Gene and Genome Sequencing

6/1/15

Workshop Schedule
• http://oomycete-training.org/2015-2/
• The schedule has links to the introductory presentations and the FungiDB workshops
• Days: Tuesday 3rd, Wednesday 4th, Thursday 5th
  – AM Session 1: Introductory Presentations; FungiDB Exercises 2
  – AM Session 2: Introductory Presentations; FungiDB Exercises 5, 10; FungiDB Exercises 6, 7; FungiDB Exercises 11; Group project
  – Lunch/Discussion
  – PM Session 3: FungiDB Exercise 1, 3; FungiDB Exercises 8, 9; FungiDB Exercises 13, 14
  – EOD/Discussion

FungiDB Exercises
• http://fungidb.org
• FungiDB is a genome database with integrated bioinformatics tools, similar to FlyBase, TAIR, and PlantGDB
• FungiDB is part of EuPathDB and uses the same software, but is less mature
  – Not as much data has been loaded
  – Not all applications are available
• When possible we will use oomycete genomes for the FungiDB exercises
  – In one exercise we will use fungal genomes because not enough oomycete data was available
  – In one exercise we will switch between FungiDB and EuPathDB to show extra functions not yet available in FungiDB
• If time permits we may add extra EuPathDB exercises

Gene and Genome Sequencing
Brent Kronmiller
Center for Genome Research and Biocomputing
Oregon State University

From Sequence to Genome
• We've got our sequence off the machines; what is it and what do we do with it?
• Three generations of sequencers
  – Gen1: Sanger, ~800 bp reads
  – Gen2: Illumina, 50-150 bp reads
  – Gen3: PacBio, >5,000 bp reads
• We still can't sequence a genome from end to end
• Using these small sequences we need to assemble a 100 Mb genome

FASTQ Files
• FASTQ is the standard flat-file sequence read format with integrated quality scores
• It is an extension of FASTA, with two extra lines:
  1) Header, starts with '@' instead of '>'
  2) Sequence
  3) Quality header, '+' (usually not used)
  4) Quality scores, one per base, lined up with the sequence

FASTA
>DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT
CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGG

FASTQ
@DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT
CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGA
+
@@@F#2ADADFFHIJIIIGIGIGIJIJJJJJJCEGIJIJGHGIJIIIJ?FGIGGIJGCHGCHGIJJFJIJIGEHH

Sequence Quality Scores
• Each base of a sequence is assigned a quality score
• Quality is (usually) used by the assembler or aligner to determine the validity of an overlap and the quality of the consensus
• Phred log scale:

  Phred score   Chance the base was called incorrectly
  Q10           1 in 10
  Q20           1 in 100
  Q30           1 in 1,000
  Q40           1 in 10,000

• Q40 is nominally the maximum score for a base. Depending on the base-calling software, a base's score can be influenced by the surrounding bases and come out higher than Q40.
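The Phred table above follows from the log-scale relationship Q = -10 × log10(P), where P is the chance that the base call is wrong. A minimal sketch of that conversion in Python; the function names are illustrative and not taken from any particular tool:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Chance that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p, where 0 < p <= 1."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: Q10, Q20, Q30, Q40.
for q in (10, 20, 30, 40):
    odds = round(1 / phred_to_error_prob(q))
    print(f"Q{q}: 1 in {odds:,}")
```

Because the scale is logarithmic, every 10 points of Phred score corresponds to a tenfold drop in the expected error rate.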
FASTQ Scoring
• Quality score characters can be translated to Phred scores by looking up the ASCII value (decimal value - 33)
  – Illumina has used 4 versions of FASTQ quality scores; be careful that you have told your software which version to use

@DB775P1:230:D1U9KACXX:5:1101:1474:1950 1:N:0:CGATGT
CAGGNTGGTAGTCAAAGGATTGTTTTTTCCTGTAAGCATCTCATCAGGTGAATAAATGACTTCTCCAGTATCTGA
+
@@@F#2ADADFFHIJIIIGIGIGIJIJJJJJJCEGIJIJGHGIJIIIJ?FGIGGIJGCHGCHGIJJFJIJIGEHH

Alignments vs Assemblies
• Alignment: align your sequences to a reference genome
  – Quicker; each sequence is compared to the reference sequence
• Assembly: de novo reconstruction of the genome from sequences
  – Slower; each sequence is compared to every other sequence
• Applications:
  – Alignment: SNP identification, RNAseq, ChIPseq, re-sequencing
  – Assembly: genome sequencing, transcriptome sequencing

Alignment Programs
• Alignment programs use one of two algorithms:
  – Hash table
  – Burrows-Wheeler Transformation (BWT)
• A hash table is a data structure in programming
  – Quick lookup of exactly matching sequence
  – Sorting is not necessary
  – Programs that use a hash: MAQ, ELAND, SHRiMP, SOAP
• BWT
  – Faster than hash aligners
  – The reference genome is pre-processed into a quickly searchable sorted index; subsequent alignments will not need to reindex
  – Programs that use BWT: BWA, Bowtie, SOAP

Genome Assembly
• The genome is broken into fragments; these are sequenced and assembled to reconstruct the genome
• Hierarchical Shotgun Sequencing
  – BACs, minimal BAC path, plasmids, end sequenced
  – Typically sequenced at 6-8x coverage before finishing
• Whole Genome Shotgun Sequencing
  – Sequenced at much higher depth
  – But locations of sequences are random across the whole genome; some areas will be sparse

Hierarchical Genome Assembly
[Figure: a BAC with a GCGCGC simple repeat, annotated with possible assembly issues]
• Possible assembly issues: low coverage, simple repeat, transposable element repeat, TE terminal repeat, gene family, tandem duplicated gene
• Constrained mate paired ends can alleviate some of these assembly issues
• The few gaps can be closed with finishing techniques
• Relative location on the genome is known

Whole Genome Assembly
[Figure: a chromosome with a GCGCGC simple repeat, annotated with possible assembly issues]
• Possible assembly issues: low coverage, simple repeat, transposable element repeat, TE terminal repeat, gene family, tandem duplicated gene

Solutions for WGS
• What can we do about these issues?
  – Paired end sequences
  – Mate pair sequencing (long range)
  – Longer sequences
  – Hybrid sequence types (2nd Gen and 3rd Gen)
  – Long range libraries to span issues
• Or, only target the genes
• Gene region enhancement for WGS
  – Methyl filtration, HiCot
• Transcriptome sequencing
  – As a hybrid assembly
• RNAseq
  – With reference
  – de novo

Assembly Programs
• Hierarchical assemblers can use overlap-layout-consensus
  – A graph is constructed from overlapping reads
  – Phrap, Arachne, etc.
• A WGS assembly, especially with 2nd-Gen data, has too many sequences for this approach
• Many short read assemblers use a de Bruijn graph algorithm
  – ABySS, Velvet, ALLPATHS, SOAPdenovo
  – Uses fixed-length k-mer substrings
  – The assembler doesn't store sequences, just counts of k-mers
• Interestingly, with long sequences from Gen3 sequencers, overlap-layout-consensus is making a comeback
  – We don't want to chop up the long sequences into k-mers

WGS - de Bruijn Graph
• Sequences are chopped up into overlapping substrings (k-mers)
  – K-mer length is decided by the user, generally based on read length along with other factors such as expected depth and genome size
• A path is created across all k-mers
• Repeated regions determine the complexity of the graph
• Errors or missing sequence directly affect the ability to find the correct path

Genome:    AGTGTAGATCTGATCCATTT
Sequences: AGTGTAGATC
           GTAGATCTGA
           TGATCCATTT
4-mers:    AGTG-GTGT-TGTA-GTAG-TAGA-AGAT-GATC
           GTAG-TAGA-AGAT-GATC-ATCT-TCTG-CTGA
           TGAT-GATC-ATCC-TCCA-CCAT-CATT-ATTT
de Bruijn graph:
           AGTG-GTGT-TGTA-GTAG-TAGA-AGAT → GATC
           GATC → ATCT-TCTG-CTGA-TGAT → GATC
           GATC → ATCC-TCCA-CCAT-CATT-ATTT

N50 for Assembly Assessment
• To calculate N50 for an assembly:
  – Order all contigs produced in the assembly by size, largest first
  – Calculate the total length of all contigs
  – Walk down the ordered list, adding up contig lengths, until the running total reaches 50% of the total length; the length of the contig at that point is the N50. In other words, half of the assembly is in contigs of that size or greater.
  – Example contigs: 100 kb, 95 kb, 80 kb, 70 kb, 60 kb, 50 kb, 40 kb, 30 kb, 25 kb, 20 kb, 20 kb, 10 kb, 5 kb
    Total 605 kb; 605 kb / 2 = 302.5 kb
    100 kb + 95 kb + 80 kb + 70 kb = 345 kb
    N50 = 70 kb
  – N50 does not assess quality, either sequence quality or correctness of sequence overlaps

Assembly/Alignment File Formats
• Assembly: output is a FASTA file
  – Multi-FASTA of the contigs (contiguous sequence assemblies)
  – Multi-FASTA of the scaffolds (contigs joined when their order and orientation is known from spanning paired-end or mate-pair sequences)
• Alignment: SAM/BAM has become the standardized output format
• SAM: Sequence Alignment/Map format
• BAM: Binary SAM
  – Binary format allows for quicker retrieval and indexing of information
  – For each sequence read, gives information on where it aligns and how it matches

SAM/BAM Format
• Header section (optional)
  – Form of @RECORD TAG:VALUE TAG:VALUE …
  – RECORD and TAG are 2-letter codes
  – 5 RECORD and 25 TAG categories; some are required, some optional
  – @HD: the header line; VN: SAM version; SO: sorting order
  – @SQ: reference sequence; SN: reference name; LN: reference length

@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

• Alignment section
  – One line per sequence in the assembly
  – 11 required fields
  – 37 possible optional tags in TAG:TYPE:VALUE format, found after field 11

r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA *
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *

  Field   Content
  1       Seq name
  2       Bit flag
  3       Ref name
  4       Position
  5       Map quality
  6       CIGAR string
  7       Ref of mate
  8       Position of mate
  9       Insert length
  10      Sequence (RC)
  11      Phred quality

• Notice that a header line begins with '@', the same as a FASTQ header.
• If your assembler doesn't remove the '@' from the sequence name, the alignment section will be confused with the header section

Bitwise Flag
• A numerical code that gives you the status of the read in the assembly
• A flag is the sum of the bits that apply; the breakdowns below are written in hexadecimal
  – 0x163 = 0x1 + 0x2 + 0x20 + 0x40 + 0x100: has pair; has alignment; pair is reversed; first of the pair to align; 2nd best alignment
  – 0x12 = 0x2 + 0x10: has alignment; reverse-complemented
  – 0x83 = 0x1 + 0x2 + 0x80: has pair; has alignment; second of the pair to align
• In a SAM file the flag is written in decimal, so a flag of 163 (as in record r001 of the SAM/BAM Format example) is 0x1 + 0x2 + 0x20 + 0x80 = 1 + 2 + 32 + 128

  Bit     Description
  0x1     template having multiple segments in sequencing
  0x2     each segment properly aligned according to the aligner
  0x4     segment unmapped
  0x8     next segment in the template unmapped
  0x10    SEQ being reverse complemented
  0x20    SEQ of the next segment in the template being reversed
  0x40    the first segment in the template
  0x80    the last segment in the template
  0x100   secondary alignment
  0x200   not passing quality controls
  0x400   PCR or optical duplicate

CIGAR String
• A compact view of the sequence in the assembly
  – 8M2I4M1D3M
    8 bases match
    2 bases inserted in sequence relative to ref
    4 bases match
    1 base deleted in sequence relative to ref
    3 bases match

ref …ATGTTAGATAA**GATAGCTGTGC…
seq     TTAGATAAAGGATA*CTG

  – 6M14N5M
    6 bases match
    14 bases skipped
    5 bases match

ref …TGATAGCTGTGCTAGTAGGCAGTCAGCGCC…
seq    ATAGCT..............TCAGC

  Op   Meaning
  M    Alignment match
  I    Insertion to the reference
  D    Deletion from the reference
  N    Skipped sequence (intron in RNAseq)
  S    Soft clipped
  H    Hard clipped
  P    Padded
  =    Sequence match
  X    Sequence mismatch

How-to: Assembly
A simple example using Velvet

Velvet
“Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions.” (Zerbino and Birney, 2008)

How does Velvet work?
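As a rough illustration of the k-mer idea behind a de Bruijn graph assembler, the sketch below chops the three example reads from the de Bruijn graph slide into 4-mers and links them. It is a teaching sketch, not Velvet's implementation: it assumes the textbook construction in which nodes are (k-1)-mers and each distinct k-mer is an edge, and it ignores reverse complements, coverage counts, and error removal. The function names are illustrative.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every distinct k-mer contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["AGTGTAGATC", "GTAGATCTGA", "TGATCCATTT"]
graph = de_bruijn(reads, 4)
distinct = {kmer for read in reads for kmer in kmers(read, 4)}

print(len(distinct), "distinct 4-mers")
for node in sorted(graph):
    print(node, "->", ", ".join(sorted(graph[node])))
```

Nodes with more than one outgoing edge are where the assembler has to choose a path. In this example that happens right after the repeated 4-mer GATC, which is followed by ATCT in one read and by ATCC in another.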
K-mers: an example with 4mers
[Figure]

Error handling by velvet
[Figure]

How to run velvet
• Two modules:
  – velveth: hash/k-mer construction
  – velvetg: genome assembly

velveth
• Velveth helps you construct the dataset for the following program, velvetg, and indicates to the system what each sequence file represents.
• Velveth takes in a number of sequence files, produces a hash table (k-mer table), then outputs two files in an output directory (creating it if necessary), Sequences and Roadmaps, which are necessary to velvetg.

velvetg
• Velvetg is the core of Velvet, where the de Bruijn graph is built and then manipulated.

Summary
• velveth, then velvetg

Genome annotation: Going from raw sequence to functional prediction for downstream applications (Part I)
Marcus Chibucos, Ph.D.
University of Maryland

Big picture…
[Figure labels: Adenine, Thymine, Guanine, Cytosine; triplets CGC ATA AAA]

Redundant
[Figure]

6 reading frames
[Figure labels: Start, Stop; frames +3, +2, +1, -1, -2, -3]

DNA transcription to mRNA
http://commons.wikimedia.org/wiki/File:Simple_transcription_elongation1.svg

mRNA codon translation table
[Figure]

mRNA translation to peptide
http://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ribosome_mRNA_translation_en.svg/1280px-Ribosome_mRNA_translation_en.svg.png

Protein structure
http://commons.wikimedia.org/wiki/File:Protein_structure.png (CC-BY-SA-3.0 Holger87 2012)

Questions annotation tries to answer
• Identify every protein coding gene in a cell
• Noncoding RNAs
• What is the function of each protein?
• What are the cell's metabolic capabilities?
• Does a protein have a role in pathogenesis?
• Under what conditions are proteins expressed?
[Figure credit: Trends in Microbiology 2009, 17, 312-319. DOI: 10.1016/j.tim.2009.05.001]

Structural annotation / Functional annotation
• “Annotate”: to make or furnish critical or explanatory notes or comment. (Merriam-Webster dictionary)
• “Genome annotation”: the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes. (Lincoln Stein, PMID 11433356)

Canonical gene structures
• Prokaryote
  – DNA: promoter, start (ATG), stop (TAG), 5' end to 3' end
  – mRNA: RBS, start (AUG), ORF, stop (UAG)
• Eukaryote
  – 5' upstream flanking region (promoter), 5' UTR, start codon, exon, donor site (GU), intron, acceptor site (AG), stop codon, 3' UTR, 3' flanking region

Gene to protein in a eukaryote
[Figure labels: Nucleus, Cytosol]

Information for gene prediction
• Extrinsic information
  – Information coming from a source other than the DNA itself
  – Comparative information
  – Aligned proteins/expression data from other taxa
  – Gene expression data like mRNA sequencing reads
  – DNA conservation between taxa (tBLASTX or other)
• Intrinsic information
  – Ab initio or de novo, meaning “from the beginning”
  – Using patterns contained within the DNA itself
  – Composition (content): inherent statistical properties of protein coding DNA itself
  – Signal: specific sequences that indicate the presence of a gene nearby

Prokaryotic gene finding examples
• Start: ATG, GTG, TTG
• Stop: TAG, TAA, TGA
• Which ORFs are genes?
[Figure: ORFs in the six reading frames +3, +2, +1, -1, -2, -3]

Some prokaryotic gene finders
• Prodigal
• GeneMark
• Easygene
• Glimmer
  – Interpolated Markov models (IMMs)
  – Compare nucleotide patterns from a training set to patterns of all ORFs
• Training set
  – 250 kb of validated/published sequences
  – BLAST query of translated ORFs against a protein database
  – Use all non-overlapping ORFs with good hits

Train & run Glimmer
• Record all k-mers 5-8 nt
• Record the frequency of the nt following each k-mer
• Build the statistical model
• Score all ORFs in the genome over a predetermined minimum size
[Figure: ORFs in the six reading frames, labeled “Scores well against model” or “Scores poorly against model”]
• False negative? False positive?
• Horizontal gene transfer, e.g. phage integration & transposition
• How much do genes overlap in prokaryotes or eukaryotes?

Translation start site prediction
• ATG >> GTG >> TTG
• Ribosomal binding site: AG rich & 5-11 bases upstream of start
• Similarity to other proteins
[Figure labels: 3 possible start sites; RBS upstream of chosen start; ORF upstream boundary; BER match protein]

Overlaps
• Is one similar to known proteins?
• Is one in an operon?
• Where are the start codons?

Inter-evidence regions
• Translate 6 frames, search a non-redundant database

Assessing gene prediction quality
• Sensitivity (Sn)
  – Fraction of known reference features actually predicted
  – Measure of false negatives: TP / (TP + FN)
• Specificity (Sp)
  – Fraction of predictions that overlap known reference features
  – Measure of false positives: TP / (TP + FP)
• Examples against the real gene model:
  – All true positives: Sn = 3/(3+0) = 1.0, Sp = 3/(3+0) = 1.0
  – One false positive: Sn = 1.0, Sp = 0.75
  – One false negative: Sn = 0.67, Sp = 1.0

Prokaryotic success is >95%
• Ab initio success rate in a eukaryote: http://bioinf.uni-greifswald.de/augustus/accuracy (accessed May 2013)

Biologically speaking, why might sensitivity & specificity be so low in eukaryotes?
• Large genomes & low coding density
• Genomic repeats: masking is very important
• Non-canonical (non-ATG) start codons
• Alternative splicing (40-50% of genes)
• Pseudogenes
• Long or short genes
• Long introns
• Non-canonical introns
• UTR introns
• Overlapping genes on opposite strands
• Nested genes overlapping on strand or in intron
• Multiple isoforms
• Very short peptides (~11 amino acid residues)
• mRNA requires multiple biological conditions

Some non-biological considerations
• Underlying algorithm
• Program parameters
• Available extrinsic evidence
• Training set quality, numbers

  Program & parameters                    Training set 1   Training set 2
  GeneMark-ES (self training)             9,024            9,024
  Augustus trained on species             8,694            9,011
  Augustus with “optimize” step           8,503            8,920
  SNAP trained on species                 7,335            7,955
  GlimmerHMM trained on species           10,313           11,894
  Scipio alignments with other species    10,691           10,691
  Trinity assemblies GMAP aligned         9,527            9,527
  Trinity (Jaccard clip option on)        10,023           10,023
  Combined evidence with Glean            8,705            9,123

Consensus gene model
[Figure tracks: Finder 1, Finder 2, Finder 3, Protein alignments]

Consensus with isoforms
[Figure tracks: Finder 1, Finder 2, Finder 3, Protein alignments]

Eukaryotic prediction
• Basic rules of gene structure
  – All coding regions (exons) are on the same strand
  – An individual exon resides in an ORF in one reading frame
  – Multiple exons within a gene can have different reading frames
• Training set
  – Want verified gene models: expression, homology, manual curation
  – Many predictors offer parameter files for common organisms

Pattern-based exon & gene prediction
• Coding region inside ORF (start & stop, no interrupting stops)
• Dimer frequency
• Coding score
• Donor & acceptor site scores
• Codon preference by species for given amino acid(s)
• GC content
• Exon length distribution
• Polymerase II promoter elements (GC box, CCAT box, TATA region)
• Ribosome binding site
• Polyadenylation signal upstream of the poly-A cleavage site
• Termination signal downstream of the poly-A cleavage site
Dimer frequency in protein sequence
• Expected dimer frequency if random = 0.25% (1/20 * 1/20)
• Not evenly distributed
• Most dicodons are biased toward non-coding or coding
• Organism specific
• AAA AAA appears 1% of the time in coding regions and 5% of the time in non-coding regions in the human genome

Splicing
http://en.wikipedia.org/wiki/File:Pre-mRNA_to_mRNA.svg
• Find all GT/AG donor/acceptor sites & score with a PSSM
[Figure labels: splice donor, branch point, polypyrimidine tract, splice acceptor. Modified from: http://en.wikipedia.org/wiki/File:Intron_miguelferig.jpg]

Position specific scoring matrix
• 5 splice donor (GU) sites:
  AUCGUCGC
  UCAGUGGC
  CUCGUCCC
  GUCGUUAC
  CACGUCUA
• Counts per position:

       1  2  3  4  5  6  7  8
  A    1  1  1  0  0  0  1  1
  G    1  0  0  5  0  1  2  0
  C    2  1  4  0  0  3  1  4
  U    1  3  0  0  5  1  1  0

• Must use confirmed splice sites for training. These are not always available for new genomes; some splice sites are noncanonical; some genes are alternatively spliced.

Translation start prediction
• Position-specific scoring matrix (PSSM)
  – Certain nucleotides tend to be in positions around the start site (ATG)
  – Such a biased nucleotide distribution is the basis for prediction of the start
• Fi(X): frequency of X (A, G, C, T) in position i
• Score a string by Σ log10(Fi(X)/0.25), summed over positions i
• Two potential start sites in a DNA sequence containing a gene:
  sequence 1: CACC ATG GC
  sequence 2: TCGA ATG TT
• What are the odds that ATG is a real start site for each one?
• Build a frequency matrix using training sequences with known starts

  Training sequences   -4 -3 -2 -1   start
  training 1:          C  A  C  C    ATG   GC
  training 2:          G  G  C  C    ATG   GG
  training 3:          A  C  G  G    ATG   GA
  training 4:          C  A  C  A    ATG   CT
  training 5:          C  G  A  G    ATG   TT
  etc.

  Matrix*   -4     -3     -2     -1
  A         17.4   48.8   28.9   15.7
  C         57.8    5.7   39.7   50.4
  G         19.0   43.0   14.9   26.4
  T          5.8    2.5   16.5    7.5
  *Shown as percentages for ease

  CACC ATG GC = log(58/25) + log(49/25) + … = 1.16
  TCGA ATG TT = log(06/25) + log(06/25) + … = -1.68
• Site 1 scores better

An ab initio “workflow”
http://genome.crg.es/software/geneid/

Extrinsic methods
http://pasa.sourceforge.net/
• EST alignment
• RNA-seq alignment
• Protein alignment

RNA-sequencing evidence
• mRNA, then cDNA, then reads:
  GCTAATGCGAAGTCCTAGACCAGATTGAC
  ATGCGATGCAGCTGACGCTGGCTAATGCG
  CGCATAGCCAGATGACCATGATGCGATGC
  TGACAGATTAGACAGTAGGACAGATAGAC
  … many millions of reads
• Reads mapped to the genome with gene models

Splice boundaries with RNA-seq
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
[Figure label: Intron]

Conservation among taxa
Arnaud, et al. (2010) Nucleic Acids Res. 38(Database issue): D420-7.

Manual curation (Jamborees)

Combiners
• Incorporate multiple evidence types, including ab initio predictions, expression data, and homology, and generate a statistically derived combination
  – Evidence Modeler (EVM)
  – Glean
  – Jigsaw
  – Maker (a pipeline)
  – PASA (merges expression data with predictions)
• Many ab initio predictors, for example Augustus, incorporate data types such as protein alignments or expression data

Glean combiner
• Glean paper at http://genomebiology.com/2007/8/1/R13
• nGASP: the nematode genome annotation assessment project, http://www.biomedcentral.com/1471-2105/9/549

Structural annotation pipeline
• Repeat masking
• RNA-seq assemblies & alignments
• EST alignments
• Splice-aware protein alignments
• Develop training set
• Train many ab initio predictors (with expression data)
• Run ab initio predictors
• Combine all evidence types
• Predict non-coding RNAs
• Ready for functional annotation next…

Structural annotation closing thoughts
• Intrinsic & extrinsic prediction methods exist
• A high-quality training dataset is required for ab initio prediction
• “Correct” gene predictions are a moving target
  – Note the steady decrease in the number of predicted genes as the human genome has been further curated
• Gene finders & gene finding pipelines produce predictions that must be verified & refined
• More pieces of high-quality evidence are better
• There is not necessarily only one correct model

How-to: Gene Calling
A quick guide on MAKER

Let's assume you have a contig (just one)
• P. unknownensis, 43,000 bp
• How do you know how many genes it has?

Two strategies of gene annotation
• Ab initio: recognition by patterns
  – Start codon (ATG)
  – Stop codon (TAG, TAA, TGA)
• Homology based: set of genes of closely related species (Species 1 to Species 5)
[Figure: gene candidates A, B, and C found by each strategy]

MAKER
• MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values.
[Figure: MAKER builds a consensus from the ab initio and homology-based evidence]

Two strategies of gene annotation (inputs for our example)
• Gene models of closely related species:
  – For our example: P. infestans, P. sojae and P. ramorum mtDNA genes
• Ab initio training
  – AUGUSTUS training (http://bioinf.uni-greifswald.de/webaugustus/training/create)

AUGUSTUS training
http://bioinf.uni-greifswald.de/webaugustus/training/create
• P. unknownensis genome, 43,000 bp
• P. unknownensis transcriptome

Running MAKER (online)
http://weatherby.genetics.utah.edu/cgi-bin/mwas/maker.cgi
1) Existing evidence (cDNA, EST, genes from closely related organisms): multi-FASTA of genes from other Phytophthora species
2) Ab initio training: AUGUSTUS model output

MAKER Outputs
http://weatherby.genetics.utah.edu/cgi-bin/mwas/maker.cgi
• Annotated results
• Proteins
• Gene models
• All info in a GFF (General Feature Format) file
[Figure tracks: Gene Model, EST/cDNA matches, Protein Matches, Ab initio Matches, Homology Matches]

Genome annotation: Going from raw sequence to functional prediction for downstream applications (Part II)
Marcus Chibucos, Ph.D.
University of Maryland

Before we start... some context
Database (Oxford). 2014; 2014: bau075.

What do our predicted genes do?
• What we would like:
  – Experimental knowledge of function
    • Literature curation
    • Perform experiment
    • Not possible for all proteins in most organisms (not even close in most)
• What we actually have:
  – Sequence similarity
    • Similarity to motifs, domains, or whole sequences
    • Protein, not DNA, for finding function
    • Shared sequence can imply shared function
    • All sequence-based annotations are putative until proven experimentally

Basic set of protein annotations
• protein name: descriptive common name for the protein
  – e.g. “ribokinase”
• gene symbol: mnemonic abbreviation for the gene
  – e.g. “recA”
• EC number: only applicable to enzymes
  – e.g. 1.4.3.2
• role: what the protein is doing in the cell and why
  – e.g. “amino acid biosynthesis”
• supporting evidence
  – accession numbers of BER and HMM matches
  – TMHMM, SignalP, LipoP
  – whatever information you used to make the annotation
• unique identifier
  – e.g. locus ids

Alignments/Families/Motifs
• pairwise alignments
  – two proteins' amino acid sequences aligned next to each other so that the maximum number of amino acids match
• multiple alignments
  – 3 or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column
  – more meaningful than pairwise alignments, since it is much less likely that several proteins will share sequence similarity due to chance alone than that 2 will. Such shared similarity is therefore more likely to be indicative of shared function.
• protein families
  – clusters of proteins that all share sequence similarity and presumably similar function
  – may be modeled by various statistical techniques
• motifs
  – short regions of amino acid sequence shared by many proteins
    • transmembrane regions
    • active sites
    • signal peptides

Important terms to understand
• homologs
  – two sequences have evolved from the same common ancestor
  – they may or may not share the same function
  – two proteins are either homologs of each other or they are not. A protein cannot be more, or less, homologous to one protein than to another.
• orthologs
  – a type of homolog where the two sequences are in different species that arose from a common ancestor. The speciation event itself created the two copies of the sequence.
  – orthologs often, but not always, share the same function
• paralogs
  – a type of homolog where the two sequences have arisen due to a gene duplication within one species
  – paralogs will initially have the same function (just after the duplication), but as time goes by one copy is free to evolve new functions while the other copy maintains the original function. This process is called “neofunctionalization”.
• xenologs
  – a type of homolog where the two sequences have arisen due to lateral (or horizontal) transfer

[Figure: an ancestor gene; speciation leads to orthologs; lateral transfer to a different species makes xenologs; duplication leads to paralogs; one paralog evolves a new function]
• “neofunctionalization”: the duplicated gene/protein develops a new function

Pairwise alignments
• There are numerous tools available for pairwise alignments
  – NCBI BLAST resources
  – FASTA searches
  – Many more
• At IGS we use a tool called BER (BLAST-extend-repraze) that combines BLAST and Smith-Waterman approaches
  – Actually, much of bioinformatics is based on reusing tools in new and creative ways…

BER
[Figure: BER workflow]
• BLAST: the genome's protein set is searched against a non-redundant protein database
• Significant hits (using a liberal cutoff) are put into a mini-db for each protein (mini-db for protein #1, #2, #3, ..., #3000)
• The query protein is extended by 300 nt
• Modified Smith-Waterman alignment: the extended query protein is aligned against its mini database
• …to look through in-frame stop codons & across frameshifts to see if similarity continues

Extensions in BER
[Figure: ORFxxxxx from end5 to end3 with 300 bp extensions on each side, shown against the search protein and the match protein. Blue line indicates the predicted protein coding sequence, green line indicates the up- and downstream extensions, red line is the match protein.]
• Normal full length match
• FS: similarity extending through a frameshift, upstream or downstream into the extensions
• PM: similarity extending in the same frame through a stop codon
• The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein coding sequence.
• FS or PM? Two functionally unrelated genes from other species matching one query protein could indicate incorrectly fused ORFs

How do you know when an alignment is good enough to determine function?
• Good question! No easy answer…
• Generally, you want a minimum of 40%-50% identity over the full lengths of both query and match, with conservation of all important structural and catalytic sites
• However, some information can be gained from partial alignments
  – Domains
  – Motifs
• BEWARE OF TRANSITIVE ANNOTATION ERRORS

Pitfalls of transitive annotation
• Transitive annotation is the process of passing annotation from one protein (or gene) to another based on sequence similarity:
  A → B, B → C, C → D
• A's name has passed to D from A through several intermediates.
  – This is fine if A is similar to D.
  – This is NOT fine if A is NOT similar to D.
• Transitive annotation errors are easy to make and happen often.
• Current public datasets are full of such errors
• A good way to avoid transitive annotation errors is to require that, in a pairwise match, the match annotation must be trusted
• Be conservative: err on the side of not making an annotation when possibly you should, rather than making an annotation when probably you shouldn't.

Trusted annotations
• It is important to know what proteins in our search database are characterized.
– proteins marked as characterized from public databases •  Gene Ontology repository (more on this later) •  GenBank (only recently began) –  UniProt •  proteins at “protein existence level 1” •  Proteins with literature reference tags indicaOng characterizaOon 14 UniProt UniProt hpp://www.uniprot.org •  Swiss-‐-‐-‐Prot –  European BioinformaOcs InsOtute (EBI) and Swiss InsOtute of BioinformaOcs (SIB) –  all entries manually curated –  hpp://www.expasy.ch/sprot –  annotaOon includes •  links to references •  coordinates of protein features •  links to cross-‐-‐-‐referenced databases •  TrEMBL –  –  –  –  EBI and SIB entries have not been manually curated once they are accessions remain the same but move into Swiss-‐-‐-‐Prot hpp://www.expasy.ch/sprot •  Protein Informa) on Resource (PIR) –  hpp://pir.georgetown.edu 15 5 37 6/1/15 UniProt 16 17 18 6 38 6/1/15 19 20 Enzyme Commission Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse •  not sequence based •  categorized collecOon of enzymaOc reacOons •  reacOons have accession numbers indicaOng the type of reacOon, for example EC 1.2.1.5 •  hp p://www.chem.qmul.ac.uk/iubmb/enzyme/ •  hp p://www.expasy.ch/enzyme/ 21 7 39 6/1/15 EC number Hierarchy All ECs starOng with #1 are some kind of oxidoreductase Further numbers narrow specificity of the type of enzyme A four-‐-‐-‐posiOon EC number describes one parOcular reacOon 22 Example entry for one specific enzyme 23 Metabolic pathway databases •  KEGG –  hpp://www.genome.jp/kegg/ •  MetaCyc/BioCyc –  hp p://metacyc.org/ –  hpp://www.biocyc.org/ •  BRENDA –  hp p://www.brenda-‐-‐-‐enzymes.info/ 24 8 40 6/1/15 25 26 Hidden Markov models (HMMs) •  Statistical model of the patterns of amino acids in a multiple alignment of proteins (called the “seed) which share sequence and, presumably, functional similarity •  Several sets routinely 
used for protein functional annotation
• TIGRFAMs (www.tigr.org/TIGRFAMs/)
• Pfam (pfam.sanger.ac.uk)
• Custom collections
• Each TIGRFAM model is assigned to a category which describes the type of functional relationship the proteins in the model have to each other
   – Equivalog - one specific function, e.g. “ribokinase”
   – Subfamily - group of related functions generally with different substrate specificities, e.g. “carbohydrate kinase”
   – Superfamily - different specific functions that are related in a very general way, e.g. “kinase”
   – Domain - not necessarily full-length of the protein, contains one functional part or structural feature of a protein, may be fairly specific or may be very general, e.g. “ATP-binding domain”

Annotation attached to HMMs
• Functionally specific HMMs have specific annotations
   – TIGR00433 (accession number for the model)
      • name: biotin synthase
      • category: equivalog
      • EC: 2.8.1.6
      • gene symbol: bioB
      • Roles:
         – biotin biosynthesis (TIGR 77/GO:0009102)
         – biotin synthase activity (GO:0004076)
• Functionally general HMMs have general annotations
   – PF04055
      • name: radical SAM domain protein
      • category: domain
      • EC: not applicable
      • gene symbol: not applicable
      • Roles:
         – enzymes of unknown specificity (TIGR role 703)
         – catalytic activity (GO:0003824)
         – metabolism (GO:0008152)

HMM building
Proteins from many species → Alignments of functionally related proteins act as training sets for HMM building → Statistical Model: model specific to a family of proteins, generally found across many species
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

HMM scores
• When a protein is searched against an HMM it receives a bit score and an e-value indicating the significance of the match
• The person building the HMM will search the new HMM against a protein database and decide on the trusted (T) and noise (N) cutoff scores
• The search protein’s score is compared with the trusted and noise cutoff scores attached to the HMM
   – proteins scoring above the trusted cutoff can be assumed to be members of the family
   – proteins scoring below the noise cutoff can be assumed NOT to be members of the family
   – when a protein scores in between the trusted and noise cutoffs, it may or may not be a member of the family

HMM databases
Proteins from many species → Alignments of functionally related proteins act as training sets for HMM building → Statistical Model (with T and N cutoffs): model specific to a family of proteins, generally found across many species → Add this model to the database
Database of HMM models, each specific to one protein family and/or functional level
Examples: Pfam and TIGRFAM
Figure: Michelle Giglio, Ph.D., Institute for Genome Sciences, University of Maryland School of Medicine, 2013

The cutoff scores attached to HMMs are sometimes high, sometimes low, and sometimes even negative. There is no inherent meaning in how high or low a cutoff score is; the important thing is the query protein’s score relative to the trusted and noise scores.
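The comparison described above can be sketched in a few lines of Python. This is only an illustration, not code from any HMM package: the function name `classify_hmm_hit`, the labels it returns, and the treatment of a score exactly equal to a cutoff are all assumptions made for the example.

```python
def classify_hmm_hit(score: float, trusted_cutoff: float, noise_cutoff: float) -> str:
    """Place a query protein's bit score relative to an HMM's cutoffs.

    Only the position of the score relative to the two cutoffs matters;
    the cutoffs themselves may be high, low, or negative.
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    # Assumption: a score equal to the trusted cutoff counts as a member.
    if score >= trusted_cutoff:
        return "member"
    # Assumption: a score equal to the noise cutoff counts as a non-member.
    if score <= noise_cutoff:
        return "not a member"
    return "possible member"


# Made-up scores, including a model whose cutoffs are negative.
print(classify_hmm_hit(120.0, trusted_cutoff=100.0, noise_cutoff=50.0))
print(classify_hmm_hit(70.0, trusted_cutoff=100.0, noise_cutoff=50.0))
print(classify_hmm_hit(-10.0, trusted_cutoff=-20.0, noise_cutoff=-40.0))
```

The last call shows why the absolute value of a cutoff carries no meaning: a score of -10 is still above a trusted cutoff of -20.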
Figure: four score scales (from -50 to 100) showing the query protein’s score (P) relative to the noise (N) and trusted (T) cutoffs:
– above trusted: the protein is a member of the family the HMM models
– below noise: the protein is not a member of the family the HMM models
– in between noise and trusted: the protein MAY be a member of the family the HMM models
– above trusted and some or all scores are negative: the protein is a member of the family the HMM models

Orthologous groups
• COGs – have not been updated in a long time
• eggNOG – newer, more complete
Figure: bi-directional best BLAST among genomes A, B, and C

Motif searches
• PROSITE - http://www.expasy.ch/prosite/
   – “consists of documentation entries describing protein domains, families and functional sites as well as associated patterns to identify them.”
• Center for Biological Sequence Analysis - http://www.cbs.dtu.dk/
   – Protein Sorting (7 tools)
      • SignalP finds potential secreted proteins
      • LipoP finds potential lipoproteins
      • TargetP predicts subcellular location of proteins
   – Protein function and structure (9 tools)
      • TMHMM finds potential membrane spans
   – Post-translational modifications (14 tools)
   – Immunological features (9 tools)
   – Gene finding and splice sites (9 tools)
   – DNA microarray analysis (2 tools)
   – Small molecules (2 tools)

One-stop shopping - InterPro
• InterPro – Brings together multiple databases of HMM, motif, and domain information.
   – Excellent annotation and documentation
   – http://www.ebi.ac.uk/interpro/

Making annotations
• Use the information from the evidence sources to decide what the gene/protein is doing
• Assign annotations that are appropriate to your knowledge
   – Name
   – EC number
   – Role
   – Etc.
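The bi-directional best BLAST idea shown on the orthologous groups slide can be sketched as follows. This is a simplified illustration under stated assumptions: hits are given as (query, subject, bit score) tuples, the function names are hypothetical, and the toy hit lists are made up. Resources such as COGs and eggNOG use more elaborate procedures than this two-genome sketch.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up example: A3 hits B1 best, but B1's best hit is A1, so A3 is not paired.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B2", 180.0), ("A3", "B1", 120.0)]
b_vs_a = [("B1", "A1", 245.0), ("B2", "A2", 175.0), ("B2", "A1", 95.0)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```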
TIGR roles
Main Categories:
– Amino acid biosynthesis
– Purines, pyrimidines, nucleosides, and nucleotides
– Fatty acid and phospholipid metabolism
– Biosynthesis of cofactors, prosthetic groups, and carriers
– Central intermediary metabolism
– Energy metabolism
– Transport and binding proteins
– DNA metabolism
– Transcription
– Protein synthesis
– Protein Fate
– Regulatory Functions
– Signal Transduction
– Cell envelope
– Cellular processes
– Other categories
– Unknown
– Hypothetical
– Disrupted Reading Frame
– Unclassified (not a real role)
Each main category has several subcategories.

Names (and other annotations) should reflect knowledge
• specific function
   – Example: “adenylosuccinate lyase”, purB, 4.3.2.2
• varying knowledge about substrate specificity
   – A good example: ABC transporters
      • ribose ABC transporter
      • sugar ABC transporter
      • ABC transporter
   – choosing the name at the appropriate level of specificity requires careful evaluation of the evidence, looking for specific characterized matches and HMMs.
• family designation - no gene symbol, partial EC
   – “Cbby family protein”
   – “carbohydrate kinase, FGGY family”
• hypotheticals
   – “hypothetical protein”
   – “conserved hypothetical protein”

Ontologies

Names can be problematic….
• ….because humans do not always use precise and consistent terminology
• Our language is riddled with
   – Synonyms – different names for the same thing
   – Homonyms – different things with the same name
• This makes data mining/query difficult
   – What name should you assign?
   – What name should you use when you search UniProt or NCBI or any other database?

Synonyms
• Within any domain do people use precise & consistent language?
• Take biologists, for example…
   – Mutually understood concepts – DNA, RNA, protein
   – Translation & protein synthesis
• Synonym: one thing, more than one name
• Enzyme Commission reactions
   – Standardized id, official name & alternative names
   http://www.expasy.ch/enzyme/2.7.1.40

Homonyms
• Different things known by same name
• Common in biology
   – Sporulation
   – Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)
Figure: Endospore formation, Bacillus anthracis. http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang= (accessed 17-Sep-09) ©L Stauffer 2003
Figure: Reproductive sporulation. Asci & ascospores, Morchella elata (morel). http://en.wikipedia.org/wiki/File:Morelasci.jpg ©PG Warner 2008 (accessed 17-Sep-09)

Standardization with controlled vocabularies (CVs)
• An official list of precisely defined terms used to classify information & facilitate its retrieval
   – Flat list
   – Thesaurus
   – Catalog
• Benefits of CVs
   – Allow standardized descriptions
   – Synonyms & homonyms addressed
   – Can be cross-referenced externally
   – Facilitate electronic searching
A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database.
An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.” http://www.nlm.nih.gov/nichsr/hta101/ta101014.html

Ontology: CV with defined relationships
• Formalizes knowledge of subject with precise textual definitions
• Networked terms; child more specific (“granular”) than parent
Figure: National Drug File

An example is the Gene Ontology with three controlled vocabularies
• Molecular Function – What the gene product is doing at a molecular level
• Biological Process – The role of the gene product in a larger context
• Cellular component – Where a gene product is doing what it does

The Gene Ontology
• A good example of a biological ontology
• Relationships among networked, defined terms
• Vascular terms shown with relationships

Example: a GO annotation
• Associating GO term with gene product (GP)
   – GP has function (6-phosphofructokinase activity)
   – GP participates in process (glycolysis)
   – GP is located in part of cell (cytoplasm)
   – Linking GO term to GP asserts it has that attribute
• Based on literature or computational methods
• Always involves:
   – Learning something about gene product
   – Selecting appropriate GO term
   – Providing appropriate evidence code
   – Citing reference [preferably open access]
   – Entering information into GO annotation file

Annotation becomes a series of ids linked to other proteins/genes/features
This protein is integral to the plasma membrane and is part of an ATP-binding cassette (ABC) transporter complex. It functions as part of a transporter to accomplish the transport of sulfate across the plasma membrane using ATP hydrolysis as an energy source.
=
• GO:0005887
• GO:0008272
• GO:0015419
• GO:0043190

Figure: a GO term page, with these parts labeled:
– Term name
– GO ID (unique numerical identifier)
– Synonyms for searching, alt. names, misspellings…
– GO slim
– Precise textual definition that describes some aspect of the biology of the gene product
– Definition reference
– Ontology relationships (next page)

Genomes can be compared
• High-level biological process terms used to compare Plasmodium and Saccharomyces (made by “slimming”)
MJ Gardner, et al. (2002) Nature 419:498-511

Evidence

The importance of recording evidence
• The process of functional annotation involves assessing available evidence and reaching a conclusion about what you think the protein is doing in the cell and why.
• Functional annotations should only be as specific as the supporting evidence allows.
• All evidence that led to the annotation conclusions that were made must be stored.
• In addition, detailed documentation of methodologies and general rules or guidelines used in any annotation process should be provided.
I conclude that you are a cat. Why?
- You look like other cats I know
- I heard you meow and purr
I conclude that you code for a protein kinase. Why?
- You look like other protein kinases I know
- You have been observed to add phosphate to proteins

Knowledge & annotation specificity
• How much can we accurately say?
Figure: available evidence for three genes and the corresponding GO annotations

Types of Evidence
• Experiments (often considered the best evidence)
• Pairwise/multiple alignments
• HMM/domain matches scoring above trusted cutoff
• Metabolic Pathway analysis
• Match to an ortholog group (COG, eggNOG)
• Motifs

The Evidence Ontology
• EO terms have standardized definitions, references & synonyms
• Allows standardizing evidence description and searching by evidence type
• Can filter by evidence type & do other things!
• GO evidence codes are a subset of EO

The big picture: an example pipeline
Figure: pipeline flowchart, with these labeled stages:
– DNA Sequence (assembly, masking)
– Gene Prediction, giving predicted protein coding genes
– RNA finding: tRNAScan, RNAmmer, homology searches, giving predicted RNA genes
– translation
– Automated start site and gene overlap correction
– Searches:
   • Pairwise BER searches against UniRef100
   • HMM searches against Pfam and TIGRFAM
   • Motif searches with LipoP, TMHMM, PROSITE
   • NCBI COGs
   • PRIAM profiles
– Automatic Annotation using the evidence hierarchy of Pfunc
– MySQL database using the Chado schema
– Genome viewer/editor
– Flat files of annotation information

Some concluding themes…
• The best annotation comes from looking at multiple sources of evidence
• It is important to track and check the evidence used in an annotation
• Do not assume the annotation you see on a protein is correct unless it comes from a trusted source
• Always err on the side of under-annotating rather than over-annotating
• Consider using UniProt (UniRef) for searches, not NCBI nr, simply for the depth of information it provides.

How-to: Functional annotation
a super-quick guide to using InterPro Scan

What do we need?
• We have a list of genes with their theoretical translations.
• We have no idea what those genes are…

Annotate against a well known database

Annotate against a bunch of well known databases

Example: Gene model 1 of P. unknowinensis
>Gene_model_1_aa
MIQIQTKVKVNDNSGIKIGQCIKIYKKKVGKIGDTI
LISAKKLRLNQKKKIKIVKGDLFKALIIHTTYQKQS
TIGNMVKFDKNCIIILNNQNKPLGTRIFGPITSEFR
KQKNFKILSLASNIL

Example: Gene model 1 of P. unknowinensis - Ribosomal protein!

Finding genes and exploring the gene page (Exercise 1)

1.1 Finding a gene using text search.
Note: For this exercise use http://www.fungidb.org
- Select only Oomycetes and run this step.
- Remember to only select Oomycetes for all further searches in this exercise.
a. Find all possible kinases in FungiDB.
Hint: use the keyword “kinase” (without quotations) in the “Gene Text Search” box.
- How many genes did you get? How many of those are Oomycetes? How did you find this out?
- What happens if you search using the word “kinases”? How many results were returned?
b. Restrict your search to only return Oomycetes. The rest of this exercise will focus on Oomycetes. You will need to restrict your keyword searches by organism.
- There are several ways to do this. To filter the kinase results by organism click the ‘Edit’ link from the results ‘Text’ box, and select ‘revise’ from the options.
c. How can you increase the number of possible kinases in your results?
Hint: the search you did in ‘a’ will miss things like “6-phosphofructokinase” or “kinases” so you need to use a wild card in your search – try “kinase*”, “*kinase” and “*kinase*” (without quotations).
- Did you get more results? Which one of the above wild card combinations gave you the largest number of kinases?
- How can you quickly examine the genes that were identified using the key word “*kinase*” but not with the word “kinase”?
Hint: You can easily do this by combining search strategies. Click on “Add Step” then select “existing strategy”:
- Select the right strategy from your list of Gene Strategies and combine the strategies with the correct operation. Which operation did you choose?
d. Find only the kinases that specifically have the word “kinase” in the gene product name.
Hint: Use the text search page, the specific page where you can define the fields to be searched. There are many ways to navigate to the Text Search page.
- How did you get there?
- How many kinases have the word kinase in their product names?
- Did you remember to use the wild card?

1.2 Combining text search results with results from other searches
a. In exercise 1.1 you identified genes that have the word “kinase” somewhere in their product name. Can you now find out how many of these kinases are likely secreted?
Hint: grow your search strategy by adding a step. Choose a search that identifies genes with likely secretory signal peptides. How did you combine the search results?
- Do the results make sense? Do all the product names contain the word kinase?
b. Now that you have a list of possible secreted kinases, how would you expand this strategy even further?
Hint: there is no wrong answer here….
- From a biological standpoint what else would be interesting to know about these kinases? Add more searches to grow this strategy.
- For example, how many of these secreted kinases also have transmembrane domains?
c. In the above example, how can you define kinases that have either a secretory signal peptide AND/OR a transmembrane domain(s)?
Hint: to do this properly you will have to employ the “Nested Strategy” feature. Why? Which operation did you choose?
Notice the different results obtained in figures A (with nesting) and B or C (without nesting) below:
(Figures A, B, and C: search strategy screenshots.)

Visiting a specific gene page.
a. Find the Ornithine aminotransferase gene in Phytophthora ramorum.
- How did you navigate to this gene? What other ways could you get there? (hint: what about using the gene ID, Psura_71772?)
- How many exons in this gene?
- How many nucleotides of coding sequence?
b. What genes are located upstream & downstream of ornithine aminotransferase?
- Is synteny (chromosome organization) in this region maintained in other species?
- How complete is the genome assembly for other species? (hint: it may help to view in the genome browser).
The genome browser on gene pages can be accessed by clicking on the “view in genome browser” link (see below). In the genome browser data tracks can be loaded from the “Select Tracks” page. Tracks are automatically added to the browser image when you select them on the “Select Tracks” page.
Just go back to the “Browser” page to view the data.
- Which tracks did you turn on?
a. Is the ornithine aminotransferase gene expressed?
Hint: look at the gene page sections entitled “Protein” and “Expression” (you may have to click on the show link to reveal the underlying data).
- What kinds of data in FungiDB provide evidence for expression?
- At what life cycle stage is it most abundant?
- Does this make sense?

3. Finding a gene by BLAST.
- Imagine that you generated an insertion mutant in an Oomycete species that is providing you with some of the most interesting results in your career! You sequence the flanking region and you are only able to get sequence from one side of the insertion. You immediately go to FungiDB to find any information about this sequence. What do you do?
- Try running a BLAST search with this sequence (hint: you can get to the BLAST tool by clicking on the BLAST link under tools on the home page).
- Which BLAST program should you use? (hint: try different combinations, just keep in mind that you have a nucleotide sequence so you have to use an appropriate BLAST program).
Note on BLAST programs:
• blastp compares an amino acid sequence against a protein sequence database;
• blastn compares a nucleotide sequence against a nucleotide sequence database;
• blastx compares the six-frame conceptual translation products of a nucleotide sequence (both strands) against a protein sequence database;
• tblastn compares a protein sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands);
• tblastx compares the six-frame translations of a nucleotide sequence against the six-frame translations of a nucleotide sequence database.
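The note above amounts to a lookup from query type and database type to BLAST program, which can be sketched as follows. This is an illustration only: the function names are hypothetical, and the 90% nucleotide-letter threshold used to guess the query type is an assumption chosen for the example, not a rule from FungiDB or NCBI.

```python
from typing import Dict, List, Tuple

# (query type, database type) -> candidate programs, following the note above.
BLAST_PROGRAMS: Dict[Tuple[str, str], List[str]] = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}

NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence: str) -> str:
    """Crude heuristic (assumption): mostly A/C/G/T/U/N means nucleotide."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= 0.9 else "protein"


def candidate_programs(query: str, database_type: str) -> List[str]:
    """List the BLAST programs that fit this query and database type."""
    return BLAST_PROGRAMS[(guess_sequence_type(query), database_type)]


# A nucleotide query, as in this exercise, against each kind of database.
print(candidate_programs("aagatgggttcccccgtgaaaaacgatag", "nucleotide"))
print(candidate_programs("aagatgggttcccccgtgaaaaacgatag", "protein"))
```

For a nucleotide query, tblastn and blastp never appear in the output, which is the point of the hint in this exercise.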
Use the following sequence:
>upstream_flanking_region
aagatgggttcccccgtgaaaaacgatagatgcgctctccatcggatgtgagaggtctgg
cttccagaaacttctctgacatgggacaaagatcgcgaagctgcataactggagcaaaac
ggacgatggccacagagcaagagtactaagcgaatgggagtgcgacagcgcacttgctgc
cccctacacatagtgtgtgaagattgcacctgcgcttgcagttccatagtgggtggcgcg
gtccataggaaagagagcgtcagaatgtggggcgtcgccaacttgcggcccacaccaatc
aaaactccttgtattcaggcgcgctgcagtacgtttcgtcccgtcgtggtacaccctcca
tcgatttgtacaggttttagtaaaatcaaaggtcgtcattcacaaactcctgccatattt
tatcttacatgatttagtatcgttttaggcagggaatgtattttacaaggttgcaagttg
tttcacgcgttccgcatgttggggatgggtggggggggaggaggggagagtcctgttggt
gacgtgtggtggttattctagaaccccaagcgcgtcggaagctccctccttgtgcacgcg
tggccgcactttttcttcagaccccaaggcgacacccccttcgtcccatta
- Are you getting any results from blastx? tblastn? What about blastn?
- What is your gene? (hint: after running a blastn against Oomycete genomic sequence, click on the “link to the genome browser”. In the genome browser zoom out to see what gene is in the area).

RNA sequence data analysis (Part 1: using Pathogen Portal’s RNAseq pipeline)
Exercise 3

The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis pipeline.

Step I: Create a login account at Pathogen Portal:
1. Go to http://pathogenportal.org
2. Click on RNA Rocket.
3. Click on Create account and fill in the required information.

Step II: Getting data into your launch pad.
The following exercise is based on data generated from the following study:
“Comparative transcriptomics of the saprobic and parasitic growth phases in Coccidioides spp.” Whiston et al. PLoS ONE 2012;7(7):e41034
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3401177/
The data mentioned in the paper has been deposited to the sequence read archive (SRA) and the study accession number is: SRA054882.
You can access this record here: http://www.ncbi.nlm.nih.gov/sra/SRA054882

The required input format is a FASTQ file, which is similar to a FASTA file. These are simple text files that include sequence and additional information about the sequence (i.e. name, quality scores, sequencing machine ID, lane number etc.).

FASTA (definition line, then sequence):
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDK
AVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAA
MRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRL
KDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQ
GKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVM
DDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKT
EDFAAEVAAQL

FASTQ (definition line, sequence, end of sequence “+”, encoded quality score):
@SRR016080.2 20AKUAAXX:7:1:123:268
TGTAGCATAATGCCGTTTTCTTTGTTTCCATTCATC
+
II&I&4IICIIIIIIII.III3:III3#6IIII1I)
@SRR016080.3 20AKUAAXX:7:1:112:638
TATAGATCTTGGTAACACCCGTTGTATTATTCGCAA
+
IIIIIIIIIIIIIIIIIIIIIIII-IIIII%%IIII
@SRR016080.4 20AKUAAXX:7:1:102:360
TTGCCAGTACAACACCGTTTTGCATCGTTTTTTTTA
+
IIIIII$IIIIIIII'IIIIIIIIIIII@IIIID35

- FASTQ files are large and as a result not all sequencing repositories will store this format. However, tools are available to convert, for example, NCBI’s .SRA format to FASTQ. The files that we will be using for this exercise originated from the DNA Data Bank of Japan (DDBJ), which is a mirror of NCBI and EBI.

Here is the record at DDBJ:
http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA054882

The FastQ files for each time point are available here:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882

C. immitis RS, parasitic spherule biorep #1 data is in folder: SRX155974
C. immitis RS, parasitic spherule biorep #2 data is in folder: SRX156187
C. immitis RS, parasitic spherule biorep #3 data is in folder: SRX156190
C. immitis RS, saprobic hyphae biorep #1 data is in folder: SRX156193
C. immitis RS, saprobic hyphae biorep #2 data is in folder: SRX156194
C. immitis RS, saprobic hyphae biorep #3 data is in folder: SRX156195
C. posadasii C735, parasitic spherule biorep #1 data is in folder: SRX156197
C. posadasii C735, parasitic spherule biorep #2 data is in folder: SRX156189
C. posadasii C735, parasitic spherule biorep #3 data is in folder: SRX156198
C. posadasii C735, saprobic hyphae biorep #1 data is in folder: SRX156199
C. posadasii C735, saprobic hyphae biorep #2 data is in folder: SRX156200
C. posadasii C735, saprobic hyphae biorep #3 data is in folder: SRX156201

We will be uploading data directly from the DDBJ FTP site. Each sample is single end. Also, they indicate that three runs were done for each sample. We are only going to worry about one of the runs for each condition. For the next part of this exercise feel free to navigate in the FTP site to the desired time point folder or simply use the links provided below:
1. C. immitis RS, parasitic spherule biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX155974/SRR516231.fastq.bz2
2. C. immitis RS, saprobic hyphae biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX156193/SRR516239.fastq.bz2
3. C. posadasii C735, parasitic spherule biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX156197/SRR516242.fastq.bz2
4. C. posadasii C735, saprobic hyphae biorep #1:
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA054/SRA054882/SRX156199/SRR516244.fastq.bz2

Here are the steps you take to start uploading data into your Launchpad:
1. Click on the “Upload Files” link.
2. On the next page, copy and paste both files for your time point in the “URL/Text” window (paste the FastQ URLs here), then click on the “Execute” button.
To view the progress of your upload, click on “Project View” (red square in image above). Completed tasks will show up in green.
You can inspect the contents of completed tasks (like uploaded files) by clicking on the eye icon next to the name of the file (arrow in above image).
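For readers who want to look at a FASTQ file outside the web interface, the four-line record layout described in this exercise can be read with a short Python sketch. This is not part of the Pathogen Portal pipeline. The function names are hypothetical, the example record is made up, and the quality offset of 33 is an assumption, since some older Illumina files use 64.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an encoded quality string; offset 33 is an assumption."""
    return [ord(char) - offset for char in quality]


# Made-up record, only to show the layout.
example = io.StringIO("@read1 example\nACGT\n+\nII5!\n")
for name, sequence, quality in read_fastq(example):
    print(name, sequence, phred_scores(quality))
```

A downloaded `.fastq.bz2` file can be opened as text with `bz2.open(path, "rt")` from the Python standard library and passed to `read_fastq` in the same way.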
Inspecting a FASTQ file should look like this:
After you click on Execute, you should see a window that looks like this:

3. Once the RNA-sequence FASTQ file has been uploaded you can start the RNAseq pipeline. Pathogen Portal uses two algorithms: TopHat for mapping, and Cufflinks for transcript prediction and expression value calculation. Note that there are many algorithms and methods for RNA-seq mapping and analysis, each with its advantages and disadvantages. You are encouraged to learn more about the algorithm you are using.
o TopHat: http://tophat.cbcb.umd.edu/
o Cufflinks: http://cufflinks.cbcb.umd.edu/index.html
- To start the pipeline click on the “Launch Pad” link (red square in above image). On the next page, scroll down to the “RNA-Seq Analysis” section and click on “Map Reads & Assemble Transcripts”.
- On the next page, scroll down and choose the type of analysis (in this case we are analyzing a single end eukaryotic sample).
- Next select the target project from the drop down menu. You should only have one or two projects, one of which will contain both FASTQ files you uploaded (probably called “Uploaded Files”). Once you select the correct project you should see the two FASTQ files contained within it. Next click on continue.
- The next page allows you to configure the pipeline:
Step1: Select the first dataset from the dropdown menu under “Input Dataset”.
Step3: Configure TopHat – there are a number of options that may be modified, however, for the purposes of this exercise the default parameters may be used. The only required change is the reference genome: select Coccidioides immitis RS.
Step4: Configure Cufflinks – once again there are a number of options to modify. For the purposes of this exercise change the following:
Select a reference annotation: Coccidioides immitis RS
Select how to use the provided annotation: Assemble Novel + annotated transcripts.
Click on the Run Workflow button.
After you start the workflow you should get a confirmation window that indicates all the steps that have been added to the queue. The progress of your workflow can be viewed to the right. Completed tasks are in green, running tasks are in yellow and tasks waiting in the queue are in grey.

Using the genome browser (GBrowse)
Exercise 2

2.1 Navigating to the genome browser (GBrowse)
Note: For this exercise use http://www.fungidb.org
a. There are two ways to navigate to GBrowse from FungiDB. From record pages, like a gene page, genomic sequence or EST page, click on the “View in Genome Browser” link. You can also use the Tools section on the home page or the grey tool bar in the header section.
b. Go to GBrowse from the FungiDB home page. Explore this page and take note of the different sections: Instructions, Search, Overview, Region, Details, Tracks, etc…
c. In the “Landmark or Region” box write the following: PinfT30-4_SC0010:1,114,424..1,164,424.
d. Look at the “Landmark or Region” box:
- What information does the “Landmark or Region” box contain?
- What chromosome is displayed?
- What location of the chromosome is displayed?
- Move to a different genomic region on this chromosome – for example, visit the right arm of this chromosome.
§ Hint: change the coordinate numbers in the “landmark or region” box to correspond to an area in that region (e.g. 1000000..1100000). Look at the overview for an indication of the total size of this chromosome.
- Change region to a different scaffold. How did you do this?
§ Hint: Change the scaffold (chromosome) number in the “landmark or region” box – it should look like this: PinfT30-4_SC0025:1..10000 for Phytophthora infestans scaffold 25 from 1bp to 10kb.
- Zoom in to a 20Kb region. Select 20Kb from the Scroll/zoom drop down menu.
e. What if you want to go to a specific gene in GBrowse?
Try to figure out how to go to this gene: Phyca_508616
§ Hint: type the ID in the “landmark or region” box and press enter.
§ Zoom out to 5 kbp.
- What is this gene?
- What genes are in this region? Mouse over the gene graphics and look at the popups.
- Explore the ruler tool. Click on the ruler to engage then drag it across the window.
- Are there other ways to move and zoom? Try highlighting an area along the scale in the overview, region or details sections of GBrowse.

2.2 Exploring data tracks in GBrowse
a. Is the region containing the Phyca_508616 gene syntenic in all Oomycetes?
Hint: Go to the “Select Tracks” section and find the B) Synteny section. Before you click ‘Syntenic Sequences and Genes’ click ‘showing 103/103 subtracks’ and select only some of the Oomycete species (all Phytophthora, Pythium, Hyaloperonospora arabidopsidis). If you turn on synteny before deselecting the fungi species you will need to wait for all 103 species to load.
- Return to the browser by clicking the “Browser” tab and zoom out to 20Kb. What does this region look like? What genes are upstream and downstream?
- If the synteny trapezoids connecting Phyca_508616 are difficult to see you can try moving the viewing window or zooming in or out.
- Which selected species doesn’t have a syntenic gene? Are there differences in the species?

2.3 Downloading data from GBrowse
- You can download data from GBrowse in multiple ways and formats.
1. The Report and Analysis drop down menu allows you to select a format for the download file that will contain all the features that you have displayed in the region you are looking at.
2. Highlighting a section of the Details scale allows you to retrieve a FASTA dump of the nucleotide sequence from this region. You can also use this same tool to submit a sequence to NCBI Blast.
3. Mousing over a gene will reveal a popup window with the option to get the coding (CDS) or amino acid sequence of that gene.
4.
Designing PCR Primers with GBrowse Open GBrowse at the genomic location where you want to find primers. -  Go to gene page of gene you want to design primers for and use the ‘View in GBrowse’ button. -  Open GBrowse from the home page and then enter genomic coordinates in the landmark region. Choose “Design PCR Primers” from the drop down menu and then click GO. -  This opens the Design Primers application. 3 6/1/15 67 ZOOM Choose a target: -  The graphic is interactive. To choose a target, highlight an area on the scale. You can zoom in with the controls in the upper left corner. -  Once you choose a target, the Product size range is automatically updated in the parameter table at the bottom of the page. -  You can choose to customize the primer design using other parameters. Click DESIGN PRIMERS to run the application. 4 6/1/15 68 Protein Motif Searches and Regular Expressions Exercise 6 6.1 Using InterPro domain searches to identify unannotated kinesin motor proteins. For this exercise use http://fungidb.org a. Identify all genes annotated as hypothetical in Phytophthora infestans. b. How many of these hypothetical genes have a kinesin-motor protein InterPro domain? Hint: add a step to the strategy. Go to the “Interpro Domain” search under similarity/pattern, start typing the work kinesin and it should autocomplete. Hint: use the full text search and look for genes with the word “hypothetical” in their product names. 1 6/1/15 69 c. Go to the gene page for PITG_05224 and look at the protein feature section. Does this look like a possible motor protein? Hint: click on the ID for PITG_05224 in the result table to go to the gene page. Mouse over the glyphs in the Protein Features graphic. b. RXLR is a domain motif found in some effectors to facilitate infection. Identify all occurrences of the RXLR motif in Phytophthora. You may need to refer to the RegEx guides to find the correct query; you will need to use a special character for ‘X’. 24043 771 c. 
Some of these were probably identified in incomplete proteins. You could use a text search to omit predicted or hypothetical proteins, but instead think of a protein motif search to only identify complete proteins containing RxLR. Protein sequences in FungiDB do not contain the stop character (*). However, bad computationally predicted proteins can have internal stops. Edit your motif search to select for proteins that start with a Methionine, do not have any *s, and contain an RXLR. (hint: you’ll need to tell the RegEx to not find * both before and after the RXLR) You can find a single RegEx to identify the correct proteins but it will be complex. Try to break it up into multiple steps to make it easier to build. Here it is split into two RegEx: 6.2 Using regular expressions to find motifs in Phytophthora. Find variations of RXLR a. To infect plants Phytophthora utilizes effector proteins. Use a text search to find all proteins that have been identified as effectors in Phytophthora. It only removed two genes. Why? Compare the results from B and C, where was the change? d. The ‘X’ in RXLR is a wild-card, allowing for any amino acid. Try some specific amino acids or special characters to narrow down the RXLR occurrences. Do most identified RXLRs fit into any special classification? 771 2 6/1/15 70 1.  Identification of specific DNA motifs. Note: For this exercise use http://fungidb.org a.  Find all BamHI restriction sites in all Pythium ultimum genomic sequences available in FungiDB. Note: you can use the DNA motif search to find complex motifs like transcription factor binding sites using regular expressions. Hint: BamHI = GGATCC and the DNA motif search is under the heading “Genomic Segments”. 7.2 Find genes that have one of these BamHI sites within 250 nucleotides upstream of their start. In the section 7.1 you found BamHI sites, but now you are looking for genes that have one of these sites located within 250 nucleotides upstream of their start. b. 
How many times does the BamHI site occur in the genomes you searched? Take a look at your results; notice the Genomic location and the Motif columns.

Hint: You can achieve this by running a genomic colocation search that defines the genomic relationship between the BamHI sites and genes. Add a “Genes by Organism” step to the motif search and select the “1 relative to 2, using genomic locations” option.

How did you modify the location relative to genes? How many genes did you get?

7.3 Using a similar sequence of steps as in part 7.2, define which of these genes also have a BamHI site in their 250 nucleotide downstream region.
Hint: after you click on add step you will have to select DNA motif search and select the genomic colocation option.

7.4 Taking this a step further, define which of these genes do NOT contain a BamHI site within them.
Hint: you will have to use a nested strategy.

Look at your results. Do they make sense? Confirm your results by looking at one of the genes in GBrowse and showing BamHI restriction sites.
Note: you can add a column to any result table that allows you to go directly to GBrowse at the genomic coordinates of any ID in your result list. Click on the Add Columns button.
Note: you can configure restriction sites by clicking on the configure button in GBrowse and selecting the restriction sites you would like to display. To view restriction sites, the “Restriction Sites” data track must be turned on. Go to the “Select Tracks” page and click “Restriction Sites” under the “Analysis” section.

Exploring Isolate Data
Exercise 8

8.1 Exploring isolates in Cryptosporidium and using the alignment tool.
Note: For this exercise use http://www.cryptodb.org

c. What is the general distribution of these isolates in Europe?
(Hint: you can do this quickly in two ways: sort the geographic location column by clicking on the sort arrows, then look at the represented countries; or use the “Isolate Geographic Location” tab to view a map and results summary table.)

a. Identify all Cryptosporidium isolates from Europe.
Hint: search for isolates by geographic location in the “Identify Other Data Types” section. Sort by clicking on the arrows.

b. How many of the Cryptosporidium isolates collected in Europe were isolated from feces?
Hint: add another isolate search step.

d. Out of those in step ‘b’, how many are unclassified Cryptosporidium species?
Hint: add another isolate search step.

e. How many of step ‘b’ isolates originated from humans?

f. How many of the isolates in step ‘b’ were typed using GP40/15 (GP60)?
(Hint: you can insert a step within a strategy. Click on the name of the step you want to insert a step before, then click on “Insert step before”.)

g. Compare some of these isolates using the multiple sequence alignment tool (ClustalW). Do you see any sequences with insertions or deletions?

h. Take a look at the ‘guide tree’ that was built using this alignment. Change the isolates that you selected for alignment. How does the tree change? Do isolates from the same country cluster together?

8.2 Typing an unclassified isolate.
Note: For this exercise use http://www.cryptodb.org

a. Run a search to find all unclassified Cryptosporidium isolates and find one that was typed using 18S small subunit ribosomal RNA.
(Hint: Identify Isolates based on Taxon/Strain and choose ‘unclassified’ under Cryptosporidium. Add a column for Gene Product and sort the column.)

b. Go to the isolate record page and copy the DNA sequence.

c. Go to search for isolates based on BLAST, select isolates and make sure only the reference isolates are selected in the target organism window.

d. Paste the DNA sequence in the input window and select the Blastn program. Click on “Get Answer”.

e.
Explore your results. Based on the similarity, which reference isolate is this one closest to?

8.3 Exploring isolates in Plasmodium.
Note: For this exercise use http://www.plasmodb.org

a. Identify all isolates from Mexico.
b. How many of those are P. falciparum? How many P. vivax?
c. What about all of North and South America?
Hint: revise the first step in your strategy to include all countries in both continents.
d. For these results, add columns such as isolate product and length. Sort these columns and explore your results. For example, what product is mainly used in typing P. falciparum isolates? What about P. vivax isolates?

Orthology and Phyletic Patterns
Exercise 9

9.1 Getting to OrthoMCL from FungiDB databases
Note: For this exercise use http://www.fungidb.org

a. Go to the gene page for the Phytophthora ramorum gene with the ID: Psura_72632.
b. What does this gene do? It is annotated as unspecified product!
c. Scroll down to the table labeled “Orthologs and Paralogs within FungiDB”. Does this gene have orthologs in other Oomycete species? What about other organisms?
Hint: click on the link below the table that takes you to OrthoMCL.
d. Does this protein have orthologs in other organisms? Does it have any orthologs in bacteria or archaea?
Hint: mouse over the colorful boxes in the tables to reveal the full species and phylum names (see image below).
e. Take a look at the PFAM domain architectures. Do all the proteins in this group have similar domain architecture?
f. Based on the orthologs, what do you think this protein might be doing? If you had to give this gene a name, what would you call it?

9.3 Use the orthology transform tool to identify P. sojae genes containing signal peptides also found in P. ramorum.

9.2 Using the phyletic pattern tool in OrthoMCL

a. Go to FungiDB. UNDOABLE: P. sojae was removed from OrthoMCL.
b. How many P. sojae genes are annotated with signal peptides (just use the default settings)?
Use http://orthomcl.org

a. How many protein groups in OrthoMCL do not have any orthologs in bacteria or archaea?
Hint: go to “Search for Groups by Evolution…Phyletic Pattern”.

c. Use intersection to see the shared P. ramorum signal peptide genes. How did that work?

d. We’ll have to use a different method. First transform the P. sojae results into their P. ramorum orthologs, then use these in the intersection.

b. How many protein groups do not contain orthologs from eukaryotes?
Hint: click on the icon to specify which taxa or species to include or exclude in the profile.

e. How many of the P. ramorum orthologs of P. sojae genes with signal peptides do not themselves contain signal peptides? Why might this be the case? Look at a couple of these using the synteny viewer to generate some hypotheses.

9.5 (optional) Integrated searches in OrthoMCL
NOTE: All EuPathDB sites including FungiDB also have a phyletic pattern search that uses OrthoMCL data under Genes -> Evolution -> Orthology Phylogenetic Profile.

Find all oomycete proteins that are likely phosphatases that do not have orthologs outside of oomycetes. Use OrthoMCL.org

a. Use the text search to find groups that contain the word “phosphatase”.
b. Run an orthology phylogenetic profile search for groups that contain any oomycete protein but do not contain proteins from any organism outside the oomycetes.
Hint: make sure everything has a red x on it except for oomycetes, which should be a grey circle.
c. How many groups did you return? Explore the multiple sequence alignments from some of these groups.
Hint: click on a group ID and open the MSA tab.

Exploring Metabolic Pathways and Compounds
Exercise 5

- Once you find glycolysis, the result page will display a graphical KEGG representation of the pathway. Examine the pathway. What do the rectangles with numbers like 2.7.1.41 represent? What do the circles represent?

1. Find the metabolic pathway for glycolysis.
For this exercise use http://fungidb.org
- Metabolic pathway and compound searches are available under the “Identify Other Data Types” heading on the home page. To find metabolic pathways by name, click on the “Pathway/Name/ID” option under the heading “Metabolic Pathways”.
- This search provides type-ahead options.
- Turn on ‘Paint Genera’, ‘Albujo, Apha…. What do the colors mean?
Note that you can mouse over and click on the various elements in the pathway to reveal popups with additional information, and you can zoom in and out.
- Find the rectangle representing 6-phosphofructokinase. (Hint: its EC number is 2.7.1.11.)
- Do you believe that this enzyme is only present in yeast? What are some other possibilities? How can you determine if this enzyme has orthologs in any oomycete species?
- Click on the enzyme name/EC number, which takes you to a FungiDB strategy. You get 3 genes but this is not necessarily all the orthologs identified by OrthoMCL. How can you find orthologs of this gene in other oomycetes?
- Orthologs can be identified by adding an “ortholog transform” step to the search strategy. (Hint: click on add step, then select ortholog transform from the popup window. In this case allow all the organisms.)

2. Compound records can be accessed by running a specific compound search available under the “Identify Other Data Types” heading on the home page. Compound records can also be accessed from the mouse-over popups in a metabolic pathway.
- Find Phosphoenolpyruvate (PEP) and visit its record page.
o PEP can be identified using a specific compound search. For example, compounds may be identified by ID, text search, metabolic pathway, molecular formula, molecular weight and metabolite levels.
o Choose one of these options to identify PEP. For example, you can type phosphoenolpyruvate in the compound text search.
- Examine the PEP record page. Note that sections (e.g. Metabolic Pathway Reactions) may be expanded by clicking on the “show” link.
- What do your results show? Is 6-phosphofructokinase unique to P. falciparum?

Data retrieval and download
Exercise 9

9.1 Downloading a set of results and associated data.
For this exercise you can start with any gene list of results. Start with any result list you have generated, such as the DNA Motif search.
Download this list of results with the following associated data: Genomic Location, Product Description, Transcript Length and Predicted GO Function.
Hint: click on the Download ## Genes link.
Hint: select the type of report to download and then click on the boxes to customize your report. The gene ID is automatically downloaded and so is not an option in the popup.

9.2 Download the sequences of genes in a list of results.
Use the same list of results as in 9.1. Go to the download section and select “Configurable FASTA”. Download the ‘genomic’ sequences. Now download the ‘transcript’ sequences. What is the difference?

Note that you can also access and download sequences with the sequence retrieval tool (SRT), accessed from the tools menu on the home page:
• Retrieve Sequences By Gene IDs.
• Retrieve Sequences By Genomic Sequence IDs.
• Retrieve Multiple Sequence Alignments by Contig / Genomic Sequence IDs.
• Retrieve Sequences By Open Reading Frame IDs.

9.3 Downloading large data files such as all coding sequences or all protein sequences for an entire genome.
Download files are available in the file download section of all EuPathDB sites.
Hint: select “Data Files” under the “Download” menu in the grey tool bar.
Hint: navigate through the subfolders and find the files containing codon usage information for P. capsici. Folders without a strain designation contain species level data.
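For the comparison asked for in 9.2, one way to look at the two downloads side by side is to compare sequence lengths gene by gene. The following is a minimal sketch, not part of the workshop material. It assumes each FASTA header starts with the gene ID followed by whitespace; the record IDs and sequences below are made up for illustration, and in practice you would read the two downloaded files instead of the in-memory strings.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict mapping record ID to sequence.

    Assumption: the record ID is the first whitespace-delimited token
    after '>' in each header line.
    """
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def length_differences(genomic, transcript):
    """Return genomic length minus transcript length for IDs found in both."""
    return {
        gid: len(genomic[gid]) - len(transcript[gid])
        for gid in genomic
        if gid in transcript
    }


# Hypothetical toy records standing in for the two downloaded files.
genomic_text = """>gene_A | made-up header text
ATGGTAAGTCCAG
GCTTAA
>gene_B
ATGCCCTAA
"""
transcript_text = """>gene_A
ATGGCTTAA
>gene_B
ATGCCCTAA
"""

genomic = read_fasta(genomic_text.splitlines())
transcript = read_fasta(transcript_text.splitlines())
diffs = length_differences(genomic, transcript)
for gene_id, diff in sorted(diffs.items()):
    print(gene_id, diff)
```

A gene whose genomic sequence is longer than its transcript sequence carries bases that are absent from the transcript download, which is a useful starting point for answering the question in 9.2.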
RNA sequence data analysis
(Part 2: Loading data generated by the pathogen portal’s RNAseq pipeline in the Genome Browser)
Exercise 11

For this exercise we will be using:
http://pathogenportal.org
http://fungidb.org

1. Explore the results of the RNA-sequence pipeline. What files were generated? To view the contents of any of the results, click on the eye icon next to the file name.
Important note: do not click on the icon next to the file called “Tophat2 on data 1 and data 3: accepted_hits”. This file is huge and will not display; instead its contents will be downloaded to your computer.

TopHat generates four files: insertions, deletions, splice junctions and accepted hits. The accepted hits file is the BAM file (binary alignment map). Note that many alignment programs will generate a file called a SAM file (sequence alignment map), which is a text table describing the alignment and mapping. However, for viewing results in a sequence browser like GBrowse, the file needs to be converted into the binary format (BAM). You do not have to worry about this for this exercise.

Cufflinks generates three files: gene expression, transcript expression and assembled transcripts. The gene expression and transcript expression files should be identical for our purposes, since FungiDB genomes do not have separate genes and transcripts. These files include the FPKM values for each gene in the genome analyzed, in this case Coccidioides immitis.

2. Share your accepted hits files. Click on the drop-down menu for your project and select the option “share or publish”. On the next page select the option: Make History Accessible and Publish.
Once your project is published, other people can access it by going to the “Published Projects” section under the Shared data menu option in the Galaxy menu bar.

3. Load your BAM data into GBrowse.
Navigate to the genome browser in FungiDB and choose a landmark for Coccidioides immitis RS. You can just cut and paste the following into the “landmark or region” box: CimmRS_SC1:1..17,454

Next, do the following to copy the link to the tophat accepted hits in pathogenportal to GBrowse:
a. Control click (same as right click on a Windows machine) on the eye icon for the tophat accepted hits and copy the link.
b. In GBrowse click on the “Custom Tracks” tab.
c. Click on the “From a URL” link and paste the link you copied from pathogenportal.
d. Delete the last portion of the URL: display/?preview=True
e. Click on import and be patient.
f. Once the data has loaded click on the Browser tab to view your data.

4. Load the assembled transcript data. Cufflinks generates this file in a format called GFF. This format is not accepted by GBrowse so you have to convert it to another format called BED. To do this click on the pencil icon next to the file. Click on “Convert Format” then click on convert. A new file will be generated in BED format. You can now copy the link to the file and load it into GBrowse the same way you loaded the BAM file.

Exploring Transcriptomics Data
Exercise 13

13.1 Evidence of expression at the transcriptional level.
Note: For this exercise use http://www.fungidb.org

13.2 Exploring RNA sequence data in FungiDB.
Note: For this exercise use http://www.fungidb.org

a. Find all genes in C. posadasii C735 delta SOWgp that are upregulated based on RNA-seq data at Parasitic spherule phase compared to Saprobic hyphae.

(13.1) a. What kind of data types can be used to provide evidence of transcriptional activity?
Hint: click on “Transcript Expression” to expand the list of possible searches.
b. Explore organisms that have microarray data. What organisms have expressed sequence tag (EST) or RNA sequence data?
c. What does RNA-seq data tell you that microarray data cannot?
d. Go to the Data Summary Section. Can you find the same information there?
Hint: the data summary table is on the left side of the home page.

Hint for the RNA-seq search in 13.2 a: there are several parameters to manipulate in this search:
Experiment: choose the experiment of interest, in this case the only option available: Saprobic vs Parasitic Growth.
Genes: gene format of the results. Choose protein coding.
Direction: the direction of change in expression. Choose up-regulated.
Fold Change >=: the intensity of difference in expression needed before a gene is returned by the search. Fold change is calculated as the ratio of two values, (expression in reference)/(expression in comparison). Choose 2 but feel free to modify this.
Reference Sample: the samples that will serve as the reference when comparing expression between samples. Choose Saprobic Hyphae.
Comparison Sample: the sample that you are comparing to the reference. In this case you are interested in genes that are up-regulated in the Parasitic Spherules phase.

b. For the genes returned by the search, what are the top 15 upregulated genes in the parasitic phase compared to the saprobic phase?
c. Can you find more information for the “hypothetical genes”? Hint: add columns from the putative function option.
d. Are some of these upregulated genes secreted? Choose the SignalP Peptide box under the Protein Feature option.
e. Are these genes unique to C. posadasii? Can the ortholog data help us find this information?
f. What does the paralog count tell us about these top upregulated genes?

3. Exploring Expression Quantitative Trait Locus (eQTL) data in PlasmoDB.
Genetic crosses were instrumental in implicating the PfCRT gene in chloroquine resistance. PlasmoDB contains expression quantitative trait locus data from Gonzales et al. PLoS Biol 6(9): e238. The trait that was examined in this study was gene expression using microarray experiments.

a. Go to the gene page for the gene with the ID PF3D7_0630200. Can you identify the genomic region (haplotype block) that is “most” associated with this gene, i.e.
has the highest LOD score? (Hint: examine the table called “Regions/Spans associated by eQTL experiment on HB3 x DD2 progeny” on the gene page.)

b. What kinds of genes do you find in this region? Click on the first link in the column “Genomic segment (liberal)”. Now examine the gene table on the genomic segment page.

c. What other genes are associated with this block? (Hint: go back to the gene page eQTL table, and click the “genes associated with this region” link. Run the search on the next page and examine the list of genes. It might be useful to sort this list based on the LOD scores.)

13.4 Finding oocyst expressed genes in T. gondii based on microarray evidence.
Note: For this exercise use http://toxodb.org

a. Find genes that are expressed at 10 fold higher levels in one of the oocyst stages than in any other stage in the Expression Profiling of T. gondii Oocyst/Tachyzoite/Bradyzoite stages (Boothroyd/Conrad) microarray experiment (fold change).
• There are multiple parameters that need to be set.
• Experiment: choose Oocyst, Tachyzoite and Bradyzoite Development.
• Select Protein coding genes. We want to only look at polyadenylated transcripts.
• Fold Change >= 10.
• Global min/max in selected time points: choose “don’t care”. Since we have selected all the samples between the reference and comparator time points, the global max and the global min will have to be within the selected time points. If we had not selected all the time points, then changing this parameter would make a difference as the global min or max could be in a time point that we didn’t select.
• Direction: choose down-regulated since we want to find things more highly expressed in oocysts than in other stages.
• Notice setting the Direction to down-regulated automatically changes the expression value for reference sample from average to maximum and minimum for the comparator samples.
This would enable you to find the genes with the maximum difference between these two sets of samples. Let’s leave the reference set to maximum.
• Reference Samples: choose the three oocyst samples (unsporulated, 4 days sporulated and 10 days sporulated).
• Comparison Samples: choose the 4 non-oocyst samples: 2 days, 4 days, 8 days in vitro, and 21 days in vivo (i.e. tachyzoite and three bradyzoite samples).
• Choose maximum expression value for the comparison samples, since the goal is to find genes with 10-fold higher expression in at least one of the oocyst samples compared to any of the non-oocyst samples.

b. Add a step to limit this set of genes to only those for which all the non-oocyst stages are expressed below the 50th percentile, i.e. likely not expressed at those stages.
• Hint: use the Expression Profiling of T. gondii Oocyst/Tachyzoite/Bradyzoite stages (str M4) (Boothroyd) -> T.g. Life Cycle Stages (percentile) search.
• Select the 4 non-oocyst samples.
• We want all to be below the 50th percentile, so set minimum percentile to 0 and maximum percentile to 50.
• Since we want all of them to be in this range, choose ALL in the “Matches Any or All Selected Samples”.
• Select Protein Coding genes.
• Note: you can turn on the column for “M4 Life Cycle Stages – graph” to see the graphs in the final result table (add column; transcript expression; microarray; tglife cycle; tg m4 life cycles stages graph).

c. Revise the first step of this strategy to find genes where all oocyst stages (d0, 4, 10) are 10 fold higher than any of the non-oocyst stages.
• Hint: change the expression value for the reference samples to minimum.
• Does this result in cleaner, more convincing looking graphs? Why?
• Would you consider these genes to be oocyst specific?

13.5 Exploring EST evidence in Phytophthora infestans.
a. Find all genes that have EST evidence.
b. Which gene has the highest number of ESTs?
c. Can you find some gene models that do not match their ESTs?
Check out the Genome Browser linked off the gene page. Go to ‘select tracks’ and make sure ESTs are shown. Try sorting by the number of ESTs to find those with just a few alignments. Those with just 1 EST aren’t very interesting but maybe those in the 5-20 range would be better. You can revise your search to return genes with more than 5, or 10, etc. EST hits and then sort by ESTs.

Complex strategies with Genomic Colocation
Exercise 14

1. Divergent genes with similar expression profiles.
Identify Phytophthora ramorum genes that meet these four criteria:
1. are located within 1000 bp of each other,
2. are divergently transcribed (on opposite strands),
3. are up-regulated in either zoospore or chlamydospore compared to either media,
4. show at least a 3-fold increase in expression.

• Hint: first use the “Genes based on RNAseq expression” -> “Transcript Profiling In Sporulations/Media” -> “P.r. Sporulations/Media RNASeq (fc)” search.
• Add a step that is the same as the first step and select the genomic colocation (1 relative to 2) operation.
• Set up the form to identify those genes that are transcribed on the opposite strand and have their starts located within 1000 bp of another gene’s start.
• Turn on the “Pr Sporulations/Media RNAseq – rpkm Graph” and “Pr Sporulations/Media RNAseq – percentile graph” columns to assess how well the pairs of genes compare in terms of expression. The pairs of genes are located one above the other in the result table if sorted by location.
• Identify paired genes that have similar expression profiles based on the graphs.
• Note that you could do similar types of experiments to look at potential coregulation / shared enhancers / divergent promoters with other sorts of data such as:
o DNA motifs for transcription factor binding sites.
o Of course other expression queries.
o Etc …
• The screenshot below shows one way (there are MANY) to configure the genome colocation form to identify genes that are divergently transcribed with their starts located within 1000 bp of each other.

14.2 Identify potential transcription factor binding motifs
The goal of this exercise is to identify DNA motifs in the promoter regions of similarly expressed phosphatase genes, and then search for these motifs in un-annotated genes that also show similar expression. Maybe these un-annotated genes have related functions or are in the same pathways.

a. Use the same RNAseq dataset from the previous example, up-regulated 3-fold increase P. ramorum Chlamydospore vs V8 media reference.

b. Restrict this set to genes that have “phosphatase” entered in the gene product. This should give you 8 genes as shown below.

Now download the promoter region sequence for these 5 genes. Most oomycete genes do not have UTR regions identified in the annotations, so we will take a large region upstream from the translation start site. Take 1 kb upstream from the ATG.
Hint: use the download # genes link shown in exercise 10. Select FASTA sequence, and change the options to get the upstream region.

d. You should now have five 1 kb long sequences. We now want to identify the overrepresented DNA patterns found in these sequences. Run these sequences in the DNA motif finder MEME (http://meme-suite.org/tools/meme). However, it will take a while to return these results, especially if we all submit jobs. Pre-run results can be accessed here (http://meme-suite.org/info/status?service=MEME&id=appMEME_4.10.114328414024451905962311). (This link should be active during the workshop but will not last forever. If you are following this example outside of the workshop you will need to actually run and wait for the MEME results.)

e. Take a look at the DNA motif results. Several interesting motifs are found. Motif 1 (CCAAAT) is very similar to a CAAT-box.
Scroll all the way down to the bottom where the motif placement map is seen. Motif 2 is often found ~200-300 bp upstream of Motif 1.

c. Let’s make the search more stringent and look for the genes that say “Purple acid phosphatase”. You will end with only 5 genes.

f. Search for all occurrences of Motif 1 in the 1 kb upstream regions of the expressed gene set of all phosphatase genes of P. ramorum (up-regulated 3-fold increase P. ramorum Chlamydospore vs V8 media reference). MEME gives the RegEx for Motif 1 as C[AG]AT[CT].

g. See how many of Motif 2 are found in close proximity to the CAAT-box-like Motif 1. Make a nested strategy for the motif identification: search for the motif [CT][GT][CA][ACG]CA[CG]CA[AC][CGT]A[AC][CA]G found within 400 bp upstream of Motif 5. How many previously un-annotated genes did you find?
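Outside the FungiDB strategy interface, the motif logic of steps f and g can be sketched in a few lines of Python. This is an illustrative sketch only, not the FungiDB implementation: it uses the two regular expressions quoted above, treats the CAAT-box-like Motif 1 as the anchor, scans only the strand of the sequence as given, and runs on a made-up toy sequence rather than real P. ramorum upstream regions. The function names are hypothetical.

```python
import re

# Regular expressions quoted in steps f and g of the exercise.
MOTIF_1 = re.compile(r"C[AG]AT[CT]")
MOTIF_2 = re.compile(r"[CT][GT][CA][ACG]CA[CG]CA[AC][CGT]A[AC][CA]G")


def motif_starts(pattern, seq):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    lookahead = re.compile("(?=" + pattern.pattern + ")")
    return [m.start() for m in lookahead.finditer(seq.upper())]


def motif2_upstream_of_motif1(seq, max_gap=400):
    """Return (motif2_start, motif1_start) pairs on the given strand where
    Motif 2 starts between 1 and max_gap bases before Motif 1 starts."""
    motif1_positions = motif_starts(MOTIF_1, seq)
    pairs = []
    for start2 in motif_starts(MOTIF_2, seq):
        for start1 in motif1_positions:
            if 0 < start1 - start2 <= max_gap:
                pairs.append((start2, start1))
    return pairs


# Made-up toy sequence: one Motif 2 instance, 250 filler bases, one Motif 1 instance.
toy_upstream = "CGCACACCAACAACG" + "G" * 250 + "CAATC" + "GG"

print(motif_starts(MOTIF_1, toy_upstream))
print(motif2_upstream_of_motif1(toy_upstream))
```

To apply this to the downloaded 1 kb upstream regions, each FASTA record would be passed through `motif2_upstream_of_motif1` in turn. Scanning the reverse complement as well would be needed to mimic a both-strand search.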
