Advancing Science with DNA Sequence

Finding the genes in microbial genomes
Natalia Ivanova
MGM Workshop, February 3, 2009

Outline
1. Introduction
2. Tools out there
3. Basic principles behind tools
4. Known problems of the tools: why you may need manual curation

Outline
1. Introduction (who said annotating prokaryotic genomes is easy?)
2. Tools out there
3. Basic principles behind tools
4. Known problems of the tools: why you may need manual curation

Finding the genes in microbial genomes
Sequence features in prokaryotic genomes:
• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)
[Figure: well-annotated bacterial genome in the Artemis genome viewer. Labeled features: rRNA, tRNA, operon, promoter, terminator, protein-coding gene … CDS, protein-binding site]

Outline
1. Introduction (who said annotating prokaryotic genomes is easy?)
2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
3. Basic principles behind tools
4. Known problems of the tools: why you may need manual curation

Tools out there: servers for microbial genome annotation - I
• IMG-ER
  http://img.jgi.doe.gov/er
  IMG-ER submission page: http://durian.jgi-psf.org/~imachen/cgi-bin/Submission/main.cgi
  Output: stable RNA-encoding genes, CDSs, functional annotations; output in GenBank format
• RAST
  http://rast.nmpdr.org/
  Output: rRNAs and tRNAs, CDSs, functional annotations; output in several formats
• JCVI Annotation Service
  http://www.tigr.org/tigr-scripts/AnnotationEngine/ann_engine.cgi
  Output: CDSs, stable RNAs? functional annotations; format?

Tools out there: servers for microbial genome annotation - II
• AMIGENE
  http://www.genoscope.cns.fr/agc/tools/amiga/Form/form.php
  Output: CDSs; output in gff format
• RefSeq
  http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi
  http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
  Output: CDSs; output in tbl format
• EasyGene
  http://www.cbs.dtu.dk/services/EasyGene/
  Output: CDSs; size restriction <1 Mb

Tools out there: genome browsers for manual annotation of microbial genomes
• Artemis
  http://www.sanger.ac.uk/Software/Artemis/
  Windows and Linux versions; works with files in many formats, annotated by any pipeline
• Manatee
  http://manatee.sourceforge.net/
  Linux versions only; genome needs to be annotated by the JCVI Annotation Service
• Argo
  http://www.broad.mit.edu/annotation/argo/
  Windows and Linux; works with files in many formats
Major difference: viewer vs editor?

Tools out there: tools for finding stable (“non-coding”) RNAs - I
• Large structural RNAs (23S and 16S rRNAs)
  - RNAmmer
    http://www.cbs.dtu.dk/services/RNAmmer/
• Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
  - Rfam database, INFERNAL search tool
    http://www.sanger.ac.uk/Software/Rfam/
    http://rfam.janelia.org/
    http://infernal.janelia.org/
    Web service: sequence search is limited to 2 kb
  - ARAGORN
    http://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html
    Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only
  - tRNAScan-SE
    http://lowelab.ucsc.edu/tRNAscan-SE/
    Web service: sequence search is limited to 5 Mb, finds tRNAs only

Tools out there: tools for finding “non-coding” RNAs - II
• Short regulatory RNAs (riboswitches, etc.)
  - Rfam database, INFERNAL search tool
    http://www.sanger.ac.uk/Software/Rfam/
    http://rfam.janelia.org/
    http://infernal.janelia.org/
    Web service: sequence search is limited to 2 kb; provides a list of precalculated RNAs for publicly available genomes
• Other (less popular) tools:
  - Pipeline for discovering cis-regulatory ncRNA motifs
    http://bio.cs.washington.edu/supplements/yzizhen/pipeline/
  - RNAz
    http://www.tbi.univie.ac.at/~wash/RNAz/

Tools out there: finding protein-coding genes (not ORFs!)
Reading frames: translations of the nucleotide sequence with an offset of 0, 1 or 2 nucleotides (three possible translations in each direction)
Open reading frame (ORF): a reading frame between a start codon and a stop codon

Tools out there: most popular CDS-finding tools
• CRITICA
• Glimmer family (Glimmer2, Glimmer3, RBS finder)
  http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS)
  http://exon.gatech.edu/GeneMark/
• EasyGene
• AMIGENE
• PRODIGAL (default JGI gene finder)
  http://compbio.ornl.gov/prodigal/
Combinations and variations of the above:
• REGANOR (CRITICA + Glimmer3 + pre-processing)
• RAST (Glimmer2 + pre- and post-processing)

Tools out there: metagenome annotation
• BLASTx
• Fgenesb
  http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb
• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs)
  http://exon.gatech.edu/GeneMark/
• MetaGene
  http://metagene.cb.k.u-tokyo.ac.jp/metagene/
• GISMO ?
  http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/
Full-service servers:
• IMG/M-ER: uses GeneMark for Sanger, proxygenes for 454
  http://img.jgi.doe.gov/submit
• MG-RAST
  http://metagenomics.nmpdr.org/

Outline
1. Introduction (who said annotating prokaryotic genomes is easy?)
2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
3. Basic principles behind tools (very basic, see specific papers for details)
4. Known problems of the tools: why you may need manual curation

Basic principles: finding CDSs using evidence-based vs ab initio algorithms
Two major approaches to prediction of protein-coding genes:
• “evidence-based” (ORFs with translations homologous to known proteins are CDSs)
  Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
  Limitations: cannot find “unique” genes; low sensitivity towards short genes; prone to propagating false positive results of ab initio annotation tools
• ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
  Advantages: finds “unique” genes; high sensitivity
  Limitations: often misses “unusual” genes; high rate of false positives

Basic principles: finding protein-coding genes with ab initio methods
[Figure: example of the overall HMM architecture used in EasyGene (from Larsen & Krogh, BMC Bioinformatics, 2003)]
• An ORF that is likely to be protein-coding is found by searching for “coding potential”
• “Coding potential” is defined by comparing the nucleotide sequence of an ORF to a hidden Markov model (HMM)
• The HMM is generated using a training set from the genome or from average frequencies observed for multiple genomes
• The probability that an ORF is a protein-coding gene is computed
• The N-terminal (5’) boundary is found by finding a start codon (ATG, GTG, TTG) next to a ribosomal binding site (RBS, Shine-Dalgarno sequence)
• Different genomes have different frequencies of start codons
• The RBS is found by (Gibbs sampling) multiple sequence alignment of upstream sequences and represented by a weighted positional frequency matrix
• Or the RBS is found by multiple sequence alignment and represented as one of the states in an HMM
• Or ...
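
The definitions of reading frame and ORF, together with the start codons listed above (ATG, GTG, TTG), can be illustrated with a minimal sketch. It only enumerates candidate ORFs in all six frames; it does not score coding potential or look for an RBS, so it is the candidate-generation step and not a gene finder. The function names, the `min_codons` cutoff and the stop codon set (TAA, TAG, TGA, the standard bacterial code) are assumptions made for this illustration and are not taken from any of the tools named in the slides.

```python
# Candidate ORF enumeration in six reading frames (illustrative sketch only).
START_CODONS = {"ATG", "GTG", "TTG"}   # start codons listed in the slides
STOP_CODONS = {"TAA", "TAG", "TGA"}    # assumption: standard bacterial code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return candidate ORFs as (strand, frame, start, end) tuples.

    Coordinates are 0-based, half-open, and given on the scanned strand
    (for "-" they refer to the reverse-complemented sequence).
    For each stop codon only the longest ORF is kept, i.e. the one that
    begins at the first in-frame start codon after the previous stop.
    min_codons is a hypothetical length cutoff that counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):           # offsets of 0, 1 and 2 nucleotides
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i            # first start codon after a stop
                elif codon in STOP_CODONS:
                    if start is not None and (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None         # the frame is closed by this stop
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG", min_codons=1):
        print(orf)
```

Keeping only the longest ORF per stop codon is a deliberate simplification: it sidesteps the choice among alternative start codons, which is exactly the part that the RBS models described above are meant to resolve.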

Features and differences between gene finding tools
• Training set selection (evidence-based vs purely ab initio)
  Example: CRITICA and EasyGene use evidence-based training sets (BLASTn with counting of synonymous/non-synonymous codons in CRITICA, BLASTx in EasyGene); Glimmer uses an ab initio training set of long non-overlapping ORFs; GeneMark uses an ab initio heuristic model
• Statistical model of coding and non-coding regions (codon frequencies, dicodon frequencies, hidden Markov models)
  Example: CRITICA uses dicodon frequencies to model coding regions; Glimmer uses interpolated Markov models (IMM) of up to 5th order; GeneMark uses an order-2 HMM for coding regions and an order-0 HMM for non-coding regions
• Statistical model architecture (i.e. which parts of the CDS are explicitly modeled; may include RBS, spacer region, start codon, second codon, internal codons, stop codon, etc.)
  Example: EasyGene explicitly models the RBS, spacer region, start codon, second codon, internal codons, stop codon, codons surrounding the stop codon, and non-coding sequence; all other tools have less comprehensive HMM architectures
• Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.)
  Example: Glimmer2.0 has a scoring schema for overlap resolution; Glimmer3 uses a dynamic programming algorithm to select the highest-scoring set of predictions consistent with the maximum allowed overlap

Outline
1. Introduction (who said annotating prokaryotic genomes is easy?)
2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
3. Basic principles behind tools (very basic, see specific papers for details)
4. Known problems of the tools: why you may need manual curation (more on manual curation in the next talk by Thanos Lykidis)

Known problems: RNAs

Genome                              Sequencing center   16S rRNA, nt
Synechococcus sp. CC9311            UCSD, TIGR          1477
Synechococcus sp. CC9605            JGI                 1440
Synechococcus elongatus PCC 7942    JGI                 1490
Synechococcus sp. JA-2-3B'a(2-13)   TIGR                1323
Synechococcus sp. JA-3-3Ab          TIGR                1324
Synechococcus sp. RCC307            Genoscope           1498
Synechococcus sp. WH7803            Genoscope           1497, 1464

• Large rRNAs: RNAmmer is very accurate, but it was developed only recently
  - rRNA sizes vary among closely related strains
  - most 16S rRNAs are missing the anti-Shine-Dalgarno sequence
• Small structural RNAs: covariance models are generally accurate, but may miss some tRNAs in Archaea
  - check for the full complement of tRNAs with all necessary anticodons
  - there is no model for pyrrolysine tRNA
• Small regulatory RNAs: search is accurate but slow (too many models)
  - annotations of regulatory RNAs are missing from many genomes

Known problems: CDSs
• Short CDSs: many are missed, others are overpredicted
  - short ribosomal proteins (30-40 aa long) are often missed
  - short proteins in the promoter region are often overpredicted
• N-terminal sequences are often inaccurate (many features of the sequence around the start codon are not accounted for)
  - Glimmer2.0 calls genes longer than they should be
  - GeneMark and Glimmer3.0 err both ways, but mostly call genes shorter
• Pseudogenes
  - all tools look for ORFs (which need valid start and stop codons)
  - “unique” genes are often predicted on the opposite strand of a pseudogene
• Proteins with unusual translational features (recoding, programmed frameshifts)
  - these genes are often mistaken for pseudogenes
  - see pseudogenes

Supplemental tools
• TIS (translation initiation site) prediction/correction
  - TICO
    http://tico.gobics.de/
  - TriTISA
    http://mech.ctb.pku.edu.cn/protisa/TriTISA
  The two tools often disagree about the best TIS, especially in high-GC genomes
• Operon prediction
  - JPOP
    http://csbl.bmb.uga.edu/downloads/#jpop
  - http://www.cse.wustl.edu/~jbuhler/research/operons/
  - http://www.sph.umich.edu/~qin/hmm/
• Proteins with unusual translational features: selenocysteine-containing genes
  - bSECISearch
    http://genomics.unl.edu/bSECISearch/

Known problems: different gene finding tools applied to the same genome

          total       total       non-pseudo  pseudo      total       total       total
          features    CDSs        CDSs                    rRNA        misc RNA    tRNA
manual    7124        7042        6699        343         18          63          1
GeneMark  7059        6974        6974        0           18          62          2
ORNL      7076        6994        6994        0           18          63          1
RAST      5503        5422        5422        0           18          63          0
REGANOR   6420        6339        6339        0           18          63          0
Glimmer3  8218        8218        8218        0           0           0           0

          missed by          false positive     too short          too long           total
          automated          (deleted by        (extended by       (truncated by      modifications
          annotation         manual curation)   manual curation)   manual curation)
          #     % CDSs       #     % CDSs       #     % CDSs       #     % CDSs       #     % CDSs
GeneMark  347   4.9          282   4.0          846   12.0         106   1.5          1581  22.4
ORNL      610   8.6          560   7.9          300   4.2          1152  16.3         2622  37.2
RAST      1783  25.3         167   2.3          83    1.1          2228  31.6         4261  60.5
REGANOR   904   12.8         203   2.8          207   2.9          1059  15.0         2373  33.6
Glimmer3  237   3.3          1408  19.9         500   7.1          542   7.6          2687  38.1

Conclusions
• There are plenty of tools for automated annotation of microbial genomes, including several “full-service” servers and annotation pipelines
• Even “full-service” pipelines identify a limited range of features; development of automated or semi-automated tools for identification of operons, promoters, terminators, etc. is highly desirable, but not likely in the absence of experimental data
• Nearly all of the annotation tools and servers use different strategies, algorithms, models, settings, etc., so the results may and will vary
• Different automated gene finders have different advantages and limitations; the best strategy is to use any one of them, or a combination, followed by evidence-based manual curation
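
A note on reading the gene finder comparison: the "% CDSs" values appear to be each count divided by the 7042 CDSs of the manually curated annotation, truncated (not rounded) to one decimal place. This is an inference from the numbers and is not stated on the slide. The sketch below recomputes the "total modifications" column under that assumption; the names `MANUAL_CDS`, `MODIFICATIONS` and `percent_tenths` are introduced here for illustration only.

```python
# Recompute "total modifications" from the comparison table (illustrative sketch).
MANUAL_CDS = 7042  # total CDSs in the manually curated annotation

# Per tool: (missed, false positive, too short, too long), as listed in the table.
MODIFICATIONS = {
    "GeneMark": (347, 282, 846, 106),
    "ORNL": (610, 560, 300, 1152),
    "RAST": (1783, 167, 83, 2228),
    "REGANOR": (904, 203, 207, 1059),
    "Glimmer3": (237, 1408, 500, 542),
}


def percent_tenths(count, total=MANUAL_CDS):
    """Return count/total as tenths of a percent, truncated.

    Integer arithmetic is used so the truncation is exact.
    """
    return count * 1000 // total


if __name__ == "__main__":
    for tool, counts in MODIFICATIONS.items():
        total = sum(counts)
        print(f"{tool:10s}{total:6d}  {percent_tenths(total) / 10:.1f}% of manual CDSs")
```

Seen this way, the four error categories for each tool sum to its "total modifications" count, and the percentages of all five tools are directly comparable because they share one denominator.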