Download ppt - Chair of Computational Biology

Gene finding Material of this lecture taken from - chapter 8, DW Mount „Bioinformatics“ - C. Mathé et al. Nucleic Acids Research 30, 4103 (2002) Recent years have seen an explosion of the number of completed genome sequences. The Genome Online Database lists 166 published completed genomes in the public databases 415 ongoing prokaryotic sequencing projects 360 ongoing eukaryotic sequencing projects. However, annotation (biological interpretation) is hardly keeping pace with the new flow of data :-( 9. Lecture WS 2003/04 Bioinformatics III 1 Gene finding - sequence motifs useful for gene finding - prokaryotic gene finding - Hidden Markov Models (HMM) - eukaryotic gene finding - popular gene finding programs, comparison 9. Lecture WS 2003/04 Bioinformatics III 2 Introduction The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs). In each sequence, there are 6 possible open reading frames: 3 ORFs starting at positions 1, 2, and 3, and going in the 5‘ to 3‘ direction and 3 ORFs starting at positions 1, 2, and 3, and going in the 5‘ to 3‘ direction of the complementary sequence. In prokaryotic genomes, DNA sequences encoding proteins are transcribed into mRNA, and the mRNA is usually directly translated into proteins without significant modification. Therefore, the longest ORF running from the first available Met codon (AUG) on the mRNA to the next stop codon in the same reading frame, is usually a good prediction of the protein-encoding regions. 9. Lecture WS 2003/04 Bioinformatics III 3 Methods Obtain new genomic DNA sequence. Translate in all 6 ORFs and compare to protein database. Perform database similarity search of EST database of same organism, or of cDNA sequences, if available. 9. Lecture WS 2003/04 Bioinformatics III Use gene prediction program to locate genes. Analyze regulatory sequences in the gene. 4 Statistical methods vs. rule-based methods Statistical methods secondary structure prediction: neural networks AGADIR Rule-based methods apply rules from stereochemistry tertiary structure prediction: homology modelling, threading ab initio predictions gene prediction: Hidden Markov Models neural networks apply rules for gene structure biological function: sequence analysis structural alignment (3D structures) 9. Lecture WS 2003/04 predict binding sites + mechanism from 3D structure if available Bioinformatics III 5 Statistical methods vs. rule-based methods Advantages statistical methods: powerful if based on large data basis Disadvantages weak if data basis is too small may be quick to apply automatic training procedures no insight in underlying principles rule-based methods: powerful if rules are strongly followed in biological there are rules and many special cases insight in underlying principles rules may only become obvious after problem is solved by statistical method may be very expensive (e.g. ab initio 3rd structure prediction) may be very quick 9. Lecture WS 2003/04 Bioinformatics III 6 Extrinsic and intrinsic methods Most approaches now combine (a) homology methods = „extrinsic methods“ with (b) gene prediction methods = „intrinsic methods“ Only about half of all genes can be found by homology to other known genes or proteins (this value is of course increasing as more genomes get sequenced and more cDNA/EST sequences get available). In order to determine the 50% of remaining genes, one has to turn to predictive methods. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 7 Prokaryotic genes Advantages: there are no introns the intergenic regions are small Disadvantages: the genes may often overlap each other the translation starts are difficult to predict. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 8 Extrinsic Content Sensors Extrinsic content sensors identify similarities with protein/DNA database entry: FASTA, BLAST ... Obvious weakness: nothing will be found if the database does not contain a sufficiently similar sequence. Even when a good similarity is found, the limits of the regions of similarity, which should indicate exons, are not always very precise. Small exons are easily missed. EST/cDNA data allow to identify exons. On the other hand, EST data gives only limited information on the gene structure. Advantage: a single match is enough to detect the presence of a gene. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 9 Eukaryotic genomes Transcription of protein-encoding regions is initiated at specific promoter sequences, and followed by removal of noncoding sequence (introns) from pre-mRNA by a splicing mechanism. 3 types of posttranscriptional events influence the translation of mRNA into protein and the accuracy of gene prediction: (1) species-dependent codon usage (2) tissue-dependent splice variations (3) mRNA may be edited. 9. Lecture WS 2003/04 Bioinformatics III 10 Intrinsic Content Sensors for eukaryotic genomes Characterize „coding“ regions: - nucleotide composition - G+C content (introns are more A/T rich than exons, especially in plants) - codon composition - hexamer frequency (this was found to be the most discriminating variable between coding and non-coding sequences) - base occurrence periodicity ... Hexamer frequence, or, more generally, the k-mer composition of coding sequences is the main search tools in many packages. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 11 Repeated sequence elements in eukaryotic genes Eukaryotic DNA is wrapped around histone-protein complexes called nucleosomes. Therefore, some of the base pairs in the major or minor grooves of DNA face the nucleosome surface and others face outside. Repeated patterns of sequence have been found in the introns and exons by hidden Markov model (HMM) analysis. Patterns appear with a periodicity of 10, JMB 263, 503 (1996) Baldi et al. 9. Lecture WS 2003/04 Bioinformatics III 12 Markov model A Markov model is a stochastic model which assumes that the probability of appearance of a given base (A, T, C, or G) at a given position depends only on the k previous nucleotides. k is called the order of the Markov model. Such a model is defined by the conditional probabilities P(X|k previous nucleotides), where X = A, T, G, or C. In order to build a Markov model, a learning set of sequences on which these probabilities will be estimated is required. Given a sequence and a Markov model, one can then very simply compute the probability that this sequence has been generated according to this model. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 13 Markov models The simplest Markov models are homogenuous zero order Markov models which assume that each base occurs independently with a given frequency. Such simple models were previously often used for non-coding regions. Modern programs like GeneMark also use higher order models to represent introns and intergenic regions. More complex three-periodic Markov models have been introduced to characterize coding sequences. Coding regions are defined by three Markov models, one for each position inside a codon. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 14 Markov models The larger the order of a Markov model, the finer it can characterize dependencies between adjacent nucleotides. However, a model of order k requires a large number of coding sequences to be reliably estimated. Therefore, most programs like GeneMark or Genscan rely on a three-periodic Markov model of order five (thus exploiting hexamer composition) or less to characterize coding sequences. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 15 Hidden Markov Models An HMM is a graph of connected states, each state potentially able to “emit” a series of observations. The process evolves in some dimension, often time, though not necessarily. The model is parameterized with probabilities governing the state at a time t + 1, given that one knows the previous states. Markov assumptions are used to truncate the dependency of having to know the entire history of states up to this point in order to assess the next state: Instead, only one step back is required. As the process evolves in time through the states, each state can potentially emit observations, which are regarded as a stream of observations over time. These models are often illustrated graphically as shown right, with the states being circles and transitions as arrows between the states. 9. Lecture WS 2003/04 Bioinformatics III 16 Signal sensors The basic and natural approach to finding a signal that may represent the presence of a functional site is to search for a match with a consensus sequence e.g. for promoter regions (TATA box), or the ribosomal binding site on the mRNA. This consensus could be determined from a multiple alignment of functionally related documented sequences. Programs SPLICEVIEW and SplicePredictor. A more flexible representation of signals is offered by the so-called positional weight matrices (PWMs): indicate the probability that a given base appears at each position of the signal (again computed from a multiple alignment of functionally related sequences). One can say that a PWM is defined by one classical zero order Markov model per position. The PWM weights can also be optimized by a neural network method. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 17 Signal sensors In order to capture possible dependencies between adjacent positions of a signal, one may use higher order Markov models called weight array models (WAM). These methods assume a fixed length signal. Hidden Markov models further allow for insertions and deletions. Most existing programs use such models to represent and detect - splice sites - branch points - correct intron/exon boundaries and other motives like - poly(A) sites (in 3‘-UTRs) - promoters Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 18 Predict eukaryotic gene structures One doesn‘t want to only search for independent exons, but instead identify the whole complex structures of genes! Each consistent pair of detected signals (translation starts and stops, spice sites) defines a potential gene region (intron, exon or coding part of an exon). Since all these potential gene regions can be used to build a gene model, the number of potential gene models grows exponentially with the number of predicted exons! In practice, „correct“ gene structures must satisfy a set of properties (i) there are no overlapping exons (ii) coding exons must be frame compatible (iii) merging two successive coding exons will not generate an in-frame stop at the junction. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) The number of candidates remains, however, exponential. 9. Lecture WS 2003/04 Bioinformatics III 19 Testing the reliability of an ORF prediction (1) Observation of unusual type of sequence variation found in ORFs: every 3rd base tends to be the same one much more often than by chance alone (Fickett 1982). This property is due to nonrandom use of codons in ORFs and is true for any ORF, regardless of the species. (2) Determine whether the codons in the ORF correspond to those used in other genes of the same organism (codon usage statistic). (3) Translate ORFs into amino acid sequence and compare that to database of protein sequences. If good hits are found, confidence in new predicted ORF rises. 9. Lecture WS 2003/04 Bioinformatics III 20 Neural Network: GRAIL II Grail II provides analyses of protein-coding regions, poly(A) sites, and promoters constructs gene models, predicts encoded protein sequences provides database searching capabilities. (1) create list of most likely exon candidates (2) evaluate candidates by neural network Uberbacher, Mural. PNAS, 88, 11261 (1991) 9. Lecture WS 2003/04 Bioinformatics III 21 Glimmer 2.0 Glimmer 2.0 is an extension. Probabilities may be computed for any of the k bases preceeding a particular base b, not just those adjacent to b. Glimmer 1.0 uses an interpolated Markov model (IMM) where the search depth k is optimized for every tuple. Sample ICM (interpolated context model) decomposition tree. The root position 12 has maximum mutual information with the final base position 13. Each child of the root represents the subset of windows with the indicated nucleotide value at position 12, and indicates the maximum mutual information position for that subset. Each node is similarly decomposed into children. Note that children of a single node may represent different base positions. Salzberg et al. NAR, 27, 4636 (1999) 9. Lecture WS 2003/04 Bioinformatics III 22 Accuracy of Glimmer 2.0 Glimmer 2.0's accuracy for ten complete bacterial and archaeal genomes. The majority of the genes missed were very short, either below the minimum or very close to it. The default settings produce additional gene predictions ranging from 7-20% of the total, many of which are likely to be false positives, but some of which may be genuine. The additional prediction rate drops quickly if the minimum gene length is set to be greater than 90bp. 9. Lecture WS 2003/04 Bioinformatics III 23 Accuracy of Glimmer 2.0 A better measure of accuracy is to consider only "confirmed genes," which we define as genes that have a significant database match to a gene in another organism. The table below shows these statistics on 10 genomes for both Glimmer 1.0 and 2.0. All results were obtained by a very simple training procedure: Glimmer was trained by first extracting all non-overlapping open reading frames over 500bp. The trained model was then used to find genes in the complete genome. 9. Lecture WS 2003/04 Bioinformatics III 24 Glimmer 1.0 vs. Glimmer 2.0 Numbers of genes confirmed by database matches found exclusively by GLIMMER 1.0, by GLIMMER 2.0, and by both systems. 9. Lecture WS 2003/04 Bioinformatics III 25 Newly identified genes Genes in M.tuberculosis found automatically by GLIMMER 2.0 with homology to protein sequences from other organisms. Salzberg et al. NAR, 27, 4636 (1999) 9. Lecture WS 2003/04 Bioinformatics III 26 Problematic gene start prediction Detecting a gene as a protein-coding ORF with an ‘open’ start still does not provide full information for gene annotation. Although several procedures for gene start prediction accuracy have been described, verification of the actual accuracy of these methods has been hampered by an insufficient number of experimentally validated translation starts and, therefore, a deficit of reliable data for training and testing. In the absence of a reliable computer procedure for gene start prediction, the rule of the ‘longest ORF’ was frequently applied to annotate complete microbial genomes with gene start assigned to the 5'-most ATG codon (see Table). 9. Lecture WS 2003/04 Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) Bioinformatics III Salzberg et al. NAR, 27, 4636 (1999) 27 GeneMarkS Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 28 Gene prediction accuracy Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 29 Performance of GeneMarkS on eukaryotic genomes Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 30 Comparable performance of Glimmer and GeneMarksS Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 31 GeneMarkS (A) In the process of GeneMarkS training there is no division of the coding sequence into two clusters. However, in applying the GeneMark.hmm 2.0 program, the model of coding region derived by GeneMarkS can be used as the Typical model along with a heuristic model used as the Atypical model . (B) In this simplified diagram of hidden state transitions in GeneMark.hmm 2.0, the state ‘gene’ represents a sequence composed of an RBS plus a spacer plus the protein-coding sequence (CDS). Gene overlaps encompass all possible types of superpositions: overlap of genes on the same strand (as observed in operons), overlap of genes on opposite strands, overlap of coding region with RBS, and so on. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 32 Positional frequency pattern Sequence logo representing the RBS positional frequency pattern detected by GeneMarkS in the analysis of B.subtilis genomic data. The total height of the four letters in each position indicates the position specific information content, while the height of each letter is proportional to the nucleotide frequency. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 33 Spacer length (B) Graph of probability distribution of spacer length, the sequence between the RBS sequence and the gene start. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 34 Gene prediction evaluation Venn diagram showing group relationships between the GenBank annotation and sets of genes detected by GeneMark.hmm 2.0 and Glimmer 2.02 for the B.subtilis genome (A) and the E.coli genome (B). Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 35 Ribosome binding sites Distributions of log-odds scores of RBS sites, as detected by GeneMarkS, in sets of overlapping and non-overlapping of genes of (A) B.subtilis, (B) E.coli and (C) M.jannaschii. As can be seen, the overlapping genes, which are likely to be located inside operons, frequently have strong RBS sites. Still, most strong sites of ribosome binding precede the non-overlapping genes (stand alone genes and genes leading operons). This tendency is much more apparent in the case of the archaeal genome of M.jannaschii than in the E.coli and B.subtilis genomes Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 36 RBS motif Sequence logo representing the RBS motif observed in a subset of upstream sequences of the A.fulgidus genome. This subset consisted of 50 nt long upstream sequences overlapping the 3' end of the preceding gene. The consensus of this motif is complementary to a section of the A.fulgidus 16S rRNA. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 37 Spacer length Distributions of spacer length for two species with strong RBS patterns, B.subtilis and E.coli (solid and dashed lines, respectively), and one species with a strong eukaryotic promoter-like pattern, A.fulgidus (dotted line). The promoter-like pattern of A.fulgidus is localized much further upstream of the start codon than the RBS patterns of B.subtilis and E.coli. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 38 Distribution of spacer lengths (A) Distribution of spacer lengths observed in the B.subtilis genome for two different types of possible RBS hexamers: AGGAGG and AGGTGA. Multiple alignment allows these hexamers to be superimposed. In actual upstream sequences, these hexamers tend to occupy different locations relative to the start codon. This preference may be involved in the precise positioning of the ribosome at the translation initiation site when the 16S rRNA binds to mRNA. The more frequent hexamer was observed on average at a further distance from the gene start than the rare hexamer. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 39 Distribution of spacer lengths (B) Distribution of spacer lengths observed in the M.thermoautotrophicum genome for two different types of RBS hexamers: GGAGGT and GGTGAT. Properties of these hexamers are similar to the two hexamers observed in the B.subtilis genome (A), except that more frequent hexamer is now found on average at a closer distance to the gene start than the rare hexamer. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 9. Lecture WS 2003/04 Bioinformatics III 40 TIGR: GlimmerM, Exonomy and Unveil Genefinder programs for eukaryotes are far from perfect, often predicting genes or exons which do not exist, failing to predict those that do exist or generating predictions having one or more incorrect exon boundaries. Also, different genefinders trained for the same organism often produce different predictions. The latest publication by The Institute of Genome Research (TIGR) therefore suggests 3 genefinders that are all based on one or more types of Markov models: GlimmerM, Exonomy, Unveil. Majoros et al. Nucl. Acids. Res. 31, 3601 (2003) 9. Lecture WS 2003/04 Bioinformatics III 41 TIGR: GlimmerM, Exonomy and Unveil Topologies of Unveil 283-state HMM Exonomy 23-state GHMM Majoros et al. Nucl. Acids. Res. 31, 3601 (2003) 9. Lecture WS 2003/04 Bioinformatics III 42 Methods don’t work everywhere An example in which Exonomy produces the correct gene model. An example in which GlimmerM produces the correct gene model. An example in which Unveil produces the correct gene model (as does Genscan). Majoros et al. Nucl. Acids. Res. 31, 3601 (2003) 9. Lecture WS 2003/04 Bioinformatics III 43 Summary The results of gene prediction should be taken with caution; although the results are becoming increasingly more reliable, they do remain only predictions. They are very useful for speeding up gene discovery and knowledge mining. Biological expertise remains necessary in order to confirm the existence of a virtual protein and to find or prove its biological function. Therefore comparative genome approaches are becoming more and more important where programs can scan for homologies with expressed sequences (EST or cDNA sequence data). More work is now focussing on detecting RNA-coding genes. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002) 9. Lecture WS 2003/04 Bioinformatics III 44

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt - Chair of Computational Biology