Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Computational Gene Finding CIS786 Intro to Comp Biol Instructor: Dr. Barry Cohen Greg Voronin Hui Zhao Xueyi(Judy) Xiao The Challenge Presented By Greg Voronin Generate predictions of gene locations from primary genomic sequence by computational means Two principle means: – Database searching – Statistical Methods The Biological Model The Computational Model Representing the biology in a framework amenable to mathematical/statistical methods Exon classification, sequence features, signal profiles – What is an exon and what properties does the sequence of an exon hold? – How is an exon recognized and processed? Exon Classification Scheme The Nature of The Data What is the primary genomic sequence? • “Nor is the available sequence a single continuous and exact sequence for each chromosome… [ the HGP ] is represented by a set of sequences that cover the genome is a statistical sense but have a very large number of gaps.” – Many genes are as large or larger than the contigs in the HGP – Finding genes will depend on the accuracy of the scaffold of their contigs Back to Beginning What is a gene? – A biological model, a mathematical model and computational representation The programs we evaluate take these factors into account in their underlying model MZEF Michael Zhang’s Exon Finder Utilizes quadratic discriminant analysis (QDA) to classify sequence into gene and non-gene groups – QDA is a multivariate statistical pattern recognition method – “Draws” a curved boundary between groups of different classes QDA Key Elements of QDA Entities are represented by an n-dimensional vector of feature values Two classes of entities are categorized by their respective multinormal distribution – Each class has its own mean vector – The mean of each feature An appropriate distance function is central to the calculation of the posterior probabillity of group membership of a given unknown entity given its specific feature vector. Mahalanobis Distance The actual posterior probabillity function is more complex, but this is the distance component: ( x – mi )T Si ( x – mi ) -1 MZEF Specifics MZEF uses the following features: – Exon length, exon-intron transition, branch site score, 3’ss score, exon score, strand score, frame score, 5’ss score, intron-exon transition 9 dimensional feature vector Training sets of known exons and “non-exons” are used to establish the class characterisitics – Supervised learning …GATC… to Gene Cells recognize genes from DNA sequence. Can we?? The Hidden Markov Model Method HMMgene Presented By Hui Zhao HMMs are Statistical Models Definition: – Any mathematical construct that attempts to parameterize a random process Example: A normal distribution – – – – Assumptions Parameters Estimation Usage HMMs are just a little more complicated… Primary HMM Assumptions Observations are ordered Random processes can be represented by a stochastic finite state machine with emitting states – transition probabilities and emission probabilities. How do we find the model probabilities? This is called training We start with an architecture and a set of observed sequences The training process iteratively alters its parameters to fit the training set The trained model will assign the training sequences high probability – but can it generalize? HMM Usage – two major tasks Evaluate the probability of an observed sequence given the model (Forward) Find the most likely path through the model for a given observation sequence (Viterbi) Gene Finding: An Ideal HMM Application Our Objective: – To find the coding and non-coding regions of an unlabeled string of DNA nucleotides Our Motivation: – Assist in the annotation of genomic data produced by genome sequencing methods – Gain insight into the mechanisms involved in transcription, splicing and other processes Why HMMs might be a good fit for Gene Finding The observations within a sequence are ordered A DNA sequence is a set of ordered observations Designing the architecture is straight forward: Easy to measure success Training data is available from various genome annotation projects A HMM genefinder States represent standard gene features: intergenic region, exon, intron, perhaps more (promotor, 5’UTR, 3’UTR, Poly-A,..). Observations are things like state-dependent base composition. In a HMM, length of each state must be included as well. Finally, reading frames and both strands must be dealt with. Several problems can occur 5’ correct gene structure extended exon missing exon additional exon missing intron extended gene model 3’ HMMgene Krogh (1997) In Proc. 5th Conf. Intel. Sys. Mol Biol. pp179-186 •Predicts whole genes in any given stretch of DNA •Uses Hidden Markov Models (HMM) to maximize probability of accurate prediction •This allows confidence levels to be determined and "Best Prediction" as well as potential alternative splicing predictions •Outputs splice sites, start and stop codons, alternative predictions •Trained for human and C. elegans HMMGene Uses an extended HMM called a CHMM CHMM = HMM with classes Takes full advantage of being able to modify the statistical algorithms Uses high-order states Trains everything at once How does HMMGene work? 1) 5th order HMM assumes: P(xi | xi-1,xi-2, xi-3, xi-4, xi-5) is different in Introns, Exons, etc.. e.g: P(G, I | A,C,G,G,T) P(G, E | A,C,G,G,T) 2) Construct the model 2. How does HMMGene work? 3) In a CHMM states emit a nucl. pair class label G G e. g. or I E 4) Use Viterbi (n-best) to find a path through the CHMM = a labeled gene 5) Use the forward algorithm to measure P(gene | model) –using n-best. A DNA sequence containing one gene. For each nucleotide its label is written below. The coding regions are labeled ‘C’, the introns ‘I’, and the intergenic regions ‘0’. HMMGene calls these class labels in a CHMM. HMMGene Does not use the standard ML method which optimizes the probability of the observed sequence – instead it maximizes the probability of the correct prediction. Only one conference paper describes the algorithm. There is a web site to run the algorithm, and it's performance has been compared to other algorithms. No complete description of the algorithm is available – in the 1997 paper the author states "… the details of HMMGene will be described elsewhere (in prep)" – but unfortunately the detailed paper has not been published. HMMgene http://www.cbs.dtu.dk/services/HMMgene/) HMMgene and HMM Disadvantages Markov Chains P(x) … P(y) – States should be independent – P(y) must be independent of P(x) -usually not true Local maxima – Model may not converge the optimal parameter set Over-fitting – More training is not always good-set may be too small Summary • HMMgene finds whole genes in anonymous DNA with correctly spliced exons. • It can predict several whole or partial genes in one sequence. •If some features of a sequence are known, such as hits to ESTs, proteins, or repeat elements, these regions can be locked as coding or non-coding and then the program will find the best gene structure under these constraints. GENSCAN (v1.0) Presented By Xueyi (Judy) Xiao A computer program identifying complete exon & intron structures of genes in genomic DNA. Developed by Chris Burge (Burge 1997), in the research group of Samuel Karlin, Dept of Mathematics, Stanford Univ. Original server @Stanford New server @MIT (seq_len <= 500 kb); Servers are also maintained by the Pasteur Institute, Paris and by the GENSCAN web server at DKFZ/EMBnet, Heidelberg Implementations web server http://genes.mit.edu/GENSCAN.html email server http://genes.mit.edu/GENSCANM.html local copy downloaded under a license agreement How does It Work? Designed to predict complete gene structures Introns and exons Promoter sites Polyadenylation signals Larger predictive scope Partial and Complete genes Multiple genes separated by intergenic DNA in a seq Consistent sets of genes on either/both DNA strands Not use similarity-based methods Based on a general probabilistic model of genomic sequences composition and gene structure Model of Genomic Sequence Structure Fig. 3, Burge and Karlin 1997 Input http://genes.mit.edu/GENSCAN.html Output Graphic View Initial Exon Internal Exon Terminal Exon Single-Exon gene Optimal Exon Suboptimal Exon Is It Good? Accuracy: Substantially higher accuracies when tested on standardized sets of human & vertebrate genes, with 75-80% of exons identified exactly. Reliability: Able to indicate fairly accurately the reliability of each predicted exon. Consistency: Consistently high levels of accuracy, for seqs of differing C+G content and for distinct groups of vertebrates. Why not Perfect? Gene Number usually approximately correct, but may not Organism primarily for human/vertebrate seqs; maybe lower accuracy for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs Exon and Feature Type Internal exons > Initial or Terminal exons; Exons > Polyadenylation or Promoter signals(‘NNPP’) Biases in Test Set The Burset/Guigó (1996) dataset: toward short genes with relatively simple exon/intron structure The Rogic (2001) dataset: DNA seqs: GenBank r-111.0 (04/1999 <- 08/1997); source organism specified; consider genomic seqs containing exactly one gene; seqs>200kb were discarded; mRNA seqs and seqs containing pseudo genes or alternatively spliced genes were excluded. What are They doing NOW? The research group @MIT is currently developing another program, GenomeScan, which is more accurate when a moderate or closely related protein seq is available. TEST OF METHODS Sample Tests reported by Literature Test on the set of 570 vertebrate gene seqs (Burset&Guigo 1996) as a standard for comparison of gene finding methods. Test on the set of 195 seqs of human, mouse or rat origin (named HMR195) (Rogic 2001). Self-Test done by our group Dataset: Intron-less(Single-exon), -rich(Multi-exon), -poor(Random) Organism: Human Methods: all of the three Steps Where to get the dataset for Self-Test? http://www.ncbi.nlm.nih.gov/genome/guide/human/ Accuracy Measures Sensitivity vs. Specificity (adapted from Burset&Guigo 1996) TP FP TN FN TP FN TN Actual Predicted Actual No Coding / Coding Predicted Coding / No Coding TP FP FN TN Sensitivity (Sn) Fraction of actual coding regions that are correctly predicted as coding Specificity (Sp) Fraction of the prediction that is actually correct Correlation Coefficient (CC) Combined measure of Sensitivity & Specificity Range: -1 (always wrong) +1 (always right) Results: Accuracy Statistics Table: Relative Performance (adapted & added from Rogic 2001) Test By Rogic 2001 Programs # of seq Nucleotide accuracy Multi-Exon # of ESn Seq ESp Single-Exon # of ESn ESp Seq ESn ESp 195(3) 0.95 0.90 0.91 0.70 0.70 5 0.57 0.63 5 0.60 0.50 HMMgene 195(5) 0.93 0.93 0.91 0.76 0.77 5 0.42 0.42 5 0.60 0.30 0.58 0.59 5 0.76 0.62 5 0.40 0.40 MZEF Sp Exon accuracy CC Genscan Sn Self-Test 2002 119(8) 0.70 0.73 0.66 # of seqs - number of seqs effectively analyzed by each program; in parentheses is the number of seqs where the absence of gene was predicted; Sn -nucleotide level sensitivity; Sp - nucleotide level specificity; CC - correlation coefficient; ESn - exon level sensitivity; ESp - exon level specificity Testing ‘Random’ Sequences Presented By Greg Voronin These gene finding programs model statistical trends and properties – Can they be fooled by ‘random’ sequences – Generate a preliminary measure of accuracy Java program written to generate ‘random’ sequences of a,t,g,c 3 groups of sequences 5k, 10k & 30K Sent to BLAST then GeneMachine Testing Results BLAST: bit score – 5k 42 – 10k 44 – 30k 42 E-value 5.7 3.0 8.7 GeneMachine: – MZEF – GenScan – HMMgene 5k 1 3 7 10k 5 11 11 30K 14 26 42 New directions Presented By Hui Zhao •Computational Gene Finding has rapidly evolved since it started 20 years ago. •The advent of full-length genomic sequences has provided data and increased the requirements. •Gene annotation has direct medical implications on the design of pharmaceuticals and the understanding of the genetic component of diseases. •Gene finding remains largely an unsolved problem. New directions •The growing quantities of training data for the models should improve their performance. •Algorithms that combine the inputs from several models in a weighted voting scheme should be considered to try to get the best from all of the methods. •Many other AI approaches can be used to meet this challenge including decision trees, neural networks and rule-based systems Challenges and Discoveries Ahead Eukaryotic gene finding continues to be an active and important area – more research is required into algorithms with greater accuracy Expertise in computational biology is also required – which means training in both: computer science and molecular biology More classes like this…