Gene Finding With A Hidden Markov Model Of Genomic Structure and Evolution
Jakob Skou Pedersen and Jotun Hein
Deepak Verghese, CS 6890

• GPHMM
• Conserved exon method
• Two-step approach of GLASS and ROSETTA
• TWINSCAN, which extends GENSCAN
• etc.
These methods do not exploit all the information in the evolutionary pattern and are not easily extended to multiple genome sequences.

EVOLUTIONARY HIDDEN MARKOV MODEL (EHMM)
A probabilistic model of both genome structure and evolution, composed of:
1. A hidden Markov model (HMM)
2. A phylogenetic tree
• Can handle any number of sequences in an alignment.
• Can have properties of higher-order HMMs.
• Can handle variability in the sequences along the alignment.
• State-of-the-art evolutionary models can be incorporated later.
• Evolutionary events between different genomes are not treated independently.

SCOPE
• Not to compete with the existing gene-finding methods on performance, but to illustrate the power of this approach.
• Relies on a pre-produced alignment.

MARKOV CHAINS
• A set of states.
• The transitions from one state to all other states, including itself, are governed by a probability distribution.
• First-order Markov chain: the probabilities depend solely on the current state.
• n-th order Markov chain: the probabilities depend on the n previous states.

HIDDEN MARKOV MODEL
5 components:
• A set of states
• A matrix of transition probabilities (A)
• An alphabet of symbols (C)
• A set of emission distributions (e)
• An initial state distribution (B)

Example of a hidden Markov model
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
There is no 1:1 correspondence between states and symbols.
Why the name "hidden"?
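One way to see why the state path is called hidden is to score the same symbol sequence under every possible path. The sketch below does that for a toy two-state HMM. The state names, the sequence "ACG", and all probabilities are made up for illustration and are not taken from the slides; only the roles of B, A, and e follow the component list above.

```python
from itertools import product

# Hypothetical two-state toy HMM (illustrative numbers, not the EHMM).
STATES = ("exon", "intron")
B = {"exon": 0.5, "intron": 0.5}                         # initial state distribution
A = {"exon": {"exon": 0.9, "intron": 0.1},               # transition probabilities
     "intron": {"exon": 0.2, "intron": 0.8}}
E = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},   # emission distributions
     "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def joint_probability(symbols, path):
    """Probability of emitting `symbols` while following the state `path`."""
    prob = B[path[0]] * E[path[0]][symbols[0]]
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        prob *= A[prev][cur] * E[cur][sym]
    return prob


sequence = "ACG"
# Every path of the right length can emit the sequence; only its probability differs.
paths = list(product(STATES, repeat=len(sequence)))
for path in paths:
    print(path, joint_probability(sequence, path))
```

An observer sees only the symbols, so any of the eight paths could have produced "ACG"; the model can only rank them by probability.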
Components
• State k
• Emits symbols (observables) from C

PROBABILISTIC MODEL
• Emission distribution e
• Initial state distribution B
• Transition probabilities A
• Path Π
Different paths are possible for the same sequence.

In the EHMM, the emission distribution e is specified by:
• An evolutionary model E_k
• A phylogenetic tree T

PHYLOGENETIC TREES
Motivation: the problem of explaining the evolutionary history of today's species.
In phylogenetic trees:
• Leaves represent present-day species.
• Interior nodes represent hypothesized ancestors.
• Character states of inner nodes are missing data.
• The lengths of the branches of a tree represent the evolutionary difference.

Evolution is often modeled by continuous-time Markov chains. Here, evolution along the branches of the phylogenetic tree is modeled by E_k.
Transition probability for a branch of length t: $P_k(t) = \exp(t Q_k)$, where Q_k is the rate matrix of E_k.
Increasing the number of sequences increases the amount of evolutionary information.

The alignment column corresponds to the state of evolution at the leaves of the phylogenetic tree. The probability of generating an alignment column in state k equals the probability of observing a given character pattern on the leaves of T when given E_k.

Phylogenetic tree of the entries of the 3 alignment columns
• A codon-based evolutionary model is used to calculate the emission probability of the columns of A.
• A nucleotide-based evolutionary model is used to calculate the emission probability of column B.
• The emission probability of C is obtained from the equilibrium distribution of the relevant evolutionary model.

PARAMETER ESTIMATION
• The parameters of the HMM are estimated by a combination of Baum-Welch and Powell.
• The evolutionary model E is divided into E_equ and E_evo.
• The initial state distribution B can be estimated by Baum-Welch, but it is generally set to 0.00001 for all states except the intergenic state.
The expectation step of Baum-Welch estimates:
• The expected number of nucleotides emitted from each state
• The expected number of state transitions
• The expected number of times a state is used
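The emission probability of an alignment column described above (the probability of a character pattern on the leaves of T given an evolutionary model) can be sketched with Felsenstein's pruning recursion. This is a minimal illustration under stated assumptions, not the codon or nucleotide models used by the authors: it swaps in the Jukes-Cantor model, for which exp(tQ) has a closed form, and the tree, branch lengths, and function names are hypothetical. Treating a gap as missing data is done here by giving it likelihood 1 for every character.

```python
import math

NUCS = "ACGT"


def jc69_transition(t):
    """P(t) = exp(tQ) in closed form for the Jukes-Cantor rate matrix Q
    (each off-diagonal rate is 1/3, so t is in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return {a: {b: (same if a == b else diff) for b in NUCS} for a in NUCS}


def partial_likelihoods(node):
    """Return {x: P(leaves below node | node carries character x)}.
    A leaf is a one-character string ("-" means missing data);
    an inner node is a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        if node == "-":
            return {x: 1.0 for x in NUCS}
        return {x: 1.0 if x == node else 0.0 for x in NUCS}
    result = {x: 1.0 for x in NUCS}
    for child, t in node:
        p = jc69_transition(t)
        below = partial_likelihoods(child)
        for x in NUCS:
            result[x] *= sum(p[x][y] * below[y] for y in NUCS)
    return result


def column_probability(tree):
    """Sum over root characters, weighted by the equilibrium distribution
    (uniform under Jukes-Cantor)."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[x] for x in NUCS)


# Hypothetical three-species column: ((A:0.1, A:0.1):0.2, G:0.3).
tree = [([("A", 0.1), ("A", 0.1)], 0.2), ("G", 0.3)]
print(column_probability(tree))
```

In an EHMM each state k would use its own model E_k in place of the single Jukes-Cantor model assumed here, so the same column gets a different emission probability in each state.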
• Powell's method, another optimization method, estimates E_evo and the phylogenetic tree T.
• The Baum-Welch method is used to estimate E_equ and A.

The likelihood of an alignment x given a parameterization θ of the EHMM can therefore be found by the equation
$P(x \mid \theta) = \sum_{\Pi} P(x, \Pi \mid \theta)$
Here we are summing over all possible paths Π. This can be done in linear time by dynamic programming.
The EHMM is fully probabilistic and can be used both to simulate data and to find genes.

EUKARYOTIC GENOME MODEL
• Can be used to generate alignments.
• The reduced model produces only inner exons.

RESULTS
• Benefits of modeling evolution with an EHMM, using a data set of orthologous mouse/human gene pairs.
• The benefit will depend on the divergence between the sequences compared.
• The key parameter for modeling the difference between exons and introns is the dN/dS ratio.
• The evolutionary model shows a distinct difference between the intergenic/intron state and the codon state.
• Evaluations were performed on both single and aligned sequences.

Graphical representation
• The simple model used at present is not comparable to state-of-the-art methods.
• Any number of aligned sequences can be handled.

EXTENSIONS OF THE MODEL
• GENSCAN can be extended into HMM
• Splice site finders
• Models of ribosome binding sites and promoter regions
• Non-geometric length distributions of exons
• Pseudo higher-order EHMMs can be constructed
• Extending the idea of pair HMMs to multiple sequences

DISADVANTAGES IN THE PRESENT MODEL
• The existing framework does not model gaps but treats them as missing data.
• The optimal data for an EHMM is a multiple alignment of full-length genomes.
• The challenge in constructing the alignment is to reduce the noise-to-signal ratio.

BUT ………..
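The claim that the likelihood sum over all paths can be computed in linear time by dynamic programming can be illustrated with the forward recursion. The sketch below is a toy under stated assumptions: the two states, the transition values, and `toy_emission` (which scores a column of two aligned nucleotides by whether they agree) are hypothetical stand-ins, where an EHMM would use the phylogenetic column probability. Only the initial value of 0.00001 for non-intergenic states follows the slides. A brute-force sum over paths is included purely as a check.

```python
from itertools import product

# Hypothetical two-state model over columns of two aligned nucleotides.
STATES = ("intergenic", "exon")
INITIAL = {"intergenic": 0.99999, "exon": 0.00001}
TRANSITION = {"intergenic": {"intergenic": 0.95, "exon": 0.05},
              "exon": {"intergenic": 0.10, "exon": 0.90}}
AGREE = {"intergenic": 0.6, "exon": 0.9}  # assumed chance that the two rows match


def toy_emission(state, column):
    """Stand-in for the EHMM column probability; sums to 1 over all 16 columns."""
    a, b = column
    s = AGREE[state]
    return 0.25 * (s if a == b else (1.0 - s) / 3.0)


def forward_likelihood(columns, states, initial, transition, emission):
    """Likelihood of the alignment, summed over all state paths, in time
    linear in the number of columns."""
    f = {k: initial[k] * emission(k, columns[0]) for k in states}
    for col in columns[1:]:
        f = {k: emission(k, col) * sum(f[j] * transition[j][k] for j in states)
             for k in states}
    return sum(f.values())


def brute_force_likelihood(columns, states, initial, transition, emission):
    """Same quantity by explicit enumeration of every path (exponential time)."""
    total = 0.0
    for path in product(states, repeat=len(columns)):
        p = initial[path[0]] * emission(path[0], columns[0])
        for prev, cur, col in zip(path, path[1:], columns[1:]):
            p *= transition[prev][cur] * emission(cur, col)
        total += p
    return total


alignment = [("A", "A"), ("C", "T"), ("G", "G"), ("T", "T")]
print(forward_likelihood(alignment, STATES, INITIAL, TRANSITION, toy_emission))
print(brute_force_likelihood(alignment, STATES, INITIAL, TRANSITION, toy_emission))
```

For long alignments the products underflow in floating point, so a practical version would rescale the forward values at each column or work in log space; that is omitted here for brevity.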