Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Long non-coding RNA wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Microevolution wikipedia , lookup
Designer baby wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Human genome wikipedia , lookup
Point mutation wikipedia , lookup
Gene desert wikipedia , lookup
Genome evolution wikipedia , lookup
Computational Gene Finding Dong Xu Computer Science Department 109 Engineering Building West E-mail: [email protected] 573-882-7064 http://digbio.missouri.edu Lecture Outline Protein-encoding genes and gene structures Computational models for coding regions Computational models for coding-region boundaries Markov chain model for coding regions What are in DNA? Genome: Blueprint of life DNA music: http://www.toddbarton.com/sequence1.asp What Is a Gene? Definition: A gene is the nucleotide sequence that stores the information which specifies the order of the monomers in a final functional polypeptide or RNA molecule, or set of closely related isoforms (Epp CD, Nature, 389: 537). Gene and Disease Environment Genes Monogenic Diseases Common Diseases Infections Cystic fibrosis Alzheimer disease Influenza Huntington’s disease Adult onset diabetes Hepatitis Haemophilia Cancer AIDS Phenylketonuria Cardiovascular disease Depression Genetic Code Reading Frame Reading (or translation) frame: each DNA segment has six possible reading frames Forward strand: ATGGCTTACGCTTGA Reading frame #1 Reading frame #2 Reading frame #3 ATG GCT TAC GCT TGC TGG CTT ACG CTT GA. GGC TTA CGC TTG A.. Reverse strand: TCAAGCGTAAGCCAT Reading frame #4 Reading frame #5 Reading frame #6 TCA AGC GTA AGC CAT CAA GCG TAA GCC AT. AAG CGT AAG CCA T.. Prokaryotic Gene Structure 5’ 3’ Coding region of Open Reading Frame Promoter region (maybe) Ribosome binding site (maybe) Termination sequence (maybe) Start codon / Stop Codon Open reading frame (ORF): a segment of DNA with two in-frame stop codons at the two ends and no in-frame stop codon in the middle Eukaryotic Gene Structure Gene Structure Rules Each coding region (exon) has a fixed translation frame (no gaps allowed) All exons of a gene are on the same strand Neighboring exons of a gene can have different reading frames frame 1 frame 3 frame 2 Computational Gene Finding The Problem: Given a stretch of DNA sequence, find all coding regions and construct gene structures from identified exons if needed A gene finding problem can be decomposed into two problems: identification of coding potential of a region in a particular frame identification of boundaries between coding and non-coding regions Repetitive Sequence Definition DNA sequences that made up of copies of the same or nearly the same nucleotide sequence Present in many copies per chromosome set Repeat Filtering RepeatMasker Uses precompiled representative sequence libraries to find homologous copies of known repeat families Use Blast http://www.repeatmasker.org/ Gene Finding Tools Genscan (http://genes.mit.edu/GENSCAN.html ) GeneMarkHMM (http://opal.biology.gatech.edu/GeneMark/) GRAIL (http://compbio.ornl.gov/Grail-1.3/) Genie (http://www.fruitfly.org/seq_tools/genie.html) Glimmer (http://www.tigr.org/softlab/glimmer) Testing Finding Tools Access Genscan (http://genes.mit.edu/GENSCAN.html ) Use a sequence at http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=8077108 Lecture Outline Protein-encoding genes and gene structures Computational models for coding regions Computational models for coding-region boundaries Markov chain model for coding regions Coding Signal Detection (1) Frequency distribution of dimers in protein sequence (shewanella) The average frequency is 5% Some amino acids prefer to be next to each other Some other amino acids prefer to be not next to each other Coding Signal Detection (2) Dimer bias (or preference) could imply di-codon (6-mers like AAA TTT) bias in coding versus non-coding regions Relative frequencies of a di-codon in coding versus non-coding frequency of dicodon X (e.g, AAAAAA) in coding region, total number of occurrences of X divided by total number of dicocon occurrences frequency of dicodon X (e.g, AAAAAA) in noncoding region, total number of occurrences of X divided by total number of dicodon occurrences In human genome, frequency of dicodon “AAA AAA” is ~1% in coding region versus ~5% in non-coding region Question: if you see a region with many “AAA AAA”, would you guess it is a coding or non-coding region? Coding Signal Detection (3) Most dicodons show bias towards either coding or noncoding regions; only fraction of dicodons is neutral Foundation for coding region identification Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise non-coding regions Dicodon frequencies are key signal used for coding region detection; all gene finding programs use this information Coding Signal Detection (4) Dicodon frequencies in coding versus non-coding are genome-dependent bovine shewanella Coding Signal Detection (5) in-frame versus any-frame dicodons not in-frame dicodons ATG TTG GAT GCC CAG AAG............ in-frame dicodons In-frame: ATG TTG GAT GCC CAG AAG more sensitive Not in-frame: TGTTGG, ATGCCC AGAAG ., GTTGGA AGCCCA, AGAAG .. any-frame Computational Model (1) Preference model: for each dicodon X (e.g., AAA AAA), calculate its frequencies in coding and non-coding regions, FC(X), FN(X) calculate X’s preference value P(X) = log (FC(X)/FN(X)) Properties: P(X) is 0 if X has the same frequencies in coding and non-coding regions P(X) has positive score if X has higher frequency in coding than in non-coding region; the larger the difference the more positive the score is P(X) has negative score if X has higher frequency in non-coding than in coding region; the larger the difference the more negative the score is Computational Model (2) Example AAA ATT, AAA GAC, AAA TAG have the following frequencies FC(AAA ATT) = 1.4%, FN(AAA ATT) = 5.2% FC(AAA GAC) = 1.9%, FN(AAA GAC) = 4.8% FC(AAA TAG) = 0.0%, FN(AAA TAG) = 6.3% We have P(AAA ATT) = log (1.4/5.2) = -0.57 P(AAA GAC) = log (1.9/4.8) = -0.40 P(AAA TAG) = - infinity (treating STOP codons differently) A region consisting of only these dicodons is probably a non-coding region Coding preference of a region (an any-frame model) Calculate the preference scores of all dicodons of the region and sum them up; If the total score is positive, predict the region to be a coding region; otherwise a non-coding region. Computational Model (3) In-frame preference model (most commonly used in prediction programs) Data collection step: For each known coding region, calculate in-frame preference score, P0(X), of each dicodon X; ATG TGC CGC GCT ...... e.g., calculate (in-frame + 1) preference score, P1(X), of each dicodon X; ATG TGC CGC GCT ...... e.g., calculate (in-frame + 2) preference score, P2(X), of each dicodon X; ATG TGC CGC GCT ...... e.g., Application step: For each possible reading frame of a region, calculate the total in-frame preference score P0(X), the total (in-frame + 1) preference score P1(X), the total (in-frame + 2) preference score P2(X), and sum them up If the score is positive, predict it to be a coding region; otherwise non-coding Computational Model (4) Prediction procedure of coding region Procedure: Calculate all ORFs of a DNA segment; For each ORF, do the following slide through the ORF with an increment of 10 base-pairs calculate the preference score, in same frame of ORF, within a window of 60 base-pairs; and assign the score to the center of the window Example (forward strand in one particular frame) +5 0 -5 preference scores Computational Model (5) Making the call: coding or non-coding and where the boundaries are coding region? where to draw the boundaries? Need a training set with known coding and non-coding regions select threshold(s) to include as many known coding regions as possible, and in the same time to exclude as many known non-coding regions as possible If threshold = 0.2, we will include 90% of coding regions and also 10% of non-coding regions If threshold = 0.4, we will include 70% of coding regions and also 6% of non-coding regions If threshold = 0.5, we will include 60% of coding regions and also 2% of non-coding regions where to draw the line? Computational Model (6) Why dicodon (6mer)? Codon (3mer) -based models are not nearly as information rich as dicodon-based models Tricodon (9mers)-based models need too many data points for it to be practical People have used 7-mer or 8-mer based models; they could provide better prediction methods 6-mer based models There are 4*4*4 = 64 codons 4*4*4*4*4*4 = 4,096 di-codons 4*4*4*4*4*4*4*4*4= 262,144 tricodons To make our statistics reliable, we would need at least ~15 occurrences of each X-mer; so for tricodon-based models, we need at least 15*262144 = 3932160 coding bases in our training data, which is probably not going to be available for most of the genomes Lecture Outline Protein-encoding genes and gene structures Computational models for coding regions Computational models for coding-region boundaries Markov chain model for coding regions Signals for Coding-Region Boundaries (1) Possible boundaries of an exon { translation start, acceptor site } { translation stop, donor site } EXON | Splice junctions: Translation start in-frame ATG INTRON | EXON | | \_/ \_/ A G G T A......C A G 64 73 100 100 62 65 100 100 (Percent occurrence) donor site: coding region | GT acceptor: AG | coding region Signals for Coding-Region Boundaries (2) Both splice junction sites and translation starts have certain distribution profiles Acceptor site (human genome) if we align all known acceptor sites (with their splice junction site aligned), we have the following nucleotide distribution Donor site (human genome) Model for Splice Sites (1) Information content for a weight matrix, the information content of each column is calculated as -F(A)*log (F(A)/.25) - F(C)*log (F(C)/.25) - F(G)*log (F(G)/.25) - F(T) *log (F(T)/.25) when a column has evenly distributed nucleotides, the information content is lowest column -3: -0.34*log (.34/.25) - 0.363*log (.363/.25) -0.183* log (.183/.25) 0.114* log (.114/.25) = 0.04 column -1: -0.092*log (.092/.25) - 0.03*log (.033/.25) -0.803* log (.803/.25) 0.073* log (.073/.25) = 0.30 Only need to consider positions with “high” information content Model for Splice Sites (2) Weight matrix model build a weight matrix for donor, acceptor, translation start site, respectively using positions with high information Application of weight matrix model add up frequencies of corresponding letter in corresponding positions AAGGTAAGT: 0.34+0.60+0.80+1.0+1.0+0.52+0.71+0.81+0.46 = 6.24 TGTGTCTCA: 0.11+0.12+0.03+1.0+1.0+0.02+0.07+0.05+0.16 = 2.56 Lecture Outline Protein-encoding genes and gene structures Computational models for coding regions Computational models for coding-region boundaries Markov chain model for coding regions Why Markov Chain? Preference model cannot capture all the dependence relationship among adjacent dicodons Markov chain model has been a popular model for modeling dependence in a linear sequence (a chain of events) Basic assumption of the model for a chain of events: The “occurrence” of each event depends only on the most recent events right before this event Example: the weather of today is a function of only the weather of past seven days (i.e., it is independent of the weather of eight days ago) Markov Chain Model (1) Basics of probabilities: P(A) represents the probability of A being true P(A, B) represents the event of having both A and B being true if A and B are independent, P(A, B) = P(A) * P(B) P(A | B), conditional probability, of A being true under the condition B is true (this applies only when B is true) Zero-th order Markov chain is equivalent to “all events are independent” First order Markov chain: the occurrence of an event depends only on the event right before it P(A1 A2 A3 A4 A5 A6) = P(A1) P(A2 | A1) P(A3 | A2) P(A4 | A3) P(A5 | A4) P(A6 | A5) Markov Chain Model (2) K-th order Markov chain model: Example of 5th order Markov chain: P(A1 A2 A3 A4 A5 A6 A7 A8 A9A10A11) = P(A1 A2 A3 A4 A5) * P(A6 | A1 A2 A3 A4 A5 ) * P(A7 | A2 A3 A4 A5 A6) * P(A8 | A3 A4 A5 A6 A7) * P(A9 | A4 A5 A6 A7 A8) * P(A10 | A5 A6 A7 A8A9) * P(A11 | A6 A7 A8 A9 A10) Markov chain model allows us to “decompose” a large problem into a collection of smaller problems Markov Chain Model (3) Definition of conditional probability B P( A | B ) = P (A, B) / P(B) Decomposition rule P( C) = P(C | A) P(A) + P(C | B) P(B) as long A and B do not overlap and A plus B completely covers C B C A A Markov Chain Model for Coding Region (1) Any-frame Markov chain model Bayesian formula for coding: P (coding | A1 …. An ) = P (coding, A1 …. An ) / P(A1 …. An ) P (A1 …. An | coding) * P (coding) = ---------------------------------------------------------------------------------------------P (A1 …. An | coding) P(coding) + P (A1 …. An | noncoding) P(noncoding) Bayesian formula for non-coding: P (non-coding | A1 …. An ) P (A1 …. An | noncoding) * P (noncoding) = ---------------------------------------------------------------------------------------------P (A1 …. An | coding) P(coding) + P (A1 …. An | noncoding) P(noncoding) Markov Chain Model for Coding Region (2) This formula decomposes a problem of “predicting a region A1 …. An being a (non) coding region” to the following four problems 1. 2. 3. 4. Estimating probability of seeing A1 …. An in noncoding regions Estimating probability of coding bases in a whole genome Estimating probability of noncoding bases in a whole genome Estimating probability of seeing A1 …. An in coding regions All these can be estimated using known coding and noncoding sequence data Markov Chain Model for Coding Region (3) Any-frame Markov chain model Markov chain model (5th order) : a priori probability conditional probability P (A1 …. An | coding) = P (A1 A2 A3 A4 A5 | coding) * P (A6 | A1 A2 A3 A4 A5 , coding) * P (A7 | A2 A3 A4 A5 A6 , coding) * …. * P (An | An-5 An-4 An-3 An-2 An-1 , coding) Markov chain model (5th order) : P (A1 …. An | noncoding) = P (A1 A2 A3 A4 A5 | noncoding) * P (A6 | A1 A2 A3 A4 A5 , noncoding) * P (A7 | A2 A3 A4 A5 A6 , noncoding) * …. * P (An | An-5 An-4 An-3 An-2 An-1 , noncoding) P(coding): total # coding bases/total # all bases P(noncoding): total # noncoding bases/total # all bases Build Markov Tables (1) a priori probability tables (5th order): P(5mer |coding) and P(5mer | noncoding) 5mer frequency table for coding regions 5mer frequency table for noncoding regions Conditional probability tables (5th order): P(X | 5mer, coding) and P(X | 5mer, noncoding) (X could be A, C, G, T) For a fixed 5mer (e.g., ATT GT), what is the probability to have A, C, G or T following it in coding region For a fixed 5mer (e.g., ATT GT), what is the probability to have A, C, G or T following it in noncoding region P(coding) = ~0.02 and P(noncoding) = ~0.98 Build Markov Tables (2) a priori probabilities for coding PAC a priori probabilities for noncoding PAN AAA AA: 0.000012 AAA AA: 0.000329 AAA AC: 0.000001 AAA AC: 0.000201 AAA AG: 0.000101 AAA AG: 0.000982 ……. ……. conditional probabilities for coding PC A C conditional probabilities for noncoding PN A C G T G T AAA AA: 0.17 0.39 0.01 0.43 AAA AA: 0.71 0.09 0.00 0.20 AAA AC: 0.12 0.44 0.02 0.42 AAA AC: 0.61 0.19 0.02 0.18 AAA AG: 0.01 0.69 0.10 0.20 AAA AG: 0.01 0.69 0.10 0.20 ……. ……. In-Frame Markov Chain Model (1) In-frame Markov tables a priori probabilities for coding PAC00, 1, 2 AAA AA: 0.000012 0.000230 0.000009 translation frame AAA AC: 0.000001 0.000182 0.000011 AAA AG: 0.000101 0.000301 0.000101 ……. A A A A C A A A A C A A A A C conditional probabilities for coding PC0 conditional probabilities for coding PC1 conditional probabilities for coding PC2 AAA AA: 0.17 0.39 0.01 0.43 AAA AA: 0.33 0.12 0.10 0.35 AAA AA: 0.17 0.39 0.01 0.43 AAA AC: 0.12 0.44 0.02 0.42 AAA AC: 0.02 0.49 0.12 0.37 AAA AC: 0.12 0.44 0.02 0.42 AAA AG: 0.01 0.69 0.10 0.20 AAA AG: 0.10 0.60 0.15 0.15 AAA AG: 0.01 0.69 0.10 0.20 ……. ……. ……. In-Frame Markov Chain Model (2) In-frame Markov chain (5th order) calculation P0 (A1 …. An | coding) = P0 (A1 A2 A3 A4 A5 | coding) * P0 (A6 | A1 A2 A3 A4 A5 , coding) * P1 (A7 | A2 A3 A4 A5 A6 , coding) * P2 (A8 | A3 A4 A5 A6 A7 , coding) * ........ 0 1 2 0 1 2 0 1 2 A A A A C T G C ....... P (A1 …. An | noncoding) = P (A1 A2 A3 A4 A5 | noncoding) * P (A6 | A1 A2 A3 A4 A5 , noncoding) * P (A7 | A2 A3 A4 A5 A6 , noncoding) * …. * P (An | An-5 An-4 An-3 An-2 An-1 , noncoding) Calculation for non-coding regions stays the same In-Frame Markov Chain Model (3) Markov tables for Human genome in-frame conditional probabilities a priori table In-Frame Markov Chain Model (4) Coding score procedure for a DNA segment [i, j], calculate Markov coding scores scoreC[0], scoreC[1], scoreC[2], representing three frames (one strand), and noncoding score scoreN if MAX { scoreC[0], scoreC[1], scoreC[2] } > scoreN, the region is predicted to coding; otherwise non-coding first base i calculated reading frame in reference to the starting point of the first base Application of Markov Chain Model Prediction procedure Procedure: Calculate all ORFs of a DNA segment; For each ORF, do the following slide through the ORF with an increment of 10 base-pairs calculate the preference score, in same frame of ORF, within a window of 60 base-pairs; and assign the score to the center of the window A computing issue Multiplication of many small numbers (probabilities) is generally problematic in computer Converting a * b * c * d ..... *z to log (a) + log (b) + log (c) + log (d) ... log (z) Reading Assignments Suggested reading: Chapter 9 in “Current Topics in Computational Molecular Biology, edited by Tao Jiang, Ying Xu, and Michael Zhang. MIT Press. 2002.” Optional reading: Chapter 9 in “Pavel Pevzner: Computational Molecular Biology - An Algorithmic Approach. MIT Press, 2000.” Project Assignment (1) The following are three possible translational frames: a b c AGCTTTCTTCTTTTCCCTGTTGCTCAAATCAACAGTGTTCTTTGCTCAAACCCCCTTTCC S F L L F P V A Q I N S V L C S N P L S A F F F S L L L K S T V F F A Q T P F P L S S F P C C S N Q Q C S L L K P P F P Write a small computer program to predict which translational frame is most probable based on the frequency table of amino acid pairs in actual proteins in the following page. Assuming the frequency for any pair in the non-coding regions is 5%. Project Assignment (2) Di-codon frequencies in coding regions: