Statistical modeling and classification in Biological Sequence Space
April 26, 04; 9.520
Gene Yeo
Poggio, Burge @MIT

Framework/Issues
• “Build” models around known biology
  – In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
  – prediction accuracy
  – experimental validation
  – higher-level traits of predictions
  – conservation in other genomes

Biological sequences
• DNA, RNA and proteins: macromolecules built up from smaller units.
  – DNA: units are the nucleotide residues A, C, G and T
  – RNA: units are the nucleotide residues A, C, G and U
  – Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
• To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

• Statistical models can be descriptive and/or predictive.
• Given known biological signal -> describe the signal with statistical modeling & find unknown examples of the same signal
  – Gene-finding (protein-coding genes)
  – Noncoding RNA genes
  – Protein domains
• Warning: although successful, models are not to be taken literally.
• Most important: biological confirmation of predictions is almost always necessary.

Different models
[Figure: models arranged by complexity across DNA, RNA and protein]
• Splice site motif (WMM, MM, SVM, NN)
• Protein gene (HMM, NN)
• RNA gene (Covariation, SCFG, NN, SVM)
• Protein structure (a variety of methods)

A case study in computational biology: modeling signals in genes
With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.

What is a (protein-coding) gene?
DNA      CCTGAGCCAACTATTGATGAA
   (transcription)
mRNA     CCUGAGCCAACUAUUGAUGAA
   (translation)
Protein  PEPTIDE

Some facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% of genes have a single exon

The idea behind an HMM genefinder
• States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5’UTR, 3’UTR, Poly-A, ...).
• Observations embody state-dependent statistics, such as base composition, dependence, and signal features.

GENSCAN (Burge & Karlin)
[Figure: GENSCAN state diagram. States shown for the forward (+) strand and mirrored for the reverse (-) strand: exons E0, E1, E2, Ei, Et, Es; introns I0, I1, I2; promoter; 5'UTR; 3'UTR; poly-A; intergenic region.]

62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC
62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC
62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA
62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC
62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG
62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC
62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC
62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC
62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC
62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA
62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC
62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA
62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT
62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG
62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC
62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA
62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC
62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG
62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT
62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC
63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT
63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT
63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC
63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC
63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT
63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT
63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT
63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG
63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT
63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG
63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG
63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA
63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT
63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA
63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT

Splice sites can be an important signal

Regular expressions can be limiting
[Figure: consensus patterns. 5’ splice junction in eukaryotes: (C/A)AG|GT(A/G)AGT. 3’ splice junction: a run of ≥11 pyrimidines (T/C), then N, then (T/C)AG at the end of the intron.]
Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.
Position-specific distributions came to represent the variability in motif composition.

Position-specific scoring matrix (PSSM)

Pos   -3   -2   -1   +1   +2   +3   +4   +5   +6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5

Sequence: S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9
Odds ratio:
\[ R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\, P_{-2}(S_2)\, P_{-1}(S_3) \cdots P_{+5}(S_8)\, P_{+6}(S_9)}{P_{bg}(S_1)\, P_{bg}(S_2)\, P_{bg}(S_3) \cdots P_{bg}(S_8)\, P_{bg}(S_9)} \]
Score:
\[ s = \log_2 R \]

OK, so we have the genes
• molecular biology (transcription, splicing)
• signals are modeled as states (HMM) or separately, i.e. PSSMs
• Here’s another catch: there isn’t just one version of each gene, but sometimes several.

E.g.
alternative splicing - CD44
Human chromosome 11p… Zhu et al Science… (2003)

Alternative splicing
• is a major determinant of protein diversity (Lander 2001, Zavolan 2003)
• 30-50% of human diseases involve alternative splicing

Defining constitutive and alternative exons
[Figure: exon types]
• Constitutive exon
• Skipped exon
• 3’ alternative exon
• 5’ alternative exon
• Intron retention
• Mutually exclusive exons

Fragile X Related Gene, FXR1
Conserved alternative, skipped exon - FXR1

Myotonic Dystrophy-containing WD Repeat, DMWD
Another example of genes containing CSE: DMWD

Predicting new alternatively spliced exons
1. The problem is ‘ill-posed’
2. High-dimensional space
3. Do not overfit the data
4. Simple feature selection
5. Unbalanced data set sizes
6. Labels are more “flexible”

E.g. of experimentally validated

Biological sequence space: challenges
• Models that “represent” as much of the biology as possible.
• Biologically motivated features are important
• Validating attributes:
  – Conservation of events is key in computational biology
  – Higher-level consistency with known biology
• Experimental validation of predictions is essential

Framework/Issues
• “Build” models around known biology
  – In the process, extend knowledge about known biology
• “Predict” new examples
• “Validate” predictions by
  – prediction accuracy
  – experimental validation
  – higher-level traits of predictions
  – conservation in other genomes

If time permits

Modeling higher order interactions: Yeast Phe tRNA
[Figure: secondary structure and tertiary structure]

The Hammerhead Ribozyme
[Figure: secondary structure and tertiary structure]

One example of how to model and predict RNA 2° structure
• Covariation (using comparative genomics)

Seq1: A C G A A A G U
Seq2: U A G U A A U A
Seq3: A G G U G A C U
Seq4: C G G C A A U G
Seq5: G U G G G A A C

Method of Covariation / Compensatory changes
Mutual information statistic for a pair of columns in a multiple alignment:
\[ M_{ij} = \sum_{x,y} f^{(i,j)}_{x,y} \log_2 \frac{f^{(i,j)}_{x,y}}{f^{(i)}_{x}\, f^{(j)}_{y}} \]
• f^{(i,j)}_{x,y} = fraction of seqs w/ nt. x in col. i, nt. y in col. j
• f^{(i)}_{x} = fraction of seqs w/ nt. x in col. i
• sum over x, y = A, C, G, U
M_{ij} is maximal (2 bits) if x and y individually appear at random (A, C, G, U equally likely), but are perfectly correlated (e.g., always complementary)

Inferring 2° Structure from Covariation

Stochastic Context-Free Grammars (SCFGs)
• A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA)
Ref: Durbin et al. “Biological Sequence Analysis” 1998

An SCFG Model of RNA 2° Structure
“Production Rules”:
• P → aWb (“pair”)
• L → aW (“left bulge/loop”)
• R → Wa (“right bulge/loop”)
• B → SS (“bifurcation”)
• S → W (“start”)
• E → ε (“end”)

Last page
• Some of the slides were obtained from various places:
  – slides available online on the web (primarily from lectures by Terry Speed).
  – slides from Chris Burge, Dirk Holste
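
As a worked illustration of the mutual information statistic M_ij from the covariation slides, the sketch below computes it for pairs of columns in the five-sequence toy alignment (Seq1 to Seq5). It is a minimal example under stated assumptions, not the method used in any particular tool: the function names `mutual_information` and `rank_column_pairs` are hypothetical, columns are indexed from 0, and frequencies are plain observed fractions with no pseudocounts or gap handling.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Toy alignment copied from the covariation slide (one RNA sequence per row).
ALIGNMENT = [
    "ACGAAAGU",  # Seq1
    "UAGUAAUA",  # Seq2
    "AGGUGACU",  # Seq3
    "CGGCAAUG",  # Seq4
    "GUGGGAAC",  # Seq5
]


def mutual_information(alignment, i, j):
    """Return M_ij in bits for 0-based columns i and j (hypothetical helper)."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i = Counter(col_i)               # n * f_x^(i)
    count_j = Counter(col_j)               # n * f_y^(j)
    count_ij = Counter(zip(col_i, col_j))  # n * f_xy^(i,j)
    mi = 0.0
    # Only observed (x, y) pairs contribute; unobserved pairs have f_xy = 0.
    for (x, y), c in count_ij.items():
        f_xy = c / n
        # f_xy / (f_x * f_y) rewritten with counts: c * n / (count_x * count_y)
        mi += f_xy * log2(c * n / (count_i[x] * count_j[y]))
    return mi


def rank_column_pairs(alignment):
    """Return (M_ij, i, j) for every column pair, highest M_ij first."""
    width = len(alignment[0])
    scores = [
        (mutual_information(alignment, i, j), i, j)
        for i, j in combinations(range(width), 2)
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    for mi, i, j in rank_column_pairs(ALIGNMENT)[:3]:
        print(f"columns {i + 1} and {j + 1}: M = {mi:.3f} bits")
```

In this toy alignment the first and last columns are always complementary, so they receive the highest score, while a fully conserved column (such as the third, all G) scores zero against every other column because it carries no variation. With only five sequences, several unpaired column pairs also score fairly high, which fits the earlier warning that models are not to be taken literally.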