* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Biological sequence analysis
Vectors in gene therapy wikipedia , lookup
Epitranscriptome wikipedia , lookup
Transposable element wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Gene regulatory network wikipedia , lookup
Biosynthesis wikipedia , lookup
Transcriptional regulation wikipedia , lookup
Multilocus sequence typing wikipedia , lookup
Molecular ecology wikipedia , lookup
Bisulfite sequencing wikipedia , lookup
Non-coding DNA wikipedia , lookup
Zinc finger nuclease wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Genetic code wikipedia , lookup
Gene expression wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Promoter (genetics) wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Point mutation wikipedia , lookup
Homology modeling wikipedia , lookup
Stat 246, Lecture 3 Biological sequence analysis The objects of our study DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units. DNA: units are the nucleotide residues A, C, G and T RNA: units are the nucleotide residues A, C, G and U Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure. The use of statistics to study linear sequences of biomolecular units Can be descriptive, predictive or everything else in between…..almost business as usual. Stochastic mechanisms should never be taken literally, but nevertheless can be amazingly useful. Care is always needed: a model or method can break down at any time without notice. Biological confirmation of predictions is almost always necessary. The statistics of biological sequences can be global or local Base composition of genomes: E. coli: 25% A, 25% C, 25% G, 25% T P. falciparum: 82%A+T Translation initiation: ATG is the near universal motif indicating the start of translation in DNA coding sequence. From certainty to statistical models: a brief case study 1 ZNF: Cys-Cys-His-His zinc finger DNA binding domain Cys-Cys-His-His zinc finger DNA binding domain Its characteristic motif has regular expression C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX . http://www.isrec.isb-sib.ch/software/PATFND_form.html c.{2,4}c...[livmfywc]........h.{3,5}h PatternFind output [ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001 ... gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein NP94b [Homo sapiens] Occurences: 2 Position : 514 CYICKASCSSQQEFQDHMSEPQH Position : 606 CTVCNRYFKTPRKFVEHVKSQGH ........ gp|X67787|1326037|02AF953C84E0AB5A zinc finger protein [Saccharomyces cerevisiae] Occurences: 1 Position : 3 CSFDGCEKVYNRPSLLQQHQNSH 200 matches found: output limit reached This search could have been conducted using a suffix tree representation. Regular expressions can be limiting C AGGT A AGT A G TC ≥11 N TC AGC 5’ splice junction in eukaryotes 3’ splice junction Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites. Position-specific distributions came to represent the variability in motif composition. Cys-Cys-His-His profile: sequence logo form A sequence logo is a scaled position-specific a.a.distribution. Scaling is by a measure of a position’s information content. Sequence logos (T.D. Schneider) A visual representation of a position-specific distribution. Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions. Idea: overall height at position l proportional to information content (2-Hl); proportions of each nucleotide ( or amino acid) are in relation to their observed frequency at that position, with most frequent on top, next most frequent below, etc.. How do we search with position-specific distributions? Position-specific scoring matrices T A T A A T A 0 1 0 1 1 0 A -38 C 0 0 0 0 0 0 C G 0 0 0 0 0 0 T 1 0 1 0 0 1 consensus T A 19 T A T 12 10 -48 -15 -38 -8 -10 -3 -32 G -13 -48 -6 T 17 -32 8 PSSM 1 A -7 -10 -48 -9 -6 19 Use of a PSSM to find sites C Move the matrix along the sequence and score each “window”. T A 12 10 -48 C -15 -38 -8 -10 -3 -32 G -13 -48 -6 17 -32 8 -9 -6 19 A -38 19 1 12 10 -48 C -15 -38 -8 -10 -3 -32 G -13 -48 -6 17 -32 8 -9 -6 19 A -38 19 1 12 10 -48 C -15 -38 -8 -10 -3 -32 G -13 -48 -6 17 -32 8 A -38 T Peaks should occur at the “true” sites. Of course in general any threshold will have some false positive and false negative rate. T T 19 T A 1 A T C sum -93 -7 -10 -48 +85 -7 -10 -48 -7 -10 -48 -9 -6 19 -95 Calculation of a PSSM from counts A 9 214 63 142 118 8 A 0.04 0.88 0.26 0.59 0.49 0.03 C 22 7 26 31 52 13 C 0.09 0.03 0.11 0.13 0.22 0.05 G 18 2 29 38 29 5 G 0.07 0.01 0.12 0.16 0.12 0.02 T 193 19 124 31 43 216 T 0.80 0.08 0.51 0.13 0.18 0.89 Relative frequencies: fbl Counts from 242 known sites 2 A -2.76 1.82 0.06 1.23 0.96 -2.92 C -1.46 -3.11 -1.22 -1.00 -0.22 -2.21 1 G -1.76 -5.00 -1.06 -0.67 -1.06 -3.58 T 1.67 -1.66 1.04 -1.00 -0.49 1.84 0 1 PSSM: log fbl/pb 2 3 4 5 6 Informativeness: 2+∑bpbllog2pbl Derivation of PSSM entries Suppose that we have aligned sequence data on a number of instances of a given type of site. Candidate sequence CTATAATC.... Aligned position 123456 Hypotheses: S=site (and independence) R=random (equiprobable, independence) pr(CTATAA | S) = log2 .09x.03x.26x.13x.51x.01 pr(CTATAA | R) .25x.25x.25x.25x.25x.25 log2 = (2+log2.09)+...+(2+log2.01) = 1 -15- 32 + 1 - 9 + 10 - 48 10 More generally, PSSM score sbl = log fbl/pb l=position, b=base pb=background frequency Representation of motifs: further steps Missing from the position-specific distribution representation of motifs are good ways of dealing with: Length distributions for insertions/deletions Cross-position association of amino acids Hidden Markov models help with the first. The second remains a hard unsolved problem. Hidden Markov models Processes {(St,Ot), t=1,…}, where St is the hidden state and Ot the observation at time t, such that pr(St | St-1,Ot-1,St-2 ,Ot-2 …) = pr(St | St-1) pr(Ot | St-1,Ot-1,St-2 ,Ot-2 …) = pr(Ot | St, St-1) The basics of HMMs were laid bare in a series of beautiful papers by L E Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day. Hidden Markov models:extensions Many variants are now used. For example, the distribution of O may not depend on previous S but on previous O values, pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St ), or pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St , St-1 ,Ot-1) . Most importantly for us, the times of S and O may be decoupled, permitting the Observation corresponding to State time t to be a string whose length and composition depends on St (and possibly St-1 and part or all of the previous Observations). This is called a hidden semi-Markov or generalized hidden Markov model. A simple HMM bA (i) 1 / 6 PAB PBB A PAA Initial distribution: ( A, B) PBA bB (i) 1/ 4 A lattice view Observed sequence: 1 4 3 6 6 4 A A A B A B Hidden sequence: B B A Observed: 1,4,3,6,6,4... Questions: 1. What is the most likely die sequence? 2. What is the probability of the observed sequence? 3. What is the probability that the 3rd state is B, given the observed sequence? The HMM algorithms Forward: t (i) = P(observed sequence, ending in state i at base t) Backward: ß t (i) = P(obs. after t | ending in state i at base t) Viterbi: t (i) = max P(obs. , ending in state i at base t) Questions: 1. What is the most likely die sequence? Viterbi 2. What is the probability of the observed sequence? Forward 3. What is the probability that the 3rd state is B, given the observed sequence? Backward The forward algorithm t (i) = P(observed sequence, ending in state i at base t) N t (j) ( t 1(i)pij )b j (Yt ) i1 N N sequence) (j) (i) = P(observed (p )b(Y (i) p )b (Y ) t (j) ( Tt (i) ) ij j t 1 ij j t t 1 i1 i1 The forward algorithm State N ... ... ... 1 2 3 3 2 1 ... ... ... ... Observation T Some early applications of HMMs finance, but we never saw them speech recognition modelling ion channels In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched. Some current applications of HMMs to biology mapping chromosomes aligning biological sequences predicting sequence structure inferring evolutionary relationships finding genes in DNA sequence Profile HMM: m=match state, I-insert state, d=delete state; go from left to right. I and m states output amino acids; d states are ‘silent”. d1 d2 d3 d4 I0 I1 I2 I3 I4 m0 m1 m2 m3 m4 Start m5 End Pfam domain-HMMs Pfam is a library of models of recurrent protein “domains”. They are constructed semi-automatically using hidden Markov models (HMMs). Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated. See http://www.sanger.ac.uk/Software/Pfam/ http://www.cgr.ki.se/Pfam/ http://pfam.wustl.edu/ http://pfam.jouy.inra.fr/ Finding genes in DNA sequence This is one of the most challenging and interesting problems in computational biology at the moment. With so many genomes being sequenced so rapidly, it remains important to begin by identifying genes computationally. What is a (protein-coding) gene? DNA CCTGAGCCAACTATTGATGAA transcription mRNA CCUGAGCCAACUAUUGAUGAA translation Protein PEPTIDE What is a gene, ctd? In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequence) are removed, leaving exons (expressed sequence), and yet other regions remain untranslated. The translated sequence comes in triples called codons, beginning and ending with a unique start (ATG) and one of three stop (TAA, TAG, TGA) codons. There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcription start sites, polyA sites,branching sites, and so on. All of the foregoing have statistical characterizations. In more detail (color ~state) Some facts about human genes Comprise about 3% of the genome Average gene length: ~ 8,000 bp Average of 5-6 exons/gene Average exon length: ~200 bp Average intron length: ~2,000 bp ~8% genes have a single exon Some exons can be as small as 1 or 3 bp. HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising 3% of a 67,000 bp gene The idea behind a GHMM genefinder States represent standard gene features: intergenic region, exon, intron, perhaps more (promotor, 5’UTR, 3’UTR, Poly-A,..). Observations embody state-dependent base composition, dependence, and signal features. In a GHMM, duration must be included as well. Finally, reading frames and both strands must be dealt with. Half a model for a genefinder Splice sites can be included in the exons Beyond position-specific distributions The bases in splice sites exhibit dependence, and not simply of the nearest neighbor kind. High-order (non-stationary) Markov models would be one option, but the number of parameters in relation to the amount of data rules them out. The class of variable length Markov models (VLMMs) deriving from early research by Rissanen prove to be valuable in this context. However, there is likely to be room for more research here. GENSCAN (Burge & Karlin) E0 I0 E1 I1 Ei Et 3'UTR poly-A promoter Reverse (-) strand I2 Es 5'UTR Forward (+) strand E2 intergenic region Forward (+) strand Reverse (-) strand 62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC 62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC 62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA 62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC 62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG 62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC 62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC 62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC 62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC 62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA 62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC 62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA 62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT 62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG 62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC 62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA 62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC 62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG 62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT 62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC 63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT 63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT 63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC 63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC 63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT 63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT 63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT 63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG 63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT 63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG 63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG 63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA 63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT 63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA 63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT Remark In general the problem of identifying (annotating) human genes is considerably harder than ß-globin might suggest. The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron. The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp. Challenges in the analysis of sequence data Understanding the biology well enough to begin. Designing HMM architecture, e.g. in Marcoil for coiled-coils. Modelling the parts, e.g. VLMMs for splice sites. Coding = software engineering, is the hardest and most important task of all: making it all work. Obtaining good data sets for use in careful evaluation and comparison with competing algorithms; designing the studies. Opportunities for methodological research. Topics not mentioned include Molecular evolution, including phylogenetic inference (building trees from aligned sequence data) Sequence alignment (pairwise, multiple), including use of Gibbs sampler Stochastic context-free grammar models and the analysis of RNA sequence data. Acknowledgements Mauro Delorenzi (WEHI) Simon Cawley (Affymetrix) Tony Wirth (CS, Princeton) Lior Pachter (Math, UCB) Marina Alexandersson (Stat, UCB) References Biological Sequence Analysis R Durbin, S Eddy, A Krogh and G Mitchison Cambridge University Press, 1998. Bioinformatics The machine learning approach P Baldi and S Brunak The MIT Press, 1998 Post-Genome Informatics M Kanehisa Oxford University Press, 2000 HMMs representing coiled-coil domains 2TMA: Tropomyosin Coiled-coil domains, schematically dimeric parallel helices, heptad repeats, knobs-into-holes (hydrophobic) dislikes water Designing the HMM, I Designing the HMM, 2 Designing the HMM, 3 HMM: decoding Sequence Labels Path1 Path2 WGP ARQLNES VKDINKM LER HP BBB CCCCCCC CCCCCCC CCC BB 000 abcdefg abcdefg abc 00 00c defgabc defgabc def g0 VITERBI decoding: of all possible state-paths, we determine the maximum probability one given the amino acid sequence O; POSTERIOR decoding: at each position, we determine the state with the highest probability given O. Issue: how to measure the strength of a potential CC domain, and how should this depend on the length of a domain? CC-PROBABILITY PROFILE Fusion protein of simian parainfluenza virus 5 Assessing performance: terms TP true positive: a predicted fragment that overlaps the annotated fragment (aa in the annotated region) FP false positive: a predicted fragment does not overlap any annotated fragment (aa not in the annotated region) LS learning set of sequences NTS negative test set; sequences with no CCD used for estimating FP PTS positive test set; used for estimating TP Much care/effort required to create these sets Assessing performance: study design Study the variability of performance under variation of the sequences used for determining the model parameters. Compare methods using the same set of aa-frequencies / emission probabilities. Use the same set of domains for learning and testing instead of testing on different protein families. Choose a number of FP-rates and calculate the corresponding TP-rates (ROC curve). PTS subdivided 150 times at random (stratified) into 2/3 for learning and the rest for testing. LEARNING PHASE => PARAMETERS TESTING ON NTS => THRESHOLDS TESTING ON PTS => TP-VALUES ANALYSIS OF RESULTS 150 X Assessing performance: summaries TP-rate at given FP-rate: per family / per length-class TP- and FP-rates for aa’s: per family / per length-class Accuracy: at the borders and of length prediction The algorithms As the name suggests, with an HMM the series O = (O1,O2 ,O3 ,……., OT) is observed, while the states S = (S1 ,S2 ,S3 ,……., ST) are not. There are elegant algorithms for calculating pr(O|), arg max pr(O|) in certain special cases, and arg maxS pr(S|O,). Here are the parameters of the model, e.g. transition and observation probabilities.