Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006 The UNIVERSITY of Kansas Mining Sequence Patterns in Biological Data A brief introduction to biology and bioinformatics Alignment of biological sequences Hidden Markov model for biological sequence analysis Summary 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 DNA Structure DNA: helix-shaped molecule whose constituents are two parallel strands of four nucleotides: A, C, G, and T. This assumes only one strand is considered; the second strand is always derivable from the first by Nucleotides (bases) pairing A with T and C with G 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 Adenine (A) Cytosine (C) Guanine (G) Thymine (T) slide3 Genome Gene: Contiguous subparts of single strand DNA that are templates for producing proteins. Chromosomes: compact chains of coiled DNA Genome: The set of all genes in a given organism. Noncoding part: The function of DNA material between genes is largely unknown. Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide4 Grand Challenges: Functional Genomics The function of a protein is its role in the cellular environment. Function is closely related to tertiary structure Functional genomics: studies the function of all the proteins of a genome Source: fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_D NA_Transcription.htm 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide5 Grand Challenges: Systems Biology A cell is made up of molecular components that can be viewed as 3D-structures of various shapes In a living cell, the molecules interact with each other (w. shape and location). An important type of interaction involve catalysis (enzyme) that facilitate interaction. The interactions of molecules are regulated temporally and spatially. The overall interaction and the regulation of the molecules is studied by systems biology. Source: www.mtsinai.on.ca/pdmg/images/pairscolour.jpg Human Genome—23 pairs of chromosomes 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide6 Grand Challenges: Evolution Develop a detailed understanding of the heritable variation in the human genome Understand evolutionary variation across species and the mechanisms underlying it 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide7 Grand Challenges: from -omics to Health Disease detection: Monitoring the biological system at the molecular level No intrusive Fast, reliable, & cost-effective 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide8 Grand Challenges: from -omics to Health Develop novel medicines Select drug target Develop robust strategies for identifying the genetic contributions to disease and drug response Use new understanding of genes and pathways to develop powerful new therapeutic approaches to disease Drug screening Drug delivery Side effects and other safety issues 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide9 Data Management & Analysis in Bioinformatics Computational management and analysis of biological information Interdisciplinary Field (Molecular Biology, Statistics, Computer Science, Genomics, Genetics, Databases, Chemistry, Radiology …) Bioinformatics vs. computational biology (more on algorithm correctness, Bioinformatics Functional Genomics complexity and other themes central to theoretical CS) Genomics Proteomics Structural Bioinformatics 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide10 Data Mining & Bioinformatics: Why? Biological data is abundant and information-rich Genomics & proteomics data (sequences), microarray and protein-arrays, protein database (PDB), bio-testing data Huge data banks, rich literature, openly accessible Largest and richest scientific data sets in the world Data Mining: transform raw data to knowledge Mining for correlations, linkages between disease and gene sequences, protein networks, classification, clustering, outliers, ... Find correlations among linkages in literature and heterogeneous databases 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide11 Data Integration in Bioinformatics Data Integration: Handling heterogeneous bio-data Build Web-based, interchangeable, integrated, multi-dimensional genome databases Data cleaning and data integration methods becomes crucial Mining correlated information across multiple databases itself becomes a data mining task Typical studies: mining database structures, information extraction from data, reference reconciliation, document classification, clustering and correlation discovery algorithms, ... 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide12 Data Mining & Bioinformatics: Current Picture Pattern discovery Comparative study of biomolecules and biological networks Similarity search Drug design Clustering gene expression profiles Phylogeny tree construction Classification Visualization 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide13 Future Research and development of new tools for bioinformatics Similarity search and comparison between classes of genes (e.g., diseased and healthy) by finding and comparing frequent patterns New clustering and classification methods for micro-array data and proteinarray data analysis Path analysis: linking genes/proteins to different disease development stages Develop pharmaceutical interventions that target the different stages separately High-dimensional analysis and OLAP mining Visualization tools and genetic/proteomic data analysis Knowledge integration You tell me! 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide14 Mining Sequence Patterns in Biological Data A brief introduction to biology and bioinformatics Alignment of biological sequences Hidden Markov model for biological sequence analysis Summary 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide15 Comparing Sequences Sequences to be compared: either nucleotides (DNA/RNA) or amino acids (proteins) Nucleotides: identical Amino acids: identical, or if one can be derived from the other by substitutions that are likely to occur in nature Alignment: Lining up sequences to achieve the maximal level of identity Local—only portions of the sequences are aligned. Global—align over the entire length of the sequences Use gap “–” to indicate preferable not to align two symbols 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide16 Comparing Sequences Two sequences are homologous if they share a common ancestor Percent identity: ratio between the number of columns containing identical symbols vs. the number of symbols in the longest sequence Score of alignment: summing up the matches and counting gaps as negative 9/13/2006 Bio-sequence analysis HEAGAWGHEE PAWHEAE Input sequences: HEAGAWGHE-E One alignment --P-AW-HEAE HEAGAWGHE-E Another alignment P-A--W-HEAE Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide17 Sequence Alignment: Problem Definition Goal: Given two or more input sequences Identify similar sequences with long conserved subsequences Method: Use substitution matrices (probabilities of substitutions of nucleotides or amino-acids and probabilities of insertions and deletions) Optimal multiple alignment problem: NP-hard Heuristic method to find good alignments 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide18 Edit Distance Model Minimal weighted sum of insertions, deletions & mutations required to transform one string into another AGGCACA--CA | |||| || A--CACATTCA 9/13/2006 Bio-sequence analysis or AGGCACACA | || || ACACATTCA Mining Biological Data KU EECS 800, Luke Huan, Fall’06 Levenshtein 1966 slide19 Pair-wise Sequence Alignment: Scoring Matrix insertion/Deletion (indel) penalty Mutation penalty A E G H W A 5 -1 0 -2 -3 E -1 6 -3 0 -3 H -2 0 -2 10 -3 P -1 -1 -2 -2 -4 W -3 -3 -3 -3 15 Gap penalty: -8 Gap extension: -8 HEAGAWGHE-E --P-AW-HEAE (-8) + (-8) + (-1) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 9 Exercise: Calculate for 9/13/2006 Bio-sequence analysis HEAGAWGHE-E P-A--W-HEAE Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide20 Complete DNA Sequences nearly 200 complete genomes have been sequenced 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide21 Whole Genome Similarity Map Mining Biological Data 9/13/2006 Bio-sequence analysis KU EECS 800, Luke Huan, Fall’06 slide22 Map of Rearrangements The UNIVERSITY of Kansas Alignment of a Syntenic Area The UNIVERSITY of Kansas 84790 seq1 84800 || ||||| ||||||||||||||| 13780 84850 84860 ||| | 13840 84910 || | ||||||| 13790 13800 84870 13810 84880 13820 84890 84900 | || |||||||| ||| | |||| | || ||| |||| 13850 84920 13860 84930 13870 84940 13880 84950 84960 TCAGCCTGGGCAGAGAATTTATGACTAAGTCCTCAAAAGTAATTTCAACAAAAATAAACA | seq2 ||||| ---AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA 13830 seq1 84840 GTAAAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA ||| seq2 84830 CAGGATCGACGTCTCTCACCTTATACAAAAATCAACTGAAGATGCATCACAGACTTTAA13770 seq1 84820 -AGGACTTCTTCCTTTCACCATATACAAAAATCAACCCAAGATTGATACAATACTTTAAT |||| seq2 84810 || | |||| ||| || ||||| ||| |||||| | TTTGCTTAGGCAAAGACTTCATGACAAAGAATGCAAAAGCAG-----------------13890 13900 13910 13920 Local Area of Similarity The UNIVERSITY of Kansas Formal Description Problem: PairSeqAlign Input: Two sequences x, y Scoring matrix s Gap penalty d Gap extension penalty e Output: The optimal sequence alignment Difficulty: If x, y are of size n then the number of possible 2n global alignments is n 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 2n (2n)! 2 2 (n!) n slide26 Global Alignment: Needleman-Wunsch Needleman-Wunsch Algorithm (1970) Uses weights for the outmost edges that encourage the best overall (global) alignment Idea: Build up optimal alignment from optimal alignments of subsequences HEAG --P- Add score from table -25 --P-AW-HEAE HEAG- HEAGA HEAGA --P-A --P—- --P-A -33 -33 -20 Gap with bottom 9/13/2006 Bio-sequence analysis HEAGAWGHE-E Gap with top Mining Biological Data KU EECS 800, Luke Huan, Fall’06 Top and bottom slide27 Pair-Wise Sequence Alignment Given s ( xi , y j ), d F (0, 0) 0 F (i 1, j 1) s( xi , y j ) F (i, j ) max F (i 1, j ) d F (i, j 1) d Alignment: F(0,0) – F(n,m) We can vary both the model and the alignment strategies 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide28 Global Alignment Uses recursion to fill in intermediate results table Uses O(nm) space and time O(n2) algorithm yj aligned to gap F(i-1,j-1) s(xi,yj) F(i-1,j) Feasible for moderate sized sequences, but not for aligning whole genomes. 9/13/2006 Bio-sequence analysis d F(i,j-1) d F(i,j) xi aligned to gap While building the table, keep track of where optimal score came from, reverse arrows Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide29 Dot Matrix Alignment Method Dot Matrix Plot: Boolean matrices representing possible alignments that can be detected visually Extremely simple but O(n2) in time and space Visual inspection 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide30 Heuristic Alignment Algorithms Motivation: Complexity of alignment algorithms: O(nm) Current protein DB: 100 million base pairs Matching each sequence with a 1,000 base pair query takes about 3 hours! Heuristic algorithms aim at speeding up at the price of possibly missing the best scoring alignment Two well known programs BLAST: Basic Local Alignment Search Tool FASTA: Fast Alignment Tool Both find high scoring local alignments between a query sequence and a target database Basic idea: first locate high-scoring short stretches and then extend them 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide31 BLAST (Basic Local Alignment Search Tool) Approach (BLAST) (Altschul et al. 1990, developed by NCBI) View sequences as sequences of short words (k-tuple) DNA: 11 bases, protein: 3 amino acids Create hash table of neighborhood (closely-matching) words Use statistics to set threshold for “closeness” Start from exact matches to neighborhood words Motivation Good alignments should contain many close matches Statistics can determine which matches are significant Much more sensitive than % identity Hashing can find matches in O(n) time Extending matches in both directions finds alignment Yields high-scoring/maximum segment pairs (HSP/MSP) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide32 BLAST (Basic Local Alignment Search Tool) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide33 Multiple Sequence Alignment Alignment containing multiple DNA / protein sequences Look for conserved regions → similar function Example: #Rat #Mouse #Rabbit #Human #Oppossum #Chicken #Frog 9/13/2006 Bio-sequence analysis ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT ---ATGGGTTTGACAGCACATGATCGT---CAGCT Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide34 Mining Sequence Patterns in Biological Data A brief introduction to biology and bioinformatics Alignment of biological sequences Hidden Markov model for biological sequence analysis Summary 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide35 From Data to Model Given a group of sequence, what is the underlying model for the group of sequences? Markov models are simple but widely used model to describe a sequence family. 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide36 A Markov Chain Model Transition probabilities Pr(xi=a|xi-1=g)=0.16 Pr(xi=c|xi-1=g)=0.34 Pr(xi=g|xi-1=g)=0.38 Pr(xi=t|xi-1=g)=0.12 Pr( x | x i i 1 g) 1 Exercise Can “acgt” be generated from the MC model? How about “cagt”, “gcat”, “tcga”? 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide37 Definition of Markov Chain Model A Markov chain model is defined by a set of states some states emit symbols other states (e.g., the begin state) are silent a set of transitions with associated probabilities the transitions emanating from a given state define a distribution over the possible next states 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide38 Markov Chain Models: Properties Given some sequence x of length L, we can ask the probability of the sequence with the given model For any probabilistic model of sequences, we can write this probability as Pr( x) Pr( xL , xL 1 ,..., x1 ) Pr( xL / xL 1 ,..., x1 ) Pr( xL 1 | xL 2 ,..., x1 )... Pr( x1 ) key property of a (1st order) Markov chain: the probability of each xi depends only on the value of xi-1 Pr( x) Pr( xL / xL 1 ) Pr( xL 1 | xL 2 )... Pr( x2 | x1 ) Pr( x1 ) L Pr( x1 ) Pr( xi | xi 1 ) i 2 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide39 The Probability of a Sequence for a Markov Chain Model Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide40 Example Application CpG islands CG dinucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G but the regions upstream of genes are richer in CG dinucleotides than elsewhere – CpG islands useful evidence for finding genes Application: Predict CpG islands with Markov chains one to represent CpG islands one to represent the rest of the genome 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide41 Markov Chains for Discrimination Suppose we want to distinguish CpG islands from other sequence regions Given sequences from CpG islands, and sequences from other regions, we can construct a model to represent CpG islands a null model to represent the other regions can then score a test sequence by: Pr( x | CpGModel) score( x) log Pr( x | nullModel ) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide42 Markov Chains for Discrimination Why use Pr( x | CpGModel) score( x) log Pr( x | nullModel ) According to Bayes’ rule Pr( x | CpG) Pr(CpG) Pr(CpG | x) Pr( x) Pr( x | null ) Pr( null ) Pr( null | x) Pr( x) If we are not taking into account prior probabilities of two classes then we just need to compare Pr(x|CpG) and Pr(x|null) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide43 Higher Order Markov Chains The Markov property specifies that the probability of a state depends only on the probability of the previous state But we can build more “memory” into our states by using a higher order Markov model In an n-th order Markov model Pr( xi | xi 1 , xi 2 ,..., x1 ) Pr( xi | xi 1 ,..., xi n ) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide44 Selecting the Order of a Markov Chain Model But the number of parameters we need to estimate grows exponentially with the order n 1 for modeling DNA we need O( 4 ) parameters for an n-th order model The higher the order, the less reliable we can expect our parameter estimates to be estimating the parameters of a 2nd order Markov chain from the complete genome of E. Coli, we’d see each word > 72,000 times on average estimating the parameters of an 8-th order chain, we’d see each word ~ 5 times on average 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide45 Hidden Markov Model: A Simple HMM Given observed sequence AGGCT, which state emits every item? 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide46 Hidden State We’ll distinguish between the observed parts of a problem and the hidden parts In the Markov models we’ve considered previously, it is clear which state accounts for each part of the observed sequence In the model above, there are multiple states that could account for each part of the observed sequence this is the hidden part of the problem 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide47 An HMM Example 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide48 The Parameters of an HMM Πi is a set of states Transition Probabilities akl Pr( i l | i 1 k ) Probability of transition from state k to state l Emission Probabilities ek (b) Pr( xi b | i k ) Probability of emitting character b in state k 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide49 Three Important Questions How likely is a given sequence? The Forward algorithm What is the most probable “path” for generating a given sequence? The Viterbi algorithm How can we learn the HMM parameters given a set of sequences? The Forward-Backward (Baum-Welch) algorithm 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide50 How Likely is a Given Sequence? The probability that the path is taken and the sequence is generated: L Pr( x1...xL , 0 ... N ) a01 e i ( xi )a i i1 i 1 Pr( AAC, ) a01 e1 ( A) a11 e1 ( A) a13 e3 (C ) a35 .5 .4 .2 .4 .8 .3 .6 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide51 How Likely is a Given Sequence? The probability over all paths is but the number of paths can be exponential in the length of the sequence... the Forward algorithm enables us to compute this efficiently 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide52 The Forward Algorithm Define f k (i) to be the probability of being in state k having observed the first i characters of sequence x To compute f N (L) , the probability of being in the end state having observed all of sequence x Can define this recursively Use dynamic programming 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide53 The Forward Algorithm Initialization f0(0) = 1 for start state; fi(0) = 0 for other state Recursion For emitting state (i = 1, … L) f l (i ) el (i ) f k (i 1)akl k For silent state f l (i ) f k (i 1) akl k Termination Pr( x) Pr( x1...xL ) f N ( L) f k ( L)akN k 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide54 Forward Algorithm Example Given the sequence x=TAGA 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide55 Forward Algorithm Example Initialization f0(0)=1, f1(0)=0…f5(0)=0 Computing other values f1(1)=e1(T)*(f0(0)a01+f1(0)a11) =0.3*(1*0.5+0*0.2)=0.15 f2(1)=0.4*(1*0.5+0*0.8) f1(2)=e1(A)*(f0(1)a01+f1(1)a11) =0.4*(0*0.5+0.15*0.2) … Pr(TAGA)= f5(4)=f3(4)a35+f4(4)a45 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide56 Dynamic Programming 9/13/2006 Bio-sequence analysis T=0 T=1 T=2 S0 1 0 0 S1 0 0.5 0.1 S2 0 0.5 0.4 S3 0 0 0.4 S4 0 0 0.1 S5 0 0 0 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide57 Three Important Questions How likely is a given sequence? What is the most probable “path” for generating a given sequence? How can we learn the HMM parameters given a set of sequences? 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide58 Suboptimal Property If path πi1, πi2, …, πik is the most probable path that generates path x1, x2, …, xk, then Given πin, πi1, πi2, …, πin is the most probable path that generates the sequence x1, x2, …, xn for all n <=k πim, πi2, …, πin is the most probable path that generates path xm, x2, …, xn for all 1 <= m <= n <=k, given that πi1, πi2, …, πim is the most probable path for x1, x2, …, xm and πin S(x1, x2, …, xk | πk) = max ( S(x1, x2, …, xk-1 | πk-1) * Pr(πk | πk-1) * Pr( xk | πk) ) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide59 Finding the Most Probable Path: The Viterbi Algorithm Define vk(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k We want to compute vN(L), the probability of the most probable path accounting for all of the sequence and ending in the end state Can define recursively Can use DP to find vN(L) efficiently 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide60 Three Important Questions How likely is a given sequence? What is the most probable “path” for generating a given sequence? How can we learn the HMM parameters given a set of sequences? 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide61 Learning Without Hidden State Learning is simple if we know the correct path for each sequence in our training set estimate parameters by counting the number of times each parameter is used across the training set 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide62 Learning With Hidden State If we don’t know the correct path for each sequence in our training set, consider all possible paths for the sequence Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide63 Learning Parameters: The Baum-Welch Algorithm Also known as the Forward-Backward algorithm An Expectation Maximization (EM) algorithm EM is a family of algorithms for learning probabilistic models in problems that involve hidden state In this context, the hidden state is the path that best explains each training sequence 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide64 Learning Parameters: The Baum-Welch Algorithm Algorithm sketch: initialize parameters of model iterate until convergence calculate the expected number of times each transition or emission is used adjust the parameters to maximize the likelihood of these expected values 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide65 Computational Complexity of HMM Algorithms Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is 2 O( S L) This assumes that the states are densely interconnected Given M sequences of length L, the complexity of Baum Welch on each iteration is O( MS 2 L) 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide66 Markov Models Summary We considered models that vary in terms of order, hidden state Three DP-based algorithms for HMMs: Forward, Backward and Viterbi The algorithms used for each task depend on whether there is hidden state (correct path known) in the problem or not 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide67 Sequence Patterns AHGQSDFILDEADGMMKSTVPN… HGFDSAAVLDEADHILQWERTY… GGGNDEYIVDEADSVIASDFGH… *[IV][LV]DEAD*** 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide68 Expectation Values Expectation value (e) is the expected number of hits for a given sequence pattern or motif e = N x f1 x f2 x f3 x .... fk N is the number of residues in DB (108) fi is the frequency of a given amino acid(s) 9/13/2006 Bio-sequence analysis Table 1 Frequency of amino acid occurrences in water soluble proteins Residue A C D E F G H I K L Mining Biological Data KU EECS 800, Luke Huan, Fall’06 Frequency 8.80% 2.05% 5.91% 5.89% 3.76% 8.30% 2.15% 5.40% 6.20% 8.09% Residue M N P Q R S T V W Y Frequency 1.97% 4.58% 4.48% 3.84% 4.22% 6.50% 5.91% 7.05% 1.39% 3.52% slide69 Pattern Example #1 ACIDS e = 108*0.088*0.021*0.054*0.059*0.065 e = 38.3 #Found in a database = 14 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide70 Pattern Example #2 A*ACI[DEN]S e = 108*0.088*1.000*0.088*0.021*0.054 *{0.059 + 0.059 + 0.046}*0.065 e = 9.4 #Found in a database = 9 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide71 How Long Should a Sequence Motif or Sequence Block Be? How many matching segments of length “l” could be found in comparing a query of length M to a DB of N ? Answer: n(l) = M x N x f l Assume f = 0.05, M = 300, N = 100,000,000 Rule of thumb: make your protein sequence motifs at least 8 residues long 9/13/2006 Bio-sequence analysis Table 2 n 3,750,000 187,500 9375 469 23 1.2 0.058 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 l 3 4 5 6 7 8 9 slide72 PROSITE Pattern Expressions C - [ACG] - T - Matches CAT, CCT and CGT only C - X -T - Matches CAT, CCT, CDT, CET, etc. C - {A} -T - Matches every CXT except CAT C - (1,3) - T - Matches CT, CCT, CCCT C - A(2) - [TP] - Matches CAAT, CAAP [LIV] - [VIC] - X(2) - G - [DENQ] - X - [LIVFM] (2) - G 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide73 Rules of Thumb Pattern-based sequence motifs should be determined from no fewer than 5 multiply aligned sequences A good degree of sequence divergence is needed. If “S” is the %similarity and “N” is the no. of sequences then 1 - SN > 0.95 A good sequence pattern should have no fewer than 8 defined amino acid positions 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide74 Sequence Pattern Databases PROSITE - http://ca.expasy.org/prosite/ BLOCKS - http://www.blocks.fhcrc.org/ DOMO - http://www.infobiogen.fr/services/domo/ PFAM - http://pfam.wustl.edu PRINTS - http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/ SEQSITE - PepTool 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide75 Summary: Mining Biological Data Biological sequence analysis compares, aligns, indexes, and analyzes biological sequences (sequence of nucleotides or amino acids) Biosequence analysis can be partitioned into two essential tasks: pair-wise sequence alignment and multiple sequence alignment Dynamic programming approach (notably, BLAST ) has been popularly used for sequence alignments Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state Given a sequence of symbols, x, the forward algorithm finds the probability of obtaining x in the model The Viterbi algorithm finds the most probable path (corresponding to x) through the model The Baum-Welch learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences. 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide76 References Lecture notes@M. Craven’s website: www.biostat.wisc.edu/~craven A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (3rd ed.). John Wiley & Sons, 2004 R.Durbin, S.Eddy, A.Krogh and G.Mitchison. Biological Sequence Analysis: Probability Models of Proteins and Nucleic Acids. Cambridge University Press, 1998 N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004 I. Korf, M. Yandell, and J. Bedell. BLAST. O'Reilly, 2003 L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE, 77:257--286, 1989 J. C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS Pub Co., 1997. M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and Genomes. CRC Press, 1995 9/13/2006 Bio-sequence analysis Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide77