Hidden Markov Models: Applications in Bioinformatics
Gleb Haynatzki, Ph.D.
Creighton University
March 31, 2003

Definition
• A Hidden Markov Model (HMM) is a discrete-time, finite-state Markov chain coupled with a sequence of letters emitted when the Markov chain visits its states.

States (Q):   q1  q2  q3  ...
Letters (O):  O1  O2  O3  ...

Definition (Cont’d)
• The sequence O of emitted letters is called “the observed sequence” because we often know it while not knowing the state sequence Q, which in this case is called “hidden”.
• The triple λ = (P, B, π) represents the full set of parameters of the HMM, where P is the transition probability matrix of the Markov chain, B is the emission probability matrix, and π denotes the initial distribution vector of the Markov chain.

Important Calculations
Given any observed sequence O = (O1, …, OT)
• and λ, efficiently calculate P(O | λ)
• and λ, efficiently calculate the hidden sequence Q = (q1, …, qT) that is most likely to have occurred; i.e., find argmax_Q P(Q | O)
• and assuming a fixed graph structure of the underlying Markov chain, find the parameters λ = (P, B, π) maximizing P(O | λ)

Applications of HMM
• Modeling protein families: (1) construct multiple sequence alignments; (2) determine the family of a query sequence
• Gene finding through semi-Hidden Markov Models (semiHMM)

HMM for Sequence Alignment
Consider a Markov chain underlying an HMM, with three types of states: “match”, “insert”, and “delete”.

HMM for Sequence Alignment (Cont’d)
• The alphabet A consists of the 20 amino acids and a “delete” symbol
• Delete states output only the delete symbol, with probability 1
• Each insert and match state has its own distribution over the 20 amino acids and does not output the delete symbol

HMM for Sequence Alignment (Cont’d)
There are two extreme situations depending on the HMM parameters:
• The emission probabilities for the match and insert states are uniform over the 20 amino acids ⇒ the model produces random sequences
• Each state emits one specific amino acid with probability 1, and each transition m_i → m_{i+1} has probability 1 ⇒ the model always produces the same sequence

HMM for Sequence Alignment (Cont’d)
Between the two extremes, consider a “family” of somewhat similar sequences:
• A “tight” family of very similar sequences
• A “loose” family with little similarity
Similarity may be confined to certain areas of the sequences, which happens if some match states emit only a few amino acids while other match states emit all amino acids uniformly at random.

HMM for Sequence Alignments: Procedure
(A) Start with “training”, or estimating, the parameters of the model using a set of training sequences from the protein family
(B) Next, compute the path of states most likely to have produced each sequence
(C) Amino acids are aligned if both are produced by the same match state in their paths
(D) Finally, gap symbols are inserted appropriately for insertions and deletions (indels)

Example
• Consider the sequences CAEFDDH and CDAEFPDDH
• Suppose the model has length 10, and the most likely paths for the two sequences are:
  m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10
  m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10

Example (Cont’d)
The alignment induced is found by aligning positions generated by the same match state:

m0  m1      m2  m3  m4  d5  d6  m7  m8  m9  m10
    C       A   E   F           D   D   H
    C   D   A   E   F       P   D   D   H
m0  m1  i1  m2  m3  m4  d5  m6  m7  m8  m9  m10

Example (End)
This leads to the following alignment:

C–AEF–DDH
CDAEFPDDH

Position 5, which is deleted in both sequences, contributes no column.

HMM: Strengths & Weaknesses
• HMMs align many sequences with little computing power
• HMMs allow the sequences themselves to guide the alignment
• Alignments by HMM are sometimes ambiguous, and some regions are left unaligned in the end
• HMM weaknesses come from their strengths: the Markov property and stationarity

Thank you.
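
Sketch: the first two “Important Calculations”
The first two calculations listed under “Important Calculations” are usually carried out with the forward algorithm (for the probability of O given the parameters) and the Viterbi algorithm (for the most likely hidden path). The slides do not name these algorithms or give any code, so the following Python is only a minimal sketch under stated assumptions. The two-state, two-letter model and all of its probabilities (PI, P, B, ALPHABET) are hypothetical values chosen for illustration, and the observed sequence is assumed to be non-empty. Because P(O) does not depend on Q, maximizing the joint probability P(Q, O) gives the same path as maximizing P(Q | O).

```python
# Toy HMM: every number below is a hypothetical value chosen for illustration.
PI = [0.5, 0.5]                  # initial distribution vector
P = [[0.8, 0.2], [0.2, 0.8]]     # transition matrix, P[i][j] = Pr(next=j | current=i)
B = [[0.9, 0.1], [0.1, 0.9]]     # emission matrix, B[state][letter index]
ALPHABET = "ab"                  # letter 'a' has index 0, 'b' has index 1


def encode(seq):
    """Map a string over ALPHABET to a list of letter indices."""
    return [ALPHABET.index(ch) for ch in seq]


def forward(obs, pi=PI, p=P, b=B):
    """Return the probability of obs, summed over all state paths."""
    n = len(pi)
    # alpha[i] = Pr(letters seen so far, current state = i)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * p[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs, pi=PI, p=P, b=B):
    """Return (most likely state path, joint probability of path and obs)."""
    n = len(pi)
    # delta[i] = probability of the best path so far that ends in state i
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    backptrs = []
    for o in obs[1:]:
        # prev[j] = best predecessor of state j at this step
        prev = [max(range(n), key=lambda i: delta[i] * p[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * p[prev[j]][j] * b[j][o] for j in range(n)]
        backptrs.append(prev)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    best = delta[state]
    path = [state]
    for prev in reversed(backptrs):
        state = prev[state]
        path.append(state)
    return path[::-1], best


if __name__ == "__main__":
    obs = encode("aab")
    print(forward(obs))
    print(viterbi(obs))
```

For the sequence “aab” under this toy model, working the recursions by hand gives a total probability of 0.0962, and the most likely path is (0, 0, 1) with joint probability 0.05832. Both functions multiply raw probabilities, which underflows on long sequences; practical implementations typically work with logarithms or rescale at each step. The third calculation, parameter estimation, is not covered by this sketch.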