Transcript
```
Hidden Markov Models:
Applications in Bioinformatics
Gleb Haynatzki, Ph.D.
Creighton University
March 31, 2003
Definition
• A Hidden Markov Model (HMM) is a
discrete-time finite-state Markov chain
coupled with a sequence of letters emitted
when the Markov chain visits its states.
States (Q):   q1   q2   q3   ...
Letters (O):  O1   O2   O3   ...
Definition (Cont’d)
• The sequence O of emitted letters is called “the
observed sequence” because we often know it
while not knowing the state sequence Q, which is
in this case called “hidden”.
• The triple
λ = (P, B, π)
represents the full set of parameters of the HMM,
where P is the transition probability matrix of the
Markov chain, B is the emission probability
matrix, and π denotes the initial distribution
vector of the Markov chain.
Important Calculations
Given any observed sequence O = (O1,…,OT)
• and λ, efficiently calculate P(O | λ)
• and λ, efficiently calculate the hidden sequence
Q = (q1,…,qT) that is most likely to have
occurred; i.e. find argmaxQ P(Q | O)
• and assuming a fixed graph structure of the
underlying Markov chain, find the parameters
λ = (P, B, π) maximizing P(O | λ)
Applications of HMM
• Modeling protein families:
(1) construct multiple sequence alignments
(2) determine the family of a query sequence
• Gene finding through semi-Hidden Markov
Models (semiHMM)
HMM for Sequence Alignment
Consider the following Markov chain underlying an
HMM, with three types of states:
“match” (m); “insert” (i); “delete” (d)
HMM for Sequence Alignment (Cont’d)
• The alphabet A consists of the 20 amino
acids and a “delete” symbol
• Delete states output only the delete symbol, with probability 1
• Each insert & match state has its own
distribution over the 20 amino acids and does
not output the delete symbol
HMM for Sequence Alignment (Cont’d)
There are two extreme situations depending on
the HMM parameters:
• The emission probs for the match & insert
states are uniform over the 20 amino acids
⇒ the model produces random sequences
• Each state emits one specific amino acid with
prob 1 & mi → mi+1 with prob 1
⇒ the model always produces the same sequence
HMM for Sequence Alignment (Cont’d)
Between the two extremes consider a “family” of
somewhat similar sequences:
• A “tight” family of very similar sequences
• A “loose” family with little similarity
Similarity may be confined to certain areas of the
sequences – if some match states emit a few
amino acids, while other match states emit all
amino acids uniformly/randomly
HMM for Sequence Alignments:
Procedure
(A) First, estimate the parameters λ of the model using a set of
training sequences from the protein family
(B) Next, compute the path of states most likely
to have produced each sequence
(C) Amino acids are aligned if both are produced
by the same match state in their paths
(D) Finally, indels are inserted appropriately for
insertions and deletions
Important Calculations
Given any observed sequence O = (O1,…,OT)
• and λ, efficiently calculate P(O | λ)
• and λ, efficiently calculate the hidden sequence
Q = (q1,…,qT) that is most likely to have
occurred; i.e. find argmaxQ P(Q | O)
• and assuming a fixed graph structure of the
underlying Markov chain, find the parameters
λ = (P, B, π) maximizing P(O | λ)
Example
• Consider: CAEFDDH, CDAEFPDDH
• Suppose the model has length 10, and the
most likely paths for the two sequences are:
m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10
and
m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10
Example (Cont’d)
The alignment induced is found by aligning
positions generated by the same match state:
m0  m1      m2  m3  m4  d5  d6  m7  m8  m9  m10
    C       A   E   F           D   D   H
    C   D   A   E   F       P   D   D   H
m0  m1  i1  m2  m3  m4  d5  m6  m7  m8  m9  m10
Example (End)
This leads to the following alignment:
C–AEF–DDH
CDAEFPDDH
HMM: Strengths & Weaknesses
• HMM aligns many sequences with little
computing power
• HMM allows the sequences themselves to
guide the alignment
• Alignments by HMM are sometimes ambiguous
and some regions are left unaligned in the end
• The weaknesses of HMMs come from their strengths:
the Markov property and stationarity
Thank you.
```
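
The first two "Important Calculations" on the slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-state, two-letter model below (the names `STATES`, `PI`, `P`, `B` and all probability values) is hypothetical and does not come from the talk. The sketch uses the standard forward algorithm for P(O | λ) and the Viterbi algorithm for the most likely hidden path; since P(O | λ) does not depend on Q, maximizing the joint probability P(Q, O | λ) gives the same path as maximizing P(Q | O).

```python
# Toy HMM parameters λ = (P, B, π). All values are hypothetical.
STATES = ["s1", "s2"]
PI = {"s1": 0.5, "s2": 0.5}                 # initial distribution π
P = {"s1": {"s1": 0.9, "s2": 0.1},          # transition probabilities
     "s2": {"s1": 0.2, "s2": 0.8}}
B = {"s1": {"a": 0.8, "b": 0.2},            # emission probabilities
     "s2": {"a": 0.3, "b": 0.7}}


def forward(obs):
    """Return P(O | λ) with the forward algorithm, O(T * N^2) time."""
    # alpha[q] = P(O1..Ot, state at time t is q)
    alpha = {q: PI[q] * B[q][obs[0]] for q in STATES}
    for o in obs[1:]:
        alpha = {q: sum(alpha[r] * P[r][q] for r in STATES) * B[q][o]
                 for q in STATES}
    return sum(alpha.values())


def viterbi(obs):
    """Return (most likely state path, its joint probability P(Q, O | λ))."""
    # delta[q] = probability of the best path that ends in q at time t
    delta = {q: PI[q] * B[q][obs[0]] for q in STATES}
    backpointers = []
    for o in obs[1:]:
        # Best predecessor of each state q at this step.
        prev = {q: max(STATES, key=lambda r: delta[r] * P[r][q])
                for q in STATES}
        delta = {q: delta[prev[q]] * P[prev[q]][q] * B[q][o] for q in STATES}
        backpointers.append(prev)
    # Trace the best path backwards from the best final state.
    last = max(STATES, key=lambda q: delta[q])
    path = [last]
    for prev in reversed(backpointers):
        path.append(prev[path[-1]])
    path.reverse()
    return path, delta[last]
```

Calling `forward(["a", "b", "b"])` returns the likelihood of the observed letters, and `viterbi(["a", "b", "b"])` returns the decoded state path with its joint probability. The third calculation (parameter estimation) is usually done with the Baum-Welch (EM) procedure and is not shown here. For long sequences a practical version would work with log probabilities to avoid underflow.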
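
Steps (C) and (D) of the procedure, as applied in the CAEFDDH / CDAEFPDDH example, can be sketched as follows. The helper names `emissions` and `induced_alignment` are hypothetical. The sketch assumes that the first and last states of each path (m0 and m10 in the example) are silent begin/end states and that delete states emit no letter, which is what the letter counts in the example imply. Columns in which every sequence has a gap are dropped, and an ASCII hyphen stands in for the gap symbol.

```python
GAP = "-"


def emissions(path, seq):
    """Map each emitting state label in `path` to the letters it emits."""
    letters = iter(seq)
    emitted = {}
    for state in path[1:-1]:           # skip the silent begin/end states
        if state.startswith("d"):      # delete states consume no letter
            continue
        # An insert state may be visited several times in a row.
        emitted[state] = emitted.get(state, "") + next(letters)
    return emitted


def induced_alignment(paths, seqs, length):
    """Align letters produced by the same match state; pad indels with GAP."""
    per_seq = [emissions(p, s) for p, s in zip(paths, seqs)]
    rows = ["" for _ in seqs]
    for k in range(1, length):
        # Column for match state m_k, skipped if every sequence deleted it.
        column = [e.get(f"m{k}", GAP) for e in per_seq]
        if any(c != GAP for c in column):
            rows = [r + c for r, c in zip(rows, column)]
        # Columns for letters emitted by insert state i_k, padded with gaps.
        inserted = [e.get(f"i{k}", "") for e in per_seq]
        width = max(len(x) for x in inserted)
        rows = [r + x.ljust(width, GAP) for r, x in zip(rows, inserted)]
    return rows


paths = [
    "m0 m1 m2 m3 m4 d5 d6 m7 m8 m9 m10".split(),
    "m0 m1 i1 m2 m3 m4 d5 m6 m7 m8 m9 m10".split(),
]
seqs = ["CAEFDDH", "CDAEFPDDH"]
aligned = induced_alignment(paths, seqs, length=10)
print("\n".join(aligned))
```

With the two paths from the example this reproduces the alignment shown on the final example slide: the first sequence gets a gap opposite the inserted D and another opposite the P emitted by m6, while the position that both paths delete (d5) produces no column.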