Download Mining Patterns from Protein Structures

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Mining Sequence Patterns in Biological Data
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide2
DNA Structure
DNA: helix-shaped molecule
whose constituents are two
parallel strands of four
nucleotides: A, C, G, and T.
This assumes only one strand
is considered; the second
strand is always derivable
from the first by
Nucleotides (bases)
pairing A with T and
C with G
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
slide3
Genome
Gene: Contiguous subparts of single
strand DNA that are templates for
producing proteins.
Chromosomes: compact chains of coiled
DNA
Genome: The set of all genes in a
given organism.
Noncoding part: The function of
DNA material between genes is
largely unknown.
Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide4
Grand Challenges: Functional Genomics
The function of a protein is
its role in the cellular
environment.
Function is closely related to
tertiary structure
Functional genomics:
studies the function of all the
proteins of a genome
Source:
fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_D
NA_Transcription.htm
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide5
Grand Challenges: Systems Biology
A cell is made up of molecular
components that can be viewed as
3D-structures of various shapes
In a living cell, the molecules
interact with each other (w. shape
and location). An important type of
interaction involve catalysis
(enzyme) that facilitate interaction.
The interactions of molecules are
regulated temporally and spatially.
The overall interaction and the
regulation of the molecules is
studied by systems biology.
Source:
www.mtsinai.on.ca/pdmg/images/pairscolour.jpg
Human Genome—23 pairs of chromosomes
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide6
Grand Challenges: Evolution
Develop a detailed understanding of the heritable variation
in the human genome
Understand evolutionary variation across species and the
mechanisms underlying it
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide7
Grand Challenges: from -omics to Health
Disease detection:
Monitoring the biological system at the molecular level
No intrusive
Fast, reliable, & cost-effective
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide8
Grand Challenges: from -omics to Health
Develop novel medicines
Select drug target
Develop robust strategies for identifying the genetic
contributions to disease and drug response
Use new understanding of genes and pathways to develop
powerful new therapeutic approaches to disease
Drug screening
Drug delivery
Side effects and other safety issues
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide9
Data Management & Analysis in
Bioinformatics
Computational management and analysis
of biological information
Interdisciplinary Field (Molecular
Biology, Statistics, Computer Science,
Genomics, Genetics, Databases,
Chemistry, Radiology …)
Bioinformatics vs. computational biology
(more on algorithm correctness,
Bioinformatics
Functional
Genomics
complexity and other themes central to
theoretical CS)
Genomics
Proteomics
Structural
Bioinformatics
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide10
Data Mining & Bioinformatics: Why?
Biological data is abundant and information-rich
Genomics & proteomics data (sequences), microarray and protein-arrays, protein
database (PDB), bio-testing data
Huge data banks, rich literature, openly accessible
Largest and richest scientific data sets in the world
Data Mining: transform raw data to knowledge
Mining for correlations, linkages between disease and gene sequences, protein
networks, classification, clustering, outliers, ...
Find correlations among linkages in literature and heterogeneous databases
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide11
Data Integration in Bioinformatics
Data Integration: Handling heterogeneous bio-data
Build Web-based, interchangeable, integrated, multi-dimensional genome
databases
Data cleaning and data integration methods becomes crucial
Mining correlated information across multiple databases itself becomes a data
mining task
Typical studies: mining database structures, information extraction from data,
reference reconciliation, document classification, clustering and correlation
discovery algorithms, ...
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide12
Data Mining & Bioinformatics: Current Picture
Pattern discovery
Comparative study of biomolecules and biological networks
Similarity search
Drug design
Clustering
gene expression profiles
Phylogeny tree construction
Classification
Visualization
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide13
Future
Research and development of new tools for bioinformatics
Similarity search and comparison between classes of genes (e.g., diseased and
healthy) by finding and comparing frequent patterns
New clustering and classification methods for micro-array data and proteinarray data analysis
Path analysis: linking genes/proteins to different disease development stages
Develop pharmaceutical interventions that target the different stages
separately
High-dimensional analysis and OLAP mining
Visualization tools and genetic/proteomic data analysis
Knowledge integration
You tell me!
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide14
Mining Sequence Patterns in Biological Data
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide15
Comparing Sequences
Sequences to be compared: either nucleotides
(DNA/RNA) or amino acids (proteins)
Nucleotides: identical
Amino acids: identical, or if one can be derived from the other
by substitutions that are likely to occur in nature
Alignment: Lining up sequences to achieve the maximal
level of identity
Local—only portions of the sequences are aligned.
Global—align over the entire length of the sequences
Use gap “–” to indicate preferable not to align two symbols
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide16
Comparing Sequences
Two sequences are
homologous if they share a
common ancestor
Percent identity: ratio between
the number of columns
containing identical symbols
vs. the number of symbols in
the longest sequence
Score of alignment: summing
up the matches and counting
gaps as negative
9/13/2006
Bio-sequence analysis
HEAGAWGHEE
PAWHEAE
Input sequences:
HEAGAWGHE-E
One alignment
--P-AW-HEAE
HEAGAWGHE-E
Another alignment
P-A--W-HEAE
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide17
Sequence Alignment: Problem Definition
Goal:
Given two or more input sequences
Identify similar sequences with long conserved subsequences
Method:
Use substitution matrices (probabilities of substitutions of nucleotides or
amino-acids and probabilities of insertions and deletions)
Optimal multiple alignment problem: NP-hard
Heuristic method to find good alignments
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide18
Edit Distance Model
Minimal weighted sum of insertions, deletions &
mutations required to transform one string into another
AGGCACA--CA
| |||| ||
A--CACATTCA
9/13/2006
Bio-sequence analysis
or
AGGCACACA
| || ||
ACACATTCA
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
Levenshtein 1966
slide19
Pair-wise Sequence Alignment: Scoring Matrix
insertion/Deletion
(indel) penalty
Mutation penalty
A
E
G
H
W
A
5
-1
0
-2
-3
E
-1
6
-3
0
-3
H
-2
0
-2
10
-3
P
-1
-1
-2
-2
-4
W
-3
-3
-3
-3
15
Gap penalty: -8
Gap extension: -8
HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + 5 + 15 + (-8)
+ 10 + 6 + (-8) + 6 = 9
Exercise: Calculate for
9/13/2006
Bio-sequence analysis
HEAGAWGHE-E
P-A--W-HEAE
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide20
Complete DNA Sequences
nearly 200 complete
genomes have been
sequenced
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide21
Whole Genome
Similarity Map
Mining Biological Data
9/13/2006
Bio-sequence analysis
KU EECS 800, Luke Huan, Fall’06
slide22
Map of Rearrangements
The UNIVERSITY of Kansas
Alignment of a Syntenic Area
The UNIVERSITY of Kansas
84790
seq1
84800
|| ||||| |||||||||||||||
13780
84850
84860
||| |
13840
84910
||
| |||||||
13790
13800
84870
13810
84880
13820
84890
84900
| || |||||||| |||
| |||| |
|| ||| ||||
13850
84920
13860
84930
13870
84940
13880
84950
84960
TCAGCCTGGGCAGAGAATTTATGACTAAGTCCTCAAAAGTAATTTCAACAAAAATAAACA
|
seq2
|||||
---AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA
13830
seq1
84840
GTAAAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA
|||
seq2
84830
CAGGATCGACGTCTCTCACCTTATACAAAAATCAACTGAAGATGCATCACAGACTTTAA13770
seq1
84820
-AGGACTTCTTCCTTTCACCATATACAAAAATCAACCCAAGATTGATACAATACTTTAAT
||||
seq2
84810
|| | |||| ||| || ||||| |||
|||||| |
TTTGCTTAGGCAAAGACTTCATGACAAAGAATGCAAAAGCAG-----------------13890
13900
13910
13920
Local Area of Similarity
The UNIVERSITY of Kansas
Formal Description
Problem: PairSeqAlign
Input: Two sequences
x, y
Scoring matrix
s
Gap penalty
d
Gap extension penalty e
Output: The optimal sequence alignment
Difficulty:
If x, y are of size n then
the number of possible 
 2n 
  
global alignments is
n
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
2n
(2n)!
2

2
(n!)
n 
slide26
Global Alignment: Needleman-Wunsch
Needleman-Wunsch Algorithm (1970)
Uses weights for the outmost edges that encourage the best overall (global)
alignment
Idea: Build up optimal alignment from optimal alignments of
subsequences
HEAG
--P-
Add score from table
-25
--P-AW-HEAE
HEAG-
HEAGA
HEAGA
--P-A
--P—-
--P-A
-33
-33
-20
Gap with bottom
9/13/2006
Bio-sequence analysis
HEAGAWGHE-E
Gap with top
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
Top and bottom
slide27
Pair-Wise Sequence Alignment
Given s ( xi , y j ), d
F (0, 0)  0
 F (i  1, j  1)  s( xi , y j )

F (i, j )  max  F (i  1, j )  d
 F (i, j  1)  d

Alignment: F(0,0) – F(n,m)
We can vary both the model and the alignment strategies
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide28
Global Alignment
Uses recursion to fill in
intermediate results table
Uses O(nm) space and time
O(n2) algorithm
yj aligned to gap
F(i-1,j-1)
s(xi,yj)
F(i-1,j)
Feasible for moderate sized
sequences, but not for aligning
whole genomes.
9/13/2006
Bio-sequence analysis
d
F(i,j-1)
d
F(i,j)
xi aligned to gap
While building the table,
keep track of where optimal
score came from, reverse
arrows
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide29
Dot Matrix Alignment Method
Dot Matrix Plot: Boolean matrices representing possible
alignments that can be detected visually
Extremely simple but
O(n2) in time and space
Visual inspection
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide30
Heuristic Alignment Algorithms
Motivation: Complexity of alignment algorithms: O(nm)
Current protein DB: 100 million base pairs
Matching each sequence with a 1,000 base pair query takes about 3 hours!
Heuristic algorithms aim at speeding up at the price of possibly
missing the best scoring alignment
Two well known programs
BLAST: Basic Local Alignment Search Tool
FASTA: Fast Alignment Tool
Both find high scoring local alignments between a query sequence and a
target database
Basic idea: first locate high-scoring short stretches and then extend them
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide31
BLAST (Basic Local Alignment Search Tool)
Approach (BLAST) (Altschul et al. 1990, developed by NCBI)
View sequences as sequences of short words (k-tuple)
DNA: 11 bases, protein: 3 amino acids
Create hash table of neighborhood (closely-matching) words
Use statistics to set threshold for “closeness”
Start from exact matches to neighborhood words
Motivation
Good alignments should contain many close matches
Statistics can determine which matches are significant
Much more sensitive than % identity
Hashing can find matches in O(n) time
Extending matches in both directions finds alignment
Yields high-scoring/maximum segment pairs (HSP/MSP)
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide32
BLAST (Basic Local Alignment Search Tool)
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide33
Multiple Sequence Alignment
Alignment containing multiple DNA / protein sequences
Look for conserved regions → similar function
Example:
#Rat
#Mouse
#Rabbit
#Human
#Oppossum
#Chicken
#Frog
9/13/2006
Bio-sequence analysis
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
---ATGGGTTTGACAGCACATGATCGT---CAGCT
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide34
Mining Sequence Patterns in Biological Data
A brief introduction to biology and bioinformatics
Alignment of biological sequences
Hidden Markov model for biological sequence analysis
Summary
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide35
From Data to Model
Given a group of sequence, what is the underlying model for the
group of sequences?
Markov models are simple but widely used model to describe a
sequence family.
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide36
A Markov Chain Model
Transition probabilities
Pr(xi=a|xi-1=g)=0.16
Pr(xi=c|xi-1=g)=0.34
Pr(xi=g|xi-1=g)=0.38
Pr(xi=t|xi-1=g)=0.12
 Pr( x | x
i
i 1
 g)  1
Exercise
Can “acgt” be generated from the
MC model?
How about “cagt”, “gcat”, “tcga”?
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide37
Definition of Markov Chain Model
A Markov chain model is defined by
a set of states
some states emit symbols
other states (e.g., the begin state) are silent
a set of transitions with associated probabilities
the transitions emanating from a given state define a
distribution over the possible next states
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide38
Markov Chain Models: Properties
Given some sequence x of length L, we can ask the probability
of the sequence with the given model
For any probabilistic model of sequences, we can write this
probability as
Pr( x)  Pr( xL , xL 1 ,..., x1 )
 Pr( xL / xL 1 ,..., x1 ) Pr( xL 1 | xL 2 ,..., x1 )... Pr( x1 )
key property of a (1st order) Markov chain: the probability of
each xi depends only on the value of xi-1
Pr( x)  Pr( xL / xL 1 ) Pr( xL 1 | xL  2 )... Pr( x2 | x1 ) Pr( x1 )
L
 Pr( x1 ) Pr( xi | xi 1 )
i 2
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide39
The Probability of a Sequence for a Markov Chain Model
Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide40
Example Application
CpG islands
CG dinucleotides are rarer in eukaryotic genomes than expected given the
marginal probabilities of C and G
but the regions upstream of genes are richer in CG dinucleotides than
elsewhere – CpG islands
useful evidence for finding genes
Application: Predict CpG islands with Markov chains
one to represent CpG islands
one to represent the rest of the genome
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide41
Markov Chains for Discrimination
Suppose we want to distinguish CpG islands from other sequence
regions
Given sequences from CpG islands, and sequences from other
regions, we can construct
a model to represent CpG islands
a null model to represent the other regions
can then score a test sequence by:
Pr( x | CpGModel)
score( x)  log
Pr( x | nullModel )
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide42
Markov Chains for Discrimination
Why use
Pr( x | CpGModel)
score( x)  log
Pr( x | nullModel )
According to Bayes’ rule
Pr( x | CpG) Pr(CpG)
Pr(CpG | x) 
Pr( x)
Pr( x | null ) Pr( null )
Pr( null | x) 
Pr( x)
If we are not taking into account prior probabilities of two classes
then we just need to compare Pr(x|CpG) and Pr(x|null)
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide43
Higher Order Markov Chains
The Markov property specifies that the probability of a state
depends only on the probability of the previous state
But we can build more “memory” into our states by using a higher
order Markov model
In an n-th order Markov model
Pr( xi | xi 1 , xi 2 ,..., x1 )  Pr( xi | xi 1 ,..., xi n )
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide44
Selecting the Order of a Markov Chain Model
But the number of parameters we need to estimate grows
exponentially with the order
n 1
for modeling DNA we need
O( 4
)
parameters for an n-th order model
The higher the order, the less reliable we can expect our
parameter estimates to be
estimating the parameters of a 2nd order Markov chain from the complete
genome of E. Coli, we’d see each word > 72,000 times on average
estimating the parameters of an 8-th order chain, we’d see each word ~ 5
times on average
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide45
Hidden Markov Model: A Simple HMM
Given observed sequence AGGCT, which state emits
every item?
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide46
Hidden State
We’ll distinguish between the observed parts of a problem and the
hidden parts
In the Markov models we’ve considered previously, it is clear
which state accounts for each part of the observed sequence
In the model above, there are multiple states that could account for
each part of the observed sequence
this is the hidden part of the problem
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide47
An HMM Example
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide48
The Parameters of an HMM
Πi is a set of states
Transition Probabilities
akl  Pr( i  l |  i 1  k )
Probability of transition from state k to state l
Emission Probabilities
ek (b)  Pr( xi  b |  i  k )
Probability of emitting character b in state k
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide49
Three Important Questions
How likely is a given sequence?
The Forward algorithm
What is the most probable “path” for generating a given
sequence?
The Viterbi algorithm
How can we learn the HMM parameters given a set of
sequences?
The Forward-Backward (Baum-Welch) algorithm
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide50
How Likely is a Given Sequence?
The probability that the path is taken and the sequence is
generated:
L
Pr( x1...xL ,  0 ... N )  a01  e i ( xi )a i i1
i 1
Pr( AAC,  )
 a01  e1 ( A)  a11  e1 ( A)
 a13  e3 (C )  a35
 .5  .4  .2  .4  .8  .3  .6
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide51
How Likely is a Given Sequence?
The probability over all paths is
but the number of paths can be exponential in the length
of the sequence...
the Forward algorithm enables us to compute this
efficiently
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide52
The Forward Algorithm
Define f k (i) to be the probability of being in state k
having observed the first i characters of sequence x
To compute f N (L) , the probability of being in the end
state having observed all of sequence x
Can define this recursively
Use dynamic programming
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide53
The Forward Algorithm
Initialization
f0(0) = 1 for start state; fi(0) = 0 for other state
Recursion
For emitting state (i = 1, … L)
f l (i )  el (i ) f k (i  1)akl
k
For silent state
f l (i ) 
f
k
(i  1) akl
k
Termination
Pr( x)  Pr( x1...xL )  f N ( L)   f k ( L)akN
k
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide54
Forward Algorithm Example
Given the sequence x=TAGA
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide55
Forward Algorithm Example
Initialization
f0(0)=1, f1(0)=0…f5(0)=0
Computing other values
f1(1)=e1(T)*(f0(0)a01+f1(0)a11)
=0.3*(1*0.5+0*0.2)=0.15
f2(1)=0.4*(1*0.5+0*0.8)
f1(2)=e1(A)*(f0(1)a01+f1(1)a11)
=0.4*(0*0.5+0.15*0.2)
…
Pr(TAGA)= f5(4)=f3(4)a35+f4(4)a45
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide56
Dynamic Programming
9/13/2006
Bio-sequence analysis
T=0
T=1
T=2
S0
1
0
0
S1
0
0.5
0.1
S2
0
0.5
0.4
S3
0
0
0.4
S4
0
0
0.1
S5
0
0
0
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide57
Three Important Questions
How likely is a given sequence?
What is the most probable “path” for generating a given
sequence?
How can we learn the HMM parameters given a set of
sequences?
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide58
Suboptimal Property
If path πi1, πi2, …, πik is the most probable path that
generates path x1, x2, …, xk, then
Given πin, πi1, πi2, …, πin is the most probable path that generates
the sequence x1, x2, …, xn for all n <=k
πim, πi2, …, πin is the most probable path that generates path xm,
x2, …, xn for all 1 <= m <= n <=k, given that πi1, πi2, …, πim is
the most probable path for x1, x2, …, xm and πin
S(x1, x2, …, xk | πk) = max ( S(x1, x2, …, xk-1 | πk-1) * Pr(πk
| πk-1) * Pr( xk | πk) )
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide59
Finding the Most Probable Path: The Viterbi Algorithm
Define vk(i) to be the probability of the most probable
path accounting for the first i characters of x and ending
in state k
We want to compute vN(L), the probability of the most
probable path accounting for all of the sequence and
ending in the end state
Can define recursively
Can use DP to find vN(L) efficiently
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide60
Three Important Questions
How likely is a given sequence?
What is the most probable “path” for generating a given
sequence?
How can we learn the HMM parameters given a set of
sequences?
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide61
Learning Without Hidden State
Learning is simple if we know the correct path for each sequence
in our training set
estimate parameters by counting the number of times each
parameter is used across the training set
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide62
Learning With Hidden State
If we don’t know the correct path for each sequence in our
training set, consider all possible paths for the sequence
Estimate parameters through a procedure that counts the
expected number of times each parameter is used across the
training set
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide63
Learning Parameters: The Baum-Welch Algorithm
Also known as the Forward-Backward algorithm
An Expectation Maximization (EM) algorithm
EM is a family of algorithms for learning probabilistic models in
problems that involve hidden state
In this context, the hidden state is the path that best
explains each training sequence
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide64
Learning Parameters: The Baum-Welch Algorithm
Algorithm sketch:
initialize parameters of model
iterate until convergence
calculate the expected number of times each
transition or emission is used
adjust the parameters to maximize the likelihood of
these expected values
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide65
Computational Complexity of HMM Algorithms
Given an HMM with S states and a sequence of length L, the
complexity of the Forward, Backward and Viterbi algorithms is
2
O( S L)
This assumes that the states are densely interconnected
Given M sequences of length L, the complexity of Baum Welch
on each iteration is
O( MS 2 L)
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide66
Markov Models Summary
We considered models that vary in terms of order, hidden
state
Three DP-based algorithms for HMMs: Forward,
Backward and Viterbi
The algorithms used for each task depend on whether
there is hidden state (correct path known) in the problem
or not
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide67
Sequence Patterns
AHGQSDFILDEADGMMKSTVPN…
HGFDSAAVLDEADHILQWERTY…
GGGNDEYIVDEADSVIASDFGH…
*[IV][LV]DEAD***
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide68
Expectation Values

Expectation value (e) is the
expected number of hits for a
given sequence pattern or
motif
e = N x f1 x f2 x f3 x .... fk
N is the number of residues in
DB (108)
fi is the frequency of a given
amino acid(s)
9/13/2006
Bio-sequence analysis
Table 1
Frequency of amino acid occurrences in water soluble proteins
Residue
A
C
D
E
F
G
H
I
K
L
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
Frequency
8.80%
2.05%
5.91%
5.89%
3.76%
8.30%
2.15%
5.40%
6.20%
8.09%
Residue
M
N
P
Q
R
S
T
V
W
Y
Frequency
1.97%
4.58%
4.48%
3.84%
4.22%
6.50%
5.91%
7.05%
1.39%
3.52%
slide69
Pattern Example #1
ACIDS
e = 108*0.088*0.021*0.054*0.059*0.065
e = 38.3
#Found in a database = 14
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide70
Pattern Example #2
A*ACI[DEN]S
e = 108*0.088*1.000*0.088*0.021*0.054
*{0.059 + 0.059 + 0.046}*0.065
e = 9.4
#Found in a database = 9
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide71
How Long Should a Sequence Motif or
Sequence Block Be?
How many matching segments of
length “l” could be found in
comparing a query of length M to a
DB of N ?
Answer:
n(l) = M x N x f l
Assume f = 0.05, M = 300, N =
100,000,000
Rule of thumb: make your protein
sequence motifs at least 8 residues
long
9/13/2006
Bio-sequence analysis
Table 2
n
3,750,000
187,500
9375
469
23
1.2
0.058
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
l
3
4
5
6
7
8
9
slide72
PROSITE Pattern Expressions
C - [ACG] - T - Matches CAT, CCT and CGT only
C - X -T - Matches CAT, CCT, CDT, CET, etc.
C - {A} -T - Matches every CXT except CAT
C - (1,3) - T - Matches CT, CCT, CCCT
C - A(2) - [TP] - Matches CAAT, CAAP
[LIV] - [VIC] - X(2) - G - [DENQ] - X - [LIVFM] (2) - G
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide73
Rules of Thumb
Pattern-based sequence motifs should be determined
from no fewer than 5 multiply aligned sequences
A good degree of sequence divergence is needed. If “S”
is the %similarity and “N” is the no. of sequences then
1 - SN > 0.95
A good sequence pattern should have no fewer than 8
defined amino acid positions
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide74
Sequence Pattern Databases
PROSITE - http://ca.expasy.org/prosite/
BLOCKS - http://www.blocks.fhcrc.org/
DOMO - http://www.infobiogen.fr/services/domo/
PFAM - http://pfam.wustl.edu
PRINTS - http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
SEQSITE - PepTool
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide75
Summary: Mining Biological Data
Biological sequence analysis compares, aligns, indexes, and analyzes biological sequences
(sequence of nucleotides or amino acids)
Biosequence analysis can be partitioned into two essential tasks:
pair-wise sequence alignment and multiple sequence alignment
Dynamic programming approach (notably, BLAST ) has been popularly used for sequence
alignments
Markov chains and hidden Markov models are probabilistic models in which the
probability of a state depends only on that of the previous state
Given a sequence of symbols, x, the forward algorithm finds the probability of obtaining x in
the model
The Viterbi algorithm finds the most probable path (corresponding to x) through the model
The Baum-Welch learns or adjusts the model parameters (transition and emission probabilities)
to best explain a set of training sequences.
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide76
References
Lecture notes@M. Craven’s website: www.biostat.wisc.edu/~craven
A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of
Genes and Proteins (3rd ed.). John Wiley & Sons, 2004
R.Durbin, S.Eddy, A.Krogh and G.Mitchison. Biological Sequence Analysis: Probability
Models of Proteins and Nucleic Acids. Cambridge University Press, 1998
N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press,
2004
I. Korf, M. Yandell, and J. Bedell. BLAST. O'Reilly, 2003
L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech
recognition. Proc. IEEE, 77:257--286, 1989
J. C. Setubal and J. Meidanis. Introduction to Computational Molecular Biology. PWS Pub
Co., 1997.
M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and Genomes.
CRC Press, 1995
9/13/2006
Bio-sequence analysis
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide77