Transcript
Welcome to
Introduction to Computational
Genomics for Infectious Disease
Course Instructors
• Instructor
James Galagan
• Teaching Assistants
Brian Weiner
Desmond Lun
• Lab Instructors
Antonis Rokas
Reinhard Engels
Mark Borowsky
Aaron Brandes
Jeremy Zucker
Caroline Colijn
Other members of Broad Microbial Analysis Group
Schedule and Logistics
• Lectures
Tues/Thurs 11-12:30
Harvard School of Public Health: FXB-301
The François-Xavier Bagnoud Center, Room 301
• Labs
Wed/Fri 1-3
Broad Institute: Olympus Room
First floor of Broad Main Lobby
See front desk attendant near entrance
Individual computers and software provided
No programming experience required
Website
www.broad.mit.edu/annotation/winter_course_2006/
• Contact information
• Directions to Broad
• Lecture slides
• Lab handouts
• Resources
Goals of Course
• Introduction to concepts behind
commonly used computational tools
• Recognize connection between different
concepts and applications
• Hands on experience with computational
analysis
Concepts and Applications
• Lectures will cover concepts
– Computationally oriented
• Labs will provide opportunity for hands
on application of tools
– Nuts and bolts of running tools
– Application of tools not covered in lectures
Computational Genomics Overview
Slide Credit: Manolis Kellis
Topics
1. Probabilistic Sequence Modeling
2. Clustering and Classification
3. Motifs
4. Steady State Metabolic Modeling
Topics Not Covered
• Sequence Alignment
• Phylogeny (maybe in labs)
• Molecular Evolution
• Population Genetics
• Advanced Machine Learning
– Bayesian Networks
– Conditional Random Fields
Applications to Infectious Disease
• Examples and labs will focus on the analysis
of microbial genomics data
– Pathogenicity islands
– TB expression analysis
– Antigen prediction
– Mycolic acid metabolism
• But approaches are applicable to any
organism and to many different questions
Probabilistic Modeling of Biological
Sequences
Concepts
Statistical Modeling of Sequences
Hidden Markov Models
Applications
Predicting pathogenicity islands
Modeling protein families
Lab Practical
Basic sequence annotation
Probabilistic Sequence Modeling
• Treat objects of interest as random
variables
– nucleotides, amino acids, genes, etc.
• Model probability distributions for these
variables
• Use probability calculus to make
inferences
Why Probabilistic Sequence Modeling?
• Biological data is noisy
• Probability provides a calculus for
manipulating models
• Not limited to yes/no answers – can provide
“degrees of belief”
• Many common computational tools based on
probabilistic models
Sequence Annotation
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT
TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT
GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA
AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC
GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC
CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC
TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC
GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC
GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG
ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG
TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG
TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG
GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC
ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT
GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG
GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC
GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT
ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG
GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC
GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT
CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
Sequence Annotation
[Figure: the same sequence as above, with a Gene region highlighted]
Sequence Annotation
[Figure: the same sequence, now annotated with a Promoter, a Motif, a Gene, and a Kinase Domain]
Probabilistic Sequence Modeling
• Hidden Markov Models (HMM)
– A general framework for sequences of symbols (e.g. nucleotides, amino acids)
– Widely used in computational genomics
1. Hmmer – HMMs for protein families
2. Pathogenicity Islands
Pathogenicity Islands
• Clusters of genes acquired by horizontal transfer
– Present in pathogenic species but not others
• Frequently encode virulence factors
– Toxins, secondary metabolites, adhesins
• Different GC content than rest of genome
• (Flanked by repeats, gene content, phylogeny, regulation, codon usage)
[Figure: GC content along the Neisseria meningitidis genome, 52% G+C (from Tettelin et al. 2000. Science)]
Application: Bacillus subtilis
Modeling Sequence Composition
• Calculate sequence distribution from
known islands
– Count occurrences of A,T,G,C
• Model islands as nucleotides drawn
independently from this distribution
... C C T A A G T T A G A G G A T T G A G A ...
Each position Si is drawn independently from the same distribution P(Si|MP):
A: 0.15, T: 0.13, G: 0.30, C: 0.42
The Probability of a Sequence
• Can calculate the probability of a particular sequence
(S) according to the pathogenicity island model (MP)
P(S|MP) = P(S1, S2, ..., SN | MP) = ∏(i=1..N) P(Si|MP)

Example
S = AAATGCGCATTTCGAA
With the island model (A: 0.15, T: 0.13, G: 0.30, C: 0.42), the sequence contains 6 As, 4 Ts, 3 Gs, and 3 Cs:

P(S|MP) = P(A)^6 × P(T)^4 × P(G)^3 × P(C)^3
        = (0.15)^6 × (0.13)^4 × (0.30)^3 × (0.42)^3
        ≈ 6.5 × 10^-12
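In code, this per-position product is only a few lines. A minimal sketch in Python (the emission probabilities are the slide's island model; working in log space avoids numerical underflow on long sequences):

```python
import math

# Emission probabilities for the pathogenicity-island model M_P (slide values)
ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}

def log_seq_probability(seq, model):
    """log P(S|M) for an i.i.d. model: sum of per-nucleotide log probabilities."""
    return sum(math.log(model[nt]) for nt in seq)

s = "AAATGCGCATTTCGAA"
print(math.exp(log_seq_probability(s, ISLAND)))
```

Summing logs rather than multiplying raw probabilities matters in practice: for a genome-scale sequence the raw product underflows to zero.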
Sequence Classification
PROBLEM: Given a sequence, is it an island?
– We can calculate P(S|MP), but what is a sufficient P value?
SOLUTION: compare to a null model and calculate log-likelihood ratio
– e.g. background DNA distribution model, B
Score = log [ P(S|MP) / P(S|B) ] = log ∏(i=1..N) [ P(Si|MP) / P(Si|B) ] = Σ(i=1..N) log [ P(Si|MP) / P(Si|B) ]
     Pathogenicity Islands   Background DNA   Score Matrix (log2)
A    0.15                    0.25             -0.73
T    0.13                    0.25             -0.94
G    0.30                    0.25              0.26
C    0.42                    0.25              0.74
Finding Islands in Sequences
• Could use the log-likelihood ratio on
windows of fixed size
– What if islands have variable length?
• We prefer a model for entire sequence
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
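A fixed-window scorer is easy to sketch. The model and background values below come from the slides; the window width of 10 is an arbitrary choice for illustration:

```python
import math

ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
BACKGROUND = {nt: 0.25 for nt in "ATGC"}

# Per-nucleotide log-odds scores (base 2, matching the slide's score matrix)
SCORE = {nt: math.log2(ISLAND[nt] / BACKGROUND[nt]) for nt in "ATGC"}

def window_scores(seq, width):
    """Log-likelihood-ratio score of every window of fixed width.
    Positive scores favor the island model over background."""
    return [sum(SCORE[nt] for nt in seq[i:i + width])
            for i in range(len(seq) - width + 1)]

seq = ("TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA"
       "GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC")
scores = window_scores(seq, 10)
print(max(scores), min(scores))
```

The weakness the slide points out is visible here: the result depends on the chosen width, which is exactly why a model over the entire sequence is preferable.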
A More Complex Model
[Two-state model: Background and Island]
Transitions: Background→Background 0.85, Background→Island 0.15, Island→Island 0.75, Island→Background 0.25
Background emissions: A: 0.25, T: 0.25, G: 0.25, C: 0.25
Island emissions: A: 0.15, T: 0.13, G: 0.30, C: 0.42
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
A Generative Model
L:  each position takes a hidden label, P (island) or B (background)
S:  G C A A A T G C

Transition probabilities P(Li+1|Li):
        B(i+1)  P(i+1)
  Bi     0.85    0.15
  Pi     0.25    0.75

Emission probabilities:
  P(S|B): A: 0.25, T: 0.25, G: 0.25, C: 0.25
  P(S|P): A: 0.42, T: 0.30, G: 0.13, C: 0.15
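The "generative" view can be made concrete by running the model forward. A sketch using the transition and emission values from this slide (the choice to start in the background state is an assumption, since the slide does not give a start distribution):

```python
import random

# Two-state model from the slide
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}

def generate(n, start="B", rng=random):
    """Walk the model for n steps: emit a nucleotide from the current
    state, then choose the next state from the transition distribution."""
    labels, seq, state = [], [], start
    for _ in range(n):
        labels.append(state)
        nts, eprobs = zip(*EMIT[state].items())
        seq.append(rng.choices(nts, eprobs)[0])
        states, tprobs = zip(*TRANS[state].items())
        state = rng.choices(states, tprobs)[0]
    return "".join(labels), "".join(seq)

labels, seq = generate(60)
print(labels)
print(seq)
```

Because the self-transition probabilities (0.85 and 0.75) are large, the sampled labels come in runs, which is what makes the model a reasonable stand-in for islands embedded in background DNA.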
A Hidden Markov Model
Hidden states: L = { 1, ..., K }
Transition probabilities: aij = probability of a transition from state i to state j
Emission probabilities: ei(b) = P(emitting b | state = i)
Initial state probabilities: p(b) = P(first state = b)
What can we do with this model?
The model defines a joint probability over
labels and sequences, P(L,S)
Implicit in model is what labels “tend to go”
with what sequences (and vice versa)
Rules of probability allow us to use this model
to analyze existing sequences
Fundamental HMM Operations
Computation / Biology

Decoding
• Given: an HMM and a sequence S
• Find: a corresponding sequence of labels, L
→ Annotate pathogenicity islands on a new sequence

Evaluation
• Given: an HMM and a sequence S
• Find: P(S|HMM)
→ Score a particular sequence (not as useful for this model – will come back to this later)

Training
• Given: an HMM without parameters and a set of sequences S
• Find: transition and emission probabilities that maximize P(S | params, HMM)
→ Learn a model for sequence composed of background DNA and pathogenicity islands
The Hidden in HMM
• DNA does not come
conveniently labeled (i.e.
Island, Gene, Promoter)
• We observe nucleotide
sequences
• The hidden in HMM refers to
the fact that state labels, L,
are not observed
– Only observe emissions (e.g.
nucleotide sequence in our
example)
[Figure: hidden states i and j emitting the observed sequence ...A A G T T A G A G...]
“Decoding” With HMM
Given observables, we would like to predict a
sequence of hidden states that is most likely to
have generated that sequence
Pathogenicity Island Example
Given a nucleotide sequence, we want a labeling of
each nucleotide as either “pathogenicity island” or
“background DNA”
The Most Likely Path
• Given a sequence, one reasonable choice for
a labeling is:
L* = argmax(labels) P(Labels, Sequence | Model)
The sequence of labels, L*, (or path) that makes
the labels and sequence most likely given the
model
Probability of a Path, Seq
L:  B B B B B B B B
S:  G C A A A T G C
(each B→B transition contributes 0.85; each emission from B contributes 0.25)

P = P(G|B) P(B1|B0) P(C|B) P(B2|B1) P(A|B) P(B3|B2) ... P(C|B)
  = (0.85)^7 × (0.25)^8
  ≈ 4.9 × 10^-6
Probability of a Path, Seq
L:  B B B P P P B B
S:  G C A A A T G C

P = P(G|B) P(B1|B0) P(C|B) P(B2|B1) P(A|B) P(P3|B2) ... P(C|B)
  = (0.85)^3 × (0.25)^6 × (0.75)^2 × (0.42)^2 × (0.30) × (0.15)
  ≈ 6.7 × 10^-7

We could try to calculate the probability of every path, but....
Decoding
• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given
sequence and model
L* = argmax(labels) P(Labels, Sequence | Model)
– Uses dynamic programming (same technique used in
sequence alignment)
– Much more efficient than searching every path
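A sketch of the Viterbi recursion for the two-state model (parameters from the slides; the uniform start distribution is an assumption, since the slides leave it unspecified):

```python
import math

TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}
INIT = {"B": 0.5, "P": 0.5}  # assumed uniform start

def viterbi(seq):
    """Most likely label sequence L* = argmax_L P(L, S), by dynamic programming."""
    states = list(TRANS)
    # v[k] = log probability of the best path so far ending in state k
    v = {k: math.log(INIT[k]) + math.log(EMIT[k][seq[0]]) for k in states}
    ptrs = []  # backpointers for traceback
    for nt in seq[1:]:
        ptr, nv = {}, {}
        for k in states:
            best = max(states, key=lambda j: v[j] + math.log(TRANS[j][k]))
            ptr[k] = best
            nv[k] = v[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][nt])
        v = nv
        ptrs.append(ptr)
    # Trace back from the best final state
    state = max(states, key=v.get)
    path = [state]
    for ptr in reversed(ptrs):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("GCAAATGC"))
```

The two tricks to notice: log probabilities (so long sequences do not underflow), and the backpointer table, which is what makes recovering the best path linear-time instead of exponential.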
Probability of a Single Label
Sum over all paths:
L:  each position may be labeled P or B
S:  G C A A A T G C

The forward algorithm (dynamic programming) computes, e.g., P(Label5 = B | S) by summing over all paths.
• Calculate the most probable label, L*i, at each position i
• Doing this for all N positions gives {L*1, L*2, L*3, ..., L*N}
Two Decoding Options
• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given
sequence and model
L* = argmax(labels) P(Labels | Sequence, Model)
• Posterior Decoding
– Finds most likely label at each position for all
positions, given sequence and model
{L*1, L*2, L*3…. L*N}
– Forward and Backward equations
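Posterior decoding can be sketched directly from the forward and backward recursions (same parameters and assumed uniform start as before; unscaled probabilities are fine for short sequences but underflow on long ones, where scaling or log-space sums are needed):

```python
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}
INIT = {"B": 0.5, "P": 0.5}  # assumed uniform start

def posterior_decode(seq):
    """Pick argmax_k P(L_i = k | S) at each position, via forward/backward."""
    states = list(TRANS)
    n = len(seq)
    # Forward: f[i][k] = P(x_1..x_i, L_i = k)
    f = [{k: INIT[k] * EMIT[k][seq[0]] for k in states}]
    for nt in seq[1:]:
        f.append({k: EMIT[k][nt] * sum(f[-1][j] * TRANS[j][k] for j in states)
                  for k in states})
    # Backward: b[i][k] = P(x_{i+1}..x_n | L_i = k)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][j] * EMIT[j][seq[i + 1]] * b[i + 1][j]
                       for j in states) for k in states}
    total = sum(f[n - 1][k] for k in states)  # P(S): the Evaluation quantity
    labels = "".join(max(states, key=lambda k: f[i][k] * b[i][k])
                     for i in range(n))
    return labels, total

labels, p_seq = posterior_decode("GCAAATGC")
print(labels, p_seq)
```

Note that the same forward table also answers the Evaluation problem: summing its last column gives P(S|HMM).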
Application: Bacillus subtilis
Method
• Three-state model: Gene+ (gene on positive strand), Gene– (gene on negative strand), AT Rich
• Second-order emissions: P(Si | State, Si-1, Si-2) (capturing trinucleotide frequencies)
• Train using EM
• Predict with posterior decoding
Nicolas et al (2002) NAR
Results
[Figure: each line is P(label | S, model), color-coded by label]
• Gene on positive strand
• A/T rich: intergenic regions and islands
• Gene on negative strand
Nicolas et al (2002) NAR
Fundamental HMM Operations
Computation / Biology

Decoding
• Given: an HMM and a sequence S
• Find: a corresponding sequence of labels, L
→ Annotate pathogenicity islands on a new sequence

Evaluation
• Given: an HMM and a sequence S
• Find: P(S|HMM)
→ Score a particular sequence (not as useful for this model – will come back to this later)

Training
• Given: an HMM without parameters and a set of sequences S
• Find: transition and emission probabilities that maximize P(S | params, HMM)
→ Learn a model for sequence composed of background DNA and pathogenicity islands
Training an HMM
Transition probabilities: e.g. P(Pi+1|Bi) – the probability of entering a pathogenicity island from background DNA
Emission probabilities: i.e. the nucleotide frequencies for background DNA and pathogenicity islands
Parameters are fit by maximum likelihood estimation.
Learning From Labelled Data
If we have a sequence that has islands marked, we can simply count:

L:  start B B B P P P B B end
S:        G C A A A T G C

P(Li+1|Li):
          B(i+1)  P(i+1)  End
  Start     1       0      0
  Bi       3/5     1/5    1/5
  Pi       1/3     2/3     0

P(S|B): A: 1/5, T: 0, G: 2/5, C: 2/5
P(S|P): counted the same way from the island positions (A: 2/3, T: 1/3, G: 0, C: 0)
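The counting itself is mechanical. A sketch that reproduces the slide's numbers from the labelled example (the "start"/"end" state names are bookkeeping labels added here for the purpose):

```python
from collections import Counter, defaultdict

def count_parameters(labels, seq):
    """Maximum-likelihood estimates from a labelled sequence:
    normalized transition and emission counts."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    path = ["start"] + list(labels) + ["end"]
    for prev, nxt in zip(path, path[1:]):
        trans[prev][nxt] += 1
    for state, nt in zip(labels, seq):
        emit[state][nt] += 1
    t = {s: {nxt: n / sum(c.values()) for nxt, n in c.items()}
         for s, c in trans.items()}
    e = {s: {nt: n / sum(c.values()) for nt, n in c.items()}
         for s, c in emit.items()}
    return t, e

t, e = count_parameters("BBBPPPBB", "GCAAATGC")
print(t["B"])  # {'B': 0.6, 'P': 0.2, 'end': 0.2} -> the slide's 3/5, 1/5, 1/5
print(e["B"])  # G and C at 2/5 each, A at 1/5
```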
End
Unlabelled Data
How do we know how to count?

L:  ? ? ? ? ? ? ? ?
S:  G C A A A T G C

P(Li+1|Li): ?    P(S|B): ?    P(S|P): ?
Unlabeled Data
L:  (hidden)
S:  G C A A A T G C

An idea:
1. Imagine we start with some parameters
2. We could calculate the most likely path, P*, given those parameters and S
3. We could then use P* to update our parameters by maximum likelihood
4. And iterate (to convergence)

{P(Li+1|Li), P(S|B), P(S|P)}0 → {P(Li+1|Li), P(S|B), P(S|P)}1 → {P(Li+1|Li), P(S|B), P(S|P)}2 → ... → {P(Li+1|Li), P(S|B), P(S|P)}K
Expectation Maximization (EM)
1. Initialize parameters
2. E step: estimate the probability of the hidden labels, Q, given the current parameters and the sequence
   Q = P(Labels | S, params(t-1))
3. M step: choose new parameters to maximize the expected log likelihood under Q
   params(t) = argmax(params) E_Q[ log P(S, Labels | params) ]
4. Iterate
P(S|Model) is guaranteed to increase with each iteration.
Expectation Maximization (EM)
Remember the basic idea!
1. Use the model to estimate (a distribution over) the missing data
2. Use the estimate to update the model
3. Repeat until convergence
EM is a general approach for learning models
(ML estimation) when there is “missing data”
Widely used in computational biology
EM is frequently used in motif discovery (Lecture 3)
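Full Baum-Welch training of the island HMM is more machinery than fits here, but the E/M alternation itself can be seen in a stripped-down setting. The sketch below is NOT the slide's HMM: it is a two-component mixture that treats positions as independent (an assumption made for brevity), so the E step is a simple per-position posterior rather than forward-backward.

```python
def em_mixture(seq, iters=50):
    """EM for a 2-component nucleotide mixture (a simplified stand-in for
    Baum-Welch: positions are assumed independent, no transitions)."""
    nts = "ATGC"
    # 1. Initialize parameters (asymmetric, so the components can separate)
    weight = 0.5
    emit = [{"A": 0.4, "T": 0.3, "G": 0.2, "C": 0.1},
            {"A": 0.1, "T": 0.2, "G": 0.3, "C": 0.4}]
    for _ in range(iters):
        # 2. E step: posterior responsibility of component 0 for each position
        resp = []
        for nt in seq:
            p0 = weight * emit[0][nt]
            p1 = (1 - weight) * emit[1][nt]
            resp.append(p0 / (p0 + p1))
        # 3. M step: re-estimate parameters from the expected counts
        weight = sum(resp) / len(seq)
        for k in (0, 1):
            w = [r if k == 0 else 1 - r for r in resp]
            tot = sum(w)
            emit[k] = {nt: sum(wi for wi, s in zip(w, seq) if s == nt) / tot
                       for nt in nts}
    return weight, emit

w, emit = em_mixture("AAAAATTTTTGGGGGCCCCC" * 5)
print(round(w, 3))
```

The structure mirrors the slide exactly: estimate the distribution over the missing labels, update the parameters from expected counts, repeat; the likelihood is non-decreasing at every iteration.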
A More Sophisticated Application
Modeling Protein Families
• Given amino acid sequences from a protein family,
how can we find other members?
– Can search databases with each known member – not
sensitive
– More information is contained in full set
• The HMM Profile Approach
– Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM
We will learn features from multiple alignments
Human Ubiquitin Conjugating Enzymes
UBE2D2    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697  FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1    FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1    FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9     FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N     LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016  IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10    FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34     FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156  FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1    FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B     FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I     FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5    LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1    FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6    FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H     LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12     VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
Profile HMM
[Figure: profile HMM architecture. Each match state M1 ... Mj ... MN carries a column of amino acid emission probabilities (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y); insert states I1 ... Ij ... IN and delete states D1 ... Dj ... DN connect Start to End.]

E2EPF5  LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKRA------------DWTAELGIRH
UBE2L1  FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISAA-----------ENWKPATKTDQ
UBE2L6  FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISSA-----------ENWKPCTKTCQ
UBE2H   LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-P-----------QTWTALYDLTN
Using Profile HMMs
Computation / Biology

Decoding
• Find: the sequence of labels, L, that maximizes P(L|S, HMM)
→ Align a new sequence to a protein family

Evaluation
• Find: P(S|HMM)
→ Score a sequence for membership in the family

Training
• Find: transition and emission probabilities that maximize P(S | params, HMM)
→ Discover and model family structure
Example: Modeling Globins
• Profile HMM from 300 randomly selected globin
genes
• Score database of 60,000 proteins
PFAM Collection of Profile HMMs
http://www.sanger.ac.uk/Software/Pfam/
PFAM Resources
• 8957 curated protein
families and domains
• Each with HMM profile(s)
• Coverage
– 73% of proteins in Swiss-Prot and SP-TrEMBL
– 53% of “typical” genome
sequence
Example PFAM Entry
• Literature Links
• Protein Structure
• Domain Architectures
• GO Functional Categories
Lab 1
HMMER
• Implementation of Profile HMM methods
• Given a multiple alignment, HMMER
can build a Profile HMM
• Given a Profile HMM (i.e. from PFAM),
HMMER can score sequences for
membership in the family or domain
HMMs in Context
• HMMs
– Sequence alignment
– Gene Prediction
• Generalized HMMs
– Variable length states
– Complex emissions models
– e.g. Genscan
• Bayesian Networks
– General graphical model
– Arbitrary graph structure
– e.g. Regulatory network analysis
References
• Sean R. Eddy, "Hidden Markov models," Current Opinion in Structural Biology, 6:361-365, 1996.
• Sean R. Eddy, "Profile hidden Markov models," Bioinformatics, 14(9):755-763, 1998.
• Anders Krogh, "An introduction to hidden Markov models for biological sequences," in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pp. 45-63, Elsevier, 1998.
• HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/
• Erik L. L. Sonnhammer et al., "Pfam: multiple sequence alignments and HMM profiles of protein domains," Nucleic Acids Research, 26(1):320-322, 1998.
• R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.
Tomorrow’s Lab
• Basic Sequence Analysis Tools
– Argo Genome Browser
– Blast
– Gene prediction using Glimmer
– Protein families with Hmmer and PFAM
– Comparative synteny analysis
• Identify virulence factors by annotating and
comparing virulent and avirulent bacterial
sequences
Relation between Viterbi and Forward
VITERBI
Vj(i) = probability of the most probable path ending in state j after observations x1...xi
Initialization: V0(0) = 1; Vk(0) = 0 for all k > 0
Iteration: Vj(i) = ej(xi) · max(k) Vk(i-1) akj
Termination: P(x, π*) = max(k) Vk(N)

FORWARD
fj(i) = P(x1...xi, state at position i = j)
Initialization: f0(0) = 1; fk(0) = 0 for all k > 0
Iteration: fj(i) = ej(xi) · Σ(k) fk(i-1) akj
Termination: P(x) = Σ(k) fk(N) ak0
Slide Credit: Serafim Batzoglou