HIDDEN MARKOV MODELS
IN COMPUTATIONAL
BIOLOGY
CS 594: An Introduction to
Computational Molecular Biology
BY
Shalini Venkataraman
Vidhya Gunaseelan
Relationship Between DNA, RNA
And Proteins
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
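The slide's example can be reproduced with a short sketch. It assumes, as the slide does, that the DNA shown is the coding strand, so transcription amounts to replacing T with U. The codon table below is deliberately partial: it lists only the seven codons that occur in this example (per the standard genetic code), and the function names are illustrative.

```python
# Partial codon table: only the codons appearing in the slide's example
# (standard genetic code). A complete table would have 64 entries.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (simplification: T becomes U)."""
    return dna.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time and map each codon to an amino acid."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(protein)
```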
2
Protein Structure
Primary Structure of Proteins
The primary structure of peptides and proteins refers to the linear
number and order of the amino acids present.
3
Protein Structure
Secondary Structure
Protein secondary structure refers to regular, repeated patterns of folding of
the protein backbone. How a protein folds is largely dictated by the primary
sequence of amino acids.
Beta Sheet
Alpha Helix
4
Multiple Alignment Process
• Process of aligning three or more
sequences with each other
• Generalization of the algorithm to align
two sequences
• Local multiple alignment uses Sum of
pairs scoring scheme
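A sum-of-pairs score can be sketched as follows: every column of the alignment is scored by adding up the pairwise scores of all residue pairs in that column. The match, mismatch and gap values here are arbitrary illustrative choices, not values from the slides; real tools typically use a substitution matrix.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (illustrative scoring values)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other are not penalized
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pairwise scores over every column of an equal-length alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


score = sum_of_pairs(["ACA", "TCA", "A-A"])
print(score)
```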
5
HMM Architecture
• Markov Chains
• What is a Hidden Markov Model (HMM)?
• Components of HMM
• Problems of HMMs
6
Markov Chains
States : Three states - sunny, cloudy, rainy.
State transition matrix : The probability of the weather given
the previous day's weather.
Initial Distribution : Defining the probability of the
system being in each of the states at time 0.
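A minimal sketch of such a weather chain is given below. The three states come from the slide; every probability is a made-up illustrative value, since the slide's numbers are not part of the transcript.

```python
import random

STATES = ["sunny", "cloudy", "rainy"]

# Hypothetical initial distribution (probability of each state at time 0).
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# Hypothetical transition matrix: TRANSITION[yesterday][today].
TRANSITION = {
    "sunny": {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy": {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}


def sequence_probability(seq):
    """P(seq) = P(first state) times the product of the transitions."""
    p = INITIAL[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= TRANSITION[prev][cur]
    return p


def sample(n, rng):
    """Draw n days of weather from the chain."""
    state = rng.choices(STATES, weights=[INITIAL[s] for s in STATES])[0]
    path = [state]
    for _ in range(n - 1):
        row = TRANSITION[state]
        state = rng.choices(STATES, weights=[row[s] for s in STATES])[0]
        path.append(state)
    return path


p = sequence_probability(["sunny", "sunny", "rainy"])  # 0.5 * 0.6 * 0.1
print(p)
print(sample(5, random.Random(0)))
```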
7
Hidden Markov Models
Hidden states : the (TRUE) states of a system that
may be described by a Markov process (e.g., the
weather).
Observable states : the states of the process that
are 'visible' (e.g., seaweed dampness).
8
Components Of HMM
Output matrix : containing the probability of observing a
particular observable state given that the hidden model is in a
particular hidden state.
Initial Distribution : contains the probability of the (hidden)
model being in a particular hidden state at time t = 1.
State transition matrix : holding the probability of a hidden
state given the previous hidden state.
9
Example-HMM
Transition
Prob.
Output Prob.
Scoring a Sequence with an HMM:
The probability of ACCY along this path is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6.
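The arithmetic above can be checked directly. The factors are copied from the slide in the order given; which of them are transition probabilities and which are output probabilities depends on the figure, which is not part of the transcript. Summing logarithms instead of multiplying is the usual way to avoid underflow on longer sequences.

```python
import math

# Factors exactly as listed on the slide for the sequence ACCY.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)

# Equivalent computation in log space, which stays stable for long paths.
log_probability = sum(math.log(f) for f in factors)

print(f"{path_probability:.3g}")
print(log_probability)
```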
10
Problems With HMM
Scoring problem:
Given an existing HMM and an observed sequence, what is the probability
that the HMM can generate the sequence?
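The scoring problem is commonly solved with the forward algorithm, which sums over all hidden paths without enumerating them. The sketch below uses a hypothetical two-state weather model with seaweed observations; all probabilities are illustrative assumptions. A brute-force sum over every path is included only to check the result.

```python
from itertools import product

STATES = ("sunny", "rainy")
# All numbers below are hypothetical, chosen only for illustration.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.8, "damp": 0.2},
    "rainy": {"dry": 0.1, "damp": 0.9},
}


def forward(obs):
    """P(obs | model), summed over all hidden state paths."""
    alpha = {s: INITIAL[s] * EMISSION[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: EMISSION[s][symbol] * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def brute_force(obs):
    """Same quantity by explicit enumeration of every path (for checking only)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = INITIAL[path[0]] * EMISSION[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANSITION[path[t - 1]][path[t]] * EMISSION[path[t]][obs[t]]
        total += p
    return total


obs = ["dry", "damp", "damp"]
print(forward(obs), brute_force(obs))
```

The forward recursion costs time proportional to the sequence length times the square of the number of states, whereas the enumeration grows exponentially with the sequence length.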
11
Problems With HMM
Alignment Problem:
Given a sequence, what is the optimal state sequence that the HMM
would use to generate it?
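The alignment problem is usually solved with the Viterbi algorithm. A minimal sketch, on a hypothetical two-state weather model with seaweed observations (all probabilities are illustrative assumptions), might look like this:

```python
STATES = ("sunny", "rainy")
# All numbers below are hypothetical, chosen only for illustration.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITION = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}
EMISSION = {
    "sunny": {"dry": 0.8, "damp": 0.2},
    "rainy": {"dry": 0.1, "damp": 0.9},
}


def viterbi(obs):
    """Return (probability, path) of the single most probable state path."""
    # best[s] holds the best path ending in state s and its probability.
    best = {s: (INITIAL[s] * EMISSION[s][obs[0]], [s]) for s in STATES}
    for symbol in obs[1:]:
        new_best = {}
        for s in STATES:
            prob, path = max(
                ((best[r][0] * TRANSITION[r][s], best[r][1]) for r in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMISSION[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


probability, path = viterbi(["dry", "dry", "damp"])
print(path, probability)
```

For longer sequences the products should be replaced by sums of logarithms to avoid numerical underflow.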
12
Problems With HMM
Training Problem:
Given a large amount of data, how can we estimate the structure
and the parameters of the HMM that best account for the data?
13
HMMs in Biology
• Gene finding and prediction
• Protein-Profile Analysis
• Secondary Structure prediction
• Advantages
• Limitations
14
Finding genes in DNA sequence
This is one of the most challenging and interesting problems in
computational biology at the moment. With so many genomes
being sequenced so rapidly, it remains important to begin by
identifying genes computationally.
15
What is a (protein-coding) gene?
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
16
In more detail
[Figure: gene structure in more detail; color ~ state; regions labeled (Left) and (Removed)]
17
Gene Finding HMMs
• Our Objective:
– To find the coding and non-coding regions
of an unlabeled string of DNA nucleotides
• Our Motivation:
– Assist in the annotation of genomic data
produced by genome sequencing methods
– Gain insight into the mechanisms involved
in transcription, splicing and other
processes
18
Why HMMs
• Classification: Classifying observations within a
sequence
• Order: A DNA sequence is a set of ordered observations
• Grammar : Our grammatical structure (and the
beginnings of our architecture) is right here:
• Success measure: # of complete exons correctly
labeled
• Training data: Available from various genome
annotation projects
19
HMMs for gene finding
• Training - Expectation Maximization (EM)
• Parsing – Viterbi algorithm
An HMM for unspliced genes.
x : non-coding DNA
c : coding state
Genefinders - a comparison

              Accuracy per nucleotide   Accuracy per exon
Method        Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2  ME     WE
GENSCAN       0.93   0.93   0.91        0.78   0.81   0.80       0.09   0.05
FGENEH        0.77   0.85   0.78        0.61   0.61   0.61       0.15   0.11
GeneID        0.63   0.81   0.67        0.44   0.45   0.45       0.28   0.24
GeneParser2   0.66   0.79   0.66        0.35   0.39   0.37       0.29   0.17
GenLang       0.72   0.75   0.69        0.50   0.49   0.50       0.21   0.21
GRAILII       0.72   0.84   0.75        0.36   0.41   0.38       0.25   0.10
SORFIND       0.71   0.85   0.73        0.42   0.47   0.45       0.24   0.14
Xpound        0.61   0.82   0.68        0.15   0.17   0.16       0.32   0.13

Sn = Sensitivity
Sp = Specificity
AC = Approximate Correlation
ME = Missing Exons
WE = Wrong Exons
GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
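The per-nucleotide measures can be sketched as below. The slides do not spell out the formulas, so this assumes the convention common in gene-finder evaluations: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and AC as the average of four conditional probabilities rescaled to the range -1 to 1. The labels 'c' (coding) and 'x' (non-coding) follow the state names used earlier; the two label strings are made up for illustration.

```python
def nucleotide_accuracy(actual, predicted):
    """Per-nucleotide Sn, Sp and AC for label strings over {'c', 'x'}.

    Assumes every denominator is non-zero, i.e. both labels occur in
    both the annotation and the prediction.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == "c" and p == "c" for a, p in pairs)  # coding, predicted coding
    fp = sum(a == "x" and p == "c" for a, p in pairs)  # non-coding, predicted coding
    fn = sum(a == "c" and p == "x" for a, p in pairs)  # coding, missed
    tn = sum(a == "x" and p == "x" for a, p in pairs)  # non-coding, predicted non-coding
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    acp = (tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)) / 4
    ac = 2 * (acp - 0.5)
    return sn, sp, ac


# Hypothetical annotation and prediction for a 10-nucleotide stretch.
sn, sp, ac = nucleotide_accuracy("xxccccxxxx", "xcccxxxxxx")
print(round(sn, 2), round(sp, 2), round(ac, 2))
```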
21
Protein Profile HMMs
• Motivation
– Given a single amino acid target sequence of
unknown structure, we want to infer the structure
of the resulting protein. Use Profile Similarity
• What is a Profile?
– Protein families of related sequences and structures
– Same function
– Clear evolutionary relationship
– Patterns of conservation; some positions are more
conserved than others
22
An Overview
Aligned Sequences
Build a Profile HMM (Training)
• Database search
• Query against Profile HMM database (Forward)
• Multiple alignments (Viterbi)
23
Building – from an existing alignment
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
(columns 4-6: insertion)
[Figure labels: Transition probabilities; Output probabilities]
An HMM for a DNA motif alignment. The transitions are
shown with arrows whose thickness indicates their probability. In
each state, the histogram shows the probabilities of the four
bases.
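The output probabilities of such a model can be estimated by counting residues per alignment column. The sketch below assumes the five aligned rows read ACA---ATG, TCAACTATC, ACAC--AGC, AGA---ATC and ACCG--ATC, with columns 4 to 6 pooled into a single insert state. It uses plain relative frequencies; practical tools usually add pseudocounts so that unseen bases do not get probability zero.

```python
from collections import Counter

# Alignment rows as assumed above; '-' marks a gap.
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
INSERT_COLUMNS = range(3, 6)  # zero-based indices of columns 4-6


def emission_probabilities(residues):
    """Relative frequency of each base, ignoring gaps (no pseudocounts)."""
    counts = Counter(r for r in residues if r != "-")
    total = sum(counts.values())
    return {base: n / total for base, n in sorted(counts.items())}


columns = list(zip(*ALIGNMENT))

# One output distribution per match column.
match_emissions = [
    emission_probabilities(col)
    for i, col in enumerate(columns)
    if i not in INSERT_COLUMNS
]

# A single distribution for the insert state, pooled over its columns.
insert_emissions = emission_probabilities(
    r for i in INSERT_COLUMNS for r in columns[i]
)

for i, dist in enumerate(match_emissions, start=1):
    print(f"match state {i}: {dist}")
print(f"insert state: {insert_emissions}")
```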
Building – Final Topology
Deletion states
Matching states
Insertion states
No of matching states = average sequence length in the family
PFAM Database - of Protein families
(http://pfam.wustl.edu)
25
Database Searching
• Given an HMM, M, for a sequence family,
find all members of the family in a
database.
• LL score: LL(x) = log P(x|M)
(The LL score is length dependent – must
normalize or use Z-score)
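One simple way to read the two options is sketched below: divide the LL score by the sequence length, or express it as a Z-score relative to the LL scores of unrelated sequences of similar length. The background scores and the query values are made-up numbers, and the exact normalization used by any particular tool may differ.

```python
import math
import statistics


def length_normalized(ll, length):
    """Average log-likelihood per residue."""
    return ll / length


def z_score(ll, background_lls):
    """Standard deviations by which ll exceeds the background mean."""
    mu = statistics.mean(background_lls)
    sigma = statistics.stdev(background_lls)  # sample standard deviation
    return (ll - mu) / sigma


# Hypothetical LL scores of unrelated sequences of similar length.
background = [-30.0, -32.0, -28.0, -31.0, -29.0]
query_ll = -22.0   # hypothetical LL score of the query sequence
query_length = 10  # hypothetical query length

per_residue = length_normalized(query_ll, query_length)
z = z_score(query_ll, background)
print(per_residue, round(z, 2))
```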
26
Query a new sequence
Suppose I have a query protein sequence and want to know
which family it belongs to. There can be many paths
leading to the generation of this sequence. Need to find all
these paths and sum the probabilities.
Consensus sequence: ACAC--ATC
P(ACACATC) = 0.8x1 x 0.8x1 x 0.8x0.6 x 0.4x0.6 x 1x1 x
0.8x1 x 0.8 = 4.7 x 10^-2
Multiple Alignments
• Conceptually, try every possible path through the
model that would produce the target
sequences
– Keep the best one and its probability.
– Output: Sequence of match, insert and
delete states
• The Viterbi algorithm finds the best path by dynamic
programming, without enumerating every path
28
Building – unaligned sequences
• Baum-Welch Expectation-maximization
method
– Start with a model whose length matches the
average length of the sequences and with random
output and transition probabilities.
– Align all the sequences to the model.
– Use the alignment to alter the output and transition
probabilities
– Repeat. Continue until the model stops changing
• By-product: It produces a multiple alignment
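The re-estimation loop can be sketched for a small, fully connected HMM with discrete outputs. This is a generic Baum-Welch iteration on a single short sequence, not a profile HMM: the two-state model, the starting probabilities and the observation sequence are all illustrative assumptions, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(obs[0..t], state at time t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(obs[t+1..] | state at time t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One EM update; also returns the likelihood under the old parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


# Hypothetical observations (symbol indices 0/1) and starting parameters.
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.8, 0.2], [0.3, 0.7]]

likelihoods = []
for _ in range(10):
    pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
    likelihoods.append(likelihood)
print(likelihoods[0], likelihoods[-1])
```

Each iteration cannot decrease the likelihood of the training data, but the procedure may settle in a local maximum, so the result depends on the starting parameters.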
29
PHMM Example
An alignment of 30 short amino acid sequences chopped out
of an alignment of the SH3 domain. The shaded areas are the
most conserved and were represented by the main states in
the HMM. The unshaded area was represented by an insert
state.
30
Prediction of Protein
Secondary structures
• Prediction of secondary structures is needed
for the prediction of protein function.
• Analyze the amino-acid sequences of
proteins
• Learn secondary structures
– helix, sheet and turn
• Predict the secondary structures of
sequences
31
Advantages
• Characterize an entire family of sequences.
• Position-dependent character distributions and
position-dependent insertion and deletion gap
penalties.
• Built on a formal probabilistic basis
• Can make libraries of hundreds of profile HMMs and
apply them on a large scale (whole genome)
32
Limitations
• Markov Chains
– Probabilities of states are supposed to
be independent
– P(x) … P(y): P(y) must be independent of P(x), and
vice versa
– This usually isn't true
33
Limitations - contd
• Standard Machine Learning Problems
• Watch out for local maxima
– Model may not converge to a truly optimal
parameter set for a given training set
• Avoid over-fitting
– You’re only as good as your training set
– More training is not always good
34
CONCLUSION
• For links & slides
– www.evl.uic.edu/shalini/hmm/
35