PROTEIN SECONDARY &
SUPER-SECONDARY
STRUCTURE PREDICTION
WITH HMM
By En-Shiun Annie Lee
CS 882 Protein Folding
Instructed by Professor Ming Li
0 OUTLINE
1. Introduction
2. Problem
3. Methods (4)
4. HMM Examples (3)
a. Segmentation HMM
b. Profile HMM
c. Conditional Random Field
5. Proposal
1 INTRODUCTION
1 Genomics
• Achievements in Genomics
– BLAST (Basic Local Alignment Search Tool)
• most cited paper published in the 1990s
• cited more than 15,000 times
– Human Genome Project
• Completed April 2003
1 Proteomics
• Precedence to Proteomics
– Protein Data Bank (PDB)
• 40,132 structures
• cited more than 6,000 times
1 Proteomics
• Figure: Number of Protein Structures in Protein Data Bank
1 Secondary Structure
• Importance
– A known secondary structure can be used as input for tertiary structure prediction.
1 Protein Structure
• Primary Structure
1 Protein Structure
• Secondary Structure
1 Secondary Structure
• α-helix
– Hydrogen bond between residue i and residue i+4
1 Secondary Structure
• β-sheet/strand
– Parallel or anti-parallel
1 Secondary Structure
• Coil (loop)
1 Protein Structure
• Tertiary Structure
1 Protein Structure
• Super-Secondary (2.5) Structure
1 Protein Structure
• Quaternary Structure
2 PROBLEM
2 Secondary Structure
• Problem
– Given:
• A primary sequence of amino acids
– a1a2…an
– Find:
• Secondary structure of each ai as
– α-helix = H
– β-strand = E *
– coil = C
2 Secondary Structure
• Example
– Given:
• Primary Sequence
– GHWIATRGQLIREAYEDYRHFSSECPFIP
– Find:
• Secondary Structure Element
– CEEEEECHHHHHHHHHHHCCCHHCCCCCC
– Note: the states occur in contiguous segments
2 Prediction Quality
• Three-state prediction accuracy
– Q3 = (number of correctly predicted residues) / (total number of residues)
– Per-state accuracies: Qα, Qβ, Qc
– Q3 for random prediction is 33%
– Theoretical limit: Q3 = 90%
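The Q3 definition above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function names `q3` and `per_state_accuracy` are made up here, the observed string is the one from the example slide, and the prediction is a hypothetical one that misses the short helix.

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


def per_state_accuracy(predicted: str, observed: str, state: str) -> float:
    """Correctly predicted residues of one state divided by observed residues of that state."""
    positions = [i for i, o in enumerate(observed) if o == state]
    if not positions:
        raise ValueError(f"state {state!r} does not occur in the observed structure")
    return sum(predicted[i] == state for i in positions) / len(positions)


# Observed structure copied from the example slide.
observed = "CEEEEECHHHHHHHHHHHCCCHHCCCCCC"
# Hypothetical prediction: identical except that the two-residue helix is called coil.
predicted = observed.replace("CHHC", "CCCC")

print(f"Q3      = {q3(predicted, observed):.3f}")
for state in "HEC":
    print(f"Q_{state}     = {per_state_accuracy(predicted, observed, state):.3f}")
```

A single Q3 value hides which state the errors fall in, which is why the per-state values are reported next to it.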
2 Prediction Quality
• Segment Overlap (SOV)
– Higher penalties for core segment regions
• Matthews Correlation Coefficients (MCC)
– Prediction errors made for each state
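As a rough sketch of the per-state MCC (SOV is not shown), the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) can be applied to one state at a time. The function name `mcc`, the toy strings, and the choice to return 0.0 when the denominator is zero are assumptions made for this example.

```python
import math


def mcc(predicted: str, observed: str, state: str) -> float:
    """Matthews correlation coefficient for one state, treating it as a binary label."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if p == state and o == state:
            tp += 1          # state predicted and observed
        elif p == state:
            fp += 1          # state predicted but not observed
        elif o == state:
            fn += 1          # state observed but not predicted
        else:
            tn += 1          # state neither predicted nor observed
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical observed and predicted structures.
observed = "HHHHCCEEEECC"
predicted = "HHHCCCEEEEEC"
for state in "HEC":
    print(state, round(mcc(predicted, observed, state), 3))
```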
2 True Structures
• Three-dimensional PDB data
– DSSP (Dictionary of Secondary Structure of Proteins)
• 8 states, each reduced to one of 3 states
– H = alpha helix → H
– G = 3₁₀-helix → H
– I = 5-helix (pi helix) → H
– E = extended strand (beta ladder) → E
– B = residue in isolated beta-bridge → E
– T = hydrogen-bonded turn → C
– S = bend → C
– C = coil → C
– STRIDE
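The reduction from the eight DSSP states to three can be written as a lookup table. The mapping below follows the H/E/C column on the slide (H, G, I to H; E, B to E; T, S, C to C); other reductions exist, so treat it as one possible convention. The name `reduce_dssp` and the input string are hypothetical.

```python
# Eight-state DSSP symbol -> three-state symbol, following the slide's column.
DSSP_TO_THREE_STATE = {
    "H": "H",  # alpha helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi helix
    "E": "E",  # extended strand
    "B": "E",  # isolated beta-bridge
    "T": "C",  # hydrogen-bonded turn
    "S": "C",  # bend
    "C": "C",  # coil
}


def reduce_dssp(eight_state: str) -> str:
    """Translate an eight-state DSSP string into a three-state H/E/C string."""
    try:
        return "".join(DSSP_TO_THREE_STATE[symbol] for symbol in eight_state)
    except KeyError as exc:
        raise ValueError(f"unknown DSSP symbol: {exc.args[0]!r}") from None


# Hypothetical eight-state assignment for a 17-residue fragment.
print(reduce_dssp("CSEEEETTGGGHHHHBC"))
```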
3 METHODS
3 Sliding Window
• Sliding-Window
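The slides show the sliding window only as figures. In its usual form, the state of each residue is predicted from a fixed-width window of residues centred on it. The sketch below assumes that formulation; the function name `sliding_windows`, the default width of 13, and the `-` padding symbol are all choices made for this example.

```python
def sliding_windows(sequence: str, width: int = 13, pad: str = "-") -> list[str]:
    """Return one window per residue, centred on that residue and padded at the ends."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    padded = pad * half + sequence + pad * half
    # Window i covers padded[i : i + width]; its middle symbol is sequence[i].
    return [padded[i:i + width] for i in range(len(sequence))]


# Primary sequence from the example slide, with a narrow window for readability.
for residue, window in zip("GHWIATRGQL", sliding_windows("GHWIATRGQL", width=5)):
    print(residue, window)
```

Each window would then be handed to a predictor (statistical, neural network, SVM) that outputs the state of the central residue.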
3 Four Methods
a. Statistical Method
b. Neural Network
c. Support Vector Machine
d. Hidden Markov Model
3a Statistical Method
• Propensity: how strongly each amino acid favours each state
• Ex. Chou-Fasman: 50~53% accuracy
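A propensity can be estimated from labelled data as the fraction of an amino acid's occurrences found in a state, divided by the overall fraction of residues in that state. The sketch below shows only this counting step, not the full Chou-Fasman procedure; the function name `propensities` and the toy data are hypothetical.

```python
from collections import Counter


def propensities(sequence: str, structure: str) -> dict[tuple[str, str], float]:
    """Propensity of each (amino acid, state) pair: P(state | amino acid) / P(state)."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    total = len(sequence)
    residue_counts = Counter(sequence)
    state_counts = Counter(structure)
    pair_counts = Counter(zip(sequence, structure))
    return {
        (aa, state): (pair_counts[(aa, state)] / residue_counts[aa])
        / (state_counts[state] / total)
        for aa in residue_counts
        for state in state_counts
    }


# Toy data: far too small for real statistics, used only to show the arithmetic.
table = propensities("AEAEGGPG", "HHHHCCCC")
for (aa, state), value in sorted(table.items()):
    print(aa, state, value)
```

A value above 1 means the amino acid occurs in that state more often than average; a value below 1 means it occurs less often.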
3b Neural Network
• Ex. PHD: 71% accuracy
3c SVM
• Ex. PSIPRED: 76~78% accuracy
3d HMM Definition
• State set Q
• Output alphabet Σ
3d HMM Definition
• Transition probabilities
– Tq(p) = probability of entering state p from state q
• q ∈ Q
• p ∈ Q
3d HMM Definition
• Emission probabilities
– Eq(ai) = probability that state q emits letter ai of Σ
• ∀ ai ∈ Σ
• q ∈ Q
3d HMM Decoding
• Problem
– Given:
• HMM = (Q, Σ, E, T) and
• Sequence S
– Where S = S1, S2, …, Sn
– Find:
• Most probable path of states that generates S
– Where X = X1, X2, …, Xn = state sequence
4 HMM Decoding
• Optimize
– Pr [ S , X ]
• X = X1, X2, …, Xn = state sequence
• S = S1, S2, …, Sn
– Pr [ S | X ]
4 HMM Decoding
• Dynamic programming
– Memoryless: each step depends only on the previous state
– Pr[X_1..X_n, S_1..S_n] = Pr[X_1..X_{n-1}, S_1..S_{n-1}] · T_{X_{n-1}}(X_n) · E_{X_n}(S_n)
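The dynamic-programming recurrence above is the basis of what is usually called the Viterbi algorithm. Below is a minimal sketch in log space. The two-state model, its probabilities, the two-letter alphabet, and the explicit start distribution are all assumptions made for illustration; the slides define the HMM only as (Q, Σ, E, T).

```python
import math


def viterbi(sequence, states, start, transition, emission):
    """Return (most probable state path, its log joint probability) for `sequence`."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    # best[i][q]: log-probability of the best path that emits sequence[:i+1] and ends in q.
    best = [{q: math.log(start[q]) + math.log(emission[q][sequence[0]]) for q in states}]
    back = []  # back[i][p]: best predecessor of state p at position i + 1
    for symbol in sequence[1:]:
        scores, pointers = {}, {}
        for p in states:
            prev = max(states, key=lambda q: best[-1][q] + math.log(transition[q][p]))
            scores[p] = (best[-1][prev] + math.log(transition[prev][p])
                         + math.log(emission[p][symbol]))
            pointers[p] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda q: best[-1][q])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best[-1][last]


# Hypothetical toy model: A favours helix (H), G favours coil (C).
states = ["H", "C"]
start = {"H": 0.5, "C": 0.5}
transition = {"H": {"H": 0.9, "C": 0.1}, "C": {"H": 0.1, "C": 0.9}}
emission = {"H": {"A": 0.8, "G": 0.2}, "C": {"A": 0.2, "G": 0.8}}

path, log_p = viterbi("AAAGGG", states, start, transition, emission)
print(path, math.exp(log_p))
```

Working with logarithms avoids numerical underflow on long sequences. All probabilities in this sketch must be strictly positive, since `math.log(0)` raises an error.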
4 HMM EXAMPLES
4a SEMI-HMM
4a Semi-HMM
• Definition
– Each state can emit a sequence of letters, not just one
– Move emission probabilities into states
– Model secondary structure segments
4a Segmentation
• Sequence Segments
• T = secondary structural type of each segment, from {H, E, L}
• S = end position of each structural segment
• R = known amino acid sequence
4a Segmentation
• Sequence Segments
• T2 = E = β-strand
• S2 = 9
• R2 = residues S1 + 1 through S2
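The (S, T) representation can be made concrete by converting between per-residue labels and segments. The helper names below are made up for this sketch, and the label string and residue string are hypothetical, chosen so that the second segment is a β-strand ending at position 9 as on the slide.

```python
from itertools import groupby


def to_segments(labels: str) -> tuple[list[int], list[str]]:
    """Return (S, T): 1-based end position and type of each maximal run of labels."""
    ends, types, position = [], [], 0
    for state, run in groupby(labels):
        position += len(list(run))
        ends.append(position)
        types.append(state)
    return ends, types


def to_labels(ends: list[int], types: list[str]) -> str:
    """Inverse of to_segments: expand (S, T) back into one label per residue."""
    labels, start = [], 0
    for end, state in zip(ends, types):
        labels.append(state * (end - start))
        start = end
    return "".join(labels)


def segment_residues(residues: str, ends: list[int], j: int) -> str:
    """Residues of the j-th segment (1-based): positions S_{j-1} + 1 through S_j."""
    start = ends[j - 2] if j > 1 else 0
    return residues[start:ends[j - 1]]


labels = "L" + "E" * 8 + "LL" + "HHHH"  # hypothetical 15-residue labelling
S, T = to_segments(labels)
print(S, T)
print(segment_residues("ABCDEFGHIJKLMNO", S, 2))  # placeholder residue letters
```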
4a Bayesian
• Bayesian Formulation
• R = sequence of ALL amino acid residues
• S = end positions of the segments
• T = secondary structural types of the segments
– {H, E, L}
4a Bayesian
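• Bayesian Formulation
– P(S, T | R) = P(R | S, T) · P(S, T) / P(R)
1. Likelihood: P(R | S, T)
2. Prior probability: P(S, T)
3. Normalizing constant P(R): does not depend on (S, T), so it is dropped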
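Dropping the constant does not change which segmentation scores highest, because every candidate is divided by the same value. The sketch below checks this on a toy example. The three candidate segmentations and all likelihood and prior values are invented for illustration, and the function names are hypothetical.

```python
import math


def map_segmentation(candidates):
    """Pick the candidate maximizing log likelihood + log prior (constant dropped)."""
    return max(
        candidates,
        key=lambda c: math.log(candidates[c][0]) + math.log(candidates[c][1]),
    )


def posteriors(candidates):
    """Normalized posterior over the listed candidates only."""
    joint = {c: likelihood * prior for c, (likelihood, prior) in candidates.items()}
    evidence = sum(joint.values())  # plays the role of the dropped constant
    return {c: value / evidence for c, value in joint.items()}


# Keys are (S, T) pairs; values are (likelihood, prior). All numbers are made up.
candidates = {
    ((4, 8), ("H", "L")): (2e-6, 0.30),
    ((3, 8), ("E", "L")): (5e-6, 0.05),
    ((8,), ("L",)): (1e-7, 0.65),
}

best = map_segmentation(candidates)
normalized = posteriors(candidates)
print(best, round(normalized[best], 3))
```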
4a Bayesian
• Likelihood
• m = total number of segments
• Sj = end of the jth segment
• Tj = secondary structural type of the jth segment
4a Bayesian
• Likelihood
– Positions within a segment: N-terminus, internal, C-terminus
4a BSPPS
• Bayesian Segmentation PPS
4a Results
• Better than PSIPRED
– (without homology information)
4b PROFILE-HMM
4b Profile HMM
• Main States
– Columns of alignment
4b Profile HMM
• Insertion States
4b Profile HMM
• Deletion States
– Jump over one or more columns of the alignment
4b Profile HMM
• Combined
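The combined model can be pictured as a graph over the three kinds of states. The sketch below builds one common textbook layout, in which each main, insertion, and deletion state can move to the next main state, the insertion state at the same position, or the next deletion state. The exact wiring on the original slide's figure is not recoverable, so this layout, the `Begin`/`End` states, and the state names are assumptions.

```python
def profile_hmm_topology(n_columns: int) -> dict[str, list[str]]:
    """Map each state to its successors for a profile HMM over n_columns alignment columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")

    def main(k: int) -> str:
        # Begin and End act as main states 0 and n_columns + 1.
        if k == 0:
            return "Begin"
        if k == n_columns + 1:
            return "End"
        return f"M{k}"

    transitions = {}
    for k in range(n_columns + 1):
        sources = [main(k), f"I{k}"] + ([f"D{k}"] if k >= 1 else [])
        targets = [main(k + 1), f"I{k}"] + ([f"D{k + 1}"] if k < n_columns else [])
        for source in sources:
            transitions[source] = list(targets)
    return transitions


for state, successors in profile_hmm_topology(2).items():
    print(f"{state:5s} -> {', '.join(successors)}")
```

A run of consecutive deletion states (D1, D2, ...) is how the model jumps over more than one column, and the self-loop on each insertion state allows insertions of any length.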
4b HMMSTR
• HMM for local protein STRucture
• Pronounced “hamster”
4b I-Site Library
• I-sites Library
– Motif = short basic structural fragment
• 3~19 residues
• 262 motifs
• Highly predictable
– Non-redundant PDB data (<25% similarity)
– Folds uniquely across protein families
– Exhaustive motif clustering
4b Build HMM
• States
– Amino acid sequence and
– Structural attribute
• Transition from state
– Adjacent positions in motif
– No gap or insertion states
4b Build HMM
• Emission probability distributions
– b = observed amino acid
• (20 probability values)
– d = secondary structure
• (helix, strand, loop)
– r = backbone angle region
• (11 dihedral angle symbols)
– c = structural context descriptor
• (10 context symbols)
4b Build HMM
• Model I-site Library
– Each of the 262 motifs is a chain in the HMM
• Merge states
– based on similarity of
• Sequence
• Structure
4b HMMSTR Merge
• Ex. β-Hairpin
– Serine β-Hairpin
– Type-I β-Hairpin
4b HMMSTR Training
• Input: PDB proteins
• Find
– best state sequence for each sequence
– probability distribution of one amino acid
• Integrate 3 data sets
– Aligned probability distribution
– Amino acid and context information
– Contact map
4b HMMSTR Summary
• 282 nodes
• 317 transitions
• 31 merged motifs
4b HMMSTR Summary
• Introduces structural context at the level of super-secondary structure
• Predicts higher-order 3D tertiary structure
– Side result: predicts 1D secondary structure
4c CONDITIONAL RANDOM FIELD
4c HMM Disadvantages
• Does not model
– Multiple interacting features
– Long-range dependencies
• Strict independence assumptions
4c Conditional Model
• Allow
– Arbitrary features
– Non-independent features
• Transition probability
– May depend on past and future observations
4c Conditional Model
• Figure: graphical models of an HMM and a CRF over labels y1…y6 and observations x1…x6
4c Random Field
• Random Field (undirected graphical model)
– Let G = (Y, E) be a graph
• Where each vertex Yv is a random variable
– If P(Yv | all other Y) = P(Yv | neighbours of Yv), then Y is a random field
4c Random Field
• Example:
– P(Y5 | all other Y) = P(Y5 | Y4, Y6)
4c Conditional RF
• Conditional Random Field
– Let X = random variable over data sequences to be labeled
• observations
– Let Y = random variable over the corresponding label sequences
• labels
– Let G = (V, E) be a graph
• such that Y = (Yv), v ∈ V, so Y is indexed by the vertices of G
– If P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v), then (X, Y) is a conditional random field
4c Conditional RF
• Example:
– P(Y3 | X, all other Y) = P(Y3 | X, Y2, Y4)
4c HMM vs. CRF
• HMM:
– Maximize P(x,y|θ) = P(y|x,θ)P(x|θ)
– Transition and emission probabilities
– Transitions/emissions based on only one x
• CRF:
– Maximize P(y|x,θ)
– Feature function f(i, j, k)
– Feature functions based on all of x
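To make the contrast concrete, here is a brute-force sketch of a linear-chain CRF: it scores a labelling with weighted feature functions that may look at the whole observation sequence, then normalizes over all labellings to obtain P(y|x). The feature signature `(previous label, current label, x, position)`, the two features, and their weights are assumptions for this example and are not the slide's f(i, j, k). Enumerating all labellings is only feasible for very short sequences.

```python
import math
from itertools import product

LABELS = "HEC"


def same_as_previous(prev, cur, x, i):
    """Hypothetical transition-style feature: rewards staying in the same state."""
    return 1.0 if prev == cur else 0.0


def coil_near_proline(prev, cur, x, i):
    """Hypothetical feature using past and future observations around position i."""
    window = x[max(0, i - 1): i + 2]
    return 1.0 if cur == "C" and "P" in window else 0.0


FEATURES = [same_as_previous, coil_near_proline]
WEIGHTS = [1.0, 2.0]  # made-up weights; in practice they are learned from data


def score(x, y, features, weights):
    """Unnormalized log-score: weighted sum of all features over all positions."""
    total = 0.0
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else None
        total += sum(w * f(prev, y[i], x, i) for f, w in zip(features, weights))
    return total


def conditional_probability(x, y, features, weights):
    """P(y | x), normalizing by brute force over every possible labelling."""
    z = sum(math.exp(score(x, candidate, features, weights))
            for candidate in product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, features, weights)) / z


x = "AAPA"
for y in ("HHHH", "HCCC"):
    print(y, conditional_probability(x, y, FEATURES, WEIGHTS))
```

Nothing here models P(x): only the labels are given a distribution, which is what lets the features inspect any part of x without an independence assumption.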
4c Beta-Wrap
• β-Helix
– 3 parallel β-strands
– Connected by coils
• Few solved structures
– 9 SCOP superfamilies
– 14 solved right-handed (RH) structures in PDB
– Solved structures differ widely
4c Graph Definition
• Let G = (V,E1,E2) be a graph
– V = Nodes/States = Secondary structures
– Edges = interactions
• E1
– Edges between adjacent neighbors
– Implied in the model
• E2
– Edges for long-term interactions
– Explicitly considered
4c Beta-Wrap Example
• Simple Example:
– S2 = first β-strand
– S3 = coil
– S4 = second β-strand
– S5 = coil
– S6 = α-helix
4c Beta-Wrap
• β-Helix Solution:
5 PROPOSAL
5 Difficulties
• Do not infer global interactions
– e.g. β-sheet interactions
• Protein structure definition constraint
5 Possible Future Work
• Novel methods of secondary structure prediction
– Model as Integer Programming
• Super-secondary structure prediction
5 Acknowledgement
• Professor Ming Li
– Guidance in
• knowledge and
• expertise
• Bioinformatics lab
• Mentoring a “rookie”
• Class
• Attention and listening