Machine Learning
in the Study of
Protein Structure
Rui Kuang
Columbia University
Candidacy Exam Talk
May 5th, 2004
Committee: Christina S. Leslie (advisor), Yoav Freund, Tony Jebara
Table of contents
1. Introduction to protein structure and its prediction
2. HMM, SVM and string kernels
3. Machine learning in the study of protein structure
• Protein ranking
• Protein structural classification
• Protein secondary structural and conformational state prediction
• Protein domain segmentation
4. Conclusion and Future work
1. Introduction
2. HMM, SVM and string kernels
3. Topics
4. Conclusion and future work
Part 1: Introduction to Protein
Structure and Its Prediction
Thanks to Carl-Ivar Branden
and John Tooze
Why study protein structure
• Protein: derived from the Greek word proteios, meaning “of the first rank”; named in 1838 by Jöns J. Berzelius
• Crucial in all biological processes
• Function depends on structure, so structure can help us understand function
How to Describe Protein
Structure
• Primary structure: amino acid sequence
• Secondary structure: local structure elements
• Tertiary structure: packing and arrangement of
secondary structure, also called domain
• Quaternary structure: arrangement of several
polypeptide chains
Secondary Structure: Alpha Helix
Hydrogen bonds between C’=O at position n and NH at position n+i (i=3,4,5)
Secondary Structure: Beta Sheet
Antiparallel Beta Sheet
Parallel Beta Sheet
We can also have a mix of both.
Secondary Structure: Loop Regions
– Less conserved structure
– Insertions and deletions are more frequent
– Conformations are flexible
Tertiary Structure
Phi: rotation about the N–Cα bond
Psi: rotation about the Cα–C’ bond
Phi-Psi angle distribution
Protein Domains
• A polypeptide chain or a part of a
polypeptide chain that can fold
independently into a stable tertiary
structure.
Determination of Protein Structures
• Experimental determination (time consuming and expensive)
– X-ray crystallography
– Nuclear magnetic resonance (NMR)
• Computational determination [Schonbrun 2002 (B2)]
– Comparative modeling
– Fold recognition (‘threading’)
– Ab initio structure prediction (‘de novo’)
Sequence, Structure and Function [Domingues 2000 (B1)]
Sequence (1,000,000)
• >30% sequence similarity suggests strong structure similarity
• Remote homologous proteins can also share similar structure
Structure (24,000): discrete groups of folds with unclear boundaries
Function (ill-defined)
• Function associated with different structures
• A superfamily with the same fold can evolve into distinct functions
• 66% of proteins having a similar fold also have a similar function
Picture due to Michal Linial
Part 2: Hidden Markov Model, Support
Vector Machine and String Kernels
Thanks to Nello Cristianini
Hidden Markov Models for Modeling Protein [Krogh 1993(B3)]
Alignment → maximum likelihood or maximum a posteriori estimation → HMM
If we don’t know the alignment, use EM to train the HMM.
Hidden Markov Models for Modeling Protein [Krogh 1993(B3)]
• Probability of sequence x through path q
• Viterbi algorithm for finding the best path
• Can be used for sequence clustering, database search…
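As a rough illustration of the Viterbi step, here is a minimal sketch on a toy two-state HMM. The states, alphabet and probabilities are invented for the example and are not taken from Krogh's profile HMMs; a real profile HMM would have match, insert and delete states.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

def logs(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Toy model: all numbers below are made up for illustration.
states = ("X", "Y")
start = {"X": 0.5, "Y": 0.5}
trans = {"X": {"X": 0.9, "Y": 0.1}, "Y": {"X": 0.1, "Y": 0.9}}
emit = {"X": {"a": 0.9, "b": 0.1}, "Y": {"a": 0.1, "b": 0.9}}

log_prob, path = viterbi("aabb", states, logs(start), logs(trans), logs(emit))
```

Working in log space avoids numerical underflow on long sequences, which matters for protein-length inputs.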
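Support Vector Machine [Burges 1998(B4)]
• Related to structural risk minimization
• Linearly separable case
– Primal QP problem
Minimize $\frac{1}{2}\|w\|^2$
subject to $y_i (w \cdot x_i + b) \ge 1$ for all $i$
– Dual convex problem
Minimize $\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j - \sum_i \alpha_i$
subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$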
Support Vector Machine [Burges 1998(B4)]
• Kernel: a nice property of the dual QP problem is that it involves only inner products between feature vectors, so we can define a kernel function to compute them more efficiently
• Example:
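One standard illustration of the kernel idea (an added example, not necessarily the one shown on the original slide) is the degree-2 polynomial kernel on 2-D inputs: the kernel value computed directly from the inputs equals the inner product of an explicit 3-D feature map. The function names below are chosen for this sketch.

```python
import math

def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without building feature vectors."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 4.0)
k_direct = poly2_kernel(x, y)                             # (1*3 + 2*4)^2 = 121
k_explicit = dot(poly2_features(x), poly2_features(y))    # same value via features
```

For higher degrees or longer inputs the explicit feature space grows quickly, while the kernel evaluation stays a single inner product.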
String Kernels for Text Classification [Lodhi 2002(M2)]
• String subsequence kernel (SSK):
$\phi_u(s) = \sum_{\mathbf{i}:\, u = s[\mathbf{i}]} \lambda^{\,i_{|u|} - i_1 + 1}, \quad u \in \Sigma^n$
• A recursive computation of SSK has complexity O(n|s||t|), which is quadratic in the length of the input sequences and not practical.
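To make the definition concrete, the sketch below computes the subsequence kernel by brute force: every length-n subsequence occurrence contributes a decay factor raised to the span it covers in the string. This is the definition only, not the O(n|s||t|) recursion, and it is exponential in n; the function names are chosen for the example.

```python
from collections import defaultdict
from itertools import combinations

def ssk_features(s, n, lam):
    """Map each length-n subsequence u of s to the sum of lam**span over its occurrences."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1   # i_|u| - i_1 + 1
        feats[u] += lam ** span
    return feats

def ssk(s, t, n, lam):
    """Inner product of the two subsequence feature vectors."""
    fs, ft = ssk_features(s, n, lam), ssk_features(t, n, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)

# "cat" and "car" share only the subsequence "ca" (span 2 in both),
# so the kernel is lam**2 * lam**2 = lam**4.
k = ssk("cat", "car", n=2, lam=0.5)
```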
Part 3
Machine learning in the study of
protein structure
3.1 Protein ranking
3.2 Protein structural classification
3.3 Protein secondary structure and
conformational state prediction
3.4 Protein domain segmentation
Part 3.1 Protein Ranking
• Smith-Waterman
• SAM-T98
• BLAST/PSI-BLAST
• Rank Propagation
Please!!! Stand in order
Local alignment: Smith-Waterman algorithm
• For two strings x and y, a local alignment with gaps is:
• The score is:
• Smith-Waterman score:
Thanks to Jean Philippe
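A minimal sketch of the Smith-Waterman dynamic program follows. It assumes a simple match/mismatch score and a linear gap penalty, with values chosen for illustration; real protein comparisons would use a substitution matrix such as BLOSUM and affine gap costs.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j]: best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

score = smith_waterman("XXDYYXX", "ASDDYYQQEYY")  # best local match is "DYY"
```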
BLAST [Altschul 1997 (R1)]: a heuristic algorithm for matching DNA/Protein sequences
• Idea: True matches are likely to contain a short stretch of identity
[Figure: the query AKQDYYYYE… is cut into words (AKQ, KQD, QDY, DYY, YYY…); neighbor mapping expands each word into similar words with substitution score > T (AKQ → AKQ, SKQ, …; KQD → KQD, AQD, …); the words are searched against the protein database; each match (Query …DYY…, Target …ASDDYYQQEYY…) is then extended]
PSI-BLAST: Position-specific Iterated BLAST [Altschul 1997 (R1)]
• Only extend those double hits within a certain range.
• A gapped alignment uses dynamic programming to extend a central pair of aligned residues in both directions.
• PSI-BLAST can take a PSSM as input to search the database
SAM-T98 [Karplus 1999 (C3)]
[Flowchart: query sequence → BLAST search of the NR protein database → build alignment with hits → HMM → search; iterate 4 rounds → profile/alignment]
Local and Global Consistency [Zhou 2003 (M1)]
• Affinity matrix: $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$
• D is a diagonal matrix whose i-th entry is the sum of the i-th row of W: $S = D^{-1/2} W D^{-1/2}$
• Iterate $F(t+1) = \alpha S F(t) + (1-\alpha) Y$
• F* is the limit of the sequence {F(t)}: $y_i = \arg\max_{j \le c} F^*_{ij}$
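The iteration on this slide can be sketched in plain Python as below. The toy points, σ, α and iteration count are illustrative choices; setting the diagonal of W to zero follows Zhou et al. and is not stated on the slide. Unlabeled points are marked with None.

```python
import math

def label_spreading(points, labels, n_classes, sigma=1.0, alpha=0.9, iters=200):
    """Propagate labels with F(t+1) = alpha*S*F(t) + (1-alpha)*Y; return predicted classes."""
    n = len(points)
    # Affinity matrix W with a Gaussian kernel (zero diagonal: an assumption from Zhou et al.).
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # S = D^(-1/2) W D^(-1/2), where D holds the row sums of W.
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Y: one-hot rows for labeled points, zero rows for unlabeled points.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n)) + (1 - alpha) * Y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]

# Two well-separated clusters with one labeled point each.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (10.0, 0.0), (10.5, 0.0), (11.0, 0.0)]
labels = [0, None, None, 1, None, None]
predicted = label_spreading(points, labels, n_classes=2)
```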
Rank propagation [Weston 2004 (R2)]
• Protein similarity network:
– Graph nodes: protein sequences in the database
– Directed edges: an exponential function of the PSI-BLAST e-value (destination node as query)
– Activation value at each node: the similarity to the query sequence
$Y_{t+1} = K_q + \alpha K Y_t$
• Exploit the structure of the protein similarity network
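A minimal sketch of the propagation update on a three-protein toy network follows. The edge weights and α are made-up numbers, and reading K[j][i] as the weight of the edge from node j to node i is an assumption of this sketch, not something fixed by the slide.

```python
def rank_propagation(k_q, K, alpha=0.5, iters=50):
    """Iterate y <- k_q + alpha * (incoming weighted activations); return final activations.

    k_q[i]  : weight of the edge from the query to database protein i
    K[j][i] : weight of the edge from protein j to protein i (assumed orientation)
    """
    n = len(k_q)
    y = [0.0] * n
    for _ in range(iters):
        y = [k_q[i] + alpha * sum(K[j][i] * y[j] for j in range(n)) for i in range(n)]
    return y

# Protein 0 is close to the query, protein 1 is close to protein 0 only,
# protein 2 has a weak direct link to the query and no neighbors.
k_q = [0.8, 0.0, 0.1]
K = [
    [0.0, 0.9, 0.0],
    [0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
activation = rank_propagation(k_q, K)
ranking = sorted(range(len(k_q)), key=lambda i: -activation[i])
```

In this toy case protein 1 ends up ranked above protein 2 even though it has no direct similarity to the query, which is the effect of exploiting the network structure.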
Result
[Weston 2004 (R2)]
Part 3.2 Protein structural classification
• Fisher Kernel
• Mismatch Kernel
• ISITE Kernel
• SVM-Pairwise
• EMOTIF Kernel
• Cluster Kernels
Where are my relatives?
SCOP [Murzin 1995 (C1)]
[Figure: SCOP hierarchy (Fold → Superfamily → Family) showing how the positive and negative training and test sets are drawn]
Family: sequence identity > 30%, or functions and structures are very similar
Superfamily: low sequence similarity, but functional features suggest a probable common evolutionary origin
Common fold: same major secondary structures in the same arrangement with the same topological connections
CATH [Orengo 1997 (C2)]
• Class: secondary structure composition and contacts
• Architecture: gross arrangement of secondary structure
• Topology: similar number and arrangement of secondary structures and the same connectivity linking them
• Homologous superfamily
• Sequence family
Fisher Kernel [Jaakkola 2000 (C4)]
• An HMM (or more than one) is built for each family
• Derive the feature mapping from the Fisher scores of each sequence given an HMM H1:
$U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
$U_{ij} = \frac{E_j(i)}{e_j(i)} - \sum_k E_j(k)$
SVM-pairwise [Liao 2002 (C5)]
• Represent sequence P as a vector of pairwise similarity scores with all training sequences
• The similarity score could be a Smith-Waterman score or PSI-BLAST e-value.
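Mismatch Kernel [Leslie 2002 (C6)]
[Figure: the sequence AKQDYYYYE… is cut into k-mers (AKQ, KQD, QDY, DYY, YYY…); each k-mer, e.g. AKQ, is mapped to its mismatch neighborhood (AKQ, CKQ, DKQ, …, AAQ, AKY), giving the feature vector ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ) with the coordinates for AAQ, AKQ, DKQ and EKQ set to 1]
Implementation with suffix tree achieves linear time complexity $O(|\Sigma|^m k^{m+1}(|x|+|y|))$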
EMOTIF Kernel [Ben-Hur 2003 (C8)]
• EMOTIF TRIE built from eBLOCKS [Nevill-manning 1998 (C7)]
• EMOTIF feature vector: $\Phi(x) = (\phi_m(x))_{m \in M}$, where $\phi_m(x)$ is the number of occurrences of the motif m in x
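A motif-count feature vector of this kind can be sketched with regular expressions, as below. The three motifs are hypothetical patterns written for this example, not entries from the EMOTIF database, and a linear scan stands in for the trie lookup.

```python
import re

def motif_counts(seq, motifs):
    """Return one count per motif, counting overlapping occurrences."""
    # A zero-width lookahead matches at every start position where the motif occurs.
    return [len(re.findall(f"(?={motif})", seq)) for motif in motifs]

# Hypothetical motifs: bracketed groups list allowed residues, "." matches any residue.
motifs = ["[AS]KQ", "DY[YF]", "G.G"]
features = motif_counts("AKQDYYYYESKQ", motifs)
```

The kernel between two sequences is then the inner product of their count vectors.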
I-SITE Kernel [Hou 2003 (C10)]
• Similar to the EMOTIF kernel, the I-SITE kernel encodes protein sequences as a vector of confidence levels against structural motifs in the I-SITES library [Bystroff 1998 (C9)]
Cluster kernels
[Weston 2004 (C11)]
• Neighborhood Kernels
Implicitly average the feature vectors for sequences in
the PSI-BLAST neighborhood of input sequence
(dependent on the size of the neighborhood and total
length of unlabeled sequences)
• Bagged Kernels
Run bagged k-means to estimate p(x,y), the
empirical probability that x and y are in the same
cluster. The new kernel is the product of p(x,y)
and base kernel K(x,y)
Results
Part 3.3: Protein secondary structure and conformational state prediction
• PHD
• PSI-PRED
• PrISM
• HMMSTR
Can we really do that?
PHD: Profile network from
HeiDelberg [Rost 1993 (P1)]
Accuracy: 70.8%
PSIPRED
[Jones 1999 (P2)]
Accuracy: 76.0%
Conformational State
Prediction
PrISM
[Yang 2003 (P3)]
Prediction with this conformation library based on sequence
and secondary structure similarity, accuracy: 74.6%
HMMSTR [Bystroff 2000 (P4)]: a Hidden Markov Model for Local Sequence-Structure Correlations in Proteins
• I-sites motifs are modeled as Markov chains and merged into one compact HMM to capture grammatical structure
• The HMM can be used for gene finding, secondary or conformational state prediction, sequence alignment…
• Accuracy:
– Secondary structure prediction: 74.5%
– Conformational state prediction: 74.0%
Part 3.4: Protein domain segmentation
• DOMAINATION
• Pfam Database
• Multi-experts
Cut? where???
DOMAINATION [George 2002 (D1)]
• Get a distribution of both the N- and C-termini in the PSI-BLAST alignment at each position; positions with Z-score > 2 are potential domain boundaries
• Accuracy: 50% over 452 multi-domain proteins
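The termini-counting idea can be sketched as follows. This is a simplified reading of the bullet above, not the DOMAINATION implementation: the hit coordinates are invented, hits are given as (start, end) positions on the query, and the Z-score is computed over all positions of the query.

```python
from statistics import mean, pstdev

def termini_boundaries(hits, length, z_cutoff=2.0):
    """Return query positions whose count of alignment termini has Z-score > z_cutoff."""
    counts = [0] * length
    for start, end in hits:
        counts[start] += 1   # N-terminus of the aligned region
        counts[end] += 1     # C-terminus of the aligned region
    mu, sd = mean(counts), pstdev(counts)
    if sd == 0:
        return []
    return [pos for pos, c in enumerate(counts) if (c - mu) / sd > z_cutoff]

# Made-up hits on a 30-residue query: three end at position 14, three start at 15.
hits = [(2, 14), (0, 14), (1, 14), (15, 27), (15, 29), (15, 28)]
boundaries = termini_boundaries(hits, length=30)
```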
Pfam [Sonnhammer 1997 (D2)]
• A database of HMMs of domain families
• Pfam A: high quality alignments and HMMs built from known domains
• Pfam B: domains built by the Domainer algorithm from the remaining protein sequences after removal of Pfam-A domains
A multi-expert system from sequence information [Nagarajan 2003 (D3)]
[Diagram: seed sequence → BLAST search → multiple alignment; DNA data → intron boundaries; experts: sequence participation, secondary structure, entropy, correlation, contact profile, physio-chemical properties → neural network → putative predictions]
Results
[Nagarajan 2003 (D3)]
Part 4: Conclusion and Future Work
Mars is not
too far!?
Distribution of Paper Year
[Bar chart: x-axis Year (<1997, 1997 to 2004); y-axis Count (0 to 8)]
Conclusion
• Structural genomics plays an important role in understanding life
• Protein structure can be studied from different perspectives with different methods
• Machine learning is one of the most important tools for understanding genome data
• Protein structure prediction is a challenging task given the data we have now
Future Work
• Rank propagation with domain
activation regions
• Profile kernel with secondary
structure information for protein
classification
• Rank propagation for domain
segmentation
• Specialist algorithm for protein
conformational state prediction
The End
Determination of Protein Structures (back)
• X-ray crystallography
The interaction of x-rays with electrons arranged in a crystal can produce an electron-density map, which can be interpreted as an atomic model. Crystals are very hard to grow.
• Nuclear magnetic resonance (NMR)
Some atomic nuclei have a magnetic spin. The molecule is probed with radio frequency to get the distances between atoms. Only applicable to small molecules.
Hidden Markov Models for Modeling Protein [Krogh 1993(B3)] (back)
Build an HMM from unaligned sequences with the EM algorithm
1. Choose initial length and parameters
2. Iterate until the change of likelihood is small
– Calculate the expected number of times each transition or emission is used
– Maximize the likelihood to get new parameters
Support Vector Machine [Burges 1998(B4)] (back)
• With probability 1-η the bound holds
$R(\alpha) \le J(\alpha) = R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}$
– l is the number of data points
– h is VC dimension
• Structural Risk Minimization
– For each h_i,
– Get best α* = argmin R_emp(α)
– Choose model with min J(α*, h_i)
Thanks to Tony Jebara
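The bound and the model-selection step can be sketched numerically as below. The candidate models, given as (empirical risk, VC dimension) pairs, are invented numbers for illustration.

```python
import math

def vc_confidence(h, l, eta):
    """Second term of the bound: sqrt((h(log(2l/h) + 1) - log(eta/4)) / l)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def risk_bound(r_emp, h, l, eta=0.05):
    """J = empirical risk + VC confidence; holds with probability 1 - eta."""
    return r_emp + vc_confidence(h, l, eta)

# Structural risk minimization over hypothetical models of growing capacity:
# empirical risk falls as h grows, while the confidence term rises.
models = [(0.30, 5), (0.10, 50), (0.05, 500)]
l = 10000
best_r_emp, best_h = min(models, key=lambda model: risk_bound(model[0], model[1], l))
```

With these numbers the middle model is selected: its empirical risk is not the lowest, but the sum of the two terms is.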
EMOTIF Database
[Nevill-manning 1998 (C7)]
• A motif database of protein families
• Substitution groups from separation score
EMOTIF Database
[Nevill-manning 1998 (C7)] (back)
• All possible motifs are enumerated from sequence alignments
I-SITE Motif Library [Bystroff 1998 (C9)] (back)
• Sequence segments (3-15 amino acids long) are clustered via K-means
• Within each cluster, structure similarity is calculated in terms of dme and mda
$dme = \sqrt{\frac{\sum_{i=1}^{L-5} \sum_{j=i+5}^{L} (\alpha^{s1}_{i \to j} - \alpha^{s2}_{i \to j})^2}{N}}$
$mda(L) = \max_{i=1,L-1}(\Delta\phi_{i+1}, \Delta\psi_i)$
• Only those clusters with good dme and mda are refined and considered motifs afterwards
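A distance-matrix error of this kind can be sketched as below, under the assumption that it compares distances between residues i and j (taken here as α-carbon coordinates) in two segments s1 and s2, over pairs at least five positions apart. The coordinates in the example are invented.

```python
import math

def dme(coords1, coords2, min_sep=5):
    """Root-mean-square difference of intra-segment distances for pairs with j >= i + min_sep."""
    if len(coords1) != len(coords2):
        raise ValueError("segments must have the same length")
    L = len(coords1)
    if L <= min_sep:
        raise ValueError("segment too short for the chosen separation")
    total, n_pairs = 0.0, 0
    for i in range(L - min_sep):
        for j in range(i + min_sep, L):
            d1 = math.dist(coords1[i], coords1[j])   # distance in segment s1
            d2 = math.dist(coords2[i], coords2[j])   # distance in segment s2
            total += (d1 - d2) ** 2
            n_pairs += 1
    return math.sqrt(total / n_pairs)

# Invented 7-residue segments: a straight line, and the same line stretched by 2.
s1 = [(float(i), 0.0, 0.0) for i in range(7)]
s2 = [(2.0 * i, 0.0, 0.0) for i in range(7)]
error = dme(s1, s2)
```

Because only internal distances are compared, the measure does not require superimposing the two segments first.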
PrISM
[Yang 2003 (P3)] (back)
Pfam
[Sonnhammer 1997 (D2)] (back)
• Construction of Pfam A:
– Pick seed sequences from several sources
and build seed alignment
– Build HMM from seed alignment and use it to pull in new members and align them to the HMM to get full alignment
Pfam
[Sonnhammer 1997 (D2)]
(back)
• Construction of Pfam B:
– Domainer program merges homology segment
pairs into homologous segment sets together
with links. This graph is partitioned into
domains
– Use domainer program to build alignment from
all protein segments not covered by Pfam-A
• Incremental updating
– New sequences are added to the full alignment of existing models if they score above a threshold
– If a new sequence causes problems, the seed alignment will be altered and Pfam-B will be regenerated afterwards.
Sonnhammer, 1997