2/11/2010
CS 6784: STRUCTURAL SVM FOR PROTEIN SEQUENCE ALIGNMENT
Feb 11, 2010 Guest Lecture
Chun-Nam Yu

What are Proteins?
Amino Acids (20 types)
Any two amino acids can form peptide bonds to join together
Represented by 20 letters:
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
A typical protein sequence:

MDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESE
TTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITS
RTQFNSLQQLVAYYSKHADGLCHRLTTVCP

Typically between 100 and 1000 amino acids long

Secondary Structures
Fold into secondary structures:
Alpha helix
Beta sheet
Loop

Tertiary Structures (Folds)
On top of secondary structures, proteins fold into stable shapes
Shape determines function:
Oxygen transport
Building hair, muscle, etc.
Homology Modeling of Proteins
Given: new sequence
Similar sequence, known structure
Want to know: structure
Steps: align, then build model

The Learning Task
Sequence pair:
LYNWVAKDVEPPKFTEVTDVVLITRD
ADKVLKGEKVQAKYPVDLKLVVKQ
Alignment:
LYNWVAKDVEPPKFTEVTDVVLITRD
ADKV-LKGEKVQAKYPV-DLKLVVKQ
Sequence Alignments
Given sequence pairs: AECD and EACC
Weighted string edit distance: match, insert, delete
[Table: Substitution Cost Matrix]
Gap penalty d = 3 (each gap contributes -d = -3 to the score)

Alignment 1:
AECD
EACC
Score = -1 - 1 + 9 + 6 = 13

Alignment 2:
-AECD
EACC-
Score = -3 + 4 - 4 + 9 - 3 = 3

Smith-Waterman Algorithm
String Edit Distance Graph:
[Figure: grid graph with the letters of AECD along one axis and the letters of EACC along the other. Diagonal edges are substitutions, e.g. edge score = s(A,A) or edge score = s(A,C). Horizontal and vertical edges are gaps, edge score = -d.]
Find highest scoring path through the graph
Dynamic Programming
Recurrence:
$$F(i,j) = \max \begin{cases} F(i-1,\,j-1) + s(a_i, b_j) \\ F(i-1,\,j) - d \\ F(i,\,j-1) - d \end{cases}$$

More Complex Substitution Scores
In the simplest case the match score $s(a_i, b_j)$ is just a simple lookup over the BLOSUM matrix
Consider the feature map $\Psi(x,y)$ with one count per alignment operation. For Alignment 2 the nonzero entries are:
1   Align 'A' with 'A'
1   Align 'C' with 'C'
1   Align 'E' with 'C'
2   Gap
All other entries are 0.
Equivalent to BLOSUM: if each weight in $\vec{w}$ is the corresponding BLOSUM entry and the gap weight is $-d$, then $\vec{w} \cdot \Psi(x,y)$ is exactly the alignment score.
Consider More General Linear Scoring Rule (shown for Alignment 2):
$$\Psi(x,y) = \phi\left(x, \begin{pmatrix} - \\ E \end{pmatrix}\right) + \phi\left(x, \begin{pmatrix} A \\ A \end{pmatrix}\right) + \phi\left(x, \begin{pmatrix} E \\ C \end{pmatrix}\right) + \phi\left(x, \begin{pmatrix} C \\ C \end{pmatrix}\right) + \phi\left(x, \begin{pmatrix} D \\ - \end{pmatrix}\right)$$
Score of alignment $= \vec{w} \cdot \Psi(x,y)$
Can still use Smith-Waterman, as the score decomposes nicely into alignment operations in y: each edge of the string edit distance graph gets the score of its own operation, e.g. $\vec{w} \cdot \phi\left(x, \begin{pmatrix} C \\ C \end{pmatrix}\right)$.
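The recurrence can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lecture's implementation: the boundary conditions F(i,0) = -i·d and F(0,j) = -j·d (global alignment) are an assumption, since the slides give only the inner recurrence, and the local Smith-Waterman variant would additionally take a maximum with 0. The names `align`, `score_alignment` and `TOY_SCORES` are hypothetical. The score table uses the substitution scores that appear in the two worked examples; the remaining entries are filled in from BLOSUM62, which is also an assumption because the slide's matrix did not survive.

```python
# Substitution scores keyed by the sorted letter pair.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "E"): -1, ("C", "C"): 9,   # quoted in the worked scores
    ("C", "E"): -4, ("C", "D"): 6,                  # quoted in the worked scores
    ("A", "C"): 0, ("E", "E"): 5,                   # assumed fill-ins (BLOSUM62)
    ("D", "E"): 2, ("A", "D"): -2,                  # assumed fill-ins (BLOSUM62)
}
GAP_PENALTY = 3  # each gap contributes -3, as in the worked scores


def sub_score(p, q):
    """Look up s(p, q); the table is symmetric."""
    return TOY_SCORES[tuple(sorted((p, q)))]


def score_alignment(top, bottom, d=GAP_PENALTY):
    """Sum the per-column scores of an alignment written with '-' for gaps."""
    return sum(-d if "-" in (p, q) else sub_score(p, q)
               for p, q in zip(top, bottom))


def align(a, b, d=GAP_PENALTY):
    """Fill F(i, j) with the recurrence, then trace back one optimal path."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # assumed boundary: leading gaps
        F[i][0] = F[i - 1][0] - d
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best_score, best_top, best_bottom = align("AECD", "EACC")
```

`score_alignment("AECD", "EACC")` and `score_alignment("-AECD", "EACC-")` reproduce the worked scores 13 and 3. With the assumed fill-in entries the dynamic program finds AE-CD over -EACC with score 14; a different fill-in would generally give a different optimum.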

More Complex Substitution Scores
Suppose we also have information on secondary structures:
x = (AECD, EACC), with each residue annotated with its secondary structure
Extra feature counts:
$\Psi(x,y)$ gains entries that count how often one secondary-structure type is aligned with another. In the example, one such pair is aligned once (count 1) and another is aligned twice (count 2).
Our substitution score, e.g. $\vec{w} \cdot \phi\left(x, \begin{pmatrix} C \\ C \end{pmatrix}\right)$, includes secondary structure information too!
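The feature counts can be made concrete with a sparse count vector. The following is a minimal sketch, assuming a column-by-column alignment written with '-' for gaps. The function names, the feature keys, and the secondary-structure labels (`h` for helix, `s` for sheet) are hypothetical, since the annotations on the slide were images that did not survive.

```python
from collections import Counter

GAP = "-"


def joint_feature_map(top, bottom, ss_top=None, ss_bottom=None):
    """Psi(x, y) as a sparse Counter: one count per alignment operation.

    ss_top / ss_bottom are optional per-column secondary-structure labels
    (hypothetical annotation), written with '-' in gap columns.
    """
    psi = Counter()
    for k, (p, q) in enumerate(zip(top, bottom)):
        if p == GAP or q == GAP:
            psi["gap"] += 1
            continue
        psi[("sub",) + tuple(sorted((p, q)))] += 1
        if ss_top is not None and ss_bottom is not None:
            psi[("ss", ss_top[k], ss_bottom[k])] += 1
    return psi


def linear_score(w, psi):
    """w . Psi(x, y) for a sparse weight dict w; missing weights count as 0."""
    return sum(w.get(feature, 0) * count for feature, count in psi.items())


# Weights taken from the worked score of Alignment 2: 4, -4, 9, and -3 per gap.
w = {("sub", "A", "A"): 4, ("sub", "C", "E"): -4, ("sub", "C", "C"): 9, "gap": -3}

psi_plain = joint_feature_map("-AECD", "EACC-")
psi_with_ss = joint_feature_map("-AECD", "EACC-", ss_top="-hhhs", ss_bottom="hhhs-")
```

Here `linear_score(w, psi_plain)` gives 3, the score of Alignment 2, which is the sense in which the plain count features are equivalent to a matrix lookup. Adding weights for the `("ss", ...)` keys lets the same linear rule reward or penalize secondary-structure pairings.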
Structural Support Vector Machines
Structural SVM [Tsochantaridis et al. '04]
$$\min_{\vec{w},\,\vec{\xi}} \;\; \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
s.t. for $1 \le i \le n$, for all alignments $\hat{y} \in \mathcal{Y}$,
$$\vec{w} \cdot \Psi(x_i, y_i) - \vec{w} \cdot \Psi(x_i, \hat{y}) \ge \Delta(y_i, \hat{y}) - \xi_i$$
Example alignments of one sequence pair:

QLVESGGGVVQPGKSLRLSCAA
PEPVVAVALGA----SRQLTCR

QLVESGGGVVQPGKSLRLSCAA
PEPVVAVALGASR----QLTCR

QLVESGGGVVQPGKSLRLSCAA
PEPVVAVALGA---S-RQLTCR

The same formulation holds with "for all output structures $\hat{y} \in \mathcal{Y}$" in place of "for all alignments"
Convex optimization problem
$O\left(\binom{n+m}{m}\right)$ = exponentially many constraints
Can solve using cutting-plane algorithm
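A cutting-plane method repeatedly looks for the constraint that the current weights violate the most. The sketch below illustrates that single step over an explicit list of candidate outputs, which is an assumption made for readability. For alignments the candidates are exponentially many, so this search would presumably be carried out with the dynamic program instead of enumeration. The function name `most_violated` and the toy features and losses are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def most_violated(w, psi, loss, y_true, candidates):
    """Return (y_hat, violation) maximizing
    loss(y_true, y_hat) - (w . psi(y_true) - w . psi(y_hat)).

    A violation of 0 (with y_hat = y_true) means every constraint holds
    with slack xi_i = 0.
    """
    score_true = dot(w, psi(y_true))
    best_y, best_violation = y_true, 0.0
    for y_hat in candidates:
        violation = loss(y_true, y_hat) - (score_true - dot(w, psi(y_hat)))
        if violation > best_violation:
            best_y, best_violation = y_hat, violation
    return best_y, best_violation


# Toy example with made-up feature vectors and losses.
features = {"y": [2.0, 0.0], "a": [1.0, 1.0], "b": [2.5, 0.0]}
losses = {"y": 0.0, "a": 0.5, "b": 1.0}
w = [1.0, 0.0]

y_hat, violation = most_violated(
    w,
    psi=lambda y: features[y],
    loss=lambda y_true, y: losses[y],
    y_true="y",
    candidates=["a", "b"],
)
```

In this toy setting candidate `a` already satisfies its constraint (margin 1.0 against a loss of 0.5), while `b` scores higher than the correct output and is returned with violation 1.5. A cutting-plane solver would add that constraint to its working set and re-solve.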
Learning with Many Features
[Diagram relating three boxes: Simple Alignment Model; Complex Alignment Model with Many Features; Structural SVM Learning]

Feature Vectors (2 examples)
3 basic structural features:
Amino acid (e.g., A, N, P)
Secondary structure (e.g., alpha helix, beta sheet, loop)
Exposure to water (i.e., levels 1, 2, 3, 4, 5)
Anova2:
Pairwise feature interaction, e.g., s(A and its secondary structure, E and its secondary structure)
Window:
Consider the neighborhood of the aligned site, e.g., s(AEC, CED), and likewise for windows of secondary structures
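As an illustration of how such features might be keyed, here is a minimal sketch. It follows only the two examples given on the slide, so the exact feature set used in the cited work may differ. The function names, the padding symbol `*`, and the secondary-structure labels (`h`, `s`) are hypothetical.

```python
def anova2_feature(seq_a, ss_a, i, seq_b, ss_b, j):
    """Pairwise interaction key: (amino acid, structure) at site i of the
    first sequence against (amino acid, structure) at site j of the second."""
    return ("anova2", (seq_a[i], ss_a[i]), (seq_b[j], ss_b[j]))


def window_feature(seq_a, i, seq_b, j, half_width=1, pad="*"):
    """Neighborhood key: the residues around site i against those around
    site j, padded with `pad` past either end of a sequence."""
    def window(seq, k):
        return "".join(seq[t] if 0 <= t < len(seq) else pad
                       for t in range(k - half_width, k + half_width + 1))
    return ("window", window(seq_a, i), window(seq_b, j))


# The window example from the slide, s(AEC, CED), on made-up sequences.
example_window = window_feature("AECD", 1, "ACED", 2)
example_anova2 = anova2_feature("AECD", "hhss", 0, "ACED", "hsss", 2)
```

Each key would index one weight in $\vec{w}$, which is why the number of features grows quickly once windows and interactions are included.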
Loss Functions
Q-loss
Correct Alignment y:
-AECD
EACC-
Incorrect Alignment y':
A-ECD
EACC-
Q-loss = 1/3 (y' recovers 2 of the 3 aligned pairs in y)

Q4-loss (shift < 4)
Correct Alignment y:
-AECD
EACC-
Incorrect Alignment y':
A-ECD
EACC-
Q4-loss = 0 (the one misaligned pair is shifted by a single position)

Experiments
Training Set: ~5000 alignments [Qiu & Elber '06]
Test Set: ~30000 alignments from deposits to the Protein Data Bank between June 05 and June 06
All structural alignments produced by the program CE by superposition of 3D coordinates
[Image from pdb.org]
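The Q-loss and Q4-loss examples from the Loss Functions slide can be checked with a short sketch. It assumes one reading of the definitions: Q-loss is the fraction of aligned pairs in the correct alignment that the other alignment misses, and Q4-loss counts a pair as recovered when it is shifted by fewer than 4 positions. The function names are hypothetical.

```python
from fractions import Fraction


def aligned_pairs(top, bottom, gap="-"):
    """Map each aligned position of the first sequence to its partner
    position in the second sequence (0-based, gaps skipped)."""
    pairs = {}
    i = j = 0
    for p, q in zip(top, bottom):
        if p != gap and q != gap:
            pairs[i] = j
        if p != gap:
            i += 1
        if q != gap:
            j += 1
    return pairs


def q_loss(correct, predicted, max_shift=0):
    """Fraction of pairs in `correct` not recovered by `predicted`.

    A pair counts as recovered if the predicted partner is within
    `max_shift` positions of the correct one. max_shift=0 gives Q-loss;
    max_shift=3 gives Q4-loss under the 'shift < 4' reading.
    """
    ref = aligned_pairs(*correct)
    pred = aligned_pairs(*predicted)
    if not ref:
        return Fraction(0)
    hits = sum(1 for i, j in ref.items()
               if i in pred and abs(pred[i] - j) <= max_shift)
    return 1 - Fraction(hits, len(ref))


y = ("-AECD", "EACC-")
y_prime = ("A-ECD", "EACC-")
q = q_loss(y, y_prime)                 # Fraction(1, 3)
q4 = q_loss(y, y_prime, max_shift=3)   # Fraction(0, 1)
```

Both values match the slide: one of the three reference pairs is missed exactly, but it is off by a single position, so it counts as recovered under the Q4 criterion.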
Results on Alignment Accuracy
From [Yu, Joachims, Elber & Pillardy. RECOMB '07]
[Bar chart of Q4 score (axis from 65 to 72) for four models: Anova2 (49634), Window (447016), Profile (448642), and SSALN. Values shown on the bars: 71.32, 69.8, 68.04, 67.51.]
Anova2, Window, Profile: Structural SVM
SSALN: a strong generative model with hand-tuned features

Summary
An application of Structural SVM to a problem in computational biology
Discriminative training allows us to incorporate complex features into the alignment models to improve alignment accuracy
Showed how to design the feature vectors, loss functions, and inference procedures