Directions in Protein Contact Map Mining
Mohammed J. Zaki
Computer Science Dept.
joint work with
Jingjing Hu & Xiaolan Shen, CS Dept.
Yu Shao & Prof. Chris Bystroff, Biology Dept.
Rensselaer Polytechnic Institute, Troy NY
Protein Structures

Primary structure
  Un-branched polymer
  20 side chains (residues or amino acids)
  PDB file 2IGD: MTPAVTTYSLVINGLTLSGU…..
Higher order structures
  Secondary: local (consecutive) in sequence
  Tertiary: 3D fold of one polypeptide chain
  Quaternary: Chains packing together
[Figure: PDB protein 2IGD, with anti-parallel beta sheets, alpha helix, and parallel beta sheets labeled]
The Protein Folding Problem
Contact Map



Amino acids Ai and Aj are in contact if their 3D distance is less than a contact threshold (e.g., 7 Angstroms)
Sequence separation is given as |i-j|
Contact map C is a symmetric N x N matrix with
  C(i,j) = 1 if Ai and Aj are in contact
  C(i,j) = 0 otherwise
Consider all pairs with |i-j| >= 4
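The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `contact_map`, its parameters, and the toy hairpin coordinates are assumptions, and which atom represents each residue is left to the caller.

```python
import math

def contact_map(coords, threshold=7.0, min_separation=4):
    """Return a symmetric N x N 0/1 contact map as a list of lists.

    coords holds one (x, y, z) point per residue. Pairs closer in
    sequence than min_separation are left at 0.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy hairpin (hypothetical data): four residues out, four residues back,
# 3 Angstroms apart along the chain and 5 Angstroms between the two strands.
toy = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (9, 0, 0),
       (9, 5, 0), (6, 5, 0), (3, 5, 0), (0, 5, 0)]
cmap = contact_map(toy)
for row in cmap:
    print("".join(map(str, row)))
```

Residues 3 and 4 are only 5 Angstroms apart in this toy chain, but the pair is skipped because its sequence separation is below 4.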
[Figure: Contact Map (2IGD), axes Amino Acid Ai and Amino Acid Aj, with parallel beta sheets, anti-parallel beta sheets, and alpha helix regions labeled]
Characterizing Physical, Protein-like Contact Maps

A very small subset of all contact maps codes for physically possible proteins (self-avoiding, globular chains)
A contact map must:
  Satisfy geometric constraints
  Represent low-energy structure
Characterizing Physical Contact Maps in Proteins

What are the typical non-local interactions?
  Frequent dense 0/1 sub-matrices in contact maps
3-step approach
  Dense pattern mining
  Pruning mined patterns
  Clustering dense patterns (non-local pattern signatures)
Dense Pattern Mining

Frequent 2D Pattern Mining




Use WxW sliding window; W window size
Measure density under each window
About (N-W)^2 / 2 possible windows for a protein of length N
Look for “minimum density” (number of 1’s)
  scale away from diagonal
Try different window sizes
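A minimal sketch of the sliding-window scan follows. The function name `dense_windows` and the demo map are hypothetical. Two simplifications are assumptions made here: only windows lying fully above the diagonal are scanned (the map is symmetric), and the minimum density is a constant rather than scaled with distance from the diagonal.

```python
def dense_windows(cmap, w=5, min_density=5):
    """Yield (i, j, window) for each w x w window holding at least min_density 1's.

    (i, j) is the top-left corner of the window. Only windows with
    j >= i + w are visited, so no window overlaps the main diagonal.
    """
    n = len(cmap)
    for i in range(n - w + 1):
        for j in range(i + w, n - w + 1):
            window = [row[j:j + w] for row in cmap[i:i + w]]
            if sum(map(sum, window)) >= min_density:
                yield i, j, window

# Hypothetical 12-residue map with five contacts running parallel to the diagonal.
n = 12
demo = [[0] * n for _ in range(n)]
for k in range(5):
    demo[k][5 + k] = demo[5 + k][k] = 1

hits = [(i, j) for i, j, _ in dense_windows(demo)]
print(hits)
```

Trying several window sizes, as the slide suggests, amounts to calling the generator with different values of `w`.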
Counting Dense Patterns

Naïve Approach: for W=5, N=60 there are 1485 windows per protein.
About 28 million possible windows for 18,544 proteins (in PDB)
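The counts on this slide can be checked with a little arithmetic. The 1485 figure equals the number of unordered pairs drawn from N-W = 55 offsets, which is close to the (N-W)^2 / 2 estimate; reading it this way is an interpretation, not something the slides state.

```python
from math import comb

N, W, PROTEINS = 60, 5, 18544

per_protein = comb(N - W, 2)      # 55 * 54 / 2
estimate = (N - W) ** 2 / 2       # the (N-W)^2 / 2 approximation
total = per_protein * PROTEINS    # windows over all proteins

print(per_protein)  # 1485
print(estimate)     # 1512.5
print(total)        # 27537840
```

The total of roughly 27.5 million corresponds to the slide's rounded figure of 28 million.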

Test if two sub-matrices are equal
  Linear search: O(P x W^2) with P current dense patterns
  Hash based: O(W^2)
Our Approach: 2-level Hashing
  O(W) time
Pattern (WxW Sub-matrix) Encoding

Encode sub-matrix as string (W ints)

Sub-matrix   Integer Value
00000        0
01100        12
01000        8
01000        8
00000        0

Concatenated String: 0.12.8.8.0
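The encoding can be illustrated with a short sketch: each row is read as a binary number and the W values are joined with dots. The helper names `row_values` and `string_id` are assumptions introduced for illustration.

```python
def row_values(window):
    """Read each row of a 0/1 window as a binary number, most significant bit first."""
    return [int("".join(str(bit) for bit in row), 2) for row in window]

def string_id(window):
    """Join the W row values with dots to form the pattern's string."""
    return ".".join(str(v) for v in row_values(window))

pattern = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(row_values(pattern))  # [0, 12, 8, 8, 0]
print(string_id(pattern))   # 0.12.8.8.0
```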
Two-level Hashing

String-ID(M) = v1.v2.....vW
Level 1 (approximate): $h_1(M) = \sum_{i=1}^{W} v_i$
Level 2 (exact): $h_2(M)$ = String-ID(M)
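One way to picture the two levels is a table keyed first by the row sum and then by the exact string. The sketch below is an assumption about how such a table could be organized, not the authors' data structure; the class name `PatternTable` and its methods are hypothetical.

```python
from collections import defaultdict

def row_values(window):
    """Read each row of a 0/1 window as a binary number."""
    return [int("".join(map(str, row)), 2) for row in window]

class PatternTable:
    """Support counts indexed by h1 (sum of row values), then h2 (String-ID)."""

    def __init__(self):
        self.buckets = defaultdict(dict)  # h1 -> {String-ID -> support}

    def add(self, window):
        values = row_values(window)
        h1 = sum(values)                  # level 1: approximate, may collide
        h2 = ".".join(map(str, values))   # level 2: exact identity
        bucket = self.buckets[h1]
        bucket[h2] = bucket.get(h2, 0) + 1
        return h2

    def support(self, window):
        values = row_values(window)
        h2 = ".".join(map(str, values))
        return self.buckets.get(sum(values), {}).get(h2, 0)

a = [[0,0,0,0,0], [0,1,1,0,0], [0,1,0,0,0], [0,1,0,0,0], [0,0,0,0,0]]  # 0.12.8.8.0
b = [[0,0,0,0,0], [0,1,0,0,0], [0,1,1,0,0], [0,1,0,0,0], [0,0,0,0,0]]  # 0.8.12.8.0

table = PatternTable()
for window in (a, a, b):
    table.add(window)
print(table.support(a), table.support(b))  # 2 1
```

Patterns `a` and `b` share the level-1 value 28 but have different strings, so they land in the same bucket and are separated at level 2.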
Binding Patterns to Protein Sequence and Structure

StringID: 0.12.8.8.0, Support = 170 (window size W=5)
00000
01100
01000
01000
00000

Occurrences:
pdb-name   (X,Y)   X_sequence   Y_sequence   Interaction
1070.0     52,30   ILLKN        TFVRI        alpha::beta
1145.0     51,13   VFALH        GFHIA        alpha::strand
1251.2     42,6    EVCLR        GSKFG        alpha::strand
1312.0     54,11   HGYDE        ATFAK        alpha::beta
1732.0     49,6    HRFAK        KELAG        alpha::beta
2895.0     49,7    SRCLD        DTIYY        alpha::beta
...
Frequent Dense Local Patterns
[Figure: example 0/1 sub-matrices of frequent dense local patterns]
Pruning Patterns
Same pattern (shifted to right) but different String-IDs

00000   00000   00000
01000   00100   00010
01000   00100   00010
01000   00100   00010
00000   00000   00000
Merge horizontally or vertically shifted patterns
Prune away the local patterns (alpha/beta)
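Shift-equivalent patterns can be detected by translating each pattern to a canonical position. The sketch below moves the 1's so that they touch the top and left edges of the window; this is one possible way to recognize shifted copies, not necessarily the merging procedure used by the authors, and the names `normalize` and `parse` are hypothetical.

```python
def normalize(window):
    """Translate the 1's of a square 0/1 window so they touch the top and left edges."""
    w = len(window)
    ones = [(r, c) for r in range(w) for c in range(w) if window[r][c]]
    if not ones:
        return tuple(tuple(row) for row in window)
    dr = min(r for r, _ in ones)
    dc = min(c for _, c in ones)
    out = [[0] * w for _ in range(w)]
    for r, c in ones:
        out[r - dr][c - dc] = 1
    return tuple(tuple(row) for row in out)

def parse(rows):
    """Turn rows such as "01000" into lists of ints."""
    return [[int(ch) for ch in row] for row in rows]

shifted = [
    parse(["00000", "01000", "01000", "01000", "00000"]),
    parse(["00000", "00100", "00100", "00100", "00000"]),
    parse(["00000", "00010", "00010", "00010", "00000"]),
]
canonical = {normalize(m) for m in shifted}
print(len(canonical))  # 1
```

All three shifted patterns collapse to a single canonical form, so their occurrences could be pooled under one key.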
Dense Pattern Mining Results



2702 non-redundant proteins from PDB
Min-Support = 1 (exhaustive patterns)
Window size = 5, Min-Density = 5
Contact Threshold   Number of Patterns
5 Angstroms         2508
6 Angstroms         9929
7 Angstroms         21231
Frequent Dense Non-Local Patterns
[Figure panels: Alpha – Alpha; Alpha – Beta Sheet; Alpha – Beta Turn; Beta Sheet – Beta Turn]
Clustering Dense Patterns

Distance: Mi, Mj are dense sub-matrices
  $d(M_i, M_j) = \sum_{k=1}^{W^2} |M_i[k] - M_j[k]|$
Use agglomerative hierarchical clustering
Find each cluster’s (c) representative (n patterns)
  Conceptually the super-imposition of n sub-matrices
  Compute contact probability at each position
  $p_c[k] = \frac{1}{n} \sum_{i=1}^{n} M_i[k]$
  Note a 1 whenever contact probability is more than a probability threshold
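The distance and the per-position contact probability can be sketched as follows. The function names and the 2x2 toy patterns are assumptions for illustration; the clustering step itself (linkage criterion, stopping rule) is not specified on the slide and is left out.

```python
def flatten(window):
    """List the W^2 entries of a window in row-major order."""
    return [bit for row in window for bit in row]

def distance(a, b):
    """Count the positions at which two 0/1 sub-matrices differ."""
    return sum(abs(x - y) for x, y in zip(flatten(a), flatten(b)))

def contact_probabilities(cluster):
    """Average the n superimposed sub-matrices position by position."""
    n = len(cluster)
    columns = zip(*(flatten(m) for m in cluster))
    return [sum(col) / n for col in columns]

m1 = [[0, 1], [1, 0]]
m2 = [[1, 1], [0, 0]]
print(distance(m1, m2))                 # 2
print(contact_probabilities([m1, m2]))  # [0.5, 1.0, 0.5, 0.0]
```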
Cluster Representative
Contact Probabilities:
 0: 0.05    1: 0.05    2: 0.68    3: 0.85    4: 0.71
 5: 0.03    6: 0.02    7: 0.14    8: 0.07    9: 0.09
10: 0.05   11: 0.05   12: 0.12   13: 0.09   14: 0.03
15: 0.03   16: 0.05   17: 0.15   18: 0.27   19: 0.85
20: 0.25   21: 0.10   22: 0.59   23: 0.92   24: 0.83

Representative contact pattern:
00111
00000
00000
00001
00011
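The step from probabilities to the representative pattern can be reproduced with a simple threshold. The value 0.6 used below is an assumption: the slide does not state its threshold, and a cutoff of 0.5 would also mark position 22 (probability 0.59) as a contact, which the representative shown here does not.

```python
# Contact probabilities for positions 0..24, read row by row from the slide.
probs = [0.05, 0.05, 0.68, 0.85, 0.71,
         0.03, 0.02, 0.14, 0.07, 0.09,
         0.05, 0.05, 0.12, 0.09, 0.03,
         0.03, 0.05, 0.15, 0.27, 0.85,
         0.25, 0.10, 0.59, 0.92, 0.83]

def representative(probs, w=5, threshold=0.6):
    """Mark a 1 wherever the contact probability exceeds the threshold."""
    rows = []
    for r in range(w):
        row = probs[r * w:(r + 1) * w]
        rows.append("".join("1" if p > threshold else "0" for p in row))
    return rows

for row in representative(probs):
    print(row)
```

With the assumed threshold the output matches the representative contact pattern on the slide.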
Clustering Quality


High and low values of pc[k] are good (most cluster members agree on k)
For a cluster c, define quality Qc:
  $S_c^1 = \sum_{k=1}^{W^2} p_c[k]$, over positions with $p_c[k] \ge 0.5$
  $S_c^0 = \sum_{k=1}^{W^2} (1 - p_c[k])$, over positions with $p_c[k] < 0.5$
  $Q_c = S_c^1 + S_c^0$
Overall clustering quality (0.5 <= Q <= 1)
  $Q = \frac{1}{NP} \sum_{i=1}^{NC} |c_i| \, Q_{c_i}$
  NC = Number of Clusters
  NP = Number of Patterns
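A minimal sketch of the quality measure follows. One assumption is made explicit: the per-cluster score is divided by the number of positions W^2, since that normalization is what keeps the value inside the stated range of 0.5 to 1. The function names and the toy clusters are hypothetical.

```python
def cluster_quality(probs):
    """Score how strongly cluster members agree at each position."""
    s1 = sum(p for p in probs if p >= 0.5)       # agreement on contacts
    s0 = sum(1 - p for p in probs if p < 0.5)    # agreement on non-contacts
    return (s1 + s0) / len(probs)                # assumption: normalize by W^2

def overall_quality(clusters):
    """Average cluster quality weighted by cluster size.

    clusters is a list of (size, probs) pairs, one per cluster.
    """
    total_patterns = sum(size for size, _ in clusters)
    weighted = sum(size * cluster_quality(probs) for size, probs in clusters)
    return weighted / total_patterns

print(cluster_quality([1.0, 0.0, 1.0, 0.0]))   # 1.0, full agreement
print(cluster_quality([0.5, 0.5, 0.5, 0.5]))   # 0.5, no agreement
print(overall_quality([(3, [1.0, 0.0]), (1, [0.5, 0.5])]))  # 0.875
```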
Example 1: Mined Cluster
#1355   #3496   #6282   #7980   representative
00011   00001   00010   00011   00011
00011   00101   00000   00101   00001
01111   11111   11000   11100   11100
11000   11000   10000   10000   10000
10000   10000   10000   00000   10000
Cluster patterns (beta-beta strand)
Example 2: Mined Cluster
#196    #503    #2834   #8697   representative
11010   01000   11000   11010   11000
01111   01110   01100   01110   01110
01000   01000   01110   01100   01000
01000   01000   01000   01100   01000
11000   11000   01000   01000   01000
Cluster Patterns (beta-beta turn)
Clustering Results
Contact Threshold   Number of Patterns   Number of Clusters   Cluster Quality
5A                  2508                 83                   0.89
6A                  9929                 99                   0.86
7A                  21231                367                  0.84
Future Work

Comprehensive list of non-local motifs
  I-sites library (by Prof. Bystroff) catalogs local motifs
Future Directions



Improving prediction of contact maps
Mining heuristic rules for “physicality”
Protein folding pathways
Improving Contact Map Prediction
[Figure: contact map with regions labeled "Physically Impossible"]
Mining Physicality Rules

Mining heuristic rules for “physicality”
  Based on simple geometric constraints
  Rules governing contacts and non-contacts
Parallel Beta Sheets:
  If C(i,j) = 1 and C(i+2,j+2) = 1,
  then C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets:
  If C(i,j+2) = 1 and C(i+2,j) = 1,
  then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices:
  If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1,
  then C(i+2,j) = 0
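The three rules can be turned into a simple checker that lists where a contact map breaks them. This is a sketch under assumptions: the function names are hypothetical, the beta-sheet rules are scanned only for j > i (the map is symmetric), and nothing here reflects how the authors mine or apply the rules.

```python
def violations(cmap):
    """Return (rule, i, j) tuples where a 0/1 contact map breaks a physicality rule."""
    n = len(cmap)
    found = []
    for i in range(n - 2):
        for j in range(i + 1, n - 2):
            # Parallel: C(i,j) and C(i+2,j+2) forbid C(i,j+2) and C(i+2,j).
            if cmap[i][j] and cmap[i + 2][j + 2] and (cmap[i][j + 2] or cmap[i + 2][j]):
                found.append(("parallel", i, j))
            # Anti-parallel: C(i,j+2) and C(i+2,j) forbid C(i,j) and C(i+2,j+2).
            if cmap[i][j + 2] and cmap[i + 2][j] and (cmap[i][j] or cmap[i + 2][j + 2]):
                found.append(("anti-parallel", i, j))
    for i in range(n - 4):
        if cmap[i][i + 4]:
            for j in range(n):
                # Helix: C(i,i+4), C(i,j) and C(i+4,j) forbid C(i+2,j).
                if cmap[i][j] and cmap[i + 4][j] and cmap[i + 2][j]:
                    found.append(("helix", i, j))
    return found

def make_map(n, contacts):
    """Build a symmetric n x n map from a list of (i, j) contacts."""
    cmap = [[0] * n for _ in range(n)]
    for i, j in contacts:
        cmap[i][j] = cmap[j][i] = 1
    return cmap

ok = make_map(10, [(0, 8), (2, 6)])           # anti-parallel ladder, no conflict
bad = make_map(10, [(0, 8), (2, 6), (0, 6)])  # extra contact C(0,6) breaks the rule
print(violations(ok))   # []
print(violations(bad))  # [('anti-parallel', 0, 6)]
```

A predicted map that yields a non-empty list would be a candidate for correction before further use.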
Heuristic Rules of Physicality
Anti-parallel Beta Sheets
[Diagram: residues labeled i, i+2, j, j+2]
If C(i,j+2) = 1 and C(i+2,j) = 1,
then C(i,j) = 0 and C(i+2,j+2) = 0
Heuristic Rules of Physicality
Parallel Beta Sheets
[Diagram: residues labeled i, i+2, j, j+2]
If C(i,j) = 1 and C(i+2,j+2) = 1,
then C(i,j+2) = 0 and C(i+2,j) = 0
Heuristic Rules of Physicality
Alpha Helix
[Diagram: residues labeled i, i+2, i+4, j]
If C(i,j) = 1 and C(i+4,j) = 1 and C(i,i+4) = 1,
then C(i+2,j) = 0
Protein Folding Pathways

Rules for Pathways in Contact Map Space
  Pathway is time-ordered sequence of contacts
  Consider only native contacts (those that are present in the true map)
  Condensation rule: New contacts within Smax
    U(i,j) <= Smax; U(i,j) unfolded residues from i to j
Pathway prediction is complementary to structure prediction
Contact Map Folding Pathways