Directions in Protein Contact Map Mining
Mohammed J. Zaki, Computer Science Dept.
Joint work with Jingjing Hu & Xiaolan Shen, CS Dept.; Yu Shao & Prof. Chris Bystroff, Biology Dept.
Rensselaer Polytechnic Institute, Troy NY

Protein Structures
- Primary structure
  - Un-branched polymer
  - 20 side chains (residues or amino acids)
  - PDB file 2IGD: MTPAVTTYSLVINGLTLSGU…..
- Higher order structures
  - Secondary: local (consecutive) in sequence
  - Tertiary: 3D fold of one polypeptide chain
  - Quaternary: chains packing together

[Figure: PDB protein 2IGD, with anti-parallel beta sheets, alpha helix, and parallel beta sheets labeled]

The Protein Folding Problem

Contact Map
- Amino acids Ai and Aj are in contact if their 3D distance is less than the contact threshold (e.g., 7 Angstroms)
- Sequence separation is given as |i-j|
- Contact map C is a symmetric N x N matrix with
  - C(i,j) = 1 if Ai and Aj are in contact
  - C(i,j) = 0 otherwise
- Consider all pairs with |i-j| >= 4

[Figure: Contact Map (2IGD); axes: Amino Acid Ai, Amino Acid Aj; labeled regions: parallel beta sheets, anti-parallel beta sheets, alpha helix]

Characterizing Physical, Proteinlike Contact Maps
- A very small subset of all contact maps code for physically possible proteins (self-avoiding, globular chains)
- A contact map must:
  - Satisfy geometric constraints
  - Represent low-energy structure

Characterizing Physical Contact Maps in Proteins
- What are the typical non-local interactions?
  - Frequent dense 0/1 sub-matrices in contact maps
- 3-step approach
  - Dense pattern mining
  - Pruning mined patterns
  - Clustering dense patterns (non-local pattern signatures)

Dense Pattern Mining
- Frequent 2D pattern mining
- Use W x W sliding window; W is the window size
- Measure density under each window
- $(N-W)^2 / 2$ possible windows for a protein of length N
- Look for "minimum density" (number of 1's), scale away from diagonal
- Try different window sizes

Counting Dense Patterns
- Naïve approach: for W=5, N=60 there are 1485 windows per protein.
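The contact map definition and the window count above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names (`contact_map`, `count_windows`, `dense_windows`) are hypothetical, one 3D coordinate per residue is assumed, and windows are enumerated at offsets i < j taken from range(N - W), which is one convention that reproduces the 1485 figure for W=5, N=60.

```python
from math import dist  # Python 3.8+


def contact_map(coords, threshold=7.0, min_separation=4):
    """Return a symmetric N x N 0/1 matrix (list of lists).

    C[i][j] = 1 when residues i and j are closer than `threshold`
    (Angstroms) and |i - j| >= min_separation; otherwise 0.
    `coords` holds one (x, y, z) tuple per residue (an assumption).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def count_windows(n, w):
    """Number of window offsets (i, j) with i < j, both in range(n - w)."""
    m = n - w
    return m * (m - 1) // 2


def dense_windows(cmap, w, min_density):
    """Yield (i, j, density) for each W x W window with enough contacts.

    Uses the same offset convention as count_windows (assumed here).
    """
    n = len(cmap)
    for i in range(n - w):
        for j in range(i + 1, n - w):
            density = sum(cmap[i + a][j + b]
                          for a in range(w) for b in range(w))
            if density >= min_density:
                yield i, j, density
```

With these definitions, `count_windows(60, 5)` evaluates to 55 * 54 / 2 = 1485; the expression $(N-W)^2/2$ is the approximation of this count for large N.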
- 28 million possible windows for 18,544 proteins (in PDB)
- Test if two sub-matrices are equal
  - Linear search: $O(P \times W^2)$ with P current dense patterns
  - Hash based: $O(W^2)$
- Our approach: 2-level hashing
  - $O(W)$ time

Pattern (W x W Sub-matrix) Encoding
- Encode sub-matrix as string (W ints)

  Sub-matrix   Integer Value
  00000        0
  01100        12
  01000        8
  01000        8
  00000        0

  Concatenated String: 0.12.8.8.0

Two-level Hashing
- String-ID(M) = v1.v2.....vW
- Level 1 (approximate): $h_1(M) = \sum_{i=1}^{W} v_i$
- Level 2 (exact): $h_2(M) = \text{String-ID}(M)$

Binding Patterns to Protein Sequence and Structure
StringID: 0.12.8.8.0, Support = 170 (window size W=5)

  00000
  01100
  01000
  01000
  00000

Occurrences:

  pdb-name  (X,Y)  X_sequence  Y_sequence  Interaction
  1070.0    52,30  ILLKN       TFVRI       alpha::beta
  1145.0    51,13  VFALH       GFHIA       alpha::strand
  1251.2    42,6   EVCLR       GSKFG       alpha::strand
  1312.0    54,11  HGYDE       ATFAK       alpha::beta
  1732.0    49,6   HRFAK       KELAG       alpha::beta
  2895.0    49,7   SRCLD       DTIYY       alpha::beta
  ...

Frequent Dense Local Patterns
Submatrix 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1

Pruning Patterns
- Same pattern (shifted to right) but different String-IDs

  00000   00000   00000
  01000   00100   00010
  01000   00100   00010
  01000   00100   00010
  00000   00000   00000

- Merge horizontally or vertically shifted patterns
- Prune away the local patterns (alpha/beta)

Dense Pattern Mining Results
- 2702 non-redundant proteins from PDB
- Min-Support = 1 (exhaustive patterns)
- Window size = 5, Min-Density = 5

  Contact Threshold   Number of Patterns
  5 Angstroms         2508
  6 Angstroms         9929
  7 Angstroms         21231

Frequent Dense Non-Local Patterns
[Figures: Alpha – Alpha; Alpha – Beta Sheet]

Frequent Dense Non-Local Patterns
[Figures: Alpha – Beta Turn; Beta Sheet – Beta Turn]

Clustering Dense Patterns
- Distance: $M_i$,
$M_j$ are dense sub-matrices

  $$d(M_i, M_j) = \sum_{k=1}^{W^2} \left| M_i[k] - M_j[k] \right|$$

- Use agglomerative hierarchical clustering
- Find each cluster's (c) representative (n patterns)
  - Conceptually the super-imposition of n sub-matrices
  - Compute contact probability at each position:

    $$p_c[k] = \frac{\sum_{i=1}^{n} M_i[k]}{n}$$

  - Note a 1 whenever contact probability is more than a probability threshold

Cluster Representative
Contact probabilities (position: probability):

   0: 0.05   1: 0.05   2: 0.68   3: 0.85   4: 0.71
   5: 0.03   6: 0.02   7: 0.14   8: 0.07   9: 0.09
  10: 0.05  11: 0.05  12: 0.12  13: 0.09  14: 0.03
  15: 0.03  16: 0.05  17: 0.15  18: 0.27  19: 0.85
  20: 0.25  21: 0.10  22: 0.59  23: 0.92  24: 0.83

Representative contact pattern:

  00111
  00000
  00000
  00001
  00011

Clustering Quality
- High and low values of $p_c[k]$ are good (most cluster members agree on k)
- For a cluster c, define quality $Q_c$:

  $$S_c^1 = \sum_{k=1}^{W^2} p_c[k], \quad (p_c[k] \ge 0.5)$$
  $$S_c^0 = \sum_{k=1}^{W^2} \left(1 - p_c[k]\right), \quad (p_c[k] < 0.5)$$
  $$Q_c = \frac{S_c^1 + S_c^0}{W^2}$$

- Overall clustering quality ($0.5 \le Q \le 1$):

  $$Q = \frac{\sum_{i=1}^{N_C} |c_i| \, Q_{c_i}}{N_P}$$

  $N_C$ = number of clusters, $N_P$ = number of patterns

Example 1: Mined Cluster

  #1355   #3496   #6282   #7980   representative
  00011   00001   00010   00011   00011
  00011   00101   00000   00101   00001
  01111   11111   11000   11100   11100
  11000   11000   10000   10000   10000
  10000   10000   10000   00000   10000

Cluster patterns (beta-beta strand)

Example 2: Mined Cluster

  #196    #503    #2834   #8697   representative
  11010   01000   11000   11010   11000
  01111   01110   01100   01110   01110
  01000   01000   01110   01100   01000
  01000   01000   01000   01100   01000
  11000   11000   01000   01000   01000

Cluster patterns (beta-beta turn)

Clustering Results

  Contact Threshold   Number of Patterns   Number of Clusters   Cluster Quality
  5A                  2508                 83                   0.89
  6A                  9929                 99                   0.86
  7A                  21231                367                  0.84

Future Work
- Comprehensive list of non-local motifs
- I-sites library (by Prof.
Bystroff) catalogs local motifs

Future Directions
- Improving prediction of contact maps
- Mining heuristic rules for "physicality"
- Protein folding pathways

Improving Contact Map Prediction
[Figure: two regions labeled "Physically Impossible"]

Mining Physicality Rules
- Mining heuristic rules for "physicality"
  - Based on simple geometric constraints
  - Rules governing contacts and non-contacts
- Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
- Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
- Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0

Heuristic Rules of Physicality: Anti-parallel Beta Sheets
[Figure: residues i, i+2, j, j+2]
If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

Heuristic Rules of Physicality: Parallel Beta Sheets
[Figure: residues i, i+2, j, j+2]
If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

Heuristic Rules of Physicality: Alpha Helix
[Figure: residues i, i+2, i+4, j]
If C(i,j) = 1 and C(i+4,j) = 1 and C(i,i+4) = 1, then C(i+2,j) = 0

Protein Folding Pathways
- Rules for pathways in contact map space
  - Pathway is a time-ordered sequence of contacts
  - Consider only native contacts (those that are present in the true map)
  - Condensation rule: new contacts within Smax
    - U(i,j) <= Smax; U(i,j) is the number of unfolded residues from i to j
- Pathway prediction is complementary to structure prediction

Contact Map Folding Pathways
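The three heuristic physicality rules stated above can be checked mechanically against a 0/1 contact map. The following Python sketch is only an illustration under stated assumptions: the function name `physicality_violations` is hypothetical, the rules are applied at every position of the map rather than only inside known sheets or helices, and indices that fall outside the map are treated as non-contacts.

```python
def physicality_violations(cmap):
    """Return a list of (rule, i, j) tuples where a rule is violated.

    `cmap` is a symmetric N x N list of lists with 0/1 entries.
    Applying the rules everywhere in the map is an assumption of
    this sketch, not something stated in the slides.
    """
    n = len(cmap)

    def c(a, b):
        # Out-of-range indices count as "no contact" (assumption).
        return 0 <= a < n and 0 <= b < n and cmap[a][b] == 1

    violations = []
    for i in range(n):
        for j in range(n):
            if j > i:  # the map is symmetric, so check each sheet pair once
                # Parallel: C(i,j)=1 and C(i+2,j+2)=1
                #   => C(i,j+2)=0 and C(i+2,j)=0
                if c(i, j) and c(i + 2, j + 2) and (c(i, j + 2) or c(i + 2, j)):
                    violations.append(("parallel", i, j))
                # Anti-parallel: C(i,j+2)=1 and C(i+2,j)=1
                #   => C(i,j)=0 and C(i+2,j+2)=0
                if c(i, j + 2) and c(i + 2, j) and (c(i, j) or c(i + 2, j + 2)):
                    violations.append(("anti-parallel", i, j))
            # Helix: C(i,i+4)=1, C(i,j)=1, C(i+4,j)=1 => C(i+2,j)=0
            if c(i, i + 4) and c(i, j) and c(i + 4, j) and c(i + 2, j):
                violations.append(("helix", i, j))
    return violations


def empty_map(n):
    """Helper for examples: an N x N map with no contacts."""
    return [[0] * n for _ in range(n)]


def set_contact(cmap, i, j):
    """Helper for examples: set a symmetric contact."""
    cmap[i][j] = cmap[j][i] = 1
```

For example, a map containing only the anti-parallel ladder contacts (0,7) and (2,5) yields no violations, while adding the contact (0,5) makes the anti-parallel rule fire at i=0, j=5.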