
An Efficient Index-based Protein Structure Database Searching Method
陳冠宇

Introduction
More than 18,000 protein structures were stored in the PDB as of September 2002.
Structural comparison (3D) and database searching: other methods perform an exhaustive search.
Design philosophy of this method: filter-and-refine, using an index-based searching method.
Result: 16 times faster than DALI.

Filter-and-Refine
[Figure: query -> ProtDex (database of 20,000 proteins) -> top 100 proteins -> actual alignment -> result]

Problem Definition
Protein structures; 3D structural comparison; structural database searching.
A protein is composed of a sequence of amino acid (AA) residues.
SSE: secondary structure element (e.g., helices, sheets).
Loop regions have no specific shape.

Sequence Comparison vs. Structural Comparison
Sequence comparison cannot determine the similarity of two remotely homologous proteins.
Structural comparison tries to superimpose one protein structure over another in order to obtain the minimum root mean square deviation (RMSD) between them, at a cost of O(n^4 m^4).

The ProtDex Method
Step 1: Extracting information from the PDB database
Step 2: Building intra-molecular distance matrices
    Design rationale: two protein structures are similar if their distance matrices are similar.
Step 3: Cutting fixed-size matrices and extracting properties
Step 4: Building the inverted file index

Step 1: Extracting Information
For each protein chain in a PDB file: PDB id and chain id; number of AA residues; number of SSEs
For each AA residue: 3D coordinate (x, y, z) of the Cα carbon
For each SSE: SSE type (helix or sheet); SSE start position; SSE length

Step 2: Representation - Building Distance Matrices
[Figure: distance matrix of protein 9xxxx with 7 AA residues]

Step 3-1: Contact Patterns & Fixed-Size Matrices
[Figure: SSE(H) and SSE(E) contact patterns; fixed-size matrix]

Step 3-2: Extracting Properties
For the 2×2 sub-matrix starting at cell (2, 2), we store the values: 8, HH, (3,3), (1,1), (1,1)
For the 2×2 sub-matrix starting at cell (3, 6), we store the values: 49, HE, (3,2), (1,2), (2,1), etc.
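The intra-molecular distance matrix of Step 2 can be illustrated with a minimal sketch. It assumes the matrix holds the Euclidean distance between every pair of residue coordinates; the function name `distance_matrix` and the toy coordinates are hypothetical and are not taken from the ProtDex implementation or from any real PDB entry.

```python
import math

def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances.

    coords: list of (x, y, z) tuples, one per amino acid residue.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical coordinates (in angstroms) for a 4-residue chain.
# These are made-up values chosen for easy arithmetic, not real PDB data.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.0, 4.0, 0.0),
    (3.0, 4.0, 12.0),
    (0.0, 0.0, 12.0),
]

toy_matrix = distance_matrix(toy_coords)

for row in toy_matrix:
    print(" ".join(f"{d:6.1f}" for d in row))
```

Because the matrix depends only on distances inside one molecule, it does not change when the protein is rotated or translated, which is what allows two structures to be compared without superimposing them first.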
Step 4: Building Inverted File Index
Implemented as a sorted list.

Searching a Protein Structure
$$S(Q,P) = W_{FMCount}(Q,P) \times W_{GSum}(i,j) \times \sum_{match(i,j)} \Big[ W_{Term}(i) \times \max_{match(a,b) \,\wedge\, PdbId_b = P} \big( W_{Area}(a,b) \times W_{ARatio}(a,b) \times W_{Ordinal}(a,b) \big) \Big]$$
W_FMCount compensates for the fact that large proteins are matched and scored more frequently than small ones.
W_Term gives more weight to the query index terms that rarely occur in the database.

Discussion
Design: representation of structures; scoring schemes; comparison algorithms; assessment of the results
Performance
Accuracy: the SCOP classification hierarchy is made of 4 levels: class, fold, superfamily and family

Pros and Cons of ProtDex
Conclusions
Advantages:
    Speed (no need to scan through each structure in the database)
Disadvantages:
    Cannot provide the actual alignment
    Storage overhead for the index structure (the entire index: 1.2 GB)
    Time required to build and update the index (building the entire index: 30 min 38 sec)
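The inverted file index of Step 4 and the filter stage of filter-and-refine can be sketched as follows. This is only an illustration under stated assumptions: index terms are reduced to made-up strings such as "HH:8", the protein ids are invented, and candidates are ranked by a plain hit count rather than by the weighted score S(Q,P). The names `build_index`, `lookup` and `filter_candidates` are hypothetical.

```python
import bisect
from collections import Counter

def build_index(postings):
    """Sort (term, protein_id) pairs so that equal terms are contiguous."""
    return sorted(postings)

def lookup(index, term):
    """Return the protein ids stored under exactly this term."""
    # (term,) sorts before every (term, protein_id) pair, so bisect_left
    # lands on the first posting for this term, if there is one.
    start = bisect.bisect_left(index, (term,))
    hits = []
    for stored_term, protein_id in index[start:]:
        if stored_term != term:
            break
        hits.append(protein_id)
    return hits

def filter_candidates(index, query_terms, k):
    """Rank proteins by the number of query terms they share; keep the top k.

    Assumption: a plain hit count stands in for the weighted ProtDex score.
    """
    counts = Counter()
    for term in query_terms:
        counts.update(lookup(index, term))
    return [protein_id for protein_id, _ in counts.most_common(k)]

# Hypothetical postings; terms and protein ids are invented for illustration.
toy_index = build_index([
    ("HH:8", "1abcA"),
    ("HE:49", "1abcA"),
    ("HH:8", "2xyzB"),
    ("EE:5", "3defC"),
])

top = filter_candidates(toy_index, ["HH:8", "HE:49"], k=2)
print(top)
```

Only the proteins returned by the filter stage would then be passed on to an actual alignment method, so the expensive comparison runs on a short candidate list instead of the whole database.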
