Methods of Protein Structure Alignment
David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic
DATAKON 2008

Presentation Outline
  - Biological background
  - Protein databases
  - Protein structure
  - Similarity measures
  - Algorithms

Terminology
  - DNA (deoxyribonucleic acid)
      - sequence of nucleotides (A, C, G, T)
      - double helix
  - RNA (ribonucleic acid)
      - single-stranded
      - sequence of nucleotides (A, C, G, U)
      - messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), …
  - Proteins
      - molecules translated from mRNA in ribosomes
      - sequence of amino acids (20 AAs)
      - each amino acid is coded by a codon (triplet of nucleotides): the genetic code
  - Central dogma
      - DNA → RNA → protein
      - DNA → RNA: transcription; RNA → protein: translation

Protein Similarity
  - Interaction of proteins determines biological functions
  - The function of a protein is derived from its three-dimensional structure
      - similar proteins (many common amino acids in “appropriate” places) have similar structure
      - → similar proteins have similar functions
      - similar proteins have a common ancestor
  - Identifying a protein sequence
      - → finding similar proteins
      - → getting a clue to the function

Protein Databases
  - Finding similar proteins
      - even among different species
  - Prominent non-structural databases
      - GenBank, EMBL (European Molecular Biology Laboratory Data Library), DDBJ (DNA Data Bank of Japan)
          - not moderated
      - UniProt
          - moderated
          - Swiss-Prot + TrEMBL (translated EMBL) + PIR (Protein Information Resource)
  - Prominent structural databases
      - PDB (Protein Data Bank)
      - SCOP (Structural Classification of Proteins)
      - ASTRAL Compendium (for Sequence and Structure Analysis)

Databases Growth
  [figure: database growth chart, not preserved]

Databases Growth (PDB)
  [figure: PDB growth chart, not preserved]

Levels of complexity of protein structure
  - Primary structure
      - linear sequence of amino acids
  - Secondary structure
      - local three-dimensional segments which are folded into specific repeated structures
      - alpha helices, beta sheets (strands)
  - Tertiary structure
      - the atomic coordinates: spatial relations among the secondary structure elements
  - Quaternary structure
      - multiple polypeptide chains

Protein structure
  - Amino acids differ in their side chains (R-groups)
  - Connection: peptide bonds
  - Protein sequence → sequence of rigid planes
  - Degrees of freedom
      - planes
      - R-groups
  - 3D conformation is described by dihedral angles
  - Usually only the α-carbons are considered

Protein structure cont.
  - SSE: Secondary Structural Elements
      - repetitive structures arising from hydrogen bonds
  - Alpha helices
      - the i-th amino acid is hydrogen-bonded to the (i+4)-th amino acid
      - φ and ψ angles are constant
      - 3.6 peptide units per turn
  - Beta sheets (strands)
      - multiple strands connected to each other by hydrogen bonds
      - parallel/antiparallel
  - Motifs
      - combinations (second form) of SSEs
      - beta ribbon, beta barrel, beta hairpin, helix-loop-helix, Greek key, …

Similarity measures
  - RMSD: Root Mean Square deviation/difference/distance
      - summarizes partial distances of aligned residue pairs
      - evaluates the quality of a matching (superposition)
  - cRMSD (core RMSD)
  - dRMSD (distance RMSD)
      - intra-residue
      - disregarding outliers

    $$\mathrm{cRMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\lVert x^A(i) - x^B(i) \right\rVert^2}$$

    $$\mathrm{dRMSD} = \sqrt{\frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( d^A_{ij} - d^B_{ij} \right)^2}, \qquad d_{ij} = \lVert x(i) - x(j) \rVert$$

  - Elastic similarity score
      - inter-residue

    $$\varphi(i, j) = \left( \theta - \frac{\lvert d^A_{ij} - d^B_{ij} \rvert}{d^*_{ij}} \right) w(d^*_{ij}), \qquad w(x) = e^{-x^2 / \alpha^2}$$

    where $d^*_{ij}$ is the mean of $d^A_{ij}$ and $d^B_{ij}$ and $\theta$ is a similarity threshold.

  - Fragmented dRMSD
      - aimed at recognition of similar substructures

    $$\mathrm{dRMSD}_{\mathrm{local}}(i, j) = \sqrt{\frac{1}{(2l+1)^2} \sum_{a=-l}^{l} \sum_{b=-l}^{l} \left( d^A_{i+a,\,j+b} - d^B_{i+a,\,j+b} \right)^2}$$

Algorithms
  - Goals
      - Alignment: direct similarity
      - Classification: indirect similarity
  - Methods
      - Incremental extension
          - extending an initial partial alignment
      - Dynamic programming
          - dynamic programming matrix of (usually) distances
      - Indexing
          - using features to be indexed by trees or geometric hashing

Algorithms - Incremental Extension
  - DALI
      - elastic similarity score
      - matrix of inter-residue distances
      - similar proteins ≡ similar inter-residue distances ≡ similar distance matrices
      - contact pattern (CP): submatrix of fixed size (hexapeptides)
      - similar pairs of CPs are stored and one is used as a seed
      - Monte Carlo optimization is used to extend the already created alignment
  - CE
      - AFP: Aligned Fragment Pair (constant-length portions, local structures)
      - fragmented dRMSD
      - joining of AFPs is based on three different distance measures
      - several paths are computed and the best of them is optimized by Smith-Waterman (on the distance matrix)

Algorithms - Dynamic Programming
  - SAP
      - double dynamic programming
      - view: vector of distances to other residues
      - between pairs of views, optimal alignments (Smith-Waterman) are computed, which are used to fill the final DP matrix → final alignment
  - PROSUP
      1. Identification of seed fragments
      2. Expand seed fragments to initial alignments
      3. Apply DP (Needleman-Wunsch) to refine initial alignments
      4. Evaluate refined alignments
  - STRUCTAL
      - evaluated as the best structural algorithm in Kolodny, R., Koehl, P., Levitt, M.: Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346 (2005), 1173-88.
      - initial alignment based on several rules (match beginnings of structures, match ends, match pairs based on sequence alignment, …)
      - refine the alignments by DP (Needleman-Wunsch)
      - exposure weighting
      - position-dependent gap penalties
      - …
  - TALI
      - incorporates torsion angles → mutual distances form the DP matrix → Smith-Waterman
  - FATCAT
      - twists: proteins are not treated as rigid bodies
      - minimization of the twists needed to turn one structure into the other → DP is used to connect AFPs (introducing a twist is penalized)

Algorithms - Indexing - Trees
  - PSI
      - Representation
          - SSE approximation
          - amino acids → centers of mass
          - defining a local neighborhood for each SSE
      - Features (9 dimensions)
          - distances in triplets
          - angles between pairs in triplets
      - R*-tree
  - PSIST
      - sliding window of size w
      - Features
          - distance
          - angle
      - Suffix tree

Algorithms - Indexing - Hashing
  - ProGreSS
      - combines structure and sequence
      - Features (sliding window of size w)
          - structure
              - curvature (dim = w)
              - torsion angles (dim = w)
              - SSE type
              - Haar wavelet transformation → dim = dt
              - normalization → [0,1]^dt space
          - sequence
              - line in scoring matrix (PAM, BLOSUM, ..) → scoring vector → chaining scoring vectors → dim = 20w
              - Haar wavelet transformation → dim = dq
              - normalization → [0,1]^dq space
  - CTSS
      - theory of differential geometry
      - structure ≡ 3D spline
      - Features
          - curvature
          - torsion angle
          - SSE type
      - Pairwise comparison
          - Smith-Waterman
          - scoring matrix based on features for each pair

End

PDB record
    HEADER HYDROLASE(O-GLYCOSYL) 25-JAN-94 149L
    TITLE CONSERVATION OF SOLVENT-BINDING SITES …
    ..
    REMARK 1
    REMARK 1 REFERENCE 1
    REMARK 1 AUTH M.MATSUMURA,W.J.BECKTEL,…
    ..
    REMARK 2 RESOLUTION. 2.60 ANGSTROMS.
    REMARK 3 ..
    REMARK 3 REFINEMENT.
    REMARK 3 PROGRAM : TNT
    REMARK 3 AUTHORS : TRONRUD,TEN EYCK…
    ...
    SEQRES 1 A 164 MET ASN LEU PHE GLU MET LEU ARG …
    SEQRES 2 A 164 ARG LEU LYS ILE TYR LYS ASP THR …
    ..
    HELIX 1 H1 LEU A 3 GLU A 11 1 9
    HELIX 2 H2 LEU A 39 ILE A 50 1 12
    ..
    HELIX 10 H10 PRO A 143 THR A 155 1 13
    SHEET 1 A 3 TYR A 18 LYS A 19 0
    SHEET 2 A 3 TYR A 25 ILE A 27 -1 N THR A 26 O TYR A 18
    SHEET 3 A 3 HIS A 31 THR A 34 -1 N HIS A 31 O ILE A 27
    TURN 1 T1 ASP A 20 GLY A 23
    ..
    ATOM 1 N MET A 1 29.360 -4.880 38.742 1.00 65.91 N
    ATOM 2 CA MET A 1 29.892 -6.057 38.096 1.00 60.68 C
    ATOM 3 C MET A 1 30.674 -5.673 36.863 1.00 56.33 C
    ..
    ATOM 302 CG PRO A 37 51.531 -30.219 18.738 1.00 78.60 C
    ATOM 303 CD PRO A 37 52.005 -28.775 18.641 1.00 78.61 C
    ATOM 304 N SER A 38 53.483 -28.405 22.129 1.00 70.92 N
    ATOM 305 CA SER A 38 54.604 -28.517 23.043 1.00 67.86 C
    ..
    ATOM 1309 OXT LEU A 164 25.719 -18.888 43.195 1.00 25.30 O
    TER 1310 LEU A 164
    END

Similarity Measures (primary structure)
  - Two strings of amino acids
  - Hamming distance
      - sequences of equal length
      - number of non-identical positions

        S1 = NGHLILLE
        S2 = HGALGLLE
             x x x
        HD(S1, S2) = 3

  - Edit distance
      - minimal number of insert/update/delete operations needed to convert one sequence into the other

        S1 = NPHGIIMGLAE
        S2 = --HGAL-GLLE
             xx  xxx  x
        ED(S1, S2) = 6

  - Weighted edit distance
      - takes into account the probability of updating one letter to another
      - scoring (substitution) matrices: PAM, BLOSUM, …
      - different costs for opening/extending a gap
  - Global/local alignment
      - global: Needleman-Wunsch
      - local: Smith-Waterman
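The two primary-structure measures on the last slide can be sketched in a few lines of Python. This is a minimal illustration and not code from the presentation: the function names `hamming_distance` and `edit_distance` are made up here, and the edit distance assumes unit cost for every insert, update and delete (the weighted variant would take its costs from a substitution matrix and gap penalties instead).

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def edit_distance(s1: str, s2: str) -> int:
    """Minimal number of unit-cost insert/update/delete operations."""
    # prev[j] is the distance between the already processed prefix of s1
    # and the prefix s2[:j]; only two rows of the DP matrix are kept.
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # keep a (cost 0) or update it (cost 1)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Sequences taken from the slide.
    print(hamming_distance("NGHLILLE", "HGALGLLE"))
    print(edit_distance("NPHGIIMGLAE", "HGALGLLE"))
```

The same dynamic programming table, filled with substitution-matrix scores and gap costs instead of unit costs, is the basis of Needleman-Wunsch (global) and Smith-Waterman (local) alignment.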

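As a supplement to the similarity-measures slide, the sketch below computes cRMSD and dRMSD for two already aligned lists of α-carbon coordinates. It is an illustration under stated assumptions, not the author's code: the function names `crmsd` and `drmsd` and the toy coordinates are hypothetical, residue i of structure A is assumed to be paired with residue i of structure B, and no superposition step is performed before cRMSD is evaluated.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def crmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """RMS distance between paired residues of A and B (no superposition)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))


def drmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """RMS difference between the distance matrices of A and B."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two coordinate lists of equal length >= 2")
    # Sum over all pairs i < j of (d_ij^A - d_ij^B)^2.
    squared = sum(
        (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
        for i, j in combinations(range(n), 2)
    )
    return math.sqrt(2.0 * squared / (n * (n - 1)))


if __name__ == "__main__":
    # Toy data: B is A translated by 2 units along z.
    A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    B = [(x, y, z + 2.0) for x, y, z in A]
    print(crmsd(A, B))
    print(drmsd(A, B))
```

In the toy data the two structures have identical internal distances, so dRMSD does not change under the translation, while cRMSD reflects the shift. This is why cRMSD is normally evaluated after an optimal superposition, whereas dRMSD needs none.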