Methods of Protein Structure Alignment
David Hoksza
Charles University in Prague, Department of Software Engineering, Czech Republic
DATAKON 2008

Presentation Outline
  Biological background
  Protein databases
  Protein structure
  Similarity measures
  Algorithms

Terminology
  DNA (deoxyribonucleic acid)
    sequence of nucleotides (A, C, G, T)
    double helix
  RNA (ribonucleic acid)
    sequence of nucleotides (A, C, G, U)
    single-stranded
    messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), …
  Proteins
    molecules translated from mRNA in ribosomes
    sequence of amino acids (20 AAs)
    each amino acid coded by a codon (triplet of nucleotides): the genetic code
  Central dogma
    DNA → RNA → protein
    DNA → RNA: transcription
    RNA → protein: translation

Protein Similarity
  Interaction of proteins determines biological functions
  Function of a protein is derived from its three-dimensional structure
  Similar proteins (many common amino acids in "appropriate" places) have similar structure → similar proteins have similar functions
  Similar proteins have a common ancestor
  Identifying protein sequence → finding similar proteins → getting a clue to the function

Protein Databases
  Finding similar proteins even among different species
  Prominent non-structural databases
    not moderated
      GenBank
      EMBL (European Molecular Biology Laboratory Data)
      DDBJ (DNA Data Bank of Japan)
    moderated
      UniProt = Swissprot + trEMBL (translated EMBL) + PIR (Protein Information Resource)
  Prominent structural databases
    PDB (Protein Data Bank)
    SCOP (Structural Classification of Proteins)
    ASTRAL Compendium (for Sequence and Structure Analysis)

Databases Growth
  (chart)

Databases Growth (PDB)
  (chart)

Levels of complexity of protein structure
  Primary structure
    linear sequence of amino acids
  Secondary structure
    local three-dimensional segments which are folded into specific repeated structures
    alpha helices, beta sheets (strands)
  Tertiary structure
    the atomic coordinates: spatial relations among the secondary structure elements
  Quaternary structure
    multiple polypeptide chains

Protein structure
  Amino acids differ in their side chains (R-groups)
  Connection: peptide bonds
  Protein sequence → sequence of rigid planes
  Degrees of freedom
    planes
    R-groups
  3D conformation described by dihedral angles
  Only α-carbons usually considered

Protein structure cont.
  SSE – Secondary Structural Elements
    Repetitive structures arising from H bonds
    Alpha helices
      the i-th amino acid is connected to the (i+4)-th amino acid
      φ and ψ angles are constant
      3.6 peptide units per turn
    Beta sheets (strands)
      multiple strands connected to each other by H bonds
      parallel/antiparallel
  Motifs
    Combinations (second form) of SSEs
    beta ribbon, beta-barrel, beta-hairpin, helix-loop-helix, greek key, …

Similarity measures
  RMSD – Root Mean Square deviation/difference/distance
    Summarizes partial distances of aligned residue pairs
    Evaluates quality of a matching (superposition)
  cRMSD (core RMSD)
    \[ \mathrm{cRMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| x^{A}(i) - x^{B}(i) \right\|^{2}} \]
  dRMSD (distance RMSD)
    intra-residue
    disregarding outliers
    \[ \mathrm{dRMSD} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left( d_{ij}^{A} - d_{ij}^{B} \right)^{2}}, \qquad d_{ij} = \left\| x(i) - x(j) \right\| \]
  elastic similarity score
    inter-residue
    \[ \phi^{E}(i,j) = \left( \theta - \frac{\left| d_{ij}^{A} - d_{ij}^{B} \right|}{d_{ij}^{*}} \right) w(d_{ij}^{*}), \qquad w(x) = e^{-x^{2}/\alpha^{2}} \]
    d*_ij is the average of d^A_ij and d^B_ij
  fragmented dRMSD
    aimed at recognition of similar substructures
    \[ \phi^{local}(i,j) = \sqrt{\frac{1}{(2l+1)^{2}} \sum_{a=-l}^{l} \sum_{b=-l}^{l} \left( d_{i+a,j+b}^{A} - d_{i+a,j+b}^{B} \right)^{2}} \]

Algorithms
  Goals
    Alignment: direct similarity
    Classification: indirect similarity
  Methods
    Incremental extension
      extending an initial partial alignment
    Dynamic programming
      dynamic programming matrix of (usually) distances
    Indexing
      using features to be indexed by trees
      geometric hashing

Algorithms – Incremental Extension
  DALI
    Elastic similarity score
    Matrix of inter-residue distances
    Similar proteins ≡ similar inter-residue distances ≡ similar distance matrices
    Contact pattern (CP): submatrix of fixed size (hexapeptides)
    Similar pairs of CPs are stored and one is used as a seed
    Monte-Carlo optimization is used to extend the already created alignment
  CE
    AFP – Aligned Fragment Pair (constant-length portions, local structures)
    Fragmented dRMSD
    Joining of AFPs based on three different distance measures
    Several paths are computed and the best of them is optimized by Smith-Waterman (on the distance matrix)

Algorithms – Dynamic Programming
  SAP
    Double dynamic programming
    View – vector of distances to other residues
    Between pairs of views, optimal alignments (Smith-Waterman) are computed, which are used to fill up the final DP matrix → final alignment
  PROSUP
    1. Identification of seed fragments
    2. Expand seed fragments to initial alignments
    3. Apply DP (Needleman-Wunsch) to refine initial alignments
    4. Evaluate refined alignments
  STRUCTAL
    Evaluated as the best structural algorithm in Kolodny, R., Koehl, P., Levitt, M.: Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 346 (2005), 1173-88.
    Initial alignment based on several rules (match beginnings of structures, match ends, match pairs based on sequence alignment, …)
    Refine the alignments by DP (Needleman-Wunsch)
  TALI
    Exposure weighting
    Position-dependent gap penalties
    …
    Incorporates torsion angles → mutual distances form DP matrix → Smith-Waterman
  FATCAT
    Twists – proteins not understood as rigid bodies
    Minimization of twists to turn one structure into another one → DP used to connect AFPs (introducing a twist is penalized)

Algorithms – Indexing – Trees
  PSI
    Representation
      SSE approximation
      amino acids → centers of mass
    Features
      defining local neighborhood for each SSE
      features (9 dimensions)
        distances in triplets
        angles between pairs in triplets
    R*-tree
  PSIST
    sliding window of size w
      distance
      angle
    Suffix tree

Algorithms – Indexing – Hashing
  ProGreSS
    Combines structure and sequence
    Features (sliding window of size w)
      sequence
        line in scoring matrix (PAM, BLOSUM, ..) → scoring vector → chaining scoring vectors → dim = 20w
        Haar wavelet transformation → dim = dq
        normalization → [0,1]^dq space
      structure
        curvature (dim = w)
        torsion angles (dim = w)
        Haar wavelet transformation → dim = dt
        normalization → [0,1]^dt space
      SSE type
  CTSS
    Theory of differential geometry
    Structure ≡ 3D spline
    Features
      torsion angle
      curvature
      SSE type
    Pairwise comparison
      Smith-Waterman
      scoring matrix based on features for each pair

End

PDB record
    HEADER    HYDROLASE(O-GLYCOSYL)   25-JAN-94   149L
    TITLE     CONSERVATION OF SOLVENT-BINDING SITES …
    ..
    REMARK   1
    REMARK   1 REFERENCE 1
    REMARK   1  AUTH   M.MATSUMURA,W.J.BECKTEL,…
    ..
    REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
    REMARK   3
    REMARK   3 REFINEMENT.
    REMARK   3   PROGRAM : TNT
    REMARK   3   AUTHORS : TRONRUD,TEN EYCK…
    ..
    SEQRES   1 A  164  MET ASN LEU PHE GLU MET LEU ARG …
    SEQRES   2 A  164  ARG LEU LYS ILE TYR LYS ASP THR …
    ..
    HELIX    1  H1 LEU A    3  GLU A   11  1   9
    HELIX    2  H2 LEU A   39  ILE A   50  1  12
    ..
    HELIX   10 H10 PRO A  143  THR A  155  1  13
    SHEET    1   A 3 TYR A  18  LYS A  19  0
    SHEET    2   A 3 TYR A  25  ILE A  27 -1  N  THR A  26   O  TYR A  18
    SHEET    3   A 3 HIS A  31  THR A  34 -1  N  HIS A  31   O  ILE A  27
    TURN     1  T1 ASP A  20  GLY A  23
    ..
    ATOM      1  N   MET A   1      29.360  -4.880  38.742  1.00 65.91  N
    ATOM      2  CA  MET A   1      29.892  -6.057  38.096  1.00 60.68  C
    ATOM      3  C   MET A   1      30.674  -5.673  36.863  1.00 56.33  C
    ..
    ATOM    302  CG  PRO A  37      51.531 -30.219  18.738  1.00 78.60  C
    ATOM    303  CD  PRO A  37      52.005 -28.775  18.641  1.00 78.61  C
    ATOM    304  N   SER A  38      53.483 -28.405  22.129  1.00 70.92  N
    ATOM    305  CA  SER A  38      54.604 -28.517  23.043  1.00 67.86  C
    ...
    ATOM   1309  OXT LEU A 164      25.719 -18.888  43.195  1.00 25.30  O
    TER    1310      LEU A 164
    END

Similarity Measures (primary structure)
  Two strings of amino acids
  Hamming distance
    sequences of equal length
    number of non-identical positions
    Alignment
      S1 = NGHLILLE
      S2 = HGALGLLE
           x x x
      HD(S1, S2) = 3
  Edit distance
    minimal number of operations insert/update/delete to convert one sequence to the other
    Alignment
      S1 = NPHGIIMGLAE
      S2 = --HGAL-GLLE
           xx  xxx  x
      ED(S1, S2) = 6
  Weighted edit distance
    takes into account the probability of updating one letter to the other
    scoring (substitution) matrices: PAM, BLOSUM, …
    different costs for opening/extending a gap
  Global/local alignment
    Needleman-Wunsch (global)
    Smith-Waterman (local)
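
The two unweighted primary-structure measures from the last slide can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the presentation: the function names hamming_distance and edit_distance are made up here, every insert, update and delete is assumed to cost 1, and the edit distance is computed with the standard dynamic-programming recurrence rather than any particular tool's implementation.

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def edit_distance(s1: str, s2: str) -> int:
    """Minimal number of unit-cost insert/update/delete operations (assumed costs)."""
    # prev[j] = distance between the already processed prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a from s1
                curr[j - 1] + 1,           # insert b into s1
                prev[j - 1] + (a != b),    # keep a match or update a to b
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming_distance("NGHLILLE", "HGALGLLE"))  # 3
    print(edit_distance("NPHGIIMGLAE", "HGALGLLE"))
```

A weighted edit distance would replace the constant costs with entries from a substitution matrix such as PAM or BLOSUM and separate gap-opening and gap-extension penalties, which is the step that leads to Needleman-Wunsch and Smith-Waterman.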