Advanced Tools and Algorithms in Bioinformatics
Chittibabu Guda
Summer, 2004
UCSD Extension, Department of Biosciences

Clustering Tools
• Clustering is the grouping of related sequences based on set thresholds such as length, % identity, composition, etc.
• % identity is the most commonly used criterion for removing redundant sequences from databases
• Clustering speeds up database searches by orders of magnitude with minimal loss of content
• The general principle in clustering is pair-wise alignment of sequences in all-to-all combinations
• The most commonly used tools are
  • blastclust
  • cd-hit

BLASTCLUST
http://www.csc.fi/molbio/progs/blast/blastclust.html
• BLAST score-based single-linkage clustering
• All sequences in the database are compared pair-wise in all-to-all combinations, based on the BLAST score
• For each pair, the top-scoring alignment is evaluated on two factors
  • Length coverage: L'/L (for one or both sequences)
  • Score density: I/AL
  • where L' is the length of the sequence in the alignment, L is the total length of the sequence, I is the number of identical residues and AL is the total alignment length (L' + gaps)
• If both factors score above the set thresholds, the two sequences are considered neighbors
• The default e-value is 1e-6

CD-HIT
http://bioinformatics.ljcrf.edu/cd-hi/
• This program is 20-30 times faster than BLASTCLUST because it avoids all-to-all comparison by pair-wise alignment
• Short-word filters are applied to reduce the number of pair-wise alignments
• First, index tables are built for short words of 2-5 residues, in all possible combinations
  • (ABC-), a 4-letter alphabet, can make a maximum of 16 two-letter pairs
  • AB, AC, A-, BA, CA, -A, BC, B-, CB, -B, C-, -C, AA, BB, CC, --
  • So, for (20+1) amino acids, the index table size would be 21^n, where n is the word size (if n=5, the total number of words would be ~4 million)
• The program compares the type and number of identical peptides between the representative
and the new sequence
• Only those pairs that meet the minimum criterion are aligned further to confirm the identity
• Very fast algorithm for clustering larger databases like NR

Phylogenetic Analysis

Terminology
• Homologous: Similar owing to common ancestry
• Paralogous: Similar sequences in the same species, originated by gene duplication
• Orthologous: Similar sequences in different species, arising by divergent evolution
• Xenologous: Genes acquired by horizontal gene transfer
• Analogous: Similarity by convergent evolution

Methods of building phylogenetic trees
• Based on the data processing
  • Discrete methods
    • Maximum-parsimony method
    • Maximum-likelihood method
  • Distance-based methods
• Based on the tree-building algorithm
  • Clustering methods
    • UPGMA
    • Neighbor-joining
  • Optimality criterion

Distance-based versus discrete methods
• Distance methods first convert aligned sequences into a pair-wise distance matrix and then feed the matrix into a tree-building method
• Discrete methods are based on characters, i.e., they consider each nucleotide or amino acid directly
• In distance methods, the biological information is lost once the distance matrix is built, whereas discrete methods preserve additional information such as which site contributes to the length of each branch
• Distance-based methods are faster and easier to implement than discrete methods

Clustering versus optimality criteria-based methods
• Clustering methods follow a fixed set of steps and arrive at a single tree, whereas optimality criterion methods build the set of all possible trees and select the best one based on its score
• Clustering methods do not allow us to evaluate competing hypotheses
• Clustering methods are faster, easy to implement and produce an unambiguous output, while the optimality methods are computationally very expensive
• Optimality methods often result in good quality trees since they can be interactively corrected

Parsimony Methods: Background
• The Eck and Dayhoff method counts the number of all-to-all amino
acid substitutions in a phylogeny, but it treats substitutions of high and low probability (according to the genetic code) equally
  • Ex: AAA (K) → CGC (R) vs AAC (N) → AGC (S)
• The Fitch method counts the minimum number of nucleotide changes required to achieve the observed variation, but it treats synonymous and non-synonymous changes equally
  • Ex: UUU (F) → CUU (L) → CUA (L) → CAA (Q)
• The maximum parsimony method takes a middle course between the two: all amino acid changes must be consistent with the genetic code, and synonymous changes are counted fewer times than non-synonymous changes
• In the above example the number of changes from F → Q is counted as two, not three

Maximum Parsimony Method
• Also called minimum evolution method
• Predicts the tree(s) that minimize the number of steps required to generate the observed variation in the sequences
• For each aligned column in the multiple alignment, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed variation are identified
• Finally, the trees that produce the smallest number of changes overall, across all sequence positions, are identified
• Very time consuming; not suitable for a large number of sequences or for sequences with a large amount of variation
• For DNA: DNAPARS
• For proteins: PROTPARS

Protpars Example [figure]

Distance-based Method
• Distance between pairs of sequences is calculated based on
  • Dayhoff's PAM matrix values
  • Fraction of non-identical amino acids between the two sequences
  • Whether the conversion of amino acids is within the group or to a different group
• An (n x n) distance matrix is calculated over all pair-wise combinations; the matrix is symmetric, so the halves on either side of the diagonal are identical
• The distance matrix is used as input to different algorithms to calculate an optimal evolutionary tree

Distance Matrix generated by Protdist
[Matrix over HUMAN, MOUSE, DROME, SOLTU, WHEAT, ARATH, NEUCR, YEAST; values not reproduced]

Distance method continued …
• The key is how best the pair-wise
distances are made additive on a predicted evolutionary tree
• Using the distance matrix, several phylogenetic trees are built and evaluated based on the following criteria
  • Goodness-of-fit methods seek the metric tree that best accounts for the observed pair-wise distances
  • Minimum evolution method: seeks the tree whose sum of branch lengths is the minimum
• Methods used
  • FITCH: Based on the Fitch-Margoliash method
  • NEIGHBOR: Based on the neighbor-joining or UPGMA methods

Feng-Doolittle Method …..
Tree building using Fitch-Margoliash method (1967)

              Human  Chimp  Gorilla  Orang
A  Human        0
B  Chimp       88      0
C  Gorilla    103    106      0
D  Orang      160    170    166      0

$D_a = (D_{AB} + D_{AC} - D_{BC}) / 2$
$D_b = (D_{AB} + D_{BC} - D_{AC}) / 2$
$D_c = (D_{AC} + D_{BC} - D_{AB}) / 2$

Join the first 3 sequences
$D_a = (88 + 103 - 106) / 2 = 42.5$
$D_b = (88 + 106 - 103) / 2 = 45.5$
$D_c = (103 + 106 - 88) / 2 = 60.5$
[Tree diagram joining A, B and C; branch labels shown: 42.5, 45.5, 51.5, 9.0]

Feng-Doolittle Method …..
Human and Chimp are merged into one cluster; its distance to each remaining sequence is the average of the two members' distances, e.g. (103 + 106) / 2 = 104.5 and (160 + 170) / 2 = 165.

                Hum/Chimp  Gorilla  Orang
A  Hum/Chimp        0
B  Gorilla        104.5       0
C  Orang          165       166      0

Join the 4th sequence to current tree
$D_a = (104.5 + 165 - 166) / 2 = 51.75$
$D_b = (104.5 + 166 - 165) / 2 = 52.75$
$D_c = (165 + 166 - 104.5) / 2 = 113.25$
[Tree diagram joining A', B and C; branch labels shown: 42.5, 45.5, 9.25, 52.75, 30.75, 82.5]

Maximum-Likelihood Methods
• These are discrete methods similar to maximum parsimony (MP) methods; however, probability calculations are used to find the tree that best accounts for the variation in a set of sequences
• Analysis is performed on all columns in the multiple alignment and all possible trees are considered
• Compared to MP methods, more divergent sequences can be analyzed
• However, the main disadvantage is that these methods are computationally intensive

Genome-scale Data Analysis
[Flowchart with nodes: Sequenced Genome, Ensembl/translation, Complete Proteome, Interpro/Pfam (Known function? Yes/No), Pdb search (Known structure? Yes/No), Unknown function & structure]

Finding
right tools for right tasks
• Finding paralogues by clustering (BLASTCLUST, CD-HIT)
• Finding homologues and orthologues (BLAST)
• Finding remote homologues (PSI-BLAST)
• Finding functional annotation (PFAM, INTERPRO)
• Finding structural annotation (Blast PDB)
• Finding low-complexity regions (SEG, CAST)
• Finding transmembrane regions (TMHMM)
• Finding disordered regions (COILS, PONDR)
• Finding secondary structure (JPRED, TOPpred)

Accessing Tools and Data
• Web-based tools vs. standalone tools
• Download
  • NCBI: ftp://ftp.ncbi.nih.gov
  • EBI: ftp://ftp.ebi.ac.uk
  • PDB: ftp://ftp.rcsb.org
  • PFAM: ftp://ftp.genetics.wustl.edu
• Local installation and configuration

Structure-based Algorithms

Protein Data Bank (PDB)
http://www.rcsb.org
• About 26000 structures including X-ray, NMR and models
• Structures include 23597 proteins, 1108 protein/nucleic acid complexes, 1336 nucleic acids and 18 carbohydrates
• Sequence numbering
• PDB/Atomic numbering
• PDB ID/chain ID

Growth of PDB entries [figure]
Growth of new folds in PDB [figure]

NIGMS funded Structural Genomics Projects
• Midwest Center for Structural Genomics
• Northeast Structural Genomics Consortium
• New York Structural Genomics Research Consortium
• Southeast Collaboratory for Structural Genomics
• Structural Genomics Center
• Tuberculosis (TB) Structural Genomics Consortium
• Joint Center for Structural Genomics
• Center for Eukaryotic Structural Genomics
• Structural Genomics of Pathogenic Protozoa Consortium

Protein Structure Databases
• SCOP: Structural Classification of Proteins
• CATH: Class, Architecture, Topology & Homologous superfamily
• FSSP/DALI: Fold classification based on Structure-Structure alignment of Proteins
• HSSP: Homology-derived Secondary Structure of Proteins
• HOMSTRAD: Homologous Structure Alignment Database
• DSSP: Database of Secondary Structure Assignments
• DMAPS: Database of Multiple Alignment for Protein Structures

Structure Alignments
• Protein structures are determined by X-ray
crystallography or NMR methods
• Structural alignment involves establishing equivalences between residues in two or more proteins based on their 3-D coordinates
• The 3-D coordinates of Cα atoms are most commonly used to calculate distances in structural alignments

Methods used for structure alignment
• Dynamic programming (Taylor & Orengo, 1989)
• Combinatorial Extension (Shindyalov & Bourne, 1998)
• Monte Carlo method (Mirny & Shakhnovich, 1998; Guda et al., 2001)
• Environment profile method (Jung & Lee, 2000)
• Genetic Algorithms (May & Johnson, 1995)

Combinatorial Extension (CE) Method
http://cl.sdsc.edu/ce.html
• The CE method is based on determining Aligned Fragment Pairs (AFPs) with local similarities and joining AFPs to form a continuous path
• AFPs are based on the difference in the local geometry of the structures being compared
• For example, inter-residue distances are calculated between 8 residues in all possible combinations, except between neighboring residues ((n-1)(n-2)/2). This is done for all candidate AFPs in each structure
• The difference (d) in the average distances is calculated, and all candidate AFPs with d under some threshold are considered AFPs
• Consecutive AFPs are selected by calculating inter-residue distances between two AFP members in the same chain in 64 (8x8) combinations and selecting the ones with the minimum average difference (d)

CE Method … Extending the optimal path
• The alignment path is constructed from AFPs selected from any position in the similarity matrix, and consecutive AFPs are added in either direction such that
  • two consecutive AFPs are aligned without gaps, OR
  • two consecutive AFPs are aligned with gaps inserted in either of the proteins, but not in both
• The maximum allowable gap size is 30; as a consequence, similarities that require a gap size > 30 are misrepresented by this algorithm
• A few of the best alignments are superimposed and the r.m.s.d.
(root mean square deviation) is iteratively optimized using dynamic programming by adjusting gaps
• Finally, the pair with the lowest RMSD value is selected

FSSP/DALI
http://www.ebi.ac.uk/dali/fssp/fssp.html
• Fold classification based on Structure-Structure alignment of Proteins
• All structures in the PDB are clustered into families based on 25% sequence identity, and a representative is selected for each family
• FSSP was built using a completely automatic method (DALI), based on all-against-all comparison of a representative set of structures
• DALI (Distance matrix ALIgnment) is based on distance maps that contain all pair-wise distances between residue centers, i.e., Cα atoms
• The distance matrices from each protein are decomposed into hexapeptide-hexapeptide submatrices. Similar contact patterns are paired and combined into larger sets of pairs
• A Monte Carlo procedure is used to optimize the similarity score
• Multiple structure alignments were built from pair-wise comparisons of representative and member within each family, and between representatives

HOMSTRAD
http://www-cryst.bioc.cam.ac.uk/homstrad/
• HOMologous STRucture Alignment Database
• 1032 families with 3454 structures
• Structures with only C-alpha values were excluded
• Structurally similar proteins were clustered into homologous families and alignments were built based on 3-D coordinate data
• Uses COMPARER and MNYFIT for building structure alignments
• Multiple alignments were calculated only for representative members of each family

Limitations of current methods
Most of the multiple alignment methods are based on master-slave or progressive alignments. These are biased towards the master structure or the initial alignment
Example: [figure; label: master]

Monte Carlo Optimization Method
http://cemc.sdsc.edu
http://dmaps.sdsc.edu
Problem: Most of the multiple alignment methods are based on pair-wise alignment of structures to a master structure.
This leads to alignments biased towards the master, ignoring the similarities among the other structures

Essential elements of the Method
• The Target/Scoring function
• The Search Algorithm
• The Search Constraints
• Algorithm

General Monte Carlo Approach
• Compute a distance-based score for the current alignment
• Make a random trial change to the current alignment and compute the change in the score (ΔS)
• If ΔS > 0, the move is always accepted
• If ΔS <= 0, the move may still be accepted, subject to an acceptance term P that depends on C, ΔS and m, where
  - C is a constant
  - m is the trial move count
• Once a move is accepted, the change in the alignment becomes permanent
• This procedure is iterated until there is no further change in the score, i.e., the system has converged

Monte Carlo Simulation ... Scoring function (Modified from Levitt & Gerstein, 1998)
- S is the total score for the alignment
- l is the total number of columns and i is the column position in the alignment
- M = 20 (maximum score of a column, chosen arbitrarily)
- d_i is the average Cα distance between residues in column i:
  $d_i = \frac{1}{N} \sum_{p<q} d_{pq}$
- p and q are residues in column i
- N = m(m-1)/2 (all-to-all combinations)
- m is the residue count in column i
- d_0 is a constant (the distance increase that can be tolerated)
- $A = \begin{cases} 0, & d_i \le d_0 \\ 10, & d_i > d_0 \end{cases}$
- G is the affine gap penalty term (G = I + pE), where I = 15 and E = 7; I and E are the gap initiation and extension penalties, respectively, and p is the number of gap extensions

Monte Carlo Simulation ... Search Constraints
• Minimum block length: > 3 (3-6)
• Residue threshold: 50 % (33-66 %)
[Figure labels: Block; Free pool]

Monte Carlo Simulation ... Random Trial Move Set
1. Shift Right
2. Shift Left
3. Expand Right
4. Expand Left
5. Shrink Right
6. Shrink Left
7. Split/Shrink

Monte Carlo Simulation ... Shift Left
Before accepting move: Score = 30796, Distance = 3.815
After accepting move: Score = 30846, Distance = 3.849
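The scoring terms and the accept/reject step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the `<=` boundary in `penalty_a` is assumed, and `accept_move` uses a generic Metropolis-style criterion with a hypothetical `temperature` parameter, swapped in because the slide's own acceptance expression is not legible. The total score S is not computed here, since its equation does not survive in the text.

```python
import math
import random
from itertools import combinations

GAP_INIT = 15.0  # I, gap initiation penalty (value from the slide)
GAP_EXT = 7.0    # E, gap extension penalty (value from the slide)


def column_distance(coords):
    """Average pairwise distance d_i over the N = m(m-1)/2 residue pairs."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        raise ValueError("a column needs at least two residues")
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def penalty_a(d_i, d0):
    """Penalty A: 0 within the tolerated distance d0, otherwise 10 (boundary assumed)."""
    return 0.0 if d_i <= d0 else 10.0


def gap_penalty(extensions):
    """Affine gap term G = I + p*E for p gap extensions."""
    return GAP_INIT + extensions * GAP_EXT


def accept_move(delta_s, temperature, rng):
    """Assumed Metropolis-style rule, not the slide's own expression."""
    if delta_s > 0:
        return True  # improving moves are always accepted
    # Non-improving moves are accepted with probability exp(delta_s / T).
    return rng.random() < math.exp(delta_s / temperature)


rng = random.Random(0)
# Hypothetical C-alpha coordinates for one alignment column of three residues.
column = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
d_i = column_distance(column)
print(d_i, penalty_a(d_i, 5.0), gap_penalty(2), accept_move(1.5, 1.0, rng))
```

Passing the random generator in explicitly keeps runs reproducible, which helps when comparing how different move sets affect convergence.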
Expand Right
Before accepting move: Score = 30850, Distance = 3.852
After accepting move: Score = 31048, Distance = 3.915
[Figure labels: Free pool of residues; Expanded fragment]

Monte Carlo Simulation ... Expand Left
Before accepting move: Score = 31093, Distance = 4.042
After accepting move: Score = 31500, Distance = 4.207
[Figure labels: Free pool of residues; Expanded fragment]

Monte Carlo Simulation ... Shrink
[Figure: before shrinking; after shrinking]

Monte Carlo Simulation ... Split and Shrink
[Figure: before split and shrinking; after split and shrinking]

Monte Carlo Simulation ... Typical Monte Carlo behavior
[Plot: number of alignment columns and average alignment distance versus move count]

Monte Carlo Simulation ... Relation between alignment improvement and distance increase
[Plot: change in the number of alignment columns versus change in the average alignment distance]

Monte Carlo Simulation ... Example 1
[Figure panels by ID: A (CE), B (CE+MC), C (HOMSTRAD)]

Monte Carlo Simulation ...
Example 2
[Figure panels by ID: A (CE), B (CE+MC), C (HOMSTRAD)]

CE-MC Web Server
• Accessible at http://cemc.sdsc.edu
• A web-based facility for performing multiple structure alignments
• Users can upload local coordinate files and compare them against PDB files
• Initial seed alignments are built with the CE algorithm and iteratively optimized using Monte Carlo optimization
• Results are emailed upon completion of the job
• Output is displayed in 4 different formats, as follows
  • JOY/html
  • JOY/post-script
  • Text
  • FASTA

DMAPS Web Server
• Accessible at http://dmaps.sdsc.edu
• Stores pre-calculated multiple structure alignments for all structural families in the PDB
• All structure chains in the PDB were clustered into ~1700 families, and multiple structure alignments were performed using the Monte Carlo algorithm
• The multiple structure alignment for a structure family is accessible with the PDB chain ID of any member of that family
• Results are retrieved and displayed in 4 different formats, i.e., JOY/html, JOY/post-script, Text and FASTA

Final Project Work