
Bioinformatics in Brief
Proteins Class1&2 M. Linial ‘02-’03

This week:
• Bioinformatics - what is it, what for
• DB for structures
• Structure Classification
• Structure-Function link

Swiss-Prot
• Established in 1986 and maintained collaboratively by SIB (Swiss Institute of Bioinformatics) and EBI/EMBL
• Provides high-level annotations, including description of protein function, structure of protein domains, post-translational modifications, variants, etc.
• Aims to be minimally redundant
• Linked to many other resources
• Considered the best

November 2002
• SWISS-PROT: 116,776 entries
• TrEMBL: 680,075 entries
• Best annotated DB

Still mistakes: algorithms may become a source of incorrect annotations
In the protein world:
1. Wrong gene finding (exon-intron)
2. Premature cleavage - wrong tails (nt sequencing mistakes)
3. ESTs may be misleading
4. Automatic assignment of features
5. No replacement for manual curators

Make sense of large DB - How?
• Which database to search? There are many
• How good are my results? (relevance and reliability)
• I didn't get any results; does it mean there aren't any?

Linking Functional Databases
• Essential addition: pathways, ligands
• Essential addition: putative TF-BS

Grand Plan
• Find all the genes
• Translate genes to proteins
• "Compute" function
• "Compute" structure

Goal of structure prediction
• Epstein & Anfinsen, 1961: sequence uniquely determines structure
• INPUT: sequence
• OUTPUT: 3D structure and function

Prediction of Function
What is function? This is not a simple term. Function may be:
• a biological process (e.g. serine protease activity)
• a molecular event (e.g. proteolysis of a specific substrate)
• a cellular structure (e.g. membrane; chromatin, etc.)
• relevance to a whole process (e.g. cell cycle)
• relevance to the whole organism (e.g.
ovulation)

Pattern Recognition
• Looks for motifs that may have functional relevance (family signatures):
  * Membrane anchoring
  * Catalytic site
  * Nucleotide binding
  * Nuclear localization signal
  * Hormone response element
  * Calcium binding, etc.
• Protein family resources (2nd week)

Homology
• What is "homology"?
• Definition: Two proteins are homologous if they are related by divergence from a common ancestor.
[Figure: divergent evolution from an ancestor (A) to homologous proteins (B, C, D)]

Analogy
• What is "analogy"?
• Definition: Two proteins are "analogous" if they acquired common structural and functional features via convergent evolution from unrelated ancestors.
[Figure: convergent evolution; unrelated ancestors (A, C) and analogous proteins (B, D) with similar structure and/or function]

Serine Proteases (Convergent Evolution)
• Trypsin-like: many homologous members
• Subtilisin-like: many homologous members
• The two groups are analogous proteins
• Trypsin and subtilisin share groups of catalytic residues with almost identical spatial geometries, but they have no other sequence or structural similarities.
• Aspartic acid - Histidine - Serine (D, H, S)

Human Kallikrein Gene Family (Divergent Evolution)
• 15 homologous genes on human chromosome 19q13.4
• Divergence in tissue expression and substrate specificity (trypsin-like, of S1; substrates Met|Lys, Arg|Ser in small molecules)
• Activates bradykinin

Orthologs
Proteins that usually perform the same function in different species (e.g. DNA polymerase; glucose 6-phosphate dehydrogenase; retinoblastoma gene; p53, etc.).

Paralogs
Proteins that perform different but related functions within one organism, usually formed by gene duplication and divergent evolution (e.g. the 15 kallikrein genes).

[Figure: orthologs and paralogs along evolutionary time. Labels: Ancestor, Paralog-1, Paralog-2, Ortholog (Xenopus, fish, human); new term: Old Paralog]

Why structure?
• Protein structure is more conserved than protein sequence, and more closely related to function.

Structural information
• Protein Data Bank: maintained by the Research Collaboratory for Structural Bioinformatics (RCSB)
  – > 16,500 structures of proteins
  – Also contains structures of protein/nucleic acid complexes, nucleic acids, carbohydrates
  – 19,200 together (Nov 2002)
• Most structures are determined by X-ray crystallography; the rest by NMR (15%) and electron microscopy (few).
• Some structures are also theoretically predicted.

Structural information
From solved structure to classification - what for?
• The structural space - rules, organization?
• Infer structure (modeling)
• Infer function - not trivial
• Protein engineering, drug design…

PDB is growing
[Figure: growth of the PDB]

Structure Alignment
• Why Structure Alignment?
• Algorithms for Structure Alignment

Why Structure Alignment?
• For homologous proteins (similar ancestry), this provides the "gold standard" for sequence alignment and elucidates the common ancestry of the proteins.
• For non-homologous proteins, it allows us to identify common substructures of interest.
• It allows us to classify proteins into clusters, based on structural similarity.

How do we recognize structural similarities?
• By eye (Alexei Murzin): SCOP, the gold standard for structure classification
• Algorithmically: growth of the PDB demands automated techniques for classification and fold detection.

Algorithms for Structure Alignment
• Distance based methods
  – DALI (Holm and Sander): Aligning scalar distance plots
  – STRUCTAL (Gerstein and Levitt): Dynamic programming using pairwise inter-molecular distances
  – SSAP (Orengo and Taylor): Dynamic programming using intra-molecular vector distance
  – Others (PRISM, CE…)
• Vector based methods
  – VAST (Bryant): Graph theory based secondary structure alignment
  – 3dSearch (Singh and Brutlag): Fast secondary structure index lookup
• Both vector and distance based
  – LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances

Databases of structural classification
• Consider the GOLD standard - Expert view
• SCOP
  – Murzin AG et al. 1995
  – Structural classification of protein structures
  – Manual assembly by inspection
  – All nodes are annotated (e.g. All-α, α/β)
  – Structural similarity search using 3dSearch (Singh and Brutlag)

From classification to distance maps

Dynamic view - too slow
[Figure: ratio of SCOP 1.59 to SCOP 1.37 counts for Domains, PDB, Fam, SF and Fold; axis from 0 to 3.5]

Dynamic view - too slow
[Figure: number of SCOP entities (Family, Superfamily, Fold; axis from 0 to 2000) against months from the SCOP 1.37 release (0 to 40)]

• CATH
  – Orengo et al. 1997
  – Class-Architecture-Topology-Homologous superfamily
  – Manual classification at Architecture level
  – Automated topology classification using the SSAP algorithms
  – No structural similarity search

Protein Classifications: CATH
Christine Orengo (Structure, 1997, 5, 1093-1108) © Christine Orengo

Protein Classifications: CATH
Class
• Determined according to the secondary structure composition
• It can be assigned automatically for over 90% of the known structures
• For the remainder, manual inspection is used

Protein Classifications: CATH
Architecture, A
• Determined by the orientations of the secondary structures, but ignores the connectivity between the secondary structures.
• It is currently assigned manually using a simple description of arrangements, e.g. barrel or 3-layer sandwich.
• Procedures are being developed for automating this step.

Protein Classifications: CATH
Topology (= fold)
• Structures are grouped into fold families depending on both the overall shape and connectivity of the secondary structures (SSAP algorithm).
• Structures which have a SSAP score of 70 and where at least 60% of the larger protein matches the smaller protein are assigned to the same T level or fold family.
• Some fold families are very highly populated; they are currently subdivided using a higher cutoff on the SSAP score.

Protein Classifications: CATH
"Score" as a classification criterion? Legitimate:
• Comparative study to SCOP etc. - good
• Separating power of the score - tested
• Using it in "real world" competitions
• Applying to structural genomics

Protein Classifications: CATH
Homologous Superfamily, H
This level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous. Similarities are identified first by sequence comparisons and subsequently by structure comparison using SSAP.
The criteria (any one of the following):
• Sequence identity >= 35%, 60% of larger structure equivalent to smaller
• SSAP score >= 80.0 and sequence identity >= 20%, 60% of larger structure equivalent to smaller
• SSAP score >= 80.0, 60% of larger structure equivalent to smaller, and domains which have related functions

Protein Classifications: CATH
S level: Sequence families, S
• Clustered on sequence identity.
• Domains clustered in the same sequence families have sequence identities >35% (with at least 60% of the larger domain equivalent to the smaller).

Protein Classifications: CATH
[Figure slides: CATH classification examples]

Example
[Figure: 3 domains in histone acetyltransferase]

SCOP vs CATH
[Figure slides: SCOP and CATH compared]

Coping with the data
• CATH addition: Temporary assignment

Moving from "classification" to a distance map - why?
• Biological examples
• Weak connections
• "Hopping" in the "map" to find relatedness (intermediate)

Databases of structural classification
• FSSP
  – L.L. Holm and C. Sander
  – Fully automated using the DALI algorithms (Holm and Sander)
  – No internal node annotations
  – Structural similarity search using DALI (considered best)
• Pclass
  – A. Singh, X. Liu, J. Chang, D.
Brutlag
  – Fully automated using the LOCK and 3dSearch algorithms
  – All internal nodes automatically annotated with common terms
  – Structural similarity search using 3dSearch

DALI
• Based on aligning 2-D intra-molecular distance matrices
• Computes the best subset of corresponding residues from the two proteins such that similarity between the 2-D distance matrices is maximized.
• Searches through the possible alignments of residues (Monte Carlo algorithms).

DALI
• DALI has been used to do an ALL vs. ALL comparison of proteins in the PDB, and to create a hierarchical clustering of families.
• FSSP = Fold classification based on Structure-Structure alignment of Proteins

From classification to distance maps

Structural distances - FSSP
• The FSSP database includes all protein chains from the PDB which are longer than 30 residues.
• A representative set of chains is selected with less than 25% sequence identity between members, so the set contains no pair of sequence homologs.
• An all-against-all structure comparison is performed on the representative set.

Structural distances - FSSP
• A hierarchical clustering method is used to construct a tree based on the structural similarities.
• Family indices are constructed by cutting the tree at levels of 2, 4, 8, 16, 32 and 64 standard deviations above the database average.

DALI finds surprising homologues
Many unexpected links:
• Histone and heat-shock protein
• Adenosine deaminase & phosphodiesterase (13% identity)
• Chymotrypsin & epidermolytic toxin A (S. aureus)

Structural distances - FSSP
• About 700 entries (as in the ENZYME DB)
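As a rough illustration of why DALI works with intra-molecular distance matrices, the sketch below builds such matrices from residue coordinates and measures how much they disagree for a given residue correspondence. It is not DALI's scoring function or its search procedure: the function names, the mean-absolute-difference measure and the toy coordinates are all assumptions made for this example.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between residue positions (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def matrix_disagreement(coords_a, coords_b, pairs):
    """Mean absolute difference between matched intra-molecular distances.

    pairs is a list of (i, j): residue i of structure A is matched to
    residue j of structure B. A lower value means more similar geometry.
    This is a simplified stand-in, not the score used by DALI.
    """
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    total, count = 0.0, 0
    for x in range(len(pairs)):
        for y in range(x + 1, len(pairs)):
            (ia, ib), (ja, jb) = pairs[x], pairs[y]
            total += abs(da[ia][ja] - db[ib][jb])
            count += 1
    return total / count if count else 0.0

# Hypothetical toy coordinates (angstroms). chain_b is chain_a shifted and
# turned onto another axis; bent has the same spacing but a different shape.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_b = [(1.0, 1.0, 1.0), (1.0, 4.8, 1.0), (1.0, 8.6, 1.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
alignment = [(0, 0), (1, 1), (2, 2)]

same = matrix_disagreement(chain_a, chain_b, alignment)
different = matrix_disagreement(chain_a, bent, alignment)
print(f"moved copy: {same:.3f}  bent chain: {different:.3f}")
```

Because only distances within each chain are compared, the measure does not change when one structure is rotated or translated, so no superposition is needed before the comparison.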

VAST - Vector Alignment Search Tool
• Aligns only secondary structure elements (SSE)
• Represents each SSE as a vector
• Finds all possible pairs of vectors from the two structures that are similar
• Uses a graph theory algorithm to find the maximal subset of similar vectors.

VAST
• VAST has been used to do an ALL vs. ALL comparison of proteins in the MMDB (NCBI's structure database), and to find structure neighbors for each structure.
• MMDB provides a service for searching structure neighbors using VAST.

Proteins and major challenges
• Predicting protein function on a genomic scale

Understand proteins through analyzing large amounts of data
• Structures, Functions, Evolution (motions, packing, folds)

A new concept called PROTEOME (PROTEin complement to a genOME)
Proteomics can be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes.

How to predict function for 1000s of proteins?
• 250 of ~650 known on chr. 22 [Dunham et al.]
• >>30K+ genes in the entire human genome (alt. splicing)

How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural genomics)
3) Clustering a microarray experiment
4) Data integration
5) Advanced classification methods using multilayer information

Function is a fuzzy term
[Table: functional categories scored from + to +++++. Categories: Structural (cytoskeletal); Transporters (channels); Signaling (switch, adaptor); Transcription (NA binding); Recognition (receptors, immune); Enzymes]

How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
• Compare uncharacterized genome sequences against known sequences in DBs, transferring function annotation for similar sequences
• Issue: the threshold is the major parameter and limitation
• Also, look for motifs & sites [Sternberg, Thornton, Rose, Koonin]
• Wait for next class

1000s of structurally based alignments of structurally and functionally characterized sequences
[Figure: from Sequence to Function. Labels: Similarity scores, Domains, Motifs, Signatures, Organized DB, Families, …; one feature: ENZYMES]

Relatively easy function: ENZYMES

Functionally characterized Enzymes: By Cofactors
6-hydroxyDOPA, Ammonia, Ascorbate, ATP, Bicarbonate, Bile salts, Biotin, Cadmium, Calcium, Cobalamin, Cobalt, Coenzyme F430, Coenzyme-A, Copper, Dipyrromethane, Dithiothreitol, Divalent cation, F420, FAD, Fe(II), Flavin, Flavoprotein, FMN, Glutathione, Heme, Heme-thiolate, Iron, Iron(II), Iron-molybdenum, Iron-sulfur, Lipoyl group, Magnesium, Manganese, Molybdenum, Molybdopterin, Monovalent cation, NAD, NAD(P)H, Nickel, Potassium, PQQ, Protoheme IX, Pterin, Pyridoxal phosphate, Pyridoxal-phosphate, Pyruvate, Reduced flavin, Selenium, Siroheme, Sodium, Tetrahydropteridine, Thiamine pyrophosphate, Thiol-dependent, Tryptophan, …

Functionally characterized Enzymes: Catalysis
1.-.-.-  Oxidoreductases
  1.1.-.-  Acting on the CH-OH group of donors
  1.2.-.-  Acting on the aldehyde or oxo group of donors
  1.3.-.-  Acting on the CH-CH group of donors
  1.4.-.-  Acting on the CH-NH(2) group of donors
  1.5.-.-  Acting on the CH-NH group of donors
  1.6.-.-  Acting on NADH or NADPH
5.-.-.-  Isomerases
  5.1.-.-  Racemases and epimerases
  5.2.-.-  Cis-trans-isomerases
  5.3.-.-  Intramolecular oxidoreductases
  5.4.-.-  Intramolecular transferases (mutases)
  5.5.-.-  Intramolecular lyases
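The EC numbering shown above is hierarchical, so two annotations can be compared level by level. The following is a minimal sketch of such a comparison; the function names and the wording of the labels are assumptions for illustration, and the EC numbers used are ones that appear in these slides.

```python
def parse_ec(ec):
    """Split an EC number such as '5.3.1.1' into its four levels.

    A '-' marks a level that is left unspecified, as in '5.3.-.-'.
    """
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels in EC number, got {ec!r}")
    return parts

def shared_levels(ec_a, ec_b):
    """Count how many leading EC levels two numbers have in common (0 to 4)."""
    shared = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a == "-" or b == "-" or a != b:
            break
        shared += 1
    return shared

# Illustrative labels for each depth of agreement (wording is an assumption).
LABELS = {
    0: "different classes",
    1: "same class",
    2: "same subclass",
    3: "same sub-subclass",
    4: "same exact function",
}

def describe(ec_a, ec_b):
    """Human-readable summary of how closely two EC numbers agree."""
    return LABELS[shared_levels(ec_a, ec_b)]

print(describe("5.3.1.1", "5.3.1.1"))   # two TP isomerase annotations
print(describe("5.3.1.1", "5.3.1.24"))  # TP isomerase vs PRA isomerase
print(describe("5.3.1.1", "4.1.3.3"))   # TP isomerase vs aldolase
```

Stopping at the first mismatch or unspecified level keeps the comparison conservative: a partially annotated entry such as 5.3.-.- can only ever confirm agreement down to the levels it actually states.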

1000s of structurally based alignments of structurally and functionally characterized sequences
[Figure: Sequence versus Function. Enzymes shown: 5.3.1.1 (TP Isomerase), 5.3.1.24 (PRA Isomerase), 5.3.1.5 (Xylose Isom.), 4.1.3.3 (Aldolase), 4.2.1.11 (Enolase). Species labels: Human, Chick, E. coli, B. ster., Yeast. Sequence identity labels: 90%, 45%, 20%. Function labels: Same Exact; Both Class 5 (isom.); Different Classes]

Relationship of Similarity in Sequence to that in Function
[Figure, built up over several slides: % Same Function (0 to 100), the percentage of pairs that have the same precise function as defined by the Enzyme & FlyBase functional classifications, plotted against the sequence similarity of pairs of proteins (%ID, 70 down to 0). Regions marked from high to low %ID: can transfer both fold & functional annotation; can transfer annotation related to fold but not function; cannot transfer fold or functional annotation ("Twilight Zone"). Final panel: Broad vs Narrow Similarity]

Relationship of Similarity in Sequence to that in Function: Caveats
• Sequence divergence of multidomain proteins implies a high threshold (>40-50%)
[Figure: % Same Function for single-domain sequences versus multidomain sequences. Species labels: Human, Chick, E. coli, B. ster., Yeast, Rat]

How to predict functions for 1000s of proteins?
1) "Traditional" sequence patterns
2) Via fold similarity (structural perspective)
   • Structures of ORFs with unknown function
   • Use fold & site similarity to determine function
   • Rationale for structure prediction
   • Issue: To what degree does fold determine function?
   [Kim, Edwards & Arrowsmith, Montelione, Burley, Eisenberg]

Fold-Function Combinations
• Many functions on the same fold (TIM-barrel)
• Different folds with the same function (Carbonic Anhydrases, 4.2.1.1)

Fold-Function Combinations
• Same function, different fold
• Same class

Carbonic Anhydrase
[Figure: carbonic anhydrase structures]

Global View of Fold-Function Combinations
[Figure: matrix of 229 folds against 91 enzymatic functions plus Non-Enz]

Correlation with Structural Features
[Figure: the same matrix with folds grouped by architectural class (all-α, all-β, small) and functions grouped by enzyme class; a slight overpopulation is marked]

Global View of Fold-Function Combinations
[Figure: the same matrix after sorting]

To what degree is fold associated with function?
[Figure: frequency in database of 229 folds against the number of functions associated with a fold; folds with multiple functions]
[Similar results by Thornton]

Most Versatile Folds – Relation to Interactions
• The number of interactions for each fold = the number of other folds it is found to contact in the PDB

Some common folds in phylogenetic groups

Not all folds shared between phylogenetic groups
• Evolution of new folds
[Figure: fold counts for Plants, Eubacteria, Eukaryotes and Animals; values shown: 46, 156, 73, 20, 104, 90]

Summary
• Structural classification
• Fold-function: not a simple relation
• Structure is very informative
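The fold-function tallies discussed in these slides (how many functions share a fold, and how many folds carry one function) can be sketched as a simple grouping of (fold, function) pairs. The function names and the toy assignments below are hypothetical and only illustrate the bookkeeping; they are not data from the 229-fold survey.

```python
from collections import defaultdict

def functions_per_fold(assignments):
    """Group (fold, function) pairs into fold -> set of functions."""
    by_fold = defaultdict(set)
    for fold, function in assignments:
        by_fold[fold].add(function)
    return dict(by_fold)

def folds_per_function(assignments):
    """Group (fold, function) pairs into function -> set of folds."""
    by_function = defaultdict(set)
    for fold, function in assignments:
        by_function[function].add(fold)
    return dict(by_function)

# Hypothetical toy data: fold names are placeholders, functions are EC numbers.
toy_assignments = [
    ("fold_A", "5.3.1.1"),
    ("fold_A", "5.3.1.24"),
    ("fold_A", "4.1.3.3"),
    ("fold_B", "4.2.1.1"),
    ("fold_C", "4.2.1.1"),
]

per_fold = functions_per_fold(toy_assignments)
per_function = folds_per_function(toy_assignments)

# Folds ranked by how many distinct functions they carry (most versatile first).
ranking = sorted(per_fold, key=lambda fold: len(per_fold[fold]), reverse=True)
print(ranking[0], len(per_fold[ranking[0]]))
print(sorted(per_function["4.2.1.1"]))
```

In this toy set, fold_A plays the role of a versatile fold with several functions, while 4.2.1.1 plays the role of one function found on different folds, mirroring the two cases named on the Fold-Function Combinations slide.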
