Classification: understanding the diversity and principles of protein structure and function

MCSG 2001 structures

Protein structure classification
Main reference: Robert B. Russell (2002) Classification of Protein Folds. Molecular Biotechnology 20:17-28.
- Importance: central to studies of protein structure, function, and evolution
- Philosophy: phyletic vs. phenetic
- Method: structure comparison + human knowledge

Philosophy of classification
- Phyletic: based on phylogenetic relationship
- Phenetic: based on study of phenomena (phenomenological)

Classification Unit: Domain, a LEGO piece
(Ranganathan)

From domain to assembly
- Domains are shuffled, duplicated, and fused to make proteins
- On average, a domain is 173 a.a. in size, compared to 466 a.a. for a yeast protein
- Most natural domain sequences assume one of a few thousand folds, of which ~1000 are already known
- No satisfactory estimate yet for the number of macromolecular complexes
- On average, a yeast complex may consist of 7.5 proteins
(Sali et al. 2003)

Distribution of protein size
(Swiss-Prot)

Structural vs. functional domain
Russian doll: a conceptual problem
(Singh)

Approaches
- Hierarchical
- Based on the types and arrangements of secondary structures
- Unit (level): domain

Domain assignment
- structural vs. functional (fold or function in isolation)
- automated assignment methods (structure vs. sequence)
(A. P. Singh)

Assignment of Class
- All α or all β (could be subjective)
- α/β (βαβ unit) or α+β
- Other classes

Class assignment could be subjective

All-alpha structures

All-beta structures
- Superoxide dismutase

Alpha/beta structures
- Closed barrel
- Open twisted sheet
- β-α-β motif (barrel) (sheet)

α/β vs. α+β

Assignment of Fold
- Defined by the number, type, and arrangement of secondary structure elements (SSEs)
- Connectivity (e.g.
circular permutation, scrambled proteins)

Assignment of Superfamily
Homologous even in the absence of significant sequence similarity
- certain level of structural similarity
- unusual structural features
- low but significant sequence similarity from structural alignment
- key active site residues
- sequence similarity bridges

Divergence vs. convergence

Divergent vs. convergent evolution
- Divergent evolution: descent from a common ancestor; proteins become variant due to mutation
- Convergent evolution: no common ancestor; proteins become similar due to functional or physical constraints

Anti-freeze protein: convergent evolution
(crystal.biochem.queensu.ca)

Homologous fold
(Ranganathan)

Analogous fold
(Ranganathan)

Analogous or homologous?
[Figure: Scallop Myosin Regulatory Domain, C chain, compared with Aldehyde Oxidoreductase, A chain; N and C termini labeled on each]

Assignment of Family
- significant sequence similarity

Classification databases
- SCOP: careful assignment of evolutionary relationships; homologous vs. analogous
- CATH: A = architecture
- FSSP: a list of structural neighbors

CATH
- Class: SSE composition & packing
- Architecture: overall shape of domain, ignoring SSE connectivity
- Topology (Fold): considers connectivity
- Homologous superfamily: a common ancestor
(Singh)

Classification databases
- CATH: Class, Architecture, Topology, and Homologous superfamily, a hierarchical classification of protein domain structures
  http://www.biochem.ucl.ac.uk/bsm/cath_new/
- SCOP: Structural Classification Of Proteins: augmented manual classification
  http://scop.mrc-lmb.cam.ac.uk/scop/
- FSSP: Fold classification based on Structure-Structure alignment of Proteins
  http://www2.ebi.ac.uk/dali/fssp/

Genome-scale structure analysis Curr. Opin. Str.
Biol., 2003

Genome-scale structure annotation

Some statistics
- 80% of sequence families belong to 400 folds (the top 10 folds account for 40% of sequence families)
- >60% of genes encode multi-domain proteins (80% for eukaryotes)
- ~50,000 protein families and ~150,000 singletons
- Structural superfamilies: ~1800 (+/-50), and ~10,000 unifolds
- 50-60% of distant homologs (<25% seq. id.) can be recognized by profile-based sequence comparison methods (e.g. PSI-BLAST, HMM, etc.)
- 50-60% of the enzymes in yeast and E. coli are common, and >80% of pathways are shared

Superfolds, superfamilies, supersites
TIM barrel, Rossmann-like, ferredoxin-like, β-propellers, 4-helix bundle, Ig-like, β-jelly rolls, oligonucleotide/oligosaccharide binding (OB) fold, SH3-like.

Structure -> function (only 50% correct)
Structure implicates function?

Assessing the Progress of Structural Genomics Projects
1 Nov. 2002, Science

Target Tracking by PDB (Sep 2002)

PDB content growth (May 2005)

Some statistics
- 11 SG consortia contributed 316 non-redundant PDB entries comprising 459 CATH and 393 SCOP domains.
- 14% of the targets have a homolog (>30% sequence identity) solved by another consortium.
- 67% of SG domains in CATH are unique vs. 21% of non-SG domains.
- 19% and 11% contributed new superfamilies and new folds, respectively.
- They allow new and reliable homology models for 9287 non-redundant gene sequences in 208 completely sequenced genomes.

PSI Structure Statistics 2002-2003
                                  PSI    PDB
Unique structures (30% seq. ID)   70%    10%
New folds                         12%    3%
(NIGMS Protein Structure Initiative)

Average total cost per structure
PSI Pilot phase
  Year   01      02      03      04    05
  Cost   $650 K  $400 K  $240 K  ?     $100 K (goal)
  Slide annotations: (7 centers), (9 centers)
PSI-2 Production phase, 06-10: $50 K (goal)
Comparison: ~$250-300 K

PSI Pilot Phase: Lessons Learned
1. Structural genomics pipelines can be constructed and scaled up
2. High-throughput operation works for many proteins
3. Genomic approach works for structures
4. Bottlenecks remain for some proteins
5. A coordinated, 5-year target selection policy must be developed
6. Homology modeling methods need improvement

PSI-2 Production Phase (2005)
Interacting network for high-throughput protein structure determination with three components:
- Large-scale centers for protein structure production of selected targets
- Specialized centers for technology development leading to high-throughput structure determination of difficult proteins
- Specialized centers for protein structures relevant to disease (other NIH Institutes and Centers)
Included in NIH Structural Biology Roadmap plans

Computational structural genomics
[Figures: Summary table; Fold occurrence matrix; Common Folds; Unique Folds]

Main findings
- Folds can be assigned to ~25% of ORFs and ~20% of amino acids for the 20 genomes
- >80% of SCOP folds are identified in one of the 20 organisms
- Worm and E. coli have the most distinct folds
- Level of gene duplication (2.4 folds in MG, 32 in worm) is higher than observed based on sequence only
- Top three most common folds: P-loop NTP hydrolase, the ferredoxin fold, TIM-barrel
- Unique folds tend to be those involved in cell defense (e.g. toxins)
- Common folds tend to be more "symmetrical"

Fold evolution
- Insertion, deletion, substitution
- α-helix & β-sheet substitution in Rossmann-fold-like proteins
- A path from all-β to all-α proteins

Circular Permutation (CP)
[Diagram: segments A, B, C, D with N and C termini; segment order ..A..B..C..D.. compared with ..C..D..A..B..]
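The segment orders ..A..B..C..D.. and ..C..D..A..B.. in the circular permutation diagram can be illustrated with a toy check. The sketch below is not a structure-comparison method: it assumes each protein has already been reduced to a string of segment labels, and it only tests whether one string is an exact rotation of the other. The function names `rotation_offset` and `is_circular_permutation` are hypothetical and chosen for illustration.

```python
from typing import Optional


def rotation_offset(original: str, candidate: str) -> Optional[int]:
    """Return the cut position in `original` that produces `candidate`.

    A string is a rotation of another exactly when it has the same length
    and occurs inside the other string concatenated with itself.
    Returns None when `candidate` is not a rotation of `original`.
    """
    if len(original) != len(candidate):
        return None
    if not original:
        return 0
    index = (original + original).find(candidate)
    return index if index != -1 else None


def is_circular_permutation(original: str, candidate: str) -> bool:
    """True when `candidate` is `original` cut at one point and rejoined."""
    return rotation_offset(original, candidate) is not None


if __name__ == "__main__":
    # Segment orders taken from the slide: A-B-C-D versus C-D-A-B.
    print(rotation_offset("ABCD", "CDAB"))
    # A scrambled order that is not a simple rotation.
    print(is_circular_permutation("ABCD", "ACBD"))
```

Real circular permutants differ in sequence as well as in segment order, so an exact string match like this would only apply after segments have been matched by a structural alignment.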
Circular permutation example
[Figure: 1nls (Concanavalin) and 1led (Lectin), with N and C termini labeled]

Strand invasion/withdrawal

Hairpin flips/swaps

Sickle-cell hemoglobin confers resistance to malaria

Hemoglobin & sickle cell anemia

Lethal legos as killer clumps
The inherited form of Lou Gehrig's disease, familial amyotrophic lateral sclerosis (FALS), causes a decay of the motor neurons in the spinal cord and brain, a devastating loss of bodily control, and death within 2 to 5 years.
(Elam et al. Nat. Str. Biol., 2003)