
Protein Structure, Classification and Prediction
BMI 730
Victor Jin
Department of Biomedical Informatics
Ohio State University

Protein Structure
Protein Structure Determination
Protein Structure Classification
- SCOP
- CATH
Secondary Structure Prediction
Tertiary Prediction
Structure Prediction Evaluation
- CASP

Chemistry
• Proteins are linear hetero-polymers of amino acids.
• There are twenty different amino acids (building blocks).
3-letter code: VAL ARG LYS ILE GLU PRO ARG GLU
1-letter code: V R K I E P R E

Peptide bond
• The peptide bond has double-bond character.
• The peptide bond is planar.
• Of the three backbone torsion angles, two (φ and ψ) are freely rotatable and one (ω, about the peptide bond) is fixed.
• Chain length:
---- Peptide: ~ 2-10 amino acids
---- Polypeptide: ~ 10-50 amino acids
---- Protein: ~ 50 or more amino acids
http://www.imb-jena.de/~rake/Bioinformatics_WEB/basics_peptide_bond.html

Amino acids
Side chain properties
• Size
• Charge
• Polarity
http://www.ch.cam.ac.uk/SGTL/Structures/amino/

Hierarchical nature of protein structure
Primary structure (amino acid sequence)
↓
Secondary structure (local conformations: α-helix, β-sheet, reverse turn and loop)
↓
Tertiary structure (global conformation: the three-dimensional structure that results from folding the secondary structures together)
↓
Quaternary structure (structure formed by more than one polypeptide chain)

Basic structural units of proteins: Secondary structure
• α-helix
• β-sheet
The secondary structures, α-helix and β-sheet, have regular hydrogen-bonding patterns.

Tertiary structure
• In globular proteins such as enzymes, the long chain of amino acids folds into a three-dimensional functional shape, the tertiary structure. The fold is stabilized in part by disulfide (S-S) bonds, which form between amino acids carrying sulfhydryl (SH) groups in the same chain.
• Other interactions between R groups of amino acids, such as hydrogen bonds, ionic bonds, covalent bonds, and hydrophobic interactions, also contribute to the tertiary structure.

A few examples of tertiary structure
• Dihydrofolate reductase
• Myoglobin

Quaternary structure
• Non-covalent interactions bind multiple polypeptides into a single, larger protein.
• Hemoglobin has quaternary structure due to the association of two alpha globin and two beta globin polypeptides.

Structure Stabilizing Interactions
Non-covalent
• Van der Waals forces (transient, weak electrical attraction of one atom for another)
• Hydrophobic (clustering of nonpolar groups)
• Hydrogen bonding
Covalent
• Disulfide bonds

Protein structure determination
• Protein expression
• Membrane proteins
• Aggregation
• X-ray crystallography
• NMR (nuclear magnetic resonance)
• Cryo-EM (electron microscopy)

Protein Structure Classification - SCOP
• Structural Classification of Proteins database
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Hierarchical clustering
• Family: clear evolutionary relationship
• Superfamily: probable common evolutionary origin
• Fold: major structural similarity
• Boundaries between levels are more or less subjective.
• Conservative evolutionary classification leads to many new divisions at the family and superfamily levels, so it is recommended to focus first on the higher levels of the classification tree.
Protein Structure Classification - SCOP
Classes:
• α/α
• β/β
• α/β
• α+β
• Misc (examples: HIV protease complexed with pepstatin; T-cell-receptor/MHC/CD8 complex)

Protein Structure Classification - SCOP
SCOP Classification Statistics
SCOP: Structural Classification of Proteins. 1.69 release
25973 PDB Entries (1 Oct 2004). 70859 Domains. 1 Literature Reference
(excluding nucleic acids and theoretical models)

Class                                Number of folds   Number of superfamilies   Number of families
All alpha proteins                               218                       376                  608
All beta proteins                                144                       290                  560
Alpha and beta proteins (a/b)                    136                       222                  629
Alpha and beta proteins (a+b)                    279                       409                  717
Multi-domain proteins                             46                        46                   61
Membrane and cell surface proteins                47                        88                   99
Small proteins                                    75                       108                  171
Total                                            945                      1539                 2845

Protein Structure Classification - CATH
• CATH Protein Structure Classification
• http://www.cathdb.info/latest/index.html
• CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H).
• Class, derived from secondary structure content, is assigned automatically for more than 90% of protein structures.
• Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.
• The topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures.
• The homologous superfamilies cluster proteins with highly similar structures and functions.
• The assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons.

Protein Structure Classification - CATH
http://www.cathdb.info/cgi-bin/cath/GotoCath.pl?link=cath_info.html
• Only crystal structures solved to a resolution better than 4.0 angstroms are considered, together with NMR structures.
• All non-proteins, models, and structures with greater than 30% "C-alpha only" are excluded from CATH.
• The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures. These include computational techniques, empirical and statistical evidence, literature review and expert analysis.
• Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the following levels:

Level name   Sequence identity   Overlap
S                          35%       80%
O                          60%       80%
L                          95%       80%
I                         100%       80%

CATH vs. SCOP

Secondary Structure Prediction
AGADIR - An algorithm to predict the helical content of peptides
APSSP - Advanced Protein Secondary Structure Prediction Server
GOR - Garnier et al, 1996
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction at University of Dundee
JUFO - Protein secondary structure prediction from sequence (neural network)
nnPredict - University of California at San Francisco (UCSF)
Porter - University College Dublin
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader, MaxHom, EvalSec from Columbia University
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel University
SOPMA - Geourjon and Deléage, 1995
SSpro - Secondary structure prediction using bidirectional recurrent neural networks at University of California
DLP - Domain linker prediction at RIKEN
http://us.expasy.org/tools/#secondary

Determining the Residue Environment
• Six basic environment classes (E, P1, P2, B1, B2 and B3)
• The environment of each residue in the three-dimensional structure is first classified according to the area of the side chain that is buried in the protein.
---- A residue is considered exposed to solvent (environment class E) if the area buried is less than 40 Å².
---- It is considered partially buried (class P) if the area buried is between 40 and 114 Å².
---- It is considered buried (class B) if the area buried is greater than 114 Å².
• The buried and partially buried classes are further subdivided according to the fraction of the side chain area that is exposed to polar atoms ("fraction polar", denoted f).
---- For this purpose polar atoms are defined as those of the solvent and the oxygen and nitrogen atoms of the protein.
---- The buried class is subdivided into classes B1 (f < 0.45), B2 (0.45 <= f < 0.58) and B3 (f >= 0.58).
---- The partially buried class is subdivided into classes P1 (f < 0.67) and P2 (f >= 0.67).

Rows: structural environments. Columns: sequence residue and predicted secondary structure classes.

          rcC   rcH   rcS   rwC   rwH   rwS   rbC   rbH   rbS   raC   raH   raS   rhC   rhH   rhS   rsC   rsH   rsS   rpC   rpH   rpS
rcC_E     3.3   2.4   0.8   0.5  −9.0  −9.0  −0.6  −9.0  −1.2  −0.1  −1.5  −0.8  −0.1  −2.1  −1.0   0.1  −2.3  −0.5   0.6  −1.9  −0.9
rcC_B     3.7  −9.0  −9.0  −9.0  −9.0  −9.0  −0.7  −9.0   0.1   0.2  −9.0  −9.0   0.7  −0.9   0.0   0.1  −9.0  −1.2   0.1  −9.0  −9.0
rcH_E     1.7   3.1  −9.0   1.2   1.3  −9.0  −9.0   1.4  −9.0  −0.3   1.0  −9.0  −1.1   1.0  −9.0  −1.5   0.7  −9.0  −9.0   0.8  −9.0
rcH_B     2.5   3.7  −9.0  −9.0  −9.0  −9.0  −9.0  −0.5  −9.0  −9.0   0.0  −9.0  −1.1   1.3  −9.0  −2.1   0.9  −9.0  −9.0   0.0  −9.0
rcS_E     0.4  −9.0   3.9  −9.0  −9.0   1.5  −1.2  −9.0   1.5  −0.2  −0.7   1.6  −1.1  −2.0   0.6  −0.5  −9.0   0.8  −0.8  −9.0   1.5
rcS_B     0.7  −9.0   4.0  −9.0  −9.0  −9.0  −0.2  −9.0   0.9  −0.7  −9.0  −0.5  −9.0  −1.8   1.0  −0.9  −9.0   1.0   0.0  −9.0   1.3

Secondary Structure Prediction - HNN
• http://npsa-pbil.ibcp.fr/cgi-bin/secpred_hnn.pl
• Input sequence:
>gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2 (Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase subunit II)
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
ILLYTAWCYWKMFGRITKEDIERNTHSLY

Secondary Structure Prediction - HNN
Sequence length : 379
HNN :
Alpha helix      (Hh) : 209 is 55.15%
310 helix        (Gg) :   0 is  0.00%
Pi helix         (Ii) :   0 is  0.00%
Beta bridge      (Bb) :   0 is  0.00%
Extended strand  (Ee) :  55 is 14.51%
Beta turn        (Tt) :   0 is  0.00%
Bend region      (Ss) :   0 is  0.00%
Random coil      (Cc) : 115 is 30.34%
Ambiguous states (?)  :   0 is  0.00%
Other states          :   0 is  0.00%

        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
ccchhhhhhhhhhhhhhheeeeehccchhcchhhhhheecccccceeeeeeccccccccceeeeeeccch
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccchhhhhhhhhhcceeehccchccheehhhhhc
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
hhcccccchhhhheeeeccchhhhhcchceccceeeeeeeeeccchhhhhhhchhhhhhchhhhhhhhhh
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
hhhhhhccceeeeeeccceeeeeccccccccccchhhhhhhhhhhheeccccceeeeccchhhhhhhhhh
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
hhhhhhhhhhhhhhhhhhhhhhhhhcchhhcccccccchhhccccchhcccchhhhhhhhhhhhhhhhhh
ILLYTAWCYWKMFGRITKEDIERNTHSLY
hhhhhhhhhhhhhhhcchhhhhhhccccc

Secondary Structure Prediction - PHD
• PHDsec predicts secondary structure from multiple sequence alignments.
  Secondary structure is predicted by a system of neural networks rated at an expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993, 90, 7558-7562; Rost & Sander, JMB, 1993, 232, 584-599; and Rost & Sander, Proteins, 1994, 19, 55-72).
• Evaluated on the same data set, PHDsec is rated at ten percentage points higher three-state accuracy than methods using only single-sequence information, and at more than six percentage points higher than, e.g., a method using alignment information based on statistics (Levin, Pascarella, Argos & Garnier, Prot. Engng., 6, 849-54, 1993).
• PHDsec predictions have three main features:
---- improved accuracy through evolutionary information from multiple sequence alignments
---- improved beta-strand prediction through a balanced training procedure
---- more accurate prediction of secondary structure segments by using a multi-level system

Secondary Structure Prediction - PHD
• Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 1993

Motifs Readily Identified from Sequence
• Zinc finger: order and spacing of a pattern of cysteine and histidine residues.
• Leucine zippers: two parallel alpha helices held together by interactions between hydrophobic leucine residues at every seventh position in each helix.
• Coiled coils: 2-3 helices coiled around each other in a left-handed supercoil (3.5 residues per turn instead of 3.6, i.e. 7 residues per two turns); the first and fourth positions of each heptad are always hydrophobic, the others hydrophilic; 5-10 heptads.
• Transmembrane-spanning proteins: alpha helices comprising amino acids with hydrophobic side chains, typically 20-30 residues.
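The motifs listed above are called "readily identified" because simple sequence rules already pick them out. A minimal sketch in Python follows, under stated assumptions: the spacing pattern in `ZINC_FINGER` is a simplified C2H2-style pattern chosen for illustration (not a curated motif definition), the hydropathy values are the Kyte-Doolittle scale (the slides do not name a scale), and the `window` and `threshold` defaults are illustrative choices rather than values from the lecture.

```python
import re

# Kyte-Doolittle hydropathy scale (assumed here; the slides do not specify one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified Cys/His spacing pattern, for illustration only.
ZINC_FINGER = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_zinc_fingers(seq):
    """Return (start, end) spans (0-based, end exclusive) that match the spacing pattern."""
    return [m.span() for m in ZINC_FINGER.finditer(seq)]


def hydrophobic_segments(seq, window=20, threshold=1.6):
    """Return merged (start, end) spans covered by windows whose mean
    hydropathy is at least `threshold` (candidate membrane-spanning helices)."""
    segments = []
    for i in range(len(seq) - window + 1):
        mean = sum(KD[aa] for aa in seq[i:i + window]) / window
        if mean >= threshold:
            if segments and i <= segments[-1][1]:
                # Overlaps or touches the previous span: extend it.
                segments[-1] = (segments[-1][0], i + window)
            else:
                segments.append((i, i + window))
    return segments


if __name__ == "__main__":
    finger = "AAC" + "PE" + "C" + "G" * 12 + "H" + "QRT" + "H" + "AA"
    print(find_zinc_fingers(finger))
    helix = "DK" * 5 + "L" * 22 + "DK" * 5
    print(hydrophobic_segments(helix))
```

Real motif databases use more specific patterns or profiles, and dedicated transmembrane predictors use more than a single averaged scale, so this should be read as a demonstration of the idea only.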
Tertiary Structure Prediction
Comparative modeling
SWISS-MODEL - An automated knowledge-based protein modelling server
3Djigsaw - Three-dimensional models for proteins based on homologues of known structure
CPHmodels - Automated neural-network based protein modelling server
ESyPred3D - Automated homology modeling program using neural networks
Geno3d - Automatic modeling of protein three-dimensional structure
SDSC1 - Protein Structure Homology Modeling Server
Threading
3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information (Foldfit)
Fugue - Sequence-structure homology recognition
HHpred - Protein homology detection and structure prediction by HMM-HMM comparison
Libellula - Neural network approach to evaluate fold recognition results
LOOPP - Sequence to sequence, sequence to structure, and structure to structure alignment
SAM-T02 - HMM-based Protein Structure Prediction
Threader - Protein fold recognition
ProSup - Protein structure superimposition
SWEET - Constructing 3D models of saccharides from their sequences
Ab initio
HMMSTR/Rosetta - Prediction of protein structure from sequence
http://us.expasy.org/tools

Tertiary Structure Prediction - Comparative Modeling
Example: 3Djigsaw
Three-dimensional models for proteins based on homologues of known structure
Contreras-Moreira, B., Bates, P.A. (2002) Domain Fishing: a first step in protein comparative modelling. Bioinformatics 18: 1141-1142.
3D Protein Sequence Profiles
• A 3D profile is based on a 3D structure-specific scoring matrix.
• A 3D scoring matrix is similar to the 1D scoring matrices we discussed in the multiple sequence alignment lectures, with the additional attribute of the structural environment of the amino acid side chain.
• There are 6 basic environment classes (E, P1, P2, B1, B2 and B3), differing in the area of the side chain that is buried and in the fraction of the side chain that is exposed to polar atoms.
• Since amino acids can assume 3 different secondary structures, there are 3 x 6 = 18 different environmental classes.
• The log odds of each amino acid in each environment type give the values for the 3D-1D scoring matrix, calculated from a database of protein structures.

Using 3D Profiles in Structure Prediction
• The alignment of an amino acid sequence with a 3D profile yields an overall 3D-1D score. The 3D-1D score is a measure of the compatibility of the sequence with the structure described by the profile.
• Given an amino acid sequence, find compatible structures.
---- Useful for finding homologous structures when doing homology modeling
• Given a preliminary or model structure, test its validity.
---- Useful for the final phase of homology modeling
• Given a structure, find compatible sequences.
---- Useful for analyzing evolutionary relationships among proteins

Homology Modeling
• Definition: predicting the tertiary structure of an unknown protein using a known 3D structure of one or more proteins with homologous sequence
• Based on the assumption that structure is more conserved than sequence
• Important to use homologous proteins whose structures were determined by X-ray crystallography or NMR
• Homology modeling is an important method since the number of different protein folds (unique structures) is much smaller than the number of different proteins.
• It is likely that homologous protein sequences will share a common protein fold.
Some of the material from this section is from: http://www.cs.wright.edu/~mraymer/cs790/Homology_Modeling.ppt

Homology Modeling Procedure
• Search databases for homologous protein sequences.
---- The Protein Data Bank (PDB) is a good choice, since all of the sequences contained in PDB have solved 3D structures.
• Align the homologous protein sequence with the sequence of interest.
---- Pair-wise or multiple sequence alignment can be used.
• Build a model of the structure of the protein of interest using the known structures of homologous proteins. Possible methods include:
1. Modeling by rigid body assembly
2. Modeling by segment matching or coordinate reconstruction
3. Modeling by satisfaction of spatial constraints
• Evaluate and refine the model structure.

Tertiary Structure Prediction
Threading
3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles coupled with secondary structure information (Foldfit)
Fugue - Sequence-structure homology recognition
HHpred - Protein homology detection and structure prediction by HMM-HMM comparison
Libellula - Neural network approach to evaluate fold recognition results
LOOPP - Sequence to sequence, sequence to structure, and structure to structure alignment
SAM-T02 - HMM-based Protein Structure Prediction
Threader - Protein fold recognition
ProSup - Protein structure superimposition
SWEET - Constructing 3D models of saccharides from their sequences

Tertiary Structure Prediction - Threading
• The term was first coined by Jones, Taylor and Thornton in 1992, originally for fold recognition.
• Today, the terms threading and fold recognition are frequently (though somewhat incorrectly) used interchangeably.
• The basic idea is that the target sequence (whose structure is to be predicted) is threaded through the backbone structures of template proteins (known as the fold library), and “goodness of fit” scores are calculated (usually derived in terms of an empirical energy function).
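The basic idea of threading can be sketched in a few lines of Python using the environment classes and 3D-1D scores described earlier in these notes. The thresholds in `environment_class` are the ones given in the residue-environment slides; everything else is an assumption made for illustration: the function names are hypothetical, the alignment is ungapped (real threading methods align with gaps, typically by dynamic programming), and the score table used in the example contains made-up numbers, not values from any published matrix.

```python
def environment_class(buried_area, fraction_polar):
    """Classify a side-chain environment as E, P1, P2, B1, B2 or B3.

    buried_area: side-chain area buried in the protein, in square angstroms.
    fraction_polar: fraction f of the side-chain area exposed to polar atoms.
    """
    if buried_area < 40:
        return "E"                      # exposed to solvent
    if buried_area <= 114:              # partially buried
        return "P1" if fraction_polar < 0.67 else "P2"
    if fraction_polar < 0.45:           # buried
        return "B1"
    if fraction_polar < 0.58:
        return "B2"
    return "B3"


def threading_score(sequence, template_envs, score_table, missing=0.0):
    """Sum 3D-1D scores for an ungapped placement of `sequence` onto a template.

    template_envs: one environment label per template position.
    score_table: dict mapping (environment, residue) to a log-odds style score.
    """
    if len(sequence) != len(template_envs):
        raise ValueError("ungapped sketch: sequence and template must have equal length")
    return sum(score_table.get((env, res), missing)
               for res, env in zip(sequence, template_envs))


def best_template(sequence, fold_library, score_table):
    """Return the name of the template with the highest total score."""
    return max(fold_library,
               key=lambda name: threading_score(sequence, fold_library[name], score_table))


if __name__ == "__main__":
    # Toy score table with invented values, for demonstration only.
    toy_scores = {("B1", "L"): 1.0, ("E", "L"): -1.0,
                  ("B1", "K"): -1.0, ("E", "K"): 1.0}
    library = {"buried_core": ["B1", "B1", "B1"], "exposed": ["E", "E", "E"]}
    print(best_template("LLL", library, toy_scores))
    print(best_template("KKK", library, toy_scores))
```

A full implementation would combine each of the six environment classes with the three secondary structure states to index the 18 environments, and would search over alignments rather than scoring a single fixed placement.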
• Threading methods share some of the characteristics of both comparative modelling methods (the sequence alignment aspect) and ab initio prediction methods (predicting structure based on identifying low-energy conformations of the target protein).
http://en.wikipedia.org/wiki/Threading_%28protein_sequence%29

Protein Threading
• Generalization of the homology modeling method
---- Homology modeling: align sequence to sequence
---- Threading: align sequence to structure (templates)
• Rationale:
---- A limited number of basic folds is found in nature.
---- Amino acid preferences for different structural environments provide sufficient information to choose the best-fitting protein fold (structure).

Tertiary Structure Prediction
Ab initio (de novo)
• From scratch: uses physical properties instead of known structures
• Mimics the folding process: minimizes a certain energy function, using stochastic modeling (e.g., simulated annealing)
• Computationally expensive: requires large clusters, large machines (e.g., IBM BlueGene) or distributed computing; currently works only for small peptides
• Big potential in the future: understanding the dynamics, accuracy, and applications in drug development

Tertiary Structure Prediction
Ab initio (de novo) Prediction Scoring with Rosetta
Rosetta uses a scoring function to judge different conformations. The process consists of making 'moves' (changing the bond angles of a particular group of amino acids) and then scoring the new conformation. The Rosetta score is a weighted sum of component scores, where each component score judges a different aspect of protein structure.
Environment score: Here, hydrophobic residues are represented as orange stars, so the left conformation is good (all the hydrophobics together) while the rightmost conformation is bad (with the hydrophobic amino acids not touching).
Pair-score: Two conformations of a polypeptide are shown, one (top) where the chain is folded back on itself, bringing two cysteines together (yellow+yellow = possible disulphide bond) and forming a salt bridge (blue+red = opposites attract). The conformation at bottom does not make these pairings, and the pair-score would thus favor the top conformation.
http://www.grid.org/projects/hpf/howitworks_scoring.htm

Evaluation - CASP
CASP (Critical Assessment of Techniques for Protein Structure Prediction) is a community-wide experiment (though it is commonly referred to as a competition) for protein structure prediction that has taken place every two years since 1994. (http://predictioncenter.org/)
The main goal of CASP is to obtain an in-depth and objective assessment of our current abilities and inabilities in the area of protein structure prediction. To this end, participants will predict as much as possible about a set of soon-to-be-known structures. These will be true predictions, not ‘post-dictions’ made on already known structures.
CASP7 will particularly address the following questions:
1. Are the models produced similar to the corresponding experimental structure?
2. Is the mapping of the target sequence onto the proposed structure (i.e. the alignment) correct?
3. Have similar structures that a model can be based on been identified?
4. Are comparative models more accurate than can be obtained by simply copying the best template?
5. Has there been progress from the earlier CASPs?
6. What methods are most effective?
7. Where can future effort be most productively focused?
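Question 1 above asks whether a model is similar to the experimental structure. One common way to put a number on that, shown here only as an illustration, is the root-mean-square deviation (RMSD) over matched atoms. This Python sketch assumes the two coordinate lists are already superposed and matched position by position; the slides do not say which similarity measures the CASP assessors use, and the function name `rmsd` is a choice made for this example.

```python
import math


def rmsd(model, reference):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superposed and that model[i]
    corresponds to reference[i] (for example, matched C-alpha atoms).
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model))


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    shifted = [(x, y, z + 2.0) for x, y, z in reference]
    print(rmsd(reference, reference))
    print(rmsd(shifted, reference))
```

A lower value means the model lies closer to the reference. In practice an optimal superposition has to be computed first, and a single RMSD value can be dominated by a few badly placed regions.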
Evaluation - CASP
Evaluation of the results is carried out in the following prediction categories:
• tertiary structure prediction (all CASPs)
• secondary structure prediction (dropped after CASP5)
• prediction of structure complexes (CASP2 only; a separate experiment, CAPRI, carries on this subject)
• residue-residue contact prediction (starting CASP4)
• disordered regions prediction (starting CASP5)
• domain boundary prediction (starting CASP6)
• function prediction (starting CASP6)
• model quality assessment (starting CASP7)
• model refinement (starting CASP7)
The tertiary structure prediction category was further subdivided into:
• homology modelling
• fold recognition (also called protein threading; note that this is incorrect, as threading is a method)
• de novo structure prediction, now referred to as 'New Fold', because many methods apply evaluation (scoring) functions that are biased by knowledge of native protein structures; an artificial neural network is one such example.

Evaluation - CASP
Number of human expert groups registered      207
Number of prediction servers registered        98
Number of targets released                    104
Targets canceled                                4
Valid targets                                 100
Refinement targets                              9

Prediction format                 Number of groups contributing   Number of models designated as 1   Total number of models
3D coordinates                                              180                              12393                    48339
Alignments to PDB structures                                 15                                966                     3896
Residue-residue contacts                                     17                               1473                     1561
Structural domains assignments                               27                               2258                     2515
Disordered regions                                           19                               1801                     1801
Function prediction                                          22                               1317                     1930
Quality assessment                                           29                               2326                     3228
Model refinement                                             26                                136                      447
All                                                255 (unique)                              22670                    63717

Summary
• Proteins are key players in our living systems.
• Proteins are polymers consisting of 20 kinds of amino acids.
• Each protein folds into a unique three-dimensional structure defined by its amino acid sequence.
• Protein structure has a hierarchical nature.
• Protein structure prediction is a grand challenge of computational biology.
