Bioinformatics of proteins: Sequence, structure and the ‘symbiosis’ between them
Maya Schushan
The Ben-Tal lab

OUTLINE
• Sequence: Databases, domains, motifs & annotations
• Structure: Secondary structure, structure databases, visualization and identification of functional sites

Sequences, domains, motifs & annotations

UniProt
• UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.
• The world's most comprehensive catalog of information on proteins
• Sequence, function & more…
• Comprised mainly of two databases:
– SwissProt: 366226 protein entries last year, 412525 now. High-quality annotation, non-redundant & cross-referenced to many other databases.
– TrEMBL: 5708298 protein entries last year, 7341751 now. Computer translation of the genetic information from the EMBL Nucleotide Sequence Database; many proteins are poorly annotated, since only automatic annotation is generated.
• Annotation description includes:
– Function(s) of the protein;
– Post-translational modification(s), such as carbohydrates, phosphorylation, acetylation and GPI-anchor;
– Domains and sites, for example calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes;
– Secondary structure, e.g. alpha helix, beta sheet;
– Quaternary structure, e.g. homodimer, heterotrimer, etc.;
– Similarities to other proteins;
– Disease(s) associated with any number of deficiencies in the protein;
– Sequence conflicts, variants, etc.
• Connected to many other databases (e.g. Pfam, Prosite, EC, GO, PdbSum, PDB (to be discussed…))
• Each sequence has a unique 6-character accession
• Entries in SwissProt also have IDs, which usually make sense (e.g. CADH1_HUMAN for a human cadherin)
• Download sequence in FASTA format

UniProt: http://www.uniprot.org/
Type accession: P05102
Or ID: MTH1_HAEPH
The entry page shows:
• General data: name, origin, EC (enzymatic reaction)…
• Functional data, including the GO annotations
• Scroll down to find the sequence & download the FASTA
• Known sites, predicted/known secondary structures, natural variation or mutagenesis
• The protein’s sequence in FASTA format: download or send to BLAST
• References for all info in the page: important to take a look…
• Connections to other databases:
– Other sequence databases, e.g. GenBank
– Related structures in the PDB (if available)
– Model structure in the ModBase database (automatically derived!)
– All sorts of domain/motif databases
– The family related to the entry

Pfam: domain database
• Proteins are generally composed of one or more functional regions, commonly termed domains.
• Different combinations of domains give rise to the diverse range of proteins found in nature.
• The identification of domains that occur within proteins can therefore provide insights into their function.
• The Pfam database is a large collection of protein domain families.
• Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
• Pfam entries are classified in one of four ways:
– Family: A collection of related proteins
– Domain: A structural unit which can be found in multiple protein contexts
– Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
– Motif: A short unit found outside globular domains

There are two components to Pfam:
• Pfam-A entries are high-quality, manually curated families. These Pfam-A entries cover a large proportion of the sequences in the sequence database.
• Pfam-B entries are automatically generated. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found.
• Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

Pfam (http://pfam.sanger.ac.uk/) allows you to:
• Analyze your protein sequence for Pfam matches
• View Pfam family annotation and alignments
• See groups of related families
• Look at the domain organization of a protein sequence
• Find the domains on a PDB structure
• Query Pfam by keyword

Pfam: searching for a certain protein accession

Other domain/motif databases:
• PROSITE
• InterPro
• BLOCKS
• SMART
• Etc…

Classifying protein function
• Each protein performs one (or more…) specific functions.
This can be, e.g., catalysis of a specific enzymatic reaction, transport of an ion, interaction with a DNA molecule, etc.
• In order to easily address the specific functions, attempts have been made to enumerate and classify the various functions performed by proteins.
• Example: some of the diverse functions exhibited by membrane proteins.

Enzyme Commission number (EC number)
• A numerical classification scheme for enzymes, based on the chemical reactions they catalyze
• EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same EC number.
• By contrast, UniProt database identifiers uniquely specify a protein by its amino acid sequence.
• Every enzyme code consists of the letters "EC" followed by four numbers separated by periods. Those numbers represent a progressively finer classification of the enzyme.
• For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4":
– EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
– EC 3.4 are hydrolases that act on peptide bonds
– EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
– EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
• The code "EC 3.4.11.4" is shown for an enzyme from Lactobacillus helveticus in BRENDA, the Comprehensive Enzyme Information System.
• The six top-level classes:
– EC 1: Oxidoreductases
– EC 2: Transferases
– EC 3: Hydrolases
– EC 4: Lyases
– EC 5: Isomerases
– EC 6: Ligases

Gene Ontology
• A collaborative effort to address the need for consistent descriptions of gene products in different databases
• The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
• The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels.

Gene Ontology: Cellular component
A cellular component is just that, a component of a cell, but with the proviso that it is part of some larger object; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

Gene Ontology: Biological process
A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. Examples of biological process terms are signal transduction or pyrimidine metabolism. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

Gene Ontology: Molecular function
Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

Gene Ontology: Topology
The ontologies are in the form of directed acyclic graphs (DAGs), with the graph nodes being GO terms. The ontologies are hierarchically structured, but a more specialized term (child) can be related to more than one less specialized term (parent). E.g. the biological process term hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process, because a biosynthetic process is a type of metabolic process and a hexose is a type of monosaccharide. When any gene is involved in hexose biosynthetic process, it is automatically annotated to both hexose metabolic process and monosaccharide biosynthetic process.
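
The automatic annotation to parent terms described above can be sketched as a walk up the graph. This is a minimal illustration, not the GO project's own tooling: the dictionary `PARENTS`, the gene name `geneX` and both function names are hypothetical, and only the two parents of hexose biosynthetic process come from the slide; the remaining edges are simplified assumptions.

```python
# Toy fragment of a GO-like graph: child term -> list of parent terms.
PARENTS = {
    # From the slide: one child with two parents (this is why GO is a DAG, not a tree).
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    # Assumed, simplified edges for illustration (the real ontology has more levels).
    "hexose metabolic process": ["metabolic process"],
    "monosaccharide biosynthetic process": ["biosynthetic process"],
    "biosynthetic process": ["metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Extend each gene's direct GO terms with all of their ancestor terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        all_terms = set(terms)
        for term in terms:
            all_terms |= ancestors(term, parents)
        full[gene] = all_terms
    return full


annotations = propagate({"geneX": ["hexose biosynthetic process"]})
for term in sorted(annotations["geneX"]):
    print(term)
```

Because a term is added to `seen` only once, a parent shared by two paths (here "metabolic process") is visited a single time, which is what makes the traversal safe on a DAG.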
Gene Ontology: Example

Gene Ontology: Interface
Search by gene or protein accession: http://www.geneontology.org/

Summary of the first part: protein sequence databases and tools
• UniProt: the most comprehensive protein sequence database. Connected to many other databases and resources.
• Pfam: domain database. Many others: InterPro, PROSITE, BLOCKS, etc.
• EC and GO classifications of protein function

OUTLINE
• Sequence: Databases, domains, motifs & annotations
• Structure: Secondary structure, structure databases, visualization and identification of functional sites

Investigating & visualizing protein structures

From Sequence to Structure
• All information about the native structure of a protein is encoded in the amino acid sequence + its native solution environment.
• Many conformations are possible, yet only one or a few native folds are exhibited for each protein (Levinthal’s paradox)
• Protein folding is driven by various forces:
– Ionic forces
– Hydrogen bonds
– The hydrophobic effect
– ...

Secondary Structure Prediction
Why predict secondary structures of proteins?
1) When the structure of the protein is still unknown. This can serve as the first step for structure prediction: first predict the secondary structures, then how they are arranged together.
2) For calculating better multiple sequence alignments or pairwise alignments.

Predicting 2° Structure
Each amino acid has a different propensity for being in each 2° structure. For example, proline causes a kink which disrupts the helix structure. Thus, proline is usually found only at the helix end. The different structures also have typical lengths.
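
The idea of per-residue propensities can be illustrated with a sliding-window average. This is only a sketch in the spirit of propensity-based methods, not a full prediction algorithm: the function name `helix_profile`, the window size and the example peptide are assumptions, and the table values are quoted as approximate Chou-Fasman helix propensities that should be checked against the original publication before any real use.

```python
# Approximate helix propensities per amino acid (values > 1 favor helix).
# Illustrative only; verify against the original Chou-Fasman table.
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_profile(sequence, window=6):
    """Average helix propensity over each window of consecutive residues."""
    seq = sequence.upper()
    if len(seq) < window:
        return []
    scores = [HELIX_PROPENSITY[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical peptide: a helix-friendly start followed by a Pro/Gly-rich tail.
profile = helix_profile("AEELLKPGPGSG")
for start, value in enumerate(profile, start=1):
    label = "helix-like" if value > 1.0 else "not helix-like"
    print(f"window starting at residue {start}: {value:.2f} ({label})")
```

Windows rich in Ala, Glu and Leu score above 1, while the proline- and glycine-rich tail scores well below 1, matching the observation that proline is rarely found inside a helix.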
Predicting 2° Structure: http://www.predictprotein.org/
All these and more…
• Input: Sequence
• Output: Secondary structure prediction, globular regions, coiled-coil regions, transmembrane helices, PROSITE motifs, bound cysteines…
• The Meta PredictProtein server now allows many other options: http://www.predictprotein.org/meta.php

A common measure is Q3 = the % of amino acids whose secondary structure was predicted correctly.

Authors        Year  % accuracy  Method
Chou-Fasman    1974  50%         propensities of aa's in 2° structures
Garnier        1978  62%         interactions between aa's
Levin          1993  69%         multiple seq. alignments (MSA)
Rost & Sander  1994  72%         neural networks + MSA

Today, Q3 is about 75-78% (as determined objectively by CASP). The theoretical limit is thought to be about 90%.

E.g. PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
• A simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST.
• Using a very stringent cross-validation method to evaluate the method's performance, the recent version of PSIPRED achieves an average Q3 score of 80.7%.

Protein 3D Structures
A protein’s structure has a critical effect on its function:
1. Binding pockets (PDB ID 1nw7)
2. Areas of specific chemical/electrical properties
3. Importance of the global fold for function

Tertiary structure = protein fold
Complete 3-dimensional structure
Why is it interesting? Isn’t the sequence enough?
• The structure is more conserved
• Detection of distant evolutionary relationships
• A key to understanding protein function
• Structure-based drug design

RCSB: the Protein Data Bank
• The main & comprehensive database for biological macromolecular structures
• Each structure receives a PDB ID: a unique 4-character identifier
• Search by author, PDB ID or any keyword.
• Download structures
http://www.rcsb.org/pdb/home/home.do
PDB ID: 3mht
On the structure page:
• Download structure
• The paper describing the structure
• Data concerning the structure: resolution, R-value…
• Display structure

PDB files have a specific format:
• TITLE
• REMARK
• COMPND
• JRNL: reference
• SEQRES: the original sequence
• HELIX, SHEET: secondary structure
• ATOM: the actual protein/DNA/RNA chain
• HETATM: additional atoms such as ligands, water etc.
• …

PDB files have a specific format (http://www.wwpdb.org/documentation/format3.1-20080211.pdf):

ATOM      7  SD  MET A   1     -29.059  28.614  71.539  1.00 26.90           S
ATOM      8  CE  MET A   1     -27.535  29.074  70.866  1.00 16.57           C
ATOM      9  N   ILE A   2     -29.656  32.903  69.094  1.00 25.93           N
ATOM     10  CA  ILE A   2     -30.077  33.171  67.730  1.00 25.49           C
HETATM 3139  C6  SAH   328     -11.642  26.514  89.489  1.00 17.97           C
HETATM 3140  N6  SAH   328     -10.474  26.661  90.103  1.00 14.50           N
HETATM 3141  N1  SAH   328     -11.895  25.334  88.899  1.00 23.10           N
HETATM 3142  C2  SAH   328     -13.079  25.090  88.350  1.00 16.93           C
HETATM 3143  N3  SAH   328     -14.120  25.887  88.278  1.00 16.05           N
HETATM 3144  C4  SAH   328     -13.832  27.092  88.861  1.00 14.31           C
HETATM 3145  O   HOH   329     -29.525  42.890  90.934  1.00 24.84           O
HETATM 3146  O   HOH   330     -28.213  42.867  93.588  1.00  8.11           O
HETATM 3147  O   HOH   331     -24.619  35.287  96.173  1.00 17.96           O

Columns, left to right: record type; atom serial number; atom name; residue or molecule name; chain (if exists); residue numbering; coordinates X, Y, Z; occupancy; B-factor; element.

More Sequences Than Structures
Discrepancy between the number of known sequences and solved structures: 5,047,807 UniRef90 entries vs. 19,988 90% non-redundant structures.
Computational methods are needed to obtain more structures.

Fold classification
Classification: clustering proteins into structural families
Motivation?
• Profound analysis of evolutionary mechanisms
• Constraints on secondary structure packing?
• Classification at domain level

Fold classification: SCOP (http://scop.berkeley.edu)
• The SCOP database aims to provide a description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the PDB.
• The SCOP classification of proteins has been constructed manually, but with the assistance of tools to make the task manageable and help provide generality.
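
Returning to the ATOM and HETATM records of the PDB format shown above: the fields sit in fixed character columns, so they are read by position rather than by splitting on whitespace. The following is a minimal sketch under that assumption, using the column ranges of the wwPDB format description; the names `AtomRecord` and `parse_atom_line` are hypothetical, and real work would normally rely on an established structure-parsing library.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    record: str      # "ATOM" or "HETATM"
    serial: int      # atom serial number
    name: str        # atom name, e.g. "SD"
    res_name: str    # residue or molecule name, e.g. "MET"
    chain: str       # chain identifier, blank if absent
    res_seq: int     # residue number
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line):
    """Parse one fixed-column ATOM/HETATM line of a PDB file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return AtomRecord(
        record=line[0:6].strip(),
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


# First ATOM line of the slide, assembled field by field so the
# column widths are explicit (adjacent string literals are concatenated).
SAMPLE = (
    "ATOM  " "    7" " " " SD " " " "MET" " " "A" "   1" "    "
    " -29.059" "  28.614" "  71.539" "  1.00" " 26.90" "          " " S"
)

atom = parse_atom_line(SAMPLE)
print(atom.name, atom.res_name, atom.chain, atom.res_seq, atom.x, atom.y, atom.z)
```

Splitting on whitespace would fail for HETATM lines without a chain identifier and for files where adjacent numeric fields touch, which is why the sketch slices by column.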
Fold classification: SCOP levels
1. Family: Clear evolutionary relationship. Generally, this means that pairwise residue identities between the proteins are 30% and greater.
2. Superfamily: Probable common evolutionary origin. Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
3. Fold: Major structural similarity. Same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins of the same fold category may not have a common evolutionary origin: the structural similarities could arise from convergent evolution.

Figure: growth in the number of unique folds as defined by SCOP, by year.

Fold classification: CATH
Hierarchical classification of protein domain structures in the PDB. Domains are clustered at five major levels:
• Class [C]: derived from secondary structure content (automatic): alpha, beta, alpha and beta, few secondary structures.
• Architecture [A]: derived from orientation of secondary structures (manual)
• Topology [T]: derived from topological connection and secondary structures (by automated structural alignment)
• Homologous superfamily [H] / sequence family: clusters of similar structures & functions.

SCOP vs. CATH (Csaba et al., 2009)
• Same SCOP family, different CATH topologies: d1rh6b (a.6.1.7) / 1rh6B00 (1.10.1660.20) vs. d1g4da (a.6.1.7) / 1g4dA00 (1.10.10.10)
• Different SCOP classes, same CATH homologous superfamily: d1bbxd (b.34.13.1) / 1bbxD00 (2.40.50.40) vs. d1rhpa (d.9.1.1) / 1rhpA00 (2.40.50.40)
• SCOP levels: class, fold, superfamily, family
• CATH levels: class, architecture, topology, homologous superfamily, sequence family
• CATH is more directed toward structural classification; SCOP pays more attention to evolutionary relationships

PdbSum: http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
• A database providing an overview of all biological macromolecular structures
• Connected to UniProt: find the sequence accession of a known PDB ID
• Detailed description of many structure properties, e.g.:
– EC number
– Chains & ligands and their interactions
– Clefts
– Secondary structure
– FASTA sequence of the structure
– …
• Search by PDB ID, free text or sequence
• Useful tabs: UniProt accession, chains & ligands, GO annotation, EC and reaction, highlights from the related paper
• Protein tab: secondary structure from the PDB
• Ligand tab: the ligand’s structure; LigPlot identifies the residues that bind the ligand

Before the invention of computer graphics, trained artists such as Irving Geis (1908 – 1997) were employed to hand-draw understandable pictures of proteins.

PyMOL Viewer
Features:
• Viewing 3D Structures
• Rendering Figures
• Giving Presentations
• Animating Molecules
• Sharing Visualizations
• Exporting Geometry
Example: potassium channel (KcsA) from Streptomyces lividans, PDB ID 1bl8 (Doyle et al., 1998)

View Manipulation
• Identify the different parts of the screen: the external GUI window and the internal GUI window.
• The internal window contains the viewer, which displays the molecule, and the command line.
• To manipulate an object, use the letter icons near its name:
– A: Action
– S: Show
– H: Hide
– L: Label
– C: Color
• Change the representation of the object to “Cartoon” using: S (Show) → As → Cartoon
• Other protein representations under S → As: Lines, Ribbons, Sticks, Dots, Spheres, Surface
• Color by chain: C (Color) → by chain
• Other coloring options:
– Color by spectrum: b-factor, rainbow
– Color by secondary structure (“SS”)
– Color by element
– Many colors are available; others can be defined in the external GUI under Settings → Colors… → New

Selecting and manipulating specific parts of the molecule
• Select specific amino acids by clicking on them.
• Select a range in the sequence by clicking the first residue, and then shift+click on the last residue.
• The selection will be indicated on the structure (in pink dots).
• In the object list, a new object “(sele)” is added. This object represents the current selection.
• You can manipulate it with the buttons next to the object. For example, change its representation to sticks (S → As → Sticks).
• Give a different name to the selection, so you can easily manipulate it later.
• Select the first chain again (using the sequence) and change its name to “chain1” by choosing Action → Rename Selection and typing “chain1”.

Making high-quality images
1. Change the background color to white, with Display → Background → White on the external GUI menu.
2. Type in the command line: “ray [x], [y]” … and wait…
3. Save the image with “Save” → “Image”. Take care not to click on the image before saving!

ConSurf
The goal: identification of functionally important amino acids that mediate the interaction of a query protein with ligands, DNA/RNA, other proteins etc.
Approach: functionally important amino acid sites are often evolutionarily conserved.
Example: Beta Class N6-Adenine DNA Methyltransferase. Its 3D structure has already been solved: PDB ID 1nw7.
• The ConSurf webserver (http://consurf.tau.ac.il/) calculates the evolutionary rate for each position in the protein
• The results, mapped on the structure, reveal residues crucial for function and structural stability
• In this case, the ligand is bound in a highly conserved cluster of residues

The consensus sequence approach:
..W..
..W..
..W..
..W..
..E..
..G..
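
The consensus sequence approach shown above amounts to counting residues in each alignment column. A minimal sketch follows; the function name `column_consensus` and the six toy sequences are hypothetical, chosen so that the third column reproduces the W/W/W/W/E/G column of the slide. Every sequence is weighted equally here, so the phylogenetic relations between the sequences are ignored.

```python
from collections import Counter


def column_consensus(msa):
    """For each alignment column, return (most common residue, its frequency).

    Gap characters ('-' and '.') are ignored when counting.
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa not in "-."]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result


# Toy alignment (hypothetical sequences); column 3 is W, W, W, W, E, G.
toy_msa = [
    "AKWLT",
    "AKWLT",
    "AKWIT",
    "ARWLT",
    "AKELT",
    "AKGLT",
]

for position, (residue, freq) in enumerate(column_consensus(toy_msa), start=1):
    print(f"column {position}: consensus {residue}, frequency {freq:.2f}")
```

In this toy case the third column reports W with a frequency of 4/6, regardless of whether the four W sequences are independent observations or near-identical copies of one another.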
However, some sequences might be close homologues of each other:
..W..
..W..
..W..
..W..
..E..
..G..
(some of the sequences come from closely related species, here primates)
Conclusion: Assessing conservation without taking the phylogenetic relations into consideration may be misleading, because of uneven sampling in sequence space.

Phylogenetic reconstruction may be used to distinguish between two possible cases:
1. Structural/functional constraints that truly result in sequence conservation as a result of evolutionary pressure.
2. Short evolutionary time that may be mistaken for sequence conservation, while no evolutionary pressure affects the examined position.

Rate4Site: an algorithm for calculating the evolutionary rate at each amino acid site (Pupko et al., 2002; Mayrose et al., 2005)
Definition: Evolutionary rate = number of AA replacements / (site × year)
Conserved sites evolve slowly; variable sites evolve rapidly.

ConSurf web server: http://consurf.tau.ac.il/ (Landau et al., 2005)

ConSurf coloring bar
The Rate4Site conservation scores are continuous values rather than integers, and such scores are hard to display on a structure. Hence, the ConSurf webserver divides them into 9 bins: 1 for highly variable, 9 for the most conserved.

The ConSurf webserver: input
Essential input (MSA and tree constructed by ConSurf):
1. PDB ID / PDB file / model structure, and chain
Essential and optional input (through “advanced options”):
1. PDB ID / PDB file / model structure, and chain
2. Constructed MSA, with the query sequence included
3. Phylogenetic tree

Web form (http://consurf.tau.ac.il/index.html):
• Essential and optional input: method (Bayesian or Maximum Likelihood), PDB ID (1NW7) and chain (check in PDBsum…), MSA, name of the query sequence in the MSA, tree, email
• Essential input only: PDB ID (1NW7) and chain (check in PDBsum…), email, alignment method, sequence database (SWISS-PROT or UniProt), additional BLAST options
• When the calculation is finished, the results page offers: an easy web-based viewer (Jmol), a viewer for producing medium-quality images, the scores, the produced or input MSA, the phylogenetic tree, a script for coloring in RasTop and instructions for PyMOL

ConSurf summary: MSA quality
• ConSurf is dependent on the quality of the MSA.
• When an MSA is not given by the user, sequences are automatically gathered by PSI-BLAST and aligned by CLUSTALW with default parameters.
• Even though these alignments are usually good, it is highly recommended to inspect the alignment manually and with other tools in order to improve the quality of the evolutionary data.

A caveat: in some cases the functionally important region may not be conserved at all, e.g. the peptide-binding groove of the MHC class I heavy chain (PDB ID 2vaa).

PatchFinder: identification of functional sites (Nimrod et al., 2005; Nimrod et al., 2008)
Patch: a spatially continuous cluster of surface residues.
Problems:
– Subjectivity of boundaries.
– Difficult to apply on large datasets
Input:
1. Protein structure
2. Multiple sequence alignment (MSA)
Steps:
(1) Assignment of conservation scores (Rate4Site; Mayrose et al., 2004)
(2) Identification of exposed residues
(3) Extraction of the surface patch of conserved residues with the highest statistical significance (ML-patch)
(4) Identification of non-overlapping secondary patches
http://patchfinder.tau.ac.il/

Summary of structure-related databases & tools
• Secondary structure prediction: PredictProtein, Meta PredictProtein and PSIPRED.
• PDB, SCOP and CATH: collection and classification of experimentally determined structures.
• Structure visualization: PyMOL
• Conservation analysis: ConSurf and PatchFinder

Protein structure prediction

Structure Prediction Approaches
1. Homology (Comparative) Modeling: based on sequence similarity with a protein for which a structure has been solved.
2. Threading (Fold Recognition): requires a structure similar to a known structure
3. Ab-initio fold prediction: not based on similarity to a sequence/structure

Ab-initio
Structure prediction from “first principles”: given only the sequence, try to predict the structure based on physico-chemical properties (energy, hydrophobicity etc.)
• When all else fails: works for novel folds
• Shows that we understand the process

The Force Field (energy function)
A group of mathematical expressions describing the potential energy of a molecular system
• Each expression describes a different type of physico-chemical interaction between atoms in the system:
– Van der Waals forces
– Covalent bonds
– Hydrogen bonds
– Charges
– Hydrophobic effects
(all except the covalent bonds are non-bonded terms)

Approaches to Ab-initio Prediction
1. Molecular Dynamics
• Simulates the forces that govern the protein within water.
• Since proteins usually fold naturally, this should lead to the native protein structure.
Problems:
• Thousands of atoms
• Huge number of time steps to reach the folded protein
→ feasible only for very small proteins
2. Minimal Energy
Assumption: the folded form is the minimal energy conformation of a protein
Main principles:
• Define an energy function.
• Search for the 3D conformation that minimizes the energy.
• Use of a simplified energy function
• Search methods for the minimal energy conformation:
– Greedy search
– Simulated annealing
– …

Ab-initio: fragment-based methods (Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006))
• Current methods (e.g. Rosetta) primarily utilize the fact that although we are far from observing all protein folds, we probably have seen nearly all substructures.
Local sequence-structure relationships:
• A library of known sub-structures (fragments of fewer than 10 residues) is created.
• A range of possible conformations for each fragment in the query protein is selected.
Non-local sequence-structure relationships:
• The primary non-local interactions considered are hydrophobic burial, electrostatics, main-chain hydrogen bonding etc.
• Structures that are consistent with both the local and non-local interactions are generated by minimizing the non-local interaction energy in the space defined by the local structure distributions.
Ab-initio: example (figure from Moult, 2006)

Fold Recognition (Threading)
Given a sequence and a library of folds, thread the sequence through each fold. Take the one with the best score.
• The method will fail if the new protein does not belong to any fold in the library.
• The score of the threading is computed based on known physical chemistry properties and statistics of amino acids.

Threading: example
• Structural template: 10 positions, onto which the sequence ACCECADAAC is threaded (A1, C2, C3, E4, C5, A6, D7, A8, A9, C10)
• Neighbor definition: which pairs of positions i, j are in contact in the template
• Energy function: $E = \sum_{\text{neighboring positions } i,j} E_{a_i a_j}$, where $a_i$ is the amino acid placed at position i and the pairwise energies $E_{ab}$ are:

      A    C    D    E   …
A    -3   -1    0    0   …
C    -1   -4    1    2   …
D     0    1    5    6   …
E     0    2    6    7   …
…

For this threading: E = -3 -1 -4 -4 -1 -4 -3 -3 = -23

Find the best fold for a protein sequence: fold recognition (threading)
Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is threaded through each potential fold in the library (1, …, 56, …, n), and each fold receives a score (the values shown are -10, -123 and 20.5).

GenTHREADER
• Align the query sequence with each template (requires some sequence homology!)
• Assess the alignment by:
– Sequence alignment score
– Pairwise potentials
– Solvation function
• Record the lengths of the alignment, the query and the template
• A neural network combines these values into the overall score.
Jones DT et al. J. Mol. Biol. 287: 797-815 (1999)

I-TASSER- Hybrid Approach
• In a recent community-wide blind experiment, CASP7, I-TASSER generated the best 3D structure predictions among all automated servers.
• Based on secondary-structure-enhanced threading and the iterative implementation of the Threading ASSEmbly Refinement (TASSER) program.
• For predicting the biological function of the protein, the I-TASSER server matches the predicted 3D models to the proteins in 3 independent libraries, which consist of proteins of known enzyme classification (EC) number, gene ontology (GO) vocabulary, and ligand-binding sites.

I-TASSER Test Case: Rosetta vs. TASSER
[Figure legend]
Grey: Crystal structure of Betannnn
Purple: Rosetta prediction, starting from homology modeling
Green: TASSER prediction

Homology Modeling – Basic Idea
1. A protein structure is defined by its amino acid sequence.
2. Closely related sequences adopt highly similar structures; distantly related sequences may still fold into similar structures.
3. The three-dimensional structures of proteins from the same family are more conserved than their primary sequences.
Example: triosephosphate isomerases, 44.7% sequence identity, RMSD 0.95

General Scheme
1. Searching for structures related to the query sequence
2. Selecting templates
3. Aligning the query sequence with the template structures
4. Building a model for the query using information from the template structures
5. Evaluating the model
Fiser A et al.
Methods in Enzymology 374: 461-491 (2004)

General Scheme
Homology modeling requires handling structures & sequences
• Query- only the protein sequence is available; it is usually found in the UniProt database
• Template- after identification, both structural and sequence-related data should be retrieved from UniProt (or NCBI databases), RCSB and PDBsum

Homology modeling- query-template alignment
Different levels of similarity between the template & query call for different computational approaches.

Homology modeling- model evaluation
Evolutionary Conservation
http://consurf.tau.ac.il

Homology Modeling
• The accuracy of the model depends on its sequence identity with the template: