PLPTH 890 Introduction to Genomic Bioinformatics
Lecture 23: Protein Structure Analysis - II
Liangjiang (LJ) Wang
April 10, 2005

Outline
• Protein structure alignment (DALI and VAST).
• Protein secondary structure prediction (PHDsec, PSIPRED, etc.).
• Prediction of 3-D protein structures:
– Homology modeling.
– Threading.
– Ab initio prediction.
• Protein structural genomics.

Protein Structure Comparison
• Why is structure comparison important?
– To understand structure-function relationships.
– To study the evolution of many key proteins (structure is more conserved than sequence).
• Comparing 3-D structures is much more difficult than comparing sequences.
• Protein structure classification:
– SCOP: Structural Classification Of Proteins.
– CATH: Class, Architecture, Topology and Homologous superfamily.
• Protein structure alignment: DALI and VAST.

Protein Structure Alignment
• Positions of atoms in two or more 3-D protein structures are compared.
• It must first be determined which atoms to align. At least two sets of three common reference points should be identified.
• Atoms in the structures are matched to minimize the average deviation.
• Computers are NOT good at comparing 3-D objects (an NP-hard problem).
(Baxevanis and Ouellette, 2005)

How to Compare Structures?
[Flow diagram: Structure 1 and Structure 2 → feature extraction → Description 1 and Description 2 → comparison → scores → statistical analysis → similarity, classification.]

DALI
• DALI stands for Distance matrix ALIgnment.
• Each structure is represented as a two-dimensional array (matrix) of distances between all pairs of Cα atoms.
– Remember what a Cα atom is?
• Assumption: similar 3-D structures have similar inter-residue distances.
• DALI uses distance matrices to align protein structures.
• DALI is available at http://www.ebi.ac.uk/dali/.

VAST
• VAST stands for Vector Alignment Search Tool.
• Each structure is represented as a set of secondary structure elements (SSEs).
– SSEs: α helices or β strands.
• VAST scores pairs of SSEs based on their type, orientation and connectivity.
• Statistically significant SSE matches are then extended (similar to BLAST).
• Structure comparisons for MMDB entries have been pre-computed and are organized as structure neighbors in Entrez.
• VAST can be accessed at http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml.

Secondary Structure Prediction
• Given the sequence of a polypeptide, its secondary structure is predicted.
• Assumption: secondary structures are fully determined by local interactions among neighboring residues.
• Early analyses were based on the frequencies of amino acids found in different types of secondary structures.
– For example, proline is common in turns but rare in α helices.
• Modern approaches use machine learning techniques and multiple sequence alignments.

Machine Learning Approach
[Diagram: training dataset → training → classifier (model) → testing on a test dataset → performance check (No / Yes) → prediction.]
Example of labeled training data (H = helix, E = sheet, L = loop):
QEALDAAGDKLVVVDF
HHHHHHLLLLEEEEEE

PHDsec
• For a given protein sequence:
– Search for homologous sequences.
– Produce a multiple sequence alignment.
– Generate a profile (evolutionary information).
• PHDsec uses a feed-forward artificial neural network to predict the secondary structure.
[Figure: neural network with an input layer (sequence window S P A R S H E L K Y), a hidden layer and an output layer.]
(PHDsec can be accessed at http://www.predictprotein.org/)

PSIPRED
• For a given protein sequence:
– Perform a PSI-BLAST search.
– Create a profile that conveys the evolutionary information at each position.
– Feed the profile into a system of neural networks (or support vector machines).
• PSIPRED can be accessed at http://bioinf.cs.ucl.ac.uk/psipred/.

How to Evaluate the Performance?
• EVA: an independent server for evaluating protein structure prediction methods.
• The best tool for three-state per-residue secondary structure prediction now reaches an accuracy of about 78%.
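Three-state per-residue accuracy, commonly called Q3, is the fraction of residues whose predicted state (H, E or L) matches the observed state. The sketch below is only an illustration of that definition, not EVA's actual evaluation code. The observed string is the label line from the Machine Learning Approach slide; the predicted string and the function name `q3_accuracy` are invented here for the example.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Return the fraction of positions where the predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


# Observed states for the slide's example sequence QEALDAAGDKLVVVDF.
observed = "HHHHHHLLLLEEEEEE"
# Hypothetical prediction, made up for illustration (differs at positions 6 and 16).
predicted = "HHHHHLLLLLEEEEEL"

print(f"Q3 = {q3_accuracy(observed, predicted):.3f}")  # prints "Q3 = 0.875"
```

Here 14 of 16 residues agree, so Q3 is 0.875. A benchmark would average this score over many proteins that were not used for training.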
(http://cubic.bioc.columbia.edu/eva/)

Prediction of 3-D Protein Structures
• There are about 30,000 structures in PDB, but more than 1.8 million non-redundant protein sequences in UniProt (Swiss-Prot + TrEMBL).
• Computational structure prediction may provide valuable information for most of the protein sequences derived from genome sequencing projects.
• Three predictive methods:
– Homology (or comparative) modeling.
– Threading (or fold recognition).
– Ab initio structure prediction.

Sequence - Structure Relationship
• In cells, protein folding is determined by the amino acid sequence. However, protein structures can also be affected by post-translational modifications and the cellular environment.
• Proteins with ≥ 30% sequence identity tend to have similar structures. However, exceptions do exist …
[Figure: an 80-residue stretch (yellow) with 40% sequence identity in two different structures: a viral capsid protein (1PIV:1) and a glycosyltransferase (1HMP:A). (Bourne, 2004)]

Homology Modeling
• Probably the most accurate method for protein structure prediction.
• Five steps:
– Find a known structure related to the query sequence by sequence comparison.
– Align the query sequence with the known structure (template).
– Build a model by modifying the backbone and side chains of the template.
– Refine the model using energy minimization.
– Validate the model using visual inspection or software tools.

Homology Modeling (Cont’d)
• The accuracy of structure prediction depends on the percent amino acid sequence identity shared between the query and the template.
• For >50% sequence identity, the RMSD (Root Mean Square Deviation) is only about 1 Å for main-chain atoms, which is comparable to the accuracy of a medium-resolution NMR structure or a low-resolution X-ray structure.
• Homology modeling may not be usable for predicting protein structures if the sequence identity is less than 30%.

Homology Modeling Servers
• SWISS-MODEL (http://swissmodel.expasy.org/): a popular site for homology modeling.
• SDSC1 (http://cl.sdsc.edu/hm.html): the #1 ranked server for homology modeling on the EVA site.
[Screenshot: SDSC1 on the EVA site, http://cubic.bioc.columbia.edu/eva/]

Threading
[Figure from Baxevanis and Ouellette, 2005]

Threading (Cont’d)
• Threading takes a query sequence and passes (threads) it through the 3-D structure of each protein in a fold database (known structures).
• As a sequence is threaded, the fit of the sequence to the fold is evaluated using an energy or packing-efficiency function.
• Threading may find a common fold for proteins with essentially no detectable sequence similarity.
• Structures predicted by threading are often not of high quality (RMSD > 3 Å).
• Based on EVA results, 3D-PSSM is the best threading server (http://www.sbg.bio.ic.ac.uk/~3dpssm/).

Ab Initio Structure Prediction
• Ab initio prediction can be used when a protein sequence has no detectable homologues in PDB.
• Protein folding is modeled based on global free-energy minimization.
• Since the protein folding problem has not yet been solved, ab initio prediction methods are still experimental and can be quite unreliable.
• One of the top ab initio prediction methods is Rosetta, which predicted 61% of structures (80 of 131) to within 6.0 Å RMSD (Bonneau et al., 2002).
• The HMMSTR/Rosetta Server can be accessed at http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php.

Comparing Structure Prediction Methods
[Figure: A - C: homology modeling with 60% (A), 40% (B) and 30% (C) sequence identity. D and E: ab initio protein structure prediction. Predicted structures are in red, and actual structures are in blue. (Baker and Sali, 2000)]

Example: Cysteine-Rich Peptides
[Figure: peptide schematic showing a signal helix and cleavage site followed by eight cysteine (C) positions.]
• NCR: Nodule-specific Cysteine-Rich genes in legumes.
• Avr9: fungal avirulence protein from Cladosporium fulvum.
• Defensin: antimicrobial peptides.
• Proteinase inhibitor: serine proteinase inhibitors.
• SCR6: from the S-locus of Brassica; involved in self-incompatibility (SI); interacts with SRK6.
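The RMSD values quoted for homology modeling, threading and Rosetta measure how far matched atoms in a predicted structure lie from their positions in the actual structure. The following is a minimal sketch of the calculation, assuming the two coordinate sets are already optimally superimposed and listed in the same atom order; real tools first compute the best rigid-body superposition, which is omitted here. The coordinates and the function name `rmsd` are hypothetical and serve only as an illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points.

    Assumes the structures are already superimposed and that point i in
    coords_a corresponds to point i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must have the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom is required")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical Cα coordinates in Å, invented for this example.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
actual = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0), (7.6, 1.0, 0.0)]

print(f"RMSD = {rmsd(model, actual):.2f} Å")  # each atom is displaced by 1 Å, so this prints 1.00
```

Because the deviations are squared before averaging, a few badly placed atoms raise the RMSD more than many small errors do.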
Ab Initio Prediction of Cys Rich Peptides
[Figure panels: LSG-TC51151; PsENOD3; Defensin (AAG40321, M. sativa); Avr9 (Cladosporium fulvum).]

Protein Structural Genomics
• A worldwide initiative aimed at determining a large number of protein structures in a high-throughput mode.
• In the US, nine structural genomics centers have been funded by the National Institutes of Health (NIH).
• More information may be found at http://www.rcsb.org/pdb/strucgen.html.
• TargetDB (http://targetdb.pdb.org/): a centralized registration database for target sequences from the worldwide structural genomics projects.

A Target Selection Pipeline from JCSG
[Diagram: filtering methods: TMHMM; protein size (7 - 80 kDa); low complexity; redundancy; BLAST against PDB sequences.]

Summary
• Fast and accurate structure alignment is still a very hard problem.
• Machine learning techniques are widely used in protein secondary structure prediction.
• Homology modeling is probably the most reliable method for structure prediction.
• The protein folding problem has not yet been solved.

Prediction of Solvent Accessibility
• Solvent accessibility: the relative area of a residue’s surface that is exposed to the surrounding solvent.
• Solvent-accessible residues may be part of an active site or a binding site, while buried residues may play an important role in stabilizing the protein structure.
• PHDacc (http://www.predictprotein.org/): a neural network-based method (similar to PHDsec).
• Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/): a neural network system that predicts both secondary structure and solvent accessibility.

Predicting Transmembrane Segments
• Transmembrane segments share common biophysical features (e.g., hydrophobicity).
• PHDhtm (http://www.predictprotein.org/):
– Part of the PredictProtein services.
– Transmembrane helices are predicted using a neural network system.
• TMHMM (http://www.cbs.dtu.dk/services/TMHMM/):
– Known transmembrane segments are represented as HMMs.
– A query sequence is matched to a known transmembrane pattern.

Signal Peptide Prediction
• Extracellular proteins or proteins targeted to subcellular compartments contain short signal peptides (often at the N-terminus).
• PSORT (http://psort.ims.u-tokyo.ac.jp/): a rule-based expert system for predicting the subcellular localization of proteins from their amino acid sequences. The k-nearest neighbors algorithm is used for reasoning.
• SignalP (http://www.cbs.dtu.dk/services/SignalP/): predicts the presence and location of signal peptide cleavage sites using a combination of neural networks and HMMs.
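The hydrophobicity feature mentioned under Predicting Transmembrane Segments can be illustrated with a sliding-window hydropathy average. This is a deliberately simple stand-in based on the Kyte-Doolittle scale, not the neural network of PHDhtm or the HMM of TMHMM. The window length of 19, the cutoff of 1.6, the function names and the toy sequence are all assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every full window along the sequence.

    Raises KeyError if the sequence contains a non-standard residue code.
    Returns an empty list if the sequence is shorter than the window.
    """
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(sequence) - window + 1)
    ]


def candidate_windows(sequence, window=19, threshold=1.6):
    """0-based start positions of windows whose average hydropathy meets the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [start for start, average in enumerate(profile) if average >= threshold]


# Hypothetical toy sequence: charged flanks around a leucine/alanine-rich core.
toy = "DKERSDKE" + "LLAALLAALLAALLAALLA" + "KRDESKRD"
print(candidate_windows(toy))
```

Windows overlapping the hydrophobic core are reported as candidates, while a sequence made only of charged residues yields none. A simple threshold like this also flags buried hydrophobic stretches of soluble proteins, which is one reason trained models such as those listed above are preferred.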
