Protein and Proteome Annotation
David Wishart
University of Alberta, Edmonton, AB
Lecture 2.5

Annotating 2D Gels
[Figure: trypsin + gel punch; gel spots labeled p53, Trx, G6PDH]

Is This Annotated?
[Figure: 2D gel with spots labeled p53, Trx, G6PDH]
Information: 1) pI 2) MW 3) name (abbr.) 4) accession # 5) relative amount

How About This?
Information: 1) name (abbr.) 2) accession # 3) relative amount 4) coexpressors

Is This Annotated?
>P12345 Sequence 1
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT

Protein Annotation
• Objective: identify and describe all the physico-chemical, functional and structural properties of a protein, including its sequence, accession #, mass, pI, absorptivity, solubility, active sites, binding sites, reactions, substrates, homologues, function, name(s), abundance, location, 2° structure, 3D structure, domains, pathways and interacting partners

Protein vs. Proteome Annotation
• Protein annotation is concerned with one or a small number (<50) of proteins from one or several types of organisms
• Proteome annotation is concerned with entire proteomes (>2000 proteins) from a specific organism (or for all organisms), hence the need for speed

Different Levels of Annotation
• Sparse – typical of many gel or microarray annotations, usually just includes name and accession number
• Moderate – typical of many sequence databases or of experiments aimed at identifying protein complexes or ligands
• Detailed – not typical (occasionally found in organism-specific databases)

Different Levels of Database Annotation
• GenBank (large # of sequences, minimal annotation)
• PIR (large # of sequences, slightly better annotation)
• SwissProt (small # of sequences, even better annotation)
• Organism-specific DB (very small # of sequences, best annotation)

Example slides (titles only): GenBank Annotation; PIR Annotation; Swiss-Prot Annotation; CCDB Annotation (two slides)

Ultimate Goal...
• To achieve the same level of protein/proteome annotation as found in CCDB for all genes/proteins – from 2D GE data, from microarray data or for sequence databases in general
How?

Annotation Methods
• Annotation by homology (BLAST) – requires a large, well annotated database of protein sequences
• Annotation by sequence composition – simple statistical/mathematical methods
• Annotation by sequence features, profiles or motifs – requires sophisticated sequence analysis tools

Annotation by Homology
• Statistically significant sequence matches identified by BLAST searches against GenBank (nr), SWISS-PROT, PIR, ProDom, BLOCKS, KEGG, WIT, Brenda, BIND
• Properties or annotation inferred by name, keywords, features, comments
Databases Are Key

Sequence Databases
• GenBank – www.ncbi.nlm.nih.gov/
• EMBL/trEMBL – www.ebi.ac.uk/trembl/
• DDBJ – www.nig.ac.jp/
• PIR – http://pir.georgetown.edu/
• SwissProt – www.expasy.ch/sprot/
• UniProt – http://www.pir.uniprot.org/

Structure Databases
• RCSB-PDB – http://www.rcsb.org/pdb/
• MSD – http://www.ebi.ac.uk/msd/index.html
• CATH – http://www.biochem.ucl.ac.uk/bsm/cath/
• SCOP – http://scop.mrc-lmb.cam.ac.uk/scop/

Expression Databases
• Swiss 2D Page – http://ca.expasy.org/ch2d/
• SMD – http://genome-www5.stanford.edu/MicroArray/SMD/
• ArrayExpress – http://www.ebi.ac.uk/arrayexpress/
• Gene Expr. Omnibus – http://www.ncbi.nlm.nih.gov/geo/

Metabolism Databases
• KEGG – http://www.genome.ad.jp/kegg/metabolism.html
• Roche/Boehringer – http://www.expasy.org/cgi-bin/search-biochem-index
• EcoCyc – www.ecocyc.org/
• MetaCyc – http://metacyc.org/

Interaction Databases
• BIND – http://www.blueprint.org/bind/bind.php
• DIP – http://dip.doe-mbi.ucla.edu/
• MINT – http://mint.bio.uniroma2.it/mint/
• IntAct – http://www.ebi.ac.uk/intact/index.html

Bibliographic Databases
• PubMed Medline – http://www.ncbi.nlm.nih.gov/PubMed/
• Science Citation Index – http://isi4.isiknowledge.com/portal.cgi
• Your Local eLibrary – www.XXXX.ca
• Current Contents – http://www.isinet.com/isi/

Annotation by Homology: An Example
• 76 residue protein from Methanobacterium thermoautotrophicum (newly sequenced)
• What does it do?
• MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELKIMGRVASKEEIKKILS

Example slides (titles only): PSI-BLAST Select Database; PSI-BLAST (three slides)

Conclusions
• Protein is a thioredoxin or glutaredoxin (function, family)
• Protein has thioredoxin fold (2° and 3D structure)
• Active site is from residues 11-14 (active site location)
• Protein is soluble, cytoplasmic (cellular location)

Annotation Methods (outline repeated)
• Annotation by homology (BLAST)
• Annotation by sequence composition
• Annotation by sequence features, profiles or motifs

Annotation by Composition
• Molecular Weight
• Isoelectric Point
• UV Absorptivity
• Hydrophobicity

Where To Go
[Figure: not recoverable from the text]

Isoelectric Point
• The pH at which a protein has a net charge = 0
• $Q = \sum_i N_i / (1 + 10^{\mathrm{pH} - \mathrm{p}K_i})$

pKa Values for Ionizable Amino Acids
Residue   pKa       Residue   pKa
C         10.28     H         6
D         3.65      K         10.53
E         4.25      R         12.43

UV Absorptivity
• OD280 = (5690 × #W + 1280 × #Y)/MW × Conc.
• Conc. = OD280 × MW/(5690 × #W + 1280 × #Y)
[Figure: chemical structures of tyrosine and tryptophan]

Hydrophobicity
• Indicates Solubility
• Indicates Stability
• Indicates Location (membrane or cytoplasm)
• Indicates Globularity or tendency to form spherical structure

Kyte / Doolittle Hydrophobicity Scale
Residue   Hphob     Residue   Hphob
A          1.8      M          1.9
C          2.5      N         -3.5
D         -3.5      P         -1.6
E         -3.5      Q         -3.5
F          2.8      R         -4.5
G         -0.4      S         -0.8
H         -3.2      T         -0.7
I          4.5      V          4.2
K         -3.9      W         -0.9
L          3.8      Y         -1.3

Annotation Methods (outline repeated)
• Annotation by homology (BLAST)
• Annotation by sequence composition
• Annotation by sequence features, profiles or motifs

Where To Go
[Figure: not recoverable from the text]

Sequence Feature Databases
• PROSITE - http://www.expasy.ch/
• BLOCKS - http://blocks.fhcrc.org/
• DOMO - http://www.infobiogen.fr/services/domo/
• PFAM - http://pfam.wustl.edu
• PRINTS - http://www.biochem.ucl.ac.uk/bsm/dbrowser/PRINTS
• SEQSITE - PepTool

What Can Be Predicted?
• O-Glycosylation Sites
• Phosphorylation Sites
• Protease Cut Sites
• Nuclear Targeting Sites
• Mitochondrial Targeting Sites
• Chloroplast Targeting Sites
• Signal Sequences
• Signal Sequence Cleavage
• Peroxisome Targeting Sites
• ER Targeting Sites
• Transmembrane Sites
• Tyrosine Sulfation Sites
• GPInositol Anchor Sites
• PEST Sites
• Coiled-Coil Sites
• T-Cell/MHC Epitopes
• Protein Lifetime
• A whole lot more…
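
The composition-based properties from the Annotation by Composition slides (isoelectric point, UV absorptivity, hydrophobicity) can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular server. It uses only the constants given on the slides, and several things are assumptions: the slide formula is applied with a positive sign to basic groups (H, K, R) and in mirrored form with a negative sign to acidic groups (C, D, E); terminal groups and tyrosine are ignored because the slide's pKa table does not list them; concentration is taken as mg/mL with a 1 cm path length; and molecular weight is passed in rather than computed. All function names are hypothetical.

```python
from collections import Counter

# Side-chain pKa values exactly as listed on the slide.
PKA_BASIC = {"H": 6.0, "K": 10.53, "R": 12.43}    # carry +1 when protonated
PKA_ACIDIC = {"C": 10.28, "D": 3.65, "E": 4.25}   # carry -1 when deprotonated

# Kyte/Doolittle hydrophobicity scale as listed on the slide.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def net_charge(seq, ph):
    """Net charge Q at a given pH (assumed sign convention, no termini)."""
    counts = Counter(seq)
    positive = sum(counts[r] / (1 + 10 ** (ph - pk)) for r, pk in PKA_BASIC.items())
    negative = sum(counts[r] / (1 + 10 ** (pk - ph)) for r, pk in PKA_ACIDIC.items())
    return positive - negative


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Find the pH where Q = 0 by bisection (Q falls monotonically with pH).

    If the sequence has no ionizable residues the result is not meaningful.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def extinction_280(seq):
    """Molar extinction term from the slide: 5690 x #W + 1280 x #Y."""
    counts = Counter(seq)
    return 5690 * counts["W"] + 1280 * counts["Y"]


def od280(seq, mw, conc_mg_per_ml):
    """OD280 = (5690 x #W + 1280 x #Y) / MW x Conc."""
    return extinction_280(seq) / mw * conc_mg_per_ml


def conc_from_od280(seq, mw, od):
    """Conc. = OD280 x MW / (5690 x #W + 1280 x #Y)."""
    eps = extinction_280(seq)
    if eps == 0:
        raise ValueError("sequence has no W or Y; the formula does not apply")
    return od * mw / eps


def mean_hydropathy(seq):
    """Average Kyte/Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[r] for r in seq) / len(seq)


if __name__ == "__main__":
    # The example protein from the Annotation by Homology slide.
    seq = ("MMKIQIYGTGCANCQMLEKNAREAVKELGIDAEFEKIKEMDQILEAGLTALPGLAVDGELK"
           "IMGRVASKEEIKKILS")
    print("pI estimate:", round(isoelectric_point(seq), 2))
    print("extinction term at 280 nm:", extinction_280(seq))
    print("mean hydropathy:", round(mean_hydropathy(seq), 2))
```

Because the termini and tyrosine are left out and the cysteine pKa is the slide's value, the pI estimate will differ somewhat from what dedicated tools report; the sketch is only meant to show how little machinery these composition-based annotations need.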

Cutting Edge Sequence Feature Servers
• Membrane Helix Prediction – http://www.cbs.dtu.dk/services/TMHMM-2.0/
• T-Cell Epitope Prediction – http://syfpeithi.bmi-heidelberg.com/scripts/MHCServer.dll/home.htm
• O-Glycosylation Prediction – http://www.cbs.dtu.dk/services/NetOGlyc/
• Phosphorylation Prediction – http://www.cbs.dtu.dk/services/NetPhos/
• Protein Localization Prediction – http://psort.nibb.ac.jp/

Subcellular Localization
http://www.cs.ualberta.ca/~bioinfo/PA/Sub/

Proteome Analyst (SubCell)

2° Structure Prediction
• PredictProtein-PHD (72%) – http://cubic.bioc.columbia.edu/predictprotein/
• Jpred (73-75%) – http://www.compbio.dundee.ac.uk/~www-jpred/submit.html
• SAM-T02 (75%) – http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
• PSIpred (77%) – http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

Putting It All Together
[Diagram: Seq Motifs, Composition and Homology combine into an Annotated Protein]

Putting It All Together
• PEDANT – http://pedant.gsf.de/
• GeneQuiz – http://jura.ebi.ac.uk:8765/ext-genequiz/
• Magpie – http://magpie.ucalgary.ca/
• Proteome Analyst – http://www.cs.ualberta.ca/~bioinfo/PA/

Programs Used By Pedant
• HMMER
• PSORT
• PREDATOR
• COILS
• FGENESH++
• pI
• PROSEARCH
• TargetP
• SAPS
• NCBI-BLAST
• SEG
• InterProScan
• SignalP
• TMHMM
• tRNAscan-SE
• GENSCAN

Databases Used By Pedant
• EMBL
• Blocks
• PIR-PSD
• PDB
• SWISS-PROT
• SCOP
• Functional Cat
• COGs
• PROSITE
• Pfam
• TrEMBL
• STRIDE

GeneQuiz
http://jura.ebi.ac.uk:8765/gqsrv/submit

GeneQuiz Functions
• Amino acid biosynthesis
• Biosynthesis of cofactors, prosthetic groups, & carriers
• Cell envelope
• Cellular processes
• Central intermediary metabolism
• Energy metabolism
• Fatty acid and phospholipid metabolism
• Other categories
• Purines, pyrimidines, nucleosides, and nucleotides
• Regulatory functions
• Replication
• Transcription
• Translation
• Transport and binding proteins
• Unknown

Home Page

Proteome Analyst
• Uses PSI-BLAST, PSIPRED and motif analysis tools
• Extracts keyword information from homologues and uses Naïve Bayes classifiers to infer function
• Combines sequence motif and sequence profile information to complete functional classification
• Supports custom classifier/ontology

BacMap
• Picking up where we left off with the CCDB… (Google “bacmap”)
• Idea is to generate a visual atlas of all bacterial chromosomes and plasmids (not just Escherichia coli), with links to extensive genome annotation
• Attempt to re-use annotation and graphing tools originally developed for the CCDB

BacMap
http://wishart.biology.ualberta.ca/BacMap/

BacMap screens (titles only): Text Search Tools; Sequence Search Tools; Bacterial Biography Card; Genome Statistics; Proteome Statistics

BacMap
• Each genome has a short description of the organism and sequence data
• Supports zoomable, hyperlinked, clickable map views of the genome
• Supports text search of gene names, protein names and synonyms
• Supports BLAST search and supplies genome-wide stats
• Currently going through major update
Stothard P, et al. BacMap: an interactive picture atlas of annotated bacterial genomes. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D317-20.

What if Your Organism or Genome isn’t in BacMap?
http://wishart.biology.ualberta.ca/basys/

BASys
• Bacterial Annotation System
• A publicly available web server that performs automated annotation of bacterial genomes given only the gene sequence of a chromosome or plasmid
• Takes about 24 hrs for an average genome (4 megabases)
• Output includes images and annotation text (about 70 fields for each gene)

Typical BASys Result

Conclusion
• Genome annotation is the same as proteome annotation – required after any gene sequencing and gene ID effort
• Can be done either manually or automatically
• Need for high throughput, automated “pipelines” to keep up with the volume of genome sequence data
• Area of active research and development with about half of all bioinformaticians working on some aspect of this process
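
As a footnote to the Proteome Analyst slide, the idea of inferring function from homologue keywords with a Naïve Bayes classifier can be illustrated with a toy example. This is a generic multinomial Naïve Bayes sketch with add-one smoothing, not Proteome Analyst's actual implementation. The training keywords are invented for illustration, the class labels are borrowed from the GeneQuiz Functions slide, and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and keywords from (keywords, label) training pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, words):
    """Return the label with the highest log posterior for the keywords."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior for the class.
        score = math.log(label_counts[label] / total)
        # Add-one (Laplace) smoothing so unseen keywords do not zero a class.
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # keywords never seen in training are skipped
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set: keywords as they might be pulled from the
# annotations of BLAST homologues, paired with a functional class.
TRAINING = [
    (["thioredoxin", "redox", "disulfide"], "Energy metabolism"),
    (["glutaredoxin", "redox", "electron"], "Energy metabolism"),
    (["ribosomal", "protein", "translation"], "Translation"),
    (["elongation", "factor", "translation"], "Translation"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(classify(model, ["thioredoxin", "redox"]))
```

A real system would train on thousands of curated entries and would also need to handle homologues with conflicting or missing keywords; the sketch only shows the counting and scoring steps.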