UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the Swiss Institute of Bioinformatics
Andrea Auchincloss ([email protected])
Tunis, March 19, 2007
A. Auchincloss UniProtKB and ExPASy Tunis, March 2007

Outline
• The Swiss Institute of Bioinformatics
• What is UniProt?
• UniProt Knowledgebase: Swiss-Prot and TrEMBL
• HPI, post-translational modifications, HAMAP
• UniRef and UniParc
• Databases for protein function and domains: PROSITE, InterPro etc.
• ExPASy; other tools

Swiss Institute of Bioinformatics (SIB)
• Non-profit foundation created in 1998;
• Groups in Geneva, Lausanne and Basel;
• Federation of several groups (some of which existed and collaborated long before the foundation of the institute), about 170 researchers in 2006.
www.isb-sib.ch

SIB missions
• Development of databases and software tools;
• High-quality bioinformatics research program;
• Courses and seminars for the training of bioinformatics research scientists. This includes a master’s degree in proteomics and bioinformatics, several weekly courses and a doctoral school;
• Services to the Swiss Life Sciences community (EMBnet node).

Swiss Institute of Bioinformatics: 20 research and service groups

"Proteins are organic compounds made of amino acids arranged in a linear chain and joined by peptide bonds…" (Wikipedia)

Proteins are composed of 20 "standard" amino acids, each symbolised by a letter.
Different ‘views’ of a protein

Proteins can also work together to perform a particular function, and they often associate to form complexes.
Proteins are essential parts of all living organisms and participate in every process within cells:
-> enzymes
-> structural or mechanical functions
-> important in cell signaling, immune response, cell adhesion, cell cycle, toxins…
Proteins are a necessary component in our diet, since animals cannot synthesize all the amino acids and must obtain essential amino acids from food.

Protein/Gene number
Organism         Number
Bacteria         182-8,591
S. cerevisiae    6,127
C. elegans       17,947
Drosophila       13,849
A. thaliana      ~25,674
Human            ~21,000

The universe in which protein databases evolve
1953: 1st sequence (bovine insulin)
1986: 4,000 sequences
2006: 3.5 million sequences
Where will it stop?
AMB, SP20

179,000,021,000
1st estimate: ~30 million species (1.5 million named)
2nd estimate:
20 million bacteria/archaea x 4,000 genes
5 million protists x 6,000 genes
3 million insects x 14,000 genes
1 million fungi x 6,000 genes
0.6 million plants x 20,000 genes
0.2 million molluscs, worms, arachnids, etc. x 20,000 genes
0.2 million vertebrates x 21,000 genes
The calculation: 2x10^7 x 4,000 + 5x10^6 x 6,000 + 3x10^6 x 14,000 + 10^6 x 6,000 + 6x10^5 x 20,000 + 2x10^5 x 20,000 + 2x10^5 x 21,000 + 21,000 (you!)
Caveat: this is an estimate of the number of potential sequence entries, not of the number of distinct protein entities in the biosphere.

What sequencing is underway right now?
Many eukaryotic & bacterial genomes (varying sizes)
Metagenomics (environmental samples)
~ 6 million sequences submitted/published in December 2006, ~ 17 million sequences being generated at the Venter Institute, 6 million proteins being submitted from the GOS (Global Ocean Sampling) trip

Protein sequences; what is sequenced?
Currently about 3.5 to 4.0 million ‘known’ protein sequences
More than 99% of these are derived by translation of nucleotide sequences
Less than 1%: direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from… (sequence & gene prediction quality)!

Level of DNA/RNA sequence quality
- DNA/RNA sequencing quality (genome or WGS, cDNA or EST …)
- Gene prediction quality; programs used, is there manual intervention afterwards?
For example: Authors can specify the nature of the CDS in the nucleotide databases by using qualifiers: "/evidence=experimental" or "/evidence=not_experimental". Very rarely done…

The hectic life of a sequence …
Data not submitted to public databases, delayed or cancelled…
cDNAs, ESTs, genomes, … -> Public nucleic acid databases (EMBL, GenBank, DDBJ)
…if the submitters provide an annotated Coding Sequence (CDS) -> Public protein sequence databases
CDS: CoDing Sequence
CDS provided by the submitters
The first Met!
CDS translation provided by EMBL
Data not submitted
Complete genome (submitted): only ~ 1,858 CDS available!
Issue for the users: the protein database jungle

[Diagram] Sequences derived from scientific publications and CoDing Sequences provided by submitters to EMBL, GenBank, DDBJ feed the protein databases: TrEMBL, Swiss-Prot (manually annotated), UniProtKB, GenPept, RefSeq*, PRF, PIR, IPI, UniParc, EnsEMBL*, CCDS, PDB + species-specific databases (EcoGene, TubercuList, TIGR…)
* Also gene prediction
Major public protein sequence database ‘sources’
PIR, PDB, PRF
Integrated resources, ‘cross-references’: UniProtKB: Swiss-Prot + TrEMBL
Separated resources: NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq
• UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)
• UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non-redundant with Swiss-Prot (127,000 species)
• GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)
• PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
• PDB: Protein Databank: 3D data and associated sequences
• PRF: journal scan of ‘published’ peptide sequences
• RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)

Other protein sequence databases
• CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species). Consensus human and mouse sequences between 4 institutions… Combines different approaches (ab initio, by similarity) and takes advantage of the expertise acquired by the different institutes, including manual annotation…
• EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species). Aligns some eukaryotic genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (→ known genes). Also does some gene prediction (→ novel genes)
• IPI: UniProtKB + RefSeq + EnsEMBL + (H-InvDB, TAIR, VEGA) (7 species). Provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes. …

The UniProt consortium
• European Bioinformatics Institute, European Molecular Biology Laboratory
• Swiss Institute of Bioinformatics
• Protein Information Resource

The UniProt Consortium
UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information
www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)

The Universal Protein resource components
UniProtKB (KnowledgeBase), produced by SIB and EBI. UniProtKB Release 9.7 consists of:
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,600,000 entries; ~100,000 species
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries; ~10,000 species
UniRef (UniRef100, UniRef90, UniRef50), produced by PIR
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that have at least 90% identity.
• One UniRef50 entry = sequences that have at least 50% identity.
Independent of species. Allows comprehensive BLAST similarity searches by providing sets of representative sequences.
UniParc (UniProt Archives), produced by EBI
~8,000,000 entries. Archived raw protein sequences, found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…

The Universal Protein resource components (same slide with current figures)
• UniProtKB/TrEMBL: 3,900,000 entries; ~127,000 species
• UniProtKB/Swiss-Prot: 260,000 entries; ~11,000 species
• UniParc: ~8,800,000 entries

UniProt web sites…
http://www.expasy.org/sprot/
http://www.pir.uniprot.org/
http://www.ebi.ac.uk/uniprot/
http://www.uniprot.org/
Soon, a new unified web site, with a very powerful search engine…
http://beta.uniprot.org/
Test it! Logon: guest Password: amazing

The UniProt groups from SIB, EBI and PIR (Antibes, September 2004)
In Geneva (SIB): 2 Group Leaders; 44 Annotators; 4 Prosite annotators; 22 Programmers and Researchers; 5 Administrators, science communicators; 3 System Administrators; 4 Students; 1 GISAID. Total: 85 people
At EBI (Swiss-Prot + EMBL + TrEMBL): 75 people (29 Annotators)
At PIR: 1 Group Leader; 13 Protein Science Team; 12 Informatics Team. Total: 26 people

UniProtKB has biweekly releases; it is available from about 100 servers, the main sources being ExPASy and www.uniprot.org

UniProtKB
From EMBL (DNA) to TrEMBL (protein)

EMBL -> TrEMBL: gene/protein name, taxonomy, reference, CDS
Automated extraction of the protein sequence (CDS), gene name, taxonomy and references.
Automated annotation (keywords and protein family).
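The automated extraction step on the EMBL-to-TrEMBL slide can be illustrated with a minimal sketch. It only reads what the submitter already annotated: the CDS features and their /translation qualifiers. It assumes the usual fixed-column feature-table layout (feature key in columns 6-20, location and qualifiers from column 22); the function name and the sample entry are invented for illustration, and this is not the pipeline actually used to build TrEMBL.

```python
def extract_cds(entry_text):
    """Collect location and qualifiers of every CDS feature in an EMBL-style entry."""
    cds_list = []
    current = None      # dict for the CDS feature being read, or None
    open_name = None    # qualifier whose quoted value continues on the next line
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue
        key = line[5:20].strip()    # feature key column (assumed fixed width)
        rest = line[21:].strip()    # location, qualifier or continuation text
        if key:                     # a new feature starts here
            open_name = None
            current = {"location": rest} if key == "CDS" else None
            if current is not None:
                cds_list.append(current)
            continue
        if current is None:         # qualifier of a non-CDS feature: ignore
            continue
        if open_name is not None:   # continuation of a quoted value
            current[open_name] += rest.rstrip('"')
            if rest.endswith('"'):
                open_name = None
        elif rest.startswith("/"):
            name, _, value = rest[1:].partition("=")
            closed = (not value.startswith('"')) or (len(value) > 1 and value.endswith('"'))
            current[name] = value.strip('"')
            if not closed:
                open_name = name
    return cds_list


# Invented sample entry (hypothetical identifier and sequence).
SAMPLE = "\n".join([
    "ID   X00001; SV 1; linear; mRNA; STD; PRO; 33 BP.",
    f"FT   {'source':<15} 1..33",
    f"FT{'':19}/organism=\"Escherichia coli\"",
    f"FT   {'CDS':<15} 1..33",
    f"FT{'':19}/gene=\"demo\"",
    f"FT{'':19}/translation=\"MKTAYIAKQR",
    f"FT{'':19}QISFVKSHFS\"",
])

for cds in extract_cds(SAMPLE):
    print(cds["gene"], cds["location"], cds["translation"])
```

Nothing in this sketch translates DNA or predicts genes: if the submitter provided no CDS feature, no protein sequence comes out.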
TrEMBL does not translate DNA sequences, nor does it use gene prediction programs: it only takes the existing CDS proposed by the submitting authors in the EMBL/GenBank/DDBJ entry.
In particular, the proposed CDS and derived protein sequences can be experimentally proven or derived from gene prediction programs (this is not obvious from the TrEMBL entry).
TrEMBL does not validate any sequences!
The quality of UniProtKB/TrEMBL data is directly dependent on the information provided by the submitter of the original nucleotide entry.

UniProtKB
From TrEMBL to Swiss-Prot

EMBL -> TrEMBL: automated extraction of the protein sequence (CDS), gene name and references. Automated annotation.
TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information (derived from literature, external experts, databases…). Annotation of sequence differences (conflicts, variants, splicing…).
Average of 6 independent sequence reports for each human protein

Distinguishing Swiss-Prot and TrEMBL
– A TrEMBL entry is a computer-annotated record derived from a coding sequence (CDS) in the nucleotide sequence databases that is not yet in Swiss-Prot, produced after some redundancy removal and automated annotation.
– A Swiss-Prot entry is a manually annotated record for a given protein.

UniProtKB
From TrEMBL to Swiss-Prot
Step 1: Sequence check

UniProtKB/Swiss-Prot: non-redundant
1 entry -> 1 gene (1 species)
i) Merge all known protein sequences (CDS and amino acid) derived from the same gene
-> decreases redundancy and improves sequence reliability
ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants etc.)
-> annotation of protein diversity
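The "1 entry -> 1 gene (1 species)" principle can be pictured with a toy sketch: sequence reports are grouped by species and gene, one sequence is shown, and positions where another report disagrees are recorded as conflicts. Everything here is hypothetical (function name, record layout, data), and the comparison naively assumes the reports are already aligned from position 1 without gaps; in Swiss-Prot this step relies on alignments and on the annotator's judgement.

```python
from collections import defaultdict


def merge_reports(reports):
    """Group reports by (species, gene) and list positions where they disagree."""
    grouped = defaultdict(list)
    for report in reports:
        grouped[(report["species"], report["gene"])].append(report["sequence"])

    merged = {}
    for key, sequences in grouped.items():
        # Hypothetical choice: display the longest report as the entry sequence.
        displayed = max(sequences, key=len)
        conflicts = set()
        for seq in sequences:
            # Naive comparison: same start, no gaps (a strong simplification).
            for pos, (shown, other) in enumerate(zip(displayed, seq), start=1):
                if shown != other:
                    conflicts.add((pos, shown, other))
        merged[key] = {
            "sequence": displayed,
            "n_reports": len(sequences),
            "conflicts": sorted(conflicts),
        }
    return merged


# Invented reports for illustration only.
reports = [
    {"species": "species one", "gene": "demoA", "sequence": "MRLDRLTNKF"},
    {"species": "species one", "gene": "demoA", "sequence": "MRLDRLTNKF"},
    {"species": "species one", "gene": "demoA", "sequence": "MRLERLTNK"},
    {"species": "species two", "gene": "demoA", "sequence": "MKLDRLSNKF"},
]

entries = merge_reports(reports)
for (species, gene), entry in entries.items():
    print(species, gene, entry["n_reports"], entry["conflicts"])
```

Three reports for the same gene in the same species collapse into one entry with one recorded conflict, while the report from the second species stays a separate entry.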
Redundancy…
UniProtKB/Swiss-Prot (~11,000 species) and UniProtKB/TrEMBL (~127,000 species): 260,000 + 3,800,000; 3,600,000
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
In the future, redundancy is going to decrease: "new" genome sequencing → "new" proteins

[Example entry] 13 sequences (complete or partial), derived from mRNA (n=6) or genomic DNA (n=7)
All alternatively spliced sequences are available for BLAST searches and protein identification tools, and are downloadable…
Human: ~2/3 of the human genes are alternatively spliced

[Example entry] 6 genomic sequences (complete or partial), 1 protein sequence from PIR

Multiple alignment of the available clpB sequences

Within Swiss-Prot?
• A snapshot of the situation (December 2006):
– 28,200 entries with 82,000 sequence conflicts;
– 2,600 entries with corrected frameshifts;
– 15,100 entries with corrected initiation sites;
– 4,300 entries with other sequence ‘problems’.
• At least 43,000 entries (19% of Swiss-Prot) required a minimal amount of annotation effort to obtain the “correct” sequence.

Quality of protein information from genome projects
• Proteins originating from different genome projects:
– Drosophila: what a curated (thanks to FlyBase) genome effort should look like: only 1.8% of the gene models conflict with what we have in UniProtKB/Swiss-Prot;
– Arabidopsis: a genome where lots of work was done to annotate it when it was sequenced, but where nothing has been done since (at least in the public view): 19.5% of the gene models are erroneous;
– Tetraodon nigroviridis: a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins.
– Bacteria and Archaea have almost no splicing, so prediction is “easier”; however, errors are still made…
• Producing a clean set of sequences is not a trivial task;
• It is not getting easier as more and more types of sequence data are submitted;
• It is important to pursue our efforts to provide our users with the most correct set of sequences for a given organism.

New ‘Protein existence evidence’ tag
• As most protein sequences are derived from translation of nucleotide sequences and are only predictions, the new PE line indicates whether there is any evidence that proves the existence of a protein;
• The ‘Protein existence evidence’ will have 5 different qualifiers:
1. Evidence at protein level
2. Evidence at transcript level
3. Inferred from homology
4. Predicted
5. Unassigned (used mostly in TrEMBL)

Righting the wrongs
“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10,000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hardley, EMBO reports, 4(9), 2003.

UniProtKB
From TrEMBL to Swiss-Prot
Step 2: Annotation: literature, controlled vocabulary

Annotation
• The focal point of the efforts to maintain and develop UniProtKB/Swiss-Prot;
• It is becoming more and more important as it:
– provides a summary of what is known about a protein;
– creates templates for automatic annotation for the many organisms whose genome sequence is or will be available but whose proteins will not be characterized;
– provides well annotated (corpus) entries to train literature mining tools (text mining).
Source of data
- publications (> 1,700 journals cited)
- also external scientific expertise & other databases (…)

Comments: “structured free text”, 27 defined topics
Manually annotated
Information from papers, specialized databases, computer prediction, external experts, brain storming
Distinction between data obtained experimentally and computerized inferences

UniProtKB
From TrEMBL to Swiss-Prot
Step 3: Sequence analysis (bioinformatics tools)

The annotation platform
Annotators could not work without the help of our software developers;

Anabelle: much more than a domain annotation platform
We manually check the results!

What else is in a UniProtKB/Swiss-Prot entry?

Cross-references; a central hub
Gasteiger E. et al, Curr. Issues Mol. Biol. 3:47-55(2001)
www.expasy.org/cgi-bin/lists?dbxref.txt
• Swiss-Prot was the first database with X-references;
• Explicitly X-referenced to 85 databases:
– DNA (EMBL/GenBank/DDBJ),
– 3D-structure (PDB)
– Family and domain (InterPro, HAMAP, PROSITE, Pfam, etc.)
– genomic (OMIM, MGI, FlyBase, SGD, SubtiList, etc.)
– 2D-gel (e.g. SWISS-2DPAGE)
– specialized db (e.g. GlycoSuiteDB, PhosSite, MEROPS);
– literature (PubMed)
• Each UniProtKB/Swiss-Prot entry can be seen as a central hub for the data available about the protein it describes
UniProtKB/Swiss-Prot explicit links
• Organism-specific databases: AGD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN
• Genome annotation databases: Ensembl, GenomeReviews, KEGG, TIGR
• Sequence databases: EMBL, PIR, UniGene
• Enzyme and pathway databases: BioCyc, Reactome
• Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
• 2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE
• 3D structure databases: HSSP, PDB, SMR
• PTM databases: GlycoSuiteDB, PhosSite
• Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp
• Protein family/group databases: GermOnline, MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC

Implicit cross-references on new web server and ExPASy
Implicit X-references to 26 additional databases are added by the ExPASy server on the web (e.g. GeneCards, ModBase, etc.).
These X-refs are not present as hard-coded DR lines in the Swiss-Prot entry as it can be downloaded by ftp, but are added on the fly when someone views an entry on ExPASy. This can be done because enough information is present in the UniProtKB entry to access the related information in another database.
Example: all Swiss-Prot/TrEMBL entries are linked to the BLOCKS domain database via the Swiss-Prot/TrEMBL accession number.

Keyword definition and usage in Swiss-Prot
Linked to Gene Ontology to further facilitate information retrieval via controlled vocabularies
In a UniProtKB/Swiss-Prot entry, you can expect to find:
• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D structures, etc.;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• An (often corrected) protein sequence and the description of various isoforms/variants.

Monitoring entry history: the UniProtKB Sequence/Annotation Version archive

… and many useful links:

And on the new website other tools are not yet available…

UniProt Knowledgebase
• Swiss-Prot: manually annotated section
• TrEMBL: automatically annotated section

Distinguishing Swiss-Prot and TrEMBL

Accession number: to be used whenever you cite a UniProt entry (never cite the entry name (ID) alone)

Non-Redundant Complete Proteome Sets
• Text search UniProtKB keyword “Complete proteome”, combined with an organism name
• Or download precomputed sets (bacteria, archaea, some eukaryotes): ftp://ftp.expasy.org/databases/complete_proteomes/entries
• Or EBI Integr8 http://www.ebi.ac.uk/integr8/
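The advice to cite the accession number rather than the entry name can be followed when working with a local flat file, since the two live on different lines of an entry. The sketch below assumes the flat-file conventions of a two-letter line code followed by the line content, an ID line starting with the entry name, AC lines holding semicolon-separated accession numbers (primary first), and "//" ending each entry. The function name and the sample entries are invented for illustration; it is only a minimal reading aid, not a full flat-file parser.

```python
def read_identifiers(flat_file_text):
    """Return entry name plus primary and secondary accessions for each entry."""
    entries = []
    entry_name, accessions = None, []
    for line in flat_file_text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "ID":
            entry_name = content.split()[0]        # first token is the entry name
        elif code == "AC":
            accessions += [ac.strip() for ac in content.split(";") if ac.strip()]
        elif code == "//":                         # end of entry
            if entry_name and accessions:
                entries.append({
                    "entry_name": entry_name,      # may change between releases
                    "primary": accessions[0],      # the identifier to cite
                    "secondary": accessions[1:],
                })
            entry_name, accessions = None, []
    return entries


# Invented sample entries (hypothetical names and accession numbers).
SAMPLE = """\
ID   DEMO1_HUMAN             Reviewed;         105 AA.
AC   P00000; Q00000;
DE   Hypothetical demo protein 1.
//
ID   DEMO2_MOUSE             Reviewed;         210 AA.
AC   P00001;
//
"""

for entry in read_identifiers(SAMPLE):
    print(entry["primary"], entry["entry_name"], entry["secondary"])
```

Keeping the primary accession as the key in downstream tables means a later change of entry name does not break the link to the entry.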
Swiss-Prot annotation priorities
The main annotation programs:
• HAMAP (High quality Automated and Manual Annotation of microbial Proteomes; bacteria, archaea, plastids);
• HPI (Human Proteomics Initiative);
• PPAP (Plant Proteome Annotation Project);
• FPAP (Fungal Proteome Annotation Project);
• Viral proteins;
• Tox-Prot (Toxin Annotation Project);
• ENZYMES (proteins with EC numbers);
• PTMs
• 3D-structure
• Protein-protein interactions
• Quality assurance, includes controlled vocabularies

Model organisms
• Organisms for which we want to have a more in-depth coverage;
• Completeness, links with specialized databases, specific documents;
• Examples: E. coli, B. subtilis, human, mouse, fruit fly, C. elegans, yeast, S. pombe, A. thaliana.

Human Proteomics Initiative (HPI)

From genome to proteome
~ 21,000 human genes
-> alternative splicing of mRNA: 2-5 fold increase
~ 100,000 human transcripts
-> post-translational modifications of proteins (PTMs): 5-10 fold increase
~ 1,000,000 human proteins
Considerable increase in complexity

In the case of human genes, the Swiss-Prot/TrEMBL redundancy is still very high: 15,803 + 53,100 entries for about 20,000* genes
* human gene number estimation: 21,000-35,000
MS proteomics has verified more than 10% of human gene products, but has not identified significant numbers of unpredicted proteins
What is missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
• Confidential data (patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …

Post-translational modifications (PTMs)
PTM definition
A post-translational modification (PTM) is a modification of a polypeptide chain that involves the making or breaking of covalent bond(s) and occurs during translation (cotranslational class) or after it.

PTMs influence or even define protein function
• Phosphorylation, and possibly GlcNAcylation and S-nitrosylation, are a means of transducing extracellular signals to the inside of the cell.
• Methylation has a role in nuclear protein import.
• Lipid addition allows protein association with membranes (e.g. GPI-anchor, myristate, palmitate).
• Intrachain disulfide bonds and N-glycosylation influence protein folding.
• Interchain disulfide bonds bind subunits together.
• Other PTMs are directly involved in protein function, for example the binding of cofactors (e.g. pyridoxal phosphate), or the synthesis of a cofactor by modification of amino acids present in the protein (e.g. quinones).

PTM variety
[Figure] Modifications mapped onto the 20 amino acids (Gly, Ala, Val, Leu, Ile, Lys, Arg, His, Asp, Glu, Asn, Gln, Cys, Ser, Thr, Met, Pro, Phe, Tyr, Trp):
• Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar
• N-terminal modifications: acetylation, methylation, acylation, crosslinks
• C-terminal modifications: GPI, amidation, crosslinks, methylation
Legend: black: cytoplasmic modifications; dark grey: both cytoplasmic and extracellular modifications, depending on the exact type; light grey: extracellular modifications
283 different protein modifications are annotated in UniProtKB/Swiss-Prot…
Each protein can be modified at … sites, which gives a very high number of ‘alternative’ peptides.

Large scale experiments (LSE) for PTMs!
• PTM information can now be obtained from the results of proteomics large scale experiments (LSE);
• In the past 12 months we have added about 6,000 experimental PTMs using data originating from some of these projects.

Proteomic studies have led to the updating of 2,767 human Swiss-Prot entries, mainly with PTM information (UniProt release 10.0, March 2007)
Phosphorylation (83%); Glycosylation (9%); Subcellular location (4%); Other PTMs (4%)

Bacteria and Archaea (HAMAP)

In 2006, ≈130 new bacterial and archaeal genomes (not WGS) were submitted to the DNA databases;
If on "average" 4,000 proteins/genome => 500,000 proteins!
How to cope?

HAMAP: High quality Automated and Manual Annotation of microbial Proteomes
Lots of microbial genomes, lots of proteins. What should we do with them in UniProt?

http://www.expasy.org/unirule/MF_00319
Automatic annotation of proteins belonging to specified families (1)
• This program requires the continuous development and adaptation of software tools as well as the development of a database of annotation rules for each family (so far about 1,400).

• Allows us to annotate automatically, yet with a very high level of quality, proteins that belong to well defined protein families;
• Can be applied both to characterized proteins and to some UPFs (Uncharacterized Protein Families);
• The families are based on UniProtKB/Swiss-Prot entries, so we first do all the annotation steps described earlier!
Auchincloss UniProtKB and ExPASy Tunis, March 2007 /www.expasy.ch/sprot/hamap/ Using HAMAP, we can currently annotate to Swiss-Prot quality level between 10% to 50% of a complete microbial proteome (next step: HAMAP for Fungi…) Updates • DNA sequence archives – EMBL/GenBank/DDBJ is an archive • All submitted data goes into the archive • Submitters are responsible for the submitted sequences and the accompanying annotation • Nobody else can change them (including the curators at EMBL/GenBank/DDBJ) • Protein sequence databases – UniPRotKB/Swiss-Prot is NOT an archive • Swiss-Prot chooses what goes into the database and where to place it • Swiss-Prot updates annotation and sequences when necessary A. Auchincloss UniProtKB and ExPASy Tunis, March 2007 **ZB SYP, 28-NOV-2003; ALB, 16-NOV-2004; MIM, 31-Jan-2006; **ZB BER, 13-FEB-2006; LYG, 14-JUN-2006; LYG, 21-SEP-2006; **ZB CHH, 05-DEC-2006; A. Auchincloss UniProtKB and ExPASy Tunis, March 2007 User updates or annotation requests A. Auchincloss UniProtKB and ExPASy Tunis, March 2007 Accessing & Searching UniProtKB Direct access (keyword search) • New search tool – we’ll use it later • Sequence Retrieval System (SRS, Europe), will disappear • Entrez (NCBI, USA) – UniProtKB/Swiss-Prot (not TrEMBL) is integrated in GenPept, but with a changed format, and with some information (e.g. 
implicit cross-references) is missing • Query tools on ExPASy & UniProt (http://www.expasy.org/sprot/, http://www.uniprot.org) Indirect access (sequence search) • Bioinformatics & sequence analysis tools (Blast, Fasta, GCG, Emboss, MS Identification tools…) Downloading the UniProt Knowledgebase http://www.expasy.org/sprot/download.html • Swiss-Prot and TrEMBL form a complete, non-redundant database, the UniProt Knowledgebase • Can be downloaded from ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase • In “Swiss-Prot” format, fasta or xml format • Complemented by sequences of alternative splice isoforms • “everything” about “ all” proteins! (at least all CDS submitted to the public nucleotide sequence databases) A. Auchincloss UniProtKB and ExPASy Tunis, March 2007 If you want to develop tools to work with your local copy of UniProtKB: Swissknife – a PERL parser for UniProtKB Constantly updated according to latest format changes Advantage: you do not need to know how exactly the information is stored in the flat file • http://swissknife.sourceforge.net/ • ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/ A. Auchincloss UniProtKB and ExPASy Tunis, March 2007 Take home message • Swiss-Prot is the non redundant, manually annotated and highly cross-referenced section of the UniProt Knowledgebase • Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot – Computer vs. Human – Redundant vs. Non-redundant • Always cite the Accession number, not the entry name – The AC is stable – The entry name might change We need your feedback and your expertise! [email protected] http://www.expasy.org/sprot/update.html (and from every UniProtKB entry page on our servers) A. Auchincloss UniProtKB and ExPASy Tunis, March 2007 The UniProt Consortium UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006). 
Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc
and soon UniMES (for Metagenomic and Environmental Sequences)

UniRef100, 90 and 50 clusters
One UniRef100 entry -> all identical sequences from UniProtKB and some sections of UniParc (including fragments, Swiss-Prot splice variants).
One UniRef90 entry -> sequences that are at least 90% identical.
One UniRef50 entry -> sequences that are at least 50% identical.

UniRef100, 90 and 50 clusters
One cluster can contain sequences of several species; clustering is done independently of the organism.
Each cluster has a “representative” or “reference” sequence, preferably that of the best-annotated Swiss-Prot entry.
UniRef identifiers are of the form UniRef100_P99999, UniRef50_P00414. They are not stable, as clusters are recomputed with every biweekly release, and cluster representatives can change!
UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.

Implicit cross-link UniProtKB to UniRef: new web view

UniParc – the UniProt Archive
• 8.8 million sequences
• Sequences and cross-references (AC numbers)
• A comprehensive collection of the raw protein sequences in public databases (including those not submitted to the DNA databases): Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• UniParc can be used to track sequence versions
Use with extreme caution: also contains pseudogenes, incorrect CDS predictions, etc., and is highly redundant!

UniParc tracks a protein sequence and its integration in various databases
http://www.pir.uniprot.org/cgi-bin/textSearch_AR
Patent data

UniParc entry UPI0000033477 part 2
TrEMBL entry probably to be merged into Swiss-Prot
TrEMBL entry was merged into Swiss-Prot

www.expasy.ch/prosite

PROSITE
A database of protein families and domains using two kinds of motif descriptors:
Patterns or regular expressions:
• User friendly (easy to understand and to use)
• Well designed for the detection of biologically meaningful sites such as residues playing a structural or functional role
• Can be used to scan a protein database in reasonable time on any computer
Generalized profiles or weight matrices:
• Well adapted to cover the full length of the protein or domain
• Are able to detect highly divergent families or domains with only a few well conserved positions

Identification of protein domains and families
• There are two non-exclusive approaches for the determination of the function of an uncharacterized protein:
– Comparison with a complete sequence database (BLAST)
– Scanning a database of patterns and profiles
• Most proteins can be grouped into families.
Proteins belonging to a particular family share functional attributes and are derived from a common ancestor;
• Some regions in the sequence are more conserved than others during evolution because they are important for the function or the structure of the protein;
• Like fingerprints for police identification, signatures built out of sequence patterns or profiles can be used to formulate hypotheses about the function of uncharacterized proteins.

Definitions of conserved regions
Conserved regions can be classified into 5 different groups:
• Families: proteins that have the same domain arrangement, whether one or many domains.
• Domains: specific combinations of secondary structures that assume characteristic three-dimensional structures or folds.
• Repeats: structural units always found in two or more copies that assemble into a specific fold. Assemblies of repeats might also be thought of as domains.
• Motifs: short regions with conserved active or binding sites that usually adopt a folded conformation only in association with their ligands.
• Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues)

Conserved regions (2)
CSA_PPIASE
Binding cleft (motif)
Cys 181: active site residue
PPID family: 1 CSA_PPIASE domain + 3 TPR repeats

http://www.expasy.org/tools/scanprosite/

Functionally and structurally relevant residues in PROSITE motif descriptors
A new concept to extract more information from profiles
Principle:
• Combining the advantages of profiles (high sensitivity) and patterns (position-specific information)
• Tagging of amino acids at precise positions in the profile and checking their presence in the matched sequence
ProRule
Aim:
• Provide users with biologically meaningful functional and structural information: active sites, post-translational modification sites, binding sites, disulfide bonds, transmembrane regions.
• Help the UniProtKB/Swiss-Prot annotation and provide enhanced homogeneity: domain name and boundaries, keywords and linked GO terms, EC numbers, false negative PROSITE patterns.

www.expasy.ch/prosite/prorule.html
Sigrist et al.: Bioinformatics 21:4060-4066(2005)

Other methods for protein/domain identification
Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: Hidden Markov Models (HMM), probabilistic models;
PRINTS: “unweighted” matrices, protein fingerprints;
BLOCKS: weight matrix derived from ungapped alignments;
PIRSF, SUPERFAMILY: classification system based on evolutionary relationship of whole proteins;
ProDom: automatic compilation of homologous domains based on recursive PSI-BLAST searches.

The InterPro project
www.ebi.ac.uk/interpro
Integrated Documentation Resource of Protein Families, Domains and Functional Sites

The InterPro project
www.ebi.ac.uk/interpro
• Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites in 2000;
• Joint effort in creating a unified yet methodologically diverse system for protein family/domain identification;
• Single set of “documents” linked to the various methods;
• Distributed with tools by anonymous FTP and through www servers;
• Used to enhance the functional annotation of UniProtKB (Swiss-Prot and TrEMBL)
• Has progressively incorporated other databases
Current status of InterPro
Release 14.1 (February 2007) was built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based SUPERFAMILY, Gene3D and PANTHER, and the current UniProt/Swiss-Prot + TrEMBL data (for details see http://www.ebi.ac.uk/interpro/release_notes.html).
InterPro release 14.1 contains 13,953 entries, representing 3,911 domains, 9,610 families, 232 repeats, 34 active sites, 20 binding sites and 19 post-translational modification sites.
Overall, there are 15,880,845 InterPro hits from 3,100,874 UniProtKB protein sequences.
92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences have one or more InterPro hits.

http://www.ebi.ac.uk/interpro/

http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304

InterPro: Graphical domain representation

http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeID=25

http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18

The ExPASy www server
• First molecular biology server on the Web (August 1993); ~500 million accesses since;
• Dedicated to proteomics:
– Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.;
– Many 2D/MS protein identification/characterization and sequence analysis tools;
• Mirror sites in Australia, Brazil, Canada, China and Korea: http://{au|br|ca|cn|kr|www}.expasy.org
ExPASy software tools
• Tools for the display and management of databases (NiceProt, Swiss-Shop sequence alerting system, etc.);
• Tools for sequence analysis (ScanProsite, ProtParam, ProtScale, RandSeq, Translate, etc.);
• Proteomics tools (AACompIdent, FindMod, FindPept, Aldente, PeptideMass, TagIdent, etc.);
• 3D-structure analysis and display tools (SwissModel, Swiss-PDBviewer)

http://www.expasy.org/tools/
Identification: Aldente, TagIdent, AACompIdent, MultiIdent
Characterization: FindMod, GlycoMod, FindPept
Analysis: PeptideMass, GlycanMass, BioGraph, PeptideCutter, ProtScale, ProtParam
- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyper-links between tools and databases

http://www.expasy.org/links.html

Finding out about recent developments:
UniProtKB/Swiss-Prot recent format changes: http://www.expasy.org/sprot/relnotes/sp_news.html
UniProtKB/Swiss-Prot planned format changes: http://www.expasy.org/sprot/relnotes/sp_soon.html
Subscribe to the electronic Swiss-Flash bulletins: http://www.expasy.org/swiss-flash/
What’s new on ExPASy: http://www.expasy.org/history.html

References (1)
UniProtKB/Swiss-Prot: http://www.expasy.org/sprot/sprot-ref.html
Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-191(2006).
Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies 328:882-899(2005).
Bairoch A. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5:39-55(2004).
Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Proteomics 4:1537-1550(2004).
Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein database. Curr. Issues Mol. Biol. 3:47-55(2001).

References (2)
PROSITE:
Hulo N. et al. The PROSITE database. Nucleic Acids Res. 34:D227-D230(2006).
Sigrist C.J.A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3:265-274(2002).
Gattiker A. et al. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 1:107-108(2002).
Sigrist C.J.A. et al. ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-4066(2005).
ExPASy:
Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31:3784-3788(2003).

Useful general publications
• Nucleic Acids Res. Database issue 2006, vol. 34, supplement 1: http://nar.oupjournals.org/content/vol34/suppl_1/
• Nucleic Acids Res. Web server issue 2005, vol. 33, supplement 2: http://nar.oupjournals.org/content/vol33/suppl_2/
• Book: Bioinformatics for Dummies, by J.-M. Claverie and C. Notredame. Publisher: For Dummies; 2nd edition (December, 2006). ISBN: 0764516965

Take home message
• We need your feedback!
[email protected]
Or via the website

Before the introduction to Swiss-Prot/ExPASy…
After the introduction to Swiss-Prot/ExPASy…

Some practical exercises: http://education.expasy.org/cours/Tunis/
1. Finding databases
2. Comparing protein databases
3. Comparing BLAST programs
4. BLAST output
5. Bacterial start sites
6. UniRef
7. Different views of UniProtKB
8. Environmental sequences
9. Inter-database links & PROSITE
10. InterPro
11. Using UniProtKB/Swiss-Prot to create datasets
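
A note for the PROSITE exercise: the PROSITE slide describes patterns as "regular expressions", and it can help to see that correspondence written out. The following is a minimal sketch in Python, not the ScanProsite implementation. The function names prosite_to_regex and scan are hypothetical, the example sequence is made up, and only the basic pattern elements are handled (x, [..], {..}, repetition counts, and the < and > terminus anchors outside brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Illustrative only: handles x, [ABC], {ABC}, (n) / (n,m) repeats,
    and '<' / '>' anchors placed outside brackets.
    """
    pattern = pattern.strip().rstrip(".")  # patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):        # '<' anchors at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # '>' anchors at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an element such as x(2,4) into its core and repeat counts.
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                    # x = any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {P} = anything except P
        # [ST] and single letters are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern: str, sequence: str):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Asn, then anything but Pro, then Ser or Thr, then anything but Pro.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("N-{P}-[ST]-{P}.", "AANGSAAANPSA"))
```

Such short patterns match often by chance, which is one reason the slides contrast patterns with the more sensitive generalized profiles; a match from a sketch like this should be read as a hypothesis, not as an annotation.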