Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology

In-Silico Analysis of Proteins: Celebrating the 20th anniversary of Swiss-Prot
Fortaleza, Brazil, August 4, 2006

Cathy H. Wu, Ph.D.
Director, Protein Information Resource
Professor, Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Wu CH, Zhao S, Chen HL. (1996) A protein class database organized with PROSITE protein groups and PIR superfamilies. Journal of Computational Biology, 3 (4), 547-562.

Protein Information Resource (PIR)
Integrated Protein Informatics Resource for Genomic/Proteomic Research
http://pir.georgetown.edu
• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Database: Data Integration and Protein Mapping
• iProLINK Literature Mining Resource: Annotation Extraction
• Other Projects: NIAID Proteomics, caBIG Grid-Enablement

PIR Protein Sequence Database
• The PIR-International Protein Sequence Database (PIR-PSD) grew out of the Atlas of Protein Sequence and Structure (1965-1978), Vol 1-5, Suppl 1-3.
• Margaret Dayhoff collected all the known protein sequences to study protein evolution. The first Atlas contained 65 proteins; the final volume had 1081 proteins.
• The PIR-PSD was produced from 1984 (Release 1, 2900 proteins) to 2004 (Release 80, 283,416 proteins).
• PIR-PSD has been integrated with UniProt since 2002.
[Figure: number of sequences (0 to 300,000) plotted against PIR-PSD release number (1 to 76), with "Joined UniProt (Jan 2002)" marked]

UniProt Activities at PIR
• Integration of PIR-PSD into UniProtKB
  • Incorporation of unique PIR entries
  • Incorporation of PIR annotations: references, experimental features with literature evidence tag
• Functional annotation of UniProtKB proteins
  • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins
  • Development of rule-based annotation system & PIRNR (name rule) / PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site feature)
• Production of UniRef100/90/50 databases
• Creation of UniProt web site and help system => Unified UniProt web site & user community interaction

PIRSF Classification System
Protein Classification and Functional Annotation
• PIRSF: Evolutionary relationships of proteins from super- to sub-families
• Curated families with name rules and site rules
• Curation platform with classification/visualization tools
• Dissemination: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform

[Figure: PIRSF classification hierarchy. Levels and example entries follow.]
Domain Superfamily
• One common Pfam domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization
Example entries:
• PF02735: Ku70/Ku80 beta-barrel domain
• PIRSF800001: Ku70/80 autoantigen
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF006493: Ku, prokaryotic type
• PF00219: Insulin-like growth factor binding protein (IGFBP)
• PIRSF001969: IGFBP
• PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6
• PIRSF018239: IGFBP-related protein, MAC25 type
• PIRSF017318: CM of AroQ class, eukaryotic type
• PIRSF001501: CM of AroQ class, prokaryotic type

iProClass Integrated Protein Database
Data Integration and Protein Mapping
• Data integration from >90 databases
• Underlying data warehouse for protein ID/name/bibliography mapping & pre-computed BLAST results
• Integration of protein family, function, structure for functional annotation
• Rich link (link + summary) for value-added reports of UniProt proteins
[Figure: iProClass Integrated Protein Knowledgebase, linking UniProt entries to external databases. Category labels: Protein Sequence, Family, Structure, Function/Pathway, Gene/Genome, Gene Expression, Protein Expression, Modification, Interaction, Disease/Variation, Ontology, Taxonomy, Literature, Additional Refs, NCBI X-Refs. Databases shown: UniProt, UniRef, UniParc, RefSeq, GenPept, PIRSF, InterPro, Pfam, Prosite, COG, PDB, SCOP, CATH, PDBSum, MMDB, EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, GEO, GXD, ArrayExpress, CleanEx, SOURCE, Swiss-2DPAGE, PMG, OMIM, HapMap, RESID, PhosphoBase, DIP, BIND, GO, NCBI Taxon, NEWT, PubMed. Report sections shown: EC, KEGG Pathway, Structure, Homolog, PTM.]

iProLINK Text Mining Resource
Annotation Extraction and Literature-Based Protein Annotation
• Curated datasets and literature corpus for development of literature mining and annotation extraction tools
• RLIMS-P text-mining tool for extracting protein phosphorylation data
• BioThesaurus of gene/protein names to resolve synonym and ambiguity

[Figure: iProLINK (integrated Protein Literature, INformation and Knowledge), http://pir.georgetown.edu/iprolink. Components follow.]
Bibliography Display
• Mapping of PubMed IDs to Proteins
• Papers Categorized by Annotations
Literature Corpus
• Mapping to Proteins/Features
• Annotation-Tagged
• Name-Tagged
Dictionary and Ontology
• Protein Names and Synonyms
• PIRSF Family Names in DAG
Guidelines
• Protein/Family Naming Guidelines
• Name Tagging Guidelines
NLP Text Mining Research: Bibliography Mapping, Text Categorization, Annotation Extraction, Named Entity Recognition
Literature-Based Curation; Literature Mining & Protein Curation
Bibliography: PubMed
Databases: UniProt, PIRSF, iProClass, GO

NIAID Biodefense Proteomic Program
Goals
• Characterize proteomes of pathogens and host cells
• Identify proteins associated with the biology of the microbes
• Elucidate mechanisms of microbial pathogenesis
• Understand immune responses and non-immune mediated host responses
[Figure: Data Integration at NIAID Admin Center. Multiple data types from Proteomics Research Centers (table labels: Adm Ctr, PRC, Data Type, Organism); Data Exchange Format, Controlled Vocabulary, Ontology; Protein ID and Peptide/Protein Sequence Mapping; Master Protein Directory & Complete Proteomes at GU-PIR (PIRSF, iProClass, UniProt); Integrated Data at VBI; http://pir.georgetown.edu/proteomics/]
• Rich annotation: capture experimental data and scientific conclusion; integrate with major databases

NCI caBIG Initiative
caBIG (cancer Biomedical Informatics Grid)
• Cancer research platform to enable sharing of research infrastructure, data, tools
• Designed and built by an open federation of organizations
• Based on common standards and open source/open access principles
One of four caBIG grid reference projects
• PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research
caBIG Workspaces
• Integrative Cancer Research
  • PIR Developer Project: Grid Enablement of PIR
  • PIR Adopter Project: SEED Genome Annotation
• caGrid Architecture
  • PIR Adopter Project: GeneConnect ID mapping
• Vocabularies and Common Data Elements
  • PIR Participant Project: Protein models, objects, vocabularies, ontologies

UniProt Knowledgebase: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function
Family Classification-Driven and Rule-Based Curation
• Functional inference of uncharacterized hypothetical proteins
• Systematic detection and correction of genome annotation errors
• Improvement of under- or over-annotated proteins
Text Mining-Assisted and Literature-Based Curation
• Annotation extraction from scientific literature
• Attribution of experimental evidence
Ontology and Controlled Vocabulary-Based Curation
• Standardization of protein/gene/family names and annotation terms
• Annotation of specific protein entities

PIR Superfamily Classification
Tree of Life and Evolution of Protein Families (Dayhoff)
• The protein superfamily concept (1976) was based on sequence similarity: sequences were categorized into superfamilies, families, subfamilies, and entries using different % identity thresholds.

PIRSF Classification System
• A network classification system from superfamily to subfamily levels to reflect the evolutionary relationships of full-length proteins and domains
• Basic unit is the homeomorphic family: full-length similarity, common domain architecture
• Provides annotation of generic biochemical and specific biological functions
• Basis for evolutionary and comparative genomics research
• Basis for accurate and consistent automated protein annotation (protein name, biochemical and biological functions, functional sites)
• Basis for standardization of protein names and development of ontology for protein evolution

[Figure: PIRSF classification hierarchy. Levels and example entries follow.]
Domain Superfamily
• One common Pfam domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization
Example entries:
• PF02735: Ku70/Ku80 beta-barrel domain
• PIRSF800001: Ku70/80 autoantigen
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF006493: Ku, prokaryotic type
• PF00219: Insulin-like growth factor binding protein (IGFBP)
• PIRSF001969: IGFBP
• PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6
• PIRSF018239: IGFBP-related protein, MAC25 type
• PF01817: Chorismate mutase (CM)
• PIRSF017318: CM of AroQ class, eukaryotic type
• PIRSF001501: CM of AroQ class, prokaryotic type
• PIRSF026640: Periplasmic CM
• PIRSF001500: Bifunctional CM/PDT (P-protein)
• PIRSF001499: Bifunctional CM/PDH (T-protein)

PIRSF Classification/Curation Workflow
1. Computational generation of homeomorphic clusters
2. Computational domain mapping and annotation of preliminary clusters
3. Automatic placement of new proteins into families
4. Computer-assisted expert analysis to define homeomorphic families
5. Family hierarchy created as needed
6. Expert annotation
7. Name rules and optional site rules created
8. Seed members to generate family HMMs
[Figure: workflow diagram. Boxes: Unclassified UniProtKB proteins; New Proteins; Unassigned Proteins; Automatic Procedure; Automatic Clustering; Uncurated Homeomorphic Clusters; Orphans; Map Domains on Clusters; Computer-assisted Manual Curation; Merge/Split Clusters; Add/Remove Members; Preliminary Homeomorphic Families; Automatic Placement; Hierarchies (Superfamilies/Subfamilies); Name, Refs, Abstract, Domain Arch.; Final Families, Subfamilies, Superfamilies; Protein Name Rules/Site Rules; Build and Test HMMs]

PIRSF Classification Tools
• Iterative BlastClust
• Tree with Annotation Table
• Multiple Alignment and Phylogenetic Tree
• PIRSF Classification in DAG Editor
[Figure: tool screenshots labelled HPS, KGPDC, Phylogenetic Tree, Alignment, Classification/Annotation]
ISMB: PIRSF Protein Classification System Demo

PIRSF Analysis/Visualization Tools
• Taxonomy Distribution and Phylogenetic Pattern
• Domain Display
• Family Hierarchy (DAG Browser)

PIRSF Family Report
• Curated family name
• Description of family
• Sequence analysis tools

Classification and Functional Annotation
Example - Phosphofructokinase (PFK): classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue.
• ATP-PFK: Gly105 + Gly125
• PPi-PFK: Gly/Asp105 + Lys125
[Figure: classification tree of PFK families (ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274), with Gly105 and Gly125 marked for E. coli (P06998)]

Family-Based Rules for Annotation
• Functional Site Rule: tags active site, binding, other residue-specific information
• Functional Name Rule: gives name, EC, GO, other function-specific information

iProLINK Literature Mining Resource
[Figure: iProLINK overview diagram, with the same components as on the iProLINK Text Mining Resource slide; http://pir.georgetown.edu/iprolink]

iProLINK Literature Mining Resource
1. UniProtKB bibliography mapping in iProClass
2. RLIMS-P: rule-based NLP method for extracting protein phosphorylation data
3. Substring-based machine learning method for PTM text categorization
4. BioThesaurus of protein/gene names with UniProtKB association
5. Entity-named tagging guide
[Figure: iProLINK overview diagram, with components numbered 1 to 5 to match the list; http://pir.georgetown.edu/iprolink]

Literature Corpus for Text Mining
• Literature survey and manual tagging for evidence attribution
• Training and benchmarking sets for information retrieval and extraction
• Protein phosphorylation data used to develop RLIMS-P for extracting phosphorylation information
• The five PTM datasets used to develop a machine learning algorithm for text categorization

Online RLIMS-P
1. Summary table: PMIDs & top-ranking annotation
2. Report: Full annotation with evidence tagging and PMID mapping to UniProtKB entry
3. Name mapping searches BioThesaurus

BioThesaurus
• Comprehensive collection of protein/gene names from 23 databases
• Associates names (~3.2 million) with UniProtKB entries (>2 million)
• Web-based searches to retrieve synonymous names, resolve ambiguous names, evaluate name coverage
• FTP download for automatic dictionary-based named entity tagging
[Figure: BioThesaurus construction. Source labels: NCBI, Genome, UniProt, iProClass, Other. Databases shown: Entrez Gene, RefSeq, GenPept, FlyBase, WormBase, MGD, SGD, RGD, UniProtKB, UniRef90/50, PIR-PSD, HUGO, EC, OMIM. Step labels: Name Extraction, Raw Thesaurus, Name Filtering (Highly Ambiguous, Nonsensical Terms), Semantic Typing, UMLS. Output: BioThesaurus, UniProtKB Entries: Protein/Gene Names & Synonyms]

Online BioThesaurus
Name ambiguity of CLIM1
1. Search protein entries sharing the same names
2. Retrieve BioThesaurus report
• Annotation error detection

BioThesaurus Report
Gene/Protein Name Mapping
1. Search Synonyms (example: synonyms for Metalloproteinase inhibitor 3)
2. Resolve Name Ambiguity (example: name ambiguity of TIMP-3)
3. Underlying ID Mapping

Protein Ontology (PRO)
PRotein Ontology (PRO) in OBO (Open Biomedical Ontologies) Framework
Two sub-ontologies:
• Ontology for Protein Evolution (ProEvo), for the classification of proteins on the basis of evolutionary relationships
• Ontology for Protein Modified Forms (ProMod), to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification)
Why PRO?
• Allow the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology
• Facilitate precise protein annotation of specific proteins/classes
The PRO prototype is illustrated using human proteins from the TGFbeta signaling pathway (http://pir.georgetown.edu/pro).

PRO Conceptual Framework
[Figure: PRO (Protein Ontology) conceptual framework. Levels, external ontologies, and relation labels follow.]
ProEvo
Root level and Unit level: evolutionary unit; domain, protein
• The two types of evolutionary units
• Not substituted by any other terms
Domain Family Level (structure): structure domain
• Related by structural similarity
• Source: SCOP Superfamily
Domain Family Level (sequence): sequence domain
• Related by sequence similarity
• Source: Pfam domain
Protein Family Level: homeomorphic protein
• Evolutionarily-related full-length protein
• May contain finer-grain sub-categories
• Sources: PIRSF family/subfamily, Panther subfamily
ProMod
Gene level: gene product
• All protein products encoded by one gene
• Source: UniProtKB
Transcript level: genetic variant, splice variant, reference protein
• Possible transcript forms
• Source: UniProtKB
Post-translation level: cleaved product, modified product
• Protein as modified after translation
• Source: UniProtKB
External ontologies and terms
• GO (Gene Ontology): molecular function, biological process, cellular component
• HGNC/MGI Gene Name: gene name
• PSI-MOD Modification: protein modification
• DO/UMLS Disease Ontology/Term: disease
Relation labels: is_a, has_part, has_function, lacks_function, has_ancestral_property, participates_in, part_of (for complexes), located_in (for compartments), encoded_by, derives_from, has_modification, agent_of

Protein Ontology (PRO)
[Figure: content not recoverable]

Acknowledgements
PIR Team
• Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan
• Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata
• Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank
Collaborators
• UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams
• NIAID: Margaret Moore (SSS), Bruno Sobral (VBI)
• Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U)
Funding Support
• NHGRI/NIGMS (UniProt)
• NCI caBIG
• NIAID (Proteomic Admin Center)
• NSF: iProClass, text mining
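
Supplementary sketch: family-based site rule for PFK
The PFK slide gives two residue patterns (ATP-PFK: Gly105 + Gly125; PPi-PFK: Gly/Asp105 + Lys125), and the Family-Based Rules slide describes site rules as tagging residue-specific information. The Python below is a minimal sketch of how such a rule might be expressed, not PIR's actual rule engine. The names `PFK_SITE_RULES` and `classify_pfk` are hypothetical, and the sketch assumes the caller has already aligned a sequence to E. coli PFK (P06998) and read off the residues at positions 105 and 125. Only the two residue patterns come from the slides.

```python
from typing import Optional

# Residue patterns taken from the PFK slide (E. coli P06998 numbering):
#   ATP-PFK: Gly105 + Gly125
#   PPi-PFK: Gly/Asp105 + Lys125
# Each entry is (label, residues allowed at 105, residues allowed at 125),
# in one-letter amino-acid codes. The table layout itself is an assumption.
PFK_SITE_RULES = [
    ("ATP-PFK", {"G"}, {"G"}),
    ("PPi-PFK", {"G", "D"}, {"K"}),
]


def classify_pfk(res105: str, res125: str) -> Optional[str]:
    """Return the PFK class implied by the two site residues, or None.

    res105 and res125 are the one-letter codes of the residues that align
    to E. coli positions 105 and 125 (the alignment step is assumed to
    have been done elsewhere).
    """
    r105, r125 = res105.upper(), res125.upper()
    for label, allowed_105, allowed_125 in PFK_SITE_RULES:
        if r105 in allowed_105 and r125 in allowed_125:
            return label
    # No rule matched: leave the protein unannotated instead of guessing.
    return None
```

For instance, `classify_pfk("G", "G")` returns "ATP-PFK" and `classify_pfk("D", "K")` returns "PPi-PFK"; any other combination returns `None`. Keeping the patterns in a table separate from the matching logic mirrors the idea of curated rules that can be edited without touching code.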