Download Poster - Protein Information Resource

Literature Data Mining and Protein Ontology Development at the Protein Information Resource (PIR)

Hu ZZ (1), Mani I (2), Liu H (3), Vijay-Shanker K (4), Hermoso V (1), Nikolskaya A (1), Natale DA (1), and Wu CH (1)

(1) Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057
(2) Georgetown University, 37th and O Streets, NW, Washington, DC 20057
(3) University of Maryland at Baltimore County, Baltimore, MD 21250
(4) Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716

1 ABSTRACT

An integrated protein literature mining resource, iProLINK, has been developed at PIR to provide data sources for Natural Language Processing (NLP) research on bibliography mapping, annotation extraction, protein named-entity recognition, and protein ontology development. A rule-based text-mining system, RLIMS-P, extracts protein phosphorylation information from MEDLINE abstracts to assist database annotation. An online BioThesaurus supports protein/gene name mapping and assists with protein named-entity recognition. A protein ontology based on the PIRSF family classification is being developed to complement other ontologies.

2 INTRODUCTION

PIR: Integrated Protein Informatics Resource

As the volume of scientific literature rapidly grows, literature data mining becomes increasingly critical to facilitate genome/proteome annotation and to improve the quality of biological databases. Annotations derived from experimentally verified data in the literature are of special value to the UniProtKB (UniProt Knowledgebase). One objective of UniProtKB is to have accurate, consistent, and rich annotation of protein sequence and function. Two activities are relevant to this goal: literature-based curation, and the development and adoption of ontologies and controlled vocabularies.

• Literature-Based Curation: Extract Reliable Information from Literature
  - Protein properties: protein function, domains and sites, developmental stages, catalytic activity, binding and modified residues, regulation, induction, pathways, tissue specificity, subcellular location, quaternary structure…
  - Literature-based curation ensures high-quality, accurate, and up-to-date experimental data for each protein, but it is a major bottleneck.
• Ontologies/Controlled Vocabularies: For Information Integration and Knowledge Management
  - UniProtKB entries will be annotated using widely accepted biological ontologies and other controlled vocabularies, e.g. Gene Ontology (GO) and EC nomenclature.

The Protein Information Resource has been collaborating with several NLP research groups to develop text-mining methodologies that extract information from biological literature and to develop a protein ontology.

3 iProLINK: An integrated protein resource for literature mining

1. Bibliography mapping - UniProt mapped citations
2. Annotation extraction - annotation tagged literature
3. Protein entity recognition - dictionary, tagged literature
4. Protein ontology development - PIRSF-based ontology

• Manual tagging assisted with computational extraction
• Training sets of positive and negative samples

4 RLIMS-P

Rule-based LIterature Mining System for Protein Phosphorylation

Preprocessing
Protein Phosphorylation Annotation Extraction for Genomic/Proteomic Research

Input: Abstracts, Full-Length Texts
Pipeline components: sentence extraction, acronym detection, part-of-speech tagging; entity recognition (term recognition, semantic type classification); phrase detection (noun and verb group detection, other syntactic structure detection); relation identification (nominal level relation, verbal level relation); postprocessing
Output: Extracted Annotations, Tagged Abstracts

Phosphorylation objects and their semantic roles:
• <AGENT> Enzyme (kinase catalyzing the phosphorylation), e.g., MAP kinase
• <THEME> Substrate (protein being phosphorylated), e.g., cPLA2
• <SITE> P-Site (amino acid residue being phosphorylated), e.g., Ser505
• P-group; phosphorylated forms: phosphorylated-cPLA2, Ser-P

Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15

Benchmarking of RLIMS-P
• High recall for paper retrieval and high precision for information extraction
• Testing and Benchmarking Dataset: http://pir.georgetown.edu/iprolink/
• Reference: Bioinformatics. 2005 Jun 1;21(11):2759-65

Applications:
• UniProtKB site feature annotation
• Proteomics MS data analysis: protein identification

Online RLIMS-P text-mining tool (version 1.0): http://pir.georgetown.edu/iprolink/rlimsp/
1. Search interface
2. Summary table with top hit of all sites
3. All sites and tagged text evidence

Resources
• PIR (http://pir.georgetown.edu)
• UniProt: Central international database of protein sequence and function (http://www.uniprot.org)
• iProLINK (http://pir.georgetown.edu/iprolink/): RLIMS-P text mining tool, protein dictionaries, name tagging guideline, protein ontology

Protein Name Tagging
• Tagging guideline versions 1.0 and 2.0
  - Generation of domain expert-tagged corpora
  - Inter-coder reliability: upper bound of machine tagging
• Dictionary pre-tagging
  - F-measure: 0.412 (0.372 Precision, 0.462 Recall)
  - Advantages: helps standardize tagging and its extent, reduces annotator fatigue, and improves inter-coder reliability
• BioThesaurus for pre-tagging

5 BioThesaurus

Source databases: NCBI Genome, Entrez Gene, RefSeq, GenPept; UniProt (UniProtKB, UniRef90/50, PIR-PSD); FlyBase, WormBase, MGD, SGD, RGD; Other: HUGO, EC, OMIM
Construction steps: iProClass, Name Extraction, Raw Thesaurus, Name Filtering (Highly Ambiguous, Nonsensical Terms), Semantic Typing (UMLS), BioThesaurus v1.0
Content: UniProtKB Entries: Protein/Gene Names & Synonyms

BioThesaurus v1.0 statistics (May 2005):
• UniProtKB entries: 1.86 million
• Source DB records: 6.6 million
• Gene/protein names/terms: 3.6 million

Applications:
• Biological entity tagging
• Name mapping
• Database annotation
• Literature mining
• Gateway to other resources

Web-based BioThesaurus: http://pir.georgetown.edu/iprolink/biothesaurus/ (Liu et al, 2005, submitted)
Gene/Protein Name Mapping:
1. Search Synonyms
2. Resolve Name Ambiguity
3. Underlying ID Mapping

BioThesaurus report: UniProtKB entry P35625
Example 1. Name ambiguity of TIMP3
Example 2. Name ambiguity of CLIM1

6 PIRSF-Based Protein Ontology

• PIRSF family hierarchy based on evolutionary relationships
• Standardized PIRSF family names as hierarchical protein ontology
• DAG Network structure for PIRSF family classification system
• PIRSF in DAG View; DAG file: ftp://ftp.pir.georgetown.edu/pir_databases/pirsf/dagfiles/

PIRSF to GO Mapping
• Mapped 5363 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
  - 68% of the PIRSF families and subfamilies map to GO leaf nodes
  - 2329 PIRSFs have shared GO leaf nodes
• Complements GO: PIRSF-based ontology can be used to analyze GO branches and concepts and to provide links between the GO sub-ontologies
• DynGO viewer: superimposes GO and PIRSF hierarchies, with bidirectional display (GO- or PIRSF-centric views) (Liu et al, 2005, submitted)

Protein Ontology Can Complement GO

Exploration of Gene and Protein Ontology. Two cases: analyze GO branches and concepts and identify missing GO nodes.
• Case I. Nuclear receptor superfamily. Systematic links between three GO sub-ontologies based on the shared annotations at different protein family levels, e.g., linking molecular function and biological process for Estrogen receptor alpha (PIRSF50001):
  - estrogen receptor binding and
  - estrogen receptor signaling pathway
• Case II. IGF-binding protein superfamily. Expanding a Node: identification of GO subtrees that need expansion if GO concepts are too broad
  - IGFBP subfamilies
  - High- vs. low-affinity binding for IGF between IGFBP and IGFBPrP

PIRSF Protein Family Classification

PIRSF: A network structure from superfamilies to subfamilies to reflect evolutionary relationships of full-length proteins

Definitions
• Basic unit = Homeomorphic Family
• Homeomorphic: Full-length similarity, common domain architecture
• Network Structure: Flexible number of levels with varying degrees of sequence conservation

Classification levels
• Domain Superfamily: one common Pfam domain
• PIRSF Superfamily: 0 or more levels; one or more common domains
• PIRSF Homeomorphic Family: exactly one level; full-length sequence similarity and common domain architecture
• PIRSF Homeomorphic Subfamily: 0 or more levels; functional specialization

Example entries
• Ku: PF02735: Ku70/Ku80 beta-barrel domain; PIRSF800001: Ku70/80 autoantigen; PIRSF003033: Ku70 autoantigen; PIRSF016570: Ku80 autoantigen; PIRSF006493: Ku, prokaryotic type
• IGFBP: PF00219: Insulin-like growth factor binding protein (IGFBP); PIRSF001969: IGFBP; PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6; PIRSF018239: IGFBP-related protein, MAC25 type
• Chorismate mutase: PF01817: Chorismate mutase (CM); PF02153: Prephenate dehydrogenase (PDH); PIRSF026640: Periplasmic CM; PIRSF001500: Bifunctional CM/PDT (P-protein); PIRSF001499: Bifunctional CM/PDH (T-protein); PIRSF017318: CM of AroQ class, eukaryotic type; PIRSF001501: CM of AroQ class, prokaryotic type

7 Summary

• The PIR iProLINK literature mining resource provides annotated data sets for NLP research on annotation extraction and protein ontology development.
• The RLIMS-P text-mining tool extracts protein phosphorylation information from PubMed literature. By coupling high recall for paper retrieval with high precision for information extraction, RLIMS-P can be applied to UniProtKB protein feature annotation.
• BioThesaurus can be used to resolve name synonymy and ambiguity and to support name mapping.
• The PIRSF-based protein ontology can complement GO by identifying missing GO concepts/nodes and by providing systematic links between the three GO sub-ontologies.

8 Acknowledgements

Research Projects
• NIH: NHGRI/NIGMS/NLM/NIMH/NCRR/NIDCR (UniProt)
• NSF: SEIII (Entity Tagging)
• NSF: ITR (Ontology)

Collaborators
• I. Mani, Georgetown University Department of Linguistics: protein name recognition and protein name ontology.
• H. Liu, University of Maryland Department of Information Systems: protein name recognition and text mining.
• K. Vijay-Shanker, University of Delaware Department of Computer and Information Sciences: text mining of protein phosphorylation features.
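
To make Pattern 1 from the RLIMS-P panel concrete, the following is a minimal sketch of how the template <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)? could be approximated with a regular expression. It is not the RLIMS-P implementation: RLIMS-P operates on part-of-speech tags and detected noun/verb groups, whereas this toy version assumes that each entity is a single whitespace-free token and that sites are Ser/Thr/Tyr residues. The names PATTERN_1 and extract_phosphorylation are hypothetical.

```python
import re

# Toy approximation of Pattern 1:
#   <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
# Assumptions (not from the poster): entities are single tokens, the verb
# group is "phosphorylate(s/d)" with an optional "also", and sites are
# Ser/Thr/Tyr followed by a residue number.
PATTERN_1 = re.compile(
    r"(?P<agent>\S+)"                 # AGENT: enzyme (kinase)
    r"(?:\s+also)?"                   # optional adverb inside the verb group
    r"\s+phosphorylate[sd]?\b"        # active-voice verb group
    r"\s+(?!by\b)(?P<theme>\S+)"      # THEME: substrate; "by" signals passive voice
    r"(?:\s+(?:in|at)\s+(?P<site>(?:Ser|Thr|Tyr)[\s-]?\d+))?"  # optional SITE
)


def extract_phosphorylation(sentence):
    """Return agent/theme/site for the first Pattern 1 match, or None."""
    match = PATTERN_1.search(sentence)
    if match is None:
        return None
    site = match.group("site")
    return {
        "agent": match.group("agent"),
        "theme": match.group("theme"),
        # Normalize "Ser 15" or "Ser-15" to "Ser15"
        "site": re.sub(r"[\s-]", "", site) if site else None,
    }


if __name__ == "__main__":
    print(extract_phosphorylation("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

Applied to the poster's example sentence, the sketch fills the three roles with ATR/FRP-1 (AGENT), p53 (THEME), and Ser15 (SITE). Passive constructions such as "p53 is phosphorylated by ATR" are deliberately rejected here, since they would call for a separate pattern.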
