Download D - Protein Information Resource

Text Mining and Ontology Development at the Protein Information Resource (PIR) Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center, Washington, DC 20007 The Protein Information Resource (PIR, http://pir.georgetown.edu/) is an integrated protein bioinformatics resource for genomic and proteomic research. For the past two decades, PIR has focused its activities on protein and proteomics informatics research and development by providing protein databases (e.g. UniProt), functional data integration and analysis tools, and other bioinformatics infrastructure. In the past few years, PIR’s work is expanded to emerging areas such as biomedical text mining, cancer biomedical informatics grid development, and medical informatics for translational research. Our interests in text mining are primarily due to the fact that literature-based manual curation of protein databases, while critical to the database quality, cannot keep up with fast-growing literature as well as sequence data. Collaborating with other groups, PIR has been working in several text mining areas, including entity recognition, information retrieval and extraction, and ontology development. I. Development of literature mining resource for NLP research and tool development PIR developed a literature mining resource – iProLINK (integrated Protein Literature INformation and Knowledge) (http://pir.georgetown.edu/iprolink/) aiming to provide annotated literature data sets for developing new text mining algorithms, such as protein named entity recognition, text categorization, and protein annotation extraction, and protein ontology development. iProLINK also provides text mining tools developed using its annotated literature data sets, such as protein phosphorylation information extraction tool RLIMS-P. iProLINK thus serves as a knowledge link bridging protein databases and literature database such as PubMed. II. Protein named entity recognition Due to the lack of name standardization, gene/protein names described in literature and annotated in bioinformatics databases may vary greatly. As part of entity tagging program (BioTagger), we developed a gene/protein name thesaurus, BioThesaurus, which maps a comprehensive collection of gene/protein names from over 35 biological databases, including the major gene (e.g., Entrez Gene) and protein databases (e.g. UniProt), and the model organism databases (e.g. MGD, SGD). BioThesaurus currently contains 5.8 million gene/protein names and are associated with over 4.6 million protein entries in UniProt Knowledgebase (UniProtKB). An online BioThesaurus is developed to provide online query for finding gene/protein synonyms and for solving name ambiguities for names shared by more than one protein entry (http://pir.georgetown.edu/iprolink/biothesaurus/). III. Information retrieval and extraction Biological document classification based on specific topics, such as protein-protein interactions, protein modifications, or pathogenesis-related microbial proteins, is important in analyzing large-scale biomedical literature and for database curation. A new text classification algorithm using discriminative substrings as attributes was developed based on literature data sets about five types of protein post-translational modifications, glycosylation, phosphorylation, acetylation, methylation and hydroxylation, provided in iProLINK. The algorithm showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (AUC 0.92-0.97) when using the substring-based attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC 0.86-0.93). The algorithm is highly effective even when the training data set is relatively small. Protein phosphorylation is a fundamental molecular event essential to cellular processes. We developed and benchmarked a rule-based text mining system, RLIMS-P, for extracting protein phosphorylation information from MEDLINE abstracts regarding three objects—the protein kinase, the protein substrate, and the phosphorylation sites, with high precision and recall. A web-based version of the RLIMS-P for mining of protein phosphorylation information from MEDLINE abstracts (http://pir.georgetown.edu/iprolink/rlimsp/) was recently developed. The online RLIMS-P is designed for easy accessibility and with enhanced functionality, including evidence tagging for extracted objects in abstracts, mapping of phosphorylated proteins to the UniProtKB entry based on PubMed ID and/or protein names using BioThesaurus. Online RLIMS-P facilitates biological studies of phosphorylated proteins and the database annotation of phosphorylation features. IV. Protein Ontology development We recently designed a PRotein Ontology (PRO) framework to facilitate protein annotation and to guide new experimental design (http://pir.georgetown.edu/pro/). PRO is an ontology of protein entities and relationships in the OBO Foundry framework. Its components extend from the evolutionary relationships of protein classes of structural domains, sequence domains, and whole proteins (ProEvo—Protein Evolution ontology), to the representation of multiple protein forms of genes generated by genetic variation, alternative splicing, proteolytic cleavage, and other post-translational modifications (ProForm—Protein Form ontology). PRO is designed to assist assignment of protein annotations (properties such as molecular functions) to specific protein forms of a gene product, and to protein classes at different evolutionary levels in a formal ontological framework. Thus, PRO allows for accurate association of its ProForm nodes to specific objects in other databases (e.g. Reactome pathway) or ontologies (e.g. Sequence Ontology). PRO also facilitates the experimental design based on knowledge transfer between, for example, a mouse protein form and its human ortholog (or even to suggest that such an ortholog might exist) via the connection between ProEvo and ProForm. V. Ongoing TATRC funded work on Pathogen Mining: Literature Mining of Pathogenesis-Related Proteins in Human Pathogens for Database text mining (2007-2009) The objective of this project is to develop an automated text mining system to facilitate literature-based curation of pathogenesis-related proteins in pathogens of military relevance, including pathogens of importance for diseases of the developing world and biodefense. We will develop a text-mining system to identify papers and extract information on pathogenicity and host-pathogen interactions, and to integrate text-mining into the UniProt Knowledge curation system. This project will build upon our technological framework that includes text mining algorithms, bibliography mapping, and database curation platform. Coupling the curated pathogenesis information from the scientific literature with the rich protein information and integrated genomics and proteomics data in UniProt will greatly improve our basic understanding of virulence and pathogenicity factors and host-interacting proteins of the human pathogens. Such knowledge will facilitate the development of preventative and therapeutic strategies against human pathogens. LITERATURE MINING RELATED REFERENCES: Named entity recognition: 1. Torii M, Hu ZZ, Song M, Wu CH, Liu H (2007) A Comparison Study on Algorithms for Detecting Biomedical Short Form Definition. BMC Bioinformatics (in press) 2. Liu H, Torii M, Hu ZZ, and Wu CH (2007) Gene Mention and Gene Normalization Based on Machine Learning and Online Resources. In Proc. of BioCreAtIvE II 3. Liu H, Hu ZZ, Wu CH (2006). BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics 22:103-105. 4. Liu H, Hu ZZ, Torii M, Wu C, Friedman C. (2006) Quantitative Assessment of Dictionary-based Protein Named Entity Tagging. J Am Med Inform Assoc 13:497-507. 5. Mani I, Hu Z, Jang SB, Samuel K, Krause M, Phillips J, Wu CH (2005). Protein name tagging guidelines: lessons learned. Comparative and Functional Genomics 6(1-2): 72-76. Information retrieval and extraction: 6. Han B, Obradovic Z, Hu ZZ, Wu CH, Vucetic S. (2006) Substring selection for biomedical document classification. Bioinformatics 22:2136-42 7. Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, and Wu CH. (2006) An Online Literature Mining Tool for Protein Phosphorylation. Bioinformatics 22:1668-9. 8. Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH (2005). Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-Based System. Bioinformatics 21(11): 2759-2765. Ontology development: 9. Natale DA, Arighi CN, Barker W, Blake J, Chang T, Hu ZZ, Liu H, Smith B, Wu CH. (2007) Framework for a Protein Ontology. BMC Bioinformatics, 8(Suppl 9):S1 10. H. Liu, Z. Hu, and C Wu (2005) DynGO: A tool for navigation and visualization of Gene Ontology resources. BMC Bioinformatics, BioMed Central Ltd, 2005, 6:201 (online journal). Annotated literature resource: 11. Hu ZZ, Mani I, Hermoso V, Liu H and Wu CH (2004). iProLINK: an integrated protein resource for literature mining. Comput Biol Chem 28: 409-416.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download D - Protein Information Resource