Download D - Protein Information Resource

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Proteasome wikipedia , lookup

Ribosomally synthesized and post-translationally modified peptides wikipedia , lookup

Immunoprecipitation wikipedia , lookup

Histone acetylation and deacetylation wikipedia , lookup

Molecular evolution wikipedia , lookup

Index of biochemistry articles wikipedia , lookup

Gene regulatory network wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Protein wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

List of types of proteins wikipedia , lookup

Gene expression wikipedia , lookup

Gene nomenclature wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Magnesium transporter wikipedia , lookup

Homology modeling wikipedia , lookup

QPNC-PAGE wikipedia , lookup

Protein domain wikipedia , lookup

Protein design wikipedia , lookup

Expression vector wikipedia , lookup

Phosphorylation wikipedia , lookup

Protein folding wikipedia , lookup

Protein structure prediction wikipedia , lookup

Protein (nutrient) wikipedia , lookup

Protein moonlighting wikipedia , lookup

Interactome wikipedia , lookup

Western blot wikipedia , lookup

Protein purification wikipedia , lookup

Protein adsorption wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Transcript
Text Mining and Ontology Development at the Protein Information Resource (PIR)
Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center, Washington, DC 20007
The Protein Information Resource (PIR, http://pir.georgetown.edu/) is an integrated protein bioinformatics
resource for genomic and proteomic research. For the past two decades, PIR has focused its activities on protein
and proteomics informatics research and development by providing protein databases (e.g. UniProt), functional
data integration and analysis tools, and other bioinformatics infrastructure. In the past few years, PIR’s work is
expanded to emerging areas such as biomedical text mining, cancer biomedical informatics grid development, and
medical informatics for translational research. Our interests in text mining are primarily due to the fact that
literature-based manual curation of protein databases, while critical to the database quality, cannot keep up with
fast-growing literature as well as sequence data. Collaborating with other groups, PIR has been working in several
text mining areas, including entity recognition, information retrieval and extraction, and ontology development.
I. Development of literature mining resource for NLP research and tool development
PIR developed a literature mining resource – iProLINK (integrated Protein Literature INformation and
Knowledge) (http://pir.georgetown.edu/iprolink/) aiming to provide annotated literature data sets for developing
new text mining algorithms, such as protein named entity recognition, text categorization, and protein annotation
extraction, and protein ontology development. iProLINK also provides text mining tools developed using its
annotated literature data sets, such as protein phosphorylation information extraction tool RLIMS-P. iProLINK thus
serves as a knowledge link bridging protein databases and literature database such as PubMed.
II. Protein named entity recognition
Due to the lack of name standardization, gene/protein names described in literature and annotated in bioinformatics
databases may vary greatly. As part of entity tagging program (BioTagger), we developed a gene/protein name
thesaurus, BioThesaurus, which maps a comprehensive collection of gene/protein names from over 35 biological
databases, including the major gene (e.g., Entrez Gene) and protein databases (e.g. UniProt), and the model
organism databases (e.g. MGD, SGD). BioThesaurus currently contains 5.8 million gene/protein names and are
associated with over 4.6 million protein entries in UniProt Knowledgebase (UniProtKB). An online BioThesaurus
is developed to provide online query for finding gene/protein synonyms and for solving name ambiguities for
names shared by more than one protein entry (http://pir.georgetown.edu/iprolink/biothesaurus/).
III. Information retrieval and extraction
Biological document classification based on specific topics, such as protein-protein interactions, protein
modifications, or pathogenesis-related microbial proteins, is important in analyzing large-scale biomedical
literature and for database curation. A new text classification algorithm using discriminative substrings as attributes
was developed based on literature data sets about five types of protein post-translational modifications,
glycosylation, phosphorylation, acetylation, methylation and hydroxylation, provided in iProLINK. The algorithm
showed that Naive Bayes and Support Vector Machine classifiers perform consistently better (AUC 0.92-0.97)
when using the substring-based attribute selection than when using attributes obtained by the Porter stemmer
algorithm (AUC 0.86-0.93). The algorithm is highly effective even when the training data set is relatively small.
Protein phosphorylation is a fundamental molecular event essential to cellular processes. We developed and
benchmarked a rule-based text mining system, RLIMS-P, for extracting protein phosphorylation information from
MEDLINE abstracts regarding three objects—the protein kinase, the protein substrate, and the phosphorylation
sites, with high precision and recall. A web-based version of the RLIMS-P for mining of protein phosphorylation
information from MEDLINE abstracts (http://pir.georgetown.edu/iprolink/rlimsp/) was recently developed. The
online RLIMS-P is designed for easy accessibility and with enhanced functionality, including evidence tagging for
extracted objects in abstracts, mapping of phosphorylated proteins to the UniProtKB entry based on PubMed ID
and/or protein names using BioThesaurus. Online RLIMS-P facilitates biological studies of phosphorylated
proteins and the database annotation of phosphorylation features.
IV. Protein Ontology development
We recently designed a PRotein Ontology (PRO) framework to facilitate protein annotation and to guide new
experimental design (http://pir.georgetown.edu/pro/). PRO is an ontology of protein entities and relationships in
the OBO Foundry framework. Its components extend from the evolutionary relationships of protein classes of
structural domains, sequence domains, and whole proteins (ProEvo—Protein Evolution ontology), to the
representation of multiple protein forms of genes generated by genetic variation, alternative splicing, proteolytic
cleavage, and other post-translational modifications (ProForm—Protein Form ontology). PRO is designed to assist
assignment of protein annotations (properties such as molecular functions) to specific protein forms of a gene
product, and to protein classes at different evolutionary levels in a formal ontological framework. Thus, PRO
allows for accurate association of its ProForm nodes to specific objects in other databases (e.g. Reactome pathway)
or ontologies (e.g. Sequence Ontology). PRO also facilitates the experimental design based on knowledge transfer
between, for example, a mouse protein form and its human ortholog (or even to suggest that such an ortholog might
exist) via the connection between ProEvo and ProForm.
V. Ongoing TATRC funded work on Pathogen Mining: Literature Mining of Pathogenesis-Related Proteins in
Human Pathogens for Database text mining (2007-2009)
The objective of this project is to develop an automated text mining system to facilitate literature-based curation of
pathogenesis-related proteins in pathogens of military relevance, including pathogens of importance for diseases of
the developing world and biodefense. We will develop a text-mining system to identify papers and extract
information on pathogenicity and host-pathogen interactions, and to integrate text-mining into the UniProt
Knowledge curation system. This project will build upon our technological framework that includes text mining
algorithms, bibliography mapping, and database curation platform. Coupling the curated pathogenesis information
from the scientific literature with the rich protein information and integrated genomics and proteomics data in
UniProt will greatly improve our basic understanding of virulence and pathogenicity factors and host-interacting
proteins of the human pathogens. Such knowledge will facilitate the development of preventative and therapeutic
strategies against human pathogens.
LITERATURE MINING RELATED REFERENCES:
Named entity recognition:
1. Torii M, Hu ZZ, Song M, Wu CH, Liu H (2007) A Comparison Study on Algorithms for Detecting Biomedical
Short Form Definition. BMC Bioinformatics (in press)
2. Liu H, Torii M, Hu ZZ, and Wu CH (2007) Gene Mention and Gene Normalization Based on Machine
Learning and Online Resources. In Proc. of BioCreAtIvE II
3. Liu H, Hu ZZ, Wu CH (2006). BioThesaurus: a web-based thesaurus of protein and gene names.
Bioinformatics 22:103-105.
4. Liu H, Hu ZZ, Torii M, Wu C, Friedman C. (2006) Quantitative Assessment of Dictionary-based Protein
Named Entity Tagging. J Am Med Inform Assoc 13:497-507.
5. Mani I, Hu Z, Jang SB, Samuel K, Krause M, Phillips J, Wu CH (2005). Protein name tagging guidelines:
lessons learned. Comparative and Functional Genomics 6(1-2): 72-76.
Information retrieval and extraction:
6. Han B, Obradovic Z, Hu ZZ, Wu CH, Vucetic S. (2006) Substring selection for biomedical document
classification. Bioinformatics 22:2136-42
7. Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, and Wu CH. (2006)
An Online Literature Mining Tool for Protein Phosphorylation. Bioinformatics 22:1668-9.
8. Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH (2005). Literature Mining and Database
Annotation of Protein Phosphorylation Using a Rule-Based System. Bioinformatics 21(11): 2759-2765.
Ontology development:
9. Natale DA, Arighi CN, Barker W, Blake J, Chang T, Hu ZZ, Liu H, Smith B, Wu CH. (2007) Framework for a
Protein Ontology. BMC Bioinformatics, 8(Suppl 9):S1
10. H. Liu, Z. Hu, and C Wu (2005) DynGO: A tool for navigation and visualization of Gene Ontology resources.
BMC Bioinformatics, BioMed Central Ltd, 2005, 6:201 (online journal).
Annotated literature resource:
11. Hu ZZ, Mani I, Hermoso V, Liu H and Wu CH (2004). iProLINK: an integrated protein resource for literature
mining. Comput Biol Chem 28: 409-416.