Genomics research paper presentation
Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature
Hans-Michael Müller, Eimear E. Kenny, Paul W. Sternberg*
Division of Biology and Howard Hughes Medical Institute, California Institute of Technology,
Pasadena, California, United States of America
Presented by: Saghan Mudbhari
Introduction
In this research, the authors build an ontology to support text mining in research papers published in the domain
of Genomics.
Definition of Ontology: An ontology is a formal representation of knowledge as a set of concepts within a domain
and the relationships between those concepts.
Source of diagram: http://slidewiki.org/print/deck/11936
Motivation
Suppose someone wants to know what role the gene “lin-12” plays in the anchor cell: they would type lin-12 anchor cell
as the search query. But if they want to know which genes are responsible for the functions of the anchor cell, they
cannot be expected to type the name of every candidate gene. Instead, the generic word ‘gene’ together with
‘anchor cell’ needs to be posed as the query. So we need an ontology that records which specific terms belong to the
concept ‘gene’, so that the search can return relevant results.
Contributions
Creation of Ontology
Created from 3,800 papers on Caenorhabditis elegans. It uses the Gene Ontology (GO) as a reference for creating
categories; 30 of the 33 categories the authors created are also present in GO.
Natural-language terms that researchers in the field use to describe relationships (for example, “expressed,”
“lineage,” “bound,” “required for”) form additional categories. WormBase and PubMed/NCBI are also used to
populate the ontology with lists of terms.
Searchability of full text
Recall for keyword search is ~94% when the full text is searched, compared to ~44% when only abstracts are searched.
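As a reminder of what the recall figures measure: recall is the fraction of all relevant items that a search actually returns. The sketch below is only an illustration; the function name and the example IDs are made up and do not come from the paper.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant items that appear among the retrieved items."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example: 4 sentences are truly relevant, the search finds 2 of them.
example = recall(retrieved={1, 2, 3}, relevant={2, 3, 4, 5})
```

A full-text search can reach higher recall simply because many relevant statements never appear in the abstract.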
Main idea
Textpresso splits papers into sentences, and sentences into words or phrases. Each word or phrase is labelled with an
XML tag that assigns it to one of the 33 categories. Regular expressions do the labelling; 14,500 of them were created.
The keywords and tags in the corpus are indexed to make searching the database fast.
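The pipeline described above (split into sentences, label terms by category with regular expressions, index keywords and tags) can be sketched roughly as follows. This is a toy illustration, not Textpresso's actual code: the three categories, their patterns, and all function names are assumptions, and the sketch records labels as Python tuples rather than XML markup.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stand-in for the category lexicon (the real system has 33 categories).
CATEGORY_PATTERNS = {
    "gene": re.compile(r"\b(?:lin-12|lag-2|glp-1)\b"),
    "cell": re.compile(r"\banchor cells?\b"),
    "regulation": re.compile(r"\b(?:regulates?|represses?|activates?)\b"),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return (category, matched_text) pairs found in the sentence."""
    tags = []
    for category, pattern in CATEGORY_PATTERNS.items():
        for match in pattern.finditer(sentence):
            tags.append((category, match.group(0)))
    return tags


def build_index(text):
    """Map each category and each matched keyword to the sentence numbers containing it."""
    sentences = split_sentences(text)
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for category, term in tag_sentence(sentence):
            index[("category", category)].add(number)
            index[("keyword", term.lower())].add(number)
    return sentences, index


sentences, index = build_index(
    "lin-12 regulates lag-2 in the anchor cell. The worm was observed."
)
```

With the index in place, a lookup such as `index[("category", "gene")]` returns the matching sentence numbers without rescanning the text, which is the point of indexing both keywords and tags.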
33 categories
Query: (Gene) (Regulation/Association category) (Gene)
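A category query of this shape can be illustrated with a minimal sketch. It assumes the query means "a gene mention, then a regulation/association term, then another gene mention, in that order within one sentence"; the real system may only require that the categories co-occur. The gene names, regulation verbs, and function name below are hypothetical examples.

```python
import re

# Hypothetical term lists standing in for two of the ontology's categories.
GENE = re.compile(r"\b(?:lin-12|lag-2|glp-1)\b")
REGULATION = re.compile(r"\b(?:regulates?|represses?|activates?)\b")


def matches_query(sentence):
    """True if a gene, then a regulation term, then another gene appear in order."""
    for first_gene in GENE.finditer(sentence):
        # Look for a regulation term after the first gene mention.
        regulation = REGULATION.search(sentence, first_gene.end())
        # Then look for a second gene mention after the regulation term.
        if regulation and GENE.search(sentence, regulation.end()):
            return True
    return False


hits = [
    s
    for s in [
        "lin-12 activates lag-2 in the anchor cell.",
        "lin-12 is expressed in the anchor cell.",
    ]
    if matches_query(s)
]
```

Because the query is posed over categories rather than literal words, the user does not need to know in advance which gene names to type.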
Thank you!