Genomics research paper presentation

Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature
Hans-Michael Müller, Eimear E. Kenny, Paul W. Sternberg*
Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, United States of America
Presented by: Saghan Mudbhari

Introduction
In this research, the authors build an ontology to support text mining of research papers published in the domain of genomics.
Definition of ontology: an ontology is a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts.
Source of diagram: http://slidewiki.org/print/deck/11936

Motivation
Suppose someone wants to know what role the gene "lin-12" plays in the anchor cell. They would type lin-12 anchor cell as the search query.
But if they want to know which genes are responsible for the functions of the anchor cell, they may not be able to list every such gene. Instead, the generic word "gene" and the term "anchor cell" need to be posed as the query.
So we need an ontology that records which terms belong to the concept "gene", so that a query on the generic concept returns relevant results.

Contributions
Creation of the ontology
The ontology was built from 3,800 papers on Caenorhabditis elegans.
It uses the Gene Ontology (GO) as a reference for creating categories. 30 of the 33 categories the authors created are also present in GO.
The natural language that researchers in the field use to describe relationships forms additional categories (for example, "expressed", "lineage", "bound", "required for").
WormBase and PubMed/NCBI are also used to populate the ontology with lists of terms.
Searchability of full text
Recall for keyword search is ~94% on full text, compared to ~44% when only abstracts are searched.

Main idea
Textpresso splits papers into sentences, and sentences into words or phrases.
Each word or phrase is labelled with an XML tag for one of 33 categories.
Regular expressions are used to label words into categories.
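The splitting and labelling steps above can be illustrated with a minimal Python sketch. It is not the Textpresso implementation: the three category names, their patterns, the XML tag names, and the helper functions `split_sentences`, `tag_sentence`, and `tag_text` are all hypothetical, chosen only to show how regular expressions can assign category tags to terms in a sentence.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical category patterns for illustration only; the real system
# uses far more expressions and 33 categories, none reproduced here.
CATEGORY_PATTERNS = {
    "gene": r"\b[a-z]{3,4}-\d+\b",           # e.g. lin-12
    "cell": r"\banchor cell\b",
    "regulation": r"\b(?:required for|regulates|expressed)\b",
}

# One combined pattern with a named group per category, so that
# match.lastgroup tells us which category matched.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in CATEGORY_PATTERNS.items()),
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    This would wrongly split abbreviations such as 'C. elegans'.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Wrap each matched term in an XML element named after its category."""
    parts = []
    last = 0
    for m in COMBINED.finditer(sentence):
        parts.append(escape(sentence[last:m.start()]))
        parts.append(f"<{m.lastgroup}>{escape(m.group())}</{m.lastgroup}>")
        last = m.end()
    parts.append(escape(sentence[last:]))
    return "".join(parts)


def tag_text(text):
    """Split text into sentences and tag each one."""
    return [tag_sentence(s) for s in split_sentences(text)]


if __name__ == "__main__":
    for line in tag_text("lin-12 is required for anchor cell fate. It regulates glp-1."):
        print(line)
```

Combining the patterns into one expression with named groups keeps the sketch to a single pass per sentence. A lexicon with thousands of expressions would likely need a different organisation, which this sketch does not attempt.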
14,500 regular expressions were created.
The keywords and tags in the corpus are indexed to make searching the database fast.

33 categories
Query: (Gene) (Regulation/Association category) (Gene)

Thank you!