Mining the Biomedical Literature for Genic Information

BioNLP ’08, June 19, 2008
Catalina O. Tudor, K. Vijay-Shanker, Carl J. Schmidt
University of Delaware
Presenter: Catalina O. Tudor

User Scenario – Groucho and PubMed

User: “Groucho? I want to know more about this gene…”
Search engine: PubMed
Result: 270 abstracts retrieved

User Scenario – Groucho and eGIFT

User: “Groucho? I want to know more about this gene…”
Web application: eGIFT (retrieves abstracts from PubMed)
Result: most relevant terms associated with the given gene

Key Terms for Groucho

Processes:
• segmentation
• neurogenesis
• embryonic development
...

Descriptors:
• enhancer
• corepressor
...

Domains:
• WD40
• eh1
• WRPW
• basic helix-loop-helix
...

Genes:
• Hairy
• AES

All sentences for Groucho containing segmentation
1. The Groucho protein interacts with Hairy-related transcription factors to regulate segmentation, neurogenesis and sex determination. (PMID 8892234)
2. The Drosophila protein Groucho is involved in embryonic segmentation and neural development, and is implicated in the Notch signal transduction pathway. (PMID 8713081)
...

What does eGIFT provide?

• Two types of users
  • Scientists trying to quickly find information about a gene
  • Annotators trying to quickly locate textual evidence describing gene functions
• Key Terms provide an overall picture about a given gene
• eGIFT allows users to identify the set of documents for a topic relevant to the gene of interest

Overall Approach of eGIFT

1. Retrieve abstracts from PubMed
   1. Background Set: all abstracts mentioning “gene” or “protein”
   2. Query Set: all abstracts mentioning a given gene
2. Refine Query Set
3. Group morphologically related words
4. Calculate term scores and identify key terms
5. Categorize key terms using controlled vocabularies
6. Link sentences and abstracts to a specific key term

Retrieve abstracts

Background Set
• all abstracts mentioning “gene” or “protein”
• (gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) AND hasabstract[text]
• 639,211 abstracts retrieved

Query Set
• all abstracts mentioning a given gene name, symbol, synonyms
• Compare information from Query Set against general information from Background Set and determine the most specific information in the Query Set
• Compare background and query frequencies of terms to identify statistically interesting cases

Refine Query Set

Query Set = all abstracts mentioning given gene

Query Set contains two types of abstracts
1. About Set
   • abstracts which focus on the given gene
2. Extra Set
   • abstracts which focus on other topics but happen to mention the gene

Heuristics for identifying an About abstract
• if given gene name occurs in title, first or last sentences
• if given gene name occurs 3+ times in abstract

Refine Query Set – About Set example

Multiple RTK pathways downregulate Groucho-mediated repression in Drosophila embryogenesis.
RTK pathways establish cell fates in a wide range of developmental processes. However, how the pathway effector MAPK coordinately regulates the expression of multiple target genes is not fully understood. We have previously shown that the EGFR RTK pathway causes phosphorylation and downregulation of Groucho, a global co-repressor that is widely used by many developmentally important repressors for silencing their various targets.
Here, we use specific antibodies that reveal the dynamics of Groucho phosphorylation by MAPK, and show that Groucho is phosphorylated in response to several RTK pathways during Drosophila embryogenesis. Focusing on the regulation of terminal patterning by the Torso RTK pathway, we demonstrate that attenuation of Groucho's repressor function via phosphorylation is essential for the transcriptional output of the pathway and for terminal cell specification. Importantly, Groucho is phosphorylated by an efficient mechanism that does not alter its subcellular localisation or decrease its stability; rather, modified Groucho endures long after MAPK activation has terminated. We propose that phosphorylation of Groucho provides a widespread, long-term mechanism by which RTK signals control target gene expression.
PMID - 18216172

Refine Query Set – Extra Set example

Engrailed defines the position of dorsal di-mesencephalic boundary by repressing diencephalic fate.
Regionalization of a simple neural tube is a fundamental event during the development of central nervous system. To analyze in vivo the molecular mechanisms underlying the development of mesencephalon, we ectopically expressed Engrailed, which is expressed in developing mesencephalon, in the brain of chick embryos by in ovo electroporation. Misexpression of Engrailed caused a rostral shift of the di-mesencephalic boundary, and caused transformation of dorsal diencephalon into tectum, a derivative of dorsal mesencephalon. Ectopic Engrailed rapidly repressed Pax-6, a marker for diencephalon, which preceded the induction of mesencephalon-related genes such as Pax-2, Pax-5, Fgf8, Wnt-1 and EphrinA2. In contrast, a mutant Engrailed, En-2(F51rE), bearing mutation in EH1 domain, which has been shown to interact with a co-repressor, Groucho, did not show the phenotype induced by wild-type Engrailed.
Furthermore, VP16-Engrailed chimeric protein, the dominant positive form of Engrailed, caused caudal shift of di-mesencephalic boundary and ectopic Pax-6 expression in mesencephalon. These data suggest that (1) Engrailed defines the position of dorsal di-mesencephalic boundary by directly repressing diencephalic fate, and (2) Engrailed positively regulates the expression of mesencephalon-related genes by repressing the expression of their negative regulator(s).
PMID - 10529429

Group morphologically related words - example

• The Drosophila Groucho transcriptional corepressor protein has been shown to interact with the DNA-binding bHLH domain of Enhancer of split, Hairy and Deadpan proteins.
• Groucho acts as a co-repressor for several Drosophila DNA binding transcriptional repressors.
• Dorsal represses transcription by recruiting the co-repressor Groucho.
• The results indicate that FoxD3 recruitment of Groucho corepressors is essential for the transcriptional repression of target genes and induction of mesoderm in Xenopus.
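
The variants that recur in these sentences (corepressor, co-repressor, corepressors; transcriptional repressors, transcriptional repression) can be collapsed into families with a very crude normalization. The sketch below is only an illustration of the idea, not eGIFT's actual grouping procedure: the suffix list, the minimum stem length, and the names `crude_stem` and `group_variants` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; longer suffixes are tried first.
SUFFIXES = ("ments", "ional", "ment", "ions", "ion", "ing", "ors", "or", "ed", "s")
MIN_STEM_LENGTH = 4  # assumption: never strip a word down to fewer than 4 letters


def crude_stem(word):
    """Lowercase the word, drop hyphens, and strip at most one known suffix."""
    w = word.lower().replace("-", "")
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM_LENGTH:
            return w[: -len(suffix)]
    return w


def group_variants(terms):
    """Map each stemmed key to the set of surface forms (unigrams or bigrams) sharing it."""
    families = defaultdict(set)
    for term in terms:
        key = " ".join(crude_stem(token) for token in term.split())
        families[key].add(term)
    return dict(families)


variants = [
    "corepressor", "corepressors", "co-repressor",
    "transcriptional repressors", "transcriptional repression",
    "transcription repression",
]
for key, family in group_variants(variants).items():
    print(key, "=", sorted(family))
```

Counting a family as a single term means that scattered variants add up to one frequency, and a rare variant cannot become a key term on its own.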
corepressor = {corepressor, corepressors, co-repressor, …}
transcription repress = {transcriptional repressors, transcriptional repression, …}

Group morphologically related words

Unigram example
recruit = {recruit, recruits, recruited, recruitment, recruiting, recruitments}

Bigram example
transcript repress = {transcriptional repressor, transcriptional repressors, transcriptional repression, transcriptional repressions, transcription repression, transcription repressions}

Reasons for grouping morphologically related words
• textual variants, independent of each other, are scattered in text
• we help the family stand out
• we prevent a very infrequent variant from becoming a key term

Calculate term scores

• Calculate Normalized Frequencies
  $f_{tb} = dc_{tb} / N_b$ and $f_{tq} = dc_{tq} / N_q$
  $dc_{tb}$, $dc_{tq}$ = document count of term t in Background Set, Query Set
  $N_b$, $N_q$ = total number of abstracts in Background Set, Query Set
• Calculate Score
  $s_t = (f_{tq} - f_{tb}) \cdot \ln(1 / f_{tb})$
  $s_t$ = score of term t
  $f_t$ = frequency of term t

  term           f_tb      f_tq    f_tq - f_tb    s_t
  segmentation   0.0012    0.13    0.13           0.874
  these          0.47      0.60    0.13           0.098

Other scoring methods

• Pearson’s Chi-Square
  • Prefers only highly infrequent terms (bigrams are ranked high)
  • Drops very frequent terms, although much more frequent in the Query Set
• Z-score
  • Performance is highly dependent on the way the Background Set is grouped
• Others considered
  • Ratio of frequencies
  • Tf-Idf
  • Mutual Information

Categorize Key Terms

Link sentences to key terms

• eGIFT allows users to see every sentence mentioning a particular key term in the gene’s Query Set
• by reading in context, the user gets a better appreciation of the relationship between the key term and the gene
• From sentences users can choose which abstracts to read
• Sentences can be saved in gene specific files (e.g. for annotation)

eGIFT Screenshots – Key Terms for Groucho

eGIFT Screenshots – Sentences

Related Work

• Andrade and Valencia (1998)
  Keywords for a protein family
  Z-score
  Background divided by literature for individual families
• Liu et al. (2004)
• e-LiSe (Gladki et al., 2008)
• MedEvi (Kim et al., 2008)
  Keyword detection (not necessarily genes)
  Z-score
  More general background set than us, grouped randomly
• Anne O’Tate (Smalheiser et al., 2008)
• XplorMed (Perez-Iratxeta et al., 2003)
• Shatkay and Wilbur (2000)
  Keyword detection (some just nouns)
  More general background set than us
  From kernel document to Query Set of on-topic documents
  Background Set contains off-topic documents
  Score is ratio of normalized frequencies

Distinguishing Features of eGIFT

• Background Set is specific for genes
• About Set yields better results than the entire Query Set
• Bigrams in addition to unigrams
• Morphological grouping gives “textual concepts”
• New scoring mechanism
• Going beyond key terms
  • Categories of key terms (for interface purposes)
  • Retrieval of sentences containing a specific key term

Future Work

Evaluation
• comparison with other systems

Named Entity Recognition
• extend unigrams and bigrams to full length names

Using other subsets of Query Set
• currently, eGIFT uses the About Set to compute key terms
• different kinds of information can be obtained from variants of Extra Set and other subsets

The End
http://dinah.cis.udel.edu/tudor/eGIFT
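
As a closing illustration, two steps of the approach can be sketched in Python: the About Set heuristics (gene name in the title, in the first or last sentence, or mentioned 3+ times) and the term score. This is a rough sketch under stated assumptions, not eGIFT's implementation. The function names, the naive sentence splitter, and the exact-name matching are hypothetical, and the score formula, (f_q - f_b) * ln(1 / f_b), is assumed because it agrees with the worked values on the scoring slide (0.874 for "segmentation", 0.098 for "these") up to rounding.

```python
import math
import re


def is_about_abstract(gene, title, abstract, min_mentions=3):
    """Return True if the abstract appears to focus on the gene (About Set heuristics)."""
    # Assumption: match the gene name as a whole word, ignoring case; synonyms are not handled.
    pattern = re.compile(r"\b" + re.escape(gene) + r"\b", re.IGNORECASE)
    # Naive sentence splitter (an assumption); it will mis-split abbreviations such as "e.g.".
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
    in_title = bool(pattern.search(title))
    in_first_or_last = bool(sentences) and bool(
        pattern.search(sentences[0]) or pattern.search(sentences[-1])
    )
    mentioned_often = len(pattern.findall(abstract)) >= min_mentions
    return in_title or in_first_or_last or mentioned_often


def normalized_frequency(doc_count, n_abstracts):
    """Fraction of abstracts in a set that contain the term."""
    return doc_count / n_abstracts


def term_score(f_query, f_background):
    """Assumed score: frequency gain in the Query Set, weighted by background rarity."""
    if f_background <= 0:
        raise ValueError("term must occur at least once in the Background Set")
    return (f_query - f_background) * math.log(1.0 / f_background)


# Frequencies taken from the scoring slide; the printed scores match it only up to rounding.
for term, f_b, f_q in [("segmentation", 0.0012, 0.13), ("these", 0.47, 0.60)]:
    print(term, round(term_score(f_q, f_b), 3))
```

Both terms gain about 0.13 in frequency from Background Set to Query Set, but the logarithmic weight keeps a common word such as "these" far below a rare, gene-specific term such as "segmentation".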