Mining the Biomedical Literature for Genic Information
BioNLP ’08, June 19, 2008
Catalina O. Tudor, K. Vijay-Shanker, Carl J. Schmidt
University of Delaware
Presenter: Catalina O. Tudor
User Scenario – Groucho and PubMed
User: “Groucho? I want to know more about this gene…”
Search Engine: PubMed
Result: 270 abstracts retrieved
User Scenario – Groucho and eGIFT
User: “Groucho? I want to know more about this gene…”
Web Application: eGIFT (retrieves abstracts from PubMed)
Result: the most relevant terms associated with the given gene
Key Terms for Groucho
Processes:
• segmentation
• neurogenesis
• embryonic development
...
Descriptors:
• enhancer
• corepressor
...
Domains:
• WD40
• eh1
• WRPW
• basic helix-loop-helix
...
Genes:
• Hairy
• AES
All sentences for Groucho containing segmentation
1. The Groucho protein interacts with Hairy-related transcription factors to regulate segmentation, neurogenesis and sex determination. (PMID 8892234)
2. The Drosophila protein Groucho is involved in embryonic segmentation and neural development, and is implicated in the Notch signal transduction pathway. (PMID 8713081)
...
What does eGIFT provide?
• Two types of users
  • Scientists trying to quickly find information about a gene
  • Annotators trying to quickly locate textual evidence describing gene functions
• Key Terms provide an overall picture of a given gene
• eGIFT allows users to identify the set of documents for a topic relevant to the gene of interest
Overall Approach of eGIFT
1. Retrieve abstracts from PubMed
   a. Background Set: all abstracts mentioning “gene” or “protein”
   b. Query Set: all abstracts mentioning a given gene
2. Refine Query Set
3. Group morphologically related words
4. Calculate term scores and identify key terms
5. Categorize key terms using controlled vocabularies
6. Link sentences and abstracts to a specific key term
Retrieve abstracts
Background Set
• all abstracts mentioning “gene” or “protein”
• PubMed query: (gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) AND hasabstract[text]
• 639,211 abstracts retrieved
Query Set
• all abstracts mentioning a given gene name, symbol, or synonyms
• Compare information from the Query Set against general information from the Background Set to determine the most specific information in the Query Set
• Compare background and query frequencies of terms to identify statistically interesting cases
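As a rough illustration of the retrieval step, the sketch below builds PubMed search URLs for the two sets using the NCBI E-utilities ESearch endpoint. The Background Set term is the query shown above; the way the Query Set term is assembled (quoted names joined with OR) and the function names are assumptions made for illustration, not a description of how eGIFT issues its searches. The code only constructs URLs and does not contact PubMed.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (db, term and retmax are standard parameters).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Background Set query, as given on the slide.
BACKGROUND_TERM = (
    "(gene[ti] OR genes[ti] OR protein[ti] OR proteins[ti]) "
    "AND hasabstract[text]"
)


def query_set_term(names):
    """Hypothetical Query Set term: OR together a gene's name, symbol and synonyms."""
    quoted = " OR ".join(f'"{name}"' for name in names)
    return f"({quoted}) AND hasabstract[text]"


def esearch_url(term, retmax=100):
    """Build an ESearch URL for PubMed; fetching it is left to the caller."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url(BACKGROUND_TERM))
    print(esearch_url(query_set_term(["Groucho", "gro"])))
```

The synonym list passed to query_set_term is a placeholder; in practice it would come from a gene database.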
Refine Query Set
Query Set = all abstracts mentioning the given gene
The Query Set contains two types of abstracts
1. About Set
• abstracts which focus on the given gene
2. Extra Set
• abstracts which focus on other topics but happen to mention the gene
Heuristics for identifying an About abstract
• the given gene name occurs in the title, or in the first or last sentence
• the given gene name occurs 3+ times in the abstract
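The two heuristics above can be sketched as a small classifier. This is a minimal illustration, assuming a naive regular-expression sentence splitter and case-insensitive whole-word matching of a single gene name; the function names, the splitter, and the handling of synonyms (none) are assumptions, not details from the talk.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def is_about(gene, title, abstract, min_mentions=3):
    """Return True if the abstract looks like an About abstract for `gene`."""
    pattern = re.compile(r"\b" + re.escape(gene) + r"\b", re.IGNORECASE)
    # Heuristic 1: gene name in the title, or in the first or last sentence.
    if pattern.search(title):
        return True
    sentences = split_sentences(abstract)
    if sentences and (pattern.search(sentences[0]) or pattern.search(sentences[-1])):
        return True
    # Heuristic 2: gene name occurs 3+ times in the abstract.
    return len(pattern.findall(abstract)) >= min_mentions


def refine_query_set(gene, query_set):
    """Split (title, abstract) pairs into an About Set and an Extra Set."""
    about = [doc for doc in query_set if is_about(gene, *doc)]
    extra = [doc for doc in query_set if not is_about(gene, *doc)]
    return about, extra
```

Under these assumptions, an abstract whose title contains "Groucho-mediated repression" lands in the About Set, while one that mentions Groucho once in a middle sentence lands in the Extra Set.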
Refine Query Set – About Set example
Multiple RTK pathways downregulate Groucho-mediated repression in Drosophila
embryogenesis.
RTK pathways establish cell fates in a wide range of developmental processes. However,
how the pathway effector MAPK coordinately regulates the expression of multiple target
genes is not fully understood. We have previously shown that the EGFR RTK pathway
causes phosphorylation and downregulation of Groucho, a global co-repressor that is widely
used by many developmentally important repressors for silencing their various targets.
Here, we use specific antibodies that reveal the dynamics of Groucho phosphorylation by
MAPK, and show that Groucho is phosphorylated in response to several RTK pathways
during Drosophila embryogenesis. Focusing on the regulation of terminal patterning by the
Torso RTK pathway, we demonstrate that attenuation of Groucho's repressor function via
phosphorylation is essential for the transcriptional output of the pathway and for terminal
cell specification. Importantly, Groucho is phosphorylated by an efficient mechanism that
does not alter its subcellular localisation or decrease its stability; rather, modified
Groucho endures long after MAPK activation has terminated. We propose that
phosphorylation of Groucho provides a widespread, long-term mechanism by which RTK
signals control target gene expression.
PMID - 18216172
Refine Query Set – Extra Set example
Engrailed defines the position of dorsal di-mesencephalic boundary by repressing
diencephalic fate.
Regionalization of a simple neural tube is a fundamental event during the development of
central nervous system. To analyze in vivo the molecular mechanisms underlying the
development of mesencephalon, we ectopically expressed Engrailed, which is expressed in
developing mesencephalon, in the brain of chick embryos by in ovo electroporation.
Misexpression of Engrailed caused a rostral shift of the di-mesencephalic boundary, and
caused transformation of dorsal diencephalon into tectum, a derivative of dorsal
mesencephalon. Ectopic Engrailed rapidly repressed Pax-6, a marker for diencephalon,
which preceded the induction of mesencephalon-related genes such as Pax-2, Pax-5, Fgf8,
Wnt-1 and EphrinA2. In contrast, a mutant Engrailed, En-2(F51rE), bearing mutation in
EH1 domain, which has been shown to interact with a co-repressor, Groucho, did not show
the phenotype induced by wild-type Engrailed. Furthermore, VP16-Engrailed chimeric
protein, the dominant positive form of Engrailed, caused caudal shift of di-mesencephalic
boundary and ectopic Pax-6 expression in mesencephalon. These data suggest that (1)
Engrailed defines the position of dorsal di-mesencephalic boundary by directly repressing
diencephalic fate, and (2) Engrailed positively regulates the expression of mesencephalon-related genes by repressing the expression of their negative regulator(s).
PMID - 10529429
Group morphologically related words - example
• The Drosophila Groucho transcriptional corepressor protein has been
shown to interact with the DNA-binding bHLH domain of Enhancer of split,
Hairy and Deadpan proteins.
• Groucho acts as a co-repressor for several Drosophila DNA binding
transcriptional repressors.
• Dorsal represses transcription by recruiting the co-repressor Groucho
• The results indicate that FoxD3 recruitment of Groucho corepressors is
essential for the transcriptional repression of target genes and induction
of mesoderm in Xenopus.
corepressor = {corepressor, corepressors, co-repressor, …}
transcription repress = {transcriptional repressors, transcriptional repression, …}
Group morphologically related words
Unigram example
recruit = {recruit, recruits, recruited, recruitment, recruiting,
recruitments}
Bigram example
transcript repress = {transcriptional repressor, transcriptional
repressors, transcriptional repression, transcriptional repressions,
transcription repression, transcription repressions}
Reasons for grouping morphologically related words
• textual variants, independent of each other, are scattered in the text
• grouping helps the whole family of variants stand out
• grouping prevents a very infrequent variant from becoming a key term
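To make the grouping idea concrete, here is a toy sketch for unigrams. The talk does not say which stemmer or normalization eGIFT uses, so the suffix list and the crude_stem function below are purely illustrative assumptions: the code lowercases a word, removes hyphens, and strips one of a handful of suffixes, which is enough to reproduce the "recruit" and "corepressor" families shown above.

```python
from collections import defaultdict

# Illustrative suffix list (assumption), checked in this order so that
# longer suffixes win over shorter ones.
SUFFIXES = ("ments", "ment", "ing", "ed", "s")


def crude_stem(word):
    """Toy normalizer: lowercase, drop hyphens, strip one known suffix."""
    w = word.lower().replace("-", "")
    for suffix in SUFFIXES:
        # Keep at least four characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w


def group_variants(words):
    """Map each crude stem to the set of surface forms that share it."""
    groups = defaultdict(set)
    for word in words:
        groups[crude_stem(word)].add(word)
    return dict(groups)


if __name__ == "__main__":
    words = ["recruit", "recruits", "recruited", "recruitment",
             "co-repressor", "corepressors", "corepressor"]
    for stem, variants in group_variants(words).items():
        print(stem, sorted(variants))
```

A real system would more likely rely on an established stemmer or a morphological lexicon; the point here is only that document counts are then taken per family rather than per surface form.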
Calculate term scores
• Calculate Normalized Frequencies
dctb = document count of term t in Background Set
dctq = document count of term t in Query Set
Nb = total number of abstracts in Background Set
Nq = total number of abstracts in Query Set
ftb = dctb / Nb
ftq = dctq / Nq
• Calculate Score
st = score of term t
ft = frequency of term t
st = (ftq − ftb) × ln(1 / ftb)
Examples
term            ftb       ftq     ftq − ftb    st
segmentation    0.0012    0.13    0.13         0.874
these           0.47      0.60    0.13         0.098
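A minimal sketch of the scoring step, assuming the score is the difference between the query and background normalized frequencies, weighted by ln(1 / ftb). That form is an assumption checked against the two worked examples above: it gives about 0.098 for "these" and about 0.87 for "segmentation" (the small gap to 0.874 is consistent with the displayed frequencies being rounded). The function and argument names are illustrative.

```python
import math


def normalized_frequency(doc_count, total_abstracts):
    """Fraction of abstracts in a set that contain the term."""
    return doc_count / total_abstracts


def term_score(f_tq, f_tb):
    """Assumed score: (f_tq - f_tb) * ln(1 / f_tb).

    f_tq: normalized frequency of the term in the Query Set
    f_tb: normalized frequency of the term in the Background Set (must be > 0)
    """
    return (f_tq - f_tb) * math.log(1.0 / f_tb)


if __name__ == "__main__":
    # Frequencies taken from the slide's two examples.
    for term, f_tb, f_tq in [("segmentation", 0.0012, 0.13), ("these", 0.47, 0.60)]:
        print(f"{term}: {term_score(f_tq, f_tb):.3f}")
```

Both terms have nearly the same frequency difference (about 0.13), but the logarithmic weight rewards "segmentation" for being rare in the Background Set, so it scores roughly nine times higher than "these".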
Other scoring methods
• Pearson’s Chi-Square
  • Prefers only highly infrequent terms (bigrams are ranked high)
  • Drops very frequent terms, even when they are much more frequent in the Query Set
• Z-score
  • Performance is highly dependent on the way the Background Set is grouped
• Others considered
  • Ratio of frequencies
  • Tf-Idf
  • Mutual Information
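For comparison, Pearson's chi-square for a single term can be computed from a 2×2 table of abstracts with and without the term in each set. The sketch below is the generic textbook formula, assuming the Query Set and Background Set are treated as two separate samples; the function name and arguments are illustrative and say nothing about how the alternatives were implemented in the authors' experiments.

```python
def chi_square_2x2(dc_tq, n_q, dc_tb, n_b):
    """Pearson chi-square for term presence vs. set membership.

    dc_tq, n_q: abstracts containing the term / total abstracts in the Query Set
    dc_tb, n_b: abstracts containing the term / total abstracts in the Background Set
    """
    a, b = dc_tq, n_q - dc_tq  # Query Set: with term, without term
    c, d = dc_tb, n_b - dc_tb  # Background Set: with term, without term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table (an empty row or column)
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # Same proportion in both sets: no association.
    print(chi_square_2x2(10, 20, 10, 20))
    # Term in every query abstract and no background abstract.
    print(chi_square_2x2(20, 20, 0, 20))
```

Note that the statistic is symmetric: it is just as large for a term that is unusually rare in the Query Set as for one that is unusually common, so a ranking based on it needs a separate check on the direction of the difference.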
Categorize Key Terms
Link sentences to key terms
• eGIFT allows users to see every sentence mentioning a particular key term in the gene’s Query Set
  • by reading in context, the user gets a better appreciation of the relationship between the key term and the gene
• From sentences users can choose which abstracts to read
• Sentences can be saved in gene-specific files (e.g. for annotation)
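A small sketch of the sentence-linking step: build an index from each key term to the sentences (with their PMIDs) that mention it. The sentence splitter, the case-insensitive substring match, and the data layout (a dict from PMID to abstract text) are all simplifying assumptions; matching on a term's whole morphological family would be a natural refinement.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def index_sentences(abstracts, key_terms):
    """Map each key term to a list of (pmid, sentence) pairs that mention it.

    abstracts: dict of PMID -> abstract text (the gene's Query Set)
    key_terms: iterable of key term strings
    """
    index = defaultdict(list)
    for pmid, text in abstracts.items():
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            for term in key_terms:
                if term.lower() in lowered:
                    index[term].append((pmid, sentence))
    return dict(index)
```

Because every sentence keeps its PMID, a user can go from a key term to a sentence and from there to the full abstract.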
eGIFT Screenshots – Key Terms for Groucho
eGIFT Screenshots – Sentences
Related Work
• Andrade and Valencia (1998)
    Keywords for a protein family
    Z-score
    Background divided by literature for individual families
• Liu et al. (2004)
• e-LiSe (Gladki et al., 2008)
• MedEvi (Kim et al., 2008)
    Keyword detection (not necessarily genes)
    Z-score
    More general background set than ours, grouped randomly
• Anne O’Tate (Smalheiser et al., 2008)
• XplorMed (Perez-Iratxeta et al., 2003)
• Shatkay and Wilbur (2000)
    Keyword detection (some just nouns)
    More general background set than ours
    From kernel document to Query Set of on-topic documents
    Background Set contains off-topic documents
    Score is ratio of normalized frequencies
Distinguishing Features of eGIFT
• Background Set is specific for genes
• About Set yields better results than the entire Query Set
• Bigrams in addition to unigrams
• Morphological grouping gives “textual concepts”
• New scoring mechanism
• Going beyond key terms
• Categories of key terms (for interface purposes)
• Retrieval of sentences containing a specific key term
Future Work
Evaluation
• comparison with other systems
Named Entity Recognition
• extend unigrams and bigrams to full length names
Using other subsets of Query Set
• currently, eGIFT uses the About Set to compute key terms
• different kinds of information can be obtained from
variants of Extra Set and other subsets
The End
http://dinah.cis.udel.edu/tudor/eGIFT