Domain-Specific Synonym Expansion and
Validation for Biomedical Information Retrieval
(MultiText Experiments for TREC 2004)
Stefan Büttcher
<[email protected]>
18th November 2004
TREC 2004 Genomics Track: University of Waterloo MultiText Project
Stefan Büttcher <[email protected]>, 18th November 2004
Introduction
The UW MultiText group participated in the Ad hoc
retrieval task of the Genomics track:
“Given a topic, find all relevant documents.”
Example
Title: H2A histone family
Need: What evolutionary changes have occurred to members
of the H2A histone family?
Context: We are interested in the evolutionary context of H2A
histones, e.g., where do they belong in the tree of life?
We implemented a document retrieval system based on
the MultiText information retrieval engine and submitted
two runs (Need only, Title+Need).
System Overview
Extensions to the MultiText system
● Domain-specific query (synonym) expansion
   ● Heuristics for lexical variants
   ● Synonym/acronym databases
● Expansion validation
● Pseudo-relevance feedback
   ● Corpus-based feedback
   ● Google
Query Expansion: Motivation (1)
<TOPIC>
<ID>53</ID>
<TITLE>
Gene expression regulation and TNF/NfkappaB pathway
</TITLE>
<NEED>
What influence do up-regulated proteins GADD45beta,
IkappaBalpha, XIAP, cIAP2, and A20 have on the
TNF/NfkappaB pathway?
</NEED>
</TOPIC>
Query Expansion: Motivation (2)
Synonyms for the terms of topic 53:
● NfkappaB: Nuclear Factor kappaB, Nuclear Transcription Factor kappaB, NFkB, NF-kappaB, ...
● TNF: Tumor Necrosis Factor, TNFalpha, TNFbeta, ...
● GADD45beta: Growth Arrest DNA Damage, Growth Arrest and DNA damage-inducible, GADD, GADD45b, ...
● cIAP2: Cellular Inhibitor of Apoptosis Protein, CIAP, ...
● XIAP: X-linked Inhibitor of Apoptosis, X-chromosome-linked Inhibitor of Apoptosis Protein, xIAP, ...
● IkappaBalpha: Inhibitory Protein kappaBalpha, Inhibitor of Nuclear Factor kappaBalpha, I kappa B alpha, ...
Query Expansion: Lexical Variants
How to generate lexical variants?
Example
The Medline corpus refers to the NF-kappaB protein
in 6 different ways:
“NF-kappa B” (33902), “NF-kappaB” (28551),
“NFkappaB” (3211), “NF-kB” (688), “NFkB” (259),
“NFkappa B” (45).
⇒ Derive a set of simple generation rules.
Query Expansion: Lexical Variants (2)
How to generate lexical variants?
1. Tokenize the term (token boundaries: transitions
between alphabetical and numerical characters;
hyphens; spaces; Greek letters).
2. Contract Greek letters: “alpha” → “a”, ...
3. Generate all hyphenation variants.
Example (Larval serum prot. 1 alpha, “Lsp1alpha”)
“lsp-1-alpha”, “lsp-1-a”, “lsp-1alpha”, “lsp-1a”,
“lsp1-alpha”, “lsp1-a”, “lsp1alpha”, “lsp1a”
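The three steps above can be sketched in Python. This is a minimal illustration, not the MultiText implementation: the function names, the small Greek-letter table, and the choice to emit only hyphenated and unhyphenated forms (no space-separated forms such as “NF-kappa B”) are assumptions made for the example.

```python
import itertools
import re

# Hypothetical, deliberately small table of Greek letter contractions.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d", "kappa": "k"}
_GREEK_RE = re.compile("(" + "|".join(GREEK) + ")")


def split_tokens(term):
    """Step 1: split at letter/digit transitions, hyphens, spaces, Greek letters."""
    tokens = []
    # Runs of letters or digits; hyphens, spaces and punctuation act as boundaries.
    for run in re.findall(r"[a-z]+|[0-9]+", term.lower()):
        # Split alphabetic runs further wherever a Greek letter name occurs.
        tokens.extend(part for part in _GREEK_RE.split(run) if part)
    return tokens


def lexical_variants(term):
    """Steps 2 and 3: optional Greek contraction, then all hyphenation variants."""
    tokens = split_tokens(term)
    if not tokens:
        return set()
    # Each Greek token may appear spelled out or contracted.
    choices = [(tok, GREEK[tok]) if tok in GREEK else (tok,) for tok in tokens]
    variants = set()
    for forms in itertools.product(*choices):
        # Between every pair of adjacent tokens: a hyphen or nothing.
        for seps in itertools.product(("-", ""), repeat=len(forms) - 1):
            pieces = [forms[0]]
            for sep, form in zip(seps, forms[1:]):
                pieces.append(sep + form)
            variants.add("".join(pieces))
    return variants


print(sorted(lexical_variants("Lsp1alpha")))
```

For “Lsp1alpha” the sketch yields the eight variants listed in the example above. A Greek letter name that happens to occur inside an ordinary word would also be split, so a real system would need a more careful tokenizer.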
Query Expansion: Acronyms
Acronym expansion using AcroMed
What is AcroMed?
An automatically generated database of biomedical
acronyms (extracted from Medline abstracts).
● We used a subset consisting of 25,589 acronyms
and 49,822 long forms.
● For every acronym that was found in the topic, its
long forms were added to the query.
The opposite direction could be interesting as well...
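A minimal sketch of the acronym lookup, assuming the acronym database is available as an in-memory dictionary. The table below is a toy stand-in with two entries, not AcroMed data, and `expand_acronyms` is a hypothetical name.

```python
import re

# Toy acronym table (hypothetical); the real subset had 25,589 acronyms.
TOY_ACRONYMS = {
    "TNF": ["tumor necrosis factor"],
    "GIS": ["gastrointestinal symptoms", "geographic information system"],
}


def expand_acronyms(topic, table):
    """Return {acronym: [long forms]} for every known acronym in the topic text."""
    expansions = {}
    # Split on anything that is not a letter or digit, so "TNF/NfkappaB" -> TNF, NfkappaB.
    for word in re.findall(r"[A-Za-z0-9]+", topic):
        if word in table:  # case-sensitive match (an assumption of this sketch)
            expansions[word] = list(table[word])
    return expansions


topic = "Gene expression regulation and TNF/NfkappaB pathway"
print(expand_acronyms(topic, TOY_ACRONYMS))
```

Note that an ambiguous acronym such as “GIS” would bring in all of its long forms at this stage.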
Query Expansion: Synonyms (1)
Synonym expansion: species names
Use a handcrafted set of 11 species names to add
synonyms to the query:
● “mouse” → “mice”, “mus musculus”
● “caenorhabditis” → “worm”, “c. elegans”
● “cow” → “bovine”, “bos taurus”, “cattle”
● ...
Query Expansion: Synonyms (2)
Synonym expansion: genes/proteins
● Gene data from the Eukaryotic Organisms database
(euGenes.org) – 184,460 genes.
● Gene/protein data from the LocusLink database –
128,580 genes.
For every gene symbol recognized, we added:
full name, protein product, alias symbols.
For every protein symbol recognized, we added:
gene symbol, gene name, gene alias symbols.
Query Expansion: Synonyms (3)
Synonym expansion: narrow synonyms
If the topic contains “TGF beta”, documents talking
about “TGF beta 1” or “TGF beta 2” are probably
relevant as well.
For every gene descriptor in the database, a search
key was added with the last part of the gene symbol
removed so that “TGF beta 1” can be found when
searching for “TGF beta”.
Why “narrow” synonyms?
“TGF beta 2” is sometimes referred to as
“TGF beta”, but never the other way round.
Query Expansion: Pattern Matching
Synonym expansion: non-strict matching
To allow for the recognition of lexical variants when
searching for an acronym or gene descriptor, the
rules for generating lexical variants are applied again.
⇒ Gene symbol “TGFB2” can be found when
searching for “TGF-beta”.
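One way to picture how narrow search keys and non-strict matching work together is sketched below. It is an illustration under assumptions, not the original code: instead of generating every lexical variant, each descriptor is reduced to one canonical variant (Greek letters contracted, no hyphens), and all function names and the tiny descriptor list are hypothetical.

```python
import re

# Hypothetical, deliberately small table of Greek letter contractions.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d", "kappa": "k"}
_GREEK_RE = re.compile("(" + "|".join(GREEK) + ")")


def split_tokens(term):
    """Split at letter/digit transitions, hyphens, spaces and Greek letter names."""
    tokens = []
    for run in re.findall(r"[a-z]+|[0-9]+", term.lower()):
        tokens.extend(part for part in _GREEK_RE.split(run) if part)
    return tokens


def canonical(tokens):
    """One representative lexical variant: Greek contracted, tokens concatenated."""
    return "".join(GREEK.get(tok, tok) for tok in tokens)


def build_index(descriptors):
    """Map search keys to descriptors, including the narrow key (last part removed)."""
    index = {}
    for descriptor in descriptors:
        tokens = split_tokens(descriptor)
        keys = {canonical(tokens)}
        if len(tokens) > 1:
            keys.add(canonical(tokens[:-1]))  # narrow-synonym key
        for key in keys:
            index.setdefault(key, set()).add(descriptor)
    return index


def lookup(index, query_term):
    """Non-strict lookup: the query term is canonicalized the same way."""
    return index.get(canonical(split_tokens(query_term)), set())


index = build_index(["TGFB2", "TGF beta 1"])
print(lookup(index, "TGF-beta"))    # both descriptors, via their narrow keys
print(lookup(index, "TGF beta 2"))  # only TGFB2
```

Because only the descriptors receive a narrow key, searching for “TGF beta” finds “TGF beta 1”, while searching for “TGF beta 1” does not find a bare “TGF beta” entry.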
Expansion Validation (1)
Closer examination of the top 150 documents
returned by the 1st stage.
Two goals:
1. Rule out wrong expansions.
2. Find additional expansions.
Why? – Two long forms for acronym “GIS”:
● “gastrointestinal symptoms”;
● “geographic information system”.
Expansion Validation (2)
Rule out wrong expansions (Goal 1)
● Look at the documents and count the
occurrences of all expansion terms.
● For every original query term, only keep the
top 10 expansions.
Find additional expansions (Goal 2)
● Search for permutations of query terms (the
gene “glucosidase, alpha; acid” encodes the
“acid alpha-glucosidase” preprotein).
● Include the most frequent permutations in
the new query.
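Goal 1 can be sketched as a simple counting filter. The sketch below makes several assumptions that the slides do not state: documents are plain strings, occurrences are counted by case-insensitive substring matching, and expansions that never occur are dropped. `validate_expansions` is a hypothetical name.

```python
from collections import Counter


def validate_expansions(expansions, top_documents, keep=10):
    """For each original term, keep the `keep` expansions seen most often."""
    docs = [doc.lower() for doc in top_documents]
    validated = {}
    for term, candidates in expansions.items():
        counts = Counter()
        for candidate in candidates:
            # Crude occurrence count: substring matches over all top documents.
            counts[candidate] = sum(doc.count(candidate.lower()) for doc in docs)
        # Assumption: an expansion that never occurs is treated as wrong.
        ranked = [cand for cand, n in counts.most_common() if n > 0]
        validated[term] = ranked[:keep]
    return validated


expansions = {"GIS": ["gastrointestinal symptoms", "geographic information system"]}
top_documents = [
    "Patients reported gastrointestinal symptoms after treatment.",
    "Gastrointestinal symptoms were common in both groups.",
]
print(validate_expansions(expansions, top_documents))
```

With biomedical documents at the top of the first-stage ranking, the “geographic information system” reading of “GIS” drops out. The permutation search of Goal 2 is not covered by this sketch.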
Pseudo-Relevance Feedback
Probability that a text passage of length len contains
the term T:
\[ P_{contained} = 1 - \left(1 - \frac{f_T}{N}\right)^{len} \]
N: corpus size; f_T: term frequency
Since f_T/N is usually very small, we can approximate
the probability that a text passage of length len
contains the term T:
\[ P_{contained} \approx \frac{len \cdot f_T}{N} \]
Information of “T inside passage”:
\[ \log_2\left(\frac{N}{f_T}\right) - \log_2(len) \]
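A small numeric check of the approximation, with made-up values for N, f_T and len (powers of two, chosen only so that the logarithms come out as whole numbers). The function names are hypothetical.

```python
import math


def p_contained_exact(f_t, n, length):
    """Probability that at least one of `length` positions holds term T."""
    return 1.0 - (1.0 - f_t / n) ** length


def p_contained_approx(f_t, n, length):
    """First-order approximation, valid when f_t / n is very small."""
    return length * f_t / n


def information(f_t, n, length):
    """Information (in bits) of the event 'T inside passage'."""
    return math.log2(n / f_t) - math.log2(length)


# Illustrative values only: corpus of 2^20 tokens, term seen 16 times, passage of 64 tokens.
N, F_T, LEN = 2**20, 2**4, 2**6
exact = p_contained_exact(F_T, N, LEN)
approx = p_contained_approx(F_T, N, LEN)
print(exact, approx, information(F_T, N, LEN))
```

The information value is simply the negative base-2 logarithm of the approximate probability, so rare terms in short passages score highest.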
Corpus-Based Feedback (1)
Passage feedback
We used MultiText's QAP passage scorer to find
passages relevant to a given topic.
FB rule: For every term T that appears in the neighborhood of a relevant passage, increase the term score:
\[ pscore_T := pscore_T + \log_2\left(\frac{N}{f_T}\right) - \log_2(l) \]
N: corpus size; f_T: term frequency; l: size of the minimum window that contains both the passage and T
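The passage feedback rule might be sketched as follows. The slides do not define “neighborhood”, so the `radius` parameter (a fixed number of tokens on either side of the passage) is an assumption, as are the function name, the token-list document representation, and the decision to skip terms without a known corpus frequency. The QAP passage scorer itself is not reproduced; the passage boundaries are taken as given.

```python
import math
from collections import defaultdict


def passage_feedback(doc_tokens, start, end, term_freq, corpus_size, pscore, radius=5):
    """Update pscore for terms near the passage doc_tokens[start:end]."""
    lo = max(0, start - radius)
    hi = min(len(doc_tokens), end + radius)
    for pos in range(lo, hi):
        term = doc_tokens[pos]
        f_t = term_freq.get(term)
        if not f_t:
            continue  # assumption: terms without a corpus frequency are ignored
        # l: size of the smallest window covering both the passage and this occurrence.
        window = max(end, pos + 1) - min(start, pos)
        pscore[term] += math.log2(corpus_size / f_t) - math.log2(window)
    return pscore


# Illustrative values only.
tokens = ["histone", "h2a", "family", "evolution", "tree"]
scores = passage_feedback(
    tokens, start=1, end=4,
    term_freq={"histone": 4, "tree": 4}, corpus_size=64,
    pscore=defaultdict(float), radius=1,
)
print(dict(scores))
```

Terms adjacent to the passage get a small window and therefore a higher score than terms further away.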
Corpus-Based Feedback (2)
Document feedback
FB rule: For every term T that appears in one of the
top 100 documents returned (D):
\[ dscore_T := dscore_T + w_D \cdot \left( \log_2\left(\frac{N}{f_T}\right) - \log_2(len_D) \right) \]
N: corpus size; f_T: term frequency; len_D: size of the
document; w_D: relative document weight
w D =score D ∗ c
rank D
c1
scoreD: BM25 score; rankD: BM25 rank (from stage 2)
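A sketch of the document feedback accumulation. It assumes that the document weight multiplies the whole information term, that each term is counted once per document, and that the weights w_D have already been computed elsewhere (they are passed in rather than derived from BM25 score and rank). The function name and all numbers are illustrative.

```python
import math
from collections import defaultdict


def document_feedback(weighted_docs, term_freq, corpus_size):
    """weighted_docs: iterable of (token list, w_D) pairs for the top documents."""
    dscore = defaultdict(float)
    for tokens, weight in weighted_docs:
        length = len(tokens)
        for term in set(tokens):  # assumption: one contribution per document
            f_t = term_freq.get(term)
            if not f_t:
                continue  # assumption: unknown terms are skipped
            dscore[term] += weight * (math.log2(corpus_size / f_t) - math.log2(length))
    return dict(dscore)


# Illustrative values only.
top_docs = [
    (["tnf", "pathway", "tnf", "nfkb"], 1.0),
    (["tnf", "xiap"], 0.5),
]
print(document_feedback(top_docs, term_freq={"tnf": 8, "xiap": 2}, corpus_size=1024))
```

Terms that recur across several highly weighted documents accumulate the largest scores.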
Google Feedback
FB rule: Send unexpanded Need to Google. For every
term T that appears in one of the top 20 snippets
returned by Google:
\[ gscore_T := gscore_T + \log_2\left(\frac{N}{f_T}\right) - \log_2(len_S) \]
N: corpus size; f_T: term frequency (within corpus);
len_S: size of the Google snippet returned
Combination of feedback methods
Final feedback score for term T: (top 10 are taken)
\[ score_T = \frac{pscore_T}{pscore_{max}} + \frac{dscore_T}{dscore_{max}} + \frac{gscore_T}{gscore_{max}} \]
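The combination step can be sketched as below, assuming the three scores are each divided by their maximum and then summed, and that a term missing from one method contributes zero for that method. The function name, the tie-breaking by term, and the sample scores are hypothetical.

```python
def combine_feedback(pscore, dscore, gscore, top_k=10):
    """Sum max-normalized scores from the three methods; return the top_k terms."""

    def normalized(scores):
        peak = max(scores.values(), default=0.0)
        if peak <= 0:
            return {}  # nothing usable from this feedback method
        return {term: value / peak for term, value in scores.items()}

    total = {}
    for scores in (pscore, dscore, gscore):
        for term, value in normalized(scores).items():
            total[term] = total.get(term, 0.0) + value
    # Highest combined score first; ties broken alphabetically for determinism.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Illustrative values only.
p = {"a": 2.0, "b": 1.0}
d = {"a": 4.0, "c": 4.0}
g = {"b": 8.0}
print(combine_feedback(p, d, g, top_k=2))
```

Normalizing by the maximum puts the three methods on a common scale, so a term supported by several methods outranks one that is strong in only one.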
Results & Discussion (1)
Intermediate results (Title+Need)
Results & Discussion (2)
Domain-specific query expansion
⇒ AcroMed caused the greatest overall improvement. euGenes was not beneficial
at all (maybe authors do not use full gene names...).
Results & Discussion (3)
Pseudo-relevance feedback
⇒ Document feedback best, Google feedback worst. Combination better than
any single technique. Why did Google perform so poorly?
Conclusion
● We added a number of different techniques (general
and domain-specific) to the MultiText system.
● Most of the new techniques increased the retrieval
effectiveness of our system.
● Lexical variants, permutations, and acronyms
performed best among the domain-specific
techniques.
● Google feedback was worse than the corpus-based
feedback methods, due to Google's restricted query
language.
Thank you
Thank you!