Domain-Specific Synonym Expansion and Validation for Biomedical Information Retrieval
(MultiText Experiments for TREC 2004)
Stefan Büttcher <[email protected]>
18th November 2004
TREC 2004 Genomics Track: University of Waterloo MultiText Project

Introduction
The UW MultiText group participated in the ad hoc retrieval task of the Genomics track: “Given a topic, find all relevant documents.”

Example
Title: H2A histone family
Need: What evolutionary changes have occurred to members of the H2A histone family?
Context: We are interested in the evolutionary context of H2A histones, e.g., where do they belong in the tree of life?

We implemented a document retrieval system based on the MultiText information retrieval engine and submitted two runs (Need only, Title+Need).

System Overview
Extensions to the MultiText system
● Domain-specific query (synonym) expansion
  ● Heuristics for lexical variants
  ● Synonym/acronym databases
● Expansion validation
● Pseudo-relevance feedback
  ● Corpus-based feedback
  ● Google

Query Expansion: Motivation (1)
<TOPIC>
<ID>53</ID>
<TITLE> Gene expression regulation and TNF/NfkappaB pathway </TITLE>
<NEED> What influence do up-regulated proteins GADD45beta, IkappaBalpha, XIAP, cIAP2, and A20 have on the TNF/NfkappaB pathway? </NEED>
</TOPIC>

Query Expansion: Motivation (2)
The same topic (ID 53), annotated with the expansions added for its terms:
● NfkappaB: Nuclear Factor kappaB, Nuclear Transcription Factor kappaB, NFkB, NFkappaB, ...
● TNF: Tumor Necrosis Factor, TNFalpha, TNFbeta, ...
● GADD45beta: Growth Arrest DNA Damage, Growth Arrest and DNA damage-inducible, GADD, GADD45b, ...
● cIAP2: Cellular Inhibitor of Apoptosis Protein, CIAP, ...
● XIAP: X-linked Inhibitor of Apoptosis, X-chromosome-linked Inhibitor of Apoptosis Protein, xIAP, ...
● IkappaBalpha: Inhibitory Protein kappaBalpha, Inhibitor of Nuclear Factor kappaBalpha, I kappa B alpha, ...

Query Expansion: Lexical Variants
How to generate lexical variants?
Example: The Medline corpus refers to the NF-kappaB protein in 6 different ways (number of occurrences in parentheses):
“NF-kappa B” (33902), “NF-kappaB” (28551), “NFkappaB” (3211), “NF-kB” (688), “NFkB” (259), “NFkappa B” (45).
⇒ Derive a set of simple generation rules.

Query Expansion: Lexical Variants (2)
How to generate lexical variants?
1. Tokenize the term (token boundaries: transitions between alphabetical and numerical characters; hyphens; spaces; Greek letters).
2. Contract Greek letters: “alpha” → “a”, ...
3. Generate all hyphenation variants.
Example (Larval serum protein 1 alpha, “Lsp1alpha”):
“lsp-1-alpha”, “lsp-1-a”, “lsp-1alpha”, “lsp-1a”, “lsp1-alpha”, “lsp1-a”, “lsp1alpha”, “lsp1a”

Query Expansion: Acronyms
Acronym expansion using AcroMed
What is AcroMed? An automatically generated database of biomedical acronyms (extracted from Medline abstracts).
● We used a subset consisting of 25,589 acronyms and 49,822 long forms.
● For every acronym that was found in the topic, its long forms were added to the query.
The opposite direction (long form to acronym) could be interesting as well.

Query Expansion: Synonyms (1)
Synonym expansion: species names
Use a handcrafted set of 11 species names to add synonyms to the query:
● “mouse” → “mice”, “mus musculus”
● “caenorhabditis” → “worm”, “c. elegans”
● “cow” → “bovine”, “bos taurus”, “cattle”
● ...

Query Expansion: Synonyms (2)
Synonym expansion: genes/proteins
● Gene data from the Eukaryotic Organisms database (euGenes.org): 184,460 genes.
● Gene/protein data from the LocusLink database: 128,580 genes.
For every gene symbol recognized, we added: full name, protein product, alias symbols.
For every protein symbol recognized, we added: gene symbol, gene name, gene alias symbols.
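
The three lexical-variant rules from the “Lexical Variants (2)” slide (tokenize, contract Greek letters, generate hyphenation variants) can be sketched in Python as follows. This is a minimal illustration, not the MultiText implementation: the function names are hypothetical, and the Greek-letter table contains only “alpha”, “beta” and “kappa”, because those are the contractions that the examples on the slides imply (“alpha” → “a”, “GADD45beta” → “GADD45b”, “NF-kappaB” → “NF-kB”). Greek letter names are matched as plain substrings, which would wrongly split an ordinary word such as “alphabet”.

```python
import re
from itertools import product

# Assumed contraction table; the slides only spell out "alpha" -> "a".
GREEK_SHORT = {"alpha": "a", "beta": "b", "kappa": "k"}
_GREEK_RE = re.compile("|".join(GREEK_SHORT))


def tokenize(term):
    """Split at hyphens, spaces, letter/digit transitions and Greek letter names."""
    lowered = term.lower()
    # Surround Greek letter names with spaces so they become separate tokens.
    spaced = _GREEK_RE.sub(lambda m: f" {m.group(0)} ", lowered)
    # Runs of letters and runs of digits are the tokens; everything else separates.
    return re.findall(r"[a-z]+|[0-9]+", spaced)


def lexical_variants(term):
    """Return all combinations of Greek contraction and hyphenation for a term."""
    tokens = tokenize(term)
    if not tokens:
        return set()
    # Each Greek token may appear in its long or its contracted form.
    choices = [(t, GREEK_SHORT[t]) if t in GREEK_SHORT else (t,) for t in tokens]
    variants = set()
    for forms in product(*choices):
        # Every gap between two tokens is either empty or a hyphen.
        for seps in product(("", "-"), repeat=len(forms) - 1):
            pieces = [forms[0]]
            for sep, form in zip(seps, forms[1:]):
                pieces.append(sep + form)
            variants.add("".join(pieces))
    return variants


if __name__ == "__main__":
    for variant in sorted(lexical_variants("Lsp1alpha")):
        print(variant)
```

For “Lsp1alpha” the sketch produces the same eight variants that the slide lists. Because the same rules can be applied to database keys as well as to query terms, a set like this is also what the non-strict matching step would compare against.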

Query Expansion: Synonyms (3)
Synonym expansion: narrow synonyms
If the topic contains “TGF beta”, documents talking about “TGF beta 1” or “TGF beta 2” are probably relevant as well.
For every gene descriptor in the database, a search key was added with the last part of the gene symbol removed so that “TGF beta 1” can be found when searching for “TGF beta”.
Why “narrow” synonyms? “TGF beta 2” is sometimes referred to as “TGF beta”, but never the other way round.

Query Expansion: Pattern Matching
Synonym expansion: non-strict matching
To allow for the recognition of lexical variants when searching for an acronym or gene descriptor, the rules for generating lexical variants are applied again.
⇒ Gene symbol “TGFB2” can be found when searching for “TGF-beta”.

Expansion Validation (1)
Closer examination of the top 150 documents returned by the first retrieval stage. Two goals:
1. Rule out wrong expansions.
2. Find additional expansions.
Why? There are two long forms for the acronym “GIS”:
● “gastrointestinal symptoms”;
● “geographic information system”.

Expansion Validation (2)
Rule out wrong expansions (Goal 1)
● Look at the documents and count the occurrences of all expansion terms.
● For every original query term, only keep the top 10 expansions.
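
Goal 1 can be illustrated with a short Python sketch. The function name and data layout are hypothetical. Counting is done by case-insensitive substring matching, and expansions that never occur in the top documents are dropped; both choices are assumptions, since the slide only says that occurrences are counted and the top 10 expansions per original term are kept.

```python
from collections import Counter


def validate_expansions(expansions, top_documents, keep=10):
    """Keep, per original query term, the expansions most frequent in the top documents.

    expansions:    dict mapping an original query term to its candidate expansions
    top_documents: texts of the documents returned by the first retrieval stage
    keep:          number of expansions kept per original term (10 on the slide)
    """
    texts = [doc.lower() for doc in top_documents]
    validated = {}
    for term, candidates in expansions.items():
        counts = Counter()
        for candidate in candidates:
            needle = candidate.lower()
            # Assumption: plain substring counts over all top documents.
            counts[candidate] = sum(text.count(needle) for text in texts)
        # Assumption: expansions with zero occurrences are treated as wrong.
        ranked = [cand for cand, n in counts.most_common() if n > 0]
        validated[term] = ranked[:keep]
    return validated


if __name__ == "__main__":
    candidates = {"GIS": ["gastrointestinal symptoms", "geographic information system"]}
    documents = [
        "Patients reported gastrointestinal symptoms after treatment.",
        "Gastrointestinal symptoms were the most common adverse event.",
    ]
    print(validate_expansions(candidates, documents))
```

In the “GIS” example, only “gastrointestinal symptoms” survives when the top documents are about clinical findings, which is the disambiguation effect the validation step is after.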
Find additional expansions (Goal 2)
● Search for permutations of query terms (the gene “glucosidase, alpha; acid” encodes the “acid alpha-glucosidase” preprotein).
● Include the most frequent permutations in the new query.

Pseudo-Relevance Feedback
Probability that a text passage of length $len$ contains the term $T$:
$P_{contained} = 1 - \left(1 - \frac{f_T}{N}\right)^{len}$
N: corpus size; fT: term frequency
Since $f_T/N$ is usually very small, we can approximate the probability that a text passage of length $len$ contains the term $T$:
$P_{contained} \approx \frac{len \cdot f_T}{N}$
Information of “T inside passage”:
$\log_2\left(\frac{N}{f_T}\right) - \log_2(len)$

Corpus-Based Feedback (1)
Passage feedback
We used MultiText's QAP passage scorer to find passages relevant to a given topic.
FB rule: For every term T that appears in the neighborhood of a relevant passage, increase the term score:
$pscore_T := pscore_T + \log_2\left(\frac{N}{f_T}\right) - \log_2(l)$
N: corpus size; fT: term frequency; l: size of the minimum window that contains both the passage and T

Corpus-Based Feedback (2)
Document feedback
FB rule: For every term T that appears in one of the top 100 documents returned (D):
$dscore_T := dscore_T + w_D \cdot \left(\log_2\left(\frac{N}{f_T}\right) - \log_2(len_D)\right)$
N: corpus size; fT: term frequency; lenD: size of the document; wD: relative document weight
$w_D = score_D \cdot$ [a factor depending on $rank_D$ and a constant $c$; its exact form is garbled in the source text]
scoreD: BM25 score; rankD: BM25 rank (from stage 2)

Google Feedback
FB rule: Send the unexpanded Need to Google.
For every term T that appears in one of the top 20 snippets returned by Google:
$gscore_T := gscore_T + \log_2\left(\frac{N}{f_T}\right) - \log_2(len_S)$
N: corpus size; fT: term frequency (within corpus); lenS: size of the Google snippet returned

Combination of feedback methods
Final feedback score for term T (the top 10 terms are taken):
$score_T = \frac{pscore_T}{pscore_{max}} + \frac{dscore_T}{dscore_{max}} + \frac{gscore_T}{gscore_{max}}$

Results & Discussion (1)
Intermediate results (Title+Need)

Results & Discussion (2)
Domain-specific query expansion
⇒ AcroMed caused the greatest overall improvement. euGenes was not beneficial at all (maybe authors do not use full gene names...).

Results & Discussion (3)
Pseudo-relevance feedback
⇒ Document feedback was best, Google feedback worst. The combination was better than any single technique.
Why did Google perform so poorly?

Conclusion
● We added a number of different techniques (general and domain-specific) to the MultiText system.
● Most of the new techniques increased the retrieval effectiveness of our system.
● Lexical variants, permutations, and acronyms performed best among the domain-specific techniques.
● Google feedback was worse than the corpus-based feedback methods, due to Google's restricted query language.

Thank you!
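
As a supplement to the feedback slides, the term-information measure and the combination of the three feedback scores can be sketched in Python. All function names are hypothetical. Two assumptions are made: a term is credited once per passage, document or snippet in which it appears, and the three normalised scores are summed. The document weight wD is passed in by the caller rather than computed, and the sketch is not the MultiText implementation.

```python
import math
from collections import defaultdict


def term_information(corpus_size, term_freq, length):
    """log2(N / f_T) - log2(len): information of 'T occurs inside a text of this length'."""
    return math.log2(corpus_size / term_freq) - math.log2(length)


def add_feedback(scores, terms, corpus_size, term_freqs, length, weight=1.0):
    """Apply one feedback rule to a passage window, document or snippet.

    scores:     defaultdict(float) holding pscore, dscore or gscore
    terms:      terms found in the text unit
    term_freqs: corpus-wide frequency f_T for every term
    length:     l, len_D or len_S, depending on the feedback method
    weight:     w_D for document feedback, 1.0 otherwise
    """
    for term in set(terms):  # assumption: each term counts once per text unit
        scores[term] += weight * term_information(corpus_size, term_freqs[term], length)


def combine(pscore, dscore, gscore, top_k=10):
    """Sum the max-normalised scores and return the top_k feedback terms."""

    def normalised(scores):
        peak = max(scores.values(), default=0.0)
        if peak <= 0:
            return {}  # nothing useful to contribute from this method
        return {term: value / peak for term, value in scores.items()}

    total = defaultdict(float)
    for scores in (pscore, dscore, gscore):
        for term, value in normalised(scores).items():
            total[term] += value
    return sorted(total, key=total.get, reverse=True)[:top_k]


if __name__ == "__main__":
    pscore = defaultdict(float)
    add_feedback(pscore, ["histone", "h2a"], 1024, {"histone": 4, "h2a": 2}, length=8)
    print(dict(pscore))
    print(combine(pscore, {"histone": 1.0}, {}))
```

Dividing each score by its maximum puts the three methods on the same scale before they are added, so no single method dominates just because it examines more text.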