Information Retrieval
Word Sense Disambiguation
Ulf Leser

Ulf Leser: Information Retrieval, Winter Semester 2014/2015

Content of this Lecture
• Word Sense Disambiguation
• Approaches
  – Knowledge-Based
  – Using Classification
  – Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
• Material from
  – Mihalcea, Pedersen: Word Sense Disambiguation, Tutorial at AAAI-2005
  – Schiemann et al. (2008). Word Sense Disambiguation in Biomedical Applications: A Machine Learning Approach. In Prince, V. and Roche, M. (eds.), "Information Retrieval in Biomedicine", IGI Global

Definition
• Word sense disambiguation: Select the correct sense for a word in a context, given a fixed set of senses
  – Knowledge-intensive methods, supervised learning
• Word sense discovery: Find all possible senses of a word, without regard to an existing set
  – Unsupervised techniques, clustering
• "Ambiguity" is ambiguous in itself
  – Homonyms: Same word, different and unrelated meanings
    • 'Sin' and 'soul' are English words and gene names
  – Polysemy: Same word, closely related meanings (often same stem)
    • A gene and its mRNA
    • "Nicht in diesem Ton!" (German: "Not in that tone!")

Example
• The fisherman jumped off the bank and into the water.
• The bank down the street was robbed!
• Back in the day, we had an entire bank of computers devoted to this problem.
• The bank in that road is entirely too steep and is really dangerous.
• The plane took a bank to the left, and then headed off towards the mountains.

Different Tasks
• Study words with multiple meanings (classical WSD)
• Find entities of a certain class with homonyms in English
  – Is the mention of "white" in a sentence a gene name or not?
  – Named Entity Recognition (NER)
• Disambiguate homonyms within a class
  – Which gene does the mention of "TNF-alpha" denote? Which species?
  – Named Entity Normalization (NEN)
• Disambiguate all words in a sentence
  – Ambiguity at the parse level – senses influence each other
  – Sense chaining: "Senses" of neighboring words should be similar

Content of this Lecture
• Word Sense Disambiguation
• Approaches
  – Knowledge-Based
  – Using Classification
  – Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts

Knowledge-Based WSD
• Idea: Use background knowledge about the given set of possible senses
• Source 1: Explicit specifications
  – Dictionaries of definitions: lexica, dictionaries, …
  – Thesauri and ontologies: WordNet, UMLS, MeSH, Gene Ontology, business ontologies, enterprise vocabularies, …
• Source 2: Annotated text
  – Supervised: Needs examples with annotated senses
  – Compute words in the context of the word to disambiguate that are indicative of a sense
  – Compute collocations per sense and find those that are the most discriminating; "one collocation per sense"
  – See: Distributional semantics

Example
• WordNet definitions for all senses of the noun "plant"
  – Note: Definitions often include an example – easier to grasp
  – buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"
  – a living organism lacking the power of locomotion
  – something planted secretly for discovery by another; "he claimed that the evidence against him was a plant"
  – an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
• Other sources: Wikipedia, classical dictionaries ("Wörterbücher"), Freebase, …
  – Wikipedia distinguishes senses and gives ample definitional text
  – Widely used in current research on semantic search
  – Also: Wiktionary – a free dictionary

Usage for Disambiguation
• Idea: Match the context of word w in the text with
words from the sense definitions si
  – The most similar sense wins
  – Use any similarity (relevance) measure: VSM, language models, …
  – May include word weighting (e.g., TF*IDF)
• Properties
  – Simple, effective
  – Not powerful enough for "hard" polysemy (contexts too similar)
  – Depends on good and complete dictionaries
  – For genes, such dictionaries do not really exist
• Idea: Use papers describing a gene as its definition
  – Not a definition, but provides specific context

Using an Ontology (e.g., WordNet)
• Idea: Use the neighborhood of a sense in the ontology as its definition
  – But where does the neighborhood end?
• Score matches by the distance between words in the ontology
• Idea
  – Match words in the context of w (in the sentence) with words in the neighborhood of all senses of w (in the ontology)
  – Score context words based on their semantic similarity to a given sense
    • Difficulty: Quantify the "semantic length" of links
    • Simplest method: distance = number of links
  – Aggregate the scores over all matches per sense

Example – Ontology
[Figure: an example ontology with the concepts Eukaryot, Insecta, Metabolism, habitat, chromosome, leaves, the ambiguous term "dares" (once as a species, once as a gene), and the amino-acid synthesis genes AA syn X/Y/Z, connected by IS-A, HAS-A, HAS-GENE, PART-OF, PRODUCES, LIVES-IN, HAS-LOCATION, and IN-PROCESS links]

Example – Ambiguous Word
[Figure: the same ontology, with the ambiguous word "dares" highlighted]
The metabolism of dares is incapable of synthesizing amino acids Y and Z, but these can be taken up from the leaves populated by this species.
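The definition-matching idea can be sketched as a simplified Lesk-style overlap count. This is a minimal illustration, not the lecture's implementation: the glosses are shortened WordNet-style examples, and the tokenizer is deliberately naive.

```python
# Simplified Lesk-style disambiguation: pick the sense whose
# definition shares the most words with the context of w.
# The glosses below are shortened WordNet-style examples.

def tokenize(text):
    return set(text.lower().replace(".", "").replace(",", "").split())

SENSES = {
    "factory": "buildings for carrying on industrial labor",
    "organism": "a living organism lacking the power of locomotion",
}

def disambiguate(context, senses=SENSES):
    ctx = tokenize(context)
    # Count overlapping words between the context and each sense definition
    scores = {s: len(ctx & tokenize(gloss)) for s, gloss in senses.items()}
    return max(scores, key=scores.get)

print(disambiguate("The plant grows leaves and needs a living soil"))  # organism
```

A real system would replace the raw overlap count with a weighted similarity measure (e.g. TF*IDF-weighted cosine in the VSM), as the slide suggests.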
Example – Simplest Approach to Semantic Similarity
[Figure: the example ontology with all link types treated uniformly]

Example – Matching
[Figure: words from the sentence matched against the neighborhoods of both senses of "dares"]
The metabolism of dares is incapable of synthesizing amino acids Y and Z, but these can be taken up from the leaves populated by dares.

Example – Scoring
• Each match contributes 1/distance to its candidate sense; the contributions are summed per sense:
  – Gene: 1 + 1/2 + 1/3 + 1/4
  – Species: 1 + 1/2 + 1/3 + 1/3

Example – Network Expanded along IS-A
• Gene: 1 + 1/2 + 1/2 + 1/3
• Species: 1 + 1 + 1/2 + 1/2

Two Tricks
• Look at the frequencies of senses in a large corpus
  – Include them as a-priori probabilities in the scores
  – Very effective
  – But beware of domain specificities – use the right corpus for counting
• One sense per discourse
  – The same word occurring multiple times in one document will always have the same meaning
    • Not true for "normal" words, but usually true for specialized, domain-specific terms (proper nouns, abbreviations, …)
  – Highly effective
  – Implicitly broadens the context for inference
• Both tricks work for all approaches to WSD

Content of this Lecture
• Word Sense Disambiguation
• Approaches
  – Knowledge-Based
  – Using Classification
  – Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
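The 1/distance aggregation used in the scoring example can be sketched as follows. The distance lists mirror the slide's example values; they are stand-ins for the link counts one would compute on a real ontology.

```python
# Score each candidate sense by summing 1/d over all matched context
# words, where d is the number of ontology links between the word's
# match and the sense. Distances mirror the slide's example values.

def sense_score(distances):
    """distances: link distance from each matched context word to the sense."""
    return sum(1.0 / d for d in distances)

# Matched context words at the distances from the slide's example
gene = sense_score([1, 2, 3, 4])     # 1 + 1/2 + 1/3 + 1/4
species = sense_score([1, 2, 3, 3])  # 1 + 1/2 + 1/3 + 1/3

print("species" if species > gene else "gene")  # species
```

Note how sensitive the decision is to a single distance: changing one link from 4 to 3 already flips the winner, which is why expanding the network along IS-A (shortening some paths) changes the scores on the next slide.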
Classification-based WSD
• Cast the problem as a multi-class classification problem
• Each sense is one class
• Given training data, learn a model for each class
• The rest is standard classification (VSM, NB, SVM, overfitting, …)
• We may use more features than only the context words
  – POS tag of the word, surrounding POS tags
    • Very important clues for disambiguation
  – Presence of parts of known collocations
  – Phrase heads

Properties
• Requires all senses to be known in advance
• Requires good (and sufficient) training data
  – For each sense of each ambiguous word
  – Which may be a real problem – see the case study
• Essentially a generalization of the knowledge-based approaches
  – Classification instead of matching against the context
• Method of choice if the requirements are met
• Performance to expect
  – Senseval-1: several systems in the range of 74–78% accuracy on the English Lexical Sample task
  – Senseval-2: several systems in the range of 61–64% accuracy
  – Senseval-3: several systems in the range of 70–73% accuracy

Content of this Lecture
• Word Sense Disambiguation
• Approaches
  – Knowledge-Based
  – Using Classification
  – Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts

Using Clustering for WSD
• A method not for sense tagging but for sense discovery
• Idea
  – Collect a large number of sentences containing the (probably) ambiguous word
  – Cluster the sentences/contexts
  – Each cluster should correspond to one sense
• Problem
  – We cannot "label" the clusters
  – We do not learn the senses (in the real sense of the word) but only how many there are and how they differ statistically
  – We can give examples per sense

Evaluation
• Manually – look at the clusters
  – Not easy, especially for highly related senses
• Using a gold standard (if you have one)
  – See if
clustering reproduces the annotation
• Trick: Merged words (pseudowords)
  – Create your own, artificial homonyms
  – Merge two arbitrary words into one (e.g., "bank" and "drink" into "bankdrink")
  – Compute the contexts of the merged word
    • Unifies the contexts of the two original words
  – Cluster
  – See if the original scopes are reproduced

One More Trick: Parallel Texts
• Within a language, senses are often not clearly separable
  – That is why there are no different words for them
• In other languages, this may be different
• If word-aligned translations are available …
  – Sense discovery is simple (different translations)
  – Obtaining training data is simple (all instances with the same translation)
• Parallel texts
  – EU parliament protocols
  – UN texts
  – Canadian official documents
  – Belgian official documents
  – …

Content of this Lecture
• Word Sense Disambiguation
• Approaches
  – Knowledge-Based
  – Using Classification
  – Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts

AliBaba
• How do we know the correct color?

Multi-Class NER
• Many words can denote entities from different classes
  – Some classes are highly ambiguous (cell: 12%, tissue: 22%)
  – Some much less (protein: 0.001%)
• Here: Class-specific dictionaries taken from various sources
  – MeSH, UMLS, UniProt, EntrezGene, OMIM, …

Our Approach
• Rely on the "one sense per discourse" assumption
• Use machine learning (SVM)
  – We built one model for each ambiguous name
  – Multi-class: Evaluate one-against-all
  – The longest distance to the hyperplane wins
• Training data – that is the main trick
• Evaluation: Leave-one-out
  – Recall: This will generally yield better (and more realistic?)
numbers than 10-fold cross-validation

Training Data
• Problem: How to obtain enough exemplary texts
  – We have approx. 1100 ambiguous senses for 531 terms
  – IR does not help – it is fooled by the homonyms
• Various ways
  – Search with unique synonyms (from the dictionary)
  – Search for local explanations containing a synonym
    • "chloramphenicol acetyltransferase (CAT)" versus "posterior vena cava of a rabbit (cat)" versus "dog, fox, cat (Carnivora)"
  – Use class-specific databases (GeneRIFs, DrugBank, OMIM, …)
• Results – not enough yet
  – For 304 of the 531 ambiguous terms: more than 3 texts
  – Thus, 227 ambiguous terms have only 3 or fewer texts

Enrichment: Adding "Dirty" Training Data
• Idea: To characterize the meaning "disease" of an ambiguous word X, descriptions of other diseases might be helpful as well
  – Add "disease-ish" context to disease X if not enough "X-ish" context is available
• Refinement: Use only similar entities of the same class
  – Measured by semantic similarity if the dictionary is an ontology (diseases, cells, tissues, organisms)
  – Else: some other measure (e.g.
orthology for genes)

Results
• Accuracy for the 304 ambiguous terms with more than 3 high-quality training texts (median: 93.7%)
• Accuracy depends strongly on the number of training samples

Even Dirty Training Data Helps
• Adding "dirty" texts until each entity has at least eight texts
• The median F-measure rises from 93.7% to 97%

Breakdown to Class Pairs
• Tissue – cell is hard (not surprising)
• Organism – cell seems very hard

Feature Sets – Performance
• Best F-measure with
  – Few stop words
  – Complete texts
  – TF (no IDF)

Feature Sets – Speed
• Best F-measure with
  – Few stop words
  – Complete texts
  – TF (no IDF)
• But slow …
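The case study's setup – one model per ambiguous name, one-against-all evaluation, TF features without IDF – can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the use of scikit-learn is an assumption, and the toy contexts for the ambiguous name "CAT" are made up.

```python
# One-against-all linear SVM over raw TF features (no IDF), in the
# spirit of the case study: the sense whose hyperplane is farthest
# from the input wins. Toy training contexts, made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = [
    "chloramphenicol acetyltransferase enzyme assay reporter gene",
    "reporter gene expression enzyme activity measured",
    "the cat is a small carnivore species kept as a pet",
    "domestic cat species carnivora felis animal",
]
labels = ["protein", "protein", "organism", "organism"]

tf = CountVectorizer()   # raw term frequencies, no IDF weighting
X = tf.fit_transform(texts)

clf = LinearSVC(C=1.0)   # trains one-vs-rest models for >2 classes;
clf.fit(X, labels)       # predict() picks the largest decision value

test = tf.transform(["enzyme assay with a reporter gene construct"])
print(clf.predict(test)[0])
```

With realistic data one would evaluate via leave-one-out as on the slides, retraining on all but one text per term and predicting the held-out one.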