Information Retrieval
Word Sense Disambiguation
Ulf Leser
Content of this Lecture
• Word Sense Disambiguation
• Approaches
– Knowledge-Based
– Using Classification
– Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
• Material from
– Mihalcea, Pedersen: Word Sense Disambiguation, Tutorial at AAAI-2005
– Schiemann et al. (2008). Word Sense Disambiguation in Biomedical Applications: A
Machine Learning Approach. In Prince, V. and Roche, M. (ed). "Information Retrieval
in Biomedicine", IGI Global
Definition
• Word sense disambiguation: Select the correct sense for a
word in a context given a fixed set of senses
– Knowledge intensive methods, supervised learning
• Word sense discovery: Find all possible senses of a word,
without regard to an existing set
– Unsupervised techniques, clustering
• Ambiguity itself comes in two forms
– Homonyms: Same word, different and unrelated meanings
• 'Sin' and 'soul' are English words and gene names
– Polysemy: Same word, closely related meanings (often same stem)
• Gene and its mRNA
• "Nicht in diesem Ton!" (German: "Not in that tone!")
Example
• The fisherman jumped off the bank and into the water.
• The bank down the street was robbed!
• Back in the day, we had an entire bank of computers
devoted to this problem.
• The bank in that road is entirely too steep and is really
dangerous.
• The plane took a bank to the left, and then headed off
towards the mountains.
Different Tasks
• Study words with multiple meanings (classical WSD)
• Find entities of a certain class whose names are homonyms of English words
– Is the mention of "white" in a sentence a gene name or not?
– Named Entity Recognition (NER)
• Disambiguate homonyms within a class
– To which gene does a mention of "TNF-alpha" refer? To which species?
– Named Entity Normalization (NEN)
• Disambiguate all words in a sentence
– Ambiguity at the parse level: senses influence each other
– Sense chaining: "senses" of neighboring words should be similar
Content of this Lecture
• Word Sense Disambiguation
• Approaches
– Knowledge-Based
– Using Classification
– Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
Knowledge-Based WSD
• Idea: Use background knowledge on the given set of
possible senses
• Source 1: Explicit specifications
– Dictionary of definitions: Lexica, dictionaries, …
– Thesauri and ontologies: Wordnet, UMLS, MESH, Gene Ontology,
business ontologies, enterprise vocabularies, …
• Source 2: Annotated text
– Supervised: Needs examples with annotated senses
– Compute words in the context of the ambiguous word that are indicative of a sense
– Compute collocations per sense and find those that are most discriminating: "one collocation per sense"
– See: Distributional semantics
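To make the second source concrete, here is a minimal sketch (plain Python, hypothetical toy data) of ranking context words by how strongly they indicate one of two senses, using smoothed log-odds:

```python
import math
from collections import Counter

def discriminative_words(contexts_by_sense):
    """contexts_by_sense: dict sense -> list of token lists (annotated examples).
    Returns words ranked by |log-odds| between the two senses."""
    senses = list(contexts_by_sense)
    counts = {s: Counter(w for ctx in contexts_by_sense[s] for w in ctx)
              for s in senses}
    vocab = set(counts[senses[0]]) | set(counts[senses[1]])
    totals = {s: sum(counts[s].values()) for s in senses}
    scores = {}
    for w in vocab:
        # add-one smoothing avoids log(0) for words seen with only one sense
        p0 = (counts[senses[0]][w] + 1) / (totals[senses[0]] + len(vocab))
        p1 = (counts[senses[1]][w] + 1) / (totals[senses[1]] + len(vocab))
        scores[w] = math.log(p0 / p1)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# toy example: 'bank' as financial institution vs. river bank
examples = {
    "bank/finance": [["money", "deposit", "loan"], ["robbed", "money"]],
    "bank/river":   [["water", "fisherman"], ["river", "water", "steep"]],
}
print(discriminative_words(examples)[:5])
```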
Example
• WordNet definitions for all senses of the noun “plant”
– Note: Definitions often include an example – easier to grasp
– buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"
– a living organism lacking the power of locomotion
– something planted secretly for discovery by another; "he claimed
that the evidence against him was a plant"
– an actor situated in the audience whose acting is rehearsed but
seems spontaneous to the audience
• Other: Wikipedia, classical dictionaries, Freebase, …
– Wikipedia distinguishes senses and gives ample definitional text
– Used widely in current research in semantic search
– Also: Wiktionary – free dictionary
Usage for Disambiguation
• Idea: Match the context of word w in the text with the words from each sense definition s_i
– The most similar sense wins
– Use any similarity (relevance) measure: VSM, language model, …
– May include word weighting (e.g. TF*IDF)
• Properties
– Simple and effective
– Not powerful enough for "hard" polysemy (contexts too similar)
– Depends on good and complete dictionaries
– For genes, such dictionaries do not really exist
• Idea: Use papers describing a gene as its definition
• Not a definition, but it provides specific context
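A minimal sketch of the dictionary-matching idea above, in the spirit of a simplified Lesk algorithm (reusing two of the WordNet "plant" definitions from the earlier example; plain word overlap stands in for a full VSM/TF*IDF similarity):

```python
def lesk(context_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the context.
    sense_definitions: dict sense -> definition string."""
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "plant/factory":  "buildings for carrying on industrial labor",
    "plant/organism": "a living organism lacking the power of locomotion",
}
sentence = "the plant grows leaves and shows no locomotion".split()
print(lesk(sentence, senses))  # -> plant/organism
```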
Using an Ontology (e.g. Wordnet)
• Idea: Use neighborhood of a sense in the ontology as
definition
• But where does the neighborhood end?
• Score matches by distance between words in the ontology
• Idea
– Match words in the context of w (in the sentence) with words in
the neighborhood of all senses of w (in the ontology)
– Score context words based on semantic similarity to a given sense
• Difficulty: Quantify the “semantic length” of links
• Simplest method: Distance = “number of links”
– Aggregate scores over all matches per sense
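A minimal sketch of this distance-based scoring (hypothetical adjacency-list encoding of the ontology; BFS yields distance = "number of links", and each matched context word contributes 1/d to the candidate sense):

```python
from collections import deque

def link_distance(graph, start, goal):
    """BFS over an undirected ontology graph; distance = number of links."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not connected

def score_sense(graph, sense_node, context_words):
    """Sum 1/d over context words reachable from the candidate sense."""
    score = 0.0
    for w in context_words:
        d = link_distance(graph, sense_node, w)
        if d:  # skip the node itself (d == 0) and unreachable words (None)
            score += 1.0 / d
    return score

# toy fragment of the example ontology below (undirected adjacency)
graph = {
    "dares/species": ["Insecta", "habitat"],
    "Insecta": ["dares/species", "Eukaryot"],
    "habitat": ["dares/species", "leaves"],
    "leaves": ["habitat"],
    "Eukaryot": ["Insecta"],
}
print(score_sense(graph, "dares/species", ["leaves", "metabolism"]))  # 0.5
```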
Example – Ontology
[Figure: example ontology. Nodes: Eukaryot, Insecta, habitat, Metabolism, chromosome, dares (species), dares (gene), leaves, AA syn X, AA syn Y, AA syn Z; edge types: IS-A, HAS-A, HAS-GENE, PART-OF, PRODUCES, LIVES-IN, HAS-LOCATION, IN-PROCESS.]
Example – Ambiguous Word
[Figure: the same example ontology; the ambiguous word "dares" appears both as a species and as a gene.]
The metabolism of dares is incapable of synthesizing amino acids Y and Z, but these can be taken up from the leaves populated by this species.
Example – Simplest Approach to SemSim
[Figure: the example ontology with both candidate senses of "dares" marked.]
The metabolism of dares is incapable of synthesizing amino acids Y and Z, but these can be taken up from the leaves populated by dares.
Example – Matching
[Figure: context words of the sentence matched against nodes in the neighborhood of each sense of "dares".]
The metabolism of dares is incapable of synthesizing amino acids Y and Z, but these can be taken up from the leaves populated by dares.
Example – Scoring
[Figure: matched context words scored by their link distance to each sense of "dares".]
• Gene: 1+1/2+1/3+1/4
• Species: 1+1/2+1/3+1/3
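(A possible reading, assuming distance = number of links: each matched context word contributes 1/d, so the gene sense sums to 1 + 1/2 + 1/3 + 1/4 ≈ 2.08 and the species sense to 1 + 1/2 + 1/3 + 1/3 ≈ 2.17 – the species reading wins, though narrowly.)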
Example – Expanded Network along ISA
[Figure: the neighborhood expanded along IS-A links, which shortens some distances.]
• Gene: 1+1/2+1/2+1/3
• Species: 1+1+1/2+1/2
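(With the shorter paths after IS-A expansion the margin widens: ≈ 2.33 for the gene sense versus 3.0 for the species sense.)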
Two Tricks
• Look at frequencies of senses in a large corpus
– Include them as a-priori probabilities in the scores
– Very effective
– But beware of domain-specificities – use the right corpus to count
• One-sense-per-discourse
– The same word occurring multiple times in one document will
always have the same meaning
• Not true for "normal" words, but usually true for specialized, domain-specific terms (proper nouns, abbreviations, …)
– Highly effective
– Implicitly broadens the context for inference
• Both tricks work for all approaches to WSD
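A minimal sketch of the first trick (hypothetical counts; smoothed corpus sense frequencies act as a-priori probabilities that rescale the context-based scores):

```python
def rank_with_prior(context_scores, sense_counts):
    """Combine a context-based score with a corpus-derived prior.
    context_scores: dict sense -> score; sense_counts: dict sense -> corpus freq."""
    total = sum(sense_counts.values())
    return sorted(
        ((s, context_scores[s]
              * (sense_counts.get(s, 0) + 1) / (total + len(context_scores)))
         for s in context_scores),
        key=lambda kv: kv[1], reverse=True)

# toy numbers: in a biology corpus, 'plant' is far more often an organism
print(rank_with_prior({"plant/factory": 0.40, "plant/organism": 0.35},
                      {"plant/factory": 20, "plant/organism": 180}))
```

With these toy numbers the organism sense wins (0.35 * 181/202 ≈ 0.31 versus 0.40 * 21/202 ≈ 0.04) despite its lower context score, which is exactly the intended effect of the prior.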
Content of this Lecture
• Word Sense Disambiguation
• Approaches
– Knowledge-Based
– Using Classification
– Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
Classification-based WSD
• Cast the problem as a multi-class classification problem
• Each sense is one class
• Given training data, learn a model for each class
• The rest is usual classification (VSM, NB, SVM, overfitting, …)
• We may use more features than only the context words
– POS tag of the word, surrounding POS tags
• Very important clues for disambiguation
– Existence of parts of known collocations
– Phrase heads
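A minimal sketch of such a setup (scikit-learn, hypothetical annotated contexts; only bag-of-words features here, whereas POS tags and collocation features would be added in practice):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# hypothetical annotated contexts for the ambiguous word 'bank'
contexts = ["the fisherman sat on the bank of the river",
            "the bank was robbed and money was stolen",
            "steep bank of the road near the water",
            "deposit money at the bank downtown"]
senses = ["bank/river", "bank/finance", "bank/river", "bank/finance"]

# TF*IDF context features feeding a linear SVM, one class per sense
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(contexts, senses)
print(model.predict(["he withdrew money from the bank"]))  # expected: bank/finance
```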
Properties
• Requires all senses to be known in advance
• Requires good (and sufficient) training data
– For each sense for each ambiguous word
– Which may be a real problem – see case study
• Essentially a generalization of the KB-approaches
– Classification instead of matching of context
• Method of choice if requirements are met
• Performance to expect
– Senseval-1: several systems in the range of 74-78% accuracy for the English Lexical Sample task
– Senseval-2: several systems in the range of 61-64% accuracy
– Senseval-3: several systems in the range of 70-73% accuracy
Content of this Lecture
• Word Sense Disambiguation
• Approaches
– Knowledge-Based
– Using Classification
– Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
Using Clustering for WSD
• Method not for sense-tagging but for sense discovery
• Idea
– Collect a large number of sentences containing the presumably ambiguous word
– Cluster sentences/contexts
– Each cluster should be one sense
• Problem
– We cannot “label” the cluster
– We do not learn the sense (in the real sense of the word) but only
how many there are and how they differ statistically
– We can give examples per sense
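A minimal sketch of sense discovery by clustering (scikit-learn, hypothetical sentences; note that k, the number of senses, must itself be chosen or estimated):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the fisherman jumped off the bank into the water",
             "the bank down the street was robbed",
             "the river bank was muddy after the rain",
             "she opened an account at the bank"]

# cluster TF*IDF context vectors; each cluster should be one sense
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```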
Evaluation
• Manually – look at clusters
– Not easy, especially for highly related senses
• Using a gold standard (if you have one)
– See if clustering reproduces annotation
• Trick: Merged words
– Create your own artificial homonyms
– Merge two arbitrary words into one (e.g. 'bank' + 'drink' → 'bankdrink')
– Compute contexts of the merged word
• This unifies the contexts of the two original words
– Cluster
– See if the original senses are reproduced
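A minimal sketch of the merged-word trick (plain Python; the original word is kept as a gold label so the clustering can later be checked against it):

```python
def make_pseudoword_data(sentences, word_a, word_b):
    """Replace two unrelated words by one artificial homonym, keeping the
    original word as a gold label for evaluating the clustering."""
    merged = word_a + word_b
    data = []
    for sentence in sentences:
        tokens = sentence.split()
        for original in (word_a, word_b):
            if original in tokens:
                data.append((
                    " ".join(merged if t == original else t for t in tokens),
                    original))  # (pseudo-homonym context, gold sense)
    return data

sentences = ["he sat on the bank", "she had a drink of water"]
print(make_pseudoword_data(sentences, "bank", "drink"))
```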
One more Trick: Parallel Texts
• Within a language, senses are often not clearly separable
– That is why there are no separate words for them
• In other languages, this may be different
• If word-aligned translations are available …
– Sense discovery is simple (different translations)
– Obtaining training data is simple (all instances with the same translation); see the sketch after the list below
• Parallel texts
– EU parliament protocols
– UN texts
– Canadian official documents
– Belgian official documents
– …
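A minimal sketch of harvesting labeled data from word-aligned parallel text (hypothetical aligned pairs; each distinct translation of the ambiguous word is treated as one sense):

```python
from collections import defaultdict

def group_by_translation(aligned_pairs):
    """aligned_pairs: list of (english_sentence, translation_of_ambiguous_word).
    Each distinct translation counts as one sense; its sentences become
    labeled training data for that sense."""
    by_sense = defaultdict(list)
    for sentence, translation in aligned_pairs:
        by_sense[translation].append(sentence)
    return dict(by_sense)

pairs = [("the bank was robbed", "Bank"),        # German: financial institution
         ("he sat on the river bank", "Ufer")]   # German: river bank
print(group_by_translation(pairs))
```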
Content of this Lecture
• Word Sense Disambiguation
• Approaches
– Knowledge-Based
– Using Classification
– Using Clustering
• Case Study: SVM-based WSD for Biomedical Texts
AliBaba
• How do we know the correct color?
Multi-Class NER
• Many words can denote entities from different classes
– Some classes are highly ambiguous (cell: 12%, tissue: 22%)
– Some less (protein: 0.001%)
• Here: Class-specific dictionaries taken from various sources
– MeSH, UMLS, UniProt, EntrezGene, OMIM, …
Our approach
• Rely on “one sense per discourse” assumption
• Use machine learning (SVM)
– We built one model for each ambiguous name
– Multi-class: Evaluate one-against-all
– Longest distance to hyperplane wins
• Training data – that’s the main trick
• Evaluation: Leave-one-out
– Recall: This will generally yield better (and more realistic?)
numbers than 10-fold cross-validation
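A minimal sketch of the described setup (scikit-learn, hypothetical training texts; LinearSVC trains one-against-all models internally, decision_function returns the signed distances to the hyperplanes, and leave-one-out evaluation uses cross_val_score):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# hypothetical training texts for the ambiguous term 'CAT'
texts = ["chloramphenicol acetyltransferase CAT reporter assay",
         "CAT gene expression was measured in transfected cells",
         "the cat was fed twice a day",
         "a domestic cat is a small carnivore",
         "a CAT scan of the head was performed",
         "computed axial tomography CAT revealed a lesion"]
labels = ["gene", "gene", "animal", "animal", "imaging", "imaging"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())

# leave-one-out evaluation, as in the case study
scores = cross_val_score(model, texts, labels, cv=LeaveOneOut())
print("LOO accuracy:", np.mean(scores))

# one-against-all: the class with the longest distance to its hyperplane wins
model.fit(texts, labels)
distances = model.decision_function(["the CAT scan results were normal"])
print(model.classes_[np.argmax(distances)])
```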
Training Data
• Problem: How to obtain enough exemplary texts
– We have approx. 1,100 ambiguous senses for 531 terms
– IR does not help – fooled by homonyms
• Various ways
– Search with unique synonyms (from dictionary)
– Search local explanations containing a synonym
• “chloramphenicol acetyltransferase (CAT)” versus “posterior vena cava
of a rabbit (cat)” versus “dog, fox, cat (Carnivora)”
– Use class-specific databases (Gene-Rifs, DrugBank, OMIM, …)
• Results – not enough yet
– For 304 of the 531 ambiguous terms: more than 3 texts
– Thus, for 227 ambiguous terms: only 3 or fewer texts
Enrichment: Adding “Dirty” Training Data
• Idea: To characterize the meaning "disease" of an ambiguous word X, using descriptions of other diseases might be helpful as well
– Add "disease-ish" context for disease X if not enough "X-ish" context is available
• Refinement: Use only similar entities of the same class
– Measured by semantic similarity if dictionary is an ontology
(diseases, cell, tissue, organism)
– Else: some other measure (e.g. orthology for genes)
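A minimal sketch of the enrichment step (hypothetical helper and data; "dirty" texts from other entities of the same class, ideally from semantically similar entities, pad the training set up to a minimum size):

```python
def enrich_training_set(own_texts, same_class_texts, min_texts=8):
    """Pad an entity's training texts with 'dirty' texts from other
    entities of the same class until min_texts is reached."""
    padded = list(own_texts)
    for text in same_class_texts:
        if len(padded) >= min_texts:
            break
        padded.append(text)
    return padded

disease_x = ["clinical description of disease X ..."]
other_diseases = [f"description of related disease {i} ..." for i in range(10)]
print(len(enrich_training_set(disease_x, other_diseases)))  # -> 8
```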
Results
[Figures: accuracy for the 304 ambiguous terms with more than 3 high-quality training texts (median: 93.7%); accuracy depends strongly on the number of training samples.]
Even Dirty Training Data Helps
• Adding "dirty" texts until each entity has at least eight texts
• Median F-measure rises from 93.7% to 97%
Breakdown to Class - Pairs
• Tissue – cell is hard (not surprising)
• Organism – cell seems very hard
Feature Sets – Performance
• Best F-measure
– Few stop words
– Use complete texts
– Use TF (no IDF)
Feature Sets – Speed
• Best F-measure
– Few stop words
– Use complete texts
– Use TF (no IDF)
• But slow …