A Genetic Algorithm using Semantic Relations for Word Sense Disambiguation
Proposal
Michael Hausman
Overview
• Introduction
• Semantic Relations
– Cost Function 1
– Cost Function 2
• Genetic Algorithms
• Testing and Deliverables
• Schedule
Introduction
• Word Sense Disambiguation
– Another way of saying, “which dictionary definition is correct in context”
– Main problem of the project
• Example: “Time flies like an arrow.”
– Time (noun #1) -- an instance or single occasion for some event
– Time (noun #5) -- the continuum of experience in which events pass from the future through the present to the past
– Fly (noun #1) -- two-winged insects characterized by active flight
– Fly (verb #1) -- travel through the air; be airborne
– Fly (verb #2) -- move quickly or suddenly
– Arrow (noun #1) -- a mark to indicate a direction or relation
– Arrow (noun #2) -- a projectile with a straight thin shaft and an arrowhead on one end and stabilizing vanes on the other; intended to be shot from a bow
– Etc.
Semantic Relations
• All relations come from WordNet
– Machine readable dictionary
– Contains several semantic relations
• Semantic relations use some aspect of a word to define how closely two words are related
• Early papers focus on a single semantic relation
– Reasonably good results
• Later papers combine two or three relations
– Better results than the early papers
• Therefore, this project uses several semantic relations
Semantic Relations: Frequency
• WordNet orders senses (definitions) by how often each appears in the corpus used to build WordNet
• More common senses are more likely to be correct
1. Fly -- travel through the air; be airborne (Freq = 1)
2. Fly -- move quickly or suddenly (Freq = 0.5)

$$\mathit{Freq}(w) = \frac{\mathit{SenseTotal}(w) - (\mathit{Sense}(w) - 1)}{\mathit{SenseTotal}(w)}$$
SenseTotal(w): The total number of senses for this word
Sense(w): the sense number of the word currently in use
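A minimal Python sketch of this score (sense numbers and totals are assumed to come from WordNet's frequency-ordered sense list, with 1-based numbering):

```python
def freq_score(sense_number: int, sense_total: int) -> float:
    """Freq(w) = (SenseTotal(w) - (Sense(w) - 1)) / SenseTotal(w).

    WordNet lists the most frequent sense first, so sense #1 scores 1.0
    and each later sense scores progressively less.
    """
    return (sense_total - (sense_number - 1)) / sense_total

# The slide's toy example: "fly" as a verb with two senses.
assert freq_score(1, 2) == 1.0   # travel through the air; be airborne
assert freq_score(2, 2) == 0.5   # move quickly or suddenly
```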
Semantic Relations: Hypernyms
• A more generic way of saying a word
• Many hypernyms in a row create a tree
– Concepts become more specific farther down the tree
– The length of the path between two words measures how similar they are
• Every definition has a different hypernym tree
– Bank: financial institution vs. bank: ground next to a river
[Figure: hypernym tree running from “entity” at the root down through “organism”, “person”, and “male”/“female” to “boy” and “girl”; the LSO is marked, D is the root-to-LSO path, and A and B are the paths from w1 and w2 to the LSO]

$$\mathit{Hyp}(w_1, w_2) = \frac{2 \times D}{A + B + 2 \times D}$$
LSO: most specific common hypernym
D: length from the root to LSO
A: length from w1 to LSO
B: length from w2 to LSO
Equation from Wu-Palmer (1994)
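A sketch of this measure using NLTK's WordNet interface (assumes the nltk package and its wordnet corpus are installed; wup_similarity is NLTK's built-in Wu-Palmer measure, and the manual version below follows NLTK's node-counting convention, so the two printed values should agree):

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

boy, girl = wn.synset('boy.n.01'), wn.synset('girl.n.01')

# NLTK ships the Wu-Palmer measure directly:
print(boy.wup_similarity(girl))

# The same value from the slide's formula, 2D / (A + B + 2D):
lso = boy.lowest_common_hypernyms(girl)[0]   # most specific common hypernym
d = lso.max_depth() + 1                      # root-to-LSO length
a = boy.shortest_path_distance(lso)          # w1-to-LSO length
b = girl.shortest_path_distance(lso)         # w2-to-LSO length
print(2 * d / (a + b + 2 * d))
```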
Semantic Relations: Coordinate Sisters
• Two words that have the same hypernym are coordinate sisters
• Slightly more specific ideas from the same concept
[Figure: the same hypernym tree as on the previous slide, highlighting words that share a direct hypernym]

$$\mathit{Coor}(w_1, w_2) = \begin{cases} 1, & w_2 \in \mathit{Coor}(w_1) \\ 0, & \text{otherwise} \end{cases}$$

w1 and w2: the two words being compared
Coor(w*): all of the coordinate sisters of the word
Coor(w1) ∋ w2: whether w2 is any of the coordinate sisters of w1
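A minimal NLTK sketch of the sister check (the dog/wolf pair is an illustrative assumption; both are hyponyms of “canine” in WordNet 3.0, so the score should print 1):

```python
from nltk.corpus import wordnet as wn

def coordinate_sisters(synset):
    """Every synset that shares a direct hypernym with `synset`."""
    sisters = set()
    for parent in synset.hypernyms():
        sisters.update(parent.hyponyms())
    sisters.discard(synset)
    return sisters

def coor_score(s1, s2) -> int:
    """Coor(w1, w2): 1 if w2 is a coordinate sister of w1, else 0."""
    return 1 if s2 in coordinate_sisters(s1) else 0

print(coor_score(wn.synset('dog.n.01'), wn.synset('wolf.n.01')))
```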
Semantic Relations: Domain
• The collection or field a word belongs to
• Words in the same paragraph tend to come from the same collection
• Computer Science
– buffer, drive, cache, program, software
• Sports
– skate, backpack, ball, foul, snorkel

$$\mathit{Dom}(w_1, w_2) = \begin{cases} 1, & \mathit{Dom}(w_1) = \mathit{Dom}(w_2) \\ 0, & \text{otherwise} \end{cases}$$
w1 and w2: the two words being compared
Dom(w*): the domain of the given word
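A sketch of the domain indicator. Recent NLTK versions expose WordNet's domain pointers as Synset.topic_domains() (an assumption worth checking against your NLTK release), and treating “same domain” as “any shared topic domain” is my generalization, since synsets can carry several domain pointers or none:

```python
from nltk.corpus import wordnet as wn

def dom_score(s1, s2) -> int:
    """Dom(w1, w2): 1 if the two synsets share a topic domain, else 0.

    Many synsets have no domain pointer at all, in which case this is 0.
    """
    d1 = set(s1.topic_domains())
    d2 = set(s2.topic_domains())
    return 1 if d1 & d2 else 0
```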
Semantic Relations: Synonym
• Two words that can be interchanged in a given context are synonyms
• Many words in a paragraph may be interchangeable
• computer, computing machine, computing device, data processor, electronic computer, information processing system
• sport, athletics
• frolic, lark, rollick, skylark, disport, sport, cavort, gambol, frisk, romp, run around, lark about

$$\mathit{Syn}(w_1, w_2) = \begin{cases} 1, & w_2 \in \mathit{Syn}(w_1) \\ 0, & \text{otherwise} \end{cases}$$
w1 and w2: the two words being compared
Syn(w*): all of the synonyms of the word
Syn(w1) ∋ w2: whether w2 is any of the synonyms of w1
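A minimal sketch of the synonym indicator: in WordNet, two words are synonyms when some synset contains them both (assumes NLTK's WordNet corpus; multi-word lemmas use underscores in lemma names):

```python
from nltk.corpus import wordnet as wn

def syn_score(word1: str, word2: str) -> int:
    """Syn(w1, w2): 1 if any synset of word1 also contains word2, else 0."""
    for synset in wn.synsets(word1):
        if word2 in {lemma.name() for lemma in synset.lemmas()}:
            return 1
    return 0

print(syn_score('sport', 'athletics'))   # 1 -- they share a synset, as on the slide
```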
Semantic Relations: Antonyms
• Two words that are opposites of each other in a given context are antonyms
• Comparisons tend to have opposites
• good, goodness
– evil, evilness
• pure
– defiled, maculate
• vague
– clear, distinct
$$\mathit{Ant}(w_1, w_2) = \begin{cases} 1, & w_2 \in \mathit{Ant}(w_1) \\ 0, & \text{otherwise} \end{cases}$$

w1 and w2: the two words being compared
Ant(w*): all of the antonyms of the word
Ant(w1) ∋ w2: whether w2 is any of the antonyms of w1
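The antonym indicator looks much the same, except that antonymy lives on lemmas rather than synsets in WordNet; NLTK exposes it as Lemma.antonyms():

```python
from nltk.corpus import wordnet as wn

def ant_score(word1: str, word2: str) -> int:
    """Ant(w1, w2): 1 if word2 is an antonym of any sense of word1, else 0."""
    for synset in wn.synsets(word1):
        for lemma in synset.lemmas():
            if word2 in {ant.name() for ant in lemma.antonyms()}:
                return 1
    return 0

print(ant_score('good', 'evil'))   # 1, per the slide's example pair
```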
Cost Function 1
• Combining several semantic relations gives better results
• Simply split a solution into every possible word pair combination and add the relation scores
• Highest scores are the best
$$\mathit{Cost}(c) = \sum_{\substack{i,\,j \\ i \neq j}} \Bigl[\, \mathit{Freq}(w_i) + \mathit{Hyp}(w_i, w_j) + \mathit{Dom}(w_i, w_j) + \mathit{Syn}(w_i, w_j) + \mathit{Ant}(w_i, w_j) + \mathit{Coor}(w_i, w_j) \,\Bigr]$$
c: the current solution
wi and wj: the current word pair combination
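A sketch of Cost Function 1, written against the relation scorers sketched on the earlier slides (the function names and the calling convention are my assumptions, not the project's actual code):

```python
from itertools import permutations

def cost1(senses, pair_relations, freq):
    """Sum the relation scores over every ordered word pair in a solution.

    senses:         the chosen sense for each word in the solution c
    pair_relations: pairwise scorers, e.g. [hyp, dom, syn, ant, coor]
    freq:           the unary frequency scorer
    """
    total = 0.0
    for wi, wj in permutations(senses, 2):   # every pair with i != j
        total += freq(wi) + sum(rel(wi, wj) for rel in pair_relations)
    return total
```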
Cost Function 2
• Semantic Relations only work up to a point
– Always picking the most frequent sense never selects a less frequent one
• Some semantic relations don’t work for every POS
– Adjectives have no hypernyms, for example
• Weighting semantic relations for each POS could help
$$\mathit{Cost}(c) = \sum_{\substack{POS,\,i,\,j \\ i \neq j}} \Bigl[\, \mathit{Freq}(w_i) \cdot \mathit{FreqWeight}(POS) + \mathit{Hyp}(w_i, w_j) \cdot \mathit{SimWeight}(POS) + \mathit{Dom}(w_i, w_j) \cdot \mathit{DomWeight}(POS) + \mathit{Syn}(w_i, w_j) \cdot \mathit{SynWeight}(POS) + \mathit{Ant}(w_i, w_j) \cdot \mathit{AntWeight}(POS) + \mathit{Coor}(w_i, w_j) \cdot \mathit{CoorWeight}(POS) \,\Bigr]$$
c: the current solution
wi and wj: the current word pair combination
POS: the part of speech
***Weight(POS): the weight for the relation and part of speech
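The weighted variant only changes how each term enters the sum; the weights table and the pos_of helper below are hypothetical placeholders for the values produced by the procedure on the next two slides:

```python
from itertools import permutations

def cost2(senses, pair_relations, freq, weights, pos_of):
    """Cost Function 2: each relation score is scaled by a per-POS weight.

    pair_relations: named scorers, e.g. {'Hyp': hyp, 'Dom': dom, ...}
    weights:        weights[pos][relation_name], learned per part of speech
    pos_of:         maps a word sense to its part of speech
    """
    total = 0.0
    for wi, wj in permutations(senses, 2):
        w = weights[pos_of(wi)]
        total += freq(wi) * w['Freq']
        total += sum(rel(wi, wj) * w[name] for name, rel in pair_relations.items())
    return total
```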
Cost Function 2: Calculating Weights
1. Take a sense-tagged (translated) file and separate it by part of speech.
– Nouns, adjectives, verbs, etc.
2. For each part of speech, find every possible word sense combination. Each word sense combination will be a solution.
3. Calculate the semantic relations for every solution.
4. For each semantic relation, make a scatter plot per part of speech with every point representing a solution. The x-axis is the ratio of correct senses; the y-axis is the semantic relation calculation.
Cost Function 2: Calculating Weights
5. Make a trend line for each scatter plot. This trend line should have the best R² value for the data. However, do not use a trend line that goes below zero.
6. Use these trend lines to make a system of equations. Note that there should be a different system of equations for each part of speech.

$$100 = \mathit{Freq}_{noun}(100)\,X + \mathit{Hyp}_{noun}(100)\,Y + \dots + \mathit{Coor}_{noun}(100)\,Z$$
$$95 = \mathit{Freq}_{noun}(95)\,X + \mathit{Hyp}_{noun}(95)\,Y + \dots + \mathit{Coor}_{noun}(95)\,Z, \quad \text{etc.}$$
7. Solve the system of equations for the weights
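Step 7 amounts to solving an overdetermined linear system, which least squares handles directly. A sketch with NumPy for one part of speech (nouns), where every number is a made-up stand-in for a trend line evaluated at an accuracy level:

```python
import numpy as np

# Right-hand side: the accuracy levels (ratio of correct senses, in percent).
accuracies = np.array([100.0, 95.0, 90.0, 85.0, 80.0, 75.0, 70.0])

# One row per accuracy level; one column per relation's noun trend line
# evaluated at that level (Freq, Hyp, Dom, Syn, Ant, Coor). Illustrative only.
relation_values = np.array([
    [0.90, 0.80, 0.30, 0.20, 0.05, 0.40],
    [0.86, 0.76, 0.28, 0.19, 0.05, 0.38],
    [0.82, 0.72, 0.26, 0.17, 0.04, 0.36],
    [0.78, 0.68, 0.24, 0.16, 0.04, 0.34],
    [0.74, 0.64, 0.22, 0.14, 0.03, 0.32],
    [0.70, 0.60, 0.20, 0.13, 0.03, 0.30],
    [0.66, 0.56, 0.18, 0.11, 0.02, 0.28],
])

# Least-squares solution: one weight per relation for this part of speech.
weights, *_ = np.linalg.lstsq(relation_values, accuracies, rcond=None)
print(weights)
```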
Why Genetic Algorithms?
• The cost functions give a way to calculate good solutions, if not the best solution
• With a cost function, this is now an optimization problem
• Large solution space
– 5 senses per word for 100 words gives $5^{100} \approx 7.888 \times 10^{69}$ possible solutions
• Genetic algorithms use a subset of the solutions to make a good solution
• I like them ☺
Genetic Algorithms: Overview
1. Start with a set of solutions (1st Generation)
2. Take original “parent” solutions and combine them with each other to create a new set of “child” solutions (Mating)
3. Somehow measure the solutions (Cost Function) and only keep the “better” half of the solutions
– Some may be from the “parent” set, others from the “child” set
4. Introduce some random changes in case solutions are “stuck” or are all the same (Mutation)
5. Repeat starting with Step 2
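A generic skeleton of that loop (a sketch of the standard procedure, not the project's exact implementation; mate and mutate are sketched on the later slides, and mutate may apply changes only with some probability):

```python
import random

def genetic_algorithm(initial, cost, mate, mutate, generations=100):
    population = list(initial)                        # step 1: first generation
    for _ in range(generations):                      # step 5: repeat
        children = [mate(random.choice(population), random.choice(population))
                    for _ in population]              # step 2: mating
        pool = population + children
        pool.sort(key=cost, reverse=True)             # step 3: score with the cost function...
        pool = pool[: len(population)]                # ...and keep the better half
        population = [mutate(s) for s in pool]        # step 4: mutation
    return max(population, key=cost)
```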
Testing and Deliverables
• SemCor
– Princeton University took the Brown Corpus and tagged it with WordNet senses
– Common resource used by many researchers
– Many papers reference specific SemCor results
• Testing
– All training and testing will use SemCor
– Compare results to several papers
– Compare results with Michael Billot
• Deliverables include any code, files, binaries, results, and reports from the project, and possibly a paper
Schedule

November 15, 2010: Wrote any code relating to the cost functions or cost function calculations
November 30, 2010: Calculated the weights for the Weighted Semantic Relation Addition Cost Function
December 30, 2010: Wrote code for the mating, mutation, and genetic algorithm portions
January 15, 2011: Tested all code and have results
January 30, 2011: Contact Michael Billot to get his latest results
February ??, 2011: Paper written and submitted to advisors
February ??, 2011: Final presentation
Questions?
Mating
• Dominant Gene Genetic Algorithm Mating
– Somehow calculate the best genes and focus on them
– Mating combines two parent solutions to make two child solutions
• Keep the cost for each gene (word sense) as well as the entire chromosome (disambiguated paragraph)
• Mating two chromosomes
1. Keep the best genes (top half, top third, above average, etc.) from parent 1
2. Add the best genes from parent 2
3. Add the senses from the lower genes of parent 1
4. Repeat after swapping both parents
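A sketch of that mating procedure, assuming a chromosome is a list of (sense, gene_cost) pairs so that gene costs travel with each word sense, as the slide requires; “best genes” here means the top half by gene cost:

```python
def mate(parent1, parent2, keep_ratio=0.5):
    n = len(parent1)
    k = max(1, int(n * keep_ratio))
    child = [None] * n

    # 1. Keep the best genes from parent 1 (top half by gene cost).
    for i in sorted(range(n), key=lambda i: parent1[i][1], reverse=True)[:k]:
        child[i] = parent1[i]

    # 2. Add the best genes from parent 2 into positions still empty.
    for i in sorted(range(n), key=lambda i: parent2[i][1], reverse=True)[:k]:
        if child[i] is None:
            child[i] = parent2[i]

    # 3. Fill what remains with parent 1's lower (recessive) genes.
    for i in range(n):
        if child[i] is None:
            child[i] = parent1[i]
    return child

# 4. Swapping the parents yields the second child:
#    child1, child2 = mate(p1, p2), mate(p2, p1)
```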
Mutations
• Dominant Gene Genetic Algorithm Mutations
– Mating focuses on dominant (best cost) genes, so mutations must focus on the recessive (lower cost) genes
– Mutations tend to focus on one solution
• Keep the cost for each gene (word sense) as well as the entire chromosome (disambiguated paragraph)
• Possible Mutations
1. Randomly change one recessive gene
2. Literally calculate the best sense for a randomly picked semantic relation on a recessive gene
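A sketch of the first mutation, under the same (sense, gene_cost) chromosome layout; sense_count is a hypothetical helper returning how many WordNet senses the word at position i has:

```python
import random

def mutate(chromosome, sense_count, recessive_ratio=0.5):
    """Randomly re-pick the sense of one recessive (low-cost) gene."""
    n = len(chromosome)
    # Recessive genes: the lower-cost half of the chromosome.
    recessive = sorted(range(n), key=lambda i: chromosome[i][1])
    recessive = recessive[: max(1, int(n * recessive_ratio))]

    i = random.choice(recessive)
    new_sense = random.randrange(1, sense_count(i) + 1)   # 1-based sense number

    mutated = list(chromosome)
    mutated[i] = (new_sense, 0.0)   # gene cost is recomputed by the cost function
    return mutated
```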
References
1. Zhang, C., Zhou, Y., and Martin, T. 2008. "Genetic Word Sense Disambiguation Algorithm." In Proceedings of the Second International Symposium on Intelligent Information Technology Application (IITA), vol. 1, pp. 123-127.
2. Tsuruoka, Y. 2005. "A part-of-speech tagger for English." University of Tokyo. http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/
3. Princeton University. 2010. "About WordNet." WordNet, Princeton University. http://wordnet.princeton.edu
4. Basile, P., Degemmis, M., Gentile, A. L., Lops, P., and Semeraro, G. 2007. "The JIGSAW Algorithm for Word Sense Disambiguation and Semantic Indexing of Documents." In Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence (AI*IA 2007): Artificial Intelligence and Human-Oriented Computing (Rome, Italy, September 10-13, 2007), R. Basili and M. T. Pazienza, Eds. Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, Heidelberg, pp. 314-325.
5. Hausman, M. 2009. "A Dominant Gene Genetic Algorithm for a Transposition Cipher in Cryptography." University of Colorado at Colorado Springs, May 2009.
6. Hausman, M. and Erickson, D. 2009. "A Dominant Gene Genetic Algorithm for a Substitution Cipher in Cryptography." University of Colorado at Colorado Springs, December 2009.
7. Princeton University. 2010. "The SemCor corpus." MultiSemCor. http://multisemcor.fbk.eu/semcor.php
8. Patwardhan, S., Banerjee, S., and Pedersen, T. 2003. "Using measures of semantic relatedness for word sense disambiguation." In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico, pp. 241-257.
9. Zesch, T. and Gurevych, I. 2010. "Wisdom of crowds versus wisdom of linguists: measuring the semantic relatedness of words." Natural Language Engineering, 16, pp. 25-59.
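10. Wu, Z. and Palmer, M. 1994. "Verb semantics and lexical selection." In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94), Las Cruces, New Mexico, pp. 133-138.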