A Genetic Algorithm using Semantic Relations for Word Sense Disambiguation
Proposal
Michael Hausman

Overview
• Introduction
• Semantic Relations
  – Cost Function 1
  – Cost Function 2
• Genetic Algorithms
• Testing and Deliverables
• Schedule

Introduction
• Word Sense Disambiguation
  – Another way of saying, "which dictionary definition is correct in context"
  – Main problem of the project
• Example: "Time flies like an arrow."
  – Time (noun #1) -- an instance or single occasion for some event
  – Time (noun #5) -- the continuum of experience in which events pass from the future through the present to the past
  – Fly (noun #1) -- two-winged insects characterized by active flight
  – Fly (verb #1) -- travel through the air; be airborne
  – Fly (verb #2) -- move quickly or suddenly
  – Arrow (noun #1) -- a mark to indicate a direction or relation
  – Arrow (noun #2) -- a projectile with a straight thin shaft and an arrowhead on one end and stabilizing vanes on the other; intended to be shot from a bow
  – Etc.

Semantic Relations
• All relations come from WordNet
  – Machine-readable dictionary
  – Contains several semantic relations
• Semantic relations use some aspect of a word to define how closely two words are related
• Early papers focus on one semantic relation
  – Somewhat good results
• Later papers focus on two or three relations
  – Better results than the early papers
• Therefore, this project uses several semantic relations

Semantic Relations: Frequency
• WordNet orders the senses (definitions) of a word by how often each appears in the corpus used to build WordNet
• More common senses are more likely to be correct
  1. Fly -- travel through the air; be airborne (Freq = 1)
  2. Fly -- move quickly or suddenly (Freq = 0.5)

$$\mathrm{Freq}(w) = \frac{\mathrm{SenseTotal}(w) - (\mathrm{Sense}(w) - 1)}{\mathrm{SenseTotal}(w)}$$

SenseTotal(w): the total number of senses for this word
Sense(w): the sense number of the word currently in use

Semantic Relations: Hypernyms
• A hypernym is a more generic way of saying a word
• Many hypernyms in a row create a tree
  – Words become more specific farther down the tree
  – The length of the path between two words measures how similar they are
• Every definition has a different hypernym tree
  – Bank: financial institution vs. bank: ground next to a river

[Figure: hypernym tree with the levels entity, …, organism; person, animal; male, female; boy, girl. The labels D, LSO, A, and B mark the quantities defined below.]

$$\mathrm{Hyp}(w_1, w_2) = \frac{2 \times D}{A + B + 2 \times D}$$

LSO: the most specific common hypernym
D: length from the root to the LSO
A: length from w1 to the LSO
B: length from w2 to the LSO
Equation from Wu and Palmer (1994)

Semantic Relations: Coordinate Sisters
• Two words that have the same hypernym are coordinate sisters
• Slightly more specific ideas from the same concept

[Figure: the same hypernym tree, with the levels entity, …, organism; person, animal; male, female; boy, girl.]

$$\mathrm{Coor}(w_1, w_2) = \begin{cases} 1, & \mathrm{Coor}(w_1) \ni w_2 \\ 0, & \text{otherwise} \end{cases}$$

w1 and w2: the two words being compared
Coor(w*): all of the coordinate sisters of the word
Coor(w1) ∋ w2: true if w2 is any of the coordinate sisters of w1

Semantic Relations: Domain
• The topic or collection that a word belongs to
• A paragraph tends to talk about the same collection
• Computer Science
  – Buffer, drive, cache, program, software
• Sports
  – Skate, backpack, ball, foul, snorkel

$$\mathrm{Dom}(w_1, w_2) = \begin{cases} 1, & \mathrm{Dom}(w_1) = \mathrm{Dom}(w_2) \\ 0, & \text{otherwise} \end{cases}$$

w1 and w2: the two words being compared
Dom(w*): the domain of the given word

Semantic Relations: Synonym
• Two words that can be interchanged in a given context are synonyms
• Many words in a paragraph may be interchangeable
• computer, computing machine, computing device, data processor, electronic computer, information processing system
• sport, athletics
• frolic, lark, rollick, skylark, disport, sport, cavort, gambol, frisk, romp, run around, lark about

$$\mathrm{Syn}(w_1, w_2) = \begin{cases} 1, & \mathrm{Syn}(w_1) \ni w_2 \\ 0, & \text{otherwise} \end{cases}$$

w1 and w2: the two words being compared
Syn(w*): all of the synonyms of the word
Syn(w1) ∋ w2: true if w2 is any of the synonyms of w1

Semantic Relations: Antonyms
• Two words that are opposites of each other in a given context are antonyms
• Comparisons tend to contain opposites
• good, goodness – evil, evilness
• pure – defiled, maculate
• vague – clear, distinct

$$\mathrm{Ant}(w_1, w_2) = \begin{cases} 1, & \mathrm{Ant}(w_1) \ni w_2 \\ 0, & \text{otherwise} \end{cases}$$

w1 and w2: the two words being compared
Ant(w*): all of the antonyms of the word
Ant(w1) ∋ w2: true if w2 is any of the antonyms of w1

Cost Function 1
• Combining several semantic relations gives better results
• Simply separate a solution into every possible word pair combination and add the relations
• The highest scores are the best

$$\mathrm{Cost}(c) = \sum_{\substack{i,j \\ i \neq j}} \left[ \mathrm{Freq}(w_i) + \mathrm{Hyp}(w_i, w_j) + \mathrm{Dom}(w_i, w_j) + \mathrm{Syn}(w_i, w_j) + \mathrm{Ant}(w_i, w_j) + \mathrm{Coor}(w_i, w_j) \right]$$

c: the current solution
wi and wj: the current word pair combination

Cost Function 2
• Semantic relations only work up to a point
  – Always picking the most frequent definition never picks the second definition
• Some semantic relations don't work on all parts of speech (POS)
  – There are no hypernyms for adjectives
• Weighting the semantic relations for each POS could help

$$\mathrm{Cost}(c) = \sum_{\substack{POS,i,j \\ i \neq j}} \left[ \begin{aligned} &\mathrm{Freq}(w_i) \cdot \mathrm{FreqWeight}(POS) + \mathrm{Hyp}(w_i, w_j) \cdot \mathrm{SimWeight}(POS) \\ &+ \mathrm{Dom}(w_i, w_j) \cdot \mathrm{DomWeight}(POS) + \mathrm{Syn}(w_i, w_j) \cdot \mathrm{SynWeight}(POS) \\ &+ \mathrm{Ant}(w_i, w_j) \cdot \mathrm{AntWeight}(POS) + \mathrm{Coor}(w_i, w_j) \cdot \mathrm{CoorWeight}(POS) \end{aligned} \right]$$

c: the current solution
wi and wj: the current word pair combination
POS: the part of speech
FreqWeight(POS), SimWeight(POS), …, CoorWeight(POS): the weight for each relation and part of speech

Cost Function 2: Calculating Weights
1. Take a translated file and separate the parts of speech.
   – Nouns, adjectives, verbs, etc.
2. For each part of speech, find every possible word sense combination. Each word sense combination is a solution.
3. Calculate the semantic relations for every solution.
4. For each semantic relation, make a scatter plot for each part of speech, with every point representing a solution. The x axis is the ratio of correct senses. The y axis is the semantic relation calculation.

Cost Function 2: Calculating Weights (continued)
5. Make a trend line for each scatter plot. This trend line should have the best R² value for the data. However, do not use a trend line that goes below zero.
6. Use these trend lines to make a system of equations. Note that there should be a different system of equations for each part of speech.

$$\begin{aligned} 100 &= \mathrm{Freq}_{noun}(100) \cdot X + \mathrm{Hyp}_{noun}(100) \cdot Y + \dots + \mathrm{Coor}_{noun}(100) \cdot Z \\ 95 &= \mathrm{Freq}_{noun}(95) \cdot X + \mathrm{Hyp}_{noun}(95) \cdot Y + \dots + \mathrm{Coor}_{noun}(95) \cdot Z \\ &\text{etc.} \end{aligned}$$

7. Solve the system of equations for the weights.

Why Genetic Algorithms?
• The cost functions give a way to calculate good solutions, if not the best solution
• With a cost function, this is now an optimization problem
• Large solution space
  – 5 senses per word for 100 words is $5^{100} \approx 7.888 \times 10^{69}$ possible solutions
• Genetic algorithms use a subset of the solutions to make a good solution
• I like them

Genetic Algorithms: Overview
1. Start with a set of solutions (1st Generation)
2. Take the original "parent" solutions and combine them with each other to create a new set of "child" solutions (Mating)
3. Measure the solutions (Cost Function) and keep only the "better" half of the solutions
   – Some may be from the "parent" set, others from the "child" set
4. Introduce some random changes in case solutions are "stuck" or are all the same (Mutation)
5. Repeat starting with Step 2

Testing and Deliverables
• SemCor
  – Princeton University took the Brown Corpus and tagged it with WordNet senses
  – Common resource used by many researchers
  – Many papers reference specific SemCor results
• Testing
  – All training and testing will use SemCor
  – Compare results to several papers
  – Compare results with Michael Billot
• Deliverables include any code, files, binaries, results, and reports from the project, and possibly a paper

Schedule
Date                 Milestone
November 15, 2010    Wrote any code relating to the cost functions or cost function calculations
November 30, 2010    Calculated the weights for the Weighted Semantic Relation Addition Cost Function
December 30, 2010    Wrote code for the mating, mutation, and genetic algorithm portions
January 15, 2011     Tested all code and have results
January 30, 2011     Contact Michael Billot to get his latest results
February ??, 2011    Paper written and submitted to advisors
February ??, 2011    Final presentation

Questions?

Mating
• Dominant Gene Genetic Algorithm Mating
  – Calculate the best genes and focus on them
  – Mating combines two parent solutions to make two child solutions
• Keep the cost for each gene (word sense) as well as for the entire chromosome (disambiguated paragraph)
• Mating two chromosomes
  1. Keep the best genes (top half, top third, above average, etc.) from parent 1
  2. Add the best genes from parent 2
  3. Add the senses from the lower genes from parent 1
  4. Repeat after swapping both parents

Mutations
• Dominant Gene Genetic Algorithm Mutations
  – Mating focuses on dominant (best cost) genes, so mutations must focus on the recessive (lower cost) genes
  – Mutations tend to focus on one solution
• Keep the cost for each gene (word sense) as well as for the entire chromosome (disambiguated paragraph)
• Possible Mutations
  1. Randomly change one recessive gene
  2. Calculate the best sense for a randomly picked semantic relation on a recessive gene

References
1. Zhang, C., Zhou, Y., and Martin, T. 2008. Genetic Word Sense Disambiguation Algorithm. In 2008 Second International Symposium on Intelligent Information Technology Application (IITA), vol. 1, pp. 123-127.
2. Tsuruoka, Y. 2005. A part-of-speech tagger for English. University of Tokyo. http://www-tsujii.is.s.utokyo.ac.jp/~tsuruoka/postagger/
3. Princeton University. 2010. About WordNet. WordNet, Princeton University. http://wordnet.princeton.edu
4. Basile, P., Degemmis, M., Gentile, A. L., Lops, P., and Semeraro, G. 2007. The JIGSAW Algorithm for Word Sense Disambiguation and Semantic Indexing of Documents. In Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence on AI*IA 2007: Artificial Intelligence and Human-Oriented Computing (Rome, Italy, September 10-13, 2007). R. Basili and M. T. Pazienza, Eds. Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, Heidelberg, 314-325.
5. Hausman, M. May 2009. A Dominant Gene Genetic Algorithm for a Transposition Cipher in Cryptography. University of Colorado at Colorado Springs.
6. Hausman, M., and Erickson, D. December 2009. A Dominant Gene Genetic Algorithm for a Substitution Cipher in Cryptography. University of Colorado at Colorado Springs.
7. Princeton University. 2010. The SemCor corpus. MultiSemCor. http://multisemcor.fbk.eu/semcor.php
8. Patwardhan, S., Banerjee, S., and Pedersen, T. 2003. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pp. 241–57, Mexico City, Mexico.
9. Zesch, T., and Gurevych, I. 2010. Wisdom of crowds versus wisdom of linguists – measuring the semantic relatedness of words. Natural Language Engineering, 16, pp. 25-59.
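
Supplementary sketch: Frequency and Hypernym scores

The Frequency and Hypernym formulas from the Semantic Relations slides can be written as two small functions. This is only an illustrative sketch: the function names `freq` and `hyp` and the zero-denominator handling are assumptions, and the path lengths A, B, and D are passed in directly rather than looked up in WordNet.

```python
def freq(sense: int, sense_total: int) -> float:
    """Freq(w) = (SenseTotal(w) - (Sense(w) - 1)) / SenseTotal(w).

    `sense` is the 1-based sense number, `sense_total` the number of senses.
    """
    if not 1 <= sense <= sense_total:
        raise ValueError("sense must be between 1 and sense_total")
    return (sense_total - (sense - 1)) / sense_total


def hyp(a: int, b: int, d: int) -> float:
    """Hyp(w1, w2) = 2*D / (A + B + 2*D), the Wu-Palmer form on the slide.

    a: path length from w1 to the LSO
    b: path length from w2 to the LSO
    d: path length from the root to the LSO
    """
    denominator = a + b + 2 * d
    if denominator == 0:
        # Assumption: the slide does not define this case, so return 0.
        return 0.0
    return 2 * d / denominator


# The two "fly" senses from the Frequency slide, assuming two senses in total.
print(freq(1, 2), freq(2, 2))  # prints "1.0 0.5"
```

Passing the path lengths as plain integers keeps the sketch independent of any particular WordNet interface.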
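
Supplementary sketch: Cost Function 1

Cost Function 1 adds the relation scores over every ordered word pair with i ≠ j. The sketch below follows that sum literally. The names `cost1` and `make_set_relation`, the sense labels, and the toy frequency values are hypothetical stand-ins for real WordNet lookups.

```python
from itertools import permutations
from typing import Callable, Iterable, Sequence, Tuple

Pairwise = Callable[[str, str], float]


def make_set_relation(pairs: Iterable[Tuple[str, str]]) -> Pairwise:
    """Build a 0/1 relation (such as Syn or Ant) from a list of related pairs."""
    table = {frozenset(pair) for pair in pairs}
    return lambda w1, w2: 1.0 if frozenset((w1, w2)) in table else 0.0


def cost1(solution: Sequence[str],
          freq: Callable[[str], float],
          relations: Sequence[Pairwise]) -> float:
    """Sum Freq(wi) plus every pairwise relation over all ordered pairs, i != j."""
    total = 0.0
    # permutations(..., 2) yields each ordered pair of distinct positions once.
    for wi, wj in permutations(solution, 2):
        total += freq(wi) + sum(relation(wi, wj) for relation in relations)
    return total


# Toy data (hypothetical): three chosen senses, each with Freq = 1.0,
# and one synonym pair taken from the Synonym slide.
toy_freq = {"computer#n1": 1.0, "data_processor#n1": 1.0, "program#n1": 1.0}
toy_syn = make_set_relation([("computer#n1", "data_processor#n1")])
solution = ["computer#n1", "data_processor#n1", "program#n1"]
score = cost1(solution, toy_freq.__getitem__, [toy_syn])
print(score)  # 6 ordered pairs * 1.0 for Freq, plus 2 * 1.0 for Syn = 8.0
```

Because the sum runs over ordered pairs, Freq(wi) is counted once for every partner word, and each symmetric relation is counted twice. Cost Function 2 would multiply each term by its per-POS weight before adding.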
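
Supplementary sketch: the genetic algorithm loop

The five steps on the Genetic Algorithms: Overview slide can be sketched as a short loop over chromosomes, where each gene is a sense index for one word. This is a minimal sketch under stated assumptions, not the Dominant Gene mating from the backup slides: the single-point crossover, the parameter defaults, and the name `evolve` are all illustrative choices. It also checks the search-space size quoted on the Why Genetic Algorithms? slide.

```python
import random
from typing import Callable, List, Sequence

Chromosome = List[int]  # one sense index per word

# Search-space size from the slide: 5 senses per word, 100 words.
search_space = 5 ** 100  # a 70-digit integer, roughly 7.888e69


def evolve(sense_counts: Sequence[int],
           cost: Callable[[Chromosome], float],
           pop_size: int = 20,
           generations: int = 50,
           mutation_rate: float = 0.1,
           seed: int = 0) -> Chromosome:
    """Return the best chromosome found; higher cost is better.

    Assumes at least two words, so a crossover point exists.
    """
    rng = random.Random(seed)
    # Step 1: first generation of random solutions.
    population = [[rng.randrange(n) for n in sense_counts]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: mating (assumed single-point crossover).
        children = []
        while len(children) < pop_size:
            parent1, parent2 = rng.sample(population, 2)
            cut = rng.randrange(1, len(sense_counts))
            children.append(parent1[:cut] + parent2[cut:])
        # Step 3: keep the better half of parents plus children.
        pool = sorted(population + children, key=cost, reverse=True)
        population = pool[:pop_size]
        # Step 4: mutation; the current best is left untouched.
        for chromosome in population[1:]:
            if rng.random() < mutation_rate:
                i = rng.randrange(len(chromosome))
                chromosome[i] = rng.randrange(sense_counts[i])
        # Step 5: repeat from step 2.
    return max(population, key=cost)


# Toy cost (hypothetical): count the words whose chosen sense is sense 0.
def toy_cost(chromosome: Chromosome) -> float:
    return float(sum(1 for gene in chromosome if gene == 0))


best = evolve([3] * 8, toy_cost)
print(best, toy_cost(best))
```

Leaving the best chromosome out of the mutation step means the best cost never decreases from one generation to the next. In the project, `toy_cost` would be replaced by one of the two cost functions.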