PAN Localization Cambodia

KHMER WORD SEGMENTATION

Regional Conference on Localized ICT Development and Dissemination across Asia
12th – 16th January 2009, Novotel Hotel, Vientiane, Lao PDR

CHEA, Sok Huor
Cambodia Country Project Leader


KHMER SCRIPT

* Khmer is written from left to right
* A new line starts when horizontal space runs out
* There is no explicit word boundary, unlike in Latin script (Chinese and Japanese also lack one)
* Almost all techniques of statistical NLP are based on words
    * POS tagging
    * Speech synthesis
    * Machine translation
    * Information retrieval


Problem Identification I

* Ambiguity issues:
    * Words can combine to form other words
      For example:
      1. ЮŲЧ˝ = ЮŲЧ˝ or ЮŲЧ |˝
      2. ŪĠďďijЊЯŠŊ ũ = ŪĠď ŪĠď| |ďijЊ|ЯŠŊ ũ or ŪĠď| ŪĠď |ďijЯŠŊ ďijЊЯŠũũ or ŪĠďďijЊ|ЯŠŊ ũ


Problem Identification II

* Unknown word identification
    * Words not in the lexicon
    * Error words
        * Typographic errors
        * Cognitive errors
    * Abbreviations
    * Proper names
    * Derived words
    * Compounds


Research Approach

* Study of problems faced by former researchers
* Disambiguation methodologies
* Selection of a segmentation method for Khmer
* Proposal of a method for unknown word detection
* Testing and improvement


Disambiguation Research:

* Rule-based methods
    * Longest Matching Algorithm
    * Maximum Matching Algorithm
* Statistical methods
    * N-gram models
    * Non-dictionary-based word segmentation using a decision tree


Disambiguation Methods:

* Two methodologies
    * Maximum matching algorithm
    * Orthographic syllable bi-gram model
        * Calculates probability by multiplying the frequencies of two-syllable orthographic collocations
    * We opted for two methodologies because
        * The statistical method requires a large corpus
        * Corpus generation was part of the project


Error Word Detection Method

* Two types of spelling errors
    * Non-word
    * Real word
* Survey to identify non-word errors
    * A group of people was asked to type articles
    * We found that 70% of errors were due to phonetic similarity
    * e.g. ₤е ₤ЮΌ₣ ЮΌ₣ to ₤Юņų ₤Юņ₣ ₣ or ₤е ₤ЮŲ₣ ЮŲ₣


Homophonous Error Detection: Identification of Khmer homophone set

* Sound shifting issues

Phone       Independent Vowel   Equal Sound
[ ΒЊ ]      Ο                   ΒЊ
[ ΒН ]      Χ                   ΒН
[ ΒРŷ ]     ό                   ΒРŷ
[ ũЖЖ ]     ζ                   ũЖЖ
[ ũК ]      ι                   ũК
[ ŲЕ ]      µ                   ŲЕ
[ ŲЙ ]      ο                   ŲЙ


Solution:

* Correction based on pronunciation
* Words with the same pronunciation are reduced to one expression
    * Khmer Common Expression (KCE)

Dictionary Word List -> KCE Building using Phonetic Rules -> Encoded Word List (KCE of Word List)
Misspelled Word -> KCE Building using Phonetic Rules -> Encoded Misspelled Word (KCE of Misspelled Word)

Search for the KCE of the misspelled word in the encoded list rather than in the dictionary.


Word segmentation process:

* Break words down into Khmer Character Clusters (KCC)
* Merge these KCCs into possible word segmentations
* Search the KCE list for the strings made from KCCs
* The disambiguation module picks the best among the KCE matches


Word Segmentation Process:

Input Sentence -> KCC Segmentation -> Generate the KCCs -> KCC Matching -> Generate Word Tokens -> Disambiguation -> Output Segmented Sentence

Resources: KCC Rules, KCE List, KCE Rules, Trained Text Corpus


Release:

* Word segmentation applications for Microsoft and Linux platforms
* Plug-in for MS Office 2003 & 2007
* Plug-in for OpenOffice.org Writer
* Research report on Khmer word segmentation


THANKS
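
Illustrative sketch

The maximum matching step and the KCE lookup described in the slides can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the input is assumed to be already split into clusters, Latin placeholder syllables stand in for Khmer Character Clusters, and the names max_match, kce, build_kce_index, suggest and the PHONETIC_RULES table are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical phonetic rules: letters assumed to sound alike map to one code.
# The real KCE rules for Khmer are not given in the slides.
PHONETIC_RULES: Dict[str, str] = {"c": "k", "q": "k", "z": "s"}


def max_match(clusters: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching over pre-split clusters."""
    words: List[str] = []
    i = 0
    while i < len(clusters):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(clusters) - i), 0, -1):
            candidate = "".join(clusters[i:i + n])
            # A lone cluster is accepted even when unknown, so the loop always advances.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def kce(word: str) -> str:
    """Encode a word so that similar-sounding spellings share one key."""
    return "".join(PHONETIC_RULES.get(ch, ch) for ch in word)


def build_kce_index(word_list: List[str]) -> Dict[str, List[str]]:
    """Map each phonetic key to the dictionary words that produce it."""
    index: Dict[str, List[str]] = {}
    for word in word_list:
        index.setdefault(kce(word), []).append(word)
    return index


def suggest(misspelled: str, index: Dict[str, List[str]]) -> List[str]:
    """Look up the key of a misspelled word in the encoded list."""
    return index.get(kce(misspelled), [])


if __name__ == "__main__":
    lexicon = {"kata", "katana", "naka"}
    print(max_match(["ka", "ta", "na", "ka"], lexicon))  # ['katana', 'ka']
    index = build_kce_index(["kata", "sona"])
    print(suggest("cata", index))  # ['kata']
```

In the toy run, greedy matching returns "katana" followed by "ka" even though "kata" followed by "naka" is also a valid split of the same clusters. That is the kind of ambiguity a separate disambiguation step, such as the bi-gram model mentioned in the slides, would have to resolve.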