PAN Localization Cambodia
KHMER WORD SEGMENTATION
Regional Conference on Localized ICT Development and Dissemination across Asia
12th – 16th January 2009
Novotel Hotel, Vientiane, LAO PDR
CHEA, Sok Huor
Cambodia Country Project Leader
PAN Localization Cambodia
KHMER SCRIPT
´ Khmer is written from left to right
´ A new line starts when horizontal space runs out
´ There is no explicit word boundary like in Latin, Chinese, Japanese languages
´ Almost all techniques of Statistical NLP are based on words
« POS tagging
« Speech synthesis
« Machine translation
« Information retrieval
Problem Identification I
´ Ambiguity Issues:
« Words can combine to form other words
For example:
1. ЮŲЧ˝ = ЮŲЧ˝ or ЮŲЧ |˝
2. ŪĠďďijЊЯŠŊ ũ = ŪĠď|ďijЊ|ЯŠŊ ũ or ŪĠď|ďijЊЯŠŊ ũ or ŪĠďďijЊ|ЯŠŊ ũ
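
The ambiguity described on this slide can be made concrete by listing every way a string splits into lexicon entries. The following is a minimal sketch, not the project's implementation: the function name `all_segmentations` and the Latin toy lexicon are assumptions chosen for readability, standing in for Khmer words.

```python
from functools import lru_cache


def all_segmentations(text, lexicon):
    """Return every way to split `text` into words drawn from `lexicon`."""
    words = frozenset(lexicon)

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                # Extend every segmentation of the remainder with this word.
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return split_from(0)


# Toy lexicon (hypothetical): one compound and its two component words.
toy_lexicon = {"black", "board", "blackboard"}
print(all_segmentations("blackboard", toy_lexicon))
```

With this lexicon the string has two valid readings, one as a single word and one as two words, which is the situation a disambiguation step has to resolve.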
Problem Identification II
« Unknown words Identification
² The words not in lexicon
² Error words
¹ Typographic errors
¹ Cognitive errors
² Abbreviations
² Proper Names
² Derived words
² Compounds
Research Approach
´ Study of problems faced by former researchers
´ Disambiguation methodologies
´ Selection of segmentation method for Khmer
´ Propose method for unknown word detection
´ Testing and improvement
Disambiguation Research:
´ Rule based methods
« Longest Matching Algorithm
« Maximum Matching Algorithm
´ Statistical methods
« N-Gram Models
« Non-Dictionary based WS using decision tree
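
As an illustration of the rule-based family listed above, here is a minimal sketch of greedy longest matching. It assumes a plain set of lexicon entries and falls back to single-character tokens when nothing matches; the function name, the fallback behaviour, and the Latin toy lexicon are assumptions, not details taken from the presentation.

```python
def longest_matching(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest lexicon hit."""
    words = set(lexicon)
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: emit one character as an unknown token.
            tokens.append(text[i])
            i += 1
    return tokens


toy_lexicon = {"the", "them", "men"}  # hypothetical entries
print(longest_matching("themen", toy_lexicon))
```

On this input the greedy choice of "them" leaves "e" and "n" unmatched, although "the" + "men" covers the whole string. Cases like this are one reason a segmenter may compare several candidate segmentations instead of committing to the first longest match.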
Disambiguation Methods:
´ Two Methodologies
« Maximum matching algorithm
« Orthographic syllable Bi-gram model
² Calculates probability by multiplying the frequencies of collocations of two orthographic syllables
« We opted for two methodologies because
² Statistical method requires large corpus
² Corpus generation was part of the project
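
The bi-gram idea above can be sketched with the standard relative-frequency estimate: count how often syllable b follows syllable a in a training corpus, divide by the count of a, and multiply these ratios along a sequence. This is a generic bigram sketch under that assumption and may differ from the exact formula used in the project; the function names and the Latin placeholder syllables are hypothetical.

```python
from collections import Counter


def train_bigrams(sequences):
    """Count single syllables and adjacent syllable pairs in a training corpus."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    return unigrams, bigrams


def sequence_score(seq, unigrams, bigrams):
    """Multiply the relative frequencies count(a, b) / count(a) over adjacent pairs."""
    score = 1.0
    for a, b in zip(seq, seq[1:]):
        if unigrams[a] == 0:
            return 0.0  # unseen syllable: no evidence for this collocation
        score *= bigrams[(a, b)] / unigrams[a]
    return score


# Placeholder corpus: each inner list is one word split into syllables.
corpus = [["ka", "ma", "ra"], ["ka", "ma", "ta"], ["ka", "ra"]]
uni, bi = train_bigrams(corpus)
print(sequence_score(["ka", "ma", "ra"], uni, bi))
print(sequence_score(["ra", "ka"], uni, bi))
```

A sequence seen often in the corpus scores higher than one whose syllables never occur together, which is why the quality of this method depends on corpus size.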
Error Word Detection Method
´ Two types of spelling errors
« Non word
« Real word
´ Survey to identify non word errors
« A group of people was asked to type articles
« We found that 70% of errors were due to phonetic similarity
« e.g. ₤еЮΌ₣ to ₤Юņ₣ or ₤еЮŲ₣
Homophonous Error Detection:
´ Identification of Khmer homophone set
´ Sound shifting issues
Phone        Independent Vowel    Equal Sound
[ ΒЊ ]       Ο                    ΒЊ
[ ΒН ]       Χ                    ΒН
[ ΒРŷ ]      ό                    ΒРŷ
[ ũЖЖ ]      ζ                    ũЖЖ
[ ũК ]       ι                    ũК
[ ŲЕ ]       µ                    ŲЕ
[ ŲЙ ]       ο                    ŲЙ
Solution:
´ Correction based on pronunciation
´ Spellings with the same pronunciation map to one expression
« Khmer Common Expression (KCE)
Dictionary Word List -> KCE Building using Phonetic Rules -> Encoded Word List (KCE of Word List)
Misspelled Word -> KCE Building using Phonetic Rules -> Encoded Misspelled Word (KCE of Misspelled Word)
Search for the KCE of the misspelled word in the encoded list rather than in the dictionary
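
The lookup described on this slide can be sketched as follows: encode every dictionary word with the same phonetic rules, index the dictionary by that code, then encode the misspelled word and look it up in the index. The rewrite rules below are hypothetical Latin placeholders, not the project's Khmer phonetic rules, and the names `to_kce`, `build_kce_index` and `suggest` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rules: each pair rewrites a spelling to the canonical form
# of the same sound. Real KCE rules would operate on Khmer characters.
PHONETIC_RULES = [("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s")]


def to_kce(word, rules=PHONETIC_RULES):
    """Encode a word so that same-sounding spellings collapse to one code."""
    code = word
    for spelling, canonical in rules:
        code = code.replace(spelling, canonical)
    return code


def build_kce_index(dictionary_words, rules=PHONETIC_RULES):
    """Map each code to the dictionary words that share it."""
    index = defaultdict(list)
    for word in dictionary_words:
        index[to_kce(word, rules)].append(word)
    return index


def suggest(misspelled, index, rules=PHONETIC_RULES):
    """Look up the code of the misspelled word in the encoded list."""
    return index.get(to_kce(misspelled, rules), [])


index = build_kce_index(["phone", "cat", "rock"])
print(suggest("fone", index))
print(suggest("dog", index))
```

Because the comparison happens between codes, a phonetically plausible misspelling can be matched with one dictionary lookup instead of an edit-distance scan over the whole word list.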
Word segmentation process:
´ Break words down into Khmer Character Clusters (KCC)
´ Merge these KCCs into possible word segmentations
´ Search the KCE list for the strings formed from KCCs
´ Disambiguation module picks the best among the KCE matches
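
The first step, splitting text into character clusters, can be approximated for Unicode Khmer with a regular expression. The pattern below is a simplified assumption and not the project's KCC rules: a base consonant or independent vowel (U+1780 to U+17B3), followed by any number of subscript consonants (the COENG sign U+17D2 plus a base character) or dependent vowels and signs. Any other character becomes its own unit.

```python
import re

# Simplified cluster pattern (an assumption, not the project's KCC rules).
KCC_PATTERN = re.compile(
    "[\u1780-\u17B3]"                                      # base character
    "(?:\u17D2[\u1780-\u17B3]|[\u17B6-\u17D1\u17D3\u17DD])*"  # subscripts, vowels, signs
    "|.",                                                  # fallback: any single character
    re.DOTALL,
)


def split_kcc(text):
    """Split text into approximate Khmer character clusters."""
    return KCC_PATTERN.findall(text)


# The word "Khmer" in Khmer script: KHA, COENG, MO, vowel AE, RO.
word = "\u1781\u17d2\u1798\u17c2\u179a"
print(split_kcc(word))
```

Working on clusters rather than single code points keeps a subscript or vowel sign attached to its base character, so later merging and lookup steps never consider a break inside a cluster.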
Word Segmentation Process:
Input Sentence -> KCC Segmentation -> Generate the KCC -> KCCs Matching -> Gen. Word Tokens -> Disambiguation -> Output Segmented Sentence
Resources: KCC Rules, KCE List, KCE Rules, Trained Text Corpus
Release:
´ Word segmentation applications for Microsoft and Linux platforms
´ Plug-in for MS Office 2003 & 2007
´ Plug-in for OpenOffice.org Writer
´ Research report of Khmer word segmentation
THANKS