With 6,500 languages in the world, we must explore new ways to learn, document, and share our linguistic knowledge.

John J. Kovarik
NSA/CSS Senior Language Technology Authority

Unlocking and Sharing LCTL Linguistic Knowledge
Keywords: CFG parsing, language generation, computational linguistics
CALICO ’05, University of Michigan, Ann Arbor, MI, May 17-20, 2005

The Challenges of Learning and Sharing Knowledge of an LCTL in the 21st Century
John J. Kovarik, National Security Agency

Presentation Overview
  General LCTL Challenges
  Challenges of Learning Mongolian
  Recipe for New Approach
  Khalkha Mongolian Parts of Speech
  Mongolian Morphological Affixes
  Method of Lexical Knowledge Representation
  Analyze, Parse, Build Grammar Model, Test
  Iterate Repeatedly

LCTL Learning Challenges
  Fewer Learned Resources to Learn from
  Less Recognition Nationally
  Fewer Opportunities to Document What’s Learned
  Very Few Students to Learn from You
  Almost All Learning Done Manually
  Few Reliable 21st Century Applications
    – Microsoft IME
    – Font

Mongolian Learning Challenges
  Input Method Editor (IME)
    – Microsoft IME
      • Keyboard arranged for native Mongols
      • American Mongolists prefer phonetic keyboard
        – “a” key on Mongolian keyboard mapped to ASCII “a”, etc.
  Fonts commonly used on the Internet
    – Russian Cyrillic fonts are commonly used
      • “|” and “0” commonly substituted for “ү” and “ө”
      • “у” and “о” are often used loosely to stand for “ү” and “ө” as well

Recipe for a New Approach
  Take a student with a computational linguistics background
  Infuse with curiosity and energy
  Stir in access to the Internet
  Add Mongolian syntax and morphology
  Create morphological analyzer, context-free parser, and grammatical generator for Mongolian
  Resulting lexicons, software, and grammar models can be used by other linguistically adept students

Khalkha Mongolian Parts of Speech
  Declinable Nouns
  Declinable Adjectives
  Inflected Verbs
  Unchanging Adverbs
  Declinable Converbs
  Unchanging Postpositions
  Unchanging Conjunctions
  Unchanging Particles

Mongolian Morphological Affixes
  27 verbal suffixes denoting tense and mood
  2 verb infixes denoting verb manner
    – Consultative
    – Passive
  6 verb paradigms or verb types
  3 irregular common verbs
  6 cases in singular and plural number
  Both nouns and adjectives are declined

Lexical Knowledge Representations
  Unchanging adverbs, conjunctions, particles, etc.
    and irregular verb forms (unchanging.txt file)
  Lemmas of declinable nouns and adjectives (declinables.txt file)
  Inflected verbs and nominalized verbs (regvb.txt file)
  Affix files (casendings.txt, reflex.txt, infixes.txt, vbforms.txt)

Some Examples
  declinables.txt file
    – N нэр
    – Q хэн
  regverb.txt file
    – V ир
    – V өс
  Affix files
    – casendings.txt
        g ний
        d д
        a ыг
        b оос
    – reflex.txt
        аа
        ээ
        оо
    – infixes.txt
        C лц
        R лд
        P гд
    – vbforms.txt
        ipf нө
        i1p в
        i3p чээ
        Ypf охгүй
  unchanging.txt file
    – Pg->талаар
    – Pc->холбогдуулан

Merge Morphology Knowledge with the Power of the Computer
  Wrote yalgah.pl to become a tireless lexical pedagogue
  Searches for identifiable affixes by comparison with the lexical knowledge affix files
  Matches the resulting lemma against the lexical knowledge files of declinables, verbs, and unchanging words
  Depending on whether the lemma can be matched, outputs:
    • If matched: word/part of speech tag to standard output file, plus an entry in the expository lexicon
    • If not matched: lemma to Out Of Vocabulary (oov) file, noting affixes found

Additional Outputs
  Expository Morphology File (named morphlex.txt)
    IR->verb command imperative 2nd person singular
    IREEREY->converb future perfect continuative
    IREG->verb command concessive 3rd person singular/plural
    BAGA->adjective
    HURAL->noun nominative
    IH->adjective
    AJILDAA->reflexive noun dative-locative
    ORLOO->verb indicative second past
  Out Of Vocabulary File (named oov)
    [ункноенахаасаа] (UNKNOWNAHAASAA) WORD 0 LINE 2 FALLS OUTSIDE OF VOCABULARY
    possible reflexive ending <аа>-<AA>
    possible declinable case ending <b>-<аас>-<AAS>
    possible verbal part of speech <Ypf>-<ах>-<AH>
    possible participial/converbal stem <ункноен>--<UNKNOWN>

Feed Analytic Output to Parser
  Developed context-free grammar (CFG) rules for both discourse and newspaper texts
    S->Sbj Prd
    S->Prd
    Sbj->Nn
    Sbj->NP
    NP->Tg Nn
    NP->Tg Ng Nn
    Prd->J
  Wrote parse.pl to validate CFG rules against input text tagged by part of speech
  When each
  sentence can be fully parsed, outputs a parse tree and an English gloss.
    Working on "BAGA HURAL IH AJILDAA ORLOO ."
    ENGLISH GLOSS: large hural great work began .
    The sentence does parse.
    Branch nodes on tree:
    S -> (Sbj Prd)
    Sbj -> (NP)
    NP -> (J Nn)
    Prd -> (NPd Vi2p)
    NPd -> (J Nd)
    POS: J Nn J Nd Vi2p

Feed Output to Generator
  Wrote gramgen.pl to generate sentences based on the lexical, morphological, and syntactic knowledge gained
  Output routinely reviewed for accuracy and for Chomskian explanatory adequacy of the grammar models created for the parser and generator engines

Iterative Process
  First take a new newspaper article or dialogue and run the morphological analyzer on it until all words are listed within vocabulary (no output in the oov [Out Of Vocabulary] file)
  Run output through parser, creating new CFG rules until the new text parses
  Run generator for a hundred or more examples to ensure adequacy of new rules

Morpho-analyzer, Parser, Generator Software Led This Student to Deeper Understanding of Mongolian
  A linguistically adept learner can thus write software to help learn more deeply and faster
  Language tool development is thus grounded in gaining and applying language knowledge in a systematic and linguistically principled manner for oneself and others

Contact Information
  John Kovarik
  Email: [email protected]
  Home Page: http://www.worldnet.att/~kovariks
  Phone: 443-479-7188
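
Sketch: Checking CFG Rules Against a Tagged Sentence
  The parse.pl step described in "Feed Analytic Output to Parser" can be illustrated with a small recognizer. This is not the author's Perl code, only a minimal Python sketch under stated assumptions. The function names (expand, expand_seq, parse, show) and the bracketed tree format are hypothetical. The grammar is copied from the "Branch nodes on tree" output, plus S->Prd and Sbj->Nn from the CFG rule list. The sentence-final "." is assumed to be dropped before parsing, because the POS line in the sample output omits it.

```python
# Rules taken from the sample parse output ("Branch nodes on tree") plus two
# alternatives from the CFG rule list. Any symbol that never appears on a
# left-hand side is treated as a part-of-speech tag (a terminal).
GRAMMAR = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["NP"], ["Nn"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}


def expand(symbol, tags, pos):
    """Yield (tree, next_pos) for each way `symbol` can cover tags[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the tag at the current position.
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in expand_seq(rhs, tags, pos):
            yield (symbol, children), end


def expand_seq(symbols, tags, pos):
    """Yield (list_of_trees, next_pos) for a sequence of right-hand-side symbols."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tags, pos):
        for rest, end in expand_seq(symbols[1:], tags, mid):
            yield [tree] + rest, end


def parse(tags):
    """Return the first tree for S that consumes every tag, or None."""
    for tree, end in expand("S", tags, 0):
        if end == len(tags):
            return tree
    return None


def show(tree):
    """Render a tree as a bracketed string (format chosen for this sketch)."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + label + " " + " ".join(show(c) for c in children) + ")"


if __name__ == "__main__":
    # Tags for "BAGA HURAL IH AJILDAA ORLOO", as listed on the POS line.
    tree = parse(["J", "Nn", "J", "Nd", "Vi2p"])
    print("The sentence does parse." if tree else "The sentence does not parse.")
    if tree:
        print(show(tree))
```

  Top-down backtracking is enough here because none of the rules shown is left-recursive; a rule such as NP->NP Nn would make this sketch loop forever and would call for a chart parser instead. When parse returns None, that corresponds to the point in the iterative process where new CFG rules are written until the new text parses.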