Unlocking and Sharing LCTL Linguistic Knowledge

With 6,500 languages in the world, we must explore new ways to learn, document, and share our linguistic knowledge.

John J. Kovarik
NSA/CSS Senior Language Technology Authority

Keywords: CFG parsing, language generation, computational linguistics

CALICO ’05, University of Michigan, Ann Arbor, MI, May 17-20, 2005

The Challenges of Learning and Sharing Knowledge of an LCTL in the 21st Century
John J. Kovarik, National Security Agency

Presentation Overview
- General LCTL Challenges
- Challenges of Learning Mongolian
- Recipe for New Approach
- Khalkha Mongolian Parts of Speech
- Mongolian Morphological Affixes
- Method of Lexical Knowledge Representation
- Analyze, Parse, Build Grammar Model, Test
- Iterate Repeatedly

LCTL Learning Challenges
- Fewer Learned Resources to Learn from
- Less Recognition Nationally
- Fewer Opportunities to Document What’s Learned
- Very Few Students to Learn from You
- Almost All Learning Done Manually
- Few Reliable 21st Century Applications
  – Microsoft IME
  – Font

Mongolian Learning Challenges
- Input Method Editor (IME)
  – Microsoft IME
    • Keyboard arranged for native Mongols
    • American Mongolists prefer phonetic keyboard
  – “a” key on Mongolian keyboard mapped to ASCII “a” etc.
- Fonts commonly used on Internet
  – Russian Cyrillic fonts are commonly used
    • “|” and “0” commonly substituted for “ү” and “ө”
    • “у” and “о” often freely extended to “ү” and “ө”

Recipe for a New Approach
- Take a student with a computational linguistics background
- Infuse with curiosity and energy
- Stir in access to the Internet
- Add Mongolian syntax and morphology
- Create morphological analyzer, context-free parser, and grammatical generator for Mongolian
- Resulting lexicons, software, and grammar models can be used by other linguistically adept students

Khalkha Mongolian Parts of Speech
- Declinable Nouns
- Declinable Adjectives
- Inflected Verbs
- Unchanging Adverbs
- Declinable Converbs
- Unchanging Postpositions
- Unchanging Conjunctions
- Unchanging Particles

Mongolian Morphological Affixes
- 27 verbal suffixes denoting tense and mood
- 2 verb infixes denoting verb manner
  – Consultative
  – Passive
- 6 verb paradigms or verb types
- 3 irregular common verbs
- 6 cases in singular and plural number
- Both nouns and adjectives are declined

Lexical Knowledge Representations
- Unchanging adverbs, conjunctions, particles, etc. and irregular verb forms (unchanging.txt file)
- Lemmas of declinable nouns and adjectives (declinables.txt file)
- Inflected verbs and nominalized verbs (regvb.txt file)
- Affix files (casendings.txt, reflex.txt, infixes.txt, vbforms.txt)

Some Examples
- declinables.txt file
  – N нэр; Q хэн
- regverb.txt file
  – V ир; V өс
- Affix files
  – casendings.txt: g ний; d д; a ыг; b оос
  – reflex.txt: аа; ээ; оо
  – infixes.txt: C лц; R лд; P гд
  – vbforms.txt: ipf нө; i1p в; i3p чээ; Ypf охгүй
- unchanging.txt file
  – Pg->талаар; Pc->холбогдуулан

Merge Morphology Knowledge with the Power of the Computer
Wrote yalgah.pl to become a tireless lexical pedagogue
- Searches for identifiable affixes by comparison with the lexical knowledge affix files
- Matches the resulting lemma against the lexical knowledge files of declinables, verbs, and unchanging words
- Depending on whether the lemma can or cannot be matched, outputs:
    • Word/part of speech tag to the standard output file, plus an entry in the expository lexicon (lemma matched)
    • Lemma to the Out Of Vocabulary (oov) file, noting the affixes found (lemma not matched)

Additional Outputs
- Expository Morphology File (named morphlex.txt)
    IR->verb command imperative 2nd person singular
    IREEREY->converb future perfect continuative
    IREG->verb command concessive 3rd person singular/plural
    BAGA->adjective
    HURAL->noun nominative
    IH->adjective
    AJILDAA->reflexive noun dative-locative
    ORLOO->verb indicative second past
- Out Of Vocabulary File (named oov)
    [ункноенахаасаа] (UNKNOWNAHAASAA) WORD 0 LINE 2 FALLS OUTSIDE OF VOCABULARY
    possible reflexive ending <аа>-<AA>
    possible declinable case ending <b>-<аас>-<AAS>
    possible verbal part of speech <Ypf>-<ах>-<AH>
    possible participial/converbal stem <ункноен>--<UNKNOWN>

Feed Analytic Output to Parser
- Developed context-free grammar (CFG) rules for both discourse and newspaper texts
    S->Sbj Prd
    S->Prd
    Sbj->Nn
    Sbj->NP
    NP->Tg Nn
    NP->Tg Ng Nn
    Prd->J
- Wrote parse.pl to validate the CFG rules against input text tagged by part of speech
- When a sentence can be fully parsed, parse.pl outputs a parse tree and an English gloss
    Working on "BAGA HURAL IH AJILDAA ORLOO ."
    ENGLISH GLOSS: large hural great work began .
    The sentence does parse.
    Branch nodes on tree:
    S -> (Sbj Prd)
    Sbj -> (NP)
    NP -> (J Nn)
    Prd -> (NPd Vi2p)
    NPd -> (J Nd)
    POS: J Nn J Nd Vi2p

Feed Output to Generator
- Wrote gramgen.pl to generate sentences based on the lexical, morphological, and syntactic knowledge gained
- Output routinely reviewed for accuracy and for Chomskian explanatory adequacy of the grammar models created for the parser and generator engines

Iterative Process
- First take a new newspaper article or dialogue and run the morphological analyzer on it until all words are listed within the vocabulary (no output in the oov [Out Of Vocabulary] file)
- Run the output through the parser, creating new CFG rules until the new text parses
- Run the generator for a hundred or more examples to ensure adequacy of the new rules

Morpho-analyzer, Parser, Generator Software Led This Student to Deeper Understanding of Mongolian
- A linguistically adept learner can thus write software to help one learn deeper and faster
- Language tool development is thus grounded in gaining and applying language knowledge in a systematic and linguistically principled manner for oneself and others

Contact Information
John Kovarik
- Email: [email protected]
- Home Page: http://www.worldnet.att/~kovariks
- Phone: 443-479-7188
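
The affix-stripping step described for yalgah.pl can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the original script, and it is not the author's code: the function names (`strip_suffix`, `analyze`), the tag-building convention (part of speech plus case code, with "n" assumed for a bare nominative), and the hint wording are assumptions made for illustration. The lexicon entries are only the few sample entries shown on the slides, and vowel-harmony variants of the affixes are ignored.

```python
# Toy lexical knowledge, limited to the sample file entries shown on the slides.
DECLINABLES = {"нэр": "N", "хэн": "Q"}                        # declinables file: lemma -> tag
VERBS = {"ир": "V", "өс": "V"}                                # verb lemma file: lemma -> tag
CASE_ENDINGS = {"ний": "g", "д": "d", "ыг": "a", "оос": "b"}  # casendings.txt
REFLEXIVES = ["аа", "ээ", "оо"]                               # reflex.txt
VERB_FORMS = {"нө": "ipf", "в": "i1p", "чээ": "i3p", "охгүй": "Ypf"}  # vbforms.txt


def strip_suffix(word, suffixes):
    """Yield (stem, suffix) pairs: first the word unchanged, then one pair
    for each listed suffix that the word ends with."""
    yield word, ""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)], suffix


def analyze(word):
    """Return (tag, lemma) when a lemma is matched, else (None, hints).

    Assumption: the tag is the part of speech plus the case or verb-form
    code, e.g. "Nd" for a dative noun; the reflexive ending is not tagged.
    """
    hints = []
    # Declinables: optional reflexive ending, then optional case ending.
    for stem, reflexive in strip_suffix(word, REFLEXIVES):
        if reflexive:
            hints.append(f"possible reflexive ending <{reflexive}>")
        for lemma, case in strip_suffix(stem, CASE_ENDINGS):
            if case:
                hints.append(f"possible declinable case ending <{case}>")
            if lemma in DECLINABLES:
                return DECLINABLES[lemma] + CASE_ENDINGS.get(case, "n"), lemma
    # Verbs: optional tense/mood suffix.
    for lemma, form in strip_suffix(word, VERB_FORMS):
        if form:
            hints.append(f"possible verbal suffix <{form}>")
        if lemma in VERBS:
            return VERBS[lemma] + VERB_FORMS.get(form, ""), lemma
    return None, hints  # out of vocabulary: report the affixes that were seen


if __name__ == "__main__":
    for token in ["нэрд", "ирчээ", "ункноенахаасаа"]:
        tag, info = analyze(token)
        if tag is None:
            print(f"{token} FALLS OUTSIDE OF VOCABULARY: {info}")
        else:
            print(f"{token}/{tag} (lemma {info})")
```

In this sketch an unmatched word is returned together with the affixes that were recognized on it, mirroring the role of the oov file: the learner sees which endings were identified and which stem still needs to be added to the lexicon.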

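The parsing step, validating CFG rules against part-of-speech-tagged text, can be sketched in the same hedged way. This is a small top-down backtracking parser in Python, not a reconstruction of parse.pl. The rule table uses the branch nodes reported for the sample sentence plus two of the listed rules (S->Prd, Sbj->Nn). The function names and the tuple-based tree format are assumptions for illustration.

```python
# Grammar taken from the "Branch nodes on tree" output on the slides,
# plus the listed alternatives S->Prd and Sbj->Nn.
RULES = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["NP"], ["Nn"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}


def parse(symbol, tags, start):
    """Yield (tree, end) for every way `symbol` can derive tags[start:end]."""
    if symbol not in RULES:
        # Terminal symbol: it must equal the part-of-speech tag at `start`.
        if start < len(tags) and tags[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in RULES[symbol]:
        yield from expand(symbol, rhs, tags, start, [])


def expand(symbol, rhs, tags, pos, children):
    """Match the remaining right-hand-side symbols left to right."""
    if not rhs:
        yield (symbol, children), pos
        return
    for child, next_pos in parse(rhs[0], tags, pos):
        yield from expand(symbol, rhs[1:], tags, next_pos, children + [child])


def parse_sentence(tags):
    """Return the first parse tree that covers every tag, or None."""
    for tree, end in parse("S", tags, 0):
        if end == len(tags):
            return tree
    return None


if __name__ == "__main__":
    # POS tags reported for "BAGA HURAL IH AJILDAA ORLOO".
    tree = parse_sentence(["J", "Nn", "J", "Nd", "Vi2p"])
    print("The sentence does parse." if tree else "The sentence does not parse.")
    print(tree)
```

A sentence that returns None marks the point in the iterative process where a new CFG rule has to be written before the text parses. This simple top-down strategy would loop on left-recursive rules, which the rules shown here do not contain.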