With 6,500 languages in the world,
we must explore
new ways to learn, document, and share
our linguistic knowledge.
John J. Kovarik
NSA/CSS Senior Language Technology Authority
Unlocking and Sharing LCTL
Linguistic Knowledge
Keywords: CFG parsing, language generation,
computational linguistics
CALICO ’05
University of Michigan
Ann Arbor, MI May 17-20, 2005
The Challenges of Learning
and Sharing Knowledge of an
LCTL in the 21st Century
John J. Kovarik
National Security Agency
Presentation Overview
▪ General LCTL Challenges
▪ Challenges of Learning Mongolian
▪ Recipe for a New Approach
▪ Khalkha Mongolian Parts of Speech
▪ Mongolian Morphological Affixes
▪ Method of Lexical Knowledge Representation
▪ Analyze, Parse, Build Grammar Model, Test
▪ Iterate Repeatedly
LCTL Learning Challenges
▪ Fewer Learned Resources to Learn From
▪ Less Recognition Nationally
▪ Fewer Opportunities to Document What’s Learned
▪ Very Few Students to Learn from You
▪ Almost All Learning Done Manually
▪ Few Reliable 21st Century Applications
– Microsoft IME
– Font
Mongolian Learning Challenges
▪ Input Method Editor (IME)
– Microsoft IME
• Keyboard arranged for native Mongols
• American Mongolists prefer phonetic keyboard
– “a” key on Mongolian keyboard mapped to ASCII “a” etc.
▪ Fonts commonly used on Internet
– Russian Cyrillic fonts are commonly used
• “|” and “0” commonly substituted for “ү” and “ө”
• “у” and “о” often freely extended to stand for “ү” and “ө”
Recipe for a New Approach
▪ Take a student with a computational linguistics background
▪ Infuse with curiosity and energy
▪ Stir in access to the Internet
▪ Add Mongolian syntax and morphology
▪ Create morphological analyzer, context-free parser, and grammatical generator for Mongolian
▪ Resulting lexicons, software, and grammar models can be used by other linguistically adept students
Khalkha Mongolian Parts of Speech
▪ Declinable Nouns
▪ Declinable Adjectives
▪ Inflected Verbs
▪ Unchanging Adverbs
▪ Declinable Converbs
▪ Unchanging Postpositions
▪ Unchanging Conjunctions
▪ Unchanging Particles
Mongolian Morphological Affixes
▪ 27 verbal suffixes denoting tense and mood
▪ 2 verb infixes denoting verb manner
– Consultative
– Passive
▪ 6 verb paradigms or verb types
▪ 3 irregular common verbs
▪ 6 cases in singular and plural number
▪ Both nouns and adjectives are declined
Lexical Knowledge Representations
▪ Unchanging adverbs, conjunctions, particles, etc. and irregular verb forms (unchanging.txt file)
▪ Lemmas of declinable nouns and adjectives (declinables.txt file)
▪ Inflected verbs and nominalized verbs (regvb.txt file)
▪ Affix files (casendings.txt, reflex.txt, infixes.txt, vbforms.txt)
Some Examples
▪ declinables.txt file
– N нэр   Q хэн
▪ regverb.txt file
– V ир   V өс
▪ Affix files
– casendings.txt: g ний   d д   a ыг   b оос
– reflex.txt: аа   ээ   оо
– infixes.txt: C лц   R лд   P гд
– vbforms.txt: ipf нө   i1p в   i3p чээ   Ypf охгүй
▪ unchanging.txt file
– Pg->талаар   Pc->холбогдуулан
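The slide shows only fragments of each lexicon file, so the exact on-disk layout is not known. As a minimal sketch, assuming entries are whitespace-separated "tag form" pairs (as in casendings.txt) or "tag->word" tokens (as in unchanging.txt), the entries could be read into lookup tables as follows. The function names `parse_pairs` and `parse_arrows` are hypothetical, and Python stands in for the Perl used in the original tools.

```python
def parse_pairs(text):
    """Read 'TAG form TAG form ...' into a {form: tag} lookup table.

    Assumed layout: tags and forms simply alternate, separated by whitespace.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (tag/form pairs)")
    return {form: tag for tag, form in zip(tokens[0::2], tokens[1::2])}


def parse_arrows(text):
    """Read 'TAG->word TAG->word ...' into a {word: tag} lookup table."""
    entries = {}
    for token in text.split():
        tag, arrow, word = token.partition("->")
        if not arrow or not word:
            raise ValueError(f"malformed entry: {token!r}")
        entries[word] = tag
    return entries


# Sample entries copied from the slide.
case_endings = parse_pairs("g ний d д a ыг b оос")
unchanging = parse_arrows("Pg->талаар Pc->холбогдуулан")
```

Keying each table by surface form rather than by tag makes the later affix lookup a single dictionary access.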
Merge Morphology Knowledge with the Power of the Computer
Wrote yalgah.pl to become a tireless lexical pedagogue
▪ Searches for identifiable affixes by comparison with lexical knowledge affix files
▪ Matches resulting lemma against lexical knowledge of declinables, verbs, and unchanging words, then outputs word/part of speech tag to standard output file plus expository lexicon
▪ Depending on whether the lemma can or cannot be matched, outputs:
• Matched: word/part of speech tag to standard output file
• Not matched: lemma to Out Of Vocabulary (oov) file, noting affixes found
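The affix-stripping and lemma-matching steps described above can be sketched roughly as follows. This is not yalgah.pl itself (which was written in Perl and is not shown in the deck); it is a toy Python illustration that uses only the sample entries from the example slide. The function name `analyze`, the result layout, and the convention that a bare declinable gets the suffix code "n" (suggested by the "Nn" tag in the parser slide) are assumptions.

```python
# Toy lexicon built from the sample entries on the "Some Examples" slide.
CASE_ENDINGS = {"ний": "g", "д": "d", "ыг": "a", "оос": "b"}
VERB_FORMS = {"нө": "ipf", "в": "i1p", "чээ": "i3p", "охгүй": "Ypf"}
DECLINABLES = {"нэр": "N", "хэн": "Q"}
VERBS = {"ир": "V", "өс": "V"}
UNCHANGING = {"талаар": "Pg", "холбогдуулан": "Pc"}


def analyze(word):
    """Return a tag for a known word, or an out-of-vocabulary record.

    The OOV record lists every affix that could be stripped, mirroring the
    "possible ... ending" hints written to the oov file.
    """
    if word in UNCHANGING:
        return {"word": word, "tag": UNCHANGING[word], "oov": False}
    if word in DECLINABLES:
        # Assumption: "n" marks the unsuffixed (nominative) form.
        return {"word": word, "tag": DECLINABLES[word] + "n", "oov": False}
    if word in VERBS:
        return {"word": word, "tag": VERBS[word], "oov": False}

    candidates = []
    for lexicon, affixes in ((DECLINABLES, CASE_ENDINGS), (VERBS, VERB_FORMS)):
        for suffix, code in affixes.items():
            if len(word) > len(suffix) and word.endswith(suffix):
                stem = word[: -len(suffix)]
                if stem in lexicon:
                    tag = lexicon[stem] + code
                    return {"word": word, "tag": tag, "oov": False}
                candidates.append((code, suffix, stem))
    return {"word": word, "tag": None, "oov": True, "candidates": candidates}


for w in ("нэрд", "ирчээ", "талаар", "номд"):
    print(analyze(w))
```

A real analyzer would also need vowel-harmony variants of each suffix, stacked affixes (case plus reflexive), and the infix table, none of which this sketch attempts.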
Additional Outputs
▪ Expository Morphology File (named morphlex.txt)
IR->verb command imperative 2nd person singular
IREEREY->converb future perfect continuative
IREG->verb command concessive 3rd person singular/plural
BAGA->adjective
HURAL->noun nominative
IH->adjective
AJILDAA->reflexive noun dative-locative
ORLOO->verb indicative second past

▪ Out Of Vocabulary File (named oov)
[ункноенахаасаа] (UNKNOWNAHAASAA) WORD 0 LINE 2
FALLS OUTSIDE OF VOCABULARY
possible reflexive ending <аа>-<AA>
possible declinable case ending <b>-<аас>-<AAS>
possible verbal part of speech <Ypf>-<ах>-<AH>
possible participial/converbal stem <ункноен>--<UNKNOWN>
Feed Analytic Output to Parser
▪ Developed context-free grammar (CFG) rules for both discourse and newspaper texts
S->Sbj Prd
S->Prd
Sbj->Nn
Sbj->NP
NP->Tg Nn
NP->Tg Ng Nn
Prd->J
▪ Wrote parse.pl to validate CFG rules against input text tagged by part of speech
▪ When a sentence can be fully parsed, outputs a parse tree and an English gloss
Working on "BAGA HURAL IH AJILDAA ORLOO ."
ENGLISH GLOSS: large hural great work began .
The sentence does parse.
Branch nodes on tree:
S -> (Sbj Prd)
Sbj -> (NP)
NP -> (J Nn)
Prd -> (NPd Vi2p)
NPd -> (J Nd)
POS: J Nn J Nd Vi2p
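To illustrate how a tag sequence such as the one above can be checked against CFG rules, here is a small top-down parser in Python. It is a sketch under stated assumptions, not parse.pl: the rule table contains only the branch nodes printed in the sample output plus the two S and Sbj rules from the rule list, the names `RULES`, `parse`, and `parse_sentence` are hypothetical, and the approach relies on the grammar having no left-recursive rules.

```python
# Rules taken from the branch nodes of the sample parse (plus S->Prd, Sbj->Nn).
RULES = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["Nn"], ["NP"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}


def parse(symbol, tags, pos):
    """Yield (tree, next_pos) for each way `symbol` can cover tags from `pos`."""
    if symbol not in RULES:
        # Terminal: must match the part-of-speech tag at this position.
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in RULES[symbol]:
        # Extend partial matches one right-hand-side symbol at a time.
        partial = [([], pos)]
        for child in rhs:
            partial = [
                (kids + [tree], nxt)
                for kids, p in partial
                for tree, nxt in parse(child, tags, p)
            ]
        for kids, end in partial:
            yield (symbol, kids), end


def parse_sentence(tags):
    """Return the first parse tree that consumes every tag, or None."""
    for tree, end in parse("S", tags, 0):
        if end == len(tags):
            return tree
    return None


tree = parse_sentence("J Nn J Nd Vi2p".split())
print("The sentence does parse." if tree else "The sentence does not parse.")
print(tree)
```

When a sentence fails to parse, the iterative workflow described in the deck would add or revise a rule in the table and run the parser again.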
Feed Output to Generator
▪ Wrote gramgen.pl to generate sentences based on lexical knowledge, morphological knowledge, and syntactic knowledge gained
▪ Output routinely reviewed for accuracy and Chomskian explanatory adequacy of the grammar models created for the parser and generator engines
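Generation from the same kind of grammar can be sketched as random expansion of the rules followed by lexical lookup. The following Python is an illustration only and does not reproduce gramgen.pl: the rule table is the small one implied by the sample parse, the lexicon holds just the five words from the sample sentence, and the names `RULES`, `LEXICON`, and `generate` are hypothetical.

```python
import random

# Grammar implied by the sample parse; terminals are part-of-speech tags.
RULES = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["Nn"], ["NP"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}

# Words from the sample sentence, grouped by the tag they received there.
LEXICON = {
    "J": ["BAGA", "IH"],
    "Nn": ["HURAL"],
    "Nd": ["AJILDAA"],
    "Vi2p": ["ORLOO"],
}


def generate(symbol, rng):
    """Expand `symbol` by randomly chosen rules and return a list of words."""
    if symbol in LEXICON:
        return [rng.choice(LEXICON[symbol])]
    words = []
    for child in rng.choice(RULES[symbol]):
        words.extend(generate(child, rng))
    return words


rng = random.Random(2005)  # fixed seed so a review session is repeatable
for _ in range(5):
    print(" ".join(generate("S", rng)), ".")
```

Reading a batch of such output is one way to spot rules that overgenerate, which is the review step the slide describes.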
Iterative Process
▪ First take a new newspaper article or dialogue and run the morphological analyzer on it until all words are listed within vocabulary (no output in the oov [Out Of Vocabulary] file)
▪ Run output through parser, creating new CFG rules until new text parses
▪ Run generator for a hundred or more examples to ensure adequacy of new rules
Morpho-analyzer, Parser, Generator Software Led This Student to Deeper Understanding of Mongolian
▪ A linguistically adept learner can thus write software to help one learn more deeply and faster
▪ Language tool development is thus grounded in gaining and applying language knowledge in a systematic and linguistically principled manner for oneself and others
Contact Information
John Kovarik
▪ Email: [email protected]
▪ Home Page: http://www.worldnet.att/~kovariks
▪ Phone: 443-479-7188