Experiences with Indian Language Morphology
Monojit Choudhury
RS, CSE, IIT Kharagpur
28/07/2005   Speech and NLP

When do we need MA/MS?
  Alternative: store all words
    Advantages:
      Less effort for NLP
      Less time for processing
    Disadvantages:
      More words, hence more space and more search time
      How to tackle unseen words?

Therefore, we need MA/MS when
  The language is morphologically rich
    Large number of affixes
    Concatenation of affixes/compounding
    Examples: Turkish, German, Sanskrit …
  The language is morphologically productive
    Speakers/writers can coin new words by following morphological rules
    Examples: German, Sanskrit …

A Problem to Ponder
  How do we decide whether a language is morphologically rich and/or productive?
    Linguistically
      Difficult (enumerate all morphological processes)
      Fuzzy/subjective
  Can you suggest some formal technique?
    Hint: statistics

Vocabulary Growth
  [Figure: vocabulary size V(N) plotted against corpus size N, with N up to 3,500,000 and V(N) up to 200,000]
  Bengali: (N, V(N)) = (3019565, 182848)
  Hindi:   (N, V(N)) = (2967438, 121603)

Another Estimate
  How many different forms of a verb are there in
    English:  5
    Hindi:    ~20 (without causation)
    Bengali:  ~170 (without causation)
    Telugu:   ~1000
    Sanskrit: ~51480 (with derivational affixes), ~3960 (otherwise)

Three Basic Concerns
  While designing a morphological analyzer/generator one must consider
    Productivity of a rule
    Morphological paradigms
    Irregular morphology

Productivity of a Rule
  Rule         Example              Productivity
  VR + tA      jAtA, letA           *****
  NR + ika     dainika, sAmAjika    **
  Adj + imA    lAlimA, niilimA      X

Productive Rules for Bengali/Hindi
  Inflectional morphology: verbs, nouns, adjectives, pronouns
  Derivational morphology: compounding, prefixation, suffixation
  Emphasizing in Bengali: i and o

Morphological Paradigms
  Classes of words that inflect similarly
  Hindi noun roots take 4 inflections
    Singular, direct     laDakA      laDakii
    Plural, direct       laDake      laDakiyA.N
    Singular, oblique    laDake      laDakii
    Plural, oblique      laDako.N    laDakiyo.N
  How many paradigms for nouns?

How to Identify the Paradigms?
  Paradigms may be based on
    Syllable structure (e.g. laDakii, nadii, sakhii)
    Gender (e.g. dhobii vs. nadii)
    Semantics (e.g. lohA vs. dohA)
  Which of these distinctions can be identified automatically? How?

Paradigms for Bengali Nouns
  Bengali noun inflections:
    Classifier suffixes: TA, gulo, rA, etc.
    Case markers: er, ke, der, te, etc.
    Emphasizers: i, o
  Paradigms are based on semantics
    Inanimate objects take TA, gulo
    Animate objects take rA, dera

Irregular Morphology
  All languages feature irregular morphology
    English: ox – oxen, go – went
    Hindi: jAnA – gayA, karanA – kiyA
    Bengali: yAoYA – gela, AsA – ela
  Better to list them as exceptions and treat them separately
  Bengali has only 4 exceptional verbs, Hindi has 2

So, We Decided to
  Build MS/MA for Hindi & Bengali
  Cover only inflectional morphology
  Cover only verbs, nouns and adjectives
  We also identified
    the morphological paradigms
    irregular verbs/nouns

Now We Need to Decide
  The list of possible affixes
  Their attributes
  Morphotactics
  And then design/build
    The input/output specification
    The lexicon structure
    The FST structure
    Lexicon and FST search strategy

A Case Study: Bengali Verb Morphology
  The information coded by affixes:
    Finite forms
      Tense: past, present, future
      Aspect: simple, continuous, perfect, habitual
      Modality: order, request
      Person: 1st, 2nd normal (tumi), 2nd familiar (tui), 3rd (se), honorific 2nd and 3rd (Apani, tini)
      Polarity: positive/negative
    Non-finite forms: e, te

Morphotactics
  Root           Aspect           Tense          Person         +/-       Gloss
  kar (to do)    eChi (perfect)   l (past)       Ama (1st)      Φ (+)     I had done
  kar            Ch (cont.)       Φ (present)    i (1st)        Φ (+)     I'm doing
  kar            Φ (simple)       b (future)     i (2nd fam)    Φ (+)     You'll do
  kar            Φ (perfect)      Φ (pre/pst)    i (1st)        ni (-)    I haven't done / I hadn't done

Morphotactics
  Root + aspect + tense + person + emphasizer + polarity
  Root + modality + person + emphasizer
  Root + aspect1 + emphasizer + aspect2 + person + polarity

Verb Suffix Table
  TAM / Person        1st           2nd familiar   2nd normal   2nd & 3rd formal   3rd
  Ind, Pr, Simple     i             isa’           [missing]    ena’               e
  Ind, Pr, Cont       chhi          chhisa’        chha         chhena’            chhe
  Ind, Pr, Perfect    echhi         echhisa’       echha        echhena’           echhe
  Ind, Pa, Simple     lAma’         li             le           lena’              la
  Ind, Pa, Cont.      chhilAma’     chhili         chhile       chhilena’          chhila
  Ind, Pa, Perfect    echhilAma’    echhili        echhile      echhilena’         echhila’
  Ind, Future         ba            bi             be           bena’              be
  Habitual Past       tAma’         tisa’          te           tena’              ta
  Imperative          -             .h/            [missing]    una’               uka’
  Neg, Perfect        ini           isa’ni         ani          ena’ni             eni

Orthographic Changes
  kar + eChilAm      →  kareChilAm
  khA + eChilAm      →  kheYeChilAm
  hAr + eChilAm      →  hereChilAm
  karA + eChilAm     →  kariYeChilAm
  tolA + eChilAm     →  tuliYeChilAm
  khAoYA + eChilAm   →  khAiYeChilAm
  de + eChilAm       →  diYeChilAm

Orthographic Classes (Paradigms?)
  V       $                           a’                          A                                oYA
  a       ha [haoYA] (to happen)      kara’ [karA] (to do)        karA [karAno] (do, causative)    saoYA [saoYAno] (undergo, causative)
  A       khA [khAoYA] (to eat)       jAna’ [jAnA] (to know)      jAnA [jAnAno] (to inform)        khAoYA [khAoYAno] (to feed)
  i       di [deoYA] (to give)        likha’ [lekhA] (to write)   ni~NrA [ni~NrAno]                --
  e       --                          dekha’ [dekhA] (to see)     dekhA [dekhAno] (to show)        deoYA [deoYAno] (give, causative)
  o       so [so;oYA] (to lie down)   tola’ [tolA] (to pick)      tolA [tolAno] (pick, causative)  so;oYA [so;oYAno] (lie, causative)
  u/au    --                          --                          ghumA [ghumAno] (to sleep)       --

FSM for Recognizing Bengali Verb Class
  [Figure: state diagram]

A Morphological Generator: Abstract Level
  [Diagram: Morphological Generator]
  Inputs: Root, TAM, Person, Polarity, Emph
  Components: Suffix Table (yields the Suffix), Orthographic FST
  Output: Surface Form

A Morphological Generator: Implementation
  [Diagram: Morphological Generator]
  Inputs: Root, TAM, Person, Polarity, Emph
  Components: Irregular Root Handler, Root Class Recognizer, Suffix Table, Orthographic Rules for each root class, Emph Adder
  Output: Surface Form

Implementation: More Facts
  Memory requirement
    Root Class Recognizer: FSM with 26 states
    Suffix Table: 56 suffixes (emphasizers not included)
    Orthographic Rule Tables: 19 × 56 = 1064 rules
  Time requirement
    Root Class Recognizer: scans the root once (r)
    Suffix selection: just a table lookup (constant)
    Orthographic rules: scan root + suffix once (r + s)
    Emphasizer Adder: constant time
    Total time: O(r + s)

A Morphological Analyzer: Abstract Level
  Trie: a data structure also called a prefix tree
    (from Information Retrieval)
  Basic notions:
    Note that Bengali verb morphology only has suffixes
    Scan a given word from right to left (backward)
    If the substring seen is a valid suffix, see if the remaining part of the input is a valid stem/root
    Take care of orthographic changes
  We shall see that a trie is just another way to implement an FST, with some nice properties

Trie: Construction
  Make a list of all valid suffixes
    NULL, i, Chi, li, eChi, YeChi, lAma, elAma
  Construct the trie recursively by inserting each of the suffixes (right to left)
  Every state where a suffix ends is marked as a final state
  Every final state consists of
    TAM, Person, Polarity information
    Rewrite rules for generation of the root

Trie: Search
  Reverse the input word
  Traverse the trie starting from the root (start state)
  At every final state, apply the orthographic rule to the rest of the string
    Let r be the string obtained; search for r in the root lexicon
    If found, output the attributes
  Continue the search

Trie: Computational Issues
  Time complexity
    Searching the trie is linear in the input length
    Searching the lexicon can also be linear
  Space complexity
    In general linear in the number of affixes
    Can be reduced further by constructing a DAWG

Trie vs DAWG
  Trie                                   DAWG
  More space                             Less space
  Linear search                          Linear search
  Easy to construct                      Exponential construction
  Easy to insert & delete                Difficult to delete and insert
  Final states have unique attributes    A final state can have ambiguous attributes

Morphological Analyzer: Implementation Details
  Size of trie: 300 states
  Size of root lexicon: 600 verb roots
  Paradigm information: not required
  Nouns, verbs and adjectives are analyzed separately
    Tries can be merged, but with no significant gain
    Root lexicons are also distinct
  Rule compilation

Summarizing
  Decide whether to go for MA/MS
  Identify the productive morphological processes and corresponding irregularities
  Identify the paradigms and morphological attributes
  Specify the morphotactics, affix list
  Gather a machine-readable root lexicon
  Choose an appropriate computational technique
  Design, implement and test
  A good interface for rule-editing is desirable
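
The construction and search steps listed under "Trie: Construction" and "Trie: Search" can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described in the talk: the suffix list is the one given on the construction slide, but the attribute labels, the two-root lexicon, and the single orthographic rule (undoing A → e before a suffix starting with Ye, modelled on khA + eChilAm → kheYeChilAm) are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    attrs: list = field(default_factory=list)  # non-empty means "final state"


def build_trie(suffixes):
    """Insert every suffix right to left, so the trie reads words backward."""
    root = Node()
    for suffix, attrs in suffixes.items():
        node = root
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, Node())
        node.attrs.append(attrs)  # mark the state where the suffix ends
    return root


def analyze(word, trie, lexicon, undo_orthography=lambda stem, suffix: [stem]):
    """Scan the word backward; at each final state, test the remainder as a root."""
    results = []
    node, consumed, n = trie, 0, len(word)
    while True:
        if node.attrs:
            stem, suffix = word[:n - consumed], word[n - consumed:]
            for root in undo_orthography(stem, suffix):
                if root in lexicon:
                    results.extend((root, suffix, a) for a in node.attrs)
        if consumed == n:
            break
        node = node.children.get(word[n - 1 - consumed])
        if node is None:  # no longer suffix matches; stop the search
            break
        consumed += 1
    return results


# Suffixes from the slide; the attribute labels are illustrative assumptions.
SUFFIXES = {
    "": "no suffix",
    "i": "present simple, 1st person",
    "Chi": "present continuous, 1st person",
    "li": "past simple, 2nd person familiar",
    "eChi": "present perfect, 1st person",
    "YeChi": "present perfect, 1st person",
    "lAma": "past simple, 1st person",
    "elAma": "past simple, 1st person",
}
ROOTS = {"kar", "khA"}  # hypothetical two-entry root lexicon
TRIE = build_trie(SUFFIXES)


def undo_a_to_e(stem, suffix):
    """Hypothetical rewrite rule: a stem-final e before Ye... may come from A."""
    candidates = [stem]
    if suffix.startswith("Ye") and stem.endswith("e"):
        candidates.append(stem[:-1] + "A")
    return candidates


print(analyze("kareChi", TRIE, ROOTS))
print(analyze("kheYeChi", TRIE, ROOTS, undo_a_to_e))
```

Keeping the rewrite rule as a function passed to analyze mirrors the slide's idea that each final state carries its own rule for regenerating the root; a fuller version would presumably store one such rule per final state instead of a single global hook.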