COMPUTATIONAL MORPHOLOGY
Radhika Mamidi
Winter workshop/school on Machine Learning and Text Analytics
17 December, 2013

Outline
  What is Computational Morphology?
  Morphology concepts
  Computational approaches
  Morphological Analysis
  Morphological Generator
  Computational Approaches
    Paradigm based
    Finite State based
    Statistical based
  Issues to be handled

Computational Morphology
Linguistics: the scientific study of language at these levels:
  Sound: Phonetics and Phonology
  Word: Morphology
  Sentence: Syntax
  Discourse: Discourse Analysis
Computational Linguistics: processing and producing language computationally.
Computational Morphology: processing and producing language computationally at WORD LEVEL.

Why are we studying morphology?
Knowledge of words helps us process language computationally at word level.
Knowledge of words includes word structure and word formation rules.
This knowledge helps us develop NLP tools like "Morphological Analyzers" and "Morphological Generators".
These are important in Information Retrieval, Machine Translation, Spell checking, etc.

What's Morphology?
The study of word structure.
The study of the mental dictionary: How are words stored in the mind? What is a possible word?
Example:
  (i) At the supermarket, the girls bought pink cheeriots and the boys blue fistings.
  (ii) When their mother signaled, the girls barried home unhappily.
The words 'cheeriots', 'fistings' and 'barried' do not exist. However, assuming they are valid words of English, we 'guess' the meaning from the context and the position of the word in the given sentence. We do this using our general knowledge and linguistic knowledge.

At the supermarket, the girls bought pink cheeriots and the boys blue fistings.
  Part of speech = nouns [comes after adjectives]
  [-s ending] = more than one
  Meaning = some objects that have color
  [clue: supermarket] = some object that is sold; perhaps a toy
  The word forms are more like toys, balls, ribbons.
When their mother signaled, the girls barried home unhappily.
  Part of speech = verb [position]
  [ends in -ed] = past tense
  Base form = barry
  Meaning = go
  The word form is more like carried, married.

What's the Longest Word of English?
Could it be ismestablishmentariandisanti?
Why not, when antidisestablishmentarianism is possible?
Possible words with anti and missile:
  anti-missile (adjective)
  anti-missile missile: a missile used for anti-missile purposes
  anti-anti-missile missile missile: a missile used against anti-missile missiles
  anti^N missile^(N+1), where N can go until....
There is a systematic way of word formation. Eg: happy, -ness, un-
This is called Morphotactics.

Morphemes
Have a sound [form] and a meaning.
Example: "cats"
  /kaet/ = "four-legged animal"
  /-s/ = "plural number"
Even though /-s/ has a sound and a meaning, it can't mean "plural" by itself. It has to attach to a noun.
"A morpheme is the smallest unit of word form that has meaning"
Examples:
  cats = cat + -s
  girlish = girl + -ish
  unfriendly = un- + friend + -ly
cat, -s, girl, -ish, un-, friend, -ly are morphemes.

Even Bush knows morphology (...though he may use it differently than the rest of us)
  The war on terrorism has transformationed the US-Russia relationship
  We're working to help Russia securitize the dismantled warheads
  The explorationists are only willing to help move equipment during the winter
  This case has had full analyzation and has been looked at a lot

Compositionality
"Explorationists"
  explore: to spend an extended effort looking around a particular area (= X)
  -ation: can attach to verbs; the process of X-ing (= Y)
  -ist: can attach to nouns; one who performs an action Y (= Z)
  -s: attaches to nouns; more than one Z
explorationists: a compositional word
The meaning of a fully compositional word is based on its parts.
Eg: computer, impossible, headache

Non-compositionality
"Inflect"
Is inflect morphologically complex? It contains more than one morpheme. What do in- and flect mean?
This is a case of a non-compositional meaning.
In explorationists, if you know the meaning of the parts, you know the meaning of the whole. Not necessarily so for inflect.
Non-compositional meaning cannot be derived from its parts.

Lexical/Content words
Words which are not function words are called content words or lexical words: these include nouns, verbs, adjectives, and most adverbs, though some adverbs are function words (e.g. then, why). They belong to the open class.
Dictionaries define the specific meanings of content words, but can only describe the general usages of function words. By contrast, grammars describe the use of function words in detail, but have little interest in lexical words.

Function words
Function words or grammatical words are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence.
Function words may be prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles or particles, all of which belong to the group of closed class words.

To know about morphemes we should know about....
  Free morphemes vs. Bound morphemes
  Lexical morphemes vs. Functional morphemes
  Null/Zero morpheme
  Inflectional morphemes vs. Derivational morphemes
  Root morphemes vs. Affix morphemes

Free vs Bound morphemes
electr- and tox- have isolable meanings in electric, electrify, toxic, (de-)toxify.
But they cannot be pronounced on their own: they are bound morphemes.
girl and book have isolable meanings in girls, girlish, books, booked, booking.
They can occur on their own: they are free morphemes.
Are prefixes and suffixes bound morphemes?

Lexical morpheme
free morphemes: apple, smart, book, slow, eat, write
  They can exist on their own as independent words.
bound morphemes: -ceive, -ject, cran-, -ship, un-, dis-
  They cannot be used independently. They need another morpheme [free or bound] to form a word.
  Eg: re-ceive, con-ceive, sub-ject, pro-ject, cran-berry, scholar-ship, fellow-ship, un-kind, dis-obey

Functional morphemes
free morphemes: of, with, she, it, and, although, however, because, then
bound morphemes: -s, -ed, -ing

Four-way contrasts
Lexical, Free: Nouns, Verbs, Adj, Adv
  cat, town, call, house, hall, smart, fast
Lexical, Bound: including derivational affixes
  rasp- [raspberry], cran- [cranberry], -ceive [conceive, receive], un-, re-, pre-
Functional, Free: Prepositions, Articles, Conj
  with, at, and, an, the, because
Functional, Bound: inflectional affixes
  -s, -ed, -ing [eats, walked, laughing]

Exercise 1: Identify the free and bound morphemes in the following words
  walked, talked, danced, arrived
  playhouse, watchdog, football player
  drinking, playing, eating
  import, export, transport
  raspberry, cranberry
  invert, convert, divert
  books, pens, boards
  writer, caretaker, rider, fighter
Can the following words be decomposed?
  delight, news, traitor, bed, evening

Exercise 2: Identify the lexical and functional morphemes in the following words. Mention if they are free or bound.
  politically
  beautiful
  between
  writing
  raspberries
  unable
  nationalization

Inflection vs. Derivation
Derivational affixes: allow us to make new words with an altered meaning.
  There is an error in the computation. [computeV - computationN] {Nominalisation}
  It is a computational approach.
  [computationN - computationalAdj] {Adjectivization}
Inflectional suffixes: required in order to make the sentence grammatical. Inflected words belong to the same class.
  *Yesterday I walk to class [walkV - walkedV]
  *I like all my student [studentN - studentsN]

Inflectional Morphology
Examples: [the POS remains the same]
VERBS
  EAT = eat, eats, ate, eaten, eating
  DRINK = drink, drinks, drank, drunk, drinking
  PLAY = play, plays, played, played, playing
  0, -s, -ed, -en, -ing are inflectional morphemes
NOUNS
  PLAY = play, plays
  GIRL = girl, girls
  SHEEP = sheep, sheep
  -s, 0 are inflectional morphemes

Derivational morphology
Two types:
May change the category {N, V, A, Adv}
  driveV + er = driverN
  eatV + able = eatableAdj
  girlN + ish = girlishAdj
  disturbV + ance = disturbanceN
Doesn't have to change the category
  un + doV = undoV
  re + fryV = refryV
  un + happyAdj = unhappyAdj

Derivational - more examples
Verbs
  eat: eatable [adj], eatables [noun]
  drink: drinking [noun]
  play: player [noun]
  -able, -ing, -er are derivational morphemes
Nouns
  play: playful [adj], replay [verb]
  girl: girlish [adj], girlhood [noun]
  sheep: sheepish [adj]
  -ful, re-, -ish, -hood are derivational morphemes

Exercise 3
Each of the words below contains two morphemes: a root and a derivational affix. Decide if the derivational affix changes only the meaning or the class of the root as well.
  rewrite, happily, unclear, unhappy, hopeless, national, creation, happiness, helpful, undo

Null/Zero morpheme
A null morpheme is a morpheme that is realized by a phonologically null affix (an empty string of phonological segments).
A null morpheme is an "invisible" affix.
It is also called a zero morpheme; the process of adding a null morpheme is called null affixation.
Examples
  cat = cat + -0 = ROOT("cat") + SINGULAR
  cats = cat + -s = ROOT("cat") + PLURAL
  sheep = sheep + -0 = ROOT("sheep") + SINGULAR
  sheep = sheep + -0 = ROOT("sheep") + PLURAL
More examples
  darken [verb] = dark [adj] + -en
  Meaning = make more 'Adjective'
  redden [verb] = red + -en [make more red]
  yellow [verb] = yellow + 0 [make more yellow]
  brown [verb] = brown + 0 [make more brown]
  blacken [verb] = black + -en [make more black]

Root Morphemes vs Affix morphemes
Root morphemes are morphemes around which larger words are built. Root morphemes are free or bound.
Affixes are additional morphemes added to roots to create multi- or poly-morphemic words. Affixes are always bound.
  Rats: Root = rat [free morpheme]; Affix = -s [bound morpheme]
  Project: Root = -ject [bound morpheme]; Affix = pro- [bound morpheme]
  Mice: Root = mouse [free morpheme]; Affix = -s [bound morpheme]
  Ate: Root = eat [free morpheme]; Affix = -ed [bound morpheme]
  Disgracefulness: Root = grace [free morpheme]; Affixes = dis-, -ful, -ness [bound morphemes]

Affixes
Morphemes added to free forms to make other free forms are called affixes.
Mainly four kinds of affixes:
  1. Prefixes (at beginning): "un-" in "unable"
  2. Suffixes (at end): "-ed" in "walked"
  3. Circumfixes (at both ends): "en-...-en" in enlighten
  4. Infixes (in the middle): "-um-" in kumilad ['to be red'], fumikas ['to be strong'] [kilad = 'red', fikas = 'strong' in the Bontoc language]
Affixes are bound morphemes.

Prefixes
No prefix can determine the category of a complex word.
  Eg: unhappy, unhappiness, unhappily, undo
What does un- mean when it attaches to adjectives? unkind, unhappy
What does un- mean when it attaches to verbs? undo, untie

Suffixes
The rightmost suffix determines the category of a word, as in triplets like:
  Eg: rational, rationalize, rationalization
  rational = adjective
  rationalize = verb
  rationalization = noun

Allomorph
An allomorph is a variant form of a morpheme.
The meaning remains the same, while the sound can vary.
Example: the different forms of the past tense morpheme /-ed/ [as we hear]
  barked, hissed: [t]
  raised, smelled: [d]
  added, trotted: [ed]

Allomorphs of /-s/ for nouns
Example: the different forms of the plural morpheme /-s/ are: [as we read]
  -s: cats, dogs, boys, girls
  -es: watches, churches
  -0: sheep
/-s/, /-es/ and 0 are allomorphs of /-s/
{If pronunciation is considered, then /-s/, /-z/, /-iz/ and 0 are allomorphs of /-s/ in the above examples}

Lexemes and Word-forms
Lemma/Lexeme: Base form of a word. Includes derived forms. It is different from root.
Word-forms: The inflected forms of a word.
Eg: happy, unhappy, unhappiness, happiness, happier, unhappier, happiest, unhappiest
  Lexemes: happy, unhappy, unhappiness, happiness
  Word-forms:
    Of lexeme HAPPY: happy, happier, happiest
    Of lexeme UNHAPPY: unhappy, unhappier, unhappiest

Analysing a word
Look at the word ACTORS
  Three morphemes: act, -or, -s
  Root: act
  Suffixes: -or, -s
  Derivational suffix: -or
  Inflectional suffix: -s
  Lexeme: ACTOR
  Word-forms: actor, actors

Hierarchical Structure within Words
The word unlockable is ambiguous:
  [[un + lock] able]: able to be unlocked
  [un [lock + able]]: not able-to-be-locked
Similar to this structure:
  French History Teacher
  Old Ladies Hostel
  Old Bombay Highway

Disgraceful
            Adj
           /   \
        Noun    Suffix
       /    \      |
  Prefix   Noun    |
     |       |     |
    Dis    grace  ful

Ungraceful
        Adj
       /   \
  Prefix    Adj
     |     /   \
     |  Noun   Suffix
     |    |      |
    Un  grace   ful

Exercise 4
Give the hierarchical structure of the following words:
  unwanted, disfigurement, interchangeable, united, actors, retries, unhappiness

KEY POINTS
Words and word structure
Root morphemes vs. Affix morphemes
Lexical morphemes
  Free morphemes
  Bound morphemes
Functional morphemes
  Free morphemes
  Bound morphemes
Derivational morphemes
Inflectional morphemes
Null/Zero morpheme

Morphological Analysis
Morphology = The study of word structure
Morphological Analysis = Analysing words
How do we do it?
Is the process the opposite of morph generation?
Why is it important?

Look at the following words and break them up into smaller units
  useless, copilot, psychology, sickness, communism, socialism, artist, dentist, curable, impossible, microstructure, macrostructure, departure

Did you do it this way? Comparing with other words?
  useless: endless, meaningless - "without"
  copilot: co-author, co-editor - "with"
  psychology: physiology, biology - "study of"
  sickness: dizziness, happiness - "state of being"
  communism: socialism, feminism - "belief, doctrine"
  artist: activist, tourist - "one who"
  impossible: impatient, immoral - "not"
  microstructure: microscope, microbiology - "small"
  macrostructure: macroeconomics, macro - "large"
What about coalition, departure, important, dentist?

Now do the same for this data:
  Mtoto amefika      Mtu amelala
  Mtoto anafika      Mtu analala
  Mtoto atafika      Mtu atalala
  Watoto wamefika    Watu wamelala
  Watoto wanafika    Watu wanalala
  Watoto watafika    Watu watalala

Here's some clue!
  Mtoto amefika - "The child has arrived."
  Mtoto anafika - "The child is arriving."
  Mtoto atafika - "The child will arrive."
  Watoto wamefika - "The children have arrived."
  Watoto wanafika - "The children are arriving."
  Watoto watafika - "The children will arrive."
  Mtu amelala - "The man has slept."
  Mtu analala - "The man is sleeping."
  Mtu atalala - "The man will sleep."
  Watu wamelala - "The men have slept."
  Watu wanalala - "The men are sleeping."
  Watu watalala - "The men will sleep."

Did you do it this way? Comparing with other words?
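The comparison of the twelve sentences can be sketched in code. The following is a minimal sketch, not a real analyzer: the morpheme inventory (nouns, subject prefixes a-/wa-, tense markers -me-/-na-/-ta-, verb roots) is an assumption inferred only from the data set above, and the names NOUNS, SUBJECT_PREFIXES, TENSE_MARKERS, VERB_ROOTS and segment are hypothetical.

```python
# Morpheme inventory inferred from the twelve example sentences only
# (an assumption of this sketch, not a full description of the language).
NOUNS = {"mtoto": "sg", "watoto": "pl", "mtu": "sg", "watu": "pl"}
SUBJECT_PREFIXES = {"a": "sg", "wa": "pl"}
TENSE_MARKERS = {"me": "perfect", "na": "present progressive", "ta": "future"}
VERB_ROOTS = {"fika": "arrive", "lala": "sleep"}


def segment(sentence):
    """Split a sentence from the data set into noun, prefix, tense, root.

    Spaces are ignored, so the input may be written with or without them.
    Returns None when no segmentation with matching number is found.
    """
    text = sentence.lower().replace(" ", "")
    # Try longer candidates first so that "watoto" is preferred over "watu".
    for noun in sorted(NOUNS, key=len, reverse=True):
        if not text.startswith(noun):
            continue
        rest = text[len(noun):]
        for prefix in sorted(SUBJECT_PREFIXES, key=len, reverse=True):
            if not rest.startswith(prefix):
                continue
            tail = rest[len(prefix):]
            for tense in TENSE_MARKERS:
                root = tail[len(tense):]
                agrees = NOUNS[noun] == SUBJECT_PREFIXES[prefix]
                if tail.startswith(tense) and root in VERB_ROOTS and agrees:
                    return [noun, prefix, tense, root]
    return None


if __name__ == "__main__":
    for s in ["Mtoto amefika", "Watu watalala", "Mtu analala"]:
        print(s, "->", segment(s))
```

The sketch only works because this data set is regular and consistent; nothing in it handles spelling changes or irregular forms.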
Exercise 5
What is the equivalent of...?
  The child has slept.
  The children are sleeping.
  The men have arrived.
  The man will arrive.

Exercise 5 (answers)
  The child has slept. = Mtoto amelala
  The children are sleeping. = Watoto wanalala
  The men have arrived. = Watu wamefika
  The man will arrive. = Mtu atafika

Quite easy?!
It is easy to identify morphemes if the language is regular and consistent.
Languages are of the following types (but no language belongs to one type solely):
  a. Isolating/Analytic (low morpheme-per-word ratio)
     Eg: Chinese, Vietnamese, English
     Context and syntax are more important than morphology
  b. Synthetic
     Fusional (many features fused in one morpheme)
       Eg: Most IE languages
     Agglutinating (joining many morphemes together)
       Eg: Finnish, Korean, Turkish, Japanese, Telugu

Isolating languages
Isolating languages do not (usually) have any bound morphemes.
Eg: Mandarin Chinese
  Gou bu ai chi qingcai (dog not like eat vegetable)
This can mean one of the following (depending on the context):
  The dog doesn't like to eat vegetables.
  The dog didn't like to eat vegetables.
  The dogs don't like to eat vegetables.
  The dogs didn't like to eat vegetables.
  Dogs don't like to eat vegetables.

Other Examples
English: nationalisation
Turkish: uygar+laş+tır+ama+dık+larımız+dan+mış+sınız+casına
  "Behaving as if you are among those whom we could not cause to become civilized"
Telugu: pagalagoTTabOyADu
  "He was about to break it."
Hindi: jA_sakatA_hE
  "He/It can go."

A bit of warm-up first:
Write a small text in your language
  Tamil
  Manipuri

Languages are not so regular....
English verbs
  Eat - ate - eaten
  Heat - heated - heated
  Teach - taught - taught
  Preach - preached - preached
  Cry - cried - cried
English nouns
  Box - boxes
  Ox - oxen
  House - houses
  Mouse - mice
  Child - children
Spelling changes occur at morpheme boundaries. Cases of suppletion are often seen. Vowel change also occurs. There is always a default regular word formation.
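The interplay between irregular forms and the default regular formation can be illustrated with a small sketch. It is only an illustration for the five nouns listed above: the table IRREGULAR_PLURALS, the tuple SIBILANT_ENDINGS and the function pluralize are hypothetical names, and the single spelling rule used here (insert e after s, z, x, ch, sh) is a simplification.

```python
# Irregular plurals must be listed one by one (taken from the noun list above).
IRREGULAR_PLURALS = {"ox": "oxen", "mouse": "mice", "child": "children"}

# Endings after which an "e" is inserted before the plural "-s"
# (a simplified spelling rule; assumption of this sketch).
SIBILANT_ENDINGS = ("s", "z", "x", "ch", "sh")


def pluralize(noun):
    """Return the plural of an English noun: exceptions first, then the default."""
    if noun in IRREGULAR_PLURALS:
        # Listed exception overrides every rule.
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(SIBILANT_ENDINGS):
        # Spelling change at the morpheme boundary: box + s -> boxes.
        return noun + "es"
    # Default regular word formation.
    return noun + "s"


if __name__ == "__main__":
    for n in ["box", "ox", "house", "mouse", "child"]:
        print(n, "->", pluralize(n))
```

Looking up exceptions before applying rules is a design choice of the sketch: it keeps "ox" from being treated like "box" even though both end in x.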
Orthographic rules identified (J&M 2000)
  Name                 Description of rule                                   Example
  Consonant doubling   1-letter consonant doubled before -ing/-ed            beg/begging
  E deletion           Silent e dropped before -ing and -ed                  make/making
  E insertion          e added after -s, -z, -x, -ch, -sh before -s          watch/watches
  Y replacement        -y changes to -ie before -s, -i before -ed            try/tries
  K insertion          Verbs ending with vowel + -c add -k                   panic/panicked

Suppletion
• Replacement of the entire wordform.
• Example:
  am, is, are, was, were: variants of 'be'
  ate: past tense of 'eat'
In Hindi:
  vaha jA rahA hE ("He is going."). vaha gayA ("He went")
In Telugu:
  tinnADA? ("Has he eaten?"). tinalEdu ("(He) Has not eaten.")
  vaccADA? ("Has he come?"). rAlEdu ("(He) Has not come.")

Assimilation
Change of one sound into another because of the influence of neighbouring sounds.
Example:
  peN + kaL → peNgaL
  manishi + lu → manushulu ("vowel harmony")

Reduplication
Some or all of a word is duplicated to mark a morphological process.
  Indonesian: orang (man) ⇒ orangorang (men)
  Bambara: wulu (dog) ⇒ wuluowulu (whichever dog)
  Turkish: mavi (blue) ⇒ masmavi (very blue)
           kırmızı (red) ⇒ kıpkırmızı (very red)
           koşa koşa (by running)

Let's explore your language
Lexical, Free morphemes
  Avu {"cow"}, pustakam {"book"}, pilla {"girl"}
Lexical, Bound morphemes
  a-, su- {adharmam "in-justice", suhAsini "good-smile"}
Functional, Free morphemes
  Ame {"she"}, mariyu {"and"}, kAni {"but"}
Functional, Bound morphemes
  -lo, -ki {inTilO "in-house", chetiki "to-hand"}

What else is happening?
Is there a null morpheme in your language?
  pilli pAlu tAgiMdi {"pilli" in nominative case}
    "The cat drank milk"
  pilli tOka guburugA uMdi {"pilli" in genitive case}
    "The cat's tail is bushy"
Some changes in lexemes
In Telugu:
  illu → inTi before locative case marker "lO", "ki"
  ceyyi → cEti before locative case marker "lO", "ki"
In Hindi:
  laDakA → laDake before ergative case marker "ne"
  laDake → laDakOM before the ergative case marker "ne"

What about causative and other constructions?
Observe the words.
Exercise 6: Translate into the language you speak at home.
  Ram killed Ravan.
  Ravan died.
  The baby ate.
  The nanny fed him.
  Sita made the nanny feed the baby.
  Ravi broke the vase.
  The vase broke.
  Anil broke the lock open.
  Ravish axed the goat to death.
  Bina nailed the picture on the wall.

In Kashmiri
Number (singular, plural)
The plural form of a noun can be obtained by addition of a suffix, by a change in vowel, or without any declension.
Example:
                     Singular             Plural
  Suffix addition:   kangir "fire pot"    kangri
                     laej "pot"           laeji
  Vowel change:      kuTh "room"          kueTh
                     gagur "rat"          gagar
  Invariants:        kAv "crow"           kAv
                     kan "ear"            kan

In Kashmiri
Gender (masculine, feminine)
Example:
                                          Masculine              Feminine
  Different roots for each gender:        ladake "boy"           kUr "girl"
                                          dAnd "ox"              gAv "cow"
  Same root but changes not regular:      batuk "duck"           batich "fem duck"
                                          tsAvul "goat"          tsAvij "fem goat"
  Same root with regular changes:         dAndur "greengrocer"   dAndren
                                          kAndur "baker"         kAndren

Let's look at Telugu verbs
Agglutinative language: many suffixes with different features are concatenated.
Derived words from pagili:
  pagiliMdi, pagilipOyiMdi, pagalabOyiMdi, pagalledu, pagalagoTTADu, pagalagoTTabOyADu

Morph analysis = Identifying morphemes and analysing them
  boys → boy + s → boy +noun +plural marker
    Root = boy
    POS = noun
    Number = plural
  Avulu → Avu +noun +plural
  pustakAlu → pustakam +noun +plural
  tinnADu → tinu + A + Du → tinu +verb +past +masc.sg
  cEsADu → ceyyi + A + Du → ceyyi +verb +past +masc.sg
  OdikoMDiddavana -> the one (masculine) who was reading
    Odu + i + koLLu + MD + u + iru + dd + a + avanu + a
    Root + VBP + AUXV + PST + VBP + AUXV + PST + RP + PRON-3SM + ACC

Exercise: Analyse every word you used in your text

Ambiguous words: multiple analyses
  Plants, Grounded, Ground, Leaves, Found, Can

Morphological Analysis and Generation
Bidirectional
Example: Ate ↔ eat + verb + past
  Surface level: tinnADu "he ate"
        |
  Intermediary level: tinn + A + Du
        |
  Lexical level: tinu + verb + past + 3rd per. sg. masc.
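The bidirectional relation between the surface level and the lexical level can be sketched with a small lookup table. This is only an illustration under stated assumptions: the table LEXICON holds three of the examples given above, and the names analyse and generate are hypothetical; a real analyzer derives the mapping from rules instead of listing every word form.

```python
# Pairs of (surface form, lexical analysis), taken from the examples above.
LEXICON = [
    ("ate", ("eat", "verb", "past")),
    ("boys", ("boy", "noun", "plural")),
    ("tinnADu", ("tinu", "verb", "past", "3rd per. sg. masc.")),
]

# One table read in two directions: surface -> analyses, analysis -> surface.
SURFACE_TO_LEXICAL = {}
LEXICAL_TO_SURFACE = {}
for surface, lexical in LEXICON:
    SURFACE_TO_LEXICAL.setdefault(surface, []).append(lexical)
    LEXICAL_TO_SURFACE[lexical] = surface


def analyse(word):
    """Analysis: surface form -> list of lexical-level analyses (may be empty)."""
    return [" + ".join(a) for a in SURFACE_TO_LEXICAL.get(word, [])]


def generate(root, *features):
    """Generation: root plus features -> surface form, or None if unknown."""
    return LEXICAL_TO_SURFACE.get((root,) + features)


if __name__ == "__main__":
    print(analyse("ate"))
    print(generate("eat", "verb", "past"))
```

Analysis returns a list because one word form may have several analyses, while generation returns a single form.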
Morphological Analyzers
They are tools that automatically decompose a word into its root and affixes and give the related features.
Example:
  1st stage - identifying morphemes
    ate: root = eat, suffix = ed
  2nd stage - analyzing morphemes
    ate: root = eat, tense = past

Morph Analyser
  Language: Hindi
  Input (word): ladake
  Output (analysis):
    a) root=ladakA, cat=n, gen=m, num=sg, case=obl
    b) root=ladakA, cat=n, gen=m, num=pl, case=dir

Morph Generator
  Language: Hindi
  Input (analysis):
    a) root=ladakA, cat=n, gen=m, num=sg, case=obl
    b) root=ladakA, cat=n, gen=m, num=pl, case=dir
  Output (word): ladake

Machine learning of morphological rules
A supervised approach requires training data, and the rules are extracted from the training data.
Rules = orthographic, suppletion, assimilation, vowel harmony
Morphological analyzers can be built by using a lexical database with morphological information to build the rules.
A good example of such a lexical database is CELEX, which contains information about lemma, wordform, abbreviation, corpus tagging etc.

Some Applications
  Machine Translation
  Speech Processing
  Information Retrieval

Machine Translation
A POS tagger gives only the part of speech. More information is needed to translate a word correctly, such as tense, aspect and mood of the verbs, and gender, number and person of the nouns.
Example: [Eng → Hindi translation]
  ENGLISH: She went home.
  HINDI: vaha ghar gayi.
  ENGLISH: He went home.
  HINDI: vaha ghar gayaa.
The gender of the pronoun is essential for the translation into Hindi. The morph analyzer will give the gender information.
Example: [Hindi → Eng translation]
In Hindi 'vaha' can have different senses: 'he', 'she' or 'that'.
  "vaha ghar gayaa"
If we were to translate this, the extra information on the verb would help us translate the sentence correctly as "He went home".
The 'yaa' indicates past tense as well as singular number and masculine gender. The morph analyzer will give this information.
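How the verb's features resolve the pronoun can be sketched as follows. This is a toy illustration, not a translation system: the table VERB_ANALYSES stands in for the output of a morph analyzer and covers only the two verb forms above, the root value "jA" and the feature names follow the notation used elsewhere in these slides, and the function name translate_pronoun is hypothetical.

```python
# Stand-in for morph analyzer output for the two verb forms in the examples
# (assumption of this sketch; a real analyzer would compute these features).
VERB_ANALYSES = {
    "gayaa": {"root": "jA", "tense": "past", "num": "sg", "gen": "m"},
    "gayi": {"root": "jA", "tense": "past", "num": "sg", "gen": "f"},
}

PRONOUN_BY_GENDER = {"m": "He", "f": "She"}


def translate_pronoun(hindi_sentence):
    """Choose the English pronoun for 'vaha' from the gender marked on the verb."""
    words = hindi_sentence.rstrip(".").split()
    if not words or words[0] != "vaha":
        return None
    # In these examples the verb is the last word of the sentence.
    analysis = VERB_ANALYSES.get(words[-1])
    if analysis is None:
        return None
    return PRONOUN_BY_GENDER[analysis["gen"]]


if __name__ == "__main__":
    print(translate_pronoun("vaha ghar gayaa"))
    print(translate_pronoun("vaha ghar gayi."))
```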
Information Retrieval
Stemming: a process of reducing an inflected or derived word to its stem. It makes search effective.
Example:
  Wordforms = play, plays, played, playing → PLAY
  Wordforms = child, children, childish → CHILD
Stemmers are useful but have limitations.
  Eg. marketing → market
      speaker → speak

Speech Processing
In Text to Speech tools, too, a Morph Analyzer is essential along with Part of Speech.
With extra information on the words, the efficiency increases. The intonation, the pause, the stress etc. can be close to the way humans speak.
This additional information is given by morph analyzers.
(POS taggers help in pronouncing ambiguous words the right way: wind, wound, minute etc.)

Linguistic models for describing morphology
  Item and Arrangement (morpheme based)
  Item and Process (lexeme based)
  Word and Paradigm (word based)

Approaches
  Paradigm based
  Finite State based
  Combination of both: fast and efficient

PARADIGM BASED MORPHOLOGICAL ANALYZERS

Requirements for building paradigm based Morph Analyzers
  Knowledge of Lexeme and Word forms
  Root and Affix dictionaries
  Paradigm Table
  Paradigm Class
The lexemes are stored in the dictionaries and the word forms as paradigms.

Lexeme and Word form
  APPLE: apple, apples
  CHURCH: church, churches
  BOY: boy, boys
  WATCH: watch, watches
  SPY: spy, spies
The word in upper case is called the LEXEME and the inflected forms are WORD FORMS.
Lexemes are the headwords in a dictionary.

Lexeme and Word form
Another example:
  played is a word form of the lexeme PLAY
  plays is a word form of the lexeme PLAY(1)
  plays is a word form of the lexeme PLAY(2)
where PLAY(1) is a verb and PLAY(2) is a noun.
PLAY(1) and PLAY(2) are two different lexemes.

Exercise 7
Give the lexeme of the following word forms:
  ate, played, manufactured, glasses, players, bites

Exercise 8
"manufactured" can be a verb in past tense or an adjective. So it belongs to two different lexemes: MANUFACTURE and MANUFACTURED.
Which of the following words belong to more than one lexeme?
  ate, wanted, wrote, written, finished

Root and Affix dictionaries
The root dictionary contains a list of roots or base forms: the lexemes. Each is usually stored with its part of speech.
The affix dictionary contains a list of all the affixes in a language. The features of the affixes are stored here, as attribute-value pairs.

Example entries in a dictionary
Root dictionary
  eat    <root='eat', category='verb'>
  book   <root='book', category='verb'>
  book   <root='book', category='noun'>
Affix dictionary
  +s     <tense = 'present'>
  +ed    <tense = 'past'>
  +en    <aspect = 'perfective'>
  +ing   <aspect = 'progressive'>

Paradigm table
A paradigm table represents the inflected forms of a particular lexeme.
It includes the conjugation of verbs and the declension of nouns, adjectives, pronouns etc.
Example:
  APPLE: apple, apples
  EAT: eat, eats, ate, eaten, eating
  SMART: smart, smarter, smartest

Conjugation of English verbs
  play    plays    played   played   playing
  eat     eats     ate      eaten    eating
  look    looks    looked   looked   looking
  dance   dances   danced   danced   dancing
  push    pushes   pushed   pushed   pushing

Declension of English nouns
  apple, apples
  boy, boys
  church, churches
  watch, watches
  spy, spies

Declension of English adjectives
• smart, smarter, smartest
• tall, taller, tallest

Exercise 9
Give the paradigm table for 5 different nouns and 5 different verbs in English.

Paradigm Class
A paradigm class contains the classes of lexemes, i.e. the prototypical root and all the roots that fall in its class, including the given root.
Words which decline or conjugate in exactly the same way fall into one paradigm class.
The English verbs 'PLAY' and 'LOOK' have the following paradigm:
  play   plays   played   played   playing
  look   looks   looked   looked   looking
So they belong to the same class.
But 'PUSH' falls in another class, since it differs in its present tense form, i.e. it has '-es' and not '-s'.
Its paradigm is as follows:
  push   pushes   pushed   pushed   pushing
The English nouns 'PLAY' and 'BOY' have the following paradigm:
  play   plays
  boy    boys
So they belong to the same class.
But 'SPY' falls in another class. Its paradigm is as follows:
  spy    spies

A paradigm class is represented by one member of the class.
  Class     Category   Members
  eat       V          eat
  play      V          play, talk, walk, train
  push      V          push, fish
  play      N          play, boy, day
  spy       N          spy, sky
  church    N          church, watch

Exercise 10
Which of the following verbs belong to the same paradigm class?
  mince, ride, walk, speak, shake, play, dance, take
Which of the following nouns belong to the same paradigm class?
  girl, house, dish, book, mouse, beach, flower, pencil
Identify paradigm classes in your own language.
  Avu = {Avu, bassu, bomma, chekka}

Add-Delete Rules for Generation and Analysis of words
• Most morphological analyzers handle only inflectional morphology.
• The rules given here are for such inflectional processes only.
• An add-delete rule looks like [n1, n2, xyz], where
    n1 = number of characters to delete from the end
    n2 = number of characters to add at the end
    xyz = the characters to add

Rules to generate the wordforms in a paradigm table for a given paradigm class:
Eg. play
  play[0,0,0] = play
  play[0,1,s] = plays
  play[0,2,ed] = played
  play[0,2,ed] = played
  play[0,3,ing] = playing
Eg. eat
  eat[0,0,0] = eat
  eat[0,1,s] = eats
  eat[3,3,ate] = ate
  eat[0,2,en] = eaten
  eat[0,3,ing] = eating

Exercise 11
Write similar rules to generate paradigm tables of the verbs 'dance', 'cry', 'sleep', 'drink', 'write' and the nouns 'book', 'mouse', 'church', 'sheep'.

In the same way, these rules can be used to find the root of an inflected word.
For example, the root of 'playing' is 'play': we get it by deleting 3 characters and adding nothing.
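The add-delete rules can be sketched in a few lines of code. This is a minimal sketch under stated assumptions: a rule [n1, n2, xyz] is written as a Python tuple, the slides' "0" for "nothing to add" is written as an empty string, and the names apply_rule, PLAY_RULES and EAT_RULES are hypothetical.

```python
def apply_rule(word, rule):
    """Apply an add-delete rule [n1, n2, xyz] to a word.

    n1 characters are deleted from the end, then the n2 characters xyz are added.
    """
    n_delete, n_add, to_add = rule
    if len(to_add) != n_add:
        raise ValueError("n2 must equal the number of characters to add")
    if n_delete > len(word):
        raise ValueError("cannot delete more characters than the word has")
    stem = word[:len(word) - n_delete]
    return stem + to_add


# Generation rules for the paradigm classes 'play' and 'eat' (from the slides).
PLAY_RULES = [(0, 0, ""), (0, 1, "s"), (0, 2, "ed"), (0, 2, "ed"), (0, 3, "ing")]
EAT_RULES = [(0, 0, ""), (0, 1, "s"), (3, 3, "ate"), (0, 2, "en"), (0, 3, "ing")]

if __name__ == "__main__":
    # Generation: root -> word forms. 'talk' is in the same class as 'play'.
    print([apply_rule("talk", r) for r in PLAY_RULES])
    print([apply_rule("eat", r) for r in EAT_RULES])
    # Analysis: word form -> root, using the same rule format.
    print(apply_rule("playing", (3, 0, "")))
    print(apply_rule("ate", (3, 3, "eat")))
```

Because the rules refer only to character counts at the end of the word, one rule list serves every root in the same paradigm class.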
  playing [3,0,0] = play      eating [3,0,0] = eat
  played [2,0,0] = play       eaten [2,0,0] = eat
  played [2,0,0] = play       ate [3,3,eat] = eat
  plays [1,0,0] = play        eats [1,0,0] = eat
  play [0,0,0] = play         eat [0,0,0] = eat

Exercise 12
Write similar rules to find the root of the verbs 'kept', 'cried', 'sat', 'stood', 'written' and the nouns 'mice', 'legs', 'spies', 'uncles', 'houses'.

Let's look at the paradigms in some ILs

Hindi Morph Analyzer demo
http://sampark.iiit.ac.in/hindimorph/
  Input text:: लड़का आ गया।
  Output text:
  Input text:: लड़के आ गये।
  Output text:
Note: Features explanation of a word
  1. root : Root of the word (e.g. the word ladZake has root ladZakA)
  2. cat : Category of the word (e.g. Noun=n, Pronoun=pn, Adjective=adj, verb=v, adverb=adv, post-position=psp and avvya=avy)
  3. gen : Gender of the word (e.g. Masculine=m, Feminine=f, Neuter=n, mf, mn, fn, and any)
  4. num : Number of the word (e.g. Singular=sg, Plural=pl, dual, and any)
  5. per : Person of the word (e.g. 1st Person=1, 2nd Person=2, 3rd Person=3, and any)
  6. case : Case of the word (e.g. direct=d, oblique=o and any)
  7. tam : Case marker for noun or Tense Aspect Mood (TAM) for verb of the word
  8. suff : Suffix of the word
  agr_gen: Agreeing gender of the following noun
  agr_num: Agreeing number of the following noun
  agr_cas: Agreeing case of the following noun
  emph: Emphatic feature of the word
  hon: Honorific feature of the word
  derive_root: Derived root
  derive_lcat: Derived lexical category
  derive_gen: Derived gender
  derive_suff: Derived suffix

FINITE STATE TRANSDUCERS (FST) BASED MORPHOLOGICAL ANALYZERS

Finite State Automaton/Transducers
  Computational model to handle morphological analysis/generation.
  Used in morphology, phonology, TTS, data mining.
  Popular, since mathematically extremely well understood.
  Easy to implement.
  Many off-the-shelf tools available for direct use.
  Optimisation: finding a machine with the minimum number of states that performs the same function.
  Simple extensions to Markov Models useful for POS tagging.

Finite State Automaton
A Finite State Automaton is a machine composed of:
  An input tape
  A finite number of states, with one initial and one or more accepting states
  Actions in terms of transitions from one state to another, depending on the current state and the input
The input tape has a sequence of symbols written on it.
The machine starts in the initial state.
The reader reads one symbol at a time and, depending on the current state and the input symbol, a transition to another state takes place.

Finite state automaton (FSA) Morphotactic Models
English nominal inflection
  [Figure: FSA with states q0, q1, q2 and arcs labelled reg-n, plural (-s), irreg-pl-n, irreg-sg-n]
  • reg-n: regular noun
  • irreg-pl-n: irregular plural noun
  • irreg-sg-n: irregular singular noun
  • Inputs: cats, goose, geese

Derivational morphology: adjective fragment
  [Figure: FSA with states q0 to q5 and arcs labelled un-, adj-root1, adj-root2, "-er, -ly, -est" and "-er, -est"]
  • Adj-root1: clear, happy, real
  • Adj-root2: big, red

Parsing with Finite State Transducers
  cats → cat +N +PL
Kimmo Koskenniemi's two-level morphology
  Words are represented as correspondences between the lexical level (the morphemes) and the surface level (the orthographic word).
  Morphological parsing: building mappings between the lexical and surface levels.
    Lexical level:  c  a  t  +N  +PL
    Surface level:  c  a  t  s

Finite State Transducers
FSTs map between one set of symbols and another using an FSA whose alphabet is composed of pairs of symbols from the input and output alphabets.
In general, FSTs can be used for:
  Translator (Hello: vanakkam)
  Parser/generator (Hello: How may I help you?)
• To map between the lexical and surface levels of Kimmo’s 2-level morphology

FST for a 2-level Lexicon: Example
Reg-n         Irreg-pl-n         Irreg-sg-n
c a t         g o:e o:e s e      g o o s e
[Figure: the lexicon entries drawn as letter-by-letter transducers]

FST for English Nominal Inflection
[Figure: transducer with states q0 to q7]
q0 -> q1 on reg-n;       q1 -> q4 on +N:ε;  q4 -> q7 on +PL:^s# or +SG:-#
q0 -> q2 on irreg-n-sg;  q2 -> q5 on +N:ε;  q5 -> q7 on +SG:-#
q0 -> q3 on irreg-n-pl;  q3 -> q6 on +N:ε;  q6 -> q7 on +PL:-s#
Combining (by cascade or composition) this FST with the FSTs for each noun type replaces, e.g., reg-n with every regular noun representation in the lexicon.

FST for Telugu verb
[Figure: transducer with states q0 to q8 and arcs labelled "tin: tinu, verb", "cEs: ceyyi, verb", "cEy: ceyyi, verb", "A: past", "a:", "Du: sg, masc, 3p" and "lEdu: neg"]
Example forms: tinnADu, tinaledu; vinnADu, vinaledu; cEsADu, cEyaledu

Issues: mainly due to ambiguity
• When shot at, the dove dove into the bushes.
• The insurance was invalid for the invalid.
• They were too close to the door to close it.
• The buck does funny things when the does are present.
• Upon seeing the tear in the painting I shed a tear.
• After seeing the wound, the doctor wound his watch.
Such words with multiple analyses affect MT and other NLP applications.

Examples
• veyyi = verb or noun? “put” (verb), “thousand” (noun)
• perugu = verb or noun? “grow” (verb), “yogurt” (noun)
• nAku = verb or pronoun in k4? “to lick” (verb), “for/to me” (pronoun)
• (icecream) nAkivvu
    nAku + ivvu (“Give it to me”)
    nAki + ivvu (“Lick and then give”)

Tools
Several off-the-shelf FST tools are available that support the Word and Paradigm model:
1. XFST (Xerox Finite State Tool)
2. SFST (Stuttgart Finite State Transducer Tools)
3. Lttoolbox (part of Apertium)

To wind up…
• Knowledge of words and word structure is important.
• A good linguistic analysis of a language will help in formulating rules of word formation, paradigm tables and paradigm classes.
• Morphological generation is the reverse process of morphological analysis.
• A thorough study of one’s language at word level will help build a morphological analyzer (MA) using any approach.
• Phonological changes at morpheme boundaries, and agglutinative and fusional types of suffixing, make analysis more challenging.
• Selecting the right analysis when a wordform has more than one analysis is another challenging task, especially in Machine Translation systems.

References
Akshar Bharati, Vineet Chaitanya, Rajeev Sangal. 1995. Chapter 3: Words and Their Analyzer. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.
Akshar Bharati, Rajeev Sangal, Dipti M Sharma, Radhika Mamidi. 2004. Generic Morphological Analysis Shell. In Proceedings of LREC. Lisbon, Portugal. 24th-30th May 2004.
Dan Jurafsky and J.H. Martin. 2000. Speech and Language Processing. Prentice-Hall.
Monis Khan, Radhika Mamidi, Umar Hamid Shah. 2005. Morphological Analyzer for Kashmiri. LSI.
Amba Kulkarni, G Uma Maheshwar Rao. Building Morphological Analyzers and Generators for Indian Languages using FST. Tutorial for ICON 2010.
Viswanathan S et al. 2003. A Tamil Morphological Analyser. Recent Advances in NLP. 31-39. Mysore: Central Institute of Indian Languages.
Parameshwari K. 2011. An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil. Language in India. 11: 5.
R. Veerappan, P J Antony, S Saravanan, K P Soman. 2011. A Rule based Kannada Morphological Analyzer and Generator using Finite State Transducer. International Journal of Computer Applications (0975-8887), Volume 27, No. 10, August 2011.

Thank you!
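
Sketch: the English nominal-inflection transducer in code
The transducer from the FST slides can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the workshop's implementation or the API of any of the listed tools: the `FST` class, `tokenize`, `build_noun_fst` and the three-word lexicon are hypothetical names and data chosen for this sketch. Boundary symbols (`^`, `#`) are left out, and the machine runs only in the generation direction (lexical to surface).

```python
import re
from collections import defaultdict

# Illustrative lexicon (an assumption for this sketch, not a real resource).
REG_NOUNS = ["cat", "dog"]
IRREG_NOUNS = {"goose": "geese"}  # singular -> plural, same length


class FST:
    """Tiny non-deterministic transducer; state 0 is the initial state."""

    def __init__(self):
        self.arcs = defaultdict(list)  # state -> [(input, output, next_state)]
        self.finals = set()
        self.n_states = 1

    def add_path(self, pairs, start=0):
        """Add a chain of input:output arcs from `start`; return the last state."""
        state = start
        for inp, out in pairs:
            nxt = self.n_states
            self.n_states += 1
            self.arcs[state].append((inp, out, nxt))
            state = nxt
        return state

    def transduce(self, symbols):
        """Return every output string for paths that consume all symbols."""
        results, stack = [], [(0, 0, "")]
        while stack:
            state, i, out = stack.pop()
            if i == len(symbols):
                if state in self.finals:
                    results.append(out)
                continue
            for inp, o, nxt in self.arcs[state]:
                if inp == symbols[i]:
                    stack.append((nxt, i + 1, out + o))
        return results


def tokenize(lexical):
    """Split 'cat+N+PL' into ['c', 'a', 't', '+N', '+PL']."""
    return re.findall(r"\+[A-Z]+|.", lexical)


def build_noun_fst():
    fst = FST()
    for noun in REG_NOUNS:
        # reg-n, then +N:(empty), then +SG:(empty) or +PL:s
        stem_end = fst.add_path([(c, c) for c in noun] + [("+N", "")])
        fst.finals.add(fst.add_path([("+SG", "")], start=stem_end))
        fst.finals.add(fst.add_path([("+PL", "s")], start=stem_end))
    for sg, pl in IRREG_NOUNS.items():
        # irreg-sg-n: letters map to themselves, only +SG is allowed
        stem_end = fst.add_path([(c, c) for c in sg] + [("+N", "")])
        fst.finals.add(fst.add_path([("+SG", "")], start=stem_end))
        # irreg-pl-n: letter pairs such as o:e, only +PL is allowed
        stem_end = fst.add_path(list(zip(sg, pl)) + [("+N", "")])
        fst.finals.add(fst.add_path([("+PL", "")], start=stem_end))
    return fst
```

For example, `build_noun_fst().transduce(tokenize("goose+N+PL"))` returns `["geese"]`, and `"cat+N+PL"` gives `["cats"]`. Analysis (surface to lexical) would use the same arcs with input and output swapped, which needs handling of empty-input arcs that this sketch leaves out.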