COMPUTATIONAL
MORPHOLOGY
Radhika Mamidi
Winter workshop/school on Machine Learning and Text Analytics
17 December, 2013
Outline
What is Computational Morphology?
Morphology concepts
Computational approaches
  Morphological Analysis
  Morphological Generator
Computational Approaches
  Paradigm based
  Finite State based
  Statistical based
Issues to be handled
Computational Morphology
Linguistics: Scientific study of language at these levels
Sound – Phonetics and Phonology
Word - Morphology
Sentence - Syntax
Discourse – Discourse Analysis
Computational Linguistics: Processing and producing
language computationally.
Computational Morphology: Processing and producing
language computationally at WORD LEVEL
Why are we studying morphology?
The knowledge of words will help us process language computationally at word level.
Knowledge of words includes word structure and word formation rules.
This knowledge will help us in developing NLP tools like “Morphological Analyzers” and “Morphological Generators”.
These are important in Information Retrieval, Machine Translation, Spell checking, etc.
What’s Morphology?
The study of word structure
The study of the mental dictionary:
How are words stored in the mind?
What is a possible word?
Example:
(i) At the supermarket, the girls bought pink cheeriots
and the boys blue fistings.
(ii) When their mother signaled, the girls barried
home unhappily.
The words ‘cheeriots’, ‘fistings’ and
‘barried’ do not exist. However, assuming
they are valid words of English, we
‘guess’ the meaning by context and the
position of the word in the given sentence.
We do this using our general knowledge
and linguistic knowledge.
At the supermarket, the girls bought pink
cheeriots and the boys blue fistings.
Part of speech = nouns [they come after adjectives]
-s ending = more than one
Meaning = some objects that have color
Clue “supermarket” = some object that is sold; perhaps a toy
The word forms are more like toys, balls, ribbons.
When their mother signaled, the girls
barried home unhappily.
Part of Speech = verb [position]
[ends in –ed] = past tense
Base form = barry
Meaning = go
The word form is more like carried, married.
What’s the Longest Word of English?
Could it be ismestablishmentariandisanti?
Why not, when antidisestablishmentarianism is possible?
Possible words with anti and missile:
anti-missile (adjective)
anti-missile missile: a missile used for anti-missile purposes
anti-anti-missile missile missile: a missile used against anti-missile missiles
anti^N missile^(N+1), where N can go on until….
There is a systematic way of word formation.
Eg: happy, -ness, un-
This is called Morphotactics.
Morphemes
Have a sound [form] and a meaning:
Example: “cats”
/kæt/ “four-legged animal”
/-s/ “plural number”
Even though /-s/ has a sound and a meaning, it can’t mean “plural” by itself…
It has to attach to a noun.
“A morpheme is the smallest unit of word form
that has meaning”
Examples:
cats = cat + -s
girlish = girl + -ish
unfriendly = un- + friend + -ly
cat, -s, girl, -ish, un-, friend, -ly are morphemes
Even Bush knows morphology
(…though he may use it differently than the rest of us)
The war on terrorism has transformationed the US-Russia Relationship
We’re working to help Russia securitize the dismantled warheads
The explorationists are only willing to help move equipment during the winter
This case has had full analyzation and has been looked at a lot
Compositionality
“Explorationists”
 explore: to spend an extended effort looking around
a particular area - X
 -ation: can attach to Verbs, the process of Xing - Y
 -ist: can attach to Nouns, one who performs an
action Y - Z
 -s: attaches to Nouns, more than one Z



explorationists: a compositional word
Fully compositional meaning is based on its parts
Eg: computer, impossible, headache
Non-compositionality
“Inflect”
 Is inflect morphologically complex?
 It contains more than one morpheme.
 What do in- and flect mean?
 This is a case of a non-compositional meaning.
 In explorationists, if you know the meaning of the
parts, you know the meaning of the whole. Not
necessarily so for inflect.
 Non-compositional meaning cannot be derived
from its parts.
Lexical/Content words




Words which are not function words are called
content words or lexical words: these include
nouns, verbs, adjectives, and most adverbs, though
some adverbs are function words (e.g. then, why).
They belong to open class.
Dictionaries define the specific meanings of content
words, but can only describe the general usages of
function words.
By contrast, grammars describe the use of function
words in detail, but have little interest in lexical
words.
Function words


Function words or grammatical words are words
that have little lexical meaning or have ambiguous
meaning, but instead serve to express grammatical
relationships with other words within a sentence.
Function words may be prepositions, pronouns,
auxiliary verbs, conjunctions, grammatical articles
or particles, all of which belong to the group of
closed class words.
To know about morpheme we should know
about….





Free morphemes vs. Bound morphemes
Lexical morphemes vs. Functional morphemes
Null/Zero morpheme
Inflectional morphemes vs. Derivational
morphemes
Root morphemes vs. Affix morphemes
Free vs Bound morphemes





electr- and tox- have isolable meanings in
electric, electrify, toxic, (de-)toxify
But they cannot be pronounced on their own:
they are bound morphemes
girl and book have isolable meanings in girls,
girlish, books, booked, booking
They can occur on their own: they are free
morphemes
Are prefixes and suffixes bound morphemes?
Lexical morpheme
free morphemes: apple, smart, book, slow, eat, write
They can exist on their own as independent words.
bound morphemes: -ceive, -ject, cran-, -ship, un-, dis-
They cannot be used independently.
They need another morpheme [free or bound] to form a word.
Eg: re-ceive, con-ceive, sub-ject, pro-ject, cran-berry,
scholar-ship, fellow-ship, un-kind, dis-obey
Functional morphemes
free morphemes:
of, with, she, it, and, although, however, because,
then
bound morphemes:
-s, -ed, -ing,
Four-way contrasts

Lexical, Free: Nouns, Verbs, Adj, Adv
cat, town, call, house, hall, smart, fast

Lexical, Bound: including derivational affixes
rasp- [raspberry], cran- [cranberry] , -ceive [conceive,
receive], un-, re-, pre-

Functional, Free: Prepositions, Articles, Conj
with, at, and, an, the, because

Functional, Bound: inflectional affixes
-s, -ed, -ing [eats, walked, laughing]
Exercise 1: Identify the free and bound morphemes in
the following words








walked, talked, danced, arrived
playhouse, watchdog, football player
drinking, playing, eating
import, export, transport
raspberry, cranberry
invert, convert, divert
books, pens, boards
writer, caretaker, rider, fighter
Can the following words be decomposed?
 delight, news, traitor, bed, evening
Exercise 2: Identify the lexical and functional morphemes in
the following words. Mention if they are free or bound.







politically
beautiful
between
writing
raspberries
unable
nationalization
Inflection vs. Derivation
Derivational affixes: allow us to make new words with an altered meaning.
There is an error in the computation. [compute (V) – computation (N)] {Nominalisation}
It is a computational approach. [computation (N) – computational (Adj)] {Adjectivization}
Inflectional suffixes: required in order to make the sentence grammatical.
Inflected words belong to the same class.
*Yesterday I walk to class [walk (V) – walked (V)]
*I like all my student [student (N) – students (N)]
Inflectional Morphology
Examples: [the POS remains the same]
VERBS
EAT = eat, eats, ate, eaten, eating
DRINK = drink, drinks, drank, drunk, drinking
PLAY = play, plays, played, played, playing
0, -s, -ed, -en, -ing are inflectional morphemes
NOUNS
PLAY = play, plays
GIRL = girl, girls
SHEEP = sheep, sheep
-s, 0 are inflectional morphemes
Derivational morphology
Two types:
May change the category {N, V, A, Adv}
drive (V) + -er = driver (N)
eat (V) + -able = eatable (Adj)
girl (N) + -ish = girlish (Adj)
disturb (V) + -ance = disturbance (N)
Doesn’t have to change the category
un- + do (V) = undo (V)
re- + fry (V) = refry (V)
un- + happy (Adj) = unhappy (Adj)
Derivational – more examples
Verbs
eat – eatable [adj], eatables [noun]
drink – drinking [noun]
play – player [noun]
-able, -ing, -er are derivational morphemes
Nouns
play – playful [adj], replay [verb]
girl – girlish [adj], girlhood [noun]
sheep – sheepish[adj]
-ful, re-, -ish, -hood are derivational morphemes
Exercise 3
Each of the words below contains two
morphemes – a root and a derivational affix.
Decide if the derivational affix changes only the
meaning or the class of the root as well.
rewrite
happily
unclear
unhappy
hopeless
national
creation
happiness
helpful
undo
Null/Zero morpheme



A null morpheme is a morpheme that is
realized by a phonologically null affix
(an empty string of phonological
segments)
A null morpheme is an "invisible" affix.
It's also called zero morpheme; the
process of adding a null morpheme is
called null affixation.
Examples




cat = cat + -0 = ROOT("cat") + SINGULAR
cats = cat + -s = ROOT("cat") + PLURAL
sheep = sheep + -0 = ROOT(“sheep") + SINGULAR
sheep = sheep + -0 = ROOT(“sheep") + PLURAL
More examples


darken[verb] = dark [adj] + -en
Meaning = make more ‘Adjective’
redden [verb] = red + -en [make more Red]
yellow [verb] = yellow + 0 [make more yellow]
brown [verb] = brown + 0 [make more brown]
blacken [verb] = black + -en [make more black]
Root Morphemes vs Affix morphemes
Root morphemes are morphemes around which larger
words are built.
Root morphemes are free or bound.
Affixes are additional morphemes added to roots to
create multi- or poly-morphemic words.
Affixes are always bound.

Rats
Root = rat [free morpheme]
Affix = -s [bound morpheme]

Project
Root = -ject [bound morpheme]
Affix = pro- [bound morpheme]

Mice
Root = mouse [free morpheme]
Affix = -s [bound morpheme]

Ate
Root = eat [free morpheme]
Affix = -ed [bound morpheme]

Disgracefulness
Root = grace [free morpheme]
Affixes = dis-, -ful, -ness [bound morpheme]
Affixes
Morphemes added to free forms to make other free forms are called affixes.
Mainly four kinds of affixes:
1. Prefixes (at beginning) – “un-” in “unable”
2. Suffixes (at end) – “-ed” in “walked”
3. Circumfixes (at both ends) – “en-…-en” in enlighten
4. Infixes (in the middle) – “-um-” in kumilad [‘to be red’], fumikas [‘to be strong’]
   [kilad = ‘red’, fikas = ‘strong’ in Bontoc language]
Affixes are bound morphemes.
Prefixes




No prefix can determine the category of a complex
word:
Eg: unhappy, unhappiness, unhappily, undo
What does un- mean when it attaches to adjectives?
unkind, unhappy
What does un- mean when it attaches to verbs?
undo, untie
Suffixes
The rightmost suffix determines the category of a word, as in triplets like:
Eg: rational, rationalize, rationalization
rational = adjective
rationalize = verb
rationalization = noun
Allomorph
An allomorph is a variant form of a morpheme.
The meaning remains the same, while the sound can vary.
Example: the different forms of the past tense morpheme /-ed/ [as we hear]
barked, hissed [t]
raised, smelled [d]
added, trotted [ed]
Allomorphs of /-s/ for nouns

Example: the different forms of plural morpheme /s/ are: [as we read]
-s --- cats, dogs, boys, girls
-es – watches, churches
-0 – sheep
/-s/, /-es/ and 0 are allomorphs of /-s/
{If pronunciation is considered, then /-s/, /-z/, /-iz/
and 0 are allomorphs of /-s/ in the above
examples}
Lexemes and Word-forms



Lemma/Lexeme: Base form of a word. Includes derived forms. It is different from root.
Word-forms: The inflected forms of a word.
Eg: happy, unhappy, unhappiness, happiness, happier, unhappier, happiest, unhappiest
Lexemes: happy, unhappy, unhappiness, happiness
Word-forms:
Of lexeme HAPPY: happy, happier, happiest
Of lexeme UNHAPPY: unhappy, unhappier, unhappiest
Analysing a word








Look at the word ACTORS
Three morphemes: act, -or, -s
Root: act
Suffixes: -or, -s
Derivational suffix: -or
Inflectional suffix: -s
Lexeme: ACTOR
Word-forms: actor, actors
Hierarchical Structure within
Words



the word unlockable is ambiguous
[[un + lock] able]: able to be unlocked
[un [lock +able]]: not able-to-be-locked
Similar to this structure:
 French History Teacher
 Old Ladies Hostel
 Old Bombay Highway
Disgraceful
            Adj
           /   \
        Noun    Suffix
       /    \      |
   Prefix  Noun    |
     |      |      |
    Dis   grace   ful
Ungraceful
        Adj
       /   \
  Prefix    Adj
     |     /   \
     |   Noun  Suffix
     |    |      |
    Un  grace   ful
Exercise 4
Give the hierarchical structure of the following words







unwanted
disfigurement
Interchangeable
united
actors
retries
unhappiness
KEY POINTS



Words and word structure
Root morphemes vs. Affix morphemes
Lexical morphemes


Free morphemes
Bound morphemes


Functional morphemes


Free morphemes
Bound morphemes


Derivational morphemes
Inflectional morphemes
Null/Zero morpheme
Morphological Analysis
Morphology = The study of word structure
Morphological Analysis = Analysing words



How do we do it?
Is the process opposite of Morph generation?
Why is it important?
Look at the following words and break
them up into smaller units













useless
copilot
psychology
sickness
communism
socialism
artist
dentist
curable
impossible
microstructure
macrostructure
departure
Did you do it this way?
Comparing with other words?
useless – endless, meaningless – “without”
copilot – co-author, co-editor – “with”
psychology – physiology, biology – “study of”
sickness – dizziness, happiness – “state of being”
communism – socialism, feminism – “belief, doctrine”
artist – activist, tourist – “one who”
impossible – impatient, immoral – “not”
microstructure – microscope, microbiology – “small”
macrostructure – macroeconomics, macro – “large”
What about coalition, departure, important, dentist?
Now do the same for this data:












Mtotoamefika
Mtotoanafika
Mtotoatafika
Watotowamefika
Watotowanafika
Watotowatafika
Mtuamelala
Mtuanalala
Mtuatalala
Watuwamelala
Watuwanalala
Watuwatalala
Here’s some clue!












Mtotoamefika - "The child has arrived."
Mtotoanafika - "The child is arriving."
Mtotoatafika - "The child will arrive."
Watotowamefika - "The children have arrived."
Watotowanafika - "The children are arriving."
Watotowatafika - "The children will arrive."
Mtuamelala - "The man has slept."
Mtuanalala – "The man is sleeping."
Mtuatalala - "The man will sleep."
Watuwamelala - "The men have slept."
Watuwanalala - "The men are sleeping."
Watuwatalala - "The men will sleep."
Did you do it this way?
Comparing with other words?
M-toto-a-me-fika - "The child has arrived."
M-toto-a-na-fika - "The child is arriving."
M-toto-a-ta-fika - "The child will arrive."
Wa-toto-wa-me-fika - "The children have arrived."
Wa-toto-wa-na-fika - "The children are arriving."
Wa-toto-wa-ta-fika - "The children will arrive."
M-tu-a-me-lala - "The man has slept."
M-tu-a-na-lala - "The man is sleeping."
M-tu-a-ta-lala - "The man will sleep."
Wa-tu-wa-me-lala - "The men have slept."
Wa-tu-wa-na-lala - "The men are sleeping."
Wa-tu-wa-ta-lala - "The men will sleep."
Morphemes found by comparison: m-/wa- (singular/plural on the noun), a-/wa- (singular/plural subject on the verb), -me- (perfect), -na- (present progressive), -ta- (future), toto “child”, tu “man”, fika “arrive”, lala “sleep”.
Exercise 5
What is the equivalent of…?

The child has slept.

The children are sleeping.

The men have arrived.

The man will arrive.
Exercise 5
What is the equivalent of…?
The child has slept.
Mtotoamelala

The children are sleeping.
Watotowanalala

The men have arrived.
Watuwamefika

The man will arrive.
Mtuatafika
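The answers to Exercise 5 can be reproduced mechanically once the morphemes are listed. The following is a minimal sketch, assuming the segmentation that falls out of comparing the twelve sentences; the dictionary names and the `generate` function are illustrative only.

```python
# Morpheme inventory read off the data above (an assumed analysis).
NOUN_PREFIX = {"sg": "m", "pl": "wa"}       # number marker on the noun
NOUN_STEMS = {"child": "toto", "man": "tu"}
SUBJECT_PREFIX = {"sg": "a", "pl": "wa"}    # agreement marker on the verb
TENSE = {"perfect": "me", "present": "na", "future": "ta"}
VERB_STEMS = {"arrive": "fika", "sleep": "lala"}


def generate(noun, number, tense, verb):
    """Concatenate the morphemes in the order seen in the data."""
    form = (NOUN_PREFIX[number] + NOUN_STEMS[noun]
            + SUBJECT_PREFIX[number] + TENSE[tense] + VERB_STEMS[verb])
    return form.capitalize()


for args in [("child", "sg", "perfect", "sleep"),
             ("child", "pl", "present", "sleep"),
             ("man", "pl", "perfect", "arrive"),
             ("man", "sg", "future", "arrive")]:
    print(args, "->", generate(*args))
```

Because this data is fully regular, plain concatenation is enough; the irregularities discussed later are what make real analyzers harder.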

Quite easy?!
It is easy to identify morphemes if the language is regular and consistent.
Languages are of the following types (but no language belongs solely to one type):
a. Isolating/Analytic (low morpheme-per-word ratio)
   Eg: Chinese, Vietnamese, English
   Context and syntax more important than morphology
b. Synthetic
   Fusional (many features fused in one morpheme)
   Eg: Most IE languages
   Agglutinating (joining many morphemes together)
   Eg: Finnish, Korean, Turkish, Japanese, Telugu

Isolating languages
Isolating languages do not (usually) have any bound
morphemes
Eg: Mandarin Chinese
Gou bu ai chi qingcai (dog not like eat vegetable)
This can mean one of the following
(depending on the context)
The dog doesn’t like to eat vegetables
The dog didn’t like to eat vegetables
The dogs don’t like to eat vegetables
The dogs didn’t like to eat vegetables.
Dogs don’t like to eat vegetables.
Other Examples




English
nationalisation
Turkish
uygar+laş+tır+ama+dık+larımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
Telugu
pagalagoTTabOyADu
He was about to break it.
Hindi
jA_sakatA_hE
He/It can go.
A bit of warm-up first: Write a small text in
your language
Tamil
Manipuri
Languages are not so regular….
English verbs
 Eat – ate – eaten
 Heat – heated – heated
 Teach – taught – taught
 Preach – preached – preached
 Cry – cried - cried
English nouns
 Box – boxes
 Ox – oxen
 House – houses
 Mouse – mice
 Child - children
Spelling changes occur at morpheme boundaries.
Cases of suppletion are often seen.
Vowel change also occurs.
There is always a default regular word formation.
Orthographic rules identified
Name | Description of rule | Example
Consonant doubling | 1-letter consonant doubled before -ing/-ed | beg/begging
E deletion | Silent e dropped before -ing and -ed | make/making
E insertion | e added after -s, -z, -x, -ch, -sh before -s | watch/watches
Y replacement | -y changes to -ie before -s, -i before -ed | try/tries
K insertion | Verbs ending with vowel + -c add -k | panic/panicked
J&M (2000)
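The five rules above can be sketched as string operations. This is only a rough approximation working on spelling alone: the function names and the consonant-vowel-consonant test are assumptions made for illustration, and cases that depend on stress (visit/visiting) or on irregular forms are not handled.

```python
VOWELS = "aeiou"


def is_cvc(stem):
    """True if the stem ends in consonant + vowel + single consonant."""
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return (c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")


def add_s(stem):
    """Attach -s, applying E insertion and Y replacement."""
    if stem.endswith(("s", "z", "x", "ch", "sh")):
        return stem + "es"                       # watch -> watches
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"                 # try -> tries
    return stem + "s"


def add_vowel_suffix(stem, suffix):
    """Attach -ing or -ed, applying the remaining rules in turn."""
    if stem.endswith("c") and len(stem) > 1 and stem[-2] in VOWELS:
        return stem + "k" + suffix               # K insertion
    if stem.endswith("e") and len(stem) > 2 and stem[-2] not in VOWELS:
        return stem[:-1] + suffix                # E deletion
    if (suffix == "ed" and stem.endswith("y") and len(stem) > 1
            and stem[-2] not in VOWELS):
        return stem[:-1] + "ied"                 # Y replacement
    if is_cvc(stem):
        return stem + stem[-1] + suffix          # consonant doubling
    return stem + suffix


print(add_s("watch"), add_vowel_suffix("beg", "ing"),
      add_vowel_suffix("panic", "ed"))
```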
Suppletion
• Replacement of the entire wordform.
• Example: am, is, are, was, were – variants of ‘be’
ate – past tense of ‘eat’
In Hindi: vaha jA rahA hE (“He is going.”).
vaha gayA (“He went”)
In Telugu:
tinnADA? (“Has he eaten?”). tinalEdu (“(He) Has not eaten.”)
vaccADA? (“Has he come?”). rAlEdu (“(He) Has not come.”)
Assimilation
Change of one sound into another because of the influence of neighbouring sounds.
Example: peN + kaL → peNgaL
manishi + lu → manushulu (“vowel harmony”)
Reduplication
Some or all of a word is duplicated to mark a
morphological process
Indonesian
 orang (man) ⇒ orangorang (men)
Bambara
 wulu (dog) ⇒ wuluowulu (whichever dog)
Turkish
 mavi (blue) ⇒ masmavi (very blue)
 kırmızı (red) ⇒ kıpkırmızı (very red)
 koşa koşa (by running)
Let’s explore your language
Lexical, Free morphemes
Avu {“cow”}, pustakam {“book”}, pilla {“girl”}
Lexical, Bound morphemes
a-, su- {adharmam “in-justice”, suhAsini “good-smile”}
Functional, Free morphemes
Ame {“she”}, mariyu {“and”}, kAni {“but”}
Functional, Bound morphemes
-lo, -ki {inTilO “in-house”, chetiki “to-hand”}
What else is happening?

Is there a null morpheme in your language?
pilli pAlu tAgiMdi {“pilli” in nominative case}
“The cat drank milk”
pilli tOka guburugA uMdi {“pilli” in genitive case}
“The cat’s tail is bushy”
Some changes in lexemes
In Telugu:
illu → inTi before locative case marker “lO”, “ki”
ceyyi → cEti before locative case marker “lO”, “ki”
In Hindi:
laDakA → laDake before ergative case marker “ne”
laDake → laDakOM before the ergative case marker “ne”
What about causative and other
constructions? Observe the words.
Exercise 6:
Translate in the language you speak at home.
 Ram killed Ravan. Ravan died.
 The baby ate. The nanny fed him.
 Sita made the nanny feed the baby.
 Ravi broke the vase. The vase broke.
 Anil broke the lock open.
 Ravish axed the goat to death.
 Bina nailed the picture on the wall.
In Kashmiri
Number (singular, plural)
The plural form of a noun can be obtained by addition of a suffix, by a change in vowel, or without any change in form.
Example:
Type | Singular | Plural
Suffix addition | kangir “fire pot” | kangri
Suffix addition | laej “pot” | laeji
Vowel change | kuTh “room” | kueTh
Vowel change | gagur “rat” | gagar
Invariants | kAv “crow” | kAv
Invariants | kan “ear” | kan
In Kashmiri
Gender (masculine, feminine)
Example:
Type | Masculine | Feminine
Different roots for each gender | ladake “boy” | kUr “girl”
Different roots for each gender | dAnd “ox” | gAv “cow”
Same root but changes not regular | batuk “duck” | batich “fem duck”
Same root but changes not regular | tsAvul “goat” | tsAvij “fem goat”
Same root with regular changes | dAndur “greengrocer” | dAndren
Same root with regular changes | kAndur “baker” | kAndren
Let’s look at Telugu verbs



Agglutinative language
Many suffixes with different features concatenating
Derived words from pagili:
pagiliMdi – pagilipOyiMdi – pagalabOyiMdi –
pagalledu - pagalagoTTADu –
pagalagoTTabOyADu
Morph analysis = Identifying morphemes and analysing them
boys → boy + s → boy +noun + plural marker
Root = boy
POS = noun
Number = plural
Avulu → Avu +noun + plural
pustakAlu → pustakam +noun + plural
tinnADu → tinu + A + Du → tinu +verb +past +masc.sg
cEsADu → ceyyi + A + Du → ceyyi +verb +past +masc.sg
OdikoMDiddavana → the one (masculine) who was reading
Odu + i + koLLu + MD + u + iru + dd + a + avanu + a
Root + VBP + AUXV + PST + VBP + AUXV + PST + RP + PRON-3SM + ACC
Exercise:
Analyse every word you used in your text
Ambiguous words: multiple analyses






Plants
Grounded
Ground
Leaves
Found
Can
Morphological Analysis and Generation
Bidirectional
Example:
Ate ↔ eat + verb + past
Surface level: tinnADu “he ate”
|
Intermediary level: tinn + A + Du
|
Lexical level: tinu + verb + past + 3rd per. sg. masc.
Morphological Analyzers
They are tools to automatically decompose a word
into its root and affixes and give related features.
Example:
1st stage – identifying morphemes
ate: root = eat
suffix = ed
2nd stage – analyzing morphemes
ate: root = eat
tense = past
Morph Analyser
Language | Input: word | Output: analysis
Hindi | ladake | a) root=ladakA, cat=n, gen=m, num=sg, case=obl
Hindi | ladake | b) root=ladakA, cat=n, gen=m, num=pl, case=dir
Morph Generator
Language | Input: analysis | Output: word
Hindi | a) root=ladakA, cat=n, gen=m, num=sg, case=obl | ladake
Hindi | b) root=ladakA, cat=n, gen=m, num=pl, case=dir | ladake
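The analyser and generator tables are the same mapping read in opposite directions. A minimal sketch, assuming a hand-written lookup table with the single Hindi entry from above (a real system would compute analyses from rules instead of storing them):

```python
# Toy table: surface form -> list of analyses (taken from the tables above).
ANALYSES = {
    "ladake": [
        {"root": "ladakA", "cat": "n", "gen": "m", "num": "sg", "case": "obl"},
        {"root": "ladakA", "cat": "n", "gen": "m", "num": "pl", "case": "dir"},
    ],
}


def analyse(word):
    """Word -> all analyses (an ambiguous word has more than one)."""
    return ANALYSES.get(word, [])


def generate(analysis):
    """Analysis -> all word forms that carry exactly these features."""
    return [word for word, alts in ANALYSES.items() if analysis in alts]


for a in analyse("ladake"):
    print(a, "->", generate(a))
```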
Machine learning of morphological rules
Supervised approach: requires training data; rules are extracted from the training data.
Rules = orthographic, suppletion, assimilation, vowel harmony
Morphological analyzers can be built by using a lexical database with morphological information to build the rules.
A good example of such a lexical database is CELEX, which contains information about lemma, wordform, abbreviation, corpus tagging, etc.
Some Applications



Machine Translation
Speech Processing
Information Retrieval
Machine Translation


A POS tagger gives only the part of speech. More information is needed to translate a word correctly:
the tense, aspect and mood of verbs, and the gender, number and person of nouns.
Example: [Eng → Hindi translation]
ENGLISH: She went home.
HINDI: vaha ghar gayi.
ENGLISH: He went home.
HINDI: vaha ghar gayaa.


The gender of the pronoun is essential for the
translation in Hindi.
The morph analyzer will give the gender information.
Example: [Hindi → Eng translation]
In Hindi ‘vaha’ can have different senses – ‘he’, ‘she’ or
‘that’.
“vaha ghar gayaa”
If we were to translate this, then the extra information on
the verb will help us to translate the above sentence
correctly as
“He went home”


The ‘yaa’ indicates past tense as well as singular
number and masculine gender.
The morph analyzer will give this information.
Information Retrieval

Stemming: A process of reducing an inflected or derived word to its stem. It makes search more effective.
Example:
Wordforms = play, plays, played, playing → PLAY
Wordforms = child, children, childish → CHILD
Stemmers are useful but have limitations.
Eg. marketing → market
speaker → speak
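A very small suffix-stripping stemmer shows both the benefit and the limitation. This is a sketch with a made-up suffix list, not any published stemming algorithm; it knows nothing about meaning, so “marketing” and “speaker” are reduced to “market” and “speak” even when a search should keep them apart.

```python
# Illustrative suffix list, tried in this order.
SUFFIXES = ["ing", "ish", "ed", "er", "s"]


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for w in ["play", "plays", "played", "playing", "childish",
          "marketing", "speaker"]:
    print(w, "->", stem(w))
```

An irregular form such as “children” is left untouched by this sketch, which is another reason stemmers and full morphological analyzers give different results.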
Speech Processing
In Text-to-Speech tools too, a Morph Analyzer is essential along with Part of Speech.
With extra information on the words, the efficiency increases.
The intonation, the pause, the stress, etc. can be close to the way humans speak.
This additional information is given by morph analyzers.
(POS taggers help in pronouncing ambiguous words the right way: wind, wound, minute, etc.)

Linguistic models for describing morphology
Item and Arrangement (morpheme based)
Item and Process (lexeme based)
Word and Paradigm (word based)
Approaches
Paradigm based
Finite State based
Combination of both – fast and efficient
PARADIGM BASED
MORPHOLOGICAL ANALYZERS
Requirement for building paradigm based
Morph Analyzers





Knowledge of Lexeme and Word forms
Root and Affix dictionaries
Paradigm Table
Paradigm Class
The lexemes are stored in the dictionaries and the
word forms as paradigms.
Lexeme and Word form
APPLE: apple, apples
CHURCH: church, churches
BOY: boy, boys
WATCH: watch, watches
SPY: spy, spies


The word in upper case is called LEXEME and the
inflected forms are WORD FORMS.
Lexemes are the headwords in a dictionary.
Lexeme and Word form
Another example:
played is a word form of the lexeme PLAY
plays is a word form of the lexeme PLAY(1)
plays is a word form of the lexeme PLAY(2)
where PLAY(1) is a verb and PLAY(2) is a noun.
PLAY(1) and PLAY(2) are two different lexemes.
Exercise 7
Give the lexeme of the following word forms:
ate
played
manufactured
glasses
players
bites
Exercise 8
“manufactured” can be a verb in past tense or an adjective.
So it belongs to two different lexemes – MANUFACTURE
and MANUFACTURED.
Which of the following words belong to more than one
lexeme?
ate
wanted
wrote
written
finished
Root and Affix dictionaries
Root dictionary contains a list of roots or the base
forms - the lexemes.
It is stored usually with its part of speech.
Affix dictionary contains a list of all the affixes in a
language.
The features of the affixes are stored here.
The features are stored as attribute value pairs.
Example entries in a dictionary
Root dictionary
eat <root=‘eat’, category=‘verb’>
book <root=‘book’, category=‘verb’>
book <root=‘book’, category=‘noun’>
Affix dictionary
+s <tense = ‘present’>
+ed <tense = ‘past’>
+en <aspect = ‘perfective’>
+ing <aspect = ‘progressive’>
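The example entries can be held as attribute-value pairs in ordinary dictionaries. The sketch below assumes Python dicts as the storage format and a hypothetical `lookup` helper; since the affix entries listed above carry verbal features only, the helper combines them with verb roots only.

```python
# Root dictionary: one list of entries per spelling (homographs allowed).
ROOT_DICT = {
    "eat": [{"root": "eat", "category": "verb"}],
    "book": [{"root": "book", "category": "verb"},
             {"root": "book", "category": "noun"}],
}

# Affix dictionary: features stored as attribute-value pairs.
AFFIX_DICT = {
    "s": {"tense": "present"},
    "ed": {"tense": "past"},
    "en": {"aspect": "perfective"},
    "ing": {"aspect": "progressive"},
}


def lookup(root, affix):
    """Merge each verb entry for the root with the affix features."""
    features = AFFIX_DICT[affix]
    return [{**entry, **features}
            for entry in ROOT_DICT.get(root, [])
            if entry["category"] == "verb"]


print(lookup("eat", "ing"))
print(lookup("book", "ed"))
```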
Paradigm table
A paradigm table represents the inflected forms of a
particular lexeme.
It includes the conjugation of verbs and declensions of
nouns, adjectives, pronouns etc.
Example:
APPLE: apple, apples
EAT: eat, eats, ate, eaten, eating
SMART: smart, smarter, smartest
Conjugation of English verbs





play plays played played playing
eat eats ate eaten eating
look looks looked looked looking
dance dances danced danced dancing
push pushes pushed pushed pushing
Declension of English nouns





apple, apples
boy, boys
church, churches
watch, watches
spy, spies
Declension of English adjectives
• smart, smarter, smartest
• tall, taller, tallest
Exercise 9

Give the paradigm table for 5 different nouns and
5 different verbs in English.
Paradigm Class


A paradigm class contains the classes of lexemes i.e.
the prototypical root and all the roots that fall in its
class including the given root.
Those words which decline or conjugate in exactly
the same way, fall into one paradigm class.
The English verbs ‘PLAY’ and ‘LOOK’ have the
following paradigm:
 play plays played played playing
 look looks looked looked looking
So they belong to the same class.
But ‘PUSH’ since it differs in its present tense form i.e. it
has ‘-es’ and not ‘- s’ falls in another class. Its
paradigm is as follows:
 push pushes pushed pushed pushing
The English nouns ‘PLAY’ and ‘BOY’ have the following
paradigm:
 play plays
 boy boys
So they belong to the same class.
But ‘SPY’ falls in another class. Its paradigm is as follows:

spy spies
Paradigm class is represented by one member of the class.
Class | POS | Members
eat | V | eat
play | V | play, talk, walk, train
push | V | push, fish
play | N | play, boy, day
spy | N | spy, sky
church | N | church, watch
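The class table above maps directly onto a lookup structure. A minimal sketch, assuming each class is keyed by its representative member and part of speech (the names are illustrative):

```python
# (representative, POS) -> roots that inflect the same way.
PARADIGM_CLASSES = {
    ("eat", "V"): ["eat"],
    ("play", "V"): ["play", "talk", "walk", "train"],
    ("push", "V"): ["push", "fish"],
    ("play", "N"): ["play", "boy", "day"],
    ("spy", "N"): ["spy", "sky"],
    ("church", "N"): ["church", "watch"],
}


def paradigm_class(root, pos):
    """Return the representative of the class the root belongs to."""
    for (representative, class_pos), members in PARADIGM_CLASSES.items():
        if class_pos == pos and root in members:
            return representative
    return None


print(paradigm_class("walk", "V"))
print(paradigm_class("sky", "N"))
```

Keying on part of speech matters because the same spelling (here “play”) can belong to a verb class and a noun class at once.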
Exercise 10
Which of the following verbs belong to the same paradigm class?
mince, ride, walk, speak, shake, play, dance, take
Which of the following nouns belong to the same paradigm class?
girl, house, dish, book, mouse, beach, flower, pencil
Identify paradigm classes in your own language.
Avu = {Avu, bassu, bomma, chekka}
Add-Delete Rules for Generation and Analysis
of words
•Most of the morphological analyzers handle only
inflectional morphology.
•The rules given here are for such inflectional
processes only.
•An add-delete rule would look like: [n1, n2, xyz]
where
n1 = number of characters to delete from the end
n2 = number of characters to add at the end
xyz = the characters to add
Rules to generate wordforms in a paradigm table for a
given paradigm class.
Eg. play
play[0,0,0] = play
play[0,1,s] = plays
play[0,2,ed] = played
play[0,2,ed] = played
play[0,3,ing] = playing
Eg. eat
eat[0,0,0] = eat
eat[0,1,s] = eats
eat[3,3,ate] = ate
eat[0,2,en] = eaten
eat[0,3,ing] = eating
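The add-delete rules for generation can be sketched in a few lines. Here a rule [n1, n2, xyz] is written as a tuple, with an empty string standing in for the “0” that means “add nothing”; the function and table names are assumptions for illustration.

```python
def apply_rule(word, rule):
    """Delete n_delete characters from the end, then add `chars`."""
    n_delete, n_add, chars = rule
    assert len(chars) == n_add, "n2 must equal the length of the added string"
    return word[:len(word) - n_delete] + chars


# One rule list per paradigm class, in the order of the paradigm table.
PARADIGM_RULES = {
    "play": [(0, 0, ""), (0, 1, "s"), (0, 2, "ed"), (0, 2, "ed"), (0, 3, "ing")],
    "eat": [(0, 0, ""), (0, 1, "s"), (3, 3, "ate"), (0, 2, "en"), (0, 3, "ing")],
}


def generate_paradigm(root, paradigm_class):
    """Apply every rule of the class to the root."""
    return [apply_rule(root, rule) for rule in PARADIGM_RULES[paradigm_class]]


print(generate_paradigm("play", "play"))
print(generate_paradigm("walk", "play"))   # same class, same rules
print(generate_paradigm("eat", "eat"))
```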
Exercise 11
Write similar rules to generate paradigm tables of the
verbs ‘dance’, ‘cry’, ‘sleep’, ‘drink’ ‘write’ and nouns ‘book’,
‘mouse’, ‘church’, ‘sheep’.
In the same way, these rules can be used to find the root of an
inflected word.
For example, the root of ‘playing’ is ‘play’ – we get it by deleting 3
characters and adding nothing.
playing [3,0,0] = play
played [2,0,0] = play
played [2,0,0] = play
plays [1,0,0] = play
play [0,0,0] = play
eating [3,0,0] = eat
eaten [2,0,0] = eat
ate [3,3,eat] = eat
eats [1,0,0] = eat
eat [0,0,0] = eat
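Running the rules in the analysis direction needs two extra pieces that the rule notation leaves implicit: which ending a rule applies to, and a root dictionary to reject wrong candidates. Both are assumptions of this sketch, as are the names used.

```python
ROOTS = {"play", "eat"}          # toy root dictionary

# Ending of the word form -> add-delete rule that undoes it.
ANALYSIS_RULES = {
    "ing": (3, 0, ""),
    "ed": (2, 0, ""),
    "en": (2, 0, ""),
    "s": (1, 0, ""),
    "ate": (3, 3, "eat"),        # suppletive past of 'eat'
    "": (0, 0, ""),              # the word is already the root
}


def apply_rule(word, rule):
    """Delete n_delete characters from the end, then add `chars`."""
    n_delete, n_add, chars = rule
    return word[:len(word) - n_delete] + chars


def find_roots(word):
    """Try every rule whose ending matches; keep candidates that are roots."""
    found = []
    for ending, rule in ANALYSIS_RULES.items():
        if word.endswith(ending):
            candidate = apply_rule(word, rule)
            if candidate in ROOTS and candidate not in found:
                found.append(candidate)
    return found


for w in ["playing", "played", "plays", "eaten", "ate", "cat"]:
    print(w, "->", find_roots(w))
```

The dictionary check is what stops the rule for “ate” from turning an unrelated word such as “late” into a root.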
Exercise 12
Write similar rules to find the root of the verbs ‘kept’, ‘cried’, ‘sat’,
‘stood’, ‘written’ and nouns ‘mice’, ‘legs’, ‘spies’, ‘uncles’, ‘houses’.
Let’s look at the paradigms in some ILs
Hindi Morph Analyzer demo
http://sampark.iiit.ac.in/hindimorph/
Input text: लड़का आ गया।
Output text:
Input text: लड़के आ गये।
Output text:
Note : Features explanation of a word
1. root : Root of the word (e.g. ladZake word has root ladZakA)
2. cat : Category of the word (e.g. Noun=n, Pronoun=pn, Adjective=adj, verb=v, adverb=adv, post-position=psp and avyaya=avy)
3. gen : Gender of the word (e.g. Masculine=m, Feminine=f, Neuter=n, mf, mn, fn, and any)
4. num : Number of the word (e.g. Singular=sg, Plural=pl, dual, and any)
5. per : Person of the word (e.g. 1st Person=1, 2nd Person=2, 3rd Person=3, and any)
6. case : Case of the word (e.g. direct=d, oblique=o and any)
7. tam : Case marker for noun or Tense Aspect Mood (TAM) for verb of the word
8. suff : Suffix of the word
agr_gen: Agreeing gender of the following noun
agr_num: Agreeing number of the following noun
agr_cas: Agreeing case of the following noun
emph: Emphatic feature of the word
hon: Honorific feature of the word
derive_root: Derived root
derive_lcat: Derived lexical category
derive_gen: Derived gender
derive_suff: Derived suffix
FINITE STATE TRANSDUCERS (FST) BASED MORPHOLOGICAL ANALYZERS
Finite State Automaton/Transducers
• Computational model for handling morphological analysis and generation.
• Used in morphology, phonology, TTS and data mining.
• Popular, since it is mathematically very well understood.
• Easy to implement.
• Many off-the-shelf tools are available for direct use.
• Optimisation: finding a machine with the minimum number of states that performs the same function.
• Simple extensions to Markov Models, useful for POS tagging.
Finite State Automaton
A Finite State Automaton is a machine composed of:
• An input tape
• A finite number of states, with one initial state and one or more accepting states
• Actions in terms of transitions from one state to another, depending on the current state and the input
The input tape has a sequence of symbols written on it. The machine starts
in the initial state. The reader reads one symbol at a time and, depending
on the current state and the input symbol, a transition to another state
takes place. The input is accepted if the machine is in an accepting state
when the tape ends.
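The definition above can be sketched in a few lines of Python. The function run_fsa, the state names and the toy machine (which accepts only "cat" and "cats") are illustrative assumptions, not taken from any particular toolkit.

```python
def run_fsa(transitions, start, accepting, tape):
    """Read `tape` one symbol at a time; accept if we end in an accepting state."""
    state = start
    for symbol in tape:
        key = (state, symbol)
        if key not in transitions:
            return False  # no transition defined for this state and symbol
        state = transitions[key]
    return state in accepting


# Toy machine: q0 -c-> q1 -a-> q2 -t-> q3 -s-> q4, with q3 and q4 accepting.
CAT_TRANSITIONS = {
    ("q0", "c"): "q1",
    ("q1", "a"): "q2",
    ("q2", "t"): "q3",
    ("q3", "s"): "q4",
}
CAT_ACCEPTING = {"q3", "q4"}

for word in ["cat", "cats", "ca", "catss"]:
    print(word, run_fsa(CAT_TRANSITIONS, "q0", CAT_ACCEPTING, word))
```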
Finite state automaton (FSA)
Morphotactic Models
English nominal inflection
[Figure: FSA with states q0, q1, q2. Arcs: q0 -reg-n-> q1, q1 -plural (-s)-> q2, q0 -irreg-pl-n-> q2, q0 -irreg-sg-n-> q2]
• reg-n: regular noun
• irreg-pl-n: irregular plural noun
• irreg-sg-n: irregular singular noun
• Inputs: cats, goose, geese
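A morphotactic automaton of this kind works over morpheme classes rather than letters. The sketch below assumes the usual layout for this example (reg-n leads to q1, the plural -s leads on to q2, and both irregular classes go straight to q2, with q1 and q2 accepting). The small lexicon and the names LEXICON, TRANSITIONS and accepts are made up for illustration.

```python
# Toy lexicon (assumption): a few nouns per class.
LEXICON = {
    "reg-n": {"cat", "dog"},
    "irreg-sg-n": {"goose"},
    "irreg-pl-n": {"geese"},
}

# Morphotactics: which morpheme class may follow which state.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}


def accepts(word):
    """Try every stem + remainder split and follow the class-level arcs."""
    for i in range(1, len(word) + 1):
        stem, rest = word[:i], word[i:]
        for noun_class, members in LEXICON.items():
            if stem not in members:
                continue
            state = TRANSITIONS.get(("q0", noun_class))
            if rest == "s":
                state = TRANSITIONS.get((state, "plural"))
            elif rest:
                state = None  # any other leftover material is not licensed
            if state in ACCEPTING:
                return True
    return False


for word in ["cats", "goose", "geese", "gooses", "geeses"]:
    print(word, accepts(word))
```

Because the irregular classes end in q2, which has no outgoing plural arc, forms like "gooses" are rejected without any extra rule.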

Derivational morphology: adjective fragment
[Figure: FSA with states q0 to q5 and arcs labelled un-, adj-root1 (twice), adj-root2, "-er, -ly, -est" and "-er, -est"]
• Adj-root1: clear, happy, real
• Adj-root2: big, red
Parsing with Finite State Transducers
cats → cat +N +PL
Kimmo Koskenniemi’s two-level morphology
• Words are represented as correspondences between the lexical level (the morphemes) and the surface level (the orthographic word)
• Morphological parsing: building mappings between the lexical and surface levels
Lexical: c a t +N +PL
Surface: c a t s
Finite State Transducers
FSTs map between one set of symbols and another using an FSA whose
alphabet Σ is composed of pairs of symbols from the input and output
alphabets
In general, FSTs can be used as a
• Translator (Hello:vanakkam)
• Parser/generator (Hello:How may I help you?)
• Mapping between the lexical and surface levels of Kimmo’s 2-level morphology
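The "pairs of symbols" idea can be made concrete with a small sketch. Each arc below reads one lexical symbol and writes one surface string (possibly empty). The arc table, the state names and the function transduce are illustrative assumptions, and only the lexical-to-surface (generation) direction is shown.

```python
EPSILON = ""  # an arc that writes nothing on the surface side

# (state, lexical symbol) -> (surface output, next state)
ARCS = {
    ("q0", "c"): ("c", "q1"),
    ("q1", "a"): ("a", "q2"),
    ("q2", "t"): ("t", "q3"),
    ("q3", "+N"): (EPSILON, "q4"),
    ("q4", "+SG"): (EPSILON, "q5"),
    ("q4", "+PL"): ("s", "q5"),
}
FINAL_STATES = {"q5"}


def transduce(lexical_symbols):
    """Map a lexical symbol sequence to a surface string, or None if rejected."""
    state = "q0"
    output = []
    for symbol in lexical_symbols:
        arc = ARCS.get((state, symbol))
        if arc is None:
            return None
        out, state = arc
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None


print(transduce(["c", "a", "t", "+N", "+PL"]))
print(transduce(["c", "a", "t", "+N", "+SG"]))
```

Reading the same pairs in the other direction (surface to lexical) gives the parser. That direction needs a search over arcs with empty surface output, which this sketch leaves out.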
FST for a 2-level Lexicon
Example
[Figure: letter-level transducer with states q0 to q5; arcs c, a, t, s for "cat(s)" and arcs g, e:o, e:o, s, e for "goose"/"geese"]
Reg-n: cat
Irreg-pl-n: g o:e o:e s e
Irreg-sg-n: g o o s e
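FST for English Nominal Inflection
[Figure: FST with states q0 to q7. From q0, the arcs reg-n, irreg-n-sg and irreg-n-pl lead to q1, q2 and q3. Each is followed by an arc +N:ε (to q4, q5 and q6). The final arcs into q7 are +PL:^s# and +SG:-# for regular nouns, +SG:-# for irregular singular nouns and +PL:-s# for irregular plural nouns]
Combining (cascade or composition) this FST
with the FSTs for each noun type replaces e.g. reg-n with every regular noun representation in the
lexicon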
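The effect of plugging a lexicon into the inflection transducer can be sketched as below. Everything here is a simplification under stated assumptions: the lexicon entries, the names LEXICON, SUFFIX_ARCS, to_intermediate and to_surface are hypothetical, "^" is taken to be a morpheme boundary and "#" a word boundary, and the arcs that add no suffix are written as plain "#". Spelling rules such as e-insertion are not modelled.

```python
# Lexicon per noun class: lemma -> stem written by the lexicon transducer.
LEXICON = {
    "reg-n": {"cat": "cat", "dog": "dog"},
    "irreg-n-sg": {"goose": "goose"},
    "irreg-n-pl": {"goose": "geese"},  # lexical g o:e o:e s e
}

# (noun class, number tag) -> material written after the stem; +N writes nothing.
SUFFIX_ARCS = {
    ("reg-n", "+SG"): "#",
    ("reg-n", "+PL"): "^s#",
    ("irreg-n-sg", "+SG"): "#",
    ("irreg-n-pl", "+PL"): "#",
}


def to_intermediate(lemma, number):
    """Return the intermediate forms for lemma +N number (e.g. 'cat^s#')."""
    results = []
    for (noun_class, tag), output in SUFFIX_ARCS.items():
        if tag != number:
            continue
        stem = LEXICON[noun_class].get(lemma)
        if stem is not None:
            results.append(stem + output)
    return results


def to_surface(intermediate):
    """Drop the boundary symbols; a stand-in for the spelling-rule stage."""
    return intermediate.replace("^", "").replace("#", "")


for lemma, number in [("cat", "+PL"), ("goose", "+SG"), ("goose", "+PL")]:
    forms = to_intermediate(lemma, number)
    print(lemma, number, forms, [to_surface(f) for f in forms])
```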
FST for Telugu verb
[Figure: FST with states q0 to q8. Arc labels: tin:tinu, verb; cEs:ceyyi, verb; cEy:ceyyi, verb; A:past; a:ε; Du: sg, masc, 3p; lEdu:neg]
Examples: tinnADu, tinaledu; vinnADu, vinaledu; cEsADu, cEyaledu
Issues – mainly due to ambiguity
• When shot at, the dove dove into the bushes.
• The insurance was invalid for the invalid.
• They were too close to the door to close it.
• The buck does funny things when the does are present.
• Upon seeing the tear in the painting I shed a tear.
• After seeing the wound, the doctor wound his watch.
Such words with multiple analyses affect MT and other NLP applications.
Examples
• veyyi = verb or noun? “put” (verb), “thousand” (noun)
• perugu = verb or noun? “grow” (verb), “yogurt” (noun)
• nAku = verb or pronoun in k4? “to lick” (verb), “for/to me” (pronoun)
(icecream) nAkivvu → nAku + ivvu (“Give it to me”)
                   → nAki + ivvu (“Lick and then give”)
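The nAkivvu example shows why an analyzer has to return all analyses instead of the first one it finds. The sketch below enumerates segmentations under one simplifying assumption: the final vowel of the first word is dropped before a vowel-initial second word. The lexicon, the vowel set (in the slides' roman notation) and the function segmentations are illustrative only.

```python
# Toy lexicon (assumption): just the three forms used on the slide.
LEXICON = {"nAku", "nAki", "ivvu"}
VOWELS = set("aAiIuUeEoO")


def segmentations(surface):
    """Return all (first word, second word) pairs that could yield `surface`."""
    results = []
    for i in range(1, len(surface)):
        left, right = surface[:i], surface[i:]
        if right not in LEXICON or right[0] not in VOWELS:
            continue
        # Undo the assumed vowel deletion: try restoring each possible vowel.
        for vowel in sorted(VOWELS):
            candidate = left + vowel
            if candidate in LEXICON:
                results.append((candidate, right))
    return results


print(segmentations("nAkivvu"))
```

Both splits come back, so choosing between “Give it to me” and “Lick and then give” has to be left to a later stage that looks at context.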
Tools
Several off-the-shelf FST tools are available that
support the Word and Paradigm model:
1. XFST (Xerox Finite State Tool)
2. SFST (Stuttgart Finite State Transducer Tools)
3. Lttoolbox (part of Apertium)
To wind up…
• Knowledge of words and word structure is important.
• A good linguistic analysis of a language helps in formulating rules of word formation, paradigm tables and paradigm classes.
• Morph generation is the reverse process of morph analysis.
• A thorough study of one’s language at the word level helps in building a morphological analyser (MA) with any approach.
• Phonological changes at morpheme boundaries, and agglutinative and fusional suffixing, make analysis more challenging.
• Selecting the right analysis when a wordform has more than one is another challenging task, especially in Machine Translation systems.
References
Akshar Bharati, Vineet Chaitanya, Rajeev Sangal. 1995. Chapter 3: Words and Their Analyzer. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.
Akshar Bharati, Rajeev Sangal, Dipti M Sharma, Radhika Mamidi. 2004. Generic Morphological Analysis Shell. In Proceedings of LREC. Lisbon, Portugal. 24th-30th May 2004.
Dan Jurafsky and J.H. Martin. 2000. Speech and Language Processing. Prentice-Hall.
Monis Khan, Radhika Mamidi, Umar Hamid Shah. 2005. Morphological Analyzer for Kashmiri. LSI.
Amba Kulkarni, G Uma Maheshwar Rao. Building Morphological Analyzers and Generators for Indian Languages using FST. Tutorial for ICON 2010.
Viswanathan S et al. 2003. A Tamil Morphological Analyser. Recent Advances in NLP. 31-39. Mysore: Central Institute of Indian Languages.
Parameshwari K. 2011. An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil. Language in India. 11: 5.
R. Veerappan, P J Antony, S Saravanan, K P Soman. 2011. A Rule based Kannada Morphological Analyzer and Generator using Finite State Transducer. International Journal of Computer Applications (0975 – 8887) Volume 27, No. 10, August 2011.
Thank you!