Lecture 5

Experiences with Indian Language Morphology
Monojit Choudhury
RS, CSE, IIT Kharagpur
28/07/2005
Speech and NLP
When do we need MA/MS (morphological analysis/synthesis)?
Alternative: store all words
Advantages:
- Less effort for NLP
- Less time for processing
Disadvantages:
- More words → more space → more search time
- How do we tackle unseen words?
Therefore, we need MA/MS when
- The language is morphologically rich
  - large number of affixes
  - concatenation of affixes/compounding
  - Example: Turkish, German, Sanskrit …
- The language is morphologically productive
  - Speakers/writers can coin new words by following morphological rules
  - Example: German, Sanskrit …
A Problem to ponder
How do we decide whether a language is
morphologically rich and/or productive?
Linguistically
Difficult (enumerate all morphological processes)
Fuzzy/Subjective
Can you suggest some formal technique?
Hint: Statistics
Vocabulary Growth
[Figure: vocabulary size V(N) plotted against corpus size N, with axes running to V(N) = 200,000 and N = 3,500,000. Labelled end points (N, V(N)): Bengali (3019565, 182848); Hindi (2967438, 121603).]
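A vocabulary growth curve of this kind can be computed directly from a tokenized corpus. The following is a minimal sketch, assuming the corpus is already available as a sequence of word tokens; the function name `vocabulary_growth` and the `step` parameter are illustrative and do not come from the slides.

```python
def vocabulary_growth(tokens, step):
    """Return (N, V(N)) pairs sampled every `step` tokens.

    N is the number of tokens read so far and V(N) is the number of
    distinct word forms seen among them.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve


# Toy usage: a six-token "corpus", sampled every two tokens.
toy_curve = vocabulary_growth("a b a c b d".split(), step=2)
print(toy_curve)
```

A curve that keeps rising steeply as N grows suggests that new word forms keep appearing, which is the statistical signature of rich or productive morphology that the hint points to.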
Another Estimate
How many different forms of a verb are there in
- English: 5
- Hindi: ~20 (without causation)
- Bengali: ~170 (without causation)
- Telugu: ~1000
- Sanskrit: ~51480 (with derivational affixes), ~3960 (otherwise)
Three basic concerns
While designing a morphological
analyzer/generator one must consider
Productivity of a rule
Morphological paradigms
Irregular morphology
Productivity of a Rule
Rule        Example              Productivity
VR + tA     jAtA, letA           *****
NR + ika    dainika, sAmAjika    **
Adj + imA   lAlimA, niilimA      X
Productive Rules for Bengali/Hindi
Inflectional Morphology
- Verb
- Noun
- Adjectives
- Pronouns
Derivational Morphology
- Compounding
- Prefixation
- Suffixation
Emphasizing in Bengali
- i and o
Morphological paradigms
Classes of words that inflect similarly
Hindi noun roots take 4 inflections
- Singular, direct → laDakA, laDakii
- Plural, direct → laDake, laDakiyA.N
- Singular, oblique → laDake, laDakii
- Plural, oblique → laDako, laDakiyo.N
How many paradigms for nouns?
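A paradigm can be stored as a small table of suffix replacements shared by every root in the class. The sketch below encodes only the two example nouns from the slide, using the forms exactly as listed there. The names `PARADIGMS` and `inflect`, and the (strip, add) encoding, are assumptions made for illustration.

```python
# Each paradigm maps (number, case) to a (strip, add) pair:
# remove `strip` from the end of the root, then append `add`.
PARADIGMS = {
    "laDakA": {
        ("singular", "direct"): ("", ""),
        ("plural", "direct"): ("A", "e"),
        ("singular", "oblique"): ("A", "e"),
        ("plural", "oblique"): ("A", "o"),
    },
    "laDakii": {
        ("singular", "direct"): ("", ""),
        ("plural", "direct"): ("ii", "iyA.N"),
        ("singular", "oblique"): ("", ""),
        ("plural", "oblique"): ("ii", "iyo.N"),
    },
}


def inflect(root, paradigm, number, case):
    """Inflect `root` following the named paradigm."""
    strip, add = PARADIGMS[paradigm][(number, case)]
    if not root.endswith(strip):
        raise ValueError(f"{root!r} does not fit paradigm {paradigm!r}")
    stem = root[: len(root) - len(strip)]
    return stem + add


print(inflect("laDakA", "laDakA", "plural", "direct"))
print(inflect("laDakii", "laDakii", "plural", "oblique"))
```

Any other root assigned to the same paradigm reuses the same four rules, so the open question on the slide amounts to asking how many such tables are needed to cover all nouns.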
How to identify the paradigms?
Paradigms may be based on
- Syllable structure (e.g. laDakii, nadii, sakhii)
- Gender (e.g. dhobii vs. nadii)
- Semantics (e.g. lohA vs. dohA)
Which of these distinctions can be identified automatically? How?
Paradigms for Bengali Nouns
Bengali noun inflections:
- Classifier suffixes → TA, gulo, rA etc.
- Case markers → er, ke, der, te etc.
- Emphasizers → i, o
Paradigms are based on semantics
- Inanimate objects take TA, gulo
- Animate objects take rA, dera
Irregular Morphology
All languages feature irregular morphology
- English: ox – oxen, go – went
- Hindi: jAnA – gayA, karanA – kiyA
- Bengali: yAoYA – gela, AsA – ela
Better to list them as exceptions and treat them separately
- Bengali has only 4 exceptional verbs
- Hindi has 2
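Treating irregular forms separately usually means consulting an exception list before any rule is applied. The sketch below illustrates this with the English example from the slide (ox – oxen). The default regular rule, which simply appends "s", is a deliberate simplification assumed for illustration, and the names `IRREGULAR_PLURALS` and `pluralize` are hypothetical.

```python
# Exception list: consulted first, so rules never see these words.
IRREGULAR_PLURALS = {"ox": "oxen"}


def pluralize(noun, regular_rule=lambda n: n + "s"):
    """Return the plural of `noun`, checking the exception list first."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    return regular_rule(noun)


print(pluralize("ox"))
print(pluralize("cat"))
```

With only a handful of exceptional verbs in Bengali and Hindi, such a list stays tiny and keeps the regular rules free of special cases.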
So, we decided to
- Build MS/MA for Hindi & Bengali
- Cover only inflectional morphology
- Cover only verbs, nouns and adjectives
We also identified
- The morphological paradigms
- Irregular verbs/nouns
Now we need to decide
- The list of possible affixes
- Their attributes
- Morphotactics
And then design/build
- The input/output specification
- The lexicon structure
- The FST structure
- Lexicon and FST search strategy
A Case Study: Bengali Verb Morphology
The information coded by affixes:
Finite forms
- Tense: past, present, future
- Aspect: simple, continuous, perfect, habitual
- Modality: order, request
- Person: 1st, 2nd normal (tumi), 2nd familiar (tui), 3rd (se), honorific 2nd and 3rd (Apani, tini)
- Polarity: positive/negative
Non-finite forms: e, te
Morphotactics
Root          Aspect          Tense        Person       +/-     Gloss
kar (to do)   eChi (perfect)  l (past)     Ama (1st)    Φ (+)   I had done
kar           Ch (cont.)      Φ (present)  i (1st)      Φ (+)   I’m doing
kar           Φ (simple)      b (future)   i (2nd fam)  Φ (+)   You’ll do
kar           Φ (perfect)     Φ (pre/pst)  i (1st)      ni (-)  I haven’t done / I’d not done
Morphotactics
- Root + aspect + tense + person + emphasizer + polarity
- Root + modality + person + emphasizer
- Root + aspect1 + emphasizer + aspect2 + person + polarity
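The first pattern above can be read as a fixed ordering of slots, where an empty slot corresponds to Φ. The sketch below concatenates morphemes in that order using values from the preceding table. It produces only the underlying concatenation, before any orthographic change, and the names `SLOT_ORDER` and `concatenate` are assumptions made for illustration.

```python
# Slot order of the first morphotactic pattern; a missing slot means Φ.
SLOT_ORDER = ("root", "aspect", "tense", "person", "emphasizer", "polarity")


def concatenate(morphemes):
    """Join the given morphemes in slot order, skipping empty slots."""
    unknown = set(morphemes) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    return "".join(morphemes.get(slot, "") for slot in SLOT_ORDER)


# Row 2 of the table: kar + Ch (cont.) + Φ (present) + i (1st) + Φ (+)
print(concatenate({"root": "kar", "aspect": "Ch", "person": "i"}))
```

Because the order is fixed by `SLOT_ORDER`, the caller can supply the morphemes in any order and still get a string that respects the morphotactics.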
Verb Suffix Table
TAM / Person      1st          2nd, familiar  2nd, normal  2nd & 3rd formal  3rd
Ind, Pr, Simple   i            isa’                        ena’              e
Ind, Pr, Cont     chhi         chhisa’        chha         chhena’           chhe
Ind, Pr, Perfect  echhi        echhisa’       echha        echhena’          echhe
Ind, Pa, Simple   lAma’        li             le           lena’             la
Ind, Pa, Cont.    chhilAma’    chhili         chhile       chhilena’         chhila
Ind, Pa, Perfect  echhilAma’   echhili        echhile      echhilena’        echhila’
Ind, Future       ba           bi             be           bena’             be
Habitual Past     tAma’        tisa’          te           tena’             ta
Imperative        -            .h/                         una’              uka’
Neg, Perfect      ini          isa’ni         ani          ena’ni            eni
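A table like this maps naturally onto a dictionary keyed by TAM and person, which makes suffix selection a single lookup. The sketch below loads three rows of the table as written above. The key names ("present", "continuous", "formal" and so on) and the function `lookup_suffix` are illustrative assumptions rather than the authors' encoding.

```python
PERSONS = ("1st", "2nd familiar", "2nd normal", "formal", "3rd")

# Three rows copied from the suffix table, one suffix per person column.
SUFFIX_ROWS = {
    ("present", "continuous"): ("chhi", "chhisa’", "chha", "chhena’", "chhe"),
    ("present", "perfect"): ("echhi", "echhisa’", "echha", "echhena’", "echhe"),
    ("future", "simple"): ("ba", "bi", "be", "bena’", "be"),
}

# Flatten into a (tense, aspect, person) -> suffix dictionary.
SUFFIX_TABLE = {
    (tense, aspect, person): suffix
    for (tense, aspect), row in SUFFIX_ROWS.items()
    for person, suffix in zip(PERSONS, row)
}


def lookup_suffix(tense, aspect, person):
    """Return the suffix for the given TAM and person (KeyError if absent)."""
    return SUFFIX_TABLE[(tense, aspect, person)]


print(lookup_suffix("present", "continuous", "1st"))
```

The lookup cost does not depend on the length of the root, which is why suffix selection can be treated as a constant-time step.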
Orthographic Changes
kar + eChilAm → kareChilAm
khA + eChilAm → kheYeChilAm
hAr + eChilAm → hereChilAm
karA + eChilAm → kariYeChilAm
tolA + eChilAm → tuliYeChilAm
khAoYA + eChilAm → khAiYeChilAm
de + eChilAm → diYeChilAm
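One simple way to picture these changes is as a stem alternation: each root takes a modified stem before this suffix, and the suffix itself is unchanged. The sketch below records the seven examples in that form. The split into stem and suffix is one reading of the examples rather than a rule set given in the slides, and it is only a lookup over the listed roots, not a general orthographic component. The names `STEM_BEFORE_ECHILAM` and `attach_echilam` are hypothetical.

```python
SUFFIX = "eChilAm"

# Stem that each listed root takes before the suffix eChilAm,
# read off the seven examples (an interpretation, not a stated rule).
STEM_BEFORE_ECHILAM = {
    "kar": "kar",
    "khA": "kheY",
    "hAr": "her",
    "karA": "kariY",
    "tolA": "tuliY",
    "khAoYA": "khAiY",
    "de": "diY",
}


def attach_echilam(root):
    """Return the surface form of root + eChilAm for a listed root."""
    return STEM_BEFORE_ECHILAM[root] + SUFFIX


for root in STEM_BEFORE_ECHILAM:
    print(root, "+", SUFFIX, "->", attach_echilam(root))
```

A full system would replace the per-root dictionary with rules attached to root classes, so that unseen roots of the same shape are handled too.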
Orthographic Classes (Paradigms?)
V \ $                              a’                         A                                oYA
a      ha [haoYA] (to happen)      kara’ [karA] (to do)       karA [karAno] (do, causative)    saoYA [saoYAno] (undergo, causative)
A      khA [khAoYA] (to eat)       jAna’ [jAnA] (to know)     jAnA [jAnAno] (to inform)        khAoYA [khAoYAno] (to feed)
i      di [deoYA] (to give)        likha’ [lekhA] (to write)  ni~NrA [ni~NrAno]                --
e      --                          dekha’ [dekhA] (to see)    dekhA [dekhAno] (to show)        deoYA [deoYAno] (give, causative)
o      so [so;oYA] (to lie down)   tola’ [tolA] (to pick)     tolA [tolAno] (pick, causative)  so;oYA [so;oYAno] (lie, causative)
u/au   --                          --                         ghumA [ghumAno] (to sleep)       --
FSM for Recognizing Bengali Verb Class
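The FSM itself is not reproduced in this transcript. As a rough stand-in, the sketch below classifies a root by its first vowel and its ending, following the rows and columns of the orthographic class table. This is a heuristic reconstruction from that table, not the 26-state machine described in the lecture; it assumes roots are written in the same transliteration as the table, and the names `VOWELS`, `ENDINGS` and `root_class` are hypothetical.

```python
VOWELS = "aAieou"
# Longest ending first; a root with nothing after its vowel has ending "".
ENDINGS = ("oYA", "a’", "A")


def root_class(root):
    """Return (vowel class, ending) for a root, in one left-to-right pass
    for the vowel plus a constant number of suffix checks."""
    for pos, ch in enumerate(root):
        if ch in VOWELS:
            break
    else:
        raise ValueError(f"no vowel found in {root!r}")
    vowel, rest = ch, root[pos + 1:]
    if vowel == "a" and rest.startswith("u"):
        vowel, rest = "u/au", rest[1:]  # treat "au" as one vowel
    elif vowel == "u":
        vowel = "u/au"
    for ending in ENDINGS:
        if rest.endswith(ending):
            return vowel, ending
    if rest == "":
        return vowel, ""
    raise ValueError(f"{root!r} does not match any ending in the table")


print(root_class("kara’"))
print(root_class("khAoYA"))
```

The pair returned by `root_class` identifies one cell of the table, and each non-empty cell corresponds to one root class.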
A Morphological Generator: Abstract Level
[Diagram: the Morphological Generator takes Root, TAM, Person, Polarity and Emph as inputs. The Suffix Table yields the Suffix, and the Orthographic FST produces the Surface Form.]
A Morphological Generator: Implementation
[Diagram: inputs Root, TAM, Person, Polarity and Emph; components Irregular Root Handler, Root Class Recognizer, Suffix Table, Orthographic Rules for each root class, and Emph Adder; output Surface Form.]
Implementation: More Facts
Memory Requirement
- Root Class Recognizer: FSM with 26 states
- Suffix Table: 56 suffixes (emphasizers not included)
- Orthographic Rule Tables: 19 × 56 = 1064 rules
Time Requirement (r = root length, s = suffix length)
- Root Class Recognizer: scans the root once, O(r)
- Suffix Selection: a table lookup (constant)
- Orthographic Rules: scans root + suffix once, O(r + s)
- Emphasizer Adder: constant time
- Total time: O(r + s)
A Morphological Analyzer: Abstract Level
- Trie: a data structure, also called a prefix tree (from Information Retrieval); here it is built over suffixes read right to left
- Basic notions:
  - Note that Bengali verb morphology only has suffixes
  - Scan a given word from right to left (backward)
  - If the substring seen is a valid suffix, see if the remaining part of the input is a valid stem/root
  - Take care of orthographic changes
- We shall see that a trie is just another way to implement an FST, with some nice properties
Trie: Construction
- Make a list of all valid suffixes
  NULL, i, Chi, li, eChi, YeChi, lAma, elAma
- Construct the trie recursively by inserting each of the suffixes (right to left)
- Every state where a suffix ends is marked as a final state
- Every final state consists of
  - TAM, Person, Polarity information
  - Rewrite rules for generation of the root
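The construction steps above can be sketched as follows, using the eight suffixes from the slide with NULL written as the empty string. The class and function names are assumptions, and each final state stores only a placeholder attribute (the suffix itself), because the TAM, person and polarity values and the rewrite rules for these suffixes are not given in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    is_final: bool = False
    attributes: list = field(default_factory=list)


def build_suffix_trie(suffixes):
    """Insert every suffix right to left; mark the end of each as final."""
    root = TrieNode()
    for suffix, attrs in suffixes.items():
        node = root
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, TrieNode())
        node.is_final = True
        node.attributes.append(attrs)
    return root


def count_states(node):
    """Count the states of the trie, including the start state."""
    return 1 + sum(count_states(child) for child in node.children.values())


# Placeholder attributes: a real table would hold TAM, person, polarity
# and the rewrite rules for recovering the root.
SUFFIXES = {
    s: {"suffix": s}
    for s in ["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"]
}
trie = build_suffix_trie(SUFFIXES)
print(count_states(trie))
```

Since the NULL suffix is in the list, the start state is itself final, which lets a bare root be analyzed as root + NULL.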
Trie: Search
- Reverse the input word
- Traverse the trie starting from the root (start state)
- At every final state apply the orthographic rule to the rest of the string
- Let r be the string obtained. Search for r in the root lexicon
- If found, output the attributes
- Continue the search
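The search procedure can be sketched as below. It is a simplified illustration under stated assumptions: the trie is a nested dictionary built from the suffix list on the construction slide, the orthographic step defaults to the identity function (the real rewrite rules are not given in the slides), and the lexicon is a toy set holding the single root kar. The function names are hypothetical.

```python
def build_trie(suffixes):
    """Build a trie over reversed suffixes; 'suffix' is set on final states."""
    root = {"children": {}, "suffix": None}
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node["children"].setdefault(
                ch, {"children": {}, "suffix": None}
            )
        node["suffix"] = suffix
    return root


def analyze(word, trie, lexicon, undo_orthography=lambda stem, suffix: stem):
    """Return every (root, suffix) split of `word` licensed by trie and lexicon."""
    analyses = []
    reversed_word = word[::-1]
    node = trie
    for consumed in range(len(word) + 1):
        if node["suffix"] is not None:  # final state reached
            rest = word[: len(word) - consumed]
            candidate = undo_orthography(rest, node["suffix"])
            if candidate in lexicon:
                analyses.append((candidate, node["suffix"]))
        if consumed == len(word):
            break
        # Continue the search with the next character from the right.
        node = node["children"].get(reversed_word[consumed])
        if node is None:
            break
    return analyses


trie = build_trie(["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"])
print(analyze("karChi", trie, {"kar"}))
```

The loop does not stop at the first match, so a word with several valid splits yields several analyses, which is what the "continue the search" step asks for.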
Trie: Computational Issues
Time Complexity
- Searching the trie is linear in the input length
- Searching the lexicon can also be linear
Space Complexity
- In general linear in the number of affixes
- Can be reduced further by constructing a DAWG
Trie vs DAWG
Trie
- More space
- Linear search
- Easy to construct
- Easy to insert & delete
- Final states have unique attributes
DAWG (directed acyclic word graph)
- Less space
- Linear search
- Exponential construction
- Difficult to delete and insert
- A final state can have ambiguous attributes
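The space saving can be illustrated by counting how many trie states would survive if identical subtrees were shared, which is the idea behind a DAWG. The sketch below does not build a DAWG; it only counts distinct subtree signatures, a simple bottom-up approach chosen here for illustration and not taken from the slides. It uses the eight example suffixes from the construction slide, and all function names are assumptions.

```python
def build_trie(suffixes):
    """Build a trie over reversed suffixes as nested dictionaries."""
    root = {"children": {}, "final": False}
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node["children"].setdefault(
                ch, {"children": {}, "final": False}
            )
        node["final"] = True
    return root


def count_trie_states(node):
    return 1 + sum(count_trie_states(c) for c in node["children"].values())


def count_dawg_states(root):
    """Count states left after merging subtrees that accept the same strings."""
    signatures = set()

    def signature(node):
        sig = (
            node["final"],
            tuple(sorted(
                (ch, signature(child))
                for ch, child in node["children"].items()
            )),
        )
        signatures.add(sig)
        return sig

    signature(root)
    return len(signatures)


trie = build_trie(["", "i", "Chi", "li", "eChi", "YeChi", "lAma", "elAma"])
print(count_trie_states(trie), count_dawg_states(trie))
```

Merged states no longer correspond to a single suffix, which is why a final state in a DAWG can carry ambiguous attributes.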
Morphological Analyzer: Implementation Details
- Size of trie: 300 states
- Size of root lexicon: 600 verb roots
- Paradigm information: not required
- Nouns, verbs and adjectives are analyzed separately
  - Tries can be merged, but with no significant gain
  - Root lexicons are also distinct
- Rule compilation
Summarizing
- Decide whether to go for MA/MS
- Identify the productive morphological processes and corresponding irregularities
- Identify the paradigms and morphological attributes
- Specify the morphotactics and affix list
- Gather a machine-readable root lexicon
- Choose an appropriate computational technique
- Design, implement and test
- A good interface for rule-editing is desirable