POS tagging and Chunking
for Indian Languages
Rajeev Sangal and V. Sriram,
International Institute of Information Technology,
Hyderabad
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction
Language
• A unique ability of humans
• Animals have signs – e.g. a sign for danger
  – But they cannot combine the signs
• Higher animals – apes
  – Can combine symbols (noun & verb)
  – But can talk only about the here and now
Language : Means of Communication
[Diagram: CONCEPT –(coding)→ Language –(decoding)→ CONCEPT]
The concept gets transferred through language.
Language : Means of thinking
“What should I wear today?”
Can we think without language?
What is NLP ?
The process of computer analysis of input provided in a human language is known as Natural Language Processing.
[Diagram: Concept → Language → Intermediate representation, used for processing by computer]
Applications
• Machine translation
• Document clustering
• Information extraction / retrieval
• Text classification
MT system : Shakti
• Machine translation system being developed at IIIT Hyderabad.
• A hybrid translation system which uses the combined strengths of linguistic, statistical and machine learning techniques.
• Integrates the best available NLP technologies.
Shakti architecture
English sentence
  ↓ English sentence analysis: Morphology → POS tagging → Chunking → Parsing
  ↓ Transfer from English to Hindi: Word reordering → Hindi word substitution
  ↓ Hindi sentence generation: Agreement → Word generation
Hindi sentence
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

Levels of Language Analysis
• Morphological analysis
• Lexical Analysis (POS tagging)
• Syntactic Analysis (Chunking, Parsing)
• Semantic Analysis (Word sense disambiguation)
• Discourse processing (Anaphora resolution)
Let’s take an example sentence
“Children are watching some programmes on television in the house”
Chunking
• What are chunks ?
  [[ Children ]] (( are watching )) [[ some programmes ]] [[ on television ]] [[ in the house ]]
• Chunks
  – Noun chunks (NP, PP) in square brackets
  – Verb chunks (VG) in parentheses
• Chunks represent objects
  – Noun chunks represent objects/concepts
  – Verb chunks represent actions
Chunking
• Representation in SSF
Part-of-Speech tagging
Morphological analysis
• Deals with the word form and its analysis.
• The analysis consists of characteristic properties like
  – Root/Stem
  – Lexical category
  – Gender, number, person, etc.
• Example: watching
  – Root = watch
  – Lexical category = verb
  – etc.
Morphological analysis
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

POS Tags in Hindi
• Broad categories are noun, verb, adjective & adverb.
• Words are classified depending on their role, both individually as well as in the sentence.
• Example:
  vaha  aama  khaa  rahaa  hei
  Pron  noun  verb  verb   verb
POS Tagging
• Simplest method of POS tagging
  – Looking in the dictionary
  – khaanaa → (dictionary lookup) → verb
Problems with POS Tagging
• The size of the dictionary limits the scope of the POS tagger.
• Ambiguity
  – The same word can be used both as a noun and as a verb.
  – khaanaa → noun or verb
Problems with POS Tagging
• Ambiguity
  – Sentences in which the word “khaanaa” occurs:
    tum bahuta achhaa khaanaa banatii ho.
    mein jilebii khaanaa chaahataa hun.
  – Hence, the complete sentence has to be looked at before determining its role and thus the POS tag.
Problems with POS Tagging
• Many applications need more specific POS tags.
• For example,
  … seba khaa rahaa …        → Verb Finite Main
  … khaate huE …             → Verb Non-Finite Adjective
  … khaakara …               → Verb Non-Finite Adverb
  sharaaba piinaa sehata …   → Verb Non-Finite Nominal
• Hence, the need for defining a tagset.
Defining the tagset for Hindi (IIIT Tagset)
Issues:
1. Fineness V/s Coarseness in linguistic analysis
2. Syntactic Function V/s lexical category
3. New tags V/s tags from a standard tagger
Fineness V/s Coarseness
• A decision has to be taken whether tags will account for finer distinctions of various features of the parts of speech.
• Need to strike a balance:
  – Not too fine, which would hamper machine learning
  – Not too coarse, which would lose information
Fineness V/s Coarseness
• Nouns
  – Plurality information not taken into account (noun singular and noun plural are marked with the same tag).
  – Case information not marked (noun direct and noun oblique are marked with the same tag).
• Adjectives and Adverbs
  – No distinction between comparative and superlative forms.
• Verbs
  – Finer distinctions are made (e.g., VJJ, VRB, VNN).
  – Helps us understand the arguments that a verb form can take.
Fineness in Verb tags
• Useful for tasks like dependency parsing, as we have better information about the arguments of the verb form.
• Non-finite forms of verbs which are used as nouns, adjectives or adverbs still retain their verbal property.
  (VNN → noun formed from a verb)
• Example:
  aasamaana/NN  “sky”
  mein/PREP     “in”
  udhane/VNN    “flying”
  vaalaa/PREP
  ghodhaa/NN    “horse”
  niiche/NLOC   “down”
  utara/VFM     “climb”
  aayaa/VAUX    “came”
Syntactic V/s Lexical
• Whether to tag the word based on its lexical or syntactic category.
  – Should “uttar” in “uttar bhaarata” be tagged as noun or adjective?
• Lexical category is given more importance than syntactic category while marking text manually.
  – This leads to consistency in tagging.
New tags V/s tags from a standard tagset
• An entirely new tagset for Indian languages is not desirable, as people are familiar with standard tagsets like the Penn tags.
• The Penn tagset has been used as a benchmark while deciding tags for Hindi.
• Wherever the Penn tagset has been found inadequate, new tags are introduced.
  – NVB → new tag for kriyamuls, or light verbs
  – QW → modified tag for question words
IIIT Tagset
Tags are grouped into three types:
1. Group1: Adopted from the Penn tagset with minor changes.
2. Group2: Modifications over the Penn tagset.
3. Group3: Tags not present in the Penn tagset.
Examples of tags in Group3:
1. INTF (Intensifier): words like ‘baHuta’, ‘kama’, etc.
2. NVB, JVB, RBVB: light verbs.
Detailed guidelines will be put online.
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

Corpus-based approach
[Diagram: POS tagged corpus → learn → POS tagger; untagged new corpus → POS tagger → tagged new corpus]
POS tagging : A simple method
• Pick the most likely tag for each word.
• Probabilities can be estimated from a tagged corpus.
• Assumes independence between tags.
• Accuracy < 90%
POS tagging : A simple method
Example
• Brown corpus, 182159 tagged words (training section), 26 tags
• Example: mujhe xo kitabein xijiye
• The word xo occurs 267 times:
  – 227 times tagged as QFN
  – 29 times tagged as VAUX
• P(QFN | W=xo) = 227/267 = 0.8502
• P(VAUX | W=xo) = 29/267 = 0.1086
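As a rough sketch (not part of the original slides), the most-likely-tag baseline fits in a few lines of Python; the toy counts below simply mirror the ‘xo’ numbers above, and the corpus is assumed to be a flat list of (word, tag) pairs.

```python
from collections import Counter, defaultdict

def train_most_likely_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from a POS-tagged corpus."""
    counts = defaultdict(Counter)          # counts[word][tag] = frequency
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word keep only its most frequent tag.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(sentence, lexicon, default_tag="NN"):
    """Tag each word independently; unseen words fall back to a default tag."""
    return [(w, lexicon.get(w, default_tag)) for w in sentence]

# Toy illustration with the counts quoted on the slide (227 QFN vs 29 VAUX for 'xo'):
corpus = [("xo", "QFN")] * 227 + [("xo", "VAUX")] * 29
lexicon = train_most_likely_tag(corpus)
print(tag(["mujhe", "xo", "kitabein", "xijiye"], lexicon))
# 'xo' comes out as QFN; the other words are unseen here and get the default tag.
```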
Corpus-based approaches
• Learning rules
  – Transformation-based error-driven learning (Brill, 1995)
  – Inductive logic programming (Cussens, 1997)
• Statistical
  – Hidden Markov models (TnT; Brants, 2000)
  – Maximum entropy (Ratnaparkhi, 1996)
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

POS tagging using HMMs
Let W be a sequence of words
W = w1 , w2 … wn
Let T be the corresponding tag sequence
T = t1 , t2 … tn
Task : Find T which maximizes
P(T| W)
T’ = argmaxT P ( T | W )
POS tagging using HMM
By Bayes Rule,
P(T|W) = P(W|T)*P(T)/P(W)
T’ = argmaxT P ( W | T ) * P ( T )
P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t1 t2 ) …… * P ( tn | t1 … tn-1 )
Applying Bi-gram approximation,
P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t2 ) …… * P ( tn | tn-1 )
POS tagging using HMM
P ( W | T ) = P ( w1 | T ) * P ( w2 | w1, T ) * P ( w3 | w1 w2, T ) * … * P ( wn | w1 … wn-1, T )
            = Π i = 1 to n P ( wi | w1 … wi-1, T )
Assume P ( wi | w1 … wi-1, T ) = P ( wi | ti )
Now, T’ is the one which maximizes
P ( t1 ) * P ( t2 | t1 ) * … * P ( tn | tn-1 ) * P ( w1 | t1 ) * P ( w2 | t2 ) * … * P ( wn | tn )
POS tagging using HMM
If we use a tri-gram model instead for the tag sequence,
P ( T ) = P ( t1 ) * P ( t2 | t1 ) * P ( t3 | t1 t2 ) * … * P ( tn | tn-2 tn-1 )
Which model to choose ?
• Depends on the amount of data available!
• Richer models (tri-grams, 4-grams) require lots of data.
Chain rule with approximations
• P ( W = “vaha ladakaa gayaa” , T = “det noun verb” )
  = P(det) * P(vaha|det) * P(noun|det) * P(ladakaa|noun) * P(verb|noun) * P(gayaa|verb)
[Diagram: tag chain det → noun → verb, emitting vaha, ladakaa, gayaa]
Chain rule with approximations: Example
• P ( vaha | det ) = (number of times ‘vaha’ appeared as ‘det’ in the corpus) / (total number of occurrences of ‘det’ in the corpus)
• P ( verb | noun ) = (number of times ‘verb’ followed ‘noun’ in the corpus) / (total number of occurrences of ‘noun’ in the corpus)
Suppose we obtained the following estimates from the corpus:
  P(det at start) = 0.5, P(vaha | det) = 0.4, P(noun | det) = 0.99,
  P(ladakaa | noun) = 0.5, P(verb | noun) = 0.4, P(gayaa | verb) = 0.02
Then
  P ( W , T ) = 0.5 * 0.4 * 0.99 * 0.5 * 0.4 * 0.02 = 0.000792
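A quick check of the arithmetic, using the estimates above (hypothetical numbers from the slide, not a real corpus):

```python
# P(W, T) under the bigram HMM approximation for
# W = "vaha ladakaa gayaa", T = "det noun verb"
p_start_det  = 0.5    # P(det at sentence start)
p_vaha_det   = 0.4    # P(vaha | det)
p_noun_det   = 0.99   # P(noun | det)
p_ladakaa_nn = 0.5    # P(ladakaa | noun)
p_verb_noun  = 0.4    # P(verb | noun)
p_gayaa_verb = 0.02   # P(gayaa | verb)

p_joint = (p_start_det * p_vaha_det * p_noun_det *
           p_ladakaa_nn * p_verb_noun * p_gayaa_verb)
print(p_joint)   # ≈ 0.000792
```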
POS tagging using HMM
We need to estimate three types of parameters from the corpus:
  Pstart( ti ) = ( no. of sentences which begin with ti ) / ( no. of sentences )
  P ( ti | ti-1 ) = count ( ti-1 ti ) / count ( ti-1 )
  P ( wi | ti ) = count ( wi with ti ) / count ( ti )
These parameters can be directly represented using Hidden Markov Models (HMMs), and the best tag sequence can be computed by applying the Viterbi algorithm to the HMM.
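A minimal sketch of estimating the three parameter types, assuming the tagged corpus is a list of sentences, each a list of (word, tag) pairs; these are plain relative-frequency estimates with no smoothing.

```python
from collections import Counter

def estimate_hmm(sentences):
    """sentences: list of sentences, each a list of (word, tag) pairs."""
    start, trans, emit = Counter(), Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    for sent in sentences:
        tags = [t for _, t in sent]
        start[tags[0]] += 1                      # sentence-initial tag
        for w, t in sent:
            emit[(t, w)] += 1                    # count(w with t)
            tag_count[t] += 1
        for t1, t2 in zip(tags, tags[1:]):
            trans[(t1, t2)] += 1                 # count(t1 t2)
            prev_count[t1] += 1
    n_sent = len(sentences)
    PI = {t: c / n_sent for t, c in start.items()}              # Pstart(t)
    A  = {k: c / prev_count[k[0]] for k, c in trans.items()}    # P(ti | ti-1)
    B  = {k: c / tag_count[k[0]] for k, c in emit.items()}      # P(wi | ti)
    return PI, A, B
```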
Markov models
Markov Chain
• An event is dependent on (a fixed number of) previous events.
• Consider the word sequence: usane kahaa ki
  Here, each word is dependent on the previous one word.
  Hence, it is said to form a Markov chain of order 1.
Hidden Markov models
  Index of sequence t:        1    2    3    4
  Observation sequence O:     o1   o2   o3   o4
  Hidden state sequence X:    x1   x2   x3   x4
The hidden states follow the Markov property. Hence, this model is known as a Hidden Markov Model.
Hidden Markov models
• Representation of parameters in HMMs
• Define O(t) = observation at position t
• Define X(t) = hidden state value at position t
• A : transition matrix,    a_ab = P ( X(t+1) = Xb | X(t) = Xa )
• B : emission matrix,      b_ak = P ( O(t) = Ok | X(t) = Xa )
• PI : initial-state vector, pi_a = probability of starting with hidden state Xa
The model is μ = { A , B , PI }
HMM for POS tagging
• Observation sequence  = word sequence
• Hidden state sequence = tag sequence
• Model:
  A  = P ( current tag | previous tag )
  B  = P ( current word | current tag )
  PI = Pstart ( tag )
Tag sequences are mapped to hidden state sequences because they are not directly observable in natural language text.
Example
A =            det    noun   verb
      det      .01    .99    .00
      noun     .30    .30    .40
      verb     .40    .40    .20

B =            vaha   ladakaa   gayaa
      det      .40    .00       .00
      noun     .00    .015      .0031
      verb     .00    .0004     .020

PI =  det  0.5     noun  0.4     verb  .01
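The same numbers written out as plain Python dictionaries, purely as a convenience for the sketches that follow (state and word names are those of the example above):

```python
STATES = ["det", "noun", "verb"]

PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}

A = {                       # A[prev][cur] = P(current tag | previous tag)
    "det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
    "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
    "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20},
}

B = {                       # B[tag][word] = P(word | tag)
    "det":  {"vaha": 0.40, "ladakaa": 0.0,    "gayaa": 0.0},
    "noun": {"vaha": 0.0,  "ladakaa": 0.015,  "gayaa": 0.0031},
    "verb": {"vaha": 0.0,  "ladakaa": 0.0004, "gayaa": 0.020},
}
```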
POS tagging using HMM
The problem can be formulated as:
Given the observation sequence O and the model μ = (A, B, PI), how do we choose the best state sequence X which explains the observations?
• Consider all the possible tag sequences and choose the tag sequence having the maximum joint probability with the observation sequence.
• X_max = argmaxX P ( O , X )
• The complexity of this is high: of the order of N^T (N tags, T words).
• The Viterbi algorithm is used for computational efficiency.
POS tagging using HMM
[Trellis: observations O = vaha, ladakaa, hansaa at t = 1, 2, 3; candidate hidden states det, noun, verb at each position]
27 tag sequences possible! (3^3 = 27 paths)
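For a sentence of T words and N candidate tags there are N^T such sequences; the brute-force scorer below (a sketch, reusing the PI, A, B dictionaries written out above) makes that exponential enumeration explicit, which is exactly what Viterbi avoids.

```python
from itertools import product

def joint_prob(words, tags, PI, A, B):
    """P(O, X) for one observation sequence and one candidate tag sequence."""
    p = PI.get(tags[0], 0.0) * B[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        p *= A[tags[i - 1]][tags[i]] * B[tags[i]].get(words[i], 0.0)
    return p

def brute_force_tag(words, states, PI, A, B):
    # Enumerate all |states| ** len(words) tag sequences -- exponential cost.
    return max(product(states, repeat=len(words)),
               key=lambda tags: joint_prob(words, tags, PI, A, B))

# e.g. brute_force_tag(["vaha", "ladakaa", "gayaa"], STATES, PI, A, B)
#      -> ("det", "noun", "verb")
```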
Viterbi algorithm
Let αnoun(ladakaa) represent the probability of reaching the state ‘noun’ taking the best possible path and generating the observation ‘ladakaa’.
Viterbi algorithm
Best probability of reaching a state associated with the first word:
  αdet(vaha) = PI (det) * B [ det, vaha ]
Viterbi algorithm
Probability of reaching a state elsewhere in the best possible way:
  αnoun(ladakaa) = MAX { αdet(vaha)  * A [ det, noun ]  * B [ noun, ladakaa ] ,
                         αnoun(vaha) * A [ noun, noun ] * B [ noun, ladakaa ] ,
                         αverb(vaha) * A [ verb, noun ] * B [ noun, ladakaa ] }
Viterbi algorithm
What is the best way to come to a particular state?
  phinoun(ladakaa) = ARGMAX { αdet(vaha)  * A [ det, noun ]  * B [ noun, ladakaa ] ,
                              αnoun(vaha) * A [ noun, noun ] * B [ noun, ladakaa ] ,
                              αverb(vaha) * A [ verb, noun ] * B [ noun, ladakaa ] }
Viterbi algorithm
The last tag of the most likely sequence:
  phi (T+1) = ARGMAX { αdet(hansaa) , αnoun(hansaa) , αverb(hansaa) }
Viterbi algorithm
The most likely sequence is obtained by backtracking through the stored phi pointers.
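A compact Viterbi sketch that follows the α / phi recursion above, assuming the PI, A, B dictionaries from the earlier example; probabilities are multiplied directly, so log-space arithmetic and smoothing are omitted for clarity.

```python
def viterbi(words, states, PI, A, B):
    # alpha[t][s] = best probability of any path ending in state s after emitting words[:t+1]
    # back[t][s]  = predecessor state on that best path (the "phi" pointers)
    alpha = [{s: PI.get(s, 0.0) * B[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        alpha.append({})
        back.append({})
        for s in states:
            prev_best = max(states, key=lambda p: alpha[t - 1][p] * A[p][s])
            alpha[t][s] = alpha[t - 1][prev_best] * A[prev_best][s] * B[s].get(words[t], 0.0)
            back[t][s] = prev_best
    # Pick the best final state, then follow the back-pointers (backtracking).
    last = max(states, key=lambda s: alpha[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# print(viterbi(["vaha", "ladakaa", "gayaa"], STATES, PI, A, B))  # ['det', 'noun', 'verb']
```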
Preliminary Results
• POS tagging for Indian languages
  – Training set = 182159 tokens, Testing set = 14277 tokens, Tags = 26
• Most frequent tag labelling = 78.85 %
• Hidden Markov Models       = 86.75 %
• Needs improvement!
  – By experimenting with a variety of tags and tokens (some experiments on the chunking task are shown in the following slides).
Preliminary Results
• Most common error seen:
  – NNP, NNC → NN
  – (see the output of the system)
• Opportunity to carry out experiments to eliminate such errors as part of the NLPAI shared task, 2006 (will be introduced at the end).
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

Introduction to TnT
• Efficient implementation of the Viterbi algorithm for 2nd-order Markov chains (trigram approximation).
• Language independent – can be trained on any corpus.
• Easy to use.
Introduction to TnT
4 main programs:
• tnt-para – trains the model (parameter generation)
    tnt-para [options] <corpus_file>
• tnt – tagging
    tnt [options] <model> <corpus>
• tnt-diff – compares two files to get precision/recall figures
    tnt-diff [options] <original file 1> <new output file>
• tnt-wc – counts tokens (words) and types (pos-tag/chunk-tag) in different files
    tnt-wc [options] <corpusfile>
Introduction to TnT
Training file format
• Token and tag separated by white space.
• Lines beginning with %% are comments; a blank line starts a new sentence.
• Example:
    nirAlA     NNP
    kI         PREP
    sAhiwya    NN

    yahAz      PRP
    yaha       PRP
    aXikAMRa   JJ
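A small helper (a sketch, not part of TnT itself) that writes a list of tagged sentences into the whitespace-separated, blank-line-delimited layout shown above; the file name is only illustrative.

```python
def write_tnt_corpus(sentences, path):
    """sentences: list of sentences, each a list of (token, tag) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("%% generated training corpus\n")   # lines starting with %% are comments
        for sent in sentences:
            for token, tag in sent:
                f.write(f"{token}\t{tag}\n")        # token and tag separated by whitespace
            f.write("\n")                           # blank line = sentence boundary

# write_tnt_corpus([[("nirAlA", "NNP"), ("kI", "PREP"), ("sAhiwya", "NN")]], "train.txt")
```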
Introduction to TnT
• Testing file – consists of only the first column.
• Other files – used to store the model:
  – .lex file
  – .123 file
  – .map file
Demo 1.
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

An Example (Chunk boundary identification)
Chunking with TnT
• Chunk Tags
  – STRT: A chunk starts at this token
  – CNT: This token lies in the middle of a chunk
  – STP: This token lies at the end of a chunk
  – STRT_STP: This token lies in a chunk of its own
• Chunk Tag Schemes
  – 2-tag Scheme: {STRT, CNT}
  – 3-tag Scheme: {STRT, CNT, STP}
  – 4-tag Scheme: {STRT, CNT, STP, STRT_STP}
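A sketch of how bracketed chunks can be turned into per-token chunk tags under the 4-tag scheme; the input representation here (a list of chunks, each a list of tokens) is an assumption made for illustration.

```python
def chunk_tags_4(chunks):
    """chunks: list of chunks, each chunk a list of tokens.
    Returns (token, chunk_tag) pairs under the 4-tag scheme."""
    tagged = []
    for chunk in chunks:
        if len(chunk) == 1:
            tagged.append((chunk[0], "STRT_STP"))        # one-token chunk
        else:
            tagged.append((chunk[0], "STRT"))            # chunk starts here
            for tok in chunk[1:-1]:
                tagged.append((tok, "CNT"))              # middle of the chunk
            tagged.append((chunk[-1], "STP"))            # chunk ends here
    return tagged

example = [["Children"], ["are", "watching"], ["some", "programmes"],
           ["on", "television"], ["in", "the", "house"]]
print(chunk_tags_4(example))
# Children/STRT_STP are/STRT watching/STP ... in/STRT the/CNT house/STP
```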
Input Tokens
• What kinds of input tokens can we use?
  – Word only – simplest
  – POS tag only – use only the part-of-speech tag of the word
  – Combinations of the above:
    · Word_POStag: word followed by POS tag
    · POStag_Word: POS tag followed by word
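The different input-token choices are easy to generate from a (word, POS) sequence; a minimal sketch follows, with illustrative tag names in the usage line that are not from the slides.

```python
def make_input_tokens(words_pos, kind="word_pos"):
    """words_pos: list of (word, pos) pairs; kind selects the token representation."""
    if kind == "word":
        return [w for w, _ in words_pos]
    if kind == "pos":
        return [p for _, p in words_pos]
    if kind == "word_pos":
        return [f"{w}_{p}" for w, p in words_pos]     # word followed by its POS tag
    if kind == "pos_word":
        return [f"{p}_{w}" for w, p in words_pos]     # POS tag followed by the word
    raise ValueError(kind)

print(make_input_tokens([("Children", "NNS"), ("are", "VBP")], "word_pos"))
# ['Children_NNS', 'are_VBP']
```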
Chunking with TnT: Experiments
• Training corpus = 150000 tokens
• Testing corpus  = 20000 tokens
• A trick to improve learning is to train on the larger tagset and then reduce the output to the smaller tagset
  – No loss of information, as all the tag schemes convey the same chunk-boundary information.
• Best results (Precision = 85.6%) obtained for
  – Input tokens of the form ‘Word_POS’
  – Learning trick: 4 tags reduced to 2
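The reduction from 4 tags to 2 might look like the sketch below; the mapping shown (STP collapsing onto CNT, STRT_STP onto STRT) is the obvious one given that the 2-tag scheme only marks where chunks start, but it is an assumption rather than something stated on the slides.

```python
FOUR_TO_TWO = {"STRT": "STRT", "CNT": "CNT", "STP": "CNT", "STRT_STP": "STRT"}

def reduce_to_two(tags_4):
    """Map 4-tag-scheme predictions onto the 2-tag scheme {STRT, CNT}.
    Only chunk starts are marked in the 2-tag scheme, so the information
    about where each chunk begins is preserved."""
    return [FOUR_TO_TWO[t] for t in tags_4]

print(reduce_to_two(["STRT_STP", "STRT", "STP", "STRT", "CNT", "STP"]))
# ['STRT', 'STRT', 'CNT', 'STRT', 'CNT', 'CNT']
```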
Chunking with TnT: Improvement
• 85.6% is not good enough.
• The model is improved (Precision = 88.63%) by adding contextual information (POS tags).
Chunking with TnT: Improvements
• For experiments which led to further improvements in chunk boundary identification, see:
  – Akshay Singh, Sushama Bendre and Rajeev Sangal, “HMM based Chunker for Hindi”, in Second International Joint Conference on Natural Language Processing: Companion Volume including Posters/Demos and Tutorial Abstracts.
Chunk labelling & Results
• Chunk labelling:
  – Chunks which have been identified have to be labelled as noun chunks, verb chunks, etc.
  – Rule-based chunk labelling performed best.
• Results:
  – Final chunk boundary identification accuracy = 92.6%
  – Chunk boundary identification + chunk labelling = 91.5%
Contents
• NLP : Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

Shared task
• For information on the shared task, refer to the flyer on the NLPAI shared task 2006.
Thank you