PART OF SPEECH TAGGING FOR TAMIL
USING SVMTOOL
A PROJECT REPORT
submitted by
ANAND KUMAR.M
(CB206CN001)
in partial fulfillment for the award of the degree
of
MASTER OF TECHNOLOGY
IN
COMPUTATIONAL ENGINEERING AND NETWORKING
Centre for Excellence in Computational Engineering and Networking
(CEN)
AMRITA SCHOOL OF ENGINEERING, COIMBATORE
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE – 641 105
JULY 2008
Dedicated to my beloved family
PART OF SPEECH TAGGING FOR TAMIL
USING SVMTOOL
A PROJECT REPORT
submitted by
ANAND KUMAR M
(CB206CN001)
in partial fulfillment for the award of the degree
of
MASTER OF TECHNOLOGY
IN
COMPUTATIONAL ENGINEERING AND NETWORKING
Under the Guidance of
Dr. K.P. Soman (Professor & Head, CEN)
Centre for Excellence in Computational Engineering and Networking
(CEN)
AMRITA SCHOOL OF ENGINEERING, COIMBATORE
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE – 641 105
JULY 2008
AMRITA SCHOOL OF ENGINEERING,
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641105
BONAFIDE CERTIFICATE
This is to certify that the project report entitled “PART OF SPEECH TAGGING FOR TAMIL USING SVMTOOL” submitted by “ANAND KUMAR M, Reg. No. CB206CN001” in partial fulfillment of the requirements for the award of the Degree of Master of Technology in “COMPUTATIONAL ENGINEERING AND NETWORKING” is a bonafide record of the work carried out under my guidance and supervision at Amrita School of Engineering, Coimbatore.
Dr. K.P. SOMAN, Project Guide, Professor, CEN.
Dr. K.P. SOMAN, Professor and HOD, Department of CEN.
This project report was evaluated by us on ………………………
INTERNAL EXAMINER
EXTERNAL EXAMINER
AMRITA VISHWA VIDYAPEETHAM
AMRITA SCHOOL OF ENGINEERING, COIMBATORE
CENTRE FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING
AND NETWORKING
DECLARATION
I, ANAND KUMAR M, Reg. No. CB206CN001, hereby declare that this project report entitled “PART OF SPEECH TAGGING FOR TAMIL USING SVMTOOL” is a record of the original work done by me under the guidance of Dr. K.P. SOMAN (Head of the Department), Department of Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, and that this work has not formed the basis for the award of any degree / diploma / associateship / fellowship or similar award to any candidate in any University, to the best of my knowledge.
Place :
Signature of the student
Date:
COUNTERSIGNED
Dr. K.P. SOMAN,
HOD of CEN
TABLE OF CONTENTS
LIST OF FIGURES .......................................................................................................... i
LIST OF TABLES .......................................................................................................... ii
ORGANIZATION ...........................................................................................................iii
1. INTRODUCTION ........................................................................................................ 1
1.1 Introduction ............................................................................................................... 1
1.2 Word classes.............................................................................................................. 1
1.3 Part of speech tagging ............................................................................................... 2
1.4 Tagging approaches .................................................................................................. 3
1.4.1 Rule based part of speech tagging ................................................................... 5
1.4.2 Stochastic part of speech tagging .................................................................... 8
1.4.3 Transformation based tagging ....................................................................... 12
1.5 Other techniques...................................................................................................... 15
1.6 Motivation ............................................................................................................. 16
2. LITERATURE SURVEY ........................................................................................... 18
3. TAMIL POS TAGGING ........................................................................................... 29
3.1 Tamil language......................................................................................................... 29
3.1.1 Alphabets......................................................................................................... 29
3.1.2 Tamil grammar ............................................................................................... 31
3.1.3 Part of speech categories ................................................................................ 32
3.1.4 Other POS categories ...................................................................................... 32
3.1.5 Ambiguity of roots .......................................................................................... 34
3.2 Complexity in Tamil POS tagging ........................................................................... 34
3.2.1 Noun complexity ............................................................................................ 34
3.2.2 Verb complexity ........................................................................................... 36
3.2.3 Complexity in Adverbs ................................................................................... 38
3.2.4 Complexity in Postpositions ........................................................................... 39
3.3 Developing tagsets .................................................................................................. 39
3.3.1 Tagsets gone through ...................................................................................... 40
3.3.2 AMRITA tagset............................................................................................... 41
3.4 Explanation of AMRITA POS tags.......................................................................... 42
3.4.1 Noun tags ....................................................................................................... 42
3.4.2 Pronoun tags................................................................................................... 45
3.4.3 Adjective tags ................................................................................................ 46
3.4.4 Adverb tags ................................................................................................... 46
3.4.5 Verb tags ........................................................................................................ 47
3.4.6 Other tags ....................................................................................................... 49
4. DEVELOPMENT OF TAGGED CORPUS ............................................................. 52
4.1 Introduction ............................................................................................................. 52
4.1.1 Tagged corpus, Parallel Corpus, and Aligned Corpus .................................. 52
4.1.2 CIIL corpus for Tamil....................................................................................... 53
4.1.3 AUKBC-RC's improved tagged corpus for Tamil .......................................... 53
4.2 Developing a new corpus ......................................................................................... 53
4.2.1 Untagged and Tagged corpus ......................................................................... 53
4.2.2 Tagged corpus development............................................................................ 55
4.3 Details of our Tagged corpus .................................................................................. 57
4.4 Applications of tagged corpus.................................................................................. 58
5. IMPLEMENTATION OF SVMTOOL FOR TAMIL ............................................ 59
5.1 Introduction ............................................................................................................. 59
5.2 Properties of the SVMTool ..................................................................................... 60
5.3 The theory of support vector machines ................................................................... 62
5.4 Problem setting ........................................................................................................ 65
5.4.1 Binarizing the classification problem .............................................................. 65
5.4.2 Feature codification ......................................................................................... 65
5.5 SVMTool components and implementations........................................................... 67
5.5.1 SVMTlearn ...................................................................................................... 68
5.5.1.1 Training data format.............................................................................. 68
5.5.1.2 Options ................................................................................................. 69
5.5.1.3 Configuration file .................................................................................. 72
5.5.1.4 C parameter tuning ............................................................................... 78
5.5.1.5 Test ....................................................................................................... 79
5.5.1.6 Models .................................................................................................. 79
5.5.1.7 Implementation of SVMTlearn for Tamil ............................................ 82
5.5.2 SVMTagger .................................................................................................... 86
5.5.2.1 Options ................................................................................................ 91
5.5.2.2 Strategies .............................................................................................. 95
5.5.2.3 Implementation of SVMTagger for Tamil .......................................... 96
5.5.3 SVMTeval ....................................................................................................... 98
5.5.3.1 Implementation of SVMTeval for Tamil ............................................. 99
5.5.3.2 Reports ............................................................................................... 100
6. GUI FOR SVMTOOL .............................................................................................. 110
6.1 Introduction ............................................................................................................ 110
6.2 GUI for SVMTagger ............................................................................................. 111
6.2.1 File based SVMTagger window .................................................................... 113
6.2.2 Sentence based SVMTagger window .......................................................... 114
6.2.3 AMRITA POS Tagset window .................................................................... 115
6.3 GUI for SVMTeval ................................................................................................. 116
6.3.1 SVMTeval report window ............................................................................. 117
6.4 Output of TnT tagger for Tamil .............................................................................. 118
7. CONCLUSION .......................................................................................................... 120
REFERENCES .............................................................................................................. 121
LIST OF FIGURES
1.1 Classification of POS tagging models ......................................................................... 4
1.2 Example of Brill's templates ...................................................................................... 15
3.1 Tamil alphabets with English mapping........................................................................ 30
4.1 Example of untagged Corpus ...................................................................................... 54
4.2 Example of tagged Corpus .......................................................................................... 54
4.3 Untagged Corpus before pre-editing .......................................................................... 56
4.4 Untagged Corpus after pre-editing ............................................................................. 56
5.1 SVM Example : Hard Margin ..................................................................................... 63
5.2 SVM Example : Soft Margin ...................................................................................... 63
5.3 Training data format.................................................................................................... 69
5.4 Implementation of SVMTlearn .................................................................................. 83
5.5 Example input file ...................................................................................................... 90
5.6 Example output file ..................................................................................................... 91
5.7 Implementation of SVMTagger .................................................................................. 97
5.8 Implementation of SVMTeval ................................................................................... 99
6.1 SVMTagger window ............................................................................................... 112
6.2 File based SVMTagger window ............................................................................... 113
6.3 Sentence based SVMTagger window ....................................................................... 114
6.4 AMRITA Tagset window ........................................................................................ 115
6.5 SVMTeval window .................................................................................................. 116
6.6 SVMTeval results window........................................................................................ 117
6.7 Output of TnT tagger for Tamil ................................................................................ 118
LIST OF TABLES
3.1 AMRITA Tagset ...................................................................................................... 42
4.1 Tag counts ................................................................................................................. 58
5.1 Rich Feature Pattern Set............................................................................................ 67
5.2 SVMTlearn config-file mandatory arguments ......................................................... 72
5.3 SVMTlearn config-file action arguments ............................................................... 73
5.4 SVMTlearn config-file optional arguments ............................................................. 74
5.5 SVMTool feature language ...................................................................................... 75
5.6 Model 0: Example of suitable POS Features .......................................................... 81
5.7 Model 1:Example of suitable POS features ............................................................ 81
5.8 Model 2:Example of suitable POS features ............................................................ 82
OVERVIEW OF THE PROJECT
This thesis describes part of speech tagging for Tamil using SVMTool. The work is divided into seven chapters.
The first chapter gives an introduction to part of speech tagging. It discusses word classes and tagsets, and briefly describes the various approaches used for POS tagging.
The second chapter gives a literature review of part of speech tagging in general and of earlier POS tagging work for Tamil.
The third chapter discusses Tamil POS tagging. It gives a short introduction to the Tamil language, Tamil grammar, and the POS categories in Tamil. It also discusses the complexity of Tamil POS tagging and the various tagsets proposed for Tamil. Finally, it defines a new tagset for Tamil (the AMRITA TAGSET).
The fourth chapter is about corpus development and explains how the corpus was developed.
The fifth chapter is about the part of speech tagger based on Support Vector Machines. It discusses the SVMTool software and how it is used for training and testing a Tamil corpus. Finally, it discusses the implementation of the POS tagger for Tamil using SVMTool, and the output of SVMTool is compared with that of TnT for Tamil.
The sixth chapter discusses the GUI for the SVMTool part of speech tagger. This GUI is intended for users who are unfamiliar with the SVMTool commands for the SVMTagger and SVMTeval components. The final chapter presents the conclusion and future work of this project.
1) INTRODUCTION
1.1 Introduction
Part of speech (POS) tagging is one of the most well-studied problems in the field of Natural Language Processing (NLP). Different approaches have already been tried to automate the task for English and other languages. Tamil is a South Asian language belonging to the Dravidian language family. Although it is spoken by over 77 million people in several countries of the world, it still lacks significant research effort in the area of Natural Language Processing. This chapter discusses what is meant by POS tagging and the various POS tagging approaches.
1.2 Word classes
Words are divided into different classes called parts of speech (POS), word classes, morphological classes, or lexical tags. In traditional grammars, there are only a few parts of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.). Many of the recent models have much larger numbers of word classes (POS tags). Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.
Parts of speech can be divided into two broad supercategories: CLOSED CLASS and OPEN CLASS types. Closed classes are those that have relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely coined. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs [23]. It turns out that English has all four of these, although not every language does.
1.3 Part of speech tagging
Parts of speech (POS) tagging means assigning grammatical classes, i.e., appropriate parts of speech tags, to each word in a natural language sentence. Assigning a POS tag to each word of an un-annotated text by hand is very time consuming, which has led to various approaches to automate the job. POS tagging is thus a technique to automate the tagging of lexical categories. The process takes a word or a sentence as input, assigns a POS tag to the word or to each word in the sentence, and produces the tagged text as output.
Part-of-speech tagging is the process of assigning a part-of-speech or other
lexical class marker to each word in a corpus. Tags are also usually applied to
punctuation markers; thus tagging for natural language is the same process as
tokenization for computer languages, although tags for Tamil are much more
ambiguous. Taggers play an increasingly important role in speech recognition, natural
language parsing and information retrieval.
The input to a tagging algorithm is a string of words and a specified tagset of
the kind described in the previous section. The output is a single best tag for each
word. For example, here are some sample sentences.
Example in English:

Take/VB that/DT book/NN

(Tagged using the Penn Treebank tagset)
Even in these simple examples, automatically assigning a tag to each word is
not trivial. For example, Book is ambiguous. That is, it has more than one possible
usage and part of speech. It can be a verb (as in book that bus or to book the suspect)
or a noun (as in hand me that book, or a book of matches). Similarly that can be a
determiner (as in Does that flight serve dinner), or a complementizer (as in I thought
that your flight was earlier).
The problem of POS-tagging is to resolve these ambiguities, choosing the
proper tag for the context. Part-of-speech tagging is thus one of the many
disambiguation tasks. Another important point which was discussed and agreed upon was that POS tagging is NOT a replacement for a morph analyzer.
A 'word' in a text carries the following linguistic knowledge: a) its grammatical category, and b) grammatical features such as gender, number, person, etc. The POS tag should be based on the 'category' of the word; the features can be acquired from the morph analyzer.
1.4 Tagging approaches
There are different approaches for POS tagging. The following figure
demonstrates different POS tagging models. Most tagging algorithms fall into one of
two classes: rule-based taggers and stochastic taggers. Rule-based taggers generally
involve a large database of hand-written disambiguation rules.
Figure 1.1: Classification of POS tagging models. Supervised and unsupervised models each divide into rule-based, stochastic, and neural approaches (with Brill tagging appearing in both); the stochastic branch covers N-gram-based maximum-likelihood models and Hidden Markov Models, associated with the Baum-Welch and Viterbi algorithms.
For example, a rule might state that an ambiguous word is a noun rather than a verb if it follows a determiner. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.
The Brill tagger shares features of both tagging architectures. Like the rule-based taggers, it is based on rules which determine when an ambiguous word should have a given tag. Like the stochastic taggers, it has a machine-learning component: the rules are automatically induced from a previously tagged training corpus.
Supervised POS Tagging
The supervised POS tagging models require pre-tagged corpora which are
used for training to learn information about the tagset, word-tag frequencies, rule sets
etc. The performance of the models generally increases with the increase in size of
this corpus.
Unsupervised POS Tagging
Unlike the supervised models, the unsupervised POS tagging models do not
require a pre-tagged corpus. Instead, they use advanced computational methods like
the Baum-Welch algorithm to automatically induce tagsets, transformation rules etc.
Based on the information, they either calculate the probabilistic information needed
by the stochastic taggers or induce the contextual rules needed by rule-based systems
or transformation based systems.
1.4.1 Rule Based POS tagging
The rule based POS tagging models apply a set of hand written rules and use
contextual information to assign POS tags to words. These rules are often known as
context frame rules. For example, a context frame rule might say something like:
“If an ambiguous/unknown word X is preceded by a Determiner and followed by a
Noun, tag it as an Adjective.” On the other hand, the transformation based approaches
use a pre-defined set of handcrafted rules as well as automatically induced rules that
are generated during training.
Morphology is a linguistic term referring to how words are built up from smaller units of meaning known as morphemes. In addition to contextual information, morphological information is also used by some models to aid in the disambiguation process. One such rule might be: “If an ambiguous/unknown word ends in the suffix -ing and is preceded by a Verb, label it a Verb”.
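As an illustration, here is a minimal Python sketch of how context frame rules of this kind might be applied over an initial tag assignment. The lexicon, the tag names, and both rules are toy assumptions made for this sketch, not taken from any particular tagger:

    # Toy lexicon mapping words to their candidate tags; purely illustrative.
    LEXICON = {"the": ["DET"], "dog": ["NOUN"], "runs": ["VERB"],
               "barking": ["VERB", "ADJ"]}

    def candidate_tags(word):
        """Return the possible tags for a word, or UNKNOWN if unseen."""
        return LEXICON.get(word, ["UNKNOWN"])

    def apply_context_rules(words, tags):
        """Resolve ambiguous or unknown tags with hand-written context rules."""
        for i in range(len(tags)):
            ambiguous = tags[i] == "UNKNOWN" or len(candidate_tags(words[i])) > 1
            if not ambiguous:
                continue
            prev_tag = tags[i - 1] if i > 0 else None
            next_tag = tags[i + 1] if i + 1 < len(tags) else None
            # Rule 1: between a Determiner and a Noun -> tag it as an Adjective.
            if prev_tag == "DET" and next_tag == "NOUN":
                tags[i] = "ADJ"
            # Rule 2: ends in -ing and preceded by a Verb -> label it a Verb.
            elif words[i].endswith("ing") and prev_tag == "VERB":
                tags[i] = "VERB"
        return tags

    words = ["the", "barking", "dog", "runs"]
    tags = [candidate_tags(w)[0] for w in words]  # naive initial assignment
    print(apply_context_rules(words, tags))       # ['DET', 'ADJ', 'NOUN', 'VERB']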
Some models also use information about capitalization and punctuation, the usefulness of which depends largely on the language being tagged. The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to narrow this list down to a single part of speech for each word.
The ENGTWOL tagger is based on the same two-stage architecture, although
both the lexicon and the disambiguation rules are much more sophisticated than the
early algorithms. The ENGTWOL lexicon is based on the two-level morphology and
has about 56,000 entries for English word stems, counting a word with multiple parts
of speech (e.g. nominal and verbal senses of hit) as separate entries, and of course not
counting inflected and many derived forms. Each entry is annotated with a set of
morphological and syntactic features.
In the first stage of the tagger, each word is run through the two-level lexicon
transducer and the entries for all possible parts of speech are returned.
For example, the phrase “Ameen had shown that output...” would return the following list (one line per possible tag, with the correct tag shown in boldface):

    Ameen    AMEEN N NOM SG PROPER
    had      HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
    shown    SHOW PCP2 SVOO SVO SV
    that     ADV
             PRON DEM SG
             DET CENTRAL DEM SG CS
    output   N NOM SG
A set of about 1,100 constraints is then applied to the input sentence to rule out incorrect parts of speech; the boldfaced entries in the table above show the desired result, in which the preterite (not participle) tag is applied to had, and the complementizer (CS) tag is applied to that. The constraints are used in a negative way, to eliminate tags that are inconsistent with the context. For example, one constraint eliminates all readings of that except the ADV (adverbial intensifier) sense (this is the sense in the sentence it isn't that odd). Here is a simplified version of the constraint:
    ADVERBIAL-THAT RULE
    Given input: "that"
    if
        (+1 A/ADV/QUANT);   /* if the next word is an adjective, adverb, or quantifier */
        (+2 SENT-LIM);      /* and the word following that is a sentence boundary, */
        (NOT -1 SVOC/A);    /* and the previous word is not a verb like 'consider' */
                            /* which allows adjectives as object complements */
    then eliminate non-ADV tags
    else eliminate ADV tag
The first two clauses of this rule check to see that the ‘that’ directly precedes a
sentence-final adjective, adverb, or quantifier. In all other cases the adverb reading is
eliminated. The last clause eliminates cases preceded by verbs like consider or
believe which can take a noun and an adjective; this is to avoid tagging the following
instance of that as an adverb:
Example: I consider that odd.
Another rule is used to express the constraint that the complementizer sense of
that is most likely to be used if the previous word is a verb which expects a
complement (like believe, think, or show), and if the ‘that’ is followed by the
beginning of a noun phrase, and a finite verb.
1.4.2 Stochastic POS tagging
A stochastic approach involves frequency, probability, or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in the unannotated text. The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of a language.
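As a minimal sketch of this word-frequency baseline, the following Python fragment counts word-tag pairs over toy training data; the data and the default tag for unseen words are assumptions made for illustration:

    from collections import Counter, defaultdict

    # Toy training data; real systems count over a large annotated corpus.
    training = [("the", "DT"), ("race", "NN"), ("to", "TO"),
                ("race", "VB"), ("race", "NN")]

    counts = defaultdict(Counter)
    for word, tag in training:
        counts[word][tag] += 1

    def most_frequent_tag(word):
        """Tag each word with whatever tag it carried most often in training."""
        if word in counts:
            return counts[word].most_common(1)[0][0]
        return "NN"  # crude default for unseen words (an assumption)

    print([most_frequent_tag(w) for w in ["the", "race"]])  # ['DT', 'NN']

Note that race always comes out as NN here, regardless of context; this is precisely the weakness described above.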
An alternative to the word frequency approach is known as the n-gram approach, which calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2, or 3 for practical purposes. These are known as the Unigram, Bigram, and Trigram models. The most common algorithm for implementing an n-gram approach for tagging new text is the Viterbi algorithm, a search algorithm that avoids the polynomial expansion of a breadth-first search by trimming the search tree at each level using the best m Maximum Likelihood Estimates (MLE), where m represents the number of tags of the following word.
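Since Viterbi decoding is central to the stochastic taggers discussed in the rest of this section, here is a minimal Python sketch of it for a bigram tag model. The race probabilities are the Brown/Switchboard estimates quoted later in this section; the start-of-sentence transition, the emission for to, and all names are assumptions made for the illustration:

    import math

    # Toy transition P(tag | previous tag) and emission P(word | tag) tables.
    TAGS = ["TO", "VB", "NN"]
    trans = {("<s>", "TO"): 0.1, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
    emit = {("to", "TO"): 0.9, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

    def viterbi(words):
        """Return the most probable tag sequence under the bigram model."""
        # best[i][t] = (log-prob of the best path ending in tag t, backpointer)
        best = [{"<s>": (0.0, None)}]
        for i, word in enumerate(words, start=1):
            best.append({})
            for tag in TAGS:
                e = emit.get((word, tag), 0.0)
                if e == 0.0:
                    continue  # prune tags that cannot emit this word
                # extend every surviving path by one transition + emission
                cands = [(p + math.log(trans.get((pt, tag), 1e-12) * e), pt)
                         for pt, (p, _) in best[i - 1].items()]
                best[i][tag] = max(cands)
        # follow the backpointers from the best final tag
        tag = max(best[-1], key=lambda t: best[-1][t][0])
        path = [tag]
        for i in range(len(words), 1, -1):
            tag = best[i][tag][1]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["to", "race"]))  # ['TO', 'VB']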
The use of probabilities in tags is quite old; probabilities in tagging were first used in 1965, a complete probabilistic tagger with Viterbi decoding was sketched by Bahl and Mercer (1976), and various stochastic taggers were built in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988). This section describes a particular stochastic tagging algorithm generally known as the Hidden Markov Model or HMM tagger. The idea behind all stochastic taggers is a simple generalization of the 'pick the most likely tag for this word' approach.
For a given sentence or word sequence, HMM taggers choose the tag sequence that
maximizes the following formula:
P(word | tag) × P(tag | previous n tags)   (1.1)
The rest of this section will explain and motivate this particular equation.
HMM taggers generally choose a tag sequence for a whole sentence rather than for a
single word, but for pedagogical purposes, let's first see how an HMM tagger assigns
a tag to an individual word. We first give the basic equation, then work through an
example, and, finally, give the motivation for the equation.
A bigram HMM tagger of this kind chooses the tag t_i for word w_i that is most probable given the previous tag t_{i-1} and the current word w_i:

t_i = argmax_j P(t_j | t_{i-1}, w_i)   (1.2)
We can restate Equation 1.2 to give the basic HMM equation for a single tag by applying some Markov assumptions, as follows:

t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)   (1.3)
An Example
Using an HMM tagger to assign the proper tag to the single word race in the following examples (both shortened slightly from the Brown corpus):

Vikram/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN   (1.4)

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN   (1.5)

(Tagged using the Penn Treebank tagset)
In the first example race is a verb (VB), in the second a noun (NN). For the
purposes of this example, we will assume that some other mechanism has already
done the best tagging job possible on the surrounding words, leaving only the word
race untagged. A bigram version of the HMM tagger makes the simplifying
assumption that the tagging problem can be solved by looking at nearby words and
tags. Consider the problem of assigning a tag to race given just these subsequences:
to/TO race/???
the/DT race/???
Equation 1.3 says that if we are trying to choose between NN and VB for the
sequence to race, we choose the tag that has the greater of these two probabilities:
P(VB|TO) P(race|VB)   (1.6)

and

P(NN|TO) P(race|NN)   (1.7)
Equation 1.3 and its instantiations, Equations 1.6 and 1.7, each have two probabilities: a tag sequence probability P(t_j | t_{i-1}) and a word likelihood P(w_i | t_j). For race, the tag sequence probabilities P(NN|TO) and P(VB|TO) answer the question "how likely are we to expect a verb (or noun) given the previous tag?". They can be computed from a corpus simply by counting and normalizing. We can expect that a verb is more likely to follow TO than a noun is, since infinitives (to race, to run, to eat) are common in English. While it is possible for a noun to follow TO (walk to school, related to hunting), it is less common.
Suppose the combined Brown and Switchboard corpora give us the following probabilities, showing that verbs are about sixteen times as likely as nouns after TO:

P(NN|TO) = .021
P(VB|TO) = .34
The second part of Equation 1.3 and its instantiations, Equations 1.6 and 1.7, is the lexical likelihood: the likelihood of the noun race given each tag, P(race|VB) and P(race|NN). This likelihood term is not 'which is the most likely tag for this word'. That is, the likelihood term is not P(VB|race). Instead we are computing P(race|VB). The probability answers the question "if we were expecting a verb, how likely is it that this verb would be race?".

Here are the lexical likelihoods from the combined Brown and Switchboard corpora:

P(race|NN) = .00041
P(race|VB) = .00003
If we multiply the lexical likelihoods by the tag sequence probabilities, we see that even the simple bigram version of the HMM tagger correctly tags race as a VB despite the fact that it is the less likely sense of race:

P(VB|TO) P(race|VB) = .00001
P(NN|TO) P(race|NN) = .000007
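As a quick arithmetic check, the two products can be recomputed directly from the quoted probabilities (the second value rounds slightly differently from the text, but the ordering, which is what matters, is the same):

    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
    print(f"{p_vb:.2e} {p_nn:.2e} {p_vb > p_nn}")  # 1.02e-05 8.61e-06 True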
1.4.3 Transformation-based POS tagging
In general, the rule based tagging models require supervised training, i.e., pre-annotated corpora. Recently, however, a good amount of work has been done to automatically induce the transformation rules. One approach to automatic rule induction is to run an untagged text through a tagging model and get the initial output.
A human then goes through the output of this first phase and corrects any erroneously
tagged words by hand. This tagged text is then submitted to the tagger, which learns
correction rules by comparing the two sets of data. Several iterations of this process
are sometimes necessary before the tagging model can achieve considerable
performance. The transformation based approach is similar to the rule based approach
in the sense that it depends on a set of rules for tagging.
It initially assigns tags to words based on a stochastic method, e.g., the tag with the highest frequency for a particular word is assigned to that word. Then it applies the set of rules to the initially tagged data to generate the final output. It also learns new rules while applying the already learnt rules, and can save the new rules if they prove effective, i.e., improve the performance of the model.
Transformation-Based Tagging, sometimes called Brill tagging, is an instance of the Transformation-Based Learning (TBL) approach to machine learning (Brill, 1995), and draws inspiration from both the rule-based and stochastic taggers. Like the rule-based taggers, TBL is based on rules that specify what tags should be assigned to what words. But like the stochastic taggers, TBL is a machine learning technique, in which rules are automatically induced from the data. Like some but not all of the HMM taggers, TBL is a supervised learning technique; it assumes a pre-tagged training corpus. The TBL algorithm has a set of tagging rules. A corpus is first tagged using the broadest rule, i.e., the one that applies to the most cases. Then a slightly more specific rule is chosen, which changes some of the original tags. Next comes an even narrower rule, which changes a smaller number of tags (some of which might be previously changed tags).
How TBL rules are applied
Here we will see some of the rules used by Brill’s tagger. Before the rules
apply, the tagger labels every word with its most-likely tag. We get these most-likely
tags from a tagged corpus. For example, in the Brown corpus, race is most likely to
be a noun:
P(NN |race ) = .98
P(VB |race ) = .02
This means that the two examples of race that we saw above will both be coded as
NN. In the first case, this is a mistake, as NN is the incorrect tag:
is/VBZ expected/VBN to/TO race/NN tomorrow/NN   (1.8)

In the second case this race is correctly tagged as an NN:

the/DT race/NN for/IN outer/JJ space/NN   (1.9)
After selecting the most-likely tag, Brill's tagger applies its transformation rules. As it
happens, Brill's tagger learned a rule that applies exactly to this mistagging of race:
Change NN to VB when the previous tag is TO

This rule would change race/NN to race/VB in exactly the following situation, since it is preceded by to/TO:

expected/VBN to/TO race/NN -> expected/VBN to/TO race/VB   (1.10)
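A minimal Python sketch of applying such a transformation is given below; the (from_tag, to_tag, condition) representation of a rule is an assumption made for this illustration, not Brill's actual data structure:

    def prev_tag_is(trigger):
        """Condition: fires when the previous tag equals the trigger."""
        return lambda tags, i: i > 0 and tags[i - 1] == trigger

    # "Change NN to VB when the previous tag is TO"
    RULES = [("NN", "VB", prev_tag_is("TO"))]

    def apply_transformations(tags):
        """Apply each rule left-to-right over the initial most-likely tags."""
        for from_tag, to_tag, cond in RULES:
            for i, tag in enumerate(tags):
                if tag == from_tag and cond(tags, i):
                    tags[i] = to_tag
        return tags

    # expected/VBN to/TO race/NN -> expected/VBN to/TO race/VB
    print(apply_transformations(["VBN", "TO", "NN"]))  # ['VBN', 'TO', 'VB']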
How TBL Rules are learned
Brill's TBL algorithm has three major stages. It first labels every word with its most-likely tag. It then examines every possible transformation and selects the one that results in the most improved tagging. Finally, it re-tags the data according to this rule. These three stages are repeated until some stopping criterion is reached, such as insufficient improvement over the previous pass. Stage two requires that TBL know the correct tag of each word; i.e., TBL is a supervised learning algorithm.
The output of the TBL process is an ordered list of transformations; these then
constitute a 'tagging procedure' that can be applied to a new corpus. In principle the
set of possible transformations is infinite, since we could imagine transformations
such as "transform NN to VB if the previous word was 'IBM' and the word 'the'
occurs between 17 and 158 words before that".
But TBL needs to consider every possible transformation, in order to pick the best one on each pass through the algorithm. Thus the algorithm needs a way to limit the set of transformations. This is done by designing a small set of templates, i.e., abstracted transformations. Every allowable transformation is an instantiation of one of the templates. Figure 1.2 shows a set of templates.
The preceding (following) word is tagged z.
The word two before (after) is tagged z.
One of the two preceding (following) words is tagged z.
One of the three preceding (following) words is tagged z.
The preceding word is tagged z and the following word is tagged w.
The preceding (following) word is tagged z and the word two before (after) is
tagged w.
Figure 1.2: Brill's templates. Each begins with "Change tag a to tag b when: ...". The variables a, b, z, and w range over parts of speech.
In practice, there are a number of ways to make the algorithm more efficient. For example, templates and instantiated transformations can be suggested in a data-driven manner: a transformation instance might only be suggested if it would improve the tagging of some specific word. The search can also be made more efficient by pre-indexing the words in the training corpus by potential transformation.
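The three stages above can be sketched as a greedy loop in Python. A single "previous tag is z" template stands in for Brill's full template set, and the toy data, names, and error measure are assumptions made for this illustration:

    # Start from most-likely tags, repeatedly pick the transformation that
    # fixes the most errors against the gold tags, stop when nothing improves.
    def candidate_rules(tagset):
        """Instantiate the template 'change a to b when previous tag is z'."""
        return [(a, b, z) for a in tagset for b in tagset for z in tagset if a != b]

    def apply_rule(rule, tags):
        a, b, z = rule
        return [b if t == a and i > 0 and tags[i - 1] == z else t
                for i, t in enumerate(tags)]

    def errors(tags, gold):
        return sum(t != g for t, g in zip(tags, gold))

    def tbl_learn(initial, gold, tagset):
        """Greedily learn an ordered list of transformations."""
        tags, learned = list(initial), []
        while True:
            best = min(candidate_rules(tagset),
                       key=lambda r: errors(apply_rule(r, tags), gold))
            if errors(apply_rule(best, tags), gold) >= errors(tags, gold):
                break  # no rule improves the tagging any further
            tags = apply_rule(best, tags)
            learned.append(best)
        return learned

    # gold: to/TO race/VB; the initial most-likely tags label race as NN
    print(tbl_learn(["TO", "NN"], ["TO", "VB"], ["TO", "NN", "VB"]))
    # [('NN', 'VB', 'TO')]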
1.5 Other techniques
Apart from these, quite a few other approaches to tagging have been developed.
Support Vector Machines: This is a powerful machine learning method used for various applications in NLP and in other areas such as bioinformatics.
Neural Networks: These are potential candidates for the classification task since
they learn abstractions from examples [Schmid 1994a].
Decision trees: These are classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging [Magerman 1995; Schmid 1994b]. We can use "Weka" for classifying ambiguous words.
Maximum entropy models: These avoid certain problems of statistical
interdependence and have proven successful for tasks such as parsing and POS
tagging.
Example-Based techniques: These techniques find the training instance that is most similar to the current problem instance and assume the same class for the new problem instance as for the similar one.
1.6 Motivation
The part of speech of a word gives a significant amount of information about the word and its neighbors. This is clearly true for the major categories (verb versus noun), but it is also true for many finer distinctions. For example, many tagsets distinguish between possessive pronouns (my, your, his, her, its) and personal pronouns (I, you, he, me). Knowing whether a word is a possessive pronoun or a personal pronoun can tell us what words are likely to occur in its vicinity (possessive pronouns are likely to be followed by a noun, personal pronouns by a verb). This can be useful in a language model for speech recognition.
A word's part of speech can also tell us something about how the word is pronounced. The word content, for example, can be a noun or an adjective, and the two are pronounced differently.
Parts of speech can also be used in stemming for Information Retrieval (IR), since knowing a word's part of speech can help tell us which morphological affixes it can take. They can also help an IR application by helping to select nouns or other important words from a document. Automatic part-of-speech taggers can help in building automatic word-sense disambiguation algorithms, and POS taggers are also used in advanced ASR language models such as class-based N-grams. Parts of speech are very often used for 'partial parsing' of texts, for example for quickly finding names or other phrases for information extraction applications. Corpora that have been marked for part of speech are very useful for linguistic research, for example to help find instances or frequencies of particular constructions in large corpora.
Apart from these, many Natural Language Processing (NLP) activities such as
Machine Translation (MT), Word Sense Disambiguation (WSD) and Question
Answering (QA) systems are dependent on Part-Of-Speech Tagging.
For part of speech tagging,
i). The problem is clearly defined and well understood.
ii). The task is relatively easy but hard enough: a "good" performance can already be achieved with very simple methods, but a perfect system ultimately requires complete understanding (the task is said to be "AI-hard").
iii). Evaluation methods and comparison measures are available which make different
machine learning approaches feasible.
iv). Because only a relatively simple annotation scheme is required, large corpora are
available.
2) LITERATURE SURVEY
Part-of-speech tagging is the act of assigning each word in a sentence a tag
that describes how that word is used in the sentence. Typically, these tags indicate
syntactic categories, such as noun or verb, and occasionally include additional feature
information, such as number (singular or plural) and verb tense.
A large number of current language processing systems use a part-of-speech tagger for pre-processing. The tagger assigns a part-of-speech tag to each token in the input and passes its output to the next processing level, usually a parser. For such applications, a tagger with the highest possible accuracy is required. Recent comparisons of approaches that can be trained on corpora have shown that in most cases statistical approaches yield better results than finite-state, rule-based, or memory-based taggers. They are only surpassed by combinations of different systems, forming a "voting tagger". One such tagger comparison was organized as a "black box test": set the same task to every tagger and compare the outcomes. The authors in [1] describe the models and techniques used by TnT together with its implementation. The result of the tagger comparison seems to support the sentence "the simplest is the best".
Part-of-speech tagging is also a very practical application, with uses in many
areas, including speech recognition and generation, machine translation, parsing,
information retrieval and lexicography. Tagging can be seen as a prototypical
problem in lexical ambiguity; advances in part-of-speech tagging could readily
translate to progress in other areas of lexical, and perhaps structural, ambiguity, such
as word sense disambiguation and prepositional phrase attachment disambiguation.
Also, it is possible to cast a number of other useful problems as part-of-speech
tagging problems, such as letter-to-sound translation [2] and building pronunciation
networks for speech recognition. Recently, a method has been proposed for using
part-of-speech tagging techniques as a method for parsing with lexicalized grammars.
When automated part-of-speech tagging was initially explored [3], people
manually engineered rules for tagging, sometimes with the aid of a corpus. As large
corpora became available, it became clear that simple Markov-model based stochastic
taggers that were automatically trained could achieve high rates of tagging accuracy.
Markov-model based taggers assign to a sentence the tag sequence that maximizes
P(word | tag) * P(tag | previous n tags). These probabilities can be estimated directly
from a manually tagged corpus. These stochastic taggers have a number of advantages over the manually built taggers, including eliminating the need for laborious manual rule construction, and possibly capturing useful information that may not have been
noticed by the human engineer. However, stochastic taggers have the disadvantage
that linguistic information is captured only indirectly, in large tables of statistics. All
recent work in developing automatically trained part-of-speech taggers has been on
further exploring Markov model based tagging.
Several different approaches have been used for building text taggers. Greene
and Rubin used a rule-based approach in the TAGGIT program [4], which was an aid
in tagging the Brown corpus [5]. TAGGIT disambiguated 77% of
the corpus; the rest was done manually over a period of several years. Koskenniemi also used a rule-based approach implemented with finite-state machines.
Statistical methods have also been used (e.g., [6]). These provide the
capability of resolving ambiguity on the basis of most likely interpretation. A form of
Markov model has been widely used that assumes that a word depends
probabilistically on just its part-of-speech category, which in turn depends solely on
the categories of the preceding two words (in case of trigram).
Two types of training (i.e., parameter estimation) have been used with this
model. The first makes use of a tagged training corpus. Derouault and Merialdo use a
bootstrap method for training [7]. At first, a relatively small amount of text is
manually tagged and used to train a partially accurate model. The model is then used
to tag more text, and the tags are manually corrected and then used to retrain the
model. Church uses the tagged Brown corpus for training [8]. These models involve
probabilities for each word in the lexicon, so large tagged corpora are required for
reliable estimation.
The second method of training does not require a tagged training corpus. In
this situation the Baum-Welch algorithm (also known as the forward-backward
algorithm) can be used [9]. Under this system the model is called a hidden Markov
model (HMM), as state transitions (i.e., part-of-speech categories) are assumed to be
unobservable. Jelinek has used this method for training a text tagger [10]. Parameter
smoothing can be conveniently achieved using the method of ‘deleted interpolation’
in which weighted estimates are taken from second and first-order models and a
uniform probability distribution. Kupiec used word equivalence classes (referred to
here as ambiguity classes) based on parts of speech, to pool data from individual
words. The most common words are still represented individually, as sufficient data
exist for robust estimation.
All other words are represented according to the set of possible categories
they can assume. In this manner, the vocabulary of 50,000 words in the Brown corpus
can be reduced to approximately 400 distinct ambiguity classes [11]. To further
reduce the number of parameters, a first-order model can be employed (this assumes
that a word's category depends only on the immediately preceding word's category).
In [12], networks are used to selectively augment the context in a basic first order
model, rather than using uniformly second-order dependencies.
In the linguistic approach, an expert linguist is needed to formalize the
restrictions of the language. This implies a very high cost and it is very dependent on
each particular language. We can find an important contribution that uses Constraint
Grammar formalism. Supervised learning methods were proposed in [14] to learn a set of transformation rules that repair the errors committed by a probabilistic tagger.
The main advantage of the linguistic approach is that the model is constructed from a
linguistic point of view and contains many and complex kinds of knowledge.
In the learning approach, the most extended formalism is based on n-grams. In
this case, the language model can be estimated from a labeled corpus (supervised
methods) [15] or from a non-labeled corpus (unsupervised methods) [16]. In the first case, the model is trained from the relative observed frequencies. In the second one, the model is learned using the Baum-Welch algorithm from an initial model which is estimated using labeled corpora [17]. The advantages of the unsupervised approach are the facility to build language models, the flexibility in the choice of categories, and
the ease of application to other languages. We can find some other machine-learning
approaches that use more sophisticated LMs, such as Decision Trees, memory-based
approaches to learn special decision trees, maximum entropy approaches that
combine statistical information from different sources [18], finite state automata
inferred using Grammatical Inference, etc.
The comparison among different approaches is difficult due to the multiple factors that can be considered: the language, the number and type of the tags, the
size of the vocabulary, the ambiguity, the difficulty of the test set, etc.
Part-of-speech tagging is an important research topic in Natural Language
Processing (NLP). Taggers are often preprocessors in NLP systems, making accurate
performance especially important. Much research has been done to improve tagging
accuracy using several different models and methods.
Most NLP applications demand, at initial stages, shallow linguistic information (e.g., part-of-speech tagging, base phrase chunking, named entity recognition). This information may be predicted fully automatically (at the cost of some errors) by means of sequential tagging over unannotated raw text. Generally, tagging is
required to be as accurate as possible, and as efficient as possible. But, certainly, there
is a trade-off between these two desirable properties. This is so because obtaining a
higher accuracy relies on processing more and more information, digging deeper and
deeper into it. However, sometimes, depending on the kind of application, a loss in
efficiency may be acceptable in order to obtain more precise results. Or the other way
around, a slight loss in accuracy may be tolerated in favour of tagging speed.
Some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns. Also, the tagset size and ambiguity
rate may vary from language to language and from problem to problem. Besides, if
few data are available for training, the proportion of unknown words may be huge.
Sometimes, morphological analyzers could be utilized to reduce the degree of
ambiguity when facing unknown words. Thus, a sequential tagger should be flexible
with respect to the amount of information utilized and context shape. Another very
interesting property for sequential taggers is their portability. Multilingual
information is a key ingredient in NLP tasks such as Machine Translation,
Information Retrieval, Information Extraction, Question Answering and Word Sense
Disambiguation, just to name a few. Therefore, having a tagger that works equally
well for several languages is crucial for the system robustness. Besides, quite often
for some languages, but also in general, lexical resources are hard to obtain.
Therefore, ideally a tagger should be capable of learning with fewer (or even no) annotated data.
The SVMTool [29] is intended to comply with all the requirements of modern
NLP technology, by combining simplicity, flexibility, robustness, portability and
efficiency with state–of–the–art accuracy. This is achieved by working in the Support
Vector Machines (SVM) learning framework, and by offering NLP researchers a
highly customizable sequential tagger generator.
In the recent literature, several approaches to POS tagging based on statistical
and machine learning techniques are applied, including among many others: Hidden
Markov Models [15], Maximum Entropy taggers [18], Transformation–based
learning [14], Memory–based learning [19], Decision Trees [20], AdaBoost [21], and
Support Vector Machines [22]. Most of the previous taggers have been evaluated on
the English WSJ corpus, using the Penn Treebank set of POS categories and a lexicon
constructed directly from the annotated corpus. Although the evaluations were
performed with slight variations, there was a wide consensus in the late 90’s that the
state–of-the–art accuracy for English POS tagging was between 96.4% and 96.7%. In
the recent years, the most successful and popular taggers in the NLP community have
been the HMM–based TnT tagger, the Transformation–based learning (TBL) tagger
[14], and several variants of the Maximum Entropy (ME) approach [18].
TnT is an example of a really practical tagger for NLP applications. It is
available to anybody, simple and easy to use, considerably accurate, and extremely
efficient, allowing training from 1 million word corpora in just a few seconds and
tagging thousands of words per second. In the case of the TBL and ME approaches, their great success has been due to the flexibility they offer in modeling contextual information, with ME being slightly more accurate than TBL.
Far from being considered a closed problem, several researchers tried to
improve results on the POS tagging task during last years. Some of them by allowing
richer and more complex HMM models [23], others [24] by enriching the feature set
in a ME tagger, and others [22] by using more effective learning techniques: SVM,
and a Voted–Perceptron–based training of a ME model. In these more complex
taggers the state–of–the–art accuracy was raised up to 96.9%–97.1% on the same
WSJ corpus. In a complementary direction, other researchers suggested the
combination of several pre-existing taggers under several alternative voting schemes
[25]. Although the accuracy of these taggers is even better (around 97.2%) the
ensembles of POS taggers are undeniably more complex and less efficient.
Many natural language tasks require the accurate assignment of Part-Of-Speech (POS) tags to previously unseen text. Due to the availability of large corpora
which have been manually annotated with POS information, many taggers use
annotated text to "learn" either probability distributions or rules and use them to
automatically assign POS tags to unseen text.
Several recent papers [14] have reported 96.5% tagging accuracy on the Wall
St. Journal corpus. A Maximum Entropy model is well-suited for experiments since it
combines diverse forms of contextual information in a principled manner, and does
not impose any distributional assumptions on the training data. Previous uses of this
model include language modeling [19], machine translation [26], prepositional phrase
attachment [18], and word morphology.
Parts of Speech Tagging gone through
For English, there are many POS taggers employing machine learning techniques such as Hidden Markov Models (Brants, 2000), transformation based error driven learning (Brill, 1995), decision trees (Black, 1992), maximum entropy methods (Ratnaparkhi, 1996), and conditional random fields (Lafferty et al., 2001). Some of the techniques proposed for chunking in English are based on Support Vector Machines (Kudoh et al., 2001) and Winnow (Zhang et al., 2002).
The POS taggers reach anywhere between 92-97 % accuracy and chunkers
have reached approximately 94 % accuracy. However, these accuracies are aided by
the availability of large annotated corpora for English. As mentioned above, due to the
lack of annotated corpora, previous research in POS tagging and chunking in Indian
languages has mainly focused on rule based systems utilizing the morphological
analysis of word-forms. A. Bharati et al. (1995), in their work on a computational Paninian POS parser, described a technique where POS tagging is implicit and is merged with the parsing phase.
More recently, Smriti et al. (2006) proposed a POS tagger for Hindi which uses an annotated corpus (15,562 words collected from the BBC Hindi News site), exhaustive morphological analysis backed by a high coverage lexicon, and a decision tree based learning algorithm (CN2). They reach an accuracy of 93.45% for Hindi with a tagset of 23 POS tags. For Bengali, Sandipan et al. (2004) developed a corpus based semi-supervised learning algorithm for POS tagging based on HMMs. Their system uses a small tagged corpus (500 sentences) and a large unannotated corpus along with a Bengali morphological analyzer. When tested on a corpus of 100 sentences (1003 words), their system obtained an accuracy of 95%.
A. Singh et al. (2005) proposed an HMM-based chunker for Hindi with an
accuracy of 91.7%. They used HMMs trained on a four-tag scheme (STRT, CNT, STP,
STRT_STP) with POS tag information and converted it into a two-tag (STRT, CNT)
scheme while testing for chunk boundary identification. They, however, used a
rule-based system for chunk label identification. Annotated data of 150,000 words was
used for training, and the chunker was tested on 20,000 words whose POS tags
were manually annotated. To the best of our knowledge, these were the reported works
on POS tagging and chunking for Indian languages until the NLPAI Machine Learning
Contest (2006) was held in the summer of 2006.
For the contest, participants had to train on a set of training data for a chosen
language provided by the contest organizers. The systems, thus trained, were to
automatically mark POS and chunk information on the test data of the chosen
language. Chunk-annotated data was not released for Bengali and Telugu. Sandipan
and Sudeshna (2006) achieved an accuracy of 84.34% for Bengali POS tagging using
semi-supervised learning combined with a Bengali morphological analyzer.
A. Dalal et al. (2006) achieved accuracies of 82.22% and 82.4% for Hindi POS
tagging and chunking respectively using maximum entropy models. Karthik et al.
(2006) got 81.59% accuracy for Telugu POS tagging using HMMs. Sivaji et al.
(2006) came up with a rule-based chunker for Bengali which gave an accuracy of
81.64%. The training data for all three languages contained approximately 20,000
words and the testing data had approximately 5000 words.
The experiences from organizing the NLPAI ML Contest prompted a subsequent
contest for Shallow Parsing (POS tagging and chunking), in which the participants
had to develop systems for POS tagging and chunking across Indian languages using
the same learning technique, with the task and the released data formally defined
as part of the contest.
CLAWS part-of-speech tagger for English
Part-of-speech (POS) tagging, also called grammatical tagging, is the
commonest form of corpus annotation, and was the first form of annotation to be
developed by UCREL (University Centre for Computer Corpus Research on
Language ) at Lancaster. The POS tagging software for English text, CLAWS (the
Constituent Likelihood Automatic Word-tagging System), has been continuously
developed since the early 1980s. The latest version of the tagger, CLAWS4, was used
to POS tag c.100 million words of the British National Corpus (BNC).
CLAWS has consistently achieved 96-97% accuracy (the precise degree of
accuracy varying according to the type of text). Judged in terms of major categories,
the system has an error rate of only 1.5%, with c. 3.3% ambiguities unresolved, within
the BNC. More detailed analysis of the error rates for the C5 tagset in the BNC can be
found in the BNC manual.
Parts of Speech Tagging in Tamil
A part of speech tagging scheme tags each word in a sentence with its part of
speech. It is done in three stages: pre-editing, automatic tag assignment, and manual
post-editing [24]. In pre-editing, the corpus is converted to a suitable format to assign
a part of speech tag to each word or word combination. Because of orthographic
similarity, one word may have several possible POS tags. After the initial assignment
of possible POS tags, words are manually corrected to disambiguate words in texts.
1. Vasu Ranganathan’s Tagtamil:
Tagtamil by Vasu Ranganathan is based on a lexical phonological approach.
It handles the morphotactics of the morphological processing of verbs by using an
index method. Tagtamil does both tagging and generation.
2. Ganesan’s POS tagger:
Ganesan has prepared a POS tagger for Tamil. His tagger works well
on the CIIL corpus; its efficiency on other corpora has yet to be tested. He has a rich
tagset for Tamil. He tagged a portion of the CIIL corpus by using a dictionary as well
as a morphological analyzer, corrected it manually, and trained the rest of the
corpus with it. The tags are added morpheme by morpheme, for example:
pukkaLai : puu_N_PL_AC
vandtavan : va_IV_ndt_PT_avan_3PMS
3. kathambam of RCILTS-Tamil:
Kathambam attaches part of speech tags to the words of a given
Tamil document. It uses heuristic rules based on Tamil linguistics for tagging, and
uses neither a dictionary nor a morphological analyzer. It gives 80%
efficiency for large documents. It uses 12 heuristic rules and identifies the tags
based on PNG (person-number-gender), tense and case markers. Standalone words are
checked against the lists stored in the tagger. It uses a 'fill-in rule' to tag unknown
words: using bigrams, it identifies an unknown word's category from the previous
word's category.
3) TAMIL POS TAGGING
3.1) TAMIL LANGUAGE
Tamil (or Tamizh) is one of the most widely spoken languages in South Asia.
It belongs to the Dravidian family of languages. It has a long history of literary
tradition dating back to 200 BC and is spoken in the state of Tamilnadu in India,
in Sri Lanka, Singapore and Malaysia, and in small numbers in other parts of the world.
Because of its long roots and unbroken literary tradition, it was declared a
Classical Language in 2004 by the Government of India.
Tamil is a diglossic language. This means that there is a large disparity
between the written form of the language and the spoken form. These differences
include grammatical differences, vocabulary differences, and pronunciation
differences.
Tamil has a variety of dialects. Even within Tamilnadu, there are numerous
dialects that vary widely based on the geography and communities of the speakers. Tamil
also exhibits diglossia between its formal and classical variety, called Centamil (or
centamizh), and its colloquial variety, which includes a number of spoken dialects of
Tamil. Centamil has been used for official, governmental purposes, to write and
present reports, news articles etc.
3.1.1) Alphabets
Tamil is a phonetic language. The script for Tamil evolved from the Brahmi
script of the 3rd century BC. The Tamil alphabet consists of 12 vowels (uyir ezhuththu)
and 18 consonants (mei ezhuththu). These combine to make 216 vowel-consonant
compounds (uyirmei ezhuththu). There is also a special character called aaytha
ezhuththu (ஃ). These altogether make 247 characters in the Tamil alphabet.
The vowels can be divided into three categories: long (netil), short (kuRil) and
diphthongs. There are 5 long and 5 short vowels and two diphthongs. The consonants are
classified under three groups: hard (vallinam), nasal (mellinam) and medium
(idaiyinam). In addition, Tamil borrows letters from the Grantha script to represent
borrowed Sanskrit words.
Figure: 3.1 Tamil alphabets with English mapping
The following is the classification of the alphabet:
Long vowels: A, I, O, U, E
Short vowels: a, i, o, u, e
Diphthongs: au, ai
Vallinam: k, c, t, th, p, r
Mellinam: nj, ng, n, N, w, m
Idaiyinam: y, R, l, v, zh, L
3.1.2) Tamil Grammar
The grammar of any language can be broadly divided into phonology,
morphology and syntax. Phonology is the study of sound structure in language.
Morphology studies words and their construction. Syntax deals with how to put
words together in some order to make meaningful sentences. We will talk more about
morphology here, as it is more relevant to this project [28]. Tamil morphology is very
rich: Tamil is an agglutinative language, much like the other Dravidian
languages. Tamil words are made up of lexical roots followed by one or more affixes.
The lexical roots and the affixes are the smallest meaningful units and are
called morphemes. Tamil words are therefore made up of morphemes concatenated to
one another in a series. The first one in the construction is always a lexical morpheme
(lexical root). This may or may not be followed by other functional or grammatical
morphemes. For instance, the word 'books' in English can be meaningfully divided
into 'book' and 's'.
books = book + s.
In this example, ‘book’ is the lexical root, representing a real world entity and
‘s’ is the plural feature marker (suffix). ‘s’ is a grammatical morpheme that is bound
to the lexical root to add plurality to the lexical root. Unlike English, Tamil words can
have a large sequence of morphemes. For instance,
puththakangkaLai = puththakam (book) + kaL (s) + ai (acc. case marker).
Tamil nouns can take case suffixes after the plural marker, and can also have
postpositions after that. These details are discussed later. Words can be
analyzed like the one above by identifying the constituent morphemes, after which
their features can be identified.
Identifying the constituent morphemes also leads to identifying the Part-Of-Speech
(POS) of every word in the sentence, information which is further used in
syntax analysis.
3.1.3) Part of Speech Categories
Words can be classified under various parts of speech classes based on the
role they play in the sentence. The following are the main POS classes in Tamil.
1. Nouns
2. Verbs
3. Adjectives
4. Adverbs
5. Determiners
6. Post Positions
7. Conjunctions
8. Quantifiers
Out of these, only Nouns and Verbs can be inflected. Words in the other classes occur
only in their root forms.
3.1.4) Other POS Categories
Apart from nouns and verbs, the other POS categories that are open-class are
adverbs and adjectives. Most adjectives and adverbs can be placed in the lexicon
in their root forms. But there are adjectives and adverbs that can be derived from noun
and verb stems. Following are the morphotactics of adjectives derived from noun
roots and verb stems.
Noun_root + adjective_suffix
e.g. uyaram + Ana = uyaramAna <ADJ>
verb_stem + relative_participle
e.g. cey + tha = ceytha <VNAJ>
Following are the morphotactics of adverbs derived from noun roots and verb stems.
noun_root + adverb_suffix
e.g. uyaram +Aka = uyaramAka <ADV>
verb_stem +adverbial participle
e.g. cey + tu = ceythu <VNAV>
There are a number of non-finite verb forms in Tamil. Apart from the
participle forms, they are grammatically classified into structures such as the
infinitive, the conditional, etc.
Example for infinitive form of verb:
paRgga <VINT> (to fly)
Example for conditional verb:
vawthAl <CVB> (if one comes)
There are other categories like conjunctions, complementizers etc.; some of
these may be derived forms, but there are not many, so they can be listed in the
lexicon. Other categories that need to be listed in the lexicon as roots are free
postpositions and auxiliary verbs. This is because they can occur as words
in isolation even though they are semantically bonded to the noun or verb preceding
them.
3.1.5) Ambiguity of Roots
Roots can also be ambiguous: they can have more than one sense, and
sometimes belong to more than one POS category. Though the POS can sometimes
be disambiguated using contextual information like co-occurring morphemes, this is
not possible in every case.
These issues should be taken care of when morphological analyzers are built for the
language. Other issues that were dealt with during the implementation are explained
in the following sections. For a thorough discussion of Tamil morphology, refer to
[28].
3.2) COMPLEXITY IN TAMIL POS TAGGING
As Tamil is an agglutinative language, nouns get inflected for number and
case, while verbs get inflected for tense, person, number and gender suffixes.
Verbs are adjectivalized and adverbialized, and verbs and adjectives are also
nominalized by means of certain nominalizers. Adjectives and adverbs do not
inflect. Many postpositions in Tamil [20] come from nominal and verbal sources.
So, many times we need to depend on syntactic function or context to decide
whether a word is a noun, adjective, adverb or postposition. This is what makes
POS tagging complex for Tamil.
3.2.1) Noun complexity
Nouns are words which denote a person, place, thing, time, etc. In Tamil,
nouns are inflected for number and case at the morphological level (3.1).
However, at the phonological level, four types of suffixes can occur with a noun
stem (3.2).

Noun (+ number) (+ case)   (3.1)
Ex: pook-kaL-ai <NN>
flower-plural-accusative case suffix

Noun (+ number) (+ oblique) (+ euphonic) (+ case)   (3.2)
Ex: pook-kaL-in-Al <NN>
flower-plural-euphonic suffix-instrumental case suffix
Nouns further need to be annotated as common noun, compound noun,
proper noun, compound proper noun, pronoun, cardinal and ordinal. Pronouns need
to be further annotated as personal, interrogative and indefinite pronouns.
There is complexity between common noun and compound noun, and
also between proper noun and compound proper noun. A common noun can also
occur as a compound noun, for example
UrAcci <NNC> thalaivar <NNC>
When UrAcci and thalaivar come together they form a compound noun
(<NNC>), but when UrAcci and thalaivar come separately in a sentence they should
be tagged as common nouns (<NN>). Such complexity also occurs with the proper
noun <NNP> and compound proper noun <NNPC>. Moreover, there is
complexity between noun and adverb, and between pronoun and emphasis, at the
syntactic level.
3.2.2) Verb complexity
Verbal forms are complex in Tamil. A finite verb shows the following
morphological structure:
Verb stem + Tense + Person-Number + Gender
Ex: wata + nt + En <VF>
'I walked'
A number of non-finite forms are possible: adverbial forms (3.3), adjectival
forms (3.4), infinitive forms (3.5), and conditional forms (3.6).
verb_stem + adverbial participle   (3.3)
cey + tu = ceythu <VNAV>
'having done'

verb_stem + relative participle   (3.4)
cey + tha = ceytha <VNAJ>
'who did'

verb_stem + infinitive suffix   (3.5)
azu + a = aza <VINT>
'to weep'

verb_stem + conditional suffix   (3.6)
kEL + Al = kEttAl <CVB>
'if asked'
A distinction needs to be made between a main verb followed by a main verb and
a main verb followed by an auxiliary verb. A main verb followed by an auxiliary
needs to be interpreted together with it, whereas a main verb followed by another
main verb needs to be interpreted separately. This leads to functional ambiguity,
as shown below.
Functional ambiguity in the adverbial <VNAV> form
The morphological structure of the adverbial verb is
verb root + adverbial participle
e.g. cey + tu = ceythu <VNAV>
'having done'
vandtu <VNAV> caappiTTuviTTu <ADV> poo <VF>
'having come and having eaten, went'
wondtu <VNAV> poo
'become vexed'
Functional ambiguity in the adjectival <VNAJ> form
The adjectival <VNAJ> forms differ by tense markings:
Verb stem + Tense + Adjectivalizer
vandta 'x who came'
varukiRa 'x who comes'
varum 'x who will come'
The adjectival <VNAJ> form allows several interpretations, as in the following
examples.
cappiTTa ilai 'the leaf which is eaten by x'
'the leaf on which x had his food and ate'
vaangkiya <VNAJ> x
'x which is bought'
'x who bought'
'x (price) by which something is bought'
'x (money) received'
'x (container) in which something is received'
The um-suffixed adjectival form clashes with other homophonous forms, which
leads to ambiguity:
varum <VNAJ> paiyan 'the boy who will come'
varum <VF> 'it will come'
varum pootu <VNAV> 'while coming'
Functional ambiguity in the infinitival <VINT> form
verb_stem + infinitive suffix
e.g. azu + a = aza <VINT>
vara.v-iru 'going to come'
vara-k.kuuTaatu 'should not come'
vara-c.col 'ask x to come'
3.2.3) COMPLEXITY IN ADVERBS
We have seen that a number of adjectival and adverbial forms of verbs are
lexicalized as adjectives and adverbs respectively, and that they clash semantically
with their respective sentential adjectival and adverbial forms, creating ambiguity
in POS tagging [20]. Adverbs too need to be distinguished based on their source
category. Many adverbs are derived by suffixing aaka to nouns in Tamil, but not all
aaka-suffixed forms are adverbial:
veekam-aaka 'fast' vs. TAkTar-Aka 'as a doctor'
A functional clash can be seen between noun and adverb in aaka-suffixed forms.
This type of clash is seen among other Dravidian languages too.
avaL azhakaaka irukkiRaaL
she beauty_ADV be_PRE_she
'she is beautiful'
3.2.4) COMPLEXITY IN POSTPOSITIONS
Postpositions come from various categories, verbal, nominal and
adverbial, in Tamil. Many a time, the line demarcating verb/noun/adverb from
postposition is slim, leading to ambiguity. Some postpositions are simple and some
are compound. Postpositions are conditioned by the case-inflected nouns they
follow; simply tagging one form as a postposition would be misleading.
There are postpositions which come after nouns and also after verbs, which
makes the postposition ambiguous (spatial vs. temporal):
pinnaal <PPO> 'behind', as in viiTTukkup pinnaal 'behind the house'
pinnaal <ADV> 'after', as in avanukkup pinnaal vandtaan 'he came after him'
3.3) Developing a new tagset:
For developing a corpus, it is necessary to define the tags (POS tags) to be
used in that corpus. The collection of all the possible tags is called a tagset. Tagsets
differ from language to language. For Tamil, some tagsets are available; we referred
to some of the tagsets for Tamil and also for other languages. After considering all
the possibilities we created a new tagset, named the AMRITA tagset, because
the available tagsets contain a very large number of tags.
We followed the guidelines from "AnnCorra, IIIT Hyderabad" [26] while creating
our AMRITA tagset:
1. The tags should be simple.
2. Simplicity should be maintained for ease of learning and consistency in annotation.
3. POS tagging is not a replacement for a morph analyzer.
4. A 'word' in a text carries a grammatical category and grammatical features such as
gender, number, person etc. The POS tag should be based on the 'category' of the
word; the features can be acquired from the morph analyzer.
3.3.1) TAGSETS SURVEYED:
AUKBC Tagset
This tagset was created by the AUKBC Research Centre, Chennai, with the
help of eminent linguists from Tamil University, Thanjavur. It is an exhaustive
tagset, which covers all possible grammatical and lexical constituents. It contains 68
tags.
IIIT, Hyderabad Tagset:
A POS tagset for Indian languages was developed by IIIT, Hyderabad.
Its tags were decided on coarse linguistic information, with the idea of expanding to
finer knowledge if required. The annotation standards for POS tagging for Indian
languages include 26 tags.
CIIL Tagset for Tamil
This tagset was developed by CIIL (Central Institute of Indian Languages),
Mysore. It contains 71 tags for Tamil. Because noun and verb inflections are
considered, the number of tags increases: nearly 30 noun forms, including pronoun
categories, and 25 verb forms, including participle forms, are considered.
CIIL-KHS Hindi Tagset
This tagset was also developed by CIIL (Central Institute of Indian Languages),
Mysore. It contains 36 tags for Hindi.
3.3.2) AMRITA TAGSET
The main drawback of the other tagsets (the AUKBC and CIIL tagsets) is that they
consider verb and noun inflections, so at tagging time each and every inflected word
in the corpus has to be split, which is a tough process. At the POS level we only want
to determine a word's POS category or tag, and that can be done with a limited number
of tags. The inflectional analysis can be handled by a morphological analyzer, so
there is no need to use a large number of tags. Moreover, a large number of tags leads
to more complexity, which in turn reduces the tagging accuracy.
Considering the complexity of Tamil POS tagging and after referring to various
tagsets, we have developed our own tagset (the AMRITA tagset); see Table 3.1.
Our tagset contains 32 tags, and we do not consider inflections. The 32 tags
are listed in Table 3.1. In our tagset we use compound tags only for nouns (NNC)
and proper nouns (NNPC). We use the tag VBG for verbal nouns and participle
nouns.
Table: 3.1 AMRITA Tagset
3.4) Explanation of AMRITA POS tags
3.4.1) Noun tags
Nouns are words which denote a person, place, thing, time, etc. In Tamil,
nouns are inflected for number and case at the morphological level (3.7).
However, at the phonological level, four types of suffixes can occur with a noun
stem (3.8).

Noun (+ number) (+ case)   (3.7)
e.g. pook-kaL-ai <NN>
flower-plural-accusative case suffix

Noun (+ number) (+ oblique) (+ euphonic) (+ case)   (3.8)
e.g. pook-kaL-in-Al <NN>
flower-plural-euphonic suffix-instrumental case suffix
As mentioned earlier, distinct tags based on grammatical information are avoided,
since plurality and case suffixation can be obtained from a morph analyzer. This brings
the number of tags down, and helps achieve simplicity, consistency and better machine
learning. Therefore, we use only two tags, common noun <NN> and common
compound noun <NNC>, for nouns, without getting into any distinction based on the
grammatical information contained in a given noun word.
Example for Common Nouns (NN)
paRavai <NN>
‘bird’
paRavaigaL <NN>
‘birds’
paRavaikku <NN>
‘to bird’
paRavaiyAl <NN>
‘by bird’
Example for Compound Nouns (NNC)
UrAcchi <NNC> thalaivar <NNC>
‘Township leader’
Vanap <NNC> pakuthi <NNC>
‘forest area’
Proper Nouns
Proper nouns are words which denote a particular person, place, or thing.
Indian languages, unlike English, do not have any specific marker for proper nouns in
their orthographic conventions: English proper nouns begin with a capital letter, which
distinguishes them from common nouns. All the words which occur as proper nouns
in Indian languages can also occur as common nouns denoting a lexical meaning.
For example, in English, John, Harry and Mary occur only as proper nouns, whereas in
Tamil, thAmari, maNi, pissai, arasi etc. are used as 'names' while also belonging to
grammatical categories of words with various senses. For example, given below is a
list of Tamil words with their grammatical class and sense.
thAmari   noun   lotus
maNi      noun   bell
pissai    noun   beg
arasi     noun   queen
Any of the above words can occur in texts as common nouns or as proper
names. Therefore, we use only two tags, proper noun <NNP> and compound proper
noun <NNPC>.
Example for Proper Nouns (NNP)
raja <NNP> wERRu vawthAn.
'Raja came yesterday.'
Example for Compound Proper Nouns (NNPC)
Abdthul <NNPC> kalAm <NNPC> inRu cennai varugiRAr.
'Abdul Kalam is coming to Chennai today'
Cardinals
Any word denoting a cardinal number will be tagged as <CRD>.
Example for Cardinals <CRD>
enakku 150 <CRD> rupAy vENtum.
'I need 150 rupees.'
mUnRu <CRD> wapargaL anggu amarwthiruggiRArgaL.
'Three people were sitting there'
Ordinals
Ordinals are formed by adding the suffixes Am and Avathu. Expressions
denoting ordinals will be marked as <ORD>.
Example for Ordinals <ORD>
muthalAm <ORD> vaguppu.
“first class”
12-Am <ORD> wURRANdu
“12th century”
3.4.2) PRONOUNS
Pronouns are words that take the place of nouns; we use a pronoun in
place of a noun so that we do not have to repeat the noun. Linguistically, a pronoun
is a variable, and functionally it is a noun, so the tags for pronouns will be helpful for
anaphora resolution. We have used three tags for pronouns: <PRP> for personal
pronouns, <PRIN> for interrogative pronouns and <PRID> for indefinite pronouns.
Example for personal Pronouns (PRP)
avan <PRP> weRRu inggu vawthAn.
‘He came here yesterday’
Example for interrogative pronoun (PRIN)
evan <PRIN> sonnAn.
'which one (male) said'
Example for indefinite pronoun (PRID)
yArrO <PRID> sonnArgaL
‘someone said’
3.4.3) ADJECTIVE tags
Adjectives are noun modifiers. We have simple and derived adjectives in
modern Tamil. We use the tag <ADJ> for adjectives.
Example for ADJECTIVES <ADJ>
iwtha walla <ADJ> paiyan
‘this nice boy’
oru azagAna <ADJ> pen
‘A beautiful girl’
3.4.4) ADVERB tags
Adverbs are words which tell us more about verbs. We have simple and
derived adverbs in modern Tamil. We use the tag <ADV> for adverbs.
Example for ADVERBS <ADV>
Avan atiggati <ADV> vidumuRai etuththAn.
‘He took leave frequently’
kuthirai vEgamAga <ADV> Odiyathu.
'The horse ran fast'
3.4.5) VERB tags
Verbs are action words, which can take tense suffixes,
person-number-gender suffixes, and a few other verbal suffixes. Tamil verb forms can
be divided into finite and non-finite verbs.
FINITE VERB
Finite verbs, as the predicate of the main clause, occur at the end of the
sentence.
Example for FINITE VERB <VF>
avan paSyil vawthAn <VF>
‘he came by bus’
NON-FINITE VERBS
Tamil distinguishes four types of non-finite verb forms [26]. They are the
verbal participle <VNAV>, the adjectival participle <VNAJ>, the infinitive <VINT>
and the conditional <CVB>.
Example for verbal participle <VNAV>
wondtu <VNAV> poo
‘become vexed’
Example for adjectival participle <VNAJ>
vawtha <VNAJ> paiyan
‘the boy who came’
Example for infinitive <VINT>
un thalaiyil idii viza <VINT>
‘may thunder fall on your head’
Example for conditional <CVB>
wE wEraththOdu vawthAl <CVB> thAn
‘If you would come in time’
NOMINALIZED VERB FORMS
Nominalized verbal forms are verbal nouns, participle nouns and adjectival
nouns. We use the tag <VBG> for all forms of nominalized verbs.
Example for Nominalized verb forms <VBG>
seithal <VBG> (doing)
seivathu <VBG> (a neuter which is doing)
seithavan <VBG> (a person who did)
Anand enna seyvathu <VBG>?
'What shall (we) do, Anand?'
AUXILIARY VERB
We use the tag <VAX> for auxiliary verbs.
Example for AUXILIARY VERB <VAX>
arun enna seyavENdum <VAX>?
'What shall (we) do, Arun?'
3.4.6) OTHER Tags
POSTPOSITION
All postpositions in Tamil are formally uninflected or inflected noun forms or
non-finite verb forms. We use the tag <PPO> for postpositions.
Example for Postposition <PPO>
avan wAyaip pOl <PPO> kaththinAn.
'He cried like a dog'
CONJUNCTIONS
Coordination in Tamil is mainly realized by the use of noun case markers, some
clitics and a number of verb forms. We use the tag <CNJ> for conjunctions.
Example for Conjunctions <CNJ>
ciRiya AnAl <CNJ> walla pen.
'a small but nice girl'
DETERMINERS
We use the tag <DET> for determiners.
Example for DETERMINERS <DET>
awthath <DET> thittam.
‘that plan’
COMPLEMENTIZER
We use the tag <COM> for complementizers.
Example for COMPLEMENTIZER <COM>
avan vawthAn enru <COM> kELvippattEn.
'I heard that he had come'
EMPHASIS
We use the tag <EMP> for emphasis.
Example for EMPHASIS <EMP>
avan than <EMP> sonnAn.
‘He only said’
ECHO WORDS
We use the tag <ECH> for echo words. In our corpus echo words are very rare,
because about 90% of the words come from the Dinamani newspaper.
Example for Echo words <ECH>
kAppi kEppi <ECH>
'coffee keeffee'
REDUPLICATION WORDS
Reduplication words are words written twice for various purposes, such as
indicating emphasis or deriving a category from another category. We use the tag
<RDW> for reduplication words.
Example for Reduplication words <RDW>
pala pala <RDW> thittam.
'many plans'
QUESTION WORD AND MARK
We use the tags <QW> and <QM> for question words and question marks.
Example for QUESTION WORD AND MARK: <QW> <QM>
avan vawthAnA <QW>? <QM>
'Did he come?'
SYMBOLS
We consider only two symbols in our corpus: the dot <DOT> tag and the comma
<COMM> tag. The <DOT> tag is used to mark sentence separation. The <COMM>
tag is used between multiple nouns and proper nouns.
Example for <DOT>
avan than sonnAn . <DOT>
'He only said.'
Example for <COMM>
wakarAj , <COMM> arunudan vawthAn.
'Nagaraj came with Arun.'
4) DEVELOPMENT OF TAGGED CORPUS
4.1) Introduction
Corpus linguistics seeks to further our understanding of language through the
analysis of large quantities of naturally occurring data. Text corpora are used in a
number of different ways. Traditionally corpora have been used for the study and
analysis of language at different levels of linguistic description. Corpora have been
constructed for the specific purpose of acquiring knowledge for information
extraction systems, knowledge-based systems and e-business systems [27]. Corpora
have been used for studying child language development. Speech corpora play a vital
role in the specification, design and implementation of telephonic communication and
for the broadcast media.
There is a long tradition of corpus linguistic studies in Europe. The need for
a corpus for a language is multifarious: starting from the preparation of a dictionary or
lexicon up to machine translation, the corpus has become an inevitable resource for the
technological development of languages. A corpus is a huge body of text
incorporating various types of textual material, including newspapers, weeklies,
fiction, scientific writings, literary writings, and so on. A corpus represents all the
styles of a language. A corpus must be very large in size, as it is going to be used for
many language applications such as the preparation of lexicons of different sizes,
purposes and types, machine translation programs and so on.
4.1.1) Tagged Corpus, Parallel Corpus, and Aligned Corpus
Corpora can be distinguished as tagged corpora, parallel corpora and aligned
corpora. A tagged corpus is one which is tagged for part of speech [27]. A parallel
corpus contains texts and translations in each of the languages involved in it; it allows
wider scope for double-checking of the translation equivalents. An aligned corpus is a
kind of bilingual corpus where text samples of one language and their translations
into another language are aligned, sentence by sentence, phrase by phrase, word by
word, or even character by character.
4.1.2) CIIL Corpus for Tamil
As far as building corpora for the Indian languages is concerned, it was the Central
Institute of Indian Languages (CIIL) which took the initiative and started preparing
corpora for some of the Indian languages (Tamil, Telugu, Kannada, and Malayalam).
The Department of Electronics (DOE) financed the corpus-building project. The target
was to prepare a corpus of ten million words for each language [27], but due to
financial crunch and time restrictions it ended up with three million words per
language. The Tamil corpus of three million words was built by CIIL in this way. It is
a partially tagged corpus.
4.1.3) AUKBC-RC's Improved Tagged Corpus for Tamil
The AUKBC Research Centre, which has taken up NLP-oriented work for Tamil,
has improved upon the CIIL Tamil corpus and tagged it for its MT programs. It
also developed parallel corpora for English-Tamil to promote its goal of preparing an
MT tool for English-Tamil translation. A parallel corpus is very useful for training
and for building example-based machine translation, and is thus a useful resource
for MT programs.
4.2) Developing a new tagged corpus
4.2.1) Untagged and tagged corpus
An untagged or unannotated corpus provides limited information to its users.
A corpus can be augmented with additional information by labeling the
morphemes, words, phrases and sentences for their grammatical values. Such information
helps the user to retrieve information selectively and easily.
The frequency of a lemma is useful in the analysis of a corpus. When the
frequency of a particular word is compared to other context words, we can find
whether the word is common or rare. The frequencies are relatively reliable for the
most common words in a corpus, but to analyze the senses and association patterns of
words we need a very large number of occurrences. With a very large corpus
containing many different texts, a wider range of topics is represented, so that
the frequencies of words are less influenced by individual texts. Frequency lists based
on an untagged corpus are limited in usefulness, because they do not tell us which
grammatical uses are common or rare. A tagged corpus is an important dataset for NLP
applications.
Figure: 4.1 Example of Untagged Corpus
Figure: 4.2 Example of tagged Corpus
4.2.2) TAGGED CORPUS DEVELOPMENT
A tagged corpus is the immediate requirement for different analyses
in the field of Natural Language Processing. Most language processing work
needs such a large database of texts, which provides real, natural, native
language of varying types. Annotation of corpora can be done at various levels, viz.
part of speech, phrase/clause level, dependency level, etc. Part of speech tagging
forms the basic step towards building an annotated corpus; chunking can form the
next level of tagging.
For creating a Tamil part of speech tagger we need a grammatically
tagged corpus, so we set up a tagged corpus of 225,000 words. We collected
sentences from the Dinamani newspaper, Yahoo Tamil news, Tamil short stories etc.
and tagged the words. Tagged corpora vary with respect to the amount of information
that they include about words.
We have done corpus tagging in three stages:
1. Pre-editing,
2. Manual Tagging,
3. Tagging using SVMTagger
In pre-editing, the untagged corpus is converted to a suitable format for
SVMTool, so that a part of speech tag can be assigned to each word. Because of
orthographic similarity one word may have several possible POS tags. After the initial
assignment of possible POS tags, words are manually tagged using our AMRITA
tagset. The tagged corpus is trained using the SVMTlearn component. After training,
new untagged corpus is tagged using SVMTagger. The output of SVMTagger is
again manually corrected and added to the tagged corpus to increase the corpus
size. Here we discuss how we created the tagged corpus.
Pre-editing
First we download untagged sentences from the Dinamani newspaper website.
Then we clean the corpus using a PERL program, i.e. we remove punctuation except
dots, commas and question marks. After this, the next step is to change the corpus
into column format, because the SVMTool training data must be in column format,
i.e. a token-per-line corpus in a sentence-by-sentence fashion. This is also done with
a PERL program. The column separator is the blank space.
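A minimal sketch of such a pre-editing program is given below. It is illustrative
rather than the exact script we used, and it assumes the raw text arrives one
sentence per line on standard input.

#!/usr/bin/perl
# Illustrative pre-editing sketch: keep only dots, commas and question
# marks, and write the corpus one token per line, as SVMTool expects.
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    $line =~ s/([.,?])/ $1 /g;        # detach the retained punctuation
    $line =~ s/[!;:"'()\[\]{}]/ /g;   # remove all other punctuation
    # one token per line; the blank space is the column separator
    print "$_\n" for grep { length } split /\s+/, $line;
}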
Figure: 4.3 Untagged corpus before pre-editing.
Figure: 4.4 Untagged corpus after pre-editing.
Manual Tagging
After pre-editing we get an untagged corpus in token-per-line fashion
(Figure 4.4). In the second stage we tagged this untagged corpus manually using the
AMRITA tagset. First we tagged nearly 10,000 words manually. Here we faced many
difficulties in assigning tags to the corpus. After discussing with various Tamil
linguists, we assigned a POS tag to each and every word in the corpus.
Tagging using SVMTagger
After completing the manual tagging, we had a tagged corpus of
10,000 words. In this stage we trained on this corpus with the SVMTlearn component
of SVMTool; after that we tagged the cleaned untagged corpus using the SVMTagger
component. The output of this component is a tagged corpus with some errors. We then
corrected the tags manually. After correcting the tags we added the newly tagged
corpus to the training corpus to increase the size of the training corpus.
4.3) APPLICATIONS OF TAGGED CORPUS
• Part of Speech tagging
• Computer lexicography
• Information extraction
• Statistical training of language models
• Machine translation using multilingual corpora
• Text checkers for evaluating spelling and grammar
• Educational applications like Computer-Assisted Language Learning
Multilingual corpora with parallel texts, which translate one language into another,
are used for the purposes of studying contrastive analysis and translation [28].
4.4) DETAILS OF OUR TAGGED CORPUS
We have a tagged corpus of 225,185 words and our tagset size is 32.
Table 4.1 shows the count of each tag in our corpus.
S.no   Tag      Count
1      <ADJ>    7457
2      <ADV>    9493
3      <CNJ>    2937
4      <COM>    3295
5      <COMM>   1695
6      <CRD>    5400
7      <CVB>    992
8      <DET>    2729
9      <DOT>    16649
10     <ECH>    1
11     <EMP>    1190
12     <INT>    981
13     <NN>     56174
14     <NNC>    31532
15     <NNP>    14826
16     <NNPC>   3875
17     <NNQ>    819
18     <ORD>    760
19     <PPO>    3200
20     <PRID>   171
21     <PRIN>   338
22     <PRP>    6668
23     <QM>     692
24     <QTF>    395
25     <QW>     1024
26     <RDW>    117
27     <VAX>    3340
28     <VBG>    3983
29     <VF>     14969
30     <VINT>   4973
31     <VNAJ>   7647
32     <VNAV>   8859

Table 4.1: Tag Counts
5) IMPLEMENTATION OF SVMTOOL FOR TAMIL
5.1) INTRODUCTION
This chapter presents the SVMTool, a simple, flexible, and effective generator
of sequential taggers based on Support Vector Machines, and how it is applied
to the problem of part-of-speech tagging. This SVM-based tagger is robust and
flexible for feature modeling (including lexicalization), trains efficiently with
almost no parameters to tune, and is able to tag thousands of words per second,
which makes it really practical for real NLP applications [30]. Regarding accuracy,
the SVM-based tagger significantly outperforms the TnT tagger under exactly the
same conditions, and achieves a very competitive accuracy of 94.2% for Tamil.
Generally, tagging is required to be as accurate as possible, and as efficient as
possible. But, certainly, there is a trade-off between these two desirable properties.
This is so because obtaining a higher accuracy relies on processing more and more
information, digging deeper and deeper into it. However, sometimes, depending on
the kind of application, a loss in efficiency may be acceptable in order to obtain
more precise results. Or the other way around, a slight loss in accuracy may be
tolerated in favour of tagging speed.
Moreover, some languages have a richer morphology than others, requiring the
tagger to take into account a bigger set of feature patterns. Also, the tagset size and
ambiguity rate may vary from language to language and from problem to problem.
Besides, if few data are available for training, the proportion of unknown words
may be huge. Sometimes, morphological analyzers can be utilized to reduce the
degree of ambiguity when facing unknown words. Thus, a sequential tagger should
be flexible with respect to the amount of information utilized and the context shape.
Another very interesting property for sequential taggers is their portability.
Multilingual information is a key ingredient in NLP tasks such as Machine
Translation, Information Retrieval, Information Extraction, Question Answering
and Word Sense Disambiguation, just to name a few. Therefore, having a tagger
that works equally well for several languages is crucial for the system robustness.
Besides, quite often for some languages, but also in general, lexical resources are
hard to obtain. Therefore, ideally a tagger should be capable of learning with fewer
(or even no) annotated data. The SVMTool is intended to comply with all the
requirements of modern NLP technology, by combining simplicity, flexibility,
robustness, portability and efficiency with state–of–the–art accuracy. This is
achieved by working in the Support Vector Machines (SVM) learning framework,
and by offering NLP researchers a highly customizable sequential tagger generator.
Here this tool is applied to POS tagging of Tamil.
5.2) PROPERTIES OF THE SVMTOOL
The following are the properties of the SVMTool:
Simplicity: The SVMTool is easy to configure and to train. The learning is
controlled by means of a very simple configuration file. There are very few
parameters to tune. And the tagger itself is very easy to use, accepting standard
input and output pipelining. Embedded usage is also supplied by means of the
SVMTool API.
Flexibility: The size and shape of the feature context can be adjusted. Also, rich
features can be defined, including word and POS (tag) n-grams as well as ambiguity
classes and “may be’s”, apart from lexicalized features for unknown words and
sentence general information. The behavior at tagging time is also very flexible,
allowing different strategies.
Robustness: The overfitting problem is well addressed by tuning the C parameter
in the soft-margin version of the SVM learning algorithm. Also, a sentence-level
analysis may be performed in order to maximize the sentence score. And, so that
unknown words do not punish the system effectiveness too severely, several
strategies have been implemented and tested.
Portability: The SVMTool is language independent. It has been successfully
applied to English and Spanish without a priori knowledge other than a supervised
corpus. Moreover, thinking of languages for which labeled data is a scarce resource,
the SVMTool may also learn from unsupervised data based on the role of
non-ambiguous words, with the only additional help of a morpho-syntactic dictionary.
Accuracy: Compared to state-of-the-art POS taggers reported to date, it
exhibits a very competitive accuracy (97.2% for English on the WSJ corpus).
Clearly, rich sets of features allow most of the information involved to be modeled
very precisely. Also, the learning paradigm, SVM, is very suitable for working
accurately and efficiently with high-dimensionality feature spaces.
Efficiency: Performance at tagging time depends on the feature set size and the
tagging scheme selected. For the default (one-pass, left-to-right, greedy) tagging
scheme, it exhibits a tagging speed of 1,500 words/second, whereas the C++ version
achieves a tagging speed of over 10,000 words/second. This has been achieved by
working in the primal formulation of the SVM. The use of linear kernels causes the
tagger to perform more efficiently both at tagging and learning time, but forces the
user to define a richer feature space. However, the learning time remains linear with
respect to the number of training examples.
5.3) THE THEORY OF SUPPORT VECTOR MACHINES
SVM is a machine learning algorithm for binary classification which has been
successfully applied to a number of practical problems, including NLP. Let
{(x_1, y_1), ..., (x_N, y_N)} be the set of N training examples, where each instance
x_i is a vector in R^N and y_i ∈ {−1, +1} is the class label. In their basic form, SVMs
learn a linear hyperplane [30] that separates the set of positive examples from the set of
negative examples with maximal margin (the margin is defined as the distance from
the hyperplane to the nearest of the positive and negative examples). This learning
bias has proved to have good properties in terms of generalization bounds for the
induced classifiers.
The linear separator is defined by two elements: a weight vector w (with one
component for each feature), and a bias b which stands for the distance of the
hyperplane to the origin. The classification rule of an SVM is:

sign(f(x, w, b))                              (5.1)

f(x, w, b) = 〈w · x〉 + b                      (5.2)

where x is the example to be classified. In the linearly separable case, learning the
maximal margin hyperplane (w, b) can be stated as a convex quadratic optimization
problem with a unique solution: minimize ||w||, subject to the constraints (one for
each training example):

y_i (〈w · x_i〉 + b) ≥ 1                       (5.3)
See an example of a 2-dimensional SVM in Figure 5.1.

Figure 5.1: SVM example: hard margin
Figure 5.2: SVM example: soft margin maximization
The SVM model has an equivalent dual formulation, characterized by a weight
vector α and a bias b. In this case, α contains one weight for each training vector,
indicating the importance of this vector in the solution. Vectors with non-null
weights are called support vectors. The dual classification rule is:

f(x, α, b) = Σ_{i=1..N} y_i α_i 〈x_i · x〉 + b        (5.4)

The α vector can also be calculated as a quadratic optimization problem. Given the
optimal α* vector of the dual quadratic optimization problem, the weight vector
w* that realizes the maximal margin hyperplane is calculated as:

w* = Σ_{i=1..N} y_i α_i* x_i                          (5.5)

b* also has a simple expression in terms of w* and the training examples
{(x_i, y_i)}, i = 1..N.
The advantage of the dual formulation is that it permits an efficient learning of
non-linear SVM separators, by introducing kernel functions. Mathematically, a kernel
function calculates a dot product between two vectors that have been (non-linearly)
mapped into a high-dimensional feature space. Since there is no need to perform
this mapping explicitly, the training is still feasible although the dimension of the
real feature space can be very high or even infinite.
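As a standard illustration (general SVM background rather than anything specific
to SVMTool), the polynomial kernel of degree d is

K(x, z) = \left( \langle x \cdot z \rangle + 1 \right)^d

For d = 2 this implicitly accounts for all products of feature pairs, which is why
explicit n-gram features have to be added to the feature set when a linear kernel is
preferred for efficiency (see Section 5.4.2).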
In the presence of outliers and wrongly classified training examples it may be useful
to allow some training errors in order to avoid overfitting. This is achieved by a
variant of the optimization problem, referred to as soft margin, in which the
contribution to the objective function of margin maximization and of training errors
can be balanced through the use of a parameter called C, as shown in Figure 5.2.
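In its standard form (stated here as common SVM background rather than a
formulation quoted from the SVMTool documentation), the soft-margin problem
introduces a slack variable \xi_i per training example:

\min_{w,\,b,\,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0

A large C penalizes training errors heavily, approaching the hard-margin constraints
of (5.3), while a small C tolerates more misclassified examples in exchange for a
wider margin; this is exactly the parameter tuned through the CK and CU options of
SVMTlearn described in Section 5.5.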
5.4) PROBLEM SETTING
Here we describe the collection of training examples and the feature codification.
5.4.1) Binarizing the Classification Problem
Tagging a word in context is a multi-class classification problem. Since SVMs are
binary classifiers, a binarization of the problem must be performed before applying
them. Here a simple one-per-class binarization is applied, i.e., an SVM is trained for
every POS tag in order to distinguish between examples of this class and all the
rest. When tagging a word, the most confident tag according to the predictions of all
binary SVMs is selected. However, not all training examples are considered
for all classes. Instead, a dictionary is extracted from the training corpus with all
possible tags for each word, and when considering the occurrence of a training
word w tagged as t_i, this example is used as a positive example for class t_i and a
negative example for all other classes t_j appearing as possible tags for w in the
dictionary. In this way, the generation of excessive (and irrelevant) negative
examples is avoided, and the training step is made faster.
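The following small sketch illustrates this scheme with a toy dictionary; it is
hypothetical illustration code, not SVMTool's internal implementation.

#!/usr/bin/perl
# One-per-class binarization sketch with a toy dictionary.
use strict;
use warnings;

my %dict = (                    # possible tags per word, from the corpus
    'vawthAn' => ['VF'],
    'varum'   => ['VNAJ', 'VF'],
);

# An occurrence of $word tagged $gold becomes a positive example for
# $gold's classifier and a negative one only for the word's other tags.
sub binarize {
    my ($word, $gold) = @_;
    return map { [$_, $_ eq $gold ? +1 : -1] } @{ $dict{$word} };
}

# 'varum' tagged VNAJ: positive for the VNAJ SVM, negative for the VF SVM
for my $ex (binarize('varum', 'VNAJ')) {
    printf "classifier %-4s label %+d\n", @{$ex};
}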
5.4.2) Feature Codification
Each example (event) has been represented using the local context of the word for
which the system will determine a tag (output decision). This local context and
local information like capitalization and affixes of the current token will help the
system make a decision even if the token has not been encountered during training.
A centered window of seven tokens is considered, in which some basic and n-gram
patterns are evaluated to form binary features such as "previous word is
awtha (that)", "two preceding tags are DET NN", etc. Table 5.1 contains the list of
all patterns considered. As can be seen, the tagger is lexicalized and all word
forms appearing in the window are taken into account. Since a very simple
left-to-right tagging scheme is used, the tags of the following words are not known at
running time.
Following the approach of the memory-based tagger, the more general ambiguity-class
tag is used for the right-context words; this is a label composed of the
concatenation of all possible tags for the word (e.g., VINT-VAX, ADJ-NN, etc.).
Each of the individual tags of an ambiguity class is also taken as a binary feature of
the form "following word may be a NN". Therefore, with ambiguity classes and
"maybe's", a two-pass solution is avoided, in which an initial first-pass tagging
would be performed in order to have right contexts disambiguated for the second pass.
Explicit n-gram features are not necessary in the SVM approach, because
polynomial kernels account for the combination of features. However, since we are
interested in working with a linear kernel, these are included in the feature set.
Additional features have been used to deal with the problem of unknown words.
Features appearing fewer times than a certain count cut-off may be
ignored for the sake of robustness.
Word features         w-3, w-2, w-1, w0, w1, w2, w3
POS features          p-3, p-2, p-1, p0, p1, p2, p3
Ambiguity classes     a0, a1, a2, a3
May_be's              m0, m1, m2, m3
Word bigrams          (w-2, w-1), (w-1, w+1), (w-1, w0), (w0, w+1), (w+1, w+2)
POS bigrams           (p-2, p-1), (p-1, a+1), (a+1, a+2)
Word trigrams         (w-2, w-1, w0), (w-2, w-1, w+1), (w-1, w0, w+1),
                      (w-1, w+1, w+2), (w0, w+1, w+2)
POS trigrams          (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1),
                      (p-1, a+1, a+2)
Sentence_info         punctuation ('.', '?', '!')
Prefixes              s1, s1s2, s1s2s3, s1s2s3s4
Suffixes              sn, sn-1sn, sn-2sn-1sn, sn-3sn-2sn-1sn
Binary word features  initial upper case, all upper case, no initial capital
                      letter(s), all lower case, contains a (period/number/hyphen...)
Word length           integer

Table 5.1: Rich Feature Pattern Set
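As a small illustration of how such binary features materialize (an illustrative
sketch, not SVMTool's internal code), consider extracting a few word-window,
bigram and suffix features for one focus token:

#!/usr/bin/perl
# Illustrative extraction of a few feature types from Table 5.1
# for the focus word w0 of one tokenized sentence.
use strict;
use warnings;

my @w = qw(avan weRRu inggu vawthAn .);   # example sentence from Chapter 3
my $i = 3;                                # focus position: 'vawthAn'

my @features;
for my $off (-2 .. 2) {                   # word unigrams w-2 .. w+2
    my $j = $i + $off;
    push @features, "w$off=$w[$j]" if $j >= 0 && $j <= $#w;
}
push @features, "w-1,w0=$w[$i-1],$w[$i]"; # one word bigram
for my $k (1 .. 4) {                      # suffix features z(1)..z(4)
    push @features, "z$k=" . substr($w[$i], -$k) if length($w[$i]) >= $k;
}
print "$_\n" for @features;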
5.5) SVMTOOL COMPONENTS AND IMPLEMENTATIONS
The SVMTool software package consists of three main components, namely the
model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval).
Prior to tagging, SVM models (weight vectors and biases) are learned from
a training corpus using the SVMTlearn component. Different models are learned for
the different strategies. Then, at tagging time, using the SVMTagger component,
one may choose the tagging strategy that is most suitable for the purpose of the
tagging.
Finally, given a correctly annotated corpus and the corresponding SVMTool
predicted annotation, the SVMTeval component displays the tagging results.
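A typical run of the three components, using the model name from our
configuration file, therefore looks roughly as follows; the file names are
illustrative and the exact command-line options may differ between SVMTool
versions.

# learn the models described in the configuration file
SVMTlearn config.svmt
# tag a one-token-per-line file with the learned Tamil model
SVMTagger TAMIL_2L < test.untagged > test.tagged
# compare the predicted annotation against the gold standard
SVMTeval TAMIL_2L test.gold test.tagged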
5.5.1) SVMTLEARN
Given a training set of examples (either annotated or unannotated), this component
is responsible for the training of a set of SVM classifiers. To do that, it makes use
of SVMlight, an implementation of Vapnik's SVMs in C developed by Thorsten
Joachims; this software has been used to train the models.
5.5.1.1) Training Data Format
Training data must be in column format, i.e. a token-per-line corpus in a sentence-by-
sentence fashion. The column separator is the blank space. The token is expected to
be the first column of the line, and the tag to predict takes the second column in the
output; the rest of the line may contain additional information. See the example in
Figure 5.3.
Figure 5.3. Training data format
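For instance, a fragment of Tamil training data in this format could look as follows
(the sentence is reused from Chapter 3; the tags given here to the two adverbs are
our own illustration):

avan PRP
weRRu ADV
inggu ADV
vawthAn VF
. DOT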
No special '<EOS>' mark is employed for sentence separation; sentence
punctuation is used instead, i.e. [.!?] symbols are taken as unambiguous sentence
separators. Here, however, only the [.?] symbols are used, and therefore these symbols
are taken as the sentence separators.
5.5.1.2) Options
SVMTlearn behavior is easily adjusted through a configuration file.
Usage: SVMTlearn [options] <config-file>
options:
- V verbose 0: none verbose
1: low verbose [default]
2: medium verbose
3: high verbose
Example: SVMTlearn -V 2 config.svmt
These are the currently available config-file options:
Sliding window: The size of the sliding window for feature extraction can be
adjusted. Also, the core position, in which the word to disambiguate is located,
may be selected. By default, the window size is 5 and the core position is 2,
starting at 0. Here we took the default window size of 5.
Feature set: Three different kinds of feature types can be collected from the sliding
window:
– Word features: word form n-grams. Usually unigrams, bigrams and trigrams
suffice. Also, the sentence's last word, which corresponds to a punctuation mark ('.',
'?', '!'), is important.
– POS features: annotated parts-of-speech and ambiguity-class n-grams, and
"may be's". As for words, considering unigrams, bigrams and trigrams is enough.
The ambiguity class for a certain word determines which POS are possible. A "may
be" states, for a certain word, that a certain POS may be possible, i.e. that it belongs
to the word's ambiguity class.
– Lexicalized features: prefixes and suffixes, capitalization, hyphenization, and
similar information related to a word form. In Tamil, capitalization is not used; all
the other features are accepted.
Default feature sets are defined for every model.
Feature filtering: The feature space can be kept to a convenient size. Smaller
models allow for higher efficiency. By default, no more than 100,000 dimensions
are used. Also, features appearing fewer than n times can be discarded, which indeed
causes the system both to fight against overfitting and to exhibit a higher accuracy.
By default, features appearing just once are ignored.
SVM model compression: Weight vector components lower than a given
threshold can be filtered out of the resulting SVM models, thus enhancing
efficiency by decreasing the model size while still preserving the accuracy level. That
is an interesting behavior of SVM models currently under study. In fact, when
discarding up to 70% of the weight components, accuracy remains stable, and it is
not until 95% of the components are discarded that accuracy falls below the current
state-of-the-art (97.0% - 97.2%).
C parameter tuning: In order to deal with noise and outliers in the training data, the
soft-margin version of the SVM learning algorithm allows the misclassification of
certain training examples when maximizing the margin. This balance can be
automatically adjusted by optimizing the value of the C parameter of the SVMs. A
local maximum is found by exploring the accuracy on a validation set for different C
values at shorter intervals.
Dictionary repairing: The lexicon extracted from the training corpus can be
automatically repaired either based on frequency heuristics or on a list of
corrections supplied by the user. This makes the tagger robust to corpus errors. A
heuristic threshold may also be specified in order to consider as tagging errors those
(word_x, tag_y) pairs occurring less than a certain proportion of times with respect to
the number of occurrences of word_x. For example, a threshold of 0.001 would
consider (run, DT) an error if the word run had been seen at least 1000 times and
only once tagged as a 'DT'. This kind of heuristic dictionary repairing does not
harm the tagger's performance; on the contrary, it may help a lot.
The repairing list must comply with the SVMTool dictionary format, i.e.
<word> <N occurrences> <N possible tags>, followed by one <tag(i)> <N occurrences(i)>
pair for each possible tag.
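For instance, a hypothetical entry for a word seen 1200 times, 1195 times as <DET>
and 5 times as <NN>, would read:

awtha 1200 2 DET 1195 NN 5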
Ambiguous classes: The list of POS tags presenting ambiguity is, by default,
automatically extracted from the corpus but, if available, this knowledge can be
made explicit. This acts in favor of the system's robustness.
Open classes: The list of POS tags an unknown word may be labeled with is also, by
default, automatically determined.
Backup lexicon: A morphological lexicon containing words that are not present in
the training corpus may be provided. It can also be provided at tagging time. This
file must comply with the SVMTool dictionary format.
5.5.1.3) Configuration File
Several arguments are mandatory (as shown in Table 5.2); the rest are optional (as
shown in Table 5.4). Lists of features are defined in the SVMTool feature language
(svmtfl), as shown in Table 5.5. Lines beginning with '#' are ignored. The list of
action items for the learner must be declared (as shown in Table 5.3):
NAME      name of the model to create (a log of the experiment is generated
          in the file "NAME.EXP")
TRAINSET  location of the training set
SVMDIR    location of Joachims' SVMlight software

Table 5.2: SVMTlearn config-file mandatory arguments.
Syntax: do <MODEL> <DIRECTION> [<CK>] [<CU>] [<T>]
where MODEL = [M0|M1|M2|M3|M4]
      DIRECTION = [LR|RL|LRL]
      CK = [CK:<range1>:<range2>:<#iterations>:<#segments_per_iteration>:<log|nolog> | <CK-value>]
      CU = [CU:<range1>:<range2>:<#iterations>:<#segments_per_iteration>:<log|nolog> | <CU-value>]
      T = [T[:<Nfolders>]]

MODEL      model type
DIRECTION  model direction
CK         known word C parameter tuning options (optional)
CU         unknown word C parameter tuning options (optional)
T          test options (optional)

Table 5.3: SVMTlearn config-file action arguments.
Here is an example of a valid config-file:
# -------------------------------------------------------------
# SVMTool configuration file for Tamil
# -------------------------------------------------------------
# prefix of the model files which will be created
NAME = TAMIL_2L
# -------------------------------------------------------------
# location of the training set
TRAINSET = /media/disk/Anand/SVM/SVMTool-1.3/bin/TAMIL_2L.TRAIN
# -------------------------------------------------------------
# location of the Joachims SVMlight software
SVMDIR = /media/disk/Anand/SVM/svm_light
# -------------------------------------------------------------
# action items
do M0 LR
SET: location of the whole set
VALSET: location of the validation set
TESTSET: location of the test set
TRAINP: proportion of sentences belonging to the provided whole SET which will be used for training
VALP: proportion of sentences belonging to the provided whole SET which will be used for validation
TESTP: proportion of sentences belonging to the provided whole SET which will be used for test
REMOVE_FILES: remove intermediate files?
REMAKE_FOLDERS: remake cross-validation folders?
Kfilter: weight filtering for known word models
Ufilter: weight filtering for unknown word models
R: dictionary repairing list (heuristically repaired by default)
D: dictionary repairing heuristic threshold (0.001 by default)
BLEX: backup lexicon
LEX: lexicon for unsupervised learning (Model 3)
W: window definition (size, core position)
F: feature filtering (count cut-off, max mapping size)
CK: C parameter for known words, all models (0 by default)
CU: C parameter for unknown words, all models (0 by default)
X: percentage of unknown words expected (3 by default)
AP: list of POS presenting ambiguity (automatically created by default)
UP: list of open classes (automatically created by default)
A0k..A4k: known word feature definition for models 0..4
A0u..A4u: unknown word feature definition for models 0..4
Table 5.4: SVMTlearn config-file optional arguments.
COLUMN n-grams: C(colid; n1,...,ni,...,nm)
WORD n-grams: w(n1,...,ni,...,nm) (equivalent to C(0; n1,...,ni,...,nm))
TAG n-grams: p(n1,...,ni,...,nm) (equivalent to C(1; n1,...,ni,...,nm))
AMBIGUITY CLASSES: k(n)
MAYBE's: m(n), where n is the relative position with respect to the element to disambiguate
CHARACTER A(i): ca(i), where i is the relative position of the character with respect to the beginning of the word
CHARACTER Z(i): cz(i), where i is the relative position of the character with respect to the end of the word
PREFIXES: a(i) = s1s2...si
SUFFIXES: z(i) = s(n-i)...s(n-1)s(n)
sa: does the word start with lower case?
SA: does the word start with upper case?
CA: does the word contain any capital letter?
CAA: does the word contain several capital letters?
aa: are all letters in the word in lower case?
AA: are all letters in the word in upper case?
SN: does the word start with a number?
CP: does the word contain a period?
CN: does the word contain a number?
CC: does the word contain a comma?
MW: does the word contain a hyphen?
L: word length
Sentence info: punctuation ('.', '?', '!')
Table 5.5: SVMTool feature language.
Enriched version of the SVMTool configuration file for Tamil:

# -------------------------------------------------------------
# SVMT configuration file
# -------------- location of the training set -----------------
TRAINSET = /media/disk/Anand/SVM/SVMTool-1.3/bin/TAMIL_CORPUS_2L.TRAIN
# -------------- location of the validation set ---------------
#VALSET = /media/disk/Anand/TAMIL_OL2.TRAIN
# -------------- location of the test set ---------------------
#TESTSET = /media/disk/Anand/TAMIL_OL3.TRAIN
# -------------- location of the Joachims svmlight software ----
SVMDIR = /media/disk/Anand/SVM/svm_light
# -------------- name of the model to create ------------------
NAME = TAMIL_CORPUS_2L
# -------------- dictionary repairing list --------------------
#R = root/Desktop/small/SVMTool-1.3/bin/CC.R
# -------------- window definition (length, core_position) ----
W = 5 2
# -------------- feature filtering (count_cut_off, max_mapping_size)
F = 2 100000
# -------------- default C parameter values -------------------
CK = 0.1086
CU = 0.07975
# -------------- % of unknown words expected (3 by default) ---
X = 10
# -------------- weight filtering for known words -------------
Kfilter = 0
# -------------- weight filtering for unknown words -----------
Ufilter = 0
# -------------- remove intermediate files --------------------
REMOVE_FILES = 1
# -------------- remake cross-validation folders --------------
REMAKE_FOLDERS = 1
# -------------- action items ---------------------------------
# *** train model 0, LR and RL, C parameter tuning, cross-validation 10 folders
do M0 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# *** train model 1, RL, C parameter tuning, no cross-validation, test
#do M1 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# *** train model 2, LR and RL, no C parameter tuning, no cross-validation, test
#do M2 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# *** M3 is currently unavailable
# *** train model 4, LR, C parameter tuning, no cross-validation, no test
#do M4 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# --------------------------------------------------------------
# list of classes (automatically determined by default)
# list of parts-of-speech presenting ambiguity
#AP = '' CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NNP NNPS PDT POS PRP PRP$ RB RBR RBS RP SYM UH VB VBD VBG VBN VBP VBZ WDT WP WRB
# list of open-classes
#UP = FW JJ JJR JJS NN NNS NNP NNPS RB RBR RBS VB VBD VBG VBN VBP VBZ
# --------------------------------------------------------------
# ambiguous-right [default]
A0k = C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1) C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1) C(1;0) k(0) k(1) k(2) m(0) m(1) m(2)
A0u = C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1) C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1) C(1;0) k(0) k(1) k(2) m(0) m(1) m(2) a(2) a(3) a(4) a(5) a(6) a(7) a(8) a(9) a(10) a(11) a(12) a(13) a(14) a(15) a(16) a(17) a(18) a(19) a(20) z(2) z(3) z(4) z(5) z(6) z(7) z(8) z(9) z(10) z(11) z(12) z(13) z(14) z(15) z(16) z(17) z(18) z(19) z(20) ca(1) cz(1) L SN CP CN
# --------------------------------------------------------------
In this case the model NAME is 'TAMIL_CORPUS_2L', so the model files will begin with this
prefix. Only the training set is specified. A window of 5 elements, with the core in
the third position, is defined for feature extraction. The expected proportion of
unknown words is 10%. Intermediate files will be removed. The list of parts-of-speech
presenting ambiguity and the list of open-classes are not provided in this
config-file, so the tool will determine them automatically.
This config-file is designed to learn Model 0 on our Tamil Corpus. That would
allow for the use of tagging strategies 0 and 5, only left-to-right though. Instead of
using default feature sets, two feature sets are defined for Model 0 (for the two
distinct problems of known word and unknown word tag guessing).
5.5.1.4 C Parameter Tuning
C parameter tuning is optional. Either no C parameter is specified (C = 0 by
default), or a fixed value is given (e.g., CK:0.1 CU:0.01), or an automatic tuning
by greedy exploration is performed. In the latter case a validation set must be
provided, together with the interval to be explored and how to explore it (i.e.,
the number of iterations and the number of segments per iteration). Moreover, the
first iteration can take place in a logarithmic fashion [30]. For example,
CK:0.01:10:3:10:log would try these values for C at the first iteration: 0.01, 0.1, 1, 10.
For the next iteration, the algorithm explores on both sides of the point where the
maximal accuracy was obtained, half way to the next and previous points. For example,
suppose the maximal accuracy was obtained for C = 0.1; then it would explore the
range from 0.1 / 2 = 0.05 to 1 / 2 = 0.5. The segmentation ratio would be 0.045, so
the algorithm would go for values 0.05, 0.095, 0.14, 0.185, 0.23, 0.275, 0.32, 0.365,
0.41, 0.455, and 0.5. And so on for the following iteration.
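As an illustration of this exploration step, the following Perl fragment (a minimal sketch, not the actual SVMTlearn code) generates the candidate C values for one iteration, given the bounds and the number of segments:

use strict; use warnings;

# Candidate C values for one tuning iteration: a logarithmic sweep for the
# first iteration, or a linear sweep over [$lo, $hi] with $segments steps.
sub c_grid {
    my ($lo, $hi, $segments, $log) = @_;
    my @values;
    if ($log) {
        my $c = $lo;
        while ($c <= $hi * 1.000001) {   # tolerance for floating-point drift
            push @values, $c;
            $c *= 10;
        }
    } else {
        my $step = ($hi - $lo) / $segments;
        push @values, $lo + $_ * $step for 0 .. $segments;
    }
    return @values;
}

print join(' ', map { sprintf '%g', $_ } c_grid(0.01, 10, 0, 1)), "\n";   # 0.01 0.1 1 10
print join(' ', map { sprintf '%g', $_ } c_grid(0.05, 0.5, 10, 0)), "\n"; # 0.05 0.095 ... 0.5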
5.5.1.5 Test
After training, a model can be evaluated against a test set. To do so, the T
option must be activated in the corresponding do-action, e.g., "do M0 LR
CK:0.01:10:3:10:log CU:0.07 T". By default a test set definition is expected in the
config file, but training/test can also be performed through cross-validation. In
that case the number of folders must be provided, e.g., "do M0 LR CK:0.01:10:3:10:log CU:0.07
T:10"; 10 is a good number. Furthermore, if training/test goes in cross-validation,
then the C parameter tuning does too, even if a validation set has been provided.
5.5.1.6 Models
Five different kinds of models have been implemented in this tool. Models 0, 1, and 2
differ only in the features they consider. Model 3 and Model 4 are just like Model 0
with respect to feature extraction, but examples are selected in a different manner.
Model 3 is for unsupervised learning: given an unlabeled corpus and a
dictionary, at learning time it can only count on knowing the ambiguity class, and
the POS information only for unambiguous words. Model 4 achieves robustness by
simulating unknown words in the learning context at training time.
Model 0: This is the default model. The unseen context remains ambiguous. It was
designed with the one-pass on-line tagging scheme in mind, i.e. the tagger goes
either left-to-right or right-to-left making decisions, so past decisions feed future
ones in the form of POS features. At tagging time only the parts-of-speech of
already disambiguated tokens are considered; for the unseen context, ambiguity
classes are considered instead. Features are shown in Table 5.6.
Model 1: This model considers the unseen context already disambiguated in a
previous step, so it is intended to work in a second pass, revisiting and
correcting already tagged text. Features are shown in Table 5.7.
Model 2: This model does not consider POS features at all for the unseen context. It
is designed to work in a first pass, requiring Model 1 to review the tagging results
in a second pass. Features are shown in Table 5.8.
Model 3: Training is based on the role of unambiguous words. Linear
classifiers are trained with examples of unambiguous words extracted from an
unannotated corpus, so less POS information is available. The only additional
information required is a morpho-syntactic dictionary.
Model 4: Errors caused by unknown words at tagging time severely punish the
system. To reduce this problem, during learning some words are artificially
marked as unknown in order to learn a more realistic model. The process is very
simple: the corpus is divided into a number of folders; before starting to extract
samples from each of the folders, a dictionary is generated from the rest of the
folders, so the words appearing in one folder but not in the rest are unknown words to
the learner.
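A minimal Perl sketch of this unknown-word simulation (illustrative only, not the SVMTlearn implementation; it assumes the corpus has already been split into folds held as arrays of words):

use strict; use warnings;

# Illustrative sketch of Model 4 example selection: words occurring in the
# current fold but in none of the other folds are unknown to the learner.
sub unknown_words_in_fold {
    my ($fold_index, @folds) = @_;
    my %seen_elsewhere;
    for my $i (0 .. $#folds) {
        next if $i == $fold_index;
        $seen_elsewhere{$_} = 1 for @{ $folds[$i] };   # dictionary from the rest
    }
    my %unknown;
    for my $w (@{ $folds[$fold_index] }) {
        $unknown{$w} = 1 unless $seen_elsewhere{$w};
    }
    return sort keys %unknown;
}

my @folds = ( [qw(thiddam enRAr muthalvar)], [qw(thiddam kUddukkudiwIr)] );
print join(' ', unknown_words_in_fold(1, @folds)), "\n";   # kUddukkudiwIr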
Ambiguity classes: a0, a1, a2
May_be's: m0, m1, m2
POS features: p-2, p-1
POS bigrams: (p-2, p-1), (p-1, a+1), (a+1, a+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence info: punctuation ('.', '?', '!')
Table 5.6: Model 0. Example of suitable POS features.
Ambiguity classes: a0, a1, a2
May_be's: m0, m1, m2
POS features: p-2, p-1, p+1, p+2
POS bigrams: (p-2, p-1), (p-1, p+1), (p+1, p+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, p+1), (p-1, a0, p+1), (p-1, p+1, p+2)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence info: punctuation ('.', '?', '!')
Table 5.7: Model 1. Example of suitable POS features.
Ambiguity classes: a0
May_be's: m0
POS features: p-2, p-1
POS bigrams: (p-2, p-1)
POS trigrams: (p-2, p-1, a0)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence info: punctuation ('.', '?', '!')
Table 5.8: Model 2. Example of suitable POS features.
5.5.1.7 IMPLEMENTATION OF SVMTlearn FOR TAMIL
SVMTlearn is the first component of SVMTool. It is used to train the SVM models
from a tagged corpus. This component runs on Linux machines only. For training, only
the data set is required. However, if enough data is available, it is good practice
to split it into three working sets (i.e. training, validation and test); that allows
the system to be trained, tuned and evaluated before it is put to use. With little
data there is no need to worry: the system can still be trained, tuned and tested
through cross-validation, which is the approach we follow here.
Figure 5.4: Implementation of SVMTlearn (a tagged Tamil corpus is given to SVMTlearn, which produces a dictionary and merged models of known and unknown word features)
In this component the input is the tagged training corpus. That training corpus is
given to the SVMTlearn component, and we adjust the features in the config file for
the Tamil language. The outputs of SVMTlearn are a dictionary file and, for each
model, merged files for the known and unknown words; each merged file contains all
the features of the known or unknown words.
Example training output of SVMTlearn for Tamil:

----------------------------------------------------------------------
SVMTool v1.3
(C) 2006 TALP RESEARCH CENTER.
Written by Jesus Gimenez and Lluis Marquez.
----------------------------------------------------------------------
TRAINING SET = /media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN
----------------------------------------------------------------------
DICTIONARY <TAMIL_CORPUS.DICT> [31605 words]
**********************************************************************
BUILDING MODELS... [MODE = 0 :: DIRECTION = LR]
**********************************************************************
C-PARAMETER TUNING by 10-fold CROSS-VALIDATION
on </media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN>
on <MODE 0> <DIRECTION LR> [KNOWN]
C-RANGE = [0.01..1] :: [log] :: #LEVELS = 3 :: SEGMENTATION RATIO = 10
======================================================================
LEVEL = 0 :: C-RANGE = [0.01..1] :: FACTOR = [* 10]
======================================================================
******************** level - 0 : ITERATION 0 - C = 0.01 - [M0 :: LR]
TEST ACCURACY: 90.6093%
KNOWN [ 92.886% ] AMBIG.KNOWN [ 83.3052% ] UNKNOWN [ 78.5781% ]
TEST ACCURACY: 90.392%
KNOWN [ 92.6809% ] AMBIG.KNOWN [ 82.838% ] UNKNOWN [ 78.0815% ]
TEST ACCURACY: 90.1015%
KNOWN [ 92.6128% ] AMBIG.KNOWN [ 83.4766% ] UNKNOWN [ 77.5075% ]
TEST ACCURACY: 89.7127%
KNOWN [ 92.0721% ] AMBIG.KNOWN [ 81.8731% ] UNKNOWN [ 77.5281% ]
TEST ACCURACY: 90.7699%
KNOWN [ 92.7304% ] AMBIG.KNOWN [ 83.4785% ] UNKNOWN [ 80.3874% ]
TEST ACCURACY: 89.8988%
KNOWN [ 92.3462% ] AMBIG.KNOWN [ 81.5675% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 90.8836%
KNOWN [ 92.9671% ] AMBIG.KNOWN [ 83.6309% ] UNKNOWN [ 79.5591% ]
TEST ACCURACY: 89.9724%
KNOWN [ 92.4002% ] AMBIG.KNOWN [ 82.1854% ] UNKNOWN [ 77.664% ]
TEST ACCURACY: 90.2643%
KNOWN [ 92.5675% ] AMBIG.KNOWN [ 83.0289% ] UNKNOWN [ 78.0907% ]
TEST ACCURACY: 90.7494%
KNOWN [ 92.7798% ] AMBIG.KNOWN [ 82.7494% ] UNKNOWN [ 79.8929% ]
OVERALL ACCURACY [Ck = 0.01 :: Cu = 0.07975] : 90.33539%
KNOWN [ 92.6043% ] AMBIG.KNOWN [ 82.81335% ] UNKNOWN [ 78.45753% ]
MAX ACCURACY -> 90.33539 :: C-value = 0.01 :: depth = 0 :: iter = 1
******************** level - 0 : ITERATION 1 - C = 0.1 - [M0 :: LR]
TEST ACCURACY: 91.7702%
KNOWN [ 94.2402% ] AMBIG.KNOWN [ 87.5492% ] UNKNOWN [ 78.7175% ]
TEST ACCURACY: 91.8881%
KNOWN [ 94.4737% ] AMBIG.KNOWN [ 88.4324% ] UNKNOWN [ 77.9821% ]
TEST ACCURACY: 91.3219%
KNOWN [ 94.0596% ] AMBIG.KNOWN [ 88.0441% ] UNKNOWN [ 77.5928% ]
TEST ACCURACY: 91.0615%
KNOWN [ 93.6037% ] AMBIG.KNOWN [ 86.6795% ] UNKNOWN [ 77.9326% ]
TEST ACCURACY: 92.0852%
KNOWN [ 94.2575% ] AMBIG.KNOWN [ 88.3275% ] UNKNOWN [ 80.5811% ]
TEST ACCURACY: 91.3927%
KNOWN [ 94.1299% ] AMBIG.KNOWN [ 87.4226% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 91.9891%
KNOWN [ 94.2944% ] AMBIG.KNOWN [ 88.0182% ] UNKNOWN [ 79.4589% ]
TEST ACCURACY: 91.3063%
KNOWN [ 93.9605% ] AMBIG.KNOWN [ 87.1258% ] UNKNOWN [ 77.8502% ]
TEST ACCURACY: 91.3654%
KNOWN [ 93.8499% ] AMBIG.KNOWN [ 87.2127% ] UNKNOWN [ 78.2339% ]
TEST ACCURACY: 91.8693%
KNOWN [ 94.1% ] AMBIG.KNOWN [ 87.0546% ] UNKNOWN [ 79.9416% ]
OVERALL ACCURACY [Ck = 0.1 :: Cu = 0.07975] : 91.60497%
KNOWN [ 94.09694% ] AMBIG.KNOWN [ 87.58666% ] UNKNOWN [ 78.55767% ]
MAX ACCURACY -> 91.60497 :: C-value = 0.1 :: depth = 0 :: iter = 2
5.5.2 SVMTAGGER
Given a text corpus (one token per line) and the path to a previously learned SVM
model (including the automatically generated dictionary), it performs the POS
tagging of a sequence of words. The tagging goes on-line based on a sliding
window which gives a view of the feature context [30] to be considered at every
decision.
In any case, there are two important concepts we must consider:
(1) Example generation
(2) Feature extraction
(1) Example generation: This step defines what an example is, according to
the concept the machine is to learn. For instance, in POS tagging the
machine must learn to correctly classify words according to their POS. Thus, every POS is a
class and, typically, every occurrence of a word generates a positive example for its
class and a negative example for the rest of the classes. Therefore, every sentence may
generate a large number of examples.
(2) Feature extraction: The set of features used by the algorithm has
to be defined. For instance, POS tags could be guessed according to the
preceding and following words. Thus, every example is represented as a set of active
features. These representations are the input for the SVM classifiers. To inspect how
SVMTool works internally, run SVMTlearn (Perl version) with the REMOVE_FILES option
set to 0 in the config file; in this way the intermediate files can be inspected.
Feature extraction is performed by the sliding window object. The whole process takes
place as follows: a sliding window works on a very local context (as defined in the
config file), usually a 5-word context [-2, -1, 0, +1, +2], with the current word
under analysis at the core position. Taking this context into account, a number of
features may be extracted. The feature set depends on how the tagger is going to
proceed later (i.e., the context and information that will be available at
tagging time). Commonly, all words are known before tagging, but the POS is only
available for some words (those already tagged).
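As an illustration, the following Perl fragment is a simplified sketch (not the actual SVMTool code) that extracts a few word and POS features from such a 5-token window; the feature names mimic those of the merged model shown later:

use strict; use warnings;

# Simplified feature extraction over a 5-token window [-2,-1,0,+1,+2].
# $words and $tags are array refs aligned with the sentence; $i is the
# position of the word being disambiguated; tags right of $i are unknown.
sub window_features {
    my ($words, $tags, $i) = @_;
    my @feats;
    for my $off (-2, -1, 1, 2) {                  # word unigram features
        my $j = $i + $off;
        next if $j < 0 or $j > $#$words;
        push @feats, "C0~$off:$words->[$j]";
    }
    push @feats, "C0~0:$words->[$i]";
    for my $off (-2, -1) {                        # POS of already tagged words
        my $j = $i + $off;
        next if $j < 0 or !defined $tags->[$j];
        push @feats, "C1~$off:$tags->[$j]";
    }
    return @feats;
}

my @w = qw(kUddukkudiwIr thiddam wiRaivERRappadum enRAr muthalvar .);
my @t = ('<NNC>', '<NNC>');                       # tags decided so far
print join("\n", window_features(\@w, \@t, 2)), "\n";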
In the tagging stage, if the input word is known and ambiguous, the word is tagged
(i.e., classified), and the predicted tag feeds forward into the next decisions. This
is done in the subroutine "sub classify_sample_merged()" in the file SVMTAGGER.
In order to speed up SVM classification, the SVMTool authors decided to merge the
feature mapping and the SVM weights and biases into a single file. Therefore, when a
new example is to be tagged, the merged model is accessed and, for every active
feature, the associated weight is retrieved. Then, for every possible tag, the bias
is retrieved as well. Finally, the SVM classification rule is applied (i.e., scalar
product plus bias).
AN EXAMPLE
Suppose the example sentence is:
Tamil Unicode form: கூட்டுக்குடிநீர் திட்டம் நிறைவேற்றப்படும் என்றார் முதல்வர் .
Romanized form: kUddukkudiwIr thiddam wiRaivERRappadum enRAr muthalvar .
Tags: <NNC> <NNC> <VNAJ>/<VF> (ambiguity) <VF> <NN> <DOT>
To tag this sentence, first take the active features like w-1 and w+1, i.e., predict
the POS tag based only on the preceding and following words. Here the ambiguous word
is "wiRaivERRappadum"; its correct tag is <VF>.
w0: wiRaivERRappadum
Then, the active features are "w-1: thiddam" and "w+1: enRAr".
Therefore, when applying the SVM classification rule for a given POS, it is
necessary to go to the merged model and retrieve the weights for these features, and
the bias (first line after the header, beginning with "BIASES") corresponding to
the given POS.
For instance, suppose this ".MRG" file:

BIASES <ADJ>:0.37059487 <ADV>:-0.19514606 <CNJ>:0.43007979 <COM>:0.037037037 <CRD>:0.55448766 <CVB>:-0.19911161 <DET>:-1.1815452 <EMP>:-0.86491783 <INT>:0.61775334 <NN>:-0.21980137 <NNC>:1.3656117 <NNP>:0.072242349 <NNPC>:0.7906585 <NNQ>:0.44012828 <ORD>:0.30304924 <PPO>:-0.2182171 <PRI>:0.89491131 <PRID>:-0.15550162 <PRIN>:0.56913633 <PRP>:0.35316978 <QW>:0.039121434 <RDW>:0.84771943 <VAX>:0.041690388 <VBG>:0.23199934 <VF>:0.33486366 <VINT>:0.0048185684 <VNAJ>:0.42063524 <VNAV>:0.18009116
C0~-1:thiddam <CRD>:0.00579042912371902 <NNC>:-0.508699048073652 <NN>:0.532690716551973 <ORD>:-0.000698015879911668 <VBG>:0.142313085089229 <VF>:0.296699729267891 <VNAJ>:-0.32
C0~1:enRAr <VAX>:0.132726597682121 <VF>:0.66667135122578 <VNAJ>:-0.676332541749603
The SVM score for "wiRaivERRappadum" being <VNAJ> is:
weight("w-1: thiddam", "VNAJ") + weight("w+1: enRAr", "VNAJ") - bias("VNAJ")
= (-0.32) + (-0.676332541749603) - (0.42063524) = -1.416967781749603
The SVM score for "wiRaivERRappadum" being <VF> is:
weight("w-1: thiddam", "VF") + weight("w+1: enRAr", "VF") - bias("VF")
= (0.296699729267891) + (0.66667135122578) - (0.33486366) = 0.628507420493671
Since the SVM score for <VF> is higher than that for <VNAJ>, the tag <VF> is
assigned to the word 'wiRaivERRappadum'.
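The same computation can be written as a small Perl sketch (illustrative only; the weights and biases are taken from the ".MRG" fragment above, restricted to the two competing tags):

use strict; use warnings;

# Biases for the two competing tags (from the BIASES line of the .MRG file).
my %bias = ( 'VNAJ' => 0.42063524, 'VF' => 0.33486366 );

# Weights of the active features, feature => { tag => weight }.
my %weight = (
    'C0~-1:thiddam' => { 'VNAJ' => -0.32,              'VF' => 0.296699729267891 },
    'C0~1:enRAr'    => { 'VNAJ' => -0.676332541749603, 'VF' => 0.66667135122578  },
);

my @active = ('C0~-1:thiddam', 'C0~1:enRAr');

my %score;
for my $tag ('VNAJ', 'VF') {
    my $s = 0;
    $s += ($weight{$_}{$tag} // 0) for @active;   # scalar product over active features
    $score{$tag} = $s - $bias{$tag};              # minus the bias, as computed above
}

my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
print "$best\n";   # prints "VF"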
The calculated part-of-speech tags feed directly forward into the next tagging
decisions as context features. The SVMTagger component works on standard
input/output. It processes a token-per-line corpus in a sentence-by-sentence fashion.
The token is expected to be the first column of the line, and the predicted tag will
take the second column in the output; the rest of the line remains unchanged. Lines
beginning with '##' are ignored by the tagger.
This is an example of an input file; SVMTagger only considers the first column of
the input file.
Figure 5.5: Example input file
Figure 5.6: Example output file
5.5.2.1 Options
SVMTagger is very flexible and adapts very well to the needs of the user. The
options currently available are the following:
Tagging scheme: Two different tagging schemes may be used.
– Greedy: Each tagging decision is made based on a reduced context. Later on,
decisions are not further reconsidered, except in the case of tagging at two steps or
tagging in two directions.
– Sentence-level: By means of dynamic programming techniques (the Viterbi algorithm),
the global sum of SVM tagging scores over the sentence is the function to maximize, as
shown in equation 5.6. Given a sentence S = w_1...w_n as a word sequence, and the set
T(S) = { t_i : 1 <= i <= |S| ∧ t_i ∈ ambiguity_class(w_i) } of all possible sequences
of POS tags associated to S,

    t(S) = argmax_{s ∈ T(S)} \sum_{i=1}^{|S|} score(s, i)        (5.6)

A softmax function is used by default so as to transform this sum of scores into a
product of probabilities. Because sentence-level tagging is expensive, two pruning
methods are provided. First, the maximum number of beams may be defined.
Alternatively, a threshold may be specified so that solutions scoring under a certain
value (with respect to the best solution at that point) are discarded. Both pruning
techniques have proved effective and efficient in our experiments.
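A minimal sketch of the default score-to-probability transformation (plain softmax, matching the -F 1 option listed further below):

use strict; use warnings;
use List::Util qw(sum);

# Softmax: turn raw SVM scores into probabilities,
# p(i) = e^score(i) / sum over j of e^score(j).
sub softmax {
    my @scores = @_;
    my @exp = map { exp($_) } @scores;
    my $z = sum(@exp);
    return map { $_ / $z } @exp;
}

my @p = softmax(0.628507420493671, -1.416967781749603);  # scores for <VF>, <VNAJ>
printf "%.4f %.4f\n", @p;                                # <VF> gets most of the mass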
Tagging direction: The tagging direction can be either left-to-right, right-to-left,
or a combination of both. The tagging direction affects the results, and combining
both directions yields a significant improvement. For every token, each direction
assigns a tag with a certain score, and the highest-scoring tag is selected.
This makes the tagger very robust. In the case of sentence-level tagging there is an
additional way to combine left-to-right and right-to-left: the "GLRL" direction makes
a global decision, i.e. considering the sentence as a whole. For every sentence, each
direction assigns a sequence of tags [30] with an associated score that corresponds
to the sum of scores (or the product of probabilities when using a softmax function,
the default option). The highest-scoring sequence of tags is selected.
One pass / two passes: Another way of achieving robustness is by tagging in two
passes. At the first pass only POS features related to already disambiguated words
are considered. At a second pass disambiguated POS features are available for
every word in the feature context, so when revisiting a word tagging errors may be
alleviated.
SVM Model Compression: Just as for the learning, weight vector components
lower than a certain threshold can be ignored.
All scores: Sometimes not only the chosen tag is relevant, but also its score, and
the scores of all competing tags as a measure of confidence; this information is
available.
Backup lexicon: Again, a morphological lexicon containing new words that were not
available in the training corpus may be provided.
Lemmae lexicon: Given a lemmae lexicon containing <word form, tag, lemma>
entries, output may be lemmatized.
<EOS> tag: The ‘<s>’ tag may be employed for sentence separation. Otherwise,
sentence punctuation is used instead, i.e. [.!?] symbols are taken as unambiguous
sentence separators.
Usage: SVMTagger [options] <model>
Options:
-T <strategy>
   0: one-pass (default; requires Model 0)
   1: two-passes [revisiting results and relabeling; requires Model 2 and Model 1]
   2: one-pass [robust against unknown words; requires Model 0 and Model 2]
   3: one-pass [unsupervised learning models; requires Model 3]
   4: one-pass [very robust against unknown words; requires Model 4]
   5: one-pass [sentence-level likelihood; requires Model 0]
   6: one-pass [robust sentence-level likelihood; requires Model 4]
-S <direction>
   LR: left-to-right (default)
   RL: right-to-left
   LRL: both left-to-right and right-to-left
   GLRL: both left-to-right and right-to-left (global assignment, only applicable under a sentence-level tagging strategy)
-K <n>: weight filtering threshold for known words (default is 0)
-U <n>: weight filtering threshold for unknown words (default is 0)
-Z <n>: number of beams in beam search, only applicable under sentence-level strategies (disabled by default)
-R <n>: dynamic beam search ratio, only applicable under sentence-level strategies (disabled by default)
-F <n>: softmax function to transform SVM scores into probabilities (default is 1)
   0: do nothing
   1: ln(e^score(i) / [sum over 1<=j<=N of e^score(j)])
-A: predictions for all possible parts-of-speech are returned
-B <backup_lexicon>
-L <lemmae_lexicon>
-EOS: enable usage of the end-of-sentence '<s>' string (disabled by default; [!.?] used instead)
-V <verbose>: 0: none, 1: low, 2: medium, 3: high, 4: very high
model: model location (path/name, as declared in the config-file NAME)

Example: SVMTagger -T 0 TAMIL_CORPUS_2L < TAMIL.IN > TAMIL.OUT
5.5.2.2 Strategies
Seven different tagging strategies have been implemented so far:
Strategy 0: It is the default one. It makes use of Model 0 in a greedy on-line
fashion, one-pass.
Strategy 1: As a first attempt to achieve robustness against error propagation, it
works in two passes, in an on-line greedy way. It uses Model 2 in the first pass and
Model 1 in the second. In other words, in the first pass the unseen morpho-syntactic
context remains ambiguous, while in the second pass the tag predicted in the first
pass is available also for unseen tokens and is used as a feature.
Strategy 2: This strategy tries to achieve robustness by using two models at tagging
time, namely Model 0 and Model 2. When all the words in the unseen context are
known it uses Model 0. Otherwise it makes use of Model 2.
Strategy 3: It uses Model 3, again in a greedy and on-line manner. This
unsupervised learning strategy is still under experimentation.
Strategy 4: It simply uses Model 4 as is, in an on-line greedy fashion.
Strategy 5: Still working on a more robust scheme, this strategy performs a
sentence-level tagging by means of dynamic programming techniques (Viterbi
algorithm). It uses Model 0.
Strategy 6: Same as strategy 5, this strategy performs a sentence level tagging, this
time applying Model 4.
5.5.2.3 IMPLEMENTATION OF SVMTagger FOR TAMIL
In the SVMTagger component the important options are the strategies and the backup
lexicon. It matters which tagging strategy is used; this may depend, for instance, on
efficiency requirements. If the tagging must be as fast as possible, strategies 1, 5
and 6 should be avoided, because strategy 1 goes in two passes and strategies 5 and 6
perform sentence-level tagging. Strategy 3 is only for unsupervised learning (no
hand-annotated data available). To choose among strategies 0, 2 and 4 the best
solution is to try them all; however, if the proportion of unknown words expected at
tagging time is known, strategies 2 and 4 are more robust than strategy 0 against
unknown words. Finally, with no speed requirement and no information about future
data, tagging strategies 4 and 6 systematically obtained the best results in our
experiments.
The format of the backup lexicon file is the same as the dictionary format, so a Perl
program can be used to convert a tagged corpus into dictionary format (see the sketch
below). The main difficulty in general POS tagging is tagging proper nouns. With a
large tagged corpus, tagging is not a problem for the closed tags, but the open tags
contain nouns and proper nouns. For English, capitalization helps in tagging proper
nouns; for Tamil this is not possible, and therefore a large backup lexicon must be
provided.
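A minimal Perl sketch of such a converter (illustrative only; it assumes the tagged corpus comes on standard input with one "word tag" pair per line):

use strict; use warnings;

# Convert a tagged corpus (one "word tag" pair per line) into the SVMTool
# dictionary format: <word> <N occurrences> <N possible tags> <tag_i> <occ_i> ...
my %count;   # word => { tag => occurrences }
while (my $line = <STDIN>) {
    chomp $line;
    my ($word, $tag) = split /\s+/, $line;
    next unless defined $tag;
    $count{$word}{$tag}++;
}
for my $word (sort keys %count) {
    my $tags  = $count{$word};
    my $total = 0;
    $total += $_ for values %$tags;
    my @fields = ($word, $total, scalar keys %$tags);
    push @fields, $_, $tags->{$_} for sort keys %$tags;
    print "@fields\n";
}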
Figure 5.7: Implementation of SVMTagger (proper nouns are expanded by a morph generator into a backup lexicon; together with the training data, dictionary and merged models of features, SVMTagger converts an untagged Tamil corpus into a tagged Tamil corpus)
Here we collect a large dataset of proper nouns (Indian names, Indian towns and
important world towns). These proper nouns are input to a morphological generator (a
Perl program). Our morph generator produces inflected proper nouns for the data set;
for each proper noun we generate nearly twelve inflections. This new dataset is
converted into the SVMTool dictionary format and given to SVMTagger as a backup
lexicon (a sketch of this generation step follows).
Figure 5.7 shows the steps in the implementation of SVMTagger for Tamil. Here the
input is an untagged, cleaned Tamil corpus and the output is tagged (annotated)
words. The supporting files are the training corpus, the dictionary file, the merged
models for unknown and known words, and the backup lexicon.
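A minimal sketch of this generation step (illustrative only; the romanized suffix list below is a hypothetical sample, not our full set of about twelve inflections):

use strict; use warnings;

# Illustrative proper-noun inflection generator: attach a few Tamil case
# suffixes (romanized, sample set) to each proper noun and emit the result
# in SVMTool dictionary format: <word> <N occ> <N tags> <tag> <occ>.
my @suffixes = ('', 'ai', 'ukku', 'il', 'in', 'Al', 'udan', 'iliruwthu');

while (my $noun = <STDIN>) {
    chomp $noun;
    next unless length $noun;
    for my $suf (@suffixes) {
        my $form = $noun . $suf;
        print "$form 1 1 <NNP> 1\n";   # every inflected form tagged <NNP>
    }
}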
5.5.3 SVMTEVAL
Given a SVMTool predicted tagging output and the corresponding gold-standard,
SVMTeval evaluates the performance in terms of accuracy. It is a very useful
component for the tuning of the system parameters, such as the C parameter, the
feature patterns and filtering, the model compression et cetera.
Based on a given morphological dictionary (e.g., the one automatically generated at
training time), results may also be presented for different sets of words (known
vs. unknown words, ambiguous vs. unambiguous words). A different view of these same
results can be taken from the class-of-ambiguity perspective, i.e., words sharing the
same kind of ambiguity may be considered together. Also, words sharing the same
degree of disambiguation complexity, determined by the size of their ambiguity
classes, can be grouped.
Usage: SVMTeval [mode] <model> <gold> <pred>
- mode: 0: complete report (everything)
        1: overall accuracy only [default]
        2: accuracy of known vs. unknown words
        3: accuracy per level of ambiguity
        4: accuracy per kind of ambiguity
        5: accuracy per class
- model: model name
- gold: correct tagging file
- pred: predicted tagging file
Example: SVMTeval TAMIL_CORPUS_2L TAMIL.GOLD TAMIL.OUT
5.5.3.1 IMPLEMENTATION OF SVMTeval FOR TAMIL
SVMTeval is the last component of SVMTool. This component is used to evaluate the
outputs under different modes. The main input of this component is a correctly tagged
corpus, also called the gold standard.
Figure 5.8: Implementation of SVMTeval (the correctly tagged file (gold standard), the SVM-tagged output, the model name and the mode are given to SVMTeval, which produces a report)
5.5.3.2 REPORTS
Brief report:
By default, a brief report mainly returning the overall accuracy is produced. It also
provides information about the number of tokens processed, and how many were
known/unknown and ambiguous/unambiguous according to the model dictionary.
Results are always compared to the most-frequent-tag (MFT) baseline (a minimal
sketch of this baseline is given after the report below).
*========================= SVMTeval report ==========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*====================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*================= TAGGING SUMMARY ==================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= OVERALL ACCURACY =================================
HITS      TRIALS      ACCURACY      MFT
* -------------------------------------------------------------------
1002      1063        94.2615%      71.2135%
*====================================================================
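The MFT baseline simply assigns every word the tag it received most often in training. A minimal Perl sketch of this baseline, assuming a %dict hash in the spirit of the dictionary format (word => {tag => count}, as built by the converter sketch shown earlier; the <NN> fallback for unknown words is an assumption of this sketch, not necessarily SVMTeval's behaviour):

use strict; use warnings;

# Most-frequent-tag baseline: tag each word with its most frequent
# training tag; unknown words fall back to a default tag.
sub mft_tag {
    my ($dict, $word) = @_;
    my $tags = $dict->{$word} or return '<NN>';   # assumed fallback
    my ($best) = sort { $tags->{$b} <=> $tags->{$a} } keys %$tags;
    return $best;
}

my %dict = ( 'wiRaivERRappadum' => { '<VF>' => 11, '<VNAJ>' => 1 } );
print mft_tag(\%dict, 'wiRaivERRappadum'), "\n";   # prints <VF>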
Known vs. unknown tokens:
Accuracy for four different sets of words is returned. The first set is that of all
known tokens, i.e. tokens seen during training. The second and third sets contain,
respectively, all ambiguous and all unambiguous tokens among these known tokens.
Finally, there is the set of unknown tokens, which were not seen during training.
*========================= SVMTeval report ==========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*====================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test1.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*================= TAGGING SUMMARY ==================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= KNOWN vs UNKNOWN TOKENS ==========================
          HITS      TRIALS      ACCURACY
* -------------------------------------------------------------------
*======= known ======================================================
          816       854         95.5504%
-------- known unambiguous tokens -----------------------------------
          604       623         96.9502%
-------- known ambiguous tokens -------------------------------------
          212       231         91.7749%
*======= unknown ====================================================
          186       209         88.9952%
*====================================================================
*================= OVERALL ACCURACY =================================
HITS      TRIALS      ACCURACY      MFT
* -------------------------------------------------------------------
1002      1063        94.2615%      71.2135%
*====================================================================
Level of ambiguity:
This view of the results groups together all words having the same degree of
POS-ambiguity.
*========================= SVMTeval report ==========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*====================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test1.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*================= TAGGING SUMMARY ==================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER LEVEL OF AMBIGUITY ==================
#CLASSES = 5
*====================================================================
LEVEL     HITS      TRIALS      ACCURACY      MFT
* -------------------------------------------------------------------
1         605       624         96.9551%      96.6346%
2         204       220         92.7273%      66.8182%
3         7         9           77.7778%      66.6667%
4         2         3           66.6667%      33.3333%
28        184       207         88.8889%      0.0000%
*================= OVERALL ACCURACY =================================
HITS      TRIALS      ACCURACY      MFT
* -------------------------------------------------------------------
1002      1063        94.2615%      71.2135%
*====================================================================
Kind of ambiguity:
This view is much finer. Every class of ambiguity is studied separately.
*========================= SVMTeval report ==========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*====================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*================= TAGGING SUMMARY ==================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER CLASS OF AMBIGUITY ==================
#CLASSES = 55
*====================================================================
CLASS    HITS    TRIALS    ACCURACY    MFT
* -------------------------------------------------------------------
<ADJ>    28    28    100.0000%    100.0000%
<ADJ>_<ADV>_<CNJ>_<COM>_<CRD>_<CVB>_<DET>_<ECH>_<INT>_<NN>_<NNC>_<NNP>_<NNPC>_<NNQ>_<ORD>_<PPO>_<PRID>_<PRIN>_<PRP>_<QTF>_<QW>_<RDW>_<VAX>_<VBG>_<VF>_<VINT>_<VNAJ>_<VNAV>    184    207    88.8889%    0.0000%
<ADJ>_<NN>    1    1    100.0000%    100.0000%
<ADJ>_<VNAJ>    1    1    100.0000%    100.0000%
<ADV>    31    32    96.8750%    96.8750%
<ADV>_<NN>_<PPO>    1    2    50.0000%    50.0000%
<ADV>_<RDW>    2    2    100.0000%    100.0000%
<CNJ>    20    20    100.0000%    100.0000%
<CNJ>_<PPO>    1    1    100.0000%    100.0000%
<COM>    17    17    100.0000%    100.0000%
<COMM>    49    49    100.0000%    100.0000%
<CRD>    23    23    100.0000%    95.6522%
<CRD>_<DET>    22    22    100.0000%    100.0000%
<CRD>_<NN>_<NNC>_<PPO>    2    2    100.0000%    50.0000%
<CRD>_<ORD>    2    2    100.0000%    0.0000%
<CVB>    6    6    100.0000%    100.0000%
<CVB>_<VF>    0    1    0.0000%    0.0000%
<DET>    14    14    100.0000%    100.0000%
<DOT>    77    77    100.0000%    100.0000%
<EMP>_<PRP>    1    1    100.0000%    100.0000%
<INT>    6    6    100.0000%    100.0000%
<INT>_<NN>    0    1    0.0000%    0.0000%
<NN>    81    91    89.0110%    89.0110%
<NN>_<NNC>    148    161    91.9255%    58.3851%
<NN>_<NNC>_<NNPC>    1    1    100.0000%    100.0000%
<NN>_<PRID>_<PRIN>_<VNAJ>    0    1    0.0000%    0.0000%
<NN>_<VBG>    1    1    100.0000%    100.0000%
<NN>_<VF>    1    1    100.0000%    100.0000%
<NNC>    46    47    97.8723%    97.8723%
<NNC>_<NNP>_<NNPC>    4    5    80.0000%    80.0000%
<NNC>_<VAX>    1    1    100.0000%    100.0000%
<NNC>_<VF>_<VNAJ>    1    1    100.0000%    0.0000%
<NNP>    34    37    91.8919%    91.8919%
<NNP>_<NNPC>    0    1    0.0000%    0.0000%
<NNPC>    0    1    0.0000%    0.0000%
<NNQ>    4    4    100.0000%    100.0000%
<ORD>    3    3    100.0000%    66.6667%
<PPO>    6    6    100.0000%    100.0000%
<PPO>_<VNAV>    1    1    100.0000%    100.0000%
<PRID>    2    2    100.0000%    100.0000%
<PRIN>    2    2    100.0000%    100.0000%
<PRP>    33    33    100.0000%    100.0000%
<QM>    4    4    100.0000%    100.0000%
<QTF>    5    5    100.0000%    100.0000%
<QW>    4    4    100.0000%    100.0000%
<RDW>    1    1    100.0000%    100.0000%
<VAX>    7    7    100.0000%    100.0000%
<VAX>_<VF>    8    8    100.0000%    100.0000%
<VBG>    11    11    100.0000%    100.0000%
<VBG>_<VF>    2    2    100.0000%    100.0000%
<VF>    39    40    97.5000%    97.5000%
<VF>_<VNAJ>    12    12    100.0000%    91.6667%
<VINT>    15    16    93.7500%    93.7500%
<VNAJ>    20    20    100.0000%    100.0000%
<VNAV>    17    18    94.4444%    94.4444%
*================= OVERALL ACCURACY =================================
HITS      TRIALS      ACCURACY      MFT
* -------------------------------------------------------------------
1002      1063        94.2615%      71.2135%
*====================================================================
Class:
Every class is studied individually.
*========================= SVMTeval report ==========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
*====================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs.
<E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
*================= TAGGING SUMMARY ==================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER PART-OF-SPEECH ======================
POS       HITS    TRIALS    ACCURACY     MFT
* -------------------------------------------------------------------
<ADJ>     30      31        96.7742%     90.3226%
<ADV>     47      48        97.9167%     70.8333%
<CNJ>     21      21        100.0000%    95.2381%
<COM>     17      17        100.0000%    100.0000%
<COMM>    49      49        100.0000%    100.0000%
<CRD>     26      26        100.0000%    84.6154%
<CVB>     7       8         87.5000%     75.0000%
<DET>     36      36        100.0000%    100.0000%
<DOT>     77      77        100.0000%    100.0000%
<EMP>     1       1         100.0000%    100.0000%
<INT>     6       7         85.7143%     85.7143%
<NN>      243     259       93.8224%     57.9151%
<NNC>     145     162       89.5062%     46.2963%
<NNP>     43      44        97.7273%     86.3636%
<NNPC>    0       16        0.0000%      0.0000%
<NNQ>     4       4         100.0000%    100.0000%
<ORD>     2       2         100.0000%    100.0000%
<PPO>     9       9         100.0000%    100.0000%
<PRID>    2       3         66.6667%     66.6667%
<PRIN>    2       2         100.0000%    100.0000%
<PRP>     34      34        100.0000%    97.0588%
<QM>      4       4         100.0000%    100.0000%
<QTF>     5       5         100.0000%    100.0000%
<QW>      6       6         100.0000%    66.6667%
<RDW>     1       1         100.0000%    100.0000%
<VAX>     18      18        100.0000%    77.7778%
<VBG>     20      22        90.9091%     54.5455%
<VF>      68      68        100.0000%    66.1765%
<VINT>    16      18        88.8889%     83.3333%
<VNAJ>    41      42        97.6190%     69.0476%
<VNAV>    22      23        95.6522%     73.9130%
*================= OVERALL ACCURACY =================================
HITS      TRIALS      ACCURACY      MFT
* -------------------------------------------------------------------
1002      1063        94.2615%      71.2135%
*====================================================================
6) GUI FOR SVMTool
6.1) INTRODUCTION
A graphical user interface (GUI) is a user-computer interface (i.e., a way for
users to interact with computers) that uses windows, icons and menus, and which can
be manipulated by mouse (and often, to a limited extent, by keyboard as well). GUIs
stand in sharp contrast to command line interfaces (CLIs), which use only text and are
accessed solely by a keyboard. The most familiar example of a CLI to many people is
MS-DOS. Another example is Linux when it is used in console mode (i.e., the entire
screen shows text only).
A window is a rectangular portion of the monitor screen that can display its
contents (e.g., a program, icons, a text file or an image) seemingly independently of
the rest of the display screen. A major feature is the ability for multiple windows to
be open simultaneously. Each window can display a different application, or each can
display different files (e.g., text, image or spreadsheet files) that have been opened or
created with a single application.
In this chapter we discuss the GUI for SVMTool. This GUI was developed using
NetBeans, and we use the Perl version of SVMTool. Our GUI runs only the SVMTagger
and SVMTeval components, because the GUI was developed on the Windows operating
system while the SVMTlearn component runs only on Linux.
Advantages of GUIs
A major advantage of GUIs is that they make computer operation more
intuitive, and thus easier to learn and use. For example, it is much easier for a new
user to move a file from one directory to another by dragging its icon with the mouse
than by having to remember and type seemingly arcane commands to accomplish the
same task. Adding to this intuitiveness of operation is the fact that GUIs generally
provide users with immediate, visual feedback about the effect of each action. For
example, when a user deletes an icon representing a file, the icon immediately
disappears, confirming that the file has been deleted (or at least sent to the trash can).
This contrasts with the situation for a CLI, in which the user types a delete command
(inclusive of the name of the file to be deleted) but receives no automatic feedback
indicating that the file has actually been removed.
In addition, GUIs allow users to take full advantage of the powerful
multitasking capabilities of modern operating systems (the ability for multiple
programs and/or multiple instances of single programs to run simultaneously) by
allowing such multiple programs and/or instances to be displayed simultaneously.
The result is a large increase in the flexibility of computer use and a consequent
rise in user productivity.
The SVMTool options work only in a command line interface, so the user must know
the SVMTool commands. This is not always feasible, so a user-friendly environment
like a GUI is needed. This GUI solves that problem because it covers all the
important options in SVMTool, giving a graphical, user-friendly front end to
SVMTool.
6.2) GUI for SVMTagger
The GUI for SVMTagger was developed using NetBeans; SVMTool is run from NetBeans
through a Java program. The SVMTagger window is shown in figure 6.1. In this window
the user can select the SVMTool location and the model name: the SVMTool location
indicates the "bin" folder location, and the model name is the name given in the
config file. This window contains two options:
1) File based tagger
2) Sentence based tagger
In the file based tagger, the user selects the input file location and the output
file location; the file must be in SVMTool format. The other SVMTagger options, like
strategy, direction and backup lexicon, are also given in that window. In sentence
based tagging, the user gives an input Tamil sentence; our program tokenizes the
sentence and tags it using SVMTagger. Since the user may not know the tag
abbreviations, the window also has an option for explaining them. The SVMTagger
window is shown below.
Figure 6.1: SVMTagger window
6.2.1) File based SVMTagger window
In this file based tagger (fig 6.2) the user gives the input file location and the
output file location. After that, the user can tag the words based on different
strategies and directions; these options were already explained in chapter 5.
Another important option is the backup lexicon location. It is very useful for
handling unknown words, especially for open tags like proper noun and noun. If the
user provides a lexicon, each word (proper noun) in the lexicon is converted into a
number of inflected words using a small morphological generator, which handles
unknown words effectively.
Figure 6.2: File based SVMTagger window
6.2.2) Sentence based SVMTagger window
The sentence based tagger (fig 6.3) takes its input in sentence form and tokenizes
it into token-per-line format, because SVMTool accepts only token-per-line input.
This tokenization is done by a small Perl program (a sketch is given after the
figure below). After converting into tokenized form, the tagging operation is
performed using the SVMTagger component. If the user has any doubt about a tag
abbreviation, the tag details window helps to find the exact meaning of the tags.
Figure 6.3: Sentence based SVMTagger window
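A minimal sketch of such a tokenizer (illustrative only; real tokenization of Tamil text would need more care with punctuation and clitics):

use strict; use warnings;

# Split an input sentence into one token per line, separating the
# sentence-final punctuation that SVMTool treats as a separator.
while (my $sentence = <STDIN>) {
    chomp $sentence;
    $sentence =~ s/([.!?,])/ $1 /g;      # detach punctuation
    my @tokens = split /\s+/, $sentence;
    print "$_\n" for grep { length } @tokens;
}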
6.2.3) AMRITA POS tagset window
This window (fig 6.4) helps the user find the appropriate meaning of every tag in
our tagset. Every tagset follows its own abbreviations for its tags; our tagset
also has its own abbreviations, but we adopt some standard tag abbreviations like
NN, NNP, ADJ, etc.
Figure 6.4: AMRITA tagset window
6.3) GUI for SVMTeval
Given a SVMTool predicted tagging output and the corresponding gold standard,
SVMTeval evaluates the performance in terms of accuracy. The GUI for SVMTeval
(fig 6.5) was also developed using NetBeans. In this window the user can choose the
SVMTool location and the model name. This evaluation component of SVMTool is used
to analyze the results of a tagger: it shows the number of tokens, unknown words,
and known and ambiguous words. The component can be run in different modes, such as
known vs. unknown words, kind of ambiguity, and level of ambiguity.
Figure 6.5: SVMTeval window
6.3.1) SVMTeval Report window
The GUI for the evaluator is shown in figure 6.6. Here the user can check the
accuracy of SVMTool. This window needs a gold standard tagged file, i.e. a correctly
tagged corpus in SVMTool format. The other input of this module is the SVMTool
output file, i.e. the corpus tagged by the SVMTagger component. Based on a given
morphological dictionary (e.g., the one automatically generated at training time),
results may also be presented for different sets of words (known vs. unknown words,
ambiguous vs. unambiguous words).
Figure 6.6: SVMTeval results window
A different view of these same results can be taken from the class-of-ambiguity
perspective, i.e., words sharing the same kind of ambiguity may be considered
together. Also, words sharing the same degree of disambiguation complexity,
determined by the size of their ambiguity classes, can be grouped.
6.4 Output of TnT tagger for Tamil
TnT, short for Trigrams'n'Tags, is a very efficient statistical part-of-speech
tagger that is trainable on different languages and virtually any tagset. The
component for parameter generation trains on tagged corpora. The system
incorporates several methods of smoothing and of handling unknown words.
Figure 6.7: Output of TnT tagger for Tamil
TnT is not optimized for a particular language. Instead, it is optimized for
training on a large variety of corpora; adapting the tagger to a new language,
domain or tagset is very easy. Additionally, TnT is optimized for speed.
We evaluated the tagger's performance under several aspects. We used a training
corpus of two lakh words and ran the tagger on a thousand untagged words,
obtaining an overall accuracy (fig 6.7) of 88.71%.
Comparative accuracies of TnT tagger and SVMTool for Tamil:

                                     TnT       SVMTool
Overall accuracy                     88.7%     93.87%
Accuracy of known words              92.54%    95.54%
Accuracy of known ambiguous words    77.94%    91.17%
Accuracy of unknown words            74.16%    87.081%
7) CONCLUSION AND FUTURE WORK
Part of speech tagging plays an important role in various speech and language
processing applications. Since many reputed companies like Google and Microsoft are
concentrating on natural language processing applications, it has gained even more
importance. Currently many tools are available for the task of part of speech
tagging, and good accuracy can be obtained using existing software like SVMTool and
the Stanford tagger, which give more than 97% accuracy for English.
The SVMTool has already been successfully applied to English and Spanish POS
tagging, exhibiting state-of-the-art performance (97.16% and 96.89%, respectively).
In both cases the results clearly outperform the HMM-based TnT part-of-speech
tagger.
For Tamil we have obtained an accuracy of 94%. Training with this software is also
easy, and any language can be trained using it, so POS tagging can be extended to
other languages. Currently there are no taggers giving good accuracy for Tamil; the
obstacle for POS tagging of Indian languages is the lack of annotated (tagged) data.
We have a corpus of 2.25 lakh words.
Another possible work is to increase the corpus size, i.e. to increase the tagged
data for Tamil, and in future work chunking tags can be added to the tagged corpus.
Better accuracy can still be obtained by combining a morphological analyzer with
SVMTool for handling unknown words.
REFERENCES
[1] Thorsten Brants, "TnT -- A Statistical Part-of-Speech Tagger", in Proceedings of the 6th Applied NLP Conference, ANLP-2000, April 29 - May 3, 2000.
[2] Huang, Caroline, Son-Bell, Mark, and Baggett, David, "Generation of pronunciations from orthographies using transformation-based error-driven learning", in International Conference on Speech and Language Processing (ICSLP), Yokohama, Japan, 1994.
[3] Klein, Sheldon and Simmons, Robert, "A computational approach to grammatical coding of English words", JACM 10, 1963, pp 334-347.
[4] B. B. Greene and G. M. Rubin, "Automatic grammatical tagging of English", Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, 1971.
[5] W. N. Francis and F. Kucera, "Frequency Analysis of English Usage", Houghton Mifflin, 1982.
[6] S. DeRose, "Grammatical category disambiguation by statistical optimization", Computational Linguistics, 1988, pp 14:31-39.
[7] A. M. Derouault and B. Merialdo, "Natural language modeling for phoneme-to-text transcription", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8, 1986, pp 742-749.
[8] K. W. Church, "A stochastic parts program and noun phrase parser for unrestricted text", in Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 1988, pp 136-143.
[9] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process", Inequalities, 1972, pp 3:1-8.
[10] F. Jelinek, "Markov source modeling of text generation", in J. K. Skwirzinski (ed.), Impact of Processing Techniques on Communication, Nijhoff, Dordrecht, 1985.
[11] J. M. Kupiec, "Robust part-of-speech tagging using a hidden Markov model", Computer Speech and Language, 1992.
[12] J. M. Kupiec, "Augmenting a hidden Markov model for phrase-dependent word tagging", in Proceedings of the DARPA Speech and Natural Language Workshop, Morgan Kaufmann, Cape Cod, MA, 1989, pp 92-98.
[13] M. A. Hearst, "Noun homograph disambiguation using local context in large text corpora", in Proceedings of the 7th New OED Conference on Using Corpora, Oxford, 1991, pp 1-22.
[14] E. Brill, "Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging", Computational Linguistics, 1995, pp 21(4):543-565.
[15] R. Weischedel, R. Schwartz, J. Palmucci, M. Meteer, and L. Ramshaw, "Coping with ambiguity and unknown words through probabilistic models", Computational Linguistics, 1993, pp 19(2):260-269.
[16] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, "A practical part-of-speech tagger", in Proceedings of the 3rd Conference on Applied Natural Language Processing, ANLP, 1992, pp 133-140.
[17] B. Merialdo, "Tagging English text with a probabilistic model", Computational Linguistics, 1994, pp 20(2):155-171.
[18] A. Ratnaparkhi, "A maximum entropy part-of-speech tagger", in Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, EMNLP, 1996.
[19] Ray Lau, Ronald Rosenfeld, and Salim Roukos, "Adaptive language modeling using the maximum entropy principle", in Proceedings of the Human Language Technology Workshop, ARPA, 1993, pp 108-113.
[20] Rajendran, S., "Complexity of Tamil in POS tagging", Language in India, Volume 7:1, January 2007.
[21] Arulmozhi, P. and Sobha, L., "A hybrid POS tagger for a relatively free word order language", in Proceedings of the First National Symposium on Modeling and Shallow Parsing of Indian Languages, 2006, pp 79-85.
[22] Gimenez, J. and Marquez, L., "Fast and accurate part-of-speech tagging: The SVM approach revisited", in Proceedings of the Fourth RANLP, 2003.
[23] D. Jurafsky and J. H. Martin, "Speech and Language Processing", Prentice-Hall, 2004.
[24] Rajendran, S., "Parsing in Tamil: Present state of art", Language in India, Volume 6:8, August 2006.
[25] Lehmann, Thomas, "A Grammar of Modern Tamil" (second edition), Pondicherry Institute of Linguistics and Culture, Pondicherry, 1992.
[26] Akshar Bharati, Dipti Misra Sharma, Lakshmi Bai, and Rajeev Sangal, "AnnCorra: Annotating Corpora - Guidelines for POS and Chunk Annotation for Indian Languages", Language Technologies Research Centre, IIIT Hyderabad, December 2006.
[27] Rajendran, S., "A survey of the state of the art in Tamil language technology", Language in India, Volume 6:10, October 2006.
[28] K. Rajan, "Corpus analysis and tagging for Tamil", Annamalai University, Annamalainagar.
[29] Ahmed, S. Bapi Raju, Pammi V. S. Chandrasekhar, and M. Krishna Prasad, "Application of multilayer perceptron network for tagging parts-of-speech", Language Engineering Conference, University of Hyderabad, India, December 2002.
[30] Jesus Gimenez and Lluis Marquez, "SVMTool: Technical Manual v1.3", August 2006.
[31] Rajendran, S., Arulmozi, S., Ramesh Kumar, S., and Viswanathan, S., "Computational morphology of the verbal complex", in B. Ramakrishna Reddy (ed.), Word Structure in Dravidian, Dravidian University, Kuppam, 2003, pp 376-398.
[32] T. Joachims, "Making large-scale SVM learning practical", in Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
[33] Vapnik, V., "The Nature of Statistical Learning Theory", Springer-Verlag, New York, NY, 1995.
[34] John McNaught, "User needs for textual corpora in natural language processing", Literary and Linguistic Computing, vol. 8, no. 4, 1993.