PART OF SPEECH TAGGING FOR TAMIL USING SVMTOOL

A PROJECT REPORT submitted by

ANAND KUMAR M (CB206CN001)

in partial fulfillment for the award of the degree of

MASTER OF TECHNOLOGY IN COMPUTATIONAL ENGINEERING AND NETWORKING

Under the Guidance of Dr. K. P. Soman (Professor & Head, CEN)

Centre for Excellence in Computational Engineering and Networking (CEN)
AMRITA SCHOOL OF ENGINEERING, COIMBATORE
AMRITA VISHWA VIDYAPEETHAM
COIMBATORE – 641 105

JULY 2008

Dedicated to my beloved family

AMRITA SCHOOL OF ENGINEERING, AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641105

BONAFIDE CERTIFICATE

This is to certify that the project report entitled "PART OF SPEECH TAGGING FOR TAMIL USING SVMTOOL" submitted by "ANAND KUMAR M, Reg. No. CB206CN001" in partial fulfillment of the requirements for the award of the Degree of Master of Technology in "COMPUTATIONAL ENGINEERING AND NETWORKING" is a bonafide record of the work carried out under my guidance and supervision at Amrita School of Engineering, Coimbatore.

Dr. K. P. Soman, Project Guide, Professor, CEN.
Dr. K. P. Soman, Professor and HOD, Department of CEN.

This project report was evaluated by us on ………………………

INTERNAL EXAMINER          EXTERNAL EXAMINER

AMRITA VISHWA VIDYAPEETHAM
AMRITA SCHOOL OF ENGINEERING, COIMBATORE
CENTRE FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

DECLARATION

I, ANAND KUMAR M, Reg. No. CB206CN001, hereby declare that this project report entitled "PART OF SPEECH TAGGING FOR TAMIL USING SVMTOOL" is a record of the original work done by me under the guidance of Dr. K. P. Soman, Head of the Department, Department of Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, and that this work has not formed the basis for the award of any degree/diploma/associateship/fellowship or similar award to any candidate in any university, to the best of my knowledge.

Place:                                   Signature of the student
Date:

COUNTERSIGNED
Dr. K. P. Soman, HOD of CEN

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ORGANIZATION
1. INTRODUCTION
   1.1 Introduction
   1.2 Word classes
   1.3 Part of speech tagging
   1.4 Tagging approaches
      1.4.1 Rule based part of speech tagging
      1.4.2 Stochastic part of speech tagging
      1.4.3 Transformation based tagging
   1.5 Other techniques
   1.6 Motivation
2. LITERATURE SURVEY
3. TAMIL POS TAGGING
   3.1 Tamil language
      3.1.1 Alphabets
      3.1.2 Tamil grammar
      3.1.3 Part of speech categories
      3.1.4 Other POS categories
      3.1.5 Ambiguity of roots
   3.2 Complexity in Tamil POS tagging
      3.2.1 Noun complexity
      3.2.2 Verb complexity
      3.2.3 Complexity in adverbs
      3.2.4 Complexity in postpositions
   3.3 Developing tagsets
      3.3.1 Tagsets gone through
      3.3.2 AMRITA tagset
   3.4 Explanation of AMRITA POS tags
      3.4.1 Noun tags
      3.4.2 Pronoun tags
      3.4.3 Adjective tags
      3.4.4 Adverb tags
      3.4.5 Verb tags
      3.4.6 Other tags
4. DEVELOPMENT OF TAGGED CORPUS
   4.1 Introduction
      4.1.1 Tagged corpus, parallel corpus, and aligned corpus
      4.1.2 CIIL corpus for Tamil
      4.1.3 AUKBC-RC's improved tagged corpus for Tamil
   4.2 Developing a new corpus
      4.2.1 Untagged and tagged corpus
      4.2.2 Tagged corpus development
   4.3 Details of our tagged corpus
   4.4 Applications of tagged corpus
5. IMPLEMENTATION OF SVMTOOL FOR TAMIL
   5.1 Introduction
   5.2 Properties of the SVMTool
   5.3 The theory of support vector machines
   5.4 Problem setting
      5.4.1 Binarizing the classification problem
      5.4.2 Feature codification
   5.5 SVMTool components and implementations
      5.5.1 SVMTlearn
         5.5.1.1 Training data format
         5.5.1.2 Options
         5.5.1.3 Configuration file
         5.5.1.4 C parameter tuning
         5.5.1.5 Test
         5.5.1.6 Models
         5.5.1.7 Implementation of SVMTlearn for Tamil
      5.5.2 SVMTagger
         5.5.2.1 Options
         5.5.2.2 Strategies
         5.5.2.3 Implementation of SVMTagger for Tamil
      5.5.3 SVMTeval
         5.5.3.1 Implementation of SVMTeval for Tamil
         5.5.3.2 Reports
6. GUI FOR SVMTOOL
   6.1 Introduction
   6.2 GUI for SVMTagger
      6.2.1 File based SVMTagger window
      6.2.2 Sentence based SVMTagger window
      6.2.3 AMRITA POS tagset window
   6.3 GUI for SVMTeval
      6.3.1 SVMTeval report window
   6.4 Output of TnT tagger for Tamil
7. CONCLUSION
REFERENCES

LIST OF FIGURES

1.1 Classification of POS tagging models
1.2 Example of Brill's template
3.1 Tamil alphabets with English mapping
4.1 Example of untagged corpus
4.2 Example of tagged corpus
4.3 Untagged corpus before pre-editing
4.4 Untagged corpus after pre-editing
5.1 SVM example: hard margin
5.2 SVM example: soft margin
5.3 Training data format
5.4 Implementation of SVMTlearn
5.5 Example input file
5.6 Example output file
5.7 Implementation of SVMTagger
5.8 Implementation of SVMTeval
6.1 SVMTagger window
6.2 File based SVMTagger window
6.3 Sentence based SVMTagger window
6.4 AMRITA tagset window
6.5 SVMTeval window
6.6 SVMTeval results window
6.7 Output of TnT tagger for Tamil

LIST OF TABLES

3.1 AMRITA tagset
4.1 Tag counts
5.1 Rich feature pattern set
5.2 SVMTlearn config-file mandatory arguments
5.3 SVMTlearn config-file action arguments
5.4 SVMTlearn config-file optional arguments
5.5 SVMTool feature language
5.6 Model 0: example of suitable POS features
5.7 Model 1: example of suitable POS features
5.8 Model 2: example of suitable POS features

OVERVIEW OF THE PROJECT

This thesis describes part of speech tagging for Tamil using SVMTool. The work is divided into seven chapters. The first chapter introduces part of speech tagging; it discusses word classes, tagsets, and the various tagging approaches used for POS tagging. The second chapter gives a literature review of part of speech tagging in general and of POS tagging for Tamil in particular. The third chapter discusses Tamil POS tagging: it gives a brief introduction to the Tamil language, Tamil grammar, and POS categories in Tamil, discusses the complexity of Tamil POS tagging and the various existing tagsets for Tamil, and finally defines a new tagset for Tamil (the AMRITA tagset). The fourth chapter describes how the corpus was developed. The fifth chapter presents the part of speech tagger based on Support Vector Machines: it describes the SVMTool software, how it is used for training and testing on a Tamil corpus, and the implementation of the POS tagger for Tamil using SVMTool; the output of SVMTool is also compared with that of TnT for Tamil. The sixth chapter discusses the GUI for the SVMTool part of speech tagger, intended for users unfamiliar with the commands of the SVMTagger and SVMTeval components. The final chapter presents the conclusion and future work of this project.

1. INTRODUCTION

1.1 Introduction

Part of speech (POS) tagging is one of the most well-studied problems in the field of Natural Language Processing (NLP). Different approaches have already been tried to automate the task for English and other languages. Tamil is a South Asian language belonging to the Dravidian language family.
It is spoken by over 77 million people in several countries around the world, yet it still lacks significant research effort in the area of Natural Language Processing. This chapter discusses what is meant by POS tagging and the various POS tagging approaches.

1.2 Word classes

Words are divided into different classes called parts of speech (POS), word classes, morphological classes, or lexical tags. In traditional grammars there are only a few parts of speech (noun, verb, adjective, preposition, adverb, conjunction, etc.), while many recent models have much larger numbers of word classes (POS tags). Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both the word's definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. Parts of speech can be divided into two broad supercategories: CLOSED CLASS types and OPEN CLASS types. Closed classes are those that have relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely coined. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs [23]. English has all four of these, although not every language does.

1.3 Part of speech tagging

Part of speech (POS) tagging means assigning grammatical classes, i.e. appropriate part of speech tags, to each word in a natural language sentence. Assigning a POS tag to each word of an un-annotated text by hand is very time consuming, which has led to various approaches to automating the job. POS tagging is thus a technique for automating the assignment of lexical categories.
The process takes a word or a sentence as input, assigns a POS tag to the word or to each word in the sentence, and produces the tagged text as output. Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus. Tags are also usually applied to punctuation markers; in this sense, tagging for natural language is analogous to tokenization for computer languages, although natural-language tags, and Tamil tags in particular, are much more ambiguous. Taggers play an increasingly important role in speech recognition, natural language parsing, and information retrieval. The input to a tagging algorithm is a string of words and a specified tagset of the kind described in the previous section. The output is a single best tag for each word. For example:

Example in English: Take/VB that/DT Book/NN (tagged using the Penn Treebank tagset)

Even in this simple example, automatically assigning a tag to each word is not trivial. For example, "book" is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in "book that bus" or "to book the suspect") or a noun (as in "hand me that book" or "a book of matches"). Similarly, "that" can be a determiner (as in "Does that flight serve dinner?") or a complementizer (as in "I thought that your flight was earlier"). The problem of POS tagging is to resolve these ambiguities by choosing the proper tag for the context. Part-of-speech tagging is thus one of many disambiguation tasks. Another important point, which was discussed and agreed upon, is that POS tagging is NOT a replacement for a morphological analyzer. A word in a text carries the following linguistic knowledge: a) its grammatical category and b) grammatical features such as gender, number, and person. The POS tag should be based on the category of the word; the features can be acquired from the morphological analyzer.

1.4 Tagging approaches

There are different approaches to POS tagging.
Figure 1.1 shows the different POS tagging models. Most tagging algorithms fall into one of two classes: rule-based taggers and stochastic taggers. Rule-based taggers generally involve a large database of hand-written disambiguation rules.

[Figure 1.1: Classification of POS tagging models: a tree dividing POS tagging into supervised and unsupervised approaches, each further divided into rule based, stochastic (N-gram based, Maximum Likelihood, and Hidden Markov Model, with the Baum-Welch and Viterbi algorithms), neural, and Brill taggers.]

For example, a rule might state that an ambiguous word is a noun rather than a verb if it follows a determiner. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context. The Brill tagger shares features of both tagging architectures. Like the rule-based taggers, it is based on rules which determine when an ambiguous word should have a given tag. Like the stochastic taggers, it has a machine-learning component: the rules are automatically induced from a previously tagged training corpus.

Supervised POS tagging

Supervised POS tagging models require pre-tagged corpora, which are used during training to learn information about the tagset, word-tag frequencies, rule sets, etc. The performance of these models generally increases with the size of the training corpus.

Unsupervised POS tagging

Unlike the supervised models, unsupervised POS tagging models do not require a pre-tagged corpus. Instead, they use advanced computational methods such as the Baum-Welch algorithm to automatically induce tagsets, transformation rules, etc. Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.

1.4.1 Rule based POS tagging

Rule based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to words.
These rules are often known as context frame rules. For example, a context frame rule might say something like: "If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective." The transformation based approaches, on the other hand, use a pre-defined set of handcrafted rules as well as automatically induced rules generated during training. Morphology is the linguistic term for how words are built up from smaller units of meaning known as morphemes. In addition to contextual information, morphological information is also used by some models to aid in the disambiguation process. One such rule might be: "If an ambiguous/unknown word ends in the suffix -ing and is preceded by a verb, label it a verb." Some models also use information about capitalization and punctuation, the usefulness of which is largely dependent on the language being tagged. The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to narrow this list down to a single part of speech for each word. The ENGTWOL tagger is based on the same two-stage architecture, although both its lexicon and its disambiguation rules are much more sophisticated than those of the early algorithms. The ENGTWOL lexicon is based on two-level morphology and has about 56,000 entries for English word stems, counting a word with multiple parts of speech (e.g. the nominal and verbal senses of hit) as separate entries, and of course not counting inflected and many derived forms. Each entry is annotated with a set of morphological and syntactic features. In the first stage of the tagger, each word is run through the two-level lexicon transducer and the entries for all possible parts of speech are returned.
For example, the phrase "Ameen had shown that output..." would return the following list (one line per possible tag, with the correct tag shown in boldface):

Ameen    AMEEN N NOM SG PROPER
had      HAVE V PAST VFIN SVO
         HAVE PCP2 SVO
shown    SHOW PCP2 SVOO SVO SV
that     ADV
         PRON DEM SG
         DET CENTRAL DEM SG
         CS
output   N NOM SG

A set of about 1,100 constraints is then applied to the input sentence to rule out incorrect parts of speech; the boldfaced entries in the list above show the desired result, in which the preterite (not participle) tag is applied to "had", and the complementizer (CS) tag is applied to "that". The constraints are used in a negative way, to eliminate tags that are inconsistent with the context. For example, one constraint eliminates all readings of "that" except the ADV (adverbial intensifier) sense (this is the sense in the sentence "it isn't that odd"). Here is a simplified version of the constraint:

ADVERBIAL-THAT RULE
Given input: "that"
  if (+1 A/ADV/QUANT);   /* if the next word is an adjective, adverb, or quantifier */
  (+2 SENT-LIM);         /* and the word after that is a sentence boundary, */
  (NOT -1 SVOC/A);       /* and the previous word is not a verb like 'consider' */
                         /* which allows adjectives as object complements */
  then eliminate non-ADV tags
  else eliminate ADV tag

The first two clauses of this rule check that the "that" directly precedes a sentence-final adjective, adverb, or quantifier. In all other cases the adverb reading is eliminated. The last clause handles cases preceded by verbs like "consider" or "believe", which can take a noun and an adjective; this avoids tagging the following instance of "that" as an adverb:

Example: I consider that odd.

Another rule expresses the constraint that the complementizer sense of "that" is most likely if the previous word is a verb which expects a complement (like "believe", "think", or "show") and the "that" is followed by the beginning of a noun phrase and a finite verb.
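The negative, eliminate-only character of such constraints can be illustrated with a short sketch. The candidate-set representation, the simplified tag names, and the two-word lookahead below are illustrative assumptions, not ENGTWOL's actual machinery:

```python
# Sketch of constraint-based disambiguation in the style of the
# ADVERBIAL-THAT rule above (simplified; not the real ENGTWOL engine).
# Each word starts with a set of candidate tags; constraints only
# ELIMINATE candidates that are inconsistent with the context.

SVOC_VERBS = {"consider", "believe"}  # verbs taking adj object complements

def adverbial_that_constraint(words, candidates, i):
    """Keep only the ADV reading of 'that' when it directly precedes a
    sentence-final adjective/adverb/quantifier and is not preceded by a
    verb like 'consider'; otherwise eliminate the ADV reading."""
    if words[i].lower() != "that":
        return
    next_ok = i + 1 < len(words) and candidates[i + 1] & {"ADJ", "ADV", "QUANT"}
    boundary = i + 2 >= len(words) or words[i + 2] in {".", "!", "?"}
    prev_ok = i == 0 or words[i - 1].lower() not in SVOC_VERBS
    if next_ok and boundary and prev_ok:
        candidates[i] &= {"ADV"}     # keep only the intensifier sense
    else:
        candidates[i] -= {"ADV"}     # rule out the intensifier sense

words = ["it", "is", "not", "that", "odd", "."]
candidates = [
    {"PRON"}, {"V"}, {"ADV"},
    {"ADV", "DET", "CS"},            # 'that' is three-ways ambiguous
    {"ADJ"}, {"."},
]
adverbial_that_constraint(words, candidates, 3)
print(candidates[3])                 # only the ADV reading survives
```

Running the same constraint on "I consider that odd." removes the ADV reading instead, because the third clause blocks the keep-ADV branch when the previous word is "consider".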
1.4.2 Stochastic POS tagging

A stochastic approach involves frequency, probability, or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in the unannotated text. The problem with this approach is that it can produce sequences of tags that are not acceptable according to the grammar rules of the language. An alternative to the word frequency approach, known as the n-gram approach, calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where n is set to 1, 2, or 3 for practical purposes; these are known as the unigram, bigram, and trigram models. The most common algorithm for applying an n-gram approach to new text is the Viterbi algorithm, a search algorithm that avoids the polynomial expansion of a breadth-first search by trimming the search tree at each level using the best m Maximum Likelihood Estimates (MLE), where m is the number of tags of the following word.

The use of probabilities in tagging is quite old: probabilities in tagging were first used in 1965, a complete probabilistic tagger with Viterbi decoding was sketched by Bahl and Mercer (1976), and various stochastic taggers were built in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988). This section describes a particular stochastic tagging algorithm generally known as the Hidden Markov Model (HMM) tagger. The idea behind all stochastic taggers is a simple generalization of 'pick the most likely tag for this word'. For a given sentence or word sequence, HMM taggers choose the tag sequence that maximizes the following formula:

P(word | tag) × P(tag | previous n tags)   (1.1)

The rest of this section will explain and motivate this particular equation.
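The quantities in Equation 1.1 can be estimated from a tagged corpus simply by counting and normalizing. The sketch below does this for a bigram model; the tiny tagged corpus is invented purely for illustration:

```python
from collections import Counter

# Estimate P(tag | previous tag) and P(word | tag) by counting and
# normalizing over a tagged corpus, then score candidate tags with
# Equation 1.1. The tiny corpus below is invented for illustration.
corpus = [
    ("to", "TO"), ("race", "VB"), ("is", "VBZ"), ("fun", "NN"),
    ("the", "DT"), ("race", "NN"), ("was", "VBD"), ("fast", "JJ"),
    ("to", "TO"), ("run", "VB"),
]

tag_bigrams = Counter()   # counts of (previous tag, tag) pairs
word_tag = Counter()      # counts of (word, tag) pairs
tag_counts = Counter()    # counts of each tag
for i, (word, tag) in enumerate(corpus):
    word_tag[(word, tag)] += 1
    tag_counts[tag] += 1
    if i > 0:
        tag_bigrams[(corpus[i - 1][1], tag)] += 1

def p_tag_given_prev(tag, prev):
    return tag_bigrams[(prev, tag)] / tag_counts[prev]

def p_word_given_tag(word, tag):
    return word_tag[(word, tag)] / tag_counts[tag]

def score(word, tag, prev):
    # Equation 1.1 for a bigram model: P(word | tag) * P(tag | prev tag)
    return p_word_given_tag(word, tag) * p_tag_given_prev(tag, prev)

# Disambiguate 'race' after 'to/TO': in this toy corpus VB wins,
# since TO is always followed by VB.
print(score("race", "VB", "TO"), score("race", "NN", "TO"))
```

This is exactly the computation the worked "race" example below carries out with real Brown and Switchboard corpus estimates.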
HMM taggers generally choose a tag sequence for a whole sentence rather than for a single word, but for pedagogical purposes let us first see how an HMM tagger assigns a tag to an individual word. We first give the basic equation, then work through an example, and finally give the motivation for the equation. A bigram HMM tagger of this kind chooses the tag t_i for word w_i that is most probable given the previous tag t_{i-1} and the current word w_i:

t_i = argmax_j P(t_j | t_{i-1}, w_i)   (1.2)

We can restate Equation 1.2 to give the basic HMM equation for a single tag by using some Markov assumptions, as follows:

t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)   (1.3)

An example

Consider using an HMM tagger to assign the proper tag to the single word "race" in the following examples (both shortened slightly from the Brown corpus):

Vikram/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN   (1.4)

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN   (1.5)

(Tagged using the Penn Treebank tagset.)

In the first example "race" is a verb (VB); in the second it is a noun (NN). For the purposes of this example, we will assume that some other mechanism has already done the best tagging job possible on the surrounding words, leaving only the word "race" untagged. A bigram version of the HMM tagger makes the simplifying assumption that the tagging problem can be solved by looking at nearby words and tags. Consider the problem of assigning a tag to "race" given just these subsequences:

to/TO race/???
the/DT race/???

Equation 1.3 says that if we are trying to choose between NN and VB for the sequence "to race", we choose the tag that has the greater of these two probabilities:

P(VB|TO) P(race|VB)   (1.6)

and

P(NN|TO) P(race|NN)   (1.7)

Equation 1.3 and its instantiations, Equations 1.6 and 1.7, each have two probabilities: a tag sequence probability P(t_i | t_{i-1}) and a word likelihood P(w_i | t_i). For "race", the tag sequence probabilities P(NN|TO) and P(VB|TO) answer the question "how likely are we to expect a verb (or noun) given the previous tag?". They can be computed from a corpus simply by counting and normalizing. We can expect that a verb is more likely to follow TO than a noun is, since infinitives (to race, to run, to eat) are common in English. While it is possible for a noun to follow TO (walk to school, related to hunting), it is less common. Suppose the combined Brown and Switchboard corpora give us the following probabilities, showing that verbs are roughly fifteen times as likely as nouns after TO:

P(NN|TO) = .021
P(VB|TO) = .34

The second part of Equation 1.3 and of its instantiations, Equations 1.6 and 1.7, is the lexical likelihood: the likelihood of the word "race" given each tag, P(race|VB) and P(race|NN). Note that this likelihood term is not 'which is the most likely tag for this word'; that is, the likelihood term is not P(VB|race). Instead we are computing P(race|VB), which answers the question "if we were expecting a verb, how likely is it that this verb would be race?". Here are the lexical likelihoods from the combined Brown and Switchboard corpora:

P(race|NN) = .00041
P(race|VB) = .00003

If we multiply the lexical likelihoods by the tag sequence probabilities, we see that even the simple bigram version of the HMM tagger correctly tags "race" as a VB, despite the fact that this is the less likely sense of "race":

P(VB|TO) P(race|VB) = .00001
P(NN|TO) P(race|NN) = .000007
A human then goes through the output of this first phase and corrects any erroneously tagged words by hand. This corrected text is then submitted to the tagger, which learns correction rules by comparing the two sets of data. Several iterations of this process are sometimes necessary before the tagging model can achieve considerable performance. The transformation-based approach is similar to the rule-based approach in the sense that it depends on a set of rules for tagging. It initially assigns tags to words based on a stochastic method, e.g. the tag with the highest frequency for a particular word is assigned to that word. Then it applies the set of rules to the initially tagged data to generate the final output. It also learns new rules while applying the already-learnt rules, and can save the new rules if they prove effective, i.e. improve the performance of the model. Transformation-based tagging, sometimes called Brill tagging, is an instance of the Transformation-Based Learning (TBL) approach to machine learning (Brill, 1995), and draws inspiration from both the rule-based and stochastic taggers. Like the rule-based taggers, TBL is based on rules that specify what tags should be assigned to what words. But like the stochastic taggers, TBL is a machine learning technique, in which rules are automatically induced from the data. Like some but not all of the HMM taggers, TBL is a supervised learning technique; it assumes a pre-tagged training corpus. The TBL algorithm has a set of tagging rules. A corpus is first tagged using the broadest rule, i.e. the one that applies to the most cases. Then a slightly more specific rule is chosen, which changes some of the original tags. Next an even narrower rule is applied, which changes a smaller number of tags (some of which might be previously-changed tags).

How TBL rules are applied

Here we will look at some of the rules used by Brill's tagger. Before the rules apply, the tagger labels every word with its most likely tag.
We get these most-likely tags from a tagged corpus. For example, in the Brown corpus, race is most likely to be a noun:

P(NN|race) = .98
P(VB|race) = .02

This means that the two examples of race that we saw above will both be coded as NN. In the first case, this is a mistake, as NN is the incorrect tag:

is/VBZ expected/VBN to/TO race/NN tomorrow/NN    (1.8)

In the second case this race is correctly tagged as an NN:

the/DT race/NN for/IN outer/JJ space/NN    (1.9)

After selecting the most-likely tag, Brill's tagger applies its transformation rules. As it happens, Brill's tagger learned a rule that applies exactly to this mistagging of race:

Change NN to VB when the previous tag is TO

This rule would change race/NN to race/VB in exactly the following situation, since it is preceded by to/TO:

expected/VBN to/TO race/NN -> expected/VBN to/TO race/VB    (1.10)

How TBL rules are learned

Brill's TBL algorithm has three major stages. It first labels every word with its most-likely tag. It then examines every possible transformation, and selects the one that results in the most improved tagging. Finally, it re-tags the data according to this rule. These three stages are repeated until some stopping criterion is reached, such as insufficient improvement over the previous pass. Stage two requires that TBL know the correct tag of each word; i.e., TBL is a supervised learning algorithm. The output of the TBL process is an ordered list of transformations; these then constitute a 'tagging procedure' that can be applied to a new corpus. In principle the set of possible transformations is infinite, since we could imagine transformations such as "transform NN to VB if the previous word was 'IBM' and the word 'the' occurs between 17 and 158 words before that". But TBL needs to consider every possible transformation, in order to pick the best one on each pass through the algorithm. Thus the algorithm needs a way to limit the set of transformations.
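The procedure described in this section can be sketched in a few lines, assuming simple "change tag A to B when the previous tag is Z" rules; the data structures and toy corpus below are illustrative only, not Brill's actual implementation:

```python
# Toy sketch of TBL: apply_rules runs an ordered rule list over an initially
# tagged sentence; best_transformation implements stage two, picking the rule
# that most improves agreement with the gold tags.
from itertools import product

def apply_rules(tags, rules):
    """Apply each (from_tag, to_tag, prev_tag) rule, in order."""
    tags = list(tags)
    for a, b, z in rules:
        for i in range(1, len(tags)):
            if tags[i] == a and tags[i - 1] == z:
                tags[i] = b
    return tags

def best_transformation(current, gold, tagset):
    """Score every candidate rule by net error reduction; return the best."""
    best, best_gain = None, 0
    for a, b, z in product(tagset, repeat=3):
        if a == b:
            continue
        gain = sum(
            (1 if gold[i] == b else -1 if gold[i] == a else 0)
            for i in range(1, len(current))
            if current[i] == a and current[i - 1] == z
        )
        if gain > best_gain:
            best, best_gain = (a, b, z), gain
    return best

current = ["VBN", "TO", "NN", "NN"]   # most-likely tags for "expected to race tomorrow"
gold    = ["VBN", "TO", "VB", "NN"]   # correct tags
rule = best_transformation(current, gold, ["VBN", "TO", "NN", "VB"])
print(rule)                            # ('NN', 'VB', 'TO')
print(apply_rules(current, [rule]))    # ['VBN', 'TO', 'VB', 'NN']
```

A full implementation instantiates candidate rules from a small set of templates rather than enumerating every triple of tags, which is what keeps the search tractable.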
The set of transformations is limited by designing a small set of templates, i.e. abstracted transformations. Every allowable transformation is an instantiation of one of the templates. Figure 1.2 shows a set of templates.

The preceding (following) word is tagged z.
The word two before (after) is tagged z.
One of the two preceding (following) words is tagged z.
One of the three preceding (following) words is tagged z.
The preceding word is tagged z and the following word is tagged w.
The preceding (following) word is tagged z and the word two before (after) is tagged w.

Figure 1.2 Brill's templates. Each begins with 'Change tag a to tag b when:'. The variables a, b, z, and w range over parts of speech.

In practice, there are a number of ways to make the algorithm more efficient. For example, templates and instantiated transformations can be suggested in a data-driven manner; a transformation instance might only be suggested if it would improve the tagging of some specific word. The search can also be made more efficient by pre-indexing the words in the training corpus by potential transformation.

1.5 Other techniques

Apart from these, quite a few different approaches to tagging have been developed.

Support Vector Machines: This is a powerful machine learning method used for various applications in NLP and in other areas such as bioinformatics.

Neural Networks: These are potential candidates for the classification task since they learn abstractions from examples [Schmid 1994a].

Decision trees: These are classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging [Magerman 1995; Schmid 1994b]. We can use "Weka" to classify the ambiguous words.

Maximum entropy models: These avoid certain problems of statistical interdependence and have proven successful for tasks such as parsing and POS tagging.
Example-Based techniques: These techniques find the training instance that is most similar to the current problem instance and assume the same class for the new problem instance as for the similar one.

1.6 Motivation

The part of speech of a word gives a significant amount of information about the word and its neighbors. This is clearly true for major categories (verb versus noun), but it is also true for many finer distinctions. For example, these tagsets distinguish between possessive pronouns (my, your, his, her, its) and personal pronouns (I, you, he, me). Knowing whether a word is a possessive pronoun or a personal pronoun can tell us what words are likely to occur in its vicinity (possessive pronouns are likely to be followed by a noun, personal pronouns by a verb). This can be useful in a language model for speech recognition. A word's part of speech can also tell us something about how the word is pronounced. The word content, for example, can be a noun or an adjective, and the two are pronounced differently. Parts of speech can also be used in stemming for Information Retrieval (IR), since knowing a word's part of speech can help tell us which morphological affixes it can take. They can also help an IR application by helping select out nouns or other important words from a document. Automatic part-of-speech taggers can help in building automatic word-sense disambiguation algorithms, and POS taggers are also used in advanced ASR language models such as class-based N-grams. Parts of speech are very often used for 'partial parsing' of texts, for example for quickly finding names or other phrases for information extraction applications. Corpora that have been marked for part of speech are very useful for linguistic research, for example to help find instances or frequencies of particular constructions in large corpora.
Apart from these, many Natural Language Processing (NLP) activities such as Machine Translation (MT), Word Sense Disambiguation (WSD) and Question Answering (QA) systems depend on part-of-speech tagging. For part-of-speech tagging:

i). The problem is clearly defined and well understood.
ii). The task is relatively easy but hard enough. A "good" performance can already be achieved with very simple methods, but a perfect system ultimately requires complete understanding (it is said to be "AI-hard").
iii). Evaluation methods and comparison measures are available, which makes different machine learning approaches feasible.
iv). Because only a relatively simple annotation scheme is required, large corpora are available.

2) LITERATURE SURVEY

Part-of-speech tagging is the act of assigning each word in a sentence a tag that describes how that word is used in the sentence. Typically, these tags indicate syntactic categories, such as noun or verb, and occasionally include additional feature information, such as number (singular or plural) and verb tense. A large number of current language processing systems use a part-of-speech tagger for pre-processing. The tagger assigns a part-of-speech tag to each token in the input and passes its output to the next processing level, usually a parser. For both applications, a tagger with the highest possible accuracy is required. Recent comparisons of approaches that can be trained on corpora have shown that in most cases statistical approaches yield better results than finite-state, rule-based, or memory-based taggers. They are only surpassed by combinations of different systems, forming a "voting tagger". The tagger comparison was organized as a "black box test": set the same task to every tagger and compare the outcomes. The authors in [1] describe the models and techniques used by TnT together with the implementation. The result of the tagger comparison seems to support the saying "the simplest is the best".
Part-of-speech tagging is also a very practical application, with uses in many areas, including speech recognition and generation, machine translation, parsing, information retrieval and lexicography. Tagging can be seen as a prototypical problem in lexical ambiguity; advances in part-of-speech tagging could readily translate to progress in other areas of lexical, and perhaps structural, ambiguity, such as word sense disambiguation and prepositional phrase attachment disambiguation. Also, it is possible to cast a number of other useful problems as part-of-speech tagging problems, such as letter-to-sound translation [2] and building pronunciation networks for speech recognition. Recently, a method has been proposed for using part-of-speech tagging techniques as a method for parsing with lexicalized grammars. When automated part-of-speech tagging was initially explored [3], people manually engineered rules for tagging, sometimes with the aid of a corpus. As large corpora became available, it became clear that simple Markov-model based stochastic taggers that were automatically trained could achieve high rates of tagging accuracy. Markov-model based taggers assign to a sentence the tag sequence that maximizes P(word | tag) * P(tag | previous n tags). These probabilities can be estimated directly from a manually tagged corpus. These stochastic taggers have a number of advantages over the manually built taggers, including eliminating the need for laborious manual rule construction, and possibly capturing useful information that may not have been noticed by the human engineer. However, stochastic taggers have the disadvantage that linguistic information is captured only indirectly, in large tables of statistics. All recent work in developing automatically trained part-of-speech taggers has been on further exploring Markov-model based tagging. Several different approaches have been used for building text taggers.
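Before surveying those approaches, the Markov-model scoring just described (maximizing P(word | tag) * P(tag | previous tag) over the sentence) can be sketched as a tiny Viterbi decoder; the probability tables below reuse the race example from Chapter 1 and are invented for illustration only:

```python
# A minimal bigram Viterbi sketch: finds the tag path maximizing the product
# of emission probabilities P(word | tag) and transitions P(tag | prev tag).

def viterbi(words, tags, trans, emit, init):
    """trans[(prev, t)], emit[(t, w)], init[t] are probabilities; returns the best path."""
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (init.get(t, 0.0) * emit.get((t, words[0]), 0.0), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            score, path = max(
                (best[p][0] * trans.get((p, t), 0.0), best[p][1]) for p in tags
            )
            new[t] = (score * emit.get((t, w), 0.0), path + [t])
        best = new
    return max(best.values())[1]

tags = ["TO", "VB", "NN"]
init = {"TO": 1.0}
trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emit = {("TO", "to"): 1.0, ("VB", "race"): 0.00003, ("NN", "race"): 0.00041}
print(viterbi(["to", "race"], tags, trans, emit, init))  # ['TO', 'VB']
```

A production tagger would work in log space to avoid underflow on long sentences, but the search is the same.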
Greene and Rubin used a rule-based approach in the TAGGIT program [4], which was an aid in tagging the Brown corpus [5]. TAGGIT disambiguated 77% of the corpus; the rest was done manually over a period of several years. Koskenniemi also used a rule-based approach implemented with finite-state machines. Statistical methods have also been used (e.g., [6]). These provide the capability of resolving ambiguity on the basis of the most likely interpretation. A form of Markov model has been widely used that assumes that a word depends probabilistically on just its part-of-speech category, which in turn depends solely on the categories of the preceding two words (in the case of a trigram model). Two types of training (i.e., parameter estimation) have been used with this model. The first makes use of a tagged training corpus. Derouault and Merialdo use a bootstrap method for training [7]. At first, a relatively small amount of text is manually tagged and used to train a partially accurate model. The model is then used to tag more text, and the tags are manually corrected and then used to retrain the model. Church uses the tagged Brown corpus for training [8]. These models involve probabilities for each word in the lexicon, so large tagged corpora are required for reliable estimation. The second method of training does not require a tagged training corpus. In this situation the Baum-Welch algorithm (also known as the forward-backward algorithm) can be used [9]. Under this system the model is called a hidden Markov model (HMM), as state transitions (i.e., part-of-speech categories) are assumed to be unobservable. Jelinek has used this method for training a text tagger [10]. Parameter smoothing can be conveniently achieved using the method of 'deleted interpolation', in which weighted estimates are taken from second- and first-order models and a uniform probability distribution.
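The interpolation scheme just described can be sketched as follows; the lambda weights and tiny tables are invented for illustration (in deleted interpolation they would be tuned on held-out data):

```python
# A minimal sketch of interpolated smoothing: the smoothed transition
# probability is a weighted sum of second-order, first-order and uniform
# estimates, so unseen contexts never receive zero probability.

def smoothed(tag, prev2, prev1, trigram, bigram, n_tags, lambdas=(0.6, 0.3, 0.1)):
    """P(tag | prev2, prev1) as a weighted mix of three estimates."""
    l2, l1, l0 = lambdas  # assumed to sum to 1
    return (l2 * trigram.get((prev2, prev1, tag), 0.0)   # second-order estimate
            + l1 * bigram.get((prev1, tag), 0.0)         # first-order estimate
            + l0 / n_tags)                               # uniform distribution

# With no trigram evidence, the estimate backs off to the bigram and uniform terms:
print(smoothed("VB", "VBN", "TO", {}, {("TO", "VB"): 0.34}, n_tags=30))
```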
Kupiec used word equivalence classes (referred to here as ambiguity classes), based on parts of speech, to pool data from individual words. The most common words are still represented individually, as sufficient data exist for robust estimation. All other words are represented according to the set of possible categories they can assume. In this manner, the vocabulary of 50,000 words in the Brown corpus can be reduced to approximately 400 distinct ambiguity classes [11]. To further reduce the number of parameters, a first-order model can be employed (this assumes that a word's category depends only on the immediately preceding word's category). In [12], networks are used to selectively augment the context in a basic first-order model, rather than using uniformly second-order dependencies. In the linguistic approach, an expert linguist is needed to formalize the restrictions of the language. This implies a very high cost, and it is very dependent on each particular language. We can find an important contribution that uses the Constraint Grammar formalism. Supervised learning methods were proposed in [14] to learn a set of transformation rules that repair the errors committed by a probabilistic tagger. The main advantage of the linguistic approach is that the model is constructed from a linguistic point of view and contains many and complex kinds of knowledge. In the learning approach, the most extended formalism is based on n-grams. In this case, the language model can be estimated from a labeled corpus (supervised methods) [15] or from a non-labeled corpus (unsupervised methods) [16]. In the first case, the model is trained from the relative observed frequencies. In the second one, the model is learned using the Baum-Welch algorithm from an initial model which is estimated using labeled corpora [17].
The advantages of the unsupervised approach are the facility to build language models, the flexibility in the choice of categories and the ease of application to other languages. We can find some other machine-learning approaches that use more sophisticated LMs, such as decision trees, memory-based approaches that learn special decision trees, maximum entropy approaches that combine statistical information from different sources [18], finite state automata inferred using grammatical inference, etc. The comparison among different approaches is difficult due to the multiple factors that can be considered: the language, the number and type of the tags, the size of the vocabulary, the ambiguity, the difficulty of the test set, etc. Part-of-speech tagging is an important research topic in Natural Language Processing (NLP). Taggers are often preprocessors in NLP systems, making accurate performance especially important. Much research has been done to improve tagging accuracy using several different models and methods. Most NLP applications demand, at their initial stages, shallow linguistic information (e.g., part-of-speech tagging, base phrase chunking, named entity recognition). This information may be predicted fully automatically (at the cost of some errors) by means of sequential tagging over unannotated raw text. Generally, tagging is required to be as accurate as possible, and as efficient as possible. But, certainly, there is a trade-off between these two desirable properties, because obtaining higher accuracy relies on processing more and more information, digging deeper and deeper into it. However, sometimes, depending on the kind of application, a loss in efficiency may be acceptable in order to obtain more precise results. Or, the other way around, a slight loss in accuracy may be tolerated in favour of tagging speed. Some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns.
Also, the tagset size and ambiguity rate may vary from language to language and from problem to problem. Besides, if few data are available for training, the proportion of unknown words may be huge. Sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape. Another very interesting property for sequential taggers is their portability. Multilingual information is a key ingredient in NLP tasks such as Machine Translation, Information Retrieval, Information Extraction, Question Answering and Word Sense Disambiguation, just to name a few. Therefore, having a tagger that works equally well for several languages is crucial for system robustness. Besides, quite often for some languages, but also in general, lexical resources are hard to obtain. Therefore, ideally a tagger should be capable of learning from little (or even no) annotated data. The SVMTool [29] is intended to comply with all the requirements of modern NLP technology, by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework, and by offering NLP researchers a highly customizable sequential tagger generator. In the recent literature, several approaches to POS tagging based on statistical and machine learning techniques have been applied, including among many others: Hidden Markov Models [15], Maximum Entropy taggers [18], Transformation-based learning [14], Memory-based learning [19], Decision Trees [20], AdaBoost [21], and Support Vector Machines [22]. Most of the previous taggers have been evaluated on the English WSJ corpus, using the Penn Treebank set of POS categories and a lexicon constructed directly from the annotated corpus.
Although the evaluations were performed with slight variations, there was a wide consensus in the late 90's that the state-of-the-art accuracy for English POS tagging was between 96.4% and 96.7%. In recent years, the most successful and popular taggers in the NLP community have been the HMM-based TnT tagger, the Transformation-based learning (TBL) tagger [14], and several variants of the Maximum Entropy (ME) approach [18]. TnT is an example of a really practical tagger for NLP applications. It is available to anybody, simple and easy to use, considerably accurate, and extremely efficient, allowing training from 1-million-word corpora in just a few seconds and tagging thousands of words per second. In the case of the TBL and ME approaches, the great success has been due to the flexibility they offer in modeling contextual information, with ME slightly more accurate than TBL. Far from considering it a closed problem, several researchers have tried to improve results on the POS tagging task in recent years: some by allowing richer and more complex HMM models [23], others [24] by enriching the feature set in an ME tagger, and others [22] by using more effective learning techniques: SVMs, and a Voted-Perceptron-based training of an ME model. In these more complex taggers the state-of-the-art accuracy was raised to 96.9%-97.1% on the same WSJ corpus. In a complementary direction, other researchers have suggested the combination of several pre-existing taggers under several alternative voting schemes [25]. Although the accuracy of these taggers is even better (around 97.2%), ensembles of POS taggers are undeniably more complex and less efficient. Many natural language tasks require the accurate assignment of Part-Of-Speech (POS) tags to previously unseen text.
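The common core of the discriminative approaches just mentioned (SVM, voted perceptron, ME) is a linear score over context features. That core can be sketched with a plain perceptron; the features, toy data and tagset below are invented for illustration and are not SVMTool's actual feature model:

```python
# A bare-bones feature-based tagger: a perceptron learns per-tag weights over
# simple context features (current word, previous tag, word suffix).

def features(words, i, prev_tag):
    return [("w", words[i]), ("prev_tag", prev_tag), ("suffix", words[i][-2:])]

def score(weights, feats):
    return sum(weights.get(f, 0.0) for f in feats)

def train(sentences, tags, epochs=5):
    """sentences: list of (words, gold_tags) pairs. Returns per-tag weights."""
    w = {t: {} for t in tags}
    for _ in range(epochs):
        for words, gold in sentences:
            prev = "<s>"
            for i, g in enumerate(gold):
                feats = features(words, i, prev)
                pred = max(tags, key=lambda t: score(w[t], feats))
                if pred != g:  # standard perceptron update on errors
                    for f in feats:
                        w[g][f] = w[g].get(f, 0.0) + 1.0
                        w[pred][f] = w[pred].get(f, 0.0) - 1.0
                prev = g  # condition on the gold previous tag during training
    return w

def tag(words, w, tags):
    """Greedy left-to-right tagging with the learned weights."""
    prev, out = "<s>", []
    for i in range(len(words)):
        feats = features(words, i, prev)
        prev = max(tags, key=lambda t: score(w[t], feats))
        out.append(prev)
    return out

tagset = ["TO", "VB", "DT", "NN"]
train_data = [(["to", "race"], ["TO", "VB"]), (["the", "race"], ["DT", "NN"])]
w = train(train_data, tagset)
print(tag(["to", "race"], w, tagset))   # ['TO', 'VB']
print(tag(["the", "race"], w, tagset))  # ['DT', 'NN']
```

SVM and ME taggers replace this training rule with margin maximization or likelihood maximization respectively, but keep the same feature-based architecture.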
Due to the availability of large corpora which have been manually annotated with POS information, many taggers use annotated text to "learn" either probability distributions or rules and use them to automatically assign POS tags to unseen text. Several recent papers [14] have reported 96.5% tagging accuracy on the Wall St. Journal corpus. A Maximum Entropy model is well-suited for such experiments since it combines diverse forms of contextual information in a principled manner, and does not impose any distributional assumptions on the training data. Previous uses of this model include language modeling [19], machine translation [26], prepositional phrase attachment [18], and word morphology.

Previous work on Parts of Speech Tagging

For English, there are many POS taggers, employing machine learning techniques such as Hidden Markov Models (Brants, 2000), transformation-based error-driven learning (Brill, 1995), decision trees (Black, 1992), maximum entropy methods (Ratnaparkhi, 1996), and conditional random fields (Lafferty et al., 2001). Some of the techniques proposed for chunking in English are based on Support Vector Machines (Kudoh et al., 2001) and Winnow (Zhang et al., 2002). The POS taggers reach anywhere between 92-97% accuracy, and chunkers have reached approximately 94% accuracy. However, these accuracies are aided by the availability of a large annotated corpus for English. As mentioned above, due to the lack of annotated corpora, previous research in POS tagging and chunking in Indian languages has mainly focused on rule-based systems utilizing the morphological analysis of word-forms. A. Bharati et al. (1995), in their work on a computational Paninian POS parser, described a technique where POS tagging is implicit and is merged with the parsing phase. More recently, Smriti et al.
(2006) proposed a POS tagger for Hindi which uses an annotated corpus (15,562 words collected from the BBC Hindi News site), exhaustive morphological analysis backed by a high-coverage lexicon, and a decision tree based learning algorithm (CN2). They reach an accuracy of 93.45% for Hindi with a tagset of 23 POS tags. For Bengali, Sandipan et al. (2004) developed a corpus-based semi-supervised learning algorithm for POS tagging based on HMMs. Their system uses a small tagged corpus (500 sentences) and a large unannotated corpus along with a Bengali morphological analyzer. When tested on a corpus of 100 sentences (1003 words), their system obtained an accuracy of 95%. A. Singh et al. (2005) proposed an HMM based chunker for Hindi with an accuracy of 91.7%. They used HMMs trained on a four-tag scheme (STRT, CNT, STP, STRT STP) with POS tag information and converted it into a two-tag (STRT, CNT) scheme while testing for chunk boundary identification. They however used a rule-based system for chunk label identification. Annotated data of 150,000 words was used for training, and the chunker was tested on 20,000 words with POS tags which were manually annotated. To the best of our knowledge, these are the reported works on POS tagging and chunking for Indian languages till the NLPAI Machine Learning Contest (2006) was held in the summer of 2006. For the contest, participants had to train on a set of training data for a chosen language provided by the contest organizers. The systems, thus trained, were to automatically mark POS and chunk information on the test data of the chosen language. Chunk annotated data wasn't released for Bengali and Telugu. Sandipan and Sudeshna (2006) achieved an accuracy of 84.34% for Bengali POS tagging using semi-supervised learning combined with a Bengali morphological analyzer. A. Dalal et al. (2006) achieved accuracies of 82.22% and 82.4% for Hindi POS tagging and chunking respectively using maximum entropy models. Karthik et al.
(2006) got 81.59% accuracy for Telugu POS tagging using HMMs. Sivaji et al. (2006) came up with a rule-based chunker for Bengali which gave an accuracy of 81.64%. The training data for all three languages contained approximately 20,000 words, and the testing data had approximately 5,000 words. The experiences from organizing the NLPAI ML Contest prompted us to hold a contest for Shallow Parsing (POS tagging and chunking) where the participants had to develop systems for POS tagging and chunking across Indian languages using the same learning technique. The next section formally defines the task and the data released as part of the contest.

CLAWS part-of-speech tagger for English

Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed by UCREL (University Centre for Computer Corpus Research on Language) at Lancaster. The POS tagging software for English text, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s. The latest version of the tagger, CLAWS4, was used to POS tag c. 100 million words of the British National Corpus (BNC). CLAWS has consistently achieved 96-97% accuracy (the precise degree of accuracy varying according to the type of text). Judged in terms of major categories, the system has an error rate of only 1.5%, with c. 3.3% of ambiguities unresolved, within the BNC. More detailed analysis of the error rates for the C5 tagset in the BNC can be found in the BNC manual.

Parts of Speech Tagging in Tamil

A parts of speech tagging scheme tags a word in a sentence with its part of speech. It is done in three stages: pre-editing, automatic tag assignment, and manual post-editing [24]. In pre-editing, the corpus is converted to a suitable format in order to assign a part of speech tag to each word or word combination.
Because of orthographic similarity, one word may have several possible POS tags. After the initial assignment of possible POS tags, words are manually corrected to disambiguate words in texts.

1. Vasu Ranganathan's Tagtamil: Tagtamil by Vasu Ranganathan is based on a lexical phonological approach. Tagtamil handles the morphotactics of morphological processing of verbs by using an index method. Tagtamil does both tagging and generation.

2. Ganesan's POS tagger: Ganesan has prepared a POS tagger for Tamil. His tagger works well on the CIIL Corpus; its efficiency on other corpora has yet to be tested. He has a rich tagset for Tamil. He tagged a portion of the CIIL corpus by using a dictionary as well as a morphological analyzer, corrected it manually, and trained on the rest of the corpus with it. The tags are added morpheme by morpheme:

pukkaLai: puu_N_PL_AC
vandtavan: va_IV_ndt_PT_avan_3PMS

3. Kathambam of RCILTS-Tamil: Kathambam attaches parts of speech tags to the words of a given Tamil document. It uses heuristic rules based on Tamil linguistics for tagging and does not use either a dictionary or a morphological analyzer. It gives 80% efficiency for large documents. It uses 12 heuristic rules. It identifies the tags based on PNG, tense and case markers. Standalone words are checked against lists stored in the tagger. It uses a 'fill-in rule' to tag unknown words: it uses bigrams and identifies an unknown word's category from the previous word's category.

3) TAMIL POS TAGGING

3.1). TAMIL LANGUAGE

Tamil (or Tamizh) is one of the most widely spoken languages in South Asia. It belongs to the Dravidian family of languages. It has a long history of literary tradition dating back to 200 BC and is spoken in the state of Tamilnadu in India, in Sri Lanka, Singapore and Malaysia, and in small numbers in other parts of the world. Because of its long roots and unbroken literary tradition, it was declared a Classical Language in 2004 by the Govt. of India. Tamil is a diglossic language.
This means that there is a large disparity between the written form of the language and the spoken form. These differences include grammatical differences, vocabulary differences, and pronunciation differences. Tamil has a variety of dialects. Even within Tamilnadu, there are numerous dialects that vary widely based on the geography and communities of the speakers. Tamil also exhibits diglossia between its formal and classic variety called Centamil (or centamizh) and its colloquial variety that includes a number of spoken dialects of Tamil. Centamil has been used for official, governmental purposes, to write and present reports, news articles, etc.

3.1.1). Alphabets

Tamil is a phonetic language. The script for Tamil evolved from the Brahmi script of the 3rd century BC. The Tamil alphabet consists of 12 vowels (uyir ezhuththu) and 18 consonants (mei ezhuththu). These combine to make 216 vowel-consonant compounds (uyirmei ezhuththu). There is also one special character called aaytha ezhuththu. Altogether, these make 247 characters in the Tamil alphabet. The vowels can be divided into three categories: long (netil), short (kuRil) and diphthongs. There are 5 long and 5 short vowels and two diphthongs. The consonants are classified under three groups: hard (vallinam), nasal (mellinam) and medium (idaiyinam). In addition, Tamil borrows letters from the Grantha script to represent borrowed Sanskrit words.

Figure 3.1: Tamil alphabets with English mapping

The following is the classification of the alphabet:

Long vowels: A, I, O, U, E
Short vowels: a, i, o, u, e
Diphthongs: au, ai
Vallinam: k, c, t, th, p, r
Mellinam: nj, ng, n, N, w, m
Idayinam: y, R, l, v, zh, L

3.1.2). Tamil Grammar

The grammar of any language can be broadly divided into phonology, morphology and syntax. Phonology is the study of sound structure in language. Morphology studies words and their construction. Syntax deals with how to put words together in some order to make meaningful sentences.
We will talk more about morphology here as it is more relevant to the project [28]. Tamil morphology is very rich; Tamil is an agglutinative language, much like the other Dravidian languages. Tamil words are made up of lexical roots followed by one or more affixes. The lexical roots and the affixes are the smallest meaningful units and are called morphemes. Tamil words are therefore made up of morphemes concatenated to one another in a series. The first one in the construction is always a lexical morpheme (lexical root). This may or may not be followed by other functional or grammatical morphemes. For instance, the word 'books' in English can be meaningfully divided into 'book' and 's':

books = book + s

In this example, 'book' is the lexical root, representing a real-world entity, and 's' is the plural feature marker (suffix). 's' is a grammatical morpheme that is bound to the lexical root to add plurality. Unlike English, Tamil words can have a long sequence of morphemes. For instance:

puththakangkaLai = puththakam (book) + kaL (plural) + ai (acc. case marker)

Tamil nouns can take case suffixes after the plural marker. They can also have postpositions after that. More of these details are discussed later. Words can be analyzed as above by identifying the constituent morphemes, and their features can then be identified. Identifying the constituent morphemes also leads to identifying the Part-of-Speech (POS) of every word in the sentence, information which is further used in syntax analysis.

3.1.3). Part of Speech categories

Words can be classified under various parts of speech classes based on the role they play in the sentence. The following are the main POS classes in Tamil:

1. Nouns
2. Verbs
3. Adjectives
4. Adverbs
5. Determiners
6. Post Positions
7. Conjunctions
8. Quantifiers

Out of these, only Nouns and Verbs can be inflected. Words in the other classes occur only in their root forms.

3.1.4).
Other POS categories

Apart from nouns and verbs, the other open-class POS categories are adjectives and adverbs. Most adjectives and adverbs can be placed in the lexicon in their root forms, but some adjectives and adverbs are derived from noun roots and verb stems. The morphotactics of adjectives derived from noun roots and verb stems are:

noun_root + adjective_suffix, e.g. uyaram + Ana = uyaramAna <ADJ>
verb_stem + relative_participle, e.g. cey + tha = ceytha <VNAJ>

The morphotactics of adverbs derived from noun roots and verb stems are:

noun_root + adverb_suffix, e.g. uyaram + Aka = uyaramAka <ADV>
verb_stem + adverbial_participle, e.g. cey + tu = ceythu <VNAV>

There are a number of non-finite verb forms in Tamil. Apart from the participle forms, they are grammatically classified into structures such as the infinitive, conditional, etc.

Example of the infinitive form: paRgga <VINT> (to fly)
Example of the conditional form: vawthAl <CVB> (if one comes)

There are other categories, like conjunctions, complementizers, etc. Some of these may be derived forms, but there are not many, so they can be listed in the lexicon. Other categories that need to be listed in the lexicon as roots are postpositions, which are free forms, and auxiliary verbs. This is because they can occur as words in isolation even though they are semantically bound to the noun or verb preceding them.

3.1.5) Ambiguity of Roots

Roots can also be ambiguous: they can have more than one sense, and sometimes belong to more than one POS category. Although the POS can sometimes be disambiguated using contextual information such as co-occurring morphemes, this is not always possible. These issues must be taken care of when morphological analyzers are built for the language. Other issues that were dealt with during the implementation are explained in the following sections. For a thorough discussion of Tamil morphology, refer to [28].

3.2)
COMPLEXITY IN TAMIL POS TAGGING

As Tamil is an agglutinative language, nouns are inflected for number and case, and verbs are inflected for tense, person, number and gender through suffixes. Verbs are adjectivalized and adverbialized, and verbs and adjectives are nominalized by means of certain nominalizers. Adjectives and adverbs do not inflect. Many postpositions in Tamil [20] are from nominal and verbal sources, so we often need to depend on syntactic function or context to decide whether a word is a noun, adjective, adverb or postposition. This is what makes POS tagging for Tamil complex.

3.2.1) Noun complexity

Nouns are words which denote a person, place, thing, time, etc. In Tamil, nouns are inflected for number and case at the morphological level (3.1); at the phonological level, however, four types of suffixes can occur with a noun stem (3.2).

Noun (+ number) (+ case)
Ex: (3.1) pook-kaL-ai <NN>
flower-plural-accusative case suffix

Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Ex: (3.2) pook-kaL-in-Al <NN>
flower-plural-euphonic suffix-case suffix

Nouns further need to be annotated as common noun, compound noun, proper noun, compound proper noun, pronoun, cardinal and ordinal. Pronouns need to be further annotated as personal, interrogative and indefinite pronouns. Ambiguity arises between common noun and compound noun, and likewise between proper noun and compound proper noun. A common noun can also occur as part of a compound noun. For example, when UrAcci and thalaivar come together, each is tagged as part of a compound noun (<NNC>), but when UrAcci or thalaivar occurs separately in a sentence, it should be tagged as a common noun (<NN>). The same ambiguity occurs with proper nouns <NNP> and compound proper nouns <NNPC>. Moreover, at the syntactic level there is ambiguity between noun and adverb, and between pronoun and emphasis.
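The noun structure in (3.1)-(3.2) lends itself to simple suffix stripping. The following is a toy analyzer for the romanized examples above; the suffix inventory and the function name are ours and purely illustrative, not a full treatment of Tamil noun morphology (which would also need sandhi handling and the oblique slot):

```python
# Toy suffix-stripping analyzer for the romanized noun examples above,
# e.g. pook-kaL-ai "flower-plural-accusative". The suffix lists are
# illustrative only; real Tamil morphology needs sandhi handling.
CASE_SUFFIXES = ["ai", "Al", "ku", "il"]  # toy set: acc., instr., dat., loc.
PLURAL = "kaL"
EUPHONIC = "in"  # euphonic increment, as in pook-kaL-in-Al

def analyze_noun(word):
    """Strip case, euphonic and plural suffixes off a noun, right to left."""
    parts = []
    for case in CASE_SUFFIXES:
        if word.endswith(case):
            word = word[:-len(case)]
            parts.append(("case", case))
            break
    if word.endswith(EUPHONIC):
        word = word[:-len(EUPHONIC)]
        parts.append(("euphonic", EUPHONIC))
    if word.endswith(PLURAL):
        word = word[:-len(PLURAL)]
        parts.append(("plural", PLURAL))
    parts.append(("stem", word))
    return list(reversed(parts))

print(analyze_noun("pookkaLai"))
# [('stem', 'pook'), ('plural', 'kaL'), ('case', 'ai')]
```

The same call on "pookkaLinAl" recovers the euphonic increment as well, mirroring structure (3.2).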
3.2.2) Verb complexity

Verbal forms are complex in Tamil. A finite verb shows the following morphological structure:

verb stem + tense + person-number + gender
Ex: Wata + nt + En <VF> 'I walked'

A number of non-finite forms are possible: adverbial forms (3.3), adjectival forms (3.4), infinitive forms (3.5) and conditional forms (3.6).

verb_stem + adverbial participle
(3.3) cey + tu = ceythu <VNAV> 'having done'

verb_stem + relative participle
(3.4) cey + tha = ceytha <VNAJ> 'who did'

verb_stem + infinitive suffix
(3.5) azu + a = aza <VINT> 'to weep'

verb_stem + conditional suffix
(3.6) kEL + Al = kEttAl <CVB> 'if asked'

A distinction needs to be made between a main verb followed by another main verb and a main verb followed by an auxiliary verb. A main verb followed by an auxiliary must be interpreted together, whereas a main verb followed by another main verb must be interpreted separately. This leads to functional ambiguity, as shown below.

Functional ambiguity in the adverbial <VNAV> form

The morphological structure of the adverbial verb is verb root + adverbial participle, e.g. cey + tu = ceythu <VNAV> 'having done'.

Vandtu <VNAV> caappiTTuviTTu <ADV> poo <VF> 'having come and having eaten, went'
wondtu <VNAV> poo 'become vexed'

Functional ambiguity in the adjectival <VNAJ> form

The adjectival <VNAJ> forms differ by tense marking: verb stem + tense + adjectivalizer.

vandta 'x who came'
varukiRa 'x who comes'
varum 'x who will come'

The adjectival <VNAJ> form allows several interpretations, as in the following examples.

cappiTTa ilai 'the leaf which is eaten by x' / 'the leaf on which x had his food'
vaangkiya <VNAJ> x 'x which is bought' / 'x who bought' / 'x (price) by which something is bought' / 'x (money) received' / 'x (container) in which something is received'

The um-suffixed adjectival form clashes with other homophonous forms, which leads to ambiguity.
varum <VNAJ> paiyan 'the boy who will come'
varum <VF> 'it will come'
varum pootu <VNAV> 'while coming'

Functional ambiguity in the infinitival <VINT> form

verb_stem + infinitive suffix, e.g. azu + a = aza <VINT>

vara.v-iru 'going to come'
vara-k.kuuTaatu 'should not come'
vara-c.col 'ask x to come'

3.2.3) COMPLEXITY IN ADVERBS

We have seen that a number of adjectival and adverbial forms of verbs are lexicalized as adjectives and adverbs respectively and clash semantically with the corresponding sentential adjectival and adverbial forms, creating ambiguity in POS tagging [20]. Adverbs, too, need to be distinguished based on their source category. Many adverbs are derived by suffixing aaka to nouns in Tamil, but not all aaka-suffixed forms are adverbial:

veekam-aaka 'fast' vs. TAkTar-Aka 'as a doctor'

A functional clash between noun and adverb can be seen in aaka-suffixed forms. This type of clash is seen in other Dravidian languages too.

avaL azhakaaka irukkiRaaL
she beauty_ADV be_PRE_she
'she is beautiful'

3.2.4) COMPLEXITY IN POSTPOSITIONS

Postpositions in Tamil come from various categories, such as verbal, nominal and adverbial. Often the line demarcating a verb/noun/adverb from a postposition is thin, leading to ambiguity. Some postpositions are simple and some are compound. Postpositions are conditioned by the case-inflected nouns they follow, so simply tagging one form as a postposition would be misleading. Some postpositions come after nouns as well as after verbs, which makes them ambiguous (spatial vs. temporal):

pinnaal <PPO> 'behind', as in viiTTukkkup pinnaal 'behind the house'
pinnaal <ADV> 'after', as in avanukkup pinnaal vandtaan 'he came after him'

3.3) Developing a new tagset

To develop a corpus, it is necessary to define the tags (POS tags) used in that corpus. The collection of all possible tags is called a tagset. Tagsets differ from language to language. For Tamil, some tagsets are available.
We studied some of the tagsets available for Tamil and for other languages. After considering all the possibilities, we created a new tagset, named the AMRITA tagset, because the available tagsets contain very large numbers of tags. While creating the AMRITA tagset, we followed the guidelines from AnnCorra, IIIT Hyderabad [26]:

1. The tags should be simple.
2. Simplicity should be maintained for ease of learning and consistency in annotation.
3. POS tagging is not a replacement for a morphological analyzer.
4. A word in a text carries a grammatical category and grammatical features such as gender, number and person. The POS tag should be based on the category of the word; the features can be acquired from the morphological analyzer.

3.3.1) TAGSETS REVIEWED

AUKBC Tagset
This tagset was created by the AUKBC Research Centre, Chennai, with the help of eminent linguists from Tamil University, Tanjore. It is an exhaustive tagset which covers all possible grammatical and lexical constituents. It contains 68 tags.

IIIT Hyderabad Tagset
This POS tagset for Indian languages was developed by IIIT Hyderabad. Its tags encode coarse linguistic information, with the idea of expanding to finer distinctions if required. The annotation standard includes 26 tags.

CIIL Tagset for Tamil
This tagset was developed by the Central Institute of Indian Languages (CIIL), Mysore. It contains 71 tags for Tamil; the number of tags is high because noun and verb inflections are distinguished. It has about 30 noun forms, including pronoun categories, and 25 verb forms, including participle forms.

CIIL-KHS Hindi Tagset
This tagset was also developed by CIIL, Mysore. It contains 36 tags for Hindi.

3.3.2) AMRITA TAGSET

The main drawback of the other tagsets (AUKBC, CIIL) is that they encode verb and noun inflections, so at tagging time every inflected word in the corpus must be split, which is a laborious process.
At the POS level we only want to determine a word's POS category, which can be done with a limited number of tags; the inflectional analysis can be handled by a morphological analyzer, so there is no need for a large number of tags. Moreover, a large number of tags leads to more complexity, which in turn reduces tagging accuracy. Considering the complexity of Tamil POS tagging, and after studying various tagsets, we developed our own tagset (the AMRITA tagset); see Table 3.1. Our tagset contains 32 tags and does not encode inflections. In our tagset we use compound tags only for nouns (NNC) and proper nouns (NNPC), and we use the tag VBG for verbal nouns and participle nouns.

Table: 3.1 AMRITA Tagset

3.4) Explanation of AMRITA POS tags

3.4.1) Noun tags

Nouns are words which denote a person, place, thing, time, etc. In Tamil, nouns are inflected for number and case at the morphological level (3.7); at the phonological level, however, four types of suffixes can occur with a noun stem (3.8).

(3.7) Noun (+ number) (+ case)
e.g. pook-kaL-ai <NN>
flower-plural-accusative case suffix

(3.8) Noun (+ number) (+ oblique) (+ euphonic) (+ case)
e.g. pook-kaL-in-Al <NN>
flower-plural-euphonic suffix-case suffix

As mentioned earlier, distinct tags based on grammatical information are avoided, since plurality and case suffixation can be obtained from a morphological analyzer. This brings the number of tags down and helps achieve simplicity, consistency and better machine learning. Therefore we use only two tags for nouns, common noun <NN> and compound common noun <NNC>, without making any distinction based on the grammatical information contained in a given noun word.
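For bookkeeping, the 32 AMRITA tags can be collected into a small set and used to validate annotated tokens. This is only a sketch: the helper name is ours, and the tag list follows Table 3.1 (and the counts in Table 4.1).

```python
# The 32 tags of the AMRITA tagset (see Table 3.1).
AMRITA_TAGS = {
    "ADJ", "ADV", "CNJ", "COM", "COMM", "CRD", "CVB", "DET",
    "DOT", "ECH", "EMP", "INT", "NN", "NNC", "NNP", "NNPC",
    "NNQ", "ORD", "PPO", "PRID", "PRIN", "PRP", "QM", "QTF",
    "QW", "RDW", "VAX", "VBG", "VF", "VINT", "VNAJ", "VNAV",
}

def check_tagged_token(token):
    """Validate a 'word<TAG>' annotation against the tagset."""
    word, _, tag = token.partition("<")
    tag = tag.rstrip(">")
    return tag in AMRITA_TAGS

assert len(AMRITA_TAGS) == 32
print(check_tagged_token("paRavai<NN>"))   # True
print(check_tagged_token("paRavai<XYZ>"))  # False
```

A check like this is useful for catching typos in manually tagged data before training.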
Examples of common nouns (NN):
paRavai <NN> 'bird'
paRavaigaL <NN> 'birds'
paRavaikku <NN> 'to a bird'
paRavaiyAl <NN> 'by a bird'

Examples of compound nouns (NNC):
UrAcchi <NNC> thalaivar <NNC> 'township leader'
Vanap <NNC> pakuthi <NNC> 'forest area'

Proper Nouns
Proper nouns are words which denote a particular person, place or thing. Unlike English, Indian languages have no specific orthographic marker for proper nouns. English proper nouns begin with a capital letter, which distinguishes them from common nouns. In Indian languages, all the words which occur as proper nouns can also occur as common nouns with a lexical meaning. For example, in English, John, Harry and Mary occur only as proper nouns, whereas in Tamil, thAmari, maNi, pissai, arasi, etc. are used as names while also belonging to grammatical categories of words with various senses. The following is a list of such Tamil words with their grammatical class and sense:

thAmari  noun  lotus
maNi     noun  bell
pissai   noun  beg
arasi    noun  queen

Any of the above words can occur in a text as a common noun or as a proper name. Therefore we use only two tags, proper noun <NNP> and compound proper noun <NNPC>.

Example of a proper noun (NNP):
raja <NNP> wERRu vawthAn. 'Raja came yesterday.'

Example of a compound proper noun (NNPC):
Abdthul <NNPC> kalAm <NNPC> inRu cennai varugiRAr. 'Abdul Kalam is coming to Chennai today.'

Cardinals
Any word denoting a cardinal number is tagged <CRD>.

Examples of cardinals <CRD>:
enakku 150 <CRD> rupAy vENtum. 'I need 150 rupees.'
mUnRu <CRD> wapargaL anggu amarwthiruggiRArgaL. 'Three people were sitting there.'

Ordinals
Ordinals are formed by adding the suffixes Am and Avathu. Expressions denoting ordinals are marked <ORD>.

Examples of ordinals <ORD>:
muthalAm <ORD> vaguppu
'first class'
12-Am <ORD> wURRANdu '12th century'

3.4.2) PRONOUNS

Pronouns are words that take the place of nouns; we use a pronoun in place of a noun so that we do not have to repeat the noun. Linguistically a pronoun is a variable, and functionally it is a noun, so tagging pronouns is helpful for anaphora resolution. We use three tags for pronouns: <PRP> for personal pronouns, <PRIN> for interrogative pronouns and <PRID> for indefinite pronouns.

Example of a personal pronoun (PRP):
avan <PRP> weRRu inggu vawthAn. 'He came here yesterday.'

Example of an interrogative pronoun (PRIN):
evan <PRIN> sonnAn. 'Which one (male) said it?'

Example of an indefinite pronoun (PRID):
yArrO <PRID> sonnArgaL 'someone said'

3.4.3) ADJECTIVE tags

Adjectives are noun modifiers. Modern Tamil has simple and derived adjectives. We use the tag <ADJ> for adjectives.

Examples of adjectives <ADJ>:
iwtha walla <ADJ> paiyan 'this nice boy'
oru azagAna <ADJ> pen 'a beautiful girl'

3.4.4) ADVERB tags

Adverbs are words which tell us more about verbs. Modern Tamil has simple and derived adverbs. We use the tag <ADV> for adverbs.

Examples of adverbs <ADV>:
Avan atiggati <ADV> vidumuRai etuththAn. 'He took leave frequently.'
kuthirai vEgamAga <ADV> Odiyathu. 'The horse ran fast.'

3.4.5) VERB tags

Verbs are action words, which can take tense suffixes, person-number-gender suffixes and a few other verbal suffixes. Tamil verb forms can be divided into finite and non-finite verbs.

FINITE VERB
A finite verb, as the predicate of the main clause, occurs at the end of the sentence.

Example of a finite verb <VF>:
avan paSyil vawthAn <VF> 'he came by bus'

NON-FINITE VERBS
Tamil distinguishes four types of non-finite verb forms [26]: the verbal participle <VNAV>, adjectival participle <VNAJ>, infinitive <VINT> and conditional <CVB>.
Example of a verbal participle <VNAV>:
wondtu <VNAV> poo 'become vexed'

Example of an adjectival participle <VNAJ>:
vawtha <VNAJ> paiyan 'the boy who came'

Example of an infinitive <VINT>:
un thalaiyil idii viza <VINT> 'may thunder fall on your head'

Example of a conditional <CVB>:
wE wEraththOdu vawthAl <CVB> thAn 'if you would come in time'

NOMINALIZED VERB FORMS
Nominalized verbal forms are verbal nouns, participle nouns and adjectival nouns. We use the tag <VBG> for all nominalized verb forms.

Examples of nominalized verb forms <VBG>:
Seithal <VBG> 'doing'
Seivathu <VBG> 'that (neuter) which is doing'
sethavan <VBG> 'a person who did'
Anaw enna seyvathu <VBG>? 'What shall (we) do, Anand?'

AUXILIARY VERB
We use the tag <VAX> for auxiliary verbs.

Example of an auxiliary verb <VAX>:
arun enna seyavENdum <VAX>? 'What shall (we) do, Arun?'

3.4.6) OTHER Tags

POSTPOSITION
All postpositions in Tamil are formally uninflected or inflected noun forms or non-finite verb forms. We use the tag <PPO> for postpositions.

Example of a postposition <PPO>:
avan wAyaip pOl <PPO> kaththinAn. 'He cried like a dog.'

CONJUNCTIONS
Coordination in Tamil is mainly realized by the use of noun case, some clitics and a number of verb forms. We use the tag <CNJ> for conjunctions.

Example of a conjunction <CNJ>:
ciRiYA AnAl <CNJ> walla pen. 'a small but nice girl'

DETERMINERS
We use the tag <DET> for determiners.

Example of a determiner <DET>:
awthath <DET> thittam. 'that plan'

COMPLEMENTIZER
We use the tag <COM> for complementizers.

Example of a complementizer <COM>:
avan vawthAn enru <COM> kELvippattEn. 'I heard that he had come.'

EMPHASIS
We use the tag <EMP> for emphasis.

Example of emphasis <EMP>:
avan than <EMP> sonnAn. 'He only said.'

ECHO WORDS
We use the tag <ECH> for echo words. Echo words are very rare in our corpus because about 90% of its words come from the Dinamani newspaper.
Example of echo words (ECH):
kAppi kEppi <ECH> 'coffee keeffee'

REDUPLICATION WORDS
Reduplication words are the same word written twice for various purposes, such as indicating emphasis or deriving one category from another. We use the tag <RDW> for reduplication words.

Example of reduplication words <RDW>:
Pala pala <RDW> thittam. 'many plans'

QUESTION WORD AND QUESTION MARK
We use the tags <QW> and <QM> for the question word and the question mark.

Example of a question word and question mark <QW> <QM>:
Avan vawthAnA <QW>? <QM> 'Did he come?'

SYMBOLS
We consider only two symbols in our corpus, the dot <DOT> and the comma <COMM>. The <DOT> tag marks sentence separation; the <COMM> tag is used between multiple nouns and proper nouns.

Example of <DOT>:
avan than sonnAn. <DOT> 'He only said.'

Example of <COMM>:
wakarAj , <COMM> arunudan vawthan. 'Nagaraj came with Arun.'

4) DEVELOPMENT OF TAGGED CORPUS

4.1) Introduction

Corpus linguistics seeks to further our understanding of language through the analysis of large quantities of naturally occurring data. Text corpora are used in a number of different ways. Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description. Corpora have been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems [27]. Corpora have been used for studying child language development. Speech corpora play a vital role in the specification, design and implementation of telephonic communication and for the broadcast media. There is a long tradition of corpus linguistic studies in Europe. The need for a corpus for a language is multifarious: from the preparation of a dictionary or lexicon to machine translation, the corpus has become an indispensable resource for the technological development of languages.
A corpus is a large body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writing, literary writing and so on. A corpus represents all the styles of a language. It must be very large, as it is used for many language applications, such as the preparation of lexicons of different sizes, purposes and types, machine translation programs and so on.

4.1.1) Tagged corpus, parallel corpus and aligned corpus

Corpora can be distinguished as tagged, parallel and aligned. A tagged corpus is one annotated for part of speech [27]. A parallel corpus contains texts and their translations in each of the languages involved; it allows wider scope for double-checking translation equivalents. An aligned corpus is a kind of bilingual corpus in which text samples of one language and their translations into another are aligned sentence by sentence, phrase by phrase, word by word, or even character by character.

4.1.2) CIIL Corpus for Tamil

As far as corpus building for the Indian languages is concerned, it was the Central Institute of Indian Languages (CIIL) which took the initiative and started preparing corpora for some of the Indian languages (Tamil, Telugu, Kannada and Malayalam). The Department of Electronics (DOE) financed the corpus-building project. The target was a corpus of ten million words for each language [27], but due to financial and time constraints it ended up with three million words per language. The Tamil corpus of three million words built by CIIL in this way is a partially tagged corpus.

4.1.3) AUKBC-RC's Improved Tagged Corpus for Tamil

The AUKBC Research Centre, which has taken up NLP-oriented work for Tamil, improved upon the CIIL Tamil corpus and tagged it for its MT programs. It also developed English-Tamil parallel corpora to support its goal of building an English-Tamil machine translation tool.
A parallel corpus is very useful for training and for building example-based machine translation; it is a valuable tool for MT programs.

4.2) Developing a new tagged corpus

4.2.1) Untagged and tagged corpus

An untagged (unannotated) corpus provides only limited information to its users. A corpus can be augmented with additional information by labeling each morpheme, word, phrase and sentence with its grammatical values. Such information helps the user retrieve information selectively and easily. The frequency of a lemma is useful in corpus analysis: comparing the frequency of a particular word with other context words shows whether the word is common or rare. Frequencies are relatively reliable for the most common words in a corpus, but analyzing the senses and association patterns of words requires a very large number of occurrences. In a very large corpus containing many different texts, a wider range of topics is represented, so word frequencies are less influenced by individual texts. Frequency lists based on an untagged corpus are of limited usefulness because they do not tell us which grammatical uses are common or rare. A tagged corpus is thus an important dataset for NLP applications.

Figure: 4.1 Example of untagged corpus
Figure: 4.2 Example of tagged corpus

4.2.2) TAGGED CORPUS DEVELOPMENT

A tagged corpus is the immediate requirement for many kinds of analysis in natural language processing. Most language processing work needs such a large database of texts, providing real, natural, native language of varying types. Annotation of corpora can be done at various levels, viz. part of speech, phrase/clause level, dependency level, etc. Part of speech tagging is the basic step towards building an annotated corpus; chunking can form the next level of tagging. To create a Tamil part of speech tagger, we need a grammatically tagged corpus.
We therefore set up a tagged corpus of 225,000 words. We collected sentences from the Dinamani newspaper, Yahoo Tamil news, Tamil short stories, etc., and tagged the words. Tagged corpora vary with respect to the amount of information they include about words. We carried out corpus tagging in three stages:

1. Pre-editing
2. Manual tagging
3. Tagging using SVMTagger

In pre-editing, the untagged corpus is converted into a format suitable for SVMTool to assign a part of speech tag to each word. Because of orthographic similarity, one word may have several possible POS tags. After the initial assignment of possible POS tags, words are manually tagged using our AMRITA tagset. The tagged corpus is then trained using the SVMTlearn component. After training, new untagged text is tagged using SVMTagger; its output is manually corrected and added to the tagged corpus to increase the corpus size. Below we describe how we created the tagged corpus.

Pre-editing
First we downloaded untagged sentences from the Dinamani newspaper website. Then we cleaned the corpus using a Perl program, removing all punctuation except dots, commas and question marks. The next step was to convert the text into column format, since SVMTool training data must be in column format, i.e. one token per line, sentence by sentence, with blank space as the column separator. This was also done with a Perl program.

Figure: 4.3 Untagged corpus before pre-editing
Figure: 4.4 Untagged corpus after pre-editing

Manual tagging
After pre-editing we had an untagged corpus in token-per-line format (Figure 4.4). In the second stage we tagged this corpus manually using the AMRITA tagset. We first tagged nearly 10,000 words by hand. This stage posed many difficulties in assigning tags; after discussions with several Tamil linguists, we assigned a POS tag to each word in the corpus.
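The pre-editing step above (strip all punctuation except dots, commas and question marks, then emit one token per line) was done with Perl scripts in the original work; an equivalent Python sketch, assuming simple whitespace tokenization, is:

```python
import re

def pre_edit(text):
    """Remove punctuation except dots, commas and question marks,
    then emit one token per line (the column format SVMTool expects)."""
    # Replace everything except word characters, whitespace and . , ?
    text = re.sub(r"[^\w\s.,?]", " ", text)
    # Separate the retained punctuation marks into their own tokens.
    text = re.sub(r"([.,?])", r" \1 ", text)
    return "\n".join(text.split())

print(pre_edit("avan than sonnAn."))
# avan
# than
# sonnAn
# .
```

A real pre-editor would additionally insert a blank line between sentences and handle Tamil Unicode ranges, but the token-per-line principle is the same.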
Tagging using SVMTagger
After completing the manual tagging, we had a tagged corpus of 10,000 words. In this stage we trained on this corpus with the SVMTlearn component of SVMTool, and then tagged the cleaned untagged corpus using the SVMTagger component. The output of this component is a tagged corpus with some errors, which we corrected manually. After correcting the tags, we added the newly tagged text to the training corpus to increase its size.

4.3) APPLICATIONS OF TAGGED CORPUS

• Part of speech tagging
• Computer lexicography
• Information extraction
• Statistical training of language models
• Machine translation using multilingual corpora
• Text checkers for evaluating spelling and grammar
• Educational applications like computer-assisted language learning

Multilingual corpora with parallel texts that translate one language into another are used for contrastive analysis and translation studies [28].

4.4) DETAILS OF OUR TAGGED CORPUS

We have a tagged corpus of 225,185 words, and our tagset size is 32. Table 4.1 shows the count of each tag in our corpus.

S.no  Tag      Count
1     <ADJ>    7457
2     <ADV>    9493
3     <CNJ>    2937
4     <COM>    3295
5     <COMM>   1695
6     <CRD>    5400
7     <CVB>    992
8     <DET>    2729
9     <DOT>    16649
10    <ECH>    1
11    <EMP>    1190
12    <INT>    981
13    <NN>     56174
14    <NNC>    31532
15    <NNP>    14826
16    <NNPC>   3875
17    <NNQ>    819
18    <ORD>    760
19    <PPO>    3200
20    <PRID>   171
21    <PRIN>   338
22    <PRP>    6668
23    <QM>     692
24    <QTF>    395
25    <QW>     1024
26    <RDW>    117
27    <VAX>    3340
28    <VBG>    3983
29    <VF>     14969
30    <VINT>   4973
31    <VNAJ>   7647
32    <VNAV>   8859

Table 4.1 Tag counts

5) IMPLEMENTATION OF SVMTOOL FOR TAMIL

5.1) INTRODUCTION

This chapter presents the SVMTool, a simple, flexible and effective generator of sequential taggers based on Support Vector Machines, and how it is applied to the problem of part-of-speech tagging.
This SVM-based tagger is robust and flexible in feature modeling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it practical for real NLP applications [30]. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions, and achieves a very competitive accuracy of 94.2% for Tamil.

Generally, tagging is required to be both as accurate and as efficient as possible, but there is a trade-off between these two desirable properties: obtaining higher accuracy relies on processing more and more information, digging deeper and deeper into it. Sometimes, depending on the application, a loss in efficiency may be acceptable in order to obtain more precise results; or, the other way around, a slight loss in accuracy may be tolerated in favour of tagging speed. Moreover, some languages have a richer morphology than others, requiring the tagger to take a bigger set of feature patterns into account. The tagset size and ambiguity rate may also vary from language to language and from problem to problem. Besides, if little data is available for training, the proportion of unknown words may be huge; sometimes morphological analyzers can be used to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape.

Another very interesting property for sequential taggers is their portability. Multilingual information is a key ingredient in NLP tasks such as machine translation, information retrieval, information extraction, question answering and word sense disambiguation, just to name a few. Therefore, having a tagger that works equally well for several languages is crucial for system robustness.
Besides, quite often for some languages, but also in general, lexical resources are hard to obtain, so ideally a tagger should be able to learn with less (or even no) annotated data. The SVMTool is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework, and by offering NLP researchers a highly customizable sequential tagger generator. Here this tool is applied to POS tagging of Tamil.

5.2) PROPERTIES OF THE SVMTOOL

The SVMTool has the following properties.

Simplicity: The SVMTool is easy to configure and train. Learning is controlled by means of a very simple configuration file, and there are very few parameters to tune. The tagger itself is very easy to use, accepting standard input and output pipelining. Embedded usage is also supported through the SVMTool API.

Flexibility: The size and shape of the feature context can be adjusted. Rich features can be defined, including word and POS (tag) n-grams as well as ambiguity classes and "may be's", apart from lexicalized features for unknown words and general sentence information. The behavior at tagging time is also very flexible, allowing different strategies.

Robustness: The overfitting problem is addressed by tuning the C parameter in the soft-margin version of the SVM learning algorithm. A sentence-level analysis may also be performed in order to maximize the sentence score. And, so that unknown words do not penalize system effectiveness too severely, several strategies have been implemented and tested.

Portability: The SVMTool is language independent. It has been successfully applied to English and Spanish without any a priori knowledge other than a supervised corpus.
Moreover, for languages in which labeled data is a scarce resource, the SVMTool may also learn from unsupervised data, based on the role of non-ambiguous words, with the only additional help of a morpho-syntactic dictionary.

Accuracy: Compared to state-of-the-art POS taggers reported to date, it exhibits a very competitive accuracy (97.2% for English on the WSJ corpus). Clearly, rich feature sets allow very precise modeling of most of the information involved. The learning paradigm, SVM, is also very suitable for working accurately and efficiently with high-dimensional feature spaces.

Efficiency: Performance at tagging time depends on the feature set size and the tagging scheme selected. For the default (one-pass, left-to-right, greedy) tagging scheme, it exhibits a tagging speed of 1,500 words/second, whereas the C++ version achieves over 10,000 words/second. This has been achieved by working in the primal formulation of SVM. The use of linear kernels makes the tagger more efficient both at tagging and at learning time, but forces the user to define a richer feature space. However, the learning time remains linear with respect to the number of training examples.

5.3) THE THEORY OF SUPPORT VECTOR MACHINES

SVM is a machine learning algorithm for binary classification which has been successfully applied to a number of practical problems, including NLP. Let {(x_1, y_1), ..., (x_N, y_N)} be the set of N training examples, where each instance x_i is a vector in R^N and y_i ∈ {−1, +1} is the class label. In their basic form, an SVM learns a linear hyperplane [30] that separates the set of positive examples from the set of negative examples with maximal margin (the margin is defined as the distance from the hyperplane to the nearest of the positive and negative examples). This learning bias has proved to have good properties in terms of generalization bounds for the induced classifiers.
The linear separator is defined by two elements: a weight vector w (with one component per feature) and a bias b, which stands for the distance of the hyperplane to the origin. The classification rule of an SVM is:

sign(f(x, w, b))   (5.1)

f(x, w, b) = ⟨w · x⟩ + b   (5.2)

where x is the example to be classified. In the linearly separable case, learning the maximal-margin hyperplane (w, b) can be stated as a convex quadratic optimization problem with a unique solution: minimize ||w||, subject to the constraints (one for each training example):

yi (⟨w · xi⟩ + b) ≥ 1   (5.3)

An example of a 2-dimensional SVM is shown in Figure 5.1.

[Figure 5.1: SVM example: hard margin]

[Figure 5.2: SVM example: soft margin maximization]

The SVM model has an equivalent dual formulation, characterized by a weight vector α and a bias b. In this case, α contains one weight for each training vector, indicating the importance of this vector in the solution. Vectors with non-null weights are called support vectors. The dual classification rule is:

f(x, α, b) = Σ (i = 1..N) yi αi ⟨xi · x⟩ + b   (5.4)

The α vector can also be calculated as a quadratic optimization problem. Given the optimal α* vector of the dual quadratic optimization problem, the weight vector w* that realizes the maximal-margin hyperplane is calculated as:

w* = Σ (i = 1..N) yi αi* xi   (5.5)

The bias b* also has a simple expression in terms of w* and the training examples {(xi, yi)}, i = 1..N. The advantage of the dual formulation is that it permits efficient learning of non-linear SVM separators by introducing kernel functions. Mathematically, a kernel function calculates a dot product between two vectors that have been (non-linearly) mapped into a high-dimensional feature space.
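Equations (5.1) to (5.5) can be checked numerically. The following sketch evaluates the primal and dual decision rules on the same toy problem; all support vectors, labels and alpha weights below are invented for illustration and are not learned values:

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def primal_score(w, b, x):
    # eq. (5.2): f(x, w, b) = <w . x> + b
    return dot(w, x) + b

def dual_score(svs, ys, alphas, b, x):
    # eq. (5.4): f(x, alpha, b) = sum_i y_i alpha_i <x_i . x> + b
    return sum(y * a * dot(sv, x) for sv, y, a in zip(svs, ys, alphas)) + b

# invented support vectors, labels and alpha weights
svs    = [[1.0, 0.0], [0.0, 1.0]]
ys     = [+1, -1]
alphas = [0.5, 1.0]
b      = 0.25

# eq. (5.5): w* = sum_i y_i alpha_i x_i
w = [sum(y * a * sv[j] for sv, y, a in zip(svs, ys, alphas))
     for j in range(len(svs[0]))]

x = [2.0, 1.0]
# primal and dual rules agree on any example
assert abs(primal_score(w, b, x) - dual_score(svs, ys, alphas, b, x)) < 1e-12
label = 1 if primal_score(w, b, x) >= 0 else -1   # eq. (5.1): sign(f)
```

Here w* works out to [0.5, -1.0], and both formulations yield the same score 0.25 for x, so the example is classified as +1.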
Since there is no need to perform this mapping explicitly, training remains feasible even though the dimension of the real feature space can be very high or even infinite. In the presence of outliers and wrongly classified training examples, it may be useful to allow some training errors in order to avoid overfitting. This is achieved by a variant of the optimization problem, referred to as soft margin, in which the contribution to the objective function of margin maximization and training errors can be balanced through a parameter called C, as shown in Figure 5.2.

5.4 PROBLEM SETTING

This section describes the collection of training examples and the feature codification.

5.4.1 Binarizing the Classification Problem

Tagging a word in context is a multi-class classification problem. Since SVMs are binary classifiers, the problem must be binarized before they can be applied. Here a simple one-per-class binarization is applied, i.e., an SVM is trained for every POS tag in order to distinguish between examples of that class and all the rest. When tagging a word, the most confident tag according to the predictions of all binary SVMs is selected. However, not all training examples are considered for all classes. Instead, a dictionary is extracted from the training corpus with all possible tags for each word; when considering the occurrence of a training word w tagged as ti, this example is used as a positive example for class ti and as a negative example for all other classes tj appearing as possible tags for w in the dictionary. In this way, the generation of excessive (and irrelevant) negative examples is avoided, and the training step is made faster.

5.4.2 Feature Codification

Each example (event) is represented using the local context of the word for which the system must determine a tag (output decision).
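The dictionary-restricted one-per-class binarization of section 5.4.1 can be sketched as follows (a simplified illustration with invented Tamil words, not SVMTool's own Perl implementation):

```python
# One-per-class binarization with dictionary-restricted negatives:
# an occurrence of word w tagged t_i is a positive example for t_i and a
# negative example only for the other tags w can take in the dictionary.
from collections import defaultdict

def binarize(tagged_corpus):
    # dictionary of possible tags per word, extracted from the training corpus
    lexicon = defaultdict(set)
    for word, tag in tagged_corpus:
        lexicon[word].add(tag)
    examples = defaultdict(list)   # one binary problem per tag
    for word, tag in tagged_corpus:
        examples[tag].append((word, +1))
        for other in lexicon[word] - {tag}:
            examples[other].append((word, -1))
    return examples

# invented toy corpus: "padi" is ambiguous between NN and VF
corpus = [("thiddam", "NN"), ("padi", "NN"), ("padi", "VF")]
ex = binarize(corpus)
# "thiddam" is unambiguous, so it generates no negative examples at all
```

Each occurrence of the ambiguous word contributes one positive and one negative example, while the unambiguous word only strengthens its own class; this is what keeps the number of negative examples small.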
This local context, along with local information such as capitalization and affixes of the current token, helps the system make a decision even if the token has not been encountered during training. A centered window of seven tokens is considered, in which some basic and n-gram patterns are evaluated to form binary features such as "previous word is awtha (that)" or "two preceding tags are DET NN". Table 5.1 contains the list of all patterns considered. As can be seen, the tagger is lexicalized and all word forms appearing in the window are taken into account. Since a very simple left-to-right tagging scheme is used, the tags of the following words are not known at running time. Following the approach of the Memory-based tagger, the more general ambiguity-class tag is used for the right-context words; this is a label composed of the concatenation of all possible tags for the word (e.g., VINT-VAX, ADJ-NN, etc.). Each of the individual tags of an ambiguity class is also taken as a binary feature of the form "following word may be a NN". With ambiguity classes and "may be's", a two-pass solution is therefore avoided, in which an initial tagging pass would be performed only to have right contexts disambiguated for the second pass. Explicit n-gram features are not necessary in the SVM approach, because polynomial kernels account for the combination of features. However, since we are interested in working with a linear kernel, these are included in the feature set. Additional features have been used to deal with the problem of unknown words. Features appearing fewer times than a certain count cut-off may be ignored for the sake of robustness.
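A few of the binary feature patterns just described can be sketched as follows (the feature-name strings and the example window are invented; Table 5.1 gives the full pattern set used by the tool):

```python
# Sketch: turning a window position into binary features, including word
# unigrams, a preceding-tag bigram, the right-context ambiguity class, and
# the derived "may be" features.
def extract_features(words, tags, ambig, i):
    feats = []
    feats.append(f"w-1:{words[i-1]}")                    # previous word
    feats.append(f"w+1:{words[i+1]}")                    # following word
    feats.append(f"p-2,p-1:{tags[i-2]}_{tags[i-1]}")     # preceding tag bigram
    feats.append(f"a+1:{ambig[i+1]}")                    # right ambiguity class
    for t in ambig[i+1].split("-"):                      # "may be" features
        feats.append(f"m+1:{t}")
    return feats

# invented example: the right-context word is ambiguous between VF and VAX,
# so its ambiguity class "VF-VAX" is used instead of a (still unknown) tag
words = ["<pad>", "thiddam", "wiRaivERRappadum", "enRAr", "<pad>"]
tags  = ["<pad>", "NN", None, None, None]   # right context not yet tagged
ambig = [None, None, None, "VF-VAX", None]
fs = extract_features(words, tags, ambig, 2)
```

The "may be" features (m+1:VF, m+1:VAX) make each member of the ambiguity class individually visible to the linear classifier.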
Word features: w-3, w-2, w-1, w0, w+1, w+2, w+3
POS features: p-3, p-2, p-1, p0, p+1, p+2, p+3
Ambiguity classes: a0, a1, a2, a3
May_be's: m0, m1, m2, m3
Word bigrams: (w-2, w-1), (w-1, w+1), (w-1, w0), (w0, w+1), (w+1, w+2)
POS bigrams: (p-2, p-1), (p-1, a+1), (a+1, a+2)
Word trigrams: (w-2, w-1, w0), (w-2, w-1, w+1), (w-1, w0, w+1), (w-1, w+1, w+2), (w0, w+1, w+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
Sentence_info: punctuation ('.', '?', '!')
Prefixes: s1, s1s2, s1s2s3, s1s2s3s4
Suffixes: sn, sn-1sn, sn-2sn-1sn, sn-3sn-2sn-1sn
Binary word features: initial upper case, all upper case, no initial capital letter(s), all lower case, contains a (period/number/hyphen...)
Word length: integer

Table 5.1: Rich Feature Pattern Set

5.5 SVMTOOL COMPONENTS AND IMPLEMENTATIONS

The SVMTool software package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component. Different models are learned for the different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy that is most suitable for the purpose of the tagging. Finally, given a correctly annotated corpus and the corresponding SVMTool-predicted annotation, the SVMTeval component displays tagging results.

5.5.1 SVMTLEARN

Given a training set of examples (either annotated or unannotated), this component is responsible for training the set of SVM classifiers. To do so, it makes use of SVMlight, an implementation of Vapnik's SVMs in C developed by Thorsten Joachims, which has been used here to train the models.

5.5.1.1 Training Data Format

Training data must be in column format, i.e., a token-per-line corpus in a sentence-by-sentence fashion.
The column separator is the blank space. The token is expected to be the first column of the line, and the tag to predict takes the second column in the output. The rest of the line may contain additional information. See the example in Figure 5.3.

[Figure 5.3: Training data format]

No special '<EOS>' mark is employed for sentence separation; sentence punctuation is used instead, i.e., the [.!?] symbols are taken as unambiguous sentence separators. Here, however, only the [.?] symbols are used, so these two symbols are taken as the sentence separators.

5.5.1.2 Options

SVMTlearn behavior is easily adjusted through a configuration file.

Usage: SVMTlearn [options] <config-file>
options:
  -V verbose   0: none   1: low [default]   2: medium   3: high

Example: SVMTlearn -V 2 config.svmt

These are the currently available config-file options:

Sliding window: The size of the sliding window for feature extraction can be adjusted. Also, the core position at which the word to disambiguate is located may be selected. By default, the window size is 5 and the core position is 2 (starting at 0). Here the default window size of 5 is used.

Feature set: Three different kinds of feature types can be collected from the sliding window:

– Word features: word-form n-grams. Usually unigrams, bigrams and trigrams suffice. Also, the sentence's last word, which corresponds to a punctuation mark ('.', '?', '!'), is important.

– POS features: annotated parts-of-speech and ambiguity-class n-grams, and "may be's". As for words, considering unigrams, bigrams and trigrams is enough. The ambiguity class for a certain word determines which POS tags are possible. A "may be" states, for a certain word, that a certain POS may be possible, i.e., that it belongs to the word's ambiguity class.

– Lexicalized features: including prefixes and suffixes, capitalization, hyphenation, and similar information related to a word form. Since Tamil has no capitalization, that feature is not used here; all the other features are accepted.
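Returning to the training data format of section 5.5.1.1, a minimal reader for the column format, using the [.?] separators adopted here for Tamil, might look like this (a hypothetical helper, not part of SVMTool):

```python
# Read a token-per-line column-format corpus into sentences, splitting on
# the '.' and '?' tokens used as sentence separators for Tamil.
def read_sentences(lines):
    sentences, current = [], []
    for line in lines:
        cols = line.split()
        if not cols:
            continue
        word, tag = cols[0], cols[1]      # token, then tag to predict
        current.append((word, tag))
        if word in (".", "?"):            # unambiguous sentence separators
            sentences.append(current)
            current = []
    if current:                           # trailing sentence without a separator
        sentences.append(current)
    return sentences

# invented two-sentence sample in the column format described above
data = ["thiddam NN", "wiRaivERRappadum VF", ". DOT", "yAr QW", "? DOT"]
sents = read_sentences(data)
```

Because the separator check is on the token itself, any extra columns after the tag are simply ignored, matching the "rest of the line may contain additional information" convention.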
Default feature sets are defined for every model.

Feature filtering: The feature space can be kept to a convenient size. Smaller models allow for higher efficiency. By default, no more than 100,000 dimensions are used. Also, features appearing fewer than n times can be discarded, which indeed causes the system both to fight overfitting and to exhibit a higher accuracy. By default, features appearing just once are ignored.

SVM model compression: Weight vector components lower than a given threshold in the resulting SVM models can be filtered out, thus enhancing efficiency by decreasing the model size while still preserving the accuracy level. This is an interesting behavior of SVM models currently under study. In fact, accuracy remains stable even when up to 70% of the weight components are discarded, and it is not until 95% of the components are discarded that accuracy falls below the current state of the art (97.0%-97.2%).

C parameter tuning: In order to deal with noise and outliers in the training data, the soft-margin version of the SVM learning algorithm allows the misclassification of certain training examples when maximizing the margin. This balance can be automatically adjusted by optimizing the value of the C parameter of the SVMs. A local maximum is found by exploring accuracy on a validation set for different C values at progressively shorter intervals.

Dictionary repairing: The lexicon extracted from the training corpus can be automatically repaired, either based on frequency heuristics or on a list of corrections supplied by the user. This makes the tagger robust to corpus errors. A heuristic threshold may also be specified in order to treat as tagging errors those (word_x, tag_y) pairs occurring less than a certain proportion of times with respect to the number of occurrences of word_x. For example, a threshold of 0.001 would consider (run, DT) an error if the word 'run' had been seen at least 1,000 times and only once tagged as 'DT'.
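The frequency-based dictionary repairing heuristic can be sketched as follows (a simplified illustration of the thresholding idea; SVMTool's own repairing logic is in Perl and also supports user-supplied correction lists):

```python
# Drop (word, tag) pairs that account for less than `threshold` of the
# word's occurrences, treating them as likely annotation errors.
def repair(dictionary, threshold=0.001):
    repaired = {}
    for word, tag_counts in dictionary.items():
        total = sum(tag_counts.values())
        kept = {t: c for t, c in tag_counts.items() if c / total >= threshold}
        repaired[word] = kept or tag_counts   # never empty a word's entry
    return repaired

# the example from the text: 'run' seen 1000 times, once tagged DT
d = {"run": {"VB": 600, "NN": 399, "DT": 1}}
fixed = repair(d)   # the spurious DT entry is removed
```

With the default threshold of 0.001, a tag seen once in 1,000 occurrences sits exactly at the boundary; here DT occurs in 1/1000 = 0.001 of the cases but the strict `>=` keeps... in fact 0.001 >= 0.001 holds, so this sketch uses a count of 1 in 1,000 with a slightly higher effective ratio test only when the proportion falls strictly below the threshold; the DT entry above is removed because 1/1000 < 0.001 is false only at exact equality, so the sketch removes entries strictly below the threshold and the example relies on the pair being rarer than the cutoff in practice.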
This kind of heuristic dictionary repairing does not harm tagger performance; on the contrary, it may help a lot. The repairing list must comply with the SVMTool dictionary format, i.e.:

<word> <N occurrences> <N possible tags> {<tag(i)> <N occurrences(i)>} (for i = 1..N possible tags)

Ambiguity classes: The list of POS tags presenting ambiguity is, by default, automatically extracted from the corpus but, if available, this knowledge can be made explicit. This acts in favor of the system's robustness.

Open classes: The list of POS tags an unknown word may be labeled with is also, by default, automatically determined.

Backup lexicon: A morphological lexicon containing words that are not present in the training corpus may be provided. It can also be provided at tagging time. This file must comply with the SVMTool dictionary format.

5.5.1.3 Configuration File

Several arguments are mandatory (shown in Table 5.2); the rest are optional (shown in Table 5.4). Lists of features are defined in the SVMTool feature language (svmtfl), as shown in Table 5.5. Lines beginning with '#' are ignored. The list of action items for the learner must be declared (shown in Table 5.3):

NAME     => name of the model to create (a log of the experiment is generated in the file "NAME.EXP")
TRAINSET => location of the training set
SVMDIR   => location of Joachims' SVMlight software

Table 5.2: SVMTlearn config-file mandatory arguments.

Syntax: do <MODEL> <DIRECTION> [<CK>] [<CU>] [<T>]

where
MODEL     = [M0|M1|M2|M3|M4]
DIRECTION = [LR|RL|LRL]
CK = [CK:<<range1>:<range2>:<#iterations>:<#segments_per_iteration>:<log>|<nolog>> | <CK-value>]
CU = [CU:<<range1>:<range2>:<#iterations>:<#segments_per_iteration>:<log>|<nolog>> | <CU-value>]
T  = [T[:<Nfolders>]]

MODEL      model type
DIRECTION  model direction
CK         known-word C parameter tuning options (optional)
CU         unknown-word C parameter tuning options (optional)
T          test options (optional)

Table 5.3: SVMTlearn config-file action arguments.
Here is an example of a valid config-file:

# -------------------------------------------------------------
# SVMTool configuration file for Tamil
# -------------------------------------------------------------
# prefix of the model files which will be created
NAME = TAMIL_2L
# -------------------------------------------------------------
# location of the training set
TRAINSET = /media/disk/Anand/SVM/SVMTool-1.3/bin/TAMIL_2L.TRAIN
# -------------------------------------------------------------
# location of Joachims' SVMlight software
SVMDIR = /media/disk/Anand/SVM/svm_light
# -------------------------------------------------------------
# action items
do M0 LR

SET              location of the whole set
VALSET           location of the validation set
TESTSET          location of the test set
TRAINP           proportion of sentences of the provided whole SET used for training
VALP             proportion of sentences of the provided whole SET used for validation
TESTP            proportion of sentences of the provided whole SET used for test
REMOVE_FILES     remove intermediate files?
REMAKE_FOLDERS   remake cross-validation folders?
Kfilter          weight filtering for known-word models
Ufilter          weight filtering for unknown-word models
R                dictionary repairing list (heuristically repaired by default)
D                dictionary repairing heuristic threshold (0.001 by default)
BLEX             backup lexicon
LEX              lexicon for unsupervised learning (Model 3)
W                window definition (size, core position)
F                feature filtering (count cut-off, max mapping size)
CK               C parameter for known words, all models (0 by default)
CU               C parameter for unknown words, all models (0 by default)
X                percentage of unknown words expected (3 by default)
AP               list of POS presenting ambiguity (automatically created by default)
UP               list of open classes (automatically created by default)
A0k..A4k         known-word feature definitions for models 0..4
A0u..A4u         unknown-word feature definitions for models 0..4

Table 5.4: SVMTlearn config-file optional arguments.
COLUMN n-grams      C(colid; n1,...,ni,...,nm)
WORD n-grams        w(n1,...,ni,...,nm)  (equivalent to C(0; n1,...,ni,...,nm))
TAG n-grams         p(n1,...,ni,...,nm)  (equivalent to C(1; n1,...,ni,...,nm))
AMBIGUITY CLASSES   k(n)
MAY_BE's            m(n)
                    (where ni is the relative position with respect to the element to disambiguate)
CHARACTER A(i)      ca(i), where i is the relative position of the character with respect to the beginning of the word
CHARACTER Z(i)      cz(i), where i is the relative position of the character with respect to the end of the word
PREFIXES            a(i) = s1s2...si
SUFFIXES            z(i) = sn-i...sn-1sn
sa                  does the word start with lower case?
SA                  does the word start with upper case?
CA                  does the word contain any capital letter?
CAA                 does the word contain several capital letters?
aa                  are all letters in the word in lower case?
AA                  are all letters in the word in upper case?
SN                  does the word start with a number?
CP                  does the word contain a period?
CN                  does the word contain a number?
CC                  does the word contain a comma?
MW                  does the word contain a hyphen?
L                   word length
sentence_info       punctuation ('.', '?', '!')

Table 5.5: SVMTool feature language.
Enriched version of the SVMTool configuration file for Tamil:

# -------------------------------------------------------------
# SVMT configuration file
# -------------- location of the training set ----------------
TRAINSET = /media/disk/Anand/SVM/SVMTool-1.3/bin/TAMIL_CORPUS_2L.TRAIN
# -------------- location of the validation set ---------------
#VALSET = /media/disk/Anand/TAMIL_OL2.TRAIN
# -------------- location of the test set ---------------------
#TESTSET = /media/disk/Anand/TAMIL_OL3.TRAIN
# -------------- location of the Joachims svmlight software ---
SVMDIR = /media/disk/Anand/SVM/svm_light
# -------------- name of the model to create ------------------
NAME = TAMIL_CORPUS_2L
# -------------- dictionary repairing list --------------------
#R = root/Desktop/small/SVMTool-1.3/bin/CC.R
# -------------- window definition (length, core_position) ----
W = 5 2
# -------------- feature filtering (count_cut_off, max_mapping_size)
F = 2 100000
# -------------- default C parameter values -------------------
CK = 0.1086
CU = 0.07975
# -------------- % of unknown words expected (3 by default) ---
X = 10
# -------------- weight filtering for known words -------------
Kfilter = 0
# -------------- weight filtering for unknown words -----------
Ufilter = 0
# -------------- remove intermediate files --------------------
REMOVE_FILES = 1
# -------------- remake cross-validation folders --------------
REMAKE_FOLDERS = 1
# -------------- action items ---------------------------------
# *** train model 0, LR and RL, C parameter tuning, cross-validation 10 folders
do M0 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# *** train model 1, RL, C parameter tuning, no cross-validation, test
#do M1 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# *** train model 2, LR and RL, no C parameter tuning, no cross-validation, test
#do M2 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# *** M3 is currently unavailable
# *** train model 4, LR, C parameter tuning, no cross-validation, no test
#do M4 LRL CK:0.01:1:3:10:log CU:0.01:1:3:10:log T
# -------------------------------------------------------------
# list of classes (automatically determined by default)
# -------------------------------------------------------------
# list of parts-of-speech presenting ambiguity
#AP = '' CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NNP NNPS PDT POS PRP PRP$ RB RBR RBS RP SYM UH VB VBD VBG VBN VBP VBZ WDT WP WRB
# list of open classes
#UP = FW JJ JJR JJS NN NNS NNP NNPS RB RBR RBS VB VBD VBG VBN VBP VBZ
# -------------------------------------------------------------
# ambiguous-right [default]
A0k = C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1) C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1) C(1;0) k(0) k(1) k(2) m(0) m(1) m(2)
A0u = C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1) C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1) C(1;0) k(0) k(1) k(2) m(0) m(1) m(2) a(2) a(3) a(4) a(5) a(6) a(7) a(8) a(9) a(10) a(11) a(12) a(13) a(14) a(15) a(16) a(17) a(18) a(19) a(20) z(2) z(3) z(4) z(5) z(6) z(7) z(8) z(9) z(10) z(11) z(12) z(13) z(14) z(15) z(16) z(17) z(18) z(19) z(20) ca(1) cz(1) L SN CP CN
# -------------------------------------------------------------

In this case the model NAME is 'TAMIL_CORPUS_2L', so model files will begin with this prefix. Only the training set is specified. A window of 5 elements, with the core in the third position, is defined for feature extraction. The expected proportion of unknown words is 10%. Intermediate files will be removed. The lists of parts-of-speech presenting ambiguity and of open classes are not provided in this config-file, so the tool determines them automatically. This config-file is designed to learn Model 0 on our Tamil corpus.
That would allow the use of tagging strategies 0 and 5, though only left-to-right. Instead of using the default feature sets, two feature sets are defined for Model 0 (for the two distinct problems of known-word and unknown-word tag guessing).

5.5.1.4 C Parameter Tuning

C parameter tuning is optional. Either no C parameter is specified (C = 0 by default), or a fixed value is given (e.g., CK:0.1 CU:0.01), or an automatic tuning by greedy exploration [30] is performed. In the latter case a validation set must be provided, together with the interval to be explored and how to explore it (i.e., the number of iterations and the number of segments per iteration). Moreover, the first iteration may proceed in a logarithmic fashion. For example, CK:0.01:10:3:10:log would try these values for C at the first iteration: 0.01, 0.1, 1, 10. For the next iteration, the algorithm explores on both sides of the point where the maximal accuracy was obtained, halfway to the next and previous points. For example, suppose the maximal accuracy was obtained for C = 0.1; then it would explore the range from 0.1/2 = 0.05 to 1/2 = 0.5. The segmentation ratio would be 0.045, so the algorithm would try the values 0.05, 0.095, 0.14, 0.185, 0.23, 0.275, 0.32, 0.365, 0.41, 0.455 and 0.5. And so on for the following iterations.

5.5.1.5 Test

After training, a model can be evaluated against a test set. To indicate so, the T option must be activated in the corresponding do-action, e.g., "do M0 LR CK:0.01:10:3:10:log CU:0.07 T". By default a test set definition is expected in the config file, but training/test can also be performed through cross-validation. In that case the number of folders must be provided, e.g., "do M0 LR CK:0.01:10:3:10:log CU:0.07 T:10"; 10 is a good number. Furthermore, if training/test runs in cross-validation then the C parameter tuning does too, even if a validation set has been provided.

5.5.1.6 Models

Five different kinds of models have been implemented.
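Going back to the C parameter tuning of section 5.5.1.4, the refinement step can be sketched as follows (a hypothetical helper named `refine`, reproducing the 0.05 to 0.5 example under the reading that the new range runs from half the best value to half the next explored value):

```python
# After a logarithmic first pass, explore around the best-scoring C value:
# from best/2 up to (next point)/2, in `segments` equal steps.
def refine(points, best, segments=10):
    i = points.index(best)
    lo = best / 2.0
    hi = points[i + 1] / 2.0
    step = (hi - lo) / segments
    return [round(lo + k * step, 6) for k in range(segments + 1)]

# first iteration of CK:0.01:10:3:10:log tried 0.01, 0.1, 1, 10;
# suppose the best accuracy was found at C = 0.1
cs = refine([0.01, 0.1, 1, 10], best=0.1)
# cs spans 0.05 .. 0.5 with a segmentation ratio of 0.045,
# i.e. 0.05, 0.095, 0.14, ..., 0.455, 0.5
```

Each subsequent level repeats the same narrowing around the new maximum, which is how the tool converges on a local optimum with few validation runs.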
Models 0, 1 and 2 differ only in the features they consider. Model 3 and Model 4 are just like Model 0 with respect to feature extraction, but examples are selected in a different manner. Model 3 is for unsupervised learning: given an unlabeled corpus and a dictionary, at learning time it can only count on knowing the ambiguity class of each word, and the POS information only for unambiguous words. Model 4 achieves robustness by simulating unknown words in the learning context at training time.

Model 0: This is the default model. The unseen context remains ambiguous. It was designed with the one-pass on-line tagging scheme in mind, i.e., the tagger goes either left-to-right or right-to-left making decisions, so past decisions feed future ones in the form of POS features. At tagging time, only the parts-of-speech of already disambiguated tokens are considered; for the unseen context, ambiguity classes are considered instead. Features are shown in Table 5.6.

Model 1: This model considers the unseen context as already disambiguated in a previous step. It is therefore intended for a second pass, revisiting and correcting already tagged text. Features are shown in Table 5.7.

Model 2: This model does not consider POS features at all for the unseen context. It is designed to work in a first pass, requiring Model 1 to review the tagging results in a second pass. Features are shown in Table 5.8.

Model 3: The training is based on the role of unambiguous words. Linear classifiers are trained with examples of unambiguous words extracted from an unannotated corpus, so less POS information is available. The only additional information required is a morpho-syntactic dictionary.

Model 4: The errors caused by unknown words at tagging time penalize the system severely. To reduce this problem, during learning some words are artificially marked as unknown so that a more realistic model is learned. The process is very simple: the corpus is divided into a number of folders.
Before samples are extracted from each folder, a dictionary is generated from the remaining folders. Thus the words appearing in a folder but not in the rest are unknown words to the learner.

Ambiguity classes: a0, a1, a2
May_be's: m0, m1, m2
POS features: p-3, p-2, p-1
POS bigrams: (p-2, p-1), (p-1, a+1), (a+1, a+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')

Table 5.6: Model 0. Example of suitable POS features

Ambiguity classes: a0, a1, a2
May_be's: m0, m1, m2
POS features: p-2, p-1, p+1, p+2
POS bigrams: (p-2, p-1), (p-1, p+1), (p+1, p+2)
POS trigrams: (p-2, p-1, a0), (p-2, p-1, p+1), (p-1, a0, p+1), (p-1, p+1, p+2)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')

Table 5.7: Model 1. Example of suitable POS features

Ambiguity classes: a0
May_be's: m0
POS features: p-2, p-1
POS bigrams: (p-2, p-1)
POS trigrams: (p-2, p-1, a0)
Single characters: ca(1), cz(1)
Prefixes: a(2), a(3), a(4)
Suffixes: z(2), z(3), z(4)
Lexicalized features: SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info: punctuation ('.', '?', '!')

Table 5.8: Model 2. Example of suitable POS features

5.5.1.7 IMPLEMENTATION OF SVMTlearn FOR TAMIL

SVMTlearn is the first component in SVMTool. It is used to train SVM models from the tagged corpus, and it runs on Linux machines only. For training, only the data set is required. However, if enough data is available, it is good practice to split it into three working sets (training, validation and test). That allows the system to be trained, tuned and evaluated before it is put to use.
If not much data is available there is no need to worry: the system can still be trained, tuned and tested through cross-validation, which is the approach followed here.

[Figure 5.4: Implementation of SVMTlearn. The tagged Tamil corpus is fed to SVMTlearn, which produces a dictionary and merged model files for known and unknown word features.]

The input to this component is the tagged training corpus. That corpus is given to the SVMTlearn component, and the features in the config file are adapted to the Tamil language. The outputs of SVMTlearn are a dictionary file and, for each model, merged files for known and unknown words. Each merged file contains all the features of the known or unknown words, respectively.

Example training output of SVMTlearn for Tamil:

-------------------------------------------------------------------
SVMTool v1.3 (C) 2006 TALP RESEARCH CENTER.
Written by Jesus Gimenez and Lluis Marquez.
-------------------------------------------------------------------
TRAINING SET = /media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN
-------------------------------------------------------------------
DICTIONARY <TAMIL_CORPUS.DICT> [31605 words]
*******************************************************************
BUILDING MODELS...
[MODE = 0 :: DIRECTION = LR]
*******************************************************************
C-PARAMETER TUNING by 10-fold CROSS-VALIDATION on
</media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN>
on <MODE 0> <DIRECTION LR> [KNOWN]
C-RANGE = [0.01..1] :: [log] :: #LEVELS = 3 :: SEGMENTATION RATIO = 10
*******************************************************************
===================================================================
LEVEL = 0 :: C-RANGE = [0.01..1] :: FACTOR = [* 10]
===================================================================
level - 0 : ITERATION 0 - C = 0.01 - [M0 :: LR]
-------------------------------------------------------------------
TEST ACCURACY: 90.6093% KNOWN [ 92.886% ] AMBIG.KNOWN [ 83.3052% ] UNKNOWN [ 78.5781% ]
TEST ACCURACY: 90.392% KNOWN [ 92.6809% ] AMBIG.KNOWN [ 82.838% ] UNKNOWN [ 78.0815% ]
TEST ACCURACY: 90.1015% KNOWN [ 92.6128% ] AMBIG.KNOWN [ 83.4766% ] UNKNOWN [ 77.5075% ]
TEST ACCURACY: 89.7127% KNOWN [ 92.0721% ] AMBIG.KNOWN [ 81.8731% ] UNKNOWN [ 77.5281% ]
TEST ACCURACY: 90.7699% KNOWN [ 92.7304% ] AMBIG.KNOWN [ 83.4785% ] UNKNOWN [ 80.3874% ]
TEST ACCURACY: 89.8988% KNOWN [ 92.3462% ] AMBIG.KNOWN [ 81.5675% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 90.8836% KNOWN [ 92.9671% ] AMBIG.KNOWN [ 83.6309% ] UNKNOWN [ 79.5591% ]
TEST ACCURACY: 89.9724% KNOWN [ 92.4002% ] AMBIG.KNOWN [ 82.1854% ] UNKNOWN [ 77.664% ]
TEST ACCURACY: 90.2643% KNOWN [ 92.5675% ] AMBIG.KNOWN [ 83.0289% ] UNKNOWN [ 78.0907% ]
TEST ACCURACY: 90.7494% KNOWN [ 92.7798% ] AMBIG.KNOWN [ 82.7494% ] UNKNOWN [ 79.8929% ]
OVERALL ACCURACY [Ck = 0.01 :: Cu = 0.07975] : 90.33539% KNOWN [ 92.6043% ]
AMBIG.KNOWN [ 82.81335% ] UNKNOWN [ 78.45753% ]
MAX ACCURACY -> 90.33539 :: C-value = 0.01 :: depth = 0 :: iter = 1
-------------------------------------------------------------------
level - 0 : ITERATION 1 - C = 0.1 - [M0 :: LR]
-------------------------------------------------------------------
TEST ACCURACY: 91.7702% KNOWN [ 94.2402% ] AMBIG.KNOWN [ 87.5492% ] UNKNOWN [ 78.7175% ]
TEST ACCURACY: 91.8881% KNOWN [ 94.4737% ] AMBIG.KNOWN [ 88.4324% ] UNKNOWN [ 77.9821% ]
TEST ACCURACY: 91.3219% KNOWN [ 94.0596% ] AMBIG.KNOWN [ 88.0441% ] UNKNOWN [ 77.5928% ]
TEST ACCURACY: 91.0615% KNOWN [ 93.6037% ] AMBIG.KNOWN [ 86.6795% ] UNKNOWN [ 77.9326% ]
TEST ACCURACY: 92.0852% KNOWN [ 94.2575% ] AMBIG.KNOWN [ 88.3275% ] UNKNOWN [ 80.5811% ]
TEST ACCURACY: 91.3927% KNOWN [ 94.1299% ] AMBIG.KNOWN [ 87.4226% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 91.9891% KNOWN [ 94.2944% ] AMBIG.KNOWN [ 88.0182% ] UNKNOWN [ 79.4589% ]
TEST ACCURACY: 91.3063% KNOWN [ 93.9605% ] AMBIG.KNOWN [ 87.1258% ] UNKNOWN [ 77.8502% ]
TEST ACCURACY: 91.3654% KNOWN [ 93.8499% ] AMBIG.KNOWN [ 87.2127% ] UNKNOWN [ 78.2339% ]
TEST ACCURACY: 91.8693% KNOWN [ 94.1% ] AMBIG.KNOWN [ 87.0546% ] UNKNOWN [ 79.9416% ]
OVERALL ACCURACY [Ck = 0.1 :: Cu = 0.07975] : 91.60497% KNOWN [ 94.09694% ] AMBIG.KNOWN [ 87.58666% ] UNKNOWN [ 78.55767% ]
MAX ACCURACY -> 91.60497 :: C-value = 0.1 :: depth = 0 :: iter = 2

5.5.2 SVMTAGGER

Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), this component performs the POS tagging of a sequence of words. The tagging goes on-line, based on a sliding window which gives a view of the feature context [30] to be considered at every decision.
In any case, there are two important concepts to consider: (1) example generation and (2) feature extraction.

(1) Example generation: This step defines what an example is, according to the concept the machine is to learn. In POS tagging, for instance, the machine must learn to correctly classify words according to their POS. Thus, every POS is a class and, typically, every occurrence of a word generates a positive example for its class and a negative example for the rest of the classes. Therefore, every sentence may generate a large number of examples.

(2) Feature extraction: The set of features has to be defined based on the algorithm to be used. For instance, POS tags may be guessed according to the preceding and following words. Thus, every example is represented as a set of active features. These representations are the input to the SVM classifiers.

To inspect how SVMTool works internally, run SVMTlearn (Perl version) with the REMOVE_FILES option set to 0 in the config file; the intermediate files can then be examined. Feature extraction is performed by the sliding-window object. The whole process takes place as follows. A sliding window works on a very local context (as defined in the config file), usually a 5-word context [-2, -1, 0, +1, +2], with the word currently under analysis at the core position. Taking this context into account, a number of features may be extracted. The feature set depends on how the tagger is going to proceed later (i.e., the context and information that will be available at tagging time). Commonly, all words are known before tagging, but the POS is only available for some words (those already tagged). In the tagging stage, if the input word is known and ambiguous, the word is tagged (i.e., classified), and the predicted tag feeds forward into the next decisions. This is done in the subroutine "sub classify_sample_merged()" in the file SVMTAGGER.
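The one-pass left-to-right scheme just described, in which each predicted tag immediately becomes context for the next decision, can be sketched as follows (the lexicon and `score` function are invented stand-ins for SVMTool's dictionary and per-tag SVM classifiers):

```python
# Greedy left-to-right tagging: previously predicted tags are available as
# context for later decisions, mirroring Model 0's one-pass scheme.
def tag_sentence(words, possible_tags, score):
    tags = []
    for i, word in enumerate(words):
        candidates = possible_tags.get(word, ["NN"])  # unknown-word fallback (assumption)
        if len(candidates) == 1:
            tags.append(candidates[0])                # unambiguous word
        else:
            # one binary classifier per tag; pick the most confident one
            tags.append(max(candidates, key=lambda t: score(words, tags, i, t)))
    return tags

# invented lexicon and scorer: the scorer prefers VF right after a noun
lex = {"thiddam": ["NN"], "wiRaivERRappadum": ["VNAJ", "VF"], "enRAr": ["VF"]}
score = lambda ws, ts, i, t: 1.0 if (t == "VF" and ts and ts[-1] == "NN") else 0.0
out = tag_sentence(["thiddam", "wiRaivERRappadum", "enRAr"], lex, score)
```

Note how the decision for the ambiguous second word uses `ts`, the tags already assigned, which is exactly the feed-forward behavior of the on-line tagger.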
In order to speed up SVM classification, the SVMTool authors merged the feature mapping and the SVM weights and biases into a single file. Therefore, when a new example is to be tagged, we just access the merged model and, for every active feature, retrieve the associated weight. Then, for every possible tag, the bias is retrieved as well. Finally, we apply the SVM classification rule (i.e., scalar product plus bias).

AN EXAMPLE

Suppose the example is taken in Tamil Unicode form: கூட் க்கு நீர் திட்டம் நிைறேவற்றப்ப ம் என்றார் தல்வர் .

Romanized form: kUddukkudiwIr thiddam wiRaivERRappadum enRAr muthalvar .
Tags: <NNC> <NNC> <VNAJ>/<VF> <VF> <NN> <DOT>

The third token, "wiRaivERRappadum", is the ambiguous word (ambiguity class <VNAJ>/<VF>); its correct tag is <VF>. To tag this sentence, first take the active features w-1 and w+1, i.e., predict POS tags based only on the preceding and following words. For w0: wiRaivERRappadum, the active features are "w-1:thiddam" and "w+1:enRAr". Therefore, when applying the SVM classification rule for a given POS, we go to the merged model and retrieve the weights for these features, as well as the bias (first line after the header, beginning with "BIASES") corresponding to the given POS.
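The lookup-and-score procedure just described can be sketched as follows. The weights and biases are the <VF>/<VNAJ> entries quoted from the merged-model excerpt shown next; the helper functions themselves are illustrative, not SVMTool code.

```python
# Sketch of the merged-model classification rule: sum the weights of the
# active features for a tag, subtract the tag's bias, and pick the
# highest-scoring tag. Weights/biases copied from the ".MRG" excerpt;
# only the <VF> and <VNAJ> entries are reproduced.
weights = {
    ("w-1:thiddam", "<VF>"): 0.296699729267891,
    ("w-1:thiddam", "<VNAJ>"): -0.32,
    ("w+1:enRAr", "<VF>"): 0.66667135122578,
    ("w+1:enRAr", "<VNAJ>"): -0.676332541749603,
}
biases = {"<VF>": 0.33486366, "<VNAJ>": 0.42063524}

def score(tag, active_features):
    s = sum(weights.get((f, tag), 0.0) for f in active_features)
    return s - biases[tag]

active = ["w-1:thiddam", "w+1:enRAr"]
best = max(["<VF>", "<VNAJ>"], key=lambda t: score(t, active))
# best is "<VF>": its score (about 0.6285) exceeds <VNAJ>'s (about -1.4170)
```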
For instance, suppose this ".MRG" file:

BIASES <ADJ>:0.37059487 <ADV>:-0.19514606 <CNJ>:0.43007979 <COM>:0.037037037 <CRD>:0.55448766 <CVB>:-0.19911161 <DET>:-1.1815452 <EMP>:-0.86491783 <INT>:0.61775334 <NN>:-0.21980137 <NNC>:1.3656117 <NNP>:0.072242349 <NNPC>:0.7906585 <NNQ>:0.44012828 <ORD>:0.30304924 <PPO>:-0.2182171 <PRI>:0.89491131 <PRID>:-0.15550162 <PRIN>:0.56913633 <PRP>:0.35316978 <QW>:0.039121434 <RDW>:0.84771943 <VAX>:0.041690388 <VBG>:0.23199934 <VF>:0.33486366 <VINT>:0.0048185684 <VNAJ>:0.42063524 <VNAV>:0.18009116

C0~-1:thiddam <CRD>:0.00579042912371902 <NNC>:-0.508699048073652 <NN>:0.532690716551973 <ORD>:-0.000698015879911668 <VBG>:0.142313085089229 <VF>:0.296699729267891 <VNAJ>:-0.32

C0~1:enRAr <VAX>:0.132726597682121 <VF>:0.66667135122578 <VNAJ>:-0.676332541749603

The SVM score for "wiRaivERRappadum" being <VNAJ> is:

weight("w-1:thiddam", VNAJ) + weight("w+1:enRAr", VNAJ) - bias(VNAJ)
= (-0.32) + (-0.676332541749603) - (0.42063524) = -1.41696778

The SVM score for "wiRaivERRappadum" being <VF> is:

weight("w-1:thiddam", VF) + weight("w+1:enRAr", VF) - bias(VF)
= (0.296699729267891) + (0.66667135122578) - (0.33486366) = 0.62850742

Since the SVM score for <VF> is higher than that for <VNAJ>, the tag <VF> is assigned to the word "wiRaivERRappadum". Predicted part-of-speech tags feed directly forward into the next tagging decisions as context features.

The SVMTagger component works on standard input/output. It processes a token-per-line corpus sentence by sentence. The token is expected to be the first column of the line; the predicted tag takes the second column in the output, and the rest of the line remains unchanged. Lines beginning with '## ' are ignored by the tagger. Figure 5.5 shows an example input file; SVMTagger considers only the first column of the input file. Figure 5.6 shows the corresponding output file.

Figure 5.5: Example input file
Figure 5.6: Example output file

5.5.2.1 Options

SVMTagger is very flexible, and adapts very well to the needs of the user.
Thus we may find the several options currently available:

Tagging scheme: Two different tagging schemes may be used.
– Greedy: Each tagging decision is made based on a reduced context. Decisions are not reconsidered later, except in the case of tagging in two steps or tagging in two directions.
– Sentence-level: By means of dynamic programming techniques (the Viterbi algorithm), the global sum of SVM tagging scores over the sentence is the function to maximize, as shown in equation 5.6. Given a sentence S = w1 ... wn as a word sequence, and the set T(S) = {(t1 ... t|S|) : ti ∈ ambiguity_class(wi), 1 <= i <= |S|} of all possible sequences of POS tags associated to S,

    t(S) = argmax_{s ∈ T(S)} Σ_{i=1..|S|} score(s_i)    (5.6)

A softmax function is used by default so as to transform this sum of scores into a product of probabilities. Because sentence-level tagging is expensive, two pruning methods are provided. First, the maximum number of beams may be defined. Alternatively, a threshold may be specified so that solutions scoring under a certain value (with respect to the best solution at that point) are discarded. Both pruning techniques have proved effective and efficient in our experiments.

Tagging direction: The tagging direction can be either "left-to-right", "right-to-left", or a combination of both. Varying the tagging direction changes the results, and combining both directions yields a significant improvement. For every token, each direction assigns a tag with a certain score, and the highest-scoring tag is selected. This makes the tagger very robust. In the case of sentence-level tagging there is an additional way to combine left-to-right and right-to-left: the "GLRL" direction makes a global decision, i.e., it considers the sentence as a whole. For every sentence, each direction assigns a sequence of tags [30] with an associated score that corresponds to the sum of scores (or the product of probabilities when using a softmax function, the default option).
The highest-scoring sequence of tags is selected.

One pass / two passes: Another way of achieving robustness is by tagging in two passes. In the first pass, only POS features related to already-disambiguated words are considered. In the second pass, disambiguated POS features are available for every word in the feature context, so tagging errors may be alleviated when a word is revisited.

SVM model compression: Just as for learning, weight-vector components lower than a certain threshold can be ignored.

All scores: Sometimes not only the predicted tag is relevant, but also its score and the scores of all competing tags, as a measure of confidence; this information is made available.

Backup lexicon: Again, a morphological lexicon containing new words that were not available in the training corpus may be provided.

Lemmae lexicon: Given a lemmae lexicon containing <word form, tag, lemma> entries, the output may be lemmatized.

<EOS> tag: The '<s>' tag may be employed for sentence separation. Otherwise, sentence punctuation is used instead, i.e., the [.!?] symbols are taken as unambiguous sentence separators.
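The sentence-level scheme of equation 5.6, together with the default softmax transformation, can be illustrated with a toy example. All scores below are invented for illustration; real SVMTool uses dynamic programming with beam pruning rather than the brute-force enumeration shown here.

```python
# Toy sketch of sentence-level tagging: among all tag sequences allowed
# by each word's ambiguity class, pick the one with the highest summed
# score. Raw SVM scores are mapped to log-probabilities by a softmax,
# so the sum behaves like a product of probabilities.
import itertools
import math

def log_softmax(scores):
    m = max(scores.values())  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return {tag: v - log_z for tag, v in scores.items()}

# Per-position raw SVM scores over each word's ambiguity class
# (invented numbers).
svm_scores = [
    {"<NNC>": 1.2},
    {"<NN>": 0.5, "<NNC>": 0.3},
    {"<VF>": 0.63, "<VNAJ>": -1.42},
]
log_probs = [log_softmax(s) for s in svm_scores]

best = max(itertools.product(*(lp.keys() for lp in log_probs)),
           key=lambda seq: sum(lp[t] for lp, t in zip(log_probs, seq)))
# best == ("<NNC>", "<NN>", "<VF>")
```

Beam pruning would simply discard partial sequences whose running sum falls too far below the current best, instead of enumerating every candidate.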
Usage: SVMTagger [options] <model>

Options:
-T <strategy>
   0: one-pass (default - requires model 0)
   1: two-passes [revisiting results and relabeling - requires model 2 and model 1]
   2: one-pass [robust against unknown words - requires model 0 and model 2]
   3: one-pass [unsupervised learning models - requires model 3]
   4: one-pass [very robust against unknown words - requires model 4]
   5: one-pass [sentence-level likelihood - requires model 0]
   6: one-pass [robust sentence-level likelihood - requires model 4]
-S <direction>
   LR: left-to-right (default)
   RL: right-to-left
   LRL: both left-to-right and right-to-left
   GLRL: both left-to-right and right-to-left (global assignment, only applicable under a sentence-level tagging strategy)
-K <n> weight filtering threshold for known words (default is 0)
-U <n> weight filtering threshold for unknown words (default is 0)
-Z <n> number of beams in beam search, only applicable under sentence-level strategies (default is disabled)
-R <n> dynamic beam search ratio, only applicable under sentence-level strategies (default is disabled)
-F <n> softmax function to transform SVM scores into probabilities (default is 1)
   0: do_nothing
   1: ln(e^score(i) / [sum:1<=j<=N:[e^score(j)]])
-A predictions for all possible parts-of-speech are returned
-B <backup_lexicon>
-L <lemmae_lexicon>
-EOS enable usage of end_of_sentence '<s>' string (disabled by default, [!.?] used instead)
-V <verbose> 0: no verbose; 1: low verbose; 2: medium verbose; 3: high verbose; 4: very high verbose

Model: model location (path/name) (name as declared in the config-file NAME)

Example: SVMTagger -T 0 TAMIL_CORPUS_2L < TAMIL.IN > TAMIL.OUT

5.5.2.2 Strategies

Seven different tagging strategies have been implemented so far:

Strategy 0: The default one. It makes use of Model 0 in a greedy on-line fashion, in one pass.

Strategy 1: As a first attempt to achieve robustness against error propagation, it works in two passes, in an on-line greedy way.
It uses Model 2 in the first pass and Model 1 in the second. In other words, in the first pass the unseen morphosyntactic context remains ambiguous, while in the second pass the tag predicted in the first pass is available also for unseen tokens and is used as a feature.

Strategy 2: This strategy tries to achieve robustness by using two models at tagging time, namely Model 0 and Model 2. When all the words in the unseen context are known it uses Model 0; otherwise it makes use of Model 2.

Strategy 3: It uses Model 3, again in a greedy, on-line manner. This unsupervised learning strategy is still under experimentation.

Strategy 4: It simply uses Model 4 as is, in an on-line greedy fashion.

Strategy 5: Aiming at a more robust scheme, this strategy performs sentence-level tagging by means of dynamic programming techniques (the Viterbi algorithm). It uses Model 0.

Strategy 6: Same as strategy 5, this strategy performs sentence-level tagging, this time applying Model 4.

5.5.2.3 IMPLEMENTATION OF SVMTagger FOR TAMIL

In the SVMTagger component the important options are the strategy and the backup lexicon. The choice of tagging strategy matters; it may depend, for instance, on efficiency requirements. If the tagging must be as fast as possible, then strategies 1, 5, and 6 should be avoided, because strategy 1 goes in two passes and strategies 5 and 6 perform sentence-level tagging. Strategy 3 is only for unsupervised learning (when no hand-annotated data is available). To choose among strategies 0, 2, and 4, the best solution is to try them all. However, if you have an idea of the proportion of unknown words you are going to find at tagging time, strategies 2 and 4 are more robust than strategy 0 in the presence of unknown words. Finally, if you have neither speed requirements nor information about future data, in our experiments tagging strategies 4 and 6 systematically obtained the best results.
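The backup lexicon discussed next shares the dictionary file format, built by counting tag occurrences per word in a tagged corpus. A hedged sketch of that conversion follows; the field layout assumed here (word, total count, number of tags, then tag/count pairs) mirrors the .DICT files SVMTlearn generates, but it should be verified against a dictionary actually produced by your installation, and the helper name is illustrative.

```python
# Sketch: convert a token-per-line tagged corpus ("word tag") into
# dictionary-format entries. Assumed entry layout (verify against a
# generated .DICT file): word, total count, number of distinct tags,
# then tag/count pairs.
from collections import Counter, defaultdict

def build_dictionary(tagged_lines):
    counts = defaultdict(Counter)
    for line in tagged_lines:
        word, tag = line.split()[:2]
        counts[word][tag] += 1
    entries = []
    for word, tags in sorted(counts.items()):
        parts = [word, str(sum(tags.values())), str(len(tags))]
        for tag, n in sorted(tags.items()):
            parts += [tag, str(n)]
        entries.append(" ".join(parts))
    return entries

entries = build_dictionary(["thiddam <NN>", "thiddam <NN>", "enRAr <VF>"])
# e.g. "thiddam 2 1 <NN> 2"
```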
Here the format of the backup lexicon file is the same as the dictionary format, so a Perl program can be used to convert a tagged corpus into dictionary format. A main drawback in POS tagging in general is tagging proper nouns. With a large tagged corpus, tagging is not a problem for closed tags, but the open tags include nouns and proper nouns. For English, capitalization helps in tagging proper nouns; in Tamil this cue is not available, so it is necessary to provide a large backup lexicon.

Figure 5.7: Implementation of SVMTagger (proper nouns pass through the morph generator into the backup lexicon; the untagged Tamil corpus, training data, dictionary, features and merged model feed SVMLearn and SVMTagger, which produce the tagged Tamil corpus)

Here we collected a large dataset of proper nouns (Indian names, Indian towns and important world towns). These proper nouns are input to a morphological generator (a Perl program), which generates inflected proper nouns for our dataset; for each proper noun we generate nearly twelve inflections. This new dataset is converted into the SVMTool dictionary format and given to SVMTagger as a backup lexicon. Figure 5.7 shows the steps in the implementation of SVMTagger for Tamil. The input is an untagged, cleaned Tamil corpus and the output is tagged (annotated) words. Supporting files are the training corpus, the dictionary file, the merged models for unknown and known words, and the backup lexicon.

5.5.3 SVMTEVAL

Given an SVMTool predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. It is a very useful component for tuning the system parameters, such as the C parameter, the feature patterns and filtering, the model compression, et cetera. Based on a given morphological dictionary (e.g., the one automatically generated at training time), results may also be presented for different sets of words (known words vs. unknown words, ambiguous words vs. unambiguous words).
A different view of these same results can be seen from the class-of-ambiguity perspective too, i.e., words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can also be grouped.

Usage: SVMTeval [mode] <model> <gold> <pred>

- mode: 0 - complete report (everything)
        1 - overall accuracy only [default]
        2 - accuracy of known vs. unknown words
        3 - accuracy per level of ambiguity
        4 - accuracy per kind of ambiguity
        5 - accuracy per class
- model: model name
- gold: correct tagging file
- pred: predicted tagging file

Example: SVMTeval TAMIL_CORPUS_2L TAMIL.GOLD TAMIL.OUT

5.5.3.1 IMPLEMENTATION OF SVMTeval FOR TAMIL

SVMTeval is the last component of SVMTool. This component is used to evaluate the outputs based on different modes. The main input of this component is a correctly tagged corpus, also called the gold standard.

Figure 5.8: Implementation of SVMTeval (the correctly tagged file or gold standard, the SVM tagged output, the mode and the model name feed SVMTeval, which produces the report)

5.5.3.2 REPORTS

Brief report: By default, a brief report mainly returning the overall accuracy is elaborated. It also provides information about the number of tokens processed, and how many were known/unknown and ambiguous/unambiguous according to the model dictionary. Results are always compared to the most-frequent-tag (MFT) baseline.

* ========================= SVMTeval report =========================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs. <E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
* ================= TAGGING SUMMARY =================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ---------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
* ================= OVERALL ACCURACY =================
HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
1002 1063 94.2615% 71.2135%
* ====================================================

Known vs. unknown tokens: Accuracy for four different sets of words is returned. The first set is that of all known tokens, i.e., tokens seen during training. The second and third sets contain, respectively, all ambiguous and all unambiguous tokens among these known tokens. Finally, there is the set of unknown tokens, which were not seen during training.

* ========================= SVMTeval report =========================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test1.out> vs. <E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...

* ================= TAGGING SUMMARY =================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ---------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
* ================= KNOWN vs. UNKNOWN TOKENS =================
HITS TRIALS ACCURACY
* ---------------------------------------------------
* ======= known =======
816 854 95.5504%
-------- known unambiguous tokens --------
604 623 96.9502%
-------- known ambiguous tokens --------
212 231 91.7749%
* ======= unknown =======
186 209 88.9952%
* ================= OVERALL ACCURACY =================
HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
1002 1063 94.2615% 71.2135%
* ====================================================

Level of ambiguity: This view of the results groups together all words having the same degree of POS-ambiguity.

* ========================= SVMTeval report =========================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test1.out> vs. <E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...

* ================= TAGGING SUMMARY =================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ---------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
* ================= ACCURACY PER LEVEL OF AMBIGUITY =================
#CLASSES = 5
LEVEL HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
1 605 624 96.9551% 96.6346%
2 204 220 92.7273% 66.8182%
3 7 9 77.7778% 66.6667%
4 2 3 66.6667% 33.3333%
28 184 207 88.8889% 0.0000%
* ================= OVERALL ACCURACY =================
HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
1002 1063 94.2615% 71.2135%
* ====================================================

Kind of ambiguity: This view is much finer. Every class of ambiguity is studied separately.

* ========================= SVMTeval report =========================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs. <E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
* ================= TAGGING SUMMARY =================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ---------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
* ================= ACCURACY PER CLASS OF AMBIGUITY =================
#CLASSES = 55
CLASS HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
<ADJ> 28 28 100.0000% 100.0000%
<ADJ>_<ADV>_<CNJ>_<COM>_<CRD>_<CVB>_<DET>_<ECH>_<INT>_<NN>_<NNC>_<NNP>_<NNPC>_<NNQ>_<ORD>_<PPO>_<PRID>_<PRIN>_<PRP>_<QTF>_<QW>_<RDW>_<VAX>_<VBG>_<VF>_<VINT>_<VNAJ>_<VNAV> 184 207 88.8889% 0.0000%
<ADJ>_<NN> 1 1 100.0000% 100.0000%
<ADJ>_<VNAJ> 1 1 100.0000% 100.0000%
<ADV> 31 32 96.8750% 96.8750%
<ADV>_<NN>_<PPO> 1 2 50.0000% 50.0000%
<ADV>_<RDW> 2 2 100.0000% 100.0000%
<CNJ> 20 20 100.0000% 100.0000%
<CNJ>_<PPO> 1 1 100.0000% 100.0000%
<COM> 17 17 100.0000% 100.0000%
<COMM> 49 49 100.0000% 100.0000%
<CRD> 23 23 100.0000% 95.6522%
<CRD>_<DET> 22 22 100.0000% 100.0000%
<CRD>_<NN>_<NNC>_<PPO> 2 2 100.0000% 50.0000%
<CRD>_<ORD> 2 2 100.0000% 0.0000%
<CVB> 6 6 100.0000% 100.0000%
<CVB>_<VF> 0 1 0.0000% 0.0000%
<DET> 14 14 100.0000% 100.0000%
<DOT> 77 77 100.0000% 100.0000%
<EMP>_<PRP> 1 1 100.0000% 100.0000%
<INT> 6 6 100.0000% 100.0000%
<INT>_<NN> 0 1 0.0000% 0.0000%
<NN> 81 91 89.0110% 89.0110%
<NN>_<NNC> 148 161 91.9255% 58.3851%
<NN>_<NNC>_<NNPC> 1 1 100.0000% 100.0000%
<NN>_<PRID>_<PRIN>_<VNAJ> 0 1 0.0000% 0.0000%
<NN>_<VBG> 1 1 100.0000% 100.0000%
<NN>_<VF> 1 1 100.0000% 100.0000%
<NNC> 46 47 97.8723% 97.8723%
<NNC>_<NNP>_<NNPC> 4 5 80.0000% 80.0000%
<NNC>_<VAX> 1 1 100.0000% 100.0000%
<NNC>_<VF>_<VNAJ> 1 1 100.0000% 0.0000%
<NNP> 34 37 91.8919% 91.8919%
<NNP>_<NNPC> 0 1 0.0000% 0.0000%
<NNPC> 0 1 0.0000% 0.0000%
<NNQ> 4 4 100.0000% 100.0000%
<ORD> 3 3 100.0000% 66.6667%
<PPO> 6 6 100.0000% 100.0000%
<PPO>_<VNAV> 1 1 100.0000% 100.0000%
<PRID> 2 2 100.0000% 100.0000%
<PRIN> 2 2 100.0000% 100.0000%
<PRP> 33 33 100.0000% 100.0000%
<QM> 4 4 100.0000% 100.0000%
<QTF> 5 5 100.0000% 100.0000%
<QW> 4 4 100.0000% 100.0000%
<RDW> 1 1 100.0000% 100.0000%
<VAX> 7 7 100.0000% 100.0000%
<VAX>_<VF> 8 8 100.0000% 100.0000%
<VBG> 11 11 100.0000% 100.0000%
<VBG>_<VF> 2 2 100.0000% 100.0000%
<VF> 39 40 97.5000% 97.5000%
<VF>_<VNAJ> 12 12 100.0000% 91.6667%
<VINT> 15 16 93.7500% 93.7500%
<VNAJ> 20 20 100.0000% 100.0000%
<VNAV> 17 18 94.4444% 94.4444%
* ================= OVERALL ACCURACY =================
HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
1002 1063 94.2615% 71.2135%
* ====================================================

Class: Every class is studied individually.

* ========================= SVMTeval report =========================
* model = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold) = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING <E:\\SVMTool-1.3\\bin\\files\\test.out> vs. <E:\\SVMTool-1.3\\bin\\files\\test.gold> on model <E:\\SVMTool-1.3\\bin\\CORPUS>...
* ================= TAGGING SUMMARY =================
#TOKENS = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* ---------------------------------------------------
#KNOWN = 80.3387% --> 854 / 1063
#UNKNOWN = 19.6613% --> 209 / 1063
#AMBIGUOUS = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
* ================= ACCURACY PER PART-OF-SPEECH =================
POS HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
<ADJ> 30 31 96.7742% 90.3226%
<ADV> 47 48 97.9167% 70.8333%
<CNJ> 21 21 100.0000% 95.2381%
<COM> 17 17 100.0000% 100.0000%
<COMM> 49 49 100.0000% 100.0000%
<CRD> 26 26 100.0000% 84.6154%
<CVB> 7 8 87.5000% 75.0000%
<DET> 36 36 100.0000% 100.0000%
<DOT> 77 77 100.0000% 100.0000%
<EMP> 1 1 100.0000% 100.0000%
<INT> 6 7 85.7143% 85.7143%
<NN> 243 259 93.8224% 57.9151%
<NNC> 145 162 89.5062% 46.2963%
<NNP> 43 44 97.7273% 86.3636%
<NNPC> 0 16 0.0000% 0.0000%
<NNQ> 4 4 100.0000% 100.0000%
<ORD> 2 2 100.0000% 100.0000%
<PPO> 9 9 100.0000% 100.0000%
<PRID> 2 3 66.6667% 66.6667%
<PRIN> 2 2 100.0000% 100.0000%
<PRP> 34 34 100.0000% 97.0588%
<QM> 4 4 100.0000% 100.0000%
<QTF> 5 5 100.0000% 100.0000%
<QW> 6 6 100.0000% 66.6667%
<RDW> 1 1 100.0000% 100.0000%
<VAX> 18 18 100.0000% 77.7778%
<VBG> 20 22 90.9091% 54.5455%
<VF> 68 68 100.0000% 66.1765%
<VINT> 16 18 88.8889% 83.3333%
<VNAJ> 41 42 97.6190% 69.0476%
<VNAV> 22 23 95.6522% 73.9130%
* ================= OVERALL ACCURACY =================
HITS TRIALS ACCURACY MFT
* ---------------------------------------------------
1002 1063 94.2615% 71.2135%
* ====================================================

6) GUI FOR SVMTool

6.1) INTRODUCTION

A graphical user interface (GUI) is a user-computer interface (i.e., a way for users to interact with
computers) that uses windows, icons and menus, and that can be manipulated by the mouse (and often, to a limited extent, by the keyboard as well). GUIs stand in sharp contrast to command-line interfaces (CLIs), which use only text and are accessed solely by keyboard. The most familiar example of a CLI to many people is MS-DOS; another example is Linux used in console mode (i.e., with the entire screen showing only text).

A window is a rectangular portion of the monitor screen that can display its contents (e.g., a program, icons, a text file or an image) seemingly independently of the rest of the display screen. A major feature is the ability for multiple windows to be open simultaneously. Each window can display a different application, or each can display different files (e.g., text, image or spreadsheet files) that have been opened or created with a single application.

In this chapter we discuss the GUI for SVMTool. This GUI was developed using NetBeans, and we use the Perl version of SVMTool. Our GUI runs only the SVMTagger and SVMTeval components, because the GUI was developed on the Windows operating system while the SVMTlearn component runs only on Linux.

Advantages of GUIs

A major advantage of GUIs is that they make computer operation more intuitive, and thus easier to learn and use. For example, it is much easier for a new user to move a file from one directory to another by dragging its icon with the mouse than by having to remember and type seemingly arcane commands to accomplish the same task. Adding to this intuitiveness of operation is the fact that GUIs generally provide users with immediate, visual feedback about the effect of each action. For example, when a user deletes an icon representing a file, the icon immediately disappears, confirming that the file has been deleted (or at least sent to the trash can).
This contrasts with the situation for a CLI, in which the user types a delete command (including the name of the file to be deleted) but receives no automatic feedback indicating that the file has actually been removed.

In addition, GUIs allow users to take full advantage of the powerful multitasking capabilities of modern operating systems (the ability for multiple programs and/or multiple instances of single programs to run simultaneously) by allowing such multiple programs and/or instances to be displayed simultaneously. The result is a large increase in the flexibility of computer use and a consequent rise in user productivity.

The SVMTool options work only in a command-line interface, so the user must know the SVMTool commands. Since this is not always practical, a user-friendly environment like a GUI is needed. Our GUI solves that problem because it covers all the important options in SVMTool, giving SVMTool a graphical, user-friendly front end.

6.2) GUI for SVMTagger

The GUI for SVMTagger was developed using NetBeans; SVMTool is run from NetBeans through a Java program. The SVMTagger window is shown in figure 6.1. In this window the user can select the SVMTool location and the model name. The SVMTool location indicates the location of the "bin" folder, and the model name is the name given in the config file. This window contains two options:

1) File-based tagger
2) Sentence-based tagger

In the file-based tagger, the user selects the input file location and the output file location. The file must be in SVMTool format. The other SVMTagger options, such as strategy, direction and backup lexicon, are also given in that window. In sentence-based tagging the user can type an input Tamil sentence; our program tokenizes the sentence and tags it using SVMTagger. Sometimes the user does not know the tag abbreviations, so our window has an option for explaining the abbreviations.
The SVMTagger window is shown below.

Figure 6.1: SVMTagger window

6.2.1) File-based SVMTagger window

In the file-based tagger (fig 6.2) the user can give the input file location and the output file location, and can then tag the words based on different strategies and directions. These options were already explained in chapter 5. Another important option is the backup lexicon location. It is very useful for handling unknown words, especially for open tags like proper nouns and nouns. If the user provides a lexicon, we convert each word (proper noun) in the lexicon into a number of inflected words using a small morphological generator. This handles unknown words well.

Figure 6.2: File-based SVMTagger window

6.2.2) Sentence-based SVMTagger window

The sentence-based tagger (fig 6.3) takes input in sentence form and tokenizes it into token-per-line format, because SVMTool accepts only token-per-line input. This tokenization is done by a small Perl program. After conversion to tokenized form, the tagging operation is performed using the SVMTagger component. If the user has any doubt about a tag abbreviation, the tag-details window helps to find the exact meaning of the tags.

Figure 6.3: Sentence-based SVMTagger window

6.2.3) AMRITA POS tagset window

This window (fig 6.4) lets the user find the meaning of each tag in our tagset. Every tagset follows different abbreviations for its tags; our tagset also has its own abbreviations, but we adopt some standard tag abbreviations like NN, NNP, ADJ, etc.

Figure 6.4: AMRITA tagset window

6.3) GUI for SVMTeval

Given an SVMTool predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. The GUI for SVMTeval (fig 6.5) was also developed using NetBeans. In this window the user can choose the SVMTool location and the model name. This evaluation component of SVMTool is used to analyze the results of a tagger.
It shows the number of tokens, unknown words, known words and ambiguous words. We can run this component in different modes, such as known vs. unknown words, kind of ambiguity, and level of ambiguity.

Figure 6.5: SVMTeval window

6.3.1) SVMTeval report window

The GUI for the evaluator is shown in figure 6.6. Here the user can check the accuracy of SVMTool. This window needs a gold standard tagged file, i.e., a correctly tagged corpus in SVMTool format. The other input of this module is the SVMTool output file, i.e., the corpus tagged by the SVMTagger component. Based on a given morphological dictionary (e.g., the one automatically generated at training time), results may also be presented for different sets of words (known vs. unknown words, ambiguous vs. unambiguous words).

Figure 6.6: SVMTeval results window

A different view of these same results can be seen from the class-of-ambiguity perspective too, i.e., words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can also be grouped.

6.4 Output of TnT tagger for Tamil

TnT, short for "trigrams 'n' tags", is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words.

Figure 6.7: Output of TnT tagger for Tamil

TnT is not optimized for a particular language; instead, it is optimized for training on a large variety of corpora, so adapting the tagger to a new language, domain or tagset is very easy. Additionally, TnT is optimized for speed. We evaluated the tagger's performance under several aspects, using a training corpus of two lakh words and running the tagger on a thousand untagged words. We obtained an overall accuracy (fig 6.7) of 88.71%.
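The accuracy figures compared below follow the same simple definition used in the SVMTeval reports, hits over trials on a pair of token-per-line files; a minimal sketch:

```python
# Minimal sketch of the overall-accuracy computation: compare a
# predicted token-per-line tagging against the gold standard.
def overall_accuracy(gold_lines, pred_lines):
    hits = trials = 0
    for g, p in zip(gold_lines, pred_lines):
        gtag, ptag = g.split()[1], p.split()[1]
        trials += 1
        hits += (gtag == ptag)
    return hits / trials

acc = overall_accuracy(["w1 <NN>", "w2 <VF>", "w3 <NN>"],
                       ["w1 <NN>", "w2 <VF>", "w3 <NNC>"])
# acc == 2/3
```

The known/unknown and per-ambiguity breakdowns are the same count, restricted to the relevant subset of tokens according to the model dictionary.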
Comparative accuracies of TnT tagger and SVMTool for Tamil

                                       TnT       SVMTool
Overall accuracy                       88.70%    93.87%
Accuracy of known words                92.54%    95.54%
Accuracy of known ambiguous words      77.94%    91.17%
Accuracy of unknown words              74.16%    87.08%

7) CONCLUSION AND FUTURE WORK
Part-of-speech tagging plays an important role in various speech and language processing applications, and since reputed companies such as Google and Microsoft are concentrating on natural language processing, it has gained further importance. Many tools are currently available for this task, and good accuracy can be obtained with existing software such as SVMTool and the Stanford tagger, which give more than 97% accuracy for English. SVMTool has already been applied successfully to English and Spanish POS tagging, exhibiting state-of-the-art performance (97.16% and 96.89%, respectively); in both cases the results clearly outperform the HMM-based TnT part-of-speech tagger. For Tamil we have obtained an accuracy of about 94%. Training with these tools is also easy, and any language can be trained using them, so POS tagging can be extended to other languages. Currently there are no taggers giving good accuracy for Tamil; the main obstacle to POS tagging for Indian languages is the lack of annotated (tagged) data. We have a corpus of 2.25 lakh (225,000) words. One possible future direction is to increase the corpus size, i.e. the amount of tagged data for Tamil, and to add chunking tags to the tagged corpus. Better accuracy can still be obtained by combining a morphological analyzer with SVMTool for handling unknown words.