POS tagging and Chunking for Indian Languages
Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad

Contents
• NLP: Introduction
• Language Analysis - Representation
• Part-of-speech tags in Indian Languages (Ex. Hindi)
• Corpus based methods: An introduction
• POS tagging using HMMs
• Introduction to TnT
• Chunking for Indian languages – Few experiments
• Shared task - Introduction

Language: a unique ability of humans
Animals have signs (e.g., a sign for danger) but cannot combine the signs. Higher animals such as apes can combine symbols (noun & verb), but can talk only about the here and now.

Language: means of communication
CONCEPT -> coding -> Language -> decoding -> CONCEPT
The concept gets transferred through language.

Language: means of thinking
"What should I wear today?" Can we think without language?

What is NLP?
The process of computer analysis of input provided in a human language is known as Natural Language Processing. The concept expressed in language is converted into an intermediate representation used for processing by computer.

Applications
• Machine translation
• Document clustering
• Information extraction / retrieval
• Text classification

MT system: Shakti
• Machine translation system being developed at IIIT Hyderabad.
• A hybrid translation system which uses the combined strengths of linguistic, statistical and machine learning techniques.
• Integrates the best available NLP technologies.

Shakti architecture
English sentence -> English sentence analysis (morphology, POS tagging, chunking, parsing) -> transfer from English to Hindi (word reordering, Hindi word substitution) -> Hindi sentence generation (agreement, word generation) -> Hindi sentence.

Levels of language analysis
• Morphological analysis
• Lexical analysis (POS tagging)
• Syntactic analysis (chunking, parsing)
• Semantic analysis (word sense disambiguation)
• Discourse processing (anaphora resolution)

Let's take an example sentence: "Children are watching some programmes on television in the house"

Chunking
What are chunks?
[[ Children ]] (( are watching )) [[ some programmes ]] [[ on television ]] [[ in the house ]]
Noun chunks (NP, PP) are in square brackets; verb chunks (VG) are in parentheses. Chunks represent objects: noun chunks represent objects/concepts, verb chunks represent actions. Chunks are represented in SSF.

Part-of-speech tagging and morphological analysis
Morphological analysis deals with the word form and its analysis. The analysis consists of characteristic properties like root/stem, lexical category, gender, number, person, etc.
Ex: "watching": root = watch, lexical category = verb, etc.

POS tags in Hindi
Broad categories are noun, verb, adjective & adverb. Words are classified depending on their role, both individually as well as in the sentence.
Example: vaha aama khaa  rahaa hei
         pron noun verb  verb  verb

POS tagging
The simplest method of POS tagging is looking the word up in a dictionary:
khaanaa -> dictionary lookup -> verb

Problems with POS tagging
• The size of the dictionary limits the scope of the POS tagger.
• Ambiguity: the same word can be used both as a noun and as a verb.
  khaanaa -> noun or verb
Sentences in which the word "khaanaa" occurs:
tum bahuta achhaa khaanaa banatii ho.
mein jilebii khaanaa chaahataa hun.
Hence, the complete sentence has to be looked at before determining the word's role and thus its POS tag.

Problems with POS tagging
Many applications need more specific POS tags. For example:
... seba khaa rahaa ...      -> Verb Finite Main
... khaate huE ...           -> Verb Non-Finite Adjective
... khaakara ...             -> Verb Non-Finite Adverb
sharaaba piinaa sehata ...   -> Verb Non-Finite Nominal
Hence the need for defining a tagset.

Defining the tagset for Hindi (IIIT Tagset)
Issues:
1. Fineness vs. coarseness in linguistic analysis
2. Syntactic function vs. lexical category
3. New tags vs. tags from a standard tagset

Fineness vs. coarseness
A decision has to be taken on whether tags will account for finer distinctions of various features of the parts of speech. We need to strike a balance: not too fine, which would hamper machine learning, and not too coarse, which would lose information.
• Nouns: plurality information is not taken into account (noun singular and noun plural are marked with the same tag). Case information is not marked (noun direct and noun oblique are marked with the same tag).
• Adjectives and adverbs: no distinction between comparative and superlative forms.
• Verbs: finer distinctions are made (e.g., VJJ, VRB, VNN). This helps us understand the arguments that a verb form can take.

Fineness in verb tags
Useful for tasks like dependency parsing, as we have better information about the arguments of a verb form. Non-finite forms of verbs which are used as nouns, adjectives or adverbs still retain their verbal property (VNN = noun formed from a verb).
Example: aasamaana/NN "sky" mein/PREP "in" udhane/VNN "flying" vaalaa/PREP ghodhaa/NN "horse" niiche/NLOC "down" utara/VFM "climb" aayaa/VAUX "came"

Syntactic vs. lexical
Whether to tag a word based on its lexical or its syntactic category. Should "uttar" in "uttar bhaarata" be tagged as a noun or as an adjective? The lexical category is given more importance than the syntactic category while marking text manually. This leads to consistency in tagging.
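The dictionary-lookup tagging described earlier, and the ambiguity it runs into with words like "khaanaa", can be sketched as a tiny program. The mini-dictionary and the NN/VFM tag choices below are illustrative assumptions, not entries from an actual lexicon:

```python
# Minimal sketch of dictionary-lookup POS tagging and its ambiguity
# problem. The entries and tags below are illustrative assumptions.
DICTIONARY = {
    "khaanaa": {"NN", "VFM"},  # "food" (noun) or "to eat" (verb): ambiguous
    "seba": {"NN"},            # "apple"
    "banatii": {"VFM"},        # "makes"
}

def lookup_tags(word):
    # A plain lookup returns every candidate tag; it cannot pick one,
    # so the full sentence context is needed to disambiguate.
    return DICTIONARY.get(word, {"UNK"})

print(sorted(lookup_tags("khaanaa")))  # ['NN', 'VFM']: lookup alone cannot decide
```

Any word outside the dictionary gets an unknown tag, which is the coverage limitation noted above.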
New tags vs. tags from a standard tagset
An entirely new tagset for Indian languages is not desirable, as people are familiar with standard tagsets like the Penn tags. The Penn tagset has been used as a benchmark while deciding tags for Hindi. Wherever the Penn tagset has been found inadequate, new tags have been introduced:
• NVB: new tag for kriyamuls (light verbs)
• QW: modified tag for question words

IIIT Tagset
Tags are grouped into three types:
1. Group1: adopted from the Penn tagset with minor changes.
2. Group2: modifications over the Penn tagset.
3. Group3: tags not present in the Penn tagset.
Examples of tags in Group3:
• INTF (intensifier): words like 'baHuta', 'kama', etc.
• NVB, JVB, RBVB: light verbs.
Detailed guidelines would be put online.

Corpus-based approach
POS tagged corpus -> learn POS tagger -> apply to untagged new corpus -> tagged new corpus.

POS tagging: a simple method
• Pick the most likely tag for each word.
• Probabilities can be estimated from a tagged corpus.
• Assumes independence between tags.
• Accuracy < 90%.

Example
Corpus: 182159 tagged words (training section), 26 tags.
Sentence: mujhe xo kitabein xijiye
The word "xo" occurs 267 times: 227 times tagged as QFN, 29 times as VAUX.
P(QFN | W=xo) = 227/267 = 0.8502
P(VAUX | W=xo) = 29/267 = 0.1086

Corpus-based approaches
Rule learning:
• Transformation-based error-driven learning (Brill, 1995)
• Inductive logic programming (Cussens, 1997)
Statistical:
• Hidden Markov models (TnT; Brants, 2000)
• Maximum entropy (Ratnaparkhi, 1996)

POS tagging using HMMs
Let W be a sequence of words, W = w1, w2 ... wn, and let T be the corresponding tag sequence, T = t1, t2 ... tn.
Task: find the T which maximizes P(T | W):
T' = argmax_T P(T | W)
By Bayes' rule, P(T | W) = P(W | T) * P(T) / P(W), so
T' = argmax_T P(W | T) * P(T)
By the chain rule,
P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * ... * P(tn | t1 ... tn-1)
Applying the bigram approximation,
P(T) = P(t1) * P(t2 | t1) * P(t3 | t2) * ... * P(tn | tn-1)
Similarly,
P(W | T) = P(w1 | T) * P(w2 | w1, T) * P(w3 | w1 w2, T) * ... * P(wn | w1 ... wn-1, T) = Π_{i=1..n} P(wi | w1 ... wi-1, T)
Assume P(wi | w1 ... wi-1, T) = P(wi | ti).
Now T' is the one which maximizes
P(t1) * P(t2 | t1) * ... * P(tn | tn-1) * P(w1 | t1) * P(w2 | t2) * ... * P(wn | tn)
If we use a trigram model instead for the tag sequence:
P(T) = P(t1) * P(t2 | t1) * P(t3 | t1 t2) * ... * P(tn | tn-2 tn-1)
Which model to choose? It depends on the amount of data available: richer models (trigrams, 4-grams) require lots of data.
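The simple per-word baseline described earlier (pick each word's most frequent tag, with relative-frequency estimates such as P(QFN | W=xo) = 227/267) can be sketched as follows. The tiny tagged corpus here is made up for illustration:

```python
from collections import Counter, defaultdict

# Most-likely-tag baseline: tag each word with the tag it most often
# carried in the training corpus. The tiny corpus below is made up.
tagged_corpus = [
    ("mujhe", "PRP"), ("xo", "QFN"), ("kitabein", "NN"), ("xijiye", "VFM"),
    ("xo", "QFN"), ("kitabein", "NN"), ("xo", "VAUX"),
]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def most_likely_tag(word, default="NN"):
    # P(tag | word) is proportional to count(word, tag); take the argmax.
    return counts[word].most_common(1)[0][0] if word in counts else default

print(most_likely_tag("xo"))  # QFN (2 of its 3 occurrences)
```

Because each word is tagged independently of its neighbours, this baseline cannot use sentence context, which is what the HMM formulation above adds.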
Chain rule with approximations
P(W = "vaha ladakaa gayaa", T = "det noun verb")
= P(det) * P(vaha | det) * P(noun | det) * P(ladakaa | noun) * P(verb | noun) * P(gayaa | verb)

Example estimates:
P(vaha | det) = (number of times 'vaha' appeared as 'det' in the corpus) / (total number of occurrences of 'det' in the corpus)
P(verb | noun) = (number of times 'verb' followed 'noun' in the corpus) / (total number of occurrences of 'noun' in the corpus)

If we obtained the following estimates from the corpus:
P(det) = 0.5, P(vaha | det) = 0.4, P(noun | det) = 0.99, P(ladakaa | noun) = 0.5, P(verb | noun) = 0.4, P(gayaa | verb) = 0.02
then P(W, T) = 0.5 * 0.4 * 0.99 * 0.5 * 0.4 * 0.02 = 0.000792

POS tagging using HMM
We need to estimate three types of parameters from the corpus:
Pstart(ti) = (no. of sentences which begin with ti) / (no. of sentences)
P(ti | ti-1) = count(ti-1 ti) / count(ti-1)
P(wi | ti) = count(wi with ti) / count(ti)
These parameters can be directly represented using Hidden Markov Models (HMMs), and the best tag sequence can be computed by applying the Viterbi algorithm to the HMM.

Markov models
Markov chain: an event is dependent on the previous events. Consider the word sequence "usane kahaa ki". Here, each word is dependent on the previous one word; hence it is said to form a Markov chain of order 1.

Hidden Markov models
Observation sequence:     o1 o2 o3 o4
Hidden state sequence:    x1 x2 x3 x4
Sequence index t:         1  2  3  4
The hidden states follow the Markov property; hence this model is known as a Hidden Markov Model.
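Since the best tag sequence is found with the Viterbi algorithm mentioned above, here is a minimal bigram-HMM Viterbi sketch. The PI, A and B values mirror the toy det/noun/verb example figures used in these slides; the decoder itself is a generic sketch, not the TnT implementation:

```python
# Minimal Viterbi decoder for a bigram HMM tagger. PI, A, B mirror the
# toy det/noun/verb example values used in these slides.
PI = {"det": 0.5, "noun": 0.4, "verb": 0.01}
A = {"det":  {"det": 0.01, "noun": 0.99, "verb": 0.00},
     "noun": {"det": 0.30, "noun": 0.30, "verb": 0.40},
     "verb": {"det": 0.40, "noun": 0.40, "verb": 0.20}}
B = {"det":  {"vaha": 0.40},
     "noun": {"ladakaa": 0.015, "gayaa": 0.0031},
     "verb": {"ladakaa": 0.0004, "gayaa": 0.020}}

def viterbi(words):
    tags = list(PI)
    # alpha[t] = probability of the best path ending in tag t
    alpha = {t: PI[t] * B[t].get(words[0], 0.0) for t in tags}
    backptrs = []
    for w in words[1:]:
        prev, alpha, ptr = alpha, {}, {}
        for t in tags:
            p, best = max((prev[s] * A[s][t] * B[t].get(w, 0.0), s)
                          for s in tags)
            alpha[t], ptr[t] = p, best
        backptrs.append(ptr)
    # Recover the most likely tag sequence by backtracking.
    last = max(alpha, key=alpha.get)
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path)), alpha[last]

path, prob = viterbi(["vaha", "ladakaa", "gayaa"])
print(path)  # ['det', 'noun', 'verb']
```

On this toy model the decoder recovers det-noun-verb without enumerating all 3^3 = 27 candidate tag sequences, which is the efficiency gain the slides point to.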
Hidden Markov models
Representation of parameters in HMMs. Define O(t) = the t-th observation and X(t) = the hidden state at position t.
A = [a_ab], where a_ab = P(X(t+1) = Xb | X(t) = Xa)   (transition matrix)
B = [b_ak], where b_ak = P(O(t) = Ok | X(t) = Xa)     (emission matrix)
PI = [pi_a], where pi_a = probability of starting in hidden state Xa   (PI matrix)
The model is μ = {A, B, PI}.

HMM for POS tagging
Observation sequence = word sequence; hidden state sequence = tag sequence.
A = P(current tag | previous tag)
B = P(current word | current tag)
PI = Pstart(tag)
Tag sequences are mapped to hidden state sequences because they are not observable in the natural language text.

Example
A =        det   noun  verb
    det    .01   .99   .00
    noun   .30   .30   .40
    verb   .40   .40   .20

B =        vaha  ladakaa  gayaa
    det    .40   .00      .00
    noun   .00   .015     .0031
    verb   .00   .0004    .020

PI = det 0.5, noun 0.4, verb .01

POS tagging using HMM
The problem can be formulated as: given the observation sequence O and the model μ = (A, B, PI), how do we choose the best state sequence X which explains the observations?
• Consider all possible tag sequences and choose the tag sequence having the maximum joint probability with the observation sequence: X_max = argmax_X P(O, X).
• The complexity of this is high: of order N^T for N tags and a sentence of T words. For the three-word sentence "vaha ladakaa hansaa" with tags {det, noun, verb}, 3^3 = 27 tag sequences (paths through the trellis) are possible.
• The Viterbi algorithm is used for computational efficiency.

Viterbi algorithm
(Trellis over the observations "vaha ladakaa hansaa", with states det, noun, verb at t = 1, 2, 3.)
Let αnoun(ladakaa) represent the probability of reaching the state 'noun' taking the best possible path and generating the observation 'ladakaa'.
Best probability of reaching a state associated with the first word:
αdet(vaha) = PI(det) * B[det, vaha]
Probability of reaching a state elsewhere in the best possible way:
αnoun(ladakaa) = MAX { αdet(vaha)  * A[det, noun]  * B[noun, ladakaa],
                       αnoun(vaha) * A[noun, noun] * B[noun, ladakaa],
                       αverb(vaha) * A[verb, noun] * B[noun, ladakaa] }
What is the best way to come to a particular state?
phinoun(ladakaa) = ARGMAX { αdet(vaha)  * A[det, noun]  * B[noun, ladakaa],
                            αnoun(vaha) * A[noun, noun] * B[noun, ladakaa],
                            αverb(vaha) * A[verb, noun] * B[noun, ladakaa] }
The last tag of the most likely sequence:
phi(T+1) = ARGMAX { αdet(hansaa), αnoun(hansaa), αverb(hansaa) }
The most likely sequence is obtained by backtracking.

Preliminary results: POS tagging for Indian languages
Training set = 182159 tokens, testing set = 14277 tokens, 26 tags.
Most frequent tag labelling = 78.85%
Hidden Markov models = 86.75%
This needs improvement, by experimenting with a variety of tags and tokens (some experiments on the chunking task are shown in the following slides).
Most common error seen: NNP and NNC tagged as NN (see the output of the system). There is an opportunity to carry out experiments to eliminate such errors as part of the NLPAI shared task, 2006 (introduced at the end).

Introduction to TnT
• An efficient implementation of the Viterbi algorithm for second-order Markov chains (trigram approximation).
• Language independent: can be trained on any corpus.
• Easy to use.

TnT has 4 main programs:
• tnt-para: trains the model (parameter generation).
  tnt-para [options] <corpus_file>
• tnt: tagging.
  tnt [options] <model> <corpus>
• tnt-diff: compares two files to get precision/recall figures.
  tnt-diff [options] <original file 1> <new output file>
• tnt-wc: counts tokens (words) and types (pos-tag/chunk-tag) in different files.
  tnt-wc [options] <corpusfile>

Training file format: tokens and tags separated by white space, one token per line; %% starts a comment; a blank line starts a new sentence.
Example:
%% <comment>
nirAlA NNP
kI PREP
sAhiwya NN

yahAz PRP
yaha PRP
aXikAMRa JJ
The testing file consists of only the first column (the tokens). Other files are used to store the model: the .lex file, the .123 file and the .map file.
Demo1.

An example: chunk boundary identification with TnT
Chunk tags:
• STRT: a chunk starts at this token
• CNT: this token lies in the middle of a chunk
• STP: this token lies at the end of a chunk
• STRT_STP: this token lies in a chunk of its own
Chunk tag schemes:
• 2-tag scheme: {STRT, CNT}
• 3-tag scheme: {STRT, CNT, STP}
• 4-tag scheme: {STRT, CNT, STP, STRT_STP}

Input tokens
What kinds of input tokens can we use?
• Word only (simplest)
• POS tag only: use only the part-of-speech tag of the word
• Combinations of the above: Word_POStag (word followed by POS tag) or POStag_Word (POS tag followed by word)

Chunking with TnT: experiments
Training corpus = 150000 tokens, testing corpus = 20000 tokens.
A trick to improve learning is to train on the larger tagset and then reduce it to the smaller tagset; there is no loss of information, as all the tag schemes convey the same information.
Best results (precision = 85.6%) were obtained for input tokens of the form Word_POS, with the learning trick of reducing 4 tags to 2.

Chunking with TnT: improvement
85.6% is not good enough. The model improves (precision = 88.63%) by adding contextual information (POS tags).
Chunking with TnT: improvements
For experiments which lead to further improvements in chunk boundary identification, see: Akshay Singh, Sushama Bendre and Rajeev Sangal, "HMM based Chunker for Hindi", in Second International Joint Conference on Natural Language Processing: Companion Volume including Posters/Demos and Tutorial Abstracts.

Chunk labelling & results
Chunk labelling: chunks which have been identified have to be labelled as noun chunks, verb chunks, etc. Rule-based chunk labelling performed best.
Results:
• Final chunk boundary identification accuracy = 92.6%
• Chunk boundary identification + chunk labelling = 91.5%

Shared task
For information on the shared task, refer to the flyer on the NLPAI shared task 2006.

Thank you