
POS Tagging: An Overview

Xiaofei Lu
Linguistics 795K
April 15, 2002

Outline

1. POS-tagging vs. syntactic parsing
2. Why tagging: applications of POS-tagged texts
3. Two problems for POS tagging
4. A tagging system
   4.1 Tokenization
   4.2 Tagset
   4.3 Encoding of token-tag relations
   4.4 Tagging schemes: assigning tags to words
   4.5 Non-linguistic issues
5. Early automatic taggers: the rule-based approach
6. Automatically trained taggers: the probabilistic approach
   6.1 Supervised tagging
   6.2 Unsupervised tagging
   6.3 Other algorithms for automatic training
7. Assignments
References

1. POS-tagging vs. syntactic parsing

1.1 Considered as one task

Both POS tagging and syntactic annotation specify the grammatical characteristics of a text: POS-tagging is a specification of the leaves of the phrase-structure (PS) tree, which is a favored model for syntactic annotation.

(1)
                      S
        Nr               N          V    .
    MD    NNT1  ,  AT   NN1        VVD   .
    Last  year  ,  the  workforce  grew  .

1.2 Considered as separate tasks

Practically, the difficulty of parsing unrestricted text makes it a useful expedient to divide the work of parsing into two manageable tasks. POS-tagging can be done much more accurately and quickly than parsing, and good taggers can typically be developed for a domain much more rapidly than good parsers can. Therefore, POS-tagging serves as a precursor to parsing and other NLP tasks.

2. Why POS-tagging: applications of POS-tagged texts

2.1 Machine translation

The probability of a word in the source language translating into a word in the target language is highly dependent on the POS of the source word. E.g. the word guancha in Chinese can be translated as either observe or observation, depending on its POS, i.e. whether it is a noun or a verb.

2.2 Information retrieval (IR) and extraction

By augmenting the query a person gives to an IR system with POS information, more refined retrieval is possible. E.g. if one wants to search for documents containing produce as a noun, adding the POS information will eliminate irrelevant ones containing only produce as a verb.

Patterns used for extracting information from text frequently make reference to POSs, too. E.g. to extract information of the form acquire (X, Y) from the Wall Street Journal, one may use the following template:

(2) (DET)? (PROPER NOUN)+ (acquired | bought) (DET)? (PROPER NOUN)+
              X                 acquire                   Y

From (3) would be extracted (4):

(3) International Business Machines bought Violet Corporation
(4) acquire (International Business Machines, Violet Corporation)

2.3 Higher-level syntactic processing

Tagging often serves as a precursor to higher-level syntactic-processing systems. E.g. noun phrase chunkers (programs to find NPs in each sentence) use a combination of word and POS information to learn either regular expressions for NPs or the contexts in a sentence that are likely to indicate the beginning or ending of a phrase (Church 1988; Ramshaw and Marcus, 1995).

Tagging can also be important for speech synthesis. E.g. the word record is pronounced differently depending on whether it is a noun or a verb.

3. Two problems for POS tagging

3.1 Words with multiple POSs

Many words have multiple POSs, and one of the difficult tasks is to provide the machine with the knowledge necessary to disambiguate from the set of allowable tags based on context. E.g. the word saw can be a noun or a verb, and the word can can be a noun, a verb, or an auxiliary verb. These usages can appear in one sentence, as in (5) and (6).

(5) I saw a saw saw a saw.
(6) We can can the can.

3.2 New words

New words appear all the time. Thus, a method for determining the tags of new words is needed. This can be done based on contextual information and information about the word itself, such as affixes. E.g. it is easy to determine that the made-up word goblanesque is an adjective, based on the environment in which it appears and the suffix -esque.

(7) His goblanesque writing was a refreshing treat.

4. A tagging system

Three questions to start with:
· How to divide the text into individual word tokens?
· How to choose a tagset?
· How to choose which tag is to be applied to which word token?

4.1 Tokenization (multiwords, merged words, compounds)

4.2 Tagset

A tagset is a list of tags used for a given task of grammatical tagging. E.g. the CLAWS C7 tagset.

There is room for disagreement about what word-categories are useful or linguistically applicable, and for interference from practical constraints such as speed and accuracy. Normally, there has to be a trade-off between what is linguistically desirable and what is computationally feasible.

E.g., it makes grammatical sense to have separate tags for the present subjunctive (come what may), the imperative form (come here!), and the present tense plural indicative form (they come every spring). However, in practice, it is difficult to distinguish them without a substantial proportion of errors. The solution was to merge them into a single category ‘finite base form’ as opposed to non-finite base form (Would you like to come?). Even this distinction is ignored in some projects, e.g. the tagging of the Brown Corpus.

4.2.1 Tags and labels

· A tag is a word-class embodied in an annotative device associated with a word in the text.
· Four criteria for choosing labels for tags:
  1) Conciseness: Brief labels are more convenient than lengthy ones. E.g. DD1 vs. SINGULAR_DETERMINER.
  2) Perspicuity: Labels which can be easily interpreted and remembered are more user-friendly than those which cannot. E.g. Preposition vs. IN.
  3) Analyzability: Labels which are decomposable into their logical parts are better than those which are not. E.g. NP1, in the BNC tagset, can be decomposed into: N = noun (vs. V = verb, etc.); P = proper [noun] (vs. N = common noun); 1 = singular (vs. 2 = plural). With an analyzable tagset, searches of the corpus can be carried out at varying levels of granularity. E.g. the symbol N* can represent all nouns, N*1 can represent all singular nouns, and NP* all proper nouns. However, people may prefer more perspicuous labels such as Noun: prop: sing to NP1.
  4) Disambiguity: each tag corresponds to a unique label.

· Horizontal and vertical formats for presenting a tagged corpus

Horizontal format:

Oh_UH ,_, and_CC he_PPHS1 did_VDD pass_VV0 his_APP$ exams_NN2 ._.

Vertical format:

SK01 271 Oh    UH    (interjection)
SK01 272 ,     ,     (comma)
SK01 273 and   CC    (coordinating conjunction)
SK01 274 he    PPHS1 (3rd per. pronoun, sing. nom.)
SK01 275 did   VDD   (past tense of the verb do)
SK01 276 pass  VV0   (base form of lexical verb)
SK01 277 his   APP$  (possessive determiner)
SK01 278 exams NN2   (plural common noun)
SK01 279 .     .     (period)

The two formats are easily convertible. Verbose labels are more conveniently handled in the vertical format than in the horizontal one.

4.2.2 Logical tagsets

The relations between the word categories symbolized by tags should be representable as a hierarchical tree, with attributes being inherited from one level of the tree to another. The same alphanumeric symbol, in a particular position in the sequence of symbols in the label, may have the same meaning across different branches of the tree. E.g. below is a summary representation of part of the C7 tagset as a hierarchical, logical tagset.

N
  N
    1   NN1: common noun, general, singular
    2   NN2: common noun, general, plural
        NN: common noun, general, number-neutral
    T
      1 NNT1: common noun, temporal, singular
      2 NNT2: common noun, temporal, plural
  P
    1   NP1: proper noun, general, singular
    2   NP2: proper noun, general, plural
        NP: proper noun, general, number-neutral
    M
      1 NPM1: proper noun, month, singular
      2 NPM2: proper noun, month, plural

4.2.3 Size and composition of tagsets

· Size is less important than it seems, and changeable according to the emphasis of a particular project.
· The ‘core’ of a tagset tends to be the major word classes with their principal sub-classes.
· Peripheral elements that need to be marked tend to be ignored: e.g. for a written corpus, certain categories of WIC (word-initial capital) words, such as month nouns, day nouns, etc., may be of semantic and syntactic significance in their own right. In spoken corpora, it would be useful to distinguish certain types of discourse marker (well) or hesitation marker (er).
· Conflict between linguistic (external) reasons and computational (internal) reasons for determining the composition of a tagset. The linguistic quality of a tagset (e.g. the extent to which it allows retrieval of important grammatical distinctions in the language) concerns the user’s requirements; the computational tractability of a tagset (e.g. the extent to which a particular tag is useful in aiding the disambiguation process and increasing the accuracy of tagging) is ‘internal’. Most tagsets show some signs of the ‘internal’ criteria impinging on the ‘external’ criteria, e.g. the low tractability of the subjunctive category, given the ambiguity of verb base forms.
· More on the comparison and evaluation of tagsets coming up with Kyuchul.

4.3 Encoding of token-tag relations

· Issues of tokenization raise problems for the way we encode token-tag relations.
· The tagging of the Brown Corpus set up a model imitated by many other tagging projects, such as the tagged LOB Corpus, the Penn Treebank, and the SUSANNE Corpus.
· The Text Encoding Initiative (TEI) (Burnard, 1995): a movement towards achieving an acceptable standard in the encoding of electronic textual material on computer, esp. for purposes of data interchange, based on the mark-up system known as SGML (more on XML coming up soon with Stacey). The SGML-conformant set-ups can be elaborated to deal with multiwords, mergers, and ‘phantom words’. Examples from BNC:

Multiwords: <w PRP>in lieu of <w NN1>payment
Mergers: <w PNP>they<w VBB>’re <w VVG>passing
‘Phantom words’: <w AJ0><w PRP>post-</w PRP><w AJ0>Cold</w AJ0> <w NN1>war </w NN1></w AJ0>

4.4 Tagging schemes: assigning tags to words

· Task: to specify how decisions are made about how to assign tags to words.
· Lexicon: to provide information on which tags are assignable to which words.
· Disambiguation: when multiple tags are assignable to a single word, we need to define the contextual conditions and distributional factors for choosing a particular tag for a particular word-token.
· Tagging manuals for different tagged corpora, e.g. Santorini (1990) for the Penn Treebank, explain the tagging decisions to be made in all possible contexts.
· Grey areas of unclarity between the use of one tag and another in English, e.g. in plastic bottle, should plastic be tagged as a noun or an adjective?

4.5 Non-linguistic issues

· manual vs. automatic tagging
· techniques and capabilities of the tagging software
· available human and hardware resources
· speed, accuracy, and consistency requirements

5. Automatic tagging: the rule-based approach

· An early program (Greene and Rubin, 1971) was created to tag the Brown Corpus semi-automatically. High-certainty word-environments discovered manually were applied to tag reliable regions in the corpus, followed by manual tagging and correction. No manually tagged corpus was available to develop and test the manually created rules empirically, so accuracy rates were not high.
· Klein and Simmons (1963) constructed a totally automatic tagger. First, all allowable tags listed in the lexicon were assigned to each word. Second, a sequence of handwritten rules deleted certain tags as possibilities in certain environments. A rule might read, “If a word is tagged with both an N tag and a V tag, and it occurs immediately after a Det, then remove the V tag.”
· A recently developed tagger using the same principles but based on Constraint Grammars (Karlsson et al., 1995) has been applied to POS tagging with success. Both the morphological and syntactic analysers use rule-based linguistic descriptions. The system works in the following way:
  1. Tokenisation
  2. Lookup of morphological tags:
     a) A lexical analyser assigns all possible morphological analyses to each word found in a large lexicon including all inflected and central derived word forms.
     b) A guesser is used to assign an analysis to all remaining words.
  3. A rule-based Constraint Grammar parser is used to resolve morphological ambiguities.
  4. Syntactic lookup: all possible syntactic tags are introduced for each word.
  5. Resolution of syntactic ambiguities: the parser finally consults a syntactic disambiguation grammar. The English version of the Constraint Grammar contains 800 syntactic constraints, of a similar form to the rules at the morphological resolution stage.

6. Automatically trained taggers: the probabilistic approach

6.1 Supervised tagging

Supervised taggers typically rely on pre-tagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, for example: the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set.

6.2 Unsupervised tagging

Unsupervised models, on the other hand, are those which do not require a pre-tagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e. tag sets) and, based on those automatic groupings, either to calculate the probabilistic information needed by stochastic taggers or to induce the context rules needed by rule-based systems.

6.3 Other algorithms for automatic training

6.3.1 Transformation-based learning
6.3.2 Learning without a tagged text
6.3.3 Maximum entropy tagging

7. Assignments

Run XKWIC on the BNC-SAMP, and answer the following questions:

1. Which of the two appears more frequently, the definite or indefinite determiner?
2. Which of the three appears most frequently, nouns, verbs, or adjectives?
3. Which of the two appears more frequently, singular nouns or plural nouns?
4. How often does VDD occur before negation?
5. Is the word record used more often as a noun or a verb?
6. How many sentences contain two word forms of have?
7. What kind of (two-word) compound nouns do you find in the corpus?
8. What are the adjectives used to modify men and women respectively?
9. What kind of feelings do people have?

References

Brill, E. 2000. ‘Part-of-Speech Tagging,’ in Robert Dale, Hermann Moisl, Harold Somers (eds.), Handbook of Natural Language Processing. New York: Marcel Dekker. 403-414.

Burnard, L. 1995. ‘Users’ Reference Guide for the British National Corpus’. Oxford University Computing Services.

Church, K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. Second Conference on Applied Natural Language Processing, Austin, TX, pp. 136-143.

Greene, B. B. & Rubin, G. M. 1971. Automatic grammatical tagging of English. Technical Report, Brown University. Providence, RI.

Karlsson, F., Voutilainen, A. & Anttila, A. 1995. Constraint Grammar. Berlin: Mouton de Gruyter.

Klein, S. & Simmons, R. 1963. A computational approach to the grammatical coding of English words. Journal of the Association for Computing Machinery 10: 334-347.

Leech, G. 1997. ‘Grammatical Tagging,’ in Roger Garside, Geoffrey Leech, Tony McEnery (eds.), Corpus Annotation: Linguistic Information From Computer Text Corpora. London: Longman.

Ramshaw, L. A. & Marcus, M. P. 1995. Text Chunking using Transformation-Based Learning. In Proceedings of the ACL Third Workshop on Very Large Corpora, June 1995, pp. 82-94.

Santorini, B. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. Department of Computer and Information Science, University of Pennsylvania, Technical Report MS-CIS-90-47.
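
Supplementary sketch for Section 5. The two-step procedure described for Klein and Simmons (1963), lexicon lookup followed by handwritten deletion rules, can be illustrated with the single rule quoted there. The toy lexicon, the function names, and the prefix tests on C7-style labels are assumptions made for this illustration only; they are not taken from the original tagger.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: every tag a word form may carry (C7-style labels).
LEXICON: Dict[str, Set[str]] = {
    "i": {"PPIS1"},
    "we": {"PPIS2"},
    "a": {"AT1"},
    "the": {"AT"},
    "saw": {"NN1", "VVD"},
    "can": {"NN1", "VV0", "VM"},
}


def is_noun(tag: str) -> bool:
    # Assumption: noun tags start with "N", as in an analyzable tagset.
    return tag.startswith("N")


def is_verb(tag: str) -> bool:
    # Assumption: verb tags (lexical and modal) start with "V".
    return tag.startswith("V")


def is_det(tag: str) -> bool:
    # Assumption: the articles AT and AT1 stand in for "Det".
    return tag.startswith("AT")


def assign_all_tags(tokens: List[str]) -> List[Set[str]]:
    """Step 1: give each token every tag the lexicon allows."""
    return [set(LEXICON.get(tok.lower(), {"UNKNOWN"})) for tok in tokens]


def drop_verb_after_det(candidates: List[Set[str]]) -> List[Set[str]]:
    """Step 2: if a word has both an N tag and a V tag and immediately
    follows a Det, remove its V tags."""
    result = [set(c) for c in candidates]
    for i in range(1, len(result)):
        prev, cur = result[i - 1], result[i]
        prev_is_det = all(is_det(t) for t in prev)
        has_noun = any(is_noun(t) for t in cur)
        has_verb = any(is_verb(t) for t in cur)
        if prev_is_det and has_noun and has_verb:
            result[i] = {t for t in cur if not is_verb(t)}
    return result


def tag(sentence: str) -> List[Tuple[str, Set[str]]]:
    """Whitespace-tokenize, look up all tags, then apply the deletion rule."""
    tokens = sentence.split()
    return list(zip(tokens, drop_verb_after_det(assign_all_tags(tokens))))


if __name__ == "__main__":
    for token, tags in tag("We can can the can"):
        print(token, sorted(tags))
```

Applied to examples (5) and (6) without the final period, the rule leaves only NN1 on each saw or can that directly follows a or the. The other ambiguous tokens keep all their tags, which shows why such a tagger needs a sequence of rules and not just one.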
