Lecture 11: Parts of speech
Intro to NLP, CS585, Fall 2014
http://people.cs.umass.edu/~brenocon/inlp2014/
Brendan O'Connor (http://brenocon.com)
Some material borrowed from Chris Manning, Jurafsky & Martin, and student exercises

• Review next Tues. Which time: 12-1:30pm or 5:30-7pm?

What's a part-of-speech (POS)?
• Syntactic categories / word classes
• You could substitute words within a class and have a syntactically valid sentence.
• They give information about how words can combine.
  • I saw the dog
  • I saw the cat
  • I saw the {table, sky, dream, school, anger, ...}

POS is an old idea
• Dionysius Thrax of Alexandria (c. 100 BCE): 8 parts of speech
• Common in grammar classes today: noun, verb, adjective, preposition, conjunction, pronoun, interjection
• Many other more fine-grained possibilities
• https://www.youtube.com/watch?v=ODGA7ssL-6g&index=1&list=PL6795522EAD6CE2F7

Open class (lexical) words
• Nouns: proper (IBM, Italy), common (cat / cats, snow)
• Verbs: main (see, registered)
• Adjectives: old, older, oldest
• Adverbs: slowly
• Numbers: 122,312, one
• ... more

Closed class (functional) words
• Determiners: the, some
• Conjunctions: and, or
• Pronouns: he, its
• Modals: can, had
• Prepositions: to, with
• Particles: off, up
• Interjections: Ow, Eh
• ... more

Open vs. closed classes
• Closed
  • Determiners: a, an, the
  • Pronouns: he, she, it, they, ...
  • Prepositions: on, over, under, of, ...
  • Why "closed"? Many are "grammatical function words."
• Open
  • Nouns, verbs, adjectives, adverbs

Many tagging standards
• Brown corpus (85 tags)
• Penn Treebank (45 tags) ... the most common one
• Coarse tagsets
  • Petrov et al. "Universal" tagset (12 tags)
    • http://code.google.com/p/universal-pos-tags/
    • Motivation: cross-linguistic regularities, e.g. adposition covers both pre- and postpositions
    • For English, a collapsing of PTB tags
  • Gimpel et al. tagset for Twitter (25 tags)
    • Motivation: easier for humans to annotate
    • We collapsed PTB and added new things that were necessary for Twitter

Coarse tags, Twitter
• Example tweet:
  It's a great show catch it on the Sundance channel #SMSAUDIO http://instagram.com/p/trHejUML3X/
  Tags: D D A N V O P D ^ N # U
• Proper or common? Does it matter?
• Grammatical category?? Tags like # (hashtag) and U (URL) are not really grammatical categories, but perhaps important word classes.

Why do we want POS?
• Useful for many syntactic and other NLP tasks:
  • Phrase identification ("chunking")
  • Named entity recognition
  • Full parsing
  • Sentiment
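To make the coarse-tagset idea from the "Many tagging standards" slide concrete, here is a minimal sketch (not from the lecture) of collapsing Penn Treebank tags into Petrov et al.-style universal categories. The full mapping lives at the universal-pos-tags URL above; the handful of entries below is an illustrative subset, and the function name is my own.

```python
# Illustrative (partial) mapping from Penn Treebank tags to coarse
# "universal" categories in the spirit of Petrov et al.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "IN": "ADP",   # PTB IN covers prepositions and subordinating conjunctions
    "DT": "DET", "PRP": "PRON", "CC": "CONJ", "CD": "NUM",
}

def coarsen(tagged_tokens):
    """Map (word, PTB tag) pairs to (word, universal tag) pairs."""
    return [(w, PTB_TO_UNIVERSAL.get(t, "X")) for w, t in tagged_tokens]

print(coarsen([("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")]))
# [('I', 'PRON'), ('saw', 'VERB'), ('the', 'DET'), ('dog', 'NOUN')]
```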
POS patterns: sentiment
• Turney (2002): identify bigram phrases useful for sentiment analysis (plus other sentiment stuff). Three steps:
• 1. Extract two-word phrases from a review (tags assigned by Brill's tagger) if their tags match one of these patterns:
    1. JJ               + NN or NNS              (third word: anything)
    2. RB, RBR, or RBS  + JJ                     (third word: not NN nor NNS)
    3. JJ               + JJ                     (third word: not NN nor NNS)
    4. NN or NNS        + JJ                     (third word: not NN nor NNS)
    5. RB, RBR, or RBS  + VB, VBD, VBN, or VBG   (third word: anything)
  The second pattern, for example, means two consecutive words are extracted if the first is an adverb and the second an adjective, but the third word (which is not extracted) cannot be a noun. NNP and NNPS (proper nouns) are avoided so that the names of the items in the review cannot influence the classification.
• 2. Estimate the semantic orientation (SO) of each extracted phrase with the PMI-IR algorithm:
    SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
  PMI is estimated from search-engine hit counts (AltaVista NEAR queries); "excellent" and "poor" were chosen because five-star reviews are commonly defined as "excellent" and one-star reviews as "poor". SO is positive when the phrase is more strongly associated with "excellent", negative when it is more strongly associated with "poor".
• 3. Average the SO of the phrases in the review; classify the review as recommended if the average is positive, otherwise not recommended. Evaluated on 410 reviews from Epinions.
• Example (a recommended Bank of America review): online experience JJ NN (2.253), low fees JJ NNS (0.333), local branch JJ NN (0.421), small part JJ NN (0.053), online service JJ NN (2.780), printable version JJ NN (-0.705), direct deposit JJ NN (1.288), well other RB JJ (0.237), inconveniently located RB VBN (-1.541), other bank JJ NN (-0.850), true service JJ NN (-0.732); average semantic orientation 0.322.

POS patterns: simple noun phrases
• Quick and dirty noun phrase identification (a code sketch follows below)

Exercises
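Here is a minimal sketch (my own code, not from the lecture or from Turney's implementation) of the phrase-extraction step above: it scans a tagged sentence and keeps bigrams whose tags match the five patterns. The PMI-IR scoring step is omitted, and function names are my own.

```python
# Minimal sketch of Turney-style two-word phrase extraction.
# Input: a list of (word, PTB tag) pairs; output: candidate bigram phrases.
ADJ = {"JJ"}
NOUN = {"NN", "NNS"}
ADV = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def turney_phrases(tagged):
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        third_not_noun = t3 not in NOUN          # "third word: not NN nor NNS"
        if (
            (t1 in ADJ and t2 in NOUN) or                       # pattern 1
            (t1 in ADV and t2 in ADJ and third_not_noun) or     # pattern 2
            (t1 in ADJ and t2 in ADJ and third_not_noun) or     # pattern 3
            (t1 in NOUN and t2 in ADJ and third_not_noun) or    # pattern 4
            (t1 in ADV and t2 in VERB)                          # pattern 5
        ):
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("the", "DT"), ("service", "NN"), ("was", "VBD"),
          ("inconveniently", "RB"), ("located", "VBN")]
print(turney_phrases(tagged))   # ['inconveniently located']
```

The "quick and dirty" noun phrase identification on the next slide can be done in the same spirit, e.g. by matching runs of tags against a pattern like (JJ|NN)*(NN|NNS).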
POS Tagging: lexical ambiguity
• Can we just use a tag dictionary (one tag per word type)?
• Most word types (80-86%) are unambiguous: they have only a single tag (Janet is always NNP, funniest JJS, and hesitantly RB).
• But the ambiguous types, although only 14-15% of the vocabulary, are some of the most common words of English, so 55-67% of word tokens in running text are ambiguous.
• Tag ambiguity for word types in the Brown and WSJ corpora (Treebank-3 45-tag tagging; punctuation counted as words, original case kept):

                           WSJ              Brown
  Types:
    Unambiguous (1 tag)    44,432 (86%)     45,799 (85%)
    Ambiguous (2+ tags)     7,025 (14%)      8,050 (15%)
  Tokens:
    Unambiguous (1 tag)   577,421 (45%)    384,349 (33%)
    Ambiguous (2+ tags)   711,780 (55%)    786,646 (67%)

• So most word types are unambiguous ... but not so for tokens!
• Note the large differences across the two genres, especially in token frequency: tags in the WSJ corpus are less ambiguous, presumably because the newspaper's specific focus on financial news leads to a more limited distribution of word usages than the more general texts combined into the Brown corpus.
• Ambiguous word types tend to be very common ones. Some of the most ambiguous frequent words are that, back, down, put and set.
• that:
  • I know that he is honest = IN (relativizer)
  • Yes, that play was nice = DT (determiner)
  • You can't go that far = RB (adverb)
• back, with 6 different parts of speech:
  • earnings growth took a back/JJ seat
  • a small building in the back/NN
  • a clear majority of senators back/VBP the bill
  • Dave began to back/VB toward the door
  • enable the country to buy back/RP debt
  • I was twenty-one back/RB then

POS Tagging: baseline
• Baseline: most frequent tag. 92.7% accuracy
• Simple baselines are very important to run!
• Why so high?
  • Many ambiguous words have a skewed distribution of tags
  • Credit for easy things like punctuation, "the", "a", etc.
• Is this actually that high?
  • I get 0.918 accuracy for token tagging
  • ...but 0.186 whole-sentence accuracy (!)
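A minimal sketch (my own, not the course's implementation) of the most-frequent-tag baseline and of the token-vs-sentence accuracy distinction mentioned above. It assumes training and test data as lists of sentences, each a list of (word, tag) pairs.

```python
from collections import Counter, defaultdict

def train_baseline(train_sents):
    """Most-frequent-tag baseline: remember each word's most common tag,
    and fall back to the overall most common tag for unseen words."""
    tag_counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in train_sents:
        for word, tag in sent:
            tag_counts[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]
    best = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
    return lambda words: [best.get(w, default_tag) for w in words]

def evaluate(tagger, test_sents):
    """Return (token accuracy, whole-sentence accuracy)."""
    tok_correct = tok_total = sent_correct = 0
    for sent in test_sents:
        words, gold = zip(*sent)
        pred = tagger(list(words))
        hits = sum(p == g for p, g in zip(pred, gold))
        tok_correct += hits
        tok_total += len(gold)
        sent_correct += (hits == len(gold))   # every tag in the sentence right?
    return tok_correct / tok_total, sent_correct / len(test_sents)

# Token accuracy for this baseline is typically around 0.9, while
# whole-sentence accuracy is far lower, as on the slide.
```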
POS tagging can be hard for humans too
• Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
• All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
• Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

Need careful guidelines (and do annotators always follow them?)
• PTB POS guidelines, Santorini (1990); section 4, "Confusing parts of speech", discusses parts of speech that are easily confused and gives guidelines on how to tag such cases. For example:
• CC or DT: When they are the first members of the double conjunctions both ... and, either ... or and neither ... nor, the words both, either and neither are tagged as coordinating conjunctions (CC), not as determiners (DT).
  • Either/DT child could sing.
  • But: Either/CC a boy could sing or/CC a girl could dance.
  • Either/CC a boy or/CC a girl could sing.
  • Either/CC a boy or/CC girl could sing.
  • Be aware that either or neither can sometimes function as determiners (DT) even in the presence of or or nor: Either/DT boy or/CC girl could sing.
• CD or JJ: Number-number combinations should be tagged as adjectives (JJ) if they have the same distribution as adjectives, e.g. a 50-3/JJ victory (cf. a handy/JJ victory). Hyphenated fractions (one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths) should be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be replaced by double or twice:
  • one-half/JJ cup; cf. a full/JJ cup
  • one-half/RB the amount; cf. twice/RB the amount; double/RB the amount
• Sometimes it is unclear whether one is a cardinal number or a noun. In general, it should be tagged as a cardinal number (CD) even when its sense is not clearly that of a numeral, e.g. one/CD of the best reasons. But if it could be pluralized or modified by an adjective in a particular context, it is a common noun (NN).

Some other lexical ambiguities
• Prepositions versus verb particles
  • turn into/P a monster vs. take out/T the trash
  • check it out/T, what's going on/T, shout out/T this
  • Test: "turn slowly into a monster" is fine, but *"take slowly out the trash" is not
• this, that -- pronouns versus determiners
  • i just orgasmed over this/O
  • this/D wind is serious
• Careful annotator guidelines are necessary to define what to do in many cases.
  • http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports
  • http://www.ark.cs.cmu.edu/TweetNLP/annot_guidelines.pdf
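To make the CC-vs-DT rule for both/either/neither above concrete, here is a minimal heuristic sketch of my own (not part of the PTB guidelines or the lecture): tag the word CC only when its paired and/or/nor appears later in the sentence, and DT otherwise. As the guidelines note, this is only an approximation, since either/neither can still be determiners even when or/nor follows.

```python
# Heuristic sketch for the CC-vs-DT decision for "both", "either", "neither".
# Approximation only: DT readings can occur even when the partner word
# follows ("Either/DT boy or/CC girl could sing").
PAIRS = {"both": "and", "either": "or", "neither": "nor"}

def cc_or_dt(tokens, i):
    """Guess CC or DT for the conjunction-like word at position i."""
    partner = PAIRS[tokens[i].lower()]
    rest = [t.lower() for t in tokens[i + 1:]]
    return "CC" if partner in rest else "DT"

print(cc_or_dt("Either a boy could sing or a girl could dance".split(), 0))  # CC
print(cc_or_dt("Either child could sing".split(), 0))                        # DT
```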
Excerpt from the Gimpel et al. Twitter annotation guidelines: since we are not annotating parse structure, it is less clear what to do with our data. To be consistent with the PTB, we typically tagged today as a noun. The PTB annotations are less clear for words like home: of the 24 times that home appears as a child under a DIR (direction) nonterminal, it is annotated as (ADVP-DIR (NN home)) 14 times, (ADVP-DIR (RB home)) 6 times, (NP-DIR (NN home)) 3 times, and (NP-DIR (RB home)) 1 time. Manual inspection of the 24 occurrences revealed no discernible difference in usage that would explain these differences in annotation. As a result of these inconsistencies, we decided to let annotators use their best judgment when annotating these types of words in tweets, asking them to refer to the previously-annotated data to improve consistency.

Proper nouns
• Common nouns vs. proper nouns on Twitter:
  Convinced that/O Monty/^ python/^ doing/V-VBG a/D completely/R straight/A faced/A Shakespeare/^ adaption/N would/V-VBD be/V-VB among/P the/D most/A Monty/^ python/^ things/N ever/R
• Names are multiwords. Their tokens are not always nouns.
• Many people in the exercises didn't want to do this. Token-level tagging is a weird abstraction here.
• From the annotation guidelines: in general, every noun within a proper name should be tagged as a proper noun (^):
  • Jesse/^ and/& the/D Rippers/^
  • the/D California/^ Chamber/^ of/P Commerce/^
• Company and web site names (Twitter, Yahoo! News) are tagged as proper nouns. Function words are only ever tagged as proper nouns if they are not behaving in a normal syntactic fashion, e.g. Ace/ ...

Are your tokens too big for tags?
• PTB tokenization of clitics leads to easy tagging:
  • I'm ==> I/PRP 'm/VBP
• Twitter: is this splitting feasible? Real examples (http://search.twitter.com):
  • hes   i'm   im   ill
  • Imma bout to do some body shots
• Gimpel et al.'s strategy: introduce compound tags (I'm = PRONOUN+VERB)

Are your tokens too big for tags?
• Other example: highly inflected languages, e.g. Turkish, have case, gender, etc. built into the "tag".
• From Jurafsky & Martin: tagging causes performance degradations in a wide variety of languages (including Czech, Slovene, Estonian, and Romanian) (Hajič, 2000). Highly inflectional languages also have much more information than English coded in word morphology, like case (nominative, accusative, genitive) or gender (masculine, feminine). Because this information is important for tasks like parsing and coreference resolution, part-of-speech taggers for morphologically rich languages need to label words with case and gender information. Tagsets for morphologically rich languages are therefore sequences of morphological tags rather than a single primitive tag. Here's a Turkish example, in which the word izin has three possible morphological/part-of-speech tags and meanings (Hakkani-Tür et al., 2002):
  1. Yerdeki izin temizlenmesi gerek.        iz + Noun+A3sg+Pnon+Gen    The trace on the floor should be cleaned.
  2. Üzerinde parmak izin kalmiş.            iz + Noun+A3sg+P2sg+Nom    Your finger print is left on (it).
  3. Içeri girmek için izin alman gerekiyor. izin + Noun+A3sg+Pnon+Nom  You need a permission to enter.
  Using a morphological parse sequence like Noun+A3sg+Pnon+Gen as the part-of-speech tag greatly increases the number of parts-of-speech, and so tagsets can be 4 to 10 times larger than the 50-100 tags we have seen for English. With such large tagsets, each word needs to be morphologically analyzed (using a method from Chapter 3, or an extensive dictionary) to generate the list of possible morphological tag sequences (part-of-speech tags) for the word. The role of the tagger is then to disambiguate among these tags.
• Our approach for Twitter was to simply treat each compound tag as a separate tag. Is this feasible here?
• This method also helps with unknown words.
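A minimal sketch of the compound-tag idea, in PTB terms rather than the actual Twitter tagset's coarse labels (which use their own symbols): an unsplit token like I'm gets one atomic label combining the tags its parts would receive after PTB-style splitting, and a long morphological analysis such as Noun+A3sg+Pnon+Gen is likewise just one label. The tag strings and data below are illustrative, not from the lecture.

```python
# Sketch of the "compound tags" strategy: the tagger's label space is simply
# the set of tag strings observed in training data, including compounds, so
# the tagging machinery is unchanged but the tagset grows (4-10x larger for
# morphologically rich languages, per the excerpt above).
train_sents = [
    [("I'm", "PRP+VBP"), ("happy", "JJ")],     # unsplit clitic: one compound tag
    [("she", "PRP"), ("is", "VBZ"), ("happy", "JJ")],
]

tagset = sorted({tag for sent in train_sents for _, tag in sent})
print(tagset)   # ['JJ', 'PRP', 'PRP+VBP', 'VBZ']
```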
How to build a POS tagger?
• Key sources of information:
  • 1. The word itself
  • 2. Morphology or orthography of the word
  • 3. POS tags of surrounding words: syntactic positions
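To preview how these three information sources typically enter a tagger, here is a minimal feature-extraction sketch of my own (not the course's implementation) of the kind used in classifier-based or sequence taggers; the feature names are arbitrary.

```python
def pos_features(words, i, prev_tag):
    """Feature sketch combining the three information sources above:
    (1) the word itself, (2) its morphology/orthography, and
    (3) the tag context of surrounding words (here: the previous tag)."""
    w = words[i]
    return {
        "word=" + w.lower(): 1,                   # 1. the word itself
        "suffix3=" + w[-3:].lower(): 1,           # 2. morphology: last 3 chars
        "is_capitalized": int(w[0].isupper()),    # 2. orthography
        "has_digit": int(any(c.isdigit() for c in w)),
        "has_hyphen": int("-" in w),
        "prev_tag=" + prev_tag: 1,                # 3. surrounding tag context
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1,
    }

print(pos_features("Janet will back the bill".split(), 2, prev_tag="MD"))
```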