Introduction to Data Science
Lecture 5
Natural Language Processing
CS 194 Fall 2015
John Canny
with some slides from David Hall
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Natural Language Processing (NLP)
• In a recent survey (KDnuggets blog) of data scientists, 62%
reported working “mostly or entirely” with data about people.
Much of this data is text.
• In the suggested CS194-16 projects (a random sample of data
science projects around campus), nearly half involve natural
language text processing.
• NLP is a central part of mining large datasets.
Natural Language Processing
Some basic terms:
• Syntax: the allowable structures in the language: sentences,
phrases, affixes (-ing, -ed, -ment, etc.).
• Semantics: the meaning(s) of texts in the language.
• Part-of-Speech (POS): the category of a word (noun, verb,
preposition etc.).
• Bag-of-words (BoW): a featurization that uses a vector of word
counts (or binary indicators), ignoring order.
• N-gram: for a fixed, small N (2-5 is common), an n-gram is a
consecutive sequence of N words in a text.
Bag of words Featurization
Assuming we have a dictionary mapping words to a unique integer
id, a bag-of-words featurization of a sentence could look like this:
Sentence:
The cat sat on the mat
word id’s:
1 12 5 3 1 14
The BoW featurization would be the vector (positions 1-14, with
non-zero counts at positions 1, 3, 5, 12 and 14):
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
In practice this would be stored as a sparse vector of (id, count) pairs:
(1,2), (3,1), (5,1), (12,1), (14,1)
Note that the original word order is lost, replaced by the order of
id’s.
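As a concrete sketch (assuming the toy word-to-id dictionary below, chosen to reproduce the slide's id assignments), the sparse featurization can be computed in a few lines of Python:

from collections import Counter

# Hypothetical word -> id dictionary matching the example ids above.
vocab = {"the": 1, "on": 3, "sat": 5, "cat": 12, "mat": 14}

def bow_featurize(sentence, vocab):
    """Return a sparse bag-of-words vector as a sorted list of (id, count) pairs."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return sorted(Counter(ids).items())

print(bow_featurize("The cat sat on the mat", vocab))
# [(1, 2), (3, 1), (5, 1), (12, 1), (14, 1)]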
Bag of words Featurization
• BoW featurization is probably the most common
representation of text for machine learning.
• Most algorithms expect numerical vector inputs and BoW
provides this.
These include kNN, Naïve Bayes, logistic regression, k-Means
clustering, topic models, and collaborative filtering (using text input).
Bag of words Featurization
• One challenge is the size of the representation. Since words
follow a power-law distribution, vocabulary size grows with
corpus size at almost a linear rate.
• Since rare words aren’t observed often enough to inform the
model, it’s common to discard infrequent words, e.g. those
occurring fewer than 5 times.
• One can also use the feature selection methods mentioned
last time: mutual information and chi-squared tests. But
these often have false positive problems and are less reliable
than frequency culling on power-law data.
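A minimal sketch of frequency culling (the min_count threshold of 5 mirrors the example above; the corpus here is a toy stand-in for pre-tokenized documents):

from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Keep only words occurring at least min_count times, and give them dense ids."""
    freqs = Counter(w for doc in tokenized_docs for w in doc)
    kept = sorted(w for w, c in freqs.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}

docs = [["the", "cat", "sat"], ["the", "mat"]]   # toy corpus; real corpora are much larger
print(build_vocab(docs, min_count=2))            # {'the': 0}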
Bag of words Featurization
• The feature selection methods mentioned last time: mutual
information and chi-squared tests, often have false positive
problems on power-law data.
• Frequency-based selection usually works better.
[Figure: log-log plot of word frequency vs. word rank. Each successive
block of ranks contains 10x more features but 10x fewer observations of
each. Every rare feature is a potential false positive, so the
false-positive rate is difficult to control for rare features.]
N-grams
Because word order is lost, the sentence meaning is weakened.
This sentence has quite a different meaning but the same BoW
vector:
Sentence:
The mat sat on the cat
word id’s:
1 14 5 3 1 12
The BoW featurization is the same vector:
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
But word order is important, especially the order of nearby words.
N-grams capture this, by modeling tuples of consecutive words.
N-grams
Sentence: The cat sat on the mat
2-grams: the-cat, cat-sat, sat-on, on-the, the-mat
Notice how even these short n-grams “make sense” as linguistic
units. For the other sentence we would have different features:
Sentence: The mat sat on the cat
2-grams: the-mat, mat-sat, sat-on, on-the, the-cat
We can go still further and construct 3-grams:
Sentence: The cat sat on the mat
3-grams: the-cat-sat, cat-sat-on, sat-on-the, on-the-mat
These capture still more of the meaning:
Sentence: The mat sat on the cat
3-grams: the-mat-sat, mat-sat-on, sat-on-the, on-the-cat
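A minimal Python sketch of n-gram extraction (hyphen-joined strings, as in the examples above):

def ngrams(tokens, n):
    """Return the list of n-grams (as hyphen-joined strings) of a token list."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # ['the-cat', 'cat-sat', 'sat-on', 'on-the', 'the-mat']
print(ngrams(tokens, 3))  # ['the-cat-sat', 'cat-sat-on', 'sat-on-the', 'on-the-mat']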
N-grams Features
Typically, it’s advantageous to use multiple n-gram features in machine
learning models for text, e.g.
unigrams + bigrams (2-grams) + trigrams (3-grams).
The unigrams have higher counts and are able to detect influences
that are weak, while bigrams and trigrams capture strong
influences that are more specific.
e.g. “the white house” will generally have very different influences
from the sum of influences of “the”, “white”, “house”.
N-grams as Features
Using n-grams + BoW improves accuracy over BoW alone for
classification and other text problems. A typical performance ordering
(e.g. classifier accuracy) looks like this:
3-grams < 2-grams < 1-grams < 1+2-grams < 1+2+3-grams
A few percent improvement is typical for text classification.
N-grams size
N-grams pose some challenges in feature set size.
If the original vocabulary size is |V|, the number of possible 2-grams is
|V|² while for 3-grams it is |V|³.
Luckily, natural language n-grams (including single words) have a
power-law frequency structure. This means that most of the n-grams you
see are common. A dictionary that contains the most common n-grams will
cover most of the n-grams you see.
Power laws for N-grams
N-grams also follow a power law distribution:
N-grams size
Because of this you may see values like this:
• Unigram dictionary size: 40,000
• Bigram dictionary size: 100,000
• Trigram dictionary size: 300,000
With coverage of > 80% of the features occurring in the text.
N-gram Language Models
N-grams can be used to build statistical models of texts.
When this is done, they are called n-gram language models.
An n-gram language model associates a probability with each n-gram, such that the sum over all n-grams (for fixed n) is 1.
You can then determine the overall likelihood of a particular
sentence:
“The cat sat on the mat”
is much more likely than
“The mat sat on the cat”
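A minimal sketch of a bigram language model estimated by maximum likelihood (real models add smoothing; the <s>/</s> boundary symbols and the small probability floor for unseen bigrams are illustrative choices, not from the lecture):

import math
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate bigram probabilities P(w_i | w_{i-1}) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def log_likelihood(sentence, probs, floor=1e-8):
    """Log-probability of a sentence; unseen bigrams get a small floor probability."""
    toks = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(probs.get(bg, floor)) for bg in zip(toks[:-1], toks[1:]))

# Toy corpus in which the first word order is far more common than the second.
sents = [["the", "cat", "sat", "on", "the", "mat"]] * 10 + [["the", "mat", "sat", "on", "the", "cat"]]
lm = train_bigram_lm(sents)
print(log_likelihood(["the", "cat", "sat", "on", "the", "mat"], lm) >
      log_likelihood(["the", "mat", "sat", "on", "the", "cat"], lm))   # True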
Skip-grams
We can also analyze the meaning of a particular word by looking at
the contexts in which it occurs.
The context is the set of words that occur near the word, i.e. at
displacements of …,-3,-2,-1,+1,+2,+3,… in each sentence where the
word occurs.
A skip-gram is a set of non-consecutive words (with specified
offsets) that occur in some sentence.
We can construct a BoSG (bag of skip-gram) representation for
each word from the skip-gram table.
Skip-grams
Then with a suitable embedding (DNN or linear projection) of the
skip-gram features, we find that word meaning has an algebraic
structure:
[Figure: 2-D projection of the word vectors for Man, King, Woman and Queen.]
Man + (King – Man) + (Woman – Man) = Queen, i.e. King – Man + Woman ≈ Queen.
Tomáš Mikolov et al. (2013). "Efficient Estimation of Word Representations in
Vector Space"
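As a sketch of this algebraic structure in code (using gensim, which is not covered in this lecture; the vector file name is hypothetical and stands for any pretrained word2vec-format embedding):

from gensim.models import KeyedVectors

# Hypothetical path to a pretrained word2vec-format embedding file.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman should land near "queen" in a good embedding.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))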
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Parts of Speech
Thrax’s original list (c. 100 B.C):
• Noun
• Verb
• Pronoun
• Preposition
• Adverb
• Conjunction
• Participle
• Article
Parts of Speech
Thrax’s original list (c. 100 B.C):
• Noun (boat, plane, Obama)
• Verb (goes, spun, hunted)
• Pronoun (She, Her)
• Preposition (in, on)
• Adverb (quietly, then)
• Conjunction (and, but)
• Participle (eaten, running)
• Article (the, a)
Parts of Speech (Penn Treebank 2014)
1. CC    Coordinating conjunction
2. CD    Cardinal number
3. DT    Determiner
4. EX    Existential there
5. FW    Foreign word
6. IN    Preposition or subordinating conjunction
7. JJ    Adjective
8. JJR   Adjective, comparative
9. JJS   Adjective, superlative
10. LS   List item marker
11. MD   Modal
12. NN   Noun, singular or mass
13. NNS  Noun, plural
14. NNP  Proper noun, singular
15. NNPS Proper noun, plural
16. PDT  Predeterminer
17. POS  Possessive ending
18. PRP  Personal pronoun
19. PRP$ Possessive pronoun
20. RB   Adverb
21. RBR  Adverb, comparative
22. RBS  Adverb, superlative
23. RP   Particle
24. SYM  Symbol
25. TO   to
26. UH   Interjection
27. VB   Verb, base form
28. VBD  Verb, past tense
29. VBG  Verb, gerund or present participle
30. VBN  Verb, past participle
31. VBP  Verb, non-3rd person singular present
32. VBZ  Verb, 3rd person singular present
33. WDT  Wh-determiner
34. WP   Wh-pronoun
35. WP$  Possessive wh-pronoun
36. WRB  Wh-adverb
POS tags
• Morphology: word form carries information, e.g. “liked,” “follows,” “poke.”
• Context: “can” can play different roles, as in “trash can,” “can do,” “can it.”
• Therefore taggers should look at the neighborhood of a word.
Constraint-Based Tagging
Based on a table of words with morphological and context
features:
POS taggers
• Taggers typically use HMMs (Hidden-Markov Models) or
MaxEnt (Maximum Entropy) sequence models, with the latter
giving state-of-the-art performance.
• Accuracy on test data is around 97%.
• A representative is the Stanford POS tagger:
http://nlp.stanford.edu/software/tagger.shtml
• Tagging is fast: around 300k words/second is typical (Speedread; the
Stanford tagger is about 10x slower).
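A minimal tagging sketch using NLTK's built-in averaged-perceptron tagger (not the Stanford tagger linked above; the example output shown in the comment is approximate):

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = nltk.word_tokenize("The cat sat on the mat")
print(nltk.pos_tag(tokens))
# roughly: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]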
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Named Entity Recognition
• People, places, (specific) things
• “Chez Panisse”: Restaurant
• “Berkeley, CA”: Place
• “Churchill-Brenneis Orchards”: Company(?)
• Not “Page” or “Medjool”: part of a non-rigid designator
Named Entity Recognition
• Why a sequence model?
• “Page” can be “Larry Page” or “Page mandarins” or
just a page.
• “Berkeley” can be a person.
Named Entity Recognition
• Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates
• States:
• Restaurant, Place, Company, GPE, Movie, “Outside”
• Usually use BeginRestaurant, InsideRestaurant.
• Why?
• Emissions: words
• Estimating these probabilities is harder.
Named Entity Recognition
• Both (hand-written) grammar methods and machine-learning
methods are used.
• The learning methods use sequence models: HMMs or CRFs
(Conditional Random Fields), with the latter giving state-of-the-art
performance.
• State-of-the-art on standard test data = 93.4%, human
performance = 97%.
• Speed around 200k words/sec (Speedread).
• Example: http://nlp.stanford.edu/software/CRF-NER.shtml
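A minimal NER sketch using NLTK's bundled chunker (the Stanford CRF tagger linked above is more accurate but requires the external Java distribution; the sentence and the labels shown are illustrative, and the exact spans depend on the model):

import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Larry Page founded Google in Menlo Park"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# e.g. PERSON Larry Page / ORGANIZATION Google / GPE Menlo Park (labels depend on the model)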
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node)
• S → NP VP
• S → NP VP PP
• NP → DT NN
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node) “the cat sat on the mat”
• S → NP VP
• S → NP VP PP      (the cat) (sat) (on the mat)
• NP → DT NN        (the cat), (the mat)
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
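The toy grammar above can be tried directly in NLTK; this is a sketch with the rules transcribed as written (no probabilities):

import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP PP
  NP  -> DT NN
  VP  -> VBD
  PP  -> IN NP
  DT  -> 'the'
  NN  -> 'cat' | 'mat'
  VBD -> 'sat'
  IN  -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    tree.pretty_print()   # prints the parse tree for the sentence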
Grammars
English grammars are context-free: the productions do not depend
on any words before or after the production.
The reconstruction of a sequence of grammar productions from a
sentence is called “parsing” the sentence.
It is most conveniently represented as a tree:
Parse Trees
“The cat sat on the mat”
Parse Trees
In bracket notation:
(ROOT
(S
(NP (DT the) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat))))))
Grammars
There are typically multiple ways to produce the same sentence.
Consider the statement by Groucho Marx:
“While I was in Africa, I shot an elephant in my pajamas”
“How he got into my pajamas, I don’t know”
Parse Trees
“… I shot an elephant in my pajamas” (what people hear first)
Parse Trees
Groucho’s version
Grammars
Recursion is common in grammar rules, e.g.
NP → NP RC
Because of this, sentences of arbitrary length are possible.
Recursion in Grammars
“Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”.
Grammars
It’s also possible to have “sentences” inside other sentences…
S → NP VP
VP → VB NP SBAR
SBAR → IN S
Recursion in Grammars
“Nero played his lyre while Rome burned”.
5-min Break
Please talk to us if you haven’t finalized your group!
We will get you your data in the next day or two.
We’re in this room for lecture and lab from now on.
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
PCFGs
Complex sentences can be parsed in many ways, most of which
make no sense or are extremely improbable (like Groucho’s
example).
Probabilistic Context-Free Grammars (PCFGs) associate and learn
probabilities for each rule:
S → NP VP       0.3
S → NP VP PP    0.7
The parser then tries to find the most likely sequence of
productions that generate the given sentence. This adds more
realistic “world knowledge” and generally gives much better results.
Most state-of-the-art parsers these days use PCFGs.
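A minimal PCFG sketch in NLTK; the rule probabilities below are made up for illustration, not learned from data:

import nltk

pcfg = nltk.PCFG.fromstring("""
  S   -> NP VP [0.3] | NP VP PP [0.7]
  NP  -> DT NN [1.0]
  VP  -> VBD [0.6] | VBD PP [0.4]
  PP  -> IN NP [1.0]
  DT  -> 'the' [1.0]
  NN  -> 'cat' [0.5] | 'mat' [0.5]
  VBD -> 'sat' [1.0]
  IN  -> 'on' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree.prob(), tree)   # most probable parse and its probability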
CKY parsing
A CKY (Cocke-Younger-Kasami) table is a dynamic programming
parser (most parsers use some form of this):
[Figure: CKY chart for “John hit the ball”, with N(John), V(hit), D(the),
N(ball) along the diagonal and NP, VP and S filled in above them.]
CKY parsing
Not every internal node encodes a symbol: there are N-1 internal nodes
in the (binary) parse tree, but O(N²) cells in the table.
[Figure: the chart is filled in cell by cell: the word symbols N(John),
V(hit), D(the), N(ball) first, then NP over “the ball”, VP over “hit the
ball”, and finally S spanning the whole sentence.]
CKY parsing
For an internal node x at height k, there are k possible pairs of symbols
(B, C) into which x can decompose, one pair per split point:
[Figure: a cell x split into a left child B and a right child C at each of
the possible split points.]
Viterbi Parse
Given the probabilities of cells in the table, we can find the most
probable parse using a generalized Viterbi scan.
That is, find the most likely parse at the root, then return it and the
left and right nodes which generated it.
We can do this as part of the scoring process by maintaining a
set of backpointers.
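A minimal CKY/Viterbi sketch over a toy grammar in Chomsky Normal Form (the rule probabilities are made up; the sentence matches the chart example above):

from collections import defaultdict

# Binary rules: (left child, right child) -> list of (parent, probability).
binary = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"):  [("VP", 1.0)],
    ("D", "N"):   [("NP", 0.6)],
}
# Lexical rules: word -> list of (tag, probability).
lexical = {
    "John": [("NP", 0.4)],
    "hit":  [("V", 1.0)],
    "the":  [("D", 1.0)],
    "ball": [("N", 1.0)],
}

def cky_viterbi(words):
    n = len(words)
    # best[i][j] maps a symbol to (probability, backpointer) for the span words[i:j].
    best = [[defaultdict(lambda: (0.0, None)) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in lexical.get(w, []):
            best[i][i + 1][tag] = (p, w)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                     # split point
                for B, (pb, _) in best[i][k].items():
                    for C, (pc, _) in best[k][j].items():
                        for A, pr in binary.get((B, C), []):
                            p = pr * pb * pc
                            if p > best[i][j][A][0]:      # keep the most probable derivation
                                best[i][j][A] = (p, (k, B, C))
    return best

def backtrace(best, i, j, sym):
    """Follow backpointers to recover the most probable tree for symbol sym over words[i:j]."""
    p, bp = best[i][j][sym]
    if isinstance(bp, str):                               # lexical cell: backpointer is the word
        return (sym, bp)
    k, B, C = bp
    return (sym, backtrace(best, i, k, B), backtrace(best, k, j, C))

words = "John hit the ball".split()
chart = cky_viterbi(words)
print(backtrace(chart, 0, len(words), "S"))
# ('S', ('NP', 'John'), ('VP', ('V', 'hit'), ('NP', ('D', 'the'), ('N', 'ball'))))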
Systems
• NLTK: Python-based NLP system. Many modules, good
visualization tools, but not quite state-of-the-art performance.
• Stanford Parser: Another comprehensive suite of tools (also POS
tagger), and state-of-the-art accuracy. Has the definitive
dependency module.
• Berkeley Parser: Slightly higher parsing accuracy (than Stanford)
but not as many modules.
• Note: high-quality dependency parsing is usually very slow, but
see: https://github.com/dlwh/puck
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Dependency Parsing
Dependency structures are rooted in existing words, i.e. there
are no non-terminal nodes, and a tree on n words has exactly n
nodes.
A dependency tree has about half as many total nodes as a
binary (Chomsky Normal form) constituency tree.
Dependency Parsing
Dependency parses may be non-binary, and structure type is
encoded in links rather than nodes:
Constituency vs. Dependency Grammars
Dependency parses are faster to compute and give more compact
descriptions. They are not simplifications of constituency parses,
although they can be derived from lexicalized constituency grammars.
Lexicalized grammars assign a head word to each non-terminal
node.
Constituency vs. Dependency Grammars
• Dependency and constituency parses capture attachment
information (e.g. sentiments directed at particular targets).
• Constituency parses are probably better for abstracting
semantic features, i.e. constituency units often correspond to
semantic units.
Constituency grammars have a much larger space to search and
take longer, especially as sentence length grows (parsing is O(N³)).
However, if the goal is to find semantic groups, it’s possible to
proceed bottom-up with (small) bounded N.
Dependencies
[Figure: “The cat sat on the mat” shown as a dependency tree alongside
its constituency parse tree, with constituency labels on the leaf nodes.]
Dependencies
From the dependency tree, we can obtain a “sketch” of the sentence,
i.e. by starting at the root and looking down one level we get:
“cat sat on”
And then by looking for the object of the prepositional child, we get:
“cat sat on mat”
We can easily ignore determiners “a, the”.
And importantly, adjectival and adverbial modifiers generally
connect to their targets:
Dependencies
“Brave Merida prepared for a long, cold winter”
Dependencies
“Russell reveals himself here as a supremely gifted director of
actors”
Dependencies
• Stanford dependencies can be constructed from the output of a
lexicalized constituency parser (so you can in principle use other
parsers).
• The mapping is based on hand-written regular expressions.
• Dependency grammars have been widely used for sentiment
analysis and for semantic embeddings of sentences.
• Examples:
• http://nlp.stanford.edu/software/lex-parser.shtml
• http://www.maltparser.org/
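As a quick way to get dependency parses from Python, here is a sketch using spaCy (a library not covered in this lecture); it assumes the small English model has been installed with "python -m spacy download en_core_web_sm":

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat")

# The root of the dependency tree and its immediate children give the "sketch" of the sentence.
root = [t for t in doc if t.dep_ == "ROOT"][0]
print("root:", root.text)
for child in root.children:
    print(child.dep_, "->", child.text)    # e.g. nsubj -> cat, prep -> on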
Parser Performance
• MaltParser is widely used for dependency parsing because of its
speed (around 10k words/sec).
• High-quality constituency parsing on a fast machine usually runs
at < 2000 words/sec. Puck (with a GPU) raises this to around 8000
words/sec, similar to MaltParser.
• Parsers are very complicated pieces of software, and are quite
sensitive to tuning. Make sure your parser is tuned properly –
evaluate on sample text.
• Parsers are often trained on sanitized data (news articles), and
will not work as well on informal text like social media. If possible,
train on similar data (e.g. LDC’s English web treebank).
Summary
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies