Introduction to Data Science
Lecture 5
Natural Language Processing
CS 194 Fall 2015
John Canny
with some slides from David Hall
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Natural Language Processing (NLP)
• In a recent survey (KDnuggets blog) of data scientists, 62%
reported working “mostly or entirely” with data about people.
Much of this data is text.
• In the suggested CS194-16 projects (a random sample of data
science projects around campus), nearly half involve natural
language text processing.
• NLP is a central part of mining large datasets.
Natural Language Processing
Some basic terms:
• Syntax: the allowable structures in the language: sentences,
phrases, affixes (-ing, -ed, -ment, etc.).
• Semantics: the meaning(s) of texts in the language.
• Part-of-Speech (POS): the category of a word (noun, verb,
preposition etc.).
• Bag-of-words (BoW): a featurization that uses a vector of word
counts (or binary indicators), ignoring order.
• N-gram: for a fixed, small N (2-5 is common), an n-gram is a
consecutive sequence of N words in a text.
Bag of words Featurization
Assuming we have a dictionary mapping words to a unique integer
id, a bag-of-words featurization of a sentence could look like this:
Sentence:
The cat sat on the mat
word id’s:
1 12 5 3 1 14
The BoW featurization would be the vector (positions 1-14, with
non-zero counts at positions 1, 3, 5, 12 and 14):
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
In practice this would be stored as a sparse vector of (id, count) pairs:
(1,2), (3,1), (5,1), (12,1), (14,1)
Note that the original word order is lost, replaced by the order of
id’s.
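As a concrete sketch (assuming the toy word-to-id dictionary below, chosen to reproduce the slide's id assignments), the sparse featurization can be computed in a few lines of Python:

from collections import Counter

# Hypothetical word -> id dictionary matching the example ids above.
vocab = {"the": 1, "on": 3, "sat": 5, "cat": 12, "mat": 14}

def bow_featurize(sentence, vocab):
    """Return a sparse bag-of-words vector as a sorted list of (id, count) pairs."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return sorted(Counter(ids).items())

print(bow_featurize("The cat sat on the mat", vocab))
# [(1, 2), (3, 1), (5, 1), (12, 1), (14, 1)]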
Bag of words Featurization
• BoW featurization is probably the most common
representation of text for machine learning.
• Most algorithms expect numerical vector inputs and BoW
provides this.
These include kNN, Naïve Bayes, logistic regression, k-Means
clustering, topic models, and collaborative filtering (using text input).
Bag of words Featurization
• One challenge is the size of the representation. Since words
follow a power-law distribution, vocabulary size grows with
corpus size at almost a linear rate.
• Since rare words aren’t observed often enough to inform the
model, it’s common to discard infrequent words, e.g. those
occurring fewer than 5 times.
• One can also use the feature selection methods mentioned
last time: mutual information and chi-squared tests. But
these often have false positive problems and are less reliable
than frequency culling on power-law data.
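A minimal sketch of frequency culling (the min_count threshold of 5 mirrors the example above; the corpus here is a toy stand-in for pre-tokenized documents):

from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Keep only words occurring at least min_count times, and give them dense ids."""
    freqs = Counter(w for doc in tokenized_docs for w in doc)
    kept = sorted(w for w, c in freqs.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}

docs = [["the", "cat", "sat"], ["the", "mat"]]   # toy corpus; real corpora are much larger
print(build_vocab(docs, min_count=2))            # {'the': 0}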
Bag of words Featurization
• The feature selection methods mentioned last time: mutual
information and chi-squared tests, often have false positive
problems on power-law data.
• Frequency-based selection usually works better.
[Figure: log-log plot of word frequency vs. word rank. Each successive
block of ranks contains 10x more features but 10x fewer observations of
each. Every rare feature is a potential false positive, so the
false-positive rate is difficult to control for rare features.]
N-grams
Because word order is lost, the sentence meaning is weakened.
This sentence has quite a different meaning but the same BoW
vector:
Sentence:
The mat sat on the cat
word id’s:
1 14 5 3 1 12
The BoW featurization is the same vector:
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1
But word order is important, especially the order of nearby words.
N-grams capture this, by modeling tuples of consecutive words.
N-grams
Sentence: The cat sat on the mat
2-grams: the-cat, cat-sat, sat-on, on-the, the-mat
Notice how even these short n-grams “make sense” as linguistic
units. For the other sentence we would have different features:
Sentence: The mat sat on the cat
2-grams: the-mat, mat-sat, sat-on, on-the, the-cat
We can go still further and construct 3-grams:
Sentence: The cat sat on the mat
3-grams: the-cat-sat, cat-sat-on, sat-on-the, on-the-mat
These capture still more of the meaning:
Sentence: The mat sat on the cat
3-grams: the-mat-sat, mat-sat-on, sat-on-the, on-the-cat
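A minimal Python sketch of n-gram extraction (hyphen-joined strings, as in the examples above):

def ngrams(tokens, n):
    """Return the list of n-grams (as hyphen-joined strings) of a token list."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # ['the-cat', 'cat-sat', 'sat-on', 'on-the', 'the-mat']
print(ngrams(tokens, 3))  # ['the-cat-sat', 'cat-sat-on', 'sat-on-the', 'on-the-mat']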
N-grams Features
Typically, it’s advantageous to use multiple n-gram features in machine
learning models for text, e.g.
unigrams + bigrams (2-grams) + trigrams (3-grams).
The unigrams have higher counts and are able to detect influences
that are weak, while bigrams and trigrams capture strong
influences that are more specific.
e.g. “the white house” will generally have very different influences
from the sum of influences of “the”, “white”, “house”.
N-grams as Features
Using n-grams + BoW improves accuracy over BoW alone for
classification and other text problems. A typical performance ordering
(e.g. classifier accuracy) looks like this:
3-grams < 2-grams < 1-grams < 1+2-grams < 1+2+3-grams
A few percent improvement is typical for text classification.
N-grams size
N-grams pose some challenges in feature set size.
If the original vocabulary size is |V|, the number of possible 2-grams is
|V|² while for 3-grams it is |V|³.
Luckily, natural language n-grams (including single words) have a
power-law frequency structure. This means that most of the n-grams you
see are common. A dictionary that contains the most common n-grams will
cover most of the n-grams you see.
Power laws for N-grams
N-grams also follow a power law distribution:
N-grams size
Because of this you may see values like this:
• Unigram dictionary size: 40,000
• Bigram dictionary size: 100,000
• Trigram dictionary size: 300,000
With coverage of > 80% of the features occurring in the text.
N-gram Language Models
N-grams can be used to build statistical models of texts.
When this is done, they are called n-gram language models.
An n-gram language model associates a probability with each n-gram, such that the sum over all n-grams (for fixed n) is 1.
You can then determine the overall likelihood of a particular
sentence:
“The cat sat on the mat”
is much more likely than
“The mat sat on the cat”
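A minimal sketch of a bigram language model estimated by maximum likelihood (real models add smoothing; the <s>/</s> boundary symbols and the small probability floor for unseen bigrams are illustrative choices, not from the lecture):

import math
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate bigram probabilities P(w_i | w_{i-1}) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def log_likelihood(sentence, probs, floor=1e-8):
    """Log-probability of a sentence; unseen bigrams get a small floor probability."""
    toks = ["<s>"] + sentence + ["</s>"]
    return sum(math.log(probs.get(bg, floor)) for bg in zip(toks[:-1], toks[1:]))

# Toy corpus in which the first word order is far more common than the second.
sents = [["the", "cat", "sat", "on", "the", "mat"]] * 10 + [["the", "mat", "sat", "on", "the", "cat"]]
lm = train_bigram_lm(sents)
print(log_likelihood(["the", "cat", "sat", "on", "the", "mat"], lm) >
      log_likelihood(["the", "mat", "sat", "on", "the", "cat"], lm))   # True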
Skip-grams
We can also analyze the meaning of a particular word by looking at
the contexts in which it occurs.
The context is the set of words that occur near the word, i.e. at
displacements of …,-3,-2,-1,+1,+2,+3,… in each sentence where the
word occurs.
A skip-gram is a set of non-consecutive words (with specified
offsets) that occur in some sentence.
We can construct a BoSG (bag of skip-gram) representation for
each word from the skip-gram table.
Skip-grams
Then with a suitable embedding (DNN or linear projection) of the
skip-gram features, we find that word meaning has an algebraic
structure:
[Figure: 2-D projection of the word vectors for Man, King, Woman and Queen.]
Man + (King – Man) + (Woman – Man) = Queen, i.e. King – Man + Woman ≈ Queen.
Tomáš Mikolov et al. (2013). "Efficient Estimation of Word Representations in
Vector Space"
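As a sketch of this algebraic structure in code (using gensim, which is not covered in this lecture; the vector file name is hypothetical and stands for any pretrained word2vec-format embedding):

from gensim.models import KeyedVectors

# Hypothetical path to a pretrained word2vec-format embedding file.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman should land near "queen" in a good embedding.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))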
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Parts of Speech
Thrax’s original list (c. 100 B.C):
• Noun
• Verb
• Pronoun
• Preposition
• Adverb
• Conjunction
• Participle
• Article
Parts of Speech
Thrax’s original list (c. 100 B.C):
• Noun (boat, plane, Obama)
• Verb (goes, spun, hunted)
• Pronoun (She, Her)
• Preposition (in, on)
• Adverb (quietly, then)
• Conjunction (and, but)
• Participle (eaten, running)
• Article (the, a)
Parts of Speech (Penn Treebank 2014)
1. CC    Coordinating conjunction
2. CD    Cardinal number
3. DT    Determiner
4. EX    Existential there
5. FW    Foreign word
6. IN    Preposition or subordinating conjunction
7. JJ    Adjective
8. JJR   Adjective, comparative
9. JJS   Adjective, superlative
10. LS   List item marker
11. MD   Modal
12. NN   Noun, singular or mass
13. NNS  Noun, plural
14. NNP  Proper noun, singular
15. NNPS Proper noun, plural
16. PDT  Predeterminer
17. POS  Possessive ending
18. PRP  Personal pronoun
19. PRP$ Possessive pronoun
20. RB   Adverb
21. RBR  Adverb, comparative
22. RBS  Adverb, superlative
23. RP   Particle
24. SYM  Symbol
25. TO   to
26. UH   Interjection
27. VB   Verb, base form
28. VBD  Verb, past tense
29. VBG  Verb, gerund or present participle
30. VBN  Verb, past participle
31. VBP  Verb, non-3rd person singular present
32. VBZ  Verb, 3rd person singular present
33. WDT  Wh-determiner
34. WP   Wh-pronoun
35. WP$  Possessive wh-pronoun
36. WRB  Wh-adverb
POS tags
• Morphology: word form carries information, e.g. “liked,” “follows,” “poke.”
• Context: “can” can play different roles, as in “trash can,” “can do,” “can it.”
• Therefore taggers should look at the neighborhood of a word.
Constraint-Based Tagging
Based on a table of words with morphological and context
features:
POS taggers
• Taggers typically use HMMs (Hidden-Markov Models) or
MaxEnt (Maximum Entropy) sequence models, with the latter
giving state-of-the-art performance.
• Accuracy on test data is around 97%.
• A representative is the Stanford POS tagger:
http://nlp.stanford.edu/software/tagger.shtml
• Tagging is fast: around 300k words/second is typical (Speedread; the
Stanford tagger is about 10x slower).
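A minimal tagging sketch using NLTK's built-in averaged-perceptron tagger (not the Stanford tagger linked above; the example output shown in the comment is approximate):

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = nltk.word_tokenize("The cat sat on the mat")
print(nltk.pos_tag(tokens))
# roughly: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]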
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Named Entity Recognition
• People, places, (specific) things
• “Chez Panisse”: Restaurant
• “Berkeley, CA”: Place
• “Churchill-Brenneis Orchards”: Company(?)
• Not “Page” or “Medjool”: part of a non-rigid designator
Named Entity Recognition
• Why a sequence model?
• “Page” can be “Larry Page” or “Page mandarins” or
just a page.
• “Berkeley” can be a person.
Named Entity Recognition
• Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates
• States:
• Restaurant, Place, Company, GPE, Movie, “Outside”
• Usually use BeginRestaurant, InsideRestaurant.
• Why?
• Emissions: words
• Estimating these probabilities is harder.
Named Entity Recognition
• Both (hand-written) grammar methods and machine-learning
methods are used.
• The learning methods use sequence models: HMMs or CRFs
(Conditional Random Fields), with the latter giving state-of-the-art
performance.
• State-of-the-art on standard test data = 93.4%, human
performance = 97%.
• Speed around 200k words/sec (Speedread).
• Example: http://nlp.stanford.edu/software/CRF-NER.shtml
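A minimal NER sketch using NLTK's bundled chunker (the Stanford CRF tagger linked above is more accurate but requires the external Java distribution; the sentence and the labels shown are illustrative, and the exact spans depend on the model):

import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "Larry Page founded Google in Menlo Park"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# e.g. PERSON Larry Page / ORGANIZATION Google / GPE Menlo Park (labels depend on the model)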
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node)
• S → NP VP
• S → NP VP PP
• NP → DT NN
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
Grammars
Grammars comprise rules that specify acceptable sentences in the
language: (S is the sentence or root node) “the cat sat on the mat”
• S → NP VP
• S → NP VP PP      (the cat) (sat) (on the mat)
• NP → DT NN        (the cat), (the mat)
• VP → VB NP
• VP → VBD
• PP → IN NP
• DT → “the”
• NN → “mat”, “cat”
• VBD → “sat”
• IN → “on”
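The toy grammar above can be tried directly in NLTK; this is a sketch with the rules transcribed as written (no probabilities):

import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP PP
  NP  -> DT NN
  VP  -> VBD
  PP  -> IN NP
  DT  -> 'the'
  NN  -> 'cat' | 'mat'
  VBD -> 'sat'
  IN  -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat sat on the mat".split()):
    tree.pretty_print()   # prints the parse tree for the sentence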
Grammars
English grammars are context-free: the productions do not depend
on any words before or after the production.
The reconstruction of a sequence of grammar productions from a
sentence is called “parsing” the sentence.
It is most conveniently represented as a tree:
Parse Trees
“The cat sat on the mat”
Parse Trees
In bracket notation:
(ROOT
(S
(NP (DT the) (NN cat))
(VP (VBD sat)
(PP (IN on)
(NP (DT the) (NN mat))))))
Grammars
There are typically multiple ways to produce the same sentence.
Consider the statement by Groucho Marx:
“While I was in Africa, I shot an elephant in my pajamas”
“How he got into my pajamas, I don’t know”
Parse Trees
“… I shot an elephant in my pajamas” (what people hear first)
Parse Trees
Groucho’s version
Grammars
Recursion is common in grammar rules, e.g.
NP → NP RC
Because of this, sentences of arbitrary length are possible.
Recursion in Grammars
“Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”.
Grammars
It’s also possible to have “sentences” inside other sentences…
S → NP VP
VP → VB NP SBAR
SBAR → IN S
Recursion in Grammars
“Nero played his lyre while Rome burned”.
5-min Break
Please talk to us if you haven’t finalized your group!
We will get you your data in the next day or two.
We’re in this room for lecture and lab from now on.
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
PCFGs
Complex sentences can be parsed in many ways, most of which
make no sense or are extremely improbable (like Groucho’s
example).
Probabilistic Context-Free Grammars (PCFGs) associate and learn
probabilities for each rule:
S → NP VP       0.3
S → NP VP PP    0.7
The parser then tries to find the most likely sequence of
productions that generate the given sentence. This adds more
realistic “world knowledge” and generally gives much better results.
Most state-of-the-art parsers these days use PCFGs.
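A minimal PCFG sketch in NLTK; the rule probabilities below are made up for illustration, not learned from data:

import nltk

pcfg = nltk.PCFG.fromstring("""
  S   -> NP VP [0.3] | NP VP PP [0.7]
  NP  -> DT NN [1.0]
  VP  -> VBD [0.6] | VBD PP [0.4]
  PP  -> IN NP [1.0]
  DT  -> 'the' [1.0]
  NN  -> 'cat' [0.5] | 'mat' [0.5]
  VBD -> 'sat' [1.0]
  IN  -> 'on' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat sat on the mat".split()):
    print(tree.prob(), tree)   # most probable parse and its probability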
CKY parsing
A CKY (Cocke-Younger-Kasami) table is a dynamic programming
parser (most parsers use some form of this):
[Figure: CKY chart for “John hit the ball”, with N(John), V(hit), D(the),
N(ball) along the diagonal and NP, VP and S filled in above them.]
CKY parsing
Not every internal node encodes a symbol: there are N-1 internal nodes
in the (binary) parse tree, but O(N²) cells in the table.
[Figure: the chart is filled in cell by cell: the word symbols N(John),
V(hit), D(the), N(ball) first, then NP over “the ball”, VP over “hit the
ball”, and finally S spanning the whole sentence.]
CKY parsing
For an internal node x at height k, there are k possible pairs of symbols
(B, C) into which x can decompose, one pair per split point:
[Figure: a cell x split into a left child B and a right child C at each of
the possible split points.]
Viterbi Parse
Given the probabilities of cells in the table, we can find the most
probable parse using a generalized Viterbi scan.
That is, find the most likely parse at the root, then return it and the
left and right nodes which generated it.
We can do this as part of the scoring process by maintaining a
set of backpointers.
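A minimal CKY/Viterbi sketch over a toy grammar in Chomsky Normal Form (the rule probabilities are made up; the sentence matches the chart example above):

from collections import defaultdict

# Binary rules: (left child, right child) -> list of (parent, probability).
binary = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"):  [("VP", 1.0)],
    ("D", "N"):   [("NP", 0.6)],
}
# Lexical rules: word -> list of (tag, probability).
lexical = {
    "John": [("NP", 0.4)],
    "hit":  [("V", 1.0)],
    "the":  [("D", 1.0)],
    "ball": [("N", 1.0)],
}

def cky_viterbi(words):
    n = len(words)
    # best[i][j] maps a symbol to (probability, backpointer) for the span words[i:j].
    best = [[defaultdict(lambda: (0.0, None)) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, p in lexical.get(w, []):
            best[i][i + 1][tag] = (p, w)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                     # split point
                for B, (pb, _) in best[i][k].items():
                    for C, (pc, _) in best[k][j].items():
                        for A, pr in binary.get((B, C), []):
                            p = pr * pb * pc
                            if p > best[i][j][A][0]:      # keep the most probable derivation
                                best[i][j][A] = (p, (k, B, C))
    return best

def backtrace(best, i, j, sym):
    """Follow backpointers to recover the most probable tree for symbol sym over words[i:j]."""
    p, bp = best[i][j][sym]
    if isinstance(bp, str):                               # lexical cell: backpointer is the word
        return (sym, bp)
    k, B, C = bp
    return (sym, backtrace(best, i, k, B), backtrace(best, k, j, C))

words = "John hit the ball".split()
chart = cky_viterbi(words)
print(backtrace(chart, 0, len(words), "S"))
# ('S', ('NP', 'John'), ('VP', ('V', 'hit'), ('NP', ('D', 'the'), ('N', 'ball'))))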
Systems
• NLTK: Python-based NLP system. Many modules, good
visualization tools, but not quite state-of-the-art performance.
• Stanford Parser: Another comprehensive suite of tools (also POS
tagger), and state-of-the-art accuracy. Has the definitive
dependency module.
• Berkeley Parser: Slightly higher parsing accuracy (than Stanford)
but not as many modules.
• Note: high-quality dependency parsing is usually very slow, but
see: https://github.com/dlwh/puck
Outline
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies
Dependency Parsing
Dependency structures are rooted in existing words, i.e. there
are no non-terminal nodes, and a tree on n words has exactly n
nodes.
A dependency tree has about half as many total nodes as a
binary (Chomsky Normal form) constituency tree.
Dependency Parsing
Dependency parses may be non-binary, and structure type is
encoded in links rather than nodes:
Constituency vs. Dependency Grammars
Dependency parses are faster to compute and give more compact
descriptions. They are not simplifications of constituency parses,
although they can be derived from lexicalized constituency grammars.
Lexicalized grammars assign a head word to each non-terminal
node.
Constituency vs. Dependency Grammars
• Dependency and constituency parses capture attachment
information (e.g. sentiments directed at particular targets).
• Constituency parses are probably better for abstracting
semantic features, i.e. constituency units often correspond to
semantic units.
Constituency grammars have a much larger space to search and
take longer, especially as sentence length grows (parsing is O(N³)).
However, if the goal is to find semantic groups, it’s possible to
proceed bottom-up with (small) bounded N.
Dependencies
[Figure: “The cat sat on the mat” shown as a dependency tree alongside
its constituency parse tree, with constituency labels on the leaf nodes.]
Dependencies
From the dependency tree, we can obtain a “sketch” of the sentence,
i.e. by starting at the root and looking down one level we get:
“cat sat on”
And then by looking for the object of the prepositional child, we get:
“cat sat on mat”
We can easily ignore determiners “a, the”.
And importantly, adjectival and adverbial modifiers generally
connect to their targets:
Dependencies
“Brave Merida prepared for a long, cold winter”
Dependencies
“Russell reveals himself here as a supremely gifted director of
actors”
Dependencies
• Stanford dependencies can be constructed from the output of a
lexicalized constituency parser (so you can in principle use other
parsers).
• The mapping is based on hand-written regular expressions.
• Dependency grammars have been widely used for sentiment
analysis and for semantic embeddings of sentences.
• Examples:
• http://nlp.stanford.edu/software/lex-parser.shtml
• http://www.maltparser.org/
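As a quick way to get dependency parses from Python, here is a sketch using spaCy (a library not covered in this lecture); it assumes the small English model has been installed with "python -m spacy download en_core_web_sm":

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat")

# The root of the dependency tree and its immediate children give the "sketch" of the sentence.
root = [t for t in doc if t.dep_ == "ROOT"][0]
print("root:", root.text)
for child in root.children:
    print(child.dep_, "->", child.text)    # e.g. nsubj -> cat, prep -> on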
Parser Performance
• MaltParser is widely used for dependency parsing because of its
speed (around 10k words/sec).
• High-quality constituency parsing on a fast machine usually runs
at < 2000 words/sec. Puck (with a GPU) raises this to around 8000
words/sec, similar to MaltParser.
• Parsers are very complicated pieces of software, and are quite
sensitive to tuning. Make sure your parser is tuned properly –
evaluate on sample text.
• Parsers are often trained on sanitized data (news articles), and
will not work as well on informal text like social media. If possible,
train on similar data (e.g. LDC’s English web treebank).
Summary
• BoW, N-grams
• Part-of-Speech Tagging
• Named-Entity Recognition
• Grammars
• Parsing
• Dependencies