Lecture 11: Parts of speech
Intro to NLP, CS585, Fall 2014
http://people.cs.umass.edu/~brenocon/inlp2014/
Brendan O'Connor (http://brenocon.com)
Some material borrowed from Chris Manning, Jurafsky & Martin, and student exercises
• Review next Tues. Which?
  • 12-1:30pm?
  • 5:30-7pm?
What's a part-of-speech (POS)?
• Syntactic categories / word classes
  • You could substitute words within a class and have a syntactically valid sentence.
  • They give information about how words can combine.
    • I saw the dog
    • I saw the cat
    • I saw the {table, sky, dream, school, anger, ...}
POS is an old idea
• Dionysius Thrax of Alexandria (100 BCE): 8 parts of speech
• Common in grammar classes today: noun, verb, adjective, preposition, conjunction, pronoun, interjection
• Many other more fine-grained possibilities
• https://www.youtube.com/watch?v=ODGA7ssL-6g&index=1&list=PL6795522EAD6CE2F7
Open class (lexical) words
• Nouns
  • Proper: IBM, Italy
  • Common: cat / cats, snow
• Verbs
  • Main: see, registered
• Adjectives: old, older, oldest
• Adverbs: slowly
• Numbers: 122,312, one
• ... more

Closed class (functional) words
• Determiners: the, some
• Conjunctions: and, or
• Pronouns: he, its
• Modals: can, had
• Prepositions: to, with
• Particles: off, up
• Interjections: Ow, Eh
• ... more
Open vs closed classes
• Open
  • Nouns, verbs, adjectives, adverbs
• Closed
  • Determiners: a, an, the
  • Pronouns: he, she, it, they ...
  • Prepositions: on, over, under, of, ...
  • Why "closed"? Many are "grammatical function words."
Many tagging standards
• Brown corpus (85 tags)
• Penn Treebank (45 tags) ... the most common one
• Coarse tagsets
  • Petrov et al. "Universal" tagset (12 tags)
    • http://code.google.com/p/universal-pos-tags/
    • Motivation: cross-linguistic regularities, e.g. adposition: pre- and postpositions
    • For English, a collapsing of PTB tags (see the sketch below)
  • Gimpel et al. tagset for Twitter (25 tags)
    • Motivation: easier for humans to annotate
    • We collapsed PTB, added new things that were necessary for Twitter
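To make "collapsing of PTB tags" concrete, here is a minimal sketch of how such a coarse mapping can be applied. The dictionary below is only an illustrative, hand-picked subset, not the official Petrov et al. mapping (which is distributed at the URL above), and the helper name `collapse` exists only for this example.

```python
# Illustrative (partial) collapsing of Penn Treebank tags into a coarse,
# "universal"-style tagset. The full, official mapping is distributed at
# http://code.google.com/p/universal-pos-tags/ ; these entries are a subset.
PTB_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "IN": "ADP",   # prepositions and subordinating conjunctions -> adpositions
    "DT": "DET", "PRP": "PRON", "CC": "CONJ", "CD": "NUM",
}

def collapse(tagged_sentence):
    """Map (word, PTB tag) pairs to (word, coarse tag); unknown tags become 'X'."""
    return [(w, PTB_TO_COARSE.get(t, "X")) for w, t in tagged_sentence]

print(collapse([("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")]))
# -> [('I', 'PRON'), ('saw', 'VERB'), ('the', 'DET'), ('dog', 'NOUN')]
```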
Coarse tags, Twitter

It's/D a/D great/A show/N catch/V it/O on/P the/D Sundance/^ channel/N #SMSAUDIO/# http://instagram.com/p/trHejUML3X/ (tagged U)

• Sundance (^): proper or common? Does it matter?
• #SMSAUDIO (#) and the URL (U): grammatical category?? Not really a grammatical category, but perhaps an important word class.
Why do we want POS?
• Useful for many syntactic and other NLP tasks.
  • Phrase identification ("chunking")
  • Named entity recognition
  • Full parsing
  • Sentiment
POS patterns: sentiment
• Turney (2002): identify bigram phrases useful for sentiment (plus other sentiment stuff); a sketch of the extraction and SO steps follows below.

Excerpt from Turney (2002):

The review is first tagged with a part-of-speech tagger (Brill, 1994). Two consecutive words are extracted from the review if their tags conform to any of the patterns in Table 1. The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are verbs. The second pattern, for example, means that two consecutive words are extracted if the first word is an adverb and the second word is an adjective, but the third word (which is not extracted) cannot be a noun. NNP and NNPS (singular and plural proper nouns) are avoided, so that the names of the items in the review cannot influence the classification.

Table 1. Patterns of tags for extracting two-word phrases from reviews.
     First Word           Second Word            Third Word (Not Extracted)
  1. JJ                   NN or NNS              anything
  2. RB, RBR, or RBS      JJ                     not NN nor NNS
  3. JJ                   JJ                     not NN nor NNS
  4. NN or NNS            JJ                     not NN nor NNS
  5. RB, RBR, or RBS      VB, VBD, VBN, or VBG   anything

The second step is to estimate the semantic orientation of the extracted phrases, using the PMI-IR algorithm. This algorithm uses mutual information as a measure of the strength of semantic association between two words (Church & Hanks, 1989): PMI measures the amount of information that we acquire about the presence of one of the words when we observe the other. The Semantic Orientation (SO) of a phrase is calculated here as follows:

  SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")

The reference words "excellent" and "poor" were chosen because, in the five star review rating system, it is common to define one star as "poor" and five stars as "excellent". SO is positive when the phrase is more strongly associated with "excellent" and negative when the phrase is more strongly associated with "poor". PMI-IR estimates PMI by issuing queries to a search engine (hence the IR in PMI-IR) and noting the number of hits (matching documents).

The third step is to calculate the average semantic orientation of the phrases in the given review and classify the review as recommended if the average is positive and otherwise not recommended.

Table 2. An example of the processing of a review that the author has classified as recommended.
  Extracted Phrase          Part-of-Speech Tags   Semantic Orientation
  online experience         JJ NN                  2.253
  low fees                  JJ NNS                 0.333
  local branch              JJ NN                  0.421
  small part                JJ NN                  0.053
  online service            JJ NN                  2.780
  printable version         JJ NN                 -0.705
  direct deposit            JJ NN                  1.288
  well other                RB JJ                  0.237
  inconveniently located    RB VBN                -1.541
  other bank                JJ NN                 -0.850
  true service              JJ NN                 -0.732
  Average Semantic Orientation                     0.322
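As a concrete illustration of the pattern-matching step and the SO formula above, here is a minimal Python sketch. It assumes the review is already a list of (word, PTB tag) pairs, and the `hits` argument is a hypothetical co-occurrence counting function standing in for the search-engine queries Turney used; this is a sketch of the idea, not a reimplementation of the paper.

```python
import math

# Table 1 patterns: (first-word tags, second-word tags, disallowed third-word tags)
ADJ = {"JJ"}; NOUN = {"NN", "NNS"}; ADV = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}
PATTERNS = [
    (ADJ,  NOUN, set()),   # 1. JJ + NN/NNS, third word anything
    (ADV,  ADJ,  NOUN),    # 2. RB* + JJ, third word not NN/NNS
    (ADJ,  ADJ,  NOUN),    # 3. JJ + JJ, third word not NN/NNS
    (NOUN, ADJ,  NOUN),    # 4. NN/NNS + JJ, third word not NN/NNS
    (ADV,  VERB, set()),   # 5. RB* + VB/VBD/VBN/VBG, third word anything
]

def extract_phrases(tagged):
    """tagged: list of (word, PTB tag). Returns two-word phrases matching Table 1."""
    phrases = []
    for i in range(len(tagged) - 1):
        w1, t1 = tagged[i]
        w2, t2 = tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, bad_third in PATTERNS:
            if t1 in first and t2 in second and (t3 is None or t3 not in bad_third):
                phrases.append(f"{w1} {w2}")
                break
    return phrases

def semantic_orientation(phrase, hits):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor'), which reduces
    to a log ratio of co-occurrence counts. `hits(query)` is a hypothetical
    count function standing in for search-engine hit counts; in practice the
    counts would need smoothing to avoid zeros."""
    return math.log(
        (hits(f"{phrase} NEAR excellent") * hits("poor")) /
        (hits(f"{phrase} NEAR poor") * hits("excellent"))
    )
```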
POS patterns: simple noun phrases
• Quick and dirty noun phrase identification (see the sketch below)
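A minimal sketch of the quick-and-dirty idea: map POS tags to single characters and run a regular expression over the tag string. The particular pattern (optional determiner, then adjectives/nouns, ending in a noun) and the helper names are illustrative assumptions, not the exact recipe from the lecture.

```python
import re

# Map each PTB tag to one character so we can regex over the tag sequence.
def tag_code(tag):
    if tag.startswith("NN"): return "N"   # NN, NNS, NNP, NNPS
    if tag.startswith("JJ"): return "A"   # JJ, JJR, JJS
    if tag == "DT":          return "D"
    return "x"                            # everything else

def simple_noun_phrases(tagged):
    """Quick-and-dirty NP chunking over (word, PTB tag) pairs:
    an optional determiner, then adjectives/nouns, ending in a noun."""
    codes = "".join(tag_code(t) for _, t in tagged)
    phrases = []
    for m in re.finditer(r"D?[AN]*N", codes):
        phrases.append(" ".join(w for w, _ in tagged[m.start():m.end()]))
    return phrases

print(simple_noun_phrases(
    [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("big", "JJ"), ("dog", "NN")]
))
# -> ['the big dog']
```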
• Exercises
POS Tagging: lexical ambiguity

Can we just use a tag dictionary (one tag per word type)?

From Jurafsky & Martin: Most word types (80-86%) are unambiguous; that is, they have only a single tag (Janet is always NNP, funniest JJS, and hesitantly RB). But the ambiguous words, although accounting for only 14-15% of the vocabulary, are some of the most common words of English, and hence 55-67% of word tokens in running text are ambiguous. Note the large differences across the two genres, especially in token frequency. Tags in the WSJ corpus are less ambiguous, presumably because this newspaper's specific focus on financial news leads to a more limited distribution of word usages than the more general texts combined into the Brown corpus.

                          WSJ              Brown
Types:
  Unambiguous (1 tag)     44,432 (86%)     45,799 (85%)
  Ambiguous (2+ tags)      7,025 (14%)      8,050 (15%)
Tokens:
  Unambiguous (1 tag)    577,421 (45%)    384,349 (33%)
  Ambiguous (2+ tags)    711,780 (55%)    786,646 (67%)

Figure 8.2: The amount of tag ambiguity for word types in the Brown and WSJ corpora, from the Treebank-3 (45-tag) tagging. These statistics include punctuation as words, and assume words are kept in their original case.

• Most word types (86%) are unambiguous ... but not so for tokens! (A sketch for computing these statistics follows below.)
• Ambiguous word types tend to be very common ones. Some of the most ambiguous frequent words are that, back, down, put and set.

that:
• I know that he is honest = IN (relativizer)
• Yes, that play was nice = DT (determiner)
• You can't go that far = RB (adverb)

Here are some examples of the 6 different parts-of-speech for the word back:
• earnings growth took a back/JJ seat
• a small building in the back/NN
• a clear majority of senators back/VBP the bill
• Dave began to back/VB toward the door
• enable the country to buy back/RP about debt
• I was twenty-one back/RB then
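The type/token ambiguity numbers in the figure above can be recomputed from any tagged corpus by counting how many distinct tags each word type receives. A minimal sketch, assuming the corpus is just an iterable of (word, tag) pairs with punctuation included and original case kept:

```python
from collections import defaultdict

def ambiguity_stats(corpus):
    """corpus: iterable of (word, tag) pairs.
    Returns the fraction of ambiguous word types and of ambiguous tokens."""
    tags_per_type = defaultdict(set)
    count_per_type = defaultdict(int)
    for word, tag in corpus:
        tags_per_type[word].add(tag)
        count_per_type[word] += 1

    ambiguous = {w for w, tags in tags_per_type.items() if len(tags) > 1}
    n_types = len(tags_per_type)
    n_tokens = sum(count_per_type.values())
    ambig_type_frac = len(ambiguous) / n_types
    ambig_token_frac = sum(count_per_type[w] for w in ambiguous) / n_tokens
    return ambig_type_frac, ambig_token_frac
```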
POS Tagging: baseline
• Baseline: most frequent tag. 92.7% accuracy (a sketch follows below)
  • Simple baselines are very important to run!
• Why so high?
  • Many ambiguous words have a skewed distribution of tags
  • Credit for easy things like punctuation, "the", "a", etc.
• Is this actually that high?
  • I get 0.918 accuracy for token tagging
  • ...but 0.186 whole-sentence accuracy (!)
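A minimal sketch of the most-frequent-tag baseline and the two accuracy numbers reported above (token-level and whole-sentence). It assumes training and test data are lists of sentences, each a list of (word, tag) pairs; the fallback tag for unseen words is an arbitrary choice here, and the exact numbers depend on the corpus and such details.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(train_sents):
    """For each word type, remember its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def evaluate(model, test_sents, fallback="NN"):  # fallback tag is an assumption
    correct_tokens = total_tokens = correct_sents = 0
    for sent in test_sents:
        predicted = [model.get(w, fallback) for w, _ in sent]
        gold = [t for _, t in sent]
        correct_tokens += sum(p == g for p, g in zip(predicted, gold))
        total_tokens += len(sent)
        correct_sents += (predicted == gold)
    return correct_tokens / total_tokens, correct_sents / len(test_sents)

# token_acc, sent_acc = evaluate(train_most_frequent_tag(train_sents), test_sents)
```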
POS tagging can be hard for humans
• Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
• All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
• Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
Need careful guidelines (and do annotators always follow them?)
PTB POS guidelines, Santorini (1990): "4 Confusing parts of speech"

This section discusses parts of speech that are easily confused and gives guidelines on how to tag such cases.

When they are the first members of the double conjunctions both ... and, either ... or and neither ... nor, both, either and neither are tagged as coordinating conjunctions (CC), not as determiners (DT).
EXAMPLES: Either/DT child could sing.
But:
Either/CC a boy could sing or/CC a girl could dance.
Either/CC a boy or/CC a girl could sing.
Either/CC a boy or/CC girl could sing.
Be aware that either or neither can sometimes function as determiners (DT) even in the presence of or or nor.
EXAMPLE: Either/DT boy or/CC girl could sing.

CD or JJ:
Number-number combinations should be tagged as adjectives (JJ) if they have the same distribution as adjectives.
EXAMPLES: a 50-3/JJ victory (cf. a handy/JJ victory)
Hyphenated fractions (one-half, three-fourths, seven-eighths, one-and-a-half, seven-and-three-eighths) should be tagged as adjectives (JJ) when they are prenominal modifiers, but as adverbs (RB) if they could be replaced by double or twice.
EXAMPLES: one-half/JJ cup (cf. a full/JJ cup); one-half/RB the amount (cf. twice/RB the amount; double/RB the amount)
Sometimes it is unclear whether one is a cardinal number or a noun. In general, it should be tagged as a cardinal number (CD) even when its sense is not clearly that of a numeral.
EXAMPLE: one/CD of the best reasons
But if it could be pluralized or modified by an adjective in a particular context, it is a common noun (NN).
Some other lexical ambiguities
• Prepositions versus verb particles
  • turn into/P a monster
  • take out/T the trash
  • check it out/T, what's going on/T, shout out/T
  • Test: can an adverb intervene? turn slowly into a monster, but *take slowly out the trash
• this, that -- pronouns versus determiners
  • i just orgasmed over this/O
  • this/D wind is serious
• Careful annotator guidelines are necessary to define what to do in many cases.
  • http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports
  • http://www.ark.cs.cmu.edu/TweetNLP/annot_guidelines.pdf
Proper nouns
• Common nouns vs. proper nouns on Twitter
  • Convinced/A that/O Monty/^ python/^ doing/V-VBG a/D completely/R straight/A faced/A Shakespeare/^ adaption/N would/V-VBD be/V-VB among/P the/D most/A Monty/^ python/^ things/N ever/R
• Names are multiwords. Their tokens are not always nouns.
• Many people in the exercises didn't want to do this. Token-level tagging is a weird abstraction here.

Excerpt from the Gimpel et al. annotation guidelines:

Since we are not annotating parse structure, it is less clear what to do with our data. In an effort to be consistent with the PTB, we typically tagged today as a noun. The PTB annotations are less clear for words like home. Of the 24 times that home appears as a child under a DIR (direction) nonterminal, it is annotated as:
• (ADVP-DIR (NN home)) 14 times
• (ADVP-DIR (RB home)) 6 times
• (NP-DIR (NN home)) 3 times
• (NP-DIR (RB home)) 1 time
Manual inspection of the 24 occurrences revealed no discernible difference in usage that would explain these differences in annotation. As a result of these inconsistencies, we decided to let annotators use their best judgment when annotating these types of words in tweets, asking them to refer to the previously-annotated data to improve consistency.

3.4 Names
In general, every noun within a proper name should be tagged as a proper noun (^):
• Jesse/^ and/& the/D Rippers/^
• the/D California/^ Chamber/^ of/P Commerce/^
Company and web site names (Twitter, Yahoo! News) are tagged as proper nouns. Function words are only ever tagged as proper nouns if they are not behaving in a normal syntactic fashion, e.g. Ace/...
Are your tokens too big for tags?
• PTB tokenization of clitics leads to easy tagging
  • I'm ==> I/PRP 'm/VBP
• Twitter: is this splitting feasible? Real examples:
  • hes i'm im ill (http://search.twitter.com)
  • Imma bout to do some body shots
• Gimpel et al.'s strategy: introduce compound tags (I'm = PRONOUN+VERB)
Are your tokens too big for tags?
• Other example: highly inflected languages, e.g. Turkish, have case, gender, etc. built into the "tag"
• Our approach for Twitter was to simply treat each compound tag as a separate tag. Is this feasible here?

Excerpt (Jurafsky & Martin):

... performance degradations in a wide variety of languages (including Czech, Slovene, Estonian, and Romanian) (Hajič, 2000). Highly inflectional languages also have much more information than English coded in word morphology, like case (nominative, accusative, genitive) or gender (masculine, feminine). Because this information is important for tasks like parsing and coreference resolution, part-of-speech taggers for morphologically rich languages need to label words with case and gender information. Tagsets for morphologically rich languages are therefore sequences of morphological tags rather than a single primitive tag. Here's a Turkish example, in which the word izin has three possible morphological/part-of-speech tags and meanings (Hakkani-Tür et al., 2002):

1. Yerdeki izin temizlenmesi gerek.          iz + Noun+A3sg+Pnon+Gen
   The trace on the floor should be cleaned.
2. Üzerinde parmak izin kalmiş.              iz + Noun+A3sg+P2sg+Nom
   Your finger print is left on (it).
3. Içeri girmek için izin alman gerekiyor.   izin + Noun+A3sg+Pnon+Nom
   You need a permission to enter.

Using a morphological parse sequence like Noun+A3sg+Pnon+Gen as the part-of-speech tag greatly increases the number of parts-of-speech, and so tagsets can be 4 to 10 times larger than the 50-100 tags we have seen for English. With such large tagsets, each word needs to be morphologically analyzed (using a method from Chapter 3, or an extensive dictionary) to generate the list of possible morphological tag sequences (part-of-speech tags) for the word. The role of the tagger is then to disambiguate among these tags. This method also helps with unknown words ...
How to build a POS tagger?
• Key sources of information:
  • 1. The word itself
  • 2. Morphology or orthography of the word
  • 3. POS tags of surrounding words: syntactic positions
  (A sketch of features drawn from these three sources follows below.)
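To make the three information sources concrete, here is a minimal sketch of the kind of feature function a classifier-based tagger might use. The feature names, the three-character suffix, and the single previous tag of context are illustrative assumptions, not the lecture's specification.

```python
def tagging_features(words, i, prev_tag):
    """Features for tagging position i, drawing on the three sources above:
    (1) the word itself, (2) its morphology/orthography, (3) surrounding tags
    (here just the previous predicted tag). Feature names are illustrative."""
    w = words[i]
    return {
        "word=" + w.lower(): 1.0,                       # 1. the word itself
        "suffix3=" + w[-3:].lower(): 1.0,               # 2. morphology: last 3 characters
        "is_capitalized": float(w[0].isupper()),        # 2. orthography
        "has_digit": float(any(c.isdigit() for c in w)),
        "has_hyphen": float("-" in w),
        "prev_tag=" + prev_tag: 1.0,                    # 3. tag context
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"): 1.0,
    }

print(tagging_features(["I", "saw", "the", "dog"], 3, "DT"))
```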