COMP790: Statistical NLP
POS Tagging (Chap. 10)
POS tagging

Goal: assign the right part of speech (noun, verb, ...) to words in a text
  "The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN."

Terminology (synonyms):
- POS, part-of-speech tag
- word class
- morphological class
- lexical tag
- grammatical tag
Why do POS tagging?

Purpose:
- 1st step towards NLU
- easier than full NLU (results > 95% accuracy)

Useful for:
- speech recognition / synthesis (better accuracy)
  - how to recognize/pronounce a word (CONtent/noun vs conTENT/adj)
- stemming in IR
  - which morphological affixes the word can take
  - adverb - ly = noun (friendly - ly = friend)
- IR and QA
  - pick out nouns, which may be more important than other words when indexing documents or analyzing queries
- partial parsing/chunking (for IE)
  - to find noun phrases / verb phrases
Tag sets

Different tag sets exist, depending on the purpose of the application:
- 45 tags in the Penn Treebank
- 62 tags in CLAWS with the BNC corpus
- 79 tags in Church (1991)
- 87 tags in the Brown corpus
- 147 tags in the C7 tagset
- 258 tags in Tzoukermann and Radev (1995)
Tag set: Penn Treebank (45 tags)

IN    preposition or subordinating conjunction
JJ    adjective or numeral, ordinal
JJR   adjective, comparative
NN    noun, common, singular or mass
NNP   noun, proper, singular
NNS   noun, common, plural
TO    "to" as preposition or infinitive marker
VB    verb, base form
VBD   verb, past tense
VBG   verb, present participle or gerund
VBN   verb, past participle
VBP   verb, present tense, not 3rd p. singular
VBZ   verb, present tense, 3rd p. singular
...
Most word types are not ambiguous, but...

- ...most word types are rare
- Brown corpus (Francis & Kucera, 1982):
  - 11.5% of word types are ambiguous (>1 tag)
  - 40% of word tokens are ambiguous (>1 tag)

Nb of word types:
  Unambiguous (1 tag)   35,340
  Ambiguous (>1 tag)     4,100
    2 tags               3,760
    3 tags                 264
    4 tags                  61
    5 tags                  12
    6 tags                   2
    7 tags                   1   ("still")
Techniques for POS tagging

- rule-based tagging
  - uses hand-written rules
- stochastic tagging
  - uses probabilities computed from a training corpus
- transformation-based tagging
  - uses rules learned automatically
Information sources for tagging

All techniques are based on the same observations...

- Syntagmatic information:
  - some tag sequences are more probable than others
  - ex: ART+ADJ+N is more probable than ART+ADJ+VB
- Lexical information:
  - knowing the word to be tagged gives a lot of information about the correct tag
  - "table": {noun, verb} but not {adj, prep, ...}
  - "rose": {noun, adj, verb} but not {prep, ...}
Naïve POS tagging

- using only syntagmatic patterns:
  - Green & Rubin (1971)
  - accuracy of 77%
- using the most-likely tag for each word:
  - Charniak et al. (1993)
  - accuracy of 90%
  - much better, but not very good...
    - 1 mistake every 10 words
    - used as a baseline for evaluation
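The most-likely-tag baseline is easy to reproduce. A minimal sketch, assuming a hand-tagged training corpus of (word, tag) pairs (the toy data below is hypothetical):

from collections import Counter, defaultdict

train = [("the", "AT"), ("representative", "NN"), ("put", "VBD"),
         ("chairs", "NNS"), ("on", "IN"), ("the", "AT"), ("table", "NN")]

# Count how often each tag occurs with each word.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_likely_tag(word, default="NN"):
    """Return the tag seen most often with this word in training."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default  # unknown word: back off to a common open-class tag

print([most_likely_tag(w) for w in ["the", "table", "put"]])
# ['AT', 'NN', 'VBD']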
Techniques for POS tagging

- --> rule-based tagging
  - uses hand-written rules
- stochastic tagging
  - uses probabilities computed from a training corpus
- transformation-based tagging
  - uses rules learned automatically
Rule-based POS tagging

- Step 1: Assign to each word all its possible tags
  - use a dictionary
- Step 2: Use if-then rules to identify the correct tag in context (disambiguation rules)
Sample rules

N-IP rule: a tag N (noun) cannot be followed by a tag IP (interrogative pronoun)
  ... man who ...
  man: {N}
  who: {RP, IP} --> {RP} (relative pronoun)

ART-V rule: a tag ART (article) cannot be followed by a tag V (verb)
  ... the book ...
  the: {ART}
  book: {N, V} --> {N}
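A minimal sketch of how such constraint rules can prune candidate tag sets; the rule encoding below is illustrative, not taken from any particular tagger:

def apply_rules(tagged):
    """tagged: list of (word, set_of_candidate_tags); prune the sets in place."""
    # Each rule removes tags that are impossible given the previous word's tag.
    forbidden_after = {
        "N": {"IP"},    # N-IP rule: a noun cannot be followed by an interrogative pronoun
        "ART": {"V"},   # ART-V rule: an article cannot be followed by a verb
    }
    for i in range(1, len(tagged)):
        prev_tags = tagged[i - 1][1]
        # Only prune when the previous word is unambiguous.
        if len(prev_tags) == 1:
            (prev_tag,) = prev_tags
            tagged[i][1].difference_update(forbidden_after.get(prev_tag, set()))
    return tagged

sent = [("the", {"ART"}), ("book", {"N", "V"})]
print(apply_rules(sent))  # [('the', {'ART'}), ('book', {'N'})]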
Techniques for POS tagging

- rule-based tagging
  - uses hand-written rules
- --> stochastic tagging
  - uses probabilities computed from a training corpus
- transformation-based tagging
  - uses rules learned automatically
Stochastic POS tagging

- Assume that a word's tag only depends on the previous tags (not the following ones)
- Use a training set (manually tagged corpus) to:
  - learn the regularities of tag sequences
  - learn the possible tags for a word
  - model this information through a language model (n-gram)
Hidden Markov Model (HMM) taggers

- Goal: maximize P(word|tag) x P(tag|previous n tags)
- P(word|tag)
  - lexical information
  - word/lexical likelihood
  - probability that, given this tag, we have this word
  - NOT the probability that this word has this tag
  - modeled through a language model (word-tag matrix)
- P(tag|previous n tags)
  - syntagmatic information
  - tag sequence likelihood
  - probability that this tag follows these previous tags
  - modeled through a language model (tag-tag matrix)
Tag sequence probability

P(tag|previous n tags):
- if we look at the previous (n-1) tags to find the current tag --> n-gram model
- trigram model
  - chooses the most probable tag ti for word wi given:
    - the previous 2 tags ti-2 & ti-1, and
    - the current word wi
- bigram model
  - chooses the most probable tag ti for word wi given:
    - the previous tag ti-1, and
    - the current word wi
- unigram model (just the most-likely tag)
  - chooses the most probable tag ti for word wi given:
    - the current word wi
Example

- "race" can be VB or NN:
  - "Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV"
  - "People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN"
- let's tag the word "race" in the 1st sentence with a bigram model
Example (con't)

- assuming the previous words have been tagged, we have:
  "Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow"
- P(race|VB) x P(VB|TO) ?
  - given that we have a VB, how likely is the current word to be "race"
  - given that the previous tag is TO, how likely is the current tag to be VB
- P(race|NN) x P(NN|TO) ?
  - given that we have an NN, how likely is the current word to be "race"
  - given that the previous tag is TO, how likely is the current tag to be NN
Example (con't)

From the training corpus, we found that:
- P(NN|TO) = .021      // given that the previous tag is TO,
                       // 2.1% chance that the current tag is NN
- P(VB|TO) = .34       // given that the previous tag is TO,
                       // 34% chance that the current tag is VB
- P(race|NN) = .00041  // given that we have an NN,
                       // 0.041% chance that this word is "race"
- P(race|VB) = .00003  // given that we have a VB,
                       // 0.003% chance that this word is "race"

so:
  P(VB|TO) x P(race|VB) = .34 x .00003 ≈ .00001
  P(NN|TO) x P(race|NN) = .021 x .00041 ≈ .000009

so: VB is more probable!
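These two products can be checked in a few lines of code, using the probabilities quoted above:

p_tag_given_prev = {("VB", "TO"): 0.34, ("NN", "TO"): 0.021}           # P(tag | previous tag)
p_word_given_tag = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}  # P(word | tag)

def score(tag, prev_tag, word):
    # bigram HMM score for one position: P(tag | prev_tag) x P(word | tag)
    return p_tag_given_prev[(tag, prev_tag)] * p_word_given_tag[(word, tag)]

for tag in ("VB", "NN"):
    print(tag, score(tag, "TO", "race"))
# VB 1.02e-05
# NN 8.61e-06   --> VB wins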
Example (con't)

- and by the way:
  - "race" is 98% of the time an NN!
  - P(VB|race) = 0.02
  - P(NN|race) = 0.98
- How are the probabilities found?
  - using a training corpus of hand-tagged text
  - long & meticulous work done by linguists
HMM tagging

- But HMM tagging tries to find:
  - the best sequence of tags for a sentence
  - not just the best tag for a single word
- goal: maximize the probability of a tag sequence, given a word sequence
- i.e. choose the sequence of tags that maximizes P(tag sequence|word sequence)
HMM tagging (con't)

$\text{bestTagSeq} = \operatorname*{argmax}_{\text{tagSeq}} P(\text{tagSeq} \mid \text{wordSeq})$

By Bayes' law:

$P(\text{tagSeq} \mid \text{wordSeq}) = P(\text{tagSeq}) \times \dfrac{P(\text{wordSeq} \mid \text{tagSeq})}{P(\text{wordSeq})}$

wordSeq is given...
- so P(wordSeq) will be the same for all tagSeq
- so we can drop it from the equation

$\text{bestTagSeq} = \operatorname*{argmax}_{\text{tagSeq}} P(\text{tagSeq}) \times P(\text{wordSeq} \mid \text{tagSeq})$

$(t_1, \ldots, t_n)^* = \operatorname*{argmax}_{t_1,\ldots,t_n} P(t_1, \ldots, t_n) \times P(w_1, \ldots, w_n \mid t_1, \ldots, t_n)$
Assumptions in HMM tagging

1. words are independent
   $P(w_1,\ldots,w_n \mid t_1,\ldots,t_n) \approx P(w_1 \mid t_1,\ldots,t_n) \times P(w_2 \mid t_1,\ldots,t_n) \times \ldots \times P(w_n \mid t_1,\ldots,t_n)$
2. Markov assumption (approximation to a short history)
   ex. with bigram approximation: $P(t_i \mid t_1,\ldots,t_{i-1}) \approx P(t_i \mid t_{i-1})$
3. the probability of a word is only dependent on its tag
   $P(w_i \mid t_1,\ldots,t_n) \approx P(w_i \mid t_i)$

so:

$(t_1,\ldots,t_n)^* = \operatorname*{argmax}_{t_1,\ldots,t_n} P(t_1,\ldots,t_n) \times P(w_1,\ldots,w_n \mid t_1,\ldots,t_n) \approx \operatorname*{argmax}_{t_1,\ldots,t_n} \prod_{i=1}^{n} \underbrace{P(t_i \mid t_{i-1})}_{\text{state transition probability}} \times \underbrace{P(w_i \mid t_i)}_{\text{emission probability}}$
The derivation

bestTagSeq = argmax P(tagSeq) x P(wordSeq|tagSeq)

$(t_1 \ldots t_n)^* = \operatorname*{argmax} P(t_1, \ldots, t_n) \times P(w_1, \ldots, w_n \mid t_1, \ldots, t_n)$

Assumption 1: independence assumption + chain rule

$P(t_1, \ldots, t_n) \times P(w_1, \ldots, w_n \mid t_1, \ldots, t_n)$
$= P(t_n \mid t_1, \ldots, t_{n-1}) \times P(t_{n-1} \mid t_1, \ldots, t_{n-2}) \times \ldots \times P(t_1)$
$\quad \times P(w_1 \mid t_1, \ldots, t_n) \times P(w_2 \mid t_1, \ldots, t_n) \times \ldots \times P(w_n \mid t_1, \ldots, t_n)$

Assumption 2: Markov assumption: only look at a short history (ex. bigram)

$= P(t_n \mid t_{n-1}) \times P(t_{n-1} \mid t_{n-2}) \times \ldots \times P(t_1)$
$\quad \times P(w_1 \mid t_1, \ldots, t_n) \times P(w_2 \mid t_1, \ldots, t_n) \times \ldots \times P(w_n \mid t_1, \ldots, t_n)$

Assumption 3: a word's identity only depends on its tag

$= P(t_n \mid t_{n-1}) \times P(t_{n-1} \mid t_{n-2}) \times \ldots \times P(t_1)$
$\quad \times P(w_1 \mid t_1) \times P(w_2 \mid t_2) \times \ldots \times P(w_n \mid t_n)$

$= \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \times P(w_i \mid t_i)$

so $\text{bestTagSeq} = (t_1, \ldots, t_n)^* = \operatorname*{argmax}_{t_1,\ldots,t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \times P(w_i \mid t_i)$
Emission & transition probabilities

let:
- N: number of possible tags (size of the tag set)
- V: number of word types (vocabulary)

From a tagged training corpus, we compute the frequency of:
- emission probabilities P(wi|ti)
  - stored in an N x V matrix
  - emission[i,j] = probability that tag i is the correct tag for word j
- transition probabilities P(ti|ti-1)
  - stored in an N x N matrix
  - transition[i,j] = probability that tag i follows tag j

In practice, these matrices are very sparse, so these models are smoothed to avoid zero probabilities.
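A minimal sketch of estimating both tables by relative frequency from a tagged corpus; the toy corpus is hypothetical and add-one smoothing stands in for whatever smoothing scheme is actually used:

from collections import Counter

corpus = [[("the", "AT"), ("table", "NN")],
          [("the", "AT"), ("chairs", "NNS")]]  # list of tagged sentences

emit, trans, tag_count = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<s>"                       # start-of-sentence pseudo-tag
    for word, tag in sent:
        emit[(tag, word)] += 1         # counts for P(word | tag)
        trans[(prev, tag)] += 1        # counts for P(tag | previous tag)
        tag_count[tag] += 1
        prev = tag

tags = sorted(tag_count)
vocab = {w for s in corpus for w, _ in s}

def p_emit(word, tag):
    # P(word | tag), add-one smoothed over the vocabulary
    return (emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab))

def p_trans(tag, prev):
    # P(tag | previous tag), add-one smoothed over the tag set
    return (trans[(prev, tag)] + 1) / (sum(trans[(prev, t)] for t in tags) + len(tags))

print(p_emit("table", "NN"), p_trans("NN", "AT"))  # 0.5 0.4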
Emission probabilities P(wi|ti)

- stored in an N x V matrix
- emission[i,j] = probability/frequency that tag i is the correct tag for word j
Transition probabilities P(ti|ti-1)

- stored in an N x N matrix
- transition[i,j] = probability/frequency that tag i follows tag j
Efficiency issues

- finding the best-probability sequence is exponential in time
- for efficiency, we usually use the Viterbi algorithm
  - for global maximisation
  - i.e. the best tag sequence
An example

Emission probabilities: a tags-by-vocabulary table of P(word|tag), for the tags {PN, VB, TO, IN, AT, NN} and the vocabulary {John, likes, to, in, the, sea, fish}.

Transition probabilities: a first-tag-by-second-tag table of P(second tag | first tag) over the same tags, plus a "None (1st tag)" row and a "None (last tag)" column.

[table values not recoverable from the transcript]
State transition diagram (VMM)

- Transition probabilities
[diagram: states start, PN, VB, TO, IN, AT, NN, end, with the transition probabilities above on the arcs; figure not recoverable from the transcript]
State transition diagram (HMM)

- but the states are "invisible" (we only see the words)
[same diagram, with word emission probabilities (e.g. John: 0.3, fish: 0.3, the: 0.1, ...) attached to the states; figure not recoverable from the transcript]
The Viterbi algorithm

- best tag sequence for "John likes to fish in the sea"?
- efficiently computes the most likely state sequence given a particular output sequence
- based on dynamic programming
A smaller example

[diagram of a two-state automaton over start, q, r, end: start --> q with probability 1.0; q emits a: 0.4, b: 0.6 and has transitions q-->q: 0.3, q-->r: 0.7; r emits a: 0.2, b: 0.8 and has transitions r-->q: 0.5, r-->r: 0.5]

- What is the best sequence of states for the input string "bbba"?
- Computing all possible paths and finding the one with the max probability is exponential
A smaller example (con't)

- For each state, store the most likely sequence that could lead to it (and its probability)
- Path probability matrix:
  - an array of states versus time (tags versus words)
  - stores the prob. of being in each state at each time, in terms of the prob. of being in each state at the preceding time

Path probability matrix for "bbba" (best sequence leading to each state, with its probability):

time 1 (ε --> b):
  to q:          ε --> q    0.6       (1.0 x 0.6)
  to r:          ε --> r    0         (0 x 0.8)
time 2 (b --> b):
  to q, from q:  q --> q    0.108     (0.6 x 0.3 x 0.6)
  to q, from r:  r --> q    0         (0 x 0.5 x 0.6)
  to r, from q:  q --> r    0.336     (0.6 x 0.7 x 0.8)
  to r, from r:  r --> r    0         (0 x 0.5 x 0.8)
time 3 (bb --> b):
  to q, from q:  qq --> q   0.01944   (0.108 x 0.3 x 0.6)
  to q, from r:  qr --> q   0.1008    (0.336 x 0.5 x 0.6)
  to r, from q:  qq --> r   0.0648    (0.108 x 0.7 x 0.8)
  to r, from r:  qr --> r   0.1344    (0.336 x 0.5 x 0.8)
time 4 (bbb --> a):
  to q, from q:  qrq --> q  0.012096  (0.1008 x 0.3 x 0.4)
  to q, from r:  qrr --> q  0.02688   (0.1344 x 0.5 x 0.4)
  to r, from q:  qrq --> r  0.014112  (0.1008 x 0.7 x 0.2)
  to r, from r:  qrr --> r  0.01344   (0.1344 x 0.5 x 0.2)
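Keeping only the best path into each state at each time step, left to right, is exactly the Viterbi computation. A minimal sketch on this two-state example, using the numbers from the diagram:

start = {"q": 1.0, "r": 0.0}
trans = {("q", "q"): 0.3, ("q", "r"): 0.7, ("r", "q"): 0.5, ("r", "r"): 0.5}
emit = {"q": {"a": 0.4, "b": 0.6}, "r": {"a": 0.2, "b": 0.8}}

def viterbi(observations, states=("q", "r")):
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {s: max((best[p][0] * trans[(p, s)] * emit[s][obs], best[p][1] + [s])
                       for p in states)
                for s in states}
    return max(best.values())

print(viterbi("bbba"))  # (0.02688, ['q', 'r', 'r', 'q'])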
Viterbi for POS tagging

Let:
- n = nb of words in the sentence to tag (nb of input tokens)
- T = nb of tags in the tag set (nb of states)
- vit = path probability matrix (Viterbi)
  - vit[i,j] = probability of being at state (tag) j at word i
- path = matrix to recover the nodes of the best path (best tag sequence)
  - path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1

// Initialization
vit[1,PERIOD] := 1.0            // pretend that there is a period before
                                // our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD
Viterbi for POS tagging (con't)

// Induction (build the path probability matrix)
for i := 1 to n step 1 do                // for all words in the sentence
  for all tags tj do                     // for all possible tags
    // store the max prob of the path
    // (P(wi+1|tj) is the emission probability,
    //  P(tj|tk) the state transition probability)
    vit[i+1,tj] := max1≤k≤T (vit[i,tk] x P(wi+1|tj) x P(tj|tk))
    // store the actual state
    path[i+1,tj] := argmax1≤k≤T (vit[i,tk] x P(wi+1|tj) x P(tj|tk))
  end
end

// Termination and path-readout
// (vit[i,tk] is the probability of the best path leading to state tk at word i)
bestStaten+1 := argmax1≤j≤T vit[n+1,j]
for j := n to 1 step -1 do               // read the best path backwards
  bestStatej := path[j+1, bestStatej+1]
end
P(bestState1, ..., bestStaten) := max1≤j≤T vit[n+1,j]
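A sketch of this pseudocode in Python, under the assumptions that p_trans(t, prev) = P(t|prev) and p_emit(word, t) = P(word|t) as in the earlier estimation sketch, and that the tag set includes the PERIOD start tag:

def viterbi_pos(words, tags, p_trans, p_emit, start_tag="PERIOD"):
    n = len(words)
    # vit[i][t] = probability of the best tag sequence for the first i words, ending in tag t
    vit = [{t: 0.0 for t in tags} for _ in range(n + 1)]
    path = [{t: None for t in tags} for _ in range(n + 1)]
    vit[0][start_tag] = 1.0                  # pretend a period precedes the sentence

    for i in range(n):                       # induction over the words
        for tj in tags:
            # best previous tag tk for reaching tj at word i+1
            best_k = max(tags, key=lambda tk: vit[i][tk] * p_trans(tj, tk))
            vit[i + 1][tj] = vit[i][best_k] * p_trans(tj, best_k) * p_emit(words[i], tj)
            path[i + 1][tj] = best_k         # backpointer to the previous tag

    # termination and path read-out (walk the backpointers right to left)
    best = max(tags, key=lambda t: vit[n][t])
    seq = [best]
    for i in range(n, 1, -1):
        seq.append(path[i][seq[-1]])
    return list(reversed(seq))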
Possible improvements

- in bigram POS tagging, we condition a tag only on the preceding tag
- why not...
  - use more context (ex. use a trigram model)
    - more precise:
      - "is clearly marked" --> verb, past participle
      - "he clearly marked" --> verb, past tense
    - combine trigram, bigram, and unigram models
  - condition on words too
- but with an n-gram approach, this is too costly (too many parameters to model)
- --> transformation-based tagging...
Techniques for POS tagging

- rule-based tagging
  - uses hand-written rules
- stochastic tagging
  - uses probabilities computed from a training corpus
- --> transformation-based tagging
  - uses rules learned automatically
Transformation-based tagging

- due to Eric Brill (1995)
- basic idea:
  - take a non-optimal sequence of tags and improve it successively by applying a series of well-ordered re-write rules
  - rule-based
  - but the rules are learned automatically by training on a pre-tagged corpus
An example

1. Assign to words their most likely tag
   - P(NN|race) = .98
   - P(VB|race) = .02
2. Change some tags by applying transformation rules

Rule                            Context (trigger: apply the rule when...)    Examples
NN --> VB                       the previous tag is the preposition "to"     go to sleep(VB), go to school(VB)
(noun --> verb)
VBD --> VB                      one of the previous 3 tags is a modal (MD)   you may cut (VB)
(past tense --> base form)
JJR --> RBR                     the next tag is an adjective (JJ)            a more (RBR) valuable
(comparative adj --> comparative adv)
VBP --> VB                      one of the previous 2 words is "n't"         should (VB) n't
(present tense --> base form)
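A minimal sketch of applying one such transformation (the rule encoding is illustrative, not Brill's actual format):

def apply_transformation(tagged, from_tag, to_tag, trigger):
    """tagged: list of (word, tag); trigger(tagged, i) -> bool."""
    return [(w, to_tag if t == from_tag and trigger(tagged, i) else t)
            for i, (w, t) in enumerate(tagged)]

# NN --> VB when the previous tag is TO
prev_is_to = lambda tagged, i: i > 0 and tagged[i - 1][1] == "TO"

sent = [("go", "VB"), ("to", "TO"), ("sleep", "NN")]  # "sleep" mistagged by step 1
print(apply_transformation(sent, "NN", "VB", prev_is_to))
# [('go', 'VB'), ('to', 'TO'), ('sleep', 'VB')]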
Types of context

- lots of latitude...
- can be:
  - tag-triggered transformation
    - the preceding/following word is tagged this way
    - the word two before/after is tagged this way
    - ...
  - word-triggered transformation
    - the preceding/following word is this word
    - ...
  - morphology-triggered transformation
    - the preceding/following word finishes with an s
    - ...
  - a combination of the above
    - the preceding word is tagged this way AND the following word is this word
Learning the transformation rules

- Input: a corpus with each word:
  - correctly tagged (for reference)
  - tagged with its most frequent tag (C0)
- Output: a bag of transformation rules
- Algorithm:
  - instantiate a small set of hand-written templates (generic rules) by comparing the reference corpus to C0
  - templates have the form "Change tag a to tag b when...":
    - the preceding/following word is tagged z
    - the word two before/after is tagged z
    - one of the 2 preceding/following words is tagged z
    - one of the 2 preceding words is z
    - ...
Learning the transformation rules (con't)

- Run the initial tagger and compile the types of errors:
  - <incorrect tag, desired tag, # of occurrences>
- For each error type, instantiate all templates to generate candidate transformations
- Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces
- Save the transformation that yields the greatest improvement
- Stop when no transformation can reduce the error rate by a predetermined threshold
Example

- if the initial tagger mistags 159 words as verbs instead of nouns:
  - create the error triple: <verb, noun, 159>
- Suppose template #3 is instantiated as the rule:
  - Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner.
- When this rule is applied to the corpus:
  - it corrects 98 of the 159 errors
  - but it also creates 18 new errors
  - error reduction is 98-18=80
Learning the best transformations

- input:
  - a corpus with each word:
    - correctly tagged (for reference)
    - tagged with its most frequent tag (C0)
  - a bag of unordered transformation rules
- output:
  - an ordering of the best transformation rules
Learning the best transformations (con't)

let:
- E(Ck) = nb of words incorrectly tagged in the corpus at iteration k
- v(C) = the corpus obtained after applying rule v on the corpus C
- ε = minimum improvement (in number of errors) required to continue

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))        // find the transformation t that minimizes the error rate
  if (E(Ck) - E(bt(Ck))) < ε     // if bt does not improve the tagging significantly
    then goto finished
  Ck+1 := bt(Ck)                 // apply rule bt to the current corpus
  Tk+1 := bt                     // bt is kept as the current transformation rule
end
finished: the sequence T1 T2 ... Tk is the ordered list of transformation rules
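A minimal sketch of this greedy loop; errors() and apply() are hypothetical helpers that count mistakes against the reference corpus and apply a rule corpus-wide:

def learn_transformations(corpus, candidates, errors, apply, eps=1):
    ordered = []                                  # T1, T2, ... in order
    while True:
        # find the rule that minimizes the error count on the current corpus
        best = min(candidates, key=lambda r: errors(apply(r, corpus)))
        gain = errors(corpus) - errors(apply(best, corpus))
        if gain < eps:                            # no significant improvement: stop
            return ordered
        corpus = apply(best, corpus)              # commit the rule
        ordered.append(best)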
Strengths of transformation-based tagging

- exploits a wider range of lexical and syntactic regularities
- can look at a wider context:
  - conditions the tags on preceding/following words, not just preceding tags
  - can use more context than a bigram or trigram
- transformation rules are easier to understand than matrices of probabilities
Evaluation of POS taggers

- compared with a gold standard of human performance
- metric:
  - accuracy = % of tags that are identical to the gold standard
  - most taggers: ~96-97% accuracy
- must compare accuracy to:
  - ceiling (best possible result)
    - how do human annotators score compared to each other? (96-97%)
    - so systems are not bad at all!
  - baseline (worst possible result)
    - what if we take the most-likely tag (unigram model), regardless of previous tags? (90-91%)
    - so anything less is really bad
More on tagger accuracy

- is 95% good?
  - that's 5 mistakes every 100 words
  - if, on average, a sentence is 20 words, that's 1 mistake per sentence
- when comparing tagger accuracy, beware of:
  - size of the training corpus
    - the bigger, the better the results
  - difference between training & testing corpora (genre, domain, ...)
    - the closer, the better the results
  - size of the tag set
    - prediction versus classification
  - unknown words
    - the more unknown words (not in the dictionary), the worse the results
Error analysis of POS taggers

- Where did the tagger go wrong?
- Use a confusion matrix / contingency table
  (rows: correct tag; columns: tags assigned by the tagger; Penn Treebank tags)

  correct   DT    NNP    JJ    NN    VBD   VBN   ...   Total
  DT       99.4    .3     0     0     .3     0          100
  NNP        0   90.2   3.3   4.1     0      0          100
  JJ         0     .1  93.9   1.8     .1   1.9          100
  NN         0     .5   2.2  95.5     .2     0          100
  VBD        0     0     .3   1.4  96.0    2.5          100
  VBN        0     0    1.9    0    3.4   93.3          100
  ...

- Most confused:
  - NN (noun) vs. NNP (proper noun) vs. JJ (adjective)
  - VBD (verb, past tense) vs. VBN (past participle) vs. JJ (adjective)
    - he chopped carrots, the carrots were chopped, the chopped carrots
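A confusion matrix like the one above can be compiled in a few lines (the gold/predicted tags below are toy data):

from collections import Counter

gold = ["NN", "NN", "VBD", "JJ", "NNP"]
pred = ["NN", "JJ", "VBN", "JJ", "NN"]

confusion = Counter(zip(gold, pred))   # (correct tag, assigned tag) -> count
for (correct, assigned), n in sorted(confusion.items()):
    print(f"{correct:4} tagged as {assigned:4}: {n}")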
Major difficulties in POS tagging

- Unknown words (ex. proper names)
  - because we do not know the set of tags they can take
  - and knowing this takes you a long way (cf. the baseline POS tagger)
  - possible solutions:
    - assign all possible tags, with a probability distribution identical to that of the lexicon as a whole
    - use morphological cues to infer the possible tags
      - ex. words ending in -ed are likely to be past tense verbs or past participles
- Frequently confused tag pairs
  - preposition vs. particle
    - <running> <up> a hill (prep) / <running up> a bill (particle)
  - verb, past tense vs. past participle vs. adjective