POS Tagging: An Overview
Xiaofei Lu
Linguistics 795K
April 15, 2002

Outline
1. POS-tagging vs. syntactic parsing
2. Why tagging: applications of POS-tagged texts
3. Two problems for POS tagging
4. A tagging system
4.1 Tokenization
4.2 Tagset
4.3 Encoding of token-tag relations
4.4 Tagging schemes: assigning tags to words
4.5 Non-linguistic issues
5. Early automatic taggers: the rule-based approach
6. Automatically trained taggers: the probabilistic approach
6.1 Supervised tagging
6.2 Unsupervised tagging
6.3 Other algorithms for automatic training
7. Assignments
References

1. POS-tagging vs. syntactic parsing
1.1 Considered as one task
Both POS tagging and syntactic annotation specify the grammatical
characteristics of a text: POS-tagging is a specification of the leaves of the
phrase-structure (PS) tree, which is a favored model for syntactic annotation.
(1)
[S [Nr Last_MD year_NNT1 ] ,_, [N the_AT workforce_NN1 ] [V grew_VVD ] ._. ]

1.2 Considered as separate tasks
Practically, the difficulty of parsing unrestricted text makes it a useful
expedient to divide the work of parsing into two manageable tasks. POS-
tagging can be done much more accurately and quickly than parsing, and
good taggers can typically be developed for a domain much more rapidly
than good parsers can. Therefore, POS-tagging serves as a precursor to
parsing and other NLP tasks.

2. Why POS-tagging: applications of POS-tagged texts
2.1 Machine translation
The probability of a word in the source language translating into a word in
the target language is highly dependent on the POS of the source word. E.g.
the word guancha in Chinese can be translated as either observe or
observation, depending on its POS, i.e. whether it is a noun or verb.

2.2 Information retrieval (IR) and extraction
By augmenting the query a person gives to an IR system with POS
information, more refined retrieval is possible. E.g. if one wants to search
for documents containing produce as a noun, adding the POS information
will eliminate irrelevant ones containing only produce as a verb.
Patterns used for extracting information from text frequently make reference
to POSs, too. E.g. to extract information of the form acquire (X, Y) from the
Wall Street Journal, one may use the following template:
(2)
(DET)? (PROPER NOUN)+ (acquired | bought) (DET)? (PROPER NOUN)+
             X                acquire                  Y
From (3) would be extracted (4).
(3)
International Business Machines bought Violet Corporation
(4)
acquire (International Business Machines, Violet Corporation)

2.3 Higher-level syntactic processing
Tagging often serves as a precursor to higher-level syntactic-processing
systems. E.g. noun phrase chunkers (programs to find NPs in each
sentence) use a combination of word and POS information to learn either
regular expressions for NPs or the contexts in a sentence that are likely to
indicate the beginning or ending of a phrase (Church 1988; Ramshaw and
Marcus, 1995).
Tagging can also be important for speech synthesis. E.g. the word record is
pronounced differently depending on whether it is a noun or a verb.
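
The extraction template in (2) can be tried out as a regular expression over
tagged text. The following is a minimal Python sketch, not the method of any
particular system. It assumes the input is in word_TAG form, and the tag names
DET (determiner) and NP (proper noun) as well as the function names are
hypothetical choices made for this illustration.

```python
import re

# Assumed conventions (not from the handout): tokens look like word_TAG,
# determiners carry the tag DET and proper nouns the tag NP.
DET = r"(?:\S+_DET\s+)?"                 # (DET)?
PROPER = r"((?:\S+_NP\b\s*)+)"           # (PROPER NOUN)+, captured
VERB = r"(?:acquired|bought)_\S+\s+"     # (acquired | bought), any verb tag

TEMPLATE = re.compile(DET + PROPER + VERB + DET + PROPER)


def strip_tags(chunk):
    """Turn 'Violet_NP Corporation_NP' into 'Violet Corporation'."""
    return " ".join(token.rsplit("_", 1)[0] for token in chunk.split())


def extract_acquisitions(tagged_text):
    """Return one ('acquire', X, Y) triple per match of the template."""
    return [
        ("acquire", strip_tags(m.group(1)), strip_tags(m.group(2)))
        for m in TEMPLATE.finditer(tagged_text)
    ]


sentence = ("International_NP Business_NP Machines_NP bought_VVD "
            "Violet_NP Corporation_NP")
print(extract_acquisitions(sentence))
# [('acquire', 'International Business Machines', 'Violet Corporation')]
```

Because the pattern refers to tags rather than to specific company names, it
generalizes to any sequence of proper nouns on either side of the verb, which
is the point of making extraction patterns refer to POSs.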

3. Two problems for POS tagging
3.1 Words with multiple POSs
Many words have multiple POSs, and one of the difficult tasks is to provide
the machine with the knowledge necessary to disambiguate from the set of
allowable tags based on context. E.g. the word saw can be a noun or a verb,
and the word can can be a noun, a verb, or an auxiliary verb. These usages
can appear in one sentence, as in (5) and (6).
(5)
I saw a saw saw a saw.
(6)
We can can the can.

3.2 New words
New words appear all the time. Thus, a method for determining the tags of
new words is needed. This can be done based on contextual information
and information about the word itself, such as affixes. E.g. it is easy to
determine that the made-up word goblanesque is an adjective, based on the
environment in which it appears and the suffix -esque.
(7)
His goblanesque writing was a refreshing treat.

4. A tagging system
Three questions to start with:
· How to divide the text into individual word tokens?
· How to choose a tagset?
· How to choose which tag is to be applied to which word token?

4.1 Tokenization (multiwords, merged words, compounds)

4.2 Tagset
A tagset is a list of tags used for a given task of grammatical tagging. E.g.
the CLAWS C7 tagset.
There is room for disagreement about what word-categories are useful or
linguistically applicable, and for interference from practical constraints such
as speed and accuracy. Normally, there has to be a trade-off between what is
linguistically desirable and what is computationally feasible.
E.g., it makes grammatical sense to have separate tags for the present
subjunctive (come what may), the imperative form (come here!), and the
present tense plural indicative form (they come every spring). However, in
practice, it is difficult to distinguish them without a substantial proportion of
errors. The solution was to merge them into a single category ‘finite base
form’ as opposed to non-finite base form (Would you like to come?). Even this
distinction is ignored in some projects, e.g. the tagging of the Brown Corpus.

4.2.1 Tags and labels
· A tag is a word-class embodied in an annotative device associated
with a word in the text
· Four criteria for choosing labels for tags
1) Conciseness: Brief labels are more convenient than lengthy ones.
E.g. DD1 vs. SINGULAR_DETERMINER
2) Perspicuity: Labels which can be easily interpreted and
remembered are more user-friendly than those which cannot. E.g.
Preposition vs. IN
3) Analyzability: Labels which are decomposable into their logical
parts are better than those which are not. E.g. NP1, in the BNC
tagset, can be decomposed into: N = noun (vs. V = verb, etc.); P =
proper [noun] (vs. N = common noun); 1 = singular (vs. 2 =
plural). With an analyzable tagset, searches of the corpus can be
carried out at varying levels of granularity. E.g. the symbol N* can
represent all nouns, N*1 can represent all singular nouns, and NP*
all proper nouns. However, people may prefer more perspicuous
labels such as Noun: prop: sing to NP1.
4) Unambiguity: each tag corresponds to a unique label.

· Horizontal and vertical formats for presenting tagged corpus
Horizontal format:
Oh_UH ,_, and_CC he_PPHS1 did_VDD pass_VV0 his_APP$ exams_NN2 ._.
Vertical format:
SK01 271   Oh      UH      (interjection)
SK01 272   ,       ,       (comma)
SK01 273   and     CC      (coordinating conjunction)
SK01 274   he      PPHS1   (3rd pers. pronoun, sing., nom.)
SK01 275   did     VDD     (past tense of the verb do)
SK01 276   pass    VV0     (base form of lexical verb)
SK01 277   his     APP$    (possessive determiner)
SK01 278   exams   NN2     (plural common noun)
SK01 279   .       .       (period)
The two formats are easily convertible. Verbose labels are more conveniently
handled in the vertical format than in the horizontal one.
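
Since the horizontal and vertical formats carry the same word-tag pairs,
converting between them is mechanical, and an analyzable tagset makes wildcard
searches such as N*1 easy to run. The following Python sketch illustrates both
points. It assumes the word_TAG convention of the horizontal format and a
tab-separated vertical format without reference numbers; the function names
are hypothetical and do not belong to any corpus tool.

```python
from fnmatch import fnmatchcase


def parse_horizontal(line):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so that ',_,' and '._.' work too.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


def to_vertical(pairs):
    """One token per line: word, a tab, then the tag."""
    return "\n".join(f"{word}\t{tag}" for word, tag in pairs)


def to_horizontal(pairs):
    """Inverse of parse_horizontal."""
    return " ".join(f"{word}_{tag}" for word, tag in pairs)


def search(pairs, pattern):
    """Return the words whose tag matches a wildcard pattern such as 'N*1'."""
    return [word for word, tag in pairs if fnmatchcase(tag, pattern)]


line = "Oh_UH ,_, and_CC he_PPHS1 did_VDD pass_VV0 his_APP$ exams_NN2 ._."
pairs = parse_horizontal(line)
print(to_vertical(pairs))
print(search(pairs, "V*"))    # ['did', 'pass']
print(search(pairs, "N*2"))   # ['exams']
```

The wildcard search only works because the C7 labels are analyzable: the
first letter encodes the major word class and the final digit encodes number.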

4.2.2 Logical tagsets
The relations between the word categories symbolized by tags should
be representable as a hierarchical tree, with attributes being inherited
from one level of the tree to another. The same alphanumeric symbol,
in a particular position in the sequence of symbols in the label, may
have the same meaning across different branches of the tree. E.g.
below is a summary representation of part of the C7 tagset as a
hierarchical, logical tagset.

N (noun)
   N (common)
      1    NN1: common noun, general, singular
      2    NN2: common noun, general, plural
           NN: common noun, general, number-neutral
      T (temporal)
         1    NNT1: common noun, temporal, singular
         2    NNT2: common noun, temporal, plural
   P (proper)
      1    NP1: proper noun, general, singular
      2    NP2: proper noun, general, plural
           NP: proper noun, general, number-neutral
      M (month)
         1    NPM1: proper noun, month, singular
         2    NPM2: proper noun, month, plural

4.2.3 Size and composition of tagsets
· Size is less important than it seems and is changeable according to the
emphasis of a particular project
· The ‘core’ of a tagset tends to be the major word classes with their
principal sub-classes
· Peripheral elements that need to be marked tend to be ignored: e.g. for
a written corpus, certain categories of WIC (word-initial capital)
words, such as month nouns, day nouns, etc., may be of semantic and
syntactic significance in their own right. In spoken corpora, it would
be useful to distinguish certain types of discourse marker (well) or
hesitation marker (er).
· There is a conflict between linguistic (external) reasons and computational
(internal) reasons for determining the composition of a tagset.
The linguistic quality of a tagset (e.g. the extent to which it allows
retrieval of important grammatical distinctions in the language)
concerns the user’s requirements; the computational tractability of a
tagset (e.g. the extent to which a particular tag is useful in aiding the
disambiguation process, and increasing the accuracy of tagging) is
‘internal’. Most tagsets show some signs of the ‘internal’ criteria
impinging on the ‘external’ criteria, e.g. the low tractability of the
subjunctive category, given the ambiguity of verb base forms.
· More on the comparison and evaluation of tagsets coming up with
Kyuchul.

4.3 Encoding of token-tag relations
· Issues of tokenization raise problems for the way we encode token-tag
relations
· The tagging of the Brown Corpus set up a model imitated by many
other tagging projects, such as the tagged LOB Corpus, the Penn
Treebank, and the SUSANNE Corpus.
· The Text Encoding Initiative (TEI) (Burnard, 1995)
A movement towards achieving an acceptable standard in the
encoding of electronic textual material on computer, esp. for purposes
of data interchange, based on the mark-up system known as SGML
(more on XML coming up soon with Stacey). The SGML-conformant
set-ups can be elaborated to deal with multiwords, mergers, and
‘phantom words’. Examples from BNC:
Multiwords:
<w PRP>in lieu of <w NN1>payment
Mergers:
<w PNP>they<w VBB>’re <w VVG>passing
‘Phantom words’:
<w AJ0><w PRP>post-</w PRP><w AJ0>Cold</w AJ0>
<w NN1>war </w NN1></w AJ0>

4.4 Tagging schemes: assigning tags to words
· Task: to specify how decisions are made about how to assign tags
to words
· Lexicon: to provide information on which tags are assignable to
which words
· Disambiguation: when multiple tags are assignable to a single
word, we need to define the contextual conditions and
distributional factors for choosing a particular tag for a particular
word-token.
· Tagging manuals for different tagged corpora, e.g. Santorini
(1990) for the Penn Treebank, explain the tagging decisions to be
made in all possible contexts.
· Grey areas of unclarity between the use of one tag and another in
English, e.g. in plastic bottle, should plastic be tagged as a noun or
an adjective?

4.5 Non-linguistic issues
· manual vs. automatic tagging
· techniques and capabilities of the tagging software
· available human and hardware resources
· speed, accuracy, and consistency requirements

5. Automatic tagging: the rule-based approach
· An early program (Greene and Rubin, 1971) was created to tag the
Brown Corpus semiautomatically. High-certainty word-environments
discovered manually were applied to tag reliable regions in the corpus,
followed by manual tagging and correction. No manually tagged corpus
was available to develop and test the manually created rules
empirically, so the rates of accuracy were not high.
· Klein and Simmons (1963) constructed a totally automatic tagger.
First, all allowable tags listed in the lexicon were assigned to each
word. Second, a sequence of handwritten rules deleted certain tags as
possibilities in certain environments. A rule might read, “If a word is
tagged with both an N tag and a V tag, and it occurs immediately after
a Det, then remove the V tag.”
· A recently developed tagger using the same principles but based on
Constraint Grammars (Karlsson et al., 1995) has been applied to POS
tagging with success. Both the morphological and syntactic analysers
use rule-based linguistic descriptions. The system works in the
following way:
1. Tokenisation
2. Lookup of morphological tags:
   a) A lexical analyser assigns all possible morphological analyses to each
   word found in a large lexicon including all inflected and central derived
   word forms
   b) A guesser is used to assign an analysis to all remaining words
3. A rule-based Constraint Grammar parser is used to resolve morphological
ambiguities
4. Syntactic lookup: all possible syntactic tags are introduced for each word
5. Resolution of syntactic ambiguities: the parser finally consults a syntactic
disambiguation grammar. The English version of the Constraint Grammar
contains 800 syntactic constraints, of a similar form to the rules at the
morphological resolution stage.

6. Automatically trained taggers: the probabilistic approach
6.1 Supervised tagging
Supervised taggers typically rely on pre-tagged corpora to serve as the
basis for creating any tools to be used throughout the tagging process,
for example: the tagger dictionary, the word/tag frequencies, the tag
sequence probabilities and/or the rule set.

6.2 Unsupervised tagging
Unsupervised models, on the other hand, are those which do not
require a pretagged corpus but instead use sophisticated computational
methods to automatically induce word groupings (i.e. tag sets) and,
based on those automatic groupings, either to calculate the
probabilistic information needed by stochastic taggers or to induce the
context rules needed by rule-based systems.

6.3 Other algorithms for automatic training
6.3.1 Transformation-based learning
6.3.2 Learning without a tagged text
6.3.3 Maximum entropy tagging
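
Looking back at the rule-based approach of Section 5, the two steps described
for the Klein and Simmons tagger (assign every allowable tag, then let rules
delete tags in context) can be illustrated in a few lines of Python. This is a
toy sketch under stated assumptions, not a reconstruction of their program:
the lexicon, the tag names, and the function names are all hypothetical, and
"after a Det" is interpreted here as "after a word whose only remaining tag is
Det".

```python
# Hypothetical toy lexicon: every tag each word is allowed to take.
LEXICON = {
    "the": {"Det"},
    "can": {"N", "V", "Aux"},
    "rusted": {"V"},
}


def assign_all_tags(words, lexicon):
    """Step 1: give each word every tag that the lexicon lists for it."""
    return [set(lexicon[word.lower()]) for word in words]


def remove_v_after_det(candidates):
    """Step 2, one rule: if a word has both an N tag and a V tag and comes
    immediately after an unambiguous Det, remove the V tag."""
    result = [set(tags) for tags in candidates]  # do not modify the input
    for i in range(1, len(result)):
        if result[i - 1] == {"Det"} and {"N", "V"} <= result[i]:
            result[i].discard("V")
    return result


words = ["The", "can", "rusted"]
tags = remove_v_after_det(assign_all_tags(words, LEXICON))
for word, options in zip(words, tags):
    print(word, sorted(options))
```

After this single rule, can still keeps both its N and Aux readings, which
shows why such taggers need a whole sequence of rules rather than one.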

7. Assignments
Run XKWIC on the BNC-SAMP, and answer the following questions:
1. Which of the two appears more frequently, the definite or indefinite
determiner?
2. Which of the three appears most frequently, nouns, verbs, or adjectives?
3. Which of the two appears more frequently, singular nouns or plural
nouns?
4. How often does VDD occur before negation?
5. Is the word record used more often as a noun or a verb?
6. How many sentences contain two word forms of have?
7. What kind of (two-word) compound nouns do you find in the corpus?
8. What are the adjectives used to modify men and women respectively?
9. What kind of feelings do people have?

References
Brill, E. 2000. ‘Part-of-Speech Tagging,’ in Robert Dale, Hermann Moisl, Harold Somers
(eds.), Handbook of Natural Language Processing. New York: Marcel Dekker. 403-414.
Burnard, L. 1995. ‘Users’ Reference Guide for the British National Corpus’, Oxford
University Computing Services.
Church, K. W. 1988. A stochastic parts program and noun phrase parser for unrestricted
text. Second Conference on Applied Natural Language Processing, Austin, TX, pp. 136-143.
Greene, B. B. & Rubin, G. M. 1971. Automatic grammatical tagging of English.
Technical Report, Brown University. Providence, RI.
Karlsson, F., Voutilainen, A. & Anttila, A. 1995. Constraint Grammar. Berlin: Mouton de
Gruyter.
Klein, S. & Simmons, R. 1963. A computational approach to the grammatical coding of
English words. Journal of the Association for Computing Machinery 10: 334-347.
Leech, G. 1997. ‘Grammatical Tagging,’ in Roger Garside, Geoffrey Leech, Tony
McEnery (eds.), Corpus Annotation: Linguistic Information From Computer Text
Corpora. London: Longman.
Ramshaw, L. A. & Marcus, M. P. 1995. Text Chunking using Transformation-Based
Learning. In Proceedings of the ACL Third Workshop on Very Large Corpora, June
1995, pp. 82-94.
Santorini, B. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project.
Department of Computer and Information Science, University of Pennsylvania, Technical
Report MS-CIS-90-47.