Linguistic annotation
2/14/2006
Nianwen Xue
Outline
• Tokenization / segmentation, POS tagging
• Treebanking
 Constituent structure and structural ambiguity
 Basic grammatical relations and how argument structure is
instantiated
• Propbanking/nombanking
 Cross-linguistic syntactic alternations, verb senses and
argument structure
• Others: named entity, coreference, discourse
connectives
2
Tokenization
• English
 In the new position he will oversee Mazda ’s U.S.
sales , services , parts and marketing operations .
 We did n’t have much of a choice .
 U.S. trade officials said the Philippines and
Thailand would be the main beneficiaries of the
president ’s action .
 Anything ’s possible -- how about the new Guinea
Fund ?
3
Tokenization
• The federal government suspended sales of
the U.S. savings bonds because Congress
has n’t lifted the ceiling on government debt .
• The Treasury said the U.S. will default on Nov.
9 if Congress does n’t act by then .
5
Tokenization
• Assets of the 400 taxable funds grew by $ 1.5
billion during the latest week .
• Exports in October stood at $ 5.29 billion , a
mere 0.7 % increase from a year earlier , while
imports increased sharply to $ 5.39 billion , up
20 % from last year .
• Do you notice any ambiguity in tokenization?
7
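The tokenized sentences on these slides can be approximated with a few rules. The following is a minimal sketch, not the Penn Treebank tokenizer itself: the abbreviation list and the function name `tokenize` are assumptions made for illustration, and curly apostrophes are normalized to straight ones.

```python
import re

# Hypothetical abbreviation list, just large enough for the slide examples.
ABBREVIATIONS = {"U.S.", "Nov."}

# Clitics that are split off as separate tokens, Penn Treebank style.
CLITIC = re.compile(r"^(.+?)(n't|'s|'re|'d|'ll|'ve|'m)$")


def tokenize(sentence):
    """Split a raw English sentence into Penn Treebank-style tokens."""
    tokens = []
    for chunk in sentence.replace("’", "'").split():
        # Peel off trailing punctuation, but leave known abbreviations intact.
        trailing = []
        while len(chunk) > 1 and chunk not in ABBREVIATIONS and chunk[-1] in ".,?!;:":
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # A leading "$" and a trailing "%" become tokens of their own.
        if len(chunk) > 1 and chunk.startswith("$"):
            tokens.append("$")
            chunk = chunk[1:]
        if len(chunk) > 1 and chunk.endswith("%"):
            trailing.insert(0, "%")
            chunk = chunk[:-1]
        # Split clitics: "didn't" -> "did" + "n't", "Mazda's" -> "Mazda" + "'s".
        match = CLITIC.match(chunk)
        if match:
            tokens.extend(match.groups())
        else:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(" ".join(tokenize("We didn't have much of a choice.")))
```

A sentence that ends in an abbreviation such as "U.S." keeps its final period attached under these rules, so the sketch cannot tell an abbreviation period from a sentence-final one.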
Exercise
• How many sentences in the WSJ corpus of
the Penn Treebank contain “’re”?
• How many sentences in the WSJ corpus of
the Penn Treebank contain “’d”?
9
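One way to approach the exercise, assuming the corpus is available as a list of already tokenized sentences with tokens separated by spaces. The three-sentence sample and the function name `count_sentences_with` are hypothetical stand-ins, not the real WSJ data.

```python
def count_sentences_with(token, sentences):
    """Count the sentences in which `token` occurs as a separate token."""
    return sum(1 for sentence in sentences if token in sentence.split())


# Hypothetical sample standing in for the tokenized corpus.
sample = [
    "We 're not sure .",
    "They 'd rather wait .",
    "You 're right , and we 're late .",
]

print(count_sentences_with("'re", sample))
print(count_sentences_with("'d", sample))
```

A sentence is counted once even if the token occurs in it several times, which matches the wording of the exercise.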
Big deal, you say
• The problem is pushed to the forefront for languages
like Chinese, where there are no delimiting spaces
between words
这句话里有几个词?
Howmanywordsarethereinthissentence?
10
Big deal, you say
• The problem is pushed to the forefront for languages like
Chinese, where there are no delimiting spaces between
words
这 句 话 里 有 几 个 词 ?
zhe ju hua li you ji ge ci
this CL sentence inside have [how many] CL word ?
How many words are there in this sentence ?
11
A much harder problem than it first
appears…
• Well, what if we just create a list of words (a
dictionary) and compare the sentence against
this list?
• 日文章鱼怎么说 ?
Dictionary entries: 日 “Sun”, 日文 “Japanese”, 文章 “article”, 章鱼 “octopus”,
鱼 “fish”, 怎么 “how”, 说 “say”
12
A much harder problem than it first
appears…
• Well, what if we just create a list of words (a dictionary) and
compare the sentence against this list?
• 日文 章鱼 怎么 说 ?
Japanese octopus how say
How do you say octopus in Japanese?
• 日 文章 鱼 怎么 说 ?
Sun article fish how say
???
13
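The dictionary-lookup idea can be tried directly. The sketch below enumerates every way of splitting the example string into entries of the dictionary given on the slide; the function name `segmentations` is an assumption for illustration, and nothing here ranks the candidates.

```python
# Dictionary entries taken from the slide.
DICTIONARY = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}


def segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


for words in segmentations("日文章鱼怎么说", DICTIONARY):
    print(" ".join(words))
```

Both readings shown on the slide come out as valid segmentations, so a dictionary alone does not decide between them.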
Computer problem vs human
problem
• Well that may be a problem for the computer
because the computer is dumb…
• Segmentation is difficult for humans as
well
 What is a word?
 Different criteria do not coincide
14
What if we let native speakers
follow their intuitions?
• Inadequate level of inter-annotator agreement
 Sproat, 1996: 70%
 Xue et al., 2005: 90%
• Conclusion: need a linguistic definition of
wordhood to develop segmentation standards
15
Packard’s (2000) notion of words
• Orthographic word: Words are defined by delimiters
in written text. This appears to have no relevance in
Chinese since there are no such written delimiters
• Sociological word: Following Chao (1968, pp. 136-138),
these are ‘that type of unit, intermediate in size
between a phoneme and a sentence, which the
general, non-linguistic public is conscious of, talks
about, has an everyday term for, and is practically
concerned with in various ways.’ In English this is
the lay notion of ‘word’, whereas in Chinese this is
the character (字zi).
16
Packard’s notions of word
• Lexical word: This corresponds to Di Sciullo
and Williams’s (1987) listeme
• Semantic word: Roughly speaking this
corresponds to a “unitary concept”.
• Phonological word: defined according to
phonological criteria. Is it a domain in which a
phonological process applies? Is it a prosodic
unit?
17
Packard’s notions of word
• Morphological word: following Di Sciullo and Williams
(1987), a morphological word is anything that is the
output of a word-formation rule
• Syntactic word: These are all and only the
constructions that occupy X0 in the syntax. Well first
you need to know what X0 is.
• Psycholinguistic word: this is the “ ‘word’ level of
linguistic analysis that is … salient and highly
relevant to the operation of the language processor”
18
Wordhood tests
• Phonological:
 Bound morpheme: a bound morpheme forms a
word with its neighboring morpheme
• Syntactic:
 Insertion: if another morpheme can be inserted
between X and Y, then X-Y is unlikely to be a word.
 XP-substitution: if a morpheme cannot be
replaced with an XP of the same type, then it is
likely to be part of a word
19
Wordhood tests
• Semantic
 If the meaning of X-Y is non-compositional, then it is a
word
• Others
 Productivity: if a rule that combines morpheme X and
morpheme Y is not productive, then X-Y is likely to be a
word
 Frequency of co-occurrence: if morphemes X and Y
co-occur frequently, then they form a word
20
Exercise
• Given the wordhood criteria and wordhood
tests we have discussed, how many words are
there in “can’t”?
21
Answer
• Orthographic word: 1
• Sociological word: ?
• Lexical word: 2
• Semantic word: 2
• Phonological word: 1
• Morphological word: 2
• Syntactic word: 2
• Psycholinguistic word: ?
22
Chinese morphological types
• Reduplication
• Affixation
• Compounding
• Proper names
• Abbreviations
23
Verbal reduplication
说说  shuo-shuo  speak-speak  “speak a little”
看看  kan-kan  look-look  “take a look”
走走  zou-zou  walk-walk  “take a walk”
磨磨  mo-mo  rub-rub  “rub a little”
讨论讨论  taolun-taolun  discuss-discuss  “discuss a little”
请教请教  qingjiao-qingjiao  ask-ask  “ask a little”
24
Verbal reduplication
说一说  shuo-yi-shuo  speak one speak  “speak a little”
看一看  kan-yi-kan  look one look  “take a look”
走一走  zou-yi-zou  walk one walk  “take a walk”
磨一磨  mo-yi-mo  rub one rub  “rub a little”
*讨论一讨论  taolun-yi-taolun  discuss-one-discuss
*请教一请教  qingjiao-yi-qingjiao  ask-one-ask
25
Adjectival reduplication
舒服 shufu
  舒舒服服  shushu-fufu  “comfortable”
  舒服舒服  shufu-shufu  “enjoy”
干净 ganjing
  干干净净  gangan-jingjing  “very clean”
  干净干净  ganjing-ganjing  “clean up”
糊涂 hutu
  糊糊涂涂  huhu-tutu  “muddleheaded”
  (?) 糊涂糊涂  hutu-hutu
快活 kuaihuo
  快快活活  kuaikuai-huohuo  “happy”
  快活快活  kuaihuo-kuaihuo  “make happy”
漂亮 piaoliang
  漂漂亮亮  piaopiao-liangliang  “pretty”
26
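The reduplicated forms on the preceding slides fall into a small number of character patterns. The sketch below names them AA, A-yi-A, ABAB and AABB; these labels and the function name `reduplication_pattern` are assumptions for illustration, not terminology from the slides. It only checks the shape of the string and says nothing about whether the form is grammatical.

```python
def reduplication_pattern(form):
    """Name the reduplication pattern of a Chinese string, or return None."""
    n = len(form)
    if n == 2 and form[0] == form[1]:
        return "AA"          # e.g. 说说
    if n == 3 and form[0] == form[2] and form[1] == "一":
        return "A-yi-A"      # e.g. 说一说
    if n == 4 and form[:2] == form[2:]:
        return "ABAB"        # e.g. 讨论讨论, 舒服舒服
    if n == 4 and form[0] == form[1] and form[2] == form[3]:
        return "AABB"        # e.g. 舒舒服服
    return None


for form in ["说说", "说一说", "讨论讨论", "舒舒服服", "舒服"]:
    print(form, reduplication_pattern(form))
```

In the examples, the verbal reading (“enjoy”, “clean up”) goes with ABAB and the adjectival reading (“comfortable”, “very clean”) with AABB, which is why distinguishing the two shapes matters for annotation.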
Prefixation
老  lao-  老王  lao-wang  “old Wang”
小  xiao-  小王  xiao-wang  “small Wang”
第  di-  第一  di yi  “first”
初  chu-  初三  chu san  “the third”
可  ke-  可爱  ke-ai  “cute”
27
Suffixation
学  -xue  心理学  xinli-xue  “psychology”
家  -jia  心理学家  xinli-xue-jia  “psychologist”
化  -hua  绿化  lv-hua  “greenize??”
率  -lv  录取率  luqu-lv  “enrollment rate”
主义  -zhuyi  马克思主义  makesi-zhuyi  “Marxism”
28
Compounding
Location:
客厅 沙发  keting-shafa  “living room sofa”
河 马  hema  “river horse (hippopotamus)”
海 狮  haishi  “sea lion (seal)”
Used for:
指甲 油  zhijia you  “nail polish”
乒乓 球  pingpang qiu  “ping-pong ball”
太阳眼镜  taiyang yanjing  “sunglasses”
Material:
大理石 地板  dalishi diban  “marble floor”
纸 老虎  zhilaohu  “paper tiger”
29
Resultative verb compounding
Result:
打破  dapo  “break by hitting”
拉开  lakai  “open by pulling”
Achievement:
写清楚  xieqingchu  “write clearly”
买到  maidao  “succeed in buying”
Direction:
跳过去  tiaoguoqu  “jump across”
走进来  zoujinlai  “come walking in”
30
Subject-Verb compounds
头疼  tou-teng (head hurt)  “have a headache”
嘴硬  zui-ying (mouth hard)  “stubborn”
眼红  yan-hong (eye red)  “covet”
心酸  xin-suan (heart sour)  “feel sad”
命苦  ming-ku (fate bitter)  “unlucky”
31
Subject-Verb compounds
我 的 头 很 疼
I DE head very hurt
“My head hurts badly.”
这 事 让 我 很 头疼
This matter make I very headache
“This gave me a real headache.”
32
Verb-object compounds
出版  chu-ban (emit edition)  “publish”
睡觉  shui-jiao (sleep sleep)  “sleep”
毕业  bi-ye (finish study)  “graduate”
开刀  kai-dao (operate knife)  “operate”
开玩笑  kai-wanxiao (make joke)  “make a joke”
照相  zhao-xiang (shine image)  “take a picture”
33
Verb-object compounds
别 开玩笑 !
Do-not joke
“Do not joke!”
开 他 的 玩笑 。
Make he DE joke
“Make fun of him.”
34
Let’s try one
她 很 担心 孩子 的 健康 成长
Test type      Test                Test result   Prediction
phonological   Bound morpheme      Yes?          One word
               Syllable count      yes           One word
syntactic      Insertion           no            One word
               XP substitution     no            One word
semantic       Non-compositional   yes           One word
others         Productive          N/A           N/A
               Frequency           N/A           N/A
35
But …
• 担心: 她 为 孩子 担 心
Test type      Test                      Test result   Prediction
phonological   Bound morphemes?          no            Two words
               Syllable count            yes           One word
syntactic      Insertion                 yes           Two words
               XP substitution           yes           Two words
semantic       Non-compositional?        yes           One word
others         Productive?               N/A           N/A
               Frequent co-occurrence?   N/A           N/A
36
Summary
• Wordhood has to be decided in context
• When wordhood tests lead to conflicting
predictions, decisions will have to be made
based on what the annotated corpus is for.
37
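The conflict between tests can be made explicit by recording each test's prediction and tallying them. This is only a bookkeeping sketch with hypothetical names (`PREDICTIONS`, `tally`, `in_conflict`); the values are the predictions listed for 担心 in 她 为 孩子 担 心, and the tally is not meant as a voting procedure, since the resolution depends on the purpose of the corpus.

```python
from collections import Counter

# Predictions of each applicable wordhood test for 担心 in 她 为 孩子 担 心.
PREDICTIONS = {
    "bound morpheme": "two words",
    "syllable count": "one word",
    "insertion": "two words",
    "XP substitution": "two words",
    "non-compositional": "one word",
}


def tally(predictions):
    """Count how many tests support each prediction."""
    return Counter(predictions.values())


def in_conflict(predictions):
    """Return True if the tests do not all make the same prediction."""
    return len(set(predictions.values())) > 1


print(tally(PREDICTIONS))
print(in_conflict(PREDICTIONS))
```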
Discussion question
• Based on the wordhood criteria we have discussed, is
“make headway” one word or two words?
38
POS-tagging: throwing words into
different buckets…
• Each category is a bucket
• How many buckets are there?
 Noun
 Verb
 Adjective
 Preposition
 Adverb
• Which bucket should “five”, “the”, and “$”
go in?
39
Penn Treebank Tagsets (buckets)
• CC - coordinating conjunction: and, but
• CD - cardinal number: one, two, three
• DT - determiner: a, the, this, that
• EX - existential there
• FW - foreign word
• IN - preposition or subordinating conjunction
• LS - list item marker: firstly, secondly
• TO - to
• UH - interjection: uh, oh
40
CC or DT
• Neither/?? he nor/CC she likes skiing.
• Neither/?? man likes skiing.
• Either/?? Jean or/CC Mary likes singing.
• Either/?? girl likes singing.
• Both/?? Jack and/CC Tom hate singing.
• Both/?? men hate singing.
41
CC or DT
• Neither/CC he nor/CC she likes skiing.
• Neither/DT man likes skiing.
• Either/CC Jean or/CC Mary likes singing.
• Either/DT girl likes singing.
• Both/CC Jack and/CC Tom hate singing.
• Both/DT men hate singing.
42
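The pattern in these answers can be written as a one-rule heuristic: tag the word CC when its partner conjunction appears later in the sentence, and DT otherwise. This is a sketch under that assumption, with hypothetical names (`PARTNERS`, `tag_correlative`); it is not a rule taken from the tagging guidelines.

```python
# Partner conjunctions for each word; "or" is accepted after "neither" as well.
PARTNERS = {
    "either": {"or"},
    "neither": {"nor", "or"},
    "both": {"and"},
}


def tag_correlative(tokens, index):
    """Tag tokens[index] (either/neither/both) as CC or DT."""
    partners = PARTNERS[tokens[index].lower()]
    following = {token.lower() for token in tokens[index + 1:]}
    # CC if a partner conjunction follows anywhere in the sentence.
    return "CC" if partners & following else "DT"


print(tag_correlative("Either Jean or Mary likes singing .".split(), 0))
print(tag_correlative("Both men hate singing .".split(), 0))
```

The rule looks at words only, so it would wrongly return CC for a sentence such as "Both men like singing and dancing", where "and" joins a later phrase.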
CD or NN
• One/?? of the best reasons
• The only one/?? of its kind
• The only ones/?? of its kind
43
CD or NN
• One/CD of the best reasons
• The only one/NN of its kind
• The only ones/NNS of its kind
44
EX or RB
• There/?? was a party in progress.
• There/?? ensued a melee.
• There/?? , a party was in progress.
• There/?? , ensued a melee.
45
EX or RB
• There/EX was a party in progress.
• There/EX ensued a melee.
• There/RB , a party was in progress.
• There/RB , ensued a melee.
46
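The four examples differ only in what immediately follows "there". A minimal sketch of that observation, assuming a small hypothetical verb list (`VERBS`) that covers just these sentences:

```python
# Hypothetical verb list, only large enough for the slide examples.
VERBS = {"is", "are", "was", "were", "ensued"}


def tag_there(tokens, index):
    """Tag tokens[index] ("there") as EX if a verb follows directly, else RB."""
    following = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    return "EX" if following in VERBS else "RB"


print(tag_there("There was a party in progress .".split(), 0))
print(tag_there("There , a party was in progress .".split(), 0))
```

Looking one token ahead is enough for these sentences, but a real tagger would need the verb's category rather than a word list.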
The role of context in POS tagging
• Can we take a list of all the words in a
language, and decide which bucket each word
should go in, without looking at the context in
which the word occurs?
• water, can, drops
47
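A context-free word list runs into trouble as soon as a word fits more than one bucket. The sketch below uses a hypothetical mini-lexicon (`LEXICON`) that maps the three example words, plus "the" for contrast, to Penn Treebank tags they can take, and lists the words that a lookup alone cannot decide.

```python
# Hypothetical mini-lexicon: each word maps to the tags it can take.
LEXICON = {
    "water": {"NN", "VB"},
    "can": {"MD", "NN", "VB"},
    "drops": {"NNS", "VBZ"},
    "the": {"DT"},
}


def ambiguous_words(lexicon):
    """Return the words that belong to more than one bucket."""
    return sorted(word for word, tags in lexicon.items() if len(tags) > 1)


print(ambiguous_words(LEXICON))
```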
Categorizing context
• Morphological
• Syntactic
• Semantic
48
Morphological context
• Inflectional morphology
Verb: destroy, destroying, destroyed
Noun: destruction, destructions
He watered the plant.
• Derivational morphology
Noun: destruction
49
Syntactic context
• Verb: The bomb destroyed the building.
He decided to water the plant.
• Noun: The destruction of the building
50
Semantic context
• Verb: action, activity
• Noun: state, object, etc.
51
What do we have in Chinese?
• Morphological clues: not as much
• Syntactic clues: not as rich, but exist
• Semantic clues: About the same
52
Syntactic clues
• Impoverished, but exist:
这 座 大楼 的 倒塌
this CL building DE collapse
“the collapse of this building”
这 座 大楼 看起来 要 倒塌
This CL building seem will collapse
“It looks like this building will collapse.”
53
Semantic clues
• Same as English:
 Noun: state, object, etc.
 Verb: action, activity, etc.
54
When syntactic and semantic clues
are in conflict
这 座 大楼 的 倒塌
this CL building DE collapse
“the collapse of this building”
Option 1: 倒塌 is a verb regardless of its context
Option 2: 倒塌 can be a noun or a verb depending
on its context
The Chinese Treebank decision: option 2
 POS tags based on syntactic clues encode not only a word's own
lexical properties, but also information provided by its
context
 “context-free” POS tags are no better than a dictionary
55
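Option 2 can be illustrated with a single contextual rule: a word that directly follows the particle 的 is tagged as a noun (NN), otherwise as a verb (VV). This is a sketch of the idea only, with a hypothetical function name (`tag_in_context`); it is not the Chinese Treebank tagging procedure, and it is meant to be applied to words like 倒塌 that are verbal out of context.

```python
def tag_in_context(tokens, index):
    """Tag tokens[index] as NN if it directly follows 的, otherwise as VV."""
    if index > 0 and tokens[index - 1] == "的":
        return "NN"
    return "VV"


nominal = ["这", "座", "大楼", "的", "倒塌"]
verbal = ["这", "座", "大楼", "看起来", "要", "倒塌"]

print(tag_in_context(nominal, 4))
print(tag_in_context(verbal, 5))
```

Under option 1 the function would ignore `tokens` entirely and always return VV, which is the sense in which a context-free tag adds nothing beyond a dictionary entry.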
Online references
• Chinese Treebank:
www.cis.upenn.edu/~chinese
• Sproat, Richard. 2002. Coling tutorial:
www.linguistics.uiuc.edu/rws
• Penn Treebank:
www.cis.upenn.edu/~treebank/home.html
56