Linguistic annotation
2/14/2006
Nianwen Xue

Outline
• Tokenization / segmentation, POS tagging
• Treebanking
  Constituent structure and structural ambiguity
  Basic grammatical relations and how argument structure is instantiated
• Propbanking / nombanking
  Cross-linguistic syntactic alternations, verb senses and argument structure
• Others: named entity, coreference, discourse connectives

Tokenization
• English
  In the new position he will oversee Mazda ’s U.S. sales , services , parts and marketing operations .
  We did n’t have much of a choice .
  U.S. trade officials said the Philippines and Thailand would be the main beneficiaries of the president ’s action .
  Anything ’s possible -- how about the new Guinea Fund ?

Tokenization
• The federal government suspended sales of the U.S. savings bonds because Congress has n’t lifted the ceiling on government debt .
• The Treasury said the U.S. will default on Nov. 9 if Congress does n’t act by then .

Tokenization
• Assets of the 400 taxable funds grew by $ 1.5 billion during the latest week .
• Exports in October stood $ 5.29 billion , a mere 0.7 % increase from a year earlier , while imports increased sharply to $ 5.39 billion , up 20 % from last year .
• Do you notice any ambiguity in tokenization?
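The tokenized examples above separate clitics such as n’t and ’s, currency and percent signs, and sentence-final punctuation, while keeping abbreviations like U.S. and Nov. intact. A minimal Python sketch of rules of this kind follows. The function name `tokenize`, the tiny `ABBREVIATIONS` set and the regular expressions are illustrative assumptions, not the actual Penn Treebank tokenizer.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (assumption for this sketch).
ABBREVIATIONS = {"U.S.", "Nov."}

def tokenize(sentence):
    """Split a sentence into tokens using a few Treebank-like rules."""
    tokens = []
    for chunk in sentence.split():
        # Split off the negative clitic: "didn't" -> "did n't", "can't" -> "ca n't".
        chunk = re.sub(r"n['’]t\b", " n't", chunk, flags=re.IGNORECASE)
        # Split off other clitics ('s, 're, 'd, ...), normalizing to a straight apostrophe.
        chunk = re.sub(r"['’](s|re|d|ll|ve|m)\b", r" '\1", chunk, flags=re.IGNORECASE)
        # Separate currency and percent signs from the number.
        chunk = re.sub(r"([$%])", r" \1 ", chunk)
        for piece in chunk.split():
            # Detach one trailing punctuation mark unless the piece is a known abbreviation.
            if piece not in ABBREVIATIONS and len(piece) > 1 and piece[-1] in ".,?!;:":
                tokens.extend([piece[:-1], piece[-1]])
            else:
                tokens.append(piece)
    return tokens

print(tokenize("We didn't have much of a choice."))
print(tokenize("Assets grew by $1.5 billion."))
```

The abbreviation list is where one ambiguity hides: a sentence that ends in U.S. needs the period both as part of the abbreviation and as the sentence boundary, and this sketch simply leaves it attached to the abbreviation.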
Exercise
• How many sentences in the WSJ corpus of the Penn Treebank contain “’re”?
• How many sentences in the WSJ corpus of the Penn Treebank contain “’d”?

Big deal, you say
• The problem is pushed to the forefront for languages like Chinese, where there are no delimiting spaces between words
  这句话里有几个词?
  Howmanywordsarethereinthissentence?

  zhe ju hua li you ji ge ci
  这 句 话 里 有 几 个 词 ?
  this CL sentence inside have [how many] CL word ?
  How many words are there in this sentence ?

A much harder problem than it first appears…
• Well, what if we just create a list of words (a dictionary) and compare the sentence against this list?
• 日文章鱼怎么说 ?
  Dictionary entries: 日 “Sun”, 日文 “Japanese”, 文章 “article”, 章鱼 “octopus”, 鱼 “fish”, 怎么 “how”, 说 “say”
• 日文 章鱼 怎么 说 ?
  Japanese octopus how say
  How do you say octopus in Japanese?
• 日 文章 鱼 怎么 说 ?
  Sun article fish how say
  ???

Computer problem vs human problem
• Well, that may be a problem for the computer because the computer is dumb…
• Segmentation is difficult for humans as well
  What is a word? Different criteria do not coincide

What if we let native speakers follow their intuitions?
• Inadequate level of inter-annotator agreement
  Sproat, 1996: 70%
  Xue et al., 2005: 90%
• Conclusion: need a linguistic definition of wordhood to develop segmentation standards

Packard’s (2000) notion of words
• Orthographic word: Words are defined by delimiters in written text.
  This appears to have no relevance in Chinese since there are no such written delimiters
• Sociological word: Following Chao (1968, pp. 136-138), these are ‘that type of unit, intermediate in size between a phoneme and a sentence, which the general, non-linguistic public is conscious of, talks about, has an everyday term for, and is practically concerned with in various ways.’ In English this is the lay notion of ‘word’, whereas in Chinese this is the character (字 zi).

Packard’s notions of word
• Lexical word: This corresponds to Di Sciullo and Williams’s (1987) listeme
• Semantic word: Roughly speaking, this corresponds to a “unitary concept”.
• Phonological word: defined according to phonological criteria. Is it a domain in which a phonological process applies? Is it a prosodic unit?

Packard’s notions of word
• Morphological word: following Di Sciullo and Williams (1987), a morphological word is anything that is the output of a word-formation rule
• Syntactic word: These are all and only the constructions that occupy X⁰ in the syntax. Well, first you need to know what X⁰ is.
• Psycholinguistic word: this is the “ ‘word’ level of linguistic analysis that is … salient and highly relevant to the operation of the language processor”

Wordhood tests
• Phonological:
  Bound morpheme: a bound morpheme forms a word with its neighboring morpheme
• Syntactic:
  Insertion: if another morpheme can be inserted between X and Y, then X-Y is unlikely to be a word.
  XP-substitution: if a morpheme cannot be replaced with an XP of the same type, then it is likely to be a word

Wordhood tests
• Semantic:
  If the meaning of X-Y is non-compositional, then it is a word
• Others:
  Productivity: if a rule that combines morpheme X and morpheme Y is not productive, then X-Y is likely to be a word
  Frequency of co-occurrence: if morphemes X and Y co-occur frequently, then they form a word

Exercise
• Given the wordhood criteria and wordhood tests we have discussed, how many words are there in “can’t”?
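The dictionary-lookup approach to segmentation raised earlier for 日文章鱼怎么说 can be made concrete with a short sketch. It enumerates every way of splitting a string into words that all occur in a word list, and shows that lookup alone leaves more than one analysis. The function name `segmentations` and the variable `DICTIONARY` are illustrative; the entries are the ones listed on the slide.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into words that all occur in `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

# Dictionary entries from the slide.
DICTIONARY = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}

for seg in segmentations("日文章鱼怎么说", DICTIONARY):
    print(" ".join(seg))
```

Both the intended reading (日文 章鱼 怎么 说) and the nonsensical one (日 文章 鱼 怎么 说) come out as valid dictionary segmentations, so choosing between them needs information beyond the word list.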
Answer
• Orthographical word: 1
• Sociological word: ?
• Lexical word: 2
• Semantic word: 2
• Phonological word: 1
• Morphological word: 2
• Syntactic word: 2
• Psycholinguistic word: ?

Chinese morphological types
• Reduplication
• Affixation
• Compounding
• Proper names
• Abbreviations

Verbal reduplication
  说说        shuo-shuo            speak-speak        “speak a little”
  看看        kan-kan              look-look          “take a look”
  走走        zou-zou              walk-walk          “take a walk”
  磨磨        mo-mo                rub-rub            “rub a little”
  讨论讨论    taolun-taolun        discuss-discuss    “discuss a little”
  请教请教    qingjiao-qingjiao    ask-ask            “ask a little”

Verbal reduplication
  说一说        shuo-yi-shuo            speak one speak        “speak a little”
  看一看        kan-yi-kan              look one look          “take a look”
  走一走        zou-yi-zou              walk one walk          “take a walk”
  磨一磨        mo-yi-mo                rub one rub            “rub a little”
  *讨论一讨论   taolun-yi-taolun        discuss-one-discuss
  *请教一请教   qingjiao-yi-qingjiao    ask-one-ask

Adjectival reduplication
  舒服 shufu        舒舒服服 shushu-fufu “comfortable”      舒服舒服 shufu-shufu “enjoy”
  干净 ganjing      干干净净 gangan-jingjing “very clean”   干净干净 ganjing-ganjing “clean up”
  糊涂 hutu         糊糊涂涂 huhu-tutu “muddleheaded”       (?) 糊涂糊涂 hutu-hutu
  快活 kuaihuo      快快活活 kuaikuai-huohuo “happy”        快活快活 kuaihuo-kuaihuo “make happy”
  漂亮 piaoliang    漂漂亮亮 piaopiao-liangliang “pretty”

Prefixation
  老 lao-     老王    lao-wang     “old Wang”
  小 xiao-    小王    xiao-wang    “small Wang”
  第 di-      第一    di yi        “first”
  初 chu-     初三    chu san      “the third”
  可 ke-      可爱    ke-ai        “cute”

Suffixation
  学 -xue        心理学        xinli-xue        “psychology”
  家 -jia        心理学家      xinli-xue-jia    “psychologist”
  化 -hua        绿化          lv-hua           “greenize??”
  率 -lv         录取率        luqu-lv          “enrollment rate”
  主义 -zhuyi    马克思主义    makesi-zhuyi     “Marxism”

Compounding
• Location:
  客厅 沙发    keting-shafa    “living room sofa”
  河 马        hema            “river horse (hippopotamus)”
  海 狮        haishi          “sea lion (seal)”
• Used for:
  指甲 油      zhijia you         “nail polish”
  乒乓 球      pingpang qiu       “ping-pong ball”
  太阳眼镜     taiyang yanjing    “sunglasses”
• Material:
  大理石 地板  dalishi diban    “marble floor”
  纸 老虎      zhilaohu         “paper tiger”

Resultative verb compounding
• Result:
  打破 dapo “break by hitting”
  拉开 lakai “open by pulling”
• Achievement:
  写清楚 xieqingchu “write clearly”
  买到 maidao “succeed in buying”
• Direction:
  跳过去 tiaoguoqu “jump across”
  走进来 zoujinlai “come walking in”

Subject-Verb compounds
  头疼    tou-teng    (head hurt)      “have a headache”
  嘴硬    zui-ying    (mouth hard)     “stubborn”
  眼红    yan-hong    (eye red)        “covet”
  心酸    xin-suan    (heart sour)     “feel sad”
  命苦    ming-ku     (fate bitter)    “unlucky”

Subject-Verb compounds
  我 的 头 很 疼
  I DE head very hurt
  “My head hurts badly.”

  这 事 让 我 很 头疼
  this matter make I very headache
  “This gave me a real headache.”

Verb-object compounds
  出版      chu-ban        (emit edition)     “publish”
  睡觉      shui-jiao      (sleep sleep)      “sleep”
  毕业      bi-ye          (finish study)     “graduate”
  开刀      kai-dao        (operate knife)    “operate”
  开玩笑    kai-wanxiao    (make joke)        “make a joke”
  照相      zhao-xiang     (shine image)      “take a picture”

Verb-object compounds
  别 开玩笑 !
  do-not joke
  “Do not joke!”

  开 他 的 玩笑。
  make he DE joke
  “Make fun of him.”

Let’s try one
她 很 担心 孩子 的 健康 成长

  Test type       Test                 Test result    Prediction
  phonological    Bound morpheme       Yes?           One word
                  Syllable count       yes            One word
  syntactic       Insertion            no             One word
                  XP substitution      no             One word
  semantic        Non-compositional    yes            One word
  others          Productive           N/A            N/A
                  Frequency            N/A            N/A

But …
• 担心: 她 为 孩子 担 心

  Test type       Test                       Test result    Prediction
  phonological    Bound morphemes?           no             Two words
                  Syllable count             yes            One word
  syntactic       Insertion                  yes            Two words
                  XP substitution            yes            Two words
  semantic        Non-compositional?         yes            One word
  others          Productive?                N/A            N/A
                  Frequent co-occurrence?    N/A            N/A

Summary
• Wordhood has to be decided in context
• When wordhood tests lead to conflicting predictions, decisions will have to be made based on what the annotated corpus is for.

Discussion question
• Based on the wordhood criteria we have discussed, is “make headway” one word or two words?

POS-tagging: throwing words into different buckets…
• Each category is a bucket
• How many buckets are there?
  Noun
  Verb
  Adjective
  Preposition
  Adverb
• Which bucket should “five”, “the”, “$” go into?

Penn Treebank Tagsets (buckets)
• CC - coordinating conjunction: and, but
• CD - cardinal number: one, two, three
• DT - determiner: a, the, this, that
• EX - existential there
• FW - foreign word
• IN - preposition or subordinate conjunction
• LS - list marker: firstly, secondly
• TO - to
• UH - interjection: uh, oh

CC or DT
• Neither/?? he nor/CC she likes skiing .
• Neither/?? man likes skiing .
• Either/?? Jean or/CC Mary likes singing .
• Either/?? girl likes singing .
• Both/?? Jack and/CC Tom hate singing .
• Both/?? men hate singing .

CC or DT
• Neither/CC he nor/CC she likes skiing .
• Neither/DT man likes skiing .
• Either/CC Jean or/CC Mary likes singing .
• Either/DT girl likes singing .
• Both/CC Jack and/CC Tom hate singing .
• Both/DT men hate singing .

CD or NN
• One/?? of the best reasons
• The only one/?? of its kind
• The only ones/?? of its kind

CD or NN
• One/CD of the best reasons
• The only one/NN of its kind
• The only ones/NN of its kind

EX or RB
• There/?? was a party in progress.
• There/?? ensued a melee.
• There/?? , a party was in progress.
• There/?? , ensued a melee.

EX or RB
• There/EX was a party in progress.
• There/EX ensued a melee.
• There/RB , a party was in progress.
• There/RB , ensued a melee.

The role of context in POS tagging
• Can we take a list of all the words in a language and decide which bucket each word should go into, without looking at the context in which the word occurs?
• Water, can, drops

Categorizing context
• Morphological
• Syntactic
• Semantic

Morphological context
• Inflectional morphology
  Verb: destroy, destroying, destroyed
  Noun: destruction, destructions
  He watered the plant.
• Derivational morphology
  Noun: destruction

Syntactic context
• Verb:
  The bomb destroyed the building.
  He decided to water the plant.
• Noun:
  The destruction of the building

Semantic context
• Verb: action, activity
• Noun: state, object, etc.

What do we have in Chinese?
• Morphological clues: not as much
• Syntactic clues: not as rich, but exist
• Semantic clues: about the same

Syntactic clues
• Impoverished, but exist:
  这 座 大楼 的 倒塌
  this CL building DE collapse
  “the collapse of this building”

  这 座 大楼 看起来 要 倒塌
  this CL building seem will collapse
  “It looks like this building will collapse.”

Semantic clues
• Same as English:
  Noun: state, object, etc.
  Verb: action, activity, etc.

When syntactic and semantic clues are in conflict
  这 座 大楼 的 倒塌
  this CL building DE collapse
  “the collapse of this building”
• Option 1: 倒塌 is a verb regardless of its context
• Option 2: 倒塌 can be a noun or a verb depending on its context
• The Chinese Treebank decision: option 2
  POS tags based on syntactic clues encode not only a word’s own lexical properties, but also information provided by its context
  “Context-free” POS tags are no better than a dictionary

Online references
• Chinese Treebank: www.cis.upenn.edu/~chinese
• Sproat, Richard. 2002. Coling tutorial: www.linguistics.uiuc.edu/rws
• Penn Treebank: www.cis.upenn.edu/~treebank/home.html
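As a closing illustration of the point about context in POS tagging (Water, can, drops), the sketch below contrasts a context-free word list with a lookup that also sees the tag of the preceding word. The lexicon, the two narrowing rules and the name `candidate_tags` are assumptions made for this example, not a real tagger; only the tag names come from the Penn Treebank tagset.

```python
# Hypothetical context-free lexicon: every tag a word can take, ignoring context.
LEXICON = {
    "water": {"NN", "VB"},
    "can": {"MD", "NN", "VB"},
    "drops": {"NNS", "VBZ"},
}

NOUN_TAGS = {"NN", "NNS"}

def candidate_tags(word, prev_tag=None):
    """Narrow a word's possible tags using the tag of the preceding word."""
    tags = LEXICON.get(word.lower(), set())
    if prev_tag == "DT":             # after a determiner, expect a noun
        narrowed = tags & NOUN_TAGS
    elif prev_tag in {"TO", "MD"}:   # after "to" or a modal, expect a base-form verb
        narrowed = tags & {"VB"}
    else:                            # no usable context: every tag remains possible
        narrowed = tags
    return narrowed or tags          # fall back if a rule removed every candidate

print(candidate_tags("water"))         # word list alone: still ambiguous
print(candidate_tags("water", "TO"))   # as in "He decided to water the plant."
print(candidate_tags("water", "DT"))   # as in "the water"
```

Without a preceding tag the lookup can only return the full set of possibilities, which is the sense in which a context-free tag is no better than a dictionary entry.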