© ISO 2009 – All rights reserved

ISO TC 37/SC 4 N 482 rev02
Date: 2009-10-15
ISO/CD 24614-2
ISO TC 37/SC 4/WG 2
Secretariat: KATS

Language resource management — Word segmentation of text — Part 2: Word segmentation for Chinese, Japanese and Korean

Gestion des resource des langues — Segmentation de texte — Partie 2: Segmentation des mots pour Chinois, Japonais et Koréan

Warning

This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard.

Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

Document type: International Standard
Document subtype:
Document stage: (30) Committee
Document language: E
STD Version 2.1c2

Copyright notice

This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards development process is permitted without prior permission from ISO, neither this document nor any extract from it may be reproduced, stored or transmitted in any form for any other purpose without prior written permission from ISO.

Requests for permission to reproduce this document for the purpose of selling it should be addressed as shown below or to ISO's member body in the country of the requester:

[Indicate the full address, telephone number, fax number, telex number, and electronic mail address, as appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or SC within the framework of which the working document has been prepared.]

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.

Violators may be prosecuted.
Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Overview – What is Word Segmentation Unit in CJK, Why is necessary, What are different from Other languages
5 Chinese word segmentation
5.1 General rules for identifying WSUs in Chinese text
5.1.1 Punctuation and white space
5.1.2 Word
5.1.3 Derivation
5.1.4 Phrasal compound
5.1.5 Idiom
5.1.6 Idiomatic expression, proverb and familiar quotation
5.1.7 Abbreviation
5.1.8 Suffixation of the nonsyllabic 儿(r)
5.1.9 Transliterated loanword
5.1.10 Non-Chinese-character strings
5.1.11 Internal structure of WSUs
5.2 Typology of WSUs in Chinese
5.2.1 Noun
5.2.2 Verb
5.2.3 Adjective
5.2.4 Pronoun
5.2.5 Numeral
5.2.6 Measure word
5.2.7 Adverb
5.2.8 Preposition
5.2.9 Conjunction
5.2.10 Auxiliary word
5.2.11 Modal word
5.2.12 Exclamation word
5.2.13 Imitative word
6 Japanese word segmentation
6.1 General rules for identifying WSUs in Japanese text
6.1.1 Punctuation
6.1.2 Noun
6.1.3 Verbs
6.1.4 Adjectives
6.1.5 Adnominal nouns
6.1.6 Adverbs
6.1.7 Conjunctions
6.1.8 Exclamations
6.1.9 Particles
6.1.10 Auxiliary verbs
6.1.11 Idioms and proverbs
6.1.12 Abbreviations
6.2 Typology of WSUs in Japanese
6.2.1 Nouns (名詞; Meishi)
6.2.2 Verbs (動詞; Doushi)
6.2.3 Adjectives (形容詞/形容動詞; Keiyoushi/Keiyoudoushi)
6.2.4 Adnominal nouns (連体詞; Rentaishi)
6.2.5 Adverbs (副詞; Fukushi)
6.2.6 Conjunctions (接続詞; Setsuzokushi)
6.2.7 Exclamations (感動詞; Kandoushi)
6.2.8 Particles (助詞; Joshi)
6.2.9 Auxiliary Verbs (助動詞; Jodoushi)
7 Korean word segmentation
7.1 Typology of word segmentation units in Korea
7.1.1 Punctuation and white space
7.1.2 Word
7.1.3 Multi-word expression
7.1.4 Non-Korean-character strings
7.2 Typology of WSUs in Korean
7.2.1 Noun
7.2.2 Pronoun
7.2.3 Numeral
7.2.4 Verb
7.2.5 Adjective
7.2.6 Adnoun
7.2.7 Adverb
7.2.8 Exclamation
7.2.9 Particle

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24614-2 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.

This second/third/... edition cancels and replaces the first/second/... edition (), [clause(s) / subclause(s) / table(s) / figure(s) / annex(es)] of which [has / have] been technically revised.
ISO 24614 consists of the following parts, under the general title Language resource management — Word segmentation of text:

Part 1: Basic concepts and general principles
Part 2: Word segmentation for Chinese, Japanese and Korean

Introduction

Word segmentation remains a challenging task in natural language processing for languages in which word boundaries cannot be fully identified from typographic properties (such as spaces in English), for example Chinese, Japanese, Korean, Thai, Vietnamese and Mongolian. Part 2 focuses on word segmentation for Chinese, Japanese and Korean.

These three languages are similar in some respects and different in others. All three use Chinese characters: each has many nouns written in Chinese characters, especially two-character nouns such as “討論 (discussion)” and “同意 (agreement)”. Typographically, Chinese and Japanese text contains no spaces, while Korean text consists of fragments (Eojeols) separated by spaces. Typologically, Chinese is an isolating language, whereas Japanese and Korean are agglutinative languages: a noun can be followed by a series of particles, and a verb can be used with several endings.

e.g. (Korean) “깨/뜨리/시/었/겠/군/요” (break [+emphasis] [+polite] [+past] [+conjectural] final ending [+polite]), “학교/에서/부터/는” (as for 'from at school')

e.g. (Japanese) “学校へ (to school)” (学校/へ, school [+ particle]), “行きました (went)” (行き/まし/た, go [+ auxiliary verb (polite)] [+ auxiliary verb (past)])

Because these three languages share similarities in words composed of Chinese characters, the general rules for identifying word segmentation units (WSUs) in Chinese text can also be applied, to some extent, to the processing of Japanese and Korean.

In practice, there is considerable concern about what the correct outcome of word segmentation applied to a text should be.
Standards are needed to make word segmentation as consistent as possible within and across texts, so as to meet the requirements of a variety of applications in language information processing, both monolingual and multilingual. Applications of the standards include, but are not limited to, natural language processing, information retrieval, search engines, question answering, machine translation and machine-aided translation, pre-processing for text-to-speech, post-processing of speech recognition, OCR and other character input methods, proofreading, digital libraries, terminology and ontology, the semantic web, eBusiness and eCommerce, content management, and natural-language-based computer-aided eLearning (including language learning and second language learning). They should also be helpful for orthographic processing (Romanization) of text in some languages, such as Chinese.

COMMITTEE DRAFT ISO/CD 24614-2

Language resource management — Word segmentation of text — Part 2: Word segmentation for Chinese, Japanese and Korean

1 Scope

This part of ISO 24614 applies the word segmentation principles of Part 1 to Chinese, Japanese and Korean. The application of these principles is standardized so that the units used in later syntactic processing can be recognized.

ISO has linguistic annotation standards developed in ISO/TC 37/SC 4: MAF (morpho-syntactic annotation framework), SynAF (syntactic annotation framework), and others. These standards describe annotation methods, but they do not define the meaningful units of word segmentation; MAF and SynAF annotate each linguistic layer in a standardized way for the sake of interoperability. The word segmentation standard recommends which linguistic units should be registered in a lexicon, and which types of word sequence, called “word segmentation units (WSUs)”, should be recognized before syntactic processing.
In the context of multilingual word segmentation, if a word sequence forms one WSU in one language, this is an indication that the corresponding sequence may be recognized as a WSU in other languages.

2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 24614-1, Language resource management — Word segmentation of text — Part 1: Basic concepts and general principles

3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 24614-1 and the following apply.

3.1
phrase
component of a sentence that carries a grammatical function

3.2
bunsetsu
phrase (3.1) in Japanese text without modifying expressions

EXAMPLE The sentence “私は学校へ早く行きました (I went to school early)。” consists of four Bunsetsu: 私は (watashiwa), 学校へ (gakkoue), 早く (hayaku) and 行きました (ikimashita). “私 (watashi)” is a pronoun, “は (wa)” is a particle, “学校 (gakkou)” is a noun, “へ (e)” is a particle, “早く (hayaku)” is an adjective in adverbial usage, “行き (iki)” is a verbal stem followed by “まし (mashi)”, which is an auxiliary verb expressing politeness, and “た (ta)” is an auxiliary verb expressing past tense.

NOTE A Bunsetsu normally consists of a noun plus its particle(s) or a verb plus its ending(s), auxiliary verb(s), and particle(s) as shown in the example above.

3.3
eojeol
phrase (3.1) in Korean text without modifying expressions separated by white space

EXAMPLE In the sentence “나는 학교에 일찍 갔다 (I went to school early)”, “나 (I)” is a pronoun, “는” is a particle, “학교 (hakgyo; school)” is a noun, “에” is a particle, “일찍 (early)” is an adverb, and “가 (go)” is a verbal stem followed by the endings “았” and “다”. The sentence contains four Eojeols: “나는 (naneun)”, “학교에 (hakgyoe)”, “일찍 (iljjik)”, and “갔다 (gatta)”.
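As a rough illustration of the Eojeol example above, the following Python sketch splits the sentence on white space and pairs each Eojeol with the morphemes named in the example. The variable names and the hand-written morpheme table are assumptions made for illustration only; they are not part of this document, and no real morphological analysis is attempted.

```python
# Illustrative sketch only: the names and the morpheme table are assumptions,
# copied by hand from the EXAMPLE in 3.3; no real analyser is involved.
sentence = "나는 학교에 일찍 갔다"

# Eojeols are the white-space-separated fragments of Korean text.
eojeols = sentence.split()

# Morphemes of each Eojeol as stated in the example (갔다 = 가 + 았 + 다).
morphemes = {
    "나는": ["나", "는"],
    "학교에": ["학교", "에"],
    "일찍": ["일찍"],
    "갔다": ["가", "았", "다"],
}

for eojeol in eojeols:
    print(eojeol, "->", "/".join(morphemes[eojeol]))
```

White-space splitting recovers the Eojeols only; the units inside each Eojeol still have to be identified by the rules of Clause 7.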
NOTE1 An Eojeol normally consists of a noun plus its particle(s) or a verb plus its ending(s), auxiliary verb(s), and particle(s) as shown in the example above.

NOTE2 An Eojeol is also called a ‘word phrase’. An Eojeol (word phrase) consists of one or more word forms. An auxiliary word can be concatenated to the word unit that precedes it. For example, 살아있다 (to keep alive) is composed of two word forms: 살아 (to live) and 있다 (keep).

3.4
particle
part of speech in Japanese and Korean that represents a grammatical function or a minor meaning

EXAMPLE 1 A Japanese particle is not used independently. A word followed by a particle can constitute a Bunsetsu. A particle functions as a case marker, expresses a correlation with another phrase, adds some minor meaning, and so on. In behaviour, it attaches to words and, like a suffix, has no inflectional ending; however, it is not a suffix but a part of speech in its own right. A Japanese particle can attach not only to a word but also to a clause or even a sentence. For example, “寒いね?” means “It is very cold, isn’t it?”; here the Japanese particle “ね (ne)” corresponds to “isn’t it?”.

EXAMPLE 2 A particle is not used independently. A word followed by a particle can constitute a Bunsetsu or an Eojeol. A particle can be attached not only to a word but also to a clause or even a sentence. For example, “寒いね?” in Japanese and “매우 춥지요?” in Korean mean “It is very cold, isn’t it?”; here the Japanese particle “ね (ne)” and the Korean particle “요 (yo)” correspond to “isn’t it?”.

NOTE In North Korean grammar, a particle is treated as an affix that freely agglutinates after a nominal and performs a grammatical function. Particles and endings are collectively called ‘토 (Tho)’ in North Korean grammar.

3.5
ending
agglutinative part of a verb, adjective or auxiliary verb in Japanese and Korean

NOTE Verbs, adjectives and auxiliary verbs have agglutinative forms at their end. These agglutinative forms are defined as endings.
For example, the endings of a verb include a negation form, an adverbial form, a base form, an adnominal form, an assumption form and an imperative form.

3.6
measure word
part of speech in Chinese used, along with numbers, to define the quantity of a given object, or, with demonstrative pronouns such as "this" and "that", to identify specific objects

NOTE1 While English speakers say "one person" or "this person", Chinese speakers say, respectively, 一个人 (yi ge ren; numeral + measure word + noun; one person) or 这个人 (zhe ge ren; demonstrative pronoun + measure word + noun; this person), where “个” (ge) is a measure word.

NOTE2 There is a set of "verbal measure words" used for counting the number of times an action occurs, rather than counting a number of items. For example, in the sentence “我去过三次北京” (wo qu guo san ci Beijing; pronoun + verb + auxiliary word + numeral + measure word + proper noun; I have been to Beijing three times), “次” (ci) functions as a verbal measure word modifying the verb “去” (qu).

4 Overview – What is Word Segmentation Unit in CJK, Why is necessary, What are different from Other languages

Word segmentation is the process of dividing a sentence into meaningful units. For example, “the White House” consists of three words but designates a single concept, the residence of the President of the USA. “Pork” in English corresponds to the two-word expression “pig meat” in Chinese, Korean and Japanese: 猪肉 (rom…;), 돼지-고기 and 豚肉, respectively. In Japanese and Korean, because an auxiliary verb must follow the main verb, the two together compose one word segmentation unit, as in “tabetai” and “meoggo sipda” respectively, both meaning “want to eat”. A word segmentation unit is thus defined as a meaningful unit that is useful for further syntactic processing. Such a unit could be an entry in a lexicon, or in any other type of storage whose entries are useful for syntactic processing in natural language processing.
A word segmentation unit is more or less fixed, and no syntactic process operates inside it. In practical terms, it is useful for further syntactic processing because it cannot be decomposed by syntactic processing and because it occurs frequently in corpora.

For words derived from Chinese characters, the three languages share common properties. A noun consisting of two or more Chinese characters is one word segmentation unit if the characters are “tightly combined and steadily used”, according to the principles of Part 1. For example, “each country” in English is not a word segmentation unit, and neither is its translation, which is segmented as “各|国”. If the last character is productive in a limited manner, it forms a word segmentation unit with the preceding word, for example “東京都” (Tokyo Metropolis), “8月” (August) or “加速器” (accelerator).

A negation character attached to a verb or adjective is segmented independently in Chinese, but forms one word segmentation unit with it in Japanese. For example, “yasashikunai” (優しく無い, not kind) is one word segmentation unit in Japanese, but “不|写” (not to write), “不|能” (cannot), “没|研究” (did not research) and “未|完成” (not completed) are segmented in Chinese. In Korean, “chinjeolhaji anhta” (친절하지 않다, not kind) has a space between its two Eojeols, but it could be treated as one word unit; “ji anhta” negates the adjectival stem “chinjeolha”.

The motivation of the word segmentation standard is to recommend which word segmentation units should be registered in a lexicon, where the lexicon is not the lexicon of linguistics but any kind of practical indexed container for word segmentation units. As a result, the standard rests on principles that may conflict with one another. For example, the principles of unproductivity, frequency and granularity can conflict because they define a word segmentation unit from different perspectives.
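The idea of a lexicon as a practical indexed container of word segmentation units can be illustrated with a small sketch. The Python code below uses forward maximum matching, a common dictionary-based technique that is an assumption here and is not prescribed by this document. The function name `segment`, the toy lexicon and the fallback to single characters are likewise hypothetical.

```python
def segment(text, lexicon):
    """Greedy forward maximum matching (an assumed technique, not from the standard)."""
    max_len = max(map(len, lexicon))
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                i += size
                break
    return units


# Toy lexicon built from the Chinese examples in this clause.
LEXICON = {"加速器", "加速", "完成", "研究", "不", "没", "未", "写", "能", "各", "国"}

print("|".join(segment("未完成", LEXICON)))  # 未|完成
print("|".join(segment("加速器", LEXICON)))  # 加速器
```

In this sketch the outcome depends entirely on which units are registered: because “加速器” is an entry, it is kept whole, whereas “未完成” is not an entry and is therefore split into “未” and “完成”.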
The word segmentation unit structure of nouns derived from Chinese characters can be shared among the three languages, though not in its entirety. Korean and Japanese, for their part, also have features in common: some Korean word endings and Japanese auxiliary verbs have the same functions. Word segmentation differs somewhat from language to language because of existing word segmentation conventions, which may even violate one or more principles of word segmentation.

This document specifies the application of Part 1 to three languages: Chinese, Japanese and Korean. It is intended as a starting point for recommending a more synchronized concept of the word segmentation unit in a multilingual environment. The concept of the “word segmentation unit” is intended to broaden the view of what can be registered in a lexicon for natural language processing, without requiring much linguistic representation.

POS | Chinese | Japanese | Korean (south) | Korean (north)
Noun | ○ (名词) | ○ (名詞) | ○ (명사 名詞) | ○ (명사 名詞)
Verb | ○ (动词) | ○ (動詞) | ○ (동사 動詞) | ○ (동사 動詞)
Adjective | ○ (形容词) | ○ (形容詞 and 形容動詞) | ○ (형용사 形容詞) | ○ (형용사 形容詞)
Numeral | ○ (数词) | Subcategory of Noun (名詞[数詞]) | ○ (수사 數詞) | ○ (수사 數詞)
Adverb | ○ (副词) | ○ (副詞) | ○ (부사 副詞) | ○ (부사 副詞)
Exclamation | ○ (叹词) | ○ (感動詞) | ○ (감탄사 感歎詞) | ○ (감동사 感動詞)
Pronoun | ○ (代词) | Subcategory of Noun (名詞[代名詞]) | ○ (대명사 代名詞) | ○ (대명사 代名詞)
Auxiliary word | ○ (助词) | × | × | ×
Measure word | ○ (量词) | Noun or Adverb (名詞/副詞 [序数詞]) | Noun or Adverb (명사 名詞/부사 副詞 [序數詞]) | Noun or Adverb (명사 名詞/부사 副詞 [序數詞])
Modal word | ○ (语气词) | × | × | ×
Imitative word | ○ (拟声词) | Part of Adverb (擬態語・擬音語) | Part of Adverb (擬態語・擬音語) | Part of Adverb (擬態語・擬音語)
Preposition | ○ (介词) | × | × | ×
Conjunction | ○ (连词) | ○ (接続詞) | ○ (접속부사 接続副詞) | ○ (이음부사 --副詞)
Particle | × | ○ (助詞) | ○ (조사 助詞) | Treated as grammatical affix named 토 (Tho)
Adnoun | × | ○ (連体詞) | ○ (관형사 冠形詞) | ○ (관형사 冠形詞)
Auxiliary verb | Subcategory of Verb (能愿动词) | ○ (助動詞) | Subcategory of Verb (보조동사 補助動詞); Subcategory of Adjective (보조형용사 補助形容詞) | Subcategory of Verb (보조동사 補助動詞); Subcategory of Adjective (보조형용사 補助形容詞)
Differentiating word | ○ (区别词) | × | × | ×

This standard adopts a notation which uses the underline to indicate the presence of a WSU under consideration.

5 Chinese word segmentation 1)

1) Most examples laid out as columns in the following clauses are formatted as follows.
First line: Chinese expression
Second line: Romanization
Third line: English translation for each component
Fourth line: part of speech
Fifth line: English translation for the whole expression
When a part is not necessary in an example, the corresponding line is left blank.

5.1 General rules for identifying WSUs in Chinese text

5.1.1 Punctuation and white space

Punctuation marks and white space are, in general, natural separators of WSUs, although in some cases a punctuation mark can be part of a WSU, such as “·” in “诺姆·乔姆斯基” (nuo mu · qiao mu si ji; Noam Chomsky).

5.1.2 Word

Words that are clearly justified by linguistic criteria, mainly consisting of two, three or four characters, are WSUs.

EXAMPLE
发展    fa zhan    develop
可爱    ke ai    lovely
现代化    xian dai hua    modernize
自行车    zi xing che    bike
毛泽东    mao ze dong    Mao Zedong
资本主义    zi ben zhu yi    capitalism
操作系统    cao zuo xi tong    operating system

5.1.3 Derivation

The results of adding a series of prefixes or suffixes to a word are WSUs.

EXAMPLE
科学 家
ke xue jia
science -er
noun suffix
scientist

物理 学 家
wu li xue jia
physics -ology -er
noun suffix suffix
physicist

5.1.4 Phrasal compound

Phrasal compounds that are frequently used in text, mainly consisting of two or three characters, are WSUs.

EXAMPLE
猪 肉
zhu rou
pig meat
noun noun
pork

发电 厂
fa dian chang
to generate electricity, plant
verb noun
power plant

5.1.5 Idiom

Idioms, mainly consisting of four characters, are WSUs.

EXAMPLE
胸有成竹    xiong you cheng zhu    have a well-thought-out plan
欣欣向荣    xin xin xiang rong    prosperous

5.1.6 Idiomatic expression, proverb and familiar quotation

Idiomatic expressions, proverbs and familiar quotations are WSUs if they are frequently used in text.
EXAMPLE
对不起 | 春夏秋冬 | 由此可见
dui bu qi | chun xia qiu dong | you ci ke jian
sorry | spring summer autumn winter | this shows

不管 三 七 二 十 一 | 失败 是 成功 之 母
bu guan san qi er shi yi | shi bai shi cheng gong zhi mu
no matter three seven two ten one | failure is success of mother
no matter what happens | Failure is the mother of success.

5.1.7 Abbreviation

Abbreviations are WSUs.

EXAMPLE
科技 | 工农业
ke ji | gong nong ye
science and technology | industry and agriculture

5.1.8 Suffixation of the nonsyllabic 儿 (r)

The results of suffixation of the nonsyllabic 儿 (r) to nouns and sometimes verbs are WSUs.

EXAMPLE
花儿 | 玩儿 | 悄悄儿
huar | wanr | qiaoqiaor
flower r | play r | quietly r
noun r | verb r | adverb r
flower | play | quietly

5.1.9 Transliterated loanword

Transliterated loanwords are WSUs.

EXAMPLE
吉普 | 巧克力
ji pu | qiao ke li
jeep | chocolate

5.1.10 Non-Chinese-character strings

Non-Chinese-character strings, including foreign language characters, Arabic numerals, mathematical symbols, chemical symbols, etc., are treated as WSUs and keep their original forms.

EXAMPLE CAD CO := cm 1298 3.1415926

5.1.11 Internal structure of WSUs

A WSU may have an internal structure which organizes several WSUs hierarchically. Such a structure can be manipulated at different levels of granularity in the process of word segmentation, according to the needs of various applications.

EXAMPLE
chocolate: WSU(巧克力)
pork: WSU(WSU(猪) WSU(肉))
physicist: WSU(WSU(WSU(物理) WSU(学)) WSU(家))
Mao Zedong: WSU(WSU(毛) WSU(泽东))

5.2 Typology of WSUs in Chinese

The treatment of some specific WSU-related phenomena is addressed in this clause (note: phenomena that can be clearly treated by Clause 5.1 are not stated here). For clarity of description, the specification is organized under 14 word categories: noun, verb, adjective, differentiating word, pronoun, numeral, measure word, adverb, preposition, conjunction, auxiliary word, modal word, exclamation, and imitative word.
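As an illustration of the hierarchical structure described in 5.1.11, the following is a minimal sketch in Python of how nested WSUs might be represented and flattened at different levels of granularity. The class name `WSU`, its fields and the `segment` method are hypothetical names chosen for this illustration; they are not defined by this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WSU:
    """A word segmentation unit: either a leaf string or a list of child WSUs."""
    text: str = ""
    children: List["WSU"] = field(default_factory=list)

    def surface(self) -> str:
        """Concatenate the leaves to recover the unsegmented string."""
        if not self.children:
            return self.text
        return "".join(child.surface() for child in self.children)

    def segment(self, depth: int) -> List[str]:
        """Segment at a granularity level; depth 0 keeps the unit whole."""
        if depth <= 0 or not self.children:
            return [self.surface()]
        pieces: List[str] = []
        for child in self.children:
            pieces.extend(child.segment(depth - 1))
        return pieces


# The structures given in the EXAMPLE of 5.1.11.
chocolate = WSU("巧克力")
pork = WSU(children=[WSU("猪"), WSU("肉")])
physicist = WSU(children=[WSU(children=[WSU("物理"), WSU("学")]), WSU("家")])

print(physicist.segment(0))  # ['物理学家']
print(physicist.segment(1))  # ['物理学', '家']
print(physicist.segment(2))  # ['物理', '学', '家']
```

An application that needs coarse units would call `segment(0)`, while one that needs finer units would pass a larger depth; a unit without internal structure, such as `chocolate`, is returned whole at every depth.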
5.2.1 Noun

5.2.1.1 Common noun

(1) The nominal expression “adjective + noun” is segmented unless the meaning of the expression is not the sum of its parts.

EXAMPLE
小 床 | 小 媳妇
xiao chuang | xiao xi fu
small bed | small wife
adjective, noun | adjective, noun
small bed | young wife

(2) The localizer word (a subcategory of noun) is segmented.

EXAMPLE
桌子 上 | 长江 以北
zhuo zi shang | chang jiang yi bei
table above | the Yangtze River, the north
noun, localizer word | noun, localizer word
on the table | to the north of the Yangtze River

(3) The plural suffix “们” (men; -s) is segmented.

EXAMPLE
朋友 们
peng you men
friend -s
noun -s
friends

However, the following are treated as WSUs:

人们 | 哥儿们 | 爷儿们
ren men | ger men | yier men
people | pals | guys

(4) The time expression is treated as follows:

a. The names of the months (January to December) and of the days of the week (Monday to Sunday) are WSUs.

EXAMPLE
五月 | 元月 | 3月 | 星期 日 | 礼拜 三
wu yue | yuan yue | 3 yue | xing qi ri | li bai san
five month | first month | 3 month | week + day | week three
May | January | March | Sunday | Wednesday

b. The time measure words for year, day, hour, minute and second are segmented.

EXAMPLE
1988 年 3 月 15 日 11 时 42 分 8 秒
1988 nian 3 yue 15 ri 11 shi 42 fen 8 miao
1988 year 3 month 15 day 11 hour 42 minute 8 second
March 15th, 1988, forty-two minutes and eight seconds past eleven

c. The results of “前、后、上、下、大前、大后” (before last, after next, last, next, before before last, after after next), each combined directly with a time noun or a time measure word, are WSUs.

EXAMPLE
前天 | 后年 | 上星期 | 下月 | 大前天 | 大后年
qian tian | hou nian | shang xingqi | xia yue | da qian tian | da hou nian
before last, day | after next, year | last, week | next, month | before before last, day | after after next, year
the day before yesterday | the year after next | last week | next month | three days ago | three years later

d. The time nouns “初一” (first day of a month in the Chinese lunar calendar) to “初十” (tenth day of a month in the Chinese lunar calendar) are WSUs.
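Item (3) of 5.2.1.1 can be read as a rule with a closed list of exceptions. The following is a minimal sketch under that reading, assuming the input tokens have already been identified as nouns; pronoun forms such as 我们, which 5.2.4 treats as single units, would have to be excluded before this step. The names `MEN_EXCEPTIONS`, `split_plural_suffix` and `apply_plural_rule` are hypothetical and used only for illustration.

```python
from typing import List

# Forms listed in item (3) as WSUs even though they end in 们.
MEN_EXCEPTIONS = {"人们", "哥儿们", "爷儿们"}


def split_plural_suffix(token: str) -> List[str]:
    """Split a trailing 们 from a noun unless the token is a listed exception."""
    if token in MEN_EXCEPTIONS or len(token) < 2 or not token.endswith("们"):
        return [token]
    return [token[:-1], "们"]


def apply_plural_rule(tokens: List[str]) -> List[str]:
    """Apply the plural-suffix rule to every token of an already tokenized text."""
    result: List[str] = []
    for token in tokens:
        result.extend(split_plural_suffix(token))
    return result


print(apply_plural_rule(["朋友们", "人们", "桌子"]))  # ['朋友', '们', '人们', '桌子']
```

Keeping the exceptions in a set makes the lexicalized forms easy to extend without touching the splitting logic.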
5.2.1.2 Proper noun

5.2.1.2.1 Personal name and title

(1) Full personal names of the Han nationality are WSUs; each has an internal structure with the surname and the given name as two WSUs.

EXAMPLE
张 胜利 | 欧阳 志华
zhang sheng li | ou yang zhi hua
surname, given name | surname, given name
Zhang Shengli | Ouyang Zhihua

(2) Full personal names of other nationalities or of foreign countries are WSUs; each may have an internal structure in accordance with its own tradition.

EXAMPLE
牛顿 | 小林 多喜二
niu dun | xiao lin duo xi er
Newton | Kobayashi Takiji

(3) The expression “surname + title” is segmented.

EXAMPLE
张 教授 | 王 部长 | 李 师傅
zhang jiao shou | wang bu zhang | li shi fu
surname professor | surname minister | surname master
Professor Zhang | Minister Wang | Master Li

(4) The expressions “one-character honorific title + surname” and “surname + one-character title” are WSUs.

EXAMPLE
老张 | 陈总
lao zhang | chen zong
one-character honorific title, surname | surname, one-character title
old Zhang | manager Chen

(5) The titles for kinship regarding rankings are WSUs, each with an internal structure.

EXAMPLE
三 叔 | 大 女儿
san shu | da nv er
three uncle | big daughter
the third younger uncle | the eldest daughter

5.2.1.2.2 Place name and nationality name

“族、省、市、州、县、乡、区、江、河、山” (nationality, province, city, prefecture, county, town, district, river (江 and 河), mountain) shall be segmented from the nationality name or place name; however, a nationality name or place name that contains only two Chinese characters in total shall not be segmented.

EXAMPLE 汉族 (the Han nationality) 哈萨克 族 (the Kazakh nationality) 北京 市 (Beijing Municipality) 浙江 省 (Zhejiang Province) 正定 县 (Zhengding County) 长江 (Yangtze River) 忻县 (Xin County)

A proper noun whose parts cannot stand independently while keeping the original meaning shall not be segmented.

EXAMPLE 牡丹江 (Mudan River) 横断山 (Hengduan Mountains)

Street, road, village and town names, and ocean and sea names, shall each be deemed a segmentation unit.
EXAMPLE 长安街 (Chang’an Avenue) 学院路 (Xueyuan Road) 周口店 (Zhoukoudian) 刘家村 (Liujiacun Village) 大西洋 (Atlantic Ocean) 地中海 (Mediterranean Sea)

5.2.1.2.3 Other types of proper names

A full country name shall be deemed a segmentation unit.

EXAMPLE 中华人民共和国 (People's Republic of China) 大不列颠及北爱尔兰联合王国 (United Kingdom of Great Britain and Northern Ireland)

The full name of an organization, agency or institution shall be segmented in accordance with the word segmentation units constituting the full name.

EXAMPLE 联合国 教科文 组织 (United Nations Educational, Scientific and Cultural Organization) 中国 共产党 (Communist Party of China)

Trade marks, product types and product series shall be segmented from the common noun.

EXAMPLE 牡丹 II 型 (Peony III) 永久 牌 (Yongjiu Brand) 中华 烟 (Zhonghua Cigarette)

5.2.2 Verb

5.2.2.1 Various forms of reiterative verbs

a) A reiterated single-character verb shall be deemed one segmentation unit.

EXAMPLE 看看 (look at) 动动 (move)

b) A two-character verb reiterated in the form “AABB” shall be deemed one segmentation unit.

EXAMPLE 来来往往 (come and go) 拉拉扯扯 (drag)

c) A verb reiterated in the form “AAB” or “ABAB” shall be segmented.

EXAMPLE 说说 看 (try to say) 研究 研究 (have a discussion)

d) A verb reiterated in the form “A 一 A”, “A 了 A” or “A 了一 A” shall be segmented.

EXAMPLE 谈 一 谈 (have a good chat) 想 一 想 (think carefully) 读 一 读 (to read) 想 了 想 (think it over) 想 了 一 想 (think it over)

5.2.2.2 Verb delimited by a negative-meaning Chinese character

The negative-meaning Chinese character before the verb shall be segmented independently.

EXAMPLE 不 写 (not to write) 不 能 (cannot) 没 研究 (did not research) 未 完成 (not completed)

5.2.2.3 "Verb + a negative-meaning Chinese character + the same verb" structure

Such a structure, when it indicates a question, shall be segmented.

EXAMPLE 说 不 说 (say or not say)? 看 不 看 (see or not see)? 相信 不 相信 (believe or not believe)?

However, the contracted (brachylogical) form shall not be segmented.
EXAMPLE 相不相信 (believe or not)

5.2.2.4 Verb–object structure and verb collocations

A verb–object structural word, or a compact and stably used verb phrase, shall not be segmented.

EXAMPLE 开会 (hold a meeting) 跳舞 (dance) 解决吃饭问题 (to resolve the problem of meals) 孩子该念书了 (it’s time for the child to go to school)

A loosely bound verb–object phrase, or one whose pattern is shared by many similar phrases, shall be segmented.

EXAMPLE 吃 鱼 (eat fish) 学 滑冰 (learn skating) 写 信 (write a letter); 写 文章 (write an article); 写 论文 (write a thesis); 写 书 (write a book); …

A verb–object structural word or phrase with other elements inserted into it shall be segmented.

EXAMPLE 吃 两 顿 饭 (have two meals) 跳 新疆 舞 (dance a Xinjiang dance)

5.2.2.5 Verb–complement structural word and stably used verb–complement phrase

A two-character verb–complement structural word, or a stably used two-character verb–complement phrase, shall not be segmented.

EXAMPLE 打倒 (knock down) 提高 (improve) 加长 (lengthen) 做好 (do well in)

A verb–complement phrase of the “2 + 1” or “1 + 2” pattern (two characters combined with one, or one with two) shall be segmented; a verb–complement phrase of more than three characters shall also be segmented.

EXAMPLE 整理 好 (clean up) 说 清楚 (speak clearly) 解释 清楚 (explain clearly)

A verb–complement word or phrase with “得” or “不” inserted into it shall be segmented.

EXAMPLE 打 得 倒 (able to knock down) 提 不 高 (unable to improve)

5.2.2.6 Adverb delimited verb

An adverb–verb word, or a compact and stably used adverb–verb phrase, shall not be segmented.

EXAMPLE 胡闹 (make trouble) 瞎说 (talk nonsense) 死记 (learn by rote)
早 来 (come early) 晚 走 (go late) 重 说 (retell)

A compound directional verb shall be deemed a segmentation unit.

EXAMPLE 出去 (go out) 进来 (come in)

However, a compound directional verb with “得” or “不” inserted into it shall be segmented.

EXAMPLE 出 得 去 (able to go out) 进 不 来 (unable to come in)

A phrase formed by a verb followed by a directional verb shall be segmented.
EXAMPLE 寄 来 (send) 跑 出 去 (run out)

5.2.2.7 Combination of independent single verbs

A combination of independent single verbs without a conjunction shall be segmented.

EXAMPLE 苫 盖 (cover with) 听 说 读 写 (listening, speaking, reading and writing)

A combination of multi-character verbs without a conjunction shall be segmented.

EXAMPLE 调查 研究 (investigate and research) 宣传 鼓动 (publicity and agitation)

5.2.3 Adjective

5.2.3.1 Reiteratively combined adjectives

An adjective in the reiterative form “AA”, “AABB”, “ABB”, “AAB” or “A + 里 + AB” shall be deemed a segmentation unit.

EXAMPLE 大大 (big) 高高 (tall) 高高兴兴 (happy) 匆匆忙忙 (busy) 绿油油 (fresh green) 红通通 (bright red) 蒙蒙亮 (daybreak) 马里马虎 (careless)

However, an adjective in the reiterative form “ABAB” shall be segmented.

EXAMPLE 雪白 雪白 (snowy white) 滚圆 滚圆 (fat and round)

5.2.3.2 Adjective phrases

An adjective phrase in the form “一 A 一 B”, “一 A 二 B”, “半 A 半 B”, “半 A 不 B” or “有 A 有 B” shall not be segmented.

EXAMPLE 一心一意 (wholeheartedly) 一清二楚 (as plain as daylight) 半明半暗 (partly bright, partly dark) 半生不熟 (half-cooked) 有条有理 (orderly)

5.2.3.3 Adjectives in parataxis form

Adjectives in parataxis form shall be segmented in accordance with the following rules:

a. Two single-character adjectives whose combination has taken on different word features shall not be segmented.

EXAMPLE 长短 (long-short) 深浅 (deep-shallow) 大小 (big-small)

b. Adjectives in parataxis form that maintain their original adjective meaning shall be segmented.

EXAMPLE 大 小尺寸 (size) 光荣 伟大 (glory)

5.2.3.4 Adjective delimited noun for colors

A color adjective word or phrase shall not be segmented.

EXAMPLE 浅黄 (light yellow) 橄榄绿 (olive green)

5.2.3.5 Adjective phrases in question form

An adjective phrase in the positive-with-negative form that indicates a question shall be segmented.

EXAMPLE 容易 不 容易 (easy or not easy)

However, the contracted (brachylogical) form shall not be segmented.

EXAMPLE 容不容易 (easy or not)

5.2.4 Pronoun

a) A single-character pronoun followed by “们” shall be deemed a segmentation unit.
EXAMPLE 我们 (we) 你们 (you) 它们 (they) 他们 (they)

b) “这、那、哪” followed by the measure word “个” or by “些、样、么、里、边” shall be deemed one segmentation unit.

EXAMPLE 这个 (this) 这么 (thus) 这边 (here) 那些 (those) 那样 (like that) 那里 (there) 哪个 (which) 哪里 (where) 哪些 (which)

c) “这、那、哪” followed by a numeral, a measure word or a noun shall be segmented.

EXAMPLE 这 十 天 (these 10 days) 那 人 (that person) 那 种 (that kind)

d) An interrogative word or phrase shall be deemed a segmentation unit.

EXAMPLE 多少 (how many) 怎样 (what about) 为什么 (why) 什么 (what)

e) Pronouns such as “各、每、某、本、该、此、全” shall be segmented from the measure word or noun that follows.

EXAMPLE 各 国 (each country) 每 种 (each type) 某 工厂 (a certain factory) 本 部门 (this department) 该 单位 (this unit) 此 人 (this person) 全 校 (the whole school)

5.2.5 Numeral

a) A numeral is segmented from a measure word.

EXAMPLE 三 个 (three) 一 种 (one type)

b) A number written in Chinese numerals shall be deemed a segmentation unit.

EXAMPLE 一亿八千零四万七百二十三 (180,040,723)

c) The ordinal marker “第” shall be segmented from the numeral that follows.

EXAMPLE 第 一 (first) 第 四 (the fourth) 第 五 十 三 (the fifty-third)

d) A fractional number formed with “分之” shall be deemed one segmentation unit.

EXAMPLE 五分之三 (three fifths) 百分之二 (2/100) 万分之五 (5/10000)

e) Paratactic numerals indicating an approximate number shall be deemed a segmentation unit.

EXAMPLE 八九 公斤 (eight or nine kg) 十 七八 岁 (seventeen or eighteen years old)

f) “多、来、几”, etc., used with a numeral to indicate an approximate number, shall be segmented.

EXAMPLE 两 点 多 (past two o’clock) 十 来 家 (about ten) 一 千 多 人 (more than one thousand people) 十 几 个 (over ten)

g) “些、一些、点儿、一点儿”, used after an adjective or verb to indicate an approximate quantity, shall be segmented.

EXAMPLE 大 些 (a little bigger) 懂 一些 (know some) 快 点儿 (quickly) 快 一点儿 (more quickly)

h) “近、约、数”, etc., used before a numeral to indicate an approximate number, shall be segmented.
EXAMPLE 近 千 人 (nearly one thousand people) 约 三 百 (about three hundred) 数 万 (tens of thousands) 成百 (hundreds of) 上千 (thousands of)

5.2.6 Measure word

a) A reiterative measure word shall not be segmented.

EXAMPLE 年年 (every year) 天天 (every day) 个个 (each) 家家户户 (every household)

b) A compound measure word or phrase shall be deemed a segmentation unit.

EXAMPLE 人年 (man-year) 人次 (person-time) 架次 (sortie) 吨公里 (tonne-kilometre)

5.2.7 Adverb

a) An adverb shall be deemed a segmentation unit.

EXAMPLE 很好 (very good) 都来了 (everyone came here) 刚走 (have just gone) 互相协助 (help each other)

b) The following frequently used phrases acting as adverbs shall each be deemed a segmentation unit:

EXAMPLE 越来越 (more and more) 不得不 (have to) 不能不 (cannot but)

“越…越…、又…又…”, etc., acting as conjunctions, shall be segmented.

EXAMPLE 越 走 越 远 (to go farther and farther) 又 香 又 甜 (sweet yet savory)

5.2.8 Preposition

A preposition shall be deemed a segmentation unit.

EXAMPLE 生于 (be born in) 走向胜利 (march towards success) 按照规定 (according to the regulations)

5.2.9 Conjunction

A conjunction shall be deemed a segmentation unit.

EXAMPLE 工人和农民 (worker and farmer) 光荣而伟大 (glorious and grand)

5.2.10 Auxiliary word

a) The structural auxiliary words “的、地、得、之” shall each be deemed a segmentation unit.

EXAMPLE 他的书 (his book) 慢慢地走 (walk slowly) 说得快 (speak fast) 美丽的城市 (beautiful city) 中国的大熊猫 (Chinese panda) 成功之路 (road to success)

b) The tense auxiliary words “着、了、过” shall each be deemed a segmentation unit.

EXAMPLE 看着 (be watching) 看了 (watched) 看过 (have watched)

c) The auxiliary word “所” shall be segmented from the verb that follows.

EXAMPLE 所 想 (what one thinks) 所 认识 (what one knows)

5.2.11 Modal word

A modal word shall be deemed a segmentation unit.

EXAMPLE 你好吗?(How are you?) 你好吧!(Is everything OK?)

5.2.12 Exclamation word

An exclamation word shall be deemed a segmentation unit.

EXAMPLE 啊,真美!(How beautiful it is!) 唉呀,他走了!(He has gone!)

5.2.13 Imitative word

An imitative word shall be deemed a segmentation unit.
EXAMPLE 嘟 (du) 当当 (tinkle) 轰隆隆 (rumble)

6 Japanese word segmentation

6.1 General rules for identifying WSUs in Japanese text

A “Bunsetsu” is a phrase in Japanese text, without modifying expressions, that serves as a component of a sentence and carries a grammatical function. The components of a “Bunsetsu” belong mainly to nine parts of speech: 名詞 (meishi; noun), 動詞 (doushi; verb), 形容詞・形容動詞 (keiyoushi, keiyoudoushi; adjective), 連体詞 (rentaishi; adnominal noun [only used in adnominal usage]), 副詞 (fukushi; adverb), 感動詞 (kandoushi; exclamation), 接続詞 (setsuzokushi; conjunction), 助詞 (joshi; particle), and 助動詞 (jodoushi; auxiliary verb). These parts of speech are divided into more detailed classes in terms of grammatical function (see 6.2).

6.1.1 Punctuation

There are two main punctuation marks in Japanese, “、” and “。”.

“、” represents a slight pause. It indicates a break between phrases inside one sentence, but does not always correspond to the boundary of a segmentation unit; it is merely a pause that makes the sentence easier to understand, and is therefore not directly related to word segmentation units. “、” serves as a comma, semicolon and colon.

“。” represents a full stop and is written at the end of a sentence. That is, it marks the end of one sentence.

Other marks include the question mark “?”, the quotation marks “「」”, the book name marks “『』” and the exclamation mark “!”.

6.1.2 Noun

When a noun is a constituent of a sentence, it is usually followed by a particle or auxiliary verb. Also, if a word like an adjective or adnominal noun modifies a noun, the modifier (adjective, adnominal noun, adnominal phrase) and the modificand (a noun) are segmented.

Some nouns whose meaning is an action can become verbs by adding the verb “suru (do)”.

6.1.3 Verbs

A Japanese verb has an inflectional ending.
The ending of a verb changes depending on whether it is a negation form, an adverbial form, a base form, an adnominal form, an assumption form, or an imperative form. Japanese verbs are often used with auxiliary verbs and/or particles, and a verb with auxiliary verbs and/or particles is considered as a word segmentation unit.

6.1.4 Adjectives

A Japanese adjective has an inflectional ending. Based on the type of inflectional ending, there are two kinds of adjectives, "keiyoushi" and "keiyoudoushi". Both are treated as adjectives. In terms of inflectional ending, “keiyoushi” is sometimes called “i_keiyoushi”, such as “美しい (utsukushi_i; beautiful)”, and “keiyoudoushi” is sometimes called “na_keiyoushi”, such as “静かな (shizuka_na; quiet)”. (The inflectional ending of “na_keiyoushi” is sometimes analysed as “Noun + auxiliary verb (da)”.) The ending of an adjective changes depending on whether it is a negation form, an adverbial form, a base form, an adnominal form, or an assumption form. Japanese adjectives are sometimes used with auxiliary verbs and/or particles, and they are considered as a word segmentation unit.

6.1.5 Adnominal nouns

An adnominal noun does not have an inflectional ending; it is always used as a modifier. An adnominal noun is considered as one segmentation unit.

6.1.6 Adverbs

An adverb does not have an inflectional ending; it is always used as a modifier of a sentence or a verb. It is considered as one segmentation unit.

6.1.7 Conjunctions

A conjunction is considered as one segmentation unit.

6.1.8 Exclamations

An exclamation is considered as one segmentation unit.

6.1.9 Particles

A particle by itself does not form a “Bunsetsu”. A word followed by a particle can constitute a “Bunsetsu”. The function of a particle is to mark a case, a correlation with another phrase, the attachment of some minor meaning, and so on. In its behaviour, it attaches to words and, like a suffix, does not have an inflectional ending.
However, it is not a suffix but a part of speech in its own right. A Japanese particle attaches not only to words but also to a clause or even a sentence. For example, “寒いね?” means “It is very cold, isn’t it?” In this example the particle “ね (ne)” corresponds to “isn’t it?”.

6.1.10 Auxiliary verbs

Auxiliary verbs represent various semantic functions such as capability, voice, tense, aspect and so on. An auxiliary verb appears at the end of a phrase or a sentence. An auxiliary verb is always preceded by a word such as a noun, a verb or an adjective, and the set is considered as one segmentation unit. An auxiliary verb should not be segmented from the word it follows.

6.1.11 Idioms and proverbs

Proverbs, mottos, etc. should be segmented if their original meanings are not violated after segmentation.

EXAMPLE
光陰 矢の ごとし
Kouin ya_no gotoshi
Noun | Noun_particle | Auxiliary verb
time | arrow | like (flying)
Time flies fast

6.1.12 Abbreviations

An abbreviation should not be segmented.

6.2 Typology of WSUs in Japanese

The examples in each section are formatted as follows.

First line: Japanese sentence
In Japanese, spaces are not used in a sentence. However, in the examples shown below, spaces indicate a border of “Bunsetsu”.
Second line: Romanization
Third line: part of speech constituting “Bunsetsu”
A space refers to the border of a part of speech.
The “_” mark within Bunsetsu refers to the composition of Bunsetsu.
The “+” mark refers to the lexical composition within words.
The “[ ]” mark refers to the semantic function of a part of speech.
Fourth line: English translation for each Bunsetsu
Fifth line: English translation for the example sentence or phrase

6.2.1 Nouns (名詞; Meishi)

When a noun is a constituent of a sentence, it is usually followed by a particle or auxiliary verb, but there are exceptions. In some cases, one word becomes one sentence.
For example, as a question, “なぜ (naze?; why?)”, and as an answer, “りんご (ringo; apple)”, “3 (san; three)” and so on. Also, if a word like an adjective or an adnominal noun modifies a noun, the modifier (adjective, adnominal noun, adnominal phrase) and the modificand (a noun) are segmented; they do not form a compound noun.

6.2.1.1 Common nouns (普通名詞; Futsuumeishi)

6.2.1.1.1 A noun followed by a particle is considered as a word segmentation unit. A noun followed by an auxiliary verb is considered as a word segmentation unit.

EXAMPLE A A noun followed by a particle as a case marker
私は トマトを 買った。
Watashi_wa tomato_wo kat_ta
Pronoun_particle | Noun_particle[object] | Verb_auxiliary verb
I | tomato | bought
I bought a tomato.

EXAMPLE B A noun followed by an auxiliary verb
私の 好きな 花は 桜です。
Watashi_no sukina hana_wa sakura_desu
Noun_particle | Adjective | Noun_particle | Noun_auxiliary verb[polite]
my | favorite | flower | is cherry blossoms
My favorite flower is cherry blossoms.

6.2.1.1.2 "A noun with a prefix and/or a suffix, plus a case particle following it” and “A noun with a prefix and/or a suffix, plus an auxiliary verb following it" are considered as a word segmentation unit.

EXAMPLE A A noun with a prefix and/or a suffix, plus a case particle following it
あなたの お名前を 教えてください。
Anata_no o+namae_wo oshiete_kudasai
Noun_particle | prefix[politeness]+Noun_particle[object] | Verb_auxiliary verb
your | name | tell
Please tell me your name.

EXAMPLE B A noun with a suffix, plus an auxiliary verb following it
この シャンプーは 植物性だ。
Kono shanpoo_wa shokubutsu+sei_da
Adnominal noun | Noun_particle | Noun+suffix_auxiliary verb[copula]
this | shampoo | is of plant origin
This shampoo is of plant origin.

6.2.1.1.3 “A compound noun plus a case particle following it” and “a compound noun plus an auxiliary verb following it" are considered as a word segmentation unit.
EXAMPLE A A compound noun plus a case particle following it
私は さしみ定食を 注文した。
Watashi_wa sashimi+teishoku_wo chumonshi_ta
Pronoun_particle | Noun+Noun_particle[object] | Verb_auxiliary verb
I | sashimi set | ordered
I ordered a sashimi set.

EXAMPLE B A compound noun plus an auxiliary verb following it
私の 趣味は 映画鑑賞です。
Watashi_no shumi_wa eigakanshou_desu
Noun_particle | Noun_particle | Noun_auxiliary verb[copula, polite]
my | hobby | watching movies
My hobby is watching movies.

6.2.1.1.4 Some nouns which mean actions can become verbs by adding the verb “suru (do)” (see 6.2.2.2).

EXAMPLE
私は 毎日 散歩する。
Watashi_wa mainichi sanpo+suru
Pronoun_particle | Adverb | Verb[Noun+”do”]
I | every day | take a walk
I take a walk every day.

6.2.1.2 Pronouns (代名詞; Daimeishi)

6.2.1.2.1 A pronoun and a case particle are regarded as a word segmentation unit. Sets of a pronoun and an auxiliary verb and/or a particle are regarded as a word segmentation unit.

EXAMPLE
私は トマトを 買った。
Watashi_wa tomato_wo kat_ta
Noun[pronoun]_particle[topic] | Noun_particle | Verb_auxiliary verb
I | tomato | bought
I bought a tomato.

6.2.1.2.2 "A pronoun with a prefix and/or a suffix, plus a case particle following it” and “A pronoun with a prefix and/or a suffix, plus an auxiliary verb and/or a particle" are regarded as a word segmentation unit.

EXAMPLE A A pronoun with a suffix, plus a case particle
彼女たちは コーチに 花を 贈った。
Kanojo+tachi_wa coochi_ni hana_wo okut_ta
Noun[pronoun]+suffix_particle[topic] | Noun_particle | Noun_particle | Verb_auxiliary verb
they | to a coach | flowers | gave
They gave flowers to their coach.

EXAMPLE B A pronoun with a suffix, plus an auxiliary verb and a particle
犯人は あなたたちですか?
Han’nin_wa anata+tachi_desu_ka
Noun_particle | Noun[pronoun]+suffix_auxiliary verb[copula]_particle[question]
criminal persons | are you?
Are you the criminals?

6.2.1.3 Proper nouns (固有名詞; Koyuumeishi)

A proper noun followed by a case particle is considered as a word segmentation unit.
A proper noun followed by an auxiliary verb and/or a particle is considered as a word segmentation unit.

EXAMPLE A A proper noun followed by a case particle
私は 東京へ 行った。
Watashi_wa Tokyou_e it_ta
Noun_particle | Noun[proper]_particle[direction] | Verb_auxiliary verb
I | to Tokyo | went
I went to Tokyo.

EXAMPLE B A proper noun followed by an auxiliary verb and/or a particle
彼は 坂本さんですね?
Kare_wa Sakamoto+san_desu_ne
Noun_particle | Noun[proper]+suffix_auxiliary verb_particle[mood]
he | is Mr. Sakamoto, isn’t he?
He is Mr. Sakamoto, isn’t he?

6.2.1.4 Interrogative (疑問詞; Gimonshi)

6.2.1.4.1 An interrogative noun and a case particle are considered as a word segmentation unit. An interrogative noun and an auxiliary verb are considered as a word segmentation unit.

EXAMPLE A An interrogative noun and a case particle
どれが 好きですか?
Dore_ga suki_desu_ka
Noun[interrogative]_particle[subject] | Verb_auxiliary verb_particle
which | do you like?
Which do you like?

EXAMPLE B An interrogative noun and an auxiliary verb
彼女は 誰でしょうか?
Kanojo_wa dare_deshou_ka
Noun_particle | Noun[interrogative]_auxiliary verb[guess]_particle[question]
she | who is?
Who is she?

6.2.1.4.2 Informally, an interrogative noun is occasionally used alone as a one-word sentence, although this usage is not limited to interrogative nouns.

EXAMPLE
いくつ?
Ikutsu?
Noun[interrogative]
How many
How many?

6.2.1.4.3 Some interrogative nouns cannot be followed by case particles.

EXAMPLE
*どうは / *どうが / *どうを
*Dou_wa / *Dou_ga / *Dou_wo
*Noun[interrogative]_particle[topic] / [subject] / [object]
*How is
*How is

6.2.1.5 Time/numeral/quantifier noun (数量詞/序数詞; Suuryoushi/Josuushi)

6.2.1.5.1 A numeral noun and a case particle are considered as a word segmentation unit. A numeral noun and an auxiliary verb are considered as a word segmentation unit.
EXAMPLE A A numeral noun and a case particle
母は ケーキを 三分の一に 分けた。
Haha_wa keeki_wo sanbun’noichi_ni wake_ta
Noun_particle | Noun_particle | Noun[numeral]_particle | Verb_auxiliary verb
my mother | a cake | into thirds | divided
My mother divided a cake into thirds.

EXAMPLE B A numeral noun and an auxiliary verb
休憩は 5 分間です。
Kyuukei_wa gofunkan_desu
Noun_particle | Noun[numeral]_auxiliary verb[copula, polite]
a break | is for 5 minutes
A break is for 5 minutes.

6.2.1.5.2 A measure noun is sometimes used as an adverb by itself without a particle.

EXAMPLE
鉛筆を 4本 準備しなさい。
Enpitsu_wo yon_hon junbishinasai
Noun_particle | Noun[measure] | Verb
a pencil | 4 | prepare
Prepare 4 pencils.

6.2.2 Verbs (動詞; Doushi)

A Japanese verb has an inflectional ending. The ending of a verb changes depending on whether it is a negation form, an adverbial form, a base form, an adnominal form, an assumption form, or an imperative form. Japanese verbs are often used with auxiliary verbs and/or particles, and they are considered as a word segmentation unit.

6.2.2.1 Single verbs and compound verbs

Verbs (including single verbs and compound verbs) are considered as one segmentation unit.

EXAMPLE
私は 毎朝 牛乳を 飲む。
Watashi_wa maiasa gyuunyu_wo nomu
Noun_particle | Adverb | Noun_particle | Verb
I | every morning | milk | drink
I drink milk every morning.

6.2.2.2 Verb composed from a noun and “suru” (do) (サ変動詞; Sahendoushi)

An action noun becomes a verb by adding the verb “suru (do)” to the end of the noun; such a verb is sometimes called “Sahendoushi”. “Sahendoushi” is considered as one segmentation unit. (see 7.1.1 (4))

EXAMPLE
私は 英語を 勉強する。
Watashi_wa eigo_wo benkyou+suru
Noun_particle | Noun_particle | Verb[Noun+do]
I | English | do study
I study English.
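The notation used on the romanization lines of these examples (spaces between Bunsetsu, “_” between the components of a Bunsetsu, “+” between the lexical parts of a word, as described in 6.2) is regular enough to be read mechanically. The following is a minimal sketch in Python; the function names `parse_notation` and `count_bunsetsu` are hypothetical. It is intended for romanization lines only, since part-of-speech labels such as "auxiliary verb" contain spaces of their own.

```python
from typing import List


def parse_notation(line: str) -> List[List[List[str]]]:
    """Parse a romanization line written in the notation of 6.2.

    Spaces separate Bunsetsu, "_" separates the components of a Bunsetsu,
    and "+" separates the lexical parts of a word.
    """
    return [
        [component.split("+") for component in bunsetsu.split("_")]
        for bunsetsu in line.split()
    ]


def count_bunsetsu(line: str) -> int:
    """Count the space-separated Bunsetsu on a romanization line."""
    return len(parse_notation(line))


parsed = parse_notation("Watashi_wa eigo_wo benkyou+suru")
print(parsed)
# [[['Watashi'], ['wa']], [['eigo'], ['wo']], [['benkyou', 'suru']]]
print(count_bunsetsu("Watashi_wa eigo_wo benkyou+suru"))  # 3
```

In the example from 6.2.2.2, the third Bunsetsu has a single component made of two lexical parts, which mirrors the statement that a “Sahendoushi” is one segmentation unit.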
6.2.2.3 A verb with a subsidiary verb

The function of a subsidiary verb is to complement the meaning of the main verb, as in “話し始める (hanashi+hajimeru; begin speaking)”. A subsidiary verb is not a suffix. A verb with a subsidiary verb is considered as a verb. When it is used at the end of a sentence or a clause, it is considered as one segmentation unit.

EXAMPLE
人形が 箱から 飛び出す。
Ningyou_ga hako_kara tobidasu
Noun_particle | Noun_particle | Verb[Verb + subsidiary]
a doll | the box | jump out
A doll jumps out of the box.

6.2.2.4 A verb with an auxiliary verb and a particle

A verb with an auxiliary verb and/or a particle is considered as one segmentation unit.

EXAMPLE A A verb with an auxiliary verb
彼は 試験に 合格するだろう。
Kare_wa shiken_ni goukakusuru_darou
Noun_particle | Noun_particle | Verb_auxiliary verb[expectation]
he | the examination | will pass
He will pass the examination.

EXAMPLE B A verb with an auxiliary verb and a particle
彼は 試験に 合格するだろうね?
Kare_wa shiken_ni goukakusuru_darou_ne
Noun_particle | Noun_particle | Verb_auxiliary verb_particle[mood]
he | the examination | will pass, don’t you think so?
He will pass the examination, don’t you think so?

6.2.3 Adjectives (形容詞/形容動詞; Keiyoushi/Keiyoudoushi)

A Japanese adjective has an inflectional ending. Based on the type of inflectional ending, there are two kinds of adjectives, "i keiyoushi" and "na keiyoushi". However, both are treated as adjectives.

In terms of traditional Japanese linguistics, “keiyoushi” refers to “i keiyoushi” and “keiyoudoushi” refers to “na keiyoushi”. (The inflectional ending of “na keiyoushi” is sometimes analysed as “Noun + auxiliary verb (da)”.) The ending of an adjective changes depending on whether it is a negation form, an adverbial form, a base form, an adnominal form, or an assumption form.
Japanese adjectives are sometimes used with auxiliary verbs and/or particles, and they are considered as a word segmentation unit.

6.2.3.1 Adjectives in predicative usage / adnominal usage

6.2.3.1.1 Adjectives in predicative usage are considered as one segmentation unit.

EXAMPLE
富士山は 高い。
Fujisan_wa takai
Noun_particle | Adjective
Mt. Fuji | high
Mt. Fuji is high.

6.2.3.1.2 Adjectives with an auxiliary verb and/or a particle are considered as one segmentation unit.

EXAMPLE
値段が 高いですか?
Nedan_ga takai_desu_ka
Noun_particle | Adjective_auxiliary verb[copula,polite]_particle[question]
the price | is high?
Is the price high?

6.2.3.1.3 Adjectives in adnominal usage and nouns modified by the adjectives are segmented separately.

EXAMPLE
面白い 本が ある。
Omoshiroi hon_ga aru
Adjective[adnominal] | Noun_particle | Verb
interesting | a book | there is
There is an interesting book.

6.2.3.1.4 Adjectives with a particle or an auxiliary verb, and nouns modified by them, are segmented separately.

EXAMPLE
面白かった 本を おしえる。
Omoshirokat_ta hon_wo oshieru
Adjective_auxiliary verb[past] | Noun_particle | Verb
was interesting | book | tell you
I will tell you about a book that was interesting.

6.2.3.2 Adjectives in adverbial usage

An adjective in adverbial usage and the verb modified by it must be segmented. In this case, the adjective in adverbial usage is considered as one segmentation unit.

EXAMPLE
早く 起きなさい!
Hayaku okinasai
Adjective[adverbial usage] | Verb
early | get up
Get up early!

6.2.3.3 Adjectives in negation and assumption

6.2.3.3.1 Adjectives in negation usage are generally represented in the form of "adjectives in adverbial form and auxiliary verbs." As auxiliary verbs for negation, “nai (for i_adjectives)”, “wa_nai (for na_adjectives)”, “arimasen (a polite form for i_adjectives)”, “wa_arimasen (a polite form for na_adjectives)” and “ja_arimasen (an informal form for na_adjectives)” are used.
An adjective in adverbial form together with an auxiliary verb for negation is considered as one segmentation unit.

EXAMPLE A
やさしくない
yasashiku_nai
Adjective_auxiliary verb[negation]
not kind
Not kind

EXAMPLE B
きれいではありません
kireide_wa_arimasen
Adjective_particle_auxiliary verb[negation+polite]
not clean
Not clean

6.2.3.3.2 An adjective in assumption usage is generally represented in the form of an adjective in assumption form plus a particle for assumption. Therefore, an adjective in assumption form plus a particle for assumption is considered as one segmentation unit.

EXAMPLE
雨が   ひどければ、   遠足は   中止する。
Ame_ga   hidokere_ba   ensoku_wa   chuushisuru
Noun_particle   Adjective_particle[assumption]   Noun_particle   Verb
the rain   if it is heavy   hiking   cancel
If the rain is heavy, the hike will be cancelled.

6.2.4 Adnominal nouns (連体詞; Rentaishi)

An adnominal noun does not have an inflectional ending; it is always used as a modifier. An adnominal noun is considered as one segmentation unit.

EXAMPLE
あらゆる   国
arayuru   kuni
Adnominal noun   Noun
every   country
every country

6.2.5 Adverbs (副詞; Fukushi)

An adverb does not have an inflectional ending; it is always used as a modifier of a sentence or a verb. It is considered as one segmentation unit.

EXAMPLE
やっと   来た。
yatto   ki_ta
Adverb   Verb_auxiliary verb
at last   came
At last (someone) came.

6.2.6 Conjunctions (接続詞; Setsuzokushi)

A conjunction is considered as one segmentation unit.

EXAMPLE
そして、   彼は   笑った。
Soshite   kare_wa   warat_ta
Conjunction   Noun_particle   Verb_auxiliary verb
then   he   laughed
Then he laughed.

6.2.7 Exclamations (感動詞; Kandoushi)

An exclamation is considered as one segmentation unit.

EXAMPLE
あっ!
A!
Exclamation
Oops!

6.2.8 Particles (助詞; Joshi)

Japanese particles are divided into seven subcategories.

“格助詞; kakujoshi” is a marker for a case.
(が; ga; subject marker, を; wo; object marker, に; ni; dative marker, and so on)
“係助詞; kakarijoshi” is a marker for a correlation with another phrase. (さえ; sae; even, しか; shika; only, and so on)
“並立助詞; heiritsujoshi” is a marker for a coordination. (と; to; and, か; ka; or, and so on)
“接続助詞; setsuzokujoshi” is a marker for a conjunction between phrases. (ので; node; because, とき; toki; when, and so on)
“副助詞; fukujoshi” is a marker that attaches an additional meaning. (くらい; kurai; about, まで; made)
“終助詞; shuujoshi” is a marker representing the mood or a question of the speaker. It is always used at the end of a sentence. (ね; ne; don’t you think so?, か; ka; question)
“準体助詞; juntaijoshi” is a marker for the nominalization of a phrase. (の; no; thing, こと; koto; thing)

EXAMPLE A   Particles for a case marker
私は (watashi_wa; I), 私を (watashi_wo; me), 私と (watashi_to; with), 私の (watashi_no; my), 私へ (watashi_e; to me), 私に (watashi_ni; me, for me)

EXAMPLE B   Particles for a conjunction
行けば (ike_ba; if you go), 行くので (iku_node; because (someone) goes)

EXAMPLE C   Particles for adding a meaning
私さえ (watashi_sae; even I), 私も (watashi_mo; I (go together), too)

EXAMPLE D   Particles for representing a mood and a question
行きますね?
Iki_masu_ne?
Verb_auxiliary verb_particle[mood]
go, don’t you?
(You go there), don’t you?

6.2.9 Auxiliary verbs (助動詞; Jodoushi)

Auxiliary verbs represent various semantic functions such as capability, voice, tense, aspect and so on. An auxiliary verb is used with a noun, a verb, or an adjective, and appears at the end of a phrase, a clause, or a sentence. Although an auxiliary verb is a part of speech, it should not be segmented from the word it follows.
EXAMPLE
雨が   降りそうなので、   家に   いるでしょう。
Ame_ga   furi_souna_node   ie_ni   iru_deshou
Noun_particle   Verb_auxiliary verb[guess]_particle[conjunction]   Noun_particle   Verb_auxiliary verb[prospect, polite]
it   because (it) seems to rain   at home   (I) will be
Because it seems to rain, I will be at home.

7 Korean word segmentation

7.1 Typology of word segmentation units in Korean

An “Eojeol (word phrase)” is a phrase in Korean text, without modifying expressions, that is a component of a sentence and carries a grammatical function. The components of an Eojeol belong mainly to eight or nine parts of speech: noun, verb, adjective, pronoun, numeral, adnoun, adverb, exclamation, and particle. In North Korean grammar, a particle is not treated as a part of speech (see section 4). The basic parts of speech can be divided into more detailed classes in terms of grammatical function (see section 7.2).

7.1.1 Punctuation and white space

A period (.) represents a full stop and is written at the end of a sentence. A question mark (?) and an exclamation mark (!) are also written at the end of a sentence. That is, each of these marks closes one sentence.

A comma (,) represents a slight pause. It indicates a break between phrases inside one sentence, but does not always correspond to the boundary of a segmentation unit. It only marks a pause that makes the sentence easier to understand, so it is not directly related to word segmentation units. A colon (:) and a slash (/) also represent a slight pause.

Double quotation marks (“ ”) are used for dialogue or quotation, and single quotation marks (‘ ’) are used for inner quotation and for emphasis of an expression.

In contrast with Chinese and Japanese typography, Korean sentences consist of fragments separated by white space. These fragments are Eojeol (word phrases). In Korean, white space, together with punctuation and brackets, is fundamental in separating word phrases.

7.1.2 Word

Words clearly justified by linguistic criteria are WSUs.
7.1.2.1 Simplex

Words with just one component are WSUs.

EXAMPLE
다섯
daseot
numeral
five

7.1.2.2 Compound

The results of composing two or more words are WSUs.

EXAMPLE
곱-씹다
gopssipda
verb
twofold-chew
Repeat a word

7.1.2.3 Derivation

The results of adding a series of prefixes or suffixes to a word are WSUs.

EXAMPLE A
외-삼촌
oesamchon
Noun(prefix-noun)
uncle-in-law

EXAMPLE B
입-질
ipjil
Noun(noun-suffix)
bite

7.1.2.4 Abbreviation

Abbreviations are WSUs.

EXAMPLE A   이공[理工] (igong; noun; science and engineering): 이학 (ihak; science) + 공학 (gonghak; engineering)

EXAMPLE B   국규[國規] (gukgyu; noun; national standard): 국가 (gukga; state) + 규격 (gyugeok; standard)

7.1.2.5 Transliterated loanword

Transliterated loanwords are WSUs.

EXAMPLE   지프 (jeep), 초콜릿 (chocolate)

7.1.2.6 Idiomatic expression with Chinese characters

Idiomatic expressions with Chinese characters are WSUs. They are usually composed of four Chinese characters.

EXAMPLE
함흥차사 (咸興差使)
hamheungchasa
noun
Lost messenger

7.1.3 Multi-word expression

7.1.3.1 Phrasal compound

Phrasal compounds that are frequently used in text and consist mainly of two or more words are WSUs.

EXAMPLE
수력   발전소
suryeok   baljeonso
noun   noun
waterpower   plant
hydroelectric plant

7.1.3.2 Idioms and proverbs

Fixed expressions such as idioms, proverbs, and mottos should be segmented only if their original meanings are preserved after segmentation. Otherwise, they should be deemed as one word unit, even though they are composed of two or more Eojeol (word phrases).

EXAMPLE A
울며   겨자   먹기
ulmyeo   gyeoja   meokgi
Verb   Noun   Verb
cry   mustard   eat
A Hobson’s choice.

EXAMPLE B
한   마디로   말해
han   madiro   malhae
adnoun   Noun_particle   Verb
one   word_with   talk
In a word (speaking briefly)

7.1.4 Non-Korean-character strings

Non-Korean-character strings, including foreign language characters, Arabic numerals, mathematical symbols, chemical symbols, etc., are treated as WSUs and keep their original forms.
EXAMPLE   CAD   CO   :=   cm   1298   3.1415926

7.2 Typology of WSUs in Korean

7.2.1 Noun

A noun is usually followed by a particle and is a component constituting a sentence. There are some exceptions, however: in some cases one noun alone forms a sentence, for example the question “어디 (eodi?; where?)” or the answers “사과 (sagwa; apple)” and “3 (set; three)”. Also, if a word such as an adjective or an adnoun modifies a noun, the modifier (adjective, adnoun, or adnominal phrase) and the modified noun are segmented separately.

7.2.1.1 Common noun

7.2.1.1.1 A noun followed by a particle is considered as a word segmentation unit. The noun shall be segmented from the other grammatical components in the Eojeol (word phrase).

EXAMPLE   Noun followed by a particle for a case marker
소녀가   사과를   먹었다.
sonyeo_ga   sagwa_leul   meogeotta
Noun_particle[subjective]   Noun_particle[object]   Verb
girl   apple   ate
A girl ate an apple.

7.2.1.1.2 A derivative noun with derivative affixes shall be deemed as a word segmentation unit.

EXAMPLE A   A noun with a prefix
비-금속
bigeumsok
Noun(prefix-noun)
not metal
nonmetal

EXAMPLE B   A noun with a suffix
음악-가
Eumak-ga
Noun(noun-suffix)
music artist
musician

7.2.1.1.3 A compound noun shall be deemed as a word segmentation unit.

EXAMPLE A   noun plus noun
손-목
sonmok
noun
hand-neck
wrist

EXAMPLE B   numeral plus numeral
하나-하나
hanahana
noun
one-one
One at a time

7.2.1.1.4 A word combination that is treated as a word segmentation unit may be sub-segmented for practical needs: prefix + noun, noun + suffix, noun + noun.

7.2.1.2 Proper noun

7.2.1.2.1 A Korean given name and surname should not be separated; together they should be deemed as one word segmentation unit. A name followed by ‘이 (i)’ should also be deemed as one word segmentation unit.
EXAMPLE
김-광수: (Kim, surname) + (Gwangsu, first name)
경철-이: (Gyeongcheol, name) + (i, suffix)

7.2.1.2.2 A person’s name or surname and the titles or affixes that follow it should be segmented independently.

EXAMPLE
손   교수
son   gyosu
Proper noun   noun
a surname   professor
Prof. Son

7.2.1.2.3 A nation name, country name, language name, or toponym shall be deemed as a word segmentation unit.

EXAMPLE   백두산 (Baekdusan; proper noun; Mt. Baekdu)

7.2.1.2.4 The full name of an organization, agency, or institution shall be deemed as a word segmentation unit.

EXAMPLE   국제 표준화 기구 (Gukjepyojunhwagigu; International Organization for Standardization)

7.2.1.3 Bound noun

Even though a bound noun is a function word, it should be segmented independently.

EXAMPLE
좋은   것
joeun   geot
adjective   Bound noun
good   thing
Good thing

A bound noun inside a word segmentation unit should not be segmented.

EXAMPLE
들-것
deulgeot
Noun(verb+bound noun)
lift thing
A stretcher

7.2.2 Pronoun

A pronoun should be segmented from following particles.

7.2.2.1 Personal pronoun

A personal pronoun followed by a particle is considered as a word segmentation unit.

7.2.2.1.1 General personal pronoun

A general personal pronoun shall be deemed as one segmentation unit.

EXAMPLE
내가   너를   그에게   소개하겠다.
Nae-ga   Neo-reul   Geu-ege   sogaehagetta
Pronoun_particle   Pronoun_particle   Pronoun_particle   Verb
I   you   to him   will introduce
I will introduce you to him.

7.2.2.1.2 Reflexive pronoun

A reflexive pronoun shall be deemed as one segmentation unit.

EXAMPLE
그녀는   자기를   부끄러워해야   한다.
Geunyeo-neun   Jagi-reul   buggeureoweohaeya   handa
Pronoun_particle   Reflexive pronoun_particle   Verb   Auxiliary verb
she   herself   be ashamed of   ought to
She ought to be ashamed of herself.

7.2.2.1.3 Indefinite pronoun

An indefinite pronoun shall be deemed as one segmentation unit.

EXAMPLE
아직   아무도   오지   않았다.
ajik   amu-do   oji   anatta
adverb   Indefinite pronoun_particle   Verb   Auxiliary verb
yet   anybody   come   not
Nobody has come yet.

7.2.2.2 Demonstrative pronoun

7.2.2.2.1 A compound pronoun that includes a bound noun should be deemed as a word segmentation unit.

EXAMPLE
이-것
igeot
Pronoun(adnoun-bound noun)
this thing
this

7.2.2.2.2 A demonstrative pronoun for place shall be deemed as a word segmentation unit.

EXAMPLE
저기
jeogi
Pronoun
there

7.2.2.3 Compounding of pronouns

In Korean, a pronoun can be produced by compounding pronouns. The result shall be deemed as a word segmentation unit.

EXAMPLE A
이것-저것
igeotjeogeot
Pronoun
this-that
One thing or another

EXAMPLE B
여기-저기
yeogijeogi
Pronoun
Here and there

7.2.3 Numeral

A numeral should be segmented from following particles.

7.2.3.1 Quantifier numeral

A quantifier numeral shall be deemed as a word segmentation unit.

EXAMPLE   하나 (hana; numeral; one), 삼[三] (sam; numeral; three)

7.2.3.2 Ordinal numeral

EXAMPLE   제일[第一] (jeil; numeral; first), 제오십삼[第五十三] (jeosipsam; numeral; the fifty-third)

7.2.4 Verb

A Korean verb can take more than one inflectional ending. The endings of a verb change and are attached depending on the grammatical function of the verb, e.g. 깨-뜨리-시-었-겠-군 (ggaeddeurisieotgetgun; verb; break - ending [+emphasis] - ending [+polite] - ending [+past] - ending [+conjectural] - final ending). Korean verbs are often used with auxiliary verbs and/or particles, and they are considered as a word segmentation unit.

In North Korean grammar, the ending of a verb is treated as a grammatical element named 토 (To; grammatical prefix). It follows a verb (or its equivalent), makes up the predicate form or represents grammatical meanings such as tense and honorifics, and should be segmented as a word. There are thus two methods for the same linguistic phenomenon.

7.2.4.1 Complete verb

Verbs (including single verbs and compound verbs) are considered as one segmentation unit.
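As an informal illustration of the ending chain given in 7.2.4 and of the rule in 7.2.4.1 that a complete verb forms one segmentation unit, the following Python sketch pairs each hyphen-delimited piece of the example with a label and then joins the pieces into one unit. It is not part of this draft: the names `Morpheme`, `analyse` and `as_segmentation_unit` are hypothetical, and the label strings (including "stem") are adapted from the gloss of the example for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str   # surface form of the stem or of one ending
    label: str  # illustrative label adapted from the gloss, e.g. "+past"


def analyse(hyphenated: str, labels: list[str]) -> list[Morpheme]:
    """Pair each hyphen-delimited piece with its label (hypothetical helper)."""
    forms = hyphenated.split("-")
    if len(forms) != len(labels):
        raise ValueError("number of pieces and number of labels differ")
    return [Morpheme(form, label) for form, label in zip(forms, labels)]


def as_segmentation_unit(morphemes: list[Morpheme]) -> str:
    """Join stem and endings into a single unit, following 7.2.4.1."""
    return "".join(m.form for m in morphemes)


# Verb example from 7.2.4; the hyphens mark the stem and its endings.
chain = analyse(
    "깨-뜨리-시-었-겠-군",
    ["stem", "+emphasis", "+polite", "+past", "+conjectural", "final"],
)
unit = as_segmentation_unit(chain)
```

Keeping the pieces alongside the joined unit means that an application following the other treatment mentioned in 7.2.4, in which endings are segmented as words, could reuse the same analysis and simply skip the joining step.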
7.2.4.1.1 Single verb

A single verb should be segmented from a following particle.

EXAMPLE
보-았-군-요
boatgunyo
Verb(stem+ending+ending)+particle
see [+past] [+sentence final] [+polite]
You might have seen (something).

A single verb should be segmented from a following auxiliary verb.

EXAMPLE
읽어   보다
ilgeo   boda
Complete verb   Auxiliary verb
read   try
Try to read

7.2.4.1.2 Derivative or compound verb

A derivative or compound verb should not be segmented. For example, “돌아가다” (dolagada; verb; pass away) is literally translated as ‘go+back’ (verb+verb). “바로잡다” (barojapda; verb; correct) is one word segmentation unit although it consists of ‘rightly+hold’ (adverb+verb).

7.2.4.2 Auxiliary verb

An auxiliary verb should also be segmented independently. A Korean auxiliary verb represents various semantic functions such as capability, voice, tense, aspect and so on. An auxiliary verb is only used after a verb whose ending is a special ending required by that auxiliary verb. For example, the auxiliary verb “보다” (boda; try to) takes the same inflectional endings as a main verb, but it must follow a main verb with the connective ending “어” (eo) or “고” (go).

EXAMPLE
먹어   버리다
meogeo   beorida
Complete verb   Auxiliary verb
eat   finish
Eat up

7.2.5 Adjective

Like a verb, a Korean adjective can take more than one inflectional ending. The endings of an adjective change and are attached depending on the grammatical function of the adjective. For example, in “예쁘-시-었-겠-군” (pretty - ending [+polite] - ending [+past] - ending [+conjectural] - final ending), one adjective has four endings. Korean adjectives are often used with auxiliary verbs and/or particles, and they are considered as a word segmentation unit.

7.2.5.1 Complete adjective

Adjectives (including single adjectives and compound adjectives) are considered as one segmentation unit.

7.2.5.1.1 Single adjective

A single adjective should be segmented from a following particle.
EXAMPLE
검-군-요
geomgunyo
Adjective(stem+ending)+particle
black [+sentence final] [+polite]
It is black, isn’t it?

A single adjective should be segmented from a following auxiliary adjective.

EXAMPLE
길지   않다
gilji   anta
Complete adjective   Auxiliary adjective
long   not
(It is) not long.

7.2.5.1.2 Derivative or compound adjective

A derivative or compound adjective should not be segmented.

EXAMPLE   “돌아가다” (dolagada; verb; pass away) is literally translated as ‘go+back’ (verb+verb). “바로잡다” (barojapda; verb; correct) is one word segmentation unit although it consists of ‘rightly+hold’ (adverb+verb).

7.2.5.2 Auxiliary adjective

Unlike Japanese, Korean has auxiliary adjectives. Their function and usage are very similar to those of auxiliary verbs. An auxiliary adjective is considered as one segmentation unit.

EXAMPLE
마시고   싶다
masigo   siptta
Complete verb   Auxiliary adjective
drink   want
Want to drink

7.2.6 Adnoun

An adnoun does not have an ending; it is always used as a modifier of a noun. An adnoun shall be a word segmentation unit by itself.

7.2.6.1 General adnoun

A general adnoun is segmented from the following noun.

EXAMPLE
새   책
sae   chaek
adnoun   Noun
new   book
A new book

7.2.6.2 Demonstrative adnoun

A demonstrative adnoun is segmented from the following noun.

EXAMPLE
이   사람
i   saram
adnoun   Noun
this   person
this person

7.2.6.3 Numeral adnoun

A numeral adnoun is segmented from the following noun of measure.

EXAMPLE
차   세   잔
cha   se   jan
noun   adnoun   Bound noun
tea   three   cup
Three cups of tea

7.2.7 Adverb

An adverb does not have an ending; it is always used as a modifier of a verb. An adverb shall be a word segmentation unit by itself. A compound adverb should also not be segmented.

EXAMPLE
더욱-더
deoukdeo
Adverb(adverb + adverb)
more-more
More and more

7.2.7.1 Component adverb

A component adverb is segmented from the following verb.
EXAMPLE
매우   바쁘다
maeu   babbeuda
adverb   verb
very   busy
Very busy

7.2.7.2 Sentence adverb

A sentence adverb is segmented from the following sentence.

EXAMPLE
다행히   비-가   온다.
dahaenghi   biga   onda
adverb   Noun-particle   verb
fortunately   rain   come
Fortunately it rains.

7.2.7.3 Conjunctive adverb

A conjunctive adverb is segmented from the following nominal or sentence.

EXAMPLE A
그리고   잠-이   들었다.
geurigo   jami   deureotta
adverb   Noun-particle   verb
and   sleep   get

EXAMPLE B
경제   및   문화
gyeongje   mit   munhwa
noun   adverb   noun
economy   and   culture
Economy and culture

7.2.8 Exclamation

An exclamation is considered as one segmentation unit.

EXAMPLE
아!
A!
Exclamation
Oops!

7.2.9 Particle

Like Japanese particles, Korean particles cannot be separated from the word they attach to. A particle is always used with a word such as a noun, a verb, or an adverb, but it shall be considered as one segmentation unit.

Korean particles can be divided into three main types. The first is the case particle, which serves as a case marker. The second is the auxiliary particle, which appears at the end of a phrase or a sentence and represents mood and tense. The third is the particle used for linking nominals.

In North Korean grammar, particles are treated as a grammatical element named 토 (To; grammatical prefix). They follow a noun (or its equivalent) and represent grammatical meanings such as case. They should be segmented as a word segmentation unit. There are thus two methods for the same linguistic phenomenon.

7.2.9.1 Particle as case marker

A particle used as a case marker decides the case of a nominal in the sentence.

EXAMPLE   내가 (nae_ga; I), 나를 (na_leul; me), 나의 (na_eui; my), 나에게 (na_ege; to me)

7.2.9.2 Conjunctive particle

A conjunctive particle is a marker for a conjunction between nominals or phrases.
EXAMPLE
경제-와   문화
gyeongjewa   munhwa
noun-particle   noun
economy and   culture
Economy and culture

7.2.9.3 Auxiliary particle

An auxiliary particle is used to attach an additional meaning.

EXAMPLE
나-는   소설-만   읽는다.
naneun   soseolman   ingneunda
Pronoun-particle   Noun-particle   verb
as for me   only novel   read
As for me, I read only novels.
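The examples in 7.2.9 mark the boundary between a word and its particle with a hyphen, and the clause notes that two treatments exist for the same phenomenon. As a minimal sketch only, assuming hyphen-annotated input like that used in the examples, the following Python shows how such a string could be turned into units either with the particle kept inside the unit or with the particle emitted as a unit of its own. The function name `segment` and the flag `split_particles` are hypothetical and do not come from this draft, and the sketch does not decide which treatment an application should adopt.

```python
from __future__ import annotations

import re


def segment(annotated: str, split_particles: bool = False) -> list[str]:
    """Turn a hyphen-annotated Korean string into a list of units.

    White space separates Eojeol; a hyphen marks a word/particle boundary.
    With split_particles=False the pieces of each Eojeol are joined again;
    with split_particles=True each piece becomes its own unit.
    Assumption: every hyphen is an annotation mark, so genuinely hyphenated
    non-Korean strings are out of scope for this sketch.
    """
    # Tolerate stray spaces around the hyphen, e.g. "경제- 와".
    normalized = re.sub(r"\s*-\s*", "-", annotated.strip())
    units: list[str] = []
    for chunk in normalized.split():
        chunk = chunk.rstrip(".?!,")  # sentence-final marks are not part of a unit
        if not chunk:
            continue
        pieces = [p for p in chunk.split("-") if p]
        if split_particles:
            units.extend(pieces)
        else:
            units.append("".join(pieces))
    return units


attached = segment("나-는 소설-만 읽는다.")
separate = segment("나-는 소설-만 읽는다.", split_particles=True)
```

With the auxiliary-particle example above, `attached` holds three units and `separate` holds five, which makes the difference between the two treatments easy to compare on the same annotated input.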