© ISO 2009 – All rights reserved
ISO TC 37/SC 4 N 482 rev02
Date: 2009-10-15
ISO/CD 24614-2
ISO TC 37/SC 4/WG 2
Secretariat: KATS
Language resource management — Word segmentation of text — Part 2:
Word segmentation for Chinese, Japanese and Korean
Gestion des resource des langues — Segmentation de texte — Partie 2: Segmentation des mots pour Chinois,
Japonais et Koréan
Warning
This document is not an ISO International Standard. It is distributed for review and comment. It is subject to
change without notice and may not be referred to as an International Standard.
Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of
which they are aware and to provide supporting documentation.
Document type: International Standard
Document subtype:
Document stage: (30) Committee
Document language: E
STD Version 2.1c2
Copyright notice
This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the
reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards
development process is permitted without prior permission from ISO, neither this document nor any extract
from it may be reproduced, stored or transmitted in any form for any other purpose without prior written
permission from ISO.
Requests for permission to reproduce this document for the purpose of selling it should be addressed as
shown below or to ISO's member body in the country of the requester:
[Indicate the full address, telephone number, fax number, telex number, and electronic mail address, as
appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or
SC within the framework of which the working document has been prepared.]
Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.
Violators may be prosecuted.
Contents
Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Overview – What is Word Segmentation Unit in CJK, Why is necessary, What are different from Other languages
5 Chinese word segmentation
5.1 General rules for identifying WSUs in Chinese text
5.1.1 Punctuation and white space
5.1.2 Word
5.1.3 Derivation
5.1.4 Phrasal compound
5.1.5 Idiom
5.1.6 Idiomatic expression, proverb and familiar quotation
5.1.7 Abbreviation
5.1.8 Suffixation of the nonsyllabic 儿(r)
5.1.9 Transliterated loanword
5.1.10 Non-Chinese-character strings
5.1.11 Internal structure of WSUs
5.2 Typology of WSUs in Chinese
5.2.1 Noun
5.2.2 Verb
5.2.3 Adjective
5.2.4 Pronoun
5.2.5 Numeral
5.2.6 Measure word
5.2.7 Adverb
5.2.8 Preposition
5.2.9 Conjunction
5.2.10 Auxiliary word
5.2.11 Modal word
5.2.12 Exclamation word
5.2.13 Imitative word
6 Japanese word segmentation
6.1 General rules for identifying WSUs in Japanese text
6.1.1 Punctuation
6.1.2 Noun
6.1.3 Verbs
6.1.4 Adjectives
6.1.5 Adnominal nouns
6.1.6 Adverbs
6.1.7 Conjunctions
6.1.8 Exclamations
6.1.9 Particles
6.1.10 Auxiliary verbs
6.1.11 Idioms and proverbs
6.1.12 Abbreviations
6.2 Typology of WSUs in Japanese
6.2.1 Nouns (名詞; Meishi)
6.2.2 Verbs (動詞; Doushi)
6.2.3 Adjectives (形容詞/形容動詞; Keiyoushi/Keiyoudoushi)
6.2.4 Adnominal nouns (連体詞; Rentaishi)
6.2.5 Adverbs (副詞; Fukushi)
6.2.6 Conjunctions (接続詞; Setsuzokushi)
6.2.7 Exclamations (感動詞; Kandoushi)
6.2.8 Particles (助詞; Joshi)
6.2.9 Auxiliary Verbs (助動詞; Jodoushi)
7 Korean word segmentation
7.1 Typology of word segmentation units in Korea
7.1.1 Punctuation and white space
7.1.2 Word
7.1.3 Multi-word expression
7.1.4 Non-Korean-character strings
7.2 Typology of WSUs in Korean
7.2.1 Noun
7.2.2 Pronoun
7.2.3 Numeral
7.2.4 Verb
7.2.5 Adjective
7.2.6 Adnoun
7.2.7 Adverb
7.2.8 Exclamation
7.2.9 Particle
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies
(ISO member bodies). The work of preparing International Standards is normally carried out through ISO
technical committees. Each member body interested in a subject for which a technical committee has been
established has the right to be represented on that committee. International organizations, governmental and
non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the
International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of technical committees is to prepare International Standards. Draft International Standards
adopted by the technical committees are circulated to the member bodies for voting. Publication as an
International Standard requires approval by at least 75 % of the member bodies casting a vote.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO shall not be held responsible for identifying any or all such patent rights.
ISO 24614-2 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content
resources, Subcommittee SC 4, Language resource management.
This second/third/... edition cancels and replaces the first/second/... edition (), [clause(s) / subclause(s) /
table(s) / figure(s) / annex(es)] of which [has / have] been technically revised.
ISO 24614 consists of the following parts, under the general title Language resource management — Word
segmentation of text:

Part 1: Basic concepts and general principles

Part 2: Word segmentation for Chinese, Japanese and Korean
Introduction
Word segmentation remains a challenging technology in natural language processing for languages in which
word boundaries in text cannot be fully identified by typographic properties (like spaces in English), for
example, Chinese, Japanese, Korean, Thai, Vietnamese, and Mongolian.
Part 2 focuses on word segmentation for Chinese, Japanese, and Korean. These three languages are similar
in some aspects and different in others. With respect to the use of Chinese characters, they are similar: for
instance, they all have many nouns that consist of Chinese characters, especially two-character nouns such
as “討論 (discussion)” and “同意 (agreement)”. With respect to typography, there is no spacing in Chinese and
Japanese text, while Korean text contains fragments (Eojeols) separated by spaces. With respect to
language type, Chinese is an isolating language, whereas Japanese and Korean are agglutinative
languages: a noun can be followed by a series of particles, and a verb can be used with several endings.
Korean examples are “깨/뜨리/시/었/겠/군/요” (break [+emphasis] [+polite] [+past] [+conjectural] final ending [+polite])
and “학교/에서/부터/는” (as for 'from at school'). Japanese examples are “学校へ (to school)” (学校/へ, school [+ particle])
and “行きました (went)” (行き/まし/た, go [+ auxiliary verb (polite)] [+ auxiliary verb (past)]).
Due to the fact that these three languages share similarities in words composed of Chinese characters,
general rules for identifying word segmentation units (WSUs) in Chinese text can also be applied to the
processing for Japanese and Korean to some extent.
In practice, there is great concern about what the right outcome of word segmentation applied to a text
should be. Standards are needed to pursue consistency in word segmentation within and among texts to the
maximum extent, so as to meet the requirements of a variety of applications in language information
processing, both mono-lingual and multi-lingual. The applications of the standards include, but are not
limited to, natural language processing, information retrieval, search engines, question answering, machine
translation and machine-aided translation, pre-processing for text-to-speech, post-processing for speech
recognition, OCR and other character input methods, proofreading, digital libraries, terminology and
ontology, the semantic web, eBusiness and eCommerce, content management, and natural-language-based
computer-aided eLearning (including language learning and second language learning). They
shall also be helpful for orthographic processing (Romanization) of text in some languages, such as Chinese.
COMMITTEE DRAFT
ISO/CD 24614-2
Language resource management — Word segmentation of
text — Part 2: Word segmentation for Chinese, Japanese and
Korean
1
Scope
The principles for word segmentation given in Part 1 are applied to Chinese, Japanese and Korean. Their
application is standardized for the purpose of recognizing the units that will be used in later
syntactic processing. There are linguistic annotation standards in ISO: MAF (morpho-syntactic annotation
framework), SynAF (syntactic annotation framework), and others in ISO/TC37/SC4. These standards describe
annotation methods, but they do not define the meaningful units of word segmentation: MAF and SynAF
annotate each linguistic layer in a standardized way for further interoperability. The word segmentation
standard recommends what linguistic units should be registered in a lexicon, and what type of
word sequences, called “word segmentation units (WSUs)”, should be recognized before syntactic processing. In
the context of multi-lingual word segmentation, if a word sequence forms one WSU in one language, this is an
indication that the corresponding word sequence may be recognized as a WSU in other languages.
2
Normative references
The following referenced documents are indispensable for the application of this document. For dated
references, only the edition cited applies. For undated references, the latest edition of the referenced
document (including any amendments) applies.
ISO 24614-1, Language resource management — Word segmentation of text — Part 1: Basic concepts and
general principles
3
Terms and definitions
For the purposes of this document, the terms and definitions given in ISO 24614-1 and the following apply.
3.1
phrase
component of a sentence that carries a grammatical function
3.2
bunsetsu
phrase (3.1) in Japanese text without modifying expressions
EXAMPLE
The sentence “私は学校へ早く行きました。” (I went to school early) consists of four Bunsetsu: 私は
(watashiwa), 学校へ (gakkoue), 早く (hayaku) and 行きました (ikimashita). “私 (watashi)” is a pronoun, “は (wa)” is a
particle, “学校 (gakkou)” is a noun, “へ (e)” is a particle, “早く (hayaku)” is an adjective in adverbial usage, “行き
(iki)” is a verbal stem followed by “まし (mashi)”, which is an auxiliary verb for politeness, and “た (ta)” is an
auxiliary verb for the past tense.
NOTE
A Bunsetsu normally consists of a noun plus its particle(s) or a verb plus its ending(s), auxiliary verb(s), and
particle(s) as shown in the example above.
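The Bunsetsu example above can be written down as data. The following is only a minimal sketch: the names `BUNSETSU` and `surface` are hypothetical and not part of this standard, and the part-of-speech labels are copied from the example.

```python
# Illustrative encoding of the Bunsetsu example above (not a normative format).
# Each Bunsetsu is a list of (surface form, part of speech) pairs.
BUNSETSU = [
    [("私", "pronoun"), ("は", "particle")],
    [("学校", "noun"), ("へ", "particle")],
    [("早く", "adjective in adverbial usage")],
    [("行き", "verbal stem"),
     ("まし", "auxiliary verb (politeness)"),
     ("た", "auxiliary verb (past tense)")],
]


def surface(bunsetsu):
    """Concatenate the surface forms of the units inside one Bunsetsu."""
    return "".join(form for form, _pos in bunsetsu)


# One surface string per Bunsetsu, and the sentence without its final "。".
phrases = [surface(b) for b in BUNSETSU]
sentence = "".join(phrases)
print(phrases)
```

Because Japanese text has no spacing, joining the four Bunsetsu reproduces the sentence without any separator.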
3.3
eojeol
phrase (3.1) in Korean text without modifying expressions separated by white space
EXAMPLE
Given the sentence “나는 학교에 일찍 갔다 (I went to school early)”, “나 (I)” is a pronoun, “는” is a particle,
“학교 (hakgyo; school)” is a noun, “에” is a particle, “일찍 (early)” is an adverb, and “가 (go)” is a verbal stem followed by
the endings “았” and “다”. The sentence contains four Eojeols: “나는 (naneun)”, “학교에 (hakgyoe)”, “일찍 (iljjik)” and
“갔다 (gatta)”.
NOTE1
An Eojeol normally consists of a noun plus its particle(s) or a verb plus its ending(s), auxiliary verb(s), and
particle(s) as shown in the example above.
NOTE2 An Eojeol is also called a ‘word phrase’. An Eojeol (word phrase) consists of one or more word forms.
An auxiliary word can be concatenated to the word unit preceding it. E.g. 살아있다 (to keep alive) is composed of
two word forms: 살아 (to live) and 있다 (keep).
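Since Eojeols are separated by white space, they can be recovered from the example sentence with a plain split. This is a minimal sketch with hypothetical variable names. It also shows why the units inside an Eojeol need real morphological analysis: the stem and endings named in the example do not simply concatenate to the written form.

```python
# Eojeols are delimited by white space, so a plain split recovers them.
sentence = "나는 학교에 일찍 갔다"
eojeols = sentence.split()
print(eojeols)

# The units inside an Eojeol are not recoverable this way. The example
# analyses the last Eojeol as the stem "가" plus the endings "았" and "다",
# but their concatenation differs from the contracted written form "갔다".
analysed = ["가", "았", "다"]
is_plain_concatenation = "".join(analysed) == eojeols[-1]
print(is_plain_concatenation)
```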
3.4
particle
part-of-speech in Japanese and Korean to represent grammatical function or trivial meaning.
EXAMPLE 1
A Japanese particle is not used independently. A word followed by a particle can constitute a “Bunsetsu”.
Its function is to mark case, to indicate correlation with another phrase, to attach some minor meaning, and so on.
Like a suffix, it attaches to words and has no inflectional ending; however, it is not a suffix but a
part of speech. A Japanese particle attaches not only to words but also to a clause or even a sentence. For example, “寒い
ね?” means “It is very cold, isn’t it?”. In this example, the Japanese particle “ね (ne)” corresponds to “isn’t it?”.
EXAMPLE 2
A Korean particle behaves in the same way. A particle is not used independently, and a word followed by a particle
can constitute a “Bunsetsu” or an “Eojeol”. A particle can be attached not only to words but also to a clause or even
a sentence. For example, “寒いね?” in Japanese and “매우 춥지요?” in Korean both mean “It is very cold, isn’t it?”. In this
example, the Japanese particle “ね (ne)” and the Korean particle “요 (yo)” correspond to “isn’t it?”.
NOTE
In North Korean grammar, a particle is treated as an affix that freely agglutinates after a nominal and performs
a grammatical function. Particles and endings are collectively called ‘토 (Tho)’ in North Korean grammar.
3.5.
ending
agglutinative part of verb, adjective and auxiliary verb in Japanese and Korean
NOTE
Verbs, adjectives and auxiliary verbs have agglutinative forms at their ends. These agglutinative forms
are defined as endings. For example, the endings of a verb include a negation form, an adverbial form, a base form, an
adnominal form, an assumption form and an imperative form.
3.6.
Measure word
part-of-speech in Chinese to define, along with numbers, the quantity of a given object, or to identify specific
objects with demonstrative pronouns such as "this" and "that".
NOTE1
While English speakers say "one person" or "this person", Chinese speakers say respectively 一个人 (yi ge
ren; numeral + measure word + noun; one person) or 这个人 (zhe ge ren; demonstrative pronoun + measure word +
noun; this person), where “个” (ge) is a measure word.
NOTE2
There is a set of "verbal measure words" used for counting the number of times an action occurs, rather than
counting a number of items. For example, in the sentence “我去过三次北京” (wo qu guo san ci Beijing; Pronoun + verb +
Auxiliary word + numeral + measure word + proper noun; I have been to Beijing three times), “次”(ci) functions as a verbal
measure word to modify the verb “去”(qu).
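The two notes above describe one positional pattern: a measure word directly follows a numeral or a demonstrative pronoun. The sketch below encodes the three expressions as tagged sequences and reads that pattern back. The function name `measure_words` and the tuple layout are hypothetical; the tags are taken from the notes.

```python
# Tagged versions of the expressions in NOTE1 and NOTE2, as (surface, POS) pairs.
ONE_PERSON = [("一", "numeral"), ("个", "measure word"), ("人", "noun")]
THIS_PERSON = [("这", "demonstrative pronoun"), ("个", "measure word"), ("人", "noun")]
THREE_TIMES = [
    ("我", "pronoun"), ("去", "verb"), ("过", "auxiliary word"),
    ("三", "numeral"), ("次", "measure word"), ("北京", "proper noun"),
]


def measure_words(tagged):
    """Return (preceding unit, measure word) pairs found in a tagged sequence."""
    pairs = []
    for i, (form, pos) in enumerate(tagged):
        if pos == "measure word" and i > 0:
            pairs.append((tagged[i - 1][0], form))
    return pairs


print(measure_words(ONE_PERSON))
print(measure_words(THREE_TIMES))
```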
4
Overview – What is Word Segmentation Unit in CJK, Why is necessary, What are
different from Other languages
Word segmentation is the process of dividing a sentence into meaningful units. For example, “the White
House” consists of three words but designates one concept, the residence of the President of the USA. “Pork” in
English is translated into two words, “pig meat”, in Chinese, Korean and Japanese: 猪肉(rom…;), 돼지-고기 and
豚肉 respectively. In Japanese and Korean, because an auxiliary verb must follow a main verb, the two compose
one word segmentation unit, such as “tabetai” and “meoggo sipda” respectively, both meaning “want to eat”.
A meaningful unit that is useful for further syntactic processing thus defines the word segmentation unit.
Such a unit can be an entry in a lexicon or in any other type of storage whose entries are useful for syntactic
processing in natural language processing. A word segmentation unit is more or less fixed, and there
is no syntactic interference inside the word segmentation unit. In practice, it is useful for
further syntactic processing because it cannot be decomposed by syntactic processing and it occurs
frequently in corpora.
If a word is derived from Chinese characters, the three languages have common properties. If a noun
consists of two or more Chinese characters, it is one word segmentation unit if the characters are “tightly
combined and steadily used”, according to the principles of Part 1. For example, “each country” in English is not a
word segmentation unit, and neither is its translation “各|国”. If the last character is productive in a limited manner, it forms
a word segmentation unit with the preceding word, for example, “東京都” (Tokyo Metropolis), “8 月” (August)
or “加速器” (accelerator).
A negation character of a verb or adjective is segmented independently in Chinese, but it forms one word
segmentation unit with the verb or adjective in Japanese. For example, “yasashikunai” (優しく無い, not kind) is one word segmentation
unit in Japanese, but “不|写” (not to write), “不|能” (cannot), “没|研究” (did not research) and “未|完成” (not
completed) are segmented independently in Chinese. In Korean, “chinjeolhaji anhta” (친절하지 않다, not
kind) has one space inserted between two eojeols, but it could be one word segmentation unit: “ji anhta” negates the
adjectival stem “chinjeolha”.
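The examples in this clause mark a boundary between word segmentation units with “|”. As a minimal sketch, that notation can be read back into a list of units. The helper name `split_wsus` is hypothetical; it only parses the notation and does not perform segmentation itself.

```python
def split_wsus(marked):
    """Split a string written in the "|" boundary notation into its units."""
    return [part.strip() for part in marked.split("|") if part.strip()]


# Chinese: the negation character is a unit of its own.
chinese = ["不|写", "不|能", "没|研究", "未|完成"]
for expression in chinese:
    print(split_wsus(expression))

# Japanese: the negated adjective carries no boundary mark, so it is one unit.
japanese = split_wsus("優しく無い")
print(japanese)
```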
The motivation of the word segmentation standard is to recommend what word segmentation units
should be registered in a type of lexicon, where “lexicon” means not the lexicon of linguistics but any kind of practical
indexed container for word segmentation units. For this reason, its principles may conflict with one another. For example, the
principles of unproductivity, frequency and granularity could conflict because they define a word
segmentation unit from different perspectives.
The word segmentation unit structure of nouns derived from Chinese characters can be shared among the three
languages, but not in its entirety. On the other hand, Korean and Japanese have things in common:
some Korean word endings and Japanese auxiliary verbs have the same functions. Word segmentation in
each language differs somewhat according to existing word segmentation regulations, which may even violate
one or more principles of word segmentation. This document specifies the application of Part 1 to three
languages: Chinese, Japanese and Korean. It is a starting point for recommending a more synchronized
concept of the word segmentation unit in a multi-lingual environment.
The concept of the “word segmentation unit” is intended to broaden the view of what can be registered in a
lexicon for natural language processing, without much linguistic representation.
POS | Chinese | Japanese | Korean (south) | Korean (north)
Noun | ○(名词) | ○(名詞) | ○(명사 名詞) | ○(명사 名詞)
Verb | ○(动词) | ○(動詞) | ○(동사 動詞) | ○(동사 動詞)
Adjective | ○(形容词) | ○(形容詞 and 形容動詞) | ○(형용사 形容詞) | ○(형용사 形容詞)
Numeral | ○(数词) | Subcategory of Noun (名詞[数詞]) | ○(수사 數詞) | ○(수사 數詞)
Adverb | ○(副词) | ○(副詞) | ○(부사 副詞) | ○(부사 副詞)
Exclamation | ○(叹词) | ○(感動詞) | ○(감탄사 感歎詞) | ○(감동사 感動詞)
Pronoun | ○(代词) | Subcategory of Noun (名詞[代名詞]) | ○(대명사 代名詞) | ○(대명사 代名詞)
Auxiliary word | ○(助词) | × | × | ×
Measure word | ○(量词) | Noun or Adverb (名詞/副詞 [序数詞]) | Noun or Adverb (명사 名詞/부사 副詞 [序數詞]) | Noun or Adverb (명사 名詞/부사 副詞 [序數詞])
Modal word | ○(语气词) | × | × | ×
Imitative word | ○(拟声词) | Part of Adverb (擬態語・擬音語) | Part of Adverb (擬態語・擬音語) | Part of Adverb (擬態語・擬音語)
Preposition | ○(介词) | × | × | ×
Conjunction | ○(连词) | ○(接続詞) | ○(접속부사 接続副詞) | ○(이음부사 --副詞)
Particle | × | ○(助詞) | ○(조사 助詞) | Treated as grammatical affix named 토(Tho)
Adnoun | × | ○(連体詞) | ○(관형사 冠形詞) | ○(관형사 冠形詞)
Auxiliary verb | Subcategory of Verb (能愿动词) | ○(助動詞) | Subcategory of Verb (보조동사 補助動詞); Subcategory of Adjective (보조형용사 補助形容詞) | Subcategory of Verb (보조동사 補助動詞); Subcategory of Adjective (보조형용사 補助形容詞)
Differentiating word | ○(区别词) | × | × | ×
This standard adopts a notation which uses the underline to indicate the presence of a WSU under
consideration.
5
Chinese word segmentation1)
5.1
General rules for identifying WSUs in Chinese text
5.1.1
Punctuation and white space
Punctuation marks and white space are, in general, natural separators of WSUs, although in some cases
a punctuation mark can be part of a WSU, such as “·” in “诺姆·乔姆斯基” (nuo mu · qiao mu si ji; Noam
Chomsky).
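A minimal sketch of this rule follows. It splits text at white space and at a small, assumed set of punctuation marks, and deliberately leaves “·” out of that set so that a transliterated name stays in one piece. The separator list and the function name `split_on_separators` are illustrative assumptions, not something this standard defines. The input string is made up from expressions used in this document, and the resulting chunks are only candidate spans for further segmentation.

```python
import re

# Assumed, non-exhaustive set of separators: white space plus some common
# full-width and ASCII punctuation marks. The middle dot "·" is left out on
# purpose, because it can be part of a WSU.
SEPARATORS = r"[\s，。！？；：、,.!?;:]+"


def split_on_separators(text):
    """Split text at separators, dropping empty chunks."""
    return [chunk for chunk in re.split(SEPARATORS, text) if chunk]


chunks = split_on_separators("诺姆·乔姆斯基 发展，可爱。")
print(chunks)
```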
5.1.2
Word
Words that are clearly justified by linguistic criteria, mainly consisting of two, three or four characters, are WSUs.
EXAMPLE
发展 (fa zhan; develop)
可爱 (ke ai; lovely)
现代化 (xian dai hua; modernize)
自行车 (zi xing che; bike)
毛泽东 (mao ze dong; Mao Zedong)
资本主义 (zi ben zhu yi; capitalism)
操作系统 (cao zuo xi tong; operating system)
5.1.3
Derivation
The results of adding a series of prefixes or suffixes to a word are WSUs.
EXAMPLE
科学      家
ke xue    jia
science   -er
noun      suffix
scientist

物理      学       家
wu li     xue      jia
physics   -ology   -er
noun      suffix   suffix
physicist
5.1.4
Phrasal compound
Phrasal compounds that are frequently used in text and mainly consist of two or three characters are WSUs.
EXAMPLE
猪      肉
zhu     rou
pig     meat
noun    noun
pork

发电                     厂
fa dian                  chang
to generate electricity  plant
verb                     noun
power plant
1)
Most examples laid out as columns in the following clauses are formatted as follows.
First line: Chinese expression
Second line: Romanization
Third line: part of speech
Fourth line: English translation for each component
Fifth line: English translation for the whole expression
When any part is not necessary in an example, there will be a blank line.
5.1.5
Idiom
Idioms, mainly consisting of four characters, are WSUs.
EXAMPLE
胸有成竹 (xiong you cheng zhu; have a well-thought-out plan)
欣欣向荣 (xin xin xiang rong; prosperous)
5.1.6
Idiomatic expression, proverb and familiar quotation
Idiomatic expressions, proverbs and familiar quotations are WSUs if they are frequently used in text.
EXAMPLE
对不起 (dui bu qi; sorry)
春夏秋冬 (chun xia qiu dong; spring summer autumn winter)
由此可见 (you ci ke jian; this shows)

不管 三 七 二 十 一
bu guan san qi er shi yi
no matter three seven two ten one
no matter what happens

失败 是 成功 之 母
shi bai shi cheng gong zhi mu
failure is success of mother
Failure is the mother of success.
5.1.7
Abbreviation
Abbreviations are WSUs.
EXAMPLE
科技 (ke ji; science and technology)
工农业 (gong nong ye; industry and agriculture)
5.1.8
Suffixation of the nonsyllabic 儿(r)
The results of suffixation of the nonsyllabic 儿(r) to nouns and sometimes verbs are WSUs.
EXAMPLE
花儿
huar
flower r
noun r
flower

玩儿
wanr
play r
verb r
play

悄悄儿
qiaoqiaor
quietly r
adverb r
quietly
5.1.9
Transliterated loanword
Transliterated loanwords are WSUs.
EXAMPLE
吉普 (ji pu; jeep)
巧克力 (qiao ke li; chocolate)
5.1.10 Non-Chinese-character strings
Non-Chinese-character strings including foreign language characters, Arabic numerals, math symbols,
chemical symbols etc. are treated as WSUs by keeping their original forms.
EXAMPLE
CAD CO := cm 1298 3.1415926
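The rule in 5.1.10 can be illustrated with a minimal sketch. It assumes that "Chinese character" can be approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF, and the function name non_chinese_wsus is hypothetical; neither is defined by this standard. Each maximal run of non-Chinese, non-space characters is returned unchanged.

```python
import re

# Assumption: a "Chinese character" is approximated by U+4E00..U+9FFF.
# Everything else that is not white space is kept in its original form.
_NON_CHINESE_RUN = re.compile(r"[^\s\u4e00-\u9fff]+")


def non_chinese_wsus(text):
    """Return each maximal run of non-Chinese, non-space characters as one unit."""
    return _NON_CHINESE_RUN.findall(text)


if __name__ == "__main__":
    # The EXAMPLE line of 5.1.10, split on white space only.
    print(non_chinese_wsus("CAD CO := cm 1298 3.1415926"))
```

Note that this sketch would also keep Chinese punctuation marks inside a run; a fuller treatment would handle punctuation according to 5.1.1.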
5.1.11 Internal structure of WSUs
A WSU may have an internal structure which organizes several WSUs hierarchically. Such a structure can be
manipulated at different granularity level in the process of word segmentation according to the need of various
applications.
EXAMPLE
chocolate: WSU(巧克力)
pork: WSU(WSU(猪) WSU(肉))
physicist: WSU(WSU(WSU(物理) WSU(学)) WSU(家))
Mao Zedong: WSU(WSU(毛) WSU(泽东))
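The hierarchical structure above can be modelled as a small tree, and a granularity level can be chosen when reading it out. The following is only a sketch: the class name WSU, the field names and the depth parameter are illustrative assumptions, not part of this standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WSU:
    """A word segmentation unit with optional nested sub-units (hypothetical model)."""
    text: str
    parts: List["WSU"] = field(default_factory=list)

    def at_depth(self, depth):
        """Flatten the tree: depth 0 yields the whole unit, larger depths yield finer units."""
        if depth == 0 or not self.parts:
            return [self.text]
        units = []
        for part in self.parts:
            units.extend(part.at_depth(depth - 1))
        return units


# physicist: WSU(WSU(WSU(物理) WSU(学)) WSU(家))
physicist = WSU("物理学家", [WSU("物理学", [WSU("物理"), WSU("学")]), WSU("家")])

if __name__ == "__main__":
    for depth in range(3):
        print(depth, physicist.at_depth(depth))
```

An application that needs coarse units would read the tree at depth 0, while one that needs finer units would use a larger depth.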
5.2
Typology of WSUs in Chinese
The treatment for some specific WSU-related phenomena is addressed in this Clause (note: the phenomena
that can be clearly treated by Clause 5.1 will not be stated here). For clarity of description, the specification is
organized under 14 word categories: noun, verb, adjective, differentiating word, pronoun, numeral, measure
word, adverb, preposition, conjunction, auxiliary word, modal word, exclamation, and imitative word.
5.2.1
Noun
5.2.1.1
Common noun
(1) The nominal expression “adjective + noun” is segmented unless the meaning of the expression is not the
sum of its parts.
EXAMPLE
小 床
xiao chuang
small bed
adjective, noun
small bed

小 媳妇
xiao xi fu
small wife
adjective, noun
young wife
(2) The localizer word (a subcategory of noun) is segmented.
EXAMPLE
桌子 上
zhuo zi shang
table above
noun, localizer word
on the table

长江 以北
chang jiang yi bei
the Yangtze River, the north
noun, localizer word
to the north of the Yangtze River
(3) The plural suffix “们” (men; -s) is segmented.
EXAMPLE
朋友 们
peng you men
friend -s
noun -s
friends
However, the following ones are treated as WSUs:
人们 (ren men; people)
哥儿们 (ger men; pals)
爷儿们 (yer men; guys)
(4) The time expression is treated as follows:
a. January-December and Monday-Sunday are WSUs.
EXAMPLE
五月 (wu yue; five month; May)
元月 (yuan yue; first month; January)
3月 (3 yue; 3 month; March)
星期 日 (xing qi ri; week + day; Sunday)
礼拜 三 (li bai san; week three; Wednesday)
b. The time measure words “year, day, hour, minute, second” are segmented.
EXAMPLE
1988 年 3 月 15 日
1988 nian 3 yue 15 ri
1988 year 3 month 15 day
March 15th, 1988

11 时 42 分 8 秒
11 shi 42 fen 8 miao
11 hour 42 minute 8 second
forty-two minutes and eight seconds past eleven
c. The results of “前、后、上、下、大前、大后” (before last, after next, last, next, before before last, after
after next) each combined directly with a time noun or a time measure word are WSUs.
EXAMPLE
前天 (qian tian; before last, day; the day before yesterday)
后年 (hou nian; after next, year; the year after next)
上星期 (shang xingqi; last, week; last week)
下月 (xia yue; next, month; next month)
大前天 (da qian tian; before before last, day; three days ago)
大后年 (da hou nian; after after next, year; three years later)
d. The time nouns “初一” (first day of a month in the Chinese lunar calendar) to “初十” (tenth day of a month
in the Chinese lunar calendar) are WSUs.
5.2.1.2
Proper noun
5.2.1.2.1
Personal name and title
(1) Full personal names of the Han nationality are WSUs, each with an internal structure consisting of the
surname and the given name as two WSUs.
EXAMPLE
张 胜利
zhang sheng li
surname, given name
Zhang Shengli

欧阳 志华
ou yang zhi hua
surname, given name
Ouyang Zhihua
(2) Full personal names of other nationalities or foreign countries are WSUs; each may have an internal
structure in accordance with its own tradition.
EXAMPLE
牛顿 (niu dun; Newton)
小林 多喜二 (xiao lin duo xi er; Kobayashi Takiji)
(3) The expression “surname + title” is segmented.
EXAMPLE
张 教授
zhang jiao shou
surname professor
professor Zhang

王 部长
wang bu zhang
surname minister
minister Wang

李 师傅
li shi fu
surname master
master Li
(4) The expressions “one-character honorific title + surname” or “surname + one-character title” are WSUs.
EXAMPLE
老张
lao zhang
one-character honorific title, surname
old Zhang

陈总
chen zong
surname, one-character title
manager Chen
(5) The titles for kinship regarding rankings are WSUs each with an internal structure.
EXAMPLE
三 叔
san shu
three uncle
the third younger uncle

大 女儿
da nü er
big daughter
the eldest daughter
5.2.1.2.2
Place name and nationality name
“族、省、市、州、县、乡、区、江、河、山” (nationality, province, city, prefecture, county, town, district,
river, mountain) shall be segmented from the nationality name or place name; however, a nationality
name or place name that contains only two Chinese characters shall not be segmented.
EXAMPLE
汉族 (the Han nationality)
哈萨克 族 (the Kazakh nationality)
北京 市 (Beijing Municipality)
浙江 省 (Zhejiang Province)
正定 县 (Zhengding County)
长江 (Yangtze River)
忻县 (Xin County)
A proper noun whose parts cannot exist independently and keep their original meaning shall not be segmented.
EXAMPLE
牡丹江(Mudan River) 横断山(Hengduan Mountains)
Street, road, village and town names, and ocean and sea names, shall be deemed as segmentation units.
EXAMPLE
长安街(Chang’an Avenue) 学院路(Xueyuan Road) 周口店(Zhoukoudian)
刘家村(Liujiacun Village) 大西洋(Atlantic Ocean) 地中海(Mediterranean Sea)
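The place-name rules in this subclause can be sketched as follows. This is an illustration under stated assumptions: the function name segment_place_name and the unsplittable parameter are hypothetical, and the exception list holds only the two names given in the examples. A real system would need a lexicon to decide which names cannot be split.

```python
# Generic suffixes listed in this subclause.
SUFFIXES = set("族省市州县乡区江河山")


def segment_place_name(name, unsplittable=frozenset({"牡丹江", "横断山"})):
    """Split a generic suffix off a place or nationality name.

    Names of two characters or fewer, names in the (assumed) exception
    lexicon, and names without a listed suffix are kept whole.
    """
    if name in unsplittable or len(name) <= 2 or name[-1] not in SUFFIXES:
        return [name]
    return [name[:-1], name[-1]]


if __name__ == "__main__":
    for n in ["汉族", "哈萨克族", "北京市", "长江", "牡丹江", "长安街"]:
        print(n, segment_place_name(n))
```

Street, village and sea names such as 长安街 or 地中海 stay whole here simply because their final characters are not in the suffix list.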
5.2.1.2.3
Other type of proper names
• A full country name shall be deemed as one segmentation unit.
EXAMPLE
中华人民共和国(People's Republic of China) 大不列颠及北爱尔兰联合王国(United Kingdom)
• The full name of an organization, agency or institution shall be segmented in accordance with the word
segmentation units constituting the full name.
EXAMPLE
联合国 教科文 组织(United Nations Educational, Scientific, and Cultural Organization)
中国 共产党(Communist Party of China)
• Trademarks, product types and product series shall be segmented from the common noun.
EXAMPLE
牡丹 II 型 Peony III 永久 牌(Yongjiu Brand) 中华 烟(Zhonghua Cigarette)
5.2.2
Verb
5.2.2.1
Various forms of reiterative verbs
a)
Single-character verb reiterated shall be deemed as one segmentation unit.
EXAMPLE 看看(look at) 动动(move)
b)
Two-character verb reiterated in the form of “AABB” shall be deemed as one segmentation unit.
EXAMPLE 来来往往(come and go) 拉拉扯扯(drag)
c)
Verb reiterated in the form of “AAB, ABAB” shall be segmented.
EXAMPLE 说说 看(try to say) 研究 研究(have a discussion)
d)
Verb reiterated in the form of “A 一 A, A 了 A, A 了一 A” shall be segmented.
EXAMPLE 谈 一 谈(have a good chat) 想 一 想(think carefully)
读 一 读(to read)
想 了 想(think it over)
想 了 一 想(think it over)
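The reiteration patterns of 5.2.2.1 can be checked purely by character shape. The sketch below is an illustration only: the function name is hypothetical, and it does not verify that the input is actually a verb, which a real segmenter would have to do with a lexicon.

```python
def segment_reduplicated_verb(s):
    """Segment a reiterated verb by its character pattern, or return None.

    AA and AABB stay as one unit; AAB, ABAB, A一A, A了A and A了一A are split.
    """
    n = len(s)
    if n == 3 and s[0] == s[2] and s[1] in "一了":      # A 一 A / A 了 A
        return [s[0], s[1], s[2]]
    if n == 4 and s[0] == s[3] and s[1:3] == "了一":    # A 了 一 A
        return [s[0], "了", "一", s[3]]
    if n == 2 and s[0] == s[1]:                          # AA
        return [s]
    if n == 4 and s[0] == s[1] and s[2] == s[3] and s[0] != s[2]:   # AABB
        return [s]
    if n == 4 and s[:2] == s[2:] and s[0] != s[1]:       # ABAB
        return [s[:2], s[2:]]
    if n == 3 and s[0] == s[1] and s[1] != s[2]:         # AAB
        return [s[:2], s[2]]
    return None


if __name__ == "__main__":
    for w in ["看看", "来来往往", "说说看", "研究研究", "想一想", "想了一想"]:
        print(w, segment_reduplicated_verb(w))
```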
5.2.2.2
Verb delimited by a negative meaning Chinese character
The negative meaning Chinese character before the verb shall be segmented independently.
EXAMPLE
不 写(not to write) 不 能(cannot)
没 研究(did not research)
未 完成(not completed)
5.2.2.3
"Verb + a negative meaning Chinese character + the same verb" structure
Such a structure, when it indicates a question, shall be segmented.
EXAMPLE
说 不 说(say or not say)?
看 不 看(see or not see)? 相信 不 相信(believe or not believe)?
Yet the brachylogical form shall not be segmented.
EXAMPLE
相不相信(believe or not)
5.2.2.4
Verb–object structure and verb collocations
A verb–object structural word, or a compact and stably used verb phrase, shall not be segmented.
EXAMPLE
开会(meeting) 跳舞(dancing)
解决吃饭问题(to resolve the problem of meals)
孩子该念书了(it’s time for the child to go to school)
A loosely bound verb–object phrase, or one of many phrases sharing the same structure, shall be segmented.
EXAMPLE
吃 鱼(eat fish)
学 滑冰(learn skating)
写 信(write a letter); 写 文章(write an article); 写 论文(write a thesis); 写 书(write a book); …
A verb–object structural word or phrase with other elements inserted shall be segmented.
EXAMPLE
吃 两 顿 饭(have two meals)
跳 新疆 舞(dance a Xinjiang dance)
5.2.2.5
Verb–complement structural word and stably used Verb-complement phrase
A two-character verb–complement structural word, or a stably used two-character verb–complement phrase,
shall not be segmented.
EXAMPLE
打倒(down with) 提高(improve) 加长(lengthen) 做好(do well in)
A verb–complement phrase with a “2 + 1” or “1 + 2” character structure shall be segmented; a verb–complement phrase of more than three characters shall also be segmented.
EXAMPLE
整理 好(clean up) 说 清楚(speak clearly)
解释 清楚(explain clearly)
A verb–complement word or phrase with “得” or “不” inserted shall be segmented.
EXAMPLE
打 得 倒(able to knock down) 提 不 高(unable to improve)
5.2.2.6
Adverb delimited verb
An adverb–verb word, or a compact and stably used adverb–verb phrase, shall not be segmented.
EXAMPLE
胡闹(make trouble)
瞎说(talk nonsense)
死记(learn by rote)
早 来(come early) 晚 走(go late)
重 说(retell)
Compound directional verb shall be deemed as segmentation unit.
EXAMPLE
出去(go out) 进来(come in)
However, a compound directional verb with “得” or “不” inserted shall be segmented.
EXAMPLE
出 得 去(able to go out) 进 不 来(unable to come in)
A phrase formed by a verb with a directional verb shall be segmented.
EXAMPLE
寄 来(send) 跑 出 去(run out)
5.2.2.7
Combination of independent single verbs
Combination of independent single verbs without conjunction shall be segmented. For example:
苫 盖(cover with) 听 说 读 写(listen, speak, read and write)
Multi-word verb without conjunction shall be segmented. For example:
调查 研究(investigate and research) 宣传 鼓动(publicity and instigation)
5.2.3
Adjective
5.2.3.1
Reiteratively combined adjectives
Adjective in reiterative form of “AA, AABB, ABB, AAB, A+"里"+AB” shall be deemed as segmentation unit.
EXAMPLE
大大(big) 高高(tall)
高高兴兴(happy) 匆匆忙忙(busy)
绿油油(fresh green) 红通通(bright red)
蒙蒙亮(daybreak) 马里马虎(careless)
However, adjective in reiterative form of “ABAB” shall be segmented.
EXAMPLE
雪白 雪白(snowy white) 滚圆 滚圆(fat and round)
5.2.3.2
Adjective phrases
Adjective phrase in the form of “一 A 一 B,一 A 二 B,半 A 半 B,半 A 不 B,有 A 有 B” shall not be
segmented.
EXAMPLE
一心一意(wholeheartedly) 一清二楚(as plain as daylight)
半明半暗(partly bright partly dark) 半生不熟(half-cooked)
有条有理(orderly)
5.2.3.3
Adjectives in parataxis form
Adjectives in parataxis form shall be segmented in accordance with following rules:
a.
Two single-character adjectives that together form a word of a different word class shall not be segmented.
EXAMPLE 长短(long-short) 深浅(deep-shallow) 大小(big-small)
b.
Adjectives in parataxis form and maintaining original adjective meaning shall be segmented.
EXAMPLE 大 小尺寸(size) 光荣 伟大(glory)
5.2.3.4
Adjective delimited noun for colors
Color adjective word or phrase shall not be segmented.
EXAMPLE 浅黄(light yellow) 橄榄绿(olive green)
5.2.3.5
Adjective phrases
Adjective phrase in positive with negative form to indicate question shall be segmented.
EXAMPLE 容易 不 容易(easy or not easy)
Yet the brachylogical one shall not be segmented.
EXAMPLE 容不容易(easy or not)
5.2.4
Pronoun
a)
Single-character pronoun with “们” shall be deemed as segmentation unit.
EXAMPLE 我们 (we) 你们(you) 它们(they) 他们(they)
b)
“这、那、哪” with unit word “个” or “些、样、么、里、边” shall be deemed as one segmentation unit.
EXAMPLE 这个(this) 这么(thus) 这边(here)
那些(those) 那样(then) 那里(there)
哪个(which) 哪里(where) 哪些(which)
c)
“这、那、哪” followed by a numeral, a unit word or a noun shall be segmented from it.
EXAMPLE 这 十 天(these 10 days) 那 人(that person) 那 种(that kind)
d)
Interrogative adjective or phrase shall be deemed as segmentation unit.
EXAMPLE 多少(how many)
怎样(what about)
为什么(why) 什么(what)
e)
Pronoun of “各、每、某、本、该、此、全”, etc. shall be segmented from followed measure word or
noun.
EXAMPLE 各 国 (each country)
每 种(each type)
某 工厂(a certain factory) 本 部门(this department)
该 单位(this unit) 此 人(this person)
全 校(whole school)
5.2.5
Numeral
a)
Numeral is segmented from measure word.
EXAMPLE 三 个(three) 一 种(one type)
b)
Chinese digit word shall be deemed as segmentation unit.
EXAMPLE 一亿八千零四万七百二十三(180,040,723)
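The value given for the Chinese digit word above can be checked mechanically. The following sketch is not part of the standard; the function name is hypothetical, and it assumes a well-formed numeral in which 亿 and 万 each appear at most once, in that order.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10**4, "亿": 10**8}


def chinese_numeral_to_int(s):
    """Convert a well-formed Chinese digit word to an integer (simplified)."""
    total = 0     # value of sections already closed by 万 or 亿
    section = 0   # value accumulated since the last 万 or 亿
    digit = 0     # most recent digit not yet multiplied by a unit
    for ch in s:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 at the start of a section means 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError("not a Chinese numeral character: " + ch)
    return total + section + digit


if __name__ == "__main__":
    print(chinese_numeral_to_int("一亿八千零四万七百二十三"))
```

For the example above the sections are 一亿 (100,000,000), 八千零四万 (80,040,000) and 七百二十三 (723), which sum to 180,040,723.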
c)
Ordinal number of “第” shall be segmented from followed numeral.
EXAMPLE 第 一 (first) 第 四(the fourth) 第 五 十 三(the fifty-third)
d)
A fractional number formed with “分之” shall be deemed as one segmentation unit.
EXAMPLE 五分之三(three fifths) 百分之二(2/100) 万分之五(5/10000)
e)
Paratactic numerals indicating approximate number shall be deemed as segmentation unit.
EXAMPLE 八九 公斤(eight or nine kg.)
f)
“多、来、几”, etc., used with a numeral for indicating approximate number, shall be segmented.
EXAMPLE 两 点 多(past two o’clock)
十 来 家(about ten)
十 七八 岁(seventeen or eighteen years old)
一 千 多 人(more than one thousand people)
十 几 个(over ten)
g)
“些、一些、点儿、一点儿”, used after adjective or verb for indicating approximate number, shall be
segmented.
EXAMPLE
大 些(bigger)
懂 一些(know some)
快 点儿(quickly) 快 一点儿(more quickly)
h)
“近、约、数”, etc. used before the numeral and numerical digit for indicating approximate number, shall
be segmented.
EXAMPLE
近 千 人(nearly one thousand people) 约 三 百(about three hundred) 数 万(tens of thousands)
成百(hundreds of) 上千(thousands of)
5.2.6
Measure word
a)
Reiterative measure word shall not be segmented.
EXAMPLE
年年(every year)
天天(every day) 个个(each) 家家户户(every household)
b)
Compound measure word or phrase shall be deemed as segmentation unit.
EXAMPLE
人年(man/year) 人次(man/time) 架次(sortie) 吨公里(t/km)
5.2.7
Adverb
a)
Adverb shall be deemed as segmentation unit.
EXAMPLE
很好(very good) 都来了(every one came here)
刚走(have just gone) 互相协助(help each other)
b)
Following phrases used frequently and acting as adverb shall be deemed as segmentation unit:
EXAMPLE
越来越(more and more)
不得不(have to) 不能不(cannot but)
“越…越…、又…又…”, etc. acting as conjunction shall be segmented.
越 走 越 远(to go farther and farther)
又 香 又 甜(sweet yet savory)
5.2.8
Preposition
Preposition shall be deemed as segmentation unit.
EXAMPLE
生于(be born in) 走向胜利(up to success) 按照规定(according to the regulations)
5.2.9
Conjunction
Conjunction shall be deemed as segmentation unit.
EXAMPLE
工人和农民(worker and farmer) 光荣而伟大(glorious and grand)
5.2.10 Auxiliary word
a)
Structural auxiliary word “的、地、得、之” shall be deemed as segmentation unit.
EXAMPLE
他的书 (his book) 慢慢地走(walk slowly) 说得快(speak fast)
美丽的城市(beautiful city) 中国的大熊猫(Chinese panda) 成功之路(road to success)
b)
Tense auxiliary word “着、了、过” shall be deemed as segmentation unit.
EXAMPLE
看着(be watching) 看了(watched) 看过(have watched)
c)
Auxiliary word “所” shall be segmented from the verb that follows it.
EXAMPLE
所 想(what one thinks) 所 认识(what one knows)
5.2.11 Modal word
Modal word shall be deemed as segmentation unit.
EXAMPLE
你好吗?(How are you?)
你好吧!(Is everything OK?)
5.2.12 Exclamation word
Exclamation word shall be deemed as segmentation unit.
EXAMPLE
啊,真美!(How beautiful it is!)
唉呀,他走了!(He has gone!)
5.2.13 Imitative word
Imitative word shall be deemed as segmentation unit.
EXAMPLE
嘟(Du) 当当(tinkle) 轰隆隆(rumble)
6
Japanese word segmentation
6.1
General rules for identifying WSUs in Japanese text
In Japanese text, a “Bunsetsu” is a phrase without modifying expressions; it is a component of a sentence
that carries a grammatical function. A “Bunsetsu” is composed mainly of nine parts of speech:
名詞 (meishi; noun), 動詞 (doushi; verb), 形容詞・形容動詞 (keiyoushi, keiyoudoushi; adjective), 連体詞
(rentaishi; adnominal noun [only used in adnominal usage]), 副詞 (fukushi; adverb), 感動詞 (kandoushi;
exclamation), 接続詞 (setsuzokushi; conjunction), 助詞 (joshi; particle), and 助動詞 (jodoushi; auxiliary verb).
These parts of speech are divided into more detailed classes in terms of grammatical function (see section
6.2).
6.1.1
Punctuation
There are two main punctuation marks in Japanese, “、” and “。”.
“、” is used for representing a slight pause. It indicates a break between phrases inside one sentence, but
does not always correspond to one segmentation unit. It means just a pause to make the sentence easier to
understand. Therefore, it is not directly related to word segmentation units.
“、” serves as a comma, semicolon, and colon.
“。” is used for representing a full stop, and is written at the end of a sentence.
That is, it marks the end of one sentence.
A question mark is “?”.
Quotation marks are “「」”.
A book name mark is “『』”.
An exclamation mark is “!”.
etc.
6.1.2
Noun
When a noun is a member constituting a sentence, it is usually followed by a particle or auxiliary verb.
Also, if a word like an adjective or adnominal noun modifies a noun, then a modifier (adjectives, adnominal
noun, adnominal phrases) and a modificand (a noun) are segmented.
Some nouns whose meaning is an action can become verbs by adding the verb “suru” (do).
6.1.3
Verbs
A Japanese verb has an inflectional ending. The ending of a verb changes depending on whether it is a
negation form, an adverbial form, a base form, an adnominal form, an assumption form, or an imperative form.
Japanese verbs are often used with auxiliary verbs and/or particles, and a verb with auxiliary verbs and/or
particles is considered as a word segmentation unit.
6.1.4
Adjectives
A Japanese adjective has an inflectional ending. Based on the type of inflectional ending, there are two kinds
of adjectives, "keiyoushi" and "keiyoudoushi". Both are treated as adjectives. In terms of inflectional ending,
“keiyoushi” is sometimes called “i_keiyoushi”, such as “美しい” (utsukushi_i; beautiful), and “keiyoudoushi” is
sometimes called “na_keiyoushi”, such as “静かな” (shizuka_na; quiet).
(In terms of inflectional ending of “na_keiyoushi,” it is sometimes said to be considered as “Noun + auxiliary
verb (da)”.)
The ending of an adjective changes depending on whether it is a negation form, an adverbial form, a base
form, an adnominal form, or an assumption form. Japanese adjectives are sometimes used with auxiliary
verbs and/or particles, and they are considered as a word segmentation unit.
6.1.5
Adnominal nouns
An adnominal noun does not have an inflectional ending; it is always used as a modifier. An adnominal noun
is considered as one segmentation unit.
6.1.6
Adverbs
An adverb does not have an inflectional ending; it is always used as a modifier of a sentence or a verb. It is
considered as one segmentation unit.
6.1.7
Conjunctions
A conjunction is considered as one segmentation unit.
6.1.8
Exclamations
An exclamation is considered as one segmentation unit.
6.1.9
Particles
A particle by itself does not form a “Bunsetsu”; a word followed by a particle can constitute a “Bunsetsu”. A
particle functions as a case marker, expresses correlation with another phrase, adds a minor nuance of meaning, and so on.
In its behavior, it attaches to words and, like a suffix, has no inflectional ending. However, it is not
a suffix but a part of speech in its own right. A Japanese particle attaches not only to words but also to a clause or even a
sentence. For example, “寒いね?” means “It is very cold, isn’t it?” In this example, the Japanese particle
“ね (ne)” corresponds to “isn’t it?”.
6.1.10 Auxiliary verbs
Auxiliary verbs represent various semantic functions such as a capability, a voice, a tense, an aspect and so
on.
An auxiliary verb appears at the end of a phrase or a sentence.
An auxiliary verb is always preceded by a word like a noun, a verb, or an adjective, and the set is considered
as one segmentation unit. An auxiliary verb should not be segmented from a word.
6.1.11 Idioms and proverbs
Proverbs, mottos, etc. should be segmented if their original meanings are not violated after segmentation.
EXAMPLE
光陰   矢の   ごとし
Kouin   ya_no   gotoshi
Noun   Noun_particle   Auxiliary verb
Time   arrow   like (flying)
Time flies fast
6.1.12 Abbreviations
An abbreviation should not be segmented.
6.2
Typology of WSUs in Japanese
The examples in each section are formatted as follows.
First line: Japanese sentence
In Japanese, spaces are not used in a sentence.
However, in the examples shown below, spaces indicate a border of “Bunsetsu”
Second line: Romanization
Third line: part of speech constituting “Bunsetsu”
A space refers to the border of a part of speech
The “_” mark within Bunsetsu refers to the composition of Bunsetsu
The “+” mark refers to the lexical composition within words
The ”[ ]” mark refers to the semantic function of a part of speech
Fourth line: English translation for each Bunsetsu
Fifth line: English translation for the example sentence or phrase
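The “_” and “+” marks described above can be read mechanically from the romanization line of an example. The sketch below is illustrative only; both function names are hypothetical. It is meant for the romanization line, because labels on the part-of-speech line (such as "auxiliary verb") contain spaces of their own.

```python
def parse_bunsetsu(annotation):
    """Split one Bunsetsu: "_" separates its words, "+" separates lexical components."""
    return [word.split("+") for word in annotation.split("_")]


def parse_romanized_line(line):
    """Split a romanization line into Bunsetsu (on white space), then parse each one."""
    return [parse_bunsetsu(b) for b in line.split()]


if __name__ == "__main__":
    print(parse_bunsetsu("anata+tachi_desu_ka"))
    print(parse_romanized_line("Watashi_wa tomato_wo kat_ta"))
```

Each Bunsetsu comes back as a list of words, and each word as a list of its lexical components, so "anata+tachi_desu_ka" yields three words, the first of which has two components.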
6.2.1
Nouns (名詞; Meishi)
• When a noun is a component constituting a sentence, it is usually followed by a particle or auxiliary
verb, but there are exceptions. In some cases, one word becomes one sentence. For example, as a
question, “なぜ (naze?; why?)”, and as an answer, “りんご (ringo; apple)”, “3 (san; three)” and so on.
• Also, if a word like an adjective or an adnominal noun modifies a noun, the modifier (adjective,
adnominal noun, adnominal phrase) and the modificand (a noun) are segmented and are not treated as a
compound noun.
6.2.1.1
Common nouns (普通名詞; Futsuumeishi)
6.2.1.1.1
A noun followed by a particle is considered as a word segmentation unit.
A noun followed by an auxiliary verb is considered as a word segmentation unit.
EXAMPLE A
Noun followed by Particle for a case marker
私は   トマトを   買った。
Watashi_wa   tomato_wo   kat_ta
Pronoun_particle   Noun_particle[object]   Verb_auxiliary verb
I   tomato   bought
I bought a tomato.
EXAMPLE B
Noun followed by Auxiliary verb
私の   好きな   花は   桜です。
Watashi_no   sukina   hana_wa   sakura_desu
Noun_particle   Adjective   Noun_particle   Noun_auxiliary verb[polite]
my   favorite   flower   is cherry blossoms
My favorite flower is cherry blossoms.
6.2.1.1.2
"A noun with a prefix and/or a suffix, plus a case particle following it” and “A noun with a prefix
and/or a suffix, plus an auxiliary verb following it" are considered as a word segmentation unit.
EXAMPLE A
A noun with a prefix and/or a suffix, plus a case particle following it.
あなたの   お名前を   教えてください。
Anata_no   o+namae_wo   oshiete_kudasai
Noun_particle   prefix[politeness]+Noun_particle[object]   Verb_auxiliary verb
your   name   tell
Please tell me your name.
EXAMPLE B
A noun with a suffix, plus an auxiliary verb following it
この   シャンプーは   植物性だ。
Kono   shanpoo_wa   shokubutsu+sei_da
Adnominal noun   Noun_particle   Noun+suffix_auxiliary verb[copula]
this   shampoo   is of plant origin
This shampoo is of plant origin.
6.2.1.1.3
“A compound noun plus a case particle following it” and “a compound noun plus an auxiliary verb
following it" are considered as a word segmentation unit.
EXAMPLE A
A compound noun plus a case particle following it
私は   さしみ定食を   注文した。
Watashi_wa   Sashimi+teishoku_wo   chumonshi_ta
Pronoun_particle   Noun+Noun_particle[object]   Verb_auxiliary verb
I   Sashimi set   ordered
I ordered a Sashimi set.
EXAMPLE B
A compound noun plus an auxiliary verb following it
私の   趣味は   映画鑑賞です。
Watashi_no   shumi_wa   eigakanshou_desu
Noun_particle   Noun_particle   Noun_auxiliary verb[copula, polite]
My   hobby   watching movies
My hobby is watching movies.
6.2.1.1.4
Some nouns which mean actions can become verbs by adding the verb “suru (do).” (see 6.2.2.2)
EXAMPLE
私は   毎日   散歩する。
Watashi_wa   mainichi   sanpo+suru
Pronoun_particle   Adverb   Verb[Noun+”do”]
I   every day   take a walk
I take a walk every day.
6.2.1.2
Pronouns (代名詞; Daimeishi)
6.2.1.2.1
A pronoun and a case particle are regarded as a word segmentation unit.
Sets of a pronoun and an auxiliary verb and/or a particle are regarded as a word segmentation unit.
EXAMPLE
私は   トマトを   買った。
Watashi_wa   tomato_wo   kat_ta
Noun[pronoun]_particle[topic]   Noun_particle   Verb_auxiliary verb
I   tomato   bought
I bought a tomato.
6.2.1.2.2
"A pronoun with a prefix and/or a suffix, plus a case particle following it” and “A pronoun with a
prefix and/or a suffix, plus an auxiliary verb and/or a particle" are regarded as a word segmentation unit.
EXAMPLE A
A pronoun with a suffix, plus a case particle
彼女たちは   コーチに   花を   贈った。
Kanojo+tachi_wa   coochi_ni   hana_wo   okut_ta
Noun[pronoun]+suffix_particle[topic]   Noun_particle   Noun_particle   Verb_auxiliary verb
they   to a coach   flowers   gave
They gave flowers to their coach.
EXAMPLE B
A pronoun with a suffix, plus an auxiliary verb and a particle.
犯人は   あなたたちですか?
Han’nin_wa   anata+tachi_desu_ka
Noun_particle   Noun[pronoun]+suffix_auxiliary verb[copula]_particle[question]
criminals   are you?
Are you the criminals?
6.2.1.3
Proper nouns (固有名詞; Koyuumeishi)
A proper noun followed by a case particle is considered as a word segmentation unit.
A proper noun followed by an auxiliary verb and/or a particle is considered as a word segmentation unit.
EXAMPLE A
A proper noun followed by a case particle
私は   東京へ   行った。
Watashi_wa   Tokyou_e   it_ta
Noun_particle   Noun[proper]_particle[direction]   Verb_auxiliary verb
I   to Tokyo   went
I went to Tokyo.
EXAMPLE B
A proper noun followed by an auxiliary verb and/or a particle
彼は   坂本さんですね?
Kare_wa   Sakamoto+san_desu_ne
Noun_particle   Noun[proper]+suffix_auxiliary verb_particle[mood]
He   is Mr. Sakamoto, isn’t he?
He is Mr. Sakamoto, isn’t he?
6.2.1.4
Interrogative (疑問詞; Gimonshi)
6.2.1.4.1
An Interrogative noun and a case particle are considered as a word segmentation unit.
An Interrogative noun and an auxiliary verb are considered as a word segmentation unit.
EXAMPLE A
An Interrogative noun and a case particle are considered as a word segmentation unit.
どれが   好きですか?
Dore_ga   suki_desu_ka
Noun[interrogative]_particle[subject]   Verb_auxiliary verb_particle
which   do you like?
Which do you like?
EXAMPLE B
An Interrogative noun and an auxiliary verb are considered as a word segmentation unit.
彼女は   誰でしょうか?
Kanojo_wa   dare_deshou_ka
Noun_particle   Noun[interrogative]_auxiliary Verb[guess]_particle[question]
she   who is?
Who is she?
6.2.1.4.2
Though this case is not limited to interrogative nouns, informally, occasionally only an
interrogative noun is used as a one-word sentence.
EXAMPLE
いくつ?
Ikutsu?
Noun[interrogative]
How many
How many?
6.2.1.4.3
Some interrogative nouns cannot be followed by case particles.
EXAMPLE
*どうは / *どうが / *どうを
*Dou_wa / *Dou_ga / *Dou_wo
*Noun[interrogative]_particle[topic]/[subject]/[object]
*How is
*How is
6.2.1.5
Time/numeral/quantifier nouns (数量詞/序数詞; Suuryoushi/Josuushi)
6.2.1.5.1
A Numeral noun and a case particle are considered as a word segmentation unit.
A Numeral noun and an auxiliary verb are considered as a word segmentation unit.
EXAMPLE A
A Numeral noun and a case particle are considered as a word segmentation unit.
母は   ケーキを   三分の一に   分けた。
Haha_wa   keeki_wo   sanbun’noichi_ni   wake_ta
Noun_particle   Noun_particle   Noun[numeral]_particle   Verb_auxiliary verb
my mother   a cake   three pieces   divided
My mother divided a cake into three pieces.
EXAMPLE B
A Numeral noun and an auxiliary verb are considered as a word segmentation unit.
休憩は   5分間です。
Kyuukei_wa   gofunkan_desu
Noun_particle   Noun[numeral]_auxiliary verb[copula, polite]
a break   is for 5 minutes
A break is for 5 minutes.
6.2.1.5.2
A measure noun is sometimes used as an adverb by itself without a particle.
EXAMPLE
鉛筆を   4本   準備しなさい。
Enpitsu_wo   yon_hon   junbishinasai
Noun_particle   Noun[measure]_   Verb
a pencil   4   prepare
Prepare 4 pencils.
6.2.2
Verbs (動詞;Doushi)
A Japanese verb has an inflectional ending. The ending of a verb changes depending on whether it is a
negation form, an adverbial form, a base form, an adnominal form, an assumption form, or an imperative form.
Japanese verbs are often used with auxiliary verbs and/or particles, and they are considered as a word
segmentation unit.
6.2.2.1
Single verbs and compound verbs
Verbs (including single verbs and compound verbs) are considered as one segmentation unit.
EXAMPLE
私は   毎朝   牛乳を   飲む。
Watashi_wa   maiasa   gyuunyuu_wo   nomu
Noun_particle   Adverb   Noun_particle   Verb
I   every morning   milk   drink
I drink milk every morning.
6.2.2.2
Verb composed from a noun and “suru”(do) (サ変動詞;Sahendoushi)
An action noun becomes a verb by adding a verb “suru (do)” to the end of the noun, and is sometimes called
“Sahendoushi.” “Sahendoushi” is considered as one segmentation unit. (see 7.1.1 (4))
EXAMPLE
私は   英語を   勉強する。
Watashi_wa   eigo_wo   benkyou+suru
Noun_particle   Noun_particle   Verb[Noun+do]
I   English   do study
I study English.
6.2.2.3
A verb with a subsidiary verb
The function of a subsidiary verb is to complement the meaning of the main verb, as in "話し始める
(hanashi+hajimeru; begin speaking)". A subsidiary verb is not a suffix. A verb with a subsidiary verb is considered as a
verb. When it is used at the end of a sentence or a clause, it is considered as one segmentation unit.
EXAMPLE
人形が   箱から   飛び出す。
Ningyou_ga   hako_kara   tobidasu
Noun_particle   Noun_particle   Verb[Verb + subsidiary]
a doll   the box   jump out
A doll jumps out of the box.
6.2.2.4
A verb with an auxiliary verb and a particle
A verb with an auxiliary verb and/or a particle is considered as one segmentation unit.
EXAMPLE A
A verb with an auxiliary verb
彼は   試験に   合格するだろう。
Kare_wa   shiken_ni   goukakusuru_darou
Noun_particle   Noun_particle   Verb_auxiliary verb[expectation]
he   the examination   will pass
He will pass the examination.
EXAMPLE B
A verb with an auxiliary verb and/or a particle
彼は   試験に   合格するだろうね?
Kare_wa   shiken_ni   goukakusuru_darou_ne
Noun_particle   Noun_particle   Verb_auxiliary verb_particle[mood]
he   the examination   will pass, don’t you think so?
He will pass the examination, don’t you think so?
6.2.3
Adjectives (形容詞/形容動詞; Keiyoushi/Keiyoudoushi)
A Japanese adjective has an inflectional ending. Based on the type of inflectional ending, there are two kinds
of adjectives, "i keiyoushi" and "na keiyoushi". However, both are treated as adjectives.
In terms of traditional Japanese linguistics, “keiyoushi” refers to “i keiyoushi” and “keiyoudoushi” refers to “na
keiyoushi.”
(In terms of inflectional ending of “na keiyoushi,” it is sometimes said to be considered as “Noun + auxiliary
verb (da)”.)
The ending of an adjective changes depending on whether it is a negation form, an adverbial form, a base
form, an adnominal form, or an assumption form. Japanese adjectives are sometimes used with auxiliary
verbs and/or particles, and they are considered as a word segmentation unit.
6.2.3.1
Adjectives in predicative usage / adnominal usage
6.2.3.1.1
Adjectives in predicative usage are considered as one segmentation unit.
EXAMPLE
富士山は   高い。
Fujisan_wa   takai
Noun_particle   Adjective
Mt. Fuji   high
Mt. Fuji is high.
6.2.3.1.2 Adjectives with an auxiliary verb and/or a particle are considered as one segmentation unit.
EXAMPLE
値段が   高いですか?
Nedan_ga   takai_desu_ka
Noun_particle   Adjective_auxiliary verb[copula,polite]_particle[question]
the price   is high?
Is the price high?
6.2.3.1.3 Adjectives in adnominal usage and nouns modified by the adjectives are segmented separately.
EXAMPLE
面白い   本が   ある。
Omoshiroi   hon_ga   aru
Adjective[adnominal]   Noun_particle   Verb
interesting   a book   there is
There is an interesting book.
6.2.3.1.4 Adjectives with a particle / auxiliary verb, and nouns modified by them are segmented separately
EXAMPLE
面白かった
本を
おしえる。
Omoshirokat_ta
hon_wo
oshieru
Adjective_auxiliary verb[past]
Noun_particle
Verb
was interesting
book
tell you
I will tell you about a book that was interesting.
6.2.3.2
Adjectives in adverbial usage
An adjective in adverbial usage and the verb it modifies must be segmented separately. In this case, the
adjective in adverbial usage is considered one segmentation unit.
EXAMPLE
早く
起きなさい!
hayaku
okinasai
Adjective [adverbial usage]
Verb
early
Get up
Get up early!
6.2.3.3
Adjectives in negation and assumption
6.2.3.3.1
Adjectives in negation usage are generally represented in the form “adjective in adverbial
form plus auxiliary verb”. The auxiliary verbs used for negation are “nai” (for i_adjectives), “wa_nai” (for na_adjectives),
“arimasen” (a polite form for i_adjectives), “wa_arimasen” (a polite form for na_adjectives) and “ja_arimasen”
(a colloquial form for na_adjectives). An adjective in adverbial form together with an auxiliary verb for
negation is considered one segmentation unit.
EXAMPLE A
やさしくない
yasashiku_nai
Adjective_auxiliary verb[negation]
Not kind
Not kind
EXAMPLE B
きれいではありません
kireide_wa_arimasen
Adjective_particle_auxiliary verb[negation+polite]
Not clean
Not clean
6.2.3.3.2
An adjective in an assumption usage is generally represented in the form of an adjective in an
assumption form plus a particle for an assumption. Therefore, an adjective in assumption form plus a particle
for an assumption is considered as one segmentation unit.
EXAMPLE
雨が
ひどければ、
遠足は
中止する。
Ame_ga
hidokere_ba
ensoku_wa
chushisuru
Noun_particle
Adjective_particle[assumption]
Noun_particle
Verb
the rain
if it is heavy
hiking
cancel
If the rain is heavy, the hike will be cancelled.
6.2.4
Adnominal nouns (連体詞; Rentaishi)
An adnominal noun does not have an inflectional ending; it is always used as a modifier. An adnominal noun
is considered as one segmentation unit.
EXAMPLE
あらゆる
国
arayuru
kuni
Adnominal noun
Noun
every
country
every country
6.2.5
Adverbs (副詞; Fukushi)
An adverb does not have an inflectional ending; it is always used as a modifier of a sentence or a verb. It is
considered as one segmentation unit.
EXAMPLE
やっと
来た。
yatto
ki_ta
adverb
Verb_auxiliary verb
at last
came
At last (someone) came.
6.2.6
Conjunctions (接続詞; Setsuzokushi)
A conjunction is considered as one segmentation unit.
EXAMPLE
そして、
彼は
笑った。
Soshite
kare_wa
warat_ta
Conjunction
Noun_particle
Verb_auxiliary verb
then
he
laughed
Then he laughed.
6.2.7
Exclamations (感動詞; Kandoushi)
An exclamation is considered as one segmentation unit.
EXAMPLE
あっ!
A!
Exclamation
Oops!
Oops!
6.2.8
Particles (助詞; Joshi)
Japanese particles are divided into the following seven subcategories.
“格助詞; kakujoshi” is a marker for a case. (が; ga; subject marker, を; wo; object marker, に; ni; dative
marker, and so on)
“係助詞; kakarijoshi” is a marker for a correlation with another phrase. (さえ; sae; even, しか; shika; only, and
so on)
“並立助詞; heiritsujoshi” is a marker for a coordination. (と; to; and, か; ka; or, and so on)
“接続助詞; setsuzokujoshi” is a marker for a conjunction between phrases. (ので; node; because, とき; toki;
when, and so on)
“副助詞; fukujoshi” is a marker that adds a shade of meaning. (くらい; kurai; about, まで;
made; )
“終助詞; shuujoshi” is a marker for the mood or a question of a speaker. It is always used at the
end of a sentence. (ね; ne; don’t you think so?, か; ka; question)
“準体助詞; juntaijoshi” is a marker for a nominalization of a phrase. (の; no; thing, こと; koto; thing)
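The subcategories listed above can be held in a simple lookup table. The following Python sketch is illustrative only: it contains just the sample particles named in this subclause, and the names `PARTICLE_SUBCATEGORIES` and `subcategories_of` are hypothetical.

```python
# Subcategories and sample particles as listed in this subclause
# (surface form -> romanization). This is not an exhaustive inventory.
PARTICLE_SUBCATEGORIES = {
    "kakujoshi": {"が": "ga", "を": "wo", "に": "ni"},
    "kakarijoshi": {"さえ": "sae", "しか": "shika"},
    "heiritsujoshi": {"と": "to", "か": "ka"},
    "setsuzokujoshi": {"ので": "node", "とき": "toki"},
    "fukujoshi": {"くらい": "kurai", "まで": "made"},
    "shuujoshi": {"ね": "ne", "か": "ka"},
    "juntaijoshi": {"の": "no", "こと": "koto"},
}


def subcategories_of(particle):
    """Return every subcategory whose sample list contains the particle."""
    return [
        name
        for name, samples in PARTICLE_SUBCATEGORIES.items()
        if particle in samples
    ]


# "か" is listed both as a coordination marker and as a sentence-final marker.
ambiguous = subcategories_of("か")
```

The lookup for か returns two subcategories, which shows that the surface form alone does not decide the subcategory; the position in the sentence has to be considered as well.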
EXAMPLE A
Particles for a case marker
私は (watashi_wa; I), 私を (watashi_wo; me), 私の (watashi_no; my), 私へ (watashi_e; to me), 私と (watashi_to; with), 私に (watashi_ni; me, for me)
EXAMPLE B
A particle for a conjunction
行けば(ike_ba; if you go) , 行くので(iku_node; because (someone) goes)
EXAMPLE C
a particle for adding something of a meaning
私さえ(watashi_sae; even I), 私も(watashi_mo; I (go together), too)
EXAMPLE D
particles for representing a mood and a question
行きますね?
Iki_masu_ne?
Verb_auxiliary verb_particle[mood]
go, don’t you?
(You go there), don’t you?
6.2.9
Auxiliary Verbs (助動詞; Jodoushi)
Auxiliary verbs represent various semantic functions such as capability, voice, tense, aspect and so
on. An auxiliary verb is used with a noun, a verb, or an adjective, and appears at the end of a phrase, a clause, or a
sentence. Although an auxiliary verb is a part of speech of its own, it should not be segmented from the word it
attaches to.
EXAMPLE
雨が
降りそうなので、
家に
いるでしょう。
Ame_ga
furi_souna_node
ie_ni
iru_deshou
Noun_particle
Verb_auxiliary verb[guess]_particle[conjunction]
Noun_particle
Verb_auxiliary verb [prospect, polite]
it
because(it) seems to rain
at home
(I ) will be
Because it seems to rain, I will be at home.
7
Korean word segmentation
7.1
Typology of word segmentation units in Korean
An “Eojeol (word phrase)” is a phrase in Korean text, without modifying expressions, that forms a component of a
sentence and carries a grammatical function. The components of an “Eojeol” mainly belong to nine parts of speech:
noun, verb, adjective, pronoun, numeral, adnoun, adverb, exclamation and particle. In North Korean grammar, a
particle is not treated as a part of speech (see section 4), which leaves eight. The basic parts of speech can be
divided into more detailed classes in terms of grammatical function (see section 7.2).
7.1.1
Punctuation and white space
A period (.) represents a full stop and is written at the end of a sentence. A question mark (?)
and an exclamation mark (!) are also written at the end of a sentence.
That is, each of these marks indicates the end of one sentence.
A comma (,) is used for representing a slight pause. It indicates a break between phrases inside one sentence,
but does not always correspond to one segmentation unit. It means just a pause to make the sentence easier
to understand. Therefore, it is not directly related to word segmentation units. A colon(:) and a slash(/) also
represent a slight pause.
Double quotation marks (“ ”) are used for dialogue or quotation and small quotation marks (‘ ’) are used for
inner quotation and emphasis of some expression.
In contrast with Chinese and Japanese typography, Korean sentences contain fragments separated by white
space. These fragments are Eojeol (word phrases). In Korean, white space, as well as punctuation and brackets, is
fundamental in separating word phrases.
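As a rough illustration of the role of white space, the following Python sketch splits a Korean sentence into Eojeol. The punctuation set and the function name `split_eojeol` are assumptions made for this sketch, not definitions from this document.

```python
# Punctuation and brackets to strip from the edges of each fragment.
# The exact set is an assumption of this sketch.
PUNCT = ".?!,:/\"'“”‘’()[]"


def split_eojeol(text):
    """Split Korean text into Eojeol on white space, dropping edge punctuation."""
    eojeols = []
    for chunk in text.split():
        cleaned = chunk.strip(PUNCT)
        if cleaned:
            eojeols.append(cleaned)
    return eojeols


eojeols = split_eojeol("소녀가 사과를 먹었다.")
```

Only leading and trailing punctuation is removed; a mark inside a fragment (for example a slash between two words) would need separate handling. Each resulting Eojeol is still subject to the segmentation rules of section 7.2.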
7.1.2
Word
Words that are clearly justified by linguistic criteria are WSUs.
7.1.2.1
Simplex
Words with just one component are WSUs.
EXAMPLE
다섯
daseot
numeral
five
7.1.2.2
Compound
The results of composing two or more words are WSUs.
EXAMPLE
곱-씹다
gopssipda
verb
Twofold-chew
Repeat a word
7.1.2.3
Derivation
The results of adding a series of prefixes or suffixes to a word are WSUs.
EXAMPLE A
외-삼촌
oesamchon
Noun(prefix-noun)
maternal uncle
EXAMPLE B
입-질
ipjil
Noun(noun-suffix)
bite
7.1.2.4
Abbreviation
Abbreviations are WSUs.
EXAMPLE A
이공[理工](igong; noun; science and engineering) : 이학(ihak; science )+공학(gonghak, engineering)
EXAMPLE B
국규[國規] (gukgyu; noun; national standard) : 국가(gukga; state)+규격(gyugeok; standard)
7.1.2.5
Transliterated loanword
Transliterated loanwords are WSUs.
EXAMPLE
지프 (jeep)
초콜릿 (chocolate)
7.1.2.6
Idiomatic expression with Chinese character
Idiomatic expressions with Chinese characters are WSUs. They are usually composed of four Chinese
characters.
EXAMPLE
함흥차사 (咸興差使)
hamheungchasa
noun
Lost messenger
7.1.3
Multi-word expression
7.1.3.1
Phrasal compound
Phrasal compounds that are frequently used in text and mainly consist of two or more words are WSUs.
EXAMPLE
수력
발전소
suryeok
baljeonso
noun
noun
waterpower
plant
hydroelectric plant
7.1.3.2
Idioms and proverbs
Fixed expressions such as idioms, proverbs, and mottos should be segmented only if their original meanings are
preserved after segmentation. Otherwise, they should be deemed one word unit, even though they are composed of two
or more Eojeol (word phrases).
EXAMPLE A
울며
겨자
먹기
ulmyeo
gyeoja
meokgi
Verb
Noun
Verb
cry
mustard
eat
A Hobson’s choice.
EXAMPLE B
한
마디로
말해
han
madiro
malhae
adnoun
Noun_particle
Verb
one
Word_with
talk
In a word (speaking briefly)
7.1.4
Non-Korean-character strings
Non-Korean-character strings including foreign language characters, Arabic numerals, math symbols,
chemical symbols etc. are treated as WSUs by keeping their original forms.
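A minimal Python sketch of this rule follows. It assumes that “Korean characters” means the Unicode Hangul Syllables block (U+AC00 to U+D7A3), which is a simplification, and the names `TOKEN` and `split_scripts` are hypothetical. Runs of non-Korean characters are kept in their original form and separated from adjacent Hangul.

```python
import re

# One token is either a run of Hangul syllables or a run of
# non-space, non-Hangul characters. Restricting "Korean characters"
# to U+AC00..U+D7A3 is an assumption of this sketch.
TOKEN = re.compile(r"[\uAC00-\uD7A3]+|[^\s\uAC00-\uD7A3]+")


def split_scripts(text):
    """Return Hangul runs and non-Hangul runs as separate tokens."""
    return TOKEN.findall(text)


# Non-Korean strings keep their original form.
tokens = split_scripts("CAD CO := cm 1298 3.1415926")
# A Latin string directly followed by Hangul is split at the script boundary.
mixed = split_scripts("CAD로")
```

The sketch only finds script boundaries; whether the Hangul run itself contains further WSUs is decided by the rules in section 7.2.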
EXAMPLE
CAD CO := cm 1298 3.1415926
7.2
Typology of WSUs in Korean
7.2.1
Noun
A noun is usually followed by a particle and is a component constituting a sentence, but there are
some exceptions. In some cases, a single noun forms a sentence by itself: for example, “어디(eodi?; where?)” as a
question, or “사과(sagwa ; apple)” or “3 (set; three)” as an answer.
Also, if a word such as an adjective or an adnoun modifies a noun, the modifier (adjective, adnoun, or
adnominal phrase) and the modificand (the noun) are segmented separately.
7.2.1.1
Common noun
7.2.1.1.1
A noun followed by a particle is considered a word segmentation unit. Within the Eojeol (word phrase), the
noun shall be segmented from the other grammatical components.
EXAMPLE
Noun followed by Particle for a case marker
소녀가
사과를
먹었다.
sonyeo_ga
sagwa_reul
meogeotta
Noun_particle[subject]
Noun_particle[object]
Verb
girl
apple
ate
A girl ate an apple.
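The separation of a noun from its particle can be sketched as a longest-suffix match. The Python code below is only an illustration: the particle list is a small sample chosen for this sketch, the name `split_noun_particle` is hypothetical, and no lexicon is consulted.

```python
# A small sample of particles (an assumption of this sketch,
# not an exhaustive inventory).
PARTICLES = ["에게", "가", "이", "를", "을", "는", "은", "의", "와", "과", "만", "도"]


def split_noun_particle(eojeol):
    """Split an Eojeol into (stem, particle) by longest-suffix match.

    Returns (eojeol, "") when no listed particle matches.
    """
    # Try longer particles first so that "에게" is preferred over shorter ones.
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if eojeol.endswith(particle) and len(eojeol) > len(particle):
            return eojeol[: -len(particle)], particle
    return eojeol, ""


subject = split_noun_particle("소녀가")
obj = split_noun_particle("사과를")
```

Pure suffix matching over-segments: the bare noun 사과 ends in 과, which is also a particle, so this sketch would wrongly split it. A practical segmenter therefore needs a dictionary or a statistical model to confirm that the remaining stem is a word.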
7.2.1.1.2 Derivative noun with derivative affixes shall be deemed as a word segmentation unit.
EXAMPLE A
A noun with a prefix
비-금속
bigeumsok
Noun(prefix-noun)
Not metal
nonmetal
EXAMPLE B
A noun with a suffix
음악-가
eumakga
Noun(noun-suffix)
Music artist
musician
7.2.1.1.3 Compound noun shall be deemed as a word segmentation unit.
EXAMPLE A
noun plus noun
손-목
sonmok
noun
Hand-neck
wrist
EXAMPLE B
numeral plus numeral
하나-하나
hanahana
noun
One-one
One at a time
7.2.1.1.4 A word combination that is treated as a word segmentation unit may be sub-segmented where there is a
practical need: prefix + noun, noun + suffix, noun + noun.
7.2.1.2
Proper noun
7.2.1.2.1 A Korean given name and surname should not be separated; together they should be deemed one word
segmentation unit. A name followed by ‘이(i)’ should also be deemed a word segmentation unit.
EXAMPLE
김-광수: (KIM, surname) + (Gwangsu, first name)
경철-이: (Gyeongcheol, name) + (i, suffix)
7.2.1.2.2 A title or affix that follows a person’s name or surname should be segmented independently.
EXAMPLE
손
교수
son
gyosu
Proper noun
noun
A surname
professor
Prof. Son
7.2.1.2.3 Nation name, country name, language name and toponym shall be deemed as a word segmentation
unit.
EXAMPLE
백두산(Baekdusan; proper noun; Mt. Baekdu)
7.2.1.2.4 Full name of organization, agency, and institution shall be deemed as a word segmentation unit.
EXAMPLE
국제 표준화 기구(Gukjepyojunhwagigu; International Organization for Standardization)
7.2.1.3
Bound noun
Even though a bound noun is a function word, it should be segmented independently.
EXAMPLE
좋은
것
joeun
geot
adjective
Bound noun
good
thing
Good thing
A bound noun inside a word segmentation unit should not be segmented.
EXAMPLE
들-것
deulgeot
Noun (verb+bound noun)
Lift thing
A stretcher
7.2.2
Pronoun
A pronoun should be segmented from the particles that follow it.
7.2.2.1
Personal pronoun
A personal pronoun followed by a particle is considered a word segmentation unit.
7.2.2.1.1
General personal pronoun
General personal pronoun shall be deemed as one segmentation unit.
EXAMPLE
내가
너를
그에게
소개하겠다.
Nae-ga
Neo-reul
Geu-ege
sogaehagetta
Pronoun_particle
Pronoun_particle
Pronoun_particle
verb
I
you
To him
will introduce
I will introduce you to him.
7.2.2.1.2
Reflexive pronoun
Reflexive pronoun shall be deemed as one segmentation unit.
EXAMPLE
그녀는
자기를
부끄러워해야
한다.
Geunyeo-neun
Jagi-reul
buggeureoweohaeya
handa
Pronoun_particle
Reflexive pronoun_particle
Verb
verb
She
herself
be ashamed of
Ought to
She ought to be ashamed of herself.
7.2.2.1.3
Indefinite pronoun
Indefinite pronoun shall be deemed as one segmentation unit.
EXAMPLE
아직
아무도
오지
않았다.
ajik
amu-do
oji
anatta
adverb
Indefinite pronoun_particle
verb
Auxiliary verb
yet
anybody
come
not
Nobody has come yet.
7.2.2.2
Demonstrative pronoun
7.2.2.2.1 Compound pronoun including bound noun should be deemed as a word segmentation unit.
EXAMPLE
이-것
igeot
Pronoun (adnoun-bound noun)
This thing
this
7.2.2.2.2 Demonstrative pronoun for place shall be deemed as a word segmentation unit.
EXAMPLE
저기
jeogi
Pronoun
there
7.2.2.3
Compounding of pronouns
In Korean, a pronoun can be formed by compounding pronouns. Such a compound shall be deemed a word
segmentation unit.
EXAMPLE A
이것-저것
igeotjeogeot
Pronoun
This-that
One thing or another
EXAMPLE B
여기-저기
yeogijeogi
Pronoun
Here and there
7.2.3
Numeral
A numeral should be segmented from the particles that follow it.
7.2.3.1
Quantifier numeral
Quantifier numeral shall be deemed as a word segmentation unit.
EXAMPLE
하나(hana; numeral; one)
삼[三](sam; numeral; three)
7.2.3.2
Ordinal numeral
EXAMPLE
제일[第一] (jeil; numeral; first)
제오십삼[第五十三](jeosipsam; numeral; the fifty-third)
7.2.4
Verb
A Korean verb can take more than one inflectional ending. The endings of a verb can be changed and attached
depending on the grammatical function of the verb. E.g. 깨-뜨리-시-었-겠-군 (ggaeddeurisieotgetgun; verb; break - ending [+emphasis] - ending [+polite] - ending [+past] - ending [+conjectural] - final ending). Korean verbs are
often used with auxiliary verbs and/or particles, and together they are considered a word segmentation unit.
In North Korean grammar, the ending of a verb is treated as a grammatical prefix named 토(To; grammatical prefix). It
follows a verb (or equivalent), makes up the predicate form or represents grammatical meanings such as tense
and honorifics, and should be segmented as a word. There are thus two methods for the same linguistic phenomenon.
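The stem-plus-endings structure of the verb example in this subclause can be made explicit with a short Python sketch. It assumes the hyphen notation used in the example, where the first element is the stem and every later element is an ending; the function name `analyse_verb` is hypothetical.

```python
def analyse_verb(hyphenated):
    """Split a hyphen-marked verb into its stem and its list of endings.

    Assumes the notation of the example above: "-" marks morpheme
    boundaries and the first element is the stem.
    """
    stem, *endings = hyphenated.split("-")
    return {
        "wsu": hyphenated.replace("-", ""),  # surface form, kept as one unit
        "stem": stem,
        "endings": endings,
    }


analysis = analyse_verb("깨-뜨리-시-었-겠-군")
```

Under the first method described above the whole surface form stays one segmentation unit, while under the North Korean method each ending would become a unit of its own; the dictionary returned here supports either reading.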
7.2.4.1
Complete verb
Verbs (including single verbs and compound verbs) are considered as one segmentation unit.
7.2.4.1.1
Single verb
A single verb should be segmented from a following particle.
EXAMPLE
보-았-군-요
boatgunyo
Verb(stem+ending+ending)+particle
See [+past] [+sentence final] [+polite]
You might have seen (something).
Single verb should be segmented from following auxiliary verb.
EXAMPLE
읽어
보다
ilgeo
boda
Complete verb
Auxiliary verb
read
try
Try to read
7.2.4.1.2
Derivative or compound verb
Derivative or compound verb should not be segmented.
For example, “돌아가다” (dolagada; verb; pass away) is literally translated into ‘go+back’ (verb+verb).
“바로잡다” (barojapda; verb; correct) is one word segmentation unit but it consists of ‘rightly+hold’
(adverb+verb).
7.2.4.2
Auxiliary verb
An auxiliary verb should also be segmented independently.
A Korean auxiliary verb represents various semantic functions such as capability, voice, tense,
aspect and so on.
An auxiliary verb is used only after a verb whose ending is the particular connective ending that the auxiliary verb requires.
For example, the auxiliary verb “보다” (boda; try to) takes the same inflectional endings as a main verb, but it must
follow a main verb with the connective ending “어” (eo) or “고” (go).
EXAMPLE
먹어
버리다
meogeo
beorida
Complete verb
Auxiliary verb
eat
finish
Eat up
7.2.5
Adjective
A Korean adjective, like a verb, can take more than one inflectional ending. The endings of an adjective can be
changed and attached depending on the grammatical function of the adjective. For example, in “예쁘-시-었-겠-군” (pretty - ending
[+polite] - ending [+past] - ending [+conjectural] - final ending), one adjective has four endings. Korean
adjectives are often used with auxiliary verbs and/or particles, and together they are considered a word
segmentation unit.
7.2.5.1
Complete adjective
Adjectives (including single adjectives and compound adjectives) are considered as one segmentation unit.
7.2.5.1.1
Single adjective
Single adjective should be segmented from following particle.
EXAMPLE
검-군-요
geomgunyo
Adjective (stem+ending)+particle
Black [+sentence final] [+polite]
It is black, isn’t it?
Single adjective should be segmented from following auxiliary verb.
EXAMPLE
길지
않다
gilji
anta
Complete adjective
Auxiliary adjective
long
not
(It is) not long.
7.2.5.1.2
Derivative or compound adjective
Derivative or compound adjective should not be segmented.
EXAMPLE
“돌아가다” (dolagada; verb; pass away) is literally translated into ‘go+back’ (verb+verb). “바로잡다”
(barojapda; verb; correct) is one word segmentation unit but it consists of ‘rightly+hold’ (adverb+verb).
7.2.5.2
Auxiliary adjective
Unlike Japanese, Korean has auxiliary adjectives. Their function and usage are very similar to those of auxiliary
verbs. An auxiliary adjective is considered one segmentation unit.
EXAMPLE
마시고
싶다
masigo
siptta
Complete verb
Auxiliary adjective
drink
Want
Want to drink
7.2.6
Adnoun
An adnoun does not have an ending; it is always used as a modifier for noun. An adnoun shall be a word
segmentation unit by itself.
7.2.6.1
General adnoun
General adnoun is segmented from following noun.
EXAMPLE
새
책
sae
chaek
adnoun
Noun
new
book
A new book
7.2.6.2
Demonstrative adnoun
Demonstrative adnoun is segmented from following noun.
EXAMPLE
이
사람
i
saram
adnoun
Noun
this
person
this person
7.2.6.3
Numeral adnoun
Numeral adnoun is segmented from following noun for measure.
EXAMPLE
차
세
잔
cha
se
jan
noun
adnoun
Bound noun
tea
three
cup
Three cups of tea
7.2.7
Adverb
An adverb does not have an ending; it is always used as a modifier for verb. An adverb shall be a word
segmentation unit by itself.
Compound adverb also should not be segmented.
EXAMPLE
더욱-더
deoukdeo
Adverb (adverb + adverb)
More-more
More and more
7.2.7.1
Component adverb
Component adverb is segmented from following verb.
EXAMPLE
매우
바쁘다
maeu
babbeuda
adverb
verb
very
busy
Very busy
7.2.7.2
Sentence adverb
Sentence adverb is segmented from following sentence.
EXAMPLE
다행히
비-가
온다.
dahaenghi
biga
onda
adverb
Noun-particle
verb
fortunately
rain
come
Fortunately it rains.
7.2.7.3
Conjunctive adverb
Conjunctive adverb is segmented from following nominal or sentence.
EXAMPLE A
그리고
잠-이
들었다.
geurigo
jami
deureotta
adverb
Noun-particle
verb
and
sleep
get
EXAMPLE B
경제
및
문화
gyeongje
mit
munhwa
noun
adverb
noun
economy
and
culture
Economy and culture
7.2.8
Exclamation
An exclamation is considered as one segmentation unit.
EXAMPLE
아!
A!
Exclamation
Oops!
7.2.9
Particle
Korean particles, just like Japanese particles, cannot stand apart from a word: a particle is always used with
a word such as a noun, a verb, an adverb and so on. Nevertheless, a particle shall be considered one segmentation unit.
Particles can be divided into three main types in Korean. The first is a case particle, which serves as a case
marker. The second is an auxiliary particle, which appears at the end of a phrase or a sentence and
represents a mood and a tense. The third is a particle used for linking nominals.
In North Korean grammar, particles are treated as a grammatical prefix named 토(To; grammatical prefix). They
follow a noun (or equivalent) and represent grammatical meanings such as case. They should be
segmented as a word segmentation unit. There are thus two methods for the same linguistic phenomenon.
7.2.9.1
Particle as case marker
Particle as case marker decides the case of nominal in the sentence.
EXAMPLE
내가(nae_ga; I ), 나를(na_reul; me),
나의(na_eui; my), 나에게(na_ege; to me)
7.2.9.2
Conjunctive particle
Conjunctive particle is a marker for a conjunction between nominals or phrases.
EXAMPLE
경제- 와
문화
gyeongjewa
munhwa
noun-particle
noun
Economy and
culture
Economy and culture
7.2.9.3
Auxiliary particle
An auxiliary particle adds a shade of meaning to the word it attaches to.
EXAMPLE
나-는
소설-만
읽는다.
naneun
soseolman
ingneunda
Pronoun-particle
Noun-particle
verb
As for me
Only novel
read
As for me, I read only novels.