2016
The First Conference on Turkic Computational Linguistics
3–9 April 2016, Konya, Turkey
Proceedings Book
ISBN: 978-605-66422-0-3
turcling.ege.edu.tr
Proceedings of
The First International Conference on
Turkic Computational Linguistics
TurCLing 2016
In conjunction with CICLing 2016, the 17th International Conference
on Intelligent Text Processing and Computational Linguistics
April 3–9, 2016 • Konya, Turkey
ISBN: 978-605-66422-0-3
The First International Conference on Turkic Computational Linguistics - TurCLing 2016 - Full Paper Proceedings
CHAIR:
• Bahar Karaoğlan, Ege University

CO-CHAIRS:
• Tarık Kışla, Ege University
• Senem Kumova, İzmir Ekonomi University
• Hatem Haddad, Mevlana University
PROGRAM COMMITTEE:
• Yeşim Aksan, Mersin University
• Adil Alpkoçak, Dokuz Eylül University
• Ildar Batyrshin, Instituto Politécnico Nacional
• Cem Bozşahin, Middle East Technical University
• Fazlı Can, Bilkent University
• İlyas Çicekli, Hacettepe University
• Gülşen Eryiğit, Istanbul Technical University
• Alexander Gelbukh, Instituto Politécnico Nacional
• Tunga Güngör, Bogazici University
• Hatem Haddad, Mevlana University
• Bahar Karaoğlan, Ege University
• Tarık Kışla, Ege University
• Senem Kumova Metin, İzmir Ekonomi University
• Altynbek Sharipbayev, L.N. Gumilyov Eurasian National University
• Dzhavdet Suleymanov, Tatarstan Academy of Sciences
• Jonathan North Washington, Indiana University
KEYNOTE SPEAKER:
Prof. Dr. Tunga Güngör, Bogazici University
EDITORIAL

The 1st International Conference on Turkic Computational Linguistics is held jointly with
CICLing (the 17th International Conference on Intelligent Text Processing and Computational
Linguistics) at Mevlana University in Konya. All computational linguistics research on
Turkic languages (Turkish, Kazakh, Azerbaijani, Uyghur, Tatar, Kyrgyz, Turkmen,
Gagauz, Bashkir, Nogay, Uzbek, Chuvash, Khakas, Tuvan, and all others)
is considered to be within the scope of this conference.

The conference aims to be a forum serving studies on Turkish and other Turkic languages,
and to gather researchers in the field to discuss common long-term goals and to promote
knowledge sharing, resource sharing and possible collaborations between groups.
PROCEEDINGS EDITORS
Bahar Karaoğlan, Ege University
Tarık Kışla, Ege University
Senem Kumova, İzmir Ekonomi University
2016
Table of Contents
A Revisited Turkish Dependency Treebank ....................................................................... 1-6
Umut Sulubacak, Gülşen Eryiğit and Tuğba Pamay
Exploring Spelling Correction Approaches for Turkish ..................................................... 7-11
Dilara Torunoğlu Selamet, Eren Bekar, Tugay İlbay and Gülşen Eryiğit
Framing of Verbs for Turkish PropBank ............................................................................ 12-17
Gözde Gül Sahin
A Free/Open-Source Hybrid Morphological Disambiguation Tool for Kazakh ................ 18-26
Zhenisbek Assylbekov, Jonathan N. Washington, Francis M. Tyers, Assulan Nurkas, Aida
Sundetova, Aidana Karibayeva, Balzhan Abduali and Dina Amirova
A Methodology for Multi-word Unit Extraction in Turkish ............................................... 27-31
Ümit Mersinli and Yeşim Aksan
The Turkish National Corpus (TNC): Comparing the Architectures of v1 and
v2......................................................................................................................................... 32-37
Yeşim Aksan, S. Ayşe Özel, Hakan Yılmazer and Umut Demirhan
(When) Do We Need Inflectional Groups? ........................................................ 38-43
Çağrı Çöltekin
Allomorphs and Binary Transitions Reduce Sparsity in Turkish Semi-supervised
Morphological Processing .................................................................................................. 44-49
Serkan Kumyol, Burcu Can and Cem Bozşahin
Automatic Detection of the Type of “Chunks” in Extracting Chunker
Translation Rules from Parallel Corpora ........................................................... 50-54
Aida Sundetova and Ualsher Tukeyev
Simplification of Turkish Sentences ................................................................................... 55-59
Dilara Torunoğlu-Selamet, Tuğba Pamay, Gülşen Eryiğit
Comprehensive Annotation of Multiword Expressions in Turkish .................................... 60-66
Kübra Adalı, Tutkum Dinc, Memduh Gokırmak, Gülşen Eryiğit
An Overview of Resources Available for Turkish Natural Language Processing
Applications ........................................................................................................................ 67-84
Tunga Güngör
IMST: A Revisited Turkish Dependency Treebank
Umut Sulubacak∗, Tuğba Pamay†, Gülşen Eryiğit‡
Department of Computer Engineering,
Istanbul Technical University,
Istanbul, 34469, Turkey.
Email: [∗sulubacak, †pamay, ‡gulsen.cebiroglu]@itu.edu.tr
Abstract—In this paper, we present a critical analysis of the dependency annotation
framework used in the METU-Sabancı Treebank (MST), and propose new annotation schemes
that would alleviate the issues we have identified. Later, we describe our attempt at
reannotating the treebank from the ground up using the proposed schemes, and then compare
the consistencies of the two versions via cross-validation using a dependency parser.
According to our experiments, the reannotated version of the original treebank, which we
call the ITU-METU-Sabancı Treebank (IMST), demonstrates a labeled attachment score of
75.3% and an unlabeled attachment score of 83.7%, surpassing the corresponding scores of
65.9% and 76.0% for MST by a very large margin.

I. INTRODUCTION

Despite the considerable interest in Turkish syntax, parsing performances have not seen
a major improvement in a long time, as evidenced by several recent case studies [6], [11],
[17], [18], [30], [35]. Many studies seem to concentrate on specific computational or
linguistic issues, fine-tuning certain aspects of their parsers and leaving the rest
untouched. As a result, although many still demonstrate local improvements, they fail to
make any pivotal progress. As certain issues remain in focus and others fall behind the
spotlight, a considerable portion of the field remains uncharted.

It is likely that there are some issues outside the domain of well-researched cases that
create a bottleneck for syntactic parsing. Considering that virtually all state-of-the-art
parsers make use of supervised learning from human-annotated corpora, it is entirely
possible that the issues stem from imperfections in the training corpora. The METU-Sabancı
Turkish Treebank (MST) [29] has proved to be an invaluable resource over the years, and
has been utilized by almost every Turkish dependency parser to date. However, its
dependency grammar has come to be criticized on occasion from various standpoints, and it
is known to contain a large number of annotation inconsistencies, as also attested in some
previous works [5], [12]. At present, there is no other available resource¹ for Turkish
that would be equivalent or an alternative to MST. This further conceals any issues with
the corpus that might otherwise emerge.

In the light of these considerations, it could be worthwhile to take a detour from
specific case studies and directly tackle the corpus, which, decidedly, has ample room for
improvement. The effort would also be promising in alleviating certain problems commonly
attributed to the corpus, such as excessive parsing difficulty [4] and cross-parser
instability [25]. Although engaging in a tedious investigation in order to recondition a
corpus may not seem cost-effective, previous successful attempts for other prominent
languages [2], [22], [27], [39] provide strong motivation for the effort.

In this paper, we propose changes to certain dependency schemes, leading to an updated
annotation framework for Turkish. We thereby aim to relieve some of the known difficulties
in the current framework, as well as to reduce the stress on human annotators and thus
alleviate manual annotation errors. We also present the ITU-METU-Sabancı Treebank (IMST),
a new version of MST reannotated from the ground up following this new framework. Later,
we make empirical evaluations on our new treebank and report our results. The paper is
structured as follows: Section 2 briefly outlines Turkish and the dependency formalism,
Section 3 explains the problems and the proposed solutions, Section 4 introduces the new
treebank, Section 5 describes the experiments, and finally, Section 6 presents the
conclusion.
II. TURKISH AND THE DEPENDENCY FORMALISM
Though the concept of dependencies has existed since
some of the earliest recorded grammars [32], the modern
dependency grammar is commonly attributed to Tesnière [37].
The formalism has seen a great deal of attention and extensive
usage in computational linguistics in recent years. Essentially,
a dependency grammar defines a set of practical rules on how
to utilize dependencies to model the syntax of a sentence.
Fig. 1: An example dependency tree for a sentence in Turkish (“Kırmızı arabada+ydı”,
‘red’ ‘car’-LOC ‘[s/he] was’) and English (“She was in the red car”). Note that the
definite article does not occur in the Turkish sentence, and the English dependency to
the preposition ‘in’ is analogous to the Turkish locative suffix ‘-da’.

¹There is the ITU Validation Set [14], [15], [20], but it is a fairly small corpus
containing only 300 sentences, and is meant to be a validation or test set for supervised
learners, and is therefore not suitable for training data-driven models.
In this work, as also in the majority of modern syntactic
studies for Turkish, we adopt the dependency formalism.
The formalism necessitates the representation of syntactic
information with sets of directed binary relations (dependencies) between tokens (Fig. 1). Each dependency is defined
between a governing token (the head) and a subordinate token
that modifies it (the dependent), and represented by labeled
arcs from the head to the dependent. The labels assigned
to dependencies indicate the type of the relation, called the
dependency type. For a recent discussion of the dependency
formalism, the interested reader may refer to [23].
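A dependency analysis of this kind can be represented with a very small data structure. The sketch below is an illustrative Python encoding (the class, function and example arcs are our own assumptions, not taken from the paper or its tools): it stores labeled head-dependent arcs and checks the single-head constraint that surface dependency trees obey.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    head: int       # index of the governing token (0 stands for the artificial root)
    dependent: int  # index of the subordinate token that modifies the head
    label: str      # dependency type, e.g. "SUBJECT", "MODIFIER"

def is_single_headed_tree(n_tokens: int, deps: list) -> bool:
    """Check the usual surface-syntax constraint: every token has exactly
    one head, and all indices refer to the root or to tokens in the sentence."""
    heads = {}
    for d in deps:
        if d.dependent in heads:  # a second head for the same token
            return False
        if not (0 <= d.head <= n_tokens) or not (1 <= d.dependent <= n_tokens):
            return False
        heads[d.dependent] = d.head
    return len(heads) == n_tokens  # every token received a head

# A hypothetical 3-token sentence: token 2 is the predicate attached to the root.
deps = [
    Dependency(2, 1, "SUBJECT"),
    Dependency(0, 2, "PREDICATE"),
    Dependency(2, 3, "OBJECT"),
]
print(is_single_headed_tree(3, deps))  # True
```

Deep dependencies of the kind introduced later in the paper would add a second head to some tokens, which is exactly what this single-head check rejects.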
Turkish is a classical example of an agglutinative, morphologically rich language
incorporating a large number of productive derivational suffixes. For example, the suffix
‘+ydı’ (‘[s/he] was’) in Fig. 1 is a third-person singular past copula attached to the
stem ‘araba’ (‘car’). As different portions of such derived words may correspond to
several words in a weakly inflected language such as English, Turkish sentences often
comprise relatively few, highly inflected words. In order to properly analyze the syntax
of a Turkish sentence, words are divided at derivational boundaries into morphosyntactic
units called inflectional groups (IGs). This formalism establishes the tokens of a
sentence as its IGs rather than its orthographic words. Words with multiple IGs are quite
prevalent in Turkish; in fact, it is not unusual to find words with as many as four or
five IGs. Having been practiced in many influential works [19], [18], [21], [28], their
usage has become the de facto standard for parsing Turkish.
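As an illustration of IG-based tokenization, the sketch below splits a hypothetical morphological analysis of the word ‘arabadaydı’ from Fig. 1 at its derivational boundary. The analysis string and the tag names are assumptions for illustration only; they do not reproduce the treebank's actual morphological annotation format.

```python
# "arabadaydı": 'araba' (car) + locative '-da' + past copula '-ydı'.
# In the hypothetical analysis string below, "^DB" marks a derivational
# boundary; inflectional suffixes such as the locative stay inside one IG.
analysis = "araba+Noun+Loc^DB+Verb+Past"

igs = analysis.split("^DB")
print(igs)       # ['araba+Noun+Loc', '+Verb+Past']
print(len(igs))  # 2 -> this one orthographic word contributes two tokens
```

Under this scheme a parser sees two tokens for the single word, with an intra-word DERIV arc between them, as in Fig. 1.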
Throughout the rest of the section, we regularly refer to our
proposed annotation framework, though a description of the
whole framework is not provided in this article. The full list
of the proposed dependency types and their usages is provided
within a separate annotation manual [36].
III. PROBLEMS AND PROPOSED SOLUTIONS

In designing a dependency annotation framework, it is essential to have a clear
definition of the dependency relations and of the set of conventions on when to use which
relation. Although dependency relations would ideally be expressive, exclusive, coherent
and concise, there are often trade-offs between some of these properties. As such, it
becomes a challenge to balance a grammar around them. Considering the drawbacks of the
original MST, we reason that prioritizing clarity and aiming for a minimal dependency
grammar makes better sense in mitigating inconsistency and obscurity.

As a foundation for our work, we carried out an in-depth manual error analysis on the
original MST. Among the most frequent cases that we noted were inconsistently or
erratically annotated linguistic constructions, as well as standard annotation methods
that mandated the usage of certain particles that are optional in informal language. The
subsections below present our attempt to loosely categorize the questionable cases that
we encountered. For each issue, we also provide example cases, discuss our own
standpoints and finally describe our proposed annotation schemes.

In the process of settling on local annotation schemes, we investigated the corresponding
methods followed in some other prominent frameworks [3], [8], [9], [10], [27], [38] and
reviewed previous work on the subject [24], [33], [34]. Through all these, we laid strong
foundations for our decisions.

A. Semantic Incoherence

In the original framework, some dependency relations were used in a way that contradicts
their semantic connotations. Such cases occurred especially in less prevalent secondary
usages of common dependency types. Though it might have seemed counter-productive to
handle such cases under exclusive dependency types or another encompassing type, we
maintain that the incoherence is generally less favorable, as it would confuse the
associations drawn by annotators. Even though this phenomenon was not very common, it
occurred frequently enough to warrant notice.

One example is that adpositional phrases were connected via the dependency label OBJECT.
Although dependents of adpositional phrases are sometimes called adpositional objects,
they are in fact arguments of the adpositional head and unrelated to sentence (or
clausal) objects. Not only was it not immediately obvious that they should be regarded as
objects, but this annotation method also confused parsers and made the prediction of
objects difficult. We assign these the new dependency label ARGUMENT along with the rest
of the phrasal arguments.

Fig. 2: The OBJECT relation used for the object of the main verb (top; “Bir örnek yazdı” /
“S/he wrote an example.”) and for an adpositional phrase argument (bottom; “Kalem ile
yazdı” / “S/he wrote with a pen.”).

Another case was in coordination structures, where the coordinating conjunction was
connected to the succeeding token with the dependency label COORDINATION. This
constitutes a counter-intuitive scenario which semantically implied that the token is in
coordination with the conjunction itself, whereas the tokens should be in coordination
with each other, as also attested in [1]. We make it so that tokens are connected
directly to the next token in coordination, while preserving the COORDINATION label. This
approach has also previously been proved to improve parsing performances in [35], who
applied automatic conversion routines to map coordination structures to different styles
and compared local performances.

Fig. 3: An example (“Barış ve sevgi” / ‘peace and love’) showing the original (top) and
the proposed (bottom) annotation scheme for coordination structures.

B. Hierarchy and Overlap

In the original framework, certain dependency relations fell within the scope of others.
As the grammar did not enact a dependency hierarchy to exploit granularity in dependency
types, this also had a negative effect. The immediate impact was on annotators, for whom
it occasionally became arbitrary which dependency type to use. Parsing frameworks also
suffered from increased entropy in prediction. Yet another impact was on evaluation, as
such cases caused some sound dependency annotations to be considered incorrect, because
any label other than the one in the gold standard would be a mismatch.

An example of this (Fig. 4) is the sub-type ETOL, which comprised a group of multiword
expressions incorporating certain auxiliary verbs, otherwise denoted by the label
COLLOCATION. We eliminate such types altogether.

Fig. 4: Two similar idiomatic expressions indicated by the dependency relations ETOL
(“Söz ettim”, ‘word’ ‘[I] did’ / “I mentioned.”) and COLLOCATION (“Söz verdim”, ‘word’
‘[I] gave’ / “I promised.”).

There were also some cases where a dependency relation overlapped with another in usage,
giving way to confusion. This was most obvious between the label MODIFIER and the
X.ADJUNCT labels for every noun declension (such as DATIVE.ADJUNCT), which are also
effectively modifiers. For instance, while generic adjuncts that did not fall into a
specific category would use a MODIFIER label and a regular nominal adjunct in the
locative case would use the label LOCATIVE.ADJUNCT, certain other adjuncts, which were
grammatically nouns in the locative case, would still be assigned a MODIFIER label due to
semantic concerns. To address this complication, we preserve only the MODIFIER label
(Fig. 5) and eliminate the X.ADJUNCT labels, which are in any case reproducible from
morphological information.

Fig. 5: Nominal adjuncts serving as modifiers were mostly indicated by different
X.ADJUNCT labels according to their cases, e.g. “İnsanı insana insanla insanca anlatma
sanatı” / “The art of relating humans to humans, with humans, like humans.”

C. Ambiguous Annotation

For certain annotation schemes, the framework clearly defined what the head should be,
but not the dependency relation (or vice versa). This encouraged arbitrary annotation, or
else annotation conventions that were quite difficult for annotators to memorize, which
impaired annotation consistency. Although at times this was due to linguistic relations
not properly explained by any dependency label, it was mostly observed in cases of
ambiguity, when a relation could possibly be explained by more than one label. For such
cases, we introduce new dependency types where the involved dependencies would be common
enough to represent a group.

An instance of this phenomenon was seen in phrasal arguments, which were not precisely
covered under any dependency type, and were variously assigned MODIFIER or OBJECT labels.
We introduce the new dependency label ARGUMENT for all cases where exactly one argument
is syntactically required to modify a head, such as in adpositional phrases, in contrast
to modifiers, of which a head could have more than one, or none at all.

D. Optional Annotation

In the original framework, only certain types of punctuation (usually conjunctive
punctuation and terminal periods) had dependency types associated with them, and the rest
were allowed to pass without any head (Fig. 6). These tokens were connected to an
arbitrary head and assigned the label NOT-CONNECTED. This indicated that the dependency
grammar essentially did not enforce dependencies for all tokens in a sentence, which is
required by most dependency parsers, leading to complexity in evaluation. Furthermore,
since NOT-CONNECTED was computationally considered to be a regular dependency type in
parsing, learning performances were also indirectly affected. To address this issue, we
introduce the new label PUNCTUATION and standardize the annotation scheme of all types of
punctuation, as well as eliminating the support for optionality in the grammar. In this
approach, all punctuation should be connected at all times with the PUNCTUATION relation
to the last non-punctuation token occurring before it. Punctuation that begins a sentence
should be connected to the sentence's root node instead (Fig. 6).

Fig. 6: Certain kinds of punctuation that were allowed to pass without a head (top) are
now covered by the new dependency type PUNCTUATION (bottom), illustrated with the
sentence “ Özgün ” . (‘original’).
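The proposed head-assignment rule for punctuation can also be stated procedurally. The sketch below is a hypothetical helper (our own naming, not part of the authors' annotation toolchain): it attaches each punctuation token to the last preceding non-punctuation token, and sentence-initial punctuation to the root.

```python
import unicodedata

def is_punct(token: str) -> bool:
    """Treat a token as punctuation if all its characters have a
    Unicode general category starting with 'P' (punctuation)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def punctuation_heads(tokens: list) -> dict:
    """Assign heads (1-based indices; 0 = root) to punctuation tokens only:
    each attaches to the last preceding non-punctuation token, and
    sentence-initial punctuation attaches to the root."""
    heads = {}
    last_word = 0  # nothing seen yet -> attach to the root
    for i, tok in enumerate(tokens, start=1):
        if is_punct(tok):
            heads[i] = last_word
        else:
            last_word = i
    return heads

# The example of Fig. 6: “ Özgün ” .
print(punctuation_heads(["“", "Özgün", "”", "."]))  # {1: 0, 3: 2, 4: 2}
```

The opening quote gets the root (index 0) as its head, while the closing quote and the period both attach to the token ‘Özgün’, matching the scheme described above.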
E. Reliance on Omissible Tokens

Some annotation schemes required certain tokens to occur in a specific position within
the sentence, and could not be properly applied when those tokens were omitted. This
prevented regular annotation in case of omission, and caused uncertainty as to how to
alternatively mark the relation, which led to annotation inconsistencies. For instance,
coordination structures were annotated with a dependency from the first constituent to
the coordinating conjunction and another dependency from the coordinating conjunction to
the next constituent, which made proper annotation impossible when the coordinating
conjunction was omitted. Adverse cases are not uncommon in non-canonical language, most
notably web jargon, where some common function words are frequently dropped in favor of
brevity. Examples are encountered even in well-typed sentences, caused by less
conventional, idiomatic or archaic usages. Therefore, the issue warranted addressing.

Reliance was perhaps most noticeable in terminal periods (Fig. 7), which were essential
in marking the main predicate of the sentence. The annotation required the predicate to
be connected to the terminal period with the label SENTENCE. This scheme left no option
for legitimately omitting periods, as practiced very frequently in non-canonical
language. To address this, we make it so that predicates are connected directly to the
sentence root with the renamed dependency label PREDICATE, making terminal periods
properly optional.

Fig. 7: Reliance on omissible tokens in the original annotation framework, illustrated
with “Çatal bıçak kullanmıyor” (‘fork’ ‘knife’ ‘[s/he] isn't using’ / “S/he doesn't use a
knife and a fork.”). The sentences show a case where annotation is impossible (top),
except by the addition of conjunctive and terminal punctuation (middle). The scheme we
propose (bottom) is not affected by this.

IV. THE ITU-METU-SABANCI TREEBANK

In order to have an indication of the impact of our proposed schemes and to provide
future studies with a new and fresh training corpus, we annotated the entire METU-Sabancı
Treebank from the ground up. We call this reannotated corpus the ITU-METU-Sabancı
Treebank (IMST). The annotation of IMST was carried out in parallel with the ITU Web
Treebank [31], an original corpus of user-generated web data that was released earlier.
This section provides details about IMST².

For the new corpus, we used an updated version of our ITU Annotation Tool [13]. Five
annotators were employed: one linguist and four computer scientists with considerable
experience in NLP research. Our annotators were well-versed in Turkish morphology and
syntax, and underwent two weeks of supervised training in the new annotation framework
before starting on the annotation. Dependency annotation was made on gold-standard tokens
with pre-allocated morphological analyses³, and was completed within a span of two
months. Although the annotation was started with two annotators for each sentence, our
annotators eventually had to work individually on their exclusive shares of the data due
to budgetary constraints. As a consequence, it was not possible to measure
inter-annotator agreement. Nonetheless, after the initial annotation, sentences from both
corpora were carefully inspected for inconsistent annotation, and a two-week correction
phase followed, which led to the final version.

²The treebank underwent some minor revisions before its release and is at version 1.3 at
the time of this publication. The latest version will be made available for research
purposes at http://tools.nlp.itu.edu.tr/.
³The morphological tags were inherited from a version of MST following a revised
morphological annotation framework established in [7], [16].

A. Deep Dependencies

Another detail to mention about the annotation is that we set out to indicate deep (or
unbounded) dependencies in IMST. Deep dependencies are secondary dependencies of tokens
to other logical heads, often with different dependency relations, in addition to their
regular surface dependencies. The annotation of these dependencies violates the
restriction of each constituent having a single head, and thereby makes a corpus
incompatible with most syntactic parsers without preprocessing. However, deep
dependencies are often favored because they function as cues for semantic parsers
designed to determine the semantic roles of verbal arguments in a sentence. In IMST, we
regularly draw deep dependencies as substitutes for coreference links from zero pronouns,
as well as to mark shared modifiers for tokens in coordination.

B. Corpus Statistics

For a proper comparison between MST and IMST, we provide a particular selection of
comparative statistics before describing our syntactic accuracy tests. Table I displays
sentence, token and dependency counts for both corpora. Table II shows the distribution
of dependencies by dependency relation.

TABLE I: Comparative sentence, token and dependency statistics.

                                    METU-Sabancı       ITU-METU-Sabancı
                                    Treebank (MST)     Treebank (IMST)
  # Sentences                       5,635              5,635
  # Words                           56,424             56,424
  # Tokens (IG)                     67,403             63,089
  # Single-headed Tokens            67,403 (100.0%)    60,688 (96.2%)
  # Multi-headed Tokens             —                  2,401 (3.8%)
  # Dependencies (excl. DERIV)      56,424             59,425
  # Dependencies (incl. DERIV)      67,403             66,090
  # Projective Dependencies         66,145 (98.1%)     64,663 (97.8%)
  # Non-projective Dependencies     1,258 (1.9%)       1,427 (2.2%)

V. EVALUATION

This section presents the statistical analysis we performed on MST and IMST. Section V-A
contains preliminary information about our parsing and evaluation systems. Section V-B
shows the test outcome and a brief discussion of the results.
TABLE II: Distribution of the dependency relation labels.
METU-S ABANCI T REEBANK
A BLATIVE .A DJUNCT
A PPOSITION
A RGUMENT
C ONJUNCTION
C LASSIFIER
C OLLOCATION
C OORDINATION
DATIVE .A DJUNCT
D ERIV
D ETERMINER
E QU.A DJUNCT
E TOL
F OCUS .PARTICLE
I NSTRUMENTAL .A DJUNCT
I NTENSIFIER
L OCATIVE .A DJUNCT
M ODIFIER
MWE
N EGATIVE .PARTICLE
O BJECT
P OSSESSOR
P REDICATE
P UNCTUATION
Q UESTION .PARTICLE
R ELATIVIZER
ROOT
S.M ODIFIER
S ENTENCE
S UBJECT
VOCATIVE
( DISCONNECTED TOKENS )
523 (0.8%)
202 (0.3%)
—
—
2,050 (3.0%)
73 (0.1%)
2,476 (3.7%)
1,361 (2.0%)
10,979 (16.3%)
1,952 (2.9%)
16 (0.0%)
10 (0.0%)
23 (0.0%)
271 (0.4%)
903 (1.3%)
1,142 (1.7%)
11,690 (17.3%)
2,432 (3.6%)
160 (0.2%)
8,338 (12.4%)
1,516 (2.2%)
—
—
289 (0.4%)
85 (0.1%)
5,644 (8.4%)
597 (0.9%)
7,261 (10.8%)
4,481 (6.6%)
241 (0.4%)
2,688 (4.0%)
As shown in Table I, the number of words and evaluated
dependencies (excluding D ERIV) is exactly the same between
the two corpora. The slight difference between the dependency
counts as seen in Table I is due to the updated morphological
analysis framework mentioned earlier in Section IV and the
entailed difference in derivational boundaries. The changes in
IG bounding should only affect the performance of morphological analysis and have a negligible effect on parsing.
Comparing the current LAS of 75.3% for IMST with the
corresponding score of 65.9%4 for MST shows that we manage
an increase of nearly 10 percentage points. The UAS seems to
have improved in a similar way, increasing to 83.7% for IMST
and passing the score of 76.0% for MST by a large margin.
ITU-METU-S ABANCI T REEBANK
—
91 (0.1%)
1,805 (2.7%)
1,360 (2.1%)
—
—
3,078 (4.7%)
—
6,665 (10.1%)
2,180 (3.3%)
—
—
—
—
1,070 (1.6%)
—
15,516 (23.5%)
3,552 (5.4%)
—
5,094 (7.7%)
4,070 (6.2%)
5,741 (8.7%)
10,375 (15.7%)
—
129 (0.2%)
—
—
—
5,174 (7.8%)
190 (0.3%)
—
VI. C ONCLUSION
In this article, we initially described the annotation schemes
we designed based on the dependency grammar of the METUSabancı Treebank (MST). Our new annotation framework
incorporates only 16 dependency relation labels in contrast
to the 24 labels of the baseline, but features generally clearer
and more intuitive dependency types with reduced overlap between each other, hopefully relieving the difficulty of manual
annotation without suffering any loss in expressiveness.
Afterwards, we presented the ITU-METU-Sabancı Treebank
(IMST) as a reannotated version of MST that followed our
revised annotation framework. We additionally marked deep
dependencies in IMST to pave the way for future semantic
role labeling studies. We substantiate the theoretical advantages of our proposed annotation schemes through a parsing
experiment in compliance with the parsing framework used in
the study for the original MST that still remains the state of
the art. Our experiment yielded a labeled attachment score of
75.3% for IMST, surpassing the best score of 65.9% attained
so far on MST by a very large margin.
Finally, considering the outcome of our work, we believe
it would be safe to say that we succeeded in making pivotal
progress by working directly on the training set. We show
that improving the quality of data, although an open-ended
endeavor, has a considerable effect on parsing performances,
and will hopefully pave the way for corpus studies for Turkish.
A. Preliminaries
In our test, we used the same MaltParser [26] configuration
as in [17] so that the results would be properly comparable.
In further accordance with the cited work, non-projective
sentences were eliminated from all training sets, which is
shown to cause a significant performance boost [17], [18].
The dependencies with the relation D ERIV (denoting intraword relations between morphosyntactic units) were excluded
in evaluation, as they are considered trivial. In the literature,
punctuation is either wholly excluded from evaluation (as in
e.g. [4]) or included (as in e.g. [25]). We follow the latter
approach and evaluate the dependencies of punctuation. Since
the inherited parsing framework does not support learning
from dependents annotated with multiple heads, we discard
all deep dependencies from IMST before running the test.
The metrics used in evaluation are the conventional IGbased labeled and unlabeled attachment scores. The unlabeled
attachment score (UAS) considers a prediction to be accurate
if the head token alone was correctly predicted, while the
labeled attachment score (LAS) additionally requires a correct
prediction of the dependency relation. Between the two, a high
LAS is more difficult to attain and more valuable, so we take
the LAS as our primary criterion in performance comparison.
We also provide standard error values, and use McNemar’s
Test for measuring statistical significance where needed.
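As a concrete illustration of the two metrics, the sketch below (not part of the original study; the head indices and relation labels are hypothetical toy values) computes UAS and LAS from per-token (head, relation) pairs:

```python
# Sketch of UAS/LAS computation; head indices and relation labels are toy values.

def attachment_scores(gold, predicted):
    """gold, predicted: one (head_index, relation) pair per token."""
    assert len(gold) == len(predicted)
    n = len(gold)
    # UAS: the head must match; LAS: both the head and the relation must match.
    uas = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0]) / n
    las = sum(1 for g, p in zip(gold, predicted) if g == p) / n
    return uas, las

gold = [(2, "SUBJECT"), (0, "PREDICATE"), (2, "OBJECT"), (2, "PUNC")]
pred = [(2, "SUBJECT"), (0, "PREDICATE"), (2, "MODIFIER"), (2, "PUNC")]
uas, las = attachment_scores(gold, pred)  # all heads correct, one wrong label
```

Since LAS additionally checks the relation label, it can only be lower than or equal to UAS, which is why it serves as the stricter primary criterion.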
ACKNOWLEDGEMENT
This study is part of a research project entitled “Parsing
Web 2.0 Sentences” subsidized by the Turkish Scientific and
Technological Research Council under grant number 112E276
and associated with the ICT COST Action IC1207. We hereby
offer our sincere gratitude to our volunteering annotators
Dilara Torunoğlu-Selamet and Ayşenur Genç, as well as our
colleagues Can Özbey, Kübra Adalı and Gözde Gül İşgüder
who offered additional help with the annotation.
B. Experimental Results
Parsing performances obtained by applying ten-fold cross-validation on IMST are shown side-by-side with the corresponding scores for MST in Table III.
4 The MST score was evaluated excluding punctuation, in accordance with
the conventions at the time. As we reproduced the baseline scores on the original MST, we found the difference between models including and excluding
punctuation to be statistically insignificant (p < 0.01). Conversely, excluding
punctuation in evaluating IMST resulted in a drop in LAS from 75.3% to
70.0%, indicating that the contribution of the new punctuation annotation is
far from wholly accounting for the increase in parsing performance.
TABLE III: Cross-validation scores and standard error values.

                               LAS             UAS
METU-Sabancı Treebank          65.9% ± 0.3%    76.0% ± 0.2%
ITU-METU-Sabancı Treebank      75.3% ± 0.2%    83.7% ± 0.2%
REFERENCES
[21] D. Z. Hakkani-Tür, K. Oflazer, and G. Tür, “Statistical morphological
disambiguation for agglutinative languages,” Computers and the Humanities, vol. 36, no. 4, pp. 381–410, 2002.
[22] K. Haverinen, J. Nyblom, T. Viljanen, V. Laippala, S. Kohonen, A. Missilä, S. Ojala, T. Salakoski, and F. Ginter, “Building the essential
resources for Finnish: the Turku Dependency Treebank,” Language
Resources and Evaluation, pp. 1–39, 2013.
[23] S. Kübler, R. McDonald, and J. Nivre, Dependency Parsing, ser.
Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2009.
[24] R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das,
K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, C. Bedini,
N. Bertomeu Castelló, and J. Lee, “Universal dependency annotation
for multilingual parsing,” in Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics (ACL). Sofia, Bulgaria:
Association for Computational Linguistics, August 2013, pp. 92–97.
[25] J. Nilsson, S. Riedel, and D. Yüret, “The CoNLL 2007 Shared Task on
dependency parsing,” in Proceedings of the CoNLL Shared Task Session
of the Joint Conference on Empirical Methods in Natural Language
Processing (EMNLP) and Computational Natural Language Learning
(CoNLL). Association for Computational Linguistics, 2007, pp. 915–
932.
[26] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryiğit, S. Kübler, S. Marinov,
and E. Marsi, “MaltParser: A language-independent system for data-driven dependency parsing,” Natural Language Engineering, vol. 13,
pp. 95–135, 2007.
[27] J. Nivre, J. Nilsson, and J. Hall, “Talbanken05: A Swedish treebank
with phrase structure and dependency annotation,” in Proceedings of
the 5th International Conference on Language Resources and Evaluation
(LREC), 2006, pp. 1392–1395.
[28] K. Oflazer, “Dependency parsing with an extended finite-state approach,”
Computational Linguistics, vol. 29, no. 4, pp. 515–544, 2003.
[29] K. Oflazer, B. Say, D. Z. Hakkani-Tür, and G. Tür, “Building a Turkish
treebank,” in Treebanks. Springer, 2003, pp. 261–277.
[30] Ö. Çetinoğlu, “Turkish Treebank as a gold standard for morphological disambiguation and its influence on parsing,” in Proceedings of the
9th International Conference on Language Resources and Evaluation
(LREC). Reykjavík, Iceland: European Language Resources Association (ELRA), May 2014.
[31] T. Pamay, U. Sulubacak, D. Torunoğlu-Selamet, and G. Eryiğit, “The
annotation process of the ITU Web Treebank,” in Proceedings of the
9th Linguistic Annotation Workshop (LAW), Denver, CO, USA, 5 June
2015.
[32] W. K. Percival, “Reflections on the history of dependency notions in
linguistics,” Historiographia Linguistica, vol. 17, no. 1-2, pp. 29–47,
1990.
[33] M. Popel, D. Mareček, J. Štěpánek, D. Zeman, and Z. Žabokrtský,
“Coordination structures in dependency treebanks,” in Proceedings of the
51st Annual Meeting of the Association for Computational Linguistics
(ACL). Sofia, Bulgaria: Association for Computational Linguistics,
August 2013, pp. 517–527.
[34] N. Schneider, B. O’Connor, N. Saphra, D. Bamman, M. Faruqui, N. A.
Smith, C. Dyer, and J. Baldridge, “A framework for (under)specifying
dependency syntax without overloading annotators,” in Proceedings of
the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW VII & ID). Sofia, Bulgaria: Association for Computational
Linguistics, August 2013, pp. 51–60.
[35] U. Sulubacak and G. Eryiğit, “Representation of morphosyntactic units
and coordination structures in the Turkish dependency treebank,” in Proceedings of the 4th Workshop on Statistical Parsing of Morphologically-Rich Languages (SPMRL). Seattle, Washington, USA: Association for
Computational Linguistics, October 2013, pp. 129–134.
[36] ——, ITU Treebank Annotation Guide, March 2016, available at
http://tools.nlp.itu.edu.tr, version 2.7.
[37] L. Tesnière, Éléments de Syntaxe Structurale. Éditions Klinksieck,
1959.
[38] L. Van der Beek, G. Bouma, R. Malouf, and G. Van Noord, “The Alpino
Dependency Treebank,” Language and Computers, vol. 45, no. 1, pp.
8–22, 2002.
[39] V. Vincze, V. Varga, K. I. Simkó, J. Zsibrita, A. Nagy, R. Farkas,
and J. Csirik, “Szeged Corpus 2.5: Morphological modifications in
a manually POS-tagged Hungarian corpus.” in Proceedings of the
9th International Conference on Language Resources and Evaluation
(LREC), 2014, pp. 1074–1078.
[1] B. R. Ambati, S. Reddy, and A. Kilgarriff, “Word sketches for Turkish,” in Proceedings of the 8th International Conference on Language
Resources and Evaluation (LREC), 2012, pp. 2945–2950.
[2] E. Bejček, J. Panevová, J. Popelka, P. Straňák, M. Ševčíková,
J. Štěpánek, and Z. Žabokrtský, “Prague Dependency Treebank 2.5 – a
revisited version of PDT 2.0,” in Proceedings of the 24th International
Conference on Computational Linguistics (COLING), 2012, pp. 231–
246.
[3] A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká, “The Prague
Dependency Treebank,” in Treebanks. Springer, 2003, pp. 103–127.
[4] S. Buchholz and E. Marsi, “CoNLL-X Shared Task on multilingual
dependency parsing,” in Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL). Association for
Computational Linguistics, 2006, pp. 149–164.
[5] R. Çakıcı, “Wide-coverage parsing for Turkish,” Ph.D. dissertation, The
University of Edinburgh, 2008.
[6] O. Çetinoğlu and J. Kuhn, “Towards joint morphological analysis and
dependency parsing of Turkish,” in Proceedings of the 2nd International
Conference on Dependency Linguistics (DepLing). Prague, Czech
Republic: Charles University in Prague, Matfyzpress, August 2013, pp. 23–32.
[7] M. Şahin, U. Sulubacak, and G. Eryiğit, “Redefinition of Turkish morphology using flag diacritics,” in Proceedings of The 10th Symposium on
Natural Language Processing (SNLP), Phuket, Thailand, October 2013.
[8] D. Csendes, J. Csirik, T. Gyimóthy, and A. Kocsor, “The Szeged
Treebank,” in Text, Speech and Dialogue. Springer, 2005, pp. 123–
131.
[9] M.-C. De Marneffe, M. Connor, N. Silveira, S. R. Bowman, T. Dozat,
and C. D. Manning, “More constructions, more genres: Extending Stanford Dependencies,” in Proceedings of the 2nd International Conference
on Dependency Linguistics (DepLing). Prague, Czech Republic: Charles
University in Prague, Matfyzpress, August 2013, pp. 187–196.
[10] M.-C. De Marneffe and C. D. Manning, “The Stanford Typed Dependencies representation,” in Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation (COLING). Association for Computational Linguistics, 2008, pp. 1–8.
[11] I. Durgar El-Kahlout, A. A. Akın, and E. Yılmaz, “Initial explorations in
two-phase Turkish dependency parsing by incorporating constituents,”
in Proceedings of the 1st Joint Workshop on Statistical Parsing of
Morphologically Rich Languages (SPMRL) and Syntactic Analysis of
Non-Canonical Languages (SANCL). Dublin, Ireland: Dublin City
University, August 2014, pp. 82–89.
[12] G. Eryiğit, “Dependency parsing of Turkish,” Ph.D. dissertation, Istanbul
Technical University, 2006.
[13] ——, “ITU Treebank Annotation Tool,” in Proceedings of the ACL
Workshop on Linguistic Annotation (LAW), Prague, 24-30 June 2007.
[14] ——, “ITU Validation Set for METU-Sabancı Turkish Treebank,”
March 2007. [Online]. Available:
[15] ——, “The impact of automatic morphological analysis & disambiguation on dependency parsing of Turkish,” in Proceedings of the
8th International Conference on Language Resources and Evaluation
(LREC), Istanbul, Turkey, 23-25 May 2012.
[16] ——, “ITU Turkish NLP Web Service,” in Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden:
Association for Computational Linguistics, April 2014.
[17] G. Eryiğit, T. Ilbay, and O. A. Can, “Multiword expressions in statistical
dependency parsing,” in Proceedings of the 2nd Workshop on Statistical
Parsing of Morphologically Rich Languages (SPMRL). Dublin, Ireland:
Association for Computational Linguistics, October 2011.
[18] G. Eryiğit, J. Nivre, and K. Oflazer, “Dependency parsing of Turkish,”
Computational Linguistics, vol. 34, no. 3, pp. 357–389, 2008.
[19] G. Eryiğit and K. Oflazer, “Statistical dependency parsing of Turkish,”
in Proceedings of the 11th Conference of the European Chapter of the
Association for Computational Linguistics (EACL), Trento, April 2006,
pp. 89–96.
[20] G. Eryiğit and T. Pamay, “ITU Validation Set,” Türkiye Bilişim Vakfı
Bilgisayar Bilimleri ve Mühendisliği Dergisi, vol. 7, no. 1, 2014.
Exploring Spelling Correction Approaches
for Turkish
Dilara Torunoğlu-Selamet, Eren Bekar, Tugay İlbay, Gülşen Eryiğit
Department of Computer Engineering
Istanbul Technical University
Istanbul, 34469, Turkey
[torunoglud, erenbekar, ilbay, gulsen.cebiroglu]@itu.edu.tr
some agglutinative and polysynthetic languages as well as
English, use the Wikipedia articles of the related languages
in order to create the corresponding language models. On
the other hand, the same error models which are used for
English are also used for MRLs, only by adding language-specific characters.
Wang et al. [7] propose a fast and accurate approximate
string search algorithm (ASS) which keeps track of the frequent mistakes (error model) extracted from training data
(consisting of spelling mistakes and their corrections) and
generates the most probable correction candidates. The method
uses a vocabulary trie for validating the generated candidates.
It is very straightforward to collect the training data for the
error model from the user queries of a search engine (the
suggested and selected corrections) as it is conducted in the
mentioned study.
In this paper, we explore how to create a Turkish-specific error
model despite the lack of manually annotated training data, as
well as different combinations of the error model, the
language model and minimum edit distance candidate
generation for spelling correction. We compare our results
with three existing spelling correction systems for Turkish: 1) the
error-tolerant finite-state recognition (ETFSR) approach of
Oflazer [3], 2) MsWord, and 3) Zemberek [8].1 The paper is
structured as follows: Section 2 introduces the error model and
Section 3 discusses the proposed spelling correctors, Section
4 presents the datasets and evaluation metrics used, Section 5
gives the experimental results and discussion, and Section 6
the conclusion and future work.
Abstract—The spelling correction of morphologically rich
languages is hard to solve with traditional approaches since,
in these languages, words may have hundreds of different
surface forms which do not occur in a dictionary. Turkish is
an agglutinative language with a very complex morphology and
lacks annotated language resources. In this study, we explore the
impact of different spelling correction approaches for Turkish
and ways to eliminate the training data scarcity. We test with
seven different spelling correction approaches, four of which
are introduced in this study. As a result of this preliminary
work, we propose a new automatic training data collection
process where existing spelling correctors help to develop an
error model for a better system. Our best performing model uses
a unigram language model and this error model, and improves
the performance scores by almost 20 percentage points over the
widely used baselines. As a result, our study reveals the achievable
top performance with the proposed approach and gives directions
for a better future implementation plan.
Keywords—Spelling Corrector, Spell Checker, Turkish
I. INTRODUCTION
In morphologically rich languages (MRLs) and especially
the agglutinative ones like Turkish, Finnish or Hungarian, a
word may occur in hundreds of different surface forms by
the addition of multiple suffixes to the end of a word stem.
The creation of a lexicon/dictionary consisting of all possible
surface forms is impractical and most of the time not efficient
due to memory space and search speed constraints. As a result,
the usage of a lexicon to check if the newly constructed
candidate of a misspelled word is valid or not, as is the case
in traditional approaches tailored for morphologically poor
languages, becomes unusable for MRLs.
Finite state transducers (FSTs) [1], [2] have proven to be
very well suited to this kind of language and perform
very fast lookup over possible word generations. One of the
early implementations of spelling correction for MRLs is the
error-tolerant finite-state recognition (ETFSR) approach of
Oflazer [3]. Although it is very fast to create the possible
candidates up to the specified edit distance limit, the deficiency
of this approach is that it does not produce an ordered list of
possible corrections which prevents its usage as an automated
spelling corrector. Recent approaches [4]–[6] which focus on
weighted finite-state spell-checking using language models
and error models are very efficient for the spelling correction
of MRLs. Pirinen and Lindén [6] who experiment also with
II. THE ERROR MODEL
Obtaining the error model is a challenging task considering
the lack of manually annotated training data for the Turkish
language. Wang et al. [7] proposed a probabilistic approach
for spelling correction. This approach was novel in that it
used log-linear candidate generation, utilizing a special
data structure that can find top candidates efficiently. The
proposed method works effectively for languages which have
a limited dictionary for lookup. They derived all the possible
1 To the best of our knowledge, at the time of writing this paper, the only
three spelling correction systems that we can compare with were these three
systems.
rules from the training data using a similar approach to Brill
and Moore [9]. In their study, they collected the training
data for the error model from the user queries of a search
engine. Despite not having this opportunity, we propose a
new automatic training data collection process where the
existing spelling correctors help to develop an error model.
We collected a training data set from the Twitter domain. We
then passed all the ill-formed words (those not accepted
by our morphological analyzer) through one online (Google2 ) and
one offline spelling corrector [8], and accepted the corrections
proposed identically by both correctors
as the corrected forms of the ill-formed words in our training
set. At the end of this process, we obtained a training set of
5775 word pairs (ill-formed and corrected words) which have
a character length within a range of 2 to 23. After obtaining
the training set for the error model, we used the same approach
as Wang et al. [7] to store the extracted error rules.
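The agreement-based collection step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; `corrector_a` and `corrector_b` stand in for the online and offline correctors, and the toy lexicon entries are hypothetical.

```python
# Keep an (ill-formed, corrected) pair only when both correctors agree.
def collect_training_pairs(ill_formed_words, corrector_a, corrector_b):
    pairs = []
    for word in ill_formed_words:
        a, b = corrector_a(word), corrector_b(word)
        if a is not None and a == b:  # identical proposals only
            pairs.append((word, a))
    return pairs

# Toy stand-ins for the two correctors used in the collection process.
corr_a = lambda w: {"gelyorum": "geliyorum", "yarn": "yarın"}.get(w)
corr_b = lambda w: {"gelyorum": "geliyorum", "yarn": "yazın"}.get(w)
pairs = collect_training_pairs(["gelyorum", "yarn"], corr_a, corr_b)
# "gelyorum" is kept; the correctors disagree on "yarn", so it is discarded.
```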
We used the Aho-Corasick tree structure for storing and
applying the correction rules. During the generation of the
error model, the rules are extracted from the misspelled
and corrected forms of words by using the Levenshtein edit
distance algorithm. The output of this part is a set of rules
which includes addition, deletion and substitutions of letters.
This rule set also contains the likelihood of each derived rule.
The extracted rules and their estimated likelihoods are stored
in an Aho-Corasick search tree which is a very efficient string
matching trie-based data structure. All leaf nodes in this search
tree have an output link which associates the node itself with
the likelihood of the rule in the node. This allows fetching rules
and their likelihoods efficiently. The tree also stores failure links that
redirect the search to the best applicable node when there is
no way to continue for the queried string. This prevents us
from starting from the beginning each time the search query
fails and results in a significant time gain during the search.
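The rule extraction step can be illustrated as below. This sketch is not the authors' implementation: instead of a full Levenshtein backtrace it uses `difflib`'s edit opcodes, which yield an equivalent set of addition/deletion/substitution operations for short words, and it estimates a rule's likelihood as its relative frequency; the word pairs are toy values.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_rules(pairs):
    """Derive character-level correction rules from (misspelled, corrected) pairs."""
    counts = Counter()
    for wrong, right in pairs:
        for op, i1, i2, j1, j2 in SequenceMatcher(None, wrong, right).get_opcodes():
            if op != "equal":
                # ('', 'i') is an addition, ('i', '') a deletion, ('i', 'u') a substitution
                counts[(wrong[i1:i2], right[j1:j2])] += 1
    total = sum(counts.values())
    return {rule: c / total for rule, c in counts.items()}  # rule likelihoods

rules = extract_rules([("gelyorum", "geliyorum"), ("glyorum", "geliyorum")])
```

In the actual system, such rules and likelihoods would then be inserted into the Aho-Corasick tree for fast lookup.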
Fig. 1: Spelling Corrector #1
transducer which is built from a stem lexicon for the MRL
in focus and the morphotactic and phonetic rules to generate
the inflected forms of these stems) as the language validator.
Figure 1 draws the main components of SC1. The training
phase is the process of creating the error model which is
explained in Section II. In the candidate generation phase,
the previously constructed Aho-Corasick tree is looked-up for
all applicable rules for a given misspelled word. Since not all
the rules generate a valid surface form, the generated results
should be validated by the FST. If the constructed word is
validated by the FST, then all applied rule likelihoods are
summed up and this forms the likelihood of the candidate
word. As a pruning technique, before applying a rule, it is
always checked that the rule likelihood is able to generate a
more probable candidate. If not, the rule is not applied to the
misspelled word.
As a result, our approach differs from the original ASS
model [7] in two main points: 1) the usage of FST for
validation, 2) the calculation of the rule set probabilities in the
training phase. In the original work, they employ a log-linear
model for calculating the probabilities of rule sets whereas in
our work, we simply use likelihoods for preliminary investigation.
III. SPELLING CORRECTORS
ETFSR and Zemberek use edit-distance based candidate
generation approaches. The following subsections introduce
our new approaches that we experiment with, which are
basically the different combinations of the language and error
models as well as ETFSR.
Spelling Corrector #1 (SC1)
Our first approach is an adaptation of Wang et al. [7]. Since
creating a lexicon which will cover all possible surface forms
in an MRL is not practical and efficient in that the required
memory allocation for the data structure is very big even with
the most compact data structures3 , instead of the vocabulary
trie for candidate validation, SC1 uses an FST (a finite-state
Spelling Corrector #2 (SC2)
As mentioned in the introductory section, the output of
ETFSR is a set of unsorted candidates and the size of the
candidate list is unpredictable. SC2 is an approach to deal
with this deficiency by re-ranking ETFSR outputs using the
probabilities calculated from the error model as explained
previously. Figure 2 shows the structure of SC2, where the
misspelled inputs firstly enter the ETFSR. We then retrieve
the rules (and their scores) from our rule tree that should
be applied to the misspelled word to generate each candidate
in the output list from the ETFSR. In other words, we get
2 At the time of this collection process, Google spelling correction service
was still available.
3 In the early stages of our implementation, we tried to just place the most
frequently occurring surface forms extracted from a corpus into the lexicon
and even this approach took more than 500 MB of memory using a suffix
tree, which we believe is not acceptable for a spelling corrector application
in practical use.
the list of applied rules (which can be addition, deletion
and substitution of a letter) according to the Levenshtein edit
distance between the misspelled word and the corresponding
candidate. When we have the rules for a candidate, we sum
up the costs of the applied rules, and then simply sort the
candidates by their costs. The candidate with the minimum
cost is accepted as the most probable correction.

Spelling Corrector #4 (SC4)
SC4 is inspired by Lindén and Pirinen [6], in that it uses
a language model and an error model together in order to generate
candidates. SC4 uses the same unigram language model as
SC3 and the same error model introduced in Section II. SC4
differs from SC1 in that the candidates generated by the error
model are validated using the language model instead of
the FST, and the best proposal is selected as the candidate with
minimum rule cost and maximum unigram probability:

argmax_{c ∈ Gen} p(c) · (1 / rulecost(c))

Laplace smoothing [10] is used in order to compensate for
the absence of a candidate word in the language model. SC4
is depicted in Figure 4.
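A minimal sketch of SC4's selection rule under the stated assumptions: candidates carry an accumulated rule cost from the error model, and p(c) is a Laplace-smoothed (add-one) unigram estimate. The candidate list, counts, vocabulary size and token total below are toy values, not figures from this study.

```python
def sc4_best(candidates, unigram_counts, vocab_size, total_tokens):
    """candidates: (surface form, accumulated rule cost) pairs."""
    def p(c):  # Laplace-smoothed unigram probability
        return (unigram_counts.get(c, 0) + 1) / (total_tokens + vocab_size)
    # argmax over candidates of p(c) * 1 / rulecost(c)
    return max(candidates, key=lambda cand: p(cand[0]) / cand[1])[0]

cands = [("geliyorum", 2.0), ("geliyoruz", 1.5)]
counts = {"geliyorum": 120, "geliyoruz": 15}
best = sc4_best(cands, counts, vocab_size=50_000, total_tokens=1_000_000)
# the far more frequent "geliyorum" wins despite its higher rule cost
```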
Fig. 2: Spelling Corrector #2
Spelling Corrector #3 (SC3)
Inspired by previous works by Linden and Pirinen [4]–[6],
SC3 aims to make use of unigram language models for
candidate sorting. To this end, a unigram language model is
trained from word surface forms from a Turkish corpus. The
ETFSR outputs are then re-ranked similarly to SC2 but this
time using the unigram probabilities. The candidate having
the highest probability and the smallest edit distance from
the input misspelled word is then accepted as the produced
correction. The structure of SC3 is shown in Figure 3.
Fig. 4: Spelling Corrector #4
Table I displays the usage and combination of language and
error models as well as the candidate generation method in
the introduced spelling correctors. As can be noticed from the
table, the difference between SC2 and SC1 is that in SC2,
which uses ETFSR in its candidate generation stage, all the
produced candidates are already valid words, whereas in SC1
the candidates are validated after being produced by the use
of the error model. The last two spelling correctors (SC3
and SC4), which use language models, are the most memory-consuming systems, as expected and explained in the introductory section. They are tested both with ETFSR candidate generation (SC3) and Aho-Corasick candidate generation (SC4).
SC4 also uses the error model in its probability calculation.
Another possible system (discussed in the following sections)
which could provide a slight increase in the scores would be
a combination of ETFSR, the language model and the error
model, though it was not tested as part of this study.
Fig. 3: Spelling Corrector #3
TABLE I: Models Used in Different Approaches

Model   Error Model   Language Model   Candidate Generation
SC1     yes           no               Aho-Corasick
SC2     yes           no               ETFSR
SC3     no            yes              ETFSR
SC4     yes           yes              Aho-Corasick

TABLE II: Output Statistics

Approach   Average Duration (ms)   Average Index
ETFSR      388                     1.46
SC1        3385                    0.9
SC2        389                     0.6
SC3        333                     0.35
SC4        363                     0.145

The added cost due to re-ranking is smaller than a single millisecond4 over ETFSR.
IV. EXPERIMENTAL SETUP
Table III gives the comparison of the spelling correction
accuracies of our models with the mentioned tools. Although
the Google spelling suggestion API was used during the
creation of our training data, it could not be compared with
the other spelling correctors in this section since it is no longer
available. In this experiment, for all of the systems, we took
the first suggestion given by that system and compared it with
the gold-standard correction in our test set. Our best model
outperforms the widely used Zemberek spelling corrector by
almost 20 percentage points. Despite the modest size of our
training data set, which we were not able to keep extending
due to the unavailability of one of the services (the Google spelling
suggestion API) that we had used, we see that the proposed
error model on its own (SC2) outperforms MsWord by more
than 2 percentage points. We believe that, with the addition of
extra training data, the system performance may be improved
even further. As future work, self-training approaches may
be tested for learning the error-rule probabilities. We also
observe that the language model has a much higher
impact, by almost 10 percentage points. One should note
that the language model used is just a unigram surface model
and better results may be obtained with more sophisticated
language models.
We tested our system on Turkish, which is a highly agglutinative language carrying all the characteristics of a morphologically rich language. We used an available two-level
morphological analyzer of Oflazer [11] as the FST language
validator of our system in SC1 and again the ETFSR from
Oflazer [3] in SC2 and SC3.
To obtain a unigram language model we used the corpus
introduced by Sak et al. [12]. The text corpus compiled from
the web contains about 500M tokens. Due to the composition
of the data found on the web, the corpus includes noisy data.
We extracted only the valid Turkish words which constitute
842 MB of the corpus (almost 43M valid tokens).
During the collection of the test data, for the sake of fairness,
we do not include errors made on purpose due to social media
writing trends, such as emoticons and words that are typed
without vowels or the proper diacritics, which would be
corrected in a normalization stage [13] rather than spelling
correction.
The creation of the training data to train the error model is
explained in Section II. Since this automatic approach is only
applied during the creation of the training data used in rule
extraction, this does not hamper the evaluation on our test data
which is manually annotated with corrected forms (1016 word
pairs).
V. EXPERIMENTAL RESULTS & DISCUSSIONS

In our experiments, we first test ETFSR and the spelling
correctors introduced in Section III and evaluate their results.
We then compare our models with other available spelling
correctors for Turkish.

Table II presents some statistics for ETFSR and the other
models (SC1, SC2, SC3 and SC4): namely, the average operation
time of each spelling correction approach on the test set described
in the previous section, and the average index of the correct
candidate among all generated candidates. The index starts
from 0, indicating that the first candidate in the output is the
correct one according to the manually annotated test set. One may
notice from this table that the ETFSR approach produces results
very fast, but the correct answer generally occurs in lower
positions of the produced candidate list. On the other hand, SC1
is almost 10 times slower than ETFSR but produces more
accurate results. SC2 is much faster than SC1 and has almost
similar success rates. SC3 and SC4 are similar to SC2 in terms
of average duration but give better average index results.

TABLE III: Comparison with previous studies

Approach   Accuracy
ETFSR      49.0%
Zemberek   61.4%
MsWord     66.3%
SC1        68.6%
SC2        67.8%
SC3        78.7%
SC4        80.7%
In order to investigate the results and the behavior of the
algorithms more closely, we also made a different evaluation
based on promoting the correct candidate appearing in the top
n list of the algorithm’s output. Table IV presents these scores
for n = 1, 3, 5 and 10; e.g., SC4 positioned the correct candidate
in its top 3 list in 92.7% of the cases.
We can observe that the success rates of all the models
become similar as n increases, meaning that ETFSR is also
successful in generating the correct candidate in its top 10
list. But SC3 and SC4 are certainly more suited to be used
4 The training time (629 ms with our available training data) is not added to
this cost since it occurs only once in the preparation stage and the pre-trained
model is only loaded at the beginning of the testing stage.
TABLE IV: Candidate List Evaluation

Candidate
List Size   ETFSR   SC1     SC2     SC3     SC4
1           49.0%   68.6%   67.8%   78.7%   80.7%
3           76.7%   88.6%   89.1%   92.7%   92.7%
5           86.2%   93.5%   92.9%   94.5%   97.0%
10          93.8%   95.7%   95.4%   95.5%   98.9%

REFERENCES
[1] K. R. Beesley and L. Karttunen, Finite State Morphology: Xerox Tools and Techniques. CSLI Publications, 2003.
[2] K. Lindén, M. Silfverberg, and T. Pirinen, “HFST tools for morphology – an efficient open-source package for construction of morphological analyzers,” 2009.
[3] K. Oflazer, “Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction,” Computational Linguistics, vol. 22, no. 1, 1996.
[4] K. Lindén, T. Pirinen et al., “Weighting finite-state morphological analyzers using HFST tools,” 2009.
[5] T. Pirinen, K. Lindén et al., “Finite-state spell-checking with weighted language and error models,” 2010.
[6] T. A. Pirinen and K. Lindén, “State-of-the-art in weighted finite-state spell-checking,” 2014.
[7] Z. Wang, G. Xu, H. Li, and M. Zhang, “A fast and accurate method for approximate string search,” Association for Computational Linguistics, 2011.
[8] A. A. Akın and M. D. Akın, “Zemberek, an open source NLP framework for Turkic languages,” vol. 10, 2007.
[9] E. Brill and R. C. Moore, “An improved error model for noisy channel spelling correction,” Association for Computational Linguistics, 2000.
[10] “Laplacian smoothing and Delaunay triangulations,” vol. 4, no. 6, 1988.
[11] K. Oflazer, “Two-level description of Turkish morphology,” Literary and Linguistic Computing, vol. 9, no. 2, 1994.
[12] H. Sak, T. Güngör, and M. Saraçlar, “Resources for Turkish morphological processing,” Language Resources and Evaluation, vol. 45, no. 2, 2011.
[13] D. Torunoğlu and G. Eryiğit, “A cascaded approach for social media text normalization of Turkish,” April 2014.
as automated spelling correctors. In the top 1 evaluation, the difference is as
high as 31.7 percentage points between ETFSR and SC4.
Although SC3 and SC4 both yield very high scores, they
are both memory-inefficient due to the surface language
models they use. A better possible system, combining the two, would
essentially be the system proposed by Lindén and Pirinen [6]
together with our automatically created error model, which we
aim to develop in future work. Although the difference between
candidate generation using FSTs and the Aho-Corasick tree is not
statistically significant5, we expect that the memory consumption will
be alleviated with a better implementation, even though there may not be
an increase in performance.
VI. CONCLUSION & FUTURE WORK
In this study, we explored ways to eliminate the scarcity
of training data for spelling correction, as well as the impact
of different spelling correction approaches for Turkish. We
proposed a new automatic training data collection process
where existing spelling correctors contribute to the development of an error model, paving the way for better systems.
We explained four spelling correction approaches adapted for
Turkish, alternatively using language models, error models
and combinations of candidate generation approaches, and
reported their performances for Turkish in comparison with
three established spelling correctors. Our work has been a preliminary investigation of better spelling correction approaches
for MRLs, and there is still much that could be further investigated and improved, such as 1) Automatically increasing
training set size, 2) Integrating self-training approaches in
learning error rule probabilities, and 3) Using weighted finitestate language and error models. Although we used a simple
unigram language model in our best-performing systems, we
observed that the systems making use of the language model
outperform those without the model by about 10 percentage
points. Furthermore, we believe that using weighted finite-state language and error models would produce slightly better
results than the ones presented in this paper, as well as
eliminate the memory consumption problem of our best
corrector.
ACKNOWLEDGMENT
This work is part of our ongoing research project “Parsing
Turkish Web 2.0 Sentences” supported by ICT COST Action
IC1207 TUBITAK 1001 (grant no: 112E276).
5 We used McNemar's test to evaluate the difference between
SC1 (68.6%) and SC2 (67.8%) and found that the difference between these
two models is not statistically significant, with a two-tailed p-value of 0.7.
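The significance check in the footnote can be reproduced with the McNemar statistic, which needs only the two disagreement counts (how many items each system alone got right). The counts below are hypothetical, since the paper reports only the resulting p-value:

```python
import math

def mcnemar(b, c):
    """Continuity-corrected McNemar chi-square for two paired classifiers.
    b: items only system 1 got right; c: items only system 2 got right.
    Returns (statistic, two-sided p-value). The chi-square(1 dof) tail
    probability is computed via the complementary error function."""
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(stat / 2))  # survival function of chi2 with 1 dof
    return stat, p

# Hypothetical disagreement counts for SC1 vs. SC2.
stat, p = mcnemar(b=30, c=26)
print(round(stat, 3), round(p, 3))
```

With small, nearly balanced disagreement counts like these, the p-value stays far above 0.05, matching the footnote's conclusion of no significant difference.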
Framing of Verbs for Turkish PropBank
Gözde Gül Şahin
Department of Computer Engineering
Istanbul Technical University
Istanbul, 34469, Turkey
[email protected]
Abstract—In this work, we present our method for framing
the verbs of Turkish PropBank and discuss the incorporation
of crowd intelligence to increase the quality and coverage
rate of the annotated frames. First, we discuss the manual
framing process performed by experts with the help of publicly available
dictionaries, corpora and guiding morphosemantic features
such as case markers. Then, we present a systematic way of
framing challenging cases such as light verbs, multiword
expressions and derived verbs. Later, a verb sense disambiguation task, where the verb senses correspond to annotated
frames, is crowdsourced. Finally, the results of the verb sense
disambiguation task are used to increase the coverage rate
and quality of the created linguistic resource. In conclusion,
a new lexicon of Turkish verbs with 759 annotated verbs
and 1262 annotated senses is constructed.

Keywords—Turkish PropBank; Semantic Role Labeling; Semantic Frame; Light Verb; MWE

I. INTRODUCTION

In recent years a considerable amount of research has
been performed on extracting semantic information from
sentences. Revealing such information is usually achieved
by identifying the arguments of a predicate and assigning
meaningful labels to them. Each label represents the
argument's relation to its predicate and is referred to as
a semantic role; the task is named semantic role
labeling (SRL). SRL aims to answer the question "Who
did what to whom?" and thus reveal the full meaning of a
sentence. It has been employed in machine translation,
information extraction and question answering tasks.

There exist different semantic role annotation schemes,
of which the most commonly used are VerbNet [18],
FrameNet [6] and PropBank [16]. FrameNet (FN) is a
semantic network built around the theory of semantic
frames. All predicates in the same semantic frame share one
set of Frame Elements (FEs). In the example below, a
sentence with the predicate "buy", annotated with the FrameNet,
VerbNet and PropBank conventions, is given. The predicate
"buy" belongs to "Commerce buy", a frame of FrameNet
which contains "Buyer" and "Goods" as core frame elements
and "Seller" as a non-core frame element. Moreover,
FN provides connections between semantic frames such as
inheritance, hierarchy and causativity. Contrary to FN,
VerbNet (VN) is a hierarchical verb lexicon that contains
categories of verbs based on the Levin verb classification [18].
The predicate "buy" is contained in the "get-13.5.1" class
of VN, along with the verbs "pick", "reserve" and
"book". Members of the same verb class share the same set of
semantic roles, referred to as thematic roles. In addition to
thematic roles, verb classes are defined with all possible
syntaxes for each class. One possible syntax is given below
the exemplary sentence. Unlike FrameNet and VerbNet,
PropBank (PB) [16] does not make use of a reference
ontology like semantic frames or verb classes. Instead,
semantic roles are numbered from Arg0 to Arg5 for the
core arguments. Moreover, PropBank has an associated
annotated corpus that helps researchers specify SRL as
a task and is used as training and test data for
supervised machine learning methods [11], [21].

[I]Buyer-Agent-Arg0 bought [a coat]Goods-Theme-Arg1 from
[the flea market.]1

Syntax: Agent V Theme {From} Source

In [17], the authors investigate the usability of the
FrameNet, VerbNet and PropBank conventions for modern
Turkish and conclude that the PropBank convention with
additional morphosemantic features would be the most
appropriate semantic resource. Unfortunately, creating a
qualified PropBank for a morphologically complex, low-resource language is a challenging task. Creation of
such a corpus is generally considered as the combination
of two subtasks: framing of verbs and corpus annotation
with the framed verbs. The framing process includes deciding on
the verbs to annotate, examining different senses of the
chosen verbs and deciding on the arguments for each
verb sense. Languages with rich resources mostly
perform corpus annotation in one step with a large number
of annotators. Due to the small number of expert annotators,
we chose to divide corpus annotation into two microtasks
for crowdsourcing: verb sense annotation and argument
labeling. In the verb sense annotation task, people are
asked to disambiguate the meaning of the verbs in
sentences from a morphologically and syntactically analysed corpus, and in the argument labeling task, annotators are
asked to label the arguments of the previously annotated
verb senses.

Framing of Turkish verbs can be considered the
most important step of PropBank creation. Errors
introduced in the framing process may accumulate and may
significantly reduce the accuracy and reliability of the semantic role labeling task. Moreover, it can be considered
the most complicated task, especially for languages with

1 In PropBank, Arg0 is used for the actor, agent, experiencer or cause of
the event; Arg1 represents the patient, if the argument is affected by the
action, and the theme, if the argument is not structurally changed.
rich derivational morphology and a large number of
light verbs and multiword expressions, like Turkish. In
this paper, we focus on this process due to its importance
and difficulty, while we investigate the details of the later
processes in other studies. In the next sections, we present
the details of our approach to choosing verbs and their
arguments for annotation, framing light verbs/multiword
expressions and incorporating declension. Further on,
we explain how we interpreted the results of the crowdsourced
verb sense disambiguation microtask for fine-tuning verb
frames. Finally, we conclude by describing the properties
of the created linguistic resource and the results of the improvement process.
II. METHOD

We took a two-pass framing approach. In the first
pass, we performed regular framing as explained in the
PropBank framing guidelines [5], based on available resources: the publicly available dictionary prepared
by the Turkish Language Association [19], a large corpus
(the Turkish National Corpus) [2] that can be queried for
different usages of words, and an open-source annotation
tool, Cornerstone [8]. The senses of the verbs and the
case marking of their arguments were decided by manually
investigating the sentences appearing in the search results of
the TNC corpus. Then, the arguments of the predicates
were labeled with VerbNet thematic roles and PropBank
argument numbers, by checking the English equivalent of
the Turkish verb sense where possible. This process was repeated for
all verb senses. However, the low number of expert framers
and the limited amount of time allocated for framing caused
incomplete, inaccurate and subjective frames. In order to
reduce these problems, we utilized crowd feedback
from a verb sense disambiguation task and performed a
second pass on the framing of Turkish verbs.

III. FIRST PASS: CREATION OF VERB FRAMES

The PropBank framing guidelines [5] are an important source of information that discusses how the verbs in the
English PropBank should be framed. Although we have
followed that guideline [5], the rich derivational morphology
of Turkish and its large number of light verbs (LV) and multiword expressions (MWE) introduce challenges for Turkish
framers.

LVs and MWEs are still an active research area for
linguists [20], and due to the complexity of this issue the
annotation of LV and MWE constructions in PropBank has
been investigated separately in [14]. Even though
PropBanks have been constructed for morphologically
rich languages such as Hindi/Urdu, Arabic and Finnish,
the modern Turkish language poses more challenges due to
its extreme derivational morphology. According to the Turkish
Language Association (TDK)2, there are 759 root verbs,
2380 verbs derived from nouns and 2944 verbs derived
from verbs. However, these numbers only account for
the first level of derivation, such as "sev-iş (to make
love)", the reciprocal form of "sev (to love)". In contemporary
everyday Turkish, it is observed that words have about 3 to
4 morphemes including the stem [15], such as "sev-iş-tir-il
(to be made to make love with someone)", which has 3
derivational morphemes: reciprocal, causative and passive,
respectively.

Due to these challenging cases, our approach to each
of them and the tools that were used are explained in the following
subsections in a guideline fashion.

A. Root Verbs

The Turkish Language Association is a trustworthy source
of lexical datasets and dictionaries. We initiated our
framing efforts with the list of Turkish root verbs provided
by TDK. This list consists of 759 root verbs; however, it
contains verbs that are rarely used or have fallen into
disuse, such as the ones shown in Table I. In order to detect
those root verbs we used the TNC (Turkish National
Corpus), a balanced and representative corpus
of contemporary Turkish with about 50 million words. Its
query interface, shown in Fig. 1, allows regular expressions,
which is essential for querying verbs that appear in different
conjugated forms in unstructured text. We performed
queries on all root verbs and framed them if their frequency was above 5 per million words. Overall, only
385 of the verbs were found to be above this threshold.
Some exemplary root verbs that were excluded from the
framing process are given with their frequencies in Table I.

Root Verb                                                           Count   Frequency
eğir (to spin cotton for making thread)                             105     2.24
semir (to batten, get fat)                                          80      1.68
yüksün (to regard someone, something as a burden)                   52      1.09
çıv (to be deflected)                                               24      0.5
evele (to hum and haw)                                              16      0.34
göynü (to be grieved)                                               5       0.1
ılga (to run at a gallop - used only for horses without a rider)    5       0.1
çemre (to roll up one's sleeves, trouser legs, or skirts)           4       0.08
ipile (to give a very dim light)                                    1       0.02
fışılda (to make a swishing or rustling sound)                      0       0

Table I: Excluded root verbs and their frequencies per million words
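The frequency filter described above can be sketched as follows; the raw counts here are invented, and the real figures come from TNC queries over the roughly 50-million-word corpus:

```python
# Hypothetical raw corpus counts for a few root verbs.
CORPUS_SIZE = 50_000_000
raw_counts = {"sev": 41_200, "eğir": 112, "semir": 84, "çıv": 25}

THRESHOLD_PER_MILLION = 5

def verbs_to_frame(counts, corpus_size, threshold=THRESHOLD_PER_MILLION):
    # Keep a verb only if its frequency per million tokens exceeds the threshold.
    per_million = {v: c * 1_000_000 / corpus_size for v, c in counts.items()}
    return sorted(v for v, f in per_million.items() if f > threshold)

print(verbs_to_frame(raw_counts, CORPUS_SIZE))  # → ['sev']
```

With these toy numbers, "eğir" comes out at 2.24 occurrences per million, matching its exclusion in Table I, while a common verb easily clears the threshold.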
Figure 1: TNC Query for "sev-iş-tir* (to make someone make love with someone)"

B. Derivational Morphology of Verbs

Turkish is among the languages with rich derivational morphology. According to TDK, there exist 10 morphemes
that derive verbs from verbs, and 2944 derived verbs. Of
these morphemes, 6 are known as valency-changing morphemes and
are responsible for 98% of the derived verbs. In Table II,
the counts of derived verbs categorized according to their
types are shown. In [17], it is stated that Turkish
valency-changing morphemes always cause a predictable
transformation; thus it is sufficient to have frames for the
root verbs only. An exemplary causative transformation
for the intransitive verb "laugh" and the transitive verb "wear" is
given in Fig. 2. When intransitive verbs are causativized,
the causee becomes the patient of the causation event.
In other words, the central argument of the root verb
(Arg0 if it exists, otherwise Arg1) is marked with the ACC
case and becomes an internal argument (usually Arg1)
of the new causative verb. For transitive root verbs, the
central argument, Arg0 of the root verb, receives the DAT
case marker and serves as an indirect object (usually
as Arg2), while Arg1 serves again as Arg1.3 Moreover,
there exist some verbs which are frequently used in their
causative forms with some deviation in meaning, such
as "yaz-dır", the causative form of the verb "yaz (to write)",
which means to register someone at a school/course. In
order to have an accurate framing process, separate frames
were created for such verbs. In addition to verb-to-verb
derivational morphemes, there exist 2380 verbs that are
derived from nominal words via 12 different morphemes,
as stated by TDK. We claim that creating a nominal bank
and linking those derived verbs with entries from the
nominal bank would be more appropriate. Thus, only the
most frequent ones are included in the current bank and
the rest is kept as the subject of a further study.

Figure 2: Causative Transformations

2 TDK is the official organization of the Turkish language, founded in
1932. It is responsible for conducting linguistic research on Turkish and other Turkic languages, and for publishing the official Turkish
dictionary. (www.tdk.gov.tr)

3 The causative morpheme introduces a new argument, called the causer, into
the valence pattern. In Fig. 2, the causer is shown as A0, whereas in the
PropBank for Hindi/Urdu it may be shown as A-A.

Morpheme                                          Type                 Count
-akla, -ekle, -ıkla, -ikle, -ukla, -ükle          Not Valency          8
-ala, -ele                                        Not Valency          22
-ımsa, -imse, -umsa, -ümse                        Not Valency          5
-zir                                              Not Valency          1
Total                                                                  36
-ş, -aş, -eş, -ış, -iş, -uş, -üş                  Reciprocal           258
-l, -al, -el, -ıl, -il, -ul, -ül                  Passive              528
-n, -ın, -in, -un, -ün                            Passive/Reflexive    720
-r, -ar, -er, -ır, -ir, -ur, -ür                  Causative            29
-t, -at, -et, -ıt, -it, -ut, -üt                  Causative            510
-tır, -tir, -tur, -tür, -dır, -dir, -dur, -dür    Causative            863
Total                                                                  2908

Table II: Derivational Morphemes

C. Light Verbs and Multiword Expressions (MWE)

Light verbs are verbs that cannot stand on their own in the
sentence but can occur with another verb
or a nominal [7]. Light verb constructions in Turkish are
complex predicates formed by a nominal and one
of the light verbs such as ol-, et-, gel-, ver-, dur-, kal-,
düş-, bulun-, eyle- and buyur- [20]. Other than Turkish,
light verb constructions can also be encountered in many
languages such as Japanese, Korean, Persian, English,
French and German.

The light verb itself may contribute comparatively little to
the meaning, or it may have no contribution at all, as in "teşekkür et- (to thank)". In such cases, where the meaning is mostly
conveyed by the nominal, the phrase is treated as a new
predicate (teşekkür et). In addition, Turkish light verbs
are not necessarily light in all uses. Consider the function
of the verb et- in the sentence "Üç artı iki beş eder
(Three plus two makes five)". The framing process for such verbs is handled
in the same way as for other root verbs.

Most of the time, MWEs are confused with light verb
constructions. In order to avoid such discussions, we approach
the problem practically rather than categorizing verbs as
LVC or MWE. We either treat such verbs as another sense
of the root verb or as a complex predicate. The criteria
followed during the decision process are:

• Deviation from the original meaning of the verb root,
• The nominal's contribution to the meaning of the complex predicate,
• The frequency of the complex predicate,
• Being a fixed phrase.
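The criteria above can be caricatured as a toy decision rule. In the actual work these decisions were made manually by expert framers; all field names and thresholds below are invented for illustration only:

```python
# Toy rendering of the framing decision for nominal + light-verb combinations.
def frame_as(candidate):
    """Return 'separate predicate' or 'sense of root verb'."""
    if candidate["fixed_phrase"] and candidate["freq_per_million"] > 1.0:
        return "separate predicate"    # e.g. "söz ver (to promise)"
    if candidate["meaning_deviates"] and candidate["nominal_dominant"]:
        return "separate predicate"    # meaning carried mostly by the nominal
    return "sense of root verb"        # e.g. "öncelik ver (to give priority)"

soz_ver = {"fixed_phrase": True, "freq_per_million": 4.2,
           "meaning_deviates": True, "nominal_dominant": True}
oncelik_ver = {"fixed_phrase": False, "freq_per_million": 2.0,
               "meaning_deviates": False, "nominal_dominant": False}

print(frame_as(soz_ver), "|", frame_as(oncelik_ver))
```

The point of the sketch is only that fixedness and frequency pull toward a separate predicate, while free, compositional combinations stay as senses of the root verb.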
In Table III, our framing approach for the verb "ver (to
give)" is shown as an example. The second sense has the
meaning "to fix, to establish", as in giving/fixing an appointment, name or price. Similarly, ver.03 is defined as to
devote, allocate, as in "öncelik vermek (to give priority)",
"emek vermek (to give/devote effort)" and "zaman vermek
(to give/allocate time)". These phrases are not fixed and
the contribution of the nominal is not dominant; hence
they are framed as new senses of the root verb. On the
contrary, the complex predicates "söz ver (to promise)",
"izin ver (to allow)", "kulak ver (to listen carefully)" and
"hesap ver (to explain)" are fixed phrases and have
high frequency in the TNC corpus. Hence they are determined
to be separate predicates.
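A frame lexicon of this shape can be represented, for illustration, with a small in-memory schema. Cornerstone itself stores frames as XML, and the role inventories below are hypothetical rather than the published frames:

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    arg: str          # PropBank argument number, e.g. "Arg1"
    theta: str        # VerbNet thematic role, e.g. "Theme"
    case: str = ""    # Turkish case marker guiding annotation, e.g. "ACC"

@dataclass
class Frame:
    predicate: str                 # lemma, possibly a complex predicate
    sense_id: str                  # e.g. "ver.02"
    meaning: str
    roles: list = field(default_factory=list)

# Illustrative entries modeled on the "ver" discussion; role sets are assumed.
ver_02 = Frame("ver", "ver.02", "to fix, to establish",
               [Role("Arg0", "Agent", "NOM"), Role("Arg1", "Theme", "ACC")])
soz_ver = Frame("söz ver", "ver.09", "to promise",
                [Role("Arg0", "Agent", "NOM"), Role("Arg2", "Recipient", "DAT")])

lexicon = {f.sense_id: f for f in (ver_02, soz_ver)}
print(lexicon["ver.09"].predicate)  # → söz ver
```

Keying the lexicon by sense identifier mirrors how the crowdsourced task later refers back to individual frames.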
Predicate    Sense     Meaning                Example
ver          ver.01    To transfer            Hediye vermek (Give presents)
ver          ver.02    To fix                 Randevu vermek (Give an appointment)
ver          ver.03    To devote, allocate    Öncelik vermek (Give priority)
söz ver      ver.09    To promise             Bana söz ver (Promise me)
kulak ver    ver.12    To listen carefully    Bana kulak ver (Listen to me)

Table III: Framing of the verb "ver- (to give)"

D. Annotation Tool

For framing purposes, we have adjusted an already
available open-source software tool, Cornerstone [8]4. In
[17], the correlation between case marking information and semantic roles has been shown. That motivated
us to include case markings in the framing process.
To supply the case marking information of the argument, a
drop-down menu containing the six possible case markers in
Turkish was added, as shown in Fig. 3.

Figure 3: Cornerstone Software Adjusted for Turkish

In this section, we have explained the process of framing
Turkish verbs by expert annotators, with our systematic
decision process for the challenging cases introduced by LVs,
MWEs and rich derivational morphology.

4 Cornerstone is also used for building the English, Chinese and
Hindi/Urdu PropBanks.

IV. SECOND PASS: INTERPRETING VERB SENSE DISAMBIGUATION RESULTS

The overall aim of this study is to build a corpus with
annotated semantic roles. For this purpose we use an
existing Turkish Dependency Treebank [15] with morphological and syntactic analysis and add an extra layer
with predicate senses and their arguments. For the annotation
of verb senses, we have crowdsourced a verb sense disambiguation task, where people are asked to choose the
appropriate frame or "Hiçbiri (None)" for all the verbs
in the treebank. An exemplary question from the original
task is given in Fig. 4. At the end of the task, 5855 rows
had been annotated by at least three annotators, 265 rows
were annotated per hour, and the whole annotation process
took 68 hours. More than 100 taskers contributed from
39 different cities of Turkey, and the overall annotator
agreement was calculated as 83.15%. The details of this
work are presented in another paper. The consolidation
of one or more contributor responses into a summarized
result is referred to as aggregation. We have analyzed
the results which have a confidence level lower than
0.7 or an aggregated result of "None". Out of 6000 rows, 2174 rows had
confidence lower than 0.7 and 738 rows were aggregated
as "None". We manually performed
a second-pass expert annotation for the rows with low
confidence and eliminated 1200 of the 2174 rows,
since their aggregated results were already accurate. We
have investigated the main reasons for annotators choosing
the option "None" and taken the appropriate actions:

• Mistakes in the morphological analysis of the predicate,
such as analyzing the verb as "sok" (to put) instead
of "sokul" (to get near), or "kal" (to stay) instead of
"kaldır" (to lift): these erroneous analyses have been
corrected and the appropriate sense chosen by an
expert.
• Missing meanings: they are added to PropBank.
• Confusion caused by metaphorical expressions: verb
senses are coarse-grained, thus metaphorical expressions are treated the same way as non-metaphorical
expressions, as suggested in the PropBank guidelines [5].

Similarly, we have detected the causes of the low-confidence
rows as follows:

• Fine-grained verb senses: when two senses of the
predicate have close meanings, it leads to confusion;
such frames were detected and merged.
• Missing meanings: they are added to PropBank.
• Confusing the meaning of the complete sentence with
the meaning of the verb in question: these were revised
and annotated by an expert.
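The aggregation and confidence filtering used in this second pass can be sketched as simple majority voting; the 0.7 cutoff comes from the text, while the vote lists below are invented:

```python
from collections import Counter

def aggregate(votes):
    """Majority-vote aggregation: return (label, confidence), where
    confidence is the fraction of annotators agreeing on the winner."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def needs_expert_pass(votes, threshold=0.7):
    # Rows aggregated as "None" or below the confidence cutoff are
    # routed to a second-pass expert annotation.
    label, conf = aggregate(votes)
    return label == "None" or conf < threshold

print(needs_expert_pass(["ver.01", "ver.01", "ver.01"]))  # unanimous → False
print(needs_expert_pass(["ver.01", "ver.02", "None"]))    # low confidence → True
```

Real crowdsourcing platforms weight votes by annotator trust rather than counting them equally, so this is only the simplest possible aggregation scheme.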
Figure 4: A question from the Verb Sense Disambiguation task. The corresponding English translations are shown near the
original text, starting with (En)
V. CONCLUSION

In conclusion, we have presented a new linguistic resource, a Turkish verb lexicon that consists of the verbs
and their arguments present in the Turkish Dependency Treebank [15], together with verbs that are frequently
used but not present in the Treebank. In total, 759
verb roots and 1262 verb senses are annotated.

We have explained our approach to framing light verbs
and multiword expressions, which can be inherited by other
languages where light verb constructions are as common
as in Turkish. We have presented a different approach to
the framing problem, with a two-step solution to ensure
the quality and quantity of the lexicon. In the first pass,
the framing guidelines explained in Section III were
constructed and expert annotators framed 1135 verb
senses. In the second pass, the results from a crowdsourced
verb sense disambiguation task were incorporated to
improve the quality of the verb lexicon as well as to increase
its coverage rate. As a result, the number of annotated
verb frames increased from 675 to 759 and the total
number of annotated senses increased from 1135 to 1262.

As future work, we plan to construct a NominalBank
to account for the verbs that are derived from nouns, and to
crowdsource an argument annotation task where people will
be asked to choose the most appropriate label for the verb
sense given in the question.

The work explained in this paper presents the first
and most important step in creating the necessary resources
for Turkish to be included in the task of semantic role labeling. We believe that these resources will encourage the NLP
community to build semantic role labelers covering
a wider range of language families, and will enable
the community to work on a more challenging language.

REFERENCES

[1] Eneko Agirre, Izaskun Aldezabal, Jone Etxeberria and
Eli Pociello. 2006. A Preliminary Study for Building the
Basque PropBank. In LREC 2006, Genoa.

[2] Yeşim Aksan and Mustafa Aksan. 2012. Construction of the
Turkish National Corpus (TNC). In LREC 2012, İstanbul.

[3] Izaskun Aldezabal, Marı́a Jesús Aranzabe,
Arantza Dı́az de Ilarraza Sánchez and Ainara Estarrona.
2010. Building the Basque PropBank. In LREC 2010, Malta.

[4] Nart B. Atalay, Kemal Oflazer and Bilge Say. 2003. The
Annotation Process in the Turkish Treebank. In Proceedings
of the EACL Workshop on Linguistically Interpreted Corpora,
Budapest.

[5] Olga Babko-Malaya. 2005. Guidelines for Propbank
Framers. http://verbs.colorado.edu/∼mpalmer/projects/ace/
FramingGuidelines.pdf

[6] Collin F. Baker, Charles J. Fillmore, and John B. Lowe.
1998. The Berkeley FrameNet Project. In Proceedings of the
36th Annual Meeting of the Association for Computational
Linguistics and 17th International Conference on Computational Linguistics, Vol. 1, PA, USA, 86-90.

[7] M. Butt. 2004. The Light Verb Jungle. Papers from the
GSAS/Dudley House Workshop on Light Verbs. Cambridge,
Harvard Working Papers in Linguistics: 1-49.

[8] Jinho D. Choi, Claire Bonial and Martha Palmer. 2010.
Propbank Frameset Annotation Guidelines Using a Dedicated
Editor, Cornerstone. In LREC 2010, Malta.

[9] Mona Diab, Alessandro Moschitti and Daniele Pighin. 2008.
Semantic Role Labeling Systems for Arabic Language using
Kernel Methods. In Proceedings of the 46th Annual Meeting
of the Association for Computational Linguistics: Human
Language Technologies.

[10] Christiane Fellbaum, Anne Osherson and Peter E. Clark.
2007. Putting Semantics into WordNet's "Morphosemantic"
Links. Computing Reviews, 24(11):503–512.

[11] Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic Role Labeling via FrameNet, VerbNet and PropBank.
In Proceedings of the 21st International Conference on Computational Linguistics, pp. 929-936.

[12] Abdelati Hawwari, Wajdi Zaghouani, Tim O'Gorman,
Ahmed Badran and Mona Diab. 2013. Building a Lexical
Semantic Resource for Arabic Morphological Patterns. In
Communications, Signal Processing, and their Applications
(ICCSPA).

[13] Mehmet Hengirmen. 2004. Türkçe Dilbilgisi. Engin
Yayınevi.

[14] Jena D. Hwang, Archna Bhatia, Clare Bonial, Aous Mansouri, Ashwini Vaidya, Nianwen Xue, and Martha Palmer.
2010. PropBank Annotation of Multilingual Light Verb Constructions. In Proceedings of the Fourth Linguistic Annotation Workshop, Association for Computational Linguistics,
Stroudsburg, PA, USA, 82-90.
[15] Kemal Oflazer, Bilge Say, Dilek Z. Hakkani-Tür and
Gökhan Tür. 2003. Building a Turkish Treebank. Invited
chapter in Building and Exploiting Syntactically Annotated
Corpora, Anne Abeille (Ed.), Kluwer Academic Publishers.

[16] Martha Palmer, Paul Kingsbury and Daniel Gildea. 2005.
The Proposition Bank: An Annotated Corpus of Semantic
Roles. Computational Linguistics, 31(1):71–106.

[17] Gozde Gul Isguder Sahin and Esref Adalı. 2014. Using
Morphosemantic Information in Construction of a Pilot Lexical Semantic Resource for Turkish. In Proceedings of the
21st International Conference on Computational Linguistics,
pp. 929-936.

[18] Karin K. Schuler. 2006. VerbNet: A Broad-Coverage,
Comprehensive Verb Lexicon. PhD diss., University of Pennsylvania.

[19] Turkish Language Association. 2005. Güncel Türkçe Sözlük
(Contemporary Turkish Dictionary). http://www.tdk.gov.tr/
index.php?option=com gts&view=gts

[20] Aygül Uçar. 2010. Light Verb Constructions in Turkish
Dictionaries: Are They Sub-meanings of Polysemous Verbs?
Mersin University Journal of Linguistics and Literature, 7(1),
1-17.

[21] Shumin Wu. 2013. Semantic Role Labeling Tutorial:
Supervised Machine Learning Methods. In Conference of the
North American Chapter of the Association for Computational
Linguistics: Human Language Technologies.
A free/open-source hybrid morphological
disambiguation tool for Kazakh
Zhenisbek Assylbekov∗ , Jonathan North Washington† , Francis Tyers‡ , Assulan Nurkas∗ ,
Aida Sundetova§ , Aidana Karibayeva§ , Balzhan Abduali§ , Dina Amirova§
∗School of Science and Technology, Nazarbayev University
†Departments of Linguistics and Central Eurasian Studies, Indiana University
‡HSL-fakultehta, UiT Norgga árktalaš universitehta
§Information Systems Department, Al-Farabi Kazakh National University
Abstract—This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation, demonstrating a per-token accuracy of 91% on
running text.

I. Introduction

In this paper, we present a free/open-source hybrid morphological disambiguation tool for Kazakh. Morphological
disambiguation is the task of selecting the sequence of morphological parses corresponding to a sequence of words from
the set of possible parses for those words. Morphological
disambiguation is an important step for a number of NLP tasks,
and this importance becomes more crucial for agglutinative
languages such as Kazakh, Turkish, Finnish, Hungarian, etc.
For example, by using a morphological analyzer together
with a disambiguator, the perplexity of a Turkish language
model can be reduced significantly [1]. Kazakh (like
any morphologically rich language) presents an interesting
problem for statistical natural language processing, since the
number of possible morphological parses is very large due to
productive derivational morphology [2, 3]. In this work
we combine rule-based [4] and statistical [5] approaches to
disambiguate Kazakh text: the output of a morphological
analyzer is pre-processed using constraint-grammar rules [6],
and then the most probable sequence of analyses is selected.
Our combined approach works well even with a small hand-annotated training corpus. The performance of the presented
hybrid system can likely be improved further when a larger
hand-tagged corpus becomes available.

In Section II, we present relevant properties of Kazakh.
Then, in Section III, we review related work on part-of-speech (POS) tagging and morphological disambiguation. In
Section IV, we describe the statistical model for morphological
disambiguation. We finally present and discuss our results in
Section V.

II. Kazakh

Kazakh (natively қазақ тілі, қазақша) is a Turkic language
belonging to the Kypchak (or Qıpçaq) branch, closely related
to Nogay (or Noğay) and Qaraqalpaq. It is spoken by around
13 million people in Kazakhstan, China, Mongolia, and adjacent areas [7].

Kazakh is an agglutinative language, which means that
words are formed by joining suffixes to the stem. A Kazakh
word can thus correspond to English phrases of various lengths,
as shown below:

дос                friend
достар             friends
достарым           my friends
достарымыз         our friends
достарымызда       at our friends
достарымыздамыз    we are at our friends

The effect of rich morphology can be observed in parallel
Kazakh-English texts. The table below provides the vocabulary
sizes, type-token ratios (TTR) and out-of-vocabulary (OOV)
rates of the Kazakh and English sides of a parallel corpus used in
[8].

                    English    Kazakh
Vocabulary size     18,170     35,984
Type-token ratio    3.8%       9.8%
OOV rate            1.9%       5.0%
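The type-token ratio and OOV rate reported in the table are straightforward to compute. A sketch with toy token lists follows (the paper's figures come from the parallel corpus of [8], not from these examples):

```python
def type_token_ratio(tokens):
    # Number of distinct word types divided by the number of tokens.
    return len(set(tokens)) / len(tokens)

def oov_rate(test_tokens, train_vocab):
    # Fraction of test tokens unseen in the training vocabulary.
    unseen = sum(1 for t in test_tokens if t not in train_vocab)
    return unseen / len(test_tokens)

# Toy data only, for illustration.
train = "біз құрылысты бастадық біз".split()
test = "біз бастадық жылы".split()

print(round(type_token_ratio(train), 2))       # 3 types / 4 tokens → 0.75
print(round(oov_rate(test, set(train)), 2))    # 1 unseen / 3 tokens → 0.33
```

A higher TTR for Kazakh than for English on parallel text is exactly the sparsity symptom the surrounding paragraph describes: each lemma surfaces in many inflected forms, so types multiply faster than tokens.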
It is easy to see that rich morphology leads to sparse-data problems for statistical natural language processing of Kazakh, be
it in machine translation, text categorization, sentiment
analysis, etc. A common approach (see [9, 10, 11, 12]) for
morphologically rich languages is to convert surface forms
into lexical forms (i.e. analyze the words), and then perform some
morphological segmentation of the lexical forms (i.e. split the
analyses). The segmentation schemes are usually motivated
by linguistics and the domain of intended use. For example,
for a Kazakh-English word alignment task we could be in
18
favor of the following segmentation of the above mentioned
word достарымыздамыз1
достар
дос⟨n⟩⟨pl⟩
friends
ымыз
⟨px1pl⟩
our
да
⟨loc⟩
at
same idea is present in [16]. One of the most well-known
corpora, Brown corpus, was automatically pre-tagged with
a rule-based tagger, TAGGIT [17]. The earliest probabilistic
tagger known to us is [18]. One of the first Markov Model
taggers was created at the University of Lancaster as part
of Lancaster-Oslo-Bergen corpus tagging effort [19, 20]. The
type of Markov Model tagger that tags based on both word
probabilities and tag transition probabilities was introduced by
Church [21] and DeRose [22]. All these taggers are trained on
hand-tagged data. Kupiec [23], Cutting et al. [24], and others
show that it is also possible to train a Hidden Markov Model
(HMM) tagger on unlabeled data, using the EM algorithm
[25]. An experiment by Merialdo [26], however, indicates
that with even a small amount of training data, a tagger
trained on hand-tagged data worked better than one trained
via EM. Other notable approaches in POS tagging are Brill’s
transformation-based learning paradigm [27], the memorybased tagging paradigm [28], and the maximum entropy-based
approach [29].
мыз
+e⟨cop⟩
⟨p1⟩⟨pl⟩
are
we
since each segment of the Kazakh word would then correspond
to a single word in English. The problem is that often for a
word in Kazakh we have more than one way to analyze it, as
in the example below:
‘in 2009 , we started the construction works .’
2009 жылы біз құрылысты бастадық .
жылы⟨adj⟩
‘warm’
жылы⟨adj⟩⟨advl⟩
‘warmly’
→ жыл⟨n⟩⟨px3sp⟩⟨nom⟩
‘year’
жылы⟨adj⟩⟨subst⟩⟨nom⟩
‘warmth’
Selecting the correct analysis from among all possible analyses
is called morphological disambiguation. Due to productive
derivational morphology this task itself suffers from data
sparseness. To alleviate the data sparseness problem we break
down the full analyses into smaller units – inflectional groups.
An inflectional group is a tag sequence split by a derivation
boundary. For example, in the sentence that follows, the word
айналасындағыларға ‘to the ones in his vicinity’ is split
into root r and two inflectional groups, g1 and g2 , the first
containing the tags before the derivation boundary -ғы and
the second containing the derivation boundary and subsequent
tags.
Жəңгір хан мен оның айналасындағыларға . . .

(айнала)·(сын·да)·(ғы·лар·ға)
айнала⟨n⟩⟨px3sp⟩⟨loc⟩⟨subst⟩⟨pl⟩⟨dat⟩
r = айнала,  g1 = ⟨n⟩⟨px3sp⟩⟨loc⟩,  g2 = ⟨subst⟩⟨pl⟩⟨dat⟩

Morphological disambiguation in inflectional or agglutinative languages with complex morphology involves determining not only the major or minor parts of speech, but also all relevant lexical and morphological features of surface forms. Levinger et al. [30] suggested an approach for morphological disambiguation of Hebrew. Hajič and Hladká [31] used a maximum entropy modeling approach for morphological disambiguation of Czech, an inflectional language. Hajič [32] extended this work to five other languages, including English and Hungarian (an agglutinative language). Ezeiza et al. [33] combined stochastic and rule-based disambiguation methods for Basque, which is also an agglutinative language. Megyesi [34] adapted Brill's POS tagger with extended lexical templates to Hungarian.
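The split of a full analysis into a root and inflectional groups can be sketched mechanically. The following is our reading of the rule, not the authors' released code; the set of derivational-boundary tags follows the list given later in the training section, and the ⟨…⟩ tags are written as <…> here.

```python
import re

# Sketch: cut an Apertium-style analysis into root + inflectional groups,
# opening a new group at every derivational-boundary tag. The boundary set
# (subst, attr, advl, ger, plus ger_*/gpr_*/gna_*/prc_* variants) follows the
# tags listed in the training section; this is our reading of the rule.
BOUNDARIES = {"subst", "attr", "advl", "ger"}
BOUNDARY_PREFIXES = ("ger_", "gpr_", "gna_", "prc_")

def split_igs(analysis):
    root, tag_str = re.match(r"([^<]+)((?:<[^>]+>)+)$", analysis).groups()
    tags = re.findall(r"<([^>]+)>", tag_str)
    groups, current = [], []
    for t in tags:
        if current and (t in BOUNDARIES or t.startswith(BOUNDARY_PREFIXES)):
            groups.append(current)   # derivation boundary: close the current IG
            current = []
        current.append(t)
    groups.append(current)
    return root, groups

root, igs = split_igs("айнала<n><px3sp><loc><subst><pl><dat>")
print(root, igs)  # айнала [['n', 'px3sp', 'loc'], ['subst', 'pl', 'dat']]
```

On the example above this recovers the root айнала and the two inflectional groups g1 and g2.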
We will heavily exploit the following observation about dependency relationships, which was made by Hakkani-Tür et al. [5, p. 387] for Turkish but is valid for Kazakh as well:
When a word is considered to be a sequence of inflectional
groups, syntactic relation links only emanate from the last
inflectional group of a (dependent) word, and land on one of
the inflectional groups of the (head) word on the right.
Of all the languages that are widely researched nowadays, Turkish is the closest one to Kazakh. Previous approaches to morphological disambiguation of Turkish text employed constraint-based methods (Oflazer and Kuruöz [35]; Oflazer and Tür [36, 37]), statistical methods (Hakkani-Tür et al. [5], Sak et al. [38]), or both (Yuret and Türe [39], Kutlu and Cicekli [40]).
III. Related work
Recently, some work has been done towards developing morphological disambiguation tools for Kazakh. Salimzyanov et al. [4] provide constraint grammar rules which reduce ambiguity from 2.4 to 1.4 analyses per form in a running text. Makhambetov et al. [41] present a comparison of part-of-speech taggers trained on the Kazakh National Corpus [42]: the best result obtained, using the full training data of around 600,000 tokens, was a per-token accuracy of 86% when cross-validated on the same training data with 10 folds. Kessikbayeva and Cicekli [43] present a transformation-based morphological disambiguator for Kazakh which is trained on a hand-annotated corpus of over 30,000 words and achieves 87% accuracy when tested against test data of around 15,000 words.
Morphological disambiguation of inflectional and agglutinative languages was inspired by part-of-speech (POS) tagging
techniques. Due to Chomsky's criticism of the inadequacies of Markov models [14, ch. 3] and the lack of training data and computing resources to pursue an 'empirical' approach to natural language, early work on POS tagging using Markov chains had been largely abandoned by the early sixties. The earliest
‘taggers’ were simply programs that looked up the category
of words in a dictionary. The first well-known program which
attempted to assign tags based on syntagmatic contexts was
the rule-based program presented in [15], though roughly the
¹ Hereinafter we use the Apertium tagset [13] for analyzed forms.
IV. Statistical morphological disambiguation

A. Derivation

Following [44], we will use the notation in Table I. We use subscripts to refer to words and tags in particular positions of the sentences and corpora we tag. We use superscripts to refer to word types in the lexicon of words and to refer to tag types in the tag set.

TABLE I: Notation

w_i: the word (token) at position i in the corpus
t_i: the tag of w_i
w_{i,i+m}: the words occurring at positions i through i+m
t_{i,i+m}: the tags t_i ... t_{i+m} for w_i ... w_{i+m}
r_i: the root of w_i
g_{i,k}: the k-th inflectional group of w_i
n: length of a text chunk (be it a sentence, a paragraph or a whole text)
w: the words w_{1,n} of a text chunk
t: the tags t_{1,n} for w_{1,n}

The basic mathematical object with which we deal here is the joint probability distribution Pr(W = w, T = t), where the random variables W and T are a sequence of words and a sequence of tags. We also consider various marginal and conditional probability distributions that can be constructed from Pr(W = w, T = t), especially the distribution Pr(T = t). We generally follow the common convention of using uppercase letters to denote random variables and the corresponding lowercase letters to denote specific values that the random variables may take. When there is no possibility for confusion, we write Pr(w, t), and use similar shorthands throughout.

In this compact notation, morphological disambiguation is the problem of selecting the sequence of morphological parses (including the root), t = t_1 t_2 ... t_n, corresponding to a sequence of words w = w_1 w_2 ... w_n, from the set of possible parses for these words:

    arg max_t Pr(t|w).    (1)

Using Bayes' rule and taking into account that w is constant for all possible values of t, we can rewrite (1) as:

    arg max_t [Pr(t) × Pr(w|t) / Pr(w)] = arg max_t Pr(t) × Pr(w|t).    (2)

In Kazakh, given a morphological analysis² including the root, there is only one surface form that can correspond to it, that is, there is no morphological generation ambiguity. Therefore, Pr(w|t) = 1, and the morphological disambiguation problem (2) is simplified to finding the most probable sequence of parses:

    arg max_t Pr(t).    (3)

Keep in mind that the search space in equations (1)–(3) is not equal to the set of all hypothetically possible sequences t. Instead, it is limited to the set of parse sequences that can correspond to w. This limited set is obtained as the full or constrained output of a morphological analysis tool.

Using the chain rule, the probability in (3) can always be rewritten as:

    Pr(t) = ∏_{i=1}^{n} Pr(t_i | t_{1,i−1}).    (4)

It is important to realize that equation (4) is not an approximation. We are simply asserting in this equation that when we generate a sequence of parses, we can first choose the first analysis; then we can choose the second parse given our knowledge of the first parse; then we can select the third analysis given our knowledge of the first two parses, and so on. As we step through the sequence, at each point we make our next choice given our complete knowledge of all our previous choices.

The conditional probabilities on the right-hand side of equation (4) cannot all be taken as independent parameters because there are too many of them. In the bigram model, we assume that

    Pr(t_i | t_{1,i−1}) ≈ Pr(t_i | t_{i−1}).

That is, we assume that the current analysis depends only on the previous one. With this assumption we get the following:

    Pr(t) ≈ ∏_{i=1}^{n} Pr(t_i | t_{i−1}).    (5)

However, the probabilities on the right-hand side of this equation still cannot be taken as parameters, since the number of possible analyses is very large in morphologically rich languages. Following the discussion in Section II, we split morphological parses across their derivational boundaries, i.e. we consider a morphological analysis as a sequence of a root (r_i) and inflectional groups (g_{i,k}), and therefore each parse t_i can be represented as (r_i, g_{i,1}, ..., g_{i,n_i}). Then the probabilities Pr(t_i | t_{i−1}) can be rewritten as:

    Pr(t_i | t_{i−1})
      = Pr((r_i, g_{i,1}, ..., g_{i,n_i}) | (r_{i−1}, g_{i−1,1}, ..., g_{i−1,n_{i−1}}))
      = {chain rule}
      = Pr(r_i | (r_{i−1}, g_{i−1,1}, ..., g_{i−1,n_{i−1}}))
        × Pr(g_{i,1} | (r_{i−1}, g_{i−1,1}, ..., g_{i−1,n_{i−1}}), r_i) × ...
        × Pr(g_{i,n_i} | (r_{i−1}, g_{i−1,1}, ..., g_{i−1,n_{i−1}}), r_i, g_{i,1}, ..., g_{i,n_i−1}).    (6)

In order to simplify this representation we make the following independence assumptions:

    Pr(r_i | (r_{i−1}, g_{i−1,1}, ..., g_{i−1,n_{i−1}})) ≈ Pr(r_i | r_{i−1}),    (7)

    Pr(g_{i,k} | (r_{i−1}, g_{i−1,1}, ..., g_{i−1,n_{i−1}}), r_i, g_{i,1}, ..., g_{i,k−1}) ≈ Pr(g_{i,k} | g_{i−1,n_{i−1}}),    (8)

i.e. we assume that the root in the current parse depends only on the root of the previous parse, and each inflectional group in the current parse depends only on the last inflectional group of the previous parse (this last assumption is motivated by the remark at the end of Section II). Now, from (6), (7), and (8) we get:

    Pr(t_i | t_{i−1}) ≈ Pr(r_i | r_{i−1}) ∏_{k=1}^{n_i} Pr(g_{i,k} | g_{i−1,n_{i−1}}) =: Pr_b(t_i | t_{i−1}),    (9)

where we define r_0 = '.' and g_{0,n_0} = '<sent>'. Now, putting together (5) and (9), we have:

    Pr(t) ≈ ∏_{i=1}^{n} Pr(t_i | t_{i−1}) ≈ ∏_{i=1}^{n} [ Pr(r_i | r_{i−1}) ∏_{k=1}^{n_i} Pr(g_{i,k} | g_{i−1,n_{i−1}}) ] =: Pr_b(t).    (10)

Pr(r^l | r^m) and Pr(g^l | g^m) are parameters (root and IG probabilities) which can be estimated using manually disambiguated texts.

² We use the terms morphological analysis and parse interchangeably, to refer to individual distinct morphological parses of a token.

B. Parameters estimation

Assume we are observing a sequence of n tokens w_1, w_2, ..., w_n, and each token was manually disambiguated, i.e. we possess a sequence of corresponding parses t_1, t_2, ..., t_n. Then the likelihood of our data is given by equation (10), and in order to find maximum likelihood estimates for the parameters Pr(r^l | r^m) and Pr(g^l | g^m) we need to solve the following optimization problem:

    ∏_{i=1}^{n} [ Pr(r_i | r_{i−1}) ∏_{k=1}^{n_i} Pr(g_{i,k} | g_{i−1,n_{i−1}}) ] → max    (11)

subject to the constraints

    Σ_l Pr(r^l | r^m) = 1,    Σ_l Pr(g^l | g^m) = 1.    (12)

Using the method of Lagrange multipliers [45], one can show that the solution of (11) subject to constraints (12) is given by:

    Pr_MLE(r^l | r^m) = C(r^m, r^l) / C(r^m),    Pr_MLE(g^l | g^m) = C(g^m, g^l) / C(g^m),    (13)

where C(r^m) is the number of occurrences of r^m, C(r^m, r^l) is the number of occurrences of r^m followed by r^l, C(g^m) is the number of occurrences of g^m, and C(g^m, g^l) is the number of parses with g^m as the last IG followed by a parse containing g^l. However, the maximum likelihood estimates suffer from the following problem: what if a bigram has not been seen in training, but then shows up in the test data? Using the formulas (13) we would assign unseen bigrams a probability of 0. Such an approach is not very useful in practice. If we want to compare different possible parses for a sentence, and all of them contain unseen bigrams, then each of these parses receives a model estimate of 0, and we have nothing interesting to say about their relative quality. Since we do not want to give any sequence of words zero probability, we need to assign some probability to unseen bigrams. Methods for adjusting the empirical counts that we observe in the training corpus to the expected counts of n-grams in previously unseen text involve smoothing, interpolation and back-off; they have been discussed by Good [46], Gale and Sampson [47], Witten and Bell [48], Kneser and Ney [49], and Chen and Goodman [50]. The latter paper presents an extensive empirical comparison of several widely-used smoothing techniques and introduces a variation of Kneser–Ney smoothing that consistently outperforms all other algorithms evaluated. We used it for estimating the parameters of the bigram model (10).

C. Tagging with the Viterbi algorithm

Once the parameters are estimated, we could evaluate the bigram model (10) for all possible parses t_{1,n} of a sentence of length n, but that would make tagging exponential in the length of the input. An efficient tagging algorithm is the Viterbi algorithm (Algorithm 1). It has three steps: initialization (lines 1–2), induction (lines 3–8), and termination and path readout (lines 9–12). We compute two functions: δ_i(t^j), which gives us the probability of parse t^j for word w_i, and ψ_{i+1}(t^j), which gives us the most likely parse at word w_i given that we have the parse t^j at word w_{i+1}. A more detailed discussion of the Viterbi algorithm for tagging is provided in [51].

Algorithm 1: Algorithm for tagging
Require: a sentence w_{1,n} of length n
Ensure: a sequence of analyses t_{1,n}
 1: δ_0(('.', <sent>)) = 1.0
 2: δ_0(t) = 0.0 for t ≠ ('.', <sent>)
 3: for i = 1 to n step 1 do
 4:   for all candidate parses t^j do
 5:     δ_i(t^j) = max_{t^k} [δ_{i−1}(t^k) × Pr_b(t^j | t^k)]
 6:     ψ_i(t^j) = arg max_{t^k} [δ_{i−1}(t^k) × Pr_b(t^j | t^k)]
 7:   end for
 8: end for
 9: X_n = arg max_{t^j} δ_n(t^j)
10: for j = n − 1 to 1 step −1 do
11:   X_j = ψ_{j+1}(X_{j+1})
12: end for
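The derivation above reduces tagging to a best-path search, which Algorithm 1 performs. Below is a small runnable sketch of that search. It is our own illustration, not the authors' released tagger: a parse is represented as a pair (root, tuple of IGs), the 1e-6 floor is a crude stand-in for smoothing, and the toy probabilities reuse the IG log-probabilities reported in the error analysis of Section V.

```python
import math

# Sketch of Algorithm 1 under the model of equation (10); our own
# illustration, not the authors' released tagger.

def log_score(prev, cur, p_root, p_ig):
    """log Pr_b(cur | prev): root bigram plus one term per IG of cur,
    each conditioned on the LAST IG of the previous parse."""
    (prev_root, prev_igs), (root, igs) = prev, cur
    s = math.log(p_root.get((prev_root, root), 1e-6))  # 1e-6: crude smoothing stand-in
    for g in igs:
        s += math.log(p_ig.get((prev_igs[-1], g), 1e-6))
    return s

def viterbi(candidates, p_root, p_ig):
    """candidates[i] is the list of possible parses of word i; returns the best path."""
    delta = {(".", ("<sent>",)): 0.0}  # initialization: the start parse
    history = []
    for cands in candidates:           # induction over words
        new_delta, psi = {}, {}
        for t in cands:
            best = max(delta, key=lambda p: delta[p] + log_score(p, t, p_root, p_ig))
            new_delta[t] = delta[best] + log_score(best, t, p_root, p_ig)
            psi[t] = best
        delta, history = new_delta, history + [psi]
    last = max(delta, key=delta.get)   # termination
    path = [last]
    for psi in reversed(history[1:]):  # path readout
        path.append(psi[path[-1]])
    return list(reversed(path))

# Toy run on the error-analysis example from Section V; the IG probabilities
# are the log-probabilities reported there, the root model is left empty.
p_ig = {("<sent>", "cnjcoo"): 1.0,
        ("cnjcoo", "n"): 10 ** -1.617432, ("cnjcoo", "attr"): 10 ** -1.485425,
        ("attr", "n.pl.gen"): 10 ** -1.808777,
        ("cnjcoo", "n.nom"): 10 ** -0.7627025, ("n.nom", "n.pl.gen"): 10 ** -3.236619}
cands = [[("жəне", ("cnjcoo",))],
         [("көрші", ("n", "attr")), ("көрші", ("n.nom",))],
         [("аймақ", ("n.pl.gen",))]]
best = viterbi(cands, {}, p_ig)
print(best[1])  # the (incorrect) n.nom parse wins, as in the error analysis
```

Note that the search remains linear in sentence length because each step only needs the previous column of δ values.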
V. Experiments and results
A. Training and test data
We selected the thirteen most viewed articles of Kazakh Wikipedia according to 2014 page-count data (see Table II), and used all of them except ‘Басты бет’, ‘CERN’, and ‘Жапония префектуралары’ to create a training set³. This totaled approximately 12.5K words (15.7K tokens).

TABLE II: Most viewed articles of Kazakh Wikipedia in 2014

Article title | Views | Tokens
Басты бет | 1,674,069 | –
Жапония | 877,693 | 3,211
Біріккен Ұлттар Ұйымы | 807,058 | 793
CERN | 648,464 | –
Иран | 602,001 | 2,879
Жапония префектуралары | 551,394 | –
Футболдан əлем чемпионаты 2014 | 333,988 | 257
Жапония Ұлттық футбол құрама командасы | 321,249 | 146
Eurovision əн конкурсы 2010 | 312,183 | 101
Абай Құнанбайұлы | 242,151 | 4,083
Радиан | 187,225 | 39
Жасуша | 145,010 | 1,789
Шоқан Шыңғысұлы Уəлиханов | 119,780 | 2,408
TOTAL | | 15,706

We performed morphological analysis of our texts using the open-source finite-state morphological transducer apertium-kaz [52]. It is based on the Helsinki Finite-State Toolkit and is available within the Apertium project [13]. The analysis was carried out by calling the lt-proc command of Lttoolbox [53]. A preliminary disambiguation was performed with Constraint Grammar rules [6] by calling the cg-proc command, which decreased ambiguity from 2.4 to 1.4 analyses per form on average. The remaining disambiguation was done manually in the following way: the texts were disambiguated independently by two different annotators. Unfortunately, spot-checking the annotations showed that they were rather noisy, mainly due to the lack of annotation guidelines. The most common mistakes were connected with:

• choosing between <attr> (attributive) and <nom> (nominative) in noun-noun compounds: e.g. in көрші елдер ‘neighbouring countries’ the word көрші ‘neighbour’ should be tagged as <n><attr> (attributive noun), but in əлем чемпионаты ‘world championship’ the word əлем ‘world’ should be tagged as <n><nom> (noun in nominative case);
• choosing between <cnjcoo> (conjunction) and <postadv> (postadverb) for the words да/де/та/те: e.g. in Үстелде қалам да, қарындаш та, дəптер де жатыр ‘There are a pen, a pencil and a notebook on the table’ they should be tagged as <cnjcoo>, but in Мен де барамын ‘I will also go’ де should be tagged as <postadv>;
• choosing between <det><dem> (demonstrative determiner) and <prn> (pronoun) for the words бұл, мынау, осы, мына, анау, ана, сол ‘this, that’: e.g. in Мынау үй жаңа ‘This house is new’ the word мынау should be tagged as <det><dem>, but in Мынау – терезе емес ‘This is not a window’ it should be tagged as <prn>;
• choosing between <ger> (gerund) and <n> (noun) for verbs in the dictionary form: e.g. in Кітап оқу адамдарды ақылдырақ етеді ‘Reading books makes people wiser’ the word оқу ‘to read’ should be tagged as <ger>, but in Оқу басталды ‘Classes began’ the word оқу ‘study’ should be tagged as <n>.

Based on these and other types of annotation mistakes we developed a set of guidelines⁴ and asked the annotators to resolve the differences in their annotations and fix them where necessary using these guidelines.

In order to enrich our model with more roots, we extracted unambiguous sequences of 1,509,480 tokens from a corpus of 2,128,642 tokens and used these unambiguous sequences in addition to the hand-annotated texts from Table II for estimating root probabilities.

For our test data we selected several texts from the free/open-source Kazakh treebank [54], which is based on universal dependency (UD) annotation standards. These texts are morphologically disambiguated and annotated manually for dependency structure, but for our purposes we used only the morphological annotations. We made sure that the document ‘wikipedia’ does not overlap with our training data. The composition of the test data is given in Table III.

TABLE III: Test data

Document | Description | Tokens
Шымкент | Wikipedia article (Shymkent) | 168
story | Story for language learners | 404
wikitravel | Phrases from Wikitravel | 177
Өлген_қазан | Folk tale from Wikisource | 134
wikipedia | Random sentences from Wikipedia | 559
Ер_төстік | Folk tale from Wikisource | 206
Жиырма_Бесінші_Сөз | Philosophical text | 435
TOTAL | | 2071

B. Training the model

We used the SRILM toolkit [55, 56] to estimate the root and IG probabilities Pr(r^l | r^m) and Pr(g^l | g^m) respectively. We need to say a few words about the way we prepared the root and IG sequences for feeding into SRILM. First of all, we used the following tags from the Apertium tagset to split analyses across derivational boundaries: <subst> (substantive, like a noun), <attr> (attributive, like an adjective), <advl> (adverbial, like an adverb), <ger_*> (gerunds in different tenses), <gpr_*> (verbal adjectives in different tenses), <gna_*> (verbal adverbs in different tenses), <prc_*> (participles in different tenses), <ger> (gerund)⁵. Now assume that, using the notation from Section IV, the hand-annotated (or unambiguous) text chunk of length n is represented as {(r_i, g_{i,1}, ..., g_{i,n_i})}_{i=1}^{n}. Then we form root bigrams as

(r_1, r_2), (r_2, r_3), ..., (r_{i−1}, r_i), ..., (r_{n−1}, r_n),

and we form IG bigrams as follows:

(g_{1,n_1}, g_{2,1}), (g_{1,n_1}, g_{2,2}), ..., (g_{1,n_1}, g_{2,n_2}),
(g_{2,n_2}, g_{3,1}), (g_{2,n_2}, g_{3,2}), ..., (g_{2,n_2}, g_{3,n_3}),
...
(g_{i−1,n_{i−1}}, g_{i,1}), (g_{i−1,n_{i−1}}, g_{i,2}), ..., (g_{i−1,n_{i−1}}, g_{i,n_i}),
...

³ ‘Басты бет’ is not an article but the main page of Kazakh Wikipedia; the articles ‘CERN’ and ‘Жапония префектуралары’ do not contain much text.
⁴ Available at http://wiki.apertium.org/wiki/Annotation_guidelines_for_Kazakh
⁵ A detailed description of the Turkic tagset in the Apertium project is given at http://wiki.apertium.org/wiki/Turkic_lexicon
The way we form the above bigrams is dictated by the
assumptions from Section IV that the root in the current parse
depends only on the root of the previous parse, and each
inflectional group in the current parse depends only on the
last inflectional group of the previous parse.
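The bigram streams just described, and the count-ratio estimates (13) they feed, can be sketched as follows. This is our own illustration of the bookkeeping, not the SRILM pipeline itself, and the toy parses are hypothetical.

```python
from collections import Counter

# Sketch of forming root-bigrams and IG-bigrams from annotated parses and of
# the count-ratio estimates (13); illustration only, not the SRILM pipeline.

def bigram_streams(parses):
    """parses: list of (root, [ig_1, ..., ig_n]) for one text chunk."""
    roots = [r for r, _ in parses]
    root_bigrams = list(zip(roots, roots[1:]))
    ig_bigrams = []
    for (_, prev_igs), (_, igs) in zip(parses, parses[1:]):
        # every IG of the current parse is paired with the LAST IG of the previous parse
        ig_bigrams += [(prev_igs[-1], g) for g in igs]
    return root_bigrams, ig_bigrams

def mle(bigrams):
    """Pr_MLE(l | m) = C(m, l) / C(m), with C(m) counted over bigram histories."""
    pair = Counter(bigrams)
    first = Counter(m for m, _ in bigrams)
    return {(m, l): pair[(m, l)] / first[m] for (m, l) in pair}

parses = [("жəне", ["cnjcoo"]), ("көрші", ["n", "attr"]), ("аймақ", ["n.pl.gen"])]
root_bigrams, ig_bigrams = bigram_streams(parses)
print(ig_bigrams)  # [('cnjcoo', 'n'), ('cnjcoo', 'attr'), ('attr', 'n.pl.gen')]
print(mle(ig_bigrams)[("cnjcoo", "n")])  # 0.5: 'n' follows 'cnjcoo' in 1 of 2 histories
```

In the actual experiments these streams were handed to SRILM, which replaces the raw count ratios with smoothed estimates.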
C. Results

Once our model was trained, i.e. its parameters were estimated, we analyzed the test data with apertium-kaz [52] and applied Algorithm 1 to its output. The accuracy results are given in the column ‘Tagger’ of Table IV. As one can see, the performance of this purely statistical approach is barely satisfactory (e.g. compared to the state of the art for Turkish [38]). This is mainly due to the relatively small amount of available hand-tagged corpora for Kazakh. However, if we preprocess the output of the transducer using CG rules [4] and then simply select the first analysis for each ambiguous token, the accuracy is around 87% on our test set (see column ‘CG’ in Table IV), which is comparable to the previous results [41, 43] for Kazakh morphological disambiguation. Combining the rule-based and statistical approaches, i.e. preprocessing the transducer's output with CG and then selecting the most probable parses based on the statistical model, yields around 91% accuracy (see column ‘CG+Tagger’ in Table IV). However, keep in mind that for a fair comparison of our approach with the previously developed methods one needs to use the same tagset and to test against the same data, which is currently not feasible, since both previous works on morphological disambiguation for Kazakh ([41] and [43]) have released neither their tools nor their data for open access.

TABLE IV: Accuracy results in %

Document | Tagger | CG | CG+Tagger
Шымкент | 88.46 | 89.74 | 92.95
story | 76.49 | 84.16 | 88.61
wikitravel | 71.75 | 80.23 | 87.57
Өлген_қазан | 88.81 | 88.06 | 91.79
wikipedia | 93.92 | 93.56 | 95.89
Ер_Төстік | 85.92 | 83.01 | 91.26
Жиырма_Бесінші_Сөз | 81.84 | 85.52 | 85.98
TOTAL | 84.55 | 87.20 | 90.73

Let us perform an example of error analysis for the ‘CG+Tagger’ configuration. One of the most common errors was choosing <n><nom> instead of <n><attr>: e.g. in

    жəне<cnjcoo> көрші<n><attr> аймақтардың<n><pl><gen> ‘and of neighboring regions’

the word көрші ‘neighbor’ was mistakenly tagged as <n><nom>. A closer look at the IG log-probabilities reveals:

    log Pr(n|cnjcoo) = −1.617432
    log Pr(attr|cnjcoo) = −1.485425
    log Pr(n.pl.gen|attr) = −1.808777
    log Pr(n.nom|cnjcoo) = −0.7627025
    log Pr(n.pl.gen|n.nom) = −3.236619

and we can see that, although there are more chances to see a noun in a non-possessive form after an attributive noun than after a noun in the nominative case, due to the split of the analysis <n><attr> into two inflectional groups the wrong parse gets a higher overall probability:

    Pr(cnjcoo, n.attr, n.pl.gen)
      = Pr(n|cnjcoo) Pr(attr|cnjcoo) Pr(n.pl.gen|attr)
      = 10^−4.911634
      < 10^−3.9993215
      = Pr(n.nom|cnjcoo) Pr(n.pl.gen|n.nom)
      = Pr(cnjcoo, n.nom, n.pl.gen).

This observation leads to the following suggestion: perhaps we should not split <n><attr>, but rather treat it as <adj> (an adjective) during training and tagging. Since we can always distinguish between a noun and an adjective in Kazakh [57], theoretically a word cannot have both <n><attr> and <adj> as possible analyses, and thus our suggested replacement can be back-substituted without causing any additional ambiguity. This might also work for other errors, e.g. when the tagger mistakenly prefers <adv> (adverb) over <adj><advl> (adverbial adjective) or <n> (noun) over <adj><subst> (substantivized adjective), etc.

The list of most common errors for the ‘CG+Tagger’ configuration also includes:

selecting: | instead of:
<n><nom> (noun) | <np><ant><m><nom> (proper noun)
<cnjcoo> (conjunction) | <prn><itg><nom> (inter. pronoun)
<det><dem> (dem. determiner) | <prn><dem><nom> (dem. pronoun)
<prn><dem><pl><nom> | <prn><pers><p3><pl><nom>
<v><tv><aor><p3><pl> | <v><tv><aor><p3><sg>

VI. Conclusion and future work

We reproduced the previous methods of statistical morphological disambiguation [5] for the case of the Kazakh language in terms of the Apertium tagset. By combining rule-based and statistical approaches we were able to achieve better accuracy than when these approaches were used separately in the task of morphological disambiguation for Kazakh. Both the tagger and the annotated data are free and available in open access.

In the future, we are planning to improve the performance of the tagger by adding more annotated data and taking into account the suggestions from the previous section. Our results will then be able to feed directly into other work on Kazakh language technology, such as machine translation. Assylbekov and Nurkas [8] made use of the partially-disambiguated output of the morphological analyser to preprocess the Kazakh side of a parallel corpus for statistical machine translation (SMT), achieving an increase in translation quality. We expect that better disambiguation of the analyser's output will lead to improved performance of the SMT system. We are also planning to apply our disambiguation tool to reduce data sparseness in the task of document and sentence alignment between Kazakh and English or Kazakh and Russian: given accurate transducers and disambiguation tools for English and Russian, we can apply morphological analysis and then morphological disambiguation to both sides of a candidate pair and then compare the stems in both documents to compute content-based similarity in addition to structural similarity measures, as was done in [58, 59, 60, 61].
Where to find the hand-tagged texts and the tagger

Our morphological disambiguation tool (including the hand-annotated texts) is under the GNU General Public License (GPL) version 3.0⁶; its code and releases can be found at https://svn.code.sf.net/p/apertium/svn/branches/kaz-tagger/.

Acknowledgements

We would like to thank Daiana Azamat for assisting in the hand-annotation of the texts and in the rigorous derivation of the maximum likelihood estimates (13).

⁶ http://www.gnu.org/licenses/gpl-3.0.html

References

[1] D. Yuret and E. Biçici, “Modeling morphologically rich languages using split words and unstructured dependencies,” in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, 2009, pp. 345–348.
[2] G. Altenbek and W. Xiao-long, “Kazakh segmentation system of inflectional affixes,” in Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2010, pp. 183–190.
[3] A. Makazhanov, O. Makhambetov, I. Sabyrgaliyev, and Z. Yessenbayev, “Spelling correction for Kazakh,” in Computational Linguistics and Intelligent Text Processing. Springer, 2014, pp. 533–541.
[4] I. Salimzyanov, J. Washington, and F. Tyers, “A free/open-source Kazakh-Tatar machine translation system,” Machine Translation Summit XIV, 2013.
[5] D. Z. Hakkani-Tür, K. Oflazer, and G. Tür, “Statistical morphological disambiguation for agglutinative languages,” Computers and the Humanities, vol. 36, no. 4, pp. 381–410, 2002.
[6] F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila, Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Walter de Gruyter, 1995, vol. 4.
[7] M. P. Lewis, G. F. Simons, and C. D. Fennig, Eds., Ethnologue: Languages of the World. Dallas, Texas: SIL International, 2013. Retrieved on 15 April 2014.
[8] Z. Assylbekov and A. Nurkas, “Initial explorations in Kazakh to English statistical machine translation,” in The First Italian Conference on Computational Linguistics CLiC-it 2014, 2014, p. 12.
[9] N. Habash and F. Sadat, “Arabic preprocessing schemes for statistical machine translation,” in Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Association for Computational Linguistics, 2006, pp. 49–52.
[10] A. Bisazza and M. Federico, “Morphological pre-processing for Turkish to English statistical machine translation,” in IWSLT, 2009, pp. 129–135.
[11] C. Mermer, “Unsupervised search for the optimal segmentation for statistical machine translation,” in Proceedings of the ACL 2010 Student Research Workshop. Association for Computational Linguistics, 2010, pp. 31–36.
[12] E. Bekbulatov and A. Kartbayev, “A study of certain morphological structures of Kazakh and their impact on the machine translation quality,” in Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on. IEEE, 2014, pp. 1–5.
[13] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, and F. M. Tyers, “Apertium: a free/open-source platform for rule-based machine translation,” Machine Translation, vol. 25, no. 2, pp. 127–144, 2011.
[14] N. Chomsky, Syntactic Structures. Walter de Gruyter, 2002.
[15] S. Klein and R. F. Simmons, “A computational approach to grammatical coding of English words,” Journal of the ACM (JACM), vol. 10, no. 3, pp. 334–347, 1963.
[16] G. Salton and R. Thorpe, “An approach to the segmentation problem in speech analysis and language translation,” in Proceedings of the 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, vol. 2, 1962, pp. 703–724.
[17] B. B. Greene and G. M. Rubin, Automatic Grammatical Tagging of English. Department of Linguistics, Brown University, 1971.
[18] W. S. Stolz, P. H. Tannenbaum, and F. V. Carstensen, “A stochastic approach to the grammatical coding of English,” Communications of the ACM, vol. 8, no. 6, pp. 399–405, 1965.
[19] R. Garside, G. Sampson, and G. Leech, The Computational Analysis of English: A Corpus-Based Approach. Longman, 1988, vol. 57.
[20] I. Marshall, “Tag selection using probabilistic methods,” in The Computational Analysis of English: A Corpus-Based Approach, 1987, pp. 42–65.
[21] K. W. Church, “A stochastic parts program and noun phrase parser for unrestricted text,” in Proceedings of the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics, 1988, pp. 136–143.
[22] S. J. DeRose, “Grammatical category disambiguation by statistical optimization,” Computational Linguistics, vol. 14, no. 1, pp. 31–39, 1988.
[23] J. Kupiec, “Robust part-of-speech tagging using a hidden Markov model,” Computer Speech & Language, vol. 6, no. 3, pp. 225–242, 1992.
[24] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, “A
practical part-of-speech tagger,” in Proceedings of the
third conference on Applied natural language processing. Association for Computational Linguistics, 1992,
pp. 133–140.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[26] B. Merialdo, “Tagging English text with a probabilistic model,” Computational Linguistics, vol. 20, no. 2, pp. 155–171, 1994.
[27] E. Brill, “Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging,” Computational Linguistics, vol. 21, no. 4, pp. 543–565, 1995.
[28] W. Daelemans, J. Zavrel, P. Berck, and S. Gillis, “MBT: A memory-based part of speech tagger-generator,” arXiv preprint cmp-lg/9607012, 1996.
[29] A. Ratnaparkhi et al., “A maximum entropy model for
part-of-speech tagging,” in Proceedings of the conference
on empirical methods in natural language processing,
vol. 1. Philadelphia, USA, 1996, pp. 133–142.
[30] M. Levinger, A. Itai, and U. Ornan, “Learning morpho-lexical probabilities from an untagged corpus with an application to Hebrew,” Computational Linguistics, vol. 21, no. 3, pp. 383–404, 1995.
[31] J. Hajič and B. Hladká, “Tagging inflective languages:
Prediction of morphological categories for a rich, structured tagset,” in Proceedings of the 17th international
conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 1998, pp. 483–
490.
[32] J. Hajič, “Morphological tagging: Data vs. dictionaries,”
in Proceedings of the 1st North American chapter of the
Association for Computational Linguistics conference.
Association for Computational Linguistics, 2000, pp. 94–
101.
[33] N. Ezeiza, I. Alegria, J. M. Arriola, R. Urizar, and
I. Aduriz, “Combining stochastic and rule-based methods for disambiguation in agglutinative languages,” in
Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume
1. Association for Computational Linguistics, 1998, pp.
380–384.
[34] B. Megyesi, “Improving Brill's POS tagger for an agglutinative language,” in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999, pp. 275–284.
[35] K. Oflazer and İ. Kuruöz, “Tagging and morphological disambiguation of Turkish text,” in Proceedings of the Fourth Conference on Applied Natural Language Processing. Association for Computational Linguistics, 1994, pp. 144–149.
[36] K. Oflazer and G. Tur, “Combining hand-crafted rules
and unsupervised learning in constraint-based morphological disambiguation,” arXiv preprint cmp-lg/9604001,
1996.
[37] K. Oflazer and G. Tür, “Morphological disambiguation
by voting constraints,” in Proceedings of the 35th Annual
Meeting of the Association for Computational Linguistics
and Eighth Conference of the European Chapter of the
Association for Computational Linguistics. Association
for Computational Linguistics, 1997, pp. 222–229.
[38] H. Sak, T. Güngör, and M. Saraçlar, “Morphological disambiguation of Turkish text with perceptron algorithm,” in Computational Linguistics and Intelligent Text Processing. Springer, 2007, pp. 107–118.
[39] D. Yuret and F. Türe, “Learning morphological disambiguation rules for Turkish,” in Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Association for Computational Linguistics, 2006, pp. 328–334.
[40] M. Kutlu and I. Cicekli, “A hybrid morphological disambiguation system for Turkish,” in IJCNLP, 2013, pp. 1230–1236.
[41] O. Makhambetov, A. Makazhanov, I. Sabyrgaliyev, and Z. Yessenbayev, “Data-driven morphological analysis and disambiguation for Kazakh,” in Computational Linguistics and Intelligent Text Processing. Springer, 2015, pp. 151–163.
[42] O. Makhambetov, A. Makazhanov, Z. Yessenbayev, B. Matkarimov, I. Sabyrgaliyev, and A. Sharafudinov, “Assembling the Kazakh language corpus,” in EMNLP, 2013, pp. 1022–1031.
[43] G. Kessikbayeva and I. Cicekli, “A rule based morphological analyzer and a morphological disambiguator for Kazakh language,” 2016.
[44] E. Charniak, C. Hendrickson, N. Jacobson, and
M. Perkowitz, “Equations for part-of-speech tagging,” in
AAAI, 1993, pp. 784–789.
[45] J. L. Lagrange, Mécanique analytique. Mallet-Bachelier,
1853, vol. 1.
[46] I. J. Good, “The population frequencies of species and
the estimation of population parameters,” Biometrika,
vol. 40, no. 3-4, pp. 237–264, 1953.
[47] W. A. Gale and G. Sampson, “Good-turing frequency
estimation without tears*,” Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[48] I. H. Witten and T. C. Bell, “The zero-frequency problem: Estimating the probabilities of novel events in
adaptive text compression,” Information Theory, IEEE
Transactions on, vol. 37, no. 4, pp. 1085–1094, 1991.
[49] R. Kneser and H. Ney, “Improved backing-off for mgram language modeling,” in Acoustics, Speech, and
Signal Processing, 1995. ICASSP-95., 1995 International
Conference on, vol. 1. IEEE, 1995, pp. 181–184.
[50] S. F. Chen and J. Goodman, “An empirical study of
smoothing techniques for language modeling,” Computer
25
Speech & Language, vol. 13, no. 4, pp. 359–393, 1999.
[51] C. D. Manning and H. Schütze, Foundations of statistical
natural language processing. MIT Press, 1999, vol. 999.
[52] J. N. Washington, I. Salimzyanov, and F. M. Tyers, “Finite-state morphological transducers for three
Kypchak languages,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC, 2014.
[53] S. O. Rojas, M. L. Forcada, and G. R. Sánchez, “Construcción y minimización eficiente de transductores de
letras a partir de diccionarios con paradigmas,” Procesamiento del lenguaje natural, vol. 35, pp. 51–57, 2005.
[54] F. M. Tyers and J. Washington, “Towards a free/opensource universal-dependency treebank for Kazakh,” in
3rd International Conference on Computer Processing
in Turkic Languages (TURKLANG 2015), 2015.
[55] A. Stolcke et al., “Srilm-an extensible language modeling
toolkit.” in INTERSPEECH, vol. 2002, 2002, p. 2002.
[56] A. Stolcke, J. Zheng, W. Wang, and V. Abrash, “Srilm
at sixteen: Update and outlook,” in Proceedings of IEEE
Automatic Speech Recognition and Understanding Workshop, 2011, p. 5.
[57] B. KREJCI and L. GLASS, “The kazakh noun/adjective
distinction.”
[58] Y. Zhang, K. Wu, J. Gao, and P. Vines, “Automatic
acquisition of chinese–english parallel corpus from the
web,” in Advances in Information Retrieval. Springer,
2006, pp. 420–431.
[59] M. Esplà-Gomis and M. Forcada, “Combining contentbased and url-based heuristics to harvest aligned bitexts
from multilingual sites with bitextor,” The Prague Bulletin of Mathematical Linguistics, vol. 93, pp. 77–86,
2010.
[60] I. San Vicente and I. Manterola, “Paco2: A fully automated tool for gathering parallel corpora from the web.”
in LREC, 2012, pp. 1–6.
[61] L. Liu, Y. Hong, J. Lu, J. Lang, H. Ji, and J. Yao,
“An iterative link-based method for parallel web page
mining,” Proceedings of EMNLP, pp. 1216–1224, 2014.
26
Methodological Considerations for
Multi-word Unit Extraction in Turkish
Ümit Mersinli
Mersin University
Mersin, Turkey
[email protected]

Yeşim Aksan
Mersin University
Mersin, Turkey
[email protected]
Abstract— Multi-word unit (MWU) extraction in Turkish has its own challenges due to the agglutinative nature of the language and the lack of reliable tools and reference datasets. The aim of this study is to share hands-on experience with MWU extraction in ongoing projects that use the Turkish National Corpus (TNC) as the data source. Since Turkish still lacks a reference MWU set, the primary purpose of these projects is to build a reference MWU dictionary of Turkish that will serve as a resource for evaluating the performance of any extraction tool or technique. In this paper we discuss methodological considerations for clarifying appropriate processes for Turkish MWU extraction. The techniques and suggestions compiled here form an overall proposal for further Turkish-specific computational or statistical work. The linguistic perspective underlying the choice of a valid methodology is described in the first part of the study. In the second part, important methodological considerations are discussed through real examples from the TNC. In the conclusion, suggestions for an interdisciplinary approach and a hybrid methodology are summarized.

Keywords—MWU extraction; multi-word; phraseology; Turkish National Corpus

I. INTRODUCTION

As Mel’čuk [1] states, “people speak not in words but in phrases”, or, in Firth’s [2] words, in a statement well known among linguists, “you shall know a word by the company it keeps”. The importance of MWUs in any language-related area has led to a huge amount of work, especially for English.

For Turkish, on the other hand, the lack of a preliminary, well-documented reference MWU lexicon with which to evaluate the performance of any linguistic, statistical or computational extraction methodology seems to be the basic challenge to overcome. The works of Oflazer et al. [3], Eryiğit et al. [4], Kumova-Metin & Karaoğlan [5], Aksan & Aksan [6], Durrant & Mathews-Aydınlı [7], Aksan, Mersinli & Altunay [8] and Mersinli & Demirhan [9] cover some aspects of Turkish phraseology, but the Turkish NLP literature is still far from providing a comprehensive reference MWU lexicon. In this respect, the purpose of this paper is to share hands-on experience from MWU extraction projects using the Turkish National Corpus (TNC) [10] as the data source, rather than to provide finalized software, resources or methodology. The following sections summarize the crucial points of the study in progress. In each section, sample data is provided for illustrative purposes only; it should not be regarded as a finalized data set of the ongoing study.

II. METHODOLOGICAL CONSIDERATIONS

According to Pecina [11], eliciting the best methodology for MWU extraction depends heavily on the data, the language, and the notion of MWU itself. These concerns, however, are underestimated in the current Turkish NLP literature. The methodological considerations discussed in this paper therefore emphasize the importance of some neglected aspects of MWU extraction in Turkish.

A. Choosing The Corpus

Most current studies on Turkish MWU extraction focus on optimizing the statistical or computational processes, or on optimizing the sorting procedure applied to the outcome. The importance of the input, the corpus in our case, is often underestimated. In this part of the paper, we deal with the necessary qualifications of a corpus to be used as input for MWU extraction in Turkish.

First, the difference between a linguistic corpus and a text archive needs to be clarified [12]. According to Sinclair [13], “a corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language”, not a random collection of any available text type. Second, unless it is a historical or specialized corpus, a reference corpus should cover naturally occurring, contemporary language data and be designed to represent the language. Third, a corpus should cover, where applicable, a variety of text types and mediums of that language. In other words, the corpus should be a well-balanced and representative one if it is to be used for MWU extraction.

In this respect, it is crucial to rely on a reference corpus like the Turkish National Corpus in order to extract true rankings of the n-grams. The size of the TNC is 50,997,016 running words, representing a wide range of text categories spanning a period of 23 years (1990-2013). It consists of samples of textual data representing 9 different domains (98%) with 4,978 documents and transcribed spoken data (2%) with 434 documents. Table (1) shows the distribution of texts in the written part of the TNC.

In addition, the annotation system of the TNC covers over 90 inflectional morphemes, all of which are compatible with modern Turkish linguistics studies. Analysis and tagging of Turkish derivational morphemes are in progress and will provide insights into the relationship between the word-forming and multi-word-forming processes of Turkish.
B. Optimizing the Input

As stated above, choosing and optimizing the input is an important part of our proposal. The basic shift from conventional approaches is to use punctuation marks as natural delimiters for MWU candidates. Thus, all punctuation marks and numerals in the corpus are replaced with line breaks, which serve as splitters for n-grams. Since the primary concern of this study is not to extract proper nouns, all corpus text is also lowercased to avoid duplicate n-grams. Table (3) shows a sample raw text and its optimized version.
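The optimization step described above can be sketched in a few lines. This is a minimal sketch assuming plain-text input, not the actual TNC pipeline (which additionally ASCII-codes the texts); the Turkish dotted/dotless i distinction is handled explicitly, since Python's default lowercasing maps "I" to "i":

```python
import re
from collections import Counter

def optimize(text):
    """Lowercase (Turkish-aware) and replace punctuation and numerals
    with line breaks, so each line is a natural n-gram boundary."""
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # any punctuation mark or digit run becomes an n-gram splitter
    text = re.sub(r"[^\w\s]|[\d_]+", "\n", text)
    return [line.split() for line in text.splitlines() if line.strip()]

def ngrams(segments, n):
    """Observed frequencies of n-grams, never crossing a delimiter."""
    counts = Counter()
    for words in segments:
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

raw = ("Günlerden bir gün , okuldan evine dönen Hetzer,sırt çantasından "
       "çıkardığı yepyeni bir kitabı, babasına gösterir.")
segs = optimize(raw)  # reproduces the optimized text of Table (3)
```

Counting 3-grams over such segments then yields exactly the observed frequencies used throughout the paper.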
TABLE I. DISTRIBUTION OF TEXTS ACCORDING TO DOMAINS IN TNC-WRITTEN

Domain | No. of words | % of words
Imaginative: Prose | 9,365,775 | 18.74 %
Informative: Natural and pure sciences | 1,367,213 | 2.74 %
Informative: Applied science | 3,464,557 | 6.93 %
Informative: Social science | 7,151,622 | 14.31 %
Informative: World affairs | 9,840,241 | 19.69 %
Informative: Commerce and finance | 4,513,233 | 9.03 %
Informative: Arts | 3,659,025 | 7.32 %
Informative: Belief and thought | 2,200,019 | 4.40 %
Informative: Leisure | 8,421,603 | 16.85 %
Total | 49,983,288 | 100.00 %

TABLE III. CORPUS OPTIMIZATION FOR MWU EXTRACTION

Raw text:
Günlerden bir gün , okuldan evine dönen Hetzer,sırt çantasından çıkardığı yepyeni bir kitabı, babasına gösterir.

Optimized text:
günlerden bir gün
okuldan evine dönen hetzer
sırt çantasından çıkardığı yepyeni bir kitabı
babasına gösterir
Table (2) shows the MWU candidates derived from the written part of the TNC, comprising 49,983,288 words. The top-ranked multi-word candidates obtained from the written part of the TNC and from its newspaper-articles section demonstrate how serious the differences are between data extracted from a reference corpus and data extracted from a specialized corpus.

After the optimization, the lowercased, sentence-split, punctuation-delimited, ASCII-coded TNC texts are processed in Text-NSP [14] to obtain all the sample lists presented in this paper. Moreover, for the sake of simplicity, no associative measures are used for extracting MWUs, and all the values represent observed frequencies. A detailed discussion of associative measures applied to Turkish MWU candidates can be found in Kumova-Metin & Karaoğlan [5] and Mersinli [15].

TABLE II. TOP-RANKED 3-GRAMS IN A REFERENCE CORPUS AND A SPECIAL CORPUS
Rank | TNC_all (a) | Freq. | TNC_Newspapers | Freq.
1 | bir süre sonra | 4419 | recep tayyip erdoğan | 555
2 | bir kez daha | 4000 | bir kez daha | 506
3 | ne var ki | 3360 | başbakan recep tayyip | 449
4 | başka bir şey | 3238 | yönetim kurulu başkanı | 442
5 | ne yazık ki | 3020 | şöyle devam etti | 367
6 | her ne kadar | 3012 | bir an önce | 367
7 | bir yandan da | 2993 | genel başkan yardımcısı | 323
8 | bir an önce | 2413 | ahmet necdet sezer | 316
9 | kısa bir süre | 2300 | cumhurbaşkanı ahmet necdet | 288
10 | ne olursa olsun | 2182 | düzenlediği basın toplantısında | 263
a. MWUs are in bold

As seen in Table (2), multi-word units are not only language specific but also text-type specific. Thus, relying on a text archive derived from the Web, or on a specialized corpus covering newspapers, for instance, is not a relevant approach for extracting the MWUs of Turkish; it is, rather, an approach for extracting the MWUs of that specific text type. If, on the other hand, the purpose of the extraction is to derive named entities, a Web-based newspaper corpus may be the appropriate choice of corpus.

C. Looking Beyond Words

It is a well-known phenomenon that an inflected Turkish verb is, in most cases, actually a sentence in English. The same holds for other phrases such as postpositions or connectives. We can easily observe that most of the connectives of English are actually suffix-word pairs in Turkish, such as -mAk için “in order to” or -A göre “according to”. The point here is that a multi-word unit in one language may appear as a single word, a multi-word unit, a suffix or a suffix-word pair in another language, and vice versa. Thus, especially when dealing with an agglutinative language, suffix-word pairs need to be taken into serious consideration. Postpositional phrases, for instance, require specific suffixation on the preceding word in Turkish.

Below are the most frequent suffix-word pairs of Turkish, extracted with the help of the annotation framework of the TNC. In the table, the suffixes are annotated according to their functions as nominalizers, case markers or person/number agreement. The frequencies are extracted from bigrams in which the first word ends with the given suffix and the second word is taken as a whole.
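Such suffix-word bigrams can be counted from a morphologically annotated token stream along the following lines. The (surface, final-suffix) input format and the tiny sample are hypothetical illustrations, not the TNC's actual annotation format:

```python
from collections import Counter

# Hypothetical annotated tokens: (surface form, function of final suffix).
# The tag labels (p3s, dat, nzmk, ...) follow the paper's tables; None
# marks a bare word carrying no relevant suffix.
tokens = [("olduğu", "p3s"), ("için", None),
          ("buna", "dat"), ("göre", None),
          ("olduğu", "p3s"), ("gibi", None)]

def suffix_word_pairs(tokens):
    """Count bigrams of <final suffix of word 1> + <word 2 as a whole>."""
    counts = Counter()
    for (w1, suffix), (w2, _) in zip(tokens, tokens[1:]):
        if suffix is not None:
            counts[f"{suffix}__{w2}"] += 1
    return counts

print(suffix_word_pairs(tokens))
```

Run over the full annotated corpus, ranking these counts yields lists like the one in Table (4).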
TABLE IV. MOST FREQUENT SUFFIX-WORD PAIRS IN TURKISH

Suffix_type | Freq. | Example | English
nzmk__için | 58535 | etmek_için | in order to
dat__göre | 37850 | buna_göre | according to
abl__sonra | 36515 | olduktan_sonra | after
p3s__için | 33514 | olduğu_için | since
p3s__gibi | 31306 | olduğu_gibi | as it is
dat__kadar | 28429 | bugüne_kadar | until
nzmk__üzere | 17728 | olmak_üzere | almost
gen__için | 15336 | bunun_için | for this
acc__olarak | 11895 | sonucu_olarak | as a result of
pl__için | 9990 | onlar_için | for them

As Table (4) demonstrates, the term ‘multi-word’ in Turkish should also cover “suffix-word” pairs, which we may call “multi-morpheme units”. Looking for in-word or intra-word units in Turkish may be the solution to most of the challenges encountered in MWU extraction.

The inflectional patterns of Turkish should likewise be considered multi-words or, in more appropriate terminology, multi-morpheme units, since their distribution across different text types provides evidence for their functional unity specific to certain text types. Below are the 6-morphgrams and their distribution across 3 text types in the TNC. The tagset includes functions such as causative, passive, auxiliary verb, aorist, nominalizer, adverbial, negation, verb I, necessity, perfective, imperfect, person agreement, possessive, accusative, locative and copula in their abbreviated forms. As seen in the table, almost all 6-morphgrams start with voice suffixes and end with the 3rd person singular suffix.

TABLE V. SAMPLE MORPHGRAMS AND THEIR DISTRIBUTION AMONG TEXT-TYPES IN TURKISH

6-morphgrams | Academic | Fiction | Newspapers
caus+pasv+va1+nzma+p3s+acc | 27 | 0 | 1
caus+pasv+va1+neg+aor+3s | 444 | 76 | 63
caus+pasv+aor+vi+avsa+3s | 386 | 25 | 46
caus+pasv+imprf+vi+past+3s | 277 | 164 | 47
caus+pasv+imprf+vi+perf+3s | 4 | 16 | 4
caus+pasv+neg+necc+cop+3s | 220 | 3 | 12
caus+pasv+neg+nzma+p3s+acc | 24 | 4 | 13
caus+pasv+neg+perf+cop+3s | 172 | 5 | 6
caus+pasv+nzma+p3s+cop+3s | 838 | 11 | 29
caus+pasv+nzma+p3s+loc+kia | 85 | 2 | 6

Table (5) clearly demonstrates that causative+passive inflection is specific to academic Turkish and can be regarded as a multi-morpheme unit in itself. Although very rare in usage, these verbal morphgrams can extend to 9 morphemes in Turkish, as in the inflected verb çıkartılabilinirdi, which starts with the verb çık- and includes the suffixes causative, causative, passive, auxiliary_verb, passive, aorist, verb_i, past_tense and 3rd_person_singular, in the given order. The inflected verb can be translated as “it could be made possible to extract”, which is a full sentence in English and thus, again, blurs our notion of ‘word’ in the term ‘multi-word unit’.

D. Bidirectional Sorting

Another common practice in MWU extraction can be summarized as sorting n-grams using associative measures, or a combination of them, setting a cut-off point, and regarding the remaining top n-grams as MWUs. As discussed in Mersinli [15], the relevance of relying only on sorting n-grams, without any linguistic filtering, is questionable. A hybrid approach combining quantitative sorting and qualitative filtering techniques, as in Seretan et al. [16], seems more productive for Turkish if the purpose is to prepare a reference MWU set and to describe multi-word formation processes in Turkish.

Below are the associative measures found to be linguistically relevant for the given n-grams in Turkish [15]. Since 2-grams include most of the sub-MWUs in Turkish, most of the measures apply to those candidates; for extracting MWUs proper, however, it seems reasonable to rely on the observed frequencies of 3-grams.

TABLE VI. RELEVANT ASSOCIATIVE MEASURES FOR TURKISH

n-grams | Measures
2-grams | T-score, Fisher’s Exact Test (left-sided), Log-likelihood, True Mutual Information, Poisson-Stirling Measure
3-grams | Poisson-Stirling Measure
4-grams | Log-likelihood
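For orientation, two of the measures in Table (6) can be computed from a bigram's joint and marginal frequencies. This is a sketch under the standard 2x2 contingency-table definitions, not the Text-NSP implementation used in the study, and the demonstration counts are hypothetical:

```python
import math

def contingency(f_ab, f_a, f_b, n):
    """2x2 observed-frequency table for a bigram (a, b):
    rows = a / not-a, columns = b / not-b."""
    return [[f_ab, f_a - f_ab],
            [f_b - f_ab, n - f_a - f_b + f_ab]]

def t_score(f_ab, f_a, f_b, n):
    # (observed - expected) / sqrt(observed), expected under independence
    expected = f_a * f_b / n
    return (f_ab - expected) / math.sqrt(f_ab)

def log_likelihood(f_ab, f_a, f_b, n):
    # G^2 = 2 * sum O_ij * ln(O_ij / E_ij) over the 2x2 table
    obs = contingency(f_ab, f_a, f_b, n)
    rows = [sum(r) for r in obs]
    cols = [obs[0][j] + obs[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if obs[i][j] > 0:
                g2 += 2 * obs[i][j] * math.log(obs[i][j] / expected)
    return g2

# Hypothetical marginals for illustration only (not TNC counts):
print(round(t_score(2413, 500000, 20000, 49983288), 2))
```

Both measures grow as the observed co-occurrence exceeds what word frequencies alone would predict.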
With that concern in mind, in order to measure the fixedness of 3-grams, since they are the most likely candidates to be included as MWUs in a Turkish dictionary, we have used the frequencies of their inner components, namely the frequency of the first two words and of the last two words of each 3-gram. If the difference between these values is high, this is regarded as evidence that the given 3-gram is not an MWU but contains a 2-gram more fixed than the whole 3-gram.

To be more specific, Table (7) shows the ranking of the values obtained by subtracting the frequency of the last two words from the frequency of the first two words of a given 3-gram. The MWUs within the given 3-grams are in bold; the ranking shows the fixedness of the ones in its center.
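The measure behind this ranking is simple enough to state directly; the bigram frequencies below are taken from Table (7) itself:

```python
def fixedness(trigram, bigram_freq):
    """Freq(AB) - Freq(BC) for a 3-gram 'A B C'. Values near 0 suggest the
    whole 3-gram is fixed; large absolute values point to a more fixed
    inner 2-gram on one side."""
    a, b, c = trigram.split()
    return bigram_freq[f"{a} {b}"] - bigram_freq[f"{b} {c}"]

# Bigram frequencies as reported in Table (7):
bigram_freq = {"ne yazık": 3020, "yazık ki": 3020,
               "korkacak bir": 50, "bir şey": 15360,
               "ya da": 13650, "da bunun": 51}

print(fixedness("ne yazık ki", bigram_freq))        # MWU: centered at 0
print(fixedness("korkacak bir şey", bigram_freq))   # inner MWU "bir şey"
print(fixedness("ya da bunun", bigram_freq))        # inner MWU "ya da"
```

Sorting candidates by this signed difference and keeping the center of the ranking reproduces the bidirectional sorting described above.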
TABLE VII. BIDIRECTIONALLY SORTED SAMPLE 3-GRAMS

ABC | Freq. | Freq.AB | Freq.BC | Freq.(AB - BC)
korkacak bir şey | 50 | 50 | 15360 | -15310
konuda bir şey | 51 | 51 | 15360 | -15309
aklına bir şey | 51 | 51 | 15360 | -15309
yapabileceği bir şey | 51 | 51 | 15360 | -15309
bildiğim bir şey | 54 | 54 | 15360 | -15306
...
ne yazık ki | 3020 | 3020 | 3020 | 0
her zamanki gibi | 992 | 992 | 992 | 0
en ufak bir | 849 | 849 | 849 | 0
her ikisi de | 804 | 804 | 804 | 0
ittihat ve terakki | 649 | 649 | 649 | 0
...
ya da bunun | 51 | 13650 | 51 | 13599
ya da siyasi | 50 | 13650 | 50 | 13600
ya da karşı | 50 | 13650 | 50 | 13600
ya da üçüncü | 50 | 13650 | 50 | 13600
ya da kültürel | 50 | 13650 | 50 | 13600

As seen in Table (7), bidirectional sorting reveals the MWUs in the center even without applying any statistical associative measure, and it provides evidence for the 2-gram MWUs within the given candidates. The results of setting double thresholds based on such a simple measure show that the relevance of a sorting practice does not depend on the complexity of the formulae we use.

E. Lexico-grammatical Filtering

‘Colligation’ is another key term important in identifying the MWUs in a given set of candidates. As defined by Baker [17], a colligation is “a form of collocation which involves relationships at the grammatical rather than the lexical level”. For morphologically rich languages, then, the grammatical relations between two or more words become important, since they effectively state the constraints that prevent some frequent n-grams from becoming multi-words, or let some less frequent ones become multi-word units.

Thus, in a hybrid approach, sorting and filtering are the two basic processes, the first statistical and the latter rule-based. In order to state the filtering rules for MWUs and non-MWUs linguistically, we have classified the grammatical, or colligational, patterns of the MWU candidates into 3 categories, presented below with examples from the TNC.

TABLE VIII. CLASSIFICATION OF COLLIGATIONAL PATTERNS OF N-GRAMS

Category 1 – Complete structures: MWU patterns
Sample colligational pattern | n-gram | English
AJ+bare DT+bare NN+nom | kısa bir süre | (in) a short time
AJ+bare DT+bare NN+loc | etkin bir şekilde | in an efficient manner

Category 2 – Sub-patterns: Non-closed, potential sub-MWUs
Sample colligational pattern | n-gram | English
AV,bare_AJ,bare_DT,bare | çok önemli bir | a very important

Category 3 – Incomplete structures: non-MWU patterns
Sample colligational pattern | n-gram | English
PP,bare_AJ,bare_DT,bare | için önemli bir | an important ... for
PP,bare_AJ,bare_DT,bare | kadar geniş bir | as a broad ... as

The categories in Table (8) allow filtering the MWUs and non-MWUs as well as reserving partial ones that may be used to identify sub-MWUs. In brief, Category 3 candidates are filtered out, Category 1 candidates are filtered in, and Category 2 candidates are reserved for identifying 4-gram MWUs. Since the identification of sub-MWU strings is problematic not only for MWU extraction but for all lexical frequencies in any language, it requires separate techniques and is beyond the scope of the current study.

Extracting colligations also provides a general ranking based on the grammatical patterns of MWU candidates and makes the filtering process more linguistically relevant. Below are the top ten 3-gram colligations in the TNC. Table (9) demonstrates that 3-word units in Turkish mostly form a closed projection consisting of a specifier, a modifier and a head, which makes 3-grams more worth extracting than 2-grams, which mostly consist of light verb constructions or reduplications.

TABLE IX. TOP TEN 3-GRAM COLLIGATIONS IN THE TNC

Colligation | Sample 3-grams | English
1 | AV,bare_AJ,bare_DT,bare | çok önemli bir | a very important
2 | AJ,bare_DT,bare_NN,nom | kısa bir süre | a short time
3 | NN,nom_CJ,bare_NN,nom | radyo ve televizyon | radio and television
4 | DT,bare_NN,nom_AV,bare | bir süre sonra | after a while
5 | AJ,bare_CJ,bare_AJ,bare | ekonomik ve sosyal | economic and social
6 | CJ,bare_AV,bare_AV,bare | ama yine de | but still
7 | NN,nom_NN,nom_CJ,bare | ne var ki | however, yet
8 | AJ,bare_DT,bare_NN,loc | etkin bir şekilde | efficiently
9 | AV,bare_DT,bare_NN,nom | böyle bir şey | such a thing
10 | CJ,bare_AJ,bare_DT,bare | ile ilgili bir | a … related to
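The category-based filtering described in Section E can be sketched as follows. The pattern inventories here are hypothetical stand-ins distilled from the examples in Table (8), not the study's full rule set:

```python
# Hypothetical pattern inventories distilled from Table (8).
COMPLETE = {("AJ", "DT", "NN")}   # Category 1: closed MWU patterns
INCOMPLETE_FIRST = {"PP", "CJ"}   # Category 3: e.g. bare postposition/conjunction edge
SUBPATTERN_LAST = {"DT"}          # Category 2: non-closed, ends in a determiner

def classify(pos_tags):
    """Map a 3-gram's POS sequence to one of the three categories."""
    if tuple(pos_tags) in COMPLETE:
        return 1  # MWU pattern: filter in
    if pos_tags[0] in INCOMPLETE_FIRST:
        return 3  # non-MWU pattern: filter out
    if pos_tags[-1] in SUBPATTERN_LAST:
        return 2  # potential sub-MWU: reserve for 4-gram extraction
    return 2      # default: reserve rather than discard

print(classify(["AJ", "DT", "NN"]))  # kısa bir süre
print(classify(["PP", "AJ", "DT"]))  # için önemli bir
print(classify(["AV", "AJ", "DT"]))  # çok önemli bir
```

In a real pipeline the inventories would be induced from the full colligational classification rather than hard-coded.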
III. CONCLUSION

The methodological considerations discussed in this paper show that MWU extraction is largely a trial-and-error process for a given language. Thus, any attempt, be it statistical, computational or linguistic, is worth sharing in an interdisciplinary manner to fill the gap in this area. A reference MWU set, or an MWU dictionary, will then serve as input not only for linguistics but for all related areas of study. Fig. 1 summarizes the recursive process followed in the proposed strategy.

1. Corpus -> 2. Optimization -> 3. Sorting -> 4. Classification -> 5. Filtering

Fig. 1. Basics of the proposed strategy

Considering that Turkish is an agglutinative language that has little to do with words and rather operates on suffixes, the term ‘multi-morpheme unit’ (MMU) seems more operational for further cross-linguistic studies. In addition, lexico-grammatical constraints on MMU formation are as important as the observed frequencies of any MMU candidate; colligational analysis and filtering of n-grams should therefore be part of any strategy that includes statistical ranking of MMU candidates.

This paper has briefly summarized some methodological considerations for multi-morpheme unit (MMU) extraction in Turkish. Its purpose was to discuss some ignored aspects of MMU extraction in Turkish and to give an overall idea of the methodological considerations we faced. The Turkish lexicon includes more MMUs than are already documented; any technical or linguistic contribution will be of great importance, and a hybrid, interdisciplinary approach may be the answer to most of the open questions in the field.

MMU extraction is, in a sense, reverse engineering of the MMU forming processes in our minds. Only a process-based approach can provide data for the linguistics of Turkish. A product-based approach, that is, extracting a reference MMU set, can nevertheless serve as an initial step for identifying the grammatical constraints that govern the MMU forming processes in Turkish. Interdisciplinary studies conducted by engineers and linguists are of great importance in this sense, in that not only the MMUs themselves but also the rules underlying their formation can only be described through such collaborative work.

ACKNOWLEDGMENT

This work is supported by a grant from the Scientific and Technological Research Council of Turkey (TÜBİTAK, Grant No: 115K135).

REFERENCES

[1] Mel’čuk, I. A.: Phrasemes in language and phraseology in linguistics. In: Everaert, M., van der Linden, E.-J., Schenk, A. and Schreuder, R. (eds.) Idioms: Structural and Psychological Perspectives. Lawrence Erlbaum, Hillsdale, NJ (1995)
[2] Firth, J. R.: A synopsis of linguistic theory 1930-1955. In: Palmer, F. (ed.) Selected Papers of J. R. Firth. Longman, Harlow (1968)
[3] Oflazer, K., Çetinoğlu, Ö. and Say, B.: Integrating morphology with multi-word expression processing in Turkish. In: Proceedings of the Workshop on Multiword Expressions: Integrating Processing (MWE '04). Association for Computational Linguistics, pp. 64-71 (2004)
[4] Eryiğit, G. et al.: Annotation and extraction of multiword expressions in Turkish treebanks. In: Proceedings of the 11th Workshop on Multiword Expressions (MWE 2015), Denver, Colorado, USA, pp. 70-76 (2015)
[5] Kumova-Metin, S. and Karaoğlan, B.: Collocation extraction in Turkish texts using statistical methods. In: Proceedings of the 7th International Conference on Natural Language Processing (IceTAL 2010), LNCS, pp. 238-249 (2010)
[6] Aksan, M. and Aksan, Y.: Multi-word units and pragmatic functions in genre specification. Paper presented at the 13th IPrA Conference, 8-13 September 2013, New Delhi, India (2013)
[7] Durrant, P. and Mathews-Aydınlı, J.: A function-first approach to identifying formulaic language in academic writing. English for Specific Purposes, 30, 58-72 (2011)
[8] Aksan, Y., Mersinli, Ü. and Altunay, S.: Colligational analysis of Turkish multi-word units. Paper presented at CCS-2015, Corpus-Based Word Frequency: Methods and Applications, 19-20 February 2015, Mersin University, Turkey (2015)
[9] Mersinli, Ü. and Demirhan, U.: Çok sözcüklü kullanımlar ve ilköğretim Türkçe ders kitapları [Multi-word usages and primary school Turkish textbooks]. In: Aksan, M. and Aksan, Y. (eds.) Türkçe Öğretiminde Güncel Çalışmalar. Mersin Üniversitesi, Mersin (2012)
[10] Aksan, Y., Aksan, M., Koltuksuz, A., Sezer, T., Mersinli, Ü., Demirhan, U. U., Yılmazer, H., Kurtoğlu, Ö., Atasoy, G., Öz, S. and Yıldız, İ.: Construction of the Turkish National Corpus (TNC). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 3223-3227 (2012)
[11] Pecina, P.: Lexical association measures and collocation extraction. Language Resources and Evaluation, 44, 137-158 (2010)
[12] Aksan, M. and Aksan, Y.: Linguistic corpora: A view from Turkish. In: Oflazer, K. and Saraçlar, M. (eds.) Studies in Turkish Language Processing. Springer Verlag, Berlin (forthcoming)
[13] Sinclair, J. McH. and Renouf, A. J.: A lexical syllabus for language learning. In: McCarthy, M. J. and Carter, R. A. (eds.) Vocabulary in Language Teaching. Longman, London (1987)
[14] Banerjee, S. and Pedersen, T.: The design, implementation, and use of the Ngram Statistics Package. In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pp. 370-381 (2003)
[15] Mersinli, Ü.: Associative measures and multi-word unit extraction in Turkish. Journal of Language and Literature, 12 (1), 43-61 (2015)
[16] Seretan, V., Nerima, L. and Wehrli, E.: Multi-word collocation extraction by syntactic composition of collocation bigrams. In: Current Issues in Linguistic Theory 260, Amsterdam Studies in the Theory and History of Linguistic Science, Series IV, pp. 91-100 (2004)
[17] Baker, P., Hardie, A. and McEnery, T.: A Glossary of Corpus Linguistics. Edinburgh University Press, Edinburgh (2006)
The Turkish National Corpus (TNC): Comparing
the Architectures of v1 and v2
Yeşim Aksan
Mersin University
Mersin, Turkey
[email protected]

Selma Ayşe Özel
Çukurova University
Adana, Turkey
[email protected]

Hakan Yılmazer
Çukurova University
Adana, Turkey
[email protected]

Umut Ufuk Demirhan
Mersin University
Mersin, Turkey
[email protected]
Abstract— The Turkish National Corpus (TNC), whose first version was released in 2012, is the first large-scale (50 million words), web-based, publicly available free resource of contemporary Turkish. It is designed to be a well-balanced and representative reference corpus for Turkish. With 48 million words coming from its written part, the untagged TNC v1 represents 4438 different data sources over 9 domains and 34 different genres. The morphologically annotated, 50-million-word TNC v2, with 5412 different documents compiled from written and spoken Turkish and planned for release in 2016, offers new query options for linguistic analyses. This paper compares the architectures of TNC v1 and v2 on the basis of a set of queries made on both versions. Standard, restricted and wildcard lexical searches are performed, and the speed of the two versions in retrieving the query results as concordance lines is compared. Finally, it is argued that TNC v2 performs better and faster than TNC v1 due to its in-memory inverted index structure. Since building language corpora is still a recent undertaking for Turkish, the architecture of TNC v2 can serve as a model for similar corpus construction projects.
Keywords—Turkish National Corpus (TNC); corpus building; architecture; inverted index; relational database; in-memory data structures

I. INTRODUCTION

There are at least two different kinds of corpora in Turkish today: (i) large general linguistic corpora that are constructed and made available to users with proper corpus tools, and (ii) NLP corpora built with no linguistic criteria in mind but rather as tools for testing algorithms devised for different applications [1]. The first electronic linguistic corpus designed to represent modern Turkish is the 2-million-word, downloadable Middle East Technical University Turkish Corpus (MTC) [2]. The MTC is tagged with XCES-style annotation using special software developed by the members of the project group, who also built its corpus query workbench. In the years following the construction of the MTC, the need for a large-scale general reference corpus of Turkish became more and more obvious. To meet this challenge, the Turkish National Corpus (TNC) was built as a reference corpus of Turkish. The project team followed best practices at all stages of corpus development. The major design principles were adopted from the experience of the British National Corpus, with minor modifications. The end product is the TNC, a well-balanced, representative, large-scale (50 million words), free, general-purpose corpus of contemporary Turkish [3].

As maintained by [14], “if the corpus in question claims to be general in nature, then it will be typically balanced with regard to genres, domains that typically represent the language under consideration”. In line with this definition, the major aim in building the TNC is to represent texts from different genres, domains and types in a balanced manner, so that the conclusions drawn from quantitative and qualitative analyses of the corpus data hold true for language use in general. Genre balance is an important aspect of corpus design [15]. Both versions of the TNC have data from different domains and genres, which sets them apart from text archives or collections of texts difficult to categorize and separate by genre, such as the Web.

The number of linguistic and computational-linguistic studies using the TNC as a reference corpus is increasing. While most linguistic and NLP studies use the TNC for compiling naturally occurring language evidence and for hypothesis testing [16, 17, 18, 19], others follow a corpus-driven approach and attempt to build hypotheses and describe Turkish on the basis of the TNC [20, 21]. Overall, the usefulness of the TNC as a general corpus is primarily due to the data itself. With 48 million words, TNC v1 represents the written component of the corpus, which contains 4438 different data sources over 9 domains and 34 different genres; it was published as a free resource for non-commercial use in October 2012. The size of TNC v2 is 50,997,016 running words, representing a wide range of text categories spanning a period of 23 years (1990-2013). It consists of samples of textual data representing 9 different domains (98%) with 4,978 documents and transcribed spoken data (2%) with 434 documents. The morphologically annotated, complete version of TNC v2 is planned for release in 2016, offering new query options for linguistic analyses.
texts are distributed along two major types, namely imaginative
and informative. While the imaginative domain is represented
by texts of fiction, the informative domain is represented by
texts from the social sciences, the arts, commerce-finance,
belief-thought, world affairs, applied sciences, natural-pure
sciences, and leisure. The criterion of medium refers to text
production. The texts collected to represent the written medium
are carefully selected from books, periodicals, published or
unpublished documents, and texts written-to-be-spoken such as
news broadcasts and screenplays, among others. The criterion
of time defines the period of text production. Here, the
distribution of the size of the texts for each year is decided in
terms of relative representation of each domain in the medium.
This paper is organized as follows: Section two explains the
design features of the TNC. Section three describes basic
features of the TNC interface. The architectures of the TNC v1
and v2 are presented in section four. Section five displays the
comparative query results obtained through the two versions of
the corpus. The paper finally argues that in-memory inverted
index structure and relational database structure are effective in
terms of speed and extension of web-based language corpora.
II. DESIGN OF THE TNC
The only Turkish corpus of its kind, the TNC is constructed
following the principles used to construct the British National
Corpus in its basic design and implementation. The distribution
of samples in written component of the corpus is determined
proportionally for each text domain, time, and medium. Table I
and II show the distribution of texts across domain and
medium, respectively.
TABLE I.
THE DISTRIBUTION OF TEXTS ACROSS DOMAINS IN THE TNC
Domain
Imaginative:
Prose
Informative:
Natural
and
pure sciences
Informative:
Applied
science
Informative:
Social science
Informative:
World affairs
Informative:
Commerce and
finance
Informative:
Arts
Informative:
Belief
and
thought
Informative:
Leisure
Total
TABLE II.
Medium
Unspecified
Book
Periodical
Miscellaneous:
published
Miscellaneous:
unpublished
Total
Transcriptions from authentic spoken language constitute
2% of the TNC’s database, which involve everyday
conversations recorded in informal settings such as
conversations among friends, talk among family members and
friends, etc., as well as speeches collected in particular
communicative settings, such as meetings, lectures, and
interviews. The spoken component of the TNC contains a total
of 1,013,728 running words. Of these words, 439,461 of them
come from orthographic transcriptions of everyday
conversations and their relevant medium, and 574,267 of them
are orthographic transcriptions of context-governed speeches.
No.
of
words
9,365,775
%
of
words
18.74 %
No.
of
documents
674
%
of
documents
13.54 %
1,367,213
2.74 %
253
5.08 %
3,464,557
6.93 %
461
9.26 %
7,151,622
14.31 %
671
13.48 %
9,840,241
19.69 %
757
15.21 %
4,513,233
9.03 %
429
8.62 %
3,659,025
7.32 %
347
6.97 %
2,200,019
4.4 %
297
5.97 %
8,421,603
16.85 %
1,089
21.88 %
49,983,288
100.00 %
4,978
100.00 %
Part-of-speech annotation, morphological tagging, and
lemmatization of the TNC are done by developing a natural
language-processing (NLP) dictionary based on the NooJ_TR
module [13]. The unique, semi-automatic process of
developing the NLP dictionary includes the following steps: (i)
automatically annotating the type list with the NooJ_TR
module, which follows a root-driven, non-stochastic, rulebased approach to annotating the morphemes of the given types
using a graph-based, finite-state transducer; (ii) manually
checking and revising the output and eliminating artificial/nonoccurring ambiguities and theoretically possible multi-tags.
After these stages, the entries of the NLP dictionary and actual
running words of the corpus are matched via the software
which has been developed by using PHP and MySQL.
III. FEATURES OF THE TNC INTERFACE
Web-based interface of the TNC provides for multitude of
features for the analysis of corpus texts including concordance
display (Fig. 1), sorting concordance data (Fig. 2), creating
descriptive statistics for query results over the languageexternal restriction categories of texts via distribution (Fig. 3),
and compiling lists of collocates (Fig. 4) for query terms on the
basis of several statistical methods.
THE DISTRIBUTION OF TEXTS ACROSS MEDIUMS IN THE TNC
No.
of
words
10,541
31,456,426
15,968,240
%
of
words
0.02 %
62.93 %
31.95 %
No.
of
documents
1
2,141
2,092
%
of
documents
0.02 %
43.01 %
42.02 %
958,999
1.92 %
294
5.91 %
1,589,082
3.18 %
450
9.04 %
49,983,288
100.00 %
4,978
100.00 %
The representativeness of the TNC is secured through
balance and sampling of varieties of contemporary language
use. The selection of written texts is done via the criteria of text
domain, medium, and time. The criterion of domain means that
Fig. 1. TNC v1 concordance results page
Fig. 1 shows the query results in the TNC, which are given as a concordance display (key word in context, KWIC). "A concordance is a list of all the occurrences of a particular search term in a corpus presented within the context in which they occur, usually a few words to the left and right of the search term" [22]. A search term in the TNC can be a single word, a multiword phrase, or a word containing wildcards. Concordances can be sorted alphabetically not only by the node word but also by the context up to 5 words to the left or right of the node word. This function of the TNC helps users find linguistic patterns easily.
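As a rough sketch of how such context-based sorting can be implemented (our own illustration in Python; the TNC itself is implemented in PHP, and the tuple format used here is an assumption, not the TNC's internal representation):

```python
# Sketch: sort KWIC concordance lines by the n-th word to the left or right
# of the node word. Each line is (left_context, node, right_context),
# with contexts given as word lists.
def sort_concordance(lines, position=1, side="right"):
    def key(line):
        left, _node, right = line
        context = right if side == "right" else list(reversed(left))
        # Lines with too little context sort first.
        return context[position - 1].lower() if len(context) >= position else ""
    return sorted(lines, key=key)

lines = [
    (["bir"], "haber", ["aldı", "ve"]),
    (["kötü"], "haber", ["verdi", "."]),
    (["son"], "haber", ["bülteni", "için"]),
]
for left, node, right in sort_concordance(lines, position=1, side="right"):
    print(" ".join(left), node.upper(), " ".join(right))
```

Sorting on the first word to the right groups identical continuations together, which is what makes recurring patterns around a node word visually apparent.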
The TNC v2, on the other hand, offers new features and query options. Since the v2 is morphologically annotated, searches for lemma forms, for morphemes and morpheme sequences, and PoS-tag-restricted searches (Fig. 5 and Fig. 6) can be conducted. Among the new features, users can save their query history, and they can search the spoken component of the corpus by using meta-textual categories such as genre, domain, interaction type, and speakers' age and sex.
Fig. 5. TNC v2 PoS-tag query
Fig. 2. TNC v1 sorting function
Users can also view distributional information for the query result based on pre-defined meta-textual categories. The distribution page allows users to access descriptive statistics concerning the distribution of the query result without performing multiple queries.
Fig. 6. TNC v2 PoS-tag query results
IV. THE ARCHITECTURES OF TNC V1 AND TNC V2

The TNC is a user-friendly, platform-independent, web-based corpus developed for the Turkish language. The HTML [12], CSS [7], PHP [5][6], and JavaScript [8] languages and the MySQL [4] database management system are used for the implementation of the TNC. The main architecture of TNC version 1 is presented in Fig. 7. To develop TNC v1, the text documents in the written component of the corpus are first pre-processed to extract metadata, such as author, year, source, and domain, that describe each document in the collection. The metadata of each document are stored in a MySQL table on disk. After the metadata extraction step, each token (a character string separated by whitespace characters) in each document is identified, and a unique token list is formed from all documents in the collection. Each token is given a unique identifier, and while unique tokens are collected from the documents, their frequencies in each document are also counted. The unique tokens, their IDs, and their frequencies are stored in another MySQL table. For each unique token found in the document collection, a kind of inverted index structure is formed, in which the position of each occurrence of the token is stored for each document in the collection. This index structure is stored on disk using the MyISAM file structure of MySQL. By using the inverted index structure, concordance data, descriptive statistics, and lists of collocates
Fig. 3. TNC v1 distribution function
Fig. 4. TNC v1 result of a collocation analysis of haber ‘news’
The collocation function allows users to list collocates (the words with which the query term occurs most frequently) by offering six statistical association measures for calculating collocational strength: log-likelihood, MI, MI3, T-score, the Dice coefficient, and the logDice coefficient.
for unique tokens in the corpus are computed, and they are stored as compressed files on disk by applying the IGBinary [9] compression method for PHP. IGBinary applies binary data compression and storage; therefore, reading and decompressing the data is faster than with other compression methods. The unique token list and the names of its compressed data files, including the concordance data, are then loaded into memory as a hash table to improve the performance of user searches. When a user sends a query via the TNC GUI, the queried token is looked up in the hash table and the name of the token's compressed concordance file is found. The compressed concordance file is then read from disk into memory and decompressed; if the user specifies filtering options in the query, these filters are applied over the decompressed file, and the computed results are randomly shuffled and displayed to the user.
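The core of this pipeline, a positional inverted index feeding a concordance display, can be approximated in a few lines (a simplified stand-in: the TNC v1 uses PHP, MySQL/MyISAM tables, and IGBinary-compressed files, while here a plain Python dict plays the role of the on-disk index, and the toy documents are invented):

```python
from collections import defaultdict

# Sketch: positional inverted index -> KWIC concordance, mimicking the v1 pipeline.
docs = {
    1: "son dakika haber bülteni yayında".split(),
    2: "kötü haber tez duyulur".split(),
}

# token -> {doc_id: [positions]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, tokens in docs.items():
    for pos, tok in enumerate(tokens):
        index[tok][doc_id].append(pos)

def concordance(term, width=2):
    """Yield (doc_id, left, node, right) for every occurrence of term."""
    for doc_id, positions in index.get(term, {}).items():
        for pos in positions:
            toks = docs[doc_id]
            yield doc_id, toks[max(0, pos - width):pos], toks[pos], toks[pos + 1:pos + 1 + width]

for doc_id, left, node, right in concordance("haber"):
    print(doc_id, " ".join(left), f"[{node}]", " ".join(right))
```

Storing positions, rather than just document frequencies, is what allows concordance lines, distribution statistics, and collocate windows to be derived from the same structure.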
Fig. 7. Architecture of the TNC v1

The TNC v2 is an updated and improved version of the TNC v1. The metadata extraction, tokenization, and indexing steps are similar to those of the TNC v1. Metadata are stored on disk as a MySQL table. The unique token list, including frequencies for each document, is loaded into memory instead of being stored on disk; only the document collection and the metadata for the documents are stored on disk. For all unique tokens in the collection, a kind of inverted index structure is constructed in which the positions of the token in each document are stored. This inverted index structure is kept in memory using Redis [10], an open-source (BSD-licensed) in-memory data structure store that supports data structures such as strings, hashes, lists, sets, and sorted sets. When a user sends a query via the TNC GUI, the queried token is looked up in the in-memory inverted index, and the unique types forming the concordance output of queries, the descriptive statistics for query results, and the lists of collocates are computed in real time. If the user specifies filters in the query, these filters are matched against the metadata table stored in the database, and the results of this search are used to filter the unique type lists for the given token. Finally, the computed concordances are shuffled and a random selection of results is displayed to the user. The architecture of the TNC v2 is presented in Fig. 8. As the inverted index structure is stored in memory, all computations are performed very fast, as shown in the next section.

Fig. 8. Architecture of the TNC v2

On the other hand, the system specifications of the computer running the TNC v1 interface are markedly different from those of the TNC v2. The system properties of the server running the TNC v2 interface are sufficient to process and store a huge amount of data in memory. Table III briefly presents the major hardware specifications of both versions.

TABLE III. HARDWARE SPECIFICATIONS OF COMPUTERS RUNNING TWO VERSIONS OF THE TNC

TNC v1: OS FreeBSD 9.0; RAM 16 GB; CPU 1 x Intel Xeon X3440, 2.53 GHz, 4 cores; Disk 500 GB SATA 2
TNC v2: OS Ubuntu Server 14.04 (virtual machine running on a FreeBSD host); RAM 64 GB; CPU 2 x Intel Xeon E5-2630 v2, 2.60 GHz, 2 cores; Disk 350 GB virtual disk
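The v2 lookup path, an in-memory index combined with relational metadata filtering, can be sketched as follows. This is our own illustration: plain Python dicts stand in for Redis and for the MySQL metadata table, and all identifiers and values are invented.

```python
import random

# Sketch of the v2 lookup path: an in-memory index (a dict standing in for
# Redis) plus an on-disk metadata table (a dict standing in for MySQL).
metadata = {  # doc_id -> metadata row (the part kept in the relational store)
    1: {"medium": "book", "year": 1998},
    2: {"medium": "periodical", "year": 2004},
    3: {"medium": "book", "year": 2010},
}

# token -> {doc_id: [positions]}; in the TNC v2 this lives in Redis.
index = {"büyük": {1: [4, 17], 2: [3], 3: [8]}}

def query(term, medium=None, year_range=None, sample=2):
    postings = index.get(term, {})
    hits = []
    for doc_id, positions in postings.items():
        meta = metadata[doc_id]
        if medium and meta["medium"] != medium:
            continue
        if year_range and not (year_range[0] <= meta["year"] <= year_range[1]):
            continue
        hits.extend((doc_id, p) for p in positions)
    random.shuffle(hits)   # the TNC also shuffles before displaying a sample
    return hits[:sample]

print(query("büyük", medium="book", year_range=(1995, 2005)))
```

Because the postings never leave memory, the only disk access per query is the metadata filter, which is the design choice behind the speed differences reported in the next section.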
V. QUERIES ON TNC V1 AND TNC V2

In what follows, the speeds of the two versions of the TNC are compared on the basis of standard, restricted, and wildcard queries conducted on the written component of the TNC v1 and the written and spoken components of the TNC v2. Fig. 9 and Fig. 10 show the main pages of the two versions, respectively.
Fig. 9. TNC v1 main page

Fig. 10. TNC v2 main page

A. Standard Queries

A standard search in the TNC allows users to search the whole corpus without filtering queries by the written or spoken parts of the corpus. Users type the search term into the form labeled "query term" and send it. At the top of the results page, users can view frequency information for the node word; a normalized frequency on a 1-million-word scale is also stated. Query results are displayed in a KWIC view by default. Each column on the results page displays the ID of the concordance line, the text in which the node word is found, and the concordance line, respectively. Users can display further context to the left and right of the node word by clicking the search term in the concordance lines. When such a query is made for the exact form of the node word fakat 'but', it takes about 5.52 seconds to compute concordance lines across 2,758 different corpus texts in the TNC v2 (Fig. 11), while the same query takes 14.57 seconds in the TNC v1 (Fig. 12).

Fig. 12. TNC v1 query results - fakat 'but'

On the other hand, while the TNC v1 does not allow a search for one of the most frequent words, kadar 'until' (which ranks 45th, with 142,693 occurrences in the frequency list of the TNC), the architecture of the TNC v2 allows this search, displaying a random selection of results to users in 10.82 seconds.

TABLE IV. THE STANDARD QUERY OF FAKAT 'BUT' AND KADAR 'UNTIL' WITHIN THE WRITTEN COMPONENT OF THE TNC

Query item      Version   Word count    Text count   Hits      Different texts   Time
fakat 'but'     TNC v1    47,641,688    4,458        22,331    2,486             14.57 sec
fakat 'but'     TNC v2    50,088,936    4,990        25,432    2,758             5.52 sec
kadar 'until'   TNC v1    47,641,688    4,458        N/A       N/A               > 60 sec
kadar 'until'   TNC v2    50,088,936    4,990        133,807   4,252             10.82 sec
B. Restricted Query

Restricted queries can be performed in the written component of the TNC with the criteria of publication date, medium, sample, domain, derived text type, author information, audience, and genre. Table V demonstrates such a query, performed by restricting the node word büyük 'big' in terms of the publication date (1995-2005), medium (books), and sample (whole text) of the corpus documents. Once again, the TNC v2 is faster for the restricted query: it takes only 3.52 seconds to produce concordance lines in the v2, while the same query takes 9.31 seconds in the v1.

TABLE V. THE RESTRICTED STANDARD QUERY OF BÜYÜK 'BIG' IN TERMS OF PUBLICATION DATE (1995-2005), MEDIUM (BOOKS) AND SAMPLE (WHOLE TEXT) WITHIN THE WRITTEN COMPONENT OF THE TNC

Query item    Version   Word count    Text count   Hits    Different texts   Time
büyük 'big'   TNC v1    47,641,688    4,458        3,476   168               9.31 sec
büyük 'big'   TNC v2    50,088,936    4,990        3,079   170               3.52 sec
C. Wildcard Queries

Wildcards can also be used in standard and restricted queries in the TNC. The special character * permits users to search for word forms starting with kol, such as kolay 'easy', kollarına 'to his arms', and koltuğa 'to the armchair'. As seen in Table VI, the TNC v2 is slightly faster than the v1 in displaying the query results.

Fig. 11. TNC v2 query results - fakat 'but'

The wildcard query that aims to obtain word forms containing both /b/ and /p/ as the final sound of kitap 'book' is only permitted in the TNC v2, where 41,098 hits are found across the corpus documents in 22.25 seconds.
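Such wildcard patterns map naturally onto regular expressions. The following sketch (our own illustration, not the TNC's query parser) shows the correspondence for the three pattern types used in this section:

```python
import re

# Sketch: translate TNC-style wildcard patterns into regular expressions.
# '*' matches any (possibly empty) suffix, '[b,p]' a character alternative,
# and '|' separates alternative words.
def wildcard_to_regex(pattern):
    regex = pattern.replace("*", r"\w*")
    regex = re.sub(r"\[(\w),(\w)\]", r"[\1\2]", regex)   # [b,p] -> [bp]
    return re.compile(rf"^(?:{regex})$")

tokens = ["kolay", "kitap", "kitabı", "kitle", "beyaz", "peynir"]
for pat in ["kol*", "kita[b,p]*", "beyaz|peynir"]:
    rx = wildcard_to_regex(pat)
    print(pat, "->", [t for t in tokens if rx.match(t)])
```

Evaluating such a pattern against an index requires scanning the unique token list rather than a single hash lookup, which is one reason wildcard queries are slower than standard ones in both versions.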
A multi-word search pattern where beyaz 'white' or peynir 'cheese' is queried across the corpus documents again shows that the speed of the TNC v2 is better than that of the v1: the query over the written and spoken parts of the corpus returned 12,212 hits in 2,085 different texts in 1.73 seconds.

Owing to the in-memory index structure of the TNC v2, it is possible to search lexical items used frequently in Turkish, such as ama 'but' (ranking 43rd among 73,383 lemmas in the NLP dictionary of the TNC) and bu 'this' (ranking 6th), at a reasonable speed. The ama|bu wildcard query returned the relevant strings within 15.66 seconds in the TNC v2, whereas the same query takes more than 60 seconds in the v1. As a final remark, the speed of the TNC v2 for some other wildcard query options still needs to be optimized.

TABLE VI. THE WILDCARD QUERIES IN THE TNC

Query item     Version   Word count    Text count   Hits      No. of diff. texts   Time
kol*           TNC v1    47,641,688    4,458        53,041    3,523                30.78 sec
kol*           TNC v2    50,088,936    4,990        58,154    3,864                22.95 sec
kita[b,p]*     TNC v1    47,641,688    4,458        N/A       N/A                  N/A
kita[b,p]*     TNC v2    50,088,936    4,990        41,098    2,687                22.25 sec
beyaz|peynir   TNC v1    47,641,688    4,458        10,881    1,894                6.46 sec
beyaz|peynir   TNC v2    50,088,936    4,990        12,212    2,085                1.73 sec
ama|bu         TNC v1    47,641,688    4,458        N/A       N/A                  N/A
ama|bu         TNC v2    50,088,936    4,990        836,838   4,565                15.66 sec

VI. CONCLUSION

This paper describes the design principles, interface features, and architecture of the TNC, and compares the architectures of the TNC v1 and v2. On the basis of standard, restricted, and wildcard corpus queries, it is shown that the in-memory inverted index structure of the TNC v2 computes better and faster than that of the v1, which is designed around disk-based compressed concordance data files for each unique term. In terms of speed, the v2 architecture allows users to perform searches across many corpus files (5,412 data files of the TNC) very rapidly, but such an architecture needs more memory to display query results quickly. We should also note that the relational database structure used in both versions of the TNC has its advantages for processing large corpus files, in that it allows for a "modular structure in which any number of features can be incorporated into the architecture" [11]. For future work, any extension of the features of the TNC would be possible via the relational database and inverted index structures.

ACKNOWLEDGMENT

This work is supported by TÜBİTAK (Grant No: 115K135, 113K039).

REFERENCES

[1] M. Aksan and Y. Aksan, "Linguistic corpora: A view from Turkish," in Studies in Turkish Language Processing, K. Oflazer and M. Saraçlar, Eds. Berlin: Springer Verlag, (forthcoming).
[2] B. Say, D. Zeyrek, K. Oflazer and U. Özge, "Development of a corpus and a treebank for present-day written Turkish," in Proceedings of the 11th International Conference of Turkish Linguistics, 2004, pp. 183-192.
[3] Y. Aksan, M. Aksan, A. Koltuksuz, T. Sezer, Ü. Mersinli, U. U. Demirhan, H. Yılmazer, Ö. Kurtoğlu, G. Atasoy, S. Öz and İ. Yıldız, "Construction of the Turkish National Corpus (TNC)," in Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), 2012, pp. 3223-3227.
[4] MySQL 5.5 Release Notes, http://dev.mysql.com/doc/relnotes/mysql/5.5/en/
[5] PHP 5.4.21, http://www.php.net/releases/5_4_21.php
[6] PHP 5.6.10, http://www.php.net/releases/5_6_10.php
[7] CSS, http://www.w3schools.com/css/
[8] JavaScript, http://www.w3schools.com/js/
[9] PHP PECL IGBinary Extension, http://codepoets.co.uk/2011/php-serialization-igbinary/
[10] Redis, http://redis.io/
[11] M. Davies, "The 385+ million word Corpus of Contemporary American English (1990-2008+)," International Journal of Corpus Linguistics, vol. 14, no. 2, pp. 159-190, 2009.
[12] HTML, http://www.w3schools.com/html/
[13] M. Aksan and Ü. Mersinli, "A corpus based NooJ module for Turkish," in Proceedings of the NooJ 2010 International Conference and Workshop, 2011, pp. 29-39.
[14] T. McEnery, R. Xiao, and Y. Tono, Corpus-based Language Studies, London: Routledge, 2006.
[15] M. Davies, "The Corpus of Contemporary American English as the first reliable monitor corpus of English," Literary and Linguistic Computing, vol. 25, no. 4, pp. 447-464, 2010.
[16] S. Akşehirli, "Dereceli karşıt anlamlılarda belirtisizlik ve ölçek yapısı," Journal of Language and Linguistic Studies, vol. 10, no. 1, pp. 49-66, 2014.
[17] G. İşgüder Şahin and E. Adalı, "Using morphosemantic information in construction of a pilot lexical semantic resource for Turkish," in Proceedings of the 21st International Conference on Computational Linguistics, 2014, pp. 929-936.
[18] S. Demir, "Generating valence shifted Turkish sentences," in Proceedings of the 8th INLG, 2014, pp. 128-132.
[19] O. Yılmaz, "Tag-based semantic website recommendation for Turkish language," International Journal of Scientific and Engineering Research, vol. 4, no. 3, pp. 1-7, 2013.
[20] A. Uçar and Ö. Kurtoğlu, "A corpus-based account of polysemy in Turkish: A case of ver- 'give'," in Proceedings of the 15th International Conference on Turkish Linguistics, 2012, pp. 539-551.
[21] Ü. Mersinli, "Associative measures and multi-word unit extraction in Turkish," Journal of Language and Literature, vol. 12, no. 1, pp. 43-61, 2015.
[22] P. Baker, A. Hardie and T. McEnery, A Glossary of Corpus Linguistics, Edinburgh: Edinburgh University Press, 2006.
(When) do we need inflectional groups?
Çağrı Çöltekin
University of Tübingen
[email protected]
Abstract—Inflectional groups (IGs) are sub-word units that have become a de facto standard in Turkish natural language processing (NLP). Despite their prominence in Turkish NLP, similar units are seldom used in other languages; theoretical or psycholinguistic studies on such units are virtually nonexistent; they are typically overused in most existing work; and there are no clear standards defining when a word should or should not be split into IGs. This paper argues for the need for sub-word syntactic units in Turkish NLP, followed by an explicit proposal listing a small set of morphosyntactic contexts in which these units should be introduced.
[Figure 1 (not reproduced): dependency tree of "Mavi arabada -kiler uyuyorlar" with the relations root (uyuyorlar), amod, nmod and nsubj, and the token analyses: Mavi = ADJ, lemma mavi; arabada = NOUN, lemma araba, Sing, Loc; -kiler = NOUN, lemma -ki, Plur, Nom; uyuyorlar = VERB, lemma uyu, Plur.]
Figure 1. Dependency analysis of the sentence in (1). The dependency
and feature labels follow Universal Dependencies (UD, Nivre et al. 2016)
conventions. Only the features relevant to our discussion are listed.
I. Introduction

The term inflectional group (IG) in the Turkish natural language processing literature refers to a sub-word unit. Although it does not seem to stem from (theoretical) linguistics, the unit has become a de facto standard for representing words in Turkish NLP. Representing words as multiple IGs helps in dealing with the complex interaction between morphology and syntax in the language. Furthermore, it alleviates the data sparseness problems in machine learning methods that arise from the large (theoretically infinite) number of word forms resulting from the numerous affixes a word can take. On the other hand, the use of IGs makes it difficult to use well-studied methods from other languages, or common off-the-shelf NLP tools, since these methods and tools are designed with the assumption that the word is the basic unit of syntactic processing. While we argue that sub-word syntactic units are necessary for Turkish NLP, the oversegmentation of words into IGs, which is very common in present practice in the field, amplifies these problems, and even defeats its own aim by shifting the data sparseness problem caused by long sequences of potential suffixes per word to one caused by long sequences of IGs per word. We discuss these issues in detail, and propose a more conservative alternative for segmenting words into IGs. In this paper, we assume that IGs are introduced for syntactic reasons, even though the traditional use of the unit seems to link it with derivational morphemes and derivation boundaries. We do not address or discuss derivational morphology outside its relation to IGs.

A. The need for sub-word syntactic units

In many languages, representing a word with a lemma, a POS tag and a set of (inflectional) features is sufficient (and useful) for most NLP tasks. In Turkish, however, this representation is often inadequate. For example, consider the word arabadakiler 'the ones in the/a car' in (1) below. The word araba 'car' is inflected for locative case, after which it receives the suffix -ki, which changes the meaning of the word to 'the one in the/a car'. Finally, the word is suffixed with the plural morpheme, resulting in plural number inflection.

(1) Mavi arabadakiler uyuyorlar
    Blue car.LOC-ki.PL sleep.PROG.3PL
    'The ones in the blue car are sleeping.'

The conventional representation with a triple ⟨lemma, POS tag, features⟩ fails here, since the word arabadakiler refers to two different (sets of) entities, and it carries a separate set of inflections for each. The first part of the word, arabada 'in the/a car', is singular and in locative case, while the complete word, arabadakiler 'the ones in the/a car', is plural and not marked for case (nominative). Besides the multiple conflicting inflectional features within the word, parts of the word participate in separate syntactic relations. Figure 1 presents a dependency analysis of the sentence in (1).[1] The adjective mavi 'blue' modifies the car (not the people in it), while the entities that sleep are the ones in the car (not the car). As a result, in the Turkish computational linguistics literature, such words have been represented using multiple sub-word units known as inflectional groups (Oflazer 1999).

[1] We present example analyses using dependency annotations, since this is where IGs were first introduced, and due to the popularity of dependency parsing and annotation in the NLP community. However, parallel examples can easily be constructed for other grammar formalisms.

Although the need for sub-word units is clear in (1), the current practice in the field oversegments words without any clear linguistic or practical reasons. For example, the subordinated verb sınırlandırılabilecek 'that/which can be limited' would be tokenized into six IGs in the METU-Sabancı treebank (Say et al. 2002; Oflazer et al. 2003) as in (2).

(2) sınır  -lan        -dır       -ıl        -abil      -ecek
    NOUN   VERB.Deriv  VERB.Caus  VERB.Pass  VERB.Abil  ADJ

In this annotation scheme, as well as the derivational morpheme -lan, the causative (-dır) and the passive (-ıl) voice suffixes, the mood suffix -abil expressing ability or possibility, and the subordinating suffix -ecek, which forms a verbal adjective, all introduce new IGs. The segmentation in
(2) does not have the same grounding as the one introduced by the suffix -ki in (1). All suffixes except the first one are considered part of inflectional morphology by modern grammars of Turkish (e.g., Kornfilt 1997; Göksel and Kerslake 2005). Even if we consider the first three inflectional suffixes as verb–verb derivations, none of the intermediate forms can carry any separate inflections, and there is no possibility of conflicting features. The case of the verbal adjective suffix is slightly more complicated (discussed in Section II-C). However, the verbal adjective forms in Turkish are not much different from participle forms in other languages, where an additional inflectional feature is sufficient to indicate that the word carries properties of both adjectives and verbs. That is, the word acts like a verb within the subordinate clause, while acting like an adjectival outside the subordinate clause.

The current paper proposes tokenizing a surface word into multiple IGs only in case one of the following is true.[2]

(3) a. Parts of the word may have potentially conflicting inflectional features.
    b. Parts of the word may participate in different syntactic relations.

These guidelines also imply that the syntactic units should have clearly defined syntactic functions, unlike, for example, the relation deriv introduced in the CoNLL-X version of the METU-Sabancı treebank (Buchholz and Marsi 2006). Under our guidelines, the word in (2) would not be segmented at all. The next section presents a critical summary of the use of IGs to date, mainly pointing out when segmentation of words is not necessary. Section III lists the cases where we need to introduce IGs, after which we provide a brief discussion followed by a summary and outlook.

II. Inflectional groups

The term inflectional group first appeared in work related to Turkish dependency parsing and annotation (Oflazer 1999), and was used in later studies with similar aims (Say et al. 2002; Oflazer et al. 2003; Sulubacak and Eryiğit 2013; Çöltekin 2015). It is also used in work on Turkish syntax with different grammar formalisms (Çetinoğlu and Oflazer 2006; Çakıcı 2008), and in pre- or non-syntactic analyses such as morphological analysis and disambiguation (e.g., Hakkani-Tür, Oflazer, and Tür 2002; Çöltekin 2014). Similar units are also used in NLP work on other Turkic languages (Tyers and Washington 2015). Although we are not aware of a precise definition of the term, both the use in the literature so far and the name inflectional group indicate that the unit was introduced based on morphosyntactic concerns. More precisely, we assume inflectional groups are sub-word units required by syntax. The remainder of this section outlines the earlier use of IGs, and discusses the morphological constructions where the current practice oversegments words according to the guidelines defined in (3).

A. Earlier use in the literature

Following Oflazer (1999), almost all Turkish NLP tools and resources annotate a word as a sequence of IGs as shown in (4) below,

(4) root+Infl1 ^DB+Infl2 + … + ^DB+Infln

where root is the root of the word, each Infli is a group (presumably a set) of inflections, and ^DB is a special symbol indicating a derivation boundary. According to this annotation scheme, the word sınırlandırılabilecek in (2) is represented as (5) below.[3]

(5) sınır+Noun+A3sg+Pnon+Nom
    ^DB+Verb+Acquire
    ^DB+Verb+Caus
    ^DB+Verb+Pass
    ^DB+Verb+Able+Pos
    ^DB+Adj+FutPart

The same annotation scheme is used in most of the Turkish computational linguistics literature to date. Below we discuss the differences between the current practice and the scheme suggested in this paper.

[2] The conditions 'conflicting features' and 'separate syntactic relations' depend on the annotation scheme. Ideally, the tagsets should avoid spurious conflicts. However, the guidelines are useful even if the tagset choice is not free and causes spurious conflicts.
[3] The analysis here follows the annotation scheme in the METU-Sabancı treebank (Oflazer et al. 2003), which is a typical example of other resources and tools for Turkish NLP with respect to the representation of words.

B. Derivation boundaries are not necessarily syntactic-token boundaries

In the current literature, it is common to see inflectional group boundaries inserted before some derivational morphemes, such as -lan in (2). However, not every derivation warrants introducing a new syntactic unit. In the noun–verb derivation example, sınır-lan 'border-lan (= to restrict)', the noun sınır cannot be inflected. Hence, it cannot have an inflectional group of its own. It is also not accessible from syntax: it can neither be modified by another syntactic word, nor can it modify another one. Although keeping the derivational history may be helpful for some applications, it is not related to determining syntactic units. For the purpose of determining syntactic units, the (derivational) morphemes of interest are typically those that modify an already inflected word, like the suffix -ki in (1) in Section I. However, attaching to an already inflected word is not sufficient for forming a new syntactic token. Also, the condition we are seeking here is stricter than scoping over phrases: some productive derivational suffixes may attach to already inflected forms and scope over whole phrases, as exemplified by the suffix -sIz 'without' in (6) below.

(6) [Takım arkadaşlarım]sız yapamam
    Team friend.PL.POSS1S.without do.AOR.NEG.1S
    'I cannot do without my team mates'

It may be tempting to segment the word arkadaşlarımsız into two IGs, since the noun takım modifies the stem arkadaş, and the suffix -sız scopes over the complete phrase. Furthermore, the suffix -sız attaches to an already inflected noun and derives an adverbial. However, according to our criteria, these do not warrant the introduction of a new syntactic token. A large number of inflections scope over the phrases headed by
the words carrying the inflection. For example, the possessive
suffix attached to the same noun also scopes over the whole
phrase (it is ‘my [team mates]’, not ‘*team [my mates]’). The
word arkadaşlarımsız in this example cannot have conflicting
features either (adverbs are not inflected in Turkish). Hence,
there are no strong reasons for segmenting words at derivation
boundaries introduced by the suffixes similar to -sIz. The
suffixes in this category include -lI, -lIk, -(n)CA, -CI, and
also -ki when it derives an adjectival. These suffixes should
be represented with adequate morphological features, rather
than separate syntactic units. Note that we make a distinction between the cases where these suffixes derive adjectivals or adverbials and the cases where some of these suffixes derive nominals. The nominal case is discussed in Section III-B.
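The feature-based alternative argued for above can be sketched as follows. This is an illustrative sketch, not the paper's tooling; the feature names (in particular Derivation) are assumptions in the spirit of the text, not an established tagset.

```python
# Sketch: encoding an adjectival/adverbial-deriving suffix such as -sIz as a
# morphological feature on a single token, instead of splitting the word into
# two IGs. All feature names here are hypothetical.

def analyze_arkadaslarimsiz():
    # arkadaş-lar-ım-sız 'without my friends': one token, one lemma;
    # the derivational suffix -sIz is recorded as a feature.
    return {
        "lemma": "arkadaş",
        "pos": "ADV",                 # the -sIz form functions adverbially here
        "feats": {
            "Number": "Plur",         # -lar
            "Person[psor]": "1",      # -ım: 1st person singular possessor
            "Number[psor]": "Sing",
            "Derivation": "sIz",      # hypothetical feature for the suffix
        },
    }

token = analyze_arkadaslarimsiz()
# A single token suffices: adverbs are not inflected in Turkish, so no
# conflicting features can arise that would force a second IG.
assert token["feats"]["Derivation"] == "sIz"
```

The design point is simply that the suffix's contribution survives as a queryable feature, so no information is lost by refusing to segment.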
C. Inflectional morphemes should not introduce IGs

In the current literature, a large number of inflections introduce new IGs. The majority of these inflections are verbal inflections, including voice suffixes as well as some mood and aspect modifiers. The passive and causative suffixes and the modal suffix glossed as Abil in (2) are examples of such inflectional suffixes.

One of the motivations for segmenting at these inflectional morphemes may be the fact that some of them can attach repeatedly to the same verbal stem. In this respect, the causative morpheme is particularly interesting, since, similar to -ki described in Section I, it can repeat multiple times with no principled limit on the number of consecutive causative suffixes. In practice, however, the use of multiple causative suffixes is rare, and it often indicates emphasis rather than multiple levels of causation. Example (7) demonstrates a verb with two causative suffixes which, indeed, can be interpreted as having two levels of causation.4

(7) Ders    bütün okullarda     oku-t-tur-ulacak.
    subject all   school.PL.LOC study.CAU.CAU.PASS.FUT.3SG
    ‘The subject will be caused to be caused to be studied in all schools.’ (literal)
    ‘The subject will be taught in all schools.’

Besides the causative suffix, the passive suffix and forms of the modal suffix -Abil may attach to the same verb multiple times. The double passive (on a transitive verb) creates impersonal (passive) expressions (Göksel and Kerslake 2005, p. 136). The double use of -Abil modifies the modality of the verb for both of its senses (ability and possibility). In all of these cases, these suffixes do not create a new predicate with potentially different inflections than the verbal stem they are attached to. For example, in the multiple levels of causatives above, all actions have to share the same tense, aspect and modality. As a result, if these suffixes form inflectional groups, the resulting inflectional groups will not have any independent inflections. A set of features that allows marking multiple levels of causation and distinguishing the effects of single or double passive or -Abil suffixes is sufficient for avoiding additional syntactic tokens.

Another aspect of the voice inflections that may have affected the current practice of oversegmentation is the fact that they change the valency of the verb, and modify the meanings of the arguments of the verb. For example, a causative or passive verb will assign different roles to its arguments. However, even if the verb valency is changed, there will still be a single grammatical subject and/or object, and their roles can be inferred from the transitivity of the verb and the voice inflections it carries. As a result, none of the suffixes discussed above meet the criteria set in (3). With a proper morphological tag set, we do not need to introduce new IGs for voice suffixes, or for other aspect or modality modifiers.

Besides the verbal suffixes discussed above, existing work also segments words at subordinating suffixes (suffixes that cause phrases headed by verbs to function as adjectives, adverbs or nouns). These suffixes change the function of the word they are attached to. However, there is no principled reason for not representing their status by setting a feature, e.g., verb form, to an appropriate value, e.g., verbal adjective (participle), verbal adverb (converb) or verbal noun (gerund/infinitive). This avoids segmentation by indicating that the word functions as a verb within the subordinate clause, while acting like a noun, adjective or adverb outside the subordinate clause. Note that even the subordinate clauses that function as nouns (verbal nouns and headless relative clauses, Göksel and Kerslake 2005, p. 84) do not require segmentation, since nominal predicates cannot be subordinated without an auxiliary verb and inflectional features, and the syntactic relations of verbs can easily be distinguished from those of nouns, adjectives and adverbs (the copula attached to subordinate verbs is discussed in Section IV). In many ways, the subordinating suffixes are similar to the productive derivational suffixes discussed in Section II-B, and do not need to introduce new syntactic tokens.
D. Uniform representation of all syntactic units
Another issue with the present use of IGs as represented in
(4) is the asymmetry between the first IG and the ones that
follow. In this representation, the only IG with a lemma is the
first one. This hinders the uniform treatment of the syntactic
tokens since some of the tokens are not represented as ⟨lemma,
POS tag, features⟩ triples, and introduces difficulties with
using existing NLP tools like parsers.
The current proposal requires a syntactic token to always be
associated with a lemma. For non-root IGs, the lemma should
be a canonical representation of the (derivational) morpheme
that introduces the IG. For example, for the proposed tokenization of arabada-kiler in (1), the suffix -ki should be treated as
the lemma rather than an inflection. This also serves as a test
for introducing new IGs. If the segmentation of a word results
in IGs that cannot have any inflections of their own (except
for the lemma), the segmentation is not justified.
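The uniform triple representation and the lemma test just described can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation; the feature values follow the example in (1) but are assumptions.

```python
# Sketch: every syntactic token, including a non-root IG, is a uniform
# <lemma, POS tag, features> triple; the morpheme opening the IG is its lemma.
from dataclasses import dataclass, field

@dataclass
class Token:
    lemma: str
    pos: str
    feats: dict = field(default_factory=dict)

# Hypothetical segmentation of arabadakiler 'the ones in the car':
# the pronominal -ki opens a new IG whose lemma is the morpheme itself,
# and that IG carries its own number/case inflections.
arabadakiler = [
    Token("araba", "NOUN", {"Number": "Sing", "Case": "Loc"}),
    Token("ki", "PRON", {"Number": "Plur", "Case": "Nom"}),
]

# The lemma test: a segmentation is justified only if each resulting IG can
# carry inflections of its own beyond the lemma (here, -ki takes Number/Case).
assert all(tok.lemma and tok.pos and tok.feats for tok in arabadakiler)
```

Because both tokens are complete triples, off-the-shelf parsers that expect one lemma per token can consume either IG without a special case for non-root units.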
4 A bit of context may be useful for non-native speakers to understand the
double causative in this example. The example, taken from a news text about
a new educational regulation, expresses that (the authorities who made) the
regulation will cause schools or teachers to cause the students to study the
subject.
III. Inflectional group boundaries

So far, our focus in this paper has been on where or when not to segment a word into sub-word syntactic units. In this section, we list the cases where sub-word units are necessary.

A. The relativizer -ki

The suffix -ki has two main functions (Hankamer 2004). It forms either adjectivals or pronominal expressions from nouns. We already argued in Section II-B that when the suffix -ki derives adjectivals, there is no need for introducing a new syntactic unit. However, as the example in Section I demonstrates, if it derives a pronominal, a new IG is necessary.

If the suffix -ki is attached to a noun in genitive case, the resulting pronominal expression refers to an entity that belongs to the object or person the original noun refers to. If it is attached to a locative noun, the resulting expression refers to an entity in/on/at the object the original noun refers to. The parts of the word referring to these two entities may have their own sets of inflections, and may participate in different syntactic relations. Example (1) and the corresponding dependency analysis in Figure 1 demonstrate the need for separate syntactic units. Without segmenting the word into multiple syntactic tokens, we cannot tell whether the expression refers to multiple cars or a single car, whether the car or the objects in the car are blue, or even whether the car is sleeping or the people/objects inside are sleeping. Both problems can be solved by introducing a new syntactic token, as in the analysis presented in Figure 1. Furthermore, the nominals derived with -ki may be suffixed with genitive or locative suffixes again, and in turn, with another -ki suffix. Although multiple -ki suffixes are rare in real language use, the process is recursive, and there is no principled limit that one can place on the number of -ki suffixes in a word form. This fact also underlines the need for introducing new IGs in the pronominal usage of the suffix -ki.

B. Other productive noun–noun derivations

Like the suffix -ki discussed above, some productive noun derivations result in word forms that refer to multiple entities. This is demonstrated using the derivational suffix -CI in (8).

(8) a. [eski kitap]çı
       old   book.CI
       ‘[old book] shop/seller’
    b. eski [kitapçı]
       old   book.CI
       ‘old [book shop]’

If the word kitapçı in (8) is not segmented, we do not have a way to represent the ambiguity between (8a) and (8b). The same issue surfaces in the case of other noun–noun derivations, or noun–adjective derivations when the derived adjectival is nominalized, referring to an object with the property described by the derived adjective. In such cases, similar to -ki, the parts of the word refer to entities which may have their own sets of inflections, and may participate in different syntactic relations. The other suffixes with similar behavior are -sIz, -lI and -lIk (which overlap with the ones listed in Section II-B). We present an example for each of the cases in (9).

(9) a. Kayıt        belgesizlere         2 bin      TL ceza kesilecek.
       registration document-SIZ.PL.DAT 2 thousand TL fine cut.PASS.FUT
       ‘Those without a registration document will be fined 2000 TL.’
    b. 2-3 metrelikleri     adamdan saymıyor       musun?
       2-3 meter-LIK.PL.ACC man.ABL count.NEG.PROG QuesP.2SG
       ‘Are you not considering 2 to 3 meter long ones worthy?’ (referring to boats)
    c. 1.5 crdi motorlusuyla        170 tl’lik dizelle    Istanbul-Sivas mesafesini yaptım.
       1.5 CRDI engine-LI.POS3S.INS 170 TL.LIK diesel.INS Istanbul-Sivas distance   do.PAST.1SG
       ‘I rode the Istanbul-Sivas distance with the one with the 1.5 CRDI engine, using 170 TL worth of diesel fuel.’

In (9a), without segmenting the word belgesizlere ‘the ones without documents’, we cannot represent the fact that the noun kayıt ‘registration’ modifies the word belge ‘document’, not the people who do not have the document. This is unlike the earlier example (6), where the relation is unambiguous since the attributive noun can only modify the noun, not the resulting adjectival. Similarly, in (9b), the numeral modifies metre ‘meter(s)’, not the pronominal expression derived by the suffix -lik. In other words, the expression refers to (an unknown number of) 2 to 3 meter boats, not 2 or 3 boats of one meter long. In (9c), too, the numeral and the abbreviation modify the motor ‘engine’, not the car with that particular engine. Also note that the suffix -lık in this example does not have to be segmented, since it derives an adjectival. The preceding number here can only modify the noun, not the adjectival.

The suffixes listed in (9) are a lot less productive than -ki discussed in Section III-A, and they attach to already inflected words to a varying but lower degree than -ki. Nevertheless, the cases exemplified in (9) exist. For a uniform treatment, our proposal is to segment words into multiple tokens when these suffixes derive a (pro)nominal expression.

Although the suffixes discussed here require segmentation of words, this is not true if the same suffix is part of a lexicalized derivation. For example, in contrast to the use of the suffix -sIz in (9a), the lexicalized word ev-siz ‘homeless’ should not be segmented, since the root here cannot be inflected, and it cannot participate in separate syntactic relations.

C. Copular suffixes and the suffix -lAş

In Turkish, the main means of forming copular predicates is through suffixation. In most cases, copular suffixes attach to a simple noun or adjective, where one may avoid segmenting the word by setting a feature that indicates the copular nature of the word. However, if the copula is attached to a verbal noun or a headless relative clause, as in (10) below, segmentation is unavoidable.

(10) Örnek   bizim  yazdıklarımızdandı.
     example we.GEN write.PART.PAST.PL.POSS1P.ABL-COP.PAST.3SG
     ‘The example was from the ones we wrote’
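The two-predicate structure forced by the copula in (10) can be sketched as follows. This is an illustrative sketch under assumptions: the copula's lemma and the relation labels are simplified stand-ins, not the treebank's actual annotation.

```python
# Sketch: segmenting yazdıklarımızdandı into a subordinate verb token and a
# copula token, so that each predicate can head its own subject relation.
# Feature labels follow the UD-style psor convention used in the text.

tokens = [
    {"form": "Örnek", "lemma": "örnek", "pos": "NOUN", "feats": {}},
    {"form": "bizim", "lemma": "biz", "pos": "PRON", "feats": {"Case": "Gen"}},
    # Subordinate predicate: first person plural subject agreement.
    {"form": "yazdıklarımızdan", "lemma": "yaz", "pos": "VERB",
     "feats": {"VerbForm": "Part", "Person[psor]": "1", "Number[psor]": "Plur"}},
    # Copula: third person singular subject agreement (lemma assumed here).
    {"form": "-dı", "lemma": "i", "pos": "VERB",
     "feats": {"Person": "3", "Number": "Sing", "Tense": "Past"}},
]

# Dependency edges as (dependent index, head index, relation); the labels are
# simplified. The copula (index 3) and the subordinate verb (index 2) each
# govern a distinct grammatical subject, which is impossible with one token.
edges = [
    (0, 3, "nsubj"),  # Örnek is the subject of the copular predicate
    (1, 2, "nsubj"),  # bizim is the subject of the subordinate verb
]

subject_heads = {head for _, head, rel in edges if rel == "nsubj"}
assert subject_heads == {2, 3}  # two predicates, two separate subjects
```

An unsegmented word could attach only one of these nsubj edges, which is exactly the failure the segmentation avoids.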
Figure 2. Dependency analysis of the sentence in (10). The dependency and feature labels follow the Universal Dependencies conventions (marking the copula as the head is against one of the UD principles, which is violated frequently). Only the features relevant to our discussion are listed. The features Person[psor] and Number[psor] mark the person and number of the possessor in a noun. The same suffixes also indicate the person and number of the subject on a subordinate verb.

Figure 3. Inconsistent analyses of the copula in case an empty syntactic unit is not introduced. (a) Overt copula: Ben arkadaşlarımlayım ‘I am with my friends’. (b) No surface copula: Ali arkadaşlarımla ‘Ali is with my friends’; a null syntactic element is introduced. (c) The same sentence as in (b), analyzed without a null element.

In (10), the word yazdıklarımızdandı includes two predicates (yaz ‘write’ and the past copula). As also presented in Figure 2, both predicates have their own subjects in the sentence. Furthermore, these two predicates have their own feature sets, which may conflict. For example, the subordinate verb carries first person plural subject–verb agreement (indicated by the feature labels Person[psor] and Number[psor] in Figure 2), while the inflections on the copula indicate a third-person singular subject (marked by the feature labels Person and Number). This example also demonstrates that the potential conflict of person and number features between the predicate and the resulting nominal is avoided by using different labels for these features (although the labels may be confusing in this particular tagset).

The morpheme -lAş ‘to become’ presents a slightly different case. -lAş forms verbs from nouns and adjectives, often leaving the possibility of modifying the stem. The sentence in (11) presents an example where the adjective pembe within the verb derived by -lAş is modified by an adverb.

(11) Koyu pembeleşinceye kadar kavurun.
     dark pink-lAş.CONV  until fry
     ‘Fry until it becomes dark pink.’

IV. Discussion and further issues

This paper argues for limiting the segmentation of words into sub-word syntactic tokens based on the two principles listed in (3). Based on these principles, the same affix may or may not introduce a new IG depending on whether it derives a nominal or an adjectival. In general, the need for tokenization arises when the same word contains multiple (pro)nouns or predicates. Furthermore, if a derived word with an otherwise transparent and productive suffix is fully lexicalized, there is no need for segmenting the word, as the stem cannot be inflected or modified by other words in the sentence.

Our proposal introduces a new IG in case a suffix derives a (pro)nominal from a noun in a way that allows modification of both nouns in the word, but not when the same suffix derives an adjective or adverb. A potential disadvantage of this approach is that it requires tokenization decisions to be made based on morphosyntactic information, which may cause difficulties for a pipeline approach to NLP.

A second issue, which we left unspecified in Section III-C, is the use of the null copula, which surfaces (pun intended) in copular constructions with present tense and a third person singular subject. Failing to introduce a null syntactic token will result in inconsistent analyses of copular expressions that differ only in trivial feature assignments, e.g., first person or third person subject–verb agreement. Figure 3 demonstrates this inconsistency. In Section III-C we demonstrated that the copular suffixes should be segmented in order to properly analyze sentences like (10). For the same reasons, we need to segment the copula in the sentence analyzed in Figure 3a. However, unless we introduce a null copula as in Figure 3b, the tokenization and syntactic analysis of these two sentences will be different (as presented in Figure 3c), despite the fact that the two sentences differ only in the person/number features of the copular predicates. It seems that introducing a null copula becomes a necessity, unless one wants to introduce an inconsistency in the analyses of these two similar structures. Note, however, that the null element introduced here is unlike the null units introduced in certain grammar formalisms as a result of syntactic processes (e.g., movement). Nevertheless, null elements will typically not be allowed in a wide range of grammatical frameworks, where an alternative method may be needed to avoid this inconsistency.

As noted earlier, the criteria we set in (3) depend on the choice of the feature set. For example, many tag sets, e.g., UD, use the same feature label for the number feature of predicates and nominals. This causes either feature conflicts or inconsistent labels for morphological and/or syntactic tags in the representation of participles and verbal nouns, which should not be tokenized according to our proposal. For example, the word yazdıkları ‘the ones he/she wrote’ in (10) requires two number features: the nominal is plural, but the predicate has a singular subject. The analysis in Figure 2 avoids conflicting feature values within the word yazdıklarımızdan by indicating the number and person of the subject of the predicate yaz using a different tag than the person and number of the subject of the copula. As a result, this word cannot be represented as a single syntactic token without assigning separate labels for these two different roles. Similar issues may also arise because of the overloaded use of some syntactic relations.
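The label-separation point can be made concrete with a small sketch. The feature names follow the UD-style psor convention already used in the discussion, but the particular values are assumptions for illustration.

```python
# Sketch: for a participle like yazdıkları 'the ones he/she wrote', keeping
# the nominal's number and the subordinate predicate's subject agreement
# under distinct labels lets both coexist on a single token.

yazdiklari = {
    "lemma": "yaz",
    "pos": "VERB",
    "feats": {
        "VerbForm": "Part",
        "Number": "Plur",       # the derived nominal 'the ones ...' is plural
        "Person[psor]": "3",    # ... while the writing subject is 3rd person
        "Number[psor]": "Sing", #     singular
    },
}

# Distinct labels mean no key collision; a tagset with a single 'Number'
# label would have to drop one of the two values or split the token.
feats = yazdiklari["feats"]
assert feats["Number"] != feats["Number[psor]"]
```

The same dictionary would raise no representational problem in a single-token analysis, which is why such words need not be segmented under the proposal.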
V. Summary and outlook

This paper presented an analysis of the current use of sub-word syntactic units, IGs, and proposed a more conservative alternative to the current practice of segmenting words into multiple IGs. We show that sub-word syntactic units are necessary even under such a conservative approach. However, the number of sub-word units can be dramatically reduced with an appropriate choice of tagset for morphological features and syntactic relations. Our concrete proposal is that the introduction of IGs should be motivated by syntactic analysis, and a word should be tokenized into multiple IGs when (1) it cannot be represented as a simple triple ⟨lemma, POS tag, features⟩ and/or (2) parts of the word participate in separate syntactic relations.

The principles set in this paper for (not) segmenting a word into multiple units depend on the tagset in use. A logical next step is to complement this proposal with a tagset that is useful for a wide range of NLP applications. Although defining a proper tagset for morphological features is out of the scope of this paper, the guidelines above are useful in the design of such a tagset. We note that efforts like the Universal Dependencies project (Nivre et al. 2016) may facilitate constructing such tagsets through the consensus of the broad community of Turkish/Turkic NLP researchers.

Our motivation in this paper has been identifying syntactic units for computational processing of the language. However, the sort of units discussed in this paper are interesting from the perspective of (general/theoretical) linguistics as well. At present, the problems discussed here are underexplored in all subfields of linguistics, including computational linguistics (with the notable exception of Bozşahin 2002). This discussion may motivate further research with a more theoretical flavor, which in turn may benefit the computational methods.

In closing, we also note that even though our discussion in this paper covers only Turkish, the same approach is likely to be relevant for other Turkic languages.

References

Bozşahin, Cem (2002). “The Combinatory Morphemic Lexicon.” In: Computational Linguistics 28.2, pp. 145–186.

Buchholz, Sabine and Erwin Marsi (2006). “CoNLL-X shared task on multilingual dependency parsing.” In: Proceedings of the Tenth Conference on Computational Natural Language Learning, pp. 149–164.

Çakıcı, Ruket (2008). “Wide-Coverage Parsing for Turkish.” PhD thesis. University of Edinburgh.

Çetinoğlu, Özlem and Kemal Oflazer (2006). “Morphology–Syntax Interface for Turkish LFG.” In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics, pp. 153–160.

Çöltekin, Çağrı (2014). “A set of open source tools for Turkish natural language processing.” In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). Reykjavik, Iceland: European Language Resources Association (ELRA).

Çöltekin, Çağrı (2015). “A grammar-book treebank of Turkish.” In: Proceedings of the 14th Workshop on Treebanks and Linguistic Theories (TLT 14). Ed. by Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, and Adam Przepiórkowski. Warsaw, Poland, pp. 35–49.

Göksel, Aslı and Celia Kerslake (2005). Turkish: A Comprehensive Grammar. London: Routledge.

Hakkani-Tür, Dilek Z., Kemal Oflazer, and Gökhan Tür (2002). “Statistical Morphological Disambiguation for Agglutinative Languages.” In: Computers and the Humanities 36.4, pp. 381–410.

Hankamer, Jorge (2004). “Why there are two ki’s in Turkish.” In: Current Research in Turkish Linguistics. Ed. by Kamile Imer and Gürkan Dogan. Eastern Mediterranean University, pp. 13–25.

Kornfilt, Jaklin (1997). Turkish. London and New York: Routledge.

Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman (2016). “Universal Dependencies v1: A Multilingual Treebank Collection.” In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), (accepted).

Oflazer, Kemal (1999). “Dependency Parsing with an Extended Finite State Approach.” In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. College Park, Maryland, USA: Association for Computational Linguistics, pp. 254–260.

Oflazer, Kemal, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür (2003). “Building a Turkish treebank.” In: Treebanks: Building and Using Parsed Corpora. Ed. by Anne Abeillé. Springer. Chap. 15, pp. 261–277.

Say, Bilge, Deniz Zeyrek, Kemal Oflazer, and Umut Özge (2002). “Development of a Corpus and a TreeBank for Present-day Written Turkish.” In: Proceedings of the Eleventh International Conference of Turkish Linguistics. Eastern Mediterranean University, Cyprus.

Sulubacak, Umut and Gülşen Eryiğit (2013). “Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank.” In: Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 129–134.

Tyers, Francis M. and Jonathan Washington (2015). “Towards a free/open-source universal-dependency treebank for Kazakh.” In: 3rd International Conference on Computer Processing in Turkic Languages (TURKLANG 2015).
Allomorphs and Binary Transitions Reduce Sparsity in Turkish
Semi-supervised Morphological Processing
Burcu Can†, Serkan Kumyol‡, Cem Bozşahin‡
† Department of Computer Engineering, Hacettepe University, Beytepe, 06800 Ankara, Turkey
‡ Cognitive Science Department, Informatics Institute, Middle East Technical University (ODTÜ), 06800 Ankara, Turkey
[email protected], [email protected], [email protected]
Abstract— Turkish is an agglutinating language with heavy affixation. During affixation, morphophonemic operations change the surface forms of morphemes, leading to allomorphy. This paper explores the use of Turkish allomorphs in the morphological segmentation task. The results show that aggregating morphemes in allomorph sets and treating them as the same morpheme decreases sparsity in morphological segmentation, leading to higher accuracy. The source of this supervision can be syntax, in particular the syntactic category of morphemes and their logical form. We further investigate the dependency of Turkish morphemes on each other, using unigram and bigram morpheme models, by adopting a non-parametric Bayesian model in the form of a Dirichlet process. The bigram morpheme model outperforms the single-morpheme model.

I. INTRODUCTION

Morphological segmentation is an important task in computational linguistics, which splits words into morphological components called morphemes.1 For example, the word başarılıdır is split into başar, ı, lı, dır (succeed, deverbal noun, comitative, copula—he/she/it is successful). It can also be split as başarı, lı, dır (success, comitative, copula—he/she/it is with success), depending on how much of derivational morphology is considered non-lexical.

Morphological segmentation becomes inevitable in any task that involves processing of languages with an agglutinative structure, because affixation leads to sparsity. It becomes impossible to build a vocabulary that consists of all possible word forms in the language. Indeed, the set of possible word forms that can be constructed in Turkish is infinite.

Morphological segmentation has been treated as both a supervised and an unsupervised machine learning (ML) problem. ML approaches to unsupervised learning of morphology start with the successor model of Harris [1], which counts successors of each grapheme (as a proxy for phonemic realization) to detect the morpheme boundaries, where the number of grapheme successors is comparably higher.

Minimum description length (MDL) is another method that has been applied in unsupervised morphological segmentation. MDL is based on selecting a model that aims to minimize the amount of space occupied by data. Goldsmith [2], [3] proposes an MDL-based system that models morphology in terms of morphological structures called signatures.2 The goal of the model is to minimize the amount of space through the signatures.

A number of probabilistic approaches have been proposed for unsupervised morphological segmentation. There are maximum a posteriori (MAP), maximum likelihood (ML), Bayesian parametric and Bayesian non-parametric models. Creutz and Lagus [4] propose both an ML model and an MDL model to introduce one of the well-known unsupervised morphological segmentation systems, Morfessor. Creutz and Lagus [5] suggest another member of the Morfessor family using MAP estimation.

Creutz [6] proposes a generative model which is based on the word segmentation model of Brent [7]. Morpheme length and frequency are used as prior information in the model.

A Bayesian non-parametric model is a Bayesian model defined on an infinite-dimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem (see [8]). Goldwater et al. [9] develop a two-stage model where the types (i.e. morphemes) are created by a generator and the frequencies of the types are modified by an adaptor in order to generate a power-law distribution using a Pitman-Yor process.

Can and Manandhar [10] propose a model based on the Hierarchical Dirichlet Process (HDP) to capture morphological paradigms that are structured within a hierarchy.

Virpioja et al. [16] introduce Allomorfessor, which is the only morphological analyzer that models allomorphs within the morphological segmentation task. They model the mutations between word forms to induce the allomorphs.

All the models given above assume that morphemes are independent of each other.

Here we adopt two different models, one of which is based on the independence assumption, and the other on a bigram morpheme model where each morpheme is assumed to be dependent on the previous morpheme. We make use of allomorphs in our model in order to reduce sparsity. Therefore, our model works like a class-based

1 Strictly speaking, ‘segmentation’ is not the right level of abstraction for morphology, because segments are phonological concepts whereas morphemes are not. The fact that Turkish morphology is ‘segmental’ is an exception rather than the rule in the world’s languages.
2 A signature consists of a set of stems and suffixes where each combination of a stem from the stem set and a suffix from the suffix set makes a valid word form.
model where all allomorphs of the same morpheme are treated as belonging to the same class. This view introduces some semantic supervision to learning the forms, because allomorphy, which is sameness of meaning under phonological variation, is a semantic notion.

Section II explains allomorphy in Turkish. Section III describes the unigram morpheme model and the bigram morpheme model, and the inference on morphological segmentation. Section IV presents the evaluation scores obtained from the experiments, before discussion and conclusion.

II. TURKISH ALLOMORPHY

Affixation in Turkish mostly occurs as segmental concatenation of suffixes to a stem or root. Prefixes are very rare. Surface forms of Turkish morphemes may change depending on phonological context. Vowel harmony and consonant assimilation are two morphophonemic processes which are common in Turkish. Segment “deletion” is also common, which may be treated as insertion depending on one’s morphological representation.

The vowels can be grouped in relation to vowel harmony:

1. Back vowels: {a, ı, o, u}
2. Front vowels: {e, i, ö, ü}
3. Front unrounded vowels: {e, i}
4. Front rounded vowels: {ö, ü}
5. Back unrounded vowels: {a, ı}
6. Back rounded vowels: {o, u}
7. High vowels: {ı, i, u, ü}
8. Low unrounded vowels: {a, e}

TABLE I
PHONEME ALTERNATIONS OF TURKISH (FROM OFLAZER ET AL. [11])
D: voiced (d) or voiceless (t)
A: back (a) or front (e)
H: high vowel (ı, i, u, ü)
R: vowel except o, ö
C: voiced (c) or voiceless (ç)
G: voiced (g) or voiceless (k)

Table I describes some alternations using these allophones. The rules governing an alternation refer to metaphonemes or allophones. A vowel alternation is presented in Example 1 below. ‘0’ is the notation for deleted phonemes, and for deleted lexical symbols such as morpheme boundaries.

Example 1. Lexical form:  bulut-lAr  N(cloud)-PLU
            Surface form: bulut0lar (i.e. bulutlar)
            Lexical form:  kedi-lAr   N(cat)-PLU
            Surface form: kedi0ler (i.e. kediler)

Here, lar and ler are allomorphs and both have the plural meaning. The number of allomorphs can vary for different morphemes. For example, dir, dır, dur, dür, tir, tır, tur, tür are all allomorphs and define the status of a verbal action.

III. THE BAYESIAN NON-PARAMETRIC MODEL

We propose two different models for morphological segmentation: the unigram Dirichlet process (DP) model, and the bigram Hierarchical Dirichlet process (HDP) model. Morphemes are drawn from a Dirichlet process by building a Markov chain. Unlike the other Bayesian non-parametric models adopted for morphological segmentation, our model generates a set of allomorphs from a Dirichlet process, rather than generating each morpheme independently.

Let corpus C be the set of words: C = {w1, w2, w3, ..., wn}. Exploiting the segmental nature of Turkish morphology, we assume that each word wn consists of segments, viz. s + m1 + ... + mn, where s is the stem and the m are suffixes.

A. Unigram Dirichlet Process Model

In the unigram model, we assume that segments are independent of each other:

    p(w = s + m) = p(s) p(m)    (1)

We do not discriminate stems from suffixes in our model. Therefore, the probability of a word with multiple segments is given as follows:

    p(w = s1 + s2 + · · · + sn) = ∏i p(si)    (2)

where the si denote the segments of w. Each segment s is drawn from a DP (see Figure 1):

    Gs ∼ DP(αs, Hs)
    s ∼ Gs    (3)

where DP(αs, Hs) denotes the Dirichlet process that generates a probability distribution Gs from which the segments are generated. Here αs is a concentration parameter which adjusts the skewness of the distribution. Large values of αs lead to a higher number of segments; low values reduce the number of segments generated per word. The condition αs < 1 results in sparse segments and a skewed distribution. The condition αs > 1 leads to a distribution closer to uniform that assigns similar probabilities to the segments. If αs = 1, all segments are equally probable and a uniform distribution is obtained. We use αs < 1 to favor a skewed distribution over the segments.

Hs is the base distribution that determines the mean of the DP [12]. We use the segment lengths for the base distribution:

    Hs = γ^|s|    (4)

where |s| indicates the length of a segment and γ is a gamma parameter (γ < 1).

The Dirichlet process in our model forms a Chinese Restaurant Process (CRP) where the same dish (i.e. segment type) is served at each table. Each segment is a customer, and whenever a new customer enters the restaurant, it either joins a table with the same segment type, if one exists; otherwise it
Fig. 2. The plate notation of the bigram Hierarchical Dirichlet process
model. Each segment is generated through a DP, which is used in another
DP in order to generate stem bigrams (si , si+1 ).
model between-group dependencies. The bigram hierarchical
Dirichlet process model is defined as follows (see Figure 2):
Fig. 1. The plate notation of the unigram Dirichlet process model. wi is
the word generated from a DP. si represents segments that form the word.
Rectangular boxes show how many times the process is repeated.
si+1 | si ∼ DP (αb , Hb )
Hb ∼ DP (αs , Hs )
si ∼ Hb
creates a new table. The conditional probability of a segment
is estimated through the CRP as follows:

−si
nsSi



if si ∈ S −si
−s
S −si + α
N
i
s
p si S , αs , Hs =

αs ∗ Hs (si )


otherwise
−s
N S i + αs
(5)
where, si+1 |si denotes the conditional probability distribution over adjacent segments. Hb is the base distribution of
the bigram model that is another Dirichlet process with a
base distribution Hs that generates each unigram segment in
the model. Segment lengths are used for the base distribution
again.
Once the probability distribution p(si+1 |si ) is drawn from
a Dirichlet process, the adjacent morphemes can be generated
by a Markov chain. Here we do not want to estimate Hb and
we integrate it out as follows:
−si
denotes the total number of segment tokens of
where nSsi
type si but with the new instance of the stem excluded from
−si
the complete set of stems S in the model. NsSi
is the total
number of segment tokens in S where new segment instance
si is excluded.
p (s1 , s2 ), (s2 , s3 ) . . . , (sM −1 , sM )
Z
M
Y
= p(Hb )
p ((si−1 , si ) | Hb ) dHb
B. Bigram Hierarchical Dirichlet Process Model
In the bigram model, we assume that each morpheme is
dependent on the previous morpheme:
p(w = s + m) = p(s)p(m|s)
(8)
(9)
i=1
(6)
where M denotes the total number of bigram tokens. Thus,
the joint probability distribution of bigrams becomes as
follows:
This rule assumes that the suffix is generated accordingly
with the stem. The same applies to a word with multiple
segments:
Y
p (w = s1 + s2 + · · · + sn ) = p(s1 )
p(si+1 |si ) (7)
p(s1 , s2 , . . . , sM )
= p (s1 ) p (s2 | s1 ) p (s3 | s2 ) ,
. . . , p (sn | sM −1 ) p (0 00 | sM )
i
where we again do not discriminate stems from suffixes.
Here, the first segment of the word is generated from a
Dirichlet process and bigrams are generated through another
Dirichlet process. We use a hierarchical Dirichlet process
(HDP) with two levels, where first we generate the first
segment through a Dirichlet process and in the second
level we generate the following segment depending on the
previous segment through another Dirichlet process. HDP
consists of multiple DPs within a hierarchy and is able to
(10)
Here 0 00 denotes the end of the word.
Let us call each bigram bi = (si | si−1 ):
p (w = {s1 , s2 , . . . , sM }) = p (s1 ) p(b1 )p(b2 ), . . . p(bM )
(11)
Here p(s1 ) is drawn from Hs through the unigram Dirichlet
process, which again forms a Chinese restaurant where
46
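Equation 5 can be sketched directly in code. The following is a minimal illustration, assuming a simple token-count table (`Counter`) and the parameter names used above; it is not the authors' implementation:

```python
from collections import Counter

def base_prob(segment, gamma=0.5):
    # Base distribution H_s = gamma^|s| of Equation 4: shorter segments
    # receive higher base probability (gamma < 1).
    return gamma ** len(segment)

def crp_prob(segment, counts, alpha_s=0.1, gamma=0.5):
    # Equation 5: counts holds the previously generated segment tokens,
    # with the current instance already excluded (the S^{-s_i} set).
    total = sum(counts.values())
    if counts[segment] > 0:
        # Join an existing table: proportional to the token frequency.
        return counts[segment] / (total + alpha_s)
    # Open a new table: discounted by alpha_s and by length through H_s.
    return alpha_s * base_prob(segment, gamma) / (total + alpha_s)

counts = Counter({"ev": 3, "ler": 2, "de": 1})
p_seen = crp_prob("ler", counts)      # frequent segment type
p_unseen = crp_prob("lerde", counts)  # unseen type, penalized by alpha_s * H_s
```

Frequent segment types are reused (the rich-get-richer behavior of the CRP), while unseen types are penalized both by α_s and by their length through H_s.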
The other Dirichlet process forms another Chinese restaurant where each table serves a segment type and the customers are the following segments. The conditional probability of a segment bigram can be calculated according to the Chinese Restaurant Process, given the previously generated segments S = {s_1, s_2, ..., s_n}, as follows:

    p(s_R | s_L) = p(b_i | B^{-b_i}, S^{-s_L}, S^{-s_R}, α_b, H_b, α_s, H_s)
                 = n_{b_i}^{B^{-b_i}} / (N_{s_L}^{S^{-s_L}} + α_b)    if b_i ∈ B^{-b_i}
                 = α_b · p(s_R) / (N_{s_L}^{S^{-s_L}} + α_b)          otherwise        (12)

where n_{b_i}^{B^{-b_i}} denotes the number of bigrams of type b_i when the new instance of the bigram b_i is excluded. Here B denotes the bigram set that involves all bigram tokens in the model, and N_{s_L}^{S^{-s_L}} is the total number of bigram tokens in the model. s_L and s_R denote the left and right nodes of the bigram. Therefore, if the bigram b_i exists in the model, the probability of generating the same bigram again is proportional to the number of bigram tokens of the same type. If the bigram does not exist in the model, it is generated with a probability proportional to that of the right node of the bigram:

    p(s_R | S^{-s_R}, α_s, H_s) = n_{s_R}^{S^{-s_R}} / (N^{S^{-s_R}} + α_s)    if s_R ∈ S^{-s_R}
                                = α_s · H_s(s_R) / (N^{S^{-s_R}} + α_s)        otherwise        (13)

where n_{s_R}^{S^{-s_R}} is the number of segments of type s_R in S when the new segment s_R is excluded, and N^{S^{-s_R}} is the total number of segment tokens in S excluding s_R. If the segment s_R exists in the model, it is generated again with a probability proportional to its frequency in the model. If it does not exist in the model, it is generated proportionally to the base distribution, so shorter morpheme lengths are favored.

The hierarchical model is useful for modeling dependencies between co-occurring segments. The co-occurrence of unseen segments is also within the scope of the hierarchical model. The prediction capability of the model comes from the hierarchical modeling of co-occurrences, which leads to a natural smoothing: a segment bigram may not be seen in the corpus, but it is smoothed with one of the segments in the bigram, which yields a kind of natural interpolation.

C. Incorporating Turkish Allomorphy Into The Model

In this study, the rules of alternation are included in the segmentation model as a filtering algorithm for vowels, with respect to the representations in Table I.3 Algorithm 1 takes a segment as input and replaces the graphemes with their allophones.

Algorithm 1 The filtering algorithm
 1: input: D = {w_1 = s_1 + s_2 + ... + s_n, ..., w_n = s_1 + s_2 + ... + s_n}
 2: chars ← {'d': 'D', 't': 'D', 'a': 'A', 'e': 'A', 'ı': 'H', 'i': 'H', 'u': 'H', 'ü': 'H', 'ç': 'C', 'g': 'G', 'k': 'G', 'ğ': 'G'}
 3: procedure FILTER(SEGMENT)
 4:   if SEGMENT ≠ 'ken' then
 5:     i ← 0
 6:     for i < length(SEGMENT) do
 7:       if SEGMENT[i] in chars then
 8:         replace(SEGMENT[i], chars[SEGMENT[i]])
 9:   return SEGMENT
10: for all m in D do
11:   Filter(m)

D. Inference

We use the Metropolis–Hastings algorithm [13] to learn the word segmentations in the given dataset. Words are randomly split initially. In each iteration we pick a word from the dataset and randomly split that word. We calculate the new conditional probability p_new of the sampled word and compare it with the old conditional probability p_old of the sampled word, using Equation 5, Equation 12 and Equation 13. We either accept or reject the new sample according to the ratio of the two probabilities (see Figure 3):

    p_new / p_old                                                   (14)

If p_new / p_old > 1, the new sample is accepted. Otherwise, the new sample is still accepted with probability p_new / p_old, in order to find the global maximum.

IV. RESULTS AND EVALUATION

We used the publicly available Turkish dataset provided by Morpho Challenge 2010 as both training set and test set.4 The dataset consists of a wordlist of 617,298 words with their frequency values. We did not make use of the frequencies in our model.

Two sets of experiments were performed for both the unigram model and the bigram model, with and without the filtering algorithm. In all experiments, we assume that words are made of only stems and suffixes; prefixes are ignored.

We followed the same evaluation procedure provided by Morpho Challenge. In order to calculate precision, two words that share a common segment are selected randomly from the results, and it is checked whether they really share a common segment according to the gold segmentations.

3 With the exception of the suffix 'ken', which does not show allomorphy. The cases ğ, ç and ü are not shown because our data does not contain any form with these symbols.
4 http://research.ics.aalto.fi/events/morphochallenge2010/data/wordlist.tur
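Algorithm 1 amounts to a character-class substitution. A minimal Python sketch, using the mapping from line 2 of the algorithm (the function name and dictionary literal are ours, not the authors' code):

```python
# Grapheme -> metaphoneme mapping from line 2 of Algorithm 1.
CHARS = {'d': 'D', 't': 'D', 'a': 'A', 'e': 'A',
         'ı': 'H', 'i': 'H', 'u': 'H', 'ü': 'H',
         'ç': 'C', 'g': 'G', 'k': 'G', 'ğ': 'G'}

def filter_segment(segment):
    # The suffix 'ken' is exempt, since it shows no allomorphy (footnote 3).
    if segment == 'ken':
        return segment
    return ''.join(CHARS.get(ch, ch) for ch in segment)

# The plural allomorphs lar/ler collapse onto the single lexical form lAr,
# so the model has fewer segment types to learn.
print(filter_segment('lar'), filter_segment('ler'))  # lAr lAr
```

Collapsing allomorphs this way is exactly what reduces the number of tables in the Chinese restaurant and yields the more stable distribution reported in the results.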
One point is given for each correct segment. Recall is estimated similarly, by selecting two words that share a common segment in the gold segmentations; for every correct segment, one point is given. The F-measure is the harmonic mean of precision and recall:

    F-measure = 2 / (1/Precision + 1/Recall)                        (15)

Algorithm 2 The inference algorithm
 1: input: data D = {w_1 = s_1 + s_2 + ... + s_n, ..., w_n = s_1 + s_2 + ... + s_n}
 2: initialize: i ← 1, w ← w_i = s_i + m_i, n ← iterations
 3: while n > 0 do
 4:   for all w_i in D do:
 5:     Randomly split w_i as S_new = {s_1, s_2, ...}
 6:     Remove the segments S_old from the model
 7:     p_old ← p(S_old | D^{-w_i})
 8:     p_new ← p(S_new | D^{-w_i})
 9:     if p_new > p_old then
10:       Accept the new segments of w_i
11:       S_old ← S_new
12:     else
13:       random ∼ Normal(0, 1)
14:       if random < (p_new / p_old) then
15:         Accept the new segments of w_i
16:         S_old ← S_new
17:       else
18:         Reject the new segments
19:         Insert the old segments S_old
20:   n ← n − 1
21: output: Optimal segments of the input words

Fig. 3. An example sampling step during the inference. The word Evlerde is randomly split into the segments Evle, r, and de. The old segmentation (Ev+ler+de) is compared with the new segmentation and either accepted or rejected.

Fig. 4. The results with the highest F-measure from the models. S indicates that the model is supervised by allomorph filtering.

The overall results are given in Figure 4. The bigram HDP with allomorphy supervision gives the highest F-measure, whereas the unigram HDP without any supervision is the weakest model.

In the unigram model, which has an F-measure of 30.64%, we set α = 0.5 and γ = 0.5. There is a significant gap between precision and recall in this setting; the low recall implies undersegmentation. The bigram model makes a significant improvement on the results: the F-measure becomes 38.83%. This shows that morphemes are highly dependent on each other, and modeling the morphemes as bigrams is more realistic.

Semi-supervision also makes a significant improvement in the F-measure when compared to the unsupervised setting, and the gap between precision and recall is not very big in the semi-supervised setting. The highest F-measure obtained in the semi-supervised setting is 43.22%. Using allomorphs with the filtering algorithm decreases the number of morphemes to be modeled, leading to a more stable distribution in the Dirichlet process, with fewer tables and less sparsity.

We compare our unsupervised model with the other unsupervised models that participated in Morpho Challenge 2010. The results are given in Table II. Our model has an F-measure of 38.83%, ranking 3rd out of 7 models.

TABLE II
COMPARISON OF OUR UNSUPERVISED MODEL WITH OTHER UNSUPERVISED SYSTEMS IN MORPHO CHALLENGE 2010 FOR TURKISH

System                         Precision(%)   Recall(%)   F-measure(%)
Morfessor CatMAP [5]           79.38          31.88       45.49
Aggressive Compounding [17]    55.51          34.36       42.45
Bigram HDP                     50.36          31.60       38.83
Iterative Compounding [17]     68.69          21.44       32.68
MorphAcq [18]                  79.02          19.78       19.78
Morfessor Baseline [4]         89.68          17.78       29.67
Base Inference [17]            72.81          16.11       26.38

We also compare our semi-supervised model (with the filtering process) with the other unsupervised models with supervised parameter tuning which participated in Morpho Challenge 2010. We do not include other unsupervised models or semi-supervised models in this comparison because they use the gold segmentations provided by Morpho Challenge.
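Equation 15 is the standard harmonic mean, and as a quick sanity check it reproduces the Bigram HDP row of Table II from its precision and recall:

```python
def f_measure(precision, recall):
    # Equation 15: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Bigram HDP row of Table II: P = 50.36%, R = 31.60%  ->  F = 38.83%
print(round(100 * f_measure(0.5036, 0.3160), 2))  # 38.83
```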
The results are given in Table III. Our model has an F-measure of 43.22%, ranking 4th among the 5 semi-supervised models. This is achieved with a minimal amount of supervision: only the allomorphs, and no other information.

TABLE III
COMPARISON OF OUR SEMI-SUPERVISED MODEL WITH OTHER ALGORITHMS WITH SUPERVISED PARAMETER TUNING PARTICIPATED IN MORPHO CHALLENGE 2010 FOR TURKISH

System               Precision(%)   Recall(%)   F-measure(%)
Promodes [14]        46.59          51.67       49.00
Promodes-E [14]      40.75          52.39       45.84
Morfessor U+W [15]   40.71          46.76       43.52
Bigram HDP S         49.21          38.52       43.22
Promodes-H [14]      47.88          39.37       43.21

The only existing allomorph-based morphological segmentation model is the Allomorfessor developed by Virpioja et al. [16], a morpheme-level model which is able to manipulate the surface forms of the morphemes with mutations. Their model resulted in an F-measure of 31.82% on the Turkish dataset5, with a huge gap between precision and recall. While their model achieves an F-measure of 62.31% on the English dataset, the results for Turkish are quite low compared to our results. Their model can capture only 1.9% of the Turkish mutations, which is quite low. This leads us to think that allomorphy may not be modeled only by consonant mutations; it needs non-phonological information about morphological forms.

V. CONCLUSION

Our study made two contributions to morphological segmentation: (i) we found that it is important to incorporate intra-word dependencies into form-driven Bayesian models, and (ii) allomorphy seems to be a useful prior for the computational morphological segmentation task. Given that sublexical training at the level of the suffixes' syntactic categories is becoming more feasible, along with their logical forms [19], there seem to be means for syntax to semi-supervise morphology.

Syntactic information about the morphemes has an important impact on segmentation that cannot be ignored, as would be the case under an independent-morphemes assumption. Providing room for co-occurrences of the morphemes in an unsupervised model provides language-specific information which improves the number of valid segments. Furthermore, the supervision of non-parametric Bayesian models is promising, as our minimal supervision achieved better results.

Our results are far from state-of-the-art performance. However, we believe that our experiments with allomorphs and morpheme dependency will facilitate further work on morphological processing.

5 http://research.ics.aalto.fi/events/morphochallenge2009/

REFERENCES

[1] Harris, Z. S.: From phoneme to morpheme. Language, 31(2):190–222, 1955.
[2] Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.
[3] Goldsmith, J.: An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4):353–371, 2006.
[4] Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Volume 6, pages 21–30. Association for Computational Linguistics, 2002.
[5] Creutz, M., Lagus, K.: Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), volume 1, pages 51–59, 2005.
[6] Creutz, M.: Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 280–287. Association for Computational Linguistics, 2003.
[7] Brent, M. R.: An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34(1-3):71–105, 1999.
[8] Orbanz, P., Teh, Y. W.: Bayesian nonparametric models. In Encyclopedia of Machine Learning, pages 81–89. Springer, 2010.
[9] Goldwater, S., Johnson, M., Griffiths, T. L.: Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466, 2005.
[10] Can, B., Manandhar, S.: Probabilistic hierarchical clustering of morphological paradigms. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 654–663. Association for Computational Linguistics, 2012.
[11] Oflazer, K., Göçmen, E., Bozşahin, C.: An outline of Turkish morphology. Report on the Bilkent and METU Turkish Natural Language Processing Initiative Project, 1994.
[12] Teh, Y. W.: Dirichlet process. In Encyclopedia of Machine Learning, pages 280–287. Springer, 2010.
[13] Hastings, W. K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[14] Spiegler, S., Golenia, B., Flach, P.: Unsupervised word decomposition with the Promodes algorithm, volume I, 2010.
[15] Kohonen, O., Virpioja, S., Lagus, K.: Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 78–86. Association for Computational Linguistics, 2010.
[16] Virpioja, S., Kohonen, O., Lagus, K.: Unsupervised morpheme discovery with Allomorfessor. In CLEF (Working Notes), 2009.
[17] Lignos, C.: Learning from unseen data. In Proceedings of the Morpho Challenge 2010 Workshop, pages 35–38, 2010.
[18] Nicolas, L., Farré, J., Molinero, M. A.: Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, Helsinki, Finland, September 2010.
[19] Çakıcı, R., Steedman, M., Bozşahin, C.: Wide-coverage parsing, semantics, and morphology. In Turkish Natural Language Processing, Oflazer, K., Saraçlar, M., eds. Springer, forthcoming, 2016.
Automatic Detection of the Type of “Chunks” in
Extracting Chunker Translation Rules from Parallel
Corpora
Aida Sundetova#, Ualsher Tukeyev#
#
Al-Farabi Kazakh National University, Research Institute of Mechanics and Mathematics,
Al-Farabi av., 71, 050040 Almaty, Kazakhstan
{sun27aida, ualsher.tukeyev}@gmail.com
Abstract— This paper describes a method for the automatic detection of the type of the "chunks" generated by the methodology presented by Sánchez-Cartagena et al. (Computer Speech & Language 32:1 (2015) 46–90). The proposed automatic detection of chunk types improves that methodology for extracting grammatical translation rules from bilingual corpora: the output phrases of the extracted "chunker" translation rules become applicable in the subsequent "interchunk" stage of the machine translation system, which improves machine translation quality. Experiments are done for the English–Kazakh1 language pair using the free/open-source rule-based machine translation (MT) platform Apertium and bilingual English–Kazakh corpora.

Keywords— rules extraction, machine translation, Apertium, transfer rules, chunks.

I. INTRODUCTION

Rule-based machine translation (MT) of natural language nearly always involves the following steps [1]: morphological analysis, part-of-speech (POS) tagging, translation of words into the target language, execution of syntactic transformations with division into phrases (or chunks), and generation of new lexical forms (word lemmas with lexical categories) of target-language words. In rule-based MT systems, most of these stages are implemented by handwritten translation rules. Creating handwritten rules is a very laborious process; automatic extraction of translation rules from bilingual corpora is therefore highly relevant.

This paper presents a method for the automatic detection of the type of "chunk" rules obtained with the methodology for automatic extraction of translation rules from bilingual corpora by Sánchez-Cartagena et al. (2015) [2], which is described in the following section. Their method requires creating tag groups and tag sequences for the new pair and tuning the extraction script by declaring the monolingual dictionary, the bilingual dictionary, and the bilingual corpora.

II. "CHUNKING" RULES FOR THE APERTIUM PLATFORM

The Apertium free/open-source rule-based shallow-transfer MT platform [3] includes the following modules: de-formatter, morphological analyzer, POS disambiguator, structural transfer, morphological generator, post-generator, and re-formatter.

Fig. 1. Structure of the Apertium platform

Structural transfer on the Apertium platform is implemented in three stages, following the description in [4]:

1. A first stage of transformations ("chunker") detects source-language (SL) lexical-form (LF) patterns and generates the appropriate sequences of target-language (TL) LFs, which are grouped into chunks representing simple constituents such as noun phrases, prepositional phrases, etc. These chunks bear tags that may be used for interchunk processing.
2. The second stage ("interchunk") reads patterns of chunks and produces a new sequence of chunks. This is the module where one can attempt to perform longer-range reordering operations, interchunk agreement (for example, agreement in number and person between a noun and a verb phrase), case selection, etc.
3. The third stage ("postchunk") transfers chunk-level tags to the lexical forms they contain, whose lexical-form-level tags are linked (through a referencing system) to the chunk-level tags.

Structural transfer for English–Kazakh has an additional clean-up stage to remove tags.

1 https://svn.code.sf.net/p/apertium/svn/staging/apertium-eng-kaz
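The three stages above can be caricatured in a few lines of Python. This is a hypothetical toy, not Apertium code (real Apertium rules are written in XML); it only illustrates how the chunk-type tags produced by the chunker let the interchunk stage reorder whole constituents:

```python
def chunker(tagged_words):
    # Stage 1: group target-language lexical forms into typed chunks.
    # Toy version: one word per chunk, typed by a small POS -> phrase map.
    kind = {'n': 'NP', 'vblex': 'VP', 'pr': 'PP'}
    return [(kind.get(tag, 'LRN'), [word]) for word, tag in tagged_words]

def interchunk(chunks):
    # Stage 2: longer-range reordering over chunk types, e.g. the
    # NP VP PP -> NP PP VP movement needed for Kazakh verb-final order.
    types = [t for t, _ in chunks]
    if types == ['NP', 'VP', 'PP']:
        chunks = [chunks[0], chunks[2], chunks[1]]
    return chunks

def postchunk(chunks):
    # Stage 3: push chunk-level tags down to the lexical forms and emit them.
    return [w for _, words in chunks for w in words]

out = postchunk(interchunk(chunker([('dog', 'n'), ('is', 'vblex'), ('garden', 'pr')])))
print(out)  # ['dog', 'garden', 'is']
```

Note that the interchunk rule matches on chunk *types* only; a chunk left with the generic type `LRN` would never trigger it, which is exactly the problem the paper addresses.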
Usually, three-stage transfer uses different types of phrases, which makes it possible to apply rules to concrete structures from stage to stage. For example, the current version of the English–Kazakh MT system, for which the experiments have been done, has 169 handwritten "chunker" rules and is able to analyze the following kinds of phrases:

1. Noun phrases (NP). Sequences with nouns (nominative or accusative case) are analyzed as noun phrases; for instance, the phrase two little cars – <NP>{two (eki) <numeral> little (kiskentai) <adjective> car (autokolik) <noun>} is grouped into one NP phrase. NP phrases leave case and possessive tags undetermined so that the next interchunk stage can assign them: I see <NP>{the sky<case-to-be-determined>} – I (Men) <NP>{sky<accusative> (aspanDY)} see (koremyn). As can be seen from the example, the noun phrase "the sky" has an undetermined case; in the interchunk stage it is determined as accusative and the ending -dy is added, which may change depending on vowel harmony.

2. Noun phrases as gerunds (NP-ger). Verbs that appear after verbs such as like, love, finish, start, hate, etc. and take the -ing form are also translated as NP phrases, and on the Kazakh side they carry gerund tense: I like playing – <NP>{I (men) <subject pronoun>} <VP>{like (zhaqsy koremin) <verb>} <NP>{playing (oinaudy) <verb gerund>}. This type of verb is treated as a noun phrase because on the Kazakh side it can take case and possessive markers just as noun phrases with nouns do.

3. Verb phrases (VP). All kinds of verbs: simple verbs (only one word), complex verb tenses (continuous, perfect), modal verbs (which assign genitive case to the subject – I must play – Meniŋ (Менің) oinauym (ойнауым) kerek (керек)), etc. Modal verbs have special phrases, for instance "VP_must_inf" or "VP_should_inf", which help assign the possessive from the subject by a rule written in the interchunk stage.

4. Prepositional phrases (PP). These phrases feature locative (-да/da – in house – үйде/uide), ablative (-нен/nen – from river – өзеннен/ozennen), genitive (-ның/niŋ – of city – қаланың/kalanyŋ) cases and postpositions, as well as complex postpositional phrases with the words үст/аст + possessive + locative (under table – үстелдің астында/usteldiŋ astynda).

5. Question verb phrases (VP_Q). These are used to detect questions starting with did/do, was/were, etc.; the auxiliary verb is analyzed as VP_Q and processed in the interchunk stage to generate question particles in Kazakh (-ma/-me, etc.). For instance, "Do you remember?" – Sizdiŋ (Сіздің) esiŋizde (есіңізде) me (ме)?

6. Auxiliary verb phrases (be/have/do, etc.). Such phrases are used in structures like <VPQ>{Do <verb do> (only tense)} you play? – to translate questions, where the rule only detects the tense and transfers it to the next stage in the <VPQ> phrase; and I <VP_be>{am <vbser> (e <copula>)} a teacher – to generate the copula e (edi, edim) and move it to the end of the noun phrase (a teacher – mugalim[e+myn]), then assign person and number from the subject at the interchunk stage.

7. Adjectival phrases (AdjP): single adjectives (AdjP big) and comparative adjectives (AdjP bigger). Superlative adjectives have a different phrase because, for instance, the translation of "SupP the most beautiful" is "SupP eŋ (ең) ædemi (әдемі)", but it can take a possessive (SupP {the most interesting} of these books – kitaptardyŋ (SupP {eŋ ædemisi})), so it cannot be treated as a regular adjective phrase AdjP.

As can be seen from the phrases above, each type of phrase has concrete operations that can be performed at the interchunk stage: determining case and possessives, assigning person and number, moving positions. Without specific phrase names, it is impossible to have a well-working interchunk stage.
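The case assignment deferred to the interchunk stage (item 1 above) can be sketched as follows. The harmony rule here is deliberately simplified to a front/back vowel test, and the function name is ours; real Kazakh accusative endings also depend on the stem's final consonant:

```python
BACK_VOWELS = set('aıou')
VOWELS = set('aıoueiöü')

def accusative(stem):
    # Pick the accusative ending from the stem's last vowel: -dy after a
    # back vowel, -di after a front vowel (simplified vowel harmony).
    last_vowel = [c for c in stem if c in VOWELS][-1]
    return stem + ('dy' if last_vowel in BACK_VOWELS else 'di')

print(accusative('aspan'))  # aspandy ('sky' once the NP case is determined)
```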
III. EXTRACTING "CHUNKER" RULES FROM CORPORA

The method described by Sánchez-Cartagena et al. (2015) was inspired by the work of Sánchez-Martínez and Forcada (2009) [5], where alignment templates were also considered for structural transfer rule inference. However, the new approach overcomes the main limitations of that of Sánchez-Martínez and Forcada (2009). First, it chooses the appropriate generalization level for the alignment templates (ATs), which contain word alignments and use word classes instead of the words themselves [6,7,8], from which the rules are generated. Second, it treats differently words that have context-dependent lexicalizations and are incorrectly translated by more general ATs. Third, it automatically selects the ATs to be used for generating convenient rules.

To adapt the method by Sánchez-Cartagena et al. (2015) to the English–Kazakh language pair, the following steps were performed:

1. Building an English–Kazakh parallel corpus using Bitextor2, a web crawler for parallel texts, and manually collected texts from fiction literature. The manually collected corpus consists of ~3200 parallel sentences; together with the crawled texts, the parallel English–Kazakh corpus contains 5625 sentences. Experiments were done on a corpus consisting of 140 sentences, and the big corpus is used for testing and tuning.

2. Creating a tag groups file for the Kazakh language. The Sánchez-Cartagena et al. (2015) method had not been tested on Turkic languages, which are morphologically rich; as a result, this file has more morphological tag groups for Kazakh. Groups have the following format, for instance the group for numerals: numtype:ord,coll,year:num, where numtype is the name of the variable used to identify different types of numerals: ord (first, second), coll (used in Kazakh to identify a number of objects or subjects without a following noun: two persons – eki adam (екі адам), two came – ekey keldi (екеуі келді)) and year (numerals coming after prepositions: in 1992); after the final ":" comes the name of the part of speech, "numeral" – "num". If some tags belong to several parts of speech, these parts of speech are put after the ":" separated by commas: tense:present:vblex,v.

2 https://svn.code.sf.net/p/bitextor/code/trunk
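Tag-group lines such as numtype:ord,coll,year:num can be parsed mechanically. A small sketch; the dictionary keys are our own naming, not part of the extraction script:

```python
def parse_tag_group(line):
    # A tag-group line has the shape  name:tag1,tag2,...:pos1,pos2
    # e.g. "numtype:ord,coll,year:num" from the example above.
    name, tags, poses = line.split(':')
    return {'name': name, 'tags': tags.split(','), 'pos': poses.split(',')}

group = parse_tag_group('numtype:ord,coll,year:num')
print(group['tags'])  # ['ord', 'coll', 'year']
```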
This file will be used to generate an appropriate group of tags for each part of speech on the English and Kazakh sides; all necessary tags can be found from the morphological analyses of the English–Kazakh MT system on the Apertium platform.

3. Creating a tag sequences file, where the defined tag groups are combined into appropriate sequences of tags according to the morphological analysis. The sequences will be used to generate target-language sequences of tags, which are the lexical categories of each lexical form. If the morphological analysis of the word "do" is do<vbdo><pres><p3><sg>, in the tag-sequence format it looks as follows: vbdo:verbtime,person,numberat, where vbdo is the name of the lexical category and verbtime, person, numberat are names of tag groups defined in the tag groups file.

4. Adapting the rule-extraction script: declaring the installed English–Kazakh language pair of the Apertium machine translation system, the morphological and bilingual dictionaries, and the corpus size.

5. The problem with the adapted method of extracting chunker rules from corpora. Some MT systems, like the English–Kazakh machine translation system on Apertium, use three-stage structural transfer, which means the adapted method needs improvement, because the rule-learning algorithm is designed to work only with 1-level Apertium transfer (only the apertium-transfer module and not apertium-interchunk). The generated chunks have no special phrase names such as NP, VP, etc., shown in Section 2 (the method generates "LRN" phrases); this fact prevents correct usage of these phrases in the interchunk stage.

IV. AUTOMATIC DETECTION OF CHUNK TYPE

To improve translation quality and to make the generated rules more usable in the interchunk stage, an additional step was added: detecting the phrase name of each generated chunk. For instance, if a chunk is named "__n__" and deals with nouns, the phrase name "NP" should be assigned.

To assign a phrase to each chunk, part-of-speech sequences are first defined for each phrase, and sequences of POSes are considered using X'-theory [9], whose X'-equivalences are shown in Table 1.

TABLE I
X'-EQUIVALENCES

As can be seen from Table 1, five phrases are defined. The first level is defined as X; the next levels are modified with other grammatical constituents functioning as specifiers: X' defines an X+X phrase, and X'' can define X+X' or X'+X' phrases. The following primary POS priority can be defined [10]:

Primary POS priority: V > N > A > P

According to this priority, POS sequences are defined for each phrase, for example for the English–Kazakh language pair as follows:

TABLE II
POS SEQUENCES FOR X'-EQUIVALENCES

The POS priority for the English–Kazakh pair then looks as follows:

Primary POS priority: P > V > N > A

This priority was chosen based on the highest score obtained in the evaluation (shown in the Results section), and also because P (prepositions) can only be modifiers of nouns, and in Kazakh they are transformed into postpositions or case markers; PP phrases thus include a noun in their structure, which puts them in priority before N.

The described phrases are written in an additional file where the user can specify phrases by priority, according to each language pair's features. This file is called "phrase.txt", and the described priority is written in it in the Apertium chunk-name format.
TABLE III
POS SEQUENCES FOR X'-EQUIVALENCES

TABLE IV
COMPARING TRANSLATION

As can be seen from Table 3, the user first writes the name of the phrase and then the parts of speech that define this phrase: VP,vblex,vbser,vbhaver,vbmod. The phrase-detection program reads this file and the generated rule file and assigns the phrases. To make this application more usable, the rule templates were changed by adding one-word rules. The evaluation of this method is described in the next section.
V. RESULTS

The results of the improved method were obtained with the English–Kazakh MT system on Apertium. From the GATs extracted from the corpus of 140 sentences, 13 rules were generated. In the next table, some of the translation rules obtained with the handwritten and the extraction processes are compared.

As can be seen from Table 4, a few generated rules work correctly, but the number of generated rules is not large because of the small volume of the corpus. The main difference between the rules is that in the handwritten rules some tags are undetermined (<PXD>, <ND>, <PD>, <NXD>) or can be changed in the next interchunk stage, whereas the generated rules assign all tags constantly. Also, the generated rules may miss some words while translating, as can be seen from the last translation of the sentence "A dog is also in the garden", where the generated rule translated it without the adverb "also". Such problems appear because of the low generalization level; they could be solved by using a bigger corpus for extracting rules.
Table 5 compares the quality of translations produced by rules obtained with the phrase-specification application and by rules obtained without it:
TABLE V
QUALITY OF TRANSLATED TEXTS
As can be seen from Table 5, adding the phrase detection step and improving the rule templates raised translation quality by 4%: for unigrams it rose by 12.55, for bigrams by 8.65 and for trigrams by 5.02. Table 6 shows sentences translated with the methodology of Sánchez-Cartagena et al. and with the proposed improved methodology:
TABLE VI
TRANSLATED TEXTS
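The reordering performed in the interchunk stage for these examples can be sketched as follows. This is an illustrative simplification: real Apertium interchunk rules are written declaratively as XML pattern-action rules over chunk tags, not in Python, but the effect on an NP VP PP sequence is the same.

```python
def reorder_chunks(phrases):
    """Apply the English-to-Kazakh pattern NP VP PP -> NP PP VP,
    leaving unspecified ("<LRN>") chunks in place, as in the
    interchunk behaviour described in the text."""
    out = list(phrases)
    for i in range(len(out) - 2):
        if out[i:i + 3] == ["NP", "VP", "PP"]:
            out[i:i + 3] = ["NP", "PP", "VP"]
    return out
```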
After the chunker stage, the input text is transformed into sequences of tags, with the phrase-type tags shown in bold. In the interchunk stage, as can be seen from the fourth column, chunks with the phrase type “<LRN>” did not change their position, whereas the specified NP, VP and PP phrases were reordered as follows: NP VP PP → NP PP VP. The last column shows the output of the translation. As a result, the new method produced the correct sequence of phrases, with the verb phrase (in italics) at the end of the sentence.
VI. CONCLUSION
This paper proposes an automatic detection method for chunk types, improving the methodology of Sánchez-Cartagena et al. (2015) for extracting grammatical translation rules from bilingual corpora. The results could be useful for other morphologically rich languages. The proposed improvement to the rule-extraction methodology improves machine translation quality. For future work, it is planned to apply the improved methodology to a bigger English-Kazakh corpus and to another language pair, Kazakh-Russian.
ACKNOWLEDGMENT
The authors thank Prof. Mikel L. Forcada and Miquel Esplà-Gomis from the Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant (Alacant, Spain) for their continuous advice and help in the research and development of this project. This research work is carried out within project 0749/GF, financed by the Ministry of Education and Science of the Republic of Kazakhstan.
REFERENCES
[1] Hutchins, William John, and Harold L. Somers. An Introduction to Machine Translation. Vol. 362. London: Academic Press, 1992.
[2] Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2015. A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora. Comput. Speech Lang. 32, 1 (July 2015), 46–90.
[3] Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, Francis M. Tyers. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation (Special Issue on Free/Open-Source Machine Translation), volume 25, issue 2, p. 127–144.
[4] Sundetova, A., Forcada, M. L., Shormakova, A., Aitkulova, A.: Structural transfer rules for English-to-Kazakh machine translation in the free/open-source platform Apertium. Proceedings of the International Conference on Computer Processing of Turkic Languages, pp. 317–326. L.N. Gumilyov Eurasian National University, Astana (2013).
[5] F. Sánchez-Martínez and M. L. Forcada. Inferring shallow-transfer machine translation rules from small parallel corpora. Journal of Artificial Intelligence Research, 34(1):605–635, 2009. ISSN 1076-9757.
[6] F. J. Och and H. Ney. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449, 2004.
[7] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[8] Y. Xu, T. K. Ralphs, L. Ladányi, and M. J. Saltzman. Computational experience with a software framework for parallel integer programming. INFORMS Journal on Computing, 21(3):383–397, 2009.
[9] Sells, Peter (1985), Lectures on Contemporary Syntactic Theories, Lecture Notes, No. 3, CSLI.
[10] Kuang-hua Chen and Hsin-Hsi Chen. 1994. Extracting noun phrases from large-scale texts: a hybrid approach and its automatic evaluation. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94). Association for Computational Linguistics, Stroudsburg, PA, USA, 234–241. DOI: http://dx.doi.org/10.3115/981732.981764
Simplification of Turkish Sentences
Dilara Torunoğlu-Selamet, Tuğba Pamay, Gülşen Eryiğit
Department of Computer Engineering
Istanbul Technical University
Istanbul, 34469, Turkey
[torunoglud, pamay, gulsen.cebiroglu]@itu.edu.tr
Abstract—Text Simplification is the process of transforming existing natural language text into a new form, aiming to reduce its syntactic or lexical complexity while preserving its meaning. A sentence being long and complicated may pose multiple problems, especially for elementary school children. In this paper, we focus on Turkish, a morphologically rich language, examine sentences from an elementary school textbook to extract complex structures, and propose a sentence simplification system to automatically generate simpler versions of the sentences. Thereby, sentences become easier for children to understand, particularly children with difficulty in reading comprehension. Our system automatically applies simplification operations, namely splitting, dropping, reordering, and substitution.
Keywords—Text Simplification, Sentence Simplification, Turkish
I. INTRODUCTION
Text Simplification is the process of transforming existing natural language text into a new form with the aim of reducing its syntactic or lexical complexity while preserving its meaning. Applications of Text Simplification can help people understand natural text with less effort. The target audience might be people with language disabilities like aphasia, adults learning a foreign language, low-literacy readers [1] and children [2]. Text simplification is also used in areas like Machine Translation (MT) [3] and Text Summarization (TS) [4]. At the sentence level, reading difficulties (sentence complexities) lie at the syntactic and lexical levels, so the simplification of sentences can be classified into two general categories: Lexical and Syntactic Simplification. Independently of the language, there are approaches to lexical and syntactic simplification based on Statistical Machine Translation. The concept of a simple, “easy-to-read” sentence is not universal. Sentence length and syllable count can give a good estimate, but they are not sufficient, since we take the preservation of meaning and understandability into account during the simplification process. Requirements for “easy-to-read” sentences can also vary from audience to audience.
Sentence simplification for highly inflectional or agglutinative languages has significant problems. For example, in Turkish, some words may be omitted from a sentence yet the meaning may remain the same. Elementary school children (preteens) face difficulty in understanding the arguments of the main predicate in a sentence, which may be complicated. Preteens have a tendency to use simple sentence structures in their daily lives, and when they come across sentences with complex structures in school textbooks, they may fall behind in class. For this reason, in this paper, we focus on Turkish and examine sentences from an elementary school textbook to extract complex structures, and we propose a sentence simplification system to automatically generate simpler versions of the sentences. Thereby, sentences become easier to understand for children, especially those with difficulty in reading comprehension.
In this paper, we take advantage of inflectional groups in Turkish and investigate certain types of complex structured sentences. We divide these sentences into three main categories: 1. Coordinate Sentences, 2. Paratactic Sentences, 3. Subordinating Sentences; each main category also has sub-categories. Examples of these categories are explained in Section III in detail. Then, we derive rules corresponding to each category and apply the rules to sentences taken from an elementary school textbook. We prepared a data set, annotated morphologically and syntactically with the NLP tools [5], to use in the sentence simplification.
The paper is structured as follows: Section II gives brief information about related work, Section III introduces the sentence structures on which we focus and presents our sentence simplification approach, and Section IV presents the conclusion and future work.
II. RELATED WORK
Text simplification has become a highly investigated topic
with the increase in the use of NLP systems. These systems
suffer lower accuracy owing to the complexity of sentences. One study [6] proposes a sentence simplification
model which is based on tree transformation by Statistical Machine Translation (SMT) [7], [8]. This work covers operations
like sentence splitting, reordering, deleting (dropping) and
phrase/word substitution. The parallel corpora that were used
in this work (PWKP) were generated from English Wikipedia
and Simple English Wikipedia. Another study [9] presents a
data-driven model based on quasi-synchronous grammar. In
contrast to state of art solutions [6], operations are not defined
explicitly; instead the quasi-synchronous grammar extraction
algorithm learns appropriate rules from the training data. Another study [10] presents a machine translation based approach similar to [6], but differs in that it does not take syntactic
information into account and only relies on phrase based
machine translation methods to implicitly learn simplifying
and paraphrasing of phrases. They claim that they produced
a language-agnostic solution. However, they only worked on
lexical operations for sentence simplification. In [11], a lexical
approach was followed for sentence simplification for different
learning levels and contexts. Their method has four steps: part-of-speech (POS) tagging, synonym probing, context frequency-based lexical replacement, and sentence checking. They evaluated their results with human annotators by asking only
yes/no questions for testing on meaning and simplicity. They
did not use parallel datasets, instead they used context-based
books for doing lexical operations. The study [12] focuses
on syntactic simplification to make text easier to comprehend
for human readers, or process by programs. They formalize
the interactions that take place between syntax and discourse
during the simplification process and present the results of
their system.
Most of the recent works focus on English, yet there are some studies on other languages. The study in [13] focuses
on Brazilian Portuguese. Another study [14] which is based on
dependency parsing of Spanish sentences is capable of lexical
simplification, deletion operations and sentence simplification
operations. The study [15] aims to develop an approach to
syntactic simplification of French sentences.
Another usage of text simplification is to help children
understand complex sentences in books. One of the studies
conducted for this purpose is [16], which examines children's stories and proposes a text simplification system to automatically generate simplified, more comprehensible versions of the
stories for children, especially those with difficulty in reading
comprehension. Splitting, dropping, reordering and substitution operations can be done with the proposed system. Another
study with the same approach is in [2] which chooses children
as the target audience of text simplification operations. They
perform both syntactic and lexical simplifications. They follow
a rule-based system for this task. Inspired by this research, in this paper we focus on simplifying children's textbooks.
III. DISCUSSION AND APPROACH
The morphologically rich nature of Turkish may cause orthographic words to be split into multiple inflectional groups2. For the sentence simplification approach, we take advantage of this property and investigate solutions for the simplification of syntactically complex sentences. We divide such sentences into three main categories: 1. Coordinate Sentences, 2. Paratactic Sentences, 3. Subordinating Sentences; each main category also has sub-categories. Then, we derive rules corresponding to each category and apply the rules to the sentences taken from the elementary school textbook. To apply the rules over the sentences, we benefit from the morphological and syntactic information of the tokens in the sentence and also use a morphological generator [5], one of the NLP tools, to generate the surface form of a token from its morphological analysis.
Sentence simplification is executed in three steps, which are visualized in Figure 1. The first step is the analysis operation, in which the sentence is analyzed morphologically and parsed syntactically. In this way, we obtain the dependency relations between the tokens. Then, in the transformation stage, each rule is tried on the given sentence, and the first suitable rule is selected to be applied (only one rule can be applied to a sentence; if no suitable rule is found, the sentence is left in its original form). Insertion of a token is performed at this level if it is considered necessary. In the insertion step, the shared arguments of the original sentence are derived first, and then each shared argument is inserted into the sub-sentence. Examples of the insertion step are given in the sections below. The rule in Figure 1 is explained in Section III-C1 in detail. In the generation step, the sentence is divided into sub-sentences according to the information obtained from the transformation stage. At this phase, the morphological information of the tokens may be updated to fit the simplified version of the sentence. For this purpose, we use a morphological generator to reconstruct the new form of the token. The morphological generator produces a valid Turkish word by applying all the rules of a morphological analyzer in reverse order (from lexical form to surface form). For example, the analysis of the participle “görmediğim” (who(m) I have not seen) is produced as gör+Verb+Neg^DB+Adj+PastPart+P1sg by the morphological analyzer. This analysis is converted to gör+Verb+Neg+Past+A1sg in the generation step to construct the predicate of the sub-sentence as “görmedim” (I have not seen). These three steps are valid for each of the rules explained in the sections below.
2 In Turkish NLP, words are generally split into sub-word units at their derivational boundaries, each resulting unit having a potentially different part-of-speech tag and dependency relation.
A. Coordinate Sentences
1) Shared Predicate: For this category, we introduce sentences in which the predicate is shared by elements interconnected in a coordination structure. A sample sentence under this category is shown in Figure 2. In the sample sentence, the word “sever” (likes) is the shared predicate. Turkish allows the non-repetition of some words in a sentence, which may make it difficult for children to understand the arguments of the shared predicate.
In this category, sentences are split based on the number of sub-parts in the original sentence. The elements of the sub-parts are decided by the coordinated arguments in the sentence. For example, in Figure 2, “Ali” and “Mehmet” are coordinated subjects and “basketbolu” (basketball) and “futbolu” (football) are coordinated objects of the same predicate. After splitting, the sentence is transformed into the new structure presented in the simplified version in Figure 2.
2) Shared Object: The sentence structure of this category is similar to the sentences in Section III-A1. However, in this case an object is shared by the elements of the coordinated structure. An example under this category is given in Figure 3.
Fig. 1: Sentence Simplification steps
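The three stages of Figure 1 can be sketched as a small pipeline. The function names are hypothetical; in the actual system, analysis and generation are performed with the ITU NLP tools [5].

```python
def simplify(sentence, rules, analyze, generate):
    """Three-step simplification sketch following Figure 1.

    `analyze` maps a raw sentence to tokens carrying morphological
    analyses and dependency relations; `generate` maps a (possibly
    updated) analysis back to a surface form. Both are placeholders
    for the ITU NLP tool calls.
    """
    tokens = analyze(sentence)                  # 1. analysis
    for rule in rules:                          # 2. transformation
        if rule.matches(tokens):
            sub_sentences = rule.apply(tokens)  # only the first
            break                               # matching rule fires
    else:
        return [sentence]                       # no rule: keep as-is
    # 3. generation: rebuild surface forms, regenerating any token
    # whose morphological analysis was updated by the rule
    return [" ".join(generate(tok) for tok in sub) for sub in sub_sentences]
```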
Fig. 2: Example for Shared Predicate Category
Original Version: “Ali basketbolu, Mehmet futbolu sever.” (‘Ali [likes] basketball, Mehmet likes football.’)
Simplified Version: “Ali basketbolu sever. Mehmet futbolu sever.” (‘Ali likes basketball. Mehmet likes football.’)
The word “hayvanları” (animals) is the object shared by the two predicates “sevelim” (let's like) and “koruyalım” (let's protect). Instead of not repeating the shared argument, it may be preferable to use the same argument twice; in this way, the meaning of the sentence may be conveyed more clearly to preteens.
In this category, sentences are split based on the number of sub-parts in the original sentence. The elements of the sub-parts are decided by the coordinated predicates in the sentence. For example, in the sentence in Figure 3, “sevelim” (like) and “koruyalım” (protect) are predicates coordinated by the same object. In the splitting operation, the sentence is split into new sentences corresponding to the coordinated predicates, and the shared arguments are put into all sub-sentences. After simplification, the sentence is divided into a number of parts, two in this case, and the split sentence is given in the simplified version in Figure 3. In this way, the sentence conveys the same meaning but with a syntactically simpler structure.
Fig. 3: Example for Shared Object Category
Original Version: “Hayvanları sevelim, koruyalım.” (‘Let's like animals, protect (them).’)
Simplified Version: “Hayvanları sevelim. Hayvanları koruyalım.” (‘Let's like animals. Let's protect animals.’)
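Both coordinate-sentence rules reduce to copying the shared element into every coordinated part. A minimal sketch, assuming the dependency analysis has already grouped the tokens into coordinated parts (surface-form regeneration is omitted):

```python
def split_coordinate(parts, shared, shared_is_predicate=True):
    """Split a coordinate sentence into simple sentences by copying
    the shared predicate (as in Fig. 2) or the shared object
    (as in Fig. 3) into every coordinated part."""
    sentences = []
    for part in parts:
        # Turkish is predicate-final, so a shared predicate goes at
        # the end of each sub-sentence; a shared object goes first.
        words = part + [shared] if shared_is_predicate else [shared] + part
        sentences.append(" ".join(words) + ".")
    return sentences
```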
B. Paratactic Sentence
For this category we focused on sentences which do not
have any shared argument or predicate. These consist of
independent clauses separated by conjunctions or punctuation.
As the predicates share no arguments, each sub-sentence
has its own elements. An example sentence under this category is shown in Figure 4. As seen from the sample, there
are two coordinated predicates: “açtı” (open) and “uyandı”
(woke up). These predicates have their own arguments. For
example, “Ebru” is the subject of “açtı” and “Elif” is the
subject of “uyandı”. In this category, sentences are split at
the conjunctions or punctuation marks which separate the
independent clauses, resulting in a number of sub-sentences.
Since these predicates have their own arguments, insertion of
any argument is not performed in this process. The example
is given in Figure 4.
Fig. 4: Example for Paratactic Sentence Category
Original Version: ‘Ebru pencereyi açtı ve Elif uyandı.’ (‘Ebru opened the window and Elif woke up.’)
Simplified Version: ‘Ebru pencereyi açtı. Elif uyandı.’ (‘Ebru opened the window. Elif woke up.’)
Fig. 5: Example for Participle Subclause Category
Original Version: “Uzun süredir görmediğim teyzem bize geliyor.” (“My aunt, whom I have not seen for a long time, is coming to us.”)
Simplified Version: ‘Teyzemi uzun süredir görmedim. Teyzem bize geliyor.’ (‘I have not seen my aunt for a long time. My aunt is coming to us.’)
C. Subordinating Sentences
A subordinating sentence is a sentence which contains a subclause. Subclauses are not complete sentences by themselves; however, they add information to complete the meaning of the whole sentence. These subclauses are formed with subordinating conjunctions (e.g., when, until) and relative pronouns (e.g., who, which). In this study, for Turkish, we focus on these structures under two topics: 1. Participle Subclauses, 2. Converbial Subclauses.
1) Participle Subclauses: For this category, we introduce sentences containing subclauses whose heads are participles. Participles are adjectives derived from a verb. An example under this category is given in Figure 5. When the English translation of the sentence is considered, the part which starts with the relative pronoun “who(m)” forms a subclause which modifies the word “aunt”. In the Turkish sentence, the part “uzun süredir görmediğim” (whom I have not seen for a long time) forms a subclause. This is a participle subclause because the head of this part is used as an adjective which modifies the word “teyzem” (my aunt).
In this category, we benefit from the inflectional groups of the words in the sentence. In the example, when the sentence is semantically analyzed, the person whom I have not seen and the person who is coming are the same person. Using this property, the sentence is split into two parts: the first part covers the subclause arguments and the second the main sentence arguments. There is one important issue in this category: the token which is modified by the participle subclause is inserted into the first split part with the proper dependency relation. The word “teyzem” (my aunt) is in the nominative case. Thus, when this token is inserted into the subclause part in the simplification process, its morphological analysis is changed to the accusative case before using the morphological generator. In this way, we ensure that the simplified sentences are grammatically correct.
Fig. 6: Example for Converbial Subclause Category
Original Version: “Ayşe koşarken düştü.” (“Ayşe fell down while [she was] running.”)
Simplified Version: “Ayşe koştu. Ayşe düştü.” (“Ayşe ran. Ayşe fell down.”)
2) Converbial Subclauses: The sentence structure of this
category is similar to the sentences in Section III-C1. For this category, we introduce sentences containing a subclause whose head is a converb. Converbs are adverbs derived from a verbal inflectional group. An example under this category
is given in Figure 6.
When the English translation of the sentence is considered,
the part which starts with the subordinating conjunction,
“while” forms a subclause which modifies the predicate of
the main sentence, “fell down”. In the Turkish sentence, the
part “koşarken” (while [she was] running) forms a subclause.
This is a converbial subclause because the head clause of this
sub-part is a converb which modifies the main predicate.
In the example, when the sentence is semantically analyzed,
the person who fell down and the person who was running are
the same person, “Ayşe”. This word is only assigned as the
subject of the main predicate in syntactic analysis. As a result,
in the simplification process, this token is also inserted into
the sub-sentence which is formed by the subclause. Also, the
head of the converbial subclause is used as the verb of the first sub-sentence. Therefore, this converb token is converted
to the verb form using the morphological generator tool.
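The tag rewriting that precedes morphological generation can be illustrated over the analysis strings used in Section III. This is a sketch on plain strings: the real system manipulates the analyses produced by the ITU NLP tools [5], and the tag spellings here are only illustrative.

```python
def participle_to_finite(analysis):
    """Rewrite a participle analysis into a finite-verb analysis, as
    in the paper's example:
      gör+Verb+Neg^DB+Adj+PastPart+P1sg -> gör+Verb+Neg+Past+A1sg
    so that the generator produces "görmedim" instead of
    "görmediğim". Tag spellings are illustrative."""
    # drop the derivation into an adjective
    analysis = analysis.replace("^DB+Adj", "")
    # participle tense/agreement become finite tense/agreement
    analysis = analysis.replace("+PastPart", "+Past")
    analysis = analysis.replace("+P1sg", "+A1sg")
    return analysis
```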
IV. CONCLUSION AND FUTURE WORK
A sentence being long and complicated can pose multiple problems in daily life. For example, in Turkish, some words may be omitted from a sentence yet the meaning may remain the same. However, elementary school children (preteens) may face difficulty in understanding the arguments of the main predicate in a complicated sentence. For this reason, in this paper, we focus on solving this problem by simplifying the given sentences.
We take advantage of inflectional groups in Turkish and investigate certain types of complex structured sentences. We divide them into three main categories: 1. Coordinate Sentences, 2. Paratactic Sentences, 3. Subordinating Sentences. Then, we derive rules corresponding to each category and apply the rules to sentences taken from an elementary school textbook. We present an automatic sentence simplifier for these categories and propose an approach to divide sentences to help children understand better.
As future work, we plan to verify the effectiveness of our simplification and the preservation of meaning by testing our results on child readers. For validating our rules, we intend to use a human-focused evaluation with elementary-school children as the testing audience.
V. ACKNOWLEDGEMENTS
This work is part of our ongoing research project “A Signing Avatar System for Turkish to Turkish Sign Language Machine Translation” supported by TUBITAK FATIH 1003 (grant no: 114E263). The authors want to thank Umut Sulubacak and Memduh Gökırmak for their valuable discussions and help.
REFERENCES
[1] W. M. Watanabe, A. C. Junior, V. R. Uzêda, R. P. d. M. Fortes, T. A. S. Pardo, and S. M. Aluísio, Facilita: reading assistance for low-literacy readers, ACM, 2009.
[2] J. De Belder and M.-F. Moens, Text simplification for children, ACM, 2010.
[3] S. Tyagi, D. Chopra, I. Mathur, and N. Joshi, Classifier based text simplification for improved machine translation, IEEE, 2015.
[4] A. Siddharthan, A. Nenkova, and K. McKeown, Syntactic simplification for improving content selection in multi-document summarization, Association for Computational Linguistics, 2004.
[5] G. Eryiğit, ITU Turkish NLP Web Service, April 2014.
[6] Z. Zhu, D. Bernhard, and I. Gurevych, A monolingual tree-based translation model for sentence simplification, Association for Computational Linguistics, 2010.
[7] K. Yamada and K. Knight, A syntax-based statistical translation model, Association for Computational Linguistics, 2001.
[8] K. Yamada and K. Knight, A decoder for syntax-based statistical MT, Association for Computational Linguistics, 2002.
[9] K. Woodsend and M. Lapata, Learning to simplify sentences with quasi-synchronous grammar and integer programming, Association for Computational Linguistics, 2011.
[10] S. Wubben, A. Van Den Bosch, and E. Krahmer, Sentence simplification by monolingual machine translation, Association for Computational Linguistics, 2012.
[11] B. P. Nunes, R. Kawase, P. Siehndel, M. A. Casanova, and S. Dietze, As simple as it gets: a sentence simplifier for different learning levels and contexts, IEEE, 2013.
[12] Syntactic simplification and text cohesion, vol. 4, no. 1, 2006.
[13] Natural language processing for social inclusion: a text simplification architecture for different literacy levels, 2009.
[14] S. Bott, L. Rello, B. Drndarevic, and H. Saggion, Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish, 2012.
[15] Simplification syntaxique de phrases pour le français (Syntactic simplification of sentences for French), 2012.
[16] T. T. Vu, G. B. Tran, and S. B. Pham, “Learning to simplify children stories with limited data,” in Intelligent Information and Database Systems, Springer, 2014, pp. 31–41.
Comprehensive Annotation of Multiword Expressions in Turkish
Kübra Adalı, Dep. of Computer Engineering, Istanbul Technical University, Maslak, Istanbul 34369, Email: [email protected]
Tutkum Dinc, Dep. of Linguistics, Istanbul University, Beyazıt, Istanbul, Email: [email protected]
Memduh Gokırmak, Dep. of Computer Engineering, Istanbul Technical University, Maslak, Istanbul 34369, Email: [email protected]
Gülşen Eryiǧit, Dep. of Computer Engineering, Istanbul Technical University, Maslak, Istanbul 34369, Email: [email protected]
Abstract—Multiword expressions (MWEs) are pervasive in Turkish, as in many other languages. There are many challenges related to MWEs in Natural Language Processing. The scarcity of annotated language resources is one of the most prominent for lesser-studied languages, and, as always, the development of these resources requires a noteworthy effort. This paper is the first study which specifically focuses on the development of Turkish MWE resources for the purposes of 1) the categorization of different MWE types in Turkish, 2) use in MWE identification, and 3) use in research focusing on the interleaving of MWE identification and parsing. For these purposes, we annotated two Turkish treebanks (IMST and IWT) with 11 MWE categories and 8 subcategories for the MWE category Named Entity.
1. Introduction
As the name implies, multiword expressions are composed of multiple words that together produce an idiosyncratic meaning or have a distinctive syntactic role. They pose several challenges for natural language processing tasks as well as in language acquisition for non-native speakers. As a result, they have been an important issue covered in many studies since the inception of the field of NLP. The reader may consult many comprehensive studies for a complete discussion of MWEs ([1], [2], [3]). Their extraction and processing within NLP applications is still a very active research topic, as may be seen from many recent workshops ([4], [5]) and research initiatives (e.g., the EU PARSEME Cost Action [6]).
Annotated data sets and lexicons are very valuable resources for MWE processing tasks. A comprehensive annotation of MWEs is a troublesome and exhausting process. Many languages, including Turkish, suffer from a lack of MWE-annotated language resources. Manually annotated treebanks are syntactically annotated corpora and are valuable resources for parsing research. The annotation of MWEs on treebanks would undoubtedly help investigations on the integration of MWE identification and parsing. As a result, there are many efforts to annotate MWEs on treebanks. Unfortunately, there is as of yet no common standard on how to annotate them. The aim of WG4 of PARSEME is to establish such standards for treebanks. [7] surveys MWE-annotated treebanks; some of these are the Prague Dependency Treebank [8], the French Dependency Treebank [9], and the Penn Treebank (a constituency treebank for English) [10].
Although there have been some previous attempts [11], [12] to build MWE-annotated treebanks for Turkish, this study is the first comprehensive annotation of MWEs on Turkish treebanks, being a fully manual annotation with detailed fine categories. This study is also a first attempt to define suitable categories for the MWE annotation of Turkish, and we believe it will also aid the creation of multi-lingual MWE annotation guidelines. Two existing Turkish dependency treebanks (IMST [13], a treebank of well-edited texts, and IWT [14], a Web treebank) are annotated with 11 main MWE categories (nominal compounds, duplications, verbal compounds, light verb constructions, compounds constructed with determiners, conjunctions, formulaic expressions, idiomatic expressions, proverbs and named entities) and 8 named entity sub-categories (Person, Location, Organization Names, Date and Time Expressions, Percentage, Monetary Expressions, Miscellaneous Numerical Expressions).
The remainder of the paper is structured as follows: Section 2 gives information about previous MWE studies in Turkish and introduces our proposed MWE categories, Section 3 presents the annotation process and the statistics, and Section 4 is the conclusion.
2. MWEs in Turkish
There are several studies focusing on MWE
discovery [15], MWE annotation [11], [12] and MWE identification [12], [16] in Turkish. [15] employs two simple
statistical methods, a chi-square hypothesis test and mutual
information, in order to discover Turkish collocations. [11]
reveals that parsing performance is affected differently
by the concatenation of the components of different MWE types.
The most recent study on MWEs is [12], in which a coarse,
undifferentiated annotation of MWEs took place and different lexical models for MWE identification, including automatic named entity recognition, were tested, demonstrating
that their extraction model improves the accuracy of MWE
extraction by a dependency parser [17] and the extraction
tool of [16].
Similar to other languages, MWEs pose interesting
challenges for Turkish. In particular, the variability of MWE
instances is very high due to the agglutinative and morphologically very rich nature of the language. The constituents
of an MWE may be inflected, resulting in a high number of
different surface forms [16], [18]. To give an example, the
MWE "aklına gelmek" (to come to mind) may appear in
different forms by taking personal agreement, tense, aspect
and modality suffixes. In the sentence "Aklıma gelmedi"
(It didn't come to my mind.), both of the components
underwent inflection and differ from their lemma forms: the
first word "aklıma" (to my mind) carries a 1st person possessive
agreement suffix and dative case, and the second word (from
"gelmek", to come) is in negative past tense with 3rd person
singular agreement. Non-compositionality and discontinuity
are common challenges of MWEs which also appear in
Turkish.
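Because MWE constituents inflect, matching on surface forms misses most instances; matching on lemma sequences is a common workaround. The sketch below illustrates this for the example above; the token/lemma pairs are hypothetical output of a morphological analyzer, not data from the treebanks:

```python
# Match the MWE "aklına gelmek" by its lemma sequence, so that inflected
# variants such as "Aklıma gelmedi" are still found.
# Each token is a (surface, lemma) pair; the lemmas are assumed to come
# from a morphological analyzer (illustrative values shown here).

def find_mwe(tokens, pattern):
    """Return start indices where the lemma sequence matches `pattern`."""
    lemmas = [lemma for _, lemma in tokens]
    n = len(pattern)
    return [i for i in range(len(lemmas) - n + 1)
            if lemmas[i:i + n] == pattern]

sentence = [("Aklıma", "akıl"), ("gelmedi", "gel")]   # "It didn't come to my mind."
print(find_mwe(sentence, ["akıl", "gel"]))            # → [0]
```

Note that this contiguous match does not handle the discontinuity challenge mentioned above; a gap-tolerant matcher would be needed for MWEs with intervening material.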
In this section, we introduce the categories that we
defined for MWE types in Turkish, which we believe will
provide the opportunity to address the problems of different types separately. The sub-categorization of MWEs will
also pave the way for further investigations on hierarchical
approaches for MWE identification and its integration into
parsing. With this aim, we define 11 categories of MWEs,
which we detail in the remainder of this section.
2.1. Nominal Compound MWEs

As described in [19], noun compounds are word-like units made up of two nominals. Our definition of nominal compound MWEs differs from this general definition in that they comprise only a subset of noun compounds used commonly enough to express a wide concept or class. These consist of bare compounds (the components do not take extra suffixes to mark the relation between them) and -(s)I compounds (the first component has no suffixes while the second one is marked with the third person possessive suffix -(s)I) [19]. To give some examples: "kadın çorabı" (hosiery), "hakem heyeti" (arbitration court), "kredi kartı" (credit card), "diş macunu" (toothpaste). As may be observed from the examples, the overall sense of this type of MWE may be discerned from its components.

2.2. Duplication MWEs

Duplications are linguistic units that are formed mainly by duplicating a nominal or modifier. The second word can be produced in several ways: the reproduction of the exact word, a synonymous word, an antonymous word, an onomatopoeic word, or a gibberish word. Examples of each are "çabuk çabuk" (very quickly, or lit. quick quick), "mal mülk" (property, or lit. property property), "aşağı yukarı" (almost, nearly, or lit. down up), "adı sanı" (public profile, or lit. name and fame), "paldır küldür" (pell-mell). Duplications with an interrogative particle in between are also considered to be duplication MWEs (e.g., "güzel mi güzel" (so beautiful)). Duplications can strengthen the meaning of the main word, turn an adjective into an adverb, or add an idiomatic meaning. We decided not to include the 'm'-duplication (where the word is repeated with its first letter replaced with 'm' in the second occurrence) as a type of duplication MWE.

2.3. Verbal Compound MWEs

In this type, the components form the MWE without undergoing a significant semantic change. They are formed with a noun and a verb1. This type of MWE may be inflected more frequently than other types due to the verbal nature of their constructions. Examples of this pattern are "karar vermek" (to decide) and "söz vermek" (to promise).

2.4. Light Verb Construction MWEs

Light verb construction MWEs are formed by six auxiliary verbs, which are "olmak" (to be), "etmek" (to do), "yapmak" (to make), "kılmak" (to render), "eylemek" (to make) and "buyurmak" (to order). Together with a preceding nominal, these auxiliary verbs behave as a finite verb. The verb phrase is a construction which has its own meaning, which can be idiomatic or relatively similar to that of its components. These MWEs can be easily detected using morphosyntactic information such as the existence of an auxiliary verb at the end of a verb phrase. Some examples are: "aşık olmak" (to fall in love), "sinir etmek" (to aggravate), "veda etmek" (to bid farewell), "yemek yapmak" (to cook), "geçersiz kılmak" (to revoke), "emir buyurmak" (to give an order). However, not every construction with the aforementioned auxiliary verbs falls under this category. For example, MWEs like "aforoz etmek" (to excommunicate) and "ah etmek" (to sigh) are considered idiomatic expressions and are handled under that category.

2.5. Compound MWEs Constructed with Determiners

This category consists of compounds having at least one determiner component. The compounds "her şey" (everything), "şu an" (now) and "bir daha" (again/never) may be given as examples for this category. Differing from the previous compound MWE categories, MWEs of this category may be used in different roles (nominal, adjectival or adverbial) in a sentence.
2.6. Conjunction MWEs

Conjunction MWEs are a sort of transition phrase and are used to concatenate two sentences. Some examples of this category are the following: "bu arada" (by the way), "bu yüzden" (therefore), "o halde" (then), "bu sebeple" (for this reason), etc. While exhibiting some semantic flexibility, the components of MWEs in this category largely retain their original meaning. This category excludes constructions formed by the addition of an enclitic intensifier such as "de", "ise", "ki" (e.g., "öyle ki" (so that), "ya da" (or)).

2.7. Formulaic Expression MWEs

MWEs in this category satisfy the following semantic and syntactic conditions. As the semantic condition, the MWE should carry the meaning of well wishing or gratitude. As the syntactic condition, the MWE is an independent clause, mostly with an elided verb implied to be in a subjunctive mood. Some examples are: "Ellerine sağlık (olsun)" (May God bless your hands), "Görüşmek üzere" (See you soon), "Hoşça kal" (Goodbye). MWEs in this category may rarely resemble light verb constructions that also carry a sense of gratitude, such as "teşekkür etmek" (to thank) and "rica etmek" (to request).

2.8. Idiomatic Expression MWEs

Idiomatic expressions are MWEs with non-compositional meanings; i.e., the meaning of the MWE differs from the literal meaning of its components. For example: "etekleri zil çalmak" (to be very happy, or lit. the bells on the skirt ring), "gemi azıya almak" (to get out of control, or lit. to scratch the bit with grinders), etc. This type of MWE is quite challenging for MWE identification due to the ambiguity between idiomatic and literal use. To give some examples: "ayvayı yemek" (to be in a worrisome and bad situation, or lit. to eat the quince) and "ayağa kalkmak" (to protest, or lit. to stand up). In these cases, there is no morphosyntactic difference between the two uses of the word group as an idiom or as an ordinary phrase carrying its literal meaning; hence it can be difficult to detect the MWE even using contextual information.

2.9. Simile Expression MWEs

Similes are expressions comparing two things, in an often striking manner, using a connecting word (e.g., the word "gibi" (like) in Turkish). We include under this category not every comparison but only those in frequent use. The syntactic construction has two main parts: the figurative part and a post-positional particle, which is only the word "gibi" (like/alike). Here are some examples: "Agop'un kazı gibi" (voraciously), "damdan düşer gibi" (out of the blue), "avcunun içi gibi" (well known), "kedinin ciğere baktığı gibi" (anxiously), etc.

2.10. Proverb MWEs

Proverbs are idiomatic and frozen sentences [20] with no words changing or undergoing inflection. Consequently, this category can be considered the easiest one to identify as an MWE. They often describe some observation or experience with didactic intent. Some examples are given below:

• "Damlaya damlaya göl olur."
lit. (By dribbling) (a lake) (composes).
(Many a little makes a mickle.)

• "Güneş balçıkla sıvanmaz."
lit. (The sun) (with mud) (can not be covered).
(The truth can not be hidden.)

• "Yalancının mumu yatsıya kadar yanar."
lit. (The candle of the liar) (until isha) (burns).
(A lie does not last long.)

2.11. Named Entities

In our annotation we consider a Named Entity to be a set of tokens denoting some unique entity in the real world. Their syntactic patterns and semantic properties are fixed, and they are not necessarily multiword expressions. Since most of the time they consist of two or more words, they are also treated as an MWE category. Named entities include 8 subcategories, namely the ENAMEX types (Person, Location and Organization Names), the TIMEX types (Date and Time) and the NUMEX types (Percentage, Monetary Expressions and Miscellaneous Numerical Expressions). We follow the MUC-6 [21] guidelines for our named entity definitions.

Person

This tag denotes persons referred to by name, and excludes any titles or alternate references other than the name of the person in question. Examples: "Başbakan Turgut Özal" (Prime Minister Turgut Özal), "Maliye Bakanı Ali Babacan" (Finance Minister Ali Babacan).

Location

Denotes the proper name of a location. For example: "Amerika Birleşik Devletleri'nden mektup geldi." (A letter came from the United States of America.)

Organization

This subcategory is used for the name or the group of names of an organization, as in "Birleşmiş Milletler kararı uyguladı." (The United Nations enforced the judgment.)

Date

Expresses an absolute date. As an example: "Doğum tarihi 25 Temmuz 1987'di." (Her birth date is the 25th of July, 1987.)

1. Excluding light verb constructions, which are also a special type of verbal MWE, collected under a separate category.
Time

In this category, the named entity states an absolute time. Examples are: "Saat 6:30'da film başlıyor." (The film starts at 6:30.), "Sınavı bugün 10:30'daymış." (Her exam is today at 10:30.)

Percentage

This category is used to represent percentage information, e.g., "Devrelerin yüzde yirmisi arızalı." (Twenty percent of the circuits are defective.), "Adayların yüzde sekseni sınavdan kaldı." (Eighty percent of the candidates have failed the examination.)

Money

For this category, the word group denotes an expression of money or monetary value. An example is: "O kitaba altmış lira verdim." (I paid sixty liras for that book.)

Miscellaneous Number

We have diverged from the MUC-6 guidelines in this tag, and marked cardinal numbers with their own named entity tag. To give an example: "Altı yüz bin araba satılacak." (Six hundred thousand cars will be sold.)

3. Annotation

The annotation process was carried out in two stages on both treebanks, with two annotators carrying out both stages on each treebank. The stages are as follows:

• The annotation of NE categories
• The annotation of MWE categories except Named Entities

3.1. NE Annotation

In the NE annotation process we annotated the entities according to the categories described in the previous section. Figure 2 shows an example dependency tree containing an organization named entity. In our annotation we largely followed the MUC-6 [21] guidelines, with the addition of a single extra category for miscellaneous numerical expressions. The MUC-6 guidelines establish a standard for marking plain text sentences with XML tags; however, we annotated sentences in CoNLL format, in which morphological information and dependency relations are marked. We added two extra columns to the data, one marking the type of the named entity, and another marking possible following items in a collocative named entity. This way of annotating the named entities is particularly well suited to Turkish, as named entities tend to be adjacent and their dependency relations are overwhelmingly organized left to right from dependent to head. Inflectional suffixes are excluded from the named entity in the plain text format marked with XML tags, but the entire token in the CoNLL file is marked as a part of the named entity. As the lemma is given for each CoNLL token, this does not result in a loss of data. Figure 1 shows an example CoNLL annotation for the sentence "Ben Arçelik'e sordum 31 Aralık'a kadarmış." (I asked Arçelik, it's until December 31st.), which exemplifies such a case on the word "Aralık" (December) inflected with a dative case marker.

Table 1 shows the numbers of NE categories in the two treebanks. We annotated IMST [13] with detailed NE types for the first time; however, IWT [14] had already been annotated for MWEs in a recent study [22]. This made it possible to calculate the Cohen's Kappa coefficient2 [23] in order to evaluate the inter-annotator agreement between the current and the previous annotation [22]. From the scores it is seen that there is sufficient agreement between our annotator and the previous annotator.

TABLE 1. THE NUMBERS OF TYPES OF NES AND THE KAPPA COEFFICIENTS

NE Type        IMST   IWT An.-1   IWT An.-2   Kappa Co.   Total
Person         1071      385         426        0.88       1497
Organization    418      401         503        0.64        921
Location        491      260         274        0.79        765
Money            54       45          48        0.98        102
Percentage       44        8           7        0.99         51
Misc. Number    427       59         317        0.87        744
Date            106       10          76        0.95        182
Time             20        -          17        -            37
Total          2631     1168        1668        -          4299

3.2. MWE Annotation

During the original dependency annotations of both treebanks, the annotators were asked to annotate the interrelations of a multiword expression with a single catch-all dependency type (also named MWE) [12]. But the annotation was limited to only this dependency relation, without any extra information on the types of the MWEs. In this work, we refine the previous annotations by inspecting all the treebank sentences and reannotating all the MWEs with finer categories.

We carried out the annotation with the participation of the linguistics student who oversaw the categorization of MWEs. The annotation of MWEs was performed in a number of iterations of an annotation-and-check cycle. We automatically checked the annotation for dependency-related errors, and manually examined cases that were marked in previous annotations ([12], [13], [24]) but not in ours, and vice versa. This iteration was carried out until all problematic cases were handled. The first MWE annotation of the treebanks is complete. We plan to have the MWEs annotated again by

2. During the calculation of the Kappa coefficient we saw that with an unmodified Kappa value the agreement rate was too high to be meaningful. We used a weight value of 0.01 for the number of tokens both annotators did not annotate, resulting in a much more meaningful statistic.
Ben Arçelik'e sordum 31 Aralık'a kadarmış.
I Arçelik.DAT ask.PAST.1-SG 31 December.DAT until.EVID.3-SG
(I asked Arçelik, it's until December 31st.)

ID   Surface Form   Dep. Head   Dep. Relation   NE Type                Next Word
1    Ben            3           SUBJECT
2    Arçelik'e      3           MODIFIER        ORGANIZATION.ENAMEX
3    sordum         8           COORDINATION
4    31             5           MWE             DATE.TIMEX             5
5    Aralık'a       8           MODIFIER        DATE.TIMEX
6    _              7           DERIV
7    _              8           DERIV
8    kadarmış       0           PREDICATE
9    .              8           PUNCTUATION

Figure 1. The annotation format of an example sentence
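The two extra columns described above can be consumed by splitting each row of the extended CoNLL data. The following is a minimal sketch; the tab-separated field order mirrors Figure 1 but is an assumption about the released files, not their exact format:

```python
# Parse rows of the extended CoNLL-like format: ID, surface form,
# dependency head, dependency relation, NE type (may be empty).
# The layout below is illustrative, not the exact treebank file format.

def read_tokens(lines):
    tokens = []
    for line in lines:
        idx, form, head, rel, ne = line.rstrip("\n").split("\t")
        tokens.append({"id": int(idx), "form": form, "head": int(head),
                       "rel": rel, "ne": ne or None})
    return tokens

rows = [
    "1\tBen\t3\tSUBJECT\t",
    "2\tArçelik'e\t3\tMODIFIER\tORGANIZATION.ENAMEX",
    "4\t31\t5\tMWE\tDATE.TIMEX",
]
toks = read_tokens(rows)
print([t["ne"] for t in toks])  # → [None, "ORGANIZATION.ENAMEX", "DATE.TIMEX"]
```

An empty NE field is mapped to `None`, so downstream code can distinguish unannotated tokens from NE-internal ones.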
Maliye Bakanlığı konuyla ilgili açıklama yaptı.
Finance Ministry.3-POSS subject.INS related statement make.PAST.3-SG
(The Ministry of Finance made a statement on the issue.)

[Dependency tree (arcs as dependent -RELATION-> head):
Maliye/Noun -MWE.NE.ORG-> Bakanlığı/Noun -SUBJECT-> yaptı/Verb;
konuyla/Noun -ARGUMENT-> ilgili/Adj -MODIFIER-> açıklama/Noun -OBJECT-> yaptı/Verb (PREDICATE);
./Punc -PUNCTUATION-> yaptı/Verb]

Figure 2. An example dependency tree showing an organization named entity
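Since NE-internal arcs run left to right from dependent to head, a multiword named-entity span can be recovered by following consecutive MWE-labelled arcs. A minimal sketch over the tree of Figure 2; the list-based head/relation encoding is illustrative, not the treebank format:

```python
# Recover multiword named-entity spans from a dependency tree in which
# NE-internal arcs point from a token to the next token and carry an
# MWE.NE label, as in Figure 2. The encoding below is illustrative.

tokens = ["Maliye", "Bakanlığı", "konuyla", "ilgili", "açıklama", "yaptı", "."]
heads = [2, 6, 4, 5, 6, 0, 6]          # 1-based head indices, 0 = root
rels = ["MWE.NE.ORG", "SUBJECT", "ARGUMENT", "MODIFIER",
        "OBJECT", "PREDICATE", "PUNCTUATION"]

def ne_spans(tokens, heads, rels):
    """Group token i with token i+1 whenever the arc i -> i+1 is NE-internal."""
    spans, current = [], []
    for i, rel in enumerate(rels):
        if rel.startswith("MWE.NE") and heads[i] == i + 2:  # arc to next token
            current.append(tokens[i])
        elif current:
            current.append(tokens[i])       # last token of the chain
            spans.append(" ".join(current))
            current = []
    return spans

print(ne_spans(tokens, heads, rels))   # → ['Maliye Bakanlığı']
```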
another linguistics student, and calculate the agreement as in the named entity annotation.

TABLE 2. THE DISTRIBUTION OF THE NUMBERS OF CATEGORIES OF MWES IN TWO TURKISH TREEBANKS

MWE Type                 IMST    IWT    Total
Named Entities            910    439     1349
Compound                  525    545     1070
Conjunction                32     41       73
Duplication               209    130      339
Formulaic Expression       22    221      243
Idiomatic Expression      773    598     1371
Lightverb Construction    537    648     1185
Nominal Compound          136    156      292
Proverb                     3      4        7
Simile Expression          12      7       19
Total                    3159   2789     5948

Table 2 gives the distribution of MWE categories in the two treebanks. As seen in Table 2, one of the biggest categories of MWE is named entities, which means that the performance of a Named Entity Recognition system used in MWE extraction will substantially affect the performance of the overall system. The other large category is idiomatic expressions, which makes MWE extraction a challenging issue, as we are obliged to deal with the particular challenges of idiomatic expressions to build a high-performance system.
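The agreement scores in Table 1 use Cohen's kappa with the both-unannotated count down-weighted (see footnote 2). The sketch below shows one plausible reading of that adjustment for a single binary annotate/don't-annotate decision; the counts are made up, and only the weight of 0.01 follows the footnote:

```python
# Cohen's kappa over annotated / not-annotated decisions, where the
# (not-annotated, not-annotated) cell is multiplied by a small weight
# so that the huge number of trivially-agreeing tokens does not
# dominate the statistic. Counts below are illustrative only.

def weighted_kappa(both, only_a, only_b, neither, w=0.01):
    neither *= w                       # down-weight trivial agreement
    total = both + only_a + only_b + neither
    p_obs = (both + neither) / total   # observed agreement
    p_a = (both + only_a) / total      # annotator A's "annotate" rate
    p_b = (both + only_b) / total      # annotator B's "annotate" rate
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

print(round(weighted_kappa(380, 5, 46, 40000), 2))  # → 0.88
```

Without the weight, the many tokens neither annotator marks would push the observed and expected agreement both toward 1, making kappa uninformative.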
4. Conclusion
In this paper, we proposed a basis for Turkish MWE
and NE categorization to be used as a working guide in
annotation. The categorization framework, which was prepared by taking into account the idiosyncratic features
of Turkish, consists of 11 categories of MWEs. We
performed annotations on two Turkish treebanks using the
proposed framework. We annotated the categories of
MWEs as the first annotation task and the annotation of
NEs and their subcategories as the second on the
Turkish treebanks. For the annotation tasks, we enlisted
the aid of linguistics researchers with expertise in the
morphosyntactic and semantic features of Turkish.
The categorization framework that we defined in this
study and the annotated treebanks will hopefully be used
in future studies in the annotation and identification of
MWEs in Turkish.
Acknowledgments

We would like to acknowledge that this work is part of a research project entitled "Parsing Web 2.0 Sentences" subsidized by TUBITAK (Turkish Scientific and Technological Research Council) 1001 program (grant number 112E276) and part of the ICT COST Action IC1207 PARSEME (PARSing and Multi-word Expressions).

References

[1] I. A. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger, "Multiword expressions: A pain in the neck for NLP," in Computational Linguistics and Intelligent Text Processing. Springer, 2002, pp. 1–15.

[2] I. Arnon and N. Snider, "More than words: Frequency effects for multi-word phrases," Journal of Memory and Language, vol. 62, no. 1, pp. 67–82, 2010.

[3] C. Ramisch, Multiword Expressions Acquisition, ser. Theory and Applications of Natural Language Processing. Springer, 2015.

[4] Proceedings of the 11th Workshop on Multiword Expressions. Denver, Colorado: Association for Computational Linguistics, June 2015. [Online]. Available: http://www.aclweb.org/anthology/W15-09

[5] V. Kordoni, M. Egg, A. Savary, E. Wehrli, and S. Evert, Eds., Proceedings of the 10th Workshop on Multiword Expressions (MWE). Gothenburg, Sweden: Association for Computational Linguistics, April 2014. [Online]. Available: http://www.aclweb.org/anthology/W14-08

[6] A. Savary, M. Sailer, Y. Parmentier, M. Rosner, V. Rosén, A. Przepiórkowski, C. Krstev, V. Vincze, B. Wójtowicz, G. S. Losnegaard et al., "PARSEME – parsing and multiword expressions within a European multilingual network," in 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2015), 2015.

[7] V. Rosén, G. S. Losnegaard, K. De Smedt, E. Bejček, A. Savary, A. Przepiórkowski, P. Osenova, and V. B. Mititelu, "A survey of multiword expressions in treebanks," in International Workshop on Treebanks and Linguistic Theories (TLT14), 2015, p. 179.

[8] E. Bejček and P. Straňák, "Annotation of multiword expressions in the Prague dependency treebank," Language Resources and Evaluation, vol. 44, no. 1–2, pp. 7–21, 2010.

[9] A. Abeillé, L. Clément, and F. Toussenel, "Building a treebank for French," in Treebanks. Springer, 2003, pp. 165–187.

[10] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: The Penn treebank," Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.

[11] G. Eryiğit, T. İlbay, and O. A. Can, "Multiword expressions in statistical dependency parsing," in Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages (IWPT), Dublin, Ireland, October 2011, pp. 45–55. [Online]. Available: http://www.aclweb.org/W11-3806

[12] G. Eryiğit, K. Adalı, D. Torunoğlu-Selamet, U. Sulubacak, and T. Pamay, Proceedings of the 11th Workshop on Multiword Expressions. Association for Computational Linguistics, 2015, ch. Annotation and Extraction of Multiword Expressions in Turkish Treebanks, pp. 70–76. [Online]. Available: http://aclweb.org/anthology/W15-0912

[13] U. Sulubacak and G. Eryiğit, "IMST: A revisited Turkish dependency treebank," in TurCLing 2016, The First International Conference on Turkic Computational Linguistics at CICLing 2016, Konya, Turkey, April 2016.

[14] T. Pamay, U. Sulubacak, D. Torunoğlu-Selamet, and G. Eryiğit, "The annotation process of the ITU web treebank," in The 9th Linguistic Annotation Workshop held in conjunction with NAACL 2015, 2015, p. 95.

[15] S. K. Metin and B. Karaoğlan, "Collocation extraction in Turkish texts using statistical methods," in Advances in Natural Language Processing. Springer, 2010, pp. 238–249.

[16] K. Oflazer, Ö. Çetinoğlu, and B. Say, "Integrating morphology with multi-word expression processing in Turkish," in Proceedings of the Workshop on Multiword Expressions: Integrating Processing. Association for Computational Linguistics, 2004, pp. 64–71.

[17] G. Eryiğit, J. Nivre, and K. Oflazer, "Dependency parsing of Turkish," Computational Linguistics, vol. 34, no. 3, pp. 357–389, 2008.

[18] A. Savary, "Computational inflection of multi-word units: A contrastive study of lexical approaches," Linguistic Issues in Language Technology, vol. 1, no. 2, 2008.

[19] A. Göksel and C. Kerslake, Turkish: A Comprehensive Grammar, ser. Comprehensive Grammars. Routledge, 2005.

[20] J. Baptista, A. Correia, and G. Fernandes, "Frozen sentences of Portuguese: Formal descriptions for NLP," in Proceedings of the Workshop on Multiword Expressions: Integrating Processing, ser. MWE '04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004, pp. 72–79. [Online]. Available: http://dl.acm.org/citation.cfm?id=1613186.1613196

[21] R. Grishman, "The NYU system for MUC-6 or where's the syntax?" in Proceedings of the 6th Conference on Message Understanding. Association for Computational Linguistics, 1995, pp. 167–175.

[22] G. A. Şeker and G. Eryiğit, "Initial explorations on using CRFs for Turkish named entity recognition," in Proceedings of COLING 2012, Mumbai, India, 8–15 December 2012.

[23] J. Cohen, "Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit," Psychological Bulletin, vol. 70, no. 4, p. 213, 1968.

[24] U. Sulubacak and G. Eryiğit, "A redefined Turkish dependency grammar and its implementations: A new Turkish web treebank & the revised Turkish treebank," 2014, under review.
An Overview of Resources Available for Turkish Natural Language Processing
Applications
Tunga GÜNGÖR
Computer Engineering, Boğaziçi University
TurcLing 2016 Keynote Speaker