Proceedings of The First International Conference on Turkic Computational Linguistics
TurCLing 2016
In conjunction with CICLing 2016, the 17th International Conference on Intelligent Text Processing and Computational Linguistics
April 3–9, 2016 • Konya, Turkey
ISBN: 978-605-66422-0-3
turcling.ege.edu.tr

The First International Conference on Turkic Computational Linguistics - TurCLing 2016 - Full Paper Proceedings

CHAIR:
• Bahar Karaoğlan, Ege University

CO-CHAIRS:
• Tarık Kışla, Ege University
• Senem Kumova, İzmir Ekonomi University
• Hatem Haddad, Mevlana University

PROGRAM COMMITTEE:
• Yeşim Aksan, Mersin University
• Adil Alpkoçak, Dokuz Eylül University
• Ildar Batyrshin, Instituto Politécnico Nacional
• Cem Bozşahin, Middle East Technical University
• Fazlı Can, Bilkent University
• İlyas Çiçekli, Hacettepe University
• Gülşen Eryiğit, Istanbul Technical University
• Alexander Gelbukh, Instituto Politécnico Nacional
• Tunga Güngör, Bogazici University
• Hatem Haddad, Mevlana University
• Bahar Karaoğlan, Ege University
• Tarık Kışla, Ege University
• Senem Kumova Metin, İzmir Ekonomi University
• Altynbek Sharipbayev, L.N. Gumilyov Eurasian National University
• Dzhavdet Suleymanov, Tatarstan Academy of Sciences
• Jonathan North Washington, Indiana University

KEYNOTE SPEAKER:
Prof. Dr. Tunga Güngör, Bogazici University

EDITORIAL

The First International Conference on Turkic Computational Linguistics is held jointly with CICLing 2016 (the 17th International Conference on Intelligent Text Processing and Computational Linguistics) at Mevlana University in Konya.
All computational linguistics research on Turkic languages, such as Turkish, Kazakh, Azerbaijani, Uyghur, Tatar, Kyrgyz, Turkmen, Gagauz, Bashkir, Nogay, Uzbek, Chuvash, Khakas, and Tuvan, among others, is within the scope of this conference. The conference aims to serve as a forum for studies on Turkish and other Turkic languages, and to gather researchers in the field to discuss common long-term goals and to promote knowledge sharing, resource sharing, and collaboration between groups.

PROCEEDINGS EDITORS:
Bahar Karaoğlan, Ege University
Tarık Kışla, Ege University
Senem Kumova, İzmir Ekonomi University
2016

Table of Contents

A Revisited Turkish Dependency Treebank ....................................................................... 1-6
Umut Sulubacak, Gülşen Eryiğit and Tuğba Pamay
Exploring Spelling Correction Approaches for Turkish ..................................................... 7-11
Dilara Torunoğlu Selamet, Eren Bekar, Tugay İlbay and Gülşen Eryiğit
Framing of Verbs for Turkish PropBank ............................................................................ 12-17
Gözde Gül Sahin
A Free/Open-Source Hybrid Morphological Disambiguation Tool for Kazakh ................ 18-26
Zhenisbek Assylbekov, Jonathan N. Washington, Francis M. Tyers, Assulan Nurkas, Aida Sundetova, Aidana Karibayeva, Balzhan Abduali and Dina Amirova
A Methodology for Multi-word Unit Extraction in Turkish ............................................... 27-31
Ümit Mersinli and Yeşim Aksan
The Turkish National Corpus (TNC): Comparing the Architectures of v1 and v2 ............ 32-37
Yeşim Aksan, S. Ayşe Özel, Hakan Yılmazer and Umut Demirhan
(When) Do We Need Inflectional Groups?
........................................................................ 38-43
Çağrı Çöltekin
Allomorphs and Binary Transitions Reduce Sparsity in Turkish Semi-supervised Morphological Processing .................................................................................................. 44-49
Serkan Kumyol, Burcu Can and Cem Bozşahin
Automatic Detection Of The Type Of "Chunks" In Extracting Chunker Translation Rules From Parallel Corpora ........................................................................... 50-54
Aida Sundetova and Ualsher Tukeyev
Simplification of Turkish Sentences ................................................................................... 55-59
Dilara Torunoğlu-Selamet, Tuğba Pamay, Gülşen Eryiğit
Comprehensive Annotation of Multiword Expressions in Turkish .................................... 60-66
Kübra Adalı, Tutkum Dinç, Memduh Gökırmak, Gülşen Eryiğit
An Overview of Resources Available for Turkish Natural Language Processing Applications ........................................................................................................................ 67-84
Tunga Güngör

IMST: A Revisited Turkish Dependency Treebank
Umut Sulubacak∗, Tuğba Pamay†, Gülşen Eryiğit‡
Department of Computer Engineering, Istanbul Technical University, Istanbul, 34469, Turkey.
Email: [∗sulubacak, †pamay, ‡gulsen.cebiroglu]@itu.edu.tr

Abstract—In this paper, we present a critical analysis of the dependency annotation framework used in the METU-Sabancı Treebank (MST), and propose new annotation schemes that would alleviate the issues we have identified. We then describe our attempt at reannotating the treebank from the ground up using the proposed schemes, and compare the consistencies of the two versions via cross-validation using a dependency parser. According to our experiments, the reannotated version of the original treebank, which we call the ITU-METU-Sabancı Treebank (IMST), demonstrates a labeled attachment score of 75.3% and an unlabeled attachment score of 83.7%, surpassing the corresponding scores of 65.9% and 76.0% for MST by a very large margin.

I. INTRODUCTION

Despite the considerable interest in Turkish syntax, parsing performances have not seen a major improvement in a long time, as evidenced by several recent case studies [6], [11], [17], [18], [30], [35]. Many studies concentrate on specific computational or linguistic issues, fine-tuning certain aspects of their parsers and leaving the rest untouched. As a result, although many still demonstrate local improvements, they fail to make any pivotal progress. As certain issues remain in focus and others fall outside the spotlight, a considerable portion of the field remains uncharted. It is likely that issues outside the domain of well-researched cases create a bottleneck for syntactic parsing. Considering that virtually all state-of-the-art parsers make use of supervised learning from human-annotated corpora, it is entirely possible that these issues stem from imperfections in the training corpora.

The METU-Sabancı Turkish Treebank (MST) [29] has proved to be an invaluable resource over the years, and has been utilized by almost every Turkish dependency parser to date. However, its dependency grammar has been criticized on occasion from various standpoints, and it is known to contain a large number of annotation inconsistencies, as attested in previous work [5], [12]. At present, there is no other available resource¹ for Turkish that would be equivalent or an alternative to MST. This further conceals any issues with the corpus that might otherwise emerge. In light of these considerations, it could be worthwhile to take a detour from specific case studies and directly tackle the corpus, which decidedly has ample room for improvement. Such an effort is also promising for alleviating certain problems commonly attributed to the corpus, such as excessive parsing difficulty [4] and cross-parser instability [25]. Although engaging in a tedious investigation in order to recondition a corpus may not seem cost-effective, previous successful attempts for other prominent languages [2], [22], [27], [39] provide strong motivation for the effort.

In this paper, we propose changes to certain dependency schemes, leading to an updated annotation framework for Turkish. We thereby aim to relieve some of the known difficulties in the current framework, as well as to reduce the load on human annotators and thus alleviate manual annotation errors. We also present the ITU-METU-Sabancı Treebank (IMST), a new version of MST reannotated from the ground up following this new framework. We then make empirical evaluations on our new treebank and report our results. The paper is structured as follows: Section 2 briefly outlines Turkish and the dependency formalism, Section 3 explains the problems and the proposed solutions, Section 4 introduces the new treebank, Section 5 describes the experiments, and finally, Section 6 presents the conclusion.

¹ There is the ITU Validation Set [14], [15], [20], but it is a fairly small corpus containing only 300 sentences, and it is meant as a validation or test set for supervised learners; it is therefore not suitable for training data-driven models.

II. TURKISH AND THE DEPENDENCY FORMALISM

Though the concept of dependencies has existed since some of the earliest recorded grammars [32], the modern dependency grammar is commonly attributed to Tesnière [37]. The formalism has seen a great deal of attention and extensive use in computational linguistics in recent years. Essentially, a dependency grammar defines a set of practical rules on how to utilize dependencies to model the syntax of a sentence.

[Fig. 1: An example dependency tree for a sentence in Turkish ("Kırmızı arabada +ydı") and English ("She was in the red car"). Note that the definite article does not occur in the Turkish sentence, and the English dependency to the preposition "in" is analogous to the Turkish locative suffix "-da".]

In this work, as in the majority of modern syntactic studies for Turkish, we adopt the dependency formalism. The formalism represents syntactic information as sets of directed binary relations (dependencies) between tokens (Fig. 1). Each dependency is defined between a governing token (the head) and a subordinate token that modifies it (the dependent), and is represented by a labeled arc from the head to the dependent. The label assigned to a dependency indicates the type of the relation, called the dependency type. For a recent discussion of the dependency formalism, the interested reader may refer to [23].

Turkish is a classical example of an agglutinative, morphologically rich language incorporating a large number of productive derivational suffixes. For example, the suffix 'ydı' ('[s/he] was') in Fig. 1 is a third-person singular past copula attached to the stem 'araba' ('car'). As different portions of such derived words may correspond to several words in a weakly inflected language such as English, Turkish sentences often comprise relatively few, highly inflected words. In order to properly analyze the syntax of a Turkish sentence, words are divided at derivational boundaries into morphosyntactic units called inflectional groups (IGs). This formalism establishes the IGs comprising the sentence, rather than orthographic words, as the tokens. Words with multiple IGs are quite prevalent in Turkish; in fact, it is not unusual to find words with as many as four or five IGs. Having been practiced in many influential works [18], [19], [21], [28], their usage has become the de facto standard for parsing Turkish.
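To make the formalism concrete, the following is a small illustrative sketch (an editorial addition, not part of the paper): the tokens, head indices, and labels are a hand-made approximation of the Fig. 1 analysis, in which the word "arabada+ydı" contributes two IG tokens joined by a DERIV arc.

```python
# A minimal sketch of the dependency formalism described above:
# each token has exactly one head (0 = artificial root) and a label.
# Tokens are inflectional groups (IGs), so a single orthographic word
# may contribute several tokens connected by DERIV arcs.

tokens = ["Kırmızı", "araba+da", "+ydı"]    # IGs of "Kırmızı arabada+ydı"
heads  = [2, 3, 0]                          # 1-based head index per token
labels = ["MODIFIER", "DERIV", "PREDICATE"]

def arcs(tokens, heads, labels):
    """Yield (head_word, label, dependent_word) triples."""
    for i, (h, lab) in enumerate(zip(heads, labels)):
        head_word = "ROOT" if h == 0 else tokens[h - 1]
        yield head_word, lab, tokens[i]

for head, lab, dep in arcs(tokens, heads, labels):
    print(f"{head} --{lab}--> {dep}")
```

The single-head constraint built into this representation (one entry in `heads` per token) is exactly the restriction that the deep dependencies discussed in Section IV-A violate.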
III. PROBLEMS AND PROPOSED SOLUTIONS

In designing a dependency annotation framework, it is essential to have a clear definition of the dependency relations and a set of conventions on when to use which relation. Although dependency relations would ideally be expressive, exclusive, coherent and concise, there are often trade-offs between these properties. As such, it becomes a challenge to balance a grammar around them. Considering the drawbacks of the original MST, we reason that prioritizing clarity and aiming for a minimal dependency grammar makes better sense in mitigating inconsistency and obscurity.

As the foundation for our work, we carried out an in-depth manual error analysis of the original MST. Among the most frequent cases we noted were inconsistently or erratically annotated linguistic constructions, as well as standard annotation methods that mandated the usage of certain particles that are optional in informal language. The subsections below present our attempt to loosely categorize the questionable cases we encountered. For each issue, we provide example cases, discuss our own standpoints and describe our proposed annotation schemes. In the process of settling on local annotation schemes, we investigated the corresponding methods followed in some other prominent frameworks [3], [8], [9], [10], [27], [38] and reviewed previous work on the subject [24], [33], [34]. Through all these, we laid strong foundations for our decisions.

Throughout the rest of the section, we regularly refer to our proposed annotation framework, though a description of the whole framework is not provided in this article. The full list of the proposed dependency types and their usages is provided in a separate annotation manual [36].

A. Semantic Incoherence

In the original framework, some dependency relations were used in a way that contradicts their semantic connotations. Such cases occurred especially in less prevalent secondary usages of common dependency types. Though it might have seemed counter-productive to handle such cases under exclusive dependency types or another encompassing type, we maintain that the incoherence is generally less favorable, as it would confuse the associations drawn by annotators. Even though this phenomenon was not very common, it occurred frequently enough to warrant notice.

[Fig. 2: The OBJECT relation used for the object of the main verb ("Bir örnek yazdı", "S/he wrote an example.") (top) and for an adpositional phrase argument ("Kalem ile yazdı", "S/he wrote with a pen.") (bottom).]

One example is that adpositional phrases were connected via the dependency label OBJECT (Fig. 2). Although dependents of adpositional phrases are sometimes called adpositional objects, they are in fact arguments of the adpositional head and are unrelated to sentence (or clausal) objects. Not only was it not immediately obvious that they should be regarded as objects, but this annotation method also confused parsers and made the prediction of objects difficult. We assign these the new dependency label ARGUMENT, along with the rest of the phrasal arguments.

Another case was in coordination structures, where the coordinating conjunction was connected to the succeeding token with the dependency label COORDINATION (Fig. 3). This constitutes a counter-intuitive scenario which semantically implies that the token is in coordination with the conjunction itself, whereas the tokens should be in coordination with each other, as also attested in [1]. We make it so that tokens are connected directly to the next token in coordination, while preserving the COORDINATION label. This approach was also previously shown to improve parsing performance in [35], which applied automatic conversion routines to map coordination structures to different styles and compared local performances.

[Fig. 3: An example ("Barış ve sevgi", 'peace and love') showing the original (top) and the proposed (bottom) annotation schemes for coordination structures.]

B. Hierarchy and Overlap

In the original framework, certain dependency relations fell within the scope of others. As the grammar did not enact a dependency hierarchy to exploit granularity in dependency types, this also had a negative effect. The immediate impact was on annotators, for whom it occasionally became arbitrary which dependency type to use. Parsing frameworks also suffered from increased entropy in prediction. Yet another impact was on evaluation, as such cases caused some sound dependency annotations to be considered incorrect, because any label other than the one in the gold standard would be a mismatch.

An example of this (Fig. 4) is the sub-type ETOL, which comprised a group of multiword expressions incorporating certain auxiliary verbs, otherwise denoted by the label COLLOCATION. We eliminate such types altogether.

[Fig. 4: Two similar idiomatic expressions, "Söz ettim" ("I mentioned.") and "Söz verdim" ("I promised."), indicated by the dependency relations ETOL and COLLOCATION.]

There were also some cases where a dependency relation overlapped with another in usage, giving way to confusion. This was most obvious between the label MODIFIER and the X.ADJUNCT labels for every noun declension (such as DATIVE.ADJUNCT), which are also effectively modifiers. For instance, while generic adjuncts that did not fall into a specific category would use a MODIFIER label and a regular nominal adjunct in the locative case would use the label LOCATIVE.ADJUNCT, certain other adjuncts, which were grammatically nouns in the locative case, would still be assigned a MODIFIER label due to semantic concerns. To address this complication, we preserve only the MODIFIER label (Fig. 5) and eliminate the X.ADJUNCT labels, which are at any rate reproducible using morphological information.

[Fig. 5: Nominal adjuncts serving as modifiers were mostly indicated by different X.ADJUNCT labels according to their cases, as in "İnsanı insana insanla insanca anlat+ma sanatı" ("The art of relating humans to humans, with humans, like humans.").]

C. Ambiguous Annotation

For certain annotation schemes, the framework clearly defined what the head should be, but not the dependency relation (or vice versa). This encouraged arbitrary annotation, or else annotation conventions that were quite difficult for annotators to memorize, which impaired annotation consistency. Although at times this was due to linguistic relations not properly explained by any dependency label, it was mostly observed in cases of ambiguity, when a relation could possibly be explained by more than one label. For such cases, we introduce new dependency types where the involved dependencies are common enough to represent a group.

An instance of this phenomenon was seen in phrasal arguments, which were not precisely covered under any dependency type, and were variously assigned MODIFIER or OBJECT labels. We introduce the new dependency label ARGUMENT for all cases where exactly one argument is syntactically required to modify a head, such as in adpositional phrases, in contrast to modifiers, of which a head could have more than one, or none at all.

D. Optional Annotation

In the original framework, only certain types of punctuation (usually conjunctive punctuation and terminal periods) had dependency types associated with them, and the rest were allowed to pass without any head (Fig. 6). These tokens were connected to an arbitrary head and assigned the label NOTCONNECTED. This indicated that the dependency grammar essentially did not enforce dependencies for all tokens in a sentence, which is required by most dependency parsers, leading to complexity in evaluation. Furthermore, since NOTCONNECTED was computationally considered a regular dependency type in parsing, learning performances were also indirectly affected. To address this issue, we introduce the new label PUNCTUATION and standardize the annotation scheme for all types of punctuation, as well as eliminating the support for optionality in the grammar. In this approach, all punctuation should always be connected with the PUNCTUATION relation to the last non-punctuation token occurring before it. Punctuation that begins a sentence should be connected to the sentence's root node instead (Fig. 6).

[Fig. 6: Certain kinds of punctuation that were allowed to pass without a head (top) are now covered by the new dependency type PUNCTUATION (bottom), as in '" Özgün " .' ('original').]

E. Reliance on Omissible Tokens

Some annotation schemes required certain tokens to occur in a specific position within the sentence, and could not be properly applied when those tokens were omitted. This prevented regular annotation in case of omission, and caused uncertainty as to how to alternatively mark the relation, which led to annotation inconsistencies. For instance, coordination structures were annotated with a dependency from the first constituent to the coordinating conjunction and another dependency from the coordinating conjunction to the next constituent, which made proper annotation impossible when the coordinating conjunction was omitted. Adverse cases are not uncommon in non-canonical language, most notably web jargon, where some common function words are frequently dropped in favor of brevity. Examples are encountered even in well-typed sentences, caused by less conventional, idiomatic or archaic usages. Therefore, the issue warranted addressing.

Reliance was perhaps most noticeable in terminal periods (Fig. 7), which were essential in marking the main predicate of the sentence. The annotation required the predicate to be connected to the terminal period with the label SENTENCE. This scheme left no option for legitimately omitting periods, as practiced very frequently in non-canonical language. To address this, we make it so that predicates are connected directly to the sentence root with the renamed dependency label PREDICATE, making terminal periods properly optional.

[Fig. 7: Reliance on omissible tokens in the original annotation framework, illustrated with "Çatal bıçak kullanmıyor" ("S/he doesn't use a knife and a fork."). The sentences show a case where annotation is impossible (top), except by the addition of conjunctive and terminal punctuation (middle). The scheme we propose (bottom) is not affected by this.]

IV. THE ITU-METU-SABANCI TREEBANK

In order to have an indication of the impact of our proposed schemes and provide future studies with a new and fresh training corpus, we annotated the entire METU-Sabancı Treebank from the ground up. We call this reannotated corpus the ITU-METU-Sabancı Treebank (IMST). The annotation of IMST was carried out in parallel with the ITU Web Treebank [31], an original corpus of user-generated web data that was released earlier. This section provides details about IMST².

² The treebank underwent some minor revisions before its release and is at version 1.3 at the time of this publication. The latest version will be made available for research purposes at http://tools.nlp.itu.edu.tr/.

For the new corpus, we used an updated version of our ITU Annotation Tool [13]. Five annotators were employed: one linguist and four computer scientists with considerable experience in NLP research. Our annotators were well-versed in Turkish morphology and syntax, and underwent two weeks of supervised training in the new annotation framework before starting on the annotation. Dependency annotation was performed on gold-standard tokens with pre-allocated morphological analyses³, and was completed within a span of two months. Although the annotation started with two annotators for each sentence, our annotators eventually had to work individually on exclusive shares of the data due to budgetary constraints. As a consequence, it was not possible to measure inter-annotator agreement. Nonetheless, after the initial annotation, sentences from both corpora were carefully inspected for inconsistent annotation, and a two-week correction phase followed, which led to the final version.

³ The morphological tags were inherited from a version of MST following a revised morphological annotation framework established in [7], [16].

A. Deep Dependencies

Another detail to mention about the annotation is that we set out to indicate deep (or unbounded) dependencies in IMST. Deep dependencies are secondary dependencies of tokens to other logical heads, often with different dependency relations, in addition to their regular surface dependencies. The annotation of these dependencies violates the restriction that each constituent have a single head, and thereby makes a corpus incompatible with most syntactic parsers without preprocessing. However, deep dependencies are often favored because they function as cues for semantic parsers designed to determine the semantic roles of verbal arguments in a sentence. In IMST, we regularly draw deep dependencies as substitutes for coreference links from zero pronouns, as well as to mark shared modifiers for tokens in coordination.

B. Corpus Statistics

For a proper comparison between MST and IMST, we provide a selection of comparative statistics before describing our syntactic accuracy tests. Table I displays sentence, token and dependency counts for both corpora. Table II shows the distribution of dependencies by dependency relation.

TABLE I: Comparative sentence, token and dependency statistics.

                                 METU-SABANCI TREEBANK    ITU-METU-SABANCI TREEBANK
# Sentences                      5,635                    5,635
# Words                          56,424                   56,424
# Tokens (IG)                    67,403                   63,089
# Single-headed Tokens           67,403 (100.0%)          60,688 (96.2%)
# Multi-headed Tokens            —                        2,401 (3.8%)
# Dependencies (excl. DERIV)     56,424                   59,425
# Dependencies (incl. DERIV)     67,403                   66,090
# Projective Dependencies        66,145 (98.1%)           64,663 (97.8%)
# Non-projective Dependencies    1,258 (1.9%)             1,427 (2.2%)

V. EVALUATION

This section presents the statistical analysis we performed on MST and IMST. Section V-A contains preliminary information about our parsing and evaluation systems. Section V-B shows the test outcome and a brief discussion of the results.

TABLE II: Distribution of the dependency relation labels.
METU-S ABANCI T REEBANK A BLATIVE .A DJUNCT A PPOSITION A RGUMENT C ONJUNCTION C LASSIFIER C OLLOCATION C OORDINATION DATIVE .A DJUNCT D ERIV D ETERMINER E QU.A DJUNCT E TOL F OCUS .PARTICLE I NSTRUMENTAL .A DJUNCT I NTENSIFIER L OCATIVE .A DJUNCT M ODIFIER MWE N EGATIVE .PARTICLE O BJECT P OSSESSOR P REDICATE P UNCTUATION Q UESTION .PARTICLE R ELATIVIZER ROOT S.M ODIFIER S ENTENCE S UBJECT VOCATIVE ( DISCONNECTED TOKENS ) 523 (0.8%) 202 (0.3%) — — 2,050 (3.0%) 73 (0.1%) 2,476 (3.7%) 1,361 (2.0%) 10,979 (16.3%) 1,952 (2.9%) 16 (0.0%) 10 (0.0%) 23 (0.0%) 271 (0.4%) 903 (1.3%) 1,142 (1.7%) 11,690 (17.3%) 2,432 (3.6%) 160 (0.2%) 8,338 (12.4%) 1,516 (2.2%) — — 289 (0.4%) 85 (0.1%) 5,644 (8.4%) 597 (0.9%) 7,261 (10.8%) 4,481 (6.6%) 241 (0.4%) 2,688 (4.0%) As shown in Table I, the number of words and evaluated dependencies (excluding D ERIV) is exactly the same between the two corpora. The slight difference between the dependency counts as seen in Table I is due to the updated morphological analysis framework mentioned earlier in Section IV and the entailed difference in derivational boundaries. The changes in IG bounding should only affect the performance of morphological analysis and have a negligible effect on parsing. Comparing the current LAS of 75.3% for IMST with the corresponding score of 65.9%4 for MST shows that we manage an increase of nearly 10 percentage points. The UAS seems to have improved in a similar way, increasing to 83.7% for IMST and passing the score of 76.0% for MST by a large margin. ITU-METU-S ABANCI T REEBANK — 91 (0.1%) 1,805 (2.7%) 1,360 (2.1%) — — 3,078 (4.7%) — 6,665 (10.1%) 2,180 (3.3%) — — — — 1,070 (1.6%) — 15,516 (23.5%) 3,552 (5.4%) — 5,094 (7.7%) 4,070 (6.2%) 5,741 (8.7%) 10,375 (15.7%) — 129 (0.2%) — — — 5,174 (7.8%) 190 (0.3%) — VI. C ONCLUSION In this article, we initially described the annotation schemes we designed based on the dependency grammar of the METUSabancı Treebank (MST). 
Our new annotation framework incorporates only 16 dependency relation labels in contrast to the 24 labels of the baseline, but features generally clearer and more intuitive dependency types with reduced overlap between each other, hopefully relieving the difficulty of manual annotation without suffering any loss in expressiveness. Afterwards, we presented the ITU-METU-Sabancı Treebank (IMST) as a reannotated version of MST that followed our revised annotation framework. We additionally marked deep dependencies in IMST to pave the way for future semantic role labeling studies. We substantiate the theoretical advantages of our proposed annotation schemes through a parsing experiment in compliance with the parsing framework used in the study for the original MST that still remains the state of the art. Our experiment yielded a labeled attachment score of 75.3% for IMST, surpassing the best score of 65.9% attained so far on MST by a very large margin. Finally, considering the outcome of our work, we believe it would be safe to say that we succeeded in making pivotal progress by working directly on the training set. We show that improving the quality of data, although an open-ended endeavor, has a considerable effect on parsing performances, and will hopefully pave the way for corpus studies for Turkish. A. Preliminaries In our test, we used the same MaltParser [26] configuration as in [17] so that the results would be properly comparable. In further accordance with the cited work, non-projective sentences were eliminated from all training sets, which is shown to cause a significant performance boost [17], [18]. The dependencies with the relation D ERIV (denoting intraword relations between morphosyntactic units) were excluded in evaluation, as they are considered trivial. In the literature, punctuation is either wholly excluded from evaluation (as in e.g. [4]) or included (as in e.g. [25]). We follow the latter approach and evaluate the dependencies of punctuation. 
Since the inherited parsing framework does not support learning from dependents annotated with multiple heads, we discard all deep dependencies from IMST before running the test. The metrics used in evaluation are the conventional IGbased labeled and unlabeled attachment scores. The unlabeled attachment score (UAS) considers a prediction to be accurate if the head token alone was correctly predicted, while the labeled attachment score (LAS) additionally requires a correct prediction of the dependency relation. Between the two, a high LAS is more difficult to attain and more valuable, so we take the LAS as our primary criterion in performance comparison. We also provide standard error values, and use McNemar’s Test for measuring statistical significance where needed. ACKNOWLEDGEMENT This study is part of a research project entitled “Parsing Web 2.0 Sentences” subsidized by the Turkish Scientific and Technological Research Council under grant number 112E276 and associated with the ICT COST Action IC1207. We hereby offer our sincere gratitude to our volunteering annotators Dilara Torunoğlu-Selamet and Ayşenur Genç, as well as our colleagues Can Özbey, Kübra Adalı and Gözde Gül İşgüder who offered additional help with the annotation. B. Experimental Results Parsing performances obtained by applying ten-fold crossvalidation on IMST are shown side-by-side with the corresponding scores for MST in Table III. 4 The MST score was evaluated excluding punctuation, in accordance with the conventions at the time. As we reproduced the baseline scores on the original MST, we found the difference between models including and excluding punctuation to be statistically insignificant (p < 0.01). Conversely, excluding punctuation in evaluating IMST resulted in a drop in LAS from 75.3% to 70.0%, indicating that the contribution of the new punctuation annotation is far from wholly accounting for the increase in parsing performance. 
TABLE III: Cross-validation scores and standard error values.

        METU-Sabancı Treebank   ITU-METU-Sabancı Treebank
  LAS   65.9% ± 0.3%            75.3% ± 0.2%
  UAS   76.0% ± 0.2%            83.7% ± 0.2%

REFERENCES

[21] D. Z. Hakkani-Tür, K. Oflazer, and G. Tür, “Statistical morphological disambiguation for agglutinative languages,” Computers and the Humanities, vol. 36, no. 4, pp. 381–410, 2002.
[22] K. Haverinen, J. Nyblom, T. Viljanen, V. Laippala, S. Kohonen, A. Missilä, S. Ojala, T. Salakoski, and F. Ginter, “Building the essential resources for Finnish: the Turku Dependency Treebank,” Language Resources and Evaluation, pp. 1–39, 2013.
[23] S. Kübler, R. McDonald, and J. Nivre, Dependency Parsing, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2009.
[24] R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, C. Bedini, N. Bertomeu Castelló, and J. Lee, “Universal dependency annotation for multilingual parsing,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Sofia, Bulgaria: Association for Computational Linguistics, August 2013, pp. 92–97.
[25] J. Nilsson, S. Riedel, and D. Yüret, “The CoNLL 2007 Shared Task on dependency parsing,” in Proceedings of the CoNLL Shared Task Session of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP) and Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, 2007, pp. 915–932.
[26] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryiğit, S. Kübler, S. Marinov, and E. Marsi, “MaltParser: A language-independent system for data-driven dependency parsing,” Natural Language Engineering, vol. 13, no. 2, pp. 95–135, 2007.
[27] J. Nivre, J. Nilsson, and J. Hall, “Talbanken05: A Swedish treebank with phrase structure and dependency annotation,” in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006, pp.
1392–1395.
[28] K. Oflazer, “Dependency parsing with an extended finite-state approach,” Computational Linguistics, vol. 29, no. 4, pp. 515–544, 2003.
[29] K. Oflazer, B. Say, D. Z. Hakkani-Tür, and G. Tür, “Building a Turkish treebank,” in Treebanks. Springer, 2003, pp. 261–277.
[30] Ö. Çetinoğlu, “Turkish Treebank as a gold standard for morphological disambiguation and its influence on parsing,” in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC). Reykjavík, Iceland: European Language Resources Association (ELRA), May 2014.
[31] T. Pamay, U. Sulubacak, D. Torunoğlu-Selamet, and G. Eryiğit, “The annotation process of the ITU Web Treebank,” in Proceedings of the 9th Linguistic Annotation Workshop (LAW), Denver, CO, USA, 5 June 2015.
[32] W. K. Percival, “Reflections on the history of dependency notions in linguistics,” Historiographia Linguistica, vol. 17, no. 1–2, pp. 29–47, 1990.
[33] M. Popel, D. Mareček, J. Štěpánek, D. Zeman, and Z. Žabokrtský, “Coordination structures in dependency treebanks,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Sofia, Bulgaria: Association for Computational Linguistics, August 2013, pp. 517–527.
[34] N. Schneider, B. O’Connor, N. Saphra, D. Bamman, M. Faruqui, N. A. Smith, C. Dyer, and J. Baldridge, “A framework for (under)specifying dependency syntax without overloading annotators,” in Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW VII & ID). Sofia, Bulgaria: Association for Computational Linguistics, August 2013, pp. 51–60.
[35] U. Sulubacak and G. Eryiğit, “Representation of morphosyntactic units and coordination structures in the Turkish dependency treebank,” in Proceedings of the 4th Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL). Seattle, Washington, USA: Association for Computational Linguistics, October 2013, pp. 129–134.
[36] ——, ITU Treebank Annotation Guide, March 2016, available at http://tools.nlp.itu.edu.tr, version 2.7.
[37] L. Tesnière, Éléments de Syntaxe Structurale. Éditions Klincksieck, 1959.
[38] L. Van der Beek, G. Bouma, R. Malouf, and G. Van Noord, “The Alpino Dependency Treebank,” Language and Computers, vol. 45, no. 1, pp. 8–22, 2002.
[39] V. Vincze, V. Varga, K. I. Simkó, J. Zsibrita, A. Nagy, R. Farkas, and J. Csirik, “Szeged Corpus 2.5: Morphological modifications in a manually POS-tagged Hungarian corpus,” in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), 2014, pp. 1074–1078.
[1] B. R. Ambati, S. Reddy, and A. Kilgarriff, “Word sketches for Turkish,” in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), 2012, pp. 2945–2950.
[2] E. Bejček, J. Panevová, J. Popelka, P. Straňák, M. Ševčíková, J. Štěpánek, and Z. Žabokrtský, “Prague Dependency Treebank 2.5 – a revisited version of PDT 2.0,” in Proceedings of the 24th International Conference on Computational Linguistics (COLING), 2012, pp. 231–246.
[3] A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká, “The Prague Dependency Treebank,” in Treebanks. Springer, 2003, pp. 103–127.
[4] S. Buchholz and E. Marsi, “CoNLL-X Shared Task on multilingual dependency parsing,” in Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, 2006, pp. 149–164.
[5] R. Çakıcı, “Wide-coverage parsing for Turkish,” Ph.D. dissertation, The University of Edinburgh, 2008.
[6] O. Çetinoğlu and J. Kuhn, “Towards joint morphological analysis and dependency parsing of Turkish,” in Proceedings of the 2nd International Conference on Dependency Linguistics (DepLing). Prague, Czech Republic: Charles University in Prague, Matfyzpress, August 2013, pp. 23–32.
[7] M. Şahin, U. Sulubacak, and G.
Eryiğit, “Redefinition of Turkish morphology using flag diacritics,” in Proceedings of the 10th Symposium on Natural Language Processing (SNLP), Phuket, Thailand, October 2013.
[8] D. Csendes, J. Csirik, T. Gyimóthy, and A. Kocsor, “The Szeged Treebank,” in Text, Speech and Dialogue. Springer, 2005, pp. 123–131.
[9] M.-C. De Marneffe, M. Connor, N. Silveira, S. R. Bowman, T. Dozat, and C. D. Manning, “More constructions, more genres: Extending Stanford Dependencies,” in Proceedings of the 2nd International Conference on Dependency Linguistics (DepLing). Prague, Czech Republic: Charles University in Prague, Matfyzpress, August 2013, pp. 187–196.
[10] M.-C. De Marneffe and C. D. Manning, “The Stanford Typed Dependencies representation,” in Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation (COLING). Association for Computational Linguistics, 2008, pp. 1–8.
[11] I. Durgar El-Kahlout, A. A. Akın, and E. Yılmaz, “Initial explorations in two-phase Turkish dependency parsing by incorporating constituents,” in Proceedings of the 1st Joint Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL) and Syntactic Analysis of Non-Canonical Languages (SANCL). Dublin, Ireland: Dublin City University, August 2014, pp. 82–89.
[12] G. Eryiğit, “Dependency parsing of Turkish,” Ph.D. dissertation, Istanbul Technical University, 2006.
[13] ——, “ITU Treebank Annotation Tool,” in Proceedings of the ACL Workshop on Linguistic Annotation (LAW), Prague, 24-30 June 2007.
[14] ——, “ITU Validation Set for METU-Sabancı Turkish Treebank,” March 2007. [Online]. Available: http://web.itu.edu.tr/gulsenc/papers/validationset.pdf
[15] ——, “The impact of automatic morphological analysis & disambiguation on dependency parsing of Turkish,” in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 23-25 May 2012.
[16] ——, “ITU Turkish NLP Web Service,” in Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Gothenburg, Sweden: Association for Computational Linguistics, April 2014.
[17] G. Eryiğit, T. Ilbay, and O. A. Can, “Multiword expressions in statistical dependency parsing,” in Proceedings of the 2nd Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL). Dublin, Ireland: Association for Computational Linguistics, October 2011.
[18] G. Eryiğit, J. Nivre, and K. Oflazer, “Dependency parsing of Turkish,” Computational Linguistics, vol. 34, no. 3, pp. 357–389, 2008.
[19] G. Eryiğit and K. Oflazer, “Statistical dependency parsing of Turkish,” in Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, April 2006, pp. 89–96.
[20] G. Eryiğit and T. Pamay, “ITU Validation Set,” Türkiye Bilişim Vakfı Bilgisayar Bilimleri ve Mühendisliği Dergisi, vol. 7, no. 1, 2014.

Exploring Spelling Correction Approaches for Turkish

Dilara Torunoğlu-Selamet, Eren Bekar, Tugay İlbay, Gülşen Eryiğit
Department of Computer Engineering
Istanbul Technical University
Istanbul, 34469, Turkey
[torunoglud, erenbekar, ilbay, gulsen.cebiroglu]@itu.edu.tr

Abstract—The spelling correction of morphologically rich languages is hard to solve with traditional approaches, since in these languages words may have hundreds of different surface forms that do not occur in a dictionary. Turkish is an agglutinative language with a very complex morphology and lacks annotated language resources. In this study, we explore the impact of different spelling correction approaches for Turkish and ways to eliminate the training data scarcity. We test seven different spelling correction approaches, four of which are introduced in this study. As a result of this preliminary work, we propose a new automatic training data collection process where existing spelling correctors help to develop an error model for a better system. Our best performing model uses a unigram language model together with this error model, and improves the performance scores by almost 20 percentage points over the widely used baselines. As a result, our study reveals the top performance achievable with the proposed approach and gives directions for a better future implementation plan.

Keywords—Spelling Corrector, Spell Checker, Turkish

I. INTRODUCTION

In morphologically rich languages (MRLs), and especially agglutinative ones like Turkish, Finnish or Hungarian, a word may occur in hundreds of different surface forms through the addition of multiple suffixes to the end of a word stem. The creation of a lexicon/dictionary consisting of all possible surface forms is impractical and most of the time inefficient due to memory space and search speed constraints. As a result, using a lexicon to check whether a newly constructed candidate for a misspelled word is valid, as in traditional approaches tailored to morphologically poor languages, becomes unusable for MRLs. Finite-state transducers (FSTs) [1], [2] have proven to be very well suited to this kind of language and perform very fast lookup over possible word generations. One of the early implementations of spelling correction for MRLs is the error-tolerant finite-state recognition (ETFSR) approach of Oflazer [3]. Although it is very fast at creating the possible candidates up to a specified edit distance limit, the deficiency of this approach is that it does not produce an ordered list of possible corrections, which prevents its use as an automated spelling corrector. Recent approaches [4]–[6], which focus on weighted finite-state spell-checking using language models and error models, are very efficient for the spelling correction of MRLs. Pirinen and Lindén [6], who also experiment with some agglutinative and polysynthetic languages as well as English, use the Wikipedia articles of the related languages in order to create the corresponding language models. On the other hand, the same error models which are used for English are also used for MRLs, only adding language-specific characters. Wang et al. [7] propose a fast and accurate approximate string search (ASS) algorithm which keeps track of the frequent mistakes (an error model) extracted from training data (consisting of spelling mistakes and their corrections) and generates the most probable correction candidates. The method uses a vocabulary trie for validating the generated candidates. It is very straightforward to collect the training data for the error model from the user queries of a search engine (the suggested and selected corrections), as done in the mentioned study.

In this paper, we explore a way of creating a Turkish-specific error model given the lack of manually annotated training data, and different combinations of the error model, the language model and minimum-edit-distance candidate generation for spelling correction. We compare our results with three existing spelling correction systems for Turkish: (1) the error-tolerant finite-state recognition (ETFSR) approach of Oflazer [3], (2) MsWord, and (3) Zemberek [8].¹

The paper is structured as follows: Section 2 introduces the error model, Section 3 discusses the proposed spelling correctors, Section 4 presents the datasets and evaluation metrics, Section 5 gives the experimental results and discussions, and Section 6 the conclusion and future work.

¹ To the best of our knowledge, at the time of writing this paper, these were the only three spelling correction systems available for comparison.

II. THE ERROR MODEL

Obtaining the error model is a challenging task considering the lack of manually annotated training data for the Turkish language. Wang et al. [7] proposed a probabilistic approach for spelling correction. This approach was novel in that it used log-linear candidate generation with a special data structure that can find top candidates efficiently. The proposed method works effectively for languages which have a limited dictionary for lookup. They derived all the possible rules from the training data using an approach similar to Brill and Moore [9]. In their study, they collected the training data for the error model from the user queries of a search engine. Despite not having this opportunity, we propose a new automatic training data collection process where existing spelling correctors help to develop an error model. We collected a training data set from the Twitter domain. We then passed all the ill-formed words (those not accepted by our morphological analyzer) through one online (Google²) and one offline [8] spelling corrector, and accepted the corrections proposed identically by both of these correctors as the corrected forms of the ill-formed words in our training set. At the end of this process, we obtained a training set of 5775 word pairs (ill-formed and corrected words) with character lengths ranging from 2 to 23.

After obtaining the training set for the error model, we used the same approach as Wang et al. [7] to store the extracted error rules. We used the Aho-Corasick tree structure for storing and applying the correction rules. During the generation of the error model, the rules are extracted from the misspelled and corrected forms of words using the Levenshtein edit distance algorithm. The output of this step is a set of rules covering additions, deletions and substitutions of letters. This rule set also contains the likelihood of each derived rule.
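The rule extraction step can be sketched as follows. This is a simplified illustration with hypothetical word pairs: it derives single-character addition, deletion and substitution rules from a Levenshtein alignment and estimates their likelihoods by relative frequency, whereas the actual system stores its rules in an Aho-Corasick tree.

```python
from collections import Counter

def edit_rules(wrong, right):
    """Extract character-level edit operations from one (misspelled, corrected) pair
    via a standard Levenshtein alignment with backtrace."""
    m, n = len(wrong), len(right)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = int(wrong[i - 1] != right[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    ops, i, j = [], m, n
    while i > 0 or j > 0:  # backtrace, collecting non-match operations
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + int(wrong[i - 1] != right[j - 1]):
            if wrong[i - 1] != right[j - 1]:
                ops.append(("sub", wrong[i - 1], right[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", wrong[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", right[j - 1]))
            j -= 1
    ops.reverse()
    return ops

def rule_likelihoods(pairs):
    """Estimate rule likelihoods by relative frequency over all training pairs."""
    counts = Counter(op for w, r in pairs for op in edit_rules(w, r))
    total = sum(counts.values())
    return {op: c / total for op, c in counts.items()}

# hypothetical training pairs: a substitution error and a missing letter
probs = rule_likelihoods([("kitep", "kitap"), ("gelyor", "geliyor")])
```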
The extracted rules and their estimated likelihoods are stored in an Aho-Corasick search tree, a very efficient trie-based string matching data structure. All leaf nodes in this search tree have an output link which associates the node with the likelihood of the rule it holds. This allows fetching rules and their likelihoods effectively. The tree also stores failure links that redirect the search to the best applicable node when there is no way to continue for the queried string. This saves us from restarting from the beginning each time a search query fails, and results in a significant time gain during the search.

III. SPELLING CORRECTORS

ETFSR and Zemberek use edit-distance based candidate generation approaches. The following subsections introduce our new approaches, which are essentially different combinations of the language and error models as well as ETFSR.

Spelling Corrector #1 (SC1)

Our first approach is an adaptation of Wang et al. [7]. Creating a lexicon covering all possible surface forms of an MRL is neither practical nor efficient, since the required memory allocation is very large even with the most compact data structures.³ Therefore, instead of the vocabulary trie for candidate validation, SC1 uses an FST (a finite-state transducer built from a stem lexicon for the MRL in focus, together with the morphotactic and phonetic rules that generate the inflected forms of these stems) as the language validator. Figure 1 shows the main components of SC1. The training phase is the process of creating the error model, as explained in Section II. In the candidate generation phase, the previously constructed Aho-Corasick tree is looked up for all rules applicable to a given misspelled word. Since not all rules generate a valid surface form, the generated results are validated by the FST. If a constructed word is validated by the FST, the likelihoods of all applied rules are summed up, and this forms the likelihood of the candidate word. As a pruning technique, before applying a rule, it is always checked whether the rule likelihood can still produce a more probable candidate; if not, the rule is not applied to the misspelled word. As a result, our approach differs from the original ASS model [7] in two main points: 1) the use of an FST for validation, and 2) the calculation of the rule set probabilities in the training phase. In the original work, a log-linear model is employed for calculating the probabilities of rule sets, whereas in our work we simply use likelihoods for this preliminary investigation.

Fig. 1: Spelling Corrector #1

² At the time of this collection process, the Google spelling correction service was still available.
³ In the early stages of our implementation, we tried placing only the most frequently occurring surface forms extracted from a corpus into the lexicon, and even this approach took more than 500 MB of memory using a suffix tree, which we believe is not acceptable for a spelling corrector in practical use.

Spelling Corrector #2 (SC2)

As mentioned in the introductory section, the output of ETFSR is a set of unsorted candidates, and the size of the candidate list is unpredictable. SC2 deals with this deficiency by re-ranking the ETFSR outputs using the probabilities calculated from the error model, as explained previously. Figure 2 shows the structure of SC2, where the misspelled inputs first enter the ETFSR. We then retrieve from our rule tree the rules (and their scores) that should be applied to the misspelled word to generate each candidate in the ETFSR output list; in other words, we get the list of applied rules (additions, deletions and substitutions of letters) according to the Levenshtein edit distance between the misspelled word and the corresponding candidate. Once we have the rules for a candidate, we sum up the costs of the applied rules and then simply sort the candidates by their costs. The candidate with the minimum cost is accepted as the most probable correction.

Fig. 2: Spelling Corrector #2

Spelling Corrector #3 (SC3)

Inspired by previous works by Lindén and Pirinen [4]–[6], SC3 makes use of a unigram language model for candidate sorting. To this end, a unigram language model is trained on word surface forms from a Turkish corpus. The ETFSR outputs are then re-ranked similarly to SC2, but this time using the unigram probabilities. The candidate having the highest probability and the smallest edit distance from the misspelled input is then accepted as the produced correction. The structure of SC3 is shown in Figure 3.

Spelling Corrector #4 (SC4)

SC4 is inspired by Lindén and Pirinen [6] in that it uses a language model and an error model together in order to generate candidates. SC4 uses the same unigram language model as SC3 and the same error model introduced in Section II. SC4 differs from SC1 in that the candidates generated by the error model are validated using the language model instead of the FST, and the best proposal is selected as the candidate with the minimum rule cost and the maximum unigram probability:

  ĉ = argmax_{c ∈ Gen} p(c) · 1/rulecost(c)

Laplace smoothing [10] is used in order to compensate for the absence of a candidate word in the language model. SC4 is depicted in Figure 4.

Fig. 4: Spelling Corrector #4

Table I displays the usage and combination of the language and error models, as well as the candidate generation method, in the introduced spelling correctors. As can be noticed from the table, the difference between SC2 and SC1 is that in SC2, which uses ETFSR in its candidate generation stage, all produced candidates are already valid words, whereas in SC1 the candidates are validated after being produced by the use of the error model.
The last two spelling correctors (SC3 and SC4), which use language models, are the most memory-consuming systems, as expected and as explained in the introductory section. They are tested both with ETFSR candidate generation (SC3) and with Aho-Corasick candidate generation (SC4). SC4 also uses the error model in its probability calculation. Another possible system (discussed in the following sections) which could provide a slight increase in the scores would be a combination of ETFSR, the language model and the error model, though it was not tested as part of this study.

Fig. 3: Spelling Corrector #3

TABLE I: Models Used in Different Approaches

  Model                 SC1           SC2    SC3    SC4
  Error Model           yes           yes    no     yes
  Language Model        no            no     yes    yes
  Candidate Generation  Aho-Corasick  ETFSR  ETFSR  Aho-Corasick

IV. EXPERIMENTAL SETUP

We tested our system on Turkish, a highly agglutinative language carrying all the characteristics of a morphologically rich language. We used the available two-level morphological analyzer of Oflazer [11] as the FST language validator of our system in SC1, and again the ETFSR of Oflazer [3] in SC2 and SC3. To obtain a unigram language model, we used the corpus introduced by Sak et al. [12]. This text corpus, compiled from the web, contains about 500M tokens. Due to the composition of data found on the web, the corpus includes noisy data. We extracted only the valid Turkish words, which constitute 842 MB of the corpus (almost 43M valid tokens). During the collection of the test data, for the sake of fairness, we did not include errors made on purpose due to social media writing trends, such as emoticons and words typed without vowels or the proper diacritics, which would be corrected in a normalization stage [13] rather than in spelling correction. The creation of the training data used to train the error model is explained in Section II. Since this automatic approach is only applied during the creation of the training data used in rule extraction, it does not hamper the evaluation on our test data, which is manually annotated with corrected forms (1016 word pairs).

V. EXPERIMENTAL RESULTS & DISCUSSIONS

In our experiments, we first test ETFSR and the spelling correctors introduced in Section III and evaluate their results. We then compare our models with the other available spelling correctors for Turkish.

Table II presents some statistics for ETFSR and the other models (SC1, SC2, SC3 and SC4): namely, the average operation time of each spelling correction approach on the test set described in the previous section, and the average index of the correct candidate within all generated candidates. The index numbering starts from 0, where index 0 means that the first candidate in the output is the correct one according to the manually annotated test set. One may notice from this table that the ETFSR approach produces results very fast, but the correct answer generally occurs in lower positions of the produced candidate list. On the other hand, SC1 is almost 10 times slower than ETFSR but produces more accurate results. SC2 is much faster than SC1 and has a similar success range. SC3 and SC4 are similar to SC2 in terms of average duration but give better average index results. The added cost due to re-ranking is smaller than a single millisecond⁴ over ETFSR.

⁴ The training time (629 ms with our available training data) is not added to this cost, since training occurs only once in the preparation stage and the pre-trained model is simply loaded at the beginning of the testing stage.

TABLE II: Output Statistics

  Approach  Average Duration (ms)  Average Index
  ETFSR     388                    1.46
  SC1       3385                   0.9
  SC2       389                    0.6
  SC3       333                    0.35
  SC4       363                    0.145

Table III gives the comparison of the spelling correction accuracies of our models with the mentioned tools. Although the Google spelling suggestion API was used during the creation of our training data, it could not be compared with the other spelling correctors in this section since it is no longer available. In this experiment, for all of the systems, we took the first suggestion given by the system and compared it with the gold-standard correction in our test set. Our best model outperforms the widely used Zemberek spelling corrector by almost 20 percentage points. Despite the modest size of our training data set, which we were not able to keep extending due to the unavailability of one of the services (the Google spelling suggestion API) that we had used, we see that the proposed error model on its own (SC2) outperforms MsWord by more than 2 percentage points. We believe that, with additional training data, the system performance may be improved even further. As future work, self-training approaches may be tested for learning the error-rule probabilities. We also observe that the language model has a much higher impact, by almost 10 percentage points. One should notice that the language model used here is just a unigram surface model, and better results may be obtained with more sophisticated language models.

TABLE III: Comparison with previous studies

  Approach  Accuracy
  ETFSR     49.0%
  Zemberek  61.4%
  MsWord    66.3%
  SC1       68.6%
  SC2       67.8%
  SC3       78.7%
  SC4       80.7%

In order to investigate the results and the behavior of the algorithms more closely, we also made a different evaluation based on whether the correct candidate appears in the top-n list of an algorithm's output. Table IV presents these scores for n = 1, 3, 5 and 10; e.g., SC4 positioned the correct candidate in its top-3 list in 92.7% of the cases. We can observe that the success rates of all the models become similar as n increases, meaning that ETFSR is also successful in generating the correct candidate within its top-10 list. But SC3 and SC4 are certainly more suited to be used as automated spelling correctors: in top 1, the difference is as high as 31.7 percentage points between ETFSR and SC4. Although SC3 and SC4 both yield very high scores, they are both memory-inefficient due to the surface language models used.

TABLE IV: Candidate List Evaluation

  Candidate List Size  ETFSR  SC1    SC2    SC3    SC4
  1                    49.0%  68.6%  67.8%  78.7%  80.7%
  3                    76.7%  88.6%  89.1%  92.7%  92.7%
  5                    86.2%  93.5%  92.9%  94.5%  97.0%
  10                   93.8%  95.7%  95.4%  95.5%  98.9%

A better possible system, combining both, would essentially be the system proposed by Lindén and Pirinen [6] coupled with our automatically created error model, which we aim to develop in our future work. Although the difference between candidate generation using FSTs and the Aho-Corasick tree is not statistically significant⁵, we expect that memory consumption will be alleviated with a better implementation, even though there may not be an increase in performance.

VI. CONCLUSION & FUTURE WORK

In this study, we explored ways to eliminate the scarcity of training data for spelling correction, as well as the impact of different spelling correction approaches for Turkish. We proposed a new automatic training data collection process in which existing spelling correctors contribute to the development of an error model, paving the way for better systems. We described four spelling correction approaches adapted for Turkish, using different combinations of language models, error models and candidate generation approaches, and reported their performance for Turkish in comparison with three established spelling correctors. Our work has been a preliminary investigation of better spelling correction approaches for MRLs, and there is still much that could be further investigated and improved, such as 1) automatically increasing the training set size, 2) integrating self-training approaches in learning error-rule probabilities, and 3) using weighted finite-state language and error models. Although we used a simple unigram language model in our best-performing systems, we observed that the systems making use of the language model outperform those without it by about 10 percentage points. Furthermore, we believe that using weighted finite-state language and error models would produce slightly better results than those presented in this paper, as well as eliminating the memory consumption problem of our best corrector.

REFERENCES

[1] Finite-State Morphology: Xerox Tools and Techniques, 2003.
[2] K. Lindén, M. Silfverberg, and T. Pirinen, “HFST tools for morphology – an efficient open-source package for construction of morphological analyzers,” 2009.
[3] K. Oflazer, “Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction,” Computational Linguistics, vol. 22, no. 1, 1996.
[4] K. Lindén, T. Pirinen et al., “Weighting finite-state morphological analyzers using HFST tools,” 2009.
[5] T. Pirinen, K. Lindén et al., “Finite-state spell-checking with weighted language and error models,” 2010.
[6] T. A. Pirinen and K. Lindén, “State-of-the-art in weighted finite-state spell-checking,” 2014.
[7] Z. Wang, G. Xu, H. Li, and M. Zhang, “A fast and accurate method for approximate string search,” Association for Computational Linguistics, 2011.
[8] “Zemberek, an open source NLP framework for Turkic languages,” vol. 10, 2007.
[9] E. Brill and R. C. Moore, “An improved error model for noisy channel spelling correction,” Association for Computational Linguistics, 2000.
[10] “Laplacian smoothing and Delaunay triangulations,” vol. 4, no. 6, 1988.
[11] K. Oflazer, “Two-level description of Turkish morphology,” Literary and Linguistic Computing, vol. 9, no. 2, 1994.
[12] H. Sak, T. Güngör, and M. Saraçlar, “Resources for Turkish morphological processing,” Language Resources and Evaluation, vol. 45, no. 2, 2011.
[13] D. Torunoğlu and G. Eryiğit, “A cascaded approach for social media text normalization of Turkish,” April 2014.
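The top-n evaluation reported in Table IV can be reproduced with a simple helper (a sketch over hypothetical candidate lists, not the authors' evaluation code):

```python
def accuracy_at_n(ranked_candidates, gold, n):
    """Fraction of test items whose gold correction appears among the top-n candidates."""
    hits = sum(g in cands[:n] for cands, g in zip(ranked_candidates, gold))
    return hits / len(gold)

# two hypothetical test items with ranked correction candidates
ranked = [["kitap", "katip"], ["gelmiş", "gelmez"]]
gold = ["katip", "gelmiş"]
# accuracy_at_n(ranked, gold, 1) == 0.5; accuracy_at_n(ranked, gold, 2) == 1.0
```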
ACKNOWLEDGMENT This work is part of our ongoing research project “Parsing Turkish Web 2.0 Sentences” supported by ICT COST Action IC1207 TUBITAK 1001 (grant no: 112E276). 5 We used McNemar’s paired t-test to evaluate the difference between SC1(68.6%) and SC2(67.8%) and found that the difference between these two models is not statistically significant, with a two-tailed p value of 0.7. 11 Framing of Verbs for Turkish PropBank Gözde Gül Şahin Department of Computer Engineering Istanbul Technical University Istanbul, 34469, Turkey [email protected] thematic roles, verb classes are defined with all possible syntaxes for each class. One possible syntax is given below the examplary sentence. Unlike FrameNet and VerbNet, PropBank (PB) [16] does not make use of a reference ontology like semantic frames or verb classes. Instead semantic roles are numbered from Arg0 to Arg5 for the core arguments. Moreover, PropBank has an associated annotated corpus that help researchers to specify SRL as a task, furthermore are used as training and test data for supervised machine learning methods [11] [21]. [I]Buyer-Agent-Arg0 bought [a coat]Goods-Theme-Arg1 from [the flea market.]1 Abstract—In this work, we present our method for framing the verbs of Turkish PropBank and discuss incorporation of crowd intelligence to increase the quality and coverage rate of annotated frames. First, we discuss the manual framing process by experts with the help of publicly available dictionaries, corpora and guiding morphosemantic features such as case markers. Then, we present a systematic way of framing for challenging cases such as light verbs, multiword expressions and derived verbs. Later, a verb sense disambiguation task where the verb senses correspond to annotated frames, is crowdsourced. Finally, the results of verb sense disambiguation task are used to increase the coverage rate and quality of created linguistic resource. 
In conclusion, a new lexicon of Turkish verbs with 759 annotated verbs and 1262 annotated senses is constructed.

Keywords—Turkish PropBank; Semantic Role Labeling; Semantic Frame; Light Verb; MWE

Syntax: Agent V Theme {From} Source

In [17], the authors investigate the usability of the FrameNet, VerbNet and PropBank conventions for modern Turkish and conclude that the PropBank convention with additional morphosemantic features would be the most appropriate semantic resource. Unfortunately, creating a high-quality PropBank for a morphologically complex language with low resources is a challenging task. Creation of such a corpus is generally considered as the combination of two subtasks: framing of verbs and corpus annotation with the framed verbs. The framing process includes deciding on the verbs to annotate, examining different senses of the chosen verbs, and deciding on the arguments for each verb sense. Languages with rich resources mostly perform corpus annotation in one step with a large number of annotators. Due to our small number of expert annotators, we chose to divide corpus annotation into two microtasks for crowdsourcing: verb sense annotation and argument labeling. In the verb sense annotation task, people are asked to disambiguate the meaning of the verbs in sentences from a morphologically and syntactically analysed corpus, and in the argument labeling task, annotators are asked to label the arguments of the previously annotated verb senses. Framing of Turkish verbs can be considered the most important step of PropBank creation. The errors introduced in the framing process may accumulate and may significantly reduce the accuracy and reliability of the semantic role labeling task. Moreover, it can be considered the most complicated task, especially for languages with

I. INTRODUCTION

In recent years a considerable amount of research has been performed on extracting semantic information from sentences.
Revealing such information is usually achieved by identifying the arguments of a predicate and assigning meaningful labels to them. Each label represents the argument's relation to its predicate and is referred to as a semantic role, and this task is named semantic role labeling (SRL). SRL aims to answer the question “Who did what to whom?” and thus reveal the full meaning of a sentence. It has been employed in machine translation, information extraction and question answering tasks. There exist different semantic role annotation schemes, of which the most commonly used are VerbNet [18], FrameNet [6] and PropBank [16]. FrameNet (FN) is a semantic network built around the theory of semantic frames. All predicates in the same semantic frame share one set of Frame Elements (FEs). In the example below, a sentence with the predicate “buy”, annotated with the FrameNet, VerbNet and PropBank conventions, is given. The predicate “buy” belongs to the “Commerce buy” frame of FrameNet, which contains “Buyer” and “Goods” as core frame elements and “Seller” as a non-core frame element. Moreover, FN provides connections between semantic frames such as inheritance, hierarchy and causativity. Contrary to FN, VerbNet (VN) is a hierarchical verb lexicon that contains categories of verbs based on the Levin verb classification [18]. The predicate “buy” is contained in the “get-13.5.1” class of VN, along with the verbs “pick”, “reserve” and “book”. Members of the same verb class share the same set of semantic roles, referred to as thematic roles. In addition to

1 In PropBank, Arg0 is used for the actor, agent, experiencer or cause of the event; Arg1 represents the patient, if the argument is affected by the action, and the theme, if the argument is not structurally changed.

rich derivational morphology and with a large number of light verbs and multiword expressions, like Turkish. In this paper, we focus on this process due to its importance and difficulty, whereas we investigate the details of the latter processes in other studies.
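The three annotation schemes applied to the running example can be pictured as one layered record. The structure below is purely illustrative (our own names): the FrameNet and PropBank labels for “the flea market” are not given in the running example, so we follow the VerbNet syntax (Source) and leave the others marked as unlabeled:

```python
# One sentence, three annotation layers (FrameNet / VerbNet / PropBank).
# "[I] bought [a coat] from [the flea market]" -- labels from the running example;
# None marks labels the example does not supply.
annotation = {
    "predicate": "buy",
    "arguments": [
        {"span": "I",               "framenet": "Buyer", "verbnet": "Agent",  "propbank": "Arg0"},
        {"span": "a coat",          "framenet": "Goods", "verbnet": "Theme",  "propbank": "Arg1"},
        {"span": "the flea market", "framenet": None,    "verbnet": "Source", "propbank": None},
    ],
}

def propbank_of(ann, span):
    """Look up the PropBank label of an argument span, if annotated."""
    return next(a["propbank"] for a in ann["arguments"] if a["span"] == span)

print(propbank_of(annotation, "a coat"))  # Arg1
```

The point of the layered record is that the same span carries scheme-specific labels, which is exactly what SRL corpora built on these conventions store.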
In the next sections, we present the details of our approach to choosing verbs and their arguments for annotation, framing light verbs/multi-word expressions, and incorporating declension. Further on, we explain how we interpreted the results of the crowdsourced verb sense disambiguation microtask for fine-tuning verb frames. Finally, we conclude by describing the properties of the created linguistic resource and the results of the improvement process.

from verbs. However, these numbers only account for the first level of derivation, such as “sev-iş (to make love)”, the reciprocal form of “sev (to love)”. In contemporary everyday Turkish, it is observed that words have about 3 to 4 morphemes including the stem [15], such as “sev-iş-tir-il (to be made to make love with someone)”, which has 3 derivational morphemes: reciprocal, causative and passive, in that order. Due to these challenging cases, our approach to each of them and the tools that are used are explained in each subsection in a guideline fashion.

A. Root Verbs

The Turkish Language Association is a trustworthy source of lexical datasets and dictionaries. We initiated our framing efforts with the list of Turkish root verbs provided by TDK. This list consists of 759 root verbs; however, it contains verbs that are rarely used or have fallen into disuse, such as the ones shown in Table I. In order to detect those root verbs we used the TNC (Turkish National Corpus), which is a balanced and representative corpus of contemporary Turkish with about 50 million words. Its query interface, shown in Fig. 1, allows regular expressions, which is essential for querying verbs that appear in different conjugated forms in unstructured text. We performed queries on all root verbs and framed them if their frequency count is above 5 per million words. Overall, only 385 of the verbs were found to be above this threshold. Some exemplary root verbs that were excluded from the framing process are given with their frequencies in Table I. II.
METHOD

We took a two-pass framing approach. In the first pass, we performed the regular framing explained in the PropBank framing guidelines [5], based on available resources such as the publicly available dictionary prepared by the Turkish Language Association [19], a large corpus (the Turkish National Corpus) [2] that can be queried for different usages of words, and an open source annotation tool, CornerStone [8]. The senses of the verbs and the case marking of their arguments are decided by manually investigating the sentences appearing in the search results of the TNC corpus. Then, the arguments of the predicates are labeled with VerbNet thematic roles and PropBank argument numbers, by checking the English equivalent of the Turkish verb sense where possible. This process is repeated for all verb senses. However, the low number of expert framers and the limited amount of time available for framing cause incomplete, inaccurate and subjective frames. In order to reduce this effect, we utilized crowd feedback from a verb sense disambiguation task and performed a second pass on the framing of Turkish verbs.

Root Verb | Count | Frequency (per million)
eğir (to spin cotton for making thread) | 105 | 2.24
semir (to batten, get fat) | 80 | 1.68
yüksün (to regard someone, something as a burden) | 52 | 1.09
çıv (to be deflected) | 24 | 0.5
evele (to hum and haw) | 16 | 0.34
göynü (to be grieved) | 5 | 0.1
ılga (to run at a gallop - used only for horses without a rider) | 5 | 0.1
çemre (to roll up one's sleeves, trouser legs, or skirts) | 4 | 0.08
ipile (to give a very dim light) | 1 | 0.02
fışılda (to make a swishing or rustling sound) | 0 | 0

Table I: Excluded root verbs and their frequencies per million words

III. FIRST PASS: CREATION OF VERB FRAMES

The PropBank framing guidelines [5] are an important source of information that discusses how the verbs in the English PropBank should be framed. Although we followed that guideline [5], the rich derivational morphology of Turkish and its large number of light verbs (LV) and multiword expressions (MWE) introduce challenges for Turkish framers. LV and MWE are still an active research area for linguists [20], and due to the complexity of this issue, the annotation of LV and MWE constructions in PropBank has been investigated separately in [14]. Even though PropBanks have been constructed for morphologically rich languages such as Hindi/Urdu, Arabic and Finnish, modern Turkish poses more challenges due to its extreme derivational morphology. According to the Turkish Language Association (TDK)2, there are 759 root verbs, 2380 verbs derived from nouns and 2944 verbs derived

B. Derivational Morphology of Verbs

Turkish is among the languages with rich derivational morphology. According to TDK, there exist 10 morphemes that derive verbs from verbs, yielding 2944 derived verbs. Of these morphemes, 6 are known as valency changing morphemes and are responsible for 98% of the derived verbs. In Table II, the counts of derived verbs categorized according to their types are shown. In [17], it has been stated that Turkish valency changing morphemes always cause a predictable transformation; thus it is sufficient to have frames for the root verbs only. An exemplary causative transformation for the intransitive verb “laugh” and the transitive verb “wear” is given in Fig. 2. When intransitive verbs are causativized, the causee becomes the patient of the causation event. In other words, the central argument of the root verb,

2 TDK is the official organization of the Turkish language, founded in 1932.
It is responsible for conducting linguistic research on Turkish and other Turkic languages, and for publishing the official Turkish dictionary. (www.tdk.gov.tr)

Figure 1: TNC query for “sev-iş-tir* (to make someone make love with someone)”

Morpheme | Type | Count
-akla, -ekle, -ıkla, -ikle, -ukla, -ükle | Not Valency | 8
-ala, -ele | Not Valency | 22
-ımsa, -imse, -umsa, -ümse | Not Valency | 5
-zir | Not Valency | 1
Total | | 36
-ş, -aş, -eş, -ış, -iş, -uş, -üş | Reciprocal | 258
-l, -al, -el, -ıl, -il, -ul, -ül | Passive | 528
-n, -ın, -in, -un, -ün | Passive—Reflexive | 720
-r, -ar, -er, -ır, -ir, -ur, -ür | Causative | 29
-t, -at, -et, -ıt, -it, -ut, -üt | Causative | 510
-tır, -tir, -tur, -tür, -dır, -dir, -dur, -dür | Causative | 863
Total | | 2908

Table II: Derivational Morphemes

There exist some verbs which are frequently used in their causative forms with some deviation in meaning, such as “yaz-dır”, the causative form of the verb “yaz (to write)”, which means to register someone at a school/course. In order to have an accurate framing process, separate frames were created for such verbs. In addition to verb-to-verb derivational morphemes, there exist 2380 verbs that are derived from nominal words via 12 different morphemes, as stated by TDK. We claim that creating a nominal bank and linking those derived verbs with entries from the nominal bank would be more appropriate. Thus, only the most frequent ones are included in the current bank and the rest is kept as the subject of a further study.

C. Light Verbs and Multiword Expressions (MWE)

Light verbs are verbs that cannot stand in a sentence on their own but can occur with another verb or a nominal [7]. Light verb constructions in Turkish are complex predicates formed by a nominal and one of the light verbs such as ol-, et-, gel-, ver-, dur-, kal-, düş-, bulun-, eyle- and buyur- [20]. Other than in Turkish, light verb constructions can also be encountered in many languages such as Japanese, Korean, Persian, English, French and German.
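The predictable valency-changing causative transformation described for Turkish root verbs can be sketched as a frame rewrite: a new causer becomes Arg0; for intransitive roots the old central argument is demoted to Arg1 with ACC case, and for transitive roots the old Arg0 is demoted to Arg2 with DAT case while Arg1 stays Arg1. The function and role names below are hypothetical, and the sketch simplifies by assuming the root frame has an Arg0 (the text also allows Arg1 as the central argument):

```python
def causativize(frame):
    """Rewrite a root verb's frame into its causativized frame.
    `frame` maps PropBank arg numbers to (role, case) pairs.
    Simplification: assumes Arg0 is the root's central argument."""
    new = {"Arg0": ("causer", None)}      # causative morpheme introduces the causer
    role, _ = frame["Arg0"]
    if "Arg1" in frame:                   # transitive root: Arg0 -> Arg2 (DAT)
        new["Arg2"] = (role, "DAT")
        new["Arg1"] = frame["Arg1"]       # Arg1 stays Arg1
    else:                                 # intransitive root: Arg0 -> Arg1 (ACC)
        new["Arg1"] = (role, "ACC")
    return new

laugh = {"Arg0": ("laugher", None)}                              # intransitive
wear  = {"Arg0": ("wearer", None), "Arg1": ("garment", "ACC")}   # transitive

print(causativize(laugh))  # causer is Arg0, laugher demoted to Arg1 with ACC
print(causativize(wear))   # causer is Arg0, wearer demoted to Arg2 with DAT
```

Because the transformation is deterministic, frames for causativized verbs need not be hand-written unless the meaning deviates (as with “yaz-dır”).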
The light verb itself may contribute comparatively little to the meaning, or it may have no contribution at all, as in “teşekkür et- (to thank)”. In such cases, where the meaning is mostly conveyed by the nominal, the phrase is treated as a new predicate (teşekkür et). In addition, Turkish light verbs are not necessarily light in all uses. Consider the function of the verb et- in the sentence “Üç artı iki beş eder (Three plus two makes five)”. The framing process is handled for such verbs in the same way as for other root verbs. Most of the time, MWEs are confused with light verb constructions. In order to avoid such discussions, we approach the problem practically, rather than categorizing verbs as LVC or MWE. We either treat such verbs as another sense of the root verb or as a complex predicate. The criteria followed during the decision process are:

• Deviation from the original meaning of the verb root,
• The nominal's contribution to the meaning of the complex predicate,
• The frequency of the complex predicate,
• Being a fixed phrase.

Figure 2: Causative Transformations

(Arg0 if it exists, otherwise Arg1) is marked with the ACC case and becomes an internal argument (usually Arg1) of the new causative verb. For transitive root verbs, the central argument, the Arg0 of the root verb, receives the DAT case marker and serves as an indirect object (usually as Arg2), while Arg1 again serves as Arg1.3

3 The causative morpheme introduces a new argument, called the causer, to the valence pattern. In Fig. 2, the causer is shown as A0, whereas in the PropBank for Hindi/Urdu it may be shown as A-A.

In Table III, our framing approach for the verb “ver (to give)” is shown as an example. The second sense has the meaning of “to fix, to establish”, as in to give/fix an appointment, name or price. Similarly, ver.03 is defined as to devote, allocate, as in “öncelik vermek (to give priority)”, “emek vermek (to give/devote effort)” and “zaman vermek (to give/allocate time)”.
These phrases are not fixed and the contribution of the nominal is not dominant. Hence, they are framed as new senses of the root verb. On the contrary, the complex predicates “söz ver (to promise)”, “izin ver (to allow)”, “kulak ver (to listen carefully)” and “hesap ver (to explain)” are fixed phrases and have high frequency in the TNC corpus. Hence, they are determined to be separate predicates.

Predicate | Sense | Meaning | Example
ver | ver.01 | To transfer | Hediye vermek (Give presents)
ver | ver.02 | To fix | Randevu vermek (Give an appointment)
ver | ver.03 | To devote, allocate | Öncelik vermek (Give priority)
söz ver | ver.09 | To promise | Bana söz ver (Promise me)
kulak ver | ver.12 | To listen carefully | Bana kulak ver (Listen to me)

Table III: Framing of the verb “ver- (to give)”

Figure 3: Cornerstone Software Adjusted for Turkish

task is given in Fig. 4. At the end of the task, 5855 rows had been annotated by at least three annotators, 265 rows were annotated per hour, and the whole annotation process took 68 hours. More than 100 taskers contributed from 39 different cities of Turkey, and the overall annotator agreement is calculated as 83.15%. The details of this work are presented in another paper. The consolidation of one or more contributor responses into a summarized result is referred to as aggregation. We analyzed the results which have a confidence level lower than 0.7 or an aggregated result of “None”. Out of 6000 rows, 2174 rows had confidence lower than 0.7 and 738 rows were aggregated as “None”. We manually performed a second-pass expert annotation for the rows with low confidence and eliminated 1200 of the 2174 rows, since their aggregated results were already accurate. We investigated the main reasons for annotators to choose the option “None” as follows, and took the appropriate actions:

D. Annotation Tool

For framing purposes, we have adjusted an already available open source software, Cornerstone [8]4.
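The aggregation and confidence-filtering step described above can be sketched as a majority vote over contributor judgments, with confidence measured as the share of contributors agreeing with the majority and the 0.7 threshold used to route rows back to an expert. This is our own minimal illustration of that workflow, not the crowdsourcing platform's actual aggregation algorithm:

```python
from collections import Counter

def aggregate(judgments, threshold=0.7):
    """Majority-vote aggregation of contributor judgments for one row.
    Returns (label, confidence, needs_expert): a row is sent to an expert
    if its aggregated label is "None" or its confidence falls below the
    threshold, mirroring the second-pass procedure described in the text."""
    votes = Counter(judgments)
    label, count = votes.most_common(1)[0]
    confidence = count / len(judgments)
    needs_expert = (label == "None") or (confidence < threshold)
    return label, confidence, needs_expert

print(aggregate(["ver.01", "ver.01", "ver.03"]))  # 2/3 agreement < 0.7 -> flagged
print(aggregate(["ver.02", "ver.02", "ver.02"]))  # unanimous -> accepted
```

On real data, ties and contributor trust scores would also have to be handled; here every judgment is weighted equally.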
In [17], the correlation between case marking information and semantic roles has been shown. That motivated us to include case markings in the framing process. To supply the case marking information of the argument, a drop-down menu containing the six possible case markers in Turkish was added, as shown in Fig. 3. In this section, we have explained the process of framing Turkish verbs by expert annotators, with our systematic decision process for the challenging cases introduced by LV, MWE and rich derivational morphology.

IV. SECOND PASS: INTERPRETING VERB SENSE DISAMBIGUATION RESULTS

The overall aim of this study is to build a corpus with annotated semantic roles. For this purpose we use an existing Turkish Dependency Treebank [15] with morphological and syntactic analysis and add an extra layer with predicate senses and their arguments. For the annotation of verb senses, we crowdsourced a verb sense disambiguation task, where people are asked to choose the appropriate frame or “Hiçbiri (None)” for all the verbs in the treebank. An exemplary question from the original

• Mistakes in the morphological analysis of the predicate, such as analyzing the verb as “sok” (to put) instead of “sokul” (to get near), or “kal” (to stay) instead of “kaldır” (to lift): These erroneous analyses have been corrected and the appropriate sense is chosen by an expert.
• Missing meanings: They are added to PropBank.
• Confusion caused by metaphorical expressions: Verb senses are coarse-grained, thus metaphorical expressions are treated the same way as non-metaphorical expressions, as suggested in the PropBank guidelines [5].

Similarly, we have detected the causes of the low-confidence rows as follows:

4 Cornerstone is also used for building the English, Chinese and Hindi/Urdu PropBanks.

• Fine-grained verb senses: When two senses of the predicate have close meanings, it leads to confusion. Such frames were detected and merged.
• Missing meanings: They are added to PropBank.
• Confusing the meaning of the complete sentence with the meaning of the verb in question: These are revised and annotated by an expert.

Figure 4: A question from the Verb Sense Disambiguation task. Corresponding English translations are shown near the original text, starting with (En)

V. CONCLUSION

In conclusion, we have presented a new linguistic resource, the Turkish verb lexicon, which consists of the verbs and their arguments that are present in the Turkish Dependency Treebank [15] and the verbs that are frequently used but not present in the Treebank. A total of 759 verb roots and 1262 verb senses are annotated. We have explained our approach to framing light verbs and multiword expressions, which can be inherited by other languages where light verb constructions are as common as in Turkish. We have presented a different approach to the framing problem, with a two-step solution to ensure the quality and quantity of the lexicon. In the first pass, the framing guidelines explained in Section III were constructed and expert annotators framed 1135 verb senses. In the second pass, the results from a crowdsourced verb sense disambiguation task were incorporated to improve the quality of the verb lexicon as well as to increase its coverage rate. As a result, the number of annotated verb frames increased from 675 to 759 and the total number of annotated senses increased from 1135 to 1262. As future work, we plan to construct a NominalBank to account for the verbs that are derived from nouns, and to crowdsource an argument annotation task where people will be asked to choose the most appropriate label for the verb sense given in the question. The work explained in this paper presented the first and most important step in creating the necessary resources for Turkish to be included in the semantic role labeling task. We believe that these resources will drive the NLP community toward building semantic role labelers that cover a wider range of language families, and will enable the community to work on a more challenging language.

REFERENCES

[1] Eneko Agirre, Izaskun Aldezabal, Jone Etxeberria and Eli Pociello. 2006. A Preliminary Study for Building the Basque PropBank. In LREC 2006, Genoa.
[2] Yeşim Aksan and Mustafa Aksan. 2012. Construction of the Turkish National Corpus (TNC). In LREC 2012, İstanbul.
[3] Izaskun Aldezabal, María Jesús Aranzabe, Arantza Díaz de Ilarraza Sánchez and Ainara Estarrona. 2010. Building the Basque PropBank. In LREC 2010, Malta.
[4] Nart B. Atalay, Kemal Oflazer and Bilge Say. 2003. The Annotation Process in the Turkish Treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora, Budapest.
[5] Olga Babko-Malaya. 2005. Guidelines for Propbank Framers. http://verbs.colorado.edu/∼mpalmer/projects/ace/FramingGuidelines.pdf
[6] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Vol. 1, PA, USA, 86-90.
[7] M. Butt. 2004. The Light Verb Jungle. Papers from the GSAS/Dudley House Workshop on Light Verbs. Cambridge, Harvard Working Papers in Linguistics: 1-49.
[8] Jinho D. Choi, Claire Bonial and Martha Palmer. 2010. Propbank Frameset Annotation Guidelines Using a Dedicated Editor, Cornerstone. In LREC 2010, Malta.
[9] Mona Diab, Alessandro Moschitti and Daniele Pighin. 2008. Semantic Role Labeling Systems for Arabic Language using Kernel Methods. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008.
[10] Christiane Fellbaum, Anne Osherson and Peter E. Clark. 2007. Putting Semantics into WordNet's ”Morphosemantic” Links. Computing Reviews, 24(11):503–512.
[11] Ana-Maria Giuglea and Alessandro Moschitti. 2006. Semantic Role Labeling via FrameNet, VerbNet and PropBank. In Proceedings of the 21st International Conference on Computational Linguistics, pp. 929-936, 2006.
[12] Abdelati Hawwari, Wajdi Zaghouani, Tim O'Gorman, Ahmed Badran and Mona Diab. 2013. Building a lexical semantic resource for Arabic morphological Patterns. In Communications, Signal Processing, and their Applications (ICCSPA).
[13] Mehmet Hengirmen. 2004. Türkçe Dilbilgisi. Engin Yayınevi.
[14] Jena D. Hwang, Archna Bhatia, Clare Bonial, Aous Mansouri, Ashwini Vaidya, Nianwen Xue, and Martha Palmer. 2010. PropBank annotation of multilingual light verb constructions. In Proceedings of the Fourth Linguistic Annotation Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA, 82-90.
[15] Kemal Oflazer, Bilge Say, Dilek Z. Hakkani-Tür and Gökhan Tür. 2003. Building a Turkish Treebank. Invited chapter in Building and Exploiting Syntactically Annotated Corpora, Anne Abeille, Editor, Kluwer Academic Publishers.
[16] Martha Palmer, Paul Kingsbury and Daniel Gildea. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. In Computational Linguistics, 31(1):71–106.
[17] Gozde Gul Isguder Sahin and Esref Adalı. 2014. Using Morphosemantic Information in Construction of a Pilot Lexical Semantic Resource for Turkish. In Proceedings of the 21st International Conference on Computational Linguistics, pp. 929-936, 2014.
[18] Karin K. Schuler. 2006. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD diss., University of Pennsylvania.
[19] Turkish Language Association. 2005. Güncel Türkçe Sözlük (Contemporary Turkish Dictionary). http://www.tdk.gov.tr/index.php?option=com gts&view=gts
[20] Aygül Uçar. 2010. Light Verb Constructions in Turkish Dictionaries: Are They Sub-meanings of Polysemous Verbs? Mersin University Journal of Linguistics and Literature, 7(1), 1-17.
[21] Shumin Wu.
2013. Semantic Role Labeling Tutorial: Supervised Machine Learning Methods. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

A free/open-source hybrid morphological disambiguation tool for Kazakh

Zhenisbek Assylbekov∗, Jonathan North Washington†, Francis Tyers‡, Assulan Nurkas∗, Aida Sundetova§, Aidana Karibayeva§, Balzhan Abduali§, Dina Amirova§

∗ School of Science and Technology, Nazarbayev University
† Departments of Linguistics and Central Eurasian Studies, Indiana University
‡ HSL-fakultehta, UiT Norgga árktalaš universitehta
§ Information Systems Department, Al-Farabi Kazakh National University

Abstract—This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation, demonstrating a per-token accuracy of 91% in running text.

II. Kazakh

Kazakh (natively қазақ тілі, қазақша) is a Turkic language belonging to the Kypchak (or Qıpçaq) branch, closely related to Nogay (or Noğay) and Qaraqalpaq. It is spoken by around 13 million people in Kazakhstan, China, Mongolia, and adjacent areas [7]. Kazakh is an agglutinative language, which means that words are formed by joining suffixes to the stem. A Kazakh word can thus correspond to English phrases of various lengths, as shown below:

I. Introduction

In this paper, we present a free/open-source hybrid morphological disambiguation tool for Kazakh. Morphological disambiguation is the task of selecting the sequence of morphological parses corresponding to a sequence of words, from the set of possible parses for those words.
Morphological disambiguation is an important step for a number of NLP tasks, and this importance becomes more crucial for agglutinative languages such as Kazakh, Turkish, Finnish, Hungarian, etc. For example, by using a morphological analyzer together with a disambiguator, the perplexity of a Turkish language model can be reduced significantly [1]. Kazakh (as well as any morphologically rich language) presents an interesting problem for statistical natural language processing, since the number of possible morphological parses is very large due to the productive derivational morphology [2, 3]. In this work we combine rule-based [4] and statistical [5] approaches to disambiguate a Kazakh text: the output of a morphological analyzer is pre-processed using constraint-grammar rules [6], and then the most probable sequence of analyses is selected. Our combined approach works well even with a small hand-annotated training corpus. The performance of the presented hybrid system can likely be improved further when a larger hand-tagged corpus becomes available. In Section II, we present relevant properties of Kazakh. Then, in Section III, we review the related work on part-of-speech (POS) tagging and morphological disambiguation. In Section IV, we describe the statistical model for morphological disambiguation. We finally present and discuss our results in Section V.

дос — friend
достар — friends
достарым — my friends
достарымыз — our friends
достарымызда — at our friends
достарымыздамыз — we are at our friends

The effect of rich morphology can be observed in parallel Kazakh-English texts. The table below provides the vocabulary sizes, type-token ratios (TTR) and out-of-vocabulary (OOV) rates of the Kazakh and English sides of a parallel corpus used in [8].
                  | English | Kazakh
Vocabulary size   | 18,170  | 35,984
Type-token ratio  | 3.8%    | 9.8%
OOV rate          | 1.9%    | 5.0%

It is easy to see that rich morphology leads to sparse data problems for statistical natural language processing of Kazakh, be it in machine translation, text categorization, sentiment analysis, etc. A common approach (see [9, 10, 11, 12]) applied to morphologically rich languages is to convert surface forms into lexical forms (i.e. analyze words), and then perform some morphological segmentation of the lexical forms (i.e. split the analyses). The segmentation schemes are usually motivated by linguistics and the domain of intended use. For example, for a Kazakh-English word alignment task we could be in favor of the following segmentation of the above-mentioned word достарымыздамыз1:

достар → дос⟨n⟩⟨pl⟩ 'friends'
ымыз → ⟨px1pl⟩ 'our'
да → ⟨loc⟩ 'at'

same idea is present in [16]. One of the most well-known corpora, the Brown corpus, was automatically pre-tagged with a rule-based tagger, TAGGIT [17]. The earliest probabilistic tagger known to us is [18]. One of the first Markov Model taggers was created at the University of Lancaster as part of the Lancaster-Oslo-Bergen corpus tagging effort [19, 20]. The type of Markov Model tagger that tags based on both word probabilities and tag transition probabilities was introduced by Church [21] and DeRose [22]. All these taggers are trained on hand-tagged data. Kupiec [23], Cutting et al. [24], and others show that it is also possible to train a Hidden Markov Model (HMM) tagger on unlabeled data, using the EM algorithm [25]. An experiment by Merialdo [26], however, indicates that with even a small amount of training data, a tagger trained on hand-tagged data worked better than one trained via EM. Other notable approaches in POS tagging are Brill's transformation-based learning paradigm [27], the memory-based tagging paradigm [28], and the maximum entropy-based approach [29].
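The type-token ratio and OOV rate reported in the table earlier in this section can be computed directly from tokenized text; the snippet below is a generic illustration on a toy sample (the paper's figures come from the parallel corpus of [8], not from this data):

```python
def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def oov_rate(test_tokens, train_vocab):
    """Share of test tokens whose surface form never occurs in the training vocabulary."""
    unseen = sum(1 for t in test_tokens if t not in train_vocab)
    return unseen / len(test_tokens)

# Toy Kazakh sample built from the дос 'friend' paradigm shown above.
train = ["дос", "достар", "дос", "достарым"]
test  = ["дос", "достарымыз", "достар", "достарымызда"]

print(type_token_ratio(train))     # 3 types over 4 tokens = 0.75
print(oov_rate(test, set(train)))  # 2 of 4 forms unseen = 0.5
```

Even in this toy example the agglutinative forms inflate both measures, which is exactly the sparsity problem the segmentation approach is meant to address.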
мыз → +e⟨cop⟩⟨p1⟩⟨pl⟩ 'are we'

since each segment of the Kazakh word would then correspond to a single word in English. The problem is that often, for a word in Kazakh, we have more than one way to analyze it, as in the example below:

'in 2009 , we started the construction works .'
2009 жылы біз құрылысты бастадық .

жылы → жылы⟨adj⟩ 'warm'
жылы → жылы⟨adj⟩⟨advl⟩ 'warmly'
жылы → жыл⟨n⟩⟨px3sp⟩⟨nom⟩ 'year'
жылы → жылы⟨adj⟩⟨subst⟩⟨nom⟩ 'warmth'

Selecting the correct analysis from among all possible analyses is called morphological disambiguation. Due to productive derivational morphology, this task itself suffers from data sparseness. To alleviate the data sparseness problem we break down the full analyses into smaller units – inflectional groups. An inflectional group is a tag sequence split by a derivation boundary. For example, in the sentence that follows, the word айналасындағыларға 'to the ones in his vicinity' is split into a root r and two inflectional groups, g1 and g2, the first containing the tags before the derivation boundary -ғы and the second containing the derivation boundary and subsequent tags. Morphological disambiguation in inflectional or agglutinative languages with complex morphology involves determining not only the major or minor parts-of-speech, but also all relevant lexical and morphological features of surface forms. Levinger et al. [30] suggested an approach for the morphological disambiguation of Hebrew. Hajič and Hladká [31] used a maximum entropy modeling approach for the morphological disambiguation of Czech, an inflectional language. Hajič [32] extended this work to 5 other languages including English and Hungarian (an agglutinative language). Ezeiza et al. [33] combined stochastic and rule-based disambiguation methods for Basque, which is also an agglutinative language. Megyesi [34] adapted Brill's POS tagger with extended lexical templates to Hungarian.

Жəңгір хан мен оның айналасындағыларға . . .
(айнала)·(сын·да)·(ғы·лар·ға) → (айнала)·(n·px3sp·loc)·(subst·pl·dat), where the root r is айнала, the first inflectional group g1 is n·px3sp·loc, and the second inflectional group g2 is subst·pl·dat.

We will heavily exploit the following observation about dependency relationships, which was made by Hakkani-Tür et al. [5, p. 387] for Turkish but is valid for Kazakh as well: when a word is considered to be a sequence of inflectional groups, syntactic relation links only emanate from the last inflectional group of a (dependent) word, and land on one of the inflectional groups of the (head) word on the right. Of all the languages that are widely researched nowadays, Turkish is the closest one to Kazakh. Previous approaches to the morphological disambiguation of Turkish text had employed constraint-based methods (Oflazer and Kuruöz [35]; Oflazer and Tür [36, 37]), statistical methods (Hakkani-Tür et al. [5], Sak et al. [38]), or both (Yuret and Türe [39], Kutlu and Cicekli [40]).

III. Related work

Recently, some work has been done towards developing morphological disambiguation tools for Kazakh. Salimzyanov et al. [4] provide constraint grammar rules which reduce ambiguity from 2.4 to 1.4 analyses per form in running text. Makhambetov et al. [41] present a comparison of part-of-speech taggers trained on the Kazakh National Corpus [42]: the best result obtained, using the full training data of around 600,000 tokens, was a per-token accuracy of 86% when cross-validated on the same training data with 10 folds. Kessikbayeva and Cicekli [43] present a transformation-based morphological disambiguator for Kazakh which is trained on a hand-annotated corpus of over 30,000 words and achieves 87% accuracy when tested against test data of around 15,000 words. Morphological disambiguation of inflectional and agglutinative languages was inspired by part-of-speech (POS) tagging techniques. Due to Chomsky's criticism of the inadequacies of Markov models [14, ch.
3], and the lack of training data and computing resources to pursue an 'empirical' approach to natural language, early work on POS tagging using Markov chains had been largely abandoned by the early sixties. The earliest 'taggers' were simply programs that looked up the category of words in a dictionary. The first well-known program which attempted to assign tags based on syntagmatic contexts was the rule-based program presented in [15].

IV. Statistical morphological disambiguation

Following [44], we will use the notation in Table I.

TABLE I: Notation

  w_i        the word (token) at position i in the corpus
  t_i        the tag of w_i
  w_{i,i+m}  the words occurring at positions i through i+m
  t_{i,i+m}  the tags t_i ... t_{i+m} for w_i ... w_{i+m}
  r_i        the root of w_i
  g_{i,k}    the k-th inflectional group of w_i
  n          the length of a text chunk (a sentence, a paragraph or a whole text)
  w          the words w_{1,n} of a text chunk
  t          the tags t_{1,n} for w_{1,n}

We use subscripts to refer to words and tags in particular positions of the sentences and corpora we tag. We use superscripts to refer to word types in the lexicon of words and to tag types in the tag set. The basic mathematical object with which we deal here is the joint probability distribution Pr(W = w, T = t), where the random variables W and T are a sequence of words and a sequence of tags. We also consider various marginal and conditional probability distributions that can be constructed from Pr(W = w, T = t), especially the distribution Pr(T = t). We generally follow the common convention of using uppercase letters to denote random variables and the corresponding lowercase letters to denote specific values that the random variables may take. When there is no possibility for confusion, we write Pr(w, t), and use similar shorthands throughout.

In this compact notation, morphological disambiguation is the problem of selecting the sequence of morphological parses (including the root), t = t_1 t_2 ... t_n, corresponding to a sequence of words w = w_1 w_2 ... w_n, from the set of possible parses for these words:

  \arg\max_t \Pr(t \mid w).   (1)

Using Bayes' rule and taking into account that w is constant for all possible values of t, we can rewrite (1) as:

  \arg\max_t \frac{\Pr(t)\,\Pr(w \mid t)}{\Pr(w)} = \arg\max_t \Pr(t)\,\Pr(w \mid t).   (2)

In Kazakh, given a morphological analysis² including the root, there is only one surface form that can correspond to it; that is, there is no morphological generation ambiguity. Therefore Pr(w | t) = 1, and the morphological disambiguation problem (2) is simplified to finding the most probable sequence of parses:

  \arg\max_t \Pr(t).   (3)

² We use the terms morphological analysis and parse interchangeably, to refer to individual distinct morphological parses of a token.

Keep in mind that the search space in equations (1)–(3) is not equal to the set of all hypothetically possible sequences t. Instead it is limited to the set of parse sequences that can correspond to w. Such a limited set is obtained as the full or constrained output of a morphological analysis tool.

A. Derivation

Using the chain rule, the probability in (3) can always be rewritten as:

  \Pr(t) = \prod_{i=1}^{n} \Pr(t_i \mid t_{1,i-1}).   (4)

It is important to realize that equation (4) is not an approximation. We are simply asserting that when we generate a sequence of parses, we can first choose the first analysis, then choose the second parse given our knowledge of the first parse, then select the third analysis given our knowledge of the first two parses, and so on. As we step through the sequence, at each point we make our next choice given our complete knowledge of all our previous choices.

The conditional probabilities on the right-hand side of equation (4) cannot all be taken as independent parameters, because there are too many of them. In the bigram model, we assume that

  \Pr(t_i \mid t_{1,i-1}) \approx \Pr(t_i \mid t_{i-1}),

that is, that the current analysis depends only on the previous one. With this assumption we get:

  \Pr(t) \approx \prod_{i=1}^{n} \Pr(t_i \mid t_{i-1}).   (5)

However, the probabilities on the right-hand side of this equation still cannot be taken as parameters, since the number of possible analyses is very large in morphologically rich languages. Following the discussion in Section II, we split morphological parses across their derivational boundaries, i.e. we consider a morphological analysis as a sequence of a root (r_i) and inflectional groups (g_{i,k}), so that each parse t_i can be represented as (r_i, g_{i,1}, ..., g_{i,n_i}). Then the probabilities Pr(t_i | t_{i-1}) can be rewritten by the chain rule:

  \Pr(t_i \mid t_{i-1})
    = \Pr(r_i \mid r_{i-1}, g_{i-1,1}, \ldots, g_{i-1,n_{i-1}})
      \times \Pr(g_{i,1} \mid r_{i-1}, g_{i-1,1}, \ldots, g_{i-1,n_{i-1}}, r_i)
      \times \cdots
      \times \Pr(g_{i,n_i} \mid r_{i-1}, g_{i-1,1}, \ldots, g_{i-1,n_{i-1}}, r_i, g_{i,1}, \ldots, g_{i,n_i-1}).   (6)

In order to simplify this representation we introduce the following independence assumptions:

  \Pr(r_i \mid r_{i-1}, g_{i-1,1}, \ldots, g_{i-1,n_{i-1}}) \approx \Pr(r_i \mid r_{i-1}),   (7)

  \Pr(g_{i,k} \mid r_{i-1}, g_{i-1,1}, \ldots, g_{i-1,n_{i-1}}, r_i, g_{i,1}, \ldots, g_{i,k-1}) \approx \Pr(g_{i,k} \mid g_{i-1,n_{i-1}}),   (8)

i.e. we assume that the root of the current parse depends only on the root of the previous parse, and that each inflectional group in the current parse depends only on the last inflectional group of the previous parse (this last assumption is motivated by the remark at the end of Section II).
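To make the representation behind these assumptions concrete, the splitting of a full morphological analysis into a root plus inflectional groups can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the set of boundary tags here is a simplified subset of the Apertium derivational tags listed later in the paper, and the tag strings are hypothetical.

```python
# Illustrative sketch (not the authors' code): splitting a morphological
# analysis into a root and inflectional groups at derivational boundaries.
# BOUNDARY_TAGS is a simplified subset of the Apertium derivational tags.
BOUNDARY_TAGS = {"subst", "attr", "advl", "ger"}

def split_parse(root, tags):
    """Return (root, [IG_1, ..., IG_k]) for a flat tag sequence."""
    groups, current = [], []
    for tag in tags:
        if tag in BOUNDARY_TAGS and current:  # a derivational tag opens a new IG
            groups.append(tuple(current))
            current = []
        current.append(tag)
    groups.append(tuple(current))
    return root, groups

# айналасындағыларға: root 'айнала', IGs <n><px3sp><loc> and <subst><pl><dat>
root, igs = split_parse("айнала", ["n", "px3sp", "loc", "subst", "pl", "dat"])
```

Under the assumptions (7)–(8), only the root and the last element of the IG list of the previous word condition the current word's parse.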
Now, from (6), (7), and (8) we get:

  \Pr(t_i \mid t_{i-1}) \approx \Pr(r_i \mid r_{i-1}) \prod_{k=1}^{n_i} \Pr(g_{i,k} \mid g_{i-1,n_{i-1}}) =: \Pr_b(t_i \mid t_{i-1}),   (9)

where we define r_0 = '.' and g_{0,n_0} = '<sent>'. Putting together (5) and (9), we have:

  \Pr(t) \approx \prod_{i=1}^{n} \Pr(t_i \mid t_{i-1}) \approx \prod_{i=1}^{n} \left[ \Pr(r_i \mid r_{i-1}) \prod_{k=1}^{n_i} \Pr(g_{i,k} \mid g_{i-1,n_{i-1}}) \right] =: \Pr_b(t).   (10)

Here Pr(r^l | r^m) and Pr(g^l | g^m) are parameters (root and IG probabilities) which can be estimated using manually disambiguated texts.

B. Parameters estimation

Assume we are observing a sequence of n tokens w_1, w_2, ..., w_n, and each token was manually disambiguated, i.e. we possess a sequence of corresponding parses t_1, t_2, ..., t_n. Then the likelihood of our data is given by equation (10), and in order to find maximum likelihood estimates for the parameters Pr(r^l | r^m) and Pr(g^l | g^m) we need to solve the following optimization problem:

  \prod_{i=1}^{n} \left[ \Pr(r_i \mid r_{i-1}) \prod_{k=1}^{n_i} \Pr(g_{i,k} \mid g_{i-1,n_{i-1}}) \right] \longrightarrow \max   (11)

subject to the constraints

  \sum_{l} \Pr(r^l \mid r^m) = 1, \qquad \sum_{l} \Pr(g^l \mid g^m) = 1.   (12)

Using the method of Lagrange multipliers [45] one can show that the solution of (11) subject to the constraints (12) is given by:

  \Pr_{MLE}(r^l \mid r^m) = \frac{C(r^m, r^l)}{C(r^m)}, \qquad \Pr_{MLE}(g^l \mid g^m) = \frac{C(g^m, g^l)}{C(g^m)},   (13)

where C(r^m) is the number of occurrences of r^m, C(r^m, r^l) is the number of occurrences of r^m followed by r^l, C(g^m) is the number of occurrences of g^m, and C(g^m, g^l) is the number of parses with g^m as the last IG followed by a parse containing g^l.

However, the maximum likelihood estimates suffer from the following problem: what if a bigram has not been seen in training, but then shows up in the test data? Using the formulas (13) we would assign unseen bigrams a probability of 0. Such an approach is not very useful in practice. If we want to compare different possible parses for a sentence, and all of them contain unseen bigrams, then each of these parses receives a model estimate of 0, and we have nothing interesting to say about their relative quality. Since we do not want to give any sequence of words zero probability, we need to assign some probability to unseen bigrams. Methods for adjusting the empirical counts that we observe in the training corpus to the expected counts of n-grams in previously unseen text involve smoothing, interpolation and back-off; they have been discussed by Good [46], Gale and Sampson [47], Witten and Bell [48], Kneser and Ney [49], and Chen and Goodman [50]. The last of these papers presents an extensive empirical comparison of several widely-used smoothing techniques and introduces a variation of Kneser–Ney smoothing that consistently outperforms all other algorithms evaluated. We used it for estimating the parameters of the bigram model (10).

C. Tagging with the Viterbi algorithm

Once the parameters are estimated, we could evaluate the bigram model (10) for all possible parse sequences t_{1,n} of a sentence of length n, but that would make tagging exponential in the length of the input. An efficient tagging algorithm is the Viterbi algorithm (Algorithm 1).

Algorithm 1: Algorithm for tagging
Require: a sentence w_{1,n} of length n
Ensure: a sequence of analyses t_{1,n}
   1: δ_0(('.', <sent>)) = 1.0
   2: δ_0(t) = 0.0 for t ≠ ('.', <sent>)
   3: for i = 1 to n step 1 do
   4:   for all candidate parses t^j do
   5:     δ_i(t^j) = max_{t^k} [δ_{i−1}(t^k) × Pr_b(t^j | t^k)]
   6:     ψ_i(t^j) = argmax_{t^k} [δ_{i−1}(t^k) × Pr_b(t^j | t^k)]
   7:   end for
   8: end for
   9: X_n = argmax_{t^j} δ_n(t^j)
  10: for j = n − 1 to 1 step −1 do
  11:   X_j = ψ_{j+1}(X_{j+1})
  12: end for

It has three steps: initialization (lines 1–2), induction (lines 3–8), and termination with path readout (lines 9–12). We compute two functions: δ_i(t^j), which gives us the probability of parse t^j for word w_i, and ψ_{i+1}(t^j), which gives us the most likely parse at word w_i given that we have the parse t^j at word w_{i+1}. A more detailed discussion of the Viterbi algorithm for tagging is provided in [51].
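As a concrete illustration, the count-based estimates of (13) and the Viterbi search of Algorithm 1 can be sketched in Python. This is a toy sketch, not the authors' implementation: parses are simplified to a single (root, IG) pair per word, the mini training data is hypothetical, and a small probability floor stands in for the Kneser–Ney smoothing that the paper estimates with SRILM.

```python
# Toy sketch (not the authors' implementation) of the MLE estimates (13)
# and of Viterbi decoding; one inflectional group per word for brevity.
from collections import Counter

def mle_bigrams(sequences):
    """Return p(l, m) = C(m, l) / C(m), estimated from training sequences."""
    uni, bi = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            uni[a] += 1
            bi[(a, b)] += 1
    return lambda l, m: bi[(m, l)] / uni[m] if uni[m] else 0.0

def viterbi(candidates, p_root, p_ig, floor=1e-6):
    """candidates: for each word, a list of (root, ig) parses."""
    delta = {(".", "<sent>"): 1.0}          # initialization
    backpointers = []
    for cands in candidates:                # induction
        new_delta, psi = {}, {}
        for t in cands:
            scores = {prev: d * max(p_root(t[0], prev[0]), floor)
                               * max(p_ig(t[1], prev[1]), floor)
                      for prev, d in delta.items()}
            best_prev = max(scores, key=scores.get)
            new_delta[t], psi[t] = scores[best_prev], best_prev
        delta = new_delta
        backpointers.append(psi)
    best = max(delta, key=delta.get)        # termination
    path = [best]
    for psi in reversed(backpointers[1:]):  # path readout
        path.append(psi[path[-1]])
    return list(reversed(path))

# Hypothetical mini training data, then an ambiguous token to disambiguate:
p_r = mle_bigrams([[".", "ол", "кел"]])
p_g = mle_bigrams([["<sent>", "prn.nom", "v.past"]])
best = viterbi([[("ол", "prn.nom"), ("ол", "det.dem")],
                [("кел", "v.past")]], p_r, p_g)
```

Raw counts as computed here only cover bigrams seen in training; in the paper, smoothed SRILM estimates take their place.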
V. Experiments and results

A. Training and test data

We selected the thirteen most viewed articles from Kazakh Wikipedia according to 2014 page counts data (see Table II), and used all of them except 'Басты бет', 'CERN', and 'Жапония префектуралары' to create a training set³. This totaled approximately 12.5K words (15.7K tokens).

³ 'Басты бет' is not an article, it is the main page of Kazakh Wikipedia; the articles 'CERN' and 'Жапония префектуралары' do not contain much text.

TABLE II: Most viewed articles of Kazakh Wikipedia in 2014

  Article title                               Views      Tokens
  Басты бет                                   1,674,069  –
  Жапония                                     877,693    3,211
  Біріккен Ұлттар Ұйымы                       807,058    793
  CERN                                        648,464    –
  Иран                                        602,001    2,879
  Жапония префектуралары                      551,394    –
  Футболдан əлем чемпионаты 2014              333,988    257
  Жапония Ұлттық футбол құрама командасы      321,249    146
  Eurovision əн конкурсы 2010                 312,183    101
  Абай Құнанбайұлы                            242,151    4,083
  Радиан                                      187,225    39
  Жасуша                                      145,010    1,789
  Шоқан Шыңғысұлы Уəлиханов                   119,780    2,408
  Total                                                  15,706

We performed morphological analysis of our texts using the open-source finite-state morphological transducer apertium-kaz [52]. It is based on the Helsinki Finite-State Toolkit and is available within the Apertium project [13]. The analysis was carried out by calling the lt-proc command of Lttoolbox [53]. A preliminary disambiguation was performed with Constraint Grammar rules [6] by calling the cg-proc command, which decreased ambiguity from 2.4 to 1.4 analyses per form on average. The remaining disambiguation was done manually in the following way: the texts were disambiguated independently by two different annotators. Unfortunately, spot-checking the annotations showed that they were rather noisy, mainly due to the lack of annotation guidelines. The most common mistakes were connected with:

• choosing between <attr> (attributive) and <nom> (nominative) in noun-noun compounds: e.g. in көрші елдер 'neighbouring countries' the word көрші 'neighbour' should be tagged as <n><attr> (attributive noun), but in əлем чемпионаты 'world championship' the word əлем 'world' should be tagged as <n><nom> (noun in nominative case);

• choosing between <cnjcoo> (conjunction) and <postadv> (postadverb) for the words да/де/та/те: e.g. in Үстелде қалам да, қарындаш та, дəптер де жатыр 'There are a pen, a pencil and a notebook on the table' they should be tagged as <cnjcoo>, but in Мен де барамын 'I will also go' де should be tagged as <postadv>;

• choosing between <det><dem> (demonstrative determiner) and <prn> (pronoun) for the words бұл, мынау, осы, мына, анау, ана, сол 'this, that': e.g. in Мынау үй жаңа 'This house is new' the word мынау should be tagged as <det><dem>, but in Мынау – терезе емес 'This is not a window' it should be tagged as <prn>;

• choosing between <ger> (gerund) and <n> (noun) for verbs in dictionary form: e.g. in Кітап оқу адамдарды ақылдырақ етеді 'Reading books makes people wiser' the word оқу 'to read' should be tagged as <ger>, but in Оқу басталды 'Classes began' the word оқу 'study' should be tagged as <n>.

Based on these and other types of annotation mistakes we developed a set of guidelines⁴ and asked the annotators to resolve the differences in their annotations, fixing them where necessary using these guidelines.

In order to enrich our model with more roots, we extracted unambiguous sequences of 1,509,480 tokens from a corpus of 2,128,642 tokens and used these unambiguous sequences, in addition to the hand-annotated texts from Table II, for estimating root probabilities.

For our test data we selected several texts from the free/open-source Kazakh treebank [54], which is based on universal dependency (UD) annotation standards. These texts are morphologically disambiguated and annotated manually for dependency structure, but for our purposes we used only the morphological annotations. We made sure that the document 'wikipedia' does not overlap with our training data. The composition of the test data is given in Table III:

TABLE III: Test data

  Document               Description                        Tokens
  Шымкент                Wikipedia article (Shymkent)       168
  story                  Story for language learners        404
  wikitravel             Phrases from Wikitravel            177
  Өлген_қазан            Folk tale from Wikisource          134
  wikipedia              Random sentences from Wikipedia    559
  Ер_төстік              Folk tale from Wikisource          206
  Жиырма_Бесінші_Сөз     Philosophical text                 435
  TOTAL                                                     2,071

B. Training the model

We used the SRILM toolkit [55, 56] to estimate the root and IG probabilities Pr(r^l | r^m) and Pr(g^l | g^m), respectively. A few words need to be said about the way we prepared root and IG sequences for feeding into SRILM. First of all, we used the following tags from the Apertium tagset to split analyses across derivational boundaries: <subst> (substantive, like a noun), <attr> (attributive, like an adjective), <advl> (adverbial, like an adverb), <ger_*> (gerunds in different tenses), <gpr_*> (verbal adjectives in different tenses), <gna_*> (verbal adverbs in different tenses), <prc_*> (participles in different tenses), and <ger> (gerund)⁵.

Now assume that, using the notation from Section IV, a hand-annotated (or unambiguous) text chunk of length n is represented as {(r_i, g_{i,1}, ..., g_{i,n_i})}, i = 1, ..., n. Then we form root-bigrams as

  (r_1, r_2), (r_2, r_3), ..., (r_{i-1}, r_i), ..., (r_{n-1}, r_n),

and we form IG-bigrams as follows:

  (g_{1,n_1}, g_{2,1}), (g_{1,n_1}, g_{2,2}), ..., (g_{1,n_1}, g_{2,n_2}),
  (g_{2,n_2}, g_{3,1}), (g_{2,n_2}, g_{3,2}), ..., (g_{2,n_2}, g_{3,n_3}),
  ...
  (g_{i-1,n_{i-1}}, g_{i,1}), (g_{i-1,n_{i-1}}, g_{i,2}), ..., (g_{i-1,n_{i-1}}, g_{i,n_i}),
  ...
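The bigram extraction just described can be sketched as follows (an illustrative sketch, not the authors' preprocessing code; the IG label strings are hypothetical). The last inflectional group of the previous word pairs with every inflectional group of the current word, while roots pair consecutively.

```python
# Illustrative sketch (not the authors' code): forming root-bigrams and
# IG-bigrams from a sequence of parses (root, [ig_1, ..., ig_n]).
def make_bigrams(parses):
    root_bigrams, ig_bigrams = [], []
    for (r_prev, igs_prev), (r_cur, igs_cur) in zip(parses, parses[1:]):
        root_bigrams.append((r_prev, r_cur))
        # last IG of the previous word pairs with each IG of the current word
        ig_bigrams.extend((igs_prev[-1], g) for g in igs_cur)
    return root_bigrams, ig_bigrams

parses = [("айнала", ["n.px3sp.loc", "subst.pl.dat"]),
          ("хан", ["n.nom"])]
rb, gb = make_bigrams(parses)
```

The resulting bigram lists are what a language-modeling toolkit such as SRILM would then consume for estimating the root and IG probabilities.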
⁴ Available at http://wiki.apertium.org/wiki/Annotation_guidelines_for_Kazakh
⁵ A detailed description of the Turkic tagset in the Apertium project is given at http://wiki.apertium.org/wiki/Turkic_lexicon

The way we form the above bigrams is dictated by the assumptions from Section IV: the root of the current parse depends only on the root of the previous parse, and each inflectional group in the current parse depends only on the last inflectional group of the previous parse.

C. Results

Once our model was trained, i.e. its parameters were estimated, we analyzed the test data with apertium-kaz [52] and applied Algorithm 1 to its output. The accuracy results are given in the column 'Tagger' of Table IV. As one can see, the performance of this purely statistical approach is barely satisfactory (e.g. compared to the state of the art for Turkish [38]). This is mainly due to the relatively small amount of available hand-tagged corpora for Kazakh. However, if we preprocess the output of the transducer using the CG rules [4] and then simply select the first analysis for each ambiguous token, the accuracy is around 87% on our test set (see column 'CG' in Table IV), which is comparable to the previous results [41, 43] for Kazakh morphological disambiguation. Combining the rule-based and statistical approaches, i.e. preprocessing the transducer's output with CG and then selecting the most probable parses based on the statistical model, yields around 91% accuracy (see column 'CG+Tagger' in Table IV).

TABLE IV: Accuracy results in %

  Document               Tagger   CG      CG+Tagger
  Шымкент                88.46    89.74   92.95
  story                  76.49    84.16   88.61
  wikitravel             71.75    80.23   87.57
  Өлген_қазан            88.81    88.06   91.79
  wikipedia              93.92    93.56   95.89
  Ер_Төстік              85.92    83.01   91.26
  Жиырма_Бесінші_Сөз     81.84    85.52   85.98
  TOTAL                  84.55    87.20   90.73

However, keep in mind that for a fair comparison of our approach with the previously developed methods one needs to use the same tagset and to test against the same data, which is currently not feasible, since both previous works on morphological disambiguation for Kazakh ([41] and [43]) have released neither their tools nor their data for open access.

Let us perform an example of error analysis for the 'CG+Tagger' configuration. One of the most common errors was choosing <n><nom> instead of <n><attr>: e.g. in жəне көрші аймақтардың 'and of neighboring regions', analyzed as жəне <cnjcoo>, көрші <n><attr>, аймақтардың <n><pl><gen>, the word көрші 'neighbor' was mistakenly tagged as <n><nom>. A closer look at the IG log-probabilities reveals:

  log Pr(n | cnjcoo) = −1.617432
  log Pr(attr | cnjcoo) = −1.485425
  log Pr(n.pl.gen | attr) = −1.808777
  log Pr(n.nom | cnjcoo) = −0.7627025
  log Pr(n.pl.gen | n.nom) = −3.236619

and we can see that, although a noun in a non-possessive form is more likely after an attributive noun than after a noun in nominative case, due to the split of the analysis <n><attr> into two inflectional groups the wrong parse gets a higher overall probability:

  Pr(cnjcoo, n.attr, n.pl.gen) = Pr(n | cnjcoo) Pr(attr | cnjcoo) Pr(n.pl.gen | attr) = 10^{−4.911634}
    < 10^{−3.9993215} = Pr(n.nom | cnjcoo) Pr(n.pl.gen | n.nom) = Pr(cnjcoo, n.nom, n.pl.gen)

This observation leads to the following suggestion: perhaps we should try not splitting <n><attr>, but rather treating it as <adj> (an adjective) during training and tagging. Since we can always distinguish between noun and adjective in Kazakh [57], a word theoretically cannot have both <n><attr> and <adj> as possible analyses, and thus our suggested replacement can be back-substituted without causing any additional ambiguity. This might work for other errors as well, e.g. when the tagger mistakenly prefers <adv> (adverb) over <adj><advl> (adverbial adjective), or <n> (noun) over <adj><subst> (substantivized adjective), etc.

The list of the most common errors for the 'CG+Tagger' configuration also includes selecting:

• <n><nom> (noun) instead of <np><ant><m><nom> (proper noun);
• <cnjcoo> (conjunction) instead of <prn><itg><nom> (interrogative pronoun);
• <det><dem> (demonstrative determiner) instead of <prn><dem><nom> (demonstrative pronoun);
• <prn><dem><pl><nom> instead of <prn><pers><p3><pl><nom>;
• <v><tv><aor><p3><pl> instead of <v><tv><aor><p3><sg>.

VI. Conclusion and future work

We reproduced the previous methods of statistical morphological disambiguation [5] for the case of the Kazakh language in terms of the Apertium tagset. By combining rule-based and statistical approaches, we were able to achieve better accuracy in the task of morphological disambiguation for Kazakh than when these approaches were used separately. Both the tagger and the annotated data are free and available in open access.

In the future, we are planning to improve the performance of the tagger by adding more annotated data and taking into account the suggestions from the previous section. Our results will then directly feed into other work on Kazakh language technology, such as machine translation. Assylbekov and Nurkas [8] made use of the partially-disambiguated output of the morphological analyser to preprocess the Kazakh side of a parallel corpus for statistical machine translation (SMT), achieving an increase in translation quality. We expect that better disambiguation of the analyzer's output will lead to improved performance of the SMT system. We are also planning to apply our disambiguation tool to reduce data sparseness in the task of document and sentence alignment between Kazakh and English or Kazakh and Russian: given accurate transducers and disambiguation tools for English and Russian, we can apply morphological analysis and then morphological disambiguation to both sides of a candidate pair, and then compare the stems in both documents to compute content-based similarity in addition to structural similarity measures, as was done in [58, 59, 60, 61].

Where to find the hand-tagged texts and the tagger

Our morphological disambiguation tool (including the hand-annotated texts) is under the GNU General Public License (GPL) version 3.0⁶: its code and releases can be found at https://svn.code.sf.net/p/apertium/svn/branches/kaz-tagger/.

⁶ http://www.gnu.org/licenses/gpl-3.0.html

Acknowledgements

We would like to thank Daiana Azamat for assisting in the hand-annotation of the texts and for a rigorous derivation of the maximum likelihood estimates (13).

References

[1] D. Yuret and E. Biçici, "Modeling morphologically rich languages using split words and unstructured dependencies," in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, 2009, pp. 345–348.
[2] G. Altenbek and W. Xiao-long, "Kazakh segmentation system of inflectional affixes," in Proceedings of CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2010, pp. 183–190.
[3] A. Makazhanov, O. Makhambetov, I. Sabyrgaliyev, and Z. Yessenbayev, "Spelling correction for Kazakh," in Computational Linguistics and Intelligent Text Processing. Springer, 2014, pp. 533–541.
[4] I. Salimzyanov, J. Washington, and F. Tyers, "A free/open-source Kazakh-Tatar machine translation system," Machine Translation Summit XIV, 2013.
[5] D. Z. Hakkani-Tür, K. Oflazer, and G.
Tür, "Statistical morphological disambiguation for agglutinative languages," Computers and the Humanities, vol. 36, no. 4, pp. 381–410, 2002.
[6] F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila, Constraint Grammar: a language-independent system for parsing unrestricted text. Walter de Gruyter, 1995, vol. 4.
[7] M. P. Lewis, G. F. Simons, and C. D. Fennig, Eds., Ethnologue: Languages of the World. Dallas, Texas: SIL International, 2013.
[8] Z. Assylbekov and A. Nurkas, "Initial explorations in Kazakh to English statistical machine translation," in The First Italian Conference on Computational Linguistics CLiC-it 2014, 2014, p. 12.
[9] N. Habash and F. Sadat, "Arabic preprocessing schemes for statistical machine translation," in Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. Association for Computational Linguistics, 2006, pp. 49–52.
[10] A. Bisazza and M. Federico, "Morphological pre-processing for Turkish to English statistical machine translation," in IWSLT, 2009, pp. 129–135.
[11] C. Mermer, "Unsupervised search for the optimal segmentation for statistical machine translation," in Proceedings of the ACL 2010 Student Research Workshop. Association for Computational Linguistics, 2010, pp. 31–36.
[12] E. Bekbulatov and A. Kartbayev, "A study of certain morphological structures of Kazakh and their impact on the machine translation quality," in Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on. IEEE, 2014, pp. 1–5.
[13] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O'Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, and F. M. Tyers, "Apertium: a free/open-source platform for rule-based machine translation," Machine Translation, vol. 25, no. 2, pp. 127–144, 2011.
[14] N. Chomsky, Syntactic structures.
Walter de Gruyter, 2002.
[15] S. Klein and R. F. Simmons, "A computational approach to grammatical coding of English words," Journal of the ACM (JACM), vol. 10, no. 3, pp. 334–347, 1963.
[16] G. Salton and R. Thorpe, "An approach to the segmentation problem in speech analysis and language translation," in Proceedings of the 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, vol. 2, 1962, pp. 703–724.
[17] B. B. Greene and G. M. Rubin, Automatic grammatical tagging of English. Department of Linguistics, Brown University, 1971.
[18] W. S. Stolz, P. H. Tannenbaum, and F. V. Carstensen, "A stochastic approach to the grammatical coding of English," Communications of the ACM, vol. 8, no. 6, pp. 399–405, 1965.
[19] R. Garside, G. Sampson, and G. Leech, The computational analysis of English: A corpus-based approach. Longman, 1988, vol. 57.
[20] I. Marshall, "Tag selection using probabilistic methods," in The computational analysis of English: a corpus-based approach, 1987, pp. 42–65.
[21] K. W. Church, "A stochastic parts program and noun phrase parser for unrestricted text," in Proceedings of the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics, 1988, pp. 136–143.
[22] S. J. DeRose, "Grammatical category disambiguation by statistical optimization," Computational Linguistics, vol. 14, no. 1, pp. 31–39, 1988.
[23] J. Kupiec, "Robust part-of-speech tagging using a hidden Markov model," Computer Speech & Language, vol. 6, no. 3, pp. 225–242, 1992.
[24] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, "A practical part-of-speech tagger," in Proceedings of the Third Conference on Applied Natural Language Processing. Association for Computational Linguistics, 1992, pp. 133–140.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[26] B.
Merialdo, "Tagging English text with a probabilistic model," Computational Linguistics, vol. 20, no. 2, pp. 155–171, 1994.
[27] E. Brill, "Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging," Computational Linguistics, vol. 21, no. 4, pp. 543–565, 1995.
[28] W. Daelemans, J. Zavrel, P. Berck, and S. Gillis, "MBT: A memory-based part of speech tagger-generator," arXiv preprint cmp-lg/9607012, 1996.
[29] A. Ratnaparkhi et al., "A maximum entropy model for part-of-speech tagging," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, Philadelphia, USA, 1996, pp. 133–142.
[30] M. Levinger, A. Itai, and U. Ornan, "Learning morpho-lexical probabilities from an untagged corpus with an application to Hebrew," Computational Linguistics, vol. 21, no. 3, pp. 383–404, 1995.
[31] J. Hajič and B. Hladká, "Tagging inflective languages: Prediction of morphological categories for a rich, structured tagset," in Proceedings of the 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, 1998, pp. 483–490.
[32] J. Hajič, "Morphological tagging: Data vs. dictionaries," in Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference. Association for Computational Linguistics, 2000, pp. 94–101.
[33] N. Ezeiza, I. Alegria, J. M. Arriola, R. Urizar, and I. Aduriz, "Combining stochastic and rule-based methods for disambiguation in agglutinative languages," in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, 1998, pp. 380–384.
[34] B. Megyesi, "Improving Brill's POS tagger for an agglutinative language," in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999, pp.
275–284.
[35] K. Oflazer and İ. Kuruöz, "Tagging and morphological disambiguation of Turkish text," in Proceedings of the Fourth Conference on Applied Natural Language Processing. Association for Computational Linguistics, 1994, pp. 144–149.
[36] K. Oflazer and G. Tür, "Combining hand-crafted rules and unsupervised learning in constraint-based morphological disambiguation," arXiv preprint cmp-lg/9604001, 1996.
[37] K. Oflazer and G. Tür, "Morphological disambiguation by voting constraints," in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 1997, pp. 222–229.
[38] H. Sak, T. Güngör, and M. Saraçlar, "Morphological disambiguation of Turkish text with perceptron algorithm," in Computational Linguistics and Intelligent Text Processing. Springer, 2007, pp. 107–118.
[39] D. Yuret and F. Türe, "Learning morphological disambiguation rules for Turkish," in Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Association for Computational Linguistics, 2006, pp. 328–334.
[40] M. Kutlu and I. Cicekli, "A hybrid morphological disambiguation system for Turkish," in IJCNLP, 2013, pp. 1230–1236.
[41] O. Makhambetov, A. Makazhanov, I. Sabyrgaliyev, and Z. Yessenbayev, "Data-driven morphological analysis and disambiguation for Kazakh," in Computational Linguistics and Intelligent Text Processing. Springer, 2015, pp. 151–163.
[42] O. Makhambetov, A. Makazhanov, Z. Yessenbayev, B. Matkarimov, I. Sabyrgaliyev, and A. Sharafudinov, "Assembling the Kazakh language corpus," in EMNLP, 2013, pp. 1022–1031.
[43] G. Kessikbayeva and I. Cicekli, "A rule based morphological analyzer and a morphological disambiguator for Kazakh language," 2016.
[44] E. Charniak, C. Hendrickson, N. Jacobson, and M.
Perkowitz, "Equations for part-of-speech tagging," in AAAI, 1993, pp. 784–789.
[45] J. L. Lagrange, Mécanique analytique. Mallet-Bachelier, 1853, vol. 1.
[46] I. J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika, vol. 40, no. 3–4, pp. 237–264, 1953.
[47] W. A. Gale and G. Sampson, "Good-Turing frequency estimation without tears," Journal of Quantitative Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[48] I. H. Witten and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression," IEEE Transactions on Information Theory, vol. 37, no. 4, pp. 1085–1094, 1991.
[49] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in Proceedings of ICASSP-95, vol. 1. IEEE, 1995, pp. 181–184.
[50] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, no. 4, pp. 359–393, 1999.
[51] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[52] J. N. Washington, I. Salimzyanov, and F. M. Tyers, "Finite-state morphological transducers for three Kypchak languages," in Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC, 2014.
[53] S. O. Rojas, M. L. Forcada, and G. R. Sánchez, "Construcción y minimización eficiente de transductores de letras a partir de diccionarios con paradigmas" [Efficient construction and minimization of letter transducers from dictionaries with paradigms], Procesamiento del Lenguaje Natural, vol. 35, pp. 51–57, 2005.
[54] F. M. Tyers and J. Washington, "Towards a free/open-source universal-dependency treebank for Kazakh," in 3rd International Conference on Computer Processing in Turkic Languages (TURKLANG 2015), 2015.
[55] A. Stolcke et al., "SRILM: an extensible language modeling toolkit," in INTERSPEECH, 2002.
[56] A. Stolcke, J. Zheng, W. Wang, and V.
Abrash, “Srilm at sixteen: Update and outlook,” in Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, 2011, p. 5. [57] B. KREJCI and L. GLASS, “The kazakh noun/adjective distinction.” [58] Y. Zhang, K. Wu, J. Gao, and P. Vines, “Automatic acquisition of chinese–english parallel corpus from the web,” in Advances in Information Retrieval. Springer, 2006, pp. 420–431. [59] M. Esplà-Gomis and M. Forcada, “Combining contentbased and url-based heuristics to harvest aligned bitexts from multilingual sites with bitextor,” The Prague Bulletin of Mathematical Linguistics, vol. 93, pp. 77–86, 2010. [60] I. San Vicente and I. Manterola, “Paco2: A fully automated tool for gathering parallel corpora from the web.” in LREC, 2012, pp. 1–6. [61] L. Liu, Y. Hong, J. Lu, J. Lang, H. Ji, and J. Yao, “An iterative link-based method for parallel web page mining,” Proceedings of EMNLP, pp. 1216–1224, 2014. 26 Methodological Considerations for Multi-word Unit Extraction in Turkish Ümit Mersinli Yeşim Aksan Mersin University Mersin, Turkey [email protected] Mersin University Mersin, Turkey [email protected] illustrative purposes only. They should not be regarded as finalized data sets of the ongoing study. Abstract— Multi-word Unit (MWU) extraction in Turkish has its own challenges due to the agglutinative nature of the language and the lack of reliable tools and reference datasets. The aim of this study is to share the hands-on experience on MWU extraction in the ongoing projects using Turkish National Corpus (TNC) as the data source. Since Turkish still does not have a reference MWU set, the primary purpose of these projects is to form a reference MWU dictionary of Turkish which will serve as a resource to evaluate the performance of any extraction tool or technique. In this paper we will discuss methodological considerations for clarifying appropriate processes for Turkish MWU extraction. 
Techniques and suggestions compiled in this paper form an overall proposal for further Turkish-specific computational or statistical work. The linguistic perspective underlying the choice of a valid methodology is described in the first part of the study. In the second part, important methodological considerations are discussed through real examples from the TNC. In the conclusion, suggestions for an interdisciplinary approach and a hybrid methodology are summarized.

Keywords—MWU extraction; multi-word; phraseology; Turkish National Corpus

II. METHODOLOGICAL CONSIDERATIONS

According to Pecina [11], choosing the best methodology for MWU extraction depends heavily on the data, the language, and the notion of MWU itself. However, these concerns are underestimated in the current Turkish NLP literature. Thus, the methodological considerations discussed in this paper emphasize the importance of some neglected aspects of MWU extraction in Turkish.

A. Choosing the Corpus

Most of the current studies on Turkish MWU extraction focus on optimizing the statistical or computational processes, or on optimizing the sorting procedure applied to the output. The importance of the input, the corpus in our case, is often underestimated. In this part of the paper, we deal with the qualifications a corpus needs in order to serve as input for MWU extraction in Turkish.

First, the difference between a linguistic corpus and a text archive needs to be clarified [12]. According to Sinclair [13], “a corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language”, not a random collection of whatever text is available. Second, a reference corpus should cover naturally occurring, contemporary language data and be designed to represent the language, unless it is a historical or specialized corpus. Third, a corpus should cover, where applicable, a variety of text types and mediums of that language.
In other words, the corpus should be a well-balanced and representative one if it is to be used in MWU extraction. In this respect, it is crucial to rely on a reference corpus like the Turkish National Corpus in order to extract true rankings of the n-grams. The TNC contains 50,997,016 running words, representing a wide range of text categories spanning a period of 23 years (1990-2013). It consists of samples of textual data representing 9 different domains (98%, 4,978 documents) and transcribed spoken data (2%, 434 documents). Table (1) shows the distribution of texts in the written part of the TNC. In addition, the annotation system of the TNC covers over 90 inflectional morphemes, all of which are compatible with modern Turkish linguistics studies. Analysis and tagging of Turkish derivational morphemes are in progress and will provide insights into the relationship between the word-forming and multi-word-forming processes of Turkish.

I. INTRODUCTION

As Mel’čuk [1] states, “people speak not in words but in phrases”, or, in Firth’s [2] words, well known among linguists, “you shall know a word by the company it keeps”. The importance of MWUs in any language-related area has led to a huge amount of work, especially for English. For Turkish, on the other hand, the lack of a preliminary, well-documented reference MWU lexicon against which to evaluate the performance of any linguistic, statistical or computational extraction methodology seems to be the basic challenge to overcome. The works of Oflazer et al. [3], Eryiğit et al. [4], Kumova & Karaoğlan [5], Aksan & Aksan [6], Durrant & Mathews-Aydınlı [7], Aksan, Mersinli & Altunay [8] and Mersinli & Demirhan [9] cover some aspects of Turkish phraseology, but the Turkish NLP literature is still far from providing a comprehensive reference MWU lexicon. In this respect, the purpose of this paper is to share hands-on experience from MWU extraction projects using the Turkish National Corpus (TNC) [10] as the data source, rather than to provide finalized software, resources or methodology. The following sections summarize the crucial points of the study in progress.
In each section, sample data is provided for illustrative purposes only; it should not be regarded as finalized data sets of the ongoing study.

B. Optimizing the Input

As stated above, choosing and optimizing the input is an important part of our proposal. The basic shift from conventional approaches is to use punctuation marks as natural delimiters for MWU candidates. Thus, all punctuation marks and numerals in the corpus are replaced with line breaks, which serve as splitters for n-grams. Since the primary concern of this study is not to extract proper nouns, all corpus text is also lowercased to avoid duplicate n-grams. Table (3) shows a sample raw text and its optimized version.

TABLE I. DISTRIBUTION OF TEXTS ACCORDING TO DOMAINS IN TNC-WRITTEN

Domain | No. of words | % of words
Imaginative: Prose | 9,365,775 | 18.74%
Informative: Natural and pure sciences | 1,367,213 | 2.74%
Informative: Applied science | 3,464,557 | 6.93%
Informative: Social science | 7,151,622 | 14.31%
Informative: World affairs | 9,840,241 | 19.69%
Informative: Commerce and finance | 4,513,233 | 9.03%
Informative: Arts | 3,659,025 | 7.32%
Informative: Belief and thought | 2,200,019 | 4.40%
Informative: Leisure | 8,421,603 | 16.85%
Total | 49,983,288 | 100.00%

TABLE III. CORPUS OPTIMIZATION FOR MWU EXTRACTION

Raw text:
  Günlerden bir gün, okuldan evine dönen Hetzer, sırt çantasından çıkardığı yepyeni bir kitabı, babasına gösterir.
Optimized text:
  günlerden bir gün
  okuldan evine dönen hetzer
  sırt çantasından çıkardığı yepyeni bir kitabı
  babasına gösterir

Table (2) shows the MWU candidates derived from the written part of the TNC, comprising 49,983,288 words. The top-ranked 3-gram candidates obtained from the whole written part of the TNC and from its newspaper-articles section demonstrate how serious the differences are between data extracted from a reference corpus and data extracted from a specialized corpus.
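The optimization step described above (lowercasing, and replacing punctuation marks and numerals with break points that n-grams may not cross) can be sketched as follows; this is a toy illustration in Python, not the Text-NSP pipeline actually used in the study, and the function names are ours.

```python
import re
from collections import Counter

def optimize(text):
    """Lowercase the text and split it at punctuation marks and
    numerals, which act as natural delimiters (cf. Table III)."""
    segments = re.split(r"[^\w\s]|\d", text.lower())
    return [seg.split() for seg in segments if seg.split()]

def ngrams(segments, n):
    """Count n-grams; a candidate never crosses a punctuation boundary."""
    counts = Counter()
    for words in segments:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

raw = ("Günlerden bir gün, okuldan evine dönen Hetzer, sırt çantasından "
       "çıkardığı yepyeni bir kitabı, babasına gösterir.")
segments = optimize(raw)
print(segments[0])  # ['günlerden', 'bir', 'gün']
print(ngrams(segments, 3).most_common(2))
```

Because the splitter discards segment boundaries before counting, an n-gram such as "gün okuldan evine", which spans a comma in the raw text, is never generated as a candidate.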
After the optimization, the lowercased, sentence-split, punctuation-delimited, ASCII-coded TNC texts are processed with Text-NSP [14] to obtain all the sample lists presented in this paper. Moreover, for the sake of simplicity, no association measures are used for extracting MWUs; all values represent observed frequencies of the data. A detailed discussion of association measures applied to Turkish MWU candidates can be found in Kumova-Metin & Karaoğlan [5] and Mersinli [15].

TABLE II. THE TOP-RANKED 3-GRAMS IN A REFERENCE CORPUS AND A SPECIALIZED CORPUS

Rank | TNC_all (a) | Freq. | TNC_Newspapers | Freq.
1 | bir süre sonra | 4419 | recep tayyip erdoğan | 555
2 | bir kez daha | 4000 | bir kez daha | 506
3 | ne var ki | 3360 | başbakan recep tayyip | 449
4 | başka bir şey | 3238 | yönetim kurulu başkanı | 442
5 | ne yazık ki | 3020 | şöyle devam etti | 367
6 | her ne kadar | 3012 | bir an önce | 367
7 | bir yandan da | 2993 | genel başkan yardımcısı | 323
8 | bir an önce | 2413 | ahmet necdet sezer | 316
9 | kısa bir süre | 2300 | cumhurbaşkanı ahmet necdet | 288
10 | ne olursa olsun | 2182 | düzenlediği basın toplantısında | 263

(a) MWUs are in bold in the original table.

C. Looking Beyond Words

It is a well-known phenomenon that an inflected Turkish verb is, in most cases, actually a sentence in English. The same is true for other phrases such as postpositions or connectives. We can easily observe that most of the connectives of English are actually suffix-word pairs in Turkish, such as -mAk için “in order to” or -A göre “according to”. The point here is that a multi-word unit in one language may appear as a single word, a multi-word, a suffix, or a suffix-word pair in another language, and vice versa. Thus, especially when dealing with an agglutinative language, suffix-word pairs need to be taken into serious consideration. Postpositional phrases, for instance, require specific suffixation on the preceding word in Turkish. Below are the most frequent suffix-word pairs of Turkish, extracted with the help of the annotation framework of the TNC.
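The suffix-word pair counts of Table (4) can be approximated as follows, assuming a morphologically annotated token stream; the tiny tagged sample and the function name are ours, and the tags are simplified stand-ins for the TNC's annotation scheme, not its actual output.

```python
from collections import Counter

# Each token is (surface, tag of its final suffix); tags follow the
# paper's abbreviations (nzmk = nominalizer -mAk, dat = dative,
# abl = ablative), here assigned by hand for illustration.
tokens = [
    ("etmek", "nzmk"), ("için", None),
    ("buna", "dat"), ("göre", None),
    ("olduktan", "abl"), ("sonra", None),
    ("etmek", "nzmk"), ("için", None),
]

def suffix_word_pairs(tokens):
    """Count bigrams of (final suffix of word 1, surface of word 2),
    i.e. the first word ending with a given suffix and the second
    word taken as a whole."""
    counts = Counter()
    for (w1, tag1), (w2, _) in zip(tokens, tokens[1:]):
        if tag1 is not None:
            counts[(tag1, w2)] += 1
    return counts

pairs = suffix_word_pairs(tokens)
print(pairs[("nzmk", "için")])  # 2
```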
The suffixes are annotated in the table according to their functions as nominalizers, case markers, or person/number agreement. The frequencies are extracted from bigrams in which the first word ends with the given suffix and the second word is taken as a whole.

As seen in Table (2), multi-word units are not only language-specific but also text-type-specific. Thus, relying on a text archive derived from the Web, or on a specialized corpus covering newspapers, for instance, is not a relevant approach for extracting the MWUs of Turkish; it is an approach for extracting the MWUs of that specific text type. If the purpose of the extraction is to derive named entities, on the other hand, a Web-based newspaper corpus may be the appropriate choice of corpus.

TABLE IV. MOST FREQUENT SUFFIX-WORD PAIRS IN TURKISH

Suffix_type | Freq. | Example | English
nzmk__için | 58535 | etmek_için | in order to
dat__göre | 37850 | buna_göre | according to
abl__sonra | 36515 | olduktan_sonra | after
p3s__için | 33514 | olduğu_için | since
p3s__gibi | 31306 | olduğu_gibi | as it is
dat__kadar | 28429 | bugüne_kadar | until
nzmk__üzere | 17728 | olmak_üzere | almost
gen__için | 15336 | bunun_için | for this
acc__olarak | 11895 | sonucu_olarak | as a result of
pl__için | 9990 | onlar_için | for them

Table (5) clearly demonstrates that causative+passive inflection is specific to academic Turkish and can be regarded as a multi-morpheme unit in itself. Although very rare in usage, such verbal morphgrams can extend to 9 morphemes in Turkish, as in the inflected verb çıkartılabilinirdi, which starts with the verb çık- and includes the suffixes causative, causative, passive, auxiliary_verb, passive, aorist, verb_i, past_tense and 3rd_person_singular, in that order. The inflected verb can be translated as “it could be made possible to extract”, a full sentence in English, which again blurs our notion of ‘word’ in the term ‘multi-word unit’.
As Table (4) demonstrates, the term ‘multi-word’ in Turkish should also cover suffix-word pairs, units which we may call “multi-morpheme units”. Looking for in-word or intra-word units in Turkish may be the solution to most of the challenges encountered in MWU extraction. The inflectional patterns of Turkish should likewise be considered multi-words or, in more appropriate terminology, multi-morpheme units, since their distribution across different text types provides evidence for their functional unity within certain text types. Below are the 6-morphgrams and their distribution across 3 text types in the TNC. The tagset includes functions such as causative, passive, auxiliary verb, aorist, nominalizer, adverbial, negation, verb I, necessity, perfective, imperfective, person agreement, possessive, accusative, locative and copula, in abbreviated form. Almost all 6-morphgrams start with voice suffixes and end with the 3rd person singular suffix, as seen in the table.

D. Bidirectional Sorting

Another common practice in MWU extraction can be summarized as sorting n-grams using association measures (or a combination of them), setting a cut-off point, and regarding the remaining top n-grams as MWUs. As discussed in Mersinli [15], the relevance of relying solely on a sorting of the n-grams, without any linguistic filtering, is questionable. A hybrid approach combining quantitative sorting and qualitative filtering techniques, as in Seretan et al. [16], seems more productive for Turkish if the purpose is to prepare a reference MWU set and to describe multi-word formation processes in Turkish. Below are the association measures found to be linguistically relevant for the given n-grams in Turkish [15]. Although most of the measures apply to 2-gram candidates, since 2-grams include most of the sub-MWUs in Turkish, it seems reasonable to rely on the observed frequencies of 3-grams for extracting MWUs in Turkish.
TABLE VI. RELEVANT ASSOCIATIVE MEASURES FOR TURKISH

n-grams | Measures
2-grams | T-score, Fisher's Exact Test (left-sided), Log-likelihood, True Mutual Information, Poisson-Stirling Measure
3-grams | Poisson-Stirling Measure
4-grams | Log-likelihood

TABLE V. SAMPLE MORPHGRAMS AND THEIR DISTRIBUTION AMONG TEXT TYPES IN TURKISH

6-morphgrams | Academic | Fiction | Newspapers
caus+pasv+va1+nzma+p3s+acc | 27 | 0 | 1
caus+pasv+va1+neg+aor+3s | 444 | 76 | 63
caus+pasv+aor+vi+avsa+3s | 386 | 25 | 46
caus+pasv+imprf+vi+past+3s | 277 | 164 | 47
caus+pasv+imprf+vi+perf+3s | 4 | 16 | 4
caus+pasv+neg+necc+cop+3s | 220 | 3 | 12
caus+pasv+neg+nzma+p3s+acc | 24 | 4 | 13
caus+pasv+neg+perf+cop+3s | 172 | 5 | 6
caus+pasv+nzma+p3s+cop+3s | 838 | 11 | 29
caus+pasv+nzma+p3s+loc+kia | 85 | 2 | 6

With that concern in mind, in order to measure the fixedness of 3-grams, which are the candidates most likely to be included as MWUs in a Turkish dictionary, we use the frequencies of their inner components: the frequency of the first two words and of the last two words of each 3-gram. If the difference between those values is high, it is taken as evidence that the given 3-gram is not a MWU itself but contains a 2-gram that is more fixed than the whole 3-gram. To be more specific, Table (7) shows 3-grams ranked by the value obtained by subtracting the frequency of the last two words from the frequency of the first two. The MWUs among the given 3-grams cluster at the center of the ranking, which shows their fixedness.
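The bidirectional measure described above (subtracting Freq.BC from Freq.AB for each 3-gram ABC) can be sketched as follows; the counts are toy values echoing Table VII, not real TNC frequencies, and the function name is ours.

```python
from collections import Counter

def bidirectional_rank(trigram_freq, bigram_freq):
    """Rank 3-grams by Freq(AB) - Freq(BC), as in Table VII.
    Values near zero (the center of the ranking) suggest a fixed
    3-gram; large absolute values suggest the 3-gram merely contains
    a 2-gram that is more fixed than the whole."""
    scored = []
    for (a, b, c), f in trigram_freq.items():
        score = bigram_freq[(a, b)] - bigram_freq[(b, c)]
        scored.append(((a, b, c), f, score))
    return sorted(scored, key=lambda t: t[2])

# Toy counts: "ne yazık ki" is fixed (score 0), while "konuda bir şey"
# merely contains the very frequent 2-gram "bir şey".
bigram_freq = Counter({("ne", "yazık"): 3020, ("yazık", "ki"): 3020,
                       ("konuda", "bir"): 51, ("bir", "şey"): 15360})
trigram_freq = Counter({("ne", "yazık", "ki"): 3020,
                        ("konuda", "bir", "şey"): 51})
for tg, f, score in bidirectional_rank(trigram_freq, bigram_freq):
    print(" ".join(tg), f, score)
```

Setting a double threshold then amounts to keeping only the candidates whose score falls inside a narrow band around zero.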
TABLE VII. BIDIRECTIONALLY SORTED SAMPLE 3-GRAMS

ABC | Freq | Freq.AB | Freq.BC | Freq.(AB - BC)
korkacak bir şey | 50 | 50 | 15360 | -15310
konuda bir şey | 51 | 51 | 15360 | -15309
aklına bir şey | 51 | 51 | 15360 | -15309
yapabileceği bir şey | 51 | 51 | 15360 | -15309
bildiğim bir şey | 54 | 54 | 15360 | -15306
…… | | | |
ne yazık ki | 3020 | 3020 | 3020 | 0
her zamanki gibi | 992 | 992 | 992 | 0
en ufak bir | 849 | 849 | 849 | 0
her ikisi de | 804 | 804 | 804 | 0
ittihat ve terakki | 649 | 649 | 649 | 0
…… | | | |
ya da bunun | 51 | 13650 | 51 | 13599
ya da siyasi | 50 | 13650 | 50 | 13600
ya da karşı | 50 | 13650 | 50 | 13600
ya da üçüncü | 50 | 13650 | 50 | 13600
ya da kültürel | 50 | 13650 | 50 | 13600

TABLE VIII. CLASSIFICATION OF COLLIGATIONAL PATTERNS OF N-GRAMS

Category 1 – Complete structures: MWU patterns
Sample colligational pattern | n-gram | English
AJ,bare_DT,bare_NN,nom | kısa bir süre | (in) a short time
AJ,bare_DT,bare_NN,loc | etkin bir şekilde | in an efficient manner

Category 2 – Sub-patterns: non-closed, potential sub-MWUs
Sample colligational pattern | n-gram | English
AV,bare_AJ,bare_DT,bare | çok önemli bir | a very important

Category 3 – Incomplete structures: non-MWU patterns
Sample colligational pattern | n-gram | English
PP,bare_AJ,bare_DT,bare | için önemli bir | an important ... for
PP,bare_AJ,bare_DT,bare | kadar geniş bir | as a broad ... as

The categories in Table (8) allow filtering MWUs from non-MWUs, as well as reserving partial candidates that may be used to identify sub-MWUs. In brief, Category 3 candidates are filtered out, Category 1 candidates are filtered in, and Category 2 candidates are reserved for identifying 4-gram MWUs. Since the identification of sub-MWU strings is problematic not only for MWU extraction but for all lexical frequency counts in any language, it requires separate techniques and is outside the scope of the current study. Extracting colligations also provides a general ranking based on the grammatical patterns of MWU candidates and makes the filtering process more linguistically relevant. Below are the top ten 3-gram colligations in the TNC.
Table (9) demonstrates that 3-word units in Turkish mostly provide a closed projection, including a specifier, a modifier and a head, which makes 3-grams more worth extracting than 2-grams, which mostly consist of light verb constructions or reduplications. As seen in Table (7), bidirectional sorting reveals the MWUs at the center of the ranking, even without applying any statistical association measure, and provides evidence for the 2-gram MWUs within the given candidates. The results of setting double thresholds based on such a simple measure point out that the relevance of a sorting practice does not depend on the complexity of the formulae used.

E. Lexico-grammatical Filtering

‘Colligation’ is another key term, important for identifying the MWUs in a given set of candidates. As defined by Baker [17], a colligation is “a form of collocation which involves relationships at the grammatical rather than the lexical level”. For morphologically rich languages, then, the grammatical relations between two or more words become important, since they state the constraints that prevent some frequent n-grams from becoming multi-word units, or let some less frequent ones become multi-word units. Thus, in a hybrid approach, sorting and filtering are the two basic processes, the first statistical and the latter rule-based. In order to provide linguistically grounded filtering rules for MWUs and non-MWUs, we have classified the grammatical, or colligational, patterns of the MWU candidates into 3 categories, presented with examples from the TNC below.
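Such category-based filtering can be sketched as a simple rule table; the category membership below is a toy subset based on the examples in Table (8), not the study's full rule set, and the pattern notation follows the paper's PoS,suffix pairs joined by underscores.

```python
# Toy category rules in the paper's colligational notation.
# Category 1 patterns are filtered in, Category 3 filtered out,
# and Category 2 reserved for identifying longer (4-gram) MWUs.
CATEGORY_1 = {"AJ,bare_DT,bare_NN,nom", "AJ,bare_DT,bare_NN,loc"}
CATEGORY_2 = {"AV,bare_AJ,bare_DT,bare"}
CATEGORY_3 = {"PP,bare_AJ,bare_DT,bare"}

def classify(pattern):
    """Map a colligational pattern to a filtering decision."""
    if pattern in CATEGORY_1:
        return "keep"
    if pattern in CATEGORY_2:
        return "reserve"
    if pattern in CATEGORY_3:
        return "discard"
    return "unknown"

candidates = [
    ("kısa bir süre", "AJ,bare_DT,bare_NN,nom"),
    ("çok önemli bir", "AV,bare_AJ,bare_DT,bare"),
    ("kadar geniş bir", "PP,bare_AJ,bare_DT,bare"),
]
for ngram, pattern in candidates:
    print(ngram, "->", classify(pattern))
```

In a full pipeline this rule-based pass would run after the statistical sorting step, so that only frequency-ranked candidates reach the filter.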
TABLE IX. THE TOP TEN 3-GRAM COLLIGATIONS IN THE TNC

# | Colligation | Sample 3-gram | English
1 | AV,bare_AJ,bare_DT,bare | çok önemli bir | a very important
2 | AJ,bare_DT,bare_NN,nom | kısa bir süre | a short time
3 | NN,nom_CJ,bare_NN,nom | radyo ve televizyon | radio and television
4 | DT,bare_NN,nom_AV,bare | bir süre sonra | after a while
5 | AJ,bare_CJ,bare_AJ,bare | ekonomik ve sosyal | economic and social
6 | CJ,bare_AV,bare_AV,bare | ama yine de | but still
7 | NN,nom_NN,nom_CJ,bare | ne var ki | however, yet
8 | AJ,bare_DT,bare_NN,loc | etkin bir şekilde | efficiently
9 | AV,bare_DT,bare_NN,nom | böyle bir şey | such a thing
10 | CJ,bare_AJ,bare_DT,bare | ile ilgili bir | a … related to

III. CONCLUSION

The methodological considerations discussed in this paper show that MWU extraction is rather a trial-and-error process for a given language. Thus, any attempt, be it statistical, computational or linguistic, is worth sharing in an interdisciplinary manner to fill the gap in this area. A reference MWU set or MWU dictionary will, for that purpose, serve as input not only for linguistics but for all related areas of study. Fig. 1 summarizes a sample recursive process followed in the proposed strategies.

Fig. 1. Basics of the proposed strategy: 1. Corpus, 2. Optimization, 3. Sorting, 4. Classification, 5. Filtering

Considering that Turkish is an agglutinative language that has little to do with words and operates rather on suffixes, the term ‘multi-morpheme unit’ (MMU) seems more operational for further cross-linguistic studies. In addition, lexico-grammatical constraints on MMU formation are as important as the observed frequencies of any MMU candidate, and thus colligational analysis and filtering of n-grams should be part of any strategy that includes statistical ranking of MMU candidates. This paper has briefly summarized some methodological considerations for multi-morpheme unit (MMU) extraction in Turkish.
The purpose of the study is to discuss some ignored aspects of MMU extraction in Turkish and to give an overall idea of the methodological considerations we faced. The Turkish lexicon includes more MMUs than have been documented so far. Any technical or linguistic contribution will be of great importance, and a hybrid, interdisciplinary approach may be the answer to most of the open questions in the field. MMU extraction is, in a sense, reverse engineering of the MMU-forming processes in our minds. Only a process-based approach can provide data for the linguistics of Turkish; a product-based approach, that is, extracting a reference MMU set, can however serve as an initial step for identifying the grammatical constraints that govern MMU-forming processes in Turkish. Interdisciplinary studies conducted by engineers and linguists are of great importance in this sense: not only the MMUs themselves but also the rules underlying their formation can only be described through such collaborative studies.

ACKNOWLEDGMENT

This work is supported by a grant from the Scientific and Technological Research Council of Turkey (TÜBİTAK, Grant No: 115K135).

REFERENCES

[1] Mel’čuk, I. A.: Phrasemes in language and phraseology in linguistics. In: Everaert, M., van der Linden, E.J., Schenk, A. and Schreuder, R. (eds.) Idioms: Structural and Psychological Perspectives. Lawrence Erlbaum, Hillsdale, NJ (1995)
[2] Firth, J.R.: A Synopsis of Linguistic Theory 1930-1955. In: Palmer, F. (ed.) Selected Papers of J. R. Firth. Longman, Harlow (1968)
[3] Oflazer, K., Çetinoğlu, Ö. and Say, B.: Integrating morphology with multi-word expression processing in Turkish. In: Proceedings of the Workshop on Multiword Expressions: Integrating Processing (MWE '04). Association for Computational Linguistics, pp. 64-71 (2004)
[4] Eryiğit, G. et al.: Annotation and Extraction of Multiword Expressions in Turkish Treebanks. In: Proceedings of the 11th Workshop on Multiword Expressions (MWE 2015),
June 4, 2015, Denver, Colorado, USA, pp. 70-76 (2015)
[5] Kumova-Metin, S. and Karaoğlan, B.: Collocation extraction in Turkish texts using statistical methods. In: 7th International Conference on Natural Language Processing (IceTAL 2010), LNCS, pp. 238-249 (2010)
[6] Aksan, M. and Aksan, Y.: Multi-word units and pragmatic functions in genre specification. Paper presented at the 13th IPrA Conference, 8-13 September 2013, New Delhi, India (2013)
[7] Durrant, P. and Mathews-Aydınlı, J.: A function-first approach to identifying formulaic language in academic writing. English for Specific Purposes, 30, 58-72 (2011)
[8] Aksan, Y., Mersinli, Ü. and Altunay, S.: Colligational analysis of Turkish multi-word units. Paper presented at CCS-2015, Corpus-Based Word Frequency: Methods and Applications, 19-20 February 2015, Mersin University, Turkey (2015)
[9] Mersinli, Ü. and Demirhan, U.: Çok sözcüklü kullanımlar ve ilköğretim Türkçe ders kitapları [Multi-word usage and primary-school Turkish textbooks]. In: Aksan, M. and Aksan, Y. (eds.) Türkçe Öğretiminde Güncel Çalışmalar. Mersin Üniversitesi, Mersin (2012)
[10] Aksan, Y., Aksan, M., Koltuksuz, A., Sezer, T., Mersinli, Ü., Demirhan, U. U., Yılmazer, H., Kurtoğlu, Ö., Atasoy, G., Öz, S. and Yıldız, İ.: Construction of the Turkish National Corpus (TNC). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pp. 3223-3227 (2012)
[11] Pecina, P.: Lexical association measures and collocation extraction. Language Resources and Evaluation, 44, 137-158 (2010)
[12] Aksan, M. and Aksan, Y.: Linguistic corpora: A view from Turkish. In: Oflazer, K. and Saraçlar, M. (eds.) Studies in Turkish Language Processing. Springer, Berlin (forthcoming)
[13] Sinclair, J. McH. and Renouf, A.J.: A lexical syllabus for language learning. In: McCarthy, M.J. and Carter, R.A. (eds.) Vocabulary in Language Teaching. Longman, London (1987)
[14] Banerjee, S. and Pedersen, T.: The design, implementation, and use of the Ngram Statistics Package.
In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pp. 370-381 (2003)
[15] Mersinli, Ü.: Associative measures and multi-word unit extraction in Turkish. Journal of Language and Literature, 12 (1), 43-61 (2015)
[16] Seretan, V., Nerima, L. and Wehrli, E.: Multi-word collocation extraction by syntactic composition of collocation bigrams. In: Current Issues in Linguistic Theory 260, Amsterdam Studies in the Theory and History of Linguistic Science, pp. 91-100 (2004)
[17] Baker, P., Hardie, A. and McEnery, T.: A Glossary of Corpus Linguistics. Edinburgh University Press, Edinburgh (2006)

The Turkish National Corpus (TNC): Comparing the Architectures of v1 and v2

Yeşim Aksan, Mersin University, Mersin, Turkey ([email protected])
Selma Ayşe Özel, Çukurova University, Adana, Turkey ([email protected])
Hakan Yılmazer, Çukurova University, Adana, Turkey ([email protected])
Umut Ufuk Demirhan, Mersin University, Mersin, Turkey ([email protected])

Abstract— The Turkish National Corpus (TNC), whose first version was released in 2012, is the first large-scale (50 million words), web-based, publicly available free resource of contemporary Turkish. It is designed to be a well-balanced and representative reference corpus for Turkish. With 48 million words coming from its written part, the untagged TNC v1 represents 4,438 different data sources over 9 domains and 34 different genres. The morphologically annotated, 50-million-word TNC v2, with 5,412 different documents compiled from written and spoken Turkish and planned for release in 2016, offers new query options for linguistic analyses. This paper aims to compare the architectures of TNC v1 and v2 on the basis of a set of queries made on both versions. Standard, restricted and wildcard lexical searches are performed; then the speed of the two versions in retrieving query results as concordance lines is compared.
Finally, it is argued that TNC v2 performs better and faster than TNC v1 due to its in-memory inverted index structure. Since building language corpora is a very recent endeavor for Turkish, the architecture of TNC v2 can serve as a model for similar corpus construction projects.

To meet the challenge, the Turkish National Corpus (TNC) was built as a reference corpus of Turkish. The project team followed best practices at all stages of corpus development. The major design principles were adopted from the experience of the British National Corpus, with minor modifications. The end product is the TNC: a well-balanced, representative, large-scale (50 million words), free, general-purpose corpus of contemporary Turkish [3]. As maintained by [14], “if the corpus in question claims to be general in nature, then it will be typically balanced with regard to genres, domains that typically represent the language under consideration”. In line with this definition, the major aim in building the TNC is to represent texts from different genres, domains and types in a balanced manner, so that the conclusions drawn from quantitative and qualitative analyses of the corpus data hold true for language use in general. Genre balance is an important aspect of corpus design [15]. Both versions of the TNC contain data from different domains and genres, which sets them apart from text archives or collections of texts that are difficult to categorize and separate by genre, such as the Web. The number of linguistic and computational linguistic studies using the TNC as a reference corpus is increasing. While most linguistic and NLP studies use the TNC for compiling naturally occurring language evidence and for hypothesis testing [16, 17, 18, 19], others follow a corpus-driven approach and attempt to build hypotheses and describe Turkish on the basis of the TNC [20, 21]. Overall, the usefulness of the TNC as a general corpus is primarily due to the data itself.
With 48 million words, the TNC v1 represents the written component of the corpus, which contains 4,438 different data sources over 9 domains and 34 different genres; it was published as a free resource for non-commercial use in October 2012. The size of the TNC v2 is 50,997,016 running words, representing a wide range of text categories spanning a period of 23 years (1990-2013). It consists of samples of textual data representing 9 different domains (98%, 4,978 documents) and transcribed spoken data (2%, 434 documents). The morphologically annotated, complete version of the TNC v2 is planned for release in 2016, offering new query options for linguistic analyses.

Keywords—Turkish National Corpus (TNC); corpus building; architecture; inverted index; relational database; in-memory data structures

I. INTRODUCTION

There are at least two different kinds of corpora in Turkish today: (i) large general linguistic corpora that are constructed and made available to users with proper corpus tools, and (ii) NLP corpora built with no linguistic criteria in mind, but rather as tools for testing algorithms devised for different applications [1]. The first electronic linguistic corpus designed to represent modern Turkish is the 2-million-word, downloadable Middle East Technical University Turkish Corpus (MTC) [2]. The MTC is tagged with XCES-style annotation using special software developed by the members of the project group, who also developed its corpus query workbench. In the years following the construction of the MTC, the need for a large-scale general reference corpus of Turkish became more and more obvious.

Texts are distributed along two major types, namely imaginative and informative. While the imaginative domain is represented by texts of fiction, the informative domain is represented by texts from the social sciences, the arts, commerce-finance, belief-thought, world affairs, applied sciences, natural-pure sciences, and leisure. The criterion of medium refers to text production.
The texts collected to represent the written medium are carefully selected from books, periodicals, published or unpublished documents, and texts written-to-be-spoken, such as news broadcasts and screenplays, among others. The criterion of time defines the period of text production. Here, the distribution of the size of the texts for each year is decided in terms of the relative representation of each domain in the medium.

This paper is organized as follows: Section two explains the design features of the TNC. Section three describes the basic features of the TNC interface. The architectures of TNC v1 and v2 are presented in section four. Section five displays the comparative query results obtained through the two versions of the corpus. The paper finally argues that an in-memory inverted index structure and a relational database structure are effective in terms of the speed and extensibility of web-based language corpora.

II. DESIGN OF THE TNC

The only Turkish corpus of its kind, the TNC is constructed following the principles used to construct the British National Corpus in its basic design and implementation. The distribution of samples in the written component of the corpus is determined proportionally for each text domain, time, and medium. Tables I and II show the distribution of texts across domain and medium, respectively.
TABLE I. THE DISTRIBUTION OF TEXTS ACROSS DOMAINS IN THE TNC

Domain | No. of words | % of words | No. of documents | % of documents
Imaginative: Prose | 9,365,775 | 18.74% | 674 | 13.54%
Informative: Natural and pure sciences | 1,367,213 | 2.74% | 253 | 5.08%
Informative: Applied science | 3,464,557 | 6.93% | 461 | 9.26%
Informative: Social science | 7,151,622 | 14.31% | 671 | 13.48%
Informative: World affairs | 9,840,241 | 19.69% | 757 | 15.21%
Informative: Commerce and finance | 4,513,233 | 9.03% | 429 | 8.62%
Informative: Arts | 3,659,025 | 7.32% | 347 | 6.97%
Informative: Belief and thought | 2,200,019 | 4.40% | 297 | 5.97%
Informative: Leisure | 8,421,603 | 16.85% | 1,089 | 21.88%
Total | 49,983,288 | 100.00% | 4,978 | 100.00%

Transcriptions from authentic spoken language constitute 2% of the TNC's database. These involve everyday conversations recorded in informal settings, such as conversations among friends and talk among family members, as well as speeches collected in particular communicative settings, such as meetings, lectures, and interviews. The spoken component of the TNC contains a total of 1,013,728 running words; 439,461 of them come from orthographic transcriptions of everyday conversations in their relevant medium, and 574,267 are orthographic transcriptions of context-governed speeches.

Part-of-speech annotation, morphological tagging, and lemmatization of the TNC are done by developing a natural language processing (NLP) dictionary based on the NooJ_TR module [13]. This unique, semi-automatic process includes the following steps: (i) automatically annotating the type list with the NooJ_TR module, which follows a root-driven, non-stochastic, rule-based approach to annotating the morphemes of the given types using a graph-based finite-state transducer; and (ii) manually checking and revising the output, eliminating artificial/non-occurring ambiguities and theoretically possible multi-tags. After these stages, the entries of the NLP dictionary are matched to the actual running words of the corpus via software developed in PHP and MySQL.
FEATURES OF THE TNC INTERFACE Web-based interface of the TNC provides for multitude of features for the analysis of corpus texts including concordance display (Fig. 1), sorting concordance data (Fig. 2), creating descriptive statistics for query results over the languageexternal restriction categories of texts via distribution (Fig. 3), and compiling lists of collocates (Fig. 4) for query terms on the basis of several statistical methods. THE DISTRIBUTION OF TEXTS ACROSS MEDIUMS IN THE TNC No. of words 10,541 31,456,426 15,968,240 % of words 0.02 % 62.93 % 31.95 % No. of documents 1 2,141 2,092 % of documents 0.02 % 43.01 % 42.02 % 958,999 1.92 % 294 5.91 % 1,589,082 3.18 % 450 9.04 % 49,983,288 100.00 % 4,978 100.00 % The representativeness of the TNC is secured through balance and sampling of varieties of contemporary language use. The selection of written texts is done via the criteria of text domain, medium, and time. The criterion of domain means that Fig. 1. TNC v1 concordance results page 33 Fig. 1 shows the query results in the TNC which are given as concordance display (key word in context-KWIC). “A concordance is a list of all the occurrence of a particular search term in a corpus presented within the context in which they occur-usually a few words to the left and right to the search term” [22]. A search term in TNC can be a single word, multiword phrases and words containing wildcards. Concordances can be sorted alphabetically not only according to the node word but also the context up to 5 words to the left or right of the node word. This function of the TNC help users find linguistic patterns easily. TNC v2, on the other hand, offers new features and query options. Since v2 is morphologically annotated, lemma form searches, morphemes and morpheme sequences and PoS-tag restricted searches (Fig. 5 and Fig. 6) can be conducted. 
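The collocate lists mentioned above are ranked by statistical association measures. As a rough illustration of how such measures are computed from corpus counts, the sketch below implements two standard ones, pointwise mutual information (MI) and the Dice coefficient, using their textbook definitions and hypothetical frequency counts; these are not necessarily the exact variants implemented in the TNC:

```python
import math

def mi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2 of observed vs. expected
    co-occurrence frequency of a node word x and a collocate y
    in a corpus of n tokens."""
    return math.log2((f_xy * n) / (f_x * f_y))

def dice(f_xy, f_x, f_y):
    """Dice coefficient: co-occurrence frequency normalized by the
    marginal frequencies of the two words."""
    return (2 * f_xy) / (f_x + f_y)

# hypothetical counts for a node word and one candidate collocate:
# 8 co-occurrences, 16 occurrences of each word, 256 corpus tokens
print(mi(8, 16, 16, 256))    # observed is 8x the expected co-occurrence
print(dice(8, 16, 16))
```

Both measures reward pairs that co-occur more often than their individual frequencies would predict; the interface then sorts the candidate collocates by the chosen measure.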
As for some of the new features, users can save their query history, and they can search the spoken component of the corpus by using meta-textual categories such as genre, domain, interaction type, and speakers' age and sex.

Fig. 2. TNC v1 sorting function

Fig. 5. TNC v2 PoS-tag query

Fig. 6. TNC v2 PoS-tag query results

Users can also view distributional information on the query result based on pre-defined meta-textual categories. The distribution page allows users to access descriptive statistics concerning the distribution of the query result without performing multiple queries.

IV. THE ARCHITECTURES OF TNC V1 AND TNC V2

The TNC is a user-friendly, platform-independent, web-based corpus developed for the Turkish language. The HTML [12], CSS [7], PHP [5][6], and JavaScript [8] languages and the MySQL [4] database management system are used for the implementation of the TNC. The main architecture of TNC version 1 is presented in Fig. 7.

To develop TNC v1, the text documents in the written component of the corpus are first pre-processed to extract metadata, such as author, year, source, and domain, that describe each document in the collection. The metadata of each document are stored in a MySQL table on disk. After the metadata extraction step, each token (a character string separated by white-space characters) in each document is identified, and a unique token list is formed from all documents in the collection. Each token is given a unique identifier, and while unique tokens are collected from the documents, their frequencies in each document are also counted. The unique tokens, their ids, and their frequencies are stored in another MySQL table. For each unique token found in the document collection, a kind of inverted index structure is formed. In this index structure, the positions of each unique token are stored for each document in the collection. The index structure is stored on disk using the MyISAM file structure of MySQL.
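The tokenization and indexing steps described above can be sketched in a few lines. The following is a minimal, illustrative Python version (the TNC itself uses PHP and MySQL; the document names and texts here are hypothetical):

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index (token -> {doc_id: [positions]})
    and corpus-wide token frequencies, mirroring the tokenization above:
    a token is any character string separated by white space."""
    index = defaultdict(dict)
    freqs = defaultdict(int)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.split()):
            index[token].setdefault(doc_id, []).append(pos)
            freqs[token] += 1
    return index, freqs

# a toy two-document collection with hypothetical contents
docs = {
    "doc1": "mavi araba geldi fakat mavi araba gitti",
    "doc2": "fakat araba gelmedi",
}
index, freqs = build_index(docs)
print(index["araba"])   # positions of 'araba' in each document
print(freqs["fakat"])   # corpus-wide frequency of 'fakat'
```

With the positions recorded per document, concordance lines, per-document statistics, and collocate counts can all be derived without rescanning the raw texts, which is the point of the index structure described in the text.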
By using the inverted index structure, concordance data, descriptive statistics, and lists of collocates for the unique tokens in the corpus are computed, and they are stored as compressed files on disk by applying the IGBinary [9] compression method of PHP. IGBinary applies binary data compression and storage; therefore, reading and decompressing the data are faster than with other compression methods. The unique token list and the names of its compressed data files, including the concordance data, are then loaded into memory as a hash table to improve the performance of user searches.

When a user sends a query through the TNC GUI, the queried token is looked up in the hash table and the name of the token's compressed concordance file is found. The compressed concordance file is then read from disk into memory and decompressed; if the user has given some filtering options in the query, these filters are applied over the decompressed file. The computed results are then randomly shuffled and displayed to the user.

Fig. 3. TNC v1 distribution function

Fig. 4. TNC v1 result of a collocation analysis of haber 'news'

The collocation function allows users to list collocates (the words the query term occurs most frequently with) by offering six statistical association measures for calculating collocational strength: log-likelihood, MI, MI3, T-score, the Dice coefficient, and the logDice coefficient.

Fig. 7. Architecture of the TNC v1

The TNC v2 is an updated and improved version of the TNC v1. The metadata extraction, tokenization, and indexing steps are similar to those of the TNC v1. Metadata are stored on disk as a MySQL table. The unique token list, including the frequencies for each document, is loaded into memory instead of being stored on disk. Only the document collection and the metadata for the documents are stored on disk. For all unique tokens in the collection, a kind of inverted index structure is constructed in which the positions of the token in each document are stored. This inverted index structure is kept in memory by using Redis [10], an open-source (BSD-licensed), in-memory data structure store that supports data structures such as strings, hashes, lists, sets, and sorted sets.

When a user sends a query through the TNC GUI, the queried token is searched in the in-memory inverted index, and the unique types forming the concordance output of the query, the descriptive statistics for the query results, and the lists of collocates are computed in real time. If the user gives some filtering options in the query, these filters are searched in the metadata table stored in the database, and the results of this search are used to filter the unique type lists for the given token. Finally, the computed concordances are shuffled, and a random number of results is displayed to the user. The architecture of the TNC v2 is presented in Fig. 8. As the inverted index structure is stored in memory, all computations are performed very fast, as shown in the next section.

Fig. 8. Architecture of the TNC v2

On the other hand, the system specifications of the computer running the TNC v1 interface are prominently different from those of the TNC v2. The system properties of the server running the TNC v2 interface are sufficient to process and store a huge amount of data in memory. Table III briefly presents the major hardware specifications of both versions.

TABLE III. HARDWARE SPECIFICATIONS OF COMPUTERS RUNNING TWO VERSIONS OF THE TNC

 | TNC v1 | TNC v2
OS | FreeBSD 9.0 | Ubuntu Server 14.04 (virtual machine running on a FreeBSD host)
RAM | 16 GB | 64 GB
CPU | 1 x Intel Xeon X3440, 2.53 GHz, 4 cores | 2 x Intel Xeon E5-2630 v2, 2.60 GHz, 2 cores
Disk | 500 GB SATA 2 | 350 GB virtual disk

V.
QUERIES ON TNC V1 AND TNC V2

In what follows, the speed of the two versions of the TNC is compared on the basis of standard, restricted, and wildcard queries conducted on the written component of the TNC v1 and the written and spoken components of the TNC v2. Fig. 9 and Fig. 10 show the main pages of the two versions, respectively.

Fig. 9. TNC v1 main page

Fig. 10. TNC v2 main page

A. Standard Queries

Standard search in the TNC allows users to search the whole corpus without filtering queries on the basis of the written or spoken parts of the corpus. Users type the search term in the form labeled "query term" and send it. At the top of the results page, users can view the frequency information of the node word. A normalized frequency on a 1-million-word scale is also stated. Query results are displayed in a KWIC view by default. Each column in the result page displays the ID of the concordance line, the text where the node word is found, and the concordance line, respectively. Users can display further context to the left and right of the node word by clicking the search term in the concordance lines.

Fig. 12. TNC v1 query results for fakat 'but'

When such a query is made for the exact form of the node word fakat, it takes about 5.52 seconds to compute concordance lines among 2,758 different corpus texts in the TNC v2 (Fig. 11), while it takes 14.57 seconds for the same query word in the TNC v1 (Fig. 12). On the other hand, while the TNC v1 does not allow searching one of the most frequent words, kadar 'until', which ranks 45th with 142,693 occurrences in the frequency list of the TNC, the architecture of the TNC v2 allows this search, displaying random results to users in 10.82 seconds.
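The KWIC view used in these result pages is straightforward to reproduce over a token list. A minimal, illustrative Python sketch (the document names and window size are arbitrary; the TNC itself serves these lines from its index rather than rescanning texts):

```python
def kwic(docs, term, window=5):
    """Return key-word-in-context lines for `term`: for every occurrence,
    a (doc_id, left context, node word, right context) tuple with up to
    `window` tokens of context on each side."""
    lines = []
    for doc_id, text in docs.items():
        tokens = text.split()
        for pos, token in enumerate(tokens):
            if token == term:
                left = " ".join(tokens[max(0, pos - window):pos])
                right = " ".join(tokens[pos + 1:pos + 1 + window])
                lines.append((doc_id, left, token, right))
    return lines

# hypothetical miniature corpus
docs = {"text1": "geldi fakat bekledi", "text2": "fakat gelmedi"}
for line in kwic(docs, "fakat", window=2):
    print(line)
```

Sorting such tuples by the left or right context field gives exactly the alphabetical context sorting described for the TNC interface.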
TABLE IV. THE STANDARD QUERY OF FAKAT 'BUT' AND KADAR 'UNTIL' WITHIN THE WRITTEN COMPONENT OF THE TNC

Query item | TNC version | Word count | Text count | Hits | Different texts | Time
fakat 'but' | TNC v1 | 47,641,688 | 4,458 | 22,331 | 2,486 | 14.57 sec
fakat 'but' | TNC v2 | 50,088,936 | 4,990 | 25,432 | 2,758 | 5.52 sec
kadar 'until' | TNC v1 | 47,641,688 | 4,458 | N/A | N/A | > 60 sec
kadar 'until' | TNC v2 | 50,088,936 | 4,990 | 133,807 | 4,252 | 10.82 sec

Fig. 11. TNC v2 query results for fakat 'but'

B. Restricted Queries

Restricted queries can be performed in the written component of the TNC with the criteria of publication date, medium, sample, domain, derived text type, author information, audience, and genre. Table V demonstrates such a sample query, performed by restricting the node word büyük 'big' in terms of the publication date (between 1995 and 2005), medium (books), and sample (whole text) of the corpus documents. Once again the TNC v2 is fast in the restricted query search: it takes only 3.52 seconds to produce concordance lines in v2, while the same query lasts 9.31 seconds in v1.

TABLE V. THE RESTRICTED STANDARD QUERY OF BÜYÜK 'BIG' IN TERMS OF PUBLICATION DATE (1995-2005), MEDIUM (BOOKS) AND SAMPLE (WHOLE TEXT) WITHIN THE WRITTEN COMPONENT OF THE TNC

Query item | TNC version | Word count | Text count | Hits | Different texts | Time
büyük 'big' | TNC v1 | 47,641,688 | 4,458 | 3,476 | 168 | 9.31 sec
büyük 'big' | TNC v2 | 50,088,936 | 4,990 | 3,079 | 170 | 3.52 sec

C. Wildcard Queries

Wildcards can also be used in standard and restricted queries in the TNC. The special character * permits users to search word forms starting with kol, such as kolay 'easy', kollarına 'to his arms', and koltuğa 'to the armchair'; as seen in Table VI, the TNC v2 is slightly faster than v1 in displaying the query results. The wildcard query kita[b,p]*, which aims to obtain word forms containing either /b/ or /p/ as the final sound of the stem kitap, is only permitted in the TNC v2, where 41,098 hits are found across the corpus documents in 22.25 seconds.
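The wildcard syntax illustrated in this section (* for arbitrary continuations, [b,p] for character alternatives, and | for alternation between terms) maps directly onto regular expressions. The following is a small, hypothetical Python sketch of such a translation, not the TNC's actual implementation:

```python
import re

def _translate_one(pat):
    """Translate a single wildcard pattern into a regex fragment."""
    out = []
    i = 0
    while i < len(pat):
        ch = pat[i]
        if ch == "*":
            out.append(".*")          # any sequence of characters
        elif ch == "[":
            j = pat.index("]", i)     # [b,p] -> character class [bp]
            chars = pat[i + 1:j].replace(",", "").replace(" ", "")
            out.append("[" + re.escape(chars) + "]")
            i = j
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

def wildcard_to_regex(query):
    """Compile a wildcard query, with | separating whole alternatives."""
    alts = "|".join(_translate_one(p) for p in query.split("|"))
    return re.compile("^(?:" + alts + ")$")

def match_tokens(query, tokens):
    """Filter a token list by a wildcard query."""
    rx = wildcard_to_regex(query)
    return [t for t in tokens if rx.match(t)]
```

In a real system the compiled expression would be run over the unique token list (not the full corpus), and the matching types would then be looked up in the inverted index, which is why wildcard queries are slower than exact-form queries.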
Multi-unit search patterns are also possible: Table VI includes a query where beyaz 'white' or peynir 'cheese' is searched across the corpus documents. The speed of the TNC v2 is again better than that of v1: the query over the written and spoken parts of the corpus returned 12,212 hits in 2,085 different texts in 1.73 seconds.

Owing to the in-memory index structure of the TNC v2, it is possible to search lexical items used frequently in Turkish, such as ama 'but' (ranking 43rd among the 73,383 lemmas in the NLP dictionary of the TNC) and bu 'this' (ranking 6th among the same 73,383 lemmas), reasonably fast. The wildcard query ama 'OR' bu returned the relevant strings within 15.66 seconds in the TNC v2, but the same query takes more than 60 seconds in v1. As a final remark, the speed of the TNC v2 for some other wildcard query options still needs to be optimized.

TABLE VI. THE WILDCARD QUERIES IN THE TNC

Query item | TNC version | Word count | Text count | Hits | No. of diff. texts | Time
kol* | TNC v1 | 47,641,688 | 4,458 | 53,041 | 3,523 | 30.78 sec
kol* | TNC v2 | 50,088,936 | 4,990 | 58,154 | 3,864 | 22.95 sec
kita[b,p]* | TNC v1 | 47,641,688 | 4,458 | N/A | N/A | N/A
kita[b,p]* | TNC v2 | 50,088,936 | 4,990 | 41,098 | 2,687 | 22.25 sec
beyaz|peynir | TNC v1 | 47,641,688 | 4,458 | 10,881 | 1,894 | 6.46 sec
beyaz|peynir | TNC v2 | 50,088,936 | 4,990 | 12,212 | 2,085 | 1.73 sec
ama|bu | TNC v1 | 47,641,688 | 4,458 | N/A | N/A | N/A
ama|bu | TNC v2 | 50,088,936 | 4,990 | 836,838 | 4,565 | 15.66 sec

VI. CONCLUSION

This paper describes the design principles, the interface features, and the architecture of the TNC, and then compares the architectures of the TNC v1 and v2. On the basis of standard, restricted, and wildcard corpus queries, it is shown that the in-memory inverted index structure of the TNC v2 computes results faster than that of v1, which is designed around disk-based compressed concordance data files for each unique term. In terms of speed, the v2 architecture allows users to perform searches across many corpus files (5,412 data files of the TNC) very rapidly, but such an architecture needs more memory to display query results fast. We should also note that the relational database structure used in both versions of the TNC has its advantages in processing large corpus files, in that it allows for a "modular structure in which any number of features can be incorporated into the architecture" [11]. For future work, any extension of the features of the TNC would be possible via the relational database and inverted index structures.

ACKNOWLEDGMENT

This work is supported by TÜBİTAK (Grant No: 115K135, 113K039).

REFERENCES

[1] M. Aksan and Y. Aksan, "Linguistic corpora: A view from Turkish," in Studies in Turkish Language Processing, K. Oflazer and M. Saraçlar, Eds. Berlin: Springer Verlag, forthcoming.
[2] B. Say, D. Zeyrek, K. Oflazer and U. Özge, "Development of a corpus and a treebank for present-day written Turkish," in Proceedings of the 11th International Conference of Turkish Linguistics, 2004, pp. 183–192.
[3] Y. Aksan, M. Aksan, A. Koltuksuz, T. Sezer, Ü. Mersinli, U. U. Demirhan, H. Yılmazer, Ö. Kurtoğlu, G. Atasoy, S. Öz and İ. Yıldız, "Construction of the Turkish National Corpus (TNC)," in Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), 2012, pp. 3223–3227.
[4] MySQL 5.5 Release Notes, http://dev.mysql.com/doc/relnotes/mysql/5.5/en/
[5] PHP 5.4.21, http://www.php.net/releases/5_4_21.php
[6] PHP 5.6.10, http://www.php.net/releases/5_6_10.php
[7] CSS, http://www.w3schools.com/css/
[8] JavaScript, http://www.w3schools.com/js/
[9] PHP PECL IGBinary extension, http://codepoets.co.uk/2011/php-serialization-igbinary/
[10] Redis, http://redis.io/
[11] M. Davies, "The 385+ million word Corpus of Contemporary American English (1990–2008+)," International Journal of Corpus Linguistics, vol. 14, no. 2, pp. 159–190, 2009.
[12] HTML, http://www.w3schools.com/html/
[13] M. Aksan and Ü. Mersinli, "A corpus-based NooJ module for Turkish," in Proceedings of the NooJ 2010 International Conference and Workshop, 2011, pp. 29–39.
[14] T. McEnery, R. Xiao and Y. Tono, Corpus-based Language Studies, London: Routledge, 2006.
[15] M. Davies, "The Corpus of Contemporary American English as the first reliable monitor corpus of English," Literary and Linguistic Computing, vol. 25, no. 4, pp. 447–464, 2010.
[16] S. Akşehirli, "Dereceli karşıt anlamlılarda belirtisizlik ve ölçek yapısı," Journal of Language and Linguistic Studies, vol. 10, no. 1, pp. 49–66, 2014.
[17] G. İşgüder Şahin and E. Adalı, "Using morphosemantic information in construction of a pilot lexical semantic resource for Turkish," in Proceedings of the 21st International Conference on Computational Linguistics, 2014, pp. 929–936.
[18] S. Demir, "Generating valence shifted Turkish sentences," in Proceedings of the 8th INLG, 2014, pp. 128–132.
[19] O. Yılmaz, "Tag-based semantic website recommendation for Turkish language," International Journal of Scientific and Engineering Research, vol. 4, no. 3, pp. 1–7, 2013.
[20] A. Uçar and Ö. Kurtoğlu, "A corpus-based account of polysemy in Turkish: A case of ver- 'give'," in Proceedings of the 15th International Conference on Turkish Linguistics, 2012, pp. 539–551.
[21] Ü. Mersinli, "Associative measures and multi-word unit extraction in Turkish," Journal of Language and Literature, vol. 12, no. 1, pp. 43–61, 2015.
[22] P. Baker, A. Hardie and T. McEnery, A Glossary of Corpus Linguistics, Edinburgh: Edinburgh University Press, 2006.


(When) do we need inflectional groups?

Çağrı Çöltekin
University of Tübingen
[email protected]

Abstract—Inflectional groups (IGs) are sub-word units that became a de facto standard in Turkish natural language processing (NLP).
Despite their prominence in Turkish NLP, similar units are seldom used in other languages; theoretical or psycholinguistic studies on such units are virtually nonexistent; they are typically overused in most existing work; and there are no clear standards defining when a word should or should not be split into IGs. This paper argues for the need for sub-word syntactic units in Turkish NLP, followed by an explicit proposal listing a small set of morphosyntactic contexts in which these units should be introduced.

I. Introduction

The term inflectional group (IG) in the Turkish natural language processing literature refers to a sub-word unit. Although it does not seem to stem from (theoretical) linguistics, the unit has been a de facto standard for representing words in Turkish NLP. Representing words as multiple IGs helps in dealing with the complex interaction between morphology and syntax in the language. Furthermore, it alleviates the data sparseness problems in machine learning methods that arise due to the large (theoretically infinite) number of word forms resulting from the numerous affixes a word can take. On the other hand, the use of IGs makes it difficult to use well-studied methods from other languages, or common off-the-shelf NLP tools, since these methods and tools are designed with the assumption that the word is the basic unit of syntactic processing.

While we argue that sub-word syntactic units are necessary for Turkish NLP, the oversegmentation of words into IGs, which is very common in present practice in the field, amplifies these problems, and even defeats its own aim by shifting the data sparseness problem caused by long sequences of potential suffixes per word to one caused by long sequences of IGs per word. We discuss these issues in detail, and propose a more conservative alternative for the segmentation of words into IGs.

In this paper, we assume that IGs are introduced for syntactic reasons, even though the traditional use of the unit seems to link it to derivational morphemes and derivation boundaries. We do not address or discuss derivational morphology outside its relation to IGs.

A. The need for sub-word syntactic units

In many languages, representing a word with a lemma, a POS tag and a set of (inflectional) features is sufficient (and useful) for most NLP tasks. In Turkish, however, this representation is often inadequate. For example, consider the word arabadakiler 'the ones in the/a car' in (1) below. The word araba 'car' is inflected for locative case, after which it receives the suffix -ki, which changes the meaning of the word to 'the one in the/a car'. Finally, the word is suffixed with the plural morpheme, resulting in plural number inflection.

(1) Mavi arabadakiler uyuyorlar
    blue car.LOC-ki.PL sleep.PROG.3P
    'The ones in the blue car are sleeping.'

The conventional representation with a triple ⟨lemma, POS tag, features⟩ fails here, since the word arabadakiler refers to two different (sets of) entities, and it carries a separate set of inflections for each. The first part of the word, arabada 'in the/a car', is singular and in locative case, while the complete word, arabadakiler 'the ones in the/a car', is plural and not marked for case (nominative). Besides the multiple conflicting inflectional features within the word, parts of the word participate in separate syntactic relations. Figure 1 presents a dependency analysis of the sentence in (1).¹ The adjective mavi 'blue' modifies the car (not the people in it), while the entities that sleep are the ones in the car (not the car). As a result, in the Turkish computational linguistics literature, such words have been represented using multiple sub-word units known as inflectional groups (Oflazer 1999).

[Figure 1: dependency tree over the units Mavi, arabada, -kiler and uyuyorlar, with root(uyuyorlar), nsubj(uyuyorlar, -kiler), nmod(-kiler, arabada) and amod(arabada, Mavi), and per-unit analyses:

    POS:    ADJ   NOUN   NOUN  VERB
    Lemma:  mavi  araba  -ki   uyu
    Number: -     Sing   Plur  Plur
    Case:   -     Loc    Nom   -                                        ]

Figure 1. Dependency analysis of the sentence in (1). The dependency and feature labels follow Universal Dependencies (UD, Nivre et al. 2016) conventions. Only the features relevant to our discussion are listed.

¹ We present example analyses using dependency annotations, since this is where the IGs were first introduced, and due to the popularity of dependency parsing and annotation in the NLP community. However, parallel examples can easily be constructed for other grammar formalisms.

Although the need for sub-word units is clear in (1), the current practice in the field oversegments words without any clear linguistic or practical reason. For example, the subordinated verb sınırlandırılabilecek 'that/which can be limited' would be tokenized into six IGs in the METU-Sabancı treebank (Say et al. 2002; Oflazer et al. 2003) as in (2).

(2) sınır  -lan        -dır       -ıl        -abil      -ecek
    NOUN   VERB.Deriv  VERB.Caus  VERB.Pass  VERB.Abil  ADJ

In this annotation scheme, as well as the derivational morpheme -lan, the causative (-dır) and the passive (-ıl) voice suffixes, the mood suffix -abil expressing ability or possibility, and the subordinating suffix -ecek, which forms a verbal adjective, introduce new IGs. The segmentation in (2) does not have the same grounding as the one introduced by the suffix -ki in (1). All suffixes except the first one are considered part of inflectional morphology by modern grammars of Turkish (e.g., Kornfilt 1997; Göksel and Kerslake 2005). Even if we consider the first three inflectional suffixes as verb–verb derivations, none of the intermediate forms can carry any separate inflections, and there is no possibility of conflicting features. The case of the verbal adjective suffix is slightly more complicated (discussed in Section II-C).
However, the verbal adjective forms in Turkish are not much different from participle forms in other languages, where an additional inflectional feature is sufficient to indicate that the word carries properties of both adjectives and verbs. That is, the word acts similar to verbs within the subordinate clause, while acting like an adjectival outside the subordinate clause.

The current paper proposes tokenizing a surface word into multiple IGs only in case one of the following is true.²

(3) a. Parts of the word may have potentially conflicting inflectional features.
    b. Parts of the word may participate in different syntactic relations.

These guidelines also imply that the syntactic units should have clearly defined syntactic functions, unlike, for example, the relation deriv introduced in the CoNLL-X version of the METU-Sabancı treebank (Buchholz and Marsi 2006). Under our guidelines, the word in (2) would not be segmented at all. The next section presents a critical summary of the use of IGs to date, mainly pointing out when segmentation of words is not necessary. Section III lists the cases where we need to introduce IGs, after which we provide a brief discussion followed by a summary and outlook.

² The conditions 'conflicting features' and 'separate syntactic relations' depend on the annotation scheme. Ideally, the tagsets should avoid spurious conflicts. However, the guidelines are useful even if the tagset choice is not free, and causes spurious conflicts.

II. Inflectional groups

The term inflectional group first appeared in work related to Turkish dependency parsing and annotation (Oflazer 1999), and it is used in later studies with similar aims (Say et al. 2002; Oflazer et al. 2003; Sulubacak and Eryiğit 2013; Çöltekin 2015). It is also used in work on Turkish syntax with different grammar formalisms (Çetinoğlu and Oflazer 2006; Çakıcı 2008), and in pre- or non-syntactic analyses such as morphological analysis and disambiguation (e.g., Hakkani-Tür, Oflazer, and Tür 2002; Çöltekin 2014). Similar units are also used in NLP work on other Turkic languages (Tyers and Washington 2015). Although we are not aware of a precise definition of the term, both the use in the literature so far and the name inflectional group indicate that the unit was introduced based on morphosyntactic concerns. More precisely, we assume inflectional groups are sub-word units required by syntax. The remainder of this section outlines the earlier use of IGs, and discusses the morphological constructions where the current practice oversegments words according to the guidelines defined in (3).

A. Earlier use in the literature

Following Oflazer (1999), almost all Turkish NLP tools and resources annotate a word as a sequence of IGs as shown in (4) below.

(4) root+Infl1 ^DB+Infl2 +…+ ^DB+Infln

where root is the root of the word, each Infli is a group (presumably a set) of inflections, and ^DB is a special symbol indicating a derivation boundary. According to this annotation scheme, the word sınırlandırılabilecek in (2) is represented as (5) below.³

(5) sınır+Noun+A3sg+Pnon+Nom ^DB+Verb+Acquire ^DB+Verb+Caus ^DB+Verb+Pass ^DB+Verb+Able+Pos ^DB+Adj+FutPart

The same annotation scheme is used in most of the Turkish computational linguistics literature to date. Below we discuss the differences between the current practice and the scheme suggested in this paper.

³ The analysis here follows the annotation scheme of the METU-Sabancı treebank (Oflazer et al. 2003), which is a typical example of other resources and tools for Turkish NLP with respect to the representation of words.

B. Derivation boundaries are not necessarily syntactic-token boundaries

In the current literature, it is common to see inflectional group boundaries inserted before some derivational morphemes, such as -lan in (2). However, not every derivation warrants introducing a new syntactic unit. In the noun–verb derivation example, sınır-lan 'border-lan (= to restrict)', the noun sınır cannot be inflected. Hence, it cannot have an inflectional group of its own. It is also not accessible from syntax: neither can it be modified by another syntactic word, nor is it possible for it to modify another one. Although keeping the derivational history may be helpful for some applications, it is not related to determining syntactic units.

For the purpose of determining syntactic units, the (derivational) morphemes of interest are typically those that modify an already inflected word, like the suffix -ki in (1) in Section I. However, attaching to an already inflected word is not sufficient for forming a new syntactic token. Also, the condition we are seeking here is stricter than scoping over phrases. Some productive derivational suffixes may attach to already inflected forms, and scope over whole phrases, as exemplified by the suffix -sIz 'without' in (6) below.

(6) [Takım arkadaşlarım]sız yapamam
    team friend.PL.POSS1S.without do.AOR.NEG.1P
    'I cannot do without my team mates'

It may be tempting to segment the word arkadaşlarımsız into two IGs, since the noun takım modifies the stem arkadaş, and the suffix -sız scopes over the complete phrase. Furthermore, the suffix -sız attaches to an already inflected noun and derives an adverbial. However, according to our criteria, these do not warrant the introduction of a new syntactic token. A large number of inflections scope over the phrases headed by
the words carrying the inflection. For example, the possessive suffix attached to the same noun also scopes over the whole phrase (it is 'my [team mates]', not '*team [my mates]'). The word arkadaşlarımsız in this example cannot have conflicting features either (adverbs are not inflected in Turkish). Hence, there are no strong reasons for segmenting words at derivation boundaries introduced by suffixes similar to -sIz. The suffixes in this category include -lI, -lIk, -(n)CA, -CI, and also -ki when it derives an adjectival. These suffixes should be represented with adequate morphological features, rather than as separate syntactic units. Note that we make a distinction between the cases where these suffixes derive adjectivals or adverbials and the cases where some of these suffixes derive nominals. The nominal case is discussed in Section III-B.

C. Inflectional morphemes should not introduce IGs

In the current literature, a large number of inflections introduce new IGs. The majority of these inflections are verbal inflections, including voice suffixes as well as some mood and aspect modifiers. The passive and causative suffixes and the modal suffix glossed as Abil in (2) are examples of such inflectional suffixes. One of the motivations for segmenting at these inflectional morphemes may be the fact that some of them can attach repeatedly to the same verbal stem. In this respect, the causative morpheme is particularly interesting since, similar to -ki described in Section I, it can repeat multiple times with no principled limit on the number of consecutive causative suffixes. In practice, however, the use of multiple causative suffixes is rare, and it often indicates emphasis rather than multiple levels of causation. Example (7) demonstrates a verb with two causative suffixes which, indeed, can be interpreted as having two levels of causation.⁴

(7) Ders     bütün  okullarda      oku-t-tur-ulacak.
    subject  all    school.PL.LOC  study.CAU.CAU.PASS.FUT.3SG
    'The subject will be caused to be caused to be studied in all schools.' (literal)
    'The subject will be taught in all schools.'

Besides the causative suffix, the passive suffix and forms of the modal suffix -Abil may attach to the same verb multiple times. The double passive (on a transitive verb) creates impersonal (passive) expressions (Göksel and Kerslake 2005, p. 136). The double use of -Abil modifies the modality of the verb for both of its senses (ability and possibility). In all of these cases, these suffixes do not create a new predicate with potentially different inflections from the verbal stem they attach to. For example, in the multiple levels of causatives above, all actions have to share the same tense, aspect and modality. As a result, if these suffixes formed inflectional groups, the resulting inflectional groups would not have any independent inflections. A set of features that allows marking multiple levels of causation and distinguishing the effects of single or double passive or -Abil suffixes is sufficient for avoiding additional syntactic tokens.

Another aspect of the voice inflections that may have affected the current practice of oversegmentation is the fact that they change the valency of the verb, and modify the meanings of the arguments of the verb. For example, a causative or passive verb will assign different roles to its arguments. However, even if the verb valency is changed, there will still be a single grammatical subject and/or object, and their roles can be inferred from the transitivity of the verb and the voice inflections it carries. As a result, none of the suffixes discussed above meets the criteria set in (3). With a proper morphological tag set, we do not need to introduce new IGs for voice suffixes or for other aspect or modality modifiers.

Besides the verbal suffixes discussed above, existing work also segments words at subordinating suffixes (suffixes that cause phrases headed by verbs to function as adjectives, adverbs or nouns). These suffixes change the function of the word they are attached to. However, there is no principled reason for not representing their status by setting a feature, e.g., verb form, to an appropriate value, e.g., verbal adjective (participle), verbal adverb (converb) or verbal noun (gerund/infinitive). This avoids segmentation by indicating that the word functions as a verb within the subordinate clause, while acting like a noun, adjective or adverb outside the subordinate clause. Note that even the subordinate clauses that function as nouns (verbal nouns and headless relative clauses, Göksel and Kerslake 2005, p. 84) do not require segmentation, since nominal predicates cannot be subordinated without an auxiliary verb and inflectional features, and the syntactic relations of verbs can easily be distinguished from those of nouns, adjectives and adverbs (the copula attached to subordinate verbs is discussed in Section IV). In many ways, the subordinating suffixes are similar to the productive derivational suffixes discussed in Section II-B, and do not need to introduce new syntactic tokens.

D. Uniform representation of all syntactic units

Another issue with the present use of IGs as represented in (4) is the asymmetry between the first IG and the ones that follow. In this representation, the only IG with a lemma is the first one. This hinders the uniform treatment of the syntactic tokens, since some of the tokens are not represented as ⟨lemma, POS tag, features⟩ triples, and it introduces difficulties in using existing NLP tools like parsers. The current proposal requires a syntactic token to always be associated with a lemma. For non-root IGs, the lemma should be a canonical representation of the (derivational) morpheme that introduces the IG. For example, for the proposed tokenization of arabada-kiler in (1), the suffix -ki should be treated as the lemma rather than as an inflection. This also serves as a test for introducing new IGs: if the segmentation of a word results in IGs that cannot have any inflections of their own (except for the lemma), the segmentation is not justified.
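The string format in (4) makes the asymmetry discussed above concrete: only the first IG carries a lemma. The following is a small illustrative Python sketch that splits such an analysis string at the ^DB markers; the tag names follow the example in (5), and this is not the parser of any particular tool:

```python
def split_igs(analysis):
    """Split an analysis string of the form (4),
    root+Infl1^DB+Infl2+...+^DB+Infln, into inflectional groups.
    Only the first IG gets a lemma; the later IGs have none, which is
    exactly the asymmetry criticized above."""
    groups = analysis.split("^DB")
    igs = []
    for i, group in enumerate(groups):
        tags = group.lstrip("+").split("+")
        if i == 0:
            igs.append({"lemma": tags[0], "tags": tags[1:]})
        else:
            igs.append({"lemma": None, "tags": tags})
    return igs

# the six-IG analysis of sınırlandırılabilecek, cf. (5)
igs = split_igs("sınır+Noun+A3sg+Pnon+Nom^DB+Verb+Acquire^DB+Verb+Caus"
                "^DB+Verb+Pass^DB+Verb+Able+Pos^DB+Adj+FutPart")
print(len(igs))          # 6 inflectional groups
print(igs[0]["lemma"])   # only the first IG carries a lemma
```

Under the proposal in this paper, the non-root IGs that survive segmentation would instead each receive the derivational morpheme (e.g., -ki) as their lemma, restoring the uniform triple representation.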
4 A bit of context may be useful for non-native speakers to understand the double causative in this example. The example, taken from a news text about a new educational regulation, expresses that (the authorities who made) the regulation will cause schools or teachers to cause the students to study the subject.

III. Inflectional group boundaries

So far, our focus in this paper has been on where or when not to segment a word into sub-word syntactic units. In this section, we list the cases where sub-word units are necessary.

A. The relativizer -ki

The suffix -ki has two main functions (Hankamer 2004). It forms either adjectivals or pronominal expressions from nouns. We already argued in Section II-B that when the suffix -ki derives adjectivals, there is no need for introducing a new syntactic unit. However, as the example in Section I demonstrates, if it derives a pronominal, a new IG is necessary. If the suffix -ki is attached to a noun in genitive case, the resulting pronominal expression refers to an entity that belongs to the object or person the original noun refers to. If it is attached to a locative noun, the resulting expression refers to an entity in/on/at the object the original noun refers to. The parts of the word referring to these two entities may have their own sets of inflections, and may participate in different syntactic relations. The example (1) and the corresponding dependency analysis in Figure 1 demonstrate the need for separate syntactic units. Without segmenting the word into multiple syntactic tokens, we cannot tell whether the expression refers to multiple cars or a single car, and we cannot tell whether the car or the objects in the car are blue, or even whether the car is sleeping or the people/objects inside are sleeping. Both problems can be solved by introducing a new syntactic token as in the analysis presented in Figure 1.

Furthermore, the nominals derived with -ki may be suffixed with genitive or locative suffixes again, and in turn, with another -ki suffix. Although multiple -ki suffixes are rare in real language use, the process is recursive, and there is no principled limit that one can place on the number of -ki suffixes in a word form. This fact also underlines the need for introducing new IGs in the pronominal usage of the suffix -ki.

B. Other productive noun–noun derivations

Like the suffix -ki discussed above, some productive noun derivations result in word forms that refer to multiple entities. This is demonstrated using the derivational suffix -CI in (8).

(8) a. [eski kitap]çı
       old book.CI
       '[old book] shop/seller'
    b. eski [kitapçı]
       old book.CI
       'old [book shop]'

If the word kitapçı in (8) is not segmented, we do not have a way to represent the ambiguity between (8a) and (8b). The same issue surfaces in case of other noun–noun derivations or noun–adjective derivations when the derived adjectival is nominalized, referring to an object with the property described by the derived adjective. In such cases, similar to -ki, the parts of the word refer to entities which may have their own sets of inflections, and may participate in different syntactic relations. The other suffixes with similar behavior are -sIz, -lI and -lIk (which overlap with the ones listed in Section II-B). We present an example for each of the cases in (9).

(9) a. Kayıt belgesizlere 2 bin TL ceza kesilecek.
       registration document-SIZ.PL.DAT 2 thousand TL fine cut.PASS.FUT
       'Those without a registration document will be fined 2000 TL.'
    b. 2-3 metrelikleri adamdan saymıyor musun?
       2-3 meter-LIK.PL.ACC man.ABL count.NEG.PROG QuesP.2SG
       'Are you not considering 2 to 3 meter long ones worthy?' (referring to boats)
    c. 1.5 crdi motorlusuyla 170 tl'lik dizelle Istanbul-Sivas mesafesini yaptım.
       1.5 CRDI engine-LI.POS3S.INS 170 TL.LIK diesel.INS Istanbul-Sivas distance do.PAST.1SG
       'I rode the Istanbul-Sivas distance with the one with 1.5 CRDI engine using 170 TL worth of diesel fuel.'

In (9a), without segmenting the word belgesizlere 'the ones without documents', we cannot represent the fact that the noun kayıt 'registration' modifies the word belge 'document', not the people who do not have the document. This is unlike the earlier example (6), where the relation is unambiguous since the attributive noun can only modify the noun, not the resulting adjectival. Similarly, in (9b), the numeral modifies metre 'meter(s)', not the pronominal expression derived by the suffix -lik. In other words, the expression refers to (an unknown number of) 2 to 3 meter boats, not 2 or 3 boats of one meter length. In (9c), too, the numeral and the abbreviation modify motor 'engine', not the car with that particular engine. Also note that the suffix -lık in this example does not have to be segmented, since it derives an adjectival. The preceding number here can only modify the noun, not the adjectival.

The suffixes listed in (9) are a lot less productive than -ki discussed in Section III-A, and they attach to already inflected words to a varying but lower degree than -ki. Nevertheless, the cases exemplified in (9) exist. For a uniform treatment, our proposal is to segment words into multiple tokens when these suffixes derive a (pro)nominal expression.

Although the suffixes discussed here require segmentation of words, this is not true if the same suffix is part of a lexicalized derivation. For example, in contrast to the use of the suffix -siz in (9a), the lexicalized word ev-siz 'homeless' should not be segmented, since the root here cannot be inflected, and it cannot participate in separate syntactic relations.

C. Copular suffixes and the suffix -lAş

In Turkish, the main means of forming copular predicates is through suffixation. In most cases, copular suffixes attach to a simple noun or adjective, where one may avoid segmenting the word by setting a feature that indicates the copular nature of the word. However, if the copula is attached to a verbal noun or a headless relative clause, as in (10) below, segmentation is unavoidable.

(10) Örnek bizim yazdıklarımızdandı.
     example we.GEN write.PART.PAST.PL.POSS1P.ABL-COP.PAST.3SG
     'The example was from the ones we wrote.'

[Figure 2 (feature table not reproduced): dependency analysis of the sentence in (10), with the tokens Örnek, bizim, yazdıklarımızdan and -dı analyzed as ⟨lemma, POS tag, features⟩ triples. The dependency and feature labels follow the Universal Dependencies conventions (marking the copula as the head is against one of the UD principles, which is violated frequently). Only the features relevant to our discussion are listed. The features Person[psor] and Number[psor] mark the person and number of the possessor in a noun. The same suffixes also indicate the person and number of the subject on a subordinate verb.]

In (10), the word yazdıklarımızdandı includes two predicates (yaz 'write' and the past copula). As it is also presented in Figure 2, both predicates have their own subjects in the sentence. Furthermore, these two predicates have their own feature sets, which may conflict. For example, the subordinate verb carries the first person plural subject–verb agreement (indicated by the feature labels Person[psor] and Number[psor] in Figure 2), while the inflections on the copula indicate a third-person singular subject (marked by the feature labels Person and Number).
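The label contrast just described can be made concrete with a small check. This is our own illustration, not the paper's code; the feature values follow the prose above (first person plural agreement on the subordinate verb, third person singular on the copula), and the merge test is an invented device.

```python
# If one label set were used for both predicates in yazdıklarımızdandı,
# the two feature bundles would assign conflicting values:
plain_verb = {"Person": "1", "Number": "Plur"}
plain_cop = {"Person": "3", "Number": "Sing"}
conflict = {k for k in plain_verb if k in plain_cop and plain_verb[k] != plain_cop[k]}
assert conflict == {"Person", "Number"}

# The relabelling in Figure 2 keeps the two feature sets disjoint,
# so no key carries two different values:
subordinate_verb = {"Person[psor]": "1", "Number[psor]": "Plur", "VerbForm": "Part"}
copula = {"Person": "3", "Number": "Sing"}
assert not set(subordinate_verb) & set(copula)
```

In other words, the possessive-style labels act as a namespace for the subordinate predicate's agreement features, which is exactly what lets the conflicting values coexist.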
This example also demonstrates that the potential conflict of person and number features between the predicate and the resulting nominal is avoided by using different labels for these features (although the labels may be confusing in this particular tagset).

The morpheme -lAş 'to become' presents a slightly different case. -lAş forms verbs from nouns and adjectives, often leaving the possibility of modifying the stem. The sentence in (11) presents an example where the adjective pembe within the verb derived by -lAş is modified by an adverb.

(11) Koyu pembeleşinceye kadar kavurun.
     dark pink-lAş.CONV until fry
     'Fry until it becomes dark pink.'

[Figure 3 (panels not reproduced): inconsistent analyses of the copula in case an empty syntactic unit is not introduced. (a) Overt copula: Ben arkadaşlarımlayım 'I am with my friends'. (b) No surface copula: Ali arkadaşlarımla 'Ali is with my friends', a null syntactic element is introduced. (c) The same sentence as in (b) analyzed without a null element.]

IV. Discussion and further issues

This paper argues for limiting the segmentation of words into sub-word syntactic tokens based on the two principles listed in (3). Based on these principles, the same affix may or may not introduce a new IG depending on whether it derives a nominal or an adjectival. In general, the need for tokenization arises when the same word contains multiple (pro)nouns or predicates. Furthermore, if a derived word with an otherwise transparent and productive suffix is fully lexicalized, there is no need for segmenting the word, as the stem cannot be inflected or modified by other words in the sentence. Our proposal introduces a new IG in case a suffix derives a (pro)nominal from a noun in a way that allows modification of both nouns in the word, but not when the same suffix derives an adjective or adverb. A potential disadvantage of this approach is that it requires tokenization decisions to be made based on morphosyntactic information, which may cause difficulties for a pipeline approach to NLP.

A second issue, which we left unspecified in Section III-C, is the use of the null copula, which surfaces (pun intended) in case of copular constructions with present tense and a third person singular subject. Failing to introduce a null syntactic token will result in inconsistent analyses of copular expressions that differ only in trivial feature assignments, e.g., first person or third person subject–verb agreement. Figure 3 demonstrates this inconsistency. In Section III-C we demonstrated that the copular suffixes should be segmented to be able to properly analyze sentences like (10). For the same reasons, we need to segment the copula in the sentence analyzed in Figure 3a. However, unless we introduce a null copula as in Figure 3b, the tokenization and syntactic analysis of these two sentences will be different (as presented in Figure 3c), despite the fact that the two sentences differ only in the person/number features of the copular predicates. It seems that introducing a null copula becomes a necessity, unless one wants to introduce an inconsistency in the analyses of these two similar structures. Note, however, that the null element introduced here is unlike the null units introduced in certain grammar formalisms as a result of syntactic processes (e.g., movement). Nevertheless, null elements will typically not be allowed in a wide range of grammatical frameworks, where an alternative method may be needed to avoid this inconsistency.

As noted earlier, the criteria we set in (3) depend on the choice of the feature set. For example, many tag sets, e.g., UD, use the same feature label for the number feature of predicates and nominals. This causes either feature conflicts or inconsistent labels for morphological and/or syntactic tags in the representation of participles and verbal nouns, which should not be tokenized according to our proposal. For example, the word yazdıkları 'the ones he/she wrote' in (10) requires two number features: the nominal is plural, but the predicate has a singular subject. The analysis in Figure 2 avoids conflicting feature values within the word yazdıklarımızdan by indicating the number and person of the subject of the predicate yaz using a different tag than the person and number of the subject of the copula. As a result, this word cannot be represented as a single syntactic token without assigning separate labels for these two different roles. Similar issues may also arise because of the overloaded use of some syntactic relations.

V. Summary and outlook

This paper presented an analysis of the current use of sub-word syntactic units, IGs, and proposed a more conservative alternative to the current practice of segmenting words into multiple IGs. We show that sub-word syntactic units are necessary even under such a conservative approach. However, the number of sub-word units can be dramatically reduced with an appropriate choice of tagset for morphological features and syntactic relations. Our concrete proposal is that the introduction of IGs should be motivated by syntactic analysis, and a word should be tokenized into multiple IGs when (1) it cannot be represented as a simple triple ⟨lemma, POS tag, features⟩ and/or (2) parts of the word participate in separate syntactic relations.

The principles set in this paper for (not) segmenting a word into multiple units depend on the tagset in use. A logical next step is to complement this proposal with a tagset that is useful for a wide range of NLP applications. Although defining a proper tagset for morphological features is out of the scope of this paper, the guidelines above are useful in the design of such a tag set. We note that efforts like the Universal Dependencies project (Nivre et al. 2016) may facilitate constructing such tag sets through the consensus of the broad community of Turkish/Turkic NLP researchers.

Our motivation in this paper has been identifying syntactic units for computational processing of the language. However, the sort of units discussed in this paper are interesting from the perspective of (general/theoretical) linguistics as well. At present, the problems discussed here are underexplored in all subfields of linguistics, including computational linguistics (with the notable exception of Bozşahin 2002). This discussion may motivate further research with a more theoretical flavor, which in turn may benefit the computational methods. In closing, we also note that even though our discussion in this paper covers only Turkish, the same approach is likely to be relevant for other Turkic languages.

References

Bozşahin, Cem (2002). "The Combinatory Morphemic Lexicon." In: Computational Linguistics 28.2, pp. 145–186.
Buchholz, Sabine and Erwin Marsi (2006). "CoNLL-X shared task on multilingual dependency parsing." In: Proceedings of the Tenth Conference on Computational Natural Language Learning, pp. 149–164.
Çakıcı, Ruket (2008). "Wide-Coverage Parsing for Turkish." PhD thesis. University of Edinburgh.
Çetinoğlu, Özlem and Kemal Oflazer (2006). "Morphology–Syntax Interface for Turkish LFG." In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics, pp. 153–160.
Çöltekin, Çağrı (2014). "A set of open source tools for Turkish natural language processing." In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). Reykjavik, Iceland: European Language Resources Association (ELRA).
Çöltekin, Çağrı (2015). "A grammar-book treebank of Turkish." In: Proceedings of the 14th Workshop on Treebanks and Linguistic Theories (TLT 14). Ed. by Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, and Adam Przepiórkowski. Warsaw, Poland, pp. 35–49.
Göksel, Aslı and Celia Kerslake (2005). Turkish: A Comprehensive Grammar. London: Routledge.
Hakkani-Tür, Dilek Z., Kemal Oflazer, and Gökhan Tür (2002). "Statistical Morphological Disambiguation for Agglutinative Languages." In: Computers and the Humanities 36.4, pp. 381–410.
Hankamer, Jorge (2004). "Why there are two ki's in Turkish." In: Current Research in Turkish Linguistics. Ed. by Kamile Imer and Gürkan Dogan. Eastern Mediterranean University, pp. 13–25.
Kornfilt, Jaklin (1997). Turkish. London and New York: Routledge.
Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman (2016). "Universal Dependencies v1: A Multilingual Treebank Collection." In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), (accepted).
Oflazer, Kemal (1999). "Dependency Parsing with an Extended Finite State Approach." In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. College Park, Maryland, USA: Association for Computational Linguistics, pp. 254–260.
Oflazer, Kemal, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür (2003). "Building a Turkish treebank." In: Treebanks: Building and Using Parsed Corpora. Ed. by Anne Abeillé. Springer. Chap. 15, pp. 261–277.
Say, Bilge, Deniz Zeyrek, Kemal Oflazer, and Umut Özge (2002). "Development of a Corpus and a TreeBank for Present-day Written Turkish." In: Proceedings of the Eleventh International Conference of Turkish Linguistics. Eastern Mediterranean University, Cyprus.
Sulubacak, Umut and Gülşen Eryiğit (2013). "Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank." In: Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 129–134.
Tyers, Francis M. and Jonathan Washington (2015). "Towards a free/open-source universal-dependency treebank for Kazakh." In: 3rd International Conference on Computer Processing in Turkic Languages (TURKLANG 2015).

Allomorphs and Binary Transitions Reduce Sparsity in Turkish Semi-supervised Morphological Processing

Burcu Can†, Serkan Kumyol‡, Cem Bozşahin‡
† Department of Computer Engineering, Hacettepe University, Beytepe, 06800 Ankara, Turkey
‡ Cognitive Science Department, Informatics Inst., Middle East Technical University (ODTÜ), 06800 Ankara, Turkey
[email protected] [email protected] [email protected]

Goldsmith [2], [3] proposes an MDL-based system that models morphology in terms of morphological structures called signatures.2 The goal of the model is to minimize the amount of space through the signatures. A number of probabilistic approaches have been proposed for unsupervised morphological segmentation. There are maximum a posteriori (MAP), maximum likelihood (ML), Bayesian parametric and Bayesian non-parametric models. Creutz and Lagus [4] propose both an ML model and an MDL model to introduce one of the well-known unsupervised morphological segmentation systems, Morfessor. Creutz and Lagus [5] suggest another member of the Morfessor family using MAP estimation. Creutz [6] proposes a generative model which is based on the word segmentation model of Brent [7]. Morpheme length and frequency are used as prior information in the model. A Bayesian non-parametric model is a Bayesian model defined on an infinite-dimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem (see [8]). Goldwater et al. [9] develop a two-stage model where the types (i.e. morphemes) are created by a generator and the frequencies of the types are modified by an adaptor in order to generate a power-law distribution using a Pitman-Yor process.
Can and Manandhar [10] propose a model based on the Hierarchical Dirichlet Process (HDP) to capture morphological paradigms that are structured within a hierarchy. Virpioja et al. [16] introduce Allomorfessor, which is the only morphological analyzer that models allomorphs within the morphological segmentation task. They model the mutations between word forms to induce the allomorphs. All the models given above assume that morphemes are independent of each other. Here we adopt two different models, one of which is based on the independence assumption, while the other is based on a bigram morpheme model where each morpheme is assumed to be dependent on the previous morpheme. We make use of allomorphs in our model in order to reduce sparsity. Therefore, our model works like a class-based model where all allomorphs of the same morpheme are treated as the same, belonging to the same class.

Abstract—Turkish is an agglutinating language with heavy affixation. During affixation, morphophonemic operations change the surface forms of morphemes, leading to allomorphy. This paper explores the use of Turkish allomorphs in the morphological segmentation task. The results show that aggregating morphemes into allomorph sets and treating them as the same morpheme decreases the sparsity in morphological segmentation, leading to higher accuracy. The source of this supervision can be syntax, in particular the syntactic category of morphemes and their logical form. We further investigate the dependency of Turkish morphemes on each other, using unigram and bigram morpheme models, by adopting a non-parametric Bayesian model in the form of a Dirichlet process. The bigram morpheme model outperforms the single-morpheme model.

I. INTRODUCTION

Morphological segmentation is an important task in computational linguistics, which splits words into morphological components called morphemes.1 For example, the word başarılıdır is split into başar, ı, lı, dır (succeed, deverbal noun, comitative, copula—he/she/it is successful).
It can also be split as başarı, lı, dır (success, comitative, copula—he/she/it is with success), depending on how much of derivational morphology is considered non-lexical. Morphological segmentation becomes inevitable in any task that involves processing of languages with an agglutinative structure, because affixation leads to sparsity. It becomes impossible to build a vocabulary that consists of all possible word forms in the language. Indeed, the set of possible word forms that can be constructed in Turkish is infinite.

Morphological segmentation has been treated as both a supervised and an unsupervised machine learning (ML) problem. ML approaches to unsupervised learning of morphology start with the successor model of Harris [1], which counts the successors of each grapheme (as a proxy for phonemic realization) to detect morpheme boundaries at points where the number of grapheme successors is comparably higher. Minimum description length (MDL) is another method that has been applied in unsupervised morphological segmentation. MDL is based on selecting a model that aims to minimize the amount of space occupied by the data.

1 Strictly speaking, 'segmentation' is not the right level of abstraction for morphology, because segments are phonological concepts whereas morphemes are not. The fact that Turkish morphology is 'segmental' is an exception rather than the rule in the world's languages.
2 A signature consists of a set of stems and suffixes where each combination of a stem from the stem set and a suffix from the suffix set makes a valid word form.

TABLE I
Phoneme alternations of Turkish (from Oflazer et al. [11])

D: voiced (d) or voiceless (t)
A: back (a) or front (e)
H: high vowel (ı, i, u, ü)
R: vowel except o, ö
C: voiced (c) or voiceless (ç)
G: voiced (g) or voiceless (k)

III. THE BAYESIAN NON-PARAMETRIC MODEL

We propose two different models for morphological segmentation: the unigram Dirichlet process (DP) model, and the bigram Hierarchical Dirichlet process (HDP) model. Morphemes are drawn from a Dirichlet process by building a Markov chain.
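The metaphoneme classes in Table I are what the paper's filtering step (Algorithm 1 in Section III-C) uses to collapse allomorphs to a single form. A minimal Python sketch of that normalization is given below; the character map is taken from Algorithm 1 as printed, and the exemption of the non-alternating suffix -ken follows footnote 3. The function name is our own.

```python
# Metaphoneme map from Algorithm 1: alternating graphemes are replaced
# by their class symbol from Table I, so allomorphs collapse to one form.
CHARS = {'d': 'D', 't': 'D', 'a': 'A', 'e': 'A',
         'ı': 'H', 'i': 'H', 'u': 'H', 'ü': 'H',
         'ç': 'C', 'g': 'G', 'k': 'G', 'ğ': 'G'}

def filter_segment(segment: str) -> str:
    """Replace alternating graphemes with metaphonemes; 'ken' is exempt."""
    if segment == 'ken':  # the suffix -ken shows no allomorphy (footnote 3)
        return segment
    return ''.join(CHARS.get(ch, ch) for ch in segment)

# The plural allomorphs -lar/-ler collapse to the same class:
assert filter_segment('lar') == filter_segment('ler') == 'lAr'
# So do all eight allomorphs dir, dır, dur, dür, tir, tır, tur, tür:
assert {filter_segment(s) for s in
        ['dir', 'dır', 'dur', 'dür', 'tir', 'tır', 'tur', 'tür']} == {'DHr'}
```

After this mapping, the segmentation model sees one class ('lAr', 'DHr', ...) per allomorph set, which is exactly how the paper reduces sparsity in the morpheme lexicon.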
This view introduces some semantic supervision to learning the forms, because allomorphy, which is sameness of meaning under phonological variation, is a semantic notion. Section II explains allomorphy in Turkish. Section III describes the unigram morpheme model and the bigram morpheme model, and the inference on morphological segmentation. Section IV presents the evaluation scores obtained from the experiments, before discussion and conclusion.

II. TURKISH ALLOMORPHY

Affixation in Turkish mostly occurs as segmental concatenation of suffixes to a stem or root. Prefixes are very rare. Surface forms of Turkish morphemes may change depending on phonological context. Vowel harmony and consonant assimilation are two morphophonemic processes which are common in Turkish. Segment "deletion" is also common, which may be treated as insertion depending on one's morphological representation. The vowels can be grouped in relation to vowel harmony:

1. Back vowels: {a, ı, o, u}
2. Front vowels: {e, i, ö, ü}
3. Front unrounded vowels: {e, i}
4. Front rounded vowels: {ö, ü}
5. Back unrounded vowels: {a, ı}
6. Back rounded vowels: {o, u}
7. High vowels: {ı, i, u, ü}
8. Low unrounded vowels: {a, e}

Table I describes some alternations using these allophones. The rules governing an alternation refer to metaphonemes or allophones. A vowel alternation is presented in Example 1 below. '0' is the notation for deleted phonemes, and for deleted lexical symbols such as morpheme boundaries.

Example 1.
    Lexical form: bulut-lAr  N (cloud) − PLU    Surface form: bulut0lar (i.e. bulutlar)
    Lexical form: kedi-lAr   N (cat) − PLU      Surface form: kedi0ler (i.e. kediler)

Unlike the other Bayesian non-parametric models adopted for morphological segmentation, our model generates a set of allomorphs from a Dirichlet process, rather than generating each morpheme independently. Let corpus C be the set of words C = {w1, w2, w3, ..., wn}. Exploiting the segmental nature of Turkish morphology, we assume that each word wn consists of segments, viz. s + m1 + ... + mn, where s is the stem and the m are suffixes.

A. Unigram Dirichlet Process Model

In the unigram model, we assume that segments are independent of each other:

    p(w = s + m) = p(s) p(m)    (1)

We do not discriminate stems from suffixes in our model. Therefore, the probability of a word with multiple segments is given as follows:

    p(w = s1 + s2 + · · · + sn) = ∏_i p(si)    (2)

where the si denote the segments of w. Each segment s is drawn from a DP (see Figure 1):

    Gs ∼ DP(αs, Hs)
    s ∼ Gs    (3)

where DP(αs, Hs) denotes the Dirichlet process that generates a probability distribution Gs from which the segments are generated. Here αs is a concentration parameter which adjusts the skewness of the distribution. Large values of αs lead to a higher number of segments; low values reduce the number of segments generated per word. The condition αs < 1 results in sparse segments and a skewed distribution, while αs > 1 leads to a distribution closer to uniform that assigns similar probabilities to the segments. If αs = 1, all segments are equally probable and a uniform distribution is obtained. We use αs < 1 to favor a skewed distribution over the segments. Hs is the base distribution that determines the mean of the DP [12]. We use the segment lengths for the base distribution:

    Hs = γ^|s|    (4)

where |s| indicates the length of a segment and γ is a gamma parameter (γ < 1). The Dirichlet process in our model forms a Chinese Restaurant Process (CRP) where the same dish (i.e. segment type) is served at each table.
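Before the CRP construction is spelled out, the way (2)–(4) combine can be sketched in code. This is an illustrative toy, not the authors' implementation: the segment counts and the values of αs and γ are invented, and the seen/unseen case split anticipates the CRP estimate given in Eq. (5) below.

```python
import math

def base_prob(segment: str, gamma: float = 0.5) -> float:
    """Length-based base distribution H_s = gamma^|s| (Eq. 4); favors short segments."""
    return gamma ** len(segment)

def crp_prob(segment: str, counts: dict, total: int, alpha: float = 0.1) -> float:
    """CRP estimate of p(segment): seen segments by relative frequency,
    unseen segments backed off to the base distribution (cf. Eq. 5)."""
    if segment in counts:
        return counts[segment] / (total + alpha)
    return alpha * base_prob(segment) / (total + alpha)

def word_prob(segments, counts, total, alpha=0.1):
    """Unigram word probability: product over independent segments (Eq. 2)."""
    return math.prod(crp_prob(s, counts, total, alpha) for s in segments)

# Toy state: segment counts collected from earlier sampling iterations.
counts = {'ev': 3, 'ler': 5, 'de': 4}
total = sum(counts.values())  # 12
p = word_prob(['ev', 'ler', 'de'], counts, total)
assert abs(p - (3/12.1) * (5/12.1) * (4/12.1)) < 1e-12
```

With αs < 1, an unseen segment receives only a small share of probability mass, scaled by γ^|s|, so the sketch reproduces the skew toward frequent, short segments described above.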
Each segment is a customer and whenever a new customer enters the restaurant, either it joins a table with the same segment type, if exists, otherwise it Here, lar and ler are allomorphs and both have the plural meaning. The number of allomorphs can vary for different morphemes. For example, dir, dır, dur, dür, tir, tır, tur, tür are all allomorphs and define the status of a verbal action. 45 Fig. 2. The plate notation of the bigram Hierarchical Dirichlet process model. Each segment is generated through a DP, which is used in another DP in order to generate stem bigrams (si , si+1 ). model between-group dependencies. The bigram hierarchical Dirichlet process model is defined as follows (see Figure 2): Fig. 1. The plate notation of the unigram Dirichlet process model. wi is the word generated from a DP. si represents segments that form the word. Rectangular boxes show how many times the process is repeated. si+1 | si ∼ DP (αb , Hb ) Hb ∼ DP (αs , Hs ) si ∼ Hb creates a new table. The conditional probability of a segment is estimated through the CRP as follows: −si nsSi if si ∈ S −si −s S −si + α N i s p si S , αs , Hs = αs ∗ Hs (si ) otherwise −s N S i + αs (5) where, si+1 |si denotes the conditional probability distribution over adjacent segments. Hb is the base distribution of the bigram model that is another Dirichlet process with a base distribution Hs that generates each unigram segment in the model. Segment lengths are used for the base distribution again. Once the probability distribution p(si+1 |si ) is drawn from a Dirichlet process, the adjacent morphemes can be generated by a Markov chain. Here we do not want to estimate Hb and we integrate it out as follows: −si denotes the total number of segment tokens of where nSsi type si but with the new instance of the stem excluded from −si the complete set of stems S in the model. NsSi is the total number of segment tokens in S where new segment instance si is excluded. p (s1 , s2 ), (s2 , s3 ) . . . 
, (sM −1 , sM ) Z M Y = p(Hb ) p ((si−1 , si ) | Hb ) dHb B. Bigram Hierarchical Dirichlet Process Model In the bigram model, we assume that each morpheme is dependent on the previous morpheme: p(w = s + m) = p(s)p(m|s) (8) (9) i=1 (6) where M denotes the total number of bigram tokens. Thus, the joint probability distribution of bigrams becomes as follows: This rule assumes that the suffix is generated accordingly with the stem. The same applies to a word with multiple segments: Y p (w = s1 + s2 + · · · + sn ) = p(s1 ) p(si+1 |si ) (7) p(s1 , s2 , . . . , sM ) = p (s1 ) p (s2 | s1 ) p (s3 | s2 ) , . . . , p (sn | sM −1 ) p (0 00 | sM ) i where we again do not discriminate stems from suffixes. Here, the first segment of the word is generated from a Dirichlet process and bigrams are generated through another Dirichlet process. We use a hierarchical Dirichlet process (HDP) with two levels, where first we generate the first segment through a Dirichlet process and in the second level we generate the following segment depending on the previous segment through another Dirichlet process. HDP consists of multiple DPs within a hierarchy and is able to (10) Here 0 00 denotes the end of the word. Let us call each bigram bi = (si | si−1 ): p (w = {s1 , s2 , . . . , sM }) = p (s1 ) p(b1 )p(b2 ), . . . p(bM ) (11) Here p(s1 ) is drawn from Hs through the unigram Dirichlet process, which again forms a Chinese restaurant where 46 Algorithm 1 The filtering algorithm 1: input: D = {w1 = s1 + s2 + · · · + sn , . . . , wn = s1 + s2 , · · · + sn } 2: chars ←{’d’: ’D’, ’t’: ’D’, ’a’: ’A’, ’e’: ’A’, ’ı’: ’H’, ’i’: ’H’, ’u’: ’H’, ’ü’: ’H’, ’ç’: ’C’, ’g’: ’G’, ’k’: ’G’, ’ğ’ : G} 3: procedure F ILTER ( SEGMENT ) 4: if SEGM EN T 6= ’ken’ then 5: i←0 6: for i < length(SEGMENT) do 7: if SEGM EN T [i] in chars then 8: replace(SEGM EN T [i], chars[i]) only a segment type is served in each table having the segment tokens of the same type as customers. 
The other Dirichlet process forms another Chinese restaurant where each table serves a segment type and the customers are the following segments. The conditional probability of a segment bigram can be calculated according to the Chinese Restaurant Process given previously generated segments S = {s1 , s2 , . . . , sn } as follows: p (sR | sL )bi B −bi , S −sL , S −sR , αb , Hb , αs , Hs −bi nB bi if bi ∈ B −bi −s NsSL L + αb = αb ∗ p(sR ) otherwise −s NsSL L + αb (12) 9: 10: 11: 12: return segment for all m in D do : return: F ilter(m) −bi where nB denotes the number of bigrams of type bi bi when the new instance of the bigram bi is excluded. Here B denotes the bigram set that involves all bigram tokens in −s the model. NsSL L is the total number of bigram tokens in the model. sL and sR denote the left and right nodes of the bigram. Therefore, if the bigram bi exists in the model, the probability of generating the same bigram again becomes proportional with the number of bigram tokens of the same type. If the bigram does not exist in the model, it is generated with the probability proportional to the number of right node in the bigram: respect to the representations in Table I.3 Algorithm 1 takes a segment as input and replaces the graphemes with their allophones. D. Inference We use Metropolis-Hastings algorithm [13] to learn word segmentations in the given dataset. Words are randomly split initially. We pick a word from the dataset in each iteration and randomly split that word. We calculate the new conditional probability pnew of the sampled word and compare it with the old conditional probability of the sampled word pold by using Equation 5, Equation 12 and Equation 13. We either accept or reject the new sample according to the proportion of two probabilities (see Figure 3): −sR nSsR if sR ∈ S −sR −s N S R + αs αs ∗ Hs (sR ) else −s N S R + αs (13) −s where nSsR R is the number of segments of type sR in −s S when the new segment sR is excluded. 
N^{S−sR} is the total number of segment tokens in S, excluding sR. If the segment sR exists in the model, it is generated again with a probability proportional to its frequency in the model. If it does not exist in the model, it is generated with a probability proportional to the base distribution; therefore, shorter morpheme lengths are favored.

The hierarchical model is useful for modeling dependencies between co-occurring segments. The co-occurrence of unseen segments is also within the scope of the hierarchical model. The prediction capability of the model comes from the hierarchical modeling of co-occurrences, which leads to a natural smoothing: a segment bigram may not be seen in the corpus, but it is smoothed through the segments it contains, which leads to a kind of natural interpolation.

a = min(1, p_new / p_old)   (14)

If p_new / p_old > 1, the new sample is accepted. Otherwise, the new sample is still accepted with probability p_new / p_old, in order to be able to reach the global maximum.

IV. RESULTS AND EVALUATION

We used a publicly available Turkish dataset provided by Morpho Challenge 2010, both as a training set and as a test set.⁴ The dataset consists of a wordlist of 617,298 words with their frequency values. We did not make use of the frequencies in our model. Two sets of experiments were performed, for both the unigram model and the bigram model, with and without the filtering algorithm. In all experiments, we assume that words are made of only stems and suffixes; prefixes are ignored.

We followed the evaluation procedure of Morpho Challenge. In order to calculate the precision, two words that share a common segment are selected randomly

C. Incorporating Turkish Allomorphy Into The Model

³With the exception of the suffix 'ken', which does not show allomorphy. The cases ğ, ç and ü are not shown because our data does not contain any form with these symbols.
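The acceptance rule of Equation (14) amounts to a standard Metropolis-Hastings accept/reject step. A minimal sketch, with the threshold drawn from Uniform(0, 1) (the standard choice) and the function name ours:

```python
# Sketch of the accept/reject decision in Eq. (14): accept outright when the
# proposal is more probable, otherwise accept with probability p_new / p_old.
import random

def accept(p_new: float, p_old: float, rng=random.random) -> bool:
    if p_new > p_old:
        return True
    return rng() < p_new / p_old
```

With `p_new = 0.05` and `p_old = 0.1`, for example, the proposed segmentation is accepted in roughly half of the iterations, which is what lets the sampler escape local maxima.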
⁴http://research.ics.aalto.fi/events/morphochallenge2010/data/wordlist.tur

In this study, the rules of alternation are included in the segmentation model as a filtering algorithm for vowels, with respect to the representations in Table I.

Algorithm 2 The inference algorithm
 1: input: data D = {w1 = s1 + s2 + · · · + sn, . . . , wn = s1 + s2 + · · · + sn}
 2: initialize: i ← 1, w ← wi = si + mi, n ← iterations
 3: while n > 0 do
 4:   for all wi in D do
 5:     Randomly split wi as Snew = {s1, s2, . . . }
 6:     Remove the segments Sold from the model
 7:     pold ← p(Sold | D−wi)
 8:     pnew ← p(Snew | D−wi)
 9:     if pnew > pold then
10:       Accept the new segments of wi
11:       Sold ← Snew
12:     else
13:       random ∼ Uniform(0, 1)
14:       if random < (pnew / pold) then
15:         Accept the new segments of wi
16:         Sold ← Snew
17:       else
18:         Reject the new segments
19:         Insert the old segments Sold
20:   n ← n − 1
21: output: Optimal segments of the input words

Fig. 3. An example sampling step during the inference. The word Evlerde is randomly split into the segments Evle, r, and de. The old segmentation (Ev+ler+de) is compared with the new segmentation and either accepted or rejected.

Fig. 4. The results with the highest F-measure from the models. S indicates that the model is supervised by allomorph filtering.

TABLE II
COMPARISON OF OUR UNSUPERVISED MODEL WITH OTHER UNSUPERVISED SYSTEMS IN MORPHO CHALLENGE 2010 FOR TURKISH

System                       Precision(%)  Recall(%)  F-measure(%)
Morfessor CatMAP [5]         79.38         31.88      45.49
Aggressive Compounding [17]  55.51         34.36      42.45
Bigram HDP                   50.36         31.60      38.83
Iterative Compounding [17]   68.69         21.44      32.68
MorphAcq [18]                79.02         19.78      31.64
Morfessor Baseline [4]       89.68         17.78      29.67
Base Inference [17]          72.81         16.11      26.38

With the bigram model, the F-measure becomes 38.83%. This shows that morphemes are highly dependent on each other, and modeling the morphemes as bigrams is more realistic. Semi-supervision also yields a significant improvement in the F-measure when compared to the unsupervised setting. The gap between precision and recall is also not very big in the semi-supervised setting.
The highest F-measure obtained in the semi-supervised setting is 43.22%. Using allomorphs with the filtering algorithm decreases the number of morphemes to be modeled, leading to a more stable distribution in the Dirichlet process, with fewer tables and less sparsity.

We compare our unsupervised model with the other unsupervised models that participated in Morpho Challenge 2010. The results are given in Table II. Our model has an F-measure of 38.83%, which is ranked 3rd out of 7 models.

We also compare our semi-supervised model (with the filtering process) with other unsupervised models with supervised parameter tuning that participated in Morpho Challenge 2010. We do not include other unsupervised models or semi-supervised models in this comparison because they use

from the results, and it is checked whether they really share a common segment according to the gold segmentations. One point is given for each correct segment. Recall is estimated similarly, by selecting two words that share a common segment in the gold segmentations. For every correct segment, one point is given. The F-measure is the harmonic mean of precision and recall:

F-measure = 2 · Precision · Recall / (Precision + Recall)   (15)

The overall results are given in Figure 4. The bigram HDP with allomorphy supervision gives the highest F-measure, whereas the unigram HDP without any supervision is the weakest model. In the unigram model, which has an F-measure of 30.64%, we assign α = 0.5 and Γ = 0.5. There is a significant gap between precision and recall in this setting; the low recall implies undersegmentation. The bigram model makes a significant improvement on the results.

TABLE III
COMPARISON OF OUR SEMI-SUPERVISED MODEL WITH OTHER

REFERENCES
[1] Harris, Z. S.: From phoneme to morpheme. Language, 31(2):190–222, 1955.
[2] Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.
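Equation (15), applied to the Bigram HDP row of Table II, reproduces the reported score:

```python
# Harmonic mean of precision and recall (Eq. 15), checked against the
# Bigram HDP row of Table II (precision 50.36%, recall 31.60%).
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(50.36, 31.60), 2))  # -> 38.83
```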
[3] Goldsmith, J.: An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4):353–371, 2006.
[4] Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Volume 6, pages 21–30. Association for Computational Linguistics, 2002.
[5] Creutz, M., Lagus, K.: Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05), volume 1, pages 51–59, 2005.
[6] Creutz, M.: Unsupervised segmentation of words using prior distributions of morph length and frequency. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1, pages 280–287. Association for Computational Linguistics, 2003.
[7] Brent, M. R.: An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34(1-3):71–105, 1999.
[8] Orbanz, P., Teh, Y. W.: Bayesian nonparametric models. In Encyclopedia of Machine Learning, pages 81–89. Springer, 2010.
[9] Goldwater, S., Johnson, M., Griffiths, T. L.: Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466, 2005.
[10] Can, B., Manandhar, S.: Probabilistic hierarchical clustering of morphological paradigms. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 654–663. Association for Computational Linguistics, 2012.
[11] Oflazer, K., Göçmen, E., Bozşahin, C.: An outline of Turkish morphology. Report on Bilkent and METU Turkish Natural Language Processing Initiative Project, 1994.
[12] Teh, Y. W.: Dirichlet process. In Encyclopedia of Machine Learning, pages 280–287. Springer, 2010.
[13] Hastings, W. K.: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97–109, 1970.
[14] Spiegler, S., Golenia, B., Flach, P.: Unsupervised word decomposition with the Promodes algorithm. Volume I, 2010.
[15] Kohonen, O., Virpioja, S., Lagus, K.: Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 78–86. Association for Computational Linguistics, 2010.
[16] Virpioja, S., Kohonen, O., Lagus, K.: Unsupervised morpheme discovery with Allomorfessor. In CLEF (Working Notes), 2009.
[17] Lignos, C.: Learning from unseen data. In Proceedings of the Morpho Challenge 2010 Workshop, pages 35–38, 2010.
[18] Nicolas, L., Farré, J., Molinero, M. A.: Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, Helsinki, Finland, September 2010.
[19] Çakıcı, R., Steedman, M., Bozşahin, C.: Wide-coverage parsing, semantics, and morphology. In Turkish Natural Language Processing, Oflazer, K., Saraçlar, M., eds. Springer, forthcoming, 2016.

ALGORITHMS WITH SUPERVISED PARAMETER TUNING THAT PARTICIPATED IN MORPHO CHALLENGE 2010 FOR TURKISH

System              Precision(%)  Recall(%)  F-measure(%)
Promodes [14]       46.59         51.67      49.00
Promodes-E [14]     40.75         52.39      45.84
Morfessor U+W [15]  40.71         46.76      43.52
Bigram HDP S        49.21         38.52      43.22
Promodes-H [14]     47.88         39.37      43.21

the gold segmentations provided by Morpho Challenge. The results are given in Table III. Our model has an F-measure of 43.22%, which is ranked 4th out of 5 semi-supervised models. This is achieved with our minimal amount of supervision, using only the allomorphs and no other information.

The only existing allomorph-based morphological segmentation model is the Allomorfessor model developed by Virpioja et al. [16], a morpheme-level model which is able to manipulate the surface forms of morphemes with mutations.
Their model resulted in an F-measure of 31.82% on the Turkish dataset,⁵ with a huge gap between precision and recall. While their model achieves a 62.31% F-measure on the English dataset, its results for Turkish are quite low compared to ours. Their model can capture only 1.9% of the Turkish mutations. This leads us to think that allomorphy cannot be modeled by consonant mutations alone; it needs non-phonological information about morphological forms.

V. CONCLUSION

Our study made two contributions to morphological segmentation: (i) we found that it is important to incorporate intra-word dependencies into form-driven Bayesian models, and (ii) allomorphy seems to be a useful prior for the computational morphological segmentation task. Given that sublexical training at the level of suffixes' syntactic categories is becoming more feasible, along with their logical forms [19], there seem to be means for syntax to semi-supervise morphology. Syntactic information about the morphemes has an important impact on segmentation that cannot be ignored, as it would be under an independent-morphemes assumption. Providing room for co-occurrences of the morphemes in an unsupervised model provides language-specific information which improves the number of valid segments. Furthermore, the supervision of non-parametric Bayesian models is promising, as our minimal supervision achieved better results.

Our results are far from state-of-the-art performance. However, we believe that our experiments with allomorphs and morpheme dependency will facilitate further work on morphological processing.
⁵http://research.ics.aalto.fi/events/morphochallenge2009/

Automatic Detection of the Type of “Chunks” in Extracting Chunker Translation Rules from Parallel Corpora

Aida Sundetova#, Ualsher Tukeyev#
#Al-Farabi Kazakh National University, Research Institute of Mechanics and Mathematics, Al-Farabi av., 71, 050040 Almaty, Kazakhstan
{sun27aida, ualsher.tukeyev}@gmail.com

Abstract— This paper describes a method for the automatic detection of the type of the “chunks” generated by the methodology presented by Sánchez-Cartagena et al. (Computer Speech & Language 32:1 (2015) 46–90). The proposed automatic detection of chunk types improves the above methodology of extracting grammatical translation rules from bilingual corpora. The proposed improvement allows the output phrases of the extracted “chunk” translation rules to be used in the subsequent “interchunk” stage of a machine translation system, and improves machine translation quality. Experiments are done for the English–Kazakh¹ language pair using the free/open-source rule-based machine translation (MT) platform Apertium and bilingual English–Kazakh corpora.

Keywords— rules extraction, machine translation, Apertium, transfer rules, chunks.

morphological analyzer, POS disambiguator, structural transfer, morphological generator, post-generator, re-formatter.

1. A first stage of transformations (“chunker”) detects source language (SL) lexical form (LF) patterns and generates the appropriate sequences of target language (TL) LFs, which are grouped in chunks representing simple constituents such as noun phrases, prepositional phrases, etc. These chunks bear tags that may be used for interchunk processing.
2. The second round (“interchunk”) reads patterns of chunks and produces a new sequence of chunks.
This is the module where one can attempt to perform some longer-range reordering operations, interchunk agreement (for example, agreement in number and person between a noun and a verb phrase), case selection, etc.
3. The third round (“postchunk”) transfers chunk-level tags to the lexical forms they contain, whose lexical-form-level tags are linked (through a referencing system) to the chunk-level tags.
Structural transfer for English–Kazakh has an additional clean-up stage to remove tags.

Fig. 1. Structure of the Apertium platform

Three-stage structural transfer on the Apertium platform follows the description in [4]:

I. INTRODUCTION

Rule-based machine translation (MT) of natural language nearly always involves the following steps [1]: morphological analysis, part-of-speech (POS) tagging, translating words into the target language, execution of syntactic transformations and division into phrases (or chunks), and generation of new lexical forms (word lemmas with lexical categories) of target language words. In rule-based MT systems, most of these stages are implemented by handwritten translation rules. Creating handwritten rules is a very laborious process. Therefore, automatic extraction of translation rules from bilingual corpora is highly relevant.

This paper presents a method for the automatic detection of the type of “chunk” rules obtained using the methodology for automatic extraction of translation rules from bilingual corpora by Sánchez-Cartagena et al. (2015) [2], which is described in the following section. Their method requires creating tag groups and tag sequences for the new language pair and tuning the extraction script by declaring a monolingual dictionary, a bilingual dictionary, and bilingual corpora.

II.
“CHUNKING” RULES FOR THE APERTIUM PLATFORM

The Apertium free/open-source rule-based shallow-transfer MT platform [3] includes the following modules: de-formatter,

¹https://svn.code.sf.net/p/apertium/svn/staging/apertium-eng-kaz

Usually, three-stage transfer uses different types of phrases, which helps to apply rules to concrete structures from stage to stage. For example, the current version of the English–Kazakh MT system, for which the experiments have been done, has 169 handwritten “chunker” rules and is able to analyze the following kinds of phrases:
1. Noun phrases (NP). Sequences with nouns (nominative or accusative case) are analyzed as noun phrases; for instance, the phrase two little cars – <NP>{two (eki) <numeral> little (kiskentai) <adjective> car (autokolik) <noun>} is grouped into one NP phrase. NP phrases leave the tags for case and possessives undetermined, so that they can be assigned in the subsequent interchunk stage: I see <NP>{the sky<case-to-be-determined>} – I (Men) <NP>{sky<accusative> (aspnDY)} see (koremyn). As can be seen from the example, the noun phrase “the sky” has an undetermined case, but in the interchunk stage it is determined as accusative, and the ending -DY is added, which may change depending on vowel harmony.
2. Noun phrases as gerunds (NP-ger). Verbs that appear after verbs such as like, love, finish, start, hate, etc. and take the -ing form are also translated as NP phrases; on the Kazakh side, they carry gerund tense: I like playing – <NP>{I (men) <subject pronoun>} <VP>{like (zhaqsy koremin) <verb>} <NP>{playing (oinaudy) <verb gerund>}. This type of verb is treated as a noun phrase because, on the Kazakh side, it can take case and possessive markers just like noun phrases headed by nouns.
3. Verb phrases (VP). All kinds of verbs: simple verbs (a single word), complex verb tenses (continuous, perfect), modal verbs (which assign genitive case to the subject: I must play – Meniŋ (Менің) oinauym (ойнауым) kerek (керек)), etc.
Modal verbs have special phrases, for instance “VP_must_inf” or “VP_should_inf”, which help to assign the possessive from the subject by a rule written in the interchunk stage.
4. Prepositional phrases. These phrases feature the locative (-да/-da – in house – үйде/uide), ablative (-нен/-nen – from river – өзеннен/ozennen), genitive (-ның/-niŋ – of city – қаланың/kalanyŋ), postpositions, as well as complex postpositional phrases with the words үст/аст + possessive + locative (under the table – үстелдің астында/usteldiŋ astynda).
5. Question verb phrases (VP_Q) are used to detect questions that start with did/do, was/were, etc. The auxiliary verbs are analyzed as VP_Q and processed in the interchunk stage to generate question particles in Kazakh (ma/me, etc.). For instance, “Do you remember?” – Sizdiŋ (Сіздің) esiŋizde (есіңізде) me (ме)?
6. Auxiliary verb phrases (be/have/do, etc.). Such phrases are used in the following structures: <VPQ>{Do <verb do> (only tense)} you play? – to translate questions, where the rule only detects the tense and transfers it to the next stage in the <VPQ> phrase; I <VP_be>{am <vbser> (e <copula>)} a teacher – to generate the copula “e” (edi, edim) and move it to the end of the noun phrase (a teacher – mugalim[e+myn]), then assign person and number from the subject at the interchunk stage.
7. Adjectival phrases: single adjectives (AdjP big) and comparative adjectives (AdjP bigger). Superlative adjectives have a different phrase because, for instance, the translation of “SupP the most beautiful” is “SupP eŋ (ең) ædemi (әдемі)”, but it can take a possessive (SupP {the most interesting} of these books – kitaptardyŋ (SupP {eŋ ædemisi})), so it cannot be treated as a regular adjective phrase AdjP.
As can be seen from the phrases above, each type of phrase has concrete operations that can be done at the interchunk stage: determining case and possessives, assigning person and number, and moving positions. Without specific phrase names, a well-working interchunk stage is impossible.

III.
EXTRACTING “CHUNKER” RULES FROM CORPORA

The method described by Sánchez-Cartagena et al. (2015) was inspired by the work of Sánchez-Martínez and Forcada (2009) [5], where alignment templates were also considered for structural transfer rule inference. However, the new approach overcomes the main limitations of Sánchez-Martínez and Forcada (2009). First, it chooses the appropriate generalization level for the alignment templates (ATs), which contain word alignments and use word classes instead of the words themselves [6,7,8], and from which the rules are generated. Second, it treats differently the words that have difficulties with context-dependent lexicalizations and are incorrectly translated by more general ATs. Third, it automatically selects the ATs to be used for generating convenient rules.

To adapt the method of Sánchez-Cartagena et al. (2015) to the English–Kazakh language pair, the following steps were performed:
1. Building English–Kazakh parallel corpora by using Bitextor², a web crawler for parallel texts, and manually collected texts from fiction literature. The manually collected corpus consists of ~3200 parallel sentences; together with the crawled texts, the parallel English–Kazakh corpus contains 5625 sentences. Experiments were done on a corpus of 140 sentences, and the big corpus is used for testing and tuning.
2. Creating a tag groups file for the Kazakh language. The Sánchez-Cartagena et al. (2015) method had not been tested on Turkic languages, which have rich morphology. As a result, this file for the Kazakh language has more morphological tag groups.
Groups have the following format; for instance, the group for numerals is numtype:ord,coll,year:num, where numtype is the name of the variable used to identify the different types of numerals: ord (first, second), coll (used in Kazakh to identify a number of objects or subjects without a following noun: two persons – eki adam (екі адам), two came – ekey keldi (екеуі келді)) and year (numerals coming after prepositions: in 1992); after the final “:” comes the name of the part of speech, “numeral” – num. If some tags belong to several parts of speech, these are put after the “:” and divided by commas: tense:present:vblex,v. This file is used to generate an appropriate group of tags for each part of speech on the English and the Kazakh side; all necessary tags can be found in the morphological analyses of the English–Kazakh MT system on the Apertium platform.

²https://svn.code.sf.net/p/bitextor/code/trunk

As can be seen from Table I, five phrases are defined. The first level is defined as X; the next levels are modified with other grammatical constituents functioning as specifiers: X' defines an X+X phrase, and X'' can define X+X' or X'+X' phrases. The following primary POS priority can be defined [10]:

Primary POS priority: V > N > A > P

According to this priority, for example for the English–Kazakh language pair, POS sequences will be defined for each phrase as follows:

3. Creating a tag sequences file, where the defined tag groups are combined into appropriate sequences of tags, according to the morphological analysis. The sequences are used to generate the target language sequences of tags, which are the lexical categories of the lexical forms. If the morphological analysis of the word “do” is do<vbdo><pres><p3><sg>, in the tag sequence format it looks as follows: vbdo:verbtime,person,numberat, where vbdo is the name of the lexical category, and verbtime, person, numberat are names of tag groups defined in the tag groups file.
4.
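A tag-group line like numtype:ord,coll,year:num splits into three colon-separated fields. A minimal parsing sketch; the function name is ours:

```python
# Sketch of parsing one line of the tag-groups file:
#   <group-name>:<tag list>:<POS list>   e.g. numtype:ord,coll,year:num
def parse_tag_group(line: str):
    name, tags, pos = line.split(':')
    return name, tags.split(','), pos.split(',')

print(parse_tag_group('numtype:ord,coll,year:num'))
# -> ('numtype', ['ord', 'coll', 'year'], ['num'])
```

The same split handles groups whose tags belong to several parts of speech, such as tense:present:vblex,v, where the last field lists both lexical categories.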
Adapting the rule extraction script: defining the installed English–Kazakh language pair of the Apertium machine translation system, the morphological and bilingual dictionaries, and the corpus size.
5. Problem of the adapted method of extracting chunker rules from corpora. Some MT systems, like the English–Kazakh machine translation system on Apertium, use three-stage structural transfer, which means that the adapted method needs improvement, because the rule learning algorithm is designed to work only with 1-level Apertium transfer (only the apertium-transfer module and not apertium-interchunk). The generated chunks have no specific phrase names such as NP, VP, etc., shown in Section II (the method generates “LRN” phrases); this fact prevents the correct usage of these phrases in the interchunk stage.

TABLE II
POS SEQUENCES FOR X'-EQUIVALENCES

The POS priority for the English–Kazakh pair then looks as follows:

Primary POS priority: P > V > N > A

This priority was chosen based on the highest score obtained in the evaluation, which is shown in the Results section, and also on the fact that prepositions (P) can only be modifiers of nouns, and in Kazakh they are transformed into postpositions or case markers. In that case, PP phrases include a noun in their structure, which puts them before N in the priority. The described phrases are written in an additional file, where the user can specify phrases by priority, according to each language pair's features. This file is called “phrase.txt”, and the described priority is written in it in the Apertium chunk-name format.

IV. AUTOMATIC DETECTION OF CHUNK TYPE

To improve the quality of translation and to make the generated rules more usable in the interchunk stage, an additional step was added: detecting the phrase name for the generated chunks. For instance, if a chunk is named “__n__” and deals with nouns, the “NP” phrase should be assigned. To assign a phrase to each chunk, part-of-speech sequences are first defined for each phrase, and sequences of POSes are considered by using X'-theory [9], where the X'-equivalences shown in Table I are defined.
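The chunk-type detection step itself reduces to a priority lookup over the POS tags inside a chunk. A sketch under the P > V > N > A priority chosen above; the tag spellings (pr, vblex, n, adj) follow Apertium conventions, but the mapping table and function name are our assumptions:

```python
# Assign a phrase name to a generated chunk from the POS tags it contains,
# following the priority P > V > N > A; chunks matching nothing keep the
# generic "LRN" name produced by the unmodified extraction method.
PRIORITY = ['pr', 'vblex', 'n', 'adj']            # P > V > N > A
PHRASE = {'pr': 'PP', 'vblex': 'VP', 'n': 'NP', 'adj': 'AdjP'}

def detect_phrase(pos_tags) -> str:
    for pos in PRIORITY:
        if pos in pos_tags:
            return PHRASE[pos]
    return 'LRN'

print(detect_phrase(['det', 'adj', 'n']))  # -> NP
```

Because a PP contains a noun, the preposition tag is checked first, which is exactly why P precedes N in the chosen priority.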
TABLE I
X'-EQUIVALENCES

TABLE III
POS SEQUENCES FOR X'-EQUIVALENCES

TABLE IV
COMPARING TRANSLATION RULES

As can be seen from Table III, the user first writes the name of the phrase and then the parts of speech that define this phrase: VP,vblex,vbser,vbhaver,vbmod. The phrase detection program reads this file and the generated file with rules, and assigns the phrases. To make the application more usable, the rule templates were changed by adding one-word rules. The evaluation of this method is described in the next section.

V. RESULTS

The results of the improved method were obtained using the English–Kazakh MT system on Apertium. From the GATs extracted from the 140-sentence corpus, 13 rules were generated. In Table IV, some of the translation rules obtained with the handwritten and the extraction processes are compared. As can be seen from Table IV, a few generated rules work correctly, but the number of generated rules is not big, because of the small volume of the corpus. The main difference between the rules is that in the handwritten rules some tags are undetermined (<PXD>, <ND>, <PD>, <NXD>) or can be changed in the next interchunk stage, whereas the generated rules assign all tags constantly. Also, the generated rules may miss some words while translating, as can be seen from the last translation of the sentence “A dog is also in the garden”, where the generated rule translated it without the adverb “also”. Such problems appear because of the low generalization level; that problem could be solved by using a bigger corpus for extracting rules. In the next table, the quality of the translations produced with the rules obtained after using the phrase-specifying application is compared to the quality without it:

TABLE V
QUALITY OF TRANSLATED TEXTS

As can be seen from Table V, adding the phrase detection step and the improvements to the rule templates helped to raise the quality of translation by 4%: for unigrams it rose by 12.55%, for bigrams by 8.65% and for trigrams by 5.02%. Table VI shows sentences translated by the Sánchez-Cartagena et al.
methodology and by the proposed improved methodology:

TABLE VI
TRANSLATED TEXTS

After the chunker stage, the input text is transformed into sequences of tags, where the phrase type tags are in bold. In the interchunk stage, as can be seen from the fourth column, chunks with the phrase type “<LRN>” did not change their position, whereas the specified phrases NP, VP, PP changed as follows: NP VP PP → NP PP VP. The last columns show the output of the translation. As a result, the new method produced the right sequence of phrases, with the verb phrase (in italics) at the end of the sentence.

VI. CONCLUSION

This paper proposes an automatic method for detecting the type of “chunks”, improving the Sánchez-Cartagena et al. (2015) methodology of extracting grammatical translation rules from bilingual corpora. The results of this paper can be used for other morphologically rich languages. The proposed improvement of the methodology of extracting grammatical translation rules from corpora improves machine translation quality. For future work, it is planned to apply the improved methodology to a bigger English–Kazakh corpus and to other language pairs, such as Kazakh–Russian.

ACKNOWLEDGMENT

The authors thank Prof. Mikel L. Forcada and Miquel Esplà-Gomis from the Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant (Alacant, Spain), for their continuous consultation and help in the research and development of this project. This research work is done in the frame of project 0749/GF, financed by the Ministry of Education and Science of the Republic of Kazakhstan.

REFERENCES
[1] Hutchins, William John, and Harold L. Somers. An introduction to machine translation. Vol. 362. London: Academic Press, 1992.
[2] Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2015. A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora. Comput. Speech Lang.
32, 1 (July 2015), 46–90.
[3] Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, Francis M. Tyers. Apertium: a free/open-source platform for rule-based machine translation. In Machine Translation (Special Issue on Free/Open-Source Machine Translation), volume 25, issue 2, p. 127–144.
[4] Sundetova, A., Forcada, M. L., Shormakova, A., Aitkulova, A.: Structural transfer rules for English-to-Kazakh machine translation in the free/open-source platform Apertium. Proceedings of the International Conference on Computer Processing of Turkic Languages, pp. 317–326. L.N. Gumilyov Eurasian National University, Astana (2013).
[5] F. Sánchez-Martínez and M. L. Forcada. Inferring shallow-transfer machine translation rules from small parallel corpora. Journal of Artificial Intelligence Research, 34(1):605–635, 2009. ISSN 1076-9757.
[6] F. J. Och and H. Ney. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449, 2004.
[7] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[8] Y. Xu, T. K. Ralphs, L. Ladányi, and M. J. Saltzman. Computational experience with a software framework for parallel integer programming. INFORMS Journal on Computing, 21(3):383–397, 2009.
[9] Sells, Peter (1985), Lectures on Contemporary Syntactic Theories, Lecture Notes, No. 3, CSLI.
[10] Kuang-hua Chen and Hsin-Hsi Chen. 1994. Extracting noun phrases from large-scale texts: a hybrid approach and its automatic evaluation. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94). Association for Computational Linguistics, Stroudsburg, PA, USA, 234–241.
DOI=http://dx.doi.org/10.3115/981732.981764

Simplification of Turkish Sentences

Dilara Torunoğlu-Selamet, Tuğba Pamay, Gülşen Eryiğit
Department of Computer Engineering
Istanbul Technical University
Istanbul, 34469, Turkey
[torunoglud, pamay, gulsen.cebiroglu]@itu.edu.tr

(preteens) face difficulty in understanding the arguments of the main predicate of a sentence, which may be complicated. Preteens have a tendency to use simple sentence structures in their daily lives, and when they come across sentences with complex structures in school textbooks, they may fall behind in class. For this reason, in this paper we focus on Turkish, examine sentences from an elementary school textbook to extract complex structures, and propose a sentence simplification system to automatically generate simpler versions of the sentences. Thereby, sentences become easier for children to understand, especially for those with difficulty in reading comprehension.

In this paper, we take advantage of inflectional groups in Turkish and investigate certain types of complex-structured sentences. We divide these sentences into three main categories: 1. Coordinate Sentences, 2. Paratactic Sentences, 3. Subordinating Sentences; each main category also has sub-categories. Examples of these categories are explained in detail in Section III. We then derive rules corresponding to each category and apply the rules to sentences taken from an elementary school textbook. We prepared a data set annotated morphologically and syntactically with the NLP tools of [5] to use in the sentence simplification. The paper is structured as follows: Section II gives brief information about related work, Section III introduces the sentence structures on which we focus and presents our sentence simplification approach, and Section IV presents the conclusion and future work.
Abstract—Text Simplification is the process of transforming existing natural language text into a new form aiming to reduce their syntactic or lexical complexities while preserving their meaning. A sentence being long and complicated may pose multiple problems especially for elementary school children. In this paper1 , we focus on Turkish, a morphologically rich language, and examine sentences from an elementary school text book to extract complex structures and propose a sentence simplification system to automatically generate simpler versions of the sentences. Thereby, sentences become easier for children to understand, particularly children with difficulty in reading comprehension. Our system automatically uses simplification operations, namely splitting, dropping, reordering, and substitution. Keywords—Text Simplification, Sentence Simplification, Turkish I. I NTRODUCTION Text Simplification is the process of transforming existing natural language text into a new form with aim of reducing their syntactic or lexical complexity while preserving their meaning. Applications of Text Simplification can help people to understand natural text with less effort. The target audience might be people with language disabilities like aphasia, adults learning a foreign language, low-literacy readers [1] and children [2]. Text simplification is also used in areas like Machine Translation (MT) [3] and Text Summarization (TS) [4]. At sentence level, reading difficulties (sentence complexities) lie in the syntactic and lexical levels, so simplification of sentences can be classified into two general categories: Lexical and Syntactical Simplification. Without considering the language level, there are some approaches for lexical and syntactic simplification based on Statistical Machine Translation. The concept of a simple, “easy-to-read” sentence is not universal. 
Sentence length and syllable count can give a good estimate but it will not be complete since we are taking the preserving of meaning and understandability into account during the simplification process. Also, requirements of “easy-to-read” sentences can vary from audience to audience. Sentence simplification for highly inflectional or agglutinative languages has significant problems. For example, in Turkish, some words may be omitted from a sentence yet the meaning may remain the same. Elementary school children II. R ELATED W ORK Text simplification has become a highly investigated topic with the increase in the use of NLP systems. These systems suffer lower accuracy results from the complexity of the sentences. One study [6], proposes a sentence simplification model which is based on tree transformation by Statistical Machine Translation (SMT) [7], [8]. This work covers operations like sentence splitting, reordering, deleting (dropping) and phrase/word substitution. The parallel corpora that were used in this work (PWKP) were generated from English Wikipedia and Simple English Wikipedia. Another study [9] presents a data-driven model based on quasi-synchronous grammar. In contrast to state of art solutions [6], operations are not defined explicitly; instead the quasi-synchronous grammar extraction algorithm learns appropriate rules from the training data. In 1 This work is part of our ongoing research project “A Signing Avatar System for Turkish to Turkish Sign Language Machine Translation” supported by TUBITAK FATIH 1003 (grant no: 114E263). 55 another study [10] which presents a machine translation based approach similar to [6], differs in that it does not take syntactic information into account and only relies on phrase based machine translation methods to implicitly learn simplifying and paraphrasing of phrases. They claim that they produced a language agnostic solution. However they only worked on lexical operations for sentence simplification. 
In [11], a lexical approach was followed for sentence simplification for different learning levels and contexts. Their method has four steps: part-of-speech (POS) tagging, synonym probing, context-frequency-based lexical replacement, and a sentence checker. They evaluated their results with human annotators by asking only yes/no questions testing meaning and simplicity. They did not use parallel datasets; instead, they used context-based books for the lexical operations. The study [12] focuses on syntactic simplification to make text easier for human readers to comprehend, or for programs to process. They formalize the interactions that take place between syntax and discourse during the simplification process and present the results of their system. Most of the recent work focuses on English, yet there are some studies on other languages. The study in [13] focuses on Brazilian Portuguese. Another study [14], based on dependency parsing of Spanish sentences, is capable of lexical simplification, deletion operations and sentence simplification operations. The study [15] aims to develop an approach to the syntactic simplification of French sentences. Another use of text simplification is to help children understand complex sentences in books. One of the studies conducted for this purpose is [16], which examines children's stories and proposes a text simplification system to automatically generate simplified, more comprehensible versions of the stories for children, especially those with difficulty in reading comprehension. Splitting, dropping, reordering and substitution operations can be performed with the proposed system. Another study with the same approach is [2], which chooses children as the target audience of text simplification operations. They perform both syntactic and lexical simplifications, following a rule-based system for this task. Inspired by these studies, in this paper we focus on simplifying children's textbooks.
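Rule-based splitting of the kind such systems perform can be illustrated with a small sketch. The token structure, dependency labels and function below are hypothetical illustrations written for this discussion, not the implementation of any of the cited systems:

```python
# Hypothetical sketch: splitting a coordinate sentence with a shared predicate.
# Each token is a (surface, dependency-label) pair; labels are illustrative.

def split_shared_predicate(tokens):
    """Split 'S1 O1, S2 O2 PRED.' into 'S1 O1 PRED.' and 'S2 O2 PRED.'"""
    predicate = next(t for t, dep in tokens if dep == "PREDICATE")
    clauses, current = [], []
    for surface, dep in tokens:
        if dep == "PREDICATE":
            break
        if surface == ",":          # clause boundary within the coordination
            clauses.append(current)
            current = []
        else:
            current.append(surface)
    clauses.append(current)
    # Insert the shared predicate into each sub-sentence.
    return [" ".join(clause + [predicate]) + "." for clause in clauses]

sentences = split_shared_predicate([
    ("Ali", "SUBJECT"), ("basketbolu", "OBJECT"), (",", "PUNCT"),
    ("Mehmet", "SUBJECT"), ("futbolu", "OBJECT"),
    ("sever", "PREDICATE"), (".", "PUNCT"),
])
print(sentences)  # ['Ali basketbolu sever.', 'Mehmet futbolu sever.']
```

A real system would of course derive the clause boundaries and the shared argument from a full morphological and syntactic analysis rather than from surface punctuation alone.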
III. DISCUSSION AND APPROACH

The morphologically rich nature of Turkish may result in orthographic words being split into multiple inflectional groups². For the sentence simplification approach, we take advantage of this property and investigate solutions for the simplification of syntactically complex sentences. We divide them into three main categories: 1. Coordinate Sentences, 2. Paratactic Sentences, 3. Subordinating Sentences; each main category also has sub-categories. We then derive rules corresponding to each category and apply the rules to sentences taken from the elementary school textbook.

² In Turkish NLP, words are generally split into sub-word units at their derivational boundaries, each resulting unit having a potentially different part-of-speech tag and dependency relation.

To apply the rules over the sentences, we benefit from the morphological and syntactic information of the tokens in the sentence and also use a morphological generator [5], one of the NLP tools, to generate the surface form of a token from its morphological analysis. Sentence simplification is executed in three steps, which are visualized in Figure 1. The first step is the analysis operation, in which the sentence is analyzed morphologically and parsed syntactically; in this way, we obtain the dependency relations between the tokens. Then, in the transformation stage, each rule is tried on the given sentence, and the first suitable rule is selected to be applied (only one rule may be applied to a sentence; if no suitable rule is found, the sentence is left in its original form). Insertion of a token is performed at this level if it is considered necessary. In the insertion step, the shared arguments of the original sentence are derived first, and then each shared argument is inserted into the sub-sentence. Examples of the insertion step are given in the sections below. The rule in Figure 1 is explained in Section III-C1 in detail. In the generation step, the sentence is divided into sub-sentences according to the information obtained from the transformation stage. At this phase, the morphological information of the tokens may be updated to fit the simplified version of the sentence. For this purpose, we use the morphological generator to reconstruct the new form of the token. The morphological generator produces a valid Turkish word by applying all the rules of a morphological analyzer in reverse order (from lexical form to surface form). For example, the analysis of the participle "görmediğim" (whom I have not seen) is produced as gör+Verb+Neg^DB+Adj+PastPart+P1sg by the morphological analyzer. This analysis is converted to gör+Verb+Neg+Past+A1sg in the generation step to construct the predicate of the sub-sentence as "görmedim" (I have not seen).

Fig. 1: Sentence Simplification steps

These three steps are valid for each of the rules explained in the sections below.

A. Coordinate Sentences

1) Shared Predicate: For this category, we introduce sentences in which the predicate is shared by elements that are interconnected by a coordination structure. A sample sentence under this category is shown in Figure 2. In the sample sentence, the word "sever" (likes) is the shared predicate. Turkish allows the non-repetition of some words in a sentence, which may make it difficult for children to identify the arguments of the shared predicate. In this category, sentences are split based on the number of sub-parts in the original sentence. The elements of the sub-parts are determined by the coordinated arguments in the sentence. For example, in Figure 2, "Ali" and "Mehmet" are coordinated subjects and "basketbolu" (basketball) and "futbolu" (football) are coordinated objects of the same predicate. After splitting, the sentence is transformed into the new structure presented in the simplified version in Figure 2.

Fig. 2: Example for Shared Predicate Category
Original version: "Ali basketbolu, Mehmet futbolu sever." ('Ali [likes] basketball, Mehmet likes football.')
Simplified version: "Ali basketbolu sever. Mehmet futbolu sever." ('Ali likes basketball. Mehmet likes football.')

2) Shared Object: The sentence structure of this category is similar to the sentences in Section III-A1. However, in this case an object is shared by the elements of the coordinated structure. An example under this category is given in Figure 3. The word "hayvanları" (animals) is the object shared by the two predicates "sevelim" (let's like) and "koruyalım" (let's protect). Instead of the non-repetition of the shared argument, it may be preferable to use the same argument twice in the sentence; in this way, the meaning of the sentence may be conveyed more clearly to preteens. In this category, sentences are split based on the number of sub-parts in the original sentence. The elements of the sub-parts are determined by the coordinated predicates in the sentence. For example, in the sentence in Figure 3, "sevelim" and "koruyalım" are predicates coordinated through the same object. In the splitting operation, the sentence is split into new sentences corresponding to the coordinated predicates, and the shared arguments are inserted into all sub-sentences. After simplification, the sentence is divided into a number of parts, two in this case, and the split sentence is given in the simplified version in Figure 3. This way, the sentence conveys the same meaning, but with a syntactically simpler structure.

Fig. 3: Example for Shared Object Category
Original version: "Hayvanları sevelim, koruyalım." ('Let's like animals, protect (them).')
Simplified version: "Hayvanları sevelim. Hayvanları koruyalım." ('Let's like animals. Let's protect animals.')

B. Paratactic Sentences

For this category we focus on sentences that do not have any shared argument or predicate. These consist of independent clauses separated by conjunctions or punctuation. As the predicates share no arguments, each sub-sentence has its own elements. An example sentence under this category is shown in Figure 4. As seen in the sample, there are two coordinated predicates: "açtı" (opened) and "uyandı" (woke up). These predicates have their own arguments; for example, "Ebru" is the subject of "açtı" and "Elif" is the subject of "uyandı". In this category, sentences are split at the conjunctions or punctuation marks that separate the independent clauses, resulting in a number of sub-sentences. Since these predicates have their own arguments, no argument insertion is performed in this process. The example is given in Figure 4.

Fig. 4: Example for Paratactic Sentence Category
Original version: "Ebru pencereyi açtı ve Elif uyandı." ('Ebru opened the window and Elif woke up.')
Simplified version: "Ebru pencereyi açtı. Elif uyandı." ('Ebru opened the window. Elif woke up.')

C. Subordinating Sentences

A subordinating sentence is a sentence that contains a subclause. Subclauses are not complete sentences by themselves; rather, they add information to complete the meaning of the whole sentence. These subclauses are formed by subordinate conjunctions (e.g., when, until, and so on) and relative pronouns (e.g., who, which, and so on). In this study, for Turkish, we focus on two types of subclauses: 1. Participle Subclauses, 2. Converbial Subclauses.

1) Participle Subclauses: For this category, we introduce sentences containing subclauses whose heads are participles. Participles are adjectives derived from a verb. An example under this category is given in Figure 5. When the English translation of the sentence is considered, the part that starts with the relative pronoun "who" forms a subclause which modifies the word "aunt". In the Turkish sentence, the part "uzun süredir görmediğim" (whom I have not seen for a long time) forms a subclause. This is a participle subclause because the head of this part is used as an adjective that modifies the word "teyzem" (my aunt). In this category we benefit from the inflectional groups of the words in the sentence. In the example, when the sentence is semantically analyzed, the person whom I have not seen and the person who is coming are the same person. Using this property, the sentence is split into two parts: the first part covers the subclause arguments and the second the main sentence arguments. There is an important issue in this category: the token modified by the participle subclause is inserted into the first split part with the proper dependency relation. The word "teyzem" (my aunt) is in nominative case. Thus, when this token is inserted into the subclause part in the simplification process, its morphological analysis is changed to accusative case before using the morphological generator. This way, we ensure that the simplified sentences are grammatically correct.

Fig. 5: Example for Participle Subclause Category
Original version: "Uzun süredir görmediğim teyzem bize geliyor." ('My aunt, whom I have not seen for a long time, is coming to us.')
Simplified version: "Teyzemi uzun süredir görmedim. Teyzem bize geliyor." ('I have not seen my aunt for a long time. My aunt is coming to us.')

2) Converbial Subclauses: The sentence structure of this category is similar to that of Section III-C1. For this category, we introduce sentences containing a subclause whose head is a converb. Converbs are adverbs derived from a verb inflectional group. An example under this category is given in Figure 6. When the English translation of the sentence is considered, the part that starts with the subordinating conjunction "while" forms a subclause which modifies the predicate of the main sentence, "fell down". In the Turkish sentence, the part "koşarken" (while [she was] running) forms a subclause. This is a converbial subclause because the head of this sub-part is a converb which modifies the main predicate. In the example, when the sentence is semantically analyzed, the person who fell down and the person who was running are the same person, "Ayşe". This word is assigned only as the subject of the main predicate in the syntactic analysis. As a result, in the simplification process, this token is also inserted into the sub-sentence formed by the subclause. Moreover, the head of the converbial subclause is used as the verb of the first sub-sentence; therefore, this converb token is converted to verb form using the morphological generator tool.

Fig. 6: Example for Converbial Subclause Category
Original version: "Ayşe koşarken düştü." ('Ayşe fell down while [she was] running.')
Simplified version: "Ayşe koştu. Ayşe düştü." ('Ayşe ran. Ayşe fell down.')

IV. CONCLUSION AND FUTURE WORK

A long and complicated sentence can pose multiple problems in daily life. For example, in Turkish, some words may be omitted from a sentence yet the meaning may remain the same. However, elementary school children (preteens) may face difficulty in understanding the arguments of the main predicate in a complicated sentence. For this reason, in this paper, we focus on solving this problem by simplifying the given sentences. We take advantage of inflectional groups in Turkish and investigate certain types of complex structured sentences. We divide them into three main categories: 1. Coordinate Sentences, 2. Paratactic Sentences, 3. Subordinating Sentences. We then derive rules corresponding to each category and apply the rules to sentences taken from an elementary school textbook. We present an automatic sentence simplifier for these categories and propose an approach to dividing sentences to help children understand better. As future work, we plan to verify the effectiveness of our simplification and the preservation of meaning by testing our results on child readers. For validating our rules, we intend to use a human-focused evaluation with elementary-school children as the testing audience.

V. ACKNOWLEDGEMENTS

This work is part of our ongoing research project "A Signing Avatar System for Turkish to Turkish Sign Language Machine Translation" supported by TUBITAK FATIH 1003 (grant no: 114E263). The authors want to thank Umut Sulubacak and Memduh Gökırmak for their valuable discussions and help.

REFERENCES

[1] W. M. Watanabe, A. C. Junior, V. R. Uzêda, R. P. d. M. Fortes, T. A. S. Pardo, and S. M. Aluísio, "Facilita: reading assistance for low-literacy readers," ACM, 2009.
[2] J. De Belder and M.-F. Moens, "Text simplification for children," ACM, 2010.
[3] S. Tyagi, D. Chopra, I. Mathur, and N. Joshi, "Classifier based text simplification for improved machine translation," IEEE, 2015.
[4] A. Siddharthan, A. Nenkova, and K. McKeown, "Syntactic simplification for improving content selection in multi-document summarization," Association for Computational Linguistics, 2004.
[5] G. Eryiğit, "ITU Turkish NLP Web Service," April 2014.
[6] Z. Zhu, D. Bernhard, and I. Gurevych, "A monolingual tree-based translation model for sentence simplification," Association for Computational Linguistics, 2010.
[7] "A syntax-based statistical translation model," Association for Computational Linguistics, 2001.
[8] K. Yamada and K. Knight, "A decoder for syntax-based statistical MT," Association for Computational Linguistics, 2002.
[9] K. Woodsend and M. Lapata, "Learning to simplify sentences with quasi-synchronous grammar and integer programming," Association for Computational Linguistics, 2011.
[10] S. Wubben, A. Van Den Bosch, and E. Krahmer, "Sentence simplification by monolingual machine translation," Association for Computational Linguistics, 2012.
[11] B. P. Nunes, R. Kawase, P. Siehndel, M. A. Casanova, and S. Dietze, "As simple as it gets: a sentence simplifier for different learning levels and contexts," IEEE, 2013.
[12] "Syntactic simplification and text cohesion," vol. 4, no. 1, 2006.
[13] "Natural language processing for social inclusion: a text simplification architecture for different literacy levels," 2009.
[14] S. Bott, L. Rello, B. Drndarevic, and H. Saggion, "Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish," 2012.
[15] "Simplification syntaxique de phrases pour le français," 2012.
[16] T. T. Vu, G. B. Tran, and S. B. Pham, "Learning to simplify children stories with limited data," in Intelligent Information and Database Systems. Springer, 2014, pp. 31–41.

Comprehensive Annotation of Multiword Expressions in Turkish

Kübra Adalı, Tutkum Dinc, Memduh Gokırmak, Gülşen Eryiğit
Dep. of Computer Engineering, Dep. of Linguistics, Dep. of Computer Engineering, Dep.
of Computer Engineering
Istanbul Technical University (Adalı, Gokırmak, Eryiğit), Maslak, Istanbul 34369; Istanbul University (Dinc), Beyazıt, Istanbul
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Multiword expressions (MWEs) are pervasive in Turkish, as in many other languages, and they raise many challenges for Natural Language Processing. The scarcity of annotated language resources is one of the most prominent challenges for lesser-studied languages, and, as always, the development of these resources requires a noteworthy effort. This paper is the first study which specifically focuses on the development of Turkish MWE resources for the purposes of 1) the categorization of different MWE types in Turkish, 2) use in MWE identification, and 3) use in research focusing on the interleaving between MWE identification and parsing. For these purposes, we annotated two Turkish treebanks (IMST and IWT) with 11 MWE categories and 8 sub-categories for the MWE category Named Entity.

1. Introduction

As the name implies, multiword expressions are composed of multiple words that together produce an idiosyncratic meaning or have a distinctive syntactic role. They pose several challenges for natural language processing tasks as well as for language acquisition by non-native speakers. As a result, they have been an important issue covered in many studies since the inception of the field of NLP. The reader may consult many comprehensive studies for a complete discussion of MWEs ([1], [2], [3]). Their extraction and processing within NLP applications is still a very active research topic, as may be seen from many recent workshops ([4], [5]) and research initiatives (e.g., the EU PARSEME COST Action [6]).

Annotated data sets and lexicons are very valuable resources for MWE processing tasks. A comprehensive annotation of MWEs is a troublesome and exhausting process, and many languages, including Turkish, suffer from a lack of MWE-annotated language resources. Manually annotated treebanks are syntactically annotated corpora and are valuable resources for parsing research. The annotation of MWEs on treebanks would undoubtedly help investigations into the integration of MWE identification and parsing. As a result, there are many efforts to annotate MWEs on treebanks. Unfortunately, there is as of yet no common standard on how to annotate them; the aim of WG4 of PARSEME is to establish such standards for treebanks. [7] surveys MWE-annotated treebanks; some of these are the Prague Dependency Treebank [8], the French Dependency Treebank [9], and the Penn Treebank (a constituency treebank for English) [10]. Although there have been some previous attempts [11], [12] to build MWE-annotated treebanks for Turkish, this study is the first comprehensive annotation of MWEs on Turkish treebanks, being a fully manual annotation with detailed fine-grained categories. This study is also a first attempt to define suitable categories for the MWE annotation of Turkish, and we believe it will also aid the creation of multi-lingual MWE annotation guidelines. Two existing Turkish dependency treebanks (IMST [13], a treebank of well-edited texts, and IWT [14], a Web treebank) are annotated with 11 main MWE categories (nominal compounds, duplications, verbal compounds, light verb constructions, compounds constructed with determiners, conjunctions, formulaic expressions, idiomatic expressions, similes, proverbs and named entities) and 8 named entity sub-categories (Person, Location, Organization Names, Date and Time Expressions, Percentage, Monetary Expressions, Miscellaneous Numerical Expressions).

The remainder of the paper is structured as follows: Section 2 gives information about previous MWE studies in Turkish and introduces our proposed MWE categories, Section 3 presents the annotation process and the statistics, and Section 4 is the conclusion.

2. MWEs in Turkish

There are a couple of studies which focus on MWE discovery [15], MWE annotation [11], [12], and MWE identification [12], [16] in Turkish. [15] employs two simple statistical methods, a chi-square hypothesis test and mutual information, in order to discover Turkish collocations. [11] reveals that parsing performance is affected differently by the concatenation of the components of different MWE types. The most recent study on MWEs is [12], in which a coarse, undifferentiated annotation of MWEs took place and different lexical models for MWE identification, including automatic named entity recognition, were tested, demonstrating that their extraction model improves the accuracy of MWE extraction by a dependency parser [17] and by the extraction tool of [16].

As in other languages, MWEs pose interesting challenges for Turkish. In particular, the variability of MWE instances is very high due to the agglutinative and morphologically very rich nature of the language. The constituents of an MWE may be inflected, resulting in a high number of different surface forms [16], [18]. For example, the MWE "aklına gelmek" (to come to mind) may appear in different forms by taking personal agreement, tense, aspect and modality suffixes. In the sentence "Aklıma gelmedi" (It didn't come to my mind.), both of the components underwent inflection and differ from their lemma forms: the first word "aklıma" (to my mind) carries the 1st person possessive agreement suffix in dative form, and the second word "gelmek" (to come) is in past tense with 3rd person singular agreement. Non-compositionality and discontinuity are common challenges of MWEs which also appear in Turkish.

In this section, we introduce the categories that we defined for MWE types in Turkish, which we believe will provide the opportunity to address the problems of different types separately.
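The surface variability just described is why MWE lookup is commonly performed at the lemma level. The following sketch (with hand-supplied lemmas and a hypothetical one-entry lexicon, not the authors' annotation tooling) shows how an inflected instance of "aklına gelmek" can still be matched:

```python
# Hypothetical sketch: lemma-level matching of an inflected Turkish MWE.
# The lemmas below are supplied by hand for illustration; a real system would
# obtain them from a morphological analyzer.

MWE_LEXICON = {("akıl", "gel")}   # "aklına gelmek" (to come to mind), as lemmas

def find_mwe(tokens):
    """Return (start, end) spans whose lemma sequence is in the lexicon."""
    lemmas = [lemma for _, lemma in tokens]
    spans = []
    for i in range(len(lemmas)):
        for j in range(i + 1, len(lemmas) + 1):
            if tuple(lemmas[i:j]) in MWE_LEXICON:
                spans.append((i, j))
    return spans

# "Aklıma gelmedi." -- both components are inflected away from their lemmas,
# so surface-form lookup would fail, but lemma lookup succeeds.
tokens = [("Aklıma", "akıl"), ("gelmedi", "gel"), (".", ".")]
print(find_mwe(tokens))  # [(0, 2)]
```

Note that this naive contiguous matching would still miss discontinuous instances and cannot distinguish idiomatic from literal uses, which are exactly the harder challenges discussed above.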
The sub-categorization of MWEs will also pave the way for further investigations on hierarchical approaches for MWE identification and its integration into parsing. With this aim, we define 11 categories of MWEs which we detail in the remaining of this section. MWEs (e.g., “gu¨zel mi gu¨zel” (so beautiful)). Duplications can strengthen the meaning of the main word, turn an adjective into an adverb, or add an idiomatic meaning. We decided not to include the ‘m’-duplication (where a word is repeated with the first letter replaced with ‘m’ in the second occurrence) as a type of duplication MWE. 2.3. Verbal Compound MWEs In this type, the components form the MWE without undergoing a significant semantic change. They are formed with a noun and a verb1. This type of MWEs may be inflected more frequently than other types due to the verbal nature of their constructions. Examples of this pattern can be like: “karar vermek” (to decide), “so¨z vermek” (to promise). 2.4. Light Verb Construction MWEs Light Verb Construction MWEs are formed by six auxiliary verbs which are “olmak” (to be), “etmek” (to do), “yapmak” (to make) , “kılmak” (to render), “eylemek” (to make) and “buyurmak” (to order). Together with a preceding nominal, these auxiliary verbs behave as a finite verb. The verb phrase is a construction which has its own meaning, which can be idiomatic or relatively similar to that of its components. These MWEs can be easily detected using morphosyntactic information such as the existence of an auxiliary verb at the end of a verb phrase. Some examples are: “as¸ık olmak” (fall in love), “sinir etmek” (to aggrevate), “veda etmek” (to bid farewell), “yemek yapmak” (to cook), “gec¸ersiz kılmak” (to revoke), “emir buyurmak” (to give order). However, not every construction with the aforementioned auxiliary verbs falls under this category. 
For example, MWEs like “aforoz etmek” (to excommunicate) and “ah etmek” (to sigh) are considered idiomatic expressions and will be handled under that category.¹

¹ Excluding light verb constructions, which are also a special type of verbal MWEs collected under a separate category.

2.1. Nominal Compound MWEs

As described in [19], noun compounds are word-like units made up of two nominals. Our definition of nominal compound MWEs differs from this general definition in that they comprise only a subset of noun compounds used commonly enough to express a wide concept or class. These consist of bare compounds (the components do not take extra suffixes to mark the relation between them) and -(s)I compounds (the first component has no suffixes while the second one is marked with the third person possessive suffix -(s)I) [19]. Some examples are “kadın çorabı” (hosiery), “hakem heyeti” (arbitration court), “kredi kartı” (credit card) and “diş macunu” (toothpaste). As may be observed from the examples, the overall sense of this type of MWE may be discerned from its components.

2.2. Duplication MWEs

Duplications are linguistic units that are formed mainly by duplicating a nominal or modifier. The second word can be produced in several ways: reproduction of the exact word, a synonymous word, an antonymous word, an onomatopoeic word or a gibberish word. Examples that refer to each, respectively, are “çabuk çabuk” (very quickly, or lit. quick quick), “mal mülk” (property, or lit. property property), “aşağı yukarı” (almost, nearly, or lit. down up), “adı sanı” (public profile, or lit. name and fame) and “paldır küldür” (pell-mell). Duplications with an interrogative particle in between are also considered to be duplication MWEs.

2.5. Compound MWEs Constructed with Determiners

This category consists of compounds having at least one determiner component. The compounds “her şey” (everything), “şu an” (now) and “bir daha” (again/never) may be given as examples for this category. Differing from the previous compound MWE categories, MWEs of this category may be used in different roles (nominal, adjectival or adverbial) in a sentence.

2.6. Conjunction MWEs

Conjunction MWEs are a sort of transition phrase and are used to concatenate two sentences. Some examples of this category are “bu arada” (by the way), “bu yüzden” (therefore), “o halde” (then) and “bu sebeple” (for this reason). While exhibiting some semantic flexibility, the components of MWEs in this category largely retain their original meaning. This category excludes constructions formed by the addition of an enclitic intensifier such as “de”, “ise” or “ki” (e.g., “öyle ki” (so that), “ya da” (or)).

2.7. Formulaic Expression MWEs

MWEs in this category satisfy the following semantic and syntactic conditions. As the semantic condition, the MWE should carry the meaning of well-wishing or gratitude. As the syntactic condition, the MWE is an independent clause, mostly with an elided verb implied to be in a subjunctive mood. Some examples are “Ellerine sağlık (olsun)” (May God bless your hands), “Görüşmek üzere” (See you soon) and “Hoşça kal” (Goodbye). MWEs in this category may rarely resemble light verb constructions that also carry a sense of gratitude, such as “teşekkür etmek” (to thank) and “rica etmek” (to request).

2.8. Idiomatic Expression MWEs

Idiomatic expressions are MWEs with noncompositional meanings; i.e., the meaning of the MWE differs from the literal meaning of its components. For example: “etekleri zil çalmak” (to be very happy, or lit. ring the bells on the skirt) and “gemi azıya almak” (to get out of control, or lit. to scratch the bit with grinders). This type of MWE is quite challenging for MWE identification due to the ambiguity between idiomatic and literal use. To give some examples: “ayvayı yemek” (to be in a worrisome and bad situation, or lit. to eat the quince) and “ayağa kalkmak” (to protest, or lit. to stand up). In these cases, there is no morphosyntactic difference between the two uses of the word group as an idiom and as an ordinary phrase carrying literal meaning, hence it can be difficult to detect the MWE using contextual information.

2.9. Simile Expression MWEs

Similes are expressions comparing two things, in an often striking manner, using a connecting word (e.g., the word “gibi” (like) in Turkish). We include under this category not every comparison but only those in frequent use. The syntactic construction has two main parts: the figurative part and the post-positional particle, which is restricted to the single word “gibi” (like/alike). Some examples are “Agop’un kazı gibi” (voraciously), “damdan düşer gibi” (out of the blue), “avcunun içi gibi” (well known) and “kedinin ciğere baktığı gibi” (anxiously).

2.10. Proverb MWEs

Proverbs are idiomatic and frozen sentences [20] with no words changing or undergoing inflection. Consequently, this category can be considered the easiest one to identify as an MWE. Proverbs often describe some observation or experience with didactic intent. Some examples are given below:

• “Damlaya damlaya göl olur.” lit. (By dribbling) (a lake) (composes). (Many a little makes a mickle.)
• “Güneş balçıkla sıvanmaz.” lit. (The sun) (with mud) (can not be covered). (The truth can not be hidden.)
• “Yalancının mumu yatsıya kadar yanar.” lit. (The candle of the liar) (until isha) (burns). (The truth can not be hidden.)

2.11. Named Entities

In our annotation we consider a named entity to be a set of tokens denoting some unique entity in the real world. Their syntactic patterns and semantic properties are fixed, and they are not necessarily multiword expressions. Since most of the time they consist of two or more words, they are also treated as an MWE category. Named entities include 8 subcategories, namely the ENAMEX types (Person, Location and Organization names), the TIMEX types (Date and Time) and the NUMEX types (Percentage, Monetary and Miscellaneous Numerical expressions). We follow the MUC-6 [21] guidelines for our named entity definitions.

Person. This tag denotes persons referred to by name, and excludes any titles or alternate references other than the name of the person in question. Examples: “Başbakan Turgut Özal” (Prime Minister Turgut Özal), “Maliye Bakanı Ali Babacan” (Finance Minister Ali Babacan).

Location. Denotes the proper name of a location. For example: “Amerika Birleşik Devletleri’nden mektup geldi.” (A letter came from the United States of America.)

Organization. This subcategory is used for the name or the group of names of an organization, as in “Birleşmiş Milletler kararı uyguladı.” (The United Nations enforced the judgment.)

Date. Expresses an absolute date. As an example: “Doğum tarihi 25 Temmuz 1987’di.” (Her birth date was the 25th of July, 1987.)

Time. In this category, the named entity states an absolute time. Examples: “Saat 6:30’da film başlıyor.” (The film starts at 6:30.), “Sınavı bugün 10:30’daymış.” (Her exam is today at 10:30.)

Percentage. This category is used to represent percentage information, e.g. “Devrelerin yüzde yirmisi arızalı.” (Twenty percent of the circuits are defective.), “Adayların yüzde sekseni sınavdan kaldı.” (Eighty percent of the candidates failed the examination.)

Money. For this category, the word group denotes an expression of money or monetary value. An example: “O kitaba altmış lira verdim.” (I paid sixty liras for that book.)

Miscellaneous Number. We have diverged from the MUC-6 guidelines in this tag, and marked cardinal numbers with their own named entity tag. To give an example: “Altı yüz bin araba satılacak.” (Six hundred thousand cars will be sold.)
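The more regular duplication patterns above lend themselves to simple surface heuristics. The following sketch is illustrative only and not part of the paper's annotation pipeline; the function name and the particle list are our own assumptions, and the synonym, antonym, onomatopoeic and gibberish duplication types would require lexical resources that this toy detector does not cover.

```python
# Illustrative sketch (not the paper's method): surface-level detection of
# two duplication MWE patterns over tokenized Turkish text:
#   1) exact reduplication:                         "çabuk çabuk"
#   2) reduplication with an interrogative particle
#      in between:                                  "güzel mi güzel"

INTERROGATIVE_PARTICLES = {"mı", "mi", "mu", "mü"}  # assumed particle list

def find_duplication_candidates(tokens):
    """Return (start, end) token spans of duplication MWE candidates."""
    spans = []
    for i in range(len(tokens) - 1):
        if tokens[i].lower() == tokens[i + 1].lower():
            spans.append((i, i + 2))                # exact reduplication
        elif (i + 2 < len(tokens)
              and tokens[i + 1].lower() in INTERROGATIVE_PARTICLES
              and tokens[i].lower() == tokens[i + 2].lower()):
            spans.append((i, i + 3))                # "X mı X" pattern
    return spans

print(find_duplication_candidates(["çabuk", "çabuk", "geldi"]))
print(find_duplication_candidates(["güzel", "mi", "güzel", "bir", "ev"]))
```

Note that Python's default `str.lower()` does not handle the Turkish dotted/dotless "I" distinction, so a real implementation would need locale-aware casefolding.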
3. Annotation

The annotation process was carried out in two stages on both treebanks, with two annotators carrying out both stages on each treebank. The stages are as follows:

• The annotation of NE categories
• The annotation of MWE categories except named entities

3.1. NE Annotation

In the NE annotation process we annotated the entities according to the categories described in the previous section. Figure 2 shows an example dependency tree containing an organization named entity. In our annotation we have largely followed the MUC-6 [21] guidelines, with the addition of a single extra category for miscellaneous numerical expressions. The MUC-6 guidelines establish a standard for marking plain-text sentences with XML tags; however, we have annotated sentences in CoNLL format, in which morphological information and dependency relations are marked. We have added two extra columns to the data, one marking the type of the named entity, and another marking possible following items in a collocative named entity. This way of annotating named entities is particularly well suited to Turkish, as named entities tend to be adjacent and their dependency relations are overwhelmingly organized left to right from dependent to head. Inflectional suffixes are excluded from the named entity in the plain-text format marked with XML tags, but the entire token in the CoNLL file is marked as a part of the named entity. As the lemma is given in each CoNLL token, this does not result in a loss of data. Figure 1 shows an example CoNLL annotation for the sentence “Ben Arçelik’e sordum 31 Aralık’a kadarmış.” (I asked Arçelik, it’s until December 31st.), which exemplifies such a case on the word “Aralık” (December) inflected with a dative case marker.

Ben Arçelik’e sordum 31 Aralık’a kadarmış.
I Arçelik.DAT ask.PAST.1-SG 31 December.DAT until.EVID.3-SG

ID  Surface Form  Dep. Head  Dep. Relation  NE Type              Next Word
1   Ben           3          SUBJECT
2   Arçelik’e     3          MODIFIER       ORGANIZATION.ENAMEX
3   sordum        8          COORDINATION
4   31            5          MWE            DATE.TIMEX           5
5   Aralık’a      8          MODIFIER       DATE.TIMEX
6   _             7          DERIV
7   _             8          DERIV
8   kadarmış      0          PREDICATE
9   .             8          PUNCTUATION

Figure 1. The annotation format of an example sentence.

Figure 2. An example dependency tree showing an organization named entity, drawn for the sentence “Maliye Bakanlığı konuyla ilgili açıklama yaptı.” (Finance Ministry.3-POSS subject.INS related statement make.PAST.3-SG; “The Ministry of Finance made a statement on the issue.”). “Maliye” (Noun) is attached to “Bakanlığı” (Noun) with the MWE.NE.ORG relation; the remaining relations in the tree are SUBJECT, MODIFIER, ARGUMENT, OBJECT, PREDICATE and PUNCTUATION. [Tree drawing not reproduced.]

Table 1 shows the numbers of NE categories in the two treebanks. We annotated IMST [13] with detailed NE types for the first time; IWT [14], however, was annotated for MWEs previously in a recent study [22]. This made it possible to calculate the Cohen’s Kappa coefficient² [23] in order to evaluate the inter-annotator agreement between the current and the previous annotation [22]. From the scores it is seen that there is sufficient agreement between our annotator and the previous annotator.

TABLE 1. THE NUMBERS OF TYPES OF NES AND THE KAPPA COEFFICIENTS

NE Type        IMST   IWT An.-1   IWT An.-2   Kappa Co.   Total
Person         1071    385         426        0.88        1497
Organization    418    401         503        0.64         921
Location        491    260         274        0.79         765
Money            54     45          48        0.98         102
Percentage       44      8           7        0.99          51
Misc. Number    427     59         317        0.87         744
Date            106     10          76        0.95         182
Time             20      -          17        -             37
Total          2631   1168        1668        -           4299

2. During the calculation of the Kappa coefficient we saw that with an unmodified Kappa value the agreement rate was too high to be meaningful. We used a weight value of 0.01 for the number of tokens both annotators did not annotate, resulting in a much more meaningful statistic.

3.2. MWE Annotation

During the original dependency annotations of both treebanks, the annotators were asked to annotate the interrelations of a multiword expression with a single catch-all dependency type (named MWE as well) [12]. The annotation was, however, limited to this dependency relation alone, without any extra information on the types of the MWEs. In this work, we refine the previous annotations by inspecting all the treebank sentences and reannotating all the MWEs with finer categories. We carried out the annotation with the participation of the linguistics student who oversaw the categorization of MWEs. The annotation of MWEs was performed in a number of iterations of an annotate-and-check cycle. We automatically checked the annotation for dependency-related errors, and manually examined the cases that were marked in the previous annotations ([12], [13], [24]) but not in ours, and vice versa. This iteration was carried out until all problematic cases were handled. The first MWE annotation of the treebanks is complete. We plan to have the MWEs annotated again by another linguistics student, and to calculate the agreement as in the named entity annotation.

Table 2 gives the distribution of MWE categories in the two treebanks.

TABLE 2. THE DISTRIBUTION OF THE NUMBERS OF CATEGORIES OF MWES IN TWO TURKISH TREEBANKS

MWE Type                 IMST    IWT    Total
Named Entities            910    439    1349
Compound                  525    545    1070
Conjunction                32     41      73
Duplication               209    130     339
Formulaic Expression       22    221     243
Idiomatic Expression      773    598    1371
Lightverb Construction    537    648    1185
Nominal Compound          136    156     292
Proverb                     3      4       7
Simile Expression          12      7      19
Total                    3159   2789    5948

As seen in Table 2, one of the biggest categories of MWE is named entities, which means that the performance of a named entity recognition system used in MWE extraction will substantially affect the performance of the overall system. The other large category is idiomatic expressions, which makes MWE extraction a challenging issue, as we are obliged to deal with the particular challenges of idiomatic expressions to build a high-performance system.
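The down-weighting applied during the Kappa calculation (footnote 2: a weight of 0.01 on tokens that neither annotator marked) can be sketched concretely. The counts below are invented for illustration and are not the paper's data, and the 2x2 annotate/do-not-annotate formulation is our assumption about how the token-level agreement table was set up:

```python
# Hypothetical sketch of a down-weighted Cohen's kappa: the count of tokens
# that NEITHER annotator marked is scaled by a small weight before computing
# observed and expected agreement, so the huge both-unannotated cell does
# not dominate the statistic.

def weighted_kappa(both, only_a, only_b, neither, neg_weight=0.01):
    """Cohen's kappa over a 2x2 annotate/not-annotate contingency table,
    with the both-unannotated cell scaled by neg_weight."""
    neither *= neg_weight
    n = both + only_a + only_b + neither
    p_observed = (both + neither) / n
    p_a = (both + only_a) / n          # marginal P(annotator A marks a token)
    p_b = (both + only_b) / n          # marginal P(annotator B marks a token)
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)

# Invented counts: 380 tokens marked by both annotators, 40 and 45 marked by
# only one of them, 10,000 left unmarked by both.
print(round(weighted_kappa(380, 40, 45, 10000, neg_weight=1.0), 3))   # 0.895
print(round(weighted_kappa(380, 40, 45, 10000, neg_weight=0.01), 3))  # 0.601
```

With the full weight, the mass of trivially agreeing unmarked tokens inflates the score; the 0.01 weight yields a lower, more informative agreement value.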
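The two extra CoNLL columns described in Section 3.1 (the NE type, and the pointer to the following token of the same collocative named entity) can be consumed mechanically downstream. The sketch below is our own illustration rather than an official format specification; the Token container and its field names are assumed, with values taken from part of the Figure 1 example:

```python
# Illustrative sketch: grouping tokens into named-entity spans by following
# the "next word" column of the extended CoNLL annotation.

from collections import namedtuple

Token = namedtuple("Token", "tid form head deprel ne_type next_word")

# Part of the Figure 1 sentence "Ben Arçelik'e sordum 31 Aralık'a kadarmış."
rows = [
    Token(1, "Ben", 3, "SUBJECT", None, None),
    Token(2, "Arçelik'e", 3, "MODIFIER", "ORGANIZATION.ENAMEX", None),
    Token(3, "sordum", 8, "COORDINATION", None, None),
    Token(4, "31", 5, "MWE", "DATE.TIMEX", 5),
    Token(5, "Aralık'a", 8, "MODIFIER", "DATE.TIMEX", None),
]

def collect_ne_spans(tokens):
    """Group NE-typed tokens into spans by chasing next_word links."""
    by_id = {t.tid: t for t in tokens}
    spans, consumed = [], set()
    for t in tokens:
        if t.ne_type is None or t.tid in consumed:
            continue
        span = [t]
        while span[-1].next_word is not None:       # follow the chain
            nxt = by_id[span[-1].next_word]
            consumed.add(nxt.tid)
            span.append(nxt)
        spans.append((t.ne_type, [s.form for s in span]))
    return spans

print(collect_ne_spans(rows))
```

Run on the sample rows, this yields a one-token ORGANIZATION.ENAMEX span for “Arçelik'e” and a two-token DATE.TIMEX span for “31 Aralık'a”, matching the adjacency and left-to-right organization of Turkish named entities noted above.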
4. Conclusion

In this paper, we proposed a basis for Turkish MWE and NE categorization to be used as a working guide in annotation. The categorization framework, which was prepared by taking into account the idiosyncratic features of Turkish, consists of 11 categories of MWEs. We performed annotations on two Turkish treebanks using the proposed framework, annotating the categories of MWEs as the first annotation task and the NEs and their subcategories as the second. For the annotation task, we enlisted the aid of linguistics researchers with expertise in the morphosyntactic and semantic features of Turkish. The categorization framework defined in this study and the annotated treebanks will hopefully be used in future studies on the annotation and identification of MWEs in Turkish.

Acknowledgments

We would like to acknowledge that this work is part of a research project entitled “Parsing Web 2.0 Sentences” subsidized by the TÜBİTAK (Turkish Scientific and Technological Research Council) 1001 program (grant number 112E276) and part of the ICT COST Action IC1207 PARSEME (PARSing and Multi-word Expressions).

References

[1] I. A. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger, “Multiword expressions: A pain in the neck for NLP,” in Computational Linguistics and Intelligent Text Processing. Springer, 2002, pp. 1–15.
[2] I. Arnon and N. Snider, “More than words: Frequency effects for multiword phrases,” Journal of Memory and Language, vol. 62, no. 1, pp. 67–82, 2010.
[3] C. Ramisch, Multiword Expressions Acquisition, ser. Theory and Applications of Natural Language Processing. Springer, 2015.
[4] Proceedings of the 11th Workshop on Multiword Expressions. Denver, Colorado: Association for Computational Linguistics, June 2015. [Online]. Available: http://www.aclweb.org/anthology/W15-09
[5] V. Kordoni, M. Egg, A. Savary, E. Wehrli, and S. Evert, Eds., Proceedings of the 10th Workshop on Multiword Expressions (MWE). Gothenburg, Sweden: Association for Computational Linguistics, April 2014. [Online]. Available: http://www.aclweb.org/anthology/W14-08
[6] A. Savary, M. Sailer, Y. Parmentier, M. Rosner, V. Rosén, A. Przepiórkowski, C. Krstev, V. Vincze, B. Wójtowicz, G. S. Losnegaard et al., “PARSEME: Parsing and multiword expressions within a European multilingual network,” in 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2015), 2015.
[7] V. Rosén, G. S. Losnegaard, K. De Smedt, E. Bejček, A. Savary, A. Przepiórkowski, P. Osenova, and V. B. Mititelu, “A survey of multiword expressions in treebanks,” in International Workshop on Treebanks and Linguistic Theories (TLT14), 2015, p. 179.
[8] E. Bejček and P. Straňák, “Annotation of multiword expressions in the Prague dependency treebank,” Language Resources and Evaluation, vol. 44, no. 1-2, pp. 7–21, 2010.
[9] A. Abeillé, L. Clément, and F. Toussenel, “Building a treebank for French,” in Treebanks. Springer, 2003, pp. 165–187.
[10] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of English: The Penn treebank,” Computational Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[11] G. Eryiğit, T. İlbay, and O. A. Can, “Multiword expressions in statistical dependency parsing,” in Proceedings of the Second Workshop on Statistical Parsing of Morphologically Rich Languages (IWPT), Dublin, Ireland, October 2011, pp. 45–55. [Online]. Available: http://www.aclweb.org/W11-3806
[12] G. Eryiğit, K. Adalı, D. Torunoğlu-Selamet, U. Sulubacak, and T. Pamay, “Annotation and extraction of multiword expressions in Turkish treebanks,” in Proceedings of the 11th Workshop on Multiword Expressions. Association for Computational Linguistics, 2015, pp. 70–76. [Online]. Available: http://aclweb.org/anthology/W15-0912
[13] U. Sulubacak and G. Eryiğit, “IMST: A revisited Turkish dependency treebank,” in TurCLing 2016, The First International Conference on Turkic Computational Linguistics at CICLing 2016, Konya, Turkey, April 2016.
[14] T. Pamay, U. Sulubacak, D. Torunoğlu-Selamet, and G. Eryiğit, “The annotation process of the ITU web treebank,” in The 9th Linguistic Annotation Workshop held in conjunction with NAACL 2015, 2015, p. 95.
[15] S. K. Metin and B. Karaoğlan, “Collocation extraction in Turkish texts using statistical methods,” in Advances in Natural Language Processing. Springer, 2010, pp. 238–249.
[16] K. Oflazer, Ö. Çetinoğlu, and B. Say, “Integrating morphology with multi-word expression processing in Turkish,” in Proceedings of the Workshop on Multiword Expressions: Integrating Processing. Association for Computational Linguistics, 2004, pp. 64–71.
[17] G. Eryiğit, J. Nivre, and K. Oflazer, “Dependency parsing of Turkish,” Computational Linguistics, vol. 34, no. 3, pp. 357–389, 2008.
[18] A. Savary, “Computational inflection of multi-word units: A contrastive study of lexical approaches,” vol. 1, no. 2, 2008.
[19] A. Göksel and C. Kerslake, Turkish: A Comprehensive Grammar, ser. Comprehensive Grammars. Routledge, 2005.
[20] J. Baptista, A. Correia, and G. Fernandes, “Frozen sentences of Portuguese: Formal descriptions for NLP,” in Proceedings of the Workshop on Multiword Expressions: Integrating Processing, ser. MWE ’04. Stroudsburg, PA, USA: Association for Computational Linguistics, 2004, pp. 72–79. [Online]. Available: http://dl.acm.org/citation.cfm?id=1613186.1613196
[21] R. Grishman, “The NYU system for MUC-6 or where’s the syntax?” in Proceedings of the 6th Conference on Message Understanding. Association for Computational Linguistics, 1995, pp. 167–175.
[22] G. A. Şeker and G. Eryiğit, “Initial explorations on using CRFs for Turkish named entity recognition,” in Proceedings of COLING 2012, Mumbai, India, 8-15 December 2012.
[23] J. Cohen, “Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, p. 213, 1968.
[24] U. Sulubacak and G. Eryiğit, “A redefined Turkish dependency grammar and its implementations: A new Turkish web treebank & the revised Turkish treebank,” 2014, under review.

An Overview of Resources Available for Turkish Natural Language Processing Applications
Tunga Güngör, Computer Engineering, Boğaziçi University
TurCLing 2016 Keynote Speaker