Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Portability of Dependency Parsing Algorithms An Application for Italian Atanas Chanev University of Trento Department of Cognitive Sciences and ITC-irst, Trento, Italy E-mail: [email protected] 1 Introduction Parsers can be classified along different dimensions: e.g. rule-based vs. statistical, constituency-based vs. dependency-based. Rule-based parsers use manually prepared grammars which consist of many language-specific rules. A grammar previously prepared for a given language cannot be usually adapted for parsing another language. On the other hand statistical parsers can be ported to a new language provided that a treebank to train the parser is available. Constituency-based parsing, traditionally applied to English texts, may not be suitable for languages with different characteristics, such as Italian or Bulgarian. Dependency-based formalisms are considered to be more adequate for free word order languages (Bosco and Lombardo, [3]). Most work on parsing concentrated on English, the language with the largest treebanks available (e.g. the Penn Treebank (Marcus et al., [14])). Dependency treebanks have been compiled for free word order languages but their sizes are not comparable to the English constituency treebanks (with few exceptions, e.g. the Prague Dependency Treebank for Czech (Hajič, [9])). Portability of tools for language analysis is an issue with a great potential for less-studied languages or languages with few resources. Porting existing parsers can ease the preparation of greater scale resources for these languages. Apart from parsing, portability should be a research problem by itself and more attention has to be paid on it. Testing a parser on different treebanks in several languages would give more reliable results than testing it on a single treebank, thus extending the scale of measurement. Previous work has concentrated on issues concerning measurement of parsing results for a particular language (e.g. Charniak, [5]) or preparation of language resources (e.g. Hajič [9]). Recently there has been some work on portability. The topic has been for example addressed in (Nivre and Scholz, [18]) and (Corazza et al., [7]). The former is about porting a statistical dependency parser originally developed for Swedish to English and the latter is about testing statistical constituency-based parsers on an Italian treebank. The goal of the work reported in the paper is to investigate further the problem of portability across languages in the framework of statistical dependency parsing. In this paper we report preliminary results of the application of a statistical dependency parser on Italian and Slovene treebanks. The only Italian treebank that is freely available is the Turin University Treebank1 (TUT). Its annotation scheme is described in (Bosco, [2]). From the perspective of our research interests, the most attractive feature of TUT is the fact that it is dependency-based. The Malt parser2 was chosen to be applied to TUT. This parser was originally applied to Swedish and then ported to English. It is a statistical dependency parser that implements the shift-reduce parsing strategy and uses memory based learning (Nivre, [16]). It includes also evaluation software available on the web page. On the whole, the results obtained for Italian are slightly lower than those obtained for Swedish and English, but the training data for Italian is smaller and this reduction in precision can be expected. Comparing these results to the results from (Corazza et al., [7]) it seems an improvement in parsing Italian has been achieved, even if the evaluation metrics for constituency and for dependency-based parsing are different. The current work of the authors is dedicated to investigate how much the training data can be reduced without significantly loosing precision and how this reflects on portability. Tests for small portions of TUT and the Slovene Dependency Treebank (Ledinek, [12]) have been performed. Some preliminary results are available. Another current activity of the authors is comparison of the results from Malt to results obtained by parsing TUT with a rule-based parser for Italian. Future work includes the use of the Malt parser on other treebanks. First of all, experiments on Bulgarian with the BulTreeBank (Simov, [19]) are planned for the next months. Later, we will investigate the possibility of using a version of the Italian treebank ISST (see Section 2) converted to dependency and Turkish and 1 2 http://www.di.unito.it/˜tutreeb/ http://w3.msi.vxu.se/˜nivre/research/MaltParser.html German treebanks. The structure of this work is organized as follows. Section 2 gives a brief description of the treebanks developed for Italian. In Section 3 the setting procedures for the experiments on TUT are presented. Section 4 consists of the experiments and obtained results followed by preliminary results of a comparative study of TUT and the Slovene Dependency Treebank (SDT) in Section 5. Future work is outlined in Section 6. 2 Italian Treebanks Three Italian treebanks have been developed so far. The only Italian dependency treebank is the Turin University Treebank (TUT) (Bosco, [4]). It has 1,500 sentences (41,771 tokens) annotated with parts of speech and dependency relations. The dependency relations contain information about syntactic relations and, in some cases, theta roles. The trees of the corpus contain traces which are presented as ordinary tokens. TUT is publicly available and can be freely downloaded from the web site of the project3 . Another Italian treebank is the Italian Syntactic-Semantic Treebank (Montemagni et al., [15]). It is annotated at four levels: morpho-syntactic, two syntactic levels (constituent structure and functional relation), and lexico-semantic. There are about 3000 sentences (89,941 tokens) annotated at constituent structure level. ISST is not yet publicly available. The third treebank for Italian is the Venice Italian Treebank (VIT) (Delmonte, [8]). It consists of 250,000 words of written text and 50,000 words of spoken text. It is annotated with constituent structures. VIT is not freely available but sentences from the treebank can be browsed in the project web page4 . 3 Preliminary Settings for the Experiments Our first experiment on portability was the evaluation of the Malt parser (Nivre, [16]) on the Italian Turin University Treebank (Bosco, [4]). The Malt parser was chosen because of its availability and the fact that originally developed for Swedish, it has already been ported to English (Nivre and Scholz, [18]). The annotation scheme of TUT is slightly different from the ones Malt is able to learn and parse. Along with the re-arranging of the information from TUT, several idiosyncrasies of the Italian language in general and TUT in particular had to be taken in mind for a successful conversion and plausibility of the parsing results. 3 4 http://www.di.unito.it/˜tutreeb/ http://sisley.cgm.unive.it/HTMLipar/indexparsing_a.htm Dependencies between prepositions and articles5 are represented in TUT using a type of indexing in which the second word in such a construction is given the index of the first word and a sub index. In Malt formats all the words except the roots in a sentence have heads which have positive integer indices. Converting the TUT indexing to integer-based one contributed to loosing information for the closer relationships between prepositions and articles within such constructions (or verbs and pronouns for which the same scheme was used). An example of a preposition-article annotation extracted from TUT is given below. 17 dello (DI PREP MONO) [11;PREP-RMOD] 17.1 dello (IL ART DEF M SING) [17;PREP-ARG] Conversion of the indexing was performed so that every token was associated with an integer index. The first word of a sentence from the TUT is given as an example in its original format and after the conversion to Malt format: TUT: 1 Valona (VALONA NOUN PROPER F CITY) [1.10;VERB-SUBJ] Malt: Valona NOUN.PROPER.F.CITY 2 VERB-SUBJ The TUT annotation scheme is well described and documented in the PhD thesis of Cristina Bosco (Bosco, [2]). The part of speech labels contain information about category, sub-category and features. The number of POS tags, extracted from the latest version of TUT was 972. Potentially, their number can be greater, because some tags might not have occurred in the 1,500 sentences of the treebank6 . In the TUT POS tag set there are cases of lexical information included in the tag. This increases artificially the number of POS tags. For nouns of verbal origin, the verb and information about its transitivity are included in the labels resulting in an increased number of tags. An example extracted from the TUT is: NOUN.COMMON.F.PL.AFFERMARE.TRANS for the noun affermazioni (affirmations). Information about the verb and its transitivity was removed from the tags of nouns of verbal origin. These steps reduced the tag set size to 511 tags. The 5 In Italian, usually the article is attached to the preposition thus forming a single word. The initial number of POS tags extracted from TUT was 1,250, because of many tags of the form NUM.3 or DATE.1997. What is after the point is not POS information and all these tags were substituted with a single NUM and a single DATE tag. 6 POS tag set was reduced further by removal of features from the tags. Gender and number features were removed from verbs, nouns, adjectives, predeterminers, articles and pronouns. A single tag for proper nouns was used and the tags for conjunctions were reduced to 3: a general conjunction tag and tags for coordinating and subordinating conjunctions. After the reduction the set consisted of 90 labels. The only grammatical information that remained in noun tags was about the case of the noun. The tags of the verbs (including auxiliaries and modal verbs) were reduced to contain only information about the mood of the verb and its transitivity. None of the remaining tags in the set contained gender and number information7 . The dependency tag set of TUT is quite complex. Each tag consists of three parts that represent different kinds of information, but not every type of information is present in all the tags. Usually a dependency label starts with the part-of-speech category of the head word, followed by the syntactic relation between the word and its head and the semantic theta role8 . The set of syntactic relations is organized in a hierarchy with 3 very general labels as top categories: dependent, function and nofunction. Dependent is the parent category of function and nofunction and it has no other child categories. The lower levels of the hierarchy are more specific. On the whole, the hierarchy is scalable and the number of labels can be reduced. A total number of 283 different dependency relation labels were extracted from TUT. But potentially the number of dependency tags can be greater due to different combinations of POS tag for the head word, syntactic relation and theta role. The reduction strategy for the set of dependency relations was to extract the part that contained only the syntactic relation information from each tag. The part of the labels that contained the POS category of the head word and the one containing the semantic theta role were removed. The number of syntactic relations was then reduced and the syntactic labels that remained in the set after the reduction were used as dependency labels of the tokens in the corpus. In most cases, the removed syntactic labels could be replaced by their parents without a significant loss of information, but in a single case, a label was substituted by its sister9 . The most general dependency relations in the hierarchy were kept. Only tags from the lower levels were removed. A simple example for grouping of tags is the use of a single tag SEPARATOR for the end of sentence punctuation 7 The suggestion about removing gender and number information was given to the authors by Joakim Nivre in personal correspondence. Use of genders and numbers has not improved the parsing results for Swedish, but has made the learning data more sparse. 8 But sometimes the label contains additional information too. 9 The label EXTRAOBJ was substituted by its sister from the hierarchy OBJ but not its parent ARG. This approach was preferred by the authors since another child of ARG was SUBJ (END), opening (OPEN) and closing (CLOSE) inverted commas and other punctuation (SEPARATOR). On the whole, the dependency tags were reduced from 283 to 17. SEPARATOR ARG ? ? I nomi . ART.DEF NOUN.COMMON I nomi . The names . PUNCT Figure 1: A sample tree from the TUT with reduced tag sets. One of the features of TUT is the existence of traces in the treebank. They have been implemented in order to deal with language-specific phenomena without introducing non-projective branches in the syntactic trees. The sentences in the treebank are assumed to be projective but after the treebank was converted to Malt parser format and browsed using the MaltEval software10 , a number of nonprojective trees were discovered. One of the pre-requisites for obtaining plausible parsing results with the Malt parser was the removal of the traces from TUT. They must neither be learned by the statistical parser nor be included later in the input test data. The removal of traces which were leaf nodes or parents of leaf trace nodes without other children was straightforward. Nevertheless 122 of 3,123 traces had children or grandchildren which were not traces and it was not possible to remove them automatically. For such cases a manual removal was performed. There were two types of cases to handle manually: those in which the trace was pointing to a verb and all the others. The treatment of the latter cases was rather straightforward. Among the former cases, a particularly hard task was to remove traces of main verbs from coordinated sub sentences. In the context of the TUT annotation scheme there are not simple answers to the questions ’Which is the head of a coordinated sub sentence with a removed verb trace being its former head?’ and ’To which token should the new head of a coordinated sub sentence point?’. The problematic verb traces were removed without an explicit and theoretically motivated strategy. We plan to further investigate issues related to traces and their 10 http://w3.msi.vxu.se/˜nivre/research/MaltEval.html Table 1: Parsing results for Italian using the MBL3 and MBL4 learning methods. tag sets complete: reduced: tag sets complete: reduced: MBL3 unlabelled precision/recall 86.59% 87.18% MBL4 unlabelled precision/recall 86.49% 87.33% labelled precision/recall 74.41% 81.66% labelled precision/recall 74.37% 81.75% removal. After this activity the TUT was ready for being learned and parsed using the Malt parser. 4 Experiments For the tests, the sentences from the TUT files were re-arranged in a random manner in a single file which was then split into 10 parts for a 10-fold cross-validation. The choice of 10-fold cross-validation was motivated by the small size of TUT (only 1,500 sentences). The MaltEval software was used for measuring the results, calculating the mean score per word, excluding punctuation. Gold standard POS tags were used in the experiments. Two couples of experiments were performed. The TUT was trained and tested with the Malt parser with both complete and reduced tag sets and both MBL3 and MBL4 (lexicalized) learning models. The first couple of experiments used the original TUT tag sets with minor changes while the second one employed reduced tag sets. The parsing results are given in Table 1. 4.1 Comparison to Dependency Parsing for Swedish The results from the evaluation of the Malt parser on Swedish from (Nivre, [17]) are given in Table 2. A updated version of the results for Swedish was provided by Joakim Nivre in personal communication: unlabelled accuracy is 86% and labelled precision/recall is 82%. In the first couple of experiments (those with large tag sets) unlabelled precision was comparable to the one from the updated results for Swedish data, but Table 2: Results from the evaluation of the Malt parser on Swedish (Nivre, [17]). learning model MBL non-lexical: MBL lexical: unlabelled precision/recall 81.70% 84.70% labelled precision/recall 74.70% 80.60% labelled precision was considerably lower. In the second couple of experiments, labelled precision was closer to the one from the updated Swedish results. The tests on the Swedish data have been performed using a smaller POS tag set with training data of 100,000 words. Automatic POS tagging was used instead of gold standard tags in the Swedish tests. For a comparison, the parsing results for Italian are slightly lower, due to sparse data. After removing the traces, the number of tokens for TUT decreased to 41,616. This size is significantly lower than the one of the training data for Swedish. The parsing results are comparable with those obtained for Swedish even though automatic tags were used in the Swedish experiments. This technique usually gives worse results, compared to the usage of gold standard tags. There are, though, significant differences in the training set sizes. 4.2 Comparison to Constituency Parsing Malt parser has been tested on the WallStreet Journal (WSJ) part of the Penn Treebank in (Nivre and Scholz, [18]) achieving dependency accuracy from 3 to 5% and root accuracy – from 6 to 10% worse than the state-of-the-art parsers for English ((Charniak, [5]) and (Collins, [6])). Dependency-based evaluation of the parsers of Charniak and Collins was done in (Yamada and Matsumoto, [20]). The unlabelled and labelled precision/recall for parsing English with the Malt parser are respectively 88% and 86%, but the training set is 1,000,000 words11 . In a recent study by Corazza and colleagues [7] state-of-the-art statistical constituency-based parsers (an implementation of Collins 2 parsing model (Collins, [6]) by Bikel in (Bikel, [1]) and the Stanford parser (Klein and Manning [10, 11])) were used for parsing the Italian Syntactic-Semantic Treebank. The difference in performance compared to English was over 15% for the F-measure. The results are significantly poorer for Italian than those for English. It is concluded that this performance can be due to either the annotation scheme or the specificity of Italian language. 11 This data was kindly provided by Joakim Nivre in personal communication. Table 3: Parsing results for small sets from TUT and SDT. treebank TUT: SDT: treebank TUT: SDT: MBL3 unlabelled precision/recall 81.62% 69.01% MBL4 unlabelled precision/recall 81.60% 69.08% labelled precision/recall 67.24% 58.33% labelled precision/recall 66.66% 57.84% Given the fact that the results obtained for TUT are closer to those obtained for English using the Malt parser and that the Malt parser performs for English slightly lower than state-of-the art parsers, the poor results on ISST seem not to be due to the specificity of Italian language but rather because of the fact that the parsers and the treebank were constituency-based. However, this issue needs further investigation, since the evaluation metrics for constituency parsing are different from those for dependency parsing. 5 Usage of the Malt Parser on Small Extracts of Corpora A couple of preliminary experiments was performed on small extracts from the TUT and the Slovene Dependency Treebank (Ledinek, [12]). SDT implements an annotation scheme similar to the one of the Prague Dependency Treebank (Hajič, [9]) The total number of sentences in both the extracts was 335 but the TUT extract contained a significantly greater number of tokens (9,322 tokens for TUT vs. 6,980 tokens for SDT). Note that the TUT tag set contains more labels than the SDT one. The preliminary results for these experiments are given in Table 3. The reason for the significantly poorer numbers for Slovene seems to be the lower number of tokens of the SDT extract. 6 Future Work The authors plan to continue their work on Italian by performing further tests using as input to the Malt parser the output of a POS tagger instead of the gold standard tags. This will provide more realistic results. A comparison with a rule-based Sb AuxK Obj AuxV ? ? ? ? Knjiga ni imela naslova . Ncfsn Vcip3s--y Vmps-sfa Ncmsg Z Figure 2: A sample tree from SDT converted to Malt format. parser for Italian is planned too. In middle term the Malt parser can be used on a version of ISST converted to dependency. Another direction for further investigation is to test the Malt parser on small sets of TUT and SDT which contain similar number of tokens and to explore the influence of the average number of words in the sentence on the final parsing results. In middle term, experiments with the Malt parser on other treebanks are planned. First of all, on Bulgarian with the BulTreeBank (Simov, [19]) and later using Turkish and German treebanks. Acknowledgements I thank Joakim Nivre for kindly answering all my questions about the Malt parser and providing me with detailed explanations on the parameters of its use. I also thank Cristina Bosco for providing useful feedback and Vincenzo Lombardo for his advice on dependency treebanks. I would like to thank Nina Ledinek and Tomaz Erjavec for making available the Slovene Dependency Treebank. And last but not least I would like to thank Alberto Lavelli and Bernardo Magnini for supervising me and for the valuable comments and suggestions they have provided me so far. References [1] Bikel, Daniel M. (2004) Intricacies of Collins’ parsing model. Computational Linguistics, 30(4). [2] Bosco, Cristina (2004) A grammatical relation system for treebank annotation. PhD thesis, University of Torino. [3] Bosco, Cristina and Lombardo, Vincenzo (2004) Dependency and relational structure in treebank annotation. In Proceedings of the Workshop on Recent Advances in Dependency Grammars at Coling 2004, Geneva, Switzerland. [4] Bosco, Cristina, Lombardo, Vincenzo, Vassallo, Daniela and Lesmo, Leonardo (2000) Building a treebank for Italian: a data-driven annotation schema. In Proceedings of the Second International Conference on Language Resources and Evaluation LREC-2000, pp. 99–106, Athens, Greece. [5] Charniak, Eugene (2000) A maximum-entropy-inspired parser. In Proceedings of NAACL-2000. [6] Collins, Michael (1997) Three generative, lexicalized models for statistical parsing. In Proceedings of ACL, pp. 16–23, Madrid, Spain. [7] Corazza, Anna, Lavelli, Alberto, Satta, Giorgio and Zanoli, Roberto (2004) Analyzing an Italian treebank with state-of-the-art statistical parsers. In Proceedings of the 3rd workshop on Treebanks and Linguistic Theories (TLT2004), Tübingen, Germany. [8] Delmonte, Rodolfo (2004) Strutture sintatiche dall’analisi computazionale di corpora di italiano. In Anna Cardinaletti (a cura di), "Intorno all’Italiano Contemporaneo" Franco Angeli, Milano, pp. 187–220. [9] Hajič, Jan (1998) Building a syntactically annotated corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning, pp. 106–132. Prague: Karolinum. [10] Klein, Dan and Manning, Christopher D. (2002) Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002). [11] Klein, Dan and Manning, Chtistopher D. (2003) Accurate unlexicalized parsing. In Proceedings of the 41st annual meeting of the Association of Computational Linguistics, Sapporo, Japan. [12] Ledinek, Nina (2005) Površinskoskladenjsko označevanje korpusa Slovene Dependency Treebank (s poudarkom na predikatu). B. A. thesis. Ljubljana: Faculty of Arts, University of Ljubljana. [13] Lin, Dekang (1998) A dependency-based method for evaluating broadcoverage parsers. In Natural Language Engineering, 4 (2), pp. 97–114. [14] Marcus, Mitchell P., Santorini, Beatrice and Marcinkiewicz, Mary Ann (1993) Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, 19 (2), pp. 273–290. [15] Montemagni, Simonetta, Barsotti, Francesco, Battista, Marco, Calzolari, Nicoletta, Corazzari, Ornella, Lenci, Alessandro, Zampolli, Antonio, Fanciulli, Francesca, Massetani, Maria, Raffaelli, Remo, Basili, Roberto, Pazienza, Maria Teresa, Saracino, Dario, Zanzotto, Fabio, Pianesi, Fabio, Mana, Nadia and Delmonte, Rodolfo (2003) Building the Italian Syntactic-Semantic Treebank. In Anne Abeillé, editor, Building and Using syntactically annotated corpora, pp. 189–210, Kluwer, Dodrecht. [16] Nivre, Joakim (2003) An efficient algorithm for projective dependency parsing. In Proceedings of the 8th international workshop on parsing technologies (IWPT), pp. 149–160, Nancy, France. [17] Nivre, Joakim, Hall, Johan and Nilsson, Jens (2004) Memory-based dependency parsing. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL), pp. 49–56, Boston, Massachusetts. [18] Nivre, Joakim and Scholz, Mario (2004) Deterministic dependency parsing of English text. In Proceedings of COLING 2004, Geneva, Switzerland. [19] Simov, Kiril, Popova, Gergana and Osenova, Petya (2001) HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In Proceedings of the Corpus Linguistics 2001 Conference, pp. 561. [20] Yamada, Hiroyasu and Matsumoto, Yuji (2003) Statistical dependency analysis with support vector machines. In Proceedings of IWPT, pp. 195–206, Nancy, France.