Portability of Dependency Parsing Algorithms
An Application for Italian
Atanas Chanev
University of Trento
Department of Cognitive Sciences
and
ITC-irst, Trento, Italy
E-mail: [email protected]
1 Introduction
Parsers can be classified along different dimensions: e.g. rule-based vs. statistical,
constituency-based vs. dependency-based.
Rule-based parsers use manually prepared grammars which consist of many
language-specific rules. A grammar prepared for a given language usually cannot be adapted for parsing another language. Statistical parsers, on the other hand,
can be ported to a new language provided that a treebank to train the parser
is available.
Constituency-based parsing, traditionally applied to English texts, may not be
suitable for languages with different characteristics, such as Italian or Bulgarian.
Dependency-based formalisms are considered to be more adequate for free word
order languages (Bosco and Lombardo, [3]).
Most work on parsing concentrated on English, the language with the largest
treebanks available (e.g. the Penn Treebank (Marcus et al., [14])). Dependency
treebanks have been compiled for free word order languages but their sizes are not
comparable to the English constituency treebanks (with few exceptions, e.g. the
Prague Dependency Treebank for Czech (Hajič, [9])).
Portability of tools for language analysis is an issue with a great potential for
less-studied languages or languages with few resources. Porting existing parsers
can ease the preparation of larger-scale resources for these languages. Beyond
parsing itself, portability deserves to be treated as a research problem in its own right and more attention has
to be paid to it. Testing a parser on different treebanks in several languages would
give more reliable results than testing it on a single treebank, thus extending the
scale of measurement.
Previous work has concentrated on issues concerning measurement of parsing
results for a particular language (e.g. Charniak, [5]) or preparation of language
resources (e.g. Hajič [9]). Recently there has been some work on portability. The
topic has been for example addressed in (Nivre and Scholz, [18]) and (Corazza
et al., [7]). The former is about porting a statistical dependency parser originally developed for Swedish to English and the latter is about testing statistical
constituency-based parsers on an Italian treebank.
The goal of the work reported in the paper is to investigate further the problem
of portability across languages in the framework of statistical dependency parsing. In this paper we report preliminary results of the application of a statistical
dependency parser on Italian and Slovene treebanks.
The only Italian treebank that is freely available is the Turin University Treebank1 (TUT). Its annotation scheme is described in (Bosco, [2]). From the perspective of our research interests, the most attractive feature of TUT is the fact that
it is dependency-based. The Malt parser2 was chosen to be applied to TUT. This
parser was originally applied to Swedish and then ported to English. It is a statistical dependency parser that implements the shift-reduce parsing strategy and uses
memory-based learning (Nivre, [16]). It also includes evaluation software, available
on the web page.
On the whole, the results obtained for Italian are slightly lower than those obtained for Swedish and English, but the training data for Italian is smaller and this
reduction in precision can be expected. Comparing these results to the results from
(Corazza et al., [7]) it seems an improvement in parsing Italian has been achieved,
even if the evaluation metrics for constituency and for dependency-based parsing
are different.
The current work of the authors investigates how much the training data can be reduced without significantly losing precision and how this affects
portability. Tests on small portions of TUT and the Slovene Dependency Treebank (Ledinek, [12]) have been performed. Some preliminary results are available.
Another current activity of the authors is the comparison of the results from Malt to
results obtained by parsing TUT with a rule-based parser for Italian.
Future work includes the use of the Malt parser on other treebanks. First of
all, experiments on Bulgarian with the BulTreeBank (Simov, [19]) are planned for
the coming months. Later, we will investigate the possibility of using Turkish and
German treebanks, as well as a version of the Italian treebank ISST (see Section 2)
converted to dependency format.
1 http://www.di.unito.it/~tutreeb/
2 http://w3.msi.vxu.se/~nivre/research/MaltParser.html
The paper is organized as follows. Section 2 gives a brief description of the treebanks developed for Italian. Section 3 presents the preparatory steps
for the experiments on TUT. Section 4 describes the experiments
and the results obtained, followed in Section 5 by preliminary results of a comparative study of TUT
and the Slovene Dependency Treebank (SDT). Future work is outlined
in Section 6.
2 Italian Treebanks
Three Italian treebanks have been developed so far. The only Italian dependency
treebank is the Turin University Treebank (TUT) (Bosco, [4]). It has 1,500 sentences (41,771 tokens) annotated with parts of speech and dependency relations.
The dependency relations contain information about syntactic relations and, in
some cases, theta roles. The trees of the corpus contain traces which are presented
as ordinary tokens. TUT is publicly available and can be freely downloaded from
the web site of the project3 .
Another Italian treebank is the Italian Syntactic-Semantic Treebank (Montemagni et al., [15]). It is annotated at four levels: morpho-syntactic, two syntactic
levels (constituent structure and functional relation), and lexico-semantic. There
are about 3000 sentences (89,941 tokens) annotated at constituent structure level.
ISST is not yet publicly available.
The third treebank for Italian is the Venice Italian Treebank (VIT) (Delmonte,
[8]). It consists of 250,000 words of written text and 50,000 words of spoken text.
It is annotated with constituent structures. VIT is not freely available but sentences
from the treebank can be browsed in the project web page4 .
3 Preliminary Settings for the Experiments
Our first experiment on portability was the evaluation of the Malt parser (Nivre,
[16]) on the Italian Turin University Treebank (Bosco, [4]). The Malt parser was
chosen because it is available and because, although originally developed for Swedish,
it has already been ported to English (Nivre and Scholz, [18]).
The annotation scheme of TUT is slightly different from the ones Malt is able to
learn and parse. Besides re-arranging the information from TUT, several
idiosyncrasies of the Italian language in general and of TUT in particular had to be
kept in mind to make the conversion successful and the parsing results plausible.
3 http://www.di.unito.it/~tutreeb/
4 http://sisley.cgm.unive.it/HTMLipar/indexparsing_a.htm
Dependencies between prepositions and articles5 are represented in TUT using
a type of indexing in which the second word in such a construction is given the
index of the first word and a sub-index. In the Malt formats, all the words in a
sentence except the roots have heads with positive integer indices. Converting
the TUT indexing to an integer-based one lost information about the
closer relationship between prepositions and articles within such constructions
(or between verbs and pronouns, for which the same scheme was used). An example of a
preposition-article annotation extracted from TUT is given below.
17 dello (DI PREP MONO) [11;PREP-RMOD]
17.1 dello (IL ART DEF M SING) [17;PREP-ARG]
Conversion of the indexing was performed so that every token was associated
with an integer index. The first word of a sentence from the TUT is given as an
example in its original format and after the conversion to Malt format:
TUT:
1 Valona (VALONA NOUN PROPER F CITY) [1.10;VERB-SUBJ]
Malt:
Valona    NOUN.PROPER.F.CITY    2    VERB-SUBJ
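The index conversion described above can be illustrated with a minimal Python sketch. It is not the conversion script used for the experiments. The line layout is inferred from the TUT examples quoted in the text. The function name `renumber`, the assumption that tokens are listed in sentence order, and the choice to map any head that lies outside the given lines to 0 are hypothetical simplifications.

```python
import re

# Assumed line layout, inferred from the TUT examples quoted in the text:
#   <index> <form> (<lemma and POS features>) [<head index>;<relation>]
TUT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\((.*?)\)\s+\[([^;\]]*);([^\]]*)\]$")


def renumber(tut_lines):
    """Map TUT indices such as '17' and '17.1' to consecutive integers."""
    rows = []
    for line in tut_lines:
        match = TUT_LINE.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        rows.append(match.groups())

    # Tokens are assumed to appear in sentence order, so the position of a
    # line (starting at 1) becomes its new integer index.
    new_index = {row[0]: position for position, row in enumerate(rows, start=1)}

    converted = []
    for index, form, _features, head, relation in rows:
        # Simplification: a head that is not among the given lines is mapped
        # to 0, the value used here for "no head inside this fragment".
        converted.append((new_index[index], form, new_index.get(head, 0), relation))
    return converted


fragment = [
    "17 dello (DI PREP MONO) [11;PREP-RMOD]",
    "17.1 dello (IL ART DEF M SING) [17;PREP-ARG]",
]
for row in renumber(fragment):
    print(row)
```

After the conversion the article is simply token 2 with head 1, so the fact that it formed one orthographic word with the preposition is no longer visible in the indices.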
The TUT annotation scheme is well described and documented in the PhD
thesis of Cristina Bosco (Bosco, [2]).
The part of speech labels contain information about category, sub-category and
features. The number of POS tags, extracted from the latest version of TUT was
972. Potentially, their number can be greater, because some tags might not have
occurred in the 1,500 sentences of the treebank6 .
In the TUT POS tag set, some tags include lexical information, which
artificially increases the number of POS tags. For nouns of verbal origin,
the verb and information about its transitivity are included in the label.
An example extracted from the TUT is:
NOUN.COMMON.F.PL.AFFERMARE.TRANS
for the noun affermazioni (affirmations).
Information about the verb and its transitivity was removed from the tags of
nouns of verbal origin. These steps reduced the tag set size to 511 tags. The
POS tag set was reduced further by removal of features from the tags. Gender
and number features were removed from verbs, nouns, adjectives, predeterminers,
articles and pronouns. A single tag for proper nouns was used and the tags for
conjunctions were reduced to 3: a general conjunction tag and tags for coordinating
and subordinating conjunctions.
5 In Italian, usually the article is attached to the preposition thus forming a single word.
6 The initial number of POS tags extracted from TUT was 1,250, because of many tags of the
form NUM.3 or DATE.1997. What follows the point is not POS information, and all these tags were
substituted with a single NUM and a single DATE tag.
After the reduction the set consisted of 90 labels. The only grammatical information that remained in noun tags was about the case of the noun. The tags
of the verbs (including auxiliaries and modal verbs) were reduced to contain only
information about the mood of the verb and its transitivity. None of the remaining
tags in the set contained gender and number information7 .
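Some of the reductions described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually applied to TUT. The feature values `M`, `F`, `SING` and `PL` are only the ones visible in the quoted examples. `INTRANS` and the name `NOUN.PROPER` for the single proper-noun tag are assumed. The reductions of the conjunction and verb tags are not covered.

```python
GENDER_NUMBER = {"M", "F", "SING", "PL"}   # only the values seen in the examples
TRANSITIVITY = {"TRANS", "INTRANS"}        # 'INTRANS' is an assumed value


def reduce_pos_tag(tag):
    """Reduce a dot-separated TUT-style POS tag (hypothetical helper)."""
    parts = tag.split(".")
    category = parts[0]

    # NUM.3, DATE.1997: what follows the point is not POS information.
    if category in {"NUM", "DATE"}:
        return category

    # A single tag for proper nouns (the tag name itself is an assumption).
    if parts[:2] == ["NOUN", "PROPER"]:
        return "NOUN.PROPER"

    # Nouns of verbal origin end in <VERB LEMMA>.<TRANSITIVITY>; drop both.
    if category == "NOUN" and len(parts) > 2 and parts[-1] in TRANSITIVITY:
        parts = parts[:-2]

    # Drop gender and number features, keeping the category untouched.
    kept = [parts[0]] + [p for p in parts[1:] if p not in GENDER_NUMBER]
    return ".".join(kept)


for tag in ["NOUN.COMMON.F.PL.AFFERMARE.TRANS", "ART.DEF.M.SING", "NUM.3", "DATE.1997"]:
    print(tag, "->", reduce_pos_tag(tag))
```

Applied to the tag of affermazioni, the sketch first strips the verb lemma and its transitivity and then the gender and number features, leaving the category and sub-category.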
The dependency tag set of TUT is quite complex. Each tag consists of three
parts that represent different kinds of information, but not every type of information
is present in all the tags. Usually a dependency label starts with the part-of-speech
category of the head word, followed by the syntactic relation between the word and
its head and the semantic theta role8 .
The set of syntactic relations is organized in a hierarchy with 3 very general
labels as top categories: dependent, function and nofunction. Dependent is the
parent category of function and nofunction and it has no other child categories.
The lower levels of the hierarchy are more specific. On the whole, the hierarchy is
scalable and the number of labels can be reduced. A total number of 283 different
dependency relation labels were extracted from TUT. But potentially the number
of dependency tags can be greater due to different combinations of POS tag for the
head word, syntactic relation and theta role.
The reduction strategy for the set of dependency relations was to extract the part
that contained only the syntactic relation information from each tag. The part of
the labels that contained the POS category of the head word and the one containing
the semantic theta role were removed. The number of syntactic relations was then
reduced and the syntactic labels that remained in the set after the reduction were
used as dependency labels of the tokens in the corpus.
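The label reduction can be illustrated with a short Python sketch. It assumes that the parts of a TUT dependency label are separated by hyphens, as in PREP-RMOD and VERB-SUBJ. The position of the theta role and the helper name `reduce_dependency_label` are assumptions, and labels that carry additional information would need extra handling. The only substitution filled in is the one named in the text (EXTRAOBJ replaced by OBJ).

```python
# Substitutions for relations removed from the set. EXTRAOBJ -> OBJ is the one
# case named in the text; further entries would come from the TUT hierarchy.
REPLACEMENT = {"EXTRAOBJ": "OBJ"}


def reduce_dependency_label(label):
    """Keep only the syntactic relation of a TUT-style dependency label.

    Assumed layout: <head POS>-<syntactic relation>[-<theta role>]
    """
    parts = label.split("-")
    # With a single part there is no head POS prefix to strip.
    relation = parts[1] if len(parts) > 1 else parts[0]
    return REPLACEMENT.get(relation, relation)


for label in ["VERB-SUBJ", "PREP-ARG", "PREP-RMOD"]:
    print(label, "->", reduce_dependency_label(label))
```

Keeping the substitutions in a single table makes it easy to try coarser or finer label sets without touching the extraction step.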
In most cases, the removed syntactic labels could be replaced by their parents
without a significant loss of information, but in a single case a label was substituted
by its sister9. The most general dependency relations in the hierarchy were kept.
Only tags from the lower levels were removed. A simple example of grouping
tags is the use of a single tag SEPARATOR for the end of sentence punctuation
(END), opening (OPEN) and closing (CLOSE) inverted commas and other punctuation (SEPARATOR). On the whole, the dependency tags were reduced from 283
to 17.
7 The suggestion about removing gender and number information was given to the authors by
Joakim Nivre in personal correspondence. Use of genders and numbers has not improved the parsing
results for Swedish, but has made the learning data more sparse.
8 But sometimes the label contains additional information too.
9 The label EXTRAOBJ was substituted by its sister from the hierarchy, OBJ, and not by its parent
ARG. This approach was preferred by the authors since another child of ARG was SUBJ.
[Figure: dependency tree over the tokens "I nomi ." (gloss: "The names ."), with POS tags ART.DEF, NOUN.COMMON and PUNCT and dependency labels ARG and SEPARATOR.]
Figure 1: A sample tree from the TUT with reduced tag sets.
One of the features of TUT is the existence of traces in the treebank. They
have been implemented in order to deal with language-specific phenomena without introducing non-projective branches in the syntactic trees. The sentences in
the treebank are assumed to be projective, but after the treebank was converted to
Malt parser format and browsed using the MaltEval software10, a number of non-projective trees were discovered.
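Projectivity can also be checked automatically rather than by browsing. The following Python sketch uses the standard definition: an arc is projective if every token lying between the head and the dependent is dominated by the head. It is a generic illustration and not part of MaltEval. The head-list representation, with 0 standing for an artificial root placed before the first token, is an assumption.

```python
def is_projective(heads):
    """heads[i] is the head of token i + 1; 0 marks the artificial root."""

    def dominates(ancestor, node):
        # Follow head links upwards from node until the root is reached.
        steps = 0
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
            if steps > len(heads):
                raise ValueError("cycle in head list")
        return ancestor == 0

    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((head, dependent))
        # Every token strictly between head and dependent must be below head.
        for between in range(low + 1, high):
            if not dominates(head, between):
                return False
    return True


print(is_projective([2, 0, 2]))  # tokens 1 and 3 attach to token 2
print(is_projective([3, 0, 2]))  # arc 3 -> 1 spans token 2, which 3 does not dominate
```

The check is quadratic in sentence length in the worst case, which is unproblematic for a treebank of this size.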
One of the pre-requisites for obtaining plausible parsing results with the Malt
parser was the removal of the traces from TUT. They must neither be learned by
the statistical parser nor be included later in the input test data.
The removal of traces which were leaf nodes or parents of leaf trace nodes
without other children was straightforward. Nevertheless 122 of 3,123 traces had
children or grandchildren which were not traces and it was not possible to remove
them automatically. For such cases a manual removal was performed.
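The automatic part of the trace removal can be sketched in Python as follows. The sketch slightly generalizes the case described above: a trace is removed when everything below it in the tree is also a trace, and the remaining traces are reported for manual treatment. The token representation `(index, head, is_trace)` and the function name are hypothetical, and the input is assumed to be a well-formed tree.

```python
def remove_simple_traces(tokens):
    """tokens: list of (index, head, is_trace); head 0 marks the root.

    Returns (kept, manual): the kept tokens as (index, head) pairs renumbered
    from 1, and the original indices of traces left for manual handling.
    """
    children = {}
    for index, head, _ in tokens:
        children.setdefault(head, []).append(index)
    is_trace = {index: flag for index, _, flag in tokens}

    def only_traces_below(index):
        return all(
            is_trace[child] and only_traces_below(child)
            for child in children.get(index, [])
        )

    removable = {i for i, _, flag in tokens if flag and only_traces_below(i)}
    manual = [i for i, _, flag in tokens if flag and i not in removable]

    kept = [(i, h) for i, h, _ in tokens if i not in removable]
    new_index = {old: new for new, (old, _) in enumerate(kept, start=1)}
    # A kept token never depends on a removed trace, so only the root head (0)
    # is missing from new_index.
    renumbered = [(new_index[i], new_index.get(h, 0)) for i, h in kept]
    return renumbered, manual


# Invented example: 3 and 4 are traces with no words below them, while
# trace 5 governs the ordinary token 6 and is therefore left for manual work.
example = [(1, 2, False), (2, 0, False), (3, 2, True),
           (4, 3, True), (5, 2, True), (6, 5, False)]
print(remove_simple_traces(example))
```

Traces reported in the second return value correspond to the cases that required a decision about a new head and therefore could not be handled mechanically.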
There were two types of cases to handle manually: those in which the trace
was pointing to a verb and all the others. The treatment of the latter cases was
rather straightforward. Among the former cases, a particularly hard task was to
remove traces of main verbs from coordinated sub sentences. In the context of the
TUT annotation scheme there are not simple answers to the questions ’Which is
the head of a coordinated sub sentence with a removed verb trace being its former
head?’ and ’To which token should the new head of a coordinated sub sentence
point?’.
The problematic verb traces were removed without an explicit and theoretically
motivated strategy. We plan to further investigate issues related to traces and their
removal. After this step the TUT was ready to be used for training and parsing with
the Malt parser.
10 http://w3.msi.vxu.se/~nivre/research/MaltEval.html
Table 1: Parsing results for Italian using the MBL3 and MBL4 learning methods.
learning model   tag sets    unlabelled precision/recall   labelled precision/recall
MBL3             complete    86.59%                        74.41%
MBL3             reduced     87.18%                        81.66%
MBL4             complete    86.49%                        74.37%
MBL4             reduced     87.33%                        81.75%
4 Experiments
For the tests, the sentences from the TUT files were re-arranged in a random manner in a single file which was then split into 10 parts for a 10-fold cross-validation.
The choice of 10-fold cross-validation was motivated by the small size of TUT
(only 1,500 sentences). The MaltEval software was used for measuring the results,
calculating the mean score per word, excluding punctuation. Gold standard POS
tags were used in the experiments.
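The data preparation for the cross-validation can be sketched in Python as follows. The function name, the fixed random seed and the representation of sentences as list items are assumptions made for the illustration. The actual splitting procedure may have differed in detail.

```python
import random


def ten_fold_splits(sentences, folds=10, seed=0):
    """Shuffle the sentences once, cut them into `folds` parts and yield
    (train, test) pairs in which each part serves as test data exactly once."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    parts = [shuffled[i * n // folds:(i + 1) * n // folds] for i in range(folds)]

    for held_out in range(folds):
        test = parts[held_out]
        train = [s for i, part in enumerate(parts) if i != held_out for s in part]
        yield train, test


# With 1,500 sentences each fold trains on 1,350 and tests on 150.
for train, test in ten_fold_splits(range(1500)):
    print(len(train), len(test))
```

Shuffling before splitting avoids folds that consist of a single source file or text genre.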
Two pairs of experiments were performed. The Malt parser was trained and tested
on TUT with both complete and reduced tag sets and with both the MBL3
and the MBL4 (lexicalized) learning models. The first pair of experiments used the
original TUT tag sets with minor changes, while the second employed the reduced
tag sets. The parsing results are given in Table 1.
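The per-word unlabelled and labelled scores can be illustrated with a small Python sketch. It is not the MaltEval implementation. The token representation, the function name and the use of the tag PUNCT to identify punctuation are assumptions, and the labels in the example are invented. Since every scored word receives exactly one head, precision and recall coincide in this setting, which is presumably why a single figure is reported for both.

```python
def attachment_scores(gold, predicted, punctuation_tags=("PUNCT",)):
    """gold and predicted: lists of (pos_tag, head, label), one per token.

    Returns (unlabelled, labelled): the proportion of non-punctuation tokens
    with the correct head, and with the correct head and label.
    """
    scored = correct_head = correct_both = 0
    for (pos, g_head, g_label), (_, p_head, p_label) in zip(gold, predicted):
        if pos in punctuation_tags:
            continue  # punctuation is excluded from the score
        scored += 1
        if g_head == p_head:
            correct_head += 1
            if g_label == p_label:
                correct_both += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return correct_head / scored, correct_both / scored


# Invented three-token example; head and label values are illustrative only.
gold = [("ART.DEF", 2, "A"), ("NOUN.COMMON", 0, "B"), ("PUNCT", 2, "C")]
pred = [("ART.DEF", 2, "X"), ("NOUN.COMMON", 0, "B"), ("PUNCT", 1, "C")]
print(attachment_scores(gold, pred))
```

In the example both scored words have the right head but only one has the right label, so the unlabelled score is higher than the labelled one, as in Table 1.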
4.1 Comparison to Dependency Parsing for Swedish
The results from the evaluation of the Malt parser on Swedish from (Nivre, [17])
are given in Table 2.
Table 2: Results from the evaluation of the Malt parser on Swedish (Nivre, [17]).
learning model     unlabelled precision/recall   labelled precision/recall
MBL non-lexical    81.70%                        74.70%
MBL lexical        84.70%                        80.60%
An updated version of the results for Swedish was provided by Joakim Nivre in
personal communication: unlabelled accuracy is 86% and labelled precision/recall
is 82%.
In the first pair of experiments (those with large tag sets) unlabelled precision was comparable to that of the updated results for the Swedish data, but
labelled precision was considerably lower. In the second pair of experiments,
labelled precision was closer to that of the updated Swedish results. The
tests on the Swedish data were performed using a smaller POS tag set with
training data of 100,000 words. Automatic POS tagging was used instead of gold
standard tags in the Swedish tests.
By comparison, the parsing results for Italian are slightly lower, due to sparse
data. After removing the traces, the number of tokens in TUT decreased to 41,616,
considerably fewer than in the training data for Swedish. The
parsing results are nevertheless comparable with those obtained for Swedish, although automatically assigned tags were used in the Swedish experiments, a technique that usually gives
worse results than the use of gold standard tags. The significant difference
in the training set sizes should be kept in mind when comparing the figures.
4.2 Comparison to Constituency Parsing
The Malt parser has been tested on the Wall Street Journal (WSJ) part of the Penn Treebank in (Nivre and Scholz, [18]), achieving dependency accuracy 3 to 5% lower and
root accuracy 6 to 10% lower than the state-of-the-art parsers for English
((Charniak, [5]) and (Collins, [6])). Dependency-based evaluation of the parsers
of Charniak and Collins was done in (Yamada and Matsumoto, [20]). The unlabelled and labelled precision/recall for parsing English with the Malt parser are
88% and 86% respectively, but the training set is 1,000,000 words11.
In a recent study by Corazza and colleagues [7] state-of-the-art statistical constituency-based parsers (an implementation of Collins 2 parsing model (Collins,
[6]) by Bikel in (Bikel, [1]) and the Stanford parser (Klein and Manning [10, 11]))
were used for parsing the Italian Syntactic-Semantic Treebank. The difference in
performance compared to English was over 15% for the F-measure, so the results
are significantly poorer for Italian than for English. The authors conclude that this
performance may be due either to the annotation scheme or to the specificity of the Italian
language.
11 This data was kindly provided by Joakim Nivre in personal communication.
Table 3: Parsing results for small sets from TUT and SDT.
learning model   treebank   unlabelled precision/recall   labelled precision/recall
MBL3             TUT        81.62%                        67.24%
MBL3             SDT        69.01%                        58.33%
MBL4             TUT        81.60%                        66.66%
MBL4             SDT        69.08%                        57.84%
Given that the results obtained for TUT are close to those obtained for
English using the Malt parser, and that the Malt parser performs slightly worse on English
than state-of-the-art parsers, the poor results on ISST seem to be due not to
the specificity of the Italian language but rather to the fact that the parsers and
the treebank were constituency-based. However, this issue needs further investigation, since the evaluation metrics for constituency parsing are different from those
for dependency parsing.
5 Usage of the Malt Parser on Small Extracts of Corpora
A pair of preliminary experiments was performed on small extracts from the
TUT and the Slovene Dependency Treebank (Ledinek, [12]). SDT implements an
annotation scheme similar to that of the Prague Dependency Treebank (Hajič,
[9]). Both extracts contained 335 sentences, but the TUT extract
contained a significantly greater number of tokens (9,322 tokens for TUT vs. 6,980
tokens for SDT).
Note that the TUT tag set contains more labels than the SDT one. The preliminary results for these experiments are given in Table 3. The reason for the
significantly poorer numbers for Slovene seems to be the lower number of tokens
of the SDT extract.
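The difference between the two extracts can be made concrete by computing the average sentence length from the figures given above. The short Python calculation below assumes that each extract contains 335 sentences. The figure for the full TUT uses the token count reported in Section 2, which still includes traces.

```python
def mean_sentence_length(tokens, sentences):
    """Average number of tokens per sentence."""
    return tokens / sentences


tut_extract = mean_sentence_length(9322, 335)    # TUT extract
sdt_extract = mean_sentence_length(6980, 335)    # SDT extract
tut_full = mean_sentence_length(41771, 1500)     # full TUT, traces included

print(f"TUT extract: {tut_extract:.2f} tokens per sentence")
print(f"SDT extract: {sdt_extract:.2f} tokens per sentence")
print(f"TUT (full):  {tut_full:.2f} tokens per sentence")
```

Under this assumption the TUT extract has sentences about seven tokens longer on average than the SDT extract, and its average is close to that of the full treebank.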
6 Future Work
The authors plan to continue their work on Italian by performing further tests using
as input to the Malt parser the output of a POS tagger instead of the gold standard
tags. This will provide more realistic results. A comparison with a rule-based
parser for Italian is planned too. In the medium term the Malt parser can be used on a
version of ISST converted to dependency format.
[Figure: dependency tree over the tokens "Knjiga ni imela naslova .", with POS tags Ncfsn, Vcip3s--y, Vmps-sfa, Ncmsg and Z and dependency labels Sb, AuxK, Obj and AuxV.]
Figure 2: A sample tree from SDT converted to Malt format.
Another direction for further investigation is to test the Malt parser on small
sets of TUT and SDT which contain a similar number of tokens and to explore the
influence of the average number of words per sentence on the final parsing results.
In the medium term, experiments with the Malt parser on other treebanks are planned:
first on Bulgarian with the BulTreeBank (Simov, [19]) and later on Turkish and German treebanks.
Acknowledgements
I thank Joakim Nivre for kindly answering all my questions about the Malt parser
and providing me with detailed explanations on the parameters of its use. I also
thank Cristina Bosco for providing useful feedback and Vincenzo Lombardo for his
advice on dependency treebanks. I would like to thank Nina Ledinek and Tomaz
Erjavec for making available the Slovene Dependency Treebank. And last but not
least I would like to thank Alberto Lavelli and Bernardo Magnini for supervising
me and for the valuable comments and suggestions they have provided me so far.
References
[1] Bikel, Daniel M. (2004) Intricacies of Collins’ parsing model. Computational
Linguistics, 30(4).
[2] Bosco, Cristina (2004) A grammatical relation system for treebank annotation. PhD thesis, University of Torino.
[3] Bosco, Cristina and Lombardo, Vincenzo (2004) Dependency and relational
structure in treebank annotation. In Proceedings of the Workshop on Recent
Advances in Dependency Grammars at Coling 2004, Geneva, Switzerland.
[4] Bosco, Cristina, Lombardo, Vincenzo, Vassallo, Daniela and Lesmo,
Leonardo (2000) Building a treebank for Italian: a data-driven annotation
schema. In Proceedings of the Second International Conference on Language
Resources and Evaluation LREC-2000, pp. 99–106, Athens, Greece.
[5] Charniak, Eugene (2000) A maximum-entropy-inspired parser. In Proceedings of NAACL-2000.
[6] Collins, Michael (1997) Three generative, lexicalized models for statistical
parsing. In Proceedings of ACL, pp. 16–23, Madrid, Spain.
[7] Corazza, Anna, Lavelli, Alberto, Satta, Giorgio and Zanoli, Roberto (2004)
Analyzing an Italian treebank with state-of-the-art statistical parsers. In Proceedings of the 3rd workshop on Treebanks and Linguistic Theories (TLT2004), Tübingen, Germany.
[8] Delmonte, Rodolfo (2004) Strutture sintattiche dall’analisi computazionale di
corpora di italiano. In Anna Cardinaletti (a cura di), "Intorno all’Italiano
Contemporaneo" Franco Angeli, Milano, pp. 187–220.
[9] Hajič, Jan (1998) Building a syntactically annotated corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning, pp. 106–132. Prague:
Karolinum.
[10] Klein, Dan and Manning, Christopher D. (2002) Fast exact inference with a
factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002).
[11] Klein, Dan and Manning, Christopher D. (2003) Accurate unlexicalized parsing. In Proceedings of the 41st annual meeting of the Association of Computational Linguistics, Sapporo, Japan.
[12] Ledinek, Nina (2005) Površinskoskladenjsko označevanje korpusa Slovene
Dependency Treebank (s poudarkom na predikatu). B. A. thesis. Ljubljana:
Faculty of Arts, University of Ljubljana.
[13] Lin, Dekang (1998) A dependency-based method for evaluating broad-coverage parsers. In Natural Language Engineering, 4 (2), pp. 97–114.
[14] Marcus, Mitchell P., Santorini, Beatrice and Marcinkiewicz, Mary Ann
(1993) Building a large annotated corpus of English: The Penn Treebank.
In Computational Linguistics, 19 (2), pp. 273–290.
[15] Montemagni, Simonetta, Barsotti, Francesco, Battista, Marco, Calzolari,
Nicoletta, Corazzari, Ornella, Lenci, Alessandro, Zampolli, Antonio, Fanciulli, Francesca, Massetani, Maria, Raffaelli, Remo, Basili, Roberto, Pazienza,
Maria Teresa, Saracino, Dario, Zanzotto, Fabio, Pianesi, Fabio, Mana, Nadia
and Delmonte, Rodolfo (2003) Building the Italian Syntactic-Semantic Treebank. In Anne Abeillé, editor, Building and Using syntactically annotated
corpora, pp. 189–210, Kluwer, Dordrecht.
[16] Nivre, Joakim (2003) An efficient algorithm for projective dependency parsing. In Proceedings of the 8th international workshop on parsing technologies
(IWPT), pp. 149–160, Nancy, France.
[17] Nivre, Joakim, Hall, Johan and Nilsson, Jens (2004) Memory-based dependency parsing. In Proceedings of the Eighth Conference on Computational
Natural Language Learning (CoNLL), pp. 49–56, Boston, Massachusetts.
[18] Nivre, Joakim and Scholz, Mario (2004) Deterministic dependency parsing
of English text. In Proceedings of COLING 2004, Geneva, Switzerland.
[19] Simov, Kiril, Popova, Gergana and Osenova, Petya (2001) HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In Proceedings of the Corpus
Linguistics 2001 Conference, pp. 561.
[20] Yamada, Hiroyasu and Matsumoto, Yuji (2003) Statistical dependency analysis with support vector machines. In Proceedings of IWPT, pp. 195–206,
Nancy, France.