Dimitra Anastasiou, Oliver Čulo
Saarland University

USING TOPOLOGICAL INFORMATION FOR DETECTING IDIOMATIC VERB PHRASES IN GERMAN

1 Introduction

The METIS-II¹ project (Dirix et al. 2005; Dologlou et al. 2003) is a hybrid statistical machine translation (SMT) system. Its goal is to generate free-text translations based on statistical methods and on pattern matching in a large monolingual target language (TL) corpus, the British National Corpus (BNC)². The project runs from 2004 until 2007 and has Dutch, German, Greek and Spanish as source languages (SL) and British English as TL. METIS-II has language-specific resources for both SL and TL, such as bilingual dictionaries, a tokenizer, a part-of-speech (PoS) tagger, a chunker, a lemmatizer/morphological generator and manually constructed mapping rules. The function of the mapping rules is to map the TL structures onto SL structures. A language model for the target language, acquired from the BNC, helps to disambiguate between different translation possibilities and is used to retrieve the TL word order (Vandeghinste et al., 2005).

A translation example is shown in figure 1 below. Clearly, the context plays a significant role; in the example, the choice of the adjective in the TL depends on the noun it is combined with.

SL: Ich betrachte Churchill als einen großen Politiker.
Literally: I consider Churchill as a tall / great politician.
TL: I consider Churchill to be a great politician.

The SL sentence must be tokenized, tagged, lemmatized and chunked. Once one or more TL translations have been found for every lemma, the statistical language model and the mapping rules help to find the right TL lexemes and word order.

2 Identification of idiomatic expressions and discontinuities

Idioms are often referred to as long words. They are expressions whose syntactic or semantic properties cannot be derived from their component parts. They always involve a lexical head and frozen and/or flexible complements.
Volk (1998) points out what an MT system has to recognize in order to identify an idiom:

1. The contiguous parts of the idiom (vor die Hunde – to the dogs)
2. The discontinuous parts of the idiom (gehen – go) in any of their inflected forms
3. The syntactic requirements of the idiom. The idiom in question often takes an animate subject and a physical object.
4. The clause boundaries. The idiom is usually contained in one clause. There are rare cases where the idiom is spread over two clauses, but then it is often used with its non-idiomatic meaning.

The term “discontinuous strings” has been used to describe strings whose verb is conjugated. Such a string usually follows the subject-verb-object (SVO) sequence, e.g. die Welt geht (ständig) vor die Hunde – the world (constantly) goes to the dogs, or its verb is in the participle form, e.g. Die Welt ist vor die Hunde gegangen – the world went to the dogs.

¹ METIS-II is sponsored by the EU under the FET-STREP scheme of FP6 (METIS-II, IST-FP6-003768).
² http://www.natcorp.ox.ac.uk/

3 Types of idiomatic verb phrases

An idiomatic expression can be a noun phrase (das A und O – the be-all and end-all), a prepositional phrase (auf Biegen und Brechen – by hook or by crook), an adverb (steif und fest – firmly), a noun phrase plus a prepositional phrase (Hals über Kopf – head over heels) or a verb phrase. In this paper, we examine German idiomatic verb phrases in detail, because they exhibit the most discontinuities. They basically occur in one of the following three categories. The entries listed below are used in our experiment and are included in our bilingual (German – English) lexicon. For our corpus, we used real-world examples from newspaper texts that contain the entries below, in both continuous and discontinuous order.

1. Noun phrase (NP) plus verb
   ein Auge zudrücken – turn a blind eye
   einen kapitalen Bock schießen – drop a real clanger

2. Prepositional phrase (PP) plus verb
   auf die falsche Karte setzen – back the wrong horse
   ins Fettnäpfchen treten – put one’s foot in it
   mit den Wölfen heulen – run with the pack
   um den heißen Brei herumreden – beat about the bush
   vor die Hunde gehen – go to the dogs

[OC: Move these examples to the end, as future work, since we do not handle them in our matching at all!]

3. NP plus PP plus verb
   das Kind mit dem Bade ausschütten – throw out the baby with the bathwater
   den Bock zum Gärtner machen – set the fox to keep the geese
   ein Wolf im Schafspelz sein – be a wolf in sheep’s clothing
   jmdn. auf den Arm nehmen – pull s.o.'s leg

4 Permutations and topological distributions

German is a language with a relatively free word order. However, it does obey some ordering principles, as described in the topological field model for German (Drach 1963, Duden 1998). Making use of this model, we can describe the patterns in which subparts of an idiomatic expression can appear, potentially carrying an idiomatic reading. The topological field model states that the German main clause³ can be divided into five fields, each of which can hold a certain number and/or a certain type of constituents. A basic description of the fields is as follows:

• The pre-field (VF) contains only one syntactic constituent, be it an NP, a PP or a subordinate clause;
• the left bracket (LK) holds the conjugated syntactic head verb;
• the middle field (MF) can consist of several permutations of various kinds of syntactic constituents and subordinate clauses;
• the right bracket (RK) holds participles or infinitive forms in case the syntactic head verb is an auxiliary or a modal;
• finally, the post-field (NF) contains subordinate clauses or coordinated main clauses.

As Drach (1963) points out, the sentence bracket is a typical feature of German clause construction. Syntactic units that have been separated often appear as brackets around the middle field, which is the main container of the main clause.
This bracketing construction occurs, for instance, with a modal verb: the modal appears in the left bracket, and the infinitive belonging to it is placed in the right bracket. The same goes for discontinuous idiomatic verb phrases (iVPs). The verb may appear in the left bracket, while the NP or PP belonging to it is placed at the end of the middle field⁴. Following this observation, we defined the pattern for discontinuous appearances of iVPs, where iV, iNP and iPP denote the verb, NP and PP parts of the idiom and the subscripts name the topological field:

iV_LK (NP|PP|subclause)*_MF iNP_MF [ (V*_RK | subclause*_NF) ]

as in:

SL: Der Mann nimmt_LK mich, obwohl ich mich darüber ärgere, ständig auf den Arm_MF.
Literally: The man takes me, although I me about it annoy, constantly on the arm.
TL: The man is constantly pulling my leg, although I am annoyed about it.

We define the ordering in which the verb appears to the left as non-canonical order, because in the lexicon the verb is situated to the right of the iNP/iPP, which we simply define as canonical order.

Besides the pattern given above, (at least) three other configurations are possible, all of them variations of continuous appearances of the iVP. The iVP can appear as a continuous string at the end of a clause. The pattern for this is:

iNP_MF iV_RK

stating that the iV is in the right bracket and preceded by the iNP or iPP in the middle field. This is usually the case in subordinate clauses, as in:

SL: Ich mag ihn nicht, weil er mich auf den Arm_MF nimmt_RK.
Literally: I like him not, because he me on the arm takes.
TL: I don't like him, because he pulls my leg.

³ Subordinate clauses can be divided into fields, too, but a detailed inspection of a complete theory of German topological fields is outside the scope of this paper.
⁴ We cannot say that it is in the right bracket, as by the definition we use only verbal forms can be contained there.

The same case appears in main clauses when there is an auxiliary or modal verb in the left bracket as syntactic head word, as in:

SL: Er hat_LK mich auf den Arm_MF genommen_RK.
Literally: He has me on the arm taken.
TL: He pulled my leg.

When the iVP is topicalized, it appears as a continuous string in the pre-field, described by the pattern:

(iNP iV)_VF

as in:

SL: Auf den Arm nehmen_VF lasse_LK ich mich nicht.
Literally: [On the arm take] let I me not.
TL: I won’t let anyone pull my leg.

While it may look superfluous to define all topological variations for the continuous appearances instead of simply trying to match an ‘en bloc’ order of all iVP constituents, some idiomatic expressions can also be varied slightly, for instance by inserting adjectives. In these cases we still want to make sure that an idiomatic reading of the whole expression is possible, which is true if the slightly modified idiom parts are distributed as in the above examples.

5 Pattern matching of discontinuous VPs

The detection of a fit between the given sentence and a particular tree structure is called pattern matching. Matching is useful to identify the permutations of discontinuous phrases. For this task, a procedure was developed that makes it possible to match discontinuous strings (Carl and Rascu, 2006). The dictionary contains only canonical forms, so special dictionary preprocessing and lookup as well as discontinuous matching are necessary for detecting the permutations. The first subsection describes the structure of the dictionary, and the second describes the rules that guide the pattern matching process with the help of topological information.

5.1 The dictionary

[OC: Link the generation of variants to the description of the discontinuous pattern in Section 4.]

The German – English lexicon used in METIS-II has been developed by the IAI⁵ at Saarland University. It contains more than 600,000 entries, which have been collected over the past 20 years. The bilingual dictionary describes the raw lemma-to-lemma translation.
A lemma and a PoS tag are taken as input, and a TL lemma and a partial TL tag are returned. Each entry has a German and an English side. During the compilation of the lexicon, both sides of the entries are lemmatized and tagged, so that they can be matched more easily against the SL string. The features are represented and used for TL generation. Lexicon entries are represented in the form of attribute-value pairs. A single German word can be translated either into a single English word or into a phrase; that means that the language sides are independent. The entries also contain additional PoS information for the German and for the English side. The dictionary entries do not necessarily have the same number of arguments. The German side may have the meta-information jemanden <jdn>, whereas the English side may lack the meta-information somebody <sb>. Take the following lexicon entry, for instance:

<jdn> auf den Arm nehmen <=> pull <sos> leg

⁵ Institut der Gesellschaft zur Förderung der Angewandten Informationsforschung

The distance between SL and TL of idioms and other fixed expressions is often large, as several of the lemmas of the idiom do not occur as possible translations of corresponding entries in the dictionary. Listing fixed expressions of this kind in the dictionary can therefore solve the problem. Let us take the above entry auf den Arm nehmen as an example. The compilation process generates one database entry for the canonical form and one for the permutation nehmen auf den Arm, where the verb moves to the front of the phrase. The canonical form NP + V and the permutation V + NP will appear in different kinds of sentences, as we have described above.

5.2 Topological rules guiding the matching process

[OC: Shorten this and relate it to Section 4.]

During the matching process, the procedure checks for each word of a multi-word entry whether it is present in the sentence.
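This presence check can be sketched as follows. The sketch is a minimal illustration under our own assumptions and not the METIS-II implementation: the function name naive_entry_match, the hand-lemmatized example sentences and the use of der as the lemma of the article den are all hypothetical.

```python
from collections import Counter


def naive_entry_match(entry_lemmas, sentence_lemmas):
    """Return True if every lemma of the entry occurs in the sentence.

    Only presence (with multiplicity) is tested; position and
    topological field are deliberately ignored.
    """
    needed = Counter(entry_lemmas)
    available = Counter(sentence_lemmas)
    return all(available[lemma] >= count for lemma, count in needed.items())


# Hypothetical lemmatized form of the entry "auf den Arm nehmen".
ENTRY = ["auf", "der", "Arm", "nehmen"]

# "Der Mann nimmt mich ständig auf den Arm." (idiomatic use)
idiomatic = ["der", "Mann", "nehmen", "ich", "ständig", "auf", "der", "Arm"]

# "Er nimmt den Koffer und legt ihn auf den Tisch, obwohl ihm der Arm
# weh tut." (all four lemmas occur, but not as parts of the idiom)
literal = ["er", "nehmen", "der", "Koffer", "und", "legen", "er", "auf",
           "der", "Tisch", "obwohl", "er", "der", "Arm", "weh", "tun"]

print(naive_entry_match(ENTRY, idiomatic))
print(naive_entry_match(ENTRY, literal))
```

Both calls return True, including the one for the sentence in which the four lemmas are scattered over unrelated phrases.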
However, this alone is not enough. While the words may be there, they can still appear in a distribution which does not make them an idiomatic expression. As we have already shown, the parts of an idiomatic VP can appear only in certain places in the sentence. The matching process can be guided by rules that check whether matched words appear in the right place. Otherwise, for an expression like auf_den_Arm_nehmen, any preposition auf appearing in the sentence could be matched. We want to match only the preposition that is part of the PP auf_den_Arm. If there is no such preposition, we can rule out that the sentence contains the idiomatic expression and thus avoid false positives, called noise.

The rules used for guiding the matching process are written in the KURD⁶ formalism. A KURD rule consists of a name, a condition and an action part. A KURD test consists of a list of nodes, each node being a set of attribute-value features. If the test matches the current node(s) in the input object, it returns true and the given action is performed; otherwise the action is skipped or, if one is given, an alternative action is performed. A schematic representation of a matching rule containing topological information is as follows:

VerbPattern_lk_mf =
  VP [ V:  field = left bracket
       X*: field = middle field OR subordinate clause
       NP: field = middle field
       Y*: End_Of_Sentence OR field = right bracket OR field = post-field ]
  : Mark_As_iVP.

⁶ KURD stands for the operations kill, unify, replace and delete. It is capable of other operations, too, though.

This rule describes a pattern for a discontinuous iVP, as we have described above. The first element that is matched should be a verb in the left bracket of the sentence. After that, a number of elements in the middle field or an inserted subordinate clause can follow. At the end of the middle field, there is the NP part of the iVP.
Either the sentence finishes at that point, or the iNP is followed by an element in the right bracket or the post-field. This condition matches sentences like “Der Mann nimmt mich, obwohl ich mich darüber ärgere, ständig auf den Arm”.

6 Evaluation of the discontinuous matching

We constructed two test corpora consisting of

• 32 sentences with canonical order, e.g.
  de: Grünes Licht geben müssen die Eltern den Kindern.
  en (literally): Green light give must the parents the children.
  en: The parents must give the green light to the children.
• 19 sentences with non-canonical order, e.g.
  de: Die Welt geht nicht vor die Hunde.
  en (literally): The world goes not before the dogs.
  en: The world does not go to the dogs.

We ran the matching procedure on the given sentences and manually evaluated right and wrong matches of idiomatic expressions. In addition, we compared it to an earlier version of the matching rules, which would simply test what kind of clause the iVP appears in, assuming that the canonical order should be more or less reserved for subordinate clauses. The following table reflects our findings:

                            Precision (%)  Recall (%)  Noise  Miss
non-canonical   old match        100           89         0     2
                new match        100           89         0     2
canonical       old match         77           53         5    15
                new match        100           91         0     3

The wrongly matched sentences fall into two categories, misses and noise. A miss occurs when the phrase has not been matched at all; noise occurs when the sentence has been matched, but in a wrong way. Some sentences may exhibit both a miss and noise at the same time.

[OC: short evaluation and discussion of the figures]
[OC: we had solved the problem described in the following paragraph shortly before PALC]

A problem that we faced very often is that the same article occurs twice in the sentence, so that the matcher picked the article that was not part of the idiom. The morphological program could not identify whether the article belonged to the discontinuous phrase or to another common noun in the sentence.
With the help of a KURD rule, we solved the problem by setting appropriate constraints.

7 Conclusion and further work

Idioms concern many researchers and have always been a difficult problem for MT. Our matching procedure within METIS-II, a hybrid SMT system, has been improved since last year. We amended the already existing rules of the IAI, and the evaluation proved successful. METIS-II is now able to match not only continuous but also discontinuous phrases. There are still more complex patterns to be researched, such as Der Bock macht sich selbst zum Gärtner – The fox is set by itself to keep the geese.

[OC: summarize the following two paragraphs briefly as future work, insert one or two examples from the beginning]

Abeillé and Schabes (1989) examine various discontinuities occurring with both “fixed” and “flexible” idioms. The former are idioms in which no syntactic or lexical rule can be applied to the frozen arguments. In other words, they cannot be relativized, passivized etc. Even idioms of this kind cannot simply be syntactically reanalyzed; rather, they need to be assigned an internal syntactic structure. All insertions are regularly predictable from the syntactic category of the idiomatic element that is modified, for example the adjective proverbial in: He kicked the proverbial bucket. Unbounded discontinuities can arise from the insertion of auxiliaries or adverbials between the frozen subject and the verb: All hell seemed to be likely to break loose.

The “flexible” idioms have the same discontinuities as free sentences and are often assigned the same syntactic structures. Unbounded discontinuities may arise from the passivization of an idiom with a frozen complement, e.g. The beans were spilled by him. Adverbials and auxiliaries can be inserted regularly into an idiom with a frozen subject: The beans continue to appear to be certain to be spilled (Wasow et al., 1983).
It should also be checked whether the already applied patterns are applicable to all support-verb constructions (SVCs).

REFERENCES

[OC: Volk (1998) is still missing here!]

Abeillé, A. and Y. Schabes. (1989). “Parsing idioms in lexicalized TAGs”. Proceedings of the fourth conference on European chapter of the Association for Computational Linguistics. Manchester, England: 1-9.
Carl, M. and E. Rascu. (2006). “A Dictionary Lookup Strategy for Translating Discontinuous Phrases”. Proceedings of EAMT. Oslo: 16-21.
Dirix, P., Schuurman, I. and V. Vandeghinste. (2005). “Example-based machine translation using monolingual corpora: System description”. Proceedings of MT Summit X, Workshop on EBMT. Phuket, Thailand: 43-50.
Dologlou, Y., Markantonatou, S., Tambouratzis, G., Yannoutsou, O., Fourla, A. and N. Ioannou. (2003). “Using Monolingual Corpora for Statistical Machine Translation: The METIS System”. Proceedings of EAMT - CLAW 2003, Dublin: 61-68.
Drach, E. (1963)[1940]. Grundgedanken der deutschen Satzlehre. Wissenschaftliche Buchgesellschaft, Darmstadt, Germany.
DUDEN Redaktion. (1998). Grammatik der deutschen Gegenwartssprache. Mannheim, Germany.
Vandeghinste, V., Dirix, P. and I. Schuurman. (2005). “Example-based Translation without Parallel Corpora: First experiments on a prototype”. Proceedings of MT Summit X, Workshop on EBMT. Phuket, Thailand: 135-142.
Wasow, T., Sag, I., and G. Nunberg. (1983). “Idioms: An interim report”. In: Hattori, S. and K. Inoue (eds.) Proceedings of the XIIIth International Congress of Linguistics. Tokyo: 102-115.