A Systematic Adaptation Scheme for English-Hindi Example-Based Machine Translation

Deepa Gupta, Niladri Chatterjee
Department of Mathematics, I.I.T Delhi, Hauz Khas, New Delhi-110016
Email: [email protected]

ABSTRACT

The success of Example-Based Machine Translation (EBMT) often depends upon how efficient the adaptation scheme is. Adaptation primarily aims at modifying retrieved examples to meet the required demands of a given translation task. The present work looks at adaptation for EBMT from English to Hindi. This paper describes a rule-driven adaptation scheme for modifying a retrieved translation example to generate the translation of a given input. Only a selected set of sentence structures has been considered so far. The structural and morphological rules for the source and the target languages have been used to develop the scheme.

KEY WORDS: EBMT, Adaptation, Structural Modification, Morphological Modification.

1. INTRODUCTION

One major aspect of an Example-Based Machine Translation (EBMT) [2] [4] system is its adaptation scheme. Even a very efficient similarity measurement scheme and a very large example base cannot, in general, guarantee an exact match for a given input sentence. As a consequence, the need arises for an efficient and systematic adaptation scheme for modifying a retrieved example and thereby generating the required translation.

In this work we present the preliminary version of a systematic scheme for adapting translations from English to Hindi. The work is still in its initial stage, and works only on sentences of selected structures. The examples used for this work have been obtained primarily from children's storybooks and translation books, where more complex sentence structures are rarely found. However, we aim to incorporate more complex structures in future.

We consider an adaptation scheme to be a rule-driven approach that considers the discrepancy between the input and the retrieved sentence in the source language.
The rules have been formed by taking into account the grammars of both the source and the target languages. Given an input sentence, the system retrieves an example from its example base that is similar to the input sentence in structure. Component-wise differences between the input and the retrieved sentence are then measured, and the retrieved translation in the target language is modified to account for the discrepancies. The adaptation is carried out by resorting to one of the following operations, as suggested in [1]:

Simple word replacement: One may get the translation of the input sentence by replacing some words in the retrieved translation example. Suppose the input sentence is: "The Squirrel was eating groundnuts." The most similar sentence retrieved by the system (along with its Hindi translation) is: "The Elephant was drinking water." (haatee paanii pii rahaa thaa). In order to generate the translation, one just needs to replace, in the retrieved translation, the Hindi words for "elephant", "drink" and "water" with those for "squirrel", "eat" and "groundnuts", respectively. Word replacement alone therefore gives the exact translation of the input sentence.

Word Deletion: In some cases one may have to delete some words from the translation example to generate the new translation. For example, the input sentence is: "Animals were dying of thirst". The retrieved translation example is: "Birds and Animals were dying of thirst" (pakshii aur pashu pyaasa se mar rahee thii). The translation can then be obtained by deleting the "birds and" (pakshii aur) part from the retrieved translation.

Word Addition: Sometimes, to generate a new translation, one may have to add some words to the retrieved translation example. For illustration, one may consider the example given just above with the roles of the input and retrieved sentences reversed.

Change in Tense: When the input and retrieved sentences differ in tense, one has to apply syntax rules for appropriate modification of the retrieved translation example.
For illustration:

English Verb Structure          Hindi Structure
"was" + root verb + "ing"       root verb + "rahaa thaa" or "rahi thi"
root verb + "s"/"es"            root verb + "taa hai" or "tii hai"

If the tenses of the input and the retrieved examples are different, appropriate morphological changes are to be made accordingly.

Although the demand for translation from foreign languages to Indian languages is increasing, few works have been reported so far in this context. In ANUBHARTI [3] we found a good example that uses adaptation as its key tool for generating new translations from old ones. There, however, the examples are stored in an abstracted form for determining the structural similarity between the input sentence and the example sentences. In this work, by contrast, we store the examples as they are. The advantage is that if certain components of the retrieved sentence are semantically and/or syntactically similar to the input sentence, then the task of adaptation is reduced. Furthermore, this helps in estimating the overall adaptation cost, which in turn may be used as a tool for similarity measurement between an input and the stored examples.

The adaptation scheme proposed here is designed on the basis of the above four operations. Section 2 describes the overall adaptation scheme. Section 3 explains in detail how the components of the sentence are determined. Section 4 describes how the structural changes are incorporated into the retrieved translation.

2. BASIC ADAPTATION SCHEME

The following notations will be followed in the paper for explaining the adaptation scheme:

SI: the input sentence in the source language (here, English).
SR: the source language example sentence retrieved as the most similar to SI.
TR: the target language (here, Hindi) translation of SR.
TT: the translation of SI in the target language, to be generated by adaptation of TR.

The sentences that have been considered for this work have the following structural limitations.
a) They may have only the following components: Subject (noun / pronoun), Auxiliary Verb, Main Verb, Adverb, and a maximum of two Objects along with related Adjectives and a connecting Preposition.

b) The tense of the sentences may be one of the following:
1. Simple Present
2. Present Continuous
3. Present Indefinite
4. Simple Past
5. Past Continuous
6. Future Indefinite

c) The sentences may be Affirmative, Interrogative (the ones where the auxiliary verbs are positioned at the beginning of the sentence) or Negative.

The following algorithm explains the overall scheme for adaptation. It checks the constituent words of SI and SR. The adaptation is then done in the following way:

a) If the constituent words of SI and SR are the same and have the same morphosyntactic tags, no adaptation is needed and TR is the desired translation TT.
b) If e ∈ SI and e' ∈ SR have the same morphosyntactic tags, but e ≠ e', then call Word-Replacement (w, w'), where w and w' are the Hindi translations of e and e' respectively.
c) If e ∈ SI does not have any corresponding e' in SR, then call Word-Addition (w), and place w in the right position of TR.
d) If e' ∈ SR does not have any corresponding e ∈ SI, then call Word-Deletion (w').

When the above scheme is applied to all the words of SI and SR, TR gets modified to TT. The following example illustrates the scheme. Here WR, WD and WA represent Word-Replacement, Word-Deletion and Word-Addition, respectively.

Example:
SI: Is he selling fresh vegetables?
SR: He is not writing the letter with a pilot pen
TR: Wah pilot kalam se chitti nahi likh rahaa hai

The sequence of changes occurs in the following order.
Operation                 Resulting sentence
WR (bech, likh)           Wah pilot kalam se chitti nahi bech rahaa hai
WR (tarkaari, chitti)     Wah pilot kalam se tarkaari nahi bech rahaa hai
WA (taazi)                Wah pilot kalam se taazi tarkaari nahi bech rahaa hai
WD (kalam)                Wah pilot se taazi tarkaari nahi bech rahaa hai
WD (se)                   Wah pilot taazi tarkaari nahi bech rahaa hai
WD (pilot)                Wah taazi tarkaari nahi bech rahaa hai
WD (nahi)                 Wah taazi tarkaari bech rahaa hai
WA (Kyaa)                 Kyaa wah taazi tarkaari bech rahaa hai

TT: Kyaa wah taazi tarkaari bech rahaa hai

The following section explains the key steps of the algorithm.

3. DETAILS OF THE ALGORITHMS

The POSs in both SI and SR are identified, and the differences between them are used to make structural changes in TR to get TT. The following observations suggest the preferable order in which the POS in any sentence under consideration should be determined.

• Nouns are distinguished as subjects and objects depending on their positions relative to the verb. Consider the example "Ram broke the bow." In this sentence "Ram" is the subject and "bow" is the object.
• The auxiliary verb and the main verb together determine the tense of the sentence.

We determine the POS in both SI and SR in the following order:
1. Auxiliary Verb
2. Main Verb
3. Subject
4. Preposition
5. Object and Adjective before the Preposition (Object1 and Adjective1 respectively).
6. Object and Adjective after the Preposition (Object2 and Adjective2 respectively).

3.1 Determining the Auxiliary and Main Verbs

The following algorithm explains the method for determining the verbs present in a sentence.

Step 1: Identify the auxiliary verb (if present). This determines the tense.
Step 2: If (tense is future indefinite)
            Then the verb must be in the root form;
        Else If (tense is present indefinite or present / past continuous)
            Then identification of the root verb may need suffix stripping.

In Hindi, the morphological suffixes of a verb are either attached directly after the root verb or appended as a group of words after the root verb without deforming it. Hence obtaining the root verb is important for adaptation. For illustration, with respect to the input sentence "Ram is eating rice", consider the following cases:

Case 1. The retrieved sentence and its translation are: "Ram eats rice" and "ram chaawal khaata hai".
Case 2. The retrieved pair is: "Ram is drinking milk" and "ram dudh pii rahaa hai".

In Case 1, the root verb "eat" is the same as that of the input; however, there is a morphological difference. In Case 2, the root verb is different, so for adaptation the verb "pii" is to be replaced with "khaa". Accomplishing the above requires identifying the root form of the verb, which may in turn require suffix stripping. Table 1 provides some of the rules for identifying English suffixes. These rules have been obtained from a standard English grammar book [5].

Tense: Present/Past Continuous
a) If the root verb ends with the vowel 'e', then remove 'e' and append "ing". E.g. come + "ing" = coming.
b) If the verb ends with 'ie', then replace 'ie' with 'y' and add the "ing" suffix. E.g. lie + "ing" = lying.
c) If the last character of the verb is not 'w' / 'r' / 'y', and it is preceded by a vowel, then the last character is doubled and "ing" is appended. E.g. run + "ing" = running.

Tense: Present Indefinite
a) If the root verb ends with 'o', 'ss', 'sh', 'ch' or 'x', then the suffix is "es".
b) If the root verb ends with 'y' preceded by a consonant, then replace 'y' with "ies"; otherwise the suffix is "s".
Examples: read + s; push + es; fry - y + ies; say + s.

Table 1: Some Common English Suffixes for Verbs.
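The rules of Table 1 can be run in reverse to recover a root verb from an inflected form. The following is a minimal Python sketch, not the authors' implementation: the function names and the small KNOWN_ROOTS lexicon are assumptions made for illustration. A candidate root is accepted only if it appears in that lexicon, since the rules alone cannot tell whether, say, "coming" comes from "com" or "come".

```python
# Hypothetical lexicon of root verbs; a real system would use its verb database.
KNOWN_ROOTS = {"come", "lie", "run", "read", "push", "fry", "say", "eat", "go"}


def ing_candidates(word):
    """Undo the Present/Past Continuous rules of Table 1 for an '-ing' form."""
    stem = word[:-3]
    candidates = [stem, stem + "e"]          # eat+ing, come -> coming
    if stem.endswith("y"):
        candidates.append(stem[:-1] + "ie")  # lie -> lying
    if len(stem) >= 2 and stem[-1] == stem[-2]:
        candidates.append(stem[:-1])         # run -> running
    return candidates


def s_candidates(word):
    """Undo the Present Indefinite rules of Table 1 for an '-s' form."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # fry -> fries
    if word.endswith("es"):
        candidates.append(word[:-2])         # push -> pushes
    candidates.append(word[:-1])             # read -> reads
    return candidates


def root_verb(word, lexicon=KNOWN_ROOTS):
    """Return the root of an English verb form, or None if it is not recognised."""
    word = word.lower()
    if word in lexicon:
        return word
    if word.endswith("ing") and len(word) > 4:
        candidates = ing_candidates(word)
    elif word.endswith("s"):
        candidates = s_candidates(word)
    else:
        candidates = []
    for candidate in candidates:
        if candidate in lexicon:
            return candidate
    return None


for form in ["coming", "lying", "running", "pushes", "fries", "says"]:
    print(form, "->", root_verb(form))
```

Checking candidates against a lexicon keeps the sketch conservative: an unknown form yields None rather than a guessed root.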
3.2 Determining the Subject

In the present work we are dealing with two types of subjects: noun and pronoun. The subject may be singular or plural. Identification of the subject is easy when it is singular. However, getting the root (singular) form from the plural is not always straightforward, as different suffixes are in use in English. Some typical suffixing rules are given below (although there are some exceptions as well):

• If the noun ends with 'x', 'z', 'o', 's', 'sh' or 'ch', then the relevant suffix is "es". E.g. Fox ~ Foxes.
• If the noun ends with 'y', then
  a) if 'y' follows a consonant, the rule is to remove 'y' and add "ies". E.g. Baby ~ Babies.
  b) if 'y' follows a vowel, then the suffix is "s". E.g. Boy ~ Boys.
• If the noun ends with 'fe' or 'f', then the rule for the plural is to remove 'fe' or 'f' and add "ves". E.g. Knife ~ Knives.

In Hindi too, the plural of a noun is obtained by appending the required suffix to the singular form of the noun, often after deleting the last vowels. Some examples are: chiriyaa ~ chiriyaa + n; ghodaa ~ ghode; billi ~ billiyaan.

Adaptation often requires identifying the root word by suffix stripping, for Hindi and/or English. To achieve this we store the relevant knowledge in our database of words. Each record for a noun in the database has the following structure:

English Noun-tag    Translation of N (T)    Gender    Person    Hindi suffix for the plural of T

Examples:
Bird-0              Chiriyaa                F         3         n
Horse-0             Ghodaa                  M         3         e

The tags used are non-negative integers. The maximum value for a tag is three. The tags have been created after scrutinizing about 600 sentences. The tags are used in determining appropriate prepositions. Details of the tags are explained in Sections 3.3 and 3.4 below.

Once the properties of the subject (i.e. its gender, number and person) are found and the tense of the sentence is established, the Hindi morphological changes required for the verb for generating the translation may be determined.
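The noun record and the plural rules described in this section might be represented as in the Python sketch below. It is an illustration under stated assumptions rather than the authors' code: the names NounRecord, NOUNS, lookup_noun and hindi_form are hypothetical, and the rule used to drop the final "aa" before a vowel-initial plural suffix is an assumption chosen only so that the two example records reproduce "chiriyaa + n" and "ghode".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NounRecord:
    english: str
    tag: int            # non-negative integer, at most 3
    hindi: str          # translation T of the noun
    gender: str         # "M" or "F"
    person: int
    plural_suffix: str  # Hindi suffix for the plural of T


# The two example records given in the text.
NOUNS = {
    "bird": NounRecord("bird", 0, "chiriyaa", "F", 3, "n"),
    "horse": NounRecord("horse", 0, "ghodaa", "M", 3, "e"),
}

VOWELS = ("a", "e", "i", "o", "u")


def singular_candidates(word):
    """Undo the English plural rules listed above."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")                  # babies -> baby
    if word.endswith("ves"):
        candidates += [word[:-3] + "fe", word[:-3] + "f"]   # knives -> knife
    if word.endswith("es"):
        candidates.append(word[:-2])                        # foxes -> fox
    if word.endswith("s"):
        candidates.append(word[:-1])                        # boys -> boy
    return candidates


def lookup_noun(word, nouns=NOUNS):
    """Return (record, is_plural) for an English noun, or None if unknown."""
    word = word.lower()
    if word in nouns:
        return nouns[word], False
    for candidate in singular_candidates(word):
        if candidate in nouns:
            return nouns[candidate], True
    return None


def hindi_form(record, plural):
    """Build the Hindi noun form; the vowel-dropping rule is an assumption."""
    if not plural:
        return record.hindi
    stem = record.hindi
    if record.plural_suffix[:1] in VOWELS and stem.endswith("aa"):
        stem = stem[:-2]  # assumed: drop final "aa" before a vowel suffix
    return stem + record.plural_suffix


record, is_plural = lookup_noun("horses")
print(hindi_form(record, is_plural))
```

The gender and person fields are carried in the record because, as the text notes, they are needed later to choose the verb morphology.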
Table 3 provides some rules for the morphological changes, with examples of some morphological transformations in Hindi:

ENGLISH              GENDER    TENSE                  HINDI
I am writing.        M         Present continuous     Main likh rahaa hoon
I am writing.        F         Present continuous     Main likh rahee hoon
You write.           M/F       Present indefinite     Tum likho
I write.             F         Present indefinite     Main likh ti hoon
She was writing.     F         Past continuous        Wah likh rahee thee
We will write.       ----      Simple future          Hum likhenge

Table 3: Morphological Transformation Rules for Verbs.

3.3 Determining the Preposition

The mapping between the prepositions in English and the corresponding Hindi words is many-to-many. Consider the following pair of examples with the preposition "with":

The king had breakfast with his friend : raajaa ne apne mitra ke saath naashta kiya
The king had breakfast with a spoon : raajaa ne chammach se naashta kiya

The translation of a preposition depends on the context, and determining the correct meaning in a given context needs semantic information. We need a strategy to automate this mapping, and to try to ensure its correctness and consistency. The meaning of the preposition depends on the object(s), adjective(s) or adverb(s) that follow it. For each preposition, its possible meanings are stored in the database. The records in the database of prepositions have the following structure:

Preposition in English (P)    M[0]    M[1]    M[2]    M[3]

where each M[i] (i = 0 .. 3) is one of the context-dependent translations of P in Hindi; i.e. the maximum number of possible meanings stored for a preposition is four. The contents of each M[i] are:

• Tag → T[i]. This tag is a non-negative integer less than or equal to 3.
• A Hindi translation of P → H[i].

Example:
With    0-se    1-ke saath    2-x    3-x

Since not all prepositions have four different translations, a dummy character 'x' is used to fill the unused entries.
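The preposition record above can be sketched in Python as a list of (tag, Hindi translation) pairs, with 'x' as the dummy filler. This is only an illustration: the names PREPOSITION_DB, WORD_TAGS and translate_preposition are hypothetical, and the tags assigned here to "spoon" and "friend" are assumptions chosen to agree with the two example sentences, not values taken from the authors' database.

```python
DUMMY = "x"

# One record per English preposition: four slots M[0..3], each a (T[i], H[i]) pair.
PREPOSITION_DB = {
    "with": [(0, "se"), (1, "ke saath"), (2, DUMMY), (3, DUMMY)],
}

# Assumed tags for the words that follow the preposition (hypothetical values).
WORD_TAGS = {"spoon": 0, "friend": 1}


def translate_preposition(preposition, following_word):
    """Pick the Hindi translation whose tag T[i] matches the tag of the following word."""
    record = PREPOSITION_DB.get(preposition.lower())
    tag = WORD_TAGS.get(following_word.lower())
    if record is None or tag is None:
        return None
    for slot_tag, hindi in record:
        if slot_tag == tag and hindi != DUMMY:
            return hindi
    return None


print(translate_preposition("with", "spoon"))
print(translate_preposition("with", "friend"))
```

Keeping the dummy slots in the record mirrors the fixed four-slot layout described in the text; the lookup simply skips them.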
Since at this stage of the translation procedure we have not yet determined the object(s) / adjective(s) / adverb(s), we simply retrieve the preposition in the sentence in question and postpone the determination of the appropriate M[i] until those determinations are over.

3.4 Determining the Object(s) and Adjective(s)

The sentences that we consider may have up to two objects, each of which may have its corresponding adjective. The search for the objects (along with the corresponding adjectives and articles) may be made in two possible locations.

Case 1: The English sentence has a preposition
• Adjective1 and Object1 are searched for between the verb and the preposition.
• Adjective2 and Object2 are searched for after the preposition.

We call the object that comes before the preposition Object1, and the one that comes after the preposition Object2; similarly for Adjective1 and Adjective2. This, however, does not exclude the possibility that there is only Object2 and no Object1. The following examples elucidate the notations.

Examples:
1) Mohan is eating hot rice with the clean spoon. Here "rice" is Object1 and "spoon" is Object2.
2) She is dancing on the floor. Here the only object, "the floor", is called Object2.

Case 2: The sentence does not have a preposition
• The sentence has at most one object and a corresponding adjective.
• The search for these is made immediately after the verb.

For example, in the sentence "Ram is drinking clean water", there is no preposition, so the sentence may have at most one object. Here "water" is Object1 and the adjective "clean" is Adjective1.

All adjectives also have a tag attached to them and are stored in the following way:

Adjective in English (A)    Translation of A in Hindi (TA)

Example record:
Hot-0                       garam

Once the objects and corresponding adjectives are determined, the appropriate M[i] (see Section 3.3) is decided. The algorithm is given below.
Algorithm for deciding the appropriate M[i]:

If (there is an object OBJ after the preposition)
    Then that M[i] is chosen for which tag(OBJ) equals T[i]
Else If (there is an adjective ADJ after the preposition)
    Then that M[i] is chosen for which tag(ADJ) equals T[i]

At this stage, all the POSs have been identified in both SI and SR, and their Hindi translations have been obtained. Section 4 describes an algorithm to make the required structural changes in TR to get TT.

4. ADAPTATION OF THE RETRIEVED TRANSLATION

The adaptation scheme makes use of three operations: Word-Replacement, Word-Addition and Word-Deletion.

4.1 Word Deletion

When a word in the retrieved translation (TR) is not required for generating the translation, the adaptation process simply deletes it from TR. The task involves two steps: searching for the word in the retrieved translation and deleting it. The key problem here is how to identify the word. This is done in the following way. The algorithm compares all the POS of the input and the retrieved sentence. If any POS is present in the retrieved English example but not in the input sentence, then the corresponding word needs to be deleted from the retrieved translation. For example, consider the following:

SI: Mira will sing in the function.
SR: Mira will sing a melodious song in the function.
TR: Mira samaaroh mein madhur geet gaaegee

SR has two extra words, the adjective "melodious" and the object "song". Hence, for generating the required translation of SI, the Hindi equivalents of these two words are to be removed from TR. Our algorithm first deletes "madhur" and then "geet" to arrive at TT, the translation of SI, which is "mira samaaroh mein gaaegee."

4.2 Word Replacement

Word replacement takes place if some word w of SI has the same morphosyntactic tag as some word w' of SR, but w is not the same as w'.
Here the task involves the following: first, to find the Hindi equivalent (h) of w; and then to replace the Hindi equivalent (h') of w' in TR with h. For illustration, consider the scenario:

SI: Sita is singing
SR: Ram is eating fruits.
TR: Ram phal khaa rahaa hai

Here, the algorithm sequentially checks the words in the following order.

1) First the verb is dealt with. In the above scenario, the verb of the input sentence is "sing", while the retrieved sentence has the verb "eat". Since they are different, the translation of the verb "sing" (gaa) is retrieved from a verb database. Application of WR (gaa, khaa) produces "Ram phal gaa rahaa hai".
2) Similarly, the subject is changed by calling WR (sita, ram).
3) The subject provides information about its number, gender and person. The auxiliary verb indicates the tense. Since the genders of the two subjects are different, a further call to WR (rahee, rahaa) is needed.

The final translation is then achieved after calling WD (phal). TT, the final translation of SI thus obtained, is "Sita gaa rahee hai".

4.3 Algorithm for Word Addition

Word Addition (WA) is a relatively more complicated operation, as the task here is threefold:

i) To determine the word to be added.
ii) To find the right position for the new word. This can be done by resorting to the structural rules of the target language.
iii) To make the actual addition of the Hindi equivalent of that word in TR.

The algorithm for finding the correct position for addition is as follows. Here Tobject1, Tobject2, Tadjective1, Tadjective2 and Tpreposition are the respective translations of Object1, Object2, Adjective1, Adjective2 (see Section 3.4) and the preposition.

CASE 1: If (Object1 ∈ SI but ∉ SR) then
            WA (Tobject1) comes before the main verb of TR.

CASE 2: If (Adjective1 ∈ SI but ∉ SR) then
            If (Object1 ∈ SI) then
                WA (Tadjective1) comes before the Object1 of TR.
            Else
                WA (Tadjective1) comes before the main verb of TR.
CASE 3: If (Object2 ∈ SI but ∉ SR) then
            If (preposition ∈ SR) then
                WA (Tobject2) comes before the preposition of TR.
            Else If (Adjective1 ∈ SI) then
                WA (Tobject2) comes before the Adjective1 of TR.
            Else If (Object1 ∈ SI) then
                WA (Tobject2) comes before the Object1 of TR.
            Else
                WA (Tobject2) comes before the main verb of TR.

CASE 4: If (Adjective2 ∈ SI but ∉ SR) then
            If (Object2 ∈ SI) then
                WA (Tadjective2) comes before the Object2 of TR.
            Else
                WA (Tadjective2) comes before the preposition of TR.

CASE 5: If (preposition ∈ SI but ∉ SR) then
            If (Adjective1 ∈ SI) then
                WA (Tpreposition) comes before the Adjective1 of TR.
            Else If (Object1 ∈ SI) then
                WA (Tpreposition) comes before the Object1 of TR.
            Else
                WA (Tpreposition) comes before the main verb of TR.

We now illustrate some of these cases with the following example:

SI: She is eating rice on a clean plate
SR: He is sitting on the chair
TR: Wah kursi par bait rahaa hai

The sequence of changes occurs in the following order.

Operation             Resulting sentence
WR (khaa, bait)       Wah kursi par khaa rahaa hai
WR (rahee, rahaa)     Wah kursi par khaa rahee hai
WA (chaawal)          Wah kursi par chaawal khaa rahee hai
WR (thaali, kursi)    Wah thaali par chaawal khaa rahee hai
WR (mein, par)        Wah thaali mein chaawal khaa rahee hai
WA (saaf)             Wah saaf thaali mein chaawal khaa rahee hai

Table 4: Structural Changes in TR

TT: Wah saaf thaali mein chaawal khaa rahee hai.

In the above example, SI has Object1 ("rice") and Adjective2 ("clean"), but SR does not have them. Hence, in order to accomplish the translation of SI, the algorithm adds Object1 (chaawal) to TR before the main verb "khaa" (the morphological change to "rahee" having already been made in the previous step). Similarly, Adjective2 "saaf" is added before Object2 of TR, i.e. thaali. All the systematic changes are given in Table 4.

4.4 Negative Sentences

Presently, our algorithm works on some selected structures of negative sentences too. It has been observed that the negation "nahiin" in a negative sentence in Hindi is placed just before the verb.
For example, consider the following scenario where the input sentence is negative, but the retrieved one is affirmative.

SI: "Ram is not learning Hindi" (negative sentence)
SR: "Ram is reading a book." (affirmative sentence)
TR: "ram kitaab pad rahaa hai"

By following the algorithm discussed above, an intermediate translation, "ram Hindi seekh rahaa hai", is generated. The correct translation of the input sentence is then obtained by inserting "nahiin" before the verb "seekh". Similarly, in the opposite case, the translation of an affirmative sentence may be created from the example of the translation of a negative sentence by deleting "nahiin" from the retrieved translation.

5. CONCLUSION

Adaptation of retrieved sentences is a key aspect of EBMT. However, given the variation of expressions in the source and the target languages, each having its own sets of syntactic and morphological rules, modification of a retrieved translation example into the desired translation is seldom straightforward. As a consequence, no algorithm exists that is equally adept across languages for achieving efficient adaptation. In this paper we report the preliminary version of an algorithm that we propose for carrying out adaptation for English to Hindi translation.

Although at the deepest level the algorithm resorts to the syntax and morphology of English and Hindi, we feel that the overall scheme should work well for other pairs of languages too. This is because many North Indian languages (e.g. Bengali, Marathi) have the same origin as Hindi (i.e. Sanskrit), and are therefore structurally close to Hindi.

The present algorithm is at its initial stage. The current version works only for limited tenses and some specific types of simple sentences. We are working towards extending the technique to the adaptation of structurally more complicated sentences, including further interrogative and negative sentences.

6. ACKNOWLEDGEMENT

We acknowledge the contributions made by Ms. S.
Anupama towards implementing the software.

REFERENCES

[1] D. Gupta and N. Chatterjee, Study of Divergence for Example Based English-Hindi Machine Translation. STRANS-2001, IIT Kanpur, 2001, pp. 43-51.
[2] H.A. Guvenir and I. Cicekli, Learning Translation Templates from Examples. Elsevier Science Ltd., 1998.
[3] R. Jain, R.M.K. Sinha and A. Jain, ANUBHARTI: Using Hybrid Example-Based Approach for Machine Translation. STRANS-2001, IIT Kanpur, 2001, pp. 20-32.
[4] S.C. Nirenburg (Ed.), The PANGLOSS Mark III Machine Translation System (Tech. Rep. No. CMU-CMT-95-145). Pittsburgh: Carnegie Mellon University, 1995.
[5] P.C. Wren, H. Martin and N.D.V.P. Rao, High School English Grammar and Composition, S. Chand and Co., New Delhi, 1989.