A Systematic Adaptation Scheme for English-Hindi
Example-Based Machine Translation
Deepa Gupta Niladri Chatterjee
Department of Mathematics
I.I.T Delhi, Hauz Khas
New Delhi-110016
Email: [email protected]
ABSTRACT
The success of Example-Based Machine Translation (EBMT) often depends upon how efficient the adaptation
scheme is. Adaptation primarily aims at modifying retrieved examples to meet the required demands of a given
translation task. The present work looks at adaptation for EBMT from English to Hindi. This paper describes a
rule-driven adaptation scheme for modifying a retrieved translation example to generate the translation of a
given input. Only a selected set of sentence structures have been considered so far. The structural and
morphological rules for the source and the target languages have been used to develop the scheme.
KEYWORDS: EBMT, Adaptation, Structural Modification, Morphological Modification.
1. INTRODUCTION
One major aspect of an Example-Based Machine Translation (EBMT) [2] [4] system is its adaptation scheme.
Even a very efficient similarity measurement scheme and a very large example base cannot, in general,
guarantee an exact match for a given input sentence. As a consequence, the need for an efficient and systematic
adaptation scheme arises for modifying a retrieved example and thereby generating the required translation. In
this work we present the preliminary version of a systematic scheme for adapting translations from English to
Hindi. The work is still in its initial stage, and works only on sentences of selected structures. The examples
used for this work have been obtained primarily from children's storybooks and translation books, where more
complex sentence structures are rarely found. However, we aim to incorporate more complex structures in the
future.
We consider an adaptation scheme to be a rule-driven approach that considers the discrepancy between the input
and the retrieved sentence in the source language. The rules have been formed by taking into account the
grammars of both the source and the target languages. Given an input sentence, the system retrieves an example
from its example base that is similar to the input sentence in structure. Component-wise differences between the
input and the retrieved sentence are then measured and the retrieved translation in the target language is then
modified by taking care of the discrepancies.
The adaptation is carried out by resorting to one of the following operations, as suggested in [1]:
Simple word replacement: One may get the translation of the input sentence by replacing some words in the
retrieved translation example. Suppose the input sentence is: "The Squirrel was eating groundnuts." The most
similar sentence retrieved by the system (along with its Hindi translation) is: "The Elephant was drinking
water." (haatee paanii pii rahaa thaa)
In order to generate the translation, one just needs to replace the Hindi equivalent of "elephant" by that of
"squirrel", "drink" by "eat" and "water" by "groundnuts". Therefore, word replacement alone gives the exact
translation of the input sentence.
Word Deletion: In some cases one may have to delete some words from the translation example to generate the
new translation. For example, suppose the input sentence is: "Animals were dying of thirst", and the retrieved
translation example is: "Birds and Animals were dying of thirst" (pakshii aur pashu pyaasa se mar rahee thii).
The translation can then be obtained by deleting the "birds and" (pakshii aur) part from the retrieved translation.
Word Addition: Sometimes to generate a new translation one may have to add some additional words in the
retrieved translation example. For illustration, one may consider the example given just above with the roles of
input and retrieved sentences being reversed.
Change in Tense: When the input and the retrieved sentences differ in tense, one has to apply syntax rules
for appropriate modification of the retrieved translation example. For illustration:
English Verb Structure          Hindi Structure
"was" + root verb + "ing"       root verb + "rahaa thaa" or "rahi thi"
root verb + "s"/"es"            root verb + "taa hai" or "tii hai"
If the tenses of the input and the retrieved example differ, the appropriate morphological changes are to be
made accordingly.
Although the demand for translation from foreign languages into Indian languages is increasing, few works
have been reported so far in this context. ANUBHARTI [3] is a good example of a system that uses
adaptation as its key tool for generating new translations from old ones. There, however, the examples are stored
in an abstracted form for determining the structural similarity between the input sentence and the example
sentences. In this work, in contrast, we store the examples as they are. The advantage is that if certain
components of the retrieved sentence are semantically and/or syntactically similar to the input sentence, then the
task of adaptation is reduced. Furthermore, this helps in estimating the overall adaptation cost, which in turn
may be used as a tool for similarity measurement between an input and the stored examples.
The adaptation scheme proposed here is designed on the basis of the above four operations. Section 2 describes
the overall adaptation scheme. Section 3 explains details of how the components of the sentence are determined.
Section 4 describes how the structural changes are incorporated into the retrieved translation.
2. BASIC ADAPTATION SCHEME
The following notations will be followed in the paper for explaining the adaptation scheme:
SI: input sentence in the source language (here it is English).
SR: source language example sentence retrieved as the most similar to SI
TR: the target language (Hindi, here) translation of SR.
TT: the translation of SI in target language to be generated by adaptation of TR
The sentences that have been considered for this work have the following structural limitations.
a) They may have only the following components: Subject (noun / pronoun), Auxiliary Verb, Main Verb,
Adverb, and a maximum of two Objects along with related Adjectives and a connecting Preposition.
b) The tense of the sentences may be one of the following:
1. Simple Present
2. Present Continuous
3. Present Indefinite
4. Simple Past
5. Past Continuous
6. Future Indefinite
c) The sentences may be Affirmative, Interrogative (the ones where the auxiliary verbs are positioned at the
beginning of the sentence) or Negative.
The following algorithm explains the overall scheme for adaptation. It checks the constituent words of SI and
SR. The adaptation is then done in the following way:
a) If the constituent words of SI and SR are the same and have the same morphosyntactic tags, no adaptation is
needed and TR is the desired translation TT.
b) If e ∈ SI and e' ∈ SR have the same morphosyntactic tags, but e ≠ e', then call Word-Replacement (w, w'),
where w and w' are the Hindi translations of e and e' respectively.
c) If e ∈ SI does not have any corresponding e' in SR, then call Word-Addition (w), where w is the Hindi
translation of e, and place w in the right position of TR.
d) If e' ∈ SR does not have any corresponding e ∈ SI, then call Word-Deletion (w'), where w' is the Hindi
translation of e'.
When the above scheme is applied to all the words of SI and SR, TR gets modified to TT. The following
example illustrates the scheme. Here WR, WD and WA represent Word-Replacement, Word-Deletion and
Word-Addition, respectively.
Example:
SI : Is he selling fresh vegetables?
SR : He is not writing the letter with a pilot pen.
TR : Wah pilot kalam se chitti nahi likh rahaa hai
The sequence of changes occurs in the following order:
Operation                 TR after the operation
WR (bech, likh)           Wah pilot kalam se chitti nahi bech rahaa hai
WR (tarkaari, chitti)     Wah pilot kalam se tarkaari nahi bech rahaa hai
WA (taazi)                Wah pilot kalam se taazi tarkaari nahi bech rahaa hai
WD (kalam)                Wah pilot se taazi tarkaari nahi bech rahaa hai
WD (se)                   Wah pilot taazi tarkaari nahi bech rahaa hai
WD (pilot)                Wah taazi tarkaari nahi bech rahaa hai
WD (nahi)                 Wah taazi tarkaari bech rahaa hai
WA (Kyaa)                 Kyaa wah taazi tarkaari bech rahaa hai
TT : Kyaa Wah taazi tarkaari bech rahaa hai
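The sequence of operations in this example can be replayed mechanically. The following Python sketch is only an illustration, not the implementation used in this work: the tuple encoding of the operations and the function name `apply_ops` are assumptions, and the position of each added word is supplied by hand rather than computed.

```python
def apply_ops(tokens, ops):
    """Apply WR / WD / WA operations, in order, to a list of Hindi tokens."""
    out = list(tokens)
    for op in ops:
        kind = op[0]
        if kind == "WR":                  # ("WR", new_word, old_word)
            _, new, old = op
            out[out.index(old)] = new
        elif kind == "WD":                # ("WD", word)
            out.remove(op[1])
        elif kind == "WA":                # ("WA", word, before); before=None means sentence start
            _, word, before = op
            pos = 0 if before is None else out.index(before)
            out.insert(pos, word)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return out


# TR and the operation sequence of the example above.
tr = "Wah pilot kalam se chitti nahi likh rahaa hai".split()
ops = [
    ("WR", "bech", "likh"),
    ("WR", "tarkaari", "chitti"),
    ("WA", "taazi", "tarkaari"),   # adjective placed before its object (hand-supplied position)
    ("WD", "kalam"),
    ("WD", "se"),
    ("WD", "pilot"),
    ("WD", "nahi"),
    ("WA", "Kyaa", None),          # question marker goes to the front
]
tt = " ".join(apply_ops(tr, ops))
print(tt)
```

Keeping the operations as plain data also makes it easy to count them, which is one possible way to approximate the adaptation cost mentioned in Section 1.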
The following section explains the key steps of the algorithm.
3. DETAILS OF THE ALGORITHMS
The POSs in both SI and SR are identified and the differences between them are used to make structural changes
in TR to get TT. The following observations suggest the preferable order in which the POS in any sentence
under consideration should be determined.
• Nouns are distinguished as subjects and objects depending on their positions relative to the verb. Consider
the example "Ram broke the bow." In this sentence "Ram" is the subject and "bow" is the object.
• The auxiliary verb and the main verb together determine the tense of the sentence.
We determine the POS in both SI and SR in the following order:
1. Auxiliary Verb
2. Main Verb
3. Subject
4. Preposition
5. Object and Adjective before Preposition – Object1 and Adjective1 respectively.
6. Object and Adjective after Preposition – Object2 and Adjective2 respectively.
3.1 Determining the Auxiliary and Main Verbs
The following algorithm explains the method for determining the verbs present in a sentence.
Step 1: Identify the auxiliary verb (if present). This determines the tense.
Step 2: If (tense is future indefinite) Then
Verb must be in the root form;
Else If (tense is present indefinite or present / past continuous) Then
Identification of the root verb may need suffix stripping
In Hindi, the morphological suffixes of a verb are either attached directly to the root verb or appended as a
group of words after the root verb without deforming it. Hence obtaining the root verb is important for
adaptation. For illustration, with respect to the input sentence "Ram is eating rice" consider the following cases:
Case 1. The retrieved sentence and its translation are: "Ram eats rice" and "ram chaawal khaata hai".
Case 2. The retrieved pair is: "Ram is drinking milk" and "ram dudh pii rahaa hai".
In case 1, the root verb "eat" is the same as that of the input; however, there is a morphological difference. In
case 2, the root verb is different, so for adaptation the verb "pii" is to be replaced with "khaa". To accomplish
this, the root form of the verb must be identified, which may in turn require suffix stripping. Table 1 provides
some of the rules for English verb suffixes. These rules have been obtained from a standard English grammar book
[5].
Tense: Present/Past Continuous
Suffix rules:
a) If the root verb ends with the vowel 'e' then remove 'e' and append "ing". E.g. come + "ing" = coming.
b) If the verb ends with 'ie' then replace 'ie' with 'y' and append "ing". E.g. lie + "ing" = lying.
c) If the last character of the verb is not 'w', 'r' or 'y', and it is preceded by a vowel, then the last
character is doubled and "ing" is appended. E.g. run + "ing" = running.
Tense: Present Indefinite
Suffix rules:
a) If the root verb ends with 'o', 'ss', 'sh', 'ch' or 'x' then the suffix is "es".
b) If the root verb ends with 'y' preceded by a consonant then remove 'y' and append "ies"; otherwise the
suffix is "s".
E.g. read + s; push + es; fry - y + ies; say + s.
Table 1: Some Common English Suffixes for Verbs.
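Table 1 states how suffixes are attached; for adaptation the rules have to be undone. Since stripping is ambiguous (for instance, "coming" could come from "com" or "come"), one simple option is to generate candidate roots and accept the first one found in a verb lexicon. The Python sketch below illustrates this under stated assumptions: the lexicon `VERB_LEXICON` and the function names are hypothetical, and the candidate list covers only the rules of Table 1.

```python
# Hypothetical lexicon of known root verbs (an assumption for illustration).
VERB_LEXICON = {"come", "lie", "run", "read", "push", "fry", "say", "eat", "sing"}


def candidate_roots(form):
    """Yield possible root verbs for an inflected English verb form."""
    if form.endswith("ing") and len(form) > 4:
        stem = form[:-3]
        yield stem                        # read + ing -> reading
        yield stem + "e"                  # come -> coming
        if stem.endswith("y"):
            yield stem[:-1] + "ie"        # lie -> lying
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            yield stem[:-1]               # run -> running (undo the doubling)
    if form.endswith("ies"):
        yield form[:-3] + "y"             # fry -> fries
    if form.endswith("es"):
        yield form[:-2]                   # push -> pushes
    if form.endswith("s"):
        yield form[:-1]                   # read -> reads
    yield form                            # the form may already be a root


def root_verb(form, lexicon=VERB_LEXICON):
    """Return the first candidate root found in the lexicon, or None."""
    for candidate in candidate_roots(form.lower()):
        if candidate in lexicon:
            return candidate
    return None


for word in ["coming", "lying", "running", "eating", "fries", "pushes", "says"]:
    print(word, "->", root_verb(word))
```

The lexicon check is what keeps a word such as "sing" from being stripped to "s"; without a lexicon the rules alone cannot decide.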
3.2 Determining the Subject
In the present work we are dealing with two types of subjects: noun and pronoun. The subject may be singular or
plural. Identification of subject is easy when it is singular. However, getting the root (singular) form from the
plural is not always straightforward as different suffixes are in use in English. Some typical suffixing rules are
given below (although there are some exceptions as well):
• If the noun ends with 'x', 'z', 'o', 's', 'sh' or 'ch' then the relevant suffix is "es". E.g. Fox ~ Foxes.
• If the noun ends with 'y' then
a) if 'y' follows a consonant, remove 'y' and add "ies". E.g. Baby ~ Babies.
b) if 'y' follows a vowel, the suffix is "s". E.g. Boy ~ Boys.
• If the noun ends with 'fe' or 'f' then remove 'fe' or 'f' and add "ves". E.g. Knife ~ Knives.
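The suffixing rules listed above can be written down directly. The following Python sketch is an illustration only; the function name `pluralize` is an assumption, and, as noted in the text, the rules have exceptions (e.g. irregular plurals) that the sketch does not handle.

```python
VOWELS = "aeiou"


def pluralize(noun):
    """Form the English plural using only the typical rules listed above."""
    word = noun.lower()
    if word.endswith("fe"):
        return word[:-2] + "ves"          # knife -> knives
    if word.endswith("f"):
        return word[:-1] + "ves"          # leaf -> leaves
    if word.endswith("y"):
        if len(word) > 1 and word[-2] not in VOWELS:
            return word[:-1] + "ies"      # baby -> babies
        return word + "s"                 # boy -> boys
    if word.endswith(("x", "z", "o", "s", "sh", "ch")):
        return word + "es"                # fox -> foxes
    return word + "s"                     # default case: bird -> birds


for noun in ["Fox", "Baby", "Boy", "Knife", "bird"]:
    print(noun, "~", pluralize(noun))
```

Recovering the singular from a plural can reuse the same rules in reverse, checking each candidate against the noun database.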
In Hindi too the plural of a noun is obtained by appending (often after deleting the last vowels) the required
suffix to the singular form of the noun. Some examples are:
chiriyaa ~ chiriyaa + n;
ghodaa ~ ghode ;
billi ~ billiyaan
Adaptation often requires identification of the root word by suffix stripping in Hindi and/or English.
To achieve this we store the relevant knowledge in our database of words.
Each record for a noun in the database has the following structure:
English Noun-tag | Translation of N (T) | Gender | Person | Hindi Suffix for the plural of T
Examples:
Bird-0           | Chiriyaa             | F      | 3      | n
Horse-0          | Ghodaa               | M      | 3      | e
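A noun record of this shape, together with Hindi plural formation, might be sketched in Python as follows. This is an illustration under assumptions: the class and field names are hypothetical, the rule for when trailing vowels are deleted (only before a vowel-initial suffix) is inferred from the three examples given in the text, and the record for "cat" (with suffix "yaan", inferred from "billi ~ billiyaan") is not one of the stored examples.

```python
from dataclasses import dataclass

VOWELS = "aeiou"


@dataclass(frozen=True)
class NounRecord:
    english: str        # English noun
    tag: int            # semantic tag, a non-negative integer up to 3
    hindi: str          # Hindi translation T (singular)
    gender: str         # "M" or "F"
    person: int         # grammatical person
    plural_suffix: str  # Hindi suffix for the plural of T


def hindi_plural(rec):
    """Build the Hindi plural from the stored singular form and suffix.

    ASSUMPTION: a vowel-initial suffix replaces the trailing vowels of the
    singular (ghodaa -> ghod + e); any other suffix is simply appended.
    """
    stem = rec.hindi
    if rec.plural_suffix and rec.plural_suffix[0] in VOWELS:
        stem = stem.rstrip(VOWELS)
    return stem + rec.plural_suffix


bird = NounRecord("bird", 0, "chiriyaa", "F", 3, "n")
horse = NounRecord("horse", 0, "ghodaa", "M", 3, "e")
cat = NounRecord("cat", 0, "billi", "F", 3, "yaan")   # hypothetical record

for rec in (bird, horse, cat):
    print(rec.hindi, "~", hindi_plural(rec))
```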
The tags used are non-negative integers. The maximum value for a tag can be three. The tags have been created
after scrutinizing about 600 sentences. The tags are used in determining appropriate prepositions. Details of the
tags are explained in Sections 3.3 and 3.4 below.
Once the properties of the subject (i.e. its gender, number and person) are found and the tense of the sentence is
established, the Hindi morphological changes required for the verb for generating the translation may be
determined. Table 3 provides some rules for the morphological changes:
Examples of some morphological transformations in Hindi:
ENGLISH          | GENDER | TENSE              | HINDI
I am writing.    | M      | Present continuous | Main likh rahaa hoon
I am writing.    | F      | Present continuous | Main likh rahee hoon
You write.       | M/F    | Present indefinite | Tum likho
I write.         | F      | Present indefinite | Main likh ti hoon
She was writing. | F      | Past continuous    | Wah likh rahee thee
We will write.   | ----   | Simple future      | Hum likhenge
Table 3: Morphological Transformation Rules for Verbs.
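For the continuous tenses, the rows of the table above follow a regular pattern: root verb, then "rahaa" or "rahee" according to gender, then an auxiliary. The Python sketch below illustrates only this pattern and is not the full rule set: the function name is an assumption, and it covers singular first-person and third-person subjects only, using forms that appear in this paper.

```python
def continuous_form(root, tense, gender, person):
    """Hindi verb group for a singular subject in a continuous tense.

    Covers only the forms attested in the examples of this paper:
    present continuous (hoon / hai) and past continuous (thaa / thee).
    """
    if gender not in ("M", "F"):
        raise ValueError("gender must be 'M' or 'F'")
    if person not in (1, 3):
        raise ValueError("only first and third person are covered in this sketch")
    aspect = "rahaa" if gender == "M" else "rahee"
    if tense == "present":
        aux = "hoon" if person == 1 else "hai"
    elif tense == "past":
        aux = "thaa" if gender == "M" else "thee"
    else:
        raise ValueError("only 'present' and 'past' continuous are covered")
    return f"{root} {aspect} {aux}"


print("Main", continuous_form("likh", "present", "M", 1))
print("Main", continuous_form("likh", "present", "F", 1))
print("Wah", continuous_form("likh", "past", "F", 3))
```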
3.3 Determining the Preposition
The mapping between the prepositions in English and corresponding Hindi words is many-to-many. Consider
the example pair with the preposition “with” :
The king had breakfast with his friend : raajaa ne apne mitra ke saath naashta kiya.
The king had breakfast with a spoon : raajaa ne chammach se naashta kiya.
The translation of a preposition is dependent on the context and determining the correct meaning in a given
context needs semantic information. We need a strategy to automate this mapping, and to try and ensure its
correctness and consistency. The meaning of the preposition is dependent on the object(s) or adjective(s) or
adverb(s) that follow it. For a preposition its possible meanings are stored in the database. The records in the
database of prepositions have the following structure:
Preposition in English (P) | M[0] | M[1] | M[2] | M[3]
where each M[i] (i = 0 .. 3) is one of the context-dependent translations of P in Hindi, i.e. the maximum number
of possible meanings stored for a preposition is four. The contents of each M[i] are:
• Tag → T[i]. This tag is a non-negative integer less than or equal to 3.
• A Hindi translation of P → H[i]
Example:
With | 0-se | 1-ke saath | 2-x | 3-x
Since not all prepositions have four different translations, a dummy character 'x' is used to fill the unused
slots.
Since at this stage of the translation procedure, we have not determined the object(s) / adjective(s) / adverb(s),
we simply retrieve the preposition in the sentence in question and postpone the determination of the appropriate
M[ i ] till those determinations are over.
3.4 Determining the Object (s) and Adjective (s)
The sentences that we consider may have up to two objects, each of which may have a corresponding adjective.
The search for the objects (along with their corresponding adjectives and articles) may be made in two
possible locations.
Case1: The English sentence has a preposition
• Adjective1 and Object1 are searched for between the verb and the preposition
• Adjective2 and Object2 are searched for after the preposition.
We call the object that comes before the preposition Object1, and the one that comes after the preposition
Object2; similarly for Adjective1 and Adjective2. This does not, however, exclude the possibility that there
may be only Object2 and no Object1. The following examples elucidate the notation.
Examples:
1) Mohan is eating hot rice with the clean spoon. Here “rice” is Object1 and “spoon” is Object2.
2) She is dancing on the floor. Here the only object "the floor" is called Object2.
Case2. The sentence does not have a preposition
• The sentence has at the most one object and corresponding adjective.
• The search for these is made immediately after the verb.
For example, in the sentence “Ram is drinking clean water", there is no preposition. So the sentence may have at
most one object. Here "water" is Object1 and the adjective "clean" is Adjective1.
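The two cases above amount to splitting the words after the main verb at the preposition. A minimal Python sketch follows, assuming the sentence has already been tagged with coarse POS labels; the tag names ("VERB", "PREP", "ADJ", "NOUN", ...) and the function name are assumptions made for illustration.

```python
def find_objects(tagged):
    """Locate Object1/Adjective1 and Object2/Adjective2 in a tagged sentence.

    `tagged` is a list of (word, pos) pairs; exactly one main verb tagged
    "VERB" is assumed to be present.
    """
    roles = {"object1": None, "adjective1": None, "object2": None, "adjective2": None}
    verb_idx = max(i for i, (_, pos) in enumerate(tagged) if pos == "VERB")
    prep_idx = next(
        (i for i, (_, pos) in enumerate(tagged) if pos == "PREP" and i > verb_idx),
        None,
    )
    # Case 1: words between verb and preposition, then words after the preposition.
    # Case 2 (no preposition): everything after the verb belongs to the first span.
    first = tagged[verb_idx + 1:prep_idx]
    second = tagged[prep_idx + 1:] if prep_idx is not None else []
    for span, obj_key, adj_key in (
        (first, "object1", "adjective1"),
        (second, "object2", "adjective2"),
    ):
        for word, pos in span:
            if pos == "ADJ" and roles[adj_key] is None:
                roles[adj_key] = word
            elif pos == "NOUN" and roles[obj_key] is None:
                roles[obj_key] = word
    return roles


sentence = [
    ("Mohan", "NOUN"), ("is", "AUX"), ("eating", "VERB"), ("hot", "ADJ"),
    ("rice", "NOUN"), ("with", "PREP"), ("the", "DET"), ("clean", "ADJ"),
    ("spoon", "NOUN"),
]
print(find_objects(sentence))
```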
All adjectives also have a tag attached to them and are stored in the following way:
Adjective in English (A) | Translation of A in Hindi (TA)
Example Record:
Hot-0                    | garam
Once the objects and corresponding adjectives are determined, the appropriate M[ i ] (See Section 3.3) is
decided. The algorithm is given below.
Algorithm for Deciding the appropriate M[ i ] :
If (there is an object OBJ after the preposition) Then
that M[ i ] is chosen for which tag (OBJ) equals T[ i ]
Else If ( there is an adjective ADJ after the preposition) Then
that M[ i ] is chosen for which tag (ADJ) equals T[ i ]
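The selection step can be illustrated with a small lookup. In the Python sketch below, the record for "with" follows the example of Section 3.3, while the tags assigned to "spoon" and "friend" are assumptions inferred from the two "with" sentences; all names are hypothetical.

```python
# Preposition records: a list of (tag T[i], Hindi translation H[i]) pairs; "x" marks an unused slot.
PREPOSITIONS = {"with": [(0, "se"), (1, "ke saath"), (2, "x"), (3, "x")]}

# Hypothetical tags (assumed for illustration; the real database is not shown in the paper).
NOUN_TAGS = {"spoon": 0, "friend": 1}
ADJ_TAGS = {"hot": 0}


def choose_translation(prep, obj=None, adj=None):
    """Pick the M[i] whose tag equals the tag of the following object, else of the adjective."""
    if obj is not None:
        tag = NOUN_TAGS[obj]
    elif adj is not None:
        tag = ADJ_TAGS[adj]
    else:
        return None
    for t, hindi in PREPOSITIONS[prep]:
        if t == tag and hindi != "x":
            return hindi
    return None


print(choose_translation("with", obj="spoon"))
print(choose_translation("with", obj="friend"))
```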
At this stage, all the POSs have been identified in both SI and SR and their Hindi translations have been
obtained. Section 4 describes an algorithm to make the required structural changes in TR to get TT.
4. ADAPTATION OF THE RETRIEVED TRANSLATION
The adaptation scheme makes use of three operations: Word-Replacement, Word-Addition and Word-Deletion.
4.1 Word Deletion.
When a word in the retrieved translation (TR) is not required for generating the translation, the adaptation
process simply deletes it from TR. The task involves two steps: searching for the word in the retrieved
translation and deleting it. The key problem here is how to identify the word. This is done in the following
way. The algorithm compares all the POS of the input and the retrieved sentence. If a POS is present in the
retrieved English example but not in the input sentence, then the Hindi equivalent of that word needs to be
deleted from the retrieved translation. For example, consider the following:
SI: Mira will sing in the function.
SR: Mira will sing a melodious song in the function.
TR: Mira samaaroh mein madhur geet gaaegee
The SR has two extra words, the adjective "melodious" and the object "song". Hence for generating the required
translation of SI, the Hindi equivalents of these two words are to be removed from TR. Our algorithm first
deletes “madhur” and then "geet” , to arrive at TT, the translation of SI which is “mira samaaroh mein
gaaegee.”
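The comparison of POS slots that drives deletion can be sketched as follows. This Python fragment is an illustration only: the role names, the dictionary layout and the small English-Hindi lexicon are assumptions, with the Hindi words taken from the example above.

```python
def words_to_delete(si_roles, sr_roles, lexicon):
    """Hindi equivalents of the words whose role is filled in SR but not in SI."""
    return [lexicon[word] for role, word in sr_roles.items() if role not in si_roles]


def delete_words(tr_tokens, doomed):
    """Remove each listed Hindi word once from the retrieved translation."""
    out = list(tr_tokens)
    for word in doomed:
        out.remove(word)
    return out


si_roles = {"subject": "Mira", "main_verb": "sing", "preposition": "in", "object2": "function"}
sr_roles = {
    "subject": "Mira", "main_verb": "sing", "adjective1": "melodious",
    "object1": "song", "preposition": "in", "object2": "function",
}
lexicon = {"melodious": "madhur", "song": "geet"}   # hypothetical bilingual lookup

doomed = words_to_delete(si_roles, sr_roles, lexicon)
tt = " ".join(delete_words("Mira samaaroh mein madhur geet gaaegee".split(), doomed))
print(doomed, "->", tt)
```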
4.2 Word Replacement
Word replacement takes place if some word w of SI has the same morphosyntactic tag as some word w' of SR,
but w is not the same as w'. Here the task involves the following: firstly, to find the Hindi equivalent (h) of w;
and then to replace the Hindi equivalent (h') of w' in TR with h. For illustration, consider the scenario:
SI: Sita is singing
SR: Ram is eating fruits.
TR: Ram phal khaa rahaa hai
Here, the algorithm sequentially checks the words in the following order.
1) First the verb is dealt with. In the above scenario, the verb of the input sentence is "to sing", while the
retrieved sentence has the verb "to eat". Since they are different, the translation of the verb "sing" (gaa) is
retrieved from a verb database. Application of WR (gaa, khaa) produces "Ram phal gaa rahaa hai".
2) Similarly, the subject is changed by calling WR (sita, ram).
3) The subject provides information about its number, gender and person. The auxiliary verb gives an idea about
the tense. Since the genders of the two subjects are different, a further call to WR (rahee, rahaa) is needed.
The final translation is achieved after calling WD (phal). TT, the final translation of SI, thus obtained
is "Sita gaa rahee hai".
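The steps of this scenario can be traced in code. The following Python sketch is a simplified illustration, not the system's implementation: the dictionaries `HINDI` and `GENDER`, the role names and the function names are assumptions, and only the verb, the subject, gender agreement and the deletion of an unneeded Object1 are handled.

```python
HINDI = {"sing": "gaa", "eat": "khaa", "sita": "Sita", "ram": "Ram", "fruits": "phal"}
GENDER = {"sita": "F", "ram": "M"}          # hypothetical gender lookup
ASPECT = {"M": "rahaa", "F": "rahee"}


def replace_word(tokens, new, old):
    """WR(new, old): replace every occurrence of `old` by `new`."""
    return [new if t == old else t for t in tokens]


def adapt(tr_tokens, si, sr):
    """Adapt TR to SI for sentences given as role -> root-word dictionaries."""
    out = list(tr_tokens)
    if si["main_verb"] != sr["main_verb"]:                 # step 1: verb
        out = replace_word(out, HINDI[si["main_verb"]], HINDI[sr["main_verb"]])
    if si["subject"] != sr["subject"]:                     # step 2: subject
        out = replace_word(out, HINDI[si["subject"]], HINDI[sr["subject"]])
        g_new, g_old = GENDER[si["subject"]], GENDER[sr["subject"]]
        if g_new != g_old:                                 # step 3: gender agreement
            out = replace_word(out, ASPECT[g_new], ASPECT[g_old])
    if "object1" in sr and "object1" not in si:            # WD for the extra object
        out.remove(HINDI[sr["object1"]])
    return out


si = {"subject": "sita", "main_verb": "sing"}
sr = {"subject": "ram", "main_verb": "eat", "object1": "fruits"}
tt = " ".join(adapt("Ram phal khaa rahaa hai".split(), si, sr))
print(tt)
```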
4.3 Algorithm for Word Addition
Word Addition (WA) is a relatively more complicated operation, as the task here is threefold:
i) To determine the word to be added.
ii) To find the right position for the new word. This can be done by resorting to the structural rules of the
target language.
iii) To make the actual addition of the Hindi equivalent of that word in TR.
The algorithm for finding the correct position for addition is as follows. Here Tobject1, Tobject2, Tadjective1,
Tadjective2 and Tpreposition are the respective translations of Object1, Object2, Adjective1, Adjective2 (see
Section 3.4) and the preposition.
CASE 1: If (Object1 ∈ SI but ∉ SR) then
    WA (Tobject1) comes before the main verb of TR.
CASE 2: If (Adjective1 ∈ SI but ∉ SR) then
    If (Object1 ∈ SI) then
        WA (Tadjective1) comes before the Object1 of TR.
    Else WA (Tadjective1) comes before the main verb of TR.
CASE 3: If (Object2 ∈ SI but ∉ SR) then
    If (preposition ∈ SR) then
        WA (Tobject2) comes before the preposition of TR.
    Else If (Adjective1 ∈ SI) then
        WA (Tobject2) comes before the Adjective1 of TR.
    Else If (Object1 ∈ SI) then
        WA (Tobject2) comes before the Object1 of TR.
    Else WA (Tobject2) comes before the main verb of TR.
CASE 4: If (Adjective2 ∈ SI but ∉ SR) then
    If (Object2 ∈ SI) then
        WA (Tadjective2) comes before the Object2 of TR.
    Else WA (Tadjective2) comes before the preposition of TR.
CASE 5: If (preposition ∈ SI but ∉ SR) then
    If (Adjective1 ∈ SI) then
        WA (Tpreposition) comes before the Adjective1 of TR.
    Else If (Object1 ∈ SI) then
        WA (Tpreposition) comes before the Object1 of TR.
    Else WA (Tpreposition) comes before the main verb of TR.
We now illustrate some of these cases with the following example:
SI : She is eating rice on a clean plate
SR : He is sitting on the chair
TR : Wah kursi par bait rahaa hai
The sequence of changes occurs in the following order:
Operation                TR after the operation
WR (khaa, bait)          Wah kursi par khaa rahaa hai
WR (rahee, rahaa)        Wah kursi par khaa rahee hai
WA (chaawal)             Wah kursi par chaawal khaa rahee hai
WR (thaali, kursi)       Wah thaali par chaawal khaa rahee hai
WR (mein, par)           Wah thaali mein chaawal khaa rahee hai
WA (saaf)                Wah saaf thaali mein chaawal khaa rahee hai
Table 4: Structural Changes in TR
TT : Wah saaf thaali mein chaawal khaa rahee hai.
In the above example, SI has Object1 ("rice") and Adjective2 ("clean") but SR does not have them. Hence, in
order to accomplish the translation of SI, the algorithm adds the translation of Object1 (chaawal) to TR before
the main verb "khaa" (the morphological change from "rahaa" to "rahee" having already been made in the
previous step). Similarly, Adjective2 "saaf" is added before Object2 of TR, i.e. thaali. All the changes are
given in Table 4.
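The five cases for locating the insertion point translate almost line by line into code. The Python sketch below is an illustration under assumptions: roles are represented as plain strings, SI and SR are given as sets of the roles they contain, and TR is a list of (Hindi word, role) pairs; none of these representations is prescribed by the paper.

```python
def anchor_role(role, si, sr):
    """Role in TR before which the translation of `role` is inserted (None if no addition)."""
    if role not in si or role in sr:
        return None
    if role == "object1":                                   # CASE 1
        return "main_verb"
    if role == "adjective1":                                # CASE 2
        return "object1" if "object1" in si else "main_verb"
    if role == "object2":                                   # CASE 3
        if "preposition" in sr:
            return "preposition"
        if "adjective1" in si:
            return "adjective1"
        return "object1" if "object1" in si else "main_verb"
    if role == "adjective2":                                # CASE 4
        return "object2" if "object2" in si else "preposition"
    if role == "preposition":                               # CASE 5
        if "adjective1" in si:
            return "adjective1"
        return "object1" if "object1" in si else "main_verb"
    raise ValueError(f"no addition rule for role: {role}")


def insert_before(tr, word, role, anchor):
    """Insert (word, role) before the first TR element whose role is `anchor`."""
    idx = next(i for i, (_, r) in enumerate(tr) if r == anchor)
    return tr[:idx] + [(word, role)] + tr[idx:]


si = {"subject", "main_verb", "object1", "preposition", "adjective2", "object2"}
sr = {"subject", "main_verb", "preposition", "object2"}
# TR of the example after the word replacements have been applied.
tr = [("Wah", "subject"), ("thaali", "object2"), ("mein", "preposition"),
      ("khaa", "main_verb"), ("rahee", "aux"), ("hai", "aux")]
tr = insert_before(tr, "chaawal", "object1", anchor_role("object1", si, sr))
tr = insert_before(tr, "saaf", "adjective2", anchor_role("adjective2", si, sr))
tt = " ".join(word for word, _ in tr)
print(tt)
```

In this sketch the additions are applied after all replacements; in the example above they are interleaved, but the final sentence is the same.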
4.4 Negative Sentences
Presently, our algorithm works on some selected structures of negative sentences too.
It has been observed that the negation "nahiin" in a negative sentence in Hindi is placed just before the
verb. For example, consider the following scenario where the input sentence is negative, but the retrieved one is
affirmative.
SI: "Ram is not learning Hindi" (negative sentence)
SR: "Ram is reading a book." (affirmative sentence)
TR: "ram kitaab pad rahaa hai"
By following the algorithm discussed above, an intermediate translation: “ram Hindi seekh rahaa hai” is
generated. The correct translation of the input sentence is then obtained by inserting “nahiin” before the verb
“seekh”. Similarly, in the opposite case, the translation of an affirmative sentence may be created from the
example of the translation of a negative sentence by deleting "nahiin" from the retrieved translation.
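Both directions of this step are small list operations. A minimal Python sketch, with function names chosen here for illustration and the root verb assumed to be known from the earlier analysis:

```python
def add_negation(tokens, verb, neg="nahiin"):
    """Insert the negation word immediately before the given root verb."""
    out = list(tokens)
    out.insert(out.index(verb), neg)
    return out


def remove_negation(tokens, neg="nahiin"):
    """Drop the negation word to obtain the affirmative translation."""
    return [t for t in tokens if t != neg]


intermediate = "ram Hindi seekh rahaa hai".split()
negative = add_negation(intermediate, "seekh")
print(" ".join(negative))
print(" ".join(remove_negation(negative)))
```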
5. CONCLUSION
Adaptation of retrieved sentences is a key aspect of EBMT. However, given the variation of expressions in the
source and the target languages, each having its own sets of syntactic and morphological rules, modification of a
retrieved translation example into the desired translation is seldom straightforward. As a consequence, no
algorithm exists that is equally adept across the languages for achieving efficient adaptation. In this paper we
report the preliminary version of an algorithm that we propose for carrying out adaptation for English to Hindi
translation. Although at the deepest level the algorithm resorts to the syntax and morphology of English and
Hindi, we feel that the overall scheme should work well for other pairs of languages too. This is because many
North Indian languages (e.g. Bengali, Marathi) have the same origin (i.e. Sanskrit) as Hindi, and are therefore
structurally close to Hindi.
The present algorithm is at its initial stage. The current version works only for a limited set of tenses and
some specific types of simple sentences. We are working towards extending the technique to the adaptation of
structurally more complicated sentences, including interrogative and negative sentences.
6. ACKNOWLEDGEMENT
We acknowledge the contributions made by Ms. S. Anupama towards implementing the software.
REFERENCES
[1] D. Gupta and N. Chatterjee, Study of Divergence for Example Based English-Hindi Machine Translation.
STRANS-2001, IIT Kanpur, 2001, pp. 43-51.
[2] H.A. Guvenir and I. Cicekli, Learning Translation Templates from Examples. Elsevier Science Ltd., 1998.
[3] R. Jain, R.M.K. Sinha and A. Jain, ANUBHARTI: Using Hybrid Example-Based Approach for Machine
Translation. STRANS-2001, IIT Kanpur, 2001, pp. 20-32.
[4] S.C. Nirenburg (Ed.), The PANGLOSS Mark III Machine Translation System (Tech. Rep. No.
CMU-CMT-95-145). Pittsburgh: Carnegie Mellon University, 1995.
[5] P.C. Wren, H. Martin and N.D.V.P. Rao, High School English Grammar and Composition, S. Chand and
Co., New Delhi, 1989.