Assignment 1: Manual Direct Translation
Rebecca Jonson
GSLT course: Machine Translation 2004
In this assignment I have tried to implement a Prolog program that translates the following
Swedish sentence to English:
Sv. Ytterst handlar kampen för sysselsättning om att hålla samman Sverige.
The translation given was the following:
En. Ultimately, the fight for full employment concerns the cohesion of Swedish society.
I started with an implementation of a very simplistic approach and then tried to improve the
program with more advanced direct translation methods. The two attempts are described
below followed by a short discussion of the result and a comparison with Systran. The
implementation is in Prolog and the code can be found on my webpage (the system does not
work with the Swedish letters å ä ö yet so you need to write au, ae, oe if you run it).
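The transliteration mentioned above could in principle be done by the program itself before dictionary look-up. The following is only a minimal sketch of such a preprocessing step, not part of the original program: the predicate names translit/2, normalise_word/2 and normalise_words/2 are hypothetical, and it assumes a Prolog system where atom_codes/2 returns Unicode code points (e.g. SWI-Prolog).

```prolog
:- use_module(library(lists)).

% translit(+Code, -Replacement): character code of a Swedish letter and
% the codes of its ASCII replacement (hypothetical helper).
translit(229, [97, 117]).   % a-ring      -> au
translit(228, [97, 101]).   % a-diaeresis -> ae
translit(246, [111, 101]).  % o-diaeresis -> oe

% normalise_word(+Word, -Normalised): rewrite the three letters in one atom.
normalise_word(Word, Normalised) :-
    atom_codes(Word, Codes),
    expand_codes(Codes, Expanded),
    atom_codes(Normalised, Expanded).

% expand_codes(+Codes, -Expanded): replace known codes, copy all others.
expand_codes([], []).
expand_codes([C|Cs], Out) :-
    (   translit(C, Rep) -> true ; Rep = [C] ),
    expand_codes(Cs, Rest),
    append(Rep, Rest, Out).

% normalise_words(+Words, -NormalisedWords): apply it to a whole word list.
normalise_words([], []).
normalise_words([W|Ws], [N|Ns]) :-
    normalise_word(W, N),
    normalise_words(Ws, Ns).
```

Under these assumptions the word list could be passed through normalise_words/2 right after tokenisation, so that the user can type the sentence with its ordinary spelling. Upper-case letters are not handled in this sketch.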
First Attempt (simplistic approach):
Algorithm:
First, try to translate bigger parts of the sentence (only two-word look-up is implemented) by
looking up phrases in the lexicon (including words that do not follow each other). Then back off
to word-to-word translation of the result using the bilingual dictionary. Unknown words are
copied. Capital letters in the source language give capital letters in the target language (not
implemented). Punctuation is copied (not implemented).
Part of the algorithm was implemented as follows:
run_trans(Target):-read_string(Str),
string2wordlist(Str, SourceList),
trans2(SourceList,Tr1),
trans1(Tr1,Target).
The predicate run_trans takes a string, produces a wordlist and sends this list for translation
to the trans2 predicate that makes two-word translations. The result is sent to the trans1
predicate that makes word-to-word translations and outputs the final translation result.
%word-to-word translation
trans1([],[]).
trans1([FirstWord|Rest], Target):-
    lookup([FirstWord],Targ1),
    trans1(Rest, Targ2),
    append(Targ1,Targ2,Target).

%two-word translation
trans2([],[]).
trans2([FirstWord|Rest],Target):-
    (
        member(Second, Rest),
        lookupSvEng([FirstWord, Second],Targ1),
        delete(Rest,Second,RestNew)
    ;
        Targ1 = [FirstWord],
        RestNew = Rest
    ),
    trans2(RestNew, Targ2),
    append(Targ1,Targ2,Target).
The bilingual dictionary looks as follows (based on Norstedts svensk-engelska, 1992):
lookupSvEng([ytterst], [ultimately]).
lookupSvEng([handla, om],[be, about]).
lookupSvEng([kamp],[fight]).
lookupSvEng([för],[for]).
lookupSvEng([sysselsättning],[employment]).
lookupSvEng([hålla, samman],[keep, together]).
lookupSvEng([att],[to]).
lookupSvEng([sverige],['Sweden']).
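The predicate lookup/2 called by trans1 is not listed in this report. Since the algorithm states that unknown words are copied, it may be a thin wrapper around lookupSvEng/2. The following is a sketch under that assumption only; the body of lookup/2 is a guess, and just two dictionary entries are repeated to keep the example self-contained.

```prolog
% Two entries copied from the bilingual dictionary above.
lookupSvEng([kamp], [fight]).
lookupSvEng([att], [to]).

% lookup(+Words, -Translation): assumed behaviour, not the original code.
% Use the dictionary entry if there is one, otherwise copy the words.
lookup(Words, Translation) :-
    (   lookupSvEng(Words, T)
    ->  Translation = T
    ;   Translation = Words
    ).
```

With this definition a known word such as kamp is translated, while an inflected form such as kampen is copied unchanged, which is consistent with the untranslated words in the result of the first attempt.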
Result
The first attempt with the program gave the following result.
| ?- run_trans(T).
|: Ytterst handlar kampen för sysselsättning om att hålla samman Sverige.
T = [ultimately,handlar,kampen,for,employment,om,to,keep,together,'Sweden'] ?
yes
Compared to:
En. Ultimately, the fight for full employment concerns the cohesion of Swedish society.
The result of this model is quite poor: it is not very intelligible, as part of the sentence has
not been translated, and it is therefore definitely far from correct English. As seen, no
morphological analysis was made, so inflected words were not translated. This is probably the
method’s biggest drawback. I will try to improve that in the second attempt.
Second Attempt:
In the second attempt I took the program from the first attempt and added some steps from the
advanced direct translation strategy, in the following order, to try to improve the translation:
Step 1 Source text dictionary look-up + morphological analysis
I added a source text dictionary with head words with their part of speech category. This was
used in the morphological analysis that I added.
lookupSv(ytterst, adv).
lookupSv(handla,v).
lookupSv(kamp, n).
lookupSv(foer,prep).
lookupSv(sysselsaettning, n).
lookupSv(haulla,v).
lookupSv(samman,prep).
lookupSv(om,prep).
lookupSv(att, infm).
lookupSv(sverige, n).
The morphological analysis worked in the following way for nouns:
If the word ends in ‘–en’, delete the suffix and check whether the remainder is a word with
category noun. If so, add the feature DEF (for definite) and the shortened word to the word
list. The order of the feature and the word is based on the target language structure (e.g. ‘the
fight’ and not ‘fight the’).
Verbs were handled like this:
If the word ends in ‘–r’, delete the suffix and check whether the rest is a Swedish verb. If so,
add the shortened word and the feature PRES to the word list.
The morphological analysis was implemented with the following predicates:
%morph analysis
morfanalys([],[]).
morfanalys([W|WL],MorfList):-
    lookupSv(W,_),
    morfanalys(WL, Morf),
    append([W],Morf,MorfList).
morfanalys([W|WL],MorfList):-
    getMorf(W,M),
    morfanalys(WL,Morf),
    append(M,Morf, MorfList).

%%%%looking up if Word ends with -en and is a Swedish noun
%%%%(suffixes are given as character codes: 101,110 = en; atom_codes/2 is
%%%%used so that they also match on ISO-conforming Prolog systems)
getMorf(Word,Lookup):-
    atom_codes(Word,Chars),
    suffix([101,110],Chars, MorfList),
    atom_codes(Morf,MorfList),
    lookupSv(Morf,n),
    Lookup = [def, Morf].
%%%%looking up if Word ends with -r (code 114) and is a Swedish verb
getMorf(Word, Lookup):-
    atom_codes(Word,Chars),
    suffix([114],Chars,MorfList),
    atom_codes(Morf,MorfList),
    lookupSv(Morf,v),
    Lookup = [Morf, pres].

%suffix(Suffix, List, Prefix): List is Prefix followed by Suffix
suffix(Xs,Xs,[]).
suffix(Xs,[Y|Ys],Morf):-
    suffix(Xs,Ys, Prefix),
    append([Y],Prefix,Morf).
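The same two suffix rules can also be written without converting atoms to code lists, using the standard built-in atom_concat/3 to split off the ending. The following is an alternative sketch, not the code used in the assignment; strip_suffix/2 is a hypothetical name, and only two dictionary entries are repeated so that the example runs on its own.

```prolog
% Two entries copied from the source-language dictionary.
lookupSv(kamp, n).
lookupSv(handla, v).

% strip_suffix(+Word, -Features): hypothetical variant of getMorf/2.
% A noun ending in -en yields [def, Stem]; a verb ending in -r yields
% [Stem, pres]. The feature order follows the listing above.
strip_suffix(Word, [def, Stem]) :-
    atom_concat(Stem, en, Word),
    lookupSv(Stem, n).
strip_suffix(Word, [Stem, pres]) :-
    atom_concat(Stem, r, Word),
    lookupSv(Stem, v).
```

For example, the goal strip_suffix(kampen, F) gives F = [def, kamp], and strip_suffix(handlar, F) gives F = [handla, pres]. Words that match neither rule simply fail, as with getMorf/2.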
Result
The result of adding this step gave the following:
| ?- run_trans(T).
|: Ytterst handlar kampen för sysselsättning om att hålla samman Sverige.
T = [ultimately,be,about,pres,def,fight,for,employment,to,keep,together,'Sweden'] ?
As seen, the words ‘handlar’ and ‘kampen’ have now been found and given a translation,
although we still need to do some processing of the target text to get their full
translation.
Step 2 Synthesis and morphological processing of target text
The synthesis and morphological processing of the target text has been implemented with the
predicate targetsynt. The predicate looks for features in the word list, such as def and pres,
left over from the morphological analysis and processes the target text accordingly. One of the
rules replaces every def feature with the definite article the. Another one checks whether there
is a verb or verb expression followed by a tense indicator and replaces the verb in the list with
its correctly inflected form. Apart from this, the target output is now presented as a string
(and not as a list) by changing the main predicate run_trans.
%Looks for definite articles and synthesises 'the'
targetsynt(WL,Target):-
    member(def,WL),
    substitute(def,WL,the,TL),
    targetsynt(TL,Target).
%Leaves the list unchanged if a plural marker is present (only singular is inflected)
targetsynt(WL,WL):-
    member(pres,WL),
    member(plur,WL).
%Checks tense and assumes singular form and inflects the verb to found tense
targetsynt(WL,Target):-
    sublist([X,pres],WL),
    lookupEng(X,v),
    tenseEng(X,pres,sng,Y),
    delete(WL,pres,WL1),
    substitute(X,WL1,Y,Targ1),
    targetsynt(Targ1,Target).
%Same rule for a verb expression with one word between verb and tense marker
targetsynt(WL,Target):-
    sublist([X,_,pres],WL),
    lookupEng(X,v),
    tenseEng(X,pres,sng,Y),
    delete(WL,pres,WL1),
    substitute(X,WL1,Y,Targ1),
    targetsynt(Targ1,Target).
%Deletes singular form indicators left from morfanalysis
targetsynt(WL,Target):-
    member(sng,WL),
    delete(WL,sng,Target).
targetsynt(WL,WL).
lookupEng(ultimately,adv).
lookupEng(be,v).
lookupEng(about,prep).
lookupEng(fight,n).
lookupEng(for,prep).
lookupEng(employment,n).
lookupEng(keep,v).
lookupEng(together,adv).
lookupEng(to,infm).
lookupEng('Sweden',n).
lookupEng(the, def).
tenseEng(be,pres,sng,is).
tenseEng(be,pres,plur,are).
| ?- run_trans.
|: Ytterst handlar kampen för sysselsättning om att hålla samman Sverige
Target:ultimately is about the fight for employment to keep together Sweden
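The Step 2 listing relies on two helper predicates, substitute/4 and sublist/2, whose definitions are not shown in this report. The following is a minimal sketch of how they might be defined, with the argument order inferred from the calls in targetsynt; the actual definitions may differ.

```prolog
:- use_module(library(lists)).

% substitute(+Old, +List, +New, -Result): assumed helper.
% Replace every element identical to Old by New.
substitute(_, [], _, []).
substitute(Old, [X|Xs], New, [Y|Ys]) :-
    (   X == Old -> Y = New ; Y = X ),
    substitute(Old, Xs, New, Ys).

% sublist(?Sub, +List): assumed helper.
% Sub is a contiguous run of elements somewhere inside List.
sublist(Sub, List) :-
    append(_, Tail, List),
    append(Sub, _, Tail).
```

With these definitions, sublist([X,pres], [ultimately,be,pres]) binds X to be, and substitute(def, [def,fight], the, R) gives R = [the,fight], which is the behaviour the targetsynt rules appear to expect.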
Step 3 rearrangement of words and phrases in target text
The rearrangement of words and phrases in the target text is hard without parsing the whole
sentence. A simplified method that could solve the problem in the example sentence would be to
identify NPs and VPs and then check the word order: the first VP must not come earlier than the
first NP. If it does, the two need to be swapped. This method has not been implemented but
would give the following result on the target text:
Target:ultimately the fight for employment is about to keep together Sweden
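The reordering idea described above was not implemented in the assignment. The following is one possible sketch of it, written as a small DCG over a tagged word list. Everything in it is an assumption: the predicate names (reorder/2, cat/2 and the grammar rules), the small lexicon, and the very crude notions of VP (a verb plus an optional preposition) and NP (determiner, noun and any following preposition-noun pairs).

```prolog
:- use_module(library(lists)).

% Hypothetical target-language lexicon for the example sentence.
cat(ultimately, adv).   cat(is, v).          cat(about, prep).
cat(the, det).          cat(fight, n).       cat(for, prep).
cat(employment, n).     cat(to, infm).       cat(keep, v).
cat(together, adv).     cat('Sweden', n).

% reorder(+Words, -Reordered): if the sentence starts with optional
% adverbs, then a VP, then an NP, swap the VP and the NP.
% Otherwise leave the word list unchanged.
reorder(Words, Reordered) :-
    phrase(sentence(Pre, VP, NP, Post), Words), !,
    append([Pre, NP, VP, Post], Reordered).
reorder(Words, Words).

sentence(Pre, VP, NP, Post) --> pre(Pre), vp(VP), np(NP), rest(Post).

% Leading adverbs.
pre([W|Ws]) --> [W], { cat(W, adv) }, pre(Ws).
pre([])     --> [].

% Verb with an optional following preposition (e.g. "is about").
vp([V,P]) --> [V], { cat(V, v) }, [P], { cat(P, prep) }.
vp([V])   --> [V], { cat(V, v) }.

% Determiner + noun + any number of preposition-noun pairs.
np([D,N|PPs]) --> [D], { cat(D, det) }, [N], { cat(N, n) }, pps(PPs).

pps([P,N|Rest]) --> [P], { cat(P, prep) }, [N], { cat(N, n) }, pps(Rest).
pps([])         --> [].

% Everything that is left over.
rest([W|Ws]) --> [W], rest(Ws).
rest([])     --> [].
```

Applied to the Step 2 output, reorder([ultimately,is,about,the,fight,for,employment,to,keep,together,'Sweden'], R) yields the word order shown above. A sentence that already starts with an NP is left untouched. A real system would need proper phrase identification; this pattern only covers the shape of the example.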
Apart from this, some adjustments to the target words depending on their context have been
made. I have added a rule that rewrites structures such as ‘prep to verb’
(e.g. about to keep) into the form ‘prep v+ing’ (e.g. about keeping). The predicate is called
ingform.
ingform(WL,TL):-
    pos(WL,POS),
    sublist([(_,prep),(to,infm),(Y,v)],POS),
    delete(WL,to,NWL),
    ing(Y,Ying),
    substitute(Y,NWL,Ying,TL).
ingform(WL,WL).

%pos(Words, Tagged): pairs each word with a category, or unk if unknown
pos([],[]).
pos([W|WL],[(W,C)|POS]):-
    lookupEng(W,C),
    pos(WL,POS).
pos([W|WL],[(W,C)|POS]):-
    lookupSv(W,C),
    pos(WL,POS).
pos([W|WL],[(W,unk)|POS]):-
    pos(WL,POS).

%ing(Verb, VerbIng): appends "ing" (codes 105,110,103) to an English verb
ing(Y,Ying):-
    lookupEng(Y,v),
    atom_codes(Y,L),
    append(L,[105,110,103],TL),
    atom_codes(Ying,TL).
This rule makes the following translation possible:
| ?- run_trans.
|: Kampen för att hålla samman Sverige.
Target:the fight for keeping together Sweden
Step 4 Editing
This step includes some editing of the target text; some of the rules I have implemented and
some have only been thought of. One rule, which inserts a comma after an adverb starting an
English sentence, has been added as follows:
%44 is the character code for ','
insertcomma(T,TC):-
    lookupEng(T,adv),
    atom_codes(T,L),
    append(L,[44],TL),
    atom_codes(TC,TL).
insertcomma(T,T).
This gives the following result:
| ?- run_trans.
|: Ytterst handlar kampen för sysselsättning om att hålla samman Sverige
Target:ultimately, is about the fight for employment to keep together Sweden
Another rule should also be added that capitalises the first word in the sentence.
Finally, the punctuation should be added. These rules have not been implemented.
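The two unimplemented editing rules could be sketched as follows. This is an illustration only: the predicate names finish_sentence/2, capitalise/2, upcase_char/2 and join_words/2 are hypothetical, only the letters a to z are capitalised, and the sentence is always closed with a full stop rather than with the punctuation of the source sentence.

```prolog
% upcase_char(+Char, -Upper): upper-case one ASCII letter, copy anything else.
upcase_char(C, U) :-
    char_code(C, Code),
    Code >= 97, Code =< 122, !,      % 97..122 are the codes of a..z
    UCode is Code - 32,
    char_code(U, UCode).
upcase_char(C, C).

% capitalise(+Word, -Capitalised): upper-case the first character of an atom.
capitalise(Word, Capitalised) :-
    atom_chars(Word, [First|Rest]),
    upcase_char(First, Upper),
    atom_chars(Capitalised, [Upper|Rest]).

% join_words(+Words, -Atom): concatenate the words with single spaces.
join_words([W], W).
join_words([W1,W2|Ws], Joined) :-
    join_words([W2|Ws], Tail),
    atom_concat(W1, ' ', W1Space),
    atom_concat(W1Space, Tail, Joined).

% finish_sentence(+Words, -Sentence): capitalise the first word,
% join the words and add a final full stop.
finish_sentence([First|Rest], Sentence) :-
    capitalise(First, Cap),
    join_words([Cap|Rest], Joined),
    atom_concat(Joined, '.', Sentence).
```

Given the word list ['ultimately,', the, fight, for, employment, is, about, keeping, together, 'Sweden'], finish_sentence/2 produces the atom 'Ultimately, the fight for employment is about keeping together Sweden.'.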
Evaluation/Discussion of Final Result:
The final result from the implementation is:
Target:ultimately, is about the fight for employment keeping together Sweden
and with the rules described that I have not implemented the result is the following:
Target: Ultimately, the fight for employment is about keeping together Sweden.
The result is not too bad: it is quite intelligible, and at least the last example (the one
including the unimplemented rules) seems quite accurate and reads like English. It is, though,
far from the full-fledged style of the example translation. I would say the quality is not too
bad for a very simple MT implementation, but the last part of the sentence in particular
sounds a bit odd in English. The manual example translation captures the meaning of the
Swedish sentence much better.
On the other hand, such a “good” result is largely due to some MT-faking, as the MT processing
has been adapted to this specific case. Knowing this, the result is not that good. The
dictionary, for example, has been adapted to the translation task, so I have avoided
homographs and used a hand-picked vocabulary. The system does accept other phrases, though,
and I have tried not to make things too specific. There are some bugs in the program that do
not interfere with the translation process as it is now, but that would cause problems if the
system were extended or used for other examples. Due to lack of time, and as the system
works for the task it was built for, I have chosen to leave the program as it is.
Comparison with Systran translation:
The Systran translation gives:
Outermost acts the struggle for employment about holding together Sweden
The program implemented for this assignment gives the following translation:
Target: ultimately, is about the fight for employment keeping together Sweden
The result of the assignment including manual steps described is:
Target: Ultimately, the fight for employment is about keeping together Sweden.
Following the Intelligibility Scale proposed by Arnold et al. (1), the last sentence is perfectly
intelligible and grammatical, and I would therefore score it quite high. It looks and sounds like
good English, although the word choice is not entirely satisfactory. The output of my
program is worse, as it is not immediately intelligible due to the word order error, but it would
be easy to fix with some post-editing. Apart from the word order error, the word choices are OK.
The Systran translation is, I would say, even harder to understand, and post-editing it
seems harder to me. The word choices in the first part of the sentence are the main
reason for the unintelligibility and the inaccuracy.
It must be said in defence of the Systran system that my system has avoided most of the
problems Systran struggles with, by being adapted to this specific translation task
and by carefully choosing the lexical translations in the bilingual dictionary to avoid
homographs. The development of my little system has shown me that the direct translation
method works, although I doubt that it is the best choice. I think you would need some
processing of bigger structures to be able to produce good translations and resolve ambiguity
at all its levels.
References
1. D.J. Arnold, Lorna Balkan, Siety Meijer, R. Lee Humphreys and Louisa Sadler, 1994.
Machine Translation: an Introductory Guide, Blackwells-NCC, London.
ISBN: 1855542-17x
http://www.essex.ac.uk/linguistics/clmt/MTbook/PostScript/