English-Persian SMT
Reza Saeedi
[email protected]
WTLAB
Wednesday, May 25, 2011

Outline
• MT Introduction
• SMT Introduction
• Requirements for SMT
• Evaluation metrics
• English-Persian MT challenges
• English-Persian SMT
  • System1
  • System2
• Problems in English-Persian SMT

MT Introduction
• Machine Translation is the automatic translation of text written in one natural language into another by computer.
• There are several ways to do this:
  • Dictionary-based
  • Rule-based
  • Example-based
  • Statistical approach

SMT Introduction
• The first ideas of statistical machine translation were proposed by Warren Weaver in 1947.
• Statistical machine translation tries to learn how to translate by examining translations made by humans.

SMT Introduction (Cont.)
• Statistical MT models take the view that every sentence in the target language is a translation of the source language sentence with some probability.
• The best translation is the sentence that has the highest probability.
• The key problems in statistical MT are:
  • estimating the probability of a translation
  • efficiently finding the sentence with the highest probability

SMT Introduction (Cont.)
• Given a source sentence f, we seek the target sentence e' that maximizes P(e | f):
  $e' = \arg\max_e P(e \mid f)$
• Intuitively, P(e | f) should depend on two factors, which Bayes' rule makes explicit:
  $P(e \mid f) = P(e) \, P(f \mid e) / P(f)$
• Since P(f) does not depend on e:
  $\arg\max_e P(e \mid f) = \arg\max_e P(e) \, P(f \mid e)$
  where P(e) measures fluency and P(f | e) measures faithfulness.

SMT Introduction (Cont.)
• Philipp Koehn
• http://homepages.inf.ed.ac.uk/pkoehn

Why SMT?
• Better use of resources
• No linguistic knowledge is needed
• It can be used for any language pair
• But: we need a large training corpus

Steps of SMT

Requirements for SMT
• Bilingual and monolingual corpus:
  • The bilingual corpus needs two files aligned sentence by sentence (one file for the source language and the other for the target language)
  • Microsoft Bilingual Sentence Aligner
• Language model:
  • We need a tool to compute P(e)
  • This step needs a monolingual corpus
  • SRILM: a tool for building n-gram language models

LM output

Requirements for SMT
• Translation model:
  • We need a tool to compute P(f | e)
  • This step needs a bilingual corpus
  • GIZA++ (word alignment)
  • The output of this step is a phrase table
• Decoder:
  • Searches for the best translation
  • Moses

Phrase table

Moses tool

The training steps
• Prepare data
• Run GIZA++
• Align words
• Get lexical translation table
• Extract phrases
• Score phrases
• Build reordering model
• Build generation models
• Create configuration file

Evaluation metrics
• BLEU (BiLingual Evaluation Understudy)
  • Developed at IBM
  • The closer a machine translation is to a professional human translation, the better it is
• NIST

English-Persian MT challenges
• The structure of Persian is very different from that of English
• The structure of the Persian language is very complex
• Little previous work has been done for this language pair
• Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English/Persian language pair

English-Persian SMT
• Few English-Persian MT systems have been developed
• Most of them are purely rule-based
• There are two works on English-Persian SMT:
  • Mohaghegh and Sarrafzadeh (Massey University)
  • Pilevar and Faili (Tehran University)

System1
• Corpus: BBC news

System1 (Cont.)
• Tools: SRILM, GIZA++, Moses

System1: Improved Language Modeling

System2
• Corpus:
  • Bidirectional (TEP): film subtitles, 3 books, KDE4

System2 (Cont.)
• Corpus:
  • Monolingual: Hamshahri, film subtitles

System2 (Cont.)
• Tools: SRILM, GIZA++, Moses

PersianSMT with 4-gram Sub-LM

Comparison of PersianSMT with Google Translator

Problems in English-Persian SMT
• Compound verbs (alignment problem)
  • Use a phrase-based SMT system
  • But inflectional morphology remains a problem
  • The large number of inflected verb forms prevents the system from learning to translate all the individual forms of a compound verb
• Persian treats personal pronouns as an optional element in the sentence (alignment problem)

Problems (Cont.)
• Failure of the system to place the elements of the sentence in the right order
  • Use a phrase-based SMT system
  • Re-rank the n-best output list and/or reorder the output sentences
  • Prior to translation, reorder the input sentence using morpho-syntactic information so that its word order better resembles that of the target language

References
[1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000.
[2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008.
[3] M. Mohaghegh and A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009.
[4] M. Mohaghegh and A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009).
[5] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010.
[6] M. Mohaghegh and A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.
[7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proc. of the 10th International Conference on Statistical Analysis of Textual Data (JADT 2010).
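
Worked example: the decision rule from the SMT Introduction slides, which picks the candidate e maximizing P(e) * P(f | e), can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The candidate sentences and all probabilities are made-up toy values, and the names `LM`, `TM`, `noisy_channel_score` and `best_translation` are hypothetical. None of them come from SRILM, GIZA++ or Moses.

```python
import math

# Toy language model P(e): hypothetical values, not estimated from any corpus.
LM = {
    "the house is small": 0.04,
    "small the house is": 0.0001,
    "the home is little": 0.01,
}

# Toy translation model P(f | e) for one fixed source sentence f
# (again hypothetical values chosen only for illustration).
TM = {
    "the house is small": 0.3,
    "small the house is": 0.3,
    "the home is little": 0.2,
}


def noisy_channel_score(e):
    """Return log P(e) + log P(f | e); P(f) is constant in e and is omitted."""
    return math.log(LM[e]) + math.log(TM[e])


def best_translation(candidates):
    """Return the candidate with the highest noisy-channel score."""
    return max(candidates, key=noisy_channel_score)


candidates = list(LM)
best = best_translation(candidates)
print(best)
```

In this toy setup the first two candidates are equally faithful (same P(f | e)), so the language model alone decides between them and the fluent word order wins. A real decoder such as Moses searches a far larger hypothesis space rather than scoring a fixed list.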
