Automatic parallel data mining after bilingual document alignment

Krzysztof Wołk, Agnieszka Wołk
Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
{kwolk, awolk}@pja.edu.pl

Abstract. It has become essential to have precise translations of texts from different parts of the world, but it is often difficult to fill the translation gaps as quickly as might be needed. There are multiple dictionaries that can help in this regard, and various online translators exist to help cross this lingual bridge, but even these resources can fall short of serving their true purpose. Such translators can provide an accurate meaning for the individual words in a phrase, but they often miss the true essence of the language. The research presented here describes a method that can help close this lingual gap by extending certain aspects of the WMT16 bilingual document alignment task. We achieve this goal by utilizing different classifiers and algorithms and by making use of advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level. This data proved capable of improving overall machine translation quality.

Key words: SMT, quasi-comparable corpora, parallel corpora generation, comparable corpora, unsupervised corpora acquisition, data mining

1 Introduction

The main purpose of this research was to create a parallel text that helps to fill the translation gaps that we face when translating multilingual texts between under-resourced languages. This was done using the documents issued for the WMT16 bilingual document alignment task (http://www.statmt.org/wmt16/bilingual-task.html). Through this research, we hoped to obtain the most accurate data, with a filtered parallel text that is further refined to achieve the best performance. To build this research on a strong foundation, techniques such as linear interpolation and in-domain adaptation were also applied to the newly discovered data. In addition, experiments were carried out using the data from the WMT16 news translation task (http://www.statmt.org/wmt16/translation-task.html). While translating one corpus into another, the true meaning of a text can be lost, and linguists have tried different ways to eliminate this loss in translation. Apart from the primary step towards achieving better translations (i.e., more data), classifiers were also needed; our classifiers were trained on the TED Talks domain (https://www.ted.com). Our method was evaluated by using different test samples from the WMT16 evaluation sets and calculating BLEU metrics to determine the quality of our method [1].
The Moses statistical machine translation toolkit [2] was used during the experiments, while the GIZA++ tool was used to train the alignment models on the parallel text and to conduct symmetrization (with the help of the Berkeley Aligner) for the translation phrases. The SRI Language Modeling Toolkit was used to train the statistical language models. In addition, data from other thematic domains were filtered to make the system more useful, because linguistic gaps need to be filled when translating from one language to another. We used Moore-Lewis filtering [3] for parallel in-domain data selection, and the monolingual language models were linearly interpolated [4]. As far as the data mining itself is concerned, the bilingual data were refined using several further techniques, such as a support vector machine (SVM) for the classification of sentences [5] and the Needleman-Wunsch algorithm [6], which aligns the discovered phrases so that the words in one language are matched to where they belong in the second language. The algorithms were re-implemented for speed and quality improvements, and a graphics processing unit was used for parallel calculations.
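To make the in-domain selection step concrete, the sketch below scores general-domain sentences by the Moore-Lewis cross-entropy difference between an in-domain and a general-domain language model. It assumes the kenlm Python bindings and two pre-trained ARPA models; the model paths, the input file, and the 10% selection cut-off are illustrative placeholders rather than the exact configuration used in our experiments.

import kenlm

in_domain_lm = kenlm.Model("ted.in-domain.arpa")   # hypothetical in-domain model
general_lm = kenlm.Model("news.general.arpa")      # hypothetical general model

def cross_entropy(model, sentence):
    """Per-word negative log10 probability of a sentence under a KenLM model."""
    return -model.score(sentence, bos=True, eos=True) / max(len(sentence.split()), 1)

def moore_lewis(sentence):
    """Lower score means the sentence looks more in-domain than general."""
    return cross_entropy(in_domain_lm, sentence) - cross_entropy(general_lm, sentence)

with open("general_corpus.txt", encoding="utf-8") as f:  # hypothetical input file
    scored = sorted((moore_lewis(s.strip()), s.strip()) for s in f if s.strip())

# Keep the best-scoring 10% of the general corpus as pseudo in-domain data.
selected = [sentence for _, sentence in scored[: max(len(scored) // 10, 1)]]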
2 Corpora types

A large collection of texts stored on a computing device is known as a corpus (the collected texts themselves are referred to as corpora). In linguistics, the term parallel corpus refers to texts that are translations of one another [4]. Two main types of such corpora are distinguished:
• Comparable corpus
• Parallel corpus
A comparable corpus is a collection of texts that merely cover the same type of content in two or more languages, whereas a parallel corpus contains translations of the content from one language into another. A comparable corpus consists of the same type of content, but is not a translation of that content at all [4]. To conduct an analysis of a language, text alignment is necessary to determine the proximity of sentences in the documents of each language; it is difficult to discover where a translation stands on its own. Such documents help to fill in the gaps that most machine translation engines leave behind. A typical machine translator is limited to translating documents in a word-to-word or element-to-element manner. In doing so, the meaning of phrases and the essence of the language are lost, leaving behind a jumble of 'noisy' parallel alignments that barely make sense when put together as a phrase. Sometimes, data from other sources that merely describe the content is used when translating from one language to another. However, extra words are sometimes unintentionally added in these translations, which can be misleading and change the meaning of the content. Linguists who strive to achieve accurate translations have always considered this an added difficulty that has yet to be adequately addressed. Additional techniques are required to differentiate between the bilingual and monolingual elements in texts in order to generate an accurate translation that leaves less room for misinterpretation while providing the highest accuracy possible. A conditional probability determination that compares one corpus to another for bilingual elements must be used [4], because it is difficult to evaluate a parallel text without refining it as much as possible to distinguish between the two texts.

Parallel corpora are relatively rare, even for the French-English language pair, whereas monolingual data are more easily accessible in both of these languages [4]. To be more precise, corpora can be placed into four categories. The rarest form is the parallel corpus, which consists of translated documents in two different languages; such data can additionally be aligned at the sentence or phrase level for better results and for consistency in meaning. Second, 'noisy' parallel corpora contain faulty and even inaccurate translations that can misguide the reader, as the language loses most of its true meaning in translation. Third, a comparable corpus is formed when there are bilingual documents without any specific translations and without sentence alignment, even though their topics are aligned. Finally, a quasi-comparable corpus includes bilingual documents that might not even be topic-aligned [7] but which may contain some phrasal alignments. Considering these four types, we deal with two kinds of corpora in this research: comparable and quasi-comparable.

3 State of the art

As far as comparable corpora are concerned, different attempts have been made in different domains to extract parallel phrases that can be used as models when translating. There have been several approaches, of which two have been the most useful in generating accurate results. One of them is based on cross-lingual information retrieval, whereas the other simply translates the data using any MT system. In the latter process, related documents are first translated and then compared with texts in the target language, which can help in finding the closest translations. An interesting way to obtain such data by using Wikipedia was adopted in [8]. The authors proposed two different approaches:
1. The first approach employs an online translation system that can, for example, translate from Dutch to English, and later compares the translated data with the EN pages on which such data is already available.
2. The second idea is simply to use a dictionary generated from the Wikipedia titles of the documents.
The first idea is much less problematic in practice, but it is not substantial in its approach. The second one was also rejected because it generated results that were not accurate. However, the second method was improved in [9] by adding one more step to the translation process; it was shown that the precision reported in [8] improved from 21% to 48% after using the new method [9]. Yasuda and Sumita [10] developed another improvement for phrase-based statistical machine translation. In their method, the alignment is achieved by using a lexicon that is already updated, which means that the corpus they use has been trained to serve in that way. In their research, they showed that 10% of Japanese Wikipedia sentences correspond to English Wikipedia sentences. Furthermore, Tyers and Pienaar [9] succeeded in leveraging interwiki links. By filtering Wikipedia's link structure, a bilingual dictionary was generated, and mismatches between links were detected as well. In their research, they discovered that the precision of a translation depends strictly on the language. In [11], the authors attempted to filter their work using their research. Moreover, they used annotations provided on Wikipedia, as well as lexicon models that generate translations automatically.
They were finally able to achieve a precision of 90% with their methods. A new and improved method for building parallel corpora was also introduced by the author of [12], who used textual entailment and an SMT system to generate data; noticeable improvements were reported when using this aligned data for German as well as for French. In addition, a different way of using Wikipedia was established by Plamada and Volk [12]. Previous methods limited the application of an algorithm to sentences, but it could potentially be applied to the final data generated during translation. Their method disregards the position of candidate sentences and ranks them on the basis of customized metrics. Instead of an element-to-element translation, their method attempts to achieve semantic accuracy in the text by limiting the translation process to a specified domain. In the evaluation of their method, they noted an improvement of 0.5 points in the BLEU metric, but only for text outside the targeted domain; there was almost no improvement for in-domain text. In [13], the authors proposed to use only limited information for data generation: the title of the text to be translated and its publication date and time, instead of the complete document as in prior research. The similarity of titles was the only factor used to generate the data. In the research described in [14], the authors modeled documents based on the events they describe. To calculate the metric of their experiment, they used documents that model sets of temporal and geographical events; these temporal and geographical expressions form the basis on which the documents are compared. The authors of [15] were also of the opinion that machine translation systems should incorporate comparable corpora, including texts from Wikipedia and Twitter. Their view was that this would help to generate better results; hence, they used domains to generate documents that help to gather data for comparison during translation. Recently, a new method inspired by Yalign [4] has been used. That method was not feasible at first; it was slow and did not help to fill the gaps. However, it was improved through the use of new methods, and it now produces better results.

4 Parallel data mining

Methodologies for analyzing data sources that contain parallel corpora without sentence alignment, such as quasi-comparable, comparable, and noisy parallel corpora, were used in this research. Bilingual text samples, which were obtained with the baseline script from the WMT16 alignment task, were used in this experiment. Through this procedure, 166,099 quasi-comparable and comparable article pairs were obtained by using the baseline script (provided by the WMT16 organizers). For SVM classifier training, the TED corpus from the IWSLT 2015 evaluation campaign (http://workshop2015.iwslt.org/) was used as wide-domain data for the mining experiment. The selected classifier text domain is one that is widely charted and covers a multitude of subject areas, which was also the reason for its usage. The data included 2.5 million untokenized words [16]. Our solution is composed of three main steps:
• In the first and second steps, quasi-comparable and comparable data are collected, and alignment is performed at the article level. Parallel sentences are then mined from the results that are produced.
• The third step is of great importance, since the results have widely varying quantities and qualities. A sentence that is produced may not contain the desired result and may not correspond with the corpus as a translation. Indirect and poor translations in the article corpus may also hinder the alignment process. On the other hand, the alignment may not be feasible for the process itself, which makes it difficult to use. In any case, certain steps must be verified, and the text used in the mining process must be prepared beforehand.
To start, a relational database is used to save all the article-level data. Afterwards, article pairs are aligned with the WMT16 baseline tool. Our algorithm is then used to filter the topic-aligned articles in order to remove XML tags, HTML tags, and noisy data (figures, references, tables, etc.). Lastly, unique IDs are used to tag the bilingual documents that make up each topic-aligned article corpus. It was also decided that, to extract parallel pairs, an automated parallel text mining process would be employed that finds the sentence matching closest to the translation within the article corpora.
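As an illustration of this preparation stage, the sketch below strips markup from topic-aligned article pairs and stores them with unique IDs in a relational database. The SQLite schema, the table and column names, and the naive tag-stripping rule are simplified assumptions for illustration; they are not the exact tooling used in the experiment.

import re
import sqlite3
import uuid

TAG_RE = re.compile(r"<[^>]+>")  # naive removal of XML/HTML tags

def clean(text):
    """Remove markup and collapse whitespace; real noise filtering
    (figures, references, tables) would require additional rules."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", text)).strip()

def store_article_pairs(pairs, db_path="articles.db"):
    """pairs: iterable of (source_article, target_article) raw strings."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS article_pairs "
        "(id TEXT PRIMARY KEY, src TEXT, trg TEXT)"
    )
    for src, trg in pairs:
        con.execute(
            "INSERT INTO article_pairs VALUES (?, ?, ?)",
            (str(uuid.uuid4()), clean(src), clean(trg)),
        )
    con.commit()
    con.close()

store_article_pairs([("<p>Bonjour le monde</p>", "<p>Hello world</p>")])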
This procedure opens a new window of opportunity: parallel corpora can be harvested from sources such as the web and translated documents, and the approach is not restricted to a particular language pair. Nevertheless, two selected languages must be used to create the alignment models. The resulting solution produces a similarity metric that gives a rough estimate, between 0 and 1, of the likelihood that one sentence is a translation of another. A sequence aligner is then used to produce the alignment that maximizes the sum of the similarities of the individual sentence pairs between the two documents [5]. The Needleman-Wunsch algorithm was used for this sequence alignment; the two selected documents are scrutinized to find the optimal alignment of the sentences they contain. The algorithm has a worst-case complexity of N*M, where N and M are the numbers of sentences in the two documents. As a result, only the sentence pairs with the highest probability are accepted as translations, and filtering the results ensures that only high-quality alignments are delivered. A threshold parameter is used for this purpose: a sentence pair with a similarity score below the threshold is omitted.
The sentence similarity metric, whose range is normalized to [0, 1], is produced by a statistical classifier. The classifier is trained to determine whether the two sentences in a pair are translations of each other; in our research, an SVM was used as the classifier. During classification, the distance of a sentence pair from the separating hyperplane is determined; this distance is then transformed by a sigmoid function that returns a value in [0, 1] [17]. Consequently, the quality and quantity of the resulting alignments depend heavily on the trained classifier. High-quality parallel data are required to create a good classifier, and a dictionary of translation probabilities can make the classifier even better [18]. The TED Talks corpus was used for this purpose only. To obtain such a dictionary, 1-grams were extracted from an already trained phrase table [18].
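To make the classification step concrete, the sketch below trains a small SVM on two simple sentence-pair features (length ratio and dictionary coverage) and maps the hyperplane distance to [0, 1] with a sigmoid, mirroring the description above. The feature set, the toy dictionary, and the training pairs are illustrative assumptions, not the actual features of the TED-trained classifier.

import numpy as np
from sklearn.svm import SVC

# Hypothetical 1-gram translation dictionary extracted from a phrase table.
dictionary = {"bonjour": "hello", "monde": "world", "merci": "thanks"}

def features(src, trg):
    """Very small feature set: length ratio and dictionary coverage."""
    src_tok, trg_tok = src.lower().split(), trg.lower().split()
    length_ratio = len(src_tok) / max(len(trg_tok), 1)
    coverage = sum(1 for w in src_tok if dictionary.get(w) in trg_tok) / max(len(src_tok), 1)
    return [length_ratio, coverage]

# Toy training pairs labelled as translation (1) or not (0).
train = [
    ("bonjour monde", "hello world", 1),
    ("merci", "thanks", 1),
    ("bonjour", "hello", 1),
    ("bonjour monde", "green bicycles everywhere", 0),
    ("merci", "the meeting starts at noon", 0),
    ("monde", "seventeen", 0),
]
X = np.array([features(s, t) for s, t, _ in train])
y = np.array([label for _, _, label in train])

clf = SVC(kernel="linear").fit(X, y)

def similarity(src, trg):
    """Distance to the separating hyperplane squashed into [0, 1]."""
    distance = clf.decision_function([features(src, trg)])[0]
    return 1.0 / (1.0 + np.exp(-distance))

print(round(similarity("bonjour monde", "hello world"), 2))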
5 The data mining process

The inspiration for the data mining process was the original Yalign tool (https://github.com/machinalis/yalign), but that tool was not feasible for a large-scale parallel data mining operation. The tool accepts input in the form of web links or plain text, and the classifier is loaded anew for each pair alignment; it should also be noted that the tool is single-threaded. In our re-implementation, the classifier is supplied with all articles in one session, which avoids having to reload the classifier for each execution and accelerates the process. This quicker solution not only increased the pace of the process, it also made it multi-threaded. Because of the extra computing power thus made available, the alignment algorithm was also redesigned (from A* to Needleman-Wunsch) so that better accuracy could be achieved. The processing time was reduced by a factor of 6.1 on a 4-core, 8-thread i7 CPU. Lastly, a tuning algorithm was added.

5.1 Needleman-Wunsch (NW) algorithm

The purpose of this algorithm is to align the elements of two sequences (phrases, letters, words, etc.). The algorithm was originally developed in the field of bioinformatics, where it is used for DNA and RNA comparison, but it can also be adapted for text comparison. It consists of two definition steps:
1. The similarity between elements is defined in the first step. A matrix X is used to define the similarity; it is an N*S matrix, where N is the number of elements in the first sequence and S is the number of elements in the second sequence. Each pair of elements is associated with a real number in the matrix; the more similar the elements, the higher the associated number. For example, consider a similarity matrix over (French phrase, English phrase) pairs with values from 0 to 1: a phrase pair associated with 0 has nothing in common, while 1 indicates the highest degree of similarity between the phrases.
2. The gap penalty is defined in the second step. If one of the elements must be associated with a gap in the other sequence, a penalty (p) is levied.
Starting from S(0, 0), which is 0 by definition, the S-matrix is then calculated. First, the first column and the first row of the matrix are initialized, after which the algorithm iterates over the remaining elements, proceeding from the upper-left corner to the bottom-right corner. This step is illustrated in Figure 1.

Figure 1: Needleman-Wunsch S-matrix calculation (first-column initialization, first-row initialization, calculation of the inner S-matrix cells).

Figure 2: Needleman-Wunsch S-matrix calculation with parallel threads (column and row initialization in parallel threads, followed by the anti-diagonals calculated one after another in parallel threads).

Conceptually, the two NW variants, with and without optimization on a graphics processing unit (GPU), are identical; however, the GPU-optimized variant has an efficiency advantage that depends on the hardware. What differs is how the S-matrix elements are calculated, and this operation is also part of the multi-threaded optimization we applied. The large number of GPU cores (e.g., CUDA cores) can execute these small operations in parallel. The elements are computed anti-diagonal by anti-diagonal, starting from the upper left and moving to the bottom right, as shown in Figure 2 [19].
The calculation of the S-matrix thus proceeds from the top-left corner. The value of a cell S(m, n) depends on its neighbours: the values above, S(m-1, n), to the left, S(m, n-1), and diagonally above-left, S(m-1, n-1), must be known in advance. S(m, n) can then be calculated with the following equation [20]:

S(m, n) = max( S(m-1, n-1) + X(m, n), S(m-1, n) - p, S(m, n-1) - p )

where X(m, n) is the similarity of the m-th element of the first sequence and the n-th element of the second sequence, and p is the gap penalty.
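For concreteness, the sketch below fills the S-matrix with this recurrence and recovers the aligned index pairs by backtracking. The word-overlap similarity, the toy documents, and the penalty value are placeholders (the real metric is the classifier score described in Section 4), and the anti-diagonal GPU parallelization is omitted for brevity.

import numpy as np

def needleman_wunsch(src, trg, similarity, penalty=0.2):
    """Return aligned (i, j) index pairs that maximize the summed similarity."""
    n, m = len(src), len(trg)
    S = np.zeros((n + 1, m + 1))
    # Initialize the first column and row with accumulated gap penalties.
    S[1:, 0] = -penalty * np.arange(1, n + 1)
    S[0, 1:] = -penalty * np.arange(1, m + 1)
    # Fill the matrix with S(m, n) = max(diagonal + X(m, n), up - p, left - p).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i, j] = max(
                S[i - 1, j - 1] + similarity(src[i - 1], trg[j - 1]),
                S[i - 1, j] - penalty,
                S[i, j - 1] - penalty,
            )
    # Backtrack from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if S[i, j] == S[i - 1, j - 1] + similarity(src[i - 1], trg[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i, j] == S[i - 1, j] - penalty:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

def overlap(a, b):
    """Toy similarity: word-overlap ratio between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

src_doc = ["the cat sleeps", "it is raining", "goodbye"]
trg_doc = ["the cat sleeps deeply", "unrelated filler line", "it is raining today", "goodbye"]
print(needleman_wunsch(src_doc, trg_doc, overlap))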
5.2 Tuning algorithm for the classifier

The SVM classifier defines the quality of the alignment; it creates a trade-off between recall and precision. The classifier has two configurable parameters:
• Threshold: the confidence value above which an alignment is accepted as 'good'; adjusting it trades precision against recall. The 'confidence' is the probability estimated by the support vector machine that the pair is 'a translation' rather than 'not a translation' (see https://github.com/machinalis/yalign/issues/3, accessed 10.11.2015).
• Penalty: controls the amount of 'slipping ahead' allowed during the alignment [5]. If the two documents are close translations with no extra paragraphs, the alignment should be close to one-to-one and the penalty should be high; if extra or missing fragments are expected, as when aligning subtitles, the penalty should be lower.
These parameters are selected automatically during training; however, they can be changed manually if necessary. A tuning algorithm is also introduced in our implementation of the solution (https://github.com/krzwolk/yalign), so that adjustments can be made for better accuracy. To perform the tuning, random articles are extracted from the corpus and aligned manually by humans. Given this information, the tuning mechanism searches for classifier parameter values, chosen at random, whose output is as similar as possible to that produced by a human. The percentage resemblance of the automatically aligned file to the human-aligned one is called the similarity. To check that the parameters were adjusted properly by the tuning algorithm, we measured the number of additionally obtained sentences. Table 1 shows the results of such an experimental check: one hundred random articles were selected from the link database for testing purposes and aligned manually by a human translator, and the trained classifiers were applied to the previously described text domain through the tuning script.

Table 1. Improvements in mining using a tuning script
Language    Improvement [%]
FR          11.75
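A minimal sketch of such a tuning loop is given below: it randomly samples threshold and penalty values and keeps the configuration whose automatic alignment agrees best with a small human-aligned reference set. The align() callback, the dummy aligner, the parameter ranges, and the data format are illustrative assumptions rather than the actual tuning script.

import random

def agreement(auto_pairs, human_pairs):
    """Percentage of human-aligned sentence pairs recovered automatically."""
    if not human_pairs:
        return 0.0
    return 100.0 * len(auto_pairs & human_pairs) / len(human_pairs)

def tune(align, articles, human_alignments, trials=200, seed=0):
    """Random search over (threshold, penalty); returns the best pair and its similarity."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(trials):
        threshold, penalty = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        scores = [
            agreement(align(article, threshold, penalty), gold)
            for article, gold in zip(articles, human_alignments)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = (threshold, penalty), mean_score
    return best_params, best_score

# Dummy aligner for demonstration: higher thresholds drop more sentence pairs.
def dummy_align(article_length, threshold, penalty):
    return {(i, i) for i in range(article_length) if (i + 1) / article_length >= threshold}

articles = [5, 8, 6]  # toy "articles", represented only by their sentence counts
gold = [{(i, i) for i in range(n)} for n in articles]
print(tune(dummy_align, articles, gold, trials=50))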
6 Results and conclusions

Bi-sentence extraction has become more and more popular in unsupervised learning for numerous specific tasks. Our method overcomes the disparities between the French-English pair and other languages. It also extends the scope of the WMT16 bilingual document alignment task by returning alignments at the sentence level. It is a language-independent method that can easily be adjusted to a new environment, and it only requires parallel corpora for the initial training.

Table 2. Resulting corpora specification
                           FR           EN
Bi-sentences               441,463      441,463
Number of words            9,941,281    10,140,950
Unique words               496,791      479,043
Average sentence length    20           21

Our experiments show that the method performs well. By mining 166,099 document pairs, we obtained 441,463 translated sentences (https://mega.nz/#F!hkEjFC4Q!lV9OJplRnsbtgveSLcc94g); the detailed corpora specification is shown in Table 2. The resulting corpora increased the MT quality in a wide news domain. We compared the WMT16 baseline news evaluation task results with the same system augmented with the mined data.
For baseline system training, we used the constrained data sets and the evaluation test files from WMT15 (http://www.statmt.org/wmt15/). To get a more accurate word alignment, we used the SyMGiza++ tool, which builds symmetrized word-alignment models. It computes alignment models that support many-to-one and one-to-many alignments between the given language pair, and it exploits a pool of several processors with up-to-date threading management, which makes it very fast. The alignment process utilizes four unique models during the training of the system to refine and enhance the alignment outcomes; these approaches have shown fruitful results [21]. Out-of-vocabulary (OOV) words are another challenge for the system to overcome. To deal with OOV words, we used the Moses toolkit together with the Unsupervised Transliteration Model (UTM) [22]. The KenLM tool was applied to the language model training; the language model order was set to 6, and the models were linearly interpolated [23]. The parallel data obtained through mining was added without domain adaptation.

Table 3. Improvements to MT with obtained data
Direction    System       BLEU
FR->EN       Baseline     31.14
FR->EN       Augmented    33.78
EN->FR       Baseline     32.25
EN->FR       Augmented    36.09
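The BLEU comparison reported in Table 3 can be reproduced in principle with a standard scorer such as sacreBLEU; the short sketch below shows the general pattern. The file names are hypothetical placeholders for the detokenized system outputs and the reference translations, one sentence per line.

import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

references = read_lines("newstest.ref.en")        # hypothetical reference file

for name in ("baseline", "augmented"):
    hypotheses = read_lines(name + ".output.en")  # hypothetical system outputs
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(name, "BLEU =", round(bleu.score, 2))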
Based on the results presented in Table 3, it can be projected that such BLEU differences can have a positive influence on real-life translation scenarios. From a practical point of view, the method requires neither expensive training nor language-specific grammatical resources, yet it produces satisfying results. It is possible to replicate such mining for any language pair or text domain, or for any reasonable comparable input data.

Acknowledgements

This work was financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education, and was backed by PJATK legal resources.

References

1. Krzysztof Wołk and Krzysztof Marasek. 2014. Real-Time Statistical Speech Translation. New Perspectives in Information Systems and Technologies, Volume 1. Springer International Publishing, p. 107-113.
2. Krzysztof Wołk and Krzysztof Marasek. 2013. Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2013. Proceedings of the 10th International Workshop on Spoken Language Translation, Heidelberg, Germany, p. 113-119.
3. Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.
4. Gonzalo García Berrotarán, Rafael Carrascosa, and Andrew Vine. Yalign documentation, https://yalign.readthedocs.org, accessed 01/2015.
5. Romain Dieny, Jerome Thevenon, Jesus Martinez-del-Rincon, and Jean-Christophe Nebel. 2011. Bioinformatics inspired algorithm for stereo correspondence. International Conference on Computer Vision Theory and Applications, Vilamoura - Algarve, Portugal.
6. Gabe Musso. 2007. Sequence alignment (Needleman-Wunsch, Smith-Waterman), http://www.cs.utoronto.ca/~brudno/bcb410/lec2notes.pdf.
7. Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), p. 261-268.
8. Mehdi Mohammadi and Nasser GasemAghaee. 2010. Building bilingual parallel corpora based on Wikipedia. 2010 Second International Conference on Computer Engineering and Applications (ICCEA). IEEE, p. 264-268.
9. Francis M. Tyers and Jacques A. Pienaar. 2008. Extracting bilingual word pairs from Wikipedia. Collaboration: Interoperability between People in the Creation of Language Resources for Less-Resourced Languages, p. 19-22.
10. Keiji Yasuda and Eiichiro Sumita. 2008. Method for building sentence-aligned corpus from Wikipedia. 2008 AAAI Workshop on Wikipedia and Artificial Intelligence (WikiAI08), p. 263-268.
11. Santanu Pal, Partha Pakray, and Sudip Kumar Naskar. 2014. Automatic Building and Using Parallel Resources for SMT from Comparable Corpora. Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) @ EACL 2014, p. 48-57.
12. Magdalena Plamada and Martin Volk. 2013. Mining for Domain-specific Parallel Text from Wikipedia. Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, ACL 2013, p. 112-120.
13. Jannik Strötgen, Michael Gertz, and Conny Junghans. 2011. An event-centric model for multilingual document similarity. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, p. 953-962.
14. Monica Lestari Paramita et al. 2013. Methods for collection and evaluation of comparable documents. Building and Using Comparable Corpora. Springer Berlin Heidelberg, p. 93-112.
15. Dekai Wu and Pascale Fung. 2005. Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. Natural Language Processing - IJCNLP 2005. Springer Berlin Heidelberg, p. 257-268.
16. Jonathan H. Clark et al. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. Association for Computational Linguistics, p. 176-181.
17. Krzysztof Wołk and Krzysztof Marasek. 2014. A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation. New Perspectives in Information Systems and Technologies, Volume 1. Springer International Publishing, p. 229-237.
18. Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, p. 355-362.
19. Krzysztof Wołk and Krzysztof Marasek. 2015. Tuned and GPU-accelerated parallel data mining from comparable corpora. Text, Speech, and Dialogue. Springer International Publishing, p. 32-40.
20. Chetan S. Khaladkar. 2009. An Efficient Implementation of Needleman-Wunsch Algorithm on Graphical Processing Units. PhD thesis, School of Computer Science and Software Engineering, The University of Western Australia.
21. Marcin Junczys-Dowmunt and Arkadiusz Szał. 2012. SyMGiza++: Symmetrized word alignment models for statistical machine translation. Security and Intelligent Information Systems. Springer Berlin Heidelberg, p. 379-390.
22. Nadir Durrani et al. 2014. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. EACL 2014, p. 148-153.
23. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, p. 187-197.