Automatic Parallel Data Mining After Bilingual Document Alignment
Conference Paper in Advances in Intelligent Systems and Computing · March 2017
DOI: 10.1007/978-3-319-56535-4_32
Krzysztof Wołk and Agnieszka Wołk
Polish-Japanese Academy of Information Technology
Automatic parallel data mining after bilingual
document alignment
Krzysztof Wołk¹, Agnieszka Wołk¹
¹ Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland
{kwolk, awolk}
Abstract. It has become essential to have precise translations of texts
from different parts of the world, but it is often difficult to fill the
translation gaps as quickly as might be needed. There are
multiple dictionaries that can help in this regard, and various
online translators exist to help cross this lingual bridge in many cases,
but even these resources can fall short of serving their true purpose:
the translators can provide an accurate meaning of the words
in a phrase, yet often miss the true essence of the language. The
research presented here describes a method that helps close this
lingual gap by extending certain aspects of the WMT16 document
alignment task. The goal is achieved by combining several
classifiers and alignment algorithms with parallel computation. We
carried out various experiments that allowed us to extract parallel data
at the sentence level, and this data proved capable of improving overall
machine translation quality.
Key words: SMT, quasi-comparable corpora, parallel corpora
generation, comparable corpora, unsupervised corpora acquisition,
data mining
1 Introduction
The main purpose of this research was to create a parallel text that will help us to
fill the translation gaps that we face when translating multilingual texts between
under-resourced languages. This was done using the documents that were issued
for the WMT16 alignment task. Through this research, we hoped to obtain the
most accurate data, with filtered parallel text that is further refined to achieve the
best performance. To build this research on a strong foundation, techniques such
as linear interpolation and in-domain adaptation were applied to the newly
discovered data as well. In addition, experiments were carried out using the
data from the WMT16 news translation task. While translating one corpus to
another, the true meaning of a text can be lost. Linguists have tried different
ways to eliminate this difference in translation. Apart from the primary step
towards achieving better translations (i.e., more data), classifiers were also
needed. Our classifiers were trained on the TED Talks domain. Our method
was evaluated by using different test samples from the WMT16 evaluation
sets and calculating BLEU metrics to determine the quality of our method
[1]. The Moses Statistical Translation Toolkit [2] was used during the
experiments, while the GIZA++ tool was used to train the alignment models
on parallel text and to conduct symmetrization (by making use of the Berkeley
Aligner) for the translation phrases. The SRI Language Modeling Toolkit
was used to train the statistical language models that often fail to fill in the
missing gaps. In addition, other data from thematic domains were filtered to
make the system more useful, because there is a need to fill in the linguistic
gaps during a translation from one language to another. We used Moore-Lewis filtering [3] for parallel in-domain data selection, and the
monolingual language models were linearly interpolated [4].
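The Moore-Lewis criterion scores each candidate sentence by the difference in cross-entropy between an in-domain and a general-domain language model, keeping the sentences that look most in-domain. A minimal sketch, using toy add-one-smoothed unigram models in place of the real language models (all names and data here are illustrative, not from the experiments):

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens):
    """Build a simple add-one-smoothed unigram model from a token list."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def moore_lewis_score(sentence, in_domain_lm, general_lm):
    """Cross-entropy difference H_in(s) - H_gen(s); lower = more in-domain."""
    toks = sentence.split()
    h_in = -sum(math.log(in_domain_lm(t)) for t in toks) / len(toks)
    h_gen = -sum(math.log(general_lm(t)) for t in toks) / len(toks)
    return h_in - h_gen

# keep the candidate sentences that look most like the in-domain data
in_lm = unigram_lm("the talk covers machine translation research".split())
gen_lm = unigram_lm("stocks fell sharply while the weather stayed dry".split())
pool = ["machine translation research talk", "the weather stayed dry"]
selected = sorted(pool, key=lambda s: moore_lewis_score(s, in_lm, gen_lm))
```

A real system would compute the cross-entropies with the trained n-gram models and keep only the top-scoring fraction of the pool.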
As far as data mining is concerned, the most accurate bilingual data were
refined using many other techniques, such as support vector machine (SVM)
for classification of sentences [5] and the Needleman-Wunsch [6] algorithm,
which aligns discovered phrases to fit the words in one language to where
they belong in a second language. The algorithms were re-implemented for
speed and quality improvements, and a graphics-processing unit was used for
parallel calculations.
2 Corpora types
A large collection of texts stored on a computing device is known as a
corpus (plural: corpora). In
linguistics, a parallel corpus is a term used for texts that are
translations of one another [4]. Two main types of such corpora can be distinguished:
• Comparable Corpus
• Parallel Corpus
A comparable corpus is a collection of texts that only includes the same type of
content in two or more languages, whereas a parallel corpus includes translations
of the content from one language to another. A comparable corpus consists of the
same type of content, but is not a translation of the content at all [4]. To
analyze a language, text alignment is necessary to determine the proximity
of sentences across language documents; it is difficult to discover how a
translation stands on its own. These documents help to fill in the gaps that most
machine translation engines leave behind. A typical machine translator is limited
to translating documents in a word-to-word or element-to-element manner. In
doing so, the meaning of phrases and the essence of the language are lost,
leaving behind a jumble of ‘noisy’ parallel alignments that barely make sense
when put together as a phrase. Sometimes, data from other sources is used that
may describe the content while translating from one language to another.
However, extra words are sometimes unintentionally added in these translations,
which can be misleading and change the meaning of the content. Linguists who
strive to achieve accurate translations have always considered this an added
difficulty that has yet to be adequately addressed.
Additional techniques are required to differentiate between the bilingual and
monolingual elements in texts in order to generate an accurate translation of the
content that leaves less room for misinterpretation while providing the highest
accuracy possible. A conditional probability determination must be used that can
compare one corpus to another for bilingual elements [4], because it is difficult
to evaluate a parallel text without refining it as much as possible to distinguish
between the two texts. Parallel corpora are relatively rare, even in the French-English linguistic pair, whereas monolingual data are more easily accessible [4]
in both of these languages. Being more precise, corpora types can be placed into
four categories. The rarest form is the parallel corpus, which consists of
documents that are direct translations of each other. This data can, however, use the
alignment of sentences at a phrasal level for better results and to provide
consistency in meaning. Second, 'noisy' parallel corpora are faulty and might even
contain inaccurate translations that can misguide the reader, as the language loses
most of its true meaning when it is translated. Third, a comparable corpus is
formed when there are bilingual documents without any specific translations and
sentences that lack alignment even though their topics are still aligned. A quasi-
comparable corpus only includes bilingual documents that might not be topic-
aligned [7] but which may have some phrasal alignments.
Considering these four different types, we deal with two types of corpora
in this research: comparable and quasi-comparable.
3 State of the art
As far as comparable corpora are concerned, different attempts have been
made for different domains to extract parallel phrases that are used as models
when translating. There have been several different approaches, of which
two have been the most useful in generating the most accurate results. One of
these is based on cross-lingual information retrieval, whereas the other is
simply to translate the data using any MT system. During this process,
related documents are first translated and then compared with the target
language texts, which can help in finding the closest translations. An
interesting way to determine such data by using Wikipedia was adopted in
[8]. The authors proposed two different approaches:
1. The first approach was to employ an online translation system that can,
for example, translate from Dutch to English, and to later compare the
translated data with the EN pages on which that content is already available.
2. The second idea was simply to use a dictionary generated from the
Wikipedia titles of the documents.
The first idea is much less problematic in practice, but its approach is limited. The
second was initially rejected because it generated results
that were not accurate. However, the second method was improved in [9] by
adding one more step to the translation process. It was shown that the
precision level in [8] improved from 21% to 48% after using the new method
[9]. Yasuda and Sumita [10] developed another improvement for machine
translation, which was based on statistics that generate phrases. In their
method, the alignment is achieved by using a lexicon that is already updated,
which means that the corpus they are using has been trained to serve in that
way. In their research, they proved that 10% of Japanese Wikipedia
sentences correspond with English Wikipedia sentences.
Furthermore, Tyers and Pienaar [9] succeeded in leveraging Interwiki
links. Filtering Wikipedia’s link structure also generated a bilingual
dictionary, and mismatches between links were detected as well. In their
research, they discovered that the precision of a translation depends strictly
on the language. In [11], the authors attempted to filter their work using their
research. Moreover, they used annotations provided on Wikipedia, as well as
lexicon models that generate translations automatically. They were finally
able to achieve a precision of 90% through their methods. A new and
improved method for parallel corpora was also introduced by the author of
[12], who used textual entailment and an SMT system to generate data. The
author reports noticeable improvements by using this aligned data in German
as well as in French.
In addition, a different way to use Wikipedia was established by Plamada
and Volk [12]. Previous methods limited the application of an algorithm on
sentences, but it could potentially be used as the final data generated during
translations. Their method rejected the position of potential sentences and
ranked them on the basis of customized metrics. Instead of an element-to-element translation, their method attempts to achieve semantic accuracy in the
text by limiting the process of translation to a specified domain. In evaluation of
their method, they noted an improvement of 0.5 points in the
BLEU metric, but only in text outside the targeted domains. There was
almost no improvement for in-domain text. In [13], the authors proposed to
provide limited information for data generation. These included the title of
the text to be translated and the publication date and time, instead of
providing the complete document as was done in prior research. The
similarity of titles was the only factor that was used to generate the data well.
In research that was described in [14], the authors produced a document
based on events they observed that can be used as a model. To calculate the
metric of their experiment, they used documents that modeled sets of events
that are temporal and geographical in their nature. These temporal and
geographical expressions are the basis on which the documents are based.
The authors in [15] were also of the opinion that machine translators must be
used that incorporate comparable corpora, including different texts from
Wikipedia and Twitter. Their view was that this will help to generate better
results; hence, they used domains to generate documents that will help to
gather data for comparison during translations. Recently, a new method that
is inspired by Yalign [4] has been used. However, that method was also not
feasible at first; it was slow and did not help to fill the gaps. It was later
improved through the use of new methods, and it now produces better results.
4 Parallel data mining
Methodologies for analyzing data sources that have parallel corpora without
sentence alignment, such as quasi-comparable, comparable, and noisy
parallel corpora, were used in this research. Bilingual text samples, which
were provided through a baseline script acquired from the WMT16
alignment task, were used in this experiment. Through this procedure, 166,099 quasi-comparable and
comparable article pairs were obtained by using the baseline script (provided
by the WMT16 organizers). For SVM classifier training, the TED corpus
from the IWSLT 2015 evaluation campaign was used as wide-domain data
for the mining experiment. The selected classifier text domain was one that
is widely charted and covers a multitude of subject areas, which was also the
reason for its usage. The data included 2.5 million untokenized words [16].
Our solution is composed of three main steps:
• In the first and second steps, quasi-comparable and comparable
data are collected, and alignment is performed at the article level.
Parallel sentences are then mined from the results that are obtained.
• The third step is of great importance, since the results have
widely varying quantities and qualities. A sentence that is
produced may not contain the desired result and may not
correspond with the corpus as a translation. Indirect and poor
translations from the article corpus may also hinder the alignment
process. On the other hand, the alignment may not be feasible for
the process itself, which will make it difficult to use. In any case,
certain steps must be verified, and the text used in the mining-tool
process must be prepared beforehand.
To start, a relational database should be used to save all the article-level
data. Afterwards, article pairs are aligned with the WMT16 baseline tool.
Our algorithm is used to filter the topic-aligned articles to remove XML tags,
HTML tags, or noisy data (figures, references, tables, etc.). Lastly, unique
IDs are used to tag the bilingual documents that comprise each topic-aligned
article corpus. It was also decided that, to extract parallel pairs, an automated
parallel text mining process would be employed that can find the sentence
matching closest to the translation of the article corpora.
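The preparation steps above (strip markup and noise, then tag each topic-aligned pair with a unique ID) can be sketched as follows; the regular expressions and the `doc-` ID scheme are illustrative assumptions, not the exact filters used in the experiments:

```python
import re

TAG = re.compile(r"<[^>]+>")       # XML/HTML tags
NOISE = re.compile(r"\[\d+\]")     # crude bracketed-reference noise

def clean_article(text):
    """Remove markup and obvious reference noise before mining."""
    text = TAG.sub(" ", text)
    text = NOISE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def tag_pairs(article_pairs):
    """Give each topic-aligned bilingual article pair a unique ID."""
    return {f"doc-{i:06d}": (clean_article(src), clean_article(tgt))
            for i, (src, tgt) in enumerate(article_pairs)}

pairs = tag_pairs([("<p>Hello <b>world</b></p>", "<p>Bonjour le monde</p>")])
# pairs["doc-000000"] == ("Hello world", "Bonjour le monde")
```

In the actual pipeline the cleaned pairs would then be stored in the relational database keyed by these IDs.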
This procedure provided a new window of opportunity; the parallel
corpora can be harvested from such sources as the web and translated
documents, which are not restricted to a particular language pair.
Nevertheless, two selected languages should be used to create the alignment
models. The resulting solution, when implemented, produced a similarity
metric that gave a rough estimate from 0 to 1 of the likelihood of one
sentence being a translation of another.
A sequence aligner was used to produce the maximum sum of the
similarities of the individual sentence pairs between the two documents [5].
The Needleman-Wunsch algorithm was used for the sequence alignment; the
two selected documents were scrutinized to find the optimal alignment of the
sentences used in both. The algorithm has N*M worst-case
complexity. As a result, only the sentence pairs with the highest probability of
being translations are retained, and filtering the results ensures that only high-quality
alignments are delivered. A threshold parameter is used for this process: a
sentence pair that has a similarity score below the threshold is omitted.
The statistical classifier of the algorithm is used for the output of the
sentence similarity metric, whose range is normalized to [0, 1]. A classifier is
trained to determine whether the two sentences in the pair are translations of
each other. In our research, an SVM was used as the classifier. During
classification, the classifier determines the distance of a sample from the
separating hyperplane; this distance can then
be mapped by a sigmoid function that returns a value in [0, 1] [17].
Consequently, the trained classifier strongly affects the quality and
quantity of the resulting alignments. High-quality parallel data are
required to create a good classifier, and a dictionary for translation
probability can also make a classifier better [18]. The TED Talks corpus was
used for this purpose only. To obtain a dictionary, 1-grams were extracted
from a phrase table that was already trained [18].
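A rough sketch of how a dictionary-based feature can be turned into a [0, 1] similarity score via a sigmoid; the fixed `weight` and `bias` stand in for a trained SVM hyperplane, and the tiny dictionary is purely illustrative (a real system would use 1-grams extracted from the trained phrase table):

```python
import math

def sigmoid(x):
    """Map an unbounded decision score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def similarity(src, tgt, dictionary, weight=4.0, bias=-2.0):
    """Score one sentence pair: fraction of source words whose
    dictionary translation appears in the target sentence, pushed
    through a sigmoid. `weight`/`bias` are illustrative stand-ins for
    a learned hyperplane."""
    src_toks, tgt_toks = src.lower().split(), set(tgt.lower().split())
    hits = sum(1 for t in src_toks if dictionary.get(t) in tgt_toks)
    coverage = hits / max(1, len(src_toks))
    return sigmoid(weight * coverage + bias)

dico = {"house": "maison", "red": "rouge", "the": "la"}
good = similarity("the red house", "la maison rouge", dico)
bad = similarity("the red house", "il pleut beaucoup", dico)
# a threshold (e.g. 0.5) would keep the first pair and drop the second
```

In the real pipeline the score feeds the sequence aligner and the threshold filter described above.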
5 The data mining process
The inspiration for the data mining process was the native Yalign tool, but it
was not feasible for a large-scale parallel data-mining operation. The tool can
accept input in the form of web links or plain text, while each pair alignment
is loaded with a classifier. It should be noted that this tool is single-threaded.
In our re-implementation, the classifier was supplied with articles in one
session so that the process could be accelerated. This avoids having to reload
the classifier for each execution. This quicker solution not only increased the
pace of the process, it also made it multi-threaded. Because of the extra
computing demand, the alignment algorithm was redesigned (from A* to
Needleman-Wunsch) so that better accuracy was achieved. The processing time
was reduced by a factor of 6.1 through the use of a 4-core, 8-thread i7 CPU.
Lastly, a tuning algorithm was added.
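The single-load, multi-threaded design described above can be sketched as below; the `Aligner` class and its trivial `align` body are stand-ins for the real classifier-plus-alignment logic:

```python
from concurrent.futures import ThreadPoolExecutor

class Aligner:
    """Stand-in for the sentence-pair classifier; loaded once, reused."""
    def __init__(self):
        self.model = object()   # imagine an expensive classifier load here

    def align(self, pair):
        # placeholder alignment: pair up sentences positionally
        src, tgt = pair
        return [(s, t) for s, t in zip(src, tgt) if s and t]

def mine(article_pairs, workers=8):
    aligner = Aligner()                      # loaded a single time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(aligner.align, article_pairs))

results = mine([(["a", "b"], ["x", "y"])])
```

The point of the design is that the classifier is constructed once per session and shared by all worker threads, instead of being reloaded for every article pair.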
5.1 Needleman-Wunsch (NW) algorithm
The purpose of this algorithm was to align the elements of two sequences
(phrases, letters, words, etc.). It consists of two steps:
1. In the first step, the similarity between the elements of the two
sequences is defined. Matrix X is used to represent the similarity (an N*S
matrix), where N is the number of elements in the first sequence and S
is the number of elements in the second. The algorithm was developed in the
field of bioinformatics, where it is used for DNA and RNA
comparison, but it can also be conditioned for use in text comparison.
With this intention, each pair of elements is associated with a real
number in the matrix. If the elements are similar, then the associated number will
be higher. For example, consider a similarity matrix (phrase-French,
phrase-English) with values from 0 to 1. If a phrase pair has 0 associated
with it, then the phrases have nothing in common, while 1 would
suggest the highest degree of similarity between the phrases. The
definition of the similarity matrix is the primary output of this step.
2. The second step defines the gap penalty. If one of the
elements is associated with a gap in the other sequence then, in this
elements is associated with a gap in the other sequence then, in this
case, it is necessary to levy a penalty (p). Starting from S(0, 0),
which is also 0 by definition, the performance of the S-matrix is
calculated. First, the columns and rows of the matrix are initialized,
followed by an iteration through the remaining elements by the
algorithm, proceeding from the upper-left corner to the
bottom right. This step is illustrated in Figure 1.
Figure 1: Needleman-Wunsch S-matrix calculation (first-column initialization, first-row initialization, then calculation of the interior S-matrix cells)
Figure 2: Needleman-Wunsch S-matrix calculation with parallel threads (first-column and first-row initialization in parallel threads, then the 1st, 2nd, and 3rd anti-diagonals calculated in parallel)
Conceptually, the two NW algorithms are identical, with or without optimization
through the Graphics Processing Unit (GPU); however, the GPU-accelerated version has an
advantage in terms of efficiency that depends on the hardware. The individual S-matrix
elements differ only in how their values are calculated, and this calculation is the part
targeted by the multi-threaded optimization we applied. A large number of GPU cores,
such as CUDA cores, can execute these small operations. The elements are computed
diagonally in parallel to obtain the results, starting from the upper left and moving to the
bottom right, as seen in Figure 2 [19].
The calculation of the S matrix proceeds from the top-left corner. To obtain the value of
a cell S(m, n), the values of its top neighbor S(m-1, n), left neighbor S(m, n-1), and
top-left neighbor S(m-1, n-1) must be known in advance. The calculation of S(m, n) can
then be performed with the following equation [20]:

S(m, n) = max( S(m-1, n-1) + X(m, n), S(m-1, n) - p, S(m, n-1) - p )
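Putting the initialization and the anti-diagonal sweep together, the S-matrix fill can be sketched as follows, with a NumPy vectorization standing in for GPU threads (function and variable names are illustrative):

```python
import numpy as np

def nw_scores(sim, p):
    """Fill the Needleman-Wunsch S-matrix one anti-diagonal at a time.

    Cells on anti-diagonal d = i + j depend only on diagonals d-1 and
    d-2, so every cell of a diagonal can be computed in parallel (here
    vectorized with NumPy; on a GPU each cell would map to one thread).
    `sim` is the N x M similarity matrix X, `p` is the gap penalty.
    """
    n, m = sim.shape
    S = np.zeros((n + 1, m + 1))
    S[:, 0] = -p * np.arange(n + 1)      # first-column initialization
    S[0, :] = -p * np.arange(m + 1)      # first-row initialization
    for d in range(2, n + m + 1):        # sweep the anti-diagonals
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        S[i, j] = np.maximum(S[i - 1, j - 1] + sim[i - 1, j - 1],
                             np.maximum(S[i - 1, j] - p, S[i, j - 1] - p))
    return S

sim = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy 2x2 similarity matrix
S = nw_scores(sim, p=0.5)                  # S[-1, -1] is the best score
```

Each anti-diagonal update applies exactly the max-of-three recurrence above to all of its cells at once.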
5.2 Tuning algorithm for the classifier
The SVM classifier is used to define the quality of the alignment; it creates a trade-off
between recall and precision. The classifier has two configurable variables:
• Threshold: an alignment is accepted as 'good' if its confidence exceeds the
threshold. For less recall and more precision, the value should be
raised. The probability estimated by the support vector machine is the
'confidence' with which a pair is classified as 'a translation' or 'not a translation'.
• Penalty: controls the amount of 'slipping ahead' allowed in the
alignment [5]. When aligning texts such as subtitles, where there are no extra or missing
paragraphs and the alignment should be one-to-one, the penalty
should be high. The penalty should be lower when the translations are
less literal and extra paragraphs may occur.
These parameters are automatically selected during training; however, they can
be manually changed if necessary. A tuning algorithm is also introduced in our
implementation of the solution. In this research, adjustments are allowed for
better accuracy. Random articles of the corpus must be extracted to perform the
tuning; humans can manually align these random articles. Given the information
provided, the tuning mechanism evaluates the classifier with randomly selected
parameters, trying to find the output that is as similar as possible to that
produced by a human. The percentage resemblance of the automatically aligned
file to a human-aligned one is called the similarity. To check the proper adjustment of
the parameters by the tuning algorithm, we measured the number of additionally
obtained sentences. Table 1 shows the results of such an experimental check. One
hundred random articles were selected from the link database for testing purposes and
aligned manually by a human translator. Trained classifiers were used on a previously
described text domain through a tuning script.
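The tuning loop can be sketched as a random search over the two parameters, scored by similarity to a human-aligned reference; `mine` below is a placeholder for running the aligner with the given parameters, and the parameter ranges are illustrative assumptions:

```python
import random

def similarity_to_human(auto_pairs, human_pairs):
    """Fraction of human-aligned sentence pairs the tool reproduced."""
    return len(set(auto_pairs) & set(human_pairs)) / max(1, len(human_pairs))

def tune(mine, human_pairs, trials=50, seed=0):
    """Random search over (threshold, penalty); `mine` is assumed to be
    a callable that runs the aligner with the given parameters."""
    rng = random.Random(seed)
    best = (None, -1.0)
    for _ in range(trials):
        params = {"threshold": rng.uniform(0.3, 0.9),
                  "penalty": rng.uniform(0.0, 1.0)}
        score = similarity_to_human(mine(**params), human_pairs)
        if score > best[1]:
            best = (params, score)
    return best

# toy stand-in: a strict threshold drops the weaker of two human pairs
human = [("a", "x"), ("b", "y")]
def toy_mine(threshold, penalty):
    return human if threshold < 0.6 else human[:1]

params, score = tune(toy_mine, human)
```

The real tuning script compares against one hundred manually aligned articles rather than a toy pair list.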
Table 1. Improvements in mining using a tuning script (improvement given in %)
6 Results and conclusions
Bi-sentence extraction has become increasingly popular in unsupervised learning
for numerous specific tasks. Our method overcomes the disparities between the
French-English pair and other languages. It also extends the scope of the WMT16
bilingual document alignment task by returning alignments at the sentence level. It is
a language-independent method that can be easily adjusted to a new environment, and
it only requires parallel corpora for the initial training.
Table 2. Resulting corpora specification (number of words, unique words, and sentence length)
Our experiments show that the method performs well. By mining 166,099
document pairs, we successfully obtained 441,463 translated sentences. (The detailed
corpora specification is shown in Table 2.)
The resulting corpora increased the MT quality in a wide news domain. We
compared the WMT16 baseline news evaluation task results with the same system
augmented with mined data. For baseline system training, we used constrained data
sets and the evaluation test files from WMT15.
To get a more accurate word alignment, we used the SyMGiza++ tool, which
assists with the formation of symmetric word-alignment models. This particular
tool develops alignment models that perform multiple many-to-one and one-to-many
alignments between the given language pairs. SyMGiza++ also exploits a
pool of several processors, supported by the latest threading management techniques,
which makes the process high-speed. The alignment process used in such cases
utilizes four unique models during the training of the system to refine and enhance
alignment outcomes. These approaches have shown fruitful results [21].
Out-of-vocabulary (OOV) words are another challenge for the system to overcome.
To deal with OOV words, we used the Moses toolkit and the Unsupervised
Transliteration Model (UTM) [22]. The KenLM tool was applied to the language
model training. The language model order was set to 6, and the models were linearly
interpolated [23]. The parallel data that was obtained was added without domain adaptation.
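Linear interpolation combines the component language models as a weighted sum of their probabilities, P(w | h) = Σᵢ λᵢ Pᵢ(w | h), with the λᵢ non-negative and summing to one. A minimal sketch, with toy callables standing in for the KenLM model queries:

```python
def interpolate(models, weights):
    """Linear interpolation: P(w | h) = sum_i w_i * P_i(w | h).
    `models` are callables returning conditional probabilities;
    `weights` must be non-negative and sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return lambda word, hist: sum(w * m(word, hist)
                                  for m, w in zip(models, weights))

# toy component models standing in for real language-model queries
news = lambda w, h: {"economy": 0.02}.get(w, 0.001)
ted  = lambda w, h: {"economy": 0.005}.get(w, 0.002)

lm = interpolate([news, ted], [0.7, 0.3])
p = lm("economy", ("the",))   # 0.7*0.02 + 0.3*0.005 = 0.0155
```

In practice the interpolation weights are chosen to minimize perplexity on a held-out in-domain set rather than fixed by hand.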
Table 3. Improvements to MT with obtained data
Based on the results presented in Table 3, it can be projected that such BLEU
differences can have a positive influence on real-life translation scenarios. From a
practical point of view, the method requires neither expensive training nor language-specific grammatical resources, but it produces satisfying results. It is possible to
replicate such mining for any language pair or text domain, or for any reasonable
comparable input data.
Work financed as part of the investment in the CLARIN-PL research infrastructure
funded by the Polish Ministry of Science and Higher Education, and backed by
the PJATK legal resources.

References
1. Krzysztof Wołk and Krzysztof Marasek. 2014. Real-Time
Statistical Speech Translation. New Perspectives in Information
Systems and Technologies, Volume 1. Springer International
Publishing, p. 107-113.
2. Krzysztof Wołk and Krzysztof Marasek. 2013. Polish–English Speech
Statistical Machine Translation Systems for the IWSLT 2013.
Proceedings of the 10th International Workshop on Spoken
Language Translation, Heidelberg, Germany. p. 113-119.
3. Philipp Koehn. 2009. Statistical machine translation.
Cambridge University Press.
4. Gonzalo García Berrotarán, Rafael Carrascosa, and Andrew
Vine. Yalign documentation, accessed 01/2015
5. Romain Dieny, Jerome Thevenon, Jesus Martinez-del-Rincon, and
Jean-Christophe Nebel. 2011. Bioinformatics inspired algorithm for
stereo correspondence. International Conference on Computer Vision
Theory and Applications. Vilamoura - Algarve, Portugal.
6. Gabe Musso. 2007. Sequence alignment (Needleman-Wunsch, Smith-Waterman).
7. Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012.
Wit3: Web inventory of transcribed and translated talks. Proceedings
of the 16th Conference of the European Association for Machine
Translation (EAMT). p. 261-268.
8. Mehdi Mohammadi and Nasser GasemAghaee. 2010. Building
bilingual parallel corpora based on Wikipedia. Computer
Engineering and Applications (ICCEA), 2010 Second International
Conference on. IEEE, p. 264-268.
9. Francis M. Tyers and Jacques A. Pienaar. 2008. Extracting
bilingual word pairs from Wikipedia. Collaboration:
interoperability between people in the creation of language
resources for less-resourced languages, p. 19-22.
10. Keiji Yasuda and Eiichiro Sumita. 2008. Method for building
sentence-aligned corpus from Wikipedia. 2008 AAAI Workshop on
Wikipedia and Artificial Intelligence (WikiAI08), p. 263-268.
11. Santanu Pal, Partha Pakray, and Sudip Kumar Naskar. 2014.
Automatic Building and Using Parallel Resources for SMT from
Comparable Corpora. Proceedings of the 3rd Workshop on Hybrid
Approaches to Translation (HyTra)@ EACL, 2014. p. 48-57.
12. Magdalena Plamada and Martin Volk. 2013. Mining for Domain-specific
Parallel Text from Wikipedia. Proceedings of the Sixth
Workshop on Building and Using Comparable Corpora, ACL 2013.
13. Jannik Strötgen, Michael Gertz, and Conny Junghans. 2011. An event-centric
model for multilingual document similarity. Proceedings of
the 34th international ACM SIGIR conference on Research and
development in Information Retrieval. ACM, p. 953-962.
14. Monica Lestari Paramita et al. 2013. Methods for collection and
evaluation of comparable documents. Building and Using
Comparable Corpora. Springer Berlin Heidelberg, p. 93-112.
15. Dekai Wu and Pascale Fung. 2005. Inversion transduction
grammar constraints for mining parallel sentences from
quasicomparable corpora. Natural Language Processing–
IJCNLP 2005. Springer Berlin Heidelberg, p. 257- 268.
16. Jonathan H. Clark et al. 2011. Better hypothesis testing for statistical
machine translation: Controlling for optimizer instability.
Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies: short
papers-Volume 2. Association for Computational Linguistics, p. 176-181.
17. Krzysztof Wołk and Krzysztof Marasek. 2014. A Sentence Meaning
Based Alignment Method for Parallel Text Corpora Preparation. New
Perspectives in Information Systems and Technologies, Volume
1. Springer International Publishing, p. 229-237.
18. Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain
adaptation via pseudo in-domain data selection. Proceedings of the
Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, p. 355-362.
19. Krzysztof Wołk and Krzysztof Marasek. 2015. Tuned and GPU-accelerated
parallel data mining from comparable corpora. Text,
Speech, and Dialogue. Springer International Publishing, p. 32-40.
20. Chetan S. Khaladkar. 2009. An Efficient Implementation of
Needleman-Wunsch Algorithm on Graphical Processing Units.
PhD thesis, School of Computer Science and Software
Engineering, The University of Western Australia.
21. Marcin Junczys-Dowmunt and Arkadiusz Szał. 2012. Symgiza++:
symmetrized word alignment models for statistical machine
translation. Security and Intelligent Information Systems.
Springer: Berlin Heidelberg. p. 379-390.
22. Nadir Durrani et al. 2014. Integrating an Unsupervised
Transliteration Model into Statistical Machine Translation.
EACL, 2014. p. 148-153.
23. Kenneth Heafield. 2011. KenLM: Faster and smaller language
model queries. Proceedings of the Sixth Workshop on Statistical
Machine Translation. Association for Computational Linguistics,
2011. p. 187-197.