
19 November 2015, Brno
Luděk Svozil
Impact of automated translation on mining knowledge from text data
Chapter 1

Introduction
• Statistical and hybrid machine translation (MT) systems are gaining more attention
• Apart from commercial services like Google Translate and Bing, there are a number of projects aiming to bring the benefits of big-data knowledge to end users

EU projects on the horizon
• Modern MT – aims to bring a powerful, ready-to-use MT system to desktop users http://www.modernmt.eu/
• LTI cloud – gathers language technology components for easy use in information systems http://www.ltinnovate.org/lticloud

• If machine translation is part of preprocessing, would it benefit the text-mining process? And how?
• Earlier experiments have shown that when scarce data are combined across different languages, MT greatly simplifies the problem

Test data and experiment
• 20 000 reviews in 5 languages from booking.com were subjected to Google machine translation and stemming; a C5.0 decision tree was then trained on them and evaluated using cross-validation

Results – % decrease in attribute count
                          ES   FR   PL   CS   DE
translation               24%  17%  42%  40%  29%
stemming                  37%  31%  20%  33%  16%
translation and stemming  41%  35%  56%  53%  44%

Results – avg. classification error
ES FR PL CS DE
Original 14.10% 14.10% 12.40%
Translated 14.10% 13.30% 11.30% 12.70% 12.00%
Stemmed 15.30% 14.00% 11.90% 11.80% 13.50%
Translated and stemmed 15.50% 15.50% 12.80% 14.60% 12.70% 13.70% 14.10%

• To observe how well translated data would combine with native English, another experiment was run
• 10 000 English documents were combined with another 10 000 from a different language; the other language was then translated with Google

Results – avg. classification error
                                 EN+FR   EN+PL   EN+DE   EN+ES
original                         16.10%  14.80%  14.60%  17.30%
non-English language translated  33.50%  33.90%  37.70%  36.10%

Conclusions
• MT simplifies the problem (reduces the dictionary) without increasing classification error
• Care is needed when combining native and translated documents

• Further details, tests and a comparison of rule-based and MT translators can be found in my bachelor's thesis "Dolování znalostí z vícejazyčných textových dat" (Mining knowledge from multilingual text data), which will be available during January-February 2016
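
The "% decrease in attributes count" figures reported in the results can be illustrated with a minimal Python sketch. It assumes a bag-of-words model in which each distinct term is one attribute. The three sample reviews, the helper names and the suffix-stripping `toy_stem` function are hypothetical stand-ins for illustration only; the slides do not say which stemmer or tokenizer was used, and no translation step is shown here.

```python
import re


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Hypothetical stand-in stemmer: strip a few common English suffixes.

    This is NOT the stemmer used in the experiment; it only illustrates
    how merging word forms shrinks the attribute set.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def attribute_count(docs, transform=lambda tok: tok):
    """Count distinct terms (attributes) after applying `transform` to each token."""
    return len({transform(tok) for doc in docs for tok in tokenize(doc)})


def percent_decrease(before, after):
    """Relative reduction in attribute count, in percent."""
    return 100.0 * (before - after) / before


# Made-up sample reviews, not data from the experiment.
docs = [
    "The rooms were cleaned daily and the staff helped us",
    "Clean room, helpful staff, they clean rooms every day",
    "Staff helping guests, cleaning rooms",
]

raw_count = attribute_count(docs)
stemmed_count = attribute_count(docs, toy_stem)
print(f"attributes before stemming: {raw_count}")
print(f"attributes after stemming:  {stemmed_count}")
print(f"decrease: {percent_decrease(raw_count, stemmed_count):.1f}%")
```

The same `attribute_count` comparison would apply to translation: passing the translated documents instead of the originals gives the "translation" row, and translating before stemming gives the combined row.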
