Impact of automated translation on mining knowledge from text data
Luděk Svozil
19. 11. 2015, Brno

Introduction
• Statistical and hybrid machine translation (MT) systems are gaining more attention
• Apart from commercial services like Google Translate and Bing, there are a number of projects aiming to bring the benefits of big data knowledge to end users

EU projects on the horizon
• Modern MT: aims to bring a powerful, ready-to-use MT system to desktop users http://www.modernmt.eu/
• LTI cloud: gathers language technology components for easy use in information systems http://www.ltinnovate.org/lticloud

• If machine translation is part of preprocessing, does it benefit the text-mining process? And how?
• Earlier experiments have shown that when scarce data are combined across different languages, MT greatly simplifies the problem

Test data and experiment
• 20 000 reviews in 5 languages from booking.com were machine-translated with Google Translate and stemmed; a C5.0 decision tree was then trained on them and evaluated using cross-validation

Results: % decrease in attribute count
                           ES    FR    PL    CS    DE
translation                24%   17%   42%   40%   29%
stemming                   37%   31%   20%   33%   16%
translation and stemming   41%   35%   56%   53%   44%

Results: avg. classification error
Columns: ES, FR, PL, CS, DE (values listed in source order; the column alignment of this table is not recoverable)
Original: 14.10% 14.10% 12.40%
Translated: 14.10% 13.30% 11.30% 12.70% 12.00%
Stemmed: 15.30% 14.00% 11.90% 11.80% 13.50%
Translated and stemmed: 15.50% 15.50% 12.80% 14.60% 12.70% 13.70% 14.10%

• To observe how well translated data combine with native English, another experiment was run
• 10 000 English documents were combined with another 10 000 documents in a different language; the non-English documents were then translated with Google Translate

Results: avg. classification error
                                  EN+FR    EN+PL    EN+DE    EN+ES
original                          16.10%   14.80%   14.60%   17.30%
non-English language translated   33.50%   33.90%   37.70%   36.10%

Conclusions
• MT simplifies the problem (it reduces the dictionary) without increasing classification error
• Care must be taken when combining native and translated documents: in the combined experiment the average error rose from 14.60-17.30% to 33.50-37.70%

• Further details, tests, and a comparison of rule-based and MT translators can be found in my bachelor's thesis „Dolování znalostí z vícejazyčných textových dat“ ("Mining knowledge from multilingual text data"), which will be available during January-February 2016
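
The "% decrease in attribute count" reported in the results can be illustrated with a minimal sketch. It assumes that an attribute is a distinct word token in a bag-of-words representation, which the slides do not state explicitly. The helpers `crude_stem`, `vocabulary`, and `percent_decrease` are hypothetical names, the toy suffix stripper merely stands in for whatever stemmer the experiment used, and the three sample sentences are invented for illustration rather than taken from the booking.com data.

```python
def crude_stem(token: str) -> str:
    """Hypothetical toy stemmer: strips a few common English suffixes.

    A stand-in for a real stemmer; it keeps at least three characters.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(docs, transform=lambda tok: tok):
    """Return the set of distinct (optionally transformed) tokens in docs."""
    return {transform(tok) for doc in docs for tok in doc.lower().split()}


def percent_decrease(before: int, after: int) -> float:
    """Relative reduction in attribute count, in percent."""
    return 100.0 * (before - after) / before


# Invented sample reviews (assumption: not from the original data set).
docs = [
    "the room was cleaned daily",
    "rooms were clean and cleaning was quick",
    "staff cleans the rooms",
]

raw = vocabulary(docs)                  # attributes before preprocessing
stemmed = vocabulary(docs, crude_stem)  # attributes after stemming

print(f"attributes before: {len(raw)}, after: {len(stemmed)}")
print(f"decrease: {percent_decrease(len(raw), len(stemmed)):.1f}%")
```

The same `percent_decrease` comparison would apply to a translated corpus: build the vocabulary of the original-language documents and of their translations, then compare the two counts.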