Statistical language modeling – overview of my work
Character based language models
Tomáš Mikolov, 2010

Motivation – why this?
- No OOV problem
- More information for the model (a model can make words out of characters, but not vice versa)
- Theoretically cleaner solution (no ad-hoc definition of what a "word" is)

Why not?
- Worse performance: larger models (the ARPA LM format is not suitable for long-context LMs), lower word accuracy
- Smoothing seems to be the weak point: word histories should not be clustered just by the length of the context!
- However, some people claim smoothing is not important...

Has anyone worked on this before?
Yes, many researchers:
- Mahoney, Schmidhuber – text compression, information-theoretic approach
- Elman – models of language based on connectionist models, linguistic approach
- Carpenter – language modeling for classification etc., just started reading that...

Comparison of standard LM and RNN LM
Conclusion: a simple RNN can learn long-context information (6-9 gram and maybe more)

MODEL        ENTROPY
RNN 160      41 227
RNN 320      40 582
RNN 640      40 484
RNN 1280     40 927
LM 4gram     41 822
LM 6gram     39 804
LM 9gram     40 278

Comparison of char-based and word-based LMs

MODEL                          WORD ERROR RATE
Baseline – word bigram         32.6%
Word KN 4gram                  30.4%
Char 6gram                     36.5%
Char 9gram                     32.3%
Word KN 4gram + Char 9gram     30.3%

Task: RT07, LM trained just on Switchboard

What can be done
- Combining strengths: automatically derived "word" units
- What is a word? This varies a lot across languages
- Word boundaries in English can be found automatically: high entropy at the first few characters of a word, very low entropy in the rest of the word

Example from Elman

Example of automatically derived lexical units
Word boundaries are simply placed at positions with high entropy (a trivial approach; a rough sketch follows the transcript)

LETTERS   WORDS     SUBWORDS   MULTIWORDS
A         YEAH      RE         YOUKNOW
S         HMMM      TH         KINDOF
I         WE        CO         YEAHI
O         OKAY      SE         ANDI
E         AND       DON        ONTHE
T         BECAUSE   LI         ITIS

Conclusion I.
- Units similar to words can be automatically detected in sequential data by using entropy
- We can attempt to build an LVCSR system without any implicit language model – it can be learned from the data
- However, these learned units are not words
- This approach is quite biologically plausible: the "weak" model here is the one that works on the phoneme/character level, a higher model works on words/subwords, and an even higher one may work on phrases etc.

Conclusion II.
- LVCSR: acoustic models + language models
- Language models are very good predictors for phonemes in the middle/end of words
- Shouldn't acoustic models focus more on the first few phonemes of words? (or perhaps simply on high-entropy phonemes, as given by the LM)
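
The entropy-based boundary detection described above is easy to sketch. Below is a minimal Python illustration, not the original implementation: it trains a character n-gram model by counting, computes the Shannon entropy of the next-character distribution after each history, and starts a new unit wherever that entropy exceeds a threshold. The n-gram order, the threshold value, and the toy corpus are assumptions made for the example; the original experiments used Switchboard text.

```python
import math
from collections import defaultdict

def train_char_ngram(text, order=4):
    """Count, for each (order-1)-character history, how often each
    next character follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    padded = " " * (order - 1) + text
    for i in range(order - 1, len(padded)):
        history = padded[i - order + 1:i]
        counts[history][padded[i]] += 1
    return counts

def next_char_entropy(counts, history):
    """Shannon entropy (bits) of the next-character distribution after
    `history`; high entropy means the model is uncertain what comes next."""
    dist = counts.get(history)
    if not dist:
        return float("inf")  # unseen history: treat as maximally uncertain
    total = sum(dist.values())
    return -sum(c / total * math.log2(c / total) for c in dist.values())

def segment(text, counts, order=4, threshold=0.5):
    """Start a new unit before every character whose history has
    next-character entropy above `threshold` (the 'trivial approach'
    from the slides; the threshold value here is a guess)."""
    units, current = [], ""
    padded = " " * (order - 1) + text
    for i in range(order - 1, len(padded)):
        history = padded[i - order + 1:i]
        if current and next_char_entropy(counts, history) > threshold:
            units.append(current)
            current = ""
        current += padded[i]
    if current:
        units.append(current)
    return units

# Toy stand-in for conversational training text, just to make the example run.
corpus = "yeahyouknowkindofokayyouknowyeahokaykindof" * 100
model = train_char_ngram(corpus, order=4)
print(segment("yeahyouknowokay", model, order=4, threshold=0.5))
```

With a real corpus and a tuned threshold, the recovered units tend to be a mixture of words, subwords, and multiword chunks, as in the LETTERS / WORDS / SUBWORDS / MULTIWORDS table above.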