Unknown Words Modelling in Training and Using Language Models for Russian LVCSR System

Maxim Korenevsky, Anna Bulusheva, Kirill Levin
Speech Technology Center, Saint-Petersburg, Russia

Abstract

The paper considers some peculiarities of training and using N-gram language models with an open vocabulary. It is demonstrated that explicit modeling of the probability distribution of out-of-model (unknown) words is necessary in this case. Two known techniques for this modeling are considered, and a new technique with several advantages is proposed. We present experiments which demonstrate the consistency of the proposed approach.

1. Introduction

The most successful speech recognition systems are currently developed for English, which can be attributed both to the large amount of English recognition research and development and to the relative simplicity of the English language itself. The development of continuous speech recognition systems for many Slavic languages [1], including Russian, is much more difficult due to such features of these languages as rich morphology, high inflection and flexible word order. There is also a lack of research and publications concerning Russian speech recognition.

One of the major components of a recognition system is the language model (LM), which is used to evaluate the probability of different word sequences in an utterance. The most widespread approach is the use of N-gram statistical models that estimate the probability of a word given the (N-1) preceding words. There are a number of software products designed to train LMs from a given text corpus, including the CMU-Cambridge SLM Toolkit [2, 3], HTK HLM [4], and the SRILM Toolkit [5, 6], which offers excellent facilities for training LMs of various types. The SRILM Toolkit was chosen as the tool for N-gram language modeling in the Russian continuous speech recognition system being developed by Speech Technology Center, Ltd. [7]. This recognition system has been developed for several years in the segment of middle-sized vocabulary (up to 5000 words) systems, but recently it has been successfully applied to large vocabularies (50-100 thousand words) as well.

In our experiments on training LMs with the SRILM Toolkit, several important peculiarities of language modeling were revealed which are almost never mentioned in the literature, although a lack of understanding of them leads to serious difficulties both in training LMs and in their further use. This paper describes these peculiarities and the techniques used to interpret and account for them.

2. Out-of-vocabulary words and unknown words <unk> in a language model

Recognition systems with a middle-sized vocabulary usually deal with certain fixed topics, and it is often assumed that only the words included in the vocabulary may be uttered. These are so-called closed-vocabulary systems. In such systems, specially built LMs are usually used whose vocabulary includes the whole recognition vocabulary. However, as the recognition vocabulary increases and a general lexicon is used more extensively, the need to account for potential out-of-vocabulary (OOV) words arises.

The coverage of an arbitrary text by the recognition vocabulary is an important characteristic that limits the maximum achievable system accuracy. It is well known that the dependence of text coverage on vocabulary size differs essentially between languages. For example, about 65,000 words are sufficient to cover almost 99% of a general-lexicon English text [8].
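Text coverage by a fixed vocabulary is straightforward to measure. The sketch below computes it for a whitespace-tokenized text; it is only an illustration, and the file names and the simplistic tokenization are assumptions rather than part of the original experimental setup.

    # Estimate how much of a text is covered by a fixed vocabulary.
    # Assumes whitespace tokenization; a real setup would normalize
    # case and punctuation first.

    def coverage(vocab_path: str, text_path: str) -> float:
        with open(vocab_path, encoding="utf-8") as f:
            vocab = {line.strip() for line in f if line.strip()}
        total = covered = 0
        with open(text_path, encoding="utf-8") as f:
            for line in f:
                for token in line.split():
                    total += 1
                    covered += token in vocab
        return covered / total if total else 0.0

    if __name__ == "__main__":
        # placeholder file names
        print(f"coverage: {coverage('vocab_65k.txt', 'test_text.txt'):.2%}")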
Much larger vocabularies, however, are required for highly inflective languages like Russian to achieve comparable text coverage. According to [8], a 65,000-word vocabulary of the most frequent Russian words covers only about 92.4% of text, whereas almost 450,000 words are necessary for 99% coverage. The same figures were observed in our research. The number of free parameters of the language model grows rapidly as the vocabulary size increases, so for large vocabulary systems the LM vocabulary usually does not entirely cover the large recognition vocabulary. Thus the LM should be "open", i.e. it should account for the potential appearance of "unknown" words in a sentence and should estimate their probabilities correctly. (Hereafter the term "unknown" is used for words which are absent from the LM vocabulary, as opposed to "OOV" words, which are absent from the recognition vocabulary.)

Traditionally, the open vocabulary problem in LM training is solved by replacing all unknown words with the special token <unk>. The model is then trained in the conventional manner, as if <unk> were a normal word. In the SRILM Toolkit this mode is set by the command-line option "-unk". When an open-vocabulary LM is used, the appearance of an unknown word in a recognition hypothesis is handled by the N-grams containing the <unk> token. This is where the trouble lies: the direct use of probabilities of N-grams ending in <unk> is not quite correct. The problem is that the probability of such an N-gram is actually the sum of the probabilities of all potential N-grams ending in unknown words (with the same context), including those absent from the training corpus.

What does this mean from the recognition viewpoint? Imagine a slightly exaggerated example, namely an LM trained on a large training corpus but with a severely limited vocabulary. In this case the number of N-grams ending in <unk> is so large that their fraction (for an arbitrary but fixed context) may easily exceed that of any N-gram ending in a known word. As a result, the estimated probability of <unk>-ended N-grams may be substantially greater than that of all other N-grams with the same context, and this will make the recognition decoder prefer an unknown word to known words which are more suitable lexically and grammatically but have a smaller N-gram probability. Consequently, the best recognition hypothesis may correspond to an absolutely meaningless sentence containing unknown words modeled by <unk>.

Consider a simple example. A language model was trained by SRILM on a 60-million-word Russian training corpus, with the vocabulary limited to the 5K most frequent words of the corpus. Exploring the LM, we found that the log-probability of the trigram "Президент России Медведев" (meaning "Medvedev, the President of Russia") equals -2.579312, whereas the log-probability of the trigram "Президент России <unk>" is as large as -1.0086. This means that any meaningless trigram, for example "Президент России Буркина-Фасо" ("President of Russia Burkina Faso"), will get a much higher probability in recognition (the word "Буркина-Фасо" is unknown to the LM but was seen in the training corpus).
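Such probabilities can be inspected directly in the ARPA-format model file produced by SRILM. The following sketch looks up the log10 probability of a trigram that is explicitly listed in the model; it is illustrative only (no back-off computation is performed, and the file name is a placeholder).

    # Find the log10 probability of an explicitly listed trigram in an
    # ARPA-format language model file. Trigrams absent from the model
    # would require back-off, which this sketch does not implement.

    def trigram_logprob(arpa_path, w1, w2, w3):
        target = (w1, w2, w3)
        in_trigrams = False
        with open(arpa_path, encoding="utf-8") as f:
            for raw in f:
                line = raw.strip()
                if line == "\\3-grams:":
                    in_trigrams = True
                    continue
                if in_trigrams:
                    if line.startswith("\\"):      # next section or \end\
                        break
                    fields = line.split()
                    if len(fields) >= 4 and tuple(fields[1:4]) == target:
                        return float(fields[0])    # log10 probability
        return None                                # not listed explicitly

    # e.g. trigram_logprob("lm_5k.arpa", "Президент", "России", "<unk>")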
Clearly, the problem described above is absent in recognition systems whose vocabulary is entirely covered by the LM vocabulary. In large vocabulary systems with a large LM the problem also becomes less prominent, because the number of unknown words is relatively small and the superiority of their cumulative probability over the probabilities of known words is not so noticeable. This probably explains why the problem is almost never mentioned in scientific papers dealing with English and similar languages. Nevertheless, the issue exists, and it is more serious for Russian-like languages, because a much larger recognition vocabulary is needed for them, and a huge training corpus is required to consistently train an LM of comparable vocabulary size.

3. Perplexity

The main function of a language model is to predict the most probable word sequences in a text. To estimate the quality of this prediction, perplexity or per-word entropy is conventionally used. The per-word entropy (measured in bits) is the limit

    H = \lim_{m \to \infty} \left( -\frac{1}{m} \log_2 P(w_1, w_2, \ldots, w_m) \right),

where the word sequence probability P(w_1, w_2, \ldots, w_m) is calculated from the LM. So the per-word entropy is the negative probability logarithm averaged over all words. The language model perplexity is defined as

    PPL = 2^H = \lim_{m \to \infty} P(w_1, w_2, \ldots, w_m)^{-1/m}.

In practice, perplexity is calculated not as a limit but on a finite text. This value is called "the perplexity on a given text" and approaches the true perplexity as the text size increases. The lower the perplexity of an LM, the better it predicts an arbitrary new text. Clearly, the text prediction capability of a model depends directly on the quality of the estimation of its parameters during training, and this quality in turn depends on the training corpus size. Thus, for a fixed vocabulary size, the LM perplexity is expected to decrease monotonically and asymptotically as the training corpus size increases. On the other hand, a larger number of words allows for better text prediction, so the same perplexity behavior should be observed for an unlimited training corpus as the vocabulary size increases.

The SRILM Toolkit includes an application, ngram, designed in particular for calculating LM perplexity on given texts. In our experiments we found two peculiarities of this application's behavior which should be kept in mind when using it and taken into account when analysing results.

The first peculiarity is related to the estimation of closed-vocabulary LM perplexity. In this case the model does not contain the <unk> token and is in principle unable to predict word sequences containing unknown words. Consequently, the probability of such sequences must be zero from the LM's viewpoint, and the perplexity must accordingly be infinite. Nevertheless, SRILM returns a finite perplexity in this case by simply skipping the N-grams with unknown words. Ignoring this circumstance can easily lead to a misconception about the quality of the tested LM.

The second ngram peculiarity is related to the open-LM mode. In that case all unknown words in a text are replaced with the <unk> token, and the corresponding N-grams are used to calculate word sequence probabilities according to the LM. The perplexity values for a 650K-word text calculated by the SRILM ngram utility are shown in Table 1 (all LMs were trained on the same 40M-word corpus with identical discounting parameters):

Table 1. Perplexity values for different vocabulary sizes

    LM vocabulary size    30K      20K      10K
    Perplexity (PPL)      144.2    120.2    84.4
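For reference, the perplexity reported on a text corresponds to the following computation over per-word log probabilities. This is a minimal sketch, not SRILM's implementation; it assumes that base-10 log probabilities have already been obtained for each token of the test text (for example from an ARPA-format LM).

    import math

    # Perplexity from per-word log10 probabilities of a test text.
    # Equivalent to 2 ** H, where H is the per-word entropy in bits.

    def perplexity(logprobs_log10):
        m = len(logprobs_log10)
        avg_neg = -sum(logprobs_log10) / m      # average negative log10 prob
        return 10.0 ** avg_neg

    # Example: three tokens with log10 probabilities -1.0, -2.0 and -3.0
    print(perplexity([-1.0, -2.0, -3.0]))       # 100.0

    # The same value via the per-word entropy in bits:
    H = -sum(p * math.log2(10) for p in [-1.0, -2.0, -3.0]) / 3
    print(2 ** H)                               # ~100.0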
The behavior of perplexity in Table 1 is exactly the opposite of the theoretical behavior discussed above. It is caused by the same problem we have already seen, namely the overestimation of <unk>-ended N-gram probabilities. As the LM vocabulary decreases, the number of <unk> tokens increases rapidly, along with the probabilities of the corresponding N-grams. As a result, the perplexity of the smaller-vocabulary LM turns out to be lower, which completely contradicts the theory.

4. The <unk> modeling

The situations described above clearly demonstrate that <unk> does not simply substitute for any unknown word but instead represents the set of all words unknown to the LM, and this set requires special modeling. The probability of <unk>-ended N-grams should therefore be calculated as

    P(w_{\text{unknown}} \mid w_1, w_2, \ldots, w_{N-1}) = P(\text{<unk>} \mid w_1, w_2, \ldots, w_{N-1}) \cdot P(w_{\text{unknown}} \mid \text{<unk>}),

where w_{\text{unknown}} denotes any unknown word and P(w_{\text{unknown}} \mid \text{<unk>}) is the probability of picking this word from the <unk> set.

Curiously, the necessity of special modeling for unknown words in LMs has received virtually no coverage in the literature. Nevertheless, the SRILM Toolkit authors are of course aware of this necessity, and one of SRILM's frequently asked questions (FAQ) lists [9] proposes at least two approaches to such modeling.

The first approach is to split all unknown words of the training text into letters and then use these letters as separate tokens in LM training. After such splitting, the training text no longer contains unknown words, and a closed-vocabulary LM can be trained in the normal manner. The probability of a word sequence containing an unknown word can then be calculated by splitting it into letters and applying the LM directly. Although this approach yields probability estimates, the SRILM authors note that the values should be treated with caution, because the word-to-letter splitting changes the number of tokens in a word sequence.

The second approach is in some sense similar to the first one, but it involves explicit modeling of unknown word probabilities. It is proposed to create a list of all unknown words and to split them into letters in the same manner as above. One can then treat this list as a training text for a "letter" LM, which is able to estimate the probability P(w_{\text{unknown}} \mid \text{<unk>}) for any unknown word; these probabilities are then taken into account when calculating P(w_{\text{unknown}} \mid w_1, w_2, \ldots, w_{N-1}). The advantage of this approach over the first one is the independence of the letter LM from the original LM. One consequence is that the letter LM may be of a different N-gram order than the original LM (the authors suggest that seven is a good choice), and the estimated perplexity values will be quite informative. But this independence is also a drawback, because both LMs (the letter LM and the original one) have to be used in a probability calculation that is not supported by SRILM directly, so additional effort is required to compute LM perplexity with this approach. Besides, the idea of modeling word probabilities based on their orthography seems rather questionable, and this approach does not take into account the actual frequencies of unknown words in the training text.
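The letter-splitting step shared by both approaches can be illustrated as follows. This is a hedged sketch of the idea only; whitespace tokenization and the absence of any special marking of the generated letter tokens are simplifying assumptions.

    # Replace every word that is missing from the LM vocabulary by the
    # sequence of its letters, so the resulting token stream contains no
    # unknown words and a closed-vocabulary (or letter) LM can be trained.

    def split_unknowns_into_letters(tokens, lm_vocab):
        out = []
        for token in tokens:
            if token in lm_vocab:
                out.append(token)
            else:
                out.extend(token)           # one token per letter
        return out

    vocab = {"Президент", "России"}
    print(split_unknowns_into_letters(["Президент", "России", "Медведев"], vocab))
    # ['Президент', 'России', 'М', 'е', 'д', 'в', 'е', 'д', 'е', 'в']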
We propose a simple approach to explicit modeling of unknown word probabilities that makes it possible to keep a single LM and to calculate its perplexity directly by means of the SRILM ngram utility. The idea behind this approach is inspired by class (cluster) language models, which are widely used for limited amounts of training text [8, 10]. In class LM training, the whole word set is divided into several subsets (by expert or automatic methods) and a class label is assigned to each subset. After that, a normal LM is trained on the class label sequences, and additional models (usually simple lookup tables) are created to obtain the probabilities of words within the corresponding classes. The conventional formula for the class LM trigram probability is

    P(w_3 \mid w_1, w_2) = P(C(w_3) \mid C(w_1), C(w_2)) \cdot P(w_3 \mid C(w_3)),

where C(w) is the class label of the word w. The first factor in this expression is calculated from the class label LM, while the second one is taken from the probability lookup tables. The SRILM Toolkit not only supports class models of this type but also makes it possible to mix words and classes within a single LM. In this case, words and classes are treated as equivalent tokens, but additional probability tables are used for the words belonging to classes.

From this viewpoint, the unknown word set <unk> is nothing but a class whose label can be included in the LM along with known words. Consequently, for explicit <unk> modeling it is sufficient to complete the LM with a table of probabilities of a word belonging to the <unk> class. From the language model viewpoint, all words can be divided into three main groups, namely A - the LM vocabulary; B - all other words that were seen at least once in the training corpus; and C - the rest of the words (unseen in the training corpus). The <unk> class consists of all words from groups B and C. For group B words, the probability of belonging to the <unk> class can be estimated as the normalized frequency in the training corpus, i.e. the usual unigram probability. For group C words this is impossible, because they are absent from the training corpus; moreover, they cannot be listed explicitly in the probability lookup table for the <unk> class, so a special common token (for example <true_unk>) should be introduced for them. In this situation, discounting methods can easily be applied to take some probability mass from the seen words (group B) in favor of the unseen words (group C). We believe that Laplace discounting (also known as "+1 discounting") [11] is quite adequate here. Thus, the probability estimates are

    P(w \mid \text{<unk>}) = \frac{n(w) + 1}{\sum_{u \in B} n(u) + |B| + 1}, \quad \text{for } w \in B,

    P(\text{<true\_unk>} \mid \text{<unk>}) = \frac{1}{\sum_{u \in B} n(u) + |B| + 1},

where n(w) is the number of times the unknown word w was seen in the training corpus, and |B| is the size of group B.

The model described above still contains the class <true_unk>, which corresponds not to a single word but potentially to many rare words. However, it is not included in the main LM N-grams and thus, unlike the original <unk> class, does not accumulate all the probability of the unseen words. From the viewpoint of the recognition system, the class <true_unk> corresponds to all words of the recognition vocabulary that are not in the LM vocabulary, as well as to potential fillers (garbage models) intended to account for OOV words. An important merit of the proposed approach is full LM compatibility with the SRILM word-class format; thanks to this, LM perplexity can be calculated in the normal way using the ngram utility.
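A sketch of how the <unk> class probability table defined by the formulas above could be generated is given below. It is illustrative only: the tokenization is simplistic, the output is a plain word/probability listing, and the result would still need to be converted to the exact class-definition format expected by SRILM.

    from collections import Counter

    # Build the <unk> class probability table with Laplace ("+1") discounting:
    # group B = unknown words seen in the training corpus, group C = unseen
    # words, pooled into the single token <true_unk>.

    def unk_class_table(corpus_path, lm_vocab):
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                for token in line.split():
                    if token not in lm_vocab:
                        counts[token] += 1                # n(w) for w in B

        denom = sum(counts.values()) + len(counts) + 1    # sum n(u) + |B| + 1
        table = {w: (n + 1) / denom for w, n in counts.items()}
        table["<true_unk>"] = 1.0 / denom                 # mass left for group C
        return table

    # Example usage with placeholder inputs:
    # vocab = set(open("lm_vocab_50k.txt", encoding="utf-8").read().split())
    # for word, prob in unk_class_table("train_corpus.txt", vocab).items():
    #     print(f"{word}\t{prob:.6e}")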
5. Experiments

Experiments were conducted to confirm the efficiency and consistency of the proposed approach. Vocabularies of different sizes (consisting of the most frequent words) were chosen, and LMs were trained on a Russian training corpus of news transcriptions. All models were trained by SRILM with Good-Turing discounting and default values of the other parameters. The results are shown in Table 2 and Figure 1. The perplexity curves obtained by means of the second SRILM method are also provided for comparison (line 40M* on the left and line 50K* on the right).

Table 2. Perplexity values for the proposed <unk> modeling, depending on vocabulary and training corpus size

                          Training corpus size in words
    Vocabulary size      10M      20M      30M      40M
    10K                  1255     1078     981      909
    20K                  1052     858      751      677
    30K                  959      757      647      571
    40K                  905      697      587      509
    50K                  869      657      546      467
    60K                  844      628      517      439
    70K                  825      607      495      417
    80K                  811      591      478      400
    90K                  799      577      464      386
    100K                 790      566      453      374

These results demonstrate that the proposed approach to unknown word probability modeling produces perplexity behavior that is fully consistent with theory and common sense, namely a monotonic asymptotic decrease. The second SRILM method shows the same tendencies but with significantly larger perplexity values.

[Figure 1. The dependence of LM perplexity on vocabulary size for different training corpus sizes (left) and on training corpus size for different vocabulary sizes (right).]

In addition, Figure 1 (right) shows that a 50M-word training corpus is not sufficient in Russian for perplexity saturation even with a 10K-word vocabulary, let alone larger vocabularies. Recognition experiments with the proposed approach are not yet complete, but we believe that a more correct modeling of unknown words will improve recognition quality.

6. Conclusions and future work

We have demonstrated the necessity of special modeling of the unknown word set <unk> when training and using language models. We have proposed a simple approach to this modeling that allows N-gram probabilities to be calculated within a single LM framework. The approach is shown to be consistent with theory with respect to perplexity. In the future we plan to explore different unigram discounting methods for unknown word probabilities and to extend our results to larger corpus sizes. We will also study the effect of our approach on the quality of large vocabulary recognition of continuous Russian speech.

References

1. J. Nouza, J. Zdansky, P. Cerva, J. Silovsky. Challenges in Speech Processing of Slavic Languages (Case Studies in Speech Recognition of Czech and Slovak). Development of Multimodal Interfaces: Active Listening and Synchrony, Second COST 2102, 2009.
2. http://www.speech.cs.cmu.edu/SLM/toolkit.html
3. P. R. Clarkson, R. Rosenfeld. Statistical Language Modeling Using the CMU-Cambridge Toolkit. Proc. ESCA Eurospeech, 1997.
4. S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland. The HTK Book. Cambridge, 2009.
5. http://www-speech.sri.com/projects/srilm
6. A. Stolcke. SRILM: An Extensible Language Modeling Toolkit. Proc. ICSLP, 2002.
7. M. Tatarnikova, I. Tampel, I. Oparin, Y. Khokhlov. Building Acoustic Models for Large Vocabulary Continuous Speech Recognizer for Russian. Proc. SpeCom, 2006.
8. E. W. D. Whittaker. Statistical Language Modelling for Automatic Speech Recognition of Russian and English. PhD thesis, Cambridge University, 2000.
9. http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html
10. J. Gao, J. Goodman, J. Miao. The Use of Clustering Techniques for Asian Language Modeling. Computational Linguistics and Chinese Language Processing, 2001, vol. 6, no. 1, pp. 27-60.
11. D. Jurafsky, J. H. Martin. Speech and Language Processing. Prentice Hall, 2000, 975 pp.