Unknown Words Modelling in Training and Using
Language Models for Russian LVCSR System.
Maxim Korenevsky, Anna Bulusheva, Kirill Levin
Speech Technology Center, Saint-Petersburg, Russia
Abstract
The paper considers some peculiarities of training and using N-gram language models with open
vocabulary. It is demonstrated that explicit modeling of the probability distribution of out-of-model
(unknown) words is necessary in this case. Two known techniques for this modeling are considered
and a new technique with several advantages is proposed. We present experiments which
demonstrate the consistency of the proposed approach.
1. Introduction
The most successful speech recognition systems are now developed for English, which can be
attributed to both the large amount of English recognition research and development and the relative
simplicity of the English language itself. The development of continuous speech recognition
systems for many Slavic languages [1] including Russian is much more difficult due to such
features of these languages as rich morphology, a high degree of inflection, and flexible word order. There is
also a lack of research and publications concerning Russian speech recognition.
One of the major components of a recognition system is a language model (LM) which is used to
evaluate the probability of different word sequences in an utterance. The most widespread approach
is the use of N-gram statistical models that estimate the probability of a word given (N-1) preceding
words. There are a number of software products designed to train LMs from a given text corpus,
including the CMU-Cambridge SLM Toolkit [2,3], HTK HLM [4], and the SRILM Toolkit [5,6],
which offers excellent facilities for training LMs of various types.
The SRILM Toolkit was chosen as a tool to perform N-gram language modeling for the Russian
continuous speech recognition system that is being developed by Speech Technology Center,
Ltd. [7]. This recognition system has been developed for several years in the segment of middle-sized vocabulary (up to 5000 words) systems, but recently it has been successfully applied to large
vocabularies (50-100 thousand words) as well. In our experiments on training LMs with the
SRILM Toolkit, several important peculiarities of language modeling were revealed that are
almost never mentioned in the literature. A lack of understanding of these peculiarities leads to serious
difficulties both in the training of LMs and in their further use. This paper describes these
peculiarities and the techniques used to interpret and account for them.
2. Out-of-vocabulary words and unknown words <unk> in a language model
Recognition systems with a middle-sized vocabulary usually deal with certain fixed topics and it is
often assumed that only the words included in the vocabulary may be uttered. These are so-called
closed-vocabulary systems. In such systems specially built LMs are usually used whose vocabulary
includes the whole recognition vocabulary.
However, as the recognition vocabulary increases and a general lexicon is used more extensively,
the need to account for potential out-of-vocabulary (OOV) words arises. The coverage of an
arbitrary text by the recognition vocabulary is an important feature that limits maximum achievable
system accuracy. It is well-known that the dependence of text coverage on vocabulary size is
essentially different for different languages. For example, about 65000 words are sufficient to cover
almost 99% of a general lexicon English text [8]. But much larger vocabularies are required for
highly inflective languages like Russian to achieve comparable text coverage. According to [8], a
65000-word vocabulary of the most frequent Russian words covers only about 92.4% of text,
whereas almost 450000 words are necessary for 99% coverage. The same figures were observed in
our research.
The number of free parameters of the language model increases rapidly as vocabulary size
increases. For large vocabulary systems an LM vocabulary usually does not entirely cover the large
recognition vocabulary. Thus the LM should be “open”, i.e. it should account for the potential
appearance of “unknown” words in a sentence and should correctly estimate their probabilities. (Hereafter the term “unknown” is used for words that are absent from the LM vocabulary, as opposed to “OOV” words, which are absent from the recognition vocabulary.)
Traditionally the open vocabulary problem in LM training is solved by replacing all unknown
words by the special token <unk>. Then the model is trained in a conventional manner as if <unk>
were a normal word. In SRILM Toolkit this mode is set by the command line option “-unk”.
When an open-vocabulary LM is used, the appearance of an unknown word in a recognition
hypothesis is handled by the N-grams containing the <unk> token. This is where the trouble lies. The
direct use of the probabilities of N-grams ending in <unk> is not quite correct. The problem is that the
probability of such an N-gram is actually the sum of the probabilities of all potential N-grams ending
in unknown words (with the same context), including those absent from the training corpus!
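To make this concrete, here is a toy Python sketch (the counts, the context and the tiny vocabulary are invented for illustration, not taken from our corpus) showing how mapping every out-of-vocabulary word to <unk> lets the <unk> N-gram accumulate the probability mass of many distinct words:

    from collections import Counter

    # Invented continuations of the context "president of" in a toy corpus;
    # only "russia" and "france" are kept in a deliberately tiny LM vocabulary.
    continuations = ["russia"] * 30 + ["france"] * 20 + \
        ["burundi", "tuvalu", "nauru", "palau", "kiribati"] * 10
    vocab = {"russia", "france"}

    # Replace every unknown continuation with <unk>, as training with "-unk" does.
    counts = Counter(w if w in vocab else "<unk>" for w in continuations)
    total = sum(counts.values())
    for token, c in counts.most_common():
        print(f"P({token} | president of) = {c / total:.2f}")

    # Each unknown word alone has probability 0.10, less than "russia" (0.30),
    # but the aggregated <unk> token gets 0.50 and would win during decoding.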
What does this mean from the recognition viewpoint? Imagine a slightly exaggerated example,
namely an LM trained on a large training corpus but with a severely limited vocabulary. In this case
the number of N-grams ending in <unk> is so large that their fraction (for an arbitrary but fixed
context) may easily exceed that of any N-gram ending in a known word. As a result, the estimated
probability of <unk>-ended N-grams may be substantially greater than that of all other N-grams
with the same context. This situation will make the recognition decoder prefer an unknown
word to known words that are more suitable lexically and grammatically but have a smaller N-gram probability. Consequently, the best recognition hypothesis may correspond to an absolutely
meaningless sentence that contains unknown words modeled by <unk>.
Let’s consider a simple example. The language model was trained with SRILM on a 60-million-word
training corpus of Russian. The vocabulary was limited to 5K of the most frequent words of the
training corpus. Exploring the LM, we found that the log-probability of the trigram “Президент
России Медведев” (meaning “Medvedev, the President of Russia”) equals -2.579312, whereas the
log-probability of the trigram “Президент России <unk>” is as large as -1.0086! This means that
any meaningless trigram, for example “Президент России Буркина-Фасо” (“President of Russia
Burkina Faso”), will get a much higher probability during recognition: the difference of about 1.57 in
log10 probability makes the <unk> trigram roughly 37 times more probable (the word “Буркина-Фасо” is
unknown to the LM but was seen in the training corpus).
Clearly the problem described above is absent in recognition systems whose vocabulary is entirely
covered by the LM’s vocabulary. In large vocabulary systems with a large LM this problem also
becomes less prominent, because the number of unknown words is relatively small and the
superiority of their cumulative probability over the probabilities of known words is not so noticeable.
This probably explains why this problem is almost never referred to in scientific papers dealing
with English and similar languages. Nevertheless, the issue exists and it is more serious for
Russian-like languages, because a much larger recognition vocabulary is needed for them. And a
huge training corpus is required to consistently train an LM of comparable vocabulary size.
3. Perplexity
The main function of a language model is to predict the most probable word sequences in text. To
estimate the quality of this prediction, perplexity or per-word entropy is conventionally used. The
per-word entropy (measured in bits) is a limit
1
Hereafter the term “unknown” will be used for words which are absent in the LM vocabulary, as opposed to “OOV”
words which are absent in the recognition vocabulary.
1
log 2 P(w1 , w2 , , wm )
m
where the word sequence probability P(w1 , w2 , , wm ) is calculated from the LM. So the per-word
entropy is a negative probability logarithm averaged over all words. The language model
perplexity is defined as follows:
PPL = 2 H = lim P(w1 , w2 , , wm ) 1 / m
H =  lim
m 
m
In practice perplexity is calculated not as a limit but from a finite text. This value is called “the
perplexity on a given text” and approaches the original perplexity as text size increases.
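As a reminder of what is being computed, here is a minimal Python sketch of perplexity on a finite text; the per-word probabilities are assumed to come from some LM and are invented here:

    import math

    def entropy_and_ppl(word_probs):
        """Per-word entropy H (in bits) and perplexity PPL = 2**H, computed
        from the conditional probabilities P(w_i | w_1 .. w_{i-1}) of a text."""
        m = len(word_probs)
        log2_prob = sum(math.log2(p) for p in word_probs)  # log2 P(w_1, ..., w_m)
        h = -log2_prob / m
        return h, 2.0 ** h

    # Toy text of four words with assumed conditional probabilities.
    h, ppl = entropy_and_ppl([0.1, 0.02, 0.3, 0.05])
    print(f"H = {h:.2f} bits, PPL = {ppl:.1f}")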
The lower the perplexity of a LM, the better it predicts an arbitrary new text. Clearly, the text
prediction capabilities of a model are directly dependent on the quality of estimation of its
parameters during training. And this quality is in turn dependent on the training corpus size. Thus
the LM perplexity for a fixed vocabulary size is expected to monotonically and asymptotically
decrease as the training corpus size increases. On the other hand, a larger number of words allows
for better text prediction. Accordingly, the same perplexity behavior should be observed for an
unlimited training corpus when vocabulary size increases.
The SRILM Toolkit includes an application, ngram, designed in particular for calculating LM perplexity
on given texts. In our experiments we found two peculiarities of this application’s behavior that
should be kept in mind when using it and taken into account when analysing results.
The first peculiarity is related to the estimation of closed-vocabulary LM perplexity. In this case the
model does not contain the <unk> token and is in principle unable to predict word sequences
containing unknown words. Consequently, the probability of such sequences must be zero from the
LM’s viewpoint, and perplexity must accordingly be infinite. Nevertheless SRILM returns a finite
perplexity in this case by simply skipping the N-grams with unknown words. Ignoring this
circumstance can easily lead to a misconception about the quality of the tested LM.
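The following sketch (toy probabilities, not SRILM internals) reproduces the effect: if N-grams containing unknown words are silently skipped, the reported perplexity stays finite and can look deceptively good, even though a true closed-vocabulary model assigns such a text zero probability:

    import math

    def ppl(probs):
        return 2.0 ** (-sum(math.log2(p) for p in probs) / len(probs))

    # Per-word probabilities that a closed-vocabulary LM assigns to a toy sentence;
    # None marks positions where the word is missing from the LM vocabulary.
    probs = [0.20, 0.05, None, 0.10, None, 0.30]

    known_only = [p for p in probs if p is not None]
    print("PPL over known words only:", round(ppl(known_only), 1))
    # The honest perplexity of the whole sentence is infinite, because the model
    # gives zero probability to the two unknown words.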
The second ngram peculiarity is related to the open-vocabulary LM mode. In that case all the
unknown words in a text are replaced with the <unk> token, and the corresponding N-grams are
used to calculate word sequence probabilities according to the LM. The perplexity values for a
650K-word text calculated by the SRILM ngram are shown in Table 1 (all LMs were trained on the
same 40M-word corpus with identical discounting parameters):
Table 1. Perplexity values for different vocabulary sizes

    LM vocabulary size     30K      20K      10K
    Perplexity (PPL)       144.2    120.2    84.4
The behavior of perplexity here is exactly the opposite of the theoretical behavior discussed above. It is
caused by the same problem we have already seen, namely the overestimation of the probabilities of
<unk>-ended N-grams. As the LM vocabulary shrinks, the number of <unk> tokens grows rapidly, and so do the
probabilities of the corresponding N-grams. As a result the perplexity of the smaller-vocabulary LM turns out
to be lower, which totally contradicts the theory.
4. <unk> modeling
The described situations clearly demonstrate that <unk> does not simply substitute for any unknown word
but instead represents the set of all the LM’s unknown words, which requires special modeling. The
probability of <unk>-ended N-grams should therefore be calculated as follows:
    P(w_unknown | w_1, w_2, ..., w_{N-1}) = P(<unk> | w_1, w_2, ..., w_{N-1}) · P(w_unknown | <unk>),

where w_unknown denotes any unknown word and P(w_unknown | <unk>) is the probability of picking this
word from the <unk> set.
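A minimal sketch of this factorization, with hypothetical probability tables standing in for the real LM and the real <unk> model:

    def p_unknown(word, context, p_ngram, p_in_unk):
        """P(w_unknown | w_1..w_{N-1}) = P(<unk> | w_1..w_{N-1}) * P(w_unknown | <unk>)."""
        return p_ngram[(context, "<unk>")] * p_in_unk[word]

    # Assumed toy values for a single trigram context (not taken from a real model).
    p_ngram = {(("президент", "россии"), "<unk>"): 0.098}   # from the N-gram LM
    p_in_unk = {"буркина-фасо": 0.004}                       # from the <unk> model
    print(p_unknown("буркина-фасо", ("президент", "россии"), p_ngram, p_in_unk))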
Curiously, the necessity of special modeling for unknown words in LMs has virtually no coverage
in the literature. Nevertheless, the SRILM Toolkit authors are of course aware of this necessity. One
of SRILM’s frequently asked questions (FAQ) lists [9] proposes at least two approaches to such
modeling. The first one is to split all unknown words of the training text into letters and then use
these letters as separate tokens in LM training. After such splitting the training text will no longer
contain unknown words, and a closed-vocabulary LM can be trained in the normal manner. The
probability of a word sequence containing an unknown word can now be calculated by splitting it
into letters and applying the LM directly. Although this approach allows us to get probability
estimates, the SRILM authors note that these values should be treated with caution. This is because the
word-to-letters splitting changes the number of tokens in a word sequence.
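The preprocessing implied by this first approach can be sketched as follows (a hypothetical helper, not SRILM code; the vocabulary is assumed to be given as a set of word strings):

    def split_unknowns_into_letters(tokens, vocab):
        """Replace every out-of-vocabulary token with a sequence of single-letter
        tokens, so that a closed-vocabulary LM can be trained on the result."""
        out = []
        for tok in tokens:
            if tok in vocab:
                out.append(tok)
            else:
                out.extend(tok)          # "медведев" -> "м", "е", "д", "в", ...
        return out

    vocab = {"президент", "россии"}
    print(split_unknowns_into_letters("президент россии медведев".split(), vocab))
    # ['президент', 'россии', 'м', 'е', 'д', 'в', 'е', 'д', 'е', 'в']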
The second approach is in some sense similar to the first one but it involves explicit unknown word
probability modeling. It is proposed to create the list of all unknown words and to split them into
letters in the same manner as above. Then one can treat this list as a training text to train a “letter”
LM which will be able to estimate the probability P(w_unknown | <unk>) for any unknown word and take
it into account when calculating P(w_unknown | w_1, w_2, ..., w_{N-1}). The advantage of this approach over
the first one is the independence of the letter LM from the original LM. One of the consequences of
this is that the letter LM may be of a different N-gram order than the original LM (the authors think
that seven is a good choice), and the estimated perplexity values will be quite informative. But this
independence is a drawback as well, because both LMs (letter and original) need to be used in a
probability calculation that is not supported in SRILM directly. So additional efforts have to be
applied to compute LM perplexity in this approach. Moreover, the idea of modeling word
probabilities based on their orthography seems rather questionable, and this approach does not take
into account the actual frequencies of unknown words in the training text.
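For illustration only, the letter LM of the second approach can be imitated by a simple unsmoothed character N-gram model trained on the unknown-word list; this sketch ignores discounting and is not the SRILM implementation:

    from collections import defaultdict

    ORDER = 3  # the SRILM authors suggest a higher order (e.g. seven) for the real letter LM

    def train_letter_lm(unknown_words, order=ORDER):
        """Collect unsmoothed character N-gram counts over a list of unknown words."""
        counts = defaultdict(lambda: defaultdict(int))
        for word in unknown_words:
            chars = ["<s>"] * (order - 1) + list(word) + ["</s>"]
            for i in range(order - 1, len(chars)):
                counts[tuple(chars[i - order + 1:i])][chars[i]] += 1
        return counts

    def p_word_given_unk(word, counts, order=ORDER):
        """Approximate P(word | <unk>) as a product of character N-gram probabilities."""
        p = 1.0
        chars = ["<s>"] * (order - 1) + list(word) + ["</s>"]
        for i in range(order - 1, len(chars)):
            ctx = counts[tuple(chars[i - order + 1:i])]
            total = sum(ctx.values())
            p *= ctx[chars[i]] / total if total else 0.0
        return p

    lm = train_letter_lm(["буркина-фасо", "бурунди", "нанотехнологии"])
    print(p_word_given_unk("бурунди", lm))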
We propose a simple approach to explicit modeling of unknown word probabilities. It allows us to
have only one LM and to calculate its perplexity directly by means of the SRILM ngram. The idea
behind this approach is inspired by class (cluster) language models, which are widely used for
limited amounts of training text [8, 10]. In class LM training, the whole word set is divided into
several subsets (by expert or automatic methods) and a class label is assigned to each subset. After
that a normal LM is trained on class label sequences, and additional models (usually simple lookup
tables) are created to obtain the probabilities of words within the corresponding classes. The
conventional formula to calculate the class LM trigram probability is as follows:
    P(w_3 | w_1, w_2) = P(C(w_3) | C(w_1), C(w_2)) · P(w_3 | C(w_3)),

where C(w) is the class label of the word w. The first factor in this expression is calculated from
the class label LM, while the second one is calculated from the probability lookup tables. The SRILM
Toolkit not only supports class models of this type but also makes it possible to mix words and
classes within a single LM. In this case, words and classes are treated as equivalent tokens but
additional probability tables are used for words belonging to classes. From this viewpoint, the
unknown word set <unk> is nothing but a class whose label can be included in the LM along
with known words. Consequently, for explicit <unk> modeling it is sufficient to supplement the LM
with a table of the probabilities of words belonging to the <unk> class.
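Schematically, the mixed word/class computation looks as follows (toy tables invented for illustration; inside SRILM the lookup is performed by the toolkit itself):

    def trigram_prob(w1, w2, w3, ngram_p, class_of, within_class_p):
        """P(w3 | w1, w2) in a mixed word/class LM: words that belong to a class are
        replaced by the class label in the N-gram, and P(w3 | C(w3)) is multiplied in."""
        t1, t2, t3 = (class_of.get(w, w) for w in (w1, w2, w3))
        p = ngram_p[(t1, t2, t3)]
        if w3 in class_of:               # w3 is a class member, e.g. an unknown word
            p *= within_class_p[(w3, t3)]
        return p

    # Assumed toy probabilities: "медведев" is a known word, "буркина-фасо" is in <unk>.
    ngram_p = {("президент", "россии", "медведев"): 2.6e-3,
               ("президент", "россии", "<unk>"): 9.8e-2}
    class_of = {"буркина-фасо": "<unk>"}
    within_class_p = {("буркина-фасо", "<unk>"): 4.0e-3}

    print(trigram_prob("президент", "россии", "буркина-фасо", ngram_p, class_of, within_class_p))
    # With the class factor included, the unknown-word trigram (9.8e-2 * 4e-3 = 3.9e-4)
    # no longer outweighs the meaningful trigram (2.6e-3).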
From the language model viewpoint, all words can be divided into three main groups, namely:
- A – the LM vocabulary;
- B – all the other words that were seen at least once in the training corpus;
- C – the rest of the words (unseen in the training corpus).
The <unk> class consists of all words from groups B and C. For group B words the probability of
belonging to the <unk> class can be estimated as a normalized frequency in the training corpus, i.e.
the usual unigram probability. For group C words this is impossible because they are absent from the
training corpus. Moreover, they cannot be listed explicitly in the probability lookup table for the
<unk> class, so a special common token (for example <true_unk>) should be introduced for them.
But in this situation discounting methods can be easily applied to take some probability from seen
words (group B) in favor of unseen words (group C). We think that Laplace discounting (also
known as “+1 discounting”) [11] may be quite adequate. Thus, the probability estimates are as
follows:

    P(w | <unk>) = (n(w) + 1) / (Σ_{u∈B} n(u) + |B| + 1),   for w ∈ B,

    P(<true_unk> | <unk>) = 1 / (Σ_{u∈B} n(u) + |B| + 1),

where n(w) is the number of times the unknown word w was seen in the training corpus, and |B| is the
size of group B.
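A minimal sketch of these estimates, assuming the unknown-word frequencies have already been counted from the training corpus:

    from collections import Counter

    def unk_class_probs(unknown_counts):
        """Laplace (+1) discounted probabilities of words within the <unk> class.

        Every group B word w gets (n(w) + 1) / (N + |B| + 1), and the single token
        <true_unk>, standing for all unseen (group C) words, gets 1 / (N + |B| + 1),
        where N is the total count of unknown-word tokens in the training corpus.
        """
        n_total = sum(unknown_counts.values())
        denom = n_total + len(unknown_counts) + 1
        probs = {w: (n + 1) / denom for w, n in unknown_counts.items()}
        probs["<true_unk>"] = 1.0 / denom
        return probs

    # Invented counts of two unknown words seen in the training corpus.
    probs = unk_class_probs(Counter({"буркина-фасо": 12, "нанотехнологии": 5}))
    print(probs)                # {'буркина-фасо': 0.65, 'нанотехнологии': 0.3, '<true_unk>': 0.05}
    print(sum(probs.values()))  # 1.0 -- the estimates are properly normalized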
The model described above still contains the token <true_unk>, which corresponds not to a single
word but potentially to many rare words. However, it is not included in the main LM N-grams
and thus, unlike the original <unk> class, does not accumulate all the probability of the unseen words.
From the viewpoint of the recognition system, the class <true_unk> will correspond to all the words
in the recognition vocabulary that are not in the LM vocabulary, as well as potential fillers (garbage
models) intended to account for OOV words. An important merit of the proposed approach is the full
compatibility of the resulting LM with the SRILM word-class format. Because of this, LM perplexity can be
calculated in the normal way using the ngram utility.
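For instance, the <unk> probability table could be written out as a class definition file and passed to ngram together with the main LM via its -classes option; the exact line format used below (class label, expansion probability, expansion) is our reading of SRILM's classes-format documentation and should be checked against the toolkit version in use:

    def write_unk_class(probs, path="unk.classes"):
        """Write the <unk> class expansions, one per line, in the assumed SRILM
        classes format: '<class> <probability> <expansion>'."""
        with open(path, "w", encoding="utf-8") as f:
            for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
                f.write(f"<unk> {p:.6g} {word}\n")

    # Using the toy probabilities from the sketch above (group B words plus <true_unk>).
    write_unk_class({"буркина-фасо": 0.65, "нанотехнологии": 0.30, "<true_unk>": 0.05})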
5. Experiments
Experiments were conducted to confirm the efficiency and consistency of the proposed approach.
Vocabularies of different sizes (consisting of the most frequent words) were chosen and LMs were
trained on a Russian training corpus of news transcriptions. All models were trained by SRILM
with Good-Turing discounting and default values of other parameters. The results are shown in
Table 2 and Figure 1. The perplexity curves obtained with the second approach from the SRILM FAQ are
also provided for comparison (curve 40M* in the left panel and curve 50K* in the right panel).
Table 2. Perplexity values for the proposed <unk> modeling depending on vocabulary and training corpus size

    Vocabulary          Training corpus size in words
    size in words       10M      20M      30M      40M
    10K                 1255     1078     981      909
    20K                 1052     858      751      677
    30K                 959      757      647      571
    40K                 905      697      587      509
    50K                 869      657      546      467
    60K                 844      628      517      439
    70K                 825      607      495      417
    80K                 811      591      478      400
    90K                 799      577      464      386
    100K                790      566      453      374
These results demonstrate that the proposed approach to unknown word probability modeling
provides perplexity behavior that is fully consistent with theory and common sense, namely a
monotonic asymptotic decrease. The SRILM approach demonstrates the same tendencies, but with
significantly larger perplexity values.
[Figure 1: two line plots of perplexity. Left panel: perplexity vs. vocabulary size (10–100 thousand words), with curves for the 10M, 20M, 30M and 40M training corpora and the 40M* SRILM curve. Right panel: perplexity vs. training corpus size (10–50 million words), with curves for the 10K, 20K, 30K, 40K and 50K vocabularies and the 50K* SRILM curve.]
Fig. 1. The dependence of LM perplexity on vocabulary size for different training corpus sizes (left),
and on training corpus size for different vocabulary sizes (right)
In addition, Figure 1 (right) shows that for Russian a 50M-word training corpus is not sufficient for
perplexity saturation even with a 10K-word vocabulary, let alone larger vocabularies. Recognition
experiments with the proposed approach are not yet completed, but we believe that more correct
modeling of unknown words will improve recognition quality.
6. Conclusions and future work
We have demonstrated the necessity for special modeling of the unknown word set <unk> in
training and using language models. We have proposed a simple approach to this modeling that
allows us to calculate N-gram probabilities in a single LM framework. The approach is shown to be
consistent with theory with respect to perplexity.
In the future we plan to explore different unigram discounting methods for unknown word
probabilities, and to extend our results to larger corpus sizes. We will also study the effect of our
approach on the quality of large vocabulary recognition of continuous Russian speech.
References
1. J. Nouza, J. Zdansky, P. Cerva, J. Silovsky. Challenges in Speech Processing of Slavic Languages
(Case Studies in Speech Recognition of Czech and Slovak). Development of Multimodal Interfaces:
Active Listening and Synchrony, Second COST 2102, 2009.
2. http://www.speech.cs.cmu.edu/SLM/toolkit.html
3. P. R. Clarkson, R. Rosenfeld. Statistical Language Modeling Using the CMU-Cambridge
Toolkit. Proc. ESCA Eurospeech, 1997.
4. S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason,
D. Povey, V. Valtchev, P. Woodland. The HTK Book. Cambridge, 2009.
5. http://www-speech.sri.com/projects/srilm
6. A. Stolcke. SRILM – An Extensible Language Modeling Toolkit. Proc. ICSLP, 2002.
7. M. Tatarnikova, I. Tampel, I. Oparin, Y. Khokhlov. Building Acoustic Models for
Large Vocabulary Continuous Speech Recognizer for Russian. Proc. SpeCom, 2006.
8. E. W. D. Whittaker. Statistical Language Modelling for Automatic Speech Recognition of Russian
and English. PhD thesis, Cambridge University, 2000.
9. http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html
10. J. Gao, J. Goodman, J. Miao. The Use of Clustering Techniques for Asian Language Modeling.
Computational Linguistics and Chinese Language Processing, 2001, vol. 6, no. 1, pp. 27-60.
11. D. Jurafsky, J. H. Martin. Speech and Language Processing. Prentice Hall, 2000, 975 pp.