Design of LVCSR Decoder for Czech Language
Filip Jurčíček, Aleš Pražák, Luděk Müller, J. V. Psutka, and Luboš Šmídl
University of West Bohemia, Faculty of Applied Sciences, Department of Cybernetics
Univerzitní 8, Plzeň, 301 00, Czech Republic
{filip, aprazak, muller, psutka_j, smidl}@kky.zcu.cz
Abstract: In this paper we present a Czech speaker-independent large vocabulary continuous speech recognition (LVCSR) system based on lexical trees and a bigram language model. The lexical trees use triphones for both the in-word and the cross-word context. A dynamically generated cross-word context saves a significant amount of memory. A telephone speech and text corpus has been used to evaluate the system accuracy and speed. The corpus was used to compare our recognizer with the standard HTK recognizer. The comparison results are shown.
I. INTRODUCTION
LVCSR systems have been under development for several decades. Nowadays LVCSR systems take advantage of the increasing performance of computers and of sophisticated algorithms. There is a continuing effort to integrate larger lexicons into LVCSR systems. An n-gram language model can improve recognition accuracy. However, its memory requirements are substantial, so in practice only a bigram language model is often considered for a real-time implementation of an LVCSR system. The LVCSR system can be used either for automatic dictation or as a speech recognition module (e.g. of a voice dialog system) loosely coupled with a speech understanding module. The output of the LVCSR system is either a word lattice or the best sequence of words matching the input acoustic signal. The word lattice can be processed by higher-level blocks. The proposed Czech LVCSR system was developed as an extension of our grammar-based decoder [6]. The decoder is implemented as a baseline recognizer at this time. All known specifics of the Czech language are implemented in the recognizer, i.e. all baseforms (phonetic transcriptions) of each individual word in the system vocabulary, the full cross-word triphone context, and the so-called voice assimilation phenomenon are considered. In the following, the lexicon representation and the decoder are described. Finally, experimental results are given and discussed.
II. LEXICON REPRESENTATION
A static representation of a lexicon is attractive because there is no need for on-the-fly compilation of the lexicon into the recognition network. This saves CPU time, and several non-trivial optimizations of the recognition network can be performed.

A. Linear lexicon
In medium vocabulary tasks (up to 1000 different words), static linear lexicons are commonly used in most recognition systems. A bigram language model can be implemented by weighted transitions between words. If V is the vocabulary size, V^2 transitions are needed. A useful technique for reducing the number of transitions was introduced in [4]. A static implementation of a language model saves decoder time. Nevertheless, the linguistic information (language model) is not used as soon as possible. An efficient reduction of local HMM likelihood evaluations can also be obtained in a non-tree lexicon by simply caching all computed HMM likelihoods.
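The likelihood caching mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a cache keyed by HMM state, flushed every frame, so that a state shared by many words in a linear lexicon is scored at most once per feature vector. The class and function names are hypothetical.

```python
class LikelihoodCache:
    """Cache of local HMM output log-likelihoods, keyed by state id.

    In a linear (non-tree) lexicon the same triphone state occurs in many
    words; caching guarantees each mixture is evaluated once per frame.
    """

    def __init__(self, score_fn):
        # score_fn(state_id, feature_vector) -> log-likelihood (assumed interface)
        self.score_fn = score_fn
        self.cache = {}

    def new_frame(self, frame_index):
        # Likelihoods are valid for a single frame only; flush on advance.
        self.cache.clear()

    def score(self, state_id, feature_vector):
        if state_id not in self.cache:
            self.cache[state_id] = self.score_fn(state_id, feature_vector)
        return self.cache[state_id]
```

With this wrapper, the number of Gaussian-mixture evaluations per frame is bounded by the number of distinct states, not the number of state instances in the recognition network.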
B. Tree lexicon
Contrary to a linear recognition net, which does not take into account the phonetic transcriptions ("similarity") of words, lexical trees share common portions of the phonetic transcriptions and make decoding more effective. This leads to a significant speed-up due to fewer local likelihood calculations and fewer total probability updates. A typical decrease in the number of triphone instances is approximately a factor of two. Due to beam pruning, which seems to be most effective in the initial phonemes of a word (where sharing is highest), the observed improvements in decoding time are even greater than a factor of two.

C. Factorization
The factorization [3] of bigram probabilities over the lexical tree allows the decoder to use linguistic information as soon as possible. With factorization, more effective pruning can be applied without a significant loss of accuracy. If during factorization several words share the same part of their phonetic transcription, only the maximum of their probabilities is propagated towards the root of the lexical tree.
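The tree construction and factorization described above can be sketched as follows. This is an illustrative sketch under assumed data structures (a dict mapping words to phone lists, and bigram log-probabilities for one fixed word history), not the paper's code: each node stores the maximum log-probability over all words below it, propagated bottom-up.

```python
class TreeNode:
    def __init__(self):
        self.children = {}        # phone -> TreeNode
        self.words = []           # words whose transcription ends at this node
        self.lm = float("-inf")   # factorized max bigram log-prob of words below

def factorize(node, logprob):
    """Propagate the maximum bigram log-probability towards the root."""
    best = max((logprob[w] for w in node.words), default=float("-inf"))
    for child in node.children.values():
        best = max(best, factorize(child, logprob))
    node.lm = best
    return best

def build_tree(lexicon, logprob):
    """lexicon: word -> phone list; logprob: word -> bigram log-prob
    for one fixed history (hypothetical interfaces)."""
    root = TreeNode()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, TreeNode())
        node.words.append(word)
    factorize(root, logprob)
    return root
```

During decoding, the language model score applied on entering a node can then be the difference between the node's `lm` value and its parent's, so the full bigram probability accumulates gradually along the path to a leaf.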
III. DECODER
The decoder uses continuous density HMMs with Gaussian mixtures, a bigram language model, context-dependent phones (triphones), and a lexicon implemented as a lexical tree. The decoder uses a time-synchronous Viterbi search with beam pruning and with pruning based on the maximum number of live paths in the lexical trees. For each word history (bigram language model), a copy of the lexical tree is dynamically created. All lexical trees are decoded simultaneously. The number of trees is limited, and an effective algorithm is used to decide which lexical tree should be discarded and which new tree with an appropriate word history should be created. The algorithm takes advantage of the predictive ability of a language model in order to reduce the number of simultaneously decoded lexical trees.

A. Voice assimilation phenomenon
The voice assimilation phenomenon is a cross-word context phonetic transcription ambiguity. One or more final phonemes of a word can be influenced by one or more initial phonemes of the successive word. There are four phoneme groups in the Czech language: vowels, voiced paired consonants, unvoiced paired consonants, and unique consonants. Voice assimilation is applied only to paired consonants. For example: the phonetic transcription of the word "ples" with an unvoiced right context is "p l e s", while a voiced right context changes the unvoiced final phoneme "s" to its voiced paired consonant "z", giving the baseform "p l e z".
[Figure: lexical tree with branches for the words "doma", "dostatek", "kopat", and "kozel"; legend: word ends uninfluenced / influenced by the voice assimilation phenomenon]
Fig. 1: Lexical tree with the Czech language voice assimilation phenomenon
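Generating all possible word baseforms under voice assimilation can be sketched as follows. The voiced/unvoiced consonant pairs below use SAMPA-like symbols and are an assumption for illustration, not the paper's exact phonetic alphabet; the function name is hypothetical.

```python
# Czech paired consonants: voiced <-> unvoiced counterparts
# (SAMPA-like symbols, assumed for illustration)
VOICED_TO_UNVOICED = {"b": "p", "d": "t", "g": "k", "v": "f",
                      "z": "s", "Z": "S", "h": "x", "dz": "c", "dZ": "tS"}
UNVOICED_TO_VOICED = {u: v for v, u in VOICED_TO_UNVOICED.items()}

def baseform_variants(phones):
    """Return all baseforms of a word whose final phoneme may assimilate
    to the voicing of the following word's initial phoneme.

    Vowels and unique consonants are unaffected, so such words keep a
    single baseform; paired consonants get both voicing variants.
    """
    variants = [list(phones)]
    last = phones[-1]
    if last in VOICED_TO_UNVOICED:
        variants.append(phones[:-1] + [VOICED_TO_UNVOICED[last]])
    elif last in UNVOICED_TO_VOICED:
        variants.append(phones[:-1] + [UNVOICED_TO_VOICED[last]])
    return variants
```

For the paper's example, `baseform_variants(["p", "l", "e", "s"])` yields both "p l e s" and "p l e z", and both baseforms are added to the lexical tree.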
To take this phenomenon into account, all possible word baseforms are added to the lexical tree. In some cases, the phonetic transcription ambiguity yields the same phonetic transcription for different words; voice assimilation itself can be the reason. E.g. the word "plot" (fence) has two phonetic transcriptions, "p l o t" and "p l o d", and the word "plod" (fruit) has the same two phonetic transcriptions. Although the words "plot" and "plod" have the same phonetic transcriptions, all four have to be added to the lexical tree because "plot" and "plod" have different bigram language model probabilities. See Figure 1.

B. Cross-word context
The lexical tree uses triphones for both the in-word and the cross-word context. In the cross-word context, a method of dynamic context generation for each successfully decoded word end is used. Using the cross-word context means generating all left-context word-start and right-context word-end cross-word triphones, which results in enormous memory requirements. For example, a lexical tree based on a 60k-word vocabulary from a Czech newspaper corpus and a phonetic alphabet of 44 phonemes contains on average 15 times more cross-word context triphones than in-word triphones. See Figure 2. This ratio depends primarily on the word lengths, the extent of shared initial portions of the phonetic transcriptions, and the number of different phonetic transcriptions. The dynamically generated cross-word context takes about 10 percent of the time of the whole decoding process, while the memory requirement is decreased to approximately 20 percent compared with a statically generated cross-word context.
[Figure: triphone-level lexical tree for the words "doma" and "dostatek"; legend: triphones with both voiced and unvoiced cross-word context, triphones with voiced cross-word context, triphones with unvoiced cross-word context; word ends uninfluenced / influenced by the voice assimilation phenomenon]
Fig. 2: Lexical tree on the triphone layer
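The dynamic generation of the right cross-word context can be sketched as follows. This is a simplified illustration under assumptions not taken from the paper: triphones are represented as "left-center+right" strings, and the right contexts are expanded only for word ends that actually survive pruning, rather than being enumerated for every word end in advance.

```python
def cross_word_triphones(left, center, successor_initials):
    """For a word-final phoneme `center` with in-word left context `left`,
    generate the right cross-word context triphones on demand, one per
    possible initial phoneme of a following word (illustrative sketch).
    """
    return ["%s-%s+%s" % (left, center, right)
            for right in sorted(successor_initials)]
```

The memory saving comes from calling such a generator only for the (pruned, few) active word ends at each frame instead of materializing all cross-word triphone instances, which the paper reports outnumber in-word triphones roughly 15 to 1.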
C. Lattice
In order for the recognition system to be able to communicate with a speech understanding module, the decoder generates a word lattice. A full N-best sentence decoding process consumes much more computer time than decoding only the single best sentence. This motivates the investigation of effective approximations of N-best sentence decoding.
Fig. 3: Conditioned copies of the lexical tree
Our approximation is based on the word pair approximation, i.e. on the idea that the time boundary between any two words does not depend on the earlier words of the word sequence. Furthermore, the recognition net must be extended in order to satisfy the condition that only transitions from leaves representing the same word can lead to the same tree root. See Figure 3. In our case, when conditioned copies of the lexical tree are used to implement the bigram language model, the word pair approximation is a natural property of the created recognition net.
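The effect of the word pair approximation on lattice construction can be sketched as follows. This is a hypothetical data layout, not the paper's implementation: since the boundary between two words is assumed independent of the earlier history, hypotheses of the same word ending at the same time are merged into a single lattice node, keeping only the best-scoring predecessor link.

```python
def add_to_lattice(lattice, word, start, end, score, predecessor):
    """Merge a word hypothesis into the lattice under the word pair
    approximation: one node per (word, end_time), best score kept."""
    key = (word, end)
    node = lattice.setdefault(key, {"start": start,
                                    "score": float("-inf"),
                                    "back": None})
    if score > node["score"]:
        node.update(start=start, score=score, back=predecessor)
    return node
```

This merging is what keeps the approximate N-best search close in cost to one-best decoding: the number of lattice nodes grows with distinct (word, end-time) pairs rather than with full word histories.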
IV. EXPERIMENTS
Our recognition engine works as a speaker-independent CDHMM-based module. It incorporates a front-end, an acoustic model, a language model, and the decoder. The acoustic signal is digitized at an 8 kHz sample rate. The pre-emphasized signal is segmented into 25-millisecond frames, and every 10 ms a feature vector consisting of 7 static PLP cepstral coefficients, 7 delta, and 7 acceleration coefficients is computed [5]. The Czech telephone corpus collected at the Department of Cybernetics has been used to evaluate the system accuracy and speed. The corpus consists of read speech transmitted over a telephone channel. It was manually annotated and phonetically transcribed, and it was used to compare our recognizer with the standard HTK recognizer [1]. The Czech telephone speech and text corpus contains a lexicon of only about 1k words. The speech part of the corpus comprises over 3.5 hours in 1492 sentences from 100 speakers (males and females) and was divided into two parts. The first part consisted of 100 randomly selected sentences, which represented the test data. The remaining 1392 sentences were used for acoustic model training. Three tests were performed: firstly, the speech recognition system equipped with a zerogram language model was evaluated on the test data; secondly, the system with a bigram language model trained on the text part of the telephone corpus was tested; thirdly, the system with a bigram language model trained on a newspaper corpus comprising more than 2250k sentences was tested. The recognition tests were carried out on a workstation with a Pentium 4 2.4 GHz processor.

A. Zerogram language model
A zerogram language model was used to evaluate the acoustic model of the recognizer. The lexicon consisted of 1000 words. In this test we compare a tree-based lexicon against a linear one. In accordance with our expectation, the word accuracy (ACC) is almost the same, but the real-time response (RTR, i.e. recognition time / speech duration) shows a significant improvement for the tree-based lexicon representation. Memory consumption corresponds to the peak process size (PS) measured during decoding. The results are shown in Table 1.

Table 1: Zerogram language model recognition accuracy and resource consumption

Recognizer   ACC [%]   RTR    PS
HTK          70.32     4.06   12 MB
tree         69.25     0.57   13 MB
B. Bigram language model with a small lexicon
In the second test we used a bigram language model trained on the telephone text corpus with the SRI LM Toolkit [2]. It is obvious that the language model is over-trained, because the text corpus contains only 513 different sentences; the resulting perplexity of the trained language model is only 2. With this language model, the implementation of linguistic knowledge can be tested. The results are shown in Table 2. The increased word accuracy is caused by the low perplexity of the language model. The low perplexity also allowed more effective pruning during the decoding phase, which led to a better RTR of the tree decoder in comparison with the previous test.
Table 2: Bigram language model with small lexicon recognition accuracy and resource consumption

Recognizer   ACC [%]   RTR    PS
HTK          97.01     4.10   12 MB
tree         98.08     0.21   13 MB
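The perplexity values quoted above and in the next test follow the standard definition, the exponential of the negative average log-probability per predicted token. A minimal sketch, assuming a `logprob(history, word)` callable that returns natural-log bigram probabilities (names are illustrative, not from the SRI LM Toolkit's API):

```python
import math

def bigram_perplexity(sentences, logprob):
    """Perplexity of a bigram model over a list of tokenized sentences.

    Each sentence contributes one prediction per word plus one for the
    end-of-sentence token; logprob(history, word) is assumed to return
    the natural-log bigram probability.
    """
    total, count = 0.0, 0
    for sent in sentences:
        history = "<s>"
        for word in sent + ["</s>"]:
            total += logprob(history, word)
            count += 1
            history = word
    return math.exp(-total / count)
```

A perplexity of 2 on a 1k-word lexicon, as in the second test, means the model is effectively choosing between two words at each step, which explains both the high accuracy and the aggressive pruning it permits; the value 806 in the third test restores a realistic difficulty.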
C. Bigram language model
For the third test, a bigram model with a lexicon of more than 60,000 words was used. The bigram model was trained on the text corpus Lidové noviny. The resulting perplexity is 806. In this case both the acoustic and the language model are tested. The results are shown in Table 3. The decreased word accuracy is caused by the increased size of the lexicon and mainly by the high perplexity of the language model.

Table 3: Bigram language model recognition accuracy and resource consumption

Recognizer   ACC [%]   RTR       PS
HTK          48.71     1383.67   532 MB
tree         57.72     41.34     408 MB
V. CONCLUSION
A Czech LVCSR system was introduced. We mainly highlighted the problems of the cross-word context and of word lattice generation. An efficient technique of dynamically generated cross-word context was presented; it decreases the memory requirement to approximately 20 percent. Experiments performed on the telephone speech and text corpus show that our system outperforms the standard HTK recognizer in both speed and memory requirements with no loss of accuracy.
ACKNOWLEDGMENT This work was supported by the Ministry of Education of the Czech Republic under project LN00B096.
REFERENCES
[1] S. Young et al.: "The HTK Book", Entropic Inc., 1999.
[2] A. Stolcke: SRILM - The SRI Language Modeling Toolkit. http://www.speech.sri.com/projects/srilm/
[3] J. Odell, V. Valtchev, P. Woodland, S. Young: "A one pass decoder design for large vocabulary recognition", Proceedings of the ARPA Human Language Technology Workshop, Plainsboro, NJ, 1994.
[4] P. Placeway, R. Schwartz, P. Fung, L. Nguyen: "The estimation of powerful language models from small and large corpora", Proceedings of the IEEE ICASSP, Minneapolis, MN, 1993.
[5] J. Psutka, L. Müller, J. V. Psutka: "Comparison of MFCC and PLP Parameterizations in the Speaker Independent Continuous Speech Recognition Task", EuroSpeech 2001, Scandinavia, 2001.
[6] L. Müller, J. Psutka, L. Šmídl: "Design of Speech Recognition Engine", TSD 2000, Third International Workshop on Text, Speech and Dialogue, Brno, Czech Republic, 2000.