Design of LVCSR Decoder for Czech Language

Filip Jurčíček, Aleš Pražák, Luděk Müller, J. V. Psutka, and Luboš Šmídl

University of West Bohemia, Department of Cybernetics
Univerzitní 8, Plzeň, 301 00, Czech Republic

{filip, aprazak, muller, psutka_j, smidl}@kky.zcu.cz

Abstract: In this paper we present a Czech speaker-independent large vocabulary continuous speech recognition (LVCSR) system based on lexical trees and a bigram language model. The lexical trees use triphones for both the in-word and the cross-word context. A dynamically generated cross-word context saves a considerable amount of memory. A telephone speech and text corpus has been used to evaluate the accuracy and speed of the system, and to compare our recognizer with the standard HTK recognizer. The comparison results are presented.

I. INTRODUCTION

LVCSR systems have been under development for several decades. Nowadays they take advantage of the increasing performance of computers and of sophisticated algorithms, and there is a continuing effort to integrate larger lexicons into LVCSR systems. An n-gram language model can improve recognition accuracy; however, its memory requirements are substantial, so in practice only a bigram language model is often considered for a real-time implementation of an LVCSR system.

An LVCSR system can be used either for automatic dictation or as a speech recognition module (e.g. of a voice dialog system) loosely coupled with a speech understanding module. The output of the LVCSR system is either a word lattice or the best sequence of words matching the input acoustic signal. The word lattice can be processed by higher-level blocks.

The proposed Czech LVCSR system was developed as an extension of our grammar-based decoder [6]. At this time the decoder is implemented as a baseline recognizer. All known specifics of the Czech language are implemented in the recognizer, i.e. all baseforms (phonetic transcriptions) of each individual word in the system vocabulary, the full cross-word triphone context, and the so-called voice assimilation phenomenon are considered. In the following, the lexicon representation and the decoder are described. Finally, experimental results are given and discussed.

II. LEXICON REPRESENTATION

A static representation of a lexicon is attractive because there is no need for an on-the-fly compilation of the lexicon into the recognition net. This saves CPU time, and several non-trivial optimizations of the recognition network can be performed.

A. Linear lexicon

In medium vocabulary tasks (up to 1000 different words), static linear lexicons are commonly used in most recognition systems. A bigram language model can be implemented by weighted transitions between words. If V is the vocabulary size, V^2 transitions are needed. A useful technique for reducing the number of transitions was introduced in [4]. A static implementation of a language model saves decoding time; nevertheless, the linguistic information (the language model) is not used as early as possible. An efficient reduction of the local HMM likelihood evaluations can also be obtained in a non-tree lexicon by simply caching all calculated HMM likelihoods.
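Such a cache can be sketched as follows. This is a minimal illustration under assumed interfaces (the state object and its log_likelihood method are hypothetical stand-ins for the acoustic model), not the implementation used in our decoder.

```python
# Minimal sketch of caching HMM state output likelihoods within one frame.
# The state object and its log_likelihood(feature_vector) method are
# hypothetical stand-ins for the decoder's acoustic model interface.

class LikelihoodCache:
    """Evaluates each shared HMM state at most once per frame, so that
    states reached through several lexicon paths are not rescored."""

    def __init__(self):
        self.cache = {}

    def new_frame(self):
        # Likelihoods depend on the current feature vector, so the
        # cache is only valid for a single frame.
        self.cache.clear()

    def score(self, state, feature_vector):
        if state.id not in self.cache:
            self.cache[state.id] = state.log_likelihood(feature_vector)
        return self.cache[state.id]
```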

B. Tree lexicon

Contrary to a linear recognition net, which does not take the phonetic transcriptions ("similarity") of words into account, a lexical tree shares the common portions of the phonetic transcriptions and makes decoding more effective. This leads to a significant speed-up due to fewer local likelihood calculations and fewer total probability updates. The number of triphone instances typically decreases by approximately one half. Due to beam pruning, which seems to be most effective in the initial phonemes of a word (where the sharing is extremely high), the observed improvement in decoding time is even higher, more than a factor of two.

C. Factorization

The factorization [3] of bigram probabilities over the lexical tree allows the decoder to use linguistic information as early as possible. With the factorization, more effective pruning can be applied without a significant loss of accuracy. If several words share the same part of their phonetic transcription, only the maximum of their probabilities can be propagated towards the root of the lexical tree.
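A minimal sketch of this factorization, under an assumed node structure (words attached to leaves or word-final nodes, children kept in a list), might look as follows; the incremental transition weights along any root-to-leaf path then sum to the word's full log bigram probability.

```python
# Sketch of factorizing bigram LM log-probabilities over a lexical tree.
# The node layout (node.word, node.children) is an assumption made for
# illustration, not the paper's actual data structure.

def factorize(node, lm_logprob):
    """Attach to each node the maximum log-probability over all words
    reachable below it (or ending at it), and return that maximum."""
    scores = [lm_logprob[node.word]] if node.word is not None else []
    scores += [factorize(child, lm_logprob) for child in node.children]
    node.lm_score = max(scores)
    return node.lm_score

def transition_weight(parent, child):
    """Incremental LM score applied when moving from parent to child;
    the increments along a root-to-leaf path sum to the word's score."""
    return child.lm_score - parent.lm_score
```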

III. DECODER

The decoder uses continuous density HMMs with Gaussian mixtures, a bigram language model, context-dependent phones (triphones), and a lexicon implemented as a lexical tree. The decoder performs a time-synchronous Viterbi search with beam pruning and with pruning based on the maximum possible number of live paths in the lexical trees. For each word history (bigram language model) a copy of the lexical tree is dynamically created. All lexical trees are decoded simultaneously. The number of trees is limited, and an effective algorithm is used to decide which lexical tree should be discarded and which new tree with an appropriate word history should be created. The algorithm takes advantage of the predictive ability of the language model in order to reduce the number of simultaneously decoded lexical trees.

A. Voice assimilation phenomenon

The voice assimilation phenomenon is a cross-word context phonetic transcription ambiguity. One or more of the last phonemes of a word can be influenced by one or more of the initial phonemes of the successive word. There are four phoneme groups in the Czech language: vowels, voiced paired consonants, unvoiced paired consonants, and unique consonants. Voice assimilation is applied only to paired consonants. For example, the phonetic transcription of the word "ples" with an unvoiced right context is "p l e s", while with a voiced right context the unvoiced last phoneme "s" is changed to its voiced paired consonant "z", giving the baseform "p l e z".

[Figure 1 omitted: a lexical tree containing the words "doma", "dostatek", "kopat", and "kozel", with word ends marked as uninfluenced or influenced by the voice assimilation phenomenon.]

Fig. 1: Lexical tree with the Czech language voice assimilation phenomenon
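The generation of the assimilated baseform variants can be sketched as follows. This is only an illustration of the rule for the final phoneme: the pairing table below is a fragment of the Czech paired-consonant inventory, and the sketch ignores assimilation reaching deeper than the last phoneme.

```python
# Sketch of generating voice-assimilation variants of a baseform.
# VOICED_OF is only a fragment of the Czech paired-consonant table.

VOICED_OF = {"s": "z", "t": "d", "p": "b", "k": "g", "f": "v"}
UNVOICED_OF = {v: k for k, v in VOICED_OF.items()}

def assimilation_variants(baseform):
    """Return the baseform variants needed to cover both a voiced and
    an unvoiced right cross-word context."""
    variants = [list(baseform)]
    last = baseform[-1]
    if last in VOICED_OF:            # unvoiced paired consonant at word end
        variants.append(list(baseform[:-1]) + [VOICED_OF[last]])
    elif last in UNVOICED_OF:        # voiced paired consonant at word end
        variants.append(list(baseform[:-1]) + [UNVOICED_OF[last]])
    return variants

print(assimilation_variants(["p", "l", "e", "s"]))
# -> [['p', 'l', 'e', 's'], ['p', 'l', 'e', 'z']]
```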

To take this phenomenon into account, all possible word baseforms are added to the lexical tree. In some cases the phonetic transcription ambiguity yields the same phonetic transcription for different words, which can also be caused by voice assimilation. E.g. the word "plot" (fence) has two phonetic transcriptions, "p l o t" and "p l o d", and the word "plod" (fruit) has the same two phonetic transcriptions, "p l o t" and "p l o d". Although the words "plot" and "plod" have the same phonetic transcriptions, all four phonetic transcriptions have to be added to the lexical tree, because the words "plot" and "plod" have different bigram language model probabilities. See Figure 1.

B. Cross-word context

The lexical tree uses triphones for both the in-word and the cross-word context. In the cross-word context, a method of dynamic context generation for each successfully decoded word end is used. Using the cross-word context means generating all word-start left and word-end right cross-word context triphones, which results in enormous memory requirements. For example, a lexical tree based on a 60k-word vocabulary from a Czech newspaper corpus and a phonetic alphabet consisting of 44 phonemes contains on average 15 times more cross-word context triphones than in-word triphones. See Figure 2. This ratio depends primarily on the word lengths, on the extent to which initial portions of the phonetic transcriptions are shared, and on the number of different phonetic transcriptions. The dynamically generated cross-word context takes about 10 percent of the time of the whole decoding process, while the memory requirement is decreased to approximately 20 percent of that of a statically generated cross-word context.

[Figure 2 omitted: the lexical tree of Figure 1 expanded on the triphone layer (triphones such as d-o+m, o-m+a, t-e+k), distinguishing triphones with voiced, unvoiced, and both voiced and unvoiced cross-word contexts, and word ends uninfluenced or influenced by the voice assimilation phenomenon.]

Fig. 2: Lexical tree on the triphone layer
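The idea of the dynamic generation can be pictured as building the right cross-word triphones only for word ends that actually survive pruning, instead of storing all of them statically. The sketch below uses assumed names (hmm_inventory and its get_or_load method are hypothetical).

```python
# Sketch of generating cross-word triphones on demand for one decoded
# word end. Only word ends that survive pruning trigger this step, which
# is why memory drops compared to a static cross-word expansion.

def cross_word_triphones(left, center, successor_initials, hmm_inventory):
    """For a word ending in '...left-center', build one triphone per
    possible initial phoneme of a successor word."""
    triphones = {}
    for right in successor_initials:
        name = f"{left}-{center}+{right}"   # e.g. 't-e+k' in Figure 2
        # get_or_load is a hypothetical lazy lookup into the HMM set.
        triphones[name] = hmm_inventory.get_or_load(name)
    return triphones
```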

C. Lattice

In order for the recognition system to be able to communicate with a speech understanding module, the decoder generates a word lattice. A full N-best sentence decoding process consumes much more computation time than decoding only the single best sentence. This motivates the investigation of an effective approximation of N-best sentence decoding.

[Figure 3 omitted: five conditioned copies of the lexical tree (Tree 1 to Tree 5) connected by word transitions w1 to w4 at times t1, t2, and t3.]

Fig. 3: Conditioned copies of the lexical tree

Our approximation is based on the word pair approximation, i.e. on the idea that the time boundary between any two words w_{n-1} and w_n does not depend on the previous words w_1, ..., w_{n-2} of the word sequence. Furthermore, the recognition net must be extended in order to satisfy the condition that only transitions from leaves representing the same word can lead to the same tree root. See Figure 3. In our case, when conditioned copies of the lexical tree are used to implement a bigram language model, the word pair approximation is a natural property of the created recognition net.
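Under this approximation, collecting lattice arcs reduces to keeping, for each (predecessor word, word, end time) triple, the best-scoring hypothesis; a minimal sketch with an assumed plain-dictionary lattice follows.

```python
# Sketch of collecting word lattice arcs under the word pair
# approximation: the boundary time of a word is assumed to depend only
# on its immediate predecessor, so one arc per (predecessor, word,
# end_time) triple suffices. The dictionary-based lattice is a
# simplification for illustration.

def record_word_end(lattice, predecessor, word, start_time, end_time, score):
    key = (predecessor, word, end_time)
    best = lattice.get(key)
    if best is None or score > best[1]:    # keep only the best-scoring arc
        lattice[key] = (start_time, score)
```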

IV. EXPERIMENTS

Our recognition engine works as a speaker-independent CDHMM-based module. It incorporates a front-end, an acoustic model, a language model, and a decoder. The acoustic signal is digitized at an 8 kHz sample rate. The pre-emphasized signal is segmented into 25 ms frames, and every 10 ms a feature vector consisting of 7 static PLP cepstral coefficients, 7 delta coefficients, and 7 acceleration coefficients is computed [5].

The Czech telephone corpus collected at the Department of Cybernetics has been used to evaluate the accuracy and speed of the system. The corpus consists of read speech transmitted over a telephone channel; it was manually annotated and phonetically transcribed. The corpus was used to compare our recognizer with the standard HTK recognizer [1]. The Czech telephone speech and text corpus contains a lexicon of only about 1,000 words. The speech part of the corpus comprises over 3.5 hours of speech in 1,492 sentences from 100 speakers (males and females) and was divided into two parts. The first part consisted of 100 randomly selected sentences, which represented the test data. The remaining 1,392 sentences were used for acoustic model training.

Three tests were performed. First, the speech recognition system equipped with a zerogram language model was evaluated on the test data; second, the system with a bigram language model trained on the text part of the telephone corpus was tested; third, the system with a bigram language model trained on a newspaper corpus comprising more than 2,250k sentences was tested. The recognition tests were carried out on a workstation with a Pentium 4 2.4 GHz processor.

A. Zerogram language model

A zerogram language model was used to evaluate the acoustic model of the recognizer. The lexicon consisted of 1000 words. In this test we compare the tree-based lexicon with the linear one. In accordance with our expectation, the word accuracy (ACC) is almost the same, but the real-time response (RTR, i.e. recognition time / speech duration) shows a significant improvement for the tree-based lexicon representation. Memory consumption corresponds to the peak process size (PS) measured during decoding. The results are shown in Table 1.

Table 1: Zerogram language model recognition accuracy and resource consumption

Recognizer   ACC [%]   RTR    PS
HTK          70.32     4.06   12 MB
tree         69.25     0.57   13 MB

B. Bigram language model with a small lexicon

In the second test we used a bigram language model trained on the telephone text corpus by the SRI LM Toolkit [2]. First of all, it is obvious that the language model is overtrained, because the text corpus contains only 513 different sentences; the resulting perplexity of the trained language model is only 2. Nevertheless, the implementation of linguistic knowledge can be tested with this language model. The results are shown in Table 2. The increased word accuracy is caused by the low perplexity of the language model, which allowed more effective pruning during the decoding phase. The low perplexity also led to a better RTR of the tree decoder in comparison to the previous test.

Table 2: Bigram language model with small lexicon recognition accuracy and resource consumption

Recognizer   ACC [%]   RTR    PS
HTK          97.01     4.10   12 MB
tree         98.08     0.21   13 MB
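For reference, the perplexity figures quoted in these tests are the usual test-set perplexity of the bigram model; a minimal sketch of its computation is given below, where bigram_logprob is a hypothetical callback returning log10 P(w_i | w_{i-1}) as reported by tools such as the SRI LM Toolkit [2].

```python
# Sketch: test-set perplexity of a bigram language model.
# bigram_logprob(prev, cur) is a hypothetical callback returning
# log10 P(cur | prev) from a trained model.

def perplexity(sentences, bigram_logprob, bos="<s>", eos="</s>"):
    log_sum, n_tokens = 0.0, 0
    for words in sentences:
        tokens = [bos] + words + [eos]
        for prev, cur in zip(tokens, tokens[1:]):
            log_sum += bigram_logprob(prev, cur)
            n_tokens += 1
    return 10 ** (-log_sum / n_tokens)
```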

C. Bigram language model

For the third test, a bigram model with a lexicon of more than 60,000 words was used. The bigram model was trained on the text corpus of the newspaper Lidové noviny. The resulting perplexity is 806. In this case both the acoustic and the language model are tested. The results are shown in Table 3. The decreased word accuracy is caused by the increased size of the lexicon and mainly by the high perplexity of the language model.

Table 3: Bigram language model recognition accuracy and resource consumption

Recognizer   ACC [%]   RTR       PS
HTK          48.71     1383.67   532 MB
tree         57.72     41.34     408 MB

V. CONCLUSION

A Czech LVCSR system was introduced. We mainly highlighted the problems of the cross-word context and of word lattice generation. An efficient technique of dynamically generating the cross-word context was presented; it decreases the memory requirement to approximately 20 percent. Experiments performed on the telephone speech and text corpus show that our system outperforms the standard HTK recognizer in speed and memory requirements with no loss of accuracy.

ACKNOWLEDGMENT

This work was supported by the Ministry of Education of the Czech Republic under project LN00B096.

REFERENCES

[1] S. Young et al.: "The HTK Book", Entropic Inc., 1999.
[2] A. Stolcke: "SRILM - The SRI Language Modeling Toolkit", http://www.speech.sri.com/projects/srilm/
[3] J. Odell, V. Valtchev, P. Woodland, S. Young: "A one pass decoder design for large vocabulary recognition", Proceedings of the ARPA Human Language Technology Workshop, Plainsboro, NJ, 1994.
[4] P. Placeway, R. Schwartz, P. Fung, L. Nguyen: "The estimation of powerful language models from small and large corpora", Proceedings of the IEEE ICASSP, Minneapolis, MN, 1993.
[5] J. Psutka, L. Müller, J. V. Psutka: "Comparison of MFCC and PLP Parameterizations in the Speaker Independent Continuous Speech Recognition Task", EuroSpeech 2001, Scandinavia, 2001.
[6] L. Müller, J. Psutka, L. Šmídl: "Design of Speech Recognition Engine", TSD 2000, Third International Workshop on Text, Speech and Dialogue, Brno, Czech Republic, 2000.
