Phonotactic Language Recognition Based on DNN-HMM Acoustic Model

Wei-Wei Liu1,2, Meng Cai1, Hua Yuan1, Xiao-Bei Shi3, Wei-Qiang Zhang1, Jia Liu1

1 Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084
2 General Communication Station, Chinese General Logistics Department
3 Beijing General Municipal Engineering Design & Research Institute Co., Ltd.

[email protected]

Abstract

The recently introduced deep neural network (DNN) has achieved unprecedented gains in many challenging automatic speech recognition (ASR) tasks. In this paper the deep neural network hidden Markov model (DNN-HMM) acoustic model is introduced to phonotactic language recognition, where it outperforms artificial neural network hidden Markov model (ANN-HMM) and Gaussian mixture model hidden Markov model (GMM-HMM) acoustic models. Experimental results confirm that the phonotactic language recognition system using the DNN-HMM acoustic model yields relative equal error rate reductions of 28.42%, 14.06% and 18.70% over the ANN-HMM approach and 12.55%, 7.20% and 2.47% over the GMM-HMM approach for the 30 s, 10 s and 3 s conditions respectively on the National Institute of Standards and Technology language recognition evaluation (NIST LRE) 2009 tasks.

Index Terms: language recognition, DNN-HMM, acoustic model

1. Introduction

Language recognition refers to the process of determining the identity of the language spoken in a speech utterance [1]. It is a widely used technology for multilingual speech processing applications such as spoken document retrieval, spoken language translation [2] and multilingual speech recognition. Two kinds of language recognition systems have been widely used with encouraging results: phonotactic language recognition systems [3] and acoustic language recognition systems [4]. Phonotactic language recognition systems usually adopt the artificial neural network hidden Markov model (ANN-HMM) [5] or the Gaussian mixture model hidden Markov model (GMM-HMM) as the acoustic modeling method. In recent years, an acoustic model combining deep neural networks and hidden Markov models (DNN-HMM) has quickly become the dominant acoustic modeling technology for automatic speech recognition (ASR) [6], outperforming both GMM-HMM and ANN-HMM.

The DNN-HMM acoustic modeling method has several advantages over GMM-HMM approaches. Firstly, with minimal assumptions about the data distribution, DNNs are able to discover the relationships between neighboring frames and extract informative features with less front-end processing [7]. Secondly, the hierarchical structure of DNNs enables parameter sharing within the hidden layers, which is more efficient and powerful than having many disjoint parameters for every target. Thirdly, with many nonlinear hidden layers, neural networks yield stronger modeling power than GMMs.

DNN-HMM acoustic modeling also has several advantages over ANN-HMM approaches. Firstly, the DNN greatly expands the number of output units of the ANN by replacing monophone states with a large number of tied triphone states. Secondly, the number of hidden layers of a DNN can be increased to six or seven. Thirdly, with a pre-training procedure [8], the parameters of a deep neural network with many hidden layers and a huge output layer can be learned in a very reliable way. In language recognition, DNNs have been used to mine the contextual information embedded in speech frames [9], detect speech activity [10], train language models and serve as backends [11]. In this paper, we introduce the DNN into phonotactic language recognition to train acoustic models for phoneme recognizers. To our knowledge, this is the first time DNN-HMM acoustic models are applied to phonotactic language recognition. Experimental results on the NIST LRE 2009 tasks confirm that the DNN consistently achieves relative equal error rate reductions of about 28.42%, 14.06% and 18.70% over ANNs and 12.55%, 7.20% and 2.47% over GMMs for 30 s, 10 s and 3 s respectively.

The rest of this paper is organized as follows. Section 2 introduces the DNN-HMM acoustic model for the phonotactic language recognition system, including its training and testing strategies. Section 3 describes the experimental setup. Experimental results are reported in detail in Section 4, and the paper is concluded in Section 5.

2. DNN-HMM acoustic model based phonotactic language recognition

In this work a phone recognition followed by support vector machine (PR-SVM) [12] language recognition system is used as the baseline system. The architecture of the system is shown in Fig. 1. In this system, phoneme recognizers convert the speech into phone lattices according to the given acoustic model, and the lattices are then used to perform phonotactic analysis to classify languages with an SVM. The phoneme recognizers are usually trained either on multiple language-specific speech datasets with different phone sets [3] or on the same language-specific speech data with one phone set but different acoustic models [13]. In this paper, the DNN is proposed to replace GMMs and ANNs for computing the state observation probabilities of all tied states in the HMM set [14] trained on the Switchboard corpus [15]. The DNN's structure is a conventional multilayer perceptron (MLP) with many layers.

Figure 1: Architecture of the phonotactic language recognition system: speech waveform → pre-processing and feature extraction → phone recognizers (with GMM-HMM, ANN-HMM or DNN-HMM acoustic models) → lattices → expected counting → feature supervectors → SVM classifier → score calibration and fusion.

The key idea of the DNN is to model the HMM state posterior probabilities using a neural network with many hidden layers. DNN training is typically initialized by a pre-training algorithm. The logistic sigmoid nonlinearity is chosen for the pre-trained sigmoidal networks; this form of nonlinearity is widely used in neural networks due to its smoothness and the simplicity of its gradient computation. In the DNN-HMM hybrid approach, the posterior probability of senone $s$ given observation $o$ is modeled by a DNN with $L$ hidden layers. The $L$ hidden layers perform feature transformations, while the output layer produces the posterior probability via softmax normalization. DNN training optimizes the cross-entropy function [14]:

$$F(\Lambda_{DNN}) = -\sum_{s} d_s \log p(s \mid o), \qquad (1)$$

where $\Lambda_{DNN}$ denotes the acoustic model and $d_s$ equals 0 for the non-target states and 1 for the target state.
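To make the hybrid computation concrete, the following is a minimal numpy sketch of a sigmoidal DNN forward pass producing senone posteriors via softmax, together with the cross-entropy objective of Eq. (1) for a one-hot senone target. The function names and the layer sizes are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_posteriors(o, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output over senones."""
    h = o
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                 # hidden-layer feature transforms
    z = h @ weights[-1] + biases[-1]           # output-layer activations
    z -= z.max()                               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()                         # p(s|o) via softmax normalization

def cross_entropy(posteriors, target_state):
    """Eq. (1): F = -sum_s d_s log p(s|o), with d_s one-hot on the target state."""
    return -np.log(posteriors[target_state])

# Toy example: 39-dim observation, two hidden layers (the paper uses five), 150 senones.
rng = np.random.default_rng(0)
dims = [39, 2048, 2048, 150]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
p = dnn_posteriors(rng.standard_normal(39), weights, biases)
print(cross_entropy(p, target_state=42))
```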

Given the acoustic model $\Lambda_{DNN}$, the expected counts over all possible hypotheses in the lattice $\ell$ of a speech utterance $X$ are computed as follows [16]:

$$c(h_i, \ldots, h_{i+N-1} \mid \ell) = E[c(h_i, \ldots, h_{i+N-1}) \mid X, \Lambda_{DNN}, M'] = \sum_{h_i \ldots h_{i+N-1} \in \ell} \alpha(h_i)\, \beta(h_{i+N-1}) \prod_{j=i}^{i+N-1} \xi(h_j), \qquad (2)$$

where $M'$ is the estimate of the $N$-gram probabilities that maximizes $\sum_H f(X \mid H, \Lambda_{DNN}) P(H \mid L)$ ($H = h_i \ldots h_{i+N-1}$, $L$ is the language under consideration, and $f(X \mid H, \Lambda_{DNN})$ is the likelihood of the speech utterance $X$ given $L$ and $H$). $\alpha(h_i)$ is the forward probability of the starting node of $h_i \ldots h_{i+N-1}$, $\beta(h_{i+N-1})$ is the backward probability of the ending node of $h_i \ldots h_{i+N-1}$, and $\xi(h_j)$ denotes the posterior probability of the edge $h_j$. The probability of the phone sequence $h_i \ldots h_{i+N-1}$ in the lattice is then calculated as

$$p(h_i \ldots h_{i+N-1} \mid \ell) = \frac{c(h_i \ldots h_{i+N-1} \mid \ell)}{\sum_{\forall m} c(h_m \ldots h_{m+N-1} \mid \ell)}. \qquad (3)$$

Let $d_i = h_i \ldots h_{i+n-1}$ $(n \le N)$. The probabilities of the phonetic $N$-grams in the lattice $\ell$ form a phonotactic feature supervector for the given utterance $X$:

$$X = \{p(d_1 \mid \ell), \ldots, p(d_F \mid \ell)\}.$$

In this paper, an SVM using the term frequency log-likelihood ratio (TFLLR) kernel [17] is employed as the back-end of the language recognition system. The kernel between phonotactic feature supervectors $X_1$ and $X_2$ is computed as

$$K(X_1, X_2) = \sum_{i=1}^{F} p_n(d_i \mid \ell_1)\, p_n(d_i \mid \ell_2) = \sum_{i=1}^{F} \frac{p(d_i \mid \ell_1)}{\sqrt{p(d_i \mid \mathrm{all})}} \cdot \frac{p(d_i \mid \ell_2)}{\sqrt{p(d_i \mid \mathrm{all})}}, \qquad (4)$$

where $F = f^N$, $f$ is the size of the phone inventory of the front-end phoneme recognizer, and $p(d_i \mid \mathrm{all})$ is the probability of $d_i$ in all the lattices used for language model training. The decision is then based on the output score of the SVM:

$$f(X) = \sum_{l} \alpha_l K(X, X_l) + d. \qquad (5)$$

The $X_l$ are support vectors obtained from training, with a kernel that satisfies the Mercer condition. The training is carried out between the target language and the non-target languages with a one-versus-rest strategy.
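The mapping from expected counts to an SVM score (Eqs. 2–5) can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the `counts` dictionary stands in for expected n-gram counts already accumulated from a lattice, and `supervector`, `tfllr_kernel` and `svm_score` are hypothetical helper names.

```python
import numpy as np

def supervector(counts, vocab):
    """Eq. (3): normalize expected n-gram counts into probabilities p(d_i | lattice)."""
    total = sum(counts.values())
    return np.array([counts.get(d, 0.0) / total for d in vocab])

def tfllr_kernel(x1, x2, p_all):
    """Eq. (4): TFLLR kernel, each probability scaled by 1/sqrt(p(d_i|all))."""
    return np.sum((x1 / np.sqrt(p_all)) * (x2 / np.sqrt(p_all)))

def svm_score(x, support_vectors, alphas, d, p_all):
    """Eq. (5): f(X) = sum_l alpha_l K(X, X_l) + d."""
    return sum(a * tfllr_kernel(x, xl, p_all)
               for a, xl in zip(alphas, support_vectors)) + d

# Toy example with a 3-phone inventory and bigrams (F = 3^2 = 9).
phones = ["a", "b", "c"]
vocab = [p + q for p in phones for q in phones]
counts_utt = {"ab": 2.0, "ba": 1.0, "ac": 0.5}    # hypothetical expected counts
x = supervector(counts_utt, vocab)
p_all = np.full(len(vocab), 1.0 / len(vocab))     # hypothetical background probabilities
print(svm_score(x, [x], alphas=[1.0], d=-0.5, p_all=p_all))
```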

3. Experimental setup

3.1. Training and initialization of the DNN-HMM based phoneme recognizer

In this paper, we use the same training algorithm to train DNN-HMMs as in [14]. In the training stage of the DNN-HMM acoustic model, 13-dimensional PLP features plus their first-order and second-order derivatives are used as input. The input PLP features are normalized to have zero mean and unit variance based on conversation-side information. The GMM-HMM acoustic model contains from 144 to 9308 states [18] with 32 Gaussians each. Firstly the model is trained using maximum likelihood; then the ML-trained model is used to generate state-aligned transcriptions for the subsequent DNN training. A triphone language model is trained on the transcriptions of the 100-hour Switchboard English corpus. Non-phonetic units, including intermittent noise, short pause and non-speech speaker noise, are all mapped to silence, which gives a phone inventory of size 47 for the English phoneme recognizer. We set the initial learning rate to 0.2 at the fine-tuning stage. At the end of every epoch, the frame accuracy on the development set is evaluated, and the learning rate is halved if the accuracy decreases. DBN pre-training is first applied for the sigmoidal network, following the process in [19]. The implementation of the DNN is based on an extended version of the CUDAMat library [20].
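The fine-tuning schedule just described (initial rate 0.2, halved whenever development-set frame accuracy drops) amounts to a simple loop. A minimal sketch follows, with caller-supplied callbacks standing in for the actual trainer and evaluator, which are not part of the paper:

```python
def fine_tune(train_one_epoch, eval_frame_accuracy, epochs=20, lr0=0.2):
    """Sec. 3.1 schedule: start at learning rate 0.2 and halve it whenever
    the development-set frame accuracy decreases at the end of an epoch."""
    lr, best_acc = lr0, 0.0
    history = []
    for _ in range(epochs):
        train_one_epoch(lr)               # caller-supplied training step
        acc = eval_frame_accuracy()       # caller-supplied dev evaluation
        if acc < best_acc:
            lr /= 2.0                     # decay only on regression
        best_acc = max(best_acc, acc)
        history.append((lr, acc))
    return history

# Toy usage with stub callbacks:
accs = iter([50.1, 51.0, 50.8, 51.2])
print(fine_tune(lambda lr: None, lambda: next(accs), epochs=4))
```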

3.2. Phonotactic language recognition system setup

A PR-SVM language recognition system is used as the baseline in this paper. The first step is to tokenize speech by running phone recognizers with different acoustic models: the HVite decoder from HTK [21] is used to produce phone lattices, and the open-source lattice-tool from SRILM [22] is used to produce phone counts. Then the popular classifier LIBLINEAR [23] is employed to classify the feature supervectors. Finally, the LDA-MMI algorithm [24] is used for score calibration and fusion among the ANN-HMM, GMM-HMM and DNN-HMM acoustic model systems.
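As an illustration of the back-end classification step, the sketch below trains one-versus-rest linear classifiers on supervectors with scikit-learn's LinearSVC, which wraps the LIBLINEAR library used in the paper. The random features are stand-ins for real TFLLR-scaled phonotactic supervectors, and the sizes are toy values.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_utt, dim, n_lang = 200, 1000, 5             # toy sizes; the real dim is 47^3 = 103823
X = rng.random((n_utt, dim))                  # stand-in phonotactic supervectors
y = rng.integers(0, n_lang, n_utt)            # language labels

# LinearSVC trains one-vs-rest linear SVMs via the LIBLINEAR backend.
clf = LinearSVC(C=1.0).fit(X, y)
scores = clf.decision_function(X[:3])         # one score per target language
print(scores.shape)                           # (3, 5)
```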

Table 2: Performance of the DNN-HMM acoustic model with different context window lengths, 5 hidden layers, NIST LRE 2009, EN-DNN-HMM frontend (EER/Cavg in %).

frames    30s           10s          3s
11        2.24/2.06     7.26/7.20    21.88/21.92
21        2.09/1.94     6.66/6.74    19.72/20.03
31        2.15/1.99     6.91/6.90    20.75/20.92


3.3. Test, training and development datasets

The results in this paper are reported on the test trials of the 2009 NIST Language Recognition Evaluation (NIST LRE 2009). The test dataset comprises 41793 test segments of 23 languages at 30 s, 10 s and 3 s nominal durations. The CallHome, CallFriend, OGI, OHSU and VOA corpora are used for training. 22701 conversations selected from the datasets provided by NIST for the 2003, 2005 and 2007 LREs and from VOA form the development dataset.

3.4. Evaluation measures

We report the equal error rate (EER) and the average cost Cavg defined by NIST LRE 2009 [25] to describe the performance of the language recognition systems.
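For reference, the EER reported below can be computed from target and non-target trial scores as the operating point where miss and false-alarm rates cross. A minimal sketch, not the official NIST scoring tool (Cavg is omitted):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep a threshold over all observed scores and return the point
    where the miss rate and false-alarm rate are (approximately) equal."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = (1.0, 0.0)                              # (p_miss, p_fa) with worst gap
    for t in thresholds:
        p_miss = np.mean(target_scores < t)        # targets rejected
        p_fa = np.mean(nontarget_scores >= t)      # non-targets accepted
        if abs(p_miss - p_fa) < abs(best[0] - best[1]):
            best = (p_miss, p_fa)
    return (best[0] + best[1]) / 2.0

rng = np.random.default_rng(0)
tgt = rng.normal(1.0, 1.0, 1000)    # toy target-trial scores
non = rng.normal(-1.0, 1.0, 9000)   # toy non-target-trial scores
print(f"EER = {100 * equal_error_rate(tgt, non):.2f}%")
```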

4. Experimental results and discussion

In this section we present our experimental results with the DNN-HMM on NIST LRE 2009. For the single English (EN) phoneme recognizer with 47 phonemes, 3-grams are considered, so the number of possible 3-grams can be up to 47 × 47 × 47 = 103823, which is also the dimension of the bag-of-trigrams feature supervector.

4.1. Effects of context window length

In the DNN training stage, context windows of 11 frames (5 on each side), 21 frames (10 on each side) and 31 frames (15 on each side) are used. Table 1 shows the frame accuracies of DNN-HMM acoustic models with different context window lengths.

Table 1: Frame accuracies of DNN-HMM acoustic models with different context window lengths, 5 hidden layers (in %).

frames            11       21       31
frame accuracy    52.43    52.88    52.61

We also evaluate the language recognition performance of the DNN-HMM acoustic model with context windows of 11, 21 and 31 frames on NIST LRE 2009; the results are shown in Table 2. From Tables 1 and 2 we conclude that the DNN-HMM acoustic model with a 21-frame context window yields both better frame classification accuracy and better language recognition EER than the 11- and 31-frame windows. This is because a context window of 21 frames better fits the speaking rate of the training data, the Switchboard phone-call corpus. Therefore, the DNN-HMM acoustic model trained with a 21-frame context window is used in all following experiments.
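Building such a context window is just frame splicing. A small sketch (illustrative, not the paper's pipeline) that turns a (T, 39) PLP feature matrix into (T, 39 × 21) spliced inputs for a 21-frame window:

```python
import numpy as np

def splice_frames(features, left=10, right=10):
    """Stack each frame with `left` and `right` neighbors (edges padded by
    repetition), giving a context window of left + 1 + right frames."""
    T, d = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

plp = np.random.randn(100, 39)     # toy 13-dim PLP + deltas + delta-deltas
spliced = splice_frames(plp)       # 21-frame context window
print(spliced.shape)               # (100, 819)
```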

4.2. Effects of network layers

It is reported that DNNs often perform better with more hidden layers in ASR tasks, but optimization becomes harder as networks grow deeper. We evaluate the effect of different numbers of hidden layers on sigmoidal networks, using the same learning procedure as described in the previous subsection. In these experiments the number of units per hidden layer is fixed to 2048 and the number of hidden layers is varied over 3, 4, 5, 6 and 7. The resulting frame accuracies are shown in Table 3.

Table 3: Frame accuracies of DNN-HMM acoustic models with different numbers of hidden layers, context window of 21 frames (in %).

hidden layers     3        4        5        6        7
frame accuracy    51.77    52.32    52.88    52.78    52.62

Table 3 suggests that the frame accuracy does not always improve with more hidden layers. Therefore, the DNN-HMM acoustic model with 5 hidden layers and a 21-frame context window is used in all following experiments.

4.3. Effects of the number of GMM-HMM states

Table 4 shows the performance of language recognition using DNN-HMM acoustic models aligned with different numbers of senones. We previously obtained better language recognition performance with the 150-state GMM-HMM acoustic model than with 904 or 9032 states, and DNN-HMMs show the same trend. Therefore, sigmoidal networks with a 21-frame context window, 5 hidden layers and 150 GMM-HMM states are used in all following experiments.

Table 4: Performance of the DNN-HMM acoustic model with different numbers of GMM-HMM states, NIST LRE 2009, EN-DNN-HMM frontend (EER/Cavg in %).

states    30s          10s          3s
9032      2.96/2.84    9.04/9.06    24.35/24.45
904       2.70/2.58    8.45/8.54    23.83/23.87
150       2.09/1.94    6.66/6.74    19.72/20.03

4.4. Comparison of different kinds of acoustic model

In Table 5 we compare the results of language recognition systems using an ANN-HMM model, a GMM-HMM model with 150 states and a pre-trained sigmoidal DNN-HMM model as the acoustic model of the phoneme recognizer, all trained on the 100-hour subset of the English Switchboard corpus.

Table 5: Performance of systems based on different kinds of acoustic model, NIST LRE 2009 (EER/Cavg in %).

Acoustic model    30s          10s          3s
EN-GMM-HMM        2.39/2.28    7.20/7.21    20.22/20.19
EN-ANN-HMM        2.92/2.83    7.75/7.73    23.41/23.42
EN-DNN-HMM        2.09/1.94    6.66/6.74    19.72/20.03
EN fusion         1.39/1.20    4.28/4.11    15.83/15.61

EN-ANN-HMM is trained using TRAP features and a context window of 21 frames. Experiments show that EN-DNN-HMM provides dramatic improvements in language recognition accuracy: it offers relative EER reductions of 28.42%, 14.06% and 18.70% over the ANN-HMM acoustic model, and of 12.55%, 7.20% and 2.47% over the GMM-HMM acoustic model. The performance on longer speech utterances (30 s) improves more dramatically than on 10 s and 3 s because DNNs are more powerful than GMM-HMMs in modeling long-context acoustic events. Generally, the ANN-HMM acoustic model outperforms the GMM-HMM in language recognition EER when training data are scarce, and the opposite holds when training data are plentiful [26]. In this work the acoustic models are all trained on the 100-hour subset of the English Switchboard corpus, which is plenty for modeling acoustic events, so here the GMM-HMM acoustic model outperforms the ANN-HMM acoustic model. Fig. 2 shows the DET curves for NIST LRE09, English frontend.

Figure 2: DET curves for NIST LRE09, EN frontend (miss probability vs. false alarm probability, in %, for the ANN-HMM, GMM-HMM and DNN-HMM systems).

4.5. Comparison of real time factor for decoding

Table 6 shows the decoding real time factors of the GMM-HMM, DNN-HMM and ANN-HMM acoustic models. Although training the DNN-HMM acoustic model is somewhat expensive compared with training the GMM-HMM acoustic model, decoding with the DNN-HMM acoustic model is very efficient because it is carried out on GPU machines, and the structure of the DNN-HMM acoustic model is no more complex than that of the GMM-HMM acoustic model.

Table 6: Comparison of real time factor for decoding, EN frontend, LRE09, 30-s test. CPU: Xeon [email protected] GHz, RAM: 8 GB, single thread. GPU: GeForce GTX 275, RAM: 1 GB, 240 CUDA cores.

acoustic model    ANN     GMM     DNN
RT factor         0.12    0.11    0.07

5. Conclusion

An approach using the DNN-HMM acoustic model in a language recognition system has been presented in this paper. The state-of-the-art DNN-HMM is proposed to model acoustic events for the PR-SVM LRE system. The DNN-HMM modeling approach is evaluated on the NIST LRE 2009 task and yields relative equal error rate reductions of 28.42%, 14.06% and 18.70% over the ANN-HMM approach and 12.55%, 7.20% and 2.47% over the GMM-HMM approach for 30 s, 10 s and 3 s respectively. As future work, we will develop effective adaptation techniques for DNNs.

6. Acknowledgements

This project is supported by the National Natural Science Foundation of China (No. 61005019, No. 61273268 and No. 61370034).

7. References

[1] H. Li, B. Ma, and K.-A. Lee, "Spoken language recognition: from fundamentals to practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136–1159, 2013.

[2] A. Waibel, P. Geutner, L. M. Tomokiyo, T. Schultz, and M. Woszczyna, "Multilinguality in speech and spoken language systems," Proceedings of the IEEE, vol. 88, no. 8, pp. 1297–1313, 2000.
[3] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, pp. 33–44, 1996.
[4] P. A. Torres-Carrasquillo, E. Singer, M. A. Kohler, R. J. Greene, D. A. Reynolds, and J. R. Deller, "Approaches to language identification using Gaussian mixture models and shifted delta cepstral features," in Proc. ICSLP, Sep. 2002, pp. 33–36.
[5] N. Morgan and H. Bourlard, "Continuous speech recognition using multilayer perceptrons with hidden Markov models," in Proc. ICASSP, 1990, pp. 413–416.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[7] A.-R. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Proc. ICASSP, 2012, pp. 4273–4276.
[8] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[9] Y. Song, B. Jiang, Y. Bao, S. Wei, and L.-R. Dai, "I-vector representation based on bottleneck features for language identification," Electronics Letters, vol. 49, no. 24, pp. 1569–1570, 2013.
[10] K. J. Han, S. Ganapathy, M. Li, M. K. Omar, and S. Narayanan, "TRAP language identification system for RATS phase II evaluation," in Proc. Interspeech, 2013.
[11] G. Montavon, "Deep learning for spoken language identification," in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
[12] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, and P. A. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Computer Speech and Language, vol. 20, no. 2–3, pp. 210–229, Jan. 2006.
[13] K. C. Sim and H. Li, "On acoustic diversification front-end for spoken language identification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 1029–1037, 2008.
[14] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[15] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in Proc. ICASSP, vol. 1, 1992, pp. 517–520.
[16] J. L. Gauvain, A. Messaoudi, and H. Schwenk, "Language recognition using phone lattices," in Proc. ICSLP, Jeju Island, Oct. 2004, pp. 1283–1286.
[17] W. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek, "Phonetic speaker recognition with support vector machines," in Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press, 2004.
[18] M. Cai, Y. Shi, and J. Liu, "Deep maxout neural networks for speech recognition," in Proc. ASRU, 2013.
[19] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, 2011, pp. 24–29.
[20] V. Mnih, "CUDAMat: a CUDA-based matrix class for Python," Department of Computer Science, University of Toronto, Tech. Rep. UTML TR, vol. 4, 2009.
[21] S. Young et al., "The HTK book," vol. 3, 2002. [Online]. Available: http://www.phonetik.uni-muenchen.de/studium/skripten/
[22] A. Stolcke, "SRILM - an extensible language modeling toolkit," Sep. 2002. [Online]. Available: http://www.speech.sri.com/projects/srilm/
[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[24] W.-Q. Zhang, T. Hou, and J. Liu, "Discriminative score fusion for language identification," Chinese Journal of Electronics, vol. 19, pp. 124–128, Jan. 2010.
[25] "The 2009 NIST language recognition evaluation plan," Apr. 2009. [Online]. Available: http://www.itl.nist.gov/iad/mig/tests/lang/2009/
[26] P. Matejka, P. Schwarz, J. Cernocký, and P. Chytil, "Phonotactic language identification using high quality phoneme recognition," in Proc. Interspeech, 2005, pp. 2237–2240.
