Speaker Adaptation Based on Sparse and Low-rank Eigenphone Matrix Estimation

Wen-Lin Zhang¹, Dan Qu¹, Wei-Qiang Zhang², Bi-Cheng Li¹

¹ Zhengzhou Information Science and Technology Institute, Zhengzhou, China
² Department of Electronic Engineering, Tsinghua University, Beijing, China

zwlin [email protected], [email protected], [email protected], [email protected]
Abstract

Eigenphone based speaker adaptation outperforms the conventional MLLR and eigenvoice methods when the adaptation data is sufficient, but it suffers from severe over-fitting when the adaptation data is limited. In this paper, l1 norm and nuclear norm regularization are applied simultaneously to obtain a more robust estimate of the eigenphone matrix, resulting in a sparse and low-rank solution. The sparsity constraint reduces the number of free parameters, while the low-rank constraint limits the dimension of the phone variation subspace; both benefit generalization. Experimental results show that the proposed method improves adaptation performance substantially, especially when the amount of adaptation data is limited.

Index Terms: eigenphones, speaker adaptation, l1 regularization, nuclear norm regularization
1. Introduction

Model space speaker adaptation is an important technique in modern speech recognition systems. Given some adaptation data, the parameters of a speaker independent (SI) system are transformed to match the speaking pattern of an unknown speaker, resulting in a speaker adapted (SA) system. To deal with the sparsity of the adaptation data, parameter sharing schemes are usually adopted. For example, in the eigenvoice based method [1], the speaker dependent (SD) models are assumed to lie in a low dimensional subspace, namely the speaker subspace. The subspace bases, i.e. the eigenvoices, are shared among all speakers. For each new speaker, a speaker-specific coordinate vector, namely the speaker factor, is estimated to obtain the SA model. The maximum likelihood linear regression (MLLR) method [2] estimates a set of linear transformations to transform an SI model into a new SD model. Using regression class trees, the HMM state components can be grouped into regression classes, with each class sharing the same transformation matrix. Recently, a novel phone subspace based method, the eigenphone based method, was proposed [3]. Unlike the speaker subspace based methods, the phone variation patterns of a speaker are assumed to lie in a low dimensional subspace, called the phone variation subspace. The coordinates of the whole phone set in this subspace are shared among different speakers. During speaker adaptation, a speaker dependent eigenphone matrix, which represents the main phone variation patterns of a specific speaker, is estimated. Due to its more elaborate modeling, the eigenphone method performs better than both the eigenvoice and MLLR methods when sufficient adaptation data is available. However, with limited adaptation data,
the maximum likelihood estimate shows severe over-fitting, resulting in very poor adaptation performance [3]. Even with a finely tuned Gaussian prior, the eigenphone matrix estimated by the maximum a posteriori (MAP) criterion still does not match the performance of the eigenvoice method. In machine learning, regularization techniques are widely employed to address data sparsity and control model complexity. Recently, regularization has been widely adopted in speech processing and recognition applications. For instance, l1 and l2 regularization have been proposed for spectral de-noising in speech recognition [4, 5]. In [6], similar regularization methods are adopted to improve the estimation of state-specific parameters in the subspace Gaussian mixture model (SGMM). In [7], l1 regularization is used to reduce the number of nonzero connections in deep neural networks without sacrificing speech recognition performance. In this paper, we investigate regularized estimation of the eigenphone matrix for speaker adaptation. l1 norm regularization is used to control the sparsity of the matrix, and nuclear norm regularization forces the eigenphone matrix to be low-rank. The basic considerations are that sparsity alleviates over-fitting and that a low-rank solution automatically controls the dimension of the phone variation subspace. In the next section, a brief overview of the eigenphone based speaker adaptation method is given. The l1 norm and nuclear norm regularization are described in Section 3, and the optimization of the sparse and low-rank eigenphone matrix is presented in Section 4. Finally, in Section 5, we present experiments on supervised speaker adaptation of a Mandarin tonal syllable recognition system.
2. Review of the eigenphone based speaker adaptation method

Given a set of speaker independent HMMs containing a total of M mixture components across all states and models, and a D-dimensional speech feature vector, let µ_m, µ_m(s) and u_m(s) = µ_m(s) − µ_m denote the SI mean vector, the SD mean vector and the phone variation vector for speaker s and mixture component m, respectively. In the eigenphone based speaker adaptation method, the phone variation vectors {u_m(s)}_{m=1}^M are assumed to lie in a speaker dependent N (N ≪ M) dimensional phone variation subspace. Let v_0(s) and {v_n(s)}_{n=1}^N denote the origin and the basis vectors of speaker s's phone variation subspace, respectively; then each phone variation vector can be written as

    u_m(s) = v_0(s) + \sum_{n=1}^{N} l_{mn} v_n(s),    (1)
where l_{mn} is the coefficient of component m corresponding to basis vector v_n(s). We call {v_n(s)}_{n=0}^N the eigenphones of speaker s and [l_{m1}, l_{m2}, ..., l_{mN}]^T the phone coordinate vector of component m. The eigenphone decomposition of speaker s's phone variation matrix can be expressed as

    U(s) = [u_1(s)  u_2(s)  ...  u_M(s)] = V(s) · L,    (2)

where V(s) = [v_0(s)  v_1(s)  v_2(s)  ...  v_N(s)] and

    L = [ 1       1       1       ...  1
          l_{11}  l_{21}  l_{31}  ...  l_{M1}
          l_{12}  l_{22}  l_{32}  ...  l_{M2}
          ...     ...     ...     ...  ...
          l_{1N}  l_{2N}  l_{3N}  ...  l_{MN} ].

Equation (2) can be viewed as the decomposition of the phone variation matrix U(s) into the product of two low-rank matrices, V(s) and L. The eigenphone matrix V(s) is speaker dependent and summarizes the main phone variation patterns of speaker s. The phone coordinate matrix L is speaker independent and implicitly reflects the correlation between different Gaussian components. Given the SD models of a set of training speakers, L can be obtained using principal component analysis (PCA) [3]. During speaker adaptation, given some adaptation data, the eigenphone matrix V(s) is estimated using the maximum likelihood criterion. Let O = {o(1), o(2), ..., o(T)} denote the sequence of feature vectors of the adaptation data. Using the expectation maximization (EM) algorithm, the auxiliary function to be optimized is

    Q(V(s)) = -\frac{1}{2} \sum_t \sum_m \gamma_m(t) [o(t) - \mu_m(s)]^T \Sigma_m^{-1} [o(t) - \mu_m(s)],    (3)
where µ_m(s) = µ_m + u_m(s), and γ_m(t) is the posterior probability of being in mixture m at time t, given the observation sequence O and the current estimate of the SD model. Suppose the covariance matrix Σ_m is diagonal; let σ_{m,d} denote its dth diagonal element, and let o_d(t), µ_{m,d} and v_{n,d}(s) denote the dth components of o(t), µ_m and v_n(s), respectively. Then Equation (3) simplifies to

    Q(V(s)) = -\frac{1}{2} \sum_t \sum_m \sum_d \gamma_m(t) \sigma_{m,d}^{-1} \left[ o'_{m,d}(t) - \hat{l}_m^T \nu_d(s) \right]^2,    (4)

where o'_{m,d}(t) = o_d(t) - µ_{m,d}, \hat{l}_m = [1, l_{m1}, l_{m2}, ..., l_{mN}]^T, and ν_d(s) = [v_{0,d}(s), v_{1,d}(s), v_{2,d}(s), ..., v_{N,d}(s)]^T is the dth row of the eigenphone matrix V(s). Define

    A_d = \sum_t \sum_m \gamma_m(t) \sigma_{m,d}^{-1} \hat{l}_m \hat{l}_m^T,
    b_d = \sum_t \sum_m \gamma_m(t) \sigma_{m,d}^{-1} o'_{m,d}(t) \hat{l}_m.

Then Equation (4) can be further simplified to

    Q(V(s)) = -\frac{1}{2} \sum_d \left[ \nu_d(s)^T A_d \nu_d(s) - 2 b_d^T \nu_d(s) \right] + Const.    (5)
Setting the derivative of (5) with respect to ν_d(s) to zero yields ν̂_d(s) = A_d^{-1} b_d. Because different feature dimensions are independent, {ν̂_d(s)}_{d=1}^D can be calculated in parallel very efficiently. The eigenphone matrix V(s) is of size D × (N + 1), so the method has more free parameters than the eigenvoice method. For the MLLR method with a global transformation matrix and a bias vector, the parameter count is (D + 1) × D. The eigenphone method is therefore more flexible and elaborate: when sufficient adaptation data is available, better adaptation performance can be obtained, but when the adaptation data is limited, performance degrades quickly. The recognition rate can even be worse than that of the unadapted SI system when very limited adaptation data is available. To alleviate the over-fitting problem, a Gaussian prior is assumed and a MAP adaptation method is derived in [3]. In this paper, we address the problem using an explicit matrix regularization function.
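The per-dimension closed-form solution ν̂_d(s) = A_d^{-1} b_d can be sketched in NumPy as follows. This is an illustrative reading of Equations (4)-(5), not the authors' implementation; the array shapes and variable names are our own assumptions.

```python
import numpy as np

def estimate_eigenphones_ml(gamma, sigma2, L_hat, O_prime):
    """ML estimate of the eigenphone matrix, one feature dimension at a time.

    gamma:   (T, M) posteriors gamma_m(t)
    sigma2:  (M, D) diagonal covariance entries sigma_{m,d}
    L_hat:   (M, N+1) extended coordinates [1, l_m1, ..., l_mN]
    O_prime: (T, M, D) centred observations o'_{m,d}(t) = o_d(t) - mu_{m,d}
    Returns V: (D, N+1), whose row d is nu_d(s) = A_d^{-1} b_d.
    """
    D = sigma2.shape[1]
    V = np.zeros((D, L_hat.shape[1]))
    for d in range(D):
        w = gamma / sigma2[:, d]                 # gamma_m(t) / sigma_{m,d}, shape (T, M)
        # A_d = sum_{t,m} w[t,m] * l_hat_m l_hat_m^T  (l_hat_m is time-independent)
        A_d = (L_hat.T * w.sum(axis=0)) @ L_hat
        # b_d = sum_{t,m} w[t,m] * o'_{m,d}(t) * l_hat_m
        b_d = (w * O_prime[:, :, d]).sum(axis=0) @ L_hat
        V[d] = np.linalg.solve(A_d, b_d)         # nu_d(s) = A_d^{-1} b_d
    return V
```

Since each row ν_d(s) is solved independently, the loop over d could also be vectorised or parallelised, which is what makes the per-dimension form attractive.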
3. Sparse and low-rank eigenphone matrix estimation

The core of the eigenphone adaptation method is the robust estimation of the eigenphone matrix V(s). This type of problem, i.e. the estimation of an unknown matrix from some observation data, appears frequently in diverse fields, and regularization has proved to be an effective way to overcome data scarcity. One widely used regularizer is the l1 norm. For the eigenphone matrix V(s), the matrix l1 norm can be written as ||V(s)||_1 = \sum_d ||ν_d(s)||_1 = \sum_d \sum_n |v_{n,d}(s)|. l1 norm regularization is sometimes referred to as the lasso; it drives element-wise shrinkage of V(s) towards zero, leading to a sparse solution. Recently, in many matrix estimation problems, such as matrix completion [8] and robust PCA [9], a nuclear norm regularizer has been used to obtain a low-rank solution. This approach is closely related to the idea of using the l1 norm as a surrogate for sparsity: low rank corresponds to sparsity of the vector of singular values, and the nuclear norm is the l1 norm of that vector. For the eigenphone matrix V(s), the nuclear norm can be written as ||V(s)||_* = \sum_i κ_i, where κ_i are the singular values of V(s). In eigenphone based speaker adaptation, the sparsity and low-rank constraints can be applied simultaneously to obtain a more robust estimate of the eigenphone matrix, for two reasons. First, the sparsity constraint reduces the number of free parameters and thus alleviates over-fitting. Second, when the adaptation data is insufficient, many speaker specific phone variation patterns will not be observed, so a low dimensional phone variation subspace should be assumed, i.e. the rank of the eigenphone matrix should be limited. Note that the solutions of low-rank estimation problems are in general not sparse at all.
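Both regularizers are straightforward to compute; a small NumPy example (the matrix values are chosen purely for illustration):

```python
import numpy as np

# A toy 2x3 matrix standing in for the eigenphone matrix V(s).
V = np.array([[3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0]])

# Entry-wise l1 norm: sum of absolute values of all entries.
l1_norm = np.abs(V).sum()

# Nuclear norm: sum of the singular values of the matrix.
nuclear_norm = np.linalg.svd(V, compute_uv=False).sum()

print(l1_norm)       # 7.0
print(nuclear_norm)  # 7.0 here, since the rows are orthogonal with norms 3 and 4
```

For this particular matrix the two norms coincide; in general they penalize different structure, entry-wise sparsity versus sparsity of the singular value vector.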
In this paper, a linear combination of the l1 norm and the nuclear norm is used to obtain a simultaneously sparse and low-rank matrix [10]. The resulting regularized objective function is

    Q'(V(s)) = Q(V(s)) + λ1 ||V(s)||_1 + λ2 ||V(s)||_*,    (6)
where λ1 , λ2 > 0.
4. Optimization

There is no closed form solution to the regularized objective function (6). Numerous approaches have been proposed in the literature to solve the l1 norm and nuclear norm penalty problems separately. For the mixed norm penalty problem, we adopt the incremental proximal descent algorithm [10, 11]. For a convex regularizer R(X), X ∈ R^{m×n}, the proximal operator is defined as

    prox_R(X) = \arg\min_Y \frac{1}{2} ||Y - X||_F^2 + R(Y),    (7)
where ||·||_F denotes the Frobenius norm of a matrix. The proximal operator for the l1 norm is the soft thresholding operator

    prox_{γ||·||_1}(X) = sgn(X) ∘ (|X| − γ)_+,    (8)
where ∘ denotes the Hadamard (element-wise) product of two matrices and (x)_+ = max{x, 0}; the sign function sgn, the product and the maximum are all taken component-wise. For the nuclear norm, the proximal operator is given by the following shrinkage operation [11]: if X = P diag(κ_1, κ_2, ..., κ_n) Q^T is the singular value decomposition of X, then

    prox_{γ||·||_*}(X) = P diag((κ_i − γ)_+) Q^T.    (9)
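Equations (8) and (9) translate directly into code. Here is a minimal NumPy sketch; the function names are ours:

```python
import numpy as np

def prox_l1(X, gamma):
    """Soft thresholding (Eq. 8): entry-wise shrinkage of X toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - gamma, 0.0)

def prox_nuclear(X, gamma):
    """Singular value shrinkage (Eq. 9): soft-threshold the singular values."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ np.diag(np.maximum(s - gamma, 0.0)) @ Qt
```

Soft thresholding zeroes out small entries, while singular value shrinkage zeroes out small singular values, reducing the rank; these are exactly the two effects the mixed penalty in (6) is meant to produce.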
The proximal operator of a convex function is a natural extension of the projection operator onto a convex set. The incremental proximal descent algorithm [11] can be viewed as a natural extension of the iterated projection algorithm, which activates each convex constraint set individually by means of its projection operator. In this paper, an accelerated version of the incremental proximal descent algorithm is used to estimate the eigenphone matrix V, summarized as follows.

Algorithm 1 Accelerated incremental proximal descent for sparse and low-rank eigenphone matrix estimation
 1: θ ← θ0                                    ▷ initialize the descent step size
 2: V ← V̂                                     ▷ V̂ is the solution of (5)
 3: Qnew ← Q(V) + λ1 ||V||_1 + λ2 ||V||_*      ▷ Equation (6)
 4: repeat
 5:     Qold ← Qnew, θ ← ηθ
 6:     repeat                                ▷ search for the step size
 7:         V ← V − θ ∇_V Q(V)
 8:         V ← prox_{θλ1 ||·||_1}(V)
 9:         V ← prox_{θλ2 ||·||_*}(V)
10:         Qnew ← Q(V) + λ1 ||V||_1 + λ2 ||V||_*
11:         if Qnew > Qold then
12:             θ ← η^{-1} θ
13:         end if
14:     until Qnew < Qold
15: until |Qold − Qnew| / |Qold| < ε

In Algorithm 1, ∇_V Q(V) is the gradient of (5), computed row by row from ∇_{ν_d(s)} Q(V) = −A_d ν_d(s) + b_d. Step 7 is the usual gradient step on the original objective Q(V(s)). In Steps 8 and 9, the proximal operators of the l1 norm and the nuclear norm are applied in sequence. The initial step size θ0 can be set to the inverse of the Lipschitz constant [12] of Q(V(s)). To accelerate convergence, the step size is increased by a predefined factor η (> 1) at each outer iteration (Step 5). In Steps 6 to 14, the value of the regularized objective function (6) is checked and the step size is reduced by a factor η^{-1} until the objective decreases. The whole procedure iterates until the relative change of (6) falls below a predefined threshold ε (Step 15).
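A compact NumPy sketch of Algorithm 1 follows. We minimise the negative of the quadratic form (5) plus the two penalties; the step-size constants, iteration caps and safeguard thresholds are our own assumptions, not values from the paper.

```python
import numpy as np

def prox_l1(X, g):
    # Eq. (8): entry-wise soft thresholding
    return np.sign(X) * np.maximum(np.abs(X) - g, 0.0)

def prox_nuclear(X, g):
    # Eq. (9): soft-threshold the singular values
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ np.diag(np.maximum(s - g, 0.0)) @ Qt

def estimate_sparse_low_rank(A, b, V0, lam1, lam2, eta=1.5, theta0=1e-2, eps=1e-6):
    """A: (D, N+1, N+1) matrices A_d; b: (D, N+1) vectors b_d;
    V0: (D, N+1) initial eigenphone matrix, e.g. the ML solution A_d^{-1} b_d."""
    def obj(V):
        # -Q(V) + lam1*||V||_1 + lam2*||V||_*  (up to an additive constant)
        smooth = sum(0.5 * V[d] @ A[d] @ V[d] - b[d] @ V[d] for d in range(len(A)))
        return (smooth + lam1 * np.abs(V).sum()
                + lam2 * np.linalg.svd(V, compute_uv=False).sum())

    V, theta, q_new = V0.copy(), theta0, obj(V0)
    for _ in range(1000):                      # outer loop, capped for safety
        q_old, theta = q_new, eta * theta      # try a slightly larger step first
        while theta > 1e-12:                   # backtracking step-size search
            grad = np.stack([A[d] @ V[d] - b[d] for d in range(len(A))])
            V_try = prox_nuclear(prox_l1(V - theta * grad, theta * lam1),
                                 theta * lam2)
            q_new = obj(V_try)
            if q_new < q_old:
                V = V_try
                break
            theta /= eta                       # objective went up: shrink the step
        else:
            break                              # no decreasing step found: converged
        if abs(q_old - q_new) / max(abs(q_old), 1e-12) < eps:
            break
    return V
```

The structure mirrors Algorithm 1: a gradient step on the smooth term followed by the two proximal operators in sequence, with the step size grown by η each outer iteration and backtracked whenever the regularized objective increases.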
5. Experiments

Experiments were performed on a Mandarin Chinese continuous speech recognition task using the Microsoft speech corpus [13]. The training set contains 19,688 sentences from 100 speakers, with a total of 454,315 syllables (about 33 hours). The testing set consists of 25 speakers, each contributing 20 sentences (the average sentence length is about 5 seconds). All experiments were based on the standard HTK (v3.4.1) toolkit [14]. The frame length and frame step size were set to 25 ms and 10 ms, respectively. Acoustic features consisted of 13 dimensional Mel-frequency cepstral coefficients and their first and second derivatives. The basic units for acoustic modeling are the 27 initial and 157 tonal final units of Mandarin Chinese as described in [13]. Monophone models were first created using all 19,688 sentences. Then all possible cross-syllable triphone expansions based on the full syllable dictionary were generated, resulting in 295,180 triphones, of which 95,534 actually occur in the training corpus. Each triphone was modeled by a 3-state left-to-right HMM without skips. After decision tree based state clustering, the number of unique tied states was reduced to 2,392. We then used HTK's Gaussian splitting capability to incrementally increase the number of Gaussian components per state to 8, resulting in 19,136 Gaussian components in the SI model. The standard regression class tree based MLLR method was used to obtain the 100 training speakers' SD models. HVite was used as the decoder with a fully connected syllable recognition network: all 1,679 tonal syllables are listed in the network, any syllable can be followed by any other syllable, and syllables may be separated by a short pause or silence. This recognition task puts the highest demand on the quality of the acoustic models.
We randomly drew 1, 2, 4, 6, 8 and 10 sentences from each testing speaker for adaptation in supervised mode, and the tonal syllable recognition rate was measured on the remaining 10 sentences. To ensure statistical robustness, each experiment was repeated 8 times using cross-validation and the recognition rates were averaged. The recognition accuracy of the SI model is 53.04% (the baseline result reported in [13] is 51.21%). For comparison, we carried out three experiments using the conventional MLLR + MAP, eigenvoice and eigenphone based adaptation methods without regularization. For MLLR + MAP adaptation, we experimented with different parameter settings; the best result was obtained with a prior weighting factor of 10 (for MAP) and 32 regression classes with a 3-block-diagonal transformation matrix (for MLLR). For eigenvoice adaptation, the dimension K of the speaker subspace was varied from 10 to 100. For the eigenphone based method, both the ML and MAP estimation schemes were tested. The adaptation results of these methods are summarized in Table 1, where λ denotes the prior weighting factor for the MAP eigenphone method. From Table 1, it can be observed that when the adaptation data is sufficient, the eigenphone based method outperforms the MAP+MLLR method. But when the adaptation data is limited to 1 or 2 sentences (about 5 to 10 seconds), performance degradation emerges due to over-fitting, and the situation is worse when a higher dimensional eigenphone subspace is used. MAP estimation using a Gaussian prior (equivalent to an l2 regularization term) can alleviate over-fitting to some extent, but preventing performance degradation requires a large prior weight, which in turn degrades performance when the adaptation data is sufficient.
Table 1: Average tonal syllable recognition rate (%) after speaker adaptation using conventional methods

                             Number of adaptation sentences
Method                     1      2      4      6      8      10
MAP+MLLR                 53.32  54.93  57.83  58.50  59.65  60.16
Eigenvoice
  K = 20                 55.32  56.38  56.61  56.90  57.11  57.05
  K = 40                 55.67  56.59  57.03  57.26  57.62  57.45
  K = 60                 55.72  57.01  57.15  57.36  57.87  57.95
  K = 80                 55.37  56.97  57.39  57.45  58.14  58.18
  K = 100                55.20  57.11  57.24  57.53  57.91  58.39
ML Eigenphone
  N = 50                 33.74  51.38  58.16  59.00  59.84  60.62
  N = 100                19.14  41.46  54.30  57.91  59.44  60.13
MAP Eigenphone, N = 50
  λ = 10                 43.26  53.67  58.43  59.11  59.78  60.45
  λ = 100                50.08  53.69  56.71  58.35  59.21  59.80
  λ = 1000               53.69  54.28  55.35  56.13  56.95  57.41
  λ = 2000               53.63  54.13  54.80  55.43  56.27  56.69
MAP Eigenphone, N = 100
  λ = 10                 27.91  44.63  53.78  57.39  59.61  60.70
  λ = 100                45.24  50.31  55.77  57.55  59.34  60.30
  λ = 1000               53.29  54.22  55.75  56.78  57.41  58.29
  λ = 2000               53.92  54.28  55.52  56.34  56.55  57.74
We tested the proposed method with different regularization parameters, varying λ1 between 0 and 100 and λ2 between 0 and 200. Table 2 presents typical results. It can be observed that nuclear norm regularization alone (λ1 = 0, λ2 ≠ 0) improves the performance for both N = 50 and N = 100, especially when the adaptation data is limited to 1 or 2 sentences. A large weighting factor (λ2 > 100) is needed to obtain the best recognition rates. We calculated the average rank of the eigenphone matrix V(s) (of size D × (N + 1)) over all testing speakers in each test. For 1 and 2 sentences, the average rank is smaller than the feature dimension (D = 39); with more adaptation data, the average rank stays at 39. We conclude that nuclear norm regularization effectively prevents the dimension of the phone variation subspace from becoming larger than necessary. Compared with nuclear norm regularization, l1 regularization alone (λ1 ≠ 0, λ2 = 0) improves the performance further with a small weighting factor (λ1 < 50). This can be attributed to the sparsity constraint, which reduces the number of free parameters and thus prevents the estimate of the eigenphone matrix from over-fitting. The larger the number of eigenphones N, the larger the weighting factor λ1 needed to achieve the best performance. In all testing conditions, many elements of V(s) become zero, resulting in a sparse eigenphone matrix. When less adaptation data is provided or a larger weighting factor λ1 is used, the eigenphone matrix becomes sparser, meaning that fewer free parameters are estimated. Combining the l1 norm and nuclear norm regularization improves performance further; in this case, compared with using nuclear norm regularization alone, a relatively small weighting factor λ2 < 30 suffices.
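The rank and sparsity statistics discussed above can be computed directly from an estimated eigenphone matrix; a small sketch with our own function name and tolerance:

```python
import numpy as np

def rank_and_sparsity(V, tol=1e-6):
    """Diagnostics for an estimated eigenphone matrix V:
    numerical rank via SVD, and the fraction of (near-)zero entries."""
    s = np.linalg.svd(V, compute_uv=False)
    rank = int((s > tol * s.max()).sum()) if s.max() > 0 else 0
    sparsity = float((np.abs(V) <= tol).mean())
    return rank, sparsity
```

Averaging these two quantities over the testing speakers gives exactly the kind of rank and sparsity summaries reported in this section.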
For 1 sentence (about 5s) adaptation, the best result is 55.24% (when λ1 = 20, λ2 = 10 and N = 50), which is comparable to the best result obtained by the eigenvoice method (55.72% when K = 60). There is about 1% relative improvement compared
Table 2: Average tonal syllable recognition rate (%) after speaker adaptation based on sparse and low-rank eigenphone matrix estimation

                    Number of adaptation sentences
(λ1, λ2)          1      2      4      6      8      10
N = 50
  (0, 120)      54.12  56.16  58.16  59.25  59.84  60.87
  (0, 140)      54.25  56.42  58.42  59.53  60.05  60.68
  (0, 160)      54.38  56.23  58.24  58.96  59.80  60.51
  (20, 0)       54.72  56.99  58.96  59.61  59.92  60.34
  (20, 10)      55.24  57.24  59.02  59.65  60.01  60.57
  (20, 20)      54.85  56.65  58.92  59.74  60.09  60.60
  (30, 0)       54.36  56.48  58.71  59.48  60.24  60.39
  (30, 10)      54.68  56.78  58.71  59.32  60.24  60.49
  (30, 20)      54.36  56.54  58.64  59.44  60.20  60.39
N = 100
  (0, 120)      54.11  55.01  57.01  59.40  59.82  61.16
  (0, 140)      54.26  55.10  57.32  59.30  59.80  60.99
  (0, 160)      53.99  54.97  57.03  59.25  59.59  60.93
  (20, 0)       53.76  55.50  57.87  59.30  59.99  61.08
  (20, 10)      54.97  56.80  58.79  59.59  60.22  61.35
  (20, 20)      54.85  56.48  58.52  59.36  60.32  61.37
  (30, 0)       54.72  56.65  58.75  60.20  60.78  61.41
  (30, 10)      55.12  57.22  58.94  60.16  60.76  61.44
  (30, 20)      54.82  56.78  58.88  60.11  60.49  61.41
with l1 regularization alone (54.72% when λ1 = 20, λ2 = 0 and N = 50), and about 2.4% relative improvement compared with the MAP eigenphone method (53.92% when λ = 2000 and N = 100). For 2-sentence (about 10 s) adaptation, the best result is 57.24%, slightly better than the best eigenvoice result (57.11% when K = 100). For 4 or more adaptation sentences, the performance is also improved compared with the ML eigenphone method. Even with 10 sentences (about 50 s) of adaptation data, the best result (61.44%) exceeds those of the MAP (60.70%) and ML (60.62%) eigenphone methods. Again, the average rank of the eigenphone matrix V(s) is smaller than 39 when there are only 1 or 2 adaptation sentences. It appears that the sparsity constraint plays the key role in the performance improvement, with the low-rank constraint acting as a good complement.
6. Conclusion

In this paper, we investigated applying l1 norm and nuclear norm regularization simultaneously to improve the robustness of the eigenphone matrix estimate in eigenphone based speaker adaptation. The l1 regularization introduces sparseness and reduces the number of free parameters, thus alleviating over-fitting. The nuclear norm regularization forces the eigenphone matrix to be low-rank, preventing the dimension of the phone variation subspace from growing higher than necessary. Their linear combination yields a simultaneously sparse and low-rank eigenphone matrix. In experiments on a Mandarin Chinese tonal syllable recognition task, we observed substantial performance improvements under all testing conditions compared with conventional methods.
7. Acknowledgements This work was supported in part by the National Natural Science Foundation of China (No. 61175017 and No. 61370034).
8. References

[1] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech Audio Process., vol. 8, no. 6, pp. 695–707, Nov. 2000.
[2] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Comput. Speech Lang., vol. 12, no. 2, pp. 75–98, Apr. 1998.
[3] W.-L. Zhang, W.-Q. Zhang, and B.-C. Li, "Speaker adaptation based on speaker-dependent eigenphone estimation," in Proc. of ASRU, Dec. 2011, pp. 48–52.
[4] Q. F. Tan, P. G. Georgiou, and S. S. Narayanan, "Enhanced sparse imputation techniques for a robust speech recognition front-end," IEEE Trans. Acoust., Speech, Signal Process., vol. 19, no. 8, pp. 2418–2429, Nov. 2011.
[5] Q. F. Tan and S. S. Narayanan, "Novel variations of group sparse regularization techniques with applications to noise robust automatic speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 20, no. 4, pp. 1337–1346, May 2012.
[6] L. Lu, A. Ghoshal, and S. Renals, "Regularized subspace Gaussian mixture models for speech recognition," IEEE Signal Process. Lett., vol. 18, no. 7, pp. 419–422, Jul. 2011.
[7] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proc. of ICASSP, Mar. 2012, pp. 4409–4412.
[8] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, Jan. 2010.
[9] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 58, no. 3, pp. 11:1–11:37, May 2011.
[10] E. Richard and P.-A. Savalle, "Estimation of simultaneously sparse and low rank matrices," in Proc. of ICML, Jul. 2012, pp. 1351–1358.
[11] D. P. Bertsekas, "Incremental proximal methods for large scale convex optimization," Math. Program., vol. 129, no. 2, pp. 163–195, Oct. 2011.
[12] K.-C. Toh and S. Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems," Pacific J. Optim., vol. 6, no. 3, pp. 615–640, 2010.
[13] E. Chang, Y. Shi, J. Zhou et al., "Speech lab in a box: a Mandarin speech toolbox to jumpstart speech related research," in Proc. of Eurospeech, 2001, pp. 2799–2802.
[14] S. Young, G. Evermann, M. Gales et al., The HTK Book (for HTK Version 3.4), 2009.