Speaker Adaptation Based on Sparse and Low-rank Eigenphone Matrix Estimation

Wen-Lin Zhang¹, Dan Qu¹, Wei-Qiang Zhang², Bi-Cheng Li¹

¹ Zhengzhou Information Science and Technology Institute, Zhengzhou, China
² Department of Electronic Engineering, Tsinghua University, Beijing, China

zwlin [email protected], [email protected], [email protected], [email protected]

Abstract

Eigenphone based speaker adaptation outperforms the conventional MLLR and eigenvoice methods when the adaptation data is sufficient, but it suffers from severe over-fitting when the adaptation data is limited. In this paper, l1 and nuclear norm regularization are applied simultaneously to obtain a more robust eigenphone estimate, resulting in a sparse and low-rank eigenphone matrix. The sparsity constraint reduces the number of free parameters, while the low-rank constraint limits the dimension of the phone variation subspace; both benefit generalization. Experimental results show that the proposed method improves adaptation performance substantially, especially when the amount of adaptation data is limited.

Index Terms: eigenphones, speaker adaptation, l1 regularization, nuclear norm regularization

1. Introduction

Model space speaker adaptation is an important technique in modern speech recognition systems. Given some adaptation data, the parameters of a speaker-independent (SI) system are transformed to match the speaking pattern of an unknown speaker, resulting in a speaker-adapted (SA) system. To deal with the sparsity of the adaptation data, parameter sharing schemes are usually adopted. For example, in the eigenvoice based method [1], the speaker-dependent (SD) models are assumed to lie in a low-dimensional subspace, namely the speaker subspace. The subspace bases, i.e. the eigenvoices, are shared among all speakers; for each new speaker, a speaker-specific coordinate vector, the speaker factor, is estimated to obtain the SA model. The maximum likelihood linear regression (MLLR) method [2] estimates a set of linear transformations to transform an SI model into a new SD model. Using regression class trees, the HMM state components can be grouped into regression classes, with each class sharing the same transformation matrix.

Recently, a novel phone subspace based method, the eigenphone based method, was proposed [3]. Unlike speaker subspace based methods, the phone variation patterns of a speaker are assumed to lie in a low-dimensional subspace, called the phone variation subspace. The coordinates of the whole phone set are shared among different speakers. During speaker adaptation, a speaker-dependent eigenphone matrix, which represents the main phone variation patterns of a specific speaker, is estimated. Owing to its more elaborate modeling, the eigenphone method performs better than both the eigenvoice and MLLR methods when sufficient adaptation data is available. However, with limited adaptation data,

the maximum likelihood estimate shows severe over-fitting, resulting in very poor adaptation performance [3]. Even with a finely tuned Gaussian prior, the eigenphone matrix estimated by the maximum a posteriori (MAP) criterion still does not match the performance of the eigenvoice method.

In machine learning, regularization techniques are widely employed to address data sparsity and model complexity, and they have recently been widely adopted in speech processing and recognition applications. For instance, l1 and l2 regularization have been proposed for spectral de-noising in speech recognition [4, 5]. In [6], similar regularization methods are adopted to improve the estimation of state-specific parameters in the subspace Gaussian mixture model (SGMM). In [7], l1 regularization is used to reduce the number of nonzero connections of deep neural networks without sacrificing speech recognition performance.

In this paper, we investigate regularized estimation of the eigenphone matrix for speaker adaptation. l1 norm regularization is used to control the sparsity of the matrix, and nuclear norm regularization forces the eigenphone matrix to be low-rank. The basic considerations are that sparsity can alleviate over-fitting and low rank can automatically control the dimension of the phone variation subspace. In the next section, a brief overview of the eigenphone based speaker adaptation method is given. The l1 norm and nuclear norm regularization are described in Section 3, and the optimization of the sparse and low-rank eigenphone matrix is presented in Section 4. Finally, in Section 5, we present experiments on supervised speaker adaptation of a Mandarin tonal syllable recognition system.

2. Review of the eigenphone based speaker adaptation method

Given a set of speaker-independent HMMs containing a total of M mixture components across all states and models, and a D-dimensional speech feature vector, let μ_m, μ_m(s) and u_m(s) = μ_m(s) − μ_m denote the SI mean vector, the SD mean vector and the phone variation vector for speaker s and mixture component m, respectively. In the eigenphone based speaker adaptation method, the phone variation vectors {u_m(s)}_{m=1}^{M} are assumed to lie in a speaker-dependent N-dimensional (N ≪ M) phone variation subspace. Let v_0(s) and {v_n(s)}_{n=1}^{N} denote the origin and the basis vectors of speaker s's phone variation subspace respectively; then each phone variation vector u_m(s) can be written as

    u_m(s) = v_0(s) + \sum_{n=1}^{N} l_{mn} v_n(s),    (1)

where l_{mn} is the coefficient of component m corresponding to basis vector v_n(s). We call {v_i(s)}_{i=0}^{N} the eigenphones of speaker s and [l_{m1}, l_{m2}, ..., l_{mN}]^T the phone coordinate vector of component m. The eigenphone decomposition of speaker s's phone variation matrix can be expressed as

    U(s) = [u_1(s)  u_2(s)  ...  u_M(s)] = V(s) L,    (2)

where V(s) = [v_0(s)  v_1(s)  v_2(s)  ...  v_N(s)] and

    L = [ 1       1       1      ...  1
          l_{11}  l_{21}  l_{31} ...  l_{M1}
          l_{12}  l_{22}  l_{32} ...  l_{M2}
          ...     ...     ...    ...  ...
          l_{1N}  l_{2N}  l_{3N} ...  l_{MN} ].

Equation (2) can be viewed as the decomposition of the phone variation matrix U(s) into the product of two low-rank matrices V(s) and L. The eigenphone matrix V(s) is speaker dependent and summarizes the main phone variation patterns of speaker s. The phone coordinate matrix L is speaker independent and implicitly reflects the correlation between different Gaussian components. Given the SD models of a set of training speakers, L can be obtained using principal component analysis (PCA) [3].

During speaker adaptation, given some adaptation data, the eigenphone matrix V(s) is estimated using the maximum likelihood criterion. Let O = {o(1), o(2), ..., o(T)} denote the sequence of feature vectors of the adaptation data. Using the expectation-maximization (EM) algorithm, the auxiliary function to be optimized is

    Q(V(s)) = -\frac{1}{2} \sum_t \sum_m \gamma_m(t) [o(t) - \mu_m(s)]^T \Sigma_m^{-1} [o(t) - \mu_m(s)],    (3)

where μ_m(s) = μ_m + u_m(s) and γ_m(t) is the posterior probability of being in mixture m at time t given the observation sequence O and the current estimate of the SD model. Suppose the covariance matrix Σ_m is diagonal; let σ_{m,d} denote its d-th diagonal element, and let o_d(t), μ_{m,d} and v_{n,d}(s) denote the d-th components of o(t), μ_m and v_n(s) respectively. Then Equation (3) simplifies to

    Q(V(s)) = -\frac{1}{2} \sum_t \sum_m \sum_d \gamma_m(t) \sigma_{m,d}^{-1} [o'_{m,d}(t) - \hat{l}_m^T \nu_d(s)]^2,    (4)

where o'_{m,d}(t) = o_d(t) - μ_{m,d}, \hat{l}_m = [1, l_{m1}, l_{m2}, ..., l_{mN}]^T and ν_d(s) = [v_{0,d}(s), v_{1,d}(s), v_{2,d}(s), ..., v_{N,d}(s)]^T, which is the d-th row of the eigenphone matrix V(s). Define

    A_d = \sum_t \sum_m \gamma_m(t) \sigma_{m,d}^{-1} \hat{l}_m \hat{l}_m^T,
    b_d = \sum_t \sum_m \gamma_m(t) \sigma_{m,d}^{-1} o'_{m,d}(t) \hat{l}_m.

Then Equation (4) can be further simplified to

    Q(V(s)) = -\frac{1}{2} \sum_d [\nu_d(s)^T A_d \nu_d(s) - 2 b_d^T \nu_d(s)] + Const.    (5)

Setting the derivative of (5) with respect to ν_d(s) to zero yields \hat{ν}_d(s) = A_d^{-1} b_d. Because different feature dimensions are independent, {\hat{ν}_d(s)}_{d=1}^{D} can be calculated in parallel very efficiently.

The size of the eigenphone matrix V(s) is (N+1) × D, which has more free parameters than the eigenvoice method. For the MLLR method with a global transformation matrix and a bias vector, the parameter size is (D+1) × D. The eigenphone method is thus more flexible and elaborate. When sufficient adaptation data is available, better adaptation performance can be obtained, but when the adaptation data is limited, performance degrades quickly; the recognition rate can even fall below that of the unadapted SI system when very limited adaptation data is available. To alleviate this over-fitting problem, a Gaussian prior is assumed and a MAP adaptation method is derived in [3]. In this paper, we address the problem using an explicit matrix regularization function.
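As a concrete illustration, the per-dimension closed-form update \hat{ν}_d(s) = A_d^{-1} b_d can be sketched in NumPy as follows. All statistics (posteriors, variances, residuals, phone coordinates) are random toy stand-ins, not values from the paper's system.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, M, T = 5, 3, 8, 12  # toy sizes: feature dim, subspace dim, components, frames

gamma = rng.random((T, M))                       # posteriors gamma_m(t)
sigma = rng.random((M, D)) + 0.5                 # diagonal variances sigma_{m,d}
o_res = rng.standard_normal((T, M, D))           # residuals o'_{m,d}(t) = o_d(t) - mu_{m,d}
l_hat = np.vstack([np.ones((1, M)),              # extended coordinates [1, l_m1, ..., l_mN]^T
                   rng.standard_normal((N, M))])

V_hat = np.zeros((D, N + 1))                     # row d holds the estimate of nu_d(s)
for d in range(D):
    A_d = np.zeros((N + 1, N + 1))
    b_d = np.zeros(N + 1)
    for t in range(T):
        for m in range(M):
            w = gamma[t, m] / sigma[m, d]        # gamma_m(t) * sigma_{m,d}^{-1}
            A_d += w * np.outer(l_hat[:, m], l_hat[:, m])
            b_d += w * o_res[t, m, d] * l_hat[:, m]
    V_hat[d] = np.linalg.solve(A_d, b_d)         # closed-form: nu_d(s) = A_d^{-1} b_d
```

Since each A_d and b_d depends only on dimension d, the D linear solves are independent, which is what makes the parallel computation mentioned above possible.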

3. Sparse and low-rank eigenphone matrix estimation

The core of the eigenphone adaptation method is the robust estimation of the eigenphone matrix V(s). This type of problem, i.e. the estimation of an unknown matrix from observation data, appears frequently in the literature of diverse fields, and regularization has proved an effective way to overcome data scarcity.

One widely used regularizer is the l1 norm. For the eigenphone matrix V(s), the matrix l1 norm can be written as ||V(s)||_1 = \sum_d ||ν_d(s)||_1 = \sum_d \sum_n |v_{n,d}(s)|. l1 norm regularization, sometimes referred to as the lasso, drives an element-wise shrinkage of V(s) towards zero, leading to a sparse matrix solution.

Recently, in many matrix estimation problems, such as matrix completion [8] and robust PCA [9], a nuclear norm regularizer has been used to obtain a low-rank solution. This approach is closely related to the use of the l1 norm as a surrogate for sparsity: low rank corresponds to sparsity of the vector of singular values, and the nuclear norm is the l1 norm of that vector. For the eigenphone matrix V(s), the nuclear norm can be written as ||V(s)||_* = \sum_i κ_i, where the κ_i are the singular values of V(s).

In eigenphone based speaker adaptation, sparsity and low-rank constraints can be applied simultaneously to obtain a more robust estimate of the eigenphone matrix, for two reasons. Firstly, the sparsity constraint reduces the number of free parameters and thus alleviates over-fitting. Secondly, when the adaptation data is insufficient, many speaker-specific phone variation patterns will not be observed, so a low-dimensional phone variation subspace should be assumed, i.e. the rank of the eigenphone matrix should be limited. The solutions of low-rank estimation problems are in general not sparse at all.
In this paper, a linear combination of the l1 and nuclear norms is used to obtain a simultaneously sparse and low-rank matrix [10]. The resulting regularized objective function is

    Q'(V(s)) = Q(V(s)) + λ_1 ||V(s)||_1 + λ_2 ||V(s)||_*,    (6)

where λ_1, λ_2 > 0.
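Both regularizers are straightforward to evaluate. A minimal NumPy sketch, using a random toy matrix rather than an eigenphone matrix from a real system:

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.standard_normal((40, 39))    # toy stand-in for a (N+1) x D eigenphone matrix

# l1 norm: sum of absolute values of all entries
l1_norm = np.abs(V).sum()

# nuclear norm: the l1 norm of the vector of singular values
nuclear_norm = np.linalg.svd(V, compute_uv=False).sum()

# the nuclear norm never exceeds the entry-wise l1 norm
assert nuclear_norm <= l1_norm
```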

4. Optimization

There is no closed-form solution to the regularized objective function (6). Numerous approaches have been proposed in the literature to solve the l1 norm and nuclear norm penalty problems

separately. For the mixed norm penalty problem, we adopt the incremental proximal descent algorithm [10, 11]. For a convex regularizer R(X), X ∈ R^{m×n}, the proximal operator is defined as

    prox_R(X) = \arg\min_Y \frac{1}{2} ||Y - X||_F^2 + R(Y),    (7)

where ||·||_F denotes the Frobenius norm of a matrix. The proximal operator for the l1 norm is the soft thresholding operator

    prox_{γ||·||_1}(X) = sgn(X) ∘ (|X| - γ)_+,    (8)

where ∘ denotes the Hadamard product of two matrices and (x)_+ = max{x, 0}; the sign function (sgn), product and maximum are all taken component-wise. For the nuclear norm, the proximal operator is given by the following shrinkage operation [11]: if X = P diag(ν_1, ν_2, ..., ν_n) Q^T is the singular value decomposition of X, then

    prox_{γ||·||_*}(X) = P diag((ν_i - γ)_+) Q^T.    (9)
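The two proximal operators (8) and (9) translate directly into NumPy; the example values below are illustrative toy inputs.

```python
import numpy as np

def prox_l1(X, gamma):
    """Soft thresholding (8): sgn(X) * (|X| - gamma)_+, taken entry-wise."""
    return np.sign(X) * np.maximum(np.abs(X) - gamma, 0.0)

def prox_nuclear(X, gamma):
    """Singular value shrinkage (9): soft-threshold the singular values of X."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ np.diag(np.maximum(s - gamma, 0.0)) @ Qt

# Entries with magnitude below the threshold are zeroed out ...
X = np.array([[2.0, -0.5], [0.1, -3.0]])
assert np.allclose(prox_l1(X, 1.0), [[1.0, 0.0], [0.0, -2.0]])

# ... and small singular values are zeroed out, reducing the rank
Z = prox_nuclear(np.diag([3.0, 0.5]), 1.0)
assert np.linalg.matrix_rank(Z) == 1
```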

The proximity operator of a convex function is a natural extension of the notion of a projection operator onto a convex set. The incremental proximal descent algorithm [11] can be viewed as a natural extension of the iterated projection algorithm, which activates each convex set modeling a constraint individually by means of its projection operator. In this paper, an accelerated version of the incremental proximal descent algorithm is introduced for estimation of the eigenphone matrix V, summarized as follows.

Algorithm 1 Accelerated incremental proximal descent algorithm for sparse and low-rank eigenphone matrix estimation
 1: θ ← θ_0                                      ▷ initialize the descent step size
 2: V ← V̂                                       ▷ V̂ is the solution of (5)
 3: Q_new ← Q(V) + λ_1||V||_1 + λ_2||V||_*      ▷ Equation (6)
 4: repeat
 5:   Q_old ← Q_new, θ ← ηθ
 6:   repeat                                     ▷ search for the step size
 7:     V ← V − θ ∇_V Q(V)
 8:     V ← prox_{θλ_1||·||_1}(V)
 9:     V ← prox_{θλ_2||·||_*}(V)
10:     Q_new ← Q(V) + λ_1||V||_1 + λ_2||V||_*
11:     if Q_new > Q_old then
12:       θ ← η^{−1} θ
13:     end if
14:   until Q_new < Q_old
15: until |Q_old − Q_new| / |Q_old| < ε

In Algorithm 1, ∇_V Q(V) is the gradient of (5), which can be easily calculated from ∇_{ν_d(s)} Q(V) = −A_d ν_d(s) + b_d. Step 7 is the normal gradient descent step on the original objective function Q(V(s)). In Steps 8 and 9, the proximal operators of the l1 norm and the nuclear norm are applied in sequence. The initial descent step size θ_0 can be set to the inverse of the Lipschitz constant [12] of Q(V(s)). In this paper, to accelerate convergence, the descent step size is increased by a predefined factor η (> 1) at each iteration (Step 5). In Steps 6 to 14, we check the value of the regularized objective function (6) and reduce the step size by a factor of η^{−1} until the objective decreases. The whole procedure is iterated until the relative change of (6) is smaller than a predefined threshold ε (Step 15).
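Algorithm 1 can be sketched end-to-end on a toy instance as below. The quadratic statistics A_d and b_d are random stand-ins, the objective is the negated auxiliary function (5) written as a quantity to minimize, and the λ values, step sizes and tolerance are illustrative defaults rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 6, 4                                          # toy sizes
A = [np.eye(N + 1) * (2.0 + d) for d in range(D)]    # toy A_d (positive definite)
b = [rng.standard_normal(N + 1) for _ in range(D)]   # toy b_d

def Q(V):
    # Negated auxiliary function (5), per-dimension quadratic to minimize
    return sum(0.5 * V[d] @ A[d] @ V[d] - b[d] @ V[d] for d in range(D))

def grad_Q(V):
    return np.stack([A[d] @ V[d] - b[d] for d in range(D)])

def prox_l1(X, g):                                   # soft thresholding (8)
    return np.sign(X) * np.maximum(np.abs(X) - g, 0.0)

def prox_nuclear(X, g):                              # singular value shrinkage (9)
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    return P @ np.diag(np.maximum(s - g, 0.0)) @ Qt

def regularized_obj(V, lam1, lam2):                  # Equation (6)
    return (Q(V) + lam1 * np.abs(V).sum()
            + lam2 * np.linalg.svd(V, compute_uv=False).sum())

def estimate(lam1, lam2, theta0=0.1, eta=1.5, eps=1e-6, max_iter=200):
    # Start from the unregularized solution of (5)
    V = np.stack([np.linalg.solve(A[d], b[d]) for d in range(D)])
    q_new = regularized_obj(V, lam1, lam2)
    theta = theta0
    for _ in range(max_iter):
        q_old, theta = q_new, eta * theta            # Step 5: grow the step size
        accepted = False
        for _ in range(50):                          # Steps 6-14: step size search
            V_try = V - theta * grad_Q(V)            # Step 7: gradient step
            V_try = prox_l1(V_try, theta * lam1)     # Step 8: l1 proximal step
            V_try = prox_nuclear(V_try, theta * lam2)  # Step 9: nuclear proximal step
            q_new = regularized_obj(V_try, lam1, lam2)
            if q_new < q_old:
                accepted = True
                break
            theta /= eta                             # Step 12: shrink the step size
        if not accepted:                             # no decreasing step found
            break
        V = V_try
        if abs(q_old - q_new) / abs(q_old) < eps:    # Step 15: convergence check
            break
    return V
```

Starting from the ML solution of (5), each outer iteration interleaves one gradient step with the two proximal maps, so the regularized objective is non-increasing over accepted steps.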

5. Experiments

Experiments were performed on a Mandarin Chinese continuous speech recognition task using the Microsoft speech corpus [13]. The training set contains 19,688 sentences from 100 speakers with a total of 454,315 syllables (about 33 hours in total). The testing set consists of 25 speakers, each contributing 20 sentences (the average sentence length is about 5 seconds). All experiments were based on the standard HTK (v3.4.1) toolset [14]. The frame length and frame step size were set to 25 ms and 10 ms, respectively. Acoustic features were constructed from 13-dimensional Mel-frequency cepstral coefficients and their first and second derivatives.

The basic units for acoustic modeling are the 27 initial and 157 tonal final units of Mandarin Chinese, as described in [13]. Monophone models were first created using all 19,688 sentences. Then all possible cross-syllable triphone expansions based on the full syllable dictionary were generated, resulting in 295,180 triphones, of which 95,534 actually occur in the training corpus. Each triphone was modeled by a 3-state left-to-right HMM without skips. After decision-tree based state clustering, the number of unique tied states was reduced to 2,392. We then used HTK's Gaussian splitting capability to incrementally increase the number of Gaussian components per state to 8, resulting in 19,136 different Gaussian components in the SI model. The standard regression class tree based MLLR method was used to obtain SD models for the 100 training speakers. HVite was used as the decoder with a fully connected syllable recognition network: all 1,679 tonal syllables are listed in the network, any syllable can be followed by any other syllable, and syllables may be separated by short pause or silence. This recognition task puts the highest demand on the quality of the acoustic models.
We drew 1, 2, 4, 6, 8 and 10 sentences randomly from each testing speaker for adaptation in supervised mode, and the tonal syllable recognition rate was measured on the remaining 10 sentences. To ensure statistical robustness, each experiment was repeated 8 times using cross-validation and the recognition rates were averaged. The recognition accuracy of the SI model is 53.04% (the baseline reference result reported in [13] is 51.21%).

For comparison, we carried out three experiments using the conventional MLLR + MAP, eigenvoice and eigenphone based adaptation methods without regularization. For MLLR + MAP adaptation, we experimented with different parameter settings; the best result was obtained with a prior weighting factor of 10 (for MAP) and 32 regression classes with a 3-block-diagonal transformation matrix (for MLLR). For eigenvoice adaptation, the dimension K of the speaker subspace was varied from 10 to 100. For the eigenphone based method, both the ML and MAP estimation schemes were tested. Results for these methods are summarized in Table 1, where for the MAP eigenphone method λ denotes the prior weighting factor.

From Table 1, it can be observed that when the adaptation data is sufficient, the eigenphone based method outperforms the MAP+MLLR method. But when the adaptation data is limited to 1 or 2 sentences (about 5 to 10 seconds), performance degrades due to over-fitting, and the degradation is worse when a higher dimensional eigenphone subspace is used. MAP estimation using a Gaussian prior (equivalent to an l2 regularization term) can alleviate over-fitting to some extent, but preventing the degradation requires a large prior weight, which in turn degrades the performance when the adaptation data is sufficient.

Table 1: Average tonal syllable recognition rate (%) after speaker adaptation using conventional methods

                         Number of adaptation sentences
Method                   1      2      4      6      8      10
MAP+MLLR                 53.32  54.93  57.83  58.50  59.65  60.16
Eigenvoice
  K = 20                 55.32  56.38  56.61  56.90  57.11  57.05
  K = 40                 55.67  56.59  57.03  57.26  57.62  57.45
  K = 60                 55.72  57.01  57.15  57.36  57.87  57.95
  K = 80                 55.37  56.97  57.39  57.45  58.14  58.18
  K = 100                55.20  57.11  57.24  57.53  57.91  58.39
ML Eigenphone
  N = 50                 33.74  51.38  58.16  59.00  59.84  60.62
  N = 100                19.14  41.46  54.30  57.91  59.44  60.13
MAP Eigenphone, N = 50
  λ = 10                 43.26  53.67  58.43  59.11  59.78  60.45
  λ = 100                50.08  53.69  56.71  58.35  59.21  59.80
  λ = 1000               53.69  54.28  55.35  56.13  56.95  57.41
  λ = 2000               53.63  54.13  54.80  55.43  56.27  56.69
MAP Eigenphone, N = 100
  λ = 10                 27.91  44.63  53.78  57.39  59.61  60.70
  λ = 100                45.24  50.31  55.77  57.55  59.34  60.30
  λ = 1000               53.29  54.22  55.75  56.78  57.41  58.29
  λ = 2000               53.92  54.28  55.52  56.34  56.55  57.74

We tested the proposed method with different regularization parameters, varying λ_1 between 0 and 100 and λ_2 between 0 and 200. Table 2 presents typical results. It can be observed that nuclear norm regularization alone (λ_1 = 0, λ_2 ≠ 0) improves the performance for both N = 50 and N = 100, especially when the adaptation data is limited to 2 sentences or fewer; a large weighting factor (λ_2 > 100) is needed to obtain the best recognition rates. We calculated the average rank of the eigenphone matrix V(s) (of size (N+1) × D) over all testing speakers in each test. For 1 and 2 adaptation sentences, the average rank is smaller than the feature dimension (D = 39); with more adaptation data, the average rank stays equal to 39. We therefore conclude that nuclear norm regularization effectively prevents the dimension of the phone variation subspace from growing larger than necessary.

Compared with nuclear norm regularization, l1 regularization alone (λ_1 ≠ 0, λ_2 = 0) improves the performance further with a small weighting factor (λ_1 < 50). This can be attributed to the sparsity constraint, which reduces the number of free parameters and thus prevents over-fitting in the estimation of the eigenphone matrix. The larger the number of eigenphones N, the larger the weighting factor λ_1 needed to achieve the best performance. In all testing conditions, many elements of V(s) become zero, resulting in a sparse eigenphone matrix. When less adaptation data is provided or a larger weighting factor λ_1 is used, the eigenphone matrix becomes sparser, meaning that fewer free parameters are estimated.

Combining the l1 norm and nuclear norm regularization, performance can be improved further. In this case, compared with using nuclear norm regularization alone, a relatively small weighting factor λ_2 < 30 suffices.
Table 2: Average tonal syllable recognition rate (%) after speaker adaptation based on sparse and low-rank eigenphone matrix estimation

                 Number of adaptation sentences
(λ1, λ2)         1      2      4      6      8      10
N = 50
  (0, 120)       54.12  56.16  58.16  59.25  59.84  60.87
  (0, 140)       54.25  56.42  58.42  59.53  60.05  60.68
  (0, 160)       54.38  56.23  58.24  58.96  59.80  60.51
  (20, 0)        54.72  56.99  58.96  59.61  59.92  60.34
  (20, 10)       55.24  57.24  59.02  59.65  60.01  60.57
  (20, 20)       54.85  56.65  58.92  59.74  60.09  60.60
  (30, 0)        54.36  56.48  58.71  59.48  60.24  60.39
  (30, 10)       54.68  56.78  58.71  59.32  60.24  60.49
  (30, 20)       54.36  56.54  58.64  59.44  60.20  60.39
N = 100
  (0, 120)       54.11  55.01  57.01  59.40  59.82  61.16
  (0, 140)       54.26  55.10  57.32  59.30  59.80  60.99
  (0, 160)       53.99  54.97  57.03  59.25  59.59  60.93
  (20, 0)        53.76  55.50  57.87  59.30  59.99  61.08
  (20, 10)       54.97  56.80  58.79  59.59  60.22  61.35
  (20, 20)       54.85  56.48  58.52  59.36  60.32  61.37
  (30, 0)        54.72  56.65  58.75  60.20  60.78  61.41
  (30, 10)       55.12  57.22  58.94  60.16  60.76  61.44
  (30, 20)       54.82  56.78  58.88  60.11  60.49  61.41

For 1-sentence (about 5 s) adaptation, the best result is 55.24% (λ_1 = 20, λ_2 = 10, N = 50), which is comparable to the best result obtained by the eigenvoice method (55.72% with K = 60). This is about a 1% relative improvement over l1 regularization alone (54.72% with λ_1 = 20, λ_2 = 0, N = 50) and about a 2.4% relative improvement over the MAP eigenphone method (53.92% with λ = 2000 and N = 100). For 2-sentence (about 10 s) adaptation, the best result is 57.24%, slightly better than the best eigenvoice result (57.11% with K = 80). With 4 or more adaptation sentences, the performance is also improved compared with the ML eigenphone method. Even with 10 sentences (about 50 s) of adaptation data, the best result (61.44%) exceeds both the MAP (60.70%) and ML (60.62%) eigenphone methods. Again, the average rank of the eigenphone matrix V(s) is smaller than 39 when there are only 1 or 2 adaptation sentences. It appears that the sparsity constraint plays the key role in the performance improvement, with the low-rank constraint as a good complement.

6. Conclusion

In this paper, we investigated applying l1 and nuclear norm regularization simultaneously to improve the robustness of eigenphone matrix estimation in eigenphone based speaker adaptation. The l1 regularization introduces sparseness and reduces the number of free parameters, thus alleviating over-fitting. The nuclear norm regularization forces the eigenphone matrix to be low-rank, preventing the dimension of the phone variation subspace from growing higher than necessary. Their linear combination yields a simultaneously sparse and low-rank eigenphone matrix. In experiments on a Mandarin Chinese syllable recognition task, we observed substantial performance improvements under all testing conditions compared with conventional methods.

7. Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61175017 and No. 61370034).

8. References

[1] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech Audio Process., vol. 8, no. 6, pp. 695–707, Nov. 2000.
[2] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Comput. Speech Lang., vol. 12, no. 2, pp. 75–98, Apr. 1998.
[3] W.-L. Zhang, W.-Q. Zhang, and B.-C. Li, "Speaker adaptation based on speaker-dependent eigenphone estimation," in Proc. of ASRU, Dec. 2011, pp. 48–52.
[4] Q. F. Tan, P. G. Georgiou, and S. S. Narayanan, "Enhanced sparse imputation techniques for a robust speech recognition front-end," IEEE Trans. Acoust., Speech, Signal Process., vol. 19, no. 8, pp. 2418–2429, Nov. 2011.
[5] Q. F. Tan and S. S. Narayanan, "Novel variations of group sparse regularization techniques with applications to noise robust automatic speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 20, no. 4, pp. 1337–1346, May 2012.
[6] L. Lu, A. Ghoshal, and S. Renals, "Regularized subspace Gaussian mixture models for speech recognition," IEEE Signal Process. Lett., vol. 18, no. 7, pp. 419–422, July 2011.
[7] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proc. of ICASSP, Mar. 2012, pp. 4409–4412.
[8] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optimization, vol. 20, no. 4, pp. 1956–1982, Jan. 2010.
[9] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 58, no. 3, pp. 11:1–11:37, May 2011.
[10] E. Richard and P.-A. Savalle, "Estimation of simultaneously sparse and low rank matrices," in Proc. of ICML, July 2012, pp. 1351–1358.
[11] D. P. Bertsekas, "Incremental proximal methods for large scale convex optimization," Math. Program., vol. 129, no. 2, pp. 163–195, Oct. 2011.
[12] K.-C. Toh and S. Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems," Pacific J. Optim., vol. 6, no. 3, pp. 615–640, 2010.
[13] E. Chang, Y. Shi, J. Zhou et al., "Speech lab in a box: a Mandarin speech toolbox to jumpstart speech related research," in Proc. of Eurospeech, 2001, pp. 2799–2802.
[14] S. Young, G. Evermann, M. Gales et al., The HTK Book (for HTK Version 3.4), 2009.
