DIMENSIONAL REDUCTION, COVARIANCE MODELING, AND COMPUTATIONAL COMPLEXITY IN ASR SYSTEMS

Scott Axelrod, Ramesh Gopinath, Peder Olsen, Karthik Visweswariah
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA
{axelrod,rameshg,pederao,kv1}@us.ibm.com

ABSTRACT

In this paper, we study acoustic modeling for speech recognition using mixtures of exponential models with linear and quadratic features tied across all context dependent states. These models are one version of the SPAM models introduced in [1]. They generalize diagonal covariance, MLLT, EMLLT, and full covariance models. Reduction of the dimension of the acoustic vectors using LDA/HDA projections corresponds to a special case of reducing the exponential model feature space. We see, in one speech recognition task, that SPAM models on an LDA projected space of varying dimensions achieve a significant fraction of the WER improvement in going from MLLT to full covariance modeling, while maintaining the low computational cost of the MLLT models. Further, the feature precomputation cost can be minimized using the hybrid feature technique of [2], and the number of Gaussians one needs to compute can be greatly reduced using hierarchical clustering of the Gaussians (with fixed feature space). Finally, we show that reducing the quadratic and linear feature spaces separately produces models with better accuracy, but comparable computational complexity, to LDA/HDA based models.

1. INTRODUCTION



In this paper we study acoustic models for speech recognition which are mixtures of exponential models for acoustic vectors x in R^n which use features tied across all states of a context dependent Hidden Markov model. We look at systems with linear features (b_j^T x) and quadratic features (x^T S_k x). These models were introduced in [1] under the acronym SPAM models because they are Gaussian mixture models with a subspace constraint placed on the model precisions (inverse covariance matrices) and means; although the precise condition on the means was left ambiguous in [1]. Reference [1] focused on the case of unconstrained means, in which the only constraint was that the precision matrices be a linear combination of matrices S_k which are shared across Gaussians. The SPAM models generalize the previously introduced EMLLT models [3, 4], in which the S_k are required to be rank one matrices. The well known maximum likelihood linear transform (MLLT) [5] or semi-tied covariance [6] models are the special case of EMLLT models when the number D of rank one matrices equals the dimension n of the acoustic vectors. Using the techniques developed in section 3 here and in [1, 2, 3, 4, 7], it is now possible to perform, at least to a good approximation, maximum likelihood training of these models for reasonably large scale systems, in both the completely general case and in a number of interesting subcases.

Our goal is to use these models as a tool to improve word error rates at reasonable computational cost. The time required for evaluating the acoustic model is

    T_{\mathrm{total}} = T_{\mathrm{precompute}} + N_{\mathrm{Gauss}} \times T_{\mathrm{per\text{-}Gauss}}.    (1)

Here T_{\mathrm{precompute}} is the time required to precompute all of the linear and quadratic features; N_{\mathrm{Gauss}} is the actual number of Gaussians evaluated; and T_{\mathrm{per\text{-}Gauss}} is the amount of time required for each Gaussian evaluation, which, up to constants, is just the total number of linear and quadratic features used. If we were to evaluate all of the Gaussians in a system with very many Gaussians, the term N_{\mathrm{Gauss}} \times T_{\mathrm{per\text{-}Gauss}} would very much dominate over the precomputation time. However, by clustering the Gaussians (preserving the fixed feature space), as discussed in section 5, we are able to reduce N_{\mathrm{Gauss}} to the point where the precomputation time becomes a significant fraction of the overall computation. The feature precomputation time can be reduced by either of two techniques. First, as discussed in section 4, one can reduce the effective dimensions of the samples by generalizing the heteroscedastic discriminant analysis technique of [7]. Second, one can use the hybrid technique of [2], which restricts the matrices S_k to be linear combinations of rank one matrices a_l a_l^T.

2. DEFINITION OF MODEL


The SPAM models have the form:

    p(x|s) = \sum_{g \in G(s)} \pi_g \, p(x|g),                                                     (2)

    p(x|g) = \frac{(\det P_g)^{1/2}}{(2\pi)^{n/2}} \exp\!\left( -\tfrac{1}{2}(x-\mu_g)^T P_g (x-\mu_g) \right),   (3)

where G(s) is the set of Gaussians for state s, and the precisions and means are written as

    P_g = S_0 + \sum_{k=1}^{D} \lambda_g^k S_k,                                                     (4)

    \psi_g = P_g \mu_g = b_0 + \sum_{j=1}^{d} \nu_g^j b_j.                                          (5)

The n x n symmetric matrices {S_k} and the vectors {b_j} in R^n are tied across all Gaussians. In the following, we will drop the affine shifts (i.e. set S_0 and b_0 to zero) when they don't need to be emphasized. The constraints (4) and (5) correspond to restricting to the following linear and quadratic features,

    \phi_{\mathrm{lin}}^j(x) = b_j^T x, \qquad \phi_{\mathrm{quad}}^k(x) = x^T S_k x.               (6)
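For intuition, the special cases mentioned in the introduction correspond to particular choices of the tied basis {S_k}. The following sketch (illustrative Python, not taken from the paper) constructs the EMLLT and full covariance bases; MLLT is the EMLLT case with as many rank one matrices as dimensions.

```python
import numpy as np

def emllt_basis(A):
    """EMLLT basis: rank one matrices S_k = a_k a_k^T built from the rows a_k
    of a D x n matrix A.  MLLT / semi-tied covariance is the case D = n."""
    return [np.outer(a, a) for a in A]

def full_covariance_basis(n):
    """Full covariance: the S_k span all n(n+1)/2 symmetric n x n matrices."""
    basis = []
    for i in range(n):
        for j in range(i, n):
            S = np.zeros((n, n))
            S[i, j] = S[j, i] = 1.0
            basis.append(S)
    return basis
```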




Note that \phi_{\mathrm{lin}}(x) = B^T x, where B = (b_1 \cdots b_d). The exponential model version of (3) is

    p(x|g) = \exp\!\left( \zeta_g + \nu_g^T \phi_{\mathrm{lin}}(x) - \tfrac{1}{2} \lambda_g^T \phi_{\mathrm{quad}}(x) \right),     (7)

    \lambda_g^T \phi_{\mathrm{quad}}(x) = \sum_{k=1}^{D} \lambda_g^k \, \phi_{\mathrm{quad}}^k(x), \qquad \nu_g^T \phi_{\mathrm{lin}}(x) = \sum_{j=1}^{d} \nu_g^j \, \phi_{\mathrm{lin}}^j(x),     (8)

    \zeta_g = \tfrac{1}{2}\left( \log\det P_g - n \log 2\pi - \nu_g^T B^T P_g^{-1} B \nu_g \right).     (9)
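To make the computational structure behind (1) and (7) concrete, here is a minimal evaluation sketch (illustrative Python/NumPy; all names are chosen here and are not from the paper): the d + D tied features are computed once per frame, after which each Gaussian costs a single inner product.

```python
import numpy as np

def precompute_features(x, B, S):
    """Tied features for one frame x (length n).
    B: n x d matrix whose columns are the b_j; S: list of D symmetric n x n matrices."""
    phi_lin = B.T @ x                                  # linear features  b_j^T x
    phi_quad = np.array([x @ Sk @ x for Sk in S])      # quadratic features x^T S_k x
    return phi_lin, phi_quad

def log_state_likelihood(x, B, S, zeta, nu, lam, priors):
    """log p(x|s) for one state: precompute once, then score each Gaussian, eqs. (2) and (7).
    zeta: (G,) normalizers; nu: (G, d); lam: (G, D); priors: (G,)."""
    phi_lin, phi_quad = precompute_features(x, B, S)   # paid once per frame
    logp = zeta + nu @ phi_lin - 0.5 * lam @ phi_quad  # ~ (d + D) multiplies per Gaussian
    return np.logaddexp.reduce(np.log(priors) + logp)
```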

 

3. PARAMETER ESTIMATION

In this section we consider training all of the model parameters, \theta = ({\pi_g}, {\lambda_g}, {\nu_g}, {S_k}, {b_j}), so as to maximize, or at least approximately maximize, the total log likelihood of a set of labeled training vectors (x_t, s_t). (The labeling corresponds to a fixed alignment of some speech corpus. More generally, we could of course weight the training vectors using the forward-backward algorithm for HMMs.) This can be accomplished, according to the EM algorithm, by iteratively updating the parameters. Given a current set of parameters \tilde\theta, the E-step of the EM algorithm gives the auxiliary function Q(\theta; \tilde\theta) that we need to maximize over \theta. Letting c_s be the number of samples associated with state s and s(g) be the HMM state for Gaussian g, and solving for the priors in the usual way, we have, as in [1, 2],

    Q = \sum_g c_g \, F(P_g, \Sigma_g),                                                    (10)

    c_g = \sum_t \gamma_t(g),                                                              (11)

    \pi_g = c_g / c_{s(g)},                                                                (12)

    \gamma_t(g) = \tilde\pi_g \, \tilde p(x_t|g) \,/\, \tilde p(x_t|s_t),                  (13)

    F(P, \Sigma) = \log\det P - \mathrm{trace}(P\Sigma),                                   (14)

    \Sigma_g = \tilde\Sigma_g + (\tilde\mu_g - \mu_g)(\tilde\mu_g - \mu_g)^T.              (15)

Here \tilde\mu_g and \tilde\Sigma_g are the mean and covariance matrix of the set of samples labeled by s(g), with sample x_t given a weight of \gamma_t(g)/c_g. Letting E_g[f(x)] denote the expectation value of the function f(x) for this sample distribution, we have

    \tilde\mu_g = E_g[x],                                                                  (16)

    \tilde\Sigma_g = E_g[x x^T] - \tilde\mu_g \tilde\mu_g^T.                               (17)
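As a concrete illustration of the E-step statistics (11), (16), and (17), the sketch below (illustrative Python, not the original implementation) accumulates per-Gaussian counts, means, and covariances from given posteriors \gamma_t(g):

```python
import numpy as np

def estep_stats(X, gamma):
    """E-step sufficient statistics for the SPAM M-step.
    X: (T, n) training vectors; gamma: (T, G) posteriors gamma_t(g), cf. eq. (13)."""
    c = gamma.sum(axis=0)                          # counts c_g, eq. (11)
    mu = (gamma.T @ X) / c[:, None]                # weighted means mu~_g, eq. (16)
    # second moments E_g[x x^T], then covariances Sigma~_g, eq. (17)
    second = np.einsum('tg,ti,tj->gij', gamma, X, X) / c[:, None, None]
    sigma = second - np.einsum('gi,gj->gij', mu, mu)
    return c, mu, sigma
```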

For the case of unconstrained means, fast algorithms for calculating the \lambda_g's and approximately calculating the S_k are given in [1], and exact calculation of the S_k is performed in [2]. For the general case considered here, section 3.1 describes how to jointly optimize for the tied matrix B defining the constrained linear space and the untied parameters \nu_g defining the point in that space; and section 3.2 gives an algorithm to optimize for the untied parameters (\lambda_g, \nu_g).

3.1. Optimization of the Linear Parameters

The part of the auxiliary function which depends on the linear parameters {b_j} and {\nu_g}, for fixed {S_k} and {\lambda_g}, is, up to an additive constant,

    Q_{\mathrm{lin}} = -\tfrac{1}{2} \sum_g c_g \, \big\| \tilde\psi_g - \sum_j \nu_g^j b_j \big\|^2_{P_g^{-1}},     (18)

    \tilde\psi_g = P_g \tilde\mu_g,                                                                                  (19)

where, for A a symmetric matrix,

    \| v \|^2_A = v^T A v.                                                                                           (20)

To maximize (18) we start by letting b_j^{(0)} be the solution of the total least squares problem obtained when the P_g are all replaced by their average \bar P:

    b_j^{(0)} = \bar P^{1/2} \, \mathrm{eig}_j\!\Big( \sum_g c_g \, \bar P^{-1/2} \tilde\psi_g \tilde\psi_g^T \bar P^{-1/2} \Big),     (21)

where \mathrm{eig}_j(A) stands for the eigenvector corresponding to the j'th largest eigenvalue of the symmetric matrix A. Then we simply alternate between optimizing the quadratic in {\nu_g} obtained from (18) by fixing B and the quadratic in B obtained from (18) by fixing {\nu_g}. The linear equations for the updates of \nu_g and B are, respectively,

    (B^T P_g^{-1} B) \, \nu_g = B^T P_g^{-1} \tilde\psi_g,                                                           (22)

    \sum_g c_g \, P_g^{-1} B \, \nu_g \nu_g^T = \sum_g c_g \, \tilde\mu_g \nu_g^T.                                   (23)
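A minimal sketch of this alternating update (illustrative Python; eq. (23) is solved here by vectorizing B, which is one of several possible implementations, and all names are chosen for illustration):

```python
import numpy as np

def update_nu(B, P_inv, psi_tilde):
    """Solve (22) for each Gaussian: (B^T P_g^{-1} B) nu_g = B^T P_g^{-1} psi~_g."""
    G = len(P_inv)
    nu = np.empty((G, B.shape[1]))
    for g in range(G):
        A = B.T @ P_inv[g] @ B
        rhs = B.T @ P_inv[g] @ psi_tilde[g]
        nu[g] = np.linalg.solve(A, rhs)
    return nu

def update_B(nu, P_inv, mu_tilde, c):
    """Solve (23) for the tied matrix B by vectorizing:
       sum_g c_g P_g^{-1} B nu_g nu_g^T = sum_g c_g mu~_g nu_g^T."""
    n, d = P_inv[0].shape[0], nu.shape[1]
    lhs = sum(c[g] * np.kron(np.outer(nu[g], nu[g]), P_inv[g]) for g in range(len(c)))
    rhs = sum(c[g] * np.outer(mu_tilde[g], nu[g]) for g in range(len(c)))
    B_vec = np.linalg.solve(lhs, rhs.reshape(-1, order='F'))   # column-major vec(B)
    return B_vec.reshape(n, d, order='F')
```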

3.2. Optimization of the Untied Parameters

The function Q above depends implicitly on the untied parameters \lambda_g and \nu_g, the tied parameters {S_k} and {b_j}, and the E-step means of the linear and quadratic features,

    \bar\phi_{\mathrm{lin},g} = E_g[\phi_{\mathrm{lin}}(x)], \qquad \bar\phi_{\mathrm{quad},g} = E_g[\phi_{\mathrm{quad}}(x)].

In this section, we consider optimizing with respect to the untied parameters. Dropping the subscript g, we may write

    F(\lambda, \nu) = F(P, \Sigma) = \log\det P - \lambda^T \bar\phi_{\mathrm{quad}} + 2\,\nu^T \bar\phi_{\mathrm{lin}} - \nu^T B^T P^{-1} B \nu.     (24)

F(\lambda, \nu) is a concave function of \lambda and \nu. It may be optimized by alternately maximizing with respect to \lambda and \nu until convergence. Maximization with respect to \nu for fixed \lambda gives:

    \nu = (B^T P^{-1} B)^{-1} \bar\phi_{\mathrm{lin}}.                                                               (25)

Note that in the case of unconstrained means (where we take B to be the identity), \bar\phi_{\mathrm{lin}} = \tilde\mu and \nu = P\tilde\mu, so that the model mean \mu = P^{-1}\nu equals the data mean \tilde\mu.

For the case when the means are unconstrained, an efficient technique was given in [1] to maximize F with respect to \lambda for fixed \mu. We use a similar technique here to optimize for \lambda when \nu is fixed. Namely, we apply the conjugate gradient algorithm with fast line searches for the maximum of the function F(\lambda, \nu) along the line through the value of \lambda found at the end of the last conjugate gradient iteration, in the direction h of the conjugate gradient search direction. The function to optimize when doing the line search, i.e. the part of the function which depends on the precision matrices (with \nu and the E-step feature means held fixed), is:

    \ell(t) = F(\lambda + t h, \nu)                                                                                  (26)

            = (\text{a constant}) + \log\det(P_0 + t\Delta) - t\,a - \nu^T B^T (P_0 + t\Delta)^{-1} B \nu,           (27)

    a = h^T \bar\phi_{\mathrm{quad}},                                                                                (28)

    P_0 = P(\lambda) = \sum_k \lambda^k S_k,                                                                         (29)

    \Delta = \sum_k h^k S_k.                                                                                         (30)



Let {e_i} be the eigenvalues and {w_i} an orthonormal basis of eigenvectors of P_0^{-1/2} \Delta P_0^{-1/2}. The vector v_i = P_0^{-1/2} w_i is called a generalized eigenvector of the pair (\Delta, P_0) because \Delta v_i = e_i P_0 v_i. Using

    \Delta = \sum_i e_i \, (P_0 v_i)(P_0 v_i)^T,                                                                     (31)

it is straightforward to verify that

    \ell(t) = (\text{a constant}) + \sum_i \left[ \log(1 + t e_i) - \frac{c_i^2}{1 + t e_i} \right] - t\,a, \qquad c_i = v_i^T B\nu.     (32)

Since P_0, \Delta, and \nu are fixed during the search, the e_i and c_i can be precomputed at the start of the line search. The line search is restricted to the values of t for which P_0 + t\Delta is positive definite, i.e. we restrict to the interval of t for which 1 + t e_i is positive for all i.
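A sketch of one such line search (illustrative Python; the bracketing constants, tolerances, and the scalar optimizer are choices made here, not specified in the paper):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import minimize_scalar

def line_search_t(P0, Delta, B_nu, a):
    """Maximize l(t) = const + log det(P0 + t*Delta) - t*a - (B nu)^T (P0 + t*Delta)^{-1} (B nu)
    using the generalized eigendecomposition of (Delta, P0); cf. eqs. (26)-(32)."""
    e, V = eigh(Delta, P0)            # Delta v_i = e_i P0 v_i, with v_i^T P0 v_j = delta_ij
    c = V.T @ B_nu                    # c_i = v_i^T B nu, precomputed once per search
    # P0 + t*Delta is positive definite iff 1 + t*e_i > 0 for all i
    t_lo = max((-1.0 / ei for ei in e if ei > 0), default=-np.inf)
    t_hi = min((-1.0 / ei for ei in e if ei < 0), default=np.inf)
    t_lo, t_hi = max(t_lo, -1e3), min(t_hi, 1e3)   # keep a finite bracket for the optimizer

    def neg_l(t):
        q = 1.0 + t * e
        return -(np.sum(np.log(q) - c ** 2 / q) - t * a)

    eps = 1e-8 * (t_hi - t_lo)
    res = minimize_scalar(neg_l, bounds=(t_lo + eps, t_hi - eps), method='bounded')
    return res.x
```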

4. SPAM-HDA MODELS

 

Following [7], we define a SPAM-HDA model to be one in which the acoustic vector (in general after an invertible linear transformation of the original vector) is broken into two complementary subspaces,

    x = (x_1, x_2), \qquad x_1 \in R^{n_1}, \; x_2 \in R^{n_2}, \; n = n_1 + n_2,

and the Gaussians are tied along one of the subspaces:

    \mu_g = \begin{pmatrix} \mu_g^1 \\ \mu^2 \end{pmatrix}, \qquad P_g = \begin{pmatrix} P_g^{11} & P^{12} \\ P^{21} & P^{22} \end{pmatrix},

where only \mu_g^1 and the n_1 x n_1 block P_g^{11} depend on the Gaussian g. If the P_g^{11} are unrestricted, the model is called a full covariance HDA model, or simply an HDA model. If they are diagonal, it is called a diagonal HDA model. If the P_g^{11} are allowed to be full covariance, but are required to be independent of g, the authors of [7] show that the maximum likelihood projection matrix agrees with the well known linear discriminant analysis (LDA) matrix. The more general SPAM-HDA model allows for an arbitrary subspace restriction on the P_g^{11}. The feature precomputation cost for these models is governed by the dimension n_1 of the untied subspace, which can be much smaller than the generic precomputation cost in the full dimension n.
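To see where the savings come from, a small sketch (illustrative Python; theta1 denotes the n_1 x n projection onto the untied subspace, a name chosen here; for simplicity it assumes the off-diagonal precision blocks vanish so that all Gaussian-dependent features live in the n_1 dimensional subspace):

```python
import numpy as np

def precompute_hda_features(x, theta1, B1, S1):
    """SPAM-HDA feature precomputation: project the frame once, then compute the
    tied linear/quadratic features entirely inside the small n_1-dim subspace."""
    x1 = theta1 @ x                                    # n_1-dimensional projection
    phi_lin = B1.T @ x1                                # linear features in the subspace
    phi_quad = np.array([x1 @ Sk @ x1 for Sk in S1])   # quadratic features in the subspace
    return phi_lin, phi_quad
```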

5. CLUSTERING OF GAUSSIANS

By using Gaussian clustering techniques [8], it is possible to reduce the number of Gaussians one needs to evaluate significantly below the total number of Gaussians in an acoustic model. To apply this idea to a SPAM model, we will find a collection of cluster models

    p(x|c), \quad c = 1, \ldots, N_{\mathrm{clusters}},                                                              (33)

which are exponential models using the same tied features as the SPAM model, and an assignment c(g) of each Gaussian g of the SPAM model to a cluster. Defining the pooled counts and feature means

    \tilde c_c = \sum_{g : c(g) = c} c_g,                                                                            (34)

    \bar\phi_c = \frac{1}{\tilde c_c} \sum_{g : c(g) = c} c_g \left( E_g[\phi_{\mathrm{quad}}(x)], \, E_g[\phi_{\mathrm{lin}}(x)] \right),     (35)

we choose the clustering to maximize

    \sum_g c_g \, E_g[\log p(x | c(g))] = \sum_c \tilde c_c \, F(\lambda_c, \nu_c ; \bar\phi_c).                     (36)

Here E_g[f(x)] is now the expectation of f(x) under the model p(x|g), and the function F is given by (24), evaluated with the pooled feature means \bar\phi_c. Similarly to K-means clustering, equation (36) is optimized by alternately choosing the best cluster for each Gaussian and recomputing the cluster models (i.e. optimizing F(\lambda_c, \nu_c; \bar\phi_c) using the technique of section 3.2). To do acoustic modeling at time t, we first evaluate all the cluster distributions p(x_t | c), c = 1, \ldots, N_{\mathrm{clusters}}. We then use those results to make a judicious choice of the Gaussians g from the original model for which to evaluate p(x_t | g). Simple threshold values are used in place of the contribution of the unevaluated Gaussians.
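The resulting procedure is a K-means-like alternation; a schematic sketch (illustrative Python; fit_cluster and score are assumed callables standing in for the section 3.2 optimization and for the expected log likelihood E_g[log p(x|c)], respectively):

```python
import numpy as np

def cluster_gaussians(counts, feat_means, n_clusters, fit_cluster, score, n_iters=10):
    """K-means-like clustering of SPAM Gaussians, cf. eq. (36).
    counts: (G,) occupancy counts c_g.
    feat_means: (G, F) per-Gaussian E-step feature means E_g[phi(x)]."""
    G = len(counts)
    assign = np.random.randint(n_clusters, size=G)       # arbitrary initial assignment
    for _ in range(n_iters):
        # refit each cluster model from its members' pooled statistics, eqs. (34)-(35)
        models = []
        for c in range(n_clusters):
            members = np.where(assign == c)[0]
            w = counts[members]
            pooled = (w[:, None] * feat_means[members]).sum(0) / max(w.sum(), 1e-12)
            models.append(fit_cluster(pooled))
        # reassign each Gaussian to its best-scoring cluster
        scores = np.array([[score(m, feat_means[g]) for m in models] for g in range(G)])
        assign = scores.argmax(axis=1)
    return assign, models
```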

6. EXPERIMENTAL RESULTS

We performed experiments using the same test data, training data with fixed Viterbi alignment (obtained using a baseline diagonal covariance model), and Viterbi decoder as were used in [1, 2, 3, 4]. The test set consists of words from utterances in small vocabulary grammar based tasks (addresses, digits, command and control) recorded in a car under idling, city driving, and highway driving conditions. The acoustic models had Gaussians distributed across context dependent states using BIC, based on a diagonal covariance system. The samples we used consisted of 117 dimensional vectors obtained by splicing nine consecutive thirteen dimensional cepstral vectors.

As a first step, we created LDA projection matrices based on the within class and between class full covariance statistics of the samples for each state. For a range of values of the dimension d1, we constructed matrices LDA(d1) which project from 117 to d1 dimensions, and we built full covariance models, FC(d1), based on the projected vectors. In order to verify that the projections used for the FC(d1) models were good, we also used the Gaussian level statistics of the FC models to construct LDA and HDA projection matrices (as well as a successful variant of HDA presented in [9]). The models FC(d1) gave WERs within a small relative margin of the best performing of all of the full covariance systems with the same projected dimension, with the sole exception that one FC(d1) system was outperformed by the system built on vectors output by the composition of LDA(d1) and a second LDA matrix constructed from the Gaussian level statistics of the corresponding FC model (a modest relative improvement).

Next, we built the systems we will refer to as MLLT(d1), which are MLLT systems for vectors produced by multiplication with LDA(d1). (As a check on these MLLT systems, we observed that they did as well as or better than the MLLT system based on features built using the diagonal version of HDA.) We also built the systems SPAM(d=d1, D=d1, δ=d1), which are SPAM systems with unconstrained means in dimension d1, with precision matrices constrained to a d1 dimensional subspace spanned by matrices S_k obtained using the quadratic approximation (to the total likelihood function) technique of [1]. Here SPAM(d, D, δ) denotes a SPAM model on d dimensional vectors whose precisions are constrained to a D dimensional subspace and whose means are constrained to a δ dimensional subspace. Figure 1 shows that the SPAM models achieve a significant fraction of the total improvement possible in going from MLLT to full covariance, while maintaining the same per Gaussian computational cost as the MLLT system.

[Fig. 1. Word error rate (in percentages) as a function of the dimension d1 of the LDA projected data samples, for MLLT(d1), SPAM(d=d1, D=d1, δ=d1), and FC(d1). The SPAM model achieves a significant fraction of the improvement from MLLT to the full covariance model, while having the same per Gaussian compute time as MLLT.]

Next, again using the techniques of [2], we built SPAM models with unconstrained means for the vectors produced by LDA(52). Using these models to provide E-step statistics, we computed a matrix B by the technique of section 3.1.

Fixing this B and the tied matrices S_k (together with the \lambda_g), we performed the EM algorithm, with the technique of section 3.2 for the M-step, to optimize the untied parameters. The models obtained are called SPAM(d=52, D=d1, δ=d1). Figure 2 shows that the system SPAM(d=52, D=d1, δ=d1) ties or outperforms (significantly when d1 is small) the systems SPAM(d=d1, D=d1, δ=d1) and (the even worse) MLLT(d1). All of these systems have equal per Gaussian computational cost.

[Fig. 2. Word error rate (in percentages) as a function of the dimension d1 of the linear and quadratic feature space, for MLLT(d1), SPAM(d=d1, D=d1, δ=d1), and SPAM(d=52, D=d1, δ=d1). SPAM features taken from the 52 dimensional model do better than SPAM features constrained to the LDA projected subspace.]

We conclude with two experiments showing that the precomputation cost as well as the number of Gaussians that need to be evaluated can be reduced. A comparable error rate to that of the SPAM models shown in Figures 1 and 2 is obtained from a hybrid model trained by the techniques of [2], in which the S_k are constrained to be linear combinations of rank one matrices a_l a_l^T. This comparable error rate was obtained by balancing the small degradation due to the constraint on the S_k with the small improvements due to the fact that an affine shift S_0 was included and the b_j were trained in a true maximum likelihood fashion. The hybrid model substantially reduces the feature precomputation cost. Applying clustering of Gaussians to this model, as described in section 5, we found that the error rate increased only slightly, at a significant savings in the number of Gaussians that must be evaluated per frame.

7. CONCLUSION

A SPAM model is just a state dependent mixture of exponential models with linear and quadratic features shared across all Gaussians. We have described how to train such models and have shown that both the flexibility to constrain the quadratic features and the flexibility to constrain the linear features can lead to improved accuracy at fixed computational cost per Gaussian. Furthermore, we have seen that the total computational cost can be lowered significantly by choosing features that can be precomputed quickly and by clustering the Gaussians (as exponential models with a common feature space) so that only a fraction of the Gaussians need to be evaluated.

8. REFERENCES

[1] S. Axelrod, R. Gopinath, and P. Olsen, "Modeling with a subspace constraint on inverse covariance matrices," in Proc. ICSLP, 2002.
[2] K. Visweswariah, P. Olsen, R. Gopinath, and S. Axelrod, "Maximum likelihood training of subspaces for inverse covariance modeling," submitted to ICASSP 2003.
[3] P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proc. ICASSP, 2002.
[4] P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Transactions on Speech and Audio Processing, submitted.
[5] R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP, 1998.
[6] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, 1999.
[7] N. K. Goel and A. G. Andreou, "Heteroscedastic discriminant analysis and reduced-rank HMMs for improved speech recognition," Speech Communication, vol. 26, pp. 283-297, 1998.
[8] E. Bocchieri, "Vector quantization for efficient computation of continuous density likelihoods," in Proc. ICASSP, 1993.
[9] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen, "Maximum likelihood discriminant feature spaces," in Proc. ICASSP, 2000.
