Chinese Journal of Electronics Vol.19, No.1, Jan. 2010

Discriminative Score Fusion for Language Identification∗

ZHANG Weiqiang, HOU Tao and LIU Jia

(Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)

Abstract — Language identification (LID) has received increasing interest in the speech signal processing community. With the rapid development of LID technologies, how to fuse the scores of multiple systems is becoming a research focus. In this paper, we propose a discriminative framework for LID score fusion. Heteroscedastic linear discriminant analysis (HLDA) is used for dimension reduction and de-correlation, and a Gaussian mixture model (GMM) trained with the Maximum mutual information (MMI) criterion is used as the classifier. Experiments show that the proposed method improves the performance significantly. By fusing the scores of five systems, we achieve an average cost of 2.10% for 30s trials on the 2007 NIST language recognition evaluation databases.

Key words — Language identification (LID), Score fusion, Heteroscedastic linear discriminant analysis (HLDA), Maximum mutual information (MMI).

I. Introduction

Language identification (LID) is the task of automatically identifying the language of a spoken utterance. With the increasing possibility of incorporating voice into many practical systems, LID has become more and more important. It has many potential applications, such as multilingual spoken dialog systems, spoken language translation, call-routing and spoken document retrieval[1−3]. In recent years, LID has drawn much attention in the speech signal processing community and has developed rapidly. Many typical technologies have emerged, such as Parallel phone recognition followed by n-gram Language model (PPR-LM)[1], Gaussian mixture model with Shifted delta cepstra features (GMM-SDC)[4], Support vector machine with Shifted delta cepstra features (SVM-SDC)[5], GMM mean supervectors followed by SVM (GMM-SVM)[6], and Parallel phone recognizer followed by Vector space model (PPR-VSM)[3]. As the systems become more and more complicated, how to combine the scores from the individual systems is becoming a research focus.

Generally speaking, score fusion can utilize the information from multiple systems and will lead to a better result. In fact, even in a single system, the score of one language may give some clue about the other languages. So extracting the information between languages to obtain a more precise result is also a task of score fusion. So far, in the LID field, score fusion methods include linear fusion[4], Gaussian[5], SVM[3], Artificial neural network (ANN)[7], and so on. They are also referred to as backends in some of the literature. The linear and Gaussian fusion methods are widely used in the National Institute of Standards and Technology (NIST) Language recognition evaluation (LRE)[4,5]. In the linear fusion method, each dimension of the score vector is assigned a weight, which is determined empirically or optimized on the development set, and the weighted sum is used as the final score. The Gaussian method, on the other hand, involves concatenating the scores of all systems to form a high-dimensional vector. Linear discriminant analysis (LDA) is usually performed subsequently to achieve dimension reduction. Finally, Gaussian classifiers are used to give the final scores. The Gaussian method is usually more effective than simple linear fusion; however, it suffers from the following problems: (1) a single Gaussian may be too simple to model the score distributions of various systems; (2) the equal-covariance assumption of LDA may not be satisfied for each class; (3) the parameters of the Gaussians are estimated with the Maximum likelihood (ML) criterion, which may fail to model the decision boundaries in detail. In this paper, we try to solve these problems by employing mature technologies from speech recognition. Firstly, we substitute a Gaussian mixture model (GMM) for the single Gaussian. Secondly, LDA is extended to Heteroscedastic LDA (HLDA). Thirdly, the Maximum mutual information (MMI) criterion is used to train the parameters of the GMMs. The rest of this paper is organized as follows: a brief review of GMM, HLDA and MMI is provided in Section II, Section III gives the fusion framework, Section IV demonstrates the effectiveness of each technology through detailed experiments, and the conclusion is given in Section V.

∗ Manuscript Received Mar. 2008; Accepted June 2009. This work is supported by the National Natural Science Foundation of China and Microsoft Research Asia (No.60776800), and in part by the National High Technology Development Program of China (863 Program) (No.2006AA010101, No.2007AA04Z223 and No.2008AA02Z414).


II. GMM, HLDA and MMI

1. GMM

GMM uses a weighted sum of Gaussian mixtures to describe a general distribution:
p(x|\lambda) = \sum_{\forall m} w_m \mathcal{N}(x; \mu_m, \Sigma_m)    (1)
where x is the input vector and N(·) denotes the normal distribution. w_m, μ_m, Σ_m are the weight, mean vector and covariance matrix of the m-th Gaussian mixture, and they are often referred to as a parameter set λ = {w_m, μ_m, Σ_m}. In theory, if the number of mixtures is sufficient, a GMM can approximate any distribution[8], so we expect GMM to give a better result than a single Gaussian.
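As an illustration of Eq.(1), the following Python sketch evaluates the log-density of a diagonal-covariance GMM in the log domain. The paper gives no code, so the function and variable names are our own and the diagonal-covariance form is an assumption made for brevity.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) of Eq.(1) for a diagonal-covariance GMM.

    x         : (d,)   input score vector
    weights   : (M,)   mixture weights w_m, summing to 1
    means     : (M, d) mixture means mu_m
    variances : (M, d) diagonal covariances Sigma_m
    """
    d = x.shape[0]
    # log N(x; mu_m, Sigma_m) for every mixture m
    log_norm = -0.5 * (d * np.log(2.0 * np.pi)
                       + np.sum(np.log(variances), axis=1)
                       + np.sum((x - means) ** 2 / variances, axis=1))
    # log sum_m w_m N(.) computed with log-sum-exp for numerical stability
    return np.logaddexp.reduce(np.log(weights) + log_norm)

# Toy usage: a 2-mixture GMM over a 3-dimensional score vector
w = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_log_density(np.zeros(3), w, mu, var))
```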

2. HLDA

Similar to LDA, HLDA is a linear projection method. It removes the restriction that all classes share the same covariance and can be seen as a generalization of LDA[9]. Let x be an n-dimensional feature vector. HLDA seeks a linear transformation A which maps x to a new vector y:
y = A x = \begin{bmatrix} A_p x \\ A_{n-p} x \end{bmatrix} = \begin{bmatrix} y_p \\ y_{n-p} \end{bmatrix}    (2)
where y_p are deemed to be the useful dimensions and y_{n-p} the nuisance dimensions in the transformed space. Let x_i denote a training sample and g(i) indicate its class label. In the HLDA framework, each class is modeled as a normal distribution. The probability density of x_i in the transformed space is given as
p(x_i) = \frac{|A|}{\sqrt{(2\pi)^n |\Sigma_{g(i)}|}} \exp\left\{ -\frac{(A x_i - \mu_{g(i)})^T \Sigma_{g(i)}^{-1} (A x_i - \mu_{g(i)})}{2} \right\}    (3)

where μ_{g(i)}, Σ_{g(i)} are the mean vector and covariance matrix of class g(i) in the transformed space. The objective function of HLDA is defined as the log-likelihood of all the training samples:
L(A) = \sum_{\forall i} \log p(x_i)    (4)
The HLDA solution maximizes this objective function. Different from LDA, HLDA has no closed-form solution. When presenting HLDA, Kumar utilized a nonlinear optimization method to seek its solution[9]. This method, however, is not easy to implement and has great computational complexity. Fortunately, Gales later proposed an effective iterative method based on the generalized Expectation-maximization (EM) algorithm[10]. Under the diagonal covariance assumption, the objective function of HLDA (Eq.(4)) can be simplified as (the unrelated items have been dropped)
L(A) = \sum_{j=1}^{J} \frac{N_j}{2} \log \frac{|A|^2}{|\mathrm{diag}(A_p W_j A_p^T)| \, |\mathrm{diag}(A_{n-p} T A_{n-p}^T)|}    (5)
where N_j is the number of training samples associated with the j-th class and J is the number of classes. W_j is the within-class covariance of the j-th class and T is the total (global) covariance over all classes. The individual rows can be re-estimated by[10]
\hat{a}_k = c_k G_k^{-1} \sqrt{N / (c_k G_k^{-1} c_k^T)}    (6)
where c_k is the k-th row vector of the co-factor matrix C = |A|A^{-1} for the current estimate of A, and
G_k = \begin{cases} \sum_{j=1}^{J} \frac{N_j}{a_k W_j a_k^T} W_j, & k \le p \\ \frac{N}{a_k T a_k^T} T, & k > p \end{cases}    (7)
where N is the total number of training samples and p is the number of useful dimensions mentioned above. Eq.(7) shows that the contribution from each class is proportional to the number of its training samples. In our application, the training data are unbalanced. Specifically speaking, the training data for English, Chinese, and Arabic are much more than those for Bengali and Thai. As a result, using Eqs.(6) and (7) directly incorporates undesired priors into HLDA, whereas in the evaluation each language is regarded equally. So we equalize the amounts of training data per language by setting N_j = N/J for j = 1, 2, ..., J instead of estimating them from the training data. In addition, by observing Eqs.(6) and (7), we can find that if N and {N_j} are simultaneously multiplied by a factor, the re-estimated \hat{a}_k remains unchanged. Therefore, our re-estimation formulae become
\hat{a}_k = c_k G_k^{-1} \sqrt{J / (c_k G_k^{-1} c_k^T)}    (8)
G_k = \begin{cases} \sum_{j=1}^{J} \frac{1}{a_k W_j a_k^T} W_j, & k \le p \\ \frac{J}{a_k T a_k^T} T, & k > p \end{cases}    (9)
After obtaining the transformation matrix A, we reserve its first p rows to form a submatrix A_p. Then we can multiply any input vector x by A_p to get the transformed and dimension-reduced new vector y_p:
y_p = A_p x    (10)
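Below is a minimal Python sketch of the equalized row re-estimation in Eqs.(8)-(9) and the projection of Eq.(10), following the definitions in the text (c_k a row of C = |A|A^{-1}, W_j the per-class covariance, T the global covariance). The data layout, identity initialization and iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def train_hlda(X, labels, p, n_iter=10):
    """Estimate the HLDA transform with equalized class weights (Eqs.(8)-(9)).

    X      : (N, n) stacked score vectors
    labels : (N,)   class indices 0..J-1
    p      : number of useful dimensions to keep
    """
    n = X.shape[1]
    classes = np.unique(labels)
    J = len(classes)
    T = np.cov(X, rowvar=False, bias=True)                              # global covariance
    W = [np.cov(X[labels == c], rowvar=False, bias=True) for c in classes]  # per-class covariances
    A = np.eye(n)                                                       # identity initialization (assumption)
    for _ in range(n_iter):
        for k in range(n):
            C = np.linalg.det(A) * np.linalg.inv(A)                     # co-factor matrix as defined in the text
            c_k, a_k = C[k], A[k]
            if k < p:                                                   # useful dimensions, upper branch of Eq.(9)
                G = sum(Wj / float(a_k @ Wj @ a_k) for Wj in W)
            else:                                                       # nuisance dimensions, lower branch of Eq.(9)
                G = J * T / float(a_k @ T @ a_k)
            G_inv = np.linalg.inv(G)
            A[k] = c_k @ G_inv * np.sqrt(J / float(c_k @ G_inv @ c_k))  # Eq.(8)
    return A[:p]                                                        # A_p of Eq.(10)

# Toy usage: 300 six-dimensional score vectors from 3 classes, keep 5 dimensions
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 6)) + labels[:, None]
Ap = train_hlda(X, labels, p=5)
Y = X @ Ap.T                                                            # projected, de-correlated scores y_p
```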

3. MMI

Let x_i denote a training sample and g(i) indicate its class label. (In this paper, x is in fact the new vector y_p that has undergone dimension reduction as described in Section II.2.) The traditional ML training method trains each class individually. The parameters of the j-th class maximize the log-likelihood of the training samples belonging to class j. Its objective function is given as
F_{ML}(\lambda_j) = \sum_{g(i)=j} \log p(x_i|\lambda_j)    (11)
In contrast to the ML training method, the MMI method maximizes the posterior probability of all the training samples[4]. Its objective function can be expressed as[11]
F_{MMI}(\lambda) = \sum_{\forall i} \log \frac{p(x_i|\lambda_{g(i)}) P(g(i))}{\sum_{\forall j} p(x_i|\lambda_j) P(j)}    (12)
where P(j) is the prior probability of class j. The prior probabilities are often assumed to be equal and thus they can be dropped from Eq.(12).
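For concreteness, a small sketch of the MMI objective of Eq.(12) under equal class priors, computed from a matrix of per-class log-likelihoods; the interface is our own illustration.

```python
import numpy as np

def mmi_objective(log_lik, labels):
    """F_MMI of Eq.(12) with equal priors.

    log_lik : (N, J) matrix of log p(x_i | lambda_j)
    labels  : (N,)   true class indices g(i)
    """
    # log posterior of each class for every sample; equal priors cancel out
    log_post = log_lik - np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    return float(np.sum(log_post[np.arange(len(labels)), labels]))
```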


The objective function of MMI (Eq.(12)) can be optimized by the Extended Baum-Welch (EBW) algorithm. The re-estimation formulae can be expressed as[11]
\hat{\mu}_{jm} = \frac{\theta_{jm}^{num}(X) - \theta_{jm}^{den}(X) + D_{jm}\mu_{jm}}{\gamma_{jm}^{num} - \gamma_{jm}^{den} + D_{jm}}    (13)
\hat{\Sigma}_{jm} = \frac{\theta_{jm}^{num}(X^2) - \theta_{jm}^{den}(X^2) + D_{jm}(\Sigma_{jm} + \mu_{jm}^2)}{\gamma_{jm}^{num} - \gamma_{jm}^{den} + D_{jm}} - \hat{\mu}_{jm}^2    (14)
where the superscripts "num" and "den" denote numerator and denominator items, the subscripts j and m denote the class and mixture, and D_{jm} is a smoothing constant. \mu^2 is shorthand for \mu\mu^T in the full covariance case and for diag(\mu\mu^T) in the diagonal covariance case. The numerator items can be given as
\gamma_{jm}^{num} = \sum_{\forall i} \gamma_{jmi}^{num}    (15)
\theta_{jm}^{num}(X) = \sum_{\forall i} \gamma_{jmi}^{num} x_i    (16)
\theta_{jm}^{num}(X^2) = \sum_{\forall i} \gamma_{jmi}^{num} x_i^2    (17)
The denominator items can be expressed by similar equations; we omit them here. In addition,
\gamma_{jmi}^{num} = \begin{cases} \gamma_{jmi}, & \text{if } g(i) = j \\ 0, & \text{otherwise} \end{cases}    (18)
\gamma_{jmi}^{den} = \gamma_{jmi} \frac{p(x_i|j)}{\sum_{\forall j'} p(x_i|j')}    (19)
where
\gamma_{jmi} = K_j \frac{w_{jm} \mathcal{N}(x_i; \mu_{jm}, \Sigma_{jm})}{\sum_{\forall m'} w_{jm'} \mathcal{N}(x_i; \mu_{jm'}, \Sigma_{jm'})}    (20)
Similar to HLDA, MMI also learns the class priors from the training data. To suppress this effect, a weight factor K_j is introduced into Eq.(20) to balance the statistics[12]. In this paper, we set K_j = N/(N_j \cdot J), which becomes 1 if all the classes have equal amounts of training samples.
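The following Python sketch performs one EBW pass over Eqs.(13)-(20) for diagonal-covariance class GMMs. It uses a single global smoothing constant D instead of the per-mixture D_jm, keeps the mixture weights fixed, and assumes a simple (data, labels, model list) layout; all of these are illustrative simplifications rather than the paper's implementation.

```python
import numpy as np

def log_gauss(X, mu, var):
    """(N, M) matrix of log N(x_i; mu_m, diag(var_m))."""
    d = X.shape[1]
    diff = X[:, None, :] - mu[None, :, :]
    return -0.5 * (d * np.log(2.0 * np.pi)
                   + np.sum(np.log(var), axis=1)[None, :]
                   + np.sum(diff ** 2 / var[None, :, :], axis=2))

def ebw_update(X, labels, gmms, D=2.0):
    """One EBW pass; gmms[j] is a dict with keys 'w', 'mu', 'var' for class j."""
    N, J = len(X), len(gmms)
    log_px, gam = np.zeros((N, J)), []
    for j, g in enumerate(gmms):
        lg = np.log(g["w"])[None, :] + log_gauss(X, g["mu"], g["var"])
        log_px[:, j] = np.logaddexp.reduce(lg, axis=1)
        K_j = N / (np.sum(labels == j) * J)                    # balancing factor of Eq.(20)
        gam.append(K_j * np.exp(lg - log_px[:, j][:, None]))   # gamma_{jmi}, Eq.(20)
    # class posteriors p(x_i|j) / sum_j' p(x_i|j') used in the denominator terms
    post = np.exp(log_px - np.logaddexp.reduce(log_px, axis=1)[:, None])
    new = []
    for j, g in enumerate(gmms):
        num_mask = (labels == j).astype(float)[:, None]        # selects g(i) = j, Eq.(18)
        g_num = gam[j] * num_mask
        g_den = gam[j] * post[:, j][:, None]                   # Eq.(19)
        denom = (g_num.sum(0) - g_den.sum(0) + D)[:, None]     # Eq.(15)-style occupancies
        th1 = g_num.T @ X - g_den.T @ X                        # Eq.(16), num minus den
        th2 = g_num.T @ X ** 2 - g_den.T @ X ** 2              # Eq.(17), num minus den
        mu_new = (th1 + D * g["mu"]) / denom                   # Eq.(13)
        var_new = (th2 + D * (g["var"] + g["mu"] ** 2)) / denom - mu_new ** 2   # Eq.(14)
        new.append(dict(w=g["w"], mu=mu_new, var=np.maximum(var_new, 1e-6)))
    return new
```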

III. Discriminative Score Fusion Framework

Our score fusion method is mainly modified from the Gaussian fusion paradigm. Firstly, the raw scores from the individual systems are stacked into a high-dimensional vector. Then HLDA is performed for dimension reduction and de-correlation. After that, GMMs are trained for each class using the MMI criterion. Then the log posterior probability is calculated as (we suppose all the prior probabilities are equal)
\log P(j|x) = \log \frac{p(x|j)}{\sum_{\forall j'} p(x|j')}    (21)

In LID applications, some languages may have several dialects. (For example, Chinese has Mandarin, Cantonese, Min, Wu, and so on.) We regard each dialect as an individual class when training HLDA and the GMMs. At the last stage, the dialect scores of the same language are merged as
s(l) = \max_{h(j)=l} \log P(j|x)    (22)
where h(j) = l denotes all the dialects (classes) belonging to language l.
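A minimal sketch of the scoring stage of this framework: the equal-prior log posterior of Eq.(21) followed by the dialect-to-language merge of Eq.(22). The per-class density function is passed in (for example, the gmm_log_density sketch above), and dialect_to_lang encodes the mapping h(j) = l; both names are illustrative.

```python
import numpy as np

def fuse_and_merge(y_p, class_gmms, dialect_to_lang, log_density):
    """y_p: HLDA-projected score vector; class_gmms[j]: parameters of dialect class j.

    dialect_to_lang[j] gives the language l = h(j) of class j.
    Returns a dict mapping each language to its score s(l) of Eq.(22).
    """
    log_lik = np.array([log_density(y_p, **g) for g in class_gmms])
    log_post = log_lik - np.logaddexp.reduce(log_lik)                 # Eq.(21), equal priors
    scores = {}
    for j, lang in enumerate(dialect_to_lang):
        scores[lang] = max(scores.get(lang, -np.inf), log_post[j])    # Eq.(22)
    return scores
```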

Note that HLDA and MMI are both discriminative technologies. In language identification, our score fusion framework is thus different from the traditional non-discriminative methods. This framework introduces more computational complexity in the score fusion training stage, but the extra calculation load is almost negligible compared with that of the whole language identification system.

Fig. 1. Flowchart of discriminative score fusion

IV. Experiments

1. Experiment setup

In this paper, the experiments are mainly performed in the framework of the NIST 2007 Language Recognition Evaluation (LRE07)[12]. There are 14 target languages, 4 of which can be further divided into 10 dialects. According to LRE07, the average cost is used as the performance measure:
C_{avg} = \frac{1}{L} \sum_{\forall l_T} \left\{ C_M P_T P_M(l_T) + C_F P_N \sum_{\forall l_N} P_F(l_T, l_N) \right\}    (23)


where L is the number of languages, P_M(l_T) and P_F(l_T, l_N) are the miss and false alarm probabilities, l_T and l_N are the target and non-target languages, C_M and C_F are the miss and false alarm costs, which are set as C_M = C_F = 1, and P_T and P_N are the target and non-target language prior probabilities, which are set as P_T = 0.5, P_N = (1 − P_T)/(L − 1). These parameters are set by NIST[12], which aims to provide an application-independent criterion, with no bias toward misses or false alarms, for evaluating algorithm performance.

Table 1. The sources of training data for each language/dialect (corpora: CallFriend, CallHome, OHSU, LRE07 Train, OGI11, OGI22; languages/dialects: Arabic; Bengali; Chinese: Cantonese, Mandarin, Min, Wu; English: American, Indian; Farsi; German; Hindustani: Hindi, Urdu; Japanese; Korean; Russian; Spanish: Caribbean, Non-Caribbean; Tamil; Thai; Vietnamese)
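A small sketch of the average cost of Eq.(23) with the LRE07 settings C_M = C_F = 1 and P_T = 0.5; the miss and false-alarm probability matrices are assumed to have been estimated elsewhere from the trial decisions.

```python
import numpy as np

def average_cost(p_miss, p_fa, C_M=1.0, C_F=1.0, P_T=0.5):
    """C_avg of Eq.(23).

    p_miss : (L,)   miss probability P_M(l_T) for each target language
    p_fa   : (L, L) false-alarm probability P_F(l_T, l_N); the diagonal is ignored
    """
    L = len(p_miss)
    P_N = (1.0 - P_T) / (L - 1)
    cost = 0.0
    for t in range(L):
        fa_sum = sum(p_fa[t, n] for n in range(L) if n != t)
        cost += C_M * P_T * p_miss[t] + C_F * P_N * fa_sum
    return cost / L
```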

From Eq.(23), we can see that the average cost depends on the decision threshold of each language. In our experiments, we alter the threshold and use the minimum average cost as the performance criterion. The training data come from CallFriend, CallHome, OHSU, LRE07 Train, OGI11 and OGI22. Table 1 lists the detailed sources for each language/dialect. All the data are first used for LDA/HLDA matrix training, and then for GMM training. The evaluation data are provided by NIST and consist of 7530 segments of telephone speech with nominal durations of 30s, 10s and 3s. The systems used in the experiments include GMM-SDC, SVM-SDC, GMM-SVM, PPR-LM, and PPR-VSM. Our LRE07 system is composed of the first three systems; the last two were developed more recently, based on the phone recognizers designed by Brno University of Technology (BUT)[13].

2. Experiments on GMM-SDC system

We first investigate the GMM-SDC system. This system outputs a 20-dimensional log-likelihood score for each test segment. (Each dialect is modeled as a class, except that Taiwan Mandarin and mainland Mandarin are merged into one class.) All the scores are divided by the number of speech frames to normalize for the different lengths of the test segments. Unless stated otherwise, all the results in this section are evaluated on 30s segments. We obtained the correlation coefficient matrix (absolute value) of the scores as shown in Fig.2. The values are mapped in gray scale but in reverse order, i.e., white denotes 0 and black denotes 1, as shown in the colorbar. The off-diagonal values represent correlations between dimensions. High correlation can be observed between some dimensions, which hints that even in a simple single system, score fusion is needed. (The PPR-LM system should be seen as multiple systems; the importance of score fusion for this system has long been noticed.)

Fig. 2. Correlation coefficients matrix (absolute value) of scores of GMM-SDC system

(1) Weighted log posterior probability

The weighted log posterior probability can be given as[4]
\log P(j|x) = \log \frac{p(x|j)^w}{\sum_{\forall j'} p(x|j')^w}    (24)

where w (w > 1) is a weight factor, which makes the posterior probability sharper. The result obtained using the weighted log posterior probability can be seen as the baseline performance without fusion. We alter the value of the weight to obtain the minimum average cost, as listed in Table 2. We can see that when w = 20, Cavg = 9.46%, which is better than that when w = 1. The optimal value may change for other systems; for example, for the PPR-LM system it becomes 10. Through many supplementary experiments, we found that this value has no obvious relation to the training and test data. It is mainly affected by the score distribution, which is an intrinsic property of the recognition system.

Table 2. The performance of the GMM-SDC system using the weighted log posterior probability
  w        :   1      10     20     30     40     50
  Cavg (%) :  11.15   9.48   9.46   9.55   9.57   9.63
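A minimal sketch of the weighted log posterior probability of Eq.(24), computed in the log domain from per-class log-likelihoods; sweeping w as in Table 2 simply means evaluating the resulting minimum average cost for each candidate weight.

```python
import numpy as np

def weighted_log_posterior(log_lik, w):
    """Eq.(24): log_lik is an (N, J) matrix of log p(x_i | j), w the sharpening weight."""
    scaled = w * log_lik
    return scaled - np.logaddexp.reduce(scaled, axis=1, keepdims=True)
```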

(2) Gaussian fusion

In the Gaussian fusion method, the number of dimensions reserved after the LDA transformation may affect the performance. The original number of dimensions is 20. We change the number of reserved dimensions and obtain the minimum average cost as listed in Table 3. We can see that when dimension = 19 (which is exactly the number of classes minus 1), the best performance Cavg = 5.11% is obtained. In the following experiments, we will use dimension = 19 as the default. Compared with the result obtained via the weighted log posterior probability, we can see that even in a single system, score fusion can significantly improve the performance.

Table 3. The performance of the GMM-SDC system using the Gaussian fusion method
  Dimension :  15     16     17     18     19     20
  Cavg (%)  :  6.41   5.86   5.33   5.14   5.11   5.16

(3) From Gaussian to GMM

In this section, we use GMM as the classifier for score fusion. We test GMMs with different mixture numbers and the results are listed in Table 4. It can be found that using GMM slightly decreases the average cost. When mixture = 32, we achieve Cavg = 4.96%. Note that these GMMs are trained with the ML criterion, so we also refer to this method as LDA + ML.

Table 4. The performance of the GMM-SDC system using the GMM (LDA + ML) fusion method
  Mixture  :  1      8      16     32     64
  Cavg (%) :  5.11   5.11   5.11   4.96   5.05

(4) From LDA to HLDA

Next, we substitute HLDA for LDA in our experiments. All the other conditions are the same as in the previous section. The results are listed in Table 5. Compared with Table 4, we can see that HLDA decreases the average cost from 4.96% to 4.79%.

Table 5. The performance of the GMM-SDC system using the GMM (HLDA + ML) fusion method
  Mixture  :  1      8      16     32     64
  Cavg (%) :  5.00   5.00   5.00   4.79   4.86
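As an illustration of the ML-trained class GMMs used in the LDA/HLDA + ML configurations, the following sketch fits one diagonal-covariance GMM per class with scikit-learn; the regularization value and random seed are our own choices, not settings reported in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(Y, labels, n_mixtures=32):
    """Fit one ML-trained GMM per class on the projected score vectors Y (N, p)."""
    gmms = {}
    for j in np.unique(labels):
        gm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                             reg_covar=1e-4, random_state=0)
        gmms[j] = gm.fit(Y[labels == j])
    return gmms

# Scoring: log p(y | j) for a test vector y and every class j
# log_lik = np.array([gmms[j].score_samples(y[None, :])[0] for j in sorted(gmms)])
```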

(5) From ML to MMI

In this section, we test the MMI training method. In the experiments, the numbers of reserved dimensions and mixtures are set as dimension = 19 and mixture = 32. Four combinations have been tested and the results are listed in Table 6. We can see that, whether with LDA or HLDA, MMI training is more effective than ML training. For HLDA + MMI, we obtain Cavg = 4.66%.

Table 6. The performance of the GMM-SDC system using the GMM (LDA/HLDA + ML/MMI) fusion method
  Fusion method :  LDA + ML   HLDA + ML   LDA + MMI   HLDA + MMI
  Cavg (%)      :  4.96       4.79        4.77        4.66

3. Experiments on multi-systems

This experiment is based on multiple systems. The average costs of each individual system and of the fused system are obtained with the HLDA + MMI fusion method. The parameters are set as dimension = 19 and mixture = 32. Five single systems, as listed in Table 7, are used for score fusion. When training the HLDA transformation matrix and the GMM models, we only used the scores of 30s segments. For 30s trials, we can see that the performance of the best single system is 4.36%. After score fusion, we get 2.10% for the fused system, which is only about half that of the best single system. For 10s and 3s trials, we obtain 6.62% and 16.81%, respectively. The improvements from score fusion for the 10s and 3s trials are not as significant as those for the 30s ones. This may be due to the mismatch between the training and evaluation conditions.

Table 7. The performance of individual and fused systems using the GMM (HLDA + MMI) method
                Cavg (%)
  System      30s     10s     3s
  GMM-SDC     4.66    9.83   21.08
  SVM-SDC    11.53   20.21   32.06
  GMM-SVM     4.36   11.53   23.32
  PPR-LM      4.89   11.74   24.41
  PPR-VSM     4.91   12.75   25.82
  Fusion      2.10    6.62   16.81

V. Conclusion

In this paper, we proposed a discriminative score fusion method for language identification. The discriminative technologies HLDA and MMI are used in our method. Experiments show that the proposed score fusion method can improve the performance even for a single system. By fusing the scores of five systems, we achieve an average cost of 2.10% for 30s trials on the 2007 NIST language recognition evaluation databases.

References

[1] M.A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech", IEEE Transactions on Speech and Audio Processing, Vol.4, No.1, pp.31–44, 1996.
[2] Y.K. Muthusamy, E. Barnard, R.A. Cole, "Reviewing automatic language identification", IEEE Signal Processing Magazine, Vol.11, No.4, pp.33–41, 1994.
[3] H. Li, B. Ma, C.H. Lee, "A vector space modeling approach to spoken language identification", IEEE Transactions on Audio, Speech and Language Processing, Vol.15, No.1, pp.271–284, 2007.
[4] P. Matejka, L. Burget, P. Schwarz, et al., "Brno University of Technology system for NIST 2005 language recognition evaluation", Proc. IEEE Odyssey - The Speaker and Language Recognition Workshop, San Juan, Puerto Rico, 2006.
[5] W. Campbell, T. Gleason, J. Navratil, et al., "Advanced language recognition using cepstra and phonotactics: MITLL system performance on the NIST 2005 language recognition evaluation", Proc. IEEE Odyssey - The Speaker and Language Recognition Workshop, San Juan, Puerto Rico, 2006.
[6] B. Campbell, T. Gleason, A. McCree, "MITLL 2007 NIST LRE MIT Lincoln Laboratory site presentation", Proc. 2007 NIST Language Recognition Evaluation Workshop, Orlando, USA, 2007.
[7] H. Suo, M. Li, T. Liu, et al., "The design of backend classifiers in PPRLM system for language identification", Proc. International Conference on Natural Computation, Haikou, China, pp.678–682, 2007.
[8] J.Q. Li, A.R. Barron, "Mixture density estimation", Proc. Advances in Neural Information Processing Systems, Denver, USA, 1999.
[9] N. Kumar, A.G. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition", Speech Communication, Vol.26, No.4, pp.283–297, 1998.
[10] M.J.F. Gales, "Semi-tied covariance matrices for hidden Markov models", IEEE Transactions on Speech and Audio Processing, Vol.7, No.3, pp.272–281, 1999.
[11] D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition", Ph.D. Thesis, Cambridge University, UK, 2003.
[12] NIST language recognition evaluation, Online, Available: http://www.nist.gov/speech/tests/lang/index.htm.
[13] P. Matejka, P. Schwarz, J. Cernocky, et al., "Phonotactic language identification using high quality phoneme recognition", Proc. Eurospeech 2005, Lisbon, Portugal, 2005.

ZHANG Weiqiang was born in Hebei, China, in 1979. He received the B.S. degree in applied physics from the University of Petroleum, Shandong, in 2002, the M.S. degree in communication and information systems from Beijing Institute of Technology, Beijing, in 2005, and the Ph.D. degree in information and communication engineering from Tsinghua University, Beijing, in 2009. He is a research assistant at the Department of Electronic Engineering, Tsinghua University. His research interests are in the areas of radar signal processing, acoustic signal processing, and speech signal processing, primarily in parameter estimation, higher-order statistics, time-frequency analysis, speaker recognition, and language identification. (E-mail: [email protected])

HOU Tao received the B.S. degree from the Department of Electronic Engineering, Tsinghua University, Beijing. He is currently pursuing the Ph.D. degree in the Department of Electronic Engineering, Tsinghua University. His research interests cover speaker recognition and language recognition.

LIU Jia is a professor at the Department of Electronic Engineering, Tsinghua University, Beijing. His research interest covers speech recognition and signal processing.
