Research on I-Vector Combination for Speaker Recognition
Zhi-Yi Li, Wei-Qiang Zhang, Jia Liu
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Email: [email protected], {wqzhang, liuj}@tsinghua.edu.cn
Abstract—The i-vector approach proved very successful in the NIST 2010 speaker recognition evaluation (SRE). In this paper, we explore how to further improve the performance of i-vector based SRE systems. We investigate both the application of feature-domain latent factor analysis (fLFA) and a complementary combination method at the i-vector level. Experimental results on the NIST 2010 core-core corpus, covering all 9 conditions, show that compensation of the acoustic features remains very effective in an i-vector based SRE system, even though linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) already compensate well for session and channel variabilities. In addition, combination at the i-vector level can further improve the performance of the i-vector based SRE system. Index Terms—I-vector combination, fLFA, speaker recognition, NIST 2010.
I. INTRODUCTION

Speaker recognition (SRE) refers to recognizing a person's identity from his or her voice. Its applications range from forensics and information security to identity authentication systems [1]. In recent years, significant improvements have been achieved, especially after the Gaussian mixture model-universal background model (GMM-UBM) was proposed [2]. GMM supervectors for support vector machines (GSV-SVM) [3] and joint factor analysis (JFA) [4], both built on the GMM, are two state-of-the-art technologies in SRE. In particular, the i-vector technology [5] derived from JFA has shown superior performance to JFA and proved very successful in the NIST 2010 speaker recognition evaluation (SRE). In i-vector based SRE systems, fixed-length, low-dimensional i-vectors are extracted by estimating the latent variables of each utterance using a factor analysis similar to JFA. Linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are then applied to remove session- and channel-dependent variabilities, and the resulting i-vectors serve as features for subsequent classifiers [5].

On the other hand, different acoustic features contain different discriminative speaker information, and their selection also has a large influence on classifier performance. In speaker recognition, although various high-level and other kinds of features have been studied, spectrum-based features still outperform the others and play the dominant role in practice. In addition to the commonly used Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and linear prediction cepstral coefficients (LPCC), there are other novel spectral-temporal features, such as the temporal discrete cosine transform (TDCT) and the time-frequency cepstrum (TFC) [7].

In this paper, we explore the application of feature-domain latent factor analysis (fLFA) in an i-vector based SRE system, together with a complementary combination method at the i-vector level, on the NIST 2010 core-core speaker recognition corpus including 9 conditions, based on two acoustic features, MFCC and TFC, that have been shown to be complementary [9]. The remainder of this paper is organized as follows: Section II introduces the SRE system based on fLFA and combination at the i-vector level. Section III presents experimental results. Finally, conclusions and some future directions are given in Section IV.

II. DESCRIPTION OF THE SRE SYSTEM
The speaker recognition system used in this paper is shown in Fig. 1. In this system, we first extract two complementary acoustic features, MFCC and TFC [9], and apply fLFA to both of them. We then derive i-vectors from both kinds of features, followed by LDA and WCCN. Finally, we combine the two i-vectors into a new i-vector and classify it by cosine distance scoring (CDS).

A. Extraction of complementary acoustic features

In this paper, we use two methods to extract complementary acoustic features from the basic 13-dimensional features. One is the popular 39-dimensional MFCC; the other is the 39-dimensional TFC, a feature obtained by performing a temporal discrete cosine transform (DCT) on the cepstrum matrix and selecting the elements in a specific area with large variances [7], [8]. Both features are tested before and after compensation by fLFA.

B. Combination method at the i-vector level

First, we need to extract i-vectors from the two different acoustic features. The i-vector concept was motivated by JFA, which models the speaker and inter-session subspaces separately, whereas the i-vector method models all the important variability in a single low-dimensional subspace, named the total variability space, as shown in (1). Thus, the estimation of the low-
rank rectangular total variability matrix is much like the eigenvoice adaptation in JFA [6]:

M = m + Tw,  (1)

where m is the speaker- and channel-independent component of the mean supervector (generally taken as the UBM mean supervector), T is the matrix of bases spanning the subspace that covers both speaker- and session-specific variability in the supervector space, and w is a standard normally distributed latent variable. For each sample, the final i-vector w is the maximum a posteriori (MAP) point estimate of the latent variable:

w = (I + T^t Σ^{-1} N T)^{-1} T^t Σ^{-1} F,  (2)

where N and F are the zero-order and first-order Baum-Welch statistics, respectively. Additional details about the i-vector extraction procedure can be found in [5].

After all the i-vectors are extracted, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are applied, as in [5]. The motivation for using LDA is that, when all utterances of a given speaker are assumed to represent one class, LDA defines new axes that minimize the intra-class variance caused by channel effects while maximizing the variance between speakers. The advantage of the LDA approach is that its discriminative criteria remove unwanted directions while minimizing the loss of information about between-speaker variance [5]. In this modeling, each class is made up of all the recordings of a single speaker. The LDA optimization problem can be defined as the Rayleigh coefficient for a space direction v:

J(v) = (v^t S_b v) / (v^t S_w v).  (3)

The purpose of LDA is to maximize this Rayleigh coefficient. The maximization yields a projection matrix P composed of the best eigenvectors (those with the highest eigenvalues) of the general eigenvalue problem

S_b v = λ S_w v,  (4)

where the between-class and within-class scatter matrices are

S_b = Σ_{s=1}^{S} (w̄_s − w̄)(w̄_s − w̄)^t,  (5)

S_w = Σ_{s=1}^{S} (1/n_s) Σ_{i=1}^{n_s} (w_i^s − w̄_s)(w_i^s − w̄_s)^t,  (6)

where w_i^s denotes the i-th observation of speaker s and w̄ represents the mean of all instances in the training set.

WCCN uses the within-class covariance matrix to normalize the cosine kernel function in order to compensate for intersession variability, while guaranteeing conservation of directions in space. We again assume that all utterances of a given speaker belong to one class. The within-class covariance matrix is computed as

W = (1/S) Σ_{s=1}^{S} (1/n_s) Σ_{i=1}^{n_s} (w_i^s − w̄_s)(w_i^s − w̄_s)^t,  (7)

where w̄_s = (1/n_s) Σ_{i=1}^{n_s} w_i^s is the mean of the i-vectors of speaker s, S is the number of speakers, and n_s is the number of utterances of speaker s. The feature-mapping matrix L is then obtained through the Cholesky decomposition W^{-1} = L L^t.

C. Cosine distance scoring

After being derived from the two complementary features MFCC and TFC, the two kinds of i-vectors are combined into a new, higher-dimensional i-vector and fed into the CDS classifier. In [5], CDS was shown to be the fastest and most efficient method; it directly uses the value of the cosine kernel between the target speaker i-vector and the test i-vector as the final decision score:

CDS(w_tar, w_imp) = ⟨L^t P^t w_tar, L^t P^t w_imp⟩ / (‖L^t P^t w_tar‖ · ‖L^t P^t w_imp‖).  (8)

As another advantage of this modeling in SRE, no target speaker enrollment is required, so modeling and scoring are faster and less complex than with other methods such as SVM and LLR. Moreover, this scoring method makes ZT-Norm of the scores very fast [11].

[Fig. 1. Description of the SRE system: MFCC39 and TFC39 features → fLFA → i-vector extraction → LDA+WCCN → i-vector combination → CDS.]
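As a concrete illustration of this section, the sketch below estimates toy i-vectors with eq. (2), trains LDA via the generalized eigenproblem of eqs. (3)-(6), applies WCCN per eq. (7), and scores with CDS per eq. (8). This is not the authors' code: the tiny dimensions, the random Baum-Welch statistics, and all helper names are hypothetical stand-ins (a real system would use, e.g., 1024 mixtures and 400-dimensional i-vectors).

```python
import numpy as np

rng = np.random.default_rng(0)
C, F_DIM, R = 8, 4, 5          # mixtures, feature dim, i-vector dim (toy sizes)
D = C * F_DIM                  # supervector dimension
T = rng.normal(size=(D, R))    # total variability matrix (would be EM-trained)
Sigma = np.ones(D)             # diagonal UBM covariance, flattened

def extract_ivector(N, F):
    """Eq. (2): w = (I + T' S^-1 N T)^-1 T' S^-1 F.
    N: zero-order stats per mixture (C,); F: centered first-order stats (D,)."""
    Nd = np.repeat(N, F_DIM) / Sigma          # expand N to the supervector diagonal
    A = np.eye(R) + T.T @ (Nd[:, None] * T)
    return np.linalg.solve(A, T.T @ (F / Sigma))

# Toy statistics: 3 "speakers" x 4 utterances.
ivecs, labels = [], []
for spk in range(3):
    for _ in range(4):
        ivecs.append(extract_ivector(rng.uniform(1.0, 5.0, size=C),
                                     rng.normal(size=D)))
        labels.append(spk)
ivecs, labels = np.asarray(ivecs), np.asarray(labels)

# LDA: solve Sb v = lambda Sw v (eqs. (4)-(6)) and keep the top directions.
mean_all = ivecs.mean(axis=0)
Sb, Sw = np.zeros((R, R)), np.zeros((R, R))
for spk in range(3):
    w_s = ivecs[labels == spk]
    m_s = w_s.mean(axis=0)
    d = (m_s - mean_all)[:, None]
    Sb += d @ d.T
    Sw += (w_s - m_s).T @ (w_s - m_s) / len(w_s)
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
P = np.real(evecs[:, np.argsort(-np.real(evals))[:2]])   # projection matrix P

# WCCN in the projected space (eq. (7)), then Cholesky W^-1 = L L'.
proj = ivecs @ P
W = np.zeros((2, 2))
for spk in range(3):
    w_s = proj[labels == spk]
    m_s = w_s.mean(axis=0)
    W += (w_s - m_s).T @ (w_s - m_s) / len(w_s)
W /= 3
Lmap = np.linalg.cholesky(np.linalg.inv(W))

def cds(w_tar, w_imp):
    """Cosine distance score of eq. (8) after the LDA+WCCN mapping."""
    a, b = Lmap.T @ (P.T @ w_tar), Lmap.T @ (P.T @ w_imp)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = cds(ivecs[0], ivecs[1])   # a value in [-1, 1]
```

Note that scoring needs only two matrix-vector products and a dot product per trial, which is why CDS avoids per-target enrollment.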
III. EXPERIMENT
A. Experimental data

Our experiments were carried out on the NIST 2010 core-core male corpus, which includes 9 conditions. The UBM was trained on the SRE04 male dataset. The data for training the total variability matrix T, the LDA matrix, and the WCCN matrix, and for ZT-Norm, all come from the pooled SRE05, SRE06, SRE08, Mix5, and SwitchBoard corpora; these data were pooled together and sampled randomly to form the training corpus. In our experiments, we simply used the same dataset to train the total variability matrix and the LDA and WCCN matrices. In total, 248 utterances were used to train the UBM, 21768 utterances from 2025 speakers were used to train the LDA and WCCN matrices, 507 utterances formed the Z-Norm cohort set, and 460 utterances formed the T-Norm cohort set.

TABLE I
THE PERFORMANCE (IN EER) OF FLFA IN THE I-VECTOR SYSTEM BASED ON MFCC

Condition   | i-vec (MFCC) | i-vec (MFCC fLFA)
condition 1 | 0.035        | 0.026
condition 2 | 0.048        | 0.043
condition 3 | 0.052        | 0.052
condition 4 | 0.035        | 0.029
condition 5 | 0.059        | 0.057
condition 6 | 0.039        | 0.045
condition 7 | 0.050        | 0.034
condition 8 | 0.008        | 0.008
condition 9 | 0.017        | 0.024
B. Experimental setup

As the front end of our experiments, speech/silence segmentation is performed by an order statistics filter (OSF) VAD detector. Cepstral mean subtraction and feature warping with a 3 s window are applied for channel mismatch compensation. After that, 25% of the low-energy frames are discarded using a dynamic threshold. Standard 39-dimensional MFCCs were extracted, while for TFC the context width (the width of the cepstrum matrix) is 9 and the elements with large variances are selected in a 13+13+13 pattern, for a total of 39 dimensions. Note that we fixed the total dimensionality of the acoustic features at 39 in our experiments to compare MFCC and TFC fairly; if this constraint is relaxed, further gains in performance can be achieved. A UBM with 1024 diagonal mixtures was trained, and the dimension of the total variability space was set to 400. After dimension reduction by LDA and WCCN, the dimension of the final i-vector is reduced to 200. The experimental results are reported in terms of equal error rate (EER).

C. Evaluation of experimental results

Tables I and II show the performance of the i-vector system before and after fLFA, based on MFCC and TFC respectively. No matter which acoustic feature the i-vectors are derived from, fLFA provides better performance in most of the 9 conditions. The TFC feature also performs better than MFCC in most of the 9 conditions. Table III shows that combination at the i-vector level performs better than either single-feature system. In our experiments, the two raw i-vectors are each 200-dimensional, and the combined i-vector is 400-dimensional. Table IV shows the EER on the NIST 2010 core-core pooled corpus; both fLFA and the combination efficiently improve the EER. The corresponding DET curves are shown in Fig. 2.

TABLE II
THE PERFORMANCE (IN EER) OF FLFA IN THE I-VECTOR SYSTEM BASED ON TFC

Condition   | i-vec (TFC) | i-vec (TFC fLFA)
condition 1 | 0.034       | 0.031
condition 2 | 0.046       | 0.048
condition 3 | 0.043       | 0.046
condition 4 | 0.023       | 0.023
condition 5 | 0.057       | 0.056
condition 6 | 0.049       | 0.051
condition 7 | 0.044       | 0.039
condition 8 | 0.015       | 0.009
condition 9 | 0.016       | 0.009
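The TFC front end described above (a temporal DCT over a sliding cepstrum matrix of width 9, keeping 13+13+13 = 39 coefficients) can be sketched as below. This is an assumption-laden illustration, not the authors' implementation: the exact variance-based selection mask of [7], [8] is not reproduced, and keeping the first three temporal DCT rows is a stand-in for it.

```python
import numpy as np

def tfc(cepstra, context=9, keep_rows=3):
    """Toy TFC extraction: cepstra is (num_frames, 13) static cepstral features.
    Returns (num_frames - context + 1, keep_rows * 13) features."""
    n, d = cepstra.shape
    # Orthonormal DCT-II matrix over the time axis (context x context).
    j, i = np.meshgrid(np.arange(context), np.arange(context), indexing="ij")
    Cmat = np.sqrt(2.0 / context) * np.cos(np.pi * (i + 0.5) * j / context)
    Cmat[0] /= np.sqrt(2.0)
    frames = []
    for t in range(n - context + 1):
        block = cepstra[t:t + context]        # cepstrum matrix (context, 13)
        coeffs = Cmat @ block                 # temporal DCT along time
        # Stand-in for the large-variance selection: first keep_rows rows.
        frames.append(coeffs[:keep_rows].reshape(-1))
    return np.asarray(frames)

feats = tfc(np.random.default_rng(1).normal(size=(100, 13)))
# 100 frames with context 9 -> 92 output frames of 3 * 13 = 39 dims
```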
TABLE III
THE PERFORMANCE (IN EER) OF COMBINATION AT THE I-VECTOR LEVEL WITH FLFA

Condition   | i-vec (MFCC fLFA) | i-vec (TFC fLFA) | i-vec (Combined)
condition 1 | 0.026             | 0.031            | 0.021
condition 2 | 0.043             | 0.048            | 0.039
condition 3 | 0.052             | 0.046            | 0.038
condition 4 | 0.029             | 0.023            | 0.020
condition 5 | 0.057             | 0.056            | 0.053
condition 6 | 0.045             | 0.051            | 0.034
condition 7 | 0.034             | 0.039            | 0.032
condition 8 | 0.008             | 0.009            | 0.008
condition 9 | 0.024             | 0.009            | 0.009
TABLE IV
THE PERFORMANCE (IN EER) ON THE CORE-CORE POOLED TEST

System              | EER
i-vec (MFCC)        | 0.059
i-vec (MFCC fLFA)   | 0.057
i-vec (TFC)         | 0.054
i-vec (TFC fLFA)    | 0.052
i-vec (Combination) | 0.048
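The EER figures reported in Tables I-IV can be computed from lists of target and impostor trial scores. The sketch below is a minimal, approximate version (it picks the threshold where miss and false-alarm rates are closest to crossing); the score distributions used here are synthetic, not the paper's data.

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Approximate equal error rate: sweep thresholds and return the
    smallest max(miss rate, false-alarm rate), i.e., the crossing point."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best = 1.0
    for th in thresholds:
        miss = np.mean(target_scores < th)      # targets rejected
        fa = np.mean(impostor_scores >= th)     # impostors accepted
        best = min(best, max(miss, fa))
    return best

# Synthetic, well-separated score distributions for demonstration only.
rng = np.random.default_rng(2)
tar = rng.normal(1.0, 1.0, 1000)
imp = rng.normal(-1.0, 1.0, 1000)
rate = eer(tar, imp)   # roughly 0.16 for these synthetic distributions
```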
[Fig. 2. Speaker detection performance: DET curves of i-vec (Combination), i-vec (MFCC fLFA), and i-vec (TFC fLFA); miss probability vs. false alarm probability (in %).]

IV. CONCLUSION

In this paper, we explored how to improve the performance of the i-vector based SRE system. The experimental results on the NIST 2010 core-core corpus, including 9 conditions, show that compensation of the acoustic features is still very effective in an i-vector system in which LDA and WCCN are used to compensate for session and channel variabilities. Moreover, combination at the i-vector level can also improve the performance of the i-vector based SRE system. The combination method can thus be seen not only as a very effective feature-domain method, as in [9], but also as a very effective method at the i-vector level. In the future, we will study how to reduce the dimensionality of the combined i-vector; much related work has been done in i-vector based language recognition [12].

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grants No. 90920302 and No. 61005019, and by the National High Technology Development Program of China (863 Program) under Grant No. 2008AA040201.

REFERENCES

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, Jan. 2010.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, Jan. 2000.
[3] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, pp. 97-100, May 2006.
[4] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435-1447, May 2007.
[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.
[6] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Audio, Speech, and Language Processing, vol. 13, no. 3, pp. 345-354, May 2005.
[7] W.-Q. Zhang, L. He, Y. Deng, J. Liu, and M. T. Johnson, "Time frequency cepstral features and heteroscedastic linear discriminant analysis for language recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 12, no. 2, pp. 266-276, Apr. 2011.
[8] W.-Q. Zhang, Y. Deng, L. He, and J. Liu, "Variant time-frequency cepstral features for speaker recognition," in Proc. Interspeech, pp. 2122-2125, Sept. 2011.
[9] Z.-Y. Li, L. He, W.-Q. Zhang, and J. Liu, "Multi-feature combination for speaker recognition," in Proc. 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 318-321, Nov. 2011.
[10] C. Vair, D. Colibro, F. Castaldo, E. Dalmasso, and P. Laface, "Channel factors compensation in model and feature domain for speaker recognition," in Proc. IEEE Odyssey, San Juan, Puerto Rico, Jun. 2006.
[11] N. Dehak, R. Dehak, J. Glass, D. Reynolds, and P. Kenny, "Cosine similarity scoring without score normalization techniques," in Proc. Odyssey, Brno, Czech Republic, Jul. 2010.
[12] Z.-Y. Li, W.-Q. Zhang, L. He, and J. Liu, "I-vector combination for language recognition," submitted to ICASSP 2012.
[13] NIST DET-curve plotting software for use with MATLAB, http://www.itl.nist.gov/iad/mig/tools/DETwarev2.1.targz.html.