Research on I-Vector Combination for Speaker Recognition

Zhi-Yi Li, Wei-Qiang Zhang, Jia Liu
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Email: [email protected], {wqzhang, liuj}@tsinghua.edu.cn

Abstract—The i-vector has proved very successful in the NIST 2010 speaker recognition evaluation (SRE). In this paper, we explore how to further improve the performance of an i-vector based SRE system, investigating both the application of feature-domain latent factor analysis (fLFA) and a complementary combination method at the i-vector level. Experimental results on the NIST 2010 core-core corpus, covering 9 conditions, show that compensation of the acoustic features is still very effective in an i-vector based SRE system, even though linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) already compensate well for session and channel variabilities. In addition, combination at the i-vector level also improves the performance of the i-vector based SRE system.

Index Terms—I-vector combination, fLFA, speaker recognition, NIST 2010.

I. INTRODUCTION

Speaker recognition (SRE) refers to recognizing a person's identity from his or her voice. Its applications range from forensics and information security to identity authentication systems [1]. In recent years significant improvements have been achieved, especially after the Gaussian mixture model-universal background model (GMM-UBM) was proposed [2]. GMM supervectors for support vector machines (GSV-SVM) [3] and joint factor analysis (JFA) [4], both built on the GMM, are two state-of-the-art technologies in SRE. In particular, the i-vector based technologies [5] derived from JFA have shown superior performance to JFA and proved very successful in the NIST 2010 speaker recognition evaluation (SRE). In i-vector based SRE systems, fixed-length, low-dimensional i-vectors are extracted by estimating the latent variables of each utterance with a factor analysis similar to JFA. Linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are then applied to remove session- and channel-dependent variabilities, and the resulting i-vectors serve as features for subsequent classifiers [5].

On the other hand, different acoustic features carry different discriminative speaker information, and their selection strongly influences classifier performance. Although various high-level and other kinds of features have been studied for speaker recognition, spectrum-based features still outperform the others and dominate in practice. In addition to the commonly used Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and linear prediction cepstral coefficients (LPCC), there are other novel spectral-temporal features, such as the temporal discrete cosine transform (TDCT) and the time-frequency cepstrum (TFC) [7].

In this paper, we explore the application of feature-domain latent factor analysis (fLFA) in an i-vector based SRE system, together with a complementary combination method at the i-vector level, on the NIST 2010 core-core corpus (9 conditions), using the two acoustic features MFCC and TFC, which have been shown to be complementary [9].

The remainder of this paper is organized as follows. Section II introduces the SRE system based on fLFA and combination at the i-vector level. Section III presents the experimental results. Finally, conclusions and future directions are given in Section IV.

II. DESCRIPTION OF SRE SYSTEM

The speaker recognition system used in this paper is shown in Fig. 1. We first extract the two complementary acoustic features MFCC and TFC [9] and apply fLFA to both. We then derive i-vectors from each feature stream, followed by LDA and WCCN. Finally, the two i-vectors are combined into a new i-vector and classified by cosine distance scoring (CDS).

[Fig. 1. Description of the SRE system: MFCC39 and TFC39 features are each compensated by fLFA, converted to i-vectors, projected by LDA+WCCN, combined at the i-vector level, and scored by CDS.]

A. Extraction of complementary acoustic features

We use two complementary methods to extract acoustic features from the basic 13-dimensional features. One is the popular 39-dimensional MFCC; the other is the 39-dimensional TFC. This newer kind of feature is obtained by performing a temporal discrete cosine transform (DCT) on the cepstrum matrix and selecting the elements in a specific area with large variances [7], [8]; a sketch is given below. Both features are tested before and after compensation by fLFA.
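For illustration, the following is a minimal sketch of the TFC construction under stated assumptions: it takes a precomputed matrix of 13-dimensional static cepstra, applies a temporal DCT over a sliding context window, and keeps the elements with the largest variance. The function name and the simple variance-based selection are illustrative and not the exact 13+13+13 region selection of [7], [8].

```python
import numpy as np
from scipy.fftpack import dct

def extract_tfc(cepstra, context=9, n_select=39):
    """cepstra: (num_frames, 13) matrix of static cepstral vectors."""
    num_frames, dim = cepstra.shape
    half = context // 2
    padded = np.pad(cepstra, ((half, half), (0, 0)), mode='edge')
    feats = np.empty((num_frames, context * dim))
    for t in range(num_frames):
        block = padded[t:t + context]               # cepstrum matrix (9 x 13)
        coeffs = dct(block, axis=0, norm='ortho')   # temporal DCT
        feats[t] = coeffs.ravel()
    # keep the n_select DCT elements with the largest variance over time
    idx = np.sort(np.argsort(feats.var(axis=0))[-n_select:])
    return feats[:, idx]                            # (num_frames, 39)
```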

B. Combination method in i-vector level

First, i-vectors must be extracted from the two different acoustic features. The concept of the i-vector was motivated by JFA, which models the speaker and inter-session subspaces separately; the i-vector method instead models all of the important variability in a single low-dimensional subspace, called the total variability space, as shown in (1). Estimating the low-rank rectangular total variability matrix is thus much like the eigenvoice adaptation in JFA [6]:

M = m + Tw    (1)

where m is the speaker-independent and channel-independent component of the mean supervector (generally taken to be the UBM mean supervector), T is the matrix of bases spanning the subspace covering both speaker- and session-specific variability in the supervector space, and w is a standard normally distributed latent variable. For each utterance, the final i-vector w is the maximum a posteriori (MAP) point estimate of the latent variable:

w = (I + T^t \Sigma^{-1} N T)^{-1} T^t \Sigma^{-1} F    (2)

where N and F are the zero-order and first-order Baum-Welch statistics, respectively. Additional details about the i-vector extraction procedure can be found in [5].
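As a minimal sketch of the MAP estimate in (2), assume the zero-order statistics N (per-mixture occupancies) and the centered first-order statistics F have already been accumulated against a diagonal-covariance UBM; the names and shapes below are illustrative.

```python
import numpy as np

def extract_ivector(T, Sigma_inv, N, F):
    """
    T         : (C*D, R) total variability matrix
    Sigma_inv : (C*D,)   inverse diagonal of the UBM covariance supervector
    N         : (C,)     zero-order Baum-Welch statistics (one per mixture)
    F         : (C*D,)   centered first-order Baum-Welch statistics
    """
    CD, R = T.shape
    D = CD // N.shape[0]
    N_diag = np.repeat(N, D)                        # expand N per feature dim
    # posterior precision: I + T^t Sigma^{-1} N T
    L = np.eye(R) + T.T @ ((Sigma_inv * N_diag)[:, None] * T)
    # MAP point estimate w = L^{-1} T^t Sigma^{-1} F, eq. (2)
    return np.linalg.solve(L, T.T @ (Sigma_inv * F))
```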

After all i-vectors have been extracted, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are applied. The motivation for LDA is that, when all utterances of a given speaker are assumed to represent one class, LDA attempts to define new axes that minimize the intra-class variance caused by channel effects while maximizing the variance between speakers. The advantage of the LDA approach lies in its discriminative criterion, designed to remove unwanted directions while minimizing the amount of between-speaker variance that is removed [5]. In this modeling, each class is made up of all the recordings of a single speaker. The LDA optimization problem can be defined as maximizing the Rayleigh coefficient for a projection direction v:

J(v) = \frac{v^t S_b v}{v^t S_w v}    (3)

This maximization yields a projection matrix P composed of the best eigenvectors (those with the highest eigenvalues) of the generalized eigenvalue problem

S_b v = \lambda S_w v    (4)

where the between- and within-class scatter matrices are

S_b = \sum_{s=1}^{S} (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^t    (5)

S_w = \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^t    (6)

where \bar{w}_s is the mean of the i-vectors of speaker s and \bar{w} represents the mean of all instances in the training set.
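A minimal sketch of the LDA training step of (3)-(6) follows, assuming the training i-vectors are grouped per speaker; scipy.linalg.eigh solves the generalized eigenvalue problem (4) directly.

```python
import numpy as np
from scipy.linalg import eigh

def train_lda(ivectors_by_speaker, out_dim=200):
    """ivectors_by_speaker: list of (n_s, dim) arrays, one array per speaker."""
    dim = ivectors_by_speaker[0].shape[1]
    global_mean = np.vstack(ivectors_by_speaker).mean(axis=0)
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for ws in ivectors_by_speaker:
        d = (ws.mean(axis=0) - global_mean)[:, None]
        Sb += d @ d.T                                # between-class scatter (5)
        centered = ws - ws.mean(axis=0)
        Sw += centered.T @ centered / len(ws)        # within-class scatter (6)
    # leading eigenvectors of S_b v = lambda S_w v, eq. (4)
    _, eigvecs = eigh(Sb, Sw)
    return eigvecs[:, ::-1][:, :out_dim]             # projection matrix P
```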

WCCN uses the within-class covariance matrix to normalize the cosine kernel function in order to compensate for inter-session variability, while guaranteeing conservation of directions in space. Again assuming that all utterances of a given speaker belong to one class, the within-class covariance matrix is computed as

W = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^t    (7)

where \bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_i^s is the mean of the i-vectors of speaker s, S is the number of speakers, and n_s is the number of utterances of speaker s. The feature-mapping matrix L is then obtained through the Cholesky decomposition of the matrix W^{-1} = L L^t.
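And a corresponding minimal sketch of WCCN, eq. (7): estimate the within-class covariance of the LDA-projected training i-vectors and take L from the Cholesky decomposition of W^{-1}.

```python
import numpy as np

def train_wccn(projected_by_speaker):
    """projected_by_speaker: list of (n_s, dim) arrays of LDA-projected i-vectors."""
    dim = projected_by_speaker[0].shape[1]
    W = np.zeros((dim, dim))
    for ws in projected_by_speaker:
        centered = ws - ws.mean(axis=0)
        W += centered.T @ centered / len(ws)         # per-speaker covariance
    W /= len(projected_by_speaker)                   # average over S speakers
    return np.linalg.cholesky(np.linalg.inv(W))      # W^{-1} = L L^t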

After being derived from the two complementary features MFCC and TFC, the two kinds of i-vectors are combined into a new, higher-dimensional i-vector and passed to the CDS classifier.

C. Cosine distance scoring

In [5], CDS was shown to be the fastest and most efficient scoring method; it directly uses the value of the cosine kernel between the target-speaker i-vector and the test i-vector as the final decision score:

CDS(w_{tar}, w_{imp}) = \frac{\langle L^t P^t w_{tar}, L^t P^t w_{imp} \rangle}{\| L^t P^t w_{tar} \| \cdot \| L^t P^t w_{imp} \|}    (8)

Another advantage of this modeling in SRE is that no target-speaker enrollment is required, which makes modeling and scoring faster and less complex than methods such as SVM and LLR. In addition, this scoring method allows ZT-Norm to be applied to the scores very quickly [11].
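A minimal sketch of the combination and scoring follows, assuming each stream's i-vector has already been projected by its own L^t P^t: the two projected 200-dimensional i-vectors are concatenated into one 400-dimensional i-vector and scored with the cosine kernel of (8).

```python
import numpy as np

def combine(w_mfcc, w_tfc):
    # concatenate the two projected i-vectors into one combined i-vector
    return np.concatenate([w_mfcc, w_tfc])

def cds_score(w_tar, w_tst):
    # cosine kernel between target and test i-vectors, eq. (8)
    return float(w_tar @ w_tst /
                 (np.linalg.norm(w_tar) * np.linalg.norm(w_tst)))

# usage sketch:
# score = cds_score(combine(tar_mfcc, tar_tfc), combine(tst_mfcc, tst_tfc))
```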

III. EXPERIMENT

A. Experimental data

Our experiments were conducted on the NIST 2010 core-core male corpus, which includes 9 evaluation conditions. The UBM is trained on the SRE04 male dataset. The data for training the total variability matrix T, the LDA and WCCN matrices, and for ZT-Norm all come from SRE05, SRE06, SRE08, Mix5, and Switchboard data; these data are pooled together and sampled randomly to form the training corpus. In our experiments we simply used the same dataset to train the total variability matrix and the LDA and WCCN matrices. In total, 248 utterances were used to train the UBM, 21768 utterances from 2025 speakers were used to train the LDA and WCCN matrices, 507 utterances formed the Z-Norm cohort set, and 460 utterances formed the T-Norm cohort set.

B. Experimental setup

As the front end of our experiments, speech/silence segmentation is performed by an order statistics filter (OSF) based voice activity detector. Cepstral mean subtraction and feature warping with a 3 s window are applied to compensate for channel mismatch, after which 25% of the low-energy frames are discarded using a dynamic threshold. Standard 39-dimensional MFCCs are extracted, while for TFC the context width (the width of the cepstrum matrix) is 9 and the large-variance elements are selected in a 13+13+13 pattern, giving 39 dimensions in total. Note that we fixed the total dimensionality of both acoustic features at 39 to allow a fair comparison between MFCC and TFC; relaxing this constraint could yield further performance gains. A UBM with 1024 diagonal-covariance mixtures is trained, and the dimension of the total variability space is set to 400. After dimension reduction by LDA and WCCN, the dimension of the final i-vector is reduced to 200. The experimental results are reported in terms of equal error rate (EER).

C. Evaluation of experimental results

Tables I and II show the performance of the two i-vector systems before and after fLFA, based on MFCC and TFC respectively. No matter which acoustic feature the i-vectors are derived from, fLFA provides better performance in most of the 9 conditions; moreover, the TFC feature outperforms MFCC in most of the 9 conditions. Table III shows that combination at the i-vector level outperforms both individual systems; in these experiments the two raw i-vectors are each 200-dimensional and the combined i-vector is 400-dimensional. Table IV reports the EER on the NIST 2010 core-core pooled corpus, where both fLFA and the combination clearly improve performance. The corresponding DET curves are shown in Fig. 2.
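Since all results below are reported in EER, here is a minimal sketch of how an EER can be computed from lists of target and impostor trial scores; the simple threshold sweep is illustrative, not the official NIST scoring tool.

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Return the equal error rate given arrays of trial scores."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))         # point where miss rate == FA rate
    return (miss[i] + fa[i]) / 2.0
```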

TABLE I
THE PERFORMANCE (IN EER) OF fLFA IN THE I-VECTOR SYSTEM BASED ON MFCC

Condition     i-vec (MFCC)   i-vec (MFCC fLFA)
condition 1   0.035          0.026
condition 2   0.048          0.043
condition 3   0.052          0.052
condition 4   0.035          0.029
condition 5   0.059          0.057
condition 6   0.039          0.045
condition 7   0.050          0.034
condition 8   0.008          0.008
condition 9   0.017          0.024

TABLE II
THE PERFORMANCE (IN EER) OF fLFA IN THE I-VECTOR SYSTEM BASED ON TFC

Condition     i-vec (TFC)    i-vec (TFC fLFA)
condition 1   0.034          0.031
condition 2   0.046          0.048
condition 3   0.043          0.046
condition 4   0.023          0.023
condition 5   0.057          0.056
condition 6   0.049          0.051
condition 7   0.044          0.039
condition 8   0.015          0.009
condition 9   0.016          0.009

TABLE III
THE PERFORMANCE (IN EER) OF COMBINATION AT THE I-VECTOR LEVEL WITH fLFA

Condition     i-vec (MFCC fLFA)   i-vec (TFC fLFA)   i-vec (Combined)
condition 1   0.026               0.031              0.021
condition 2   0.043               0.048              0.039
condition 3   0.052               0.046              0.038
condition 4   0.029               0.023              0.020
condition 5   0.057               0.056              0.053
condition 6   0.045               0.051              0.034
condition 7   0.034               0.039              0.032
condition 8   0.008               0.009              0.008
condition 9   0.024               0.009              0.009

TABLE IV
THE PERFORMANCE (IN EER) IN THE CORE-CORE POOLED TEST

System                 EER
i-vec (MFCC)           0.059
i-vec (MFCC fLFA)      0.057
i-vec (TFC)            0.054
i-vec (TFC fLFA)       0.052
i-vec (Combination)    0.048

IV. CONCLUSION

In this paper, we explored how to improve the performance of an i-vector based SRE system. The experimental results on the NIST 2010 core-core corpus, covering 9 conditions, show that compensation of the acoustic features is still very effective in an i-vector system in which LDA and WCCN are already used to compensate the session and channel variabilities. In addition, combination at the i-vector level also improves the performance of the i-vector based SRE system. The combination method can thus be seen not only as a very effective feature-domain method [9], but also as a very effective method at the i-vector level. In future work, we will study how to reduce the dimensionality of the combined i-vector; much related work has been done in i-vector based language recognition [12].

[Fig. 2. Performance DET of combination at the i-vector level with fLFA (false alarm probability vs. miss probability, in %), comparing i-vec(Combination), i-vec(MFCC_fLFA), and i-vec(TFC_fLFA).]

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grants No. 90920302 and No. 61005019, and by the National High Technology Development Program of China (863 Program) under Grant No. 2008AA040201.

REFERENCES

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, Jan. 2010.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, Jan. 2000.
[3] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, pp. 97-100, May 2006.
[4] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1435-1447, May 2007.
[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.
[6] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Audio, Speech and Language Processing, vol. 13, no. 3, pp. 345-354, May 2005.
[7] W.-Q. Zhang, L. He, Y. Deng, J. Liu, and M. T. Johnson, "Time-frequency cepstral features and heteroscedastic linear discriminant analysis for language recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 12, no. 2, pp. 266-276, Apr. 2011.
[8] W.-Q. Zhang, Y. Deng, L. He, and J. Liu, "Variant time-frequency cepstral features for speaker recognition," in Proc. Interspeech, pp. 2122-2125, Sept. 2011.
[9] Z.-Y. Li, L. He, W.-Q. Zhang, and J. Liu, "Multi-feature combination for speaker recognition," in Proc. 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 318-321, Nov. 2011.
[10] C. Vair, D. Colibro, F. Castaldo, E. Dalmasso, and P. Laface, "Channel factors compensation in model and feature domain for speaker recognition," in Proc. IEEE Odyssey, San Juan, Puerto Rico, Jun. 2006.
[11] N. Dehak, R. Dehak, J. Glass, D. Reynolds, and P. Kenny, "Cosine similarity scoring without score normalization techniques," in Proc. Odyssey, Brno, Czech Republic, July 2010.
[12] Z.-Y. Li, W.-Q. Zhang, L. He, and J. Liu, "I-vector combination for language recognition," submitted to ICASSP 2012.
[13] NIST DET-Curve Plotting software for use with MATLAB, http://www.itl.nist.gov/iad/mig/tools/DETwarev2.1.targz.html.
