12

Kernel Based Text-Independent Speaker Verification

Johnny Mariéthoz¹, Yves Grandvalet¹ and Samy Bengio²

¹ IDIAP Research Institute, Martigny, Switzerland
² Google Inc., Mountain View, CA, USA

The goal of a person authentication system is to authenticate the claimed identity of a user. When this authentication is based on the voice of the user, regardless of what the user actually said, the system is called a text-independent speaker verification system. Speaker verification systems are increasingly used to secure personal information, particularly for mobile phone based applications. Furthermore, text-independent versions of speaker verification systems are the most widely used because of their simplicity, as they do not require complex speech recognition modules. The most common approach to this task is based on Gaussian Mixture Models (GMMs) (Reynolds et al. 2000), which do not take into account any temporal information. GMMs have been used intensively thanks to their good performance, especially with the use of the Maximum A Posteriori (MAP) (Gauvain and Lee 1994) adaptation algorithm. This approach is based on the density estimation of an impostor data distribution, followed by its adaptation to a specific client data set. Note that the estimation of these densities is not the final goal of speaker verification systems, which is rather to discriminate the client and impostor classes; hence discriminative approaches might appear to be good candidates for this task as well. As a matter of fact, Support Vector Machine (SVM) based systems have been the subject of several recent publications in the speaker verification community, in which they obtain performance similar to or even better than that of GMMs on several text-independent speaker verification tasks. In order to use SVMs or any other discriminant approach for speaker verification, several modifications of the classical techniques are needed. The purpose of this chapter is to present an overview of discriminant approaches that have been used successfully for the task of text-independent speaker verification, and to analyze their differences and similarities with each other and with classical generative approaches based on GMMs. An open-source version of the C++ source code used to perform all experiments described in this chapter can be found at http://speaker.abracadoudou.com.

Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. Edited by J. Keshet and S. Bengio. © 2001 John Wiley & Sons, Ltd

12.1 Introduction

Person authentication systems are in general designed to let genuine clients access a given service while denying access to impostors. This can be seen as a two-class classification problem suitable for machine learning approaches. A number of specificities make speaker verification different from a standard binary classification problem. First, the input data are sentences whose lengths depend on their phonetic content and on the speaking rate of the speaker. Second, only a few client training examples are available: in most real applications, it is not possible to ask a client to speak for several hours or days in order to capture the entire variability of his or her voice. There are typically between one and three utterances for each client. Third, the impostor distribution is not known and not even well defined: we have no idea of what an impostor is in a “real” application. In order to simulate impostor accesses, one usually considers other speakers in the database. This ignorance is somewhat remedied by evaluating the models with impostor identities that are not available when creating the models. This incidentally means that plenty of impostor accesses are usually available, often more than 1000 times the number of client accesses, which makes the problem highly unbalanced. The distribution of impostors being only loosely defined, the prior probability of each class is unknown, and the cost of each type of error is usually not known beforehand. Thus, one usually selects a model that gives reasonable performance for several possible cost trade-offs. Finally, the recording conditions change over time. The speaker can be located in several kinds of places: office, street, train station, etc. The device used to perform the authentication can also change between authentication attempts: land line phone, mobile phone, laptop microphone, etc.
That being said, the problem of accepting or rejecting someone’s identity claim can be formally stated as a binary classification task. Let $S$ be a set of clients and $s_i \in S$ be the $i$-th client of that set. We look for a discriminant function $f(\cdot\,; \vartheta_i)$ and a decision threshold $\Delta$ such that

$$f(\bar{x}; \vartheta_i) > \Delta \tag{12.1}$$

if and only if sentence $\bar{x}$ was pronounced by speaker $s_i$.

The parameters $\vartheta_i$ are typically determined by optimizing an empirical criterion computed on a set of $L_i$ sentences, called either the training or the learning set, $\mathcal{L}_i = \{(\bar{x}_l, y_l)\}_{l=1}^{L_i}$, where $\bar{x}_l \in \mathbb{R}^{d \times T_l}$ is an input waveform sequence encoded as $T_l$ $d$-dimensional frames, and $y_l \in \{-1, 1\}$ is the corresponding target, where $1$ stands for a true client sequence and $-1$ for an impostor access. The search space is defined as the set of functions $f : \mathbb{R}^{d \times T_l} \mapsto \mathbb{R}$ parameterized by $\vartheta_i$, and $\vartheta_i$ is identified by minimizing the mean loss on the training set, where the loss $\ell(\cdot)$ returns low values when $f(\bar{x}; \vartheta_i)$ is near $y$ and high values otherwise:

$$\vartheta_i = \arg\min_{\theta} \sum_{(\bar{x}, y) \in \mathcal{L}_i} \ell\bigl(f(\bar{x}; \theta), y\bigr) .$$

Note that the overall goal is not to obtain zero error on $\mathcal{L}_i$ but rather on unseen examples drawn from the same probability distribution. This objective is monitored by measuring the classification performance on an independent test set $\mathcal{T}_i$, in order to provide an unbiased estimate of performance on the population. A standard taxonomy of machine learning algorithms sets apart discriminant models, which directly estimate the function $f(\cdot\,; \vartheta_i)$, from generative models, where $f(\cdot\,; \vartheta_i)$ is defined through the estimation of the conditional distribution of sequences given the speaker. We briefly present hereafter the classical generative approach that encompasses the very popular Gaussian Mixture Model (GMM), which will provide a baseline in the experimental section. All the other methods presented in this chapter are kernel-based systems that belong to the discriminative approach.

12.2 Generative Approaches

The state-of-the-art generative approaches for speaker verification use atypical models in the sense that they do not model the joint distribution of inputs and outputs. This is because we have no clue what the prior probability of having client $s_i$ speaking should be, since the distribution of impostors is only loosely defined and the proportion of client accesses in the training set may not be representative of the proportion in future accesses. Although the model is not complete, a decision function is computed using the rationale described below.

12.2.1 Rationale

The system has to decide whether a sentence $\bar{x}$ was pronounced by speaker $s_i$ or by any other person $s_0$. It should accept a claimed speaker as a client if and only if

$$P(s_i \mid \bar{x}) > \alpha_i \, P(s_0 \mid \bar{x}) , \tag{12.2}$$

where $\alpha_i$ is a trade-off parameter that accounts for the loss of false acceptance of an impostor access versus false rejection of a genuine client access. Using Bayes' theorem, we rewrite (12.2) as follows:

$$\frac{p(\bar{x} \mid s_i)}{p(\bar{x} \mid s_0)} > \alpha_i \frac{P(s_0)}{P(s_i)} = \Delta_i = \Delta , \tag{12.3}$$

where $\Delta_i$ is proportional to the ratio of the prior probabilities of being or not being the client. This ratio being unknown, $\Delta_i$ is replaced by a client-independent decision threshold $\Delta$. This corresponds to having different (unknown) settings for the trade-off parameters $\alpha_i$. The left ratio in (12.3) plays the role of $f(\bar{x}; \vartheta_i)$ in (12.1), where the set of parameters $\vartheta_i$ is decomposed as follows:

$$f(\bar{x}; \vartheta_i) = \frac{p(\bar{x} \mid s_i, \theta_i)}{p(\bar{x} \mid s_0, \theta_0)} ,$$

with $\vartheta_i = \{\theta_i, \theta_0\}$. The loss function used to estimate $\theta_0$ is the negative log-likelihood

$$\theta_0 = \arg\min_{\theta} \sum_{(\bar{x}, y) \in \mathcal{L}_i^-} - \log p(\bar{x} \mid s_0, \theta) ,$$

where $\mathcal{L}_i^-$ is the subset of pairs $(\bar{x}, y)$ in the learning set $\mathcal{L}_i$ for which $y = -1$. As few positive examples are generally available, the loss function used to estimate $\theta_i$ is based on a Maximum A Posteriori (MAP) adaptation scheme (Gauvain and Lee 1994) and can be written as follows:

$$\theta_i = \arg\min_{\theta} \sum_{(\bar{x}, y) \in \mathcal{L}_i^+} - \log \bigl( p(\bar{x} \mid s_i, \theta) \, p(\theta) \bigr) ,$$

where $\mathcal{L}_i^+$ is the subset of pairs $(\bar{x}, y)$ in $\mathcal{L}_i$ for which $y = 1$. This MAP approach puts a prior on $\theta$ to constrain these parameters to reasonable values. In practice, they are constrained to be near $\theta_0$, which represents reasonable parameters for any unknown person. See for instance Reynolds et al. (2000) for a practical implementation.
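The step from (12.2) to (12.3) can be spelled out as follows. This is only a worked restatement of Bayes' theorem, assuming $p(\bar{x}) > 0$, $p(\bar{x} \mid s_0) > 0$ and $P(s_i) > 0$; it introduces no new symbols.

$$P(s \mid \bar{x}) = \frac{p(\bar{x} \mid s)\, P(s)}{p(\bar{x})} \quad \text{for } s \in \{s_i, s_0\},$$

$$\frac{p(\bar{x} \mid s_i)\, P(s_i)}{p(\bar{x})} > \alpha_i \, \frac{p(\bar{x} \mid s_0)\, P(s_0)}{p(\bar{x})} \;\Longleftrightarrow\; \frac{p(\bar{x} \mid s_i)}{p(\bar{x} \mid s_0)} > \alpha_i \, \frac{P(s_0)}{P(s_i)} .$$

The evidence $p(\bar{x})$ cancels, so the decision depends only on the likelihood ratio and on a threshold that gathers the priors and the cost trade-off.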

12.2.2 Gaussian Mixture Models

State-of-the-art systems compute the density of a sentence $\bar{x}$ by a rough estimate that assumes independence of the $T$ frames that encode $\bar{x}$. The density of the frames themselves is assumed to be independent of the sequence length, and is estimated by a Gaussian Mixture Model (GMM) with diagonal covariance matrices, as follows:

$$\begin{aligned}
p(\bar{x} \mid s, \theta) &= P(T)\, p(\bar{x} \mid T, s, \theta) \\
&= P(T) \prod_{t=1}^{T} p(x^t \mid T, s, \theta) \\
&= P(T) \prod_{t=1}^{T} p(x^t \mid s, \theta) \\
&= P(T) \prod_{t=1}^{T} \sum_{m=1}^{M} \pi_m \, \mathcal{N}(x^t \mid \mu_m, \sigma_m) ,
\end{aligned} \tag{12.4}$$

where $P(T)$ is the probability distribution¹ of the length of sequence $\bar{x}$, $x^t$ is the $t$-th frame of $\bar{x}$, and $M$ is the number of mixture components. The parameters $\theta$ comprise the means $\{\mu_m\}_{m=1}^{M}$, standard deviations $\{\sigma_m\}_{m=1}^{M}$, and mixing weights $\{\pi_m\}_{m=1}^{M}$ for all Gaussian components. The Gaussian density is defined as follows:

$$\mathcal{N}(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-2} (x - \mu) \right) ,$$

where $d$ is the dimension of $x$, $\Sigma$ is the diagonal matrix with diagonal elements $\Sigma_{ii} = \sigma_i$, and $|\Sigma|$ denotes the determinant of $\Sigma$.

¹ Under the reasonable assumption that the distributions of sentence length are identical for each speaker, this distribution does not play any discriminating role and can be left unspecified.
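As a minimal sketch of (12.4), the following NumPy code evaluates the log-density of each frame under a diagonal-covariance GMM and sums it over the sequence, leaving $P(T)$ unspecified as the footnote allows. The function names and array layouts are assumptions made for illustration; they are not taken from the authors' C++ code.

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, stds):
    """Log-density of each frame under a diagonal-covariance GMM.

    frames: (T, d) array, weights: (M,), means: (M, d), stds: (M, d).
    Returns an array of shape (T,).
    """
    d = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]               # (T, M, d)
    # log of the Gaussian normalization: -(d/2) log(2 pi) - log |Sigma|
    log_norm = -0.5 * d * np.log(2 * np.pi) - np.log(stds).sum(axis=1)
    log_gauss = log_norm[None, :] - 0.5 * ((diff / stds[None, :, :]) ** 2).sum(axis=2)
    log_joint = np.log(weights)[None, :] + log_gauss            # (T, M)
    # numerically stable log-sum-exp over the mixture components
    top = log_joint.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(log_joint - top).sum(axis=1))

def gmm_sequence_loglik(frames, weights, means, stds):
    """Sum of frame log-densities: log p(x | T, s, theta), without the P(T) term."""
    return float(gmm_frame_loglik(frames, weights, means, stds).sum())
```

Working in the log domain avoids the underflow that the product over several thousand frames would otherwise cause.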


As stated in the previous section, we first train an impostor model $p(\bar{x} \mid s_0, \theta_0)$, called the world model or universal background model when it is common to all speakers $s_i$. For this purpose, we use the Expectation-Maximization (EM) algorithm to maximize the likelihood of the negative examples in the training set. Note that in order to obtain state-of-the-art performance, the variances of all Gaussian components are constrained to be higher than some threshold, normally selected on a separate development set. This process, often called variance flooring (Melin et al. 1998), can be seen as a way to control the capacity of the overall model.

For each client $s_i$, we use a variant of MAP adaptation (Reynolds et al. 2000) to estimate a client model $p(\bar{x} \mid s_i, \theta_i)$ that departs only partly from the world model $p(\bar{x} \mid s_0, \theta_0)$. In this setting, only the mean parameters of the world model are adapted to each client, using the following update rule:

$$\mu_m^i = \tau_{i,m} \, \hat{\mu}_m^i + (1 - \tau_{i,m}) \, \mu_m^0 ,$$

where $\mu_m^0$ is the vector of means of Gaussian $m$ of the world model, $\hat{\mu}_m^i$ is the corresponding vector estimated by maximum likelihood on the sequences available for client $s_i$, and $\tau_{i,m}$ is the adaptation factor that represents the faith we have in the client data. The latter is defined as follows (Reynolds et al. 2000):

$$\tau_{i,m} = \frac{n_{i,m}}{n_{i,m} + r} , \tag{12.5}$$

where $n_{i,m}$ is the effective number of frames used to compute $\hat{\mu}_m^i$, that is, the sum of memberships to component $m$ for all the frames of the training sequence(s) uttered by client $s_i$ (see Section 12.5.2 for details). The MAP relevance factor $r$ is chosen by cross-validation.

Finally, when all GMMs have been estimated, one can instantiate (12.3) to take a decision for a given access as follows:

$$\frac{1}{T} \sum_{t=1}^{T} \log \frac{\sum_{m=1}^{M} \pi_m \, \mathcal{N}(x^t ; \mu_m^i, \sigma_m)}{\sum_{m=1}^{M} \pi_m \, \mathcal{N}(x^t ; \mu_m^0, \sigma_m)} > \log \Delta ,$$

where $\theta_0 = \{\mu_m^0, \sigma_m, \pi_m\}_{m=1}^{M}$ are the GMM parameters for the world model, and $\theta_i = \{\mu_m^i, \sigma_m, \pi_m\}_{m=1}^{M}$ are the GMM parameters for the client model. Note that $\frac{1}{T}$ does not follow from (12.3) and is an empirical normalization factor added to yield a threshold $\Delta$ that is independent of the length of the sentence.
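The mean adaptation rule with factor (12.5) and the length-normalized log-likelihood ratio can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the memberships are taken to be the posteriors under the world model, the helper names are hypothetical, and `r` is assumed strictly positive.

```python
import numpy as np

def log_joint(frames, weights, means, stds):
    """log(pi_m N(x^t | mu_m, sigma_m)) for every frame and component: (T, M)."""
    d = frames.shape[1]
    z = (frames[:, None, :] - means[None, :, :]) / stds[None, :, :]
    log_gauss = (-0.5 * d * np.log(2 * np.pi)
                 - np.log(stds).sum(axis=1)[None, :]
                 - 0.5 * (z ** 2).sum(axis=2))
    return np.log(weights)[None, :] + log_gauss

def logsumexp_rows(a):
    """Stable log-sum-exp along axis 1."""
    top = a.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(a - top).sum(axis=1))

def map_adapt_means(frames, weights, world_means, stds, r):
    """Adapt only the means of the world model to the client frames."""
    lj = log_joint(frames, weights, world_means, stds)
    post = np.exp(lj - logsumexp_rows(lj)[:, None])      # memberships P(m | x^t)
    n = post.sum(axis=0)                                  # effective counts n_{i,m}
    # membership-weighted means; a component with n = 0 gets tau = 0 below
    ml_means = (post.T @ frames) / np.maximum(n, 1e-12)[:, None]
    tau = n / (n + r)                                     # adaptation factor (12.5)
    return tau[:, None] * ml_means + (1.0 - tau)[:, None] * world_means

def llr_score(frames, weights, client_means, world_means, stds):
    """Average per-frame log-likelihood ratio, to be compared with log(Delta)."""
    client = logsumexp_rows(log_joint(frames, weights, client_means, stds))
    world = logsumexp_rows(log_joint(frames, weights, world_means, stds))
    return float(np.mean(client - world))
```

A component that receives few client frames has a small adaptation factor and therefore stays close to the world model, which is how the prior compensates for the scarcity of enrollment data.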

12.3 Discriminative Approaches

Support Vector Machines (SVMs) (Vapnik 2000) are now a standard tool in numerous applications of machine learning, such as in text or vision (Joachims 2002; Pontil and Verri 1998). While the GMM is the mainstream generative model in speaker verification, SVMs prevail in the discriminative approach. This section provides a basic description of SVMs and introduces the kernel trick, which relates feature expansions to kernels and on which we will focus in Section 12.5.


12.3.1 Support Vector Machines

In the context of binary classification problems, the SVM decision function is defined by the sign of

$$f(x; \vartheta) = w \cdot \Phi(x) + b , \tag{12.6}$$

where $x$ is the current example, $\vartheta = \{w, b\}$ are the model parameters and $\Phi(\cdot)$ is a mapping, chosen “a priori”, that associates a possibly high-dimensional feature vector with each input. The SVM training problem consists in solving the following problem:

$$\begin{aligned}
(w^*, b^*) = \arg\min_{(w, b)} \quad & \frac{1}{2} \|w\|^2 + C \sum_{l=1}^{L} \xi_l \\
\text{s.t.} \quad & y_l \, (w \cdot \Phi(x_l) + b) \geq 1 - \xi_l \quad \forall l \\
& \xi_l \geq 0 \quad \forall l ,
\end{aligned} \tag{12.7}$$

where $L$ is the number of training examples, the target class label $y_l \in \{-1, 1\}$ corresponds to $x_l$, and $C$ is a hyper-parameter that trades off the minimization of classification error (upper-bounded by $\xi_l$) and the maximization of the margin, which provides generalization guarantees (Vapnik 2000). Solving (12.7) leads to a discriminant function expressed as a linear combination of training examples in the feature space $\Phi(\cdot)$. We can thus rewrite (12.6) as follows:

$$f(x; \vartheta) = \sum_{l=1}^{L} \alpha_l \, y_l \, \Phi(x_l) \cdot \Phi(x) + b ,$$

where most training examples do not enter this combination ($\alpha_l = 0$); the training examples for which $\alpha_l \neq 0$ are called support vectors. As the feature mapping $\Phi(\cdot)$ only appears in dot products, the SVM solution can be expressed as follows:

$$f(x; \vartheta) = \sum_{l=1}^{L} \alpha_l \, y_l \, k(x_l, x) + b ,$$

where $k(\cdot, \cdot)$ is the dot product $\Phi(\cdot) \cdot \Phi(\cdot)$. More generally, $k(\cdot, \cdot)$ can be any kernel function that fulfills the Mercer conditions (Burges 1998), which ensure that, for any possible training set, the optimization problem is convex.
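The kernel form of the decision function can be illustrated with a few lines of NumPy. This is a sketch only: the coefficients are assumed to come from an SVM solver that is not shown, and the function names are hypothetical.

```python
import numpy as np

def linear_kernel(a, b):
    """k(a, b) = a . b, i.e. Phi is the identity mapping."""
    return float(np.dot(a, b))

def svm_decision(x, support_vectors, alphas, labels, bias, kernel):
    """Evaluate f(x) = sum_l alpha_l * y_l * k(x_l, x) + b."""
    k_values = np.array([kernel(sv, x) for sv in support_vectors])
    return float(np.dot(alphas * labels, k_values) + bias)

# Two hypothetical support vectors; with a linear kernel the implied weight
# vector is w = sum_l alpha_l y_l x_l = (1, 0).
support_vectors = np.array([[1.0, 0.0], [-1.0, 0.0]])
alphas = np.array([0.5, 0.5])
labels = np.array([1.0, -1.0])
score = svm_decision(np.array([2.0, 5.0]), support_vectors, alphas, labels, 0.0, linear_kernel)
```

Only the support vectors are needed at test time, and swapping `linear_kernel` for another Mercer kernel changes the feature space without changing this code.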

12.3.2 Kernels

A usual problem in machine learning is to extract features that are relevant for the classification task. For SVMs, choosing the features and choosing the kernel are equivalent problems, thanks to the so-called “kernel trick” mentioned above. The latter also makes it possible to map $x_l$ into potentially infinite-dimensional feature spaces by avoiding the explicit computation of $\Phi(x_l)$; it also reduces the computational load for mappings in finite but high dimension. The two best known kernels are the Radial Basis Function (RBF) kernel

$$k(x_l, x_{l'}) = \exp\left( \frac{-\|x_l - x_{l'}\|^2}{2\sigma^2} \right) \tag{12.8}$$

and the polynomial kernel

$$k(x_l, x_{l'}) = (a \, x_l \cdot x_{l'} + b)^p , \tag{12.9}$$

where $\sigma$, $p$, $b$, $a$ are hyper-parameters that define the feature space.

[Figure 12.1: tree diagram. The speaker population is split into the development set $D$ (clients $D^+$, impostors $D^-$), the world set $W$ and the evaluation set $E$ (clients $E^+$, impostors $E^-$); the leaves are the learning sets $\mathcal{L}_i^+$, $\mathcal{L}_i^-$ and the test sets $\mathcal{T}_i^+$, $\mathcal{T}_i^-$.]
Figure 12.1 Split of the speaker population in three subsets, with the final decomposition in learning and test sets.

Several SVM-based approaches have been proposed recently to tackle the speaker verification problem (Campbell et al. 2006a; Wan and Renals 2003). These approaches rely on constructing an ad-hoc kernel for the problem at hand. These kernels will be presented and evaluated after the following section, which describes the details of the experimental methodology and the data that will be used to compare the various methods.
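For reference, the RBF kernel (12.8) and the polynomial kernel (12.9) can be written directly as below; the function names and default hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    """RBF kernel (12.8): exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def polynomial_kernel(x, y, a=1.0, b=1.0, p=3):
    """Polynomial kernel (12.9): (a x.y + b)^p."""
    return float((a * np.dot(x, y) + b) ** p)
```

The RBF kernel equals 1 for identical inputs and decays with distance at a rate set by `sigma`, whereas the polynomial kernel corresponds to a finite expansion into monomials of degree at most `p`.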

12.4 Benchmarking Methodology

In this section, we describe the methodology and the data used in all the experiments reported in this chapter. We first present the data splitting strategy that is used to imitate a realistic use of speaker verification systems. Then, we discuss the measures used to evaluate the performance of learning algorithms. Finally, we detail the database used to benchmark these algorithms, and the pre-processing that builds sequences of frames from waveform signals.

12.4.1 Data Splitting for Speaker Verification

A speaker verification problem is not a standard classification problem, since the objective is not to certify accesses from a pre-defined set of clients. Instead, we want to be able to authenticate new clients when they subscribe to the service, that is, we want to learn how to build new classifiers on the fly. Hence, a speaker verification system is evaluated by its ability to produce new classifiers with small test error. This is emulated by the data splitting process depicted in Figure 12.1. The root level gathers the population of speakers, which is split into three subpopulations, defined by their role in building classifiers: the development set $D$, the world set $W$ and the evaluation set $E$. All accesses from the speakers of $W$ will be used as the set of negative examples $\mathcal{L}_i^-$ for training the models responsible for authenticating client $s_i$, where $s_i$ may belong either to the development set $D$ or to the evaluation set $E$.

The sets $D$ and $E$ are further split into clients (resp. $D^+$ and $E^+$) and impostors (resp. $D^-$ and $E^-$) at the second level of the tree. The clients and the test impostors hence differ between the development and the evaluation sets. The impostor accesses in $D^-$ and $E^-$ form the set of negative test examples $\mathcal{T}_i^-$, that is, “attempt data” from out-of-training impostors claiming identity $s_i$, where $s_i$ belongs respectively to $D^+$ and $E^+$. Finally, at the third level of the tree, the accesses of client $s_i$ are split to form the positive examples of the training set $\mathcal{L}_i^+$ (also known as the “enrollment data”, usually a single access), and the set of positive “attempt data” $\mathcal{T}_i^+$ that play the role of out-of-training client accesses requiring authentication.

To summarize, the development set $D$ is used jointly with $W$ to train models and select their various hyper-parameters (such as the number of Gaussians, the MAP adaptation factor, kernel parameters, etc.). For each hyper-parameter, we define a range of possible values, and for each value, each client model is trained using the enrollment data $\mathcal{L}_i^+$ and the world data $\mathcal{L}_i^-$, before being evaluated with the positive and negative attempt data $\mathcal{T}_i^+$ and $\mathcal{T}_i^-$. We then select the value of the hyper-parameters that optimizes a given performance measure (the Equal Error Rate described below) on $\{\mathcal{T}_i^+ \cup \mathcal{T}_i^-\}$. Finally, the evaluation set $E$ is used to train new client models using these hyper-parameters, and to measure the performance of the system on these new clients.

12.4.2 Performance Measures

The classification error rate is the most common performance measure in the machine learning literature, but it is not well suited to the type of problems encountered in speaker verification, where class priors are unknown and misclassification losses are unbalanced. Hence, a weighted version of the misclassification rate is used, where one distinguishes two kinds of errors: False Rejection (FR), which consists in rejecting a genuine client, and False Acceptance (FA), which consists in accepting an impostor. All the measures used in this chapter are based on the corresponding error rates: the False Acceptance Rate (FAR) is the number of FAs divided by the number of impostor accesses, and the False Rejection Rate (FRR) is the number of FRs divided by the number of client accesses.

As stated in the previous section, in practice, we aim at building a single system that is able to take decisions for all future users. The performance is measured globally, on the set of speakers of the evaluation set, by averaging the performance over all trials independently of the claimed identity.

In the speaker verification literature, a point often overlooked is that most of the results are reported with “a posteriori” measures, in the sense that the decision threshold $\Delta$ in Equation (12.1) is selected such that it optimizes some criterion on the evaluation set. We believe that this is unfortunate, and, in order to obtain unbiased results, we will use “a priori” measures, where the decision threshold $\Delta$ is selected on a development set, before seeing the evaluation set, and then applied to the evaluation data.

Common a posteriori measures include the Equal Error Rate (EER), where the threshold $\Delta$ is chosen such that FAR = FRR, and the Detection Error Tradeoff (DET) curve (Martin et al. 1997), which depicts FRR as a function of FAR when $\Delta$ varies. Note that the DET curve is a non-linear transformation of the Receiver Operating Characteristic (ROC) curve (Van Trees 1968). The non-linearity is in fact a normal deviate, coming from the hypothesis that the scores of client accesses and impostor accesses follow a Gaussian distribution. These measures are perfectly legitimate for exploratory analysis or for tuning hyper-parameters on the development set, and they are used for this purpose here. To avoid confusion with proper test results, we will only report DET curves computed on the development set.

For test performance, we will use a priori measures: the Half Total Error Rate, $\mathrm{HTER} = \frac{1}{2} (\mathrm{FAR}(\Delta) + \mathrm{FRR}(\Delta))$, and the Expected Performance Curve (EPC) (Bengio et al. 2005), which depicts the evaluation set HTER as a function of a trade-off parameter $\alpha$. The latter defines a decision threshold, computed on the development set, by minimizing the following convex combination of development FAR and FRR:

$$\Delta^* = \arg\min_{\Delta} \bigl( \alpha \cdot \mathrm{FAR}(\Delta) + (1 - \alpha) \cdot \mathrm{FRR}(\Delta) \bigr) . \tag{12.10}$$

We will provide confidence intervals around HTER and EPC. In this chapter, we report confidence intervals computed at the 5% significance level, using an adapted version of the standard proportion test (Bengio and Mariéthoz 2004).
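The a priori protocol can be sketched as follows: the threshold of (12.10) is selected on development scores and then applied unchanged to evaluation scores to obtain the HTER. In this sketch FAR is computed over impostor accesses and FRR over client accesses, an access is accepted when its score exceeds the threshold as in (12.1), and the candidate thresholds are restricted to the observed development scores, which is a simplifying assumption.

```python
import numpy as np

def far_frr(client_scores, impostor_scores, threshold):
    """FAR = accepted impostors / impostor accesses, FRR = rejected clients / client accesses."""
    far = float(np.mean(np.asarray(impostor_scores) > threshold))
    frr = float(np.mean(np.asarray(client_scores) <= threshold))
    return far, frr

def select_threshold(dev_client, dev_impostor, alpha=0.5):
    """Minimize alpha * FAR + (1 - alpha) * FRR on the development scores (12.10)."""
    observed = np.sort(np.concatenate([dev_client, dev_impostor]))
    candidates = np.concatenate([[-np.inf], observed])   # -inf accepts every access
    costs = []
    for t in candidates:
        far, frr = far_frr(dev_client, dev_impostor, t)
        costs.append(alpha * far + (1.0 - alpha) * frr)
    return float(candidates[int(np.argmin(costs))])

def hter(client_scores, impostor_scores, threshold):
    """Half Total Error Rate at a threshold chosen beforehand."""
    far, frr = far_frr(client_scores, impostor_scores, threshold)
    return 0.5 * (far + frr)
```

Sweeping `alpha` over a grid, selecting a threshold on the development scores for each value, and plotting the resulting evaluation HTER gives the points of an EPC.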

12.4.3 NIST Data

The NIST database is a subset of the database that was used for the NIST 2005 and 2006 Speaker Recognition Evaluation, which comes from the second release of the cellular switchboard corpus (Switchboard Cellular - Part 2) of the Linguistic Data Consortium. This data was used as development and evaluation sets while the training (negative) examples come from previous NIST campaigns. For both development and evaluation clients, there are about 2 minutes of telephone speech available to train the models and each test access was less than 1 minute long. Only male speakers were used. The development population consisted of 264 speakers, while the evaluation set contained 349 speakers. 219 different records were used as negative examples for the discriminant models. The total number of accesses is 13596 for the development population and 22131 for the evaluation population, with a proportion of 10% of true target accesses.

12.4.4 Pre-Processing

To extract input features, the original waveforms are sampled every 10 ms with a window size of 20 ms. Each sentence is parameterized using 24 triangular band-pass filters with a DCT transformation of order 16, complemented by their first derivative (delta) and their second derivative (delta-delta), the log-energy, the delta-log-energy and the delta-delta-log-energy, for a total of 51 coefficients. The NIST database being telephone-based, the signal is band-pass filtered between 300 and 3400 Hz.

A simple silence detector, based on a two-component Gaussian mixture model, is used to remove all silence frames. The model is first learned on a random recording with a land line microphone and adapted for each new sequence using the MAP adaptation algorithm. The sequences are then normalized in order to have zero mean and unit variance on each feature.

While the log-energy is important in order to remove the silence frames, it is known to be inappropriate for discriminating between clients and impostors. This feature is thus eliminated after silence removal, while its first derivative is kept. Hence, the speaker verification models are trained with 50 (51 − 1) features.
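The last two steps, dropping one feature and normalizing each sequence to zero mean and unit variance per feature, can be sketched as below. The column index of the log-energy and the small floor `eps` are assumptions made for illustration; the actual feature layout is not specified here.

```python
import numpy as np

def drop_feature(frames, index):
    """Remove one feature column (for example the log-energy) from a (T, d) sequence."""
    return np.delete(np.asarray(frames, dtype=float), index, axis=1)

def normalize_sequence(frames, eps=1e-8):
    """Give each feature of a (T, d) sequence zero mean and unit variance."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    # eps guards against division by zero for constant features
    return (frames - mean) / np.maximum(std, eps)
```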

12.5 Kernels for Speaker Verification

One particularity of speaker verification is that patterns are sequences. An SVM based classification thus requires a kernel that handles variable size sequences. Most solutions proposed in the literature use a procedure that converts the sequences into fixed size vectors that are processed by a linear SVM. Other sequence kernels allow embeddings in infinite-dimensional feature spaces (Mariéthoz and Bengio 2007). However, compared with the mainstream approach, this type of kernel is computationally too demanding for long sequences. It will not be applied here, since the NIST database contains long sequences. In the following we describe several approaches using sequence kernels. The most promising are then compared in Section 12.8.

12.5.1 Mean Operator Sequence Kernels

For kernel methods, a simple approach to tackle variable length sequences considers the following kernel between two sequences:

$$K(\bar{x}_i, \bar{x}_j) = \frac{1}{T_i T_j} \sum_{t=1}^{T_i} \sum_{u=1}^{T_j} k(x_i^t, x_j^u) , \tag{12.11}$$

where we denote by $K(\cdot, \cdot)$ a sequence kernel, $\bar{x}_i$ is a sequence of size $T_i$ and $x_i^t$ is a frame of $\bar{x}_i$. We thus apply a frame-based kernel $k(\cdot, \cdot)$ to all possible pairs of frames coming from the two input sequences $\bar{x}_i$ and $\bar{x}_j$.

As the kernel $K$ represents the average similarity between all possible pairs of frames, it will be referred to as the mean operator sequence kernel. This kind of kernel has been applied successfully in other domains such as object recognition (Boughorbel et al. 2004). Provided that $k(\cdot, \cdot)$ is positive-definite, the resulting kernel $K(\cdot, \cdot)$ is also positive-definite.

The sequences in the NIST database typically consist of several thousand frames, hence the double summation in (12.11) is very costly. As the number of operations for each sequence kernel evaluation is proportional to the product of sequence lengths, such a computation typically requires on the order of millions of operations. We will thus consider factorizable kernels $k(\cdot, \cdot)$, such that the mean operator sequence kernel (12.11) can be expressed as follows:

$$\begin{aligned}
K(\bar{x}_i, \bar{x}_j) &= \frac{1}{T_i T_j} \sum_{t=1}^{T_i} \sum_{u=1}^{T_j} \phi(x_i^t) \cdot \phi(x_j^u) \\
&= \left[ \frac{1}{T_i} \sum_{t=1}^{T_i} \phi(x_i^t) \right] \cdot \left[ \frac{1}{T_j} \sum_{u=1}^{T_j} \phi(x_j^u) \right] .
\end{aligned} \tag{12.12}$$

When the dimension of the feature space is not too large, computing the dot product explicitly is not too demanding, and replacing the double summation by two single ones may result in a significant reduction of computing time.
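The equivalence between the double summation (12.11) and the factorized form (12.12) can be checked numerically. The sketch below uses the homogeneous quadratic kernel $k(x, y) = (x \cdot y)^2$ as an example of a factorizable kernel, chosen only because its explicit map is short to write; the function names are hypothetical.

```python
import numpy as np

def mean_kernel_double_sum(seq_i, seq_j, frame_kernel):
    """Mean operator kernel (12.11): average of k over all pairs of frames."""
    total = sum(frame_kernel(xt, xu) for xt in seq_i for xu in seq_j)
    return total / (len(seq_i) * len(seq_j))

def mean_kernel_factorized(seq_i, seq_j, feature_map):
    """Factorized form (12.12): dot product of the two averaged feature vectors."""
    phi_i = np.mean([feature_map(x) for x in seq_i], axis=0)
    phi_j = np.mean([feature_map(x) for x in seq_j], axis=0)
    return float(np.dot(phi_i, phi_j))

def quadratic_map(x):
    """Explicit map for k(x, y) = (x . y)^2: all pairwise products x_a * x_b."""
    return np.outer(x, x).ravel()

def quadratic_kernel(x, y):
    return float(np.dot(x, y) ** 2)

rng = np.random.default_rng(0)
seq_a = rng.normal(size=(4, 3))   # 4 frames of dimension 3
seq_b = rng.normal(size=(5, 3))   # 5 frames of dimension 3
slow = mean_kernel_double_sum(seq_a, seq_b, quadratic_kernel)
fast = mean_kernel_factorized(seq_a, seq_b, quadratic_map)
```

The double sum costs one kernel evaluation per pair of frames, while the factorized form needs one pass over each sequence plus a single dot product, and the averaged vector of each sequence can be cached and reused.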


Explicit polynomial expansions have been used in Campbell (2002), Campbell et al. (2006a) and Wan and Renals (2003). In practice, the average feature vectors within brackets in (12.12) are used as input to a linear SVM. The GLDS (Generalized Linear Discriminant Sequence) kernel of Campbell departs slightly from a raw polynomial expansion, by using a normalization in the feature space:

$$K(\bar{x}_i, \bar{x}_j) = \frac{1}{T_i T_j} \Phi(\bar{x}_i)^T \, \Gamma^{-1} \, \Phi(\bar{x}_j) , \tag{12.13}$$

where $\Gamma$ defines a metric in the feature space. Typically, this is a diagonal approximation of the Mahalanobis metric, that is, $\Gamma$ is a diagonal matrix whose diagonal elements $\gamma_k$ are the empirical variances² for each feature, computed over the training data.

The polynomial expansion sends $d$-dimensional frames to a feature space of dimension $(d + p)!/(d!\,p!) - 1$, where $p$ is the degree of the polynomial. With our 50 input features, and for a polynomial of degree $p = 3$, the dimension of the feature space is 23426 (counting the constant term). For higher polynomial degrees and for other feature spaces of higher dimension, the computational advantage of the decomposition (12.12) disappears, and it is better to use an explicit kernel in the form (12.11).

We empirically show below that, for the usual representation of frames described in Section 12.4.4, the GLDS normalization in (12.13) is embedded in the standard polynomial kernel. Let us define $k(x_i, x_j)$ as a polynomial kernel of the form $(x_i \cdot x_j + 1)^p$, where $p$ is the degree of the polynomial. After removing the constant term, the explicit expansion of this standard polynomial kernel involves $(d + p)!/(d!\,p!) - 1$ terms that can be indexed by $r = (r_1, r_2, \dots, r_d)$, such that

$$\phi_r(x) = \sqrt{c_r} \; x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d} ,$$

where

$$\sum_{i=1}^{d+1} r_i = p , \qquad r_i \geq 0 , \qquad \text{and} \qquad c_r = \frac{p!}{r_1! \, r_2! \cdots r_{d+1}!} ,$$

with $r_{d+1}$ the exponent of the constant term.

In the above equations, $\sqrt{c_r}$ has exactly the same role as the $1/\sqrt{\gamma_k}$ coefficients on the diagonal of $\Gamma^{-1/2}$ in Equation (12.13). In Figure 12.2, we compare these coefficient values, where the normalization factors $1/\sqrt{\gamma_k}$ are estimated on two real datasets, after a polynomial expansion of degree 3. The values are very similar, with highs and lows on the same monomials. In fact, the performance of the two approaches obtained on the development set of NIST is about the same, as shown by the DET curves given in Figure 12.3.

Even if this approach is simple and easy to use, the accuracy can be improved by introducing priors. In fact, very few positive examples are available to train a client model. Thus, if we can put information collected on a large set of speakers into the SVM model, as done for the GMM system, we can expect an improvement. One can for example try to include the world model in the kernel function, as proposed in the next section.
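The explicit expansion of the kernel $(x \cdot y + 1)^p$ can be sketched for a small input dimension, using multinomial coefficients in which the last exponent, that of the constant 1, is $p$ minus the sum of the other exponents. The enumeration below is exponential in $d$ and is meant only as an illustration for tiny $d$; all function names are hypothetical.

```python
import itertools
import math
import numpy as np

def num_monomials(d, p):
    """Number of monomials of degree at most p in d variables, constant included."""
    return math.comb(d + p, p)

def exponent_vectors(d, p):
    """All r = (r_1, ..., r_d) with 1 <= r_1 + ... + r_d <= p (constant term excluded)."""
    return [r for r in itertools.product(range(p + 1), repeat=d) if 1 <= sum(r) <= p]

def poly_features(x, p):
    """phi_r(x) = sqrt(c_r) * prod_i x_i^{r_i}, without the constant feature."""
    x = np.asarray(x, dtype=float)
    feats = []
    for r in exponent_vectors(len(x), p):
        r_last = p - sum(r)   # exponent of the constant 1 in (x.y + 1)^p
        c = math.factorial(p) / (math.prod(math.factorial(k) for k in r) * math.factorial(r_last))
        feats.append(math.sqrt(c) * np.prod(x ** np.array(r)))
    return np.array(feats)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
explicit = float(np.dot(poly_features(x, 3), poly_features(y, 3)))
implicit = (np.dot(x, y) + 1.0) ** 3 - 1.0   # kernel value minus the constant term
```

The two values agree, which is the multinomial theorem at work; for 50 inputs and degree 3, `num_monomials(50, 3)` returns 23426 monomials including the constant one.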

² The constant feature is removed from the feature space prior to normalization.

[Figure 12.2: three panels plotting coefficient values against the monomial index $k$: the normalization factors $1/\sqrt{\gamma_k}$ estimated on Banca, the normalization factors $1/\sqrt{\gamma_k}$ estimated on PolyVar, and the polynomial coefficients $c_k$.]
Figure 12.2 Coefficient values of polynomial terms, as computed on two different datasets (Banca and PolyVar), compared to the $c_k$ polynomial coefficients.

12.5.2 Fisher Kernels

Jaakkola and Haussler proposed a principled means for building kernel functions from generative models: the Fisher kernel (Jaakkola and Haussler 1998). In this framework, which has been applied to speaker verification by Wan and Renals (2005a), the generative model is used to specify the similarity between pairs of examples, instead of the usual practice where it is used to provide a likelihood score, which measures how well the example fits the model. Put another way, a Fisher kernel uses a generative model to measure the differences in the generative process between pairs of examples instead of the differences in posterior probabilities.

The key ingredient of the Fisher kernel is the vector of Fisher scores:

$$u_{\bar{x}} = \nabla_{\theta} \log p(\bar{x} \mid \theta) ,$$

where $\theta$ denotes here the parameters of the generative model, and $\nabla_{\theta}$ is the gradient with respect to $\theta$. The Fisher scores quantify how much each parameter contributes to the generation of example $\bar{x}$.

The Fisher kernel itself is given by:

$$K(\bar{x}_i, \bar{x}_j) = u_{\bar{x}_i}^T \, I(\theta)^{-1} \, u_{\bar{x}_j} , \tag{12.14}$$

where $I(\theta)$ is the Fisher information matrix at $\theta$, that is, the covariance matrix of Fisher scores:

$$I(\theta) = E_{\bar{x}} \bigl( u_{\bar{x}} \, u_{\bar{x}}^T \bigr) , \tag{12.15}$$

where we used that $E_{\bar{x}}(u_{\bar{x}}) = 0$. The Fisher kernel (12.14) can thus be interpreted as a Mahalanobis distance between two Fisher scores.


Figure 12.3 DET curves on the development set of the NIST database comparing the explicit polynomial expansion (noted “GLDS kernel p = 3” in the legend), and the principled polynomial kernel (noted “Polynomial kernel p = 3”).

Another interpretation of the Fisher kernel is based on the representation of a parametric class of generative models as a Riemannian manifold (Jaakkola and Haussler 1998). Here, the vector of Fisher scores defines a tangent direction at a given location, that is, at a given model parameterized by $\theta$. The Fisher information matrix is the local metric at this given point, which defines the distance between the current model $p(\bar{x}|\theta)$ and its neighbors $p(\bar{x}|\theta + \delta)$. The (squared) distance $d(\theta, \theta + \delta) = \frac{1}{2} \delta^T I(\theta)\, \delta$ approximates the Kullback-Leibler divergence between the two models. Note that, unlike the Kullback-Leibler divergence, the Fisher kernel (12.14) is symmetric. It is also positive definite, since the Fisher information matrix $I(\theta)$, being a covariance matrix, is positive definite whenever it is non-singular.
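A one-parameter worked example may help to see why the quadratic form approximates the Kullback-Leibler divergence. It uses a Gaussian with unknown mean $\theta$ and fixed variance $\sigma^2$, a toy model chosen here for illustration and not taken from the chapter; in this particular case the approximation happens to be exact.

```latex
% Model: p(x|\theta) = N(x|\theta, \sigma^2), with \sigma fixed.
% Fisher score and Fisher information:
u_x = \frac{x-\theta}{\sigma^2}, \qquad
I(\theta) = E_x\big[u_x^2\big] = \frac{1}{\sigma^2} .
% Kullback-Leibler divergence between two Gaussians with equal variance:
\mathrm{KL}\big(p(\cdot|\theta)\,\|\,p(\cdot|\theta+\delta)\big) = \frac{\delta^2}{2\sigma^2}
  = \frac{1}{2}\,\delta\, I(\theta)\, \delta .
```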

Fisher Kernels for GMMs

In the MAP framework, the family of generative models we consider is the set of Gaussian mixtures (12.4) that differ in their mean vectors $\mu_m$. Hence, a relevant dissimilarity between examples will be measured by the Fisher scores computed on these vectors, $u_{\bar{x}} = (\nabla_{\mu_1}^T \log p(\bar{x}|\theta), \ldots, \nabla_{\mu_M}^T \log p(\bar{x}|\theta))^T$, where

$$
\begin{aligned}
\nabla_{\mu_m} \log p(\bar{x}|\theta)
 &= \sum_{t=1}^{T} \nabla_{\mu_m} \log \sum_{m'=1}^{M} \pi_{m'} N(x^t|\mu_{m'}, \sigma_{m'}) \\
 &= \sum_{t=1}^{T} P(m|x^t) \, \nabla_{\mu_m} \left( -\frac{1}{2} (x^t - \mu_m)^T \Sigma_m^{-2} (x^t - \mu_m) \right) \\
 &= \sum_{t=1}^{T} P(m|x^t) \, \Sigma_m^{-2} (x^t - \mu_m) .
\end{aligned}
\tag{12.16}
$$
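The computation in (12.16) can be sketched as follows. This is a minimal illustration that assumes diagonal covariances, with each `sigmas[m]` holding the standard deviations of component m; the function names and the toy parameter values are hypothetical and are not taken from the chapter's C++ implementation.

```python
import math


def log_gauss_diag(x, mu, sigma):
    """Log density of a diagonal Gaussian; sigma holds standard deviations."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((xd - md) / s) ** 2
        for xd, md, s in zip(x, mu, sigma)
    )


def component_logs(x, weights, means, sigmas):
    """log(pi_m) + log N(x | mu_m, sigma_m) for every component m."""
    return [math.log(w) + log_gauss_diag(x, mu, sg)
            for w, mu, sg in zip(weights, means, sigmas)]


def posteriors(x, weights, means, sigmas):
    """P(m | x) for each component, computed stably in the log domain."""
    logs = component_logs(x, weights, means, sigmas)
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def log_likelihood(frames, weights, means, sigmas):
    """log p(frames | theta), with frames treated as independent."""
    total = 0.0
    for x in frames:
        logs = component_logs(x, weights, means, sigmas)
        top = max(logs)
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total


def fisher_score_means(frames, weights, means, sigmas):
    """Gradient of log p(frames | theta) w.r.t. each mean vector, as in (12.16)."""
    dim = len(means[0])
    score = [[0.0] * dim for _ in means]
    for x in frames:
        post = posteriors(x, weights, means, sigmas)
        for m, (mu, sg) in enumerate(zip(means, sigmas)):
            for d in range(dim):
                score[m][d] += post[m] * (x[d] - mu[d]) / sg[d] ** 2
    return score


# Toy two-component model and a three-frame sequence (illustrative values only).
weights = [0.4, 0.6]
means = [[0.0, 0.0], [2.0, 1.0]]
sigmas = [[1.0, 0.5], [0.8, 1.2]]
frames = [[0.1, -0.2], [1.9, 1.3], [1.0, 0.5]]
score = fisher_score_means(frames, weights, means, sigmas)
```

A convenient sanity check for such code is to compare each entry of the score with a finite-difference derivative of `log_likelihood`.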

Using definition (12.15), the Fisher information matrix can be expressed block-wise, with $M \times M$ blocks of size $d \times d$:

$$ I = (I_{m,m'})_{1 \le m \le M,\, 1 \le m' \le M} , $$

with

$$ I_{m,m'} = E_{\bar{x}} \left[ \sum_{t=1}^{T} \sum_{u=1}^{T} P(m|x^t) P(m'|x^u) \, \Sigma_m^{-2} (x^t - \mu_m)(x^u - \mu_{m'})^T \Sigma_{m'}^{-2} \right] . \tag{12.17} $$

There is no simple analytical expression of this expectation, due, among other things, to the product $P(m|x^t)P(m'|x^u)$. Hence, several options are possible:

I. ignore the information matrix in the computation of the Fisher kernel (12.14). This option, mentioned by Jaakkola and Haussler as a simpler suitable substitute, is often used in the application of Fisher kernels;

II. approximate the expectation in the definition of Fisher information by Monte Carlo sampling;

III. approximate the product $P(m|x^t)P(m'|x^u)$ by a simpler expression in (12.17). For example, if we assume that the considered GMM performs hard assignments of frames to mixture components, then $P(m|x^t)P(m'|x^u)$ is null if $m \neq m'$. Furthermore, this product is also null for $m = m'$ when $x^t$ or $x^u$ is generated from another component of the mixture distribution; otherwise, we have $P(m|x^t)P(m'|x^u) = 1$. Let $g_m$ denote the function such that $g_m(x, y) = \Sigma_m^{-2} (x - \mu_m)(y - \mu_m)^T \Sigma_m^{-2}$ if $P(m|x) = P(m|y) = 1$ and $g_m(x, y) = 0$ otherwise. With this notation and the above approximations, (12.17) reads

$$ I_{m,m'} \simeq 0 \quad \text{if } m \neq m' , $$

$$ I_{m,m} \simeq E_{\bar{x}} \left[ \sum_{t=1}^{T} \sum_{u=1}^{T} g_m(x^t, x^u) \right] \simeq E_x[g_m(x, x)] \, E_T[T] \simeq \Sigma_m^{-2} \, E_T[T] . $$

The unknown constant $E_T[T]$ is not relevant and can be dropped from the implementation of this approximation to the Fisher kernel.
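The last two approximation steps can be unpacked as follows. This is a sketch under assumptions that the text does not state explicitly: frames are drawn independently, $\Sigma_m$ denotes the diagonal matrix of standard deviations of component m (so that $\Sigma_m^2$ is its covariance), and expectations are taken over frames assigned to component m.

```latex
% Cross terms (t \neq u) vanish because frames are independent and centred:
E\big[(x^t-\mu_m)(x^u-\mu_m)^T\big] = E[x^t-\mu_m]\,E[x^u-\mu_m]^T = 0 .
% Each of the T diagonal terms (t = u) contributes
E_x\big[g_m(x,x)\big]
  = \Sigma_m^{-2}\,E\big[(x-\mu_m)(x-\mu_m)^T\big]\,\Sigma_m^{-2}
  = \Sigma_m^{-2}\,\Sigma_m^{2}\,\Sigma_m^{-2}
  = \Sigma_m^{-2} ,
% so that, averaging over sequence lengths,
I_{m,m} \simeq E_T[T]\,\Sigma_m^{-2} .
```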


We now introduce some definitions with the following scenario. Suppose that we trained the GMM world model on a large set of speakers, resulting in parameters $\theta_0 = \{\mu_m^0, \sigma_m, \pi_m\}_{m=1}^{M}$. We then use this GMM as an initial guess for the model of client $s_i$. If, as in the MAP framework, the client model differs from the world model in the mean vectors only, then, after one EM update, the training sequence $\bar{x}_i$ will result in the following estimates:

$$ \mu_m^i = \frac{1}{n_{i,m}} \sum_{t=1}^{T_i} x_i^t \, P(m|x_i^t) , \quad \text{where} \quad n_{i,m} = \sum_{t=1}^{T_i} P(m|x_i^t) . $$

Hence, $n_{i,m}$ is the effective number of frames used to compute $\mu_m^i$, that is, the sum of the memberships of all frames of $\bar{x}_i$ to component m. These definitions of $\mu_m^i$ and $n_{i,m}$ are convenient for expressing Fisher scores, when the reference generative model is the world model parameterized by $\theta_0$:

$$
\begin{aligned}
\nabla_{\mu_m} \log p(\bar{x}_i|\theta) \Big|_{\theta = \theta_0}
 &= \Sigma_m^{-2} \sum_{t=1}^{T_i} P(m|x_i^t) \, (x_i^t - \mu_m^0) \\
 &= n_{i,m} \, \Sigma_m^{-2} (\mu_m^i - \mu_m^0) .
\end{aligned}
\tag{12.18}
$$
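The identity (12.18) can be checked numerically. The sketch below assumes that the posteriors P(m|x_i^t) under the world model have already been computed and are passed in as `post[t][m]`; the function names and the toy numbers are hypothetical.

```python
def map_statistics(frames, post):
    """Return (n, mu): n[m] is the effective frame count, mu[m] the posterior-weighted mean."""
    n_comp, dim = len(post[0]), len(frames[0])
    n = [sum(p[m] for p in post) for m in range(n_comp)]
    mu = [
        [sum(p[m] * x[d] for x, p in zip(frames, post)) / n[m] for d in range(dim)]
        for m in range(n_comp)
    ]
    return n, mu


def score_from_statistics(n, mu, world_means, sigmas):
    """Right-hand side of (12.18): n_m * Sigma_m^{-2} * (mu_m - mu0_m), diagonal case."""
    return [
        [n[m] * (mu[m][d] - world_means[m][d]) / sigmas[m][d] ** 2
         for d in range(len(world_means[m]))]
        for m in range(len(world_means))
    ]


def score_direct(frames, post, world_means, sigmas):
    """Frame-by-frame sum: Sigma_m^{-2} * sum_t P(m|x^t) * (x^t - mu0_m)."""
    return [
        [sum(p[m] * (x[d] - world_means[m][d]) for x, p in zip(frames, post))
         / sigmas[m][d] ** 2
         for d in range(len(world_means[m]))]
        for m in range(len(world_means))
    ]


# Illustrative values: three frames, two components, two dimensions.
frames = [[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]]
post = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
world_means = [[1.5, 1.0], [2.5, 1.0]]
sigmas = [[1.0, 2.0], [0.5, 1.0]]  # standard deviations

n, mu = map_statistics(frames, post)
via_stats = score_from_statistics(n, mu, world_means, sigmas)
direct = score_direct(frames, post, world_means, sigmas)
```

Working with the pair (n, mu) rather than with the frames is what makes the kernels of this section cheap to evaluate: each sequence is summarized once, whatever its length.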

With the approximations of the Fisher information discussed above, the kernel is expressed as:

I. for the option where the Fisher information matrix is ignored:

$$ K(\bar{x}_i, \bar{x}_j) = u_{\bar{x}_i}^T u_{\bar{x}_j} = \sum_{m=1}^{M} \left( n_{i,m} \Sigma_m^{-2} (\mu_m^i - \mu_m^0) \right)^T \left( n_{j,m} \Sigma_m^{-2} (\mu_m^j - \mu_m^0) \right) ; $$

II. for the option where the Fisher information matrix is approximated by Monte Carlo integration: here, for computational reasons, we only consider a block-diagonal approximation $\hat{I}$, where

$$ \hat{I} = (\hat{I}_{m,m'})_{1 \le m \le M,\, 1 \le m' \le M} , $$

with

$$ \hat{I}_{m,m'} = 0 \quad \text{if } m \neq m' , \qquad \hat{I}_{m,m} = \frac{1}{n} \sum_{t} P(m|x^t)^2 \, \Sigma_m^{-2} (x^t - \mu_m^0)(x^t - \mu_m^0)^T \Sigma_m^{-2} , $$

where n is the number of random draws of $x^t$ generated from the world model. We then have:

$$ K(\bar{x}_i, \bar{x}_j) = \sum_{m=1}^{M} \left( n_{i,m} \Sigma_m^{-2} (\mu_m^i - \mu_m^0) \right)^T \hat{I}_{m,m}^{-1} \left( n_{j,m} \Sigma_m^{-2} (\mu_m^j - \mu_m^0) \right) ; $$

III. for the option where the Fisher information matrix is approximated analytically:

$$ K(\bar{x}_i, \bar{x}_j) = \sum_{m=1}^{M} \left( n_{i,m} \Sigma_m^{-1} (\mu_m^i - \mu_m^0) \right)^T \left( n_{j,m} \Sigma_m^{-1} (\mu_m^j - \mu_m^0) \right) . $$

Figure 12.4 DET curves on the development set of the NIST database comparing the three different approximations of the Fisher information matrix.

These three variants of the Fisher kernel are compared in Figure 12.4, which shows the DET curves obtained on the development set of the NIST database. The three curves almost overlap, confirming that ignoring the information matrix in the Fisher kernel is not harmful in our setup.
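Options I and III differ only in the power of the standard deviations that scales each coordinate, which the following sketch makes explicit. It assumes diagonal covariances and sequences summarized by the pair (n, mu) of effective counts and posterior-weighted means; the names are hypothetical, and option II is left out because it additionally requires sampling from the world model.

```python
def _scaled_dot(stats_i, stats_j, world_means, sigmas, power):
    """Sum over components and dimensions of n_i*(mu_i - mu0) * n_j*(mu_j - mu0) / sigma**power."""
    n_i, mu_i = stats_i
    n_j, mu_j = stats_j
    total = 0.0
    for m, (mu0, sg) in enumerate(zip(world_means, sigmas)):
        for d in range(len(mu0)):
            left = n_i[m] * (mu_i[m][d] - mu0[d])
            right = n_j[m] * (mu_j[m][d] - mu0[d])
            total += left * right / sg[d] ** power
    return total


def fisher_kernel_option_1(stats_i, stats_j, world_means, sigmas):
    """Option I: information matrix ignored, Sigma^{-2} applied on both sides."""
    return _scaled_dot(stats_i, stats_j, world_means, sigmas, power=4)


def fisher_kernel_option_3(stats_i, stats_j, world_means, sigmas):
    """Option III: analytic approximation, Sigma^{-1} applied on both sides."""
    return _scaled_dot(stats_i, stats_j, world_means, sigmas, power=2)


# One component, one dimension, illustrative values only.
world_means = [[0.0]]
sigmas = [[2.0]]                 # standard deviations
stats_i = ([2.0], [[1.0]])       # (n, mu) for sequence i
stats_j = ([3.0], [[2.0]])       # (n, mu) for sequence j
k1 = fisher_kernel_option_1(stats_i, stats_j, world_means, sigmas)
k3 = fisher_kernel_option_3(stats_i, stats_j, world_means, sigmas)
```

With unit standard deviations the two options coincide, which is one way to read the near overlap of the curves: after feature normalization, the rescaling matters little.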

12.5.3 Beyond Fisher Kernels

The previous experimental results confirm that the main ingredient of the Fisher kernel is the Fisher score. The latter is based on a probabilistic model viewed through the log-likelihood function. We can depart from the original setup described above by using other models and/or scores. Some alternative approaches have already been investigated; for example, Wan and Renals (2005b) use scores based on a log-likelihood ratio between the world model and the adapted client model. We describe below a very simple modification of the scoring function that brings noticeable improvements in performance.


Table 12.1 EERs (the lower the better) on the development set of the NIST database, comparing the Fisher kernel (approximation 3) and the normalized Fisher kernel.

                     Fisher    Normalized Fisher
EER (%)              9.3       8.2
95% confidence       ±0.9      ±0.8
# Support Vectors    37        32

Normalized Fisher Scores

We saw in Section 12.2.2 that the scores used for classifying examples are normalized, in order to counterbalance the exponential decrease of likelihoods with sequence lengths. Using the normalized likelihood leads to the following Fisher-like kernel:

$$
\begin{aligned}
K(\bar{x}_i, \bar{x}_j) &= \frac{1}{T_i T_j} \, u_{\bar{x}_i}^T u_{\bar{x}_j} \\
 &= \sum_{m=1}^{M} \left( \frac{n_{i,m}}{T_i} \Sigma_m^{-2} (\mu_m^i - \mu_m^0) \right)^T \left( \frac{n_{j,m}}{T_j} \Sigma_m^{-2} (\mu_m^j - \mu_m^0) \right) .
\end{aligned}
$$
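The effect of the normalization can be seen on a toy case: repeating a sequence doubles both its effective counts and its length, so the normalized kernel is unchanged, whereas the unnormalized Fisher kernel would double. The sketch below assumes diagonal covariances and sequences summarized by hypothetical triples (n, mu, T).

```python
def normalized_fisher_kernel(stats_i, stats_j, world_means, sigmas):
    """Fisher-like kernel with each score divided by the sequence length T."""
    n_i, mu_i, t_i = stats_i
    n_j, mu_j, t_j = stats_j
    total = 0.0
    for m, (mu0, sg) in enumerate(zip(world_means, sigmas)):
        for d in range(len(mu0)):
            left = (n_i[m] / t_i) * (mu_i[m][d] - mu0[d]) / sg[d] ** 2
            right = (n_j[m] / t_j) * (mu_j[m][d] - mu0[d]) / sg[d] ** 2
            total += left * right
    return total


# Two components, one dimension, illustrative values only.
world_means = [[0.0], [0.5]]
sigmas = [[1.0], [2.0]]                        # standard deviations
short = ([2.0, 1.0], [[1.0], [-1.0]], 3)       # (n, mu, T)
doubled = ([4.0, 2.0], [[1.0], [-1.0]], 6)     # the same sequence played twice
other = ([1.0, 4.0], [[0.5], [0.0]], 5)

k_short = normalized_fisher_kernel(short, other, world_means, sigmas)
k_doubled = normalized_fisher_kernel(doubled, other, world_means, sigmas)
```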

Here also, one may consider several options for approximating the Fisher information matrix, but the results displayed in Figure 12.4 suggest it is not worth pursuing this road further. Table 12.1 and Figure 12.5 compare empirically the Fisher kernel (approximation 3) with the normalized Fisher kernel. Including a normalization seems to have a positive impact on the accuracy. Thus other kinds of scores should be explored.

GMM Supervector Linear Kernel

The Fisher kernel is a similarity based on the differences in the generation of examples. In this respect, it is related to the GMM Supervector Linear Kernel (GSLK) proposed by Campbell et al. (2006b). The GSLK approximates the Kullback-Leibler divergence that measures the dissimilarity between two GMMs, each one being obtained by adapting the world model to one example of the pair $(\bar{x}_i, \bar{x}_j)$. Hence, instead of looking at how a single generative process differs for each example of the $(\bar{x}_i, \bar{x}_j)$ pair, GSLK looks at the difference between pairs of generative models. The GSLK is given by:

$$ K(\bar{x}_i, \bar{x}_j) = \sum_{m=1}^{M} \left( \sqrt{\pi_m} \, \Sigma_m^{-1} \left( \tau_{i,m} \mu_m^i + (1 - \tau_{i,m}) \mu_m^0 \right) \right)^T \left( \sqrt{\pi_m} \, \Sigma_m^{-1} \left( \tau_{j,m} \mu_m^j + (1 - \tau_{j,m}) \mu_m^0 \right) \right) , $$

where $\tau_{i,m}$ is the adaptation factor for the mixture component m adapted with sequence $\bar{x}_i$, as defined in Equation (12.5). The MAP relevance factor r is chosen by cross-validation, as in GMM based text-independent speaker verification systems (Reynolds et al. 2000).
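The GSLK amounts to a plain scalar product between rescaled, MAP-adapted mean supervectors, as the sketch below illustrates. Equation (12.5) is not reproduced in this section, so the adaptation factor is assumed here to take the usual MAP form n / (n + r); this form, the function names and the toy values are assumptions made for the example.

```python
import math


def gslk_supervector(stats, world_means, sigmas, weights, relevance):
    """Stack sqrt(pi_m) * Sigma_m^{-1} * (tau*mu_m + (1 - tau)*mu0_m) over all components."""
    n, mu = stats
    vector = []
    for m, (mu0, sg) in enumerate(zip(world_means, sigmas)):
        tau = n[m] / (n[m] + relevance)  # assumed form of the adaptation factor
        for d in range(len(mu0)):
            adapted = tau * mu[m][d] + (1.0 - tau) * mu0[d]
            vector.append(math.sqrt(weights[m]) * adapted / sg[d])
    return vector


def gslk(stats_i, stats_j, world_means, sigmas, weights, relevance):
    """Linear kernel between the two supervectors."""
    v_i = gslk_supervector(stats_i, world_means, sigmas, weights, relevance)
    v_j = gslk_supervector(stats_j, world_means, sigmas, weights, relevance)
    return sum(a * b for a, b in zip(v_i, v_j))


# One component, one dimension, illustrative values only.
world_means = [[1.0]]
sigmas = [[2.0]]          # standard deviations
weights = [1.0]
relevance = 16.0
stats_i = ([16.0], [[3.0]])   # tau = 0.5: adapted mean halfway between 3.0 and 1.0
stats_j = ([0.0], [[5.0]])    # tau = 0.0: no frames, the adapted mean stays at the world mean
k = gslk(stats_i, stats_j, world_means, sigmas, weights, relevance)
```

A component that receives no frames keeps the world mean, so it contributes the same constant to every supervector.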


Figure 12.5 DET curves on the development set of the NIST database for Fisher kernel (approximation 3) and normalized Fisher kernel.

Table 12.2 EERs (the lower the better) on the development set of the NIST database, comparing GSLK and the normalized Fisher kernel.

                     GSLK      Normalized Fisher
EER (%)              7.9       8.2
95% confidence       ±0.8      ±0.8
# Support Vectors    34        32

The Fisher kernel and GSLK are somewhat similar scalar products. The most noticeable difference is that the Fisher similarity is based on the difference from the reference $\mu^0$, whereas the GSLK kernel above is based on a convex combination of the observations and the reference $\mu^0$ that has no obvious interpretation. Both are approximations of the KL divergence, as mentioned in Section 12.5.2. The difference is that GSLK compares two adapted distributions, whereas the Fisher kernel compares the world model to the model updated using the access data. Table 12.2 and Figure 12.6 compare empirically GSLK with the normalized Fisher kernel. There is no significant difference between GSLK and the normalized Fisher kernel.


Figure 12.6 DET curves on the development set of the NIST database for GSLK and normalized Fisher kernel.

12.6 Parameter Sharing

The text-independent speaker verification problem is actually a set of several binary classification problems, one for each client of the system. Although few positive examples are available for each client, the overall number of available positive examples may be large. Hence, techniques that share information between classification problems should be beneficial. We already mentioned such a technique: the MAP adaptation scheme that trains a single world model on a common data set, and uses it as a prior distribution over the parameters to train a GMM for each client. Here, the role of the world model is to bias each client model towards a reference speaker model. This bias amounts to a soft sharing of parameters. Additional parameter sharing techniques are now used in discriminant approaches. In the following, we discuss one of them, the Nuisance Attribute Projection (NAP).

12.6.1 Nuisance Attribute Projection

The Nuisance Attribute Projection (NAP) approach (Solomonoff et al. 2004) looks for a linear subspace such that similar accesses (that is, accesses coming from the same client or from the same channel, etc.) are close to each other. In order to avoid an obviously bad solution, the dimension of the target subspace is controlled by cross-validation. This transformation is learned on a large set of clients (similarly to learning a generic GMM in the generative approach). After this step is performed, a standard linear SVM is usually trained for each new client over the transformed access data. This approach provided very good performance in recent NIST evaluations.


More specifically, assume each access sequence $\bar{x}$ is mapped into a fixed-size feature space through some transformation $\Phi(\bar{x})$ such as the one used in the GLDS kernel. Let $W^c$ be a proximity matrix encoding, for each pair of accesses $(\bar{x}_i, \bar{x}_j)$, whether these sequences were recorded over the same channel ($W^c_{i,j} = 0$) or not ($W^c_{i,j} = 1$). The NAP approach then consists in finding a projection matrix $P^\star$ such that

$$ P^\star = \arg\min_{P} \sum_{i,j} W^c_{i,j} \, \| P (\Phi(\bar{x}_i) - \Phi(\bar{x}_j)) \|^2 \tag{12.19} $$

among orthonormal projection matrices of a given rank. Hence $P^\star$ minimizes the average difference between accesses from differing channels, in the feature space. Similarly, a second matrix $W^s$ could encode the fact that two accesses come from the same speaker. A combination of these two sources of prior knowledge could be encoded as follows:

$$ W = \alpha W^c - \gamma W^s , \tag{12.20} $$

with $\alpha$ and $\gamma$ hyper-parameters to tune, and $P^\star$ found by minimizing (12.19) with $W$ instead of $W^c$. As stated earlier, $P^\star$ is then used to project each access $\Phi(\bar{x})$ into a feature subspace where, for each client, a linear SVM is used to discriminate client and impostor accesses.

As shown in Table 12.3 and Figure 12.7, NAP brings significant improvement when combined with the GSLK kernel. On the other hand, the number of support vectors also grows significantly. A possible interpretation is that, once all accesses lie in the same space and are independent of the channel, more training impostors become good candidates for support vectors.

Table 12.3 EERs (the lower the better) on the development set of the NIST database, comparing an SVM classifier with GSLK with and without NAP (polynomial kernel of degree 3).

                     GSLK      GSLK with NAP
EER (%)              7.9       5.8
95% confidence       ±0.8      ±0.6
# Support Vectors    34        59

Although the approach has been shown to yield very good performance, we believe that there is still room for improvement, since $P^\star$ is not selected using a criterion that is directly related to the task. Minimizing the average squared distance between accesses of the same client (or accesses of different channels) is likely to help classification, but it would also be relevant to do something about accesses from different clients, such as moving them apart.
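The minimization in (12.19) can be sketched on a toy example. Under the assumption that the weights are non-negative, the objective equals the trace of P S P^T with S the weighted scatter of pairwise differences, so a projection of a given rank minimizes it by discarding the leading eigenvectors of S. The sketch below removes a single nuisance direction, found by power iteration; the function names, the two-dimensional features and the choice of removing only one direction are illustrative assumptions, not the chapter's implementation.

```python
import math


def nuisance_scatter(features, weights):
    """S = sum_ij W_ij (phi_i - phi_j)(phi_i - phi_j)^T."""
    dim = len(features[0])
    scatter = [[0.0] * dim for _ in range(dim)]
    for i, f_i in enumerate(features):
        for j, f_j in enumerate(features):
            diff = [a - b for a, b in zip(f_i, f_j)]
            for p in range(dim):
                for q in range(dim):
                    scatter[p][q] += weights[i][j] * diff[p] * diff[q]
    return scatter


def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector of a symmetric positive semi-definite matrix (power iteration)."""
    dim = len(matrix)
    v = [1.0 / (k + 1) for k in range(dim)]  # arbitrary non-symmetric start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[p][q] * v[q] for q in range(dim)) for p in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def remove_direction(feature, direction):
    """Apply P = I - v v^T to one feature vector (v has unit norm)."""
    dot = sum(a * b for a, b in zip(feature, direction))
    return [a - dot * b for a, b in zip(feature, direction)]


def nap_objective(features, weights):
    """Value of the sum in (12.19) for already projected features."""
    return sum(
        weights[i][j] * sum((a - b) ** 2 for a, b in zip(f_i, f_j))
        for i, f_i in enumerate(features)
        for j, f_j in enumerate(features)
    )


# Toy data: the first coordinate mostly encodes the channel, the second the speaker.
features = [[0.0, 1.0], [3.0, 1.1], [0.1, -1.0], [3.1, -0.9]]
channels = [0, 1, 0, 1]
weights = [[1.0 if ci != cj else 0.0 for cj in channels] for ci in channels]

direction = top_eigenvector(nuisance_scatter(features, weights))
projected = [remove_direction(f, direction) for f in features]
before = nap_objective(features, weights)
after = nap_objective(projected, weights)
```

On this toy data the removed direction is almost aligned with the first coordinate, and the speaker-related second coordinate is left nearly untouched.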


Figure 12.7 DET curves on the development set of the NIST database with GSLK with and without NAP.

12.6.2 Other Approaches

Another recent approach that goes in the same direction, and that obtains state-of-the-art performance similar to the NAP approach, is the Bayesian Factor Analysis approach (Kenny et al. 2005). In this case, one assumes that the mean vector of a client model is a linear combination of a generic mean vector, the mean vector of the available training data for that client, and the mean vector of the particular channel used in this training data. Once again, the linear combination parameters are trained on a large amount of access data, involving a large number of clients. While this approach is nicely presented theoretically (and obtains very good empirical performance), it still does not try to find the optimal parameters of client models and linear combination by taking into account the global cost function.

Another very promising line of research that has emerged in machine learning relates to the general problem of learning a similarity metric (Chopra et al. 2005; Lebanon 2006; Weinberger et al. 2005). In this setting, where the learning algorithm relies on the comparison of two examples, one can set aside some training examples to actually learn what would be a good metric to compare pairs of examples. Obviously, in the SVM world, this relates to learning the kernel itself (Crammer et al. 2002; Lanckriet et al. 2004).

In the context of discriminant approaches to speaker verification, none of these techniques has been tried, to the best of our knowledge. Using a large base of accesses for which one knows the correct identity, one could for instance train a parametric similarity measure that would assess whether two accesses are coming from the same person or not. That could be done efficiently by stochastic gradient descent using a scheme similar to the so-called Siamese neural network (Chopra et al. 2005) and a margin criterion with proximity constraints.

12.7 Is the Margin Useful for this Problem?

The scarcity of positive training examples in speaker verification explains the great improvements brought by parameter sharing techniques. In this section, we question whether this specificity also prevents large margin methods from improving upon simpler approaches. The K-Nearest Neighbors (KNN) algorithm (Duda and Hart 1973) is probably the simplest and best known non-parametric classifier. Instead of learning a decision boundary, decisions are computed on-the-fly for each test access, by using the k nearest labelled sequences in the database as “experts”, whose votes are aggregated to make up the decision on the current access. In the weighted KNN (Dudani 1986) variant, the votes of the nearest neighbors are weighted according to their distance to the query:

$$ f(\bar{x}_j) = \sum_{i=1}^{k} y_i w_i , \quad \text{with} \quad w_i = \begin{cases} 1 & \text{if } d(j,k) = d(j,1) \\ \dfrac{d(j,k) - d(j,i)}{d(j,k) - d(j,1)} & \text{otherwise,} \end{cases} \tag{12.21} $$

where the sum runs over the k neighbors of the query $\bar{x}_j$, $y_i \in \{-1, 1\}$ determines whether the neighbor's access is from a client ($y_i = 1$) or an impostor ($y_i = -1$), and $d(j, i)$ is the distance from $\bar{x}_j$ to its i-th neighbor.

One can then use kernels to define distances, as follows:

$$ d(i, j) = \sqrt{ K(\bar{x}_i, \bar{x}_i) - 2 K(\bar{x}_i, \bar{x}_j) + K(\bar{x}_j, \bar{x}_j) } , \tag{12.22} $$

but it is often better to normalize the data also in the feature space so that they have unit norm, as follows,

$$ K_{\mathrm{norm}}(\bar{x}_i, \bar{x}_j) = \frac{ K(\bar{x}_i, \bar{x}_j) }{ \sqrt{ K(\bar{x}_i, \bar{x}_i) \, K(\bar{x}_j, \bar{x}_j) } } , \tag{12.23} $$

which leads to the final distance measure used in the experiments:

$$ d_{\mathrm{norm}}(i, j) = \sqrt{ 2 - 2 \, \frac{ K(\bar{x}_i, \bar{x}_j) }{ \sqrt{ K(\bar{x}_i, \bar{x}_i) \, K(\bar{x}_j, \bar{x}_j) } } } . \tag{12.24} $$
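Equations (12.21) and (12.24) can be combined in a few lines. The sketch below uses a linear kernel on fixed-size vectors as a stand-in for the sequence kernels of this chapter; the function names and the toy distances are hypothetical.

```python
import math


def linear_kernel(a, b):
    """Stand-in kernel on fixed-size vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalized_distance(x_i, x_j, kernel):
    """d_norm of (12.24): distance between unit-normalized images in feature space."""
    cosine = kernel(x_i, x_j) / math.sqrt(kernel(x_i, x_i) * kernel(x_j, x_j))
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine))  # clamp guards against rounding


def weighted_knn_score(distances, labels, k):
    """Score of (12.21), given the distances from the query to every database access."""
    nearest = sorted(zip(distances, labels))[:k]
    d_first, d_last = nearest[0][0], nearest[-1][0]
    score = 0.0
    for d, y in nearest:
        weight = 1.0 if d_last == d_first else (d_last - d) / (d_last - d_first)
        score += y * weight
    return score


# Four database accesses: labels are +1 for the client and -1 for impostors.
distances = [0.1, 0.5, 0.9, 2.0]
labels = [1, -1, 1, -1]
score = weighted_knn_score(distances, labels, k=3)
```

The nearest neighbor always receives weight 1 and the k-th neighbor weight 0 (unless all k distances are equal), so the choice of k also sets the scale of the weighting.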

Table 12.4 and Figure 12.8 compare the normalized Fisher score with the NAP approach followed by either an SVM or a KNN. As can be seen, the KNN approach yields similar, if not better, performance than the SVM approach. Furthermore, KNN has several advantages over SVMs: there is no training session, and KNN can easily approximate posterior probabilities and does not rely on potentially constraining Mercer conditions to work. On the other hand, the test session might be longer, so the search for nearest neighbors needs to be implemented efficiently.


Figure 12.8 DET curves on the development set of the NIST database comparing Fisher normalized kernel with NAP for KNN and SVM.

Table 12.4 EERs (the lower the better) on the development set of the NIST database, comparing the Fisher normalized kernel with NAP (250) for KNN and SVM.

                     SVM       KNN
EER (%)              6.7       5.3
95% confidence       ±0.7      ±0.7
# Support Vectors    47        -

12.8 Comparing All Methods

As a final experiment, we have compared all the proposed approaches and now report the results on the evaluation set. Figure 12.9 compares a state-of-the-art diagonal GMM with an SVM using a GSLK kernel with NAP, and also with a KNN based on the normalized Fisher kernel with NAP. In this experiment, the following hyper-parameters were tuned according to the EER obtained on the development set:

• the number of neighbors K in the KNN approach was varied between 20 and 200, with optimal value: 100;
• the size of P, the transformed space in NAP for the GSLK kernel, was varied between 40 and 250, with optimal value: 64;
• the size of P, the transformed space in NAP for the Fisher kernel, was varied between 40 and 400, with optimal value: 250;
• the number of Gaussians in the GMM used for the GSLK and Fisher kernel approaches was varied between 100 and 500, with optimal value: 200;
• all other parameters of the state-of-the-art diagonal GMM baseline were taken from previously published experiments.

The GMM yields the worst performance, probably partly because no channel compensation method is used (while the others use NAP). KNN and SVM performances do not differ significantly, hence the margin does not appear to be at all necessary for speaker verification.

Figure 12.9 Expected Performance Curve (the lower, the better) on the evaluation set of the NIST database comparing GMM with T-norm, SVM with a GSLK kernel and NAP, SVM Fisher normalized kernel with NAP and KNN with a Fisher normalized kernel with NAP.


Table 12.5 Final results on the evaluation set of the NIST database.

               GMM      SVM GSLK NAP    KNN Normalized Fisher NAP
HTER (%)       10.2     5.4             5.5
95% Conf.      ±0.7     ±0.5            ±0.5

12.9 Conclusion

In this chapter, we have presented the task of text-independent speaker verification. We have shown that the traditional method to approach this task is through a generative approach based on Gaussian Mixture Models. We have then presented a discriminative framework for this task, and presented several recent approaches in this framework, mainly based on Support Vector Machines. We have presented various kernels adapted to the task, including the GLDS, GSLK and Fisher kernels. While many of the kernels in the literature were proposed in some heuristic way, including the GLDS and GSLK kernels, we have shown the relation between the principled polynomial kernel and the GLDS kernel, as well as the relation between the principled Fisher kernel and the GSLK kernel. We have then shown that in order for SVMs to perform at a state-of-the-art level, parameter sharing in one way or another was necessary. Approaches such as NAP or Bayesian Factor Analysis were designed for that purpose and indeed helped SVMs to reach better performance. Finally, we have questioned the main purpose of using SVMs, which maximize the margin in the feature space. We have tried instead a plain KNN approach, which yielded similar performance. This simple experiment suggests that future research should concentrate more on better modelling of the distance measure, rather than on maximizing the margin. A drawback of the current approaches is that they are made of various blocks (feature extraction, feature normalization, distance measure, etc.) which were all trained using separate ad-hoc criteria. Ultimately, a system that would train all these steps in a single framework to optimize the final objective should perform better, but more research is necessary to reach that goal.

In order to foster more research in this domain, an open-source version of the C++ source code used to perform all experiments described in this chapter has been made available at http://speaker.abracadoudou.com.

References

Bengio S and Mariéthoz J 2004 A statistical significance test for person authentication Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 237–240.
Bengio S, Mariéthoz J and Keller M 2005 The expected performance curve International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning.
Boughorbel S, Tarel JP and Fleuret F 2004 Non-Mercer kernel for SVM object recognition British Machine Vision Conference.
Burges C 1998 A tutorial on support vector machines for pattern recognition. Knowledge Discovery and Data Mining.
Campbell W 2002 Generalized linear discriminant sequence kernels for speaker recognition Proc IEEE International Conference on Audio Speech and Signal Processing, pp. 161–164.
Campbell W, Campbell J, Reynolds D, Singer E and Torres-Carrasquillo P 2006a Support vector machines for speaker and language recognition. Computer Speech and Language 20(2-3), 125–127.
Campbell W, Sturim D and Reynolds D 2006b Support vector machines using GMM supervectors for speaker verification. Signal Processing Letters, IEEE 13(5), 308–311.
Chopra S, Hadsell R and LeCun Y 2005 Learning a similarity metric discriminatively, with application to face verification Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
Crammer K, Keshet J and Singer Y 2002 Kernel design using boosting Advances in Neural Information Processing Systems, NIPS.
Duda RO and Hart PE 1973 Pattern Classification and Scene Analysis. Wiley, New York.
Dudani SA 1986 The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics 6(4), 325–327.
Gauvain JL and Lee CH 1994 Maximum a posteriori estimation for multivariate Gaussian mixture observation of Markov chains IEEE Transactions on Speech Audio Processing, vol. 2, pp. 291–298.
Jaakkola T and Haussler D 1998 Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing 11, 487–493.
Joachims T 2002 Learning to Classify Text using Support Vector Machines. Kluwer Academic Publishers, Dordrecht, NL.
Kenny P, Boulianne G and Dumouchel P 2005 Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing.
Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE and Jordan MI 2004 Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, JMLR 5, 27–72.
Lebanon G 2006 Metric learning for text documents. IEEE Transaction on Pattern Analysis and Machine Intelligence, PAMI 28, 497–508.
Mariéthoz J and Bengio S 2007 A kernel trick for sequences applied to text-independent speaker verification systems. Pattern Recognition. IDIAP-RR 05-77.
Martin A, Doddington G, Kamm T, Ordowski M and Przybocki M 1997 The DET curve in assessment of detection task performance Proceedings of Eurospeech’97, Rhodes, Greece, pp. 1895–1898.
Melin H, Koolwaaij J, Lindberg J and Bimbot F 1998 A comparative evaluation of variance flooring techniques in HMM-based speaker verification ICSLP 1998, pp. 1903–1906.
Pontil M and Verri A 1998 Support vector machines for 3-d object recognition. IEEE Transaction PAMI 20, 637–646.
Reynolds DA, Quatieri TF and Dunn RB 2000 Speaker verification using adapted Gaussian mixture models. Digital Signal Processing.
Solomonoff A, Quillen C and Campbell W 2004 Channel compensation for SVM speaker recognition Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 57–62.
Van Trees HL 1968 Detection, Estimation and Modulation Theory, vol. 1. Wiley, New York.
Vapnik VN 2000 The nature of statistical learning theory second edn. Springer.
Wan V and Renals S 2003 Support vector machine speaker verification methodology IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pp. 221–224.
Wan V and Renals S 2005a Speaker verification using sequence discriminant support vector machines. IEEE Transactions on Speech and Audio Processing 13(2), 203–210.
Wan V and Renals S 2005b Speaker verification using sequence discriminant support vector machines. IEEE Transactions on Speech and Audio Processing 12(2), 203–210.
Weinberger KQ, Blitzer J and Saul LK 2005 Distance metric learning for large margin nearest neighbor classification Advances in Neural Information Processing Systems, NIPS.
