Speaker Recognition in Adverse Conditions
Ananth N. Iyer, Uchechukwu O. Ofoegbu, Robert E. Yantorno
Speech Processing Lab, Temple University, Philadelphia, PA 19122, USA
{aniyer,uche1,byantorn}@temple.edu
Stanley J. Wenndt
Air Force Research Laboratory/IFEC, Rome, NY 13441-4505, USA
[email protected]

Abstract— Recognizing speakers from their voices is a challenging area of research with several practical applications. Presently, speaker verification (SV) systems achieve a high level of accuracy under ideal conditions, such as when there is ample data to build speaker models and when verification is performed in the presence of little or no interference. In general, these systems assume that the features extracted from the data follow a particular parametric probability density function (pdf), i.e., a Gaussian or a mixture of Gaussians; a form of the pdf is thus imposed on the speech data rather than determined from the underlying structure of the data. In practical conditions, such as an aircraft cockpit where most verbal communication consists of short commands, it is almost impossible to ascertain that the assumptions made about the structure of the pdf are correct, and wrong assumptions can lead to a significant reduction in SV performance. In this research, non-parametric strategies for statistically modeling speakers are developed and evaluated. Non-parametric density estimation methods are generally known to be superior when limited data are available for model building and verification. Experimental evaluation has shown that the non-parametric system yielded a 70% accuracy level in speaker verification with only 0.5 seconds of data and under the influence of noise at a signal-to-noise ratio of 5 dB. This result corresponds to a 20% decrease in error when compared to the parametric system.

TABLE OF CONTENTS

1. Introduction
2. Background
3. GMM-Based Modeling
4. Non-parametric Modeling
5. Dimensionality of Feature Space
6. Experimental Evaluation
7. Conclusions

1. INTRODUCTION

The speaker verification problem can be stated as follows. Given speech data and a claimed identity, the task is to determine whether the speech data is sufficiently similar to the reference speaker model associated with the claimed identity to accept the claim. In other words, the task is to determine whether or not the speech data was generated by the reference model. The approach adopted to perform SV, in its most general form (based on a distance measure), is depicted by the block diagram presented in Figure 1.

Figure 1. Block diagram representing the general setup of a speaker verification system.

The decision logic block in Figure 1 usually involves determination of an optimal threshold for the distance measure. The choice of threshold is directly related to the operating error rate of the SV system. It can be shown that SV has a well-defined error rate that is independent of the population size, i.e., the number of reference models [1], [2]. A formal treatment of the error rates, based on simulations using normally distributed speaker models, was presented by Doddington [3]. Possible applications of SV include banking and credit authorizations, access to secure facilities or information, and carrying out transactions from remote locations by telephone.

Parametric vs. Non-parametric Modeling

The central theme of this research is the use of non-parametric methods to represent a speaker's voice. There are some significant differences between parametric and non-parametric modeling. Parametric modeling assumes a functional representation of the underlying distribution of the speaker feature vectors, and hence imposes a specific structure on the probability density function. For example, the structure assumed by the Gaussian model is that the data are unimodal and central; i.e., the pdf has only one maximum, and the probability of occurrence of a data sample decreases uniformly with increasing distance from the mean value. The structure of the parametric function is usually controlled by a set of parameters, and the number of parameters is directly proportional to the complexity of the model, as well as to the capability of relaxing the imposed structure [4]. The advantage of parametric modeling lies in the fact that a succinct representation of the data can be obtained, which is manifested by the efficient use of data (with a well-defined optimality criterion) for parameter estimation. Furthermore, the evaluation of test data (testing) can be performed through statistical summaries of the data, namely the mean vector and covariance matrix, rather than the data itself.

The notion of optimality in parametric modeling does not translate well to non-parametric methods. For example, the histogram, which is a basic non-parametric data representation, might prove (asymptotically) to be an inadmissible estimator. However, the histogram is generally superior for small data sizes regardless of its asymptotic properties. Estimation of parametric models with small data sizes can become highly biased and hence ineffective. Furthermore, parametric density estimators tend to become inefficient under even small perturbations of the assumptions of the underlying model. The emphasis in robust estimation procedures is to sacrifice a small percentage of optimality in order to achieve greater insensitivity to model misspecification [5], [6]. However, in most situations, only vague prior information on an appropriate model is available. The use of non-parametric models conveniently eliminates the requirement for model specification. The loss of optimality is balanced by alleviating the risk of misrepresenting the data with an inappropriate model.

Based on the above discussion, it can be inferred that the use of non-parametric models is desirable when adverse operating conditions exist. In ideal conditions, and when the underlying distribution function is known, a parametric model may instead be preferable. In this research, the SV system is developed to operate under high levels of noise interference and with restricted amounts of data available for testing. Therefore, a non-parametric approach to represent speakers and perform speaker verification is proposed.

The rest of the article is organized as follows. Section 2 provides a detailed background on speaker verification theory, including the definition of the error types and a method to determine the optimal SV threshold. Sections 3 and 4 present the modeling of speakers by the parametric (traditional) and the non-parametric (proposed) approaches, respectively. A discussion of the dimensionality of the feature space constructed by speech features is presented in Section 5. Experimental evaluation of the proposed system and comparison with the traditional system is presented in Section 6. Conclusions are drawn and future research directions are indicated in Section 7.

1-4244-0525-4/07/$20.00 © 2007 IEEE. IEEEAC paper #1247, Version 3, updated October 25, 2006.

2. BACKGROUND

Speaker verification requires a binary decision, i.e., either accept or reject the claimed identity of a person [7], [8]. Speaker verification can be performed in two ways, based on the content of the speech data: text-dependent and text-independent. Text-dependent SV is a restricted problem, where the training and testing data are required to be the same, i.e., a registered password or pass-phrase. Several methods that take the temporal information of the speech into account to perform verification have been proposed [9], [10], [11]. Text-independent SV, on the other hand, does not depend on the content of the speech data, and a statistical approach is adopted both to model speakers and to perform verification [7], [12].

Testing in SV refers to the decision to either accept or reject the claimed identity. Let X represent the collection of feature vectors obtained from the test data, and let k be the claimed speaker identity, which has a corresponding reference model Ck. The verification decision is given as follows:

• Accept speaker k, if d(Ck, X) ≤ τk, and
• Reject speaker k, if d(Ck, X) > τk,

where τk is the verification threshold and d(Ck, X) represents some "distance" measure between the test data and the reference model Ck. The distance measure is computed by determining the likelihood of the test data being generated by the reference model:

d(Ck, X) = − log p(X | Ck).   (1)
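To make the decision rule concrete, the following minimal Python sketch (illustrative only; the reference model, the feature matrix, and the threshold value are hypothetical placeholders, not artifacts of the paper) scores a test utterance against a claimed speaker's model using the negative log-likelihood of Equation (1) and applies the threshold test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # stand-in reference model; any density model with score() works


def verify(reference_model: GaussianMixture, test_features: np.ndarray, tau: float) -> bool:
    """Accept the claimed identity if the distance d(Ck, X) is at most the threshold tau.

    d(Ck, X) = -log p(X | Ck) is approximated here by the negative average
    per-frame log-likelihood, so that tau does not depend on utterance length.
    """
    avg_log_likelihood = reference_model.score(test_features)  # mean log p(x_t | Ck) over frames
    distance = -avg_log_likelihood
    return distance <= tau


# Hypothetical usage: X_test is an (n_frames, n_features) array of feature vectors.
# accept = verify(claimed_speaker_model, X_test, tau=35.0)
```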

The estimation of the pdf p(X | Ck) can be achieved in either a parametric or a non-parametric form; these two approaches are described in Sections 3 and 4, respectively. The verification threshold can be speaker-dependent; however, in this research it is assumed to be the same for all speakers, i.e., τk = τ, ∀k.

Determination of the optimal verification threshold τ is very important, as the accuracy of the SV system is directly related to the threshold. Consider a typical SV setup (as in Figure 1) and the following two situations: (i) the test data truly belongs to the claimant (referred to as the customer), and (ii) the test data does not belong to the claimant (the imposter). Let d represent a random variable measuring the distance between the test data and the reference model of the claimant, and let p(d|ωs) and p(d|ωd) be the conditional probability density functions of this random variable given the source of the test data, i.e., customer or imposter. A graphical representation of the two pdfs along with the threshold is presented in Figure 2.

Figure 2. Hypothetical conditional pdfs of the distance based on the test data belonging to the customer, p(d|ωs), or the imposter, p(d|ωd).

Application of a threshold to the verification decision gives rise to two kinds of errors:

• Type I error, also known as a "false alarm" or "false positive": the error caused by acceptance of an imposter. This error occurs when the distance from the test data is lower than the threshold even though the test data does not belong to the claimant. Several situations can cause a Type I error, such as excessive background noise or voice data recorded from an impersonator. The Type I error is computed from the pdfs as: P_τ^I = ∫_{−∞}^{τ} p(x|ωd) dx.

• Type II error, also known as a "miss" or "false negative": the error of rejecting a valid customer. In other words, the distance from the test data to his/her reference model is greater than the threshold. A Type II error can occur due to background conditions or as a result of a change in the speaker's voice, e.g., because of a cold or cough. The Type II error is computed as: P_τ^II = ∫_{τ}^{∞} p(x|ωs) dx.

One can imagine that an appropriate threshold is the one at which the two kinds of error are equal, i.e., the equal error probability (EEP). The EEP itself serves as a measure of separability between the customer and imposter distance pdfs, since it gives the lowest attainable classification error; the accuracy of the SV system is inversely related to the EEP. An analytical estimate of the EEP can be determined if the distributions of the distances are known or assumed. The distances, however, typically have non-Gaussian distributions, so determining the threshold analytically is quite difficult or sometimes impossible. Hence, a numerical approach is adopted to compute the EEP and the verification threshold. The EEP can be computed by solving

∫_{−∞}^{τ} p(d|ωd) dd = ∫_{τ}^{∞} p(d|ωs) dd   (2)

with respect to τ, which is the optimal verification threshold (also known as the "equal error threshold"). The integral equation is solved numerically, so there is no need to assume parametric forms for the distributions p(d|ωs) and p(d|ωd). The integrals in Equation (2) are graphically represented by the shaded regions in Figure 2.

Speaker Features

In speech processing applications, Mel-Frequency Cepstral Coefficients (MFCC) are commonly used to analyze speech signals. This method is based on a filter bank whose center frequencies are spaced on a mel scale. The filters in the filter bank have a triangular frequency response, as shown in Figure 3.

Figure 3. Frequency response of the Mel-scaled filter bank.

The computation of the MFCC coefficients is shown as a block diagram in Figure 4. It should be pointed out that the computation is performed on short-time, non-overlapping windows of length 30 ms.

Figure 4. Determination of the MFCC coefficients.
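As a rough illustration of the processing chain in Figure 4, the sketch below computes MFCC-like coefficients with NumPy/SciPy on 30 ms non-overlapping windows: windowing, magnitude spectrum, a triangular mel-spaced filter bank, log compression, and a DCT. The filter-bank size, number of coefficients, and sampling rate are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.fft import dct


def mel(f):
    """Hz to mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_inv(m):
    """Mel scale back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with center frequencies equally spaced on the mel scale."""
    edges = mel_inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc(signal, sr, n_filters=26, n_coeffs=13, win_dur=0.030):
    """MFCCs from non-overlapping 30 ms windows (illustrative parameter choices)."""
    win = int(sr * win_dur)
    n_frames = len(signal) // win
    frames = signal[: n_frames * win].reshape(n_frames, win) * np.hamming(win)
    spectrum = np.abs(np.fft.rfft(frames, n=win, axis=1))            # magnitude spectrum per frame
    fbank_energies = spectrum @ mel_filterbank(n_filters, win, sr).T  # mel filter-bank energies
    log_energies = np.log(fbank_energies + 1e-10)                     # avoid log(0)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]


# Hypothetical usage: features = mfcc(audio_samples, sr=8000)
```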

3. GMM-BASED MODELING

The mixture of Gaussians, or Gaussian Mixture Model (GMM), is one of the most enduring and widely used models in applied statistics. It has been the probability distribution of choice in many applications, as it finds theoretical corroboration in the central limit theorem. In the case of speaker modeling, the sample data are thought of as originating from various sources (each representing an acoustic unit), and the data from each source are modeled by a Gaussian [13]. In general, the parametric form of the GMM is given by a weighted sum of m Gaussians:

p(x) = Σ_{i=1}^{m} αi pi(x | µi, Σi),   (3)

where the αi are the mixture weights, and µi and Σi are the mean vectors and covariance matrices of the Gaussian component densities pi, given by:

pi(x) = (2π)^{−p/2} |Σi|^{−1/2} exp{ −(1/2) (x − µi)^T Σi^{−1} (x − µi) }.   (4)
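As a small illustration of Equations (3) and (4), the sketch below evaluates a GMM density at a set of points using SciPy; the weights, means, and covariances are made-up example values, not parameters from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal


def gmm_pdf(x, weights, means, covs):
    """Evaluate p(x) = sum_i alpha_i * N(x; mu_i, Sigma_i) at each row of x (Equation 3)."""
    return sum(a * multivariate_normal(mean=m, cov=c).pdf(x)
               for a, m, c in zip(weights, means, covs))


# Example with m = 2 components in p = 2 dimensions (illustrative values only).
weights = np.array([0.4, 0.6])
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
x = np.array([[0.5, 0.2], [2.8, 1.1]])
print(gmm_pdf(x, weights, means, covs))  # mixture density at each point
```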

Parameter Estimation

Several techniques have been proposed to estimate the parameters of the m-Gaussian model; among them, the most popular appears to be the Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative method for obtaining a maximum-likelihood estimate (MLE) of the parameters of the GMM, and was formalized by Dempster et al. [14]. EM is a local search heuristic, designed to converge to a local maximum of the likelihood function. A detailed description of the algorithm and its convergence properties is presented in the survey article [15] and in [16].

To begin, let X = {xk; k = 1, 2, ..., N} be the set of feature vectors, assumed to be generated by the density function given in Equation (3), and let θ represent the parameters αi, µi, and Σi. EM can be used to estimate the unknown parameter set θ given the observed data X; however, it also requires a "hidden" nuisance variable J, which indicates the association of each data sample xk with a Gaussian component pi(x). The EM algorithm can be summarized by the following two steps [17]. Start with an arbitrary initial estimate θ^(0) of the unknown parameter θ, and then iterate:

1. E-Step: compute U(θ′, θ^(j)) ≜ E{ log p(X, J; θ′) | X; θ^(j) } as a function of θ′.
2. M-Step: compute θ^(j+1) = arg max_{θ′} U(θ′, θ^(j)).

The two steps are iterated until θ ceases to change, or changes so slowly that an imposed convergence criterion is met. In other words, EM alternates between estimating the unknown θ and the hidden variable J.

For the GMM, the E-step and the M-step are performed in each iteration using the following update formulas:

αi^(j+1) = (1/N) Σ_{k=1}^{N} p(i | xk, θ^(j)),   (5)

µi^(j+1) = [ Σ_{k=1}^{N} p(i | xk, θ^(j)) xk ] / [ Σ_{k=1}^{N} p(i | xk, θ^(j)) ],   (6)

Σi^(j+1) = [ Σ_{k=1}^{N} p(i | xk, θ^(j)) (xk − µi^(j))(xk − µi^(j))^T ] / [ Σ_{k=1}^{N} p(i | xk, θ^(j)) ],   (7)

and the a posteriori probability is given by:

p(i | xk, θ^(j)) = αi pi(xk) / Σ_{n=1}^{m} αn pn(xk).   (8)
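A compact NumPy/SciPy sketch of the update rules in Equations (5)-(8) is given below; the initialization, the number of components, and the convergence test are simplified illustrative choices, not the exact procedure used in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal


def em_gmm(X, m, n_iter=100, tol=1e-6, seed=0):
    """Fit an m-component GMM to X (N x p) with EM updates in the spirit of Eqs. (5)-(8)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    alphas = np.full(m, 1.0 / m)
    mus = X[rng.choice(N, m, replace=False)]                       # random samples as initial means
    sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(p) for _ in range(m)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior p(i | x_k, theta^(j)) for every sample and component (Eq. 8)
        dens = np.column_stack([
            alphas[i] * multivariate_normal(mus[i], sigmas[i]).pdf(X) for i in range(m)
        ])                                                          # shape (N, m)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances (Eqs. 5-7; updated means are used here)
        Nk = resp.sum(axis=0)                                       # effective counts per component
        alphas = Nk / N
        mus = (resp.T @ X) / Nk[:, None]
        for i in range(m):
            d = X - mus[i]
            sigmas[i] = (resp[:, i, None] * d).T @ d / Nk[i] + 1e-6 * np.eye(p)
        ll = np.log(dens.sum(axis=1)).sum()                         # data log-likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return alphas, mus, sigmas


# Hypothetical usage: alphas, mus, sigmas = em_gmm(mfcc_features, m=8)
```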

4. NON-PARAMETRIC MODELING

Non-parametric statistical methods for analyzing multivariate data, though seldom used, are highly flexible and robust tools for data analysis. In this section, non-parametric approaches to speaker modeling are introduced.

Why use Non-parametric Models?

Classical approaches to linear multivariate statistical methods are based on the analysis of only the first two moments, i.e., the mean vector and the covariance matrix. These methods rest on the inherent assumption that the underlying parametric model generating the data (feature vectors) is a multivariate Gaussian distribution. Parametric methods are powerful analysis techniques, even for data sets with a large number of dimensions. Further, parametric studies provide a clear inferential summary and a parsimonious representation of the data. However, in many practical problems, the second-order information (the first and second moments) is inadequate. For example, the χ² plots of feature vectors (MFCCs) computed from a speaker's data available in the HTIMIT speech corpus [18] are presented in Figure 5. One can clearly note that the features (represented as '+' symbols) deviate significantly from the black line, suggesting that the Gaussian model is not an appropriate model for the data. This observation can be further emphasized by considering just one dimension of the feature: the histogram of the second MFCC coefficient, along with its Gaussian approximation, is presented in Figure 6. Note that the MFCC distribution is highly skewed and does not fit the Gaussian function adequately. Sophisticated models or data transformations (such as transformations that emphasize the Gaussianity of a data set) may provide solutions to a certain extent. However, advanced models are generally associated with complex procedures for parameter estimation, as well as for comparing two models.

Figure 5. The χ² plots of MFCC features to study the goodness-of-fit with a Gaussian model (black line).

Figure 6. Histogram representation of the second coefficient of MFCC features and the Gaussian fit (black line).

The aforementioned issues suggest the use of non-parametric methods, where no assumption is made about the underlying probability distribution function of the data. Non-parametric approaches, even though more computationally expensive, are viable because a parametric model of similar complexity may involve hundreds of parameters.
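The χ² diagnostic described above can be produced with a few lines of NumPy/SciPy: squared Mahalanobis distances of the feature vectors are sorted and plotted against the corresponding χ² quantiles, so that data from a true multivariate Gaussian would fall close to the identity line. This is a generic sketch of the diagnostic, not the exact script used to produce Figure 5.

```python
import numpy as np
from scipy.stats import chi2


def chi_square_plot_points(X):
    """Return (theoretical chi-square quantiles, sorted squared Mahalanobis distances)."""
    N, p = X.shape
    mu = X.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - mu, C_inv, X - mu)   # squared Mahalanobis distances
    probs = (np.arange(1, N + 1) - 0.5) / N                # plotting positions
    return chi2.ppf(probs, df=p), np.sort(d2)


# Hypothetical usage: points deviating from the y = x line indicate non-Gaussian features.
# q, d2 = chi_square_plot_points(mfcc_features); plt.plot(q, d2, '+'); plt.plot(q, q, 'k-')
```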

Kernel Density Estimator

The basic kernel estimator in the univariate case may be written compactly as [19], [20]:

f̂(x) = (1/(Nh)) Σ_{i=1}^{N} K((x − xi)/h) = (1/N) Σ_{i=1}^{N} K_h(x − xi),   (9)

where K_h(t) = K(t/h)/h is called the kernel function. An appropriate choice of the kernel function leads the kernel estimator to smooth out the contribution of each observed data point over a local neighbourhood of that point. The contribution of a data point xi to the estimate at some point x* depends on the distance between xi and x*. The extent of this contribution also depends on the shape of the kernel function and on its width (bandwidth) h.

The kernel estimator defined in Equation (9) is valid only in the univariate case. A multivariate generalization appears in the form of the product kernel estimator, defined as:

f̂(x) = (1/(N h1 h2 ··· hp)) Σ_{i=1}^{N} Π_{j=1}^{p} K((xj − xij)/hj).   (10)

The same (univariate) kernel is used in each dimension, with a different bandwidth factor hj for each. The data xij come from the N × p matrix X (or Z), and the estimate is defined pointwise at x = [x1, x2, ..., xp]^T. From a geometric perspective, the estimate places a probability mass of size 1/N centered at each sample point. The Gaussian kernel is a commonly used kernel function [19], given by the expression:

K(u) = (1/√(2π)) exp(−u²/2).   (11)

The kernel function, along with an example of its use (via Equation (10)) for estimating a bivariate standard normal distribution from a set of 100 samples, is shown in Figure 7. It can be clearly seen that the Gaussian kernel produces a smooth density estimate. A disadvantage of the Gaussian kernel is that it does not have finite support; however, for practical purposes the probability values far from the data are so small that they can be safely ignored.

Figure 7. The kernel function (left) and the density estimate (right) using the Gaussian kernel.
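A direct NumPy transcription of the product kernel estimator in Equation (10) with the Gaussian kernel of Equation (11) is sketched below; the bandwidth choice in the usage comment is a simple rule-of-thumb assumption, not the one used in the paper.

```python
import numpy as np


def gaussian_kernel(u):
    """Univariate Gaussian kernel, Equation (11)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def product_kernel_pdf(x, data, h):
    """Product kernel density estimate at point x (Equation 10).

    data: (N, p) training samples; h: (p,) per-dimension bandwidths.
    """
    scaled = (x - data) / h                                    # (N, p) scaled differences
    return np.mean(np.prod(gaussian_kernel(scaled), axis=1)) / np.prod(h)


# Hypothetical usage with a rule-of-thumb bandwidth per dimension (assumption):
# h = 1.06 * train_features.std(axis=0) * len(train_features) ** (-1.0 / 5.0)
# p_hat = product_kernel_pdf(test_vector, train_features, h)
```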

5. DIMENSIONALITY OF FEATURE SPACE

In this section, the dimensionality of the feature space is determined. The notion of dimension relates directly to the number of independent basis vectors that can uniquely represent all points in the linear vector space constructed by the feature vectors. The feature vectors, however, are random variables, and hence the theory of linear systems is connected with random variables through the expectation operator. For example, let Xi, i = 1, 2, ..., p, represent random variables forming the random vector X = [X1, X2, ..., Xp]^T. The scalar product between two variables Xi and Xj is defined as:

⟨Xi, Xj⟩ = E{(Xi − X̄i)(Xj − X̄j)} ≡ Cov(Xi, Xj),   (12)

which is generally referred to as the covariance between the two random variables Xi and Xj; X̄i = E{Xi} is the mean value. The definition of orthogonality, i.e., x and y are orthogonal iff ⟨x, y⟩ = 0, translates in the random setting to Cov(Xi, Xj) = 0, which implies that the variables Xi and Xj are statistically independent (strictly, this holds only when Xi and Xj are jointly Gaussian). Similarly, the variance can be related to the square of the norm, i.e., ||Xi||² = E{(Xi − X̄i)²} ≡ Var(Xi). Arranging the covariances between all the random variables in a matrix leads to the covariance matrix, which helps relate the shape of a density function to well-known elliptical curves. The covariance matrix is symmetric and has the form:

      | Cov(X1, X1)  ···  Cov(X1, Xp) |
  C = |      ⋮        ⋱        ⋮      |      (13)
      | Cov(Xp, X1)  ···  Cov(Xp, Xp) |

Alternatively, the covariance matrix can easily be constructed by considering the outer product of the (mean-centered) random vector:

C = E{X ⊗ X} = E{X X^T}.   (14)
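The sample analogue of Equations (12)-(14) is straightforward: after removing the mean, the covariance matrix is the average outer product of the feature vectors. The short check below is illustrative only and simply confirms that this construction matches NumPy's built-in estimator.

```python
import numpy as np


def covariance_outer(X):
    """Sample covariance via Equation (14): C = E{ (X - mean)(X - mean)^T }."""
    Xc = X - X.mean(axis=0)            # remove the mean vector
    return (Xc.T @ Xc) / len(X)        # average of outer products x_k x_k^T


# Hypothetical check against the library estimator (population normalization):
# X = np.random.default_rng(0).normal(size=(1000, 23))
# assert np.allclose(covariance_outer(X), np.cov(X, rowvar=False, bias=True))
```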

The dimensionality of the feature space is directly related to the rank of the covariance matrix. However, since the covariance matrix is computed from a set of samples, it is usually of full rank, and so the numerical rank is used to determine the dimensionality of the feature space. The numerical rank can be computed from the eigenvalue decomposition of the covariance matrix, which can be written as:

C V = V Λ,   (15)

where V is an orthogonal eigenvector matrix (V^T V = I) and Λ = diag(λ1, λ2, ..., λp) is a diagonal matrix of eigenvalues. It should be pointed out that the eigenvalues represent the variances of the independent random variables obtained by transforming the data with the eigenvector matrix. The approach to estimating r = rank(C) from the eigenvalues is to choose a tolerance δ > 0 and regard r̂ as the numerical rank of C [21], [22] if the λi satisfy

λ1 ≥ ··· ≥ λ_r̂ > δ ≥ λ_{r̂+1} ≥ ··· ≥ λp.   (16)

The tolerance δ is usually chosen as a function of ||C||, e.g., δ = 10^{-3} ||C||. The dimensionality of the feature space constructed by the MFCC features was determined experimentally. Feature vectors were computed from speech data from all the speakers in the HTIMIT database, and the sample covariance matrix was estimated. Figure 8 shows the eigenvalues, as well as statistics such as the mean, standard deviation, and range across each dimension, computed from the 23-dimensional MFCC feature space. Note that the eigenvalues decrease exponentially with increasing dimension; a similar observation can be made for the data statistics.
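The numerical-rank criterion of Equations (15) and (16), with the tolerance δ = 10⁻³ ||C|| mentioned above, can be coded in a few lines; using the spectral norm (the largest eigenvalue) for ||C|| is an assumption made here for illustration.

```python
import numpy as np


def numerical_rank(C, rel_tol=1e-3):
    """Numerical rank of a covariance matrix from its eigenvalues (Eqs. 15-16)."""
    eigvals = np.linalg.eigvalsh(C)[::-1]       # real eigenvalues of symmetric C, sorted descending
    delta = rel_tol * eigvals[0]                 # delta = 1e-3 * ||C||, spectral norm assumed
    return int(np.sum(eigvals > delta))


# Hypothetical usage on a matrix of MFCC feature vectors:
# C = np.cov(mfcc_features, rowvar=False)
# r_hat = numerical_rank(C)
```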

Figure 8. Determination of the MFCC feature space dimension by performing eigenvalue (left) and statistical (right) analysis.

The numerical rank for the MFCC features was found to be 17. On careful observation of the eigenvalue curve and the standard deviations (right panel of Figure 8), it can be seen that a much lower dimension might be sufficient. Another, rather heuristic, school of thought is to determine the number of dimensions required to capture 90% of the variation [23]; in that case, the features have a dimension of 3. There is more than one way to determine the dimension, but the final goal is to have as low a dimension as possible. In this research, the determination of the dimension is based on the numerical rank, which yielded the best speaker verification performance.

6. EXPERIMENTAL EVALUATION

The performance of a speaker verification system can be presented using the receiver operating characteristic (ROC) curve [24]. Both the Type I and Type II errors are functions of the threshold τ, and the ROC curve is constructed by plotting the detection rate (or customer success probability) as a function of the false-alarm rate (imposter success probability). Note that the detection rate is given by PD = 1 − P_τ^II.

Figure 9 shows the ROC curves obtained for speaker verification experiments with 0.5 seconds of clean testing data, i.e., no interference from environmental noise. The experimental setup consisted of 384 speaker verification tests. In each test, a 'customer' distance and an 'imposter' distance were computed using the appropriate test data. The data for the experiments were obtained from the HTIMIT database, and care was taken to ensure that the training and testing data were disjoint. It can be clearly seen from the ROC curves in Figure 9 that the performance of the non-parametric system is higher than that of the GMM system. The GMM-based system yielded an equal error probability of 0.1745, whereas the non-parametric system resulted in an equal error probability of 0.1641.

Figure 9. Receiver operating characteristic curves comparing the performance of the GMM system (dotted) and the non-parametric system (solid) with speech data obtained in a clean environment.

Under the influence of environmental noise, modeled as Gaussian random variables, a significant improvement in speaker verification performance was achieved with the non-parametric system in comparison with the GMM-based system. The ROC curves for the speaker verification experiment performed with speech degraded by noise (signal-to-noise ratio of 5 dB) are shown in Figure 10. The equal error probabilities obtained were 0.3880 and 0.3099 for the GMM and the non-parametric systems, respectively. Note that the use of non-parametric modeling resulted in a 20% reduction in the equal error probability.

Figure 10. Receiver operating characteristic curves comparing the performance of the GMM system (dotted) and the non-parametric system (solid) with speech data obtained in the presence of noise at a signal-to-noise ratio of 5 dB.
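The ROC curves and equal error probabilities reported in this section can be computed from the two sets of distances with a short script like the one below; this is a generic sketch, and the threshold sweep and the way the equal-error point is located are assumptions rather than the exact procedure of the paper.

```python
import numpy as np


def roc_and_eep(customer_d, imposter_d, n_points=1000):
    """Sweep the threshold tau over the observed distances and return
    (imposter success probability, customer success probability, equal error probability)."""
    taus = np.linspace(min(customer_d.min(), imposter_d.min()),
                       max(customer_d.max(), imposter_d.max()), n_points)
    # Type I error: imposter accepted (distance at or below threshold)
    p1 = np.array([(imposter_d <= t).mean() for t in taus])
    # Type II error: customer rejected (distance above threshold)
    p2 = np.array([(customer_d > t).mean() for t in taus])
    eep = p1[np.argmin(np.abs(p1 - p2))]        # threshold where the two errors are (nearly) equal
    return p1, 1.0 - p2, eep


# Hypothetical usage: customer_d and imposter_d are arrays of d(Ck, X) values from the verification tests.
# fa, det, eep = roc_and_eep(customer_d, imposter_d)
# plt.plot(fa, det)   # ROC: customer success probability vs. imposter success probability
```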

7. CONCLUSIONS

The central theme of this article was to use a non-parametric speaker modeling method to represent speakers and to demonstrate its effectiveness for speaker verification in adverse conditions. Evaluation was performed under adverse, even unrealistically harsh, conditions, and it was shown that the non-parametric system performs better than the traditional GMM system. The verification performance reported in this article may seem low; however, it should be kept in mind that only 0.5 seconds of data were used for testing, and in general a larger amount of data will be available.

In conclusion, an effective approach to modeling a speaker's voice data was proposed and successfully evaluated. The non-parametric model finds use in a wide range of applications, including speaker identification, speaker clustering, speaker counting, and speaker adaptation for speech recognition. Future research includes the use of alternative kernel functions, such as the triangular, uniform, and Epanechnikov kernels, and determination of their effectiveness for speaker verification.

REFERENCES

[1] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 2, pp. 254-272, 1981.
[2] A. E. Rosenberg, "Automatic speaker verification: A review," Proceedings of the IEEE, vol. 64, no. 4, pp. 475-487, April 1976.
[3] G. R. Doddington, "Speaker recognition: Identifying people by their voices," Proceedings of the IEEE, vol. 73, no. 11, pp. 1651-1664, 1985.
[4] H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, vol. 11, no. 4, pp. 18-32, October 1994.
[5] P. J. Huber, Robust Statistics. New York: Wiley, 1981.
[6] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C. Cambridge University Press, 1990.
[7] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-García, D. Petrovska-Delacrétaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430-451, 2004.
[8] R. J. Mammone, X. Zhang, and R. P. Ramachandran, "Robust speaker recognition: A feature-based approach," IEEE Signal Processing Magazine, vol. 13, pp. 58-71, 1996.
[9] J. M. Naik and D. M. Lubensky, "A hybrid HMM-MLP speaker verification algorithm for telephone speech," in Proc. ICASSP, vol. 1, pp. 19-22, 1994.
[10] J. M. Naik, L. P. Netsch, and G. R. Doddington, "Speaker verification over long distance telephone lines," in Proc. ICASSP, vol. 1, pp. 524-527, 1989.
[11] D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Magazine, vol. 3, pp. 4-17, 1986.
[12] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[13] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
[15] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, no. 2, pp. 195-239, 1984.
[16] B. Lindsay, Mixture Models: Theory, Geometry, and Applications. Virginia: American Statistical Association, 1995.
[17] H. Stark and J. W. Woods, Probability and Random Processes with Applications to Signal Processing. Prentice Hall, 2002.
[18] D. A. Reynolds, "HTIMIT and LLHDB: Speech corpora for the study of handset transducer effects," in Proc. ICASSP, pp. 1535-1538, 1997.
[19] J.-N. Hwang, S.-R. Lay, and A. Lippman, "Nonparametric multivariate density estimation: A comparative study," IEEE Transactions on Signal Processing, vol. 42, no. 10, pp. 2795-2810, 1994.
[20] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley-Interscience, 1992.
[21] J. S. Bay, Fundamentals of Linear State Space Systems. McGraw-Hill, 1999.
[22] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: The Johns Hopkins University Press, 1996.
[23] B. F. J. Manly, Multivariate Statistical Methods, 2nd ed. Chapman & Hall, 1995.
[24] T. F. Quatieri, Discrete-Time Speech Signal Processing. Prentice Hall, 2001.

Ananth Iyer received the B.E. degree in electrical and electronic engineering from the University of Mysore, India, in 1997, and two M.S. degrees in electrical engineering from Temple University, Philadelphia, and the Pennsylvania State University, University Park, PA. He is currently pursuing a Ph.D. degree in engineering at Temple University. His research interests are in modeling for speaker identification and developing numerical methods for signal processing. Mr. Iyer is a member of the IEEE Signal Processing Society, the International Speech Communication Association, and the Acoustical Society of America.

Uchechukwu Ofoegbu received the B.S. and M.S. degrees in electrical engineering from Temple University, Philadelphia, in 2003 and 2005, respectively. She is currently pursuing a Ph.D. degree in engineering at Temple University. Her research interests include speaker discrimination for criminal activity detection applications and enhancing pre-college engineering education. Ms. Ofoegbu is a member of the American Society for Engineering Education, the Society of Women Engineers, and Eta Kappa Nu.

Robert Yantorno received the B.S.E.S. degree in 1970 from the University of Rhode Island, Kingston, Rhode Island, and the Ph.D. in Bioengineering in 1978 from the University of Pennsylvania. Dr. Yantorno joined the faculty of Electrical and Computer Engineering at Temple University in 1982, and is presently a Professor. He is past chair of the Philadelphia chapter of the IEEE Signal Processing Society; as chair he organized a number of workshops and a short course, as well as an all-day speech processing educational seminar. He has also served on the organizing committees of a number of national and international conferences. Dr. Yantorno is presently the Director of the Speech Processing Laboratory, and his research interests include intelligent signal and speech processing, speaker identification, speaker indexing, low-bit-rate speech coding, speech excitation, articulation modeling, and speech quality. He has been a consultant for AT&T Bell Labs in Murray Hill, New Jersey, and for the Speech Processing Lab of the Air Force Research Laboratory in Rome, New York. Dr. Yantorno has served as a reviewer for the IEEE Transactions on Speech Processing and the Journal of the Acoustical Society of America. He received the university-wide Lindback Award for Distinguished Teaching from Temple University, and the "Long-time Continuous Contributions to the Delaware Valley and the Philadelphia Section in Education, Research and Service" award from the Philadelphia Section of the IEEE. Dr. Yantorno is a member of the ASEE, IEEE, Tau Beta Pi, Phi Kappa Phi, and Sigma Xi.

Stanley Wenndt
