Improving Speaker Identification via Singular Value Decomposition Based Feature Transformer

Bibhu Prasad Mishra, Sandipan Chakroborty and Goutam Saha
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur, India, Kharagpur-721 302
Email: [email protected], [email protected], [email protected]
Telephone: +91-3222-283556/1470, FAX: +91-3222-255303

Abstract—State-of-the-art Speaker Identification (SI) systems use Gaussian Mixture Models (GMM) for modeling speakers' data. Using GMM, a speaker can be identified accurately even from a large number of speakers when the model complexity is large. However, lower order speaker models using GMM show poor accuracy, as fewer Gaussians are involved. In the SI context, not much attention has been paid to improving accuracies for lower order models, although they are used in real-time applications like hierarchical speaker pruning. In this paper, two different approaches are proposed using a Singular Value Decomposition (SVD) based Feature Transformer (FT) for improving accuracies, especially for lower order speaker models. The results show significant improvements over the baseline and are presented on two widely different public databases comprising more than 130 speakers.

I. INTRODUCTION

Any SI [1] system needs a robust acoustic feature extraction technique as a front-end block, followed by an efficient modeling scheme for a generalized representation of these features. Mel-frequency Cepstral Coefficients (MFCC) [2] have been widely accepted as such a front-end for a typical SI application, as they are less vulnerable to noise perturbation, show little session variability and are easy to extract. Usually, multiple features are extracted from a raw speech frame, forming a feature vector. When enough feature vectors are available for a speaker, a statistical model like GMM [3] is used to represent them in a multidimensional sense. GMM gives a good probabilistic approximation of a speaker's multidimensional data (multidimensional feature vectors), which makes reasonably accurate identification of a speaker possible even from a large database. The accuracy of an SI system increases with the model order, as a sufficient number of Gaussians (i.e., a higher order model) are involved to approximate the arbitrary Probability Density Function (pdf) of the feature vectors. Lower order speaker models find it difficult to approximate the pdf of the feature vectors, which results in poor performance at the time of blind testing. In the context of SI, not much attention has been paid to improving the performance of these lower order speaker models, although they are needed for real-time deployment of the system. Applications like hierarchical speaker pruning need lower order speaker models to prune out unlikely speakers from the whole speaker list. For example, if a lower

order speaker model could prune out a large number of speakers with greater confidence at the initial stage of classification, lower complexity is expected at the successive stage(s), which use a detailed model (i.e., a model with higher complexity). In addition, lower order speaker models could be used as stand-alone speaker models for an SI task if they can show nearly equal performance compared to speaker models with higher complexity. It is worth mentioning here that a lower order speaker model involves much lower time and space complexity, both in the training and testing phases, than a higher order model. In this paper, two different approaches are proposed using an SVD [4] based Feature Transformer (FT) [5] for improving lower order speaker models in the SI application. The feature vectors are transformed using a transformer, which has been chosen here as the Right Singular Vector Matrix (RSVM) derived from the original feature matrix. In the first proposition a common transformer matrix is chosen for all the speakers, while in the latter case a separate RSVM is used for each speaker in the database. The idea of transforming the feature vectors is motivated from the viewpoint of acoustic variability in an orthogonal sense, for which an FT is required to be orthogonal. Note that an RSVM acting as an FT is an orthogonal matrix, whose transpose is its inverse. Examples of such linear transformation techniques using an orthogonal transformation matrix can be found in [6], [7]. In those previous works, the transformation matrices were obtained using Linear Discriminant Analysis [8], which may fail when the data is multi-modal in nature. The results are presented on two widely different public databases comprising more than 130 speakers.
The experiments have been conducted with varying model orders, and the results show that both proposed techniques outperform the conventional SI system significantly, especially for the lower order models. The rest of the paper is organized as follows: Section II describes the necessary theoretical background for this study. Section III illustrates the proposed framework in detail. Section IV presents the experimental results with a detailed discussion. Finally, Section V draws the principal conclusions.

II. THEORETICAL BACKGROUND


A. Singular Value Decomposition

SVD [4], as an orthogonal decomposition, finds wide application in rank determination and inversion of matrices, as well as in the modeling, prediction, filtering and information compression of data sequences. SVD is closely related to the Karhunen-Loève transform (KLT) [9], the singular values being uniquely related to the eigenvalues, although the computational requirements of SVD are less than those of KLT. From a numerical point of view, SVD is extremely robust, and the singular values can be computed with greater accuracy than eigenvalues. SVD is popularly used for the solution of least squares problems; it offers an unambiguous way of handling rank deficient or nearly rank deficient least squares problems. SVD is also the most reliable method for detecting the rank of a matrix or the nearness of a matrix to loss of rank. Given any m × n matrix F, there exist an m × m real orthogonal matrix U, an n × n real orthogonal matrix V and an m × n diagonal matrix Sv, such that

    F = U Sv V^T,    Sv = U^T F V,    (1)

where the elements of Sv can be arranged in non-increasing order, that is, 1) for a nonsingular F, Sv = diag{s1, s2, . . . , sp}, p = min(m, n), with s1 ≥ s2 ≥ . . . ≥ sp > 0, or 2) for F of rank r, s1 ≥ s2 ≥ . . . ≥ sr > 0 and s(r+1) = s(r+2) = . . . = sp = 0. In other words, U^T U = U U^T = I, V^T V = V V^T = I, and

    Sv = | s1  0  ...  0  |
         | 0   s2 ...  0  |
         | .   .   .   .  |
         | 0   0  ...  sp |

for m > n = p.

The decomposition is called the singular value decomposition. The numbers s1 , s2 , s3 , . . . sp are the singular values (or principal values) of F. U and V are called the left and right singular vector matrices of F respectively. U and V can be expressed as

    U = [u1 u2 . . . ui . . . um],    (2)
    V = [v1 v2 . . . vi . . . vn],    (3)

where, for i = 1 to p, the m-column vector ui and the n-column vector vi, which correspond to the i-th singular value si, are called the i-th left singular vector and the i-th right singular vector, respectively.
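As a concrete illustration, the decomposition of eq. (1) and the orthogonality of V can be checked numerically. The sketch below assumes NumPy and a synthetic 100 × 19 feature matrix; both are illustrative stand-ins, not part of the original system.

```python
# Minimal numerical check of eq. (1): F = U Sv V^T with orthogonal U and V.
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((100, 19))        # hypothetical 100x19 feature matrix

# numpy.linalg.svd returns V transposed (Vt); singular values in s.
U, s, Vt = np.linalg.svd(F, full_matrices=False)
V = Vt.T

# Singular values come back in non-increasing order, as in Sec. II-A.
assert np.all(np.diff(s) <= 0)

# V is orthogonal: its transpose is its inverse.
assert np.allclose(V.T @ V, np.eye(19), atol=1e-10)

# F is recovered as U Sv V^T.
assert np.allclose(U @ np.diag(s) @ Vt, F, atol=1e-10)
```

The same V is the RSVM used later as the feature transformer in Section III.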

B. Mel-frequency Cepstral Coefficients

According to psychophysical studies, human perception of the frequency content of sounds follows a subjectively defined nonlinear scale called the Mel scale. This is defined as

    fmel = 2595 · log10(1 + f/700)    (4)

where fmel is the subjective pitch in Mels corresponding to f, the actual frequency in Hz. This leads to the definition of MFCC, a baseline [10] acoustic feature set for speech and speaker recognition applications, which can be calculated as follows. Let {y(n)}, n = 1, . . . , Ns, represent a frame of speech that is pre-emphasized and Hamming-windowed. First, y(n) is converted to the frequency domain by an Ms-point Discrete Fourier Transform (DFT), which leads to the energy spectrum

    |Y(k)|^2 = | Σ_{n=1}^{Ms} y(n) · e^{−j2πnk/Ms} |^2    (5)

where 1 ≤ k ≤ Ms. This is followed by the construction of a filter bank with Q unity-height triangular filters, uniformly spaced on the Mel scale (eqn. 4). The filter response ψi(k) of the i-th filter in the bank is defined as

    ψi(k) = 0                                      for k < kb(i−1)
            (k − kb(i−1)) / (kb(i) − kb(i−1))      for kb(i−1) ≤ k ≤ kb(i)
            (kb(i+1) − k) / (kb(i+1) − kb(i))      for kb(i) ≤ k ≤ kb(i+1)
            0                                      for k > kb(i+1)    (6)

where 1 ≤ i ≤ Q, Q is the number of filters in the bank, {kb(i)}, i = 0, . . . , Q+1, are the boundary points of the filters and k denotes the coefficient index in the Ms-point DFT. The filter bank boundary points {kb(i)} are equally spaced on the Mel scale, which is satisfied by the definition

    kb(i) = (Ms/Fs) · fmel^{−1}( fmel(flow) + i · {fmel(fhigh) − fmel(flow)} / (Q + 1) )    (7)

where the function fmel(·) is defined in eqn. 4, Ms is the number of points in the DFT (eqn. 5), Fs is the sampling frequency, flow and fhigh are the low and high frequency boundaries of the filter bank, and fmel^{−1} is the inverse of the transformation in eqn. 4, defined as

    f = fmel^{−1}(fmel) = 700 · (10^{fmel/2595} − 1)    (8)

The sampling frequency Fs and the frequencies flow and fhigh are in Hz, while fmel is in Mels. For both the databases considered in this work, Fs is 8 kHz. Ms was taken as 256, flow = Fs/Ms = 31.25 Hz, while fhigh = Fs/2 = 4 kHz. Next, this filter bank is imposed on the spectrum calculated in eqn. 5. The outputs e(i), i = 1, . . . , Q, of the Mel-scaled band-pass filters can be calculated by a weighted summation between

the respective filter response ψi(k) and the energy spectrum |Y(k)|^2 as

    e(i) = Σ_{k=1}^{Ms/2} |Y(k)|^2 · ψi(k)    (9)
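The filter bank construction of eqs. (4), (6) and (7) and the energy computation of eq. (9) can be sketched as follows, assuming NumPy. The parameter values (Fs = 8 kHz, Ms = 256, Q = 20) are those stated in the text; the test frame is synthetic, so this is an illustration rather than the authors' implementation.

```python
# Hedged sketch of the Mel filter bank, eqs. (4), (6), (7), and the
# filter-bank energies, eq. (9).
import numpy as np

Fs, Ms, Q = 8000, 256, 20
flow, fhigh = Fs / Ms, Fs / 2

fmel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)         # eq. (4)
fmel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)   # eq. (8)

# Boundary points kb_i, i = 0..Q+1, equally spaced in Mels -- eq. (7).
mels = np.linspace(fmel(flow), fmel(fhigh), Q + 2)
kb = np.round(Ms / Fs * fmel_inv(mels)).astype(int)

# Triangular responses psi_i(k) -- eq. (6) -- for k = 1 .. Ms/2.
k = np.arange(1, Ms // 2 + 1)
psi = np.zeros((Q, k.size))
for i in range(1, Q + 1):
    lo, c, hi = kb[i - 1], kb[i], kb[i + 1]
    rise = (k >= lo) & (k <= c)
    fall = (k > c) & (k <= hi)
    psi[i - 1, rise] = (k[rise] - lo) / (c - lo)
    psi[i - 1, fall] = (hi - k[fall]) / (hi - c)

# Filter-bank energies e(i) from an energy spectrum |Y(k)|^2 -- eq. (9);
# y is a synthetic pre-processed frame of Ns = 160 samples.
y = np.hamming(160) * np.random.randn(160)
Y2 = np.abs(np.fft.fft(y, Ms)[1:Ms // 2 + 1]) ** 2
e = psi @ Y2
```

The log of `e` then feeds the DCT of eq. (10) below.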

Finally, the Discrete Cosine Transform is taken on the log filter bank energies log[e(i)], i = 1, . . . , Q, and the final MFCC coefficients Cm can be written as

    Cm = sqrt(2/Q) · Σ_{l=0}^{Q−1} log[e(l + 1)] · cos( m · ((2l − 1)/2) · (π/Q) )    (10)

where 0 ≤ m ≤ R−1 and R is the desired number of cepstral features. Typically, Q = 20, and 10 to 30 cepstral coefficients are taken for speech processing applications. Here we took Q = 20, R = 20 and used the last 19 coefficients to model the individual speakers. Note that the first coefficient C0 is discarded because it contains only a d.c. term that signifies spectral energy.

C. Gaussian Mixture Model

Speaker recognition involves the state-of-the-art GMM [3] for a generalized representation of acoustic vectors, irrespective of their extraction process. A GMM can be viewed as a nonparametric, multivariate probability distribution model that is capable of modeling arbitrary distributions, and it is currently one of the principal methods of modeling speakers for SI systems. The GMM of the distribution of feature vectors for speaker s is a weighted linear combination of M uni-modal Gaussian densities b_i^s(x), each parameterized by a mean vector µ_i^s with a diagonal covariance matrix Σ_i^s. These parameters, which collectively constitute the speaker model, are represented by the notation λs = {p_i^s, µ_i^s, Σ_i^s}, i = 1, . . . , M. The p_i^s are the mixture weights satisfying the stochastic constraint Σ_{i=1}^{M} p_i^s = 1. For a feature vector x, the mixture density for speaker s is computed as

    p(x|λs) = Σ_{i=1}^{M} p_i^s b_i^s(x)    (11)

where

    b_i^s(x) = (1 / ((2π)^{D/2} |Σ_i^s|^{1/2})) · e^{ −(1/2) (x − µ_i^s)^T (Σ_i^s)^{−1} (x − µ_i^s) }    (12)

and D is the dimension of the feature space. Given a sequence of feature vectors X = {x1, x2, . . . , xT} for an utterance with T frames, the log-likelihood of speaker model s is

    Ls(X) = log p(X|λs) = Σ_{t=1}^{T} log p(xt|λs)    (13)

assuming the vectors to be independent for computational simplicity. For SI, the value of Ls (X) is computed for all speaker models λs enrolled in the system and the owner of the model that generates the highest value is returned as the identified speaker.
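The scoring rule of eqs. (11)-(13) and the argmax identification step can be sketched as follows, assuming NumPy. The two models below use illustrative hand-set parameters rather than trained ones, and `log_likelihood` is a hypothetical helper name.

```python
# Hedged sketch of diagonal-covariance GMM scoring, eqs. (11)-(13).
import numpy as np

def log_likelihood(X, weights, means, variances):
    """Ls(X) = sum_t log sum_i p_i b_i(x_t), with diagonal covariances."""
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]              # (T, M, D)
    # log of eq. (12) for every (frame, mixture) pair
    log_b = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                    + np.sum(np.log(variances), axis=1)
                    + D * np.log(2 * np.pi))              # (T, M)
    # eq. (11) inside a log, summed over frames as in eq. (13)
    frame_ll = np.log(np.sum(np.exp(log_b) * weights, axis=1))
    return frame_ll.sum()

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 19))          # synthetic test utterance
# Two toy 4-mixture models; the second is shifted away from the data.
models = [(np.full(4, 0.25),
           rng.standard_normal((4, 19)) + shift,
           np.ones((4, 19))) for shift in (0.0, 3.0)]
scores = [log_likelihood(X, *m) for m in models]
best = int(np.argmax(scores))              # identified speaker index
```

In practice the exp/log step would use a log-sum-exp trick for numerical safety; it is omitted here for brevity.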

III. PROPOSED FRAMEWORK

Often, transformation in the feature space is done by multiplying a transformation matrix with the feature matrix. By feature matrix, we mean a stacked version of all the feature vectors extracted from several frames of speech. Therefore, the number of rows of a feature matrix equals the number of vectors (or number of frames), while the number of columns equals the dimensionality. Note that the feature dimension is taken as 19 for the whole study.

A. Transformation Matrix using Stacked Version of Vector Quantized Outputs from Speaker Data (Prop-I)

In this scheme, the transformation matrix is obtained from the stacked version of vector quantized [11] outputs from the speaker data. Using the Linde-Buzo-Gray algorithm [11], representative vectors (code vectors) are obtained; these representative vectors are also used as the initial guess of the mean vectors for GMM. The common feature matrix F is formed by stacking the M code vectors from speaker 1, followed by the M code vectors of speaker 2, and so on up to speaker S. Therefore, the row dimension m of F equals S × M, where S is the total number of speakers. However, for a fixed number of speakers, the row dimension of F may vary with the initial selection of the model order, which decides the parameter M. The matrix F coarsely represents the whole corpus with equal evidence from the different speakers. Its size is small in comparison with the concatenated version of all feature vectors from all speakers in the database. Another useful property of F is that it holds the most representative data from each speaker, allowing the SVD operator to yield a useful transformation matrix. The transformed feature matrix is obtained by multiplying the right singular vector matrix V with the original feature matrix of each speaker. This is given by

    F̂spk = Fspk × V    (14)

With this, the same transformation matrix is used to transform the initial seed mean vectors obtained after vector quantization of each speaker's data (see Fig. 1(a)). The columns of V are the right singular vectors (eigenvectors of F^T F) corresponding to the singular values in (1). Multiplication by these vectors orthogonalizes the feature vectors along the directions present in the transformation matrix V. At the time of testing, the same transformation matrix, i.e., V, is used to transform the test feature vectors irrespective of the speaker model involved (see Fig. 1(a)). If X is the set of unknown test feature vectors, then after transformation the transformed test feature vectors are

    X̂ = X × V    (15)
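Prop-I (eqs. 14-15) reduces to two matrix products once V is available. In the sketch below, random matrices stand in for the LBG codebooks and the feature matrices; the dimensions (S speakers, M code vectors, D = 19) follow the text, but the data is synthetic.

```python
# Hedged sketch of Prop-I: one common RSVM V from the stacked VQ codebooks.
import numpy as np

rng = np.random.default_rng(2)
S, M, D = 5, 8, 19                        # speakers, code vectors, dimension

# Stand-ins for the LBG code vectors of each speaker.
codebooks = [rng.standard_normal((M, D)) for _ in range(S)]

F = np.vstack(codebooks)                  # (S*M) x D stacked matrix F
_, _, Vt = np.linalg.svd(F, full_matrices=False)
V = Vt.T                                  # common right singular vector matrix

F_spk = rng.standard_normal((200, D))     # one speaker's feature matrix
F_hat = F_spk @ V                         # eq. (14)

X = rng.standard_normal((40, D))          # unknown test vectors
X_hat = X @ V                             # eq. (15)
```

Prop-II (eqs. 16-17, below) differs only in computing a separate V_spk by the same SVD on each speaker's own feature matrix. Because V is orthogonal, the transform is a pure rotation of the feature space.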

B. Transformation Matrix using Original Feature Matrix from each Speaker (Prop-II)

In this approach, a transformation matrix is calculated for each speaker separately. SVD is applied to the feature matrix of every speaker and the corresponding V is obtained (see Fig. 1(b)). The transformed feature matrix for each speaker is obtained similarly, by multiplying his/her features with the corresponding V. This can be shown as

    F̂spk = Fspk × Vspk    (16)

where the suffix 'spk' denotes the index of a speaker in the corpus. In this proposed scheme, no VQ stack preparation is required, and the calculated RSVMs are expected to be more speaker specific in nature, as SVD is applied to each speaker's data individually. At the time of testing, the unknown set of test feature vectors is transformed speaker-wise, according to the particular speaker model to which those test vectors are to be sent. Therefore, for every speaker, a multiplication of the speaker-specific RSVM with the unknown test vectors is required. This can be shown as

    X̂ = X × Vspk    (17)

where the symbols have their usual meanings. The two systems are shown in Fig. 1(a) and 1(b), respectively.

C. Comparison between the proposed transformation techniques

The former is much more computationally efficient than the latter, at the time of training as well as testing. The main computational burden of the latter system is observed at the time of identification, where one matrix multiplication is required for every speaker, while the former needs only one such multiplication to yield transformed test vectors for all the speaker models. However, the latter system utilizes the original feature matrix while evaluating the RSVM, as compared to the former, which uses coarse feature vectors from the speakers. Therefore, it is expected that the latter system could outperform the former, as it can transform the unknown test vectors speaker-wise using speaker-specific transformation matrices. Another advantage of the latter system is that one does not have to recalculate the common transformation matrix V when speakers are added to or deleted from the corpus.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Pre-processing stage

In this work, each frame of speech is pre-processed by i) silence removal and end-point detection using an energy threshold criterion, followed by ii) pre-emphasis with a 0.97 pre-emphasis factor, iii) frame blocking with a 20 ms frame length, i.e., Ns = 160 samples/frame (ref. Sec. II) and 50% overlap, and finally iv) Hamming windowing.
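Steps ii)-iv) of this chain can be sketched as follows, assuming NumPy; the energy-based silence removal of step i) is elided, and `frames` is a hypothetical helper name. The parameters (Ns = 160 at 8 kHz, 50% overlap, 0.97 pre-emphasis) follow the text.

```python
# Hedged sketch of pre-emphasis, frame blocking and Hamming windowing.
import numpy as np

def frames(speech, Ns=160, overlap=0.5, alpha=0.97):
    # ii) pre-emphasis: y(n) = s(n) - alpha * s(n-1)
    emphasized = np.append(speech[0], speech[1:] - alpha * speech[:-1])
    # iii) frame blocking: Ns-sample frames with 50% overlap (hop = 80)
    hop = int(Ns * (1 - overlap))
    n_frames = 1 + (len(emphasized) - Ns) // hop
    # iv) Hamming windowing of each frame
    window = np.hamming(Ns)
    return np.stack([window * emphasized[i * hop: i * hop + Ns]
                     for i in range(n_frames)])

speech = np.random.randn(8000)            # one second of synthetic speech at 8 kHz
Y = frames(speech)                        # (99, 160) windowed frames
```

Each row of `Y` is one frame y(n) ready for the Ms-point DFT of eq. (5).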

B. Databases for the study

1) YOHO Database: The YOHO voice verification corpus [12] was collected while testing ITT's prototype speaker verification system in an office environment. Most subjects were from the New York City area, although there were many exceptions, including some non-native English speakers. A high-quality telephone handset (Shure XTH-383) was used to collect the speech; however, the speech was not passed through a telephone channel. There are 138 speakers (106 males and 32 females); for each speaker, there are 4 enrollment sessions of 24 utterances each and 10 test sessions of 4 utterances each. In this work, a closed-set text-independent speaker identification problem is attempted, where we consider all 138 speakers as client speakers. For a speaker, all 96 (4 sessions × 24 utterances) utterances are used for developing the speaker model, while for testing, 40 (10 sessions × 4 utterances) utterances are put under test. Therefore, for 138 speakers we put 138 × 40 = 5520 utterances under test and evaluated the identification accuracies. For this database, 32 Gaussians was the highest model order, because some speakers do not generate enough feature vectors to create the next higher order, i.e., 64 Gaussians.

2) POLYCOST Database: The POLYCOST database [13] was recorded as a common initiative within the COST 250 action during January-March 1996. It contains around 10 sessions recorded by 134 subjects from 14 countries. Each session consists of 14 items, two of which (MOT01 and MOT02 files) contain speech in the subject's mother tongue. The database was collected through the European telephone network. The recording was performed with ISDN cards on two XTL SUN platforms with an 8 kHz sampling rate. In this work, a closed-set text-independent speaker identification problem is addressed, where only the mother tongue (MOT) files are used. The specified guideline [14] for conducting closed-set speaker identification experiments is adhered to, i.e., 'MOT02' files from the first four sessions are used to build a speaker model, while 'MOT01' files from session five onwards are taken for testing. Unlike the YOHO database, not all speakers have the same number of sessions. Further, three speakers (M042, M045 and F035) are not included in our experiments, as they provide fewer than 4 sessions. A total of 754 'MOT01' utterances are put under test. As with the YOHO database, all speakers (131, after deletion of the three speakers) in the database were registered as clients. We restricted ourselves to 3 different mixture sizes for GMM, because the smaller number of feature vectors obtained from the POLYCOST database prevents the development of meaningful higher order GMMs.

C. Performance Evaluation

For any closed-set SI problem, identification accuracy is defined as in [3], and we have used the same:

    Percentage of Identification Accuracy (PIA) =
        (No. of utterances correctly identified / Total no. of utterances under test) × 100    (18)

D. Experimental Results

For each database, we evaluated the performance of an MFCC based system and the proposed systems.

1) Results for YOHO Database: The results (ref. Table I) show that the proposed systems outperform the baseline significantly for lower order models, i.e., M = 2, 4, 8. However, improvements can also be observed for models with higher complexities (M = 16, 32). Out of the two proposed systems

[Fig. 1. Proposed Feature Transformation Methods. (a) Feature Transformer applied on the feature matrix, where the transformer is obtained from the stacked version of vector quantized outputs across the speakers' data. (b) Feature Transformer applied on the feature matrix, where the transformer is obtained from the individual feature matrices of the speakers.]

the system indicated by 'Prop-II' outperforms the other variant (denoted by 'Prop-I') in most of the cases. For higher order models, Prop-II shows nearly equal performance in comparison with Prop-I.

TABLE I
COMPARATIVE PERFORMANCE USING MFCC AND PROPOSED SYSTEMS FOR THE YOHO DATABASE.

Model Order (M) | MFCC PIA (%) | Prop-I PIA (%) | Prop-II PIA (%)
       2        |    74.31     |     80.22      |     90.11
       4        |    84.86     |     88.39      |     91.94
       8        |    90.69     |     93.04      |     93.61
      16        |    94.20     |     95.33      |     95.25
      32        |    95.67     |     96.41      |     96.00

2) Results for POLYCOST Database: For this database, the results show a similar trend to that of the YOHO database. In this case, both proposed methods perform significantly better than the baseline over all model orders. However, the rate of improvement for the lowest model order (M = 2) is much greater than that of the higher model orders (i.e., M = 4, 8). From Table II it can also be observed that, by using Prop-II, one could achieve almost the same performance at a much lower model complexity as the baseline system with the highest model order, i.e., 8.

TABLE II
COMPARATIVE PERFORMANCE USING MFCC AND PROPOSED SYSTEMS FOR THE POLYCOST DATABASE.

Model Order (M) | MFCC PIA (%) | Prop-I PIA (%) | Prop-II PIA (%)
       2        |    63.93     |     69.89      |     77.59
       4        |    72.94     |     76.92      |     77.72
       8        |    77.85     |     79.05      |     81.83

V. CONCLUSION

In this paper, two different feature transformation schemes have been proposed based on SVD. Right Singular Vector Matrices have been chosen as the transformation matrix, which is

obtained after applying SVD on the speakers' data. The feature vectors are first transformed to the space spanned by the right singular vectors of the data matrix before being applied to the GMM. The two methods differ in the way the transformation matrix is derived. In one of the proposed methods, the transformation matrix is determined for each speaker in the database by applying SVD on his/her feature matrix, whereas in the other method the transformer is obtained from a stacked version of the VQ outputs obtained across the speakers. Results are presented over varying model orders with two different databases. The results show that both propositions outperform the baseline performance significantly for the lower order models. The results prove the superiority of our propositions irrespective of data type (either microphone or telephone) and amount of data. In this whole study, the modified GMM involves a linear transform of all the test vectors. However, this small calculation overhead has no noticeable influence on the response speed. Strictly speaking, for the same number of mixtures, there is no clear difference between the modified GMM and the standard GMM in

terms of both computational efficiency and response speed.

REFERENCES

[1] J.-C. Wang, C.-H. Yang, J.-F. Wang, and H.-P. Lee, "Robust Speaker Identification and Verification," IEEE Computational Intelligence Magazine, vol. 2, no. 2, May 2007.
[2] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 4, pp. 357-365, Aug. 1980.
[3] D. A. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, Jan. 1995.
[4] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, 1996, pp. 48-80.
[5] J. Tou and R. Gonzalez, Pattern Recognition Principles, London: Addison-Wesley, 1974, pp. 243-314.
[6] L. Liu and J. He, "On the use of orthogonal GMM in speaker recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 1999), 1999, vol. 2, pp. 845-848.
[7] R. Zhang and X. Ding, "Offline handwritten numeral recognition using orthogonal Gaussian mixture model," in Proc. Int. Conf. Image Processing, vol. 1, 2001, pp. 1126-1129.
[8] K. K. Paliwal, "Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer," Digital Signal Process., vol. 2, no. 3, pp. 157-173, Jul. 1992.
[9] Y. Hua and W. Liu, "Generalized Karhunen-Loève Transform," IEEE Signal Process. Lett., vol. 5, no. 6, pp. 141-142, Jun. 1998.
[10] D. J. Mashao and M. Skosan, "Combining Classifier Decisions for Robust Speaker Identification," Pattern Recognition, vol. 39, no. 1, pp. 147-155, Jan. 2006.
[11] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84-95, Jan. 1980.
[12] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP 1995), 1995, pp. 341-344.
[13] J. Hennebert, H. Melin, D. Petrovska, and D. Genoud, "POLYCOST: A telephone-speech database for speaker recognition," Speech Communication, vol. 31, no. 2-3, pp. 265-270, Jun. 2000.
[14] H. Melin and J. Lindberg, "Guidelines for experiments on the POLYCOST database," in Proc. COST 250 Workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59-69.
