Blind Speaker Clustering

A. N. Iyer∗, U. O. Ofoegbu∗, R. E. Yantorno∗ and B. Y. Smolenski†

∗Speech Processing Laboratory, Temple University, Philadelphia, PA 19122, USA
E-mail: {aniyer,uche1,byantorn}@temple.edu

†Air Force Research Laboratory/IFEC, Rome, NY 13441-4505, USA
E-mail: [email protected]

Abstract— A novel approach to performing speaker clustering in telephone conversations is presented in this paper. The method is based on a simple observation that the distance between populations of feature vectors extracted from different speakers is greater than a preset threshold. This observation is incorporated into the clustering problem by the formulation of a constrained optimization problem. A modified c-means algorithm is designed to solve the optimization problem. Another key aspect in speaker clustering is to determine the number of clusters, which is either assumed or expected as an input in traditional methods. The proposed method does not require such information; instead, the number of clusters is automatically determined from the data. The performance of the proposed algorithm with the Hellinger, Bhattacharyya, Mahalanobis and the Generalized Likelihood Ratio distance measures is evaluated and compared. The approach, employing the Hellinger distance, resulted in an average cluster purity value of 0.85 from experiments performed using the Switchboard telephone conversational speech database. The result indicates a 9% relative improvement in the average cluster purity as compared to the best performing agglomerative clustering system.

I. INTRODUCTION

In the recent past, speaker clustering in conversations has attracted considerable research interest, as its applications span from improving speech recognition (by enabling the use of speaker-dependent systems) to categorizing broadcast and telephone recordings by speaker activity. The main application considered in this research is clustering data from telephone conversations, mostly of short duration (around one minute). Popular approaches apply agglomerative methods by constructing distance matrices and building dendrograms [1][2]. These methods usually require knowing the number of speakers participating in the conversation, or else include heuristic strategies to determine the number of speakers/clusters. Furthermore, these methods have been developed for clustering news and broadcast data, where the amount of available data is large compared to telephone conversations. Other recent efforts include maximum purity clustering [3] and a procedure based on constructing eigenvoice-motivated vector spaces [4]. Though these methods are mathematically appealing, they fail to provide a definitive framework to determine the number of speakers and the clusters themselves. The above-stated inadequacies of the existing algorithms warrant the need for a simple clustering technique which works reliably on short telephone conversations. Unlike the other

cited methods, a partitional method for clustering speakers is presented in this paper. The method assumes that the number of speakers is unknown and estimates it automatically from the data. The approach relies on the simple observation that the distance between populations of feature vectors extracted from two different speakers is larger than a pre-determined threshold. Along with this information, a clustering cost function, similar to the squared-error criterion, is minimized by a modified c-means algorithm. The modified algorithm includes a merging strategy which takes into account the fact that if the distance between two clusters becomes arbitrarily small, the clusters belong to the same speaker. This method will be referred to as the cc-means (constrained c-means) algorithm throughout the paper. The evaluation of the cc-means algorithm is performed using the popular cluster purity index. As a baseline for comparison, an agglomerative clustering method (assuming a known number of clusters) is used.

The paper is organized as follows: a review of the speech distances used in this work and a method to estimate the speaker threshold are presented in Section II. In Section III, the formulation and description of the cc-means algorithm are introduced, followed by an experimental evaluation of the proposed algorithm in Section IV. Conclusions and future studies are discussed in Section V.

II. BACKGROUND

Classification of speakers' utterances is performed using the elementary notion of a distance measure computed between the two probability density functions (pdfs) of the feature vectors extracted from the speech utterances. The features considered are the short-time Linear Predictive Cepstral Coefficients (LPCC) computed from 30 msec non-overlapping speech windows. Various distances are investigated and results are presented. This is followed by a procedure to estimate the optimal speaker threshold.

A. Speech Distances

Based on a recent study [5], the following four distances are used separately with the cc-means algorithm, and their performance is compared:

(i) Mahalanobis distance is one of the earliest distance measures used for speaker identification with a minimum-distance classifier. The distance between two sets of feature vectors X1 and X2 is defined as [6]:

dMAH(X1, X2) = (µ1 − µ2)^T Σ^(−1) (µ1 − µ2),   (1)

where Σ is the pooled estimate of the covariance matrix, and µ1 and µ2 are the respective mean vectors. The Mahalanobis distance was also employed in a speaker count system and has been shown to be helpful in discriminating between speakers [7].

(ii) Bhattacharyya distance is a special case of the f-divergence and measures the classification error bound between two pdfs. The Bhattacharyya distance is known to have a geometric interpretation of being the cosine of the angle between the two pdfs and can be computed as [8]:

dBH(X1, X2) = (1/4) (µ1 − µ2)^T (Σ1 + Σ2)^(−1) (µ1 − µ2) + (1/2) log( |Σ1 + Σ2| / (2 sqrt(|Σ1||Σ2|)) ),   (2)

where {µ1, Σ1} and {µ2, Σ2} are the mean and covariance matrix estimates of the feature vectors in X1 and X2, respectively.

(iii) Hellinger distance is generally used for estimating mixture densities and is considered superior to maximum likelihood density estimates under certain conditions [9]. The Hellinger distance is adopted for clustering in this research based on the conclusions drawn in [8], which indicate the usefulness of this distance for clustering applications. The Hellinger distance can be computed by a non-linear mapping of the Bhattacharyya distance:

dHE(X1, X2) = 1 − e^(−dBH(X1, X2)).   (3)

(iv) Generalized Likelihood Ratio (GLR) is derived from the hypothesis test constructed to determine whether two populations of feature vectors X1 and X2 belong to the same underlying model. The distance is computed as [10]:

dGLR(X1, X2) = −log(λΣ λµ),   (4)

where:

λΣ = |Σ1|^α |Σ2|^(1−α) / |Σ|,   (5)

with α = n1/(n1 + n2), β = (n1 + n2)/2 and Σ the pooled covariance, and

λµ = ( 1 + (n1 n2 / (n1 + n2)^2) (µ1 − µ2)^T Σ^(−1) (µ1 − µ2) )^(−β).   (6)

In the above expressions, n1 and n2 are the number of feature vectors in X1 and X2, respectively.

B. Speaker Threshold Estimation

A common method to determine whether two utterances belong to the same speaker is to compute the distance between them and make a decision using a threshold. To determine a threshold, an experimental analysis of speech data obtained from the HTIMIT database was performed. Figure 1 shows the histograms of the Hellinger distances computed between two utterances of the same speaker, p(d|ω0), and of different speakers, p(d|ω1). The dotted vertical line represents the threshold ξ at the equal error rate¹, which is considered the optimal threshold.

Fig. 1. Histograms showing the distances between utterances from the same speaker (gray) and different speakers (black). The dotted line represents the threshold at the equal error rate. (Axes: Hellinger Distance vs. Probability.)

During experimentation, it was observed that the threshold depends on the lengths (n1 and n2) of the two utterances, and hence a procedure is required to estimate a threshold that depends on the data lengths. To illustrate this fact, the threshold was determined experimentally by varying the data length, and the result is shown in Figure 2. The smooth surface represents the estimating function ξ̂ = Φ(n1, n2), parametrized as a 3rd-order polynomial. The polynomial coefficients were chosen to minimize the least-squares error between the estimated threshold and the true threshold ξ.

Fig. 2. Estimation of the optimal threshold from the data sizes. The Hellinger distance is used for illustration. The asterisks (∗) represent the thresholds determined and the surface is the polynomial fit Φ(n1, n2). (Axes: Segment 1 Size [secs], Segment 2 Size [secs], Threshold.)

¹The equal error rate was obtained by numerically integrating the error regions of the two pdfs.
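To make the distance computations above concrete, the following Python sketch estimates Gaussian statistics from two populations of feature vectors and evaluates Equations (1)-(3). This is an illustrative sketch, not the authors' implementation: NumPy, the function names, and the full-covariance sample estimates are assumptions; the GLR of Equations (4)-(6) follows the same pattern.

```python
import numpy as np

def gaussian_stats(X):
    """Mean vector and covariance matrix of a population of feature vectors (rows)."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def d_mah(X1, X2):
    """Mahalanobis distance, Eq. (1), using a pooled covariance estimate."""
    (mu1, S1), (mu2, S2) = gaussian_stats(X1), gaussian_stats(X2)
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)  # pooled covariance
    diff = mu1 - mu2
    return float(diff @ np.linalg.inv(S) @ diff)

def d_bh(X1, X2):
    """Bhattacharyya distance, Eq. (2), between the two Gaussian estimates."""
    (mu1, S1), (mu2, S2) = gaussian_stats(X1), gaussian_stats(X2)
    diff = mu1 - mu2
    mean_term = 0.25 * diff @ np.linalg.inv(S1 + S2) @ diff
    cov_term = 0.5 * np.log(np.linalg.det(S1 + S2)
                            / (2.0 * np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return float(mean_term + cov_term)

def d_he(X1, X2):
    """Hellinger distance, Eq. (3): a non-linear mapping of d_bh into [0, 1)."""
    return 1.0 - np.exp(-d_bh(X1, X2))

# Toy usage: two well-separated 3-dimensional Gaussian populations
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(200, 3))
X2 = rng.normal(4.0, 1.0, size=(200, 3))
print(d_he(X1, X2))  # close to 1 for well-separated populations
```

Because the Hellinger mapping is bounded in [0, 1), a single threshold on it is easier to interpret across conversations than the unbounded Mahalanobis or Bhattacharyya values.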

III. METHODOLOGY

The speaker clustering problem can be mathematically represented as the following optimization problem: given Xj, j = 1, 2, ..., P utterances from a telephone conversation, determine the partitions, denoted Ck, k = 1, 2, ..., N, such that:

minimize over N, Zk:   Σ_{k=1..N} Σ_{Xj ∈ Ck} d(Xj, Zk)   (7)

subject to d(Zp, Zq) > ξ, ∀ p, q; p ≠ q,

where Zk represents the utterances lying in the partition defined by Ck, d(·, ·) is the appropriate distance measure, and ξ is the threshold. Despite the elegance of the above formulation, a closed-form solution to the problem is not available. It was shown in [11] that the c-means algorithm solves a similar unconstrained clustering problem when the number of clusters N is known. In most speaker clustering problems this knowledge is not available, and hence the number of clusters must be determined automatically. In this research, a modification of the c-means algorithm (the cc-means algorithm) is developed to solve the constrained problem. The number of clusters, representing the number of speakers, is estimated and used to determine the initial partitions, and is kept as a free parameter which is updated with every cc-means iteration. The algorithmic description of the proposed approach is presented as a flow-chart in Figure 3 and is followed by a description of each operation.

Fig. 3. Flowchart of the proposed cc-means clustering algorithm.

A. Utterance Segmentation

The first step for most speaker clustering methods is to perform an utterance-level segmentation, where the conversation is segmented into speaker-homogeneous segments (utterances). This is achieved by a two-step procedure: (i) the conversation is split into segments of 1 second in length, and (ii) adjacent segments belonging to the same speaker (determined using the distance measure) are concatenated to form speaker-homogeneous segments. It should be pointed out that one could adopt a sophisticated change-point algorithm to perform the utterance segmentation; however, to maintain the simplicity of the approach, such methods were not investigated.

B. Initial Partitioning

The utterances formed in the previous step are collected into a set, which is subjected to a selection process to form a subset, with each utterance in the subset being a representative of one partition. The smallest subset in which all pairwise distances between utterances are greater than the speaker threshold ξ̂ represents the number of speakers in the conversation. For ease of presentation, let Ω be the set of all utterances formed from the conversation Xj and S be the desired subset of utterances. To begin, set S(k=2) = {Xp, Xq}, where Xp and Xq are the two farthest utterances (based on the employed distance). Note that d(Xp, Xq) < ξ̂ is the trivial case where only one speaker is present in the conversation. A new utterance is included in the subset by choosing the farthest utterance from the subset S(k). Mathematically:

Xi = argmax_{Xi ∈ Ω} min_{Xj ∈ S(k)} d(Xi, Xj)   (8)

S(k+1) = S(k) ∪ Xi   (9)

Utterances are repeatedly included, one at a time, until two utterances with a distance lower than the threshold are found. The selected utterances are then subjected to the cc-means iteration process, described below, to update the partitions.

C. cc-means Iterations and Merging

The proposed clustering approach follows the traditional c-means steps; however, an additional strategy provides the algorithm with the ability to merge two clusters that move close to each other during the iterations. The following two steps are repeated until convergence:

1. Update the partition: each utterance is associated with the closest partition in an attempt to reduce the cost function defined in Equation 7.

2. Adjust the number of clusters by merging partitions that are close enough to ascertain that they belong to the same speaker. The closeness between partitions is again defined by the distance between the two clusters, and a merge decision is obtained using the estimated threshold ξ̂.

The algorithm is said to have converged when the value of the cost function ceases to decrease with further iterations.

IV. EXPERIMENTS AND RESULTS

The proposed speaker clustering algorithm was experimentally evaluated using the Switchboard database [12]. The Switchboard database consists of telephone conversations between two speakers, with the data of each speaker available in a different channel. The two channels were added together on a sample-by-sample basis to simulate a single-channel recording. The ground truth, indicating the temporal activity of each speaker, was obtained from the transcriptions provided by Mississippi State University². Two hundred and forty-five conversations, each of duration 1 minute, were extracted from the database and clusters were obtained. Note that all the conversations in the database are between two speakers; however, this information was not used in the clustering algorithm, and the number of speakers was instead estimated from the data.

The evaluation of the cc-means clustering process was based on the cluster purity index, defined as [2]:

pk = (1 / mk^2) Σj mkj^2,   (10)

which gives the purity pk associated with cluster k. The symbol mk represents the number of speech frames in the k-th cluster and mkj represents the number of speech frames in the k-th cluster that belong to the j-th speaker. The purity measures the extent to which all speech data in a cluster comes from the same speaker. The highest achievable purity value is 1, attained when all the speech data in a cluster is associated with the same speaker. The average purity, used to measure the overall performance of the clustering algorithm, is defined as:

p = (1 / Σk mk) Σk mk pk.   (11)
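As a concrete check of the purity definitions in Equations (10) and (11), the sketch below computes per-cluster and average purity from per-speaker frame counts. The list-of-lists input format and function names are assumptions made for illustration only.

```python
import numpy as np

def cluster_purity(frame_counts):
    """Eq. (10): purity of one cluster.
    frame_counts[j] is the number of frames in this cluster from speaker j."""
    m = np.asarray(frame_counts, dtype=float)
    return float((m ** 2).sum() / m.sum() ** 2)

def average_purity(clusters):
    """Eq. (11): size-weighted average purity over all clusters."""
    sizes = np.array([sum(c) for c in clusters], dtype=float)
    purities = np.array([cluster_purity(c) for c in clusters])
    return float((sizes * purities).sum() / sizes.sum())

# Two clusters of 100 frames each; each inner list gives frames per speaker
print(cluster_purity([100, 0]))              # 1.0: a perfectly pure cluster
print(average_purity([[90, 10], [20, 80]]))  # close to 0.75
```

Note that purity rewards clusters dominated by one speaker but says nothing about splitting one speaker across many clusters, which is why the number-of-clusters accuracy is reported alongside it.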

The average purity and its standard deviation (STD) obtained over the 245 conversations for the various distances are tabulated in Table I. The accuracy [%] of the approach in estimating the number of clusters as two is also reported.

TABLE I
CLUSTERING PERFORMANCE ANALYSIS OF THE PROPOSED ALGORITHM

              Avg. Purity          STD
Distance    CCM(a)   AC(b)    CCM     AC     NC(c)   RI(d)
dBH          0.84    0.71     0.13   0.18     95      18
dHE          0.85    0.78     0.13   0.17     93       9
dMAH         0.86    0.73     0.12   0.18     86      17
dGLR         0.90    0.75     0.08   0.17      8      20

(a) cc-means clustering algorithm
(b) Agglomerative clustering
(c) Accuracy [%] in determining the number of clusters
(d) Relative improvement [%] compared to AC
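The farthest-first selection used for initial partitioning (Section III-B, Equations (8) and (9)) can be sketched as follows. This is an illustrative reimplementation, not the authors' code: utterances are reduced to toy scalar values, the distance is a plain absolute difference, and the stopping rule (halt before admitting a candidate that falls within the threshold of an existing representative) is one reading of the description in the text.

```python
import numpy as np

def select_representatives(utts, dist, threshold):
    """Farthest-first subset selection (Eqs. 8-9): grow the subset with the
    utterance farthest from it, stopping when the next candidate would fall
    within the speaker threshold of an existing representative."""
    n = len(utts)
    D = np.array([[dist(utts[i], utts[j]) for j in range(n)] for i in range(n)])
    p, q = np.unravel_index(np.argmax(D), D.shape)  # two farthest utterances
    if D[p, q] < threshold:
        return [p]              # trivial case: a single speaker
    S = [p, q]
    while len(S) < n:
        rest = [i for i in range(n) if i not in S]
        best = max(rest, key=lambda i: min(D[i, j] for j in S))  # Eq. (8)
        if min(D[best, j] for j in S) < threshold:
            break               # candidate too close to an existing cluster
        S.append(best)          # Eq. (9)
    return S

# Toy example: scalar "utterances" from three well-separated speakers
utts = [0.0, 0.1, 5.0, 5.1, 10.0]
reps = select_representatives(utts, dist=lambda a, b: abs(a - b), threshold=1.0)
print(len(reps))  # 3 estimated speakers
```

In the paper's setting, the scalar values would be replaced by utterance feature populations and `dist` by one of the four distances of Section II-A, with the length-dependent threshold ξ̂ = Φ(n1, n2).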

Note that the values reported for the traditional agglomerative clustering method assume a priori knowledge of the number of clusters.

²The transcriptions are available online at http://www.cavs.msstate.edu/hse/ies/projects/switchboard/index.html

V. DISCUSSION AND CONCLUSIONS

The results presented in Table I clearly indicate that the average cluster purity obtained with any of the distances, in conjunction with the cc-means algorithm, is higher than the values obtained with the best performing agglomerative clustering method, even though the number of speakers was known in the latter case. Furthermore, the cc-means algorithm consistently produced a lower standard deviation (representing a tighter distribution) of the cluster purity measure, indicating that the cc-means algorithm is more reliable than its traditional counterpart. It can also be noted that the cc-means algorithm was successful in identifying the correct number of speakers (two in this case) more than 93% of the time with the Hellinger distance. One interesting observation in Table I is that there exists a trade-off in the choice of the distance measure: the cluster purity increases moving down the column, whereas the accuracy in determining the correct number of clusters decreases. The Hellinger distance appears to provide a balance between the two performance measures. Furthermore, one can envisage using two different distances, one to estimate the number of clusters and the other to perform the clustering, which is expected to improve the performance of the algorithm.

In conclusion, the proposed cc-means speaker clustering algorithm was found to be reasonably accurate and reliable in clustering utterances in telephone conversations. Experimental evaluation resulted in an average cluster purity value of 0.85, which corresponds to a 9% relative increase in the purity value compared to the best performance obtained using the agglomerative clustering approach.

ACKNOWLEDGMENT

This effort was sponsored by the Air Force Research Laboratory, Air Force Materiel Command, USAF, under agreement number FA8750-04-1-0146.

REFERENCES

[1] S. Chen and P. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," DARPA Speech Recognition Workshop, 1998.
[2] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish, "Clustering speakers by their voices," ICASSP, vol. 2, pp. 757-760, 1998.
[3] W.-H. Tsai and H.-M. Wang, "Speaker clustering of unknown utterances based on maximum purity estimation," Eurospeech, 2005.
[4] W.-H. Tsai, S.-S. Cheng, Y.-H. Chao, and H.-M. Wang, "Clustering speech utterances by speaker using eigenvoice-motivated vector space models," ICASSP, vol. 1, pp. 725-728, 2005.
[5] A. N. Iyer, U. O. Ofoegbu, R. E. Yantorno, and B. Y. Smolenski, "Speaker discriminative distances: A comparative study," IEEE Transactions on Signal Processing, (in preparation).
[6] H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, vol. 11, no. 4, pp. 18-32, October 1994.
[7] U. O. Ofoegbu, A. N. Iyer, R. E. Yantorno, and B. Y. Smolenski, "A speaker count system for telephone conversations," ISPACS, 2006 (submitted).
[8] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349-369, 1989.
[9] Z. Lu, Y. V. Hui, and A. H. Lee, "Minimum Hellinger distance estimation for finite mixtures of Poisson regression models and its applications," Biometrics, vol. 59, no. 4, pp. 1016-1026, 2003.
[10] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd ed. John Wiley & Sons, Inc., 2003.
[11] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[12] J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," ICASSP, pp. 517-520, 1992.
