A Speaker Count System for Telephone Conversations


Uchechukwu O. Ofoegbu1, Ananth N. Iyer1, Robert E. Yantorno1 and Brett Y. Smolenski2

1 Speech Processing Laboratory, Temple University, PA 19122-6077 {uche1, aniyer, byantorn}@temple.edu
2 Air Force Research Laboratory/IFEC, Rome, NY 13441-4514, USA [email protected]

Abstract - In telephone conversations, only short consecutive utterances can be examined for each speaker. Discriminating between speakers in such conversations is therefore a challenging task, which becomes even more challenging when no information about the speakers is known a priori. In this paper, a technique for determining the number of speakers participating in a telephone conversation is presented. This approach assumes no knowledge or information about any of the participating speakers. The technique is based on comparing short utterances within the conversation and deciding whether or not they belong to the same speaker. The applications of this research include three-way call detection and speaker tracking, and could be extended to speaker change-point detection and indexing. The proposed method involves an elimination process in which speech segments matching a chosen set of reference models are sequentially removed from the conversation. Models are formed using the mean vectors and covariance matrices of Linear Predictive Cepstral Coefficients (LPCCs) of voiced segments in the conversation. The use of the Mahalanobis distance to determine if two models belong to the same or to different speakers, based on likelihood ratio testing, is investigated. The relative amount of residual speech is observed after each elimination process to determine if an additional speaker is present. Experimentation was performed on 4000 artificial conversations from the HTIMIT database. The proposed system yielded an average speaker count accuracy of 78%.

I. INTRODUCTION

Speaker recognition is the art of observing a given speech waveform and automatically making a decision about the speaker(s) from which it was generated. Speaker recognition was originally classified into speaker identification (SID), in which the speaker of a given utterance is determined from a known dataset, and speaker verification, in which the identity claim of an observed speaker is accepted or rejected based on the characteristics of the speaker's voice. SID, which is more commonly encountered, is usually performed in two stages: training and testing. During training, models are built from a dataset consisting of speech from all the speakers to be examined. Testing is then performed on a test dataset containing different speech utterances from the same speakers [1]. In SID systems, information about the speakers is known a priori, and at least 5 seconds of data per speaker is generally available for comparison.

Currently, other aspects of speaker recognition besides identification and verification exist, among them speaker change-point detection and speaker indexing of broadcast news data, where the utterances are labeled according to the participating speakers. Indexing is usually accomplished by first determining speaker change points and then clustering the segments between these change points [2], [3], or by clustering speakers based on distances [4], [5]. Other methods of indexing speech data have also been examined [6], [7]. It should be noted that in broadcast data indexing, long consecutive speaker utterances (5 to 20 seconds) are usually available [3], [7].

Recently, the problem of detecting the presence of a third speaker in a telephone conversation, which can also be regarded as a speaker recognition application, was investigated [8]. No a priori knowledge about any speaker in each conversation was assumed. Such lack of information poses a challenge in the detection problem, as the system cannot be trained with information about the speakers as is the usual practice in SID systems. Furthermore, unlike in broadcast news data indexing, only short speaker-homogeneous utterances are available in telephone conversations, due to the rapid change of speaker turns. These barriers were circumvented by forming speaker models from short segments of the observed data and then implementing a sequential distance-based elimination procedure referred to as the Residual Ratio Algorithm (RRA) [8].

In this research, the problem is generalized to a speaker count task, and a Generalized Residual Ratio Algorithm (GRRA) is proposed for determining the number of speakers in a telephone conversation, as opposed to simply deciding whether there are two or three speakers present. Models are created using the mean vectors and covariance matrices of 14th order Linear Predictive Cepstral Coefficients (LPCCs) of voiced segments. The Mahalanobis distance is used for distinguishing speakers during the elimination rounds of the GRRA, and the number of speakers is determined by observing the relative amount of speech left in the conversation after each round.

The paper is organized as follows: the approach taken in comparing speaker models is explained in Section II. In Section III, a detailed description of the GRRA is given, followed by a presentation of experimental procedures and computation of results in Section IV. Conclusions are drawn and possible areas of further research are given in Section V.

II. COMPARING SPEAKER MODELS

The Euclidean distance is a simple and widely used distance measure for distinguishing speakers, especially in speaker recognition applications [9]. However, for multivariate random variables, the Euclidean distance does not take into account the correlations of the dataset, and is sensitive to the scale of the measurements. In other words, with the Euclidean distance, only the means of the vectors are observed, and no computation is performed using the covariance matrix. The Mahalanobis distance, on the other hand, which is very similar to the Euclidean distance, measures the dissimilarity between two random vectors by utilizing the covariance matrix as well as the mean vectors of the random variables. Let X = [X1, X2, …, Xp] and Y = [Y1, Y2, …, Yp] be two multivariate random variables. Let µx and µy be the mean vectors of X and Y respectively, and let Σ be an estimate of their covariance matrix. The Mahalanobis distance can be expressed as:

$$d(X,Y) = (\mu_x - \mu_y)^{T}\,\Sigma^{-1}\,(\mu_x - \mu_y) \tag{1}$$

It must be noted that the covariance matrix, Σ, is assumed to be equal for both random variables being compared. 14th order LPCCs are used as features in this research, as they had been shown to be appropriate features for the speaker count procedure based on tests performed on utterances from the HTIMIT database [10]. The LPCCs were computed on a frame-by-frame basis, with each frame being 30 milliseconds in length. The first test involved computing T² statistics for different speech utterances from the same speaker using all 384 speakers from the HTIMIT database. These were then compared with the T² values for speech utterances from different speakers, chosen at random from the database, using a combination of all 384 speakers. The distributions of T² values obtained clearly indicated that two speakers could be effectively discriminated using the T² statistic [8]. The same procedure was repeated in this research, except that the Mahalanobis distance was used instead of the T² statistic. The same inference can be made from the distribution of Mahalanobis distances, as can be seen in Fig. 1.

[Figure 1: "Distribution of Mahalanobis Distance - Utterance Based"; probability versus distance value for same-speaker and different-speaker comparisons.]

Fig. 1. Intra- (black) and inter- (grey) speaker utterance-based Mahalanobis distance distributions.

It is impossible to compare whole utterances of speakers in practical (conversational) applications without prior information about speaker change points. Therefore, another experiment was conducted in which segments were used instead of whole utterances. In this case, speaker models were formed using N consecutive voiced segments from the same utterance and compared with another set of N voiced segments from the same or a different speaker's utterance, as the case may be. Fig. 2 shows the means (circles) and standard deviations (horizontal bars) of the Mahalanobis distances for the intra- and inter-speaker comparisons for N = 1 to 20, and illustrates the effect of data size on speaker discrimination. One thousand comparisons were observed for each value of N (number of segments). Each voiced segment (or phoneme) had an average length of 200 milliseconds.

One goal of this investigation was to determine an appropriate number of segments that would yield sufficient differentiation between intra- and inter-speaker distances, while preventing segments from two different speakers from being grouped together to form one model. In other words, the smallest number of segments with adequate separation is desired. From Fig. 2, it is observed that an increase in data size results in an increase in speaker separability. Additionally, all values of N below 5 result in an overlap in the standard deviations of the intra- and inter-speaker distances; thus, in this research, 5 segments (a total length of about 1 second) are used in forming speaker models.

[Figure 2: "Speaker Differentiation with Respect to Data Size"; Mahalanobis distance versus number of segments for same-speaker and different-speaker comparisons.]

Fig. 2. Comparison of Mahalanobis distances for different ‘models’ of speech from the same speaker (black bars). The x-axis represents the number of segments used to form each model.

A direct approach to deciding if two models are from the same or different speakers based on the Mahalanobis distance between them would be to choose a threshold by observing the mean values of both distributions, and to make decisions based on this threshold. This approach could be considered sufficient if both distributions had almost equal variances. However, Fig. 3, which shows the intra- and inter-speaker Mahalanobis distance distributions (1000 comparisons for each class) for 5-segment based models, suggests some difference in the variance of the two classes.

[Figure 3: "Mahalanobis Distance"; probability versus distance value for same-speaker and different-speaker comparisons.]

Fig. 3. Intra- and inter-speaker ‘model’-based Mahalanobis distance distributions.
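As an illustration of the model comparison in Eq. (1), the following minimal sketch builds a model (mean vector and covariance matrix) from a list of feature vectors and evaluates the distance between two models. It is not the authors' implementation. The function names are hypothetical, and the shared Σ is assumed here to be the average of the two models' covariance matrices, since the text only states that Σ is common to both. Toy 2-dimensional vectors stand in for 14th order LPCC frames.

```python
from statistics import fmean


def mean_vector(frames):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*frames)]


def covariance(frames):
    """Unbiased sample covariance matrix of the feature vectors."""
    mu = mean_vector(frames)
    n, p = len(frames), len(mu)
    centered = [[v[i] - mu[i] for i in range(p)] for v in frames]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(frames_x, frames_y):
    """Eq. (1): (mu_x - mu_y)^T Sigma^{-1} (mu_x - mu_y) with a shared Sigma.

    Assumption: Sigma is the average of the two sample covariance matrices.
    Many references take the square root of this quadratic form; Eq. (1) as
    printed has no root, so none is taken here.
    """
    mu_x, mu_y = mean_vector(frames_x), mean_vector(frames_y)
    cov_x, cov_y = covariance(frames_x), covariance(frames_y)
    p = len(mu_x)
    sigma = [[(cov_x[i][j] + cov_y[i][j]) / 2 for j in range(p)]
             for i in range(p)]
    diff = [a - b for a, b in zip(mu_x, mu_y)]
    # Sigma^{-1} diff is obtained by solving a linear system, not by inversion.
    return sum(d * s for d, s in zip(diff, solve(sigma, diff)))


# Toy data: two clouds with identical spread, offset along the first axis.
model_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
model_b = [[x + 2.0, y] for x, y in model_a]
print(mahalanobis(model_a, model_b))
```

In practice each model would be built from the LPCC frames of 5 consecutive voiced segments, and a linear algebra library would normally replace the hand-written solver; the pure-Python version is used only to keep the sketch self-contained.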

Because the two classes differ in variance, likelihood ratio testing is performed in this research to determine whether two models belong to the same speaker or to different speakers. A Gaussian approximation is assumed for the inter- and intra-speaker distributions [5], and the mean and variance for each class are obtained from the distributions shown in Fig. 3. (The separation between the intra- and inter-speaker classes appears to be greater with the Mahalanobis distance than with the T² statistic shown in [8]; hence the use of the Mahalanobis distance in this research.) Let µ1 and σ1 represent the mean and standard deviation of the intra-speaker Mahalanobis distances, and let µ2 and σ2 represent the corresponding inter-speaker parameters. Then, given the Mahalanobis distance, d, between two models, the Gaussian probabilities f(x|µ1,σ1) and f(x|µ2,σ2), evaluated at x = d, can be computed, and one can determine if the models are from the same or different speakers simply by observing the greater of the two probabilities. In other words, the two models can be said to be from the same speaker if the intra-speaker likelihood, f(x|µ1,σ1), is greater than the inter-speaker likelihood, f(x|µ2,σ2). A Mahalanobis Likelihood Ratio (MLR) is thus defined as:

$$\mathrm{MLR} = \frac{f(x \mid \mu_1, \sigma_1)}{f(x \mid \mu_2, \sigma_2)} \tag{2}$$
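The test in Eq. (2) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the numeric parameters are placeholders, not the values estimated from Fig. 3. The ratio is evaluated in the log domain, so that very small likelihoods do not underflow; comparing the log-ratio with 0 is equivalent to comparing the MLR with 1.

```python
import math


def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_mlr(d, intra, inter):
    """Logarithm of Eq. (2) at distance d.

    intra and inter are (mean, standard deviation) pairs for the
    intra-speaker and inter-speaker distance distributions.
    """
    return gaussian_log_pdf(d, *intra) - gaussian_log_pdf(d, *inter)


def same_speaker(d, intra, inter):
    """Decide 'same speaker' when MLR > 1, i.e. when log(MLR) > 0."""
    return log_mlr(d, intra, inter) > 0.0


# Placeholder parameters for illustration only (NOT taken from the paper).
INTRA = (1.0, 0.3)  # hypothetical intra-speaker mean and standard deviation
INTER = (2.0, 0.5)  # hypothetical inter-speaker mean and standard deviation

for distance in (1.0, 2.5):
    print(distance, same_speaker(distance, INTRA, INTER))
```

Because the two standard deviations differ, the decision boundary is not simply the midpoint between the two means, which is the motivation given above for using a likelihood ratio instead of a fixed threshold on the means.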

If the intra- and inter-speaker cases are assumed to have equal probability, then an MLR value above 1 indicates that both models are from the same speaker, and an MLR below 1 indicates that the models are from different speakers. Note that this test is based not just on the means but also on the variances of the distributions, thereby increasing the accuracy of the Mahalanobis distance in discriminating speakers. This MLR test is applied in the GRRA described in the following section.

III. THE GENERALIZED RESIDUAL RATIO ALGORITHM

The RRA introduced in [8] was designed for three-speaker detection and was based on eliminating two speakers from a conversation and observing the relative amount of speech remaining. In this paper, a generalized form of the RRA, referred to as the Generalized RRA (GRRA), is presented, whereby a speaker count of up to K speakers can be determined. A detailed description of this technique is given below:

i. Speech models are formed from a given conversation by computing the mean vectors and covariance matrices of the 14th order LPCCs of 5 consecutive voiced segments (representing one model).
ii. All pair-wise Mahalanobis distances for all models in the conversation are computed.
iii. A reference model is chosen at random, and MLR tests are performed between this model and all others. Every model with an MLR > 1 is considered to belong to the reference speaker and is eliminated from the conversation along with the reference model itself. The Residual Ratio (the ratio of the size of the residual speech to the original size of the conversation) is then determined. This completes the first elimination round.
iv. Step iii is repeated for the second round; however, the following procedure is used to ensure that the new reference model is not one of those that belong to the first reference speaker but were erroneously mismatched in the first round: the ratio of the size of the speech that was matched to the second reference to the total amount of speech is observed, and the process is repeated until this ratio is greater than a threshold chosen a priori. Once this condition is satisfied, the Residual Ratio for the second round is determined.
v. Step iv is repeated until the (K-1)th round.

Ideally, all reference models should belong to different speakers, all models from the kth (k = 1, 2, …, K-1) reference speaker should be eliminated in the kth round, and, if there are k speakers, the Residual Ratio after the kth round should be zero. In practice, however, some models may be mismatched in the elimination rounds, with some models belonging to the references being missed and some models being wrongfully eliminated (as was illustrated in [8]).

In determining the speaker count based on the computed Residual Ratios, two approaches are considered. The first is a tree-type classification procedure in which the Residual Ratio is observed for each round; if it is below a certain threshold for the kth round (obtained by observing Residual Ratios for a total of 4,000 artificially generated 1-4 speaker conversations from the HTIMIT database), the speaker count is considered equal to k, and the Residual Ratios for other rounds are not considered. This method is referred to as the Stopped Residual Ratio (SRR) approach. An alternate approach involves determining the speaker count based on the sum of the Residual Ratios for all K-1 rounds. This method is referred to as the Added Residual Ratio (ARR) approach. The higher the ARR, the higher the speaker count is expected to be.
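The elimination rounds and the two counting rules described in this section can be sketched as below. This is a hedged illustration, not the authors' implementation. All function and parameter names are hypothetical; `min_match_ratio` stands in for the a priori threshold of step iv, `max_tries` is an added safeguard against endless redraws, and all models are assumed to cover equal amounts of speech, so that sizes reduce to model counts. The thresholds passed to the counting functions are placeholders, not the values trained in the paper.

```python
import random


def grra_residual_ratios(models, is_match, max_speakers=4,
                         min_match_ratio=0.1, max_tries=20, rng=None):
    """Run K-1 elimination rounds; return the Residual Ratio after each.

    models: list of speaker models (any objects), assumed equal in duration.
    is_match(ref, other): True when the MLR test says 'same speaker'.
    """
    rng = rng or random.Random(0)
    total = len(models)
    remaining = list(range(total))
    ratios = []
    for round_no in range(1, max_speakers):
        if not remaining:            # nothing left to eliminate
            ratios.append(0.0)
            continue
        for _ in range(max_tries):
            ref = rng.choice(remaining)
            matched = [i for i in remaining
                       if i == ref or is_match(models[ref], models[i])]
            # Step iv (assumed reading): from round 2 on, redraw the reference
            # if it matched too little speech; the last draw is kept if
            # max_tries runs out.
            if round_no == 1 or len(matched) / total > min_match_ratio:
                break
        matched_set = set(matched)
        remaining = [i for i in remaining if i not in matched_set]
        ratios.append(len(remaining) / total)
    return ratios


def count_srr(ratios, thresholds):
    """Stopped Residual Ratio: first round k whose ratio is below threshold."""
    for k, (ratio, threshold) in enumerate(zip(ratios, thresholds), start=1):
        if ratio < threshold:
            return k
    return len(ratios) + 1


def count_arr(ratios, boundaries):
    """Added Residual Ratio: a larger sum of ratios maps to a larger count.

    boundaries: ascending cut points on the ARR axis (placeholders here).
    """
    arr = sum(ratios)
    return 1 + sum(arr > b for b in boundaries)


# Toy conversation: models are speaker labels and matching is exact equality,
# i.e. an idealized error-free MLR test.
conversation = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
ratios = grra_residual_ratios(conversation, lambda a, b: a == b)
print(ratios)
print(count_srr(ratios, thresholds=(0.05, 0.05, 0.05)))
print(count_arr(ratios, boundaries=(0.25, 0.75, 1.25)))
```

With an error-free matcher and equal speaker shares, the ARR for 1, 2, 3 and 4 speakers is 0, 0.5, 1.0 and 1.5 respectively, which is why the placeholder boundaries sit midway between those values; with real MLR decisions the ratios are noisier and the thresholds have to be estimated from training conversations.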

IV. EXPERIMENTS AND RESULTS

All experiments were performed using artificial conversations from the HTIMIT database, since databases with telephone conversations consisting of more than two speakers are currently unavailable. The intra-speaker parameters, µ1 and σ1, and the inter-speaker parameters, µ2 and σ2, used in the MLR tests were obtained from 1000 intra-speaker and 1000 inter-speaker Mahalanobis distances, as described in Section II. A maximum count of K = 4 speakers was considered in this research. To obtain thresholds for the SRR and ARR speaker count approaches, the GRRA was run on 4000 generated training conversations (1000 each for 1-4 speakers). Each conversation was about 60 seconds in length, and each speaker contributed an approximately equal amount of speech. For the SRR method, the thresholds were obtained by fitting the SRR values to a decision tree [11]. Appropriate thresholds for the ARR method were obtained from Fig. 4, which shows the estimated distributions of ARR values for the training conversations described above.

[Figure 4: probability versus ARR (0 to 2), with one curve each for the 1-, 2-, 3- and 4-speaker cases.]

Fig. 4. Added Residual Ratio distribution functions for 1000 artificially generated conversations each for the 1-4 speaker cases.

From Fig. 4, it can be inferred that the speaker count accuracy will decrease as the number of speakers in the conversation increases, as a relatively small separation between the ARR distributions for three and four speakers is observed. The proposed GRRA technique was tested on 4000 testing conversations having the same statistics as the training data. The speaker count accuracy was determined in three different ways, as described below:

1. One or more speakers: an accurate count was obtained if there was one speaker and the proposed system yielded a speaker count of 1, or if there were two, three or four speakers and the proposed system yielded a speaker count greater than 1.
2. One, two or more speakers: an accurate count was obtained if there were one or two speakers and the proposed system yielded a count of one or two, respectively, or if there were three or four speakers and the proposed system yielded a count greater than 2.
3. One, two, three or four speakers: an accurate count was obtained if the proposed system yielded the correct number of speakers.

The accuracy rate of the system was obtained as the ratio of the number of correct speaker counts to the total number of conversations. Fig. 5 shows the accuracy rates for the SRR and ARR methods based on the three accuracy forms described.

[Figure 5: "GRRA Accuracy"; bar chart of percent correct (40 to 100) for the SRR and ARR methods under the three accuracy measures "1 or More", "1, 2 or More" and "1, 2, 3 or 4". Bar labels as extracted: 92.5, 90, 77.5, 76, 63, 60.]

Fig. 5. Speaker count accuracy of the Generalized Residual Ratio Algorithm.

From Fig. 5, it can be observed that the performance of the proposed technique diminishes as the complexity of the task increases. Note that a four-speaker telephone conversation is relatively unlikely, which compensates for the relatively low accuracy obtained in the one, two, three or four speaker count case. Also, the SRR is shown to perform slightly better than the ARR in determining the speaker count in all three cases. The trade-off, however, is the lower complexity of implementing the ARR method.

V. DISCUSSION

A speaker count technique has been presented, and its accuracy in counting up to four speakers in a telephone conversation has been shown. The limitation with such conversations is the insufficient amount of data available for comparing speakers. This usually presents difficulties in speaker recognition even when information about the speakers is known (which is not the case in this research). Attempts to distinguish between speakers using short utterance lengths (below 2 seconds) in conversations have reported up to 41% detection error [12]; nevertheless, with the proposed technique, an average accuracy of about 78% was obtained in determining the number of speakers in artificial one-minute telephone conversations. Possible future enhancements include determining an optimal method for selecting reference models, rather than selecting them randomly, during the elimination rounds of the GRRA,

ACKNOWLEDGMENT

This effort was sponsored by the Air Force Research Laboratory, Air Force Material Command, and USAF, under agreement number FA8750-04-1-0146.

REFERENCES

[1] Reynolds, D. A. and Rose, R. C., “Robust Text-Independent Speaker Identification using Gaussian Mixture Speaker Models”, IEEE Trans. Speech and Audio Process., pp. 72–83, 1995.
[2] Chen, S. and Gopalakrishnan, P., “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion”, Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[3] Zhou, B. W. and Hansen, J. H., “Unsupervised Audio Stream Segmentation and Clustering via the Bayesian Information Criterion”, Proceedings of ICSLP, vol. 3, pp. 714-717, 2000.
[4] Iyer, A. N., Ofoegbu, U. O., Yantorno, R. E. and Wenndt, S. J., “Blind Speaker Clustering”, ISPACS, 2006. (Submitted).
[5] Iyer, A. N., Ofoegbu, U. O., Yantorno, R. E. and Smolenski, B. Y., “Speaker Discriminative Distances: A Comparative Study”, IEEE Trans. on Speech and Audio Processing. (In Progress).
[6] Kwon, S. and Narayanan, S., “Unsupervised Speaker Indexing using Generic Models”, IEEE Trans. on Speech and Audio Processing, vol. 13 (5), pp. 1004-1013, 2004.
[7] Nishida, M. and Ariki, Y., “Real Time Speaker Indexing Based on Subspace Method - Application To TV News Articles and Debate”, Proceedings of ICSLP, 1998.
[8] Ofoegbu, U. O., Iyer, A. N., Yantorno, R. E. and Wenndt, S. J., “Detection of a Third Speaker in Telephone Conversations”, ICSLP, 2006.
[9] Ong, S. and Yang, C., “A Comparative Study of Text-Independent Speaker Identification using Statistical Features”, International Journal of Computer and Engineering Management, vol. 6 (1), 1998.
[10] Reynolds, D., “HTIMIT and LLHDB: Speech Corpora for the Study of Handset Transducer Effects”, ICASSP, vol. 2, p. 1535, 1997.
[11] Duda, R. O., Hart, P. E. and Stork, D. G., “Pattern Classification”, Wiley, New York, 2nd edition, 2001.
[12] Delacourt, P., Kryze, D. and Wellekens, C. J., “Speaker-based Segmentation for Audio Data Indexing”, Proceedings of the ESCA ETRW workshop, UK, 1999.