Text-dependent speaker-recognition systems based on one-pass dynamic programming algorithm

V. Ramasubramanian, V. Praveen Kumar, Deepak Vijaywargiay*, D. Harish*, S. Thiyagarajan, Amitav Das†

Siemens Corporate Technology - India, Siemens Information Systems Ltd., Bangalore - 560100, India
{V.Ramasubramanian, V.Praveenkumar}@siemens.com, {Deepak.Vijaywargiay, D.Harish, Thiyagarajan.Subramani}@internal.siemens.com, [email protected]

Abstract

We propose variable-text text-dependent speaker-recognition systems based on the one-pass dynamic programming (DP) algorithm. The key feature of the proposed algorithm is its ability to use multiple templates for each of the words that form the 'password' text. The use of multiple templates allows the proposed system to capture the idiosyncratic intra-speaker variability of a word, resulting in a significant improvement in performance. Our algorithm also uses inter-word silence templates to handle continuous speech input. We use the proposed one-pass DP algorithm in three speaker-recognition systems, namely closed-set speaker-identification (CSI), speaker-verification (SV) and open-set speaker-identification (OSI). These systems were evaluated on 100-speaker and 200-speaker tasks using the TIDIGITS database and under various car-noise conditions. The key result of this paper is that the use of multiple templates enhances the performance of all three systems significantly: in comparison to a single template, multiple templates improve the CSI performance from 94% to 100%, the SV EER from 1.6% to 0.09% and the OSI EER from 12.3% to 3.5% on a 100-speaker task. We also use the proposed one-pass DP algorithm to automatically extract the multiple templates from continuous speech training data; the performance of the three systems using such automatically extracted templates is as good as with manually extracted templates. Front-end noise suppression enables our systems to deliver robust performance down to 0 dB car noise.

* On internship from International Institute of Information Technology, Bangalore. † Presently with Microsoft Research India, Bangalore.

1. Introduction

Text-dependent speaker-recognition systems can be of two types: fixed-text and variable-text. In the fixed-text case, the text can be a single word or a multiple-word sentence, treated as the same during training and testing, i.e., the given text 'password' is fixed. Matching is typically done by the dynamic time warping (DTW) algorithm in an isolated-style manner, where the fixed-text training template is matched with the fixed-text test utterance [1], [2], [3]. In contrast, a variable-text system uses a pre-defined vocabulary of words from which any 'variable' text can be composed and specified as the user 'password' or a system 'prompt' (as in a prompted mode of operation). The variable-text thus has

the flexibility of allowing composition of 'password' phrases as a connected string of words from the specified vocabulary. Thus, even with a small vocabulary (such as the digits 0 through 9), a large variety of passwords can be composed, with the convenience of changing the password either by the user (from time to time), as with Personal Identification Numbers (PINs), or by the system, in the form of a randomly generated 'password prompt' each time the system is used; this is of paramount importance in security applications such as access control.

Interestingly, despite the advantages of variable-text operation over fixed-text operation, the variable-text type has received much less attention. The main results reported in the literature for variable-text systems are by [4] and [5]. Rosenberg et al. [4] use an isolated-style DTW, where the input utterance (supposed to be a string of words) is matched against the supposed 'password' text in the form of a concatenated sequence of the corresponding isolated word templates. Higgins et al. [5] use a connected-word recognition algorithm proposed in [6] to perform either an open recognition of the input utterance or a forced alignment. However, these approaches have the following shortcomings: i) the algorithm in [5] uses one averaged template per word per speaker for the forced-alignment matching, which is clearly inadequate for dealing with intra-speaker variability arising from idiosyncratic pronunciation variations of the speaker or from inter-session variability over time; ii) the isolated-style DTW matching in [4] cannot handle multiple templates; moreover, isolated-style matching also cannot handle continuous input utterances, in which there will be arbitrary inter-word pauses and co-articulations.
In this paper, we propose a multiple-template based variable-text speaker-recognition algorithm adopting the one-pass dynamic programming (DP) algorithm for the forced-alignment matching. The one-pass DP algorithm was originally proposed for connected word recognition [7]. The use of multiple templates enhances the performance of our system by efficiently handling intra-speaker variability. The use of inter-word silence templates allows users to speak the password freely in a continuous fashion, as the system can now handle arbitrary inter-word silences gracefully while also allowing for inter-word co-articulations. This adds a high degree of user convenience as well as system reliability, even while allowing some freedom in how the isolated word templates are defined (their end-pointing is no longer crucial).

Preliminary results of this algorithm are reported in [8]; in this paper, we present more elaborate results by extending this algorithm to all three types of speaker recognition, applying them to a larger speaker population, and employing the proposed algorithm to automatically extract multiple templates from continuous speech during the training phase. The three speaker-recognition systems reported in this paper, based on the proposed algorithm, are the closed-set speaker-identification system, the speaker-verification system and the open-set speaker-identification system. These systems were evaluated on 100-speaker and 200-speaker tasks using the TIDIGITS database and under various car-noise conditions. The key result of this paper is that the use of multiple templates enhances the performance of all three systems significantly. For instance, the use of multiple templates (in comparison to a single template) improves the CSI performance from 94% to 100%, the SV EER from 1.6% to 0.09% and the OSI EER from 12.3% to 3.5% on a 100-speaker task. We also use the proposed one-pass DP algorithm to automatically extract the multiple templates from continuous speech training data; the performance of the three systems using such automatically extracted templates is as good as with manually extracted templates. For real-life deployment, it is also important to make such systems robust to background noise. We deployed a noise-suppression technique at the front end, which enables our systems to deliver robust performance in heavy noise conditions (down to 0 dB car noise).

2. Proposed algorithm

The proposed variable-text speaker-recognition system based on the one-pass dynamic-programming (DP) matching algorithm, and the corresponding system architecture, are presented in Fig. 1. Here, each speaker has a set of templates for each word in the vocabulary. For example, for the word 'nine', there are four templates, R91, R92, R93, R94. Given an input utterance, the feature-extraction module converts the input speech into a sequence of feature vectors; we use mel-frequency cepstral coefficients (MFCCs) as the feature vector. This feature-vector sequence corresponds to the input 'password' text (say, the digit string 915 in the figure) and is presented to the forced-alignment module, together with the corresponding concatenated set of multiple reference templates for '9', '1' and '5' and the inter-word silence templates.

Figure 1: Proposed variable-text speaker-recognition algorithm based on one-pass DP matching with multiple templates

The one-pass DP algorithm matches the feature-vector sequence O = o1 o2 ... oT against the multiple-template and inter-word-silence based word-model sequence for speaker Si. The resulting match score Di = D(O, Txt | Si) is the optimal distance between the input utterance O and the word templates of speaker Si corresponding to the password text 'Txt'. This score is used in different ways in the three speaker-recognition systems, namely closed-set speaker-identification (CSI), speaker-verification (SV) and open-set speaker-identification (OSI). These systems are described in Sec. 3.

2.1. One-pass DP algorithm with multiple templates and inter-word silence templates

Fig. 2 illustrates the use of multiple templates in the proposed one-pass DP forced alignment between the input utterance (on the x-axis) and the word templates (on the y-axis). The same example password '915' as in Fig. 1 is used. Although multiple templates are used for all the words, for the sake of clarity only the multiple templates of the word '1' are shown on the y-axis. From the best warping path obtained by the one-pass DP algorithm in this example, it can be seen that template 2 of word '1' (R12) was chosen as the best matching template for that part (word '1') of the input utterance.

Figure 2: One-pass DP matching between test utterance and multiple training templates corresponding to password text

Fig. 3 illustrates a typical match by the proposed one-pass DP algorithm with templates for inter-word silences. In this example, the input utterance is the same ('915') as in Fig. 2, but it is spoken with silence before '9', silence between '1' and '5', and silence after '5'; there is no inter-word silence between '9' and '1', representing an inter-word co-articulation. The one-pass DP algorithm uses concatenated 'multiple' templates of each word in the password '915' as in Fig. 2, but with a silence template between adjacent words (for clarity, and to emphasize the handling of inter-word silence, only one template per word is shown in Fig. 3). The one-pass DP recursions now allow entry into any word either from a silence template or from one of the multiple templates of the predecessor word. Fig. 3 shows how the one-pass DP algorithm correctly decodes the input utterance, skipping the silence model between words '9' and '1'; the other inter-word silences are mapped to the corresponding silence templates.

Figure 3: One-pass DP matching with optional inter-word silences

We now state the dynamic-programming recursions, which are the heart of our one-pass DP algorithm, for the combined case of multiple templates and inter-word silence, illustrating how the warping paths shown in Figs. 2 and 3 are realized jointly. The recursions for two specific parts, one for the word templates and the other for the inter-word silence templates, are presented next.

2.1.1. Word template recursions

Fig. 4 shows the two main types of recursions, a) the within-word recursion and b) the across-word recursion, for the general case of any word template, in the context of the password sequence '915'. The general equations for these two types of recursions are:

Within-word recursion:

D(m, n, v) = d(m, n, v) + min_{n-2 <= j <= n} D(m-1, j, v)    (1)

Across-word recursion:

D(m, 1, v) = d(m, 1, v) + min{ D(m-1, 1, v), min_{u in Pred'(v)} D(m-1, N_u, u) }    (2)

Here, D(m, n, v) is the minimum accumulated distortion over all paths reaching the grid point defined by frame n of word template v and frame m of the input utterance; d(m, n, v) is the local distance between the n-th frame of the word-v template and the m-th frame of the input utterance. The within-word recursion applies to all frames of the word-v template other than the starting frame (i.e., n > 1). The across-word recursion applies to frame 1 of any word v, to account for a potential 'entry' into the word-v template from the last frame N_u of any of the words {u} that are valid predecessors of word v, i.e., Pred'(v) = {silence template Rsil, Pred(v)}; these are the valid predecessors of any word v, consisting of a silence template Rsil and the multiple templates Pred(v) of the word preceding word v in the 'password' text. For instance, if the 'password' text is 915 and v = 5, then Pred'(v = 5) = {Rsil, R11, R12, R13, R14}; likewise, Pred'(v = 1) = {Rsil, R91, R92, R93, R94}. This across-word recursion takes care of entry into any template of any word from a preceding silence template or from any template of the preceding word in the password text.

Figure 4: One-pass DP recursions for multiple training templates

2.1.2. Silence template recursions

Fig. 5 shows the recursions for an inter-word silence template, illustrated for the transition from any of the 4 templates of word '1' to the silence template between words '1' and '5'. The within-word and across-word recursions in this case are:

Within-word recursion:

D(m, n, v) = d(m, n, v) + min_{n-2 <= j <= n} D(m-1, j, v)    (3)

Across-word recursion:

D(m, 1, v) = d(m, 1, v) + min{ D(m-1, 1, v), min_{u in Pred(v)} D(m-1, N_u, u) }    (4)

Figure 5: One-pass DP recursions for optional inter-word silences

Here, all terms are the same as in the recursions of Sec. 2.1.1, except for the definition of Pred(v), where v is the inter-word silence template Rsil between two consecutive words in the password. Thus, Pred(v) is the set of multiple templates of the preceding word in the 'password' text. For instance, if the 'password' text is 915, then Pred(v = Rsil between 1 and 5) = {R11, R12, R13, R14}, i.e., the 4 templates of word '1'.
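The predecessor sets Pred'(v) and Pred(v) above can be made concrete in a few lines. The following is an illustrative sketch, not the paper's code: node names such as R91 and Rsil_0 are our own labels, and the trailing silence after the last word is omitted for brevity.

```python
# Sketch: building the predecessor sets for the password text "915" with
# 4 templates per word plus an inter-word silence template before each word.
# Pred'(v) for a word template = {preceding silence} U {templates of the
# preceding word}; Pred(v) for a silence template = templates of the
# preceding word only.

def build_predecessors(password, n_templates=4):
    pred = {}
    prev_word_templates = []              # templates of the preceding word
    for i, digit in enumerate(password):
        sil = f"Rsil_{i}"                 # silence template before word i
        pred[sil] = list(prev_word_templates)       # Pred(v) for silence
        templates = [f"R{digit}{k}" for k in range(1, n_templates + 1)]
        for t in templates:
            pred[t] = [sil] + list(prev_word_templates)  # Pred'(v) for word
        prev_word_templates = templates
    return pred

pred = build_predecessors("915")
print(pred["R51"])   # Pred'(v=5): the silence plus the 4 templates of word '1'
```

For v = 5 this reproduces the paper's example Pred'(v = 5) = {Rsil, R11, R12, R13, R14}.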

The above recursions together describe the one-pass DP recursion using multiple templates and inter-word silence templates for the forced-alignment matching required in variable-text speaker recognition. The best (lowest) score among D(T, Nr, r), r = 1, ..., L+1, where T is the last frame of the input utterance and r = 1, ..., L+1 indexes the L multiple templates of the last word in the password text and the final silence template (with Nr as their respective last frames), yields the minimum accumulated distance Di of the match between the input utterance and the 'password' text; this is used as the score for speaker i, whose word templates were used. The next section describes how Di is used for closed-set speaker-identification, speaker-verification and open-set speaker-identification.
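The combined recursions (1)-(4) can be written down as a small dynamic program. The following is a minimal illustrative sketch, not the authors' implementation: it uses toy one-dimensional 'features', made-up template names, and a `starts`/`finals` convention (which nodes may begin and end the path) that we introduce for the endpoints.

```python
import numpy as np

# Minimal sketch of the one-pass DP forced alignment of Sec. 2.1.
# `templates`: node name -> (N_v, dim) frame array; `pred`: node -> its valid
# predecessor templates (Pred'(v) for words, Pred(v) for silences);
# `starts`/`finals`: nodes allowed to begin/end the warping path.

def one_pass_dp(obs, templates, pred, starts, finals):
    T = len(obs)
    D = {v: np.full((T, len(r)), np.inf) for v, r in templates.items()}
    for m in range(T):
        for v, r in templates.items():
            for n in range(len(r)):
                loc = float(np.linalg.norm(obs[m] - r[n]))  # d(m, n, v)
                if n == 0:
                    # across-word recursion (2)/(4): stay in frame 1, or
                    # enter from the last frame of a valid predecessor
                    if m == 0:
                        best = 0.0 if v in starts else np.inf
                    else:
                        best = D[v][m - 1, 0]
                        for u in pred[v]:
                            best = min(best, D[u][m - 1, len(templates[u]) - 1])
                else:
                    # within-word recursion (1)/(3): j in {n-2, n-1, n}
                    best = D[v][m - 1, max(0, n - 2):n + 1].min() if m > 0 else np.inf
                D[v][m, n] = loc + best
    # best (lowest) accumulated distance ending at a final template's last frame
    return min(D[v][T - 1, len(templates[v]) - 1] for v in finals)

# Toy check: two templates for word A, one silence, one template for word B;
# the observation is A1 followed immediately by B1 (silence skipped).
A1 = np.array([[0.0], [1.0], [2.0]])
A2 = np.array([[0.0], [1.0], [2.5]])
Sil = np.array([[9.0]])
B1 = np.array([[5.0], [6.0]])
templates = {"A1": A1, "A2": A2, "Sil": Sil, "B1": B1}
pred = {"A1": [], "A2": [], "Sil": ["A1", "A2"], "B1": ["Sil", "A1", "A2"]}
obs = np.vstack([A1, B1])
score = one_pass_dp(obs, templates, pred, starts={"A1", "A2"}, finals={"B1"})
print(score)   # exact concatenation of A1 and B1 -> accumulated distance 0.0
```

In the toy check the best path enters B1 directly from the last frame of A1, i.e., the silence node is skipped exactly as in Fig. 3.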

3. Speaker-recognition systems

The previous section described how the proposed one-pass DP algorithm with multiple templates obtains the match score Di = D(O, Txt | Si), the optimal distance between the input utterance O and the word templates of speaker Si corresponding to the password text 'Txt'. In this section, we describe the three speaker-recognition systems, namely closed-set speaker-identification (CSI), speaker-verification (SV) and open-set speaker-identification (OSI), and how the above match score is used in different ways in these systems.

3.1. Closed-set speaker-identification (CSI)

In an N-speaker 'closed-set' SI problem, there are N decision alternatives, i.e., the input speech is classified as one of the N speakers. The score Di described above is computed for each of the N registered speakers in the system, and the speaker with the lowest score is declared the identified speaker.

3.2. Speaker-verification (SV)

In the speaker-verification problem, a two-alternative decision is made, i.e., the input utterance is classified as belonging to the claimed speaker identity or not. Here, the score Di corresponds to the match between the input utterance and the templates of the 'claimed speaker' Si. We perform a form of likelihood-ratio normalization on the one-pass DP score Di by dividing it by a background score, computed between the input utterance and the background speaker closest to the input utterance from among the remaining target-speaker set [5]. This normalized score is then compared to a threshold; the input speaker claim is accepted if the normalized score is less than the threshold and rejected otherwise. This is done for both target speakers and impostor speakers, and the probabilities of false rejection and false acceptance for the given threshold are determined as defined in Sec. 4.3. This further yields the ROC curve for varying thresholds.

3.3. Open-set speaker-identification (OSI)

In an N-speaker 'open-set' SI problem (unlike the closed-set case, where the input utterance is forcibly decided to be one of the N speakers), the system can also decide that the input speaker does not match any of the N registered speakers and reject the speaker.

Fig. 6 gives the basic structure of an open-set speaker-identification (OSI) system. Here, the system has a set of N registered speakers whose speaker models (word templates) are available through a training session. These N speakers correspond to the set of valid speakers the system is capable of recognizing. The

objective of an OSI system is as follows: if the test speaker is one of the N registered speakers, the system should correctly decide that the speaker belongs to the N-speaker set and also identify him; if the test speaker is not one of the N speakers, the system should correctly reject him as not belonging to the set of N registered speakers.
Figure 6: Generic structure of open-set speaker-identification

The OSI system has two main components: 1. closed-set speaker-identification and 2. speaker-verification. In the recognition phase, the input speech is first converted to a sequence of feature vectors (MFCCs), and the OSI system first performs closed-set speaker-identification (CSI), recognizing the input speaker as one of the N speakers. Let the recognized speaker-id be 'i'. Next, the speaker-verification (SV) system verifies that the input speaker is indeed speaker i, i.e., it assumes an identity claim of i for the input speaker and accepts or rejects the input speaker as i. Both the CSI and SV components are based on the one-pass dynamic programming (DP) algorithm as described above. The SV system inside the OSI system employs the same normalization as the stand-alone SV system of Sec. 3.2.
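The two-stage decision above (CSI followed by normalized SV) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it assumes the per-speaker one-pass DP distances Di are already computed, and the speaker ids, scores and threshold are made up.

```python
# Sketch of the two-stage OSI decision: stage 1 (CSI) picks the registered
# speaker with the lowest one-pass DP distance; stage 2 (SV) verifies that
# claim using the ratio normalization of Sec. 3.2 (claimed speaker's distance
# divided by the closest background speaker's distance).

def open_set_identify(scores, threshold):
    i = min(scores, key=scores.get)                 # CSI: closed-set argmin
    d_background = min(d for s, d in scores.items() if s != i)
    normalized = scores[i] / d_background           # likelihood-ratio-style score
    if normalized <= threshold:
        return i        # accept: input speaker is registered speaker i
    return None         # reject: not one of the N registered speakers

# Hypothetical distances D_i = D(O, Txt | S_i) for three registered speakers
scores = {"S1": 10.0, "S2": 14.0, "S3": 25.0}
print(open_set_identify(scores, threshold=0.9))                       # "S1"
print(open_set_identify({"S1": 13.5, "S2": 14.0, "S3": 25.0}, 0.9))   # None
```

In the first call the winner S1 is well separated from its closest competitor (10.0/14.0 ≈ 0.71 ≤ 0.9), so the claim is accepted; in the second, the winner is barely better than the background (13.5/14.0 ≈ 0.96 > 0.9), so the input is rejected as an impostor.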

4. Performance of proposed algorithms

4.1. Database

We have built three speaker-recognition systems, namely the closed-set speaker-identification (CSI) system, the speaker-verification (SV) system and the open-set speaker-identification (OSI) system, using the proposed algorithm. We evaluated the performance of these three systems using two speaker sets, one of 100 speakers and the other of 200 speakers, from the TIDIGITS database. The TIDIGITS database has a vocabulary of 11 words ('oh' and 0 to 9), with 77 continuously spoken digit-string utterances per speaker of lengths 1, 2, 3, 4, 5 and 7, comprising 22 utterances of length 1 and 11 utterances of each of the other lengths. We studied the proposed algorithm on test utterances of lengths 3, 4 and 5 digits pooled together. The training templates were excised from the 7-digit utterances, yielding up to 5 templates per word.

The 100-speaker set was first used to evaluate the performance of the three systems with manually excised training templates; this was to ensure that we measure the best performance achievable by the one-pass DP algorithm with multiple templates, without concern about template-extraction accuracy. Further, we excised the templates automatically in a training phase, using the proposed one-pass DP algorithm in a forced-alignment mode, and performed experiments on the same 100-speaker task to measure the differences between manual and automatic extraction. We then extended the three systems to the 200-speaker task, where the templates were extracted entirely automatically. While the results of Secs. 4.2, 4.3 and 4.4 are obtained with the manually extracted templates, Sec. 4.5 gives the results using automatically extracted templates in all three systems for both the 100- and 200-speaker tasks.

We also studied the performance of the proposed system under noisy conditions, with 'car' noise added digitally to the clean TIDIGITS database. The three noise levels studied are clean, 0 dB and -10 dB SNR. In all these cases, we have compared the performance of the system with and without noise suppression. The feature vectors used in the systems are MFCCs of dimension 12, obtained with an analysis frame size of 20 ms and an overlap of 10 ms.
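As a quick worked example of the 20 ms / 10 ms analysis framing behind the MFCC front end (the sampling rate is not stated above; 20 kHz, the usual TIDIGITS rate, is assumed here for illustration):

```python
# Frame count for a sliding-window analysis: 20 ms windows shifted so that
# consecutive windows overlap by 10 ms (i.e., a 10 ms frame shift).
# The 20 kHz sampling rate is an assumption, not stated in the paper.

def n_frames(n_samples, sr, win_ms=20, shift_ms=10):
    win = int(sr * win_ms / 1000)       # window length in samples
    shift = int(sr * shift_ms / 1000)   # frame shift in samples
    return 1 + (n_samples - win) // shift

# a 1-second utterance at 20 kHz: 400-sample window, 200-sample shift
print(n_frames(20000, 20000))   # 99 frames
```

Each frame then yields one 12-dimensional MFCC vector, so a 1-second utterance gives a feature sequence of roughly T ≈ 99 vectors.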

4.2. Closed-set speaker-identification system

Here, we present the performance of the closed-set speaker-identification (CSI) system on the 100-speaker task with manually extracted templates. 33 test utterances (11 from each of lengths 3, 4 and 5 digits) per speaker per SNR condition were used. Both training and test templates were subjected to noise removal. The performance metric used here is the %SID accuracy, defined as the percentage of correctly identified test utterances out of the total of 33 x 100 = 3300 trials for all 100 speakers put together. Fig. 7 gives the performance of the CSI system as a function of the number of templates used per word, for various SNR conditions.

Figure 7: Closed-set speaker-identification (CSI) system performance (%SID accuracy) for 1 to 5 training templates and for different SNR levels; 100 speakers

As seen in Fig. 7, the use of multiple templates clearly improves the CSI performance significantly at all SNR levels compared to the single-template version. In particular, the use of 5 templates yields 100% SID accuracy for SNR levels down to 0 dB. To deliver high performance across all noise conditions, we have used noise suppression at the front end during both training and testing. The impact of the noise suppression is also seen in Fig. 7 for the -10 dB condition: compared to the performance without noise suppression (dotted blue line), the performance with noise suppression (dashed blue line) is about 10% better. Even when noise suppression is not used, multiple templates are seen to deliver better performance than the single-template version.

4.3. Speaker-verification system

Here, we present the performance of the speaker-verification (SV) system on the 100-speaker task with manually extracted templates. For the speaker-verification system, 33 test utterances (11 from each of lengths 3, 4 and 5 digits) were used as target-speaker data, i.e., 33 target trials per target speaker per noise condition. For all 100 speakers, the total number of target trials is therefore Ntarget-trials = 33 x 100 = 3300. For each target-speaker trial, an impostor datum (a speaker and a sentence by that speaker) was generated from the non-target speakers (i.e., the other 99 speakers, chosen randomly). This yields 33 test utterances (11 from each of lengths 3, 4 and 5 digits) corresponding to a target speaker's 33 utterances, but with each impostor utterance coming from a different impostor speaker. Thus Nimpostor-trials is also 3300.

4.3.1. SV performance measures

If Nfr is the total number of times the system incorrectly rejects the claimed speaker, for a given threshold θ, the probability of false rejection is defined as

Pfr = Nfr / Ntarget-trials

If Nfa is the total number of times the system incorrectly accepts an impostor as the corresponding claimed speaker, for a given threshold θ, the probability of false acceptance is

Pfa = Nfa / Nimpostor-trials

This yields a point (Pfa, Pfr) in the Pfa-Pfr plane for the given θ, and varying θ yields the ROC curve.

4.3.2. SV performance results

ROC curves were obtained for 1 to 5 training templates and various SNR conditions. Fig. 8 shows the ROC curves for the clean condition for different numbers of templates. It can be clearly observed that the SV performance improves significantly with the use of multiple templates.

Figure 8: ROC curves for the speaker-verification (SV) system for 1 to 5 training templates per word in 'clean' condition; 100 speakers. (EER points: 1 template (1.15, 1.58); 2 templates (0.36, 0.30); 5 templates (0.06, 0.09).)

To bring out the effect of the normalization described in Sec. 3.2, Fig. 9 shows the ROC curves obtained without normalization, alongside the ROC curves obtained with normalization as in Fig. 8. It can be clearly seen that the normalization has a significant impact in reducing the EER of the SV system: the unnormalized EERs are several times higher than the normalized ones for a given number of templates.

Figure 9: ROC curves for the SV system for 1 to 5 training templates per word in clean condition, normalized and unnormalized; 100 speakers

Table 1 presents the equal-error-rate (EER) points (points on the ROC curve where Pfr = Pfa) for 1 to 5 templates and various SNR conditions. The results clearly show that a) the use of multiple templates indeed leads to improved SV performance, and b) the front-end noise suppression makes our proposed algorithm quite robust to noise down to -10 dB, particularly with multiple templates. Specifically, the SV system achieves a sub-0.1% EER down to 0 dB, which represents the best performance reported in the literature so far for such a large speaker population (from which the impostors are drawn).

4.4. Open-set speaker-identification system

To describe the evaluation of the open-set speaker-identification (OSI) system, we first describe the types of errors possible in an OSI system and how its performance is given in terms of ROC plots, as in speaker-verification (Sec. 4.3).

4.4.1. OSI performance measures

Referring back to Fig. 6 (Sec. 3.3), the overall output of the OSI system can be the following: 1. accept the input speaker as belonging to one of the N speakers, with speaker identity i, or 2. reject the input speaker as not belonging to the N-speaker set. In this process, various combinations of decisions occur at the closed-set speaker-identification (CSI) and speaker-verification (SV) stages taken together. Table 2 gives all the possible scenarios. As seen from the table, the system makes the following types of errors:

1. The speaker is a genuine speaker 'j' (one of N) and the system rejects him/her (after identifying him/her correctly as 'j' or incorrectly as 'k'). This is referred to as 'false rejection'.

2. The speaker is a genuine speaker 'j' (one of N) and the system accepts him/her as 'k' (after an incorrect identification by CSI as 'k'). This is referred to as 'false acceptance, type 1'. The SV system can also reject the valid speaker j while verifying the claim against speaker k (the incorrect result of the CSI stage); this results in a 'false rejection'.

3. The speaker is an impostor, but the system accepts him/her as one of the N speakers. This is referred to as 'false acceptance, type 2'.

Thus, for a given acceptance/rejection threshold θ used in the speaker-verification stage within the OSI system, we get two overall error measures:

1. the number of false rejections Nfr from a given set of valid target-speaker trials Ntarget-trials, and

2. the number of false acceptances Nfa = Nfa1 + Nfa2 from the target-speaker trials Ntarget-trials and impostor trials Nimpostor-trials put together.

The overall performance of the OSI system is measured in terms of the probabilities of false rejection and false acceptance, as for a speaker-verification (SV) system. The probability of false rejection (not detecting the speaker) for a given threshold θ is

pfr(θ) = Nfr / Ntarget-trials

where Ntarget-trials and Nfr are, respectively, the number of target trials and the number of those trials for which the target speaker was not detected (falsely rejected). The probability of false acceptance for a given threshold θ is

pfa(θ) = Nfa / (Ntarget-trials + Nimpostor-trials)

where (Ntarget-trials + Nimpostor-trials) and Nfa are, respectively, the total number of target and impostor trials and the number of those trials for which the input speaker was falsely accepted. These two measures pfr(θ) and pfa(θ) are combined in the receiver-operating-characteristic (ROC) curve by plotting pfr(θ) against pfa(θ) for various thresholds θ, as in the case of the SV system.
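The error accounting above (the cases of Table 2) can be sketched directly in code. This is an illustrative implementation of the definitions, not the paper's evaluation software; the trial tuples and the simple EER scan at the end are made up for demonstration.

```python
# OSI error accounting per Sec. 4.4.1: each trial is
# (normalized SV score, is_target, csi_correct); a claim is accepted iff
# score <= theta. Returns (p_fr, p_fa) for the given threshold.

def osi_error_rates(trials, theta):
    n_target = sum(1 for _, tgt, _ in trials if tgt)
    n_impostor = len(trials) - n_target
    n_fr = n_fa = 0
    for score, is_target, csi_correct in trials:
        accepted = score <= theta
        if is_target:
            if not accepted:
                n_fr += 1        # false rejection (correct or wrong CSI id)
            elif not csi_correct:
                n_fa += 1        # false acceptance, type 1
        elif accepted:
            n_fa += 1            # false acceptance, type 2
    return n_fr / n_target, n_fa / (n_target + n_impostor)

def eer_point(trials):
    # sweep candidate thresholds; the EER point is where p_fr and p_fa
    # are closest (a coarse scan over the observed scores)
    thetas = sorted({s for s, _, _ in trials})
    points = [osi_error_rates(trials, t) for t in thetas]
    return min(points, key=lambda p: abs(p[0] - p[1]))

# Five made-up trials: three targets (one with a wrong CSI id), two impostors
trials = [(0.5, True, True), (0.7, True, False), (1.2, True, True),
          (0.6, False, False), (1.5, False, False)]
print(osi_error_rates(trials, theta=1.0))
```

At θ = 1.0 the three target trials yield one rejection (p_fr = 1/3) and, of the accepted trials, one mis-identified target and one accepted impostor are counted as false acceptances over all five trials (p_fa = 2/5), matching the pfr and pfa definitions above.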

4.4.2. OSI performance results In the evaluation of the OSI system, we have used the following target-trials and impostor-trials from the TIDIGITS database. There are 100 target speakers, and for each target speaker, 33 test utterances (11 from each of lengths 3, 4 and 5 digits) were used as target speaker data; i.e., 33 target-trials per target speaker per noise condition. Impostor speakers were drawn from a set of 50 speakers outside the target set of 100 speakers. For all the 100 speakers, the total number of target trials are therefore Ntarget−trials = 33x100 = 3300. For each target speaker trial, an impostor data (speaker and sentence by that speaker) was generated from the set of 50 impostor speakers (chosen randomly) to yield test utterances such that there are 33 test utterances (11 utterances from each of lengths 3, 4 and 5 digits) corresponding to a target speaker’s 33 utterances, but with each impostor utterance

Table 1: EER for the speaker-verification (SV) system for multiple templates and SNRs; 100 speakers

No. of      Noise conditions (SNR in dB)
templates   Clean           0 dB            -10 dB          All SNRs
            Pfr    Pfa      Pfr    Pfa      Pfr    Pfa      Pfr    Pfa
1           1.58   1.15     1.85   1.73     5.42   4.33     3.09   2.24
2           0.30   0.36     0.45   0.33     2.27   2.15     1.21   0.74
3           0.15   0.09     0.18   0.24     1.52   1.58     0.40   0.76
4           0.09   0.09     0.15   0.21     1.45   0.94     0.66   0.29
5           0.09   0.06     0.09   0.12     0.94   0.79     0.58   0.22
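The gains from multiple templates in Table 1 come from scoring the test word against every enrolled template of that word and keeping the best match. A minimal sketch of this scoring, using plain DTW as a simplified stand-in for the paper's one-pass DP alignment (the function names are illustrative, not from the paper):

```python
import numpy as np

def dtw_distance(x, y):
    """Plain DTW distance between two feature sequences (a simplified
    stand-in for the one-pass DP alignment score)."""
    T, R = len(x), len(y)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for r in range(1, R + 1):
            d = np.linalg.norm(x[t - 1] - y[r - 1])
            D[t, r] = d + min(D[t - 1, r - 1], D[t - 1, r], D[t, r - 1])
    return D[T, R] / (T + R)   # length-normalised total distance

def word_score(test_word, templates):
    """Score a test word against all templates of that word for a
    speaker, keeping the best (minimum-distance) match."""
    return min(dtw_distance(test_word, tpl) for tpl in templates)
```

Taking the minimum over templates lets any one template that captures the speaker's current rendition of the word dominate the score, which is the mechanism behind the EER reduction as the template count grows.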

Table 2: Types of errors in the open-set speaker-identification (OSI) system

Input speaker        CSI output   SV output     Overall open-set error type
-------------------  -----------  ------------  -----------------------------------------
j (one of N;         j            Accept as j   NO ERROR (Correct Acceptance)
valid speaker)       (no error)   Reject as j   ERROR (False Rejection); Nfr = Nfr + 1
                     k (error)    Accept as k   ERROR (False Acceptance); Nfa1 = Nfa1 + 1
                                  Reject as k   ERROR (False Rejection); Nfr = Nfr + 1
m (not one of N;     j (any one   Accept as j   ERROR (False Acceptance); Nfa2 = Nfa2 + 1
impostor speaker)    of N)        Reject as j   NO ERROR (Correct Rejection)
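The decision cascade summarised in Table 2 (closed-set identification followed by verification of the identified speaker) can be sketched as follows; the per-speaker scores and helper name are hypothetical stand-ins for the one-pass DP distances:

```python
def osi_decide(scores, theta):
    """Two-stage open-set decision over per-speaker scores.

    scores: dict mapping registered speaker id -> alignment distance of
            the input utterance against that speaker's templates
            (lower = better match); theta: verification threshold.
    Stage 1 (CSI): pick the best-scoring registered speaker.
    Stage 2 (SV): accept that identity only if its score passes the
    threshold; otherwise reject the input as an impostor.
    """
    best = min(scores, key=scores.get)      # closed-set identification
    if scores[best] <= theta:               # verification of identity
        return ("accept", best)
    return ("reject", None)

# Toy example: "s2" matches best and passes the threshold.
print(osi_decide({"s1": 2.4, "s2": 0.8, "s3": 1.9}, theta=1.0))
# -> ('accept', 's2')
```

Note how the cascade produces exactly the error types of Table 2: a valid speaker misidentified at stage 1 yields either Nfa1 (accepted as 'k') or Nfr (rejected), while an impostor passing stage 2 yields Nfa2.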

being from a different impostor speaker. Thus Nimpostor-trials is also 3300. Fig. 10 shows the ROC curves of the open-set speaker-identification system for the clean condition and for 1, 3 and 5 templates. It can be clearly observed that open-set speaker-identification performance improves significantly with the use of multiple templates, with a dramatic lowering of the EER from 1 to 5 templates.

[Figure 10: ROC curves (probability of false rejection in % vs. probability of false acceptance in %) for the open-set speaker-identification (OSI) system for 1, 3 and 5 templates per word in the 'clean' condition; 100 target speakers; 50 impostor speakers. EER points for 1, 3 and 5 templates: (12.22, 12.36), (5.13, 5.18) and (3.47, 3.45) respectively.]

Table 3 presents the equal-error-rate (EER) points of the OSI system for multiple templates and various SNR conditions. As with the speaker-verification system, the results clearly show that the use of multiple templates leads to improved OSI performance (a reduction in EER from 12.3% for 1 template to 3.4% for 5 templates in the clean condition). Despite the front-end noise suppression, the performance tends to worsen with noise down to -10 dB. However, the use of multiple templates in each noise condition has helped improve the performance significantly (by factors of 2 and 3 for -10 dB and 0 dB respectively), pointing to the possibility that the use of multiple templates (perhaps cleaned templates from various SNR levels and noise conditions) could help increase the robustness of the system to arbitrary input noise conditions and SNR levels.

It should also be noted that the OSI system has a poorer EER than the speaker-verification system, as OSI is a harder task (it can be viewed as a 100-speaker speaker-verification system) and is implicitly coupled to the performance of the closed-set speaker-identification stage (whose errors propagate to the subsequent speaker-verification step); it is thus dependent on the speaker population, unlike the speaker-verification system, which is in principle independent of the population of registered speakers and depends more on the population of impostor speakers that can challenge a target speaker.

4.5. Results with automatically extracted templates

Secs. 4.2, 4.3 and 4.4 above presented results for a 100 speaker task where the multiple templates were extracted manually. In this manner, we could establish the best performances achievable by the one-pass DP algorithm with multiple templates, without concern about template-extraction accuracy. However, since it is impractical to use manually extracted templates, we also extracted the multiple templates automatically in a training phase using the proposed one-pass DP algorithm in a forced-alignment mode (against the known word-level orthographic transcriptions of the input training utterances, which were typically continuously spoken 7-digit strings). We first performed experiments on the same 100 speaker task to measure the differences

Table 3: EER for open-set speaker-identification (OSI) for multiple templates and SNRs; 100 target speakers; 50 impostor speakers

No. of      Noise conditions (SNR in dB)
templates   Clean             0 dB              -10 dB            All SNRs
            Pfr     Pfa       Pfr     Pfa       Pfr     Pfa       Pfr     Pfa
1           12.36   12.22     14.30   14.34     24.42   24.16     17.44   17.31
3           5.18    5.13      6.76    6.79      16.45   16.38     9.92    10.10
5           3.45    3.47      5.18    5.15      14.45   14.28     8.29    8.30

between manual and automatic extractions. We then extended the three systems to a 200 speaker task where the templates were extracted entirely automatically. In this section, we give the results using automatically extracted templates for all three systems on both the 100 and 200 speaker tasks.

Table 4: Performances of closed-set SID (CSI), speaker-verification (SV) and open-set SID (OSI) systems for 100 speakers and 200 speakers using manually and automatically derived word templates (5 templates) under 2 SNR conditions

                            Closed-set SID   Speaker-verification   Open-set SID
                   SNR      % accuracy       pfr      pfa           pfr     pfa
Manual             Clean    100              0.09     0.06          3.45    3.47
(100 speakers)     0 dB     100              0.09     0.12          5.18    5.15
Automatic          Clean    99.3             0.09     0.06          4.21    4.16
(100 speakers)     0 dB     98.4             0.12     0.09          5.94    5.86
Automatic          Clean    98.8             0.14     0.12          3.85    3.93
(200 speakers)     0 dB     97.8             0.21     0.21          5.36    5.47

Table 4 shows the performances of the three systems (CSI, SV and OSI) for 100 speakers and 200 speakers using manually and automatically derived word templates. The number of templates used here is 5. Results are shown for two SNR conditions: clean and 0 dB. The following can be noted from this table: a) The performance of the three systems on the 100 speaker task with automatically extracted templates is only marginally poorer than with manually extracted templates; this clearly brings out the effectiveness of the proposed one-pass DP algorithm for automatic template extraction as well, making it a consistent technique for both training and recognition in these systems. b) The performance of the three systems for 200 speakers is as good as with 100 speakers, showing the robustness of the algorithm to an increase in speaker population. The lower EER of the 200 speaker OSI system (compared with the 100 speaker case) is mainly due to the fact that we kept the impostor set size the same (50 speakers) in both cases. Since the proportion of impostors to target speakers is lower in the 200 speaker case, the corresponding overlapping error regions between the score distributions (contributing to false acceptances and false rejections) were smaller.
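The automatic template extraction above relies on forced alignment of a training utterance against its known word sequence. A minimal sketch of the idea, using plain DTW over the concatenated word references as a simplified stand-in for the one-pass DP algorithm (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def forced_align(frames, word_refs):
    """Segment `frames` into words by DTW against the concatenation of
    `word_refs` (one reference template per word of the known
    transcription).  Returns (start, end) frame indices per word."""
    ref = np.concatenate(word_refs)
    # word index owning each reference frame
    owner = np.concatenate([np.full(len(r), w) for w, r in enumerate(word_refs)])
    T, R = len(frames), len(ref)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    # back-pointers: 0 = advance both, 1 = advance input only, 2 = advance ref only
    back = np.zeros((T + 1, R + 1), dtype=int)
    for t in range(1, T + 1):
        for r in range(1, R + 1):
            d = np.linalg.norm(frames[t - 1] - ref[r - 1])
            choices = (D[t - 1, r - 1], D[t - 1, r], D[t, r - 1])
            k = int(np.argmin(choices))
            D[t, r] = d + choices[k]
            back[t, r] = k
    # backtrack, labelling every input frame with the word of its ref frame
    t, r = T, R
    frame_owner = np.empty(T, dtype=int)
    while t > 0:
        frame_owner[t - 1] = owner[r - 1]
        k = back[t, r]
        if k == 0:
            t, r = t - 1, r - 1
        elif k == 1:
            t -= 1
        else:
            r -= 1
    # word boundaries = maximal runs of frames owned by the same word
    bounds, start = [], 0
    for i in range(1, T):
        if frame_owner[i] != frame_owner[i - 1]:
            bounds.append((start, i))
            start = i
    bounds.append((start, T))
    return bounds

# Toy demo: 4 "zero" frames then 4 "one" frames, two word references
refs = [np.zeros((3, 2)), np.ones((3, 2))]
frames = np.vstack([np.zeros((4, 2)), np.ones((4, 2))])
print(forced_align(frames, refs))   # -> [(0, 4), (4, 8)]
```

The recovered (start, end) segments would then be cut from the training utterance and stored as the word templates; repeating this over several utterances yields the multiple templates per word.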

5. Conclusions

We have presented variable-text text-dependent speaker-recognition systems based on the one-pass dynamic programming (DP) algorithm with multiple templates. The use of multiple templates allows efficient handling of intra-speaker variability, delivering significant performance improvements over the single-template version. The proposed algorithm also uses inter-word silence templates, enabling the speaker-recognition systems to handle continuous input utterances. The one-pass DP algorithm is also used to automatically extract the multiple templates from continuous training speech utterances. The resulting 100-speaker and 200-speaker closed-set (and open-set) speaker-identification and speaker-verification systems demonstrated high performance and robustness in noisy conditions up to 0 dB. The results reported here represent the best reported so far for text-dependent speaker-recognition on such large populations.

