Robust Speaker Verification with Principal Pitch Components


INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 8(4), 323–339, 2005 © 2006 Springer Science + Business Media, LLC. Manufactured in The Netherlands. DOI: 10.1007/s10772-006-9048-4

ROBERT M. NICKEL, SACHIN P. OSWAL AND ANANTH N. IYER
Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802
[email protected]

Abstract. We present a new method that improves the accuracy of text dependent speaker verification systems. The method exploits a set of novel speech features derived from a principal component analysis of pitch synchronous voiced speech segments. We use the term principal pitch components (PPCs) or optimal pitch bases (OPBs) to denote the new feature set. Utterance distances computed from these new PPC features are only loosely correlated with utterance distances computed from cepstral features. A distance measure that combines both cepstral and PPC features provides a discriminative power that cannot be achieved with cepstral features alone. By augmenting the feature space of a cepstral baseline system with PPC features we achieve a significant reduction of the equal error probability, i.e. the operating point at which incorrect customer rejection and incorrect impostor acceptance are equally likely. The proposed method delivers robust performance in various noise conditions.

Keywords: speaker verification, speaker recognition, speaker identification, principal component analysis, pitch estimation, biometrics

1. Introduction

The majority of text dependent automatic speaker verification systems in use today employ cepstral features in combination with either dynamic time warping (DTW) (Campbell, 1997; Furui, 1981; Higgins et al., 1991a) or hidden Markov modeling (HMM) (Che et al., 1996). For text dependent speaker verification, cepstral features exhibit a discriminative power that is, as of now, unsurpassed by any other feature representation for speech (Campbell, 1997). It is, thus, not surprising that, in recent years, the research community has focused more on the pattern matching end and the noise reduction part of the verification problem (Sang et al., 2003; Singh et al., 2003; Reynolds et al., 2000; Furui, 1997). The success in these areas, however, warrants a revisit of the feature extraction problem since the performance of any speaker verification system is inherently limited by the discriminative power of the underlying speech feature. More recently, a number of novel speech features have been proposed. Gopalan et al. (1999) performed speaker identification experiments using

Fourier-Bessel representations of the speech waveform. They showed that speech features obtained from first order Bessel expansions are comparable in discriminative power to cepstral features. Assaleh (1995) proposed an orthogonal set of supplementary features to the cepstrum. These features, termed the sepstrum, were derived from the imaginary parts of the linear prediction (LP) poles. An adaptive component weighting scheme, which modifies the LP spectral components to emphasize the formant structure, was studied in Assaleh and Mammone (1994). Petry and Barone (2003) considered Lyapunov exponents derived from a state space representation of speech. Discrete wavelet transform coefficients (DWTC) were investigated by Bovbel et al. (2002). Thévenaz and Hugli (1995) considered the utility of the LPC residue. Carey et al. (1996) investigated how prosodic features can be employed. Instead of developing novel features, some studies focussed on how to obtain an optimal subset of cepstral coefficients. Pandit and Kittler (1998) derived an optimal feature selection technique for a DTW based


speaker verification system. Similarly, Haydar et al. (1998) proposed a feature selection optimization by means of a genetic algorithm.

The features that we are considering in this paper are derived from the local structure of voiced sections of the speech signal (Nickel and Williams, 2000). The term local structure refers to the characteristic shape of the waveform of each voiced pitch period. Since every speaker is bound to use the same vocal apparatus for each utterance, it is expected that the generated waveforms will bear striking similarities. By comparing suitably chosen waveforms from different utterances we should be able to obtain insight into the identity of the given speaker. The caveats of this approach are that (i) we must restrict ourselves to waveforms that are not chaotic in nature (i.e. utterly unpredictable), and (ii) we must verify that the variability in the articulator positions of the vocal tract stays within reasonable bounds between two utterances of the same word from the same speaker. Condition (i) is easily satisfied by excluding waveforms from unvoiced (and silent) sections of the utterance. Condition (ii) warrants an averaging procedure that focuses on the principal components of the observed waveforms (Nickel and Oswal, 2003). It is expected that a feature representation based on principal component waveforms alone will exhibit a larger variability for intra-speaker comparisons than cepstral features. If we can demonstrate, however, that the proposed new features are only loosely correlated with cepstral coefficients, then we can significantly improve the accuracy of state-of-the-art speaker verification systems by considering both features jointly in the verification process.

The details of the proposed feature extraction method and the proposed feature comparison procedure are presented in Section 2. Section 3 describes the experimental setup that is used to evaluate the discriminative power of the proposed feature space.
Section 4 summarizes our results.

2. Methods

We are using separate strategies for speaker enrollment (training) and speaker verification (testing). Principal pitch components (PPCs) are obtained by performing a principal component analysis on pitch synchronous segments from each voiced phonetic unit that is present in the enrollment signal. The PPCs are then used in the

verification procedure as a matched filter to detect similarities between the training utterance and the testing utterance. The details of the enrollment and verification procedure are described in Sections 2.2 and 2.3. Section 2.1 summarizes the signal preprocessing that is applied to both training and testing. Block diagrams that summarize the enrollment and verification procedures are shown in Figs. 4 and 5.

2.1. Signal Preprocessing

Utterances that enter the enrollment and verification procedure must first be subjected to a speech endpoint detection (EPD). The exact details of the employed EPD scheme are not germane to our point, and many alternative algorithms may be used. In our experiments, we chose the following one for its simplicity and robustness (Deller et al., 1999): In a first step, we extract voice activity identifiers such as the short-time absolute energy¹ (STAE), the zero crossing rate (ZCR), and the normalized short-time autocorrelation function at lag one (STAC1). The features are extracted on a frame-by-frame basis. For each frame, the features are fused into a voice activity decision with the method² proposed by Qiang and Youwei (1998). The voice activity decisions are then used to eliminate the silent regions at the beginning and the end of an incoming utterance.

2.2. Speaker Enrollment Procedure

The speaker enrollment procedure requires the following five steps: (1) a silence/voiced/unvoiced classification, (2) a pitch aligned signal segmentation, (3) a pitch class identification, (4) the computation of optimal pitch bases features, and (5) the computation of linear predictive cepstral features.

2.2.1. Silence/Voiced/Unvoiced Classification. A segmentation of incoming training utterances into silent, voiced, and unvoiced (SVU) regions is accomplished with a pattern classification approach proposed by Atal and Rabiner (1976). The employed features are: (1) the zero crossing rate (ZCR), (2) the short-time absolute energy (STAE), (3) the short-time normalized autocorrelation at lag one (STAC1), and (4) the short-time frame entropy (STFE). We relied on the short-time absolute energy (instead of the short-time log energy, as originally proposed by Atal and Rabiner) to strengthen the classification in the presence of babble and


background noise (Qiang and Youwei, 1998). The incorporation of a short-time entropy measure helps to suppress the effects of non-stationary noises such as mechanical sounds (Huang and Yang, 2000). A variety of existing approaches combine SVU classification with pitch estimation. Atal and Rabiner (1976) suggested that these techniques suffer from the disadvantage of assuming voiced periodicity. Voiced speech is only approximately periodic due to its time varying envelope and its (generally) time varying pitch contour. Separating the speech classification task from the pitch tracking algorithm allows a more accurate classification for short utterances. We are, hence, using separate algorithms for the SVU classification described above and the pitch estimation described in the following section.

2.2.2. Pitch Aligned Segmentation. In a next step we extract pitch synchronous segments from the voiced portions of the incoming training utterance. The extraction of the segments is divided into two steps. First, we use the algorithm proposed by Medan et al. (1991) to estimate the time-varying pitch of the voiced speech sections. Second, we employ the resulting pitch contour in a peak picking algorithm to identify and isolate individual pitch periods. According to Medan et al., the best pitch estimate at each time is found by maximizing a normalized correlation measure between locally adjacent pitch frames. The algorithm uses normalization to counter the intensity variation that may exist between two successive periods (Medan et al., 1991). The original algorithm proposed by Medan et al. also uses an interpolation procedure to increase the resolution of the pitch estimate.³ Their interpolation procedure, however, is omitted in our work at this time. We may study the possibly positive effect of the interpolation in future research.
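The core idea of the pitch estimate, maximizing a normalized correlation between two adjacent frames of candidate length, can be illustrated with a minimal sketch. This is not the algorithm of Medan et al. in full: the interpolation step is left out (as in the text), and the function name `estimate_period` and the search bounds `p_lo` and `p_hi` are hypothetical choices made only for this example.

```python
import math

def estimate_period(s, n, p_lo, p_hi):
    """Return (period, score): the candidate period in samples whose two
    adjacent frames starting at index n have the highest normalized
    correlation. The search bounds p_lo and p_hi are hypothetical."""
    best_p, best_rho = None, -2.0
    for p in range(p_lo, p_hi + 1):
        if n + 2 * p > len(s):
            break  # not enough samples left for two full frames
        x = s[n:n + p]
        y = s[n + p:n + 2 * p]
        num = sum(a * b for a, b in zip(x, y))
        den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
        rho = num / den if den > 0.0 else 0.0  # normalization removes gain changes
        if rho > best_rho:
            best_p, best_rho = p, rho
    return best_p, best_rho

# Synthetic "voiced" signal: period of 50 samples with a slowly rising envelope.
s = [(1.0 + 0.002 * n) * math.sin(2.0 * math.pi * n / 50.0) for n in range(400)]
period, score = estimate_period(s, 0, 20, 120)
```

Because the correlation is normalized, a pure gain change between successive periods does not lower the score, which is the property the text attributes to the normalization.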
The resulting pitch contour p[n] assigns an instantaneous fundamental period measure to each time index n of a voiced speech segment s[n]. The maximum and the minimum of the pitch are computed over the set $N_{\text{voiced}}$ of all time indices of voiced speech samples s[n]:

$$ p_{\max} = \max_{n \in N_{\text{voiced}}} \{\, p[n] \,\} \tag{1} $$

$$ p_{\min} = \min_{n \in N_{\text{voiced}}} \{\, p[n] \,\} \tag{2} $$

Figure 1. An illustration of the peak picking procedure that is used in the pitch aligned segmentation described in Section 2.2.2.

The estimates $p_{\max}$, $p_{\min}$, and p[n] are then used in a peak picking algorithm to identify the most predominant peaks (positive or negative) within each pitch period of the incoming training utterance s[n]. The procedure is illustrated in Fig. 1. We begin by identifying the predominant peak (positive or negative) within a frame of length $p_{\max}$ starting at the beginning of a voiced speech section. We indicate the index of this first peak with $n_1$. We then identify the predominant peak in a frame of length $p_{\min}$ that is centered at index $n_1 + p[n_1]$. The index of the second peak is indicated with $n_2$. We continue by considering the current peak at index $n_k$ and searching for the next peak in a frame of length $p_{\min}$ that is centered at index $n_k + p[n_k]$ until the end of the voiced section under consideration. The locations of the chosen peaks $n_k$ serve as a simple means to align similar glottal events across neighboring frames. We construct pitch synchronous frames $\mathbf{s}[k]$ by extracting a 20 msec long segment symmetrically around each predominant peak:

$$ \mathbf{s}[k] = [\; s[n_k - L] \;\ldots\; s[n_k] \;\ldots\; s[n_k + L] \;]^T \tag{3} $$

The parameter L denotes the number of samples that corresponds to a segment half-length of 10 msec.
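The peak picking procedure and the frame extraction of Eq. (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the pitch contour `p` is given as an integer period per sample, the "predominant peak" is taken to be the sample of largest absolute value in the search window, and frames that would reach beyond the signal boundaries are skipped. The function names are hypothetical.

```python
def pick_peaks(s, p, p_max, p_min):
    """s: samples of one voiced section; p: pitch period (in samples) for
    every sample index; returns the indices n_k of the predominant peaks."""
    def dominant(lo, hi):
        lo, hi = max(lo, 0), min(hi, len(s))
        # largest magnitude, so positive and negative peaks both qualify
        return max(range(lo, hi), key=lambda n: abs(s[n]))

    peaks = [dominant(0, p_max)]          # first peak: frame of length p_max
    while True:
        center = peaks[-1] + p[peaks[-1]]
        lo = center - p_min // 2          # frame of length p_min around center
        hi = lo + p_min
        if hi > len(s):
            break                         # end of the voiced section
        peaks.append(dominant(lo, hi))
    return peaks

def pitch_frames(s, peaks, L):
    """Extract the 2L+1 sample frames of Eq. (3) around each peak."""
    return [s[n - L:n + L + 1] for n in peaks if n - L >= 0 and n + L < len(s)]

# Toy example: one pulse every 40 samples, constant pitch contour.
s = [1.0 if n % 40 == 10 else 0.0 for n in range(400)]
p = [40] * len(s)
peaks = pick_peaks(s, p, p_max=40, p_min=40)
frames = pitch_frames(s, peaks, L=8)
```

For the TI46 sampling rate of 12500 Hz mentioned in Section 3.1, a half-length of 10 msec would correspond to L = 125 samples; the small value used above only keeps the toy example short.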

2.2.3. Identification of Pitch Classes. For the proposed verification strategy it is necessary to classify all voiced speech segments $\mathbf{s}[k]$ into different phonetic units (Furui, 1997). Each voiced phonetic unit within an utterance is referred to as a pitch class with index i. The division of segments into different pitch classes is accomplished with Klatt's weighted spectral slopes (WSS) (Klatt, 1982). Weighted spectral slopes represent a simple spectral similarity measure that characterizes phonemic differences between frames. The slopes are computed over twenty-five overlapping filters with increasing bandwidth. The employed center frequencies $f_q$ and bandwidths $\Delta f_q$ for $q = 1, 2, \ldots, 25$ are listed in Table 1.

Table 1. WSS center frequencies and bandwidths.

Filter number q    Center frequency f_q (Hz)    3 dB filter bandwidth Δf_q (Hz)
 1                   50.00                        70.00
 2                  120.00                        70.00
 3                  190.00                        70.00
 4                  260.00                        70.00
 5                  330.00                        70.00
 6                  400.00                        70.00
 7                  470.00                        70.00
 8                  540.00                        77.37
 9                  617.37                        86.01
10                  703.38                        95.34
11                  798.72                       105.41
12                  904.13                       116.26
13                 1020.38                       127.91
14                 1148.30                       140.42
15                 1288.72                       153.82
16                 1442.54                       168.15
17                 1610.70                       183.46
18                 1794.16                       199.78
19                 1993.93                       217.15
20                 2211.08                       235.63
21                 2446.71                       255.26
22                 2701.97                       276.07
23                 2978.04                       298.13
24                 3276.17                       321.47
25                 3597.63                       346.14

For each voiced pitch segment $\mathbf{s}[k]$ we estimate the average log power $E_q$ in frequency band q via

$$ E_q(\mathbf{s}[k]) = \log \int_0^{\pi} S_k(\omega)\, H_q(\omega)\, d\omega \tag{4} $$

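in which $S_k(\omega)$ represents the power spectrum of segment $\mathbf{s}[k]$:

$$ S_k(\omega) = \left| \sum_{n=-L}^{L} s[n_k + n]\, e^{j\omega n} \right|^2 \tag{5} $$

and $H_q(\omega)$ is a Gaussian averaging window:

$$ H_q(\omega) = c_q \exp\left\{ -\left( \frac{\omega}{2\pi T\, \Delta f_q} - \frac{f_q}{\Delta f_q} \right)^{2} \ln 4 \right\} \tag{6} $$

with $c_q$ such that $\int_0^{\pi} H_q(\omega)\, d\omega = 1$ for $q = 1, \ldots, 25$. The average power estimates $E_q(\mathbf{s}[k])$ are converted into weighted spectral slopes $Q_q(\mathbf{s}[k])$:

$$ Q_q(\mathbf{s}[k]) = W_{q,k} \cdot [\, E_{q+1}(\mathbf{s}[k]) - E_q(\mathbf{s}[k]) \,]. \tag{7} $$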

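The adaptive weights $W_{q,k}$ are designed to emphasize peak and valley locations in the given spectrum. They are computed according to recommendations by Klatt (1982) as

$$ W_{q,k} = W_{q,k}^{\text{global}} \cdot W_{q,k}^{\text{local}} \tag{8} $$

with

$$ W_{q,k}^{\text{global}} = \frac{20}{20 + \max_{m=1,\ldots,25} \{ E_m(\mathbf{s}[k]) \} - E_q(\mathbf{s}[k])} $$

$$ W_{q,k}^{\text{local}} = \frac{1}{20 + \max_{\underline{\omega}_q \le \omega \le \overline{\omega}_q} \{ \log S_k(\omega) \} - E_q(\mathbf{s}[k])} $$

in which $\underline{\omega}_q = 2\pi\,( f_q - \Delta f_q / 2 )$ and $\overline{\omega}_q = 2\pi\,( f_q + \Delta f_q / 2 )$.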
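The chain from power spectrum to weighted spectral slopes in Eqs. (4) to (8) can be sketched numerically as follows. Several points are assumptions made for this illustration only: T is read as the sampling period (so that ω/(2πT) is a frequency in Hz), the logarithm is the natural logarithm and is applied to the band-averaged power, the integrals are replaced by sums on a uniform grid of `n_bins` frequencies, and a small floor of 1e-12 avoids log(0). The function names are hypothetical.

```python
import cmath
import math

def power_spectrum(seg, n_bins):
    """|sum_n seg[n] e^{j w n}|^2 on n_bins frequencies w in [0, pi), Eq. (5)."""
    L = len(seg) // 2
    spec = []
    for b in range(n_bins):
        w = math.pi * b / n_bins
        acc = sum(x * cmath.exp(1j * w * (i - L)) for i, x in enumerate(seg))
        spec.append(abs(acc) ** 2)
    return spec

def weighted_spectral_slopes(seg, fs, centers, bandwidths, n_bins=256):
    """Return the slopes Q_q of Eq. (7) for one pitch frame sampled at fs Hz."""
    S = power_spectrum(seg, n_bins)
    freqs = [0.5 * fs * b / n_bins for b in range(n_bins)]  # grid in Hz
    E, local_max = [], []
    for fq, dfq in zip(centers, bandwidths):
        # Gaussian window of Eq. (6), normalized to unit sum on the grid
        H = [math.exp(-((f - fq) / dfq) ** 2 * math.log(4.0)) for f in freqs]
        band_power = sum(h * s for h, s in zip(H, S)) / sum(H)
        E.append(math.log(band_power + 1e-12))               # Eq. (4)
        in_band = [s for f, s in zip(freqs, S)
                   if fq - dfq / 2.0 <= f <= fq + dfq / 2.0]
        local_max.append(math.log(max(in_band, default=0.0) + 1e-12))
    e_max = max(E)
    Q = []
    for q in range(len(E) - 1):
        w_global = 20.0 / (20.0 + e_max - E[q])
        w_local = 1.0 / (20.0 + local_max[q] - E[q])
        Q.append(w_global * w_local * (E[q + 1] - E[q]))      # Eqs. (7), (8)
    return Q

# Toy frame: a 400 Hz tone at 12500 Hz, using the first eight filters of Table 1.
fs = 12500.0
seg = [math.sin(2.0 * math.pi * 400.0 * n / fs) for n in range(251)]
centers = [50.0, 120.0, 190.0, 260.0, 330.0, 400.0, 470.0, 540.0]
bandwidths = [70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 77.37]
Q = weighted_spectral_slopes(seg, fs, centers, bandwidths)
```

In this toy case the slope is positive on the rising side of the spectral peak (between the 330 Hz and 400 Hz filters) and negative on the falling side, which is the behavior the slopes are meant to capture.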

The psychoacoustic motivation for the proposed weighting is comprehensively described in Klatt (1982). In our experiments we found that a WSS metric performs better than other measures such as the Itakura-Saito distance or the log-likelihood ratio (Deller et al., 1999). For the remainder of this section we use the notation $\mathbf{Q}(\mathbf{s}[k])$ to indicate the vector of weighted spectral slopes of segment $\mathbf{s}[k]$:

$$ \mathbf{Q}(\mathbf{s}[k]) = [\, Q_1(\mathbf{s}[k]) \;\; Q_2(\mathbf{s}[k]) \;\ldots\; Q_{24}(\mathbf{s}[k]) \,]^T. \tag{9} $$

If i denotes the index of a pitch class and $k_i$ denotes the index of the first segment in the class, then an average WSS measure $\bar{\mathbf{Q}}_i[k]$ for class i is calculated for every new incoming pitch frame $\mathbf{s}[k]$ with

$$ \bar{\mathbf{Q}}_i[k] = \frac{1}{k - k_i + 1} \sum_{p=k_i}^{k} \mathbf{Q}(\mathbf{s}[p]) \quad \text{for } k \ge k_i. \tag{10} $$

A new pitch class is formed when the new incoming pitch frame has a WSS measure that deviates from the average WSS measure of the current class by more than a fixed threshold ρ:

$$ \text{if } \left\| \bar{\mathbf{Q}}_i[k] - \mathbf{Q}(\mathbf{s}[k+1]) \right\| > \rho \;\text{ then }\; k_{i+1} = k + 1. \tag{11} $$

The process is repeated for the next class until all frames $\mathbf{s}[k]$ are sequentially assigned to a unique pitch class i. We found that a threshold of ρ = 15 works best with the training data described in Section 3. Note that the classification of each segment is based on a comparison with the (temporally) previous class only. Segments that belong to equivalent phonetic units, but have disjoint occurrences in time, are put into different pitch classes. All segments that belong to the same class i are collected to form the class matrix $\mathbf{C}_i$:

$$ \mathbf{C}_i = [\; \mathbf{s}[k_i] \;\; \mathbf{s}[k_i + 1] \;\ldots\; \mathbf{s}[k_{i+1} - 1] \;]^T \tag{12} $$

Inaccurate peak picking during the segmentation stage sometimes leads to a formation of invalid classes. Hence, class matrices that contain fewer than 4 segments are purged.

2.2.4. Optimal Pitch Bases Expansions (OPB). An optimal pitch bases expansion (Nickel and Oswal, 2003) is used to derive a PPC feature that represents the pitch class. Each class matrix $\mathbf{C}_i$ is subjected to a singular value decomposition (SVD):

$$ \mathbf{C}_i = \mathbf{U}_i \mathbf{D}_i \mathbf{V}_i \tag{13} $$

$$ \text{with } \mathbf{V}_i^T = [\; \mathbf{v}_i^1 \;\; \mathbf{v}_i^2 \;\; \mathbf{v}_i^3 \;\ldots\; ] \tag{14} $$

with $\mathbf{U}_i$ and $\mathbf{V}_i$ being unitary matrices and $\mathbf{D}_i$ being a diagonal matrix. The singular values contained in $\mathbf{D}_i$ are assumed to be sorted in decreasing order. The first column vector of $\mathbf{V}_i^T$ (represented by $\mathbf{v}_i^1$) serves as the PPC feature for class i. Figure 2 displays the normalized average value⁴ $\mu_S$ and the standard deviation $\sigma_S$ of the 10 largest singular values of matrix $\mathbf{D}_i$ computed over all classes from all training and testing utterances from the TI46 data set (see Section 3). The average magnitude of the singular values tends to decay rapidly. Pitch classes are thus well represented⁵ by a projection onto a subset of eigenvectors $\mathbf{v}_i^1, \mathbf{v}_i^2, \ldots, \mathbf{v}_i^N$ with a small N. In our work we have so far considered the case N = 1 only. An incorporation of more eigenvectors is theoretically possible, but leads to normalization problems during feature matching and an increase in the computational burden in the testing stage.

Figure 2. A statistical analysis of the normalized average magnitude of the first 10 singular values in matrix Di from Eq. (14) for the entire TI46 data set. [Plot "Singular Value Statistics (TI46)": relative value (0 to 1) versus singular value number (1 to 10).]

2.2.5. OPB Feature Matrices. Direct feature matching at the testing stage is impeded by the fact that the pitch aligned segmentations of training and testing utterances are not readily synchronized between utterances. In order to measure utterance similarities we propose a scheme that employs OPB feature matrices P in addition to the PPC vectors $\mathbf{v}_i^1$. An OPB feature matrix P provides a similarity measure that is obtained on a fixed grid, with a fixed window length of 2L + 1 samples (corresponding to 20 msec) and a fixed window shift offset of M samples (corresponding to 10 msec). Matrices P are computed in the following way. First, we generate a PPC similarity measure by computing the normalized correlation of the principal pitch component $\mathbf{v}_i^1$ with the training utterance s[n] itself. Mathematically, we can express the correlation in two steps: (i) we divide the training utterance into segments of length 2L + 1 via

$$ \mathbf{s}_n = [\; s[n - L] \;\ldots\; s[n] \;\ldots\; s[n + L] \;]^T, \tag{15} $$

and (ii) we compute the normalized inner product of the segments with the PPC feature for each class i and each time instant n, i.e.

$$ \vartheta_i[n] = \frac{1}{\| \mathbf{s}_n \|}\; \mathbf{s}_n^T \mathbf{v}_i^1. \tag{16} $$

In practice one would use an FFT based fast correlation algorithm to compute $\vartheta_i[n]$ from $\mathbf{s}_n$ for all n. In order to synchronize the resulting PPC match with a non pitch synchronous segmentation we pick the absolute maximum of the correlation $\vartheta_i[n]$ within each non synchronous frame of length 2L + 1 as the match measure for that frame:

$$ \varphi_i[m] = \max_{-L \,\le\, k \,\le\, +L} \{\, | \vartheta_i[Mm - k] | \,\} \tag{17} $$

Parameter M denotes the window shift that is used in the non synchronous frame segmentation. The match measures for each class i at frame number m are then read into a feature vector $\boldsymbol{\phi}[m]$:

$$ \boldsymbol{\phi}[m] = [\; \varphi_1[m] \;\; \varphi_2[m] \;\; \varphi_3[m] \;\ldots\; ]^T. \tag{18} $$

The collection of all feature vectors $\boldsymbol{\phi}[m]$ for all frame numbers m forms the OPB feature matrix P:

$$ \mathbf{P} = [\; \boldsymbol{\phi}[0] \;\; \boldsymbol{\phi}[1] \;\; \boldsymbol{\phi}[2] \;\ldots\; ]. \tag{19} $$
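The steps of Sections 2.2.3 to 2.2.5 (class formation by thresholding, PPC extraction, and construction of the OPB feature matrix) can be summarized in a small sketch. It rests on several assumptions that are not taken from the text: the deviation in Eq. (11) is measured with the Euclidean norm, the first right singular vector is obtained with a plain power iteration on the class matrix as a dependency-free stand-in for a full SVD, correlations are set to zero where a full segment is not available, and the direct (non-FFT) correlation is used for brevity. All function names are hypothetical.

```python
import math

def split_classes(Q, rho=15.0, min_size=4):
    """Q: one WSS vector per pitch frame. Returns (start, stop) frame ranges
    of the pitch classes; classes with fewer than min_size frames are purged."""
    classes, start, total = [], 0, list(Q[0])
    for k in range(len(Q) - 1):
        mean = [t / (k - start + 1) for t in total]         # running average, Eq. (10)
        if math.dist(mean, Q[k + 1]) > rho:                 # boundary test, Eq. (11)
            classes.append((start, k + 1))
            start, total = k + 1, list(Q[k + 1])
        else:
            total = [t + x for t, x in zip(total, Q[k + 1])]
    classes.append((start, len(Q)))
    return [c for c in classes if c[1] - c[0] >= min_size]

def principal_pitch_component(C, iters=50):
    """First right singular vector of the class matrix C (rows are pitch
    frames), approximated by power iteration on C^T C."""
    norm = math.sqrt(sum(x * x for x in C[0]))
    v = [x / norm for x in C[0]]                            # start from the first frame
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, v)) for row in C]           # C v
        v = [sum(p * row[j] for p, row in zip(proj, C)) for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def opb_matrix(s, ppcs, L, M):
    """Return P as a list of feature vectors phi[m], Eqs. (15) to (19)."""
    thetas = []
    for v in ppcs:
        theta = [0.0] * len(s)
        for n in range(L, len(s) - L):
            seg = s[n - L:n + L + 1]
            norm = math.sqrt(sum(x * x for x in seg))
            if norm > 0.0:
                theta[n] = sum(a * b for a, b in zip(seg, v)) / norm       # Eq. (16)
        thetas.append(theta)
    P = []
    for m in range((len(s) - 1) // M + 1):
        lo, hi = max(M * m - L, 0), min(M * m + L + 1, len(s))
        P.append([max(abs(x) for x in th[lo:hi]) for th in thetas])        # Eq. (17)
    return P

# Toy example: one waveform shape w observed with four different gains.
w = [math.exp(-0.3 * abs(i - 5)) * math.cos(1.2 * (i - 5)) for i in range(11)]
C = [[g * x for x in w] for g in (1.0, 0.8, 1.2, 0.9)]
v = principal_pitch_component(C)
s = [0.0] * 60
s[25:36] = w                      # the same shape, centered at sample 30
P = opb_matrix(s, [v], L=5, M=5)
```

The sign of the extracted vector is arbitrary, which does not matter here because Eq. (17) takes the absolute value of the correlation. In the toy example the frame centered on the embedded waveform reaches a match value of one, while frames that contain no signal stay at zero.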

The top panel of Fig. 3 shows a typical example of a correlation template P derived from 8 pitch classes. The PPC match regions (dark) follow a diagonal path in accordance with the sequential selection of the associated pitch classes. It is visible that pitch classes 3, 4, and 5 were apparently chosen from the beginning, middle, and end of the same phonetic unit. The separation of phonetic units into sub-units is a consequence of the adaptive classification described in Section 2.2.3. Note that, unlike in the actually proposed procedure, the matrices in Fig. 3 were computed from the incoming signal s[n] before the end point detection was applied. The lack of any pitch match in the top panel before frame 20 and after frame 93 is due to the silence before and after the utterance.


Figure 4. A block diagram that illustrates the proposed speaker enrollment procedure.

2.2.6. The Baseline System. As indicated in Fig. 4, we are augmenting our PPC feature extraction method with a procedure that also extracts linear predictive cepstral coefficients (LPCCs). The LP cepstral coefficients are the features of a baseline system that serves as the foundation for the proposed method. The details of the LPCC computation are summarized as follows. After the removal of leading and trailing silences, the incoming speech signal is segmented into 20 msec long Hamming windowed frames with a 50% overlap. The signal segmentation grid of the baseline system must be the same as the segmentation grid used for the OPB feature matrices P. An LPC analysis computes autoregressive (AR) coefficients for each frame via the autocorrelation method (Rabiner and Schaefer, 1978). Cepstral coefficients are computed recursively from the AR coefficients (Rabiner and Juang, 1993, Eqs. (3.83b/3.83c)). The resulting LPCCs are weighted with a sinusoidally shaped bandpass lifter (Rabiner and Juang, 1993, Eq. (3.89)). Table 2 summarizes the parameters involved in the baseline feature computation. Since we are using the same segmentation grid for the computation of the LPCC feature matrix and the OPB feature matrix we can assume that both matrices are time aligned.
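A compact sketch of the baseline feature computation for a single frame is given below. It uses textbook forms that are assumed, not verified, to match the cited equations: a Levinson-Durbin recursion for the autocorrelation method, the standard recursion from predictor coefficients to cepstral coefficients for an all-pole model with predictor s[n] ≈ Σ a_k s[n−k], and a raised-sine lifter w_m = 1 + (Q/2) sin(πm/Q). The function names are hypothetical; the default order of 22 and the 24 cepstral coefficients follow the parameters named for the baseline in Table 2.

```python
import math

def levinson(r, order):
    """Predictor coefficients a[1..order] (a[0] unused) from autocorrelation r."""
    a, err = [0.0] * (order + 1), r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, err = new_a, err * (1.0 - k * k)
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model defined by a."""
    order = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= order else 0.0
        for k in range(max(1, m - order), m):
            acc += (k / m) * c[k] * a[m - k]
        c[m] = acc
    return c[1:]

def lifter(n_ceps):
    """Raised-sine (sinusoidal) cepstral weights for m = 1..n_ceps."""
    return [1.0 + (n_ceps / 2.0) * math.sin(math.pi * m / n_ceps)
            for m in range(1, n_ceps + 1)]

def lpcc_frame(frame, order=22, n_ceps=24):
    """Weighted LPCC vector of one frame (Hamming window applied here)."""
    N = len(frame)
    x = [f * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)))
         for n, f in enumerate(frame)]
    r = [sum(x[n] * x[n + k] for n in range(N - k)) for k in range(order + 1)]
    if r[0] == 0.0:
        return [0.0] * n_ceps          # silent frame: no model to estimate
    c = lpc_to_cepstrum(levinson(r, order), n_ceps)
    return [w * cm for w, cm in zip(lifter(n_ceps), c)]

# Sanity check on a one-pole model: r[k] = 0.5**k gives a_1 = 0.5, c_m = 0.5**m / m.
a = levinson([1.0, 0.5, 0.25], 2)
c = lpc_to_cepstrum(a, 4)
```

Splitting the computation into three small functions makes each recursion checkable on a case with a known closed-form answer, such as the one-pole example at the end.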


Figure 3. A typical example for OPB feature matrices P: the OPB template from the training utterance (top panel), a typical customer feature (middle panel), and a typical impostor feature (bottom panel).

2.2.7. The Training Template. The speaker enrollment procedure provides a training template for each training utterance. The training template consists of three parts: (i) the principal pitch components vi1 for each pitch class i from Section 2.2.4, (ii) the OPB feature matrix P from Section 2.2.5, and (iii) the weighted short-time LPC cepstral coefficients from the baseline system of Section 2.2.6.⁶

Table 2. Computation of baseline features.

Parameter                 Value
LPC order                 22
No. of cepstr. coeff.     24
Cepstral weights          Sinusoidal⁶
Segment length            20 msec
Segment overlap           10 msec
Frame window              Hamming

2.3. Speaker Verification Procedure

The verification procedure of the proposed method is significantly less complex than the enrollment procedure. Figure 5 summarizes the proposed verification system with a block diagram. After the preprocessing stage, the incoming testing utterance is subjected to an extraction of short-time LPC cepstral features (as outlined in Section 2.2.6) and the computation of an OPB feature matrix P (as outlined in Section 2.2.5). Note that in the verification stage we use the PPC features vi1 of the training utterance to compute the inner products ϑi[n] in Eq. (16). It is, thus, not necessary to perform an explicit PPC feature extraction on the testing utterances, which reduces the complexity of the testing stage significantly. The second panel in Fig. 3 shows an example for an OPB feature matrix P of a testing utterance from a customer, i.e. a person for whom the verification system was trained. The good match between the customer feature and the OPB template (first panel) is clearly visible. The third panel shows the OPB feature matrix computed for an impostor, i.e. a different speaker saying the same word. The obvious lack of match between the OPB template and the impostor feature is as expected.

The pattern comparison between the incoming short-time LPCC features and the LPCC template is accomplished with a dynamic time warping procedure (DTW) (Rabiner and Juang, 1993). We are using a simple DTW algorithm with path length normalization and with a local path constraint that allows steps up, right, and diagonal only. We found that uniform slope weighting and a relaxed endpoint constraint with a maximum offset of 7 frames performs best for the given training data (see Section 3). We use $d_c$ to denote the resulting LPCC distance between an incoming utterance and the training template. The optimal alignment path between the LPCC features is also used to align the OPB feature matrices P of the testing and the training utterance. The average Euclidean distance between the φ[m] feature vectors of both matrices along the prescribed path is used as the overall distance $d_o$ between the OPB feature matrices. The overall distance d between the training and the testing utterance is obtained from an appropriately weighted linear combination of the LPCC distance $d_c$ and the OPB feature distance $d_o$:

$$ d = d_c + w_f \cdot d_o \tag{20} $$
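The distance computation can be illustrated with a minimal sketch: a DTW with unit-weight steps up, right, and diagonal, path length normalization, reuse of the alignment path for the OPB feature vectors, and the combination of Eq. (20). The relaxed endpoint constraint mentioned in the text is deliberately left out, so both paths are forced to start and end at the corner points. The function names and the toy weight value are hypothetical.

```python
import math

def dtw_path(A, B):
    """Align two sequences of feature vectors. Returns the accumulated
    Euclidean distance divided by the path length, and the path itself."""
    n, m = len(A), len(B)
    D = [[math.inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = math.dist(A[i], B[j])
            if i == 0 and j == 0:
                D[i][j] = cost
                continue
            cands = []
            if i > 0:
                cands.append((D[i - 1][j], (i - 1, j)))          # step up
            if j > 0:
                cands.append((D[i][j - 1], (i, j - 1)))          # step right
            if i > 0 and j > 0:
                cands.append((D[i - 1][j - 1], (i - 1, j - 1)))  # diagonal step
            best, back[i][j] = min(cands)
            D[i][j] = best + cost
    path, node = [], (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return D[n - 1][m - 1] / len(path), path

def combined_distance(lpcc_train, lpcc_test, opb_train, opb_test, w_f):
    """d = d_c + w_f * d_o, with d_o averaged along the LPCC alignment path."""
    d_c, path = dtw_path(lpcc_train, lpcc_test)
    d_o = sum(math.dist(opb_train[i], opb_test[j]) for i, j in path) / len(path)
    return d_c + w_f * d_o

# Toy example: the test sequence repeats one frame of the template.
template = [[0.0], [1.0], [2.0]]
test_seq = [[0.0], [1.0], [1.0], [2.0]]
d_c, path = dtw_path(template, test_seq)
```

The sketch relies on the LPCC and OPB matrices sharing the same frame grid, as stated in Section 2.2.6, so that a path index pair found on the cepstral features can be applied directly to the OPB feature vectors.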

The optimal weighting factor $w_f$ was found by minimizing the equal error rate of the verification procedure over the training data described in Section 3.

Figure 5. A block diagram that illustrates the proposed speaker verification procedure.

3. Experiments

We evaluated the performance of the proposed procedure with a set of speaker verification experiments. The experiments were conducted with speech utterances from the TI46 database (Liberman et al., 1993; Doddington and Schalk, 1981) and the YOHO database (Campbell and Higgins, 1998; Higgins et al., 1991b). Both databases have disjoint training and testing sets for enrollment and verification. The experiments were performed as follows:

1. Each experiment consisted of a large number of enrollment and verification trials. The resulting distance measures dc, do and d (Eq. (20)) were recorded for all same-speaker (customer) and different-speaker (impostor) trials.
2. Only text-dependent verification was considered, i.e. for each enrollment and verification trial only utterances of the same words were used.
3. We considered single-utterance training only, i.e. each enrollment process was performed on a single training utterance only (and not on multiple training utterances).
4. We evaluated the verification performance in three different modes: (i) with no noise, (ii) with additive white noise, and (iii) with additive cospeaker noise. Enrollment was always performed with clean (i.e. no noise) utterances in all three cases.

It should be pointed out that text dependent single-utterance training constitutes a particularly difficult case for most conventional verification systems. We chose to evaluate the proposed scheme in this context to gain a better understanding of the core correlation between cepstral features and PPC features. The performance of the employed baseline system is thus expected to be significantly worse than that reported for systems that utilize a more intensive training (see Section 4). The database specific details of our experiments are summarized in the following subsections.

3.1. TI46 Speech Corpus Experiments

In our experiments we used the TI20 subcorpus of the TI46 data set (Liberman et al., 1993). The TI20 set consists of recordings from 8 male and 8 female speakers uttering 20 isolated words.⁷ The recordings were performed in 9 sessions in a low noise sound isolation booth with a cardioid dynamic microphone that was positioned two inches away from the speaker's mouth and out of the speaker's breath stream. The utterances were sampled and digitized with a sampling rate of 12500 Hz. The data is divided into an enrollment set that consists of 10 utterances of each word from each speaker and a testing set that consists of 16 utterances of each word from each speaker distributed over 8 separate recording sessions.

During each experiment we trained the system with the first utterance of each word in the enrollment set. Separate training was performed once for each word and each speaker leading to a total of 320 (20 words × 16 speakers) training iterations. Each trained system was then subjected to a verification test against 8 customer (same speaker, same word) and 8 impostor (different speaker, same word) utterances from the testing set. As customer utterances we chose the first utterance in each of the eight testing sessions for the same speaker and the same word. As impostor utterances we selected 8 randomly chosen utterances of the same word spoken by 8 different speakers from the testing set. As a result, we recorded the distances dc, do and d for 5120 verification trials⁸ (320 enrollments × 16 verifications) in each of the three experiment modes (no noise, white noise, and cospeaker noise).

3.2. YOHO Speech Corpus Experiments

The YOHO speech corpus (Higgins et al., 1991b) consists of combination lock phrases (such as 36-24-36) uttered by 108 male and 30 female speakers. The phrases were recorded over a period of three months in a real-world office environment. The data was sampled and digitized at an 8000 Hz sampling rate with a 3.8 kHz analog bandwidth. Each subject underwent a total of 4 enrollment sessions with 24 phrases each and 10 testing sessions with 4 phrases each. The total number of validated sessions in the database is 1932.

The YOHO data set is not explicitly designed for text-dependent verification since the employed combination lock phrases are not the same for the enrollment and the testing sets. In order to use the database in the given context it was necessary to segment every employed utterance into a set of six sub-phrases. The sub-phrases consisted of the 17 words: ONE, TWO, ..., SEVEN, NINE, TEN, TWENTY, ..., NINETY. The segmentation was done with the SPHINX speech recognition system developed and maintained by the Sphinx Group at Carnegie Mellon University (Lee et al., 1990). The details of the employed segmentation procedure are summarized in Appendix B.

In our experiments we trained the system once for each of the 17 words and each of the 138 speakers leading to a total of 2346 training iterations. The training utterances were randomly picked from the enrollment data, i.e. any phrase that contained the targeted sub-phrase was equally likely to be picked. Unfortunately, the SPHINX system was not always able to correctly segment the chosen phrase (see Appendix B). In cases in which an obvious segmentation error could be detected, the picked phrase was marked as "disqualified for further selection" and a different phrase was randomly picked from the remaining "qualified" phrases. The employed selection process clearly introduced a bias into the analysis. Due to the large number of considered phrases, however, it was not possible to validate the segmentation process manually and avoid the bias. Each trained system was subjected to a verification test against 8 customer (same speaker, same word) and 8 impostor (different speaker, same word) utterances from the testing set. The testing utterances were again randomly selected (given that no segmentation errors could be detected). As a result, we recorded the distances dc, do and d for 37536 verification trials⁹ (2346 enrollments × 16 verifications) in each of the three experiment modes (no noise, white noise, and cospeaker noise).

3.3. Additive Noise

An evaluation of the system performance under noise conditions was done by adding white noise and cospeaker noise to the testing data at a signal-to-noise ratio of 15 dB. In the white noise mode we added plain white Gaussian noise. In the cospeaker noise mode we added a random subsection of a fixed speech signal from a person that was not part of the TI46 or YOHO recordings. We chose the utterance "the reasons for this dive seemed foolish now" (file TEST\DR1\FAKS0\SI2203.WAV) of the TIMIT¹⁰ speech corpus (Garofolo et al., 1993). The initial and final silence sections of the cospeaker noise file were truncated to disqualify silent regions from being part of the additive noise. The noise signal was appropriately resampled to match the sampling rate of the TI46 and the YOHO data.

4. Results

Figure 6. A histogram of the number of classes per utterance for the employed training data of the TI46 data set. The average number of classes is 3.5.

Figure 7. A histogram of the number of extracted pitch segments per utterance for the employed training data of the TI46 data set. The average number of segments per utterance is 43.1.

Figure 8. A histogram of the number of classes per utterance for the employed training data of the YOHO data set. The average number of classes is 2.7.

Figure 9. A histogram of the number of extracted pitch segments per utterance for the employed training data of the YOHO data set. The average number of segments per utterance is 23.5.

The performance of the proposed system is dependent on the total number of validated classes Ci and the number of validated pitch synchronous segments s[k] that can be identified from each given training utterance. Figures 6 to 9 show histograms of the identified class and segment numbers for the employed training utterances of the TI46 and YOHO data sets. It is

clearly visible that the YOHO set delivers significantly fewer validated classes and segments. The discrepancy is due to the differences in speaking style between the two data sets. While the TI46 set tends to be more “annunciated” the combination lock phrases of the YOHO set tend to be more “rattled down”. Even though

332

Nickel, Oswal and Iyer

Table 3.

TI46 correlation results.

Distance Scatter Plot 1.4

Correlation coefficients

Data subset ID

Relative subset size

With no noise

White noise

Cospeaker noise

CL1/SG4 CL2/SG35 CL3/SG40 CL4/SG60

100% 61% 47% 23%

0.55673 0.59562 0.60310 0.58921

0.50842 0.49912 0.50703 0.49762

0.39463 0.47265 0.48837 0.49867

1.2

o

OPB Distances (d )

1

0.8

0.6

0.4

0.2

Table 4.

YOHO correlation results. Correlation coefficients

Data subset ID

Relative subset size

With no noise

White noise

Cospeaker noise

CL1/SG4 CL2/SG10 CL3/SG20 CL4/SG30

100% 86% 44% 17%

0.29473 0.30120 0.31600 0.32248

0.25950 0.26486 0.27187 0.25620

0.27847 0.28712 0.29873 0.31001

both sets employ a similar vocabulary the utterances of the YOHO set are hence significantly shorter. Furthermore, the higher signal-to-noise ratio and sampling rate of the TI46 set are likely to reduce the number of classification errors in the pitch extraction (see Section 2.2) and are thus leading to higher class and segment counts. To provide a more refined performance analysis the training data was divided into four subsets for each database. For the TI46 data these were: (i) the set of all training utterances with at least 1 identified class and at least 4 validated segments (CL1/SG4), (ii) the set of all training utterances with at least 2 classes and at least 35 segments (CL2/SG35), (iii) a set with at least 3 classes and 40 segments (CL3/SG40), and (iv) a set with at least 4 classes and 60 segments (CL4/SG60). Analogously we defined the sets CL1/SG4, CL2/SG10, CL3/SG20, and CL4/SG30 for the YOHO training data. The relative size of each subsets in comparison to the entire training set is listed in Tables 3 and 4. 4.1.
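The subset definitions above can be illustrated with a short sketch. It is not the authors' code: the `TrainingUtterance` record and the function names are hypothetical, and only the (minimum classes, minimum segments) thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class TrainingUtterance:
    """Hypothetical record; only the two counts matter for the subset split."""
    utterance_id: str
    num_classes: int    # validated pitch classes found in the utterance
    num_segments: int   # validated pitch synchronous segments found


# (minimum classes, minimum segments) thresholds as stated in the text.
TI46_SUBSETS = {
    "CL1/SG4": (1, 4),
    "CL2/SG35": (2, 35),
    "CL3/SG40": (3, 40),
    "CL4/SG60": (4, 60),
}
YOHO_SUBSETS = {
    "CL1/SG4": (1, 4),
    "CL2/SG10": (2, 10),
    "CL3/SG20": (3, 20),
    "CL4/SG30": (4, 30),
}


def split_into_subsets(utterances, thresholds):
    """Return a dict mapping each subset ID to the utterances that qualify."""
    return {
        name: [
            u for u in utterances
            if u.num_classes >= min_classes and u.num_segments >= min_segments
        ]
        for name, (min_classes, min_segments) in thresholds.items()
    }


def relative_sizes(subsets, total):
    """Subset sizes in percent of the total number of training utterances."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: 100.0 * len(members) / total for name, members in subsets.items()}


# Made-up example data, for illustration only.
example = [
    TrainingUtterance("a", 1, 4),
    TrainingUtterance("b", 2, 35),
    TrainingUtterance("c", 4, 60),
    TrainingUtterance("d", 3, 45),
]
example_subsets = split_into_subsets(example, TI46_SUBSETS)
example_sizes = relative_sizes(example_subsets, len(example))
```

The subsets are nested, since each threshold pair is at least as strict as the previous one. This matches the decreasing relative sizes listed in Tables 3 and 4.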

4.1. Correlation Analysis

A scatter plot of a random subset of LPCC distances dc from our baseline system and the newly proposed PPC feature distances do is shown in Fig. 10. The gray circles indicate the customer distances and the black crosses indicate the impostor distances.

Figure 10. A scatter plot of a random subset of inter-speaker distances (◦) and intra-speaker distances (×) from our TI46 experiment. [Plot "Distance Scatter Plot": OPB distances (do) versus cepstral distances (dc); legend: Customer, Impostor.]

The scatter plot suggests that there is only a loose correlation between distances obtained from PPC features and distances obtained from LPCC features. A lack of correlation can be used to boost the performance of the baseline system by replacing the LPCC distance dc with the combined distance d proposed in Eq. (20) (Nickel and Williams, 2000). The level of possible performance increase depends (in part) on the normalized correlation coefficient ρ between the two distance measures:

$$\rho(d_c, d_o) = \frac{E\{(d_c - E\{d_c\})\,(d_o - E\{d_o\})\}}{\sqrt{E\{(d_c - E\{d_c\})^2\}\; E\{(d_o - E\{d_o\})^2\}}} \tag{21}$$

Tables 3 and 4 list estimates of the normalized correlation coefficients ρ(dc, do) for various experiments. The coefficients are generally around 0.5 for the TI46 data and around 0.3 for the YOHO data. A weak correlation between two distance measures generally indicates that both measures address (at least in part) unrelated characteristics of the given speech signals. A weak correlation by itself, however, does not guarantee a performance gain since not every characteristic of a speech signal is relevant to the classification task (Papoulis, 2002). We demonstrate a concrete performance gain with the receiver operating characteristics in Section 4.3.
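As an illustration of Eq. (21), the following sketch estimates the normalized correlation coefficient from paired distance samples. It assumes that the expectations are replaced by sample averages. The function name and the example values are hypothetical and do not come from the experiments.

```python
import math


def normalized_correlation(d_c, d_o):
    """Sample estimate of Eq. (21) for paired distances d_c[k] and d_o[k]."""
    if len(d_c) != len(d_o) or len(d_c) < 2:
        raise ValueError("need two equally long sequences with at least 2 samples")
    n = len(d_c)
    mean_c = sum(d_c) / n
    mean_o = sum(d_o) / n
    # Numerator: sample covariance of the two distance measures.
    covariance = sum((c - mean_c) * (o - mean_o) for c, o in zip(d_c, d_o)) / n
    # Denominator: square root of the product of the two sample variances.
    variance_c = sum((c - mean_c) ** 2 for c in d_c) / n
    variance_o = sum((o - mean_o) ** 2 for o in d_o) / n
    if variance_c == 0.0 or variance_o == 0.0:
        raise ValueError("correlation is undefined for constant distances")
    return covariance / math.sqrt(variance_c * variance_o)


# Made-up distances: the second list is an exact linear function of the first.
cepstral = [6.0, 8.0, 10.0, 12.0]
linear_opb = [2.0 * c + 1.0 for c in cepstral]
rho_linear = normalized_correlation(cepstral, linear_opb)
```

For exactly linearly related distances the estimate is 1 up to rounding. Values near 0.3 to 0.6, as reported in Tables 3 and 4, indicate a much looser relationship.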

4.2. ROC Estimation

In order to estimate smooth receiver operating characteristics (ROCs; see Van Trees, 1968) from our experiments we assume that the impostor and the customer distances are both distributed in an approximately normal fashion (Papoulis, 2002; see Note 11). With customer distances distributed as N(μc, σc) and impostor distances distributed as N(μi, σi), the functional connection between the customer access probability Pc and the impostor access probability Pi is given by:

$$P_i = \frac{1}{2}\left[1 + \operatorname{erf}\left\{\frac{a}{\sqrt{2}} + b \cdot \operatorname{erf}^{-1}\{2P_c - 1\}\right\}\right] \quad \text{for } a = \frac{\mu_c - \mu_i}{\sigma_i} \text{ and } b = \frac{\sigma_c}{\sigma_i} \tag{22}$$

$$\text{with } \operatorname{erf}\{x\} = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt. \tag{23}$$

We can map the ROC estimation problem into a linear regression problem by considering the warped quantities

$$Q_c = \sqrt{2}\; \operatorname{erf}^{-1}\{2P_c - 1\} \tag{24}$$

and

$$Q_i = \sqrt{2}\; \operatorname{erf}^{-1}\{2P_i - 1\}. \tag{25}$$

From Eq. (22) we obtain the linear relationship

$$Q_i = a + b\, Q_c \tag{26}$$

with the unknowns a and b. Using estimates for the warped quantity pairs (Qi, Qc) from our experiments we can find a and b via a linear regression. The numerical details of the estimation procedure for various experiments are summarized in Figs. 22 to 27 in Appendix A.

Figure 11. The resulting ROC estimates for the TI46 experiments under no noise conditions. [Plot: customer access probability Pc (%) versus impostor access probability Pi (%); curves: Baseline and OPB for [A] CL1/SG4, [B] CL2/SG35, [C] CL3/SG40, and [D] CL4/SG60.]

Figure 12. The relative reduction in impostor access probability as a function of the customer access probability for the TI46 experiments under no noise conditions. [Plot: reduction in Pi (%) versus customer access probability Pc (%); curves: CL1/SG4, CL2/SG35, CL3/SG40, CL4/SG60.]
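The warping of Eqs. (24) and (25) and the regression of Eq. (26) can be sketched as follows. This is a minimal illustration and not the authors' implementation. It relies on the identity that √2·erf⁻¹{2P − 1} equals the standard normal quantile of P, which Python's standard library exposes as `NormalDist().inv_cdf`. The function names and the synthetic (Pc, Pi) pairs are assumptions made for this example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit standard deviation


def warp(p):
    """Eqs. (24)/(25): Q = sqrt(2) * erfinv(2p - 1), the standard normal quantile.

    Requires 0 < p < 1; inv_cdf raises StatisticsError otherwise.
    """
    return _STD_NORMAL.inv_cdf(p)


def fit_roc(pc_values, pi_values):
    """Least squares fit of Q_i = a + b * Q_c (Eq. (26)); returns (a, b)."""
    if len(pc_values) != len(pi_values) or len(pc_values) < 2:
        raise ValueError("need at least two (Pc, Pi) pairs")
    qc = [warp(p) for p in pc_values]
    qi = [warp(p) for p in pi_values]
    n = len(qc)
    mean_qc = sum(qc) / n
    mean_qi = sum(qi) / n
    sxx = sum((x - mean_qc) ** 2 for x in qc)
    if sxx == 0.0:
        raise ValueError("all Pc values are identical; slope is undefined")
    sxy = sum((x - mean_qc) * (y - mean_qi) for x, y in zip(qc, qi))
    b = sxy / sxx
    a = mean_qi - b * mean_qc
    return a, b


def smooth_roc(a, b, pc):
    """Eq. (22) rewritten with the normal CDF: P_i = Phi(a + b * Q_c)."""
    return _STD_NORMAL.cdf(a + b * warp(pc))


# Synthetic check with made-up parameters: points generated from the model
# itself should be recovered by the regression.
true_a, true_b = -2.0, 0.9
pc_points = [0.80, 0.85, 0.90, 0.95]
pi_points = [smooth_roc(true_a, true_b, pc) for pc in pc_points]
fitted_a, fitted_b = fit_roc(pc_points, pi_points)
```

With measured operating points the fit is not exact. Under the normality assumption, the residuals of the regression give a rough impression of how well that assumption holds.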

4.3. Verification Results

The receiver operating characteristics (ROCs) in Figs. 11 to 18 demonstrate the improvements in verification performance with the proposed method. The ROCs establish the functional connection between the customer access probability Pc and the impostor access probability Pi for a given distance metric. The desired operating point on the ROC can be specified with a given decision threshold λ. Distance measures above λ are considered impostor distances and distance measures below λ are considered customer distances. We are generally interested in ROCs that approach the upper left corner of the diagram (i.e. ROCs that approach Pc = 100% and Pi = 0%). The special point 100% − Pc = Pi is referred to as the equal error point (EEP). It can be used as a general performance measure for a given procedure. In the respective legends of the graphs the term “Baseline” refers to the performance of the baseline system with distance metric dc. The term “OPB” refers to the proposed combined system with distance metric d after Eq. (20).

4.3.1. TI46 Results. The results of the TI46 experiments in clean, i.e. no noise, conditions are shown in Figs. 11 and 12. Figure 11 shows the receiver operating characteristics for each of the considered four TI46 subsets separately. As could be expected, the performance of both the baseline system and the proposed system was best for the CL4/SG60 subset and worst for the CL1/SG4 subset. The relative reduction in impostor access probability as a function of the customer access probability Pc is given by:

$$\text{Reduction in } P_i\ (\%) = 100\% \times \frac{P_i^{\text{Baseline}} - P_i^{\text{OPB}}}{P_i^{\text{Baseline}}} \tag{27}$$
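The equal error point and the relative reduction of Eq. (27) can be illustrated with a short sketch. It assumes the smooth ROC model of Eq. (22), written with the standard normal CDF, and locates the EEP by bisection. The function names and all numerical values are hypothetical and are not taken from the experiments.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit standard deviation


def impostor_access_probability(a, b, pc):
    """Eq. (22) with the normal CDF: P_i = Phi(a + b * Phi^{-1}(P_c))."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(pc))


def equal_error_point(a, b, tol=1e-10):
    """Error rate at the point where 1 - P_c = P_i, found by bisection.

    Assumes b > 0, so that P_i(P_c) - (1 - P_c) increases with P_c.
    """
    if b <= 0.0:
        raise ValueError("this sketch assumes b > 0")

    def mismatch(pc):
        return impostor_access_probability(a, b, pc) - (1.0 - pc)

    lo, hi = 1e-9, 1.0 - 1e-9
    if mismatch(lo) > 0.0 or mismatch(hi) < 0.0:
        raise ValueError("no equal error point inside the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mismatch(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 1.0 - 0.5 * (lo + hi)


def relative_reduction(pi_baseline, pi_opb):
    """Eq. (27): relative reduction in impostor access probability in percent."""
    if pi_baseline <= 0.0:
        raise ValueError("baseline probability must be positive")
    return 100.0 * (pi_baseline - pi_opb) / pi_baseline


# Made-up parameters: for a = -2 and b = 1 the EEP lies at Q_c = 1.
example_eep = equal_error_point(-2.0, 1.0)
# Made-up probabilities: a drop from 10% to 7.5% is a 25% relative reduction.
example_reduction = relative_reduction(0.10, 0.075)
```

Both functions operate on probabilities in the range 0 to 1. The percentages quoted in the text correspond to these values multiplied by 100.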

Figure 13. The resulting ROC estimates for the TI46 experiments under white noise conditions. The results for the CL2/SG35 and the CL3/SG40 data sets are not drawn to maintain legibility of the plot. They generally fall between graphs [A] and [B]. [Plot: customer access probability Pc (%) versus impostor access probability Pi (%); curves: Baseline and OPB for [A] CL1/SG4 and [B] CL4/SG60.]

Figure 14. The relative reduction in impostor access probability as a function of the customer access probability for the TI46 experiments under white noise conditions. [Plot: reduction in Pi (%) versus customer access probability Pc (%); curves: CL1/SG4, CL2/SG35, CL4/SG60.]

Figure 15. The resulting ROC estimates for the TI46 experiments under cospeaker noise conditions. The results for the CL2/SG35 and the CL3/SG40 data sets are not drawn to maintain legibility of the plot. They generally fall between graphs [A] and [B]. [Plot: customer access probability Pc (%) versus impostor access probability Pi (%); curves: Baseline and OPB for [A] CL1/SG4 and [B] CL4/SG60.]

The resulting reduction for different data subsets is displayed in Fig. 12. The equal error point (EEP) of the baseline system is around 9% for the CL1/SG4 data set and around 5% for the CL4/SG60 data set. The relative reduction in Pi at the EEP is around 5% for the CL1/SG4 data set and around 25% for the CL4/SG60 data set.

The results of the TI46 experiments with additive white noise are shown in Figs. 13 and 14. Figure 13 shows the receiver operating characteristics for two of the considered four TI46 subsets. As in the no noise case, the performance of the baseline system and the proposed system was best for the CL4/SG60 subset and worst for the CL1/SG4 subset. The relative reduction in impostor access probability is displayed in Fig. 14. The EEP of the baseline system is around 25% for the CL1/SG4 data set and around 22% for the CL4/SG60 data set. The relative reduction in Pi at the EEP is around 10% for the CL1/SG4 data set and around 27% for the CL4/SG60 data set.

The results under additive cospeaker noise are shown in Figs. 15 and 16. Figure 15 shows the receiver operating characteristics and Fig. 16 displays the relative reduction in impostor access probability. Again, the overall system performance is best for the CL4/SG60 subset and worst for the CL1/SG4 subset. The EEP of the baseline system is around 16% for the CL1/SG4 data set and around 11% for the CL4/SG60 data set. The relative reduction in Pi at the EEP is around 20% for the CL1/SG4 data set and around 21% for the CL4/SG60 data set.

Lastly, Fig. 17 shows the performance of a system that employs only distance do to make a verification decision (see Note 12). The ROC shows that the addition of cospeaker noise leads to a small increase in EEP (from 27% to 29% for the CL1/SG4 set and from 21% to 23% for the CL4/SG60 set). The relative increase, however, is significantly smaller than that for the baseline system, for which the EEP almost doubles under cospeaker noise conditions.

In summary, the proposed method leads to a significant reduction in impostor access probability across all speakers of the TI46 data set. The performance is evidently correlated with the number of validated pitch classes and pitch segments of the given training utterance. As a consequence, we obtain a stronger performance boost for speakers with a high pitch (such as women and children) and for words and phrases that contain pronounced vowels and/or similarly stationary, almost-periodic sounds. Furthermore, the performance of the proposed system is much more robust under noise conditions than that of the employed baseline system.

Figure 16. The relative reduction in impostor access probability as a function of the customer access probability for the TI46 experiments under cospeaker noise conditions. [Plot: reduction in Pi (%) versus customer access probability Pc (%); curves: CL1/SG4, CL2/SG35, CL4/SG60.]

Figure 17. TI46 ROC estimates for the OPB distances do alone in no noise and cospeaker noise conditions. [Plot: customer access probability Pc (%) versus impostor access probability Pi (%); curves: No Noise (CL1/SG4), No Noise (CL4/SG60), Cospeaker Noise (CL1/SG4), Cospeaker Noise (CL4/SG60).]

Figure 18. The resulting ROC estimates for the YOHO experiments under no noise conditions. [Plot: customer access probability Pc (%) versus impostor access probability Pi (%); curves: Baseline and OPB for [A] CL1/SG4, [B] CL2/SG10, [C] CL3/SG20, and [D] CL4/SG30.]

Figure 19. The relative reduction in impostor access probability as a function of the customer access probability for the YOHO experiments under no noise conditions. [Plot: reduction in Pi (%) versus customer access probability Pc (%); curves: CL1/SG4, CL2/SG10, CL3/SG20.]

Figure 20. The relative reduction in impostor access probability as a function of the customer access probability for the YOHO experiments under white noise conditions. [Plot: reduction in Pi (%) versus customer access probability Pc (%); curves: CL1/SG4, CL3/SG20, CL4/SG30.]

Figure 21. The relative reduction in impostor access probability as a function of the customer access probability for the YOHO experiments under cospeaker noise conditions. [Plot: reduction in Pi (%) versus customer access probability Pc (%); curves: CL1/SG4, CL4/SG30.]

4.3.2. YOHO Results. The results of the YOHO experiments are summarized in Figs. 18 to 21. Figure 18 shows the ROCs of the no noise experiments and Figs. 19 to 21 show the relative reduction in impostor access probabilities for the noise and the no noise cases. As expected, the improvement obtained from the proposed method is much smaller than that for the TI46 data. The EEP of the baseline system for the no noise experiments is 14% for the CL1/SG4 set and 12% for the CL4/SG30 set. Under white noise the EEP increases to 22% for the CL1/SG4 set and to 20% for the CL4/SG30 set. Cospeaker noise moves the EEP to 17% for the CL1/SG4 data and to 15% for the CL4/SG30 data. In the no noise case the relative reduction in impostor access probability at the EEP shrinks to between 1% and 2%. For the cospeaker case we still obtain a reduction of between 2% and 3%. For the white noise case the results indicate a reduction of at least 3% for the CL1/SG4 set and up to 10% for the CL4/SG30 case. The reasons for the poorer performance of the proposed system in the YOHO experiments are three-fold:

1. The number of identifiable classes and extractable pitch segments is overall significantly lower than that for the TI46 data (see Figs. 6 to 9).
2. The sampling rate and signal-to-noise ratio of the YOHO data are significantly lower than for the TI46 data. This is important because the OPB method depends particularly strongly on training with high-bandwidth, low-noise data (Nickel and Oswal, 2003).
3. It is not possible to directly measure how many of the classification errors are indirectly induced by a poor phrase segmentation by the SPHINX system (see Appendix B).

Despite these problems we obtain a small but acceptable improvement in verification performance, especially in the white noise case. The proposed method is thus capable of delivering valuable speech information for large population data sets. This claim would not be possible with the results from the TI46 data set alone. We expect that the results for comparable data sets (with large speaker numbers) will improve if a data acquisition procedure is used that is better adjusted to the proposed processing.

5. Conclusions

The paper presents a new technique to increase the robustness of speaker verification algorithms by means of principal pitch components (PPC). The newly proposed method leads to a significant performance improvement when incorporated into a traditional verification system based on cepstral features. Experimental results show that under various noise conditions the impostor access probability can be significantly reduced even if the underlying baseline system operates at the discriminative limit of the employed LPC cepstral (LPCC) features. The key to the new approach is the design of a filterbank of speaker specific matched filters whose impulse responses are the identified PPCs. In the current work we are only using the main principal component of each pitch class. We are anticipating further improvements when not only one, but several optimal pitch bases are employed.

Furthermore, the proposed method may help to combat break-in attempts with prerecorded speech. Since PPCs are sensitive to the time waveform, it is possible to detect if the time-domain similarity between two utterances that were submitted to the system becomes “too good”. If the feature distance between a previously submitted utterance and the currently submitted utterance is too small then we can reject the current utterance as a recorded one.

In its current form the proposed method requires a relatively complex training procedure. At the verification stage, however, the computational complexity is of the same order as that of simple LPCC based approaches.

Appendix A

Figures 22 to 27 demonstrate how well the assumption of normally distributed distance measures fits our experimental results. For the TI46 - CL1/SG4 data set Figs. 22 to 24 show the warped result counts (Qi, Qc) from Eqs. (24) and (25) (gray lines) and the least squares fit from Eq. (26) (black lines). Figs. 25 to 27 show the same for the YOHO - CL1/SG4 data set. Results under no noise, white noise, and cospeaker noise conditions are displayed separately. The graphs for all data subsets with higher class and segment counts were very similar to the ones shown here and were omitted.

Figure 22. Least squares fit of Eq. (26) from warped result counts (Qi, Qc) for the TI46 - CL1/SG4 data set with no noise contamination. The left axis system shows the result for the baseline system (with a = 2.5127 and b = 0.89875). The right axis system shows the result for the proposed combined system (with a = 2.5824 and b = 0.92915). [Panels "Baseline / No Noise" and "OPB / No Noise": Qi = warped(Pi) versus Qc = warped(Pc); (Qi, Qc) estimate and least squares fit.]

Figure 23. Least squares fit of Eq. (26) from warped result counts (Qi, Qc) for the TI46 - CL1/SG4 data set with white noise contamination. The left axis system shows the result for the baseline system (with a = 1.4492 and b = 1.1853). The right axis system shows the result for the proposed combined system (with a = 1.5112 and b = 1.1343). [Panels "Baseline / White Noise" and "OPB / White Noise": Qi = warped(Pi) versus Qc = warped(Pc); (Qi, Qc) estimate and least squares fit.]

Figure 24. Least squares fit of Eq. (26) from warped result counts (Qi, Qc) for the TI46 - CL1/SG4 data set with cospeaker noise contamination. The left axis system shows the result for the baseline system (with a = 1.922 and b = 0.94431). The right axis system shows the result for the proposed combined system (with a = 1.9662 and b = 0.87287). [Panels "Baseline / Cospeaker Noise" and "OPB / Cospeaker Noise": Qi = warped(Pi) versus Qc = warped(Pc); (Qi, Qc) estimate and least squares fit.]

Figure 25. Least squares fit of Eq. (26) from warped result counts (Qi, Qc) for the YOHO - CL1/SG4 data set with no noise contamination. The left axis system shows the result for the baseline system (with a = 2.0182 and b = 0.88137). The right axis system shows the result for the proposed combined system (with a = 2.0271 and b = 0.88525). [Panels "Baseline / No Noise" and "OPB / No Noise": Qi = warped(Pi) versus Qc = warped(Pc); (Qi, Qc) estimate and least squares fit.]

Figure 26. Least squares fit of Eq. (26) from warped result counts (Qi, Qc) for the YOHO - CL1/SG4 data set with white noise contamination. The left axis system shows the result for the baseline system (with a = 1.4614 and b = 0.96996). The right axis system shows the result for the proposed combined system (with a = 1.4862 and b = 0.96286). [Panels "Baseline / White Noise" and "OPB / White Noise": Qi = warped(Pi) versus Qc = warped(Pc); (Qi, Qc) estimate and least squares fit.]

Figure 27. Least squares fit of Eq. (26) from warped result counts (Qi, Qc) for the YOHO - CL1/SG4 data set with cospeaker noise contamination. The left axis system shows the result for the baseline system (with a = 1.9362 and b = 1.0253). The right axis system shows the result for the proposed combined system (with a = 1.9408 and b = 1.0137). [Panels "Baseline / Cospeaker Noise" and "OPB / Cospeaker Noise": Qi = warped(Pi) versus Qc = warped(Pc); (Qi, Qc) estimate and least squares fit.]

Appendix B

SPHINX is an open source speech recognition engine developed and maintained by the Sphinx Group at Carnegie Mellon University (Lee et al., 1990). Software and documentation can be downloaded from http://fife.speech.cs.cmu.edu/sphinx/. In our segmentation procedure we used SPHINX version 3–0.4.1 in its default configuration with a 17 word dictionary. Recognition was limited to the 17 sub-phrases targeted in Section 3.2. In segmenting the data we used the estimated word-boundary indices produced by the software. In order to eliminate the most serious segmentation errors we compared the produced recognition result with the actually spoken phrase for each utterance. Utterances were discarded if one of the following two conditions was met:

1. fewer than 6 sub-phrases were identified, or
2. not all identified sub-phrases matched the actually spoken sub-phrases.

The employment of this disqualification rule introduced a bias in our analysis. Unfortunately, it was not possible to replace the automated segmentation procedure with a manual one due to the large number of considered speech files.

Acknowledgments

This work has been supported in part by a grant from the Pittsburgh Digital Greenhouse initiative, Pittsburgh, Pennsylvania. We would like to thank the reviewers for their valuable suggestions. Their comments have helped to significantly improve the organization of the paper.

Notes

1. The term short-time absolute energy is technically a misnomer since it refers to a sliding window absolute sum.
2. Unlike Qiang and Youwei, we are additionally incorporating the normalized autocorrelation feature STAC1 to suppress false end point detections due to spurious unvoiced regions such as lip pops, for example.
3. The resolution of the pitch estimate is limited by the sampling time if the interpolation is omitted.
4. Slot number 1 represents the average μS and standard deviation σS of all largest singular values, slot number 2 displays the average and standard deviation of the second largest singular values, and so forth. All averages are normalized so that μS + σS of the largest singular values is equal to one.
5. In a mean square sense.
6. We are employing a Sinusoidal Bandpass Lifter after (Rabiner and Juang, 1993).
7. The words are ZERO, ONE, ..., NINE, ENTER, ERASE, GO, HELP, NO, RUBOUT, REPEAT, STOP, START, and YES.
8. Half of the 5120 verification trials were customer trials and the other half were impostor trials.
9. Again, half of the 37536 verification trials were customer trials and the other half were impostor trials.
10. The TIMIT corpus was recorded in a low noise environment with 16 bits/sample at a 16000 Hz sampling rate.
11. Our results show that the assumption of normally distributed distances is fairly accurate around the mean of the distribution. The distribution tails must be excluded for obvious reasons.
12. Distances d and dc are being ignored!

References

Assaleh, K.T. (1995). Supplementary orthogonal cepstral features. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 413–416.
Assaleh, K.T. and Mammone, R.J. (1994). New LP-derived features for speaker identification. IEEE Transactions on Speech and Audio Processing, 2(4):630–638.
Atal, B.S. and Rabiner, L.R. (1976). A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 24(3).
Bovbel, E.I., Kheidorov, I.E., and Chaikou, Y.A. (2002). Wavelet based speaker identification. Digital Signal Processing, 2:1005–1008.
Campbell, J. and Higgins, A. (1998). YOHO Speaker Verification, Speech Database of the Linguistic Data Consortium, LDC Catalog No.: LDC94S16.
Campbell, J.P. (1997). Speaker recognition: A tutorial. Proceedings of the IEEE, 85(9):1437–1461.
Carey, M.J., Parris, E.S., Lloyd-Thomas, H., and Bennett, S. (1996). Robust prosodic features for speaker identification. In Fourth International Conference on Spoken Language Processing (ICSLP), vol. 3, pp. 1800–1803.
Che, C.W., Lin, Q., and Yuk, D.S. (1996). An HMM approach to text-prompted speaker verification. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 673–676.
Deller, J.R., Proakis, J., and Hansen, J. (1999). Discrete-Time Processing of Speech Signals, Macmillan.
Doddington, G.R. and Schalk, T.B. (1981). Speech recognition: Turning theory to practice. IEEE Spectrum, 18(9).
Furui, S. (1981). Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(2):254–272.
Furui, S. (1997). Recent advances in speaker recognition. Pattern Recognition Letters, 18:859–872.
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., and Zue, V. (1993). TIMIT acoustic-phonetic continuous speech corpus, Speech Database of the Linguistic Data Consortium, LDC Catalog No.: LDC93S1.
Gopalan, K., Anderson, T.R., and Cupples, E.J. (1999). A comparison of speaker identification results using features based on cepstrum and Fourier-Bessel expansion. IEEE Transactions on Speech and Audio Processing, 7(3):289–294.
Haydar, A., Demirekler, M., and Yurtseven, M.K. (1998). Feature selection using genetic algorithm and its application to speaker verification. Electronic Letters, 34:1457–1459.
Higgins, A., Bahler, L., and Porter, J. (1991a). Speaker verification using randomized phrase prompting. Digital Signal Processing, 1(2):89–106.
Higgins, A., Bahler, L., and Porter, J. (1991b). Speaker verification using randomized phrase prompting. Digital Signal Processing, 1(2):89–106.
Huang, L. and Yang, C. (2000). A novel approach to robust speech endpoint detection in car environments. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, pp. 1751–754.
Klatt, D.H. (1982). Prediction of perceived phonetic distance from critical-band spectra: A first step. In Proceedings of IEEE Conference of Acoustics, Speech, and Signal Processing (ICASSP), pp. 1278–1281.
Lee, K.-F., Hon, H.-W., and Reddy, R. (1990). An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech and Signal Processing, 38:35–44.
Liberman, M., Amsler, R., Church, K., Fox, E., Hafner, C., Klavans, J., Marcus, M., Mercer, B., Pedersen, J., Roossin, P., Walker, D., Warwick, S., and Zampolli, A. (1993). TI 46-Word, Speech Database of the Linguistic Data Consortium, LDC Catalog No.: LDC93S9.
Medan, Y., Yair, E., and Chazan, D. (1991). Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1):40–48.
Nickel, R.M. and Oswal, S.P. (2003). Optimal pitch bases expansions in speech signal processing. In Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers.
Nickel, R.M. and Williams, W.J. (2000). On local time-frequency features of speech and their employment in speaker verification. Journal of the Franklin Institute, Special Issue on Time-Frequency Analysis and Applications, 377:469–481.
Pandit, M. and Kittler, J. (1998). Feature selection for a DTW-based speaker verification system. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 769–772.
Papoulis, A. (2002). Probability, Random Variables, and Stochastic Processes, 4th Edition, McGraw Hill.
Petry, A. and Barone, D.A.C. (2003). Preliminary experiments in speaker verification using time-dependent largest Lyapunov exponents. Computer Speech and Language, 17:403–413.
Qiang, H. and Youwei, Z. (1998). On prefiltering and endpoint detection of speech signal. In Proceedings of IEEE International Conference on Signal Processing (ICSP), pp. 749–752.
Rabiner, L. and Juang, B.H. (1993). Fundamentals of Speech Recognition. Englewood Cliffs, New Jersey 07632: Prentice Hall, Inc.
Rabiner, L.R. and Schaefer, R.W. (1978). Digital Processing of Speech Signals. Englewood Cliffs, New Jersey 07632: Prentice Hall, Inc.
Reynolds, D.A., Quatieri, T.F., and Dunn, R.B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1):19–41.
Sang, L., Wu, Z., Yang, Y., and Zhang, W. (2003). Automatic speaker recognition using dynamic Bayesian network. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 188–191.
Singh, G., Panda, A., Bhattacharyya, S., and Srikanthan, T. (2003). Vector quantization techniques for GMM based speaker verification. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 65–68.
Thévenaz, P. and Hugli, H. (1995). Usefulness of the LPC-residue in text-independent speaker verification. Speech Communication, 17:145–157.
Van Trees, H.L. (1968). Detection, Estimation, and Modulation Theory: Part 1. New York: John Wiley & Sons.
Yacoub, S., Abdeljaoued, Y., and Mayoraz, E. (1999). Fusion of face and speech data for person identity verification. IEEE Transactions on Neural Networks, 10(5):1065–1074.
