AUTOMATIC ACQUISITION DEVICE IDENTIFICATION FROM SPEECH RECORDINGS

Daniel Garcia-Romero and Carol Y. Espy-Wilson
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD
[email protected], [email protected]

ABSTRACT

In this paper we present a study on the automatic identification of acquisition devices when only the output speech recordings are available. A statistical characterization of the frequency response of the device, contextualized by the speech content, is proposed. In particular, the intrinsic characteristics of the device are captured by a template constructed by appending together the means of a Gaussian mixture trained on the device's speech recordings. This study focuses on two classes of acquisition devices, namely, landline telephone handsets and microphones. Three publicly available databases are used to assess the performance of linear- and mel-scaled cepstral coefficients. A Support Vector Machine classifier was used to perform closed-set identification experiments. The results show classification accuracies higher than 90 percent among the eight telephone handsets and eight microphones tested.

Index Terms— Digital speech forensics, intrinsic fingerprint, non-intrusive forensics, Gaussian supervectors.

1. INTRODUCTION

The widespread availability of low-cost and sophisticated digital media editing software allows amateur users to perform imperceptible alterations to digital content. This fact poses a serious threat to a wide variety of fields such as intellectual property, criminal investigation, and law enforcement. In an attempt to minimize this threat, this work addresses the problem of automatic acquisition device identification (Dev-ID). In particular, our goal is to automatically extract forensic evidence about the mechanism involved in the generation of a speech recording by analysis of the acoustic signal. Our focus is on blind-passive strategies – as opposed to active embedding of watermarks or having access to input-output pairs – since most realistic scenarios only allow for this kind of approach. The underlying hypothesis of our approach is that physical devices – along with their associated signal processing chains – leave behind intrinsic fingerprint traces in the speech signal. Moreover, these digital traces can be

The authors would like to thank Dr. K.J. Ray Liu for his generous advice and direction regarding this research. Supported by NSF grant #0917104

characterized and detected by statistical methods and automatic pattern recognition techniques. An important observation in this regard is that in the field of speaker recognition (SR), the artifacts left in the speech signal by the acquisition device are highly detrimental to the performance of recognition systems [1]. Hence, most SR systems try to remove or compensate for these artifacts by using some general knowledge about the acquisition device (e.g., GSM handset as opposed to CDMA). A great amount of research has been dedicated to this issue (see [1] for current approaches). Since Dev-ID suffers from the dual problem – here it is the speech content variability that adds great difficulty – this paper relies heavily on the findings of the SR field when building our algorithms.

Regarding prior work, a small study concerning the classification of 4 microphones was presented in [2]. Motivated by a steganalysis perspective, a combination of time-domain features involving short-term statistics of the audio signal as well as mel-cepstral coefficients was used during feature extraction. A Naïve Bayes classifier was used to perform closed-set identification experiments at the frame level. Accuracies on the order of 60-75% were reported. Although its experimental setup was very limited in size, it is the only prior work the authors have been able to identify, and it nonetheless sets a precedent in the automatic identification of microphones.

2. DATABASES

2.1. Landline telephone handsets

The Handset-TIMIT (HTIMIT) and Lincoln-Labs Handset Database (LLHDB) provide speech recordings through four carbon-button {CB1-CB4} and four electret {EL1-EL4} landline handsets [3]. In particular, HTIMIT was created by playing a subset of TIMIT through a dummy head into the different handsets. Ten short segments of around 3 seconds from 384 speakers (half of them male) resulted in 3,840 speech segments played through each handset. The specific make and model of each handset are available in [3]. LLHDB comprises 53 speakers (24 males and 29 females) who uttered the 10 short sentences from HTIMIT plus a read passage and a picture description using the same handsets as HTIMIT. Both databases were acquired at an 8 kHz sampling rate with 16 bits per sample.

Figure 1. Magnitude squared of the frequency response (in dB) of the landline handsets of HTIMIT/LLHDB computed from white noise using Welch’s method with a 20 ms Hamming window and 50% overlap
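The kind of input-output characterization shown in Figure 1 is straightforward to reproduce from the white-noise recordings distributed with HTIMIT. The following is a minimal sketch using SciPy with the same 20 ms Hamming window and 50% overlap; the file name and the assumption of an 8 kHz, 16-bit recording are illustrative, not part of the HTIMIT distribution.

# Sketch: estimate a handset's magnitude response (in dB) from a white-noise
# recording, as in Figure 1. The file name is a hypothetical placeholder.
import numpy as np
import scipy.io.wavfile as wavfile
from scipy.signal import welch

fs, x = wavfile.read("htimit_white_noise_el4.wav")  # assumed 8 kHz, 16-bit file
x = x.astype(np.float64)

nperseg = int(0.020 * fs)                 # 20 ms Hamming window
f, pxx = welch(x, fs=fs, window="hamming",
               nperseg=nperseg, noverlap=nperseg // 2)

# With a white-noise input, the output PSD approximates (up to a constant)
# the magnitude-squared frequency response of the handset.
response_db = 10 * np.log10(pxx + 1e-12)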

2.2. Microphones

The ICSI subset of the NIST 2006 Speaker Recognition Evaluation database [4], comprising 8 different microphones and 61 speakers (28 male and 33 female), provides a total of 2223 speech segments of around 2.5 minutes each. An almost evenly distributed number of recordings came from each microphone. The speech segments were acquired simultaneously with all microphones in an interview-style setup. Table 1 shows the specific microphone types.

 M1   AT3035 (Audio Technica Studio Mic)
 M2   MX418S (Shure Gooseneck Mic)
 M3   Crown PZM Soundgrabber II
 M4   AT Pro45 (Audio Technica Hanging Mic)
 M5   Jabra Cellphone Earwrap Mic
 M6   Motorola Cellphone Earbud
 M7   Olympus Pearlcorder
 M8   Radio Shack Computer Desktop Mic

Table 1. Microphone types from the NIST 2006 database.

3. DEVICE CHARACTERIZATION

A typical methodology to characterize handsets is to model them as a linear system and estimate their frequency response by measuring the output of the system when excited with white noise. Figure 1 shows the results of estimating the power spectral density of white noise captured with each telephone handset using Welch's method. These data are part of the HTIMIT distribution. A simple inspection of the frequency response of each device shows the potential discriminative power of the frequency-domain information.

Unfortunately, for blind-passive speech forensics approaches an input-output characterization of the device is not available. Consequently, we need to devise mechanisms that rely solely on the output signal, which in our case consists of speech recordings. This makes the problem more challenging, since the signals available to us contain information not only about the device but also about the speech content variability (i.e., different speakers and linguistic content). Our approach is to alleviate the effects of the speech content variability by using a statistical characterization of the frequency response of the device contextualized by the speech content. In the following we present a practical implementation of this formulation.

3.1. Statistical modeling by UBM-GMM

Influenced by the SR field, we used Gaussian Mixture Models (GMM). In particular, an only-means-adapted UBM-GMM architecture with 2048 mixtures and diagonal covariance matrices was used [5]. The frequency content information was represented by a parametrization of short-time speech segments using 20 ms Hamming windows with 50% overlap. Either 24 mel-scaled or 39 linearly-scaled filters were used to compute 23 MFCCs or 38 LFCCs, respectively, after removing c0. To select an optimal parametrization, closed-set identification experiments were conducted on HTIMIT. Half of the database was used to train the 8 GMM handset models and the remaining half to perform classification using log-likelihood ratio scores [5]. Table 2 shows the average device identification results for the different types of parametrizations.

 Param. type        Num. coeffs.   ID rate (%)
 23 MFCC                 23           99.84
 23 MFCC + delta         46           99.75
 38 LFCC                 38           99.98
 38 LFCC + delta         76           99.97

Table 2. Average handset identification accuracy for various types of parametrizations on a development set of HTIMIT.
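As a concrete illustration of the parametrization just described, the sketch below extracts the 23-dimensional MFCC features (20 ms Hamming window, 50% overlap, 24 mel filters, c0 removed). The use of librosa is our choice for the sketch, not the authors' toolkit, and the file name is hypothetical.

# Sketch: 23-dimensional MFCC parametrization as described in Section 3.1.
import numpy as np
import librosa

def extract_mfcc(path, sr=8000, n_mels=24, n_mfcc=24):
    y, sr = librosa.load(path, sr=sr)
    win = int(0.020 * sr)                      # 20 ms Hamming window
    hop = win // 2                             # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels,
                                n_fft=win, hop_length=hop, win_length=win,
                                window="hamming")
    return mfcc[1:, :].T                       # drop c0 -> 23 coefficients per frame

# frames = extract_mfcc("llhdb_utterance.wav")   # hypothetical file name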

Two important remarks are in order. First, augmenting the mel-frequency (MFCC) and linear-frequency (LFCC) cepstral coefficients with deltas does not improve the performance. Thus, there is no justification for doubling the dimensionality of our acoustic space. Second, the MFCCs and LFCCs produce very similar results, which suggests the use of MFCCs due to their smaller dimensionality. This last observation will be further tested in a different setting in Section 4 to strengthen its validity.

3.2. Intrinsic fingerprint computation

In an only-means-adapted UBM-GMM architecture, all the discriminant information of the model (in our case telephone handsets or microphones) is captured by the means of the GMM. This is because the MAP adaptation process only updates the means of the adapted GMM with respect to the UBM. This observation led to the construction of a Gaussian supervector (GSV) by stacking the means of the mixture components [6].

Figure 2. Top panel shows the magnitude responses of the EL4 telephone handset and the reference SENH microphone from the HTIMIT database. They were estimated by computing the power spectral density of the output using white noise as input. The bottom panel is the visualization of the difference between the clustered means of the EL4 GSV and the SENH GSV. Dark blue colors indicate low values and dark red high values.
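The bottom panel of Figure 2 was produced with the visualization procedure of [7], which is not reproduced in detail in this paper. As a rough, hypothetical stand-in (not the exact procedure of [7]), the sketch below clusters the UBM mean vectors and displays the per-cluster difference between two GSVs as a heat map; the clustering choice and all names are illustrative.

# Rough sketch of a Figure-2-style panel: gsv_a, gsv_b, ubm_means are arrays of
# shape (K, D), i.e., the supervectors reshaped to K mean vectors of dimension D.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_gsv_difference(gsv_a, gsv_b, ubm_means, n_clusters=20):
    # Group the K mixtures into 20 clusters (as in the text) via k-means on the UBM means.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(ubm_means)
    diff = gsv_a - gsv_b
    # Average the per-mixture difference within each cluster and show it as a heat map.
    panel = np.vstack([diff[labels == c].mean(axis=0) for c in range(n_clusters)])
    plt.imshow(panel, aspect="auto", cmap="jet")
    plt.xlabel("cepstral dimension")
    plt.ylabel("mean-vector cluster")
    plt.colorbar()
    plt.show()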

A speech recording is thus represented by a point in a high-dimensional vector space (i.e., dimension ~ 50k). Formally, given a training data set $X = \{\mathbf{x}_t\}_{t=1}^{T}$ and a diagonal-covariance UBM with $K$ mixtures defined by $\{\boldsymbol{\pi}, \mathbf{m}, \mathbf{R}\}$, where $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_K]^T$, $\mathbf{m} = [\mathbf{m}_1^T, \ldots, \mathbf{m}_K^T]^T$ and $\mathbf{R} = \mathrm{diag}(\{\mathbf{R}_k\}_{k=1}^{K})$, the GSV $\boldsymbol{\theta} = [\boldsymbol{\theta}_1^T, \boldsymbol{\theta}_2^T, \ldots, \boldsymbol{\theta}_K^T]^T$ is obtained as the MAP adaptation of the means of the UBM using one iteration of the EM algorithm with prior $N(\boldsymbol{\theta}; \mathbf{m}, \tau^{-1}\mathbf{R})$. After the E step, the responsibility of Gaussian $k$ for data point $\mathbf{x}_t$ is denoted by $\gamma_{tk}$, and $\gamma_k = \sum_t \gamma_{tk}$. The application of the M step then yields the MAP estimate of the GSV [5]:

    $\boldsymbol{\theta} = (\tau\mathbf{I} + \boldsymbol{\Lambda})^{-1}(\tau\mathbf{m} + \boldsymbol{\Lambda}\boldsymbol{\mu})$,    (1)

with $\boldsymbol{\Lambda} = \mathrm{diag}(\{\gamma_k \mathbf{I}\}_{k=1}^{K})$ and $\boldsymbol{\mu} = [\boldsymbol{\mu}_1^T, \ldots, \boldsymbol{\mu}_K^T]^T$. For mixture $k$, eq. (1) reduces to the convex combination

    $\boldsymbol{\theta}_k = \alpha_k \mathbf{m}_k + (1 - \alpha_k)\, \boldsymbol{\mu}_k$,    (2)

between the UBM mean $\mathbf{m}_k$ and the data mean $\boldsymbol{\mu}_k$, with the mixing coefficient $\alpha_k = \tau/(\tau + \gamma_k)$ and the data mean

    $\boldsymbol{\mu}_k = \frac{1}{\gamma_k} \sum_t \gamma_{tk}\, \mathbf{x}_t$.    (3)

The scalar $\tau$ (known as the "relevance factor") controls the trade-off between what the data "say" and our prior belief contained in the UBM means [5]. Based on this formulation, the intrinsic fingerprint of an acquisition device $d$ is defined as the GSV $\boldsymbol{\theta}^{(d)}$ computed from speech recordings acquired with the device of interest and a UBM. This procedure results in a fixed-length template that represents variable-length speech recordings. This is a desirable property since, in principle, the intrinsic fingerprint of a device should be independent of the amount of data acquired with it. Moreover, regarding the speech content contextualization, an experimental study presented in [7] indicated that a phonetic context can be attached to partitions of the GSV (i.e., groupings of subsets of mean vectors).
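The following is a minimal sketch of eqs. (1)-(3): a diagonal-covariance UBM is trained and the GSV of a recording is obtained by mean-only MAP adaptation with relevance factor tau. The use of scikit-learn's GaussianMixture, the value of tau, and all names are our assumptions for illustration, not the authors' toolkit or settings.

# Sketch of mean-only MAP adaptation (eqs. 1-3). features: (N, D) array of MFCC
# frames from one recording; ubm: a fitted diagonal-covariance GaussianMixture.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_frames, n_components=2048):
    # Diagonal-covariance UBM on pooled frames (2048 mixtures as in Sec. 3.1).
    # A direct (and slow) stand-in for whatever trainer the authors used.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50)
    ubm.fit(all_frames)
    return ubm

def gaussian_supervector(features, ubm, tau=16.0):   # tau value is illustrative
    # E step: responsibilities gamma_tk of each mixture for each frame.
    gamma = ubm.predict_proba(features)               # shape (N, K)
    counts = gamma.sum(axis=0)                        # gamma_k, shape (K,)
    # Data means mu_k (eq. 3); guard against empty mixtures.
    mu = gamma.T @ features / np.maximum(counts, 1e-10)[:, None]
    # Data weight 1 - alpha_k = gamma_k / (gamma_k + tau); convex combination (eq. 2).
    w = counts / (counts + tau)
    adapted = (1.0 - w)[:, None] * ubm.means_ + w[:, None] * mu
    return adapted.reshape(-1)                        # stacked means -> supervector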

To support this phonetic-context claim, the bottom panel of Figure 2 shows the difference between the GSV of the EL4 handset and that of the reference SENH microphone (both from HTIMIT), using the visualization procedure presented in [7]. Twenty clusters were used in the visualization process. The color coding is such that dark blue indicates low values and dark red high values. The top panel of Figure 2 shows the magnitude response of both devices estimated with white noise. The blue band in the low-frequency range of the GSV difference is explained by the fact that below 1 kHz the SENH magnitude response is above the EL4 response; above 1 kHz the relation is the opposite, and it therefore makes sense that this part of the GSV difference is mostly represented with red colors.

4. EXPERIMENTS

This section presents an experimental validation of the use of GSVs as intrinsic fingerprints of acquisition devices. Two separate closed-set identification experiments were carried out: one for landline telephone handsets and another for microphones. This separation was primarily motivated by the distinct nature of the devices as well as the difference in the duration of the speech recordings for each type of device: whereas the average recording length for the telephone handsets is 3 seconds, 2.5 minutes of speech are available for each microphone recording. A linear Support Vector Machine (SVM) classifier was trained for each device using the GSVs computed from each file, with MFCCs used to parametrize the files. In particular, for the telephone handsets, half of HTIMIT was used to train a UBM. Additionally, LLHDB was partitioned into 2 sets (balanced in the number of files as well as in speakers and gender). Each file in the database was MAP-adapted from the UBM to compute its GSV. A two-fold cross-validation setup resulted in approximately 300 positive GSV exemplars, comprising 26 speakers, and 7 times as many negative GSV exemplars to train the SVM models for each partition. A total of 5079 identification trials were obtained. Table 3 shows the corresponding confusion matrix, where the entries represent percentages. The average identification accuracy across telephones is 93.2%. The gray shading in the table highlights the fact that most of the errors remain within the same transducer class (i.e., electret and carbon-button). As indicated in Section 3.1, to further validate the use of 23 MFCCs instead of 38 LFCCs, the same experiment was repeated using LFCCs. The average identification rate was identical to that of MFCCs. Thus, no improvement was obtained by using a linear-frequency scale.

              Models
 Files   CB1   CB2   CB3   CB4   EL1   EL2   EL3   EL4
 CB1    94.3   0.6   0.3   1.1   0.9   0.2   0.0   2.5
 CB2     1.6  97.0   0.3   0.2   0.5   0.0   0.5   0.0
 CB3     0.2   0.0  99.2   0.6   0.0   0.0   0.0   0.0
 CB4     0.2   0.0   0.9  98.7   0.0   0.0   0.0   0.2
 EL1     1.1   2.5   0.0   0.2  86.9   2.5   3.3   3.5
 EL2     0.2   0.3   0.0   0.0   3.3  92.9   2.8   0.5
 EL3     0.3   1.9   0.0   0.0   7.4   6.2  83.4   0.8
 EL4     1.3   0.0   0.3   0.5   3.5   0.6   0.6  93.2

Table 3. Confusion matrix for telephone handsets based on the GSV-SVM architecture. Rows indicate files and columns models. In this way, entry (2,1) indicates that 1.6% of the CB2 files were wrongly identified as coming from CB1.
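As a companion to the description above, the sketch below trains one linear SVM per device on the GSVs (one-vs-rest) and identifies a test file by the device whose model gives the largest decision score. scikit-learn, the regularization value, and the data layout are our assumptions for illustration.

# Sketch of the closed-set GSV-SVM identification described in Section 4.
import numpy as np
from sklearn.svm import LinearSVC

def train_device_svms(gsvs, labels, device_ids):
    # gsvs: (n_files, K*D) array of supervectors; labels: device id per file.
    models = {}
    for d in device_ids:
        y = (labels == d).astype(int)          # positives: files from device d
        models[d] = LinearSVC(C=1.0).fit(gsvs, y)
    return models

def identify(models, gsv):
    # Closed-set decision: the device whose SVM gives the largest score wins.
    scores = {d: m.decision_function(gsv[None, :])[0] for d, m in models.items()}
    return max(scores, key=scores.get)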

For the microphone identification experiments, the same experimental setup was used with some small changes imposed by the database. Namely, the UBM was trained with all speech segments from the ICSI subset; a downsampling ratio of 100 (keeping one of every 100 frames) reduced the data to about 2 hours of speech for UBM training. Moreover, the two-fold cross-validation resulted in an average of 280 positive GSV exemplars, comprising 30 speakers, and 7 times as many negative GSV exemplars to train the SVM models. A total of 2223 identification trials were obtained. Table 4 shows the confusion matrix of the identification experiments. The average identification accuracy across microphones is 99.0%, with very uniform behavior, and no apparent confusion patterns are observable among the small number of errors. The worst performance was obtained for recordings coming from M1, for which 2.5% were mistakenly identified as M2. The fact that the biggest source of errors for recordings from M2 is also M1 suggests that these two microphones have similar characteristics. Although a formal study of the influence of the amount of data on performance is a topic for future research, the 6% gap in average identification accuracy between the telephone handsets (3-second recordings) and the microphones (2.5-minute recordings) suggests such a dependence.

              Models
 Files    M1    M2    M3    M4    M5    M6    M7    M8
 M1     96.8   2.5   0.4   0.0   0.0   0.4   0.0   0.0
 M2      1.1  98.2   0.0   0.0   0.4   0.4   0.0   0.0
 M3      0.0   0.4  99.3   0.4   0.0   0.0   0.0   0.0
 M4      0.0   0.0   0.0   100   0.0   0.0   0.0   0.0
 M5      0.0   0.0   0.0   0.0   100   0.0   0.0   0.0
 M6      1.1   0.4   0.0   0.0   0.0  98.6   0.0   0.0
 M7      0.0   0.0   0.0   0.4   0.0   0.0  99.6   0.0
 M8      0.0   0.0   0.0   0.0   0.0   0.0   0.0   100

Table 4. Confusion matrix for microphones based on the GSV-SVM architecture. Rows indicate files and columns models.

Finally, to extend the telephone handset experiments presented above, an explicit mechanism to remove undesired variability from the device intrinsic fingerprint, known as Nuisance Attribute Projection (NAP) [8], was added. The subset of HTIMIT not used in the construction of the UBM was used to compute an undesired-variability subspace following the procedure detailed in [8]. This variability was removed by projecting the GSVs onto the orthogonal complement of that subspace. A slight improvement of 0.3% over the average identification accuracy of 93.2% was obtained when 64 dimensions were projected away. This result indicates that most of the variability in our experimental setup is already well accounted for by our GSV-SVM system. However, we are currently searching for publicly available databases with more sources of variability (e.g., acoustic environments), and it is our belief that these techniques will play an important role in such more challenging setups.
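To make the NAP step concrete, the sketch below estimates a nuisance subspace from the leading directions of within-device scatter of the GSVs and projects each GSV onto its orthogonal complement. This is one common NAP variant stated under our own assumptions; it is not necessarily the exact recipe of [8].

# Rough sketch of Nuisance Attribute Projection on GSVs.
import numpy as np

def nap_subspace(gsvs, labels, n_dims=64):
    # Within-device (nuisance) scatter: deviations of each GSV from its device mean.
    deviations = []
    for d in np.unique(labels):
        subset = gsvs[labels == d]
        deviations.append(subset - subset.mean(axis=0))
    W = np.vstack(deviations)                      # (n_files, K*D)
    # Leading right singular vectors of W span the nuisance subspace.
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return vt[:n_dims].T                           # (K*D, n_dims), orthonormal columns

def remove_nuisance(gsv, U):
    # Project onto the orthogonal complement: x - U U^T x.
    return gsv - U @ (U.T @ gsv)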

5. CONCLUSIONS

A study on the automatic identification of eight telephone handsets and eight microphones was presented. Several types of parametrizations were evaluated, and the MFCCs provided the best trade-off between performance and dimensionality. The use of Gaussian supervectors as a statistical characterization of the frequency-domain information of a device, contextualized by the speech content, was proposed. In this way, a template that captures the intrinsic characteristics of a device was obtained, and a simple visualization procedure validated its discriminative power. A Support Vector Machine classifier was used to perform closed-set identification experiments, yielding classification accuracies higher than 90 percent. This result indicates that most of the variability in our experimental setup was well accounted for by our system.

6. REFERENCES

[1] L. Burget, P. Matejka, P. Schwarz, O. Glembek, and J. Cernocký, "Analysis of Feature Extraction and Channel Compensation in a GMM Speaker Recognition System," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 1979-1986, September 2007.
[2] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital Audio Forensics: A First Practical Evaluation on Microphone and Environment Classification," in MM&Sec'07, pp. 63-74, 2007.
[3] D. A. Reynolds, "HTIMIT and LLHDB: Speech Corpora for the Study of Handset Transducer Effects," in ICASSP, pp. 1535-1538, 1997.
[4] NIST Speech Group. [Online]. http://www.nist.gov/speech/
[5] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[6] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support Vector Machines Using GMM Supervectors for Speaker Verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, May 2006.
[7] D. Garcia-Romero and C. Espy-Wilson, "Intersession Variability in Speaker Recognition: A Behind the Scene Analysis," in Interspeech, pp. 1413-1416, 2008.
[8] A. Solomonoff, W. M. Campbell, and I. Boardman, "Advances in Channel Compensation for SVM Speaker Recognition," in ICASSP, pp. 629-632, 2005.
