Fractional Fourier Transform Based Auditory Feature for Language Identification

Wei-Qiang Zhang, Liang He, Tao Hou, and Jia Liu
Tsinghua National Laboratory for Information Science and Technology
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Email: {weiq.zhang, sanphiee}@gmail.com, [email protected], [email protected]

Abstract—In this paper, a novel auditory feature based on the fractional Fourier transform (FRFT), namely the fractional auditory cepstrum coefficient (FACC), is presented for language identification (LID). Different from the widely used Mel-frequency cepstrum coefficient (MFCC), the proposed feature utilizes the human auditory model and performs Gammatone filtering on the short-time fractional spectrum of the speech signal. Experimental results on the NIST 2003 Language Recognition Evaluation (LRE03) show that the FACC feature reduces the equal error rate (EER) by 10.5% relative to the MFCC feature.

I. INTRODUCTION

Language identification (LID) is a developing branch of speech signal processing. It has many applications in multilingual spoken dialog systems, spoken language translation, call routing, spoken document retrieval, etc. [1]. For example, in an intelligent human-machine interaction system, LID can act as an important front end that automatically identifies the language of an utterance and routes it to the corresponding back end.

In the LID field, the Mel-frequency cepstrum coefficient (MFCC) and its derived features have been used extensively [1], [2]. Previous investigations have shown that features based on more subtle auditory models can achieve better performance for speech recognition [3], [4]. Motivated by these works, we study auditory features for LID. On the other hand, the fractional Fourier transform (FRFT), a generalization of the Fourier transform, has gained considerable popularity in the signal processing community [5], [6]. In [7], a noise-robust FRFT feature for word recognition was reported. By utilizing the FRFT and properly selecting its order, we believe more discriminative information for language identification can be extracted from the speech signal. In this paper, we propose a fractional auditory feature for LID.

The rest of this paper is organized as follows. Sections II and III introduce the auditory Gammatone filterbank and the FRFT and give their implementation methods. Section IV describes the extraction procedure of the proposed feature. Experiments are presented in Section V. Finally, Section VI gives conclusions.

This work was supported by the National Natural Science Foundation of China and Microsoft Research Asia under Grant No. 60776800, and in part by the National High Technology Development Program of China under Grant No. 2006AA010101 and No. 2007AA04Z223.

II. GAMMATONE AUDITORY FILTERBANK

A. Gammatone Filterbank

Similar to the commonly used Mel filterbank [8], on which the MFCC feature is based, the Gammatone filterbank models the cochlea by a bank of overlapping bandpass filters [9]. A single Gammatone filter of order q can be expressed as

g(t) = a t^{q-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi) u(t),   (1)

where a is a normalization constant, b is a bandwidth-related parameter, f_c is the center frequency, \phi is the phase shift, and u(t) is the unit step function. In speech signal processing, the parameters are usually set to q = 4 and \phi = 0. When q = 4, b can be calculated as [10]

b = 1.019 \cdot ERB(f_c),   (2)

where ERB denotes the equivalent rectangular bandwidth, determined by

ERB(f_c) = f_c / Q + B_0,   (3)

where Q is the asymptotic filter quality at large frequencies and B_0 is the minimum bandwidth at zero frequency. In this paper we use Q = 9.26449 and B_0 = 24.7 Hz, which are estimated from experimental and statistical data [11].

B. Implementation of Gammatone Filterbank

The Gammatone filterbank can be implemented in either the time domain or the frequency domain [12]. To facilitate comparison with the Mel filterbank, we investigate it in the frequency domain. The Fourier transform of the fourth-order Gammatone filter can be obtained as

G(f) = F[g(t)] = \frac{3a\left[(b + jf)^4 - 6(b + jf)^2 f_c^2 + f_c^4\right]}{8\pi^4 \left[(b + jf)^2 + f_c^2\right]^4}.   (4)

Its amplitude response is

|G(f)| = \frac{3a \sqrt{G_1 + G_2}}{8\pi^4 \left[(b^2 + f^2)^2 + 2(b^2 - f^2) f_c^2 + f_c^4\right]^2},   (5)

where

G_1 = \left(-4b^3 f + 4b f^3 + 12 b f_c^2 f\right)^2,   (6)

G_2 = \left(b^4 - 6b^2 f^2 + f^4 - 6(b^2 - f^2) f_c^2 + f_c^4\right)^2.   (7)

The maximum value of |G(f)| is located approximately at f_c, i.e.,

\max_f |G(f)| \approx |G(f_c)|.   (8)

Thus we obtain the normalized form

\breve{G}(f) = \frac{|G(f)|}{|G(f_c)|}.   (9)

C. Comparison of the Mel and Gammatone Filterbanks

To further understand the Gammatone filterbank, we compare it with the Mel filterbank. The two filterbanks with similar parameters are illustrated in Fig. 1. Each filter of the Mel filterbank (also called a triangular filter) has a piecewise linear form whose apex is not differentiable, while the Gammatone filter has a smoother form. In addition, the step factor (defined as the amount of overlap) of the Mel filterbank is fixed, so if the number of filters increases, the bandwidth of each triangular filter decreases. For the Gammatone filter, the bandwidth is determined by its center frequency (see (3)), so if the number of filters increases, the step factor also increases. Compared with the Mel filterbank, the structure of the Gammatone filterbank is more subtle and closer to the human auditory model.

III. FRACTIONAL FOURIER TRANSFORM

A. Review of the FRFT

As a generalization of the conventional Fourier transform, the FRFT can be interpreted as a rotation of signals in the continuous time-frequency plane [5]. Compared with the Fourier transform, the FRFT has an orthobasis of chirp signals and is thus more flexible and suitable for nonstationary signal processing. It has been widely applied in optics, quantum mechanics, communications, radar, information security, pattern recognition, and so on [5]. The FRFT of order p of a signal x(t) is defined as

X_p(u) = F^p[x(t)] = \int_{-\infty}^{+\infty} K_p(u, t) x(t) \, dt,   (10)

where K_p(u, t) is the kernel function:

K_p(u, t) = \begin{cases} A_\alpha e^{j\pi\left(t^2 \cot\alpha - 2ut \csc\alpha + u^2 \cot\alpha\right)}, & \alpha \neq n\pi \\ \delta(t - u), & \alpha = 2n\pi \\ \delta(t + u), & \alpha = 2n\pi \pm \pi \end{cases}   (11)

where n \in Z, A_\alpha = \sqrt{1 - j\cot\alpha}, and \alpha = p\pi/2 indicates the rotation angle of the signal in the time-frequency plane. It is easy to observe that the FRFT has the following two special cases:

F^0[x(t)] = x(t),   (12)

F^1[x(t)] = X(f).   (13)

As p changes from 0 to 1, the FRFT of x(t) changes from the time domain to the frequency domain. In some cases, this additional degree of freedom brings flexibility for processing speech signals.

B. Discrete FRFT

The discrete FRFT (DFRFT) has several definitions and implementations, each with different computational complexity and precision. In this paper, we follow the eigendecomposition method [13]. The N-point DFRFT can be defined by its transformation matrix:

F^p[m, n] = \sum_{k=0}^{N-1} u_k[m] \, e^{-j\frac{\pi}{2} k p} \, u_k[n],   (14)

where u_k[n] are the discrete Hermite-Gaussian functions. The advantage of this method is its low computational complexity of O(N log N). Fig. 2 gives an example of the DFRFT of a segment of windowed speech signal. We can observe how the transform evolves from the time-domain waveform to its discrete Fourier transform (DFT) as the order ranges from 0 to 1.

IV. FRACTIONAL AUDITORY CEPSTRUM COEFFICIENTS

A. Basic Fractional Auditory Feature

Combining the Gammatone filterbank and the FRFT, we can develop a fractional auditory feature. The feature extraction procedure is as follows. The speech signal is first preprocessed (e.g., pre-emphasized) and then windowed with a Hanning window. After that, the DFRFT is computed and Gammatone filtering is performed to obtain the energy of every channel. Then the logarithm of each energy is taken and the discrete cosine transform (DCT) is applied to decorrelate the log-energies. Just as the traditional cepstrum is the (inverse) DCT of the log spectrum, the proposed feature is the (inverse) DCT of the log fractional spectrum; since it also imitates the mechanism of the human auditory system, we refer to it as the fractional auditory cepstrum coefficient (FACC).

B. Other Configurations

Besides the Gammatone filterbank, we also consider the effect of the equal loudness curve (ELC) on the auditory feature. According to psychoacoustics, the subjective loudness perceived by the human ear is determined not only by the intensity of the sound but also by its frequency. Thus, for a specific loudness, the intensity can be seen as a function of frequency; this function is called the equal loudness curve [8]. When implementing the Gammatone filterbank, we can weight each channel by the inverse ELC to simulate human subjective perception. This is equivalent to weighting the spectrum and then performing filtering, but requires less computation.
Moreover, to compensate for speaker variability and suppress channel noise, we also use vocal tract length normalization (VTLN) [14] and relative spectra (RASTA) filtering [15]. The detailed descriptions are omitted here.
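As a concrete illustration, the per-frame front end described in Sec. IV-A (windowing, fractional spectrum, Gammatone filtering, log, DCT) can be sketched as follows. This is a simplified sketch, not the authors' implementation: the DFRFT is taken at order p = 1, where it reduces to the ordinary DFT, and the sampling rate, frame length, channel count, and cepstral order are illustrative choices.

```python
import numpy as np
from scipy.fft import dct

Q, B0 = 9.26449, 24.7  # Glasberg-Moore constants of Eq. (3)

def gammatone_weights(n_fft, fs, n_ch=32, f_lo=100.0, f_hi=3800.0):
    """Normalized 4th-order Gammatone magnitudes (Eqs. (2)-(9)) on the rfft bins.

    Centre frequencies are spaced uniformly on the ERB-rate scale, a common
    choice that the paper does not spell out.
    """
    hz2erb = lambda f: Q * np.log(1.0 + f / (Q * B0))
    erb2hz = lambda e: Q * B0 * (np.exp(e / Q) - 1.0)
    centres = erb2hz(np.linspace(hz2erb(f_lo), hz2erb(f_hi), n_ch))
    f = np.fft.rfftfreq(n_fft, 1.0 / fs)
    W = []
    for fc in centres:
        b = 1.019 * (fc / Q + B0)                      # Eq. (2)
        def G(freq):                                   # complex form of Eq. (4), a = 1
            z = b + 1j * freq
            return (z**4 - 6 * z**2 * fc**2 + fc**4) / (z**2 + fc**2)**4
        W.append(np.abs(G(f)) / np.abs(G(fc)))         # normalization of Eq. (9)
    return np.asarray(W)

def facc_frame(frame, fs, n_cep=7):
    """FACC for one frame: window -> spectrum -> Gammatone -> log -> DCT."""
    x = frame * np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(x)) ** 2                 # order-1 fractional spectrum
    energies = gammatone_weights(len(frame), fs) @ spec
    return dct(np.log(energies + 1e-10), norm='ortho')[:n_cep]
```

A full extractor would apply this to every frame, substitute the DFRFT of Sec. III-B at the selected order p, and append the SDC post-processing.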

[Fig. 1. Comparison of the magnitude response of the Mel filterbank (a) and the Gammatone filterbank (b), plotted as amplitude versus frequency over 0-4000 Hz.]
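The Gammatone filterbank of Fig. 1(b) can be constructed directly from Eqs. (2)-(9). The sketch below is an illustration rather than the authors' code; the channel count, FFT size, and the ERB-rate spacing of the centre frequencies are assumptions.

```python
import numpy as np

Q, B0 = 9.26449, 24.7  # asymptotic quality and minimum bandwidth (Eq. (3))

def erb(fc):
    """Equivalent rectangular bandwidth in Hz: ERB(fc) = fc/Q + B0 (Eq. (3))."""
    return fc / Q + B0

def gammatone_mag(f, fc):
    """Normalized magnitude response of a 4th-order Gammatone filter.

    Evaluates G(f) of Eq. (4) as a complex expression (with a = 1, which
    cancels under the normalization of Eq. (9)).
    """
    b = 1.019 * erb(fc)  # Eq. (2)
    def G(freq):
        z = b + 1j * freq
        D = z**2 + fc**2
        return 3 * (z**4 - 6 * z**2 * fc**2 + fc**4) / (8 * np.pi**4 * D**4)
    return np.abs(G(f)) / np.abs(G(fc))

def gammatone_filterbank(n_filters=32, fs=8000, n_fft=256,
                         f_lo=100.0, f_hi=4000.0):
    """Bank of Gammatone weights on the positive FFT bins.

    Centre frequencies are spaced uniformly on the ERB-rate scale
    (Glasberg & Moore), an assumed design choice.
    """
    hz2erb = lambda f: Q * np.log(1.0 + f / (Q * B0))
    erb2hz = lambda e: Q * B0 * (np.exp(e / Q) - 1.0)
    centres = erb2hz(np.linspace(hz2erb(f_lo), hz2erb(f_hi), n_filters))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return np.stack([gammatone_mag(freqs, fc) for fc in centres]), centres
```

Plotting each row of the returned matrix against the bin frequencies reproduces the overlapping, frequency-dependent bandwidths visible in Fig. 1(b): unlike the fixed-overlap triangular Mel filters, each Gammatone channel widens with its centre frequency.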

[Fig. 2. A segment of windowed speech signal and its discrete fractional Fourier transform. (a) Waveform (sampling rate = 8 kHz, 250 samples); (b) DFRFT magnitude as the order varies from 0 to 1.]
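The transform underlying Fig. 2 can be approximated with a short sketch. Note this is a simplification of the eigendecomposition method of Sec. III-B: instead of assigning the phases e^{-j(pi/2)kp} to the discrete Hermite-Gaussian eigenvectors u_k of Eq. (14), it takes a principal fractional power of the unitary DFT matrix, which coincides with the DFRFT of [13] at the integer orders p = 0 and p = 1 but may differ at intermediate orders.

```python
import numpy as np
from scipy.linalg import schur

def dft_matrix(N):
    """Unitary N-point DFT matrix."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

def dfrft_matrix(N, p):
    """Order-p principal fractional power of the unitary DFT matrix.

    The DFT matrix is normal, so its complex Schur form is (numerically)
    diagonal; raising the unit-modulus eigenvalues to the power p yields a
    unitary fractional transform that interpolates between the identity
    (p = 0) and the DFT (p = 1).
    """
    T, Z = schur(dft_matrix(N), output='complex')
    lam = np.diag(T)                    # eigenvalues, all on the unit circle
    return (Z * lam**p) @ Z.conj().T    # Z diag(lam^p) Z^H
```

Applying `dfrft_matrix(N, p)` to a windowed frame for a grid of orders p in [0, 1] and stacking the magnitudes reproduces the kind of evolution from waveform to DFT shown in Fig. 2(b).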

C. Shifted Delta Cepstrum

In order to extract broader temporal information for LID, the shifted delta cepstrum (SDC) feature has been presented and has shown better performance than basic features [2]. Suppose the basic cepstrum coefficients at frame t are {c_j(t), j = 1, 2, ..., N - 1}, where j is the dimension index and N is the number of cepstrum coefficients. The SDC feature is essentially k blocks of delta cepstrum coefficients:

s_{iN+j}(t) = c_j(t + iP + d) - c_j(t + iP - d),   i = 0, 1, ..., k - 1,   (15)

where d is the time difference between the frames used for the delta computation and P is the time shift between blocks. The SDC feature is specified by the parameter set N-d-P-k. In this paper, we use 7 FACCs concatenated with SDC 7-1-3-7, which totals 56 coefficients per frame; we denote this feature FACC-SDC. The effectiveness of this configuration has been proven for the MFCC-SDC feature [16]. The whole feature extraction procedure is illustrated in Fig. 3.

V. EXPERIMENTS

A. Experimental Data

The training data come from the CallFriend corpus [17], which consists of Arabic, English (Southern and non-Southern dialects), Farsi, French, German, Hindi, Japanese, Korean, Mandarin (Mainland and Taiwan dialects), Spanish (Caribbean and non-Caribbean dialects), Tamil, and Vietnamese telephone speech. Each language/dialect contains 60 half-hour conversations. After feature extraction, we decimate 1/20 of the frames for training. The evaluation data come from the National Institute of Standards and Technology (NIST) 2003 Language Recognition Evaluation (LRE03) data [18].

B. Experimental Setup

The evaluation is performed in the framework of the NIST LRE [18]. The detection task is performed for each language, and the equal error rate (EER), which is achieved when the false acceptance rate equals the false rejection rate by adjusting the

detection threshold, is obtained. The closed-set, 30 s duration condition is evaluated, and the average EER is used as the performance measure.

We use the Gaussian mixture model (GMM) as the classifier to validate the performance of the proposed fractional auditory features. Each GMM has 256 mixture components. In our experiments, the GMMs are first trained via the maximum likelihood (ML) criterion with 8 iterations and then via the maximum mutual information (MMI) criterion [16] with 20 iterations.

C. Results and Discussion

We first compare the MFCC-SDC and FACC-SDC (FRFT order p = 1) features. The EERs are listed in Table I. The FACC-SDC (p = 1) feature gives a 6.55% relative improvement over the MFCC-SDC feature.

TABLE I
COMPARISON OF MFCC-SDC AND FACC-SDC FEATURES

Feature           | EER (%)
MFCC-SDC          | 4.58
FACC-SDC (p = 1)  | 4.28

The above result is obtained with FRFT order p = 1. To find the optimal parameter, we vary the FRFT order p from 1.00 to 0.95 in steps of 0.01; the results are listed in Table II. The lowest EER, 4.10%, is achieved at p = 0.98, corresponding to a 10.5% relative improvement over the MFCC-SDC baseline. This shows that the fractional auditory feature is effective for language identification.

TABLE II
FACC-SDC FEATURE ORDER SELECTION

FRFT Order p | 1.00 | 0.99 | 0.98 | 0.97 | 0.96 | 0.95
EER (%)      | 4.28 | 4.33 | 4.10 | 4.24 | 4.15 | 4.47

[Fig. 3. Block diagram of FACC feature extraction: speech -> pre-processing -> windowing -> FRFT (with order selection) -> Gammatone filtering -> log -> DCT -> FACC -> post-processing -> FACC-SDC -> training & test.]
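The FACC-SDC features used in these experiments stack deltas according to Eq. (15). A minimal sketch follows; the edge handling (clamping frame indices past the utterance boundaries) is an implementation choice not specified in the paper.

```python
import numpy as np

def sdc(C, d=1, P=3, k=7):
    """Shifted delta cepstra, Eq. (15).

    For each frame t, stack c(t + iP + d) - c(t + iP - d) for i = 0..k-1.
    C is a (T, N) array of N base cepstra per frame; indices past the ends
    are clamped to the edges.  Returns a (T, N*k) array.
    """
    T, N = C.shape
    idx = lambda t: np.clip(t, 0, T - 1)          # clamp out-of-range frames
    out = np.empty((T, N * k))
    for i in range(k):
        lead = C[idx(np.arange(T) + i * P + d)]
        lag = C[idx(np.arange(T) + i * P - d)]
        out[:, i * N:(i + 1) * N] = lead - lag
    return out
```

With the 7-1-3-7 configuration, concatenating the 7 base FACCs with the 7 x 7 = 49 SDC values gives the 56-dimensional FACC-SDC vector per frame.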

In this paper, we have proposed a fractional auditory cepstrum coefficient feature for language identification. This feature computes the spectrum of the speech signal by employing the fractional Fourier transform and computes the sub-band energies by utilizing the psychoacoustic Gammatone filterbank instead of the Mel filterbank. In experiments using a GMM classifier, the proposed feature outperforms the MFCC-based feature.

REFERENCES

[1] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, pp. 31-44, Jan. 1996.
[2] P. A. Torres-Carrasquillo, "Language identification using Gaussian mixture models," Ph.D. dissertation, Michigan State University, 2002.
[3] Q. Li, F. Soong, and O. Siohan, "A high-performance auditory feature for robust speech recognition," in Proc. 6th International Conference on Spoken Language Processing, Beijing, Oct. 2000, pp. 51-54.
[4] W. H. Abdulla, "Auditory based feature vectors for speech recognition systems," in Proc. Advances in Communications and Software Technologies, Oct. 2002, pp. 231-236.
[5] H. M. Ozaktas, Z. Zalevsky, and M. A. Kutay, The Fractional Fourier Transform With Applications in Optics and Signal Processing. New York: Wiley.
[6] R. Tao, B. Deng, W.-Q. Zhang et al., "Sampling and sampling rate conversion of band limited signals in the fractional Fourier transform domain," IEEE Transactions on Signal Processing, vol. 56, no. 1, pp. 158-171, Jan. 2008.
[7] R. Sahkaya, Y. Gao, and G. Saon, "Fractional Fourier transform features for speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing, vol. 1, May 2004, pp. 529-532.
[8] X.-D. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing. Prentice Hall, 2000.
[9] A. M. Aertsen, P. I. Johannesma, and D. J. Hermes, "Spectro-temporal receptive fields of auditory neurons in the grassfrog," Biological Cybernetics, vol. 38, no. 4, pp. 235-248, Nov. 1980.
[10] R. Patterson, K. Robinson, J. Holdsworth et al., "Complex sounds and auditory images," in Proc. 9th International Symposium on Hearing, 1992, pp. 429-446.
[11] B. R. Glasberg and B. C. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1-2, pp. 103-108, Aug. 1990.
[12] M. Slaney, "An efficient implementation of the Patterson-Holdsworth auditory filter bank," Apple Computer Inc., Tech. Rep. 35, 1993.
[13] C. Candan, M. A. Kutay, and H. M. Ozaktas, "The discrete fractional Fourier transform," IEEE Transactions on Signal Processing, vol. 48, no. 5, pp. 1329-1337, May 2000.
[14] W.-Q. Zhang, J. Liu, and L. He, "Auditory features with vocal tract length normalization for language identification," in Proc. International Conference on Audio, Language and Image Processing, vol. 1, Shanghai, Jul. 2008, pp. 66-70.
[15] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, Oct. 1994.
[16] P. Matejka, L. Burget, P. Schwarz et al., "Brno University of Technology system for NIST 2005 language recognition evaluation," in Proc. IEEE Odyssey - The Speaker and Language Recognition Workshop, San Juan, Puerto Rico, June 2006.
[17] CallFriend corpus. [Online]. Available: http://www.ldc.upenn.edu/Catalog
[18] NIST Language Recognition Evaluation. [Online]. Available: http://www.nist.gov/speech/tests/lang/index.htm
