IEEE SIGNAL PROCESSING LETTERS, VOL. XX, NO. XX, MON YYYY
Theoretical Complex Cepstrum of DCT and Warped DCT Filters
R. Muralishankar, Member, IEEE, Abhijeet Sangwan, Student Member, IEEE, and Douglas O’Shaughnessy, Fellow, IEEE
Abstract— In this letter, we derive the theoretical complex cepstrum (TCC) of the discrete cosine transform (DCT) and warped DCT (WDCT) filters. Using these derivations, we intend to develop an analytic model of the warped discrete cosine transform cepstrum (WDCTC), which was recently introduced as a speech processing feature. In our derivation, we start with the filter bank structure for the DCT, where each basis is represented by a finite impulse response (FIR) filter. The WDCT filter bank is obtained by substituting $z^{-1}$ in the DCT filter bank with a first-order all-pass filter. Using the filter bank structures, we first derive the transfer functions for the DCT and WDCT, and subsequently compute the TCC for each filter. We analyze the DCT and WDCT filter transfer functions and the TCC by illustrating the corresponding pole-zero maps and cepstral sequences. Moreover, we use the derived TCC expressions to compute the cepstral sequence for a synthetic vowel /aa/, where the observations on the theoretical cepstrum corroborate our practical findings.
I. INTRODUCTION

In an earlier work, we introduced the warped discrete cosine transform cepstrum (WDCTC) as a new speech processing feature. In particular, we demonstrated the better performance of the WDCTC over the mel-frequency cepstral coefficients (MFCC) in vowel recognition and speaker-identification tasks [1]. Further, we reported several properties of the WDCTC, such as (i) good vowel class separability, (ii) low variance, (iii) good codebook representation, (iv) robustness to noise, and (v) a better approximation to a Gaussian distribution [2], [3]. The algorithm for computing the WDCTC is briefly reviewed in Algorithm 1 [1]. Note that the algorithm uses the warped discrete cosine transform (WDCT) [4], which directly maps the time-domain signal to the perceptual frequency domain via an appropriate choice of the warping factor [1], unlike the MFCC, which employs mel-frequency filter banks to map the signal representation to a perceptual frequency domain. The use of the WDCT in computing the WDCTC gives a significant advantage in terms of the analyzability of the WDCTC as opposed to the MFCC, where analysis has proved to be a difficult task [5].

In this letter, we exploit this advantage by deriving the theoretical complex cepstrum (TCC) for the WDCTC. Our approach towards deriving the TCC is as follows. First, we consider the filter bank structure of the discrete cosine transform (DCT) and the WDCT, where each basis is represented by a finite impulse response (FIR) filter, as shown in Fig. 1(a) and (b) [4]. Next, we obtain the transfer function for each of the DCT and WDCT filters and show the pole-zero maps along with the frequency responses to demonstrate the effects of warping. Finally, using these DCT and WDCT transfer functions, we derive the TCC for each DCT and WDCT filter. In addition, we use the newly derived expressions for the WDCT and DCT TCC to compute the cepstral sequences for the synthetic vowel /aa/.

[Figure 1: block diagrams. (a) DCT: the input sequence ..., x2, x1, x0 feeds eight parallel branches $F_k(z^{-1})$, $k = 0, \ldots, 7$, each followed by a decimator (factor 8) that outputs $C_k$. (b) WDCT: the same structure with $F_k(A(z))$ in each branch.]

Fig. 1. Filter-bank representation of an 8-point (a) DCT and (b) WDCT. $\{x_i\}$ represents the time-domain signal sequence, and $\{C_i\}$ are the corresponding (a) DCT or (b) WDCT coefficients.

Algorithm 1 Algorithm to compute the WDCTC
1: Obtain an N-point WDCT, $X_{WDCT}(k)$, $0 \le k \le N-1$, for a finite-duration real sequence $x(n)$, $0 \le n \le N-1$.
2: Compute the phase $\zeta(k)$ of the WDCT coefficients as
\[ \zeta(k) = \frac{j\pi}{2}\left(\operatorname{sgn}(X_{WDCT}(k)) - 1\right), \tag{1} \]
where sgn is the sign of the WDCT coefficient.
3: Compute the WDCTC $\hat{x}(n)$ as
\[ \hat{x}(n) = \Re\left(\mathrm{IDCT}\left(\zeta(k) + \ln|X_{WDCT}(k)|\right)\right), \tag{2} \]
where $\Re$ denotes the real part of a complex number.

II. POLE-ZERO MAP OF DCT AND WDCT FILTERS
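Before turning to the filters, steps 2 and 3 of Algorithm 1 can be sketched in a few lines. The sketch is illustrative only and rests on stated assumptions: it takes the transform coefficients as given rather than computing a WDCT, it assumes the IDCT in (2) is the exact inverse of the DCT defined in (3) (including the $U(k)$ scaling), and the names `dct`, `idct`, and `wdctc_from_coeffs` are hypothetical. For the demonstration it uses plain DCT coefficients, which is the unwarped case (β = 0 in (8)).

```python
import cmath
import math


def dct(x):
    """DCT as in (3): X(k) = U(k) * sum_n x(n) cos((2n+1) k pi / (2N))."""
    N = len(x)
    out = []
    for k in range(N):
        u = 1 / math.sqrt(2) if k == 0 else 1.0
        out.append(u * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                           for n in range(N)))
    return out


def idct(X):
    """Inverse of dct() above (assumed 2/N scaling); accepts complex input."""
    N = len(X)
    out = []
    for n in range(N):
        acc = 0
        for k in range(N):
            u = 1 / math.sqrt(2) if k == 0 else 1.0
            acc += u * X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
        out.append(2 * acc / N)
    return out


def wdctc_from_coeffs(X):
    """Steps 2 and 3 of Algorithm 1 for real, nonzero coefficients X(k)."""
    # Step 2, eq. (1): phase is 0 for positive and -j*pi for negative coefficients.
    zeta = [1j * math.pi / 2 * (math.copysign(1.0, v) - 1) for v in X]
    # Step 3, eq. (2): inverse DCT of the complex logarithm, then the real part.
    log_spec = [z + math.log(abs(v)) for z, v in zip(zeta, X)]
    return [complex(c).real for c in idct(log_spec)], log_spec


x = [0.9, -0.4, 0.3, 1.2, -0.7, 0.5, 0.1, -0.2]   # arbitrary test frame
X = dct(x)
cepstrum, log_spec = wdctc_from_coeffs(X)
# Exponentiating the complex logarithm recovers the signed coefficients.
reconstructed = [cmath.exp(v).real for v in log_spec]
```

Because $\zeta(k)$ is purely imaginary and the inverse DCT has real coefficients, the real part taken in (2) coincides, in this sketch, with the inverse DCT of $\ln|X(k)|$ alone.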
Let the N-point DCT, $\{X(0), X(1), \ldots, X(N-1)\}$, be given by
\[ X(k) = U(k) \sum_{n=0}^{N-1} x(n) \cos\left(\frac{(2n+1)k\pi}{2N}\right) \tag{3} \]
for $k = 0, 1, \ldots, N-1$, where
\[ U(k) = \begin{cases} \frac{1}{\sqrt{2}}, & k = 0, \\ 1, & \text{otherwise.} \end{cases} \]
The kth row of the $N \times N$ DCT matrix can be viewed as a filter whose transfer function is given by
\[ F_k(z^{-1}) = U(k) \sum_{n=0}^{N-1} \cos\left(\frac{(2n+1)k\pi}{2N}\right) z^{-n}. \tag{4} \]

Manuscript received Apr. 19, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Patrick Naylor.
R. Muralishankar was with INRS-EMT, Montreal, Quebec, Canada and is currently with the Dept. of Telecommunications, P.E.S Institute of Technology, Bangalore, India (e-mail: [email protected]).
Abhijeet Sangwan is with the Center for Robust Speech Systems, Dept. of Electrical Engg., University of Texas at Dallas, Richardson, Texas, U.S.A. (e-mail: [email protected]).
Douglas O’Shaughnessy is with INRS-EMT, Montreal, Quebec, Canada (e-mail: [email protected]).
0000–0000/00$00.00 © 2006 IEEE
[Figure 2: pole-zero maps (Imaginary Part versus Real Part) in panels (a), (c), and (e); magnitude responses (Magnitude (dB) versus Frequency) in panels (b), (d), and (f).]
Fig. 2. Pole-zero map for the 4th filter in a 16-length transform: (a) DCT, (c) WDCT β = 0.56, and (e) WDCT β = −0.56. The corresponding frequency responses: (b) DCT, (d) WDCT β = 0.56, and (f) WDCT, β = −0.56.
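The filter view in (4) can be made concrete with a short sketch: each DCT coefficient in (3) is one output sample of the FIR filter whose taps are the kth row of the DCT matrix, as in Fig. 1(a). This is an illustration under an explicit assumption, namely that the block enters the filter in time-reversed order so that taps and block align at the sample kept by the decimator. The helper names (`dct_taps`, `fir_filter`, `dct_direct`, `dct_via_filter_bank`) are hypothetical.

```python
import math


def dct_taps(k, N):
    """Taps of F_k in (4): U(k) * cos((2n+1) k pi / (2N)), n = 0..N-1."""
    u = 1 / math.sqrt(2) if k == 0 else 1.0
    return [u * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]


def fir_filter(taps, u):
    """Direct-form FIR filtering: y[m] = sum_n taps[n] * u[m - n]."""
    return [sum(taps[n] * u[m - n]
                for n in range(len(taps)) if 0 <= m - n < len(u))
            for m in range(len(u))]


def dct_direct(x):
    """DCT coefficients computed directly from (3)."""
    N = len(x)
    return [sum(t * v for t, v in zip(dct_taps(k, N), x)) for k in range(N)]


def dct_via_filter_bank(x):
    """One block through the filter bank of Fig. 1(a)."""
    N = len(x)
    u = x[::-1]  # assumption: x(N-1) enters first, so taps and block align at m = N-1
    # The decimator keeps one output sample per block from each branch.
    return [fir_filter(dct_taps(k, N), u)[N - 1] for k in range(N)]


x = [0.9, -0.4, 0.3, 1.2, -0.7, 0.5, 0.1, -0.2]   # arbitrary 8-point block
direct = dct_direct(x)
banked = dct_via_filter_bank(x)
```

Both routes give the same eight coefficients; the filter-bank route is far less efficient, but it is the form in which $z^{-1}$ can later be replaced by an all-pass section.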
In (4), the ith coefficient of $F_k(z^{-1})$ is the $(k, i)$th element of the DCT matrix. $F_k(z^{-1})$ is a band-pass filter with center frequency at $\frac{(2k+1)}{2N}$, with the sampling frequency normalized to one. Letting $\theta = \frac{k\pi}{N}$ and using Euler's trigonometric relations, (4) can be written as
\[ F_k(z^{-1}) = \frac{U(k)}{2} \sum_{n=0}^{N-1} \left\{ e^{\frac{i\theta}{2}} e^{in\theta} + e^{\frac{-i\theta}{2}} e^{-in\theta} \right\} z^{-n}. \tag{5} \]
Using a closed form for the geometric series in (5) yields
\[ F_k(z^{-1}) = \frac{U(k)}{2} \left[ e^{\frac{i\theta}{2}} \, \frac{1 - e^{ik\pi} z^{-N}}{1 - e^{i\theta} z^{-1}} + e^{\frac{-i\theta}{2}} \, \frac{1 - e^{-ik\pi} z^{-N}}{1 - e^{-i\theta} z^{-1}} \right]. \tag{6} \]
We rewrite $e^{ik\pi} = (-1)^k$ in (6) and simplify to obtain the expression for the transfer function as
\[ F_k(z^{-1}) = U(k) \cos\left(\frac{\theta}{2}\right) \frac{(1 - (-1)^k z^{-N})(1 - z^{-1})}{(1 - e^{i\theta} z^{-1})(1 - e^{-i\theta} z^{-1})}. \tag{7} \]
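The step from (4) to (7) is easy to check numerically. The sketch below evaluates both forms at an arbitrary point off the unit circle and then looks at $z = e^{i\theta}$, where the numerator of (7) vanishes together with the denominator. The function names and the test point are illustrative choices, not part of the original derivation.

```python
import cmath
import math


def U(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def F_sum(k, N, z):
    """FIR form (4), evaluated at a complex point z."""
    return U(k) * sum(math.cos((2 * n + 1) * k * math.pi / (2 * N)) * z ** (-n)
                      for n in range(N))


def F_closed(k, N, z):
    """Closed form (7); not defined exactly at z = exp(+/- i*theta)."""
    theta = k * math.pi / N
    zi = 1 / z
    num = (1 - (-1) ** k * zi ** N) * (1 - zi)
    den = (1 - cmath.exp(1j * theta) * zi) * (1 - cmath.exp(-1j * theta) * zi)
    return U(k) * math.cos(theta / 2) * num / den


N, k = 16, 4
z_test = 1.3 * cmath.exp(0.7j)              # arbitrary point off the unit circle
theta = k * math.pi / N
z_pole = cmath.exp(1j * theta)               # pole location in (7)
numerator_at_pole = 1 - (-1) ** k * z_pole ** (-N)
peak_gain = abs(F_sum(k, N, z_pole))         # finite: the pole is cancelled by a zero
```

For the 4th filter of a 16-length transform, the gain at the cancelled pole evaluates to N/2 = 8, which is consistent with reading the response peak as a pole-zero cancellation.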
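To obtain the transfer function for the kth WDCT filter, $\tilde{F}_k(z^{-1})$, we substitute $z^{-1}$ with the first-order all-pass filter given by
\[ A(z) = \frac{-\beta + z^{-1}}{1 - \beta z^{-1}}, \tag{8} \]
where β is the warping parameter. The value of β controls the nature and degree of warping. Using (8) in (7), we obtain
\[ \tilde{F}_k(z^{-1}) = U(k) \cos\left(\frac{\theta}{2}\right) \frac{(1 - (-1)^k A(z)^N)(1 - A(z))}{(1 - e^{i\theta} A(z))(1 - e^{-i\theta} A(z))}. \tag{9} \]
The N roots of the term $(1 - (-1)^k z^{-N})$ in (7) are given by
\[ z = e^{i\pi M_l(k)/N}, \quad l = 0, 1, \ldots, N-1, \]
where
\[ M_l(k) = \begin{cases} 2l + 1, & k \text{ odd}, \\ 2l, & k \text{ even}. \end{cases} \]
Now, (7) can be rewritten as
\[ F_k(z^{-1}) = U(k) \cos\left(\frac{\theta}{2}\right) \frac{(1 - z^{-1}) \prod_{l=0}^{N-1} \left(1 - e^{\frac{i\pi M_l(k)}{N}} z^{-1}\right)}{(1 - e^{i\theta} z^{-1})(1 - e^{-i\theta} z^{-1})}. \tag{10} \]
Similarly, (9) becomes
\[ \tilde{F}_k(z^{-1}) = U(k) \cos\left(\frac{\theta}{2}\right) \frac{(1 - A(z)) \prod_{l=0}^{N-1} \left(1 - e^{\frac{i\pi M_l(k)}{N}} A(z)\right)}{(1 - e^{i\theta} A(z))(1 - e^{-i\theta} A(z))}. \tag{11} \]
Using the definition of the all-pass filter in (8), we simplify (11) to obtain
\[ \tilde{F}_k(z^{-1}) = \frac{U(k)(1+\beta)\cos\left(\frac{\theta}{2}\right) \prod_{l=0}^{N-1}\left(1 + \beta e^{i\pi M_l(k)/N}\right) \, (1 - z^{-1}) \prod_{l=0}^{N-1}\left(1 - \frac{\beta + e^{i\pi M_l(k)/N}}{1 + \beta e^{i\pi M_l(k)/N}} z^{-1}\right)}{(1 + \beta e^{i\theta})(1 + \beta e^{-i\theta})(1 - \beta z^{-1})^{(N-1)} \left(1 - \frac{\beta + e^{i\theta}}{1 + \beta e^{i\theta}} z^{-1}\right)\left(1 - \frac{\beta + e^{-i\theta}}{1 + \beta e^{-i\theta}} z^{-1}\right)}. \tag{12} \]
The transfer functions of the DCT and WDCT filters are given by (10) and (12), respectively.

III. COMPLEX CEPSTRUM OF DCT AND WDCT FILTERS

To derive the TCC, we apply the logarithm to (10), i.e.,
\[ \ln(F_k(z^{-1})) = \ln\left(U(k)\cos\left(\frac{\theta}{2}\right)\right) + \ln(1 - z^{-1}) + \sum_{l=0}^{N-1} \ln\left(1 - e^{\frac{i\pi M_l(k)}{N}} z^{-1}\right) - \ln\left[(1 - e^{i\theta} z^{-1})(1 - e^{-i\theta} z^{-1})\right]. \]
The power series expansion for the term $\ln(1 - \alpha z^{-1})$ is [6]
\[ \ln(1 - \alpha z^{-1}) = -\sum_{n=1}^{\infty} \frac{\alpha^n}{n} z^{-n}, \quad |z| > |\alpha|. \tag{13} \]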
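Before expanding the logarithm term by term, the factored WDCT transfer function (12) can be checked against a direct substitution of the all-pass filter (8) into the DCT filter (4). The sketch below is only a numerical sanity check at one arbitrary test point, with N = 16 and β = 0.56 chosen for illustration; the helper names are hypothetical.

```python
import cmath
import math


def U(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def allpass(z, beta):
    """First-order all-pass A(z) of (8)."""
    return (-beta + 1 / z) / (1 - beta / z)


def F_poly(k, N, w):
    """DCT filter (4) with z^{-1} replaced by an arbitrary complex value w."""
    return U(k) * sum(math.cos((2 * n + 1) * k * math.pi / (2 * N)) * w ** n
                      for n in range(N))


def F_wdct_closed(k, N, z, beta):
    """Factored WDCT transfer function (12)."""
    theta = k * math.pi / N
    zi = 1 / z
    # Unit-circle roots exp(i*pi*M_l(k)/N): M_l = 2l+1 for odd k, 2l for even k.
    e = [cmath.exp(1j * math.pi * (2 * l + k % 2) / N) for l in range(N)]
    p, q = cmath.exp(1j * theta), cmath.exp(-1j * theta)
    num = U(k) * (1 + beta) * math.cos(theta / 2) * (1 - zi)
    for el in e:
        num *= (1 + beta * el) * (1 - (beta + el) / (1 + beta * el) * zi)
    den = (1 + beta * p) * (1 + beta * q) * (1 - beta * zi) ** (N - 1)
    den *= (1 - (beta + p) / (1 + beta * p) * zi)
    den *= (1 - (beta + q) / (1 + beta * q) * zi)
    return num / den


N, beta = 16, 0.56
z_test = 1.4 * cmath.exp(0.9j)               # arbitrary point outside the unit circle
direct = [F_poly(k, N, allpass(z_test, beta)) for k in range(N)]
closed = [F_wdct_closed(k, N, z_test, beta) for k in range(N)]
```

The factor $(1 - \beta z^{-1})^{(N-1)}$ in the denominator is where the N − 1 real poles at z = β come from; setting β = 0 in `F_wdct_closed` reduces it to the unwarped filter.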
The region of convergence in (13) includes the unit circle. Using the above-mentioned power series expansion in (13), we get
\[ \ln(F_k(z^{-1})) = \ln\left(U(k)\cos\left(\frac{\theta}{2}\right)\right) - \sum_{n=1}^{\infty} \frac{z^{-n}}{n} - \sum_{l=0}^{N-1} \sum_{n=1}^{\infty} \frac{e^{\frac{i\pi M_l(k) n}{N}}}{n} z^{-n} + \sum_{n=1}^{\infty} \frac{e^{i\theta n}}{n} z^{-n} + \sum_{n=1}^{\infty} \frac{e^{-i\theta n}}{n} z^{-n}. \tag{14} \]
Taking the inverse z-transform, we obtain the TCC for the kth filter as
\[ \hat{f}_k(n) = \begin{cases} \ln\left(U(k)\cos\left(\frac{\theta}{2}\right)\right), & n = 0, \\ -\frac{1}{n}\left\{ 1 + \sum_{l=0}^{N-1} e^{\frac{i\pi M_l(k) n}{N}} - 2\cos(\theta n) \right\}, & n > 0. \end{cases} \tag{15} \]
Similarly, we obtain the TCC of the kth WDCT filter as
\[ \hat{\tilde{f}}_k(n) = \begin{cases} \ln\left(\frac{U(k)(1+\beta)\cos\left(\frac{\theta}{2}\right)}{(1 + \beta e^{i\theta})(1 + \beta e^{-i\theta})}\right) + \sum_{l=0}^{N-1} \ln\left(1 + \beta e^{i\pi M_l(k)/N}\right), & n = 0, \\ -\frac{1}{n}\left\{ 1 + \sum_{l=0}^{N-1} \left(\frac{\beta + e^{i\pi M_l(k)/N}}{1 + \beta e^{i\pi M_l(k)/N}}\right)^{n} - \left(\frac{\beta + e^{i\theta}}{1 + \beta e^{i\theta}}\right)^{n} - \left(\frac{\beta + e^{-i\theta}}{1 + \beta e^{-i\theta}}\right)^{n} - (N-1)\beta^{n} \right\}, & n > 0. \end{cases} \tag{16} \]

[Figure 3: three panels plotted against Sample no.: (a) DCT Cepstral sequence, (b) WDCT Cepstral sequence with β = 0.56, (c) WDCT Cepstral sequence with β = −0.56.]

Fig. 3. Cepstral sequence for all the filters in a 16-length transform: (a) DCT, (b) WDCT β = 0.56, and (c) WDCT β = −0.56. The cepstral sequences of the filters are shown in order with the 0th filter at the bottom.

IV. TCC OF A SYNTHETIC VOWEL

Let the input speech signal be represented as $X(z)$ in the z-domain. Using the source-filter model, we get
\[ X(z) = V(z)E(z), \tag{17} \]
where $V(z)$ and $E(z)$ are the z-domain representations of the vocal-tract response and excitation signal, respectively. Then the output of the kth DCT and WDCT filter bank is given by
\[ Y_k(z) = X(z)F_k(z) \tag{18} \]
and
\[ \tilde{Y}_k(z) = X(z)\tilde{F}_k(z), \tag{19} \]
respectively, where $k = 0, 1, \ldots, N-1$. Applying logarithm and inverse z-transform, (18) and (19) become
\[ \hat{y}_k(n) = \hat{x}(n) + \hat{f}_k(n) \tag{20} \]
and
\[ \hat{\tilde{y}}_k(n) = \hat{x}(n) + \hat{\tilde{f}}_k(n), \tag{21} \]
respectively, where $n = 0, 1, \ldots, N-1$. Here, $\hat{y}_k(n)$ and $\hat{\tilde{y}}_k(n)$ are output cepstral sequences of the DCT and the WDCT. The decimator present in each branch of the filter bank acts as a multiplexer which selects consecutive samples from each of the filter output sequences. Hence, the TCC can be obtained from DCT and WDCT cepstral sequences as
\[ \hat{c}_k = \hat{y}_k(n)\delta(n - k) \tag{22} \]
and
\[ \hat{\tilde{c}}_k = \hat{\tilde{y}}_k(n)\delta(n - k), \tag{23} \]
respectively, where $k, n = 0, 1, \ldots, N-1$.

In this letter, we use the above technique to obtain the TCC for a synthetic vowel. An intelligible vowel sound can be synthesized by appropriately choosing the first three formant frequencies $F_1$, $F_2$ and $F_3$. Also, one could use cascaded digital resonators to produce vowel sounds by choosing resonant frequencies equal to formants. Hence, the problem of deriving the TCC for a vowel sound reduces to determining the TCC for a digital resonator. In order to derive the TCC for a digital resonator, we define the phase and magnitude of the poles of the digital resonator to be $\pm\omega_0$ and $r_0$, respectively, and obtain the corresponding transfer function as
\[ H_{r_0,\omega_0}(z) = \frac{1}{(1 - r_0 e^{j\omega_0} z^{-1})(1 - r_0 e^{-j\omega_0} z^{-1})}. \tag{24} \]
Taking logarithm and inverse z-transform of (24), we get the complex cepstrum of the digital resonator as
\[ \hat{h}_{r_0,\omega_0}[n] = \begin{cases} \frac{2 r_0^{n}}{n} \cos(\omega_0 n), & n > 0, \\ 0, & n = 0. \end{cases} \tag{25} \]
Now, the transfer function for the vocal-tract response of a vowel sound is easily obtained as a product of the three digital resonator transfer functions corresponding to each formant,
\[ V(z) = H_{r_{F_1},\omega_{F_1}}(z) \, H_{r_{F_2},\omega_{F_2}}(z) \, H_{r_{F_3},\omega_{F_3}}(z). \tag{26} \]
Hence, the complex cepstrum of the vowel is now merely the sum of the individual complex cepstrum sequences corresponding to the formants and the excitation, i.e.,
\[ \hat{x}[n] = \sum_{i=1}^{3} \hat{h}_{r_{F_i},\omega_{F_i}}[n] + \hat{e}[n] = \frac{2}{n}\left[ \sum_{i=1}^{3} r_{F_i}^{n} \cos(\omega_{F_i} n) + \sum_{i=1}^{N_h} r^{n} \cos(\omega_i n) \right], \tag{27} \]
where $\hat{e}[n]$ is the complex cepstrum of the excitation, and $N_h$ is the number of harmonics available within $\frac{f_s}{2}$.

V. RESULTS AND DISCUSSION
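As a numerical companion to the derivations of Sections III and IV, the sketch below evaluates the DCT filter TCC (15) and the resonator cepstrum (25), and confirms that exponentiating the resulting power series reproduces the transfer functions (4), (24), and (26) at a test point outside the unit circle. Several values are assumptions made for illustration only: the pole radius `R_POLE = 0.97`, the 16 kHz sampling rate, the test point, and the omission of the excitation term $\hat{e}[n]$. The formant frequencies are those of the synthetic /aa/ discussed in this section, and all function names are hypothetical.

```python
import cmath
import math


def U(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def tcc_dct(k, N, n):
    """TCC of the kth DCT filter, eq. (15)."""
    theta = k * math.pi / N
    if n == 0:
        return math.log(U(k) * math.cos(theta / 2))
    roots = sum(cmath.exp(1j * math.pi * (2 * l + k % 2) * n / N) for l in range(N))
    return -(1 + roots - 2 * math.cos(theta * n)) / n


def resonator_cepstrum(r0, w0, n):
    """Complex cepstrum of a digital resonator, eq. (25)."""
    return 0.0 if n == 0 else 2 * r0 ** n * math.cos(w0 * n) / n


def from_cepstrum(c, z, terms=200):
    """Rebuild a transfer-function value as exp(sum_n c(n) z^{-n}), for |z| > 1."""
    return cmath.exp(sum(c(n) * z ** (-n) for n in range(terms)))


def F_sum(k, N, z):
    """DCT filter in FIR form, eq. (4)."""
    return U(k) * sum(math.cos((2 * n + 1) * k * math.pi / (2 * N)) * z ** (-n)
                      for n in range(N))


def resonator(r0, w0, z):
    """Digital resonator transfer function, eq. (24)."""
    return 1 / ((1 - r0 * cmath.exp(1j * w0) / z) * (1 - r0 * cmath.exp(-1j * w0) / z))


FS = 16000.0                               # assumed sampling rate (Hz)
FORMANTS_HZ = (700.0, 1220.0, 2600.0)      # formants of the synthetic /aa/
R_POLE = 0.97                              # hypothetical pole radius for every formant
omegas = [2 * math.pi * f / FS for f in FORMANTS_HZ]


def vowel_cepstrum(n):
    """Formant part of (27); the excitation cepstrum is left out of this sketch."""
    return sum(resonator_cepstrum(R_POLE, w, n) for w in omegas)


def filter_output_cepstrum(k, N, n):
    """Cepstrum at the output of the kth DCT filter, eq. (20)."""
    return vowel_cepstrum(n) + tcc_dct(k, N, n)


z_test = 1.5 * cmath.exp(0.4j)
# One reading of (22): the decimator keeps sample n = k from the kth branch.
tcc_vowel = [complex(filter_output_cepstrum(k, 16, k)).real for k in range(16)]
```

Comparing `from_cepstrum(lambda n: tcc_dct(4, 16, n), z_test)` with `F_sum(4, 16, z_test)` is a direct test of (15); the same pattern applies to (25) against `resonator`.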
The pole-zero maps for the DCT and WDCT filters are shown in Fig. 2 for the 4th filter of a 16-length transform. It is observed in Fig. 2 that the peaks in the frequency response correspond to a pole-zero cancellation. The warping effect is illustrated in Figs. 2(c), (d), (e), and (f) for two values of the warping parameter (β = 0.56 and −0.56). Note that the warped filters have N − 1 real poles, which boost the magnitude response in the region of higher resolution. Figure 3 shows the cepstral sequence for the DCT and WDCT filters using two values of the warping parameter (β = 0.56 and −0.56), where the effect of warping is evident. It may be useful to note that Smith and Abel have already shown that the warping for β = 0.56 closely resembles the psychoacoustic Bark scale for a sampling frequency of 16 kHz [7]. Hence, β = 0.56 is employed in computing the WDCTC speech feature, and is also used in analyzing the synthetic vowel /aa/ in what follows.

We used a synthetic vowel /aa/ with formant frequencies of 700 Hz, 1220 Hz, and 2600 Hz and a fundamental frequency of 100 Hz. The complex cepstral sequences of the vowel /aa/ with the DCT and WDCT are obtained using (22) and (23), respectively, and are shown in Fig. 4. Figures 4(a) and (c) show the cepstral sequences obtained using the DCT and WDCT. The sequences decay by a factor of $\frac{1}{n}$. Further, pitch peaks can be seen in Figs. 4(a) and (c) and correspond exactly to 100 Hz (160th sample in the quefrency domain). We can also observe a clear separation between the vocal-tract response and the excitation in the quefrency domain for the WDCT. Figures 4(b) and (d) show the spectra obtained from the first 18 cepstral coefficients of (a) and (c). Further, from Figs. 4(b) and (d), we can see that the dynamic range of the WDCT spectrum is smaller than that of the DCT spectrum. It is useful to note that a similar observation about the low variance of the WDCTC was made in practical experiments [2]. It is also observed that the higher resolution at low frequencies gives a closer match between the spectral peaks and the given formant positions for the WDCT than for the DCT.

[Figure 4: panels (a) and (c) show cepstral sequences versus Quefrency (Samples) with the pitch peak marked; panels (b) and (d) show Magnitude (dB) versus Frequency (kHz).]

Fig. 4. TCC for synthetic vowel /aa/ using: (a) DCT and (b) the corresponding spectrum using the first 18 cepstral coefficients, and (c) WDCT, β = 0.56, and (d) the corresponding spectrum using the first 18 cepstral coefficients.

VI. CONCLUSION
We have presented the TCC and the transfer functions for the DCT and WDCT filters, and illustrated the corresponding pole-zero map, frequency responses and TCC sequences. Our analysis of the DCT and WDCT filters has thrown light on some interesting properties such as the distribution of zeros on the unit circle and the pole-zero cancellation, which illustrate the spectral properties of the DCT and WDCT bases. Further, we have also shown the TCC for a synthetic vowel which reflects the property of low variance observed in an earlier work. The results presented in this letter are significant as they provide an opportunity to theoretically analyze the WDCTC which has already shown improved performance over the MFCC in an earlier work.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful and constructive comments during the review process.

REFERENCES

[1] R. Muralishankar, A. Sangwan, and D. O’Shaughnessy, “Warped Discrete Cosine Transform Cepstrum: A new feature for speech processing,” in EUSIPCO’05, Antalya, Turkey, Sept. 2005.
[2] A. Sangwan, R. Muralishankar, and D. O’Shaughnessy, “Performance analysis of the Warped Discrete Cosine Transform Cepstrum with MFCC using different classifiers,” IEEE Workshop on Machine Learning for Signal Processing, pp. 99–104, Sept. 2005.
[3] R. Muralishankar, A. Sangwan, and D. O’Shaughnessy, “Statistical properties of the Warped Discrete Cosine Transform Cepstrum compared with the MFCC,” in EUROSPEECH’05, Lisbon, Portugal, Sept. 2005.
[4] N. Cho and S. Mitra, “Warped discrete cosine transform and its application in image compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 8, pp. 1364–1373, Dec. 2000.
[5] Y. Ephraim and M. Rahim, “On second-order statistics and linear estimation of cepstral coefficients,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 2, pp. 162–176, Mar. 1999.
[6] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing. Prentice-Hall Inc., New Jersey, 1989, ch. 12, p. 780.
[7] J. O. Smith III and J. S. Abel, “Bark and ERB bilinear transforms,” IEEE Trans. on Speech and Audio Proc., vol. 7, pp. 697–708, June 1999.