INTERSPEECH 2005

Voicing Features for Robust Speech Detection

Trausti Kristjansson, Sabine Deligne, Peder Olsen
IBM T.J. Watson Research Center, Yorktown Heights, NY 10601
{tkristj,deligne,pederao}@us.ibm.com

Abstract

Accurate speech activity detection is a challenging problem in the car environment, where high background noise and high-amplitude transient sounds are common. We investigate a number of features that are designed to capture the harmonic structure of speech. We evaluate separately three important characteristics of these features: 1) discriminative power, 2) robustness to greatly varying SNR and channel characteristics, and 3) performance when used in conjunction with MFCC features. We propose a new feature, the Windowed Autocorrelation Lag Energy (WALE), which has desirable properties.

1. Introduction

Speech-silence discrimination and end-pointing are important components of many speech recognition systems. Speech-silence discrimination is a challenging problem in the car environment, where it is common to have high-intensity semi-stationary background noise and high-amplitude transient noises such as road bumps, wiper noise, door slams, tapping, etc. High SNR conditions are also commonly encountered, such as when the car is stationary. We are therefore interested in features that are highly discriminative while being very robust to different conditions.

In this paper we focus on the inherent performance of the features, which should be independent of the higher-level decision mechanism. Various decision mechanisms have been proposed, such as likelihood ratios [1], HMMs [2] and hierarchical HMMs [3].

A simple and effective feature for speech detection in high SNR conditions is signal energy. Any robust decision mechanism based on energy must adapt to the relative signal and noise levels and to the overall gain of the signal. In contrast, all the features reviewed in this paper are gain invariant.

MFCC features are also effective for discriminating speech from other environmental sounds, although they were not designed for this purpose. In particular, the Mel filter removes the characteristics of the excitation signal. For voiced sounds the excitation signal is a periodic glottal pulse train, which manifests itself as harmonic structure in the spectrum (see Figure 1(c)). Since MFCC features do not capture the harmonic structure of speech, an avenue of exploration is to extend the feature space with features that succinctly capture the strength of voicing of the signal. An additional motivation for pursuing the structure of voiced speech rather than that of unvoiced speech is that in the car environment, unvoiced speech sounds are easily confusable with wind, road and fan noise.

2. Features for Voicing Detection

We investigated well-known features and some recently introduced features. These features are:

• Autocorrelation Peak Count
• Spectral Entropy
• Maximum LPC Residual Autocorrelation Peak
• Spectral Autocorrelation Peak Valley Ratio
• Maximum Autocorrelation Peak
• Maximum Cepstral Coefficient

In addition, we introduce an extension of the Maximum Autocorrelation coefficient that is designed to improve its robustness:

• Windowed Autocorrelation Lag Energy (WALE)

These features use either the autocorrelation or the spectrum, or a combination of the two, in conjunction with a nonlinear method for extracting a single measure. They all attempt to condense the harmonic structure of voiced speech into a single coefficient that is relatively efficient to compute.

2.1. Autocorrelation Based Features

A number of techniques in the literature are based on the autocorrelation of the signal [3, 4]. The periodic character of the speech signal makes it a good candidate for searching for self-similarity, i.e. repetitions of the filtered glottal pulse. However, the autocorrelation captures any repetitive signal, including motor noise. The standard un-normalized autocorrelation is

a_j[k] = \sum_{n=k}^{N} x_j[n] x_j[n-k]    (1)

where x_j is the j-th segment of the signal and k is the lag. The autocorrelation can be normalized in a number of ways. In the lag-zero normalized autocorrelation, each lag is divided by a[0]; this ensures gain invariance [4]. The short-time normalized autocorrelation normalizes each element by the energy of that lag, and hence normalizes both for the number of lags and for the energy:

acorr_j[k] = \frac{\sum_{n=k}^{N} x_j[n] x_j[n-k]}{\left( \sum_{n=1}^{N-k} x_j[n]^2 \right)^{1/2} \left( \sum_{n=k}^{N} x_j[n]^2 \right)^{1/2}}    (2)
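For concreteness, here is a minimal numpy sketch of Equation 2. The function name and the small constant guarding against division by zero are illustrative choices of this sketch, not details from the paper.

```python
import numpy as np

def short_time_normalized_autocorrelation(x, max_lag):
    """Short-time normalized autocorrelation of one frame (Equation 2).

    For each lag k, the raw autocorrelation sum_{n=k} x[n] x[n-k] is
    divided by the square-root energies of the two sub-segments involved,
    so the result is gain invariant.
    """
    N = len(x)
    acorr = np.zeros(max_lag + 1)
    for k in range(max_lag + 1):
        num = np.dot(x[k:], x[:N - k])                 # sum of x[n] * x[n-k]
        e1 = np.sum(x[:N - k] ** 2)                    # energy of leading segment
        e2 = np.sum(x[k:] ** 2)                        # energy of lagged segment
        acorr[k] = num / (np.sqrt(e1 * e2) + 1e-12)    # guard against divide-by-zero
    return acorr
```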

The methods based on the autocorrelation include the Maximum Autocorrelation Peak [4], which finds the magnitude or power of the maximum peak within the range of lags that corresponds to the range of fundamental frequencies of male and female voices. In our experiments, we used a range of 50 Hz to 400 Hz, corresponding to lags 320 down to 40, respectively, at a 16 kHz sampling rate. Another measure is the Autocorrelation Peak Count [3], the number of peaks found in a range of lags. For pitch estimation in high SNR conditions, it is advantageous to remove the correlations of the vocal tract to reveal an approximation of the glottal pulse train. This can be done by inverse LPC filtering. The Maximum LPC Residual Autocorrelation Peak [5] measure is based on finding the peak of the autocorrelation of the LPC residual signal.
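As a sketch of the Maximum Autocorrelation Peak, the 50-400 Hz search range maps to lags fs/400 through fs/50 at sampling rate fs; the helper below reuses the autocorrelation function from the previous sketch and is illustrative rather than the authors' exact implementation.

```python
import numpy as np

def max_autocorrelation_peak(x, fs=16000, f_lo=50.0, f_hi=400.0):
    """Largest short-time normalized autocorrelation value within the lag
    range corresponding to plausible fundamental frequencies."""
    lag_lo = int(fs / f_hi)   # 40 lags at 16 kHz  (400 Hz)
    lag_hi = int(fs / f_lo)   # 320 lags at 16 kHz (50 Hz)
    acorr = short_time_normalized_autocorrelation(x, lag_hi)
    return np.max(acorr[lag_lo:lag_hi + 1])
```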

Figure 1: (a) The PCM waveform of a segment of voiced speech; three glottal periods are shown. (b) The Normalized Autocorrelation shows a distinctive 'saw' pattern centered at the glottal period (lag 146 ≈ 109 Hz fundamental at a 16 kHz sampling rate). The shaded box corresponds to a window of 30 lags; the energy of the autocorrelation lags in this window corresponds to the WALE coefficient for the speech segment. (c) The Log Spectrum shows regularly spaced harmonic peaks characteristic of voiced speech. (d) The Cepstrum has a very distinct peak corresponding to the fundamental frequency.

2.2. Spectrum Based Features

The Spectral Entropy [6, 3] measure is found by interpreting the short-time spectrum as a probability distribution over a single discrete random variable X and then calculating the entropy of the distribution. The spectral distribution is found by normalizing the values of the short-time spectrum:

p_X(f) = \frac{s(f)}{\sum_{k=1}^{N} s(k)}

where s(f) is the spectral energy for frequency f, and p_X is the spectral distribution. Now we can calculate the spectral entropy for frame j as

H(j) = -\sum_{k=1}^{N} p_{X_j}(k) \log(p_{X_j}(k))    (3)
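A minimal numpy sketch of Equation 3 follows; the paper does not specify the window or FFT size, so those details (and the small epsilon guards) are left as illustrative choices.

```python
import numpy as np

def spectral_entropy(x):
    """Spectral Entropy (Equation 3): treat the normalized short-time
    power spectrum as a probability distribution and take its entropy.
    Harmonic (voiced) frames concentrate energy in few bins -> low entropy."""
    s = np.abs(np.fft.rfft(x)) ** 2          # short-time power spectrum
    p = s / (np.sum(s) + 1e-12)              # normalize to a distribution
    return -np.sum(p * np.log(p + 1e-12))    # entropy in nats
```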

Figure 2: Distributions for voiced speech and background noise for the log WALE and Spectral Entropy features, shown under the close-talk and far-field conditions. Notice that the distribution for background noise changes considerably between conditions for the entropy feature, while log WALE is robust.

Due to the harmonic structure of voiced speech, it is expected that voiced speech will have relatively low entropy, while stationary background noise is expected to have high entropy. This tendency can be seen in Figure 2. Various noise signals, such as alarms, brake squeaks and sirens, are also expected to have low entropy.

The Spectral Autocorrelation Peak Valley Ratio (SAPVR) [7] measure was introduced in the context of usable speech detection. If a single speaker is speaking, the spectrum will have regularly spaced harmonic peaks; if two speakers are speaking simultaneously, this structure will be distorted. SAPVR takes the autocorrelation of the magnitude spectrum to detect the harmonic regularity. After this operation, the maximal ratio between the first valley and the second peak in the autocorrelation is found (a few variants are reported in the literature [7, 8]).

The Cepstral Peak has been used for pitch estimation [9] as well as for voice activity detection. The cepstrum is computed as

ceps = DCT(\log(|FFT(x)|^2))    (4)

where x is a short segment of the signal. Note that, unlike the Mel-Filtered Cepstral Coefficient (MFCC) transform, it is important not to use Mel filtering or warping here. It is well known that the low-order cepstra characterize the vocal tract filter, whereas the higher orders capture the excitation. Figure 1(d) shows a clear peak corresponding to the excitation period. In order to better capture the peak, we ran a difference operator over the cepstra and then took the difference between the maximum and minimum values. This produced better results than directly using the maximum.
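A sketch of the Maximum Cepstral Coefficient feature as described: Equation 4 followed by the difference operator. The use of scipy's DCT is an assumption for illustration; note that no Mel filtering is applied.

```python
import numpy as np
from scipy.fftpack import dct  # DCT, as in Equation 4 (assumed implementation)

def max_cepstral_coefficient(x):
    """Cepstral Peak feature: Equation 4 plus the difference operator
    described in the text (spread of the differenced cepstrum)."""
    ceps = dct(np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-12), norm='ortho')
    d = np.diff(ceps)             # difference operator over the cepstra
    return np.max(d) - np.min(d)  # max minus min of the differenced cepstrum
```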


3. Windowed Autocorrelation Lag Energy

The Windowed Autocorrelation Lag Energy (WALE) measure is designed as a robust extension of the Autocorrelation Maximum Peak Amplitude metric. Voiced speech is produced when the vocal cords produce a glottal pulse train that is then filtered by the vocal tract. The autocorrelation of a pulse train has a single peak at the lag corresponding to the period of the pulse train, which motivates the use of the Maximum Autocorrelation as a voicing indicator. However, the vocal tract introduces correlations and spreads out the energy somewhat. The signal decays rapidly after the glottal pulse, and energy is concentrated in that region. In the autocorrelation, this manifests itself as the 'saw' pattern seen in Figure 1(b). The motivation for the Windowed Autocorrelation Lag Energy is to better capture this structure by taking into account a short window where most of the energy should be concentrated when the signal is voiced speech.

To calculate this feature, we slide a window across the autocorrelation lags and calculate the energy of the lags in the window at each shift point. The maximum value is then returned. A window of length 30 is shown as the shaded area in Figure 1(b), centered at the maximum value. Hence WALE is computed as

WALE(j) = \max_l \sum_{i=l}^{l+W-1} |acorr_j(i)|^2    (5)

where acorr_j is the vector of autocorrelation coefficients calculated in Equation 2 and W is the length of the lag window. In our experiments W was set to 15 lags. Notice that when the window is of length W = 1, WALE is equivalent to the square of the Maximum Autocorrelation.

To further improve robustness, we can take advantage of the fact that voiced segments usually span a few consecutive frames. Since the voicing period will not change dramatically between consecutive frames, the location of the maximum will be close in consecutive frames. We therefore define the Multi-Frame WALE (WALE_MF) as

WALE_MF(j) = \max_l \sum_{i=l}^{l+W-1} \sum_{t=j-\alpha}^{j+\beta} |acorr_t(i)|^2    (6)

where α and β designate how many past and future frames to consider, respectively. Both α and β were set to 1 in our experiments. When using Gaussian Mixture Models to model the feature distributions, it is advantageous to use log(WALE) and log(WALE_MF) instead.
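A sketch of Equations 5 and 6, assuming per-frame autocorrelation vectors computed as in Equation 2. W = 15 and α = β = 1 follow the values quoted in the text; the boundary clamp at the start of the utterance is an implementation choice of this sketch.

```python
import numpy as np

def wale(acorr, W=15):
    """Windowed Autocorrelation Lag Energy (Equation 5): maximum energy
    of any length-W window of autocorrelation lags for one frame."""
    energies = [np.sum(acorr[l:l + W] ** 2) for l in range(len(acorr) - W + 1)]
    return max(energies)

def wale_mf(acorr_frames, j, W=15, alpha=1, beta=1):
    """Multi-Frame WALE (Equation 6): window energy is accumulated over
    frames j-alpha .. j+beta before taking the maximum over shifts l."""
    frames = acorr_frames[max(j - alpha, 0):j + beta + 1]   # clamp at utterance start
    summed = np.sum([f ** 2 for f in frames], axis=0)       # per-lag energy across frames
    return max(np.sum(summed[l:l + W]) for l in range(len(summed) - W + 1))
```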

4. Feature Evaluation

Three performance characteristics of the features are of particular interest to us: 1) their discriminative power, 2) their robustness to different conditions, and 3) how well they complement the MFCC features, which are known to be effective. To assess these aspects of the features we collected a dataset consisting of 3 male and 3 female speakers. Each speaker contributed about 10 minutes of speech data, for a total of about 1 hour of data. The data was collected in different cars and in a variety of conditions. An attempt was made to produce the whole range of noise conditions, transient sounds and situations where speech detection might fail. As an example, the data contains speech recorded while driving over road seams, washboards, potholes and other rough surfaces. Data was also collected by the roadside with door slams, trunk slams, and with open windows and trucks driving by at high speed in heavy rain. The data was collected on two channels: one channel recorded a far-field microphone mounted on the rear-view mirror, and the second channel a head-mounted close-talking noise-canceling microphone.

A portion of the close-talking data was hand labeled with three tags: voiced, unvoiced and non-speech. In order to get a similar labeling for the whole dataset, we ran forced alignment with a speech recognition system and known transcriptions on all the close-talking data. Phone models that correspond well to voiced speech segments were then used as the ground truth for voiced sounds. The recognition system also labeled non-speech segments reliably; this labeling was used as the ground truth for non-speech segments.

4.1. Discriminative Power

To assess the discriminative power of individual features we calculated two metrics for each feature when used alone: 1) the symmetric KL distance between the voiced and non-speech models in two noise and channel conditions, and 2) ROC curves showing segment-level False Accept and False Reject error rates.

Figure 3: Symmetric KL distance between the voiced distribution and the non-speech distribution for the far-field and close-talk conditions. High values are indicative of highly discriminative features.

Figure 3 shows the symmetric KL distance between p(x|voiced) and p(x|non-speech) for all features in the far-field and close-talk conditions. Note that in the close-talk condition, all features performed well. In the far-field condition, the performance of some of the features (e.g. Autocorrelation Peak Count and Spectral Entropy) decreases considerably, as the distributions for speech and non-speech overlap. The cepstrum, the Max Autocorrelation and log WALE continue to perform well.
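As an illustration of the discriminability metric, the sketch below computes a symmetric KL distance between single-Gaussian fits to two sets of scalar feature values. This is a simplification: the paper models the features with Gaussian mixtures and does not state how the KL distance was evaluated for them.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def symmetric_kl(feat_voiced, feat_nonspeech):
    """Symmetric KL distance between two sets of scalar feature values,
    each summarized by a single Gaussian fit (illustrative simplification)."""
    m1, v1 = np.mean(feat_voiced), np.var(feat_voiced)
    m2, v2 = np.mean(feat_nonspeech), np.var(feat_nonspeech)
    return kl_gauss(m1, v1, m2, v2) + kl_gauss(m2, v2, m1, v1)
```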

4.2. Robustness

To assess the robustness of the features we also calculated the symmetric KL distance between equivalent distributions in the far-field and the close-talking conditions. This measure gives an indication of how much the features vary between two common conditions, i.e. the high-SNR close-talking condition and the low-SNR noisy far-field condition, and hence how they can be expected to perform in new, unseen conditions.

Figure 4: Symmetric KL distance between the close-talk and far-field conditions, shown for both the non-speech and voiced speech distributions. A large value is indicative of a non-robust feature.

Figure 4 shows that the noise distributions for Autocorrelation Peak Count and Spectral Entropy change considerably, while Max Autocorrelation and log WALE show relatively good robustness characteristics for both the noise and voiced speech distributions.


False Reject Rate    5.00%    10.00%   15.00%   20.00%   24.62%   30.00%   40.00%   Average
FA, MFCC only        68.28%   42.33%   25.19%   13.45%    7.33%    3.55%    1.41%       --
Max LPC residual      0.71%    4.52%    8.85%   15.63%   16.38%    8.00%    3.58%    8.24%
Acorr. Peak Count     0.06%    2.24%    1.34%   -1.61%   -0.81%    0.16%   -1.84%   -0.06%
Entropy               0.56%    5.04%    1.80%   -4.50%   -3.01%   -5.38%  -12.23%   -2.53%
log SAPVR            -0.48%   -0.39%   -3.15%   -3.31%   -2.04%   -7.02%  -14.21%   -4.37%
Max Cep. coef.        0.96%   -0.97%   -3.07%   -6.30%  -13.05%  -18.19%  -19.26%   -8.55%
Max Autocorr.        -2.71%   -0.52%   -8.08%  -11.88%  -21.18%  -21.49%  -17.94%  -11.97%
log WALE             -1.38%   -6.33%  -12.15%  -11.12%  -20.13%  -19.67%  -19.72%  -12.93%

Table 1: Relative percent change in the False Accept rate when a single feature is added to the MFCC features. The columns represent different points on an ROC curve; the baseline False Accept rate for the MFCC features is shown in the first row. Notice that the Max Autocorr. and log WALE features perform well, with log WALE performing best when a low False Reject rate is desirable. The False Reject rate of 24.62% corresponds to a symmetric loss function of a Bayes classifier (i.e. the cost of rejecting voiced speech is equal to the cost of accepting noise).


Figure 5: ROC curves (False Accept vs. False Reject) for the individual features.

4.3. Complement to MFCC Features

To assess how well these features complement the MFCC features used in our baseline speech detection system, we appended each of the features in turn to the 13 MFCC features and noted the effect on the False Accept rate at a given False Reject rate. The voiced speech and noise models were trained on a large dataset consisting of in-car speech recorded at 0 mph, 30 mph and 60 mph. Each feature was modeled with a 64-Gaussian mixture model, and the models were combined assuming independence between the features. The test set was the 1-hour far-field test set described above; the relative amount of low-SNR conditions and transient noises in the test set was larger than in the training set.

Table 1 shows the effect of adding each feature in turn. The numbers represent different points on an ROC curve; different False Reject rates were achieved by artificially biasing the prior probabilities of the speech and noise models. In our application, the cost of missing a speech vector is high, and it is desirable to select a low False Reject rate. The best improvement is achieved by the features at the bottom of the table. The log WALE feature is slightly better on average than the Max Autocorrelation feature, and at low False Reject rates log WALE outperforms Max Autocorrelation. It is interesting to note that inverse LPC filtering the signal prior to using the Max Autocorrelation is harmful to performance; this may be because the feature is not robust to the types of noise in the test set that were not seen in the training set. It is also interesting to note that adding any of these features helps less when we bias towards low False Reject rates.
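The sketch below illustrates the independence assumption used to combine the per-feature models: per-frame log-likelihoods simply add, and the speech/noise decision can be biased via a log-prior term to trace out different False Reject rates. The use of sklearn's GaussianMixture is an assumed stand-in for the authors' GMM trainer, not their actual toolkit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumed stand-in GMM trainer

def train_class_models(mfcc_frames, voicing_values, n_mix=64):
    """Fit separate GMMs to the 13-dim MFCC vectors and to the scalar
    voicing feature for one class (voiced speech or noise)."""
    gmm_mfcc = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(mfcc_frames)
    gmm_voice = GaussianMixture(n_components=n_mix).fit(voicing_values.reshape(-1, 1))
    return gmm_mfcc, gmm_voice

def decide_speech(models_speech, models_noise, mfcc, voice, log_prior_bias=0.0):
    """Frame-level decision under the independence assumption: the joint
    log-likelihood is the sum over feature streams. Sweeping log_prior_bias
    biases the priors and traces out the ROC curve."""
    def loglik(models):
        gmm_mfcc, gmm_voice = models
        return gmm_mfcc.score_samples(mfcc) + gmm_voice.score_samples(voice.reshape(-1, 1))
    return loglik(models_speech) + log_prior_bias > loglik(models_noise)
```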

5. Discussion

We have evaluated a number of well-known voice activity features as well as some features that have recently been proposed. Our evaluation focused on performance in very difficult noise conditions and on robustness to different noise and channel conditions. We also introduced the Windowed Autocorrelation Lag Energy feature, which has advantages over the Maximum Autocorrelation feature when low False Reject rates are desirable.

6. References

[1] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 7, 1999.
[2] R. Sarikaya and J. H. Hansen, "Robust detection of speech activity in the presence of noise," in Proc. International Conference on Spoken Language Processing, 1998.
[3] S. Basu, "A linked-HMM model for robust voicing and speech detection," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2003), 2003, pp. 816-819.
[4] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2001), 2001.
[5] W. Hess, Pitch Determination of Speech Signals. Springer Verlag, 1983.
[6] J.-L. Shen, J.-W. Hung, and L.-S. Lee, "Robust entropy-based endpoint detection for speech recognition in noisy environments," in Proc. International Conference on Spoken Language Processing, Sydney, Australia, Nov.-Dec. 1998.
[7] K. R. Krishnamachari and R. E. Yantorno, "Spectral autocorrelation ratio as a usability measure of speech segments under co-channel conditions," in IEEE Symposium on Intelligent Signal Processing and Communication Systems, 2000.
[8] R. E. Yantorno, K. R. Krishnamachari, and J. M. Lovekin, "The spectral autocorrelation peak valley ratio (SAPVR) - a usable speech measure employed as a co-channel detection system," in IEEE International Workshop on Intelligent Signal Processing, 2001.
[9] S. Ahmadi and A. S. Spanias, "Cepstrum-based pitch detection using a new statistical V/UV classification algorithm," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 333-337, 1999.

