Temporal Filtering of Visual Speech for Audio-Visual Speech Recognition in Acoustically and Visually Challenging Environments

Jong-Seok Lee and Cheol Hoon Park
School of Electrical Engineering and Computer Science
Korea Advanced Institute of Science and Technology (KAIST)
Guseong-dong, Yuseong-gu, Daejeon, 305-701, Korea

[email protected]; [email protected]

ABSTRACT


The use of visual speech information has been shown to be effective for compensating for the performance degradation of acoustic speech recognition in noisy environments. However, visual noise is ignored in most audio-visual speech recognition systems, although it can be introduced into visual speech signals during acquisition or transmission. In this paper, we present a new temporal filtering technique for extracting noise-robust visual features. In the proposed method, a carefully designed band-pass filter is applied to the temporal pixel value sequences of lip region images in order to remove unwanted temporal variations due to visual noise, illumination conditions or speakers' appearances. We demonstrate that the method improves not only visual speech recognition performance for clean and noisy images but also audio-visual speech recognition performance in both acoustically and visually noisy conditions.

Categories and Subject Descriptors
I.4.7 [Image Processing and Computer Vision]: Feature measurement – feature representation; I.2.7 [Artificial Intelligence]: Natural Language Processing – speech recognition and synthesis

General Terms
Algorithms, Experimentation.

Keywords
Audio-visual speech recognition, temporal filtering, noise-robustness, feature extraction, late integration, hidden Markov model, neural network.

1. INTRODUCTION

Audio-visual speech recognition (AVSR) is a multimodal human-computer interface technology which uses both acoustic and visual speech signals to recognize speech automatically. While conventional speech recognition using only acoustic signals performs well in quiet conditions, its performance degrades easily in the presence of acoustic noise, which is inevitable in most real-world applications of automatic speech recognition. The visual signal can be a powerful source for compensating for such performance degradation because speech recognition using visual signals is not affected by acoustic noise. It has been shown that the additional use of the visual modality together with the acoustic one improves the robustness of recognition performance against acoustic noise [1-3]. AVSR by computers has been motivated by the bimodal nature of human speech perception. People use lip movements as a supplementary information source for speech understanding in acoustically noisy environments [4] and even in clean conditions [5]. The McGurk effect shows that an illusion can occur when the acoustic and the visual stimuli conflict [6]. Also, it has been shown that some acoustically confusable phonemes are easily distinguished by using visual speech information [7].



Several AVSR systems have been presented by researchers, most of which aim at obtaining robustness only against acoustic noise [8]. Although visual speech signals may be contaminated by noise during acquisition or transmission [9], only a few recent studies have considered visually challenging conditions [10,11]. When a visually degraded speech signal is used for AVSR, discriminating between speech classes with it becomes difficult, and thereby the overall AVSR performance is degraded, especially when the audio-visual integrated recognition must rely on the visual modality because of heavy noise in the acoustic signal. While extensive research on noise-robust acoustic feature extraction has been carried out by analyzing speech dynamics or modeling the human auditory system, there is little work on extracting robust visual features.




In this paper, we propose a temporal filtering method to extract noise-robust visual features for AVSR in both acoustically and visually challenging environments. We design a band-pass filter (BPF) which is applied to the temporal trajectory of the intensity value at each pixel location in the mouth region images. The filter suppresses spectral components that change too slowly or too quickly to be relevant to speech information and that are easily contaminated by noise, and passes only speech-related components. We determine the passband of the filter based on psychological, spectral and experimental analyses. We demonstrate that the extracted noise-robust visual features improve visual speech recognition performance in both clean and visually challenging conditions and that, as a result, the AVSR system using the proposed feature extraction method shows robust recognition performance in acoustically and visually noisy conditions.




To extract pixel-based features in an image, we first segment the lip region in the image. This stage also includes pre-processing steps to reduce the effects of speakers' different appearances and uneven illumination conditions within and across images. First, the difference in brightness between the left and the right parts of the image is balanced so that we can detect the two mouth corners accurately [18]. Next, the pixel values in the image are normalized by the histogram specification method [9] so that all incoming images have the same pixel value distribution. Then, the two mouth corners are detected by thresholding; since there is always a dark region between the upper and the lower lips due to the oral cavity or the shadow of the upper lip, the mouth corners can be found by detecting the end points of the dark region after thresholding. The mouth region is cropped based on the corner points. As a result, a lip region image of 44×50 pixels, which is invariant to scaling and rotation, is obtained.
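The following sketch illustrates how such a preprocessing pipeline could look in code, assuming 8-bit grayscale input frames; the reference histogram, dark-region threshold and cropping margins are illustrative placeholders rather than the authors' actual values.

```python
# A minimal sketch of the lip-region preprocessing described above (assumptions:
# 8-bit grayscale input, an externally supplied reference histogram, and a
# fixed darkness threshold for locating the region between the lips).
import numpy as np
import cv2

def histogram_specification(img, ref_hist):
    """Map pixel values so the image histogram matches a reference histogram."""
    src_hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    src_cdf = np.cumsum(src_hist) / img.size
    ref_cdf = np.cumsum(ref_hist) / np.sum(ref_hist)
    mapping = np.interp(src_cdf, ref_cdf, np.arange(256))
    return mapping[img].astype(np.uint8)

def extract_lip_region(img, ref_hist, dark_thresh=60, out_size=(50, 44)):
    """Normalize the image, locate the mouth corners and crop the lip region."""
    norm = histogram_specification(img, ref_hist)
    dark = norm < dark_thresh                  # dark region between the lips
    ys, xs = np.nonzero(dark)
    left, right = xs.min(), xs.max()           # approximate mouth corners
    cy = int(ys.mean())                        # vertical centre of the dark region
    w = right - left
    x0, x1 = left - w // 8, right + w // 8     # margin around the corners
    y0, y1 = cy - w // 2, cy + w // 2
    crop = norm[max(y0, 0):y1, max(x0, 0):x1]
    return cv2.resize(crop, out_size)          # 44x50-pixel lip region image
```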

The remainder of the paper is organized as follows. Section 2 describes the baseline AVSR system including the database, the acoustic and the visual feature extraction methods, the recognizers and the audio-visual integration problem. In Section 3, the proposed temporal filtering method is presented. Section 4 shows experimental results for visual speech recognition and audio-visual integrated recognition. Finally, concluding remarks are given in Section 5.

2. BASELINE SYSTEM

To obtain features from the segmented lip region images, we use principal component analysis (PCA), which has been widely used for feature extraction of visual speech [19]. PCA finds the main linear modes of variation in the training lip region images. Before applying PCA, we subtract the mean value of each pixel location over an utterance from the pixel values so as to remove constant variations over different utterances [16]. We use the coefficients of the first 12 principal components as the visual features.
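A minimal sketch of this step is given below, assuming each lip region image is vectorized frame by frame; scikit-learn's PCA is used purely as a convenient stand-in for the authors' implementation.

```python
# Per-utterance mean subtraction followed by projection onto the first 12
# principal components learned from the training lip-region images.
import numpy as np
from sklearn.decomposition import PCA

def fit_pca(train_utterances, n_components=12):
    # Each utterance: array of shape (n_frames, 44 * 50) of vectorized lip images.
    centered = [u - u.mean(axis=0, keepdims=True) for u in train_utterances]
    pca = PCA(n_components=n_components)
    pca.fit(np.vstack(centered))
    return pca

def visual_features(utterance, pca):
    centered = utterance - utterance.mean(axis=0, keepdims=True)
    return pca.transform(centered)   # (n_frames, 12) static visual features
```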

2.1 Database

We use two audio-visual databases of isolated words [12]. The DIGIT database contains eleven Korean digits including two versions of zero, and the CITY database contains the names of sixteen famous Korean cities. Fifty-six speakers participated in database collection and each speaker pronounced each word three times. A camera recorded the face regions around the speakers' lips at a frame rate of 30 Hz and, at the same time, a microphone recorded the acoustic signals at a sampling rate of 32 kHz (downsampled to 16 kHz for feature extraction).

2.4 Speech Recognizer

Hidden Markov models (HMMs) are the dominant paradigm for recognizers of both acoustic and visual speech [8,13]. We use left-to-right continuous HMMs with Gaussian mixture models for the observation probability distributions in the HMM states. An HMM is trained to model each word by the well-known expectation-maximization algorithm [13]. During the recognition phase, a test datum of unknown class is input to the HMMs of all classes, and the class whose model shows the maximum probability for the datum is chosen.
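As a rough illustration of such a recognizer, the sketch below uses the hmmlearn library as a stand-in for the authors' HMM implementation: one left-to-right GMM-HMM per word, trained with EM, with recognition by maximum log-likelihood. The transition initialization and library choice are assumptions, not the paper's code.

```python
# One left-to-right GMM-HMM per word; recognition picks the highest-scoring model.
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_left_to_right_hmm(n_states, n_mix=3):
    hmm = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type='diag',
                 init_params='mcw', params='tmcw')   # keep our start/transition structure
    hmm.startprob_ = np.eye(n_states)[0]             # always start in the first state
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):                        # self-loop or move one state right
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    hmm.transmat_ = trans
    return hmm

def train_word_models(features_per_word, n_states_per_word):
    models = {}
    for word, feats in features_per_word.items():    # feats: list of (T_i, D) arrays
        hmm = make_left_to_right_hmm(n_states_per_word[word])
        hmm.fit(np.vstack(feats), lengths=[f.shape[0] for f in feats])
        models[word] = hmm
    return models

def recognize(models, features):
    return max(models, key=lambda w: models[w].score(features))
```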

The data of the 56 speakers are divided into three groups for speaker-independent experiments. The data of 28 speakers are used for training the recognizers, the data of 14 speakers for developing the proposed filter, and the data of the remaining 14 speakers for the recognition test.

2.2 Acoustic Feature Extraction

We use the Mel-frequency cepstral coefficients (MFCCs) [13] as acoustic features. A single frame contains speech samples of 25 ms and the frame window proceeds by 10 ms. After a Hamming window is applied to each frame, Mel-scale filterbank analysis converts the speech samples in the frame into frequency information. Finally, the discrete cosine transform is applied to the log-transformed outputs of the filterbank. The 12th-order MFCCs, the normalized frame energy and their temporal derivatives (delta terms) form the acoustic features of a frame.
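This front end can be approximated with standard tools; the sketch below uses librosa as a stand-in, and the handling of the normalized frame energy (replacing the 0th cepstral coefficient) is an assumption, since the paper does not state how the energy term is combined with the cepstra.

```python
# 25 ms Hamming-windowed frames with a 10 ms shift, Mel filterbank analysis,
# DCT, 12 cepstral coefficients plus a normalized log-energy term, and deltas.
import numpy as np
import librosa

def acoustic_features(y, sr=16000):
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)        # 25 ms frames, 10 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, window='hamming')
    energy = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]
    log_e = np.log(energy + 1e-10)
    mfcc[0] = (log_e - log_e.mean()) / (log_e.std() + 1e-10)  # normalized frame energy
    delta = librosa.feature.delta(mfcc)                  # temporal derivatives
    return np.vstack([mfcc, delta]).T                    # (n_frames, 26) feature vectors
```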

2.5 Audio-Visual Integration

Integration of the acoustic and visual modalities aims at obtaining robust speech recognition performance in noisy circumstances. Models for audio-visual integration can be categorized into two major approaches: early integration (or feature fusion) and late integration (or decision fusion). In the early integration model, the acoustic and the visual features are combined and then used for recognition by a single recognizer. In the late integration model, a recognizer for each modality performs recognition separately and the outputs of the two recognizers are combined to produce the final decision. Late integration can occur at various levels such as phonemes, words, or sentences. Also, it can be performed within more complex models such as multistream HMMs [1] or product HMMs [20].

2.3 Visual Feature Extraction

There are two main approaches to extracting visual features from recorded image sequences. The first is the contour-based method, in which the lip contours are found in the images and the features are defined by geometrical properties of the contours such as the height and the width of the lips or the parameters of models describing the contours [14,15]. The second is the pixel-based method, where image transformation methods are applied to the lip region images and a few coefficients of the transformed images are used as features [16]. In the contour-based method, it is possible to lose useful information such as the protrusion of the lips and the visibility of the tongue or the teeth in the oral cavity. Therefore, the pixel-based method usually shows better recognition performance than the contour-based one [17].

There are advantages to using late integration for AVSR. First, the late integration model can easily utilize adaptive weighting techniques to control the contributions of the two modalities to the final decision according to their relative reliabilities. If the acoustic signal does not contain noise, the final recognition result should rely mostly on the acoustic recognition result because the acoustic modality without noise usually performs better than the visual one. If the visual recognition performance is better than the acoustic one because of heavy acoustic noise, the weight for the visual modality should be large so that the final result is largely governed by the visual modality. By adaptively changing the integration weights based on the relative reliabilities measured from the recognizers' outputs, we can expect robustness of AVSR over a wide range of noise conditions. Second, while early integration assumes perfect synchrony of the two signals, late integration can offer flexibility in modeling the inherent asynchrony between them. For some pronunciations, the tongue and the lips start to move up to several hundred milliseconds before the actual speech sound [21]. Third, we can construct a late integration system by using existing unimodal recognition systems, whereas we need to train a new recognizer for an early integration-based AVSR system.




3. PROPOSED METHOD

Psychological studies have revealed that speech movements are composed of a basic syllable rate of about 4 Hz and its modulation by higher-frequency components [24]. Since the videos in the databases were recorded at a rate of 30 Hz, the temporal sequences of the pixel values of the lip region images in an utterance contain frequency components from 0 Hz to 15 Hz. However, some frequency components may be unnecessary for recognition. The pixel value sequences can contain not only useful speech information but also unwanted variations caused by fluctuations of illumination, speakers' appearances or visual noise. These variations are obstacles to good visual recognition performance. Therefore, the objective of the proposed method is to remove the components which change too slowly or too quickly compared to speech information and obtain invariant features for improved recognition performance. We design a BPF which passes only speech-related components and suppresses other unnecessary components. The images filtered by the BPF are used for feature extraction by mean subtraction and PCA.

In the late integration model, the outputs of the acoustic and the visual recognizers are combined by the weighted sum rule. When we have a test datum O, the recognized class u^* is given by [22]

u^* = \arg\max_i \left\{ \gamma \log P(O \mid \lambda_i^A) + (1 - \gamma) \log P(O \mid \lambda_i^V) \right\},   (1)

where λ_i^A and λ_i^V are the acoustic and the visual HMMs for the i-th class, respectively, and log P(O | λ_i^A) and log P(O | λ_i^V) are their outputs (i.e., log-likelihoods). The integration weight γ ∈ [0,1] controls the relative contributions of the two modalities to the final recognition result. Its value for a datum should be appropriately determined according to the noise condition of the datum in order to produce a synergy effect between the modalities; otherwise, the integrated result may be worse than either of the unimodal recognition performances, which is called "attenuating fusion" [8].
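Equation (1) amounts to a weighted sum of per-class log-likelihoods followed by an argmax, as in the following sketch.

```python
# A direct transcription of Eq. (1): combine the per-class log-likelihoods of
# the acoustic and visual HMMs with the integration weight gamma.
import numpy as np

def late_integration(acoustic_loglik, visual_loglik, gamma):
    """acoustic_loglik, visual_loglik: arrays of shape (n_classes,)."""
    combined = (gamma * np.asarray(acoustic_loglik)
                + (1.0 - gamma) * np.asarray(visual_loglik))
    return int(np.argmax(combined))   # index of the recognized class u*
```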

The low-stop portion of the BPF deemphasizes slow-varying components mainly due to speakers’ different appearances and different illumination conditions across recording sessions. The high-stop portion of the BPF removes the frequency components which are easily contaminated by random noise and not relevant to speech information.

We present a neural network-based method of determining the values of γ automatically according to the noise conditions of the given audio-visual speech data. In this method, the reliability of a modality is measured from the outputs of the corresponding HMMs and the integration weight is determined by a neural network based on the measured reliabilities of the two modalities. When the signal of a modality contains no noise, the outputs of the corresponding set of HMMs show large differences and, as the signal becomes noisy, the differences tend to become small. Thus, the reliability of a modality can be defined by [23]

S = \frac{1}{C-1} \sum_{i=1}^{C} \left( \max_j \log P(O \mid \lambda_j) - \log P(O \mid \lambda_i) \right),   (2)

where C is the number of classes. In other words, the reliability of a modality is the average difference between the maximum log-likelihood and the other ones. After we measure the reliabilities of the two modalities for a given datum, a proper integration weight for the datum is produced by inputting the reliabilities to a neural network which models the mapping between the reliabilities and the integration weights.

To determine the lower cut-off frequency of the BPF, we perform a visual speech recognition experiment for the clean condition by using high-pass filters (HPFs) with various cut-off frequencies. A Butterworth HPF [25] is applied to the temporal pixel value sequences of the lip region images. Figure 1 shows the recognition performance for the development data set with respect to the value of the cut-off frequency. It is observed that suppressing slow-varying components can improve recognition performance significantly. When the cut-off frequency becomes larger than about 3 Hz, we lose speech-related components and the performance becomes poor. Although the best cut-off frequency for each database is slightly different, the lower cut-off frequency of the BPF can be set to 0.9 Hz, which shows good performance for both databases.
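A sketch of this cut-off sweep is given below; `extract_features_and_evaluate` is a hypothetical placeholder for the PCA-plus-HMM pipeline, and the zero-phase filtering call is an assumption about how the HPF is applied to the pixel trajectories.

```python
# Sweep candidate HPF cut-offs and measure development-set accuracy for each.
import numpy as np
from scipy.signal import butter, filtfilt

FRAME_RATE = 30.0                                   # video frame rate (Hz)

def highpass_pixel_sequences(video, cutoff_hz, order=4):
    """video: array (n_frames, H, W); filter each pixel trajectory over time."""
    b, a = butter(order, cutoff_hz, btype='highpass', fs=FRAME_RATE)
    return filtfilt(b, a, video.astype(float), axis=0)

def sweep_cutoffs(dev_videos, dev_labels, cutoffs, extract_features_and_evaluate):
    accuracies = {}
    for fc in cutoffs:                              # e.g. np.logspace(-2, 1, 20)
        filtered = [highpass_pixel_sequences(v, fc) for v in dev_videos]
        accuracies[fc] = extract_features_and_evaluate(filtered, dev_labels)
    return accuracies
```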

Next, we determine the higher cut-off frequency of the BPF by the following analyses. Figure 2 shows the average power spectra of the pixel value sequences in clean and noisy images for the development data. For generating noisy images, we use white Gaussian noise, which is signal-independent, stationary and zero-mean. White Gaussian noise is commonly found in images as a result of electronic noise in cameras and sensors [26]. We can observe that the high frequency components are significantly affected by noise. Thus, the noise effect can be effectively reduced by removing the high frequency components, as long as those components do not contain important speech information. Psychological studies have experimentally shown that a frame rate of 16.7 Hz is sufficient for human visual speech perception and that a higher frame rate does not help much [27], which implies that important speech information lies below 8.35 Hz. Thus, the higher cut-off frequency of the BPF is set to 9 Hz based on the spectral and the psychological analyses.
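The average spectra of Figure 2 can be estimated, for example, with Welch's method applied to each pixel trajectory and averaged over pixels and utterances; the sketch below is illustrative and its segment length is an arbitrary choice.

```python
# Average power spectral density of the temporal pixel trajectories of a set
# of lip-region videos (clean or noise-corrupted), as in Figure 2.
import numpy as np
from scipy.signal import welch

FRAME_RATE = 30.0

def average_pixel_spectrum(videos, nperseg=32):
    """videos: list of arrays (n_frames, H, W); returns (freqs, mean PSD in dB)."""
    psds = []
    for v in videos:
        traj = v.reshape(v.shape[0], -1).astype(float)   # one column per pixel
        f, pxx = welch(traj, fs=FRAME_RATE, nperseg=nperseg, axis=0)
        psds.append(pxx.mean(axis=1))                    # average over pixels
    return f, 10.0 * np.log10(np.mean(psds, axis=0) + 1e-12)
```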

An advantage of using a neural network for estimating proper integration weights is its good generalization capability: because we cannot utilize data from all possible noise conditions to construct the reliabilities-to-weights mapping, data from only a few selected conditions are used to train the neural network, and we expect it to generate appropriate integration weights for data whose noise conditions are not considered during its learning phase. For the training audio-visual data of the selected noise conditions, we calculate the reliabilities of the acoustic and the visual modalities by using (2). Then, for each datum, we exhaustively obtain the integration weight values producing the correct recognition result by increasing γ from 0 to 1 by 0.01 and testing the recognition result. Finally, we train the neural network with the pairs of the reliabilities and the found integration weights.
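The sketch below outlines one way to build the training pairs and fit the mapping. It uses scikit-learn's MLPRegressor as a stand-in for the five-hidden-neuron sigmoidal network trained with Levenberg-Marquardt in the paper, and it returns the first weight that yields a correct decision, which is one reading of the exhaustive search described above.

```python
# Reliability of Eq. (2), exhaustive search over gamma in {0, 0.01, ..., 1},
# and training of a small (S_A, S_V) -> gamma regression network.
import numpy as np
from sklearn.neural_network import MLPRegressor

def reliability(logliks):
    """Eq. (2): average gap between the best log-likelihood and the others."""
    logliks = np.asarray(logliks)
    return np.sum(logliks.max() - logliks) / (len(logliks) - 1)

def find_weight(ll_a, ll_v, true_class):
    ll_a, ll_v = np.asarray(ll_a), np.asarray(ll_v)
    for gamma in np.arange(0.0, 1.01, 0.01):
        if np.argmax(gamma * ll_a + (1 - gamma) * ll_v) == true_class:
            return gamma
    return None                                  # no weight gives a correct result

def train_weight_estimator(training_items):
    """training_items: list of (ll_a, ll_v, true_class) for selected noise conditions."""
    X, y = [], []
    for ll_a, ll_v, cls in training_items:
        gamma = find_weight(ll_a, ll_v, cls)
        if gamma is not None:
            X.append([reliability(ll_a), reliability(ll_v)])
            y.append(gamma)
    net = MLPRegressor(hidden_layer_sizes=(5,), activation='logistic', max_iter=5000)
    return net.fit(np.array(X), np.array(y))
```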


Figure 1. Visual speech recognition performance for the clean condition when HPFs with various cut-off frequencies are used. (Accuracy (%) versus cut-off frequency (Hz) for the DIGIT and CITY databases, with and without the HPF.)

Figure 3. Frequency response of the proposed BPF. (Magnitude (dB) versus frequency (Hz).)

Figure 4. Example frames of clean, noisy and blurred lip region images used for experiments.

4. EXPERIMENTS

In this section, we report the experimental results of visual speech recognition and AVSR using the proposed filtering method on the databases described in Section 2.1. The images are degraded by white Gaussian noise and blur for the experiments in visually challenging conditions. Examples of the clean and the degraded lip region images are shown in Figure 4.

Figure 2. Average power spectra of clean and noisy images. (Power spectral density (dB) versus frequency (Hz) for the clean and noisy cases.)

Noisy images are produced by adding zero-mean white Gaussian noise to the clean images. We obtain noisy images of 0 dB to 15 dB in peak signal-to-noise ratio (PSNR), defined by


\mathrm{PSNR} = 10 \log_{10} \left\{ \frac{(\text{maximum pixel value})^2 \cdot MN}{\sum_{m=1}^{M} \sum_{n=1}^{N} \left( I(m,n) - K(m,n) \right)^2} \right\},   (4)

where I(m,n) and K(m,n) are the pixel values of the clean and the noisy images of M×N pixels at pixel location (m,n), respectively. For producing blurred images, we use an R×R Gaussian blur with standard deviation σ; three different combinations of R and σ are used.
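The degradations can be generated as in the following sketch, where the noise variance is chosen to hit a target PSNR by inverting (4); note that clipping to the 8-bit range slightly perturbs the realized PSNR.

```python
# Additive white Gaussian noise at a target PSNR and R x R Gaussian blur.
import numpy as np
import cv2

def psnr(clean, noisy, peak=255.0):
    mse = np.mean((clean.astype(float) - noisy.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def add_white_noise(img, target_psnr_db, peak=255.0):
    mse = peak ** 2 / (10.0 ** (target_psnr_db / 10.0))   # invert Eq. (4)
    noisy = img.astype(float) + np.random.normal(0.0, np.sqrt(mse), img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def gaussian_blur(img, R=9, sigma=7.0):
    return cv2.GaussianBlur(img, (R, R), sigma)           # kernel size must be odd
```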

The number of states in each HMM is set to be proportional to the number of phonetic units of the corresponding word. The observation probability distribution in each HMM state is modeled by a Gaussian mixture model with three Gaussian components.

Therefore, the fourth-order Butterworth BPF having cut-off frequencies of 0.9 Hz and 9 Hz for temporal filtering is given by

H(z) = \frac{0.3307 - 0.6614 z^{-2} + 0.3307 z^{-4}}{1 - 1.4261 z^{-1} + 0.4618 z^{-2} - 0.1566 z^{-3} + 0.1754 z^{-4}}.   (3)

Figure 3 shows the frequency response of the BPF. There exists a fairly flat passband to transmit important speech information.
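A minimal sketch of the complete temporal filtering step is given below: a fourth-order Butterworth band-pass design with 0.9 Hz and 9 Hz cut-offs (corresponding to (3)), applied causally along the time axis of each pixel trajectory, followed by per-utterance mean subtraction and PCA projection. The use of scipy's `butter`/`lfilter` and the `pca` object are assumptions about how the published transfer function would be realized in code.

```python
# Proposed temporal filtering of the lip-region pixel trajectories, followed by
# mean subtraction and projection onto the PCA basis (12 static features/frame).
import numpy as np
from scipy.signal import butter, lfilter

FRAME_RATE = 30.0                                   # video frame rate (Hz)

def bandpass_pixel_sequences(video, low=0.9, high=9.0, order=2):
    """video: (n_frames, H, W). A second-order design of each band edge yields a
    fourth-order band-pass filter, matching Eq. (3)."""
    b, a = butter(order, [low, high], btype='bandpass', fs=FRAME_RATE)
    return lfilter(b, a, video.astype(float), axis=0)

def filtered_visual_features(video, pca):
    filtered = bandpass_pixel_sequences(video)
    frames = filtered.reshape(filtered.shape[0], -1)   # vectorize each frame
    frames -= frames.mean(axis=0, keepdims=True)       # per-utterance mean removal
    return pca.transform(frames)
```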


Table 1. Visual speech recognition accuracies (%) with clean and corrupted images by white noise for the DIGIT database.

PSNR (dB) | Baseline, Static | Baseline, Static+Δ | Filtering, Static | Filtering, Static+Δ
clean     | 54.1 | 63.3 | 64.3 | 68.0
15        | 53.2 | 60.4 | 61.9 | 66.7
10        | 49.8 | 51.1 | 58.9 | 56.9
5         | 43.1 | 38.3 | 51.7 | 42.9
0         | 24.2 | 24.7 | 32.5 | 22.1

Table 2. Visual speech recognition accuracies (%) with clean and corrupted images by white noise for the CITY database.

PSNR (dB) | Baseline, Static | Baseline, Static+Δ | Filtering, Static | Filtering, Static+Δ
clean     | 62.2 | 75.4 | 76.8 | 79.6
15        | 62.1 | 72.0 | 75.4 | 76.8
10        | 57.7 | 62.6 | 67.9 | 70.1
5         | 42.2 | 44.9 | 55.1 | 55.2
0         | 24.0 | 22.9 | 35.0 | 32.3

Table 3. Visual speech recognition accuracies (%) with blurred images for the DIGIT database.

Blur (R, σ) | Baseline, Static | Baseline, Static+Δ | Filtering, Static | Filtering, Static+Δ
(7, 5)      | 44.2 | 53.9 | 60.8 | 64.5
(9, 7)      | 36.4 | 46.5 | 56.9 | 59.5
(11, 9)     | 30.7 | 42.4 | 54.8 | 54.8

Table 4. Visual speech recognition accuracies (%) with blurred images for the CITY database.

Blur (R, σ) | Baseline, Static | Baseline, Static+Δ | Filtering, Static | Filtering, Static+Δ
(7, 5)      | 54.0 | 64.6 | 71.0 | 74.0
(9, 7)      | 42.6 | 51.2 | 62.2 | 61.2
(11, 9)     | 30.2 | 39.1 | 54.6 | 50.9

4.1 Visual Speech Recognition Results

We evaluate the proposed filtering method on visual-only speech recognition. Tables 1 and 2 compare the recognition performance of the features obtained without and with the proposed filtering method for each database. The tables show the results obtained by using only static features and by using both static and delta (Δ) features. We compute the delta features over a regression window of t ± 2, which shows the best performance for the clean condition. We can see that the proposed method significantly improves performance in both clean and noisy conditions by suppressing unnecessary frequency components. In the case of the features without filtering, the static features contain only instantaneous information, so the use of their delta terms containing transitional (or dynamic) information is beneficial [28]. On the other hand, the additional use of the delta features in the case of filtering is not very helpful overall; because the features of a frame depend on the preceding ones through temporal filtering, they already include both instantaneous and dynamic information. It is also observed that the static and the delta features without filtering still perform worse than the static features with filtering: although the delta feature computation is a form of high-pass filtering and can remove slowly varying components, it is selective to only a small range of frequencies [29], whereas the proposed filter has a flat passband as shown in Figure 3. Besides, the delta feature computation cannot remove the noise effect appearing in the high frequency band.

The results for blurred images are given in Tables 3 and 4 for each database, respectively. We can see the effectiveness of the proposed method in this experiment again. Blurring increases the low frequency components of the pixel value sequences, which are suppressed by applying the proposed BPF. In the baseline results, high-pass filtering by the delta features is essential for good performance. As in the results in Tables 1 and 2, however, the static and the delta features without filtering are outperformed by the static features with filtering.

4.2 Audio-Visual Speech Recognition Results

We evaluate the performance of the AVSR system using the proposed filtering method for visual feature extraction. We add white noise to the acoustic speech signal to obtain 0 dB to 25 dB noisy acoustic speech in signal-to-noise ratio (SNR). In this subsection, only the results for the DIGIT database are shown because we obtained similar results for the CITY database.

To determine the integration weights, we use a feedforward neural network having five sigmoidal hidden neurons; using more hidden neurons did not improve performance. The neural network is trained by the Levenberg-Marquardt algorithm, which is one of the fastest training algorithms for neural networks [30]. The acoustic noise conditions for training the neural network are ∞ dB (i.e., the clean condition), 15 dB and 0 dB; the visual noise conditions of ∞ dB, 10 dB and 0 dB are used for training. In both cases, we use white noise. Therefore, 9 combinations of noise conditions are used for training the neural network.

Figure 5 shows the audio-only, the visual-only and the integrated recognition performance for various acoustic SNRs and visual PSNRs. Note that Figure 5(b) is for a visual noise condition which is not considered in training the neural network. Also, the untrained acoustic noise conditions (5 dB, 10 dB, 20 dB and 25 dB) are included in each case of Figure 5. For the baseline visual-only and integrated results, we use the static and the delta visual features and, for the results obtained by using the proposed filtering method, only the static features are used based on the results in the previous subsection. When the clean visual signal is used (Figure 5(a)), the integrated results for the cases without and with filtering are nearly the same because the visual-only recognition results for the two cases are similar to each other. However, the number of visual features for the case of filtering is half that of the baseline, which means that the filtering method allows a more efficient realization of the AVSR system than the baseline in terms of the number of HMM parameters and the computational cost. When the images are noisy, the proposed filtering method enhances the visual recognition performance and, consequently, the integrated AVSR performance is also improved. The performance gap between the baseline and the proposed systems is large at high acoustic noise levels, where the acoustic recognition performance is quite low and the visual recognition performance largely contributes to the AVSR results.


In Figure 6, we show the unimodal and the bimodal recognition performance when the images are blurred. The neural networks for obtaining the results in Figure 5 are used without further training. Again, it can be observed that applying the proposed BPF to the blurred images improves the AVSR performance compared to the baseline system.


Figure 5. Performance of acoustic-only (A), visual-only (V) and audio-visual (AV) recognition when the visual signal is (a) clean, (b) noisy (PSNR=5 dB), and (c) noisy (PSNR=0 dB).

Figure 6. Performance of acoustic-only (A), visual-only (V) and audio-visual (AV) recognition when the visual signal is degraded by Gaussian blur with the parameters (a) (R, σ )=(9, 7) and (b) (R, σ )=(11, 9).



5. CONCLUSION

We have presented an AVSR system that is robust in acoustically and visually challenging environments by using a new temporal filtering method for visual speech. The temporal filter, designed on the basis of experimental, spectral and psychological analyses, is applied to the pixel value sequences of the lip images to reduce unwanted variations which change too slowly or too quickly compared to speech information. We demonstrated that the filtering method is effective for visual-only and audio-visual recognition using clean, noisy and blurred images.

One might raise a question about the applicability of the proposed BPF given by (3) to speech data with high speaking rates, such as continuous or spontaneous speech. Although it is not guaranteed that the BPF is optimal for such data, we can expect that the filter will still be helpful. In Figure 1, we could see that applying HPFs with cut-off frequencies below 0.9 Hz always improves visual recognition performance. Similarly, although the optimal lower cut-off frequency may be slightly higher than 0.9 Hz for continuous or spontaneous speech, the value of 0.9 Hz should still enhance the performance for such speech data. Our future work will confirm this and, if necessary, fine-tune the lower cut-off frequency of the BPF.

It would also be necessary to evaluate the presented AVSR system under more realistic challenging conditions. Various sources of audio-visual speech degradation should be considered to test the effectiveness of the system in diverse recognition environments.

6. ACKNOWLEDGEMENT

This work was supported by the Brain Korea 21 Project, The School of Information Technology, KAIST, in 2007.

7. REFERENCES

[1] Dupont, S., Luettin, J. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia, 2(3):141-151, Sept. 2000.
[2] Huang, J., Potamianos, G., Connell, J., Neti, C. Audio-visual speech recognition using an infrared headset. Speech Communication, 44(1-4):83-96, 2004.
[3] Hazen, T. J. Visual model structures and synchrony constraints for audio-visual speech recognition. IEEE Trans. Audio, Speech, and Language Processing, 14(3):1082-1089, May 2006.
[4] Ross, L. A., Saint-Amour, D., Leavitt, V. M., Foxe, J. J. Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex, 17(5):1147-1153, 2007.
[5] Arnold, P., Hill, F. Bisensory augmentation: a speechreading advantage when speech is clearly audible and intact. British J. Psychology, 92:339-355, 2001.
[6] McGurk, H., MacDonald, J. Hearing lips and seeing voices. Nature, 264(5588):746-748, Dec. 1976.
[7] Summerfield, A. Q. Some preliminaries to a comprehensive account of audio-visual speech perception. In Hearing by Eye: The Psychology of Lip-reading, Dodd, B., Campbell, R. (eds.), Lawrence Erlbaum, London, UK, 1987, 3-51.
[8] Chibelushi, C. C., Deravi, F., Mason, J. S. D. A review of speech-based bimodal recognition. IEEE Trans. Multimedia, 4(1):23-27, Mar. 2002.
[9] Gonzalez, R. C., Woods, R. E. Digital Image Processing. Prentice-Hall, Upper Saddle River, NJ, 2002.
[10] Potamianos, G., Neti, C. Audio-visual speech recognition in challenging environments. In Proc. European Conf. Speech Communication and Technology (Geneva, Switzerland, 2003), 1293-1296.
[11] Saenko, K., Darrell, T., Glass, J. Articulatory features for robust visual speech recognition. In Proc. Int. Conf. Multimodal Interfaces (State College, PA, 2004), 152-158.
[12] Lee, J.-S., Park, C. H. Training hidden Markov models by hybrid simulated annealing for visual speech recognition. In Proc. IEEE Int. Conf. Systems, Man, and Cybernetics (Taipei, Taiwan, Oct. 2006), 198-202.
[13] Huang, X.-D., Acero, A., Hon, H.-W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall, Upper Saddle River, NJ, 2001.
[14] Gurbuz, S., Tufekci, Z., Patterson, E., Gowdy, J. Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition. In Proc. Int. Conf. Acoustics, Speech and Signal Processing (Salt Lake City, 2001), 177-180.
[15] Kaynak, M. N., Zhi, Q., Cheok, A. D., Sengupta, K., Jiang, Z., Chung, K. C. Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis. Speech Communication, 43(1-2):1-16, 2004.
[16] Lucey, S. An evaluation of visual speech features for the tasks of speech and speaker recognition. In Proc. Int. Conf. Audio- and Video-Based Biometric Person Authentication (Guildford, UK, 2003), 260-267.
[17] Scanlon, P., Reilly, R. Features analysis for automatic speechreading. In Proc. Int. Conf. Multimedia and Expo (Tokyo, Japan, 2001), 625-630.
[18] Lee, J.-S., Shim, S. H., Kim, S. Y., Park, C. H. Bimodal speech recognition using robust feature extraction of lip movement under uncontrolled illumination conditions. Telecommunications Review, 14(1):123-134, Feb. 2004.
[19] Bregler, C., Konig, Y. Eigenlips for robust speech recognition. In Proc. Int. Conf. Acoustics, Speech, and Signal Processing (Adelaide, Australia, 1994), 669-672.
[20] Nakamura, S. Statistical multimodal integration for audio-visual speech processing. IEEE Trans. Neural Networks, 13(4):854-866, July 2002.
[21] Benoît, C. The intrinsic bimodality of speech communication and the synthesis of talking faces. In The Structure of Multimodal Dialogue II, Taylor, M. M., Nel, F., Bouwhuis, D. (eds.), John Benjamins, Amsterdam, The Netherlands, 2000, 485-502.
[22] Rogozan, A., Deléglise, P. Adaptive fusion of acoustic and visual sources for automatic speech recognition. Speech Communication, 26(1-2):149-161, Oct. 1998.
[23] Lewis, T. W., Powers, D. M. W. Sensor fusion weighting measures in audio-visual speech recognition. In Proc. Conf. Australasian Computer Science (Dunedin, New Zealand, 2004), 305-314.
[24] Munhall, K., Vatikiotis-Bateson, E. The moving face during speech communication. In Hearing by Eye II: Advances in the Psychology of Speechreading and Audio-Visual Speech, Campbell, R., Dodd, B., Burnham, D. (eds.), Psychology Press, Hove, UK, 1998, 123-142.
[25] Oppenheim, A. V., Schafer, R. W. Discrete-Time Signal Processing. Prentice-Hall, Upper Saddle River, NJ, 1999.
[26] Weeks Jr., A. R. Fundamentals of Electronic Image Processing. SPIE/IEEE Press, Bellingham, WA, 1996.
[27] Vitkovitch, M., Barber, P. Visible speech as a function of image quality: effects of display parameters on lipreading ability. Applied Cognitive Psychology, 10(2):121-140, 1996.
[28] Jung, H.-Y., Lee, S.-Y. On the temporal decorrelation of feature parameters for noise-robust speech recognition. IEEE Trans. Speech and Audio Processing, 8(4):407-416, 2000.
[29] Hermansky, H., Morgan, N. RASTA processing of speech. IEEE Trans. Speech and Audio Processing, 2(4):578-589, 1994.
[30] Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, UK, 1995.

