
Robust Audio-Visual Speech Recognition Based on Late Integration

Jong-Seok Lee, Member, IEEE, and Cheol Hoon Park, Senior Member, IEEE

Abstract—Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention because of its robustness in noisy environments. In this paper, we present a late integration scheme-based AVSR system whose robustness under various noise conditions is improved by enhancing the performance of the three parts composing the system. First, we improve the performance of the visual subsystem by using the stochastic optimization method for the hidden Markov models as the speech recognizer. Second, we propose a new method of considering dynamic characteristics of speech for improved robustness of the acoustic subsystem. Third, the acoustic and the visual subsystems are effectively integrated to produce final robust recognition results by using neural networks. We demonstrate the performance of the proposed methods via speaker-independent isolated word recognition experiments. The results show that the proposed system improves robustness over the conventional system under various noise conditions without a priori knowledge about the noise contained in the speech.

Index Terms—Audio-visual speech recognition, late integration, robustness, hidden Markov model, interframe correlation, neural network, stochastic optimization.

I. INTRODUCTION

Manuscript received September 26, 2006; revised December 11, 2007. First published June 13, 2008; last published July 9, 2008 (projected). This work was supported by GRANT R01-2003-000-10829-0 from the Basic Research Program of the Korea Science and Engineering Foundation and Brain Korea 21 Project, The School of Information Technology, KAIST, in 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Bo Shen. The authors are with the School of Electrical Engineering and Computer Science, KAIST, Daejeon 305-701, Korea (e-mail: [email protected]ist.ac.kr; [email protected]). Digital Object Identifier 10.1109/TMM.2008.922789

AUDIO-VISUAL speech recognition (AVSR) recognizes speech by observing not only the speaker’s voice signal but also the movement of the lips. An important role of the visual information in speech recognition is to enhance robustness against acoustic noise in the environment. Most real-world applications of acoustic speech recognition are vulnerable to interference from unavoidable noise such as vehicle engines, machinery, or other voices in the background. Since the visual signal is not affected by acoustic noise, it can be used to remedy the performance degradation of acoustic speech recognition in such noisy environments. People use the movement of the speaker’s lips as a supplementary information source when they cannot hear well due to noise. AVSR is an attempt to imitate this bimodal nature of human speech perception for improving the robustness of automatic speech recognition systems.

Since the first speech recognition system using visual information was introduced by Petajan in 1984 [1], several researchers have reported their own AVSR systems [2]. Generally, AVSR systems work as follows. First, the acoustic and the visual signals of speech are recorded by a microphone and a camera, respectively. Then, each signal is converted into an appropriate form of compact features. Finally, the two modalities are integrated for recognition of the given speech.

Integration of acoustic and visual information aims at obtaining recognition results that are as good as possible in noisy circumstances. It can take place either before the two information sources are processed by a recognizer [early integration (EI)] or after they are classified independently [late integration (LI)]. LI has been shown to be preferable because of its better performance and robustness than EI [3] and because of its psychological support [4].

This paper concentrates on constructing a robust AVSR system based on the LI model by considering the three factors affecting the performance of the system: performance of the visual speech recognition, robustness of the acoustic speech recognition, and effectiveness of the audio-visual integration.

First, we optimize the recognizer of the visual speech, the hidden Markov models (HMMs), by utilizing a stochastic optimization method [5] to enhance the visual speech recognition performance. A conventional popular method for optimizing HMMs is the expectation-maximization (EM) algorithm [6], in which the HMM parameters are adjusted iteratively so as to maximize the likelihood of the training data. A limitation of the EM algorithm is that it only achieves locally optimal solutions and may not provide the global optimum. In contrast, our stochastic optimization algorithm performs global optimization in estimating the HMMs’ parameters and thereby improves the visual speech recognition performance.
Second, we propose a method of modeling interframe correlations of acoustic speech to obtain more robust acoustic recognition performance compared to the conventional speech modeling method by HMMs. Acoustic speech recognition systems usually deal with short segments of speech; the spectral analysis of speech is performed with a speech segment of 20–30 ms while the window function moves at the rate of about 10 ms. When we model the speech sequence with the conventional HMMs, the conditional dependence between observation frames is often ignored. However, it has been revealed that the use of the property related to speech dynamics, especially frame dependence between relatively long time intervals, is one of the important issues for robust speech recognition by both humans and machines [7]–[11]. Instead of assuming the conditional independence between frames, we model the joint distribution of frames by the Gaussian mixture model (GMM) so that the conditional dependence between frames is considered in the observation probability distributions of HMM

1520-9210/$25.00 © 2008 IEEE


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 5, AUGUST 2008

states. This modification enables the HMMs to capture the dynamic characteristics of speech, which improves robustness of acoustic speech recognition.

Third, we develop an audio-visual integration method using neural networks (NNs) to achieve the final robust AVSR performance. When we integrate the two information sources, we should measure the “reliability” of each modality (i.e., the degree of trust in the recognition result drawn from the acoustic or the visual subsystem) for the given audio-visual speech data and apply an appropriate integration weight to the modalities according to their reliabilities. The weight determines how much we depend on each modality during recognition of the speech. Determining proper weights for the given speech data is crucial for robust integrated recognition results. We use a trainable NN to automatically determine appropriate integration weights based on the reliabilities measured from the recognizers’ outputs and obtain reliable integration results for various noise conditions. The NN learns the mapping between the reliabilities of the modalities and the weight values giving the correct recognition results for training speech data so that it produces appropriate weights for test speech data of unknown noise conditions.

Through the experiments, we demonstrate that the constructed AVSR system consistently shows improved robustness compared to the conventional system without a priori knowledge about the type or the amount of noise contained in the speech.

The remainder of the paper is organized as follows. Section II presents the stochastic optimization algorithm for HMMs and its mathematical convergence property. Section III introduces the proposed method of modeling the interframe correlations of speech in HMMs. Section IV describes our integration scheme of the audio-visual information by using NNs. Section V shows the experimental results for isolated word recognition tasks. Finally, Section VI concludes the paper.

II. STOCHASTIC OPTIMIZATION OF HMMS FOR VISUAL SUBSYSTEM

This section introduces a global optimization method for HMMs, hybrid simulated annealing (HSA) [5]. The algorithm is based on the stochastic search algorithm simulated annealing (SA), which is able to escape from local optima and search the whole parameter space for the global optimum [12]. The HSA algorithm combines SA with a local optimization technique to enhance the convergence speed and the quality of the solution. The algorithm is applied to improve the visual speech recognition performance and, consequently, enhances the AVSR performance (footnote 1).

There have been some efforts to use SA to optimize HMMs for speech recognition [13] or sequence alignment in bioinformatics [14]. Although these methods have been shown to be successful, most of them have dealt only with simple discrete HMMs, while continuous HMMs (CHMMs) are much more widely used in speech recognition because of their better performance than that of discrete HMMs. Moreover, the optimization procedures of these methods are heuristically designed and thus their convergence to the global optimum may not be guaranteed. In contrast, our method is developed for optimizing CHMMs, and its procedures are designed so that the method is supported by mathematical convergence proofs.

Footnote 1: HSA can also improve the acoustic speech recognition performance for clean speech. However, such improvement does not guarantee improved robustness in acoustically noisy environments because training of HMMs uses only clean speech. Therefore, we do not apply HSA to the acoustic HMMs.

A. Algorithm

As in the EM algorithm, the HSA algorithm maximizes the sum of the log-likelihoods for the training data

$$L(\lambda) = \sum_{n} \log P(O_n \mid \lambda) = \sum_{n} \log \sum_{Q} P(O_n, Q \mid \lambda) \tag{1}$$

where $O_n$ is a training observation sequence having the frame length of $T_n$ and $Q$ a hidden state sequence for $O_n$. The set of the $N$-state HMM’s parameters $\lambda$ includes the initial state distribution $\pi$, the state transition probability distribution $A$, and the observation probability distribution $B$. We use the CHMMs in which $B$ is given by the GMM.

HSA performs global optimization of $\lambda$ by iterative generation, local optimization, evaluation, and selection of the solution with a controlled annealing schedule. The temperature parameter, which gradually decreases by annealing, governs the amount of random displacement of new solutions and their acceptance probability. The procedure of the algorithm is as follows:

Step 1) Initialization: Generate an initial solution vector, $\lambda_1$, and calculate its objective value, $L(\lambda_1)$. Set the temperature to its initial value.

Step 2) Generation: Generate a new solution vector $\lambda'_k$ from the current one $\lambda_k$:

$$\lambda'_k = \lambda_k + \Delta\lambda_k \tag{2}$$

where $k$ is the iteration index and $\Delta\lambda_k$ the amount of displacement of the solution. $\Delta\lambda_k$ is determined by the Cauchy generating function, which is written as the following probability distribution function [15]:

$$g(\Delta\lambda_k) = \frac{a\,T_k}{\left(\|\Delta\lambda_k\|^2 + T_k^2\right)^{(D+1)/2}} \tag{3}$$

where $T_k$ is the temperature at iteration $k$, $D$ the dimension of $\lambda$, and $a$ the normalizing constant, which is given by

$$a = \pi^{-(D+1)/2}\,\Gamma\!\left(\frac{D+1}{2}\right) \tag{4}$$

(in (4), $\pi$ denotes the circle constant, not the initial state distribution). The magnitude of $\Delta\lambda_k$ is controlled by $T_k$. In the beginning of the algorithm, $T_k$ is high and the solution is likely to change by a large amount. As $T_k$ decreases

LEE AND PARK: ROBUST AVSR BASED ON LATE INTEGRATION


by the annealing schedule in Step 6, $\Delta\lambda_k$ tends to become small.

Step 3) Local optimization: Apply a few iterations of the EM algorithm to $\lambda'_k$ to produce $\lambda''_k$. This step helps to find a good solution with improved convergence speed.

Step 4) Evaluation: Calculate the objective value of $\lambda''_k$, $L(\lambda''_k)$.

Step 5) Selection: Select the solution of the next iteration, $\lambda_{k+1}$, between $\lambda_k$ and $\lambda''_k$ by the Metropolis rule [16]: The acceptance probability of $\lambda''_k$ is given by

$$P_a = \min\left\{1,\ \exp\!\left[\frac{L(\lambda''_k) - L(\lambda_k)}{T_k}\right]\right\} \tag{5}$$

In other words, when the new solution is better than the current one, the new one is always accepted; otherwise, acceptance of the new solution is determined probabilistically. Thus, a worse solution than the current one can be selected with a nonzero probability, which helps the algorithm escape from local optima. The temperature controls the acceptance probability of a worse solution against the current one: For a large $T_k$ in the beginning of the algorithm, the value of the exponential term in the equation is large and a worse solution than the current one is accepted with a relatively high probability; when $T_k$ becomes small, the value of the exponential term becomes small and a worse solution than the current one is hardly accepted, which allows the solution to converge to the final one.

Step 6) Annealing: Decrease the temperature by the reciprocal annealing schedule

$$T_k = \frac{T_0}{k} \tag{6}$$

where $T_0$ is the initial temperature.

Step 7) Termination: If some termination conditions are satisfied, stop. Otherwise, go to Step 2. We set a maximum number of iterations as the termination condition.

B. Convergence

The convergence of HSA can be proved mathematically. The following theorem states that the objective sequence converges in probability to the global optimum.

Theorem 1: Let $\Omega$ be the entire feasible space in $\mathbb{R}^D$. Suppose that $\Delta\lambda_k$ is regenerated until every one of its components is greater than or equal to $\delta_k$, which is a lower bound on the displacement at iteration $k$ and given by

(7)

Then, the objective value sequence $\{L(\lambda_k)\}$ in HSA converges in probability to the global optimum for any initial solution $\lambda_1$.
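Before turning to the implications of the theorem, the loop formed by Steps 1 to 7 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is a hypothetical two-dimensional toy function rather than an HMM log-likelihood, the "few EM iterations" of Step 3 are replaced by a few guarded finite-difference gradient steps, the displacement is drawn from a multivariate Cauchy distribution with scale equal to the temperature, and the lower-bound condition of Theorem 1 is not enforced. All function names are illustrative.

```python
import math
import random


def cauchy_step(dim, temp, rng):
    """Draw a displacement from a multivariate Cauchy with scale `temp`.

    Standard construction: a Gaussian vector divided by |Gaussian scalar|.
    The divisor is clamped so the toy example never produces huge values.
    """
    w = max(abs(rng.gauss(0.0, 1.0)), 1e-12)
    return [temp * rng.gauss(0.0, 1.0) / w for _ in range(dim)]


def local_ascent(f, x, steps=3, lr=0.05, eps=1e-5):
    """Stand-in for the EM iterations (assumption): guarded gradient ascent."""
    x = list(x)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            xp, xm = x[:], x[:]
            xp[i] += eps
            xm[i] -= eps
            grad.append((f(xp) - f(xm)) / (2.0 * eps))
        cand = [xi + lr * gi for xi, gi in zip(x, grad)]
        if f(cand) > f(x):      # like EM, never make the objective worse
            x = cand
        else:
            break
    return x


def hsa_maximize(f, x0, t0=1.0, max_iter=500, seed=0):
    """Toy hybrid simulated annealing loop for maximizing f."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)                       # initialization
    best_x, best_f = list(x), fx                  # auxiliary memory of the best
    for k in range(1, max_iter + 1):
        temp = t0 / k                             # reciprocal annealing
        step = cauchy_step(len(x), temp, rng)     # generation
        cand = [xi + di for xi, di in zip(x, step)]
        cand = local_ascent(f, cand)              # local optimization
        fc = f(cand)                              # evaluation
        # selection by the Metropolis rule (maximization form)
        if fc >= fx or rng.random() < math.exp((fc - fx) / temp):
            x, fx = cand, fc
        if fx > best_f:
            best_x, best_f = list(x), fx
    return best_x, best_f


def toy_objective(x):
    """Hypothetical multimodal objective; its global maximum is 4 at the origin."""
    return -(x[0] * x[0] + x[1] * x[1]) + 2 * math.cos(3 * x[0]) + 2 * math.cos(3 * x[1])


best_x, best_f = hsa_maximize(toy_objective, [2.5, -1.7])
print(best_x, best_f)
```

Keeping the best solution in a separate variable mirrors the auxiliary memory mentioned for the best objective sequence; the current solution may still move to worse values while the temperature is high.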
By Theorem 1, it is guaranteed that, if the displacement of every component of the solution is larger than a monotonically decreasing lower bound, the solution converges to the global optimum in probability regardless of its initial value. Here, the lower bound can be set to a very small value, so that we do not need many trials for generating a displacement satisfying the lower bound condition.

A key factor for obtaining the above theorem is the combination of the generating function (3) and the annealing schedule (6). The balance between the amount of displacement of the solution and the rate at which the temperature decreases is important; if we use a temperature function that decreases too fast with the Cauchy generating function, or a generating function with too short tails (e.g., the Gaussian distribution) with the reciprocal annealing schedule, we cannot get the convergence shown above.

Next, the following theorem states that the best objective sequence converges in probability to the global optimum.

Theorem 2: The best objective value sequence in HSA converges in probability to the global optimum for any initial solution.

The above theorem states that the best objective value up to the current iteration, which is stored in an auxiliary memory, converges in probability to the global optimum regardless of the initial solution. Note that, for convergence of the best objective sequence to the global optimum, we do not need the lower bound condition on the displacement. The proofs of the theorems can be found in [17].

III. MODELING OF INTERFRAME CORRELATIONS FOR ACOUSTIC SUBSYSTEM

One of the weaknesses of acoustic speech recognition systems with the conventional HMMs is the lack of consideration of the dynamic characteristics of speech signals. Since the dynamic structures of speech are quite different from those of noise, they are more invariant to noise than the static information, which is based on the frequency analysis of short speech segments. In the conventional HMM it is assumed that the observations belonging to a state are independent of each other and identically distributed.
Thus, although correlations actually exist between neighboring speech frames and such correlations are known to be important for robust human speech understanding [7], [8], modeling of speech by the conventional HMM does not reflect them and, thereby, does not fully utilize the temporal characteristics of speech in recognition. In this section, we propose a modeling method for considering the correlations between speech frames within the framework of the HMM. The assumption of conditional independence between frames is relaxed so that an observation frame is conditionally dependent on a past one. We explicitly model the joint probability distribution of two observation vectors with the GMM to enhance the noise-robustness of recognition with HMMs by capturing the dynamic speech characteristics, which are independent of the noise characteristics. The proposed model and its learning algorithm are described below.

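A. Formulation

Given a state sequence $Q = \{q_1, \ldots, q_T\}$, the probability of an observation sequence $O = \{o_1, \ldots, o_T\}$ with the assumption of conditional independence in the conventional HMM is written as

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda). \tag{8}$$

On the other hand, the current observation depends on a previous frame in the proposed modeling method, i.e.,

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid o_{t-\tau}, q_t, \lambda) \tag{9}$$

where $\tau$ is the interval between the current frame and the previous frame on which the current observation depends. Using Bayes’ rule, we can rewrite the conditional observation probability distribution as

$$P(o_t \mid o_{t-\tau}, q_t, \lambda) = \frac{P(o_t, o_{t-\tau} \mid q_t, \lambda)}{P(o_{t-\tau} \mid q_t, \lambda)}. \tag{10}$$

We model the joint probability distribution of $o_t$ and $o_{t-\tau}$ with a GMM, which is a simple extension of the conventional CHMM. Therefore, omitting the state subscripts for simplicity,

$$P(o_t, o_{t-\tau} \mid q_t, \lambda) = \sum_{m=1}^{M} c_m\, \mathcal{N}(z_t; \mu_m, \Sigma_m) \tag{11}$$

where $M$ is the number of the Gaussian functions in the GMM and $c_m$ are the mixture coefficients satisfying

$$\sum_{m=1}^{M} c_m = 1, \qquad c_m \ge 0. \tag{12}$$

$z_t = [o_t^\top, o_{t-\tau}^\top]^\top$ is the joint vector of $o_t$ and $o_{t-\tau}$ (with twice the dimension of a single observation vector), and $\mathcal{N}(\cdot\,; \mu_m, \Sigma_m)$ is a Gaussian function with mean $\mu_m$ and covariance matrix $\Sigma_m$. The mean vector is given by $\mu_m = [(\mu_m^{x})^\top, (\mu_m^{y})^\top]^\top$, where $\mu_m^{x}$ and $\mu_m^{y}$ are the mean vectors for $o_t$ and $o_{t-\tau}$, respectively. The covariance matrix is given by

$$\Sigma_m = \begin{bmatrix} \Sigma_m^{xx} & \Sigma_m^{xy} \\ \Sigma_m^{yx} & \Sigma_m^{yy} \end{bmatrix} \tag{13}$$

where $\Sigma_m^{xx}$ and $\Sigma_m^{yy}$ are the covariance matrices for $o_t$ and $o_{t-\tau}$, respectively, $\Sigma_m^{xy}$ is the cross-covariance matrix of $o_t$ and $o_{t-\tau}$, and $\Sigma_m^{yx} = (\Sigma_m^{xy})^\top$. We use diagonal matrices for the blocks of $\Sigma_m$ to reduce the number of parameters, as in the conventional CHMMs. By using the block matrix inversion lemma [18], the inverse of $\Sigma_m$ can be written in block form (14). Note that, if each block of $\Sigma_m$ in (13) is diagonal, the blocks of the inverse are all also diagonal.

Now, we consider the denominator of (10). It can be shown [19] that integrating (11) with respect to $o_t$ yields

$$P(o_{t-\tau} \mid q_t, \lambda) = \sum_{m=1}^{M} c_m\, \mathcal{N}(o_{t-\tau}; \mu_m^{y}, \Sigma_m^{yy}). \tag{15}$$

Thus, the conditional observation probability in (10) becomes

$$P(o_t \mid o_{t-\tau}, q_t, \lambda) = \frac{\sum_{m=1}^{M} c_m\, \mathcal{N}(z_t; \mu_m, \Sigma_m)}{\sum_{m=1}^{M} c_m\, \mathcal{N}(o_{t-\tau}; \mu_m^{y}, \Sigma_m^{yy})}. \tag{16}$$

Since the GMM is used for modeling correlations between observation frames, we call our method Gaussian mixture correlation model HMM (GMCM-HMM). In the method, $\tau$ determines the frame interval for modeling the correlations between two frames. While short-term correlations of speech data are considered by the delta features, which are used together with static features, the proposed method is used to deal with rather long-term correlations. Therefore, $\tau$ is set to be larger than the frame span over which the delta features are defined (footnote 2). The effect of $\tau$ on performance is investigated in the experiments.

B. Learning Algorithm

The parameters of the proposed model can be optimized by an EM-like algorithm. “Baum’s auxiliary function” [6], which is to be maximized in the EM algorithm for CHMMs and GMCM-HMMs, is written as (17), where $\lambda$ and $\bar{\lambda}$ are the HMM parameter sets before and after updating, respectively. The summations in the first line of (17) are performed over all possible state sequences and mixtures. Since the Baum’s auxiliary functions for the initial state distribution and the state transition probability distribution of GMCM-HMM in (17) are the same as those of the conventional HMM, the updating formulas for them are the same as those in the original EM algorithm. The auxiliary function for the observation probability distribution of GMCM-HMM is obtained by using the conditional observation probability (16) in the corresponding term of (17).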

Footnote 2: Although increasing the window length for calculating the delta features may be helpful for improving robustness, it is not helpful for recognizing clean speech and the optimal window length varies with the noise condition of the speech [20]. Also, our preliminary experiments confirmed that our method outperforms the use of delta features of various window lengths.


To optimize the mean vectors and the covariance matrices of the joint GMM, we differentiate the auxiliary function for the observation probability distribution, denoted here by $Q_B$, with respect to each of their elements and set the derivative to zero:

$$\frac{\partial Q_B}{\partial \theta} = 0 \tag{18}$$

where $\theta$ is any element of these parameters. For part of the parameters, (18) can be solved analytically, which gives the update in (19). For the other parameters, it is not possible to obtain analytic solutions of (18) because of the complexity of the equations. Instead, we use the Gauss–Newton method with line search [21] to numerically find the parameter values that make the derivatives zero.

As for the mixture coefficients, we should solve a constrained optimization problem in which $Q_B$ is maximized with respect to the mixture coefficients under the constraints in (12). While the Lagrange multiplier method is used to obtain the updating formula for the mixture coefficients in the original EM algorithm, such an approach is not applicable in our case because of the complexity of the equation. Instead, we solve this problem numerically by applying a sequential quadratic programming method [21], [22].

C. Related Work

There have been efforts to incorporate the dynamic information of speech into speech recognition for noise-robustness. One of the popular methods is relative spectra (RASTA) processing [10]. This method was developed on the basis of the observation that human auditory perception is highly sensitive to the relative temporal changes of speech. In this method the speech dynamics are considered at the stage of feature extraction by filtering: After a short segment of speech is analyzed by the filterbank in the frequency domain, a band-pass filter is applied to the logarithm of the filterbank outputs. This RASTA filter has a rather long time constant and thus the analysis of the current frame depends on its history. Filtering makes the signal less sensitive to slowly varying components, which are usually corrupted by noise. While RASTA filtering in the log domain can remove convolutional noise, one can use J-RASTA, in which filtering is done in the so-called lin-log domain, to deal with additive noise. While the RASTA method is performed at the feature extraction stage, GMCM-HMM directly models the correlations between frames.
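To make the observation model of Section III-A concrete, the conditional probability of the current frame given an earlier frame can be evaluated as the ratio of a joint GMM to its marginal GMM over the earlier frame. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: every covariance block is diagonal (so each dimension reduces to a bivariate Gaussian), the dictionary layout and all names are hypothetical, and the parameter values are made up. Here `x` plays the role of the current frame and `y` the role of the earlier frame.

```python
import math


def log_gauss1(v, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def log_joint_component(x, y, comp):
    """log N([x; y]; mu, Sigma) for one mixture with diagonal covariance blocks.

    Each dimension d is a bivariate Gaussian, factored as p(y_d) * p(x_d | y_d).
    """
    total = 0.0
    for d in range(len(x)):
        vx, vy, vxy = comp["var_x"][d], comp["var_y"][d], comp["cov_xy"][d]
        cond_mean = comp["mean_x"][d] + vxy / vy * (y[d] - comp["mean_y"][d])
        cond_var = vx - vxy * vxy / vy   # must stay positive for a valid covariance
        total += log_gauss1(y[d], comp["mean_y"][d], vy)
        total += log_gauss1(x[d], cond_mean, cond_var)
    return total


def log_marginal_component(y, comp):
    """log N(y; mu_y, Sigma_yy): the joint Gaussian with x integrated out."""
    return sum(log_gauss1(y[d], comp["mean_y"][d], comp["var_y"][d])
               for d in range(len(y)))


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gmcm_log_obs_prob(x, y, mixture):
    """log P(x | y, state): log of the joint GMM minus log of the marginal GMM."""
    num = logsumexp([math.log(c["weight"]) + log_joint_component(x, y, c)
                     for c in mixture])
    den = logsumexp([math.log(c["weight"]) + log_marginal_component(y, c)
                     for c in mixture])
    return num - den


# Hypothetical one-dimensional, two-component state model (made-up numbers).
example_mixture = [
    {"weight": 0.6, "mean_x": [0.0], "mean_y": [0.0],
     "var_x": [1.0], "var_y": [1.0], "cov_xy": [0.5]},
    {"weight": 0.4, "mean_x": [2.0], "mean_y": [1.5],
     "var_x": [1.5], "var_y": [2.0], "cov_xy": [-0.3]},
]
print(gmcm_log_obs_prob([0.3], [0.1], example_mixture))
```

When the cross-covariance is zero and there is a single component, the result reduces to the ordinary Gaussian density of the current frame, which is the conventional frame-independent case.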
We will compare the performance of RASTA and GMCM-HMM in the experiments.

IV. AUDIO-VISUAL INTEGRATION WITH NEURAL NETWORKS

A. Audio-Visual Integration Problem

Integration of the acoustic and the visual modalities aims at improving automatic speech recognition performance, especially in noisy environments. Complementary usage of the two information sources can be found in humans’ face-to-face speech perception. People utilize the visual cues (the lips’ movements) for better speech understanding in acoustically noisy circumstances [23] and even in clean conditions [24].

Models for audio-visual integration can be categorized into two major approaches: EI and LI. In EI (sometimes called feature fusion), the visual and the acoustic speech features are combined to form a composite feature vector and processed by a single recognizer. Since the acoustic and the visual feature sequences usually have different frame rates, interpolation is performed so that both sequences have the same number of frames. In LI (sometimes called decision fusion), each modality is processed independently by the corresponding classifier and the outputs of the two classifiers are combined to yield the final recognition result.

Although which approach is more appropriate is still arguable, LI has advantages for implementing noise-robust AVSR systems. First, since the acoustic and the visual signals are processed and classified independently in the LI model, we can easily use an adaptive weighting scheme to adjust the relative contributions of the two modalities to the final decision according to the noise level of the acoustic signal. This is the greatest motivation for employing LI because the main objective of AVSR is to enhance recognition performance in various noisy environments. By adapting the integration weights of the two modalities according to the noise conditions, we can effectively utilize the complementary nature of the modalities for robustness. Second, the LI approach allows us to model the asynchronous characteristics of the two modalities flexibly, while EI assumes perfect synchrony between the acoustic and the visual signals. For some pronunciations the lips and the tongue start to move up to several hundred milliseconds before the speech sound is produced [25]. It has been shown that audio-visual integration does not require precise synchrony and there exists an “intersensory synchrony window” during which the performance of human audio-visual speech perception is not degraded for desynchronized audio-visual signals [26]. Third, an LI system can be constructed by utilizing existing unimodal recognition systems, whereas we need to train a whole new recognizer for designing an EI system.

The issue on which this paper mainly focuses is the optimal weighting problem of the modalities in the LI model. A good scheme of adaptive weighting enables us to apply the recognition system reliably in a wide range of noise conditions [27], [28]. On the other hand, when the integration weights are not estimated appropriately for the given noise conditions, we cannot expect complementarity and synergy of the two information sources and, moreover, the combined recognition performance may be even inferior to that of either unimodal system (which is called “attenuating fusion” [2]).

After the acoustic and the visual subsystems perform recognition separately, their outputs are combined by a weighted sum rule to produce the final decision. For a given test datum $O$, the recognized utterance $C^*$ is chosen by [29], [30]

$$C^* = \arg\max_{i} \left\{ \gamma \log P(O \mid \lambda_A^i) + (1-\gamma) \log P(O \mid \lambda_V^i) \right\} \tag{20}$$

where $\lambda_A^i$ and $\lambda_V^i$ are the acoustic and the visual HMMs for the $i$th utterance class, respectively, and $\log P(O \mid \lambda_A^i)$ and $\log P(O \mid \lambda_V^i)$ are their outputs. The weighting factor $\gamma$ ($0 \le \gamma \le 1$) determines how much each modality contributes to the final decision. $\gamma$ represents the reliability of the acoustic modality and, thus, its value varies with the amount of acoustic noise: when the SNR of the acoustic speech is large, where the acoustic modality usually outperforms the visual one, $\gamma$ is close to 1 and the final decision mostly relies on the acoustic modality; when the acoustic speech contains much noise and the acoustic modality performs worse than the visual one, $\gamma$ should be small so that the visual modality contributes more to the final decision than the acoustic one. Therefore, it is important to determine $\gamma$ automatically according to the relative reliability measures of the modalities to obtain robust performance for speech corrupted by various types and amounts of noise.

As the simplest solutions to this problem, a constant weight value over various noise conditions [31] or manual determination of the weight [32] have been considered. In some work the weight is given as a function of the SNR by assuming that we know the SNR of the acoustic signal [33], which is not always a feasible assumption. There has also been research on determining the weight for the operating condition of an AVSR system by using a small amount of additional adaptation data [34]. We are interested in a scheme that automatically determines appropriate integration weights without a priori knowledge of the given noise conditions or additional (and sometimes unavailable) adaptation data. One of the most popular methods among such schemes is the reliability ratio-based method, in which the weight is calculated from the relative reliability measures of the two modalities [29], [35]. In the following subsection we briefly explain this method and discuss its limitation in producing proper weights. Then, we propose an NN-based method to overcome the limitation and estimate appropriate weights for various noise conditions.

B.
Reliability Ratio-Based Integration

In the reliability ratio-based method, the weighting factor is calculated by [35]

$$\gamma = \frac{S_A}{S_A + S_V} \tag{21}$$

where $S_A$ and $S_V$ are the reliability measures of the outputs of the acoustic and the visual subsystems, respectively. The reliability of each modality is obtained from the outputs of the corresponding set of HMMs. When the acoustic speech does not contain any noise, there are large differences between the outputs of the acoustic HMMs. As the acoustic speech becomes noisy, the differences tend to become small. Considering this observation, we can define the reliability of a modality in several ways [29], [30]. Among them, the following definition has been shown to be the most appropriate and the best in performance [30]:

$$S = \frac{1}{C-1} \sum_{i=1}^{C} \left\{ \max_{j} \log P(O \mid \lambda^j) - \log P(O \mid \lambda^i) \right\} \tag{22}$$

where $C$ is the number of classes being considered.

Although the integration using (21) can improve noise-robustness compared to the acoustic-only recognition, the method may not be the optimal way of integration: We tried different equations for calculating $\gamma$ other than (21), and one example giving partially better performance is shown in Fig. 1. The integrated recognition result with (21) is more robust than the audio-only recognition result due to the additional use of visual information. However, if we use a different way of obtaining the weight $\gamma$, we can obtain further improvement for low SNR values. Thus, the weight estimation by (21) is not always optimal and the AVSR performance can be improved for certain noise conditions.

Fig. 1. Audio-visual integration results with different formulas for estimating the weight $\gamma$. The F-16 cockpit noise was added to the clean speech to produce noisy speech of various SNRs.

C. Proposed Neural Network-Based Method

In our integration method, an NN models the input-output mapping between the two reliabilities and the integration weight so that it works as an optimal weight estimator. In theory, a feedforward NN can model any arbitrary function with a desired error bound provided that the number of its hidden neurons is not limited [36]. Practical applications have also shown that NNs are more effective than other linear and nonlinear regression methods at constructing smooth mappings from data by learning the underlying input-output relationships in the data and at generalizing to unseen data [37]. Thus, an NN is a good candidate for approximating the reliabilities-weight mapping and producing appropriate weights for AVSR in various noise conditions. Fig. 2 shows the overall architecture of the proposed integration scheme.

In order for the NN to work as an estimator of the weight, it is trained by the following steps. First, we calculate the reliabilities of the outputs of the visual and the acoustic HMMs for the training data of various SNRs. Then, for each datum we obtain the values of $\gamma$ for correct recognition of the datum: while increasing $\gamma$ from 0 to 1 in steps of 0.01, we store the values of $\gamma$ with which the recognition result by (20) is correct. Next, we train the NN by using the reliabilities and the found optimal $\gamma$’s as the training input and target pairs.
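The weighted decision rule, the reliability ratio, and the search for weights that give a correct result can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the log-likelihood values are made up, the reliability is assumed to be the average gap between the best log-likelihood and the others, and `interval_error` assumes a squared distance to the nearest bound as one simple way of scoring an NN output against an interval of acceptable weights. All function names are hypothetical.

```python
def fuse_and_decide(loglik_a, loglik_v, gamma):
    """Weighted-sum rule: pick the class maximizing gamma*audio + (1-gamma)*visual."""
    scores = [gamma * a + (1.0 - gamma) * v for a, v in zip(loglik_a, loglik_v)]
    return max(range(len(scores)), key=scores.__getitem__)


def reliability(logliks):
    """Assumed measure: mean difference between the best score and all scores."""
    best = max(logliks)
    return sum(best - value for value in logliks) / (len(logliks) - 1)


def ratio_weight(loglik_a, loglik_v):
    """Reliability-ratio weight for the acoustic modality."""
    s_a, s_v = reliability(loglik_a), reliability(loglik_v)
    return s_a / (s_a + s_v)


def correct_weight_interval(loglik_a, loglik_v, true_class, steps=100):
    """Sweep gamma over 0, 0.01, ..., 1 and return the range giving a correct result."""
    good = [k / steps for k in range(steps + 1)
            if fuse_and_decide(loglik_a, loglik_v, k / steps) == true_class]
    return (min(good), max(good)) if good else None


def interval_error(y, lower, upper):
    """Assumed error for an interval target: zero inside, squared distance outside."""
    if y < lower:
        return (lower - y) ** 2
    if y > upper:
        return (y - upper) ** 2
    return 0.0


# Made-up outputs for three classes: noisy audio prefers class 1, video prefers class 0.
audio = [-120.0, -118.0, -125.0]
video = [-60.0, -66.04, -70.0]

gamma = ratio_weight(audio, video)
print(gamma, fuse_and_decide(audio, video, gamma))
print(correct_weight_interval(audio, video, true_class=0))
```

In this toy case the audio scores are closer together than the video scores, so the ratio assigns the smaller weight to audio and the fused decision follows the video stream.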
Although we want the NN to generate proper weights consistently over various noise conditions (types and levels) for wide applicability of the system, it is practically impossible to use training speech data of all possible noise conditions. Instead, we use only the $\infty$ dB, 20 dB, 10 dB, and 0 dB speech data corrupted by white noise for training the NN. Then, by the NN’s generalization capability, it produces proper weights for speech data whose noise conditions are not considered during training. This generalization capability will be verified in the experiments.

It should be noted that the value of $\gamma$ which gives the correct recognition result for each given datum appears as an interval, as shown in Fig. 3. When the SNR is large, a relatively large interval of the weight can lead to the correct recognition result because the outputs of the acoustic HMMs show large differences; when the SNR is small, the weight producing the correct recognition result has a small interval. Unlike in the conventional training of NNs, the target value for an input vector of the NN in our integration method is not a specific value but is given by an interval. Thus, we modify the error function for training the NN as follows:

(23)

where $y$ is the NN’s output, and $\gamma_l$ and $\gamma_u$ are the lower and the upper bounds of $\gamma$ giving the correct recognition result, respectively; they correspond to the lower and the upper boundaries of the shaded region in Fig. 3.

There are two popular types of NNs: multilayer perceptrons (MLPs) and radial basis function networks (RBFNs) [37]. An RBFN with localized basis functions (Gaussian functions) represents a local mapping in the space of hidden units with respect to the input space because only a few hidden neurons significantly contribute to the output for an input vector. On the other hand, many hidden units of an MLP are activated for an input vector because the sigmoidal function used for the activation function of the hidden units shows a global characteristic. In our case, the data which the NN is to learn are highly noisy and we expect the NN to model the global input-output characteristics of the data rather than the local noisy patterns. Therefore, the MLP is more appropriate than the RBFN in our system.

Fig. 2. Proposed audio-visual integration scheme.

Fig. 3. Example of the optimal weight for a datum corrupted by white noise as a function of the SNR. Any value of $\gamma$ in the shaded region gives the correct recognition result for the datum of the corresponding SNR.

V. EXPERIMENTS

A. Databases

For the experiments, we use two databases of isolated words: the digit database (DIGIT) and the city name database (CITY) [5]. The DIGIT database contains the isolated digits from zero to nine (including two versions of zero) in Korean, and the CITY database contains the names of 16 famous Korean cities. Each database contains pronunciations of 56 speakers (37 males and 19 females). Each person pronounced each word three times. The camera was focused on the face region around each speaker's lips, and the lips' movements were captured by a digital video camera at the rate of 30 Hz. At the same time, the acoustic speech was recorded at the rate of 32 kHz by a microphone and downsampled to 16 kHz. Both databases were collected under quiet laboratory conditions. The recognition task is performed in a speaker-independent manner. In order to increase the reliability of the experiments, we use the jackknife method: after dividing the data of the 56 speakers into four parts, we repeat the experiments four times, using the data of three parts (42 speakers) for training and the remaining part (14 speakers) for testing. For simulating noisy environments, we use the NOISEX-92 database [38], which contains various real-world noises. We choose four additive background noises from the database: the white noise (WHT), the F-16 cockpit noise (F16), the factory noise (FAC), and the operation room noise (OPR). Each noise is added to the clean acoustic speech to produce speech signals of various SNR values.

B. Baseline System

The acoustic feature used is the Mel-frequency cepstral coefficient (MFCC) [6]. A single frame contains 25 ms of speech samples and the frame window advances by 10 ms. The 12th-order MFCCs, the normalized energy, and their delta terms, computed over a window of neighboring frames, are used as the acoustic features. We apply the cepstral mean subtraction (CMS) technique [6] to remove the channel distortions contained in the speech samples.
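The noise mixing and the four-fold jackknife split described in Section V-A can be sketched as follows. This is a minimal sketch under assumptions: the SNR is taken as a global power ratio over the whole utterance, the noise recording is at least as long as the speech, and speakers are assigned to parts in index order; the function names are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db,
    then add it to the clean speech. Assumes len(noise) >= len(speech)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def jackknife_folds(n_speakers=56, n_folds=4):
    """Yield (train, test) speaker indices: each part serves once as the test set."""
    parts = np.array_split(np.arange(n_speakers), n_folds)
    for k in range(n_folds):
        train = np.concatenate([parts[j] for j in range(n_folds) if j != k])
        yield train, parts[k]

# Toy usage with synthetic signals standing in for speech and noise recordings.
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0.0, 100.0, 16000))
noise = rng.normal(size=20000)
noisy = add_noise_at_snr(speech, noise, snr_db=10.0)

for train, test in jackknife_folds():
    print(len(train), "training speakers,", len(test), "test speakers")
```

Splitting by speaker rather than by utterance keeps every test speaker unseen during training, which is what makes the task speaker-independent.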
Visual features are extracted by the following procedure: We first crop the lip region from each recorded image in an automatic way [5]. Variations such as scaling and rotation of the lips over different images are compensated during segmentation of the lip regions. Then, for each pixel point of the segmented grayscale lip region images of 50 × 44 pixels, the


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 5, AUGUST 2008

Fig. 4. Log-likelihood values of the visual HMMs trained by EM and HSA for (a) the DIGIT database and (b) the CITY database.

mean value over an utterance is subtracted, which is similar to the CMS technique in acoustic feature extraction and removes unwanted variations across images due to the speakers' appearances and the different illumination conditions. Finally, we apply principal component analysis (PCA) to find the main linear modes of variation and reduce the dimension of the feature vector. Twelve-dimensional static features and their delta terms are used as the visual features. We use left-to-right CHMMs without skips for the acoustic and the visual recognizers. The whole-word model is used and the number of states in each HMM is set to be proportional to the number of phonetic units of the corresponding word. We tested different numbers of Gaussian functions for the GMMs in the HMMs, and three Gaussian functions are used in each state because this configuration gives the best performance. The initial parameters of the HMMs are obtained by linear segmentation of the training data onto the states of the HMMs and iterative application of the segmental k-means algorithm and the Viterbi alignment. The EM training of the HMMs is terminated when the relative change of the log-likelihood value is less than a preset threshold, which results in about 20 iterations for each word on average. The baseline system uses the reliability ratio-based method for audio-visual integration.

C. Results of the Visual Subsystem

The performance of the visual subsystem using HSA is shown in this subsection. In HSA, we set the initial temperature to 10 and the maximum number of iterations to 10000. We use five iterations of EM for the local optimization step of HSA. In Fig. 4, we compare the final log-likelihood values obtained by the EM and the HSA algorithms for each database. It is observed that the HMMs optimized by HSA always show higher likelihoods than those by EM, which indicates that the HMMs trained by HSA, through global rather than local optimization, model the visual speech better than those trained by EM.
As a result, the HMMs optimized by HSA show better recognition performance than those by EM, as shown in Table I. The relative error reduction is 7.5% and 2.7% for each database. The performance improvement of HSA over EM was obtained at the cost of increased computational complexity. The

TABLE I ERROR RATES (%) OF THE VISUAL HMMS TRAINED BY EM AND HSA. RELATIVE ERROR REDUCTIONS (%) OVER EM ARE SHOWN IN THE PARENTHESES

time complexity of a parallel implementation of HSA by using multiple computers was 39 times that of EM with a single computer. However, this increased training time is acceptable if we consider the performance improvement by HSA. Also, the additional computation of HSA is needed only for the training process, which can be done in advance.

D. Results of the Acoustic Subsystem

In this subsection, we demonstrate the robustness of the acoustic recognition using the proposed modeling method. First, we investigate the effect of the value of the frame interval used in the proposed GMCM-HMM method. Fig. 5 compares the recognition error rates for the DIGIT database when various values of the frame interval are used. The performance of GMCM-HMM depends on the value of the frame interval, but the overall tendency shows that, by modeling the correlations between frames, we can obtain improvement in robustness over the conventional HMM system for various types and levels of noise. On the whole, a single value of the frame interval performs well for various conditions, which indicates that it is acceptable to use a fixed value of the frame interval for various conditions. Tables II and III compare the performance of GMCM-HMM with that of the conventional HMM with the MFCCs (baseline) and the J-RASTA features. For J-RASTA, we use the 12th-order J-RASTA features derived from the Mel-scale filterbank analysis, the normalized energy and their delta terms. We remove the channel distortions by CMS after the filterbank analysis and then apply the RASTA filtering in the lin-log domain. We can observe that GMCM-HMM significantly improves robustness in comparison with the baseline. J-RASTA also shows better performance than the baseline, but it is outperformed by GMCM-HMM in most cases. The performance gap among the three methods is salient when the SNR values are low.

LEE AND PARK: ROBUST AVSR BASED ON LATE INTEGRATION


Fig. 5. Performance of GMCM-HMM with various values of the frame interval for the DIGIT database when the speech is corrupted by (a) WHT, (b) F16, (c) FAC, and (d) OPR.

TABLE II ACOUSTIC RECOGNITION PERFORMANCE IN ERROR RATES (%) BY THE BASELINE SYSTEM, THE RASTA PROCESSING AND THE PROPOSED GMCM-HMM METHOD FOR THE DIGIT DATABASE

TABLE III ACOUSTIC RECOGNITION PERFORMANCE IN ERROR RATES (%) BY THE BASELINE SYSTEM, THE RASTA PROCESSING AND THE PROPOSED GMCM-HMM METHOD FOR THE CITY DATABASE

There are a few cases where the performance for a higher SNR is worse than that for a lower SNR; for example, in Table II the error rate of J-RASTA for clean speech is higher than that for the 20 dB noisy data corrupted by OPR. In our database there are a few utterances which are highly confusable because the HMMs for the correct and the competing classes produce nearly the same outputs for those utterances. Since the recognition results of these utterances dominantly contribute to the

overall error rates for high SNRs where the error rates are small, such inversions of performance occur in our experiments. Nevertheless, the results do not show large, significant differences. We believe that the superiority of GMCM-HMM to J-RASTA comes from the fact that the former has more freedom in modeling the correlations between observations than the latter. While the RASTA processing uses the fixed filter for considering dynamic characteristics of speech, the interframe correlations are captured by the “adjustable” models in our


Fig. 6. Recognition performance of the unimodal (acoustic-only and visual-only) and the AVSR systems for the DIGIT database. (a) WHT. (b) F16. (c) FAC. (d) OPR.

method. A GMCM-HMM can model the dynamic characteristics of the speech for a class, and such characteristics are distinct from those for the other classes. The improvement by the proposed method over the conventional HMM is obtained at the expense of an increased number of model parameters³ and higher computational complexity. For the conventional HMMs, a Gaussian function has parameters for both the mean vector and the covariance matrix. In the proposed method, both the dimension of the mean vector and the number of parameters in the covariance matrix increase. Thus, the number of parameters for each Gaussian function increases 2.5 times. Also, the training time for the proposed model was nearly 20 times that for the conventional one in the MATLAB environment. This is because, while we can use the parameter update formulas in each iteration of training of the conventional HMM, an iteration of training of the proposed model includes several iterations of solving (18) and the constrained optimization problem. However, these additional complexities are worth accepting because of the improved noise-robustness shown by the experiments.

³ We tried various parameter settings for both methods, and the ones giving the best performance were used in the experiments. In particular, HMMs having more parameters were not beneficial to robustness in the conventional method.

E. Results of Integrated Recognition

We demonstrate the performance of the proposed NN-based integration method in comparison with the conventional reliability ratio-based method. Figs. 6 and 7 show the visual-only, the acoustic-only and the integrated recognition performance for the two databases, respectively, when the visual HMMs are trained by HSA and we use the GMCM-HMMs in the acoustic subsystem. We use two-layer MLPs having five sigmoidal hidden neurons because using more hidden neurons did not show any further improvement of the performance. For training the MLPs, we use the Levenberg-Marquardt algorithm [36], which is one of the fastest training algorithms for NNs. In the figures, the reliability ratio-based method is referred to as “RR” and the proposed method as “NN”. We can observe performance improvement by the proposed integration method over the conventional one. The improvement is prominent when the SNR is low. It is observed that the proposed method sometimes performs a little worse than the conventional one in the mid-range of the SNRs. However, in the sense of the overall performance for various conditions, we can conclude that the proposed method enhances the integrated recognition performance compared to the conventional one. Also, the generalization capability of the NN for various types and levels of noise is verified; although we trained the NN with only the ∞ dB, 20 dB, 10 dB, and 0 dB data containing WHT, the NN works successfully for integration in untrained conditions.

F. Comparison of the Overall Systems

Finally, we evaluate the performance of our AVSR system by comparing it with the baseline system. Tables IV and V compare the performance of the two AVSR systems for each database,


Fig. 7. Recognition performance of the unimodal (acoustic-only and visual-only) and the AVSR systems for the CITY database. (a) WHT. (b) F16. (c) FAC. (d) OPR.

TABLE IV AVSR PERFORMANCE IN ERROR RATES (%) BY THE BASELINE AND THE PROPOSED SYSTEMS FOR THE DIGIT DATABASE

TABLE V AVSR PERFORMANCE IN ERROR RATES (%) BY THE BASELINE AND THE PROPOSED SYSTEMS FOR THE CITY DATABASE

respectively. The tables clearly show overall improvement by the proposed AVSR system over the baseline in noisy speech recognition tasks. Especially, the performance improvement is large when the SNR is small, which shows that the goal of our work, improvement of robustness, is achieved. In Fig. 8, we investigate the effect of each proposed method on AVSR by comparing the recognition results by the combinations of the conventional and the proposed methods for the CITY database. The acoustic subsystem uses either HMMs or GMCM-HMMs and the visual recognizer is trained by either EM or HSA. For fair comparisons, the NN-based integration method is used for combining the two modalities for all cases.

When we compare the dotted and the dashed lines, we can evidently observe the advantageous effect of GMCM-HMM on the final AVSR results. Since we obtain large performance gains for low SNRs by the GMCM-HMM method in acoustic speech recognition, the improvement of the AVSR performance by the method is also large for these SNR values. The benefit of the HSA training algorithm is not as much as that of GMCM-HMM because the performance gain by HSA in the visual-only recognition is smaller than that by GMCM-HMM in the acoustic-only recognition. However, we still observe the enhanced performance by HSA in the integrated recognition from the figures.


Fig. 8. AVSR performance of different combinations of the conventional and the proposed methods for the acoustic (A) and the visual (V) subsystems on the CITY database. (a) WHT. (b) F16. (c) FAC. (d) OPR.

VI. CONCLUSION

In this paper, we have proposed and constructed a novel robust AVSR system of the LI scheme. To enhance the robustness of the system in various noisy environments, we have considered the three parts constituting the system, proposed a technique for improving the performance of each part, and demonstrated the effectiveness of the methods via experiments. For improving the performance of the visual subsystem, the visual HMMs have been globally optimized by the HSA algorithm. For enhancing robustness of the acoustic subsystem, GMCM-HMM for modeling correlations between observation frames has been proposed and its learning method has been devised. To combine the two subsystems effectively, we have utilized the NN for obtaining appropriate integration weights. Although we have shown effectiveness of the proposed NN-based integration method on the isolated word recognition tasks, this scheme can be extended for connected-word or continuous speech recognition tasks. In such cases, it would be a problem that, from the two modalities, we have unmanageably many possible word or phoneme sequence hypotheses to be considered for weighted integration. A solution for this is to consider only N-best hypotheses from each modality and test 2N combined pairs, as shown in [33]. For large vocabulary speech recognition, it would be necessary to incorporate the adaptive weighting scheme into joint modeling of the two modalities by using complex models such as multi-stream HMMs [39]. Also, more complicated interactions between the

modalities can be modeled by using cross-modal associations and influences [40], [41], where we still can use the proposed integration method for adaptive robustness. With these considerations, further investigation of applying the proposed system to complex tasks such as connected-word or continuous speech recognition is in progress.

REFERENCES

[1] E. D. Petajan, “Automatic lipreading to enhance speech recognition,” in Proc. Global Telecommunications Conf., Atlanta, GA, Nov. 1984, pp. 265–272.
[2] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, “A review of speech-based bimodal recognition,” IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23–37, Mar. 2002.
[3] B. V. Dasarathy, “Sensor fusion potential exploitation: Innovative architectures and illustrative applications,” Proc. IEEE, vol. 85, pp. 24–38, Jan. 1997.
[4] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Hillsdale, NJ: Lawrence Erlbaum, 1987.
[5] J.-S. Lee and C. H. Park, “Training hidden Markov models by hybrid simulated annealing for visual speech recognition,” in Proc. IEEE Int. Conf. Systems, Man, Cybernetics, Taipei, Taiwan, R.O.C., Oct. 2006, pp. 198–202.
[6] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall, 2001.
[7] R. Drullman, J. M. Festen, and R. Plomp, “Effect of temporal envelope smearing on speech reception,” J. Acoust. Soc. Amer., vol. 95, no. 2, pp. 1053–1064, Feb. 1994.
[8] T. Arai and S. Greenberg, “Speech intelligibility in the presence of cross-channel spectral asynchrony,” in Proc. ICASSP, Seattle, WA, 1998, vol. 2, pp. 933–936.


[9] S. Furui, “Speaker-independent isolated word recognition using dynamic features of speech spectrum,” IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 1, pp. 52–59, Feb. 1986.
[10] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[11] C. Bartels and J. Bilmes, “Focused state transition information in ASR,” in Proc. Workshop on Automatic Speech Recognition and Understanding, San Juan, PR, Nov. 2005, pp. 191–196.
[12] H. H. Szu and R. L. Hartley, “Fast simulated annealing,” Phys. Lett. A, vol. 122, no. 3–4, pp. 157–162, June 1987.
[13] D. Paul, “Training of HMM recognizers by simulated annealing,” in Proc. ICASSP, Tampa, FL, Mar. 1985, pp. 13–16.
[14] S. R. Eddy, “Multiple alignment using hidden Markov models,” in Proc. Int. Conf. Intelligent Systems for Molecular Biology, Menlo Park, CA, 1995, pp. 114–120.
[15] D. Nam, J.-S. Lee, and C. H. Park, “n-dimensional Cauchy neighbor generation for the fast simulated annealing,” IEICE Trans. Inf. Syst., vol. E87-D, no. 11, pp. 2499–2502, Nov. 2004.
[16] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of state calculations by fast computing machines,” J. Chem. Phys., vol. 21, no. 6, pp. 1087–1092, 1953.
[17] J.-S. Lee, “Audio-Visual Speech Recognition: Stochastic Optimization of Hidden Markov Models, Modeling of Interframe Correlations and Integration With Neural Networks,” Ph.D. dissertation, Dept. Elect. Eng. Comput. Science, KAIST, Daejeon, Korea, 2006.
[18] D. V. Ouellette, “Schur complements and statistics,” Linear Algebra Appl., vol. 36, pp. 187–295, Mar. 1981.
[19] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991.
[20] T. H. Applebaum and B. A. Hanson, “Regression features for recognition of speech in quiet and in noise,” in Proc. ICASSP, Toronto, ON, Canada, Apr. 1991, vol. 2, pp. 985–988.
[21] Optimization Toolbox User’s Guide. Natick, MA: The MathWorks, Inc., 2005.
[22] A. D. Belegundu and T. R. Chandrupatla, Optimization Concepts and Applications in Engineering. Upper Saddle River, NJ: Prentice-Hall, 1999.
[23] L. A. Ross, D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe, “Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments,” Cerebral Cortex, vol. 17, no. 5, pp. 1147–1153, 2007.
[24] P. Arnold and F. Hill, “Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact,” Brit. J. Psychol., vol. 92, pp. 339–355, 2001.
[25] C. Benoît, “The intrinsic bimodality of speech communication and the synthesis of talking faces,” in The Structure of Multimodal Dialogue II, M. M. Taylor, F. Nel, and D. Bouwhuis, Eds. Amsterdam, The Netherlands: John Benjamins, 2000, pp. 485–502.
[26] B. Conrey and D. B. Pisoni, “Auditory-visual speech perception and synchrony detection for speech and nonspeech signals,” J. Acoust. Soc. Amer., vol. 119, no. 6, pp. 4065–4073, June 2006.
[27] M. Heckmann, F. Berthommier, and K. Kroschel, “Noise adaptive stream weighting in audio-visual speech recognition,” EURASIP J. Appl. Signal Process., vol. 11, pp. 1260–1273, 2002.
[28] E. Marcheret, V. Libal, and G. Potamianos, “Dynamic stream weight modeling for audio-visual speech recognition,” in Proc. ICASSP, Honolulu, HI, Apr. 2007, vol. 4, pp. 945–948.
[29] A. Rogozan and P. Deléglise, “Adaptive fusion of acoustic and visual sources for automatic speech recognition,” Speech Commun., vol. 26, no. 1–2, pp. 149–161, Oct. 1998.
[30] T. W. Lewis and D. M. W. Powers, “Sensor fusion weighting measures in audio-visual speech recognition,” in Proc. 27th Conf. Australasian Computer Science, Dunedin, New Zealand, 2004, pp. 305–314.
[31] T. J. Hazen, “Visual model structures and synchrony constraints for audio-visual speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 1082–1089, May 2006.


[32] A. Verma, T. Faruquie, C. Neti, and S. Basu, “Late integration in audio-visual continuous speech recognition,” in Proc. Workshop on Automatic Speech Recognition and Understanding, Keystone, CO, Dec. 1999, pp. 71–74.
[33] G. F. Meyer, J. B. Mulligan, and S. M. Wuerger, “Continuous audio-visual digit recognition using N-best decision fusion,” Inform. Fusion, vol. 5, no. 2, pp. 91–101, June 2004.
[34] S. Tamura, K. Iwano, and S. Furui, “A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization,” in Proc. ICASSP, Philadelphia, PA, Mar. 2005, vol. 1, pp. 469–472.
[35] A. Adjoudani and C. Benoît, “On the integration of auditory and visual parameters in an HMM-based ASR,” in Speechreading by Humans and Machines: Models, Systems and Applications, D. G. Stork and M. E. Hennecke, Eds., ser. NATO ASI Series. Berlin, Germany: Springer, 1996, pp. 461–472.
[36] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford Univ. Press, 1995.
[37] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1999.
[38] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.
[39] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151, Sep. 2000.
[40] M. H. Coen, “Multimodal integration—A biological view,” in Proc. Int. Joint Conf. Artificial Intelligence, Seattle, WA, 2001, pp. 1417–1424.
[41] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, “Multimedia content processing through cross-modal association,” in Proc. ACM Int. Conf. Multimedia, Berkeley, CA, Nov. 2003, pp. 604–611.

Jong-Seok Lee (M’06) received the B.S. degree in electrical and electronic engineering and the M.S. and the Ph.D. degrees in electrical engineering and computer science in 1999, 2001 and 2006, respectively, from KAIST, Daejeon, Korea. He is currently a Postdoctoral Fellow at KAIST. He was an Adjunct Professor at the School of Electrical Engineering and Computer Science, KAIST, in 2007. His research interests include speech recognition, multimodal human-computer interaction, and machine learning.

Cheol Hoon Park (S’82–M’84–SM’04) received the B.S. degree in electronics engineering (with the Best Student Award) from Seoul National University, Seoul, Korea, in 1984 and the M.S. and Ph.D. degrees in electrical engineering from California Institute of Technology, Pasadena, in 1985 and 1990, respectively. He joined the Department of Electrical Engineering, KAIST, Daejeon, Korea, in 1991, where he is currently a Professor. His research interests are in the area of intelligent systems including intelligence, neural networks, fuzzy logic, evolutionary algorithms, and their application to recognition, information processing, intelligent control, dynamic systems, and optimization.