Discriminative Training of Hidden Markov Models by Multiobjective Optimization for Visual Speech Recognition

Jong-Seok Lee

Cheol Hoon Park

Department of Electrical Engineering and Computer Science Korea Advanced Institute of Science and Technology Daejeon, 305-701, Korea E-mail: [email protected]

Department of Electrical Engineering and Computer Science Korea Advanced Institute of Science and Technology Daejeon, 305-701, Korea E-mail: [email protected]

Abstract— This paper proposes a novel discriminative training algorithm of hidden Markov models (HMMs) based on the multiobjective optimization for visual speech recognition. We develop a new criterion composed of two minimization objectives for training HMMs discriminatively and a global multiobjective optimization algorithm based on the simulated annealing algorithm to find the Pareto solutions of the optimization problem. We demonstrate the effectiveness of the proposed method via an isolated digit recognition experiment. The results show that the proposed method is superior to the conventional maximum likelihood estimation and the popular discriminative training algorithms.

0-7803-9048-2/05/$20.00 ©2005 IEEE

I. INTRODUCTION

Visual speech recognition (or lipreading) is the task of recognizing speech by observing the movement of the speaker’s lips. Its importance arises from the observation that visual information about speech is useful for both human speech recognition and automatic speech recognition by machines, especially in acoustically noisy circumstances [1], [2]. Although visual speech recognition is somewhat less accurate than recognition from acoustic speech in low-noise environments, it is not affected by acoustic noise and can therefore compensate for the performance degradation of acoustic speech recognition in noisy environments [3].

The dominant recognition paradigm for visual speech recognition is the hidden Markov model (HMM), as in acoustic speech recognition [2]. To use HMMs for recognition, their parameters must be trained on appropriate visual speech features extracted from the recorded image sequences. By far the most popular training method for HMMs is maximum likelihood (ML) estimation, in which the parameters are adjusted so as to maximize the likelihood of the training sequences under the HMM. A simple implementation of ML estimation, the Baum-Welch algorithm, has been developed and successfully used in many applications [4].

Discriminative training methods, on the other hand, attempt to maximize the recognition accuracy of HMMs [5], [6], [7]. If the true distribution of the data to be modeled can be accurately described by an HMM and the amount of training data goes to infinity, the optimal distribution estimate obtained by ML estimation is consistent with the optimal recognizer design. Unfortunately, since the true distribution of the visual speech signal cannot be modeled exactly by an HMM and the training data are always insufficient, the performance of ML estimation may differ significantly from that of the optimal recognizer design [8]. To obtain high recognition performance, discriminative training algorithms usually use the data of both the correct class and the other classes, whereas in ML estimation an HMM is trained only with the data of the correct class. The key to discriminative training is the development of new objective functions that are expected to reduce the classification error, together with methods for optimizing them.

In this paper, we propose a new optimization criterion composed of two objectives for training HMMs in a discriminative manner and develop a multiobjective simulated annealing (MOSA) algorithm to solve the resulting multiobjective optimization problem. When a test sample is given, the probabilities of the sample under all the HMMs are compared and the class with the maximum probability is chosen as the recognized class. To obtain high recognition accuracy, it is therefore desirable for an HMM to yield a high probability for the data of the correct class and low probabilities for the data of the other classes. Accordingly, in our optimization criterion, the probability of the data of the correct class under the corresponding HMM is maximized while the probability of the data of the other classes under that HMM is minimized. These two conflicting objectives are optimized by a multiobjective optimization method: we devise a MOSA algorithm to find the non-dominated optimal solutions for the two objective functions.
Our MOSA algorithm has a simple and compact structure and is able to find diverse globally optimal solutions.

In the next section, we explain the visual speech recognition system. Section III formulates the proposed optimization criterion for discriminative training of HMMs, describes the multiobjective simulated annealing algorithm for solving the optimization problem, and discusses the problem of decision making among the set of non-dominated solutions for composing the final classifier. Section IV presents the experimental results for an isolated digit recognition task and, finally, concluding remarks are given in Section V.

Fig. 1. Lip contour points for the visual speech features.

Fig. 2. A typical left-to-right HMM. There are four states ($S_1$–$S_4$). $a_{ij}$ is the state transition probability from state $i$ to state $j$ and $b_i$ is the observation probability distribution of state $i$.

II. VISUAL SPEECH RECOGNITION

When a person speaks, a camera captures the lip movement of the speaker and produces a sequence of images. An appropriate set of visual speech features is then extracted from the images by image processing techniques. In our system, the visual features are represented by geometrical contour information of the lips. They are based on 14 points on the lip contours, as shown in Figure 1. We extract these points automatically by using color information; a detailed description of the extraction method can be found in [9].

To obtain visual features from the extracted contour points, we measure the heights of the edge points ($U_i$ and $L_i$, $i = 1, 2, \ldots, 6$) from the line connecting the mouth corners ($C_1 C_2$). Each height is then divided by the width of the mouth for normalization, because the size of the mouth varies across speakers. Since some speakers have non-symmetric lip shapes, both the left and right parts of the lips are used. The normalized heights are then transformed by principal component analysis (PCA) [10], [11]. Along with the geometrical features, we use the delta (Δ) terms defined by their temporal derivatives to improve the recognition performance [4]. Thus, we have 24 features (12 visual features + 12 delta features) for each image frame. Finally, these continuous visual features are converted into discrete symbols by vector quantization using a codebook with 128 code words.

The visual speech features are modeled by HMMs. Figure 2 illustrates the structure of a typical left-to-right HMM. An $N$-state HMM is characterized by the initial state distribution $\Pi = \{\pi_i\}$, the state transition probability distribution $A = \{a_{ij}\}$, and the observation probability distribution $B = \{b_i\}$. For convenience, the following compact notation is used:

$$\lambda = (\Pi, A, B). \tag{1}$$

We use whole-word HMMs, which is a standard approach in small-vocabulary speech recognition systems [4]. In the training phase, the features of each utterance are used for training the parameters of the corresponding HMM. In the recognition phase, the features are processed through the HMMs for all utterances and the HMM which yields the maximum probability is chosen.
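The recognition rule of this section (evaluate the observation sequence under every word HMM and pick the model with the maximum probability) can be illustrated with a minimal sketch. It assumes the standard scaled forward algorithm for computing $\log P(O \mid \lambda)$ of a discrete HMM, which the text does not spell out; the two toy word models, their labels, and the 2-symbol codebook are hypothetical stand-ins for the eight-state models and the 128-word codebook used in the paper.

```python
import math


def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | lambda) for a discrete HMM.

    obs is a non-empty list of symbol indices; pi[i], A[i][j] and
    B[i][symbol] hold the probabilities of lambda = (Pi, A, B).
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # the model cannot generate this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def recognize(obs, models):
    """Return the label of the HMM with the largest log-likelihood for obs."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))


# Hypothetical 2-state left-to-right word models (illustration only).
A_LR = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "word_a": ([1.0, 0.0], A_LR, [[0.9, 0.1], [0.1, 0.9]]),
    "word_b": ([1.0, 0.0], A_LR, [[0.1, 0.9], [0.9, 0.1]]),
}

if __name__ == "__main__":
    print(recognize([0, 0, 1, 1], MODELS))
```

Working in the log domain with per-frame rescaling keeps the computation stable for long symbol sequences, where the raw probability would underflow.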

III. PROPOSED METHOD

A. Objectives for Discriminative Training

The most popular method for training the HMM parameters from the training data is ML estimation. Its objective is stated as

$$\text{Maximize } f(\lambda) = \sum_{u=1}^{U_c} \log P(O_u^c \mid \lambda), \tag{2}$$

where $\lambda$ is the parameter set of the HMM, $O_u^c$ is the $u$-th training sample of the corresponding class, and $U_c$ is the number of training samples. ML estimation is implemented by the well-known Baum-Welch method [4]. When the re-estimation formulas of the method are applied repeatedly, the likelihood monotonically increases and eventually converges to a critical point.

Although ML estimation is simple and gives quite good performance, it does not guarantee the maximum recognition accuracy because of the insufficient training data and the mismatch between the model (i.e., the HMM) and the speech data to be modeled [8]. To solve this problem, discriminative training methods replace (2) with new objective functions designed to maximize the recognition accuracy, together with methods for optimizing them. Usually, the discriminative training objectives involve information from the competing classes as well as from the correct class.

For correct recognition, the probability of a given sample under the HMM of the correct class should be larger than any of the probabilities under the HMMs of the other classes. In other words, it is desirable that an HMM produce high probabilities for the data of the correct class and low probabilities for the data of the other classes. Therefore, we propose a new training criterion of two objectives for training HMMs discriminatively,

which is stated in the following form:

$$\begin{aligned} \text{Minimize } f_1(\lambda) &= -\sum_{u=1}^{U_c} \log P(O_u^c \mid \lambda), \\ \text{Minimize } f_2(\lambda) &= \sum_{\substack{w=1 \\ w \neq c}}^{W} \sum_{u=1}^{U_w} \log P(O_u^w \mid \lambda), \end{aligned} \tag{3}$$

where $c$ is the correct class, $O_u^w$ the $u$-th training sample of the $w$-th class, $W$ the total number of classes, and $U_w$ the number of training samples of the $w$-th class. Note that the first objective is the conventional ML criterion in (2) with the sign reversed, which converts the maximization into a minimization. Optimizing the HMM parameters with these objectives raises the likelihood of the data of the correct class and suppresses that of the data of the other classes, and thereby makes the HMM discriminative when it is used for recognition.

The solution of (3) is not unique; there exist many solutions that are mutually non-dominated. A solution $x$ is said to dominate another solution $y$ if $x$ is no worse than $y$ in all objectives and strictly better than $y$ in at least one objective. If neither $x$ dominates $y$ nor $y$ dominates $x$, they are said to be mutually non-dominated. The set of optimal solutions of a multiobjective problem, none of which is dominated by any feasible solution, is called the Pareto optimal set or the Pareto front. The method used to find the Pareto optimal solutions of (3) is explained in the following subsection.

B. Training Algorithm

To solve the problem given by (3), we adopt a multiobjective simulated annealing (MOSA) algorithm. The simulated annealing (SA) algorithm is a stochastic search algorithm designed by Kirkpatrick et al. using the spin glass model [12]. It has been used in a wide range of areas because it performs well on most optimization problems, especially complex ones. An attractive property of SA is that its convergence has been proved mathematically [13], [14]. In [15], an SA method was developed to find the Pareto solutions of multiobjective optimization problems and its convergence property was discussed. The optimization algorithm used in this paper is based on that method, with some modifications to suit the proposed objectives. Our algorithm builds on the fast simulated annealing algorithm [16], in which the Cauchy distribution is used for generating new states and reciprocal cooling is used for annealing. The procedure of the proposed algorithm is as follows:

Step 1: Initialization. Randomly generate an initial population of $P$ solutions. Set the temperature to its initial value.

Step 2: Generation (mutation). Generate a new solution for each current solution by the Cauchy generating function $g_k$, which is given by

$$g_k(\Delta x) = \frac{T_k}{(\|\Delta x\|^2 + T_k^2)^{(D+1)/2}}, \tag{4}$$

where $k$ is the index of the iteration, $D$ the dimension of a solution vector $x$, and $T_k$ the temperature at iteration $k$.

Step 3: Local search. Mutate the new solutions further by the local search operation. One iteration of the Baum-Welch estimation is used as the local search operator. This operation is used to speed up the search process.

Step 4: Evaluation. Calculate the objective function values of the new solutions by (3) and determine the Pareto-based costs of the current and the new solutions. In this method, the cost of a solution $x$ is the number of solutions that dominate $x$ among the current and the new solutions.

Step 5: Selection. Select the solutions of the next iteration among the current and the new solutions by the Metropolis method [17]; when a current solution $x$ and the new solution $x'$ generated from $x$ compete, the acceptance probability of the new solution is calculated by

$$p_a = \begin{cases} 1 & \text{if } c(x) > c(x'), \\ \exp\left(\dfrac{c(x) - c(x')}{T_k}\right) & \text{otherwise,} \end{cases} \tag{5}$$

where $c(x)$ and $c(x')$ are the Pareto-based costs of $x$ and $x'$, respectively. In other words, when the new solution is better than the current solution, the transition from the current solution to the new one always occurs; otherwise, the transition occurs probabilistically.

Step 6: Annealing. Decrease the temperature by the reciprocal cooling schedule given by

$$T_k = \frac{T_0}{k}, \tag{6}$$

where $T_0$ is the initial temperature.

Step 7: Termination. If the termination conditions are satisfied, stop. Otherwise, go to Step 2.

The above optimization process is carried out separately for each HMM. A new solution is generated from a current solution by the combination of the Cauchy generating function and one iteration of the Baum-Welch algorithm. The addition of the local search operator aims at speeding up the search process; we observed that the algorithm converges much faster with this scheme.

The cost of each solution is calculated by the Pareto-based cost. In [15], it was proven that MOSA with the Pareto-based cost guarantees that the search agents are distributed uniformly over all the global optima. The Pareto-based cost is defined by

$$c(x) = \int_A \operatorname{dom}(y, x)\, dy, \tag{7}$$

where $A$ represents the whole feasible solution space and

$$\operatorname{dom}(y, x) = \begin{cases} 1 & \text{if } y \text{ dominates } x, \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$

Figure 3 shows the concept of the Pareto-based cost for a problem with two objectives. The solutions in the region of area $S_x$ dominate the solution $x$, so the Pareto-based cost of $x$ is $S_x$. Similarly, the Pareto-based cost of $y$ is $S_y$. Since there is no solution which dominates $z$, the cost of $z$ is zero. We can see that $c(y) > c(x) > c(z)$, which reflects the distances of the solutions from the Pareto front.

Fig. 3. Illustration of the Pareto-based cost (axes: objective 1 and objective 2). The upper right region of the Pareto front is the feasible region. The solutions $x$ and $y$ are dominated by the solutions in the regions whose areas are $S_x$ and $S_y$, respectively. The solution $z$ is on the Pareto front and there is no feasible solution which dominates $z$.

However, implementing the Pareto-based cost given by (7) requires the cost information of all solutions, including the optimal solutions, which is generally impossible to obtain. Besides, the integral in (7) cannot be calculated exactly on a digital computer. Therefore, we use the Pareto-based cost based on sampling [15], i.e.,

$$\tilde{c}(x) = \sum_{p=1}^{2P} \operatorname{dom}(y_p, x), \tag{9}$$

where $y_p$ is the $p$-th solution among the set of the $P$ current solutions and the $P$ new solutions.

In many evolutionary algorithms, several individuals may converge to the same one due to stochastic errors in the selection process. Techniques such as fitness sharing or niche induction are often used to prevent this genetic drift phenomenon and keep the solutions diverse [18]. In our method, on the other hand, competition occurs only between a current solution and the new one generated from it, and each of the next solutions is selected in parallel. Therefore, the selection scheme by construction does not allow the undesirable convergence of the solutions to the same value.

Steps 2 to 6 are repeated until the termination conditions are satisfied. In our experiments, we set a maximum number of iterations as the termination condition.

C. Decision Making

Once we obtain the non-dominated optimal solutions of the HMM for each utterance model, we must select one solution for each utterance to compose the final recognizer. Given $W$ utterance classes and $M$ optimal solutions for each class, we must choose one of $M^W$ possible combinations for performing recognition. We try several combinations, each formed by randomly selecting one non-dominated solution for each class, and take the one showing the best recognition performance on the training data. Additionally, based on the experimental results, we discuss in the next section which of the non-dominated solutions are preferable for obtaining a well-performing recognizer.

IV. EXPERIMENTAL RESULTS

Our database contains recordings of 18 speakers (8 males and 10 females) pronouncing the isolated digits from one to ten in Korean. Each speaker pronounced each word ten times; eight of these utterances were used for training and two for testing. Each image frame consists of 8-bit R, G, and B components of size 360×480 pixels, and the frame rate is 30 Hz. As explained in Section II, we extracted 24-dimensional features from the recorded images and discretized them using a codebook with 128 code words. We used left-to-right discrete HMMs with eight states for all words; HMMs with several numbers of states were tested and the one showing the best recognition result was selected.

In our MOSA algorithm, we used a population size of 25. The initial temperature $T_0$ was set to 10 and the maximum number of iterations to 50000. To implement the generating function in (4), we used the method for generating $D$-dimensional Cauchy random numbers proposed in [19].

We compare the recognition performance of the solutions found by our method with that of the solutions obtained by the Baum-Welch algorithm and two popular existing discriminative training algorithms: the minimum classification error (MCE) approach [5] and the maximum mutual information (MMI) method [6], [7].
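For readers who want to see how the rules of Section III-B fit together with the settings quoted above, the following minimal sketch combines the dominance test, the sampled Pareto-based cost (9), the acceptance probability (5), and the reciprocal cooling schedule (6) in one loop. It is an illustration under stated assumptions rather than the authors' implementation: `toy_objectives` is a hypothetical pair of conflicting functions standing in for $f_1$ and $f_2$ of (3), the Cauchy step uses a generic Gaussian-ratio sampler instead of the method of [19], the Baum-Welch local search of Step 3 is omitted, and the default iteration count is far below the 50000 used in the experiments.

```python
import math
import random


def dominates(fa, fb):
    """True if objective vector fa dominates fb (all objectives minimized)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))


def sampled_cost(f, pool):
    """Eq. (9): number of objective vectors in the pool that dominate f."""
    return sum(dominates(g, f) for g in pool)


def acceptance_probability(cost_cur, cost_new, temp):
    """Eq. (5): accept a better candidate surely, a worse one probabilistically."""
    if cost_cur > cost_new:
        return 1.0
    return math.exp((cost_cur - cost_new) / temp)


def temperature(t0, k):
    """Eq. (6): reciprocal cooling for iteration k = 1, 2, ..."""
    return t0 / k


def cauchy_step(dim, temp, rng):
    """D-dimensional Cauchy step with scale temp (Gaussian vector over |Gaussian|).

    Assumption: a generic sampler, not the generator proposed in [19].
    """
    u = abs(rng.gauss(0.0, 1.0)) or 1e-12
    return [temp * rng.gauss(0.0, 1.0) / u for _ in range(dim)]


def toy_objectives(x):
    """Hypothetical stand-in for (f1, f2): two conflicting quadratic bowls."""
    return (sum(v * v for v in x), sum((v - 2.0) ** 2 for v in x))


def mosa(objectives, dim, pop_size=25, t0=10.0, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population.
    current = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    for k in range(1, iterations + 1):
        t_k = temperature(t0, k)
        # Step 2: one candidate per current solution (Step 3 is omitted here).
        candidates = [
            [v + d for v, d in zip(x, cauchy_step(dim, t_k, rng))] for x in current
        ]
        # Step 4: objective values and the pool of 2P vectors used in Eq. (9).
        f_cur = [objectives(x) for x in current]
        f_new = [objectives(x) for x in candidates]
        pool = f_cur + f_new
        # Step 5: each solution competes only with its own candidate.
        for i in range(pop_size):
            p_a = acceptance_probability(
                sampled_cost(f_cur[i], pool), sampled_cost(f_new[i], pool), t_k
            )
            if rng.random() < p_a:
                current[i] = candidates[i]
    return current


if __name__ == "__main__":
    population = mosa(toy_objectives, dim=2)
    print(len(population), "solutions in the final population")
```

Because every solution competes only with the candidate generated from it, the population size stays fixed and no explicit diversity mechanism is needed, which mirrors the argument made at the end of Section III-B.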
In the MCE method, a smooth loss function approximating the error rate is defined and minimized by the generalized probabilistic descent (GPD) algorithm. In the MMI method, the HMM parameters are adjusted so that the mutual information between the observations and the correct class is maximized. In particular, a simple and powerful optimization technique for the MMI objective, similar to the Baum-Welch algorithm, was developed in [7], and we used it in the comparison with the proposed algorithm. To initialize each HMM for the Baum-Welch algorithm, the training data are linearly segmented onto the HMM states, and then iterative segmental k-means clustering and Viterbi alignment are applied. This initialization scheme is known to be very effective in speech modeling [4]. The algorithm parameters of the MCE and MMI methods were carefully determined via several experiments.

Fig. 4. Typical examples of the obtained non-dominated solutions, plotted as Objective 2 against Objective 1. (a) For the seventh utterance. (b) For the eighth utterance.

Fig. 5. Comparison of the recognition performance of the conventional methods and the proposed method. Accuracy (%): Baum-Welch 46.9, MMI 47.2, MCE 47.2, Proposed 51.1.

Figure 4 shows typical examples of the obtained non-dominated solutions in the objective space. For each word, we
finally obtained about seven to eight non-dominated solutions.

Figure 5 compares the recognition accuracies of the proposed method and the three conventional methods. For the decision making of our method, we tried 100 combinations of solutions and selected the one showing the best recognition accuracy on the training data. The recognition accuracy obtained by our method is 51.1%, which corresponds to a relative reduction of about 8% in error rate in comparison to the Baum-Welch algorithm. The two discriminative training methods perform slightly better than the Baum-Welch method, but their improvement is smaller than that of the proposed method. The superiority of the proposed method over the MCE and MMI methods comes from the fact that the proposed method performs global optimization by MOSA, while the MCE and MMI methods only find local solutions.

We observed that the final solutions selected among the non-dominated solutions by the decision making procedure are usually the ones with comparatively small values of the first objective (i.e., large likelihoods for the correct class). When we picked the leftmost solutions on the objective plane (i.e., the ones showing the smallest values of the first objective among the non-dominated solutions) to produce a classifier, we obtained an accuracy of 50.8%, which is similar to the result in Figure 5. The recognizer composed of the solutions lying midway among the non-dominated solutions shows an accuracy of 48.6% and, when we use the rightmost solutions (the ones with the minimum values of the second objective among the non-dominated solutions), the accuracy is 43.6%. Therefore, we can infer that it is preferable to pick the solutions having small values of the first objective.

It should be noted that the improvement of the proposed method over the conventional methods is obtained at the cost of additional training time. Since the HMM for each utterance can be optimized independently, we implemented the training process in parallel on multiple computers to reduce the training time. Nevertheless, while training the HMMs by the three conventional methods takes a few minutes, the proposed method requires a few hours to design the recognizer. However, the additional complexity is required only in the training phase; the recognition phases of the conventional and the proposed methods are exactly the same.

V. CONCLUSION

We have proposed a discriminative training algorithm for HMMs using a multiobjective optimization method for visual speech recognition. We defined a new discriminative HMM training criterion composed of two objectives in order to maximize the recognition performance of the HMM recognizer. We also developed a multiobjective simulated annealing method for optimizing the objectives and finding the Pareto solutions. The experimental results showed that the proposed method produces a recognizer with better recognition performance than the conventional Baum-Welch algorithm and popular discriminative training algorithms such as the MCE and MMI methods.
Testing the proposed method with various databases and applying it to the systems using continuous HMMs are currently in progress.

ACKNOWLEDGMENT

This work was supported by grant No. R01-2003-00010829-0 from the Basic Research Program of the Korea Science and Engineering Foundation.

REFERENCES

[1] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, pp. 746–748, 1976.
[2] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, “A review of speech-based bimodal recognition,” IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23–27, 2002.
[3] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
[4] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, New Jersey: Prentice-Hall, 1993.
[5] B.-H. Juang, W. Chou, and C.-H. Lee, “Minimum classification error rate methods for speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 257–265, 1997.
[6] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum mutual information estimation of hidden Markov model parameters,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Tokyo, Japan, 1986, vol. 1, pp. 49–52.
[7] A. Ben-Yishai and D. Burshtein, “A discriminative training algorithm for hidden Markov models,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 3, pp. 204–216, 2004.
[8] W. Chou, “Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition,” Proceedings of the IEEE, vol. 88, no. 8, pp. 1201–1223, 2000.
[9] J.-S. Lee, S. H. Shim, S. Y. Kim, and C. H. Park, “Bimodal speech recognition using robust visual feature extraction under uncontrolled illumination conditions,” Telecommunication Reviews, vol. 14, no. 1, pp. 123–134, 2004.
[10] C. Bregler and Y. Konig, “‘Eigenlips’ for robust speech recognition,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Adelaide, Australia, 1994, vol. 2, pp. 669–672.
[11] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
[12] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
[13] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli, “Convergence and finite-time behavior of simulated annealing,” Applied Probability Trust, vol. 18, pp. 747–771, 1986.
[14] R. L. Yang, “Convergence of the simulated annealing algorithm for continuous global optimization,” Journal of Optimization Theory and Applications, vol. 104, no. 3, pp. 691–716, 2000.
[15] D. Nam and C. H. Park, “Pareto-based cost simulated annealing for multiobjective optimization,” Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning, Singapore, 2002, vol. 2, pp. 522–526.
[16] H. Szu and R. Hartley, “Fast simulated annealing,” Physics Letters A, vol. 122, no. 3–4, pp. 157–162, 1987.
[17] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, “Equation of state calculation by fast computing machines,” Journal of Chemical Physics, vol. 21, pp. 1087–1092, 1953.
[18] E. Zitzler, M. Laumanns, and S. Bleuler, “A tutorial on evolutionary multiobjective optimization,” Metaheuristics for Multiobjective Optimisation, Lecture Notes in Economics and Mathematical Systems, X. Gandibleux, M. Sevaux, K. Sörensen, and V. T’kindt, Eds. Berlin: Springer-Verlag, 2004, vol. 535, pp. 3–37.
[19] D. Nam, J.-S. Lee, and C. H. Park, “n-dimensional Cauchy neighbor generation for the fast simulated annealing,” IEICE Transactions on Information and Systems, vol. E87-D, no. 11, pp. 2499–2502, 2004.