
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 40, NO. 4, AUGUST 2010

Correspondence

Hybrid Simulated Annealing and Its Application to Optimization of Hidden Markov Models for Visual Speech Recognition

Jong-Seok Lee and Cheol Hoon Park, Senior Member, IEEE

Abstract—We propose a novel stochastic optimization algorithm, hybrid simulated annealing (SA), to train hidden Markov models (HMMs) for visual speech recognition. In our algorithm, SA is combined with a local optimization operator that substitutes a better solution for the current one to improve the convergence speed and the quality of solutions. We mathematically prove that the sequence of the objective values converges in probability to the global optimum in the algorithm. The algorithm is applied to train HMMs that are used as visual speech recognizers. While the popular training method of HMMs, the expectation–maximization algorithm, achieves only local optima in the parameter space, the proposed method can perform global optimization of the parameters of HMMs and thereby obtain solutions yielding improved recognition performance. The superiority of the proposed algorithm to the conventional ones is demonstrated via isolated word recognition experiments.

Index Terms—Global optimization, hidden Markov model (HMM), hybrid simulated annealing (HSA), visual speech recognition.

Manuscript received April 27, 2007; revised February 3, 2008; accepted April 2, 2008. Date of publication January 8, 2010; date of current version July 16, 2010. This work was supported in part by Grant R01-2003-00010829-0 from the Basic Research Program of the Korea Science and Engineering Foundation and by the Brain Korea 21 Project from the School of Information Technology, KAIST. This paper was recommended by Associate Editor S. Hu.
J.-S. Lee was with the School of Electrical Engineering and Computer Science, KAIST, Daejeon 305-701, Korea. He is now with the Institute of Electrical Engineering, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland (e-mail: [email protected]).
C. H. Park is with the School of Electrical Engineering and Computer Science, KAIST, Daejeon 305-701, Korea (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSMCB.2009.2036753

I. INTRODUCTION

Visual speech recognition (or lipreading) is to recognize speech by observing the movement of the speaker's lips. Although speech recognition using the visual signal shows rather lower recognition performance than acoustic speech recognition in low-noise environments, the visual signal is not affected by the acoustic noise that is inevitable in real-world applications and can thus be a powerful information source that compensates for the performance degradation of acoustic speech recognition in noisy conditions [1].

The hidden Markov model (HMM) is the dominant paradigm for the recognizer in visual speech recognition, as in acoustic speech recognition [2]. Before the HMMs are used for recognition, their parameters must be sufficiently trained with training data. The expectation–maximization (EM) algorithm is popularly used to train the parameters according to the maximum-likelihood (ML) objective [3]. Although the EM algorithm has successfully been used for training HMMs, its limitation is that it only achieves locally optimal solutions and may not provide the global optimum.

There have been efforts for global optimization of HMMs for speech recognition. Genetic algorithms (GAs), which are inspired by natural evolution and perform stochastic optimization, have been used for global optimization of HMMs [4].

A population of solutions for an HMM evolves by random operators such as crossover and mutation, and the solutions compete with each other to be selected by the "survival of the fittest" rule based on their likelihood values. The crossover operator drives regional optimization of the HMMs, whereas the mutation operator helps the algorithm jump out of such regions. In [5], an attempt to use simulated annealing (SA) for training HMMs was made. SA is a stochastic optimization algorithm that exploits an analogy between the cooling of a metal into its minimum-energy crystalline structure and the search for the global minimum of a general system [6]. Gaussian random numbers are generated and added to the parameters of the HMMs to perturb their values at each iteration. The perturbation is accepted with a nonzero probability even when the new solution is worse than the current one, which enables the algorithm to perform a global search over the HMM parameters. As the so-called temperature is gradually decreased by the annealing process, the system gets "frozen" and converges to a final state. In [7], another annealing-based approach, called deterministic annealing (DA), has been presented by adopting fundamental principles of statistical physics and information theory. The formulation of DA introduces randomness into the classification rule of the HMMs and minimizes their misclassification rate while controlling the level of randomness by annealing.

Although these methods have been shown to enhance recognition performance in comparison to the EM method, they have the following limitations. First, their optimization procedures have been designed heuristically, and thus we cannot theoretically guarantee the convergence of the algorithms; the ways of performing the mutation and crossover operations in the GA-based method and the annealing schedules of the SA and DA methods were determined manually, without consideration of the theoretical convergence properties of the algorithms. Second, they were developed only for the discrete HMM, while the continuous HMM is much more popular for automatic speech recognition. Optimizing continuous HMMs is more challenging than optimizing discrete HMMs because of the largely increased number of parameters to be trained.

In this paper, we propose a global optimization method for HMMs, hybrid SA (HSA), which is supported by mathematical convergence theorems and applicable to the training of continuous HMMs, and show its application to visual speech recognition. The proposed method basically adopts the SA algorithm. SA performs global optimization via iterative generation, evaluation, and selection of solutions under a controlled annealing schedule. Starting from an initial solution, we randomly generate a new solution from the current one by using a generating function and evaluate it at each iteration. The amount of the displacement of the new solution is determined by the temperature; a high temperature allows the solution to move by a large distance. The temperature also determines the acceptance probability of the new solution so that, when the temperature is high, the transition from the current solution to a new one with a worse objective value can occur with a relatively high probability. The temperature is gradually decreased by the annealing schedule. The major advantage of SA is its ability to escape from local optima.
Because the transition to a solution having a worse objective value than the current one can occur with a nonzero probability, the algorithm can still reach the global optimum even when local optima exist. Although SA is a good candidate for overcoming the limitation of EM, optimizing the parameters of HMMs by SA is not easy because the problem dimension is relatively high and the degree of epistasis between parameters is high. To obtain good solutions and reduce the convergence time for global optimization of the HMM parameters, we propose the HSA algorithm, in which SA is fused with a local optimization technique. The proposed algorithm combines the exploration capability of the underlying stochastic search by SA and the exploitation capability of the local search by the EM estimation so that the algorithm performs global optimization with improved accuracy and convergence speed.

We show that the proposed method overcomes the aforementioned shortcomings of the conventional algorithms for global optimization of HMMs. First, we mathematically prove the global convergence of the proposed algorithm. The presented convergence theorems state that both the objective value sequence and the best objective value sequence converge to the global optimum in probability. The convergence analysis of the proposed method provides a suitable, problem-independent combination of the optimization steps, such as the probability density function for generating solutions, the annealing schedule, and the selection scheme. This overcomes the limitation of the conventional methods that their optimization procedures were designed heuristically and thus may be problem-dependent and contain additional algorithm parameters to be determined manually. Moreover, the proposed method can be utilized for other parameter optimization problems without modification because we prove its convergence property for a general problem.¹ Second, while the conventional methods only dealt with optimization of discrete HMMs, we show that the proposed method can successfully be applied to optimization of continuous HMMs for visual speech recognition and produces HMMs showing improved recognition performance compared with the conventional EM algorithm. We also analyze the effect of global optimization of HMMs by the proposed method by inspecting the alignment of visual speech onto the HMM states and the confusion matrices.

¹Recently, the proposed algorithm has successfully been applied for parameter optimization of neural networks [8].

In the following section, we explain the proposed HSA algorithm for a general problem and mathematically prove its convergence property. Section III describes the visual speech recognition system. In Section IV, we present experimental results for isolated word recognition tasks. Finally, we conclude this paper in Section V.

II. HYBRID SA

This section describes the proposed HSA algorithm for solving a general optimization problem and its mathematical convergence property. A specific implementation of the algorithm for optimizing HMMs for visual speech recognition is explained in Section III.


A. Algorithm

Consider that we are to solve the following general optimization problem:

Minimize C(x),  x ∈ Ψ    (1)

where Ψ is the entire feasible space in R^n and C is the real-valued objective function (or cost function) whose global minimum is C*. Basically, our HSA algorithm utilizes the fast SA (FSA) algorithm, in which the Cauchy probability distribution is used for generating new solutions and the reciprocal schedule is used for annealing [9]. FSA is practically effective in real-world applications compared with the classical SA, whose inverse-logarithmic annealing schedule is too slow for practical use. The procedure of the algorithm for solving (1) is as follows.

Step 1: Initialization. Generate an initial solution x_0 ∈ Ψ. Set the temperature to its initial value.

Step 2: Generation. Generate a new solution y_t from the current one x_t by y_t = x_t + Δx_t, where t is the iteration index. The amount of change of the solution vector, Δx_t, is a random vector that follows the Cauchy probability distribution (also called the generating function) given by

g(Δx_t, T_t) = a_n T_t / (‖Δx_t‖² + T_t²)^{(n+1)/2}    (2)

where T_t is the temperature at iteration t and a_n is the normalizing constant. The method of generating a random vector following the Cauchy distribution can be found in [10].

Step 3: Local optimization. Mutate y_t further to produce z_t by using a local optimizer φ: z_t = φ(y_t). It must be guaranteed that the solution never gets worse by the local optimization, i.e.,

C(y_t) ≥ C(z_t).    (3)

Gradient-based optimization methods, greedy search methods, and, as in our method for visual speech recognition, the EM algorithm are examples of φ.

Step 4: Evaluation. Calculate the objective value of z_t, C(z_t).

Step 5: Selection. Select the solution of the next iteration, x_{t+1}, from the current and the new solutions by the Metropolis method [11]: the acceptance probability of z_t is given by

p_a(x_t, z_t, T_t) = min[1, exp{(C(x_t) − C(z_t)) / T_t}].    (4)

A uniform random number α between zero and one is generated, and if α ≤ p_a(x_t, z_t, T_t), z_t is selected; otherwise, x_t is selected. Thus, a z_t that is better than x_t is always accepted; if z_t is worse than x_t, it is accepted probabilistically. Because a solution worse than the current one has a chance of being selected, the algorithm can escape from local optima. The temperature T_t controls the acceptance probability: a high temperature enables a solution worse than the current one to be chosen with a relatively high probability, and, as T_t becomes small, the probability of accepting a worse solution becomes small, so that we obtain a converged solution at the final stage of the algorithm.

Step 6: Annealing. Decrease the temperature by the reciprocal cooling schedule given by

T_t = T_0 / t    (5)

where T_0 is the initial temperature.

Step 7: Termination. If the termination conditions are satisfied, stop. Otherwise, go to Step 2.
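For concreteness, the loop of Steps 1–7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names cost and local_opt stand for the user-supplied objective C and local optimizer φ, and the displacement is drawn with a simplified isotropic heavy-tailed sampler rather than the exact n-dimensional Cauchy generator of [10].

```python
import numpy as np

def hsa_minimize(cost, local_opt, x0, T0=1.0, max_iter=10000, seed=None):
    """Sketch of the hybrid simulated annealing loop (Steps 1-7).

    cost      : objective C(x) to be minimized
    local_opt : local optimizer phi, assumed never to worsen the solution
    x0        : initial solution (1-D numpy array)
    """
    rng = np.random.default_rng(seed)
    n = x0.size
    x, cx = x0.copy(), cost(x0)                  # Step 1: initialization
    best_x, best_c = x.copy(), cx
    for t in range(1, max_iter + 1):
        T = T0 / t                               # Step 6: reciprocal annealing, Eq. (5)
        # Step 2: heavy-tailed random displacement scaled by the temperature
        # (a stand-in for the n-dimensional Cauchy generating function, Eq. (2))
        direction = rng.standard_normal(n)
        direction /= np.linalg.norm(direction)
        y = x + T * np.abs(rng.standard_cauchy()) * direction
        z = local_opt(y)                         # Step 3: local optimization, C(z) <= C(y)
        cz = cost(z)                             # Step 4: evaluation
        # Step 5: Metropolis selection, Eq. (4)
        if cz <= cx or rng.random() <= np.exp((cx - cz) / T):
            x, cx = z, cz
        if cx < best_c:                          # keep the best-so-far solution (cf. Theorem 2)
            best_x, best_c = x.copy(), cx
    return best_x, best_c                        # Step 7: stop after max_iter iterations
```

In the HMM application of Section III, cost would be the negative log-likelihood objective of (15) and local_opt a few EM reestimation iterations.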

B. Convergence Proof

The convergence of the HSA algorithm presented in the previous section can be proved in a similar way to the argument in [12]. In this section, we state and prove two theorems. The first theorem shows that the objective value sequence of the HSA algorithm converges to the global optimum in probability. The second theorem analyzes the convergence of the best objective value sequence.

Theorem 1: For ε > 0 and ξ > 0, let

Ψ_ε = {x ∈ Ψ | C(x) < C* + ε}    (6)
Ω_{ε,ξ,t} = {x ∈ Ψ | C* + ε ≤ C(x) < C* + ε + 1/t^ξ}    (7)
Ω'_{ε,ξ,t} = {x ∈ Ψ | C* + ε ≤ C(φ(x)) < C* + ε + 1/t^ξ and C(x) ≥ C* + ε + 1/t^ξ}.    (8)


Assume that ζ(Ψ_ε) > 0 for every ε > 0, where ζ(·) is the Lebesgue measure on R^n, and that there exist constants ξ > 0 and R > 0 such that ζ(Ω_{ε,ξ,t}) + ζ(Ω'_{ε,ξ,t}) ≤ R/t^ξ for all t ≥ t_0. If we impose a lower bound σ_t on the random displacement of each dimension of the solution vector at iteration t, which is small enough and monotonically decreases with respect to t, then, for any initial solution x_0 ∈ Ψ, the objective value sequence {C(x_t), t ≥ 0} of HSA converges in probability to the global minimum C*.

Proof: See Appendix I. ∎

The combination of the generating function in (2) and the annealing schedule in (5) is crucial for obtaining the aforementioned convergence property of HSA. If the heaviness of the tails of the generating function and the cooling speed of the annealing schedule are not well balanced, we cannot obtain such a property; for example, if we use the reciprocal annealing schedule (5) together with the Gaussian distribution (which has shorter tails than the Cauchy distribution) as the generating function, the theorem above does not hold. In the existing annealing-based algorithms, unlike our method, the generating function and the annealing schedule were chosen heuristically, and thus no convergence proof was given for them. The SA method in [5] used the Gaussian distribution as the generating function and a cooling schedule that is initially linear and then exponential, determined experimentally. In the DA method in [7], an exponential annealing schedule given by T_t = T_0 η^t was adopted, where a suitable value of the constant η should be determined manually.

In Theorem 1, it is required that every component of the displacement of a new solution (Δx_t) is greater than or equal to the lower bound σ_t at each iteration. Thus, generation of Δx_t with (2) should be repeated until this lower bound condition is satisfied.

Next, we consider the convergence of the sequence of the best objective value in the algorithm. This is the case when the best objective value up to the current iteration and the corresponding solution are stored in an auxiliary memory.

Theorem 2: For ε > 0, let

Ψ_ε = {x ∈ Ψ | C(x) < C* + ε}    (9)
Ψ'_ε = {x ∈ Ψ | C(φ(x)) < C* + ε and C(x) ≥ C* + ε}.    (10)

Assume that ζ(Ψ_ε) > 0 and ζ(Ψ'_ε) > 0 for every ε > 0, where ζ(·) is the Lebesgue measure on R^n. Then, for any initial solution x_0 ∈ Ψ, the best objective value sequence {min_{0≤j≤t} C(x_j), t ≥ 0} of HSA converges in probability to the global minimum C*.

Proof: See Appendix II. ∎

When we are concerned only with the convergence of the best objective value sequence, we do not need the lower bound σ_t imposed on the displacement of the solution in the generation step.

III. VISUAL SPEECH RECOGNITION

Visual speech recognition is performed by the following procedure. When a person speaks, a camera records the movement of the lips. Then, salient and compact visual features are extracted from the recorded image sequence. Finally, the recognizer (a set of HMMs) performs recognition with the extracted features and determines what the speaker said. In the following sections, we explain the databases for the experiments in Section IV, the method of extracting visual speech features, and the HMMs for recognition.

A. Databases

We use two databases of isolated words: the digit database (DIGIT) and the city name database (CITY)² [13]. The former contains the digits from zero to nine (including two versions of zero) in Korean, and the latter the names of 16 famous Korean cities. These databases contain more significant amounts of data for speaker-independent recognition experiments than many previously reported databases [14]–[16]. Two existing audiovisual speech databases that are thought to be significant, namely, IBM AV-ViaVoice [17] and AV-TIMIT [18], are not publicly available.

²We are planning to make the databases public for academic purposes.

The DIGIT and CITY databases contain pronunciations of 56 speakers (37 males and 19 females). Each person pronounced each word three times. A digital video camera focused on the face region around the speakers' lips and captured the lips' movements at a rate of 30 Hz.

The recognition experiments are performed in a speaker-independent manner. To increase the reliability of the experiments, we use the jackknife test method. We divide the data of the 56 speakers into four parts so that each part contains the data of 14 speakers. Then, we train the recognizer with the data of three parts (42 speakers) and run the recognition test with the data of the remaining part (14 speakers). We repeat this procedure four times with different combinations so that the whole database is used for testing.

B. Feature Extraction

For good visual speech recognition performance, we must obtain appropriate visual speech features from the recorded image sequence. The features must contain crucial information that can discriminate between the utterance classes and, at the same time, be common across speakers having different skin and lip colors and invariant to environmental changes such as illumination. There are two major approaches for extracting visual speech features from images: the contour-based approach and the pixel-based approach [1]. The former concentrates on identifying the lip contours; after the lip contours are tracked in the image sequences, the features are defined by geometric measures such as the height or the width of the mouth [19] or the parameters of a model describing the contours [20]. In the latter approach, the image containing the mouth region is used after image transformations [21]. It is known that the contour-based approach has disadvantages such as degradation of the recognition accuracy due to errors in tracking the lip contours and loss of important information describing the characteristics of the oral cavity and the protrusion of the lips [22]. Thus, our visual speech recognition system adopts the pixel-based approach.

Fig. 1. Procedure of extracting visual features.

Fig. 1 shows the overall procedure of extracting features in our system. We carefully design the method of extracting the lip area and define an effective representation of the visual features derived from the extracted images of the mouth region [13]. First, we remove the brightness variation across the left and right parts of the images. Then, the pixel values of the images are normalized so that the pixel values of all incoming images have a similar distribution characteristic, which reduces the effect of different illumination conditions during different recording sessions and of skin color differences across speakers. Next, the two mouth corners are detected by the bilevel thresholding method, and the lip regions are cropped based on the found corners. In the following mean subtraction, the mean value over an utterance is subtracted for each pixel point to remove unwanted constant variations across utterances. Finally, we apply principal component analysis (PCA) to find the main linear modes of variation in the images and reduce the feature dimension. If we let h be a column vector of the pixel values of a mean-subtracted lip region image and h̄ the mean of h over all training images, then the static feature vector f after PCA is given by

f = U^T (h − h̄)    (11)

where the columns of the matrix U are the eigenvectors corresponding to the few largest eigenvalues of the covariance matrix of the training images. We found that the coefficients for the first 12 principal components of the pixel values are adequate to perform recognition.

Fig. 2. Four principal modes of variations in the lip region images.

In Fig. 2, we show the mean lip region image and the four most significant principal modes of intensity variation by ±2 standard deviations for the training data of the DIGIT database. We can see that each mode explains distinct variations occurring in the lip region images. The first mode mainly accounts for the mouth opening. In the second mode, the protrusion of the lower lip and the visibility of the teeth are shown. The third mode represents the protrusion of the upper lip and the changes of the shadow under the lower lip. The fourth mode largely describes the visibility of the teeth according to the mouth opening.

We also use the delta features defined by the temporal derivatives of the static features obtained from PCA [2]:

Δf_k = ( Σ_{i=−τ}^{τ} i f_{k+i} ) / ( Σ_{i=−τ}^{τ} i² )    (12)

where f_k is the static feature vector at the kth image frame and τ denotes the frame interval for the delta features. We use τ = 2, which was found to be an appropriate choice for good recognition performance. Thus, in total, a 24-dimensional feature vector is extracted for each frame.
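The PCA projection of (11) and the delta features of (12) can be sketched as follows. This is an illustrative sketch only: U (the leading eigenvectors), h_mean, and the frame matrix are assumed to come from the preprocessing steps described above, and the edge padding at the sequence boundaries is an assumption not specified in the text.

```python
import numpy as np

def static_features(frames, U, h_mean):
    """Eq. (11): project mean-subtracted lip-region frames onto the leading
    principal components.  frames: (K, P) pixel vectors, U: (P, 12)."""
    return (frames - h_mean) @ U                  # (K, 12) static features f_k

def delta_features(f, tau=2):
    """Eq. (12): temporal derivatives of the static features with frame
    interval tau (tau = 2 in the paper)."""
    K, _ = f.shape
    denom = sum(i * i for i in range(-tau, tau + 1))
    # repeat the first/last frame at the edges (assumption)
    padded = np.vstack([np.repeat(f[:1], tau, axis=0), f,
                        np.repeat(f[-1:], tau, axis=0)])
    weights = np.arange(-tau, tau + 1).reshape(-1, 1)
    delta = np.zeros_like(f)
    for k in range(K):
        window = padded[k:k + 2 * tau + 1]        # frames k-tau .. k+tau
        delta[k] = (weights * window).sum(axis=0) / denom
    return delta

# f = static_features(frames, U, h_mean)
# o = np.hstack([f, delta_features(f)])           # (K, 24) observation vectors
```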

C. Recognition With HMMs

An N-state HMM λ = (Π, A, B) is characterized by the initial state probability distribution Π = {π_i}_{i=1}^{N}, the state transition probability distribution A = {a_ij}_{i,j=1}^{N}, and the observation probability distribution B = {b_i}_{i=1}^{N}. We use the continuous HMM, which is the most popular type of HMM for speech recognition, where the observation probability distribution is given by the Gaussian mixture model (GMM):

b_i(o) = Σ_{m=1}^{M} c_im N(o; μ_im, Σ_im)    (13)

where M is the number of Gaussian functions in a state, o is an observation vector, c_im is the mixture coefficient for the mth mixture in state i, and N(·; μ_im, Σ_im) is a Gaussian density function with mean vector μ_im and covariance matrix Σ_im. In speech recognition, diagonal covariance matrices are usually used to reduce the number of free parameters.

During the training phase, the features of each speech class are used for training the parameters of the corresponding HMM. The most popular training criterion is ML. The objective of the ML estimation is to maximize the sum of the log-likelihood values over all training data:

L(λ) = Σ_{all O} log P(O | λ) = Σ_{all O} log Σ_{all q} π_{q_1} b_{q_1}(o_1) Π_{k=2}^{K} a_{q_{k−1} q_k} b_{q_k}(o_k)    (14)

where O = [o_1, o_2, ..., o_K] is a training observation sequence (i.e., o_k = [f_k^T, Δf_k^T]^T) and q = [q_1, q_2, ..., q_K] is a hidden state sequence for O. The forward–backward algorithm is used to calculate L(λ) [2]. Conventionally, the solution of this criterion is obtained by the well-known EM method [3]. As we repeatedly apply the reestimation formulas of the method, the likelihood monotonically increases until a local optimum is achieved.

In this paper, we use the HSA algorithm developed in Section II for global optimization of (14). The collection of the parameters of an HMM, i.e., {π_i}, {a_ij}, {c_im}, {μ_im}, and {Σ_im}, forms the n-dimensional solution vector in HSA. Because the ML objective is a maximization while the problem stated in (1) is a minimization, we attach a minus sign to the ML objective, i.e.,

C(λ) = −L(λ).    (15)
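For reference, a minimal sketch of evaluating (13)–(14) with the forward algorithm in the log domain is given below; diagonal covariances are assumed, as in the paper, and all parameter names are illustrative.

```python
import numpy as np

def log_gmm(o, c, mu, var):
    """Eq. (13): log b_i(o) for one state with M diagonal-covariance Gaussians.
    c: (M,), mu, var: (M, D), o: (D,)."""
    d = o.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(var).sum(axis=1))
    log_comp = log_norm - 0.5 * (((o - mu) ** 2) / var).sum(axis=1)
    return np.logaddexp.reduce(np.log(c) + log_comp)

def log_likelihood(O, pi, A, c, mu, var):
    """Eq. (14) for one observation sequence O (K, D), computed with the
    forward recursion.  pi: (N,), A: (N, N), c: (N, M), mu/var: (N, M, D)."""
    K, _ = O.shape
    N = pi.shape[0]
    with np.errstate(divide="ignore"):            # log(0) -> -inf for structural zeros
        log_pi, log_A = np.log(pi), np.log(A)
    logb = np.array([[log_gmm(O[k], c[i], mu[i], var[i]) for i in range(N)]
                     for k in range(K)])          # (K, N) log emission probabilities
    log_alpha = log_pi + logb[0]
    for k in range(1, K):
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0) + logb[k]
    return np.logaddexp.reduce(log_alpha)         # log P(O | lambda)

# HSA objective of Eq. (15): C(lambda) = -sum of log P(O | lambda) over training data
```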

We start the optimization process of HSA with a randomly generated initial solution. A new solution is produced by the steps of generation and local optimization. When the random displacement of the solution vector (Δx_t) is generated according to the Cauchy generating function (2), its lower bound (σ_t) is determined by (23) and can be set to a very small value by using a very small σ_0, e.g., σ_0 = 10^−10, so that the generation of a new solution at each iteration is not repeated many times and does not take much time. After the generation step, the new values of π_i, a_ij, and c_im are renormalized so that the row sums of the probability matrices are equal to one. We use a few iterations of the EM reestimation for the local optimization, which satisfies condition (3) by the monotonicity property of the EM algorithm. Next, we calculate the objective value of the newly generated solution by (15) and choose the solution for the next iteration between the current and new solutions according to (4). Then, the temperature is decreased by (5). These steps are repeated until the maximum iteration number is reached.

During the recognition phase, a feature sequence whose class is unknown is input to the trained HMMs of all classes, and the HMM showing the maximum probability of the data determines the winning class.
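A sketch of the renormalization performed after the generation step (so that π, each row of A, and the mixture weights remain valid probability distributions and the variances stay positive) might look as follows. The variance floor is an assumption, not a value taken from the paper, and rows that become entirely nonpositive after the perturbation are not handled here.

```python
import numpy as np

def renormalize(pi, A, c, var, var_floor=1e-6):
    """Project perturbed HMM parameters back onto valid ranges."""
    pi = np.clip(pi, 0.0, None); pi = pi / pi.sum()
    A = np.clip(A, 0.0, None);   A = A / A.sum(axis=1, keepdims=True)   # rows sum to one
    c = np.clip(c, 0.0, None);   c = c / c.sum(axis=1, keepdims=True)   # mixture weights
    var = np.maximum(var, var_floor)             # keep diagonal covariances positive (assumption)
    return pi, A, c, var
```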


IV. EXPERIMENTAL RESULTS

Visual speech recognition results obtained by applying the proposed method are reported in this section. Before showing the results, we demonstrate the global optimization capability of HSA via an experiment of optimizing an HMM with data generated from an artificial HMM.

A. Experiment on Synthetic Data

We make a left-to-right HMM having three states and a Gaussian function in each state. The initial state probability is set to Π = (1, 0, 0), and the transition probability is set to

A = [ 0.5  0.5  0
      0    0.9  0.1
      0    0    1   ].

The centers and the variances of the univariate Gaussian functions in the states are μ = (0, 1, −1) and Σ = (1, 0.5, 2), respectively. We randomly generate 100 observations of 20 frames from the HMM. The task is to train an HMM by EM or HSA with the generated data and identify the parameters of the original HMM. When we initialize the HMM in the EM method, the training data are linearly segmented onto the HMM states, and then the segmental k-means algorithm is iteratively applied [2]. For HSA, the initial temperature is set to one, and the maximum number of iterations is set to 1000 for the termination condition. Moreover, one iteration of the EM algorithm is used for local optimization in HSA.

The parameters obtained by EM are

Â = [ 0.91  0.09  0
      0     0.01  0.99
      0     0     1    ]

μ̂ = (0.73, 1.18, −0.98), and Σ̂ = (0.81, 0.24, 2.06), which are quite different from the original model. On the other hand, HSA produced

Â = [ 0.52  0.48  0
      0     0.89  0.11
      0     0     1    ]

μ̂ = (−0.13, 1.00, −0.97), and Σ̂ = (0.92, 0.47, 2.06), which are nearly the same as those of the original model. The sum of the log-likelihood values of the data under the original HMM was −3115, and those under the HMMs trained by EM and HSA were −3227 and −3111, respectively, which indicates that HSA succeeded in finding the global optimum while EM did not.

Fig. 3. Comparison of the objective values of the HMMs trained by EM and HSA for the DIGIT database.

Fig. 4. Comparison of the objective values of the HMMs trained by EM and HSA for the CITY database.

B. Visual Speech Recognition Experiment

1) Setup: We use the whole-word HMM, which is a standard approach in small-vocabulary speech recognition systems. The number of states in each HMM is set to be proportional to the number of visemic units of the corresponding word. We use a GMM having three Gaussian functions in each state, which is the best configuration in our experiments. Again, the HMMs are initialized by linear segmentation, followed by iterative application of the segmental k-means algorithm, when EM is used for their training. When we use HSA for training the HMMs, we set T_0 to ten, and a maximum iteration number of 10000 is used for the termination condition. Then, the final temperature becomes 10^−3, which is sufficiently small. As the local optimization operator, we use five iterations of EM. Using fewer EM iterations leads to slow convergence of HSA, while using more iterations causes a longer running time because the computational complexity of the EM iterations is dominant in HSA.
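As a usage illustration tying the earlier sketches together, this setup (T_0 = 10, 10000 HSA iterations, five EM reestimation iterations as the local optimizer) could be wired up roughly as follows. The helpers pack, unpack, and em_reestimate are hypothetical placeholders (the EM reestimation formulas are not shown), and hsa_minimize, log_likelihood, and renormalize refer to the illustrative sketches given earlier; folding the renormalization into the local optimizer is a simplification of the procedure described in Section III-C.

```python
def train_word_hmm(train_sequences, init_params):
    """Rough wiring of the HSA training configuration for one word HMM."""
    def cost(x):
        pi, A, c, mu, var = unpack(x)             # hypothetical (un)packing helpers
        return -sum(log_likelihood(O, pi, A, c, mu, var) for O in train_sequences)

    def local_opt(x):
        pi, A, c, mu, var = unpack(x)
        pi, A, c, var = renormalize(pi, A, c, var)    # after the random perturbation
        params = (pi, A, c, mu, var)
        for _ in range(5):                        # five EM iterations; condition (3) holds
            params = em_reestimate(params, train_sequences)   # reestimation not shown
        return pack(*params)

    best, _ = hsa_minimize(cost, local_opt, pack(*init_params),
                           T0=10.0, max_iter=10000)
    return unpack(best)
```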

Fig. 5. Comparison of the alignments of the visual speech, “7” (/chil/), onto the states (S1 , S2 , and S3 ) of the HMMs trained by EM and HSA.

2) Comparison of Performance: Figs. 3 and 4 compare the final values of the objective function in (15) for the training data of each class from the HMMs trained by EM and HSA for the two databases, respectively. We can see that the HMMs trained by HSA always show smaller objective values (larger likelihood values) than those trained by EM, owing to its global optimization capability. This result shows that HSA produces HMMs that model the visual speech better than EM.

Fig. 5 shows an example of the alignments of the visual speech, "7" (/chil/), onto the states of the HMMs trained by EM and HSA. The alignments are obtained by the Viterbi decoding algorithm. It can be seen that the two training algorithms result in different alignments of the utterance. An important difference between the two cases is the alignment of the second frame. While the second frame is aligned on the second state of the HMM trained by EM, along with the third frame, the HMM trained by HSA aligns it on the first state with the first frame. Because the second frame corresponds to silence like the first frame, whereas the third frame corresponds to the mouth opening, the alignment by HSA is more appropriate than that by EM. As a result, the log-likelihood of the datum by HSA is −999, which is larger than that by EM (−1055). The log-likelihoods for the competing class, "1" (/il/), were −1031 and −1052 by EM and HSA, respectively, and thus HSA gives the correct recognition result while EM does not.

Next, we compare the recognition performance of the HMMs trained by EM and HSA. We also include two existing discriminative training algorithms for comparison: the minimum classification error (MCE) approach [23] and the maximum mutual information (MMI) approach [24], [25]. In these approaches, objective functions related to the classification error are designed instead of the ML objective, and the parameters of the HMMs are adjusted according to such objective functions. Although their training objectives are different from those of EM and HSA, comparing their recognition performance with that of HSA can demonstrate how much our method improves the recognition performance in comparison with existing algorithms that aim at a higher accuracy than EM. In the MCE method, a smooth function approximating the classification error rate is defined and minimized by iteratively applying the updating formulas of the generalized probabilistic descent algorithm. The MMI method tries to maximize the mutual information between the training data and the HMMs of the corresponding classes. An estimation algorithm of HMMs for the MMI objective, which is similar to EM, was presented in [25] and is used in our comparison. The MMI and MCE methods have a few algorithm parameters, which are carefully determined via several experiments.

TABLE I. Error rates (in percent) by the conventional and proposed methods. Relative error reductions (in percent) over EM are shown in the parentheses.

In Table I, we compare the recognition performance of the HMMs trained by each method. As shown in the table, the proposed HSA algorithm shows the best performance for both databases. First, we see that HSA shows better recognition performance than EM. The relative error reductions by HSA over EM are 7.5% and 2.7% for the two databases, respectively. This superiority comes from the better objective values after training by HSA than by EM, as shown in Figs. 3 and 4. Moreover, the performance of HSA is superior to those of MCE and MMI. MCE and MMI produce HMMs showing improved performance over EM by the use of discriminative training objectives. However, they also use training algorithms that perform only local optimization and may thus be trapped in local optima, depending on the initial parameters of the HMMs.

TABLE II. Confusion matrix obtained by the EM training for the DIGIT database. The values are the percentages of each word on the vertical axis recognized as the words on the horizontal axis. "0" and "0′" are the two versions of zero.

TABLE III. Confusion matrix obtained by the HSA training for the DIGIT database. The values are the percentages of each word on the vertical axis recognized as the words on the horizontal axis. "0" and "0′" are the two versions of zero.

To analyze the performance improvement by HSA over EM in detail, the confusion matrices for the recognition results by EM and HSA are shown in Tables II and III, respectively. The words whose accuracies are increased the most by HSA are "7" and "9" (8%). The increased accuracy of "7" (/chil/) in HSA is obtained mostly by the decreased misclassification rate of the word as "1" (/il/). In Fig. 5, we have already shown a case where more accurate modeling by HSA than by EM prevents the misclassification of "7" as "1." Also, HSA increases the recognition accuracy of "9" (/gu/) by reducing its misclassification as "6" (/yuk/), a word that is visually confusable with it. Similar results are observed in the cases of "5" (/o/) and "0′" (/yeong/), where the accuracies are increased by 6% by HSA compared to EM; the misclassifications of "5" as "0" (/gong/) and of "0′" as "4" (/sa/) are suppressed in the case of HSA. These observations confirm that HSA produces HMMs modeling the speech data of the corresponding classes more accurately than EM through global optimization and thereby enhances discriminability between visually confusable utterances.

3) Computational Complexity: The superiority of HSA over the conventional methods is obtained at the expense of additional computational complexity. Theoretically, training an HMM by applying I_EM iterations of the EM algorithm has a computational complexity of order O(N M K_total D I_EM), where K_total is the total frame length of the training data and D is the dimension of the feature vectors [26]. In HSA, the local optimization step by EM is computationally dominant, and thus the complexity of HSA can be written as O(N M K_total D I_local I_HSA), where I_HSA and I_local are the numbers of iterations of HSA and of EM for local optimization, respectively. Because I_EM ≪ I_HSA in our experiments, it is apparent that HSA has a much higher computational complexity than EM.

TABLE IV. Computation times in seconds for training HMMs by HSA and the conventional methods. For HSA, the average training times per HMM are also shown in the parentheses.

Table IV shows the actual computation times of each method in seconds in our experiments. All results in the table are from experiments performed on PCs with 2.4-GHz CPUs and 1 GB of RAM under the Linux operating system. We implemented the HSA training of HMMs in a parallel way by using multiple computers to reduce the computation time, because each HMM for each word is trained independently. In the table, the sums of the training times for all words are shown, with the average times for each word in the parentheses. We observe that HSA requires much more time than the other methods. Even if we consider the parallel HSA for comparison, its time complexity is still much higher than those of the others. However, this complexity is necessary only for the training process, which can be done in advance; the recognition process is exactly the same regardless of the training method.

V. CONCLUSION

We have proposed a new algorithm for training HMM parameters for visual speech recognition. We have developed the HSA algorithm by fusing SA with EM for fast convergence and improved solution quality. It has been proven that the sequences of the objective value and the best objective value converge in probability to the global optimum by HSA. The experimental results have shown that HSA can successfully be applied to the optimization of continuous HMMs and is superior to the conventional algorithms in terms of likelihoods and error rates, at the cost of additional computational complexity.

As a way of reducing the computational complexity of the proposed method, we are considering parallelization of the method with more processors. For this, we can consider adopting a parallel implementation of the EM algorithm [27] or devising a variant of HSA with a population of solutions that evolves simultaneously in parallel [28]. Also, further study of applying the method to connected-word or continuous speech recognition tasks is in progress. Finally, it would be desirable to analyze the finite-time behavior of HSA based on the proved global convergence property.

APPENDIX I
PROOF OF THEOREM 1

We will show that, for any ε > 0 and δ > 0, and initial solution x_0 ∈ Ψ, there exists a positive integer I such that

P(x_t ∈ Ψ \ Ψ_ε) < δ,   ∀t ≥ I.    (16)

Let

Θ_{ε,ξ,t} = {x ∈ Ψ | C(x) ≥ C* + ε + 1/t^ξ}    (17)
Ψ'_ε = {x ∈ Ψ | C(φ(x)) < C* + ε and C(x) ≥ C* + ε}.    (18)

Let A(t, u) denote the event that there is at least one transition from Ψ_ε to Ω_{ε,ξ,t+j−1} (1 ≤ j ≤ u) during the iterations between t and t + u, B(t, u) be the event that there is at least one transition from Ψ_ε to Θ_{ε,ξ,t+j−1} (1 ≤ j ≤ u) during the iterations between t and t + u, C(t, u) be the event that the solution is never in Ψ_ε during the iterations between t and t + u, and D be the event that x_{t+u} ∈ Ψ \ Ψ_ε. Because D ⊂ A ∪ B ∪ C, we have

P(x_{t+u} ∈ Ψ \ Ψ_ε) = P(D) ≤ P(A) + P(B) + P(C).    (19)

First, we will show that there exists a positive integer t_1 such that

P(A(t, u)) < δ/3,   ∀t ≥ t_1, ∀u > 0.    (20)

Let A_j(t, u) be the event that x_{t+j−1} ∈ Ψ_ε and x_{t+j} ∈ Ω_{ε,ξ,t+j−1}. Then, we have

A(t, u) ⊂ ∪_{j=1}^{u} A_j(t, u)    (21)

which yields

P(A(t, u)) ≤ Σ_{j=1}^{u} P(A_j(t, u)).    (22)

If we set

σ_t = σ_0 t^{−ξ/(2(n+1))}    (23)

where 0 < ξ < 1, we can write

P(A_j(t, u)) = ∫_{Ω_{ε,ξ,t+j−1}\Ψ'_ε} g(y − x_{t+j−1}, T_{t+j−1}) p_a(x_{t+j−1}, φ(y), T_{t+j−1}) dy
               + ∫_{Ω'_{ε,ξ,t+j−1}} g(y − x_{t+j−1}, T_{t+j−1}) p_a(x_{t+j−1}, φ(y), T_{t+j−1}) dy
  ≤ ∫_{Ω_{ε,ξ,t+j−1}} g(y − x_{t+j−1}, T_{t+j−1}) dy + ∫_{Ω'_{ε,ξ,t+j−1}} g(y − x_{t+j−1}, T_{t+j−1}) dy
  ≤ ζ(Ω_{ε,ξ,t+j−1}) max_{x∈Ψ_ε, y∈Ω_{ε,ξ,t+j−1}, |y^i−x^i|≥σ_{t+j−1}} g(y − x, T_{t+j−1})
    + ζ(Ω'_{ε,ξ,t+j−1}) max_{x∈Ψ_ε, y∈Ω'_{ε,ξ,t+j−1}, |y^i−x^i|≥σ_{t+j−1}} g(y − x, T_{t+j−1})
  ≤ [ζ(Ω_{ε,ξ,t+j−1}) + ζ(Ω'_{ε,ξ,t+j−1})] max_{x,y∈Ψ, |y^i−x^i|≥σ_{t+j−1}} g(y − x, T_{t+j−1})
  ≤ [ζ(Ω_{ε,ξ,t+j−1}) + ζ(Ω'_{ε,ξ,t+j−1})] max_{x,y∈Ψ, |y^i−x^i|≥σ_{t+j−1}} a_n T_{t+j−1} / (‖y − x‖²)^{(n+1)/2}
  ≤ [ζ(Ω_{ε,ξ,t+j−1}) + ζ(Ω'_{ε,ξ,t+j−1})] × a_n T_{t+j−1} / (n σ²_{t+j−1})^{(n+1)/2}
  = [ζ(Ω_{ε,ξ,t+j−1}) + ζ(Ω'_{ε,ξ,t+j−1})] × [a_n T_0 / (n σ_0²)^{(n+1)/2}] × 1/(t + j − 1)^{1−ξ/2}
  ≤ R M_1 / (t + j − 1)^{β+ξ}    (24)

where we set M_1 = a_n T_0 / (n σ_0²)^{(n+1)/2} and β = 1 − ξ/2. From (22) and (24),

P(A(t, u)) ≤ R M_1 Σ_{j=t}^{t+u−1} 1/j^{β+ξ}.    (25)

Because β + ξ > 1, Σ_{j=N_1}^{∞} 1/j^{β+ξ} < ∞ for all N_1 > 0. Therefore, there exists an integer t_1 > N_1 such that, for all t ≥ t_1 and u > 0,

Σ_{j=t}^{t+u−1} 1/j^{β+ξ} ≤ Σ_{j=t_1}^{∞} 1/j^{β+ξ} < δ/(3 R M_1)    (26)

which yields (20).

Next, we will show that there exists a positive integer t_2 such that

P(B(t, u)) < δ/3,   ∀t ≥ t_2, ∀u > 0.    (27)

Let B_j(t, u) be the event that x_{t+j−1} ∈ Ψ_ε and x_{t+j} ∈ Θ_{ε,ξ,t+j−1}. Then

P(B(t, u)) ≤ Σ_{j=1}^{u} P(B_j(t, u)).    (28)

For all 1 ≤ j ≤ u and x_{t+j−1} ∈ Ψ_ε,

P(B_j(t, u)) = ∫_{Θ_{ε,ξ,t+j−1}\Ω'_{ε,ξ,t+j−1}\Ψ'_ε} g(y − x_{t+j−1}, T_{t+j−1}) exp{[C(x_{t+j−1}) − C(φ(y))] / T_{t+j−1}} dy.    (29)

Because the maximum objective value over Ψ_ε and the minimum objective value over Θ_{ε,ξ,t+j−1} are C* + ε and C* + ε + 1/(t + j − 1)^ξ, respectively, C(x_{t+j−1}) − C(φ(y)) ≤ −1/(t + j − 1)^ξ. Thus,

P(B_j(t, u)) ≤ exp{−1/[(t + j − 1)^ξ T_{t+j−1}]} ∫_{Θ_{ε,ξ,t+j−1}\Ω'_{ε,ξ,t+j−1}\Ψ'_ε} g(y − x_{t+j−1}, T_{t+j−1}) dy
  ≤ exp{−1/[(t + j − 1)^ξ T_{t+j−1}]}.    (30)

Hence,

P(B(t, u)) ≤ Σ_{j=1}^{u} exp{−1/[(t + j − 1)^ξ T_{t+j−1}]} = Σ_{j=t}^{t+u−1} exp(−1/(j^ξ T_j)).    (31)

Because Σ_{j=1}^{∞} exp(−1/(j^ξ T_j)) = Σ_{j=1}^{∞} exp(−1/(j^{ξ−1} T_0)) < ∞, there exists an integer t_2 such that, for all t ≥ t_2 and u > 0,

Σ_{j=t}^{t+u−1} exp(−1/(j^ξ T_j)) ≤ Σ_{j=t_2}^{∞} exp(−1/(j^ξ T_j)) < δ/3    (32)

which yields (27). If we set t_0 = max{t_1, t_2}, it follows from (20) and (27) that

P(A(t_0, u)) + P(B(t_0, u)) < 2δ/3,   ∀u > 0.    (33)

Finally, we will show that there exists a positive integer u_0 such that

P(C(t_0, u)) < δ/3,   ∀u > u_0.    (34)

We have

P(C(t_0, u)) = P(x_{t_0} ∈ Ψ\Ψ_ε | x_{t_0−1} ∈ Ψ) P(x_{t_0+1} ∈ Ψ\Ψ_ε | x_{t_0} ∈ Ψ\Ψ_ε) ⋯ P(x_{t_0+u} ∈ Ψ\Ψ_ε | x_{t_0+u−1} ∈ Ψ\Ψ_ε)
  ≤ Π_{j=t_0}^{t_0+u−1} max_{x_j∈Ψ\Ψ_ε} P(x_{j+1} ∈ Ψ\Ψ_ε | x_j).    (35)

Because we can obtain

P(x_{t+1} ∈ Ψ\Ψ_ε | x_t) ≤ 1 − P(y_t ∈ Ψ_ε | x_t)    (36)

(35) becomes

P(C(t_0, u)) ≤ Π_{j=t_0}^{t_0+u−1} max_{x_j∈Ψ\Ψ_ε} {1 − P(y_j ∈ Ψ_ε | x_j)}
  = exp[ log Π_{j=t_0}^{t_0+u−1} {1 − min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j)} ]
  ≤ exp[ −Σ_{j=t_0}^{t_0+u−1} min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j) ]    (37)

where the last line comes from the fact that log(1 − x) ≤ −x for all 0 ≤ x < 1. Here, with v_i = max_{x,y∈Ψ} |x^i − y^i|, 1 ≤ i ≤ n,

min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j) = min_{x_j∈Ψ\Ψ_ε} ∫_{Ψ_ε} g(y − x_j, T_j) dy
  ≥ ζ(Ψ_ε) min_{x∈Ψ\Ψ_ε, y∈Ψ_ε, |y^i−x^i|≥σ_j} g(y − x, T_j)
  ≥ ζ(Ψ_ε) min_{x,y∈Ψ, |y^i−x^i|≥σ_j} g(y − x, T_j)
  ≥ ζ(Ψ_ε) × (a_n T_0/j) / (Σ_{i=1}^{n} v_i² + T_0²)^{(n+1)/2}
  = ζ(Ψ_ε) M_2 / j    (38)

where we set M_2 = a_n T_0 / (Σ_{i=1}^{n} v_i² + T_0²)^{(n+1)/2}. It follows that

Σ_{j=t_0}^{t_0+u−1} min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j) ≥ ζ(Ψ_ε) M_2 Σ_{j=t_0}^{t_0+u−1} 1/j.    (39)

Because Σ_{j=t_0}^{∞} 1/j = ∞, there exists an integer u_0 such that

Σ_{j=t_0}^{t_0+u−1} min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j) > log(3/δ),   ∀u > u_0    (40)

which yields

P(C(t_0, u)) < exp{−log(3/δ)} = δ/3.    (41)

From (19), (33), and (41), for all u ≥ u_0,

P(x_{t_0+u} ∈ Ψ\Ψ_ε) ≤ P(A(t_0, u)) + P(B(t_0, u)) + P(C(t_0, u)) < δ.    (42)

Therefore, (16) is satisfied with I = t_0 + u_0. ∎

APPENDIX II
PROOF OF THEOREM 2

Let F_t = min_{0≤j≤t} C(x_j). To prove the theorem, we will show that, for any ε > 0 and initial solution x_0 ∈ Ψ\Ψ_ε,

lim_{t→∞} P(F_t − C* ≥ ε) = 0.    (43)

We have

P(F_t − C* ≥ ε) = P(x_1 ∈ Ψ\Ψ_ε | x_0 ∈ Ψ\Ψ_ε) P(x_2 ∈ Ψ\Ψ_ε | x_1 ∈ Ψ\Ψ_ε) ⋯ P(x_t ∈ Ψ\Ψ_ε | x_{t−1} ∈ Ψ\Ψ_ε)
  ≤ Π_{j=0}^{t−1} max_{x_j∈Ψ\Ψ_ε} P(x_{j+1} ∈ Ψ\Ψ_ε | x_j).    (44)

Therefore, in a similar way to (37),

lim_{t→∞} P(F_t − C* ≥ ε) ≤ lim_{t→∞} Π_{j=0}^{t−1} max_{x_j∈Ψ\Ψ_ε} {1 − P(x_{j+1} ∈ Ψ_ε | x_j)}
  ≤ exp[ −Σ_{j=0}^{∞} min_{x_j∈Ψ\Ψ_ε} P(x_{j+1} ∈ Ψ_ε | x_j) ].    (45)

Here, we have

min_{x_j∈Ψ\Ψ_ε} P(x_{j+1} ∈ Ψ_ε | x_j) ≥ min{ min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j), min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ'_ε | x_j) }.    (46)

We can write each term in (46) as

min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ_ε | x_j) = min_{x_j∈Ψ\Ψ_ε} ∫_{Ψ_ε} g(y − x_j, T_j) dy
  ≥ ζ(Ψ_ε) min_{x_j∈Ψ\Ψ_ε, y∈Ψ_ε} g(y − x_j, T_j)
  ≥ ζ(Ψ_ε) min_{x_j, y∈Ψ} g(y − x_j, T_j)
  ≥ ζ(Ψ_ε) × (a_n T_0/j) / (Σ_{i=1}^{n} v_i² + T_0²)^{(n+1)/2}
  ≥ ζ(Ψ_ε) M_3 / j    (47)

where we set M_3 = a_n T_0 / (Σ_{i=1}^{n} v_i² + T_0²)^{(n+1)/2} with v_i = max_{x,y∈Ψ} |x^i − y^i|, 1 ≤ i ≤ n, and, similarly,

min_{x_j∈Ψ\Ψ_ε} P(y_j ∈ Ψ'_ε | x_j) ≥ ζ(Ψ'_ε) M_3 / j.    (48)

Combining (45) through (48) yields

lim_{t→∞} P(F_t − C* ≥ ε) ≤ exp[ −min{ζ(Ψ_ε), ζ(Ψ'_ε)} Σ_{j=0}^{∞} M_3/j ] = 0.    (49)

∎
REFERENCES

[1] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, "A review of speech-based bimodal recognition," IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23–37, Mar. 2002.
[2] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Stat. Soc., B, vol. 39, no. 1, pp. 1–38, 1977.
[4] F. Sun and G. Hu, "Speech recognition based on genetic algorithm for training HMM," Electron. Lett., vol. 34, no. 16, pp. 1563–1564, Aug. 1998.
[5] D. Paul, "Training of HMM recognizers by simulated annealing," in Proc. ICASSP, Tampa, FL, Mar. 1985, pp. 13–16.
[6] S. Kirkpatrick, C. D. Gerlatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, May 1983.
[7] A. V. Rao and K. Rose, "Deterministically annealed design of hidden Markov model speech recognizers," IEEE Trans. Speech Audio Process., vol. 9, no. 2, pp. 111–126, Feb. 2001.
[8] Y. Lee, J.-S. Lee, S.-Y. Lee, and C. H. Park, "Improving generalization capability of neural networks based on simulated annealing," in Proc. IEEE Congr. Evol. Comput., Singapore, Oct. 2007, pp. 3447–3453.
[9] H. H. Szu and R. L. Hartley, "Fast simulated annealing," Phys. Lett. A, vol. 122, no. 3/4, pp. 157–162, Jun. 1987.
[10] D. Nam, J.-S. Lee, and C. H. Park, "n-dimensional Cauchy neighbor generation for the fast simulated annealing," IEICE Trans. Inf. Syst., vol. E87-D, no. 11, pp. 2499–2502, Nov. 2004.
[11] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," J. Chem. Phys., vol. 21, no. 6, pp. 1087–1092, Jun. 1953.
[12] R. L. Yang, "Convergence of the simulated annealing algorithm for continuous global optimization," J. Optim. Theory Appl., vol. 104, no. 3, pp. 691–716, Mar. 2000.
[13] J.-S. Lee and C. H. Park, "Training hidden Markov models by hybrid simulated annealing for visual speech recognition," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Taipei, Taiwan, Oct. 2006, pp. 198–202.
[14] J. R. Movellan, "Visual speech recognition with stochastic networks," in Advances in Neural Information Processing Systems, vol. 7, G. Tesauro, D. Toruetzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, pp. 851–858.
[15] C. C. Chibelushi, S. Gandon, J. S. D. Manson, F. Deravi, and R. D. Johnston, "Design issues for a digital audio-visual integrated database," in Proc. IEE Colloq. Integr. Audio-Visual Process. Recog., Synthesis, Commun., London, U.K., 1996, pp. 7/1–7/7.
[16] S. Pigeon and L. Vandendrope, "The M2VTS multimodal face database (release 1.00)," in Proc. Int. Conf. Audio-Video-Based Biometric Authentication, Crans-Montana, Switzerland, Mar. 1997, pp. 403–409.
[17] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audiovisual speech," Proc. IEEE, vol. 91, no. 9, pp. 1306–1326, Sep. 2003.
[18] T. Hazen, K. Saenko, C. La, and J. Glass, "A segment-based audiovisual speech recognizer: Data collection, development, and initial experiments," in Proc. Int. Conf. Multimodal Interfaces, State College, PA, 2004, pp. 235–242.
[19] M. N. Kaynak, Q. Zhi, A. D. Cheok, K. Sengupta, Z. Jian, and K. C. Chung, "Lip geometric features for human–computer interaction using bimodal speech recognition: Comparison and analysis," Speech Commun., vol. 43, no. 1/2, pp. 1–16, Jan. 2004.
[20] S. Gurbuz, Z. Tufekci, E. Patterson, and J. Gowdy, "Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition," in Proc. ICASSP, Salt Lake City, UT, May 2001, vol. 1, pp. 177–180.
[21] S. Lucey, "An evaluation of visual speech features for the tasks of speech and speaker recognition," in Proc. Int. Conf. Audio-Video-Based Biometric Authentication, Guildford, U.K., Jun. 2003, pp. 260–267.
[22] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, "A comparison of model and transform-based visual features for audio-visual LVCSR," in Proc. Int. Conf. Multimedia Expo, Tokyo, Japan, Apr. 2001, pp. 22–25.
[23] B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Trans. Speech Audio Process., vol. 5, no. 3, pp. 257–265, May 1997.
[24] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP, Tokyo, Japan, Apr. 1986, pp. 49–52.
[25] A. Ben-Yishai and D. Burshtein, "A discriminative training algorithm for hidden Markov models," IEEE Trans. Speech Audio Process., vol. 12, no. 3, pp. 204–217, May 2004.
[26] L. J. Rodrígues and I. Torres, "Comparative study of the Baum–Welch and Viterbi training algorithms applied to read and spontaneous speech recognition," in Proc. Iberian Conf. Pattern Recog. Image Anal., vol. 2652, Lecture Notes in Computer Science, F. J. Perales, A. C. Campilh, N. P. de la Blanca, and A. Sanfeliu, Eds., Berlin, Germany, 2003, pp. 847–857.
[27] W. Turin, "Unidirectional and parallel Baum–Welch algorithms," IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 516–523, Nov. 1998.
[28] H.-J. Cho, S.-Y. Oh, and D.-H. Choi, "Population-oriented simulated annealing technique based on local temperature concept," Electron. Lett., vol. 34, no. 3, pp. 312–313, Feb. 1998.

2 Computer Science Department, San Francisco State University, San Francisco, CA. 3 Division of Genetics and Metabolism, Children's National Medical Center ...