VARIATIONAL KULLBACK-LEIBLER DIVERGENCE FOR HIDDEN MARKOV MODELS

John R. Hershey, Peder A. Olsen, Steven J. Rennie
IBM Thomas J. Watson Research Center
{jrhershe, pederao, sjrennie}@us.ibm.com

ABSTRACT

Divergence measures are widely used tools in statistics and pattern recognition. The Kullback-Leibler (KL) divergence between two hidden Markov models (HMMs) would be particularly useful in the fields of speech and image recognition. Whereas the KL divergence is tractable for many distributions, including gaussians, it is not in general tractable for mixture models or HMMs. Recently, variational approximations have been introduced to efficiently compute the KL divergence and Bhattacharyya divergence between two mixture models, by reducing them to the divergences between the mixture components. Here we generalize these techniques to approach the divergence between HMMs using a recursive backward algorithm. Two such methods are introduced, one of which yields an upper bound on the KL divergence, the other of which yields a recursive closed-form solution. The KL and Bhattacharyya divergences, as well as a weighted edit-distance technique, are evaluated for the task of predicting the confusability of pairs of words.

Index Terms: Kullback-Leibler divergence, variational methods, mixture models, hidden Markov models (HMMs), weighted edit distance, Bhattacharyya divergence.

1. INTRODUCTION

The Kullback-Leibler (KL) divergence, also known as the relative entropy, between two probability density functions f(x) and g(x),

    D(f \| g) \stackrel{\mathrm{def}}{=} \int f(x) \log \frac{f(x)}{g(x)} \, dx,    (1)

is commonly used in statistics as a measure of similarity between two densities [1]. The KL divergence satisfies three divergence properties:

1. Self similarity: D(f \| f) = 0.
2. Self identification: D(f \| g) = 0 only if f = g.
3. Positivity: D(f \| g) \ge 0 for all f, g.

The KL divergence is used in many aspects of speech and image recognition, such as determining whether two acoustic models are similar [2], measuring how confusable two words or hidden Markov models (HMMs) are [3, 4, 5], computing the best match using pixel distribution models [6], clustering of models, and optimization by minimizing or maximizing the divergence between distributions.

The KL divergence has a closed-form expression for many probability densities. For two gaussians f and g it reduces to the well-known expression

    D(f \| g) = \frac{1}{2} \left( \log \frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}[\Sigma_g^{-1} \Sigma_f] - d + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) \right).    (2)
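As a quick illustration (ours, not from the paper), Eq. (2) can be evaluated directly with numpy; the function name and the use of slogdet for numerical stability are our own choices.

import numpy as np

def kl_gaussian(mu_f, Sigma_f, mu_g, Sigma_g):
    """D(f || g) for f = N(mu_f, Sigma_f) and g = N(mu_g, Sigma_g), Eq. (2)."""
    d = mu_f.shape[0]
    Sigma_g_inv = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    _, logdet_f = np.linalg.slogdet(Sigma_f)
    _, logdet_g = np.linalg.slogdet(Sigma_g)
    return 0.5 * (logdet_g - logdet_f
                  + np.trace(Sigma_g_inv @ Sigma_f)
                  - d
                  + diff @ Sigma_g_inv @ diff)

# example: two 2-dimensional gaussians
print(kl_gaussian(np.zeros(2), np.eye(2), np.ones(2), 2.0 * np.eye(2)))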


In fact, the same is true if f and g are any of a wide range of useful distributions known as the exponential family, of which the gaussian is the most famous example. These densities are defined as f(x) \stackrel{\mathrm{def}}{=} \exp(\theta_f^T \phi(x)) / z(\theta_f), where \theta_f is a vector of parameters, z(\theta_f) = \int \exp(\theta_f^T \phi(x)) \, dx is the normalizer, and \phi(x) is a vector-valued function of x [7]. This formulation makes the KL divergence between two such densities surprisingly simple:

    D(f \| g) = \log \frac{z(\theta_g)}{z(\theta_f)} + (\theta_f - \theta_g)^T E_f \phi(x),    (3)

which requires only that E_f \phi(x) be known [8]. In general, however, for more complex distributions such as mixture models and hidden Markov models, the integral involves the logarithm of sums of component densities, and no such simple expression exists.

In the following sections we review two variational approximations to the KL divergence between two mixture models. Throughout the paper we use the example of gaussian mixture models (GMMs), and HMMs with gaussian mixtures as observation models, although the same techniques apply directly to any densities for which we can compute the KL divergences between pairs of mixture components. For an observation of a given sequence length, an HMM can be construed as a mixture model in which each HMM state sequence is a mixture component. In theory, the variational approximations for the KL divergence between two mixture models therefore carry over directly to HMMs. However, the direct application of the variational approximation yields one set of state sequences inside the logarithm and another outside the logarithm. This prevents us from using a recursive formulation to sum over the exponential number of pairs of state sequences generated by typical HMMs. We therefore derive a new variational approximation that is amenable to standard forward and backward algorithms. The variational approximations contain variational parameters that serve to associate sequences of one HMM with similar sequences of the other. We constrain these parameters by factorizing them into a Markov chain, which allows us to recursively solve for the variational parameters and evaluate the approximation.

One weakness of the KL divergence between HMMs is that, in many common cases, the divergence becomes infinite. If the HMM f generates sequences of lengths that the HMM g cannot generate, then the KL divergence is infinite. Such is the case, for instance, in left-to-right models where g is longer than f, despite the fact that the precise length of such models may be an artifact of the phonetic system of the recognizer rather than an important modeling assumption. In [9], this was addressed by connecting the final state to the initial state to make the HMM ergodic, and substituting the KL divergence rate for the KL divergence. Here we propose methods for approximating the KL divergence of non-ergodic HMMs, and instead consider symmetric versions of the KL divergence that yield finite, meaningful values. We consider two symmetrized versions of the KL divergence:

    D_{\min}(f, g) = \min\{ D(f \| g), D(g \| f) \},
    D_{\mathrm{resistor}}(f, g) = \left( D(f \| g)^{-1} + D(g \| f)^{-1} \right)^{-1}.

The resistor-average symmetrized KL divergence was first introduced in [10]. This problem can also be addressed by computing the KL divergence over the intersection of the sets of sequence lengths allowed by the HMMs.

In addition to the KL divergence, there exist other useful measures of dissimilarity between distributions. In particular, the Bhattacharyya divergence

    D_B(f, g) \stackrel{\mathrm{def}}{=} -\log \int \sqrt{f(x) g(x)} \, dx    (4)

is closely related to the KL divergence and can be used to bound the Bayes error, B_e(f, g) \stackrel{\mathrm{def}}{=} \frac{1}{2} \int \min(f(x), g(x)) \, dx \le \frac{1}{2} e^{-D_B(f, g)}. The Bhattacharyya divergence is symmetric, and has the advantage that it does not diverge to infinity for HMMs which support different sets of sequence lengths, so long as both HMMs can generate some sequences of the same length. The variational approximations for the KL divergence can also be applied to compute a variational approximation to the Bhattacharyya divergence, without factorizing the variational parameters. It also turns out that the Bhattacharyya divergence is closely related to a heuristic method known as the weighted edit distance, which we include in our experiments. To validate these approaches we compare numerical predictions with empirical word confusability measurements (i.e., word substitution error rates) from a speech recognizer.
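For reference, a minimal sketch (ours) of the two symmetrizations and of the Bhattacharyya bound on the Bayes error, assuming the one-sided divergences have already been computed; the function names are hypothetical.

import math

def d_min(d_fg, d_gf):
    # D_min(f, g) = min{D(f||g), D(g||f)}
    return min(d_fg, d_gf)

def d_resistor(d_fg, d_gf):
    # D_resistor(f, g) = (D(f||g)^-1 + D(g||f)^-1)^-1; an infinite one-sided
    # divergence simply drops out of the harmonic combination
    if d_fg == 0.0 or d_gf == 0.0:
        return 0.0
    return 1.0 / (1.0 / d_fg + 1.0 / d_gf)

def bayes_error_upper_bound(d_bhattacharyya):
    # B_e(f, g) <= 0.5 * exp(-D_B(f, g))
    return 0.5 * math.exp(-d_bhattacharyya)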

2. VARIATIONAL METHODS FOR MIXTURE MODELS

In [11] variational methods were introduced that allow the KL divergence to be approximated for mixture models. Without loss of generality, we consider the case where f and g are gaussian mixture models, with marginal densities of x \in \mathbb{R}^d under f and g given by

    f(x) = \sum_a \pi_a \mathcal{N}(x; \mu_a, \Sigma_a),
    g(x) = \sum_b \omega_b \mathcal{N}(x; \mu_b, \Sigma_b).    (5)

Here \pi_a is the prior probability of each state, and \mathcal{N}(x; \mu_a, \Sigma_a) is a gaussian in x with mean \mu_a and variance \Sigma_a. We use the shorthand notation f_a(x) = \mathcal{N}(x; \mu_a, \Sigma_a) and g_b(x) = \mathcal{N}(x; \mu_b, \Sigma_b). Our estimates of D(f \| g) will make use of the KL divergence between individual components, which we thus write as D(f_a \| g_b).

The Variational Approximation for Mixture Models: A variational lower bound to the likelihood was introduced in [11]. We define variational parameters \phi_{b|a} > 0 such that \sum_b \phi_{b|a} = 1. By Jensen's inequality we have

    L(f \| g) \stackrel{\mathrm{def}}{=} \int f(x) \log g(x) \, dx
             = \sum_a \pi_a \int f_a(x) \log \sum_b \phi_{b|a} \frac{\omega_b g_b(x)}{\phi_{b|a}} \, dx    (6)
             \ge \sum_a \pi_a \sum_b \phi_{b|a} \left( \log \frac{\omega_b}{\phi_{b|a}} + L(f_a \| g_b) \right)    (7)
             \stackrel{\mathrm{def}}{=} L_\phi(f \| g),

where L(f_a \| g_b) = \int f_a(x) \log g_b(x) \, dx. Since this is a lower bound on L(f \| g), we get the best bound by maximizing L_\phi(f \| g) with respect to \phi. If we define D_{VA}(f \| g) = L_{\hat\psi}(f \| f) - L_{\hat\phi}(f \| g) and substitute the optimal variational parameters, \hat\phi_{b|a} and \hat\psi_{a'|a}, the result simplifies to

    D_{VA}(f \| g) = \sum_a \pi_a \log \frac{\sum_{a'} \pi_{a'} e^{-D(f_a \| f_{a'})}}{\sum_b \omega_b e^{-D(f_a \| g_b)}}.    (8)
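A minimal sketch (ours, not the authors' code) of Eq. (8), taking the pairwise component divergences D_ff[a, a'] = D(f_a || f_a') and D_fg[a, b] = D(f_a || g_b) as precomputed inputs (e.g., with the kl_gaussian sketch above); scipy's logsumexp is used only for numerical stability.

import numpy as np
from scipy.special import logsumexp

def kl_va_gmm(pi, omega, D_ff, D_fg):
    """Eq. (8): D_VA(f||g) from mixture weights and pairwise component KLs."""
    # num[a] = log sum_a' pi_a' exp(-D(f_a||f_a'))
    num = logsumexp(np.log(pi)[None, :] - D_ff, axis=1)
    # den[a] = log sum_b omega_b exp(-D(f_a||g_b))
    den = logsumexp(np.log(omega)[None, :] - D_fg, axis=1)
    return float(np.dot(pi, num - den))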

D_{VA}(f \| g) satisfies the similarity property, but it does not in general satisfy the positivity property. Note that this variational approximation is the difference of two bounds, and hence is not itself a bound. In terms of accuracy, however, it performs somewhat better than the bound described below, perhaps because some of the error cancels out in the subtraction, as shown in [11].

The Variational Bound for Mixture Models: A direct upper bound on the divergence is also introduced in [11] for mixture models. We define variational parameters \phi_{ab} \ge 0 and \psi_{ab} \ge 0 satisfying the constraints \sum_b \phi_{ab} = \pi_a and \sum_a \psi_{ab} = \omega_b. Using the variational parameters we may write

    f = \sum_a \pi_a f_a = \sum_{ab} \phi_{ab} f_a,
    g = \sum_b \omega_b g_b = \sum_{ab} \psi_{ab} g_b.    (9)

With this notation we use Jensen's inequality to obtain an upper bound on the KL divergence as follows:

    D(f \| g) = \int f \log (f / g) \, dx
              = - \int f \log \left( \sum_{ab} \frac{\phi_{ab} f_a}{f} \cdot \frac{\psi_{ab} g_b}{\phi_{ab} f_a} \right) dx
              \le - \int f \sum_{ab} \frac{\phi_{ab} f_a}{f} \log \frac{\psi_{ab} g_b}{\phi_{ab} f_a} \, dx
              = \sum_{ab} \int \phi_{ab} f_a \log \frac{\phi_{ab} f_a}{\psi_{ab} g_b} \, dx    (10)
              \stackrel{\mathrm{def}}{=} D_{\phi\psi}(f \| g).

The best possible upper bound is attained by finding the variational parameters \hat\phi and \hat\psi that minimize D_{\phi\psi}(f \| g). The problem is convex in \phi as well as in \psi, so we can fix one and optimize the other. Fixing \phi, the optimal value for \psi is seen to be

    \psi_{ab} = \frac{\omega_b \phi_{ab}}{\sum_{a'} \phi_{a'b}}.    (11)

Similarly, fixing \psi, the optimal value for \phi is

    \phi_{ab} = \frac{\pi_a \psi_{ab} e^{-D(f_a \| g_b)}}{\sum_{b'} \psi_{ab'} e^{-D(f_a \| g_{b'})}}.    (12)

At each iteration step the upper bound D_{\phi\psi}(f \| g) is lowered, and we refer to the convergent as D_{VB}(f \| g). Since any zeros in \phi and \psi are fixed under the iteration, we recommend starting with \phi_{ab} = \psi_{ab} = \pi_a \omega_b. In practice the iteration converges sufficiently in a few iterations [11]. This iterative scaling scheme is of the same type as the Blahut-Arimoto algorithm for computing the channel capacity, and it also arises in maximum entropy models (see [11] for references).
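The iteration of Eqs. (11)-(12) is a few lines of numpy. The sketch below (ours) starts from the recommended \phi_{ab} = \psi_{ab} = \pi_a \omega_b and evaluates the converged bound using the reduction of Eq. (10) to \sum_{ab} \phi_{ab} [\log(\phi_{ab}/\psi_{ab}) + D(f_a \| g_b)]; the pairwise matrix D_fg is assumed given.

import numpy as np

def kl_vb_gmm(pi, omega, D_fg, num_iters=10):
    """Iteratively minimized upper bound D_VB(f||g); D_fg[a, b] = D(f_a||g_b)."""
    phi = np.outer(pi, omega)            # recommended starting point
    psi = np.outer(pi, omega)
    E = np.exp(-D_fg)
    for _ in range(num_iters):
        # Eq. (11): psi_ab = omega_b phi_ab / sum_a' phi_a'b
        psi = omega[None, :] * phi / phi.sum(axis=0, keepdims=True)
        # Eq. (12): phi_ab proportional to psi_ab exp(-D(f_a||g_b)), rows summing to pi_a
        unnorm = psi * E
        phi = pi[:, None] * unnorm / unnorm.sum(axis=1, keepdims=True)
    # D_phi_psi = sum_ab phi_ab [ log(phi_ab/psi_ab) + D(f_a||g_b) ]
    return float(np.sum(phi * (np.log(phi / psi) + D_fg)))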

3. HIDDEN MARKOV MODELS

To formulate the KL divergence for hidden Markov models, we must take care to define them in a way that yields a distribution (integrates to one) over all sequence lengths. To this end the HMM must terminate the sequence when it transitions to a special final state. For an HMM f emitting an observation sequence of length n, let a_{1:n} \stackrel{\mathrm{def}}{=} (a_1, \ldots, a_n) be a sequence of hidden discrete state random variables, with a_t taking values in E, the set of emitting states. Let x_{1:n} \stackrel{\mathrm{def}}{=} (x_1, \ldots, x_n) be a sequence of observations, with x_t \in \mathbb{R}^d. For the observations we use the shorthand f_{a_t}(x_t) \stackrel{\mathrm{def}}{=} \mathcal{N}(x_t; \mu_{a_t}, \Sigma_{a_t}). We also define non-emitting initial and final state values (i.e., not random variables) I and F. The state sequence probabilities are thus formulated as a Markov chain, \pi_{a_{1:n}} \stackrel{\mathrm{def}}{=} \pi_{a_1|I} \, \pi_{F|a_n} \prod_{t=2}^n \pi_{a_t|a_{t-1}}, where \pi_{a_1|I} is an initial distribution, \pi_{a_t|a_{t-1}} are transition probabilities, and \pi_{F|a_n} are the final state transitions. The transition probabilities are normalized such that \sum_{a_1} \pi_{a_1|I} = 1 and \pi_{F|a_{t-1}} + \sum_{a_t} \pi_{a_t|a_{t-1}} = 1 for t \ge 2. It bears emphasizing here that the transitions to emitting states do not in general sum to one (i.e., \sum_{a_t} \pi_{a_t|a_{t-1}} \le 1), because there may also be a transition to the non-emitting final state. Hence it is as if the HMM is gradually leaking probability away to paths which terminate before the path in question. This allows the HMM to describe a distribution over all sequence lengths. In general, the transition to the final state can occur at any time; however, for a given sequence length n, we only consider paths that reach the final state after exactly n observations. The density assigned to signals of a particular length can thus be written

    f(x_{1:n}) = \sum_{a_{1:n}} \pi_{a_{1:n}} f_{a_{1:n}}(x_{1:n})
               = \sum_{a_{1:n}} \pi_{a_1|I} \, \pi_{F|a_n} \, f_{a_1}(x_1) \prod_{t=2}^n \pi_{a_t|a_{t-1}} f_{a_t}(x_t).    (13)

The probability of a particular sequence length n is p_f(n) = \int f(x_{1:n}) \, dx_{1:n} = \sum_{a_{1:n}} \pi_{a_1|I} \, \pi_{F|a_n} \prod_{t=2}^n \pi_{a_t|a_{t-1}} \le 1. Since the set of all sequences is x \in \cup_{n=1}^\infty \mathbb{R}^{n \times d}, the integration over all sequences is perhaps an unfamiliar operation. It amounts to separately integrating over sequences of each length and then summing the individual results. To see that f is a distribution over all sequences it is enough to verify that indeed

    \int f(x) \, dx = \sum_{n=1}^\infty \int f(x_{1:n}) \, dx_{1:n} = \sum_{n=1}^\infty p_f(n) = 1.
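To make the termination construction concrete, the sketch below (ours) computes the length distribution p_f(n) from an initial vector, an emitting-state transition matrix, and the final-state probabilities; the toy parameters are invented and satisfy \pi_{F|a} + \sum_{a'} \pi_{a'|a} = 1.

import numpy as np

def length_distribution(pi_I, Pi, pi_F, max_len):
    """Return [p_f(1), ..., p_f(max_len)] for a terminating HMM."""
    probs = []
    alpha = pi_I.copy()                     # mass over emitting states after 1 emission
    for _ in range(max_len):
        probs.append(float(alpha @ pi_F))   # terminate now
        alpha = alpha @ Pi                  # or emit one more observation
    return np.array(probs)

# toy left-to-right example: sum_n p_f(n) approaches 1 as max_len grows
pi_I = np.array([1.0, 0.0])
Pi   = np.array([[0.5, 0.5],
                 [0.0, 0.6]])
pi_F = np.array([0.0, 0.4])
print(length_distribution(pi_I, Pi, pi_F, 50).sum())   # ~1.0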

4. THE VARIATIONAL APPROXIMATION FOR HMMS

We extend the variational approximation for mixture models to HMMs by defining variational parameters in the form of a conditional Markov chain, \phi_{b_{1:n}|a_{1:n}} \stackrel{\mathrm{def}}{=} \phi_{b_1|a_1} \prod_{t=2}^n \phi_{b_t|a_t b_{t-1}}, where \sum_{b_1} \phi_{b_1|a_1} = 1 and \sum_{b_t} \phi_{b_t|a_t b_{t-1}} = 1, so that \sum_{b_{1:n}} \phi_{b_{1:n}|a_{1:n}} = 1. For a given sequence length n, by Jensen's inequality we have

    L^n(f \| g) \stackrel{\mathrm{def}}{=} \int f(x_{1:n}) \log g(x_{1:n}) \, dx_{1:n}
               \ge \sum_{a_{1:n}} \pi_{a_{1:n}} \sum_{b_{1:n}} \phi_{b_{1:n}|a_{1:n}} \log \frac{\omega_{b_{1:n}} e^{L(f_{a_{1:n}} \| g_{b_{1:n}})}}{\phi_{b_{1:n}|a_{1:n}}}    (14)
               \stackrel{\mathrm{def}}{=} L_\phi(f \| g),

where L(f_{a_{1:n}} \| g_{b_{1:n}}) = \int f_{a_{1:n}}(x_{1:n}) \log g_{b_{1:n}}(x_{1:n}) \, dx_{1:n}. Note that, due to the conditional independence of the x_t given the a_t, we have

    L(f_{a_{1:n}} \| g_{b_{1:n}}) = \sum_{t=1}^n L(f_{a_t} \| g_{b_t}),    (15)

where L(f_{a_t} \| g_{b_t}) = \int f_{a_t}(x_t) \log g_{b_t}(x_t) \, dx_t. Since this is a lower bound on L(f \| g), we get the best bound by maximizing L_\phi(f \| g) with respect to \phi_{b_{1:n}|a_{1:n}}. To do so, we first expand the objective function into a recursive formula, by pulling earlier terms out of the sums over later variables:

    L_\phi(f \| g) = \sum_{a_1} \pi_{a_1|I} \sum_{b_1} \phi_{b_1|a_1} \Bigg[ \Big( \sum_{a_{2:n}} \pi_{a_{2:n}|a_1} \Big) \log \frac{\omega_{b_1|I} e^{L(f_{a_1} \| g_{b_1})}}{\phi_{b_1|a_1}}
        + \sum_{a_2} \pi_{a_2|a_1} \sum_{b_2} \phi_{b_2|a_2 b_1} \Big[ \Big( \sum_{a_{3:n}} \pi_{a_{3:n}|a_2} \Big) \log \frac{\omega_{b_2|b_1} e^{L(f_{a_2} \| g_{b_2})}}{\phi_{b_2|a_2 b_1}} + \cdots
        + \sum_{a_n} \pi_{a_n|a_{n-1}} \pi_{F|a_n} \sum_{b_n} \phi_{b_n|a_n b_{n-1}} \log \frac{\omega_{b_n|b_{n-1}} \omega_{F|b_n} e^{L(f_{a_n} \| g_{b_n})}}{\phi_{b_n|a_n b_{n-1}}} \cdots \Big] \Bigg],    (16)

where we can use the following recursion to compute the nested sums over the priors:

    p_{n-t}(a_t) \stackrel{\mathrm{def}}{=} \sum_{a_{t+1:n}} \pi_{a_{t+1:n}|a_t} = \sum_{a_{t+1:n}} \pi_{F|a_n} \prod_{\tau=t+1}^n \pi_{a_\tau|a_{\tau-1}} = \sum_{a_{t+1}} \pi_{a_{t+1}|a_t} \, p_{n-t-1}(a_{t+1}),    (17)

which is the probability that a sequence in state a_t will terminate in n - t steps. The recursion terminates with p_0(a_n) \stackrel{\mathrm{def}}{=} \pi_{F|a_n}. Then we can write (16) recursively as

    L_t^\phi(a_{t-1}, b_{t-1}) \stackrel{\mathrm{def}}{=} \sum_{a_t} \pi_{a_t|a_{t-1}} \sum_{b_t} \phi_{b_t|a_t b_{t-1}} \left( p_{n-t}(a_t) \log \frac{\omega_{b_t|b_{t-1}} e^{L(f_{a_t} \| g_{b_t})}}{\phi_{b_t|a_t b_{t-1}}} + L_{t+1}^\phi(a_t, b_t) \right),

beginning the recursion with

    L_n^\phi(a_{n-1}, b_{n-1}) = \sum_{a_n} \pi_{a_n|a_{n-1}} \sum_{b_n} \phi_{b_n|a_n b_{n-1}} \, p_0(a_n) \log \frac{\omega_{b_n|b_{n-1}} \omega_{F|b_n} e^{L(f_{a_n} \| g_{b_n})}}{\phi_{b_n|a_n b_{n-1}}},

and terminating it with

    L_\phi(f \| g) = \sum_{a_1} \pi_{a_1|I} \sum_{b_1} \phi_{b_1|a_1} \left( p_{n-1}(a_1) \log \frac{\omega_{b_1|I} e^{L(f_{a_1} \| g_{b_1})}}{\phi_{b_1|a_1}} + L_2^\phi(a_1, b_1) \right).

Note that L_t^\phi(a_{t-1}, b_{t-1}) is the only term containing \phi_{b_t|a_t b_{t-1}}, so the derivative is

    \frac{\partial L_\phi(f \| g)}{\partial \phi_{b_t|a_t b_{t-1}}} = \tilde\phi_{b_{t-1}} \tilde\pi_{a_t} \left( p_{n-t}(a_t) \log \frac{\omega_{b_t|b_{t-1}} e^{L(f_{a_t} \| g_{b_t})}}{\phi_{b_t|a_t b_{t-1}}} + L_{t+1}^\phi(a_t, b_t) - p_{n-t}(a_t) \right),

where \tilde\phi_{b_{t-1}} \tilde\pi_{a_t} are some priors that are independent of b_t. Equating to zero and solving for \phi_{b_t|a_t b_{t-1}} yields

    \hat\phi_{b_t|a_t b_{t-1}} = \frac{\omega_{b_t|b_{t-1}} e^{L(f_{a_t} \| g_{b_t})} e^{L_{t+1}^\phi(a_t, b_t)/p_{n-t}(a_t)}}{\sum_{b_t'} \omega_{b_t'|b_{t-1}} e^{L(f_{a_t} \| g_{b_t'})} e^{L_{t+1}^\phi(a_t, b_t')/p_{n-t}(a_t)}}.

The variational parameters for the end points are

    \hat\phi_{b_n|a_n b_{n-1}} = \frac{\omega_{b_n|b_{n-1}} \omega_{F|b_n} e^{L(f_{a_n} \| g_{b_n})}}{\sum_{b_n'} \omega_{b_n'|b_{n-1}} \omega_{F|b_n'} e^{L(f_{a_n} \| g_{b_n'})}}
and
    \hat\phi_{b_1|a_1} = \frac{\omega_{b_1|I} e^{L(f_{a_1} \| g_{b_1})} e^{L_2^\phi(a_1, b_1)/p_{n-1}(a_1)}}{\sum_{b_1'} \omega_{b_1'|I} e^{L(f_{a_1} \| g_{b_1'})} e^{L_2^\phi(a_1, b_1')/p_{n-1}(a_1)}}.

Substituting back in, we can simplify and eliminate the variational parameters altogether. The recursion begins with

    \hat L_n^\phi(a_{n-1}, b_{n-1}) = \sum_{a_n} \pi_{a_n|a_{n-1}} \pi_{F|a_n} \log \sum_{b_n} \omega_{b_n|b_{n-1}} \omega_{F|b_n} e^{L(f_{a_n} \| g_{b_n})},

continues with

    \hat L_t^\phi(a_{t-1}, b_{t-1}) = \sum_{a_t} \pi_{a_t|a_{t-1}} \, p_{n-t}(a_t) \log \sum_{b_t} \omega_{b_t|b_{t-1}} e^{L(f_{a_t} \| g_{b_t})} e^{\hat L_{t+1}^\phi(a_t, b_t)/p_{n-t}(a_t)},

and ends with

    L_{VA}(f(x_{1:n}) \| g(x_{1:n})) \stackrel{\mathrm{def}}{=} L_{\hat\phi}(f \| g) = \sum_{a_1} \pi_{a_1|I} \, p_{n-1}(a_1) \log \sum_{b_1} \omega_{b_1|I} e^{L(f_{a_1} \| g_{b_1})} e^{\hat L_2^\phi(a_1, b_1)/p_{n-1}(a_1)}.    (18)

Thus we have a single-pass backward algorithm. The KL divergence is then approximated using

    D_{VA}(f \| g) \stackrel{\mathrm{def}}{=} \sum_{n=0}^\infty L_{VA}(f(x_{1:n}) \| f(x_{1:n})) - L_{VA}(f(x_{1:n}) \| g(x_{1:n})),

which in practice is truncated to a finite series. Note that this sum can also be computed recursively by saving intermediate results.

In some situations a forward algorithm may be useful. In such cases an easy option is to make a change of parameters to reverse the HMM itself, using time-dependent state transitions that condition on the future rather than the past. Reversing the HMM entails a forward algorithm to produce the reversed parameters. Then the above backward algorithms can be applied on the time-reversed HMM, yielding a second forward algorithm. The two forward algorithms can be run simultaneously, deriving the next parameters from the reversal algorithm just in time for the next forward step of the variational recursion.
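The closed-form backward recursion is straightforward to implement. The sketch below (ours, not the authors' code) evaluates L_VA(f(x_{1:n}) \| g(x_{1:n})) for a single sequence length n, taking Lmat[a, b] = L(f_a \| g_b) as a precomputed input; explicit loops are used for clarity, and we assume n >= 2, strictly positive transition probabilities, and p_{n-t}(a) > 0 for the states involved.

import numpy as np
from scipy.special import logsumexp

def hmm_variational_loglik(pi_I, Pi, pi_F, omega_I, Omega, omega_F, Lmat, n):
    """L_VA(f(x_1:n) || g(x_1:n)) via the single-pass backward recursion."""
    assert n >= 2                       # n = 1 would need a separate, simpler formula
    A, B = Lmat.shape
    # p[k][a]: probability that a sequence currently in state a terminates in k steps
    p = [pi_F.copy()]
    for _ in range(n - 1):
        p.append(Pi @ p[-1])

    # initialize at t = n (zero transition probabilities are not specially handled here)
    Lhat = np.zeros((A, B))
    for a_prev in range(A):
        for b_prev in range(B):
            Lhat[a_prev, b_prev] = sum(
                Pi[a_prev, a] * pi_F[a]
                * logsumexp(np.log(Omega[b_prev]) + np.log(omega_F) + Lmat[a])
                for a in range(A))

    # recurse backwards for t = n-1, ..., 2
    for t in range(n - 1, 1, -1):
        k = n - t
        new = np.zeros((A, B))
        for a_prev in range(A):
            for b_prev in range(B):
                new[a_prev, b_prev] = sum(
                    Pi[a_prev, a] * p[k][a]
                    * logsumexp(np.log(Omega[b_prev]) + Lmat[a] + Lhat[a] / p[k][a])
                    for a in range(A))
        Lhat = new

    # terminate at t = 1 with the initial distributions
    return float(sum(pi_I[a] * p[n - 1][a]
                     * logsumexp(np.log(omega_I) + Lmat[a] + Lhat[a] / p[n - 1][a])
                     for a in range(A)))

Evaluating the same routine with g replaced by f and taking the difference, then summing over n, gives the D_VA approximation described above.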

5. THE VARIATIONAL BOUND FOR HMMS

To upper-bound the divergence we employ two variational parameters, again factorized into a Markov chain. (A different upper bound is proposed in [12], in which the closest pair of paths is used instead of summing over all paths.) In this section we define c_t = (a_t, b_t) to simplify the notation. For HMMs we formulate the variational parameters as \phi_{c_{1:n}} \stackrel{\mathrm{def}}{=} \phi_{c_1|I} \, \phi_{F|c_n} \prod_{t=2}^n \phi_{c_t|c_{t-1}} and \psi_{c_{1:n}} \stackrel{\mathrm{def}}{=} \psi_{c_1|I} \, \psi_{F|c_n} \prod_{t=2}^n \psi_{c_t|c_{t-1}}. We also have the constraints \sum_{b_t} \phi_{a_t b_t | a_{t-1} b_{t-1}} = \pi_{a_t|a_{t-1}} and \sum_{a_t} \psi_{a_t b_t | a_{t-1} b_{t-1}} = \omega_{b_t|b_{t-1}}. The variational parameters for the final state transitions are constrained to be \phi_{F|c_n} = \pi_{F|a_n} and \psi_{F|c_n} = \omega_{F|b_n}. For the variational bound we have

    D(f(x_{1:n}) \| g(x_{1:n})) \le D_{\phi\psi}(f_{1:n} \| g_{1:n}) \stackrel{\mathrm{def}}{=} \sum_{c_{1:n}} \phi_{c_{1:n}} \left( \log \frac{\phi_{c_{1:n}}}{\psi_{c_{1:n}}} + D(f_{a_{1:n}} \| g_{b_{1:n}}) \right),    (19)

where D(f_{a_{1:n}} \| g_{b_{1:n}}) = \int f_{a_{1:n}}(x_{1:n}) \log \frac{f_{a_{1:n}}(x_{1:n})}{g_{b_{1:n}}(x_{1:n})} \, dx_{1:n}. Note that, due to the conditional independence of the x_t given the a_t, we have D(f_{a_{1:n}} \| g_{b_{1:n}}) = \sum_{t=1}^n D(f_{a_t} \| g_{b_t}), where D(f_{a_t} \| g_{b_t}) = \int f_{a_t}(x_t) \log \frac{f_{a_t}(x_t)}{g_{b_t}(x_t)} \, dx_t. We unroll this in time as

    D_{\phi\psi}(f \| g) = \sum_{c_1} \phi_{c_1|I} \Bigg[ \Big( \sum_{c_{2:n}} \phi_{c_{2:n}|c_1} \Big) \log \frac{\phi_{c_1|I} e^{D(f_{a_1} \| g_{b_1})}}{\psi_{c_1|I}}
        + \sum_{c_2} \phi_{c_2|c_1} \Big[ \Big( \sum_{c_{3:n}} \phi_{c_{3:n}|c_2} \Big) \log \frac{\phi_{c_2|c_1} e^{D(f_{a_2} \| g_{b_2})}}{\psi_{c_2|c_1}} + \cdots
        + \sum_{c_n} \phi_{c_n|c_{n-1}} \phi_{F|c_n} \log \frac{\phi_{c_n|c_{n-1}} \phi_{F|c_n} e^{D(f_{a_n} \| g_{b_n})}}{\psi_{c_n|c_{n-1}} \psi_{F|c_n}} \cdots \Big] \Bigg].

Because of the variational constraints we have the following equality:

    \sum_{c_{t+1:n}} \phi_{c_{t+1:n}|c_t} = \sum_{c_{t+1:n}} \phi_{F|c_n} \prod_{\tau=t}^{n-1} \phi_{c_{\tau+1}|c_\tau} = \sum_{a_{t+1:n}} \pi_{F|a_n} \prod_{\tau=t}^{n-1} \pi_{a_{\tau+1}|a_\tau} = \sum_{a_{t+1:n}} \pi_{a_{t+1:n}|a_t} = p_{n-t}(a_t).    (20)

Using (20) we can directly write the recursive form of (19) as

    D_t^{\phi\psi}(c_{t-1}) = \sum_{c_t} \phi_{c_t|c_{t-1}} \left( p_{n-t}(a_t) \log \frac{\phi_{c_t|c_{t-1}} e^{D(f_{a_t} \| g_{b_t})}}{\psi_{c_t|c_{t-1}}} + D_{t+1}^{\phi\psi}(c_t) \right),

beginning the recursion with

    D_n^{\phi\psi}(c_{n-1}) = \sum_{c_n} \phi_{c_n|c_{n-1}} \phi_{F|c_n} \log \frac{\phi_{c_n|c_{n-1}} \phi_{F|c_n} e^{D(f_{a_n} \| g_{b_n})}}{\psi_{c_n|c_{n-1}} \psi_{F|c_n}},

and terminating it with

    D_{\phi\psi}(f \| g) = \sum_{c_1} \phi_{c_1|I} \left( p_{n-1}(a_1) \log \frac{\phi_{c_1|I} e^{D(f_{a_1} \| g_{b_1})}}{\psi_{c_1|I}} + D_2^{\phi\psi}(c_1) \right).

To optimize we must iterate between solving for \phi_{c_t|c_{t-1}} and \psi_{c_t|c_{t-1}}, holding the other constant. The optimal value of \phi_{c_t|c_{t-1}} given \psi_{c_t|c_{t-1}} is

    \hat\phi_{c_t|c_{t-1}} = \frac{\pi_{a_t|a_{t-1}} \, \psi_{c_t|c_{t-1}} \, e^{-D(f_{a_t} \| g_{b_t}) - D_{t+1}^{\phi\psi}(c_t)/p_{n-t}(a_t)}}{\sum_{b_t'} \psi_{a_t b_t'|c_{t-1}} \, e^{-D(f_{a_t} \| g_{b_t'}) - D_{t+1}^{\phi\psi}(a_t, b_t')/p_{n-t}(a_t)}}.

Similarly, the optimal value for \psi_{c_t|c_{t-1}} given \phi_{c_t|c_{t-1}} is

    \hat\psi_{c_t|c_{t-1}} = \frac{\omega_{b_t|b_{t-1}} \, p_{n-t}(a_t) \, \phi_{c_t|c_{t-1}}}{\sum_{a_t'} p_{n-t}(a_t') \, \phi_{a_t' b_t|c_{t-1}}}.

The iteration can be done to convergence for each step in a backward algorithm. Analogously to the variational approximation, we need corresponding starting and terminating iterations too. Let D_{VB}(f_{1:n} \| g_{1:n}) be the convergent value of D_{\hat\phi\hat\psi}. Then D_{VB}(f \| g) = \sum_{n=1}^\infty D_{VB}(f_{1:n} \| g_{1:n}).

Factoring \phi and \psi differently leads to a family of related approximations. Factorizations that constrain the variational parameters more will in general require less storage for the variational parameters, and yield less accurate results. In addition, the values of the variational parameters can be constrained. Constraining them to be sparse leads to Viterbi-style dynamic programming algorithms for the KL divergence, which may have some computational advantages.
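As a rough sketch (ours, reconstructed from the update equations above), one coordinate update of the two parameter sets at a single time step t and a fixed previous pair (a', b') can be written as follows; D_pair[a, b] = D(f_a \| g_b), D_next[a, b] = D_{t+1}^{\phi\psi}(a, b), and p_k[a] = p_{n-t}(a) are assumed given.

import numpy as np

def update_phi(psi, Pi_row, D_pair, D_next, p_k):
    """phi_hat given psi, for one (a', b'); Pi_row = Pi[a', :]."""
    E = np.exp(-D_pair - D_next / p_k[:, None])
    unnorm = psi * E
    # rows sum to pi_{a|a'}, as required by the constraint on phi
    return Pi_row[:, None] * unnorm / unnorm.sum(axis=1, keepdims=True)

def update_psi(phi, Omega_row, p_k):
    """psi_hat given phi, for one (a', b'); Omega_row = Omega[b', :]."""
    weighted = p_k[:, None] * phi
    # columns sum to omega_{b|b'}, as required by the constraint on psi
    return Omega_row[None, :] * weighted / weighted.sum(axis=0, keepdims=True)

Iterating the two updates to convergence at each backward step, and then evaluating the recursion for D_t^{\phi\psi}, yields D_VB(f_{1:n} \| g_{1:n}) for that sequence length.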



6. WEIGHTED EDIT DISTANCES

Various types of weighted edit distances have been applied to the task of estimating spoken word confusability, as discussed in [3] and [4]. A word is modeled in terms of a left-to-right HMM, see Fig. 1.

[Fig. 1. An HMM for call with pronunciation K AO L. In practice, each phoneme is composed of three states, although here they are shown with one state each.]

The confusion between two words can be heuristically modeled in terms of a cartesian product between the two HMMs, as seen in Fig. 2. This structure is similar to that used for acoustic perplexity [3] and the average divergence distance [4]. Weights are placed on the vertices that assign smaller values when the corresponding phoneme state models are more confusable.

[Fig. 2. Product HMM for the words call (K AO L) and dial (D AY AX L).]

The weighted edit distance (WED) is the shortest path (i.e., the Viterbi path) from the initial to the final node in the product graph:

    D_{WED}(f, g) = \min_n \min_{a_{1:n}, b_{1:n}} C(a_{1:n}, b_{1:n}),

where C(a_{1:n}, b_{1:n}) = \sum_{t=1}^n \left( w_{f_{a_t}|a_{t-1}} + w_{g_{b_t}|b_{t-1}} + w_{f_{a_t}, g_{b_t}} \right) is the cost of the path, and the w are costs assigned to each transition. In our experiments we define w_{f_{a_t}|a_{t-1}} = -\log \pi_{a_t|a_{t-1}} and w_{g_{b_t}|b_{t-1}} = -\log \omega_{b_t|b_{t-1}}. The w_{f_{a_t}, g_{b_t}} are dissimilarity measures between the acoustic models for each pair of HMM states. For the KL divergence WED we define w_{f_{a_t}, g_{b_t}} \stackrel{\mathrm{def}}{=} D(f_{a_t} \| g_{b_t}), and for the Bhattacharyya WED we define w_{f_{a_t}, g_{b_t}} \stackrel{\mathrm{def}}{=} D_B(f_{a_t} \| g_{b_t}).

An interesting variation, which we call the total weighted edit distance (TWED), is to sum over all paths and sequence lengths:

    D_{TWED}(f, g) = -\log \sum_n \sum_{a_{1:n}, b_{1:n}} e^{-C(a_{1:n}, b_{1:n})}.    (21)

That is, we sum over the similarities (probabilities), rather than the costs (negative log probabilities), since this corresponds to the interpretation as a product HMM. It turns out that when we apply the variational techniques introduced above to the Bhattacharyya divergence, the resulting measure D_B(f \| g) can be seen as a special case of the total weighted edit distance. These are formulated for mixture models in [13], and the same methods apply directly to HMMs. In addition, the TWED with Bhattacharyya weights, w_{f_{a_t}, g_{b_t}} \stackrel{\mathrm{def}}{=} D_B(f_{a_t} \| g_{b_t}), is in fact a simple Jensen's bound on the HMM Bhattacharyya divergence. Because the Bhattacharyya approximations and weighted edit distances are not in general zero for f = g, we subsequently normalize them using D_{\mathrm{norm}}(f, g) = D(f, g) - \frac{1}{2} D(f, f) - \frac{1}{2} D(g, g), which improves the performance. The derivations of the variational Bhattacharyya divergence bounds and the details of their relationship to the weighted edit distances are beyond the scope of this paper and are to be published elsewhere.
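The two edit distances above are easy to prototype. The sketch below (ours, not the paper's implementation) computes D_WED by Dijkstra search over the product graph (valid since all edge costs are nonnegative) and D_TWED by solving a small linear system for the sum over all terminating paths. W[a, b] holds the chosen state-pair dissimilarity; including the final-state transition costs when entering the final product node is our own convention, as is the assumption that the product transition matrix has spectral radius below one.

import heapq
import numpy as np

def wed(pi_I, Pi, pi_F, om_I, Om, om_F, W):
    """Weighted edit distance: min-cost path through the product graph."""
    A, B = W.shape
    with np.errstate(divide='ignore'):                        # log 0 -> inf cost (missing edge)
        start_cost = -np.log(np.outer(pi_I, om_I)) + W        # I -> (a, b), plus arrival weight
        step_prob = Pi[:, None, :, None] * Om[None, :, None, :]   # [a', b', a, b]
        step_cost = -np.log(step_prob)
        final_cost = -np.log(np.outer(pi_F, om_F))            # (a, b) -> final node
    done = set()
    heap = [(start_cost[a, b], a, b) for a in range(A) for b in range(B)
            if np.isfinite(start_cost[a, b])]
    heapq.heapify(heap)
    best = np.inf
    while heap:
        cost, a, b = heapq.heappop(heap)
        if (a, b) in done:
            continue
        done.add((a, b))
        best = min(best, cost + final_cost[a, b])
        for a2 in range(A):
            for b2 in range(B):
                nc = cost + step_cost[a, b, a2, b2] + W[a2, b2]
                if (a2, b2) not in done and np.isfinite(nc):
                    heapq.heappush(heap, (nc, a2, b2))
    return best

def twed(pi_I, Pi, pi_F, om_I, Om, om_F, W):
    """Total weighted edit distance, Eq. (21), summing over all terminating paths."""
    A, B = W.shape
    s = np.exp(-W)                                            # state-pair similarities
    v0 = (np.outer(pi_I, om_I) * s).ravel()                   # enter (a, b) at t = 1
    vF = np.outer(pi_F, om_F).ravel()                         # terminate from (a, b)
    M = (Pi[:, None, :, None] * Om[None, :, None, :] * s[None, None, :, :]).reshape(A * B, A * B)
    # sum_{n >= 1} v0^T M^{n-1} vF = v0^T (I - M)^{-1} vF, assuming spectral radius of M < 1
    total = v0 @ np.linalg.solve(np.eye(A * B) - M, vF)
    return -np.log(total)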

7. WORD CONFUSABILITY EXPERIMENTS

In this section we briefly describe some experimental results where we use the HMM divergence estimates to approximate spoken word confusability. To measure how well each method can predict recognition errors we used a test suite consisting of spelling data, meaning utterances in which letter sequences are read out, i.e., "J O N" is read as "jay oh en." There were a total of 38,921 instances of the spelling words (the letters A-Z) in the test suite, with an average letter error rate of about 19.3%. A total of 7,500 recognition errors were detected. Given the errors, we estimated the probability of error for each word pair as E(w_1, w_2) \stackrel{\mathrm{def}}{=} \frac{1}{2} P(w_1|w_2) + \frac{1}{2} P(w_2|w_1), where P(w_1|w_2) is the fraction of utterances of w_2 that are recognized as w_1. We discarded cases where w_1 = w_2, since these dominate the results and exaggerate the performance. We also discarded unreliable cases where the counts were too low. Continuous speech was used, so it is possible that some errors were due to mis-alignment.

[Fig. 3. The negative log error rate for all spelling word pairs compared to the variational HMM KL divergence. Scatter plot of divergence score (horizontal axis) versus negative log error rate (vertical axis), with points labeled by letter pair, e.g. "c·z" and "a·p".]

Figure 3 shows a scatter plot of the variational KL divergence score for each pair of letters versus the empirical error measurement. Note that similar-sounding combinations of letters appear in the lower left (e.g., "c·z"), and dissimilar combinations appear in the upper right (e.g., "a·p"). We also computed the HMM KL divergence by direct Monte-Carlo sampling of the HMM state sequences, as well as the Bhattacharyya and weighted edit distance methods. The variational bound was excluded because it did not perform as well as the variational approximation.

Table 1 shows the results using all the different methods. The variational HMM KL divergence is about as good as the more accurate and time-consuming Monte Carlo estimates of the KL divergence. Unfortunately, the HMM KL divergence itself is apparently not as well suited to the confusability task as the weighted edit distances and the Bhattacharyya divergence. This is natural, since the Bhattacharyya divergence is known to yield a tighter bound on the Bayes error than the KL divergence. It is a bit surprising, though, that the Bhattacharyya total weighted edit distance outperforms the variational Bhattacharyya divergence, since it is actually a looser bound on the Bayes error. However, the confusability measurements produced by the recognizer are only loosely related to the Bayes error, because, for example, the recognizer computes the Viterbi path instead of summing over paths.

    Method                                        Score
    VA Min KL Divergence                          0.365
    VA Resistor KL Divergence                     0.433
    MC 100K Min KL Divergence                     0.450
    MC 100K Resistor KL Divergence                0.442
    KL Divergence Weighted Edit Distance          0.571
    Bhattacharyya Weighted Edit Distance          0.610
    VA Bhattacharyya Divergence                   0.631
    Bhattacharyya Total Weighted Edit Distance    0.646

Table 1. Squared correlation scores between the various model-based divergence measures and the empirical word confusabilities -\log E(w_1, w_2). VA refers to the variational HMM approximation of the KL divergence or Bhattacharyya divergence. Min and Resistor are the two symmetrization methods. MC 100K refers to Monte Carlo simulations with 100,000 samples of HMM sequences.
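Finally, a small sketch (ours) of the empirical quantities used in this section: the pairwise error probability E(w_1, w_2) and the squared correlation reported in Table 1; the recognition counts passed in are hypothetical.

import numpy as np

def error_probability(n_w1_given_w2, n_w2, n_w2_given_w1, n_w1):
    """E(w1, w2) = 0.5 P(w1|w2) + 0.5 P(w2|w1), from recognition counts."""
    return 0.5 * n_w1_given_w2 / n_w2 + 0.5 * n_w2_given_w1 / n_w1

def squared_correlation(divergence_scores, error_rates):
    """r^2 between a divergence measure and -log E(w1, w2) over word pairs (E > 0)."""
    x = np.asarray(divergence_scores)
    y = -np.log(np.asarray(error_rates))
    return float(np.corrcoef(x, y)[0, 1] ** 2)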

8. REFERENCES

[1] Solomon Kullback, Information Theory and Statistics, Dover Publications Inc., Mineola, New York, 1968.

[2] Peder Olsen and Satya Dharanipragada, "An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models," in Eurospeech, Geneva, Switzerland, September 1-4, 2003, vol. 4, pp. 2509-2512.

[3] Harry Printz and Peder Olsen, "Theory and practice of acoustic confusability," Computer, Speech and Language, vol. 16, pp. 131-164, January 2002.

[4] Jorge Silva and Shrikanth Narayanan, "Average divergence distance as a statistical discrimination measure for hidden Markov models," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 890-906, May 2006.

[5] Qiang Huo and Wei Li, "A DTW-based dissimilarity measure for left-to-right hidden Markov models and its application to word confusability analysis," in Proceedings of Interspeech 2006 - ICSLP, Pittsburgh, PA, 2006, pp. 2338-2341.

[6] Jacob Goldberger, Shiri Gordon, and Hayit Greenspan, "An efficient image similarity measure based on approximations of KL-divergence between two gaussian mixtures," in Proceedings of ICCV 2003, Nice, October 2003, vol. 1, pp. 487-493.

[7] L. D. Brown, Fundamentals of Statistical Exponential Families, vol. 9 of Lecture Notes - Monograph Series, Institute of Mathematical Statistics, 1991.

[8] Peder A. Olsen and Karthik Visweswariah, "Fast clustering of gaussians and the virtue of representing gaussians in exponential model format," in Proceedings of the International Conference on Spoken Language Processing, October 2004.

[9] B. H. Juang and L. R. Rabiner, "A probabilistic distance measure for hidden Markov models," AT&T Technical Journal, vol. 64, no. 2, pp. 391-408, 1985.

[10] D. Johnson and S. Sinanovic, "Symmetrizing the Kullback-Leibler distance," http://www-dsp.rice.edu/~dhj/resistor.pdf.

[11] John Hershey and Peder Olsen, "Approximating the Kullback Leibler divergence between gaussian mixture models," in ICASSP, Honolulu, Hawaii, April 2007.

[12] Jorge Silva and Shrikanth Narayanan, "Upper bound Kullback-Leibler divergence for hidden Markov models with application as discrimination measure for speech recognition," in IEEE International Symposium on Information Theory, 2006.

[13] Peder Olsen and John Hershey, "Bhattacharyya error and divergence using variational importance sampling," in ICSLP, Antwerp, Belgium, August 2007.
