Confidence Score Based Unsupervised Incremental Adaptation for OOV Words Detection

Wei Chu, Xi Xiao, and Jia Liu
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
[email protected]

Abstract. This paper presents a novel approach to distinguishing in-vocabulary (IV) words from out-of-vocabulary (OOV) words using confidence score based unsupervised incremental adaptation. The unsupervised adaptation uses Viterbi decoding results with high confidence scores to adjust the acoustic models. The adjusted acoustic models reward IV words and penalize OOV words in the confidence score domain, thereby separating IV and OOV words. Our Automatic Speech Recognition Laboratory has developed a Speech Recognition Developer Kit (SRDK) that serves as a baseline system for different speech recognition tasks. Experiments conducted on the SRDK system show that this method raises the OOV word detection rate from 68% to 96% (a relative improvement of over 41%) at the same false alarm rate (IV words taken as OOV words) of 10%. The method also raises the correct acceptance rate from 88% to 98% (a relative improvement of over 11%) at the same false acceptance rate of 20%.

1 Introduction

Speech recognition systems can now perform quite well on isolated word recognition when given only IV word utterances as input and a vocabulary that is not very large. The situation worsens, however, when OOV words appear. In the real world the OOV input problem cannot be overlooked, because the recognizer constantly faces OOV words spoken by users.

Confidence scores were first used to evaluate the reliability of recognition results by S. Cox [1]. Since then, many approaches to computing confidence scores have been introduced. T. J. Hazen did prominent work in summarizing and devising word-level and utterance-level confidence scores [2]. For our practical short isolated word recognition task, however, it is hard to distinguish IV words from OOV words in the confidence score domain. One major reason is that the acoustic models used in SRDK cannot generate confidence scores that are separable for IV and OOV words.

In this paper, a confidence score based unsupervised incremental adaptation method is used to adjust the acoustic models of the SRDK system. During adaptation, we first feed adaptation data containing both IV and OOV words into the SRDK system, and then use the Viterbi decoding results of the recognizer that have high confidence scores to guide model adaptation.

D.-Y. Yeung et al. (Eds.): SSPR&SPR 2006, LNCS 4109, pp. 723-731, 2006. © Springer-Verlag Berlin Heidelberg 2006

A threshold on the confidence

score is set to ensure that almost all of the words used for adaptation are correctly recognized IV words. Because the adaptation data are limited, we adopt a combined maximum likelihood linear regression (MLLR) + maximum a posteriori (MAP) adaptation method. Our experiments show that this unsupervised adaptation procedure greatly improves subsequent OOV word detection performance.

2 Word Confidence Scoring

In this OOV word detection task, two classical but demonstrably efficient confidence scores are employed. For computational reasons, we adopt a two-pass search strategy in which a semi-syllable based confidence score [3] is calculated in the first pass and a filler model based confidence score [4] is calculated in the second pass. Finally, we combine the two confidence scores into a single-dimensional confidence score through a simple linear discrimination method.

2.1 Semi-Syllable Based Likelihood Ratio

CS_{ss}(X, W_0) is the semi-syllable based confidence score of the best hypothesis W_0 when the observed vector sequence is X, based on the likelihood ratio:

    CS_{ss}(X, W_0) = \frac{P(X \mid W_0)}{P(X)} .                                   (1)

Considering the state alignment of the observed vectors, we can express P(X | W_0) as

    P(X \mid W_0) = P(X_1, X_2, \ldots, X_m \mid h_1, h_2, \ldots, h_m),             (2)

where h_i is the semi-syllable aligned with the observed vector segment X_i; the correspondence between h_i and X_i is determined during the Viterbi match. Assuming the observed vector segments X_i are independent of each other, we have

    P(X) = \prod_{i=1}^{m} P(X_i).                                                   (3)

Furthermore, assuming the semi-syllables h_i are independent of each other, the conditional probability P(X | W_0) becomes

    P(X \mid W_0) = P(X_1, \ldots, X_m \mid h_1, \ldots, h_m) = \prod_{i=1}^{m} P(X_i \mid h_i).   (4)

So we get

    CS_{ss}(X, W_0) = \prod_{i=1}^{m} \frac{P(X_i \mid h_i)}{P(X_i)} .               (5)

For each segmented observation vector X_i,

    P(X_i) = \sum_{j=1}^{M} P(X_i \mid h_j).                                         (6)

In our semi-syllable based system, if h_j is matched as a consonant (or vowel), M is the number of all consonants (or vowels). Consequently, CS_{log\_ss}(X, W_0) in the log domain can be written as

    CS_{log\_ss}(X, W_0) = \sum_{i=1}^{m} \Big[ \log P(X_i \mid h_i) - \log \sum_{j=1}^{M} P(X_i \mid h_j) \Big].   (7)
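As a concrete sketch, the log-domain score of eq. (7) can be accumulated segment by segment. The function name and input layout below are illustrative assumptions, not part of the SRDK system; only the arithmetic of eq. (7) is taken from the text:

```python
import numpy as np

def semi_syllable_confidence(seg_loglikes, competitor_loglikes):
    """Semi-syllable confidence score of eq. (7), in the log domain.

    seg_loglikes: log P(X_i | h_i) for each Viterbi-aligned segment i.
    competitor_loglikes: for each segment i, the log-likelihoods
        log P(X_i | h_j) over all M competing consonant (or vowel) models,
        including the aligned model h_i itself.
    """
    score = 0.0
    for ll_aligned, ll_all in zip(seg_loglikes, competitor_loglikes):
        # log sum_j P(X_i | h_j), accumulated stably in the log domain
        log_denominator = np.logaddexp.reduce(ll_all)
        score += ll_aligned - log_denominator
    return score
```

Because the aligned model h_i is one of the M competitors, each summand is at most zero, so the score is never positive; IV words should sit closer to zero than OOV words.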

2.2 Filler Model Based Likelihood Ratio

CS_{fl}(X, W_0) is the filler model based confidence score of the best hypothesis W_0 when the observed vector sequence is X:

    CS_{fl}(X, W_0) = \frac{P(X \mid W_0)}{P(X \mid H_{Filler})} .                   (8)

In our system, an online garbage model H_{Online\_Garbage} serves as the filler model H_{Filler}. In the back end, the N-best hypotheses are listed; besides the best hypothesis W_0, the remaining N-1 hypotheses are called online garbage:

    P(X \mid H_{Online\_Garbage}) = \frac{1}{N-1} \sum_{i=1}^{N-1} P(X \mid W_i).    (9)

The normalized confidence score CS_{log\_fl}(X, W_0) in the log domain is then

    CS_{log\_fl}(X, W_0) = \frac{1}{n_X} \Big\{ \log P(X \mid W_0) - \log \Big[ \frac{1}{N-1} \sum_{i=1}^{N-1} P(X \mid W_i) \Big] \Big\},   (10)

where n_X is the number of frames of word W_0; dividing by n_X makes words with different frame counts comparable in the confidence score domain.

2.3 Confidence Score Combination

Since the two confidence scores carry different information, better performance can be achieved by combining them. For computational reasons, Fisher linear discrimination is used to find the projection vector p^T and generate a linear discriminating plane between the IV and OOV words. This yields a single-dimensional confidence score CS_{single}(X, W_0):

    CS_{single}(X, W_0) = p^T CS_{multi}(X, W_0) = [\alpha \ \ \beta] \, [CS_{log\_ss}(X, W_0) \ \ CS_{log\_fl}(X, W_0)]^T.   (11)
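A minimal sketch of eqs. (9)-(11) follows, assuming the decoder already supplies the N-best log-likelihoods; the function names and argument shapes are hypothetical:

```python
import numpy as np

def filler_confidence(nbest_loglikes, n_frames):
    """Normalized online-garbage confidence of eqs. (9)-(10).

    nbest_loglikes: log P(X | W_i) for the N-best list, index 0 = W_0.
    n_frames: frame count n_X of the word, used for length normalization.
    """
    best = nbest_loglikes[0]
    garbage = nbest_loglikes[1:]  # the N-1 online-garbage hypotheses
    # log of the average garbage likelihood: log[(1/(N-1)) sum_i P(X | W_i)]
    log_garbage = np.logaddexp.reduce(garbage) - np.log(len(garbage))
    return (best - log_garbage) / n_frames

def combine_scores(cs_log_ss, cs_log_fl, alpha, beta):
    """Single-dimensional score of eq. (11): projection onto p = [alpha, beta]."""
    return alpha * cs_log_ss + beta * cs_log_fl
```

In practice alpha and beta would come from Fisher linear discrimination on held-out IV/OOV scores, as the section describes.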

3 Unsupervised Incremental Adaptation

After computing confidence scores, we find that IV and OOV words cannot be separated easily in the confidence score domain, which makes OOV detection by confidence score difficult. The main reason is that the initial acoustic models are trained on generic speech data. These initial models perform speaker-independent recognition tasks quite well when only IV word utterances are provided as input, but when OOV word utterances are added they cannot generate separable confidence scores for IV and OOV words.

For this reason, we use specific IV word utterances to adapt the acoustic models, which in our OOV word detection task are context-dependent phoneme HMMs. D. Wang used a similar confidence score based unsupervised adaptation method to improve speech recognition performance [5]. Our experiments show that the adapted acoustic models also reward IV words and penalize OOV words in the confidence score domain.

Because we use IV word utterances for unsupervised incremental adaptation, wrongly recognized results could degrade the accuracy of the model parameters. Our strategy is therefore to select only correctly recognized IV word utterances with high confidence scores for adaptation.

3.1 MAP Adaptation

In MAP adaptation, the following formulas are used in each re-estimation step for each Gaussian pdf [6]:

    \mu = \frac{\frac{N_{prior}}{N_{Init}} \sum_{i \in Init} x_i + \sum_{j \in Adapt} w(x_j) x_j}{N_{prior} + \sum_{j \in Adapt} w(x_j)},   (12)

    \sigma^2 = \frac{\frac{N_{prior}}{N_{Init}} \sum_{i \in Init} x_i^2 + \sum_{j \in Adapt} w(x_j) x_j^2}{N_{prior} + \sum_{j \in Adapt} w(x_j)} - \mu^2,   (13)

where N_{prior} is a control parameter of the adaptation process: the smaller N_{prior} is, the more weight the adaptation utterances receive relative to the prior data. w(x_j) is a weighting factor that determines how each utterance is used in adaptation. We adopt a strict strategy for w(x_j):

    w(x_j) = 1  if  CS_{single}(x_j, W_0) > Th,
    w(x_j) = 0  if  CS_{single}(x_j, W_0) \le Th.                                    (14)
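The update of eqs. (12)-(14) can be sketched for a single one-dimensional Gaussian as follows. A real system would use per-Gaussian sufficient statistics rather than raw training samples; the function name and argument layout are assumptions:

```python
import numpy as np

def map_update(init_samples, adapt_samples, adapt_confidences, n_prior, th):
    """Confidence-gated MAP re-estimation of one Gaussian, eqs. (12)-(14).

    init_samples: data behind the prior model (size N_Init).
    adapt_samples: adaptation observations x_j.
    adapt_confidences: CS_single(x_j, W_0) for each adaptation observation.
    n_prior: prior/adaptation balance; smaller values weight the
        adaptation data more heavily.
    """
    init = np.asarray(init_samples, dtype=float)
    adapt = np.asarray(adapt_samples, dtype=float)
    # eq. (14): hard 0/1 weights from the confidence threshold Th
    w = (np.asarray(adapt_confidences, dtype=float) > th).astype(float)
    denom = n_prior + w.sum()
    prior_scale = n_prior / len(init)
    # eq. (12): interpolated mean
    mu = (prior_scale * init.sum() + (w * adapt).sum()) / denom
    # eq. (13): interpolated second moment, recentred by mu^2
    var = (prior_scale * (init ** 2).sum() + (w * adapt ** 2).sum()) / denom - mu ** 2
    return mu, var
```

When every adaptation confidence falls below Th, the weights are all zero and the update reduces to the prior mean and variance, which is exactly the conservatism the hard gate is designed to provide.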

In our experiments we find that when the confidence scores of the recognition results exceed a certain threshold, all of the Viterbi decoder output is correct. Only utterances with confidence scores above Th are used for adaptation, ensuring that wrong Viterbi decoding results do not negatively affect the model parameters.

3.2 MLLR Adaptation

MLLR adaptation [7] is suitable when the amount of adaptation data is small or limited, and it adapts faster than MAP adaptation given the same amount of data. For each Gaussian pdf, the mean \mu_{ik} is transformed as

    \tilde{\mu}_{ik} = A_c \mu_{ik} + b_c,                                           (15)

where A_c is a regression matrix and b_c is an additive bias vector associated with some broad class c, which can be either a broad phone class or a set of tied Markov states. As in MAP adaptation, we only use utterances with confidence scores above Th for MLLR adaptation.
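The mean transform of eq. (15) can be sketched as below. Note the estimation function is only an illustrative least-squares stand-in: true MLLR [7] maximizes the likelihood of the adaptation data using occupancy-weighted statistics, but it shares the same extended-mean linear form; all names here are hypothetical:

```python
import numpy as np

def apply_mllr(means, A, b):
    """Transform the class-c Gaussian means as in eq. (15): mu~ = A mu + b."""
    means = np.asarray(means, dtype=float)
    return means @ np.asarray(A, dtype=float).T + np.asarray(b, dtype=float)

def estimate_transform_ls(means, target_means):
    """Estimate [A | b] by least squares on (mean, target) pairs.

    Illustrative stand-in for the ML estimation of Leggetter & Woodland [7];
    the regression uses the extended mean vector [mu ; 1].
    """
    means = np.asarray(means, dtype=float)
    ext = np.hstack([means, np.ones((len(means), 1))])  # rows are [mu_k ; 1]
    W, *_ = np.linalg.lstsq(ext, np.asarray(target_means, dtype=float), rcond=None)
    return W[:-1].T, W[-1]  # A, b
```

Because one (A_c, b_c) pair is shared by every Gaussian in class c, even a small amount of confidence-gated adaptation data can move all class-c means at once, which is why MLLR suits the limited-data setting described above.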

4 Experimental Results

To show the effectiveness of the proposed method, we conducted experiments on our SRDK system. The initial acoustic models of the SRDK system are trained on approximately 100 hours of speech data, and the IV vocabulary size is 200. Because OOV word detection is expected to perform well in adverse situations, we assume that OOV word utterances make up 50% of the total: 3000 IV word utterances and 3000 OOV word utterances are prepared as input. One third of the total input is used for MLLR+MAP unsupervised incremental adaptation; the remaining 4000 utterances, containing both IV and OOV word utterances, form the OOV word detection test set.

In MLLR+MAP adaptation, to find an optimum value of Th, we compare OOV word detection performance under different values of Th. The results are depicted in Figure 1. The work point is the OOV word detection rate at the point where the detection rate and the false alarm rate sum to 1; given a Th, each work point represents the best operating condition under that Th. All of the following experiments are conducted under the optimum Th (4 × 10^5). The original work point before unsupervised adaptation is 82.5%. When Th is higher than the optimum, the number of utterances used in adaptation decreases and the work point falls, though it always stays above the initial work point. When Th is lower than the optimum, OOV detection performance falls greatly, mainly because incorrectly recognized words negatively affect the unsupervised adaptation. We also observed that when the adaptation data contain a few OOV word utterances, the work point after adaptation is still higher than the initial one.
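The work point defined above (detection rate where detection rate + false alarm rate = 1, i.e. where the miss rate equals the false alarm rate) can be located by sweeping a rejection threshold over the combined confidence scores. This is a hypothetical sketch of the evaluation step, not SRDK code:

```python
import numpy as np

def work_point(iv_scores, oov_scores):
    """Detection rate at the threshold where miss rate equals false alarm
    rate, i.e. where detection rate + false alarm rate = 1.

    Words whose combined confidence score falls below the threshold
    are flagged as OOV.
    """
    iv = np.asarray(iv_scores, dtype=float)
    oov = np.asarray(oov_scores, dtype=float)
    best_gap, best_det = float("inf"), 0.0
    for t in np.unique(np.concatenate([iv, oov])):
        detection = np.mean(oov < t)    # OOV words correctly flagged
        false_alarm = np.mean(iv < t)   # IV words wrongly flagged
        gap = abs((1.0 - detection) - false_alarm)
        if gap < best_gap:
            best_gap, best_det = gap, detection
    return best_det
```

With perfectly separated score distributions the work point reaches 1.0; overlapping distributions, as in Figure 2, pull it down toward the 82.5% reported before adaptation.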

[Figure 1 omitted: work point of OOV word detection (0.7-0.95) vs. confidence score threshold (2-7 × 10^5), before and after MLLR+MAP adaptation.]

Fig. 1. Change in the work point of OOV word detection as the confidence score threshold varies, after MLLR+MAP adaptation

Figures 2 and 3 illustrate that IV and OOV words become easier to separate in the confidence score domain after adaptation. During the incremental adaptation procedure, the acoustic model parameters are gradually adjusted according to formulas (12), (13) and (14), and thus generate higher confidence scores for later utterances used in adaptation. It would therefore also be feasible to gradually lower Th as adaptation proceeds. However, to keep the unsupervised incremental adaptation robust, we use a fixed Th throughout adaptation to prevent possible instabilities that might harm OOV word detection.

In Figure 4, the false alarm rate refers to the rate at which IV words are falsely accepted as OOV words. Figure 4 shows that the proposed method achieves a rise of

[Figure 2 omitted: histogram of combined confidence scores (0-10 × 10^5) for IV and OOV words before adaptation.]

Fig. 2. Confidence score distribution of IV and OOV words before adaptation

[Figure 3 omitted: histogram of combined confidence scores (0-10 × 10^5) for IV and OOV words after adaptation.]

Fig. 3. Confidence score distribution of IV and OOV words after adaptation, under Th = 4 × 10^5

[Figure 4 omitted: ROC curves of OOV word detection rate vs. false alarm rate, before and after adaptation.]

Fig. 4. OOV word detection performance before adaptation and after adaptation, under Th = 4 × 10^5

over 41% in the OOV word detection rate (from 68% to 96%) at the same false alarm rate of 10%.

In Figure 5, the false acceptance rate is the percentage of wrongly recognized words that are accepted, and the correct acceptance rate is the percentage of correctly recognized words that are accepted. It is essential to examine the recognition performance of the adapted models. Figure 5 shows that we achieve a rise in correct acceptance rate (from 88% to 98%) at a false acceptance rate of 20% when the input data are composed of 50% IV word utterances and 50% OOV word utterances. But when the input data are all IV word utterances, we observe a degradation in correct acceptance rate (from 88% to 68%) at a false acceptance rate of


20%. The main reason is that the adapted acoustic models are task-oriented (best fit for 50% IV + 50% OOV input), and their performance depends heavily on the proportions of IV and OOV word utterances in the input data.

[Figure 5 omitted: ROC curves of correct acceptance rate vs. false acceptance rate, before adaptation and after adaptation with 100% IV and with 50% IV + 50% OOV input.]

Fig. 5. ROC curve of recognition with rejection before adaptation and after adaptation, under Th = 4 × 10^5

5 Conclusions

This paper presented a new method for improving the OOV word detection rate using confidence score based unsupervised incremental adaptation. The effectiveness of the method has been demonstrated by experiments on our SRDK system. Future work will include applying this idea not only to the acoustic models but also to the language model of a real-time human-machine interactive system whose input consists of both isolated words and sentences. In practical use, it is also important to find a balance between the OOV word detection rate and the recognition rate according to actual user experience.

Acknowledgements

This research is supported by the National Natural Science Foundation of China (Grant No. 60572083). The authors would like to thank everyone who contributed to building and improving the SRDK system.

References

1. Cox, S., Rose, R.: Confidence measures for the SWITCHBOARD database. In: Proceedings of ICASSP 1996, Atlanta (1996) 511-514
2. Hazen, T. J., Seneff, S., Polifroni, J.: Recognition confidence scoring and its use in speech understanding systems. Computer Speech and Language 16(1) (2002) 49-67

3. Sankar, A., Wu, S.-L.: Utterance verification based on statistics of phone-level confidence scores. In: Proceedings of ICASSP 2003, Menlo Park (2003) 584-587
4. Boite, J., Bourlard, H., D'hoore, B., Haesen, M.: A new approach towards keyword spotting. In: Proceedings of Eurospeech 93, Berlin (1993) 1273-1276
5. Wang, D., Narayanan, S. S.: A confidence-score based unsupervised MAP adaptation for speech recognition. In: Proceedings of the 36th Conference on Signals, Systems and Computers, Pacific Grove (2002) 222-226
6. Charlet, D.: Confidence-measure-driven unsupervised incremental adaptation for HMM-based speech recognition. In: Proceedings of ICASSP 2001, Salt Lake City (2001) 357-360
7. Leggetter, C. J., Woodland, P. C.: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9(2) (1995) 171-185
