JUST-IN-TIME LATENT SEMANTIC ADAPTATION ON LANGUAGE MODEL FOR CHINESE SPEECH RECOGNITION USING WEB DATA

Qin Gao, Xiaojun Lin, Xihong Wu

Speech and Hearing Research Center, National Lab on Machine Perception, Peking University
{gaoqin,linxj,wxh}@cis.pku.edu.cn

ABSTRACT

A novel method is proposed for performing just-in-time adaptation of language models in Chinese speech recognition using Web search engines. Latent semantic analysis (LSA) is employed to change the probability distribution of the N-gram language model. The method has two advantages. First, it needs a relatively small amount of data, which can be obtained from the Web on-the-fly. Second, compared to the traditional adaptation formula of LSA, the proposed approach is more efficient, which allows second-pass decoding to be performed at high speed. Experiments show that the perplexity of the language model is reduced by over 13% after adaptation, and a 4.29% relative reduction in WER is achieved in large-vocabulary Chinese speech recognition on a standard test set.

1. INTRODUCTION

The language model is a critical component of speech recognition. State-of-the-art language models are dominated by statistical ones, especially the N-gram model. Many efforts have been made to improve the training of statistical language models. However, no matter how much data are used, the lexical, syntactic and semantic characteristics of the training set and the test set are quite likely to differ in real-world applications [1]. Hence, language model adaptation becomes a pressing need. It involves two major tasks: one is the acquisition of adaptation data, the other is the extraction of knowledge from the adaptation data and its combination with the well-trained background model [1]. In this paper, we explore the use of a Web search engine to solve the first problem; for the second problem, latent semantic analysis is investigated.

The development of the World Wide Web has made it a huge data source. Web search engines provide a way to find samples of text that are closely related to given keywords. Some research has been carried out on utilizing Web search engines in language modeling, such as [2] and [3]. However, the corpus these methods download from the Web must be large enough to ensure statistical significance, which can be very expensive if the recognition task consists of unrelated short sentences.

Latent Semantic Analysis (LSA) is a promising method that can reveal the semantic relationships between words and documents as well as between words themselves. Many attempts have been made to utilize LSA in speech recognition [4, 5]. These works achieved good results, although most of them only report the perplexity of the language model. [6] presented an encouraging result of using LSA in Chinese speech recognition, in which the CER was reduced by more than 3% relatively. However, most of these works did not investigate the complexity of the algorithm, which actually prevents the method from being put into application.

Aiming to bring together the advantages of these two promising ideas, a novel method is proposed which uses data from a Web search engine as the adaptation data and LSA as the adaptation approach. First, it is assumed that the data obtained from the search engine are closely related to the utterance in the semantic sense, and the LSA probabilities of all the words in the dictionary are calculated. These probabilities are then used to adjust the trigram probabilities. Finally, second-pass decoding is carried out on the word lattice with the adapted language model. The full adaptation procedure is presented in Figure 1 (speech → ASR system → word lattice → confidence measure → confident hypothesis → query to search engine → search result → LSA update of the background LM → adapted LM).

Fig. 1. Procedure of Web-based LSA Adaptation

Since the statistical characteristics of the collected data are not needed, the amount of data used is relatively small; generally fewer than 10 Web requests are needed for a single utterance, so the overhead is minimized. Meanwhile, the adaptation formula is simplified so that the complexity of the algorithm is greatly reduced.

This paper is organized as follows. The second section introduces the LSA adaptation method. The third section presents the approach of acquiring adaptation data using a Web search engine. In Section 4, the evaluation results are presented, followed by analysis and discussion. The fifth section gives the conclusion.

(This work was supported in part by NSFC 60435010, 60305030, 60305004, NKBRPC 2004CB318000, a program for NCET, as well as a joint project of the Engineering Institute at Peking University.)
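As a rough illustration of this procedure (not part of the original paper), the adaptation loop could be sketched as follows; every callable here is a hypothetical placeholder for the corresponding box in Figure 1, not an API of the actual system.

```python
# Minimal sketch of the just-in-time adaptation loop of Figure 1.  The five
# callables are hypothetical placeholders for the boxes in the figure.

def recognize_with_jit_adaptation(utterance, background_lm, lsa_model,
                                  decode_to_lattice, confident_words,
                                  fetch_web_text, adapt_lm, rescore_lattice):
    lattice = decode_to_lattice(utterance, background_lm)       # first-pass decoding
    query = confident_words(lattice)                            # confidence-filtered hypothesis (Sec. 3.1)
    web_text = fetch_web_text(query)                            # small amount of adaptation data (Sec. 3.2-3.3)
    adapted_lm = adapt_lm(background_lm, lsa_model, web_text)   # LSA rescaling of the trigram LM (Sec. 2.2)
    return rescore_lattice(lattice, adapted_lm)                 # second-pass decoding
```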

2. LM ADAPTATION USING LSA

2.1. LSA Probability

LSA originated in the field of information retrieval. It assumes that the words {w_i} in the same "document" are semantically related. Given this assumption, the co-occurrence matrix W of words and documents can be built from a corpus T = {d_1, ..., d_M} and a vocabulary V = {w_1, ..., w_N}, where the (i, j) cell of W denotes the weighted count of word w_i in document d_j. However, the matrix cannot be used directly, because its dimension is too large and the vector spaces of words and documents are distinct from each other [7]. In order to reduce the dimensionality of the matrix and exploit the relationship between the document and word vector spaces, singular value decomposition is applied to transform W in the following way [8]:

W ≈ \hat{W} = U S V^T    (1)

where the dimensions of U, S and V are N×R, R×R and M×R respectively. Here, \hat{W} is the best rank-R approximation of W for any unitarily invariant norm [7]. The column vectors of U and the row vectors of V^T are orthonormal bases of the column and row vector spaces of \hat{W}. Given (1), a word w_i can be represented as u_i and a document d_j as v_j in the same space:

u_i = w_i V S^{-1},    v_j = d_j U S^{-1}    (2)

The co-occurrence of word w_i and document d_j, namely the (i, j) cell of W, can be characterized by the dot product of the corresponding u_i S^{1/2} and v_j S^{1/2}. Therefore, the cosine of the angle between u_i S^{1/2} and v_j S^{1/2} can be used to represent the closeness between w_i and d_j:

Pr_LSA(w_i | d_j) = C(u_i, v_j) = cos(u_i S^{1/2}, v_j S^{1/2}) = u_i S v_j^T / (||u_i S^{1/2}|| ||v_j S^{1/2}||)    (3)

A further discussion of the LSA probability can be found in [8].
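As an illustration of (1)-(3) (not from the paper; a plain NumPy sketch that assumes W is stored as an N×M word-by-document count matrix), the rank-R decomposition and the closeness measure could be computed as follows.

```python
import numpy as np

def lsa_fit(W, R):
    """Eq. (1): rank-R SVD of the N x M word-by-document matrix, W ~ U S V^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :R], np.diag(s[:R]), Vt[:R, :].T      # U: N x R, S: R x R, V: M x R

def fold_in_document(d, U, S):
    """Eq. (2): map a length-N (pseudo-)document count vector d into the LSA space."""
    return d @ U @ np.linalg.inv(S)

def closeness(u_i, v_j, S):
    """Eq. (3): cosine of the angle between u_i S^{1/2} and v_j S^{1/2}."""
    a, b = u_i @ np.sqrt(S), v_j @ np.sqrt(S)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

For a word w_i already present in the training matrix, u_i is simply the i-th row of U, since W V S^{-1} = U.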

2.2. Adaptation Formula

Assume that the document d is closely related to the utterance to be recognized. The task is to incorporate LSA into the N-gram language model. [5] proposed a method that modifies the N-gram directly, given a smoothing factor λ:

Pr_LSA(w_q | w_{q-1} ... w_{q-n+1}, d) = Pr_ngram(w_q | w_{q-1} ... w_{q-n+1})^{1-λ} Pr(w_q | d)^λ / Σ_{w_i ∈ V} Pr_ngram(w_i | w_{q-1} ... w_{q-n+1})^{1-λ} Pr(w_i | d)^λ    (4)

However, two problems lie in this method. First, the updating process is time-consuming: for each trigram, the denominator is accumulated over all the words in the dictionary. In our system, whose dictionary contains more than 60,000 words, this approach is inapplicable. Second, due to data sparsity the N-gram itself is not well trained and back-offs may be applied, which may prevent the equation from yielding proper estimates of the new N-gram probabilities.

Instead of trying to estimate the N-gram probability itself, it is assumed that semantic conditions have the same effect on the N-gram and the unigram of the same word. That is¹:

∀ w_i ∈ V, ∀ H:   Pr_LSA(w_i | d) / Pr(w_i) = Pr_LSA(w_i | d, H) / Pr(w_i | H) = C(w_i)    (5)

where H = {w_{q-1}, w_{q-2}, ..., w_{q-n+1}} denotes the decoding history. So the updating formula becomes:

Pr_LSA(w_i | H, d) = Pr(w_i | H) × ( Pr_LSA(w_i | d) / Pr(w_i) )    (6)

where Pr_LSA(w_i | d) is calculated by:

Pr_LSA(w_i | d) = Pr(w_i | d)^λ Pr(w_i) / Σ_{w_j ∈ V} Pr(w_j | d)^λ Pr(w_j)    (7)

The overhead of this algorithm is much smaller than that of (4), which makes it more suitable for applications. Besides, since the method relies on unigram probabilities, which suffer less from sparsity than trigrams, its performance is more stable. The selection of d in (6) should also be considered. In the works of [7] and [8], the decoding histories are viewed as "pseudo-documents". However, since the document changes as the decoding history changes, every access to a trigram probability requires (7) to be computed. In this work, Web data are used as the document, so (7) is calculated only once.

¹The assumption is made by expanding (5) using (4). After eliminating corresponding factors we get

(5) ⇔ Σ_{w_i ∈ V} Pr(w_i | H) · Pr(w_i | d) / Σ_{w_i ∈ V} Pr(w_i) · Pr(w_i | d) ≡ C

so the assumption is that the right-hand side equals a constant. The assumption is reasonable: the denominator is a constant, and the numerator is accumulated over the whole vocabulary, so it does not vary too much.
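A compact sketch of the rescaling in (6)-(7) follows (not the authors' implementation; the background unigram/trigram models and the LSA closeness values are represented by hypothetical dictionaries and callables, and negative cosine values are floored so that the λ exponent is well defined).

```python
import math

def lsa_scale_factors(closeness, unigram, lam):
    """Eq. (7) followed by the ratio used in Eq. (6): for every word w,
    compute Pr_LSA(w|d) from the LSA closeness C(w, d) and the background
    unigram Pr(w), then return scale(w) = Pr_LSA(w|d) / Pr(w).
    This is done once per adaptation document, not once per trigram."""
    # Flooring the closeness at a small positive value is an assumption of
    # this sketch, not something specified in the paper.
    raw = {w: max(closeness[w], 1e-6) ** lam * unigram[w] for w in unigram}
    norm = sum(raw.values())
    return {w: (raw[w] / norm) / unigram[w] for w in unigram}

def adapted_logprob(trigram_logprob, scale, word, history):
    """Eq. (6) in the log domain: log Pr_LSA(w|H,d) = log Pr(w|H) + log scale(w)."""
    return trigram_logprob(word, history) + math.log(scale[word])
```

Because the scale factors depend only on d, second-pass rescoring adds just one table lookup and one addition per trigram access.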

3. WEB-BASED ADAPTATION DATA ACQUISITION

The rapid development of Web search engines provides an interface to find text that is closely related to query keywords. Technologies such as stop-word and duplicate-page filtering further alleviate the burden on the client side. In this paper, we take advantage of Web search engines to acquire adaptation data. The procedure is described in Figure 2, and the rest of this section presents the proposed method in detail.

3.1. Query Generation

In order to obtain high-quality adaptation data from the Web, the query strings must be chosen carefully. If they contain erroneous hypotheses, the collected data may be biased and the result may deteriorate. Hence, a confidence measure is used to choose only the most confident words as the query string, rather than the full hypothesis.

Fig. 2. Adaptation Data Acquisition (word lattice → confidence measure thresholding → query generation → Web query → Web page parsing → word separation → adaptation data; if the data are not sufficient, the confidence threshold is relaxed and the query repeated, otherwise second-pass decoding follows)

There are many methods for producing confidence measures in ASR; here N-best voting is selected. First, an A* search is performed on the output word lattice to obtain an N-best list containing a fixed number of hypotheses. All of the hypotheses then "vote" for the words in the best result, and the percentage of positive votes for each word is used as its confidence measure.
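The N-best voting just described can be sketched as below. This is a simplified stand-in, not the system's implementation: it assumes the N-best hypotheses are already available as word lists, ignores the time alignment a real lattice-based implementation would need, and the default threshold value is purely illustrative.

```python
def nbest_voting_confidence(best_hyp, nbest):
    """For each word in the best hypothesis, use the fraction of N-best
    hypotheses containing that word as a rough confidence score."""
    scores = {}
    for word in best_hyp:
        votes = sum(1 for hyp in nbest if word in hyp)
        scores[word] = votes / len(nbest)
    return scores

def confident_words(best_hyp, nbest, threshold=0.8):
    """Keep only the words whose confidence exceeds the threshold; these form the query."""
    scores = nbest_voting_confidence(best_hyp, nbest)
    return [w for w in best_hyp if scores[w] >= threshold]
```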

3.2. Query Strategy

Web search engines support several kinds of queries, such as exact phrase search and "AND"/"OR" Boolean operations. This work aims to find the text most closely related to the utterance, so more exact queries are preferable. However, in most cases the utterance itself cannot be found on the Web, so in order to get enough data for adaptation the constraints are relaxed gradually. To avoid the potential bias introduced by relaxed queries, the needed amount of data is fixed, and querying stops as soon as enough data have been collected. Table 1 shows the adopted strategy, in which "Word Seg" indicates whether or not the words are submitted without spaces between them; this is meaningful only for Chinese, which originally has no spaces between words.

Table 1. Query Strategy
Step  Boolean OP  Word Seg  Exact
1     AND         NO        YES
2     OR          NO        YES
3     AND         YES       NO
4     OR          YES       NO

If the data acquired are still not enough after these four steps, the confidence threshold is modified to include more words in "OR" queries and fewer words in "AND" queries, until enough data are obtained.

3.3. Query and Pre-processing

State-of-the-art Web search engines such as Google can also locate and return the context surrounding the keywords, which further refines the adaptation data. In this work, Google is chosen as the search engine for data acquisition. After fetching the result page, we parse the file, extract the content, and ignore text from templates and advertisements. Before the fetched data can be used, several pre-processing steps must be applied. Because Chinese provides no explicit word boundaries, word segmentation has to be done beforehand. The maximum-match approach is used to perform this task; its advantage is that the segmentation result is restricted to words in the dictionary. On average, obtaining 4k Chinese characters from the Web requires 5 queries, and the pre-processing can be done on-the-fly.
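The maximum-match (greedy longest-match) segmentation used here can be sketched as follows; the dictionary being a plain set of word strings and the max_len limit are assumptions of this illustration, not details given in the paper.

```python
def max_match_segment(text, dictionary, max_len=8):
    """Greedy forward maximum-match segmentation: at each position take the
    longest dictionary word; fall back to a single character if none matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```

For instance, with a dictionary containing 北京大学, 的 and 学生, the call max_match_segment("北京大学的学生", dictionary) yields [北京大学, 的, 学生].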

4. EVALUATION AND DISCUSSION

4.1. Baseline System and Evaluation Task

The experiments were carried out with the PULSAR speech recognition system of SHRC, PKU. The specifications of the system are listed in Table 2.

Table 2. System Specification
Item                  Specification
Acoustic Model Units  Initial-Final with varying context
AM Topology           3-state (Initial) or 5-state (Final) HMM
AM Output             32-mixture GMM
AM Training Data      1000+ hours
Dictionary            64,325 words
Language Model        Trigram trained from 4 GB of Web data

The evaluation task is the test corpus of the 2004 National 863 Evaluation on Desktop Speech Recognition. The corpus consists of 200 separate sentences; it is artificially mixed with natural noise, and most speakers have accents. Traditionally, the character error rate (CER) is used to evaluate Chinese speech recognition, but this metric does not reveal the actual quality of the output. In this work, both CER and word error rate (WER) are used. To obtain the WER, word segmentation is applied to the reference transcription using the maximum-match method, and the decoder outputs are left unmerged as usual. In training the LSA model, 35,000 sentences were used as class seeds, and another 200,000 sentences were classified into the classes; the dimension of the LSA space is 46.

4.2. Perplexity

Perplexities were computed on the test corpus, as shown in Figure 3.

(a) Perplexity for different factors λ        (b) Perplexity for different amounts of adaptation data

Fig. 3. Perplexity After Adaptation (plot data omitted; the perplexity drops from 943.24 without adaptation to a minimum of 813.17 after adaptation)

We adjusted the amount of data obtained from the search engine and performed experiments accordingly. The value of λ was also adjusted to produce a range of results. In the figure, "GOLD" denotes using the reference transcription as the adaptation data, and "HYP" denotes using the decoding results filtered by the confidence measure.

4.3. Performance of ASR

The results on the test set are shown in Figure 4. The experiment design is as described above.

(a) CER for different factors λ        (b) CER for different amounts of adaptation data
(c) WER for different factors λ        (d) WER for different amounts of adaptation data

Fig. 4. System Performance

The results show that increasing the amount of data fetched from the Web search engine improves performance, but that over-relaxing the query criterion deteriorates the decoding result. The best result is obtained when the factor λ is set to 2 and the amount of adaptation data is 4.5 kB: the relative reductions of CER and WER are 2.67% and 4.29% respectively. In that result, insertion and deletion errors decreased by only 0.14%, while substitution errors decreased from 30.82% to 29.22%, which shows that a number of ambiguous words are corrected by incorporating semantic information. The results also show that using Web data performs even better than using the reference transcription: the more semantically related words are found, the more the probabilities of the correct words are raised. Another interesting observation is that the reduction in WER is much greater than that in CER. The reason is that most Chinese words consist of two characters, and the decoder often yields partly correct words; although this helps the CER, it does not help the WER at all. Of course, the definition of Chinese words is not clear-cut and this analysis is primitive, but it suggests that a new evaluation standard should be carefully considered in order to reveal the true quality of an ASR system.

5. CONCLUSION

A new method has been presented that utilizes a Web search engine to improve the performance of Chinese speech recognition. The method uses latent semantic analysis to perform language model adaptation, and an efficient integration of LSA is used to ensure the speed of the system. Experiments have demonstrated the validity of the method. The method incurs a relatively small speed impact; if a local information source is used in place of the search engine, it can also be integrated into real-time ASR systems. The proposed method is especially promising for recognizing "Web-relevant" tasks such as broadcast news, whose content can easily be found on the Web. One piece of future work is to apply the method in a broadcast news indexing system. In addition, the method could be improved by refining the quality of the downloaded data with other measures to avoid possible bias.

6. REFERENCES

[1] J.R. Bellegarda, "Statistical language model adaptation: review and perspectives," Speech Communication, vol. 42, pp. 93–108, 2004.

[2] A. Berger and R. Miller, "Just-in-time language modelling," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 1998, vol. II, pp. 705–708.

[3] X. Zhu and R. Rosenfeld, "Improving trigram language modeling with the world wide web," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), 2001, vol. I, pp. 533–536.

[4] J.R. Bellegarda, "Exploiting latent semantic information in statistical language modeling," Proceedings of the IEEE, vol. 88, 2000.

[5] N. Coccaro and D. Jurafsky, "Towards better integration of semantic predictors in statistical language modeling," in Proc. Int. Conf. Spoken Language Processing, 1998, pp. 2403–2406.

[6] J. Ren, Research on the Clustering-based Latent Semantic Analysis, Ph.D. thesis, Tsinghua University, 2005.

[7] J.R. Bellegarda, "A multispan language modeling framework for large vocabulary speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 5, pp. 456–467, September 1998.

[8] W. Chou and B.H. Juang, Pattern Recognition in Speech and Language Processing, chapter 9, pp. 280–297, CRC Press, 2003.
