THUEE Language Modeling Method for the OpenKWS 2015 Evaluation

Zhuo Zhang∗, Wei-Qiang Zhang∗, Kai-Xiang Shen∗, Xu-Kui Yang†, Yao Tian∗, Meng Cai∗, and Jia Liu∗

∗ Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084
  Email: [email protected], [email protected]
† Zhengzhou Information Science and Technology Institute, Zhengzhou 450002

(This work was supported by the National Natural Science Foundation of China under Grant No. 61370034, No. 61273268, and No. 61403224. The corresponding author is W.-Q. Zhang.)

Abstract—In this paper, we describe the THUEE (Department of Electronic Engineering, Tsinghua University) team's method of building language models (LMs) for the OpenKWS 2015 Evaluation held by the National Institute of Standards and Technology (NIST). Because the in-domain data provided by NIST is very limited, most of our time and effort goes into making good use of the out-of-domain data. There are three main steps in our work. First, data cleaning is performed on the out-of-domain data. Second, by comparing the cross-entropy difference between LMs trained on the in-domain and out-of-domain data, the part of the out-of-domain corpus that is well matched to the in-domain one is selected as training data. Third, the final n-gram LM is obtained by interpolating individual n-gram LMs trained on the different training corpora, and all the training data is further combined to train one feed-forward neural network LM (FNNLM). In this way, we reduce the perplexity on the development test data by 8.3% for the n-gram LM and 1.7% for the FNNLM, and the Actual Term-Weighted Value (ATWV) of the final result is 0.5391.

I. INTRODUCTION

Statistical language modeling based on large amounts of text data has become a main research area of language modeling in recent years. Statistical language models (LMs) estimate the distribution of text segments (e.g., words, morphemes, etc.) for various language technologies such as keyword spotting. So far, the n-gram models [1] and the neural network models [2] are the most popular LMs.

Statistical LMs are particularly sensitive to the text data on which they are trained. Too much training data that is not matched to the in-domain data can make the final LM much worse. Thus, the selection of training data is an important step that cannot be skipped. There are different approaches to this problem; among them, comparing the cross-entropy difference of LMs built from in-domain and out-of-domain data, proposed by Moore and Lewis in 2010 [3], has earned widespread acceptance for its good performance.

In the OpenKWS15 evaluation [4], the in-domain corpus is very limited, so we select useful data from the out-of-domain data to help build LMs based on the method described above. In total, we build four LMs: n-gram LMs and FNNLMs, for both words and morphemes. This paper describes the method of building these LMs, in which data selection is the key point. Before data selection, data cleaning is performed on the out-of-domain data to improve the data quality. Although this step costs much time, it is worth doing in advance because the data quality directly affects the performance of the LM and even of the whole system. After data selection, the training texts consist of the selected data and the in-domain data. The FNNLM is trained by merging these data into one corpus. For the n-gram LM, linearly interpolating individual n-gram LMs built from the different training sources is a better way to improve performance.

We have carried out a series of experiments to build better LMs for this evaluation by applying the techniques described above. The paper is organized as follows: Section 2 describes the data processing applied to the out-of-domain data. Section 3 describes the data used for training and testing the LMs and the experimental setup. Section 4 gives the results and our discussion. Finally, Section 5 concludes the paper.

II. DATA PROCESSING

Data processing on the out-of-domain data includes data cleaning, data selection, and linear interpolation. The source language in this evaluation is Swahili.

A. Data Cleaning

In this evaluation, NIST provides a large amount of out-of-domain data scraped from the web. This data is not suitable for training LMs directly because it may contain many characters that do not belong to Swahili. In our work, all words and characters contained in the pronunciation lexicon, which is generated from the Language Specific Peculiarities (LSP) document [4] provided by NIST, are kept, and character strings consisting of English letters, including letters with diacritical marks, are also kept. Other content, such as web URLs and punctuation, is deleted. Before deleting punctuation, we perform sentence segmentation based on its locations.
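For illustration only, a minimal sketch of this kind of cleaning pass is shown below. The file names, the lexicon-loading helper, and the exact character classes kept are assumptions for the example, not the actual scripts used in the evaluation.

```python
import re

# Hypothetical file names; the real evaluation data uses different paths.
LEXICON_FILE = "lexicon.txt"      # words derived from the LSP pronunciation lexicon
RAW_WEB_FILE = "202webtext.raw"   # out-of-domain web text
CLEAN_FILE = "202webtext.clean"

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
# Assumed token class: Latin letters (including accented ones), apostrophes, hyphens.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F'\-]+")
SENT_END_RE = re.compile(r"[.!?]+")

def load_lexicon(path):
    """Words from the pronunciation lexicon are always kept as-is."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

def clean_line(line, lexicon):
    line = URL_RE.sub(" ", line)                     # drop web URLs
    sentences = []
    # Use punctuation positions for sentence segmentation before deleting it.
    for chunk in SENT_END_RE.split(line):
        tokens = []
        for tok in chunk.split():
            tok = tok.strip("\",;:()[]{}")           # strip surrounding punctuation
            if tok and (tok in lexicon or TOKEN_RE.fullmatch(tok)):
                tokens.append(tok)
        if tokens:
            sentences.append(" ".join(tokens))
    return sentences

if __name__ == "__main__":
    lexicon = load_lexicon(LEXICON_FILE)
    with open(RAW_WEB_FILE, encoding="utf-8") as fin, \
         open(CLEAN_FILE, "w", encoding="utf-8") as fout:
        for raw in fin:
            for sent in clean_line(raw.strip(), lexicon):
                fout.write(sent + "\n")
```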

After data cleaning, we count the word frequencies of the out-of-domain data to generate the vocabulary for language modeling. Because of case variation, however, the raw word counts may not be correct. Thus, words containing capital letters are transformed into lower-case ones, except for the words in the pronunciation lexicon, and about 110 thousand high-frequency words are selected as the vocabulary.

B. Data Selection

As mentioned above, data cleaning only deletes obviously wrong characters and cannot by itself make the out-of-domain data suitable for language modeling. Our key work for making use of the out-of-domain data is data selection. We choose the algorithm proposed by Moore and Lewis [3] as our selection method. The principle is as follows. Let I be the in-domain corpus and O the out-of-domain corpus to be selected from, and let O_I be the part of O that is well matched to I. For a sentence s drawn from O, P(O_I | s, O) is the posterior probability that s belongs to O_I, so this probability indicates the degree of similarity between s and I. By Bayes' rule, it can be expressed as

\[ P(O_I \mid s, O) = \frac{P(s \mid O_I, O)\,P(O_I \mid O)}{P(s \mid O)} \tag{1} \]

Because O_I is a subset of O, we have

\[ P(s \mid O_I, O) = P(s \mid O_I) \tag{2} \]

Meanwhile, we assume the distribution of O_I is similar to that of I, so

\[ P(s \mid O_I) \approx P(s \mid I) \tag{3} \]

Thus, we obtain

\[ P(O_I \mid s, O) = \frac{P(s \mid I)\,P(O_I \mid O)}{P(s \mid O)} \tag{4} \]

The probability P(O_I | O) is the prior probability of the matched data O_I. It is the same for every sentence s to be selected, so it has no effect on the similarity ranking of the sentences. We therefore only need to focus on the ratio of P(s | I) to P(s | O). For convenience, we turn the division into a subtraction by working in the log domain, giving the score

\[ \log P(s \mid I) - \log P(s \mid O) \tag{5} \]

Define H_O(s) and H_I(s) as the cross-entropy of s according to LMs trained on the out-of-domain data O and the in-domain data I, respectively:

\[ H_O(s) = -\frac{1}{L} \log P(s \mid O) \tag{6} \]

\[ H_I(s) = -\frac{1}{L} \log P(s \mid I) \tag{7} \]

where L is the length of the sentence s.

Moore and Lewis proposed using the cross-entropy difference H_I(s) - H_O(s) as the selection metric instead of log P(s | I) - log P(s | O). Without length normalization, log P(s | I) - log P(s | O) correlates strongly with the length of s, so it would effectively select sentences by their lengths and fail to pick out sentences that are actually similar to the in-domain data. Since H_I(s) - H_O(s) = -(1/L)[log P(s | I) - log P(s | O)], the cross-entropy difference removes this length effect and the sentences can be scored correctly; the sentences with the lowest scores are selected.
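For illustration, a minimal sketch of this scoring step is given below. It assumes two pre-trained LMs exposed through a hypothetical `score(sentence)` method returning the log-probability of a whole sentence; it is not the actual pipeline used in the evaluation, which relied on the XenC toolkit [7] described in Section 3.

```python
def cross_entropy_difference(sentence, in_lm, out_lm):
    """Moore-Lewis score H_I(s) - H_O(s); lower means more in-domain-like.

    `in_lm` and `out_lm` are assumed to expose a score(sentence) method
    returning the log-probability of the whole sentence.
    """
    length = max(len(sentence.split()), 1)   # length normalization term L
    h_in = -in_lm.score(sentence) / length
    h_out = -out_lm.score(sentence) / length
    return h_in - h_out

def select_sentences(candidates, in_lm, out_lm, n_keep):
    """Rank out-of-domain sentences and keep the n_keep best-matching ones."""
    ranked = sorted(candidates,
                    key=lambda s: cross_entropy_difference(s, in_lm, out_lm))
    return ranked[:n_keep]
```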

C. The Use of Selected Data

After data selection, the selected data and the in-domain data are used as our training data. For building FNNLMs, merging these training sets into one corpus is sufficient. For n-gram LMs, however, we build separate LMs from the different data sources and then interpolate them linearly to obtain the final model. The interpolation weights, which are tuned to minimize the perplexity on the development data set, are computed iteratively with the SRI Language Modeling Toolkit [5] using the "compute-best-mix" option. This approach is based on the Expectation-Maximization (EM) algorithm [6]. For the evaluation, the performance of the n-gram LM can be further improved by this linear interpolation.

III. EXPERIMENT

We carry out our experiments on the OpenKWS15 evaluation data and obtain the final LMs used in this evaluation.

A. Data Preparation

The source language of the OpenKWS15 evaluation is Swahili. All of our training data and development test data are provided by NIST [4]. The training data amounts to 80 hours and is divided into three parts. The first part is a 40-hour untranscribed training pool, from which we obtain 1-best decoded text; the second part is a 30-hour transcribed training audio pool, from which 3 hours of transcriptions, named the Very Limited Language Pack (VLLP), are selected from the middle of the conversation sides; the remaining part is a 10-hour transcribed "Tuning Set" audio pool used for parameter setting of the evaluation conditions, of which only 3 hours are transcribed. Thus, the VLLP, the decoded text data, and the Tuning Set together constitute the in-domain data. The out-of-domain data to be selected from is the web-scraped text (202WebText), also provided by NIST. Counting the word frequencies of the in-domain and out-of-domain data, approximately 110 thousand words are selected as the vocabulary for language modeling. In addition, there is another 10 hours of data serving as the development test data, which is used only for development testing and is not incorporated into the training material.

[Fig. 1. Comparison of perplexity on development test data between two methods of selecting data ("selecting data randomly" vs. "selecting data according to cross-entropy difference"). The LMs are built from different amounts of data selected from the out-of-domain data. X-axis: thousands of web data; Y-axis: dev-set perplexity.]

B. Experimental Setup

Based on the algorithm described in Section 2, we use the XenC toolkit [7] for data selection. We build two LMs from the in-domain and out-of-domain data and use this toolkit to compute the cross-entropy difference between them; the out-of-domain corpus is then sorted by its degree of similarity to the in-domain data. After that, we choose different numbers of sentences from the sorted corpus and build LMs with the MITLM toolkit [8]. For comparison, we also randomly select the same numbers of sentences from the original out-of-domain data and build LMs in the same way. By comparing the perplexities on the development test data, we can test the effectiveness of the data selection method.

Once we have obtained the part of the data that is most similar to the in-domain data, we merge it with the VLLP data and the decoded untranscribed text data to train one FNNLM with variance regularization, which reduces the computational complexity [9]. For the n-gram LM we take a different approach: we combine n-gram LMs built from different amounts of selected data with n-gram LMs built separately from the VLLP and the decoded untranscribed text data, using the SRI Language Modeling Toolkit [5], to find out which amount of selected data is best. Note that, regarding the use of in-domain data, the Tuning Set cannot be used to train the LMs for interpolation. In addition, we merge the VLLP, the decoded untranscribed text data, and the selected data into one large corpus to train a single n-gram LM; this is used to verify whether combining individual n-gram LMs from different data sources is better than training one LM on all the data together.
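As an illustration of the interpolation step, the sketch below estimates linear interpolation weights on held-out development text with a simple EM loop, in the spirit of SRILM's "compute-best-mix". The `component_probs` input (per-word probabilities from each component LM on the development text) and all names here are assumptions for the example, not the toolkit's actual interface.

```python
def estimate_mixture_weights(component_probs, n_iters=50):
    """EM for linear LM interpolation weights.

    component_probs: list over dev-set word positions; each entry is a list
    of probabilities p_k(w_t | h_t), one per component LM (assumed to be
    precomputed, e.g. from each LM's per-word debug output).
    Returns one weight per component, summing to 1.
    """
    n_comp = len(component_probs[0])
    weights = [1.0 / n_comp] * n_comp      # start from a uniform mixture
    for _ in range(n_iters):
        counts = [0.0] * n_comp
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            if mix <= 0.0:
                continue
            # E-step: posterior responsibility of each component for this word
            for k, (w, p) in enumerate(zip(weights, probs)):
                counts[k] += w * p / mix
        total = sum(counts)
        # M-step: re-normalize responsibilities into new weights
        weights = [c / total for c in counts]
    return weights

# Example with two components (e.g. a VLLP LM and a selected-web-data LM):
# weights = estimate_mixture_weights([[0.010, 0.002], [0.0005, 0.004]])
```

In practice, the candidate amounts of selected data are compared by interpolating each candidate LM with the in-domain LMs in this way and keeping the combination with the lowest development-set perplexity.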

[Fig. 2. Comparison of perplexity on development test data between two methods of using the selected data ("combining different models" vs. "use all data to train one LM"). The LMs are built from the in-domain data and the selected data in different ways. X-axis: thousands of web data; Y-axis: dev-set perplexity.]

IV. RESULTS AND DISCUSSIONS

For data selection, Fig. 1 compares the method we used with random selection. We select different numbers of sentences from the out-of-domain data to train LMs and test their performance on the development test data. According to this figure, with random selection the perplexity on the development test data decreases as the amount of selected data increases, whereas with our data selection method the perplexity decreases first and then increases slowly. When an LM is built only from the VLLP corpus, its perplexity is 697.856, which is much smaller than any point on the two curves in Fig. 1. This suggests that the out-of-domain data has low correlation with the in-domain data, so when more data is used to train LMs without any data selection, the performance of the LMs keeps improving, as the upper curve in Fig. 1 shows. With regard to the lower curve, using about 50 thousand sentences selected by cross-entropy difference gives the smallest perplexity of 1513.346. As the amount of data increases further, the LM gets worse, which means the excess data is useless for building LMs. A reasonable explanation is that after data selection these 50 thousand sentences are the most similar to the in-domain data, and adding more training data lowers the correlation between the training corpus and the in-domain data. We therefore merge just these 50 thousand sentences with the VLLP data and the decoded untranscribed text data to train the FNNLM. As a result, the perplexity is reduced by 1.7%, from 582.04 to 572.207.

For n-gram language modeling, we interpolate individual LMs linearly to obtain the final one, so we do not restrict ourselves to the 50 thousand sentences. Fig. 2 shows the results of another experiment that helps determine how many sentences should be selected.

TABLE I
Perplexity on development test data for LMs built from different data sources. Notation a) indicates merging all data sources into one corpus to train the LM; notation b) indicates combining individual LMs by linear interpolation.

  data sources                                                     perplexity
  VLLP                                                                697.856
  decoded untranscribed text                                          549.168
  selected data                                                      2451.399
  all web data                                                       3099.715
  VLLP + decoded text                   a) merging all data           541.691
                                        b) linear interpolation       523.115
  VLLP + decoded text + selected data   a) merging all data          1208.763
                                        b) linear interpolation       479.594
  VLLP + decoded text + all web data    a) merging all data          1533.638
                                        b) linear interpolation       481.633

In Fig. 2, the upper curve gives the performance of merging different amounts of selected data, the VLLP data, and the decoded untranscribed text data into one large corpus to train a single LM. As the amount of selected data increases, the perplexity gets worse. The reason is that simply adding selected data to the training corpus lowers the correlation between the whole training data and the in-domain data; after all, the selected data is still out-of-domain data and cannot be treated as equal to the in-domain data. The lower curve in Fig. 2 clearly achieves much better perplexity than the upper one. It corresponds to the other method of building the LM: we combine LMs built from the different source corpora, including the selected data, the VLLP data, and the decoded untranscribed text data, using the SRI Language Modeling Toolkit option "compute-best-mix" [5]. From this curve we can see not only that this method gives a much better LM, but also the approximately suitable amount of selected data: selecting about two million sentences from the out-of-domain data yields nearly the lowest perplexity.

Detailed experimental results are given in Table I. The results show that the method described in this paper reduces the perplexity of the LM built from the in-domain data by 8.3%. Table I also shows that building the LM from all the web data by linear interpolation gives a perplexity similar to using only the selected data. This is because, during the interpolation process, the weight of the all-web-data LM that yields the best perplexity is much smaller than that of the selected-data LM, while the in-domain data receives a larger weight. Data selection has another advantage: the final LM is much smaller, which saves computation time and storage.

Finally, Table II gives the word error rate (WER) of systems trained with different data for the acoustic models (AMs) and the LMs. There are two groups of contrasts in this table.

TABLE II
WER of systems on the Tuning Set and the development test data with different data sources for training the AMs and LMs. The full systems are trained with Gaussian mixture models, multilingual bottleneck features, and data augmentation [10].

  data sources for full systems                                WER (%)
                                                             tuning    dev
  AM: VLLP                 LM: VLLP                            63.0    63.0
  AM: VLLP                 LM: VLLP + selected data            62.4    62.2
  AM: VLLP + decoded text  LM: VLLP + decoded text             61.7    59.8
  AM: VLLP + decoded text  LM: VLLP + decoded text
                               + selected data                 61.1    59.5

The results of the first and second rows show that using the selected data to train the LMs improves the WER by 0.6 percent on the Tuning Set and 0.8 percent on the development test data; the AMs of these two rows are trained using the VLLP only. The results of the third and fourth rows show that using the decoded text data for semi-supervised LM training improves the WER by 0.6 percent on the Tuning Set and 0.3 percent on the development test data; the AMs of these two rows are trained using the VLLP and the decoded text data for semi-supervised training. The improvements of the whole system further demonstrate that our method of building LMs for this evaluation is effective.

The THUEE OpenKWS15 keyword search system includes 12 sub-systems [10, 11]. Among them, 11 sub-systems developed with Kaldi [12] use the final morpheme and word n-gram LMs, and the remaining one, developed with HTK [13], uses the morpheme and word FNNLMs. The morpheme LMs are trained to deal with the out-of-vocabulary (OOV) problem, and they are built with the same method as the word LMs. The ATWV of the final result in the evaluation is 0.5391.

V. CONCLUSION

Experiments on the OpenKWS15 evaluation have demonstrated the effectiveness of our proposed methods for building LMs. Making good use of the out-of-domain data plays an important role in building LMs for this evaluation. Among the language modeling techniques, data selection has been used successfully in many situations; in this evaluation it helps us determine which part of the data can be used for building both the FNNLMs and the n-gram LMs. By interpolating n-gram LMs built from different sources, we can further improve the performance of the n-gram LMs. In the future, we would like to build recurrent neural network (RNN) LMs [14] and long short-term memory (LSTM) LMs [15], given their good performance, and interpolate them with our LMs to find out whether further improvements can be obtained.

REFERENCES

[1] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," The Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.

[3] R. C. Moore and W. Lewis, "Intelligent selection of language model training data," in Proc. ACL 2010 Conference Short Papers, 2010, pp. 220–224.
[4] NIST, "KWS15 keyword search evaluation plan," 2015. [Online]. Available: http://www.nist.gov/itl/iad/mig/upload/KWS15-evalplan-v05.pdf
[5] A. Stolcke et al., "SRILM—An extensible language modeling toolkit," in Proc. Interspeech, 2002.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[7] A. Rousseau, "XenC: An open-source tool for data selection in natural language processing," The Prague Bulletin of Mathematical Linguistics, vol. 100, pp. 73–82, 2013.
[8] B.-J. Hsu and J. Glass, "Iterative language model estimation: Efficient data structure & algorithms," in Proc. Interspeech, 2008.
[9] Y. Shi, W.-Q. Zhang, M. Cai, and J. Liu, "Variance regularization of RNNLM for speech recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4893–4897.
[10] M. Cai, Z. Lv, C. Lu, J. Kang, L. Hui, Z. Zhang, W.-Q. Zhang, and J. Liu, "The THUEE system for the OpenKWS15 evaluation," in Proc. NIST 2015 OpenKWS Workshop, 2015.
[11] M. Cai, Z. Lv, C. Lu, J. Kang, L. Hui, Z. Zhang, and J. Liu, "High-performance Swahili keyword search with very limited language pack: The THUEE system for the OpenKWS15 evaluation," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), to be published.
[12] D. Povey, A. Ghoshal, G. Boulianne et al., "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[13] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, "The HTK Book (revised for HTK version 3.4.1)," Cambridge University, 2009.
[14] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, 2010, pp. 1045–1048.
[15] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Proc. Interspeech, 2012.
