Back-Off Language Model Compression
Boulos Harb, Ciprian Chelba, Jeffrey Dean, Sanjay Ghemawat
{harb,ciprianchelba,jeff,sanjay}@google.com

Outline
- Motivation: Language Model (LM) Size Matters
- Integer Trie LM Representation
- Techniques for LM Compaction:
  - N-gram Map: Block Compression
  - Probabilities and Back-off Weights: Quantization and Block Compression
- Experiments
- Conclusions and Future Work

How Big a Language Model?

The typical Voice Search LM training setup is data rich:
- vocabulary size: 1 million words, OoV rate 0.57%
- training data: 230 billion words from google.com query logs, after text normalization for ASR

Order   # n-grams   pruning     PPL   n-gram hit-ratios
3       15M         entropy     190   47/93/100
3       7.7B        1-1-1       132   97/99/100
5       12.7B       1-1-2-2-2   108   77/88/97/99/100

That is a lot of floating-point numbers to store along with the n-grams!

Is a Bigger 1st-Pass LM Better? YES!

[Figure: Perplexity (left) and Word Error Rate (right) as a function of LM size, in # n-grams (billions, log scale). Both PPL and WER decrease steadily as the LM grows.]

Integer Trie LM Representation

1-1 mapping between n-grams and a dense integer range using an integer trie: two vectors that concatenate, for each n-gram context:
- cumulative diversity count
- list of future words
Look-up time: O((n − 1) · log(V)), in practice much smaller.
Once the n-gram key is identified, the probability and back-off weight are looked up in 2 separate arrays (see the sketch below).
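To make this layout concrete, here is a minimal Python sketch of such a sorted-array integer trie. This is an illustration, not the authors' code: the class name, field names, and the toy data are all hypothetical.

```python
import bisect

class IntegerTrieLM:
    """Per order k: offsets[k] is the cumulative diversity-count vector
    (offsets[k][i]..offsets[k][i+1] delimit context i's future-word list)
    and words[k] is the concatenation of those sorted lists."""

    def __init__(self, offsets, words, logprobs, backoffs):
        self.offsets = offsets    # offsets[k]: one entry per context, plus 1
        self.words = words        # words[k]: concatenated future-word lists
        self.logprobs = logprobs  # logprobs[k][key], indexed by dense key
        self.backoffs = backoffs  # backoffs[k][key], a separate array

    def key(self, ngram):
        """Dense integer key of an n-gram (tuple of word ids), or None.
        One binary search per order gives O((n-1) * log V) look-up."""
        ctx = 0  # the single empty context at order 1
        for k, w in enumerate(ngram):
            lo, hi = self.offsets[k][ctx], self.offsets[k][ctx + 1]
            pos = bisect.bisect_left(self.words[k], w, lo, hi)
            if pos == hi or self.words[k][pos] != w:
                return None
            ctx = pos  # match index = dense key = next order's context id
        return ctx

# Toy example: unigrams {0, 1, 2}; bigrams (0,1), (0,2), (2,0).
lm = IntegerTrieLM(offsets=[[0, 3], [0, 2, 2, 3]],
                   words=[[0, 1, 2], [1, 2, 0]],
                   logprobs=None, backoffs=None)
assert lm.key((0, 1)) == 0 and lm.key((2, 0)) == 2
assert lm.key((1, 0)) is None  # unseen bigram
```

The match position inside a future-word list doubles as the n-gram's dense key (indexing the probability and back-off arrays) and as the context index at the next order.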

Integer Trie LM Compaction

The sequence of entries in the vectors is far from memoryless.
N-gram map: block compression for both diversity and word vectors:
- GroupVar: variable integer length per block
- RandomAccess: fixed integer length per block
- CompressedArray: a version of Huffman coding enhanced with simple operators
Probabilities and back-off weights:
- linear quantization to 1 byte
- block compression of 4-byte bundles cast to int
A sketch of two of these ideas follows.
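As a rough sketch of two of these ideas (my own illustration of the slide's description, not the paper's implementation; GroupVar's per-value length tags and the Huffman-based CompressedArray are omitted), the code below shows 1-byte linear quantization and a RandomAccess-style layout that stores each block of integers at the smallest fixed byte width fitting that block, so a single entry can be fetched without decoding its neighbors.

```python
def quantize(values, num_bins=256):
    """Linear quantization of floats (e.g., logprobs) to 1 byte each.
    Dequantize a code with lo + code * step."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (num_bins - 1) or 1.0  # avoid step 0 on constant input
    return [round((v - lo) / step) for v in values], lo, step

def encode_blocks(ints, block_len=64):
    """RandomAccess-style encoding of non-negative ints: each block is
    stored at the smallest fixed byte width that fits its largest value."""
    widths, payload = [], bytearray()
    for start in range(0, len(ints), block_len):
        block = ints[start:start + block_len]
        width = max(1, (max(block).bit_length() + 7) // 8)
        widths.append(width)
        for v in block:
            payload += v.to_bytes(width, "little")
    return widths, payload

def get(widths, payload, i, block_len=64):
    """Random access to the i-th value; only its own bytes are read.
    (A real implementation would precompute the block start offsets.)"""
    b, r = divmod(i, block_len)
    base = sum(w * block_len for w in widths[:b])  # blocks before b are full
    w = widths[b]
    return int.from_bytes(payload[base + r * w : base + (r + 1) * w], "little")

data = [5, 1000, 3, 70000, 2, 9]
widths, payload = encode_blocks(data, block_len=2)
assert all(get(widths, payload, i, block_len=2) == v
           for i, v in enumerate(data))
```

The block length is the knob behind the speed/size trade-off: longer blocks give a smaller representation rate at the cost of slower look-ups, as quantified in the results table below.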

Experiments

Google Search by Voice LM: 3-gram LM, 13.5 million n-grams (1.0/8.2/4.3 million 1/2/3-grams, respectively).
We measure:
- storage: representation rate, in bytes/n-gram
- speed, relative to the uncompressed LM
- PPL computed on unseen test data

LM Representation Rate vs. Speed

Compression Technique    Block Length   Relative Time   Bytes per n-gram
None                     —              1.0             13.2
Quantized                —              1.0             8.1
CMU 24b, Quantized       —              1.0             5.8
GroupVar                 8              1.4             6.3
GroupVar                 64             1.9             4.8
GroupVar                 256            3.4             4.6
RandomAccess             8              1.5             6.2
RandomAccess             64             1.8             4.6
RandomAccess             256            3.0             4.6
CompressedArray          8              2.3             5.0
CompressedArray          64             5.6             3.2
CompressedArray          256            16.4            3.1
  + logprob/bow arrays   256            19.0            2.6

LM Representation Rate vs. Speed

[Figure: Google Search by Voice LM. Representation rate (bytes/n-gram) vs. time relative to the uncompressed LM, for GroupVar, RandomAccess, and CompressedArray.]

1 billion 3-grams at ~4 bytes/n-gram fit in 4 GB of RAM at acceptable look-up speed.

Conclusions
- A representation rate of 2.6 bytes/n-gram is achievable if speed is not a concern, and 4 bytes/n-gram at reasonable speed.
- A 1st-pass LM using 1 billion n-grams is feasible, with excellent results: a 10% relative reduction in WER over the 13.5 million n-gram LM baseline.

Future Work
- Integrate with a reachable-composition decoder at a real-time factor close to 1.0 (Allauzen, Riley, Schalkwyk: "A Generalized Composition Algorithm for Weighted Finite-State Transducers").
- Scale up to 10 billion n-grams (40-60GB)?
