Language Modeling in the Era of Abundant Data

Ciprian Chelba

Information Theory Forum, Stanford, 01/09/2015


Statistical Modeling in Automatic Speech Recognition

[Block diagram: Speaker's Mind -> W -> Speech Producer -> Speech -> Acoustic Processor -> A -> Linguistic Decoder -> Ŵ; the speech producer and acoustic processor form the acoustic channel, and the acoustic processor plus linguistic decoder form the speech recognizer.]

\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W P(A \mid W) \cdot P(W)

- P(A|W): acoustic model (AM, hidden Markov model); varies depending on the problem (machine translation, spelling correction, soft keyboard input)
- P(W): language model (LM, usually a Markov chain)
- search for the most likely word string Ŵ
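A minimal sketch of the decision rule above, assuming hypothetical am_logprob and lm_logprob callbacks that score a candidate word string under the acoustic and language models (working in log space):

    def decode(candidates, am_logprob, lm_logprob):
        """Return the word string W maximizing log P(A|W) + log P(W).

        `candidates`, `am_logprob` and `lm_logprob` are illustrative stand-ins:
        an iterable of candidate word strings and two scoring callbacks for the
        acoustic and language models, respectively.
        """
        return max(candidates, key=lambda w: am_logprob(w) + lm_logprob(w))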

Language Modeling: Usual Assumptions

- we have a word-level tokenization of the text (not true in all languages, e.g. Chinese)
- some vocabulary is given to us (usually also estimated from data); out-of-vocabulary (OoV) words are mapped to <UNK> ("open" vocabulary LM)
- sentences are assumed to be independent and of finite length; the LM needs to predict the end-of-sentence symbol:

  On my second day , I managed the uphill walk to a waterfall called Skok . </S>
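A minimal sketch of the open-vocabulary convention above; the helper names and the <UNK> spelling are illustrative, not prescribed by the slide:

    def map_oov(tokens, vocab, unk="<UNK>"):
        """Map out-of-vocabulary words to the unknown-word token (open vocabulary LM)."""
        return [w if w in vocab else unk for w in tokens]

    def oov_rate(tokens, vocab):
        """Fraction of running words falling outside the vocabulary (the OoV rate)."""
        return sum(w not in vocab for w in tokens) / len(tokens)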

Language Model Evaluation (1)

Word Error Rate (WER):

    TRN: UP  UPSTATE NEW YORK SOMEWHERE UH OVERALL
    HYP:     UPSTATE NEW YORK SOMEWHERE UH OVER    ALL
         D   0       0   0    0         0  S       I

3 errors / 7 words in the transcript; WER = 43%

Perplexity (PPL) (Jelinek, 1997):

\mathrm{PPL}(M) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \ln P_M(w_i \mid w_1 \ldots w_{i-1}) \right)

- good models are "smoothed" ML estimates: P_M(w_i | w_1 ... w_{i-1}) > ε; this also guarantees a proper probability model over sentences
- other metrics: out-of-vocabulary rate, n-gram hit ratios
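A minimal sketch of the PPL computation above, assuming per-word natural-log probabilities are already available:

    import math

    def perplexity(logprobs):
        """PPL(M) = exp(-1/N * sum_i ln P_M(w_i | w_1 ... w_{i-1})).

        `logprobs` is an illustrative list of per-word natural-log probabilities,
        assumed to include the end-of-sentence predictions.
        """
        return math.exp(-sum(logprobs) / len(logprobs))

    # Toy example: three words, each predicted with probability 0.1 -> PPL = 10.
    print(perplexity([math.log(0.1)] * 3))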

Language Model Smoothing

The Markov assumption leads to the N-gram model:

P_\theta(w_i \mid w_1 \ldots w_{i-1}) = P_\theta(w_i \mid w_{i-N+1} \ldots w_{i-1}), \quad \theta \in \Theta, \; w_i \in \mathcal{V}

Smoothing using Deleted Interpolation:

P_n(w \mid h) = \lambda(h) \cdot P_{n-1}(w \mid h') + (1 - \lambda(h)) \cdot f_n(w \mid h)
P_{-1}(w) = \mathrm{uniform}(\mathcal{V})

where:
- h = (w_{i-n+1} ... w_{i-1}) is the n-gram context, and h' = (w_{i-n+2} ... w_{i-1}) is the back-off context
- the weights λ(h) must be estimated on held-out (cross-validation) data
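A minimal sketch of the deleted-interpolation recursion above; `rel_freq` (relative-frequency tables keyed by context, with the unigram case stored under the empty tuple) and `lam` (the held-out-estimated weights) are assumed inputs:

    def interpolated_prob(w, h, rel_freq, lam, vocab_size):
        """P_n(w|h) = lam(h) * P_{n-1}(w|h') + (1 - lam(h)) * f_n(w|h).

        `rel_freq` maps a context tuple h to a dict of relative frequencies
        f_n(w|h) (unigrams under the empty tuple ()); `lam` returns the
        interpolation weight for a context. The recursion bottoms out in the
        uniform distribution P_{-1}(w) = 1/|V|.
        """
        if h is None:                                  # P_{-1}(w) = uniform(V)
            return 1.0 / vocab_size
        f = rel_freq.get(h, {}).get(w, 0.0)
        shorter = h[1:] if len(h) > 0 else None        # back-off context h'
        return (lam(h) * interpolated_prob(w, shorter, rel_freq, lam, vocab_size)
                + (1.0 - lam(h)) * f)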

Language Model Smoothing: Katz

Katz Smoothing (Katz, 1987) uses Good-Turing discounting:

P_n(w \mid h) =
\begin{cases}
f_n(w \mid h), & C(h,w) > K \\
\frac{(r+1)\, t_{r+1}}{r\, t_r} \cdot f_n(w \mid h), & 0 < C(h,w) = r \leq K \\
\beta(h) \, P_{n-1}(w \mid h'), & C(h,w) = 0
\end{cases}

where:
- t_r is the number of n-grams (types) that occur exactly r times: t_r = |\{ (w_{i-n+1} \ldots w_i) : C(w_{i-n+1} \ldots w_i) = r \}|
- β(h) is the back-off weight ensuring proper normalization
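A minimal sketch of the count-of-counts t_r and the discount ratio used above (a simplified Good-Turing discount; production Katz smoothing also renormalizes so that mass is only redistributed below the cutoff K):

    from collections import Counter

    def count_of_counts(ngram_counts):
        """t_r: number of n-gram types occurring exactly r times."""
        return Counter(ngram_counts.values())

    def discount_ratio(r, t):
        """(r+1) * t_{r+1} / (r * t_r), applied to f_n(w|h) for 0 < C(h,w) = r <= K."""
        return (r + 1) * t.get(r + 1, 0) / (r * t[r])

    counts = {("new", "york"): 2, ("on", "my"): 1,
              ("second", "day"): 1, ("uphill", "walk"): 1}
    t = count_of_counts(counts)
    print(discount_ratio(1, t))    # (1+1) * t_2 / (1 * t_1) = 2/3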

Language Model Smoothing: Kneser-Ney

Kneser-Ney Smoothing (Kneser & Ney, 1995):

P_n(w \mid h) =
\begin{cases}
\frac{C(h,w) - D_1}{C(h)} + \lambda(h) \, P_{n-1}(w \mid h'), & n = N \\
\frac{\mathrm{LeftDivC}(h,w) - D_2}{\sum_w \mathrm{LeftDivC}(h,w)} + \lambda(h) \, P_{n-1}(w \mid h'), & 0 \leq n < N
\end{cases}

where LeftDivC(h, w) = |{v : C(v, h, w) > 0}| is the "left diversity" count for an n-gram (h, w).

See (Goodman, 2001) for a detailed presentation of LM smoothing.
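A minimal sketch of the left-diversity counts that replace raw frequencies in the lower-order Kneser-Ney estimates; the input is assumed to be an iterable of (v, h, w) tuples one order higher than the n-gram being scored:

    from collections import defaultdict

    def left_diversity_counts(higher_order_ngrams):
        """LeftDivC(h, w) = number of distinct left extensions v with C(v, h, w) > 0."""
        seen = defaultdict(set)
        for gram in higher_order_ngrams:
            v, hw = gram[0], gram[1:]     # split off the left extension v
            seen[hw].add(v)
        return {hw: len(vs) for hw, vs in seen.items()}

    trigrams = [("new", "york", "city"), ("in", "york", "city"),
                ("new", "york", "city")]
    print(left_diversity_counts(trigrams)[("york", "city")])   # -> 2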


Language Model Representation: ARPA Back-off

    p(wd3 | wd1, wd2) =
        if (trigram wd1, wd2, wd3 exists)   p_3(wd1, wd2, wd3)
        else if (bigram wd1, wd2 exists)    bo_2(wd1, wd2) * p(wd3 | wd2)
        else                                p(wd3 | wd2)

    p(wd2 | wd1) =
        if (bigram wd1, wd2 exists)         p_2(wd1, wd2)
        else                                bo_1(wd1) * p_1(wd2)

File layout:

    \1-grams:
    p_1  wd  bo_1
    \2-grams:
    p_2  wd1 wd2  bo_2
    \3-grams:
    p_3  wd1 wd2 wd3
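A minimal sketch of the back-off lookup above; `probs[k]` and `backoffs[k]` are assumed dicts keyed by k-gram tuples (real ARPA files store log10 values; plain probabilities are used here for clarity):

    def backoff_prob(ngram, probs, backoffs):
        """p(w | context) via the ARPA back-off recursion sketched above."""
        n = len(ngram)
        if n == 1:
            return probs[1].get(ngram, 0.0)
        if ngram in probs[n]:                          # explicit n-gram entry
            return probs[n][ngram]
        context = ngram[:-1]
        bo = backoffs[n - 1].get(context, 1.0)         # missing context => bo = 1
        return bo * backoff_prob(ngram[1:], probs, backoffs)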

Language Model Size Control: Entropy Pruning

Entropy pruning (Stolcke, 1998) is required for use in the 1st pass: should one remove the n-gram (h, w)?

D[\, q(h)\, p(\cdot \mid h) \,\|\, q(h)\, p'(\cdot \mid h) \,] = q(h) \sum_w p(w \mid h) \log \frac{p(w \mid h)}{p'(w \mid h)}

- prune (h, w) if |D[q(h) p(·|h) || q(h) p'(·|h)]| < pruning threshold
- q(h) from lower-order estimates: q(h) = p(h_1) ... p(h_n | h_1 ... h_{n-1}), or relative frequency: q(h) = f(h)
- greedily reduces LM size at minimal cost in PPL
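A minimal sketch of the pruning criterion above, assuming the conditional distributions before and after removing the candidate n-gram are available as dicts:

    import math

    def pruning_cost(q_h, p_full, p_pruned):
        """Weighted KL divergence D[q(h) p(.|h) || q(h) p'(.|h)].

        `p_full` and `p_pruned` map words to p(w|h) before and after removing
        the candidate n-gram (h, w); `q_h` approximates the context probability q(h).
        """
        return q_h * sum(p * math.log(p / p_pruned[w])
                         for w, p in p_full.items() if p > 0.0)

    # Remove the n-gram when abs(pruning_cost(q_h, p_full, p_pruned)) < threshold.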

On Smoothing and Pruning

[Figure: Perplexity increase with pruned LM size; x-axis: model size in number of n-grams (log2), y-axis: PPL (log2); curves for Katz (Good-Turing), Kneser-Ney, and Interpolated Kneser-Ney.]

- KN degrades very fast with aggressive pruning (< 10% of the original size) (Chelba et al., 2010)
- switching from KN to Katz smoothing: 10% WER gain for voice search


Voice Search LM Training Setup (Chelba & Schalkwyk, 2013)

- spelling-corrected google.com queries, normalized for ASR, e.g. 5th -> fifth
- vocabulary size: 1M words, OoV rate 0.57% (!), excellent n-gram hit ratios
- training data: 230B words

Order | no. n-grams | pruning   | PPL | n-gram hit-ratios
3     | 15M         | entropy   | 190 | 47/93/100
3     | 7.7B        | none      | 132 | 97/99/100
5     | 12.7B       | 1-1-2-2-2 | 108 | 77/88/97/99/100


Is Bigger Better? YES!

[Figure: Perplexity (left axis, roughly 120-260) and Word Error Rate (right axis, roughly 17-20.5%) as a function of LM size; x-axis: number of n-grams in billions, log scale.]

PPL is really well correlated with WER when controlling for vocabulary and training set.

Better Language Models: More Smarts

1-billion word benchmark (Chelba et al., 2013) results:

Model                     | Num. Params | PPL
Katz 5-gram               | 1.74 B      | 79.9
Kneser-Ney 5-gram         | 1.76 B      | 67.6
SNM skip-gram             | 33 B        | 52.9
RNN                       | 20 B        | 51.3
ALL, linear interpolation |             | 41.0

- there are LMs that handily beat the N-gram by leveraging longer context (when available)
- how about increasing the amount of data, when we have it?
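A minimal sketch of the linear interpolation in the "ALL" row above; the mixture weights would be tuned on held-out data, and the model callbacks are illustrative stand-ins:

    def interpolate(models, weights, w, history):
        """Linearly interpolate several LMs: sum_m lambda_m * P_m(w | history).

        `models` is a list of callables returning P_m(w | history); `weights`
        are mixture weights summing to 1 (tuned on held-out data in practice).
        """
        return sum(lam * m(w, history) for lam, m in zip(weights, models))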

Better Language Models: More Smarts, More Data? Ideally Both

10/100 billion word query data benchmark results*:

Model             | Data Amount | Num. Params | PPL
Katz 6-gram       | 10B         | 3.2 B       | 123.9
Kneser-Ney 6-gram | 10B         | 4.1 B       | 114.5
SNM skip-gram     | 10B         | 25 B        | 111.0
RNN               | 10B         | 4.1 B       | 111.1
Katz 6-gram       | 100B        | 19.6 B      | 92.7
Kneser-Ney 6-gram | 100B        | 24.5 B      | 87.9
RNN               | 100B        | 4.1 B       | 101.0

- more data and a bigger model is an easy way to get solid gains
- complex models had better scale up gracefully (the fixed-size RNN falls behind at 100B words)
- KN smoothing loses its edge over Katz

* Thanks to Babak Damavandi for the RNN experimental results.

More Data Is Not Always a Winner: Query Stream Non-stationarity (1)

- USA training data: two sets, spanning XX months and X months
- test data: 10k queries, Sept-Dec 2008
- very little impact on the OoV rate for a 1M-word vocabulary: 0.77% (X-months vocabulary) vs. 0.73% (XX-months vocabulary)


More Data Is Not Always a Winner: Query Stream Non-stationarity (2)

3-gram LM      | Training Set | Test Set PPL
unpruned       | X months     | 121
unpruned       | XX months    | 132
entropy pruned | X months     | 205
entropy pruned | XX months    | 209

- bigger is not always better*: 10% relative reduction in PPL when using the most recent X months instead of XX months
- no significant difference after pruning, in either PPL or WER

* The vocabularies are mismatched, so the PPL comparison is troublesome; the difference would be higher if we used a fixed vocabulary.

More Locales

- training data across 3 locales: USA, GBR, AUS, spanning the same amount of time, ending in Aug 2008
- test data: 10k queries/locale, Sept-Dec 2008

Out-of-Vocabulary Rate:

              Training Locale
Test Locale   USA    GBR    AUS
USA           0.7    1.3    1.6
GBR           1.3    0.7    1.3
AUS           1.3    1.1    0.7

- locale-specific vocabulary halves the OoV rate

Locale Matters (2)

Perplexity of the unpruned LM:

              Training Locale
Test Locale   USA    GBR    AUS
USA           132    234    251
GBR           260    110    224
AUS           276    210    124

- locale-specific LM halves the PPL of the unpruned LM


Open Problems

- Entropy of text from a given source: how much are we leaving on the table?
- How much data/model is enough for a given source: does such a bound exist for N-gram models?
- More data, relevance, transfer learning: not all data is created equal.
- Conditional ML estimation: LM estimation should take into account the channel model.


Entropy of English

- High variance, depending on the estimate and the source of data; 0.1-0.2 bits/char is a significant difference in PPL at the word level!
- (Cover & King, 1978): 1.3 bits/char
- (Brown, Della Pietra, Mercer, Della Pietra, & Lai, 1992): 1.75 bits/char
- 1-billion word corpus: ≈ 1.17 bits/char for KN*, ≈ 1.03 bits/char for the best reported LM mixing skip-gram SNM with RNN
- 10- and 100-billion word query corpora: ≈ 1.43 and 1.35 bits/char for KN, respectively

* Modulo OoV word modeling.
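To make the "0.1-0.2 bits/char" remark concrete, a back-of-the-envelope conversion between character-level entropy and word-level perplexity is sketched below; the average word length of about 5.2 characters (whitespace included) is an assumption chosen so the numbers roughly line up with the 1-billion word benchmark slide, not a figure from the talk:

    # Back-of-the-envelope bits/char <-> word-level PPL conversion. The average
    # word length (~5.2 characters, whitespace included) is an assumption, not a
    # figure from the talk.
    AVG_CHARS_PER_WORD = 5.2

    def word_ppl_from_bits_per_char(bits_per_char):
        return 2.0 ** (bits_per_char * AVG_CHARS_PER_WORD)

    print(word_ppl_from_bits_per_char(1.17))   # ~68, near the KN 5-gram PPL of 67.6
    print(word_ppl_from_bits_per_char(1.03))   # ~41, near the best mixture PPL of 41.0

Under this rough conversion, a 0.14 bits/char gap corresponds to a word-level PPL drop from roughly 68 to 41.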

Abundant Data: How Much is Enough for Modeling a Given Source?

A couple of observations:
- one can prune an LM to about 10% of its unpruned size without significant impact on PPL
- increasing the amount of data and the model size becomes unproductive after a while

For a given source and N-gram order, is there a data size beyond which there is no benefit to model quality?


Abundant Data: Not All Data is Created Equal

- It is not always possible to find very large amounts of data that are well matched to a given application/test set.
- E.g., when building an LM for SMS text we may have very little such data, quite a bit more from posts on social networks, and a lot of text from a web crawl.
- LM adaptation: leveraging data in different amounts, and of various degrees of relevance* to a given test set.

* Relevance of data to a given test set is hard to describe, but you know it when you see it.


References

Brown, P. F., Pietra, V. J. D., Mercer, R. L., Pietra, S. A. D., & Lai, J. C. (1992, March). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), 31–40. Available from http://dl.acm.org/citation.cfm?id=146680.146685

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., et al. (2013). One billion word benchmark for measuring progress in statistical language modeling.

Chelba, C., Neveitt, W., Xu, P., & Brants, T. (2010). Study on interaction between entropy pruning and Kneser-Ney smoothing. In Proc. Interspeech (pp. 2242–2245). Makuhari, Japan.

Chelba, C., & Schalkwyk, J. (2013). Empirical exploration of language modeling for the google.com query stream as applied to mobile voice search. In Mobile speech and advanced natural language solutions (pp. 197–229). New York: Springer. Available from http://www.springer.com/engineering/signals/book/978-1-4614-6017-6

Cover, T., & King, R. (1978, September). A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4), 413–421. Available from http://dx.doi.org/10.1109/TIT.1978.1055912

Goodman, J. (2001). A bit of progress in language modeling, extended version (Tech. Rep.). Microsoft Research.

Jelinek, F. (1997). Statistical methods for speech recognition. Cambridge, MA, USA: MIT Press.

Katz, S. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35, 400–401.

Kneser, R., & Ney, H. (1995). Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 181–184).

Stolcke, A. (1998). Entropy-based pruning of back-off language models. In Proceedings of the DARPA News Transcription and Understanding Workshop (pp. 270–274). Lansdowne, VA.
