Language Modeling for Automatic Speech Recognition Meets the Web:

Google Search by Voice

Ciprian Chelba, Johan Schalkwyk, Boulos Harb, Carolina Parada, Cyril Allauzen, Leif Johnson, Michael Riley, Peng Xu, Preethi Jyothi, Thorsten Brants, Vida Ha, Will Neveitt

02/03/2012 Ciprian Chelba et al., Voice Search Language Modeling – p. 1

Statistical Modeling in Automatic Speech Recognition

[Figure: source-channel model of ASR. Speaker's Mind → W → Speech Producer → Speech → Acoustic Processor → A → Linguistic Decoder → Ŵ. The Speaker and the Acoustic Channel form the noisy channel; the Acoustic Processor and the Linguistic Decoder form the Speech Recognizer.]

Ŵ = argmax_W P(W|A) = argmax_W P(A|W) · P(W)

P(A|W): acoustic model (Hidden Markov Model)
P(W): language model (Markov chain)

search for the most likely word string Ŵ
due to the large vocabulary size (1M words), an exhaustive search is intractable

Language Model Evaluation (1)

Word Error Rate (WER):

TRN: UP UPSTATE NEW YORK SOMEWHERE UH      OVER
HYP:    UPSTATE NEW YORK SOMEWHERE UH ALL  ALL
      D    0     0    0      0      0  I    S

3 errors / 7 words in transcript; WER = 43%

Perplexity (PPL):

PPL(M) = exp( -1/N Σ_{i=1}^{N} ln P_M(w_i | w_1 ... w_{i-1}) )

good models are smooth: P_M(w_i | w_1 ... w_{i-1}) > ε
other metrics: out-of-vocabulary rate, n-gram hit ratios
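Both metrics are easy to compute from first principles; a minimal sketch (the helper names are mine, and wer recovers the D/I/S cost via plain Levenshtein distance over words):

```python
import math

def wer(ref, hyp):
    """Word error rate: minimum substitutions + insertions + deletions
    (Levenshtein distance over words) divided by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

def perplexity(probs):
    """PPL(M) = exp(-1/N * sum_i ln P_M(w_i | w_1 ... w_{i-1})),
    given the per-word conditional probabilities assigned by the model."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

ref = "UP UPSTATE NEW YORK SOMEWHERE UH OVER".split()
hyp = "UPSTATE NEW YORK SOMEWHERE UH ALL ALL".split()
print(round(wer(ref, hyp), 2))   # 3 errors / 7 reference words ≈ 0.43
```

A model that assigns every word probability 1/V has perplexity exactly V, which is why PPL is often read as an effective vocabulary size.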


Language Model Evaluation (2)

Web Score (WebScore)

TRN: TAI PAN RESTAURANT PALO ALTO
HYP: TAIPAN RESTAURANTS PALO ALTO

the two queries produce the same search results, so the hypothesis does not count as an error: if the top search result for the recognized query is identical to that for the manually transcribed query, it scores as correct


Language Model Smoothing

Markov assumption: P_θ(w_i | w_1 ... w_{i-1}), θ ∈ Θ, w_i ∈ V

Smoothing using Deleted Interpolation:

P_n(w|h) = λ(h) · P_{n-1}(w|h′) + (1 − λ(h)) · f_n(w|h)
P_{-1}(w) = uniform(V)

Parameters (the smoothing weights λ(h) must be estimated on cross-validation data):

θ = {λ(h); count(w|h), ∀(w|h) ∈ T}
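A minimal sketch of the recursion, with a single fixed interpolation weight standing in for the bucketed, cross-validated λ(h):

```python
from collections import defaultdict

# Deleted-interpolation sketch. A real system estimates bucketed weights
# lambda(h) on held-out data; here one fixed lambda per back-off step
# stands in for them, and the recursion bottoms out at uniform(V).

def train(corpus, order=2):
    ngram, ctx = defaultdict(int), defaultdict(int)
    for sent in corpus:
        words = sent.split()
        for i in range(len(words)):
            for k in range(order):          # context lengths k = 0 .. order-1
                if i - k < 0:
                    continue
                h = tuple(words[i - k:i])
                ngram[h + (words[i],)] += 1
                ctx[h] += 1
    return ngram, ctx

def prob(w, h, ngram, ctx, vocab, lam=0.5):
    """P_n(w|h) = lam * P_{n-1}(w|h') + (1 - lam) * f_n(w|h)."""
    f = ngram[h + (w,)] / ctx[h] if ctx[h] else 0.0
    lower = 1.0 / len(vocab) if not h else prob(w, h[1:], ngram, ctx, vocab, lam)
    return lam * lower + (1 - lam) * f
```

For any context seen in training, the smoothed probabilities still sum to 1 over the vocabulary, and every word gets probability > 0: exactly the "good models are smooth" property.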

Voice Search LM Training Setup

correct google.com queries^a, normalized for ASR, e.g. 5th -> fifth
vocabulary size: 1M words, OoV rate 0.57% (!), excellent n-gram hit ratios
training data: 230B words

Order  no. n-grams  pruning    PPL  n-gram hit-ratios
3      15M          entropy    190  47/93/100
3      7.7B         none       132  97/99/100
5      12.7B        1-1-2-2-2  108  77/88/97/99/100

a. Thanks Mark Paskin


Distributed LM Training

Input: key=ID, value=sentence/doc
Intermediate: key=word, value=1
Output: key=word, value=count

Map chooses the reduce shard based on a hash of the word (red or blue in the figure)^a

a. T. Brants et al., Large Language Models in Machine Translation
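The three stages can be simulated in-process; a toy version of the count pipeline (shard count and data are invented):

```python
from collections import defaultdict

# Toy simulation of distributed n-gram/word counting: map emits (word, 1),
# the shuffle routes each word to a reduce shard by hash, and each reducer
# sums the counts for the words assigned to its shard.

NUM_SHARDS = 2

def map_phase(docs):
    for doc_id, sentence in docs.items():   # key=ID, value=sentence
        for word in sentence.split():
            yield word, 1                   # key=word, value=1

def shuffle(pairs):
    shards = [defaultdict(list) for _ in range(NUM_SHARDS)]
    for word, one in pairs:
        shards[hash(word) % NUM_SHARDS][word].append(one)
    return shards

def reduce_phase(shards):
    counts = {}
    for shard in shards:
        for word, ones in shard.items():
            counts[word] = sum(ones)        # key=word, value=count
    return counts

docs = {1: "the cat sat", 2: "the cat ran"}
counts = reduce_phase(shuffle(map_phase(docs)))
```

Because shard assignment is a pure function of the word, all counts for a given word land on the same reducer, which is what makes the final sum correct.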


Using Distributed LMs

load each shard into the memory of one machine
Bottleneck: in-memory vs. network access, at X-hundred nanoseconds vs. Y milliseconds (a factor of about 10,000)
Example: translating one sentence requires approx. 100k n-gram lookups; 100k * 7 ms = 700 seconds per sentence if issued one at a time
Solution: batched processing^a; 25 batches of 4k n-grams each: less than 1 second

a. T. Brants et al., Large Language Models in Machine Translation
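The slide's arithmetic can be checked directly; a toy cost model (the 7 ms round-trip and the 4k batch size come from the slide, the function itself is illustrative and ignores server-side work):

```python
# Illustrative arithmetic for batched LM lookups: with a ~7 ms network
# round-trip, per-n-gram requests are hopeless, while a handful of batched
# round-trips per sentence are cheap.

ROUND_TRIP_MS = 7.0
NGRAMS_PER_SENTENCE = 100_000

def lookup_time_ms(num_ngrams, batch_size):
    """One round-trip per batch; server-side lookup time is ignored."""
    num_batches = -(-num_ngrams // batch_size)   # ceiling division
    return num_batches * ROUND_TRIP_MS

print(lookup_time_ms(NGRAMS_PER_SENTENCE, 1) / 1000)      # 700.0 seconds
print(lookup_time_ms(NGRAMS_PER_SENTENCE, 4000) / 1000)   # 0.175 seconds
```

With a batch size of 4000, the 100k lookups collapse into 25 round-trips, matching the "25 batches, less than 1 second" figure.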


ASR Decoding Interface

First pass LM: finite state machine (FSM) API
states: n-gram contexts
arcs: for each state/context, one arc per n-gram in the LM, plus a back-off transition
trouble: need all n-grams in RAM (tens of billions)

Second pass LM: lattice rescoring
states: n-gram contexts, after expansion to the rescoring LM order
arcs: {new states} X {no. arcs in original lattice}
good: works with a distributed LM and large batch RPC

Language Model Pruning

Entropy pruning is required for use in the 1st pass: should one remove n-gram (h, w)?

D[ q(h) p(·|h) ‖ q(h) p′(·|h) ] = q(h) Σ_w p(w|h) log [ p(w|h) / p′(w|h) ]

prune (h, w) if | D[ q(h) p(·|h) ‖ q(h) p′(·|h) ] | < pruning threshold
lower-order estimates: q(h) = p(h_1) ... p(h_n | h_1 ... h_{n-1}), or relative frequency: q(h) = f(h)
very effective in reducing LM size at minimal cost in PPL
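The criterion is straightforward to evaluate; a sketch on toy distributions (the words, probabilities, and threshold are invented):

```python
import math

# Sketch of the entropy-pruning criterion: the context-weighted KL
# divergence between the original conditional p(.|h) and the pruned
# model's p'(.|h). p and p_prime are toy dicts over words; q_h is the
# weight of the context h.

def weighted_kl(q_h, p, p_prime):
    """D[q(h) p(.|h) || q(h) p'(.|h)] = q(h) * sum_w p(w|h) log(p(w|h)/p'(w|h))."""
    return q_h * sum(pw * math.log(pw / p_prime[w])
                     for w, pw in p.items() if pw > 0.0)

def should_prune(q_h, p, p_prime, threshold=1e-4):
    """Prune when removing the n-gram barely changes the distribution."""
    return abs(weighted_kl(q_h, p, p_prime)) < threshold

p = {"york": 0.6, "jersey": 0.3, "haven": 0.1}
print(should_prune(0.01, p, {"york": 0.59, "jersey": 0.3, "haven": 0.11}))  # True
print(should_prune(0.01, p, {"york": 0.2, "jersey": 0.4, "haven": 0.4}))    # False
```

When the back-off estimate p′(·|h) is close to p(·|h), the divergence is tiny and the explicit n-gram can be dropped; rare contexts (small q(h)) are pruned first.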


On Smoothing and Pruning (1)

4-gram model trained on 100M words with a 100k vocabulary, pruned to 1% of raw size using SRILM; tested on 690k words

Perplexity:

LM smoothing                   raw    pruned
Ney                            120.5  197.3
Ney, Interpolated              119.8  198.1
Witten-Bell                    118.8  196.3
Witten-Bell, Interpolated      121.6  202.3
Ristad                         126.4  203.6
Katz (Good-Turing)             119.8  198.1
Kneser-Ney                     114.5  285.1
Kneser-Ney, Interpolated       115.8  274.3
Kneser-Ney (CG)                116.3  280.6
Kneser-Ney (CG, Interpolated)  115.8  274.3


On Smoothing and Pruning (2)

[Figure: perplexity increase with pruned LM size. x-axis: model size in number of n-grams (log2); y-axis: PPL (log2); curves for Katz (Good-Turing), Kneser-Ney, and Interpolated Kneser-Ney]

the baseline LM is pruned to 0.1% of its raw size!
switching from Kneser-Ney to Katz smoothing: 10% relative WER gain

Billion n-gram 1st Pass LM (1)

LM representation rate:

Compression Technique  Block Length  Rel. Time  Rep. Rate (B/n-gram)
None                   --            1.0        13.2
Quantized              --            1.0        8.1
CMU 24b, Quantized     --            1.0        5.8
GroupVar               8             1.4        6.3
                       64            1.9        4.8
                       256           3.4        4.6
RandomAccess           8             1.5        6.2
                       64            1.8        4.6
                       256           3.0        4.6
CompressedArray        8             2.3        5.0
                       64            5.6        3.2
                       256           16.4       3.1
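The GroupVar rows refer to group-variable-length (group-varint-style) encoding. As an illustration of the idea only, not the exact format used in the compressed LMs, here is a sketch that packs four integers behind a single tag byte recording each value's byte width:

```python
# Simplified group-varint sketch: four unsigned 32-bit values share one
# tag byte; two bits per value record how many bytes (1-4) that value
# occupies. Small values (the common case for n-gram ids and counts)
# then cost 1-2 bytes instead of a fixed 4.

def encode_group(values):
    assert len(values) == 4
    tag, body = 0, b""
    for i, v in enumerate(values):
        nbytes = max(1, (v.bit_length() + 7) // 8)   # 1..4 bytes
        tag |= (nbytes - 1) << (2 * i)               # 2 tag bits per value
        body += v.to_bytes(nbytes, "little")
    return bytes([tag]) + body

def decode_group(data):
    tag, pos, values = data[0], 1, []
    for i in range(4):
        nbytes = ((tag >> (2 * i)) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return values

vals = [3, 300, 70000, 1]
packed = encode_group(vals)
print(len(packed), decode_group(packed))   # 8 bytes instead of 16
```

Because the tag byte is read first, the decoder knows every value's width up front, which keeps decoding branch-light and fast relative to per-byte varint schemes.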


Billion n-gram 1st Pass LM (2)

[Figure: Google Search by Voice LM. Representation rate (B/n-gram) vs. lookup time relative to uncompressed, for GroupVar, RandomAccess, and CompressedArray]

1B 3-grams in 5GB of RAM at acceptable lookup speed^a

a. B. Harb, C. Chelba, J. Dean and S. Ghemawat, Back-Off Language Model Compression, Interspeech 2009

Is Bigger Better? YES!

[Figure: Word Error Rate (left) and WebScore error rate (100% − WebScore, right) as a function of LM size, # n-grams (B, log scale)]

8%/10% relative gain in WER/WebScore^a

a. With Cyril Allauzen, Johan Schalkwyk, Mike Riley. May reachable composition CLoG be with you!

Is Bigger Better? YES!

[Figure: Perplexity (left) and Word Error Rate (right) as a function of LM size, # n-grams (B, log scale)]

PPL is really well correlated with WER!

Is Even Bigger Better? YES!

[Figure: WER (left) and WebError (100 − WebScore, right) as a function of 5-gram LM size, # 5-grams (B)]

5-gram: 11% relative gain in WER/WebScore

Is Even Bigger Better? YES!

[Figure: Perplexity (left) and WER (right) as a function of 5-gram LM size, # 5-grams (B)]

Again, PPL is really well correlated with WER!

Detour: Search vs. Modeling error

Ŵ = argmax_W P(A, W|θ)

If the correct W* ≠ Ŵ we have an error:
P(A, W*|θ) > P(A, Ŵ|θ): search error
P(A, W*|θ) < P(A, Ŵ|θ): modeling error

Conventional wisdom has it that in ASR, search error < modeling error.
Corollary: improvements come primarily from using better models; integration in the decoder/search is second order!

Lattice LM Rescoring

Pass  Language Model  PPL  WER   WebScore
1st   15M 3g          191  18.7  72.2
1st   1.6B 5g         112  16.9  75.2
2nd   15M 3g          191  18.8  72.6
2nd   1.6B 3g         112  16.9  75.3
2nd   12B 5g          108  16.8  75.4

10% relative reduction in the remaining WER and WebScore error
1st-pass gains matched in ProdLm lattice rescoring^a, at negligible impact on the real-time factor

a. Older front end, 0.2% WER difference


Lattice Depth Effect on LM Rescoring

[Figure: Perplexity (left) and WER (right) as a function of lattice depth (lattice density, # links per transcribed word, log scale)]

the LM becomes ineffective after a certain lattice depth

N-best Rescoring

N-best rescoring experimental setup: minimal coding effort for testing LMs; all you need to do is assign a score to a sentence

Experiment         LM       WER   WebScore
SpokenLM baseline  13M 3g   17.5  73.3
lattice rescoring  12B 5g   16.1  76.3
10-best rescoring  1.6B 5g  16.4  75.2

a good LM will immediately show its potential, even on as little as 10-best alternates rescoring!
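That minimal coding effort amounts to re-sorting the list under a combined score; a sketch (the hypotheses, scores, interpolation weight, and toy LM are invented for illustration):

```python
# Sketch of N-best rescoring: each hypothesis carries its first-pass
# acoustic score; a new LM supplies a second score, and we rerank by a
# weighted combination. All values below are made up.

def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """nbest: list of (sentence, acoustic_logprob) pairs. Returns the
    list re-sorted by acoustic + lm_weight * new-LM log-probability."""
    return sorted(nbest,
                  key=lambda h: h[1] + lm_weight * lm_score(h[0]),
                  reverse=True)

# A toy "LM" that prefers hypotheses containing a known business name.
def toy_lm_score(sentence):
    return 0.0 if "tai pan" in sentence else -5.0

nbest = [("taipan restaurants palo alto", -10.0),
         ("tai pan restaurant palo alto", -10.5)]
best = rescore_nbest(nbest, toy_lm_score)[0][0]
print(best)   # 'tai pan restaurant palo alto'
```

The first-pass winner loses once the new LM score is folded in, which is exactly how a better LM shows its potential on a 10-best list.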


Query Stream Non-stationarity (1)

USA training data^a: XX months vs. X months
test data: 10k queries, Sept-Dec 2008^b
very little impact on the OoV rate for a 1M-word vocabulary: 0.77% (X-months vocabulary) vs. 0.73% (XX-months vocabulary)

a. Thanks Mark Paskin
b. Thanks Zhongli Ding for query selection.


Query Stream Non-stationarity (2)

3-gram LM       Training Set  Test Set PPL
unpruned        X months      121
unpruned        XX months     132
entropy pruned  X months      205
entropy pruned  XX months     209

bigger is not always better^a: 10% relative reduction in PPL when using the most recent X months instead of XX months
no significant difference after pruning, in either PPL or WER

a. The vocabularies are mismatched, so the PPL comparison is a bit troublesome. The difference would be higher if we used a fixed vocabulary.

More Locales

training data across 3 locales^a: USA, GBR, AUS, spanning the same amount of time, ending in Aug 2008
test data: 10k queries/locale, Sept-Dec 2008

Out-of-Vocabulary Rate:

Training Locale  Test: USA  Test: GBR  Test: AUS
USA              0.7        1.3        1.6
GBR              1.3        0.7        1.3
AUS              1.3        1.1        0.7

a locale-specific vocabulary halves the OoV rate

a. Thanks Mark Paskin

Locale Matters (2)

Perplexity of the unpruned LM:

Training Locale  Test: USA  Test: GBR  Test: AUS
USA              132        234        251
GBR              260        110        224
AUS              276        210        124

a locale-specific LM halves the PPL of the unpruned LM


Locale Matters (3)

Perplexity of the pruned LM:

Training Locale  Test: USA  Test: GBR  Test: AUS
USA              210        369        412
GBR              442        150        342
AUS              422        293        171

a locale-specific LM halves the PPL of the pruned LM as well


Discriminative Language Modeling

ML estimation from correct text only is of limited use in decoding: the back-off n-gram assigns -log P("a navigate to") = 0.266
need parallel data (A, W*): a significant amount can be mined from voice search logs using confidence filtering
but then the first-pass scores discriminate perfectly on such data, so there is nothing to learn?^a

a. Work with Preethi Jyothi, Leif Johnson, Brian Strope [ICASSP '12, to be published]


Experimental Setup

confidence filtering on the baseline AM/LM yields reference transcriptions (≈ manually transcribed data)
a weaker AM (ML-trained, single-mixture Gaussians) generates the N-best lists and ensures sufficient errors to train the DLMs
the largest models are trained on ∼80,000 hours of speech (re-decoding is expensive!), ∼350 million words
different from previous work [Roark et al., ACL '04], which cross-validates the baseline LM training to generalize better to unseen data


N-best Reranking Oracle Error Rates on weakAM-dev/T9b

[Figure 1: oracle error rates up to N=200; SER and WER on weakAM-dev and T9b as a function of N]

DLM at Scale: Distributed Perceptron

Features: 1st-pass lattice costs and n-gram word features [Roark et al., ACL '04]
Rerankers: parameter weights w_{t+1} at iteration t+1, for reranker models trained on N utterances split across C shards (Δ_c is the update computed on shard c):

Perceptron [Collins, EMNLP '02]: w_{t+1} = w_t + Σ_{c=1}^{C} Δ_c
DistributedPerceptron [McDonald et al., ACL '10]: w_{t+1} = w_t + (1/C) Σ_{c=1}^{C} Δ_c
AveragedPerceptron: w^{av}_{t+1} = (t/(t+1)) · w^{av}_t + w_t/(t+1) + Σ_{c=1}^{C} SΔ_c / (N·(t+1))
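The DistributedPerceptron update can be sketched end-to-end; this toy version uses dense feature vectors and invented data rather than lattice costs and n-gram features:

```python
# Sketch of the DistributedPerceptron (iterative parameter mixing, after
# McDonald et al.): each of C shards computes a standard perceptron update
# on its own utterances, and the shared weights take the average of the
# per-shard deltas: w_{t+1} = w_t + (1/C) * sum_c Delta_c.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def shard_delta(w, pairs):
    """Perceptron update on one shard: for each (correct, competitor)
    feature-vector pair, move toward 'correct' whenever it is misranked."""
    delta = [0.0] * len(w)
    for good, bad in pairs:
        cur = [wi + di for wi, di in zip(w, delta)]
        if dot(cur, good) <= dot(cur, bad):
            delta = [d + g - b for d, g, b in zip(delta, good, bad)]
    return delta

def distributed_perceptron(shards, dim, epochs=3):
    w = [0.0] * dim
    for _ in range(epochs):
        deltas = [shard_delta(w, s) for s in shards]    # map: per-shard updates
        w = [wi + sum(ds) / len(shards)                 # reduce: average deltas
             for wi, ds in zip(w, zip(*deltas))]
    return w

# Two toy shards: the correct hypothesis always has mass on feature 0.
shards = [[([1.0, 0.0], [0.0, 1.0])],
          [([2.0, 0.0], [0.0, 2.0])]]
w = distributed_perceptron(shards, dim=2)
print(dot(w, [1.0, 0.0]) > dot(w, [0.0, 1.0]))   # True
```

Averaging the per-shard deltas (rather than summing them) keeps the update magnitude independent of the number of shards, which is what makes the scheme behave like a single sequential perceptron as C grows.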

MapReduce Implementation

[Figure: MapReduce implementation. An SSTable of utterances feeds the Rerank-Mappers, which read the SSTable FeatureWeights: Epoch t through a per-map-chunk SSTableService cache; Identity-Mappers pass the old weights through; Reducers write the SSTable FeatureWeights: Epoch t+1]


WERs on weakAM-dev

Model      WER (%)
Baseline   32.5
DLM-1gram  29.5
DLM-2gram  28.3
DLM-3gram  27.8
ML-3gram   29.8

Our best DLM gives ∼4.7% absolute (∼15% relative) improvement over the 1-best baseline WER, and ∼2% absolute (∼6% relative) improvement over an ML n-gram LM trained on the same data T.

Results on T9b

Data set     Baseline  Reranking, ML LM  Reranking, DLM
weakAM-test  39.1      36.7              34.2
T9b          14.9      14.6              14.3^a

5% relative gains in WER
Note: improvements are cut in half when comparing our models trained on data T with a reranker using an n-gram LM trained on T.

a. Statistically significant at p < 0.05.


Open Problems in Language Modeling for ASR and Beyond

LM adaptation: bigger is not always better. Making use of related, yet not fully matched data, e.g.:
Web text should help the query LM?
related locales (GBR, AUS) should help USA?

discriminative LM: ML estimation from correct text is of limited use in decoding, where the LM is presented with atypical n-grams
can we sample from correct text instead of needing parallel data (A, W*)?

LM smoothing, estimation: neural network LMs are staging a comeback.

ASR Success Story: Google Search by Voice

What contributed to success:
excellent language model built from the query stream
user expectations clearly set by the existing text app
clean speech: users are motivated to articulate clearly
app phones (Android, iPhone) do high-quality speech capture
speech transferred error-free to the ASR server over IP

Challenges:
measuring progress: manually transcribing data runs at about the same word error rate as the system (15%)

ASR Core Technology

Current state:
automatic speech recognition is an incredibly complex problem that remains fundamentally unsolved
data availability and computing have changed significantly: 2-3 orders of magnitude more of each

Challenges and directions:
re-visit (simplify!) modeling choices made on corpora of modest size
multi-linguality built in from the start
better feature extraction, acoustic modeling
