Large Scale Distributed Acoustic Modeling With Back-off N-grams

Google Search by Voice
Ciprian Chelba, Peng Xu, Fernando Pereira, Thomas Richardson

04/23/2013 Ciprian Chelba et al., BAM: Large Scale Distributed Acoustic Modeling – p. 1

Statistical Modeling in Automatic Speech Recognition

[Diagram: source-channel view of ASR. The speaker's mind produces the word string W; the speech producer (speaker) turns it into speech; the acoustic processor turns speech into acoustic observations A; the linguistic decoder outputs the hypothesis Ŵ. Speaker + acoustic processor form the acoustic channel; acoustic processor + linguistic decoder form the speech recognizer.]

Ŵ = argmax_W P(W|A) = argmax_W P(A|W) · P(W)

P(A|W): acoustic model (Hidden Markov Model)
P(W): language model (Markov chain)
decoding searches for the most likely word string Ŵ
due to the large vocabulary size (1M words), an exhaustive search is intractable
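The decision rule above can be sketched in the log domain, where the product P(A|W) · P(W) becomes a sum. The hypothesis strings and scores below are made up for illustration; a real decoder searches a lattice rather than enumerating word strings:

```python
def decode(hypotheses):
    """Pick the word string W maximizing P(A|W) * P(W).

    `hypotheses` maps each candidate W to a pair of log scores
    (log P(A|W), log P(W)); in log space the product is a sum.
    """
    return max(hypotheses, key=lambda w: sum(hypotheses[w]))

# Toy, made-up scores for illustration only.
nbest = {
    "recognize speech": (-12.0, -3.5),    # total: -15.5
    "wreck a nice beach": (-11.5, -6.0),  # total: -17.5
}
print(decode(nbest))  # prints "recognize speech"
```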

Voice Search LM Training Setup

training data: 230B words of google.com queries, normalized for ASR, e.g. 5th -> fifth
vocabulary size: 1M words, OoV rate 0.57% (!), excellent n-gram hit ratios

Order  no. n-grams  pruning    PPL  n-gram hit-ratios
3      15M          entropy    190  47/93/100
3      7.7B         none       132  97/99/100
5      12.7B        1-1-2-2-2  108  77/88/97/99/100


Is a Bigger LM Better? YES!

[Figure: perplexity (left axis, 100 to 200) and WER (right axis, 16.5% to 19%) as a function of 5-gram LM size, from 0.01B to 10B 5-grams, log scale]

PPL is very well correlated with WER. It is critical to let model capacity (the number of parameters) grow with the data.

Back to Acoustic Modeling: How Much Model Can We Afford?

typical amounts of AM training data in ASR range from 100 to 1,000 hours
the frame rate in most systems is 100 Hz (one frame every 10 ms)
assuming 1,000 frames are sufficient to robustly estimate a single Gaussian, 1,000 hours of speech allow training about 0.36 million Gaussians (quite close to actual systems!)
we have 100,000 hours of speech. Where is the 40-million-Gaussian AM?
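The back-of-the-envelope arithmetic above can be written out as a sketch:

```python
def gaussian_budget(hours, frame_rate_hz=100, frames_per_gaussian=1000):
    """Number of Gaussians we can robustly estimate from `hours` of
    speech: total frames divided by the frames assumed necessary to
    estimate a single Gaussian."""
    total_frames = hours * 3600 * frame_rate_hz
    return total_frames // frames_per_gaussian

print(gaussian_budget(1_000))    # 360000: ~0.36M Gaussians
print(gaussian_budget(100_000))  # 36000000: ~36M, i.e. a ~40M-Gaussian AM
```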


Previous Work

GMM sizing [Kim et al. 2003]: log(num. components) = log(β) + α · log(n), where n is the per-state frame count
typical values: α = 0.3, β = 2.2 or α = 0.7, β = 0.1
same approach to obtaining training data as CU-HTK [Gales et al. 2006]
they report diminishing returns past 1,350 hours, at 9k states / 300k Gaussians
we use 87,000 hours and build models up to 1.1M states / 40M Gaussians

Kim et al., "Recent advances in broadcast news transcription," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.
Gales et al., "Progress in the CU-HTK broadcast news transcription system," IEEE Transactions on Audio, Speech, and Language Processing, 2006.

Back-off N-gram Acoustic Model (BAM)

W = action, aligned as: sil ae k sh ih n sil
BAM with M = 3 extracts (each paired with its speech frames):
  ih_1 / ae k sh ___ n sil
  ih_1 / k sh ___ n sil
  ih_1 / sh ___ n

Back-off strategy:
  back off at both ends if the M-phone is symmetric
  if not, back off from the longer end until the M-phone becomes symmetric

Rich Schwartz et al., "Improved Hidden Markov modeling of phonemes for continuous speech recognition," in Proceedings of ICASSP, 1984.
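The back-off strategy above can be sketched as follows; the phone spellings and the printed key format are illustrative, not the paper's exact on-disk key format:

```python
def backoff_mphones(left, center, right):
    """Enumerate an M-phone and its back-offs, down to the
    context-independent phone.

    `left` and `right` are tuples of context phones in sentence order.
    If the context is symmetric, back off at both ends; otherwise drop
    the outermost phone of the longer side until it becomes symmetric.
    """
    l, r = list(left), list(right)
    out = [(tuple(l), center, tuple(r))]
    while l or r:
        if len(l) > len(r):
            l = l[1:]            # drop outermost left phone
        elif len(r) > len(l):
            r = r[:-1]           # drop outermost right phone
        else:                    # symmetric: back off at both ends
            l, r = l[1:], r[:-1]
        out.append((tuple(l), center, tuple(r)))
    return out

# "action" = sil ae k sh ih n sil, center state ih_1, M = 3:
for l, c, r in backoff_mphones(("ae", "k", "sh"), "ih_1", ("n", "sil")):
    print(c, "/", " ".join(l), "___", " ".join(r))
```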

Back-off Acoustic Model Training

generate a context-dependent, state-level Viterbi alignment using H ◦ C ◦ L ◦ W and the first-pass AM
extract maximal-order M-phones along with their speech frames, and output (M-phone key, frames) pairs
compute back-off M-phones and output (M-phone key, empty) pairs to avoid sending the frame data M times
sort the stream of M-phones arriving at each Reducer in nesting order
cache frames arriving on maximal-order M-phones for reuse with the lower-order M-phones when they arrive

MapReduce for BAM Training

Chunked input data: utterances for "action", "fashion", "faction", each with speech frames.

Mapper for "action":
- generate alignment: sil ae k sh ih n sil
- extract and emit M-phones:
    ih_1 / ae k sh ___ n sil ~ , frames_A
    ih_1 / k sh ___ n sil , (empty)
    ih_1 / sh ___ n , (empty)
    ...

Mapper for "fashion":
- generate alignment: sil f ae sh ih n sil
- extract and emit M-phones:
    ih_1 / f ae sh ___ n sil ~ , frames_B
    ih_1 / ae sh ___ n sil , (empty)
    ih_1 / sh ___ n , (empty)
    ...

Mapper for "faction":
- generate alignment: sil f ae k sh ih n sil
- extract and emit M-phones:
    ih_1 / ae k sh ___ n sil ~ , frames_C
    ih_1 / k sh ___ n sil , (empty)
    ih_1 / sh ___ n , (empty)
    ...

Shuffling:
- each M-phone is sent to its Reduce shard, as determined by the partitioning key shard(ih_1 / sh ___ n)
- the M-phone stream arriving at a given Reduce shard is sorted in lexicographic order

Reducer for partition shard(ih_1 / sh ___ n):
- maintains a stack of nested M-phones in reverse order, along with a frames reservoir:
    ae k sh ___ n sil ~ , frames_A
    f ae sh ___ n sil ~ , frames_B
    k sh ___ n sil , frames_A | frames_C
    ae sh ___ n sil , frames_B
    sh ___ n , frames_A | frames_B | frames_C
- when a new M-phone arrives: pop the top entry, estimate its GMM, and output the (M-phone, GMM) pair

Output: an SSTable storing BAM as a distributed associative array (M-phone, GMM); the partition shard(ih_1 / sh ___ n) holds the entries above.
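Ignoring sharding and the nesting-order sort, the net effect of the reducer's frame caching, pooling each maximal-order M-phone's frames into its back-offs, can be sketched as follows (the keys and back-off map below are illustrative):

```python
from collections import defaultdict

def pool_frames(maximal, backoffs):
    """maximal: maximal-order M-phone key -> frames from the mappers.
    backoffs: M-phone key -> list of its back-off keys.
    Returns the frames pooled per key. In the real reducer the frames
    travel once and are cached, rather than re-sent per back-off."""
    pools = defaultdict(list)
    for key, frames in maximal.items():
        pools[key].extend(frames)
        for bk in backoffs[key]:
            pools[bk].extend(frames)
    return pools

# Illustrative keys from the "action" / "fashion" example:
BACKOFFS = {
    "ih_1 / ae k sh _ n sil": ["ih_1 / k sh _ n sil", "ih_1 / sh _ n"],
    "ih_1 / f ae sh _ n sil": ["ih_1 / ae sh _ n sil", "ih_1 / sh _ n"],
}
pools = pool_frames(
    {"ih_1 / ae k sh _ n sil": ["frames_A"],
     "ih_1 / f ae sh _ n sil": ["frames_B"]},
    BACKOFFS,
)
print(pools["ih_1 / sh _ n"])  # ['frames_A', 'frames_B']
```

A GMM would then be estimated from each pool, so lower-order M-phones are trained on strictly more data than the maximal-order ones they back off from.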

N-best Rescoring

load the model into an in-memory key-value serving system (SSTable service) with S servers, each holding 1/S-th of the data
query the SSTable service with batch requests for all M-phones (including back-offs) in an N-best list
combine scores:

  log P_AM(A|W) = λ · log P_first-pass(A|W) + (1.0 − λ) · log P_second-pass(A|W)
  log P(W, A) = 1/lmw · log P_AM(A|W) + log P_LM(W)
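The two interpolation formulas can be sketched directly; the default λ and LM weight lmw below are placeholder values, not tuned settings from the paper:

```python
def rescore(log_p_first, log_p_second, log_p_lm, lam=0.8, lmw=13.0):
    """Interpolate first-pass and BAM (second-pass) acoustic log-scores,
    then combine with the LM score:
    log P(W, A) = (1/lmw) * log P_AM(A|W) + log P_LM(W)."""
    log_p_am = lam * log_p_first + (1.0 - lam) * log_p_second
    return log_p_am / lmw + log_p_lm

# Rescoring an N-best list keeps the hypothesis with the highest score.
scores = [rescore(-100.0, -90.0, -8.0), rescore(-95.0, -99.0, -9.0)]
```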

Experimental Setup

training data for the baseline ML AM: 1 million manually transcribed Voice Search spoken queries, approx. 1,000 hours of speech
filtered logs: 110 million Voice Search spoken queries + 1-best ASR transcript, filtered at 0.8 confidence (approx. 87,000 hours)
dev/test data: manually transcribed, each about 27,000 spoken queries (87,000 words)
N = 10-best rescoring: 7% oracle WER on the dev set, against a 15% WER baseline
80% of the test set has 0% WER at 10-best

Experimental Results: Maximum Likelihood Baseline

Model       Train (hrs)  Source     WER (%)  No. Gaussians  M
ML, λ=0.6   1k           base AM    11.6     327k           —
ML, λ=1.0   1k           base AM    11.9     327k           —
BAM, λ=0.8  1k           base AM    11.5     490k           1
BAM, λ=0.8  1k           1% logs    11.3     600k           2
BAM, λ=0.8  1k           1% logs    11.4     720k           1
BAM, λ=0.6  9k           10% logs   10.9     3,975k         2
BAM, λ=0.6  9k           10% logs   10.9     4,465k         1
BAM, λ=0.6  87k          100% logs  10.6     22,210k        2
BAM, λ=0.6  87k          100% logs  10.6     14,435k        1

BAM steadily improves with more data, and wider phonetic context does not really help beyond triphones
1.3% absolute (11% relative) WER reduction over the ML baseline

Experimental Results: WER with Model Size

[Figure: WER (%), 10.6 to 11.5, as a function of model size (number of Gaussians, 10^5 to 10^8, log scale), for M=1 (α = 0.7, β = 0.1) and M=2 (α = 0.3, β = 2.2)]

Experimental Results: WER with Data Size

[Figure: WER (%), 10.5 to 11.5, as a function of training data size (hours, 10^2 to 10^5, log scale), for M=1 (α = 0.7, β = 0.1) and M=2 (α = 0.3, β = 2.2)]

Experimental Results: bMMI Baseline

Model        Train (hrs)  Source     WER (%)  No. Gaussians  M
bMMI, λ=0.6  1k           base AM    9.7      327k           —
bMMI, λ=1.0  1k           base AM    9.8      327k           —
BAM, λ=0.8   87k          100% logs  9.2      40,360k        3

0.6% absolute (6% relative) WER reduction over the tougher 9.8% bMMI baseline

Experimental Results: M-phone Hit Ratios

10-best hypotheses for the test data, for BAM using M = 3 (7-phones) trained on the filtered-logs data (87,000 hours):

left \ right context size   0     1      2      3
0                           1.1%  0.1%   0.2%   4.3%
1                           0.1%  26.0%  0.9%   3.4%
2                           0.7%  0.9%   27.7%  2.2%
3                           3.8%  2.9%   2.0%   23.6%

For large amounts of data, decision-tree clustering of triphone states is not needed

Experimental Results: Validation Setup

train on the dev set with Nmin = 1
test on the subset of the dev set with 0% WER at 10-best: 80% of utterances; first-pass AM: 7.6% WER
use only the BAM AM score, with a very small LM weight

Context type     M  WER (%)
CI phones        1  4.5
CI phones        5  1.5
+ word boundary  1  1.8
+ word boundary  5  0.6

triphones do not overtrain

BAM: Conclusions and Future Work

distributed acoustic modeling is promising for improving ASR
expanding phonetic context is not really productive, whereas more Gaussians do help

Future work:
bring discriminative training to the new world of (D)NN AMs
wish: a steeper learning curve as we add more training data

Parting Thoughts on ASR Core Technology

Current state:
automatic speech recognition is incredibly complex
the problem is fundamentally unsolved
data availability and computing have changed significantly: 2-3 orders of magnitude more of each

Challenges and directions:
re-visit (simplify!) modeling choices made on corpora of modest size
multi-linguality built in from the start
better modeling: feature extraction, acoustic, pronunciation, and language modeling

ASR Success Story: Google Search by Voice

What contributed to success:
DNN acoustic models
user expectations clearly set by the existing text app
excellent language model built from the query stream
clean speech: users are motivated to articulate clearly
app phones do high-quality speech capture
speech transferred error-free to the ASR server over IP
