End-to-End Attention-based Large Vocabulary Speech Recognition

D Bahdanau, JK Chorowski, D Serdyuk, P Brakel, Y Bengio

End-to-end trainable systems

What is end-to-end?
• “training all the modules to optimize a global performance criterion” (“Gradient-based learning applied to document recognition”, LeCun et al., 98)
• that paper presents a system for recognizing checks in which segmentation and character recognition are trained jointly, with word constraints taken into account (the approach would now be called Conditional Random Fields)

Not end-to-end: hand-crafted feature engineering, manual integration of separately trained modules.

Why end-to-end? Better performance, better portability.

End-to-end trainable systems are the future

Recent examples of end-to-end systems:
• convolutional networks for object recognition from raw pixels (Krizhevsky et al., 12)
• Neural Machine Translation: takes raw words as the input, all components trained together (Sutskever et al., 14; Bahdanau et al., 15)
• Neural Caption Generation: produce image descriptions from raw images (many recent papers)

Are DNN-HMM systems end-to-end trainable?

Without sequence discriminative training: no
• Lexicon and HMM structure are not optimized with the rest of the system
• Acoustic model (DNN) is trained to predict the states of the HMM in isolation from the language model

With sequence discriminative training: more end-to-end, but still no
• Lexicon and HMM structure …

Our (more-) end-to-end approach

• Direct transduction from speech to characters (Graves et al., 14)
• Based on Bidirectional Recurrent Neural Networks (BiRNN) and an attention mechanism, as in the Neural Machine Translation approach of Bahdanau et al., 14
• An additional language model is added after training, so the system is not yet fully end-to-end

Recurrent Neural Networks

(figure: an RNN unrolled over time, with inputs below, states in the middle, and optional outputs on top)

Generic RNN:
    y_t ∼ g(s_t)
    s_{t+1} = f(s_t, x_{t+1}, y_t)

“Simple RNN” with hyperbolic tangent units:
    p(y_t | s_t) = 1(y_t)^T softmax(V s_t + b_y)
    s_{t+1} = tanh(W s_t + x_{t+1} + U 1(y_t) + b_s)

We use Gated Recurrent Units (GRU).
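
For concreteness, here is a minimal NumPy sketch of the simple-RNN step above; the parameter names mirror the formulas (V, b_y for the output, W, U, b_s for the state) and are purely illustrative, not the paper's GRU-based code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def simple_rnn_step(s_t, x_next, y_t, W, U, V, b_y, b_s, vocab_size):
    """One step of the tanh 'simple RNN' from this slide."""
    one_hot = np.zeros(vocab_size)
    one_hot[y_t] = 1.0
    # p(y_t | s_t) = 1(y_t)^T softmax(V s_t + b_y)
    p_y = softmax(V @ s_t + b_y)[y_t]
    # s_{t+1} = tanh(W s_t + x_{t+1} + U 1(y_t) + b_s)
    s_next = np.tanh(W @ s_t + x_next + U @ one_hot + b_s)
    return p_y, s_next
```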

Deep Bidirectional RNN with Subsampling

BiRNN = forward RNN + backward RNN (both without outputs)
Deep BiRNN: the states of layer K are the inputs of layer K + 1 (a wiring sketch follows)
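
A pure-Python sketch of the wiring only, assuming `forward_rnn` and `backward_rnn` stand in for any recurrent layer (the paper uses GRUs); all names are illustrative.

```python
import numpy as np

def birnn_layer(inputs, forward_rnn, backward_rnn):
    fwd = forward_rnn(inputs)                    # states read left-to-right
    bwd = backward_rnn(inputs[::-1])[::-1]       # states read right-to-left, re-reversed
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

def deep_birnn(inputs, layers, subsample_after=()):
    states = list(inputs)
    for k, (fwd, bwd) in enumerate(layers, start=1):
        states = birnn_layer(states, fwd, bwd)   # states of layer K feed layer K + 1
        if k in subsample_after:
            states = states[::2]                 # subsampling: keep every 2nd time step
    return states
```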

Attention Mechanism

At each step t:
1. Compute attention weights α_{t,j} with an MLP
2. Compute the weighted sum c_t of the encoder states
3. Use the weighted sum c_t as the input to the RNN

(A sketch of one attention step follows.)
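
A minimal NumPy sketch of one attention step with an MLP scorer; W_a, U_a, v_a are illustrative parameter names and h denotes the array of encoder states.

```python
import numpy as np

def attend(s_t, h, W_a, U_a, v_a):
    """One attention step: scores -> weights alpha_{t,j} -> context c_t.

    s_t: decoder state, h: encoder states of shape (L, dim).
    """
    # 1. attention weights from a small MLP over (decoder state, encoder state)
    scores = np.array([v_a @ np.tanh(W_a @ s_t + U_a @ h_j) for h_j in h])
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # 2. weighted sum of the encoder states
    c_t = (alpha[:, None] * h).sum(axis=0)
    # 3. c_t is then fed to the decoder RNN
    return alpha, c_t
```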

The System at a Glance

The network defines

    P_θ(Y | X)

where Y = y_1 y_2 … y_T are characters, X = x_1 x_2 … x_L are feature vectors, and θ are the parameters.

Encoder: 4 GRU BiRNN layers with 250 units each, with subsampling after layers 2 and 3
Attention mechanism
Decoder: GRU RNN with 250 units
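
A rough PyTorch sketch of the encoder configuration on this slide (4 bidirectional GRU layers of 250 units over 123-dimensional features, subsampling after layers 2 and 3); it is an assumption-laden illustration, not the released attention-lvcsr code.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_features=123, hidden=250):
        super().__init__()
        in_sizes = [n_features, 2 * hidden, 2 * hidden, 2 * hidden]
        self.layers = nn.ModuleList(
            nn.GRU(size, hidden, bidirectional=True, batch_first=True)
            for size in in_sizes
        )

    def forward(self, x):                 # x: (batch, time, n_features)
        for k, gru in enumerate(self.layers, start=1):
            x, _ = gru(x)                 # (batch, time, 2 * hidden)
            if k in (2, 3):               # subsample after layers 2 and 3
                x = x[:, ::2, :]
        return x                          # states read by the attention mechanism
```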

Training

We train the network to maximize the log-likelihood of the correct outputs:

    (1/N) Σ_{n=1}^{N} log P_θ(Y_n | X_n) → max

P_θ is differentiable with respect to θ, so we can use gradient-based methods.
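
A toy sketch of the criterion, assuming a hypothetical `model(x, y)` that returns per-step log-probability vectors over characters under teacher forcing; the interface is invented for illustration.

```python
import numpy as np

def sequence_log_likelihood(step_log_probs, target_chars):
    # log P_theta(Y | X) = sum_t log P_theta(y_t | y_1 ... y_{t-1}, X)
    return sum(lp[y] for lp, y in zip(step_log_probs, target_chars))

def training_objective(batch, model):
    # (1/N) * sum_{n=1..N} log P_theta(Y_n | X_n), maximised with a gradient-based method
    return np.mean([sequence_log_likelihood(model(x, y), y) for x, y in batch])
```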

Decoding

We use beam search to find Ŷ = argmax_Y log P_θ(Y | X).

Problem: not enough text in the training data.
Workaround: add an additional language model:

    Ŷ = argmax_Y (α log P_θ(Y | X) + β log P_LM(Y) + (1 − α − β) |Y|)

Problem: we have a word-level LM, but we need a character-level model.
Workaround:
• compose a “spelling” FST with the FST obtained from the n-gram LM
• minimize the new FST and push the weights

(A beam-search sketch with the combined score follows.)
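
A compact beam-search sketch that scores hypotheses with the combined criterion above; `step_log_probs(prefix)` and `lm_log_prob(prefix)` are assumed interfaces to the network and the character-level LM, not the paper's FST-based decoder.

```python
def beam_search(step_log_probs, lm_log_prob, alphabet, eos, alpha, beta,
                beam_width=10, max_len=200):
    def score(prefix, log_p_model):
        # alpha * log P_theta(Y|X) + beta * log P_LM(Y) + (1 - alpha - beta) * |Y|
        return (alpha * log_p_model + beta * lm_log_prob(prefix)
                + (1 - alpha - beta) * len(prefix))

    beams, finished = [((), 0.0)], []            # (prefix, accumulated log P_theta)
    for _ in range(max_len):
        candidates = []
        for prefix, log_p in beams:
            for char, lp in zip(alphabet, step_log_probs(prefix)):
                hyp = (prefix + (char,), log_p + lp)
                (finished if char == eos else candidates).append(hyp)
        beams = sorted(candidates, key=lambda h: score(*h), reverse=True)[:beam_width]
        if not beams:
            break
    return max(finished + beams, key=lambda h: score(*h))[0]
```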

Comparison with other approaches

Our approach: the alignment α is computed. More traditional approach: make α a latent variable and marginalize it out:

    P_θ(Y | X) = Σ_α P_θ(Y, α | X)

• Connectionist Temporal Classification (CTC, Graves et al., 06): P_θ(Y | α, X) is factorized
• RNN Transducer (Graves et al., 12): another RNN is used to model dependencies in Y

Our model and the RNN Transducer are equal in terms of expressive power.

Experiment

Data details:
• Wall Street Journal (WSJ) dataset, ~80 hours of training data
• mel-scale filterbank + deltas + delta-deltas + energies = 123 features
• model selection on dev93, evaluation on eval92

Training details:
• ADADELTA learning rule, with ε annealed from 10^-8 to 10^-10
• adaptive gradient clipping (sketched below)
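
A standard global-norm clipping sketch; the paper's adaptive variant adjusts the threshold during training from statistics of recent gradient norms, which is not reproduced here.

```python
import numpy as np

def clip_gradients(grads, threshold):
    """Rescale the list of gradient arrays so their global L2 norm is at most `threshold`."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / total_norm) for g in grads]
    return grads
```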

Tricks of the Trade

• Windowing: let p_t = median(α_t). Set all α_{t,j} outside of [p_t − l, p_t + r] to zero. Used during training and during testing; especially important to decode with the LM. Latest findings: not so necessary when subsampling is used.
• Regularization: constraining the norm of the incoming weights to 1 for every unit of the network brings ~30% (!!!) performance improvement.

(Both tricks are sketched below.)
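
NumPy sketches of both tricks; the window centre p, the margins l and r, and the exact renormalisation are illustrative choices, not a verbatim reproduction of the paper's implementation.

```python
import numpy as np

def window_attention(alpha_t, p, l, r):
    """Zero attention weights outside [p - l, p + r] and renormalise."""
    j = np.arange(len(alpha_t))
    windowed = np.where((j >= p - l) & (j <= p + r), alpha_t, 0.0)
    return windowed / windowed.sum()

def constrain_incoming_norms(W, max_norm=1.0):
    """Rescale each row (the incoming weights of one unit) to have norm at most `max_norm`."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
```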

Results

Model                                          | Language Model   | CER% | WER%
---------------------------------------------- | ---------------- | ---- | ----
Encoder-Decoder                                | -                | 6.4  | 18.6
Encoder-Decoder                                | bigram           | 5.3  | 11.7
Encoder-Decoder                                | trigram          | 4.8  | 10.8
Encoder-Decoder                                | extended trigram | 3.9  | 9.3
CTC, phonemes, Miao et al. (2015)              | lexicon          | -    | 26.9
CTC, characters, Miao et al. (2015)            | trigram          | -    | 9.0
CTC, characters, Miao et al. (2015)            | extended trigram | -    | 7.3
DNN-HMM (Kaldi), Miao et al. (2015)            | trigram          | -    | 7.14
DNN-HMM, seq. dis. training, extended lexicon  | ?                | -    | ~4?

Alignment example

Discussion and Future Work

What is better for LVCSR, alignment computation or alignment marginalization?
• Enough evidence that both are feasible (this paper, “Listen, Attend and Spell”, “DeepSpeech2”, “EESEN”)
• But fully end-to-end training is yet to be tried for both

In our future work we want to train the network to work well with the LM.

Thank you for attention!

Code: https://github.com/rizar/attention-lvcsr (research quality…)

Nov 17, 2010 - variations in loudness of speech between different programs. 5'457'769 A ..... In an alternative implementation, the loudness esti mator 14 also ... receives an indication of loudness or signal energy for all segments and makes ...