Improving Training Time of Deep Belief Networks Through Hybrid Pre-Training And Larger Batch Sizes

Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran IBM T.J. Watson Research Center Yorktown Heights, NY 10598 {tsainath, bedk, bhuvana}@us.ibm.edu

Pre-training of deep belief networks (DBNs) is typically unsupervised and generative, so the learned weights are not linked to the final supervised objective function. Discriminative pre-training addresses this issue, but it is often too greedy, and after fine-tuning it can perform slightly worse than generative pre-training. We therefore propose a hybrid pre-training methodology that combines the benefits of both generative and discriminative pre-training. A further benefit of hybrid pre-training is that it produces weights that are more closely linked to the fine-tuning objective function, which allows us to use a very large batch size and to parallelize the gradient computation during fine-tuning. Experimental results indicate that combining hybrid pre-training with gradient parallelization yields a 3x speedup with little loss in accuracy compared to a generatively pre-trained DBN.

1 Introduction

Deep Belief Networks (DBNs) have gained increasing popularity in acoustic modeling over the past few years [1], [2], showing improvements of 5-20% relative over state-of-the-art Gaussian Mixture Model (GMM)/Hidden Markov Model (HMM) systems. However, DBNs are usually trained serially using stochastic gradient descent (SGD) and are computationally expensive to train, particularly on large vocabulary tasks. DBN training typically consists of first generatively learning a set of unsupervised weights via Restricted Boltzmann Machines (RBMs) [3], followed by a supervised, discriminative fine-tuning (backpropagation) step. We believe that one reason DBN training is slow is that the generatively pre-trained weights are not linked to the final cross-entropy objective function used during fine-tuning.

Recently, [4] and [1] performed pre-training in a discriminative fashion, where weights are pre-trained using the cross-entropy objective function. While [1] showed that discriminative pre-training followed by fine-tuning gives slightly better results than generative RBM pre-training followed by fine-tuning, [4] demonstrated that discriminative pre-training degraded performance compared to generative pre-training. One problem with discriminative pre-training is that at every layer the weights are learned so as to minimize the cross-entropy of the system. This means that weights learned in lower layers are potentially not general enough, but rather too specific to the final DBN objective [4]. Having generalized weights in lower layers has been shown to be helpful [5], [6]: generalized concepts, such as edges, are typically captured in lower layers, while more discriminative representations, such as different faces, are captured in higher layers. Experimentally, we will show in this paper that a DBN pre-trained discriminatively performs slightly worse than a DBN pre-trained generatively, underlining the importance of generalization in the pre-trained weights.

In this work, we introduce a hybrid pre-training (PT) strategy [7] that combines both the generative and discriminative benefits. While hybrid training itself has been explored before, [7] only explored hybrid PT for a two-layer network with binary inputs, while [8] explored using the hybrid objective function for fine-tuning. The first contribution of this work is to extend previous hybrid training work by exploring greedy, layer-wise hybrid pre-training of the weights, which are then adjusted via discriminative fine-tuning. We will show that one benefit of hybrid PT is that the weights are put in a much better initial space relative to generative PT, so fewer iterations of fine-tuning are needed. The main reason for not using the hybrid objective during fine-tuning is that it is over two times slower than discriminative fine-tuning [8]; because pre-training takes only a fraction of the time spent in fine-tuning [2], the extra cost of hybrid pre-training remains insignificant compared to fine-tuning. In addition, we will show that another benefit of hybrid pre-training is that the mini-batch size during fine-tuning can be made very large relative to generative pre-training. Typically, a mini-batch size between 128 and 512 is used during fine-tuning [5]. If the batch size is too small, parallelization of matrix-matrix multiplies on CPUs and GPUs is inefficient; a batch size that is too large often makes training unstable, unless the learning rate is dropped, which slows down training. We will show that when the weights are in a much better initial space thanks to hybrid pre-training, a very large batch size (>10,000) can be used. This allows the gradient computation to be parallelized and improves overall training speed. Parallel-gradient SGD methods are typically not effective in speech for small batch sizes because of the I/O involved in passing large gradient vectors computed on different worker machines to a master [9].

Our initial experiments are conducted on a 50-hour English Broadcast News (BN) task [10]. First, we will show that a DBN with hybrid pre-training offers modest improvements over a DBN pre-trained either generatively or discriminatively. Second, we will show that hybrid pre-training allows for fewer iterations of fine-tuning than generative pre-training with the same batch size, illustrating the importance of learning pre-trained weights that are linked to the final objective function. Third, we will show that we can increase the batch size with hybrid pre-training to more than 10,000, allowing us to effectively parallelize the gradient computation. Overall, with hybrid PT and parallel SGD, we achieve roughly a 3x speedup in fine-tuning time over generative pre-training with a small batch size, with a very small decrease in accuracy.

The rest of this paper is organized as follows. Section 2 provides background on generative, discriminative, and hybrid pre-training. Section 3 discusses the larger batch sizes that can be used during fine-tuning when the pre-trained weights are linked to the final objective. Experiments and results are presented in Sections 4 and 5, respectively. Finally, Section 6 concludes the paper.

2 Pre-Training Methodologies

In this section, we describe generative, discriminative, and hybrid pre-training in more detail.

2.1 Generative Pre-Training

The Restricted Boltzmann Machine (RBM) is a commonly used model for generative pre-training [3]. An RBM is a bipartite graph where visible units v, representing observations, are connected via undirected weights to hidden units h. Units h and v are stochastic, with values distributed according to a given distribution, and the entire RBM is endowed with an energy function. For an RBM in which all units are binary and follow a Bernoulli distribution, the energy function is

E(v, h; θ) = −h^T W v − b^T v − a^T h    (1)

where θ = {W, b, a} defines the RBM parameters, namely the weights W, visible biases b, and hidden biases a. The RBM assigns a probability to an observed vector v based on the energy function:

p(v; θ) = (Σ_h e^{−E(v,h;θ)}) / (Σ_u Σ_h e^{−E(u,h;θ)})    (2)

and the RBM parameters are trained to maximize this generative likelihood. In generative pre-training, an RBM is used to learn the weights for the first layer of a neural network. Once these weights are learned, the outputs (hidden units) are treated as inputs to another RBM that learns higher-order features, and the process is iterated for each layer in the network. Because speech features are continuous, the RBM for the first layer is a Gaussian-Bernoulli RBM. Subsequent layers are trained using Bernoulli-Bernoulli RBMs. This greedy, layer-wise pre-training scheme is both fast and effective [3]. After a stack of RBMs has been trained, the layers are connected together to form what is referred to as a DBN.
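
For concreteness, the sketch below shows one contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM in NumPy. The layer sizes, learning rate, and function names are illustrative assumptions rather than the exact recipe used here; a Gaussian-Bernoulli first layer would use a linear (Gaussian) visible reconstruction instead of a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, a, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for a Bernoulli-Bernoulli RBM.

    v0: (batch, n_visible) binary inputs
    W:  (n_visible, n_hidden) weights; b, a: visible and hidden biases
    """
    # Positive phase: hidden probabilities given the data.
    h0 = sigmoid(v0 @ W + a)
    h0_sample = (rng.random(h0.shape) < h0).astype(v0.dtype)

    # Negative phase: one step of Gibbs sampling (reconstruction).
    v1 = sigmoid(h0_sample @ W.T + b)
    h1 = sigmoid(v1 @ W + a)

    # Approximate gradient ascent on log p(v; theta) from Equation 2.
    n = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ h1) / n
    b += lr * (v0 - v1).mean(axis=0)
    a += lr * (h0 - h1).mean(axis=0)
    return W, b, a
```

Stacking is then a matter of treating the hidden activations h0 of a trained RBM as the visible data for the next RBM in the stack.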

2.2 Discriminative Pre-Training

Rather than maximizing the generative likelihood p(v; θ) as in generative pre-training, discriminative pre-training optimizes the likelihood p(l|v; θ), which makes use of both features v and labels l [4], [1]. This discriminative likelihood is defined to be the cross-entropy objective function used during fine-tuning (i.e., backpropagation). An RBM trained discriminatively in this way is referred to as a DRBM [7]. In the discriminative pre-training methodology, a 2-layer DRBM, namely one hidden layer and one softmax layer, is trained using the cross-entropy criterion with label information. After one pass through the entire data with discriminative pre-training, the softmax layer is thrown away and replaced by another randomly initialized hidden layer with a softmax layer on top. The initially trained hidden layer is held constant, and discriminative pre-training is performed on the new hidden and softmax layers. This discriminative training is greedy and layer-wise, like generative RBM pre-training.
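
For illustration, the sketch below implements this greedy procedure in PyTorch. The layer sizes, learning rate, and the `data_loader` of (feature, label) mini-batches are illustrative assumptions and not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

def discriminative_pretrain(data_loader, layer_sizes, n_targets, lr=0.01):
    """Greedy, layer-wise discriminative pre-training (sketch).

    layer_sizes: e.g. [360, 1024, 1024]; each stage trains one new hidden
    layer plus a temporary softmax layer for one pass over the data, then
    discards the softmax layer and freezes the hidden layer.
    """
    frozen = []  # hidden layers that have already been pre-trained
    for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
        hidden = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())
        softmax = nn.Linear(out_dim, n_targets)  # temporary output layer
        opt = torch.optim.SGD(
            list(hidden.parameters()) + list(softmax.parameters()), lr=lr)
        loss_fn = nn.CrossEntropyLoss()

        for x, y in data_loader:                 # one epoch per layer
            with torch.no_grad():                # lower layers are held fixed
                for layer in frozen:
                    x = layer(x)
            loss = loss_fn(softmax(hidden(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

        frozen.append(hidden)                    # keep only the hidden layer

    return nn.Sequential(*frozen)                # softmax layers were discarded
```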

2.3 Hybrid Pre-Training

One problem with performing discriminative pre-training is that at every layer, weights are learned to minimize the objective function (i.e., cross-entropy). This means that weights learned in lower layers are potentially not general enough, but rather too specific to the final DBN objective. Having generalized weights in lower layers has been shown to be helpful. Specifically, generalized, local concepts are captured in lower layers, while more discriminative representations, such as different phonemes, are captured in higher layers [6].

Hybrid pre-training has been proposed to address the issues of discriminative pre-training by performing pre-training with both a generative and a discriminative component. We follow a hybrid pre-training recipe similar to the methodology in [7], which maximizes the objective function in Equation 3, where α is an interpolation weight between the discriminative component p(l|v) and the generative component p(v, l):

p(l|v) + α p(v, l)    (3)

Intuitively, the generative component can be seen to act as a data-dependent regularizer for the discriminative component [7]. An RBM trained with this hybrid objective is referred to as an HDRBM. While [7] only explored pre-training a two-layer HDRBM with binary inputs, in this work we extend this to multiple layers and continuous inputs.

To optimize the generative component p(v, l), first consider a 2-layer DBN, where the weights, hidden biases, and visible biases for layer 1 are given by {W, a, b}, and those for layer 2 by {U, c, d}. In addition, we define l to be a label vector with an entry of 1 at the position of the class label of input v and zeros elsewhere. For an HDRBM in which all units are binary and follow a Bernoulli distribution, the energy function is given by Equation 4:

E(v, l, h; θ) = −h^T W v − b^T v − a^T h − c^T l − h^T U l    (4)

The joint probability that the model assigns to a visible vector v and label l is given by Equation 5:

p(v, l; θ) = (Σ_h e^{−E(v,l,h)}) / (Σ_u Σ_l Σ_h e^{−E(u,l,h)})    (5)

The generative component is trained to maximize the likelihood p(v, l; θ), while the discriminative component p(l|v; θ) is trained as in the discriminative pre-training methodology. To train an HDRBM, stochastic gradient descent is used, and for each example the gradient contribution due to p(l|v; θ) is added to α times the gradient estimated from p(v, l; θ). Similar to RBM training, because the input speech features are continuous, the HDRBM for the first layer is a Gaussian-Bernoulli HDRBM, while subsequent layers are Bernoulli-Bernoulli HDRBMs. Again, training is performed in a greedy, layer-wise fashion, similar to discriminative pre-training.
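
The resulting per-example update can be summarized as follows. This is only a sketch: `disc_loss_fn` and `gen_loss_fn` are hypothetical placeholders assumed to return −log p(l|v) (cross-entropy) and −log p(v, l) (or a contrastive-divergence style approximation of it), respectively, and α is the interpolation weight from Equation 3.

```python
import torch

def hybrid_pretrain_step(params, v, l, disc_loss_fn, gen_loss_fn, alpha, lr):
    """One hybrid pre-training update for the current layer (sketch).

    params: list of tensors (e.g. W, a, b, U, c) with requires_grad=True.
    disc_loss_fn(params, v, l) -> -log p(l|v)   (cross-entropy term)
    gen_loss_fn(params, v, l)  -> -log p(v, l)  (generative term; in practice
                                                 approximated, e.g. by CD)
    """
    # Minimizing this loss maximizes log p(l|v) + alpha * log p(v, l),
    # the log-domain form of the hybrid objective in Equation 3.
    loss = disc_loss_fn(params, v, l) + alpha * gen_loss_fn(params, v, l)
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g                          # per-example SGD step
    return params
```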

3 Batch Size for Optimization

During fine-tuning, each frame is labeled with a target class label. Given a DBN and a set of learned initial weights, fine-tuning is performed via backpropagation to retrain the weights such that the loss between the target and hypothesized class probabilities is minimized. During SGD fine-tuning, the gradient is estimated using a small collection of frames, referred to as a mini-batch [3]. The weight update per mini-batch is given explicitly by Equation 6, where γ is the learning rate, w are the weights, v_i is training example i, grad_i is the gradient computed from this training example and the current weights, and B is the mini-batch size:

w := w − γ Σ_{i=1}^{B} grad_i(w, v_i)    (6)

Notice from Equation 6 that the gradient is calculated as the sum of gradients from individual training examples. When the batch size B is large (and thus the number of training examples per batch is large), the gradient computation can be parallelized across multiple worker machines: each worker estimates a gradient using a subset of the training examples, and the gradients calculated on the workers are then added together by a master machine to form the total gradient. We will show that when hybrid pre-training places the weights in a much better initial space, the mini-batch size can be increased and the gradient computation can be parallelized efficiently.
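
The following sketch illustrates this decomposition with Python's multiprocessing module and a toy least-squares gradient standing in for the DBN gradient; the dimensions, worker count, and learning rate are illustrative assumptions, and a real system would ship gradients between machines rather than processes.

```python
import numpy as np
from multiprocessing import Pool

def shard_gradient(args):
    """Sum of per-example gradients over one worker's shard.

    A toy least-squares gradient (targets of zero) stands in for the DBN
    cross-entropy gradient; w and the shard are passed as a picklable tuple."""
    w, shard = args
    return shard.T @ (shard @ w)

def parallel_sgd_step(w, batch, n_workers=5, lr=1e-5):
    """One update of Equation 6, with the gradient sum split across workers."""
    shards = np.array_split(batch, n_workers)
    with Pool(n_workers) as pool:
        partial_grads = pool.map(shard_gradient, [(w, s) for s in shards])
    # The master adds the partial gradients to form the full mini-batch gradient.
    return w - lr * sum(partial_grads)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((40, 1))
    batch = rng.standard_normal((30000, 40))  # one large mini-batch of frames
    w = parallel_sgd_step(w, batch)
```

In a true multi-machine setup, the I/O cost of shipping the partial gradients is amortized only when the per-worker batch is large, which is why the large batch sizes enabled by hybrid pre-training make this decomposition worthwhile.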

4 Experiments

4.1 Broadcast News

The first set of experiments is conducted on a 50-hour English Broadcast News transcription task [10]. The acoustic models are trained on 50 hours of data from the 1996 and 1997 English Broadcast News Speech Corpora. Results are reported on the 101 speakers in the EARS Dev-04f set. An LVCSR recipe described in [11] is used to create vocal tract length normalized (VTLN) features, which are used as input features to the DBN. The DBN architecture for Broadcast News consists of a six-layer DBN with 1,024 hidden units per layer. Similar to previous DBN work on Broadcast News [2], 2,220 output targets are used. For the pre-training experiments, one epoch of training was done per layer for both discriminative and hybrid pre-training; for hybrid pre-training, the optimal value of α was tuned on a held-out set, and for generative pre-training, multiple epochs of RBM training were performed per layer. Following a recipe similar to [2], during fine-tuning, after one pass through the data, the loss is measured on a held-out set (distinct from Dev-04f) and the learning rate is annealed (i.e., reduced) by a factor of 2 if the held-out loss has not improved sufficiently over the previous iteration. Training stops after the learning rate has been annealed 5 times. All DBN results are reported using the cross-entropy loss function.
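
The fine-tuning control logic described above can be summarized in a few lines; `train_one_epoch` and `heldout_loss` are hypothetical callbacks, and the initial learning rate and improvement threshold are illustrative assumptions.

```python
def fine_tune_with_annealing(train_one_epoch, heldout_loss,
                             lr=0.005, min_improvement=0.001, max_anneals=5):
    """Fine-tuning control loop (sketch of the recipe described above).

    train_one_epoch(lr): one backpropagation pass over the training data.
    heldout_loss():      cross-entropy loss on the held-out set.
    """
    prev_loss = float("inf")
    n_anneals = 0
    while n_anneals < max_anneals:
        train_one_epoch(lr)
        loss = heldout_loss()
        if prev_loss - loss < min_improvement:   # not enough improvement
            lr /= 2.0                            # anneal the learning rate
            n_anneals += 1
        prev_loss = loss
    return lr
```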

5 Results

5.1 WER Comparison of Pre-Training Strategies

Table 1 compares the word error rate (WER) after fine-tuning when generative, discriminative, and hybrid pre-training is performed. Notice that the WER using discriminative pre-training is slightly worse than generative pre-training, indicating that having generalization in learning the pre-trained weights is helpful. However, hybrid pre-training, which combines the generalization of pre-trained weights with a discriminative objective linked to the final cross-entropy objective function, offers a slight improvement in WER over both generative and discriminative pre-training alone.

Method                          WER Dev-04f
Generative Pre-Training         19.6
Discriminative Pre-Training     19.7
Hybrid Pre-Training             19.5

Table 1: WER of Pre-Training Strategies, Broadcast News (BN)

It is natural to wonder whether hybrid pre-training would produce similar results to performing generative pre-training on x% of the data and discriminative pre-training on the remaining (100-x)% of the data [4]. Because it is more important to generatively pre-train the lower layers and discriminatively pre-train the higher layers, we explored pre-training a 5-layer DBN with a different percentage of data used for generative training per layer. The best configuration of the percentage of data used for generative pre-training per layer was found to be [80%, 60%, 40%, 20%, 0%]; the rest of the data per layer was used for discriminative pre-training. Using this strategy, we obtained a WER of 19.7%, worse than the hybrid pre-training WER. This shows the value of the joint optimization of the generative and discriminative components in hybrid pre-training.

5.2 Timing Comparison of Pre-Training Strategies

Because both discriminative and hybrid pre-training learn weights that are more closely linked to the final objective function than generative pre-training, fewer iterations of fine-tuning should be necessary. We confirm this experimentally by showing in Table 2 the number of iterations and the total fine-tuning time needed for the three pre-training strategies on Broadcast News. All timing experiments in this paper were run on an 8-core Intel Xeon CPU; matrix/vector operations for DBN training are multi-threaded using Intel MKL-BLAS. Notice that fewer iterations of fine-tuning are needed for both hybrid and discriminative pre-training relative to generative pre-training. Because discriminative pre-training lacks a generative component and is even closer to the final objective function than hybrid pre-training, it requires the fewest fine-tuning iterations. However, learning weights too greedily causes the WER with discriminative pre-training to be higher than with generative pre-training, as illustrated in Table 1. Thus, hybrid pre-training offers the best tradeoff between WER and training time of the three pre-training strategies.

Method                          Number of Iterations    Fine-Tuning Time (hrs)
Generative Pre-Training         12                      42.0
Discriminative Pre-Training     9                       31.5
Hybrid Pre-Training             10                      35.8

Table 2: Fine-Tuning Time for Pre-Training Strategies, BN

5.3 Larger Mini-Batch Size

Typically when generative pre-training is performed, a mini-batch size between 128 and 512 is used [5]. (We note that a batch size of 1,000 was used in [1]; however, the first few iterations of training there were run with a batch size of 256 before increasing to 1,000.) The intuition, which we will show experimentally, is the following: if the batch size is too small, parallelization of matrix-matrix multiplies on CPUs is inefficient, while a batch size that is too large often makes training unstable. However, when the weights are in a much better initial space, we hypothesize that a larger batch size can be used, speeding up training time further. Figure 1 shows the WER as a function of batch size for both the generative and hybrid pre-training methods. We have not included discriminative pre-training in this analysis, since Section 5.2 showed that hybrid pre-training offers the best tradeoff between WER and training time. Notice that after a batch size of 2,000 the WER of the generative pre-training method starts to increase rapidly, while with hybrid pre-training we can use a batch size of 10,000 with no degradation in WER. Even at a batch size of 30,000 the WER degradation is minimal.

[Figure 1 (plot): WER (approximately 19.5-20) as a function of mini-batch size (x10,000, ranging from 0.01 to 3) for Generative PT and Hybrid PT.]

Figure 1: Batch Size vs. WER for Pre-Training Strategies, BN

5.4 Parallel Stochastic Gradient Descent

Having a large batch size implies that the gradient computation can be efficiently parallelized across worker machines. Table 3 shows that parallel SGD improves fine-tuning training time by more than a factor of 1.5 over serial SGD for the same batch size of 30,000. In addition, hybrid PT with parallel SGD provides a large speedup over generative pre-training: the fine-tuning time for generative PT with a batch size of 512, a commonly used size in the literature, is roughly 24.7 hours, while with hybrid PT and a batch size of 30,000 the training time is roughly 7.9 hours, a 3x speedup over generative PT with little loss in accuracy.

Method                      Fine-Tuning Training Time (hrs)
Generative PT, Serial SGD   24.7
Hybrid PT, Serial SGD       14.5
Hybrid PT, Parallel SGD     7.9 (5 workers)

Table 3: Comparison between Serial and Parallel SGD Fine-Tuning Training Time

6 Conclusions

In this paper, we introduced a novel hybrid methodology to perform pre-training which combined both a generative and discriminative component. We demonstrated that hybrid pre-training initializes pre-trained weights in a much better space compared to generative pre-training, allowing for fewer iterations of fine-tuning. In addition, hybrid pre-training allows for more generalization compared to discriminative pre-training, resulting in an improvement in WER. Furthermore, we demonstrated that hybrid pre-training allows for a larger batch size during fine-tuning, allowing the gradient computation to be parallelized. Using hybrid PT + parallel SGD results in roughly a 3x speedup with little loss in accuracy compared to a generatively pre-trained DBN.

7 Acknowledgements

The authors would like to thank Hagen Soltau, George Saon and Stanley Chen for their contributions towards the IBM toolkit and recognizer utilized in this paper. Also, thank you to Abdel-rahman Mohamed of Toronto for useful discussions related to hybrid pre-training.

References

[1] F. Seide, G. Li, X. Chen, and D. Yu, "Feature Engineering in Context-Dependent Deep Neural Networks for Conversational Speech Transcription," in Proc. ASRU, 2011.
[2] T. N. Sainath et al., "Making Deep Belief Networks Effective for Large Vocabulary Continuous Speech Recognition," in Proc. ASRU, 2011.
[3] G. E. Hinton, S. Osindero, and Y. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[4] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring Strategies for Training Deep Neural Networks," Journal of Machine Learning Research, vol. 1, pp. 1-40, 2009.
[5] G. E. Hinton, "A Practical Guide to Training Restricted Boltzmann Machines," Tech. Rep. 2010-003, Machine Learning Group, University of Toronto, 2010.
[6] A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modeling," in Proc. ICASSP, 2012.
[7] H. Larochelle and Y. Bengio, "Classification Using Discriminative Restricted Boltzmann Machines," in Proc. ICML, 2008.
[8] A. Mohamed and G. E. Hinton, "Phone Recognition using Restricted Boltzmann Machines," in Proc. ICASSP, 2010.
[9] K. Vesely, L. Burget, and F. Grezl, "Parallel Training of Neural Networks for Speech Recognition," in Proc. Interspeech, 2010.
[10] B. Kingsbury, "Lattice-Based Optimization of Sequence Classification Criteria for Neural-Network Acoustic Modeling," in Proc. ICASSP, 2009.
[11] H. Soltau, G. Saon, and B. Kingsbury, "The IBM Attila Speech Recognition Toolkit," in Proc. SLT Workshop, 2010.
