Comparison of Training Methods for Deep Neural Networks
Patrick Oliver GLAUNER
Imperial College London, Department of Computing

May 2015


Motivation

- Deep learning has attracted major IT companies, including Google, Facebook, Microsoft and Baidu, to make significant investments in it
- The so-called "Google Brain" project self-learned cat faces from images extracted from YouTube videos
- Features are learned from data rather than modeled by hand
- These advances have raised many hopes about the future of machine learning, in particular towards building a system that implements the single learning algorithm hypothesis


Contents

1  Neural networks
2  Deep neural networks
3  Application to computer vision problems
4  Conclusions and prospects


Neural networks

- Neural networks are inspired by the brain
- Composed of layers of logistic regression units
- Can learn complex non-linear hypotheses (see the forward-pass sketch after Figure 1)

Figure 1: Neural network with two input and output units and one hidden layer with two units and bias units x0 and z0 [1]
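To make the layered structure concrete, here is a minimal MATLAB/Octave sketch of a forward pass through a network shaped like Figure 1 (two inputs, one hidden layer with two sigmoid units, two outputs); the weight values are random placeholders, not values from the talk.

% Forward pass through a 2-2-2 network of logistic (sigmoid) units.
% Weight values are random placeholders for illustration only.
sigmoid = @(z) 1 ./ (1 + exp(-z));

x  = [0.5; -1.2];        % input vector (x1, x2)
W1 = randn(2, 3);        % hidden-layer weights, first column acts on the bias unit x0 = 1
W2 = randn(2, 3);        % output-layer weights, first column acts on the bias unit z0 = 1

a1 = [1; x];             % prepend bias unit x0
z  = sigmoid(W1 * a1);   % hidden activations
a2 = [1; z];             % prepend bias unit z0
h  = sigmoid(W2 * a2)    % network output h_Theta(x)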


Neural networks: training

- Goal: minimize a cost function, e.g. J(Θ) = sum_{i=1}^{m} (y^(i) − h_Θ(x^(i)))^2
- Partial derivatives ∂J(Θ)/∂θ_i are used in an optimization algorithm (see the sketch below)
- Backpropagation is an efficient method to compute them
- Risk of overfitting because of many parameters
- Highly non-convex cost functions: training may end in a local minimum
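As an illustration of how these partial derivatives drive an optimization algorithm, here is a small MATLAB/Octave sketch of batch gradient descent on the squared-error cost of a single logistic unit; the toy data and learning rate are made up for the example, and backpropagation generalizes this chain rule to all layers of a network.

% Batch gradient descent on J(theta) = sum_i (y_i - h_theta(x_i))^2
% for a single logistic unit. Toy data and learning rate are placeholders.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 0.2; 1 0.9; 1 1.5; 1 2.3];   % m x 2 design matrix (first column is the bias feature)
y = [0; 0; 1; 1];                   % targets
theta = zeros(2, 1);
alpha = 0.5;                        % learning rate

for iter = 1:1000
    h     = sigmoid(X * theta);                  % hypothesis h_theta(x) for all examples
    grad  = X' * (2 * (h - y) .* h .* (1 - h));  % dJ/dtheta via the chain rule
    theta = theta - alpha * grad;                % gradient descent step
end
J = sum((y - sigmoid(X * theta)) .^ 2)           % final cost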


Deep neural networks

Figure 2: Deep neural network layers learning complex feature hierarchies [4]


Deep neural networks

- Unsupervised layer-wise pre-training to compute a good initialization of the weights:
  - Autoencoder
  - Restricted Boltzmann Machine (RBM)
- Discriminative pre-training
- Sparse initialization
- Reduction of internal covariate shift
- Discriminative fine-tuning using backpropagation


Deep neural networks: autoencoder

- Three-layer neural network whose targets are its inputs, y^(i) = x^(i)
- Tries to learn the identity function, h_Θ(x) ≈ x
- Denoising autoencoder: the input is corrupted by a corruption mapping (e.g. masking noise), and the network is trained to reconstruct the clean x^(i) from the corrupted version (see the sketch after Figure 3)

Figure 3: Autoencoder with three input and output units and two hidden units
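A minimal MATLAB/Octave sketch of a denoising autoencoder with tied weights, trained by stochastic gradient descent on the squared reconstruction error; the binary toy data, the roughly 30% masking noise and the learning rate are placeholder choices, not the settings used in the experiments.

% Minimal denoising autoencoder: 6 visible units, 3 hidden units, tied weights.
% Data, masking noise level and learning rate are placeholders for illustration.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = double(rand(200, 6) > 0.5);      % 200 binary training vectors
W = 0.1 * randn(3, 6);               % encoder weights (the decoder uses W')
b = zeros(3, 1);                     % hidden biases
c = zeros(6, 1);                     % visible (reconstruction) biases
alpha = 0.1;

for epoch = 1:50
    for i = 1:size(X, 1)
        x  = X(i, :)';
        xt = x .* (rand(6, 1) > 0.3);              % corrupt: mask roughly 30% of the inputs
        h  = sigmoid(W * xt + b);                  % encode the corrupted input
        r  = sigmoid(W' * h + c);                  % decode: reconstruction of the clean x
        dr = (r - x) .* r .* (1 - r);              % squared-error gradient wrt decoder pre-activation
        dh = (W * dr) .* h .* (1 - h);             % backpropagate to the hidden layer
        W  = W - alpha * (h * dr' + dh * xt');     % tied-weight gradient (both uses of W)
        c  = c - alpha * dr;
        b  = b - alpha * dh;
    end
end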


Deep neural networks: stacked autoencoder

- First, an autoencoder is trained on the input; its trained hidden layer becomes the first hidden layer of the stacked autoencoder
- The hidden activations are then used as input and output to train another autoencoder; its learned hidden layer becomes the second hidden layer of the stacked autoencoder
- This is repeated for further layers; the stack is then fine-tuned (see the sketch after Figure 4)

Figure 4: Stacked autoencoder network structure
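A sketch of the greedy layer-wise procedure, assuming toy layer sizes 6-4-2, plain (non-denoising) autoencoders with tied weights, and placeholder epoch counts and learning rate.

% Greedy layer-wise pre-training of a 6-4-2 stacked autoencoder (toy data,
% sizes, epochs and learning rate are placeholders for illustration).
sigmoid = @(z) 1 ./ (1 + exp(-z));
X     = double(rand(200, 6) > 0.5);    % 200 binary training vectors
sizes = [6 4 2];                       % visible size, then hidden layer sizes
alpha = 0.1;
Ws    = cell(1, numel(sizes) - 1);     % learned encoder weights, one per layer
input = X;                             % current representation fed to the next layer

for l = 1:numel(sizes) - 1
    nv = sizes(l); nh = sizes(l + 1);
    W = 0.1 * randn(nh, nv); b = zeros(nh, 1); c = zeros(nv, 1);
    for epoch = 1:30
        for i = 1:size(input, 1)
            x  = input(i, :)';
            h  = sigmoid(W * x + b);               % encode
            r  = sigmoid(W' * h + c);              % decode (tied weights)
            dr = (r - x) .* r .* (1 - r);
            dh = (W * dr) .* h .* (1 - h);
            W  = W - alpha * (h * dr' + dh * x');
            c  = c - alpha * dr;  b = b - alpha * dh;
        end
    end
    Ws{l} = W;
    input = sigmoid(input * W' + repmat(b', size(input, 1), 1));  % activations become the next layer's data
end
% Ws now initializes the hidden layers of the stacked autoencoder before fine-tuning.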


Deep neural networks: RBM

- A Boltzmann Machine in which the neurons are binary nodes of a bipartite graph
- The visible units of an RBM represent states that are observed
- The hidden units represent the feature detectors

Figure 5: Restricted Boltzmann Machine with three visible units and two hidden units (and biases)


Deep neural networks: RBM

- RBMs are undirected
- A single matrix W of parameters encodes the connections between visible and hidden units
- Bias vectors a for the visible units and b for the hidden units
- Goal: minimize the energy E(v, h) = −a^T v − b^T h − v^T W h
- Contrastive divergence is used to compute the gradients of the weights (a CD-1 sketch follows)
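A minimal MATLAB/Octave sketch of one-step contrastive divergence (CD-1) for a small binary RBM; the sizes, toy data and learning rate are placeholders, not the configurations evaluated later.

% CD-1 training of a small binary RBM.
% Sizes, data and learning rate are placeholders for illustration only.
sigmoid = @(z) 1 ./ (1 + exp(-z));
nv = 6; nh = 3;
W = 0.01 * randn(nh, nv);  a = zeros(nv, 1);  b = zeros(nh, 1);
alpha = 0.1;
X = double(rand(200, nv) > 0.5);                 % 200 binary training vectors

for epoch = 1:20
    for i = 1:size(X, 1)
        v0  = X(i, :)';
        ph0 = sigmoid(W * v0 + b);               % p(h = 1 | v0)
        h0  = double(rand(nh, 1) < ph0);         % sample hidden states
        pv1 = sigmoid(W' * h0 + a);              % reconstruction p(v = 1 | h0)
        v1  = double(rand(nv, 1) < pv1);         % sample visible states
        ph1 = sigmoid(W * v1 + b);               % p(h = 1 | v1)
        % CD-1 updates: positive-phase statistics minus negative-phase statistics
        W = W + alpha * (ph0 * v0' - ph1 * v1');
        a = a + alpha * (v0 - v1);
        b = b + alpha * (ph0 - ph1);
    end
end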


Deep neural networks: deep belief network (DBN)

- Layer-wise pre-training of RBMs
- Procedure similar to training a stacked autoencoder: each RBM's hidden activities become the training data for the next RBM (see the sketch after Figure 6)
- Followed by discriminative fine-tuning

Figure 6: Deep belief network structure
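A sketch of greedily stacking two RBMs into a DBN, reusing the CD-1 update from the previous sketch (abbreviated to a single pass over toy data; sizes and learning rate are placeholders).

% Greedy stacking of RBMs into a DBN (toy sizes and data; one CD-1 pass per layer).
sigmoid = @(z) 1 ./ (1 + exp(-z));
sizes = [6 4 2];                       % visible size, then hidden layer sizes
alpha = 0.1;
data  = double(rand(200, 6) > 0.5);
Ws    = cell(1, numel(sizes) - 1);

for l = 1:numel(sizes) - 1
    nv = sizes(l); nh = sizes(l + 1);
    W = 0.01 * randn(nh, nv); a = zeros(nv, 1); b = zeros(nh, 1);
    for i = 1:size(data, 1)            % one pass of CD-1 over the data
        v0  = data(i, :)';
        ph0 = sigmoid(W * v0 + b);  h0 = double(rand(nh, 1) < ph0);
        pv1 = sigmoid(W' * h0 + a); v1 = double(rand(nv, 1) < pv1);
        ph1 = sigmoid(W * v1 + b);
        W = W + alpha * (ph0 * v0' - ph1 * v1');
        a = a + alpha * (v0 - v1);  b = b + alpha * (ph0 - ph1);
    end
    Ws{l} = W;
    data = sigmoid(data * W' + repmat(b', size(data, 1), 1));   % hidden probabilities feed the next RBM
end
% Ws initializes a feed-forward network that is then fine-tuned with backpropagation.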


Application to computer vision problems: goal

- Comparison of RBMs and autoencoders on two data sets:
  - MNIST
  - Kaggle facial emotion data
- Use of the MATLAB Deep Learning Toolbox (DeepLearnToolbox) [3]; a usage sketch follows
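For orientation, a sketch of how a DBN is pre-trained and fine-tuned with DeepLearnToolbox, modeled on the example scripts shipped with the toolbox; the toolbox path is assumed to be set up, exact option names and defaults may differ between versions, and the layer sizes and epoch counts are placeholders rather than the settings reported later.

% Sketch of DBN pre-training and fine-tuning with DeepLearnToolbox
% (follows the toolbox's example scripts; values are placeholders).
load mnist_uint8;                       % MNIST data set shipped with the toolbox
train_x = double(train_x) / 255;
test_x  = double(test_x)  / 255;
train_y = double(train_y);
test_y  = double(test_y);

dbn.sizes      = [100 100];             % two hidden layers of RBMs
opts.numepochs = 10;
opts.batchsize = 100;
opts.momentum  = 0;
opts.alpha     = 1;                     % learning rate
dbn = dbnsetup(dbn, train_x, opts);
dbn = dbntrain(dbn, train_x, opts);     % unsupervised layer-wise pre-training

nn = dbnunfoldtonn(dbn, 10);            % unfold into a feed-forward net with 10 outputs
nn.activation_function = 'sigm';
nn = nntrain(nn, train_x, train_y, opts);   % discriminative fine-tuning
[er, bad] = nntest(nn, test_x, test_y); % classification error on the test set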


Application to computer vision problems: MNIST

- Hand-written digits
- 28 × 28 pixel gray-scale values
- 60,000 training and 10,000 test examples

Figure 7: Hand-written digit recognition learned by a convolutional neural network [5]


Application to computer vision problems: MNIST

- Training of:
  - Deep belief network composed of RBMs (DBN)
  - Stacked denoising autoencoder (SAE)
- 10 epochs for pre-training and fine-tuning
- Independent optimization of the following parameters (a search sketch follows the list):
  - Learning rate
  - Momentum
  - L2 regularization
  - Output unit type
  - Batch size
  - Hidden layers
  - Dropout
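A minimal sketch of the independent (one-parameter-at-a-time) model selection: each parameter is swept over its tested values while the others stay at their defaults. evaluate() is a stand-in for a full pre-training plus fine-tuning run, and only a subset of the parameters from Table 4 is shown.

% One-at-a-time hyperparameter search over the values listed in Table 4.
% evaluate() is a placeholder for training a network and returning its test error.
evaluate = @(params) rand();            % placeholder: replace with a real training/testing run

defaults = struct('learning_rate', 1.0, 'momentum', 0, 'batchsize', 100, 'dropout', 0);
grid = struct('learning_rate', {{0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0}}, ...
              'momentum',      {{0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5}}, ...
              'batchsize',     {{25, 50, 100, 150, 200, 400}}, ...
              'dropout',       {{0, 0.125, 0.25, 0.5}});

names = fieldnames(grid);
for p = 1:numel(names)
    best_err = inf; best_val = defaults.(names{p});
    for v = 1:numel(grid.(names{p}))
        params = defaults;
        params.(names{p}) = grid.(names{p}){v};   % vary only this parameter
        err = evaluate(params);
        if err < best_err
            best_err = err; best_val = params.(names{p});
        end
    end
    fprintf('%s: best value %s, test error %.4f\n', names{p}, num2str(best_val), best_err);
end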


Application to computer vision problems: MNIST

Neural network                   Test error
DBN composed of RBMs             0.0244
Stacked denoising autoencoder    0.0194
Stacked autoencoder              0.0254

Table 1: Error rates for optimized DBN and SAE on MNIST, lowest error rate in bold


Application to computer vision problems: Kaggle data

- From a 2013 competition named "Emotion and identity detection from face images" [2]
- 48 × 48 pixel gray-scale values
- Size is reduced to 24 × 24 = 576 pixels using bilinear interpolation (see the resizing sketch after Figure 8)
- 4178 training and 1312 test examples
- The original training set is split up into 3300 training and 800 test examples

Figure 8: Sample data of the Kaggle competition [2]
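A sketch of the size reduction step; the random image stands in for one 48 × 48 face, and imresize requires MATLAB's Image Processing Toolbox (or Octave's image package).

% Downsample a 48x48 gray-scale face to 24x24 with bilinear interpolation,
% then flatten it into a 576-dimensional feature vector.
I = rand(48, 48);                        % placeholder for one 48x48 gray-scale face
I_small = imresize(I, [24 24], 'bilinear');
x = reshape(I_small, 1, 24 * 24);        % 1x576 row vector, one training example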


Application to computer vision problems: Kaggle data

- Training of:
  - Deep belief network composed of RBMs (DBN)
  - Stacked denoising autoencoder (SAE)
- 10 epochs for pre-training and fine-tuning
- Independent optimization of:
  - Learning rate
  - Momentum
  - L2 regularization
  - Output unit type
  - Batch size
  - Hidden layers
  - Dropout


Application to computer vision problems: Kaggle data

Neural network                   Test error
DBN composed of RBMs             0.7225
Stacked denoising autoencoder    0.5737
Stacked autoencoder              0.3975

Table 2: Error rates for optimized DBN and SAE on Kaggle data, lowest error rate in bold


Application to computer vision problems: Kaggle data

Neural network                   Test error
DBN composed of RBMs             0.5675
Stacked denoising autoencoder    0.3387
Stacked autoencoder              0.3025

Table 3: Error rates for optimized DBN and SAE on Kaggle data, lowest error rate in bold, for 100 epochs


Conclusions and prospects

- Neural networks can learn complex non-linear hypotheses
- Training them comes with many difficulties
- Unsupervised pre-training using autoencoders or RBMs
- Followed by discriminative fine-tuning
- Promising methods, but no silver bullet
- Proposed investigations: better pre-processing, convolutional neural networks and use of GPUs


Application to computer vision problems: MNIST

Parameter          Default value   Tested values
Learning rate      1.0             0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0
Momentum           0               0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5
L2 regularization  0               1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4
Output unit type   Sigmoid         Sigmoid, softmax
Batch size         100             25, 50, 100, 150, 200, 400
Hidden layers      [100, 100]      [50], [100], [200], [400], [50, 50], [100, 100], [200, 200], [400, 400], [50, 50, 50], [100, 100, 100], [200, 200, 200]
Dropout            0               0, 0.125, 0.25, 0.5

Table 4: Model selection values for MNIST


Application to computer vision problems: MNIST

Parameter          DBN         Test error   SAE         Test error
Learning rate      0.5         0.0323       0.75        0.0383
Momentum           0.02        0.0331       0.5         0.039
L2 regularization  5e-5        0.0298       5e-5        0.0345
Output unit type   softmax     0.0278       softmax     0.0255
Batch size         50          0.0314       25          0.0347
Hidden layers      [400, 400]  0.0267       [400, 400]  0.017
Dropout            0           0.0335       0           0.039

Table 5: Model selection for DBN and SAE on MNIST, lowest error rates in bold


Application to computer vision problems: MNIST

Figure 9: Test error for different L2 regularization values for training of DBN


Application to computer vision problems: MNIST

Neural network                   Test error
DBN composed of RBMs             0.0244
Stacked denoising autoencoder    0.0194
Stacked autoencoder              0.0254

Table 6: Error rates for optimized DBN and SAE on MNIST, lowest error rate in bold


Application to computer vision problems: MNIST

Neural network                   Test error
DBN composed of RBMs             0.0225
Stacked denoising autoencoder    0.0189
Stacked autoencoder              0.0191

Table 7: Error rates for optimized DBN and SAE on MNIST, lowest error rate in bold, for 100 iterations


Application to computer vision problems: Kaggle data

Parameter          Default value   Tested values
Learning rate      1.0             0.05, 0.1, 0.15, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5
Momentum           0               0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5
L2 regularization  0               1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4
Output unit type   Sigmoid         Sigmoid, softmax
Batch size         100             25, 50, 100, 150, 275
Hidden layers      [100, 100]      [50], [100], [200], [400], [50, 50], [100, 100], [200, 200], [400, 400], [50, 50, 50], [100, 100, 100], [200, 200, 200]
Dropout            0               0, 0.125, 0.25, 0.5

Table 8: Model selection values for Kaggle data


Application to computer vision problems: Kaggle data

Parameter          DBN        Test error   SAE       Test error
Learning rate      0.25       0.5587       0.1       0.5413
Momentum           0.01       0.7225       0.5       0.7225
L2 regularization  5e-5       0.7225       1e-4      0.7225
Output unit type   softmax    0.7225       softmax   0.7225
Batch size         50         0.6987       50        0.5913
Hidden layers      [50, 50]   0.7225       [200]     0.5850
Dropout            0.125      0.7225       0.5       0.7225

Table 9: Model selection for DBN and SAE on Kaggle data, lowest error rates in bold


Application to computer vision problems: Kaggle data

Figure 10: Test error for different learning rate values for training of DBN


Application to computer vision problems: Kaggle data

Neural network                   Test error
DBN composed of RBMs             0.7225
Stacked denoising autoencoder    0.5737
Stacked autoencoder              0.3975

Table 10: Error rates for optimized DBN and SAE on Kaggle data, lowest error rate in bold


Application to computer vision problems: Kaggle data

Figure 11: Test error for different factors of noise in SAE


Application to computer vision problems: Kaggle data

Neural network                   Test error
DBN composed of RBMs             0.5675
Stacked denoising autoencoder    0.3387
Stacked autoencoder              0.3025

Table 11: Error rates for optimized DBN and SAE on Kaggle data, lowest error rate in bold, for 100 epochs


Application to computer vision problems: Kaggle data

Figure 12: Test error for different factors of noise in SAE, for 100 epochs


References

[1] Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer, 2007.
[2] Kaggle: Emotion and identity detection from face images. http://inclass.kaggle.com/c/facial-keypoints-detector. Retrieved: April 15, 2015.
[3] Rasmus Berg Palm: DeepLearnToolbox. http://github.com/rasmusbergpalm/DeepLearnToolbox. Retrieved: April 22, 2015.
[4] The Analytics Store: Deep Learning. http://theanalyticsstore.com/deep-learning/. Retrieved: March 1, 2015.
[5] Yann LeCun et al.: LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet/. Retrieved: April 22, 2015.

