828L project: 1D plot of NN loss surface
Zheng Xu
Dept. of CS, Univ. of Maryland, College Park
[email protected]
Motivation. We study the loss surface of neural networks in this manuscript. Recent research suggests that saddle points are ubiquitous and that local optima are reasonably good for neural networks [2, 1, 5]. However, the loss surface of neural networks is still largely unknown. To gain more intuition about what the loss surface may look like, we use low-dimensional plots to visualize several measurements, such as the objective and the norm of the gradient.

Approach. Specifically, we first train a neural network with the momentum SGD method and a decreasing learning rate to arrive at a reasonable solution (critical point) on the loss surface. We then find the direction from the initial parameters to the solution and perform 1D interpolation along this direction, which passes through both the initialization and the solution. For each interpolated point, we calculate the objective and the gradient on the entire training batch and plot the results (a code sketch is given below). Note that our method is very similar to the visualization methods in [4, 6].

Implementation. In the following experiments, we use a rather "toy" problem: a sampled MNIST dataset with 8 × 8 images, 600 images for training, and 100 images for testing. Without any architecture tuning, the neural networks we use generally achieve over 85% accuracy on this dataset. We use the default initialization in Torch with random seed 0 to run the SGD optimization, with an initial learning rate of 0.01 and a minibatch size of 16. We monitor the training process by calculating the objective and gradient for the entire training batch after each epoch, and stop training when the residual (norm of the gradient) becomes small or after 10 minutes of running time.

Discussion. In our experiments the gradient never becomes small enough (less than 1e-5) to stop the optimization, while training seems to stop making progress after the first few hundred epochs. A faster learning-rate decay may help the solution settle at a more "stable" critical point. Though it has been argued that SGD is unlikely to get trapped at a saddle point [3], it is still worth verifying whether our solution is a true local minimum. Analyzing the Hessian matrix would provide sufficient evidence, though it may require factorizing a large matrix. Finally, we remind the reader that a 1D plot loses a lot of detail: we use 50 interpolation points between the initialization and the solution, so the plots only approximate the loss surface at a coarse grain.
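To make the procedure concrete, below is a minimal sketch of the interpolation loop. Our experiments were run in Torch; this sketch uses PyTorch for readability, and the `model`, `loss_fn`, `data`, and `targets` objects, as well as the helper name `line_interpolation`, are hypothetical stand-ins rather than our actual code.

```python
import torch

def line_interpolation(model, theta_init, theta_star, loss_fn, data, targets,
                       num_points=50):
    """Loss and gradient norm along the line from theta_init to theta_star.

    theta_init and theta_star are lists of tensors matching model.parameters();
    alpha = 0 recovers the initialization, alpha = 1 the trained solution.
    """
    results = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        # Overwrite the model parameters with the interpolated point.
        with torch.no_grad():
            for p, p0, p1 in zip(model.parameters(), theta_init, theta_star):
                p.copy_((1 - alpha) * p0 + alpha * p1)
        model.zero_grad()
        loss = loss_fn(model(data), targets)  # objective on the entire training batch
        loss.backward()                       # gradient on the entire training batch
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        results.append((alpha.item(), loss.item(), grad_norm.item()))
    return results
```

Extending the range of `alpha` beyond [0, 1] gives wider views of the same line, such as those in the top rows of the 1D plots below.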
1 Shallow MLP
We use a multilayer perceptron (MLP) with one hidden layer of 50 hidden nodes in this experiment. ReLU is used as the nonlinear activation, softmax for the output layer, and the negative log likelihood (NLL) criterion for the loss. Fig. 1 presents the plots of test accuracy, training loss, and training residual along the training path. The test accuracy suggests the solution is reasonably good. The training loss and residual drop quickly in the first few hundred epochs and then arrive at a plateau. Fig. 2 presents the plots of training loss, training residual, and the inner product of the gradient with the direction from initialization to solution, all along that direction. We do see some other critical points along the direction, and there seems to be a plateau around the solution (Fig. 2 top left). The interpolation between initialization and solution suggests the local loss surface is reasonably good, which may partially explain the success of optimizing neural networks (Fig. 2 bottom).
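For concreteness, here is a sketch of this architecture in PyTorch. We assume the 8 × 8 images are flattened to 64 features and that there are 10 digit classes; everything except the 50 hidden nodes, ReLU, softmax, and NLL is inferred from the data description.

```python
import torch.nn as nn

# 1-hidden-layer MLP: 64 inputs (flattened 8x8 image), 50 hidden nodes, 10 classes.
model = nn.Sequential(
    nn.Linear(64, 50),
    nn.ReLU(),
    nn.Linear(50, 10),
    nn.LogSoftmax(dim=1),  # log-softmax output pairs with the NLL criterion below
)
criterion = nn.NLLLoss()
```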
Figure 1: Training of 1-layer MLP. (left) Test accuracy along the training path. (middle) Loss for the entire training batch. (right) Residual (norm of gradient) for the entire training batch. The small images present the fine-grained curve from the first 1000 epochs.
Figure 2: 1D plot of 1-layer MLP. (left) Objective (loss) along the direction from initialization to solution. (middle) Residual (norm of gradient) for the entire training batch. (right) Inner product of the direction and the gradient. The fine-grained plots showing the interpolation between initialization and solution are presented at the bottom.
2 Deep MLP
We make the neural network a bit deeper by adding a second hidden layer with 20 hidden nodes; ReLU is used for both hidden layers, and all other settings are the same as in Section 1. Fig. 3 presents the plots of test accuracy, training loss, and training residual along the training path. Fig. 4 presents the plots of training loss, training residual, and the inner product of the gradient with the direction from initialization to solution, all along that direction. Though the loss objective of the 2-layer MLP looks sharper than that of the 1-layer MLP over a wide range (note that the objective in Fig. 4 top left is higher than in Fig. 2), the local curves are very similar (compare the bottoms of Fig. 4 and Fig. 2).
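Under the same assumptions as the sketch in Section 1, the corresponding model differs only in the extra 20-node layer:

```python
import torch.nn as nn

# 2-hidden-layer MLP: a second hidden layer of 20 nodes is inserted
# between the 50-node layer and the 10-class output.
model = nn.Sequential(
    nn.Linear(64, 50), nn.ReLU(),
    nn.Linear(50, 20), nn.ReLU(),
    nn.Linear(20, 10), nn.LogSoftmax(dim=1),
)
```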
Figure 3: Training of 2-layer MLP. (left) Test accuracy along the training path. (middle) Loss for the entire training batch. (right) Residual (norm of gradient) for the entire training batch. The small images present the fine-grained curve from the first 1000 epochs.
Figure 4: 1D plot of 2-layer MLP. (left) Objective (loss) along the direction from initialization to solution. (middle) Residual (norm of gradient) for the entire training batch. (right) Inner product of the direction and the gradient. The fine-grained plots showing the interpolation between initialization and solution are presented at the bottom.
3 CNN
We test a simple convolutional neural network (CNN) with one hidden convolutional layer of 6 filters of size 3 × 3. ReLU is used as the nonlinear activation, softmax for the output layer, and the NLL criterion for the loss. Surprisingly, the plots for the 1-layer MLP, 2-layer MLP, and 1-layer CNN are rather similar (Fig. 1–Fig. 6), and the local plots (bottom figures) suggest the loss surface is friendly to optimization methods. The 1D plots, however, cannot fully capture the real loss surface. It remains an open question whether these "good" local properties are general for neural networks, hold only around "good" solutions, or appear only along specific directions.
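A sketch of this CNN under the same assumptions (10 classes, NLL loss); the absence of padding and pooling is our assumption, since the text only specifies 6 filters of 3 × 3.

```python
import torch.nn as nn

# 1-layer CNN: 6 filters of 3x3 over a 1-channel 8x8 image.
# With no padding, each feature map shrinks to 6x6, so the
# flattened representation has 6 * 6 * 6 = 216 features.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(216, 10),
    nn.LogSoftmax(dim=1),
)
```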
Figure 5: Training of 1-layer CNN. (left) Test accuracy along the training path. (middle) Loss for the entire training batch. (right) Residual (norm of gradient) for the entire training batch. The small images present the fine-grained curve from the first 1000 epochs.
Figure 6: 1D plot of 1-layer CNN. (left) Objective (loss) along the direction from initialization to solution. (middle) Residual (norm of gradient) for the entire training batch. (right) Inner product of the direction and the gradient. The fine-grained plots showing the interpolation between initialization and solution are presented at the bottom.

References

[1] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[2] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
[3] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of the 28th Conference on Learning Theory, pages 797–842, 2015.
[4] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
[5] K. Kawaguchi. Deep learning without poor local minima. arXiv preprint arXiv:1605.07110, 2016.
[6] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.