828L project: 1D plot of NN loss surface
Zheng Xu
Dept. of CS, Univ. of Maryland, College Park
[email protected]
Motivation. We study the loss surface of neural networks in this manuscript. Recent research suggests that saddle points are ubiquitous and that local optima are reasonably good for neural networks [2, 1, 5]. However, the loss surface of neural networks is still largely unknown. To gain more intuition about what the loss surface may look like, we use low-dimensional plots to visualize several measurements, such as the objective and the norm of the gradient.

Approach. Specifically, we first train a neural network with the momentum SGD method and a decreasing learning rate to arrive at a reasonable solution (critical point) on the loss surface. We then find the direction from the initial parameters to the solution and perform 1D interpolation along this direction, which passes through both the initialization and the solution. For each interpolated point, we calculate the objective and the gradient on the entire training batch and plot the results (a code sketch is given below). Note that our method is very similar to the visualization methods in [4, 6].

Implementation. In the following experiments, we use a rather "toy" problem: a sampled MNIST dataset with 8 × 8 images, 600 images for training, and 100 images for testing. Without any architecture tuning, the neural networks we use generally achieve over 85% accuracy on this dataset. We use the default initialization in Torch with random seed 0 to run the SGD optimization, with an initial learning rate of 0.01 and a minibatch size of 16. We monitor the training process by calculating the objective and gradient for the entire training batch after each epoch, and stop training when the residual (norm of the gradient) becomes small or after 10 minutes of running time.

Discussion. In our experiments the gradient never becomes small enough (less than 1e-5) to stop the optimization, while training seems to stop making progress after the first few hundred epochs. A faster learning-rate decay may help the solution settle at a more "stable" critical point. Though it has been argued that SGD is unlikely to get trapped at a saddle point [3], it is still worth verifying whether our solution is a true local minimum. Analyzing the Hessian matrix would provide sufficient evidence, though it may require factorizing a large matrix. Finally, we remind the reader that a 1D plot loses a lot of detail: we use 50 interpolation points between the initialization and the solution, so the plots only approximate the loss surface at a coarse grain.
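To make the procedure concrete, below is a minimal sketch of the interpolation loop. Our experiments were run in Torch; this sketch uses PyTorch for readability, and the `model`, `loss_fn`, `data`, and `targets` objects, as well as the helper name `line_interpolation`, are hypothetical stand-ins rather than our actual code.

```python
import torch

def line_interpolation(model, theta_init, theta_star, loss_fn, data, targets,
                       num_points=50):
    """Loss and gradient norm along the line from theta_init to theta_star.

    theta_init and theta_star are lists of tensors matching model.parameters();
    alpha = 0 recovers the initialization, alpha = 1 the trained solution.
    """
    results = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        # Overwrite the model parameters with the interpolated point.
        with torch.no_grad():
            for p, p0, p1 in zip(model.parameters(), theta_init, theta_star):
                p.copy_((1 - alpha) * p0 + alpha * p1)
        model.zero_grad()
        loss = loss_fn(model(data), targets)  # objective on the entire training batch
        loss.backward()                       # gradient on the entire training batch
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        results.append((alpha.item(), loss.item(), grad_norm.item()))
    return results
```

Extending the range of `alpha` beyond [0, 1] gives wider views of the same line, such as those in the top rows of the 1D plots below.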
1 Shallow MLP
We use a multilayer perceptron (MLP) with one hidden layer of 50 hidden nodes in this experiment. ReLU is used as the nonlinear activation, softmax for the output layer, and the negative log likelihood (NLL) criterion for the loss. Fig. 1 presents the plots of test accuracy, training loss, and training residual along the training path. The test accuracy suggests the solution is reasonably good. The training loss and residual drop quickly in the first few hundred epochs and then arrive at a plateau. Fig. 2 presents the plots of training loss, training residual, and the inner product of the gradient with the direction from initialization to solution, all along that direction. We do see some other critical points along the direction, and there seems to be a plateau around the solution (Fig. 2 top left). The interpolation between initialization and solution suggests the local loss surface is reasonably good, which may partially explain the success of optimizing neural networks (Fig. 2 bottom).
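For concreteness, here is a sketch of this architecture in PyTorch. We assume the 8 × 8 images are flattened to 64 features and that there are 10 digit classes; everything except the 50 hidden nodes, ReLU, softmax, and NLL is inferred from the data description.

```python
import torch.nn as nn

# 1-hidden-layer MLP: 64 inputs (flattened 8x8 image), 50 hidden nodes, 10 classes.
model = nn.Sequential(
    nn.Linear(64, 50),
    nn.ReLU(),
    nn.Linear(50, 10),
    nn.LogSoftmax(dim=1),  # log-softmax output pairs with the NLL criterion below
)
criterion = nn.NLLLoss()
```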
Figure 1: Training of 1-layer MLP. (left) Test accuracy along the training path. (middle) Loss for the entire training batch. (right) Residual (norm of gradient) for the entire training batch. The small images present the fine-grained curve from the first 1000 epochs.
Figure 2: 1D plot of 1-layer MLP. (left) Objective (loss) along the direction from initialization to solution. (middle) Residual (norm of gradient) for the entire training batch. (right) Inner product of the direction and the gradient. The fine-grained plots showing the interpolation between initialization and solution are presented at the bottom.
2 Deep MLP
We make the neural network a bit deeper by adding a second hidden layer with 20 hidden nodes; ReLU is used for both hidden layers, and all other settings are the same as in Section 1. Fig. 3 presents the plots of test accuracy, training loss, and training residual along the training path. Fig. 4 presents the plots of training loss, training residual, and the inner product of the gradient with the direction from initialization to solution, all along that direction. Though the loss objective of the 2-layer MLP looks sharper than that of the 1-layer MLP over a wide range (note that the objective in Fig. 4 top left is higher than in Fig. 2), the local curves are very similar (compare the bottoms of Fig. 4 and Fig. 2).
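Under the same assumptions as the sketch in Section 1, the corresponding model differs only in the extra 20-node layer:

```python
import torch.nn as nn

# 2-hidden-layer MLP: a second hidden layer of 20 nodes is inserted
# between the 50-node layer and the 10-class output.
model = nn.Sequential(
    nn.Linear(64, 50), nn.ReLU(),
    nn.Linear(50, 20), nn.ReLU(),
    nn.Linear(20, 10), nn.LogSoftmax(dim=1),
)
```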
Figure 3: Training of 2-layer MLP. (left) Test accuracy along the training path. (middle) Loss for the entire training batch. (right) Residual (norm of gradient) for the entire training batch. The small images present the fine-grained curve from the first 1000 epochs.
Figure 4: 1D plot of 2-layer MLP. (left) Objective (loss) along the direction from initialization to solution. (middle) Residual (norm of gradient) for the entire training batch. (right) Inner product of the direction and the gradient. The fine-grained plots showing the interpolation between initialization and solution are presented at the bottom.
3 CNN
We test a simple convolutional neural network (CNN) with one hidden convolutional layer of 6 filters of size 3 × 3. ReLU is used as the nonlinear activation, softmax for the output layer, and the NLL criterion for the loss. Surprisingly, the plots for the 1-layer MLP, 2-layer MLP, and 1-layer CNN are rather similar (Fig. 1–Fig. 6), and the local plots (bottom figures) suggest the loss surface is friendly to optimization methods. The 1D plots, however, cannot fully capture the real loss surface. It remains an open question whether these "good" local properties are general for neural networks, hold only around "good" solutions, or appear only along specific directions.
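A sketch of this CNN under the same assumptions (10 classes, NLL loss); the absence of padding and pooling is our assumption, since the text only specifies 6 filters of 3 × 3.

```python
import torch.nn as nn

# 1-layer CNN: 6 filters of 3x3 over a 1-channel 8x8 image.
# With no padding, each feature map shrinks to 6x6, so the
# flattened representation has 6 * 6 * 6 = 216 features.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(216, 10),
    nn.LogSoftmax(dim=1),
)
```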
Figure 5: Training of 1-layer CNN. (left) Test accuracy along the training path. (middle) Loss for the entire training batch. (right) Residual (norm of gradient) for the entire training batch. The small images present the fine-grained curve from the first 1000 epochs.
Figure 6: 1D plot of 1-layer CNN. (left) Objective (loss) along the direction from initialization to solution. (middle) Residual (norm of gradient) for the entire training batch. (right) Inner product of the direction and the gradient. The fine-grained plots showing the interpolation between initialization and solution are presented at the bottom.

References

[1] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.
[2] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
[3] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Proceedings of the 28th Conference on Learning Theory, pages 797–842, 2015.
[4] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
[5] K. Kawaguchi. Deep learning without poor local minima. arXiv preprint arXiv:1605.07110, 2016.
[6] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.