Functional manifold projections in Deep-LEARCH

Jim Mainprice 1, Arunkumar Byravan 2, Daniel Kappler 1,3, Stefan Schaal 1,4, Dieter Fox 2, Nathan Ratliff 3

1 Autonomous Motion Department (AMD), Max Planck Institute for Intelligent Systems (MPI-IS), Tübingen, Germany
2 University of Washington, Seattle, WA, USA
3 Lula Robotics Inc., Seattle, WA, USA
4 University of Southern California (USC), Los Angeles, CA, USA

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain

1 Introduction

Inverse Reinforcement Learning (IRL) has been studied for more than 15 years and is of fundamental importance in robotics. It allows learning a utility function that "explains" the behavior of an agent, and can thus be used to imitate or predict a given behavior given access only to demonstrated optimal or near-optimal solutions. When the reward function is assumed to be a linear combination of features, IRL has strong convergence properties [AN04, KPRS13, BMZ+15, MHB15, MHB16]. In recent years, however, much interest has focused on using Deep Convolutional Neural Networks (CNNs) to encode the reward function [FLA16, WZWP16]. Such powerful nonlinear function approximators allow learning directly from low-level features, without requiring domain knowledge, which can potentially lead to learning higher-fidelity behaviors. The LEARning to SearCH framework (LEARCH) [RSB09] introduced functional gradients as a powerful technique underlying IRL loss optimization. This approach has been shown to outperform subgradient methods when optimizing under a linear reward assumption, and to efficiently optimize nonlinear cost functions.

In this paper, we extend LEARCH to train CNNs using functional manifold projections, which we denote Deep-LEARCH. Earlier work on functional gradient approaches [RSB09] built large but flat additive models that continually grow in size. Our technique maintains the convergence advantage of functional gradient techniques (observed in linear spaces [MBVH09]) while generalizing to fixed-size deep parametric models (CNNs) by formally representing the function approximator as a nonlinear submanifold of the space of all functions. We derive a simple step-project functional gradient descent method that walks across the manifold and is substantially more data efficient than the traditional gradient step, consisting of a single back-propagation, commonly used in Deep-IRL. We present preliminary experimental results showing higher training rates on low-dimensional 2D synthetic data. We believe these ideas have broad implications for structured training beyond IRL, as well as for deep learning training in general.

2 Functional manifold projection

Defining ξi as the ith example trajectory and denoting a state along a trajectory by xt ∈ ξ, the LEARCH [RSB09] loss functional is

L[c] = \sum_{i=1}^{N} \left[ \sum_{x_t^i \in \xi_i} c(x_t^i) \; - \; \min_{\xi \in \Xi} \sum_{x_t \in \xi} \big( c(x_t) - l_i(x_t) \big) \right],    (1)

where Ξ is the set of all trajectories and l_i is some loss defining the margin. For regularization, we often restrict the class of functions c to lie on the manifold of function approximators H, such as neural networks with a particular parameterization, and we may restrict that class further by placing bounds on the size of the neural-network weight vectors. In Deep-LEARCH both of these are handled by the projection step in the inner loop, where we train the neural network based on the data we have collected.
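To make the loss concrete, here is a minimal Python sketch (not our implementation) that evaluates Eq. (1) on a 4-connected grid. Dijkstra's algorithm stands in for the Field D* planner used in our experiments, the margin is a constant on non-demonstrated cells, and a small clip keeps the loss-augmented costs positive; all of these choices are illustrative assumptions.

```python
import heapq
import numpy as np

def dijkstra_path(cost, start, goal):
    """Shortest path on a 4-connected grid; entering cell (r, c) costs cost[r, c]."""
    H, W = cost.shape
    dist = np.full((H, W), np.inf)
    prev = {}
    dist[start] = 0.0
    queue = [(0.0, start)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            break
        if d > dist[r, c]:
            continue                      # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < H and 0 <= nc < W and d + cost[nr, nc] < dist[nr, nc]:
                dist[nr, nc] = d + cost[nr, nc]
                prev[(nr, nc)] = (r, c)
                heapq.heappush(queue, (dist[nr, nc], (nr, nc)))
    path, node = [goal], goal
    while node != start:                  # walk the predecessor map back to the start
        node = prev[node]
        path.append(node)
    return path[::-1]

def learch_loss(cost, demos, margin=1.0):
    """Eq. (1): for each demo, demonstrated cost minus the loss-augmented optimum."""
    total = 0.0
    for demo in demos:
        l_i = np.full(cost.shape, margin)
        for cell in demo:
            l_i[cell] = 0.0               # no margin on demonstrated states
        demo_cost = sum(cost[cell] for cell in demo)
        # Loss-augmented planning: min over trajectories of sum_x (c(x) - l_i(x)).
        xi_star = dijkstra_path(np.maximum(cost - l_i, 1e-6), demo[0], demo[-1])
        total += demo_cost - sum(cost[cell] - l_i[cell] for cell in xi_star)
    return total

cost = np.random.rand(32, 32) + 0.1       # a synthetic per-pixel cost map
demo = [(0, c) for c in range(32)]        # one demonstration along the top row
print(learch_loss(cost, [demo]))          # larger when the cost explains the demo poorly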

[Figure 1: The types of environments, from left to right: fully observable, lidar, and object in motion (left); rows show the occupancy map, the synthetic cost, and the learned cost. Validation loss on the "fully observable" data set as a function of LEARCH iterations, for 100, 500, 1,000, 2,000, and 4,000 inner-loop CNN training steps (right).]

The outline of the LEARCH algorithm that optimizes this loss functional based on its functional gradient is given in Section 5. Intuitively, the negative gradient of the portion of the loss L[c] that deals only with the outputs y = c_t(x) of the cost function, i.e. -\partial L(y)/\partial y, defines the quickest way to decrease the loss. Thus the (negative) functional gradient data set is just the (negative) partials of the loss,

\Delta c_i = \left. \frac{\partial L(y)}{\partial y} \right|_{y = c_t(x_i)}.

The functional gradient ∇f L[c] defines a step off of the function manifold spanned by our hypothesis class H = {c = c(·; w) | w ∈ W} [RSB09]. At each Deep-LEARCH iteration, the functional gradient is projected back onto the manifold by computing the direction h* ∈ H that best correlates with the functional gradient. This results in training the CNN, solving a problem of the form

h^* = \arg\max_{h \in H} \; \langle h, \nabla_f L[c] \rangle.

The space of all square-integrable functions L2 is large. Even if H is the space of all deep neural networks with 200 layers and 20 million weights (a made-up example), that class of function approximators can span a submanifold of at most 20 million dimensions, one for each weight. How do we know that? The set of all function approximators in that class is parameterized by its weight vector w. Changing the weight vector moves from point to point in the class of functions. The function approximators are differentiable w.r.t. w, so this movement between functions is smooth. Therefore, it traces out a smooth submanifold in the space of functions. The Jacobian of the function approximator w.r.t. w tells us, to first order, how the function changes when we change the weights.
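As a toy illustration of this step-project view (not our implementation), the sketch below uses a two-weight nonlinear model, evaluates it at a set of sample points, takes a functional-gradient step on those values (which leaves the submanifold), and projects back by a least-squares solve in the tangent space spanned by the Jacobian's columns. The model, the sample points, and the random stand-in gradient are assumptions made for the example.

```python
import numpy as np

def c(x, w):
    """A tiny function approximator with two weights; its image over all w is a
    curved 2-dimensional submanifold of the space of functions on the samples x."""
    return np.sin(w[0] * x) + w[1] * x ** 2

def jacobian(x, w, eps=1e-6):
    """Finite-difference Jacobian J[i, j] = d c(x_i; w) / d w_j (tangent directions)."""
    J = np.zeros((x.size, w.size))
    for j in range(w.size):
        dw = np.zeros_like(w)
        dw[j] = eps
        J[:, j] = (c(x, w + dw) - c(x, w - dw)) / (2.0 * eps)
    return J

x = np.linspace(0.0, 1.0, 50)            # states at which the cost is evaluated
w = np.array([2.0, -1.0])                # current weights, i.e. a point on the manifold

g = np.random.randn(x.size)              # stand-in for the functional gradient at the samples
step = 0.1
off_manifold = c(x, w) - step * g        # functional-gradient step: generally leaves the manifold

# Project back: find the weight change whose first-order effect J @ dw best matches
# the desired change in function values (least squares onto the tangent space).
J = jacobian(x, w)
dw, *_ = np.linalg.lstsq(J, off_manifold - c(x, w), rcond=None)
print("projected weight update:", w + dw)
```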

3 Results

In order to validate our approach, we have implemented Deep-LEARCH on three types of environments. We then study the evolution of the validation loss of [RSB09] on a holdout set as the number of CNN training steps (i.e., stochastic gradient descent steps) increases; this number intuitively relates to how precise the manifold projections are. The environments are presented in Figure 1. The occupancy maps (first row in the figure) are known and represent the agent's sensory input. The three types of data sets, from the left to the right columns in the figure, are constructed to represent 1) a fully observable occupancy map, 2) lidar occupancy data, and 3) an object in motion, whose direction of motion is represented by a few pixels. For each data set, we randomly generated 800 environments composed of occupancy maps and synthetic costmaps, on which we planned 20 example trajectories using the Field D* [FS06] algorithm, resulting in 16,000 demonstrations. We then attempt to learn a CNN parameterization that can reproduce the behaviors using Deep-LEARCH. The CNN maps occupancy to cost (as shown in the third row of the figure), producing a cost for each pixel in the image. The predicted behavior is obtained by path planning with the same Field D* algorithm. We show on the first data set that training the network with a larger number of stochastic gradient steps (in black in the figure) yields higher generalization performance, which makes our functional manifold projection view of Deep-IRL promising.

4 CNN models

We use a Convolutional Neural Network (CNN) as the function approximator that generates the cost function for a given scene. We have tested three specific architectures, all of which take an occupancy grid representation Occ(X) of a scene X as input and generate the corresponding cost function c:

• Conv-Only: This network applies a sequence of six convolutions, each followed by Batch Normalization (BN) [IS15] and a PReLU [HZRS15] non-linearity. We keep the resolution constant throughout the network.

• Conv-Deconv: This network first applies a sequence of four convolutions to the input. Each convolution layer is followed by a max-pooling operation that reduces the resolution by a factor of 2. At the bottleneck, we apply a 1x1 convolution followed by four deconvolution layers which interpolate the output back to the full resolution. Each layer except the final layer is followed by Batch Normalization and a PReLU non-linearity.

• Conv-Deconv-Linear: This architecture is similar to the Conv-Deconv network, but we replace the Conv + max-pooling operations by strided convolutions, using only three strided convolution and deconvolution layers. Also, we replace the 1x1 convolution at the bottleneck with two fully connected layers that allow us to propagate information throughout the entire image. As before, all layers except the final layer are followed by Batch Normalization and a PReLU non-linearity. This is the model used in the results presented in Section 3 (a sketch is given below).

We use the neural network package Torch [tor] to implement our models, and we train all networks with the ADAM optimization algorithm [KB14] with default parameters, a step size of 1e-3, and a batch size of 32.
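Our models are implemented in Torch; for readers more familiar with PyTorch, the following sketch of the Conv-Deconv-Linear variant is illustrative only, and the channel widths, kernel sizes, and the 32x32 input resolution are assumptions rather than the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class ConvDeconvLinear(nn.Module):
    """Sketch of the Conv-Deconv-Linear cost network: three strided convolutions,
    two fully connected layers at the bottleneck, and three deconvolutions back to
    full resolution. BatchNorm + PReLU follow every layer except the last."""

    def __init__(self, in_channels=1, base=16, resolution=32):
        super().__init__()
        bottleneck = resolution // 8                       # three stride-2 convolutions
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.BatchNorm2d(base), nn.PReLU(),
            nn.Conv2d(base, 2 * base, 4, stride=2, padding=1),
            nn.BatchNorm2d(2 * base), nn.PReLU(),
            nn.Conv2d(2 * base, 4 * base, 4, stride=2, padding=1),
            nn.BatchNorm2d(4 * base), nn.PReLU(),
        )
        flat = 4 * base * bottleneck * bottleneck
        self.fc = nn.Sequential(                           # propagate information globally
            nn.Linear(flat, flat), nn.BatchNorm1d(flat), nn.PReLU(),
            nn.Linear(flat, flat), nn.BatchNorm1d(flat), nn.PReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4 * base, 2 * base, 4, stride=2, padding=1),
            nn.BatchNorm2d(2 * base), nn.PReLU(),
            nn.ConvTranspose2d(2 * base, base, 4, stride=2, padding=1),
            nn.BatchNorm2d(base), nn.PReLU(),
            nn.ConvTranspose2d(base, 1, 4, stride=2, padding=1),  # per-pixel cost, no BN/PReLU
        )
        self.bottleneck_shape = (4 * base, bottleneck, bottleneck)

    def forward(self, occ):
        h = self.encoder(occ)
        h = self.fc(h.flatten(1)).view(-1, *self.bottleneck_shape)
        return self.decoder(h)

# Training setup as described above: ADAM, step size 1e-3, batch size 32.
model = ConvDeconvLinear()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
costmap = model(torch.rand(32, 1, 32, 32))       # (batch, 1, H, W) occupancy in [0, 1]
```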

5 Appendix: LEARCH algorithm

1. Initialize the data set to empty, D = ∅, and iterate the following across all examples i:

   (a) Solve the loss-augmented problem

       \xi_i^* = \arg\min_{\xi \in \Xi_i} \sum_{x_t \in \xi} \big( c(x_t) - l_i(x_t) \big).    (2)

   (b) Add the functional gradient data from the loss-augmented problem,

       D = D \cup \{ (x_t^*, \; c(x_t^*) + \eta_t) \mid x_t^* \in \xi_i^* \}.    (3)

       These points suggest where to increase the cost function by η_t.

   (c) Add the functional gradient data from the example,

       D = D \cup \{ (x_t, \; c(x_t) - \eta_t) \mid x_t \in \xi_i \}.    (4)

       These points suggest where to decrease the cost function by η_t.

2. Solve the regression problem to improve the hypothesis:

   c_{t+1} = \arg\min_{w} \; \frac{1}{2} \sum_{(x, y) \in D} \big( y - c(x; w) \big)^2 + \frac{\lambda_1}{2} \| w - w_t \|^2 + \frac{\lambda_2}{2} \| w \|^2.    (5)
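The following compact Python sketch of this loop is schematic, not the released implementation: `plan` is any grid planner with the signature plan(cost, start, goal) -> list of cells (Dijkstra in the earlier sketch, Field D* in the paper), and `fit_fn` is a hypothetical stand-in for the inner-loop regression of Eq. (5), e.g. a few ADAM steps on the CNN of Section 4. Regressing toward a full target cost map rather than the point set D is a simplification made for this sketch.

```python
import numpy as np

def learch_iteration(cost, plan, fit_fn, demos, eta=0.1, margin=1.0):
    """One outer LEARCH iteration (steps 1-2 above) for a single environment.

    cost    -- current per-cell cost map c_t (2D numpy array)
    plan    -- planner: plan(cost, start, goal) -> list of grid cells
    fit_fn  -- regression/projection step, Eq. (5): fit_fn(target_map) -> new cost map
    demos   -- demonstrated trajectories, each a list of grid cells
    """
    targets = cost.copy()                 # cells untouched below keep their current cost
    for demo in demos:
        # (a) Loss-augmented planning, Eq. (2): subtract the margin off the demonstration.
        l_i = np.full(cost.shape, margin)
        for cell in demo:
            l_i[cell] = 0.0
        xi_star = plan(np.maximum(cost - l_i, 1e-6), demo[0], demo[-1])
        # (b) Raise the cost along the loss-augmented plan, Eq. (3).
        for cell in xi_star:
            targets[cell] = cost[cell] + eta
        # (c) Lower the cost along the demonstration, Eq. (4).
        #     (Cells visited by both trajectories end up with the demonstration target here.)
        for cell in demo:
            targets[cell] = cost[cell] - eta
    # 2. Project back onto the manifold of representable cost functions, Eq. (5).
    return fit_fn(targets)

# Example wiring (hypothetical names): new_cost = learch_iteration(cost_map, dijkstra_path,
#     lambda t: cnn_fit(occupancy, t), demos)   # cnn_fit runs the inner-loop SGD regression
```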

6 Acknowledgment

This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.

References

[AN04] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.

[BMZ+15] Arunkumar Byravan, Mathew Monfort, Brian Ziebart, Byron Boots, and Dieter Fox. Graph-based inverse optimal control for robot manipulation. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

[FLA16] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. arXiv preprint, March 2016.

[FS06] Dave Ferguson and Anthony Stentz. Using interpolation to improve path planning: The Field D* algorithm. Journal of Field Robotics, 23(2):79–101, 2006.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.

[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[KB14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KPRS13] Mrinal Kalakrishnan, Peter Pastor, Ludovic Righetti, and Stefan Schaal. Learning objective functions for manipulation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 1331–1336, 2013.

[MBVH09] Daniel Munoz, J. Andrew Bagnell, Nicolas Vandapel, and Martial Hebert. Contextual classification with functional max-margin Markov networks. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, pages 975–982. IEEE, 2009.

[MHB15] Jim Mainprice, Rafi Hayne, and Dmitry Berenson. Predicting human reaching motion in collaborative tasks using inverse optimal control and iterative re-planning. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 885–892, 2015.

[MHB16] Jim Mainprice, Rafi Hayne, and Dmitry Berenson. Goal set inverse optimal control and iterative re-planning for predicting human reaching motions in shared workspaces. IEEE Transactions on Robotics, 2016.

[RSB09] Nathan D. Ratliff, David Silver, and J. Andrew Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, June 2009.

[tor] Torch: http://torch.ch/.

[WZWP16] Markus Wulfmeier, Dominic Zeng Wang, and Ingmar Posner. Watch this: Scalable cost-function learning for path planning in urban environments. In IROS. IEEE, 2016.

