Faster Convolutional Architecture Search for Semantic Segmentation

Rupesh Durgesh (1,2), Gerald Krell (2), Andrii Iegorov (1), Peter Schuberth (1), Felix Friedmann (1)
(1) Audi Electronics Venture GmbH, Germany, {rupesh.durgesh, felix.friedmann, peter.schuberth}@audi.de
(2) Otto-von-Guericke-Universität Magdeburg, Germany, {gerald.krell}@ovgu.de

Abstract—Designing deep learning architectures is a complex task that requires expert knowledge. Convolutional Neural Networks (CNNs) involve different network topologies, layers, and layer parameters. To automate the design process, our approach builds on MetaQNN, in which a Q-learning agent designs architectures by sequentially selecting CNN layers. We extend the approach to the semantic segmentation task, where architectures consist of encoder and decoder layers. To speed up the search process, we use a Hyperband-like technique. Our experiments are evaluated on the CamVid urban street scene semantic segmentation dataset. The architectures designed by the Q-learning agent for semantic segmentation outperform some commonly used hand-designed architectures with a similar number of parameters.

I. INTRODUCTION

In recent years, deep neural networks have revolutionized the field of Artificial Intelligence. CNNs outperform classical computer vision algorithms in classification and segmentation tasks [13][10][2][6]. Different CNNs can be designed for a given accuracy; for example, the SqueezeNet [11] architecture achieves accuracy similar to AlexNet [13] with far fewer parameters. CNNs involve different network topologies with different layers and parameters, and it is not easy to understand how these layers and layer parameters influence the results. Hence, automated design of neural networks is desirable for effective exploration. This paper defines a CNN architecture search using Q-learning [24] for encoder-decoder architectures. Our experiments are evaluated on semantic segmentation, where the goal is to assign a class label to each pixel of a given image. For training, instead of early stopping, our approach uses a Hyperband-like technique to select the number of epochs for the generated architectures. Our experiments run fast, and the results are competitive with hand-designed architectures with a similar number of parameters.

A. Related Work

Design space exploration of neural networks is one of the important topics in the field of deep learning. Earlier methods include cascade-correlation learning [8], where hidden units are added to the network when the accuracy does not improve. Recent approaches include predicting CNN layer parameters with a recurrent network as a controller [25]. Although this approach is promising with a vast design space, it requires a lot of GPU resources and is not feasible for all kinds of applications. Saxena and Verbeek [22] proposed convolutional neural fabrics, in which a large number of architectures are ensembled. DeepArchitect [17] is a framework for architecture search with tree search methods. The MetaQNN [3] approach is based on Q-learning [24].

II. METHODS

Fig. 1. Overview: The controller Q-learning agent selects layers using the ε-greedy strategy. Different network topologies are added to the selected layers. The designed architecture is trained for a certain number of epochs, and the Q-values are then updated using the validation accuracy as a reward.

The concept is shown in Fig. 1. Our approach is based on MetaQNN [3]. The controller Q-learning [24] agent designs architectures by consecutively selecting available CNN layers and explores the design space using the ε-greedy strategy [16]. Designed architectures are trained for a certain number of epochs, and the best validation accuracy is used as the reward. Q-values for selecting layers as actions are updated using Bellman's equation [4]:

    Q_{t+1}(s_i, u) = (1 - α) Q_t(s_i, u) + α [ r_t + γ max_{u' ∈ U(s_j)} Q_t(s_j, u') ]    (1)

where r_t is the reward, α is the learning rate, and γ is the discount factor that balances immediate and future rewards; if γ is zero, the Q-value relies only on the immediate reward. The ε-greedy strategy decides how a layer is selected as an action, distinguishing an exploration stage from an exploitation stage. During exploration, the agent selects a random action; during exploitation, it selects the action with the maximum Q-value. The exploration rate is usually high at the beginning of the experiment, so that the agent can explore many different actions for a given state, as sketched below.
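As a concrete illustration, the following minimal Python sketch implements the update of Eq. (1) together with ε-greedy selection, assuming a tabular Q representation; the function names and the constants ALPHA and GAMMA are illustrative and not taken from the original implementation.

```python
import random

ALPHA, GAMMA = 0.1, 1.0  # assumed values for the learning rate and discount factor

def epsilon_greedy(q, state, actions, eps):
    """Exploration: random layer with probability eps; else exploit the max Q-value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def update_q(q, s_i, u, r_t, s_j, next_actions):
    """Bellman update of Eq. (1); r_t is the best validation accuracy (reward)."""
    best_next = max((q.get((s_j, u2), 0.0) for u2 in next_actions), default=0.0)
    q[(s_i, u)] = (1 - ALPHA) * q.get((s_i, u), 0.0) \
                  + ALPHA * (r_t + GAMMA * best_next)
```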

A. Design Space

Similar to the MetaQNN approach, layer selection is modelled as a Markov Decision Process. The state and action space is shown in Fig. 2. We extend the design space to the semantic segmentation task, which involves encoder and decoder layers. The encoder consists of a stack of convolutional layers with downsampling layers; the decoder uses upsampling layers instead of downsampling. In the decoder design space, convolutional and upsampling layers are assigned in alternation. No termination state is used, so all designed architectures have the same depth.
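To make the sequential selection concrete, the sketch below draws one encoder-decoder layer sequence from a design space like the one above; it samples uniformly at random purely for illustration (the agent actually selects via Q-values), and the tuple encoding and helper name are hypothetical.

```python
import random

# Convolution choices follow Table I: kernels in {1, 3, 5},
# feature maps in {16, 32, 64, 128, 144}, stride 1.
CONV_CHOICES = [('conv', f, k) for f in (16, 32, 64, 128, 144) for k in (1, 3, 5)]

def sample_encoder_decoder(encoder_depth=12, num_classes=12):
    layers, n_pools = [], 0
    for i in range(1, encoder_depth + 1):
        layers.append(random.choice(CONV_CHOICES))
        if i % 4 == 0:                       # constraint: pool after every 4 layers
            layers.append(('maxpool', random.choice([(2, 2), (3, 3)])))
            n_pools += 1
    for _ in range(n_pools):                 # one upsampling per pooling layer,
        layers.append(('upsample', 4, 2))    # alternated with convolutions whose
        layers.append(('conv', num_classes, 3))  # feature maps equal the classes
    return layers
```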

Fig. 2. Design space for encoder-decoder architectures. The parameters are in the format (feature maps, kernel width, kernel height). C is the number of classes. The number of feature maps for decoder layers equals the number of classes of the given task. For the decoder layer selection, a convolutional layer is introduced between layers.

The available layers and parameters are shown in Table I. We do not use larger numbers of feature maps for convolutional layers, in order to limit the design space and the number of parameters. In the decoder design space, the number of convolutional feature maps equals the number of classes of the respective task.

B. Network Topologies

Fig. 3. Different network topologies: For the selected layers, different network topologies such as split, residual, and SharpMask-like connections are added automatically.

For generated layers with similar shapes, the agent randomly adds different network topologies and forms the final architecture. The topologies shown in Fig. 3, such as split, residual [9], and SharpMask-like [19] connections, are introduced, as sketched below.
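A minimal Keras sketch of two of these connection types, assuming feature maps of matching spatial shape; the helper names are illustrative and not the authors' code.

```python
from keras.layers import Add, Concatenate, Conv2D

def residual_connection(x, block_output):
    # Residual topology [9]: element-wise addition of similarly shaped layers.
    return Add()([x, block_output])

def sharpmask_like_connection(encoder_feat, decoder_feat, n_filters):
    # SharpMask-like topology [19]: merge an encoder feature map into the
    # decoder path and refine with a convolution.
    merged = Concatenate()([encoder_feat, decoder_feat])
    return Conv2D(n_filters, 3, padding='same', activation='elu')(merged)
```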

C. Speeding Up the Learning Process

Instead of using early stopping, we use a Hyperband-like [14] technique to speed up the learning process. N designed architectures are trained for T epochs; the N/3 architectures with the highest validation accuracy are then trained for 3T epochs. By repeating the process, in the end the best N/S architectures are trained for S·T epochs. In this way the agent explores more, and it also finds fast-converging architectures. The number of epochs per step is shown in Table II, and the schedule is sketched below.
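A short sketch of this schedule, assuming a `train` callback that returns the best validation accuracy and hashable architecture descriptions (e.g., tuples); with N = 27 and S = 3 it reproduces the 27/9/3/1 progression of Table II.

```python
def hyperband_like(architectures, train, base_epochs=1, s=3):
    """Successive-halving-style schedule: keep the best 1/s of the pool and
    multiply the epoch budget by s, until one architecture remains."""
    pool, epochs = list(architectures), base_epochs
    while True:
        # `train` returns the best validation accuracy (also the Q-learning reward)
        scores = {arch: train(arch, epochs) for arch in pool}
        if len(pool) == 1:
            return pool[0], scores[pool[0]]
        pool = sorted(pool, key=scores.get, reverse=True)[:len(pool) // s]
        epochs *= s
```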

III. EXPERIMENTS

A. Semantic Segmentation

The goal of semantic segmentation is the pixel-wise classification of the objects present in a given input image. It is common to design segmentation architectures by modifying the best classification architectures trained on the ImageNet dataset [21]. These architectures usually have a high number of parameters, and training them from scratch without pre-trained weights is not feasible [15][6]. However, some architectures are designed exclusively for this task [2][20][18].

B. Datasets

Our experiments are evaluated on an internal dataset provided by Audi and, for benchmarking purposes, on the public CamVid dataset [5]. For the architecture search stage, we took a subset of the Audi dataset with 1900 training and 200 validation samples. The best final architectures are then trained on the full dataset, and the results are evaluated on an internal test dataset. The Cambridge-driving Labeled Video Database (CamVid) is an automotive road scene dataset. For training, a resolution of 480x360 RGB pixels is used; the dataset consists of 367 training, 101 validation, and 233 test images. It has 11 classes; unlabeled or isolated pixels are mapped to a 12th class, which is ignored during training.

C. Training

TABLE I
DESIGN SPACE LAYERS AND PARAMETERS

Layer        | Available Parameters
------------ | ----------------------------------------------------------------
Convolution  | Kernel size: {1, 3, 5}; Feature maps: {16, 32, 64, 128, 144}; Stride: {1}
Pooling      | Max pooling; Pool size: {(2, 2), (3, 3)}
Dropout      | Rate: {0.25, 0.5}
Up-Sampling  | Kernel size: (4, 4); Stride: {2}

In our experiments, the maximum encoder depth is set to 20 layers. For the decoder design space, deconvolutional layers [15] are used. We introduce design constraints: after every 4 layers the agent must select a pooling layer, and the agent has to select upsampling layers according to the number of pooling layers. For the generated sequential layers, the agent randomly adds different network topologies and forms the final architecture. All designed architectures use the Exponential Linear Unit (ELU) activation function to avoid bias shift during training [23].

For the Hyperband-like process, the number of architectures N is 27 and the sample rate S is 3. In one cycle, the agent designs 27 architectures and each architecture is trained for 1 epoch; Q-values are updated after each training run using the validation accuracy as the reward. The best 9 architectures are then trained for 3 epochs, and the Q-values are updated again. At the end of the cycle, the best architecture is trained for 27 epochs. The number of architectures and epochs for each step is shown in Table II. The cycle is repeated for a certain number of episodes. Finally, the best architectures are trained on the complete dataset for more epochs.

Results are evaluated using the mean intersection over union (IoU) and class average accuracy metrics. Our learning framework is built on Keras [7]. For all designed architectures, the cross-entropy loss is optimized with the Adam optimizer [12], using a fixed learning rate of 1e-4 and L2 regularization with a weight decay of 2e-4. A minimal sketch of this training setup follows.
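The sketch below shows these training settings in Keras [7] on a placeholder one-layer model; it illustrates the compile settings only, not a searched architecture, and the handling of the ignored CamVid class is an assumption.

```python
from keras.layers import Conv2D, Input
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

NUM_CLASSES = 11  # CamVid classes; treatment of the 12th, ignored class is an assumption

inputs = Input(shape=(360, 480, 3))                    # CamVid training resolution
x = Conv2D(16, 3, padding='same', activation='elu',    # ELU activation [23]
           kernel_regularizer=l2(2e-4))(inputs)        # L2 weight decay of 2e-4
outputs = Conv2D(NUM_CLASSES, 1, activation='softmax')(x)
model = Model(inputs, outputs)

# Fixed learning rate of 1e-4; the keyword is `lr` in older Keras releases.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss='categorical_crossentropy')         # per-pixel cross-entropy
```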

TABLE II
HYPERBAND-LIKE PROCESS

Architectures | Epochs
------------- | ------
27            | 1
9             | 3
3             | 9
1             | 27

Fig. 4. Visual results of agent-designed architectures on Audi dataset test images (columns: input, ground truth, result).

IV. RESULTS

We initially ran experiments at a small resolution of 336x144 pixels on the Audi dataset. The agent designed and trained 1650 architectures in 3 weeks on a single NVIDIA Tesla P40 GPU. For benchmarking against our internal hand-designed architecture, we then ran experiments at higher resolutions. A few of the architectures designed by the agent outperform our hand-designed architecture by 3 to 4% with a similar number of parameters. Visual results on Audi test images are shown in Fig. 4.

We also benchmark our results against the hand-designed architectures SegNet and ENet [18]; the SegNet results are taken from [1]. The results are shown in Table III and visual results in Fig. 5. Data augmentation is not used. The architectures designed by the Q-learning agent outperform the hand-designed architectures in terms of mean IoU; a sketch of both evaluation metrics follows Table III.

TABLE III
RESULTS COMPARISON WITH SEGNET AND ENET ON THE CAMVID DATASET

Architecture     | No. of Parameters | Class Avg. | Mean IoU
---------------- | ----------------- | ---------- | --------
SegNet-Basic     | 1.4M              | 62.3       | 46.3
SegNet           | 29.4M             | 65.39      | 50.2
ENet             | 0.37M             | 68.3       | 51.3
Agent designed-1 | 2.9M              | 67.94      | 55.9
Agent designed-2 | 1.03M             | 65.91      | 53.73
Agent designed-3 | 1.02M             | 64.73      | 52.3
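A minimal sketch of the two metrics, assuming integer-valued label maps and an index for ignored pixels; the function names are illustrative.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index):
    """Accumulate a confusion matrix over integer label maps, skipping
    pixels whose ground truth equals `ignore_index`."""
    mask = gt != ignore_index
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def mean_iou_and_class_avg(cm):
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)  # intersection / union
    class_acc = tp / cm.sum(axis=1)                    # per-class accuracy
    return np.nanmean(iou), np.nanmean(class_acc)
```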

Fig. 5. Visual results of agent-designed architectures on CamVid test images (columns: input, ground truth, result).

V. CONCLUSION

Many different architectures can achieve a given accuracy, so automating the design process is desirable. The above approach can be applied to other deep learning tasks, and it can be improved further by enlarging the design space, for example with strided convolutions and special layers such as dilated convolutions. In the end, we conclude that architecture learning is the new feature learning.

REFERENCES

[1] Getting started with SegNet. http://mi.eng.cam.ac.uk/projects/segnet/tutorial.html. Accessed: 2017-08-06.
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
[3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. CoRR, abs/1611.02167, 2016.
[4] Dimitri P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Belmont, 2015.
[5] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In Proceedings of the 10th European Conference on Computer Vision: Part I, ECCV '08, pages 44-57, Berlin, Heidelberg, 2008. Springer-Verlag.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR, abs/1606.00915, 2016.
[7] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[8] Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 524-532. Morgan-Kaufmann, 1990.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[10] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[11] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[14] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Efficient hyperparameter optimization and infinitely many armed bandits. CoRR, abs/1603.06560, 2016.
[15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[17] Renato Negrinho and Geoff Gordon. DeepArchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
[18] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. CoRR, abs/1606.02147, 2016.
[19] Pedro H. O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. CoRR, abs/1603.08695, 2016.
[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[22] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. CoRR, abs/1606.02492, 2016.
[23] Michael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, et al. Speeding up semantic segmentation for autonomous driving. NIPS Workshop, 2016.
[24] Christopher J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
[25] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
