Situation Awareness for UAVs using Deep Learning techniques∗

Miquel Martí1,2  Amin Barekatain3  Hsueh-Fu Shih4  Samuel Murray1  Yutaka Matsuo5  Helmut Prendinger6

1 KTH Royal Institute of Technology, Stockholm, Sweden
2 UPC Universitat Politècnica de Catalunya, Barcelona, Catalonia/Spain
3 TU München, Munich, Germany
4 NTU National Taiwan University, Taipei, Taiwan
5 The University of Tokyo, Tokyo, Japan
6 NII National Institute of Informatics, Tokyo, Japan

Abstract: Unmanned Aerial Vehicle (UAV) missions and applications, whether manually controlled or autonomous, require the operator to be aware of what is happening around the UAV. In this paper we introduce an approach that divides this visual situation awareness problem into different levels, which we first try to solve separately using different deep learning techniques and later integrate into a single model using Multi-Task Learning techniques, with the aim of deploying it on an embedded system mounted on the UAV itself and running it in real time.

∗ The work was conducted while the first four authors were participating in the NII International Internship Program.

1. Introduction

Unmanned Aerial Vehicle (UAV) missions require a good understanding of the situation in the surroundings of the UAV in order to fully exploit their capability of quick deployment and almost unlimited access, which enables new points of view in many scenarios. To provide UAVs with situation awareness from a visual perspective, we divide the problem into a number of subproblems that can be solved independently. Each subproblem constitutes a level of awareness, from low- to high-level understanding according to its complexity, which we solve with different Deep Learning techniques and later integrate into a single model using Multi-Task Learning, so that it can be deployed on an embedded system mounted on the UAV and run in real time.

First, we define five levels of situation awareness that are to be integrated into a Dynamic Map, a foundational technology for applications and services including delivery, traffic/crowd management, surveillance, disaster response, and so on; see Figure 1. In this paper we present work towards solving the first three levels and towards integrating the first two under a single multi-task model.

1. Semantic Segmentation: to determine the type of terrains over which the UAV is navigating and their extent. This can be useful, for example, for finding a safe area for landing or for determining whether a region is flooded by comparing it to historical records or maps.

2. Object Detection and Tracking: to determine the position over time of multiple objects of interest pertaining to different categories. An application can be to track the movement of vehicles, pedestrians or animals, in order to follow them or to report unusual behaviors if they access restricted areas.

3. Action detection: to understand and localize human actions in the area, such as people running, walking, or other behaviors that might be of interest or unexpected.

4. Event recognition: to recognize social events involving multiple humans.

5. Anomaly detection: to understand anomalous patterns in the previous levels of awareness and trigger alerts or actuate, e.g. determining whether a pedestrian is in need of help given their position and actions over a period of time within the context of the scene, such as detecting an accident involving a pedestrian and a car on a road.

Figure 1 The Dynamic Map presents all static and time-varying (dynamic) objects that are captured by the UAVs. From [1].

Our approach to solving the tasks corresponding to the first three levels of awareness is explained in the next section. We then aim at integrating the first two models into a single one in order to meet the requirements of deployment. Typically, Deep Learning models (and Machine Learning models in general) focus on solving one task at a time, as we do in the first stage of this paper. However, applications often require more than one task to be performed simultaneously, and in that case the naive solution is to deploy different models running in parallel, one for each task, as would be the case for our situation awareness problem. In fact, [2] claims that most real-world problems can be seen as multi-task problems and that, although few of the test problems in machine learning repositories are treated as such, this is because most have been broken into smaller subproblems beforehand, as we have done in a first stage.

When it comes to deploying a system involving different models, especially taking into account that it has to run on an embedded system with real-time performance so that it is useful in the real world, many challenges arise: the resources available in the embedded system are typically limited, and real-time performance means small inference times, which cannot be achieved by deploying more powerful hardware as this is already a constraint. Embedded systems are found on autonomous systems or mobile devices such as UAVs, robots, smartphones, wearables, Virtual/Augmented Reality headsets or portable game consoles. They typically run on low power, which in turn limits the computing power they can provide, and are limited in terms of available memory. They can be based on a Central Processing Unit (CPU), possibly complemented by a Graphics Processing Unit (GPU), or on a Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC). Each of these approaches serves different applications and poses different challenges, e.g. a typical FPGA has very small on-chip memory1 but is much more efficient than a CPU. Here the focus is on the more flexible but less efficient CPU- or GPU-based embedded systems. In those, the available resources range from very limited to moderate, but are never close to the systems in which Deep Learning models are trained and where their performance is usually evaluated. Our target deployment system is the Nvidia Jetson TX1 SoC, which features 4 GB of shared memory, a 256-core Maxwell GPU capable of 512 GFLOPS, and 16 GB of storage.

1 As an example, a Xilinx Spartan-7 FPGA has at most 0.5 MB of on-chip memory.

This poses three main challenges from the point of view of deploying deep learning models for inference, challenges that are only accentuated when not one but multiple models have to be deployed simultaneously. They concern memory

requirements, both on disk and in physical memory, and the computational resources needed to achieve inference times small enough to give real-time throughput:

• Model storage: The model weights need to be stored on disk to be used later. Embedded systems are typically constrained in the size of their storage, and the weights of a model can take several hundreds of MB2, which can mean a large share of the total available storage space. If more than one task is needed and there is one model for each, the weights of every model have to be stored, even if they are not run simultaneously.

• Memory usage: The model has to fit into memory. Not only do the parameters have to be loaded into memory, but the outputs of intermediate layers must be stored as well, and their size depends on the size of the input data and the batch size. Figure 2 shows the relation between batch size and memory utilization for common convolutional networks used for Image Classification. This might not be a problem for models running in data centers, but it clearly compromises models running in embedded systems with little memory available, whether they run on a CPU or on a GPU with dedicated memory. If two models are needed in parallel, assuming they use the same amount of memory, the memory footprint is doubled.

• Inference times: The model has to be run to get the inference results, i.e. a forward pass of the model has to be computed for each sample of the input data. Focusing on GPU-accelerated inference, the best throughput is typically achieved by making full use of the parallelism capabilities of the GPU with a batch size larger than one, which introduces some latency. When GPU utilization gets close to its maximum, the inference time is proportional to the computational complexity, or number of operations, that needs to be performed [5]; see Figure 3. In fully-connected and convolutional networks, the computational complexity is dominated by multiply-adds, typically measured in floating-point operations (FLOPs). With two models in parallel, double the number of operations needs to be performed, i.e. double the inference time (assuming they reach the maximum utilization of the GPU or run sequentially on a single CPU).

2 For example, the file containing the weights of the reference AlexNet model for Image Classification is 233 MB [3] when using the Caffe framework [4]. It can be found at: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet

Using Multi-task Learning models, we study how the memory footprint and computational complexity of Deep Learning models can be reduced without compromising their prediction accuracy in applications in which more than one task needs to be solved simultaneously, and especially for the case of solving Semantic Segmentation and Object Detection with the goal of finally deploying the model on a UAV to provide our first two levels of situation awareness.
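To make the memory and computation figures above concrete, the following back-of-the-envelope sketch (our own illustration; the layer shape is hypothetical and not taken from any of the cited models) estimates the activation memory and multiply-add count of a single convolutional layer, the quantities that dominate the memory usage and inference time just discussed.

```python
def conv2d_cost(h, w, c_in, c_out, k, stride=1, batch=1, bytes_per_value=4):
    """Rough activation memory and multiply-add count for one convolution.

    Assumes 'same' padding, square kernels and float32 activations; biases and
    framework overhead are ignored, so the numbers are lower bounds.
    """
    h_out, w_out = h // stride, w // stride
    # Output feature map that has to be kept in memory for the next layer.
    activation_bytes = batch * h_out * w_out * c_out * bytes_per_value
    # One multiply-add per kernel element, input channel and output position.
    mult_adds = batch * h_out * w_out * c_out * (k * k * c_in)
    return activation_bytes, mult_adds

# Hypothetical ResNet-style layer on a 500x500 input (feature map 250x250).
mem, ops = conv2d_cost(h=250, w=250, c_in=64, c_out=64, k=3)
print(f"activations: {mem / 2**20:.1f} MiB, multiply-adds: {ops / 1e9:.2f} G")
```

Doubling the batch size doubles both the activation memory and the per-batch operation count, which is consistent with the batch size 1 setting used for the deployment measurements reported later in Table 1.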

2. Approach

2.1. Semantic Segmentation

Image classification and semantic segmentation are very similar tasks, except for the resolution of the output: semantic segmentation outputs a matrix of class labels instead of a single label. Most state-of-the-art semantic segmentation models therefore build on an image classification architecture, to which they add an up-sampling part that compensates for the dimensionality reduction of the feature extraction part.

Figure 2 Memory vs. batch size. Maximum system memory utilisation for batches of different sizes. Memory usage shows a knee shape, due to the static allocation of the network model's memory plus the variable memory that grows with batch size. From [5].

Different approaches have been extensively studied in the literature, but since [6] most follow the Fully-Convolutional Network (FCN) approach, i.e. a model consisting of only a few modifications of any classification architecture. The last fully connected layers of the network are transformed into convolutional layers with filter size 1, allowing the network to output a heatmap, which can be up-sampled through bilinear interpolation to obtain a segmented image of the same size as the input. The main problem with this method is that the many pooling layers considerably reduce the spatial information in the features. Therefore, the last layer producing the heatmap can be accurate in terms of classification, but very coarse in terms of localisation. By shortcutting features obtained from shallower layers, which carry more information about location, the localisation aspect of the prediction can be improved. Others build on top of this approach by adding context [7] or by applying post-processing steps [8], [9].

As previously explained in [10], here we use the simplest version with no post-processing steps and adapt the more powerful classification architectures from [11] to be fully-convolutional. We explore multiple versions of ResNets with different numbers of layers, from 50 to 152, training on two new datasets annotated for 9 Semantic Segmentation classes and consisting of top-down images taken from a UAV in Japan and Switzerland, which we name the Okutama and Swiss datasets respectively.
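As an illustration of this fully-convolutional conversion, the sketch below (our own simplification in PyTorch; the layer choices and sizes are assumptions, not the exact architecture used in this work) replaces the classifier of a backbone with a 1x1 convolution, up-samples the resulting heatmap bilinearly, and fuses in one skip connection from a shallower layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class SimpleFCN(nn.Module):
    """Minimal FCN head on a ResNet-50 trunk with one skip connection."""

    def __init__(self, num_classes=9):
        super().__init__()
        backbone = resnet50()
        # Everything up to the second-to-last residual stage is the shared stem.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1,
                                  backbone.layer2, backbone.layer3)  # stride 16
        self.deep = backbone.layer4                                  # stride 32
        # 1x1 convolutions play the role of the former fully connected classifier.
        self.score_deep = nn.Conv2d(2048, num_classes, kernel_size=1)
        self.score_skip = nn.Conv2d(1024, num_classes, kernel_size=1)

    def forward(self, x):
        size = x.shape[-2:]
        skip = self.stem(x)        # shallower features, better localised
        deep = self.deep(skip)     # deeper features, better classified
        score = self.score_deep(deep)
        # Up-sample the coarse heatmap and fuse it with the skip prediction.
        score = F.interpolate(score, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
        score = score + self.score_skip(skip)
        # Final bilinear up-sampling back to the input resolution.
        return F.interpolate(score, size=size, mode="bilinear",
                             align_corners=False)

logits = SimpleFCN()(torch.randn(1, 3, 512, 512))  # -> (1, 9, 512, 512)
```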

Figure 3 Operations vs. inference time, size proportional to # parameters. Relationship between operations and inference time for batch size 16, where resources are fully utilised and therefore the operation count gives a good estimate of the inference time. From [5].

The small size of the datasets (100 and 91 images respectively) forces us to balance the different splits carefully, so that the classes are uniformly distributed across them, and to explore different training strategies. We apply data augmentation with random crops and random flips, explore multi-source training using both datasets, and use model compression to improve the performance of the smaller model, Resnet-50, by teaching it to mimic an ensemble formed by a fully-convolutional Resnet-152 with skip connections and other models taken from [6], based on the VGG [12] architecture. Figure 4 shows the architecture of an FCN Resnet-50 with skip connection.
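The model compression step mentioned above follows the general idea of training a smaller network to mimic a larger ensemble. The sketch below is an assumed formulation of such a per-pixel distillation loss (the weighting alpha and temperature T are illustrative, not the exact recipe used in this work): it combines the usual cross-entropy on the ground-truth masks with a term that matches the student's per-pixel predictions to the ensemble's soft outputs.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
    """Per-pixel distillation loss for segmentation.

    student_logits, teacher_logits: (N, C, H, W); target: (N, H, W) class ids.
    alpha balances ground-truth supervision against mimicking the teacher and
    T softens the teacher distribution; both values here are illustrative.
    """
    ce = F.cross_entropy(student_logits, target)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between softened distributions:
    # sum over classes, mean over pixels and batch.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="none")
    kd = kd.sum(dim=1).mean() * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```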

Figure 4 FCN Resnet-50 with skip connection architecture. From [10].

Figure 5 Single-shot Multibox Detector. From [17].

2.2. Object Detection and Tracking

Object detection aims at finding in an image all instances of objects from a number of known classes, while also giving their class. The detection is normally given in the form of a class and a bounding box defining the location and scale of the detected object within the image. Object detection is often used within higher-level Computer Vision pipelines, e.g. in Action Detection [13], as will be explained in the next section, or in Multiple Object Tracking algorithms based on the tracking-by-detection approach.

Approaches to Object Detection have typically been divided between sliding-window [14] and region proposal classification approaches. Since the accuracy jump provided by R-CNN, the second method has gained much attention and been improved to the extent that Faster R-CNN [15], by integrating the Region Proposal part of the pipeline into the network, was able to run at 7 frames per second and achieve 73.2% mean Average Precision (mAP) on Pascal VOC 2007 [16]. The Single-shot Multibox Detector (SSD) [17] is similar to the Region Proposal network of Faster R-CNN, but instead of using the proposals to pool features and perform the classification, SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network.

We use SSD because of its simplicity and speed while maintaining good accuracy, and train it to classify 6 different classes of objects (Pedestrian, Biker, Skater, Car, Cart, Bus) on the Stanford Drone Dataset [18], and Pedestrians only on the newly annotated Okutama-Action dataset that will be explained in the next section. We base the models both on VGG, as done originally, and on Resnet-50 to match the base network used for Semantic Segmentation. See Figure 5.
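The sketch below illustrates how SSD-style default boxes are generated for one feature map, the grid of anchors over which the network predicts class scores and box offsets. It is our own illustration; the scales and aspect ratios are typical SSD defaults and should be read as assumptions, not the exact configuration trained here.

```python
import itertools
import math

def default_boxes(fmap_size, scale, next_scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """SSD-style default boxes (cx, cy, w, h), normalized to [0, 1], for one
    square feature map of size fmap_size x fmap_size."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        # Box centre at the middle of each feature map cell.
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
        # One extra box with a scale between this and the next feature map's.
        s_prime = math.sqrt(scale * next_scale)
        boxes.append((cx, cy, s_prime, s_prime))
    return boxes

# A 38x38 feature map with scales 0.1 and 0.2 yields 38 * 38 * 4 = 5776 boxes.
print(len(default_boxes(fmap_size=38, scale=0.1, next_scale=0.2)))
```

In the full detector the same procedure is repeated for every feature map used for prediction, which is how objects of different sizes are covered.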

The tracking algorithm uses the detections of our SSD model and consists of two steps: a prediction step, in which the location of objects is estimated from their location in previous frames by applying a Kalman Filter with a motion model, and an association step, which connects the tracked objects to the new detections in the next frame using the Hungarian algorithm (see the sketch below). Given the metadata available about the position of the UAV and its camera, the tracked objects are then geo-referenced so they can be consumed by other applications or shown on a map.

2.3. Action Detection

After reviewing the available datasets for aerial-view scene understanding tasks and for Human Action Detection, we decided to define a new dataset that overcomes some of the limitations of previous datasets and brings the task closer to real-world applications. Each of the existing datasets suffers from at least one of the following limitations: low-resolution videos, non-aerial view, short video length, no concurrent actions from different classes, or a stable camera. The dataset is collected in Okutama and can be used for multiple tasks: Pedestrian Detection, Multiple Object Tracking and, especially, Multi-human Action Detection, being the first dataset available for that purpose. We name it Okutama-Action after the location where it was collected and its main purpose. The dataset consists of 42 videos taken from 2 drones at the same time under 2 different lighting conditions. In total it contains 77,368 frames, which were annotated every 10 frames using an open-source annotation tool, VATIC. The annotation task was crowdsourced using Amazon Mechanical Turk. An example annotated frame is shown in Figure 6.
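To make the tracking-by-detection loop of the previous subsection concrete, here is a minimal sketch under simplifying assumptions: a constant-velocity Kalman filter over box centres only, SciPy's Hungarian solver for association, and no track management for appearing or disappearing objects. The noise covariances are illustrative, not tuned values from this work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

class Track:
    """Constant-velocity Kalman filter over a box centre (x, y)."""

    def __init__(self, xy):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])      # state: x, y, vx, vy
        self.P = np.eye(4) * 10.0                         # state covariance
        self.F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                           [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                         # process noise
        self.R = np.eye(2) * 1.0                          # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def associate(tracks, detections):
    """Match predicted track positions to detection centres (shape (D, 2))."""
    predictions = np.array([t.predict() for t in tracks])
    cost = np.linalg.norm(predictions[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    for r, c in zip(rows, cols):
        tracks[r].update(detections[c])
    return list(zip(rows, cols))
```

A full tracker would additionally gate associations by a distance threshold and create or delete tracks for unmatched detections and tracks.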

Figure 6 An annotated frame of the Okutama-Action dataset.

Here the aim is to use this dataset to train deep learning models that can detect human actions spatially (drawing bounding boxes around actors) and recognize them (assigning to each box a label describing the action). For this purpose we re-use the SSD architecture for Object Detection and train it on the Action labels instead of the object class labels. This can later be integrated in the framework proposed by [13] to give temporal coherence and smooth the output of the detections. As SSD is not able to deal with multiple labels, we give only one label to every annotated human according to a set of priorities. For example, if someone is reading while carrying a book and walking, the label will only be "Reading", as it is more specific. In this way we try to balance the dataset towards the action classes we expect to be more difficult.

2.4. Multi-task learning for deployment in embedded devices

Multi-task learning (MTL) learns to solve several tasks in parallel using a shared representation of the same input data. One straightforward way of implementing MTL is with Neural Networks that share a base trunk, or set of hidden layers, which learns a common representation for the multiple tasks and from which a number of task-specific branches emerge, giving multiple different outputs, one for each task. Such a network can still be trained end-to-end by defining a global objective function as the sum of the loss functions of the individual tasks. Using a part of the network to learn a shared representation also means computing this representation only once, sharing the resources both in terms of memory and of computational cost. In this way, only the branches emerging from the shared representation must be stored and computed in parallel, and these account for the extra cost over a single-task network. Depending on the size of these task-specific branches relative to the shared trunk, the multi-task network can be much larger than the single-task network or only negligibly larger.

In practice, networks solving many different tasks are already based on the same networks. For example, most of the recently proposed networks for both Object Detection and Image Segmentation are based on VGG [12] or ResNet [11] networks previously trained for Image Classification. The non-task-specific layers of these networks account for most of the weights and computational complexity. They are therefore perfect candidates to be used as the shared base trunks of multi-task networks, as they have already proved useful for many different Computer Vision tasks, as previously investigated in the literature [19]–[21]; in [22], features extracted from the OverFeat network [23] are used for solving up to 5 different tasks. Moreover, previous research shows that by exploiting synergies between related tasks the models can improve their generalization performance, as the extra information in the training signals of the other tasks acts as an inductive bias towards the more general solution that gives better performance across the multiple tasks [2].

MTL thus seems to be a win-win approach for the deployment of models in real-world problems, specifically in those constrained by real-time requirements and by the limited resources available in embedded systems. As a downside, the training process becomes more complex, as has been analyzed in previous research [2]. A comprehensive method for training multi-task networks is presented in [24], covering even the case in which the training involves multiple partial datasets whose samples contain labels for only some of the tasks.

The study of the viability of MTL for this purpose focuses on the tasks of Semantic Segmentation and Object Detection, which are known to be related and have previously been solved simultaneously with Multi-task Networks [24]–[26], although never with the goals pursued here. First, a multi-task network is defined for both tasks. The common convolutional trunk is selected, based on memory and computational complexity requirements, from the typical architectures used nowadays as the base for these tasks. Figure 7 plots the accuracy of the best-known architectures for Image Classification versus their throughput for batch size 16 on the Jetson TX1 GPU.

Figure 7 Accuracy vs. throughput, size proportional to # of operations. A non-trivial linear upper bound can be seen in this scatter plot, illustrating the relationship between prediction accuracy and throughput of all examined architectures. From [5].

While no architecture offers a free lunch with both high throughput and high accuracy, the heaviest networks are expected not to be suited for deployment in embedded systems, due both to a bigger memory footprint and to larger inference times (at least with today's technology), while the lightest and fastest models are expected to generalize poorly to other tasks compared to the heaviest ones, due to their lower capability of learning sufficiently general

features. The Resnet-50 architecture is selected as the base trunk due to its good speed/accuracy trade-off and our previous successful experience using it. Similarly, the task-specific layers are taken from the recent approaches for each task that offer the best compromise between inference time and accuracy (SSD [17], R-FCN [27], FCN [6], ParseNet [7], RefineNet [9], ENet [28]). SSD and plain FCN are selected, again taking into account their speed/accuracy trade-offs and our positive prior experience. Figure 8 shows the proposed multi-task architecture for Object Detection and Semantic Segmentation based on Resnet-50.
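A minimal sketch of such a shared-trunk architecture is given below. It is our own PyTorch illustration: the head shapes, number of default boxes and class counts are assumptions, and the real SSD head attaches to several feature maps of different resolutions rather than the single one used here for brevity. A ResNet-50 trunk feeds both a segmentation head and a detection head, and the global objective is simply the sum of the per-task losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class SharedTrunkMTL(nn.Module):
    """Shared ResNet-50 trunk with an FCN-style and an SSD-style branch."""

    def __init__(self, num_seg_classes=9, num_det_classes=7, boxes_per_cell=4):
        super().__init__()
        backbone = resnet50()
        # Drop the average pooling and fully connected layers: (N, 2048, H/32, W/32).
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        # Segmentation branch: 1x1 classifier + bilinear up-sampling.
        self.seg_head = nn.Conv2d(2048, num_seg_classes, kernel_size=1)
        # Detection branch (one feature map only): class scores and 4 box
        # offsets per default box at every location.
        self.det_cls = nn.Conv2d(2048, boxes_per_cell * num_det_classes, 3, padding=1)
        self.det_loc = nn.Conv2d(2048, boxes_per_cell * 4, 3, padding=1)

    def forward(self, x):
        feat = self.trunk(x)  # computed once, shared by both tasks
        seg = F.interpolate(self.seg_head(feat), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        return seg, self.det_cls(feat), self.det_loc(feat)

def total_loss(seg_logits, seg_target, det_loss_value):
    # Global objective: sum of the task losses. det_loss_value would come from
    # the usual SSD matching + localisation/confidence loss, omitted here.
    return F.cross_entropy(seg_logits, seg_target) + det_loss_value
```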

Figure 8 Multi-task model with FCN for Semantic Segmentation and SSD for Object Detection based on Resnet-50.

The multi-task architecture is first evaluated by training it on a common dataset that includes labels for both tasks, Pascal VOC07+++ [16], [29]. It will later be evaluated on datasets where the ground truth is available for only one of the tasks, namely our newly annotated aerial-view datasets Swiss [30] and Okutama [10] for Semantic Segmentation, and the Stanford Drone Dataset [18] and our new Okutama-Action dataset for Object or Pedestrian Detection. A multi-task network approach opens up the possibility of creating a Swiss Army knife for different tasks which share the same input data, as done in [24]. Here we focus on the first two tasks, but it is worth mentioning that any number of further tasks can be added as long as they share the same input, are related and are based on a convolutional trunk. For the application considered here, the natural task to add would be Action Detection, as it would directly extend the model to provide the third level of situation awareness defined previously.

For Pascal VOC07+++, a subset in which there is ground truth for both tasks is defined and divided into training, validation and test splits; it is about half the size of the sets used for Segmentation and Detection separately. This is one of the biggest challenges of MTL: having datasets with ground truth available for multiple tasks.

The common training strategies for SSD and FCN models are very different. FCN models are typically trained with online stochastic gradient descent, i.e. a mini-batch size of only one sample, with original-sized images, no normalization of the loss function, a very small learning rate of 1e-10 and a high momentum of 0.99. SSD models are trained with a mini-batch size of 32, standard momentum of 0.9, a learning rate of 1e-3, a normalized loss and resized images, with a heavy data augmentation strategy. For our training we try a number of combinations of these parameters to find a middle ground for which both tasks can be trained reasonably. This is a second challenge we face when training MTL models: the exploration of the hyperparameter space has to be performed again, as the best hyperparameters do not necessarily coincide with those of the single-task models.
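Two recurring practical issues in the paragraphs above are partially labelled datasets (each sample carrying ground truth for only one task) and the very different training recipes of the two branches. The sketch below is our own illustration of one simple way to handle this; model, loaders, criteria and the weights w_seg/w_det are hypothetical placeholders, and this is not necessarily the scheme of [24]. Each branch is supervised only when its ground truth is available, while the shared trunk receives gradients from both tasks.

```python
def mtl_training_steps(model, optimizer, seg_loader, det_loader,
                       seg_criterion, det_criterion, w_seg=1.0, w_det=1.0):
    """Alternate batches from two partially labelled datasets.

    Each batch carries ground truth for only one task, so only that task's
    branch contributes to the gradient while the shared trunk is updated by
    both. All arguments are placeholders: e.g. det_criterion stands for the
    usual SSD matching + localisation/confidence loss.
    """
    for seg_batch, det_batch in zip(seg_loader, det_loader):
        for (images, targets), task in ((seg_batch, "seg"), (det_batch, "det")):
            optimizer.zero_grad()
            seg_out, det_cls, det_loc = model(images)
            if task == "seg":
                loss = w_seg * seg_criterion(seg_out, targets)
            else:
                loss = w_det * det_criterion(det_cls, det_loc, targets)
            loss.backward()
            optimizer.step()
```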

3. Results

3.1. Semantic Segmentation

As already reported in [10], all the data augmentation techniques and training strategies we tried successfully improve the accuracy of the model. The best resulting model, Resnet-152 skip, achieved 69.6% mIoU and 96.2% pixel accuracy on the Swiss dataset at 10 fps when running on a Titan X GPU; the extended results can be found there. Since then, more data has been annotated for the Okutama dataset and we trained a Resnet-50 skip model, able to run at 20 fps on a Titan X GPU, on the union of the training sets of the Okutama and Swiss datasets, obtaining a mIoU of 70.8% on the union of the validation sets, a reminder that the amount of training data is extremely important for training Deep Learning models. Three example segmentation results are shown in Figure 9.

3.2. Object Detection

Three models were trained for Object Detection using SSD: two on the Stanford Drone Dataset (SDD) [18] and one on the Okutama-Action dataset for Pedestrian detection only. Figure 10 shows the results of the VGG-based model on the SDD dataset, which achieves 80.1% mAP on the validation set and runs at 2.8 fps on the Jetson TX1. The version based on the Resnet-50 model proves more difficult to train and achieves only 50% mAP; a wider exploration of the hyperparameter space seems to be needed when training with a new base model. Figure 11 shows an example result of the VGG-based model trained on the Okutama-Action dataset for Pedestrian detection, which achieves 72.3% mAP. We can see that the accuracy is higher when the flying height is lower, as the objects appear bigger; this is consistent with the known limitations of this model on other datasets. It can be somewhat improved by training on higher-resolution input and/or changing the scale of the prior boxes.

3.3. Action Detection

The same SSD model used for Object Detection is trained on the Action labels instead. Figure 12 shows an example frame of our action detections. The model is trained first at the same resolution as before and then fine-tuned at a higher resolution of 960x540 pixels, achieving a mAP of 19%.

Figure 9 Segmentation results compared to the ground truth annotations for three different images of the Okutama dataset.

Figure 10 Detection results of the SSD model on the Stanford Drone Dataset.

Figure 12 Detection results of the SSD model on the Okutama-Action dataset for the Action Detection task.

By visual inspection and by checking the per-class accuracies, we see that many detections are not completely mislabeled but still count as False Positives and False Negatives because only one class is given per person: for example, the class "Carrying" appears very often and is considered wrong when computing the score if the ground-truth label was, for instance, "Drinking" or "Reading". We believe that introducing the priorities in the labeling to handle the lack of multi-label prediction created some artifacts in the computation of the final accuracy. Future work will address this issue.

3.4. Multi-task model

Figure 11 Detection results of the SSD model on the Okutama-Action dataset for Pedestrian detection.

The training of our multi-task models is still work in progress. The best results for our multi-task model are achieved using standard momentum, a normalized loss for the semantic segmentation task, resized images, simple data augmentation consisting of random horizontal flips, a mini-batch size of 12, which allows us to use Batch Normalization, and a larger learning rate of 1e-3. For the Semantic Segmentation task we obtain 54% mIoU with the MTL model, versus 59% mIoU for the single-task baseline trained with the best strategy on the same data. For the Object Detection task we obtain 47% mAP with the MTL model, versus 48% mAP for the single-task baseline trained without data augmentation on the same data, although the best reported model based on Resnet-50 achieves 70% mAP when using more data and extensive data augmentation.

The next step will be to add more data augmentation, as it seems to be essential for preventing the SSD branch from overfitting and will probably also help the FCN branch. In terms of speed and memory usage, Table 1 summarizes the inference times of the single-task models, their union and the multi-task model. The benefit of MTL in sharing resources is obvious and further encourages our work. Even with the low accuracy achieved for the Object Detection task, the MTL model enables the deployment of both tasks on the Jetson TX1: the memory usage of the union of both single-task models (if deployed in parallel) is larger than the total amount of memory available, making it impossible to deploy both models simultaneously, while the memory usage of the MTL model is only marginally larger than that of the single-task models.

Table 1 Inference times, model size and memory usage for the single-task baselines, their union and the MTL model. Values measured on an Nvidia GTX 970m GPU with CUDA 8 and cuDNN 5 for an input image size of 500x500, batch size 1 and 10 classes.

4. Conclusions

By dividing the task of providing situation awareness for UAVs into 5 different levels, we successfully solve the first two levels of awareness using Deep Learning models: Semantic Segmentation of top-down aerial images into the regions over which the UAV navigates, and Object Detection for detecting smaller objects of interest in similar imagery, which can later be tracked over time via a tracking-by-detection approach. We introduce a new dataset, Okutama-Action, which provides a natural set of challenges not found in previous datasets and can specifically be used for Multi-human Action Detection from an aerial view. We try a simple approach to solve this task and observe the complexity that Action Detection introduces through the interaction between action classes and its multi-label nature. In addition, we evaluate how Multi-Task Learning models can help share resources during inference, which becomes essential when deploying multiple models in embedded systems under real-time constraints, and analyze the challenges and benefits of this approach.

We are currently integrating the Object Detection model into a Multiple Object Tracker, improving our model for Action Detection and training a Multi-Task model for Object Detection and Semantic Segmentation on aerial-view datasets that we can deploy on a Jetson TX1 mounted on a UAV.

References

[1] “DRONET,” DRONET – We develop the DRONET open platform to transform the unprecedented opportunity of designing and safely operating low-altitude airspace into the next IT revolution. [Online]. Available: http://siliconmountain.jp/en/frontpage/

[2] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, no. 1, pp. 41–75, Jul. 1997 [Online]. Available: http://dx.doi.org/10.1023/A:1007379606734

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[5] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” CoRR, vol. abs/1605.07678, 2016 [Online]. Available: http://arxiv.org/abs/1605.07678

[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[7] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: Looking wider to see better,” CoRR, vol. abs/1506.04579, 2015 [Online]. Available: http://arxiv.org/abs/1506.04579

[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” arXiv preprint arXiv:1412.7062, 2014.

[9] G. Lin, A. Milan, C. Shen, and I. D. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” CoRR, vol. abs/1611.06612, 2016 [Online]. Available: http://arxiv.org/abs/1611.06612

[10] J. Laurmaa, “A deep learning model for scene segmentation of images captured by drones,” Master’s thesis, EPFL, Switzerland, 2016.

[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.

[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014 [Online]. Available: http://arxiv.org/abs/1409.1556

[13] S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, “Deep learning for detecting multiple space-time action tubes in videos,” arXiv preprint arXiv:1608.01529, 2016.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[15] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015 [Online]. Available: http://arxiv.org/abs/1506.01497

[16] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.”

[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” CoRR, vol. abs/1512.02325, 2015 [Online]. Available: http://arxiv.org/abs/1512.02325

[18] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning social etiquette: Human trajectory understanding in crowded scenes,” in European Conference on Computer Vision, 2016, pp. 549–565.

[19] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning (ICML), 2014.

[20] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.

[21] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, 2014, pp. 818–833.

[22] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2014.

[23] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” CoRR, vol. abs/1312.6229, 2013 [Online]. Available: http://arxiv.org/abs/1312.6229

[24] I. Kokkinos, “UberNet: Training a ‘universal’ convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory,” CoRR, vol. abs/1609.02132, 2016 [Online]. Available: http://arxiv.org/abs/1609.02132

[25] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in CVPR, 2016.

[26] S. Brahmbhatt, H. I. Christensen, and J. Hays, “StuffNet: Using ‘stuff’ to improve object detection,” CoRR, vol. abs/1610.05861, 2016 [Online]. Available: http://arxiv.org/abs/1610.05861

[27] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” arXiv preprint arXiv:1605.06409, 2016.

[28] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A deep neural network architecture for real-time semantic segmentation,” CoRR, vol. abs/1606.02147, 2016 [Online]. Available: http://arxiv.org/abs/1606.02147

[29] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.”

[30] senseFly, “Swiss dataset – example datasets,” Example Datasets: senseFly SA. [Online]. Available: https://www.sensefly.com/drones/example-datasets.html
