A Multi-scale Multiple Instance Video Description Network

Huijuan Xu1, Subhashini Venugopalan2, Vasili Ramanishka1, Marcus Rohrbach3, Kate Saenko1
1 UMass Lowell, 2 UT Austin, 3 UC Berkeley
1 {hxu1, vramanis, saenko}@cs.uml.edu, 2 [email protected], 3 [email protected]
Abstract

Generating natural language descriptions for in-the-wild videos is a challenging task. Most state-of-the-art methods for this problem borrow existing deep convolutional neural network (CNN) architectures to extract a visual representation of the input video, but these cannot handle multiple receptive field sizes in the original image. In order to consider objects at different positions and different scales simultaneously, we apply several fully convolutional neural networks (FCNs) to form a multi-scale network. Evaluation on a YouTube video dataset shows the advantage of our approach over the original single-scale model.

1. Introduction

The ability to automatically describe videos in natural language has many real-world applications, such as content-based video retrieval, descriptive video service for the visually impaired, and automated video surveillance. Most current visual description approaches make use of pre-trained deep convolutional neural networks (CNNs) as semantic feature extractors for each video frame. These CNN models are trained to predict a single object label on images in which objects are usually centered and occupy most of the image. However, realistic videos are much more complex and contain several objects of different scales at different positions in each frame, including small objects. To detect smaller objects and actions, receptive fields of different sizes (relative to the original image size) must be used.

In this paper, we propose the first end-to-end trainable video description network that incorporates spatially localized descriptors to capture concepts at multiple scales. We combine the traditional classification CNN that operates on the scale of the whole image with fully convolutional networks (FCNs) that have smaller receptive fields relative to the original image. We further incorporate a Multiple Instance Learning (MIL) mechanism to deal with the uncertainty of object scales and positions. The resulting semantic representation of the frames is encoded into a hidden state vector and then decoded into a sentence using the recurrent neural network proposed in [7]. We call our model the Multi-scale Multi-instance Video Description Network (MM-VDN).

2. Approach

The MM-VDN architecture is shown in Figure 1. The input to the network is a sequence of video frames, and the output is a sequence of words. Each frame is processed by a multi-scale multiple-instance convolutional network and embedded into an N-dimensional high-level semantic vector, corresponding to activations of N high-level concepts. A recurrent network accepts such semantic vectors from all frames in the video and then decodes the resulting state into the output sentence. Unlike previous approaches that used a single-scale, single-label architecture (the top stream in our network), our network can handle ambiguity in the number, size and location of objects and activities in the scene.

The visual subnet (Figure 1) is structured as several multi-scale fully convolutional networks connected via the MIL mechanism. The first scale is the base pre-trained CNN classification network (AlexNet [4]) applied to the whole frame to capture scene-level semantics. Additional scales consist of the same CNN applied in a fully convolutional manner to upsampled versions of the original frame. Our FCN conversion of AlexNet changes the last three fully-connected layers into convolutional layers, while the first five convolutional layers are kept the same. The weights of the last three fully-connected layers of AlexNet become the filter weights of the last three convolutional layers of the FCN. The weights of the first seven layers are shared across scales (shown by shading their outputs in Figure 1). The fc8 weights are not shared, to allow different concepts to be learned at different scales.

The MIL mechanism consists of several layers of max pooling and allows the latent position and scale of concepts to be discovered simultaneously during learning. MIL allows weaker forms of supervision, where training examples come in sets; here, a set consists of all image patches corresponding to all possible receptive field locations at all scales in a given frame. A MIL max-pooling layer is defined on top of each FCN output score map to capture the maximum score, which infers the latent object positions.
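To make the fc-to-conv conversion concrete, the following is a minimal PyTorch-style sketch using torchvision's AlexNet; the paper's actual implementation is not specified, so the torchvision layer indices and the reuse of the pretrained 1000-way fc8 head are assumptions (the paper learns fc8/conv-fc8 for its own concept vocabulary).

```python
import torch.nn as nn
from torchvision import models

def alexnet_to_fcn():
    """Sketch of converting AlexNet's fc6/fc7/fc8 into conv-fc6/conv-fc7/conv-fc8.

    Assumes torchvision's AlexNet layout (Linear layers at classifier[1], [4], [6]);
    the paper's original model may differ in details.
    """
    alexnet = models.alexnet(pretrained=True)
    fc6, fc7, fc8 = alexnet.classifier[1], alexnet.classifier[4], alexnet.classifier[6]

    # fc6 sees the flattened 256x6x6 conv5 output, so it becomes a 6x6 convolution.
    conv_fc6 = nn.Conv2d(256, 4096, kernel_size=6)
    conv_fc6.weight.data.copy_(fc6.weight.data.view(4096, 256, 6, 6))
    conv_fc6.bias.data.copy_(fc6.bias.data)

    # fc7 and fc8 become 1x1 convolutions over the 4096-channel feature maps.
    conv_fc7 = nn.Conv2d(4096, 4096, kernel_size=1)
    conv_fc7.weight.data.copy_(fc7.weight.data.view(4096, 4096, 1, 1))
    conv_fc7.bias.data.copy_(fc7.bias.data)

    conv_fc8 = nn.Conv2d(4096, fc8.out_features, kernel_size=1)  # not shared across scales
    conv_fc8.weight.data.copy_(fc8.weight.data.view(fc8.out_features, 4096, 1, 1))
    conv_fc8.bias.data.copy_(fc8.bias.data)

    # conv1-5 (alexnet.features) are reused unchanged and shared across scales.
    return nn.Sequential(alexnet.features, conv_fc6, nn.ReLU(inplace=True),
                         conv_fc7, nn.ReLU(inplace=True), conv_fc8)
```

Applied to a larger (upsampled) input, this network produces a spatial score map rather than a single prediction, which is what the additional scales in Figure 1 rely on.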

[Figure 1. MM-VDN architecture. Each frame (..., t-1, t, t+1, ...) is processed at three scales: scale 1 applies conv1-5 and fc6-fc8 to the whole frame, while scales 2 and 3 apply the shared conv1-5 and conv-fc6/conv-fc7/conv-fc8 fully convolutionally to upsampled frames. Max operations pool over spatial positions and across scales, and the resulting per-frame semantic vectors feed a recurrent network that generates the sentence, e.g. "A cat and a bear meet in the woods".]

Input Image Size | Score Map Size | Height Ratio
227 × 227        | 1 × 1          | 100%
451 × 451        | 8 × 8          | 78.7%
707 × 707        | 16 × 16        | 50.2%

Table 1. Height ratio of the receptive field to the original input image for different input sizes in the FCN. The size of the receptive field is 355 × 355 for all input image sizes.
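As a quick check of Table 1, assuming the ratio is simply the 355-pixel receptive field height divided by the upscaled input height (capped at 100%): 355/451 ≈ 78.7% and 355/707 ≈ 50.2%, while the 227 × 227 input is covered entirely by the 355 × 355 receptive field, giving 100%.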

Then, we define an additional MIL element-wise max layer on top of the multi-scale FCNs to select among different input scales for each embedding concept. When the input image of the FCN has been upscaled, each score in the output score map corresponds to a smaller region in the original unscaled image. Thus, by using an FCN coupled with an upscaled input image, we can capture smaller objects. We further combine several FCNs with different input image sizes (scales) and apply MIL across the scales to capture concepts of different scales simultaneously. The ratio of the receptive field height to the original image height for several input scales is shown in Table 1.
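The two max operations can be written in a few lines; the following is a schematic PyTorch example with assumed tensor names and an assumed concept count N, using the score-map sizes from Table 1, not the authors' code.

```python
import torch

N = 512  # assumed number of embedding concepts (fc8 / conv-fc8 outputs)

# Hypothetical per-frame score maps from the three scales, each of shape (N, H, W).
scores = [torch.randn(N, 1, 1),    # scale 1: whole-frame AlexNet
          torch.randn(N, 8, 8),    # scale 2: FCN on the 451x451 input
          torch.randn(N, 16, 16)]  # scale 3: FCN on the 707x707 input

# MIL max pooling within each scale: keep the best spatial position per concept.
per_scale = [s.flatten(1).max(dim=1).values for s in scores]   # each of shape (N,)

# MIL element-wise max across scales: pick the best scale per concept.
frame_embedding = torch.stack(per_scale).max(dim=0).values     # shape (N,)
```

The resulting frame_embedding is the N-dimensional semantic vector that is fed to the recurrent network for each frame.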

3. Experiments

We perform our experiments on the Microsoft Research Video Description Corpus (MSVD) [2]. The dataset contains 1,970 short YouTube video clips paired with multiple human-generated natural language descriptions (about 40 English sentences per video). The 1,970 videos are split into a training set (1,200 videos), a validation set (100 videos) and a test set (670 videos), following the splits used by prior work on this video description task [3][6][9][8]. We report test accuracy for the model selected on the validation set.

In our model, the input image for the AlexNet stream is resized to 256 × 256 and cropped to 227 × 227 to generate five candidate patches (four corners and one center), without mirroring. The input for the other scales is resized directly to the sizes listed in Table 1, without cropping or mirroring. The AlexNet weights are initialized with the 566-category ImageNet pretrained model, and the conv1 to fc7 weights are kept fixed during training.
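For concreteness, the frame preprocessing described above might look roughly like the following torchvision-style sketch; the exact resizing and cropping pipeline the authors used is not given, so treat this as an assumption.

```python
from torchvision import transforms

# Scale 1 (AlexNet stream): resize to 256x256, then take the four corner crops
# and the center crop at 227x227, with no mirrored copies.
scale1_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.FiveCrop(227),   # four corners + center, no mirroring
])

# Scales 2 and 3 (FCN streams): resize directly to the Table 1 input sizes,
# without cropping or mirroring.
scale2_transform = transforms.Resize((451, 451))
scale3_transform = transforms.Resize((707, 707))
```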

Methods                  | BLEU  | METEOR
FGM [6]                  | 13.68 | 23.9
LSTM-YT [8]              | 31.19 | 26.87
S2VT RGB (AlexNet) [7]   | –     | 27.9
MM-VDN (AlexNet + 8 × 8) | 37.64 | 29.00

Table 2. Comparison to other baselines. FGM is the factor graph model of [6]; LSTM-YT is the LSTM model of [8]; S2VT RGB is the basic RGB sequence-to-sequence model with AlexNet features of [7]; MM-VDN is our model.

The FCN weights are initialized with the reshaped 566-category ImageNet pretrained model weights, and the conv1 to conv-fc7 weights are also kept fixed. The fc8 and conv-fc8 layers connect directly to the LSTM recurrent neural network via the max operations, and only the fc8, conv-fc8 and LSTM parameters are fine-tuned on the training videos. We also tried fine-tuning the weights in conv-fc6 and conv-fc7, but the results were poor compared with keeping these layers fixed.

For a single scale, the original whole-frame AlexNet scale performs better than either of the other two scales alone. We investigate combinations of the input scales with 1 × 1, 8 × 8, and 16 × 16 score maps in the FCN model. The combination of 1 × 1 and 8 × 8 performs best among all possible combinations, giving a boost of 4.7 in BLEU and 1.0 in METEOR. Scale combination is a dataset-specific problem, and finding the best combination requires cross-validation; our model is flexible enough to integrate different input scales and combinations of several scales depending on the dataset.

We use BLEU [5] and METEOR [1] scores to evaluate the generated sentences against all reference sentences. Table 2 compares the best MM-VDN model from the ablation study (AlexNet + 8 × 8 score maps) to the other three baselines. Our MM-VDN model provides a distinct improvement in both BLEU and METEOR over all three baselines.
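The selective fine-tuning described above corresponds to a parameter-freezing pattern like the sketch below; the module and parameter names (and the SGD settings) are hypothetical placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module, lr: float = 0.01) -> torch.optim.Optimizer:
    """Freeze everything except fc8, conv-fc8 and the LSTM, then build an optimizer.

    Assumes a hypothetical MM-VDN module whose parameter names contain "fc8"
    for the score layers and start with "lstm" for the decoder; the solver and
    learning rate are not given in the text and are only illustrative.
    """
    for name, param in model.named_parameters():
        param.requires_grad = ('fc8' in name) or name.startswith('lstm')
    return torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                           lr=lr, momentum=0.9)
```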

4. Conclusion

This paper proposed the Multi-scale Multi-instance Video Description Network (MM-VDN). The model is shown to be effective for video description generation compared to the single-scale whole-frame classification CNN.

References

[1] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
[2] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 190–200. Association for Computational Linguistics, 2011.
[3] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the 14th International Conference on Computer Vision (ICCV 2013), pages 2712–2719, Sydney, Australia, December 2013.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[5] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[6] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), August 2014.
[7] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. CoRR, abs/1505.00487, 2015.
[8] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. CoRR, abs/1412.4729, 2014.
[9] R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
