Learning Multimodal Semantic Models for Image Question Answering

Zichao Yang1, Xiaodong He2, Jianfeng Gao2, Li Deng2
1 Carnegie Mellon University, 2 Microsoft Research, Redmond, WA 98052, USA
[email protected], {xiaohe, jfgao, deng}@microsoft.com

Abstract

This paper presents a set of models and methods that learn to answer natural language questions about an image. The proposed methods use deep neural networks to jointly learn the semantic representations of the image and the question in order to predict the answer. We carried out evaluations on three image question answering (QA) data sets. The experimental results demonstrate that the proposed image QA system, based on multimodal semantic models, gives superior performance, outperforming previous state-of-the-art approaches by a significant margin.

1 Introduction

With the advancement of deep learning in computer vision and in natural language processing, the merging of vision and language is becoming an increasingly important research area. One of the recent successes in this area is automatic caption generation for images [2, 18, 3, 9, 7, 13, 20]. The models underlying this success jointly learn high-level representations of images using convolutional neural networks (CNNs) and of word sequences using recurrent neural networks (RNNs) or CNNs. More recently, image question answering (QA) has been proposed as a new challenging task in the vision and language area. In image QA, we need to answer natural language questions according to the content of a reference image. To facilitate research on image QA, several datasets have been constructed, either through automatic generation based on image caption data or by human labeling of questions and answers for given images [4, 14, 11, 1, 12].

Though closely related, image QA differs from automatic image captioning in several aspects. Unlike image captioning, language generation is not a main focus of image QA. Therefore, instead of learning models for generating syntactically fluent language, a model for image QA focuses more on capturing the semantics of the image and the question effectively, and thereby on predicting answers correctly.

In this paper, we propose using deep neural networks to jointly learn the semantic representations of the image and the question, and then to predict the answer based on the learned abstract semantic representations. Unlike previous work [11], no domain-specific knowledge in language or vision processing, such as natural language parsing or image segmentation, is required. Instead, the method reported in this paper is an end-to-end image QA framework that is learnable directly from the training data. The overall architecture of our model is shown in Fig. 1. Our model is composed of three parts: an image model for extracting semantic representations of images, a question (text) model for extracting semantic representations of natural language questions, and a prediction model that combines the image representation and the question representation to predict the answer.

The main contributions of our work are: 1) a novel end-to-end trainable framework, based on joint image and text semantic representation learning, for image QA tasks; 2) a comprehensive evaluation on three image QA benchmarks; 3) a detailed analysis and demonstration of the effectiveness of the proposed image QA system through case studies.

2 The Proposed Model

In this work, we map the image I and the question Q to the same semantic space through joint learning. Denote their vector representations by vI and vQ, respectively. We assume that the semantics of the image can be decomposed into the semantics of the question plus the semantics of the answer and some background noise. Thus, subtracting vQ from vI provides strong information for predicting the answer. We therefore propose the multimodal semantic learning based image QA framework illustrated in Fig. 1.

[Figure: the image is encoded by a CNN into vI; the question ("what is the color of the horses") is encoded by a CNN or LSTM into vQ; a softmax over vI − vQ predicts the answer ("Brown").]

Figure 1: Image QA multimodal model

2.1 Predicting answers from semantic representations of images and questions

As discussed above, given the semantic vectors vI and vQ for the image and the question, subtracting vQ from vI provides strong information for predicting the answer. In the three image QA benchmarks, the set of possible answers is pre-defined. Therefore, we predict the answer directly from vI − vQ using a classifier, i.e.,

pA = softmax(Wmm(vI − vQ) + bmm)    (1)
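As an illustration of Eq. (1), below is a minimal sketch of the answer classifier in a modern framework (PyTorch). This is not the authors' original implementation; the module and parameter names are chosen for exposition only.

```python
import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    """Predicts an answer class from the difference vI - vQ, as in Eq. (1)."""
    def __init__(self, semantic_dim: int, num_answers: int):
        super().__init__()
        # W_mm and b_mm of Eq. (1) are folded into a single linear layer.
        self.classifier = nn.Linear(semantic_dim, num_answers)

    def forward(self, v_i: torch.Tensor, v_q: torch.Tensor) -> torch.Tensor:
        # v_i, v_q: (batch, semantic_dim) vectors in the shared semantic space.
        logits = self.classifier(v_i - v_q)
        # Return log-probabilities over the pre-defined answer set (log pA).
        return torch.log_softmax(logits, dim=-1)
```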

2.2 The Image Model

We use a CNN to learn the image features. We choose the VGGnet [16] to extract the image representation. Specifically, denoting the image by I, the VGGnet image feature is

hI = CNNvgg(I)    (2)

We then add an MLP layer to further transform the VGGnet image feature into the joint image-text semantic space:

vI = tanh(WI hI + bI)    (3)
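A minimal sketch of the image model in Eqs. (2)-(3) is shown below, assuming a pretrained torchvision VGG-16 backbone whose penultimate 4096-dimensional fully connected activation serves as hI. The exact VGG variant, layer, and framework are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageModel(nn.Module):
    """Maps an image to the joint semantic space: vI = tanh(WI hI + bI)."""
    def __init__(self, semantic_dim: int):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        # Use the 4096-d activations of the penultimate fully connected layer as hI.
        self.backbone = nn.Sequential(
            vgg.features, vgg.avgpool, nn.Flatten(),
            *list(vgg.classifier.children())[:-1],
        )
        self.project = nn.Linear(4096, semantic_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        h_i = self.backbone(images)            # (batch, 4096) VGG feature, Eq. (2)
        return torch.tanh(self.project(h_i))   # (batch, semantic_dim), Eq. (3)
```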

2.3 The Question Model

In this work, we explore using an LSTM and a CNN to capture the semantic meaning of questions for image QA tasks.

2.3.1 The LSTM based question model

As illustrated in Fig. 2, the output ht after seeing the word sequence up to time t preserves the semantic meaning of the text up to word t. Given the question q = [q1, ..., qT], where qt is the one-hot vector representation of the word at position t, we first embed the words into a vector space through an embedding matrix, xt = We qt. Then at every time step we feed the corresponding embedding vector to the LSTM, and at the end we use the last output state as the representation of the whole question:

xt = We qt,  t ∈ {1, 2, ..., T}    (4)
ht = LSTM(xt),  t ∈ {1, 2, ..., T}    (5)

The model is shown in Fig. 2, where the question "what is the color of the horse" is fed into the LSTM, and the final hidden layer of the LSTM after consuming the last word of the question is taken as the semantic representation of the question, i.e., vQ = hT.
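A minimal sketch of the LSTM question model in Eqs. (4)-(5) follows; padding and masking are omitted for brevity, and the hyperparameter names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class LSTMQuestionModel(nn.Module):
    """Encodes a question as the last LSTM hidden state: vQ = hT (Eqs. 4-5)."""
    def __init__(self, vocab_size: int, embed_dim: int, semantic_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # xt = We qt
        self.lstm = nn.LSTM(embed_dim, semantic_dim, batch_first=True)

    def forward(self, question_ids: torch.Tensor) -> torch.Tensor:
        # question_ids: (batch, T) word indices for q1, ..., qT (no padding assumed).
        x = self.embed(question_ids)        # (batch, T, embed_dim)
        _, (h_n, _) = self.lstm(x)          # h_n: (1, batch, semantic_dim)
        return h_n.squeeze(0)               # vQ = hT
```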

[Figure: each question word ("what", "is", ..., "horse") is embedded via We and fed into the LSTM step by step; the final hidden state is used as vQ.]

Figure 2: LSTM for the question

2.3.2 The CNN based question model

CNNs have been demonstrated to be powerful models for capturing the semantic meaning of text [15, 6]. In this work, we adopt a CNN architecture similar to [8, 15]. The diagram of the CNN based question model is shown in Fig. 3. In the CNN, we first embed the words into a vector space, then we apply convolution and pooling operations to the word embedding vectors. We use three sets of convolution filters with window sizes of one (unigram), two (bigram), and three (trigram) words, respectively. A max pooling operation over time is applied after the convolution to generate a global semantic representation vector for the whole question.

[Figure: the embedded question words are convolved with unigram, bigram, and trigram filters, followed by max pooling over time.]

Figure 3: The CNN based question model

The red, blue, and orange parts of the diagram in Fig. 3 illustrate the unigram, bigram, and trigram convolution and pooling operations, respectively. The semantic representation of the whole question is the vector obtained after max pooling.

3 Experiments

3.1 Datasets & Baselines

We evaluate our model on three common benchmarks recently proposed for image QA: DAQUAR-ALL [11], DAQUAR-REDUCED, and COCO-QA [14]. We compare our model with methods proposed in recent work on image QA. In [11], the authors use a parser to parse the question and image segmentation to obtain the answer. Other papers make use of neural network models [14, 12, 10]. Although these papers also use LSTMs or CNNs, our approach has a different model architecture. We report answer accuracy in percentage in this paper. In addition, since the reference models also report the Wu-Palmer similarity (WUPS) measure [19], we report WUPS0.9 and WUPS0.0 as well.

3.2 Training and results

All models are trained using stochastic gradient descent with momentum 0.9. The best learning rate is picked from {0.1, 0.01, 0.001}. The batch size is fixed to 100. Gradient clipping [5] and dropout [17] are used during training. The experimental results for the DAQUAR-ALL, DAQUAR-REDUCED, and COCO-QA datasets are shown in Tables 1, 4, and 3, respectively. The names of our models are self-explanatory: LSTM-MM and CNN-MM denote the proposed multimodal (MM) model with the LSTM based and the CNN based question model, respectively. The experimental results show that our models significantly outperform previous models, by around 4% absolute accuracy across all datasets.
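The training recipe above can be summarized by the following sketch of one epoch of minibatch SGD; the clipping threshold and the placement of dropout are not reported in the paper, so those details are placeholders, and the combined model is assumed to return log-probabilities over answers as in the earlier sketches.

```python
import torch
from torch import nn

def train_one_epoch(model: nn.Module, train_loader, lr: float = 0.01) -> None:
    """One epoch of SGD training as described in Section 3.2 (sketch).

    lr is chosen from {0.1, 0.01, 0.001} on a validation set; the clipping
    threshold below is a placeholder since the paper does not report it.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.NLLLoss()  # model outputs log-probabilities over answers
    model.train()
    for images, question_ids, answers in train_loader:  # batches of size 100
        optimizer.zero_grad()
        log_p_a = model(images, question_ids)
        loss = criterion(log_p_a, answers)
        loss.backward()
        # Gradient clipping [5]; dropout [17] is assumed to live inside the model.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
```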

3.3 A case study

To further verify our assumption that the semantic meaning of the image can be decomposed into two parts, the semantics of the question and the semantics of the answer plus some background noise, we perform a retrieval experiment. In each row of Fig. 4, we compute vQ for the question and vI for the reference image, and then retrieve the four nearest images, measured by the distance between each candidate image's semantic vector vI and the difference vI − vQ of the reference pair. Ideally, if the semantic information of the image is decomposable, then vI − vQ corresponds to the semantic representation of the object of the answer plus some background. In Fig. 4, the nearest-neighbor images retrieved by vI − vQ all contain the object of the answer, with various backgrounds, showing that the proposed approach embeds the image and the question in a unified semantic space, which then helps predict the correct answer effectively in our framework.

Figure 4: Image query with vI − vQ
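A sketch of the retrieval procedure used in this case study follows; the distance metric is not specified in the paper, so Euclidean distance is assumed here, and the function and argument names are illustrative.

```python
import torch

def retrieve_nearest_images(v_i: torch.Tensor, v_q: torch.Tensor,
                            gallery: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Returns indices of the k gallery images whose vI is closest to vI - vQ.

    gallery: (N, d) semantic vectors of candidate images; v_i, v_q: (d,) vectors
    of the reference image and the question.
    """
    query = (v_i - v_q).unsqueeze(0)                      # (1, d)
    distances = torch.cdist(query, gallery).squeeze(0)    # (N,) Euclidean distances
    return distances.topk(k, largest=False).indices       # k nearest neighbors
```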

Methods                    Accuracy   WUPS0.9   WUPS0.0
Multi-World [11]:
  Multi-World                  7.86     11.86     38.79
Neural-based [12]:
  Language                    17.15     22.80     58.42
  Neural-based                19.43     25.28     62.00
CNN [10]:
  IMG-CNN                     23.40     29.59     62.95
Ours:
  LSTM-MM                     27.80     34.27     68.09
  CNN-MM                      28.61     34.91     68.84
Human [11]:
  Human                       50.20     50.82     67.27

Table 1: DAQUAR-ALL results, in percentage

Methods            Objects   Number   Color   Location
VSE [14]:
  GUESS                2.11    35.84   13.87       8.93
  BOW                 37.27    43.56   34.75      40.84
  LSTM                35.87    45.34   36.26      38.42
  IMG                 40.37    29.26   42.68      44.19
  IMG+BOW             58.66    44.10   51.96      49.39
  VIS+LSTM            56.53    46.10   45.87      45.52
  2-VIS+BLSTM         58.17    44.79   49.53      47.34
Ours:
  LSTM-MM             61.94    47.51   53.65      51.70
  CNN-MM              62.96    46.97   53.86      53.15

Table 2: COCO-QA accuracy per class, in percentage

Methods                    Accuracy   WUPS0.9   WUPS0.0
VSE [14]:
  GUESS                        6.65     17.42     73.44
  BOW                         37.52     48.54     82.78
  LSTM                        36.76     47.58     82.34
  IMG                         43.02     58.64     85.85
  IMG+BOW                     55.92     66.78     88.99
  VIS+LSTM                    53.31     63.91     88.25
  2-VIS+BLSTM                 55.09     65.34     88.64
CNN [10]:
  IMG-CNN                     54.95     65.36     88.58
  CNN                         32.70     44.32     80.89
Ours:
  LSTM-MM                     58.88     68.93     90.02
  CNN-MM                      59.69     69.62     90.20

Table 3: COCO-QA results, in percentage

Methods                    Accuracy   WUPS0.9   WUPS0.0
Multi-World [11]:
  Multi-World                 12.73     18.20     51.47
Neural-based [12]:
  Language                    31.65     38.35     80.08
  Neural-based                34.68     40.76     79.54
VSE [14]:
  GUESS                       18.24     29.65     77.59
  BOW                         32.67     43.19     81.30
  LSTM                        32.73     43.50     81.62
  IMG+BOW                     34.17     44.99     81.48
  VIS+LSTM                    34.41     46.05     82.23
  2-VIS+BLSTM                 35.78     46.83     82.15
CNN [10]:
  IMG-CNN                     39.66     44.86     83.06
Ours:
  LSTM-MM                     43.45     49.09     83.09
  CNN-MM                      43.45     47.70     83.22
Human [11]:
  Human                       60.27     61.04     78.96

Table 4: DAQUAR-REDUCED results, in percentage

4 Conclusion

In this paper, we propose a novel framework for image QA and explore important variations of it. The evaluation on three image QA benchmarks shows that our models outperform the methods proposed in previous studies. Further, we observe that the CNN performs better than the LSTM at extracting the general semantic meaning of sentences, while the LSTM captures the representation of numbers in context better than the CNN.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. arXiv preprint arXiv:1505.00468, 2015.
[2] Xinlei Chen and C. Lawrence Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
[3] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back. CoRR, abs/1411.4952, 2014.
[4] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv preprint arXiv:1505.05612, 2015.
[5] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[6] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.
[7] Andrej Karpathy, Armand Joulin, and Fei-Fei Li. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889–1897, 2014.
[8] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
[9] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[10] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333, 2015.
[11] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690, 2014.
[12] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. arXiv preprint arXiv:1505.01121, 2015.
[13] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[14] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. arXiv preprint arXiv:1505.02074, 2015.
[15] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101–110. ACM, 2014.
[16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[18] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[19] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
