Relating Natural Language and Visual Recognition

Marcus Rohrbach1,2, Jacob Andreas1, Trevor Darrell1, Jiashi Feng7, Lisa Anne Hendricks1, Dan Klein1, Ronghang Hu1, Raymond Mooney4, Anna Rohrbach3, Kate Saenko5, Bernt Schiele3, Subhashini Venugopalan4, Huazhe Xu6

1 UC Berkeley EECS, 2 ICSI, Berkeley, 3 MPI for Informatics, 4 UT Austin, 5 UMass Lowell, 6 Tsinghua University, 7 National University of Singapore

In this poster we will relate and discuss several of our most recent efforts for “Closing the Loop Between Vision and Language”. In Section 1 we show how we can describe videos [6] and images [3] with natural language sentences (Vision ⇒ Language). In Section 2 we show how we ground phrases in images [4] (Language ⇔ Vision). And finally, in Section 3, we discuss how compositional computation allows for effective question answering about images [1] (Language & Vision ⇒ Language).

1. Describing visual content with natural language sentences

Figure 2: Results on MPII Movie Description dataset [7]. The “Visual labels” approach [6] identifies activities, objects, and places better than related work. From [6].

In [6], we decompose the challenging task of describing movies into two steps. First we recognize the most relevant activities/verbs, scenes, and objects; then we describe them with natural sentences using a recurrent network, namely an LSTM. The approach is visualized in Fig. 1 and achieves state-of-the-art performance on the challenging MPII Movie Description dataset [7], with respect to both automatic and human evaluation. Qualitative results are shown in Fig. 2.
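The two-step structure can be made concrete with a minimal sketch. The label vocabularies, scores, and the template-based second stage below are purely illustrative; the actual model uses trained visual classifiers for stage 1 and an LSTM decoder for stage 2.

```python
import numpy as np

# Hypothetical label vocabularies (illustrative; the real model learns
# classifiers for verbs, places, and objects from movie-description data).
VERBS = ["enter", "look", "drive"]
PLACES = ["room", "street", "courtyard"]
OBJECTS = ["door", "car", "bottle"]

def select_visual_labels(verb_scores, place_scores, object_scores):
    """Stage 1: keep only the most confident verb/place/object per clip."""
    return (VERBS[int(np.argmax(verb_scores))],
            PLACES[int(np.argmax(place_scores))],
            OBJECTS[int(np.argmax(object_scores))])

def generate_sentence(labels):
    """Stage 2 stand-in: the paper conditions an LSTM on the selected
    labels; a template here just makes the pipeline runnable."""
    verb, place, obj = labels
    return f"Someone {verb}s the {obj} in the {place}."

labels = select_visual_labels([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1])
print(generate_sentence(labels))  # "Someone enters the door in the room."
```

The point of the decomposition is that stage 1 can be trained and evaluated as ordinary visual recognition, independent of the language model.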

State-of-the-art deep image and video captioning approaches are limited to describing objects which appear in paired image-sentence data. Hendricks et al. [3] show how to exploit unpaired vision-only and language-only data to describe novel categories (Fig. 3).
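A simplified sketch of the transfer idea: a caption model's output layer has weights only for words seen in paired data; for a novel word, weights can be borrowed from a semantically close seen word, found via word embeddings learned from unpaired text. This is an illustrative reduction of the approach, not the paper's exact procedure.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def transfer_weights(caption_weights, embeddings, novel_word, seen_words):
    """Give a novel word output weights copied from its nearest seen word
    (nearest in embedding space, learned from unpaired text)."""
    nearest = max(seen_words,
                  key=lambda w: cosine(embeddings[novel_word], embeddings[w]))
    caption_weights[novel_word] = caption_weights[nearest].copy()
    return nearest

# Toy 2-d embeddings and output weights (all values illustrative).
embeddings = {"dog": np.array([1.0, 0.1]), "pizza": np.array([0.0, 1.0]),
              "otter": np.array([0.9, 0.2])}
caption_weights = {"dog": np.array([0.5, -0.3]), "pizza": np.array([-0.2, 0.8])}

nearest = transfer_weights(caption_weights, embeddings, "otter", ["dog", "pizza"])
print(nearest)  # "dog": otter's embedding is closest to dog's
```

After the transfer, the caption model can emit "otter" in the contexts where it would have emitted the related seen word, even though no paired otter captions exist.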

Figure 3: Existing deep caption methods are unable to generate sentences about objects unseen in caption corpora (like otter). In contrast our model (DCC) effectively incorporates information from independent image datasets and text corpora to compose descriptions about novel objects without any paired image-captions. From [3].

Figure 1: Describing movie snippets with natural sentences.

Figure 4: Overview of our approach to grounding phrases in images. Given an input image, a text query, and a set of candidate locations (e.g. from object proposal methods), a recurrent neural network model is used to score candidate locations based on local descriptors, spatial configurations, and global context. The highest-scoring candidate is retrieved. From [4].

Figure 5: Correctly localized examples (IoU ≥ 0.5) on ReferIt [5] with EdgeBox proposals [9]. Ground truth shown in yellow, correctly retrieved box in green.

2. Grounding natural language phrases in images

In many human-computer interaction or robotic scenarios it is important to be able to ground, i.e. localize, referential natural language expressions in visual content. Hu et al. [4] propose to do this by ranking bounding box proposals using local, contextual, and spatial information (Fig. 4). An important aspect of the approach is transferring models trained on full-image description datasets to this new task.
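The ranking step can be sketched as scoring each candidate box on a concatenation of the three cues. The feature dimensions and the linear scorer below are made up for illustration; the actual model scores candidates with a recurrent network conditioned on the query words.

```python
import numpy as np

def spatial_feature(box, img_w, img_h):
    """8-d spatial configuration: normalized corners, center, and size."""
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     (x1 + x2) / (2 * img_w), (y1 + y2) / (2 * img_h),
                     (x2 - x1) / img_w, (y2 - y1) / img_h])

def score_candidates(local_feats, boxes, global_feat, query_weights, img_w, img_h):
    """Score each box on [local | spatial | global]; return the best index."""
    scores = []
    for feat, box in zip(local_feats, boxes):
        full = np.concatenate([feat, spatial_feature(box, img_w, img_h), global_feat])
        scores.append(float(query_weights @ full))
    return int(np.argmax(scores))

# Two candidate boxes: one on the left, one on the right of a 100x100 image.
boxes = [(0, 0, 10, 10), (80, 0, 100, 10)]
local = [np.zeros(2), np.zeros(2)]          # 2-d local descriptors (toy)
w = np.zeros(12)
w[6] = 1.0  # up-weight the x-center feature, mimicking a query like "on the right"
best = score_candidates(local, boxes, np.zeros(2), w, 100, 100)
print(best)  # 1: the right-hand box wins
```

The spatial configuration is what lets a query like "white car on the right" prefer one of several visually identical candidates.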

Figure 6: We use a natural language parser to dynamically lay out a deep network composed of reusable modules. For visual question answering tasks, an additional sequence model provides sentence context and learns common-sense knowledge. From [1].

3. Answering questions about images

In the third part we discuss how to answer natural-language questions about images. Andreas et al. [1] describe an approach to visual question answering based on neural module networks (NMNs): natural language questions about images are answered by collections of jointly trained neural "modules", dynamically composed into deep networks based on linguistic structure. Concretely, given an image and an associated question (e.g. where is the dog?), we wish to predict a corresponding answer (e.g. on the couch, or perhaps just couch) by decomposing the question into a where and a dog module, as shown in Fig. 6.
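The composition where(find[dog]) can be illustrated with toy modules over a label grid. In the paper the modules are jointly trained networks and the layout comes from a linguistic parse; the hand-written functions below only make the compositional structure concrete.

```python
import numpy as np

def find(grid, label):
    """find[dog]: an attention map over cells whose object label matches."""
    return (grid == label).astype(float)

def where(attention, locations):
    """where: read off the location label at the most-attended cell."""
    return locations.flatten()[int(np.argmax(attention))]

# A 2x2 "image": the object in each cell, and the location of each cell
# (both grids are hypothetical).
objects = np.array([["sky", "sky"], ["dog", "grass"]])
locations = np.array([["sky", "sky"], ["couch", "grass"]])

# Layout for "where is the dog?": where(find[dog]).
answer = where(find(objects, "dog"), locations)
print(answer)  # "couch"
```

Because the modules are reusable, the same find and where pieces recombine for new questions (where is the cat?, how many dogs?) without retraining from scratch.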

This surpasses the performance of prior work on the MSCOCO-based VQA dataset [2], as well as on a new, challenging shapes dataset which requires composing up to six modules to answer a question.

References

[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Deep compositional question answering with neural module networks. arXiv:1511.02799, 2015.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[3] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. arXiv:1511, 2015.
[4] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. arXiv:1511.04164, 2015.
[5] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2015.
[7] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. arXiv:1505.00487v2, 2015.
[9] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision (ECCV), pages 391–405. Springer, 2014.
