Perceptual Reward Functions

Ashley Edwards [email protected]

Charles Isbell [email protected]

Atsuo Takanishi [email protected]

Abstract Reinforcement learning problems are often described through rewards that indicate if an agent has completed some task. This specification can yield desirable behavior; however, many problems are difficult to specify in this manner, as one often needs to know the proper configuration for the agent. When humans are learning to solve tasks, we often learn from visual instructions composed of images or videos. Such representations motivate our development of Perceptual Reward Functions, which provide a mechanism for creating visual task descriptions. We show that this approach allows an agent to learn from rewards that are based on raw pixels rather than internal parameters.

Background

Figure 1: The image on the left shows the result of an agent moving from the start location to the top-right position. The middle image is the corresponding motion template. The image on the right is a zoomed-in visualization of the HOG features of the motion template.

Reinforcement Learning
• In reinforcement learning problems, an agent takes an action a in its current state s and receives a reward r that indicates how good it was to take that action in that state. A policy π informs an agent of what actions to take and aims to maximize rewards. A Q-value represents the expected discounted cumulative reward an agent will receive after taking action a in state s, then following π thereafter. We use a Deep Q-Network (DQN) to approximate these values.

Motion Templates
• A Motion Template is a 2D visual representation of motion that has occurred in a sequence of images [Davis, 1999; Bobick and Davis, 2001]. Movement that occurred more recently in time has a higher pixel intensity in the template than earlier motion, so the template depicts both where and when motion occurred (see the sketch below).

HOG Features
• A Histogram of Oriented Gradients (HOG) [Dalal and Triggs, 2005] represents information about local appearances and shapes within images. The method divides an image into cells and calculates the gradients for each. A feature vector is then computed by concatenating a histogram of gradient orientations from each cell.
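To make the motion-template idea concrete, the following is a minimal NumPy sketch of a motion-history update in the spirit of [Davis, 1999; Bobick and Davis, 2001]; the frame-difference threshold and decay constant are illustrative assumptions, not values from this work.

```python
# Minimal sketch (not the authors' code): a motion template / motion history image.
import numpy as np

TAU = 10            # number of recent steps that motion stays visible (assumed)
MOTION_THRESH = 30  # pixel-difference threshold for "motion occurred here" (assumed)

def update_motion_template(mhi, prev_frame, curr_frame, timestamp):
    """Mark recently moving pixels with the current timestamp and fade out old motion."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    moved = diff > MOTION_THRESH
    mhi = np.where(moved, float(timestamp), mhi)   # newest motion gets the largest value
    mhi[mhi < timestamp - TAU] = 0.0               # motion older than TAU disappears
    return mhi

def to_template_image(mhi, timestamp):
    """Rescale so that more recent motion has higher pixel intensity (0-255)."""
    scaled = np.clip((mhi - (timestamp - TAU)) / TAU, 0.0, 1.0)
    return (scaled * 255).astype(np.uint8)
```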

Perceptual Reward Functions (PRFs)
• This work aims to provide a mechanism for describing goals without modifying internal reward values.
• Rather than stating: “The task is complete when these specific configurations are met,” visual task descriptions allow one to say: “Here is how the completed task should look.”
• A mirror state is an image from the agent’s simulator or camera.
• A goal template TG is an image of the agent’s goal.
• An agent template TA is a visual representation of the agent.

Mirror Descriptors
• In a mirror descriptor, TA is based on the agent’s mirror state.
• In a direct descriptor, TA is the agent’s mirror state and TG is in the same space as TA.
• In a window descriptor, TG is a cropped section of the goal and TA is the closest matching section from the agent’s mirror state (see the sketch below).

Motion Template Descriptor
• In a motion template descriptor, TA is a motion template of the sequence of mirror states in an episode and TG is a motion template of the desired task.
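As an illustration of the window descriptor, one plausible way (our assumption, not necessarily the authors' implementation) to extract TA is standard template matching: slide the goal window over the mirror state and keep the best-matching region.

```python
# Minimal sketch: locate the section of the mirror state that best matches the
# cropped goal window T_G, and use that region as the agent template T_A.
import cv2

def window_descriptor(mirror_state_gray, goal_window_gray):
    """Return the region of the (grayscale) mirror state closest to the goal window."""
    result = cv2.matchTemplate(mirror_state_gray, goal_window_gray, cv2.TM_CCOEFF_NORMED)
    _, _, _, (x, y) = cv2.minMaxLoc(result)        # top-left corner of the best match
    h, w = goal_window_gray.shape[:2]
    return mirror_state_gray[y:y + h, x:x + w]     # agent template T_A
```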

Experiments

• Compared against a Variable Reward Function (VRF) based on the task’s internal variables.
• Aim to show that a policy based on a PRF will yield behaviors that are at least as good as the behaviors learned with the VRF.
• Use the same DQN architecture as [Mnih et al., 2015], with a different input and output size (a sketch follows this list).
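The network below is a minimal PyTorch sketch of the DQN architecture of [Mnih et al., 2015]; the input channels, 84x84 resolution, and number of actions are placeholders that would change per domain.

```python
# Minimal PyTorch sketch of the [Mnih et al., 2015] DQN; sizes are placeholders.
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, in_channels=4, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 assumes 84x84 inputs
            nn.Linear(512, n_actions),               # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x))
```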

Domains

Figure 2: Tasks used for evaluation. (a) In Breakout, the agent must hit a pellet with a paddle to break all of the bricks in the game. (b) In Flappy Bird, the agent must flap its wings to move itself between pipes. (c) In the Kobian Simulator, the agent must move parts of its face to make expressions.

Task Descriptors

Figure 3: Task Descriptors. From left to right: Breakout TG, Flappy Bird TG, Kobian Simulator Happy expression for VRF, Kobian Simulator Surprise expression for VRF, KPRF Happy TG, KPRF Surprise TG, HPRF Happy TG, HPRF Surprise TG. The descriptors for Breakout are direct descriptors, the descriptors for Flappy Bird are window descriptors, and the descriptors for Kobian are motion template descriptors.

• The Breakout VRF gives a reward of 1 for each brick hit. The PRF rewards increase as the number of black pixels in TA increases (a sketch follows this list).
• The Flappy Bird VRF gives a reward of 1 when the agent is between pipes, -1 when it crashes, and 0.1 for each step. The PRF rewards increase as the bird gets closer to the middle of a pipe.
• The Kobian VRF gives rewards based on distances to facial components for a Happy expression and a Surprise expression. The K(obian)PRF rewards increase as TA approaches motion templates for the Happy and Surprise expressions. The H(uman)PRF rewards increase as TA approaches motion templates generated from humans making the expressions.
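As a concrete (and assumed) example of the Breakout PRF above, the reward could simply count the dark pixels in TA, since broken bricks leave black regions behind; the threshold and normalization here are our own choices, not the authors' exact formulation.

```python
# Minimal sketch (assumed form): a Breakout-style PRF that grows with the
# number of black pixels in the agent template T_A.
import numpy as np

BLACK_THRESH = 10   # grayscale values at or below this count as "black" (assumed)

def breakout_prf(agent_template_gray):
    """Reward proportional to how many pixels in T_A have turned black."""
    n_black = int(np.sum(agent_template_gray <= BLACK_THRESH))
    return n_black / agent_template_gray.size   # normalized to [0, 1]
```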

Results

Figure 4: The results obtained in (a) Breakout and (b) Flappy Bird. We ran Breakout for 60,000 episodes and Flappy Bird for 8,500 episodes. In Breakout, the agent’s score was incremented each time it hit a brick. In Flappy Bird, the agent’s score was incremented when it moved through two pipes.

PRF Calculation
• Given these templates, we can compute a PRF:

• The distance metric D can be calculated by taking the distance between the HOG features of automatically cropped templates:
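The formulas themselves are not reproduced here. The sketch below shows one way to compute D with scikit-image HOG features and a Euclidean distance; mapping the distance to a reward by negation is an assumption about the form of the PRF, not the authors' exact definition.

```python
# Minimal sketch: distance D between the HOG features of the (cropped, equally
# sized) goal and agent templates, and a reward that increases as T_A
# approaches T_G. The negative-distance reward is an assumption.
import numpy as np
from skimage.feature import hog

def hog_distance(template_a, template_g):
    """Euclidean distance between HOG feature vectors of two grayscale templates."""
    feat_a = hog(template_a, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    feat_g = hog(template_g, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return float(np.linalg.norm(feat_a - feat_g))

def perceptual_reward(template_a, template_g):
    """Reward grows as the agent template approaches the goal template."""
    return -hog_distance(template_a, template_g)
```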


Figure 5: The results obtained in the Kobian simulator for (a) the VRF, (b) the KPRF, and (c) the HPRF. We ran the Kobian simulator for 10,000 episodes for each experiment.

State Representation
• Information necessary to solve a specific task can be lost when only a single image is used as input to a DQN.
• Our approach is to take the Exponential Moving Average [Hunter 2016] of states as the state input for direct and window descriptors (see the sketch below).
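A minimal sketch of such an exponential moving average over mirror states, with the smoothing weight chosen arbitrarily for illustration:

```python
# Minimal sketch (assumed alpha): an exponential moving average over mirror
# states, used as the DQN input so recent history is not lost in a single frame.
import numpy as np

class EMAState:
    def __init__(self, alpha=0.3):
        self.alpha = alpha     # weight on the newest frame (an assumption)
        self.state = None

    def update(self, frame):
        """Blend the newest frame into the running average and return the state input."""
        frame = frame.astype(np.float32)
        if self.state is None:
            self.state = frame
        else:
            self.state = self.alpha * frame + (1.0 - self.alpha) * self.state
        return self.state
```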

Figure 6: The faces that each experiment converged or nearly converged to. The first three images show the learned happy faces for the VRF, KPRF, and HPRF. The final three images show the learned surprise faces for the VRF, KPRF, and HPRF.

Acknowledgements This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1148903 and the International Research Fellowship of the Japan Society for the Promotion of Science.
