Perceptual Reward Functions

Ashley Edwards [email protected]

Charles Isbell [email protected]

Atsuo Takanishi [email protected]

Abstract

Reinforcement learning problems are often described through rewards that indicate whether an agent has completed some task. This specification can yield desirable behavior; however, many problems are difficult to specify in this manner, as one often needs to know the proper configuration for the agent. When humans learn to solve tasks, we often learn from visual instructions composed of images or videos. Such representations motivate our development of Perceptual Reward Functions, which provide a mechanism for creating visual task descriptions. We show that this approach allows an agent to learn from rewards that are based on raw pixels rather than internal parameters.


Background

Figure 1: The image on the left shows the result of an agent moving from the start location to the top-right position. The middle image is the corresponding motion template. The image on the right is a zoomed-in visualization of the HOG features of the motion template.

Reinforcement Learning
• In reinforcement learning problems, an agent takes an action a in its current state s and receives a reward r that indicates how good it was to take that action in that state. A policy π tells an agent which actions to take, with the aim of maximizing reward. A Q-value represents the expected discounted cumulative reward an agent will receive after taking action a in state s and following π thereafter. We use a Deep Q-Network (DQN) to approximate these values.

Motion Templates
• A motion template is a 2D visual representation of the motion that has occurred in a sequence of images [Davis, 1999; Bobick and Davis, 2001]. Movement that occurred more recently in time has a higher pixel intensity in the template than earlier motion, so the template depicts both where and when motion occurred.

HOG Features
• A Histogram of Oriented Gradients (HOG) [Dalal and Triggs, 2005] represents information about local appearance and shape within an image. The method divides an image into cells and calculates the gradients for each. A feature vector is then computed by concatenating a histogram of gradient orientations from each cell.
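The motion template idea above can be sketched in a few lines. This is a minimal illustration, not the original Davis/Bobick implementation; the threshold and decay duration are illustrative choices. Pixels that changed between frames are stamped with the current time step, and stamps older than `duration` are cleared, so more recent motion keeps a higher value.

```python
import numpy as np

def update_motion_template(template, frame_diff, t, duration=5.0, thresh=30):
    """One update step of a simplified motion template.

    template   -- 2D array of timestamps from previous updates
    frame_diff -- absolute pixel difference between consecutive frames
    t          -- current time step
    """
    # pixels with enough frame-to-frame change get stamped with the current time
    template = np.where(frame_diff > thresh, float(t), template)
    # motion older than `duration` time steps fades to zero
    template[template < t - duration] = 0.0
    return template
```

Rendering the timestamps as intensities reproduces the property described above: recent motion appears brighter than older motion.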

Perceptual Reward Functions (PRFs)
• This work aims to provide a mechanism for describing goals without modifying internal reward values.
• Rather than stating "The task is complete when these specific configurations are met," visual task descriptions allow one to say "Here is how the completed task should look."
• A mirror state is an image from the agent's simulator or camera.
• A goal template TG is an image of the agent's goal.
• An agent template TA is a visual representation of the agent.

Mirror Descriptors
• In a mirror descriptor, TA is based on the agent's mirror state.
• In a direct descriptor, TA is the agent's mirror state and TG is in the same space as TA.
• In a window descriptor, TG is a cropped section of the goal, and TA is the closest matching section from the agent's mirror state.

Motion Template Descriptors
• In a motion template descriptor, TA is a motion template of the sequence of mirror states in an episode, and TG is a motion template of the desired task.
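For a window descriptor, the agent template is the section of the mirror state that best matches the goal crop. A minimal sketch of that matching step, assuming a simple sum-of-squared-differences search (the poster does not specify the matching criterion, so SSD is an illustrative choice):

```python
import numpy as np

def best_matching_window(mirror_state, goal_crop):
    """Slide a goal-sized window over the mirror state and return the
    patch with the smallest sum of squared differences (the agent template TA)."""
    gh, gw = goal_crop.shape
    H, W = mirror_state.shape
    best, best_ssd = None, np.inf
    for i in range(H - gh + 1):
        for j in range(W - gw + 1):
            patch = mirror_state[i:i + gh, j:j + gw]
            ssd = np.sum((patch - goal_crop) ** 2)
            if ssd < best_ssd:
                best, best_ssd = patch.copy(), ssd
    return best, best_ssd
```

In practice a library routine such as OpenCV's template matching would replace the explicit double loop.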

Evaluation
• Compared against a Variable Reward Function (VRF) based on the task's internal variables.
• Aim to show that a policy based on a PRF will yield behaviors that are at least as good as the behaviors learned with the VRF.
• Use the same DQN architecture from [Mnih et al., 2015] with a different input and output size.

Domains

Figure 2: Tasks used for evaluation: (a) Breakout, (b) Flappy Bird, (c) Kobian. In Breakout, the agent must hit a pellet with a paddle to break all of the bricks in the game. In Flappy Bird, the agent must flap its wings to move itself between pipes. In the Kobian simulator, the agent must move parts of its face to make expressions.

Task Descriptors

Figure 3: Task Descriptors. From left to right: Breakout TG, Flappy Bird TG, Kobian Simulator Happy expression for VRF, Kobian Simulator Surprise expression for VRF, KPRF Happy TG, KPRF Surprise TG, HPRF Happy TG, HPRF Surprise TG. The descriptors for Breakout are direct descriptors. The descriptors for Flappy Bird are window descriptors. The descriptors for Kobian are motion template descriptors.

• The Breakout VRF gives a reward of 1 for each brick hit. PRF rewards increase as the number of black pixels in TA increases.
• The Flappy Bird VRF gives a reward of 1 when the agent is between pipes, -1 when it crashes, and 0.1 for each step. PRF rewards increase as the bird gets closer to the middle of a pipe.
• The Kobian VRF gives rewards based on the distance to components of the face for a Happy expression and a Surprise expression. K(obian)PRF rewards increase as TA approaches motion templates for the Happy and Surprise expressions. The H(uman)PRF rewards increase as TA approaches motion templates generated from humans making the expressions.
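The Breakout PRF above reduces to a pixel count: broken bricks leave black regions in the agent template, so the reward can grow with the fraction of black pixels. A minimal sketch, with the blackness threshold chosen purely for illustration:

```python
import numpy as np

def breakout_prf(agent_template, black_thresh=10):
    """Reward grows with the fraction of near-black pixels in TA,
    a proxy for the number of bricks broken."""
    return float(np.mean(agent_template < black_thresh))
```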


Figure 4: The results obtained in (a) Breakout and (b) Flappy Bird. We ran Breakout for 60,000 episodes and Flappy Bird for 8,500 episodes. In Breakout, the agent's score was incremented each time it hit a brick. In Flappy Bird, the agent's score was incremented when it moved through two pipes.

PRF Calculation
• Given these templates, we can compute a PRF whose reward increases as TA approaches TG.
• The distance metric D can be calculated by taking the distance between the HOG features of automatically cropped templates.

Figure 5: The results obtained in the Kobian simulator, shown for (a) the Kobian VRF, (b) the Kobian KPRF, and (c) the Kobian HPRF. We ran the Kobian simulator for 10,000 episodes for each experiment.
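The PRF calculation above can be sketched end to end. This is a simplified stand-in, not the paper's implementation: it uses a bare-bones per-cell gradient histogram in place of a full HOG pipeline (no block normalization), and negates the Euclidean distance so that reward increases as TA approaches TG. Cell size and bin count are illustrative.

```python
import numpy as np

def hog_features(img, cell=8, bins=9):
    """Simplified HOG: per-cell histograms of gradient orientations,
    weighted by gradient magnitude, concatenated and L2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            m = mag[i:i + cell, j:j + cell].ravel()
            a = ang[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0.0, np.pi), weights=m)
            feats.append(hist)
    f = np.concatenate(feats)
    n = np.linalg.norm(f)
    return f / n if n > 0 else f

def prf_reward(t_agent, t_goal):
    """Reward grows as the agent template TA approaches the goal template TG:
    the negated distance D between the templates' HOG features."""
    d = np.linalg.norm(hog_features(t_agent) - hog_features(t_goal))
    return -d
```

A reward of 0 means the templates' HOG features match exactly; more dissimilar templates yield more negative rewards.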

State Representation
• Information necessary to solve a specific task can be lost when only a single image is used as input to a DQN.
• Our approach is to take the Exponential Moving Average [Hunter, 2016] of states as the state input for direct and window descriptors.
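The exponential moving average of states amounts to blending each new frame into a running average, so recent motion leaves a visible trail in the input image. A minimal sketch, with the smoothing factor chosen for illustration:

```python
import numpy as np

def ema_state(prev_state, frame, alpha=0.5):
    """Blend the new frame into the running average:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    return alpha * frame + (1.0 - alpha) * prev_state
```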

Figure 6: The faces that each experiment converged or nearly converged to. The first three images show the learned happy faces for the VRF, KPRF, and HPRF. The final three images show the learned surprise faces for the VRF, KPRF, and HPRF.

Acknowledgements This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1148903 and the International Research Fellowship of the Japan Society for the Promotion of Science.
