Perceptual Reward Functions
Ashley Edwards
Charles Isbell
Atsuo Takanishi
Abstract
Reinforcement learning problems are often described through rewards that indicate whether an agent has completed some task. This specification can yield desirable behavior; however, many problems are difficult to specify in this manner, as one often needs to know the proper configuration for the agent. When humans learn to solve tasks, we often learn from visual instructions composed of images or videos. Such representations motivate our development of Perceptual Reward Functions, which provide a mechanism for creating visual task descriptions. We show that this approach allows an agent to learn from rewards that are based on raw pixels rather than internal parameters.
Background
Figure 1: The image on the left shows the result of an agent moving from the start location to the top-right position. The middle image is the corresponding motion template. The image on the right is a zoomed-in visualization of the HOG features of the motion template.
Reinforcement Learning
• In reinforcement learning problems, an agent takes an action a in its current state s and receives a reward r that indicates how good it was to take that action in that state.
• A policy π tells an agent which actions to take and aims to maximize rewards.
• A Q-value represents the expected discounted cumulative reward an agent will receive after taking action a in state s and following π thereafter. We use a Deep Q-Network (DQN) to approximate these values.
Motion Templates
• A Motion Template is a 2D visual representation of motion that has occurred in a sequence of images [Davis, 1999; Bobick and Davis, 2001].
• Movement that occurred more recently has a higher pixel intensity in the template than earlier motion, so the template depicts both where and when motion occurred.
HOG Features
• A Histogram of Oriented Gradients (HOG) [Dalal and Triggs, 2005] represents information about local appearances and shapes within images.
• The method divides an image into cells and calculates the gradients for each. A feature vector is then computed by concatenating a histogram of gradient orientations from each cell.
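The motion template idea described above can be illustrated with a minimal sketch. It assumes grayscale frames with values in [0, 1], simple frame differencing to detect motion, and a linear scaling of intensity with time; the function name `motion_template` and the parameter `diff_threshold` are hypothetical, and this is a simplification rather than the exact formulation of [Bobick and Davis, 2001].

```python
import numpy as np

def motion_template(frames, diff_threshold=0.1):
    """Build a simple motion template from grayscale frames.

    frames: array-like of shape (T, H, W) with values in [0, 1].
    A pixel is treated as "moving" at step t when it changes by more than
    diff_threshold between frame t-1 and frame t (an assumption).
    Pixels that moved most recently get intensity 1.0, earlier motion gets
    a proportionally lower intensity, and pixels that never moved stay 0.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_steps = len(frames) - 1
    template = np.zeros(frames.shape[1:], dtype=np.float64)
    for t in range(1, len(frames)):
        moved = np.abs(frames[t] - frames[t - 1]) > diff_threshold
        # Later motion overwrites earlier motion with a higher intensity.
        template[moved] = t / num_steps
    return template

# A single bright pixel moving left to right across a 1x4 image.
frames = np.zeros((4, 1, 4))
for t in range(4):
    frames[t, 0, t] = 1.0
template = motion_template(frames)
```

In this toy sequence the template is brightest where the pixel ended up and dimmest where it started, which is the "where and when" property the bullets describe.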
Perceptual Reward Functions (PRFs)
• This work aims to provide a mechanism for describing goals without modifying internal reward values.
• Rather than stating, “The task is complete when these specific configurations are met,” visual task descriptions allow one to say, “Here is how the completed task should look.”
• A mirror state is an image from the agent’s simulator or camera.
• A goal template TG is an image of the agent’s goal.
• An agent template TA is a visual representation of the agent.
Mirror Descriptors
• In a mirror descriptor, TA is based on the agent’s mirror state.
• In a direct descriptor, TA is the agent’s mirror state and TG is in the same space as TA.
• In a window descriptor, TG is a cropped section of the goal, and TA is the closest matching section from the agent’s mirror state.
Motion Template Descriptor
• In a motion template descriptor, TA is a motion template of the sequence of mirror states in an episode, and TG is a motion template of the desired task.
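As an illustration of the window descriptor, the sketch below searches a mirror state for the section that most closely matches a cropped goal template TG and returns it as TA. The poster does not say how "closest" is measured; the sum of squared pixel differences used here, and the function name `closest_window`, are assumptions made for the example.

```python
import numpy as np

def closest_window(mirror_state, goal_template):
    """Return (row, col, window) for the crop of mirror_state nearest to goal_template.

    Both inputs are 2D grayscale arrays. Closeness is measured with the
    sum of squared pixel differences, which is an assumption of this sketch.
    """
    mirror_state = np.asarray(mirror_state, dtype=np.float64)
    goal_template = np.asarray(goal_template, dtype=np.float64)
    gh, gw = goal_template.shape
    mh, mw = mirror_state.shape
    if gh > mh or gw > mw:
        raise ValueError("goal_template must fit inside mirror_state")
    best_cost, best_pos = None, None
    # Exhaustively slide the goal-sized window over the mirror state.
    for r in range(mh - gh + 1):
        for c in range(mw - gw + 1):
            window = mirror_state[r:r + gh, c:c + gw]
            cost = np.sum((window - goal_template) ** 2)
            if best_cost is None or cost < best_cost:
                best_cost, best_pos = cost, (r, c)
    r, c = best_pos
    return r, c, mirror_state[r:r + gh, c:c + gw]

# Toy mirror state containing the goal pattern at row 2, column 1.
goal = np.array([[1.0, 2.0], [3.0, 4.0]])
mirror = np.zeros((5, 5))
mirror[2:4, 1:3] = goal
row, col, agent_template = closest_window(mirror, goal)
```

The exhaustive search is quadratic in the image size; it is kept here because it makes the definition of TA explicit, not because it is efficient.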
• We compare PRFs against a Variable Reward Function (VRF) based on the task’s internal variables.
• We aim to show that a policy based on a PRF will yield behaviors that are at least as good as the behaviors learned with the VRF.
• We use the same DQN architecture as [Mnih et al., 2015], with a different input and output size.
Domains
Figure 2: Tasks used for evaluation. In Breakout, the agent must hit a pellet with a paddle to break all of the bricks in the game. In Flappy Bird, the agent must flap its wings to move itself between pipes. In the Kobian Simulator, the agent must move parts of its face to make expressions.
Figure 3: Task Descriptors. From left to right: Breakout TG, Flappy Bird TG, Kobian Simulator Happy Expression for VRF, Kobian Simulator Surprise Expression for VRF, KPRF Happy TG, KPRF Surprise TG, HPRF Happy TG, HPRF Surprise TG. The descriptors for Breakout are Direct Descriptors. The descriptors for Flappy Bird are Window Descriptors. The descriptors for Kobian are Motion Template Descriptors.
• Breakout: The VRF gives a reward of 1 for each brick hit. PRF rewards increase as the number of black pixels in TA increases.
• Flappy Bird: The VRF gives a reward of 1 when the agent is between pipes, -1 when it crashes, and 0.1 for each step. PRF rewards increase as the bird gets closer to the middle of a pipe.
• Kobian: The VRF gives a reward based on the distance to the components of the face for a Happy Expression and a Surprise Expression. K(obian)PRF rewards increase as TA approaches the motion templates for the Happy and Surprise Expressions. H(uman)PRF rewards increase as TA approaches motion templates generated from humans making the expressions.
Figure 4: The results obtained in (a) Breakout and (b) Flappy Bird. We ran Breakout for 60,000 episodes and Flappy Bird for 8,500 episodes. In Breakout, the agent’s score was incremented each time it hit a brick. In Flappy Bird, the agent’s score was incremented when it moved through two pipes.
PRF Calculation
• Given these templates, we can compute a PRF from a distance metric D between TA and TG.
• The distance metric D can be calculated by taking the distance between the HOG features of automatically cropped templates.
Figure 5: The results obtained in the Kobian simulator for (a) the Kobian VRF, (b) the Kobian KPRF, and (c) the Kobian HPRF. We ran the Kobian simulator for 10,000 episodes for each experiment.
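To make the PRF calculation concrete, here is a minimal sketch of a reward built from a HOG-based distance between an agent template TA and a goal template TG. Several parts are assumptions rather than the poster's method: the HOG variant is simplified (per-cell histograms with one global normalization, no block normalization as in [Dalal and Triggs, 2005]), the automatic cropping step is omitted, D is taken to be the Euclidean distance, and the mapping from distance to reward, exp(-D), is one plausible choice that decreases as the templates move apart. The names `hog_features` and `perceptual_reward` are hypothetical.

```python
import numpy as np

def hog_features(image, cell_size=4, num_bins=9):
    """Simplified HOG: concatenated per-cell histograms of gradient orientation."""
    image = np.asarray(image, dtype=np.float64)
    gy, gx = np.gradient(image)  # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in degrees, folded into [0, 180).
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / num_bins
    bin_index = np.minimum((orientation / bin_width).astype(int), num_bins - 1)
    histograms = []
    for r in range(image.shape[0] // cell_size):
        for c in range(image.shape[1] // cell_size):
            rows = slice(r * cell_size, (r + 1) * cell_size)
            cols = slice(c * cell_size, (c + 1) * cell_size)
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist = np.bincount(bin_index[rows, cols].ravel(),
                               weights=magnitude[rows, cols].ravel(),
                               minlength=num_bins)
            histograms.append(hist)
    features = np.concatenate(histograms)
    norm = np.linalg.norm(features)
    return features / norm if norm > 0 else features

def perceptual_reward(agent_template, goal_template):
    """Reward in (0, 1] that grows as the HOG distance D shrinks (assumed form)."""
    distance = np.linalg.norm(hog_features(agent_template)
                              - hog_features(goal_template))
    return float(np.exp(-distance))

# Toy templates: a vertical edge (goal) and a horizontal edge (mismatch).
goal = np.zeros((8, 8))
goal[:, 4:] = 1.0
mismatch = np.zeros((8, 8))
mismatch[4:, :] = 1.0
reward_match = perceptual_reward(goal, goal)
reward_mismatch = perceptual_reward(mismatch, goal)
```

An identical template yields the maximum reward, and a template whose edges point in a different direction yields a lower one, so the reward depends only on pixels and not on the task's internal variables.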
State Representation
• Information necessary to solve a specific task can be lost when only a single image is used as input to a DQN.
• For direct and window descriptors, our approach is to use the Exponential Moving Average [Hunter 2016] of states as the state input.
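The exponential moving average of states can be sketched as follows. The smoothing factor `alpha` and the function name `ema_states` are assumptions for illustration; the poster does not state the value that was used.

```python
import numpy as np

def ema_states(frames, alpha=0.5):
    """Exponential moving average over a sequence of image states.

    frames: array-like of shape (T, H, W).
    alpha: weight given to the newest frame (hypothetical default).
    Returns an array of the same shape whose entry t blends frame t with
    the running average of all earlier frames.
    """
    frames = np.asarray(frames, dtype=np.float64)
    averaged = np.empty_like(frames)
    averaged[0] = frames[0]
    for t in range(1, len(frames)):
        averaged[t] = alpha * frames[t] + (1.0 - alpha) * averaged[t - 1]
    return averaged

# A bright pixel moving left to right across a 1x3 image.
frames = np.array([[[1.0, 0.0, 0.0]],
                   [[0.0, 1.0, 0.0]],
                   [[0.0, 0.0, 1.0]]])
states = ema_states(frames)
```

The final averaged state still carries a fading trace of where the pixel has been, which a single raw frame would not show.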
Figure 6: The faces that each experiment converged or nearly converged to. The first three images show the learned happy faces for the VRF, KPRF, and HPRF. The final three images show the learned surprise faces for the VRF, KPRF, and HPRF.
Acknowledgements
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1148903 and the International Research Fellowship of the Japan Society for the Promotion of Science.