Contextual Decision Processes with Low Bellman Rank are PAC-Learnable
Nan Jiang¹,³, Akshay Krishnamurthy², Alekh Agarwal³, John Langford³, Robert E. Schapire³
¹ University of Michigan, Ann Arbor   ² University of Massachusetts, Amherst   ³ Microsoft Research, NYC

Introduction: 3 challenges of RL

Problem: reinforcement learning combines three challenges, and existing theory handles any two of them at once:
● Long-term planning + generalization: approximate DP
● Long-term planning + exploration: PAC-MDP theory
● Generalization + exploration: contextual bandits
● All three simultaneously: ?

Our Answer
● A new measure – Bellman rank
○ Captures a wide range of tractable RL problems
○ Polynomial sample complexity guarantee
● A new algorithm – OLIVE (Optimism-Led Iterative Value-function Elimination)

Value-based RL in CDPs

Contextual Decision Processes (CDPs): episodic RL with rich observations.
● Action space A, horizon H.
● Context space X. A context is
○ any function of history that expresses a good policy & value function
○ e.g., last 4 frames of images in Atari games
○ e.g., (state, time-step) for finite-horizon tabular MDPs
● An episode: x1, a1, r1, x2, …, xH, aH, rH (a toy instance of this protocol is sketched below).
● Policy π : X → A. Want to maximize the expected return E[r1 + … + rH | π].
● In general, |X| is very large ⇒ requires generalization!
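To make the episodic protocol concrete, here is a minimal sketch of a toy CDP with context = (state, time-step) and a Monte Carlo estimate of a policy's value. The environment, its reset/step interface, and all numbers are illustrative assumptions, not part of the paper.

```python
# A toy CDP with context = (state, time-step) on a 2-state MDP.
# Everything here (the environment, its reset/step interface, the numbers)
# is an illustrative assumption, not the paper's construction.
import random
from statistics import mean

class ToyCDP:
    """Two states {0, 1}; state 1 pays reward 1; action a moves to state a
    with probability 0.9."""
    def __init__(self, horizon=5):
        self.H = horizon
    def reset(self):
        self.s, self.h = 0, 1
        return (self.s, self.h)              # initial context x1
    def step(self, a):
        r = 1.0 if self.s == 1 else 0.0      # reward r_h at the current state
        self.s = a if random.random() < 0.9 else 1 - a
        self.h += 1
        return r, (self.s, self.h)           # (r_h, next context x_{h+1})

def episode_return(cdp, policy):
    """One episode x1, a1, r1, ..., xH, aH, rH; returns r1 + ... + rH."""
    x, total = cdp.reset(), 0.0
    for _ in range(cdp.H):
        r, x = cdp.step(policy(x))
        total += r
    return total

def value(cdp, policy, n=2000):
    """Monte Carlo estimate of E[r1 + ... + rH | pi]."""
    return mean(episode_return(cdp, policy) for _ in range(n))

print(value(ToyCDP(), lambda x: 1))          # ~3.6: always head to state 1
```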

Value-based PAC-RL in CDPs
● Input: a function class F which contains Q*.
● Output: a policy π such that, with probability ≥ 1 − δ, Vπ* − Vπ ≤ ε after acquiring poly(|A|, H, log|F|, 1/ε, 1/δ) trajectories.
● Realizability alone is not enough: an additional condition is needed, otherwise an exponential lower bound applies. [1]

Bellman rank
● Average Bellman error of a candidate value function f when contexts are generated by a roll-in policy πf':
E(f', f, h) = E[ f(xh, ah) − rh − f(xh+1, πf(xh+1)) ],
where a1, …, ah−1 are chosen by πf', ah = πf(xh), and πf is the greedy policy of f.
● Bellman rank M: the rank of the |F| × |F| average Bellman error matrices [E(f', f, h)], with rows indexed by roll-in policies πf' and columns by candidate value functions f, maximized over h = 1, …, H. (A numerical check for the tabular case follows.)
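As a sanity check on the definition, the following sketch computes the average Bellman error matrices exactly (no sampling) for a small random tabular MDP and verifies that their rank never exceeds the number of states, matching the first row of the table below. The MDP sizes and the random candidate class F are assumptions for illustration.

```python
# Exact Bellman error matrices for a random tabular MDP; their rank is
# at most S because each factors through the S-dimensional state space.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4                                  # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))         # P[s, a] = next-state dist.
R = rng.uniform(size=(S, A))                       # deterministic rewards
d1 = np.ones(S) / S                                # initial state distribution

Q = np.zeros((H + 1, S, A))                        # Q*[h], zero after step H
for h in range(H - 1, -1, -1):
    Q[h] = R + P @ Q[h + 1].max(axis=1)

# Candidate class F: Q* plus random functions (all zero after step H).
F = [Q] + [np.concatenate([rng.uniform(size=(H, S, A)), np.zeros((1, S, A))])
           for _ in range(6)]

def greedy(f, h):                                  # pi_f at step h
    return f[h].argmax(axis=1)

def state_dist(f_roll, h):                         # law of x_h under pi_{f'}
    d = d1
    for t in range(h):
        a = greedy(f_roll, t)
        d = sum(d[s] * P[s, a[s]] for s in range(S))
    return d

for h in range(H):
    E = np.zeros((len(F), len(F)))                 # rows f', columns f
    for i, f_roll in enumerate(F):
        d = state_dist(f_roll, h)
        for j, f in enumerate(F):
            a = greedy(f, h)
            v_next = f[h + 1].max(axis=1)          # f(x_{h+1}, pi_f(x_{h+1}))
            xi = np.array([f[h, s, a[s]] - R[s, a[s]] - P[s, a[s]] @ v_next
                           for s in range(S)])     # Bellman error per state
            E[i, j] = d @ xi                       # <state dist., error per state>
    print(h, np.linalg.matrix_rank(E, tol=1e-8))   # always <= S = 3
```

The Q* column of each matrix comes out identically zero, and every printed rank is at most S, as the definition predicts.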

RL problems with low Bellman rank
● Tabular MDP (context = state): Bellman rank ≤ # states. PAC learning known (e.g., [2]).
● Large MDP with low-rank transition matrix (context = state, driven by a hidden factor): Bellman rank ≤ rank of transition matrix. New.
● Large MDP with Q*-irrelevant abstraction (context = abstract state): Bellman rank ≤ poly(# abstract states, # actions). Known [3].
● POMDP with rich observations and reactive value function (context = current obs., generated by a hidden state): Bellman rank ≤ # hidden states. Extends [1].
● PSRs with rich observations and reactive value function (context = current obs.): Bellman rank ≤ poly(system dim, # actions). New.
○ Obtained by expressing the Bellman error matrix using a submatrix of the System Dynamics Matrix, which is naturally low-rank for PSRs (histories = all (h−1)-long sequences, tests = length-2 sequences).
● Linear Quadratic Regulators (context = state): Bellman rank ≤ poly(state space dim, action space dim). Known [4].
○ Needs the policy class + state-value function class representation (see Extensions).
○ Crucially depends on the choice of function classes: linear policies + quadratic value functions.
○ The algorithm does not apply as-is due to the continuous action space.

OLIVE (Optimism-Led Iterative Value-function Elimination)

Simplified algorithm (sketched in code below):
● Generate trajectories using πf'.
● Eliminate all f with non-zero average Bellman error.
● Choose a new πf' optimistically: f' is the maximizer of the predicted value E[f(x1, πf(x1))] among the surviving functions.
● Repeat until the optimistic f' itself has zero average Bellman error at every level h; then πf' is optimal.
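A minimal sketch of this idealized loop for a finite class F. The two oracles (exact predicted value and exact average Bellman error) are hypothetical stand-ins for the Monte Carlo estimates the actual algorithm uses, and all statistical error handling is omitted.

```python
# Idealized OLIVE over a finite class F, with exact (hypothetical) oracles:
#   predicted_value(f)         -> E[f(x1, pi_f(x1))]
#   bellman_error(f_opt, f, h) -> E(f_opt, f, h), the average Bellman error
def olive(F, H, predicted_value, bellman_error, tol=1e-12):
    surviving = list(F)
    while surviving:
        # Optimism: pick the surviving f with the highest predicted value.
        f_opt = max(surviving, key=predicted_value)
        worst = lambda f: max(abs(bellman_error(f_opt, f, h)) for h in range(H))
        if worst(f_opt) <= tol:
            return f_opt   # optimist is Bellman-consistent, hence optimal
        # Eliminate every f with non-zero error under the roll-in pi_{f_opt}.
        surviving = [f for f in surviving if worst(f) <= tol]
    raise RuntimeError("class exhausted: Q* was not in F")
```

Because Q* has zero average Bellman error under every roll-in policy, it is never eliminated and the loop always has a candidate to pick; the proof sketch below bounds how many times the elimination step can fire.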

Proof Sketch (assuming no statistical errors)

Full matrix view
● For each level h, collect the average Bellman errors into the |F| × |F| matrix defined above (rows = roll-in policies, columns = candidate value functions).
● Q* has 0 Bellman error on all roll-in policies, i.e., its column is all 0's, so it is never eliminated.
● Sample-efficient to evaluate a row at a time: generate trajectories using πf' until step h, then take a random action and importance-weight (estimator sketched below).
● Suffices to find a row that contains a non-zero entry in the surviving columns; optimism finds a row with a non-zero diagonal entry (for some h).
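A sketch of that row estimator, for finite actions. Here `rollin` is a hypothetical helper that runs πf' to step h and then takes a uniformly random action, and each f is assumed callable as f(x, a) with greedy policy f.policy(x); none of these interfaces are fixed by the paper.

```python
# One row of the Bellman error matrix from a single batch of trajectories:
# importance weighting by |A| corrects for the uniformly random action a_h.
def estimate_row(surviving, rollin, n_actions, n=10_000):
    """Unbiased estimates of E(f', f, h) for every surviving f, where f' is
    fixed by the (hypothetical) rollin() helper returning (x_h, a_h, r_h,
    x_next), with x_next = None at the end of the episode."""
    est = {f: 0.0 for f in surviving}
    for _ in range(n):
        x, a, r, x_next = rollin()
        for f in surviving:
            if f.policy(x) == a:               # indicator that pi_f agrees
                td = f(x, a) - r - (f(x_next, f.policy(x_next))
                                    if x_next is not None else 0.0)
                est[f] += n_actions * td / n   # importance weight |A|
    return est
```

Note that one batch of trajectories prices an entire row at once, which is what makes the per-iteration data cost independent of |F| beyond the log|F| union bound.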

Factored matrix view
● In all of the examples above, each entry factors into an inner product of M-dimensional vectors: E(f', f, h) = ⟨νf', ξf⟩. In a tabular MDP, νf' is the state distribution at step h induced by πf' and ξf is the Bellman error of f on each state (worked out below).
● Hence the Bellman error matrix is a product of an |F| × M matrix and an M × |F| matrix, so its rank is at most M.

Analysis of iteration complexity
● If the roll-in distribution vectors νf' chosen across iterations are linearly independent, the number of iterations (for each h) is at most the Bellman rank M.
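For the tabular case the factorization can be written out explicitly; this worked instance (notation as in the Bellman rank definition above) is why the first row of the examples table reads "Bellman rank ≤ # states":

```latex
\mathcal{E}(f', f, h)
  = \sum_{s} \underbrace{\Pr[x_h = s \mid \pi_{f'}]}_{\nu_{f'}(s)}
    \underbrace{\Big( f(s, \pi_f(s))
      - \mathbb{E}\big[ r_h + f(x_{h+1}, \pi_f(x_{h+1})) \mid s, \pi_f(s) \big] \Big)}_{\xi_f(s)}
  = \langle \nu_{f'}, \xi_f \rangle
```

The matrix [E(f', f, h)] is therefore the product of an |F| × |S| matrix of roll-in state distributions and an |S| × |F| matrix of per-state Bellman errors, so its rank is at most |S|.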

Geometric view (Bellman rank = 2)
● Each elimination step imposes the constraint |⟨νf', ξf⟩| ≈ 0 on the surviving functions, so the surviving ξf vectors are confined to an ellipsoid that shrinks as roll-in directions accumulate.

Analysis that considers statistical errors
● With finite data, eliminate only the functions whose estimated average Bellman error is large; surviving functions then satisfy |⟨νf', ξf⟩| ≤ O(ε) for every νf' used so far.
● The optimistic choice has a large inner product with the newly added νf' ⇒ significant reduction in ellipsoid volume [Todd '82], so the number of iterations remains polynomial in M.
● Sample complexity bound: poly(M, H, |A|, log|F|, 1/ε, log(1/δ)) trajectories, where M is the Bellman rank.

Extensions
● Can use a doubling trick to guess an unknown Bellman rank.
● Can compete with functions that have small non-zero Bellman errors.
● Can work with a policy class + V-value function class (as opposed to Q).
○ Compete with the best (policy, V-value function) pair that respects the Bellman equation for policy evaluation.
● Can accommodate infinite classes with bounded statistical complexity.
● Can handle approximately low-rank Bellman error matrices.

References
[1] Krishnamurthy, Agarwal, and Langford. PAC reinforcement learning with rich observations. NIPS 2016.
[2] Kearns and Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
[3] Li. A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers University, 2009.
[4] Osband and Van Roy. Model-based reinforcement learning and the eluder dimension. NIPS 2014.
