Learning in Sequential Decision Problems

Peter Bartlett
Mathematical Sciences, Queensland University of Technology
Computer Science and Statistics, University of California at Berkeley

AusDM November 28, 2014


A Decision Problem

Order news stories by popularity.
Multiclass classification.
Incur loss when a low-ranked story is chosen.
Aim to minimize expected loss.

Decision Problems

Decision Problem    Approach                    Aim
Finite decision     estimate expected loss      close to optimal

A Decision Problem: Classification

Aim for small relative expected loss: close to best in some comparison class of prediction rules.

Decision Problems

Decision Problem    Approach                       Aim
Finite decision     estimate expected loss         close to optimal
Classification      e.g., logistic regression      close to best linear classifier

A Decision Problem: Bandit Classification

But we do not get complete information: what would the user do if we offered a different recommendation?
Sequential decision problem: current decisions affect later performance.
Exploration versus exploitation.

Decision Problems

Decision Problem    Approach                               Aim
Finite decision     estimate expected loss                 close to optimal
Classification      e.g., logistic regression              close to best linear classifier
Contextual bandit   e.g., reduction to classification      close to best in set of classifiers

A Decision Problem: Maximizing Value

State of user evolves.
Sequential decision problem: current decisions affect our knowledge and the subsequent state.
Exploration versus exploitation.

Decision Problems

Decision Problem          Approach                               Aim
Finite decision           estimate expected loss                 close to optimal
Classification            e.g., logistic regression              close to best linear classifier
Contextual bandit         e.g., reduction to classification      close to best in set of classifiers
Markov decision process                                          close to best in set of policies

Sequential Decision Problems: Key Ideas

1. Scaling back our ambitions: performance guarantees relative to a comparison class.
2. Two directions for Markov Decision Processes (MDPs):
   Large-scale policy design for MDPs.
   Learning changing MDPs with full information.

Markov Decision Processes

MDP: Managing Threatened Species
For t = 1, 2, . . . :
  1. See state X_t (of ecosystem).
  2. Play an action A_t (intervention, e.g., anti-poaching patrols).
  3. Incur loss ℓ(X_t, A_t) ($, extinction).
  4. State evolves to X_{t+1} ∼ P_{X_t, A_t}.

MDP: Web Customer Interactions
For t = 1, 2, . . . :
  1. See state X_t (of customer interaction).
  2. Play an action A_t (offer/advertisement).
  3. Incur loss ℓ(X_t, A_t) (missed revenue).

Transition matrix: P : X × A → ∆(X).
Policy: π : X → ∆(A).
Stationary distribution: µ_π. Average loss: $\mu_\pi^\top \ell$.

Performance Measure: Regret
  $R_T = \mathbb{E}\sum_{t=1}^{T} \ell(X_t, A_t) \;-\; \min_\pi \mathbb{E}\sum_{t=1}^{T} \ell(X_t^\pi, \pi(X_t^\pi))$.

Performance Measure: Excess Average Loss
  $\mu_\pi^\top \ell \;-\; \min_{\pi'} \mu_{\pi'}^\top \ell$.
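To make the average-loss objective concrete, here is a minimal NumPy sketch (not from the talk; the transition probabilities, losses, and policies are made-up illustrative values) that builds a tiny two-state, two-action MDP, forms the state transition matrix induced by a fixed policy π : X → ∆(A), solves for its stationary distribution µ_π, and evaluates the average loss µ_π^⊤ℓ.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions (all numbers are illustrative).
# P[x, a, x'] = probability of moving to state x' after action a in state x.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
loss = np.array([[0.0, 1.0],   # loss[x, a]
                 [2.0, 0.5]])

def stationary_average_loss(pi):
    """pi[x, a] = probability of action a in state x; returns (mu_pi, average loss)."""
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum('xa,xay->xy', pi, P)
    # Stationary distribution: left eigenvector of P_pi with eigenvalue 1.
    vals, vecs = np.linalg.eig(P_pi.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    mu = mu / mu.sum()
    # Average loss: sum over (x, a) of mu(x) * pi(a|x) * loss(x, a).
    avg_loss = np.einsum('x,xa,xa->', mu, pi, loss)
    return mu, avg_loss

# Compare two deterministic policies.
pi_a = np.array([[1.0, 0.0], [1.0, 0.0]])  # always action 0
pi_b = np.array([[1.0, 0.0], [0.0, 1.0]])  # action 0 in state 0, action 1 in state 1
for name, pi in [('always a0', pi_a), ('mixed', pi_b)]:
    mu, avg = stationary_average_loss(pi)
    print(name, 'stationary dist:', np.round(mu, 3), 'average loss:', round(avg, 3))
```

The excess average loss of a policy is then just its average loss minus the smallest average loss over the comparison class.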

Large-Scale Sequential Decision Problems

Large MDP Problems: When the state space X is large, we must scale back the ambition of optimal performance:
  In comparison to a restricted family of policies Π (e.g., linear value function approximation).
  Want a strategy that competes with the best policy in Π.

Outline

1. Large-Scale Policy Design
   Compete with a restricted family of policies Π:
   Linearly parameterized policies.
   Stochastic gradient convex optimization.
   Competitive with policies near the approximating class.
   Without knowledge of optimal policy.
   Simulation results: queueing, crowdsourcing.

2. Learning Changing Dynamics
   Changing MDP; complete information.
   Exponential weights strategy.
   Competitive with small comparison class Π.
   Computationally efficient if Π has polynomial size.
   Hard for shortest path problems.

Linear Subspace of Stationary Distributions

Large-scale policy design (with Yasin Abbasi-Yadkori and Alan Malek, ICML 2014)

Stationary distributions are dual to value functions.
Consider a class of policies defined by a feature matrix Φ, a distribution µ0, and parameters θ:
  $\pi_\theta(a|x) = \dfrac{[\mu_0(x,a) + \Phi_{(x,a),:}\,\theta]_+}{\sum_{a'}[\mu_0(x,a') + \Phi_{(x,a'),:}\,\theta]_+}$.
Let µ_θ denote the stationary distribution of policy π_θ.
Find $\hat\theta$ such that $\mu_{\hat\theta}^\top \ell \le \min_{\theta\in\Theta} \mu_\theta^\top \ell + \epsilon$.
Large-scale policy design: independent of the size of X.
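As a concrete illustration of this policy class, the sketch below (purely illustrative: small tabular X and A, a random feature matrix Φ, a uniform µ0, and a uniform fallback when all clipped scores are zero are all assumptions, not details from the talk) computes π_θ(·|x) from µ0, Φ, and θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 3, 4                    # illustrative sizes
Phi = rng.normal(size=(n_states * n_actions, d))    # feature matrix, rows indexed by (x, a)
mu0 = np.full(n_states * n_actions, 1.0 / (n_states * n_actions))  # base distribution

def policy(theta, x):
    """pi_theta(.|x): clip mu0(x,a) + Phi_{(x,a),:} theta at zero, then normalize over actions."""
    rows = x * n_actions + np.arange(n_actions)     # row indices of (x, a) for each action a
    scores = np.maximum(mu0[rows] + Phi[rows] @ theta, 0.0)
    z = scores.sum()
    if z == 0.0:                                    # degenerate case: fall back to uniform
        return np.full(n_actions, 1.0 / n_actions)  # (this handling is an assumption)
    return scores / z

theta = rng.normal(size=d)
print(policy(theta, x=2))   # a probability distribution over the 3 actions
```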

Approach: a Reduction to Convex Optimization

Define a constraint violation function
  $V(\theta) = \underbrace{\big\|[\mu_0 + \Phi\theta]_-\big\|_1}_{\text{prob. dist.}} + \underbrace{\big\|(P - B)^\top(\mu_0 + \Phi\theta)\big\|_1}_{\text{stationarity}}$
and consider the convex cost function
  $c(\theta) = \ell^\top(\mu_0 + \Phi\theta) + \alpha V(\theta)$.

Stochastic gradient: $\theta_{t+1} = \theta_t - \eta\, g_t(\theta_t)$, $\hat\theta_T = \frac{1}{T}\sum_{t=1}^{T}\theta_t$,
. . . with cheap, unbiased stochastic subgradient estimates:
  $g_t(\theta) = \ell^\top\Phi \;-\; \alpha\,\dfrac{\Phi_{(x_t,a_t),:}}{q_1(x_t,a_t)}\, I\{\mu_0(x_t,a_t) + \Phi_{(x_t,a_t),:}\theta < 0\} \;+\; \alpha\,\dfrac{(P-B)^\top_{:,x'_t}\,\Phi}{q_2(x'_t)}\,\mathrm{sign}\big((P-B)^\top_{:,x'_t}\,\Phi\,\theta\big)$.
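The sketch below spells out V(θ) and c(θ) on a toy tabular MDP and runs plain subgradient descent with exact subgradients; the talk instead uses the cheap sampled estimates g_t above (with sampling distributions q_1, q_2), so the full matrices P and B are never touched. Reading B as the matrix with B_{(x,a),x'} = I{x = x'}, so that (P − B)^⊤µ = 0 encodes stationarity, is an assumption consistent with the constraint labels on the slide; all sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nA, d = 4, 2, 3                          # illustrative sizes
nXA = nX * nA

# Random toy MDP: P[(x,a), x'] transition probabilities, loss vector ell over (x, a).
P = rng.random((nXA, nX)); P /= P.sum(axis=1, keepdims=True)
ell = rng.random(nXA)
B = np.repeat(np.eye(nX), nA, axis=0)        # B[(x,a), x'] = 1{x = x'}  (assumed form)
Phi = rng.normal(size=(nXA, d))
mu0 = np.full(nXA, 1.0 / nXA)
alpha = 10.0

def V(theta):
    u = mu0 + Phi @ theta
    return np.abs(np.minimum(u, 0.0)).sum() + np.abs((P - B).T @ u).sum()

def c(theta):
    return ell @ (mu0 + Phi @ theta) + alpha * V(theta)

def subgrad(theta):
    u = mu0 + Phi @ theta
    g_neg = -Phi.T @ (u < 0).astype(float)                 # from ||[u]_-||_1
    g_stat = Phi.T @ (P - B) @ np.sign((P - B).T @ u)      # from ||(P - B)^T u||_1
    return Phi.T @ ell + alpha * (g_neg + g_stat)

theta, eta = np.zeros(d), 0.01
for t in range(2000):
    theta -= eta / np.sqrt(t + 1) * subgrad(theta)         # decaying step size
print('surrogate cost c(theta):', round(c(theta), 4),
      'constraint violation V(theta):', round(V(theta), 4))
```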

Performance Bounds

Main Result
For $T = 1/\epsilon^4$ gradient estimates, with high probability (under a mixing assumption),
  $\mu_{\hat\theta_T}^\top \ell \;\le\; \min_{\theta\in\Theta}\left(\mu_\theta^\top \ell + \frac{V(\theta)}{\epsilon}\right) + O(\epsilon)$.

Competitive with all policies (stationary distributions) in the linear subspace (i.e., V(θ) = 0).
Competitive with other policies; the comparison is more favorable near some stationary distribution.
Previous results of this kind require knowledge about the optimal policy, or require that the comparison class Π contains a near-optimal policy.

Simulation Results: Queueing

(Rybko and Stolyar, 1992; de Farias and Van Roy, 2003a)

(image credit: http://alzatex.org/)

Outline

1. Large-Scale Policy Design
   Compete with a restricted family of policies Π:
   Linearly parameterized approximate stationary distributions.
   Linearly parameterized exponentially transformed value function.
   Stochastic gradient convex optimization.
   Competitive with policies in the approximating class.
   Simulation results: crowdsourcing.

Total Cost, Kullback-Leibler Penalty

Large-scale policy design (with Yasin Abbasi-Yadkori, Xi Chen and Alan Malek)

Total cost (assume the process almost surely hits an absorbing state with zero loss):
  $\mathbb{E}\sum_{t=1}^{\infty}\ell(X_t)$.
Parameterized value functions.
Convex cost function.
Stochastic gradient.

Simulation Results: Crowdsourcing

Classification task.
Crowdsource: $ for labels.
Fixed budget; minimize errors.
Bayesian model: binary labels, i.i.d. crowd; Y_i ∼ Bernoulli(p_i); p_i ∼ Beta.
State = posterior.

(image credits: http://www.technicalinfo.net/, http://cdns2.freepik.com/)
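A minimal sketch of the "state = posterior" idea, assuming the simplest version of the model above: each item's unknown label probability p_i has a Beta prior, each purchased crowd label is a Bernoulli(p_i) draw, and the state carried forward is just the Beta posterior parameters per item. The specific prior and vote sequence are made up.

```python
from dataclasses import dataclass

@dataclass
class ItemState:
    alpha: float = 1.0   # Beta posterior parameters (illustrative prior Beta(1, 1))
    beta: float = 1.0

    def update(self, label: int) -> None:
        # Bernoulli likelihood + Beta prior => Beta posterior (conjugacy).
        if label == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def prob_positive(self) -> float:
        # Posterior mean of p_i: current estimate of the positive-label rate.
        return self.alpha / (self.alpha + self.beta)

    def error_if_stopped(self) -> float:
        # Expected error if we stop buying labels now and predict the
        # majority class under the posterior mean.
        p = self.prob_positive()
        return min(p, 1 - p)

state = ItemState()
for vote in [1, 1, 0, 1]:          # made-up crowd labels for one item
    state.update(vote)
print(state.prob_positive(), state.error_if_stopped())
```

The sequential decision is then which item (if any) to buy the next label for, given the budget and the current posterior state.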

Outline

1. Large-Scale Policy Design
   Compete with a restricted family of policies Π:
   Linearly parameterized policies.
   Stochastic gradient convex optimization.
   Competitive with policies near the approximating class.
   Without knowledge of optimal policy.
   Simulation results: queueing, crowdsourcing.

2. Learning Changing Dynamics
   Changing MDP; complete information.
   Exponential weights strategy.
   Competitive with small comparison class Π.
   Computationally efficient if Π has polynomial size.
   Hard for shortest path problems.

Changing Dynamics

(with Yasin Abbasi-Yadkori, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari, NIPS 2013)

Full information: observe P_t, ℓ_t after round t. Arbitrary, even adversarial.
Consider a comparison class: Π ⊂ {π | π : X → A}.
Optimal policy: $\pi^* = \arg\min_{\pi\in\Pi} \sum_{t=1}^{T} \ell_t(x_t^\pi, \pi(x_t^\pi))$.
Regret: $R_T = \sum_{t=1}^{T} \ell_t(x_t, a_t) - \sum_{t=1}^{T} \ell_t(x_t^{\pi^*}, \pi^*(x_t^{\pi^*}))$.
Aim for low regret: $R_T/T \to 0$.
Computationally efficient low-regret strategies?
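Note that the regret compares the learner's loss on its own trajectory with each comparator policy's loss on the trajectory that policy would have generated under the same changing dynamics. This hypothetical sketch (made-up sizes, random dynamics and losses) computes the comparator side of that expression for a handful of deterministic policies.

```python
import numpy as np

def comparator_losses(P_seq, loss_seq, policies, x0=0, rng=None):
    """Run each deterministic policy pi on its *own* trajectory under the changing
    dynamics P_t, accumulating its losses ell_t(x_t^pi, pi(x_t^pi))."""
    rng = rng or np.random.default_rng(0)
    totals = np.zeros(len(policies))
    states = np.full(len(policies), x0)
    for P_t, ell_t in zip(P_seq, loss_seq):
        for i, pi in enumerate(policies):
            a = pi[states[i]]
            totals[i] += ell_t[states[i], a]
            states[i] = rng.choice(P_t.shape[-1], p=P_t[states[i], a])
        # The learner (not shown) would also act here, then observe (P_t, ell_t)
        # in full -- that is the "full information" setting.
    return totals

# Tiny illustrative instance: 2 states, 2 actions, T = 100 rounds of changing dynamics.
rng = np.random.default_rng(3)
T, nX, nA = 100, 2, 2
P_seq = rng.random((T, nX, nA, nX)); P_seq /= P_seq.sum(axis=-1, keepdims=True)
loss_seq = rng.random((T, nX, nA))
policies = [np.array([0, 0]), np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
print('best comparator total loss:', comparator_losses(P_seq, loss_seq, policies).min())
```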

Regret Bound

Main Result
There is a strategy that achieves
  $\mathbb{E}[R_T] \le (4 + 2\tau^2)\left(\sqrt{T\log|\Pi|} + \log|\Pi|\right)$
(under a τ-mixing assumption).

Exponential weights: Strategy for a repeated game: choose action a ∈ A with probability proportional to exp(−η × total loss a has incurred so far).

Regret (total loss versus best in hindsight) for T rounds: $O\big(\sqrt{T\log|A|}\big)$.

Long history. Unreasonably broadly applicable:
  Zero-sum games.        Shortest path problems.
  AdaBoost.              Fast max-flow.
  Bandit problems.       Fast graph sparsification.
  Linear programming.    Model of evolution.
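For reference, here is a generic exponential-weights (Hedge) sketch for the repeated-game setting described above, with losses assumed in [0, 1]. The learning rate η = √(8 ln|A| / T) is one standard choice, not something specified in the talk.

```python
import numpy as np

def exponential_weights(loss_matrix, eta=None, rng=None):
    """loss_matrix[t, a] = loss of action a at round t, assumed in [0, 1].
    Plays a ~ p_t with p_t(a) proportional to exp(-eta * cumulative loss of a)."""
    rng = rng or np.random.default_rng(0)
    T, n_actions = loss_matrix.shape
    eta = eta if eta is not None else np.sqrt(8 * np.log(n_actions) / T)  # one standard choice
    cum_loss = np.zeros(n_actions)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))   # shift for numerical stability
        p = w / w.sum()
        a = rng.choice(n_actions, p=p)
        total += loss_matrix[t, a]
        cum_loss += loss_matrix[t]                       # full information: see every loss
    regret = total - cum_loss.min()                      # vs best single action in hindsight
    return total, regret

losses = np.random.default_rng(1).random((1000, 5))      # illustrative loss sequence
print(exponential_weights(losses))
```

The strategy on the next slide runs this update over policies π ∈ Π rather than over individual actions, with rare, random policy switches.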

Strategy:

For all policies π ∈ Π, $w_{\pi,0} = 1$. $W_t = \sum_{\pi\in\Pi} w_{\pi,t}$, $p_{\pi,t} = w_{\pi,t-1}/W_{t-1}$.
for t := 1, 2, . . . do
  With probability $\beta_t = w_{\pi_{t-1},t-1}/w_{\pi_{t-1},t-2}$, set $\pi_t = \pi_{t-1}$; otherwise $\pi_t \sim p_{\cdot,t}$.
  Choose action $a_t \sim \pi_t(\cdot|x_t)$.
  Observe dynamics $P_t$ and loss $\ell_t$. Suffer $\ell_t(x_t, a_t)$.
  For all policies π, $w_{\pi,t} = w_{\pi,t-1}\exp\big(-\eta\,\mathbb{E}[\ell_t(x_t^\pi, \pi)]\big)$.
end for

Exponential weights on Π. Rare, random changes to $\pi_t$.

Regret Bound

Main Result
There is a strategy that achieves
  $\mathbb{E}[R_T] \le (4 + 2\tau^2)\left(\sqrt{T\log|\Pi|} + \log|\Pi|\right)$
(under a τ-mixing assumption).

Adversarial dynamics and loss functions.
Large state and action spaces.
$\mathbb{E}[R_T]/T \to 0$; $T = \omega(\log|\Pi|)$ suffices.
Computationally efficient as long as |Π| is polynomial.
No computationally efficient algorithm in general.

Shortest Path Problem

Special case of MDP: node = state; action = edge; loss = weight.

(image credits: http://www.google.com/, http://www.meondirect.com/)

Computational Efficiency

Hardness Result
Suppose there is a strategy for the online adversarial shortest path problem that:
  1. runs in time poly(n, T), and
  2. has regret $R_T = O(\mathrm{poly}(n)\,T^{1-\delta})$ for some constant δ > 0.
Then there is an efficient algorithm for online agnostic parity learning with sublinear regret.

Online Agnostic Parity Learning

Class of parity functions on $\{0,1\}^n$: $\mathrm{PARITIES} = \{\mathrm{PAR}_S \mid S \subset [n],\ \mathrm{PAR}_S(x) = \oplus_{i\in S}\, x_i\}$.
Learning problem: given $x_t \in \{0,1\}^n$, the learner predicts $\hat y_t \in \{0,1\}$, observes the true label $y_t$, and suffers loss $I\{\hat y_t \ne y_t\}$.
$R_T = \sum_{t=1}^{T} I\{\hat y_t \ne y_t\} - \min_{\mathrm{PAR}_S\in\mathrm{PARITIES}} \sum_{t=1}^{T} I\{\mathrm{PAR}_S(x_t) \ne y_t\}$.
Is there an efficient (time polynomial in n, T) learning algorithm with sublinear regret ($R_T = O(\mathrm{poly}(n)\,T^{1-\delta})$ for some δ > 0)?
Very well studied. Widely believed to be hard: used for cryptographic schemes.
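To make the comparison class concrete, this sketch (purely illustrative) evaluates PAR_S(x) and computes the regret of a prediction sequence against the best parity in hindsight by brute force over all 2^n subsets, which is only feasible for tiny n. The point of the hardness result is precisely that no efficient strategy is believed to achieve sublinear regret here.

```python
from itertools import combinations
import numpy as np

def parity(S, x):
    """PAR_S(x) = XOR of the bits of x indexed by S."""
    return int(sum(x[i] for i in S) % 2)

def regret_vs_best_parity(xs, ys, preds):
    n = len(xs[0])
    learner_loss = sum(int(p != y) for p, y in zip(preds, ys))
    best_loss = min(
        sum(int(parity(S, x) != y) for x, y in zip(xs, ys))
        for r in range(n + 1) for S in combinations(range(n), r)   # brute force over subsets
    )
    return learner_loss - best_loss

rng = np.random.default_rng(0)
n, T = 4, 50                                   # tiny, illustrative instance
xs = rng.integers(0, 2, size=(T, n)).tolist()
S_true = (0, 2)                                # hidden parity; labels flipped with prob 0.1
ys = [parity(S_true, x) ^ int(rng.random() < 0.1) for x in xs]
preds = rng.integers(0, 2, size=T).tolist()    # a learner that guesses at random
print('regret of random guessing:', regret_vs_best_parity(xs, ys, preds))
```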

Computational Efficiency

Hardness Result
Suppose there is a strategy for the online adversarial shortest path problem that:
  1. runs in time poly(n, T), and
  2. has regret $R_T = O(\mathrm{poly}(n)\,T^{1-\delta})$ for some constant δ > 0.
Then there is an efficient algorithm for online agnostic parity learning with sublinear regret.

Reduction

[Diagram: the adversary's example x = (1, 0, 1, 0, 1) ∈ {0,1}^5 and label y are converted (Conversion1) into an online shortest path instance on a layered graph with node pairs 1a/1b, . . . , 6a/6b and a terminal node Z, with edge weights built from y and 1 − y; the shortest path algorithm's chosen path is converted back (Conversion2) into a prediction ŷ_t ∈ {0, 1}.]

Online shortest path: Hard versus easy

Edges (dynamics)    Weights (costs)
Adversarial         Adversarial        As hard as noisy parity.
Stochastic          Adversarial        Efficient algorithm.
Adversarial         Stochastic         Efficient algorithm.

Sequential Decision Problems: Key Ideas

Scaling back our ambitions: performance guarantees relative to a comparison class.
Two directions for Markov Decision Processes (MDPs):

1. Large-Scale Policy Design
   Compete with a restricted family of policies Π: linearly parameterized policies.
   Stochastic gradient convex optimization.

2. Learning changing dynamics
   Compete with a restricted family of policies Π.
   Exponential weights strategy.
   Computationally efficient if Π has polynomial size.
   Hard for shortest path problems.

Acknowledgements: ARC, NSF.

Alan Malek, Yasin Abbasi-Yadkori, Varun Kanade, Yevgeny Seldin, Xi Chen, Csaba Szepesvari.
