Yevgeny Seldin University of Copenhagen

ECML-PKDD-2015 Tutorial

What is Online Learning? • Subfield of Machine Learning studying problems involving interaction with an environment

Examples
• Investment in the stock market
• Online advertising/personalization/routing/…
• Games
• Robotics
• …

How is Online Learning different from “batch” learning?
• Batch Learning: Collect Data → Analyze → Apply
• Online Learning: a closed loop of Act → Analyze → Get More Data → Act → …

When do we need Online Learning?
• Interactive learning
• “Adversarial” game-theoretic settings – no i.i.d. assumptions
• Large-scale data analysis

The Space of Online Learning Problems

The problems are organized along three axes (the same figure is built up over several slides):
• Feedback: full info → limited (bandit) feedback → partial monitoring.
  Examples: stock market (full info); medical treatments, advertising (bandit feedback); censored feedback, dynamic pricing (partial monitoring).
• Environmental resistance: stochastic (i.i.d.) → adversarial.
  Examples: medical treatments, weather prediction (i.i.d.); spam filtering, stock market (adversarial).
• Structural complexity: stateless → side info (state) → Markov Decision Processes (MDPs).
  Example: medical records of different patients (side info) vs. subsequent treatments of the same patient (MDP).

Metaphor: the feedback axis is “the player”, environmental resistance is “the opponent”, and structural complexity is “the battle field”.

Part I: “classical” algorithms

On the map (feedback × environmental resistance × structural complexity):
• Prediction with expert advice: adversarial, stateless, full feedback
• Adversarial bandits: adversarial, stateless, bandit feedback
• Stochastic bandits: i.i.d., stateless, bandit feedback

Prediction with Expert Advice

Examples
• Experts = financial advisers
• Experts = different algorithms

Game Definition
For t = 1, 2, …:
1. Observe the advice \xi_t^1, …, \xi_t^K of K experts
2. Pick an expert A_t to follow
3. Observe the losses \ell_t^1, …, \ell_t^K and suffer \ell_t^{A_t}

(The expert advice and the expert losses each form a K-row matrix unfolding over time.)

Performance Measure: Regret
R_T = \sum_{t=1}^T \ell_t^{A_t} - \min_a \sum_{t=1}^T \ell_t^a

Prediction with Expert Advice – Simplification

The advice matrix can be crossed out of the picture: only the loss matrix matters for the analysis.

Game Definition
For t = 1, 2, …:
1. Pick a row (expert) A_t
2. Observe \ell_t^1, …, \ell_t^K and suffer \ell_t^{A_t}

Performance Measure: Regret
R_T = \sum_{t=1}^T \ell_t^{A_t} - \min_a \sum_{t=1}^T \ell_t^a

Prediction with Expert Advice

Game Definition
For t = 1, 2, …:
1. Pick a row (expert) A_t
2. Observe \ell_t^1, …, \ell_t^K and suffer \ell_t^{A_t}

Performance Measure: Regret
R_T = \sum_{t=1}^T \ell_t^{A_t} - \min_a \sum_{t=1}^T \ell_t^a

Assumptions
• The losses \ell_t^a are in [0, 1] and selected arbitrarily (adversarially)

Learning Goal
R_T = o(T)

Why compare to the best row and not to the best path? Against an adversary no algorithm can compete with the best sequence of experts (the adversary can always make the chosen expert lose), so the best fixed expert is the strongest meaningful benchmark.

Prediction with Expert Advice

• The algorithm needs protection against the adversary
• The protection is randomization

Assumptions
• The adversary may know the algorithm, but not its random bits (“oblivious setting”: the losses \ell_t^a are written down before the game starts)

Algorithms

The Hedge Algorithm (a.k.a. Exponential Weights)
[Vovk, 1990; Littlestone & Warmuth, 1994; …]

Input: Learning rates \eta_1 \ge \eta_2 \ge \dots > 0
\forall a: \hat{L}_0(a) = 0
for t = 1, 2, … do
    \forall a: p_t(a) = e^{-\eta_t \hat{L}_{t-1}(a)} / \sum_{a'} e^{-\eta_t \hat{L}_{t-1}(a')}
    Sample A_t according to p_t and play it
    Observe \ell_t^1, …, \ell_t^K
    \forall a: \hat{L}_t(a) = \hat{L}_{t-1}(a) + \ell_t^a
end
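For concreteness, here is a minimal Python sketch of Hedge as stated above. The environment interface `get_losses` is a hypothetical stand-in (it must return the full loss vector, with entries in [0, 1]), and the anytime schedule η_t = √(ln K / t) is one common choice rather than the tuned constant rate used in the analysis below.

```python
import numpy as np

def hedge(get_losses, K, T, rng=np.random.default_rng(0)):
    """Hedge / exponential weights over K experts, full-information feedback."""
    L_hat = np.zeros(K)                  # cumulative expert losses
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(K) / t)             # one possible anytime schedule
        w = np.exp(-eta * (L_hat - L_hat.min())) # shift for numerical stability
        p = w / w.sum()
        A = rng.choice(K, p=p)                   # sample an expert and follow it
        losses = get_losses(t)                   # observe all K losses
        total_loss += losses[A]
        L_hat += losses                          # update every expert
    return total_loss, L_hat
```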

Analysis (simplified for known T and constant η)

Reminder
p_t(a) = e^{-\eta \hat{L}_{t-1}(a)} / \sum_{a'} e^{-\eta \hat{L}_{t-1}(a')}, \qquad \hat{L}_t(a) = \hat{L}_{t-1}(a) + \ell_t^a

Useful inequalities
• For x \le 0: \; e^x \le 1 + x + \tfrac{1}{2}x^2
• For any x: \; 1 + x \le e^x

Calculation
Let W_t = \sum_a e^{-\eta \hat{L}_t(a)} = \sum_a e^{-\eta \ell_t^a} e^{-\eta \hat{L}_{t-1}(a)}. Then

\frac{W_t}{W_{t-1}} = \sum_a e^{-\eta \ell_t^a} \frac{e^{-\eta \hat{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta \hat{L}_{t-1}(a')}}
 = \sum_a e^{-\eta \ell_t^a} p_t(a)
 \le \sum_a \Big(1 - \eta \ell_t^a + \tfrac{1}{2}(\eta \ell_t^a)^2\Big) p_t(a)
 = 1 - \eta \sum_a \ell_t^a p_t(a) + \frac{\eta^2}{2} \sum_a (\ell_t^a)^2 p_t(a)
 \le e^{-\eta \sum_a \ell_t^a p_t(a) + \frac{\eta^2}{2} \sum_a (\ell_t^a)^2 p_t(a)}

Analysis (simplified for known T and constant η)

From the last slide: W_t = \sum_a e^{-\eta \hat{L}_t(a)} and
\frac{W_t}{W_{t-1}} \le e^{-\eta \sum_a \ell_t^a p_t(a) + \frac{\eta^2}{2} \sum_a (\ell_t^a)^2 p_t(a)}

Calculation continued: taking the product over t = 1, …, T and logarithms,
\ln \frac{W_T}{W_0} \le -\eta \sum_{t=1}^T \sum_a \ell_t^a p_t(a) + \frac{\eta^2}{2} \sum_{t=1}^T \sum_a (\ell_t^a)^2 p_t(a)

On the other hand, W_0 = K, so
\ln \frac{W_T}{W_0} = \ln \frac{\sum_a e^{-\eta \hat{L}_T(a)}}{K} \ge \ln \frac{\max_a e^{-\eta \hat{L}_T(a)}}{K} = -\eta \min_a \hat{L}_T(a) - \ln K

Combining the two bounds:
-\eta \min_a \hat{L}_T(a) - \ln K \le -\eta \sum_{t=1}^T \sum_a \ell_t^a p_t(a) + \frac{\eta^2}{2} \sum_{t=1}^T \sum_a (\ell_t^a)^2 p_t(a)

Analysis (simplified for known T and constant η)

Calculation Summary
\sum_{t=1}^T \sum_a \ell_t^a p_t(a) - \min_a \hat{L}_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_a (\ell_t^a)^2 p_t(a)

Here \sum_a \ell_t^a p_t(a) = E[\ell_t^{A_t}], so the left-hand side is E[R_T]; and since (\ell_t^a)^2 \le 1, the double sum on the right is at most T. Hence

E[R_T] \le \frac{\ln K}{\eta} + \frac{\eta T}{2}

Minimize with respect to η: \eta = \sqrt{\frac{2 \ln K}{T}}

Final Result
E[R_T] \le \sqrt{2 T \ln K}
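The minimization over η in the last step is the standard balancing of the two terms; spelled out (my reconstruction of the omitted one-line calculation):

```latex
\[
\frac{d}{d\eta}\left(\frac{\ln K}{\eta} + \frac{\eta T}{2}\right)
  = -\frac{\ln K}{\eta^{2}} + \frac{T}{2} = 0
\;\Rightarrow\; \eta = \sqrt{\frac{2\ln K}{T}},
\qquad
\frac{\ln K}{\eta} + \frac{\eta T}{2}
  = \sqrt{\frac{T\ln K}{2}} + \sqrt{\frac{T\ln K}{2}}
  = \sqrt{2T\ln K}.
\]
```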

Lower bound (high-level idea)

Construction
The losses \ell_t^a are independent Bernoulli with bias 1/2.

Lemma
\lim_{K \to \infty, T \to \infty} \frac{T/2 - E\left[\min_a \hat{L}_T(a)\right]}{\sqrt{\frac{1}{2} T \ln K}} = 1

The numerator is exactly E[R_T]: any algorithm suffers T/2 in expectation, while the best expert in hindsight does better by about \sqrt{\frac{1}{2} T \ln K}.

Conclusion
E[R_T] = \Omega\left(\sqrt{T \ln K}\right)

Part I: “classical” algorithms (map recap) — next: adversarial bandits.

Adversarial Multiarmed Bandits

Game Definition
For t = 1, 2, …:
1. Play an action A_t
2. Observe and suffer only the loss \ell_t^{A_t} (the losses of the other actions remain hidden)

Performance Measure: Regret
R_T = \sum_{t=1}^T \ell_t^{A_t} - \min_a \sum_{t=1}^T \ell_t^a

Reminder: The Hedge Algorithm
As before: p_t(a) \propto e^{-\eta_t \hat{L}_{t-1}(a)}, sample A_t according to p_t, observe all losses \ell_t^1, …, \ell_t^K, and update \hat{L}_t(a) = \hat{L}_{t-1}(a) + \ell_t^a. In the bandit setting the full loss vector is not observed, so the update has to be replaced by an estimate.

The EXP3 Algorithm for Adversarial Bandits
[Auer et al., 2002; Bubeck, 2010]

Input: Learning rates \eta_1 \ge \eta_2 \ge \dots > 0
\forall a: \tilde{L}_0(a) = 0
for t = 1, 2, … do
    \forall a: p_t(a) = e^{-\eta_t \tilde{L}_{t-1}(a)} / \sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}
    Sample A_t according to p_t and play it
    Observe \ell_t^{A_t}
    \forall a: \tilde{\ell}_t^a = \frac{\ell_t^a \mathbf{1}\{A_t = a\}}{p_t(a)} = \begin{cases} \ell_t^a / p_t(a), & \text{if } A_t = a \\ 0, & \text{otherwise} \end{cases}
    \forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a
end

Importance-weighted sampling
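A minimal Python sketch of EXP3 as stated above; `get_loss` is a hypothetical environment interface returning the loss of the played arm, and the schedule η_t = √(ln K/(tK)) is the one quoted later in the deck.

```python
import numpy as np

def exp3(get_loss, K, T, rng=np.random.default_rng(0)):
    """EXP3 with importance-weighted loss estimates, bandit feedback."""
    L_tilde = np.zeros(K)                # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(K) / (t * K))
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        p = w / w.sum()
        A = rng.choice(K, p=p)
        loss = get_loss(t, A)            # only the played arm's loss is observed
        total_loss += loss
        L_tilde[A] += loss / p[A]        # unbiased importance-weighted update
    return total_loss, L_tilde
```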

Properties of Importance-Weighted Sampling

Notation
E_t[\cdot] = E[\cdot \mid \text{everything up to round } t]

Expectation (unbiasedness)
E_t\left[\tilde{\ell}_t^a\right] = E_t\left[\frac{\ell_t^a \mathbf{1}\{A_t=a\}}{p_t(a)}\right] = \frac{\ell_t^a \, E_t[\mathbf{1}\{A_t=a\}]}{p_t(a)} = \frac{\ell_t^a \, p_t(a)}{p_t(a)} = \ell_t^a

Second moment
E_t\left[\left(\tilde{\ell}_t^a\right)^2\right] = \frac{(\ell_t^a)^2 \, E_t[\mathbf{1}\{A_t=a\}]}{p_t(a)^2} = \frac{(\ell_t^a)^2}{p_t(a)} \le \frac{1}{p_t(a)},
\qquad \text{and therefore} \qquad
E_t\left[\sum_a p_t(a) \left(\tilde{\ell}_t^a\right)^2\right] \le K
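A quick numerical sanity check of the two identities above (a sketch with an arbitrary loss vector and sampling distribution of my choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
losses = rng.uniform(0, 1, K)                    # arbitrary losses in [0, 1]
p = np.array([0.4, 0.25, 0.15, 0.12, 0.08])
p = p / p.sum()                                  # arbitrary sampling distribution

n = 200_000
A = rng.choice(K, size=n, p=p)                   # many simulated draws of A_t
ell_tilde = np.zeros((n, K))
ell_tilde[np.arange(n), A] = losses[A] / p[A]    # importance-weighted estimates

print(np.allclose(ell_tilde.mean(axis=0), losses, atol=0.03))  # unbiasedness
second_moment = (ell_tilde ** 2).mean(axis=0)
print((p * second_moment).sum() <= K)            # sum_a p(a) E[(l~_t^a)^2] <= K
```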

Analysis (simplified for known T and constant η)

Following the calculation in the analysis of Hedge (with \tilde{\ell} in place of \ell):
\sum_{t=1}^T \sum_a \tilde{\ell}_t^a p_t(a) - \min_a \tilde{L}_T(a) \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^T \sum_a \left(\tilde{\ell}_t^a\right)^2 p_t(a)

Taking expectations on both sides and propagating them inside (using E[\min(\cdot)] \le \min E[\cdot]):
• E_t[\tilde{\ell}_t^a] = \ell_t^a, so the first term becomes E\left[\sum_t \ell_t^{A_t}\right]
• \min_a E[\tilde{L}_T(a)] = \min_a L_T(a)
• E_t\left[\sum_a (\tilde{\ell}_t^a)^2 p_t(a)\right] \le K, so the last term is at most \frac{\eta}{2} K T

Calculation summary
E[R_T] \le \frac{\ln K}{\eta} + \frac{\eta}{2} K T

Minimize with respect to η: \eta = \sqrt{\frac{2 \ln K}{K T}}

Final Result
E[R_T] \le \sqrt{2 K T \ln K}

• In comparison with full information we pay an extra factor of \sqrt{K} (an extra K under the square root)
• It is possible to eliminate the \ln K factor with more sophisticated algorithms (e.g., INF [Audibert & Bubeck, 2009])

Lower bound (high-level idea)

Construction: K + 1 games (K + 1 candidate loss matrices)
• 0-th game: all losses \ell_t^a are Bernoulli with bias 1/2
• i-th game: \ell_t^i is Bernoulli with bias 1/2 - \varepsilon, and for a \ne i the losses \ell_t^a are Bernoulli with bias 1/2 + \varepsilon

Claim
For small \varepsilon, the 0-th game is indistinguishable from the i-th game based on T observations.

For \varepsilon = \Theta\left(\sqrt{K/T}\right):
E[R_T] = \Omega\left(\sqrt{K T}\right)

Part I: “classical” algorithms (map recap) — next: stochastic bandits.

Stochastic Multiarmed Bandits

Game Definition
The losses \ell_t^a are drawn independently with E[\ell_t^a] = \mu(a).
For t = 1, 2, …:
1. Play an action A_t
2. Observe and suffer \ell_t^{A_t}

Notations
• N_t(a) – the number of times a was played up to round t
• Gaps: \Delta(a) = \mu(a) - \min_{a'} \mu(a')

Performance: Expected Regret
E[R_T] = E\left[\sum_{t=1}^T \ell_t^{A_t}\right] - T \min_a \mu(a) = \sum_a E[N_T(a)] \, \Delta(a)

Historical remark
Originally formulated with gains r_t^a = 1 - \ell_t^a

LCB1 (Lower Confidence Bound) Algorithm
Originally UCB1 (Upper Confidence Bound, stated for gains) [Auer et al., 2002]

Initialization: Play each arm once.
for t = 1, 2, … do
    Let \hat{L}_t(a) be the average loss of a up to round t
    Play A_t = \arg\min_a \Big(\underbrace{\hat{L}_t(a) - \sqrt{\tfrac{3 \ln t}{2 N_t(a)}}}_{\mathrm{LCB}_t(a)}\Big)
end

Hoeffding’s inequality (simplified)
Let X_1, …, X_N be i.i.d. with values in [0, 1] and E[X_i] = \mu. Then:
P\left\{\frac{1}{N}\sum_{i=1}^N X_i - \mu \ge \sqrt{\frac{3 \ln t}{2N}}\right\} \le \frac{1}{t^3}
\qquad \text{and} \qquad
P\left\{\mu - \frac{1}{N}\sum_{i=1}^N X_i \ge \sqrt{\frac{3 \ln t}{2N}}\right\} \le \frac{1}{t^3}
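For concreteness, a minimal Python sketch of LCB1 with the confidence width above; `get_loss` is a hypothetical environment interface returning an i.i.d. loss in [0, 1] for the chosen arm.

```python
import numpy as np

def lcb1(get_loss, K, T):
    """LCB1 (UCB1 stated for losses): optimism in the face of uncertainty."""
    N = np.zeros(K)          # number of pulls per arm
    S = np.zeros(K)          # cumulative observed loss per arm
    for a in range(K):       # initialization: play each arm once
        S[a] += get_loss(a + 1, a)
        N[a] += 1
    for t in range(K + 1, T + 1):
        lcb = S / N - np.sqrt(3.0 * np.log(t) / (2.0 * N))  # lower confidence bounds
        A = int(np.argmin(lcb))                             # play the optimistic arm
        S[A] += get_loss(t, A)
        N[A] += 1
    return S.sum(), N
```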

Key properties of LCB

Optimism in the face of uncertainty
\mathrm{LCB}_t(a) = \hat{L}_t(a) - \sqrt{\frac{3 \ln t}{2 N_t(a)}}

• E[\hat{L}_t(a)] = \mu(a)
• Warning: N_t(a) is a random variable, so Hoeffding cannot be applied directly; instead, take a union bound over the possible values of N_t(a). Let X_1, …, X_t be i.i.d. with E[X_s] = \mu(a) and \hat{\mu}_s = \frac{1}{s}\sum_{i=1}^s X_i. Then

P\{\mathrm{LCB}_t(a) \ge \mu(a)\}
 = P\left\{\hat{L}_t(a) - \sqrt{\tfrac{3\ln t}{2 N_t(a)}} \ge \mu(a)\right\}
 \le P\left\{\exists s \in \{1,\dots,t\}: \hat{\mu}_s(a) - \sqrt{\tfrac{3\ln t}{2s}} \ge \mu(a)\right\}
 \le \sum_{s=1}^t P\left\{\hat{\mu}_s(a) - \sqrt{\tfrac{3\ln t}{2s}} \ge \mu(a)\right\}
 \le t \cdot \frac{1}{t^3} = \frac{1}{t^2}

• Bottom line: the probability that \mathrm{LCB}_t(a) \ge \mu(a) is small

LCB1 Analysis Highlights (for two arms, a* and a)

• \Delta = \mu(a) - \mu(a^*)
• With probability \ge 1 - \frac{1}{t^2}: \mathrm{LCB}_t(a^*) \le \mu(a^*)
• With probability \ge 1 - \frac{1}{t^2}: \hat{L}_t(a) \ge \mu(a) - \sqrt{\frac{3\ln t}{2 N_t(a)}} (by a mirror calculation)…
• … and thus \mathrm{LCB}_t(a) = \hat{L}_t(a) - \sqrt{\frac{3\ln t}{2 N_t(a)}} \ge \mu(a) - 2\sqrt{\frac{3\ln t}{2 N_t(a)}}
• In expectation, the number of rounds on which either of the two confidence bounds fails is bounded by 2\sum_{s=1}^t \frac{1}{s^2} \le \frac{\pi^2}{3}
• Fix the time horizon T
• Once N_t(a) > \frac{6\ln T}{\Delta^2}, we have 2\sqrt{\frac{3\ln T}{2 N_t(a)}} < \Delta
• Thus, \mathrm{LCB}_t(a) > \mu(a) - \Delta = \mu(a^*) \ge \mathrm{LCB}_t(a^*), so arm a is no longer played
• Therefore, in expectation, arm a is played no more than \frac{6\ln T}{\Delta^2} + 1 + \frac{\pi^2}{3} times
• E[R_T] = \Delta\, E[N_T(a)] \le \frac{6\ln T}{\Delta} + \left(1 + \frac{\pi^2}{3}\right)\Delta

Lower bound [Lai & Robbins, 1985]

\liminf_{T\to\infty} \frac{E[R_T]}{\ln T} \ge \sum_{a: \Delta(a)>0} \frac{\Delta(a)}{K_{\inf}(\nu_a, \mu(a^*))},

where K_{\inf}(\nu_a, \mu(a^*)) is the minimal KL divergence between the loss distribution \nu_a of arm a and a suitable distribution with mean at most \mu(a^*) (i.e., a distribution under which arm a would look optimal).

Simplified
• K_{\inf}(\nu_a, \mu(a^*)) \ge 2\Delta(a)^2 (e.g., via Pinsker’s inequality)
• When \ell_t^a is Bernoulli with \mu(a) close to 1/2, K_{\inf}(\nu_a, \mu(a^*)) \approx 2\Delta(a)^2 and
E[R_T] = \Theta\left(\sum_{a:\Delta(a)>0} \frac{\ln T}{\Delta(a)}\right)

Other popular algorithms

KL-UCB
• Cappé, Garivier, Maillard, Munos, Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 2013
• Replaces Hoeffding’s inequality with a tighter KL concentration inequality
• Matches the lower bound

Thompson Sampling
• Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933
• Kaufmann, Korda, Munos. Thompson sampling: an asymptotically optimal finite-time analysis. ALT, 2012
• A Bayesian playing strategy
• Matches the lower bound
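To make the Bayesian strategy concrete, here is a minimal sketch of Thompson sampling with a Beta-Bernoulli model, written for Bernoulli losses to match the loss convention used throughout; `get_loss` is a hypothetical environment interface, and real implementations vary in priors and reward models.

```python
import numpy as np

def thompson_bernoulli(get_loss, K, T, rng=np.random.default_rng(0)):
    """Thompson sampling for Bernoulli losses with a Beta(1, 1) prior per arm."""
    alpha = np.ones(K)   # pseudo-counts of observed losses equal to 1
    beta = np.ones(K)    # pseudo-counts of observed losses equal to 0
    total_loss = 0.0
    for t in range(1, T + 1):
        theta = rng.beta(alpha, beta)     # sample a plausible mean loss per arm
        A = int(np.argmin(theta))         # play the arm that looks best under the sample
        loss = get_loss(t, A)             # observed loss in {0, 1}
        total_loss += loss
        alpha[A] += loss                  # posterior update
        beta[A] += 1 - loss
    return total_loss, (alpha, beta)
```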

Part I: “classical” algorithms (map recap) — next: bandits with side information and MDPs (moving up the structural-complexity axis).

Bandits with Side Information (a.k.a. Contextual Bandits)

Version 1: Multiarmed Bandits with Expert Advice
For t = 1, 2, …:
1. Observe the advice of N experts in the form of N distributions p_{t,h}(\cdot) over K arms, where h \in \{1, …, N\} indexes the experts
2. Play one arm, observe and suffer the loss of that arm

Algorithm: EXP4 [Auer et al., 2002]
For t = 1, 2, …:
• Mix the advice into p_t(a) \propto \sum_h p_{t,h}(a) \, e^{-\eta_t \tilde{L}_{t-1}(h)}
• Pick an arm A_t according to p_t(\cdot)
• Use importance-weighted sampling to track the \tilde{L}_t(h)-s

Regret Bound
E[R_T] = O\left(\sqrt{K T \ln N}\right)

Version 2: Multiarmed Bandits with Side Info
For t = 1, 2, …:
1. Observe the side info (= state) S_t
2. Play arm A_t, observe and suffer the loss \ell(A_t, S_t)

Expert advice is a special case of side info: side info = the advice vector.

Inverse reduction for a finite state space S
Experts = the set of all possible functions h: S \to \{1, …, K\}, so N = K^{|S|} and
E[R_T] = O\left(\sqrt{K T |S| \ln K}\right)

Structural complexity: \ln N = |S| \ln K
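A minimal Python sketch of EXP4, which covers Version 1 directly and Version 2 through the reduction above. The environment interfaces `get_advice` (returning the N×K advice matrix) and `get_loss` are hypothetical names, and the learning-rate schedule is one plausible choice.

```python
import numpy as np

def exp4(get_advice, get_loss, N, K, T, rng=np.random.default_rng(0)):
    """EXP4: exponential weights over N experts with bandit feedback over K arms."""
    L_tilde = np.zeros(N)                 # importance-weighted cumulative loss per expert
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(N) / (t * K))
        advice = get_advice(t)            # shape (N, K), each row a distribution over arms
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        q = w / w.sum()                   # distribution over experts
        p = q @ advice
        p = p / p.sum()                   # renormalize for numerical safety
        A = rng.choice(K, p=p)
        loss = get_loss(t, A)
        total_loss += loss
        ell_hat = loss / p[A]             # importance-weighted loss of the played arm
        L_tilde += advice[:, A] * ell_hat # each expert charged its weight on that arm
    return total_loss, L_tilde
```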

Markov Decision Processes (MDPs)

Game definition
Start from state S_1. For t = 1, 2, …:
1. Play an action A_t
2. Observe and suffer the loss \ell(A_t, S_t)
3. Transition to state S_{t+1} \sim p(S_{t+1} \mid A_t, S_t)

Major difference from bandits with side info: S_{t+1} depends on A_t.

Complexity of MDPs
1. The size of the state space (same as in bandits with side info)
2. The mixing time

“Classical” algorithms summary

• Prediction with expert advice (adversarial, full info): \sqrt{T \ln K}
• Adversarial bandits: \sqrt{K T}
• Stochastic bandits: \sum_{a \ne a^*} \frac{\ln T}{\Delta(a)}
• Adversarial bandits with expert advice: \sqrt{K T \ln N}, or \sqrt{K T |S| \ln K} with side information
• MDPs: additionally depend on the structural complexity of the state space

Simplicities along the axes

• Feedback: the more feedback, the simpler the problem
• Environmental resistance: gaps (between action outcomes) and variance (of action outcomes)
• Structural complexity: reducibility of the state space (relevance of side info) and mixing time (in MDPs)

“Classical” algorithms assume a certain form of simplicity and exploit it.

Directions beyond the classical algorithms (placed on the same map):
• Environmental resistance in full info [Koolen & van Erven, COLT 2015; Luo & Schapire, COLT 2015; Wintenberger, 2015; van Erven, Kotłowski & Warmuth, COLT 2014; Gaillard, Stoltz & van Erven, COLT 2014; …]
• Prediction with limited advice [Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014; Kale, COLT 2014]
• Bandits with paid observations [Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014]
• Contaminated stochastic bandits [Seldin & Slivkins, ICML 2014]
• Filtering of relevant side info [Seldin, Auer, Laviolette, Shawe-Taylor, Ortner, NIPS 2011]

In detail:

Putting it all in one language

The loss matrix \{\ell_t^a\} is the common object; the settings differ in how much of each column is observed and in how the losses are generated.

Feedback (observations per round, out of K arms)
• Expert Advice: K/K
• Limited Advice: M/K
• Bandits: 1/K
• Paid Observations: 0/K (observations have to be bought)

Loss generation
• Adversarial (deterministic)
• Contaminated stochastic
• Stochastic (E[\ell_t^a] = \mu(a))

Regret
E[R_T] = E\left[\sum_{t=1}^T \ell_t^{A_t}\right] - \min_a E\left[\sum_{t=1}^T \ell_t^a\right]

Prediction with limited advice
[Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014; Kale, COLT 2014]

Prediction with Limited Advice

Motivation
We can observe the advice of only M out of K experts, M ≤ K.

Examples
• Experts are computationally expensive functions (or algorithms) and we have a constraint on the response time
• Experts are humans who have to be paid

Notations
• O_t \subseteq \{1, …, K\} – the set of observed experts
• |O_t| = M_t – the number of observed experts, 1 \le M_t \le K

Game Definition
For t = 1, 2, …:
1. Pick (O_t, A_t) such that A_t \in O_t, and follow the advice of A_t
2. Observe \ell_t^a for a \in O_t and suffer \ell_t^{A_t}

General Picture

Setting:               Expert Advice (M = K)       Limited Advice                        Bandits (M = 1)
Observations:          \ell_t^1, …, \ell_t^K       \{\ell_t^a : a \in O_t, |O_t| = M\}   \ell_t^{A_t}
Regret Upper Bound:    O(\sqrt{T \ln K})           O(\sqrt{\tfrac{K}{M} T \ln K})        O(\sqrt{K T})
Regret Lower Bound:    \Omega(\sqrt{T \ln K})      \Omega(\sqrt{\tfrac{K}{M} T})         \Omega(\sqrt{K T})

• The (\ln K) gaps can be closed
• For time-dependent M_t the regret is O\left(\sqrt{K \left(\sum_{t=1}^T \tfrac{1}{M_t}\right) \ln K}\right)

Reminder: the Hedge and EXP3 algorithms, as stated earlier. The algorithm below interpolates between them: it observes M_t ≥ 1 expert losses per round and importance-weights accordingly.

Algorithm for Prediction with Limited Advice

Input: M_1, M_2, … and learning rates \eta_1 \ge \eta_2 \ge \dots
\forall a: \tilde{L}_0(a) = 0
for t = 1, 2, … do
    \forall a: p_t(a) = e^{-\eta_t \tilde{L}_{t-1}(a)} / \sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}
    Sample A_t according to p_t and play it (A_t \in O_t)
    Sample M_t - 1 additional experts (the rest of O_t) uniformly at random
    Observe \ell_t^a for a \in O_t
    \forall a: \tilde{\ell}_t^a = \frac{\ell_t^a \mathbf{1}\{a \in O_t\}}{p_t(a) + (1 - p_t(a))\frac{M_t - 1}{K - 1}}
    \forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a
end
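A minimal Python sketch of the algorithm above for a fixed M; `get_losses` is a hypothetical environment interface returning the full loss vector (only the observed entries are used), and the learning-rate schedule is one plausible choice consistent with the √((K/M) T ln K) bound.

```python
import numpy as np

def limited_advice(get_losses, K, M, T, rng=np.random.default_rng(0)):
    """Prediction with limited advice: observe M of the K experts per round."""
    L_tilde = np.zeros(K)
    total_loss = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(M * np.log(K) / (t * K))      # one possible schedule
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        p = w / w.sum()
        A = rng.choice(K, p=p)                      # the expert we follow
        others = rng.choice([a for a in range(K) if a != A], size=M - 1, replace=False)
        O = np.concatenate(([A], others)).astype(int)
        losses = get_losses(t)
        total_loss += losses[A]
        q = p + (1.0 - p) * (M - 1) / (K - 1)       # observation probability per expert
        L_tilde[O] += losses[O] / q[O]              # importance-weighted update of observed experts
    return total_loss, L_tilde
```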

Analysis idea

Upper bound
By the analysis of Hedge/EXP3:
E[R_T] \le \frac{\ln K}{\eta_T} + E\left[\sum_{t=1}^T \frac{\eta_t}{2} \sum_{a=1}^K p_t(a) \left(\tilde{\ell}_t^a\right)^2\right]

And we have:
E_t\left[\sum_{a=1}^K p_t(a) \left(\tilde{\ell}_t^a\right)^2\right] \le \frac{K}{M_t}

By tuning \eta_t:
E[R_T] \le O\left(\sqrt{K \left(\sum_{t=1}^T \frac{1}{M_t}\right) \ln K}\right) = O\left(\sqrt{\frac{K}{M} T \ln K}\right)

Lower bound
Similar to bandits (indistinguishability of K games), just with M T observations:
E[R_T] = \Omega\left(\sqrt{\frac{K}{M} T}\right)

Bandits with paid observations
[Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML 2014]

Multiarmed Bandits with Paid Observations

Motivation
How do we deal with a situation where we have to pay for observations? The loss of any arm can be observed, but each observation has a known cost c_t(a).

Example
Signing contracts with service providers; c_t(a) is the inspection cost.

Notations
• A_t – the arm played
• O_t \subseteq \{1, …, K\} – the set of observed arms

Game Definition
For t = 1, 2, …:
1. Pick (A_t, O_t) and play A_t (A_t is not necessarily in O_t)
2. Observe \ell_t^a for a \in O_t and suffer \ell_t^{A_t} + \sum_{a \in O_t} c_t(a)

Performance Measure: Cost-Sensitive Expected Regret
E[R_T^c] = \underbrace{E\left[\sum_{t=1}^T \ell_t^{A_t} - \min_a \sum_{t=1}^T \ell_t^a\right]}_{E[R_T]} + E\left[\sum_{t=1}^T \sum_{a \in O_t} c_t(a)\right]

Lower bound

Assume the algorithm makes M T observations in total and c_t(a) = c:
E[R_T^c] = E[R_T] + c M T = \Omega\left(\sqrt{\tfrac{K}{M} T} + c M T\right) \ge \Omega\left((cK)^{1/3} T^{2/3}\right)

Algorithm

\forall a: \tilde{L}_0(a) = 0
for t = 1, 2, … do
    \forall a: p_t(a) = e^{-\eta_t \tilde{L}_{t-1}(a)} / \sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}
    Sample A_t according to p_t and play it
    \forall a: query the loss of a with probability
        q_t(a) = \min\left\{1, \sqrt{\frac{\eta_t p_t(a)}{2 c_t(a)}}\right\}
    (a trade-off between relative arm quality p_t(a) and observation cost c_t(a))
    \forall a: \tilde{\ell}_t^a = \frac{\ell_t^a}{q_t(a)} \mathbf{1}\{a \in O_t\} (the importance-weighted loss of the arms that were actually queried)
    \forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a
end
The learning rate \eta_t is tuned based on p_1(\cdot), …, p_{t-1}(\cdot) and c_1(\cdot), …, c_{t-1}(\cdot).
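A minimal Python sketch of the algorithm; `get_loss` and `get_cost` are hypothetical environment interfaces (costs assumed strictly positive), and a fixed η replaces the data-dependent tuning of η_t used in the paper.

```python
import numpy as np

def paid_observations(get_loss, get_cost, K, T, eta=0.05, rng=np.random.default_rng(0)):
    """Bandits with paid observations: buy loss observations with probability q_t(a)."""
    L_tilde = np.zeros(K)
    played_loss, paid = 0.0, 0.0
    for t in range(1, T + 1):
        w = np.exp(-eta * (L_tilde - L_tilde.min()))
        p = w / w.sum()
        A = rng.choice(K, p=p)
        played_loss += get_loss(t, A)
        c = get_cost(t)                                    # observation costs c_t(a) > 0
        q = np.minimum(1.0, np.sqrt(eta * p / (2.0 * c)))  # query probabilities
        queried = rng.random(K) < q
        paid += c[queried].sum()
        for a in np.flatnonzero(queried):                  # importance-weighted estimates
            L_tilde[a] += get_loss(t, a) / q[a]
    return played_loss, paid, L_tilde
```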

Analysis

By the analysis of Hedge/EXP3:
E[R_T] \le \frac{\ln K}{\eta_T} + E\left[\sum_{t=1}^T \frac{\eta_t}{2} \sum_{a=1}^K p_t(a) \left(\tilde{\ell}_t^a\right)^2\right]

And: E_t\left[\left(\tilde{\ell}_t^a\right)^2\right] \le \frac{1}{q_t(a)}

Thus:
E[R_T^c] \le \frac{\ln K}{\eta_T} + E\left[\sum_{t=1}^T \sum_{a=1}^K \left(\frac{\eta_t}{2} \frac{p_t(a)}{q_t(a)} + c_t(a) q_t(a)\right)\right]

We have to minimize, for each t,
\sum_{a=1}^K \left(\frac{\eta_t p_t(a)}{2 q_t(a)} + c_t(a) q_t(a)\right),
which is achieved by q_t(a) = \min\left\{1, \sqrt{\frac{\eta_t p_t(a)}{2 c_t(a)}}\right\}.

Results

Simplified Upper Bound for c_t(a) = c
R_T^c \lesssim (32 c \ln K)^{1/3} \Bigg(\sum_{t=1}^T \underbrace{\sum_{a=1}^K \sqrt{p_t(a)}}_{1 \le \cdot \le \sqrt{K}}\Bigg)^{2/3} + 2\sqrt{T \ln K}

• Worst case: R_T^c \le (32 c K \ln K)^{1/3} T^{2/3} + 2\sqrt{T \ln K}
• Favorable case (one dominating arm): R_T^c \to (32 c \ln K)^{1/3} T^{2/3} + 2\sqrt{T \ln K}

General Upper Bound
R_T^c \lesssim (32 \ln K)^{1/3} \Bigg(\sum_{t=1}^T \underbrace{\sum_{a=1}^K \sqrt{p_t(a) c_t(a)}}_{\le \sqrt{\sum_a c_t(a)}}\Bigg)^{2/3} + 2\sqrt{T \ln K}

• Worst case: R_T^c \lesssim (32 \ln K)^{1/3} \left(\sum_{t=1}^T \sqrt{\sum_{a=1}^K c_t(a)}\right)^{2/3} + 2\sqrt{T \ln K}
• Favorable case (one dominating arm a^*): R_T^c \to (32 \ln K)^{1/3} \left(\sum_{t=1}^T \sqrt{c_t(a^*)}\right)^{2/3} + 2\sqrt{T \ln K}

Bandits with Paid Observations – Summary

• Adaptation to the cost of information gathering
• Balance between problem complexity and information cost: q_t(a) = \min\left\{1, \sqrt{\frac{\eta_t p_t(a)}{2 c_t(a)}}\right\}
• Automatic tuning of the learning rate \eta_t

Stochastic and adversarial bandits
[Seldin & Slivkins, ICML 2014]

Loss Generation Models

Adversarial Regime
The losses \ell_t^a are picked by an adversary in an arbitrary way.

Stochastic Regime
The losses \ell_t^a are drawn independently at random with E[\ell_t^a] = \mu(a).
• \Delta(a) = \mu(a) - \min_{a'}\{\mu(a')\} – the gap
• \Delta = \min_{a: \Delta(a)>0}\{\Delta(a)\} – the minimal gap

Moderately Contaminated Stochastic Regime (new)
A stochastic regime where the adversary can contaminate
• up to t\Delta(a)/4 entries of suboptimal actions
• up to t\Delta/4 entries of optimal actions

Adversarial Regime with a Gap (new)
Let \lambda_t(a) = \sum_{s=1}^t \ell_s^a. There exists a consistent minimizer a^*_\tau of \lambda_t(a) for all t \ge \tau, and
\Delta(\tau, a) = \min_{t \ge \tau} \frac{1}{t}\left(\lambda_t(a) - \lambda_t(a^*_\tau)\right)
is the deterministic gap.

Can we have one algorithm that performs “well” in all the regimes, without prior knowledge of the regime type?

Classical Results

Adversarial Regime
• Lower bound: \Omega(\sqrt{Kt}) [Auer et al., 1995]
• EXP3: O(\sqrt{Kt \ln K}) [Auer et al., 2002]
• INF: O(\sqrt{Kt}) [Audibert & Bubeck, 2009]

Stochastic Regime
• Lower bound: \Omega\left(\sum_{a:\Delta(a)>0}\frac{\ln t}{\Delta(a)}\right) [Lai & Robbins, 1985]
• UCB1: O\left(\sum_{a:\Delta(a)>0}\frac{\ln t}{\Delta(a)}\right) [Auer et al., 2002]
• KL-UCB, Thompson sampling, EwS, …: O\left(\sum_{a:\Delta(a)>0}\frac{\ln t}{\Delta(a)}\right)

• Algorithms for the stochastic regime are inapplicable in the adversarial regime (linear regret)
• Algorithms for the adversarial regime are suboptimal in the stochastic regime

SAO [Bubeck & Slivkins, 2012]
+ O\left(\sqrt{TK}\,(\ln T)^{3/2}\ln K\right) in the adversarial regime
+ O\left(\frac{K(\ln T)^2}{\Delta}\ln K\right) in the stochastic regime
− Does not cover the intermediate regimes
− Relatively complicated and unnatural for the problem
− Relies on the knowledge of the time horizon T
− Based on a one-time irreversible transition from the stochastic to the adversarial operation mode

The EXP3++ Algorithm [Seldin & Slivkins, 2014]

+ A simple and natural generalization of the EXP3 algorithm
+ O(\sqrt{Kt \ln K}) regret in the adversarial regime
+ O\left(\sum_{a:\Delta(a)>0}\frac{(\ln t)^3}{\Delta(a)}\right) regret in the stochastic regime
+ O\left(\sum_{a:\Delta(a)>0}\frac{(\ln t)^3}{\Delta(a)}\right) regret in the moderately contaminated stochastic regime
+ O\left(\min_\tau\left\{\tau + \sum_{a:\Delta(\tau,a)>0}\frac{(\ln t)^3}{\Delta(\tau,a)}\right\}\right) regret in the adversarial regime with a gap

Reminder: EXP3
Control lever: \eta_t \left(= \sqrt{\tfrac{\ln K}{tK}}\right). The algorithm is as stated earlier: p_t(a) \propto e^{-\eta_t \tilde{L}_{t-1}(a)}, sample and play A_t, observe and suffer \ell_t^{A_t}, and update \tilde{L}_t(a) with the importance-weighted estimate \tilde{\ell}_t^a = \frac{\ell_t^{A_t}\mathbf{1}\{A_t=a\}}{p_t(a)}.

The EXP3++ Algorithm
Control levers: \eta_t and the exploration parameters \varepsilon_t(a)

\forall a: \tilde{L}_0(a) = 0
for t = 1, 2, … do
    \forall a: p_t(a) = e^{-\eta_t \tilde{L}_{t-1}(a)} / \sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}
    \forall a: \tilde{p}_t(a) = \left(1 - \sum_{a'} \varepsilon_t(a')\right) p_t(a) + \varepsilon_t(a)
    Sample A_t according to \tilde{p}_t and play it. Observe and suffer \ell_t^{A_t}
    \forall a: \tilde{\ell}_t^a = \frac{\ell_t^{A_t} \mathbf{1}\{A_t = a\}}{\tilde{p}_t(a)}
    \forall a: \tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a
end

Analysis: Adversarial Regime

EXP3++ playing distribution: \tilde{p}_t(a) = \left(1 - \sum_{a'} \varepsilon_t(a')\right) p_t(a) + \varepsilon_t(a)

Regret in the adversarial regime
For \varepsilon_t(a) = O\left(\sqrt{\tfrac{\ln K}{Kt}}\right):
E[R_T] = O\left(\sqrt{KT \ln K}\right)
(E[R_T] is essentially unaffected by \varepsilon_t(a) of this order)

Analysis: Stochastic Regime

Properties of importance-weighted sampling
• E\left[\tilde{\ell}_t^a\right] = E[\ell_t^a] = \mu(a)
• t\mu(a) - \tilde{L}_t(a) = \sum_{s=1}^t \left(\mu(a) - \tilde{\ell}_s^a\right) is a martingale
• Instantaneous variance: E_t\left[\left(\mu(a) - \tilde{\ell}_t^a\right)^2\right] \le \frac{1}{\tilde{p}_t(a)} \le \frac{1}{\varepsilon_t(a)}
• Cumulative variance over t rounds: \nu_t(a) \approx \frac{t}{\varepsilon_t(a)}

The Fundamental Trade-off of the Algorithm (Stochastic Regime)

By Bernstein’s inequality, with probability at least 1 - \frac{1}{t}:
t\mu(a) - \tilde{L}_t(a) \le \sqrt{2\nu_t(a)\ln t} + \frac{\ln t}{3\varepsilon_t(a)} \approx \sqrt{\frac{2t\ln t}{\varepsilon_t(a)}} + \frac{\ln t}{3\varepsilon_t(a)}

[Figure: cumulative losses t\mu(a_1), t\mu(a_2), their estimates \tilde{L}_t(a_1), \tilde{L}_t(a_2), and the corresponding upper and lower confidence bounds, plotted as a function of t.]

• For separation of the arms we need \sqrt{\frac{2t\ln t}{\varepsilon_t(a)}} = O(t\Delta(a)), i.e., \varepsilon_t(a) = \Omega\left(\frac{1}{t\Delta(a)^2}\right)
• On the other hand, N_t(a) \ge \sum_{s=1}^t \varepsilon_s(a), so to keep the number of exploration plays small we need \varepsilon_t(a) = O\left(\frac{1}{t\Delta(a)^2}\right)
• We take \varepsilon_t(a) = \frac{18(\ln t)^2}{t\hat{\Delta}_t(a)^2}, where \hat{\Delta}_t(a) is an empirical estimate of the gap, and show that \hat{\Delta}_t(a) \to \Delta(a)

Main Results

• With \eta_t = \frac{1}{2}\sqrt{\frac{\ln K}{tK}} and \varepsilon_t(a) = O\left(\sqrt{\frac{\ln K}{tK}}\right):
  O\left(\sqrt{tK\ln K}\right) regret in the adversarial regime
• Let \hat{\Delta}_t(a) be an empirical estimate of \Delta(a). With \varepsilon_t(a) = \frac{18(\ln t)^2}{t\hat{\Delta}_t(a)^2} and \eta_t \ge \frac{1}{2}\sqrt{\frac{\ln K}{tK}}:
  O\left(\frac{(\ln t)^3}{\Delta(a)}\right) regret in the stochastic, moderately contaminated stochastic, and adversarial-with-a-gap regimes
• With \eta_t = \frac{1}{2}\sqrt{\frac{\ln K}{tK}} and \varepsilon_t(a) = \frac{18(\ln t)^2}{t\hat{\Delta}_t(a)^2}: good for all four regimes

Experiments in the Stochastic Regime

[Figure: six panels of cumulative regret as a function of $t$, for $K \in \{2, 10, 100\}$ and $\Delta \in \{0.1, 0.01\}$, comparing UCB, Thompson sampling (Thom), EXP3, EXP3++ with $\eta = \beta$, and EXP3++ with $\eta = 1$.]
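An illustrative way to reproduce this kind of comparison on a small scale, reusing the `exp3pp` sketch after the Main Results slide (a hypothetical simulation, not the tutorial's actual experiment):

```python
import numpy as np

def pseudo_regret(arms, mu):
    """Cumulative pseudo-regret: sum over rounds of mu(A_t) - min_a mu(a)."""
    mu = np.asarray(mu)
    return np.cumsum(mu[np.asarray(arms)] - mu.min())

# Two arms with gap Delta = 0.1 over a short horizon (the tutorial's plots use far longer runs)
mu = np.array([0.4, 0.5])
regret = pseudo_regret(exp3pp(mu, T=100_000), mu)
print(regret[-1])
```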

EXP3++ Summary
• EXP3++ is a simple and natural extension of EXP3
• Two control levers: $\eta_t$ and the $\varepsilon_t(a)$-s
• Almost optimal performance in both the stochastic and adversarial regimes
• "Logarithmic" regret in two new regimes:
  • Moderately contaminated stochastic regime
  • Adversarial regime with a gap
• In the stochastic regime empirically comparable to UCB1

Punch Line: EXP3++ is a powerful tool for exploiting the gaps in a variety of regimes without compromising on the worst-case performance!

[Recap: the space of online learning problems, with axes environmental resistance (i.i.d. vs. adversarial), feedback (full, bandit, partial) and structural complexity (no state, state, MDPs).]

Other popular problems we have not touched
• Linear bandits
• Combinatorial bandits
• Dueling bandits
• And many, many more variations ...

Further Reading Part 1
• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006
• Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 2012
• Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002
• Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002
• Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 2012

Further Reading Part 2
• Wouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial games. COLT, 2015
• Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. COLT, 2015
• Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial bandits. ICML, 2014
• Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. ICML, 2014
• Satyen Kale. Multiarmed bandits with limited expert advice. COLT, 2014
• Yevgeny Seldin, Peter Auer, François Laviolette, John Shawe-Taylor, and Ronald Ortner. PAC-Bayesian analysis of contextual bandits. NIPS, 2011
