Limit Values in some Markov Decision Processes and Repeated Games
Jérôme Renault, Université Toulouse 1, TSE-GREMAQ
IMT Probas, April 6, 2010
1. Intro: standard MDPs with finite states and actions
2. Limit values in dynamic programming problems
3. Application to standard MDPs
4. Application to MDPs with imperfect observation
5. Application to repeated games with an informed controller
"Uniform Value in Dynamic Programming", arXiv:0803.2758. To appear in JEMS (Journal of the European Mathematical Society).
1. Intro: standard MDPs with finite sets of states and actions
(Controlled Markov chain, Blackwell 1962)

Example 1: [diagram of a controlled Markov chain with five states k1, ..., k5]

A finite set of states K, a finite set of actions A, a transition function q from K × A to ∆(K), a reward function g: K × A −→ [0, 1], and an initial probability p0 on K.
k1 in K is selected according to p0 and told to the player, then he selects a1 in A and receives a payoff of g(k1, a1). A new state k2 is selected according to q(k1, a1) and told to the player, etc.
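As a minimal sketch of this model, a hypothetical two-state instance (an illustrative assumption, not the chain of Example 1) can be simulated in a few lines of Python:

```python
import random

# Standard MDP: finite states K, actions A, transitions q: K x A -> Delta(K),
# rewards g: K x A -> [0, 1]. Hypothetical two-state instance (not Example 1).
K = ["k1", "k2"]
A = ["stay", "move"]
q = {("k1", "stay"): {"k1": 1.0}, ("k1", "move"): {"k2": 1.0},
     ("k2", "stay"): {"k2": 1.0}, ("k2", "move"): {"k1": 0.5, "k2": 0.5}}
g = {("k1", "stay"): 0.0, ("k1", "move"): 0.0,
     ("k2", "stay"): 1.0, ("k2", "move"): 0.5}

def simulate(p0, policy, n, rng=random.Random(0)):
    """Average reward over n stages of a stationary policy f: K -> A."""
    k = rng.choices(list(p0), weights=list(p0.values()))[0]  # draw k1 ~ p0
    total = 0.0
    for _ in range(n):
        a = policy[k]                      # the current state is told to the player
        total += g[(k, a)]
        nxt = q[(k, a)]
        k = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return total / n

f = {"k1": "move", "k2": "stay"}   # reach k2, then collect payoff 1 forever
print(simulate({"k1": 1.0}, f, 1000))
```

For this instance the n-stage average from k1 is (n − 1)/n: one stage of payoff 0 to reach k2, then payoff 1 at every stage.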
(Pure) strategy: σ = (σt)t≥1, where for each t, σt: (K × A)^(t−1) × K −→ A defines the action to be played at stage t. (p0, σ) generates a probability on plays, and one can define:

n-stage problem, for n ≥ 1: vn(p0) = sup_σ γn^{p0}(σ), where γn^{p0}(σ) = E_{p0,σ}[ (1/n) ∑_{t=1}^n g(kt, at) ].
λ-discounted problem, for λ ∈ (0, 1]: vλ(p0) = sup_σ γλ^{p0}(σ), where γλ^{p0}(σ) = E_{p0,σ}[ λ ∑_{t=1}^∞ (1 − λ)^{t−1} g(kt, at) ].

Theorem (Blackwell 1962):
1) In a neighborhood of zero, vλ is a rational function of λ.
2) lim_{n→∞} vn and lim_{λ→0} vλ exist and are equal.
3) There exist λ0 > 0 and a strategy σ which is stationary (induced by a mapping f: K → A) and 0-optimal in any λ-discounted problem with λ ≤ λ0 (Blackwell optimal).

Generalizations to: more general actions, imperfect observation of the states, stochastic games?
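The λ-discounted value can be approximated by iterating the fixed-point equation vλ(k) = max_a [λ g(k, a) + (1 − λ) ∑_{k′} q(k′|k, a) vλ(k′)]. The two-state MDP below is a hypothetical illustration (not from the talk); for it vλ(k1) = 1 − λ, so vλ(k1) → 1 as λ → 0, and the same stationary policy is optimal for every small λ, as in part 3) of the theorem:

```python
# Value iteration for the lambda-discounted value:
#   v(k) = max_a [ lam*g(k,a) + (1-lam) * sum_k' q(k'|k,a) * v(k') ].
# Hypothetical two-state, two-action MDP (an assumption, not Example 1).
K = ["k1", "k2"]
A = ["stay", "move"]
q = {("k1", "stay"): {"k1": 1.0}, ("k1", "move"): {"k2": 1.0},
     ("k2", "stay"): {"k2": 1.0}, ("k2", "move"): {"k1": 0.5, "k2": 0.5}}
g = {("k1", "stay"): 0.0, ("k1", "move"): 0.0,
     ("k2", "stay"): 1.0, ("k2", "move"): 0.5}

def v_discounted(lam, iters=3000):
    v = {k: 0.0 for k in K}
    for _ in range(iters):  # contraction with factor (1 - lam)
        v = {k: max(lam * g[(k, a)]
                    + (1 - lam) * sum(p * v[k2] for k2, p in q[(k, a)].items())
                    for a in A)
             for k in K}
    return v

for lam in [0.2, 0.05, 0.01]:
    print(lam, v_discounted(lam))   # v_lam(k1) = 1 - lam here; limit value 1
```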
Blackwell optimality is weakened. The MDP has a uniform value if (vn(p0))n has a limit v(p0) and the player can guarantee this limit, i.e. ∀ε > 0, ∃σ, lim inf_n γn^{p0}(σ) ≥ v(p0) − ε.
When the uniform value exists, one can play ε-optimally simultaneously in any long enough problem. A strategy σ such that lim inf_n γn^{p0}(σ) ≥ v(p0) − ε is called ε-optimal.
Blackwell case: existence of the uniform value and of a 0-optimal strategy.

Example 2: K = {a, b, c}. States b and c are absorbing with payoffs 1 and 0. Start at a, choose α ∈ [0, 1/2], and move to b with probability α and to c with probability α². [diagram: from a, remain with probability 1 − α − α², move to absorbing b* (payoff 1) with probability α, to absorbing c* (payoff 0) with probability α²]

vλ(a) = 1 − 2√λ + o(√λ) (slow convergence). No 0-optimal strategy.
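For Example 2, the discounted value at a solves vλ(a) = max_{α∈[0,1/2]} (1 − λ)(α · 1 + (1 − α − α²) vλ(a)), using vλ(b) = 1 and vλ(c) = 0 for the absorbing states and stage payoff 0 at a. A rough numerical sketch, with the maximum taken over a grid of α:

```python
import math

def v_lambda_a(lam, grid=800, iters=1500):
    """Fixed-point iteration for v_lam(a) in Example 2:
    v = max over alpha in [0, 1/2] of (1-lam)*(alpha + (1-alpha-alpha^2)*v)."""
    alphas = [0.5 * i / grid for i in range(grid + 1)]
    v = 0.0
    for _ in range(iters):   # contraction: coefficient (1-lam)*(1-a-a^2) < 1
        v = max((1 - lam) * (a + (1 - a - a * a) * v) for a in alphas)
    return v

for lam in [0.01, 0.001]:
    print(lam, v_lambda_a(lam), 1 - 2 * math.sqrt(lam))
# v_lambda(a) -> 1 as lam -> 0, at the slow speed 1 - 2*sqrt(lam) + o(sqrt(lam))
```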
2. Limit values in dynamic programming problems (bounded payoffs)

2.a. Problem Γ(z0) = (Z, F, r, z0) given by a non-empty set of states Z, an initial state z0, a transition correspondence F from Z to Z with non-empty values, and a reward mapping r from Z to [0, 1].
A player chooses z1 in F(z0), gets a payoff of r(z1), then chooses z2 in F(z1), etc.
Admissible plays: S(z0) = {s = (z1, ..., zt, ...) ∈ Z^∞ : ∀t ≥ 1, zt ∈ F(zt−1)}.

n-stage problem, for n ≥ 1: vn(z) = sup_{s∈S(z)} γn(s), where γn(s) = (1/n) ∑_{t=1}^n r(zt).
λ-discounted problem, for λ ∈ (0, 1]: vλ(z) = sup_{s∈S(z)} γλ(s), where γλ(s) = λ ∑_{t=1}^∞ (1 − λ)^{t−1} r(zt).

The values satisfy the recursions:
vn+1(z) = sup_{z′∈F(z)} ( (1/(n+1)) r(z′) + (n/(n+1)) vn(z′) ),
vλ(z) = sup_{z′∈F(z)} ( λ r(z′) + (1 − λ) vλ(z′) ).
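The n-stage recursion can be run directly on a small finite problem. Here is a hypothetical deterministic three-state graph (an illustration, not from the talk) where the best play from a enters the cycle b → c → b, so vn(a) → 1/2:

```python
# v_n via the recursion v_n(z) = max_{z' in F(z)} ( r(z')/n + (n-1)/n * v_{n-1}(z') ),
# starting from v_0 = 0. Hypothetical deterministic problem (not from the talk).
Z = ["a", "b", "c"]
F = {"a": ["a", "b"], "b": ["c"], "c": ["b"]}   # from a: wait, or enter the cycle b -> c -> b
r = {"a": 0.0, "b": 1.0, "c": 0.0}

def n_stage_value(n):
    v = {z: 0.0 for z in Z}
    for m in range(1, n + 1):
        v = {z: max(r[z2] / m + (m - 1) * v[z2] / m for z2 in F[z]) for z in Z}
    return v

print(n_stage_value(1)["a"], n_stage_value(2)["a"], n_stage_value(100)["a"])
# here v_n(a) = ceil(n/2)/n, which converges to 1/2
```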
• Large known horizon: existence of lim_{n→∞} vn(z), of lim_{λ→0} vλ(z), equality of the limits? Uniform convergence of (vn) and (vλ)?
0 players (i.e. F single-valued): (vn(z))n converges iff (vλ(z))λ converges, and in case of convergence both limits are the same (Hardy-Littlewood).
1 player: (vn)n converges uniformly iff (vλ)λ converges uniformly, and in case of convergence both limits are the same (Lehrer-Sorin 1992).

• Large unknown horizon: when is it possible to play ε-optimally simultaneously in any long enough problem? Say that Γ(z) has a uniform value if (vn(z))n has a limit v(z) and one can guarantee this limit, i.e. ∀ε > 0, ∃s ∈ S(z), ∃n0, ∀n ≥ n0, γn(s) ≥ v(z) − ε.
Always: sup_{s∈S(z)} (lim inf_n γn(s)) ≤ lim inf_n vn(z) ≤ lim sup_n vn(z). And the uniform value exists iff sup_{s∈S(z)} (lim inf_n γn(s)) = lim sup_n vn(z).
A play s such that ∃n0, ∀n ≥ n0, γn(s) ≥ v(z) − ε is called ε-optimal.
The uniform convergence of (vn) does not imply the existence of the uniform value (Monderer-Sorin 1993, Lehrer-Monderer 1994). Sufficient conditions for the existence of the uniform value were given by Mertens and Neyman 1982, from stochastic games (convergence of (vλ)λ with a bounded-variation condition).

Back to Example 2: K = {a, b, c}. States b and c are absorbing with payoffs 1 and 0. Start at a, choose α ∈ [0, 1/2], and move to b with probability α and to c with probability α².

→ Dynamic programming problem with Z = ∆(K), r(z) = z^b, z0 = δa and F(z) = {(z^a(1 − α − α²), z^b + z^a α, z^c + z^a α²) : α ∈ [0, 1/2]}.
The uniform value exists and v(z0) = 1. No ergodicity.
Example 3: Z = [0, 1]² ∪ {z0}. F(z0) = {(0, y), y ∈ [0, 1]}, and F(x, y) = {(min{1, x + y}, y)}. [figure: the first coordinate moves right by y at each stage across [0, 1], with reward 1 on [1/3, 2/3]]

r(x, y) = 1 if x ∈ [1/3, 2/3], and r(x, y) = 0 if x ∉ [1/4, 3/4]. The player would like to maximize the number of stages where the first coordinate of the state is not too far from 1/2.
For each n ≥ 2, we have vn(z0) ≥ 1/2. But for each play s, we have lim_n γn(s) = 0. The uniform value does not exist.
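A quick simulation of Example 3 shows both claims (taking, as a conservative choice, reward 0 in the unspecified zones [1/4, 1/3) and (2/3, 3/4]): tuning y to the horizon n gives an n-stage average near 1/2, while the same play evaluated on a much longer horizon has average near 0:

```python
def r(x):
    # reward: 1 on [1/3, 2/3], 0 outside [1/4, 3/4]; conservatively 0 in between
    return 1.0 if 1/3 <= x <= 2/3 else 0.0

def avg_payoff(y, n):
    """Average reward of the play z0 -> (0, y) -> (y, y) -> (2y, y) -> ..."""
    x, total = 0.0, 0.0
    for _ in range(n):
        x = min(1.0, x + y)
        total += r(x)
    return total / n

n = 3000
y = 2 / (3 * n)        # tuned to the horizon: x crosses [1/3, 2/3] during stages n/2..n
print(avg_payoff(y, n))          # about 1/2
print(avg_payoff(y, 100 * n))    # the same play on a longer horizon: near 0
```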
2.b. The auxiliary functions vm,n and uniform convergence of (vn)

For m ≥ 0, n ≥ 1 and s = (zt)t≥1, define:
γm,n(s) = (1/n) ∑_{t=1}^n r(z_{m+t}) and vm,n(z) = sup_{s∈S(z)} γm,n(s).

The player first makes m moves in order to reach a "good initial state", then plays n moves for payoffs. Write v−(z) = lim inf_n vn(z), v+(z) = lim sup_n vn(z).

Lemma 1: v−(z) = sup_{m≥0} inf_{n≥1} vm,n(z).
Lemma 2: ∀m0, inf_{n≥1} sup_{m≤m0} vm,n(z) ≤ v−(z) ≤ v+(z) ≤ inf_{n≥1} sup_{m≥0} vm,n(z).
This can be restated as: inf_{n≥1} sup_{z′∈G^{m0}(z)} vn(z′) ≤ v−(z) ≤ v+(z) ≤ inf_{n≥1} sup_{z′∈G^∞(z)} vn(z′), where G^{m0}(z) is the set of states that can be reached from z in at most m0 stages, and G^∞(z) = ∪m G^m(z).
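The role of the m preliminary moves can be seen on a tiny hypothetical deterministic problem (an illustration, not from the talk): from a, an absorbing payoff-1 state b is only reached through an intermediate state a2, so inf_n v_{0,n}(a) = 0 (take n = 1) while inf_n v_{m,n}(a) = 1 for m ≥ 1, and sup_m inf_n v_{m,n}(a) = 1 = v−(a) as in Lemma 1:

```python
# Hypothetical deterministic problem: a -> a2 -> b, with b absorbing.
Z = ["a", "a2", "b"]
F = {"a": ["a2"], "a2": ["b"], "b": ["b"]}
r = {"a": 0.0, "a2": 0.0, "b": 1.0}

def v_n(n):
    v = {z: 0.0 for z in Z}
    for m in range(1, n + 1):
        v = {z: max(r[z2] / m + (m - 1) * v[z2] / m for z2 in F[z]) for z in Z}
    return v

def reach(z, m):
    """States reachable from z in exactly m steps."""
    cur = {z}
    for _ in range(m):
        cur = {z2 for z1 in cur for z2 in F[z1]}
    return cur

def v_mn(z, m, n):
    # deterministic problem: v_{m,n}(z) = max of v_n over states reachable in m steps
    vn = v_n(n)
    return max(vn[z2] for z2 in reach(z, m))

N = 50
print([min(v_mn("a", m, n) for n in range(1, N + 1)) for m in range(3)])
# inf_n v_{0,n}(a) = 0 (n = 1 gives only the payoff of a2),
# but inf_n v_{1,n}(a) = 1: one preliminary move reaches a2, then all rewards are 1
```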
Define V = {vn, n ≥ 1} ⊂ {v: Z −→ [0, 1]}, endowed with d∞(v, v′) = sup_z |v(z) − v′(z)|.

Thm 1 (R, 2009): (vn)n converges uniformly iff V is precompact. And the uniform limit v* can only be: v*(z) = sup_{m≥0} inf_{n≥1} vm,n(z) = inf_{n≥1} sup_{m≥0} vm,n(z).

Sketch of proof:
1) Define d(z, z′) = sup_{n≥1} |vn(z) − vn(z′)|. Prove that (Z, d) is a precompact pseudometric space. Clearly, each vn is 1-Lipschitz for d.
2) Fix z. Prove that ∀ε > 0, ∃m0, ∀z′ ∈ G^∞(z), ∃z″ ∈ G^{m0}(z) s.t. d(z′, z″) ≤ ε.
3) Use inf_{n≥1} sup_{z′∈G^{m0}(z)} vn(z′) ≤ v−(z) ≤ v+(z) ≤ inf_{n≥1} sup_{z′∈G^∞(z)} vn(z′), and conclude.
Corollary 1: V = {vn, n ≥ 1} is precompact, and thus (vn)n converges uniformly, in the following cases:
a) Z is endowed with a distance d such that (Z, d) is precompact, and the family (vn)n≥1 is uniformly equicontinuous.
b) Z is endowed with a distance d such that (Z, d) is precompact, r is uniformly continuous and F is non-expansive: ∀z ∈ Z, ∀z′ ∈ Z, ∀z1 ∈ F(z), ∃z1′ ∈ F(z′) s.t. d(z1, z1′) ≤ d(z, z′).
c) Z is finite (Blackwell, 1962).
2.c. The auxiliary functions wm,n and the existence of the uniform value

For m ≥ 0, n ≥ 1 and s = (zt)t≥1, we define:
γm,n(s) = (1/n) ∑_{t=1}^n r(z_{m+t}), and vm,n(z) = sup_{s∈S(z)} γm,n(s);
µm,n(s) = min{γm,t(s), t ∈ {1, ..., n}}, and wm,n(z) = sup_{s∈S(z)} µm,n(s).

wm,n: the player first makes m moves in order to reach a "good initial state", but then his payoff is only the minimum of his next n average rewards.

Thm 2 (R, 2007): Assume that W = {wm,n, m ≥ 0, n ≥ 1} is precompact for d∞. Then for every initial state z in Z, the problem has a uniform value, which is:
v*(z) = sup_{m≥0} inf_{n≥1} wm,n(z) = sup_{m≥0} inf_{n≥1} vm,n(z) = inf_{n≥1} sup_{m≥0} wm,n(z) = inf_{n≥1} sup_{m≥0} vm,n(z).
And (vn)n converges uniformly to v*.
Corollary 2: W is precompact, and thus the previous theorem applies, in the following cases:
a) Z is endowed with a distance d such that (Z, d) is precompact, and the family (wm,n)m≥0,n≥1 is uniformly equicontinuous.
b) Z is endowed with a distance d such that (Z, d) is compact, r is continuous and F is non-expansive.
c) Z is finite.

Remark: Application to non-expansive control problems in continuous time (with M. Quincampoix, 2009).
2.d. Characterizing lim vn

Fix Z compact metric and F non-expansive, and put E = {r: Z −→ [0, 1], r continuous}. For each r in E, there is a limit value Φ(r): we have Φ(r) = sup_{m≥0} inf_{n≥1} v^r_{m,n} = inf_{n≥1} sup_{m≥0} v^r_{m,n}. What are the properties of Φ: E −→ E?
Example: 0 players, ergodic Markov chain on a finite set: Φ(r) = <m*, r>, with m* the unique invariant measure.

Define: A = {r ∈ E, Φ(r) = 0}, B = {x ∈ E, ∀z x(z) = sup_{z′∈F(z)} x(z′)}.

Thm 3 (R, 2010): Φ(r) = min{v, v ∈ B and r − v ∈ A}.
Particular cases:
1) If the problem is ergodic (Φ(r) is constant, for each r), then the decomposition r = v + w with v in B and w in A is unique: Φ is the projection onto B along A.
2) Assume the problem is leavable, i.e. z ∈ F(z) for each z. Then B = {x ∈ E, ∀z x(z) ≥ sup_{z′∈F(z)} x(z′)} (excessive functions) is convex, and Φ(r) = min{v, v ∈ B, v ≥ r} (Gambling Fundamental Theorem, Dubins-Savage 1965).

Example: K finite, X = ∆(K) is a simplex. Consider the gambling house Γ: X ⇒ ∆f(X), with Γ(x) = {µ ∈ ∆f(X), m(µ) = x}. Then Φ(r) = cav r, the smallest concave function above r. [figure: graph of r on [0, 1] together with its concavification cav r]
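When K has two states, X = ∆(K) identifies with [0, 1] and cav r can be computed on a grid as the upper concave envelope of the points (x, r(x)). A sketch with a hypothetical two-peak reward (an illustration, not from the talk):

```python
def upper_hull(points):
    """Vertices of the concave (upper) envelope of a finite set of points."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2], hull[-1]
            # drop the middle point if it lies on or below the chord to p
            if (x1 - x0) * (p[1] - y0) >= (p[0] - x0) * (y1 - y0):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def cav_at(hull, x):
    """Evaluate the piecewise-linear envelope at x."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the hull range")

def r(x):  # hypothetical non-concave reward with peaks at 0.2 and 0.7
    return max(1 - 10 * abs(x - 0.2), 0.8 - 10 * abs(x - 0.7), 0.0)

grid = [i / 1000 for i in range(1001)]
hull = upper_hull([(x, r(x)) for x in grid])
print(cav_at(hull, 0.45))  # on the chord joining the peaks: 0.9, while r(0.45) = 0
```

The gain cav r(x) − r(x) is exactly what the gambler obtains by splitting x into a lottery over better points of the simplex.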
3. Application to standard MDPs with a finite set of states

[diagram of Example 1: a controlled Markov chain with five states k1, ..., k5]

MDP Ψ(p0): a finite set of states K, a non-empty set of actions A, a transition function q from K × A to ∆(K), a reward function g: K × A −→ [0, 1], and an initial probability p0 on K.
k1 in K is selected according to p0 and told to the player, then he selects a1 in A and receives a payoff of g(k1, a1). A new state k2 is selected according to q(k1, a1) and told to the player, etc.
A pure strategy: σ = (σt)t≥1, with ∀t, σt: (K × A)^(t−1) × K −→ A defining the action to be played at stage t. (p0, σ) generates a probability on plays; one can define the expected payoffs and the n-stage values.

→ Auxiliary deterministic problem Γ(z0): new set of states Z = ∆(K) × [0, 1], a new initial state z0 = (p0, 0), a new payoff function r(p, y) = y for all (p, y) in Z, and a transition correspondence such that for every z = (p, y) in Z:
F(z) = { ( ∑_{k∈K} p^k q(k, ak), ∑_{k∈K} p^k g(k, ak) ) : ak ∈ A for all k ∈ K }.

Put d((p, y), (p′, y′)) = max{‖p − p′‖1, |y − y′|}. Apply the corollaries to obtain the uniform convergence of (vn)n and the existence of the uniform value (for any set A). Well known for A finite (Blackwell 1962), and for A compact and q, g continuous in a (Dynkin-Yushkevich 1979).
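When A is finite, the correspondence F can be enumerated: each successor of z = (p, y) corresponds to an action profile (ak)k∈K, one action per state the player might be told. A sketch on a hypothetical two-state MDP (an illustration, not from the talk):

```python
from itertools import product

# Hypothetical two-state, two-action MDP (illustration only).
K, A = ["k1", "k2"], ["u", "d"]
q = {("k1", "u"): {"k1": 1.0}, ("k1", "d"): {"k2": 1.0},
     ("k2", "u"): {"k2": 1.0}, ("k2", "d"): {"k1": 0.5, "k2": 0.5}}
g = {("k1", "u"): 0.0, ("k1", "d"): 0.2, ("k2", "u"): 1.0, ("k2", "d"): 0.3}

def F(z):
    """Successors of z = (p, y): one per action profile (a_k) in A^K."""
    p, _ = z
    succ = []
    for profile in product(A, repeat=len(K)):
        newp = {k: 0.0 for k in K}
        y = 0.0
        for k, a in zip(K, profile):
            y += p[k] * g[(k, a)]                  # expected stage payoff
            for k2, w in q[(k, a)].items():
                newp[k2] += p[k] * w               # next distribution on K
        succ.append((newp, y))
    return succ

z0 = ({"k1": 0.5, "k2": 0.5}, 0.0)
for newp, y in F(z0):
    print(newp, y)
```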
4. Application to MDPs with imperfect observation Hidden controlled Markov chain More general model where the player may not perfectly observe the state. r = 0,s = s3 r =0 r =1 k1 k3 ? ? ? s = 34 s1 + 14 s2 ? s = s3 r =0 s = s4 R r =0 s = 14 s1 + 34 s2 ?
r =0 s = s3 k2
r =0 k4 ? ? ? s = s4
r = 0, s = s4
States K = {k1 , k2 , k3 , k4 }, Actions y, y, y, Signals: {s1 , s2 , s3 , s4 }. p0 = 1/2 δk1 + 1/2 δk2 . Playing y for a large number of stages, and then y or y depending on the stream of signals received, is ε-optimal. v (p0 ) = 1, the uniform value exists, but non existence of 0-optimal strategies.
19/34
A finite set of states K, an initial probability p0 on K, a non-empty set of actions A, and a non-empty set of signals S. Transition q : K × A → ∆f(S × K), and reward function g : K × A → [0, 1].
k1 is selected according to p0 and is not told to the player. At stage t the player selects an action at ∈ A and receives the (unobserved) payoff g(kt, at). Then a pair (st, kt+1) is selected according to q(kt, at), and st is told to the player. The new state is kt+1, and the play goes to stage t + 1.
Thm: If the set of states is finite, a hidden controlled Markov chain, played with behavioral strategies, has a uniform value, and (vn)n converges uniformly. This generalizes a result of Rosenberg, Solan and Vieille (2002) for K, A and S finite.
20/34
Write X = ∆(K). Assume the current state has been selected according to p in X and the player plays some action a in A. This defines a probability q̂(p, a) ∈ ∆f(X) on the player's posterior belief about the next state.
→ Auxiliary deterministic problem Γ(z0): new set of states Z = ∆f(X) × [0, 1], new initial state z0 = (δ_{p0}, 0), new payoff function r(u, y) = y for all (u, y) in Z, and a transition correspondence such that for every z = (u, y) in Z,
F(z) = {(H(u, f), R(u, f)), f : X → ∆f(A)},
where H(u, f) = ∑_{p∈X} u(p) (∑_{a∈A} f(p)(a) q̂(p, a)) ∈ ∆f(X),
and R(u, f) = ∑_{p∈X} u(p) ∑_{k∈K, a∈A} p^k f(p)(a) g(k, a).
Use ‖·‖₁ on X. ∆(X): Borel probabilities over X, with the weak-* topology, metrized by the Wasserstein distance: for all u, v in ∆(X), d(u, v) = sup_{f∈E₁} |u(f) − v(f)|, where E₁ is the set of 1-Lipschitz functions from X to [0, 1].
21/34
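The construction rests on the Bayesian update q̂(p, a); a sketch of how it can be computed for finite K, A, S (the tensor encoding of q and the function name are assumptions of this illustration):

```python
import numpy as np

def belief_updates(q, p, a):
    """The transition q_hat(p, a) in Delta_f(X): given current belief p on K
    and action a, return the list of (P(signal s), posterior belief after s).
    q: array (K, A, S, K) with q[k, a, s, k'] = P(signal s, next state k' | k, a)."""
    out = []
    for s in range(q.shape[2]):
        joint = p @ q[:, a, s, :]      # joint[k'] = P(signal s, next state k' | p, a)
        ps = joint.sum()
        if ps > 0:
            out.append((ps, joint / ps))
    return out

# Example: one action; the signal reveals the next state (s = k'), which is
# drawn uniformly. Starting from any belief, each posterior is a Dirac mass.
q = np.zeros((2, 1, 2, 2))
q[:, 0, 0, 0] = 0.5
q[:, 0, 1, 1] = 0.5
print(belief_updates(q, np.array([0.5, 0.5]), 0))
```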
∆f(X) is dense in the compact set ∆(X) for the weak-* topology, so Z is a precompact metric space. One can show that the graph of F is convex (use mixed actions). So the set of plays S(z) is a convex subset of Z^∞, and a minmax theorem gives w_{m,n}(z) = inf_{θ∈∆({1,…,n})} v_{θ[m,n]}(z). Hence w_{m,n} is 1-Lipschitz, as an infimum of 1-Lipschitz mappings. Conclude by part a) of the corollaries.
Open problems: 1) Does the uniform value exist with pure strategies (even if K, A, S are finite)? 2) What about sup_σ 𝔼_{p0,σ}(lim inf_n (1/n) ∑_{t=1}^n g(k_t, a_t))? (Even in the perfect-observation case; OK for leavable MDPs via Φ(r) = min{v : v ∈ B, v ≥ r}.)
22/34
5. Application to zero-sum repeated games with an informed controller
5.a. Reminder: finite "static" zero-sum games
Given a matrix A = (a_{i,j}) in ℝ^{I×J}, the value of A is (von Neumann):
max_{x∈∆(I)} min_{y∈∆(J)} ∑_{i,j} x(i) y(j) a_{i,j} = min_{y∈∆(J)} max_{x∈∆(I)} ∑_{i,j} x(i) y(j) a_{i,j}.
Introducing dynamics: → repeated games.
23/34
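The value of a matrix game can be computed by linear programming; a sketch with scipy (not part of the talk; `game_value` is a hypothetical helper):

```python
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Value of the zero-sum matrix game A (row player maximizes):
    max v  s.t.  (x^T A)_j >= v for every column j, x in Delta(I)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    shift = 1.0 - A.min()                      # shifting A keeps optimal x, shifts v
    B = A + shift
    c = np.concatenate([np.zeros(m), [-1.0]])  # variables (x, v); minimize -v
    A_ub = np.hstack([-B.T, np.ones((n, 1))])  # v - (x^T B)_j <= 0 for each column j
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=[np.append(np.ones(m), 0.0)], b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1] - shift

print(game_value([[1, -1], [-1, 1]]))  # matching pennies: value 0
```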
5.b. General zero-sum repeated games Γ(π)
• Five non-empty finite sets: states K, actions I for player 1 and J for player 2, signals C for player 1 and D for player 2.
• An initial distribution π ∈ ∆(K × C × D), a payoff function g from K × I × J to [0, 1], and a transition q from K × I × J to ∆(K × C × D).
At stage 1: (k1, c1, d1) is selected according to π; player 1 learns c1 and player 2 learns d1. Then simultaneously player 1 chooses i1 in I and player 2 chooses j1 in J. The payoff for player 1 is g(k1, i1, j1).
At any stage t ≥ 2: (kt, ct, dt) is selected according to q(kt−1, it−1, jt−1); player 1 learns ct and player 2 learns dt. Simultaneously, player 1 chooses it in I and player 2 chooses jt in J. The stage payoff for player 1 is g(kt, it, jt).
24/34
A pair of behavioral strategies (σ, τ) induces a probability over plays. The n-stage payoff for player 1 is:
γ_n^π(σ, τ) = 𝔼_{ℙπ,σ,τ} ( (1/n) ∑_{t=1}^n g(k_t, i_t, j_t) ).
The n-stage value exists: v_n(π) = sup_σ inf_τ γ_n^π(σ, τ) = inf_τ sup_σ γ_n^π(σ, τ).
Definition: The repeated game Γ(π) has a uniform value if:
• (v_n(π))_n has a limit v(π) as n goes to infinity,
• Player 1 can uniformly guarantee this limit: ∀ε > 0, ∃σ, ∃n0, ∀n ≥ n0, ∀τ, γ_n^π(σ, τ) ≥ v(π) − ε,
• Player 2 can uniformly guarantee this limit: ∀ε > 0, ∃τ, ∃n0, ∀n ≥ n0, ∀σ, γ_n^π(σ, τ) ≤ v(π) + ε.
Conjectures: does (v_n) converge? Can a fully informed player guarantee the limit?
25/34
5.c. The standard case of repeated games with incomplete information (Aumann and Maschler, 1966-67)
A finite family of matrix games (G^k)_{k∈K} in ℝ^{I×J} is given, with an initial probability p ∈ ∆(K). k is chosen according to p and told to player 1 only. Then the players repeat the matrix game G^k (actions observed).
Examples: 2 states, K = {a, b}, p = (1/2, 1/2). (Matrices written row by row, rows separated by ";".)
Ex1: G^a = (0 0; 0 −1) and G^b = (−1 0; 0 0). → P1 should fully use the info. v = 0.
Ex2: G^a = (1 0; 0 0) and G^b = (0 0; 0 1). → P1 should not use the info. The non-revealing game 1/2 G^a + 1/2 G^b = (1/2 0; 0 1/2) guarantees 1/4.
Ex3: G^a = (4 0 2; 4 0 −2) and G^b = (0 4 −2; 0 4 2). Playing CR (completely revealing) or NR (non-revealing) guarantees 0. Optimal play for P1 here: choose s = Top with probability 3/4 if k = a and with probability 1/4 if k = b, and choose s = Bottom with the remaining weights; then play s at each stage. Partial revelation of information guarantees 1 here.
26/34
Define u(q) = Val(∑_k q^k G^k) for each probability q on K.
[Graph of u for Ex3: u is piecewise linear on [0, 1], with u(0) = u(1/2) = u(1) = 0 and peaks u(1/4) = u(3/4) = 1.]
1) lim_n v_n(p) exists and can be guaranteed by P2 (v_n(p) is non-increasing in n and concave in p). P1 can guarantee u(p).
2) Once P1's strategy is fixed, the beliefs of P2 follow a martingale: given σ of player 1 and h_t ∈ (I × J)^t, the belief of P2 after h_t is p_t(σ, h_t) = (ℙ_{p,σ,τ}(k̃ = k | h_t))_{k∈K} ∈ ∆(K).
3) Splitting Lemma: P1 can generate any martingale: if p = ∑_{s∈S} λ_s p_s, there exists a transition probability x ∈ ∆(S)^K such that ∀s ∈ S, λ_s = ∑_{k∈K} p^k x^k(s), and the posterior given s satisfies p_x(·|s) = p_s.
27/34
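For Ex3, both u and cav u can be computed numerically: u(q) via the LP for the matrix-game value, and cav u(p) as the best splitting of p over sampled posteriors, which is exactly the optimization of the Splitting Lemma. A sketch under a grid discretization (the grid and helper names are assumptions of this illustration):

```python
import numpy as np
from scipy.optimize import linprog

Ga = np.array([[4, 0, 2], [4, 0, -2]], float)
Gb = np.array([[0, 4, -2], [0, 4, 2]], float)

def val(A):
    """Value of the zero-sum matrix game A, via the standard LP."""
    m, n = A.shape
    shift = 1.0 - A.min()                      # make the value positive
    B = A + shift
    c = np.concatenate([np.zeros(m), [-1.0]])  # variables (x, v); maximize v
    res = linprog(c, A_ub=np.hstack([-B.T, np.ones((n, 1))]), b_ub=np.zeros(n),
                  A_eq=[np.append(np.ones(m), 0.0)], b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1] - shift

# u(q) = Val(q * Ga + (1 - q) * Gb), sampled on a grid of priors q = P(k = a)
qs = np.linspace(0, 1, 101)
us = np.array([val(q * Ga + (1 - q) * Gb) for q in qs])

def cav_at(p):
    """cav u(p) = best convex combination of the sampled points (q_i, u(q_i))
    with mean p -- the splitting of the Splitting Lemma, as a small LP."""
    res = linprog(-us, A_eq=np.vstack([qs, np.ones_like(qs)]),
                  b_eq=[p, 1.0], bounds=[(0, None)] * len(qs))
    return -res.fun

print(val(0.5 * Ga + 0.5 * Gb))   # u(1/2) = 0: the non-revealing game
print(cav_at(0.5))                # cav u(1/2) = 1: split p on 1/4 and 3/4
```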
[Figure: the prior p in ∆(K) written as a convex combination of posteriors p1, p2, p3.] Hence P1 guarantees cav u(p).
Prop:
a) ∀T ≥ 1, (1/T) ∑_{t=0}^{T−1} 𝔼(‖p_{t+1} − p_t‖) ≤ C/√T, with C = ∑_{k∈K} √(p^k(1 − p^k)).
b) ∀t, ∀h_t ∈ H_t, 𝔼(‖p_{t+1} − p_t‖ | h_t) = 𝔼(‖σ^{k̃}(h_t) − σ̄(h_t)‖ | h_t).
This implies: v_T(p) ≤ cav u(p) + C/√T.
28/34
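Part a) can be checked by Monte Carlo on a two-state example where the informed player plays the 3/4-1/4 splitting at every stage; the simulation setup below is an illustration, not from the talk:

```python
import numpy as np

def avg_belief_variation(p0, x, T, runs=2000, seed=0):
    """Monte-Carlo estimate of (1/T) sum_{t<T} E||p_{t+1} - p_t||_1 for the
    belief martingale of P2: the state k ~ p0 is fixed once, P1 plays the
    mixed action x[k] i.i.d. at every stage, and P2 updates his belief by
    Bayes' rule on the observed actions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        k = rng.choice(len(p0), p=p0)
        p = np.array(p0, dtype=float)
        for _ in range(T):
            a = rng.choice(x.shape[1], p=x[k])
            new = p * x[:, a]
            new /= new.sum()
            total += np.abs(new - p).sum()
            p = new
    return total / (runs * T)

# For p = (1/2, 1/2): C = 2 * sqrt(1/4) = 1, so the bound is 1/sqrt(T).
x = np.array([[0.75, 0.25], [0.25, 0.75]])   # the 3/4-1/4 splitting of Ex3
T = 16
est = avg_belief_variation(np.array([0.5, 0.5]), x, T)
print(est, "<=", 1 / np.sqrt(T))
```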
Thm (Aumann and Maschler, 1966): cav u(p) = lim_T v_T(p) is the uniform value of the game. (Many extensions.)
5.d. Extension: the state follows a Markov chain, only observed by P1.
• Example where the value is difficult to compute: K = {a, b}, p = (1/2, 1/2),
M = (α 1−α; 1−α α), G^a = (1 0; 0 0) and G^b = (0 0; 0 1).
If α = 1, the value is 1/4 (Aumann and Maschler). If α ∈ [1/2, 2/3], the value is α/(4α − 1) (Hörner et al. 2006; Marino 2005 for α = 2/3). What is the value for α = 0.9?
• General case: define B_0 = lim_n M^{nL} for some L, and B_l = lim_{n→∞} M^{nL+l} for each l. The process q_t(h_t) = p_t(h_t) B_{−t} is a ℙ_{p,σ}-martingale.
Thm (R, 2006): The uniform value exists and is cav w(p B_0), where w is the value of the NR game.
29/34
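For α = 0.9 the chain is aperiodic, so L = 1 and B_0 is just lim_n M^n; a quick numerical check (illustrative, not from the talk):

```python
import numpy as np

alpha = 0.9
M = np.array([[alpha, 1 - alpha], [1 - alpha, alpha]])
# The second eigenvalue is 2*alpha - 1 = 0.8, so M^n converges geometrically.
B0 = np.linalg.matrix_power(M, 200)
print(B0)   # both rows close to (1/2, 1/2): p B0 is the uniform invariant law
```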
5.e. Application to repeated games with an informed controller. P1 can also influence the state process. Hypothesis HX: Player 1 is informed, in the sense that he can always deduce the state and player 2’s signal from his own signal. HX does not imply that P1 knows the actions played by P2.
Hypothesis HY: Player 1 controls the transition, in the sense that the marginal of the transition q on K × D does not depend on player 2’s action.
30/34
Given m ≥ 0 and n ≥ 1, define the payoffs and auxiliary value functions:
γ_{m,n}^π(σ, τ) = 𝔼_{ℙπ,σ,τ} ( (1/n) ∑_{t=m+1}^{m+n} g(k_t, i_t, j_t) ),
v_{m,n}(π) = sup_σ inf_τ γ_{m,n}^π(σ, τ) = inf_τ sup_σ γ_{m,n}^π(σ, τ).
Thm (R, 2007): Under HX and HY, the repeated game Γ(π) has a uniform value, which is: v ∗ (π) = infn≥1 supm≥0 vm,n (π) = supm≥0 infn≥1 vm,n (π).
31/34
"Natural" deterministic state space here: ∆f(X). Variant: does the uniform value exist under HB (q(k, i, j) does not depend on j) and HA (player 1 is more informed than player 2)? The natural state space here should be {(u, v) ∈ ∆f(X) × ∆f(X), u ≼ v}, where u ≼ v if u is a sweeping of v: u(f) ≤ v(f) for each concave f in C(X).
32/34
5.f. A question: the state follows a Markov chain observed by P1, but payoffs are non-zero-sum (with E. Solan and N. Vieille)
Example: K = {0, 1, 2}. From x, move to x + 1 with probability 3/4 and to x − 1 with probability 1/4 (mod 3).
Question: given an ergodic Markov chain on a finite set K, characterize the set C of copulas µ ∈ ∆(K × K) such that there exists a process (s_n, t_n)_n with values in K × K satisfying:
1) (s_n)_n and (t_n)_n have the same law, the law of the Markov chain,
2) for each n, the law of (s_n, t_n) is µ,
3) given s_1, …, s_n, the r.v. t_n is independent of (s_t)_{t>n} (or even: (s_n, t_n)_n a Markov chain?)
33/34
Compute the set C for the example K = {0, 1, 2} (from x, move to x + 1 with probability 3/4 and to x − 1 with probability 1/4). Writing matrices row by row:
(0, 0, 1/3; 1/3, 0, 0; 0, 1/3, 0) ∈ C, but (0, 1/3, 0; 1/3, 0, 0; 0, 0, 1/3) ∉ C.
Bet: C = {(α, γ, β; β, α, γ; γ, β, α) : α + β + γ = 1/3, α ≥ 0, β ≥ 0, γ ≥ 0}.
34/34
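The example matrices and the bet are consistent with the shift couplings t_n = s_n + c (mod 3): the transition of the cyclic chain is rotation-invariant, so the shifted process is again a copy of the chain, t_n is a function of s_n, and the joint law puts mass 1/3 on a shifted diagonal. A simulation sketch (function name and parameters are illustrative):

```python
import numpy as np

def joint_law_of_shift(c, T=60_000, seed=0):
    """Estimate the joint law of (s_n, s_n + c mod 3) along one trajectory of
    the cyclic chain on K = {0, 1, 2} (x -> x+1 w.p. 3/4, x -> x-1 w.p. 1/4).
    The shift t_n = s_n + c is again a copy of the chain, so the resulting
    matrix is an element of C."""
    rng = np.random.default_rng(seed)
    s = 0
    counts = np.zeros((3, 3))
    for _ in range(T):
        s = (s + rng.choice([1, -1], p=[0.75, 0.25])) % 3
        counts[s, (s + c) % 3] += 1
    return counts / T

mu = joint_law_of_shift(2)
print(mu)   # mass ~1/3 on entries (0,2), (1,0), (2,1): the first example matrix
```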
References
Aumann, R.J. and M. Maschler (1995): Repeated Games with Incomplete Information. With the collaboration of R. Stearns. Cambridge, MA: MIT Press.
Araposthathis, A., V. Borkar, E. Fernández-Gaucherand, M. Ghosh and S. Marcus (1993): Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM Journal on Control and Optimization, 31, 282-344.
Arisawa, M. and P.-L. Lions (1998): On ergodic stochastic control. Communications in Partial Differential Equations, 23, 2187-2217.
Bettiol, P. (2005): On ergodic problem for Hamilton-Jacobi-Isaacs equations. ESAIM: COCV, 11, 522-541.
Blackwell, D. (1962): Discrete dynamic programming. Annals of Mathematical Statistics, 33, 719-726.
Coulomb, J.-M. (2003): Games with a recursive structure. Based on a lecture of J.-F. Mertens. Chapter 28 in Stochastic Games and Applications, A. Neyman and S. Sorin eds, Kluwer Academic Publishers.
Dubins, L. and L. Savage (1965): How to Gamble if You Must: Inequalities for Stochastic Processes. McGraw-Hill. 2nd edition 1976, Dover, New York.
Dynkin, E.B. and A.A. Yushkevich (1979): Controlled Markov Processes. Springer.
Hernández-Lerma, O. and J.B. Lasserre (1996): Long-run average-cost problems. Discrete-Time Markov Control Processes, Ch. 5, 75-124.
Lehrer, E. and D. Monderer (1994): Discounting versus averaging in dynamic programming. Games and Economic Behavior, 6, 97-113.
Lehrer, E. and S. Sorin (1992): A uniform Tauberian theorem in dynamic programming. Mathematics of Operations Research, 17, 303-307.
Lippman, S. (1969): Criterion equivalence in discrete dynamic programming. Operations Research, 17, 920-923.
Maitra, A. and W. Sudderth (1996): Discrete Gambling and Stochastic Games. Applications of Mathematics, Stochastic Modelling and Applied Probability, Springer.
Mertens, J.-F. (1987): Repeated games. Proceedings of the International Congress of Mathematicians, Berkeley 1986, 1528-1577. American Mathematical Society.
Mertens, J.-F. and A. Neyman (1981): Stochastic games. International Journal of Game Theory, 10, 53-66.
Monderer, D. and S. Sorin (1993): Asymptotic properties in dynamic programming. International Journal of Game Theory, 22, 1-11.
Quincampoix, M. and J. Renault (2009): On the existence of a limit value in some non expansive optimal control problems. arXiv:0904.3653.
Quincampoix, M. and F. Watbled (2003): Averaging methods for discontinuous Mayer's problem of singularly perturbed control systems. Nonlinear Analysis, 54, 819-837.
Renault, J. (2006): The value of Markov chain games with lack of information on one side. Mathematics of Operations Research, 31, 490-512.
Renault, J. (2010): Uniform value in dynamic programming. arXiv:0803.2758. To appear in JEMS (Journal of the European Mathematical Society).
Renault, J.: The value of repeated games with an informed controller. arXiv:0803.3345.
Rosenberg, D., E. Solan and N. Vieille (2002): Blackwell optimality in Markov decision processes with partial observation. The Annals of Statistics, 30, 1178-1193.
Rosenberg, D., E. Solan and N. Vieille (2004): Stochastic games with a single controller and incomplete information. SIAM Journal on Control and Optimization, 43, 86-110.
Sorin, S. (1984): Big match with lack of information on one side (Part I). International Journal of Game Theory, 13, 201-255.
Sorin, S. and S. Zamir (1985): A 2-person game with lack of information on 1 and 1/2 sides. Mathematics of Operations Research, 10, 17-23.
Sorin, S. (2002): A First Course on Zero-Sum Repeated Games. Mathématiques et Applications, Springer.
Thank you for your attention!
34/34