Notation for Stochastic Dynamic Programming (Markov Decision Processes, Approximate Dynamic Programming, Reinforcement Learning)
Notation, by source (stages, states, actions, policies, transitions, costs, discounting, Q-values, value functions, and the Bellman operator):

• Bertsekas [2007]
  Stages: k, from N down to 0
  State: i, i_k
  Action space: U(i); action: u
  Policy: µ_k(i), π
  Transitions: p_ij(µ_k(i))
  Cost: g(i, u, j); terminal cost: G(i_N)
  Discount: α
  Value (policy): J_k^π(i); value (optimal): J_k^*(i)
  Bellman operator: T

• Sutton and Barto [1998]
  Stages: t, from 1 to T
  State: s; action: a
  Policy: π(s, a), π
  Transitions: P_ss'^a
  Cost: R_ss'^a; terminal cost: r_T
  Discount: γ
  Q-value (policy): Q^π(s, a); Q-value (optimal): Q^*(s, a)
  Value (policy): V^π(s); value (optimal): V^*(s)

• Puterman [1994]
  Stages: t, from 1 to N
  State space: S; state: s
  Action space: A = ∪_{s ∈ S} A_s; action: a
  Policy: π, d_t^{MD}(s)
  Transitions: p_t(· | s, a)
  Cost: r_t(s, a); terminal cost: r_N(s)
  Discount: λ
  Value (policy): u_t^π; value (optimal): u_t^*
  Bellman operator: L, ℒ

• Powell [2011]
  Stages: t, from 1 to T
  State space: S; state: s, S_t
  Action space: A; action: a
  Policy: π
  Transitions: P(s' | S_t, a_t)
  Cost: C_t(S_t, a_t); terminal cost: V_T(S_T)
  Discount: γ
  Q-value: Q(S^n, a)
  Value (policy): V_t^π(S_t); value (optimal): V_t(S_t)
  Bellman operator: M
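The four notations all describe the same underlying object. As a book-neutral reference point, here is a minimal sketch of those shared ingredients as a Python container; every identifier is hypothetical and taken from none of the four books, and the comments map each field back to the authors' symbols:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Book-neutral container for the ingredients listed above.
# All field names are hypothetical illustrations, not any author's API.
@dataclass
class FiniteHorizonMDP:
    states: Sequence          # i, i_k (Bertsekas); s (Sutton & Barto, Puterman); s, S_t (Powell)
    actions: Callable         # admissible actions per state: U(i); A_s; A
    transition: Callable      # p_ij(u); P_ss'^a; p_t(. | s, a); P(s' | S_t, a_t)
    cost: Callable            # g(i, u, j); R_ss'^a; r_t(s, a); C_t(S_t, a_t)
    terminal_cost: Callable   # G(i_N); r_T; r_N(s); V_T(S_T)
    horizon: int              # final stage: N or T (Bertsekas counts k down to 0)
    discount: float           # alpha; gamma; lambda
```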
Optimal Value Function

Each line below is the same Bellman optimality recursion written in that book's notation; a runnable sketch follows the list.

• Bertsekas [2007]:
  \[ J_k^*(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \left[ g(i, u, j) + \alpha J_{k-1}^*(j) \right] \]

• Sutton and Barto [1998]:
  \[ V^*(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^*(s') \right] \]

• Puterman [1994]:
  \[ u_t^*(s_t) = \max_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}^*(j) \right\} \]

• Powell [2011]:
  \[ V_t(S_t) = \max_{a_t} \left\{ C_t(S_t, a_t) + \gamma \sum_{s' \in S} P(s' \mid S_t, a_t) \, V_{t+1}(s') \right\} \]
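Because the four recursions are the same backup, one implementation covers them all. Below is a minimal value-iteration sketch in Sutton-and-Barto-style notation; the two-state, two-action MDP is hypothetical, invented purely for illustration and drawn from none of the books:

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions.
# P[a, s, s2] plays the role of P_ss'^a and R[a, s, s2] of R_ss'^a.
gamma = 0.9  # discount: Sutton & Barto's gamma, Bertsekas's alpha, Puterman's lambda
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],  # transition probabilities under action 0
    [[0.5, 0.5], [0.3, 0.7]],  # transition probabilities under action 1
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],  # rewards under action 0
    [[0.5, 0.5], [1.0, 0.0]],  # rewards under action 1
])

V = np.zeros(2)  # V(s), initialized arbitrarily
for _ in range(1000):
    # Bellman optimality backup:
    # V(s) <- max_a sum_s' P_ss'^a [ R_ss'^a + gamma * V(s') ]
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
    V_new = Q.max(axis=0)
    if np.abs(V_new - V).max() < 1e-10:  # sup-norm stopping test
        V = V_new
        break
    V = V_new

print("V* ≈", V)                           # optimal state values
print("greedy policy:", Q.argmax(axis=0))  # a*(s) = argmax_a Q(s, a)
```

Run backward for t = T−1, …, 1 from a given terminal value instead of iterating to a fixed point, the same backup yields the finite-horizon recursions of Bertsekas, Puterman, and Powell.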
References

D.P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 2. Athena Scientific Optimization and Computation Series. Athena Scientific, 2007. ISBN 9781886529304. URL http://books.google.com/books?id=eL01YAAACAAJ.

W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. John Wiley & Sons, 2011. ISBN 9781118029152. URL http://books.google.com/books?id=VBuZhne7pmwC.

M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley-Interscience, 1994. ISBN 9780471727828. URL http://books.google.com/books?id=Y-gmAQAAIAAJ.

R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998. ISBN 9780262193986. URL http://books.google.com/books?id=CAFR6IBF4xYC.
Tim Hopper – [email protected] – StiglerDiet.com