Planning to Maximize Reward:
Markov Decision Processes
Brian C. Williams
16.410
December 8th, 2003
Slides adapted from:
Manuela Veloso,
Reid Simmons, &
Tom Mitchell, CMU
Reading and Assignments

- This lecture is based on the development in:
  "Machine Learning" by Tom Mitchell,
  Chapter 13: Reinforcement Learning.
- Markov Decision Processes:
  Read AIMA, Chapter 17, Sections 1–3.
How Might a Mouse Search a Maze for Cheese?

- State Space Search?
- As a Constraint Satisfaction Problem?
- Goal-directed Planning?
- As a Rule or Production System?
- What is missing?

[Figure: a maze with cheese at the goal location.]

Ideas in this lecture

- Objective is to accumulate rewards, rather than goal states.
- Task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.
MDP Examples: TD-Gammon [Tesauro, 1995]
Learning Through Reinforcement

Learns to play Backgammon.
- States: board configurations (10^20)
- Actions: moves
- Rewards:
  +100 if win
  -100 if lose
  0 for all other states
Trained by playing 1.5 million games against itself.
Currently roughly equal to the best human player.

MDP Examples: Aerial Robotics [Feron et al.]
Computing a Solution from a Continuous Model

[Images courtesy of Eric Feron.]
Markov Decision Processes

- Motivation
- What are Markov Decision Processes (MDPs)?
  - Models
  - Lifetime Reward
  - Policies
- Computing Policies From a Model
- Summary

MDP Problem

[Figure: the agent observes state s_t and reward r_t from the environment and responds with action a_t, producing the sequence s_0, r_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, s_3, ...]

Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
MDP Problem: Model

[Same agent/environment figure as above; the focus here is the model of the environment.]

Markov Decision Processes (MDPs)

Model:
- Finite set of states, S
- Finite set of actions, A
- (Probabilistic) state transitions, δ(s,a)
- Reward for each state and action, R(s,a)

Process:
- Observe state s_t in S
- Choose action a_t in A
- Receive immediate reward r_t
- State changes to s_{t+1}

Example:
[Figure: grid world with goal state G; legal transitions shown; labeled transitions have reward 10; reward on unlabeled transitions is 0.]
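As a concrete (and hypothetical) illustration of this model, the sketch below writes a small deterministic MDP as plain Python dictionaries. The state names, legal transitions, and reward values are placeholders standing in for the grid-world figure, not a transcription of it.

```python
# A minimal sketch of a deterministic MDP model in Python.
# State/action names and rewards are illustrative assumptions.

# Finite set of states, S, and finite set of actions, A
S = ["s1", "s2", "s3", "G"]
A = ["right", "up"]

# Deterministic state transitions, delta(s, a) -> next state
# (pairs that are absent are illegal transitions)
delta = {
    ("s1", "right"): "s2",
    ("s2", "right"): "G",
    ("s2", "up"):    "s3",
    ("s3", "up"):    "G",
}

# Reward for each state and action, r(s, a); unlabeled transitions earn 0
labeled_rewards = {
    ("s2", "right"): 10,
    ("s3", "up"):    10,
}

def reward(s, a):
    """Immediate reward for taking action a in state s (0 if unlabeled)."""
    return labeled_rewards.get((s, a), 0)
```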
MDP Environment Assumptions

- Markov Assumption: the next state and reward are a function only of the current state and action:
  s_{t+1} = δ(s_t, a_t)
  r_t = r(s_t, a_t)
- Uncertain and Unknown Environment: δ and r may be nondeterministic and unknown.

MDP Nondeterministic Example

[Figure: a four-state career MDP with states S1 (Unemployed), S2 (Industry), S3 (Grad School), and S4 (Academia) and actions R (Research) and D (Development); each action's outcome is probabilistic, with transition probabilities of 0.1, 0.9, and 1.0 on the arcs.]

Today we only consider the deterministic case.
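Although the rest of the lecture sticks to the deterministic case, it is worth noting how the nondeterministic δ of this example could be represented: as a distribution over successor states rather than a single state. The probabilities and arcs in the sketch below are illustrative assumptions, not a transcription of the figure.

```python
import random

# Sketch of a nondeterministic transition model, in the spirit of the career
# example. Each (state, action) maps to a list of (probability, next_state);
# the particular entries below are assumed for illustration only.
delta_prob = {
    ("S3_GradSchool", "R"): [(0.9, "S4_Academia"), (0.1, "S3_GradSchool")],
    ("S2_Industry",   "D"): [(1.0, "S2_Industry")],
}

def sample_next_state(s, a):
    """Sample a successor state according to the transition probabilities."""
    outcomes = delta_prob[(s, a)]
    p, cumulative = random.random(), 0.0
    for prob, s_next in outcomes:
        cumulative += prob
        if p <= cumulative:
            return s_next
    return outcomes[-1][1]  # guard against floating-point round-off
```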
MDP Problem: Lifetime Reward

[The agent/environment figure repeats: given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.]
Lifetime Reward

- Finite horizon: rewards accumulate for a fixed period.
  $100K + $100K + $100K = $300K
- Infinite horizon: assume rewards accumulate forever.
  $100K + $100K + . . . = infinity
- Discounting: future rewards are not worth as much as immediate rewards (a bird in hand …).
  Introduce a discount factor γ:
  $100K + γ $100K + γ^2 $100K + . . . converges.
  Will make the math work.

MDP Problem

[Same agent/environment figure.] Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward:

  V = r_0 + γ r_1 + γ^2 r_2 + . . .
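To make the discounting arithmetic concrete, the short sketch below sums the $100K-per-step stream for an assumed discount factor γ = 0.9 (the slide does not fix a value) and compares a long truncated sum with the geometric-series closed form r/(1 - γ).

```python
# Discounted lifetime reward V = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# Sketch with an assumed discount factor gamma = 0.9.

gamma = 0.9
reward_per_step = 100  # $100K each step

# Truncate the infinite sum after enough terms for it to have converged.
V = sum((gamma ** i) * reward_per_step for i in range(1000))

# For a constant reward r, the geometric series converges to r / (1 - gamma).
closed_form = reward_per_step / (1 - gamma)

print(V, closed_form)  # both are approximately 1000.0, i.e. $1,000K lifetime reward
```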
MDP Problem: Policy

[Same agent/environment figure.] Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r_0 + γ r_1 + γ^2 r_2 + . . .

Assume a deterministic world.

- Policy π : S → A
  Selects an action for each state.
  [Figure: grid world with goal G and rewards of 10; arrows show one policy π.]
- Optimal policy π* : S → A
  Selects the action for each state that maximizes lifetime reward.
  [Figure: the same grid world; arrows show the optimal policy π*.]
- There are many policies; not all are necessarily optimal.
- There may be several optimal policies.
  [Figure: three grid worlds with goal G and rewards of 10, showing alternative policies.]

Markov Decision Processes

- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions through Value Iteration
  - An Alternative: Policy Iteration (appendix)
- Summary
Value Function V^π for a Given Policy π

- V^π(s_t) is the accumulated lifetime reward resulting from starting in state s_t and repeatedly executing policy π:

  V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + . . .
  V^π(s_t) = Σ_i γ^i r_{t+i}

  where r_t, r_{t+1}, r_{t+2}, . . . are generated by following π starting at s_t.

[Figure: the grid world annotated with V^π values for one policy π, assuming γ = 0.9.]
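In a deterministic model, V^π can be approximated directly from this definition by rolling the policy forward and summing discounted rewards. The sketch below assumes the dictionary-style `delta` and `reward` from the earlier model sketch and a hypothetical policy; it is an illustration, not code from the lecture.

```python
def value_of_policy(s, policy, delta, reward, gamma=0.9, horizon=100):
    """Approximate V^pi(s) = sum_i gamma^i * r_{t+i} by rolling the
    deterministic model forward for a finite number of steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if s not in policy:                # absorbing state, e.g. the goal
            break
        a = policy[s]                      # action chosen by the policy
        total += discount * reward(s, a)   # accumulate discounted reward
        s = delta[(s, a)]                  # deterministic next state
        discount *= gamma
    return total

# Hypothetical usage with the toy model sketched earlier:
# pi = {"s1": "right", "s2": "right"}
# value_of_policy("s1", pi, delta, reward)   # -> 0 + 0.9 * 10 = 9.0
```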
An Optimal Policy π* Given Value Function V*

Idea: given state s,
1. Examine the possible actions a_i in state s.
2. Select the action a_i with the greatest lifetime reward.

The lifetime reward Q(s, a_i) is:
- the immediate reward for taking the action, r(s,a), ...
- plus the lifetime reward starting in the target state, V(δ(s,a)), ...
- discounted by γ.

  π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]

Requires:
- Value function
- Environment model:
  δ : S x A → S
  r : S x A → ℜ

[Figure: the grid world with its values and policy, as before.]
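This one-step lookahead is straightforward to code. The sketch below extracts the greedy policy from a given value function V over the dictionary-style deterministic model used in the earlier sketches; it is a sketch under those assumptions rather than the lecture's own implementation.

```python
def greedy_policy(V, states, actions, delta, reward, gamma=0.9):
    """pi*(s) = argmax_a [ r(s,a) + gamma * V(delta(s,a)) ],
    taking the argmax over the legal actions in each state."""
    pi = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in delta:        # skip illegal transitions
                continue
            q = reward(s, a) + gamma * V[delta[(s, a)]]
            if q > best_q:
                best_a, best_q = a, q
        if best_a is not None:
            pi[s] = best_a
    return pi
```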
Example: Mapping Value Function to Policy

Given the model + V, with γ = 0.9:

[Figure: grid world with goal G; transitions into G have reward 100, all others 0; the cells are labeled with values V = 90, 100, 0 (G) in the top row and 81, 90, 100 in the bottom row.]

  π(s) = argmax_a [r(s,a) + γ V(δ(s,a))]

- For one state the agent selects the optimal action among candidates a and b:
  a: 0 + 0.9 x 100 = 90
  b: 0 + 0.9 x 81 = 72.9
  → select a

- For the next state the agent selects among candidates a and b:
  a: 100 + 0.9 x 0 = 100
  b: 0 + 0.9 x 90 = 81
  → select a

- For a state with three candidates a, b, and c:
  a: ?
  b: ?
  c: ?
  → select ?
Markov Decision Processes

- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions through Value Iteration
  - An Alternative: Policy Iteration
- Summary
Value Function V* for an Optimal Policy π*: Example

[Figure: a two-state example with states S_A and S_B and actions A and B, with rewards R_A and R_B; the optimal value of each state is expanded as a max over a one-level-deeper lookahead tree, e.g. V*_2(S_A) = max { R_A + γ V*_1(S_A), R_B + γ V*_1(S_B) }.]

- Optimal value function for a one-step horizon:
  V*_1(s) = max_{a_i} [r(s,a_i)]

- Optimal value function for a two-step horizon:
  V*_2(s) = max_{a_i} [r(s,a_i) + γ V*_1(δ(s,a_i))]

  This is an instance of the Dynamic Programming Principle:
  - Reuse shared sub-results
  - Exponential saving

- Optimal value function for an n-step horizon:
  V*_n(s) = max_{a_i} [r(s,a_i) + γ V*_{n-1}(δ(s,a_i))]

- Optimal value function for an infinite horizon:
  V*(s) = max_{a_i} [r(s,a_i) + γ V*(δ(s,a_i))]
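The n-step recursion translates almost line for line into code. The sketch below computes V*_n over the dictionary-style deterministic model assumed in the earlier sketches, reusing V*_{n-1} at each step exactly as the dynamic-programming principle suggests; it is illustrative rather than the lecture's code.

```python
def finite_horizon_values(n, states, actions, delta, reward, gamma=0.9):
    """Compute V*_n via the recursion
    V*_k(s) = max_a [ r(s,a) + gamma * V*_{k-1}(delta(s,a)) ],
    starting from V*_0(s) = 0 for all s."""
    V = {s: 0.0 for s in states}            # V*_0
    for _ in range(n):
        V_next = {}
        for s in states:
            q_values = [reward(s, a) + gamma * V[delta[(s, a)]]
                        for a in actions if (s, a) in delta]
            V_next[s] = max(q_values) if q_values else 0.0
        V = V_next                          # reuse the shared sub-result V*_{k-1}
    return V
```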
Solving MDPs by Value Iteration

Insight: we can calculate the optimal values iteratively using Dynamic Programming.

Algorithm:
- Iteratively calculate values using Bellman's Equation:
  V*_{t+1}(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))]
- Terminate when values are "close enough":
  |V*_{t+1}(s) - V*_t(s)| < ε
- Agent selects the optimal action by one-step lookahead on V*:
  π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]

Convergence of Value Iteration

- If we terminate when values are "close enough":
  |V_{t+1}(s) - V_t(s)| < ε
  then:
  max_{s in S} |V_{t+1}(s) - V*(s)| < 2εγ/(1 - γ)
- Converges in polynomial time.
- Convergence is guaranteed even if updates are performed infinitely often, but asynchronously and in any order.
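Putting the Bellman update and the termination test together gives a complete value-iteration routine. The sketch below again assumes the dictionary-style deterministic model from the earlier sketches and returns both the converged values and the greedy policy extracted from them; it is illustrative, not the lecture's code.

```python
def value_iteration(states, actions, delta, reward, gamma=0.9, epsilon=1e-3):
    """Iterate V_{t+1}(s) <- max_a [ r(s,a) + gamma * V_t(delta(s,a)) ]
    until max_s |V_{t+1}(s) - V_t(s)| < epsilon, then extract the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        V_next, max_change = {}, 0.0
        for s in states:
            q_values = [reward(s, a) + gamma * V[delta[(s, a)]]
                        for a in actions if (s, a) in delta]
            V_next[s] = max(q_values) if q_values else 0.0
            max_change = max(max_change, abs(V_next[s] - V[s]))
        V = V_next
        if max_change < epsilon:            # values are "close enough"
            break
    # One-step lookahead: pi*(s) = argmax_a [ r(s,a) + gamma * V(delta(s,a)) ]
    pi = {}
    for s in states:
        candidates = [(reward(s, a) + gamma * V[delta[(s, a)]], a)
                      for a in actions if (s, a) in delta]
        if candidates:
            pi[s] = max(candidates)[1]
    return V, pi
```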
Example of Value Iteration

  V*_{t+1}(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))],  with γ = 0.9

[Figure: the grid world with goal G; transitions into G have reward 100, all others 0. Successive slides fill in V*_t and V*_{t+1} cell by cell.]

- With V*_t(s) = 0 everywhere, a state far from G stays at 0:
  a: 0 + 0.9 x 0 = 0
  b: 0 + 0.9 x 0 = 0
  → Max = 0
- A state adjacent to G picks up the 100 reward:
  a: 100 + 0.9 x 0 = 100
  b: 0 + 0.9 x 0 = 0
  → Max = 100
- On later sweeps the value propagates backwards one step at a time:
  0 + 0.9 x 100 = 90, then 0 + 0.9 x 90 = 81.
- The values converge to:

  V*:  90  100    0 (G)
       81   90  100
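As a check, the grid in this example can be encoded and fed to the value_iteration sketch given earlier. The set of legal moves below is an assumption (moves between adjacent cells in both directions, with G absorbing and a reward of 100 for entering G), chosen because it reproduces the values shown on the slides.

```python
# Usage sketch on a 2 x 3 grid like the one in the example; the exact set of
# legal moves is an assumption, not a transcription of the figure.
cells = [(r, c) for r in range(2) for c in range(3)]
G = (0, 2)
delta = {}
for (r, c) in cells:
    if (r, c) == G:
        continue                                   # G is absorbing
    for a, (dr, dc) in {"up": (-1, 0), "down": (1, 0),
                        "left": (0, -1), "right": (0, 1)}.items():
        nxt = (r + dr, c + dc)
        if nxt in cells:
            delta[((r, c), a)] = nxt

def reward(s, a):
    return 100 if delta.get((s, a)) == G else 0

V, pi = value_iteration(cells, ["up", "down", "left", "right"],
                        delta, reward, gamma=0.9)
# V converges to 90, 100, 0 (G) / 81, 90, 100, matching the slides.
```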
Markov Decision Processes

- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions through Value Iteration
  - An Alternative: Policy Iteration (appendix)
- Summary

Policy Iteration

Idea: iteratively improve the policy.

1. Policy Evaluation: given a policy π_i, calculate V_i = V^{π_i}, the utility of each state if π_i were to be executed.
2. Policy Improvement: calculate a new maximum-expected-utility policy π_{i+1} using one-step lookahead based on V_i.

- π_i improves at every step, converging when π_i = π_{i+1}.
- Computing V_i is simpler than for value iteration (no max):
  V_{t+1}(s) ← r(s, π_i(s)) + γ V_t(δ(s, π_i(s)))
  - Solve linear equations in O(N^3), or
  - Solve iteratively, similar to value iteration.
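For comparison with value iteration, here is a rough sketch of the two alternating steps over the same dictionary-style deterministic model assumed in the earlier sketches, using the simpler iterative form of policy evaluation; it is an illustration, not the lecture's code.

```python
def policy_iteration(states, actions, delta, reward, gamma=0.9, eval_sweeps=50):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    # Start from an arbitrary policy: the first legal action in each state.
    pi = {s: next(a for a in actions if (s, a) in delta)
          for s in states if any((s, a) in delta for a in actions)}
    while True:
        # 1. Policy evaluation (iterative, no max):
        #    V_{t+1}(s) <- r(s, pi(s)) + gamma * V_t(delta(s, pi(s)))
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: (reward(s, pi[s]) + gamma * V[delta[(s, pi[s])]])
                    if s in pi else 0.0
                 for s in states}
        # 2. Policy improvement: one-step lookahead on V
        new_pi = {}
        for s in states:
            candidates = [(reward(s, a) + gamma * V[delta[(s, a)]], a)
                          for a in actions if (s, a) in delta]
            if candidates:
                new_pi[s] = max(candidates)[1]
        if new_pi == pi:        # converged: pi_{i+1} = pi_i
            return pi, V
        pi = new_pi
```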
Markov Decision Processes

- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Iteration
  - Policy Iteration
- Summary

Markov Decision Processes (MDPs)

Model:
- Finite set of states, S
- Finite set of actions, A
- Probabilistic state transitions, δ(s,a)
- Reward for each state and action, R(s,a)

Process:
- Observe state s_t in S
- Choose action a_t in A
- Receive immediate reward r_t
- State changes to s_{t+1}

Deterministic Example:
[Figure: grid world with goal state G and transition rewards of 10, together with the agent/environment sequence s_0, r_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, s_3, ...]

Crib Sheet: MDPs by Value Iteration

Insight: we can calculate the optimal values iteratively using Dynamic Programming.

Algorithm:
- Iteratively calculate values using Bellman's Equation:
  V*_{t+1}(s) ← max_a [r(s,a) + γ V*_t(δ(s,a))]
- Terminate when values are "close enough":
  |V*_{t+1}(s) - V*_t(s)| < ε
- Agent selects the optimal action by one-step lookahead on V*:
  π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
Ideas in this lecture

- Objective is to accumulate rewards, rather than goal states.
- Objectives are achieved along the way, rather than at the end.
- Task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.

How Might a Mouse Search a Maze for Cheese?

- By Value Iteration?
- What is missing?

[Figure: the maze with cheese, as before.]