Learning to Act Optimally: Reinforcement Learning
Brian C. Williams
16.410
December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
TD-Gammon [Tesauro, 1995]
Learning Through Reinforcement
• Learns to play Backgammon
• States: board configurations (roughly 10^20)
• Actions: moves
• Rewards:
  – +100 if win
  – -100 if lose
  – 0 for all other states
• Trained by playing 1.5 million games against itself.
• Currently roughly equal to the best human player.
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  – Q values
  – Q learning
• Summary
• Appendix: Nondeterministic MDPs

Reinforcement Learning Problem

[Figure: the agent–environment loop. At each step the agent observes state s_t and reward r_t from the environment and responds with action a_t: s_0 —(a_0, r_0)→ s_1 —(a_1, r_1)→ s_2 —(a_2, r_2)→ s_3 … Goal: learn to choose actions that maximize r_0 + γ r_1 + γ^2 r_2 + …]

Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn: an action policy π : S → A that maximizes life reward
    r_0 + γ r_1 + γ^2 r_2 + …
from any start state, with discount 0 ≤ γ < 1.

Note:
• Unsupervised learning
• Delayed reward
• Model not known
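To make the life-reward objective concrete, here is a minimal sketch (not from the slides) that evaluates r_0 + γ r_1 + γ^2 r_2 + … for a finite reward sequence; the reward values and γ = 0.9 are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Life reward r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative: three zero-reward steps, then a terminal reward of 100.
print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 = 72.9
```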
How About Learning Value Functions?
1. Have the agent learn the value function V^{π*}, denoted V*.
2. The agent then selects the optimal action by one-step lookahead on V*:
       π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Problem:
• This works well if the agent knows the environment model:
  – δ : S × A → S
  – r : S × A → ℝ
• With no model, the agent can't choose an action from V*.

Eliminating the Model with Q Functions
       π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Key idea:
• Define a function like V* that encapsulates δ and r:
       Q(s, a) = r(s, a) + γ V*(δ(s, a))
• Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
       π*(s) = argmax_a Q(s, a)
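A minimal sketch of the point above: once a Q table is in hand, action selection is a pure table lookup, π*(s) = argmax_a Q(s,a), with no reference to δ or r. The state/action names and Q values below are hypothetical.

```python
def greedy_action(Q, state, actions):
    """pi*(s) = argmax_a Q(s, a): pick the action with the largest learned Q value."""
    return max(actions, key=lambda a: Q[(state, a)])

# Hypothetical learned table for one state with two actions:
Q = {("s1", "left"): 72.0, ("s1", "right"): 90.0}
print(greedy_action(Q, "s1", ["left", "right"]))   # -> 'right'
```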
How Do We Learn Q?
       Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
• Need to eliminate V* in the update rule.
• Note that Q and V* are closely related:
       V*(s) = max_{a'} Q(s, a')
• Substituting Q for V*:
       Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s', a')

Example – Q-Learning Update
With γ = 0.9:
       Q̂(s_1, a_right) ← r(s_1, a_right) + γ max_{a'} Q̂(s_2, a')
                       ← 0 + 0.9 max{63, 81, 100}
                       ← 90

[Figure: grid diagrams before and after the update; the arrows out of s_2 carry Q̂ values 63, 81, and 100, other arrows toward the reward state R carry 100, 81, 72, 63, and the updated arrow from s_1 is labeled 90.]

Note: if rewards are non-negative:
• For all s, a, n:  Q̂_n(s, a) ≤ Q̂_{n+1}(s, a)
• For all s, a, n:  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
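The single update above can be checked directly; a sketch, assuming the three arrows out of s_2 carry the values 63, 81, and 100 shown in the figure (the action names are placeholders).

```python
gamma = 0.9
q_s2 = {"a1": 63.0, "a2": 81.0, "a3": 100.0}   # placeholder names for the three arrows out of s_2
r = 0.0                                        # immediate reward for taking a_right in s_1
q_s1_right = r + gamma * max(q_s2.values())
print(q_s1_right)                              # 0 + 0.9 * 100 = 90.0
```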
Q-Learning Iterations
• The agent starts at the top-left corner and moves clockwise around the perimeter;
  initially all values in the Q̂ table are zero; γ = 0.8.
• Each step applies Q̂(s, a) ← r + γ max_{a'} Q̂(s', a').

[Figure: 2 × 3 grid world with states s1 s2 s3 on the top row and s4 s5 s6 on the bottom row; G marks the absorbing goal state s5, and moves into it receive reward 10.]

                  Q̂(s1,E)   Q̂(s2,E)   Q̂(s3,S)   Q̂(s4,W)
  After pass 1:   0          0          0          10    [r + γ max{Q̂(s5,loop)} = 10 + 0.8 × 0]
  After pass 2:   0          0          8          10    [r + γ max{Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0}]
  After pass 3:   0          6.4        8          10    [r + γ max{Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8}]

Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
       Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'
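A minimal tabular Q-learning sketch that reproduces the iterations above. The grid geometry is only partly recoverable from the slide, so the transition model, action names (E, S, W, loop), and fixed clockwise route below are assumptions chosen to be consistent with the table: s1 →E s2 →E s3 →S s4 →W s5, with reward 10 for the move into the goal s5. Only the moves along the route are modeled; the extra zero-valued actions in the table (e.g. Q̂(s4,N)) don't change the max and are omitted.

```python
from collections import defaultdict

GAMMA = 0.8
ROUTE = [("s1", "E"), ("s2", "E"), ("s3", "S"), ("s4", "W")]    # one fixed clockwise pass
DELTA = {("s1", "E"): "s2", ("s2", "E"): "s3",
         ("s3", "S"): "s4", ("s4", "W"): "s5"}                  # assumed deterministic transitions
REWARD = {("s4", "W"): 10.0}                                    # reward 10 for the move into the goal s5

Q = defaultdict(float)                                          # Q-hat table, initially all zeros

def actions_from(state):
    """Actions modeled as available in a state; the goal s5 only loops on itself."""
    acts = [a for (s, a) in DELTA if s == state]
    return acts if acts else ["loop"]

for episode in range(3):                                        # three clockwise passes
    for s, a in ROUTE:
        s_next = DELTA[(s, a)]
        r = REWARD.get((s, a), 0.0)
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions_from(s_next))
    print({k: v for k, v in Q.items() if v > 0})
# pass 1: {('s4', 'W'): 10.0}
# pass 2: adds ('s3', 'S'): 8.0
# pass 3: adds ('s2', 'E'): 6.4
```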
Discussion
• How should the learning agent use the intermediate Q̂ values?
  – Exploration vs. exploitation
• Scaling up in the size of the state space
  – Function approximators (neural net instead of a table)
  – Reuse, use of macros

Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  – Q values
  – Q learning
• Summary
• Appendix: Nondeterministic MDPs
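The slides name the exploration-vs-exploitation question without prescribing an answer; one common strategy, shown here purely as an illustration, is ε-greedy selection over the intermediate Q̂ values (the ε value is arbitrary).

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps explore (random action); otherwise exploit the current Q-hat table."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```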
Nondeterministic Example

[Figure: a four-state nondeterministic MDP with states S1 Unemployed, S2 Industry, S3 Grad School, and S4 Academia, and actions R – Research and D – Development; each action's outcome is probabilistic, with transition probabilities of 0.9, 0.1, or 1.0.]

Nondeterministic Case
We redefine V and Q by taking expected values:
       V^π(s_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … ]
                = E[ Σ_i γ^i r_{t+i} ]
       Q(s_t, a_t) = E[ r(s_t, a_t) + γ V*(δ(s_t, a_t)) ]
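A sketch of the redefined Q(s,a) = E[ r(s,a) + γ V*(δ(s,a)) ] when δ is stochastic: the expectation becomes a probability-weighted sum over next states. The values, probabilities, and discount below are illustrative and are not taken from the Unemployed/Industry/Grad School/Academia figure.

```python
GAMMA = 0.9   # illustrative discount

def q_value(r_sa, outcomes, v_star, gamma=GAMMA):
    """Q(s,a) = E[r(s,a) + gamma * V*(delta(s,a))] for a stochastic transition.
    outcomes: list of (probability, next_state) pairs describing delta(s, a)."""
    return r_sa + gamma * sum(p * v_star[s_next] for p, s_next in outcomes)

v_star = {"S1": 5.0, "S2": 20.0}                          # hypothetical optimal values
print(q_value(1.0, [(0.9, "S2"), (0.1, "S1")], v_star))   # 1 + 0.9*(0.9*20 + 0.1*5) = 17.65
```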
Nondeterministic Case
• Alter the training rule to
       Q̂_n(s, a) ← (1 − α_n) Q̂_{n-1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n-1}(s', a') ]
  where α_n = 1 / (1 + visits_n(s, a)) and s' = δ(s, a).
• Can still prove convergence of Q̂ [Watkins and Dayan, 92].
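A sketch of the training rule above, with α_n = 1/(1 + visits_n(s,a)) tracked per state–action pair; the discount value and the exact point at which the visit counter is incremented are assumptions.

```python
from collections import defaultdict

GAMMA = 0.9                       # illustrative; the slide fixes only alpha_n
Q = defaultdict(float)            # Q-hat table, initialized to zero
visits = defaultdict(int)         # visits_n(s, a)

def q_update(s, a, r, s_next, actions_next):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```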
Ongoing Research
• Handling the case where the state is only partially observable
• Design of optimal exploration strategies
• Extending to continuous actions and states
• Learning and using δ : S × A → S
• Relationship to dynamic programming
• Multiple learners – multi-agent reinforcement learning