Partially Observable Markov Decision Processes

• Additional reading:
  – Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains," Artificial Intelligence, Vol. 101, 1998.
  – J. Pineau, G. Gordon and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.

Issues
• Problem statement:
  – If we do not know the current state of the world, what should we do to act optimally?
• Inputs:
  – Model of states, actions, observations, transition and emission functions, reward function, initial distribution, and discount factor
• Outputs from different algorithms:
  – Complete/partial, exact/approximate value function
  – Complete/partial, exact/approximate finite state machine
• Choices:
  – Policy vs. plan
  – Exact vs. approximate
  – Representation
  – Discrete vs. continuous states, actions, observations
Graphical Models (Sondik, 1971)
[Figure: two-slice graphical model. Hidden: states S1, S2 linked by the transition model T(s_j | a_i, s_i), with rewards R1. Observable: beliefs b1, b2, observations Z1, Z2 generated by the emission model O(z_j | s_i), and actions a1.]
Reliable Navigation
• Conventional trajectories may not be robust to localization error
[Figure: estimated robot position, true robot position, and goal position.]
Control Models
• Markov Decision Processes
• Partially Observable Markov Decision Processes
[Diagrams: world state, probabilistic perception model, P(x), and control block for each of the two models.]

Navigation as a POMDP
• State space: the state is hidden from the controller
• Action, observation
• Controller chooses actions based on probability distributions
[Figure: argmax P(x)]
• Stationary policy: Best action is fixed
• Non-stationary policy: Best action depends on time
• States can be discrete, continuous, or hybrid

Tradeoffs
• MDP
  + Tractable to solve
  + Relatively easy to specify
  – Assumes perfect knowledge of state
• POMDP
  + Treats all sources of uncertainty (action, sensing, environment) in a uniform framework
  + Allows for taking actions that gain information
  – Difficult to specify all the conditional probabilities
  – Hugely intractable to solve optimally

                     Controlled   Passive
  Hidden state       POMDP        HMM
  Fully observable   MDP          Markov Model

POMDP Advantages
• Models information gathering
• Computes trade-off between:
  – Getting reward
  – Being uncertain ("Am I here, or over there?")
A Simple POMDP
[Figure: belief distribution p(s) over three states s1, s2, s3.]
• State is hidden
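Because the state is hidden, the controller acts on a belief b(s) rather than on s itself. Below is a minimal sketch of the standard Bayes-filter belief update, assuming generic transition and emission arrays; the function and variable names (and the tiny model at the end) are illustrative, not from the slides.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes-filter update of a belief over hidden states.

    b: (|S|,) current belief, b[s] = p(s)
    a: index of the action taken
    z: index of the observation received
    T: (|A|, |S|, |S|) transitions, T[a, s, s2] = p(s2 | s, a)
    O: (|S|, |Z|) emissions, O[s, z] = p(z | s)
    """
    predicted = b @ T[a]              # prediction: p(s') = sum_s b(s) p(s'|s,a)
    updated = O[:, z] * predicted     # correction: weight by likelihood p(z|s')
    return updated / updated.sum()    # normalize

# Tiny illustrative two-state, one-action model (placeholder numbers).
T = np.array([[[0.8, 0.2], [0.3, 0.7]]])
O = np.array([[0.9, 0.1], [0.4, 0.6]])
print(belief_update(np.array([0.5, 0.5]), 0, 0, T, O))
```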
POMDP Policies
Coastal Navigation
Example POMDP
[Figure: two states s1, s2 with actions a1, a2 between them.]
Two states: s1, s2
Two actions: a1, a2
Two observations: z1, z2
Assume γ = 0.9

Rewards:  r(a1, s1) = 2    r(a1, s2) = 1
          r(a2, s1) = 1    r(a2, s2) = 3

Emission probabilities:
          p(z1|s1) = 0.9   p(z1|s2) = 0.5
          p(z2|s1) = 0.1   p(z2|s2) = 0.5

Transition probabilities:
          p(s1|s1, a1) = 0.3   p(s1|s2, a1) = 0.6
          p(s2|s1, a1) = 0.7   p(s2|s2, a1) = 0.4
          p(s1|s1, a2) = 0.1   p(s1|s2, a2) = 0.8
          p(s2|s1, a2) = 0.9   p(s2|s2, a2) = 0.2
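One way to put this model into code (a sketch using numpy; the array names and index layout are my own convention):

```python
import numpy as np

gamma = 0.9                               # discount factor

# R[a, s] = r(a, s): rows are actions a1, a2; columns are states s1, s2
R = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# T[a, s, s'] = p(s' | s, a)
T = np.array([[[0.3, 0.7], [0.6, 0.4]],   # action a1
              [[0.1, 0.9], [0.8, 0.2]]])  # action a2

# O[s, z] = p(z | s)
O = np.array([[0.9, 0.1],                 # state s1
              [0.5, 0.5]])                # state s2

# Sanity check: every transition and emission row is a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0) and np.allclose(O.sum(axis=1), 1.0)
```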
(Example courtesy of Simmons and Veloso, 2002)

Exact Value Iteration
• Define "policy tree": a conditional plan that fixes an action at the root and, for every observation, a subtree one level shorter
• For a horizon of length t, there are |A|^((|Z|^t - 1)/(|Z| - 1)) possible policy trees
• Idea: Compute the value of a tree of depth t recursively from trees of depth t-1
[Figure: a policy tree with root action a1; after each observation z1 or z2 a subtree rooted at a1 or a2 follows.]

      V_t(b) = max_{α ∈ Γ_t} α · b,   where Γ_t is the set of α-vectors of the depth-t policy trees
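A quick check of the tree-count formula above for this two-action, two-observation example (a minimal sketch):

```python
def num_policy_trees(num_actions, num_observations, horizon):
    # A depth-t policy tree has (|Z|^t - 1)/(|Z| - 1) nodes, each holding one of |A| actions.
    nodes = (num_observations ** horizon - 1) // (num_observations - 1)
    return num_actions ** nodes

for t in range(1, 5):
    print(t, num_policy_trees(2, 2, t))   # 2, 8, 128, 32768 -- 8 trees at t=2, as on the later slide
```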
Exact Value Iteration
• Compute the value of each policy tree over the space of beliefs b by taking an expectation
• Given tree p:

      V_p(b) = E_b[V_p(s)] = Σ_{s∈S} b(s) V_p(s)

• Expectation is linear: the value function of a policy tree is therefore linear in b, expressed as an "α-vector"
• Compute V_p(s) recursively from the trees one step shorter (the backup mirrors Bayes' filter):

      V_p^t(s) = R(s, a) + γ Σ_{s'∈S} Σ_{z∈Z} p(s'|s, a) p(z|s') V_{p_z}^{t-1}(s')

  where a is the action at the root of p and p_z is the subtree followed after observation z
[Figure: a policy tree with root action a1 and, after observations z1 and z2, depth-(t-1) subtrees with values V_{p_z1}^{t-1} and V_{p_z2}^{t-1}.]
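A minimal sketch of this recursion: a policy tree is a nested tuple (action, {observation: subtree}), and its α-vector is computed bottom-up. The model arrays follow the layout used in the sketch after the example slide; the trees p1, p2 and p112 are named as on the following slides.

```python
import numpy as np

gamma = 0.9
R = np.array([[2.0, 1.0], [1.0, 3.0]])       # R[a, s]
T = np.array([[[0.3, 0.7], [0.6, 0.4]],
              [[0.1, 0.9], [0.8, 0.2]]])     # T[a, s, s']
O = np.array([[0.9, 0.1], [0.5, 0.5]])       # O[s, z]

def tree_value(tree):
    """Alpha-vector V_p(s) of policy tree p = (action, {observation: subtree})."""
    action, subtrees = tree
    alpha = R[action].copy()
    for z, sub in subtrees.items():
        # gamma * sum_{s'} p(s'|s,a) p(z|s') V_{p_z}(s'), computed for every s at once
        alpha += gamma * T[action] @ (O[:, z] * tree_value(sub))
    return alpha

p1 = (0, {})                                 # depth-1 tree: just take a1
p2 = (1, {})                                 # depth-1 tree: just take a2
p112 = (0, {0: p1, 1: p2})                   # a1, then p1 after z1 and p2 after z2
print(tree_value(p112))                      # [3.773, 2.746], matching the later slides
```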
Exact Value Iteration
• Set V_{t=0} = 0
• V_{t=1}: Two policy trees, p1 (root action a1) and p2 (root action a2)

      α_{p1} = [V_{p1}(s1), V_{p1}(s2)] = [R(s1, a1), R(s2, a1)]
      α_{p2} = [V_{p2}(s1), V_{p2}(s2)] = [R(s1, a2), R(s2, a2)]

[Figure: the linear value functions V_{p1}^{t=1} and V_{p2}^{t=1} plotted over p(s2) ∈ [0, 1]; V^{t=1} is their upper envelope.]

      V_t(b) = max_{α ∈ Γ_t} α · b
Exact Value Iteration
• V_{t=2}: 8 possible policy trees
[Figure: the eight depth-2 trees p_111 … p_222. Tree p_ijk has root action a_i and follows p_j after observation z1 and p_k after observation z2.]

      α_111 = [ R(s1, a1) + γ Σ_{s'∈S} p(s'|s1, a1) [ p(z1|s') α_1(s') + p(z2|s') α_1(s') ],
                R(s2, a1) + γ Σ_{s'∈S} p(s'|s2, a1) [ p(z1|s') α_1(s') + p(z2|s') α_1(s') ] ]
      ....
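A sketch that enumerates all eight depth-2 trees and backs up their α-vectors in one shot (same model arrays as before); it reproduces the numbers listed on the next slide.

```python
import numpy as np
from itertools import product

gamma = 0.9
R = np.array([[2.0, 1.0], [1.0, 3.0]])       # R[a, s]
T = np.array([[[0.3, 0.7], [0.6, 0.4]],
              [[0.1, 0.9], [0.8, 0.2]]])     # T[a, s, s']
O = np.array([[0.9, 0.1], [0.5, 0.5]])       # O[s, z]

alpha_t1 = {1: R[0], 2: R[1]}                # depth-1 vectors alpha_1 and alpha_2

# p_ijk: root action a_i, subtree p_j after z1, subtree p_k after z2
for i, j, k in product((1, 2), repeat=3):
    backup = T[i - 1] @ (O[:, 0] * alpha_t1[j]) + T[i - 1] @ (O[:, 1] * alpha_t1[k])
    alpha = R[i - 1] + gamma * backup
    print(f"alpha_{i}{j}{k} = {np.round(alpha, 3)}")
```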
Exact Value Iteration
• α_111 = [3.17, 2.44]
• α_112 = [3.773, 2.746]
• α_121 = [3.557, 2.314]
• α_122 = [4.16, 2.62]
• α_211 = [1.99, 4.62]
• α_212 = [2.791, 4.728]
• α_221 = [2.719, 4.152]
• α_222 = [3.52, 4.26]
[Figure: the eight linear value functions plotted over p(s2) ∈ [0, 1] (value axis 0 to 5); only α_222, α_212, and α_122 appear on the upper envelope.]
Pruning Alpha Vectors
• Test all α-vectors for dominance: keep α_p only if there is some belief at which it beats every other vector. Solve the linear program

      maximize  d
      such that:  Σ_{s∈S} b(s) V_p(s) ≥ Σ_{s∈S} b(s) V_p̃(s) + d   for all p̃ ≠ p
      and:        Σ_{s∈S} b(s) = 1
      and:        b(s) ≥ 0   for all s

  and prune α_p if the optimal d is not positive
• V_{t=2} = {[2.791, 4.728], [3.52, 4.26], [4.16, 2.62]}
[Figure: the pruned value function over p(s2) ∈ [0, 1] (value axis roughly 2 to 5), the upper envelope of the three surviving vectors.]
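A sketch of this dominance test with scipy's LP solver (maximizing d by minimizing -d); the function and variable names are mine. Run on the eight depth-2 vectors it keeps exactly the three vectors shown above.

```python
import numpy as np
from scipy.optimize import linprog

def dominates_somewhere(alpha, others):
    """LP from the slide: maximize d s.t. b.alpha >= b.alpha_tilde + d for every
    other vector, with b a probability distribution. True iff the optimal d > 0."""
    n = len(alpha)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                            # minimize -d
    A_ub = np.array([np.append(other - alpha, 1.0) for other in others])
    A_eq = np.array([np.append(np.ones(n), 0.0)])           # sum_s b(s) = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(others)), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)])
    return res.success and -res.fun > 1e-9

def prune(vectors):
    return [v for i, v in enumerate(vectors)
            if dominates_somewhere(v, [w for j, w in enumerate(vectors) if j != i])]

vectors = [np.array(v) for v in
           [[3.17, 2.44], [3.773, 2.746], [3.557, 2.314], [4.16, 2.62],
            [1.99, 4.62], [2.791, 4.728], [2.719, 4.152], [3.52, 4.26]]]
print(prune(vectors))    # [4.16, 2.62], [2.791, 4.728], [3.52, 4.26]
```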
Execution
• At run time, evaluate the current belief b against the pruned α-vectors and execute the root action of the maximizing vector
[Figure: belief space p(s2) with the surviving vectors α_222, α_212, α_122; the current belief picks out the dominating vector, whose root action (here a1) is the optimal action.]
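Execution itself is then a dot product per step (a small sketch; the root-action labels follow the p_ijk naming used earlier): pick the α-vector that is maximal at the current belief and issue its root action.

```python
import numpy as np

# Surviving alpha-vectors and the root action of the tree each came from
# (alpha_122 -> a1, alpha_212 -> a2, alpha_222 -> a2).
alphas = np.array([[4.16, 2.62], [2.791, 4.728], [3.52, 4.26]])
root_actions = ["a1", "a2", "a2"]

def best_action(belief):
    return root_actions[int(np.argmax(alphas @ belief))]

print(best_action(np.array([0.9, 0.1])))   # mostly s1 -> a1
print(best_action(np.array([0.1, 0.9])))   # mostly s2 -> a2
```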
Witness Algorithm (Littman et al., 1998)
• A witness is a counter-example
  – Idea: Find places where the value function is suboptimal
  – Operates action-by-action and observation-by-observation to build up value vectors
• Algorithm
  – Start with value vectors for known ("corner") states
  – Define a linear program (based on Bellman's equation) that finds a point in the belief space where the value of the function is incorrect
  – Add a new vector (a linear combination of the old value function)
  – Iterate
• Set V_{t=0} = 0
• V_{t=1}: Two α-vectors, α_1, α_2
• V_{t=2}:
  – Start with some belief state, and compute the optimal policy tree p to get a single α-vector
  – Construct a new policy tree p_new, and test for a witness (the linear program on the next slide)
  – Keep the tree if a witness exists
Witness Algorithm
[Figure: the α-vectors of V^{t-1} plotted over p(s2) ∈ [0, 1].]
• Back up a candidate vector for root action a from the previous-stage vectors:

      V_p_new(s) = R(s, a) + γ Σ_{z∈Z} max_{α ∈ Γ^{t-1}} Σ_{s'∈S} p(s'|s, a) p(z|s') α(s')
• Witness test: solve the linear program

      maximize  d
      such that:  Σ_{s∈S} b(s) V_p_new(s) ≥ Σ_{s∈S} b(s) V_p(s) + d   for all p ∈ Γ_t
      and:        Σ_{s∈S} b(s) = 1
      and:        b(s) ≥ 0   for all s

  A belief with d > 0 is a witness that p_new improves on the current value function.
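A sketch of the witness test as a linear program (scipy; the helper name is mine). It returns a belief at which the candidate vector improves on the current set, or None if no witness exists; the example calls use vectors from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def find_witness(candidate, current_vectors):
    """Return a belief b with b.candidate > max_p b.alpha_p, or None."""
    n = len(candidate)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                             # maximize d
    A_ub = np.array([np.append(alpha - candidate, 1.0) for alpha in current_vectors])
    A_eq = np.array([np.append(np.ones(n), 0.0)])            # belief sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(current_vectors)), A_eq=A_eq,
                  b_eq=[1.0], bounds=[(0, 1)] * n + [(None, None)])
    if res.success and -res.fun > 1e-9:
        return res.x[:n]                                     # the witness belief
    return None

current = [np.array([4.16, 2.62]), np.array([3.52, 4.26])]
print(find_witness(np.array([2.791, 4.728]), current))  # [0. 1.]: improves near p(s2)=1
print(find_witness(np.array([3.773, 2.746]), current))  # None: never improves
```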
Point-based Value Iteration (Pineau et al., 2003)
• Fix a set of beliefs B
• Maintain a single α-vector per belief
• Update each α-vector with a one-step backup (mirroring Bayes' filter): for each b ∈ B,

      α_{b,a,z} = argmax_{α ∈ Γ^{t-1}} Σ_{s∈S} b(s) Σ_{s'∈S} p(s'|s, a) p(z|s') α(s')
      a_b       = argmax_{a ∈ A} Σ_{s∈S} b(s) [ R(s, a) + γ Σ_{z∈Z} Σ_{s'∈S} p(s'|s, a) p(z|s') α_{b,a,z}(s') ]
      α_b^t(s)  = R(s, a_b) + γ Σ_{z∈Z} Σ_{s'∈S} p(s'|s, a_b) p(z|s') α_{b,a_b,z}(s')

[Figure: beliefs b_1, b_2, b_3 marked on p(s2) ∈ [0, 1], each with its own α-vector (value axis roughly 2 to 5).]
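A sketch of one point-based backup over a fixed belief set (same model arrays as before; the belief set B and the helper name are illustrative). Each belief keeps exactly one α-vector, so the work per iteration grows with |B| instead of exponentially with the horizon.

```python
import numpy as np

gamma = 0.9
R = np.array([[2.0, 1.0], [1.0, 3.0]])       # R[a, s]
T = np.array([[[0.3, 0.7], [0.6, 0.4]],
              [[0.1, 0.9], [0.8, 0.2]]])     # T[a, s, s']
O = np.array([[0.9, 0.1], [0.5, 0.5]])       # O[s, z]
num_actions, num_obs = R.shape[0], O.shape[1]

def pbvi_backup(B, Gamma):
    """One point-based backup: return one new alpha-vector per belief in B."""
    new_Gamma = []
    for b in B:
        best_alpha, best_value = None, -np.inf
        for a in range(num_actions):
            alpha_a = R[a].copy()
            for z in range(num_obs):
                # For this belief, pick the previous-stage vector that is best after (a, z).
                g = [T[a] @ (O[:, z] * alpha) for alpha in Gamma]
                alpha_a = alpha_a + gamma * max(g, key=lambda v: v @ b)
            if alpha_a @ b > best_value:
                best_alpha, best_value = alpha_a, alpha_a @ b
        new_Gamma.append(best_alpha)
    return new_Gamma

B = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
Gamma = [R[0], R[1]]                         # start from the depth-1 vectors
for _ in range(3):                           # a few backups
    Gamma = pbvi_backup(B, Gamma)
print(np.round(Gamma, 3))
```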