Partially Observable Markov
Decision Processes Part II
● Additional reading:
● Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision
Processes. Ph.D. Thesis. Brown University, Department of Computer Science, Providence, RI, 1998
(This lecture focuses mostly on Chapter 6).
Issues
● Problem statement:
– If we do not know the current state of the world, what should we do
to act optimally?
● Inputs:
– Model of states, actions, observations, transition and emission functions,
reward function, initial distribution and discount factor
● Outputs from approximate algorithms:
– approximate value function
● Outputs from heuristic algorithms:
– policies that capture the kind of behaviour we want
Point-based Value Iteration (Pineau et al., 2003)
● Fix a set of beliefs
● Maintain a single α-vector per belief
● Update each α-vector according to Bayes’ filter
[Figure: piecewise-linear value function over the belief space, with one α-vector maintained at each of the belief points b1, b2, b3]
For each belief point b_i:

α^i_{t,a,z}(s) = R(s,a) + γ Σ_{s'∈S} p(s'|s,a) p(z|s') α^i_{t−1}(s')

α^i_{t,a} = Σ_{z∈Z} argmax_{α^i_{t,a,z}} ( α^i_{t,a,z} · b_i )

α^i_t = argmax_{α^i_{t,a}} ( α^i_{t,a} · b_i )
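A minimal numpy sketch of one point-based backup, written in the standard PBVI form of Pineau et al. (2003): every current α-vector is projected through each (action, observation) pair, the best projection per observation is kept for each belief point, and the best action vector becomes that point's new α-vector. All array names, shapes, and the function name are assumptions for illustration; note that here the reward is added once per action rather than inside every per-observation vector.

```python
import numpy as np

def pbvi_backup(B, alphas, T, O, R, gamma=0.95):
    """One point-based backup over a fixed set of beliefs.
    B:      (nB, S)   belief points (rows sum to 1)
    alphas: (nV, S)   current alpha-vectors
    T:      (S, A, S) with T[s, a, s2] = p(s2 | s, a)
    O:      (S, Z)    with O[s2, z]    = p(z | s2)
    R:      (S, A)    rewards
    Returns one new alpha-vector per belief point."""
    S, A, _ = T.shape
    Z = O.shape[1]
    # g[a, z, k, s] = gamma * sum_{s2} p(s2|s,a) * p(z|s2) * alphas[k, s2]
    g = gamma * np.einsum('sat,tz,kt->azks', T, O, alphas)
    new_alphas = np.zeros_like(B)
    for i, b in enumerate(B):
        best_val, best_vec = -np.inf, None
        for a in range(A):
            # for each observation, keep the projected vector that scores best at b ...
            vec = R[:, a] + sum(g[a, z, int(np.argmax(g[a, z] @ b))] for z in range(Z))
            val = vec @ b
            if val > best_val:          # ... then keep the best action at b
                best_val, best_vec = val, vec
        new_alphas[i] = best_vec
    return new_alphas
```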
Maximum-Likelihood State Heuristic
● Assume the MDP policy π_MDP : S → A
● Given belief b:

π_MLS(b) = π_MDP( argmax_{s∈S} b(s) )
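A sketch of the maximum-likelihood state heuristic, assuming the MDP policy is given as an array mapping state indices to action indices (the names and numbers below are illustrative):

```python
import numpy as np

def mls_action(b, pi_mdp):
    """Act as the MDP policy would in the single most probable state."""
    return int(pi_mdp[int(np.argmax(b))])

# Illustrative example
b = np.array([0.1, 0.6, 0.3])     # belief over 3 states
pi_mdp = np.array([2, 0, 1])      # MDP policy computed beforehand
print(mls_action(b, pi_mdp))      # most likely state is 1 -> action 0
```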
Voting Heuristic
● Assume the MDP policy
● Given belief b:
Nourbakhsh, Powers and Birchfield, 1995
Simmons & Koenig, 1996
π_AV(b) = argmax_{a∈A} Σ_{s∈S} b(s) I( π_MDP(s), a )

where I(a_i, a_j) = 1 if a_i = a_j, and 0 if a_i ≠ a_j
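A sketch of the action-voting heuristic under the same assumptions as above (pi_mdp maps state index to action index); each state casts a vote for its MDP action, weighted by its belief probability:

```python
import numpy as np

def voting_action(b, pi_mdp, num_actions):
    """Return the action receiving the most belief-weighted votes."""
    votes = np.zeros(num_actions)
    for s, p in enumerate(b):
        votes[pi_mdp[s]] += p
    return int(np.argmax(votes))

b = np.array([0.1, 0.6, 0.3])
pi_mdp = np.array([2, 0, 1])
print(voting_action(b, pi_mdp, num_actions=3))   # action 0 collects 0.6 of the vote
```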
Q-MDP
● Assume the optimal MDP Q-function Q_MDP : S × A → ℝ
● Given belief b:

π_QMDP(b) = argmax_{a∈A} Σ_{s∈S} b(s) Q_MDP(s,a)

Cassandra, Kaelbling, and Littman, 1994
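A sketch of Q-MDP, assuming the optimal MDP Q-function is available as an (|S|, |A|) array (the numbers are illustrative):

```python
import numpy as np

def qmdp_action(b, Q_mdp):
    """argmax_a sum_s b(s) Q_MDP(s, a)."""
    return int(np.argmax(b @ Q_mdp))

b = np.array([0.1, 0.6, 0.3])
Q_mdp = np.array([[1.0, 0.0],
                  [0.2, 0.9],
                  [0.5, 0.4]])
print(qmdp_action(b, Q_mdp))   # expected Q-values [0.37, 0.66] -> action 1
```

Q-MDP never selects actions purely for information gain, since it implicitly assumes all state uncertainty disappears after one step.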
Entropy Heuristic
Cassandra, Kaelbling & Kurien, 1996
π(b) = argmin_{a∈A} E_z[ H( p(s|a,z) ) ]    if H(b) > κ
     = π_MDP( argmax_s p(s) )               otherwise

where H(·) is the belief entropy and κ a fixed entropy threshold
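A sketch of the dual-mode entropy heuristic, assuming transition and observation arrays T[s, a, s'] = p(s'|s,a) and O[s', z] = p(z|s'); the helper names are illustrative:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def expected_posterior_entropy(b, a, T, O):
    """E_z[ H(p(s | a, z)) ] computed with the Bayes filter."""
    b_pred = b @ T[:, a, :]                  # p(s' | b, a)
    exp_h = 0.0
    for z in range(O.shape[1]):
        pz = float(b_pred @ O[:, z])         # p(z | b, a)
        if pz > 0:
            exp_h += pz * entropy(b_pred * O[:, z] / pz)
    return exp_h

def entropy_heuristic_action(b, pi_mdp, T, O, kappa):
    """Uncertain belief: minimize expected posterior entropy.
    Confident belief: act on the most likely state."""
    if entropy(b) > kappa:
        num_actions = T.shape[1]
        return int(np.argmin([expected_posterior_entropy(b, a, T, O)
                              for a in range(num_actions)]))
    return int(pi_mdp[int(np.argmax(b))])
```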
Alternate Dual-Mode Heuristic
● Given belief b:

π(b) = argmax_{a∈A} [ (H(b)/κ) Σ_{s∈S} p(s) Q_CU(s,a) + (1 − H(b)/κ) Σ_{s∈S} p(s) Q_MDP(s,a) ]
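A sketch of the weighted blend, under two stated assumptions: the second action-value table (called Q_info below, standing in for the Q_CU term above) scores actions for reducing uncertainty, and the entropy weight is normalized to [0, 1] by dividing by log|S|:

```python
import numpy as np

def weighted_dual_mode_action(b, Q_mdp, Q_info):
    """Blend belief-weighted action-values: high entropy favors the
    information-gathering values, low entropy favors the MDP values."""
    p = b[b > 0]
    h = float(-np.sum(p * np.log(p))) / np.log(len(b))   # normalized entropy
    scores = h * (b @ Q_info) + (1.0 - h) * (b @ Q_mdp)
    return int(np.argmax(scores))
```

Unlike the hard switch of the entropy heuristic, the blended weights change continuously with the belief entropy rather than switching abruptly at a threshold.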
Discretization
● Discretize belief space
● Convert to belief space MDP
● Solve using conventional DP
● Subject to “Curse of Dimensionality”
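A sketch of a regular, fixed-resolution discretization of the belief simplex; the function name and resolution scheme are illustrative. Counting the grid points shows the curse of dimensionality directly:

```python
import itertools
import numpy as np

def belief_grid(num_states, resolution):
    """All beliefs whose entries are multiples of 1/resolution."""
    points = []
    for c in itertools.product(range(resolution + 1), repeat=num_states):
        if sum(c) == resolution:
            points.append(np.array(c) / resolution)
    return np.array(points)

print(len(belief_grid(3, 4)))   # 15 grid beliefs for 3 states
print(len(belief_grid(8, 4)))   # 330 for 8 states: combinatorial blow-up
```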
Coastal Navigation
● Represent beliefs using the most likely state and the belief entropy:

b̃ = ( argmax_s b(s) ; H(b) )

1. Discretize state-entropy space
2. Compute reward function and transition function
3. Solve belief-state MDP
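A sketch of this low-dimensional representation: keep only the most likely state and the belief entropy (the function name is illustrative):

```python
import numpy as np

def compress_belief(b):
    """b_tilde = (argmax_s b(s), H(b))."""
    p = b[b > 0]
    return int(np.argmax(b)), float(-np.sum(p * np.log(p)))

print(compress_belief(np.array([0.7, 0.2, 0.1])))   # (0, ~0.80)
```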
Model Parameters
● Reward function R̃(b̃)

[Figure: back-project the low-dimensional belief b̃ to a full distribution p(s) over the states s1, s2, s3]

Compute expected reward from belief:

R̃(b̃) = E_b[ R(s) ] = Σ_{s∈S} p(s) R(s)
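A sketch of the expected-reward computation. The back-projection scheme below (most of the mass on the most likely state, the rest spread uniformly according to the entropy) is an assumption for illustration only, not the lecture's scheme:

```python
import numpy as np

def compressed_reward(s_ml, H, R, back_project):
    """R_tilde(b_tilde) = E_b[R(s)] after back-projecting to a full belief."""
    p = back_project(s_ml, H)
    return float(p @ R)

def uniform_back_project(s_ml, H, num_states=3):
    """Hypothetical back-projection: (1 - u) mass on s_ml, u spread uniformly."""
    u = H / np.log(num_states)
    p = np.full(num_states, u / num_states)
    p[s_ml] += 1.0 - u
    return p

R = np.array([10.0, 0.0, -5.0])                       # per-state rewards
print(compressed_reward(0, 0.5, R, uniform_back_project))
```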
Model Parameters
● Use forward model

[Figure: the low-dimensional belief b̃_i is back-projected to a full belief p_i(s); the forward model with action a and observation z1 produces p_j(s), which is projected back down to b̃_j. The diagram contrasts the low-dimensional and full-dimensional representations and the deterministic and stochastic parts of the process.]
Model Parameters
● Use forward model

[Figure: from b̃_i, action a leads to different successor beliefs (b̃_j, b̃_q, b̃_k) depending on the observation z1, z2, ...]

T(b̃_i, a, b̃_j) ∝ p(z|s) b_i(s|a)    if b̃_j(s) = b̃_i(s|a,z)
T(b̃_i, a, b̃_j) = 0                   otherwise
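A sketch of estimating the compressed transition function with the forward model: back-project b̃_i to a full belief, run the Bayes filter for every observation, compress each posterior back down, and assign it probability p(z | b, a). The compress argument could be the compress_belief function sketched earlier (with the entropy rounded onto the grid); all names are illustrative:

```python
import numpy as np

def compressed_transition_row(b_full, a, T, O, compress):
    """Return a dict mapping successor b_tilde_j -> T(b_tilde_i, a, b_tilde_j)."""
    b_pred = b_full @ T[:, a, :]               # p(s' | b, a)
    row = {}
    for z in range(O.shape[1]):
        pz = float(b_pred @ O[:, z])           # p(z | b, a)
        if pz == 0.0:
            continue
        posterior = b_pred * O[:, z] / pz      # b_i(s | a, z), the Bayes filter update
        key = compress(posterior)              # discretized successor belief
        row[key] = row.get(key, 0.0) + pz
    return row
```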
What You Should Know
● Advantages and disadvantages of using POMDPs
● What an optimal policy looks like
● How to compute a policy
● Different heuristics and approximations:
how they work and where they may fail