Partially Observable Markov Decision Processes Part II

● Additional reading:
– Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. Ph.D. Thesis. Brown University, Department of Computer Science, Providence, RI, 1998. (This lecture focuses mostly on Chapter 6.)

Issues

● Problem statement:
– If we do not know the current state of the world, what should we do to act optimally?
● Inputs:
– Model of states, actions, observations, transition and emission functions, reward function, initial distribution, and discount factor
● Outputs from approximate algorithms:
– an approximate value function
● Outputs from heuristic algorithms:
– policies that capture the kind of behaviour we want

Point-based Value Iteration (Pineau et al., 2003)

● Fix a set of beliefs
● Maintain a single α-vector per belief
● Update each α-vector according to Bayes’ filter

[Figure: piecewise-linear value function over the belief simplex, with one α-vector anchored at each belief point b_1, b_2, b_3]

$\alpha^{a,z}_{t,i}(s) = R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, p(z \mid s')\, \alpha_{t+1,i}(s')$
$\alpha^{a}_{t,i} = \sum_{z \in Z} \operatorname*{argmax}_{\alpha^{a,z}_{t,i}} \left( \alpha^{a,z}_{t,i} \cdot b_i \right)$
$\alpha_{t,i} = \operatorname*{argmax}_{\alpha^{a}_{t,i}} \left( \alpha^{a}_{t,i} \cdot b_i \right)$

(A NumPy sketch of this backup appears at the end of these notes.)

Maximum-Likelihood State Heuristic

● Assume the MDP policy $\pi_{MDP} : S \to A$
● Given belief b:

$\pi_{MLS}(b) = \pi_{MDP}\left( \operatorname*{argmax}_{s \in S} b(s) \right)$

Voting Heuristic

Nourbakhsh, Powers and Birchfield, 1995; Simmons & Koenig, 1996

● Assume the MDP policy
● Given belief b:

$I(a_i, a_j) = \begin{cases} 1 & a_i = a_j \\ 0 & a_i \neq a_j \end{cases}$
$\pi_{AV}(b) = \operatorname*{argmax}_{a \in A} \sum_{s \in S} b(s)\, I(a, \pi_{MDP}(s))$

Q-MDP

Cassandra, Kaelbling, and Littman, 1994

● Assume the optimal MDP Q-function: $Q_{MDP} : S \times A \to \mathbb{R}$
● Given belief b:

$\pi_{Q}(b) = \operatorname*{argmax}_{a \in A} \sum_{s \in S} b(s)\, Q_{MDP}(s, a)$

(Sketches of the MLS, voting, and Q-MDP heuristics appear at the end of these notes.)

Entropy Heuristic

Cassandra, Kaelbling & Kurien, 1996

● Given belief b: gather information when the belief entropy exceeds the threshold κ; otherwise act on the most likely state:

$\pi_{E}(b) = \begin{cases} \operatorname*{argmin}_{a \in A} E_z\left[ H(p(s \mid a, z)) \right] & \text{if } H(b) > \kappa \\ \pi_{MDP}\left( \operatorname*{argmax}_{s} p(s) \right) & \text{otherwise} \end{cases}$

Alternate Dual-Mode Heuristic

● Given belief b:

$\pi_{DM}(b) = \operatorname*{argmax}_{a \in A} \left[ H_{\kappa}(b) \sum_{s \in S} p(s)\, Q_{CU}(s, a) + \left(1 - H_{\kappa}(b)\right) \sum_{s \in S} p(s)\, Q_{MDP}(s, a) \right]$

(Sketches of the entropy and dual-mode heuristics also appear at the end of these notes.)

Discretization

● Discretize belief space
● Convert to belief space MDP
● Solve using conventional DP
● Subject to the “Curse of Dimensionality”

Coastal Navigation

● Represent beliefs using $\tilde{b} = \langle \operatorname*{argmax}_{s} b(s);\ H(b) \rangle$
1. Discretize state-entropy space
2. Compute reward function and transition function
3. Solve belief state MDP

Model Parameters – Reward function R(b̃)

[Figure: a belief distribution p(s) over states s_1, s_2, s_3]

● Back-project to the high-dimensional belief
● Compute expected reward from the belief:

$\tilde{R}(\tilde{b}) = E_b[R(s)] = \sum_{S} p(s)\, R(s)$

(A sketch of this back-projection and expected-reward computation appears at the end of these notes.)

Model Parameters – Use forward model

[Figure: forward simulation of beliefs. In the low-dimensional space, $\tilde{b}_i$ maps to $\tilde{b}_j$ under action a and observation z_1; in the full-dimensional space, $b_i$ (with distribution $p_i(s)$) maps to $b_j$ (with $p_j(s)$). Given a and z, the full-dimensional update is a deterministic process; over observations, the low-dimensional transition is a stochastic process.]

Model Parameters – Use forward model

[Figure: from $\tilde{b}_i$, action a branches to different compressed beliefs ($\tilde{b}_j$, $\tilde{b}_k$, $\tilde{b}_q$) depending on the observation ($z_1$, $z_2$, …)]

$\tilde{T}(\tilde{b}_i, a, \tilde{b}_j) \propto \begin{cases} p(z \mid s)\, b_i(s \mid a) & \text{if } \tilde{b}_j(s) = b_i(s \mid a, z) \\ 0 & \text{otherwise} \end{cases}$

What You Should Know

● Advantages and disadvantages to using POMDPs
● What an optimal policy looks like
● How to compute a policy
● Different heuristics and approximations: how they work and where they may fail
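
As a concrete illustration of the point-based backup above, here is a minimal NumPy sketch. It assumes a small tabular POMDP with arrays `R[s,a]`, `T[a,s,s']`, `O[s',z]` (i.e. p(z|s')), a discount `gamma`, a fixed belief set `B` (one belief per row), and one α-vector per belief in `alphas`; all of these names are illustrative, not from the slides.

```python
import numpy as np

def pbvi_backup(B, alphas, R, T, O, gamma):
    """One point-based backup: compute a new alpha-vector for every belief in B.

    B      : (n_beliefs, n_states)           belief points (rows sum to 1)
    alphas : (n_beliefs, n_states)           current alpha-vectors, one per belief
    R      : (n_states, n_actions)           R[s, a]
    T      : (n_actions, n_states, n_states) T[a, s, s'] = p(s' | s, a)
    O      : (n_states, n_obs)               O[s', z]    = p(z | s')
    gamma  : discount factor
    """
    n_a = R.shape[1]
    n_z = O.shape[1]
    new_alphas = np.empty_like(alphas)

    for i, b in enumerate(B):
        best_val, best_alpha = -np.inf, None
        for a in range(n_a):
            # Start from the immediate reward; note that, unlike the slide's
            # per-(a, z) formula, R(s, a) enters once per action here, which is
            # the form used in Pineau et al. (2003).
            alpha_a = R[:, a].copy()
            for z in range(n_z):
                # g[j, s] = gamma * sum_s' T[a, s, s'] * O[s', z] * alphas[j, s']
                g = gamma * alphas @ (T[a] * O[:, z]).T
                # For this (a, z), keep the projected vector that scores best at b.
                alpha_a += g[np.argmax(g @ b)]
            val = alpha_a @ b
            if val > best_val:
                best_val, best_alpha = val, alpha_a
        new_alphas[i] = best_alpha
    return new_alphas
```

Iterating this backup over a fixed (or periodically expanded) belief set is the core of PBVI; the α-vector set never grows beyond one vector per belief point.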
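The three MDP-based control heuristics (MLS, voting, Q-MDP) can each be written in a few lines. The sketch below assumes a precomputed optimal MDP Q-function `Q[s,a]` and a belief vector `b` over states; again, the names are illustrative.

```python
import numpy as np

def mls_action(b, Q):
    """Maximum-likelihood state: act as the MDP would in the most probable state."""
    s_ml = np.argmax(b)            # most likely state under b
    return np.argmax(Q[s_ml])      # pi_MDP(s_ml)

def voting_action(b, Q):
    """Voting: each state votes for its MDP action with weight b(s)."""
    n_s, n_a = Q.shape
    pi_mdp = np.argmax(Q, axis=1)  # MDP policy, one action per state
    votes = np.zeros(n_a)
    for s in range(n_s):
        votes[pi_mdp[s]] += b[s]
    return np.argmax(votes)

def qmdp_action(b, Q):
    """Q-MDP: maximize the belief-averaged MDP Q-values."""
    return np.argmax(b @ Q)        # argmax_a sum_s b(s) Q_MDP(s, a)
```

All three effectively assume full observability from the next step onward, so none of them will deliberately choose a purely information-gathering action.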
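The entropy-based heuristics can be sketched the same way. In addition to `Q[s,a]` and `b`, the snippet assumes an observation model `O[s,z]`, a transition model `T[a,s,s']`, an entropy threshold `kappa`, and a second value function `Q_cu[s,a]` to use when uncertain (the slides' Q_CU); normalizing the entropy weight by log|S| is my assumption.

```python
import numpy as np

def belief_entropy(b, eps=1e-12):
    return -np.sum(b * np.log(b + eps))

def expected_posterior_entropy(b, a, T, O):
    """E_z[ H(p(s | a, z)) ] when taking action a from belief b."""
    b_pred = b @ T[a]            # predicted state distribution after a
    p_z = b_pred @ O             # marginal observation probabilities
    h = 0.0
    for z in range(O.shape[1]):
        if p_z[z] > 0:
            post = b_pred * O[:, z] / p_z[z]   # Bayes update on observation z
            h += p_z[z] * belief_entropy(post)
    return h

def entropy_heuristic_action(b, Q, T, O, kappa):
    """If the belief is too uncertain, act to reduce entropy; else act like MLS."""
    if belief_entropy(b) > kappa:
        return min(range(T.shape[0]),
                   key=lambda a: expected_posterior_entropy(b, a, T, O))
    return np.argmax(Q[np.argmax(b)])

def dual_mode_action(b, Q, Q_cu):
    """Blend an 'uncertain-mode' value Q_cu with the MDP value by entropy weight."""
    w = belief_entropy(b) / np.log(len(b))     # assumed normalization to [0, 1]
    return np.argmax(w * (b @ Q_cu) + (1.0 - w) * (b @ Q))
```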
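The coastal-navigation reward computation can be sketched similarly. The slides only say "back-project to the high-dimensional belief" without fixing how; the peak-plus-uniform family below is my assumed choice, and `s_ml`, `H_target`, and `R_s` are illustrative names.

```python
import numpy as np

def back_project(s_ml, H_target, n_states, tol=1e-4):
    """Recover a full belief whose most-likely state is s_ml and whose entropy
    is approximately H_target, using an assumed peak-plus-uniform family."""
    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def make_belief(m):
        # Put mass (1 - m) on the most likely state, spread m over the rest.
        b = np.full(n_states, m / (n_states - 1))
        b[s_ml] = 1.0 - m
        return b

    # Entropy grows monotonically in m on [0, (n-1)/n], so bisect on m.
    lo, hi = 0.0, (n_states - 1) / n_states
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if entropy(make_belief(mid)) < H_target:
            lo = mid
        else:
            hi = mid
    return make_belief(0.5 * (lo + hi))

def reward_of_compressed_belief(s_ml, H_target, R_s):
    """R~(b~) = E_b[R(s)] = sum_s b(s) R(s), evaluated on the back-projected belief."""
    b = back_project(s_ml, H_target, len(R_s))
    return float(b @ R_s)
```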