Partially Observable Markov Decision Processes

Additional reading:
• Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains," Artificial Intelligence, Vol. 101, 1998.
• J. Pineau, G. Gordon and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.

Issues
• Problem statement: if we do not know the current state of the world, what should we do to act optimally?
• Inputs: a model of states, actions, and observations; transition and emission functions; a reward function; an initial distribution; and a discount factor.
• Outputs from different algorithms:
  – Complete/partial, exact/approximate value function
  – Complete/partial, exact/approximate finite state machine
• Choices:
  – Policy vs. plan
  – Exact vs. approximate
  – Representation
  – Discrete vs. continuous states, actions, observations

Graphical Models (Sondik, 1971)
[Figure: the POMDP as a graphical model. Hidden: states S1, S2 linked by the transition function T(s_j | a_i, s_i) and producing rewards R1. Observable: actions a_1, observations Z1, Z2 generated by the emission function O(z_j | s_i), and beliefs b_1, b_2.]

Reliable Navigation
• Conventional trajectories may not be robust to localization error.
[Figure: estimated robot position, true robot position, and goal position.]

Control Models
• Markov Decision Processes: the controller chooses controls directly from the world state.
• Partially Observable Markov Decision Processes: the world state is hidden; the controller chooses controls from the distribution P(x) produced by a probabilistic perception model.
[Figure: block diagrams of the two control loops (world state, probabilistic perception model P(x), control).]

Navigation as a POMDP
• The state is hidden from the controller; the controller interacts with the state space only through actions and observations.
• The controller chooses actions based on probability distributions over states (the full P(x), not just argmax_x P(x)).
• Stationary policy: the best action is fixed. Non-stationary policy: the best action depends on time.
• States can be discrete, continuous, or hybrid.

Tradeoffs
• MDP
  + Tractable to solve
  + Relatively easy to specify
  – Assumes perfect knowledge of the state
• POMDP
  + Treats all sources of uncertainty (action, sensing, environment) in a uniform framework
  + Allows for taking actions that gain information
  – Difficult to specify all the conditional probabilities
  – Hugely intractable to solve optimally

                     Controlled | Passive
  Hidden state:      POMDP      | HMM
  Fully observable:  MDP        | Markov model

POMDP Advantages
• Models information gathering
• Computes the trade-off between getting reward and being uncertain ("Am I here, or over there?")

A Simple POMDP
• The state is hidden; the agent maintains a belief p(s) over the states.
[Figure: a belief distribution p(s) over three states s1, s2, s3.]

POMDP Policies
[Figure slides: example POMDP policies.]

Coastal Navigation
[Figure slide: the coastal navigation example.]

Example POMDP (example courtesy of Simmons and Veloso, 2002)
• Two states: s1, s2
• Two actions: a1, a2
• Two observations: z1, z2
• Assume γ = 0.9
• Rewards: r(a1, s1) = 2, r(a1, s2) = 1, r(a2, s1) = 1, r(a2, s2) = 3
• Emission probabilities: p(z1 | s1) = 0.9, p(z2 | s1) = 0.1, p(z1 | s2) = 0.5, p(z2 | s2) = 0.5
• Transition probabilities:
  p(s1 | s1, a1) = 0.3, p(s2 | s1, a1) = 0.7, p(s1 | s2, a1) = 0.6, p(s2 | s2, a1) = 0.4
  p(s1 | s1, a2) = 0.1, p(s2 | s1, a2) = 0.9, p(s1 | s2, a2) = 0.8, p(s2 | s2, a2) = 0.2
[Figure: two-state transition diagram for actions a1 and a2.]
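As a concrete illustration, here is a minimal sketch of how this example model can be encoded and how the belief is tracked with the Bayes filter, b'(s') ∝ p(z | s') Σ_s p(s' | s, a) b(s). The sketch is in Python with NumPy; the array layout and the belief_update helper are illustrative choices of this sketch, not something prescribed by the slides.

```python
import numpy as np

# Example POMDP from the slides (Simmons and Veloso, 2002).
# Index convention (an assumption of this sketch): s1=0, s2=1; a1=0, a2=1; z1=0, z2=1.
gamma = 0.9

# R[a, s]: immediate reward for taking action a in state s.
R = np.array([[2.0, 1.0],     # a1
              [1.0, 3.0]])    # a2

# T[a, s, s']: transition probability p(s' | s, a).
T = np.array([[[0.3, 0.7],    # a1, from s1
               [0.6, 0.4]],   # a1, from s2
              [[0.1, 0.9],    # a2, from s1
               [0.8, 0.2]]])  # a2, from s2

# O[s', z]: emission probability p(z | s').
O = np.array([[0.9, 0.1],     # s1
              [0.5, 0.5]])    # s2

def belief_update(b, a, z):
    """Bayes filter: b'(s') is proportional to p(z | s') * sum_s p(s' | s, a) b(s)."""
    b_pred = b @ T[a]           # predict: push the belief through the transition model
    b_new = O[:, z] * b_pred    # correct: weight by the observation likelihood
    return b_new / b_new.sum()  # normalize

b0 = np.array([0.5, 0.5])           # uniform initial belief
print(belief_update(b0, a=0, z=0))  # belief after taking a1 and observing z1
```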
Exact Value Iteration
• Define a "policy tree": the root specifies an action, and each observation z leads to a branch whose subtree is a depth-(t−1) policy tree.
• For a horizon of length t there are |A|^((|Z|^t − 1)/(|Z| − 1)) possible policy trees.
• Idea: compute the value of a tree of depth t recursively from trees of depth t−1.
[Figure: a depth-two policy tree with root action a1 and observation branches z1, z2 leading to subtrees rooted at a2 and a1.]

Exact Value Iteration
• Compute the value of each policy tree p over the space of beliefs b by taking an expectation:
  V_p(b) = E_b[V_p(s)] = Σ_{s ∈ S} b(s) V_p(s)
• Compute V_p(s) recursively from the subtrees of p: if a is the action at the root of p and p_z is the subtree followed after observing z, then
  V_p^t(s) = R(s, a) + γ Σ_{s' ∈ S} p(s' | s, a) Σ_{z ∈ Z} p(z | s') V_{p_z}^{t−1}(s')
• The expectation is linear in b, so the value function of a policy tree is linear and can be written as an "α-vector" α_p = [V_p(s1), ..., V_p(s_|S|)], with V_p(b) = α_p · b.
• The optimal t-step value function is the upper surface of these linear functions: V_t(b) = max_p α_p · b.

Exact Value Iteration
• Set V_{t=0} = 0.
• V_{t=1}: two policy trees, p1 (take a1) and p2 (take a2), giving two α-vectors:
  α_{p1} = [V_{p1}(s1), V_{p1}(s2)] = [R(s1, a1), R(s2, a1)] = [2, 1]
  α_{p2} = [V_{p2}(s1), V_{p2}(s2)] = [R(s1, a2), R(s2, a2)] = [1, 3]
[Figure: V_{p1}^{t=1} and V_{p2}^{t=1} plotted against p(s2); V_{t=1} is their upper surface.]

Exact Value Iteration
• V_{t=2}: 8 possible policy trees p_111, p_112, p_121, p_122, p_211, p_212, p_221, p_222, where p_ijk takes action a_i at the root, then follows p_j after observing z1 and p_k after observing z2.
• Each α-vector is backed up from its subtrees, e.g.
  α_111(s) = R(s, a1) + γ Σ_{s' ∈ S} p(s' | s, a1) Σ_{z ∈ Z} p(z | s') α_{p1}(s')
  α_112(s) = R(s, a1) + γ Σ_{s' ∈ S} p(s' | s, a1) [ p(z1 | s') α_{p1}(s') + p(z2 | s') α_{p2}(s') ]

Exact Value Iteration
• α_111 = [3.17, 2.44]
• α_112 = [3.773, 2.746]
• α_121 = [3.557, 2.314]
• α_122 = [4.16, 2.62]
• α_211 = [1.99, 4.62]
• α_212 = [2.791, 4.728]
• α_221 = [2.719, 4.152]
• α_222 = [3.52, 4.26]
[Figure: the eight α-vectors plotted against p(s2) on [0, 1]; only α_122, α_212, and α_222 lie on the upper surface.]

Pruning Alpha Vectors
• Test every α-vector for dominance: α_p can be pruned if there is no belief at which it is strictly best. For each p, solve the linear program
  maximize d
  such that Σ_{s ∈ S} b(s) V_p(s) ≥ Σ_{s ∈ S} b(s) V_p̃(s) + d for all p̃ ∈ V_t \ {p},
  Σ_{s ∈ S} b(s) = 1, and b(s) ≥ 0 for all s,
  and prune α_p if the optimal d ≤ 0.
• After pruning, V_{t=2} = { α_212 = [2.791, 4.728], α_222 = [3.52, 4.26], α_122 = [4.16, 2.62] }.
[Figure: the three surviving α-vectors over the belief space.]

Execution
• Given the current belief b, choose the α-vector that maximizes α · b; the optimal action is the action at the root of the corresponding policy tree.
[Figure: belief space with α_222, α_212, α_122; at the current belief shown, α_122 is maximal, so the optimal action is a1.]

Witness Algorithm (Littman et al., 1998)
• A witness is a counter-example.
  – Idea: find places where the value function is suboptimal.
  – Operates action-by-action and observation-by-observation to build up value vectors.
• Algorithm
  – Start with value vectors for the known ("corner") states.
  – Define a linear program (based on Bellman's equation) that finds a point in the belief space where the value of the function is incorrect.
  – Add a new vector (a linear combination of the old value function).
  – Iterate.

Witness Algorithm
• Set V_{t=0} = 0.
• V_{t=1}: two α-vectors, α_1, α_2.
• V_{t=2}:
  – Start with some belief state and compute its optimal policy tree p to get a single α-vector.
  – Construct a new policy tree p_new and test for a witness with the linear program
    maximize d
    such that Σ_{s ∈ S} b(s) V_{p_new}(s) ≥ Σ_{s ∈ S} b(s) V_p(s) + d for all p ∈ V_t,
    Σ_{s ∈ S} b(s) = 1, and b(s) ∈ [0, 1] for all s.
  – Keep the new tree if a witness exists (optimal d > 0).
• The backed-up value at a belief b comes from Bellman's equation:
  V_t[b] = max_a [ Σ_{s ∈ S} b(s) R(s, a) + γ Σ_{z ∈ Z} max_{α ∈ V_{t−1}} Σ_{s' ∈ S} α(s') p(z | s') Σ_{s ∈ S} p(s' | s, a) b(s) ]
[Figure: V_{t−1} plotted against p(s2).]

Point-based Value Iteration (Pineau et al., 2003)
• Fix a set of beliefs b_1, b_2, b_3, ...
• Maintain a single α-vector per belief.
• Update each α-vector with a Bellman backup at its belief, propagating the belief with the Bayes filter:
  α_{i,t}(s) = R(s, a_i) + γ Σ_{z ∈ Z} Σ_{s' ∈ S} p(s' | s, a_i) p(z | s') α_{i,z}^{t−1}(s')
  where, for each observation z, α_{i,z}^{t−1} = argmax_{α ∈ V_{t−1}} of the value α assigns to the updated belief, and a_i = argmax_a of the resulting backed-up value at b_i.
[Figure: one α-vector maintained at each of the beliefs b_1, b_2, b_3, plotted against p(s2).]
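To make the point-based backup concrete, here is a minimal sketch on the running example, again in Python with NumPy and with the same illustrative array layout as the earlier belief-update sketch (R[a, s], T[a, s, s'], O[s', z]); the function name point_based_backup is an assumption of this sketch, not notation from Pineau et al.

```python
import numpy as np

# Same example model and index convention as the belief-update sketch above.
gamma = 0.9
R = np.array([[2.0, 1.0], [1.0, 3.0]])   # R[a, s]
T = np.array([[[0.3, 0.7], [0.6, 0.4]],  # T[a, s, s']
              [[0.1, 0.9], [0.8, 0.2]]])
O = np.array([[0.9, 0.1], [0.5, 0.5]])   # O[s', z]

def point_based_backup(b, V_prev):
    """One point-based Bellman backup at belief b.

    V_prev is the previous set of alpha-vectors (arrays indexed by state);
    returns the single backed-up alpha-vector kept for this belief.
    """
    best_alpha, best_value = None, -np.inf
    for a in range(R.shape[0]):
        alpha_a = R[a].copy()
        for z in range(O.shape[1]):
            # Project every old vector back through (a, z):
            #   alpha_az(s) = gamma * sum_s' p(s'|s,a) p(z|s') alpha'(s')
            projected = [gamma * T[a] @ (O[:, z] * alpha_p) for alpha_p in V_prev]
            # Keep only the projection that is best for this particular belief.
            alpha_a += max(projected, key=lambda alpha: alpha @ b)
        if alpha_a @ b > best_value:
            best_alpha, best_value = alpha_a, alpha_a @ b
    return best_alpha

# V_{t=1}: one alpha-vector per action (the immediate rewards).
V1 = [R[0], R[1]]
for b in [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]:
    print(b, point_based_backup(b, V1))
```

Backing up V_{t=1} at the three beliefs (1, 0), (0.5, 0.5), and (0, 1) yields [4.16, 2.62], [3.52, 4.26], and [2.791, 4.728], i.e. exactly the three vectors of the pruned exact V_{t=2} above, while the per-step cost stays linear in the number of beliefs rather than growing with the number of policy trees.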