Planning to Maximize Reward: Markov Decision Processes
Brian C. Williams, 16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

2  Reading and Assignments
- This lecture is based on the development in "Machine Learning" by Tom Mitchell, Chapter 13: Reinforcement Learning.
- Markov Decision Processes: read AIMA, Chapter 17, Sections 1-3.

3  How Might a Mouse Search a Maze for Cheese?
- By state space search?
- As a constraint satisfaction problem?
- By goal-directed planning?
- As a rule or production system?
- What is missing?
[Figure: maze with cheese.]

4  Ideas in this Lecture
- The objective is to accumulate rewards, rather than to reach goal states.
- The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.

5  MDP Examples: TD-Gammon [Tesauro, 1995]
Learning through reinforcement: learns to play Backgammon.
- States: board configurations (roughly 10^20).
- Actions: moves.
- Rewards: +100 if win, -100 if lose, 0 for all other states.
- Trained by playing 1.5 million games against itself.
- Currently roughly equal to the best human player.

6  MDP Examples: Aerial Robotics [Feron et al.]
Computing a solution from a continuous model.
[Images courtesy of Eric Feron.]

7  Markov Decision Processes (outline)
- Motivation
- What are Markov Decision Processes (MDPs)?
  - Models
  - Lifetime Reward
  - Policies
- Computing Policies From a Model
- Summary

8  MDP Problem
The agent and the environment interact in a loop: the agent observes state s_t, receives reward r_t, and chooses action a_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

9  MDP Problem: Model
[Same agent/environment loop; this slide highlights the model.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

10  Markov Decision Processes (MDPs)
Model:
- A finite set of states, S.
- A finite set of actions, A.
- (Probabilistic) state transitions, δ(s, a).
- A reward for each state and action, R(s, a).
Process:
- Observe state s_t in S.
- Choose action a_t in A.
- Receive immediate reward r_t.
- State changes to s_{t+1}.
Example: the interaction s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, beginning with s_1, a_1.
[Grid-world figure: goal state G; only legal transitions are shown; transitions entering G have reward 10; reward on unlabeled transitions is 0.]

11  MDP Environment Assumptions
- Markov assumption: the next state and reward are a function only of the current state and action:
    s_{t+1} = δ(s_t, a_t)
    r_t = r(s_t, a_t)
- Uncertain and unknown environment: δ and r may be nondeterministic and unknown.
- Today we consider only the deterministic case.

12  MDP Nondeterministic Example
States: S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia.
Actions: R = Research, D = Development.
[Figure: transition diagram with probabilities 0.1, 0.9, and 1.0 on the R and D actions.]
(Again, today we consider only the deterministic case.)

13  MDP Problem: Model
[Agent/environment loop as before.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

14  MDP Problem: Lifetime Reward
[Agent/environment loop as before.]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

15  Lifetime Reward
- Finite horizon: $100K + $100K + $100K = $300K.
- Infinite horizon: $100K + $100K + ... = infinity.
- Discounting (a bird in the hand ...): $100K + γ $100K + γ^2 $100K + ... (a short code sketch follows below).
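To make the deterministic model and the discounted lifetime reward concrete, here is a minimal Python sketch. The two-state world, the state names "s1" and "goal", the reward of 10 for entering the goal, and γ = 0.9 are illustrative assumptions, not the exact figure from the slides.

```python
# Minimal sketch of a deterministic MDP and discounted lifetime reward.
# The tiny two-state example below is illustrative, not the grid from the slides.

GAMMA = 0.9  # discount factor

# delta(s, a) -> next state; reward(s, a) -> immediate reward
delta = {
    ("s1", "right"): "goal",
    ("s1", "stay"): "s1",
    ("goal", "stay"): "goal",
}
reward = {
    ("s1", "right"): 10,   # entering the goal pays 10 (assumed)
    ("s1", "stay"): 0,
    ("goal", "stay"): 0,
}

def lifetime_reward(rewards, gamma=GAMMA):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** i) * r_i for i, r_i in enumerate(rewards))

def rollout(start, actions):
    """Follow a fixed action sequence from `start`, collecting immediate rewards."""
    s, rs = start, []
    for a in actions:
        rs.append(reward[(s, a)])
        s = delta[(s, a)]
    return rs

if __name__ == "__main__":
    rs = rollout("s1", ["right", "stay", "stay"])
    print(rs, lifetime_reward(rs))   # [10, 0, 0] 10.0
```

With γ < 1 the same infinite stream of rewards that would diverge under an undiscounted infinite horizon has a finite discounted sum, which is the point of the discount factor.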
16  MDP Problem: Lifetime Reward
[Agent/environment loop as before.]
Given an environment model as an MDP, create a policy for acting that maximizes the lifetime reward
    V = r_0 + γ r_1 + γ^2 r_2 + ...
- Finite horizon: rewards accumulate for a fixed period.
- Infinite horizon: if reward accumulates forever, $100K + $100K + ... = infinity.
- Discounting: future rewards are not worth as much, so introduce a discount factor γ < 1; the sum then converges, which will make the math work.

17  MDP Problem: Policy
[Agent/environment loop as before.]
Given an environment model as an MDP, create a policy for acting that maximizes the lifetime reward
    V = r_0 + γ r_1 + γ^2 r_2 + ...

18  Policies
Assume a deterministic world.
- A policy π : S → A selects an action for each state.
- An optimal policy π* : S → A selects, for each state, the action that maximizes lifetime reward.
[Grid-world figure: two policies π drawn as arrows on the grid with goal G.]

19  Policies (continued)
- There are many policies, and not all of them are optimal.
- There may be several optimal policies.
[Grid-world figure: three different policies drawn on the grid with goal G.]

20  Markov Decision Processes (outline)
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions through Value Iteration
  - An Alternative: Policy Iteration (appendix)
- Summary

21  Value Function V^π for a Given Policy π
V^π(s_t) is the accumulated lifetime reward resulting from starting in state s_t and repeatedly executing policy π:
    V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_i γ^i r_{t+i}
where r_t, r_{t+1}, r_{t+2}, ... are generated by following π starting at s_t.
[Grid-world figure: with γ = 0.9, the value V^π of each state under π is shown, e.g. 10 adjacent to the goal, 9 one step further away, and 0 at G.]

22  An Optimal Policy π* Given Value Function V*
Idea: given state s,
1. Examine the possible actions a_i in state s.
2. Select the action a_i with the greatest lifetime reward Q(s, a_i): the immediate reward for taking the action, r(s, a), plus the lifetime reward starting in the target state, V(δ(s, a)), discounted by γ:
    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Requires:
- the value function, and
- the environment model: δ : S × A → S and r : S × A → ℝ.
(A code sketch of this one-step lookahead appears after slide 27.)

23  Example: Mapping Value Function to Policy
Model + V, with γ = 0.9:
    π(s) = argmax_a [ r(s, a) + γ V(δ(s, a)) ]
[Grid-world figure: values 90, 100, 0 (goal G) in the top row and 81, 90, 100 in the bottom row; actions entering G have reward 100, all others 0.]

24  Example: Mapping Value Function to Policy
The agent selects the optimal action in the highlighted state (γ = 0.9):
- a: 0 + 0.9 × 100 = 90
- b: 0 + 0.9 × 81 = 72.9
→ select a.

25  Example: Mapping Value Function to Policy
The agent selects the optimal action in the next state (γ = 0.9):
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 90 = 81
→ select a.

26  Example: Mapping Value Function to Policy
The agent selects the optimal action in the remaining state in the same way (γ = 0.9), now over three choices:
- a: ?
- b: ?
- c: ?

Markov Decision Processes (outline)
- Motivation
- Markov Decision Processes
- Computing Policies From a Model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions through Value Iteration
  - An Alternative: Policy Iteration
- Summary

27  Value Function V* for an Optimal Policy π*: Example
Two states S_A and S_B, actions A and B, and rewards R_A and R_B.
Optimal value function for a one-step horizon:
    V*_1(s) = max_{a_i} [ r(s, a_i) ]
[Figure: one-step lookahead trees for V*_1(S_A) and V*_1(S_B); each takes the max of the immediate rewards R_A and R_B.]
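The one-step lookahead π(s) = argmax_a [ r(s,a) + γ V(δ(s,a)) ] from the mapping example can be sketched in a few lines. The 2×3 grid encoding, the (row, col) state names, and the action labels below are assumptions modeled loosely on the slides' figure; the value function uses the numbers shown on slide 23.

```python
# Sketch: extracting a greedy policy from a value function by one-step lookahead,
#   pi(s) = argmax_a [ r(s, a) + gamma * V(delta(s, a)) ]
# The 2x3 grid world below is an assumed stand-in for the slides' example.

GAMMA = 0.9

# Deterministic model: (state, action) -> (next_state, reward).
# States are (row, col); 'G' is the absorbing goal in the upper right.
model = {
    ((0, 0), "right"): ((0, 1), 0),
    ((0, 1), "right"): ("G", 100),   # entering the goal pays 100
    ((0, 1), "left"):  ((0, 0), 0),
    ((0, 0), "down"):  ((1, 0), 0),
    ((1, 0), "up"):    ((0, 0), 0),
    ((1, 0), "right"): ((1, 1), 0),
    ((1, 1), "left"):  ((1, 0), 0),
    ((1, 1), "right"): ((1, 2), 0),
    ((1, 2), "left"):  ((1, 1), 0),
    ((1, 2), "up"):    ("G", 100),
}

# A value function like the one on slide 23 (placement assumed for illustration).
V = {(0, 0): 90, (0, 1): 100, (1, 0): 81, (1, 1): 90, (1, 2): 100, "G": 0}

def greedy_policy(model, V, gamma=GAMMA):
    """One-step lookahead: in each state, pick the action maximizing r + gamma*V(s')."""
    pi = {}
    for s in {s for (s, _a) in model}:
        actions = [a for (s2, a) in model if s2 == s]
        pi[s] = max(actions,
                    key=lambda a: model[(s, a)][1] + gamma * V[model[(s, a)][0]])
    return pi

if __name__ == "__main__":
    for s, a in sorted(greedy_policy(model, V).items(), key=str):
        print(s, "->", a)
```

For state (1, 1) this reproduces the comparison on slide 24: moving toward the value-100 neighbor scores 0 + 0.9 × 100 = 90, versus 0 + 0.9 × 81 = 72.9 the other way.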
28  Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
    V*_1(s) = max_{a_i} [ r(s, a_i) ]
- Optimal value function for a two-step horizon:
    V*_2(s) = max_{a_i} [ r(s, a_i) + γ V*_1(δ(s, a_i)) ]
This is an instance of the dynamic programming principle:
- Reuse shared sub-results.
- Exponential saving.
[Figure: two-step lookahead trees for V*_2(S_A) and V*_2(S_B); each branch adds γ V*_1 of the successor state to the immediate reward, and the shared one-step values are reused.]

29  Value Function V* for an Optimal Policy π*: Example
- Optimal value function for a one-step horizon:
    V*_1(s) = max_{a_i} [ r(s, a_i) ]
- Optimal value function for a two-step horizon:
    V*_2(s) = max_{a_i} [ r(s, a_i) + γ V*_1(δ(s, a_i)) ]
- Optimal value function for an n-step horizon:
    V*_n(s) = max_{a_i} [ r(s, a_i) + γ V*_{n-1}(δ(s, a_i)) ]

30-31  Value Function V* for an Optimal Policy π*
- V*_1(s) = max_{a_i} [ r(s, a_i) ]
- V*_2(s) = max_{a_i} [ r(s, a_i) + γ V*_1(δ(s, a_i)) ]
- V*_n(s) = max_{a_i} [ r(s, a_i) + γ V*_{n-1}(δ(s, a_i)) ]
- Optimal value function for an infinite horizon:
    V*(s) = max_{a_i} [ r(s, a_i) + γ V*(δ(s, a_i)) ]
[Figure: the two-state example with states S_A, S_B, actions A, B, and rewards R_A, R_B.]

32  Solving MDPs by Value Iteration
Insight: the optimal values can be calculated iteratively using dynamic programming.
Algorithm:
- Iteratively calculate values using Bellman's equation:
    V*_{t+1}(s) ← max_a [ r(s, a) + γ V*_t(δ(s, a)) ]
- Terminate when values are "close enough":
    |V*_{t+1}(s) - V*_t(s)| < ε
- The agent then selects the optimal action by one-step lookahead on V*:
    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
(A code sketch of this algorithm follows the worked example below.)

33  Convergence of Value Iteration
- If we terminate when the values are "close enough", i.e. |V_{t+1}(s) - V_t(s)| < ε for all s in S, then:
    max_s |V_{t+1}(s) - V*(s)| < 2εγ / (1 - γ)
- Converges in polynomial time.
- Convergence is guaranteed even if updates are performed infinitely often, but asynchronously and in any order.

34  Example of Value Iteration
    V*_{t+1}(s) ← max_a [ r(s, a) + γ V*_t(δ(s, a)) ]
Grid world with goal G; actions entering G have reward 100, all others 0; γ = 0.9. Start with V*_t = 0 everywhere.
Updating the first state:
- a: 0 + 0.9 × 0 = 0
- b: 0 + 0.9 × 0 = 0
→ max = 0.

35  Example of Value Iteration
Updating the state next to G, which has three actions:
- a: 100 + 0.9 × 0 = 100
- b: 0 + 0.9 × 0 = 0
- c: 0 + 0.9 × 0 = 0
→ max = 100.

36-39  Example of Value Iteration
The remaining states are updated in the same way. After the first sweep, V*_{t+1} is 100 in the states that can enter G directly and 0 everywhere else.

40-42  Example of Value Iteration
Subsequent sweeps propagate value backward from the goal: a state one step further away gets 0 + 0.9 × 100 = 90, and a state two steps away gets 0 + 0.9 × 90 = 81. The values converge to 90, 100, 0 (goal G) in the top row and 81, 90, 100 in the bottom row.
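Here is a minimal sketch of value iteration for the deterministic case on the same assumed 2×3 grid encoding used earlier (reward 100 for entering G, 0 otherwise, γ = 0.9). The dictionary representation and the ε threshold of 10^-3 are assumptions for illustration.

```python
# Sketch of value iteration for a deterministic MDP: repeated Bellman backups
# plus the "close enough" termination test from slide 32.
# The 2x3 grid world is an assumed encoding of the slides' example.

GAMMA = 0.9
EPSILON = 1e-3

# (state, action) -> (next_state, reward)
model = {
    ((0, 0), "right"): ((0, 1), 0),
    ((0, 1), "right"): ("G", 100),
    ((0, 1), "left"):  ((0, 0), 0),
    ((0, 0), "down"):  ((1, 0), 0),
    ((1, 0), "up"):    ((0, 0), 0),
    ((1, 0), "right"): ((1, 1), 0),
    ((1, 1), "left"):  ((1, 0), 0),
    ((1, 1), "right"): ((1, 2), 0),
    ((1, 2), "left"):  ((1, 1), 0),
    ((1, 2), "up"):    ("G", 100),
}
states = {s for (s, _a) in model} | {s2 for (s2, _r) in model.values()}

def value_iteration(model, states, gamma=GAMMA, eps=EPSILON):
    V = {s: 0.0 for s in states}           # start from all-zero values
    while True:
        V_next = {}
        for s in states:
            acts = [a for (s2, a) in model if s2 == s]
            if not acts:                    # absorbing goal: no actions, value stays 0
                V_next[s] = 0.0
                continue
            # Bellman backup: V_{t+1}(s) = max_a [ r(s,a) + gamma * V_t(delta(s,a)) ]
            V_next[s] = max(model[(s, a)][1] + gamma * V[model[(s, a)][0]]
                            for a in acts)
        if max(abs(V_next[s] - V[s]) for s in states) < eps:
            return V_next                   # values are "close enough"
        V = V_next

if __name__ == "__main__":
    V = value_iteration(model, states)
    for s in sorted(V, key=str):
        print(s, round(V[s], 1))
```

On this assumed grid the values settle at 100 for the states adjacent to G, 90 one step further away, and 81 two steps away, matching the numbers in the worked example.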
43  Markov Decision Processes (outline)
- Motivation
- Markov Decision Processes
- Computing policies from a model
  - Value Functions
  - Mapping Value Functions to Policies
  - Computing Value Functions through Value Iteration
  - An Alternative: Policy Iteration (appendix)
- Summary

44  Policy Iteration
Idea: iteratively improve the policy.
1. Policy evaluation: given a policy π_i, calculate V_i = V^{π_i}, the utility of each state if π_i were to be executed.
2. Policy improvement: calculate a new maximum-expected-utility policy π_{i+1} using one-step lookahead based on V_i.
- π_i improves at every step, converging when π_i = π_{i+1}.
- Computing V_i is simpler than for value iteration (no max):
    V_{t+1}(s) ← r(s, π_i(s)) + γ V_t(δ(s, π_i(s)))
- Either solve the linear equations in O(N^3), or solve iteratively, similar to value iteration.
(A code sketch of policy iteration appears at the end of these notes.)

45  Markov Decision Processes (outline)
- Motivation
- Markov Decision Processes
- Computing policies from a model
  - Value Iteration
  - Policy Iteration
- Summary

46  Markov Decision Processes (MDPs)
Model:
- A finite set of states, S.
- A finite set of actions, A.
- Probabilistic state transitions, δ(s, a).
- A reward for each state and action, R(s, a).
Process:
- Observe state s_t in S.
- Choose action a_t in A.
- Receive immediate reward r_t.
- State changes to s_{t+1}.
The interaction produces s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...
[Deterministic grid-world example with goal G, as before.]

Crib Sheet: MDPs by Value Iteration
Insight: the optimal values can be calculated iteratively using dynamic programming.
Algorithm:
- Iteratively calculate values using Bellman's equation:
    V*_{t+1}(s) ← max_a [ r(s, a) + γ V*_t(δ(s, a)) ]
- Terminate when values are "close enough":
    |V*_{t+1}(s) - V*_t(s)| < ε
- The agent selects the optimal action by one-step lookahead on V*:
    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

47  Ideas in this Lecture
- The objective is to accumulate rewards, rather than to reach goal states.
- Objectives are achieved along the way, rather than at the end.
- The task is to generate policies for how to act in all situations, rather than a plan for a single starting situation.
- Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
- Value functions are iteratively approximated.

48-49  How Might a Mouse Search a Maze for Cheese?
- By value iteration?
- What is missing?
[Figure: maze with cheese.]
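To complement the value-iteration crib sheet, here is a minimal sketch of the policy iteration scheme from slide 44, for the deterministic case considered in this lecture. It reuses the same assumed 2×3 grid world; the state names, action labels, the arbitrary initial policy, and the choice to evaluate each policy iteratively (rather than by solving the O(N^3) linear system) are illustrative assumptions.

```python
# Sketch of policy iteration on a deterministic model (slide 44):
#   1. policy evaluation (no max: just follow pi_i), 2. policy improvement (one-step lookahead).
# The 2x3 grid world is an assumed encoding of the lecture's running example.

GAMMA = 0.9
EPSILON = 1e-3

# (state, action) -> (next_state, reward)
model = {
    ((0, 0), "right"): ((0, 1), 0),
    ((0, 1), "right"): ("G", 100),
    ((0, 1), "left"):  ((0, 0), 0),
    ((0, 0), "down"):  ((1, 0), 0),
    ((1, 0), "up"):    ((0, 0), 0),
    ((1, 0), "right"): ((1, 1), 0),
    ((1, 1), "left"):  ((1, 0), 0),
    ((1, 1), "right"): ((1, 2), 0),
    ((1, 2), "left"):  ((1, 1), 0),
    ((1, 2), "up"):    ("G", 100),
}
actions_of = {}
for (s, a) in model:
    actions_of.setdefault(s, []).append(a)

def evaluate(pi, gamma=GAMMA, eps=EPSILON):
    """Iterative policy evaluation: V(s) <- r(s, pi(s)) + gamma * V(delta(s, pi(s)))."""
    V = {s: 0.0 for s in actions_of}
    V["G"] = 0.0                                  # absorbing goal
    while True:
        delta_max = 0.0
        for s in actions_of:
            s2, r = model[(s, pi[s])]
            new = r + gamma * V[s2]
            delta_max = max(delta_max, abs(new - V[s]))
            V[s] = new
        if delta_max < eps:
            return V

def policy_iteration(gamma=GAMMA):
    pi = {s: acts[0] for s, acts in actions_of.items()}   # arbitrary initial policy
    while True:
        V = evaluate(pi, gamma)
        # Policy improvement: one-step lookahead on V.
        new_pi = {s: max(acts,
                         key=lambda a: model[(s, a)][1] + gamma * V[model[(s, a)][0]])
                  for s, acts in actions_of.items()}
        if new_pi == pi:                          # converged: pi_{i+1} = pi_i
            return pi, V
        pi = new_pi

if __name__ == "__main__":
    pi, V = policy_iteration()
    for s in sorted(pi, key=str):
        print(s, "->", pi[s], " V =", round(V[s], 1))
```

Note how the evaluation step has no max: each state simply backs up the value of the single action the current policy prescribes, which is what makes it cheaper than a value-iteration sweep.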