Slide 1 – Learning to Act Optimally: Reinforcement Learning
Brian C. Williams, 16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

Slide 2 – TD-Gammon [Tesauro, 1995]: Learning Through Reinforcement
Learns to play Backgammon.
- States: board configurations (≈10^20)
- Actions: moves
- Rewards: +100 if win, -100 if lose, 0 for all other states
- Trained by playing 1.5 million games against itself.
- Currently roughly equal to the best human player.

Slide 3 – Markov Decision Processes and Reinforcement Learning (Outline)
- Motivation
- Learning policies through reinforcement
  - Q values
  - Q learning
- Appendix: Nondeterministic MDPs
- Summary

Slide 4 – Reinforcement Learning Problem
(Figure: agent–environment loop. At each step the agent observes a state and reward and emits an action: s0 –a0→ s1 –a1→ s2 –a2→ s3 …, with rewards r0, r1, r2, ….)
Given, repeatedly:
- the executed action,
- the observed state,
- the observed reward.
Goal: learn to choose actions that maximize life reward r0 + γ r1 + γ^2 r2 + …, with discount 0 ≤ γ < 1.
That is, learn an action policy π: S → A that maximizes life reward from any start state.
Note:
- Unsupervised learning
- Delayed reward
- Model not known

Slide 5 – How About Learning Value Functions?
1. Have the agent learn the value function V^π*, denoted V*.
2. The agent selects the optimal action by one-step lookahead on V*:
   π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]
Problem: this works well if the agent knows the environment model:
- δ: S × A → S
- r: S × A → ℝ
With no model, the agent can't choose an action from V*.

Slide 6 – Eliminating the Model with Q Functions
π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]
Key idea:
- Define a function like V* that encapsulates δ and r:
  Q(s,a) = r(s,a) + γ V*(δ(s,a))
- Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
  π*(s) = argmax_a Q(s,a)

Slide 7 – How Do We Learn Q?
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
- Need to eliminate V* in the update rule.
- Note that Q and V* are closely related: V*(s) = max_a' Q(s, a')
- Substituting Q for V*:
  Q(s_t, a_t) = r(s_t, a_t) + γ max_a' Q(s', a'),  where s' = δ(s_t, a_t)

Slide 8 – Example: Q-Learning Update (γ = 0.9)
Q(s1, a_right) ← r(s1, a_right) + γ max_a' Q(s2, a')
              ← 0 + 0.9 × max{63, 81, 100}
              ← 90
(Figure: a grid shown before and after the update; the arcs leaving s2 carry Q values 63, 81, and 100, R = 100 marks the goal reward, and the estimate for Q(s1, a_right) is revised from 72 to 90.)
Note: if rewards are non-negative, then
- ∀ s, a, n: Q_n(s, a) ≤ Q_{n+1}(s, a)
- ∀ s, a, n: 0 ≤ Q_n(s, a) ≤ Q(s, a)

Slide 9 – Q-Learning Iterations (γ = 0.8)
The agent starts at the top-left corner and moves clockwise around the perimeter. Initially all values in the Q table are zero; each step applies Q(s, a) ← r + γ max_a' Q(s', a').
(Figure: a grid of states s1–s6 with an absorbing goal state G; transitions into the goal receive reward 10, all other rewards are 0.)
- Tour 1: Q(s1,E) = 0; Q(s2,E) = 0; Q(s3,S) = 0; Q(s4,W) = r + γ max{Q(s5,loop)} = 10 + 0.8 × 0 = 10
- Tour 2: Q(s1,E) = 0; Q(s2,E) = 0; Q(s3,S) = r + γ max{Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8; Q(s4,W) = 10
- Tour 3: Q(s1,E) = 0; Q(s2,E) = r + γ max{Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4; Q(s3,S) = 8; Q(s4,W) = 10

Slide 10 – Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially: for each s, a, initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
Do forever:
- Select an action a and execute it.
- Receive immediate reward r.
- Observe the new state s'.
- Update the table entry for Q̂(s, a): Q̂(s, a) ← r + γ max_a' Q̂(s', a')
- s ← s'
(A runnable sketch of this loop appears after the Discussion slide below.)

Slide 11 – Discussion
- How should the learning agent use the intermediate Q values?
  - Exploration vs. exploitation
- Scaling up in the size of the state space
  - Function approximators (neural net instead of a table)
  - Reuse, use of macros
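To make the crib sheet concrete, here is a minimal Python sketch of tabular Q-learning for a deterministic world. The transition table, the state names s1–s6, the absorbing goal s5, and the reward of 10 are illustrative assumptions chosen to mirror the slide 9 example (γ = 0.8); the step and q_learning helpers are hypothetical, not the lecture's own code or figure.

```python
import random

# Hypothetical deterministic world mirroring the slide 9 example:
# six states, an absorbing goal s5, reward 10 for entering the goal,
# 0 everywhere else.  The exact topology is an illustrative assumption.
TRANSITIONS = {
    "s1": {"E": "s2", "S": "s6"},
    "s2": {"E": "s3", "W": "s1"},
    "s3": {"S": "s4", "W": "s2"},
    "s4": {"W": "s5", "N": "s3"},
    "s6": {"E": "s5", "N": "s1"},
    "s5": {"loop": "s5"},          # absorbing goal state
}
GOAL = "s5"

def step(state, action):
    """Deterministic transition function delta plus the reward function r."""
    next_state = TRANSITIONS[state][action]
    reward = 10 if next_state == GOAL and state != GOAL else 0
    return next_state, reward

def q_learning(gamma=0.8, episodes=200, start="s1"):
    """Crib-sheet loop: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    # Initially every table entry Q(s, a) is zero.
    Q = {s: {a: 0.0 for a in acts} for s, acts in TRANSITIONS.items()}
    for _ in range(episodes):
        s = start
        while s != GOAL:
            a = random.choice(list(Q[s]))    # select an action and execute it
            s_next, r = step(s, a)           # receive reward, observe new state
            Q[s][a] = r + gamma * max(Q[s_next].values())   # update table entry
            s = s_next                       # s <- s'
    return Q

if __name__ == "__main__":
    Q = q_learning()
    for s, a in [("s4", "W"), ("s3", "S"), ("s2", "E"), ("s1", "E")]:
        print(f"Q({s},{a}) = {Q[s][a]:.2f}")
    # Converges toward 10, 8, 6.4, 5.12 -- the same progression as the
    # hand-computed tours on slide 9.
```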
Slide 12 – Markov Decision Processes and Reinforcement Learning (Outline)
- Motivation
- Learning policies through reinforcement
  - Q values
  - Q learning
- Appendix: Nondeterministic MDPs
- Summary

Slide 13 – Nondeterministic Example
(Figure: a four-state transition diagram over S1 Unemployed, S2 Industry, S3 Grad School, and S4 Academia, with actions R – Research and D – Development; action outcomes are probabilistic, with transition probabilities 0.1, 0.9, and 1.0 on the arcs.)

Slide 14 – Nondeterministic Case
We redefine V and Q by taking expected values:
  V^π(s_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … ] = E[ Σ_i γ^i r_{t+i} ]
  Q(s_t, a_t) = E[ r(s_t, a_t) + γ V*(δ(s_t, a_t)) ]

Slide 15 – Nondeterministic Case
Alter the training rule to
  Q̂_n(s, a) ← (1 − α_n) Q̂_{n-1}(s, a) + α_n [ r + γ max_a' Q̂_{n-1}(s', a') ]
where α_n = 1/(1 + visits_n(s, a)) and s' = δ(s, a).
Can still prove convergence of Q̂ to Q [Watkins and Dayan, 92].
(A code sketch of this rule appears at the end of these notes.)

Slide 16 – Ongoing Research
- Handling the case where the state is only partially observable
- Designing optimal exploration strategies
- Extending to continuous actions and states
- Learning and using δ̂: S × A → S
- Relationship to dynamic programming
- Multiple learners – multi-agent reinforcement learning
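As a companion to the deterministic sketch above, here is a minimal Python sketch of the altered training rule from slide 15, using the decaying step size α_n = 1/(1 + visits_n(s, a)). The nondeterministic_q_learning function, the sample_step toy environment, its "low"/"high" states, "risky"/"safe" actions, probabilities, and rewards are all made-up illustrations, not the career example from slide 13 or the lecture's own code.

```python
import random
from collections import defaultdict

def nondeterministic_q_learning(actions, sample_step, start, gamma=0.9,
                                episodes=2000, steps_per_episode=30):
    """Q-learning for nondeterministic worlds (slide 15):
    Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma max_a' Q_{n-1}(s',a')]
    with alpha_n = 1 / (1 + visits_n(s, a))."""
    Q = defaultdict(float)        # table entries Q[(s, a)], initially 0
    visits = defaultdict(int)     # visit counts per (s, a) pair

    for _ in range(episodes):
        s = start
        for _ in range(steps_per_episode):
            a = random.choice(actions)              # explore uniformly at random
            s_next, r = sample_step(s, a)           # stochastic transition + reward
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])    # decaying learning rate
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            # Blend the old estimate with the new sampled target.
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q

# Hypothetical two-state toy: from "low", the "risky" action pays off with
# probability 0.9; once in "high", every step yields a small reward.
def sample_step(state, action):
    if state == "low" and action == "risky":
        return ("high", 10) if random.random() < 0.9 else ("low", 0)
    if state == "high":
        return ("high", 1)
    return ("low", 0)      # "safe" in "low": stay put, no reward

if __name__ == "__main__":
    Q = nondeterministic_q_learning(["risky", "safe"], sample_step, start="low")
    print({sa: round(v, 2) for sa, v in Q.items()})
```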