A Short Trip to Reinforcement Learning

Reinforcement Learning
Presented by: Bibhas Chakraborty and Lacey Gunter

What is Machine Learning?
• A method to learn about some phenomenon from data, when there is little scientific theory (e.g., physical or biological laws) relative to the size of the feature space.
• The goal is to make an "intelligent" machine, so that it can make decisions (or predictions) in an unknown situation.
• The science of learning plays a key role in areas like statistics, data mining and artificial intelligence. It also arises in engineering, medicine, psychology and finance.

Types of Learning
• Supervised Learning
  - Training data: (X, Y) (features, label).
  - Predict Y, minimizing some loss.
  - Regression, Classification.
• Unsupervised Learning
  - Training data: X (features only).
  - Find "similar" points in high-dimensional X-space.
  - Clustering.

Types of Learning (Cont'd)
• Reinforcement Learning
  - Training data: (S, A, R) (state, action, reward).
  - Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.
  - Robotics, board-game-playing programs.

Example of Supervised Learning
• Predict the price of a stock 6 months from now, based on economic data. (Regression)
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on demographic, diet and clinical measurements for that patient. (Logistic Regression)

Example of Supervised Learning
• Identify the numbers in a handwritten ZIP code from a digitized image (pixels). (Classification)

Example of Unsupervised Learning
• From DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)

Examples of Reinforcement Learning
• How should a robot behave so as to optimize its "performance"? (Robotics)
• How to automate the motion of a helicopter? (Control Theory)
• How to make a good chess-playing program? (Artificial Intelligence)

History of Reinforcement Learning
• Roots in the psychology of animal learning (Thorndike, 1911).
• Another independent thread was the problem of optimal control and its solution using dynamic programming (Bellman, 1957).
• The idea of temporal-difference learning (an online method), e.g., for playing board games (Samuel, 1959).
• A major breakthrough was the discovery of Q-learning (Watkins, 1989).

What is special about RL?
• RL is learning how to map states to actions so as to maximize a numerical reward over time.
• Unlike other forms of learning, it is a multi-stage decision-making process (often Markovian).
• An RL agent must learn by trial and error. (Not entirely supervised, but interactive.)
• Actions may affect not only the immediate reward but also subsequent rewards (delayed effect).

Elements of RL
• A policy
  - A map from the state space to the action space.
  - May be stochastic.
• A reward function
  - Maps each state (or state-action pair) to a real number, called the reward.
• A value function
  - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
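
To make these three elements concrete, here is a minimal Python sketch for a hypothetical two-state, two-action problem. The arrays and numbers below are purely illustrative assumptions, not anything specified on the slides.

import numpy as np

n_states, n_actions = 2, 2

# A policy: a map from states to (a distribution over) actions; it may be stochastic.
policy = np.array([[0.9, 0.1],    # in state 0, take action 0 with probability 0.9
                   [0.2, 0.8]])   # in state 1, take action 1 with probability 0.8

# A reward function: maps each state-action pair to a real number (the reward).
reward = np.array([[1.0, 0.0],
                   [0.0, 2.0]])

# A value function: the total expected reward starting from each state; here it is
# merely allocated, and would be computed from the policy, rewards and dynamics.
value = np.zeros(n_states)
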
The Precise Goal
• To find a policy that maximizes the value function.
• There are different approaches to achieving this goal in different situations.
• Q-learning and A-learning are just two different approaches to this problem; essentially, both are temporal-difference methods.

The Basic Setting
• Training data: n finite-horizon trajectories of the form $\{s_0, a_0, r_0, \dots, s_T, a_T, r_T, s_{T+1}\}$.
• Deterministic policy: a sequence of decision rules $\{\pi_0, \pi_1, \dots, \pi_T\}$.
• Each $\pi_t$ maps from the observable history (states and actions) to the action space at that time point.
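
For concreteness, one possible in-memory representation of such a trajectory is sketched below in Python; the class and field names are illustrative assumptions, not notation from the slides.

from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    states: List[int]     # s_0, ..., s_T, s_{T+1}   (length T + 2)
    actions: List[int]    # a_0, ..., a_T            (length T + 1)
    rewards: List[float]  # r_0, ..., r_T            (length T + 1)

# The training data consist of n such finite-horizon trajectories, e.g. with T = 2:
data = [Trajectory(states=[0, 1, 1, 0], actions=[1, 0, 1], rewards=[0.0, 1.0, 2.0])]
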
Value and Advantage
(Throughout, $\bar{s}_t = (s_0, \dots, s_t)$ and $\bar{a}_t = (a_0, \dots, a_t)$ denote the state and action histories.)
• The time t state value function, for history $(\bar{s}_t, \bar{a}_{t-1})$, is
$$V_t(\bar{s}_t, \bar{a}_{t-1}) = E\Big[\sum_{j=t}^{T} r_j(\bar{S}_j, \bar{A}_j, S_{j+1}) \,\Big|\, \bar{S}_t = \bar{s}_t,\ \bar{A}_{t-1} = \bar{a}_{t-1}\Big].$$
• The time t state-action value function (Q-function) is
$$Q_t(\bar{s}_t, \bar{a}_t) = E\Big[r_t(\bar{S}_t, \bar{A}_t, S_{t+1}) + V_{t+1}(\bar{S}_{t+1}, \bar{A}_t) \,\Big|\, \bar{S}_t = \bar{s}_t,\ \bar{A}_t = \bar{a}_t\Big].$$
• The time t advantage (A-function) is
$$\mu_t(\bar{s}_t, \bar{a}_t) = Q_t(\bar{s}_t, \bar{a}_t) - V_t(\bar{s}_t, \bar{a}_{t-1}).$$
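
A minimal Monte Carlo sketch of these definitions (my illustration, not a method from the slides): if the states are Markov and the trajectories were generated by the policy being evaluated, then V_t and Q_t can be estimated by averaging observed future returns, and the advantage is their difference. Each trajectory is taken here as a (states, actions, rewards) triple of lists.

from collections import defaultdict

def value_q_advantage(trajectories, t):
    """Estimate V_t, Q_t and mu_t by averaging future returns over trajectories."""
    v_sum, v_cnt = defaultdict(float), defaultdict(int)
    q_sum, q_cnt = defaultdict(float), defaultdict(int)
    for states, actions, rewards in trajectories:
        future_return = sum(rewards[t:])              # sum_{j >= t} r_j
        v_sum[states[t]] += future_return
        v_cnt[states[t]] += 1
        q_sum[(states[t], actions[t])] += future_return
        q_cnt[(states[t], actions[t])] += 1
    V = {s: v_sum[s] / v_cnt[s] for s in v_sum}
    Q = {sa: q_sum[sa] / q_cnt[sa] for sa in q_sum}
    mu = {(s, a): Q[(s, a)] - V[s] for (s, a) in Q}   # mu_t(s, a) = Q_t(s, a) - V_t(s)
    return V, Q, mu
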
Optimal Value and Advantage
• The optimal time t value function, for history $(\bar{s}_t, \bar{a}_{t-1})$, is
$$V_t^*(\bar{s}_t, \bar{a}_{t-1}) = \max_{\pi} E_{\pi}\Big[\sum_{j=t}^{T} r_j(\bar{S}_j, \bar{A}_j, S_{j+1}) \,\Big|\, \bar{S}_t = \bar{s}_t,\ \bar{A}_{t-1} = \bar{a}_{t-1}\Big].$$
• The optimal time t Q-function is
$$Q_t^*(\bar{s}_t, \bar{a}_t) = E\Big[r_t(\bar{S}_t, \bar{A}_t, S_{t+1}) + V_{t+1}^*(\bar{S}_{t+1}, \bar{A}_t) \,\Big|\, \bar{S}_t = \bar{s}_t,\ \bar{A}_t = \bar{a}_t\Big].$$
• The optimal time t A-function is
$$\mu_t^*(\bar{s}_t, \bar{a}_t) = Q_t^*(\bar{s}_t, \bar{a}_t) - V_t^*(\bar{s}_t, \bar{a}_{t-1}).$$
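
One consequence of these definitions, used implicitly when A-learning recenters its fitted advantages later in the talk: the optimal value is the maximum of the optimal Q-function over the current action, so the optimal advantage is never positive,
$$V_t^*(\bar{s}_t, \bar{a}_{t-1}) = \max_{a_t} Q_t^*(\bar{s}_t, \bar{a}_{t-1}, a_t), \qquad \text{hence} \qquad \mu_t^*(\bar{s}_t, \bar{a}_t) \le 0,$$
with equality exactly at the actions an optimal rule would choose.
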
Return (sum of the rewards)
• The conditional expectation of the return is
$$E\Big[\sum_{t=0}^{T} r_t \,\Big|\, \bar{S}_T, \bar{A}_T\Big] = \sum_{t=0}^{T} \mu_t(\bar{S}_t, \bar{A}_t) + \sum_{t=0}^{T-1} \delta_{t+1}(\bar{S}_{t+1}, \bar{A}_t) + V_0(S_0),$$
where the advantages $\mu$ are
$$\mu_t(\bar{S}_t, \bar{A}_t) = Q_t(\bar{S}_t, \bar{A}_t) - V_t(\bar{S}_t, \bar{A}_{t-1})$$
and the $\delta$, called temporal-difference errors, are
$$\delta_t(\bar{S}_t, \bar{A}_{t-1}) = r_{t-1}(\bar{S}_t, \bar{A}_{t-1}) + V_t(\bar{S}_t, \bar{A}_{t-1}) - Q_{t-1}(\bar{S}_{t-1}, \bar{A}_{t-1}).$$

Return (continued)
• The conditional expectation of the return is a telescoping sum:
$$E\Big[\sum_{t=0}^{T} r_t \,\Big|\, \bar{S}_T, \bar{A}_T\Big] = \sum_{t=0}^{T} \mu_t + \sum_{t=0}^{T-1} \delta_{t+1} + V_0(S_0) = \sum_{t=0}^{T} [Q_t - V_t] + \sum_{t=0}^{T-1} [r_t + V_{t+1} - Q_t] + V_0(S_0).$$
• Temporal-difference errors have conditional mean zero:
$$E\big[r_{t-1}(\bar{S}_t, \bar{A}_{t-1}) + V_t(\bar{S}_t, \bar{A}_{t-1}) \,\big|\, \bar{S}_{t-1}, \bar{A}_{t-1}\big] = Q_{t-1}(\bar{S}_{t-1}, \bar{A}_{t-1}),$$
that is, $E[\delta_t \mid \bar{S}_{t-1}, \bar{A}_{t-1}] = 0$.
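
To see the telescoping explicitly (a step the slide leaves implicit), substitute the definitions of $\mu_t$ and $\delta_{t+1}$ and note that the $Q$'s and $V$'s cancel in pairs:
$$\sum_{t=0}^{T}\mu_t + \sum_{t=0}^{T-1}\delta_{t+1} = \sum_{t=0}^{T}\big(Q_t - V_t\big) + \sum_{t=0}^{T-1}\big(r_t + V_{t+1} - Q_t\big) = \sum_{t=0}^{T-1} r_t + Q_T - V_0 .$$
Adding $V_0(S_0)$ leaves $\sum_{t=0}^{T-1} r_t + Q_T(\bar{S}_T, \bar{A}_T)$, which equals the left-hand side because $V_{T+1} \equiv 0$ implies $E\big[r_T \mid \bar{S}_T, \bar{A}_T\big] = Q_T(\bar{S}_T, \bar{A}_T)$.
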
Q-Learning
Watkins, 1989
• Estimate the Q-function using some approximator (for example, linear regression, neural networks or decision trees).
• Derive the estimated policy as an arg max of the estimated Q-function.
• Allow different parameter vectors at different time points.
• We illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

Q-Learning Algorithm
• Set $Q_{T+1} \equiv 0$.
• For $t = T, T-1, \dots, 0$:
$$Y_t = r_t + \max_{a_{t+1}} Q_{t+1}\big(\bar{S}_{t+1}, \bar{A}_t, a_{t+1}; \hat{\theta}_{t+1}\big),$$
$$\hat{\theta}_t = \arg\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} \Big[Y_{t,i} - Q_t\big(\bar{S}_{t,i}, \bar{A}_{t,i}; \theta\big)\Big]^2 .$$
• The estimated policy satisfies
$$\hat{\pi}_{Q,t}(\bar{s}_t, \bar{a}_{t-1}) = \arg\max_{a_t} Q_t(\bar{s}_t, \bar{a}_t; \hat{\theta}_t) \quad \text{for all } t.$$
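
A minimal sketch of this backward recursion in Python, with scikit-learn's LinearRegression as the approximator. The data layout (per-time-point arrays over the n trajectories) and the simple state-plus-action feature map are illustrative assumptions of mine, not something specified on the slides.

import numpy as np
from sklearn.linear_model import LinearRegression

def fitted_q(states, actions, rewards, action_space):
    """states[t], actions[t], rewards[t]: arrays over the n trajectories, t = 0..T."""
    T = len(rewards) - 1
    models = [None] * (T + 1)
    next_max = np.zeros(len(rewards[T]))               # Q_{T+1} is identically 0
    for t in range(T, -1, -1):
        # Y_t = r_t + max_{a_{t+1}} Q_{t+1}(history, a_{t+1}; theta_{t+1})
        Y = rewards[t] + next_max
        X = np.column_stack([states[t], actions[t]])   # simple feature map
        models[t] = LinearRegression().fit(X, Y)       # least-squares fit of theta_t
        if t > 0:
            # Pre-compute max_a Q_t(., a) for the target used at time t - 1.
            preds = [models[t].predict(np.column_stack([states[t], np.full(len(Y), a)]))
                     for a in action_space]
            next_max = np.max(preds, axis=0)
    return models

The estimated policy at time t then chooses, for each observed history, the action whose predicted Q-value under models[t] is largest.
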
What is the intuition?
• The Bellman equation gives
$$Q_t^*(\bar{S}_t, \bar{A}_t) = E\Big[r_t + \max_{a_{t+1}} Q_{t+1}^*\big(\bar{S}_{t+1}, \bar{A}_t, a_{t+1}\big) \,\Big|\, \bar{S}_t, \bar{A}_t\Big].$$
• If $Q_{t+1} = Q_{t+1}^*$ and the training set were infinite, then Q-learning would minimize
$$E\Big[r_t + \max_{a_{t+1}} Q_{t+1}^*\big(\bar{S}_{t+1}, \bar{A}_t, a_{t+1}\big) - Q_t\big(\bar{S}_t, \bar{A}_t; \theta\big)\Big]^2,$$
which is equivalent to minimizing
$$E\Big[Q_t^*\big(\bar{S}_t, \bar{A}_t\big) - Q_t\big(\bar{S}_t, \bar{A}_t; \theta\big)\Big]^2.$$
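
The claimed equivalence follows from a standard conditional-variance decomposition (spelled out here for completeness). Write $Z_t = r_t + \max_{a_{t+1}} Q_{t+1}^*(\bar{S}_{t+1}, \bar{A}_t, a_{t+1})$; by the Bellman equation, $E[Z_t \mid \bar{S}_t, \bar{A}_t] = Q_t^*(\bar{S}_t, \bar{A}_t)$, so
$$E\big[Z_t - Q_t(\bar{S}_t, \bar{A}_t; \theta)\big]^2 = E\big[Z_t - Q_t^*(\bar{S}_t, \bar{A}_t)\big]^2 + E\big[Q_t^*(\bar{S}_t, \bar{A}_t) - Q_t(\bar{S}_t, \bar{A}_t; \theta)\big]^2,$$
since the cross term has conditional mean zero given $(\bar{S}_t, \bar{A}_t)$. The first term does not involve $\theta$, so minimizing either criterion over $\theta$ gives the same result.
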
A Success Story
• TD-Gammon (Tesauro, G., 1992)
  - A backgammon-playing program.
  - An application of temporal-difference learning.
  - The basic learner is a neural network.
  - It trained itself to world-class level by playing against itself and learning from the outcomes.
  - More information: http://www.research.ibm.com/massive/tdl.html

A-Learning
Murphy, 2003 and Robins, 2004
• Estimate the A-function (advantages) using some approximator, as in Q-learning.
• Derive the estimated policy as an arg max of the estimated A-function.
• Allow different parameter vectors at different time points.
• We illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

A-Learning Algorithm (Inefficient Version)
• For $t = T, T-1, \dots, 0$:
$$Y_t = \sum_{j=t}^{T} r_j - \sum_{j=t+1}^{T} \hat{\mu}_j\big(\bar{S}_j, \bar{A}_j; \hat{\theta}_j\big),$$
$$\hat{\theta}_t = \arg\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} \Big[Y_{t,i} - \mu_t\big(\bar{S}_{t,i}, \bar{A}_{t,i}; \theta\big) - E\big(\mu_t \mid \bar{S}_{t,i}, \bar{A}_{t-1,i}\big)\Big]^2,$$
$$\hat{\mu}_t\big(\bar{S}_t, \bar{A}_t; \hat{\theta}_t\big) = \mu_t\big(\bar{S}_t, \bar{A}_t; \hat{\theta}_t\big) - \max_{a_t} \mu_t\big(\bar{S}_t, \bar{A}_{t-1}, a_t; \hat{\theta}_t\big).$$
• The estimated policy satisfies
$$\hat{\pi}_{A,t}(\bar{s}_t, \bar{a}_{t-1}) = \arg\max_{a_t} \mu_t(\bar{s}_t, \bar{a}_t; \hat{\theta}_t) \quad \text{for all } t.$$
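
A mechanical sketch of this recursion in Python, under strong simplifying assumptions of my own: binary actions in {0, 1} that were uniformly randomized (so the conditional expectation of mu_t given the history is just the average of mu_t over the two actions), a linear advantage model, and a simple state-action feature map. It is meant only to make the bookkeeping concrete, not to reproduce the authors' implementation.

import numpy as np

def phi(s, a):
    """Illustrative feature map: action main effect plus state-action interaction."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    s = np.asarray(s, dtype=float).reshape(len(a), -1)
    return np.column_stack([a, a * s])

def a_learning(states, actions, rewards):
    """states[t], actions[t], rewards[t]: arrays over the n trajectories, t = 0..T."""
    T = len(rewards) - 1
    thetas = [None] * (T + 1)
    mu_hat_future = np.zeros(len(rewards[T]))    # running sum of mu_hat_j for j > t
    for t in range(T, -1, -1):
        # Y_t = sum_{j >= t} r_j  -  sum_{j > t} mu_hat_j
        Y = sum(rewards[j] for j in range(t, T + 1)) - mu_hat_future
        phi0 = phi(states[t], np.zeros_like(actions[t]))
        phi1 = phi(states[t], np.ones_like(actions[t]))
        phi_obs = phi(states[t], actions[t])
        # Least squares for Y - mu_t(theta) - E(mu_t | history), where E(. | history)
        # is taken as the average over the two uniformly randomized actions.
        X = phi_obs + 0.5 * (phi0 + phi1)
        thetas[t], *_ = np.linalg.lstsq(X, Y, rcond=None)
        # Recentred advantage: mu_hat_t = mu_t(theta_t) - max_a mu_t(theta_t, a)
        mu0, mu1, mu_obs = phi0 @ thetas[t], phi1 @ thetas[t], phi_obs @ thetas[t]
        mu_hat_future = mu_hat_future + (mu_obs - np.maximum(mu0, mu1))
    return thetas

The estimated rule at time t then recommends the action maximizing phi(s, a) @ thetas[t].
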
Differences between Q-learning and A-learning
• Q-learning
  - At time t we model the main effects of the history $(\bar{S}_t, \bar{A}_{t-1})$ and the action $A_t$, and their interaction.
  - Our $Y_{t-1}$ is affected by how we modeled the main effect of the history $(\bar{S}_t, \bar{A}_{t-1})$ at time t.
• A-learning
  - At time t we model only the effect of $A_t$ and its interaction with $(\bar{S}_t, \bar{A}_{t-1})$.
  - Our $Y_{t-1}$ does not depend on a model of the main effect of the history $(\bar{S}_t, \bar{A}_{t-1})$ at time t.

Q-Learning vs. A-Learning
• The relative merits and demerits of the two methods are not yet completely understood.
• Q-learning has low variance but high bias.
• A-learning has high variance but low bias.
• A comparison of Q-learning with A-learning therefore involves a bias-variance trade-off.

References
• Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction.
• Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction.
• Murphy, S.A. (2003). Optimal Dynamic Treatment Regimes. JRSS-B.
• Blatt, D., Murphy, S.A. and Zhu, J. (2004). A-Learning for Approximate Planning.
• Murphy, S.A. (2004). A Generalization Error for Q-Learning.