Learning Algorithms for Solving MDPs

References: Barto, Bradtke and Singh (1995) "Learning to Act Using Real-Time Dynamic Programming" in Machine Learning (also on WWW)

1. Q-Learning

Given an MDP problem, define the function $Q: S \times A \to \mathbb{R}$ by

$$Q(s,a) = u(s,a) + \beta \sum_{s' \in S} V(s')\, p(s' \mid s,a), \qquad V(s) = \max_{a' \in A(s)} Q(s,a'),$$

i.e. $Q(s,a)$ is the expected present discounted value of taking action $a$ in state $s$. Let $s'$ denote the (random) successor state to $s$, i.e. a random draw from $p(\cdot \mid s,a)$. Then if $Q_t$ denotes the estimate of the $Q$ function at iteration $t$ of the Q-learning algorithm, we have:

$$Q_{t+1}(s,a) = Q_t(s,a) + \gamma_t(s,a) \left[ u(s,a) + \beta \max_{a' \in A(s')} Q_t(s',a') - Q_t(s,a) \right].$$

Comment: Each of the $(s,a)$ pairs is updated at each step of the Q-learning algorithm. At any stage $t$ the recommended action in state $s$ is the action for which the corresponding $Q_t(s,a)$ is highest, i.e.

$$\hat{a}_t(s) = \mathop{\rm argmax}_{a \in A(s)} Q_t(s,a).$$

Comment: $\{\gamma_t(s,a)\}$ is a sequence of possibly stochastic step-size functions satisfying, for each $(s,a)$ pair,

$$\sum_{t=0}^{\infty} \gamma_t(s,a) = \infty \ \ \text{w.p. 1}, \qquad \sum_{t=0}^{\infty} \gamma_t(s,a)^2 < \infty \ \ \text{w.p. 1}.$$

(A numerical sketch of this update rule appears at the end of these notes.)

Relationship to Stochastic Approximation

1. Standard stochastic approximation: Find the fixed point $V = \Gamma(V)$ by

$$V_{t+1} = V_t + \gamma_t \left[ \Gamma(V_t) - V_t + \eta_t \right],$$

where $\{\gamma_t\}$ satisfies the step-size conditions above and $\{\eta_t\}$ is a sequence of stochastic shocks satisfying $E\{\eta_t \mid V_t, \dots, V_0\} = 0$ and $\mathrm{var}(\eta_t) < \infty$.

2. Asynchronous stochastic approximation: Only update or "back up" some of the components of $V_t$ at time $t$. Let $T^s$ be an infinite sequence of times at which state $s$ is updated. Then

$$V_{t+1}(s) = \begin{cases} V_t(s) + \gamma_t(s) \left[ \Gamma(\tilde{V}_t)(s) - V_t(s) + \eta_t(s) \right] & \text{if } t \in T^s \\ V_t(s) & \text{otherwise,} \end{cases}$$

where $\tilde{V}_t = \left( V_{\tau_1(t)}(1), \dots, V_{\tau_{|S|}(t)}(|S|) \right)$ and $\tau_i(t)$ denotes the last time component $i$ of $V$ was updated. Obviously $\tau_i(t) \le t$.

Theorem (Tsitsiklis, 1993): If for all $i$ we have $\lim_{t \to \infty} \tau_i(t) = \infty$ with probability 1, and if $\Gamma$ is a contraction mapping, then $V_t \to V^* = \Gamma(V^*)$ with probability 1.

Corollary: If each $(s,a)$ pair is backed up infinitely often, then $Q_t \to Q$ with probability 1.

Comment: It is not necessary to back up each $(s,a)$ pair at every iteration of the Q-learning algorithm. However, it is necessary that every $(s,a)$ pair be backed up infinitely often. This corresponds to the assumption that $\lim_{t \to \infty} \tau_i(t) = \infty$ with probability 1 in Tsitsiklis's theorem.

Real Time Dynamic Programming

A learning algorithm closely related to successive approximations that works in real time, i.e. the value function is updated during the process of controlling the system. (Sketches of the backup schemes below also appear at the end of these notes.)

0. Successive Approximations

$$V_{t+1}(s) = \max_{a \in A(s)} \left[ u(s,a) + \beta \sum_{s' \in S} V_t(s')\, p(s' \mid s,a) \right].$$

1. Gauss-Seidel

$$V_{t+1}(s) = \max_{a \in A(s)} \left[ u(s,a) + \beta \sum_{s' \in S} \tilde{V}_t(s')\, p(s' \mid s,a) \right],$$

where

$$\tilde{V}_t(s') = \begin{cases} V_{t+1}(s') & \text{if } s' < s \\ V_t(s') & \text{otherwise.} \end{cases}$$

2. Asynchronous Successive Approximations

$$V_{t+1}(s) = \begin{cases} \max_{a \in A(s)} \left[ u(s,a) + \beta \sum_{s' \in S} V_t(s')\, p(s' \mid s,a) \right] & \text{if } s \in S_t \\ V_t(s) & \text{otherwise.} \end{cases}$$

Comment: $S_t \subseteq S$ denotes the set of states which are backed up at iteration $t$ of the asynchronous successive approximations algorithm. Successive approximations is the special case when $S_t = S$ (all states backed up). The Gauss-Seidel method is the special case when $S_t = \{s_t\}$ and $s_t$ cycles through the states in the fixed order $1, 2, \dots, |S|, 1, 2, \dots, |S|, \dots$

3. Real Time Dynamic Programming

This is the special case of asynchronous successive approximations in which $S_t = \{s_t\}$, where $s_t$ is the realized state of the process at time $t$.

Theorem: In discounted finite MDPs, if $\{V_t\}$ denotes the sequence of value functions from asynchronous successive approximations, then $V_t \to V^*$ with probability 1 provided each state $s \in S$ is backed up infinitely often.
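
To make the Q-learning update concrete, here is a minimal numerical sketch in Python. Everything beyond the update rule itself is an illustrative assumption: the MDP is represented by arrays `u` (utilities) and `p` (transition probabilities), actions are chosen uniformly at random so that every $(s,a)$ pair is backed up infinitely often (only the visited pair is updated each step, which the corollary above permits), and the step size $\gamma_t(s,a) = 1/n_t(s,a)$ is one schedule satisfying the two summability conditions.

```python
import numpy as np

def q_learning(u, p, beta, n_steps=200_000, seed=0):
    """Tabular Q-learning for a finite MDP.

    u    : (S, A) array of utilities u(s, a)
    p    : (S, A, S) array of transition probabilities p(s' | s, a)
    beta : discount factor in (0, 1)
    """
    rng = np.random.default_rng(seed)
    S, A = u.shape
    Q = np.zeros((S, A))
    visits = np.zeros((S, A))               # n_t(s, a): backups of each pair so far
    s = int(rng.integers(S))                # arbitrary initial state
    for _ in range(n_steps):
        a = int(rng.integers(A))            # uniform exploration: every (s, a)
                                            # is backed up infinitely often w.p. 1
        s_next = int(rng.choice(S, p=p[s, a]))   # random draw from p(. | s, a)
        visits[s, a] += 1
        gamma = 1.0 / visits[s, a]          # sum gamma = inf, sum gamma^2 < inf
        # Q-learning backup: Q <- Q + gamma [u + beta max_a' Q(s', a') - Q]
        Q[s, a] += gamma * (u[s, a] + beta * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

# Small illustrative MDP with |S| = 3 states and |A| = 2 actions.
rng = np.random.default_rng(1)
S, A = 3, 2
u = rng.uniform(0, 1, size=(S, A))
p = rng.uniform(size=(S, A, S))
p /= p.sum(axis=2, keepdims=True)           # normalize into distributions over s'
Q = q_learning(u, p, beta=0.95)
print("recommended actions:", Q.argmax(axis=1))   # argmax_a Q_t(s, a)
```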
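The deterministic backup schemes differ only in which states are backed up at iteration $t$ and which iterate appears on the right-hand side. A sketch under the same assumed array representation; the in-place sweep in `gauss_seidel` is what makes components $s' < s$ use their iteration-$(t+1)$ values.

```python
import numpy as np

def successive_approximations(u, p, beta, n_iter=500):
    """V_{t+1}(s) = max_a [u(s,a) + beta sum_s' V_t(s') p(s'|s,a)] for every s."""
    S, A = u.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        V = (u + beta * (p @ V)).max(axis=1)   # synchronous backup of all states
    return V

def gauss_seidel(u, p, beta, n_iter=500):
    """Same backup, but sweep states in order 1, ..., |S| and update V in
    place, so states s' < s already hold their iteration-(t+1) values."""
    S, A = u.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        for s in range(S):
            V[s] = (u[s] + beta * (p[s] @ V)).max()
    return V
```

Asynchronous successive approximations would replace the fixed sweep by any rule for choosing $S_t$, subject to backing every state up infinitely often.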
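Finally, a sketch of real time dynamic programming under the same assumptions: only the realized state $s_t$ is backed up, and the system is controlled greedily. The `explore` parameter is an added assumption, included so that every state keeps being backed up, as the convergence theorem requires.

```python
import numpy as np

def rtdp(u, p, beta, n_steps=100_000, explore=0.1, seed=0):
    """Real time dynamic programming: back up only the realized state s_t."""
    rng = np.random.default_rng(seed)
    S, A = u.shape
    V = np.zeros(S)
    s = int(rng.integers(S))
    for _ in range(n_steps):
        # Bellman backup at the current state only: S_t = {s_t}.
        q = u[s] + beta * (p[s] @ V)       # q[a] = u(s,a) + beta sum_s' V(s') p(s'|s,a)
        V[s] = q.max()
        # Control the system: greedy action, with occasional random
        # exploration so that every state is visited infinitely often.
        a = int(rng.integers(A)) if rng.random() < explore else int(q.argmax())
        s = int(rng.choice(S, p=p[s, a]))  # realized successor state s_{t+1}
    return V
```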