CS 683 — Learning, Games, and Electronic Markets                    Spring 2007
Notes from Week 9: Multi-Armed Bandit Problems II (26-30 Mar 2007)
Instructor: Robert Kleinberg

1  Information-theoretic lower bounds for multi-armed bandits

1.1  KL divergence

The central notion in this lower bound proof — as in many information-theoretic lower bound proofs — is KL divergence. This is a measure of the statistical distinguishability of two probability distributions on the same set.

Definition 1. Let Ω be a finite set with two probability measures p, q. Their Kullback-Leibler divergence, or KL-divergence, is the sum

    KL(p; q) = Σ_{x∈Ω} p(x) ln( p(x)/q(x) ),

with the convention that p(x) ln(p(x)/q(x)) is interpreted to be 0 when p(x) = 0 and +∞ when p(x) > 0 and q(x) = 0. If Y is a random variable defined on Ω and taking values in some set Γ, the conditional Kullback-Leibler divergence of p and q given Y is the sum

    KL(p; q | Y) = Σ_{x∈Ω} p(x) ln( p(x | Y = Y(x)) / q(x | Y = Y(x)) ),

where terms containing log(0) or log(∞) are handled according to the same convention as above.

Remark 1. Some authors use the notation D(p‖q) instead of KL(p; q). In fact, the D(p‖q) notation was adopted when this material was presented in class. In these notes we will use the KL(p; q) notation, which is more convenient for expressing conditioning.

One should visualize KL-divergence as a measure of certainty that observed data is coming from some "true" distribution p as opposed to a counterfactual distribution q. If a sample point x ∈ Ω represents the observed data, then the log-likelihood-ratio ln(p(x)/q(x)) is a measure of how much more likely we are to observe the data under distribution p than under distribution q. The KL-divergence is the expectation of this log-likelihood-ratio when the random sample x actually does come from distribution p.

The following lemma summarizes some standard facts about KL-divergence; for proofs, see Cover and Thomas's book Elements of Information Theory.

Lemma 1.
Let p, q be two probability measures on a measure space (Ω, F) and let Y be a random variable defined on Ω and taking values in some finite set Γ. Define a pair of probability measures p_Y, q_Y on Γ by specifying that p_Y(y) = p(Y = y), q_Y(y) = q(Y = y) for each y ∈ Γ. Then

    KL(p; q) = KL(p; q | Y) + KL(p_Y; q_Y),

and KL(p; q | Y) is non-negative.

The equation given in the lemma is sometimes called the "chain rule for KL divergence." Its interpretation is as follows: the amount of certainty we gain by observing a pair of random variables (X, Y) equals the amount of certainty we gain by observing Y alone, plus the additional amount of certainty we gain by observing X, conditional on Y.

The KL-divergence of two distributions can be thought of as a measure of their statistical distinguishability. We will need three lemmas concerning KL-divergence. The first asserts that a sequence of n experiments cannot be very good at distinguishing two possible distributions if none of the individual experiments is good at distinguishing them. The second shows that if the KL-divergence of p and q is small, then an event which is reasonably likely under distribution p cannot be too unlikely under distribution q. The third estimates the KL-divergence of distributions which are very close to the uniform distribution on {0, 1}.

Lemma 2. Suppose Ω_0, Ω_1, ..., Ω_n is a sequence of finite probability spaces, and suppose we are given two probability measures p_i, q_i on Ω_i (0 ≤ i ≤ n) and random variables Y_i : Ω_i → Ω_{i−1} such that p_{i−1} = (p_i)_{Y_i}, q_{i−1} = (q_i)_{Y_i} for i = 1, 2, ..., n. If p_0 = q_0 and KL(p_i; q_i | Y_i) < δ for all i, then KL(p_n; q_n) < δn.

Proof. The proof is by induction on n, the base case n = 1 being trivial. For the induction step, Lemma 1 implies

    KL(p_n; q_n) = KL(p_n; q_n | Y_n) + KL(p_{n−1}; q_{n−1}) < δ + KL(p_{n−1}; q_{n−1}),

and the right side is less than δn by the induction hypothesis.

Lemma 3.
If p, q are two distributions on Ω, then

    2 KL(p; q) ≥ ‖p − q‖₁².                                            (1)

Proof. For any event A with p(A) = a, q(A) = b, if Y is the indicator random variable of A then Lemma 1 ensures that

    KL(p; q) ≥ KL(p_Y; q_Y)
             = a ln(a/b) + (1 − a) ln( (1 − a)/(1 − b) )
             = ∫_a^b ( (1 − a)/(1 − x) − a/x ) dx
             = ∫_a^b (x − a)/(x(1 − x)) dx
             ≥ ∫_a^b 4(x − a) dx                                        (2)

(using x(1 − x) ≤ 1/4), while

    [2(p(A) − q(A))]² = 4(a − b)² = ∫_a^b 8(x − a) dx.                  (3)

If b ≥ a, comparing (2) and (3) and taking A to be an event maximizing |p(A) − q(A)|, so that ‖p − q‖₁ = 2|p(A) − q(A)|, confirms (1). If b < a we rewrite the right sides of (2) and (3) as ∫_b^a 4(a − x) dx and ∫_b^a 8(a − x) dx to make the intervals properly oriented and the integrands non-negative, and again (1) follows.

Lemma 4. If 0 < ε < 1/2 and p, q, r are the distributions on {0, 1} defined by

    p(1) = (1 + ε)/2,    q(1) = 1/2,    r(1) = (1 − ε)/2,
    p(0) = (1 − ε)/2,    q(0) = 1/2,    r(0) = (1 + ε)/2,

then KL(p; q) < 2ε² and KL(p; r) < 4ε².

Proof.

    KL(p; q) = ((1 + ε)/2) ln(1 + ε) + ((1 − ε)/2) ln(1 − ε)
             = (1/2) ln(1 − ε²) + (ε/2) ln( 1 + 2ε/(1 − ε) )
             < (ε/2) · 2ε/(1 − ε)
             < 2ε².

    KL(p; r) = ((1 + ε)/2) ln( (1 + ε)/(1 − ε) ) + ((1 − ε)/2) ln( (1 − ε)/(1 + ε) )
             = (1/2) ln(1) + (ε/2) ln( ((1 + ε)/(1 − ε))² )
             = ε ln( 1 + 2ε/(1 − ε) )
             < 2ε²/(1 − ε)
             < 4ε².

1.2  Distinguishing coins

Suppose that I give you two coins: one fair, the other with bias 1/2 − ε. Your job is to repeatedly choose one coin to flip, and to stop when you think you know which one is biased. Your answer should be correct with probability at least 0.99. How many flips does it take to identify the biased coin? The answer is O(1/ε²). While it is possible to prove this by elementary means, here we will give a proof using KL-divergence, which has the benefit of generalizing to the case where there are more than two coins.

Theorem 5. Let p, q be two distributions on {0, 1}^m such that the m bits are independent, with expected value 1/2 − ε under p and 1/2 under q. If 16m ≤ 1/ε² and A is any event, then |p(A) − q(A)| ≤ 1/2.

Proof. We have

    |p(A) − q(A)| ≤ (1/2) ‖p − q‖₁ ≤ sqrt( (1/2) KL(p; q) ).

Also

    KL(p; q) = Σ_i KL( p(x_i); q(x_i) | x_1, ..., x_{i−1} ).

This is bounded above by 8ε²m, so |p(A) − q(A)| ≤ sqrt(4ε²m) = 2ε·sqrt(m) ≤ 1/2 whenever 16m ≤ 1/ε².

Here's the generalization to more than 2 coins.
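Lemmas 3 and 4 are easy to sanity-check numerically before they get used below. A minimal sketch in Python (the helper names `kl` and `l1` are mine, not from the notes):

```python
import math

def kl(p, q):
    """KL(p; q) = sum_x p(x) ln(p(x)/q(x)), with the 0 ln 0 = 0 convention."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def l1(p, q):
    """The L1 distance ||p - q||_1."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

for eps in [0.01, 0.1, 0.3, 0.49]:
    p = [(1 - eps) / 2, (1 + eps) / 2]   # (p(0), p(1)) from Lemma 4
    q = [0.5, 0.5]
    r = [(1 + eps) / 2, (1 - eps) / 2]
    assert 2 * kl(p, q) >= l1(p, q) ** 2   # Lemma 3
    assert kl(p, q) < 2 * eps ** 2         # Lemma 4, first bound
    assert kl(p, r) < 4 * eps ** 2         # Lemma 4, second bound
```

Note how tight Lemma 3 is for these two-point distributions: 2 KL(p; q) exceeds ‖p − q‖₁² = ε² only by an O(ε⁴) margin.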
We have n coins and an algorithm which chooses, at time t (1 ≤ t ≤ T), a coin x_t to flip. It also outputs a guess y_t; the guess is considered correct if y_t is a biased coin. The choice of x_t, y_t is only allowed to depend on the outcomes of the first t − 1 coin flips.

Consider the following distributions on coin-flip outcomes. Distribution p_0 is the distribution in which all coins are fair. Distribution p_j is the distribution in which coin j has bias 1/2 − ε while all others are fair coins. In all these distributions p_0, p_1, ..., p_n, the different coin flips are mutually independent events. When 0 ≤ j ≤ n, we will denote the probability of an event under distribution p_j by Pr_j. Similarly, the expectation of a random variable under distribution p_j will be denoted by E_j.

Theorem 6. Let ALG be any coin-flipping algorithm. If 100t ≤ n/ε² then there exist at least n/3 distinct values of j > 0 such that Pr_j(y_t ≠ j) ≥ 1/2.

Proof. The intuition is as follows. Let Q_j denote the random variable which counts the number of times ALG flips coin j. If E_j(Q_j) is much smaller than 1/ε², then at time t the algorithm is unlikely to have accumulated enough evidence that j is the biased coin. On the other hand, since there are n coins and t ≤ n/ε², for most values of j it is unlikely that the algorithm flips coin j more than 1/ε² times before time t.

To make this precise, first note that Σ_{j=1}^n E_0(Q_j) = t by linearity of expectation. Hence the set J_1 = {j : E_0(Q_j) ≤ 3t/n} has at least 2n/3 elements. Moreover, the set J_2 = {j : Pr_0(y_t = j) ≤ 3/n} has at least 2n/3 elements. Let J = J_1 ∩ J_2; this set has at least n/3 elements. If j ∈ J and E is the event y_t = j then

    Pr_j(E) ≤ Pr_0(E) + |Pr_j(E) − Pr_0(E)|
            ≤ Pr_0(E) + (1/2) ‖p_0 − p_j‖₁
            ≤ 3/n + sqrt( (1/2) KL(p_0; p_j) ).

Moreover, by Lemmas 2 and 4,

    KL(p_0; p_j) ≤ 8ε² E_0(Q_j) ≤ 24ε² t/n < 1/4,                       (4)

so Pr_j(y_t = j) ≤ 3/n + sqrt(1/8) < 1/2 (once n is larger than a small constant), which proves the theorem.

1.3  The multi-armed bandit lower bound

Let MAB be any algorithm for the n-armed bandit problem.
We will describe a procedure for generating a random input such that the expected regret accumulated by MAB when running against this random input is Ω(sqrt(nT)). Define distributions p_0, p_1, ..., p_n on ({0, 1}^n)^T as follows. First, p_0 is the uniform distribution on this set. Second, for 1 ≤ j ≤ n, p_j is the distribution in which the random variables c_t(i) are mutually independent, and

    Pr(c_t(i) = 1) = 1/2        if i ≠ j,
    Pr(c_t(i) = 1) = 1/2 − ε    if i = j.

Here ε = (1/10) sqrt(n/T), so that 100T ≤ n/ε². The random input is generated by sampling a number j* uniformly at random from [n] and then sampling the sequence of cost functions according to p_{j*}.

Remark 2. There is nothing surprising about the fact that a lower bound for oblivious adversaries against randomized algorithms comes from looking at a distribution over inputs. Yao's Lemma says that it must be possible to prove the lower bound this way. However, there are two surprising things about the way this lower bound is established. First, conditional on the random number j*, the samples are independent and identically distributed. There's nothing in Yao's Lemma which says that the worst-case distribution of inputs has to be such a simple distribution. Second, the lower bound for oblivious adversaries nearly matches the upper bound for adaptive adversaries. It appears that almost all of the adversary's power comes from the ability to tailor the value of ε to the time horizon T. Note that there is still a logarithmic gap between the lower and upper bounds. Closing this gap is an interesting open question. Finally, the most important thing to appreciate about this lower bound (besides the proof technique, which serves as a very good demonstration of the power of KL divergence) is that it pins down precisely which multi-armed bandit problems are the toughest: those in which the n strategies have nearly identical payoff distributions but one of them is just slightly better than the rest.

Consider the following coin-flipping algorithm based on MAB.
At time t, when MAB chooses strategy x_t, the coin-flipping algorithm chooses to flip coin x_t and guesses y_t = x_t as well. The previous theorem about coin-flipping algorithms proves that there exists a set J_t with at least n/3 distinct elements, such that if we run this coin-flipping algorithm then for all t (1 ≤ t ≤ T) and j ∈ J_t, Pr_j(x_t = j) ≤ 1/2, which implies that

    E[c_t(x_t) | j* ∈ J_t] ≥ (1/2)·(1/2) + (1/2)·(1/2 − ε) = 1/2 − ε/2.     (5)

Trivially,

    E[c_t(x_t) | j* ∉ J_t] ≥ 1/2 − ε.                                       (6)

Recalling that |J_t| ≥ n/3, we have Pr(j* ∈ J_t) ≥ 1/3, so

    E[c_t(x_t)] ≥ (1/3)(1/2 − ε/2) + (2/3)(1/2 − ε) = 1/2 − 5ε/6.           (7)

Hence

    E[ Σ_{t=1}^T c_t(j*) ] = T/2 − εT,

while

    E[ Σ_{t=1}^T c_t(x_t) ] ≥ T/2 − 5εT/6.

It follows that the regret of MAB is at least εT/6 = (1/60) sqrt(nT).

2  Markov decision processes

Up to this point, our treatment of multi-armed bandit problems has focused on worst-case analysis of algorithms. Historically, the first approach to multi-armed bandit problems was grounded in average-case analysis of algorithms; this is still the most influential approach to multi-armed bandit algorithms. It assumes that the decision-maker has a prior distribution which is a probability measure on the set of possible input sequences. The task is then to design a Bayesian optimal bandit algorithm, i.e. one which optimizes the expected cost of the decision sequence, assuming that the actual input sequence is a random sample from the prior distribution. Under a convenient assumption (namely, that costs are geometrically time-discounted) this problem belongs to a class of planning problems called Markov decision problems or simply MDP's. MDP's are much more general than multi-armed bandit problems, and they constitute an extremely important topic in artificial intelligence. See http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html for a textbook with excellent coverage of the subject.
In these notes we will explore the most basic elements of the theory of MDP's, with the aim of laying the technical foundations for the Gittins index theorem, a theorem which describes the Bayesian optimal bandit algorithm when the prior distribution over the n strategies is a product distribution (i.e. the prior belief is that different strategies are uncorrelated) and the costs are geometrically time-discounted.

2.1  Definitions

Definition 2. A Markov decision process (MDP) is specified by the following data:

1. a finite set S of states;

2. a finite set A of actions;

3. transition probabilities P^a_{s,s'} for all s, s' ∈ S, a ∈ A, specifying the probability of a state transition from s to s' given that the decision-maker selects action a when the system is in state s;

4. costs c(s, a) ∈ R₊ specifying the cost of choosing action a ∈ A when the system is in state s.

Some authors define MDP's using payoffs or rewards instead of costs, and this change of terminology implies changing the objective from cost minimization to payoff or reward maximization. Also, some authors define the costs/payoffs/rewards to be random variables. Here we have opted to make them deterministic. From the standpoint of solving the expected-cost-minimization problem, it does not matter whether c(s, a) is defined to be a random variable or to be the expectation of that random variable. For simplicity, we have opted for the latter interpretation.

Definition 3. A policy for an MDP is a rule for assigning a probability π(s, a) ∈ [0, 1] to each state-action pair (s, a) ∈ S × A such that for all s ∈ S, Σ_{a∈A} π(s, a) = 1. A pure policy is a policy such that π(s, a) ∈ {0, 1} for all s ∈ S, a ∈ A. If π is a pure policy, then for every s ∈ S there is a unique a ∈ A such that π(s, a) = 1; we will sometimes denote this unique value of a as π(s), by abuse of notation.

Definition 4.
A realization of an MDP (S, A, P, c) with policy π is a probability space Ω with the following collection of random variables:

• a sequence s_0, s_1, ... taking values in S;

• a sequence a_0, a_1, ... taking values in A.

These random variables are required to obey the specified transition probabilities, i.e.

    Pr(s_{t+1} = s | s_0, s_1, ..., s_t, a_0, a_1, ..., a_t) = Pr(s_{t+1} = s | s_t, a_t) = P^{a_t}_{s_t, s}.

Given a policy π for an MDP, there are a few different ways to define the cost of using policy π. One way is to set a finite time horizon T and to define the cost of using π starting from state s ∈ S to be the function

    V^π(s) = E[ Σ_{t=0}^T c(s_t, a_t) | s_0 = s ].

Another way is to define the cost as an infinite sum using geometric time discounting with some discount factor γ < 1:

    V^π(s) = E[ Σ_{t=0}^∞ γ^t c(s_t, a_t) | s_0 = s ].

The following general definition incorporates both of these possibilities and many others.

Definition 5. A stopping time τ for a MDP is a random variable defined in a realization of the MDP, taking values in N ∪ {0}, which satisfies the property:

    Pr(τ = t | s_0, s_1, ..., a_0, a_1, ...) = Pr(τ = t | s_0, s_1, ..., s_t, a_0, a_1, ..., a_t).

If τ satisfies the stronger property that there exists a function p : S → [0, 1] such that

    Pr(τ = t | s_0, s_1, ..., a_0, a_1, ...) = p(s_t),

then we say that τ is a memoryless stopping time and we call p the stopping probability function. Given a MDP with policy π and stopping time τ, the cost of π is defined to be the function

    V^π(s) = E[ Σ_{t=0}^{τ−1} c(s_t, a_t) | s_0 = s ].

This is also called the value function of π.

For example, a finite time horizon T is encoded by setting the stopping time τ to be equal to T + 1 at every point of the sample space Ω. Geometric time discounting with discount factor γ < 1 is encoded by setting τ to be a geometrically distributed random variable which is independent of the random variables s_0, s_1, ... and a_0, a_1, ..., i.e.
a random variable satisfying

    Pr(τ > t | s_0, s_1, ..., a_0, a_1, ...) = γ^t

for all t ∈ N ∪ {0}, s_0, s_1, ... ∈ S, a_0, a_1, ... ∈ A.

Definition 6. Given a set U ⊆ S, the hitting time of U is a stopping time τ which satisfies τ = min{t | s_t ∈ U} whenever the right side is defined. If there is a positive probability that the infinite sequence s_0, s_1, s_2, ... never visits U, then the hitting time of U is not a well-defined stopping time. However, we will always be considering sets U such that the hitting time is well-defined. Note that the hitting time of U is a memoryless stopping time whose stopping probability function is p(s) = 1 if s ∈ U, 0 otherwise.

Given a MDP with a memoryless stopping time τ, we may assume (virtually without loss of generality) that τ is the hitting time of U, for some set of states U ⊆ S. This is because we may augment the MDP by adjoining a single extra state, Done, and defining the transition probabilities P̂ and costs ĉ as follows (where p denotes the stopping probability function of τ):

    P̂^a_{s,s'} = (1 − p(s)) · P^a_{s,s'}    if s, s' ∈ S,
    P̂^a_{s,s'} = p(s)                       if s ∈ S, s' = Done,
    P̂^a_{s,s'} = 1                          if s = s' = Done;

    ĉ(s, a) = c(s, a)    if s ∈ S,
    ĉ(s, a) = 0          otherwise.

There is a natural mapping from policies for the augmented MDP to policies for the original MDP and vice-versa: given a policy π̂ for the augmented MDP one obtains a policy π for the original MDP by restricting π̂ to the state-action pairs in S × A; given a policy π for the original MDP one obtains π̂ as an arbitrary policy whose restriction to S × A is equal to π. Both of these natural mappings preserve the policy's value function. In that sense, solving the original MDP (i.e. identifying a policy of minimum cost) is equivalent to solving the augmented MDP. This is what we mean when we say that a memoryless stopping rule is, without loss of generality, equal to the hitting time of some set U.
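The augmentation just described is purely mechanical, so it can be written down directly. A sketch in Python (function and variable names are mine; transition tables are dicts mapping (state, action) pairs to dicts of successor probabilities):

```python
def augment(states, actions, P, c, p_stop):
    """Adjoin an absorbing state 'Done'. From each s, move to Done with
    probability p_stop(s); otherwise follow the scaled original transitions."""
    P_hat, c_hat = {}, {}
    for s in states:
        for a in actions:
            row = {s2: (1 - p_stop(s)) * pr for s2, pr in P[(s, a)].items()}
            row["Done"] = p_stop(s)
            P_hat[(s, a)] = row
            c_hat[(s, a)] = c[(s, a)]
    for a in actions:
        P_hat[("Done", a)] = {"Done": 1.0}   # Done is absorbing
        c_hat[("Done", a)] = 0.0             # and costless
    return states + ["Done"], actions, P_hat, c_hat

# Geometric discounting with factor 0.9, i.e. constant stopping probability 0.1:
states, actions = ["s0"], ["a"]
P = {("s0", "a"): {"s0": 1.0}}
c = {("s0", "a"): 1.0}
S2, A2, P2, c2 = augment(states, actions, P, c, lambda s: 0.1)
print(P2[("s0", "a")])   # {'s0': 0.9, 'Done': 0.1}
```

With the constant stopping probability 1 − γ, the hitting time of {Done} is exactly the geometrically distributed stopping time used above to encode discounting.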
2.2  Examples

To illustrate the abstract definition of Markov decision processes, we will give two examples in this section. A third illustration is contained in the following section, which explains how MDP's model an important class of bandit problems.

Example 1 (Blackjack with an infinite deck). If one is playing blackjack with an infinite deck of cards (such that the probability of seeing any given type of card is 1/52 regardless of what cards have been seen before) then the game is a MDP whose states are ordered triples (H, B, F) where H is a multiset of cards (the contents of the player's current hand), B > 0 is the size of the player's current bet, and F ∈ {0, 1} specifies whether the player is finished receiving new cards into his or her hand. The set of actions is {hit, stand, double}. In state (H, B, 0), if the player chooses "stand" then the next state is (H, B, 1) with probability 1. If the player chooses "double" then the next state is (H, 2B, 0) with probability 1. If the player chooses "hit" then a random card is added to H, B remains the same, and F changes from 0 to 1 if the sum of the values in H now exceeds 21; otherwise F remains at 0. The stopping time is the hitting time of the set of states such that F = 1. (Consequently it doesn't matter how we define the transition probabilities in such states, though for concreteness we will say that any action taken in such a state leads back to the same state with probability 1.) The cost of taking an action that leads to a state (H, B, 0) is 0; the cost of taking an action that leads to a state (H, B, 1) is B times the probability that the dealer beats a player whose hand is H. (Technically, we are supposed to define the cost as a function of the action and the state immediately preceding that action. Thus we should really define the cost of taking action a in state s to be equal to B times the probability that a leads to a state with F = 1 and the dealer beats the player in this state.)
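The infinite-deck assumption is what makes the "hit" transition of Example 1 so easy to sample: each rank is always equally likely. A deliberately simplified sketch (my simplifications: the state tracks only the hand total, ignoring aces' soft values, bets, and doubling):

```python
import random

# 13 ranks, each equally likely under the infinite-deck assumption;
# face cards count 10, and the ace is treated as a hard 11 here.
CARD_VALUES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]

def hit(total, rng):
    """Sample one 'hit' transition; returns (new_total, finished),
    where finished corresponds to F flipping to 1 on a bust."""
    new_total = total + rng.choice(CARD_VALUES)
    return new_total, new_total > 21

# A toy pure policy: hit while the total is below 17 and not busted.
rng = random.Random(1)
state = (11, False)
while not state[1] and state[0] < 17:
    state = hit(state[0], rng)
print(state)
```

A full model would carry the multiset H (and the bet B) in the state, exactly as the example specifies; the point here is only the memoryless card draw.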
If the deck is not infinite and the player is counting cards, then to model the process as an MDP we must enlarge the state space to include the information that the player recalls about cards that have been dealt in the past.

Example 2 (Playing golf with n golf balls). The game of "golf with n golf balls" is played by a single golfer using n golf balls on a golf course with a finite set of locations where a ball may come to rest. One of these locations is the hole, and the objective is to get at least one of the n balls to land in the hole while minimizing the total number of strokes (including strokes that involved hitting other balls besides the one which eventually landed in the hole). We can model this as a MDP, as follows. Let L be the set of locations, and let h ∈ L denote the hole. The set of states of the MDP is L^n and the stopping time is the hitting time of the set

    U = {(ℓ_1, ℓ_2, ..., ℓ_n) ∈ L^n | ∃i such that ℓ_i = h}.

The set of actions is [n] × C where C is the set of golf clubs that the golfer is using. The interpretation of action (i, c) is that the golfer uses club c to hit ball number i. When the golfer takes action (i, c), the state updates from (ℓ_1, ..., ℓ_n) to a random new state (ℓ_1, ..., ℓ_{i−1}, ℓ'_i, ℓ_{i+1}, ..., ℓ_n), where the probability of hitting ball i from ℓ to ℓ' using club c is a property of the golfer and the ball which the golfer is hitting (but it does not depend on the time at which the golfer is hitting the ball, nor on the positions of the other balls on the golf course).

2.3  How is this connected to multi-armed bandits?

Let F denote a family of probability measures on R. For example F may be the family of all Gaussian distributions, or F may be the family of all distributions supported on the two-element set {0, 1}.
Consider a multi-armed bandit problem with strategy set S = [n], in which the decision-maker believes that each strategy i ∈ [n] has costs distributed according to some unknown distribution f_i ∈ F, and that these unknown distributions f_1, f_2, ..., f_n are themselves independent random variables distributed according to n known probability measures μ_1, μ_2, ..., μ_n on F. To put it more precisely, the decision-maker's prior belief distribution can be described as follows. There are n random variables f_1, f_2, ..., f_n taking values in F; they are distributed according to the product distribution μ_1 ⊗ ... ⊗ μ_n, and the costs c_t(i) are mutually conditionally independent (conditioned on f_1, f_2, ..., f_n) and satisfy

    Pr(c_t(i) ∈ B | f_1, f_2, ..., f_n) = f_i(B)

for every Borel set B ⊆ R. Let us assume, moreover, that there is a fixed discount factor γ < 1 and that the decision-maker wishes to choose a sequence of strategies x_0, x_1, ... so as to minimize the expected time-discounted cost

    E[ Σ_{t=0}^∞ γ^t c_t(x_t) ],

where the expectation is with respect to the decision-maker's prior.

This problem can be modeled as a Markov decision process with an infinite state space. Specifically, a state of the MDP is an n-tuple of beliefs ν_1, ν_2, ..., ν_n — each of which is a probability measure on F — representing the decision-maker's posterior belief about the cost distribution of each strategy, after performing some number of experiments and observing their outcomes. The set of actions is simply [n]; performing action x at time t in the MDP corresponds to choosing strategy x in step t of the bandit problem. The transition probabilities of the MDP are determined by Bayes' law, which specifies how to update the posterior distribution for strategy x after making one observation of the cost c_t(x). Note that when the decision-maker chooses action x in state (ν_1, ...
, ν_n), the resulting state transition only updates the x-th component of the state vector. (Our assumption that the cost distributions of the different strategies are independent ensures that a Bayesian update after observing c_t(x) has no effect on ν_y for y ≠ x.) The cost of choosing action x in state (ν_1, ..., ν_n) is simply the conditional expectation E[c_t(x) | ν_x], i.e. the expected value of one cost observation when the unknown distribution f_x is itself sampled from ν_x.

Note the similarity between this example and the "golfing with n golf balls" example. Both problems entail studying MDP's in which the states are represented as n-tuples, and each action can only update one component of the n-tuple. In fact, if one generalizes the golfing problem to include golf courses with uncountably many locations and rules in which the cost of a stroke depends on the ball's location at the time it was hit, then it is possible to see the bandit problem as a special case of the golfing problem.

2.4  Properties of optimal policies

In this section we prove a sequence of three theorems which characterize optimal policies of MDP's and which establish that every MDP has an optimal policy which is a pure policy. Before doing so, it will be useful to introduce the notation Q^π(s, a), which denotes the expected cost of performing action a at time 0 in state s, and using policy π in every subsequent time step:

    Q^π(s, a) = c(s, a) + E[ Σ_{t=1}^{τ−1} c(s_t, a_t) | s_0 = s, a_0 = a ].

By abuse of notation, for a policy π' we also define Q^π(s, π') to be the weighted average

    Q^π(s, π') = Σ_{a∈A} π'(s, a) Q^π(s, a).

Note that Q^π(s, π) = V^π(s) for any policy π and state s.

Theorem 7 (Policy improvement theorem). If π, π' are policies such that

    Q^π(s, π) ≥ Q^π(s, π')                                              (8)

for every state s, then

    V^π(s) ≥ V^{π'}(s)                                                  (9)

for every state s. If the inequality (8) is strict for at least one s, then (9) is also strict for at least one s.

Proof.
For any t ≥ 0, let π⟨t⟩ denote the "hybrid policy" which distributes its actions according to π' at all times before t and according to π at all times from t onward. (Technically, this does not satisfy our definition of the word "policy," since the distribution over actions depends not only on the current state but on the time as well. However, this abuse of terminology should not cause confusion.) For every t ≥ 0 and s ∈ S we have

    V^{π⟨t+1⟩}(s) − V^{π⟨t⟩}(s) = Σ_{s'∈S} ( Q^π(s', π') − Q^π(s', π) ) · Pr(s_t = s' | s_0 = s) ≤ 0.

The theorem now follows by observing that V^{π⟨0⟩}(s) = V^π(s) and that

    lim_{t→∞} V^{π⟨t⟩}(s) = V^{π'}(s).

Definition 7. A policy π for a MDP is optimal if V^π(s) ≤ V^{π'}(s) for every state s and policy π'.

Theorem 8 (Bellman's optimality condition). A policy π is optimal if and only if it satisfies

    a ∈ argmin_{b∈A} Q^π(s, b)                                          (10)

for every state-action pair (s, a) such that π(s, a) > 0.

Proof. By Theorem 7, if π fails to satisfy (10) for some state-action pair (s, a) such that π(s, a) > 0, then π is not an optimal policy. This is because we may construct a different policy π' such that π'(s) is a probability distribution concentrated on the set argmin_{b∈A} Q^π(s, b), and π'(s') = π(s') for all states s' ≠ s. This new policy π' satisfies Q^π(s, π) > Q^π(s, π') and Q^π(s', π) = Q^π(s', π') for all s' ≠ s; hence by Theorem 7 there is some state in which V^{π'} is strictly less than V^π, hence π is not optimal.

To prove the converse, assume π is not optimal and let σ be an optimal policy. Also assume (without loss of generality) that the stopping time τ is the hitting time of some set U ⊆ S. Let

    x = max_{s∈S} ( V^π(s) − V^σ(s) ),

and let T = {s ∈ S | V^π(s) = V^σ(s) + x}. Notice that V^π(s) = V^σ(s) = 0 for all s ∈ U, while x > 0 because σ is optimal and π is not; hence T is disjoint from U.
Thus there must be at least one state s ∈ T such that the probability of a state transition from s to the complement of T, when playing with policy σ, is strictly positive. Now,

    Q^π(s, π) = V^π(s) = V^σ(s) + x
              = Σ_{a∈A} σ(s, a) ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} (V^σ(s') + x) )
              > Σ_{a∈A} σ(s, a) ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V^π(s') )
              = Σ_{a∈A} σ(s, a) Q^π(s, a)
              = Q^π(s, σ),

where the strict inequality uses V^π(s') ≤ V^σ(s') + x for all s', with strict inequality for s' ∉ T, together with the fact that some s' ∉ T is reached from s with positive probability under σ. Hence π must not satisfy (10).

Theorem 9 (Existence of pure optimal policies). For every MDP, there is a pure policy which is optimal.

Proof. There are only finitely many pure policies, so among all pure policies there is at least one policy π which minimizes the sum Σ_{s∈S} V^π(s). We claim that this policy π is an optimal policy. Indeed, if π is not an optimal policy then by Theorem 8 there is a state-action pair (s, a) such that π(s, a) > 0 but a ∉ argmin_{b∈A} Q^π(s, b). Then, by Theorem 7, if we modify π to a new policy π' by changing π(s) from a to any action a' ∈ argmin_{b∈A} Q^π(s, b), this new policy π' satisfies V^{π'}(s') ≤ V^π(s') for all states s' ∈ S, with strict inequality for at least one such state. This contradicts our assumption that π minimizes Σ_{s∈S} V^π(s) among all pure policies.

2.5  Computing optimal policies

Theorem 8 actually implies an algorithm for computing an optimal policy of a MDP in polynomial time, by solving a linear program. Namely, consider the linear program:

    max  Σ_{s∈S} V(s)
    s.t. V(s) = 0                                           ∀ s ∈ U
         V(s) ≤ c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s')         ∀ s ∈ S \ U, a ∈ A.

It is easy to check the following facts.

1. If V is a solution of the linear program, then

       V(s) = min_{a∈A} ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s') )

   for all s ∈ S \ U.

2. If π is a pure policy obtained by selecting π(s) ∈ argmin_{a∈A} ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s') ) for all s ∈ S \ U, then V is the value function of π.

3. The pure policy π defined in this manner satisfies the Bellman optimality condition, and is therefore an optimal policy.
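Fact 1 characterizes the optimal value function as a fixed point of the minimization, which also suggests the simplest iterative method, value iteration: apply that minimization repeatedly as an update rule, then read off a pure policy by the argmin as in Theorem 9. A sketch on a toy discounted instance (the instance and all names are mine; the discount factor γ stands in for a terminal set U via the augmentation of Section 2.1):

```python
def value_iteration(states, actions, P, c, gamma, iters=1000):
    """Iterate V(s) <- min_a [ c(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: min(c[(s, a)] + gamma * sum(pr * V[s2] for s2, pr in P[(s, a)].items())
                    for a in actions)
             for s in states}
    # Extract a pure policy from the argmin, as in fact 2 / Theorem 9.
    policy = {s: min(actions,
                     key=lambda a: c[(s, a)]
                     + gamma * sum(pr * V[s2] for s2, pr in P[(s, a)].items()))
              for s in states}
    return V, policy

states = ["s0", "s1"]
actions = ["stay", "go"]
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
c = {("s0", "stay"): 1.0, ("s0", "go"): 0.5,
     ("s1", "stay"): 0.0, ("s1", "go"): 1.0}
V, policy = value_iteration(states, actions, P, c, gamma=0.9)
print(policy)   # {'s0': 'go', 's1': 'stay'}
```

Here the optimal policy pays 0.5 once to move from s0 to the costless state s1 and stays there, so V(s0) = 0.5 and V(s1) = 0.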
In practice, there are iterative algorithms for solving MDP's which are much more efficient than the reduction from MDP's to linear programming presented here. For more information on these other methods for solving MDP's, we refer the reader to http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html.
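To close the loop with Section 2.3: when F is the family of distributions on {0, 1} and each prior μ_i is a Beta distribution (a standard conjugate-prior choice — my assumption here, not something the notes fix), the posterior ν_i remains Beta, so the belief state is just an n-tuple of counter pairs and the Bayes transition is an increment. A sketch of the belief-MDP's state update and cost function under that assumption:

```python
def update(belief, x, cost):
    """belief: tuple of (alpha, beta) Beta parameters, one pair per arm.
    Observing a cost in {0, 1} from arm x updates only component x,
    mirroring the fact that the transition touches only the x-th belief."""
    a, b = belief[x]
    new = list(belief)
    new[x] = (a + 1, b) if cost == 1 else (a, b + 1)
    return tuple(new)

def expected_cost(belief, x):
    """c(state, x) = E[c_t(x) | nu_x], the mean of Beta(a, b)."""
    a, b = belief[x]
    return a / (a + b)

state = ((1, 1), (1, 1))         # uniform Beta(1, 1) priors on both arms
state = update(state, 0, 1)      # arm 0 yields cost 1
print(expected_cost(state, 0))   # mean of Beta(2, 1) = 2/3
```

The Gittins index theorem mentioned at the start of Section 2 operates on exactly this kind of product-form belief state.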