CS 683 — Learning, Games, and Electronic Markets
Spring 2007
Notes from Week 9: Multi-Armed Bandit Problems II
26-30 Mar 2007
Instructor: Robert Kleinberg

1  Information-theoretic lower bounds for multi-armed bandits

1.1  KL divergence
The central notion in this lower bound proof — as in many information-theoretic
lower bound proofs — is KL divergence. This is a measure of the statistical distinguishability of two probability distributions on the same set.
Definition 1. Let Ω be a finite set with two probability measures p, q. Their Kullback-Leibler
divergence, or KL-divergence, is the sum

    KL(p; q) = Σ_{x∈Ω} p(x) ln( p(x) / q(x) ),
with the convention that p(x) ln(p(x)/q(x)) is interpreted to be 0 when p(x) = 0 and
+∞ when p(x) > 0 and q(x) = 0. If Y is a random variable defined on Ω and taking
values in some set Γ, the conditional Kullback-Leibler divergence of p and q given Y
is the sum
    KL(p; q | Y) = Σ_{x∈Ω} p(x) ln( p(x | Y = Y(x)) / q(x | Y = Y(x)) ),
where terms containing log(0) or log(∞) are handled according to the same convention
as above.
Remark 1. Some authors use the notation D(p‖q) instead of KL(p; q). In fact, the
D(p‖q) notation was adopted when this material was presented in class. In these
notes we will use the KL(p; q) notation, which is more convenient for expressing
conditioning.
One should visualize KL-divergence as a measure of certainty that observed data
is coming from some “true” distribution p as opposed to a counterfactual distribution
q. If a sample point x ∈ Ω represents the observed data, then the log-likelihood-ratio
ln(p(x)/q(x)) is a measure of how much more likely we are to observe the data under
distribution p than under distribution q. The KL-divergence is the expectation of this
log-likelihood-ratio when the random sample x actually does come from distribution
p.
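As an illustration (not part of the original notes), here is a minimal Python sketch of the KL-divergence of Definition 1, including the conventions for zero probabilities; the function name and the dictionary representation of distributions are my own choices.

```python
import math

def kl_divergence(p, q):
    """KL(p; q) for two distributions on the same finite set.

    p and q are dicts mapping each outcome x to its probability. Following the
    convention of Definition 1, a term with p(x) = 0 contributes 0, and a term
    with p(x) > 0 but q(x) = 0 makes the divergence +infinity.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                 # 0 * ln(0 / q(x)) is interpreted as 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf          # p(x) > 0 but q(x) = 0
        total += px * math.log(px / qx)
    return total

# Example: a slightly biased coin versus a fair coin (bias eps = 0.1).
eps = 0.1
p = {1: (1 + eps) / 2, 0: (1 - eps) / 2}
q = {1: 0.5, 0: 0.5}
print(kl_divergence(p, q))   # about 0.005, below the 2*eps^2 bound of Lemma 4
```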
The following lemma summarizes some standard facts about KL-divergence; for
proofs, see Cover and Thomas’s book Elements of Information Theory.
Lemma 1. Let p, q be two probability measures on a measure space (Ω, F) and let Y
be a random variable defined on Ω and taking values in some finite set Γ. Define a
pair of probability measures p_Y, q_Y on Γ by specifying that p_Y(y) = p(Y = y) and
q_Y(y) = q(Y = y) for each y ∈ Γ. Then

    KL(p; q) = KL(p; q | Y) + KL(p_Y; q_Y),

and KL(p; q | Y) is non-negative.
The equation given in the lemma is sometimes called the “chain rule for KL
divergence.” Its interpretation is as follows: the amount of certainty we gain by
observing a pair of random variables (X, Y ) is equal to the amount of certainty
we gain by observing Y alone, plus the additional amount of certainty we gain by
observing X, conditional on Y .
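A quick numerical check of the chain rule, again purely as an illustration rather than something from the notes: take Ω = {0,1}², let Y be the first coordinate, and confirm that KL(p; q) = KL(p; q | Y) + KL(p_Y; q_Y). All helper names are mine.

```python
import math

def kl(p, q):
    # Plain KL divergence; the 0 and +infinity conventions are omitted for brevity.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Two distributions on Omega = {0,1}^2, written as dicts over pairs (x1, x2).
p = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.20, (1, 1): 0.40}
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

Y = lambda x: x[0]                      # the random variable Y: first coordinate

def marginal(dist):
    """The push-forward dist_Y of dist under Y."""
    m = {}
    for x, px in dist.items():
        m[Y(x)] = m.get(Y(x), 0.0) + px
    return m

def conditional_kl(p, q):
    """KL(p; q | Y) = sum_x p(x) ln( p(x | Y=Y(x)) / q(x | Y=Y(x)) )."""
    pY, qY = marginal(p), marginal(q)
    return sum(px * math.log((px / pY[Y(x)]) / (q[x] / qY[Y(x)]))
               for x, px in p.items() if px > 0)

lhs = kl(p, q)
rhs = conditional_kl(p, q) + kl(marginal(p), marginal(q))
print(abs(lhs - rhs) < 1e-12)           # True: the chain rule of Lemma 1
```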
The KL-divergence of two distributions can be thought of as a measure of their
statistical distinguishability. We will need three lemmas concerning KL-divergence.
The first lemma asserts that a sequence of n experiments can not be very good at
distinguishing two possible distributions if none of the individual experiments is good
at distinguishing them. The second one shows that if the KL-divergence of p and
q is small, then an event which is reasonably likely under distribution p can not be
too unlikely under distribution q. The third lemma estimates the KL-divergence of
distributions which are very close to the uniform distribution on {0, 1}.
Lemma 2. Suppose Ω0 , Ω1 , . . . , Ωn is a sequence of finite probability spaces, and suppose we are given two probability measures pi , qi on Ωi (0 ≤ i ≤ n) and random
variables Yi : Ωi → Ωi−1 such that pi−1 = (pi )Yi , qi−1 = (qi )Yi for i = 1, 2, . . . , n. If
p0 = q0 and KL(pi ; qi | Yi ) < δ for all i, then KL(pn ; qn ) < δn.
Proof. The proof is by induction on n, the base case n = 1 being trivial. For the
induction step, Lemma 1 implies

    KL(p_n; q_n) = KL(p_n; q_n | Y_n) + KL(p_{n−1}; q_{n−1})
                 < δ + KL(p_{n−1}; q_{n−1}),

and the right side is less than δn by the induction hypothesis.
Lemma 3. If p, q are two distributions on Ω, then

    2 KL(p; q) ≥ ‖p − q‖₁².                                          (1)
Proof. For any event A with p(A) = a, q(A) = b, if Y is the indicator random variable
of A then Lemma 1 ensures that

    KL(p; q) ≥ KL(p_Y; q_Y)
             = a log(a/b) + (1 − a) log( (1 − a)/(1 − b) )
             = ∫_a^b [ (1 − a)/(1 − x) − a/x ] dx
             = ∫_a^b (x − a) / ( x(1 − x) ) dx
             ≥ ∫_a^b 4(x − a) dx,                                    (2)

while

    [ 2(p(A) − q(A)) ]² = 4(a − b)² = ∫_a^b 8(x − a) dx.             (3)

If b ≥ a this confirms (1). If b < a we rewrite the right sides of (2) and (3) as
∫_b^a 4(a − x) dx and ∫_b^a 8(a − x) dx to make the intervals properly oriented and the
integrands non-negative, and again (1) follows. (In both cases, choosing A = {x : p(x) ≥ q(x)}
makes 2(p(A) − q(A)) = ‖p − q‖₁.)
Lemma 4. If 0 < ε < 1/2 and p, q, r are the distributions on {0, 1} defined by

    p(1) = (1 + ε)/2,    q(1) = 1/2,    r(1) = (1 − ε)/2,
    p(0) = (1 − ε)/2,    q(0) = 1/2,    r(0) = (1 + ε)/2,

then KL(p; q) < 2ε² and KL(p; r) < 4ε².
Proof.

    KL(p; q) = ((1 + ε)/2) ln(1 + ε) + ((1 − ε)/2) ln(1 − ε)
             = (1/2) ln(1 − ε²) + (ε/2) ln( 1 + 2ε/(1 − ε) )
             < (ε/2) · 2ε/(1 − ε)
             < 2ε².

    KL(p; r) = ((1 + ε)/2) ln( (1 + ε)/(1 − ε) ) + ((1 − ε)/2) ln( (1 − ε)/(1 + ε) )
             = (1/2) ln(1) + (ε/2) ln( ((1 + ε)/(1 − ε))² )
             = ε ln( 1 + 2ε/(1 − ε) )
             < 4ε².
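These bounds are easy to check numerically. The snippet below is my own sanity check, not part of the notes; it verifies Lemma 3 and both bounds of Lemma 4 for a few values of ε.

```python
import math

def kl(p, q):
    """KL divergence of two distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

for eps in (0.05, 0.1, 0.3):
    p = [(1 + eps) / 2, (1 - eps) / 2]
    q = [0.5, 0.5]
    r = [(1 - eps) / 2, (1 + eps) / 2]
    l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
    assert 2 * kl(p, q) >= l1 ** 2           # Lemma 3
    assert kl(p, q) < 2 * eps ** 2           # Lemma 4, first bound
    assert kl(p, r) < 4 * eps ** 2           # Lemma 4, second bound
print("Lemmas 3 and 4 hold on these examples")
```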
1.2  Distinguishing coins
Suppose that I give you two coins: one fair, the other with bias 1/2 − ε. Your job is to
repeatedly choose one coin to flip, and to stop when you think you know which one is
biased. Your answer should be correct with probability at least 0.99. How many flips
does it take to identify the biased coin? The answer is O(1/ε²). While it is possible
to prove this by elementary means, here we will give a proof using KL-divergence,
which has the benefit of generalizing to the case where there are more than two coins.
Theorem 5. Let p, q be two distributions on {0, 1}^m such that the m bits are independent,
with expected value 1/2 − ε under p and 1/2 under q. If 16m ≤ 1/ε² and A is
any event, then |p(A) − q(A)| ≤ 1/2.
Proof. We have

    |p(A) − q(A)| ≤ (1/2) ‖p − q‖₁ ≤ √( (1/2) KL(p; q) ).

Also

    KL(p; q) = Σ_i KL(p(x_i); q(x_i) | x_1, . . . , x_{i−1}).

This is bounded above by 8ε²m, so |p(A) − q(A)| ≤ √(4ε²m) = 2ε√m ≤ 1/2, using the
assumption that 16m ≤ 1/ε².
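To make the 1/ε² scaling concrete, here is a small simulation sketch (my own illustration, not from the notes): it flips each of the two coins m times and checks how often the empirical counts identify the biased coin. The parameter values are arbitrary.

```python
import random

def identify_biased(eps, m, trials=2000, seed=0):
    """Fraction of trials in which the coin showing fewer heads after m flips
    of each coin really is the biased one (heads probability 1/2 - eps)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        biased = rng.randrange(2)                    # which coin is biased
        heads = [0, 0]
        for coin in (0, 1):
            p_heads = 0.5 - eps if coin == biased else 0.5
            heads[coin] = sum(rng.random() < p_heads for _ in range(m))
        guess = 0 if heads[0] < heads[1] else 1      # fewer heads -> guess biased
        correct += (guess == biased)
    return correct / trials

eps = 0.05
for m in (10, 100, 400, 4000):                       # compare with 1/eps^2 = 400
    print(m, identify_biased(eps, m))
```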
Here’s the generalization to more than 2 coins. We have n coins and we have an
algorithm which chooses, at time t (1 ≤ t ≤ T ), a coin xt to flip. It also outputs a
guess yt ; the guess is considered correct if yt is a biased coin. The choice of xt , yt is
only allowed to depend on the outcomes of the first t − 1 coin flips.
Consider the following distributions on coin-flip outcomes. Distribution p0 is the
distribution in which all coins are fair. Distribution pj is the distribution in which coin
j has bias 1/2 − ε while all others are fair coins. In all these distributions p0, p1, . . . , pn,
the different coin flips are mutually independent events. When 0 ≤ j ≤ n, we
will denote the probability of an event under distribution pj by Prj. Similarly, the
expectation of a random variable under distribution pj will be denoted by Ej.
Theorem 6. Let ALG be any coin-flipping algorithm. If 100t ≤ n/ε² then there exist
at least n/3 distinct values of j > 0 such that Prj(yt ≠ j) ≥ 1/2.
Proof. The intuition is as follows. Let Qj denote the random variable which counts
the number of times ALG flips coin j. If Ej(Qj) is much smaller than 1/ε², then at
time t the algorithm is unlikely to have accumulated enough evidence that j is the
biased coin. On the other hand, since there are n coins and t ≤ n/ε², for most values
of j it is unlikely that the algorithm flips coin j more than 1/ε² times before time t.
To make this precise, first note that Σ_{j=1}^n E0(Qj) = t by linearity of expectation.
Hence the set
J1 = {j : E0 (Qj ) ≤ 3t/n}
has at least 2n/3 elements. Moreover, the set J2 = {j : Pr0 (yt = j) ≤ 3/n} has at
least 2n/3 elements. Let J = J1 ∩ J2 ; this set has at least n/3 elements. If j ∈ J and
E is the event yt = j then
    Prj(E) ≤ Pr0(E) + | Prj(E) − Pr0(E) |
           ≤ Pr0(E) + (1/2) ‖p0 − pj‖₁
           ≤ 3/n + √( (1/2) KL(p0; pj) ).
Moreover, by Lemmas 2 and 4,

    KL(p0; pj) ≤ 8ε² E0(Qj) ≤ 24ε²t/n < 1/4.                         (4)

Substituting (4) into the preceding chain of inequalities gives Prj(yt = j) ≤ 3/n + √(1/8),
which is less than 1/2 provided n is not too small; hence Prj(yt ≠ j) ≥ 1/2 for every
j ∈ J, as claimed.

1.3  The multi-armed bandit lower bound
Let MAB be any algorithm for the n-armed bandit problem. We will describe a
procedure for generating a random input such that the expected regret accumulated
by MAB when running against this random input is Ω(√(nT)). Define distributions
p0, p1, . . . , pn on ({0, 1}^n)^T as follows. First, p0 is the uniform distribution on this set.
Second, for 1 ≤ j ≤ n, pj is the distribution in which the random variables ct(i) are
mutually independent, and

    Pr( ct(i) = 1 ) =  1/2        if i ≠ j,
                       1/2 − ε    if i = j.
Here ε = (1/10) √(n/T), so that 100T ≤ n/ε². The random input is generated by sampling
a number j ∗ uniformly at random from [n] and then sampling the sequence of cost
functions according to pj ∗ .
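The following sketch (my own illustration, not from the notes; the function name is hypothetical) generates one sample from this randomized hard instance: it draws j* uniformly from [n] and then draws T independent cost vectors in which every coordinate is fair except coordinate j*, which is 1 with probability 1/2 − ε.

```python
import math
import random

def hard_bandit_instance(n, T, seed=None):
    """Sample (j_star, costs) from the lower-bound distribution: costs[t][i] is in
    {0, 1}, with Pr[costs[t][i] = 1] equal to 1/2 for i != j_star and 1/2 - eps
    for i = j_star, where eps = (1/10) * sqrt(n / T)."""
    rng = random.Random(seed)
    eps = 0.1 * math.sqrt(n / T)
    j_star = rng.randrange(n)
    costs = [[int(rng.random() < (0.5 - eps if i == j_star else 0.5))
              for i in range(n)]
             for _ in range(T)]
    return j_star, costs

j_star, costs = hard_bandit_instance(n=10, T=1000, seed=1)
print(j_star, sum(row[j_star] for row in costs))   # the best arm and its total cost
```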
Remark 2. There is nothing surprising about the fact that a lower bound for oblivious adversaries
against randomized algorithms comes from looking at a distribution over inputs. Yao’s
Lemma says that it must be possible to prove the lower bound this way.
However, there are two surprising things about the way this lower bound is established. First, conditional on the random number j ∗ , the samples are independent
and identically distributed. There’s nothing in Yao’s Lemma which says that the
worst-case distribution of inputs has to be such a simple distribution. Second, the
lower bound for oblivious adversaries nearly matches the upper bound for adaptive
adversaries. It appears that almost all of the adversary’s power comes from the ability
to tailor the value of ε to the time horizon T .
Note that there is still a logarithmic gap between the lower and upper bounds.
Closing this gap is an interesting open question.
Finally, the most important thing to appreciate about this lower bound (besides
the proof technique, which serves as a very good demonstration of the power of
KL divergence) is that it pins down precisely which multi-armed bandit problems are the
toughest: those in which the n strategies have nearly identical payoff distributions
but one of them is just slightly better than the rest.
Consider the following coin-flipping algorithm based on MAB. At time t, when
MAB chooses strategy xt , the coin-flipping algorithm chooses to flip coin xt and
guesses yt = xt as well. The previous theorem about coin-flipping algorithms proves
that there exists a set Jt with at least n/3 distinct elements, such that if we run this
coin-flipping algorithm then for all t (1 ≤ t ≤ T ) and j ∈ Jt , Prj (xt = j) ≤ 1/2,
which implies that
    E[ ct(xt) | j* ∈ Jt ] ≥ (1/2)(1/2) + (1/2)(1/2 − ε) ≥ 1/2 − ε/2.          (5)
Trivially,

    E[ ct(xt) | j* ∉ Jt ] ≥ 1/2 − ε.

Recalling that |Jt| ≥ n/3, we have Pr(j* ∈ Jt) ≥ 1/3, so

    E[ ct(xt) ] ≥ (1/3)(1/2 − ε/2) + (2/3)(1/2 − ε) = 1/2 − 5ε/6.

Hence

    E[ Σ_{t=1}^T ct(j*) ] = T/2 − εT,                                         (6)

while

    E[ Σ_{t=1}^T ct(xt) ] ≥ T/2 − 5εT/6.                                      (7)

It follows that the regret of MAB is at least εT/6 = (1/60)√(nT).

2  Markov decision processes
Up to this point, our treatment of multi-armed bandit problems has focused on
worst-case analysis of algorithms. Historically, the first approach to multi-armed
bandit problems was grounded in average-case analysis of algorithms; this is still the
most influential approach to multi-armed bandit algorithms. It assumes that the
decision-maker has a prior distribution which is a probability measure on the set
of possible input sequences. The task is then to design a Bayesian optimal bandit algorithm, i.e. one which optimizes the expected cost of the decision sequence,
assuming that the actual input sequence is a random sample from the prior distribution. Under a convenient assumption (namely, that costs are geometrically time-discounted) this problem belongs to a class of planning problems called Markov decision problems or simply MDP's. MDP's are much more general than multi-armed
bandit problems, and they constitute an extremely important topic in artificial intelligence. See http://www.cs.ualberta.ca/∼sutton/book/ebook/the-book.html
for a textbook with excellent coverage of the subject. In these notes we will explore the most basic elements of the theory of MDP's, with the aim of laying the
technical foundations for the Gittins index theorem, a theorem which describes the
Bayesian optimal bandit algorithm when the prior distribution over the n strategies
is a product distribution (i.e. the prior belief is that different strategies are uncorrelated) and the costs are geometrically time-discounted.
2.1  Definitions
Definition 2. A Markov decision process (MDP) is specified by the following data:
1. a finite set S of states;
2. a finite set A of actions;
3. transition probabilities P^a_{s,s'} for all s, s' ∈ S, a ∈ A, specifying the probability
   of a state transition from s to s' given that the decision-maker selects action a
   when the system is in state s;
4. costs c(s, a) ∈ R+ specifying the cost of choosing action a ∈ A when the system
is in state s.
Some authors define MDP’s using payoffs or rewards instead of costs, and this
change of terminology implies changing the objective from cost minimization to payoff
or reward maximization. Also, some authors define the costs/payoffs/rewards to
be random variables. Here we have opted to make them deterministic. From the
standpoint of solving the expected-cost-minimization problem, it does not matter
whether c(s, a) is defined to be a random variable or to be the expectation of that
random variable. For simplicity, we have opted for the latter interpretation.
Definition 3. A policy for an MDP is a rule for assigning a probability π(s, a) ∈ [0, 1]
to each state-action pair (s, a) ∈ S × A such that for all s ∈ S, Σ_{a∈A} π(s, a) = 1. A
pure policy is a policy such that π(s, a) ∈ {0, 1} for all s ∈ S, a ∈ A. If π is a pure
policy, then for every s ∈ S there is a unique a ∈ A such that π(s, a) = 1; we will
sometimes denote this unique value of a as π(s), by abuse of notation.
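As an illustration of Definitions 2 and 3 (my own sketch, not part of the notes; the toy example and all names are invented), here is a minimal Python encoding of an MDP and a pure policy.

```python
# A tiny MDP in the notation of Definition 2: states S, actions A,
# transition probabilities P[s][a][s2] = P^a_{s,s2}, and costs c[(s, a)].
S = ["s0", "s1", "done"]
A = ["stay", "go"]

P = {
    "s0":   {"stay": {"s0": 0.9, "s1": 0.1},   "go": {"s1": 1.0}},
    "s1":   {"stay": {"s1": 0.5, "done": 0.5}, "go": {"done": 1.0}},
    "done": {"stay": {"done": 1.0},            "go": {"done": 1.0}},
}
c = {(s, a): 1.0 for s in S for a in A}      # every step costs 1 in this toy example
c[("done", "stay")] = c[("done", "go")] = 0.0

# A pure policy (Definition 3): pi[s][a] is 0 or 1 and sums to 1 over the actions.
pi = {
    "s0":   {"stay": 0, "go": 1},
    "s1":   {"stay": 0, "go": 1},
    "done": {"stay": 1, "go": 0},
}

# Sanity checks that the data really does specify an MDP and a policy.
for s in S:
    assert all(abs(sum(P[s][a].values()) - 1.0) < 1e-12 for a in A)
    assert abs(sum(pi[s].values()) - 1.0) < 1e-12
```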
Definition 4. A realization of an MDP (S, A, P, c) with policy π is a probability
space Ω with the following collection of random variables:
• a sequence s0 , s1 , . . . taking values in S;
• a sequence a0 , a1 , . . . taking values in A.
These random variables are required to obey the specified transition probabilities, i.e.
    Pr(s_{t+1} = s | s0, s1, . . . , st, a0, a1, . . . , at) = Pr(s_{t+1} = s | st, at) = P^{a_t}_{s_t, s}.
Given a policy π for an MDP, there are a few different ways to define the cost of
using policy π. One way is to set a finite time horizon T and to define the cost of
using π starting from state s ∈ S to be the function
    V^π(s) = E[ Σ_{t=0}^T c(st, at) | s0 = s ].
Another way is to define the cost as an infinite sum using geometric time discounting
with some discount factor γ < 1:
" T
#
X
V π (s) = E
γ t c(st , at ) s0 = s .
t=0
The following general definition incorporates both of these possibilities and many
others.
Definition 5. A stopping time τ for a MDP is a random variable defined in a realization of the MDP, taking values in N ∪ {0}, which satisfies the property:
Pr(τ = t | s0 , s1 , . . . , a0 , a1 , . . .) = Pr(τ = t | s0 , s1 , . . . , st , a0 , a1 , . . . , at ).
If τ satisfies the stronger property that there exists a function p : S → [0, 1] such
that
Pr(τ = t | s0 , s1 , . . . , a0 , a1 , . . .) = p(st ),
then we say that τ is a memoryless stopping time and we call p the stopping probability
function. Given a MDP with policy π and stopping time τ , the cost of π is defined
to be the function
" τ −1
#
X
V π (s) = E
c(st , at ) s0 = s .
t=0
This is also called the value function of π.
For example, a finite time horizon T is encoded by setting the stopping time τ to
be equal to T + 1 at every point of the sample space Ω. Geometric time discounting
with discount factor γ < 1 is encoded by setting τ to be a geometrically distributed
random variable which is independent of the random variables s0 , s1 , . . . and a0 , a1 , . . .,
i.e. a random variable satisfying
    Pr(τ > t | s0, s1, . . . , a0, a1, . . .) = γ^t
for all t ∈ N ∪ {0}, s0 , s1 , . . . ∈ S, a0 , a1 , . . . ∈ A.
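The geometric-discount encoding suggests a simple way to estimate V^π(s) by simulation: run the MDP under π and, after each step, stop with probability 1 − γ. The sketch below is my own illustration (not from the notes); the function name is invented, and the commented example reuses the toy MDP from the earlier sketch.

```python
import random

def estimate_value(P, c, pi, s0, gamma=0.9, runs=10000, seed=0):
    """Monte Carlo estimate of V^pi(s0), with geometric time discounting encoded
    as a memoryless stopping time: after each step, stop with probability 1 - gamma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        s, cost = s0, 0.0
        while True:
            # Sample the action from the distribution pi(s, .).
            a = rng.choices(list(pi[s]), weights=list(pi[s].values()))[0]
            cost += c[(s, a)]
            # Sample the next state from P^a_{s, .}.
            s = rng.choices(list(P[s][a]), weights=list(P[s][a].values()))[0]
            if rng.random() > gamma:     # Pr(tau > t) = gamma^t
                break
        total += cost
    return total / runs

# Example, using the toy MDP and policy from the earlier sketch:
# print(estimate_value(P, c, pi, "s0", gamma=0.9))
```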
Definition 6. Given a set U ⊆ S, the hitting time of U is a stopping time τ which
satisfies τ = min{t | st ∈ U } whenever the right side is defined.
If there is a positive probability that the infinite sequence s0 , s1 , s2 , . . . never visits
U , then the hitting time of U is not a well-defined stopping time. However, we will
always be considering sets U such that the hitting time is well-defined. Note that the
hitting time of U is a memoryless stopping time whose stopping probability function
is p(s) = 1 if s ∈ U , 0 otherwise.
Given a MDP with a memoryless stopping time τ , we may assume (virtually
without loss of generality) that τ is the hitting time of U , for some set of states
U ⊆ S. This is because we may augment the MDP by adjoining a single extra state,
Done, and defining the transition probabilities P̂ and costs ĉ as follows (where p
denotes the stopping probability function of τ):

    P̂^a_{s,s'} =  (1 − p(s)) · P^a_{s,s'}    if s, s' ∈ S,
                   p(s)                      if s ∈ S, s' = Done,
                   1                         if s = s' = Done;

    ĉ(s, a) =  c(s, a)   if s ∈ S,
               0         otherwise.
There is a natural mapping from policies for the augmented MDP to policies for the
original MDP and vice-versa: given a policy π̂ for the augmented MDP one obtains
a policy π for the original MDP by restricting π̂ to the state-action pairs in S × A;
given a policy π for the original MDP one takes π̂ to be an arbitrary policy whose
restriction to S ×A is equal to π. Both of these natural mappings preserve the policy’s
value function. In that sense, solving the original MDP (i.e. identifying a policy of
minimum cost) is equivalent to solving the augmented MDP. This is what we mean
when we say that a memoryless stopping rule is, without loss of generality, equal to
the hitting time of some set U .
2.2  Examples
To illustrate the abstract definition of Markov decision processes, we will give two
examples in this section. A third illustration is contained in the following section,
which explains how MDP’s model an important class of bandit problems.
Example 1 (Blackjack with an infinite deck). If one is playing blackjack with an
infinite deck of cards (such that the probability of seeing any given type of card is
1/52 regardless of what cards have been seen before) then the game is a MDP whose
states are ordered triples (H, B, F ) where H is a multiset of cards (the contents of the
player’s current hand), B > 0 is the size of the player’s current bet, and F ∈ {0, 1}
specifies whether the player is finished receiving new cards into his or her hand. The
set of actions is {hit,stand,double}. In state (H, B, 0), if the player chooses “stand”
then the next state is (H, B, 1) with probability 1. If the player chooses “double”
then the next state is (H, 2B, 0) with probability 1. If the player chooses “hit” then a
random card is added to H, B remains the same, and F changes from 0 to 1 if the sum
of the values in H now exceeds 21, otherwise F remains at 0. The stopping time is the
hitting time of the set of states such that F = 1. (Consequently it doesn’t matter how
we define the transition probabilities in such states, though for concreteness we will
say that any action taken in such a state leads back to the same state with probability
1.) The cost of taking an action that leads to a state (H, B, 0) is 0; the cost of taking
an action that leads to a state (H, B, 1) is B times the probability that the dealer
beats a player whose hand is H. (Technically, we are supposed to define the cost as
a function of the action and the state immediately preceding that action. Thus we
should really define the cost of taking action a in state s to be equal to B times the
probability that a leads to a state with F = 1 and the dealer beats the player in this
state.)
If the deck is not infinite and the player is counting cards, then to model the
process as an MDP we must enlarge the state space to include the information that
the player recalls about cards that have been dealt in the past.
Example 2 (Playing golf with n golf balls). The game of “golf with n golf balls”
is played by a single golfer using n golf balls on a golf course with a finite set of
locations where a ball may come to rest. One of these locations is the hole, and the
objective is to get at least one of the n balls to land in the hole while minimizing the
total number of strokes (including strokes that involved hitting other balls besides
the one which eventually landed in the hole).
We can model this as a MDP, as follows. Let L be the set of locations, and let
h ∈ L denote the hole. The set of states of the MDP is L^n and the stopping time is
the hitting time of the set U = {(ℓ1, ℓ2, . . . , ℓn) ∈ L^n | ∃i such that ℓi = h}. The set
of actions is [n] × C where C is the set of golf clubs that the golfer is using. The
interpretation of action (i, c) is that the golfer uses club c to hit ball number i. When
the golfer takes action (i, c), the state updates from (ℓ1, . . . , ℓn) to a random new state
(ℓ1, . . . , ℓ_{i−1}, ℓ′_i, ℓ_{i+1}, . . . , ℓn), where the probability of hitting ball i from ℓ to ℓ′ using
club c is a property of the golfer and the ball which the golfer is hitting (but it does
not depend on the time at which the golfer is hitting the ball, nor on the positions of
the other balls on the golf course).
2.3  How is this connected to multi-armed bandits?
Let F denote a family of probability measures on R. For example F may be the family
of all Gaussian distributions, or F may be the family of all distributions supported on
the two-element set {0, 1}. Consider a multi-armed bandit problem with strategy set
S = [n], in which the decision-maker believes that each strategy i ∈ [n] has costs
distributed according to some unknown distribution fi ∈ F and that these unknown
distributions f1 , f2 , . . . , fn are themselves independent random variables distributed
according to n known probability measures µ1 , µ2 , . . . , µn on F. To put it more
precisely, the decision-maker’s prior belief distribution can be described as follows.
There are n random variables f1 , f2 , . . . , fn taking values in F; they are distributed
according to the product distribution µ1 ⊗ . . . ⊗ µn , and the costs ct (i) are mutually
conditionally independent (conditioned on f1 , f2 , . . . , fn ) and satisfy
Pr(ct (i) ∈ B | f1 , f2 , . . . , fn ) = fi (B)
for every Borel set B ⊆ R.
Let us assume, moreover, that there is a fixed discount factor γ < 1 and that the
decision-maker wishes to choose a sequence of strategies x1 , x2 , . . . so as to minimize
the expected time-discounted cost
"∞
#
X
E
γ t ct (it )
t=0
where the expectation is with respect to the decision-maker’s prior.
This problem can be modeled as a Markov decision process with an infinite state
space. Specifically, a state of the MDP is an n-tuple of beliefs ν1 , ν2 , . . . , νn — each
of which is a probability measure on F — representing the decision-maker’s posterior
belief about the cost distribution of each strategy, after performing some number of
experiments and observing their outcomes. The set of actions is simply [n]; performing
action x at time t in the MDP corresponds to choosing strategy x in step t of the
bandit problem. The transition probabilities of the MDP are determined by Bayes’
law, which specifies how to update the posterior distribution for strategy x after
making one observation of the cost ct (x). Note that when the decision-maker chooses
action x in state (ν1 , . . . , νn ), the resulting state transition only updates the x-th
component of the state vector. (Our assumption that the cost distributions of the
different strategies are independent ensures that a Bayesian update after observing
ct (x) has no effect on νy for y 6= x.) The cost of choosing action x in state (ν1 , . . . , νn )
is simply the conditional expectation E[ct (x) | νx ], i.e. it is the expected value of a
random sample from distribution νx .
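To make the state space concrete in one simple special case (my own illustration, not from the notes): if costs take values in {0, 1} and each prior µ_i is a Beta distribution, then a posterior ν_i is summarized by two counts, and the Bayes update after observing c_t(x) is a one-line counter increment. The class and method names are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaBelief:
    """Posterior over the unknown probability that c_t(i) = 1 for one strategy,
    when the prior is Beta(alpha, beta) and the costs are {0,1}-valued."""
    alpha: float
    beta: float

    def expected_cost(self):
        # The cost of choosing this strategy in the MDP: E[c_t(i) | belief].
        return self.alpha / (self.alpha + self.beta)

    def updated(self, observed_cost):
        # Bayes' law for a Bernoulli observation: increment one of the two counts.
        if observed_cost == 1:
            return BetaBelief(self.alpha + 1, self.beta)
        return BetaBelief(self.alpha, self.beta + 1)

# A state of the MDP is an n-tuple of beliefs; choosing action x and observing
# c_t(x) updates only the x-th component, as described in the text above.
state = (BetaBelief(1, 1), BetaBelief(1, 1), BetaBelief(1, 1))
x, observed = 1, 0
state = state[:x] + (state[x].updated(observed),) + state[x + 1:]
print([round(b.expected_cost(), 3) for b in state])   # [0.5, 0.333, 0.5]
```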
Note the similarity between this example and the “golfing with n golf balls” example. Both problems entail studying MDP’s in which the states are represented as
n-tuples, and each action can only update one component of the n-tuple. In fact, if
one generalizes the golfing problem to include golf courses with uncountably many
locations and rules in which the cost of a stroke depends on the ball’s location at the
time it was hit, then it is possible to see the bandit problem as a special case of the
golfing problem.
2.4  Properties of optimal policies
In this section we prove a sequence of three theorems which characterize optimal
policies of MDP’s and which establish that every MDP has an optimal policy which
is a pure policy.
Before doing so, it will be useful to introduce the notation Qπ (s, a), which denotes
the expected cost of performing action a at time 0 in state s, and using policy π in
every subsequent time step.
    Q^π(s, a) = c(s, a) + E[ Σ_{t=1}^{τ−1} c(st, at) | s0 = s, a0 = a ].
By abuse of notation, for a policy π′ we also define Q^π(s, π′) to be the weighted average

    Q^π(s, π′) = Σ_{a∈A} π′(s, a) Q^π(s, a).
Note that Q^π(s, π) = V^π(s) for any policy π and state s.
Theorem 7 (Policy improvement theorem). If π, π′ are policies such that

    Q^π(s, π) ≥ Q^π(s, π′)                                           (8)

for every state s, then

    V^π(s) ≥ V^{π′}(s)                                               (9)

for every state s. If the inequality (8) is strict for at least one s, then (9) is also strict
for at least one s.
Proof. For any t ≥ 0, let π<t> denote the "hybrid policy" which distributes its
actions according to π′ at all times before t and distributes its actions according to π
at all times t and later. (Technically, this does not satisfy our definition of the word
"policy" since the distribution over actions depends not only on the current state but
on the time as well. However, this abuse of terminology should not cause confusion.)
For every t ≥ 0 and s ∈ S we have

    V^{π<t+1>}(s) − V^{π<t>}(s) = Σ_{s'∈S} ( Q^π(s', π′) − Q^π(s', π) ) · Pr(st = s' | s0 = s)
                                 ≤ 0.

The theorem now follows by observing that V^{π<0>}(s) = V^π(s) and that

    lim_{t→∞} V^{π<t>}(s) = V^{π′}(s).
Definition 7. A policy π for a MDP is optimal if V^π(s) ≤ V^{π′}(s) for every state s
and policy π′.
Theorem 8 (Bellman’s optimality condition). A policy π is optimal if and only
if it satisfies
    a ∈ arg min_{b∈A} Q^π(s, b)                                      (10)
for every state-action pair (s, a) such that π(s, a) > 0.
Proof. By Theorem 7, if π fails to satisfy (10) for some state-action pair (s, a) such
that π(s, a) > 0, then π is not an optimal policy. This is because we may construct a
different policy π′ such that π′(s) is a probability distribution concentrated on the set
arg min_{b∈A} Q^π(s, b), and π′(s') = π(s') for all states s' ≠ s. This new policy π′ satisfies
Q^π(s, π) > Q^π(s, π′), and Q^π(s', π) = Q^π(s', π′) for all s' ≠ s; hence by Theorem 7
there is some state in which V^{π′} is strictly less than V^π, hence π is not optimal.
To prove the converse, assume π is not optimal and let σ be an optimal policy.
Also assume (without loss of generality) that the stopping time τ is the hitting time
of some set U ⊆ S. Let
    x = max_{s∈S} ( V^π(s) − V^σ(s) ),
and let
T = {s ∈ S | V π (s) = V σ (s) + x}.
Notice that V π (s) = V σ (s) = 0 for all s ∈ U , hence T is disjoint from U . Thus there
must be at least one state s ∈ T such that the probability of a state transition from
s to the complement of T , when playing with policy σ, is strictly positive. Now,
    Q^π(s, π) = V^π(s)
              = V^σ(s) + x
              = Σ_{a∈A} σ(s, a) ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} (V^σ(s') + x) )
              > Σ_{a∈A} σ(s, a) ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V^π(s') )
              = Σ_{a∈A} σ(s, a) Q^π(s, a)
              = Q^π(s, σ),

where the strict inequality holds because V^σ(s') + x ≥ V^π(s') for every state s', with
strict inequality when s' ∉ T, and the transition from s to the complement of T has
positive probability under σ. Hence π must not satisfy (10).
Theorem 9 (Existence of pure optimal policies). For every MDP, there is a
pure policy which is optimal.
Proof. There are only finitely many pure policies, so among all pure policies there is
at least one policy π which minimizes the sum Σ_{s∈S} V^π(s). We claim that this policy
π is an optimal policy. Indeed, if π is not an optimal policy then by Theorem 8 there
is a state-action pair (s, a) such that π(s, a) > 0 but a ∉ arg min_{b∈A} Q^π(s, b). Then,
by Theorem 7, if we modify π to a new policy π′ by changing π(s) from a to any
action a′ ∈ arg min_{b∈A} Q^π(s, b), then this new policy π′ satisfies V^{π′}(s') ≤ V^π(s') for
all states s' ∈ S, with strict inequality for at least one such state. This contradicts
our assumption that π minimizes Σ_{s∈S} V^π(s) among all pure policies.
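The proofs of Theorems 7–9 suggest an iterative procedure: repeatedly evaluate the current pure policy and switch each state to an action minimizing Q^π(s, ·), until the Bellman condition (10) holds. The sketch below is my own illustration (not from the notes); it assumes the stopping time is the hitting time of a terminal set U, approximates V^π by sweeping the one-step recurrence, and reuses the toy MDP from the earlier sketch only in the commented example.

```python
def evaluate(P, c, U, pi, sweeps=2000):
    """Approximate V^pi by repeatedly applying V(s) <- c(s, pi(s)) + sum_s' P V(s'),
    holding V fixed at 0 on the terminal set U (hitting-time stopping)."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        for s in P:
            if s not in U:
                a = pi[s]
                V[s] = c[(s, a)] + sum(prob * V[s2] for s2, prob in P[s][a].items())
    return V

def q_value(P, c, V, s, a):
    """Q^pi(s, a) computed from the value function V of pi."""
    return c[(s, a)] + sum(prob * V[s2] for s2, prob in P[s][a].items())

def policy_iteration(P, c, U, actions):
    """Pure-policy improvement until condition (10) holds (up to evaluation error)."""
    pi = {s: actions[0] for s in P}                  # an arbitrary initial pure policy
    while True:
        V = evaluate(P, c, U, pi)
        new_pi = {s: min(actions, key=lambda a: q_value(P, c, V, s, a)) for s in P}
        if new_pi == pi:
            return pi, V
        pi = new_pi

# Example, with the toy MDP from the earlier sketch and U = {"done"}:
# pi_opt, V_opt = policy_iteration(P, c, {"done"}, A)
```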
2.5  Computing optimal policies
Theorem 8 actually implies an algorithm for computing an optimal policy of a MDP
in polynomial time, by solving a linear program. Namely, consider the linear program:
    max   Σ_{s∈S} V(s)
    s.t.  V(s) = 0                                        ∀ s ∈ U
          V(s) ≤ c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s')       ∀ s ∈ S \ U, a ∈ A.
It is easy to check the following facts.
1. If V is a solution of the linear program, then

       V(s) = min_{a∈A} ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s') )

   for all s ∈ S \ U.

2. If π is a pure policy obtained by selecting π(s) ∈ arg min_{a∈A} ( c(s, a) + Σ_{s'∈S} P^a_{s,s'} V(s') )
   for all s ∈ S, then V is the value function of π.
3. The pure policy π defined in this manner satisfies the Bellman optimality condition, and is therefore an optimal policy.
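The LP translates directly into code. The sketch below is my own illustration (not part of the notes), assuming scipy is available; scipy.optimize.linprog minimizes, so the objective is the negated sum of the V(s). The commented example reuses the hypothetical toy MDP from the earlier sketches.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(S, A, P, c, U):
    """Solve the linear program of this section: maximize sum_s V(s) subject to
    V(s) = 0 on U and V(s) <= c(s, a) + sum_{s'} P^a_{s,s'} V(s') elsewhere."""
    idx = {s: i for i, s in enumerate(S)}
    n = len(S)
    obj = -np.ones(n)                       # linprog minimizes, so negate the sum

    # Inequality constraints: V(s) - sum_{s'} P^a_{s,s'} V(s') <= c(s, a).
    A_ub, b_ub = [], []
    for s in S:
        if s in U:
            continue
        for a in A:
            row = np.zeros(n)
            row[idx[s]] = 1.0
            for s2, prob in P[s][a].items():
                row[idx[s2]] -= prob
            A_ub.append(row)
            b_ub.append(c[(s, a)])

    # Equality constraints: V(s) = 0 for s in U.
    A_eq = [np.eye(n)[idx[s]] for s in U]
    b_eq = [0.0] * len(U)

    res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(None, None)] * n)
    V = dict(zip(S, res.x))
    # A pure policy extracted as in fact 2 above.
    pi = {s: min(A, key=lambda a: c[(s, a)] +
                 sum(prob * V[s2] for s2, prob in P[s][a].items()))
          for s in S}
    return V, pi

# Example, with the toy MDP S, A, P, c from the earlier sketch and U = {"done"}:
# V, pi = solve_mdp_lp(S, A, P, c, {"done"})
```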
In practice, there are iterative algorithms for solving MDP’s which are much more
efficient than the reduction from MDP’s to linear programming presented here. For
more information on these other methods for solving MDP’s, we refer the reader to
http://www.cs.ualberta.ca/∼sutton/book/ebook/the-book.html.