Bayesian Learning
Consider the following simple yet quite general setting where an economic agent learns
about underlying uncertainty in her economic environment:
• Time is discrete, t = 0, 1, ...
• In each period, a random variable Xt is observed.
• We take the realizations of this random variable to be in a finite set X = {x1 , ..., xM }.
• To make the presentation simple and concrete, I assume that an unknown parameter
θ controls the uncertainty in the model.
• Again, for simplicity, assume that θ is in a finite set Θ = {θ1 , ..., θN }.
• There is a prior probability µ0 (θ) on Θ.
• In each period, an action a is taken in a finite set A = {a1 , ..., aK }.
• For a fixed sequence of actions {at}, the space of uncertainty is
Ω ≡ Θ × X^∞ × A^∞.
• A typical realization of uncertainty is thus ω ∈ Ω.
• In a more general setting, we would not have a parametric representation for the uncertainty. Instead, we would be operating directly on a product of measurable spaces (Xt, Ft), where Xt is the sample space of the period-t random variable and Ft is the sigma-algebra of measurable events on Xt. The space of uncertainty would then be
Ω ≡ (×Xt, ×Ft) ≡ (X, F).
• For most of this lecture, we shall assume that Xt depends only on θ and at . Write
the conditional distribution on X given a, θ as p(x | a, θ).
• In this setting, learning is clearly just about the parameter, and inference is quite
easy.
Information
• At t, the information given to the agent is (a0 , x0 , ..., at−1 , xt−1 ).
• We denote this information by Ft , and formally it is the smallest sigma-algebra on
Ω containing events {Xs = xs , As = as } for all s < t.
• In the parametric setting, the distribution of future Xt is completely determined
by the value of the parameter θ. As a result, we may state our learning problem in
terms of posterior probabilities on Θ.
• Let µ(θ | Ft ) = µt (θ).
• We want to understand how the sequence µt (θ) behaves as a function of the observables xt , at .
• The key is of course Bayes’ rule. We shall return to this.
• In the more general case of non-parametric learning, one starts with a (prior)
probability measure µ on (X, F). Learning is then represented by conditioning the
original measure µ on the sequence of increasing sigma-algebras Ft .
Examples
1. Parametric example
• In each period, a coin is either flipped at = 1 or not at = 0.
• There are three possible signals: X = {H, T, ∅}.
• p(∅ | 0, θ) = 1 for all θ (no flip yields no information), and p(H | 1, θ) = θ for all θ.
• Assume for simplicity that Θ = {1/2, 1/3}.
• It is clear that a coin flip is informative of the bias of the coin in a statistical
sense whereas no information is received if there is no coin flip.
• Let µ0 be the prior probability that θ = 1/2.
• Let µt be the posterior based on the previous t signals. Exercise: Write Bayes’
rule for this updating.
• What are the key issues here: Will the bias of the coin be learned? Translation: Is it the case that
µt → 1 ⇔ θ = 1/2 ?
• In what sense is the convergence to be understood?
• How quickly does the convergence take place?
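A minimal numerical sketch of this updating exercise (an illustration only: the true bias, the prior, and the horizon below are assumed values, not part of the text; the variable mu tracks µt = Pr(θ = 1/2 | observations)):

```python
import numpy as np

# Bayes' rule for the coin example: posterior mu on theta = 1/2 against theta = 1/3.
def update(mu, signal):
    """One step of Bayes' rule; signal is 'H', 'T', or None (no coin flip)."""
    if signal is None:                     # no flip: the signal is uninformative
        return mu
    p_half = 0.5                           # p(H | flip, 1/2) = p(T | flip, 1/2) = 1/2
    p_third = 1/3 if signal == "H" else 2/3
    num = mu * p_half
    return num / (num + (1 - mu) * p_third)

rng = np.random.default_rng(0)
theta, mu = 1/3, 0.5                       # true bias (assumed) and prior on theta = 1/2
for t in range(2000):
    signal = "H" if rng.random() < theta else "T"
    mu = update(mu, signal)
print(mu)                                  # close to 0: the true bias 1/3 has been learned
```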
2. Non-parametric example
• Sample space is X = {0, 1}.
• The dependence across periods cannot be described through a useful parametric model (such as the i.i.d. model above or a simple Markov chain). In this case, one posits that the true probability measure is given by some λ on (X, F).
Agents start with a prior µ on (X, F).
• The relevant question is then whether
µ(B | Ft ) converges to λ(B | Ft ).
• Obviously one gets different notions of convergence as one varies the sets B
allowed in the definition.
• A famous result by Blackwell and Dubins (1962) states that
sup_{B∈F} | µ(B | Ft) − λ(B | Ft) | → 0 a.s. λ
if λ is absolutely continuous with respect to µ (i.e. λ(B) > 0 ⇒ µ(B) > 0).
• This strong notion of convergence is called merging of µ to λ.
• A special case of such absolute continuity is the so-called 'grain of truth' assumption, where µ = ρλ + (1 − ρ)λ0 with ρ > 0 (a short verification appears after this list).
• Kalai and Lehrer have shown that if µ merges to λ for all filtrations {Ft} that generate F, then λ is absolutely continuous with respect to µ.
• Hence merging is a notion that is independent of the filtration.
• Other specifications for convergence are possible.
• One could consider
sup_{B∈F_{t+1}} | µ(B | Ft) − λ(B | Ft) | → 0 a.s. λ
• This is clearly a weaker notion of convergence, but since it extends in the obvious way to any finite future horizon, it seems the appropriate notion for discounted decision problems.
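Returning to the 'grain of truth' case above, a short verification (a sketch using only the definitions) of why it delivers the absolute continuity needed for merging:

```latex
% Grain of truth implies absolute continuity of lambda with respect to mu.
\mu = \rho\lambda + (1-\rho)\lambda_0, \quad \rho > 0
\;\Longrightarrow\;
\lambda(B) \le \tfrac{1}{\rho}\,\mu(B) \ \text{for all } B \in \mathcal{F},
\ \text{so}\ \mu(B) = 0 \Rightarrow \lambda(B) = 0 .
```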
Learning in the parametric setting
• Bayes' Rule:
µt+1(θ̂ | at, xt) = µt(θ̂) p(xt | at, θ̂) / Σ_{θ∈Θ} µt(θ) p(xt | at, θ).
• A sequence of random variables {Yt} on a probability space (Ω, F, µ) is called a martingale with respect to {Ft} if
E[Yt+1 | Ft] = Yt a.s. µ.
• Think of martingales as representing fair gambles.
Claim 1. {µt(θ)} is a martingale for all θ with respect to {Ft}, where Ft is the sigma-algebra generated by the observables.
Proof. Using Bayes' rule from above, we have for all at:
E[µt+1(θ̂) | Ft] = Σ_{x∈X} [ µt(θ̂) p(x | at, θ̂) / Σ_{θ∈Θ} µt(θ) p(x | at, θ) ] Pr{Xt = x | at}
= Σ_{x∈X} µt(θ̂) p(x | at, θ̂) = µt(θ̂) Σ_{x∈X} p(x | at, θ̂) = µt(θ̂),
where the second equality uses Pr{Xt = x | at} = Σ_{θ∈Θ} µt(θ) p(x | at, θ).
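The martingale property is also easy to verify numerically. A small sketch for the coin example; the posterior vector and parameter values are assumed for illustration:

```python
import numpy as np

# Check E[mu_{t+1}(theta_hat) | F_t] = mu_t(theta_hat) when the coin is flipped at t.
thetas = np.array([0.5, 1/3])        # Theta
mu_t = np.array([0.4, 0.6])          # an arbitrary current posterior on Theta
theta_hat = 0                        # track the posterior weight on theta = 1/2

def posterior(p_signal_given_theta):
    num = mu_t * p_signal_given_theta
    return num / num.sum()

p_H = mu_t @ thetas                  # Pr{X_t = H | a_t = flip} under the current posterior
post_H = posterior(thetas)           # posterior after H, since p(H | theta) = theta
post_T = posterior(1 - thetas)       # posterior after T

lhs = p_H * post_H[theta_hat] + (1 - p_H) * post_T[theta_hat]
print(lhs, mu_t[theta_hat])          # both equal 0.4: the posterior is a martingale
```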
• Notice that these posteriors on individual θ̂ generate the posterior on Θ.
• We turn next to the question of convergence of these posterior probabilities on
individual θ̂ and hence on the convergence of the posterior beliefs on Θ.
• In this quest, we use one of the most famous theorems in the theory of stochastic
processes.
Theorem 2 (Martingale Convergence Theorem, Doob). Let {Yt} be a martingale with respect to {Ft} which satisfies
sup_t E|Yt| < ∞.
Then the limit lim_t Yt ≡ Y∞ exists almost surely (and the limit is thus finite almost surely). Moreover, we have that Y∞ ∈ L1.
• An excellent source for more material on martingales is: David Williams, Probability with martingales, Cambridge Mathematical Textbooks, Cambridge University
Press, 1991.
• When do posteriors converge in the parametric case?
• The Martingale Convergence Theorem says that there exists a random variable µ∞ on
(Ω, F) such that for all paths ω in Ω (except a set of measure zero),
µt (θ̂ | Ft ) → µ∞ (ω).
• This implies that for almost all ω, µt (θ̂) converges to a constant.
• When is this possible?
• Think about Bayes’ rule: Posterior is constant if and only if a new trial contains no
new information.
• Two possibilities: uninformative signals (i.e. no more coin flips), or certainty about the state.
• We conclude that as long as the coin is flipped infinitely often, beliefs will converge. Furthermore, beliefs converge to the truth with probability 1, and since the posterior is a martingale, the probability of converging to θ̂ equals the prior on θ̂.
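This last point can be illustrated by simulation. A rough sketch with an assumed prior and horizon: along paths where the coin is flipped every period, the posterior settles on the true θ, and the fraction of paths settling on θ̂ matches the prior weight on θ̂:

```python
import numpy as np

# Simulate posterior convergence in the coin example (prior, horizon, and paths are assumed).
rng = np.random.default_rng(1)
thetas = np.array([0.5, 1/3])
prior = np.array([0.7, 0.3])
n_paths, T = 2000, 2000

settles_on_half = 0
for _ in range(n_paths):
    theta = rng.choice(thetas, p=prior)           # nature draws theta from the prior
    heads = (rng.random(T) < theta).sum()         # the coin is flipped every period
    # log posterior odds of theta = 1/2 against theta = 1/3 after T flips
    log_odds = (np.log(prior[0] / prior[1])
                + heads * np.log(0.5 / (1/3))
                + (T - heads) * np.log(0.5 / (2/3)))
    settles_on_half += log_odds > 0
print(settles_on_half / n_paths)                  # roughly prior[0] = 0.7
```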
Preferences
• For this subsection, we stay in the parametric setting for simplicity.
• Stationary environment: At = A for all t, Xt = X for all t where both sets are
finite.
• The timing is as follows. At start of period t, decision maker chooses at ∈ A. Then
Xt is realized. After this, play moves to period t + 1.
• Per period discount factor 0 < δ < 1.
• Preferences are defined on A^∞ × X^∞ so that they depend on Θ only through X. This allows the player to record and collect utility u(at, xt) at the end of the period in such a way that the realized utility carries no information about θ beyond that contained in Xt.
• Let µt denote the posterior on Θ given Ft.
• We can write Bayes’ rule as:
µt+1 = f (µt , at ).
• Notice that in this sense, the posterior is a controlled Markov process: the next posterior depends only on the current posterior and the current action.
• Per period expected utility from action at at posterior µt is given by:
U(at, µt) = Σ_{θ∈Θ} Σ_{xt∈X} u(at, xt) p(xt | at, θ) µt(θ).
• We assume a time-separable preference setting, and the dynamic decision problem is then to
max_{{at}} Σ_{t=0}^∞ δ^t U(at, µt)
subject to µt+1 = f(µt, at).
• Since we have finite A and X, the per period expected utilities are bounded. Furthermore, in our simple parametric case, the transition function µt+1 = f(µt, at) is regular enough so that we have equivalence between the sequence problem above and a dynamic programming formulation (a numerical sketch appears at the end of this section):
V(µt) = max_{at} { U(at, µt) + δ E V(f(µt, at)) }.
• Let
V̂(µt) = max_{at} U(at, µt).
• What can be said about convexity of V̂ (µt ) and V (µt )?
• Pretty much everything that has been said above generalizes to the case where A
and X and Ω are compact, u(·, ·) is continuous and bounded and the conditional
densities f (x | at , ω) exist and are well behaved. Stokey, Lucas, Prescott is a good
source for sufficient conditions for this case.
• Can we say anything about optimal actions based on the above? Experimental
consumption? How to think about risk and uncertainty in this framework?
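A minimal computational sketch of the dynamic programming formulation above, specialized to the coin example: the belief µ (the posterior on θ = 1/2) is the state, and the two actions are 'flip' and 'don't flip'. The flow payoffs, discount factor, and grid are assumptions made only for illustration:

```python
import numpy as np

# Value iteration on a belief grid for V(mu) = max_a { U(a, mu) + delta * E V(f(mu, a)) }.
thetas = (0.5, 1/3)
delta = 0.9
grid = np.linspace(0.0, 1.0, 201)          # grid for mu = posterior on theta = 1/2

def bayes(mu, heads):
    p = thetas if heads else (1 - thetas[0], 1 - thetas[1])   # p(x | flip, theta)
    num_hi, num_lo = mu * p[0], (1 - mu) * p[1]
    return num_hi / (num_hi + num_lo)

def U(a, mu):
    # hypothetical flow payoffs: a safe payoff of 0.45, or 1 per head when flipping
    return 0.45 if a == 0 else mu * thetas[0] + (1 - mu) * thetas[1]

V = np.zeros_like(grid)
for _ in range(400):                       # value iteration
    V_new = np.empty_like(V)
    for i, mu in enumerate(grid):
        p_H = mu * thetas[0] + (1 - mu) * thetas[1]
        cont_flip = (p_H * np.interp(bayes(mu, True), grid, V)
                     + (1 - p_H) * np.interp(bayes(mu, False), grid, V))
        v_flip = U(1, mu) + delta * cont_flip
        v_stay = U(0, mu) + delta * np.interp(mu, grid, V)    # no flip: belief unchanged
        V_new[i] = max(v_flip, v_stay)
    V = V_new
print(V[::50])                             # value function at a few beliefs
```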
Bandit Problems
We now specialize the above dynamic decision-making framework as follows:
• Markov decision problem in discrete time with time index t = 0, 1, ....
• At each t, the decision maker chooses amongst K arms and we denote this choice
by at ∈ {1, ..., K}.
• If at = k, a random payoff x^k_t is realized, and we denote the associated random variable by X^k_t. We assume that |X^k_t| is bounded by L for all t, k.
• The state variable of the Markovian decision problem is given by st .
• Write the distribution of x^k_t as F^k(·; st).
• The state transition function φ depends on the choice of the arm and the realized payoff:
st+1 = φ(x^k_t; st).
• Let St denote the set of all possible states in period t.
• A feasible Markov policy a = {at}_{t=0}^∞ selects an available alternative for each conceivable state st, i.e.
at : St → {1, ..., K}.
Defining properties
The following two assumptions must be met for the problem to qualify as a bandit problem.
1. Payoffs are evaluated according to the discounted expected payoff criterion where
the discount factor δ satisfies 0 ≤ δ < 1.
2. The payoff from each k depends only on outcomes of periods with at = k. In other words, we can decompose the state variable st into K components s^1_t, ..., s^K_t such that for all k:
s^k_{t+1} = s^k_t if at ≠ k,
s^k_{t+1} = φ(s^k_t, x^k_t) if at = k,
and
F^k(·; st) = F^k(·; s^k_t).
Notice that when the second assumption holds, the alternatives must be statistically
independent.
It is easy to see that many situations of economic interest are special cases of the
above formulation.
• First, it could be that F^k(·; θ^k) is a fixed distribution with an unknown parameter θ^k. The state variable is then the posterior probability distribution on θ^k.
• Alternatively, F^k(·; s^k) could denote the random yield per period from a resource k after extracting s^k units.
The value function V(s0) of the bandit problem can be written as follows. Let X^k(s^k_t) denote the random reward with distribution F^k(·; s^k_t). Then the problem of finding an optimal allocation policy is the solution to the following intertemporal optimization problem:
V(s0) = sup_a E[ Σ_{t=0}^∞ δ^t X^{at}(s^{at}_t) ].
The celebrated index theorem due to Gittins and Jones (1974) transforms the problem of finding the optimal policy into a collection of K stopping problems. For each alternative k, we calculate the following index m^k(s^k_t), which depends only on the state variable of alternative k:
m^k(s^k_t) = sup_τ E[ Σ_{u=t}^τ δ^u X^k(s^k_u) ] / E[ Σ_{u=t}^τ δ^u ],     (1)
where τ is a stopping time with respect to {s^k_t}.
The idea is to find for each k the stopping time τ that results in the highest discounted
expected return per discounted expected number of periods in operation. The Gittins
index theorem then states that the optimal way of choosing arms in a bandit problem is
to select in each period the arm with the highest Gittins index, m^k(s^k_t).
Theorem 3 (Gittins-Jones (1974)).
The optimal policy satisfies at = k for some k such that
m^k(s^k_t) ≥ m^j(s^j_t) for all j ∈ {1, ..., K}.
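In code, the index rule itself is a one-line comparison once per-arm indices are available. The harness below is a generic sketch; index and step are hypothetical callables that must be supplied for the arm class at hand (they are not defined anywhere in the text):

```python
# Generic index policy: in every period pull an arm with the highest index m^k(s^k_t).
def index_policy(states, index, step, T):
    """states: per-arm states; index(s) -> Gittins index of a state; step(k, s) -> (reward, s')."""
    history = []
    for _ in range(T):
        k = max(range(len(states)), key=lambda j: index(states[j]))  # highest-index arm
        reward, states[k] = step(k, states[k])                       # only arm k's state moves
        history.append((k, reward))
    return history
```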
An alternative formulation of the main theorem, based on dynamic programming, can be found in Whittle (1982). The basic idea is to find for every arm a retirement value M^k_t, and then to choose in every period the arm with the highest retirement value. Formally, for every arm k and retirement value M, we can compute the optimal retirement policy given by:
V^k(s^k_t, M) = max{ E[X^k(s^k_t) + δ V^k(s^k_{t+1}, M)], M }.     (2)
The auxiliary decision problem given by (2) compares in every period the trade-off between continuing with the reward process generated by arm k and stopping with a fixed retirement value M. The index of arm k in state s^k_t is the highest retirement value at which the decision maker is just indifferent between continuing with arm k and retiring with M = M^k(s^k_t):
M^k(s^k_t) = V^k(s^k_t, M^k(s^k_t)).
The resulting index M^k(s^k_t) equals the discounted sum of the flow index m^k(s^k_t), i.e.
M^k(s^k_t) = m^k(s^k_t) / (1 − δ).
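The retirement-value characterization also suggests a direct numerical recipe: solve the one-arm auxiliary problem (2) on a grid and search for the smallest M at which immediate retirement is optimal. The sketch below does this for a Bernoulli arm whose success probability is either 0.8 or 0.2, with the posterior on the high value as the state; the arm, grid sizes, and discount factor are assumptions, not taken from the text:

```python
import numpy as np

# Compute M^k(mu0) = min{ M : V^k(mu0, M) = M } for a two-point Bernoulli arm by value iteration.
delta = 0.9
hi, lo = 0.8, 0.2                              # possible success probabilities (assumed)
grid = np.linspace(0.0, 1.0, 401)              # belief grid: mu = posterior weight on hi

def bayes(mu, success):
    p_hi, p_lo = (hi, lo) if success else (1 - hi, 1 - lo)
    return mu * p_hi / (mu * p_hi + (1 - mu) * p_lo)

p_succ = grid * hi + (1 - grid) * lo
mu_up = np.array([bayes(m, True) for m in grid])    # posterior after a success
mu_dn = np.array([bayes(m, False) for m in grid])   # posterior after a failure

def V_k(M, iters=300):
    """Auxiliary problem: pull the arm (reward 1 on success, 0 otherwise) or retire with M."""
    V = np.full_like(grid, M)
    for _ in range(iters):
        cont = (p_succ * (1.0 + delta * np.interp(mu_up, grid, V))
                + (1 - p_succ) * delta * np.interp(mu_dn, grid, V))
        V = np.maximum(M, cont)
    return V

mu0 = 0.5
M_candidates = np.linspace(0.0, hi / (1 - delta), 200)
M_index = next(M for M in M_candidates if np.interp(mu0, grid, V_k(M)) <= M + 1e-6)
print(M_index, (1 - delta) * M_index)          # retirement value M^k and flow index m^k
```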
Proof of the Gittins Index Theorem
Since the rewards are bounded, it is easy to verify (e.g. by Blackwell's sufficient conditions) that the following Bellman equation has a unique bounded solution:
V(st, M) = max{ M, max_{k∈{1,...,K}} E[x^k_t + δ V(st+1, M)] }.
Consider also the Bellman equations for the following K auxiliary problems:
V^k(s^k_t, M) = max{ M, E[x^k_t + δ V^k(s^k_{t+1}, M)] }.
These problems also have unique bounded solutions. Let M^k(s^k) = min{ M | V^k(s^k, M) = M }.
Finally, define
V^{−k}(s^{−k}_t, M) = max{ M, max_{l≠k} E[x^l_t + δ V^{−k}(s^{−k}_{t+1}, M)] }.
The following Lemma plays an important role in characterizing the form of optimal
policies.
Lemma 1.
V(st, M) ≤ V^k(s^k_t, M) + V^{−k}(s^{−k}_t, M) − M.
Proof. Write s = (s^k, s^{−k}). Let V_0(s, M) = V_0^k(s^k, M) = V_0^{−k}(s^{−k}, M) = M, and for all st define
V_{n+1}(st, M) = max{ M, E[x^k_t + δ V_n(st+1, M)], max_{l≠k} E[x^l_t + δ V_n(st+1, M)] }.
Similarly,
V^k_{n+1}(s^k_t, M) = max{ M, E[x^k_t + δ V^k_n(s^k_{t+1}, M)] },
and
V^{−k}_{n+1}(s^{−k}_t, M) = max{ M, max_{l≠k} E[x^l_t + δ V^{−k}_n(s^{−k}_{t+1}, M)] }.
These are the successive approximations in the value iteration algorithm for dynamic
programming, and therefore the result follows if we have for all n, s,
V_n(st, M) ≤ V^k_n(s^k_t, M) + V^{−k}_n(s^{−k}_t, M) − M.
This is trivially true for n = 0, so assume it holds for n and we will show that it holds
for n + 1.
Observe first that V^k_n(s^k_t, M) ≥ M and V^{−k}_n(s^{−k}_t, M) ≥ M for all n, s, and
V^k_{n+1}(s^k_t, M) ≥ V^k_n(s^k_t, M)  and  V^{−k}_{n+1}(s^{−k}_t, M) ≥ V^{−k}_n(s^{−k}_t, M).
As a consequence,
δ[V^k_n(s^k, M) − M] ≤ V^k_n(s^k, M) − M ≤ V^k_{n+1}(s^k, M) − M,
and
δ[V^{−k}_n(s^{−k}, M) − M] ≤ V^{−k}_n(s^{−k}, M) − M ≤ V^{−k}_{n+1}(s^{−k}, M) − M.
But then, using the induction hypothesis and the definitions of the value functions,
we have
V_{n+1}(st, M) ≤ max{ M,
E[x^k_t + δ(V^k_n(s^k_{t+1}, M) + V^{−k}_n(s^{−k}_t, M) − M)],
max_{l≠k} E[x^l_t + δ(V^k_n(s^k_t, M) + V^{−k}_n(s^{−k}_{t+1}, M) − M)] }.
All three terms on the right-hand side are bounded from above by V^k_{n+1}(s^k_t, M) + V^{−k}_{n+1}(s^{−k}_t, M) − M. For the first term, this follows from V^k_{n+1}(s^k_t, M) ≥ M and V^{−k}_{n+1}(s^{−k}_t, M) ≥ M. The second term is bounded by E[x^k_t + δ V^k_n(s^k_{t+1}, M)] + V^{−k}_{n+1}(s^{−k}_t, M) − M, and by the definition of V^k_{n+1}(s^k_t, M) this in turn is bounded by V^k_{n+1}(s^k_t, M) + V^{−k}_{n+1}(s^{−k}_t, M) − M. The last term is bounded using exactly the same argument as the middle term. Therefore we have
V_{n+1}(st, M) ≤ V^k_{n+1}(s^k_t, M) + V^{−k}_{n+1}(s^{−k}_t, M) − M,
which completes the induction.
We use this Lemma to prove that the optimal policy takes a write-off form: for each arm k there is a set of states S^k such that the retirement option is chosen if and only if s^k ∈ S^k for every k. We can write S^k = { s^k | V^k(s^k, M) = M }. If s^k ∈ S^k for some k, then by the Lemma, V(s, M) = V^{−k}(s^{−k}, M), and therefore an optimal policy exists that never uses k. If s^k ∈ S^k for all k, then it is clearly optimal to retire. Clearly, if there is a k such that s^k ∉ S^k, then V(s, M) ≥ V^k(s^k, M) > M, and it is not optimal to retire.
We write V (s) for the value function of the original bandit problem without retirement
options. Since rewards are bounded, we have V (s, M ) = V (s) for low enough M, and the
problems coincide. From the definition it is clear that V (s, M ) is non-decreasing in M .
Consider an arbitrary policy of retiring at a stopping time τ. The value from this policy is
V(s, M; τ) = E[ Σ_{t=0}^{τ−1} δ^t x̂_t ] + M E[δ^τ].
Since each V (s, M ; τ ) is linear in M and V (s, M ) = supτ V (s, M ; τ ), we know that
V (s, M ) is convex in M . Therefore it has a derivative almost everywhere, and by envelope
theorem,
∂V(s, M)/∂M = E[δ^{τ_M}],
where τ_M is the optimal retirement policy for retirement value M.
By the write-off property deduced from the Lemma, we know that retirement happens
when sk ∈ S k for all k. Therefore by the independence of the arms,
E[δ^{τ_M}] = Π_k E[δ^{τ_{k,M}}],
where τ_{k,M} is the random time for reaching S^k under the optimal policy.
We have then:
∂V(s, M)/∂M = Π_k ∂V^k(s^k, M)/∂M.
Notice that ∂V^k(s^k, M)/∂M is non-decreasing in M since each V^k(s^k, M) is convex and non-decreasing in M. Furthermore, Π_k ∂V^k(s^k, M)/∂M is zero for M < −L and unity for M ≥ L (recall that L is the bound on the absolute value of the rewards). Therefore it has the properties of a distribution function.
Integrating gives:
V(s, M) = L − ∫_M^L Π_k [ ∂V^k(s^k, m)/∂m ] dm.
The remaining step is to verify that the Gittins index rule is optimal given this value function.
Lemma 2. We have:
V(st, M) − E[x^k_t + δ V(st+1, M)]
= [ V^k(s^k_t, M) − E[x^k_t + δ V^k(s^k_{t+1}, M)] ] Π_{l≠k} ∂V^l(s^l, M)/∂M
+ ∫_M^L [ V^k(s^k_t, m) − E[x^k_t + δ V^k(s^k_{t+1}, m)] ] d( Π_{l≠k} ∂V^l(s^l, m)/∂m )
≥ 0,     (3)
with equality if M^k = max_l M^l ≥ M.
Proof. Integrating by parts,
V(s, M) = L − ∫_M^L Π_k [ ∂V^k(s^k, m)/∂m ] dm
= V^k(s^k_t, M) Π_{l≠k} ∂V^l(s^l, M)/∂M + ∫_M^L V^k(s^k_t, m) d( Π_{l≠k} ∂V^l(s^l, m)/∂m ).
Therefore the first equality in (3) follows by subtracting E[x^k_t + δ V(st+1, M)] from both sides of the equation. Note that V^k(s^k_t, M) − E[x^k_t + δ V^k(s^k_{t+1}, M)] = 0 if M ≤ M^k. Furthermore, d( Π_{l≠k} ∂V^l(s^l, m)/∂m ) = 0 for m ≥ max_{l≠k} M^l. Therefore the expression is zero if M^k = max_l M^l ≥ M. This proves the optimality of the index rule.
Computing Gittins Index for Examples
Pandora’s boxes
• Weitzman (1979, Econometrica), in the Pandora's boxes problem, asks how one should schedule search for a prize (only one prize can be claimed) when there are K boxes, each characterized by the value of the prize and the probability of finding a prize in it, (v^k, p^k).
• By now we know how to answer this: compute the Gittins indices for all boxes and open them in decreasing order of their indices.
• Properties of the optimal sequence: Suppose p^k v^k = p^l v^l for some k, l, i.e. the two boxes offer the same expected prize from a single opening, so that myopically the decision maker would be indifferent between them. Which should be opened first? (A numerical sketch at the end of this section works this out.)
• Formulate the problem so that it fits our model above.
• Each arm starts in state s^k_0.
• If arm k is chosen, it gives an immediate expected return of p^k v^k.
• If k is chosen, then s^k_t = H for all subsequent t with probability p^k, and s^k_t = L for all subsequent t with probability 1 − p^k.
• x^k_t = v^k for all t if s^k_t = H, and x^k_t = 0 for all t if s^k_t = L.
• Observe that if the arm is tried once, all uncertainty about its future is resolved.
• δ < 1 is the discount factor.
• Compute the Gittins index as follows: Let V^k(s^k_0, M) be the value function of the auxiliary problem. Then
V(L, M) = M for all M ≥ 0,
V(H, M) = v^k / (1 − δ) for all M ≤ v^k / (1 − δ).
At the index value, M satisfies
M = p^k v^k + δ( p^k V(H, M) + (1 − p^k) V(L, M) )
= p^k v^k + δ( p^k v^k / (1 − δ) + (1 − p^k) M ),
or
M = p^k v^k / ( (1 − δ)(1 − δ + p^k δ) ),
or
(1 − δ)M = p^k v^k / (1 − δ + p^k δ).
• Observe that as δ → 1, (1 − δ)M → v^k.
• Observe also that
(1 − δ)M − p^k v^k = δ(1 − p^k) p^k v^k / (1 − δ + p^k δ) > 0.
• Therefore there is always value to experimentation.
• How far does this generalize?
• Key property required for Index Theorem to work: Outside option for each alternative must stay constant.
• Simple generalizations fail this property: choosing multiple arms simultaneously; switching costs between arms.
• Generalizations that can be handled: Arms branching into different arms.
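As promised above, a small numerical sketch for the opening-order question: two boxes with the same p·v but different risk profiles, ranked by the flow index (1 − δ)M = p v / (1 − δ + pδ) derived in this section. The box values and discount factor are illustrative:

```python
# Pandora's boxes: rank boxes by the flow index (1 - delta) M = p * v / (1 - delta + p * delta).
delta = 0.95

def flow_index(p, v):
    return p * v / (1 - delta + p * delta)

boxes = {"high p, low v": (0.9, 1.0), "low p, high v": (0.3, 3.0)}   # both have p * v = 0.9
order = sorted(boxes, key=lambda name: flow_index(*boxes[name]), reverse=True)
print({name: round(flow_index(*boxes[name]), 3) for name in boxes})
print(order)   # the low-probability, high-prize box has the larger index: open it first
```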