Bayesian Learning

Consider the following simple yet quite general setting, in which an economic agent learns about underlying uncertainty in her economic environment:

• Time is discrete, t = 0, 1, ...
• In each period, a random variable X_t is observed.
• We take the realizations of this random variable to be in a finite set X = {x_1, ..., x_M}.
• To make the presentation simple and concrete, I assume that an unknown parameter θ controls the uncertainty in the model.
• Again, for simplicity, assume that θ lies in a finite set Θ = {θ_1, ..., θ_N}.
• There is a prior probability µ_0(θ) on Θ.
• In each period, an action a is taken in a finite set A = {a_1, ..., a_K}.
• For a fixed sequence of actions {a_t}, the space of uncertainty is Ω ≡ Θ × X^∞ × A^∞.
• A typical realization of uncertainty is thus ω ∈ Ω.
• In a more general setting, we would not have a parametric representation for the uncertainty. Instead, we would be operating directly on a product of measurable spaces (X_t, F_t), where X_t is the sample space of the period-t random variable and F_t is the sigma-algebra of measurable events on X_t. The space of uncertainty would then be Ω ≡ (×X_t, ×F_t) ≡ (X, F).
• For most of this lecture, we shall assume that X_t depends only on θ and a_t. Write the conditional distribution on X given a, θ as p(x | a, θ).
• In this setting, learning is clearly just about the parameter, and inference is quite easy.

Information

• At t, the information given to the agent is (a_0, x_0, ..., a_{t−1}, x_{t−1}).
• We denote this information by F_t; formally, it is the smallest sigma-algebra on Ω containing the events {X_s = x_s, A_s = a_s} for all s < t.
• In the parametric setting, the distribution of future X_t is completely determined by the value of the parameter θ. As a result, we may state our learning problem in terms of posterior probabilities on Θ.
• Let µ(θ | F_t) = µ_t(θ).
• We want to understand how the sequence µ_t(θ) behaves as a function of the observables x_t, a_t.
• The key is of course Bayes’ rule. We shall return to this.
• In the more general case of non-parametric learning, one starts with a (prior) probability measure µ on (X, F). Learning is then represented by conditioning the original measure µ on the sequence of increasing sigma-algebras F_t.

Examples

1. Parametric example

• In each period, a coin is either flipped (a_t = 1) or not (a_t = 0).
• There are three possible signals: X = {H, T, ∅}.
• p(∅ | 0, θ) = 1 and p(∅ | 1, θ) = 0 for all θ; p(H | 1, θ) = θ for all θ.
• Assume for simplicity that Θ = {1/2, 1/3}.
• It is clear that a coin flip is informative about the bias of the coin in a statistical sense, whereas no information is received if there is no coin flip.
• Let µ_0 be the prior probability that θ = 1/2.
• Let µ_t be the posterior based on the previous t signals. Exercise: write Bayes’ rule for this updating.
• The key issues are: Will the bias of the coin be learned? That is, is it the case that µ_t → 1 if and only if θ = 1/2?
• In what sense is the convergence to be understood?
• How quickly does the convergence take place?

2. Non-parametric example

• The sample space is X = {0, 1}.
• The dependence across periods cannot be described through a useful parametric model (such as the i.i.d. model above or a simple Markov chain). One then posits that the true probability measure is given by some λ on (X, F). Agents start with a prior µ on (X, F).
• The relevant question is then whether µ(B | F_t) converges to λ(B | F_t).
• Obviously one gets different notions of convergence as one varies the sets B allowed in the definition.
• A famous result by Blackwell and Dubins (1962) states that

  sup_{B∈F} | µ(B | F_t) − λ(B | F_t) | → 0  λ-a.s.

if λ is absolutely continuous with respect to µ (i.e. λ(B) > 0 ⇒ µ(B) > 0).
• This strong notion of convergence is called merging of µ to λ.
• A special case of such absolute continuity is the so-called ‘grain of truth’ assumption, where µ = ρλ + (1 − ρ)λ_0 for some probability measure λ_0 and ρ > 0.
• Kalai and Lehrer have shown that if µ merges to λ for all filtrations {F_t} that generate F, then λ is absolutely continuous with respect to µ.
• Hence merging is a notion that is independent of the filtration.
• Other specifications for convergence are possible.
• One could consider

  sup_{B∈F_{t+1}} | µ(B | F_t) − λ(B | F_t) | → 0  λ-a.s.

• This is clearly a weaker notion of convergence, but since it extends in an obvious way to any finite future, it seems the appropriate notion for discounted decision problems.

Learning in the parametric setting

• Bayes’ rule:

  µ_{t+1}(θ̂) = µ_t(θ̂) p(x_t | a_t, θ̂) / Σ_{θ∈Θ} µ_t(θ) p(x_t | a_t, θ).

• A sequence of random variables {Y_t} on a probability space (Ω, F, µ) is called a martingale with respect to {F_t} if E[Y_{t+1} | F_t] = Y_t, µ-a.s.
• Think of martingales as representing fair gambles.

Claim 1. {µ_t(θ̂)} is a martingale for all θ̂ with respect to {F_t}, the filtration generated by the observables.

Proof. Using Bayes’ rule from above, we have for any a_t:

  E[µ_{t+1}(θ̂) | F_t] = Σ_{x∈X} [ µ_t(θ̂) p(x | a_t, θ̂) / Σ_{θ∈Θ} µ_t(θ) p(x | a_t, θ) ] Pr{X_t = x | a_t}.

Since Pr{X_t = x | a_t} = Σ_{θ∈Θ} µ_t(θ) p(x | a_t, θ), this equals

  Σ_{x∈X} µ_t(θ̂) p(x | a_t, θ̂) = µ_t(θ̂) Σ_{x∈X} p(x | a_t, θ̂) = µ_t(θ̂).

• Notice that these posteriors on individual θ̂ generate the posterior on Θ.
• We turn next to the question of convergence of these posterior probabilities on individual θ̂, and hence to the convergence of the posterior beliefs on Θ.
• In this quest, we use one of the most famous theorems in the theory of stochastic processes.

Theorem 2 (Martingale Convergence Theorem, Doob). Let {Y_t} be a martingale with respect to {F_t} which satisfies

  sup_t E|Y_t| < ∞.

Then the limit lim_t Y_t ≡ Y_∞ exists almost surely (and is thus finite almost surely). Moreover, Y_∞ ∈ L^1.

• An excellent source for more material on martingales is: David Williams, Probability with Martingales, Cambridge Mathematical Textbooks, Cambridge University Press, 1991.
• When do posteriors converge in the parametric case?
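Before turning to convergence, the Bayes update and the martingale property of Claim 1 can be checked numerically for the coin-flip example. A minimal sketch (the prior weights below are illustrative, and `bayes_update` is a hypothetical helper, not from the text):

```python
# Coin-flip example: Theta = {1/2, 1/3}, p(H | flip, theta) = theta.

def bayes_update(mu, x, thetas):
    """One step of Bayes' rule after observing outcome x in {'H', 'T'}."""
    likelihood = [th if x == 'H' else 1 - th for th in thetas]
    posterior = [m * l for m, l in zip(mu, likelihood)]
    z = sum(posterior)  # Pr{X_t = x}, the denominator in Bayes' rule
    return [p / z for p in posterior]

thetas = [0.5, 1 / 3]
mu = [0.6, 0.4]  # illustrative prior on theta = 1/2 and theta = 1/3

# Martingale check: E[mu_{t+1}(theta_hat) | F_t] should equal mu_t(theta_hat).
# Pr{X_t = H} = sum_theta mu(theta) * theta.
p_H = sum(m * th for m, th in zip(mu, thetas))
expected_next = [
    p_H * bayes_update(mu, 'H', thetas)[i]
    + (1 - p_H) * bayes_update(mu, 'T', thetas)[i]
    for i in range(len(thetas))
]
print(expected_next)  # equals the prior [0.6, 0.4] up to floating point
```

The cancellation is exact: weighting each posterior by the marginal probability of the signal that produced it recovers the prior, which is precisely the computation in the proof of Claim 1.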
• The Martingale Convergence Theorem says that there exists a random variable µ_∞ on (Ω, F) such that for all paths ω ∈ Ω (except a set of measure zero), µ_t(θ̂ | F_t) → µ_∞(ω).
• This implies that for almost all ω, µ_t(θ̂) converges to a constant.
• When is this possible?
• Think about Bayes’ rule: the posterior is constant if and only if a new trial contains no new information.
• Two possibilities: uninformative signals (i.e. no more coin flips), or certainty about the state.
• We conclude that as long as the coin is flipped infinitely often, beliefs will converge. Furthermore, beliefs converge to the truth with probability 1, and since the posterior is a martingale, the probability of converging to θ̂ equals the prior on θ̂.

Preferences

• For this subsection, we stay in the parametric setting for simplicity.
• Stationary environment: A_t = A and X_t = X for all t, where both sets are finite.
• The timing is as follows: at the start of period t, the decision maker chooses a_t ∈ A. Then X_t is realized. After this, play moves to period t + 1.
• The per-period discount factor is 0 < δ < 1.
• Preferences are defined on A^∞ × X^∞, so that they depend on Θ only through X. This allows the player to record and collect utility u(a_t, x_t) at the end of the period in such a way that the utility realization is not informative about θ beyond X_t.
• Let µ_t denote the posterior on Θ given F_t.
• We can write Bayes’ rule as µ_{t+1} = f(µ_t, a_t), suppressing the dependence on the realized signal x_t.
• Notice that in this sense the posterior is a controlled Markov process: the distribution of the next posterior depends only on the current posterior and the current action.
• The per-period expected utility from action a_t at posterior µ_t is

  U(a_t, µ_t) = Σ_{θ∈Θ} Σ_{x∈X} u(a_t, x) p(x | a_t, θ) µ_t(θ).

• We assume time-separable preferences, and the dynamic decision problem is then

  max_{a_t} Σ_{t=0}^{∞} δ^t U(a_t, µ_t)

subject to µ_{t+1} = f(µ_t, a_t).
• Since A and X are finite, the per-period expected utilities are bounded.
Furthermore, in our simple parametric case, the transition function µ_{t+1} = f(µ_t, a_t) is regular enough that the sequence problem above is equivalent to the dynamic programming formulation

  V(µ_t) = max_{a_t} { U(a_t, µ_t) + δ E V(f(µ_t, a_t)) }.

• Let V̂(µ_t) = max_{a_t} U(a_t, µ_t).
• What can be said about the convexity of V̂(µ_t) and V(µ_t)?
• Pretty much everything said above generalizes to the case where A, X and Ω are compact, u(·, ·) is continuous and bounded, and the conditional densities f(x | a_t, ω) exist and are well behaved. Stokey, Lucas and Prescott is a good source for sufficient conditions for this case.
• Can we say anything about optimal actions based on the above? Experimental consumption? How should we think about risk and uncertainty in this framework?

Bandit Problems

We now specialize the above dynamic decision-making framework as follows:

• Markov decision problem in discrete time with time index t = 0, 1, ....
• At each t, the decision maker chooses among K arms; we denote this choice by a_t ∈ {1, ..., K}.
• If a_t = k, a random payoff x^k_t is realized; we denote the associated random variable by X^k_t. We assume that |X^k_t| is bounded by L for all t, k.
• The state variable of the Markovian decision problem is s_t.
• Write the distribution of x^k_t as F^k(·; s_t).
• The state transition function φ depends on the choice of the arm and the realized payoff: s_{t+1} = φ(x^k_t; s_t).
• Let S_t denote the set of all possible states in period t.
• A feasible Markov policy a = {a_t}_{t=0}^{∞} selects an available alternative for each conceivable state s_t, i.e. a_t : S_t → {1, ..., K}.

Defining properties

The following two assumptions must be met for the problem to qualify as a bandit problem:

1. Payoffs are evaluated according to the discounted expected payoff criterion, where the discount factor δ satisfies 0 ≤ δ < 1.
2. The payoff from each arm k depends only on the outcomes of the periods with a_t = k.
In other words, we can decompose the state variable s_t into K components s^1_t, ..., s^K_t such that for all k:

  s^k_{t+1} = s^k_t if a_t ≠ k,  s^k_{t+1} = φ(x^k_t; s^k_t) if a_t = k,  and F^k(·; s_t) = F^k(·; s^k_t).

Notice that when the second assumption holds, the alternatives must be statistically independent.

It is easy to see that many situations of economic interest are special cases of the above formulation.

• First, it could be that F^k(·; θ^k) is a fixed distribution with an unknown parameter θ^k. The state variable is then the posterior probability distribution on θ^k.
• Alternatively, F^k(·; s^k) could denote the random yield per period from a resource k after extracting s^k units.

The value function V(s_0) of the bandit problem can be written as follows. Let X^k(s^k_t) denote the random reward with distribution F^k(·; s^k_t). Then the problem of finding an optimal allocation policy is the solution to the following intertemporal optimization problem:

  V(s_0) = sup_a E[ Σ_{t=0}^{∞} δ^t X^{a_t}(s^{a_t}_t) ].

The celebrated index theorem due to Gittins and Jones (1974) transforms the problem of finding the optimal policy into a collection of K stopping problems. For each alternative k, we calculate the following index m^k(s^k_t), which depends only on the state variable of alternative k:

  m^k(s^k_t) = sup_τ  E[ Σ_{u=t}^{τ} δ^u X^k(s^k_u) ] / E[ Σ_{u=t}^{τ} δ^u ],   (1)

where τ is a stopping time with respect to {s^k_t}. The idea is to find, for each k, the stopping time τ that results in the highest discounted expected return per discounted expected number of periods in operation. The Gittins index theorem then states that the optimal way of choosing arms in a bandit problem is to select in each period the arm with the highest Gittins index m^k(s^k_t).

Theorem 3 (Gittins-Jones (1974)). The optimal policy satisfies a_t = k for some k such that m^k(s^k_t) ≥ m^j(s^j_t) for all j ∈ {1, ..., K}.

An alternative formulation of the main theorem, based on dynamic programming, can be found in Whittle (1982).
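When the reward stream of an arm is deterministic, the sup over stopping times in (1) reduces to a maximum over horizons, which makes the index straightforward to compute. A minimal sketch (the reward sequence and discount factor are illustrative):

```python
# Gittins index (flow form, eq. (1)) for an arm whose remaining rewards are a
# known deterministic, finite sequence.  With no randomness, stopping times are
# just horizons tau = 1, 2, ..., so the sup is a finite maximum.

def gittins_index_deterministic(rewards, delta):
    """max over tau >= 1 of sum_{u < tau} delta^u * x_u / sum_{u < tau} delta^u."""
    best = float('-inf')
    num = den = 0.0
    for u, x in enumerate(rewards):
        num += delta ** u * x   # discounted reward up to horizon u + 1
        den += delta ** u       # discounted "number of periods in operation"
        best = max(best, num / den)
    return best

# Stopping right after the large second reward is optimal here:
# (1 + 0.9 * 5) / (1 + 0.9) ~ 2.8947, better than stopping earlier or later.
print(gittins_index_deterministic([1.0, 5.0, 0.0], 0.9))
```

The example illustrates why the index can exceed the immediate reward: the ratio rewards continuing through a low payoff to reach a high one, exactly the per-discounted-period rate that (1) maximizes.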
The basic idea is to find for every arm k a retirement value M^k_t, and then to choose in every period the arm with the highest retirement value. Formally, for every arm k and retirement value M, we can compute the optimal retirement policy given by:

  V^k(s^k_t, M) = max{ M, E[ X^k(s^k_t) + δ V^k(s^k_{t+1}, M) ] }.   (2)

The auxiliary decision problem given by (2) compares, in every period, the trade-off between continuing with the reward process generated by arm k and stopping with a fixed retirement value M. The index of arm k in state s^k_t is the highest retirement value at which the decision maker is just indifferent between continuing with arm k and retiring with M = M^k(s^k_t):

  M^k(s^k_t) = V^k(s^k_t, M^k(s^k_t)).

The resulting index M^k(s^k_t) equals the discounted sum of the flow index m^k(s^k_t), i.e. M^k(s^k_t) = m^k(s^k_t)/(1 − δ).

Proof of the Gittins Index Theorem

Since the rewards are bounded, it is easy to verify (e.g. by Blackwell’s sufficient conditions) that the following Bellman equation has a unique bounded solution:

  V(s_t, M) = max{ M, max_{k∈{1,...,K}} E[ x^k_t + δ V(s_{t+1}, M) ] }.

Consider also the Bellman equations for the following K auxiliary problems:

  V^k(s^k_t, M) = max{ M, E[ x^k_t + δ V^k(s^k_{t+1}, M) ] }.

These problems also have unique bounded solutions. Let

  M^k(s^k) = min{ M | V^k(s^k, M) = M }.

Finally, define

  V^{−k}(s_t, M) = max{ M, max_{l≠k} E[ x^l_t + δ V^{−k}(s^{−k}_{t+1}, M) ] }.

The following lemma plays an important role in characterizing the form of optimal policies.

Lemma 1. V(s_t, M) ≤ V^k(s^k_t, M) + V^{−k}(s^{−k}_t, M) − M.

Proof. Write s = (s^k, s^{−k}). Let V_0(s, M) = V^k_0(s^k, M) = V^{−k}_0(s^{−k}, M) = M, and for all s_t define

  V_{n+1}(s_t, M) = max{ M, E[ x^k_t + δ V_n(s_{t+1}, M) ], max_{l≠k} E[ x^l_t + δ V_n(s_{t+1}, M) ] }.

Similarly,

  V^k_{n+1}(s^k_t, M) = max{ M, E[ x^k_t + δ V^k_n(s^k_{t+1}, M) ] },

and

  V^{−k}_{n+1}(s^{−k}_t, M) = max{ M, max_{l≠k} E[ x^l_t + δ V^{−k}_n(s^{−k}_{t+1}, M) ] }.

These are the successive approximations in the value iteration algorithm for dynamic programming, and therefore the result follows if for all n and s,

  V_n(s_t, M) ≤ V^k_n(s^k_t, M) + V^{−k}_n(s^{−k}_t, M) − M.

This is trivially true for n = 0, so assume it holds for n; we show that it holds for n + 1. Observe first that V^k_n(s^k_t, M), V^{−k}_n(s^{−k}_t, M) ≥ M for all n, s, and that

  V^k_{n+1}(s^k_t, M) ≥ V^k_n(s^k_t, M)  and  V^{−k}_{n+1}(s^{−k}_t, M) ≥ V^{−k}_n(s^{−k}_t, M).

As a consequence,

  δ[ V^k_n(s^k, M) − M ] ≤ V^k_n(s^k, M) − M ≤ V^k_{n+1}(s^k, M) − M,

and

  δ[ V^{−k}_n(s^{−k}, M) − M ] ≤ V^{−k}_n(s^{−k}, M) − M ≤ V^{−k}_{n+1}(s^{−k}, M) − M.

But then, using the induction hypothesis and the definitions of the value functions, we have

  V_{n+1}(s_t, M) ≤ max{ M, E[ x^k_t + δ( V^k_n(s^k_t, M) + V^{−k}_n(s^{−k}_t, M) − M ) ], max_{l≠k} E[ x^l_t + δ( V^k_n(s^k_t, M) + V^{−k}_n(s^{−k}_t, M) − M ) ] }.

All three terms on the right-hand side are bounded from above by V^k_{n+1}(s^k_t, M) + V^{−k}_{n+1}(s^{−k}_t, M) − M. For the first term, this follows from V^k_{n+1}(s^k_t, M) ≥ M and V^{−k}_{n+1}(s^{−k}_t, M) ≥ M. The second term is bounded by E[ x^k_t + δ V^k_n(s^k_{t+1}, M) ] + V^{−k}_{n+1}(s^{−k}, M) − M; by the definition of V^k_{n+1}(s^k, M), this is bounded by V^k_{n+1}(s^k_t, M) + V^{−k}_{n+1}(s^{−k}, M) − M. The last term is bounded using exactly the same argument as the middle term. Therefore

  V_{n+1}(s_t, M) ≤ V^k_{n+1}(s^k_t, M) + V^{−k}_{n+1}(s^{−k}_t, M) − M,

which completes the induction step and the proof.

We use this lemma to prove that the optimal policy takes a write-off form: for each arm k there is a set of states S^k = { s^k | V^k(s^k, M) = M } such that the retirement option is chosen if and only if s^k ∈ S^k for all k. If s^k ∈ S^k for some k, then V^k(s^k, M) = M, so the lemma gives V(s, M) ≤ V^{−k}(s^{−k}, M); since the policy that ignores arm k is feasible, V(s, M) = V^{−k}(s^{−k}, M). Therefore an optimal policy exists that never uses k. If s^k ∈ S^k for all k, then it is clearly optimal to retire. Clearly, if there is a k such that s^k ∉ S^k, then V(s, M) ≥ V^k(s^k, M) > M and it is not optimal to retire.
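The retirement formulation suggests a direct numerical recipe for the index: solve the auxiliary problem V^k(s^k, M) by value iteration and bisect for the smallest M with V^k(s^k, M) = M. A minimal sketch for an arm with a known two-state Markov reward process (the chain, rewards and discount factor are all illustrative assumptions, not from the text):

```python
# Retirement-value formulation: V(s, M) = max{M, r(s) + delta * E[V(s', M)]},
# and the index of state s is the smallest M with V(s, M) = M.

def arm_value(r, P, delta, M, iters=2000):
    """Value iteration for the auxiliary stopping problem with retirement value M."""
    n = len(r)
    V = [M] * n
    for _ in range(iters):
        V = [max(M, r[s] + delta * sum(P[s][t] * V[t] for t in range(n)))
             for s in range(n)]
    return V

def retirement_index(r, P, delta, s, lo=-100.0, hi=100.0, tol=1e-8):
    """Bisect for the smallest M such that retiring is optimal in state s."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if arm_value(r, P, delta, mid)[s] > mid + 1e-10:
            lo = mid    # continuing is strictly better: index lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# Two-state arm: state 0 pays 1 and decays to state 1 w.p. 1/2; state 1 pays 0 forever.
r = [1.0, 0.0]
P = [[0.5, 0.5], [0.0, 1.0]]
delta = 0.9
print(retirement_index(r, P, delta, 0))  # ~ 10.0
```

For this arm the optimal stopping time in (1) retires as soon as the arm decays, giving flow index m = 1 and hence total index M = m/(1 − δ) = 10, which the bisection recovers.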
We write V(s) for the value function of the original bandit problem without retirement options. Since rewards are bounded, we have V(s, M) = V(s) for low enough M, and the problems coincide. From the definition it is clear that V(s, M) is non-decreasing in M.

Consider an arbitrary policy of retiring at a stopping time τ. The value from this policy is

  V(s, M; τ) = E[ Σ_{t<τ} δ^t x̂_t ] + M E[δ^τ].

Since each V(s, M; τ) is linear in M and V(s, M) = sup_τ V(s, M; τ), we know that V(s, M) is convex in M. Therefore it has a derivative almost everywhere, and by the envelope theorem,

  ∂V(s, M)/∂M = E[δ^{τ_M}],

where τ_M is the optimal retirement time for retirement value M. By the write-off property deduced from the lemma, retirement happens when s^k ∈ S^k for all k. Therefore, by the independence of the arms,

  E[δ^{τ_M}] = Π_k E[δ^{τ_{k,M}}],

where τ_{k,M} is the random time for reaching S^k under the optimal policy. We then have:

  ∂V(s, M)/∂M = Π_k ∂V^k(s^k, M)/∂M.

Notice that Π_k ∂V^k(s^k, M)/∂M is non-decreasing in M, since each V^k(s^k, M) is convex and hence ∂V^k(s^k, M)/∂M is non-decreasing in M. Furthermore, Π_k ∂V^k(s^k, M)/∂M is zero for M < −L and unity for M ≥ L (recall that L is the bound on the absolute value of the rewards). Therefore it has the properties of a distribution function. Integrating gives:

  V(s, M) = L − ∫_M^L Π_k [ ∂V^k(s^k, m)/∂m ] dm.

The remaining step is to verify that the Gittins index rule is optimal given this value function.

Lemma 2. We have:

  V(s_t, M) − E[ x^k_t + δ V(s_{t+1}, M) ]
    = [ V^k(s^k_t, M) − E[ x^k_t + δ V^k(s^k_{t+1}, M) ] ] Π_{l≠k} ∂V^l(s^l, M)/∂M
      + ∫_M^L ( V^k(s^k_t, m) − E[ x^k_t + δ V^k(s^k_{t+1}, m) ] ) d_m[ Π_{l≠k} ∂V^l(s^l, m)/∂m ]
    ≥ 0,   (3)

with equality if M^k = max_l M^l ≥ M.

Proof. Integrating by parts,

  V(s, M) = L − ∫_M^L Π_k [ ∂V^k(s^k, m)/∂m ] dm
          = V^k(s^k_t, M) Π_{l≠k} ∂V^l(s^l, M)/∂M + ∫_M^L V^k(s^k_t, m) d_m[ Π_{l≠k} ∂V^l(s^l, m)/∂m ].

The first equality in (3) then follows by applying this representation at s_t and at s_{t+1} (where only s^k changes) and subtracting E[ x^k_t + δ V(s_{t+1}, M) ] on both sides of the equation.
Since V^k(s^k_t, M) − E[ x^k_t + δ V^k(s^k_{t+1}, M) ] = 0 whenever M ≤ M^k, and d_m[ Π_{l≠k} ∂V^l(s^l, m)/∂m ] = 0 for m ≥ max_{l≠k} M^l, the expression in (3) is zero if M^k = max_l M^l ≥ M. This proves the optimality of the index rule.

Computing Gittins Indices for Examples

Pandora’s boxes

• Weitzman (1979, Econometrica) asks, in his Pandora’s boxes problem, how one should schedule search for a prize (only one prize can be claimed) when there are K boxes, each characterized by the value of its prize and the probability of finding a prize in the box, (v^k, p^k).
• By now we know how to answer this: compute the Gittins indices for all boxes and open them in decreasing order of the indices.
• Properties of the optimal sequence: suppose p^k v^k = p^l v^l for some k, l, i.e. if these were the only boxes, a myopic decision maker would be indifferent. Which should be opened first?
• Formulate the problem so that it fits the model above:
• Each arm starts in state s^k_0.
• If arm k is chosen, it gives an immediate expected return of p^k v^k.
• If k is chosen, then s^k_t = H for all subsequent t with probability p^k, and s^k_t = L for all subsequent t with probability 1 − p^k.
• x^k_t = v^k for all t if s^k_t = H, and x^k_t = 0 for all t if s^k_t = L.
• Observe that if the arm is tried once, all uncertainty about its future is resolved.
• δ < 1 is the discount factor.
• Compute the Gittins index as follows. Let V^k(s^k_0, M) be the value function of the auxiliary problem. Then V^k(L, M) = M for all M ≥ 0, and V^k(H, M) = v^k/(1 − δ) for all M ≤ v^k/(1 − δ). At the index, M satisfies

  M = p^k v^k + δ( p^k V^k(H, M) + (1 − p^k) V^k(L, M) )
    = p^k v^k + δ( p^k v^k/(1 − δ) + (1 − p^k) M ),

so that

  M = p^k v^k / ( (1 − δ)(1 − δ + p^k δ) ),

or

  (1 − δ) M = p^k v^k / (1 − δ + p^k δ).

• Observe that as δ → 1, (1 − δ)M → v^k.
• Observe also that

  (1 − δ)M − p^k v^k = δ(1 − p^k) p^k v^k / (1 − δ + p^k δ) > 0.

• Therefore there is always value to experimentation.
• How far does this generalize?
• Key property required for the index theorem to work: the outside option for each alternative must stay constant.
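The closed-form Pandora index above is easy to put into code; a minimal sketch (the numbers are illustrative) that also answers the tie-breaking question: among boxes with equal myopic payoff p^k v^k, the index favors the low-probability, high-prize box.

```python
# Flow Gittins index of a Pandora's box with prize v and success probability p:
# (1 - delta) * M = p * v / (1 - delta + p * delta), as derived above.

def pandora_flow_index(p, v, delta):
    return p * v / (1 - delta + p * delta)

delta = 0.9
m = pandora_flow_index(0.2, 10.0, delta)   # illustrative box
print(m, 0.2 * 10.0)   # the index strictly exceeds the myopic payoff p*v = 2

# Two boxes with the same expected prize p*v = 2: the riskier one
# (smaller p, larger v) has the higher index and is opened first.
print(pandora_flow_index(0.1, 20.0, delta) > pandora_flow_index(0.2, 10.0, delta))
```

Intuitively, the riskier box resolves more uncertainty per trial while the expected immediate payoff is the same, so experimentation value pushes it to the front of the search order.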
• Simple generalizations fail this property: choosing multiple arms simultaneously, or switching costs between arms.
• Generalizations that can be handled: arms branching into different arms.