Bayesian analysis

Bayesian analysis illustrates scientific reasoning where consistent reasoning (in the sense that two individuals with the same background knowledge, evidence, and perceived veracity of the evidence reach the same conclusion) is fundamental. In the next few pages we survey foundational ideas: Bayes' theorem (and its product and sum rules), maximum entropy probability assignment, the importance of loss functions, and Bayesian posterior simulation (direct and Markov chain Monte Carlo, or McMC for short). Then, we return to some of the examples utilized to illustrate classical inference strategies in the projections and conditional expectation functions notes.

1 Bayes' theorem and consistent reasoning

Consistency is the hallmark of scientific reasoning. When we consider probability assignment to events, whether they are marginal, conditional, or joint events, their assignments should be mutually consistent (match up with common sense). This is what Bayes' product and sum rules express formally.

The product rule says the product of a conditional likelihood (or distribution) and the marginal likelihood (or distribution) of the conditioning variable equals the joint likelihood (or distribution).

$$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$$

The sum rule says if we sum over all events related to one (set of) variable(s) (integrate out one variable or set of variables), we are left with the likelihood (or distribution) of the remaining (set of) variable(s).

$$p(x) = \sum_{i=1}^{n} p(x, y_i), \qquad p(y) = \sum_{j=1}^{n} p(x_j, y)$$

Bayes' theorem combines these ideas to describe consistent evaluation of evidence. That is, the posterior likelihood associated with a proposition, $\theta$, given the evidence, $y$, is equal to the product of the conditional likelihood of the evidence given the proposition and the marginal or prior likelihood of the conditioning variable (the proposition) scaled by the likelihood of the evidence.
Notice, we've simply rewritten the product rule where both sides are divided by $p(y)$, and $p(y)$ is simply the sum rule where $\theta$ is integrated out of the joint distribution, $p(\theta, y)$.

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

For Bayesian analyses, we often find it convenient to suppress the normalizing factor, $p(y)$, and write that the posterior distribution is proportional to the product of the sampling distribution or likelihood function and the prior distribution.

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \quad \text{or} \quad p(\theta \mid y) \propto \ell(\theta \mid y)\, p(\theta)$$

where $p(y \mid \theta)$ is the sampling distribution, $\ell(\theta \mid y)$ is the likelihood function, and $p(\theta)$ is the prior distribution for $\theta$. Bayes' theorem is the glue that holds consistent probability assignment together.

Example 1 Consider the following joint distribution:

$$\begin{array}{ll} p(y = y_1, \theta = \theta_1) = 0.1 & p(y = y_2, \theta = \theta_1) = 0.4 \\ p(y = y_1, \theta = \theta_2) = 0.2 & p(y = y_2, \theta = \theta_2) = 0.3 \end{array}$$

The sum rule yields the following marginal distributions:

$$\begin{array}{ll} p(y = y_1) = 0.3 & p(y = y_2) = 0.7 \end{array} \quad \text{and} \quad \begin{array}{ll} p(\theta = \theta_1) = 0.5 & p(\theta = \theta_2) = 0.5 \end{array}$$

The product rule gives the conditional distributions:

$$\begin{array}{lcc} & y_1 & y_2 \\ p(y \mid \theta = \theta_1) & 0.2 & 0.8 \\ p(y \mid \theta = \theta_2) & 0.4 & 0.6 \end{array} \quad \text{and} \quad \begin{array}{lcc} & \theta_1 & \theta_2 \\ p(\theta \mid y = y_1) & \tfrac{1}{3} & \tfrac{2}{3} \\ p(\theta \mid y = y_2) & \tfrac{4}{7} & \tfrac{3}{7} \end{array}$$

as common sense dictates.

2 Maximum entropy distributions

From the above, we see that evaluation of propositions given evidence is entirely determined by the sampling distribution, $p(y \mid \theta)$, or likelihood function, $\ell(\theta \mid y)$, and the prior distribution for the proposition, $p(\theta)$. Consequently, assignment of these probabilities is a matter of some considerable import. How do we proceed? Jaynes suggests we take account of our background knowledge, $\Im$, and evaluate the evidence in a manner consistent with both background knowledge and evidence. That is, the posterior likelihood (or distribution) is more aptly represented by

$$p(\theta \mid y, \Im) \propto p(y \mid \theta, \Im)\, p(\theta \mid \Im)$$

Now, we're looking for a mathematical statement of what we know and only what we know.
For this idea to be properly grounded requires a sense of complete ignorance (even though this may never represent our state of background knowledge). For instance, if we think that $\mu_1$ is more likely the mean or expected value than $\mu_2$, then we must not be completely ignorant about the location of the random variable, and consistency demands that our probability assignment reflect this knowledge. Further, if the order of events or outcomes is not exchangeable (if one permutation is more plausible than another), then the events are not seen as stochastically independent or identically distributed [footnote 1]. The mathematical statement of our background knowledge is defined in terms of Shannon's entropy (or sense of diffusion or uncertainty).

Footnote 1: Exchangeability is foundational for independent and identically distributed events (iid), a cornerstone of inference. However, exchangeability is often invoked in a conditional sense. That is, conditional on a set of variables exchangeability applies, a foundational idea of conditional expectations or regression.

2.1 Entropy

Shannon defines entropy as [footnote 2]

$$h = -\sum_{i=1}^{n} p_i \log(p_i)$$

for discrete events where

$$\sum_{i=1}^{n} p_i = 1$$

or differential entropy as

$$h = -\int p(x) \log p(x)\, dx$$

for events with continuous support where

$$\int p(x)\, dx = 1$$

Footnote 2: The axiomatic development for this measure of entropy can be found in Jaynes [2003] or Accounting and Causal Effects: Econometric Challenges, ch. 13.

2.2 Discrete examples

2.2.1 Discrete uniform

Example 2 Suppose we know only that there are three possible (exchangeable) events, $x_1 = 1$, $x_2 = 2$, and $x_3 = 3$. The maximum entropy probability assignment is found by solving the Lagrangian

$$\max_{p_i} \mathcal{L} = -\sum_{i=1}^{3} p_i \log p_i - (\lambda_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right)$$

First order conditions yield

$$p_i = e^{-\lambda_0} \quad \text{for } i = 1, 2, 3$$

and

$$\lambda_0 = \log 3$$

Hence, as expected, the maximum entropy probability assignment is a discrete uniform distribution with $p_i = \tfrac{1}{3}$ for $i = 1, 2, 3$.

2.2.2 Discrete nonuniform

Example 3 Now suppose we know a little more.
We know the mean is 2.5 [footnote 3]. The Lagrangian is now

$$\max_{p_i} \mathcal{L} = -\sum_{i=1}^{3} p_i \log p_i - (\lambda_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right) - \lambda_1\left(\sum_{i=1}^{3} p_i x_i - 2.5\right)$$

First order conditions yield

$$p_i = e^{-\lambda_0 - x_i \lambda_1} \quad \text{for } i = 1, 2, 3$$

and

$$\lambda_0 = 2.987, \qquad \lambda_1 = -0.834$$

The maximum entropy probability assignment is

$$p_1 = 0.116, \qquad p_2 = 0.268, \qquad p_3 = 0.616$$

Footnote 3: Clearly, if we knew the mean is 2 then we would assign the uniform discrete distribution above.

2.3 Normalization and partition functions

The above analysis suggests a general approach for assigning probabilities.

$$p(x_i) = \frac{1}{Z(\lambda_1, \ldots, \lambda_m)} \exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x_i)\right]$$

where $f_j(x_i)$ is a function of the random variable, $x_i$, reflecting what we know [footnote 4] and

$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x_k)\right]$$

is a normalizing factor, called a partition function. Probability assignment is completed by determining the Lagrange multipliers, $\lambda_j$, $j = 1, \ldots, m$, from the $m$ constraints.

Footnote 4: Since $\lambda_0$ simply ensures the probabilities sum to unity and the partition function assures this, we can define the partition function without $\lambda_0$.

Return to the example above. Since we know support and the mean, $n = 3$ and the function $f(x_i) = x_i$. This implies

$$Z(\lambda_1) = \sum_{i=1}^{3} \exp[-\lambda_1 x_i]$$

and

$$p_i = \frac{\exp[-\lambda_1 x_i]}{\sum_{k=1}^{3} \exp[-\lambda_1 x_k]}$$

where $x_1 = 1$, $x_2 = 2$, and $x_3 = 3$. Now, solving the constraint

$$\sum_{i=1}^{3} p_i x_i - 2.5 = 0$$

$$\sum_{i=1}^{3} \frac{\exp[-\lambda_1 x_i]}{\sum_{k=1}^{3} \exp[-\lambda_1 x_k]}\, x_i - 2.5 = 0$$

produces the multiplier, $\lambda_1 = -0.834$, and identifies the probability assignments

$$p_1 = 0.116, \qquad p_2 = 0.268, \qquad p_3 = 0.616$$

We utilize the analog to the above partition function approach next to address continuous density assignment.

2.4 Continuous examples

The partition function approach for continuous support involves density assignment

$$p(x) = \frac{\exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x)\right]}{\int_a^b \exp\left[-\sum_{j=1}^{m} \lambda_j f_j(x)\right] dx}$$

where support is between $a$ and $b$.

2.4.1 Continuous uniform

Example 4 Suppose we only know support is between zero and three.
The above partition function density assignment is simply (there are no constraints so there are no multipliers to identify)

$$p(x) = \frac{\exp[0]}{\int_0^3 \exp[0]\, dx} = \frac{1}{3}$$

Of course, this is the density function for a uniform with support from 0 to 3.

2.4.2 Known mean

Example 5 Continue the example above but with known mean equal to 2. The partition function density assignment is

$$p(x) = \frac{\exp[-\lambda_1 x]}{\int_0^3 \exp[-\lambda_1 x]\, dx}$$

and the mean constraint is

$$\int_0^3 x\, p(x)\, dx - 2 = 0$$

$$\int_0^3 x\, \frac{\exp[-\lambda_1 x]}{\int_0^3 \exp[-\lambda_1 x]\, dx}\, dx - 2 = 0$$

so that $\lambda_1 = -0.716$ and the density function is a truncated exponential distribution with support from 0 to 3.

$$p(x) = 0.0945 \exp[0.716 x], \qquad 0 \le x \le 3$$

2.4.3 Gaussian (normal) distribution and known variance

Example 6 Suppose we know the average dispersion or variance is $\sigma^2 = 100$. Then, a finite mean must exist, but even if we don't know it, we can find the maximum entropy density function for arbitrary mean, $\mu$. Using the partition function approach above we have

$$p(x) = \frac{\exp\left[-\lambda_2 (x - \mu)^2\right]}{\int_{-\infty}^{\infty} \exp\left[-\lambda_2 (x - \mu)^2\right] dx}$$

and the average dispersion constraint is

$$\int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx - 100 = 0$$

$$\int_{-\infty}^{\infty} (x - \mu)^2\, \frac{\exp\left[-\lambda_2 (x - \mu)^2\right]}{\int_{-\infty}^{\infty} \exp\left[-\lambda_2 (x - \mu)^2\right] dx}\, dx - 100 = 0$$

so that $\lambda_2 = \frac{1}{2\sigma^2}$ and the density function is

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] = \frac{1}{\sqrt{2\pi}\, 10} \exp\left[-\frac{(x - \mu)^2}{200}\right]$$

Of course, this is a Gaussian or normal density function. Strikingly, the Gaussian distribution has greatest entropy of any probability assignment with the same variance.

3 Loss functions

Bayesian analysis is combined with decision analysis via explicit recognition of a loss function. Relative to classical statistics this is a strength, as a loss function always exists but is sometimes not acknowledged. For simplicity and brevity we'll explore symmetric versions of a few loss functions [footnote 5]. Let $\hat{\theta}$ denote an estimator for $\theta$, $c(\hat{\theta}, \theta)$ denote a loss function, and $p(\theta \mid y)$ denote the posterior distribution for $\theta$ given evidence $y$. A sensible strategy for consistent decision making involves selecting the estimator, $\hat{\theta}$, to minimize the average or expected loss.
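$$\min_{\hat{\theta}} E[\text{loss}] = \min_{\hat{\theta}} E\left[c(\hat{\theta}, \theta)\right]$$

Briefly, for symmetric loss functions, we find the expected loss minimizing estimator is the posterior mean for a quadratic loss function, the median of the posterior distribution for a linear loss function, and the posterior mode for an all or nothing loss function.

Footnote 5: The more general case, including asymmetric loss, is addressed in chapter 4 of Accounting and Causal Effects: Econometric Challenges.

3.1 Quadratic loss

Let $c(\hat{\theta}, \theta) = \delta (\hat{\theta} - \theta)^2$ for $\delta > 0$, then (with support from $a$ to $b$)

$$\min_{\hat{\theta}} E\left[c(\hat{\theta}, \theta)\right] = \min_{\hat{\theta}} \int_a^b \delta (\hat{\theta} - \theta)^2\, p(\theta \mid y)\, d\theta$$

The first order conditions are

$$\frac{d}{d\hat{\theta}} \left\{ \int_a^b \delta (\hat{\theta} - \theta)^2\, p(\theta \mid y)\, d\theta \right\} = 0$$

Expansion of the integrand gives

$$\frac{d}{d\hat{\theta}} \left\{ \int_a^b \delta \left(\hat{\theta}^2 - 2\hat{\theta}\theta + \theta^2\right) p(\theta \mid y)\, d\theta \right\} = 0$$

or

$$\frac{d}{d\hat{\theta}} \left\{ \delta \hat{\theta}^2 - 2\delta \hat{\theta} \int_a^b \theta\, p(\theta \mid y)\, d\theta + \delta \int_a^b \theta^2\, p(\theta \mid y)\, d\theta \right\} = 0$$

Differentiation gives

$$2\delta \hat{\theta} - 2\delta \int_a^b \theta\, p(\theta \mid y)\, d\theta = 0$$

and the solution is

$$\hat{\theta} = \int_a^b \theta\, p(\theta \mid y)\, d\theta$$

where $\int_a^b \theta\, p(\theta \mid y)\, d\theta$ is the posterior mean.

3.2 Linear loss

Let $c(\hat{\theta}, \theta) = \delta\, |\hat{\theta} - \theta|$ for $\delta > 0$, then

$$\min_{\hat{\theta}} E\left[c(\hat{\theta}, \theta)\right] = \min_{\hat{\theta}} \int_a^b \delta\, |\hat{\theta} - \theta|\, p(\theta \mid y)\, d\theta$$

The first order conditions are

$$\frac{d}{d\hat{\theta}} \left\{ \int_a^b \delta\, |\hat{\theta} - \theta|\, p(\theta \mid y)\, d\theta \right\} = 0$$

Rearrangement of the integrand gives

$$\frac{d}{d\hat{\theta}} \left\{ \int_a^{\hat{\theta}} \delta (\hat{\theta} - \theta)\, p(\theta \mid y)\, d\theta - \int_{\hat{\theta}}^b \delta (\hat{\theta} - \theta)\, p(\theta \mid y)\, d\theta \right\} = 0$$

or

$$\frac{d}{d\hat{\theta}} \left\{ \delta \hat{\theta} F(\hat{\theta} \mid y) - \delta \int_a^{\hat{\theta}} \theta\, p(\theta \mid y)\, d\theta - \delta \hat{\theta} \left[1 - F(\hat{\theta} \mid y)\right] + \delta \int_{\hat{\theta}}^b \theta\, p(\theta \mid y)\, d\theta \right\} = 0$$

where $F(\hat{\theta} \mid y)$ is cumulative posterior probability evaluated at $\hat{\theta}$. Differentiation gives

$$\delta \left[ F(\hat{\theta} \mid y) + \hat{\theta}\, p(\hat{\theta} \mid y) - \hat{\theta}\, p(\hat{\theta} \mid y) - \left[1 - F(\hat{\theta} \mid y)\right] + \hat{\theta}\, p(\hat{\theta} \mid y) - \hat{\theta}\, p(\hat{\theta} \mid y) \right] = 0$$

or

$$\delta \left[ F(\hat{\theta} \mid y) - \left[1 - F(\hat{\theta} \mid y)\right] \right] = 0, \qquad \delta \left[ 2F(\hat{\theta} \mid y) - 1 \right] = 0$$

and the solution is

$$F(\hat{\theta} \mid y) = \frac{1}{2}$$

or the median of the posterior distribution.

3.3 All or nothing loss

Let

$$c(\hat{\theta}, \theta) = \begin{cases} \delta > 0 & \text{if } \hat{\theta} \neq \theta \\ 0 & \text{if } \hat{\theta} = \theta \end{cases}$$

Then, we want to assign $\hat{\theta}$ the maximum value of $p(\theta \mid y)$, or the posterior mode.

4 Analysis

The science of experimentation and evaluation of evidence is a deep and subtle art. Essential ingredients include careful framing of the problem (theory development), matching of data to be collected with the frame, and iterative model specification that complements the data and problem frame so that causal effects can be inferred. Consistent evaluation of evidence draws from the posterior distribution, which is proportional to the product of the likelihood and prior.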
When they combine to generate a recognizable posterior distribution, our work is made simpler. Next, we briefly discuss conjugate families which produce recognizable posterior distributions.

4.1 Conjugate families

Conjugate families arise when the likelihood times the prior produces a recognizable posterior kernel

$$p(\theta \mid y) \propto \ell(\theta \mid y)\, p(\theta)$$

where the kernel is the characteristic part of the distribution function that depends on the random variable(s) (the part excluding any normalizing constants). For example, the density function for a univariate Gaussian or normal is

$$\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]$$

and its kernel (for $\sigma$ known) is

$$\exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right]$$

as $\frac{1}{\sqrt{2\pi}\,\sigma}$ is a normalizing constant. Now, we discuss a few common conjugate family results [footnote 6] and uninformative prior results to connect with classical results.

Footnote 6: A more complete set of conjugate families is summarized in chapter 7 of Accounting and Causal Effects: Econometric Challenges.

4.1.1 Binomial - beta prior

A binomial likelihood with unknown success probability, $\theta$,

$$\ell(\theta \mid s; n) = \binom{n}{s} \theta^s (1 - \theta)^{n - s}, \qquad s = \sum_{i=1}^{n} y_i, \quad y_i \in \{0, 1\}$$

combines with a beta($\theta$; $a$, $b$) prior (i.e., with parameters $a$ and $b$)

$$p(\theta) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a - 1} (1 - \theta)^{b - 1}$$

to yield

$$p(\theta \mid y) \propto \theta^s (1 - \theta)^{n - s}\, \theta^{a - 1} (1 - \theta)^{b - 1} \propto \theta^{s + a - 1} (1 - \theta)^{n - s + b - 1}$$

which is the kernel of a beta distribution with parameters $(a + s)$ and $(b + n - s)$, beta($\theta \mid y$; $a + s$, $b + n - s$).

Uninformative priors Suppose priors for $\theta$ are uniform over the interval zero to one or, equivalently, beta(1, 1) [footnote 7]. Then, the likelihood determines the posterior distribution for $\theta$.

$$p(\theta \mid y) \propto \theta^s (1 - \theta)^{n - s}$$

which is beta($\theta \mid y$; $1 + s$, $1 + n - s$).
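The binomial-beta update above amounts to adding counts to the prior parameters. The following is a minimal sketch of that bookkeeping; the function names and the data (7 successes in 10 draws) are hypothetical and chosen only for illustration.

```python
def beta_binomial_update(a, b, s, n):
    """Posterior beta parameters for a beta(a, b) prior and s successes in n draws.

    Implements beta(a + s, b + n - s) from the kernel derivation above.
    """
    if not 0 <= s <= n:
        raise ValueError("need 0 <= s <= n")
    return a + s, b + n - s


def beta_mean(a, b):
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical evidence: s = 7 successes in n = 10 exchangeable draws.
uniform_posterior = beta_binomial_update(1.0, 1.0, s=7, n=10)   # uniform beta(1, 1) prior
informed_posterior = beta_binomial_update(2.0, 2.0, s=7, n=10)  # hypothetical beta(2, 2) prior

print("uniform prior  ->", uniform_posterior, beta_mean(*uniform_posterior))
print("informed prior ->", informed_posterior, beta_mean(*informed_posterior))
```

With the uniform prior the posterior mean is $(1 + s)/(2 + n)$, which stays close to the sample proportion $s/n$ and approaches it as $n$ grows.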
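4.1.2 Gaussian (unknown mean, known variance)

A single draw from a Gaussian likelihood with unknown mean, $\theta$, and known standard deviation, $\sigma$,

$$\ell(\theta \mid y, \sigma) \propto \exp\left[-\frac{1}{2} \frac{(y - \theta)^2}{\sigma^2}\right]$$

combines with a Gaussian or normal prior for $\theta$ given $\sigma^2$ with prior mean $\theta_0$ and prior variance $\sigma_0^2$

$$p(\theta \mid \sigma^2; \theta_0, \sigma_0^2) \propto \exp\left[-\frac{1}{2} \frac{(\theta - \theta_0)^2}{\sigma_0^2}\right]$$

or writing $\sigma_0^2 \equiv \sigma^2 / \kappa_0$, we have

$$p(\theta \mid \sigma^2; \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \frac{\kappa_0 (\theta - \theta_0)^2}{\sigma^2}\right]$$

to yield

$$p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2\sigma^2} \left\{ (y - \theta)^2 + \kappa_0 (\theta - \theta_0)^2 \right\}\right]$$

Expansion and rearrangement gives

$$p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2\sigma^2} \left\{ y^2 + \kappa_0 \theta_0^2 - 2\theta y + \theta^2 + \kappa_0 \left(\theta^2 - 2\theta_0 \theta\right) \right\}\right]$$

Any terms not involving $\theta$ are constants and can be discarded as they are absorbed on normalization of the posterior

$$p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2\sigma^2} \left\{ (\kappa_0 + 1)\theta^2 - 2\theta (\kappa_0 \theta_0 + y) \right\}\right]$$

Footnote 7: Some would utilize Jeffreys' prior, $p(\theta) \propto$ beta($\theta$; $\tfrac{1}{2}$, $\tfrac{1}{2}$), which is invariant to transformation, as the uninformative prior.

Completing the square (add and subtract $\frac{(\kappa_0 \theta_0 + y)^2}{\kappa_0 + 1}$), dropping the term subtracted (as it's all constants), and factoring out $(\kappa_0 + 1)$ gives

$$p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{\kappa_0 + 1}{2\sigma^2} \left(\theta - \frac{\kappa_0 \theta_0 + y}{\kappa_0 + 1}\right)^2\right]$$

Finally, we have

$$p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \frac{(\theta - \theta_1)^2}{\sigma_1^2}\right]$$

where

$$\theta_1 = \frac{\kappa_0 \theta_0 + y}{\kappa_0 + 1} = \frac{\frac{1}{\sigma_0^2}\theta_0 + \frac{1}{\sigma^2} y}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}} \quad \text{and} \quad \sigma_1^2 = \frac{\sigma^2}{\kappa_0 + 1} = \frac{1}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}}$$

or the posterior distribution of the mean given the data and priors is Gaussian or normal. Notice, the posterior mean, $\theta_1$, weights the data and prior beliefs by their relative precisions.

For a sample of $n$ exchangeable draws, the likelihood is

$$\ell(\theta \mid y, \sigma) \propto \exp\left[-\frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - \theta)^2}{\sigma^2}\right]$$

which combined with the above prior yields

$$p(\theta \mid y, \sigma, \theta_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2} \frac{(\theta - \theta_n)^2}{\sigma_n^2}\right]$$

where

$$\theta_n = \frac{\kappa_0 \theta_0 + n\bar{y}}{\kappa_0 + n} = \frac{\frac{1}{\sigma_0^2}\theta_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}, \qquad \sigma_n^2 = \frac{\sigma^2}{\kappa_0 + n} = \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}$$

and $\bar{y}$ is the sample mean. That is, the posterior distribution of the mean, $\theta$, given the data and priors is again Gaussian or normal and the posterior mean, $\theta_n$, weights the data and priors by their relative precisions.

Uninformative prior An uninformative prior for the mean, $\theta$, is the (improper) uniform, $p(\theta \mid \sigma^2) \propto 1$.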
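Hence, the likelihood

$$\ell(\theta \mid y, \sigma) \propto \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2\right]$$

$$\propto \exp\left[-\frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} y_i^2 - 2n\bar{y}\theta + n\theta^2 \right\}\right]$$

$$\propto \exp\left[-\frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 + n(\theta - \bar{y})^2 \right\}\right]$$

$$\propto \exp\left[-\frac{1}{2\sigma^2}\, n(\theta - \bar{y})^2\right]$$

determines the posterior

$$p(\theta \mid \sigma^2, y) \propto \exp\left[-\frac{n(\theta - \bar{y})^2}{2\sigma^2}\right]$$

which is the kernel for a Gaussian or $N\left(\theta \mid \sigma^2, y;\ \bar{y}, \frac{\sigma^2}{n}\right)$, the classical result.

4.1.3 Gaussian (known mean, unknown variance)

For a sample of $n$ exchangeable draws with known mean, $\mu$, and unknown variance, $\theta$, a Gaussian or normal likelihood

$$\ell(\theta \mid y, \mu) \propto \prod_{i=1}^{n} \theta^{-\frac{1}{2}} \exp\left[-\frac{1}{2} \frac{(y_i - \mu)^2}{\theta}\right]$$

combines with an inverted-gamma($a$, $b$) prior

$$p(\theta; a, b) \propto \theta^{-(a + 1)} \exp\left[-\frac{b}{\theta}\right]$$

to yield an inverted-gamma$\left(\frac{n + 2a}{2},\ b + \frac{1}{2}t\right)$ posterior distribution where

$$t = \sum_{i=1}^{n} (y_i - \mu)^2$$

Alternatively and conveniently (but equivalently), we could parameterize the prior as an inverted-chi square($\nu_0$, $\sigma_0^2$) [footnote 8]

$$p(\theta; \nu_0, \sigma_0^2) \propto \theta^{-\left(\frac{\nu_0}{2} + 1\right)} \exp\left[-\frac{\nu_0 \sigma_0^2}{2\theta}\right]$$

and combine with the above likelihood to yield

$$p(\theta \mid y) \propto \theta^{-\left(\frac{n + \nu_0}{2} + 1\right)} \exp\left[-\frac{1}{2\theta} \left(\nu_0 \sigma_0^2 + t\right)\right]$$

an inverted-chi square$\left(\nu_0 + n,\ \frac{\nu_0 \sigma_0^2 + t}{\nu_0 + n}\right)$.

Footnote 8: $\frac{\nu_0 \sigma_0^2}{X}$ is a scaled, inverted-chi square($\nu_0$, $\sigma_0^2$) with scale $\sigma_0^2$ where $X$ is a chi square($\nu_0$) random variable.

Uninformative prior An uninformative prior for scale is

$$p(\theta) \propto \theta^{-1}$$

Hence, the posterior distribution for scale is

$$p(\theta \mid y) \propto \theta^{-\left(\frac{n}{2} + 1\right)} \exp\left[-\frac{t}{2\theta}\right]$$

which is the kernel of an inverted-chi square$\left(\theta;\ n, \frac{t}{n}\right)$.

4.1.4 Gaussian (unknown mean, unknown variance)

For a sample of $n$ exchangeable draws, a normal likelihood with unknown mean, $\theta$, and unknown (but constant) variance, $\sigma^2$, is

$$\ell(\theta, \sigma^2 \mid y) \propto \prod_{i=1}^{n} \sigma^{-1} \exp\left[-\frac{1}{2} \frac{(y_i - \theta)^2}{\sigma^2}\right]$$

Expanding and rewriting the likelihood gives

$$\ell(\theta, \sigma^2 \mid y) \propto \sigma^{-n} \exp\left[-\sum_{i=1}^{n} \frac{y_i^2 - 2y_i\theta + \theta^2}{2\sigma^2}\right]$$

Adding and subtracting $\sum_{i=1}^{n} 2y_i\bar{y} = 2n\bar{y}^2$, we write

$$\ell(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left\{ \left(y_i^2 - 2y_i\bar{y} + \bar{y}^2\right) + \left(\bar{y}^2 - 2\bar{y}\theta + \theta^2\right) \right\}\right]$$

or

$$\ell(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \theta)^2 \right\}\right]$$

which can be rewritten as

$$\ell(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^2 \right\}\right]$$

where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2$.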
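The sum-of-squares decomposition used in the likelihood above, $\sum_i (y_i - \theta)^2 = (n - 1)s^2 + n(\bar{y} - \theta)^2$, can be checked numerically. The sketch below uses hypothetical data and a hypothetical trial value of $\theta$; the function names are illustrative only.

```python
import math


def sum_sq_about(theta, y):
    """Direct sum of squared deviations of the draws y about theta."""
    return sum((yi - theta) ** 2 for yi in y)


def decomposed_sum_sq(theta, y):
    """Same quantity written as (n - 1) s^2 + n (ybar - theta)^2."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)  # sample variance
    return (n - 1) * s2 + n * (ybar - theta) ** 2


y = [1.2, 0.7, 2.5, 1.9, 3.1]  # hypothetical sample
theta = 1.5                    # hypothetical trial value for the mean

direct = sum_sq_about(theta, y)
split = decomposed_sum_sq(theta, y)
print(direct, split, math.isclose(direct, split))
```

The cross term vanishes because deviations from the sample mean sum to zero, which is why the data enter the likelihood only through $\bar{y}$ and $s^2$.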
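The above likelihood combines with a Gaussian or normal($\theta \mid \sigma^2$; $\theta_0$, $\sigma^2/\kappa_0$) $\times$ inverted-chi square($\sigma^2$; $\nu_0$, $\sigma_0^2$) prior [footnote 9]

$$p(\theta \mid \sigma^2; \theta_0, \sigma^2/\kappa_0) \times p(\sigma^2; \nu_0, \sigma_0^2) \propto \sigma^{-1} \exp\left[-\frac{\kappa_0 (\theta - \theta_0)^2}{2\sigma^2}\right] \times \left(\sigma^2\right)^{-(\nu_0/2 + 1)} \exp\left[-\frac{\nu_0 \sigma_0^2}{2\sigma^2}\right]$$

$$\propto \left(\sigma^2\right)^{-\frac{\nu_0 + 3}{2}} \exp\left[-\frac{\nu_0 \sigma_0^2 + \kappa_0 (\theta - \theta_0)^2}{2\sigma^2}\right]$$

to yield a normal($\theta \mid \sigma^2$; $\theta_n$, $\sigma^2/\kappa_n$) $\times$ inverted-chi square($\sigma^2$; $\nu_n$, $\sigma_n^2$) joint posterior distribution [footnote 10] where

$$\kappa_n = \kappa_0 + n, \qquad \nu_n = \nu_0 + n, \qquad \nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2$$

Footnote 9: The prior for the mean, $\theta$, is conditional on the scale of the data, $\sigma^2$.

Footnote 10: The product of normal or Gaussian kernels produces a Gaussian kernel.

That is, the joint posterior is

$$p(\theta, \sigma^2 \mid y; \theta_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ \nu_0 \sigma_0^2 + (n - 1)s^2 + \kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 \right\}\right]$$

Completing the square The expression for the joint posterior is written by completing the square. Completing the weighted square for $\theta$ centered around

$$\theta_n = \frac{1}{\kappa_0 + n} (\kappa_0 \theta_0 + n\bar{y})$$

where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ gives

$$(\kappa_0 + n)(\theta - \theta_n)^2 = (\kappa_0 + n)\theta^2 - 2(\kappa_0 + n)\theta\theta_n + (\kappa_0 + n)\theta_n^2 = (\kappa_0 + n)\theta^2 - 2\theta(\kappa_0 \theta_0 + n\bar{y}) + (\kappa_0 + n)\theta_n^2$$

While expanding the exponent includes the square plus additional terms as follows

$$\kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 = \kappa_0 \left(\theta^2 - 2\theta\theta_0 + \theta_0^2\right) + n\left(\theta^2 - 2\theta\bar{y} + \bar{y}^2\right) = (\kappa_0 + n)\theta^2 - 2\theta(\kappa_0 \theta_0 + n\bar{y}) + \kappa_0 \theta_0^2 + n\bar{y}^2$$

Add and subtract $(\kappa_0 + n)\theta_n^2$ and simplify.

$$\kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 = (\kappa_0 + n)\theta^2 - 2(\kappa_0 + n)\theta\theta_n + (\kappa_0 + n)\theta_n^2 - (\kappa_0 + n)\theta_n^2 + \kappa_0 \theta_0^2 + n\bar{y}^2$$

$$= (\kappa_0 + n)(\theta - \theta_n)^2 + \frac{1}{\kappa_0 + n} \left\{ (\kappa_0 + n)\left(\kappa_0 \theta_0^2 + n\bar{y}^2\right) - (\kappa_0 \theta_0 + n\bar{y})^2 \right\}$$

Expand and simplify the last term.

$$\kappa_0 (\theta - \theta_0)^2 + n(\theta - \bar{y})^2 = (\kappa_0 + n)(\theta - \theta_n)^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2$$

Now, the joint posterior can be rewritten as

$$p(\theta, \sigma^2 \mid y; \theta_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{1}{2\sigma^2} \left\{ \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2 + (\kappa_0 + n)(\theta - \theta_n)^2 \right\}\right]$$

or

$$p(\theta, \sigma^2 \mid y; \theta_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0}{2} - 1} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right] \times \sigma^{-1} \exp\left[-\frac{1}{2\sigma^2} (\kappa_0 + n)(\theta - \theta_n)^2\right]$$

Hence, the conditional posterior distribution for the mean, $\theta$, given $\sigma^2$ is Gaussian or normal$\left(\theta \mid \sigma^2;\ \theta_n, \frac{\sigma^2}{\kappa_0 + n}\right)$.

Marginal posterior distributions We're often interested in the marginal posterior distributions, which are derived by integrating out the other parameter from the joint posterior.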
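The marginal posterior for the mean, $\theta$, on integrating out $\sigma^2$ is a noncentral, scaled-Student t$\left(\theta;\ \theta_n, \frac{\sigma_n^2}{\kappa_n}, \nu_n\right)$ [footnote 11] for the mean

$$p(\theta; \theta_n, \sigma_n^2, \kappa_n, \nu_n) \propto \left[\nu_n + \frac{\kappa_n (\theta - \theta_n)^2}{\sigma_n^2}\right]^{-\frac{\nu_n + 1}{2}}$$

or

$$p\left(\theta; \theta_n, \frac{\sigma_n^2}{\kappa_n}, \nu_n\right) \propto \left[1 + \frac{\kappa_n (\theta - \theta_n)^2}{\nu_n \sigma_n^2}\right]^{-\frac{\nu_n + 1}{2}}$$

and the marginal posterior for the variance, $\sigma^2$, is an inverted-chi square($\sigma^2$; $\nu_n$, $\sigma_n^2$) on integrating out $\theta$.

$$p(\sigma^2; \nu_n, \sigma_n^2) \propto \left(\sigma^2\right)^{-(\nu_n/2 + 1)} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right]$$

Footnote 11: The noncentral, scaled-Student t$\left(\theta;\ \theta_n, \sigma_n^2/\kappa_n, \nu_n\right)$ implies $\frac{\theta - \theta_n}{\sigma_n/\sqrt{\kappa_n}}$ has a standard Student-t($\nu_n$) distribution, $p(\theta \mid y) \propto \left[1 + \frac{1}{\nu_n}\left(\frac{\theta - \theta_n}{\sigma_n/\sqrt{\kappa_n}}\right)^2\right]^{-\frac{\nu_n + 1}{2}}$.

Derivation of the marginal posterior for the mean, $\theta$, is as follows. Let $z = \frac{A}{2\sigma^2}$ where

$$A = \nu_0 \sigma_0^2 + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^2 + (\kappa_0 + n)(\theta - \theta_n)^2 = \nu_n \sigma_n^2 + (\kappa_0 + n)(\theta - \theta_n)^2$$

The marginal posterior for the mean, $\theta$, integrates out $\sigma^2$ from the joint posterior

$$p(\theta \mid y) = \int_0^\infty p(\theta, \sigma^2 \mid y)\, d\sigma^2 \propto \int_0^\infty \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2$$

Utilizing $\sigma^2 = \frac{A}{2z}$ and $dz = -\frac{2z^2}{A}\, d\sigma^2$, or $d\sigma^2 = -\frac{A}{2z^2}\, dz$,

$$p(\theta \mid y) \propto \int_0^\infty \left(\frac{A}{2z}\right)^{-\frac{n + \nu_0 + 3}{2}} \frac{A}{2z^2} \exp[-z]\, dz$$

$$\propto \int_0^\infty \left(\frac{A}{2z}\right)^{-\frac{n + \nu_0 + 1}{2}} z^{-1} \exp[-z]\, dz$$

$$\propto A^{-\frac{n + \nu_0 + 1}{2}} \int_0^\infty z^{\frac{n + \nu_0 + 1}{2} - 1} \exp[-z]\, dz$$

The integral $\int_0^\infty z^{\frac{n + \nu_0 + 1}{2} - 1} \exp[-z]\, dz$ is a constant since it is the kernel of a gamma density and therefore can be ignored when deriving the kernel of the marginal posterior for the mean

$$p(\theta \mid y) \propto A^{-\frac{n + \nu_0 + 1}{2}} \propto \left[\nu_n \sigma_n^2 + (\kappa_0 + n)(\theta - \theta_n)^2\right]^{-\frac{n + \nu_0 + 1}{2}} \propto \left[1 + \frac{(\kappa_0 + n)(\theta - \theta_n)^2}{\nu_n \sigma_n^2}\right]^{-\frac{n + \nu_0 + 1}{2}}$$

which is the kernel for a noncentral, scaled Student t$\left(\theta;\ \theta_n, \frac{\sigma_n^2}{\kappa_0 + n}, n + \nu_0\right)$.

Derivation of the marginal posterior for $\sigma^2$ is somewhat simpler. Write the joint posterior in terms of the conditional posterior for the mean multiplied by the marginal posterior for $\sigma^2$.

$$p(\theta, \sigma^2 \mid y) = p(\theta \mid \sigma^2, y)\, p(\sigma^2 \mid y)$$

Marginalization of $\sigma^2$ is achieved by integrating out $\theta$.

$$p(\sigma^2 \mid y) = \int_{-\infty}^{\infty} p(\sigma^2 \mid y)\, p(\theta \mid \sigma^2, y)\, d\theta$$

Since only the conditional posterior involves $\theta$, the marginal posterior for $\sigma^2$ is immediate.

$$p(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 3}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 2}{2}} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right] \times \sigma^{-1} \exp\left[-\frac{(\kappa_0 + n)(\theta - \theta_n)^2}{2\sigma^2}\right]$$

Integrating out $\theta$ yields

$$p(\sigma^2 \mid y) \propto \left(\sigma^2\right)^{-\frac{n + \nu_0 + 2}{2}} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right] \int_{-\infty}^{\infty} \sigma^{-1} \exp\left[-\frac{(\kappa_0 + n)(\theta - \theta_n)^2}{2\sigma^2}\right] d\theta \propto \left(\sigma^2\right)^{-\left(\frac{\nu_n}{2} + 1\right)} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right]$$

which we recognize as the kernel of an inverted-chi square($\sigma^2$; $\nu_n$, $\sigma_n^2$).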
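The posterior parameters for the Gaussian case with unknown mean and variance follow mechanically from the update formulas for $\kappa_n$, $\nu_n$, $\theta_n$, and $\nu_n \sigma_n^2$. The following is a minimal sketch, with hypothetical function and argument names and hypothetical data, not a definitive implementation.

```python
def normal_inv_chi2_posterior(y, theta0, kappa0, nu0, sigma0_sq):
    """Posterior parameters (theta_n, kappa_n, nu_n, sigma_n^2) for the
    normal x inverted-chi square conjugate prior described above."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)  # equals (n - 1) s^2
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    theta_n = (kappa0 * theta0 + n * ybar) / kappa_n
    nu_sigma_sq = nu0 * sigma0_sq + ss + kappa0 * n / kappa_n * (theta0 - ybar) ** 2
    return theta_n, kappa_n, nu_n, nu_sigma_sq / nu_n


# Hypothetical prior and data, for illustration only.
theta_n, kappa_n, nu_n, sigma_n_sq = normal_inv_chi2_posterior(
    [1.0, 2.0, 3.0], theta0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0
)
# Location, squared scale, and degrees of freedom of the marginal Student t for the mean.
student_t_params = (theta_n, sigma_n_sq / kappa_n, nu_n)
print(theta_n, kappa_n, nu_n, sigma_n_sq, student_t_params)
```

Letting `kappa0` and `nu0` shrink toward zero moves `theta_n` toward the sample mean, in line with the weighting by relative precisions noted earlier.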
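Uninformative priors The case of uninformative priors is relatively straightforward. Since priors convey no information, the prior for the mean is uniform (proportional to a constant, $\kappa_0 \to 0$) and an uninformative prior for $\sigma^2$ has $\nu_0 \to 0$ degrees of freedom so that the joint prior is

$$p(\theta, \sigma^2) \propto \left(\sigma^2\right)^{-1}$$

The joint posterior is

$$p(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-(n/2 + 1)} \exp\left[-\frac{1}{2\sigma^2} \left\{ (n - 1)s^2 + n(\theta - \bar{y})^2 \right\}\right]$$

$$\propto \left(\sigma^2\right)^{-[(n - 1)/2 + 1]} \exp\left[-\frac{\nu_n \sigma_n^2}{2\sigma^2}\right] \times \sigma^{-1} \exp\left[-\frac{n}{2\sigma^2} (\theta - \bar{y})^2\right]$$

where

$$\nu_n \sigma_n^2 = (n - 1)s^2$$

The conditional posterior for $\theta$ given $\sigma^2$ is Gaussian$\left(\bar{y}, \frac{\sigma^2}{n}\right)$. And, the marginal posterior for $\theta$ is noncentral, scaled Student t$\left(\bar{y}, \frac{s^2}{n}, n - 1\right)$, the classical estimator.

Derivation of the marginal posterior proceeds as above. The joint posterior is

$$p(\theta, \sigma^2 \mid y) \propto \left(\sigma^2\right)^{-(n/2 + 1)} \exp\left[-\frac{1}{2\sigma^2} \left\{ (n - 1)s^2 + n(\theta - \bar{y})^2 \right\}\right]$$

Let $z = \frac{A}{2\sigma^2}$ where $A = (n - 1)s^2 + n(\theta - \bar{y})^2$. Now integrate $\sigma^2$ out of the joint posterior following the transformation of variables.

$$p(\theta \mid y) \propto \int_0^\infty \left(\sigma^2\right)^{-(n/2 + 1)} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2 \propto A^{-n/2} \int_0^\infty z^{n/2 - 1} e^{-z}\, dz$$

As before, the integral involves the kernel of a gamma density and therefore is a constant which can be ignored. Hence,

$$p(\theta \mid y) \propto A^{-n/2} \propto \left[(n - 1)s^2 + n(\theta - \bar{y})^2\right]^{-\frac{n}{2}} \propto \left[1 + \frac{n(\theta - \bar{y})^2}{(n - 1)s^2}\right]^{-\frac{n - 1 + 1}{2}}$$

which we recognize as the kernel of a noncentral, scaled Student t$\left(\theta;\ \bar{y}, \frac{s^2}{n}, n - 1\right)$.

4.1.5 Multivariate Gaussian (unknown mean, known variance)

More than one random variable (the multivariate case) with joint Gaussian or normal likelihood is analogous to the univariate case with Gaussian conjugate prior. Consider a vector of $k$ random variables (the sample is comprised of $n$ draws for each random variable) with unknown mean, $\theta$, and known variance, $\Sigma$. For $n$ exchangeable draws of the random vector (containing each of the $k$ random variables), the multivariate Gaussian likelihood is

$$\ell(\theta \mid y, \Sigma) \propto \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]$$

where superscript $T$ refers to transpose, $y_i$ and $\theta$ are $k$ length vectors, and $\Sigma$ is a $k \times k$ variance-covariance matrix.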
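A Gaussian prior for the mean vector, $\theta$, with prior mean, $\theta_0$, and prior variance, $\Sigma_0$, is

$$p(\theta \mid \Sigma; \theta_0, \Sigma_0) \propto \exp\left[-\frac{1}{2} (\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0)\right]$$

The product of the likelihood and prior yields the kernel of a multivariate posterior Gaussian distribution for the mean

$$p(\theta \mid \Sigma, y; \theta_0, \Sigma_0) \propto \exp\left[-\frac{1}{2} (\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0)\right] \times \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]$$

Completing the square Expanding terms in the exponent leads to

$$(\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0) + \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) = \theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \theta - 2\theta^T \left(\Sigma_0^{-1} \theta_0 + n\Sigma^{-1} \bar{y}\right) + \theta_0^T \Sigma_0^{-1} \theta_0 + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i$$

where $\bar{y}$ is the sample average. While completing the (weighted) square centered around

$$\bar{\theta} = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1} \left(\Sigma_0^{-1} \theta_0 + n\Sigma^{-1} \bar{y}\right)$$

leads to

$$\left(\theta - \bar{\theta}\right)^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \left(\theta - \bar{\theta}\right) = \theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \theta - 2\theta^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} + \bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta}$$

Thus, adding and subtracting $\bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta}$ in the exponent completes the square (with three extra terms).

$$(\theta - \theta_0)^T \Sigma_0^{-1} (\theta - \theta_0) + \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) = \left(\theta - \bar{\theta}\right)^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \left(\theta - \bar{\theta}\right) - \bar{\theta}^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \bar{\theta} + \theta_0^T \Sigma_0^{-1} \theta_0 + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i$$

Dropping constants (the last three extra terms unrelated to $\theta$) gives

$$p(\theta \mid \Sigma, y; \theta_0, \Sigma_0) \propto \exp\left[-\frac{1}{2} \left(\theta - \bar{\theta}\right)^T \left(\Sigma_0^{-1} + n\Sigma^{-1}\right) \left(\theta - \bar{\theta}\right)\right]$$

Hence, the posterior for the mean has expected value $\bar{\theta}$ and variance

$$Var[\theta \mid y, \Sigma, \theta_0, \Sigma_0] = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}$$

As in the univariate case, the data and prior beliefs are weighted by their relative precisions.

Uninformative priors Uninformative priors for $\theta$ are proportional to a constant. Hence, the likelihood determines the posterior

$$\ell(\theta \mid \Sigma, y) \propto \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]$$

Expanding the exponent and adding and subtracting $n\bar{y}^T \Sigma^{-1} \bar{y}$ (to complete the square) gives

$$\sum_{i=1}^{n} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta) = \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i - 2n\theta^T \Sigma^{-1} \bar{y} + n\theta^T \Sigma^{-1} \theta + n\bar{y}^T \Sigma^{-1} \bar{y} - n\bar{y}^T \Sigma^{-1} \bar{y}$$

$$= n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) + \sum_{i=1}^{n} y_i^T \Sigma^{-1} y_i - n\bar{y}^T \Sigma^{-1} \bar{y}$$

The latter two terms are constants, hence, the posterior kernel is

$$p(\theta \mid \Sigma, y) \propto \exp\left[-\frac{n}{2} (\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta)\right]$$

which is Gaussian or $N\left(\theta;\ \bar{y}, n^{-1}\Sigma\right)$, the classical result.

4.1.6 Multivariate Gaussian (unknown mean, unknown variance)

When both the mean, $\theta$, and variance, $\Sigma$, are unknown, the multivariate Gaussian case remains analogous to the univariate case.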
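Specifically, a Gaussian likelihood

$$\ell(\theta, \Sigma \mid y) \propto \prod_{i=1}^{n} |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} (y_i - \theta)^T \Sigma^{-1} (y_i - \theta)\right]$$

$$\propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2} \left\{ \sum_{i=1}^{n} (y_i - \bar{y})^T \Sigma^{-1} (y_i - \bar{y}) + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \right\}\right]$$

$$\propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \right\}\right]$$

where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^T \Sigma^{-1} (y_i - \bar{y})$, combines with a Gaussian-inverted Wishart prior

$$p\left(\theta \mid \Sigma; \theta_0, \tfrac{\Sigma}{\kappa_0}\right) \times p\left(\Sigma^{-1}; \nu, \Lambda\right) \propto |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0}{2} (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0)\right] \times |\Lambda|^{\frac{\nu}{2}}\, |\Sigma|^{-\frac{\nu + k + 1}{2}} \exp\left[-\frac{tr\left(\Lambda \Sigma^{-1}\right)}{2}\right]$$

where $tr(\cdot)$ is the trace of the matrix and $\nu$ is degrees of freedom, to produce

$$p(\theta, \Sigma \mid y) \propto |\Lambda|^{\frac{\nu}{2}}\, |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{tr\left(\Lambda \Sigma^{-1}\right)}{2}\right] \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) + \kappa_0 (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0) \right\}\right]$$

Completing the square Completing the square involves the matrix analog to the univariate unknown mean and variance case. Consider the exponent (in braces)

$$(n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) + \kappa_0 (\theta - \theta_0)^T \Sigma^{-1} (\theta - \theta_0)$$

$$= (n - 1)s^2 + n\bar{y}^T \Sigma^{-1} \bar{y} - 2n\theta^T \Sigma^{-1} \bar{y} + n\theta^T \Sigma^{-1} \theta + \kappa_0 \theta^T \Sigma^{-1} \theta - 2\kappa_0 \theta^T \Sigma^{-1} \theta_0 + \kappa_0 \theta_0^T \Sigma^{-1} \theta_0$$

$$= (n - 1)s^2 + (\kappa_0 + n)\theta^T \Sigma^{-1} \theta - 2\theta^T \Sigma^{-1} (\kappa_0 \theta_0 + n\bar{y}) + (\kappa_0 + n)\theta_n^T \Sigma^{-1} \theta_n - (\kappa_0 + n)\theta_n^T \Sigma^{-1} \theta_n + \kappa_0 \theta_0^T \Sigma^{-1} \theta_0 + n\bar{y}^T \Sigma^{-1} \bar{y}$$

$$= (n - 1)s^2 + (\kappa_0 + n)(\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n) + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^T \Sigma^{-1} (\theta_0 - \bar{y})$$

Hence, the joint posterior can be rewritten as

$$p(\theta, \Sigma \mid y) \propto |\Lambda|^{\frac{\nu}{2}}\, |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{tr\left(\Lambda \Sigma^{-1}\right)}{2}\right] \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} \left\{ (\kappa_0 + n)(\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n) + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^T \Sigma^{-1} (\theta_0 - \bar{y}) \right\}\right]$$

$$\propto |\Lambda|^{\frac{\nu}{2}}\, |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{1}{2} \left\{ tr\left(\Lambda \Sigma^{-1}\right) + (n - 1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\theta_0 - \bar{y})^T \Sigma^{-1} (\theta_0 - \bar{y}) \right\}\right] \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2} (\kappa_0 + n)(\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]$$

Inverted-Wishart kernel We wish to identify the exponent with Gaussian by inverted-Wishart kernels where the inverted-Wishart involves the trace of a square, symmetric matrix, call it $\Lambda_n$, multiplied by $\Sigma^{-1}$. To make this connection we utilize the following general results.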
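Since a quadratic form, say $x^T \Sigma^{-1} x$, is a scalar, it's equal to its trace,

$$x^T \Sigma^{-1} x = tr\left(x^T \Sigma^{-1} x\right)$$

Further, for conformable matrices $A$, $B$ and $C$, $D$,

$$tr(A) + tr(B) = tr(A + B) \quad \text{and} \quad tr(CD) = tr(DC)$$

We immediately have the results

$$tr\left(x^T x\right) = tr\left(x x^T\right) \quad \text{and} \quad tr\left(x^T \Sigma^{-1} x\right) = tr\left(\Sigma^{-1} x x^T\right) = tr\left(x x^T \Sigma^{-1}\right)$$

Therefore, the above joint posterior can be rewritten as a $N\left(\theta;\ \theta_n, (\kappa_0 + n)^{-1}\Sigma\right) \times$ inverted-Wishart$\left(\Sigma^{-1};\ \nu + n, \Lambda_n\right)$

$$p(\theta, \Sigma \mid y) \propto |\Lambda_n|^{\frac{\nu + n}{2}}\, |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{1}{2}\, tr\left(\Lambda_n \Sigma^{-1}\right)\right] \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0 + n}{2} (\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]$$

where

$$\theta_n = \frac{1}{\kappa_0 + n} (\kappa_0 \theta_0 + n\bar{y})$$

and

$$\Lambda_n = \Lambda + \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^T + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{y} - \theta_0)(\bar{y} - \theta_0)^T$$

Now, it's apparent the conditional posterior for $\theta$ given $\Sigma$ is $N\left(\theta_n, (\kappa_0 + n)^{-1}\Sigma\right)$

$$p(\theta \mid \Sigma, y) \propto \exp\left[-\frac{\kappa_0 + n}{2} (\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]$$

Marginal posterior distributions Integrating out the other parameter gives the marginal posteriors, a multivariate Student t for the mean, Student $t_k$ with location $\theta_n$, scale matrix $\left[(\kappa_0 + n)(\nu + n - k + 1)\right]^{-1} \Lambda_n$, and $\nu + n - k + 1$ degrees of freedom, and an inverted-Wishart for the variance, I-W$\left(\Sigma^{-1};\ \nu + n, \Lambda_n\right)$.

Marginalization of the mean derives from the following identities (see Box and Tiao [1973], p. 427, 441). Let $Z$ be an $m \times m$ positive definite symmetric matrix consisting of $\frac{1}{2}m(m + 1)$ distinct random variables $z_{ij}$ ($i, j = 1, \ldots, m$; $i \ge j$). And let $q > 0$ and $B$ be an $m \times m$ positive definite symmetric matrix. Then, the distribution of $z_{ij}$,

$$p(Z) \propto |Z|^{\frac{1}{2}q - 1} \exp\left[-\frac{1}{2}\, tr(ZB)\right], \qquad Z > 0$$

is a multivariate generalization of the $\chi^2$ distribution obtained by Wishart [1928]. Integrating out the distinct $z_{ij}$ produces the first identity.

$$\int_{Z > 0} |Z|^{\frac{1}{2}q - 1} \exp\left[-\frac{1}{2}\, tr(ZB)\right] dZ = |B|^{-\frac{1}{2}(q + m - 1)}\, 2^{\frac{1}{2}m(q + m - 1)}\, \Gamma_m\left(\frac{q + m - 1}{2}\right) \qquad (I.1)$$

where $\Gamma_p(b)$ is the generalized gamma function (Siegel [1935])

$$\Gamma_p(b) = \left[\Gamma\left(\tfrac{1}{2}\right)\right]^{\frac{1}{2}p(p - 1)} \prod_{\alpha = 1}^{p} \Gamma\left(b + \frac{\alpha - p}{2}\right), \qquad b > \frac{p - 1}{2}$$

and

$$\Gamma(z) = \int_0^\infty t^{z - 1} e^{-t}\, dt$$

or for integer $n$, $\Gamma(n) = (n - 1)!$

The second identity involves the relationship between determinants that allows us to express the above as a quadratic form. The identity is

$$|I_k - PQ| = |I_l - QP| \qquad (I.2)$$

for $P$ a $k \times l$ matrix and $Q$ a $l \times k$ matrix.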
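Identity (I.2) can be spot-checked numerically for a small case. The sketch below is illustrative only: the matrices $P$ (3 x 2) and $Q$ (2 x 3) are hypothetical, and the helper functions are written with plain Python lists so that no external library is assumed.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def identity_minus(m):
    """Return I - m for a square matrix m."""
    size = len(m)
    return [[(1.0 if i == j else 0.0) - m[i][j] for j in range(size)] for i in range(size)]


def det(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


# Hypothetical matrices: P is k x l = 3 x 2, Q is l x k = 2 x 3.
P = [[0.5, -1.0], [2.0, 0.3], [1.5, 0.7]]
Q = [[0.2, 0.4, -0.6], [1.0, -0.5, 0.1]]

lhs = det(identity_minus(matmul(P, Q)))  # |I_3 - PQ|
rhs = det(identity_minus(matmul(Q, P)))  # |I_2 - QP|
print(lhs, rhs)
```

The practical value of the identity is that it trades a $k \times k$ determinant for a lower-dimensional one; when $Q$ is a single row ($l = 1$) the right-hand side is a scalar, which is how a determinant collapses to a quadratic form.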
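If we transform the joint posterior to $p(\theta, \Sigma^{-1} \mid y)$, the above identities can be applied to marginalize the joint posterior. The key to transformation is

$$p(\theta, \Sigma^{-1} \mid y) = p(\theta, \Sigma \mid y) \left|\frac{\partial \Sigma}{\partial \Sigma^{-1}}\right|$$

where $\left|\frac{\partial \Sigma}{\partial \Sigma^{-1}}\right|$ is the (absolute value of the) determinant of the Jacobian or

$$\left|\frac{\partial \Sigma}{\partial \Sigma^{-1}}\right| = \left|\frac{\partial (\sigma_{11}, \sigma_{12}, \ldots, \sigma_{kk})}{\partial (\sigma^{11}, \sigma^{12}, \ldots, \sigma^{kk})}\right| = |\Sigma|^{k + 1}$$

with $\sigma_{ij}$ the elements of $\Sigma$ and $\sigma^{ij}$ the elements of $\Sigma^{-1}$. Hence,

$$p(\theta, \Sigma \mid y) \propto |\Sigma|^{-\frac{\nu + n + k + 1}{2}} \exp\left[-\frac{1}{2}\, tr\left(\Lambda_n \Sigma^{-1}\right)\right] \times |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0 + n}{2} (\theta - \theta_n)^T \Sigma^{-1} (\theta - \theta_n)\right]$$

$$\propto |\Sigma|^{-\frac{\nu + n + k}{2} - 1} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right]$$

where $S(\theta) = \Lambda_n + (\kappa_0 + n)(\theta - \theta_n)(\theta - \theta_n)^T$, can be rewritten

$$p(\theta, \Sigma^{-1} \mid y) \propto |\Sigma|^{-\frac{\nu + n + k + 2}{2}} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right] |\Sigma|^{\frac{2k + 2}{2}} \propto \left|\Sigma^{-1}\right|^{\frac{\nu + n - k}{2}} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right]$$

Now, applying the first identity yields

$$\int_{\Sigma^{-1} > 0} p(\theta, \Sigma^{-1} \mid y)\, d\Sigma^{-1} \propto |S(\theta)|^{-\frac{1}{2}(\nu + n + 1)} \propto \left|\Lambda_n + (\kappa_0 + n)(\theta - \theta_n)(\theta - \theta_n)^T\right|^{-\frac{1}{2}(\nu + n + 1)} \propto \left|I + (\kappa_0 + n)\Lambda_n^{-1} (\theta - \theta_n)(\theta - \theta_n)^T\right|^{-\frac{1}{2}(\nu + n + 1)}$$

And the second identity gives

$$p(\theta \mid y) \propto \left[1 + (\kappa_0 + n)(\theta - \theta_n)^T \Lambda_n^{-1} (\theta - \theta_n)\right]^{-\frac{1}{2}(\nu + n + 1)}$$

We recognize this is the kernel of a multivariate Student $t_k$ distribution with location $\theta_n$, scale matrix $\left[(\kappa_0 + n)(\nu + n - k + 1)\right]^{-1} \Lambda_n$, and $\nu + n - k + 1$ degrees of freedom.

Uninformative priors The joint uninformative prior (with a locally uniform prior for $\theta$) is

$$p(\theta, \Sigma) \propto |\Sigma|^{-\frac{k + 1}{2}}$$

and the joint posterior is

$$p(\theta, \Sigma \mid y) \propto |\Sigma|^{-\frac{k + 1}{2}}\, |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \right\}\right]$$

$$\propto |\Sigma|^{-\frac{n + k + 1}{2}} \exp\left[-\frac{1}{2} \left\{ (n - 1)s^2 + n(\bar{y} - \theta)^T \Sigma^{-1} (\bar{y} - \theta) \right\}\right] \propto |\Sigma|^{-\frac{n + k + 1}{2}} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right]$$

where now $S(\theta) = \sum_{i=1}^{n} (\bar{y} - y_i)(\bar{y} - y_i)^T + n(\bar{y} - \theta)(\bar{y} - \theta)^T$. Then, the conditional posterior for $\theta$ given $\Sigma$ is $N\left(\bar{y}, n^{-1}\Sigma\right)$

$$p(\theta \mid \Sigma, y) \propto \exp\left[-\frac{n}{2} (\theta - \bar{y})^T \Sigma^{-1} (\theta - \bar{y})\right]$$

The marginal posterior for $\theta$ is derived analogous to the above informed conjugate prior case. Rewriting the posterior in terms of $\Sigma^{-1}$ yields

$$p(\theta, \Sigma^{-1} \mid y) \propto |\Sigma|^{-\frac{n + k + 1}{2}} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right] |\Sigma|^{\frac{2k + 2}{2}} \propto \left|\Sigma^{-1}\right|^{\frac{n - k - 1}{2}} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right]$$

$$p(\theta \mid y) \propto \int_{\Sigma^{-1} > 0} p(\theta, \Sigma^{-1} \mid y)\, d\Sigma^{-1} \propto \int_{\Sigma^{-1} > 0} \left|\Sigma^{-1}\right|^{\frac{n - k - 1}{2}} \exp\left[-\frac{1}{2}\, tr\left(S(\theta) \Sigma^{-1}\right)\right] d\Sigma^{-1}$$

The first identity (I.1) produces

$$p(\theta \mid y) \propto |S(\theta)|^{-\frac{n}{2}} \propto \left|\sum_{i=1}^{n} (\bar{y} - y_i)(\bar{y} - y_i)^T + n(\bar{y} - \theta)(\bar{y} - \theta)^T\right|^{-\frac{n}{2}} \propto \left|I + n\left[\sum_{i=1}^{n} (\bar{y} - y_i)(\bar{y} - y_i)^T\right]^{-1} (\bar{y} - \theta)(\bar{y} - \theta)^T\right|^{-\frac{n}{2}}$$

The second identity (I.2) identifies the marginal posterior for $\theta$ as (multivariate) Student $t_k\left(\theta;\ \bar{y}, n^{-1}s^2, n - k\right)$

$$p(\theta \mid y) \propto \left[1 + n(\bar{y} - \theta)^T \left[(n - k)s^2\right]^{-1} (\bar{y} - \theta)\right]^{-\frac{n}{2}}$$

where, in this expression, $s^2$ denotes the $k \times k$ matrix defined by $(n - k)s^2 = \sum_{i=1}^{n} (\bar{y} - y_i)(\bar{y} - y_i)^T$.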
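The marginal posterior for the variance is I-W$\left(\Sigma^{-1};\ n, \Lambda_n\right)$ where now $\Lambda_n = \sum_{i=1}^{n} (\bar{y} - y_i)(\bar{y} - y_i)^T$.

4.1.7 Bayesian linear regression

Linear regression is the starting point for more general data modeling strategies, including nonlinear models. Hence, Bayesian linear regression is foundational. Suppose the data are generated by

$$y = X\beta + \varepsilon$$

where $X$ is an $n \times p$ full column rank matrix of (weakly exogenous) regressors, $\varepsilon \sim N\left(0, \sigma^2 I_n\right)$, and $E[\varepsilon \mid X] = 0$. Then, the sample conditional density is $y \mid X, \beta, \sigma^2 \sim N\left(X\beta, \sigma^2 I_n\right)$.

Known variance If the error variance, $\sigma^2 I_n$, is known and we have informed Gaussian priors for $\beta$ conditional on $\sigma^2$,

$$p(\beta \mid \sigma^2) \sim N\left(\beta_0, \sigma^2 V_0\right)$$

where we can think of $V_0 = \left(X_0^T X_0\right)^{-1}$ as if we had a prior sample $(y_0, X_0)$ such that

$$\beta_0 = \left(X_0^T X_0\right)^{-1} X_0^T y_0$$

then the conditional posterior for $\beta$ is

$$p(\beta \mid \sigma^2, y, X; \beta_0, V_0) \sim N\left(\bar{\beta}, V_{\bar{\beta}}\right)$$

where

$$\bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \beta_0 + X^T X \hat{\beta}\right), \qquad \hat{\beta} = \left(X^T X\right)^{-1} X^T y$$

and

$$V_{\bar{\beta}} = \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1}$$

The variance expression follows from rewriting the estimator

$$\bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \beta_0 + X^T X \hat{\beta}\right)$$

$$= \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \left(X_0^T X_0\right)^{-1} X_0^T y_0 + X^T X \left(X^T X\right)^{-1} X^T y\right)$$

$$= \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T y_0 + X^T y\right)$$

Since the DGP is

$$y_0 = X_0 \beta + \varepsilon_0, \quad \varepsilon_0 \sim N\left(0, \sigma^2 I_{n_0}\right), \qquad y = X\beta + \varepsilon, \quad \varepsilon \sim N\left(0, \sigma^2 I_n\right)$$

then

$$\bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 \beta + X_0^T \varepsilon_0 + X^T X \beta + X^T \varepsilon\right)$$

The conditional (and by iterated expectations, unconditional) expected value of the estimator is

$$E\left[\bar{\beta} \mid X, X_0\right] = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 + X^T X\right) \beta = \beta$$

Hence,

$$\bar{\beta} - E\left[\bar{\beta} \mid X, X_0\right] = \bar{\beta} - \beta = \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \varepsilon_0 + X^T \varepsilon\right)$$

so that

$$V_{\bar{\beta}} \equiv Var\left[\bar{\beta} \mid X, X_0\right] = E\left[\left(\bar{\beta} - \beta\right)\left(\bar{\beta} - \beta\right)^T \mid X, X_0\right]$$

$$= E\left[\left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \varepsilon_0 + X^T \varepsilon\right) \left(X_0^T \varepsilon_0 + X^T \varepsilon\right)^T \left(X_0^T X_0 + X^T X\right)^{-1} \mid X, X_0\right]$$

$$= E\left[\left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \varepsilon_0 \varepsilon_0^T X_0 + X^T \varepsilon \varepsilon_0^T X_0 + X_0^T \varepsilon_0 \varepsilon^T X + X^T \varepsilon \varepsilon^T X\right) \left(X_0^T X_0 + X^T X\right)^{-1} \mid X, X_0\right]$$

$$= \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T \sigma^2 I X_0 + X^T \sigma^2 I X\right) \left(X_0^T X_0 + X^T X\right)^{-1}$$

$$= \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1} \left(X_0^T X_0 + X^T X\right) \left(X_0^T X_0 + X^T X\right)^{-1} = \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1}$$

Now, let's backtrack and derive the conditional posterior as the product of conditional priors and the likelihood function.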
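The likelihood function for known variance is

$$\ell(\beta \mid \sigma^2, y, X) \propto \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\}$$

Conditional Gaussian priors are

$$p(\beta \mid \sigma^2) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta - \beta_0)^T V_0^{-1} (\beta - \beta_0) \right\}$$

The conditional posterior is the product of the prior and likelihood

$$\begin{aligned}
p(\beta \mid \sigma^2, y, X) &\propto \exp\left\{ -\frac{1}{2\sigma^2} \left[ (y - X\beta)^T (y - X\beta) + (\beta - \beta_0)^T V_0^{-1} (\beta - \beta_0) \right] \right\} \\
&= \exp\left\{ -\frac{1}{2\sigma^2} \left[ y^T y - 2 y^T X \beta + \beta^T X^T X \beta + \beta^T X_0^T X_0 \beta - 2 \beta_0^T X_0^T X_0 \beta + \beta_0^T X_0^T X_0 \beta_0 \right] \right\}
\end{aligned}$$

The first and last terms in the exponent do not involve $\beta$ (are constants) and can be ignored as they are absorbed through normalization. This leaves

$$\begin{aligned}
p(\beta \mid \sigma^2, y, X) &\propto \exp\left\{ -\frac{1}{2\sigma^2} \left[ -2 y^T X \beta + \beta^T X^T X \beta + \beta^T X_0^T X_0 \beta - 2 \beta_0^T X_0^T X_0 \beta \right] \right\} \\
&= \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T (X_0^T X_0 + X^T X) \beta - 2 (y^T X + \beta_0^T X_0^T X_0) \beta \right] \right\}
\end{aligned}$$

which can be recognized as the expansion of the conditional posterior claimed above.

$$\begin{aligned}
p(\beta \mid \sigma^2, y, X) \sim N(\bar{\beta}, V_\beta) &\propto \exp\left\{ -\frac{1}{2} (\beta - \bar{\beta})^T V_\beta^{-1} (\beta - \bar{\beta}) \right\} \\
&= \exp\left\{ -\frac{1}{2\sigma^2} (\beta - \bar{\beta})^T (X_0^T X_0 + X^T X)(\beta - \bar{\beta}) \right\} \\
&= \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T (X_0^T X_0 + X^T X) \beta - 2 \bar{\beta}^T (X_0^T X_0 + X^T X) \beta + \bar{\beta}^T (X_0^T X_0 + X^T X) \bar{\beta} \right] \right\} \\
&= \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T (X_0^T X_0 + X^T X) \beta - 2 (X_0^T X_0 \beta_0 + X^T y)^T \beta + \bar{\beta}^T (X_0^T X_0 + X^T X) \bar{\beta} \right] \right\}
\end{aligned}$$

The last term in the exponent is all constants (does not involve $\beta$) so it's absorbed through normalization and disregarded for comparison of kernels. Hence,

$$\begin{aligned}
p(\beta \mid \sigma^2, y, X) &\propto \exp\left\{ -\frac{1}{2} (\beta - \bar{\beta})^T V_\beta^{-1} (\beta - \bar{\beta}) \right\} \\
&\propto \exp\left\{ -\frac{1}{2\sigma^2} \left[ \beta^T (X_0^T X_0 + X^T X) \beta - 2 (y^T X + \beta_0^T X_0^T X_0) \beta \right] \right\}
\end{aligned}$$

as claimed.

Uninformative priors

If the prior for $\beta$ is uniformly distributed conditional on known variance, $\sigma^2$, $p(\beta \mid \sigma^2) \propto 1$, then it's as if $X_0^T X_0 \rightarrow 0$ and the posterior for $\beta$ is

$$p(\beta \mid \sigma^2, y, X) \sim N\left( \hat{\beta}, \sigma^2 (X^T X)^{-1} \right)$$

equivalent to the classical parameter estimators.

Unknown variance

In the usual case where the variance as well as the regression coefficients, $\beta$, are unknown, the likelihood function can be expressed as

$$\ell(\beta, \sigma^2 \mid y, X) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\}$$

Rewriting gives

$$\ell(\beta, \sigma^2 \mid y, X) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \varepsilon^T \varepsilon \right\}$$

since $\varepsilon = y - X\beta$. The estimated model is $y = Xb + e$, therefore $X\beta + \varepsilon = Xb + e$ where $b = (X^T X)^{-1} X^T y$ and $e = y - Xb$ are estimates of $\beta$ and $\varepsilon$, respectively.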
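This implies $\varepsilon = e - X(\beta - b)$ and

$$\ell(\beta, \sigma^2 \mid y, X) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ e^T e - 2 (\beta - b)^T X^T e + (\beta - b)^T X^T X (\beta - b) \right] \right\}$$

Since $X^T e = 0$ by construction, this simplifies as

$$\ell(\beta, \sigma^2 \mid y, X) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ e^T e + (\beta - b)^T X^T X (\beta - b) \right] \right\}$$

or

$$\ell(\beta, \sigma^2 \mid y, X) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right] \right\}$$

where $s^2 = \frac{1}{n - p} e^T e$.[12]

[12] Notice, the univariate Gaussian case is subsumed by linear regression where $X = \iota$ (a vector of ones). Then, the likelihood as described earlier,
$$\ell(\beta, \sigma^2 \mid y, X) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right] \right\}$$
becomes
$$\ell(\beta = \mu, \sigma^2 \mid y, X = \iota) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - 1) s^2 + n (\mu - \bar{y})^2 \right] \right\}$$
where $\beta = \mu$, $b = (X^T X)^{-1} X^T y = \bar{y}$, $p = 1$, and $X^T X = n$.

The conjugate prior for linear regression is the Gaussian$(\beta \mid \sigma^2; \beta_0, \sigma^2 \Lambda_0^{-1})$-inverse chi square$(\sigma^2; \nu_0, \sigma_0^2)$

$$p(\beta \mid \sigma^2; \beta_0, \Lambda_0) \, p(\sigma^2; \nu_0, \sigma_0^2) \propto \sigma^{-p} \exp\left\{ -\frac{(\beta - \beta_0)^T \Lambda_0 (\beta - \beta_0)}{2\sigma^2} \right\} \times (\sigma^2)^{-(\nu_0/2 + 1)} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right\}$$

Combining the prior with the likelihood gives a joint Gaussian$(\bar{\beta}, \sigma^2 \Lambda_n^{-1})$-inverse chi square$(\nu_0 + n, \sigma_n^2)$ posterior

$$\begin{aligned}
p(\beta, \sigma^2 \mid y, X; \beta_0, \Lambda_0, \nu_0, \sigma_0^2) \propto{} & \sigma^{-n} \exp\left\{ -\frac{(n - p) s^2}{2\sigma^2} \right\} \exp\left\{ -\frac{(\beta - b)^T X^T X (\beta - b)}{2\sigma^2} \right\} \\
& \times \sigma^{-p} \exp\left\{ -\frac{(\beta - \beta_0)^T \Lambda_0 (\beta - \beta_0)}{2\sigma^2} \right\} \\
& \times (\sigma^2)^{-(\nu_0/2 + 1)} \exp\left\{ -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right\}
\end{aligned}$$

Collecting terms and rewriting, we have

$$p(\beta, \sigma^2 \mid y, X; \beta_0, \Lambda_0, \nu_0, \sigma_0^2) \propto (\sigma^2)^{-[(\nu_0 + n)/2 + 1]} \exp\left\{ -\frac{\nu_n \sigma_n^2}{2\sigma^2} \right\} \times \sigma^{-p} \exp\left\{ -\frac{1}{2\sigma^2} (\beta - \bar{\beta})^T \Lambda_n (\beta - \bar{\beta}) \right\}$$

where

$$\bar{\beta} = (\Lambda_0 + X^T X)^{-1} (\Lambda_0 \beta_0 + X^T X b), \qquad \Lambda_n = \Lambda_0 + X^T X$$

and

$$\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - p) s^2 + (\beta_0 - \bar{\beta})^T \Lambda_0 (\beta_0 - \bar{\beta}) + (b - \bar{\beta})^T X^T X (b - \bar{\beta})$$

where $\nu_n = \nu_0 + n$. The conditional posterior of $\beta$ given $\sigma^2$ is Gaussian$(\bar{\beta}, \sigma^2 \Lambda_n^{-1})$.

Completing the square

The derivation of the above joint posterior follows from the matrix version of completing the square where $\Lambda_0$ and $X^T X$ are square, symmetric, full rank $p \times p$ matrices. The exponents from the prior for the mean and likelihood are

$$(\beta - \beta_0)^T \Lambda_0 (\beta - \beta_0) + (\beta - b)^T X^T X (\beta - b)$$

Expanding and rearranging gives

$$\beta^T (\Lambda_0 + X^T X) \beta - 2 (\Lambda_0 \beta_0 + X^T X b)^T \beta + \beta_0^T \Lambda_0 \beta_0 + b^T X^T X b \qquad (1)$$

The latter two terms are constants not involving $\beta$ (and can be ignored when writing the kernel for the conditional posterior) which we'll add to when we complete the square.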
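Now, write out the square centered around $\bar{\beta}$

$$(\beta - \bar{\beta})^T (\Lambda_0 + X^T X)(\beta - \bar{\beta}) = \beta^T (\Lambda_0 + X^T X) \beta - 2 \bar{\beta}^T (\Lambda_0 + X^T X) \beta + \bar{\beta}^T (\Lambda_0 + X^T X) \bar{\beta}$$

Substitute for $\bar{\beta}$ in the second term on the right hand side and the first two terms are identical to the two terms in equation (1). Hence, the exponents from the prior for the mean and likelihood in (1) are equal to

$$(\beta - \bar{\beta})^T (\Lambda_0 + X^T X)(\beta - \bar{\beta}) - \bar{\beta}^T (\Lambda_0 + X^T X) \bar{\beta} + \beta_0^T \Lambda_0 \beta_0 + b^T X^T X b$$

which can be rewritten as

$$(\beta - \bar{\beta})^T (\Lambda_0 + X^T X)(\beta - \bar{\beta}) + (\beta_0 - \bar{\beta})^T \Lambda_0 (\beta_0 - \bar{\beta}) + (b - \bar{\beta})^T X^T X (b - \bar{\beta})$$

or (in the form analogous to the univariate Gaussian case)

$$(\beta - \bar{\beta})^T (\Lambda_0 + X^T X)(\beta - \bar{\beta}) + (\beta_0 - b)^T \left( \Lambda_1 \Lambda_n^{-1} \Lambda_0 \Lambda_n^{-1} \Lambda_1 + \Lambda_0 \Lambda_n^{-1} \Lambda_1 \Lambda_n^{-1} \Lambda_0 \right) (\beta_0 - b)$$

where $\Lambda_1 = X^T X$.

Stacked regression

Bayesian linear regression with conjugate priors works as if we have a prior sample $\{X_0, y_0\}$, $\Lambda_0 = X_0^T X_0$, and initial estimates

$$\beta_0 = (X_0^T X_0)^{-1} X_0^T y_0$$

Then, we combine this initial "evidence" with new evidence to update our beliefs in the form of the posterior. Not surprisingly, the posterior mean is a weighted average of the two "samples" where the weights are based on the relative precision of the two "samples".

Marginal posterior distributions

The marginal posterior for $\beta$ on integrating out $\sigma^2$ is noncentral, scaled multivariate Student $t_p(\bar{\beta}, \sigma_n^2 \Lambda_n^{-1}, \nu_0 + n)$

$$p(\beta \mid y, X) \propto \left[ \nu_n \sigma_n^2 + (\beta - \bar{\beta})^T \Lambda_n (\beta - \bar{\beta}) \right]^{-\frac{\nu_0 + n + p}{2}} \propto \left[ 1 + \frac{(\beta - \bar{\beta})^T \Lambda_n (\beta - \bar{\beta})}{\nu_n \sigma_n^2} \right]^{-\frac{\nu_0 + n + p}{2}}$$

where $\Lambda_n = \Lambda_0 + X^T X$. This result corresponds with the univariate Gaussian case and is derived analogously by transformation of variables where $z = \frac{A}{2\sigma^2}$ and $A = \nu_n \sigma_n^2 + (\beta - \bar{\beta})^T \Lambda_n (\beta - \bar{\beta})$. The marginal posterior for $\sigma^2$ is inverted-chi square$(\sigma^2; \nu_n, \sigma_n^2)$.

Derivation of the marginal posterior for $\beta$ is as follows.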
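$$p(\beta \mid y) = \int_0^\infty p(\beta, \sigma^2 \mid y) \, d\sigma^2 = \int_0^\infty (\sigma^2)^{-\frac{n + \nu_0 + p + 2}{2}} \exp\left[ -\frac{A}{2\sigma^2} \right] d\sigma^2$$

Utilizing $\sigma^2 = \frac{A}{2z}$ and $dz = -\frac{2z^2}{A} d\sigma^2$ or $d\sigma^2 = -\frac{A}{2z^2} dz$ ($-1$ and $2$ are constants and can be ignored when deriving the kernel),

$$\begin{aligned}
p(\beta \mid y) &\propto \int_0^\infty \left( \frac{A}{2z} \right)^{-\frac{n + \nu_0 + p + 2}{2}} \frac{A}{2z^2} \exp[-z] \, dz \\
&\propto A^{-\frac{n + \nu_0 + p}{2}} \int_0^\infty z^{\frac{n + \nu_0 + p}{2} - 1} \exp[-z] \, dz
\end{aligned}$$

The integral $\int_0^\infty z^{\frac{n + \nu_0 + p}{2} - 1} \exp[-z] \, dz$ is a constant since it is the kernel of a gamma density and therefore can be ignored when deriving the kernel of the marginal posterior for the mean

$$\begin{aligned}
p(\beta \mid y) &\propto A^{-\frac{n + \nu_0 + p}{2}} \\
&\propto \left[ \nu_n \sigma_n^2 + (\beta - \bar{\beta})^T \Lambda_n (\beta - \bar{\beta}) \right]^{-\frac{n + \nu_0 + p}{2}} \\
&\propto \left[ 1 + \frac{(\beta - \bar{\beta})^T \Lambda_n (\beta - \bar{\beta})}{\nu_n \sigma_n^2} \right]^{-\frac{n + \nu_0 + p}{2}}
\end{aligned}$$

the kernel for a noncentral, scaled (multivariate) Student $t_p(\beta; \bar{\beta}, \sigma_n^2 \Lambda_n^{-1}, n + \nu_0)$.

Uninformative priors

Again, the case of uninformative priors is relatively straightforward. Since priors convey no information, the prior for the mean is uniform (proportional to a constant, $\Lambda_0 \rightarrow 0$) and the prior for $\sigma^2$ has $\nu_0 \rightarrow 0$ degrees of freedom so that the joint prior is $p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}$. The joint posterior is

$$p(\beta, \sigma^2 \mid y) \propto (\sigma^2)^{-[n/2 + 1]} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\}$$

Since $y = Xb + e$ where $b = (X^T X)^{-1} X^T y$, the joint posterior can be written

$$p(\beta, \sigma^2 \mid y) \propto (\sigma^2)^{-[n/2 + 1]} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (n - p) s^2 + (\beta - b)^T X^T X (\beta - b) \right] \right\}$$

Or, factoring into the conditional posterior for $\beta$ and marginal for $\sigma^2$, we have

$$\begin{aligned}
p(\beta, \sigma^2 \mid y) &\propto p(\sigma^2 \mid y) \, p(\beta \mid \sigma^2, y) \\
&\propto (\sigma^2)^{-[(n - p)/2 + 1]} \exp\left\{ -\frac{\nu_n \sigma_n^2}{2\sigma^2} \right\} \times \sigma^{-p} \exp\left\{ -\frac{1}{2\sigma^2} (\beta - b)^T X^T X (\beta - b) \right\}
\end{aligned}$$

where $\nu_n \sigma_n^2 = (n - p) s^2$. Hence, the conditional posterior for $\beta$ given $\sigma^2$ is Gaussian$\left( b, \sigma^2 (X^T X)^{-1} \right)$. The marginal posterior for $\beta$ is multivariate Student $t_p\left( \beta; b, s^2 (X^T X)^{-1}, n - p \right)$, the classical estimator.

Derivation of the marginal posterior for $\beta$ is analogous to that above. Let $z = \frac{A}{2\sigma^2}$ where $A = (n - p) s^2 + (\beta - b)^T X^T X (\beta - b)$. Integrating $\sigma^2$ out of the joint posterior produces the marginal posterior for $\beta$.

$$p(\beta \mid y) \propto \int p(\beta, \sigma^2 \mid y) \, d\sigma^2 \propto \int (\sigma^2)^{-\frac{n + 2}{2}} \exp\left[ -\frac{A}{2\sigma^2} \right] d\sigma^2$$

Substitution yields

$$p(\beta \mid y) \propto \int \left( \frac{A}{2z} \right)^{-\frac{n + 2}{2}} \frac{A}{2z^2} \exp[-z] \, dz \propto A^{-\frac{n}{2}} \int z^{\frac{n}{2} - 1} \exp[-z] \, dz$$

As before, the integral involves the kernel of a gamma distribution, a constant which can be ignored.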
Therefore, we have n p ( | y) A 2 n2 T (n p) s2 + ( b) X T X ( b) n2 T ( b) X T X ( b) 1+ (n p) s2 1 which is multivariate Student tp ; b, s2 X T X ,n p . 4.1.8 Bayesian linear regression with general error structure Now, we consider Bayesian regression with a more general error structure. That is, the DGP is y = X + , ( | X) N (0, ) First, we consider the known variance case, then take up the unknown variance case. Known variance If the error variance, , is known, we simply repeat the Bayesian linear regression approach discussed above for the known variance case after transforming all variables via the Cholesky decomposition of . Let = T and Then, the DGP is 1 1 1 = T 1 y = 1 X + 1 where 1 N (0, In ) 1 With informed priors for , p ( | ) N ( 0 , ) where it is as if = X0T 1 , 0 X0 the posterior distribution for conditional on is p ( | , y, X; 0 , ) N , V where 1 T 1 T 1 T 1 X0T 1 X + X X X X + X X 0 0 0 0 0 0 1 1 T 1 T 1 = 1 + X X + X X 0 = = X T 1 X 1 X T 1 y 32 and V T 1 X0T 1 X 0 X0 + X 1 T 1 = 1 X +X = 1 It is instructive to once again backtrack to develop the conditional posterior distribution. The likelihood function for known variance is 1 T 1 ( | , y, X) exp (y X) (y X) 2 Conditional Gaussian priors are 1 T p ( | ) exp 2 ( 0 ) V1 ( 0 ) 2 The conditional posterior is the product of the prior and likelihood T 1 (y X) 1 (y X) 2 p | , y, X exp 2 T 2 + ( 0 ) V1 ( 0 ) T 1 1 y y 2y T 1 X + T X T 1 X = exp 2 + T V1 2 T0 V1 + T0 V1 0 2 The first and last terms in the exponent do not involve (are constants) and can ignored as they are absorbed through normalization. This leaves 1 2y T 1 X + T X T 1 X 2 p | , y, X exp 2 + T V1 2 T0 V1 2 T 1 T 1 V + X X 1 = exp 2 2 2 y T 1 X + T0 V 1 which can be recognized as the expansion of the conditional posterior claimed above. 
p ( | , y, X) N , V T 1 exp V1 2 T 1 1 = exp V + X T 1 X 2 T V1 + X T 1 X 1 T 1 T 1 = exp 2 V + X X 2 + T V 1 + X T 1 X T X0T X0 + X T X T 1 T 1 T 1 = exp 2 y X + V 0 2 T + X0T X0 + X T X 33 The last term in the exponent is all constants (does not involve ) so its absorbed through normalization and disregarded for comparison of kernels. Hence, T 1 p | 2 , y, X exp V1 2 T X0T X0 + X T X 1 T exp 2 2 2 y T 1 X + T0 V1 as claimed. Unknown variance Bayesian linear regression with unknown general error structure, , is something of a composite of ideas developed for exchangeable ( 2 In error structure) Bayesian regression and the multivariate Gaussian case with mean and variance unknown where each draw is an element of the y vector and X is an n p matrix of regressors. A Gaussian likelihood is 1 n T 1 2 exp (y ) (y ) (, | y, X) || 2 T 1 1 (y Xb) (y Xb) n || 2 exp T 2 + (b ) X T 1 X (b ) 1 n T || 2 exp (n p) s2 + (b ) X T 1 X (b ) 2 1 T 1 T 1 where b = X T 1 X X y and s2 = np (y Xb) 1 (y Xb). Combine the likelihood with a Gaussian-inverted Wishart prior 1 1 T 1 p ( | ; 0 , ) p ; , exp ( 0 ) ( 0 ) 2 1 +p+1 tr 2 || 2 || exp 2 1 where tr (·) is the trace of the matrix, it is as if = X0T 1 , and is degrees 0 X0 of freedom to produce the joint posterior 1 +n+p+1 tr 2 p (, | y, X) || 2 || exp 2 (n p) s2 1 T + (b ) X T 1 X (b ) exp 2 T + ( 0 ) 1 ( 0 ) 34 Completing the square Completing the square involves the matrix analog to the univariate unknown mean and variance case. Consider the exponent (in braces) T T (n p) s2 + (b ) X T 1 X (b ) + ( 0 ) 1 ( 0 ) (n p) s2 + bT X T 1 Xb 2 T X T 1 Xb + T X T 1 X T 1 T 1 + T 1 2 0 + 0 0 T 1 = (n p) s2 + T 1 + X X = 2 T V1 + bT X T 1 Xb + T0 1 0 = (n p) s2 + T V1 2 T V1 + bT X T 1 Xb + T0 1 0 where 1 1 T 1 T 1 1 + X X + X Xb 0 1 T 1 = V 0 + X Xb = 1 T 1 .and V = 1 X . 
+X Variation in around is T T V1 = T V1 2 T V1 + V1 The first two terms are identical to two terms in the posterior involving and there is apparently no recognizable kernel from these expressions. The joint posterior is p (, | y, X) +n+p+1 2 2 || || exp 1 tr 1 exp 2 T V1 T + (n p) s2 V1 +bT X T 1 Xb + T0 1 0 tr 1 + (n p) s2 T 1 +n+p+1 2 exp || 2 || V1 2 +bT X T 1 Xb + T 1 0 0 T 1 1 exp V 2 2 Therefore, we write the conditional posteriors for the parameters of interest. First, we focus on then we take up . The conditional posterior for conditional on involves collecting all terms involving . Hence, the conditional posterior for is ( | ) N , V or T 1 1 p ( | , y, X) exp V 2 35 Inverted-Wishart kernel Now, we gather all terms involving and write the conditional posterior for . p ( | , y, X) 2 +n+p+1 2 +n+p+1 2 +n+p+1 2 || || || 2 || || 2 || 1 tr 1 + (n p) s2 exp T + (b ) X T 1 X (b ) 2 tr 1 + 1 T (y Xb) 1 (y Xb) exp 2 T + (b ) X T 1 X (b ) T 1 +( y Xb) (y Xb) exp tr 1 T 2 + (b ) X T X (b ) We can identify the kernel as an inverted-Wishart involving the trace of a square, symmetric matrix, call it n , multiplied by 1 . The above joint posterior can be rewritten as an inverted-Wishart 1 ; + n, n +n 1 +n+p+1 2 p (, | y) |n | 2 || exp tr n 1 2 where T T n = + (y Xb) (y Xb) + (b ) X T X (b ) With conditional posteriors in hand, we can employ McMC strategies (namely, a Gibbs sampler) to draw inferences around the parameters of interest, and . That is, we sequentially draw conditional on and , in turn, conditional on . We discuss McMC strategies (both the Gibbs sampler and its generalization, the MetropolisHastings algorithm) later. (Nearly) uninformative priors As discussed by Gelman, et al [2004] uninformative priors for this case is awkward, at best. What does it mean to posit uninformative priors for a regression with general error structure? 
Consistent probability assignment suggests that either we have some priors about the correlation structure or heteroskedastic nature of the errors (informative priors) or we know nothing about the error structure (uninformative priors). If priors are uninformative, then maximum entropy probability assignment suggests we assign independent and unknown homoskedastic errors. Hence, we discuss nearly uninformative priors for this general error structure regression.

The joint uninformative prior (with a locally uniform prior for $\beta$) is

$$p(\beta, \Sigma) \propto |\Sigma|^{-\frac{1}{2}}$$

and the joint posterior is

$$\begin{aligned}
p(\beta, \Sigma \mid y, X) &\propto |\Sigma|^{-\frac{1}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2} \left[ (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
&\propto |\Sigma|^{-\frac{n + 1}{2}} \exp\left\{ -\frac{1}{2} \left[ (n - p) s^2 + (b - \beta)^T X^T \Sigma^{-1} X (b - \beta) \right] \right\} \\
&\propto |\Sigma|^{-\frac{n + 1}{2}} \exp\left\{ -\frac{1}{2} tr\left( S(\beta) \Sigma^{-1} \right) \right\}
\end{aligned}$$

where now $S(\beta) = (y - Xb)(y - Xb)^T + X (b - \beta)(b - \beta)^T X^T$. Then, the conditional posterior for $\beta$ given $\Sigma$ is $N\left( b, (X^T \Sigma^{-1} X)^{-1} \right)$

$$p(\beta \mid \Sigma, y, X) \propto \exp\left\{ -\frac{1}{2} (\beta - b)^T X^T \Sigma^{-1} X (\beta - b) \right\}$$

The conditional posterior for $\Sigma$ given $\beta$ is inverted-Wishart$(\Sigma^{-1}; n, \Psi_n)$

$$p(\Sigma \mid \beta, y, X) \propto |\Psi_n|^{\frac{n}{2}} |\Sigma|^{-\frac{n + 1}{2}} \exp\left\{ -\frac{1}{2} tr\left( \Psi_n \Sigma^{-1} \right) \right\}$$

where

$$\Psi_n = (y - Xb)(y - Xb)^T + X (b - \beta)(b - \beta)^T X^T$$

As with informed priors, a Gibbs sampler (sequential draws from the conditional posteriors) can be employed to draw inferences for the uninformative prior case.

Next, we discuss posterior simulation, a convenient and flexible strategy for drawing inference from the evidence and (conjugate) priors.

4.2 Direct posterior simulation

Posterior simulation allows us to learn about features of the posterior (including linear combinations, products, or ratios of parameters) by drawing samples when the exact form of the density is impossible or difficult to write.

Example

For example, suppose $x_1$ and $x_2$ are (jointly) Gaussian or normally distributed with unknown means, $\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$, and known variance-covariance, $\sigma^2 I = 9I$, but we're interested in $x_3 = \beta_1 x_1 + \beta_2 x_2$.
Based on a sample of data ($n = 30$), $y = \{x_1, x_2\}$, we can infer the posterior means and variance for $x_1$ and $x_2$ and simulate posterior draws for $x_3$ from which properties of the posterior distribution for $x_3$ can be inferred. Suppose $\mu_1 = 50$ and $\mu_2 = 75$ and we have no prior knowledge regarding the location of $x_1$ and $x_2$ so we employ uniform (uninformative) priors. Sample statistics for $x_1$ and $x_2$ are reported below.

| statistic | $x_1$ | $x_2$ |
|---|---|---|
| mean | 51.0 | 75.5 |
| median | 50.8 | 76.1 |
| standard deviation | 3.00 | 2.59 |
| maximum | 55.3 | 80.6 |
| minimum | 43.6 | 69.5 |
| quantile 0.01 | 43.8 | 69.8 |
| quantile 0.025 | 44.1 | 70.3 |
| quantile 0.05 | 45.6 | 71.1 |
| quantile 0.10 | 47.8 | 72.9 |
| quantile 0.25 | 49.5 | 73.4 |
| quantile 0.75 | 53.0 | 77.4 |
| quantile 0.9 | 54.4 | 78.1 |
| quantile 0.95 | 55.1 | 79.4 |
| quantile 0.975 | 55.3 | 80.2 |
| quantile 0.99 | 55.3 | 80.5 |

Sample statistics

Since we know $x_1$ and $x_2$ are independent each with variance 9, and uniform priors center each posterior on the sample mean, the marginal posteriors for $\mu_1$ and $\mu_2$ are

$$p(\mu_1 \mid y) \sim N\left( 51.0, \frac{9}{30} \right) \quad \text{and} \quad p(\mu_2 \mid y) \sim N\left( 75.5, \frac{9}{30} \right)$$

and the predictive posteriors for $x_1$ and $x_2$ are based on posterior draws for $\mu_1$ and $\mu_2$

$$p(x_1 \mid \mu_1, y) \sim N(\mu_1, 9) \quad \text{and} \quad p(x_2 \mid \mu_2, y) \sim N(\mu_2, 9)$$

For $\beta_1 = 2$ and $\beta_2 = 3$, we generate 1,000 posterior predictive draws of $x_1$ and $x_2$, and utilize them to create posterior predictive draws for $x_3$. Sample statistics for these posterior draws are reported below.
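The simulation just described can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `simulate_x3`, the argument names, and the seed are hypothetical, the sample means 51.0 and 75.5 are taken from the sample statistics reported for $x_1$ and $x_2$, and the weights are passed in as arguments. Because the draws depend on the random seed, summaries of these draws will not reproduce the tabulated figures.

```python
import random
import statistics


def simulate_x3(ybar1, ybar2, n, sigma2, w1, w2, draws=1000, seed=0):
    """Posterior predictive draws of x3 = w1*x1 + w2*x2.

    Assumes independent Gaussian data with known variance sigma2 and
    uniform priors on both means, so each mean has posterior
    N(sample mean, sigma2 / n).
    """
    rng = random.Random(seed)
    post_sd = (sigma2 / n) ** 0.5   # posterior sd of each mean
    pred_sd = sigma2 ** 0.5         # sd of x given its mean
    x3 = []
    for _ in range(draws):
        # step 1: independent draws from the marginal posteriors of the means
        mu1 = rng.gauss(ybar1, post_sd)
        mu2 = rng.gauss(ybar2, post_sd)
        # step 2: predictive draws of x1 and x2 given the drawn means
        x1 = rng.gauss(mu1, pred_sd)
        x2 = rng.gauss(mu2, pred_sd)
        # step 3: combine into the quantity of interest
        x3.append(w1 * x1 + w2 * x2)
    return x3


x3_draws = simulate_x3(51.0, 75.5, n=30, sigma2=9.0, w1=2.0, w2=3.0)
print(round(statistics.mean(x3_draws), 1), round(statistics.stdev(x3_draws), 2))
```

Any other function of the parameters (a product or a ratio, say) can be summarized the same way by changing only the last line inside the loop.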
| statistic | $\mu_1$ | $\mu_2$ | $x_1$ | $x_2$ | $x_3$ |
|---|---|---|---|---|---|
| mean | 51.0 | 75.5 | 50.9 | 75.4 | 149.2 |
| median | 51.0 | 75.5 | 50.8 | 75.3 | 149.2 |
| standard deviation | 0.55 | 0.56 | 3.15 | 3.04 | 5.07 |
| maximum | 52.5 | 77.5 | 59.7 | 85.4 | 163.1 |
| minimum | 48.5 | 73.5 | 39.4 | 65.4 | 131.4 |
| quantile 0.01 | 49.6 | 74.4 | 44.1 | 68.5 | 137.8 |
| quantile 0.025 | 49.8 | 74.5 | 44.7 | 69.7 | 139.8 |
| quantile 0.05 | 50.1 | 74.6 | 45.7 | 70.6 | 141.2 |
| quantile 0.10 | 50.3 | 74.8 | 46.8 | 71.6 | 142.8 |
| quantile 0.25 | 50.6 | 75.2 | 48.8 | 73.3 | 145.7 |
| quantile 0.75 | 51.3 | 75.9 | 52.9 | 77.6 | 152.8 |
| quantile 0.9 | 51.6 | 76.3 | 55.0 | 79.4 | 155.6 |
| quantile 0.95 | 51.8 | 76.5 | 56.2 | 80.5 | 157.6 |
| quantile 0.975 | 52.0 | 76.7 | 57.5 | 81.6 | 159.4 |
| quantile 0.99 | 52.3 | 76.9 | 58.5 | 82.4 | 160.9 |

Sample statistics for posterior draws

A normal probability plot[13] and histogram based on 1,000 draws of $x_3$ along with the descriptive statistics above based on posterior draws suggest that $x_3$ is well approximated by a Gaussian distribution.

[13] We employ Filliben's [1975] approach by plotting normal quantiles of $u_j$, $N(u_j)$, (horizontal axis) against $z$ scores (vertical axis) for the data, $y$, of sample size $n$ where
$$u_j = \begin{cases} 1 - 0.5^{1/n} & j = 1 \\ \dfrac{j - 0.3175}{n + 0.365} & j = 2, \ldots, n - 1 \\ 0.5^{1/n} & j = n \end{cases}$$
(a general expression is $\frac{j - a}{n + 1 - 2a}$, in the above $a = 0.3175$), and $z_i = \frac{y_i - \bar{y}}{s}$ with sample average, $\bar{y}$, and sample standard deviation, $s$.

[Figure: Normal probability plot for $x_3$]

4.2.1 Independent simulation

The above example illustrates independent simulation. Since $x_1$ and $x_2$ are independent, their joint distribution, $p(x_1, x_2)$, is the product of their marginals, $p(x_1)$ and $p(x_2)$. As these marginals depend on their unknown means, we can independently draw from the marginal posteriors for the means, $p(\mu_1 \mid y)$ and $p(\mu_2 \mid y)$, to generate predictive posterior draws for $x_1$ and $x_2$.[14] The general independent posterior simulation procedure is

1. draw $\theta_2$ from the marginal posterior $p(\theta_2 \mid y)$,
2. draw $\theta_1$ from the marginal posterior $p(\theta_1 \mid y)$.

4.2.2 Nuisance parameters & conditional simulation

When there are nuisance parameters or, in other words, the model is hierarchical in nature, it is simpler to employ conditional posterior simulation.
That is, draw the nuisance parameter from its marginal posterior then draw the other parameters of interest conditional on the draw of the nuisance or hierarchical parameter. The general conditional simulation procedure is

1. draw $\theta_2$ (say, scale) from the marginal posterior $p(\theta_2 \mid y)$,
2. draw $\theta_1$ (say, mean) from the conditional posterior $p(\theta_1 \mid \theta_2, y)$.

Example

We compare independent simulation based on marginal posteriors for the mean and variance with conditional simulation based on the marginal posterior of the variance and the conditional posterior of the mean for the Gaussian (normal) unknown mean and variance case. First, we explore informed priors, then we compare with uninformative priors. We draw an exchangeable sample of $n = 50$ observations from a Gaussian (normal) distribution with mean equal to 46, a draw from the prior distribution for the mean (described below), and variance equal to 9, a draw from the prior distribution for the variance (also described below).

Informed priors

The prior distribution for the mean is Gaussian with mean equal to $\mu_0 = 50$ and variance equal to $\frac{\sigma^2}{\kappa_0} = 18$ ($\kappa_0 = \frac{1}{2}$). The prior distribution for the variance is inverted-chi square with $\nu_0 = 5$ degrees of freedom and scale equal to $\sigma_0^2 = 9$. Then, the marginal posterior for the variance is inverted-chi square with $\nu_n = \nu_0 + n = 55$ degrees of freedom and scale equal to

$$\nu_n \sigma_n^2 = 45 + 49 s^2 + \frac{25}{50.5} (50 - \bar{y})^2$$

where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ depend on the sample. The conditional posterior for the mean is Gaussian with mean equal to $\mu_n = \frac{1}{50.5} (25 + 50 \bar{y})$ and variance equal to the draw from the marginal posterior for the variance scaled by $\kappa_0 + n$, $\frac{\sigma^2}{50.5}$. The marginal posterior for the mean is noncentral, scaled Student t with noncentrality parameter equal to $\mu_n = \frac{1}{50.5} (25 + 50 \bar{y})$ and scale equal to $\frac{\sigma_n^2}{50.5}$. In other words, posterior draws for the mean are

$$\mu = t \sqrt{\frac{\sigma_n^2}{50.5}} + \mu_n$$

where $t$ is a draw from a standard Student $t(55)$ distribution.

[14] Predictive posterior simulation is discussed below.
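The two ways of drawing the mean described for the informed-prior case (directly from its marginal Student t posterior, or conditionally on a draw of the variance) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function name `posterior_draws`, the seeds, and the simulated sample `y` are hypothetical, and the scaled inverted-chi square draw is formed, as an assumption about parameterization, by dividing $\nu_n \sigma_n^2$ by a chi-square draw with $\nu_n$ degrees of freedom.

```python
import random
import statistics


def posterior_draws(y, mu0=50.0, kappa0=0.5, nu0=5.0, sigma0_sq=9.0,
                    draws=1000, seed=0):
    """Marginal and conditional posterior draws for a Gaussian mean,
    plus marginal draws for the variance, under the conjugate prior."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.mean(y)
    s2 = statistics.variance(y)              # divisor n - 1
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_sigma_n = (nu0 * sigma0_sq + (n - 1) * s2
                  + kappa0 * n / kappa_n * (mu0 - ybar) ** 2)
    sigma_n_sq = nu_sigma_n / nu_n

    marginal_mu, conditional_mu, var_draws = [], [], []
    for _ in range(draws):
        # variance: scaled inverted-chi square draw
        chi2 = rng.gammavariate(nu_n / 2.0, 2.0)   # chi-square(nu_n)
        sigma_sq = nu_sigma_n / chi2
        var_draws.append(sigma_sq)
        # mean, conditional on the variance draw
        conditional_mu.append(rng.gauss(mu_n, (sigma_sq / kappa_n) ** 0.5))
        # mean, from its marginal scaled Student t posterior
        t = rng.gauss(0.0, 1.0) / (rng.gammavariate(nu_n / 2.0, 2.0) / nu_n) ** 0.5
        marginal_mu.append(mu_n + t * (sigma_n_sq / kappa_n) ** 0.5)
    return marginal_mu, conditional_mu, var_draws


# hypothetical sample standing in for the n = 50 observations
data_rng = random.Random(123)
y = [data_rng.gauss(46.0, 3.0) for _ in range(50)]
marginal_mu, conditional_mu, var_draws = posterior_draws(y)
print(round(statistics.mean(marginal_mu), 2),
      round(statistics.mean(conditional_mu), 2),
      round(statistics.mean(var_draws), 2))
```

Both sets of draws for the mean target the same posterior, so their summaries should agree up to simulation error; the conditional route has the practical advantage that it never requires the marginal posterior of the mean in closed form.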
Statistics for 1,000 marginal and conditional posterior draws of the mean and marginal posterior draws of the variance are tabulated below.

| statistic | $(\mu \mid y)$ | $(\mu \mid \sigma^2, y)$ | $(\sigma^2 \mid y)$ |
|---|---|---|---|
| mean | 45.4 | 45.5 | 9.6 |
| median | 45.4 | 45.5 | 9.4 |
| standard deviation | 0.45 | 0.44 | 1.85 |
| maximum | 46.8 | 46.9 | 21.1 |
| minimum | 44.1 | 43.9 | 5.5 |
| quantile 0.01 | 44.4 | 44.4 | 6.1 |
| quantile 0.025 | 44.5 | 44.6 | 6.6 |
| quantile 0.05 | 44.7 | 44.8 | 7.0 |
| quantile 0.10 | 44.9 | 44.8 | 7.0 |
| quantile 0.25 | 45.1 | 45.2 | 8.3 |
| quantile 0.75 | 45.7 | 45.8 | 10.7 |
| quantile 0.9 | 46.0 | 46.0 | 12.0 |
| quantile 0.95 | 46.2 | 46.2 | 12.8 |
| quantile 0.975 | 46.3 | 46.3 | 13.4 |
| quantile 0.99 | 46.5 | 46.5 | 14.9 |

Sample statistics for posterior draws based on informed priors

Clearly, marginal and conditional posterior draws for the mean are very similar, as expected. Marginal posterior draws for the variance have more spread than those for the mean, as expected, and all posterior draws comport well with the underlying distribution. Sorted posterior draws based on informed priors are plotted below with the underlying parameter depicted by a horizontal line.

[Figure: Posterior draws for the mean and variance based on informed priors]

As the evidence and priors are largely in accord, we might expect the informed priors to reduce the spread in the posterior distributions somewhat. Below we explore uninformed priors.

Uninformed priors

The marginal posterior for the variance is inverted-chi square with $n - 1 = 49$ degrees of freedom and scale equal to $(n - 1) s^2 = 49 s^2$ where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ depend on the sample. The conditional posterior for the mean is Gaussian with mean equal to $\bar{y}$ and variance equal to the draw from the marginal posterior for the variance scaled by $n$, $\frac{\sigma^2}{50}$. The marginal posterior for the mean is noncentral, scaled Student t with noncentrality parameter equal to $\bar{y}$ and scale equal to $\frac{s^2}{50}$. In other words, posterior draws for the mean are

$$\mu = t \sqrt{\frac{s^2}{50}} + \bar{y}$$

where $t$ is a draw from a standard Student $t(49)$ distribution.

Statistics for 1,000 marginal and conditional posterior draws of the mean and marginal posterior draws of the variance are tabulated below.
| statistic | $(\mu \mid y)$ | $(\mu \mid \sigma^2, y)$ | $(\sigma^2 \mid y)$ |
|---|---|---|---|
| mean | 45.4 | 45.4 | 9.7 |
| median | 45.4 | 45.5 | 9.4 |
| standard deviation | 0.43 | 0.45 | 2.05 |
| maximum | 46.7 | 47.0 | 18.9 |
| minimum | 44.0 | 43.9 | 4.8 |
| quantile 0.01 | 44.4 | 44.3 | 6.2 |
| quantile 0.025 | 44.6 | 44.5 | 6.6 |
| quantile 0.05 | 44.7 | 44.7 | 6.9 |
| quantile 0.10 | 44.9 | 44.8 | 7.3 |
| quantile 0.25 | 45.1 | 45.1 | 8.3 |
| quantile 0.75 | 45.7 | 45.7 | 10.9 |
| quantile 0.9 | 46.0 | 46.0 | 12.4 |
| quantile 0.95 | 46.1 | 46.2 | 13.5 |
| quantile 0.975 | 46.3 | 46.3 | 14.3 |
| quantile 0.99 | 46.4 | 46.4 | 15.6 |

Sample statistics for posterior draws based on uninformed priors

There is remarkably little difference between the informed and uninformed posterior draws. With a smaller sample we would expect the priors to have a more substantial impact. Sorted posterior draws based on uninformed priors are plotted below with the underlying parameter depicted by a horizontal line.

[Figure: Posterior draws for the mean and variance based on uninformed priors]

Discrepant evidence

Before we leave this subsection, perhaps it is instructive to explore the implications of discrepant evidence. That is, we investigate the case where the evidence differs substantially from the priors. We again draw a value for $\mu$ from a Gaussian distribution with mean 50 and variance $\frac{9}{1/2}$; now the draw is $\mu = 53.1$. Then, we set the prior for $\mu$, $\mu_0$, equal to $50 + 6\sqrt{18} = 75.5$, six prior standard deviations above 50. Everything else remains as above. As expected, posterior draws based on uninformed priors are very similar to those reported above except with the shift in the mean for $\mu$.[15]

[15] To conserve space, posterior draws based on the uninformed prior results are not reported.

Based on informed priors, the marginal posterior for the variance is inverted-chi square with $\nu_n = \nu_0 + n = 55$ degrees of freedom and scale equal to

$$\nu_n \sigma_n^2 = 45 + 49 s^2 + \frac{25}{50.5} (75.5 - \bar{y})^2$$

where $s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ depend on the sample. The conditional posterior for the mean is Gaussian with mean equal to $\mu_n = \frac{1}{50.5} (37.75 + 50 \bar{y})$ and variance equal to the draw from the marginal posterior for the variance scaled by $\kappa_0 + n$, $\frac{\sigma^2}{50.5}$.
The marginal posterior for the mean is noncentral, scaled Student t with noncentrality parameter equal to $\mu_n = \frac{1}{50.5} (37.75 + 50 \bar{y})$ and scale equal to $\frac{\sigma_n^2}{50.5}$. In other words, posterior draws for the mean are

$$\mu = t \sqrt{\frac{\sigma_n^2}{50.5}} + \mu_n$$

where $t$ is a draw from a standard Student $t(55)$ distribution.

Statistics for 1,000 marginal and conditional posterior draws of the mean and marginal posterior draws of the variance are tabulated below.

| statistic | $(\mu \mid y)$ | $(\mu \mid \sigma^2, y)$ | $(\sigma^2 \mid y)$ |
|---|---|---|---|
| mean | 53.4 | 53.4 | 13.9 |
| median | 53.4 | 53.4 | 13.5 |
| standard deviation | 0.52 | 0.54 | 2.9 |
| maximum | 55.3 | 55.1 | 26.0 |
| minimum | 51.4 | 51.2 | 7.5 |
| quantile 0.01 | 52.2 | 52.2 | 8.8 |
| quantile 0.025 | 52.5 | 52.4 | 9.4 |
| quantile 0.05 | 52.6 | 52.6 | 10.0 |
| quantile 0.10 | 52.8 | 52.8 | 10.6 |
| quantile 0.25 | 53.1 | 53.1 | 11.9 |
| quantile 0.75 | 53.8 | 53.8 | 15.5 |
| quantile 0.9 | 54.1 | 54.1 | 17.6 |
| quantile 0.95 | 54.3 | 54.4 | 19.2 |
| quantile 0.975 | 54.5 | 54.5 | 21.1 |
| quantile 0.99 | 54.7 | 54.7 | 23.3 |

Sample statistics for posterior draws based on informed priors: discrepant case

Posterior draws for $\mu$ are largely unaffected by the discrepancy between the evidence and the prior, presumably because the evidence dominates with a sample size of 50. However, consistent with intuition, posterior draws for the variance are skewed upward more than previously. Sorted posterior draws based on informed priors are plotted below with the underlying parameter depicted by a horizontal line.

[Figure: Posterior draws for the mean and variance based on informed priors: discrepant case]

4.2.3 Posterior predictive distribution

As we've seen for independent simulation (the first example in this section), posterior predictive draws allow us to generate distributions for complex combinations of parameters or random variables. For independent simulation, the general procedure for generating posterior predictive draws is

1. draw $\theta_1$ from $p(\theta_1 \mid y)$,
2. draw $\theta_2$ from $p(\theta_2 \mid y)$,
3. draw $\tilde{y}$ from $p(\tilde{y} \mid \theta_1, \theta_2, y)$ where $\tilde{y}$ is the predictive random variable.

Also, posterior predictive distributions provide a diagnostic check on model specification adequacy.
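One common way to make such a diagnostic check concrete is to compare a statistic of the observed sample with the same statistic computed on replicated data sets drawn from the posterior predictive distribution. The sketch below does this for a Gaussian model with known variance and a uniform prior on the mean. It is an illustration under stated assumptions only: the function names, the choice of the sample range as the test statistic, the seeds, and both simulated data sets are hypothetical and do not come from the notes.

```python
import random
import statistics


def predictive_p_value(y, sigma2, stat, draws=1000, seed=0):
    """Share of replicated data sets whose statistic is at least as
    large as the observed one (Gaussian model, known variance sigma2,
    uniform prior on the mean)."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.mean(y)
    observed = stat(y)
    exceed = 0
    for _ in range(draws):
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)                    # posterior draw
        y_rep = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)]     # predictive draw
        if stat(y_rep) >= observed:
            exceed += 1
    return exceed / draws


def sample_range(values):
    return max(values) - min(values)


data_rng = random.Random(1)
y_ok = [data_rng.gauss(50.0, 3.0) for _ in range(30)]    # consistent with variance 9
y_bad = [data_rng.gauss(50.0, 9.0) for _ in range(30)]   # far more spread than assumed

p_ok = predictive_p_value(y_ok, 9.0, sample_range)
p_bad = predictive_p_value(y_bad, 9.0, sample_range)
print(p_ok, p_bad)
```

A proportion near 0 or 1 indicates that the observed statistic sits in the tail of what the model predicts; here the second data set has much more spread than the assumed variance allows, so its proportion should be close to zero.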
If sample data and posterior predictive draws are substantially different we have evidence of model misspecification.

For conditional simulation, the general procedure for generating posterior predictive draws is

1. draw $\theta_2$ from $p(\theta_2 \mid y)$,
2. draw $\theta_1$ from $p(\theta_1 \mid \theta_2, y)$,
3. draw $\tilde{y}$ from $p(\tilde{y} \mid \theta_1, \theta_2, y)$.

4.2.4 Bayesian linear regression

As our (ANOVA and ANCOVA) examples draw on small samples, in this subsection we focus on a linear DGP

$$Y = X\beta + \varepsilon$$

uninformative priors for $\beta$, and known variance, $\sigma^2 = 1$. The posterior for $\beta$ is Gaussian or normal with mean $b = (X^T X)^{-1} X^T Y$ and variance $\sigma^2 (X^T X)^{-1}$

$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta - b)^T X^T X (\beta - b) \right\}$$

Next, we return to the (ANOVA and ANCOVA) examples from the projections chapter but apply Bayesian simulation.

Mean estimation

As discussed in the projections chapter, we can estimate an unknown mean from a Gaussian distribution via regression where $X = \iota$ (a vector of ones). The posterior distribution is Gaussian with mean $\bar{Y}$ and variance $\frac{\sigma^2}{n}$. For an exchangeable sample, $Y = \{4, 6, 5\}$, we have $\bar{Y} = 5$ and standard deviation $\sqrt{\sigma^2 / n} = 0.577$ (variance $\frac{1}{3}$). The table below reports statistics from 1,000 posterior draws.

| statistic | $(\beta \mid \sigma^2, Y, X)$ |
|---|---|
| mean | 4.99 |
| median | 4.99 |
| standard deviation | 0.579 |
| maximum | 6.83 |
| minimum | 3.25 |
| quantile 0.01 | 3.62 |
| quantile 0.025 | 3.91 |
| quantile 0.05 | 4.04 |
| quantile 0.10 | 4.26 |
| quantile 0.25 | 4.57 |
| quantile 0.75 | 5.39 |
| quantile 0.9 | 5.78 |
| quantile 0.95 | 5.92 |
| quantile 0.975 | 6.06 |
| quantile 0.99 | 6.23 |

Sample statistics for posterior draws of the mean. DGP: $Y = X\beta + \varepsilon$, $X = \iota$

These results correspond well with, say, a 95% classical confidence interval $\{3.87, 6.13\}$ while the 95% Bayesian posterior interval is $\{3.91, 6.06\}$.

Different means (ANOVA)

ANOVA example 1

Suppose we have exchangeable outcome or response data (conditional on $D$) with two factor levels identified by $D$.

| $Y$ | $D$ |
|---|---|
| 4 | 0 |
| 6 | 0 |
| 5 | 0 |
| 11 | 1 |
| 9 | 1 |
| 10 | 1 |

We're interested in estimating the unknown means, or equivalently, the mean difference conditional on $D$ (and one of the means).
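We view the former DGP as

$$Y = X_1 \beta^{(1)} + \varepsilon = \beta_0 (1 - D) + \beta_1 D + \varepsilon$$

a no intercept regression where $X_1 = \begin{bmatrix} (1 - D) & D \end{bmatrix}$ and $\beta^{(1)} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$, and the latter DGP as

$$Y = X_2 \beta^{(2)} + \varepsilon = \beta_0 + \beta_2 D + \varepsilon$$

where $X_2 = \begin{bmatrix} \iota & D \end{bmatrix}$, $\beta^{(2)} = \begin{bmatrix} \beta_0 \\ \beta_2 \end{bmatrix}$, and $\beta_2 = \beta_1 - \beta_0$.

The posterior for $\beta^{(1)}$ is Gaussian with means

$$b^{(1)} = \begin{bmatrix} \bar{Y} \mid D = 0 \\ \bar{Y} \mid D = 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$$

and variance $\sigma^2 (X_1^T X_1)^{-1} = \begin{bmatrix} \frac{1}{3} & 0 \\ 0 & \frac{1}{3} \end{bmatrix}$.

$$p\left( \beta^{(1)} \mid \sigma^2, Y, X_1 \right) \propto \exp\left\{ -\frac{1}{2\sigma^2} \left( \beta^{(1)} - b^{(1)} \right)^T X_1^T X_1 \left( \beta^{(1)} - b^{(1)} \right) \right\}$$

While the posterior for $\beta^{(2)}$ is Gaussian with means

$$b^{(2)} = \begin{bmatrix} \bar{Y} \mid D = 0 \\ \bar{Y} \mid D = 1 - \bar{Y} \mid D = 0 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix}$$

and variance $\sigma^2 (X_2^T X_2)^{-1} = \begin{bmatrix} \frac{1}{3} & -\frac{1}{3} \\ -\frac{1}{3} & \frac{2}{3} \end{bmatrix}$.

$$p\left( \beta^{(2)} \mid \sigma^2, Y, X_2 \right) \propto \exp\left\{ -\frac{1}{2\sigma^2} \left( \beta^{(2)} - b^{(2)} \right)^T X_2^T X_2 \left( \beta^{(2)} - b^{(2)} \right) \right\}$$

The tables below report statistics from 1,000 posterior draws for each model.

| statistic | $\beta_0$ | $\beta_1$ | $\beta_1 - \beta_0$ |
|---|---|---|---|
| mean | 4.99 | 10.00 | 5.00 |
| median | 4.99 | 10.00 | 4.99 |
| standard deviation | 0.597 | 0.575 | 0.851 |
| maximum | 6.87 | 11.63 | 7.75 |
| minimum | 2.92 | 8.35 | 2.45 |
| quantile 0.01 | 3.64 | 8.64 | 3.14 |
| quantile 0.025 | 3.82 | 8.85 | 3.37 |
| quantile 0.05 | 4.01 | 9.06 | 3.58 |
| quantile 0.10 | 4.23 | 9.26 | 3.90 |
| quantile 0.25 | 4.60 | 9.58 | 4.44 |
| quantile 0.75 | 5.40 | 10.39 | 5.57 |
| quantile 0.9 | 5.76 | 10.73 | 6.09 |
| quantile 0.95 | 5.97 | 10.91 | 6.36 |
| quantile 0.975 | 6.15 | 11.08 | 6.69 |
| quantile 0.99 | 6.40 | 11.30 | 7.17 |
| classical 95% interval: lower | 3.87 | 8.87 | 3.40 |
| classical 95% interval: upper | 6.13 | 11.13 | 6.60 |

Sample statistics for posterior draws from ANOVA example 1. DGP: $Y = X_1 \beta^{(1)} + \varepsilon$, $X_1 = \begin{bmatrix} (1 - D) & D \end{bmatrix}$

| statistic | $\beta_0$ | $\beta_2$ | $\beta_0 + \beta_2$ |
|---|---|---|---|
| mean | 5.02 | 4.99 | 10.01 |
| median | 5.02 | 4.97 | 10.02 |
| standard deviation | 0.575 | 0.800 | 0.580 |
| maximum | 6.71 | 7.62 | 12.09 |
| minimum | 3.05 | 2.88 | 7.90 |
| quantile 0.01 | 3.65 | 3.10 | 8.76 |
| quantile 0.025 | 3.89 | 3.43 | 8.88 |
| quantile 0.05 | 4.05 | 3.66 | 9.02 |
| quantile 0.10 | 4.29 | 3.91 | 9.22 |
| quantile 0.25 | 4.62 | 4.49 | 9.62 |
| quantile 0.75 | 5.41 | 5.51 | 10.38 |
| quantile 0.9 | 5.73 | 6.02 | 10.75 |
| quantile 0.95 | 5.93 | 6.30 | 10.96 |
| quantile 0.975 | 6.14 | 6.54 | 11.18 |
| quantile 0.99 | 6.36 | 6.79 | 11.33 |
| classical 95% interval: lower | 3.87 | 3.40 | 8.87 |
| classical 95% interval: upper | 6.13 | 6.60 | 11.13 |

Sample statistics for posterior draws from ANOVA example 1. DGP: $Y = X_2 \beta^{(2)} + \varepsilon$, $X_2 = \begin{bmatrix} \iota & D \end{bmatrix}$

As expected, for both models there is strong correspondence between classical confidence intervals and the posterior intervals (reported at the 95% level).

ANOVA example 2

Suppose we now have a second binary factor, $W$.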
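| $Y$ | $D$ | $W$ | $(D \times W)$ |
|---|---|---|---|
| 4 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 |
| 11 | 1 | 1 | 1 |
| 9 | 1 | 0 | 0 |
| 10 | 1 | 0 | 0 |

We're still interested in estimating the mean differences conditional on $D$ and $W$. The DGP is

$$Y = X\beta + \varepsilon = \beta_0 + \beta_1 D + \beta_2 W + \beta_3 (D \times W) + \varepsilon$$

where $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}^T$. The posterior is Gaussian

$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta - b)^T X^T X (\beta - b) \right\}$$

with mean

$$b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 4.5 \\ 5 \\ 1.5 \\ 0 \end{bmatrix}$$

and variance

$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2} & 1 & \frac{1}{2} & -1 \\ -\frac{1}{2} & \frac{1}{2} & \frac{3}{2} & -\frac{3}{2} \\ \frac{1}{2} & -1 & -\frac{3}{2} & 3 \end{bmatrix}$$

The table below reports statistics from 1,000 posterior draws.

| statistic | $\beta_0$ | $\beta_1$ | $\beta_2$ | $\beta_3$ |
|---|---|---|---|---|
| mean | 4.45 | 5.05 | 1.54 | 0.03 |
| median | 4.47 | 5.01 | 1.50 | 0.06 |
| standard deviation | 0.706 | 0.989 | 1.224 | 1.740 |
| maximum | 6.89 | 7.98 | 5.80 | 6.28 |
| minimum | 2.16 | 2.21 | -2.27 | -5.94 |
| quantile 0.01 | 2.74 | 2.76 | -1.36 | -4.05 |
| quantile 0.025 | 3.06 | 3.07 | -0.87 | -3.36 |
| quantile 0.05 | 3.31 | 3.47 | -0.41 | -2.83 |
| quantile 0.10 | 3.53 | 3.82 | 0.00 | -2.31 |
| quantile 0.25 | 3.94 | 4.39 | 0.69 | -1.15 |
| quantile 0.75 | 4.93 | 5.72 | 2.35 | 1.13 |
| quantile 0.9 | 5.33 | 6.34 | 3.20 | 2.14 |
| quantile 0.95 | 5.57 | 6.69 | 3.58 | 2.75 |
| quantile 0.975 | 5.78 | 6.99 | 3.91 | 3.41 |
| quantile 0.99 | 6.00 | 7.49 | 4.23 | 4.47 |
| classical 95% interval: lower | 3.11 | 3.04 | -0.90 | -3.39 |
| classical 95% interval: upper | 5.89 | 6.96 | 3.90 | 3.39 |

Sample statistics for posterior draws from ANOVA example 2. DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$

As expected, there is strong correspondence between the classical confidence intervals and the posterior intervals (reported at the 95% level).

ANOVA example 3

Now, suppose the second factor, $W$, is perturbed slightly.

| $Y$ | $D$ | $W$ | $(D \times W)$ |
|---|---|---|---|
| 4 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 |
| 11 | 1 | 1 | 1 |
| 9 | 1 | 0 | 0 |
| 10 | 1 | 0 | 0 |

We're still interested in estimating the mean differences conditional on $D$ and $W$. The DGP is

$$Y = X\beta + \varepsilon = \beta_0 + \beta_1 D + \beta_2 W + \beta_3 (D \times W) + \varepsilon$$

where $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}^T$. The posterior is Gaussian

$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta - b)^T X^T X (\beta - b) \right\}$$

with mean

$$b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 5 \\ 4.5 \\ 0 \\ 1.5 \end{bmatrix}$$

and variance

$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2} & 1 & \frac{1}{2} & -1 \\ -\frac{1}{2} & \frac{1}{2} & \frac{3}{2} & -\frac{3}{2} \\ \frac{1}{2} & -1 & -\frac{3}{2} & 3 \end{bmatrix}$$

The table below reports statistics from 1,000 posterior draws.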
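Because the coefficients in these saturated designs are correlated, posterior draws cannot be generated coefficient by coefficient. One standard approach is to factor the posterior variance with a Cholesky decomposition and transform independent standard normal draws. The sketch below applies it to the ANOVA example 2 data using only the Python standard library; the helper names (`matmul`, `inverse`, `cholesky`) and the seed are hypothetical, and $\sigma^2 = 1$ is assumed as in this subsection.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s ** 0.5 if i == j else s / low[j][j]
    return low


Y = [4, 6, 5, 11, 9, 10]
D = [0, 0, 0, 1, 1, 1]
W = [0, 1, 0, 1, 0, 0]
X = [[1, d, w, d * w] for d, w in zip(D, W)]
Xt = [list(col) for col in zip(*X)]

V = inverse(matmul(Xt, X))                       # posterior variance, sigma^2 = 1
b = [row[0] for row in matmul(V, matmul(Xt, [[v] for v in Y]))]   # posterior mean
L = cholesky(V)

rng = random.Random(0)
draws = []
for _ in range(1000):
    z = [rng.gauss(0.0, 1.0) for _ in b]
    # beta = b + L z has mean b and variance L L^T = V
    draws.append([bi + sum(L[i][k] * z[k] for k in range(i + 1))
                  for i, bi in enumerate(b)])

print([round(v, 3) for v in b])
```

The same routine serves the other ANOVA and ANCOVA examples by changing only `Y` and the columns of `X`; with an orthogonal design the Cholesky factor is diagonal and the draws reduce to independent univariate draws.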
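| statistic | $\beta_0$ | $\beta_1$ | $\beta_2$ | $\beta_3$ |
|---|---|---|---|---|
| mean | 5.01 | 4.49 | 0.01 | 1.50 |
| median | 5.00 | 4.49 | 0.03 | 1.52 |
| standard deviation | 0.700 | 0.969 | 1.224 | 1.699 |
| maximum | 7.51 | 7.45 | 3.91 | 8.35 |
| minimum | 2.97 | 1.61 | -5.23 | -3.34 |
| quantile 0.01 | 3.46 | 2.28 | -2.84 | -2.36 |
| quantile 0.025 | 3.73 | 2.67 | -2.30 | -1.82 |
| quantile 0.05 | 3.87 | 2.93 | -1.98 | -1.18 |
| quantile 0.10 | 4.08 | 3.25 | -1.56 | -0.62 |
| quantile 0.25 | 4.50 | 3.79 | -0.80 | 0.37 |
| quantile 0.75 | 5.50 | 5.14 | 0.82 | 2.60 |
| quantile 0.9 | 5.93 | 5.76 | 1.51 | 3.61 |
| quantile 0.95 | 6.11 | 6.09 | 2.00 | 4.23 |
| quantile 0.975 | 6.29 | 6.27 | 2.36 | 4.74 |
| quantile 0.99 | 6.51 | 6.58 | 2.93 | 5.72 |
| classical 95% interval: lower | 3.61 | 2.54 | -2.40 | -1.89 |
| classical 95% interval: upper | 6.39 | 6.46 | 2.40 | 4.89 |

Sample statistics for posterior draws from ANOVA example 3. DGP: $Y = X\beta + \varepsilon$, $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$

As expected, there is strong correspondence between the classical confidence intervals and the posterior intervals (reported at the 95% level).

ANOVA example 4

Now, suppose the second factor, $W$, is again perturbed.

| $Y$ | $D$ | $W$ | $(D \times W)$ |
|---|---|---|---|
| 4 | 0 | 0 | 0 |
| 6 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 |
| 11 | 1 | 0 | 0 |
| 9 | 1 | 0 | 0 |
| 10 | 1 | 1 | 1 |

We're again interested in estimating the mean differences conditional on $D$ and $W$. The DGP is

$$Y = X\beta + \varepsilon = \beta_0 + \beta_1 D + \beta_2 W + \beta_3 (D \times W) + \varepsilon$$

where $X = \begin{bmatrix} \iota & D & W & (D \times W) \end{bmatrix}$ and $\beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 \end{bmatrix}^T$. The posterior is Gaussian

$$p(\beta \mid \sigma^2, Y, X) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta - b)^T X^T X (\beta - b) \right\}$$

with mean

$$b = (X^T X)^{-1} X^T Y = \begin{bmatrix} 4 \\ 6 \\ 1.5 \\ -1.5 \end{bmatrix}$$

and variance

$$\sigma^2 (X^T X)^{-1} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 1.5 & 1 & -1.5 \\ -1 & 1 & 1.5 & -1.5 \\ 1 & -1.5 & -1.5 & 3 \end{bmatrix}$$

The table below reports statistics from 1,000 posterior draws.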
statistic 0 1 2 3 mean 3.98 6.01 1.53 1.46 median 3.95 6.01 1.56 1.51 standard deviation 1.014 1.225 1.208 1.678 maximum 6.72 9.67 6.40 4.16 minimum 0.71 2.61 1.78 6.50 quantiles: 0.01 1.69 3.28 1.06 5.24 0.025 2.02 3.69 0.84 4.73 0.05 2.37 3.94 0.53 4.26 0.10 2.70 4.34 0.05 3.60 0.25 3.27 5.19 0.69 2.59 0.75 4.72 6.81 2.36 0.27 0.9 5.28 7.58 3.09 0.73 0.95 5.64 8.08 3.45 1.31 0.975 5.94 8.35 3.81 1.75 0.99 6.36 8.79 4.21 2.20 classical 95% interval: lower 2.04 3.60 0.90 4.89 upper 5.96 8.40 3.90 1.89 Sample statistics for posterior draws from ANOVA example 4 DGP : Y = X + , X = D W (D W ) As expected, there is strong correspondence between the classical confidence intervals and the posterior intervals (reported at the 95% level). Simple regression Suppose we don’t observe treatment, D, but rather observe only regressor, X1 , in combination with outcome, Y . Y 4 6 5 11 9 10 X1 1 1 0 1 1 0 For the perceived DGP Y = 0 + 1 X1 + with N 0, 2 I ( 2 known) the posterior distribution for is 1 T T p ( | Y, X) exp 2 ( b) X X ( b) 2 55 1 T 1 7.5 X Y = , and 2 X T X = where X = X1 , b = X T X 1 1 0 6 . 0 14 Sample statistics for 1, 000 posterior draws are tabulated below. statistic 0 1 mean 7.48 0.98 median 7.49 1.01 standard deviation 0.386 0.482 maximum 8.67 2.39 minimum 6.18 0.78 quantiles: 0.01 6.55 0.14 0.025 6.71 0.08 0.05 6.85 0.16 0.10 6.99 0.34 0.25 7.22 0.65 0.75 7.74 1.29 0.9 7.98 1.60 0.95 8.09 1.76 0.975 8.22 1.88 0.99 8.33 2.10 classical 95% interval: lower 6.20 0.02 upper 7.80 1.98 Sample statistics for posterior draws from simple regression example DGP : Y = X + , X = X1 Classical confidence intervals and Bayesian posterior intervals are similar, as expected. ANCOVA example Suppose in addition to the regressor, X1 , we observe treatment, D, in combination with outcome, Y . 
Y 4 6 5 11 9 10 X1 1 1 0 1 1 0 D 0 0 0 1 1 1 For the perceived DGP Y = 0 + 1 D + 2 X1 + 3 (D X1 ) + 56 with N 0, 2 I ( 2 known) the posterior distribution for is 1 T p ( | Y, X) exp 2 ( b) X T X ( b) 2 where X = 5 5 1 T D X1 (D X1 ) , b = X T X X Y = 1 , and 0 1 13 0 0 3 2 0 0 T 1 13 3 2 X X = 1 0 12 2 0 0 0 12 1 Sample statistics for 1, 000 posterior draws are tabulated below. statistic 0 1 2 3 mean 5.00 4.99 1.02 0.01 median 5.02 5.02 1.02 0.01 standard deviation 0.588 0.802 0.697 1.00 maximum 7.23 7.96 3.27 3.21 minimum 3.06 2.69 1.33 3.31 quantiles: 0.01 3.62 3.15 0.57 2.34 0.025 3.74 3.38 0.41 2.03 0.05 4.02 3.65 0.10 1.66 0.10 4.28 3.98 0.14 1.25 0.25 4.60 4.47 0.54 0.66 0.75 5.34 5.53 1.48 0.65 0.9 5.73 5.96 1.91 1.25 0.95 5.97 6.30 2.16 1.68 0.975 6.20 6.50 2.37 2.07 0.99 6.49 6.93 2.72 2.36 classical 95% interval: lower 3.87 3.40 0.39 1.96 upper 6.13 6.60 2.39 1.96 Sample statistics for posterior draws from the ANCOVA example DGP : Y = X + , X = D X1 (D X1 ) Classical confidence intervals and Bayesian posterior intervals are similar, as expected. 4.3 McMC (Markov chain Monte Carlo) simulation Markov chain Monte Carlo (McMC) simulations can be employed when the marginal posterior distributions cannot be derived or are extremely cumbersome to derive. 57 McMC approaches draw from the set of conditional posterior distributions instead of the marginal posterior distributions. The utility of McMC simulation has evolved along with the R Foundation for Statistical Computing. Before discussing standard algorithms (the Gibbs sampler and Metropolis-Hastings) we briefly review some important concepts associated with Markov chains and attempt to develop some intuition regarding their effective usage. The objective is to be able to generate draws from a stationary posterior distribution which we denote but we’re unable to directly access. To explore how Markov chains help us access , we begin with discrete Markov chains then connect to continuous chains. 
Discrete state spaces

Let $S = \{\theta^1, \theta^2, \ldots, \theta^d\}$ be a discrete state space. A Markov chain is the sequence of random variables, $\{\theta_1, \theta_2, \ldots, \theta_r, \ldots\}$ given $\theta_0$, generated by the following transition

$$p_{ij} \equiv \Pr\left(\theta_{r+1} = \theta^j \mid \theta_r = \theta^i\right)$$

The Markov property says that transition to $\theta_{r+1}$ only depends on the immediate past history, $\theta_r$, and not all history. Define a Markov transition matrix, $P = [p_{ij}]$, where the rows denote initial states and the columns denote transition states such that, for example, $p_{ii}$ is the likelihood of beginning in state $i$ and remaining in state $i$. Now, relate this Markov chain idea to distributions from which random variables are drawn. Say, the initial value, $\theta_0$, is drawn from $\pi_0$. Then, the distribution for $\theta_1$ given $\theta_0 \sim \pi_0$ is

$$\pi_{1j} \equiv \Pr\left(\theta_1 = \theta^j\right) = \sum_{i=1}^{d} \Pr\left(\theta_0 = \theta^i\right) p_{ij} = \sum_{i=1}^{d} \pi_{0i}\, p_{ij}, \quad j = 1, 2, \ldots, d$$

In matrix notation, the above is

$$\pi_1 = \pi_0 P$$

and after $r$ iterations we have

$$\pi_r = \pi_0 P^r$$

As the number of iterations increases, we expect the effect of the initial distribution, $\pi_0$, to die out so long as the chain does not get trapped.

Irreducibility and stationarity

The idea of no absorbing states or states in which the chain gets trapped is called irreducibility. This is key to our construction of Markov chains. If $p_{ij} > 0$ (strictly positive) for all $i, j$, then the chain is irreducible and there exists a stationary distribution, $\pi$, such that

$$\lim_{r \to \infty} \pi_0 P^r = \pi \quad \text{and} \quad \pi P = \pi$$

Since the elements are all positive and each row sums to one, the maximum eigenvalue of $P^T$ is one and its corresponding eigenvector and vector from the inverse of the matrix of eigenvectors determine $\pi$.
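As a small numerical illustration of the convergence claim above, the following sketch iterates the distribution forward one transition at a time from two different initial distributions. The three-state matrix `P` is a hypothetical example (not from these notes) chosen so that every entry is strictly positive; the function names are likewise illustrative.

```python
def step(pi, P):
    """One transition of the distribution: returns the row vector pi P."""
    d = len(P)
    return [sum(pi[i] * P[i][j] for i in range(d)) for j in range(d)]


def iterate(pi0, P, r):
    """Distribution after r transitions, pi_0 P^r, computed by repeated steps."""
    pi = list(pi0)
    for _ in range(r):
        pi = step(pi, P)
    return pi


# Hypothetical transition matrix: strictly positive entries, rows sum to one.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

# Two very different initial distributions pi_0.
pi_a = iterate([1.0, 0.0, 0.0], P, 200)
pi_b = iterate([0.0, 0.0, 1.0], P, 200)

# Both runs should agree (the initial distribution dies out) and the
# limit should be unchanged by one more transition (pi P = pi).
pi_a_next = step(pi_a, P)
```

For this particular matrix, solving $\pi P = \pi$ by hand gives $\pi = (5/21, 9/21, 7/21)$, which both runs should reproduce to numerical precision.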
By the eigenvalue (spectral) decomposition, $P^T = S \Lambda S^{-1}$ where $S$ is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of corresponding eigenvalues, $\left(P^T\right)^r = S \Lambda^r S^{-1}$ since

$$\left(P^T\right)^r = S \Lambda S^{-1} S \Lambda S^{-1} \cdots S \Lambda S^{-1} = S \Lambda^r S^{-1}$$

This implies the long-run steady-state is determined by the largest eigenvalue (if $\lambda_{max} = 1$) and in the direction of its eigenvector and inverse eigenvector (if the remaining $\lambda$s are less than one in absolute value then $\lambda_i^r$ goes to zero and their corresponding eigenvectors' influence on direction dies out). That is, for large $r$,

$$\left(P^T\right)^r \approx S_1 1^r S_1^{-1}$$

where $S_1$ denotes the eigenvector (column vector) corresponding to the unit eigenvalue and $S_1^{-1}$ denotes the corresponding inverse eigenvector (row vector of $S^{-1}$). Since one is the largest eigenvalue of $P^T$, after a large number of iterations $\pi_0 P^r$ converges to $1 \times \pi = \pi$. Hence, after many iterations the Markov chain produces draws from a stationary distribution if the chain is irreducible.

Time reversibility and stationarity

A related property, time reversibility, is more useful when working with more general state space chains. Time reversibility says that if we reverse the order of a Markov chain, the resulting chain has the same transition behavior. First, we show the reverse chain is Markovian if the forward chain is Markovian, then we relate the forward and reverse chain transition probabilities, and finally, we show that time reversibility implies $\pi_i p_{ij} = \pi_j p_{ji}$ and this implies $\pi P = \pi$ where $\pi$ is the stationary distribution for the chain. The reverse transition probability (by Bayesian "updating"), with $i \equiv i_1$, is

$$p^*_{ij} \equiv \Pr\left(\theta_r = \theta^j \mid \theta_{r+1} = \theta^{i_1}, \theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T}\right)$$

$$= \frac{\Pr\left(\theta_r = \theta^j, \theta_{r+1} = \theta^{i_1}, \theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T}\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}, \theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T}\right)}$$

$$= \frac{\Pr\left(\theta_r = \theta^j\right) \Pr\left(\theta_{r+1} = \theta^{i_1} \mid \theta_r = \theta^j\right) \Pr\left(\theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T} \mid \theta_r = \theta^j, \theta_{r+1} = \theta^{i_1}\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}\right) \Pr\left(\theta_{r+2} = \theta^{i_2}, \ldots, \theta_{r+T} = \theta^{i_T} \mid \theta_{r+1} = \theta^{i_1}\right)}$$

Since the forward chain is Markovian, the last factor in the numerator does not depend on $\theta_r$ and cancels with the last factor in the denominator. This simplifies as

$$p^*_{ij} = \frac{\Pr\left(\theta_r = \theta^j\right) \Pr\left(\theta_{r+1} = \theta^{i_1} \mid \theta_r = \theta^j\right)}{\Pr\left(\theta_{r+1} = \theta^{i_1}\right)}$$

The reverse chain is Markovian. Let $P^*$ represent the transition matrix for the reverse chain, then the above says

$$p^*_{ij} = \frac{\pi_j p_{ji}}{\pi_i}$$

By definition, time reversibility implies $p^*_{ij} = p_{ij}$. Hence, time reversibility implies

$$\pi_i p_{ij} = \pi_j p_{ji}$$

Time reversibility says the likelihood of being in state $i$ and transitioning to $j$ is equal to the likelihood of being in state $j$ and transitioning to $i$. The above implies if a chain is reversible with respect to a distribution $\pi$ then $\pi$ is the stationary distribution of the chain. To see this, sum both sides of the above relation over $i$

$$\sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j \cdot 1, \quad j = 1, 2, \ldots, d$$

In matrix notation, we have

$$\pi P = \pi$$

so $\pi$ is the stationary distribution of the chain.

Continuous state spaces

Continuous state spaces are analogous to discrete state spaces but with a few additional technical details. Transition probabilities are defined in reference to sets rather than the singletons $\theta^i$. For example, for a set $A \subseteq S$ the chain is defined in terms of the probabilities of the set given the value of the chain on the previous iteration, $\theta$. That is, the kernel of the chain, $K(\theta, A)$, is the probability of set $A$ given the chain is at $\theta$ where

$$K(\theta, A) = \int_A p(\theta, \phi)\, d\phi$$

$p(\theta, \phi)$ is a density function for $\phi$ with $\theta$ given and $p(\cdot, \cdot)$ is the transition or generator function of the kernel. An invariant or stationary distribution with density $\pi(\cdot)$ implies

$$\int_A \pi(\phi)\, d\phi = \int_S K(\theta, A)\, \pi(\theta)\, d\theta = \int_S \left[\int_A p(\theta, \phi)\, d\phi\right] \pi(\theta)\, d\theta$$

Time reversibility in the continuous space case implies

$$\pi(\theta)\, p(\theta, \phi) = \pi(\phi)\, p(\phi, \theta)$$

And, irreducibility in the continuous state case is satisfied for a chain with kernel $K$ with respect to $\pi(\cdot)$ if every set $A$ with positive probability can be reached with positive probability after a finite number of iterations. In other words, if $\int_A \pi(\theta)\, d\theta > 0$ then there exists $n \geq 1$ such that $K^n(\theta, A) > 0$. With continuous state spaces, irreducibility and time reversibility produce a stationary distribution of the chain as with discrete state spaces.
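To make the link between time reversibility and stationarity concrete, the sketch below builds a transition matrix that is reversible with respect to a chosen discrete distribution and then checks both conditions numerically. The construction (propose one of the other states uniformly, accept with probability $\min\{1, \pi_j / \pi_i\}$) is one convenient way to obtain a reversible chain and is an assumption of this example, as is the hypothetical target `pi`.

```python
def reversible_matrix(pi):
    """Transition matrix satisfying pi_i p_ij = pi_j p_ji for the given pi.

    Off-diagonal moves: propose each other state with probability 1/(d-1),
    accept with probability min(1, pi_j / pi_i). The diagonal absorbs the
    remaining probability so each row sums to one.
    """
    d = len(pi)
    P = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if i != j:
                P[i][j] = min(1.0, pi[j] / pi[i]) / (d - 1)
        P[i][i] = 1.0 - sum(P[i])  # diagonal is still zero when summed here
    return P


# Hypothetical target (stationary) distribution.
pi = [0.2, 0.5, 0.3]
P = reversible_matrix(pi)
d = len(pi)

# Largest violation of the reversibility condition pi_i p_ij = pi_j p_ji.
balance_gap = max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
                  for i in range(d) for j in range(d))

# One transition starting from pi; stationarity requires pi P = pi.
pi_next = [sum(pi[i] * P[i][j] for i in range(d)) for j in range(d)]
```

Summing the reversibility condition over $i$, as in the text, is exactly what makes `pi_next` equal to `pi` here.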
Next, we briefly discuss application of these Markov chain concepts to two popular McMC strategies: the Gibbs sampler and the Metropolis-Hastings (MH) algorithm. The Gibbs sampler is a special case of MH and somewhat simpler so we review it first.

4.3.1 Gibbs sampler

Suppose we cannot derive $p(\theta \mid Y)$ in closed form (it does not have a standard probability distribution) but we are able to identify the set of conditional posterior distributions. We can utilize the set of full conditional posterior distributions to draw dependent samples for parameters of interest via McMC simulation. For the full set of conditional posterior distributions

$$p(\theta_1 \mid \theta_{-1}, Y), \; \ldots, \; p(\theta_k \mid \theta_{-k}, Y)$$

draws are made for $\theta_1$ conditional on starting values for parameters other than $\theta_1$, that is $\theta_{-1}$. Then, $\theta_2$ is drawn conditional on the $\theta_1$ draw and the starting values for the remaining $\theta$. Next, $\theta_3$ is drawn conditional on the draws for $\theta_1$ and $\theta_2$ and the starting values for the remaining $\theta$. This continues until all $\theta$ have been sampled. Then the sampling is repeated for a large number of draws with parameters updated each iteration by the most recent draw. For example, the procedure for a Gibbs sampler involving two parameters is

1. select a starting value for $\theta_2$,
2. draw $\theta_1$ from $p(\theta_1 \mid \theta_2, Y)$ utilizing the starting value for $\theta_2$,
3. draw $\theta_2$ from $p(\theta_2 \mid \theta_1, Y)$ utilizing the previous draw for $\theta_1$,
4. repeat until a converged sample based on the marginal posteriors is obtained.

The samples are dependent. Not all samples will be from the posterior; only after a finite (but unknown) number of iterations are draws from the marginal posterior distribution (see Gelfand and Smith [1990]). (Note, in general, $p(\theta_1, \theta_2 \mid Y) \neq p(\theta_1 \mid \theta_2, Y)\, p(\theta_2 \mid \theta_1, Y)$.) Convergence is usually checked using trace plots, burn-in iterations, and other convergence diagnostics.
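The two-parameter procedure above can be sketched for a case where both full conditionals are known. As an assumption made purely for illustration (it is not an example from these notes), the target is a standard bivariate normal with correlation `rho`, for which $\theta_1 \mid \theta_2 \sim N(\rho\,\theta_2, 1 - \rho^2)$ and symmetrically for $\theta_2 \mid \theta_1$. The function name, starting value, and burn-in length are hypothetical choices.

```python
import random
from statistics import fmean, pvariance


def gibbs_bivariate_normal(rho, n_draws, burn_in, start=0.0, seed=0):
    """Gibbs draws from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5   # conditional standard deviation
    theta2 = start                       # step 1: starting value for theta_2
    draws = []
    for it in range(burn_in + n_draws):
        theta1 = rng.gauss(rho * theta2, cond_sd)  # step 2: theta_1 | theta_2
        theta2 = rng.gauss(rho * theta1, cond_sd)  # step 3: theta_2 | theta_1
        if it >= burn_in:                          # discard burn-in draws
            draws.append((theta1, theta2))
    return draws


# Deliberately poor starting value; its influence should die out in burn-in.
draws = gibbs_bivariate_normal(rho=0.8, n_draws=20000, burn_in=1000, start=10.0)
t1 = [d[0] for d in draws]
t2 = [d[1] for d in draws]

mean1, mean2 = fmean(t1), fmean(t2)
var1, var2 = pvariance(t1), pvariance(t2)
cov12 = fmean([(a - mean1) * (b - mean2) for a, b in zip(t1, t2)])
corr12 = cov12 / (var1 * var2) ** 0.5
```

The retained draws are serially dependent, so summary statistics are less precise than they would be for the same number of independent draws; here the marginal means, variances, and correlation should nonetheless be close to 0, 1, and 0.8.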
Model specification includes convergence checks, sensitivity to starting values and possibly prior distribution and likelihood assignments, comparison of draws from the posterior predictive distribution with the observed sample, and various goodness of fit statistics.

4.3.2 Metropolis-Hastings algorithm

If neither some conditional posterior, $p(\theta_j \mid Y, \theta_{-j})$, nor the marginal posterior, $p(\theta \mid Y)$, is recognizable, then we may be able to employ the Metropolis-Hastings (MH) algorithm. The Gibbs sampler is a special case of the MH algorithm. The random walk Metropolis algorithm is most common and outlined next. We wish to draw from $p(\theta \mid \cdot)$ but we only know $p(\theta \mid \cdot)$ up to a constant of proportionality, $p(\theta \mid \cdot) = c f(\theta \mid \cdot)$ where $c$ is unknown. The random walk Metropolis algorithm for one parameter is as follows.

1. Let $\theta^{(k-1)}$ be a draw from $p(\theta \mid \cdot)$.
2. Draw $\theta^*$ from $N\left(\theta^{(k-1)}, s^2\right)$ where $s^2$ is fixed.
3. Let $\alpha = \min\left\{1, \frac{p(\theta^* \mid \cdot)}{p(\theta^{(k-1)} \mid \cdot)} = \frac{c f(\theta^* \mid \cdot)}{c f(\theta^{(k-1)} \mid \cdot)}\right\}$.
4. Draw $z$ from $U(0, 1)$.
5. If $z < \alpha$ then $\theta^{(k)} = \theta^*$, otherwise $\theta^{(k)} = \theta^{(k-1)}$. In other words, with probability $\alpha$ set $\theta^{(k)} = \theta^*$, and otherwise set $\theta^{(k)} = \theta^{(k-1)}$.16

These draws converge to random draws from the marginal posterior distribution after a burn-in interval if properly tuned. Tuning the Metropolis algorithm involves selecting $s^2$ (jump size) so that the parameter space is explored appropriately. Usually, smaller jump size results in more accepts and larger jump size results in fewer accepts. If $s^2$ is too small, the Markov chain will not converge quickly, has more serial correlation in the draws, and may get stuck at a local mode (multi-modality can be a problem). If $s^2$ is too large, the Markov chain will move around too much and not be able to thoroughly explore areas of high posterior probability. Of course, we desire concentrated samples from the posterior distribution.
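The five steps above can be sketched as follows. The acceptance test is written on the log scale (accept when $\log z < \log f(\theta^*) - \log f(\theta^{(k-1)})$), which is equivalent to step 5 but avoids numerical underflow. The target here, $f(\theta) \propto \exp\{-(\theta - 2)^2 / 2\}$, is a hypothetical stand-in for a posterior known only up to a constant, and the function name and tuning value are assumptions of this example. Note that `s` is the proposal standard deviation, so $s^2$ in the text corresponds to `s ** 2`.

```python
import math
import random
from statistics import fmean, pvariance


def rw_metropolis(log_f, start, s, n_draws, burn_in, seed=0):
    """Random walk Metropolis for one parameter.

    log_f   : log of the target density up to an additive constant
    s       : standard deviation of the normal random walk proposal
    Returns the post-burn-in draws and the post-burn-in acceptance rate.
    """
    rng = random.Random(seed)
    theta = start
    draws, accepts = [], 0
    for k in range(burn_in + n_draws):
        proposal = rng.gauss(theta, s)                      # step 2
        log_alpha = min(0.0, log_f(proposal) - log_f(theta))  # step 3 (logs)
        log_z = math.log(1.0 - rng.random())                # step 4, z in (0, 1]
        accepted = log_z < log_alpha                        # step 5
        if accepted:
            theta = proposal
        if k >= burn_in:
            draws.append(theta)
            accepts += accepted
    return draws, accepts / n_draws


def log_target(theta):
    """Hypothetical unnormalized log target: normal with mean 2, variance 1."""
    return -0.5 * (theta - 2.0) ** 2


draws, accept_rate = rw_metropolis(log_target, start=0.0, s=2.4,
                                   n_draws=20000, burn_in=1000)
post_mean = fmean(draws)
post_var = pvariance(draws)
```

Rerunning with a much smaller or much larger `s` is a quick way to see the tuning trade-off described above: the acceptance rate moves toward one or toward zero, and in both cases serial correlation in the draws rises.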
A commonly-employed rule of thumb is to target an acceptance rate of around 30% (20-80% is usually considered “reasonable”).17 The above procedure describes the algorithm for a single parameter or vector of parameters. A general K parameter Metropolis-Hastings algorithm works similarly (see Train [2002], p. 305).

1. Start with a value $\beta_n^0$.
2. Draw $K$ independent values from a standard normal density, and stack the draws into a vector labeled $\eta^1$.
3. Create a trial value of $\beta_n^1 = \beta_n^0 + \sigma \Gamma \eta^1$ where $\sigma$ is the researcher-chosen jump size parameter, $\Gamma$ is the Cholesky factor of $W$ such that $\Gamma \Gamma^T = W$. Note the proposal distribution is specified to be normal with zero mean and variance $\sigma^2 W$.
4. Draw a standard uniform variable $\mu^1$.
5. Calculate the ratio $F = \frac{L\left(y_n \mid \beta_n^1\right) \phi\left(\beta_n^1 \mid b, W\right)}{L\left(y_n \mid \beta_n^0\right) \phi\left(\beta_n^0 \mid b, W\right)}$ where $L\left(y_n \mid \beta_n^1\right)$ is a product of logits, and $\phi\left(\beta_n^1 \mid b, W\right)$ is the normal density.
6. If $\mu^1 \leq F$, accept $\beta_n^1$; if $\mu^1 > F$, reject $\beta_n^1$ and let $\beta_n^1 = \beta_n^0$.
7. Repeat the process many times, adjusting the tuning parameters if necessary. For sufficiently large $t$, $\beta_n^t$ is a draw from the marginal posterior.

4.3.3 Missing data augmentation

One of the many strengths of McMC approaches is their flexibility for dealing with missing data. Missing data is a common characteristic plaguing limited dependent variable models like discrete choice and selection. As a prime example, we next discuss Albert and Chib’s McMC data augmentation approach to discrete choice modeling. Later we’ll explore McMC data augmentation of selection models.

Albert and Chib’s Gibbs sampler Bayes’ probit

The challenge with discrete choice models (like probit) is that latent utility is unobservable, rather the analyst observes only discrete (usually binary) choices.18 Albert & Chib [1993] employ Bayesian data augmentation to “supply” the latent variable. Hence, parameters of a probit model are estimated via normal Bayesian regression (see earlier discussion in this chapter). Consider the latent utility model

$$U_D = W\theta - V$$

where binary choice, $D$, is observed.

$$D = \begin{cases} 1 & U_D > 0 \\ 0 & U_D < 0 \end{cases}$$

The conditional posterior distribution for $\theta$ is

$$p(\theta \mid D, W, U_D) \sim N\left(b_1, \left(Q^{-1} + W^T W\right)^{-1}\right)$$

where

$$b_1 = \left(Q^{-1} + W^T W\right)^{-1} \left(Q^{-1} b_0 + W^T W b\right)$$

$$b = \left(W^T W\right)^{-1} W^T U_D$$

$b_0$ = prior means for $\theta$ and $Q = \left(W_0^T W_0\right)^{-1}$ is the prior for the covariance. The conditional posterior distributions for the latent variables are

$$p(U_D \mid D = 1, W, \theta) \sim N(W\theta, I \mid U_D > 0) \text{ or } TN_{(0, \infty)}(W\theta, I)$$

$$p(U_D \mid D = 0, W, \theta) \sim N(W\theta, I \mid U_D \leq 0) \text{ or } TN_{(-\infty, 0)}(W\theta, I)$$

where $TN(\cdot)$ refers to random draws from a truncated normal (truncated below for the first and truncated above for the second). Iterative draws for $(U_D \mid D, W, \theta)$ and $(\theta \mid D, W, U_D)$ form the Gibbs sampler. Interval estimates of $\theta$ are supplied by post-convergence draws of $(\theta \mid D, W, U_D)$. For simulated normal draws of the unobservable portion of utility, $V$, this Bayes’ augmented data probit produces remarkably similar inferences to MLE.19

16 A modification of the RW Metropolis algorithm works on the log scale: it sets $\theta^{(k)} = \theta^*$ when $\log(z) < \alpha$ where $\alpha = \min\left\{0, \log\left[f(\theta^* \mid \cdot)\right] - \log\left[f\left(\theta^{(k-1)} \mid \cdot\right)\right]\right\}$.
17 Gelman, et al [2004] report the optimal acceptance rate is 0.44 when the number of parameters K = 1 and drops toward 0.23 as K increases.
18 See Accounting and causal effects: econometric challenges, chapter 5 for a discussion of discrete choice models.
19 An efficient algorithm for this Gibbs sampler probit, rbprobitGibbs, is available in the bayesm package of R (http://www.r-project.org/), the open source statistical computing project. Bayesm is a package written to complement Rossi, Allenby, and McCulloch [2005].
20 See the second appendix to these notes for a brief discussion of ML estimation of discrete choice models.

Probit example

We compare ML (maximum likelihood) estimates20 with Gibbs sampler McMC data augmentation probit estimates for a simple discrete choice problem. In particular, we return to the choice (or selection) equation referred to in the
63 illustration of the control function strategy for identifying treatment effects of the projection chapter. The variables (choice and instruments) are D 1 1 1 0 0 0 Z1 1 0 1 1 0 1 Z2 0 3 0 0 0 2 Z3 0 1 0 0 1 0 Z4 0 1 0 2 0 0 Z5 1 2 1 0 0 0 The above data are a representative sample. To mitigate any small sample bias, we repeat this sample 20 times (n = 120).21 ML estimates (with standard errors in parentheses below the estimates) are E [UD | Z] = 0.6091 Z1 + 0.4950 Z2 0.1525 Z3 0.7233 Z4 + 0.2283 Z5 (0.2095) (0.1454) (0.2618) (0.1922) (0.1817) (Z ) The model has only modest explanatory power (pseudo-R2 = 1 = 11.1%, ( 0 ) where Z is the log-likelihood for the model and 0 is the log-likelihood with a constant only). However, recall this selection equation works perfectly as a control function in the treatment effect example where high explanatory power does not indicate an adequate model specification (see projections chapter). Now, we compare the ML results with McMC data augmentation and the Gibbs sampler probit discussed previously. Statistics from 10, 000 posterior draws following 1, 000 burn-in draws are tabulated below based on the n = 120 sample. statistic 1 2 3 4 5 mean 0.6225 0.5030 0.1516 0.7375 0.2310 median 0.6154 0.5003 0.1493 0.7336 0.2243 standard deviation 0.2189 0.1488 0.2669 0.2057 0.1879 quantiles: 0.025 1.0638 0.2236 0.6865 1.1557 0.1252 0.25 0.7661 0.4009 0.3286 0.8720 0.1056 0.75 0.4713 0.6007 0.02757 0.5975 0.3470 0.975 0.1252 0.1056 0.2243 0.3549 0.6110 Sample statistics for data augmented Gibbs McMC probit posterior draws DGP : UD = Z + , Z = Z1 Z2 Z3 Z4 Z5 As expected, the means, medians, and standard errors of the McMC probit estimates correspond quite well with ML probit estimates. 21 Comparison of estimates based on n = 6 versus n = 120 samples produces no difference in ML parameter estimates but substantial difference in the McMC estimates. 
The n = 6 McMC estimates are typically larger in absolute value compared to their n = 120 counterparts. This tends to exagerate heterogeneity in outcomes if we reconnect with the treatment effect examples. The remainder of this discussion focuses on the n = 120 sample. 64 Logit example Next, we apply logistic regression (logit for short) to the same (n = 120) data set. We compare MLE results with two McMC strategies: (1) logit estimated via a random walk Metropolis-Hastings (MH) algorithm without data augmentation and (2) a uniform data augmented Gibbs sampler logit. Random walk MH for logit The random walk MH algorithm employs a standard binary discrete choice model exp ZiT (Di | Zi ) Bernoulli 1 + exp ZiT The default tuning parameter, s2 = 0.25, produces an apparently satisfactory MH acceptance rate of 28.6%. Details are below. We wish to draw from the posterior Pr ( | D, Z) p () ( | D, Z) where the log likelihood is exp ZiT exp ZiT + (1 Di ) log 1 ( | D, Z) = Di log 1 + exp ZiT 1 + exp ZiT i=1 n For Z other than a constant, there is no prior, p (), which produces a well known posterior, Pr ( | D, Z), for the logit model. This makes the MH algorithm attractive. The MH algorithm builds a Markov chain (the current draw depends on only the previous draw) such that eventually the influence of initial values dies out and draws are from a stable, approximately independent distribution. The MH algorithm applied to the logit model is as follows. 1. Initialize the vector 0 at some value. 2. Define a proposal generating density, q , k1 for draw k {1, 2, . . . , K}. The random walk MH chooses a convenient generating density. = k1 + , N 0, 2 I In other words, for each parameter, j , q j , k1 = j 1 exp 2 j k1 j 2 2 2 3. Draw a vector, from N k1 , 2 I . Notice, for the random walk, the tuning parameter, 2 , is the key. 
If 2 is chosen too large, then the algorithm will reject the proposal draw frequently and will converge slowly, If 2 is chosen too small, then the algorithm will accept the proposal draw frequently but may fail to fully explore the parameter space and may fail to discover the convergent distribution. 65 4. Calculate where Pr( |D,Z)q ( , k1 ) min 1, Pr k1 |D,Z q k1 , ( )( ) = 1 Pr k1 | D, Z q k1 , > 0 Pr k1 | D, Z q k1 , = 0 The core of the MH algorithm is that the ratio eliminates the problematic normalizing constant for the posterior (normalization is problematic since we don’t recognize the posterior). The convenience MH enters here of therandom walk k1 k1 as, by symmetry of the normal, q , =q , and the calculation of simplifies as = q ( , k1 ) q ( k1 , ) min 1, drops out. Hence, we calculate Pr( |D,Z) Pr( k1 |D,Z ) 1 Pr k1 | D, Z q k1 , > 0 Pr k1 | D, Z q k1 , = 0 5. Draw U from a Uniform(0, 1). If U < , set k = , otherwise set k = k1 . In other words, with probability accept the proposal draw, . 6. Repeat K times until the distribution converges. Uniform Gibbs sampler for logit On the other hand, the uniform data augmented Gibbs sampler logit specifies a complete set of conditional posteriors developed as follows. Let 1 with probability i Di = , i = 1, 2, . . . , n 0 with probability 1 i where i = exp[ZiT ] 1+exp[ZiT ] = FV ZiT , or log i 1 i = ZiT , and FV ZiT is the cumulative distribution function of the logistic random variable V . Hence, i = exp[ZiT ] Pr U < 1+exp Z T where U has a uniform(0, 1) distribution. Then, given the pri[ i ] ors for , p (), the joint posterior for the latent variable u = (u1 , u2 , . . . , un ) and given the data D and Z is exp[ZiT ] n I u I (D = 1) i i 1+exp[ZiT ] Pr (, u | D, Z) p () I (0 ui 1) exp[ZiT ] i=1 I (D = 0) +I ui > log 1+exp Z i [ iT ] where I (X A) is an indicator function that equals one if X A, and zero otherwise. 
Thus, the conditional posterior for the latent (uniform) variable u is exp[ZiT ] U nif orm 0, 1+exp Z T if Di = 1 [ i ] Pr (ui | , D, Z) T exp[Zi ] U nif orm 1+exp Z ,1 if Di = 0 [ iT ] 66 Since the joint posterior can be written n I Z T log ui I (D = 1) i 1ui i I (0 ui 1) Pr (, u | D, Z) p () +I Z T < log ui I (Di = 0) i i=1 1ui we have 5 j=1 so Zij j log ui 1ui if Di = 1 1 ui log Zij j Zik 1 ui j=k for all samples for which Di = 1 and Zik > 0, as well as for all samples for which Di = 0 and Zik < 0. Similarly, 1 ui k < log Zij j Zik 1 ui j=k for all samples for which Di =1 and Zik > 0, as well as for all samples for which Di = 0 and Zik < 0.22 Let Ak and Bk be the sets defined by the above, that is, Ak = {i : ((Di = 1) (Zik > 0)) ((Di = 0) (Zik < 0))} and Bk = {i : ((Di = 0) (Zik > 0)) ((Di = 1) (Zik < 0))} A diffuse prior p () 1 combined with the above gives the conditional posterior for k , k = 1, 2, . . . , 5, given the other ’s and latent variable, u. p (k | k , u, D, Z) U nif orm (ak , bk ) where k is a vector of parameters except k , 1 ui ak = max log Zij j iAk Zik 1 ui j=k and 1 u i log bk = min Zij j iBk Zik 1 ui j=k The Gibbs sampler is implemented by drawing n values of u in one block conditional on and the data, D, Z. The elements of are drawn successively, each conditional on u, the remaining parameters, k , and the data, D, Z. 22 If Zik = 0, the observation is ignored as k is determined by the other regressor values. 67 ML logit estimates (with standard errors in parentheses below the estimates) are E [UD | Z] = 0.9500 Z1 + 0.7808 Z2 0.2729 Z3 1.1193 Z4 + 0.3385 Z5 (0.3514) (0.2419) (0.4209) (0.3250) (0.3032) Logit results are proportional to the probit results (approximately 1.5 times the probit estimates), as is typical. As with the probit model, the logit model has modest explana (Z ) is the log-likelihood for tory power (pseudo-R2 = 1 = 10.8%, where Z ( 0 ) the model and 0 is the log-likelihood with a constant only). 
Now, we compare the ML results with McMC posterior draws. Statistics from 10, 000 posterior MH draws following 1, 000 burn-in draws are tabulated below based on the n = 120 sample. statistic 1 2 3 4 5 mean 0.9850 0.8176 0.2730 1.1633 0.3631 median 0.9745 0.8066 0.2883 1.1549 0.3440 standard deviation 0.3547 0.2426 0.4089 0.3224 0.3069 quantiles: 0.025 1.7074 0.3652 1.0921 1.7890 0.1787 0.25 1.2172 0.6546 0.5526 1.3793 0.1425 0.75 0.7406 0.9787 0.0082 0.9482 0.5644 0.975 0.3134 1.3203 0.5339 0.5465 0.9924 Sample statistics for MH McMC logit posterior draws DGP : UD = Z + , Z = Z1 Z2 Z3 Z4 Z5 Statistics from 10, 000 posterior data augmented uniform Gibbs draws following 40, 000 burn-in draws23 are tabulated below based on the n = 120 sample. statistic 1 2 3 4 5 mean 1.015 0.8259 0.3375 1.199 0.3529 median 1.011 0.8126 0.3416 1.2053 0.3445 standard deviation 0.3014 0.2039 0.3748 0.2882 0.2884 quantiles: 0.025 1.6399 0.3835 1.1800 1.9028 0.2889 0.25 1.2024 0.6916 0.5902 1.3867 0.1579 0.75 0.8165 0.9514 0.0891 1.0099 0.5451 0.975 0.4423 1.2494 0.3849 0.6253 0.9451 Sample statistics for uniform Gibbs McMC logit posterior draws DGP : UD = Z + , Z = Z1 Z2 Z3 Z4 Z5 As expected, the means, medians, and standard errors of the McMC logit estimates correspond well with each other and the ML logit estimates. Now that we’ve developed McMC data augmentation for the choice or selection equation, we return to the discussion of causal effects (initiated in the projections notes) and discuss data augmentation for the counterfactuals as well as latent utility. 23 Convergence to marginal posterior draws is much slower with this algorithm. 
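For readers who want to experiment with the data augmentation idea before moving on, here is a minimal sketch of an Albert and Chib style Gibbs sampler for a probit with a single regressor and no intercept. Several simplifications are assumptions of this example rather than features of the analysis above: the prior on $\theta$ is flat (so the conditional posterior mean is the least squares coefficient of latent utility on the regressor), the data are simulated with a hypothetical true value $\theta = 1$, and truncated normal draws are produced by inverting the normal CDF with the standard library's `NormalDist`.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def draw_positive_normal(mean, rng):
    """Draw from N(mean, 1) truncated to (0, inf) by inverting the CDF."""
    lower = STD_NORMAL.cdf(-mean)             # Pr(N(mean, 1) <= 0)
    p = lower + (1.0 - lower) * rng.random()  # uniform on [lower, 1)
    p = min(max(p, 1e-12), 1.0 - 1e-12)       # keep inv_cdf argument in (0, 1)
    return mean + STD_NORMAL.inv_cdf(p)


def probit_gibbs(D, W, n_draws, burn_in, seed=0):
    """Data augmented Gibbs sampler for a one-regressor probit, flat prior."""
    rng = random.Random(seed)
    sww = sum(w * w for w in W)
    theta = 0.0                                # starting value
    draws = []
    for it in range(burn_in + n_draws):
        # Augmentation step: latent utility given theta and observed choice.
        # For D = 0 the draw is truncated above at zero, obtained by
        # reflecting a draw truncated below at zero.
        U = [draw_positive_normal(w * theta, rng) if d == 1
             else -draw_positive_normal(-w * theta, rng)
             for d, w in zip(D, W)]
        # Regression step: theta given latent utility (unit error variance).
        b = sum(w * u for w, u in zip(W, U)) / sww
        theta = rng.gauss(b, (1.0 / sww) ** 0.5)
        if it >= burn_in:
            draws.append(theta)
    return draws


# Simulated (hypothetical) data: one regressor, true coefficient 1.0.
data_rng = random.Random(1)
W = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
D = [1 if w * 1.0 + data_rng.gauss(0.0, 1.0) > 0 else 0 for w in W]

draws = probit_gibbs(D, W, n_draws=1000, burn_in=500)
posterior_mean = fmean(draws)
```

With several regressors the regression step becomes a multivariate normal draw with mean $b_1$ and variance $\left(Q^{-1} + W^T W\right)^{-1}$; the scalar version is used here only to keep the sketch dependency-free.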
68 5 Treatment effects and counterfactuals Suppose we observe treatment or no treatment and the associated outcome, Y = DY1 + (1 D) Y0 , where Y1 Y0 = 1 + V1 = 0 + V0 and a representative sample is Y 15 14 13 13 14 15 D 1 1 1 0 0 0 Y1 15 14 13 11 10 9 Y0 9 10 11 13 14 15 V1 3 2 1 1 2 3 V0 3 2 1 1 2 3 Further, we have the following instruments at our disposal Z = where their representative values are Z1 5 6 0 0 0 1 Z2 4 5 0 0 1 0 Z3 3 4 0 1 0 0 Z1 Z2 Z3 Z4 Z4 1 2 1 0 0 0 and we perceive latent utility, EU , to be related to choice via the instruments. EU = Z + VD and observed choice is D= 1 0 EU > 0 otherwise This is the exact setup we discussed earlier in the projections analysis. 5.1 Gibbs sampler for treatment effects There are three sources of missing data: latent utility, EU , counterfactuals for individuals who choose treatment, (Y0i | Di = 1), and counterfactuals for individuals who choose no treatment, (Y1i | Di = 0). Bayesian data augmentation effectively models these missing data processes (as in, for example, Albert and Chib’s McMC probit) by drawing in sequence from the conditional posterior distributions — a Gibbs sampler. Define the complete or augmented data as T ri = Di Di Yi + (1 Di ) Yimiss Di Yimiss + (1 Di ) Yi 69 Also, let Zi Hi = 0 0 and 0 Xi 0 where X = , a vector of ones. 5.1.1 = 1 0 0 0 Xi Full conditional posterior distributions Let x denote all parameters other than x. 
The full conditional posteriors for the augmented outcome data are Yimiss | Yimiss , Data N ((1 Di ) µ1i + Di µ0i , (1 Di ) 1i + Di 0i ) where standard multivariate normal theory (see the appendix) is applied to derive means and variances conditional on the draw for latent utility and the other outcome24 µ1i = Xi 1 + 20 D1 10 D0 10 D1 D0 (Di Zi ) + (Yi Xi 0 ) 20 2D0 20 2D0 µ0i = Xi 0 + 21 D0 10 D1 10 D1 D0 (Di Zi ) + (Yi Xi 1 ) 21 2D1 21 2D1 1i = 21 0i = 20 2D1 20 2 10 D1 D0 + 210 20 2D0 2D0 21 2 10 D1 D0 + 210 21 2D1 Similarly, the conditional posterior for the latent utility is T N(0,) µDi D Di | Di , Data T N(,0) µDi D if Di = 1 if Di = 0 where T N (·) refers to the truncated normal distribution with support indicated via the subscript and the arguments are parameters of the untruncated distribution. Applying multivariate normal theory for (Di | Yi ) we have µDi 2 D1 10 D0 = Zi + Di Yi + (1 Di ) Yimiss Xi 1 0 2 2 1 0 210 2 D0 10 D1 + Di Yimiss + (1 Di ) Yi Xi 0 1 2 2 1 0 210 24 Technically, 10 is unidentified (i.e., even with unlimited data we cannot "observe" the parameter). However, we can employ restrictions derived through the positive-definiteness (see the appendix) of the variance-covariance matrix, , to impose bounds on the parameter, 10 . If treatment effects are overly sensitive this strategy will prove ineffective; otherwise, it allows us to proceed from observables to treatment effects via augmentation of unobservables (the counterfactuals as well as latent utility). 70 D = 1 2D1 20 2 10 D1 D0 + 2D0 21 21 20 210 The conditional posterior distribution for the parameters is | , Data N µ , where by the SUR (seemingly-unrelated regression) generalization of Bayesian regression (see the appendix at the end of these notes) 1 µ = H T 1 In H + V1 H T 1 In r + V1 0 1 = H T 1 In H + V1 and the prior distribution is p () N ( 0 , V ). 
The conditional distribution for the trivariate variance-covariance matrix is | , Data G1 where G W ishart (n + , S + R) with prior p (G) W ishart (, R), and S = n i=1 T (ri Hi ) (ri Hi ) . As usual, starting values for the Gibbs sampler are varied to test convergence of parameter posterior distributions. 5.1.2 Nobile’s algorithm Recall 2D is normalized to one. This creates a slight complication as the conditional posterior is no longer inverse-Wishart. Nobile [2000] provides a convenient algorithm for random Wishart (multivariate 2 ) draws with a restricted element. The algorithm applied to the current setting results in the following steps: 1. Exchange rows and columns one and three in S + R, call this matrix V . T 2. Find L such that V = L1 L1 . 3. Construct a lower triangular matrix A with a. aii equal to the square root of 2 random variates, i = 1, 2. 1 where l33 is the third row-column element of L. b. a33 = l33 c. aij equal to N (0, 1) random variates, i > j. T 1 T 1 1 4. Set V = L1 A A L . 5. Exchange rows and columns one and three in V and denote this draw . 71 5.1.3 Prior distributions Li, Poirier, and Tobias choose relatively diffuse priors such that the data dominates the posterior distribution. Their prior distribution for is p () N ( 0 , V ) where 0 = 0, V = 4I and their prior for 1 is p (G) W ishart (, R) where = 12 1 1 1 and R is a diagonal matrix with elements 12 , 4, 4 . 5.2 Marginal and average treatment effects The marginal treatment effect is the impact of treatment for individuals who are indifferent between treatment and no treatment. We can employ Bayesian data augmentationbased estimation of marginal treatment effects (MTE) as data augmentation generates repeated draws for unobservables, VDj , (Y1j | Dj = 0), and (Y0j | Dj = 1). Now, exploit these repeated samples to describe the distribution for M T E (uD ) where VD is transformed to uniform (0, 1), uD = pv . 
For each draw, VD = v, we determine the cumulative probability, uD = (v),25 and calculate M T E (uD ) = E [Y1 Y0 | uD ]. If M T E (uD ) is constant for all uD , then all treatment effects are alike. MTE can be connected to standard population-level treatment effects, ATE, ATT, and ATUT, via non-negative weights whose sum is one (assuming full support) n j=1 I (uD ) wAT E (uD ) = n n j=1 I (uD ) Dj n wAT T (uD ) = j=1 Dj n j=1 I (uD ) (1 Dj ) n wAT U T (uD ) = j=1 (1 Dj ) where probabilities pk refer to bins from 0 to 1 by increments of 0.01 for indicator variable I (uD ) = 1 uD = pk I (uD ) = 0 uD = pk Hence, MTE-estimated average treatment effects are estAT E (M T E) estAT T (M T E) estAT U T (M T E) = = = n i=1 n i=1 n wAT E (uD ) M T E (uD ) wAT T (uD ) M T E (uD ) wAT U T (uD ) M T E (uD ) i=1 Next, we apply these data augmentation ideas to the causal effects example and estimate the average treatment effect on the treated (ATT), the average treatment effect on the untreated (ATUT), and the average treatment effect (ATE). 25 (·) is a cumulative probability distribution function. 72 5.3 Return to the treatment effect example Initially, we employ Bayesian data augmentation via a Gibbs sampler on the treatment effect problem outlined above. Recall this example was employed in the projections notes to illustrate where the inverse-Mills ratios control functions strategy based on the full complement of instruments26 was exceptionally effective. The representative sample is Y 15 14 13 13 14 15 D 1 1 1 0 0 0 Y1 15 14 13 11 10 9 Y0 9 10 11 13 14 15 Z1 5 6 0 0 0 1 Z2 4 5 0 0 1 0 Z3 3 4 0 1 0 0 Z4 1 2 1 0 0 0 which is repeated 200 times to create a sample of n = 1, 200 observations. The Gibbs sampler employs 15, 000 draws from the conditional posteriors. The first 5, 000 draws are discarded as burn-in, then sample statistics are based on the remaining 10, 000 draws. 
                  outcome mean        selection coefficient
statistic         (1)      (0)      (1)      (2)      (3)      (4)
mean            13.76    13.76    0.810   -0.391   -1.647    1.649
median          13.76    13.76    0.809   -0.391   -1.645    1.650
standard dev    0.026    0.028    0.051    0.054    0.080    0.080
quantiles:
minimum         13.67    13.64    0.617   -0.585   -1.943    1.362
0.01            13.70    13.69    0.695   -0.521   -1.837    1.461
0.025           13.71    13.70    0.713   -0.500   -1.807    1.493
0.05            13.72    13.71    0.727   -0.481   -1.781    1.518
0.10            13.73    13.71    0.746   -0.461   -1.751    1.547
0.25            13.74    13.74    0.776   -0.428   -1.699    1.595
0.75            13.78    13.78    0.844   -0.356   -1.593    1.704
0.90            13.79    13.80    0.873   -0.325   -1.547    1.751
0.95            13.80    13.80    0.893   -0.306   -1.519    1.778
0.975           13.81    13.81    0.910   -0.289   -1.497    1.806
0.99            13.82    13.82    0.931   -0.269   -1.467    1.836
maximum         13.84    13.86    1.006   -0.185   -1.335    1.971

Sample statistics for the parameters of the data augmented Gibbs sampler applied to the treatment effect example

The results demonstrate selection bias as the outcome means are biased upward from 12. This does not bode well for effective estimation of marginal or average treatment effects. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$, are tabulated below.

[26] Typically, we're fortunate to identify any instruments. In the example, the instruments form a basis for the nullspace to the outcomes, Y1 and Y0. In this (linear or Gaussian) sense, we've exhausted the potential set of instruments.
statistic         ATE      ATT     ATUT   rho_D,1  rho_D,0  rho_1,0
mean            0.000    0.481   -0.482    0.904   -0.904   -0.852
median          0.000    0.480   -0.481    0.904   -0.904   -0.852
standard dev    0.017    0.041    0.041    0.009    0.009    0.015
quantiles:
minimum        -0.068    0.331   -0.649    0.865   -0.933   -0.899
0.01           -0.039    0.388   -0.580    0.880   -0.923   -0.884
0.025          -0.033    0.403   -0.564    0.884   -0.920   -0.879
0.05           -0.028    0.415   -0.549    0.888   -0.918   -0.875
0.10           -0.022    0.428   -0.534    0.892   -0.915   -0.871
0.25           -0.012    0.452   -0.509    0.898   -0.910   -0.862
0.75            0.011    0.510   -0.453    0.910   -0.898   -0.842
0.90            0.022    0.535   -0.429    0.915   -0.892   -0.832
0.95            0.028    0.551   -0.416    0.917   -0.888   -0.826
0.975           0.034    0.562   -0.405    0.920   -0.884   -0.821
0.99            0.040    0.576   -0.393    0.923   -0.880   -0.814
maximum         0.068    0.649   -0.350    0.932   -0.861   -0.787

Sample statistics for average treatment effects and error correlations of the data augmented Gibbs sampler applied to the treatment effect example

Average treatment effects estimated from weighted averages of MTE are similar:

estATE(MTE)  =  0.000
estATT(MTE)  =  0.464
estATUT(MTE) = -0.464

The average treatment effects on the treated and untreated suggest heterogeneity but are grossly understated compared to the DGP averages of 4 and -4. Next, we revisit the problem and attempt to consider what is left out of our model specification.

5.4 Instrumental variable restrictions

Consistency demands that we fully consider what we know. In the foregoing analysis, we have not effectively employed this principle. Data augmentation of the counterfactuals involves another condition. That is, outcomes are independent of the instruments (otherwise, they are not instruments): $DY + (1 - D) Y^{draw}$ and $DY^{draw} + (1 - D) Y$ are independent of $Z$. We can impose orthogonality on the draws of the counterfactuals such that the "sample" satisfies this population condition. We'll refer to this as the IV data augmented Gibbs sampler treatment effect analysis. To implement this we add the following steps to the above Gibbs sampler.
Minimize the distance of $Y^{draw}$ from $Y^{miss}$ such that $Y_1 = DY + (1 - D) Y^{draw}$ and $Y_0 = DY^{draw} + (1 - D) Y$ are orthogonal to the instruments, $Z$.

$$\min_{Y^{draw}} \left(Y^{draw} - Y^{miss}\right)^T \left(Y^{draw} - Y^{miss}\right)$$

$$\text{s.t.} \quad Z^T \begin{bmatrix} DY + (1 - D) Y^{draw} & DY^{draw} + (1 - D) Y \end{bmatrix} = 0$$

where the constraint is $p \times 2$ zeroes and $p$ is the number of columns in $Z$ (the number of instruments). Hence, in each McMC round, IV outcome draws are

$$Y_1 = DY + (1 - D) Y^{draw} \quad \text{and} \quad Y_0 = DY^{draw} + (1 - D) Y$$

5.5 Return to the example once more

With the IV data augmented Gibbs sampler in hand we return to the representative sample

 Y    D    Y1   Y0   Z1   Z2   Z3   Z4
15    1    15    9    5    4    3    1
14    1    14   10   -6   -5   -4   -2
13    1    13   11    0    0    0    1
13    0    11   13    0    0    1    0
14    0    10   14    0    1    0    0
15    0     9   15    1    0    0    0

and repeat 20 times to create a sample of n = 120 observations. The IV Gibbs sampler employs 15,000 draws from the conditional posteriors. The first 5,000 draws are discarded as burn-in, then sample statistics are based on the remaining 10,000 draws.

                  outcome mean        selection coefficient
statistic         (1)      (0)      (1)      (2)      (3)      (4)
mean            12.01    11.99    0.413   -0.167   -0.896    0.878
median          12.01    11.99    0.420   -0.148   -0.866    0.852
standard dev    0.160    0.160    0.227    0.274    0.370    0.359
quantiles:
minimum         11.35    11.37   -0.558   -1.325   -2.665   -0.202
0.01            11.64    11.62   -0.149   -0.889   -1.888    0.170
0.025           11.69    11.68   -0.058   -0.764   -1.696    0.254
0.05            11.74    11.73    0.028   -0.648   -1.550    0.336
0.10            11.80    11.80    0.117   -0.530   -1.381    0.435
0.25            11.90    11.89    0.267   -0.334   -1.124    0.617
0.75            12.11    12.10    0.566    0.023   -0.637    1.113
0.90            12.21    12.20    0.695    0.168   -0.451    1.367
0.95            12.27    12.25    0.774    0.249   -0.348    1.509
0.975           12.32    12.30    0.840    0.312   -0.256    1.630
0.99            12.38    12.36    0.923    0.389   -0.170    1.771
maximum         12.63    12.64    1.192    0.685    0.257    2.401

Sample statistics for the parameters of the IV data augmented Gibbs sampler applied to the treatment effect example

Not surprisingly, the results demonstrate no selection bias and effectively estimate marginal and average treatment effects. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$, are tabulated below.
statistic         ATE      ATT     ATUT   rho_D,1  rho_D,0  rho_1,0
mean            0.000    4.000   -4.000    0.813   -0.812   -0.976
median          0.000    4.000   -4.000    0.815   -0.815   -0.976
standard dev    0.000    0.000    0.000    0.031    0.032    0.004
quantiles:
minimum         0.000    4.000   -4.000    0.650   -0.910   -0.987
0.01            0.000    4.000   -4.000    0.728   -0.874   -0.984
0.025           0.000    4.000   -4.000    0.743   -0.866   -0.983
0.05            0.000    4.000   -4.000    0.756   -0.859   -0.982
0.10            0.000    4.000   -4.000    0.772   -0.851   -0.981
0.25            0.000    4.000   -4.000    0.794   -0.835   -0.979
0.75            0.000    4.000   -4.000    0.835   -0.794   -0.973
0.90            0.000    4.000   -4.000    0.850   -0.771   -0.970
0.95            0.000    4.000   -4.000    0.859   -0.755   -0.968
0.975           0.000    4.000   -4.000    0.866   -0.742   -0.967
0.99            0.000    4.000   -4.000    0.874   -0.726   -0.965
maximum         0.000    4.000   -4.000    0.904   -0.640   -0.952

Sample statistics for average treatment effects and error correlations of the IV data augmented Gibbs sampler applied to the treatment effect example

Weighted MTE estimates of average treatment effects are similar.

estATE(MTE)  =  0.000
estATT(MTE)  =  3.792
estATUT(MTE) = -3.792

Next, we report some more interesting experiments. Instead of having the full set of instruments available, suppose we have only three, Z1, Z2, and Z3 + Z4, or two, Z1 + Z2 and Z3 + Z4, or one, Z1 + Z2 + Z3 + Z4. We repeat the above for each set of instruments and compare the results with classical control function analysis based on Heckman's inverse Mills strategy introduced in the projections notes.

5.5.1 Three instruments

Suppose we have only three instruments, Z1, Z2, and Z3 + Z4.
IV data augmented Gibbs sampler results are tabulated below.[27]

                  outcome mean      selection coefficient
statistic         (1)      (0)      (1)      (2)      (3)
mean            12.00    12.00    0.242   -0.358    0.001
median          12.00    12.00    0.243   -0.342    0.001
standard dev    0.164    0.165    0.222    0.278    0.132
quantiles:
minimum         11.32    11.36   -0.658   -1.451   -0.495
0.01            11.62    11.61   -0.263   -1.080   -0.306
0.025           11.68    11.68   -0.189   -0.950   -0.258
0.05            11.73    11.73   -0.120   -0.844   -0.216
0.10            11.79    11.79   -0.041   -0.723   -0.170
0.25            11.89    11.89    0.094   -0.532   -0.091
0.75            12.11    12.11    0.394   -0.168    0.090
0.90            12.21    12.21    0.526   -0.021    0.171
0.95            12.27    12.27    0.604    0.071    0.217
0.975           12.32    12.32    0.670    0.155    0.254
0.99            12.38    12.39    0.753    0.245    0.302
maximum         12.58    12.57    1.067    0.564    0.568

Sample statistics for the parameters of the IV data augmented Gibbs sampler with three instruments applied to the treatment effect example

These results differ very little from those based on the full set of four instruments. There is no selection bias and marginal and average treatment effects are effectively estimated. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$, are tabulated below.

[27] Inclusion of an intercept in the selection equation with three, two, and one instruments makes no qualitative difference in the average treatment effect analysis. These results are not reported.
statistic         ATE      ATT     ATUT   rho_D,1  rho_D,0  rho_1,0
mean            0.000    4.000   -4.000    0.799   -0.800   -0.884
median          0.000    4.000   -4.000    0.802   -0.814   -0.888
standard dev    0.000    0.000    0.000    0.036    0.037    0.029
quantiles:
minimum         0.000    4.000   -4.000    0.605   -0.899   -0.956
0.01            0.000    4.000   -4.000    0.702   -0.870   -0.936
0.025           0.000    4.000   -4.000    0.719   -0.861   -0.930
0.05            0.000    4.000   -4.000    0.734   -0.853   -0.924
0.10            0.000    4.000   -4.000    0.751   -0.844   -0.918
0.25            0.000    4.000   -4.000    0.777   -0.826   -0.905
0.75            0.000    4.000   -4.000    0.825   -0.778   -0.867
0.90            0.000    4.000   -4.000    0.842   -0.751   -0.846
0.95            0.000    4.000   -4.000    0.852   -0.734   -0.833
0.975           0.000    4.000   -4.000    0.860   -0.720   -0.821
0.99            0.000    4.000   -4.000    0.869   -0.699   -0.803
maximum         0.000    4.000   -4.000    0.894   -0.554   -0.703

Sample statistics for average treatment effects and error correlations of the IV data augmented Gibbs sampler with three instruments applied to the treatment effect example

Weighted MTE estimates of average treatment effects are similar.

estATE(MTE)  =  0.000
estATT(MTE)  =  3.940
estATUT(MTE) = -3.940

Classical results based on Heckman's inverse Mills control function strategy with three instruments are reported below for comparison. The selection equation estimated via probit is

$$\Pr(D \mid Z) = \Phi\left(0.198 Z_1 - 0.297 Z_2 + 0.000 (Z_3 + Z_4)\right), \qquad \text{Pseudo } R^2 = 0.019$$

where $\Phi(\cdot)$ denotes the cumulative normal distribution function. The estimated outcome equations are

$$E[Y \mid X] = 11.890 (1 - D) + 11.890 D - 2.700 (1 - D) \lambda_0 + 2.700 D \lambda_1$$

and estimated average treatment effects are

estATE  =  0.000
estATT  =  4.220
estATUT = -4.220

In spite of the weak explanatory power of the selection model, control functions produce reasonable estimates of average treatment effects. Next, we consider two instruments.

5.5.2 Two instruments

Suppose we have only two instruments, Z1 + Z2 and Z3 + Z4. IV data augmented Gibbs sampler results are tabulated below.
                  outcome mean      selection coefficient
statistic         (1)      (0)      (1)      (2)
mean            12.08    13.27   -0.034    0.008
median          12.07    13.27   -0.034    0.009
standard dev    0.168    0.243    0.065    0.128
quantiles:
minimum         11.47    12.41   -0.328   -0.579
0.01            11.69    12.70   -0.185   -0.287
0.025           11.75    12.79   -0.162   -0.244
0.05            11.80    12.87   -0.141   -0.207
0.10            11.86    12.96   -0.118   -0.159
0.25            11.96    13.11   -0.077   -0.077
0.75            12.18    13.42    0.009    0.095
0.90            12.29    13.58    0.048    0.171
0.95            12.35    13.67    0.073    0.219
0.975           12.41    13.75    0.092    0.260
0.99            12.46    13.84    0.115    0.308
maximum         12.64    14.26    0.260    0.635

Sample statistics for the parameters of the IV data augmented Gibbs sampler with two instruments applied to the treatment effect example

Selection bias emerges as the untreated outcome mean (subscript 0) diverges from 12. This suggests marginal and average treatment effects are likely to be confounded. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$, are tabulated below.

statistic         ATE      ATT     ATUT   rho_D,1  rho_D,0  rho_1,0
mean           -1.293    1.413   -4.000    0.802   -0.516   -0.634
median         -1.297    1.406   -4.000    0.806   -0.532   -0.648
standard dev    0.219    0.438    0.000    0.037    0.136    0.115
quantiles:
minimum        -2.105   -0.211   -4.000    0.601   -0.813   -0.890
0.01           -1.806    0.389   -4.000    0.695   -0.757   -0.834
0.025          -1.738    0.525   -4.000    0.719   -0.732   -0.813
0.05           -1.665    0.670   -4.000    0.735   -0.706   -0.795
0.10           -1.572    0.855   -4.000    0.754   -0.675   -0.768
0.25           -1.435    1.130   -4.000    0.779   -0.613   -0.716
0.75           -1.147    1.705   -4.000    0.828   -0.438   -0.569
0.90           -1.005    1.989   -4.000    0.846   -0.340   -0.479
0.95           -0.930    2.141   -4.000    0.856   -0.262   -0.417
0.975          -0.861    2.277   -4.000    0.864   -0.195   -0.365
0.99           -0.795    2.409   -4.000    0.874   -0.124   -0.301
maximum        -0.625    2.750   -4.000    0.902    0.150   -0.055

Sample statistics for average treatment effects and error correlations of the IV data augmented Gibbs sampler with two instruments applied to the treatment effect example

Weighted MTE estimates of average treatment effects are similar.

estATE(MTE)  = -1.293
estATT(MTE)  =  1.372
estATUT(MTE) = -3.959

ATUT is effectively estimated but the other average treatment effects are biased.
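As a consistency check on the weighted MTE figures just reported, note that the weights defined in section 5.2 satisfy $w_{ATE} = \frac{\sum_j D_j}{n} w_{ATT} + \frac{\sum_j (1 - D_j)}{n} w_{ATUT}$, so with half of the sample treated the ATE estimate is the simple average of the ATT and ATUT estimates. The arithmetic below assumes the ATE and ATUT estimates are negative, in line with the DGP (untreated gains $Y_1 - Y_0$ of -2, -4, and -6); that sign reading is an assumption rather than something stated in the text.

```latex
\begin{aligned}
estATE(MTE) &= \tfrac{1}{2}\, estATT(MTE) + \tfrac{1}{2}\, estATUT(MTE) \\
            &= \tfrac{1}{2}(1.372) + \tfrac{1}{2}(-3.959) = -1.2935 \approx -1.293
\end{aligned}
```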
Classical results based on Heckman's inverse Mills control function strategy with two instruments are reported below for comparison. The selection equation estimated via probit is

$$\Pr(D \mid Z) = \Phi\left(-0.023 (Z_1 + Z_2) + 0.004 (Z_3 + Z_4)\right), \qquad \text{Pseudo } R^2 = 0.010$$

The estimated outcome equations are

$$E[Y \mid X] = 109.38 (1 - D) + 11.683 D + 121.14 (1 - D) \lambda_0 + 2.926 D \lambda_1$$

and estimated average treatment effects are

estATE  =  -97.69
estATT  = -191.31
estATUT =  -4.621

While the Bayesian estimates of ATE and ATT are moderately biased, classical estimates produce severe bias. Both strategies produce reasonable ATUT estimates with the Bayesian estimation right on target. Finally, we consider one instrument.

5.5.3 One instrument

Suppose we have only one instrument, Z1 + Z2 + Z3 + Z4. IV data augmented Gibbs sampler results are tabulated below.

                  outcome mean      selection coefficient
statistic         (1)      (0)      (1)
mean            12.08    13.95   -0.019
median          12.09    13.95   -0.019
standard dev    0.166    0.323    0.013
quantiles:
minimum         11.42    12.95   -0.074
0.01            11.69    13.27   -0.051
0.025           11.75    13.35   -0.046
0.05            11.81    13.43   -0.041
0.10            11.87    13.53   -0.036
0.25            11.97    13.73   -0.027
0.75            12.19    14.18   -0.010
0.90            12.29    14.38   -0.002
0.95            12.35    14.50    0.003
0.975           12.40    14.59    0.006
0.99            12.47    14.69    0.011
maximum         12.67    15.12    0.033

Sample statistics for the parameters of the IV data augmented Gibbs sampler with one instrument applied to the treatment effect example

Selection bias emerges as the untreated outcome mean (subscript 0) again diverges from 12. This suggests marginal and average treatment effects are likely to be confounded. Sample statistics for average treatment effects as well as correlations, $\rho_{D,1}$, $\rho_{D,0}$, and $\rho_{1,0}$, are tabulated below.
statistic         ATE      ATT     ATUT   rho_D,1  rho_D,0  rho_1,0
mean           -1.293    1.413   -4.000    0.797   -0.039   -0.048
median         -1.297    1.406   -4.000    0.801   -0.051   -0.061
standard dev    0.219    0.438    0.000    0.039    0.298    0.336
quantiles:
minimum        -2.105   -0.211   -4.000    0.576   -0.757   -0.817
0.01           -1.806    0.389   -4.000    0.691   -0.615   -0.682
0.025          -1.738    0.525   -4.000    0.710   -0.554   -0.624
0.05           -1.665    0.670   -4.000    0.727   -0.503   -0.571
0.10           -1.572    0.855   -4.000    0.746   -0.429   -0.490
0.25           -1.435    1.130   -4.000    0.774   -0.272   -0.310
0.75           -1.147    1.705   -4.000    0.824    0.187    0.213
0.90           -1.005    1.989   -4.000    0.843    0.370    0.415
0.95           -0.930    2.141   -4.000    0.853    0.461    0.518
0.975          -0.861    2.277   -4.000    0.861    0.526    0.581
0.99           -0.795    2.409   -4.000    0.870    0.591    0.651
maximum        -0.625    2.750   -4.000    0.894    0.747    0.800

Sample statistics for average treatment effects and error correlations of the IV data augmented Gibbs sampler with one instrument applied to the treatment effect example

Weighted MTE estimates of average treatment effects are similar.

estATE(MTE)  = -1.957
estATT(MTE)  =  0.060
estATUT(MTE) = -3.975

ATUT is effectively estimated but the other average treatment effects are biased. Classical results based on Heckman's inverse Mills control function strategy with one instrument are reported below for comparison. The selection equation estimated via probit is

$$\Pr(D \mid Z) = \Phi\left(-0.017 (Z_1 + Z_2 + Z_3 + Z_4)\right), \qquad \text{Pseudo } R^2 = 0.009$$

The estimated outcome equations are

$$E[Y \mid X] = 14.000 (1 - D) + 11.885 D + NA\, (1 - D) \lambda_0 + 2.671 D \lambda_1$$

and estimated average treatment effects are

estATE  = -2.115
estATT  = NA
estATUT = NA

While the Bayesian estimates of ATE and ATT are biased, the classical strategy fails to generate estimates for ATT and ATUT: it involves a singular X matrix as there is no variation in $\lambda_0$.

6 Appendix: common distributions and their kernels

To try to avoid confusion, we list our descriptions of common distributions and their kernels. Others may employ variations on these definitions.
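Multivariate distributions

Gaussian (normal)
  support:   $x \in R^k$, $\mu \in R^k$, $\Sigma \in R^{k \times k}$ positive definite
  density:   $f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \propto e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
  conjugacy: conjugate prior for mean of multinormal distribution

Student t
  support:   $x \in R^k$, $\mu \in R^k$, $\Sigma \in R^{k \times k}$ positive definite
  density:   $f(x; \nu, \mu, \Sigma) = \frac{\Gamma[(\nu+k)/2]}{\Gamma(\frac{\nu}{2}) (\nu\pi)^{k/2} |\Sigma|^{1/2}} \left[1 + \frac{1}{\nu}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]^{-\frac{\nu+k}{2}} \propto \left[1 + \frac{1}{\nu}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]^{-\frac{\nu+k}{2}}$
  conjugacy: marginal posterior for multi-normal with unknown mean and variance

Wishart
  support:   $W \in R^{k \times k}$ positive definite, $S \in R^{k \times k}$ positive definite, $\nu > k - 1$
  density:   $f(W; \nu, S) = \frac{1}{2^{\nu k/2} |S|^{\nu/2} \Gamma_k(\frac{\nu}{2})} |W|^{\frac{\nu-k-1}{2}} e^{-\frac{1}{2} Tr(S^{-1} W)} \propto |W|^{\frac{\nu-k-1}{2}} e^{-\frac{1}{2} Tr(S^{-1} W)}$
  conjugacy: see inverse Wishart

Inverse Wishart
  support:   $W \in R^{k \times k}$ positive definite, $S \in R^{k \times k}$ positive definite, $\nu > k - 1$
  density:   $f(W; \nu, S^{-1}) = \frac{|S|^{\nu/2}}{2^{\nu k/2} \Gamma_k(\frac{\nu}{2})} |W|^{-\frac{\nu+k+1}{2}} e^{-\frac{1}{2} Tr(S W^{-1})} \propto |W|^{-\frac{\nu+k+1}{2}} e^{-\frac{1}{2} Tr(S W^{-1})}$
  conjugacy: conjugate prior for variance of multinormal distribution

where $\Gamma(n) = (n-1)!$ for $n$ a positive integer, $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t} dt$, and $\Gamma_k\left(\frac{\nu}{2}\right) = \pi^{k(k-1)/4} \prod_{j=1}^{k} \Gamma\left(\frac{\nu + 1 - j}{2}\right)$.

Univariate distributions

Beta
  support:   $x \in (0, 1)$; $\alpha, \beta > 0$
  density:   $f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} \propto x^{\alpha-1} (1-x)^{\beta-1}$
  conjugacy: beta is conjugate prior to binomial

Binomial
  support:   $s = 1, 2, \ldots$; $\theta \in (0, 1)$
  density:   $f(s; \theta) = \binom{n}{s} \theta^s (1-\theta)^{n-s} \propto \theta^s (1-\theta)^{n-s}$
  conjugacy: beta is conjugate prior to binomial

Chi-square
  support:   $x \in [0, \infty)$; $\nu > 0$
  density:   $f(x; \nu) = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} x^{\nu/2-1} e^{-x/2} \propto x^{\nu/2-1} e^{-x/2}$
  conjugacy: see scaled, inverse chi-square

Inverse chi-square
  support:   $x \in (0, \infty)$; $\nu > 0$
  density:   $f(x; \nu) = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} \frac{\exp[-1/(2x)]}{x^{\nu/2+1}} \propto x^{-\nu/2-1} e^{-1/(2x)}$
  conjugacy: see scaled, inverse chi-square

Scaled, inverse chi-square
  support:   $x \in (0, \infty)$; $\nu, \sigma^2 > 0$
  density:   $f(x; \nu, \sigma^2) = \frac{(\nu\sigma^2)^{\nu/2}}{2^{\nu/2} \Gamma(\nu/2)} \frac{\exp[-\nu\sigma^2/(2x)]}{x^{\nu/2+1}} \propto x^{-\nu/2-1} e^{-\nu\sigma^2/(2x)}$
  conjugacy: conjugate prior for variance of a normal distribution

Exponential
  support:   $x \in [0, \infty)$; $\lambda > 0$
  density:   $f(x; \lambda) = \lambda \exp[-\lambda x] \propto \exp[-\lambda x]$
  conjugacy: gamma is conjugate prior to exponential

Extreme value (logistic)
  support:   $x \in (-\infty, \infty)$; $-\infty < \mu < \infty$, $s > 0$
  density:   $f(x; \mu, s) = \frac{\exp[-(x-\mu)/s]}{s (1 + \exp[-(x-\mu)/s])^2} \propto \frac{\exp[-(x-\mu)/s]}{(1 + \exp[-(x-\mu)/s])^2}$
  conjugacy: posterior for Bernoulli prior and normal likelihood

Gamma
  support:   $x \in [0, \infty)$; $\alpha, \beta > 0$
  density:   $f(x; \alpha, \beta) = \frac{x^{\alpha-1} \exp[-x/\beta]}{\Gamma(\alpha) \beta^{\alpha}} \propto x^{\alpha-1} \exp[-x/\beta]$
  conjugacy: gamma is conjugate prior to exponential and others

Inverse gamma
  support:   $x \in [0, \infty)$; $\alpha, \beta > 0$
  density:   $f(x; \alpha, \beta) = \frac{\beta^{\alpha} x^{-\alpha-1} \exp[-\beta/x]}{\Gamma(\alpha)} \propto x^{-\alpha-1} \exp[-\beta/x]$
  conjugacy: conjugate prior for variance of a normal distribution

Gaussian (normal)
  support:   $x \in (-\infty, \infty)$; $-\infty < \mu < \infty$, $\sigma > 0$
  density:   $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \propto \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
  conjugacy: conjugate prior for mean of a normal distribution

Student t
  support:   $x \in (-\infty, \infty)$; $\mu \in (-\infty, \infty)$, $\sigma > 0$
  density:   $f(x; \nu, \mu, \sigma) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\, \Gamma(\frac{\nu}{2})\, \sigma} \left[1 + \frac{1}{\nu} \frac{(x-\mu)^2}{\sigma^2}\right]^{-\frac{\nu+1}{2}} \propto \left[1 + \frac{1}{\nu} \frac{(x-\mu)^2}{\sigma^2}\right]^{-\frac{\nu+1}{2}}$
  conjugacy: marginal posterior for a normal with unknown mean and variance

7 Appendix: maximum likelihood estimation of discrete choice models

The most common method for estimating the parameters of discrete choice models is maximum likelihood. The likelihood is defined as the joint density for the parameters of interest $\theta$ conditional on the data $X_t$. For binary choice models and $D_t = 1$ the contribution to the likelihood is $F(X_t\theta)$, and for $D_t = 0$ the contribution to the likelihood is $1 - F(X_t\theta)$, where these are combined as binomial draws and $F(X_t\theta)$ is the cumulative distribution function evaluated at $X_t\theta$.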
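Hence, the likelihood is

$$L(\theta \mid X) = \prod_{t=1}^{n} F(X_t\theta)^{D_t} \left[1 - F(X_t\theta)\right]^{1 - D_t}$$

The log-likelihood is

$$\ell(\theta \mid X) \equiv \log L(\theta \mid X) = \sum_{t=1}^{n} D_t \log\left(F(X_t\theta)\right) + (1 - D_t) \log\left(1 - F(X_t\theta)\right)$$

Since this function for binary response models like probit and logit is globally concave, numerical maximization is straightforward. The first order conditions for a maximum, $\max_\theta \ell(\theta \mid X)$, are

$$\sum_{t=1}^{n} \left[\frac{D_t f(X_t\theta) X_{ti}}{F(X_t\theta)} - \frac{(1 - D_t) f(X_t\theta) X_{ti}}{1 - F(X_t\theta)}\right] = 0, \qquad i = 1, \ldots, k$$

where $f(\cdot)$ is the density function. Simplifying yields

$$\sum_{t=1}^{n} \frac{\left[D_t - F(X_t\theta)\right] f(X_t\theta) X_{ti}}{F(X_t\theta)\left[1 - F(X_t\theta)\right]} = 0, \qquad i = 1, \ldots, k$$

Estimates of $\theta$ are found by solving these first order conditions iteratively or, in other words, numerically. A common estimator for the variance of $\hat\theta_{MLE}$ is the negative inverse of the Hessian matrix evaluated at $\hat\theta_{MLE}$, $-H(D, \hat\theta)^{-1}$. Let $H(D, \theta)$ be the Hessian matrix for the log-likelihood with typical element $H_{ij}(D, \theta) \equiv \frac{\partial^2 \ell_t(D, \theta)}{\partial\theta_i \partial\theta_j}$.[28]

[28] Details can be found in numerous econometrics references and chapter 4 of Accounting and Causal Effects: Econometric Challenges.

8 Appendix: seemingly unrelated regression (SUR)

First, we describe the seemingly unrelated regression (SUR) model. Second, we remind ourselves that Bayesian regression works as if we have two samples: one representative of our priors and another from the new evidence. Then, we connect to SUR; both classical and Bayesian strategies are summarized.

We describe the SUR model in terms of a stacked regression as if the latent variables in a binary selection setting are observable.

$$r = X\beta + \varepsilon$$

where

$$r = \begin{bmatrix} U \\ Y_1 \\ Y_0 \end{bmatrix}, \quad X = \begin{bmatrix} W & 0 & 0 \\ 0 & X_1 & 0 \\ 0 & 0 & X_2 \end{bmatrix}, \quad \beta = \begin{bmatrix} \theta \\ \beta_1 \\ \beta_2 \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} V_D \\ V_1 \\ V_0 \end{bmatrix}$$

and $\varepsilon \sim N(0, V = \Sigma \otimes I_n)$. Let

$$\Sigma = \begin{bmatrix} 1 & \sigma_{D1} & \sigma_{D0} \\ \sigma_{D1} & \sigma_{11} & \sigma_{10} \\ \sigma_{D0} & \sigma_{10} & \sigma_{00} \end{bmatrix}$$

a $3 \times 3$ matrix, then the Kronecker product, denoted by $\otimes$, is

$$V = \Sigma \otimes I_n = \begin{bmatrix} I_n & \sigma_{D1} I_n & \sigma_{D0} I_n \\ \sigma_{D1} I_n & \sigma_{11} I_n & \sigma_{10} I_n \\ \sigma_{D0} I_n & \sigma_{10} I_n & \sigma_{00} I_n \end{bmatrix}$$

a $3n \times 3n$ matrix in which each block is a diagonal $n \times n$ matrix, and $V^{-1} = \Sigma^{-1} \otimes I_n$.

Classical estimation of SUR follows generalized least squares (GLS).

$$\hat\beta = \left[X^T \left(\Sigma^{-1} \otimes I_n\right) X\right]^{-1} X^T \left(\Sigma^{-1} \otimes I_n\right) r$$

and

$$Var\left[\hat\beta\right] = \left[X^T \left(\Sigma^{-1} \otimes I_n\right) X\right]^{-1}$$

On the other hand, Bayesian analysis employs a Gibbs sampler based on the conditional posteriors as, apparently, the SUR error structure prevents identification of conjugate priors. Recall, from the discussion of Bayesian linear regression with general error structure, the conditional posterior for $\beta$ is $p(\beta \mid \Sigma, y; \beta_0, \Sigma_\beta) \sim N\left(\bar\beta, V_\beta\right)$ where

$$\bar\beta = \left(X_0^T \Sigma_0^{-1} X_0 + X^T \Sigma^{-1} X\right)^{-1} \left(X_0^T \Sigma_0^{-1} X_0 \beta_0 + X^T \Sigma^{-1} X \hat\beta\right) = \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1} \left(\Sigma_\beta^{-1} \beta_0 + X^T \Sigma^{-1} X \hat\beta\right)$$

$$\hat\beta = \left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} y$$

and

$$V_\beta = \left(X_0^T \Sigma_0^{-1} X_0 + X^T \Sigma^{-1} X\right)^{-1} = \left(\Sigma_\beta^{-1} + X^T \Sigma^{-1} X\right)^{-1}$$

For the present SUR application, we replace $y$ with $r$ and $\Sigma^{-1}$ with $\Sigma^{-1} \otimes I_n$, yielding the conditional posterior for $\beta$, $p(\beta \mid \Sigma, y; \beta_0, \Sigma_\beta) \sim N\left(\bar\beta^{SUR}, V_\beta^{SUR}\right)$ where

$$\bar\beta^{SUR} = \left(X_0^T \left(\Sigma_0^{-1} \otimes I_n\right) X_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} \left(X_0^T \left(\Sigma_0^{-1} \otimes I_n\right) X_0 \beta_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X \hat\beta^{SUR}\right)$$

$$= \left(\Sigma_\beta^{-1} + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} \left(\Sigma_\beta^{-1} \beta_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X \hat\beta^{SUR}\right)$$

$$\hat\beta^{SUR} = \left[X^T \left(\Sigma^{-1} \otimes I_n\right) X\right]^{-1} X^T \left(\Sigma^{-1} \otimes I_n\right) r$$

and

$$V_\beta^{SUR} = \left(X_0^T \left(\Sigma_0^{-1} \otimes I_n\right) X_0 + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1} = \left(\Sigma_\beta^{-1} + X^T \left(\Sigma^{-1} \otimes I_n\right) X\right)^{-1}$$

9 Examples [to be added]

means (ANOVA)
- Gaussian likelihood
- variance unknown
- Jeffreys prior

9.1 regression
- Gaussian likelihood
- follow from Bayesian conditional distribution: data model $p(y \mid \theta_1, \theta_2, X) \sim N(X\theta_1, \theta_2 I)$, prior model $p(\theta_1, \theta_2 \mid X) \propto \theta_2^{-1}$
- conditional simulation effectively treats scale as a nuisance parameter: marginal posterior of $\theta_2$ given $(y, X)$, conditional posterior of $\theta_1$ given $(\theta_2, y, X)$
- variance unknown

9.2 selection missing data
- latent expected utility and counterfactuals
- data augmentation
- Gaussian likelihood
- SUR
- bounding unidentified parameters
- Gibbs sampler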
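The conditional simulation outlined in 9.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the notes' own implementation: it assumes the data model is Gaussian with mean $X\theta_1$ and variance $\theta_2 I$ and that the prior is proportional to $1/\theta_2$, in which case the standard results are a scaled, inverse chi-square marginal posterior for $\theta_2$ and a Gaussian conditional posterior for $\theta_1$. The function name `simulate_posterior` and the synthetic data are hypothetical.

```python
import numpy as np


def simulate_posterior(y, X, draws=5000, seed=0):
    """Direct posterior simulation for Gaussian regression, prior 1/theta_2.

    Hypothetical helper: draws theta_2 from its marginal posterior, then
    theta_1 from its conditional posterior given each theta_2 draw.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                    # least squares estimate
    resid = y - X @ b
    s2 = resid @ resid / (n - k)             # residual variance estimate

    # theta_2 | y, X ~ scaled, inverse chi-square(n - k, s2)
    theta2 = (n - k) * s2 / rng.chisquare(n - k, size=draws)

    # theta_1 | theta_2, y, X ~ N(b, theta_2 (X'X)^{-1})
    chol = np.linalg.cholesky(XtX_inv)       # chol @ chol.T == XtX_inv
    z = rng.standard_normal((draws, k))
    theta1 = b + np.sqrt(theta2)[:, None] * (z @ chol.T)
    return theta1, theta2


# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=n)

theta1, theta2 = simulate_posterior(y, X)
print(theta1.mean(axis=0), theta2.mean())
```

Because each draw of $\theta_1$ is conditioned on a fresh draw of $\theta_2$, the scale is integrated out by simulation, which is the sense in which it is treated as a nuisance parameter.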