STATISTICS 450/850
Estimation and Hypothesis Testing
Supplementary Lecture Notes

Don L. McLeish and Cyntha A. Struthers
Dept. of Statistics and Actuarial Science
University of Waterloo
Waterloo, Ontario, Canada
Winter 2013

Contents

1 Properties of Estimators
1.1 Prerequisite Material
1.2 Introduction
1.3 Unbiasedness and Mean Square Error
1.4 Sufficiency
1.5 Minimal Sufficiency
1.6 Completeness
1.7 The Exponential Family
1.8 Ancillarity

2 Maximum Likelihood Estimation
2.1 Maximum Likelihood Method - One Parameter
2.2 Principles of Inference
2.3 Properties of the Score and Information - Regular Model
2.4 Maximum Likelihood Method - Multiparameter
2.5 Incomplete Data and The E.M. Algorithm
2.6 The Information Inequality
2.7 Asymptotic Properties of M.L. Estimators - One Parameter
2.8 Interval Estimators
2.9 Asymptotic Properties of M.L. Estimators - Multiparameter
2.10 Nuisance Parameters and M.L. Estimation
2.11 Problems with M.L. Estimators
2.12 Historical Notes

3 Other Methods of Estimation
3.1 Best Linear Unbiased Estimators
3.2 Equivariant Estimators
3.3 Estimating Equations
3.4 Bayes Estimation

4 Hypothesis Tests
4.1 Introduction
4.2 Uniformly Most Powerful Tests
4.3 Locally Most Powerful Tests
4.4 Likelihood Ratio Tests
4.5 Score and Maximum Likelihood Tests
4.6 Bayesian Hypothesis Tests

5 Appendix
5.1 Inequalities and Useful Results
5.2 Distributional Results
5.3 Limiting Distributions
5.4 Proofs

Chapter 1

Properties of Estimators

1.1 Prerequisite Material

The following topics should be reviewed:

1. Tables of special discrete and continuous distributions including the multivariate normal distribution. Location and scale parameters.
2. Distribution of a transformation of one or more random variables including change of variable(s).
3. Moment generating function of one or more random variables.
4. Multiple linear regression.
5. Limiting distributions: convergence in probability and convergence in distribution.

1.2 Introduction

Before beginning a discussion of estimation procedures, we assume that we have designed and conducted a suitable experiment and collected data X1 , . . .
, Xn , where n, the sample size, is fixed and known. These data are expected to be relevant to estimating a quantity of interest θ which we assume is a statistical parameter, for example, the mean of a normal distribution. We assume we have adopted a model which specifies the link between the parameter θ and the data we obtained. The model is the framework within which we discuss the properties of our estimators. Our model might specify that the observations X1 , . . . , Xn are independent with 1 2 CHAPTER 1. PROPERTIES OF ESTIMATORS a normal distribution, mean θ and known variance σ2 = 1. Usually, as here, the only unknown is the parameter θ. We have specified completely the joint distribution of the observations up to this unknown parameter. 1.2.1 Note: We will sometimes denote our data more compactly by the random vector X = (X1 , . . . , Xn ). The model, therefore, can be written in the form {f (x; θ) ; θ ∈ Ω} where Ω is the parameter space or set of permissible values of the parameter and f (x; θ) is the probability (density) function. 1.2.2 Definition A statistic, T (X), is a function of the data X which does not depend on the unknown parameter θ. Note that although a statistic, T (X), is not a function of θ, its distribution can depend on θ. An estimator is a statistic considered for the purpose of estimating a given parameter. It is our aim to find a “good” estimator of the parameter θ. In the search for good estimators of θ it is often useful to know if θ is a location or scale parameter. 1.2.3 Location and Scale Parameters Suppose X is a continuous random variable with p.d.f. f (x; θ). Let F0 (x) = F (x; θ = 0) and f0 (x) = f (x; θ = 0). The parameter θ is called a location parameter of the distribution if F (x; θ) = F0 (x − θ) , θ∈< or equivalently f (x; θ) = f0 (x − θ), θ ∈ <. Let F1 (x) = F (x; θ = 1) and f1 (x) = f (x; θ = 1). The parameter θ is called a scale parameter of the distribution if ³x´ F (x; θ) = F1 , θ>0 θ 1.2. INTRODUCTION 3 or equivalently f (x; θ) = 1.2.4 1 x f1 ( ), θ θ θ > 0. Problem (1) If X ∼ EXP(1, θ) then show that θ is a location parameter of the distribution. See Figure 1.1 (2) If X ∼ EXP(θ) then show that θ is a scale parameter of the distribution. See Figure 1.2 1 0.9 0.8 0.7 θ=1 θ=-1 θ=0 0.6 f(x) 0.5 0.4 0.3 0.2 0.1 0 -1 0 1 2 x 3 4 5 Figure 1.1: EXP(1, θ) p.d.f.’s 1.2.5 Problem (1) If X ∼ CAU(1, θ) then show that θ is a location parameter of the distribution. (2) If X ∼ CAU(θ, 0) then show that θ is a scale parameter of the distribution. 4 CHAPTER 1. PROPERTIES OF ESTIMATORS 2 1.8 θ=0.5 1.6 1.4 1.2 f(x) 1 θ=1 0.8 0.6 θ=2 0.4 0.2 0 0 0.5 1 1.5 x 2 2.5 3 Figure 1.2: EXP(θ) p.d.f.’s 1.3 Unbiasedness and Mean Square Error How do we ensure that a statistic T (X) is estimating the correct parameter? How do we ensure that it is not consistently too large or too small, and that as much variability as possible has been removed? We consider the problem of estimating the correct parameter first. We begin with a review of the definition of the expectation of a random variable. 1.3.1 Definition If X is a discrete random variable with p.f. f (x; θ) and support set A then P E [h (X) ; θ] = h (x) f (x; θ) x∈A provided the sum converges absolutely, that is, provided P E [|h (X)| ; θ] = |h (x)| f (x; θ)dx < ∞. x∈A 1.3. UNBIASEDNESS AND MEAN SQUARE ERROR 5 If X is a continuous random variable with p.d.f. f (x; θ) then E[h(X); θ] = Z∞ h(x)f (x; θ) dx, −∞ provided the integral converges absolutely, that is, provided E[|h(X)| ; θ] = Z∞ −∞ |h(x)| f (x; θ) dx < ∞. 
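The absolute-convergence requirement is not a mere technicality. For the CAU(1, θ) distribution of Problem 1.3.2 below the integral of |x| f (x; θ) diverges, and sample means behave erratically as a result. The following short simulation is a sketch of this behaviour (the seed and sample size are arbitrary choices): the running mean of a N(0, 1) sample settles down quickly, while that of a CAU(1, 0) sample keeps jumping no matter how many observations are used.

    import numpy as np

    rng = np.random.default_rng(1)           # arbitrary seed
    n = 100000
    cauchy = rng.standard_cauchy(n)          # CAU(1, 0) sample
    normal = rng.standard_normal(n)          # N(0, 1) sample

    k = np.arange(1, n + 1)
    run_mean_cauchy = np.cumsum(cauchy) / k  # running sample means
    run_mean_normal = np.cumsum(normal) / k

    # The normal running mean stabilizes near 0; the Cauchy one does not,
    # because E(|X|) is infinite for the Cauchy distribution.
    for m in (10**2, 10**3, 10**4, 10**5):
        print(m, run_mean_normal[m - 1], run_mean_cauchy[m - 1])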
If E [|h (X)| ; θ] = ∞ then we say that E [h (X) ; θ] does not exist. 1.3.2 Problem Suppose that X has a CAU(1, θ) distribution. Show that E(X; θ) does not exist and that this implies E(X k ; θ) does not exist for k = 2, 3, . . .. 1.3.3 Problem Suppose that X is a random variable with probability density function f (x; θ) = θ , xθ+1 x ≥ 1. For what values of θ do E(X; θ) and V ar(X; θ) exist? 1.3.4 Problem If X ∼ GAM(α, β) show that E(X p ; α, β) = β p Γ(α + p) . Γ(α) For what values of p does this expectation exist? 1.3.5 Problem Suppose X is a non-negative continuous random variable with moment generating function M (t) = E(etX ) which exists for t ∈ <. The function M (−t) is often called the Laplace Transform of the probability density function of X. Show that Z∞ ¡ −p ¢ 1 = M (−t)tp−1 dt, p > 0. E X Γ(p) 0 6 1.3.6 CHAPTER 1. PROPERTIES OF ESTIMATORS Definition A statistic T (X) is an unbiased estimator of θ if E[T (X); θ] = θ for all θ ∈ Ω. 1.3.7 Example Suppose Xi ∼ POI(iθ) i = 1, ..., n independently. Determine whether the following estimators are unbiased estimators of θ: T1 = n X 1 P i , n i=1 i T2 = µ 2 n+1 ¶ X̄ = n P 2 Xi . n(n + 1) i=1 Is unbiased estimation preserved under transformations? For example, if T is an unbiased estimator of θ, is T 2 an unbiased estimator of θ2 ? 1.3.8 Example Suppose X1 , . . . , Xn are uncorrelated random variables with E(Xi ) = μ and V ar(Xi ) = σ 2 , i = 1, 2, . . . , n. Show that T = n P ai Xi i=1 is an unbiased estimator of μ if n P ai = 1. Find an unbiased estimator of i=1 σ 2 assuming (i) μ is known (ii) μ is unknown. If (X1 , . . . , Xn ) is a random sample from the N(μ, σ 2 ) distribution then show that S is not an unbiased estimator of σ where ¸ ∙n n P 2 1 P 1 (Xi − X̄)2 = Xi − nX̄ 2 S2 = n − 1 i=1 n − 1 i=1 is the sample variance. What happens to E (S) as n → ∞? 1.3.9 Example Suppose X ∼ BIN(n, θ). Find an unbiased estimator, T (X), of θ. Is [T (X)]−1 an unbiased estimator of θ−1 ? Does there exist an unbiased estimator of θ−1 ? 1.3. UNBIASEDNESS AND MEAN SQUARE ERROR 1.3.10 7 Problem Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Find ´ ³ E X (k) ; θ = E[X(X − 1) · · · (X − k + 1); θ], the kth factorial moment of X, and thus find an unbiased estimator of θk , k = 1, 2, . . . ,. We now consider the properties of an estimator from the point of view of Decision Theory. In order to determine whether a given estimator or statistic T = T (X) does well for estimating θ we consider a loss function or distance function between the estimator and the true value which we denote L(θ, T ). This loss function is averaged over all possible values of the data to obtain the risk: Risk = E[L(θ, T ); θ]. A good estimator is one with little risk, a bad estimator is one whose risk 2 is high. One particular loss function is L(θ, T ) = (T − θ) which is called the squared error loss function. Its corresponding risk, called mean squared error (M.S.E.), is given by h i 2 M SE(T ; θ) = E (T − θ) ; θ . Another loss function is L(θ, T ) = |T − θ| which is called the absolute error loss function. Its corresponding risk, called the mean absolute error, is given by Risk = E (|T − θ|; θ) . 1.3.11 Problem Show M SE(T ; θ) = V ar(T ; θ) + [Bias(T ; θ)]2 where Bias(T ; θ) = E (T ; θ) − θ. 1.3.12 Example Let X1 , . . . , Xn be a random sample from a UNIF(0, θ) distribution. Compare the M.S.E.’s of the following three estimators of θ: T1 = 2X, T2 = X(n) , T3 = (n + 1)X(1) 8 CHAPTER 1. PROPERTIES OF ESTIMATORS where X(n) = max(X1 , . . . 
, Xn ) and X(1) = min(X1 , . . . , Xn ). 1.3.13 Problem Let X1 , . . . , Xn be a random sample from a UNIF(θ, 2θ) distribution with θ > 0. Consider the following estimators of θ: T1 = 1 1 1 5 2 X(n) , T2 = X(1) , T3 = X(n) + X(1) , T4 = X(n) + X(1) . 2 3 3 14 7 (a) Show that all four estimators can be written in the form Za = aX(1) + 1 (1 − a) X(n) 2 (1.1) for suitable choice of a. (b) Find E(Za ; θ) and thus show that T3 is the only unbiased estimator of θ of the form (1.1). (c) Compare the M.S.E.’s of these estimators and show that T4 has the smallest M.S.E. of all estimators of the form (1.1). Hint: Find V ar(Za ; θ), show Cov(X(1) , X(n) ; θ) = θ2 (n + 1)2 (n + 2) , and thus find an expression for M SE(Za ; θ). 1.3.14 Problem Let X1 , . . . , Xn be a random sample from the N(μ, σ2 ) distribution. Consider the following estimators of σ 2 : S2, T1 = n−1 2 S , n T2 = n−1 2 S . n+1 Compare the M.S.E.’s of these estimators by graphing them as a function of σ 2 for n = 5. 1.3.15 Example Let X ∼N(θ, 1). Consider the following three estimators of θ: T1 = X, T2 = X , 2 T3 = 0. 1.3. UNBIASEDNESS AND MEAN SQUARE ERROR 9 Which estimator is better in terms of M.S.E.? Now h i M SE (T1 ; θ) = E (X − θ)2 ; θ = V ar (X; θ) = 1 "µ µ ¶2 # ¶ ∙ µ ¶ ¸2 X X X M SE (T2 ; θ) = E − θ ; θ = V ar ;θ + E ;θ − θ 2 2 2 µ ¶2 ¢ 1 1¡ 2 θ = + −θ = θ +1 4 2 4 h i 2 M SE (T3 ; θ) = E (0 − θ) ; θ = θ2 The M.S.E.’s can be compared by graphing them as functions of θ. See Figure 1.3. 4 3.5 3 2.5 MSE(T ) 3 MSE 2 1.5 1 MSE(T 1) MSE(T ) 2 0.5 0 -2 -1.5 -1 -0.5 0 θ 0.5 1 1.5 2 Figure 1.3: Comparison of M.S.E.’s for Example 1.3.15 One of the conclusions of the above example is that there is no estimator, even the natural one, T1 = X which outperforms all other estimators. One is better for some values of the parameter in terms of smaller risk, while another, even the trivial estimator T3 , is better for other values of the parameter. In order to achieve a best estimator, it is unfortunately 10 CHAPTER 1. PROPERTIES OF ESTIMATORS necessary to restrict ourselves to a specific class of estimators and select the best within the class. Of course, the best within this class will only be as good as the class itself, and therefore we must ensure that restricting ourselves to this class is sensible and not unduly restrictive. The class of all estimators is usually too large to obtain a meaningful solution. One possible restriction is to the class of all unbiased estimators. 1.3.16 Definition An estimator T = T (X) is said to be a uniformly minimum variance unbiased estimator (U.M.V.U.E.) of the parameter θ if (i) it is an unbiased estimator of θ and (ii) among all unbiased estimators of θ it has the smallest M.S.E. and therefore the smallest variance. 1.3.17 Problem Suppose X has a GAM(2, θ) distribution and consider the class of estimators {aX; a ∈ <+ }. Find the estimator in this class which minimizes the mean absolute error for estimating the scale parameter θ. Hint: Show E (|aX − θ|; θ) = θE (|aX − 1|; θ = 1) . Is this estimator unbiased? Is it the best estimator in the class of all functions of X? 1.4 Sufficiency A sufficient statistic is one that, from a certain perspective, contains all the necessary information for making inferences about the unknown parameters in a given model. By making inferences we mean the usual conclusions about parameters such as estimators, significance tests and confidence intervals. Suppose the data are X and T = T (X) is a sufficient statistic. 
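A small simulation previews what this means in practice. For Bernoulli(θ) data (see Problem 1.4.4), once the total number of successes T = X1 + · · · + Xn is fixed, the arrangement of the successes among the trials looks the same whatever the value of θ, so the arrangement itself tells us nothing further about θ. This is only a sketch; the values of θ, n, t and the replication count are arbitrary choices.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(5)
    n, t, reps = 4, 2, 200000                     # arbitrary choices

    for theta in (0.2, 0.7):                      # two quite different values of theta
        X = rng.binomial(1, theta, size=(reps, n))
        keep = X[X.sum(axis=1) == t]              # condition on T = t
        counts = Counter(map(tuple, keep.tolist()))
        freqs = {k: round(v / len(keep), 3) for k, v in sorted(counts.items())}
        print("theta =", theta, freqs)            # both close to uniform (1/6 each)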
The intuitive basis for sufficiency is that if X has a conditional distribution given T (X) that does not depend on θ, then X is of no value in addition to T in estimating θ. The assumption is that random variables carry information on a statistical parameter θ only insofar as their distributions (or conditional distributions) change with the value of the parameter. All of this, of course, assumes that the model is correct and θ is the only unknown. It should be remembered that the distribution of X given a sufficient statistic T may have a great deal of value for some other purpose, such as testing the validity of the model itself. 1.4. SUFFICIENCY 1.4.1 11 Definition A statistic T (X) is sufficient for a statistical model {f (x; θ) ; θ ∈ Ω} if the distribution of the data X1 , . . . , Xn given T = t does not depend on the unknown parameter θ. To understand this definition suppose that X is a discrete random variable and T = T (X) is a sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. Suppose we observe data x with corresponding value of the sufficient statistic T (x) = t. To Experimenter A we give the observed data x while to Experimenter B we give only the value of T = t. Experimenter A can obviously calculate T (x) = t as well. Is Experimenter A “better off” than Experimenter B in terms of making inferences about θ? The answer is no since Experimenter B can generate data which is “as good as” the data which Experimenter A has in the following manner. Since T (X) is a sufficient statistic, the conditional distribution of X given T = t does not depend on the unknown parameter θ. Therefore Experimenter B can use this distribution and a randomization device such as a random number generator to generate an observation y from the random variable Y such that P (Y = y|T = t) = P (X = y|T = t) (1.2) and such that X and Y have the same unconditional distribution. So Experimenter A who knows x and experimenter B who knows y have equivalent information about θ. Obviously Experimenter B did not gain any new information about θ by generating the observation y. All of her information for making inferences about θ is contained in the knowledge that T = t. Experimenter B has just as much information as Experimenter A, who knows the entire sample x. Now X and Y have the same unconditional distribution because = = = = = = = P (X = x; θ) P [X = x, T (X) = T (x) ; θ] since the event {X = x} is a subset of the event {T (X) = T (x)} P [X = x|T (X) = T (x)] P [T (X) = T (x) ; θ] P (X = x|T = t) P (T = t; θ) where t = T (x) P (Y = x|T = t) P (T = t; θ) using (1.2) P [Y = x|T (X) = T (x)] P [T (X) = T (x) ; θ] P [Y = x, T (X) = T (x) ; θ] P (Y = x; θ) since the event {Y = x} is a subset of the event {T (X) = T (x)} . 12 CHAPTER 1. PROPERTIES OF ESTIMATORS The use of a sufficient statistic is formalized in the following principle: 1.4.2 The Sufficiency Principle Suppose T (X) is a sufficient statistic for a model {f (x; θ) ; θ ∈ Ω}. Suppose x1 , x2 are two different possible observations that have identical values of the sufficient statistic: T (x1 ) = T (x2 ). Then whatever inference we would draw from observing x1 we should draw exactly the same inference from x2 . If we adopt the sufficiency principle then we partition the sample space (the set of all possible outcomes) into mutually exclusive sets of outcomes in which all outcomes in a given set lead to the same inference about θ. This is referred to as data reduction. 1.4.3 Example Let (X1 , . . . , Xn ) be a random sample from the POI(θ) distribution. 
Show n P that T = Xi is a sufficient statistic for this model. i=1 1.4.4 Problem Let X1 , . . . , Xn be a random sample from the Bernoulli(θ) distribution and n P let T = Xi . i=1 (a) Find the conditional distribution of (X1 , . . . , Xn ) given T = t and thus show T is a sufficient statistic for this model. (b) Explain how you would generate data with the same distribution as the original data using the value of the sufficient statistic and a randomization device. (c) Let U = U (X1 ) = 1 if X1 = 1 and 0 otherwise. Find E(U ) and E(U |T = t). 1.4.5 Problem Let X1 , . . . , Xn be a random sample from the GEO(θ) distribution and let n P T = Xi . i=1 1.4. SUFFICIENCY 13 (a) Find the conditional distribution of (X1 , . . . , Xn ) given T = t and thus show T is a sufficient statistic for this model. (b) Explain how you would generate data with the same distribution as the original data using the value of the sufficient statistic and a randomization device. (d) Find E(X1 |T = t). 1.4.6 Problem Let X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution and let T = X(1) . (a) Find the conditional distribution of (X1 , . . . , Xn ) given T = t and thus show T is a sufficient statistic for this model. (b) Explain how you would generate data with the same distribution as the original data using the value of the sufficient statistic and a randomization device. (c) Find E [(X1 − 1) ; θ] and E [(X1 − 1) |T = t]. 1.4.7 Problem Let X1 , . . . , Xn be a random sample from the distribution with probability density function f (x; θ) . Show that the order statistic T (X) = (X(1) , . . . , X(n) ) is sufficient for the model {f (x; θ) ; θ ∈ Ω}. The following theorem gives a straightforward method for identifying sufficient statistics. 1.4.8 Factorization Criterion for Sufficiency Suppose X has probability (density) function {f (x; θ) ; θ ∈ Ω} and T (X) is a statistic. Then T (X) is a sufficient statistic for {f (x; θ) ; θ ∈ Ω} if and only if there exist two non—negative functions g(.) and h(.) such that f (x; θ) = g(T (x); θ)h(x), for all x, θ ∈ Ω. Note that this factorization need only hold on a set A of possible values of X which carries the full probability, that is, f (x; θ) = g(T (x); θ)h(x), where P (X ∈ A; θ) = 1, for all θ ∈ Ω. for all x ∈ A, θ ∈ Ω 14 CHAPTER 1. PROPERTIES OF ESTIMATORS Note that the function g(T (x); θ) depends on both the parameter θ and the sufficient statistic T (X) while the function h(x) does not depend on the parameter θ. 1.4.9 Example Let X1 , . . . , Xn be a random sample from the N(μ, σ2 ) distribtion. Show n n P P that ( Xi , Xi2 ) is a sufficient statistic for this model. Show that i=1 i=1 (X, S 2 ) is also a sufficient statistic for this model. 1.4.10 Example Let X1 , . . . , Xn be a random sample from the WEI(1, θ) distribution. Find a sufficient statistic for this model. 1.4.11 Example Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Show that T = X(n) is a sufficient statistic for this model. Find the conditional probability density function of (X1 , . . . , Xn ) given T = t. 1.4.12 Problem Let X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution. Show that X(1) is a sufficient statistic for this model and find the conditional probability density function of (X1 , . . . , Xn ) given X(1) = t. 1.4.13 Problem Use the Factorization Criterion for Sufficiency to show that if T (X) is a sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} then any one-to-one function of T is also a sufficient statistic. 
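The regeneration argument used to motivate sufficiency can be carried out concretely for Example 1.4.3: given T = t, a Poisson random sample has a Multinomial(t; 1/n, . . . , 1/n) conditional distribution, which does not involve θ, so Experimenter B can rebuild data with the same distribution as the original sample. The sketch below illustrates this; θ, n and the replication count are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(6)
    n, theta, reps = 5, 3.0, 50000                # arbitrary choices

    X = rng.poisson(theta, size=(reps, n))        # Experimenter A's data
    T = X.sum(axis=1)

    # Experimenter B sees only T and regenerates data from the conditional
    # distribution of X given T = t, Multinomial(t; 1/n, ..., 1/n),
    # which does not involve theta.
    Y = np.array([rng.multinomial(t, [1.0 / n] * n) for t in T])

    # The regenerated data have the same (unconditional) distribution:
    print("mean, var of X1:", X[:, 0].mean(), X[:, 0].var())
    print("mean, var of Y1:", Y[:, 0].mean(), Y[:, 0].var())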
We have seen above that sufficient statistics are not unique. One-toone functions of a statistic contain the same information as the original statistic. Fortunately, we can characterise all one-to-one functions of a statistic in terms of the way in which they partition the sample space. Note that the partition induced by the sufficient statistic provides a partition of the sample space into sets of observations which lead to the same inference about θ. See Figure 1.4. 1.4. SUFFICIENCY 1.4.14 15 Definition The partition of the sample space induced by a given statistic T (X) is the partition or class of sets of the form {x; T (x) = t} as t ranges over its possible values. SAMPLE SPACE {x;T(x)=1} {x;T(x)=2} {x;T(x)=3} {x;T(x)=4} {x;T(x)=5} Figure 1.4: Partition of the sample space induced by T From the point of view of statistical information on a parameter, a statistic is sufficient if it contains all of the information available in a data set about a parameter. There is no guarantee that the statistic does not contain more information than is necessary. For example, the data (X1 , . . . , Xn ) is always a sufficient statistic (why?), but in many cases, there is a further data reduction possible. For example, for independent observations from a N(θ, 1) distribution, the sample mean X is also a sufficient statistic but it is reduced as much as possible. Of course, T = (X̄)3 is a sufficient statistic since T and X̄ are one-to-one functions of each other. From X̄ we can obtain T and from T we can obtain X̄ so both of these statistics are equivalent in terms of the amount of information they contain about θ. Now suppose the function g is a many-to-one function, which is not invertible. Suppose further that g (X1 , . . . , Xn ) is a sufficient statistic. Then the reduction from (X1 , . . . , Xn ) to g (X1 , . . . , Xn ) is a non-trivial reduction of the data. Sufficient statistics that have experienced as much data reduction as is possible without losing the sufficiency property are called minimal sufficient statistics. 16 1.5 CHAPTER 1. PROPERTIES OF ESTIMATORS Minimal Sufficiency Now we wish to consider those circumstances under which a given statistic (actually the partition of the sample space induced by the given statistic) allows no further real reduction. Suppose g(.) is a many-to-one function and hence is a real reduction of the data. Is g(T ) still sufficient? In some cases, as in the example below, the answer is “no”. 1.5.1 Problem Let X1 , . . . , Xn be a random sample from the Bernoulli(θ) distribution. n P Show that T (X) = Xi is sufficient for this model. Show that if g is not i=1 a one-to-one function, (g(t1 ) = g(t2 ) = g0 for some integers t1 and t2 where 0 ≤ t1 < t2 ≤ n) then g(T ) cannot be sufficient for {f (x; θ) ; θ ∈ Ω}. Hint: Find P (T = t1 |g(T ) = g0 ). 1.5.2 Definition A statistic T (X) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω} if it is sufficient and if for any other sufficient statistic U (X), there exists a function g(.) such that T (X) = g(U (X)). This definition says that a minimal sufficient statistic is a function of every other sufficient statistic. In terms of the partition induced by the minimal sufficient this implies that the minimal sufficient statistic induces the coarsest partition possible of the sample space among all sufficient statistics. This partition is called the minimal sufficient partition. 1.5.3 Problem Prove that if T1 and T2 are both minimal sufficient statistics, then they induce the same partition of the sample space. 
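The notion of one partition being coarser than another is easy to see in a small discrete case. The sketch below lists the partition of the Bernoulli sample space {0, 1}^n induced by T(x) = x1 + · · · + xn for n = 3: the raw data induce the finest partition (eight singletons), while T groups the outcomes into four sets, one for each value of t. A minimal sufficient statistic induces the coarsest partition achievable without losing sufficiency. (The choice n = 3 and the variable names are ours.)

    from itertools import product

    n = 3
    partition = {}
    for x in product((0, 1), repeat=n):      # the 2^n points of the sample space
        t = sum(x)                           # T(x) = x1 + ... + xn
        partition.setdefault(t, []).append(x)

    for t, block in sorted(partition.items()):
        print("T =", t, ":", block)          # the four sets {x : T(x) = t}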
The following theorem is useful in showing a statistic is minimally sufficient. 1.5. MINIMAL SUFFICIENCY 1.5.4 17 Theorem - Minimal Sufficient Statistic Suppose the model is {f (x; θ) ; θ ∈ Ω} and let A = support of X. Partition A into the equivalence classes defined by ½ ¾ f (x; θ) Ay = x; = H(x, y) for all θ ∈ Ω , y ∈ A. f (y; θ) This is a minimal sufficient partition. The statistic T (X) which induces this partition is a minimal sufficient statistic. The proof of this theorem is given in Section 5.4.2 of the Appendix. 1.5.5 Example Let (X1 , . . . , Xn ) be a random sample from the distribution with probability density function f (x; θ) = θxθ−1 , 0 < x < 1, θ > 0. Find a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. 1.5.6 Example Let X1 , . . . , Xn be a random sample from the N(θ, θ2 ) distribution. Find a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. 1.5.7 Problem Let X1 , . . . , Xn be a random sample from the LOG(1, θ) distribution. Prove that the order statistic (X(1) , . . . , X(n) ) is a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. 1.5.8 Problem Let X1 , . . . , Xn be a random sample from the CAU(1, θ) distribution. Find a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. 1.5.9 Problem Let X1 , . . . , Xn be a random sample from the UNIF(θ, θ + 1) distribution. Find a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. 18 CHAPTER 1. PROPERTIES OF ESTIMATORS 1.5.10 Problem Let Ω denote the set of all probability density functions. Let (X1 , . . . , Xn ) be a random sample from a distribution with probability density function f ∈ Ω. Prove that the order statistic (X(1) , . . . , X(n) ) is a minimal sufficient statistic for the model {f (x); f ∈ Ω}. Note that in this example the unknown “parameter” is f . 1.5.11 Problem - Linear Regression Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n, X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is a vector of unknown parameters. Let ¡ ¢−1 T β̂ = X T X X Y and Se2 = (Y − X β̂)T (Y − X β̂)/(n − k). Show that (β̂, Se2 ) is a minimal sufficient statistic for this model. Hint: Show (Y − Xβ)T (Y − Xβ) = (n − k)Se2 + (β̂ − β)T X T X(β̂ − β). 1.6 Completeness The property of completeness is one which is useful for determining the uniqueness of estimators, for verifying, in some cases, that a minimal sufficient statistic has been found, and for finding U.M.V.U.E.’s. Let X1 , . . . , Xn denote the observations from a distribution with probability (density) function {f (x; θ) ; θ ∈ Ω}. Suppose T (X) is a statistic and u(T ), a function of T , is an unbiased estimator of θ so that E[u(T ); θ] = θ for all θ ∈ Ω. Under what circumstances is this the only unbiased estimator which is a function of T ? To answer this question, suppose u1 (T ) and u2 (T ) are both unbiased estimators of θ and consider the difference h(T ) = u1 (T ) − u2 (T ). Since u1 (T ) and u2 (T ) are both unbiased estimators we have E[h(T ); θ] = 0 for all θ ∈ Ω. Now if the only function h(T ) which satisfies E[h(T ); θ] = 0 for all θ ∈ Ω is the function h(t) = 0, then the two unbiased estimators must be identical. A statistic T with this property is said to be complete. The property of completeness is really a property of the family of distributions of T generated as θ varies. 1.6. 
COMPLETENESS 1.6.1 19 Definition The statistic T = T (X) is a complete statistic for {f (x; θ) ; θ ∈ Ω} if E[h(T ); θ] = 0, for all θ ∈ Ω implies P [h(T ) = 0; θ] = 1 for all θ ∈ Ω. 1.6.2 Example Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Consider n P T = T (X) = (X1 , Xi ). Prove that T is a sufficient statistic for the model i=2 {f (x; θ) ; θ ∈ Ω} but not a complete statistic. 1.6.3 Example Let X1 , . . . , Xn be a random sample from the Bernoulli(θ) distribution. n P Prove that T = T (X) = Xi is a complete sufficient statistic for the i=1 model {f (x; θ) ; θ ∈ Ω}. 1.6.4 Example Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Show that T = T (X) = X(n) is a complete statistic for the model {f (x; θ) ; θ ∈ Ω}. 1.6.5 Problem Prove that any one-to-one function of a complete sufficient statistic is a complete sufficient statistic. 1.6.6 Problem Let X1 , . . . , Xn be a random sample from the N(θ, aθ2 ) distribution where a > 0 is a known constant and θ > 0. Show that the minimal sufficient statistic is not a complete statistic. 20 1.6.7 CHAPTER 1. PROPERTIES OF ESTIMATORS Theorem If T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} then T (X) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω}. The proof of this theorem is given in Section 5.4.3 of the Appendix. 1.6.8 Problem The converse to the above theorem is not true. Let X1 , . . . , Xn be a random sample from the UNIF(θ − 1, θ + 1) distribution. Show that T = T (X) = (X(1) , X(n) ) is a minimal sufficient statistic for the model. Show also that for the non-zero function h(T ) = X(n) − X(1) n−1 − , 2 n+1 E[h(T ); θ] = 0 for all θ ∈ Ω and therefore T is not a complete statistic. 1.6.9 Example Let X = X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Prove that T = T (X) = X(n) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω}. 1.6.10 Problem Let X = X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution. Prove that T = T (X) = X(1) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω}. 1.6.11 Theorem For any random variables X and Y , E(X) = E[E(X|Y )] and V ar(X) = E[V ar(X|Y )] + V ar[E(X|Y )] 1.6. COMPLETENESS 1.6.12 21 Theorem If T = T (X) is a complete statistic for the model {f (x; θ) ; θ ∈ Ω}, then there is at most one function of T that provides an unbiased estimator of the parameter τ (θ). 1.6.13 Problem Prove Theorem 1.6.12. 1.6.14 Theorem (Lehmann-Scheffé) If T = T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} and E [g (T ) ; θ] = τ (θ), then g(T ) is the unique U.M.V.U.E. of τ (θ). 1.6.15 Example Let X1 , . . . , Xn be a random sample from the Bernoulli(θ) distribution. Find the U.M.V.U.E. of τ (θ) = θ2 . 1.6.16 Example Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Find the U.M.V.U.E. of τ (θ) = θ. 1.6.17 Problem Let X1 , . . . , Xn be a random sample from the Bernoulli(θ) dsitribution. Find the U.M.V.U.E. of τ (θ) = θ(1 − θ). 1.6.18 Problem Suppose X has a Hypergeometric distribution with p.f. µ ¶µ ¶ Nθ N − Nθ x n−x µ ¶ f (x; θ) = , x = 0, 1, . . . , min (N θ, N − N θ) ; N n ¾ ½ 1 2 θ ∈ Ω = 0, , , . . . , 1 N N Show that X is a complete sufficient statistic. Find the U.M.V.U.E. of θ. 22 1.6.19 CHAPTER 1. PROPERTIES OF ESTIMATORS Problem Let X1 , . . . , Xn be a random sample from the EXP(β, μ) distribution where β is known. Show that T = X(1) is a complete sufficient statistic for this model. Find the U.M.V.U.E. of μ and the U.M.V.U.E. of μ2 . 
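Example 1.6.16 can be checked by simulation. For a UNIF(0, θ) sample, E[X(n)] = nθ/(n + 1), so (n + 1)X(n)/n is unbiased and, being a function of the complete sufficient statistic X(n), it is the U.M.V.U.E. of θ by the Lehmann-Scheffé Theorem; in particular its variance should be smaller than that of the unbiased competitor 2X̄ (T1 of Example 1.3.12). The following sketch illustrates this; θ, n and the number of replications are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(7)
    n, theta, reps = 10, 5.0, 100000                  # arbitrary choices

    X = rng.uniform(0, theta, size=(reps, n))
    T1 = 2 * X.mean(axis=1)                           # 2 * Xbar, unbiased
    T_umvue = (n + 1) / n * X.max(axis=1)             # (n+1) X_(n) / n, the U.M.V.U.E.

    print("means     :", T1.mean(), T_umvue.mean())   # both near theta = 5
    print("variances :", T1.var(), T_umvue.var())     # the U.M.V.U.E. is much smaller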
1.6.20 Problem Suppose X1 , ..., Xn is a random sample from the UNIF(a, b) distribution. Show that T = (X(1) , X(n) ) is a complete sufficient statistic for this model. Find the U.M.V.U.E.’s of a and b. Find the U.M.V.U.E. of the mean of Xi . 1.6.21 Problem Let T (X) be an unbiased estimator of τ (θ). Prove that T (X) is a U.M.V.U.E. of τ (θ) if and only if E(U T ; θ) = 0 for all U (X) such that E(U ) = 0 for all θ ∈ Ω. 1.6.22 Theorem (Rao-Blackwell) If T = T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} and U = U (X) is any unbiased estimator of τ (θ), then E(U |T ) is the U.M.V.U.E. of τ (θ). 1.6.23 Problem Let X1 , . . . , Xn be a random sample from the EXP(β, μ) distribution where β is known. Find the U.M.V.U.E. of τ (μ) = P (X1 > c; μ) where c ∈ < is a known constant. Hint: Let U = U (X1 ) = 1 if X1 ≥ c and 0 otherwise. 1.6.24 Problem Let X1 , . . . , Xn be a random sample from the DU(θ) distribution. Show that T = X(n) is a complete sufficient statistic for this model. Find the U.M.V.U.E. of θ. 1.7. THE EXPONENTIAL FAMILY 1.7 1.7.1 23 The Exponential Family Definition Suppose X = (X1 , . . . , Xp ) has a (joint) probability (density) function of the form " # k P f (x; θ) = C(θ) exp qj (θ)Tj (x) h(x) (1.3) j=1 for functions qj (θ), Tj (x), h(x), C(θ). Then we say that f (x; θ) is a member of the exponential family of densities. We call (T1 (X), . . . , Tk (X)) the natural sufficient statistic. It should be noted that the natural sufficient statistic is not unique. Multiplication of Tj by a constant and division of qj by the same constant results in the same function f (x; θ). More generally linear transformations of the Tj and the qj can also be used. 1.7.2 Example Prove that T (X) = (T1 (X), . . . , Tk (X)) is a sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} where f (x; θ) has the form (1.3). 1.7.3 Example Show that the BIN(n, θ) distribution has an exponential family distribution and find the natural sufficient statistic. One of the important properties of the exponential family is its closure under repeated independent sampling. 1.7.4 Theorem Let X1 , . . . , Xn be a random sample from the distribution with probability (density) function given by (1.3). Then (X1 , . . . , Xn ) also has an exponential family form, with joint probability (density) function n f (x1 , . . . xn ; θ) = [C (θ)] exp ( k P j=1 qj (θ) ∙ n P i=1 ¸) n Q Tj (xi ) h (xi ) . i=1 24 CHAPTER 1. PROPERTIES OF ESTIMATORS In other words, C is replaced by C n and Tj (x) by n P Tj (xi ). The natural i=1 sufficient statistic is µ 1.7.5 n P T1 (Xi ), . . . , i=1 n P i=1 ¶ Tk (Xi ) . Example Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Show that X1 , . . . , Xn is a member of the exponential family. 1.7.6 Canonical Form of the Exponential Family It is usual to reparameterize equation (1.3) by replacing qj (θ) by a new parameter ηj . This results in the canonical form of the exponential family f (x; η) = C(η) exp " k P # ηj Tj (x) h(x). j=1 The natural parameter space in this form is the set of all values of η for which the above function is integrable; that is {η; Z∞ −∞ f (x; η)dx < ∞}. If X is discrete the intergral is replaced by the sum over all x such that f (x; η) > 0. If the statistic satisfies a linear constraint, for example, P à k P j=1 Tj (X) = 0; η ! = 1, then the number of terms k can be reduced. Unless this is done, the parameters ηj are not all statistically meaningful. 
For example the data may permit us to estimate η1 + η2 but not allow estimation of η1 and η2 individually. In this case we call the parameter “unidentifiable”. We will need to assume that the exponential family representation is minimal in the sense that neither the ηj nor the Tj satisfy any linear constraints. 1.7. THE EXPONENTIAL FAMILY 1.7.7 25 Definition We will say that X has a regular exponential family distribution if it is in canonical form, is of full rank in the sense that neither the Tj nor the ηj satisfy any linear constraints, and the natural parameter space contains a k−dimensional rectangle. By Theorem 1.7.4 if Xi has a regular exponential family distribution then X = (X1 , . . . , Xn ) also has a regular exponential family distribution. 1.7.8 Example Show that X ∼ BIN(n, θ) has a regular exponential family distribution. 1.7.9 Theorem If X has a regular exponential family distribution with natural sufficient statistic T (X) = (T1 (X), . . . , Tk (X)) then T (X) is a complete sufficient statistic. Reference: Lehmann and Ramano (2005), Testing Statistical Hypotheses (3rd edition), pp. 116-117. 1.7.10 Differentiating under the Integral In Chapter 2, it will be important to know if a family of models has the property that differentiation under the integral is possible. We state that for a regular exponential family, it is possible to differentiate under the integral, that is, # " # " Z Z k k P P ∂m ∂m ηj Tj (x) h(x)dx = C(η) exp ηj Tj (x) h(x)dx C(η) exp ∂ηim ∂ηim j=1 j=1 for any m = 1, 2, . . . and any η in the interior of the natural parameter space. 1.7.11 Example Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Find a complete sufficient statistic for this model. Find the U.M.V.U.E.’s of μ and σ 2 . 1.7.12 Example Show that X ∼ N(θ, θ2 ) does not have a regular exponential family distribution. 26 CHAPTER 1. PROPERTIES OF ESTIMATORS 1.7.13 Example Suppose (X1 , X2 , X3 ) have joint density f (x1 , x2 , x3 ; θ1 , θ2 , θ3 ) = P (X1 = x1 , X2 = x2 , X3 = x3 ; θ1 , θ2 , θ3 ) n! = θ x1 θ x2 θ x3 x1 !x2 !x3 ! 1 2 3 xi = 0, 1, . . . ; i = 1, 2, 3; x1 + x2 + x3 = n 0 < θi < 1; i = 1, 2, 3; θ1 + θ2 + θ3 = 1 Find the U.M.V.U.E. of θ1 , θ2 , and θ1 θ2 . Since f (x1 , x2 , x3 ; θ1 , θ2 , θ3 ) = exp " 3 P # qj (θ1 , θ2 , θ3 ) Tj (x1 , x2 , x3 ) h (x1 , x2 , x3 ) j=1 where qj (θ1 , θ2 , θ3 ) = log θj , Tj (x1 , x2 , x3 ) = xj , j = 1, 2, 3 and h (x1 , x2 , x3 ) = (X1 , X2 , X3 ) is a member of the exponential family. But 3 P Tj (x1 , x2 , x3 ) = n and θ1 + θ2 + θ3 = 1 j=1 and thus (X1 , X2 , X3 ) is not a member of the regular exponential family. However by substituting X3 = n − X1 − X2 and θ3 = 1 − θ1 − θ2 we can show that (X1 , X2 ) has a regular exponential family distribution. Let µ µ ¶ ¶ θ1 θ2 η1 = log , η2 = log 1 − θ1 − θ2 1 − θ1 − θ2 then θ1 = eη1 , 1 + eη1 + eη2 θ2 = eη2 . 1 + eη1 + eη2 Let C (η1 , η2 ) = µ T1 (x1 , x2 ) = x1 , T2 (x1 , x2 ) = x2 , ¶n 1 n! , and h (x1 , x2 ) = . 1 + eη1 + eη2 x1 !x2 ! (n − x1 − x2 )! In canonical form (X1 , X2 ) has p.f. f (x1 , x2 ; η1 , η2 ) = C (η1 , η2 ) exp [η1 T1 (x1 , x2 ) + η2 T2 (x1 , x2 )] h (x1 , x2 ) n! , x1 !x2 !x3 ! 1.7. THE EXPONENTIAL FAMILY 27 with natural parameter space {(η1 , η2 ) ; η1 ∈ <, η2 ∈ <} which contains a two-dimensional rectangle. The ηj0 s and the Tj0 s do not satisfy any linear constraints. Therefore (X1 , X2 ) has a regular exponential family distribution with natural sufficient statistic T (X1 , X2 ) = (X1 , X2 ) and thus T (X1 , X2 ) is a complete sufficient statistic. 
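Before continuing with the U.M.V.U.E.'s, the canonical form just derived can be verified numerically: C (η1 , η2 ) exp [η1 x1 + η2 x2 ] h (x1 , x2 ) should reproduce the multinomial probabilities when η1 = log [θ1 / (1 − θ1 − θ2 )] and η2 = log [θ2 / (1 − θ1 − θ2 )]. The sketch below uses scipy; the parameter values are arbitrary choices.

    from math import exp, factorial, log
    from scipy.stats import multinomial

    n = 10
    theta1, theta2 = 0.25, 0.5                # arbitrary values
    theta3 = 1 - theta1 - theta2

    eta1 = log(theta1 / theta3)
    eta2 = log(theta2 / theta3)
    C = (1 + exp(eta1) + exp(eta2)) ** (-n)   # C(eta1, eta2)

    for x1 in range(n + 1):
        for x2 in range(n + 1 - x1):
            x3 = n - x1 - x2
            h = factorial(n) / (factorial(x1) * factorial(x2) * factorial(x3))
            canon = C * exp(eta1 * x1 + eta2 * x2) * h
            direct = multinomial.pmf([x1, x2, x3], n=n, p=[theta1, theta2, theta3])
            assert abs(canon - direct) < 1e-10
    print("canonical form matches the multinomial p.f.")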
By the properties of the multinomial distribution (see Section 5.2.2) we have X1 v BIN (n, θ1 ) , X2 v BIN (n, θ2 ) and Cov (X1 , X2 ) = −nθ1 θ2 . Since µ µ ¶ ¶ X1 X2 nθ1 nθ2 E ; θ1 , θ2 = = θ1 and E ; θ1 , θ2 = = θ2 n n n n then by the Lehmann-Scheffé Theorem X1 /n is the U.M.V.U.E. of θ1 and X2 /n is the U.M.V.U.E. of θ2 . Since −nθ1 θ2 = Cov (X1 , X2 ; θ1 , θ2 ) = E (X1 X2 ; θ1 , θ2 ) − E (X1 ; θ1 , θ2 ) E (X2 ; θ1 , θ2 ) = E (X1 X2 ; θ1 , θ2 ) − n2 θ1 θ2 or E µ X1 X2 ; θ1 , θ2 n (n − 1) ¶ = θ1 θ2 then by the Lehmann-Scheffé Theorem X1 X2 / [n (n − 1)] is the U.M.V.U.E. of θ1 θ2 . 1.7.14 Example Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Find the U.M.V.U.E. of τ (θ) = e−θ . Show that the U.M.V.U.E. is also a consistent estimator of τ (θ). Since (X1 , . . . , Xn ) is a member of the regular exponential family with n P natural sufficient statistic T = Xi therefore T is a complete sufficient i=1 statistic. Consider the random variable U (X1 ) = 1 if X1 = 0 and 0 otherwise. Then E [U (X1 ); θ] = 1 · P (X1 = 0; θ) = e−θ , θ>0 and U (X1 ) is an unbiased estimator of τ (θ) = e−θ . Therefore by the RaoBlackwell Theorem E (U |T ) is the U.M.V.U.E. of τ (θ) = e−θ . 28 CHAPTER 1. PROPERTIES OF ESTIMATORS Since X1 , . . . , Xn is a random sample from the POI(θ) distribution, X1 v POI (θ) , T = Thus n P i=1 Xi v POI (nθ) and n P i=2 Xi v POI ((n − 1) θ) . E (U |T = t) = 1 · P (X1 = 0|T = t) µ ¶ n P P X1 = 0, Xi = t; θ i=1 = P (T = t; θ) µ ¶ n P P X1 = 0, Xi = t − 0; θ i=2 = P (T = t; θ) (nθ)t −nθ [(n − 1) θ]t e−(n−1)θ = e−θ ÷ e t! t! µ ¶t 1 = 1− , t = 0, 1, . . . n ¢T ¡ is the U.M.V.U.E. of θ. Therefore E (U |T ) = 1 − n1 Since X1 , . . . , Xn is a random sample from the POI(θ) distribution then by the W.L.L.N. X̄ →p θ and by the Limit Theorems (see Section 5.3) µ ¶T ∙µ ¶n ¸X̄ 1 1 = 1− →p e−θ E (U |T ) = 1 − n n and therefore E (U |T ) a consistent estimator of e−θ . 1.7.15 Example Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Find the U.M.V.U.E. of τ (θ) = Φ(c − θ) = P (Xi ≤ c; θ) for some constant c where Φ is the standard normal cumulative distribution function. Show that the U.M.V.U.E. is also a consistent estimator of τ (θ). Since (X1 , . . . , Xn ) is a member of the regular exponential family with n P natural sufficient statistic T = Xi therefore T is a complete sufficient i=1 statistic. Consider the random variable U (X1 ) = 1 if X1 ≤ c and 0 otherwise. Then E [U (X1 ); θ] = 1 · P (X1 ≤ c; θ) = Φ(c − θ), θ∈< 1.7. THE EXPONENTIAL FAMILY 29 and U (X1 ) is an unbiased estimator of τ (θ) = Φ(c − θ). Therefore by the Rao-Blackwell Theorem E (U |T ) is the U.M.V.U.E. of τ (θ) = Φ(c − θ). Since X1 , . . . , Xn is a random sample from the N(θ, 1) distribution, X1 v N(θ, 1), T = n P i=1 Xi v N(nθ, n) and The conditional p.d.f. of X1 given T = t is n P i=2 Xi v N((n − 1) θ, n − 1). f (x1 |T = t) ¸ ∙ 1 1 2 = √ exp − (x1 − θ) 2 2π ( ( ) ) [t − x1 − (n − 1) θ]2 1 [t − nθ]2 1 exp − ×p exp − ÷√ 2 (n − 1) 2n 2πn 2π (n − 1) ( " #) 2 1 2 (t − x1 ) 1 t2 exp − = q ¡ + x − ¢ 2 1 n−1 n 2π 1 − n1 " # µ ¶2 1 t 1 ¡ ¢ x1 − = q ¡ ¢ exp − 2 1 − 1 n n 2π 1 − 1 n which is the p.d.f. of a N( nt , 1 − n1 ) random variable. Since X1 |T = t has a N( nt , 1 − n1 ) distribution, E (U |T ) = 1 · P (X1 ≤ c|T ) ⎞ ⎛ c − T /n ⎠ = Φ ⎝ q¡ ¢ 1 1− n is the U.M.V.U.E. of τ (θ) = Φ(c − θ). Since X1 , . . . , Xn is a random sample from the N(θ, 1) distribution then by the W.L.L.N. 
X̄ →p θ and by the Limit Theorems ⎞ ⎛ ⎞ ⎛ c − T /n ⎠ ⎝ c − X̄ ⎠ →p Φ (c − θ) E (U |T ) = Φ ⎝ q¡ ¢ = Φ q¡ ¢ 1 1− n 1 − n1 and therefore E (U |T ) a consistent estimator τ (θ) = Φ(c − θ). 30 1.7.16 CHAPTER 1. PROPERTIES OF ESTIMATORS Problem Let X1 , . . . , Xn be a random sample from the distribution with probability density function f (x; θ) = θxθ−1 , 0 < x < 1, θ > 0. µn ¶1/n Q Show that the geometric mean of the sample Xi is a complete i=1 sufficient statistic and find the U.M.V.U.E. of θ. Hint: − log Xi ∼ EXP(1/θ). 1.7.17 Problem Let X1 , . . . , Xn be a random sample from the EXP(β, μ) distribution where n P μ is known. Show that T = Xi is a complete sufficient statistic. Find i=1 the U.M.V.U.E. of β 2 . 1.7.18 Problem Let X1 , . . . , Xn be a random sample from the GAM(α, β) distribution and θ = (α, β). Find the U.M.V.U.E. of τ (θ) = αβ. 1.7.19 Problem Let X ∼ NB(k, θ). Find the U.M.V.U.E. of θ. Hint: Find E[(X + k − 1)−1 ; θ]. 1.7.20 Problem Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Find the U.M.V.U.E. of τ (θ) = θ2 . 1.7.21 Problem Let X1 , . . . , Xn be a random sample from the N(0, θ) distribution. Find the U.M.V.U.E. of τ (θ) = θ2 . 1.7.22 Problem Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Find the U.M.V.U.E. for τ (θ) = (1 + θ)e−θ . Hint: Find P (X1 ≤ 1; θ). 1.7. THE EXPONENTIAL FAMILY Member of the REF 31 Complete Sufficient Statistic n P POI (θ) Xi i=1 n P BIN (n, θ) Xi i=1 n P NB (k, θ) Xi i=1 ¡ ¢ N μ, σ2 ¡ ¢ N μ, σ2 Xi i=1 n P μ known i=1 µ ¢ ¡ N μ, σ2 GAM (α, β) n P σ 2 known n P (Xi − μ)2 Xi , i=1 α known n P i=1 n P Xi2 ¶ Xi i=1 GAM (α, β) n Q β known µ GAM (α, β) n P Xi , i=1 EXP (β, μ) μ known Xi i=1 n Q Xi i=1 n P ¶ Xi i=1 Not a Member of the REF Complete Sufficient Statistic UNIF (0, θ) X(n) ¡ ¢ X(1) , X(n) UNIF (a, b) EXP (β, μ) EXP (β, μ) β known X(1) µ ¶ n P X(1) , Xi i=1 32 1.7.23 CHAPTER 1. PROPERTIES OF ESTIMATORS Problem Let X1 , . . . , Xn be a random sample form the POI(θ) distribution. Find the U.M.V.U.E. for τ (θ) = e−2θ . Hint: Find E[(−1)X1 ; θ] . Show that this estimator has some undesirable properties when n = 1 and n = 2 but when n is large, it is approximately equal to the maximum likelihood estimator. 1.7.24 Problem Let X1 , . . . , Xn be a random sample from the GAM(2, θ) distribution. Find the U.M.V.U.E. of τ1 (θ) = 1/θ and the U.M.V.U.E. of τ2 (θ) = P (X1 > c; θ) where c > 0 is a constant. 1.7.25 Problem In Problem 1.5.11 show that β̂ is the U.M.V.U.E. of β and Se2 is the U.M.V.U.E. of σ 2 . 1.7.26 Problem A Brownian Motion process is a continuous-time stochastic process X (t) which is often used to describe the value of an asset. Assume X (t) represents the market price of a given asset such as a portfolio of stocks at time t and x0 is the value of the portfolio at the beginning of a given time period (assume that the analysis is conditional on x0 so that x0 is fixed and known). The distribution of X (t) for any fixed time t is assumed to be N(x0 + μt, σ2 t) for 0 < t ≤ 1. The parameter μ is the drift of the Brownian motion process and the parameter σ is the diffusion coefficient. Assume that t = 1 corresponds to the end of the time period so X (1) is the closing price. Suppose that we record both the period high max X (t) and the close X (1). Define random variables {0≤t≤1} M = max X (t) − x0 {0≤t≤1} and Y = X (1) − x0 . 
The joint probability density function of (M, Y ) can be shown to be ½ i¾ 2 (2m − y) 1 h 2 2 f (m, y; μ, σ 2 ) = √ 2μy − μ , exp − (2m − y) 2σ2 2πσ 3 m > 0, − ∞ < y < m, μ ∈ < and σ 2 > 0. 1.8. ANCILLARITY 33 (a) Show that (M, Y ) has a regular exponential family distribution. ¡ ¢ ¢ ¡ (b) Let Z = M (M − Y ). Show that Y v N μ, σ2 and Z v EXP σ2 /2 independently. (c) Suppose we record independent pairs of observations (Mi , Yi ), i = 1, . . . n on the portfolio for a total of n distinct time periods. Find the U.M.V.U.E.’s of μ and σ 2 . (d) Show that the estimators V1 = and V2 = n ¡ ¢2 1 P Yi − Ȳ n − 1 i=1 n n 2 P 2 P Zi = Mi (Mi − Yi ) n i=1 n i=1 are also unbiased estimators of σ 2 . How do we know that neither of these estimators is the U.M.V.U.E. of σ 2 ? Show that the U.M.V.U.E. of σ2 can be written as a weighted average of V1 and V2 . Compare the variances of all three estimators. (e) An up-and-out call option on the portfolio is an option with exercise price ξ (a constant) which pays a total of (X (1) − ξ) dollars at the end of one period provided that this quantity is positive and provided that X (t) never exceeded the value of a barrier throughout this period of time, that is, provided that M < a. Thus the option pays g(M, Y ) = max {Y − (ξ − x0 ), 0} if M < a and otherwise g(M, Y ) = 0. Find the expected value of such an option, that is, find the expected value of g(M, Y ). 1.8 Ancillarity Let X = (X1 , . . . , Xn ) denote observations from a distribution with probability (density) function {f (x; θ) ; θ ∈ Ω} and let U (X) be a statistic. The information on the parameter θ is provided by the sensitivity of the distribution of a statistic to changes in the parameter. For example, suppose a modest change in the parameter value leads to a large change in the expected value of the distribution resulting in a large shift in the data. Then the parameter can be estimated fairly precisely. On the other hand, if a statistic U has no sensitivity at all in distribution to the parameter, then it would appear to contain little information for point estimation of this parameter. A statistic of the second kind is called an ancillary statistic. 34 1.8.1 CHAPTER 1. PROPERTIES OF ESTIMATORS Definition U (X) is an ancillary statistic if its distribution does not depend on the unknown parameter θ. Ancillary statistics are, in a sense, orthogonal or perpendicular to minimal sufficient statistics. Ancillary statistics are analogous to the residuals in a multiple regression, while the complete sufficient statistics are analogous to the estimators of the regression coefficients. It is well-known that the residuals are uncorrelated with the estimators of the regression coefficients (and independent in the case of normal errors). However, the “irrelevance” of the ancillary statistic seems to be limited to the case when it is not part of the minimal (preferably complete) sufficient statistic as the following example illustrates. 1.8.2 Example Suppose a fair coin is tossed to determine a random variable N = 1 with probability 1/2 and N = 100 otherwise. We then observe a Binomial random variable X with parameters (N, θ). Show that the minimal sufficient statistic is (X, N ) but that N is an ancillary statistic. Is N irrelevant to inference about θ? In this example it seems reasonable to condition on an ancillary component of the minimal sufficient statistic. 
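A quick simulation of Example 1.8.2 makes the point (a sketch; θ and the replication count are arbitrary choices): N has the same half-and-half distribution whatever θ is, so on its own it carries no information about θ, yet the precision of the natural estimate X/N is completely different on the two branches N = 1 and N = 100, which is why conditioning on the observed N seems sensible.

    import numpy as np

    rng = np.random.default_rng(2)
    theta = 0.3                                    # arbitrary true value
    reps = 200000

    N = np.where(rng.random(reps) < 0.5, 1, 100)   # fair coin chooses N
    X = rng.binomial(N, theta)                     # X | N ~ BIN(N, theta)
    est = X / N

    print("P(N = 1) estimate:", np.mean(N == 1))   # about 0.5, free of theta
    print("sd of X/N given N = 1  :", est[N == 1].std())
    print("sd of X/N given N = 100:", est[N == 100].std())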
Conducting inference conditionally on the ancillary statistic essentially means treating the observed number of trials as if it had been fixed in advance instead of the result of the toss of a fair coin. This example also illustrates the use of the following principle: 1.8.3 The Conditionality Principle Suppose the minimal sufficient statistic can be written in the form T = (U, A) where A is an ancillary statistic. Then all inference should be conducted using the conditional distribution of the data given the value of the ancillary statistic, that is, using the distribution of X|A. Some difficulties arise from the application of this principle since there is no general method for constructing the ancillary statistic and ancillary statistics are not necessarily unique. 1.8. ANCILLARITY 35 The following theorem allows us to use the properties of completeness and ancillarity to prove the independence of two statistics without finding their joint distribution. 1.8.4 Basu’s Theorem Consider X with probability (density) function {f (x; θ) ; θ ∈ Ω}. Let T (X) be a complete sufficient statistic. Then T (X) is independent of every ancillary statistic U (X). 1.8.5 Proof We need to show P [U (X) ∈ B, T (X) ∈ C; θ] = P [U (X) ∈ B; θ] · P [T (X) ∈ C; θ] for all sets B, C and all θ ∈ Ω. Let g(t) = P [U (X) ∈ B|T (X) = t] − P [U (X) ∈ B] for all t ∈ A where P (T ∈ A; θ) = 1. By sufficiency, P [U (X) ∈ B|T (X) = t] does not depend on θ, and by ancillarity, P [U (X) ∈ B] also does not depend on θ. Therefore g(T ) is a statistic. Let ½ 1 if U (X) ∈ B I{U (X) ∈ B} = 0 else. Then E[I{U (X) ∈ B}] = P [U (X) ∈ B], E[I{U (X) ∈ B}|T = t] = P [U (X) ∈ B|T = t], and g(t) = E[I{U (X) ∈ B}|T (X) = t] − E[I{U (X) ∈ B}]. This gives E[g(T )] = E[E[I{U (X) ∈ B}|T ]] − E[I{U (X) ∈ B}] = E[I{U (X) ∈ B}] − E[I{U (X) ∈ B}] = 0 for all θ ∈ Ω, and since T is complete this implies P [g(T ) = 0; θ] = 1 for all θ ∈ Ω. Therefore P [U (X) ∈ B|T (X) = t] = P [U (X) ∈ B] for all t ∈ A and all B. (1.4) 36 CHAPTER 1. PROPERTIES OF ESTIMATORS Suppose T has probability density function h(t; θ). Then Z P [U (X) ∈ B, T (X) ∈ C; θ] = P [U (X) ∈ B|T = t]h(t; θ)dt C = Z P [U (X) ∈ B]h(t; θ)dt C = P [U (X) ∈ B] · Z by (1.4) h(t; θ)dt C = P [U (X) ∈ B] · P [T (X) ∈ C; θ] true for all sets B, C and all θ ∈ Ω as required.¥ 1.8.6 Example Let X1 , . . . , Xn be a random sample from the EXP(θ) distribution. Show n P that T (X1 , . . . , Xn ) = Xi and U (X1 , . . . , Xn ) = (X1 /T, . . . , Xn /T ) are i=1 independent random variables. Find E(X1 /T ). 1.8.7 Example Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Prove that X̄ and S 2 are independent random variables. 1.8.8 Problem Let X1 , . . . , Xn be a random sample from the distribution with p.d.f. f (x; β) = 2x , β2 0 < x ≤ β. (a) Show that β is a scale parameter for this model. (b) Show that T = T (X1 , . . . , Xn ) = X(n) is a complete sufficient statistic for this model. (c) Find the U.M.V.U.E. of β. (d) Show that T and U = U (X) = X1 /T are independent random variables. (e) Find E (X1 /T ). 1.8. ANCILLARITY 1.8.9 37 Problem Let X1 , . . . , Xn be a random sample from the GAM(α, β) distribution. (a) Show that β is a scale parameter for this model. n P (b) Suppose α is known. Show that T = T (X1 , . . . , Xn ) = Xi is a i=1 complete sufficient statistic for the model. (c) Show that T and U = U (X1 , . . . , Xn ) = (X1 /T, . . . , Xn /T ) are independent random variables. (d) Find E(X1 /T ). 
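Basu's Theorem is easy to probe by simulation. For Example 1.8.6, with X1 , . . . , Xn a random sample from EXP(θ), T = ΣXi is a complete sufficient statistic while X1 /T is ancillary (θ is a scale parameter), so the two should be independent, and by symmetry E(X1 /T ) = 1/n. The following sketch checks this; θ, n and the replication count are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)
    n, theta, reps = 5, 2.0, 100000                    # arbitrary choices

    X = rng.exponential(scale=theta, size=(reps, n))   # EXP(theta) samples
    T = X.sum(axis=1)
    U = X[:, 0] / T                                    # ancillary statistic X1 / T

    print("corr(T, X1/T):", np.corrcoef(T, U)[0, 1])   # near 0, consistent with independence
    print("mean of X1/T :", U.mean(), "   1/n =", 1 / n)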
1.8.10 Problem In Problem 1.5.11 show that β̂ and Se2 are independent random variables. 1.8.11 Problem Let X1 , . . . , Xn be a random sample from the EXP(β, μ) distribution. (a) Suppose β is known. Show that T1 = X(1) is a complete sufficient statistic for the model. n ¡ ¢ P (b) Show that T1 and T2 = Xi − X(1) are independent random varii=1 ables. (c) Find the p.d.f. of T2 . Hint: Show n P i=1 (Xi − μ) = n (T1 − μ) + T2 . (d) Show that (T1 , T2 ) is a complete sufficient statistic for the model {f (x1 , . . . , xn ; μ, β) ; μ ∈ <, β > 0}. (e) Find the U.M.V.U.E.’s of β and μ. 38 1.8.12 CHAPTER 1. PROPERTIES OF ESTIMATORS Problem Let X1 , . . . , Xn be a random sample from the distribution with p.d.f. f (x; α, β) = αxα−1 , βα α > 0, 0 < x ≤ β. (a) Show that if α is known then T1 = X(n) is a complete sufficient statistic for the model. n Q Xi (b) Show that T1 and T2 = T1 are independent random variables. i=1 (c) Find the p.d.f. of T2 . Hint: Show n X i=1 log µ Xi β ¶ = log T2 + n log µ T1 β ¶ . (d) Show that (T1 , T2 ) is a complete sufficient statistic for the model. (e) Find the U.M.V.U.E. of α. Chapter 2 Maximum Likelihood Estimation 2.1 Maximum Likelihood Method - One Parameter Suppose we have collected the data x (possibly a vector) and we believe that these data are observations from a distribution with probability function P (X = x; θ) = f (x; θ) where the scalar parameter θ is unknown and θ ∈ Ω. The probability of observing the data x is equal to f (x; θ). When the observed value of x is substituted into f (x; θ), then f (x; θ) is a function of the parameter θ only. In the absence of any other information, it seems logical that we should estimate the parameter θ using a value most compatible with the data. For example we might choose the value of θ which maximizes the probability of the observed data. 2.1.1 Definition Suppose X is a random variable with probability function P (X = x; θ) = f (x; θ), where θ ∈ Ω is a scalar and suppose x is the observed data. The likelihood function for θ is L(θ) = P (observing the data x; θ) = P (X = x; θ) = f (x; θ), θ ∈ Ω. 39 40 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION If X = (X1 , . . . , Xn ) is a random sample from the probability function P (X = x; θ) = f (x; θ) and x = (x1 , . . . , xn ) are the observed data then the likelihood function for θ is L(θ) = P (observing the data x; θ) = P (X1 = x1 , . . . , Xn = xn ; θ) n Q = f (xi ; θ), θ ∈ Ω. i=1 The value of θ which maximizes the likelihood L (θ) also maximizes the logarithm of the likelihood function. (Why?) Since it is easier to find the derivative of the sum of n terms rather than the product, we usually determine the maximum of the logarithm of the likelihood function. 2.1.2 Definition The log likelihood function is defined as l(θ) = log L(θ), θ∈Ω where log is the natural logarithmic function. 2.1.3 Definition The value of θ that maximizes the likelihood function L(θ) or equivalently the log likelihood function l(θ) is called the maximum likelihood (M.L.) estimate. The M.L. estimate is a function of the data x and we write θ̂ = θ̂ (x). The corresponding M.L. estimator is denoted θ̂ = θ̂(X). 2.1.4 Example Suppose in a sequence of n Bernoulli trials with P (Success) = θ we have observed x successes. Find the likelihood function L (θ), the log likelihood function l (θ), the M.L. estimate of θ and the M.L. estimator of θ. 2.1.5 Example Suppose we have collected data x1 , . . . , xn and we believe these observations are independent observations from a POI(θ) distribution. 
Find the likelihood function, the log likelihood function, the M.L. estimate of θ and the M.L. estimator of θ. 2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 41 2.1.6 Problem Suppose we have collected data x1 , . . . , xn and we believe these observations are independent observations from the DU(θ) distribution. Find the likelihood function, the M.L. estimate of θ and the M.L. estimator of θ. 2.1.7 Definition The score function is defined as S(θ) = 2.1.8 d d l (θ) = log L (θ) , dθ dθ θ ∈ Ω. Definition The information function is defined as I(θ) = − d2 d2 l(θ) = − 2 log L(θ), 2 dθ dθ θ ∈ Ω. I(θ̂) is called the observed information. In Section 2.7 we will see how the observed information I(θ̂) can be used to construct approximate confidence intervals for the unknown parameter θ. I(θ̂) also tells us about the concavity of the log likelihood function. Suppose in Example 2.1.5 the M.L. estimate of θ was θ̂ = 2. If n = 10 then I(θ̂) = 10/2 = 5. If n = 25 then I(θ̂) = 25/2 = 12.5. See Figure 2.1. The log likelihood function is more concave down for n = 25 than for n = 10 which reflects the fact that as the number of observations increases we have more “information” about the unknown parameter θ. 2.1.9 Finding M.L. Estimates If X1 , . . . , Xn is a random sample from a distribution whose support set does not depend on θ then we usually find θ̂ by solving S(θ) = 0. It is important to verify that θ̂ is the value of θ which maximizes L (θ) or equivalently l (θ). This can be done using the First Derivative Test. Note that the condition I(θ̂) > 0 only checks for a local maximum. Although we view the likelihood, log likelihood, score and information functions as functions of θ they are, of course, also functions of the observed 42 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 2 0 n=10 -2 R(θ) n=25 -4 -6 -8 -10 1 1.5 2 θ 2.5 3 3.5 Figure 2.1: Poisson Log Likelihoods for n = 10 and n = 25 data x. When it is important to emphasize the dependence on the data x we will write L(θ; x), S(θ; x), etc. Also when we wish to determine the sampling properties of these functions as functions of the random variable X we will write L(θ; X), S(θ; X), etc. 2.1.10 Definition If θ is a scalar then the expected or Fisher information (function) is given by ¸ ∙ ∂2 J(θ) = E [I(θ; X); θ] = E − 2 l(θ; X); θ , θ ∈ Ω. ∂θ Note: If X1 , . . . , Xn is a random sample from f (x; θ) then ∙ ¸ ∙ ¸ ∂2 ∂2 J (θ) = E − 2 l(θ; X); θ = nE − 2 log f (X; θ); θ ∂θ ∂θ where X has probability function f (x; θ). 2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 43 2.1.11 Example Find the Fisher information based on a random sample X1 , . . . , Xn from the POI(θ) distribution and compare it to the variance of the M.L. estimator θ̂. How does the Fisher information change as n increases? The Poisson model is used to model the number of events occurring in time or space. Suppose it is not possible to observe the number of events but only whether or not one or more events has occurred. In other words it is only possible to observe the outcomes “X = 0” and “X > 0”. Let Y be the number of times the outcome “X = 0” is observed in a sample of size n. Find the M.L. estimator of θ for these data. Compare the Fisher information for these data with the Fisher information based on (X1 , . . . , Xn ). See Figure 2.2 1 0.9 0.8 0.7 Ratio of Information Functions0.6 0.5 0.4 0.3 0.2 0.1 0 0 1 2 3 4 θ 5 6 7 8 Figure 2.2: Ratio of Fisher Information Functions 2.1.12 Problem Suppose X ∼ BIN(n, θ) and we observe X. Find θ̂, the M.L. 
estimator of θ, the score function, the information function and the Fisher information. Compare the Fisher information with the variance of θ̂. 44 2.1.13 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION Problem Suppose X ∼ NB(k, θ) and we observe X. Find the M.L. estimator of θ, the score function and the Fisher information. 2.1.14 Problem - Randomized Sampling A professor is interested estimating the unknown quantity θ which is the proportion of students who cheat on tests. She conducts an experiment in which each student is asked to toss a coin secretly. If the coin comes up a head the student is asked to toss the coin again and answer “Yes” if the second toss is a head and “No” if the second toss is a tail. If the first toss of the coin comes up a tail, the student is asked to answer “Yes” or “No” to the question: Have you ever cheated on a University test? Students are assumed to answer more honestly in this type of randomized response survey because it is not known to the questioner whether the answer “Yes” is a result of tossing the coin twice and obtaining two heads or because the student obtained a tail on the toss of the coin and then answered “Yes” to the question about cheating. (a) Find the probability that x students answer “Yes” in a class of n students. (b) Find the M.L. estimator of θ based on X students answering “Yes” in a class of n students. Be sure to verify that your answer corresponds to a maximum. (c) Find the Fisher information for θ. (d) In a simpler experiment n students could be asked to answer “Yes” or “No” to the question: Have you ever cheated on a University test? If we could assume that they answered the question honestly then we would expect to obtain more information about θ from this simpler experiment. Determine the amount of information lost in doing the randomized response experiment as compared to the simpler experiment. 2.1.15 Problem Suppose (X1 , X2 ) ∼ MULT(n, θ2 , 2θ (1 − θ)). Find the M.L. estimator of θ, the score function and the Fisher information. 2.1.16 Likelihood Functions for Continuous Models Suppose X is a continuous random variable with probability density function f (x; θ). We will often observe only the value of X rounded to some 2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 45 degree of precision (say one decimal place) in which case the actual observation is a discrete random variable. For example, suppose we observe X correct to one decimal place. Then 1.15 Z P (we observe 1.1) = f (x; θ)dx ≈ (1.15 − 1.05) · f (1.1; θ) 1.05 assuming the function f (x; θ) is quite smooth over the interval. More generally, if we observe X rounded to the nearest ∆ (assumed small) then the likelihood of the observation is approximately ∆f (observation; θ). Since the precision ∆ of the observation does not depend on the parameter, then maximizing the discrete likelihood of the observation is essentially equivalent to maximizing the probability density function f (observation; θ) over the parameter. Therefore if X = (X1 , . . . , Xn ) is a random sample from the probability density function f (x; θ) and x = (x1 , . . . , xn ) are the observed data then we define the likelihood function for θ as L(θ) = L (θ; x) = n Q i=1 See also Problem 2.8.12. 2.1.17 f (xi ; θ), θ ∈ Ω. Example Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function f (x; θ) = θxθ−1 , 0 ≤ x ≤ 1, θ > 0. Find the score function, the M.L. estimator, and the information function of θ. Find the observed information. Find the mean and variance of θ̂. 
Compare the Fisher infomation and the variance of θ̂. 2.1.18 Example Suppose X1 , . . . , Xn is a random sample from the UNIF(0, θ) distribution. Find the M.L. estimator of θ. 2.1.19 Problem Suppose X1 , . . . , Xn is a random sample from the UNIF(θ, θ + 1) distribution. Show the M.L. estimator of θ is not unique. 46 2.1.20 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION Problem Suppose X1 , . . . , Xn is a random sample from the DE(1, θ) distribution. Find the M.L. estimator of θ. 2.1.21 Problem Show that if θ̂ is the unique M.L. estimator of θ then θ̂ must be a function of the minimal sufficient statistic. 2.1.22 Problem The word information generally implies something that is additive. Suppose X has probability (density) function f (x; θ) , θ ∈ Ω and independently Y has probability (density) function g(y; θ), θ ∈ Ω. Show that the Fisher information in the joint observation (X, Y ) is the sum of the Fisher information in X plus the Fisher information in Y . Often S(θ) = 0 must be solved numerically using an iterative method such as Newton’s Method. 2.1.23 Newton’s Method Let θ(0) be an initial estimate of θ. We may update that value as follows: θ(i+1) = θ(i) + S(θ(i) ) , I(θ(i) ) i = 0, 1, . . . Notes: (1) The initial estimate, θ(0) , may be determined by graphing L (θ) or l (θ). (2) The algorithm is usually run until the value of θ(i) no longer changes to a reasonable number of decimal places. When the algorithm is stopped it is always important to check that the value of θ obtained does indeed maximize L (θ). (3) This algorithm is also called the Newton-Raphson Method. (4) I (θ) can be replaced by J (θ) for a similar algorithm which is called the method of scoring or Fisher’s method of scoring. (5) The value of θ̂ may also be found by maximizing L(θ) or l(θ) using the maximization (minimization) routines available in various statistical software packages such as Maple, S-Plus, Matlab, R etc. (6) If the support of X depends on θ (e.g. UNIF(0, θ)) then θ̂ is not found by solving S(θ) = 0. 2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 47 2.1.24 Example Suppose X1 , . . . , Xn is a random sample from the WEI(1, β) distribution. Explain how you would find the M.L. estimate of β using Newton’s Method. How would you find the mean and variance of the M.L. estimator of β? 2.1.25 Problem - Likelihood Function for Grouped Data Suppose X is a random variable with probability (density) function f (x; θ) and P (X ∈ A; θ) = 1. Suppose A1 , A2 , . . . , Am is a partition of A and let pj (θ) = P (X ∈ Aj ; θ), j = 1, . . . , m. Suppose n independent observations are collected from this distribution but it is only possible to determine to which one of the m sets, A1 , A2 , . . . , Am , the i’th observation belongs. The observed data are: Outcome Frequency A1 f1 A2 f2 ... ... Am fm Total n (a) Show that the Fisher information for these data is given by J(θ) = n Hint: Since m P pj (θ) = 1, j=1 m [p0 (θ)]2 P j . j=1 pj (θ) ¸ ∙m d P pj (θ) = 0. dθ i=1 (b) Explain how you would find the M.L. estimate of θ. 2.1.26 Definition The relative likelihood function R(θ) is defined by R(θ) = R (θ; x) = L(θ) L(θ̂) , θ ∈ Ω. The relative likelihood function takes on values between 0 and 1.and can be used to rank possible parameter values according to their plausibilities in light of the data. If R(θ1 ) = 0.1, say, then θ1 is rather an implausible parameter value because the data are ten times more probable when θ = θ̂ than they are when θ = θ1 . 
However, if R(θ1 ) = 0.5, say, then θ1 is a fairly plausible value because it gives the data 50% of the maximum possible probability under the model. 48 2.1.27 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION Definition The set of θ values for which R(θ) ≥ p is called a 100p% likelihood region for θ. If the region is an interval of real values then it is called a 100p% likelihood interval (L.I.) for θ. Values inside the 10% L.I. are referred to as plausible and values outside this interval as implausible. Values inside a 50% L.I. are very plausible and outside a 1% L.I. are very implausible in light of the data. 2.1.28 Definition The log relative likelihood function is the natural logarithm of the relative likelihood function: r(θ) = r (θ; x) = log[R(θ)] = log[L(θ] − log[L(θ̂)] = l(θ) − l(θ̂), θ ∈ Ω. Likelihood regions or intervals may be determined from a graph of R(θ) or r(θ) and usually it is more convenient to work with r(θ). Alternatively, they can be found by solving r(θ) − log p = 0. Usually this must be done numerically. 2.1.29 Example Plot the relative likelihood function for θ in Example 2.1.5 if n = 15 and θ̂ = 1. Find the 15% L.I.’s for θ. See Figure 2.3 2.1.30 Problem Suppose X ∼ BIN(n, θ). Plot the relative likelihood function for θ if x = 3 is observed for n = 100. On the same graph plot the relative likelihood function for θ if x = 6 is observed for n = 200. Compare the graphs as well as the 10% L.I. and 50% L.I. for θ. 2.1.31 Problem Suppose X1 , . . . , Xn is a random sample from the EXP(1, θ) distribution. Plot the relative likelihood function for θ if n = 20 and x(1) = 1. Find 10% and 50% L.I.’s for θ. 2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 49 1 0.9 0.8 0.7 R(θ) 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.5 1 1.5 θ 2 2.5 Figure 2.3: Relative Likelihood Function for Example 2.1.29 2.1.32 Problem The following model is proposed for the distribution of family size in a large population: P (k children in family; θ) = θk , for k = 1, 2, . . . 1 − 2θ P (0 children in family; θ) = . 1−θ The parameter θ is unknown and 0 < θ < 12 . Fifty families were chosen at random from the population. The observed numbers of children are given in the following table: No. of children Frequency observed 0 17 1 22 2 7 3 3 4 1 Total 50 (a) Find the likelihood, log likelihood, score and information functions for θ. 50 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION (b) Find the M.L. estimate of θ and the observed information. (c) Find a 15% likelihood interval for θ. (d) A large study done 20 years earlier indicated that θ = 0.45. Is this value plausible for these data? (e) Calculate estimated expected frequencies. Does the model give a reasonable fit to the data? 2.1.33 Problem The probability that k different species of plant life are found in a randomly chosen plot of specified area is ¢k+1 ¡ 1 − e−θ pk (θ) = , (k + 1) θ k = 0, 1, . . . ; θ > 0. The data obtained from an examination of 200 plots are given in the table below: No. of species Frequency observed 0 147 1 36 2 13 3 4 ≥4 0 Total 200 (a) Find the likelihood, log likelihood, score and information functions for θ. (b) Find the M.L. estimate of θ and the observed information. (c) Find a 15% likelihood interval for θ. (d) Is θ = 1 a plausible value of θ in light of the observed data? (e) Calculate estimated expected frequencies. Does the model give a reasonable fit to the data? 2.2 Principles of Inference In Chapter 1 we discussed the Sufficiency Principle and the Conditionality Principle. 
There is another principle which is equivalent to the Sufficiency Principle. The likelihood ratios generate the minimal sufficient partition. In other words, two likelihood ratios will agree f (x1 ; θ) f (x2 ; θ) = f (x1 ; θ0 ) f (x2 ; θ0 ) if and only if the values of the minimal sufficient statistic agree, that is, T (x1 ) = T (x2 ). Thus we obtain: 2.2. PRINCIPLES OF INFERENCE 2.2.1 51 The Weak Likelihood Principle Suppose for two different observations x1 , x2 , the likelihood ratios f (x2 ; θ) f (x1 ; θ) = f (x1 ; θ0 ) f (x2 ; θ0 ) for all values of θ, θ0 ∈ Ω. Then the two different observations x1 , x2 should lead to the same inference about θ. A weaker but similar principle, the Invariance Principle follows. This can be used, for example, to argue that for independent identically distributed observations, it is only the value of the observations (the order statistic) that should be used for inference, not the particular order in which those observations were obtained. 2.2.2 Invariance Principle Suppose for two different observations x1 , x2 , f (x1 ; θ) = f (x2 ; θ) for all values of θ ∈ Ω. Then the two different observations x1 , x2 should lead to the same inference about θ. There are relationships among these and other principles. For example, Birnbaum proved that the Conditionality Principle and the Sufficiency Principle above imply a stronger version of a Likelihood Principle. However, it is probably safe to say that while probability theory has been quite successfully axiomatized, it seems to be difficult if not impossible to derive most sensible statistical procedures from a set of simple mathematical axioms or principles of inference. 2.2.3 Problem Consider the model {f (x; θ) ; θ ∈ Ω} and suppose that θ̂ is the M.L. estimator based on the observation X. We often draw conclusions about the plausibility of a given parameter value θ based on the relative likelihood L(θ) . If this is very small, for example, less than or equal to 1/N , we regard L(θ̂) the value of the parameter θ as highly unlikely. But what happens if this test declares every value of the parameter unlikely? Suppose f (x; θ) = 1 if x = θ and f (x; θ) = 0 otherwise, where θ = 1, 2, . . . N . Define f0 (x) to be the discrete uniform distribution on the 52 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION integers {1, 2, . . . , N }. In this example the parameter space is Ω = {θ; θ = 0, 1, ..., N }. Show that the relative likelihood 1 f0 (x) ≤ f (x; θ) N no matter what value of x is observed. Should this be taken to mean that the true distribution cannot be f0 ? 2.3 Properties of the Score and Information - Regular Model Consider the model {f (x; θ) ; θ ∈ Ω}. The following is a set of sufficient conditions which we will use to determine the properties of the M.L. estimator of θ. These conditions are not the most general conditions but are sufficiently general for most applications. Notable exceptions are the UNIF(0, θ) and the EXP(1, θ) distributions which will be considered separately. For convenience we call a family of models which satisfy the following conditions a regular family of distributions. (See 1.7.9.) 2.3.1 Regular Model Consider the model {f (x; θ) ; θ ∈ Ω}. Suppose that: (R1) The parameter space Ω is an open interval in the real line. (R2) The densities f (x; θ) have common support, so that the set A = {x; f (x; θ) > 0} , does not depend on θ. (R3) For all x ∈ A, f (x; θ) is a continuous, three times differentiable function of θ. 
R (R4) The integral f (x; θ) dx can be twice differentiated with respect to θ A under the integral sign, that is, Z Z ∂k ∂k f (x; θ) dx = f (x; θ) dx, k = 1, 2 for all θ ∈ Ω. ∂θk ∂θk A A (R5) For each θ0 ∈ Ω there exist a positive number c and function M (x) (both of which may depend on θ0 ), such that for all θ ∈ (θ0 − c, θ0 + c) ¯ ¯ 3 ¯ ∂ log f (x; θ) ¯ ¯ < M (x) ¯ ¯ ¯ ∂θ3 2.3. PROPERTIES OF THE SCORE AND INFORMATION- REGULAR MODEL53 holds for all x ∈ A, and E [M (X) ; θ] < ∞ for all θ ∈ (θ0 − c, θ0 + c) . (R6) For each θ ∈ Ω, 0<E (∙ ∂ 2 log f (X; θ) ∂θ2 ¸2 ;θ ) <∞ If these conditions hold with X a discrete random variable and the integrals replaced by sums, then we shall also call this a regular family of distributions. Condition (R3) insures that the function ∂ log f (x; θ) /∂θ has, for each x ∈ A, a Taylor expansion as a function of θ. The following lemma provides one method of determining whether differentiation under the integral sign (condition (R4)) is valid. 2.3.2 Lemma Suppose ∂g (x; θ) /∂θ exists for all θ ∈ Ω, and all x ∈ A. Suppose also that for each θ0 ∈ Ω there exist a positive number c, and function G (x) (both of which may depend on θ0 ), such that for all θ ∈ (θ0 − c, θ0 + c) ¯ ¯ ¯ ∂g (x; θ) ¯ ¯ ¯ ¯ ∂θ ¯ < G(x) holds for all x ∈ A, and Z G (x) dx < ∞. A Then ∂ ∂θ Z A 2.3.3 g(x, θ)dx = Z ∂ g(x, θ)dx. ∂θ A Theorem - Expectation and Variance of the Score Function If X = (X1 , . . . , Xn ) is a random sample from a regular model {f (x; θ) ; θ ∈ Ω} then E[S(θ; X); θ] = 0 54 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION and V ar[S(θ; X); θ] = E{[S(θ; X)]2 ; θ} = E[I(θ; X); θ] = J(θ) < ∞ for all θ ∈ Ω. 2.3.4 Problem - Invariance Property of M.L. Estimators Suppose X1 , . . . , Xn is a random sample from a distribution with probability (density) function f (x; θ) where {f (x; θ) ; θ ∈ Ω} is a regular family. Let S(θ) and J(θ) be the score function and Fisher information respectively based on X1 , . . . , Xn . Consider the reparameterization τ = h(θ) where h is a one-to-one differentiable function with inverse function θ = g(τ ). Let S ∗ (τ ) and J ∗ (τ ) be the score function and Fisher information respectively under the reparameterization. (a) Show that τ̂ = h(θ̂) is the M.L. estimator of τ where θ̂ is the M.L. estimator of θ. (b) Show that E[S ∗ (τ ; X); τ ] = 0 and J ∗ (τ ) = [g 0 (τ )]2 J[g(τ )]. 2.3.5 Problem It is natural to expect that if we compare the information available in the original data X and the information available in some statistic T (X), the latter cannot be greater than the former since T can be obtained from X. Show that in a regular model the Fisher information calculated from the marginal distribution of T is less than or equal to the Fisher information for X. Show that they are equal for all values of the parameter if and only if T is a sufficient statistic for {f (x; θ) ; θ ∈ Ω}. 2.4 Maximum Likelihood Method - Multiparameter The case of several parameters is exactly analogous to the one parameter case. Suppose θ = (θ1 , . . . , θk )T . The log likelihood function l (θ1 , . . . , θk ) = log L (θ1 , . . . , θk ) is a function of k parameters. The M.L. estimate of θ, ∂l θ̂ = (θ̂1 , . . . , θ̂k )T is usually found by solving ∂θ = 0, j = 1, . . . , k simultaj neously. The invariance property of the M.L. estimator also holds in the multiparameter case. 2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER55 2.4.1 Definition If θ = (θ1 ,. . . , θk )T then the score vector is defined as ⎡ ∂l ⎤ ⎢ ⎢ S(θ) = ⎢ ⎣ 2.4.2 ∂θ1 .. . ∂l ∂θk ⎥ ⎥ ⎥, ⎦ θ ∈ Ω. Definition If θ = (θ1 , . 
. . , θk )T then the information matrix I(θ) is a k × k symmetric matrix whose (i, j) entry is given by − ∂2 l(θ), ∂θi ∂θj θ ∈ Ω. I(θ̂) is called the observed information matrix. 2.4.3 Definition If θ = (θ1 , . . . , θk )T then the expected or Fisher information matrix J(θ) is a k × k symmetric matrix whose (i, j) entry is given by ∙ ¸ ∂2 E − l(θ; X); θ , θ ∈ Ω. ∂θi ∂θj 2.4.4 Expectation and Variance of the Score Vector For a regular family of distributions ⎤ 0 ⎢ ⎥ E[S(θ; X); θ] = ⎣ ... ⎦ 0 ⎡ and V ar[S(θ; X); θ] = E[S(θ; X)S(θ; X)T ; θ] = E [I (θ; X) ; θ] = J(θ). 2.4.5 Likelihood Regions The set of θ values for which R(θ) ≥ p is called a 100p% likelihood region for θ. 56 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 2.4.6 Example: Suppose X1 , ..., Xn is a random sample from the N(μ, σ 2 ) distribution. Find the score vector, the information matrix, the Fisher information matrix and ¢T ¡ the estimator of θ = μ,¡σ2 .¢Find the observed information ¡ M.L. ¢ ¡ ¢matrix I μ̂, σ̂2 and thus verify that μ̂,¡σ̂ 2 is¢the M.L. estimator of μ, σ 2 . Find the Fisher information matrix J μ, σ 2 . ¡ ¢ Since X1 , ..., Xn is a random sample from the N μ, σ 2 distribution the likelihood function is ¸ ∙ n ¡ ¢ Q −1 1 2 2 √ L μ, σ (xi − μ) = exp 2σ 2 2πσ i=1 ∙ ¸ n −1 P −n/2 ¡ 2 ¢−n/2 2 exp (xi − μ) σ = (2π) 2σ2 i=1 ∙ ¸ n ¡ ¡ ¢−n/2 ¢ −1 P 2 2 exp − 2μx − μ x = (2π)−n/2 σ 2 i 2σ2 i=1 i ∙ µn ¶¸ n ¡ ¢−n/2 P −1 P 2 2 exp x − 2μ x + nμ = (2π)−n/2 σ 2 i 2σ2 i=1 i i=1 ∙ ¸ ¡ ¢ ¡ ¢ −1 −n/2 −n/2 2 exp − 2μt + nμ σ2 t , μ ∈ <, σ 2 > 0 = (2π) 1 2 2σ2 where t1 = n P i=1 x2i and t2 = The log likelihood function is ¢ ¡ l μ, σ 2 = = = where −n log (2π) − 2 −n log (2π) − 2 −n log (2π) − 2 ¡ ¢ n log σ 2 − 2 ¡ ¢ n log σ 2 − 2 ¡ ¢ n log σ 2 − 2 s2 = Now n P xi . i=1 n 1 P 2 (xi − μ) 2 2σ i=1 ∙n ¸ 1 ¡ 2 ¢−1 P 2 2 (xi − x̄) + n (x̄ − μ) σ 2 i=1 i 1 ¡ 2 ¢−1 h (n − 1) s2 + n (x̄ − μ)2 , μ ∈ <, σ 2 > 0 σ 2 n 1 P (xi − x̄)2 . n − 1 i=1 ¡ ¢−1 n ∂l (x̄ − μ) = 2 (x̄ − μ) = n σ 2 ∂μ σ 2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER57 and i n ¡ 2 ¢−1 1 ¡ 2 ¢−2 h ∂l 2 2 = − + + n (x̄ − μ) (n − 1) s . σ σ ∂σ 2 2 2 The equations ∂l ∂l =0 = 0, ∂μ ∂σ 2 are solved simultaneously for μ = x̄ and σ2 = Since ∂2l ∂μ2 ∂2l − − ∂ (σ 2 ) 2 n 1 P (n − 1) 2 (xi − x̄)2 = s . n i=1 n n ∂2l n (x̄ − μ) , − = σ2 ∂σ 2 ∂μ σ4 h i n 1 1 2 2 = − + + n (x̄ − μ) (n − 1) s 2 σ4 σ6 = the information matrix is ⎡ n/σ 2 ¡ ¢ 2 ⎣ I μ, σ = n (x̄ − μ) /σ 4 − n2 σ14 Since ⎤ n (x̄ − μ) /σ 4 h i ⎦ , μ ∈ <, σ 2 > 0. + σ16 (n − 1) s2 + n (x̄ − μ)2 ¡ ¡ ¢ ¢ n n2 I11 μ̂, σ̂ 2 = 2 > 0 and det I μ̂, σ̂ 2 = 6 > 0 σ̂ 2σ̂ then by the Second Derivative Test the M.L. estimates of of μ and σ2 are μ̂ = x̄ and σ̂ 2 = and the M.L. estimators are μ̂ = X̄ and σ̂2 = The observed information is ¡ ¢ I μ̂, σ̂2 = Now ´ ³n n 2 ; μ, σ = 2, E 2 σ σ n 1 P (n − 1) 2 (xi − x̄)2 = s n i=1 n n ¡ ¢2 (n − 1) 2 1 P Xi − X̄ = S . n i=1 n " n/σ̂ 2 0 0 ¡ ¢ 1 4 n/σ̂ 2 # . " ¡ # ¢ n X̄ − μ 2 E ; μ, σ = 0, σ4 58 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION and ¾ ½ ¡ ¢2 i 1 h n 1 2 2 (n − 1) S + n X̄ − μ ; μ, σ + E − 2 σ4 σ6 h¡ io ¢2 n 1 1 n 2 2 2 = − + ; μ, σ ) + nE X̄ − μ ; μ, σ (n − 1) E(S 2 σ4 σ6 ¤ n 1 1 £ = − + 6 (n − 1) σ 2 + σ 2 2 σ4 σ n = 2σ 4 since ¤ £¡ ¢ E X̄ − μ ; μ, σ 2 = 0, h¡ i ¢2 ¡ ¢ σ2 ¢ ¡ E X̄ − μ ; μ, σ2 = V ar X̄; μ, σ2 = and E S 2 ; μ, σ 2 = σ 2 . n Therefore the Fisher information matrix is # " n 0 ¡ ¢ 2 σ J μ, σ2 = 0 2σn4 and the inverse of the Fisher information matrix is ⎡ 2 ⎤ σ 0 ¢¤ £ ¡ −1 n ⎦. 
=⎣ J μ, σ 2 4 0 2σn Now ¡ ¢ V ar X̄ = and σ2 n ∙ ¸ n ¡ ¡ 2¢ ¢2 2σ 4 2(n − 1)σ 4 1 P V ar σ̂ ≈ = = V ar Xi − X̄ n i=1 n2 n Cov(X̄, σ̂ 2 ) = since X̄ and n ¡ ¢2 P 1 Xi − X̄ ) = 0 Cov(X̄, n i=1 n ¡ ¢2 P Xi − X̄ are independent random variables. Inferences i=1 for μ and σ 2 are usually made using X̄ − μ (n − 1)S 2 √ v t (n − 1) and v χ2 (n − 1) . σ2 S/ n 2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER59 The relative likelihood function is ¢ µ ¶n/2 ¡ nn io ¡ ¢ L μ, σ 2 σ̂ 2 n h 2 2 2 R μ, σ = exp + (x̄ − μ) σ̂ , μ ∈ <, σ 2 > 0. = − L (μ̂, σ̂ 2 ) σ2 2 2σ 2 ¡ ¢ See Figure 2.4 for a graph of R μ, σ 2 for n = 350, μ̂ = 160 and σ̂ 2 = 36. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 42 161 40 38 160.5 36 160 34 159.5 32 30 σ 159 2 μ Figure 2.4: Normal Likelihood function for n = 350, μ̂ = 160 and σ̂ 2 = 36 2.4.7 Problem - The Score Equation and the Exponential Family Suppose X has a regular exponential family distribution of the form " # k P f (x; η) = C(η) exp ηj Tj (x) h(x). j=1 where η = (η1 , . . . , ηk )T . Show that E[Tj (X); η] = −∂ log C (η) , j = 1, . . . , k ∂ηj 60 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION and Cov(Ti (X), Tj (X); η) = −∂ 2 log C (η) , i, j = 1, . . . , k. ∂ηi ∂ηj Suppose that (x1 , . . . , xn ) are the observed data for a random sample from fη (x). Show that the score equations ∂ l (η) = 0, ∂ηj j = 1, ..., k can be written as ¸ ∙n n P P Tj (Xi ); η = Tj (xi ), E i=1 2.4.8 j = 1, . . . , k. i=1 Problem Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution. Use the result of Problem 2.4.7 to find the score equations for μ and σ2 and verify that these are the same equations obtained in Example 2.4.7. 2.4.9 Problem Suppose (X1 , Y1 ) , . . . , (Xn , Yn ) is a random sample from the BVN(μ, Σ) distribution. Find the M.L. estimators of μ1 , μ2 , σ12 , σ22 , and ρ. You do not need to verify that your answer corresponds to a maximum. Hint: Use the result from Problem 2.4.7. 2.4.10 Problem Suppose (X1 , X2 ) ∼ MULT(n, θ1 , θ2 ). Find the M.L. estimators of θ1 and θ2 , the score function and the Fisher information matrix. 2.4.11 Problem Suppose X1 , . . . , Xn is a random sample from the UNIF(a, b) distribution. Find the M.L. estimators of a and b. Verify that your answer corresponds to a maximum. Find the M.L. estimator of τ (a, b) = E (Xi ). 2.4.12 Problem Suppose X1 , . . . , Xn is a random sample from the UNIF(μ − 3σ, μ + 3σ) distribution. Find the M.L. estimators of μ and σ. 2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER61 2.4.13 Problem Suppose X1 , . . . , Xn is a random sample from the EXP(β, μ) distribution. Find the M.L. estimators of β and μ. Verify that your answer corresponds to a maximum. Find the M.L. estimator of τ (β, μ) = xα where xα is the α percentile of the distribution. 2.4.14 Problem In Problem 1.7.26 find the M.L. estimators of μ and σ 2 . Verify that your answer corresponds to a maximum. 2.4.15 Problem Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n, X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is a vector of unknown parameters. Show that the M.L. estimators of β and σ 2 are given by ¡ ¢−1 T β̂ = X T X X Y and σ̂ 2 = (Y − X β̂)T (Y − X β̂)/n. 2.4.16 Newton’s Method In the multiparameter case θ = (θ1 , . . . , θk )T Newton’s method is given by: θ(i+1) = θ(i) + [I(θ(i) )]−1 S(θ(i) ), i = 0, 1, 2, . . . I(θ) can also be replaced by the Fisher information J(θ). 
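To make this update concrete, the following Python sketch (any of the packages mentioned in 2.1.23, such as R or Matlab, would serve equally well) implements the iteration for a general vector parameter and applies it to the N(μ, σ²) score and information derived in Example 2.4.6. The data values, function names and stopping rule are illustrative choices only, not part of the notes; the iteration should converge to the known M.L. estimates (x̄, (n − 1)s²/n).

```python
import numpy as np

def newton_mle(theta0, score, info, tol=1e-8, max_iter=50):
    """Newton's Method of 2.4.16: theta <- theta + I(theta)^(-1) S(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(info(theta), score(theta))
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Illustrative data (not from the notes); the score and information below are
# those of the N(mu, sigma^2) model in Example 2.4.6.
x = np.array([4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

def score(theta):
    mu, v = theta  # v denotes sigma^2
    return np.array([n * (xbar - mu) / v,
                     -n / (2 * v) + ((n - 1) * s2 + n * (xbar - mu) ** 2) / (2 * v ** 2)])

def info(theta):
    mu, v = theta
    return np.array([[n / v, n * (xbar - mu) / v ** 2],
                     [n * (xbar - mu) / v ** 2,
                      -n / (2 * v ** 2) + ((n - 1) * s2 + n * (xbar - mu) ** 2) / v ** 3]])

print(newton_mle([xbar, s2], score, info))   # approaches (xbar, (n-1)s^2/n)
print(xbar, (n - 1) * s2 / n)
```

Since the score and information are supplied as functions, the same routine can be reused for the BETA(a, b) model in the example that follows.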
2.4.17 Example The following data are 30 independent observations from a BETA(a, b) distribution: 0.2326, 0.3049, 0.5297, 0.1079, 0.0465, 0.4195, 0.1508, 0.0819, 0.2159, 0.2447, 0.0674, 0.3729, 0.3247, 0.3910, 0.3150, 0.3473, 0.2709, 0.4302, 0.3232, 0.2354, 0.4014, 0.3720, 0.4253, 0.0710, 0.3212, 0.3373, 0.1322, 0.4712, 0.4111, 0.3556 The likelihood function for observations x1 , x2 , ..., xn is n Γ(a + b) Q (1 − xi )b−1 , a > 0, b > 0 xa−1 i i=1 Γ(a)Γ(b) ¸b−1 ∙ ¸n ∙ n ¸a−1 ∙ n Q Q Γ(a + b) xi (1 − xi ) . = Γ(a)Γ(b) i=1 i=1 L(a, b) = 62 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION The log likelihood function is l(a, b) = n [log Γ(a + b) − log Γ(a) − log Γ(b) + (a − 1)t1 + (b − 1)t2 ] where t1 = n n 1 P 1 P log xi and t2 = log(1 − xi ). n i=1 n i=1 (T1 , T2 ) is a sufficient statistic for (a, b) where T1 = Why? Let n n 1 P 1 P log Xi and T2 = log(1 − Xi ). n i=1 n i=1 Ψ (z) = d log Γ(z) Γ0 (z) = dz Γ(z) which is called the digamma function. The score vector is " # ∙ ¸ Ψ (a + b) − Ψ (a) + t1 ∂l/∂a S (a, b) = =n . ∂l/∂b Ψ (a + b) − Ψ (b) + t2 S (a, b) = [0 0]T must be solved numerically to find the M.L. estimates of a and b. Let d Ψ0 (z) = Ψ (z) dz which is called the trigamma function. The information matrix is " # Ψ0 (a) − Ψ0 (a + b) −Ψ0 (a + b) I(a, b) = n −Ψ0 (a + b) Ψ0 (b) − Ψ0 (a + b) which is also the Fisher or expected information matrix. For the data above t1 = n n P 1 P 1 log xi = −1.3929 and t2 = log(1 − xi ) = −0.3594. log 30 i=1 30 i=1 The M.L. estimates of a and b can be found using Newton’s Method given by ∙ (i+1) ¸ ∙ (i) ¸ h i−1 a a = + I(a(i) , b(i) ) S(a(i) , b(i) ) (i+1) (i) b b 2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER63 for i = 0, 1, ... until convergence. Newton’s Method converges after 8 interations beginning with the initial estimates a(0) = 2, b(0) = 2. The iterations are given below: ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ 0.6449 2.2475 1.0852 3.1413 1.6973 4.4923 2.3133 5.8674 2.6471 6.6146 2.7058 6.7461 2.7072 6.7493 2.7072 6.7493 ¸ ¸ ¸ ¸ ¸ ¸ ¸ ¸ = = = = = = = = ∙ ∙ ∙ ∙ ∙ ∙ ∙ ∙ 2 2 ¸ + ∙ 0.6449 2.2475 1.0852 3.1413 1.6973 4.4923 2.3133 5.8674 2.6471 6.6146 2.7058 6.7461 2.7072 6.7493 ¸−1 ∙ ¸ 10.8333 −8.5147 −16.7871 −8.5147 10.8333 14.2190 ¸ ∙ ¸−1 ∙ ¸ 84.5929 −12.3668 26.1919 + −12.3668 4.3759 −1.5338 ¸ ∙ ¸−1 ∙ ¸ 35.8351 −8.0032 11.1198 + −8.0032 3.2253 −0.5408 ¸ ∙ ¸−1 ∙ ¸ 18.5872 −5.2594 4.2191 + −5.2594 2.2166 −0.1922 ¸ ∙ ¸−1 ∙ ¸ 12.2612 −3.9004 1.1779 + −3.9004 1.6730 −0.0518 ¸ ∙ ¸−1 ∙ ¸ 10.3161 −3.4203 0.1555 + −3.4203 1.4752 −0.0067 ¸ ∙ ¸−1 ∙ ¸ 10.0345 −3.3478 0.0035 + −3.3478 1.4450 −0.0001 ¸ ∙ ¸−1 ∙ ¸ 10.0280 −3.3461 0.0000 + −3.3461 1.4443 0.0000 The M.L. estimates are â = 2.7072 and b̂ = 6.7493. The observed information matrix is ∙ ¸ 10.0280 −3.3461 I(â, b̂) = −3.3461 1.4443 2 Note that since det[I(â, b̂)] = (10.0280) (1.4443) − (3.3461) > 0 and [I(â, b̂)]11 = 10.0280 > 0 and then by the Second Derivative Test we have found the M.L. estimates. A graph of the relative likelihood function is given in Figure 2.5. 64 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 15 10 5 b 0 1 1.5 2.5 2 3 3.5 4 5 4.5 a Figure 2.5: Relative Likelihood for Beta Example A 100p% likelihood region for (a, b) is given by {(a, b) ; R (a, b) ≥ p}. The 1%, 5% and 10% likelihood regions for (a, b) are shown in Figure 2.6. Note that the likelihood contours are elliptical in shape and are skewed relative to the ab coordinate axes. 
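The likelihood regions referred to above (Figure 2.6) can be reproduced numerically by evaluating the log relative likelihood r(a, b) = l(a, b) − l(â, b̂) on a grid and contouring R(a, b) = exp[r(a, b)] at the levels 0.01, 0.05 and 0.10. Below is a minimal Python sketch using n = 30, t1 = −1.3929, t2 = −0.3594 and the M.L. estimates found above; the grid limits and plotting details are our own choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gammaln

n, t1, t2 = 30, -1.3929, -0.3594      # summary statistics from Example 2.4.17
a_hat, b_hat = 2.7072, 6.7493         # M.L. estimates found by Newton's Method

def loglik(a, b):
    # l(a, b) = n[log Gamma(a+b) - log Gamma(a) - log Gamma(b) + (a-1)t1 + (b-1)t2]
    return n * (gammaln(a + b) - gammaln(a) - gammaln(b) + (a - 1) * t1 + (b - 1) * t2)

A, B = np.meshgrid(np.linspace(0.5, 5.5, 300), np.linspace(0.5, 14.0, 300))
R = np.exp(loglik(A, B) - loglik(a_hat, b_hat))   # relative likelihood R(a, b)

plt.contour(A, B, R, levels=[0.01, 0.05, 0.10])   # 1%, 5% and 10% likelihood regions
plt.plot(a_hat, b_hat, "k.")
plt.xlabel("a")
plt.ylabel("b")
plt.show()
```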
Since this is a regular model and S(â, b̂) = 0 then by Taylor’s Theorem we have " # ¤ 1£ L (a, b) ≈ L(â, b̂) + S(â, b̂) + â − a b̂ − b I(â, b̂) 2 b̂ − b " # â − a ¤ 1£ = L(â, b̂) + â − a b̂ − b I(â, b̂) 2 b̂ − b â − a for all (a, b) sufficiently close to (â, b̂). Therefore R (a, b) = L (a, b) L(â, b̂) " â − a b̂ − b # 2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER65 h ≈ 1 − 2L(â, b̂) h = 1 − 2L(â, b̂) i−1 £ i−1 £ â − a b̂ − b â − a b̂ − b ¤ ¤ I(â, b̂) ∙ Iˆ11 Iˆ12 " â − a b̂ − b ¸" Iˆ12 Iˆ22 # â − a # b̂ − b i−1 h i = 1 − 2L(â, b̂) (a − â)2 Iˆ11 + 2 (a − â) (b − b̂)Iˆ12 + (b − b̂)2 Iˆ22 . h The set of points (a, b) which satisfy R (a, b) = p is approximately the set of points (a, b) which satisfy 2 (a − â) Iˆ11 + 2 (a − â) (b − b̂)Iˆ12 + (b − b̂)2 Iˆ22 = 2 (1 − p) L(â, b̂) which we recognize as the points on an ellipse centred at (â, b̂). The skewness of the likelihood contours relative to the ab coordinate axes is determined by the value of Iˆ12 . If this value is close to zero the skewness will be small. 2.4.18 Problem The following data are 30 independent observations from a GAM(α, β) distribution: 15.1892, 19.3316, 1.6985, 2.0634, 12.5905, 6.0094, 13.6279, 14.7847, 13.8251, 19.7445, 13.4370, 18.6259, 2.7319, 8.2062, 7.3621, 1.6754, 10.1070, 3.2049, 21.2123, 4.1419, 12.2335, 9.8307, 3.6866, 0.7076, 7.9571, 3.3640, 12.9622, 12.0592, 24.7272, 12.7624 For these data t1 = 30 P log xi = 61.1183 and t2 = i=1 30 P xi = 309.8601. i=1 Find the M.L. estimates of α and β for these data, the observed information I(α̂, β̂) and the Fisher information J(α, β). On the same graph plot the 1%, 5%, and 10% likelihood regions for (α, β). Comment. 2.4.19 Problem Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function f (x; α, β) = αβ (1 + βx)α+1 x > 0; α, β > 0. Find the Fisher information matrix J(α, β). 66 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 14 1% 5% 12 10% 10 (1.5,8) 8 b (2.7,6.7) 6 4 2 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 a Figure 2.6: Likelihood Regions for BETA(a,b) Example The following data are 15 independent observations from this distribution: 9.53, 0.15, 0.77, 0.47, 4.10, 1.60, 0.42, 0.01, 2.30, 0.40, 0.80, 1.90, 5.89, 1.41, 0.11 Find the M.L. estimates of α and β for these data and the observed information I(α̂, β̂). On the same graph plot the 1%, 5%, and 10% likelihood regions for (α, β). Comment. 2.4.20 Problem Suppose X1 , . . . , Xn is a random sample from the CAU(β, μ) distribution. Find the Fisher information for (β, μ). 2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM 2.4.21 67 Problem A radioactive sample emits particles at a rate which decays with time, the rate being λ(t) = λe−βt . In other words, the number of particles emitted in an interval (t, t + h) has a Poisson distribution with parameter t+h R λe−βs ds and the number emitted in disjoint intervals are independent t random variables. Find the M.L. estimate of λ and β, λ > 0, β > 0 if the actual times of the first, second, ..., n’th decay t1 < t2 < . . . tn are observed. Show that β̂ satisfies the equation β̂tn eβ̂tn − 1 2.4.22 = 1 − β̂ t̄ where t̄ = n 1 P ti . n i=1 Problem In Problem 2.1.23 suppose θ = (θ1 , . . . , θk )T . Find the Fisher information matrix and explain how you would find the M.L. estimate of θ. 2.5 Incomplete Data and The E.M. Algorithm The E.M. algorithm, which was popularized by Dempster, Laird and Rubin (1977), is a useful method for finding M.L. 
estimates when some of the data are incomplete but can also be applied to many other contexts such as grouped data, mixtures of distributions, variance components and factor analysis. The following are two examples of incomplete data: 2.5.1 Censored Exponential Data Suppose Xi ∼ EXP(θ), i = 1, . . . , n. Suppose we only observe Xi for m observations and the remaining n − m observations are censored at a fixed time c. The observed data are of the form Yi = min(Xi , c), i = 1, . . . , n. Note that Y = Y (X) is a many-to-one mapping. (X1 , . . . , Xn ) are called the complete data and (Y1 , . . . , Yn ) are called the incomplete data. 2.5.2 “Lumped” Hardy-Weinberg Data A gene has two forms A and B. Each individual has a pair of these genes, one from each parent, so that there are three possible genotypes: AA, AB and BB. Suppose that, in both male and female populations, the proportion of A types is equal to θ and the proportion of B types is equal to 1 − θ. 68 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION Suppose further that random mating occurs with respect to this gene pair. Then the proportion of individuals with genotypes AA, AB and BB in the next generation are θ2 , 2θ(1 − θ) and (1 − θ)2 respectively. Furthermore, if random mating continues, these proportions will remain nearly constant for generation after generation. This is the famous result from genetics called the Hardy-Weinberg Law. Suppose we have a group of n individuals and let X1 = number with genotype AA, X2 = number with genotype AB and X3 = number with genotype BB. Suppose however that it is not possible to distinguish AA0 s from AB 0 s so that the observed data are (Y1 , Y2 ) where Y1 = X1 + X2 and Y2 = X3 . The complete data are (X1, X2 , X3 ) and the incomplete data are (Y1 , Y2 ). 2.5.3 Theorem Suppose X, the complete data, has probability (density) function f (x; θ) and Y = Y (X), the incomplete data, has probability (density) function g(y; θ). Suppose further that f (x; θ) and g (x; θ) are regular models. Then ∂ log g(y; θ) = E[S(θ; X)|Y = y]. ∂θ ∂ ∂θ Suppose θ̂, the value which maximizes log g(y; θ), is found by solving log g(y; θ) = 0. By the previous theorem θ̂ is also the solution to E[S(θ; X)|Y = y; θ] = 0. Note that θ appears in two places in the second equation, as an argument in the function S as well as an argument in the expectation E. 2.5.4 The E.M. Algorithm The E.M. algorithm solves E[S(θ; X)|Y = y] = 0 using an iterative two-step method. Let θ(i) be the estimate of θ from the ith iteration. (1) E-Step (Expectation Step) Calculate h i E log f (X; θ) |Y = y; θ(i) = Q(θ, θ(i) ). (2) M-step (Maximization Step) 2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM 69 Find the value of θ which maximizes Q(θ, θ(i) ) and set θ(i+1) equal to this value. θ(i+1) is found by solving ∙ ¸ h i ∂ ∂ Q(θ, θ(i) ) = E log f (X; θ) |Y = y; θ(i) = E S(θ; X)|Y = y; θ(i) = 0 ∂θ ∂θ with respect to θ. Note that 2.5.5 h i E S(θ(i+1) ; X)|Y = y; θ(i) = 0 Example Give the E.M. algorithm for the “Lumped” Hardy-Weinberg example. Find θ̂ if n = 10 and y1 = 3. Show how θ̂ can be found explicitly by solving ∂ ∂θ log g(y; θ) = 0 directly. The complete data (X1 , X2 ) have joint p.f. f (x1 , x2 ; θ) = x1 , x2 h in−x1 −x2 £ 2 ¤x1 n! x 2 [2θ (1 − θ)] 2 (1 − θ) θ x1 !x2 ! (n − x1 − x2 )! = θ2x1 +x2 (1 − θ)2n−(2x1 +x2 ) · h (x1 , x2 ) = 0, 1, . . . ; x1 + x2 ≤ n; 0 < θ < 1 where h (x1 , x2 ) = n! 2x2 . x1 !x2 ! (n − x1 − x2 )! It is easy to see (show it!) 
that (X1 , X2 ) has a regular exponential family distribution with natural sufficient statistic T = T (X1 , X2 ) = 2X1 + X2 . The incomplete data are Y = X1 + X2 . For the E-Step we need to calculate h i Q(θ, θ(i) ) = E log f (X1 , X2 ; θ) |Y = y; θ(i) n o = E (2X1 + X2 ) log θ + [2n − (2X1 + X2 )] log (1 − θ) |Y = X1 + X2 = y; θ(i) h i +E h (X1 , X2 ) |Y = X1 + X2 = y; θ(i) . (2.1) To find these expectations we note that by the properties of the multinomial distribution µ ¶ µ ¶ θ2 θ X1 |X1 + X2 = y v BIN y, 2 = BIN y, θ + 2θ (1 − θ) 2−θ 70 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION and µ X2 |X1 + X2 = y v BIN y, 1 − θ 2−θ ¶ . Therefore µ ¶ µ (i) ¶ ´ ³ θ(i) θ + y 1 − = 2y E 2X1 + X2 |Y = X1 + X2 = y; θ(i) 2 − θ(i) 2 − θ(i) µ ¶ ³ ´ 2 = y = yp θ(i) (2.2) 2 − θ(i) where p (θ) = 2 . 2−θ Substituting (2.2) into (2.1) gives ³ ´ h ³ ´i h i Q(θ, θ(i) ) = yp θ(i) log θ+ 2n − yp θ(i) log (1 − θ)+E h (X1 , X2 ) |Y = X1 + X2 = y; θ(i) . Note that we do not need to simplify the last term on the right hand side since it does not involve θ. For the M-Step we need to solve ∂ Q(θ, θ(i) ) = 0. ∂θ Now ∂ ³ (i) ´ = Q θ, θ ∂θ = = and if ¡ ¢ yp θ(i) 2n − yp(θ(i) ) − θ (1 − θ) £ ¤ (i) yp(θ ) (1 − θ) − 2n − yp(θ(i) ) θ θ (1 − θ) yp(θ(i) ) − 2nθ θ (1 − θ) ∂ Q(θ, θ(i) ) = 0 ∂θ yp(θ(i) ) y θ= = 2n 2n Therefore θ(i+1) is given by θ (i+1) µ 2 2 − θ(i) y = n µ ¶ y = n 1 2 − θ(i) ¶ µ . 1 2 − θ(i) ¶ . (2.3) 2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM 71 Our algorithm for finding the M.L. estimate of θ is ¶ µ y 1 (i+1) θ = , i = 0, 1, . . . n 2 − θ(i) For the data n = 10 and y = 3 let the initial guess for θ be θ(0) = 0.1. Note that the initial guess does not really matter in this example since the algorithm converges rapidly for any initial guess between 0 and 1. For the given data and initial guess we obtain: θ(1) = θ(2) = θ(3) = θ(4) = µ ¶ 3 1 = 0.1579 10 2 − 0.1 ¶ µ 1 3 = 0.1629 10 2 − θ(1) ¶ µ 3 1 = 0.1633 10 2 − θ(2) ¶ µ 3 1 = 0.1633. 10 2 − θ(3) So the M.L. estimate of θ is θ̂ = 0.1633 to four decimal places. In this example we can find θ̂ directly since ¡ ¢ Y = X1 + X2 v BIN n, θ2 + 2θ (1 − θ) and therefore µ ¶ in−y ¤y h n £ 2 2 g (y; θ) = θ + 2θ (1 − θ) , y = 0, 1, . . . , n; 0 < θ < 1 (1 − θ) y µ ¶ n y n−y 2 = q (1 − q) , where q = 1 − (1 − θ) y which is a binomial likelihood so the M.L. estimate of q is q̂ = y/n. By the √ invariance property of M.L. estimates the M.L. estimate of θ = 1 − 1 − q is p p (2.4) θ̂ = 1 − 1 − q̂ = 1 − 1 − y/n. p For the data n = 10 and y = 3 we obtain θ̂ = 1 − 1 − 3/10 = 0.1633 to four decimal places which is the same answer as we found using the E.M. algorithm. 72 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 2.5.6 E.M. Algorithm and the Regular Exponential Family Suppose X, the complete data, has a regular exponential family distribution with probability (density) function # " k P qj (θ) Tj (x) h(x), θ = (θ1 , . . . , θk )T f (x; θ) = C(θ) exp j=1 and let Y = Y (X) be the incomplete data. Then the M-step of the E.M. algorithm is given by E[Tj (X); θ(i+1) ] = E[Tj (X)|Y = y; θ(i) ], 2.5.7 j = 1, . . . , k. (2.5) Problem Prove (2.5) using the result from Problem 2.4.7. 2.5.8 Example Use (2.5) to find the M-step for the “Lumped” Hardy-Weinberg example. Since the natural sufficient statistic is T = T (X1 , X2 ) = 2X1 + X2 the M-Step is given by i h i h (2.6) E 2X1 + X2 ; θ(i+1) = E 2X1 + X2 |Y = y; θ(i) . 
Using (2.2) and the fact that ¡ ¢ X1 v BIN n, θ2 and X2 v BIN (n, 2θ (1 − θ)) , then (2.6) can be written as or µ h i2 ih i h 2n θ(i+1) + n 2θ(i+1) 1 − θ(i+1) = y θ(i+1) = y 2n µ 2 2 − θ(i) ¶ = which is the same result as in (2.3). If the algorithm converges and lim θ(i) = θ̂ i→∞ y n µ 2 2 − θ(i) 1 2 − θ(i) ¶ ¶ 2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM 73 (How would you prove this? Hint: Recall the Monotonic Sequence Theorem.) then ¶ µ 1 y (i+1) lim θ = lim i→∞ i→∞ n 2 − θ(i) or ¶ µ y 1 θ̂ = . n 2 − θ̂ Solving for θ̂ gives θ̂ = 1 − which is the same result as in (2.4). 2.5.9 p 1 − y/n Example Use (2.5) to give the M-step for the censored exponential data example. Assuming the algorithm converges, find an expression for θ̂. Show that this ∂ is the same θ̂ which is obtained when ∂θ log g(y; θ) = 0 is solved directly. 2.5.10 Problem Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution. Suppose we observe Xi , i = 1, . . . , m but for i = m + 1, . . . , n we observe only that Xi > c. (a) Give explicitly the M-step of the E.M. algorithm for finding the M.L. estimate of μ in the case where σ 2 is known. (b) Give explicitly the M-step of the E.M. algorithm for finding the M.L. estimates of μ and σ 2 . Hint: If Z ∼ N(0, 1) show that E(Z|Z > b) = φ(b) = h(b) 1 − Φ(b) where φ is the probability density function and Φ is the cumulative distribution function of Z and h is called the hazard function. 2.5.11 Problem Let (X1 , Y1 ) , . . . , (Xn , Yn ) be a random sample from the BVN(μ, Σ) distribution. Suppose that some of the Xi and Yi are missing as follows: for i = 1, . . . , n1 we observe both Xi and Yi , for i = n1 + 1, . . . , n2 we observe only Xi and for i = n2 + 1, . . . , n we observe only Yi . Give explicitly the M-step of the E.M. algorithm for finding the M.L. estimates of 74 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION (μ1 , μ2 , σ12 , σ22 , ρ). Hint: Xi |Yi = yi ∼ N(μ1 + ρσ1 (yi − μ2 )/σ2 , (1 − ρ2 )σ12 ). 2.5.12 Problem The data in the table below were obtained through the National Crime Survey conducted by the U.S. Bureau of Census (See Kadane (1985), Journal of Econometrics, 29, 46-67.). Households were visited on two occasions, six months apart, to determine if the occupants had been victimized by crime in the preceding six-month period. First visit Crime-free (X1 = 0) Victims (X1 = 1) Nonrespondents Crime-free (X2 = 0) 392 76 31 Second visit Victims (X2 = 1) 55 38 7 Nonrespondents 33 9 115 Let X1i = 1 (0) if the occupants in household i were victimized (not victimized) by crime in the preceding six-month period on the first visit. Let X2i = 1 (0) if the occupants in household i were victimized (not victimized) by crime during the six-month period between the first visit and second visits. Let θjk = P (X1i = j, X2i = k) , j = 0, 1; k = 0, 1; i = 1, . . . , N. (a) Write down the probability of observing the complete data Xi = (X1i , X2i ) , i = 1, . . . , N and show that X = (X1 , . . . , XN ) has a regular exponential family distribution. (b) Give the M-step of the E.M. algorithm for finding the M.L. estimate of θ = (θ00 , θ01 , θ10 , θ11 ). Find the M.L. estimate of θ for the data in the table. Note: You may ignore the 115 households that did not respond to the survey at either visit. Hint: E [(1 − X1i ) (1 − X2i ) ; θ] = P (X1i = 0, X2i = 0; θ) = θ00 E [(1 − X1i ) (1 − X2i ) |X1i = 1; θ] = 0 P (X1i = 0, X2i = 0; θ) θ00 E [(1 − X1i ) (1 − X2i ) |X1i = 0; θ] = = P (X1i = 0; θ) θ00 + θ01 etc. 2.6. THE INFORMATION INEQUALITY 75 (c) Find the M.L. 
estimate of the odds ratio τ= θ00 θ11 . θ01 θ10 What is the significance of τ = 1? 2.6 The Information Inequality Suppose we consider estimating a parameter τ (θ),where θ is a scalar, using an unbiased estimator T (X). Is there any limit to how well an estimator like this can behave? The answer for unbiased estimators is in the affirmative, and a lower bound on the variance is given by the information inequality. 2.6.1 Information Inequality - One Parameter Suppose T (X) is an unbiased estimator of the parameter τ (θ) in a regular statistical model {f (x; θ) ; θ ∈ Ω}. Then 2 V ar(T ) ≥ [τ 0 (θ)] . J (θ) Equality holds if and only if X has a regular exponential family with natural sufficient statistic T (X). 2.6.2 Proof Since T is an unbiased estimator of τ (θ), Z T (x)f (x; θ) dx = τ (θ), for all θ ∈ Ω, A where P (X ∈ A; θ) = 1. Since f (x; θ) is a regular model we can take the derivative with respect to θ on both sides and interchange the integral and derivative to obtain: Z ∂f (x; θ) T (x) dx = τ 0 (θ). ∂θ A Since E[S(θ; X)] = 0, this can be written as Cov[T, S(θ; X)] = τ 0 (θ) 76 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION and by the covariance inequality, this implies V ar(T )V ar[S(θ; X)] ≥ [τ 0 (θ]2 (2.7) which, upon dividing by J(θ) = V ar[S(θ; X)], provides the desired result. Now suppose we have equality in (2.7). Equality in the covariance inequality obtains if and only if the random variables T and S(θ; X) are linear functions of one another. Therefore, for some (non-random) c1 (θ), c2 (θ), if equality is achieved, S(θ; x) = c1 (θ)T (x) + c2 (θ) all x ∈ A. Integrating with respect to θ, log f (x; θ) = C1 (θ)T (x) + C2 (θ) + C3 (x) where we note that the constant of integration C3 is constant with respect to changing θ but may depend on x. Therefore, f (x; θ) = C(θ) exp [C1 (θ)T (x)] h(x) where C(θ) = eC2 (θ) and h(x) = eC3 (x) which is exponential family with natural sufficient statistic T (X).¥ The special case of the information inequality that is of most interest is the unbiased estimation of the parameter θ. The above inequality indicates that any unbiased estimator T of θ has variance at least 1/J(θ). The lower bound is achieved only when f (x; θ) is regular exponential family with natural sufficient statistic T . Notes: 1. If equality holds then T (X) is called an efficient estimator of τ (θ). 2. The number [τ 0 (θ)]2 J(θ) is called the Cramér-Rao lower bound (C.R.L.B.). 3. The ratio of the C.R.L.B. to the variance of an unbiased estimator is called the efficiency of the estimator. 2.6.3 Example Suppose X1 , . . . , Xn is a random sample from the POI(θ) distribution. Show that the variance of the U.M.V.U.E. of θ achieves the Cramér-Rao 2.6. THE INFORMATION INEQUALITY 77 lower bound for unbiased estimators of θ and find the lower bound. What is the U.M.V.U.E. of τ (θ) = θ2 ? Does the variance of this estimator achieve the Cramér-Rao lower bound for unbiased estimators of θ2 ? What is the lower bound? 2.6.4 Example Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function f (x; θ) = θxθ−1 , 0 < x < 1, θ > 0. Show that the variance of the U.M.V.U.E. of θ does not achieve the CramérRao lower bound. What is the efficiency of the U.M.V.U.E.? For some time it was believed that no estimator of θ could have variance smaller than 1/J(θ) at any value of θ but this was demonstrated incorrect by the following example of Hodges. 2.6.5 Problem Let X1 , . . . 
, Xn is a random sample from the N(θ, 1) distribution and define X̄ if |X̄| ≤ n−1/4 , T (X) = X̄ otherwise. 2 1 Show that E(T ) ≈ θ, V ar(T ) ≈ 1/n if θ = 6 0, and V ar(T ) ≈ 4n if θ = 0. Show that the Cramér-Rao lower bound for estimating θ is equal to n1 . T (X) = This example indicates that it is possible to achieve variance smaller than 1/J(θ) at one or more values of θ. It has been proved that this is the exception. In fact the set of θ for which the variance of an estimator is less than 1/J(θ) has measure 0, which means, for example, that it may be a finite set or perhaps a countable set, but it cannot contain a non-degenerate interval of values of θ. 2.6.6 Problem For each of the following determine whether the variance of the U.M.V.U.E. of θ based on a random sample X1 , . . . , Xn achieves the Cramér-Rao lower bound. In each case determine the Cramér-Rao lower bound and find the efficiency of the U.M.V.U.E. (a) N(θ, 4) (b) Bernoulli(θ) (c) N(0, θ2 ) (d) N(0, θ). 78 2.6.7 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION Problem Find examples of the following phenomena in a regular statistical model. (a) No unbiased estimator of τ (θ) exists. (b) An unbiased estimator of τ (θ) exists but there is no U.M.V.U.E. (c) A U.M.V.U.E. of τ (θ) exists but its variance is strictly greater than the Cramér-Rao lower bound. (d) A U.M.V.U.E. of τ (θ) exists and its variance equals the Cramér-Rao lower bound. 2.6.8 Information Inequality - Multiparameter The right hand side in the information inequality generalizes naturally to the multiple parameter case in which θ is a vector. For example if θ = (θ1 , . . . , θk )T , then the Fisher information J(θ) is a k × k matrix. If τ (θ) is any real-valued function of θ then its derivative is a column vector ´T ³ ∂τ ∂τ , . . . , . Then if T (X) is any unbiased we will denote by D(θ) = ∂θ ∂θ 1 k estimator of τ (θ) in a regular model, V ar(T ) ≥ [D(θ)]T [J(θ)]−1 D(θ) for all θ ∈ Ω. 2.6.9 Example Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Find the U.M.V.U.E. of σ and determine whether the U.M.V.U.E. is an efficient estimator of σ. What happens as n → ∞? Hint: ∙ µ ¶¸ Γ (k + a) (a + b − 1) (a − b) 1 as k → ∞ = ka−b 1 + +O Γ (k + b) 2k k2 2.6.10 Problem Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Find the U.M.V.U.E. of μ/σ and determine whether the U.M.V.U.E. is an efficient estimator. What happens as n → ∞? 2.6.11 Problem Let X1 , . . . , Xn be a random sample from the GAM(α, β) distribution. Find the U.M.V.U.E. of E(Xi ; α, β) = αβ and determine whether the U.M.V.U.E. is an efficient estimator. 2.7. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - ONE PARAMETER 79 2.6.12 Problem Consider the model in Problem 1.7.26. (a) Find the M.L. estimators of μ and σ 2 using the result from Problem 2.4.7. You do not need to verify that your answer corresponds to a maximum. Compare the M.L. estimators with the U.M.V.U.E.’s of μ and σ 2 . (b) Find the observed information matrix and the Fisher information. (c) Determine if the U.M.V.U.E.’s of μ and σ 2 are efficient estimators. 2.7 Asymptotic Properties of M.L. Estimators - One Parameter One of the more successful attempts at justifying estimators and demonstrating some form of optimality has been through large sample theory or the asymptotic behaviour of estimators as the sample size n → ∞. One of the first properties one requires is consistency of an estimator. 
This means that the estimator converges to the true value of the parameter as the sample size (and hence the information) approaches infinity. 2.7.1 Definition Consider a sequence of estimators Tn where the subscript n indicates that the estimator has been obtained from data X1 , . . . , Xn with sample size n. Then the sequence is said to be a consistent sequence of estimators of τ (θ) if Tn →p τ (θ) for all θ ∈ Ω. It is worth a reminder at this point that probability (density) functions are used to produce probabilities and are only unique up to a point. For example if two probability density functions f (x) and g(x) were such that they produced the same probabilities, or the same cumulative distribution function, for example, Zx −∞ f (z)dz = Zx g(z)dz −∞ for all x, then we would not consider them distinct probability densities, even though f (x) and g(x) may differ at one or more values of x. Now when we parameterize a given statistical model using θ as the parameter, it is natural to do so in such a way that different values of the parameter lead to distinct probability (density) functions. This means, for example, that 80 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION the cumulative distribution functions associated with these densities are distinct. Without this assumption it would be impossible to accurately estimate the parameter since two different parameters could lead to the same cumulative distribution function and hence exactly the same behaviour of the observations. Therefore we assume: (R7) The probability (density) functions corresponding to different values of the parameters are distinct, that is, θ 6= θ∗ =⇒ f (x; θ) 6= f (x; θ∗ ). This assumption together with assumptions (R1) − (R6) (see 2.3.1) are sufficient conditions for the theorems given in this section. 2.7.2 Theorem - Consistency of the M.L. Estimator (Regular Model) Suppose X1 , . . . , Xn is a random sample from a model {f (x; θ) ; θ ∈ Ω} satisfying regularity conditions (R1) − (R7). Then with probability tending to 1 as n → ∞, the likelihood equation or score equation n X ∂ logf (Xi ; θ) = 0 ∂θ i=1 has a root θ̂n such that θ̂n converges in probability to θ0 , the true value of the parameter, as n → ∞. The proof of this theorem is given in Section 5.4.9 of the Appendix. The likelihood equation does not always have a unique root as the following problem illustrates. 2.7.3 Problem Indicate whether or not the likelihood equation based on X1 , . . . , Xn has a unique root in each of the cases below: (a) LOG(1, θ) (b) WEI(1, θ) (c) CAU(1, θ) The consistency of the M.L. estimator is one indication that it performs reasonably well. However, it provides no reason to prefer it to some other consistent estimator. The following result indicates that M.L. estimators perform as well as any reasonable estimator can, at least in the limit as n → ∞. 2.7. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - ONE PARAMETER 81 2.7.4 Theorem - Asymptotic Distribution of the M.L. Estimator (Regular Model) Suppose (R1) − (R7) hold. Suppose θ̂n is a consistent root of the likelihood equation as in Theorem 2.7.2. Then p J(θ0 )(θ̂n − θ0 ) →D Z ∼ N(0, 1) where θ0 is the true value of the parameter. The proof of this theorem is given in Section 5.4.10 of the Appendix. Note: Since J(θ) is the Fisher expected information based on a random sample from the model {f (x; θ) ; θ ∈ Ω}, ¸ ∙ ¸ ∙ n P ∂2 ∂2 logf (Xi ; θ) ; θ = nE − 2 logf (X; θ) ; θ J(θ) = E − 2 ∂θ i=1 ∂θ where X has probability (density) function f (x; θ). 
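A small simulation makes the content of Theorem 2.7.4 concrete for the POI(θ) model, where θ̂n = X̄n and J(θ) = n/θ: the standardized quantity √J(θ0) (θ̂n − θ0) should behave like a N(0, 1) random variable. In the Python sketch below the true value θ0, the sample size and the number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n, reps = 3.0, 50, 10000      # true value, sample size, replications (arbitrary)

# For the POI(theta) model the M.L. estimator is the sample mean and J(theta) = n/theta.
theta_hat = rng.poisson(theta0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n / theta0) * (theta_hat - theta0)

print(z.mean(), z.std())              # should be close to 0 and 1
print(np.mean(np.abs(z) < 1.96))      # should be close to 0.95
```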
This theorem implies that for a regular model and sufficiently large n, θ̂n has an approximately normal distribution with mean θ0 and variance −1 −1 [J(θ0 )] . [J(θ0 )] is called the asymptotic variance of θ̂n . This theorem also asserts that θ̂n is asymptotically unbiased and its asymptotic variance approaches the Cramér-Rao lower bound for unbiased estimators of θ. By the Limiting Theorems it also follows that τ (θ̂ ) − τ (θ0 ) q n →D Z v N (0, 1) . [τ 0 (θ0 )]2 /J (θ0 ) Compare this result with the Information Inequality. 2.7.5 Definition Suppose X1 , . . . , Xn is a random sample from a regular statistical model {f (x; θ) ; θ ∈ Ω}. Suppose also that Tn = Tn (X1 , ..., Xn ) is asymptotically normal with mean θ and variance σT2 /n. The asymptotic efficiency of Tn is defined to be ½ ∙ 2 ¸¾−1 ∂ log f (X; θ) σT2 · E − ; θ ∂θ2 where X has probability (density) function f (x; θ). 82 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 2.7.6 Problem Suppose X1 , . . . , Xn is a random sample from a distribution with continuous probability density function f (x; θ) and cumulative distribution function F (x; θ) where θ is the median of the distribution. Suppose also that f (x; θ) is continuous at x = θ. The sample median Tn = med(X1 , . . . , Xn ) is a possible estimator of θ. (a) Find the probability density function of the median if n = 2m + 1 is odd. (b) Prove µ ¶ √ 1 n(Tn − θ) →D T ∼ N 0, 4[f (0; 0)]2 (c) If X1 , ..., Xn is a random sample from the N(θ, 1) distribution find the asymptotic efficiency of Tn . (d) If X1 , ..., Xn is a random sample from the CAU(1, θ) distribution find the asymptotic efficiency of Tn . 2.8 2.8.1 Interval Estimators Definition Suppose X is a random variable whose distribution depends on θ. Suppose that A(x) and B(x) are functions such that A(x) ≤ B(x) for all x ∈ support of X and θ ∈ Ω. Let x be the observed data. Then (A(x), B(x)) is an interval estimate for θ. The interval (A(X), B(X)) is an interval estimator for θ. Likelihood intervals are one type of interval estimator. Confidence intervals are another type of interval estimator. We now consider a general approach for constructing confidence intervals based on pivotal quantities. 2.8.2 Definition Suppose X is a random variable whose distribution depends on θ. The random variable Q(X; θ) is called a pivotal quantity if the distribution of Q does not depend on θ. Q(X; θ) is called an asymptotic pivotal quantity if the limiting distribution of Q as n → ∞ does not depend on θ. For example, for a random sample X1 , . . . , Xn from a N(θ, σ 2 ) distrib- 2.8. INTERVAL ESTIMATORS 83 ution where σ 2 is known, the statistic ¢ √ ¡ n X̄ − θ T = σ is a pivotal quantity whose distribution does not depend on θ. If X1 , . . . , Xn is a random sample from a distribution, not necessarily normal, having mean θ and known variance σ 2 then the asymptotic distribution of T is N(0, 1) by the C.L.T. and T is an asymptotic pivotal quantity. 2.8.3 Definition Suppose A(X) and B(X) are statistics. If P [A(X) < θ < B(X)] = p, 0 < p < 1 then (a(x), b(x)) is called a 100p% confidence interval (C.I.) for θ. Pivotal quantities can be used for constructing C.I.’s in the following way. Since the distribution of Q(X; θ) is known we can write down a probability statement of the form P (q1 ≤ Q(X; θ) ≤ q2 ) = p. If Q is a monotone function of θ then this statement can be rewritten as P [A(X) ≤ θ ≤ B(X)] = p and the interval [a(x), b(x)] is a 100p% C.I.. 
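For instance, for the normal example above with σ² known, the pivotal quantity Q = √n(X̄ − θ)/σ ∼ N(0, 1) and the statement P(−1.96 ≤ Q ≤ 1.96) = 0.95 inverts to the interval x̄ ± 1.96σ/√n. A short Python sketch of this construction follows; the data values are illustrative only.

```python
import numpy as np

sigma = 2.0                                                   # known standard deviation
x = np.array([9.6, 12.1, 10.4, 11.7, 8.9, 10.8, 11.2, 9.9])   # illustrative data
n, xbar = len(x), x.mean()

# Invert P(-1.96 <= sqrt(n)(Xbar - theta)/sigma <= 1.96) = 0.95 for theta.
lower = xbar - 1.96 * sigma / np.sqrt(n)
upper = xbar + 1.96 * sigma / np.sqrt(n)
print(lower, upper)                                           # 95% C.I. for theta
```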
The following theorem gives the pivotal quantity in the case in which θ is either a location parameter or a scale parameter. 2.8.4 Theorem Let X = (X1 , ..., Xn ) be a random sample from the model {f (x; θ) ; θ ∈ Ω} and let θ̂ = θ̂(X) be the M.L. estimator of the scalar parameter θ based on X. (1) If θ is a location parameter then Q = Q (X) = θ̂−θ is a pivotal quantity. (2) If θ is a scale parameter then Q = Q (X) = θ̂/θ is a pivotal quantity. 2.8.5 Asymptotic Pivotal Quantities and Approximate Confidence Intervals In cases in which an exact pivotal quantity cannot be constructed we can use the limiting distribution of θ̂n to construct approximate C.I.’s. Since [J(θ̂n )]1/2 (θ̂n − θ0 ) →D Z v N(0, 1) 84 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION then [J(θ̂n )]1/2 (θ̂−θ0 ) is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal quantity is given by i h θ̂n − a[J(θ̂n )]−1/2 , θ̂n + a[J(θ̂n )]−1/2 where θ̂n = θ̂n (x1 , ..., xn ) is the M.L. estimate of θ, P (−a < Z < a) = p and Z v N(0, 1). Similarly since [I(θ̂n ; X)]1/2 (θ̂n − θ0 ) →D Z v N(0, 1) where X = (X1 , ..., Xn ) then [I(θ̂n ; X)]1/2 (θ̂n − θ0 ) is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal quantity is given by i h θ̂n − a[I(θ̂)]−1/2 , θ̂n + a[I(θ̂n )]−1/2 where I(θ̂n ) is the observed information. Finally since −2 log R(θ0 ; X) →D W v χ2 (1) then −2 log R(θ0 ; X) is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal is {θ : −2 log R(θ; x) ≤ b} where x = (x1 , ..., xn ) are the observed data, P (W ≤ b) = p and W v χ2 (1). Usually this must be calculated numerically. Since τ (θ̂ ) − τ (θ0 ) rh n i →D Z v N (0, 1) 2 0 τ (θ̂n ) /J(θ̂n ) an approximate 100p% C.I. for τ (θ) is given by " τ (θ̂n ) − a ½h ¾1/2 ½h ¾1/2 # i2 i2 0 τ (θ̂n ) /J(θ̂n ) , τ (θ̂n ) + a τ (θ̂n ) /J(θ̂n ) 0 where P (−a < Z < a) = p and Z v N(0, 1). 2.8. INTERVAL ESTIMATORS 2.8.6 85 Likelihood Intervals and Approximate Confidence Intervals A 15% L.I. for θ is given by {θ : R(θ; x) ≥ 0.15}. Since −2 log R(θ0 ; X) →D W v χ2 (1) we have P [R(θ; X) ≥ 0.15] = = ≈ ≈ ≈ P [−2 log R(θ; X) ≤ −2 log (0.15)] P [−2 log R(θ; X) ≤ 3.79] ¡ ¢ P (W ≤ 3.79) = P Z 2 ≤ 3.79 where Z v N(0, 1) P (−1.95 ≤ Z ≤ 1.95) 0.95 and therefore a 15% L.I. is an approximate 95% C.I. for θ. 2.8.7 Example Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function f (x; θ) = θxθ−1 , 0 < x < 1. The likelihood function for observations x1 , . . . , xn is L (θ) = n Q i=1 θxθ−1 i =θ n µ n Q xi i=1 ¶θ−1 , θ>0 The log likelihood and score function are l (θ) = n log θ + (θ − 1) S (θ) = where t = n P n P log xi , θ>0 i=1 n 1 + t = (n + tθ) θ θ log xi . Since i=1 S (θ) > 0 for 0 < θ < −n/t and S (θ) < 0 for θ > −n/t therefore by the First Derivative Test, θ̂ = −n/t is the M.L. estimate of θ. n P The M.L. estimator of θ is θ̂ = −n/T where T = log Xi . i=1 86 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION The information function is I (θ) = n , θ2 θ>0 and the Fisher information is J (θ) = E [I (θ; X)] = E By the W.L.L.N. − ³n´ θ2 = n θ2 n T 1 1 P log Xi →p E (− log Xi ; θ0 ) = =− n n i=1 θ0 and by the Limit Theorems θ̂n = n →p θ0 T and thus θ̂n is a consistent estimator of θ0 . By Theorem 2.7.4 r p n J (θ0 )(θ̂n − θ0 ) = (θ̂n − θ0 ) →D Z v N (0, 1) . θ02 (2.8) The asymptotic variance of θ̂n is equal to θ02 /n whereas the actual variance of θ̂n is n2 θ02 V ar(θ̂n ) = . 
2 (n − 1) (n − 2) This can be shown using the fact that − log Xi v EXP (1/θ0 ), i = 1, . . . , n independently which means µ ¶ n P 1 T =− log Xi v GAM n, (2.9) θ0 i=1 and then using the result from Problem 1.3.4. Therefore the asymptotic variance and the actual variance of θ̂n are not identical but are close in value for large n. An approximate 95% C.I. for θ based on q J(θ̂n )(θ̂n − θ0 ) →D Z v N (0, 1) is given by ∙ ¸ h q q √ √ i θ̂n − 1.96/ J(θ̂n ), θ̂n + 1.96/ J(θ̂n ) = θ̂n − 1.96θ̂n / n, θ̂n + 1.96θ̂n / n . 2.8. INTERVAL ESTIMATORS 87 √ Note √ the width of the C.I. which is equal to 2 (1.96) θ̂n / n decreases as 1/ n. An exact C.I. for θ can be obtained in this case since nθ T = Tθ = v GAM (n, 1) θ−1 θ̂ and therefore nθ/θ̂ is a pivotal quantitiy. Since 2T θ = 2nθ θ̂ v χ2 (2n) we can use values from the chi-squared tables. From the chi-squared tables we find values a and b such that P (a ≤ W ≤ b) = 0.95 where W v χ2 (2n) . Then µ ¶ 2nθ a≤ ≤ b = 0.95 θ̂ à ! aθ̂ bθ̂ P ≤θ≤ = 0.95 2n 2n P or and a 95% C.I. for θ is " # aθ̂ bθ̂ , . 2n 2n If we choose P (W ≤ a) = 1 − 0.95 = 0.025 = P (W ≥ b) 2 then we obtain an “equal-tail” C.I. for θ. This is not the narrowest C.I. but it is easier to obtain than the narrowest C.I.. How would you obtain the narrowest C.I.? 2.8.8 Example Suppose X1 , . . . , Xn is a random sample from the POI (θ) distribution. The parameter θ is neither a location or scale parameter. The M.L. estimator of θ and the Fisher information are θ̂n = X̄n and J (θ) = n . θ 88 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION By Theorem 2.7.4 ³ ´ rn ³ ´ p J (θ0 ) θ̂n − θ0 = θ̂n − θ0 →D Z v N (0, 1) . θ0 (2.10) By the C.L.T. X̄n − θ0 p →D Z v N (0, 1) θ0 /n which is the same result. The asymptotic variance of θ̂n is equal to θ0 /n. Since the actual variance of θ̂n is ¡ ¢ θ0 V ar(θ̂n ) = V ar X̄n = n the asymptotic variance and the actual variance of θ̂n are identical in this case. An approximate 95% C.I. for θ based on q ³ ´ J(θ̂n ) θ̂n − θ0 →D Z v N (0, 1) is given by ∙ ¸ ∙ ¸ q q q q θ̂n − 1.96/ J(θ̂n ), θ̂n + 1.96/ J(θ̂n ) = θ̂n − 1.96 θ̂n /n, θ̂n + 1.96 θ̂n /n . An approximate 95% C.I. for τ (θ) = e−θ can be based on the asymptotic pivotal τ (θ̂ ) − τ (θ0 ) rh n i →D Z v N (0, 1) . 2 0 τ (θ̂n ) /J(θ̂n ) ¡ −θ ¢ d e = −e−θ and the approxFor τ (θ) = e−θ = P (X1 = 0; θ), τ 0 (θ) = dθ imate 95% C.I. is given by " ½h ½h ¾1/2 ¾1/2 # ³ ´ i2 ³ ´ i2 τ θ̂n − 1.96 τ 0 (θ̂n ) /J(θ̂n ) , τ θ̂n − 1.96 τ 0 (θ̂n ) /J(θ̂n ) = ∙ ¸ q q e−θ̂n − 1.96e−θ̂n θ̂n /n, e−θ̂n + 1.96e−θ̂n θ̂n /n . which is symmetric about the M.L. estimate τ (θ̂n ) = e−θ̂n . (2.11) 2.8. INTERVAL ESTIMATORS 89 Alternatively since µ ¶ q q 0.95 ≈ P θ̂n − 1.96 θ̂n /n ≤ θ ≤ θ̂n + 1.96 θ̂n /n µ ¶ q q = P −θ̂n + 1.96 θ̂n /n ≥ −θ ≥ −θ̂n − 1.96 θ̂n /n µ µ ¶ µ ¶¶ q q −θ = P exp −θ̂n + 1.96 θ̂n /n ≥ e ≥ exp −θ̂n − 1.96 θ̂n /n µ µ ¶ µ ¶¶ q q −θ = P exp −θ̂n − 1.96 θ̂n /n ≤ e ≤ exp −θ̂n + 1.96 θ̂n /n therefore ∙ µ ¶ µ ¶¸ q q exp −θ̂n − 1.96 θ̂n /n , exp −θ̂n + 1.96 θ̂n /n (2.12) is also an approximate 95% C.I. for τ (θ). If n = 20 and θ̂n = 3 then the C.I. (2.11) is equal to [0.012, 0.0876] while the C.I. (2.12) is equal to [0.0233, 0.1064] . 2.8.9 Example Suppose X1 , . . . , Xn is a random sample from the EXP (1, θ) distribution. f (x; θ) = e−(x−θ) , x≥θ The likelihood function for observations x1 , . . . , xn is n Q e−(xi −θ) if xi ≥ θ > −∞, i = 1, . . . , n i=1 µ n ¶ P xi enθ if − ∞ < θ ≤ x(1) = exp − L (θ) = i=1 and L (θ) is equal to 0 if θ > x(1) . 
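A quick numerical look at this likelihood can help fix ideas; the sketch below simulates a sample from the EXP(1, θ) model for a hypothetical value of θ and evaluates L(θ) on a grid, showing that it rises with θ up to x(1) and vanishes beyond it.

import numpy as np

# Sketch: shape of L(theta) in Example 2.8.9 on simulated EXP(1, theta) data.
rng = np.random.default_rng(2)
theta_true, n = 1.5, 10                      # hypothetical values for illustration
x = theta_true + rng.exponential(1.0, n)     # Xi = theta + EXP(1), so Xi >= theta

def likelihood(theta):
    # L(theta) = exp(-sum(xi) + n*theta) for theta <= x_(1), and 0 otherwise
    return np.where(theta <= x.min(), np.exp(-x.sum() + n * theta), 0.0)

grid = np.linspace(x.min() - 1, x.min() + 0.5, 7)
for t in grid:
    print(round(t, 3), likelihood(t))
# The printed values increase up to theta = x.min() = x_(1) and are 0 past it.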
To maximize this function of θ we note that we want to make the term enθ as large as possible subject to θ ≤ x(1) which implies that θ̂n = x(1) is the M.L. estimate and θ̂n = X(1) is the M.L. estimator of θ. Since the support of Xi depends on the unknown parameter θ, the model is not a regular model. This means that Theorem 2.7.2 and 2.7.4 cannot be used to determine the asymptotic properties of θ̂n . Since P (θ̂n ≤ x; θ0 ) = 1 − n Q i=1 P (Xi > x; θ0 ) = 1 − e−n(x−θ0 ) , x ≥ θ0 90 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION then θ̂n v EXP ¡1 n , θ0 ¢ . Therefore lim E(θ̂n ) = lim n→∞ and n→∞ µ 1 θ0 + n ¶ = θ0 µ ¶2 1 lim V ar(θ̂n ) = lim =0 n→∞ n→∞ n by Theorem 5.3.8, θ̂n →p θ0 and θ̂n is a consistent estimator. Since h P n(θ̂n − θ0 ) ≤ t; θ0 i ¶ µ t = P θ̂n ≤ + θ0 ; θ0 n = 1 − en(t/n+θ0 −θ0 ) = 1 − e−t , t ≥ 0 true for n = 1, 2, . . ., therefore n(θ̂n − θ0 ) v EXP (1) for n = 1, 2, . . . and therefore the asymptotic distribution of n(θ̂n − θ0 ) is also EXP (1). Since we know the exact distribution of θ̂n for n = 1, 2, . . ., the asymptotic distribution is not needed for obtaining C.I.’s. For this model the parameter θ is a location parameter and therefore (θ̂ − θ) is a pivotal quantity and in particular P (θ̂ − θ ≤ t; θ) = P (θ̂ ≤ t + θ; θ) = 1 − e−nt , t ≥ 0. h i Since (θ̂−θ) is a pivotal quantity, a C.I. for θ would take the form θ̂ − b, θ̂ − a where 0 ≤ a ≤ b. Unless a = 0 this interval would not contain the M.L. estimate θ̂ and therefore a “one-tail” C.I. makes sense in this case. To obtain a 95% “one-tail” C.I. for θ we solve 0.95 = P (θ̂ − b ≤ θ ≤ θ̂; θ) ³ ´ = P 0 ≤ θ̂ − θ ≤ b; θ = 1 − e−nb which gives 1 log 20 b = − log (0.05) = . n n Therefore ¸ ∙ log 20 , θ̂ θ̂ − n is a 95% “one-tail” C.I. for θ. 2.8. INTERVAL ESTIMATORS 2.8.10 91 Problem Let X1 , . . . , Xn be a random sample from the distribution with p.d.f. f (x; β) = 2x , β2 0 < x ≤ β. (a) Find the likelihood function of β and the M.L. estimator of β. (b) Find the M.L. estimator of E (X; β) where X has p.d.f. f (x; β). (c) Show that the M.L. estimator of β is a consistent estimator of β. (d) If n = 15 and x(15) = 0.99, find the M.L. estimate of β. Plot the relative likelihood function for β and find 10% and 50% likelihood intervals for β. (e) If n = 15 and x(15) = 0.99, construct an exact 95% one-tail C.I. for β. 2.8.11 Problem Suppose X1 , . . . , Xn is a random sample from the UNIF(0, θ) distribution. Show that the M.L. estimator θ̂n is a consistent estimator of θ. How would you construct a C.I. for θ? 2.8.12 Problem A certain type of electronic equipment is susceptible to instantaneous failure at any time. Components do not deteriorate significantly with age and the distribution of the lifetime is the EXP(θ) density. Ten components were tested independently with the observed lifetimes, to the nearest days, given by 70 11 66 5 20 4 35 40 29 8. (a) Find the M.L. estimate of θ and verify that it corresponds to a local maximum. Find the Fisher information and calculate an approximate 95% C.I. for θ based on the asymptotic distribution of θ̂. Compare this with an exact 95% C.I. for θ. (b) The estimate in (a) ignores the fact that the data were rounded to the nearest day. Find the exact likelihood function based on the fact that the probability of observing a lifetime of i days is given by g(i; θ) = i+0.5 Z i−0.5 1 −x/θ dx, i = 1, 2, . . . and g(0; θ) = e θ Z0.5 1 −x/θ dx. e θ 0 Obtain the M.L. estimate of θ and verify that it corresponds to a local maximum. Find the Fisher information and calculate an approximate 95% C.I. for θ. 
Compare these results with those in (a). 92 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 2.8.13 Problem The number of calls to a switchboard per minute is thought to have a POI(θ) distribution. However, because there are only two lines available, we are only able to record whether the number of calls is 0, 1 , or ≥ 2. For 50 one minute intervals the observed data were: 25 intervals with 0 calls, 16 intervals with 1 call and 9 intervals with ≥ 2 calls. (a) Find the M.L. estimate of θ. (b) By computing the Fisher information both for this problem and for one with full information, that is, one in which all of the values of X1 , . . . , X50 had been recorded, determine how much information was lost by the fact that we were only able to record the number of times X > 1 rather than the value of these X’s. How much difference does this make to the asymptotic variance of the M.L. estimator? 2.8.14 Problem Let X1 , . . . , Xn be a random sample from a UNIF(θ, 2θ) distribution. Show that the M.L. estimator θ̂ is a consistent estimator of θ. What is the 5 minimal sufficient statistic for this model? Show that θ̃ = 14 X(n) + 27 X(1) is a consistent estimator of θ which has smaller M.S.E. than θ̂. 2.9 Asymptotic Properties of M.L. Estimators - Multiparameter Under similar regularity conditions to the univariate case, the conclusion of Theorem 2.7.2 holds in the multiparameter case θ = (θ1 , . . . , θk )T , that is, each component of θ̂n converges in probability to the corresponding component of θ0 . Similarly, Theorem 2.7.4 remains valid with little modification: [J (θ0 )]1/2 (θ̂n − θ0 ) →D Z ∼ MVN(0k , Ik ) where 0k is a k ×1 vector of zeros and Ik is the k ×k identity matrix. Therefore for a regular model and sufficiently large n, θ̂n has approximately a multivariate normal distribution with mean vector θ0 and variance/covariance −1 matrix [J (θ0 )] . Consider the reparameterization τj = τj (θ), j = 1, . . . , m ≤ k. It follows that ª−1/2 © [τ (θ̂n ) − τ (θ0 )] →D Z ∼ MVN(0m , Im ) [D(θ0 )]T [J(θ0 )]−1 D(θ0 ) 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 93 where τ (θ) = (τ1 (θ), . . . , τm (θ))T and D(θ) is a k × m matrix with (i, j) element equal to ∂τj /∂θi . 2.9.1 Definition A 100p% confidence region for the vector θ based on X = (X1 , ..., Xn ) is a region R(X) ⊂ Rk which satisfies P (θ ∈ R(X); θ) = p. 2.9.2 Aymptotic Pivotal Quantities and Approximate Confidence Regions Since [J(θ̂n )]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik ) it follows that (θ̂n − θ0 )T J(θ̂n )(θ̂n − θ0 ) →D W v χ2 (k) and an approximate 100p% confidence region for θ based on this asymptotic pivotal is the set of all θ vectors in the set {θ : (θ̂n − θ)T J(θ̂n )(θ̂n − θ) ≤ b} where θ̂n = θ̂(x1 , ..., xn ) is the M.L. estimate of θ and b is the value such that P (W < b) = p where W v χ2 (k). Similarly since [I(θ̂n ; X)]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik ) it follows that (θ̂n − θ0 )T I(θ̂n ; X)(θ̂n − θ0 ) →D W v χ2 (k) where X = (X1 , ..., Xn ). An approximate 100p% confidence region for θ based on this asymptotic pivotal quantity is the set of all θ vectors in the set {θ : (θ̂n − θ)T I(θ̂n )(θ̂n − θ) ≤ b} where I(θ̂n ) is the observed information matrix. Finally since −2 log R(θ0 ; X) →D W v χ2 (k) 94 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION an approximate 100p% confidence region for θ based on this asymptotic pivotal quantity is the set of all θ vectors in the set {θ : −2 log R(θ; x) ≤ b} where x = (x1 , ..., xn ) are the observed data, R(θ; x) is the relative likelihood function. 
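In practice a region of this form is evaluated numerically, either over a grid of θ values or by testing particular candidate vectors. The sketch below illustrates the likelihood-ratio-based region; the N(μ, σ²) model and the simulated data are hypothetical, chosen only so that the log likelihood and the M.L. estimate have simple closed forms.

import numpy as np
from scipy.stats import chi2

# Sketch: is a candidate theta = (mu, sigma^2) inside the approximate 95%
# confidence region {theta : -2 log R(theta; x) <= b}, b the chi^2(2) quantile?
rng = np.random.default_rng(3)
x = rng.normal(10.0, 2.0, size=50)          # hypothetical data
n = len(x)

def loglik(mu, sig2):
    return -0.5 * n * np.log(2 * np.pi * sig2) - np.sum((x - mu) ** 2) / (2 * sig2)

mu_hat, sig2_hat = x.mean(), np.mean((x - x.mean()) ** 2)   # closed-form M.L. estimates

b = chi2.ppf(0.95, df=2)                    # P(W <= b) = 0.95 where W ~ chi^2(2)

def in_region(mu, sig2):
    minus2logR = -2 * (loglik(mu, sig2) - loglik(mu_hat, sig2_hat))
    return minus2logR <= b

print(in_region(mu_hat, sig2_hat))          # True: the M.L. estimate is always inside
print(in_region(10.5, 4.5))                 # test any candidate value of interest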
Note that since {θ : −2 log R(θ; x) ≤ b} = {θ : R(θ; x) ≥ e−b/2 } this approximate 100p% confidence region is also a 100e−b/2 % likelihood region for θ. Approximate confidence intervals for a single parameter, say θi , from the vector of parameters θ = (θ1 , ..., θi , ..., θk )T can also be obtained. Since [J(θ̂n )]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik ) it follows that an approximate 100p% C.I. for θi is given by h p p i θ̂i − a v̂ii , θ̂i + a v̂ii where θ̂i is the M.L. estimate of θi , v̂ii is the (i, i) entry of [J(θ̂n )]−1 and a is the value such that P (−a < Z < a) = p where Z v N(0, 1). Similarly since [I(θ̂n ; X)]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik ) it follows that an approximate 100p% C.I. for θi is given by h p p i θ̂i − a v̂ii , θ̂i + a v̂ii where v̂ii is the (i, i) entry of [I(θ̂n )]−1 . If τ (θ) is a scalar function of θ then o−1/2 n [τ (θ̂n ) − τ (θ0 )] →D Z ∼ N(0, 1) [D(θ̂n )]T [J(θ̂n )]−1 D(θ̂n ) where D(θ) is a k × 1 vector with ith element equal to ∂τ /∂θi . An approximate 100p% C.I. for τ (θ) is given by ∙ n o1/2 n o1/2 ¸ τ (θ̂n ) − a [D(θ̂n )]T [J(θ̂n )]−1 D(θ̂n ) . , τ (θ̂n ) + a [D(θ̂n )]T [J(θ̂n )]−1 D(θ̂n ) (2.13) 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 95 2.9.3 Example Recall from Example 2.4.17 that for a random sample from the BETA(a, b) distribution the information matix and the Fisher information matrix are given by " # Ψ0 (a) − Ψ0 (a + b) −Ψ0 (a + b) I(a, b) = n = J(a, b). −Ψ0 (a + b) Ψ0 (b) − Ψ0 (a + b) Since £ â − a0 ¤ ³ ´ b̂ − b0 J â, b̂ " â − a0 b̂ − b0 # →D W v χ2 (2) , an approximate 100p% confidence region for (a, b) is given by " # â − a £ ¤ {(a, b) : â − a b̂ − b J(â, b̂) < c} b̂ − b where P (W ≤ c) = p. Since χ2 (2) = GAM (1, 2) = EXP(2), c can be determined using Z c 1 −x/2 p = P (W ≤ c) = dx = 1 − e−c/2 e 0 2 which gives c = −2 log(1 − p). For p = 0.95, c = −2 log(0.05) = 5.99. An approximate 95% confidence region is given by " # â − a £ ¤ {(a, b) : â − a b̂ − b J(â, b̂) < 5.99}. b̂ − b Let J(â, b̂) = " Jˆ11 Jˆ12 Jˆ12 Jˆ22 # then the confidence region can be written as {(a, b) : (â − a)2 Jˆ11 + 2 (â − a) (b̂ − b)Jˆ12 + (b̂ − b)2 Jˆ22 ≤ 5.99} which can be seen to be the points inside an on the ellipse centred at (â, b̂). For the data in Example 2.4.15, â = 2.7072, b̂ = 6.7493 and ∙ ¸ 10.0280 −3.3461 J(â, b̂) = . −3.3461 1.4443 96 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION 14 99% 12 95% 10 90% (1.5,8) 8 (2.7,6.7) b 6 4 2 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 a Figure 2.7: Approximate Confidence Regions for Beta(a,b) Example Approximate 90%, 95% and 99% confidence regions are shown in Figure 2.7. A 10% likelihood region for (a, b) is given by {(a, b) : R(a, b; x) ≥ 0.1}. Since −2 log R(a0 , b0 ; X) →D W v χ2 (2) = EXP (2) we have P [R(a, b; X) ≥ 0.1] = ≈ = = P [−2 log R(a, b; X) ≤ −2 log (0.1)] P (W ≤ −2 log (0.1)) 1 − e−[−2 log(0.1)]/2 1 − 0.1 = 0.9 and therefore a 10% likelihood region corresponds to an approximate 90% confidence region. Similarly 1% and 5% likelihood regions correspond to approximate 99% and 95% confidence regions respectively. Compare the 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 97 likelihood regions in Figure 2.6 with the approximate confidence regions shown in Figure 2.7. What do you notice? Let h i−1 = J(â, b̂) " v̂11 v̂12 v̂12 v̂22 # . Since 1/2 [J(â, b̂)] ∙ â − a0 b̂ − b0 ¸ →D Z v BVN µ∙ 0 0 ¸ ∙ ¸¶ 1 0 , 0 1 then for large n, V ar(â) ≈ v̂11 , V ar(b̂) ≈ v̂22 and Cov(â, b̂) ≈ v̂12 . Therefore an approximate 95% C.I. 
for a is given by h p i p â − 1.96 v̂11 , â + 1.96 v̂11 and an approximate 95% C.I. for b is given by h p i p b̂ − 1.96 v̂22 , b̂ + 1.96 v̂22 . For the given data â = 2.7072, b̂ = 6.7493 and h i−1 ∙ 0.4393 1.0178 ¸ = J(â, b̂) 1.0178 3.0503 so the approximate 95% C.I. for a is i h √ √ 2.7072 + 1.96 0.44393, 2.7072 − 1.96 0.44393 = [1.4080, 4.0063] and the approximate 95% C.I. for b is i h √ √ 6.7493 − 1.96 3.0503, 6.7493 + 1.96 3.0503 = [3.3261, 10.1725] . Note that a = 1.5 is in the approximate 95% C.I. for a and b = 8 is in the approximate 95% C.I. for b and yet the point (1.5, 8) is not in the approximate 95% joint confidence region for (a, b). Clearly these marginal C.I.’s for a and b must be used with care. To obtain an approximate 95% C.I. for τ (a, b) = E (X; a, b) = a a+b 98 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION we use (2.13) with D (a, b) = and v̂ £ ∂τ ∂b ∂τ ∂a ¤T = h b (a+b)2 −a (a+b)2 = [D(â, b̂)]T [J(â, b̂)]−1 D(â, b̂) #" " i v̂11 v̂12 h −â b̂ = (â+b̂)2 (â+b̂)2 v̂12 v̂22 iT b̂ (â+b̂)2 −â (â+b̂)2 # For the given data τ (â, b̂) = â â + b̂ = 2.7072 = 0.28628 2.7072 + 6.7493 and v̂ = 0.00064706. The approximate 95% C.I. for τ (a, b) = E (X; a, b) = a/ (a + b) is i h √ √ 0.28628 − 1.96 0.00064706, 0.28628 + 1.96 0.00064706 = [0.23642, 0.33614] . 2.9.4 Problem In Problem 2.4.10 find an approximate 95% C.I. for Cov(X1 , X2 ; θ1 , θ2 ). 2.9.5 Problem In Problem 2.4.18 find an approximate 95% joint confidence region for (α, β) and approximate 95% C.I.’s for β and τ (α, β) = E(X; α, β) = αβ. 2.9.6 Problem In Problem 2.4.19 find approximate 95% C.I.’s for β and E(X; α, β). 2.9.7 Problem Suppose X1 , ..., Xn is a random sample from the EXP(β, μ) distribution. Show that the M.L. estimators β̂n and μ̂n are consistent estimators. How would you construct a joint confidence region for (β, μ)? How would you construct a C.I. for β? How would you construct a C.I. for μ? 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 99 2.9.8 Problem Consider the model in Problem 1.7.26. Explain clearly how you would construct a C.I. for σ 2 and a C.I. for μ. 2.9.9 Problem Let X1 , . . . , Xn be a random sample from the distribution with p.d.f. f (x; α, β) = αxα−1 , βα 0 < x ≤ β, α > 0. (a) Find the likelihood function of α and β and the M.L. estimators of α and β. (b) Show that the M.L. estimator of α is a consistent estimator of α. Show that the M.L. estimator of β is a consistent estimator of β. 15 P (c) If n = 15, x(15) = 0.99 and log xi = −7.7685 find the M.L. estimates i=1 of α and β. (d) If n = 15, x(15) = 0.99 and 15 P i=1 log xi = −7.7685, construct an exact 95% equal-tail C.I. for α and an exact 95% one-tail C.I. for β. (e) Explain how you would construct a joint likelihood region for α and β. Explain how you would construct a joint confidence region for α and β? 2.9.10 Problem The following are the results, in millions of revolutions to failure, of endurance tests for 23 deep-groove ball bearings: 17.88 48.48 68.64 105.12 28.92 51.84 68.64 105.84 33.00 51.96 68.88 127.92 41.52 54.12 84.12 128.04 42.12 55.56 93.12 173.40 45.60 67.80 98.64 As a result of testing thousands of ball bearings, it is known that their lifetimes have a WEI(θ, β) distribution. (a) Find the M.L. estimates of θ and β and the observed information I(θ̂, β̂). (b) Plot the 1%, 5% and 10% likelihood regions for θ and β on the same graph. 100 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION (c) Plot the approximate 99%, 95% and 90% joint confidence regions for θ and β on the same graph. 
Compare these with the likelihood regions in (b) and comment. (d) Calulate an approximate 95% confidence interval for β. (e) The value β = 1 is of interest since WEI(θ, 1) = EXP (θ). Is β = 1 a plausible value of β in light of the observed data? Justify your conclusion. (f ) If X v WEI (θ, β) then " µ ¶ # β 80 = τ (θ, β) . P (X > 80; θ, β) = exp − θ Find an approximate 95% confidence interval for τ (θ, β). 2.9.11 Example - Logistic Regression Pistons are made by casting molten aluminum into moulds and then machining the raw casting. One defect that can occur is called porosity, due to the entrapment of bubbles of gas in the casting as the metal solidifies. The presence or absence of porosity is thought to be a function of pouring temperature of the aluminum. One batch of raw aluminum is available and the pistons are cast in 8 different dies. The pouring temperature is set at one of 4 levels 750, 775, 800, 825 and at each level, 3 pistons are cast in the 8 dies available. The presence (1) or absence (0) of porosity is recorded for each piston and the data are given below: Temperature 750 775 800 825 Total 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 13 11 10 8 In Figure 2.8, the scatter plot of the proportion of pistons with porosity versus temperature shows that there is a general decrease in porosity as temperature increases. 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 101 0.6 0.55 0.5 0.45 proportion of defects 0.4 0.35 0.3 0.25 740 750 760 770 780 790 Temperature 800 810 820 830 Figure 2.8: Proportion of Defects vs Temperature A model for these data is Yij ∼ BIN(1, pi ), i = 1, . . . , 4, j = 1, . . . , 24 independently where i indicates the level of pouring temperature, j the replication. We would like to fit a curve, a function of the pouring temperature, to the probabilities pi and the most common function used for this purpose is the logistic function, ez /(1 + ez ). This function is bounded between 0 and 1 and so can be used to model probabilities. We may choose the exponent z to depend on the explanatory variates resulting in: pi = pi (α, β) = eα+β(xi −x̄) . 1 + eα+β(xi −x̄) In this expression, xi is the pouring temperature at level i, x̄ = 787.5 is the average pouring temperature, and α, β are two unknown parameters. Note also that µ ¶ pi = α + β(xi − x̄). logit(pi ) = log 1 − pi The likelihood function is L(α, β) = 4 Q 24 Q i=1 j=1 P (Yij = yij ; α, β) = 4 Q 24 Q i=1 j=1 y pi ij (1 − pi )(1−yij ) 102 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION and the log likelihood is l (α, β) = 24 4 P P i=1 j=1 Note that [yij log (pi ) + (1 − yij ) log (1 − pi )] . ∂pi = pi (1 − pi ) ∂α and ∂pi = (xi − x̄) pi (1 − pi ) ∂β so that ∂l(α, β) ∂α ∙ ¸ 24 ∂l 24 4 P 4 P P P ∂pi 1 − yij yij · − pi (1 − pi ) = = ∂α 1 − pi i=1 j=1 ∂pi i=1 j=1 pi = 24 4 P P i=1 j=1 = 4 P i=1 where [yij (1 − pi ) − (1 − yij ) pi ] = 24 4 P P i=1 j=1 (yij − pi ) (yi. − 24pi ) yi. = 24 P yij . j=1 Similarly 4 P ∂l(α, β) (xi − x̄)(yi. − 24pi ). = ∂β i=1 The score function is ⎡ ⎢ ⎢ S(α, β) = ⎢ 4 ⎣ P ∂ 2 l (α, β) ∂α2 ∂ 2 l (α, β) ∂β 2 and ∂ 2 l (α, β) ∂α∂β (yi. − 24pi ) i=1 (xi − x̄)(yi. − 24pi ) i=1 Since 4 P ⎤ ⎥ ⎥ ⎥. ⎦ 4 ∂p 4 P P i pi (1 − pi ) , = 24 i=1 ∂α i=1 4 4 P P ∂pi 2 = 24 (xi − x̄) (xi − x̄) pi (1 − pi ) , = 24 ∂β i=1 i=1 4 ∂p 4 P P i (xi − x̄) pi (1 − pi ) = 24 = 24 i=1 ∂β i=1 = 24 2.9. 
ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 103 the information matrix and the Fisher information matrix are equal and given by ⎡ 4 4 P P 24 pi (1 − pi ) 24 (xi − x̄)pi (1 − pi ) ⎢ i=1 i=1 ⎢ I (α, β) = J (α, β) = ⎢ 4 4 ⎣ P P (xi − x̄)2 pi (1 − pi ) 24 (xi − x̄)pi (1 − pi ) 24 i=1 i=1 To find the M.L. estimators of α and β we must solve ∂l(α, β) ∂l(α, β) =0= ∂α ∂β simultaneously which must be done numerically using a method such as Newton’s method. Initial estimates of α and β can be obtained by drawing a line through the points in Figure 2.8, choosing two points on the line and then solving for α and β. For example, suppose we require that the line pass through the points (775, 11/24) and (825, 8/24). We obtain −0.167 = logit(11/24) = α + β (775 − 787.5) = α + β(−12.5) −0.693 = logit(8/24) = α + β (825 − 787.5) = α + β(37.5), and these result in initial estimates: α(0) = −0.298, β (0) = −0.0105. Now ¸ ³ ´ ∙ 23.01 −27.22 J α(0) , β (0) = −27.22 17748.30 and ´ ³ S α(0) , β (0) = [0.9533428, − 9.521179]T and the first iteration of Newton’s method gives ∙ (1) ¸ ∙ (0) ¸ h ³ ´i−1 ³ ´ ∙ −0.2571332 ¸ α α (0) (0) (0) (0) , β S α , β = + J α = . −0.01097377 β (1) β (0) Repeating this process does not substantially change these estimates, so we have the M.L. estimates: α̂ = −0.2571831896 β̂ = −0.01097623887. The Fisher information matrix evaluated at the M.L. estimate is " # 23.09153 −24.63342 J(α̂, β̂) = . −24.63342 17783.63646 ⎤ ⎥ ⎥ ⎥. ⎦ 104 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION The inverse of this matrix gives an estimate of the asymptotic variance/covariance matrix of the estimators: " # 0.0433700024 0.0000600749759 −1 . [J(α̂, β̂)] = 0.0000600749759 0.00005631468312 0.65 0.6 0.55 0.5 proportion of defects 0.45 0.4 0.35 740 750 760 770 780 790 800 Temperature 810 820 830 840 Figure 2.9: Fitted Model for Proportion of Defects as a Function of Temperature A plot of h i exp α̂ + β̂(x − x̄) h i p̂(x) = 1 + exp α̂ + β̂(x − x̄) is shown in Figure 2.9. Note that the curve is very close to a straight line over the range of x. The 1%, 5% and 10% likelihood regions for (α, β) are shown in Figure 2.10. Note that these likelihood regions are very elliptical in shape. This follows since the (1, 2) entry in the estimated variance/covariance matrix [J(α̂, β̂)]−1 is very close to zero which implies that the estimators α̂ and β̂ are not highly correlated. This allows us to make inferences more easily about β alone. Plausible values for β can be determined from the likelihood regions in 2.10. A model with no effect due to pouring temperature corresponds to β = 0. The likelihood regions indicate that the value, β = 0, is a very plausible value in light of the data for all plausible value of α. 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 105 0.02 1% 0.01 5% 0 10% beta -0.01 (-.26,-.011) -0.02 -0.03 -0.04 -1 -0.5 0 0.5 alpha Figure 2.10: Likelihood regions for (α, β). The probability of a defect when the pouring temperature is x = 750 is equal to τ = τ (α, β) = eα+β(750−x̄) eα+β(−37.5) 1 = = −α−β(−37.5) . α+β(750−x̄) α+β(−37.5) 1+e 1+e e +1 By the invariance property of M.L. estimators the M.L. estimator of τ is τ̂ = τ (α̂, β̂) = eα̂+β̂(−37.5) 1+ eα̂+β̂(−37.5) = 1 e−α̂−β̂(−37.5) +1 and the M.L. estimate is 1 1 = 0.5385. = 0.2571831896+0.01097623887(−37.5) τ̂ = − α̂− β̂(−37.5) e +1 e +1 To construct a approximate C.I. for τ we need an estimate of V ar (τ̂ ; α, β). 
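A short numerical sketch of this calculation is given below; it simply reuses the M.L. estimates and the entries of [J(α̂, β̂)]⁻¹ quoted above and applies the delta method to the linear predictor at x = 750, reproducing the interval obtained in the hand calculation that follows.

import numpy as np

# Sketch: approximate 95% C.I. for tau = p(750) via the delta method,
# using the M.L. estimates and [J(alpha_hat, beta_hat)]^{-1} quoted above.
alpha_hat, beta_hat = -0.2571831896, -0.01097623887
Jinv = np.array([[0.0433700024,    0.0000600749759],
                 [0.0000600749759, 0.00005631468312]])

# linear predictor at x = 750: eta = alpha + beta*(750 - 787.5) = alpha - 37.5*beta
d = np.array([1.0, -37.5])                  # gradient of eta with respect to (alpha, beta)
eta_hat = alpha_hat - 37.5 * beta_hat
v_hat = float(d @ Jinv @ d)                 # estimated Var(eta_hat), about 0.118057

eta_lo = eta_hat - 1.96 * np.sqrt(v_hat)
eta_hi = eta_hat + 1.96 * np.sqrt(v_hat)

# tau = 1/(1 + exp(-eta)) is increasing in eta, so transform the two endpoints
print(1 / (1 + np.exp(-eta_lo)), 1 / (1 + np.exp(-eta_hi)))   # about [0.373, 0.696]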
Now h i V ar −α̂ − β̂(−37.5); α, β = (−1)2 V ar (α̂; α, β) + (37.5)2 V ar(β̂; α, β) + 2 (−1) (37.5)Cov(α̂, β̂; α, β) and using [J(α̂, β̂)]−1 we estimate this variance by v̂ = 0.0433700024 + (37.5)2 (0.00005631468312) − 2(37.5) (0.0000600749759) = 0.118057. 106 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION An approximate 95% C.I. for −α − β(−37.5) is √ √ [−α̂ − β̂(−37.5) − 1.96 v̂, − α̂ − β̂(−37.5) + 1.96 v̂] h i √ √ = −0.154426 − 1.96 0.118057, − 0.154426 + 1.96 0.118057 = [−0.827870, 0.519019] . An approximate 95% C.I. for τ = τ (α, β) = 1 e−α−β(−37.5) + 1 is ∙ ¸ 1 1 , = [0.373082, 0.695904] . exp (0.519019) + 1 exp (−0.827870) + 1 The near linearity of the fitted function as indicated in Figure 2.9 seems to imply that we need not use the logistic function treated in this example, but that a straight line could have been fit to these data with similar results over the range of temperatures observed. Indeed, a simple linear regression would provide nearly the same fit. However, if values of pi near 0 or 1 had been observed, e.g. for temperatures well above or well below those used here, the non-linearity of the logistic function would have been important and provided some advantage over simple linear regression. 2.9.12 Problem - The Challenger Data On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend. The data are given in Table 1. (See Dalal, Fowlkes and Hoadley (1989), JASA, 84, 945-957.) 2.9. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - MULTIPARAMETER 107 Table 1 Date Temperature 4/12/81 11/12/81 3/22/82 6/27/82 1/11/82 4/4/83 6/18/83 8/30/83 11/28/83 2/3/84 4/6/84 8/30/84 10/5/84 11/8/84 1/24/85 4/12/85 4/29/85 6/17/85 7/29/85 8/27/85 10/3/85 10/30/85 11/26/85 1/12/86 1/28/86 66 70 69 80 68 67 72 73 70 57 63 70 78 67 53 67 75 70 81 76 79 75 76 58 31 Number of Damage Incidents 0 1 0 Not available 0 0 0 0 0 1 1 1 0 0 3 0 0 0 0 0 0 2 0 1 Challenger Accident (a) Let p(t; α, β) = P (at least one damage incident for a flight at temperature t) eα+βt = . 1 + eα+βt Using M.L. estimation fit the model Yi ∼ BIN(1, p (ti ; α, β)), i = 1, . . . , 23 to the data available from the flights prior to the Challenger accident. You may ignore the flight for which information on damage incidents is not available. (b) Plot 10% and 50% likelihood regions for α and β. 108 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION (c) Find an approximate 95% C.I. for β. How plausible is the value β = 0? (d) Find an approximate 95% C.I. for p(t) if t = 31, the temperature on the day of the disaster. Comment. 2.10 Nuisance Parameters and M.L. Estimation Suppose X1 , . . . , Xn is a random sample from the distribution with probability (density) function f (x; θ). 
Suppose also that θ = (λ, φ) where λ is a vector of parameters of interest and φ is a vector of nuisance parameters. The profile likelihood is one modification of the likelihood which allows us to look at estimation methods for λ in the presence of the nuisance parameter φ. 2.10.1 Definition Suppose θ = (λ, φ) with likelihood function L(λ, φ). Let φ̂(λ) be the M.L. estimator of φ for a fixed value of λ. Then the profile likelihood for λ is given by L(λ, φ̂(λ)). The M.L. estimator of λ based on the profile likelihood is, of course, the same estimator obtained by maximizing the joint likelihood L(λ, φ) simultaneously over λ and φ. If the profile likelihood is used to construct likelihood regions for λ, care must be taken since the imprecision in the estimation of the nuisance paramter φ is not taken into account. Profile likelihood is one example of a group of modifications of the likelihood known as pseudo-likelihoods which are based on a derived likelihood for a subset of parmeters. Marginal likelihood, conditional likelihood and partial likelihood are also included in this class. Suppose that θ = (λ, φ) and the data X, or some function of the data, can be partitioned into U and V . Suppose also that f (u, v; θ) = f (u; λ) · f (v|u; θ). If the conditional distribution of V given U does depends only on φ then estimation of λ can be based on f (u; λ), the marginal likelihood for λ. If f (v|u; θ) depends on both λ and φ then the marginal likelihood may still be used for estimation of λ if, in ignoring the conditional distribution, there is little information lost. If there is a factorization of the form f (u, v; θ) = f (u|v; λ) · f (v; θ) 2.11. PROBLEMS WITH M.L. ESTIMATORS 109 then estimation of λ can be based on f (u|v; λ) the conditional likelihood for λ. 2.10.2 Problem Suppose X1 , . . . , Xn is a random sample from a N(μ, σ 2 ) distribution and that σ is the parameter of interest while μ is a nuisance parameter. Find the profile likelihood of σ. Let U = S 2 and V = X̄. Find f (u; σ), the marginal likelihood of σ and f (u|v; σ), the conditional likelihood of σ. Compare the three likelihoods. 2.11 Problems with M.L. Estimators 2.11.1 Example This is an example to indicate that in the presence of a large number of nuisance parameters, it is possible for a M.L. estimator to be inconsistent. Suppose we are interested in the effect of environment on the performance of identical twins in some test, where these twins were separated at birth and raised in different environments. If the vector (Xi , Yi ) denotes the scores of the i’th pair of twins, we might assume (Xi , Yi ) are both independent N(μi , σ 2 ) random variables. We wish to estimate the parameter σ 2 based on a sample of n twins. Show that the M.L. estimator of σ 2 is n P 1 σ̂ 2 = 4n (Xi − Yi )2 and this is a biased and inconsistent estimator of i=1 σ 2 . Show, however, that a simple modification results in an unbiased and consistent estimator. 2.11.2 Example Recall that Theorem 2.7.2 states that under some conditions a root of the likelihood equation exists which is consistent as the sample size approaches infinity. One might wonder why the theorem did not simply make the same assertion for the value of the parameter providing the global maximum of the likelihood function. The answer is that while the consistent root of the likelihood equation often corresponds to the global maximum of the likelihood function, there is no guarantee of this without some additional conditions. 
This somewhat unusual example shows circumstances under which the consistent root of the likelihood equation is not the global maximizer of the likelihood function. Suppose Xi , i = 1, . . . , n are independent 110 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION observations from the mixture density of the form 2 2 2 ² 1−² f (x; θ) = √ e−(x−μ) /2σ + √ e−(x−μ) /2 2πσ 2π where θ = (μ, σ 2 ) with both parameters unknown. Notice that the likelihood function L(μ, σ) → ∞ for μ = xj , σ → 0 for any j = 1, . . . , n. This means that the globally maximizing σ is σ = 0, which lies on the boundary of the parameter space. However, there is a local maximum of the likelihood function at some σ̂ > 0 which provides a consistent estimator of the parameter. 2.11.3 Unidentifiability and Singular Information Matrices Suppose we observe two independent random variables Y1 , Y2 having normal distributions with the same variance σ2 and means θ1 + θ2 , θ2 + θ3 respectively. In this case, although the means depend on the parameter θ = (θ1 , θ2 , θ3 ), the value of this vector parameter is unidentifiable in the sense that, for some pairs of distinct parameter values, the probability density function of the observations are identical. For example the parameter θ = (1, 0, 1) leads to exactly the same joint distribution of Y1 , Y2 as does the parameter θ = (0, 1, 0). In this case, we we might consider only the two parameters (φ1 , φ2 ) = (θ1 + θ2 , θ2 + θ3 ) and anything derivable from this pair estimable, while parameters such as θ2 that cannot be obtained as functions of φ1 , φ2 are consequently unidentifiable. The solution to the original identifiability problem is the reparametrization to the new parameter (φ1 , φ2 ) in this case, and in general, unidentifiability usually means one should seek a new, more parsimonious parametrization. In the above example, compute the Fisher information matrix for the parameter θ = (θ1 , θ2 , θ3 ). Notice that the Fisher information matrix is singular. This means that if you were to attempt to compute the asymptotic variance of the M.L. estimator of θ by inverting the Fisher information matrix, the inversion would be impossible. Attempting to invert a singular matrix is like attempting to invert the number zero. It results in one or more components that you can consider to be infinite. Arguing intuitively, the asymptotic variance of the M.L. estimator of some of the parameters is infinite. This is an indication that asymptotically, at least, some of the parameters may not be identifiable. When parameters are unidentifiable, the Fisher information matrix is generally singular. However, when J(θ) is singular for all values of θ, this may or may not mean parameters are unidentifiable for finite sample sizes, but it does usually mean one should 2.12. HISTORICAL NOTES 111 take a careful look at the parameters with a possible view to adopting another parametrization. 2.11.4 U.M.V.U.E.’s and M.L. Estimators: A Comparison Should we use U.M.V.U.E.’s or M.L. estimators? There is no general consensus among statisticians. 1. If we are estimating the expectation of a natural sufficient statistic Ti (X) in a regular exponential family both M.L. and unbiasedness considerations lead to the use of Ti as an estimator. 2. When sample sizes are large U.M.V.U.E.’s and M.L. estimators are essentially the same. In that case use is governed by ease of computation. Unfortunately how large “large” needs to be is usually unknown. Some studies have been carried out comparing the behaviour of U.M.V.U.E.’s and M.L. 
estimators for various small fixed sample sizes. The results are, as might be expected, inconclusive. 3. M.L. estimators exist “more frequently” and when they do they are usually easier to compute than U.M.V.U.E.’s. This is essentially because of the appealing invariance property of M.L.E.’s. 4. Simple examples are known for which M.L. estimators behave badly even for large samples (see Examples 2.10.1 and 2.10.2 above). 5. U.M.V.U.E.’s and M.L. estimators are not necessarily robust. They are sensitive to model misspecification. In Chapter 3 we examine other approaches to estimation. 2.12 Historical Notes The concept of sufficiency is due to Fisher (1920), who in his fundamental paper of 1922 also introduced the term and stated the factorization criterion. The criterion was rediscovered by Neyman (1935) and was proved for general dominated families by Halmos and Savage (1949). The theory of minimal sufficiency was initiated by Lehmann and Scheffé (1950) and Dynkin (1951). For further generalizations, see Bahadur (1954) and Landers and Rogge (1972). One-parameter exponential families as the only (regular) families of distributions for which there exists a one-dimensional sufficient statistic were 112 CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION also introduced by Fisher (1934). His result was generalized to more than one dimension by Darmois (1935), Koopman (1936) and Pitman (1936). A more recent discussion of this theorem with references to the literature is given, for example, by Hipp (1974). A comprehensive treatment of exponential families is provided by Barndorff-Nielsen (1978). The concept of unbiasedness as “lack of systematic error” in the estimator was introduced by Gauss (1821) in his work on the theory of least squares. The first UMVU estimators were obtained by Aitken and Silverstone (1942) in the situation in which the information inequality yields the same result. UMVU estimators as unique unbiased functions of a suitable sufficient statistic were derived in special cases by Halmos (1946) and Kolmogorov (1950), and were pointed out as a general fact by Rao (1947). The concept of completeness was defined, its implications for unbiased estimation developed, and the Lehmann-Scheffé Theorem obtained in Lehmann and Scheffé (1950, 1955, 1956). Basu’s theorem is due to Basu (1955, 1958). The origins of the concept of maximum likelihood go back to the work of Lambert, Daniel Bernoulli, and Lagrange in the second half of the 18th century, and of Gauss and Laplace at the beginning of the 19th. [For details and references, see Edwards (1974).] The modern history begins with Edgeworth (1908-09) and Fisher (1922, 1925), whose contributions are discussed by Savage (1976) and Pratt (1976). The amount of information that a data set contains about a parameter was introduced by Edgeworth (1908, 1909) and was developed systematically by Fisher (1922 and later papers). The first version of the information inequality appears to have been given by Fréchet (1943), Rao (1945), and Cramér (1946). The designation “information inequality” which replaced the earlier “Cramér-Rao inequality” was proposed by Savage (1954). Fisher’s work on maximum likelihood was followed by a euphoric belief in the universal consistence and asymptotic efficiency of maximum likelihood estimators, at least in the i.i.d. case. The true situation was sorted out gradually. 
Landmarks are Wald (1949), who provided fairly general conditions for consistency, Cramér (1946), who defined the “regular” case in which the likelihood equation has a consistent asymptotically efficient root, the counterexamples of Bahadur (1958) and Hodges (Le Cam, 1953), and Le Cam’s resulting theorem on superefficiency (1935). Chapter 3 Other Methods of Estimation 3.1 Best Linear Unbiased Estimators The problem of finding best unbiased estimators is considerably simpler if we limit the class in which we search. If we permit any function of the data, then we usually require the heavy machinery of complete sufficiency to produce U.M.V.U.E.’s. However, the situation is much simpler if we suggest some initial random variables and then require that our estimator be a linear combination of these. Suppose, for example we have random variables Y1 , Y2 , Y3 with E(Y1 ) = α + θ, E(Y2 ) = α − θ, E(Y3 ) = θ where θ is the parameter of interest and α is another parameter. What linear combinations of the Yi ’s provide an unbiased estimator of θ and among these possible linear combinations which one has the smallest possible variance? To answer these questions, we need to know the covariances Cov(Yi , Yj ) (at least up to some scalar multiple). Suppose Cov(Yi , Yj ) = 0, i 6= j and V ar(Yj ) = σ 2 . Let Y = (Y1 , Y2 , Y3 )T and β = (α, θ)T . The model can be written as a general linear model as Y = Xβ + ² where ⎡ ⎤ 1 1 X = ⎣ 1 −1 ⎦ , 0 1 ε = (ε1 , ε2 , ε3 )T , and the εi ’s are uncorrelated random variables with E(²i ) = 0 and V ar(εi ) = σ2 . Then the linear combination of the compo113 114 CHAPTER 3. OTHER METHODS OF ESTIMATION nents of Y that has the smallest variance among all unbiased estimators of β is given by the usual regression formula β̃ = (α̃, θ̃)T = (X t X)−1 X T Y and θ̃ = 13 (Y1 − Y2 + Y3 ) provides the best estimator of θ in the sense of smallest variance. In other words, the linear combination of the components of Y which has smallest variance among all unbiased estimators of aT β is aT β̃ where aT = (0, 1). This result follows from the following theorem. 3.1.1 Gauss-Markov Theorem Suppose Y = (Y1 , . . . , Yn )T is a vector of random variables such that Y = Xβ + ε where X is a n × k (design) matrix of known constants having rank k, β = (β1 , . . . , βk )T is a vector of unknown parameters and ε = (ε1 , . . . , εn )T is a vector of random variables such that E (εi ) = 0, i = 1, . . . , n and V ar (ε) = σ2 B where B is a known non-singular matrix and σ 2 is a possibly unknown scalar parameter. Let θ = aT β, where a is a known k × 1 vector. The unbiased estimator of θ having smallest variance among all unbiased estimators that are linear combinations of the components of Y is θ̃ = aT (X T B −1 X)−1 X T B −1 Y. Note that this result does not depend on any assumed normality of the components of Y but only on the first and second moment behaviour, that is, the mean and the covariances. The special case when B is the identity matrix is the least squares estimator. 3.1.2 Problem Show that if the conditions of the Gauss-Markov Theorem hold and the εi ’s are assumed to be normally distributed then the U.M.V.U.E. of β is given by β̃ = (X T B −1 X)−1 X T B −1 Y (see Problems 1.5.11 and 1.7.25). Use this result to prove the Gauss-Markov Theorem in the case in which the εi ’s are not assumed to be normally distributed. 3.2. EQUIVARIANT ESTIMATORS 3.1.3 115 Example Suppose T1 , . . . , Tn are independent unbiased estimators of θ with known variances V ar(Ti ) = σi2 , i = 1, . . . , n. 
Find the best linear combination of these estimators, that is, the one that results in an unbiased estimator of θ having the minimum variance among all linear unbiased estimators. 3.1.4 Problem Suppose Yij , i = 1, 2; j = 1, . . . , n are independent random variables with E(Yij ) = μ + αi and V ar(Yij ) = σ2 where α1 + α2 = 0. (a) Find the best linear unbiased estimator of α1 . (b) Under what additional assumptions is this estimator the U.M.V.U.E.? Justify your answer. 3.1.5 Problem Suppose Yij , i = 1, 2; j = 1, . . . , ni are independent random variables with E (Yij ) = αi + βi (xij − x̄i ) , V ar (Yij ) = σ 2 and x̄i = ni 1 P xij . ni j=1 (a) Find the best linear unbiased estimators of α1 , α2 , β1 and β2 . (b) Under what additional assumptions are these estimators the U.M.V.U.E.’s? Justify your answer. 3.1.6 Problem Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution. Find the linear combination of the random variables (Xi − X̄)2 , i = 1, . . . , n which minimizes the M.S.E. for estimating σ 2 . Compare this estimator with the M.L. estimator and the U.M.V.U.E. of σ2 . 3.2 3.2.1 Equivariant Estimators Definition A model {f (x; θ) ; θ ∈ <} such that f (x; θ) = f0 (x − θ) with f0 known is called a location invariant family and θ is called a location parameter. (See 1.2.3.) In many examples the location of the origin is arbitrary. For example if we record temperatures in degrees celcius, the 0 point has been more or 116 CHAPTER 3. OTHER METHODS OF ESTIMATION less arbitrarily chosen and we might wish that our inference methods do not depend on the choice of origin. This can be ensured by requiring that the estimator when it is applied to shifted data, is shifted by the same amount. 3.2.2 Definition The estimator θ̃(X1 , . . . , Xn ) is location equivariant if θ̃(x1 + a, . . . , xn + a) = θ̃(x1 , . . . , xn ) + a for all values of (x1 , . . . , xn ) and real constants a. 3.2.3 Example Suppose X1 , . . . , Xn is a random sample from a N(θ, 1) distribution. Show that the U.M.V.U.E. of θ is a location equivariant estimator. Of course, location equivariant estimators do not make much sense for estimating variances; they are naturally connected to estimating the location parameter in a location invariant family. We call a given estimator, θ̃(X), minimum risk equivariant (M.R.E.) if, among all location equivariant estimators, it has the smallest M.S.E.. It is not difficult to show that a M.R.E. estimator must be unbiased (Problem 3.2.8). Remarkably, best estimators in the class of location equivariant estimators are known, due to the following theorem of Pitman. 3.2.4 Theorem Suppose X1 , . . . , Xn is a random sample from a location invariant family {f (x; θ) = f0 (x − θ), θ ∈ <} , with known density f0 . Then among all location equivariant estimators, the one with smallest M.S.E. is the Pitman location equivariant estimator given by θ̃(X1 , ..., Xn ) = R∞ −∞ R∞ u n Q i=1 n Q −∞ i=1 3.2.5 f0 (Xi − u) du . (3.2) f0 (Xi − u) du Example Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Show that the Pitman estimator of θ is the U.M.V.U.E. of θ. 3.2. EQUIVARIANT ESTIMATORS 3.2.6 117 Problem Prove that the M.R.E. estimator is unbiased. 3.2.7 Problem Let (X1 , X2 ) be a random sample from the distribution with probability density function 1 1 f (x; θ) = −6(x − θ − )(x − θ + ), 2 2 θ− 1 1 <x<θ+ . 2 2 Show that the Pitman estimator of θ is θ̃ (X1 , X2 ) = (X1 + X2 )/2. 3.2.8 Problem Let X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution. 
Find the Pitman estimator of θ and compare it to the M.L. estimator of θ and the U.M.V.U.E. of θ. 3.2.9 Problem Let X1 , . . . , Xn be a random sample from the UNIF(θ − 1/2, θ + 1/2) distribution. Find the Pitman estimator of θ. Show that the M.L. estimator is not unique in this case. 3.2.10 Problem Suppose X1 , . . . , Xn is a random sample from a location invariant family {f (x; θ) = f0 (x − θ), θ ∈ <}. Show that if the M.L. estimator is unique then it is a location equivariant estimator. Since the M.R.E. estimator is an unbiased estimator, it follows that if there is a U.M.V.U.E. in a given problem and if that U.M.V.U.E. is location equivariant then the M.R.E. estimator and the U.M.V.U.E. must be identical. M.R.E. estimators are primarily used when no U.M.V.U.E. exists. For example, the Pitman estimator of the location parameter for a Cauchy distribution performs very well by comparison with any other estimator, including the M.L. estimator. 3.2.11 Definition A model {f (x; θ) ; θ > 0} such that f (x; θ) = 1θ f1 ( xθ ) with f1 known is called a scale invariant family and θ is called a scale parameter (See 1.2.3). 118 3.2.12 CHAPTER 3. OTHER METHODS OF ESTIMATION Definition An estimator θ̃k = θ̃k (X1 , . . . , Xn ) is scale equivariant if θ̃k (cx1 , . . . , cxn ) = ck θ̃k (x1 , . . . , xn ) for all values of (x1 , . . . , xn ) and c > 0. 3.2.13 Theorem Suppose X1 , . . . , Xn is aª random sample from a scale invariant family © f (x; θ) = θ1 f1 ( xθ ), θ > 0 , with known density f1 . The Pitman scale equivariant estimator of θk which minimizes ⎡à !2 ⎤ k k − θ θ̃ E⎣ ; θ⎦ θk (the scaled M.S.E.) is given by R∞ 0 θ̃k = θ̃k (X1 , . . . , Xn ) = R∞ 0 for all k for which the integrals exist. 3.2.14 un+k−1 un+2k−1 n Q i=1 n Q f1 (uXi )du f1 (uXi )du i=1 Problem (a) Show that the EXP(θ) density is a scale invariant family. (b) Show that the U.M.V.U.E. of θ based on a random sample X1 , . . . , Xn is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of θ. How does the M.L. estimator of θ compare with these estimators? (c) Find the Pitman scale equivariant estimator of θ−1 . 3.2.15 Problem (a) Show that the N(0, σ 2 ) density is a scale invariant family. (b) Show that the U.M.V.U.E. of σ2 based on a random sample X1 , . . . , Xn is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of σ2 . How does the M.L. estimator of σ 2 compare with these estimators? (c) Find the Pitman scale equivariant estimator of σ and compare it to the M.L. estimator of σ and the U.M.V.U.E. of σ. 3.3. ESTIMATING EQUATIONS 3.2.16 119 Problem (a) Show that the UNIF(0, θ) density is a scale invariant family. (b) Show that the U.M.V.U.E. of θ based on a random sample X1 , . . . , Xn is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of θ. How does the M.L. estimator of θ compare with these estimators? (c) Find the Pitman scale equivariant estimator of θ2 . 3.3 Estimating Equations To find the M.L. estimator, we usually solve the likelihood equation n X ∂ logf (Xi ; θ) = 0. ∂θ i=1 (3.3) Note that the function on the left hand side is a function of both the observations and the parameter. Such a function is called an estimating function. Most sensible estimators, like the M.L. estimator, can be described easily through an estimating function. 
For example, if we know V ar(Xi ) = θ for independent identically distributed Xi , then we can use the estimating function n P (Xi − X̄)2 − (n − 1)θ Ψ(θ; X) = i=1 to estimate the parameter θ, without any other knowledge of the distribution, its density, mean etc. The estimating function is set equal to 0 and solved for θ. The above estimating function is an unbiased estimating function in the sense that E[Ψ(θ; X); θ] = 0, θ ∈ Ω. (3.4) This allows us to conclude that the function is at least centered appropriately for the estimation of the parameter θ. Now suppose that Ψ is an unbiased estimating function corresponding to a large sample. Often Ψ can be written as the sum of independent components, for example Ψ(θ; X) = n P ψ(θ; Xi ). i=1 Now suppose θ̂ is a root of the estimating equation Ψ(θ; X) = 0. (3.5) 120 CHAPTER 3. OTHER METHODS OF ESTIMATION Then for θ sufficiently close to θ̂, Ψ(θ; X) = Ψ(θ; X) − Ψ(θ̂; X) ≈ (θ − θ̂) ∂ Ψ(θ; X). ∂θ Now using the Central Limit Theorem, assuming that θ is the true value of the parameter and provided ψ is a sum as in (3.5), the left hand side is approximately normal with mean 0 and variance equal to V ar[Ψ(θ; X)]. ∂ Ψ(θ; X) is also a sum of similar derivatives of the individual The term ∂θ ψ(θ; Xi ). If a law of large numbers applies to these terms, £then when divided ¤ ∂ by n this sum will be asymptotically equivalent to n1 E ∂θ Ψ(θ, X); θ . It follows that the root θ̂ will have an approximate normal distribution with mean θ and variance V ar [Ψ(θ; X); θ] ¤ª2 . © £∂ E ∂θ Ψ(θ; X); θ By analogy with the relation between asymptotic variance of the M.L. estimator and the Fisher information, we call the reciprocal of the above asymptotic variance formula the Godambe information of the estimating function. This information measure is © £∂ ¤ª2 E ∂θ Ψ(θ; X); θ J(Ψ; θ) = . (3.1) V ar [Ψ(θ; X); θ] Godambe(1960) proved the following result. 3.3.1 Theorem Among all unbiased estimating functions satisfying the usual regularity conditions (see 2.3.1), an estimating function which maximizes the Godambe information (3.1) is of the form c(θ)S(θ; X) where c(θ) is non-random. 3.3.2 Problem Prove Theorem 3.3.1. 3.3.3 Example Suppose X = (X1 , . . . , Xn ) is a random sample from a distribution with E(log Xi ; θ) = eθ and V ar(log Xi ; θ) = e2θ , Consider the estimating function Ψ(θ; X) = n P (log Xi − eθ ). i=1 i = 1, . . . , n. 3.3. ESTIMATING EQUATIONS 121 (a) Show that Ψ(θ; X) is an unbiased estimating function. (b) Find the estimator θ̂ which satisfies Ψ(θ̂; X) = 0. (c) Construct an approximate 95% C.I. for θ. 3.3.4 Problem Suppose X1 , . . . , Xn is a random sample from the Bernoulli(θ) distribution. Suppose also that (²i , . . . , ²n ) are independent N(0, σ 2 ) random variables independent of the Xi ’s. Define Yi = θXi + ²i , i = 1, . . . , n. We observe only the values (Xi , Yi ), i = 1, . . . , n. The parameter θ is unknown and the ²i ’s are unobserved. Define the estimating function Ψ[θ; (X, Y )] = n P (Yi − θXi ). i=1 (a) Show that this is an unbiased estimating function for θ. (b) Find the estimator θ̂ which satisfies Ψ[θ̂; (X, Y )] = 0. Is θ̂ an unbiased estimator of θ? (c) Construct an approximate 95% C.I. for θ. 3.3.5 Problem Consider random variables X1 , . . . , Xn generated according to a first order autoregressive process Xi = θXi−1 + Zi , where X0 is a constant and Z1 , . . . , Zn are independent N(0, σ 2 ) random variables. (a) Show that Xi = θi X0 + i P θi−j Zj . 
j=1 (b) Show that Ψ(θ; X) = n−1 P i=0 Xi (Xi+1 − θXi ) is an unbiased estimating function for θ. (c) Find the estimator θ̂ which satisfies Ψ(θ̂; X) = 0. Compare the asymptotic variance of this estimator with the Cramér-Rao lower bound. 122 3.3.6 CHAPTER 3. OTHER METHODS OF ESTIMATION Problem Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Since V ar (Xi ; θ) = θ, we could use the sample variance S 2 rather than the sample mean X̄ as an estimator of θ, that is, we could use the estimating function Ψ(θ; X) = n 1 P (Xi − X̄)2 − θ. n − 1 i=1 Find the asymptotic variance of the resulting estimator and hence the asymptotic efficiency of this estimation method. (Hint: The sample variance S 2 has asymptotic variance V ar(S 2 ) ≈ n1 {E[(Xi − μ)4 ] − σ 4 } where E(Xi ) = μ and V ar(Xi ) = σ2 .) 3.3.7 Problem Suppose Y1 , . . . , Yn are independent random variable such that E (Yi ) = μi and V ar (Yi ) = v (μi ) , i = 1, . . . , n where v is a known function. Suppose also that h (μi ) = xTi β where h is a known function, xi = (xi1 , . . . , xik )T , i = 1, . . . , n are vectors of known constants and β is a k × 1 vector of unknown parameters. The quasi-likelihood estimating equation for estimating β is µ ¶T ∂μ [V (μ)]−1 (Y − μ) = 0 ∂β where Y = (Y1 , . . . , Yn )T , μ = (μ1 , . . . , μn )T , V (μ) = diag {v (μ1 ) , . . . , v (μn )}, ∂μi and ∂μ ∂β is the n × k matrix whose (i, j) entry is ∂βj . (a) Show that this is an unbiased estimating equation for all β. (b) Show that if Yi v POI(μi ) and log (μi ) = xTi β then the quasi-likelihood estimating equation is also the likelihood equation for estimating β. 3.3.8 Problem Suppose X1 , . . . , Xn be a random sample from a distribution with p.f./p.d.f. f (x; θ). It is well known that the estimator X̄ is sensitive to extreme observations while the sample median is not. Attempts have been made to find robust estimators which are not unduly affected by outliers. One such class proposed by Huber (1981) is the class of M-estimators. These estimators are defined as the estimators which minimize n P i=1 ρ (Xi ; θ) (3.2) 3.3. ESTIMATING EQUATIONS 123 with respect to θ for some function ρ. The “M” stands for “maximum likelihood type” since for ρ (x; θ) = −logf (x; θ) the estimator is the M.L. estimator. Since minimizing (3.2) is usually equivalent to solving Ψ(θ; X) = n ∂ P ρ (Xi ; θ) = 0, i=1 ∂θ M-estimators may also be defined in terms of estimating functions. Three examples of ρ functions are: 2 (1) ρ (x; θ) = (x − θ) /2 (2) ρ (x; θ) = |x − θ| (3) ½ 2 (x − θ) /2 ρ (x; θ) = c |x − θ| − c2 /2 if |x − θ| ≤ c if |x − θ| > c (a) For all three ρ functions, find a p.d.f. f (x; θ) such that ρ (x; θ) = −logf (x; θ) + log k. (b) For the f (x; θ) obtained for (3), graph the p.d.f. for θ = 0, c = 1 and θ = 0, c = 1.5. On the same graph plot the N(0, 1) and t(2) p.d.f.’s. What do you notice? (c) The following data, ordered from smallest to largest, were randomly generated from a t(2) distribution: −1.75 −0.78 −0.18 0.09 1.16 −1.24 −0.61 −0.18 0.14 1.34 −1.15 −0.59 −0.17 0.25 1.61 −1.09 −0.58 −0.15 0.36 1.95 −1.02 −0.44 −0.08 0.43 2.25 −0.93 −0.35 −0.04 0.93 2.37 −0.92 −0.26 0.02 1.03 2.59 −0.91 −0.20 0.03 1.13 4.82 Construct a frequency histogram for these data. (d) Find the M-estimate for θ for each of the ρ functions given above. Use c = 1 for (3). (e) Compare these estimates with the M.L. estimate obtained by assuming that X1 , . . . , Xn is a random sample from the distribution with p.d.f. 
" #−3/2 Γ (3/2) (x − θ)2 f (x; θ) = √ 1+ , t∈< 2 2π which is a t(2) p.d.f. if θ = 0. 124 3.4 CHAPTER 3. OTHER METHODS OF ESTIMATION Bayes Estimation There are two major schools of thought on the way in which statistical inference is conducted, the frequentist and the Bayesian school. Typically, these schools differ slightly on the actual methodology and the conclusions that are reached, but more substantially on the philosophy underlying the treatment of parameters. So far we have considered a parameter as an unknown constant underlying or indexing the probability density function of the data. It is only the data, and statistics derived from the data that are random. However, the Bayesian begins by asserting that the parameter θ is simply the realization of some larger random experiment. The parameter is assumed to have been generated according to some distribution, the prior distribution π and the observations then obtained from the corresponding probability density function f (x; θ) interpreted as the conditional probability density of the data given the value of θ. The prior distribution π(θ) quantifies information about θ prior to any further data being gathered. Sometimes π(θ) can be constructed on the basis of past data. For example, if a quality inspection program has been running for some time, the distribution of the number of defectives in past batches can be used as the prior distribution for the number of defectives in a future batch. The prior can also be chosen to incorporate subjective information based on an expert’s experience and personal judgement. The purpose of the data is then to adjust this distribution for θ in the light of the data, to result in the posterior distribution for the parameter. Any conclusions about the plausible value of the parameter are to be drawn from the posterior distribution. For a frequentist, statements like P (1 < θ < 2) are meaningless; all randomness lies in the data and the parameter is an unknown constant. Hence the effort taken in earlier courses in carefully assuring students that if an observed 95% confidence interval for the parameter is 1 ≤ θ ≤ 2 this does not imply P (1 ≤ θ ≤ 2) = 0.95. However, a Bayesian will happily quote such a probability, usually conditionally on some observations, for example, P (1 ≤ θ ≤ 2|x) = 0.95. 3.4.1 Posterior Distribution Suppose the parameter is initially chosen at random according to the prior distribution π(θ) and then given the value of the parameter the observations are independent identically distributed, each with conditional probability (density) function f (x; θ). Then the posterior distribution of the parameter 3.4. BAYES ESTIMATION 125 is the conditional distribution of θ given the data x = (x1 , . . . , xn ) π(θ|x) = cπ(θ) n Q f (xi ; θ) = cπ(θ)L(θ; x) i=1 where −1 c = Z∞ π(θ)L(θ; x)dθ −∞ is independent of θ and L(θ; x) is the likelihood function. Since Bayesian inference is based on the posterior distribution it depends only on the data through the likelihood function. 3.4.2 Example Suppose a coin is tossed n times with probability of heads θ. It is known from “previous experience with coins” that the prior probability of heads is not always identically 1/2 but follows a BETA(10, 10) distribution. If the n tosses result in x heads, find the posterior density function for θ. 3.4.3 Definition - Conjugate Prior Distribution If a prior distribution has the property that the posterior distribution is in the same family of distributions as the prior then the prior is called a conjugate prior. 
3.4.4 Conjugate Prior Distribution for the Exponential Family Suppose X1 , . . . , Xn is a random sample from the exponential family f (x; θ) = C(θ) exp[q(θ)T (x)]h(x) and θ is assumed to have the prior distribution with parameters a, b given by π(θ) = π (θ; a, b) = k[C(θ)]a exp[bq(θ)] (3.8) where k−1 = ∞ R [C (θ)]a exp [bq (θ)] dθ. −∞ Then the posterior distribution of θ, given the data x = (x1 , . . . , xn ) is easily seen to be given by π(θ|x) = c[C(θ)]a+n exp{q(θ)[b + n P i=1 T (xi )]} 126 CHAPTER 3. OTHER METHODS OF ESTIMATION where −1 c = ∞ R a+n [C (θ)] −∞ ½ ∙ ¸¾ n P exp q (θ) b + T (xi ) dθ. i=1 Notice that the posterior distribution is in the same family of distributions as (3.8) and thus π(θ) is a conjugate prior. The value of the parameters of the posterior distribution reflect the choice of parameters in the prior. 3.4.5 Example Find the conjugate prior for θ for a random sample X1 , . . . , Xn from the distribution with probability density function f (x; θ) = θxθ−1 , 0 < x < 1, θ > 0. Show that the posterior distribution of θ given the data x = (x1 , . . . , xn ) is in the same family of distributions as the prior. 3.4.6 Problem Find the conjugate prior distribution of the parameter θ for a random sample X1 , . . . , Xn from each of the following distributions. In each case, find the posterior distribution of θ given the data x = (x1 , . . . , xn ). (a) POI(θ) (b) N(θ, σ 2 ), σ2 known (c) N(μ, θ), μ known (d) GAM(α, θ), α known. 3.4.7 Problem Suppose X1 , . . . , Xn is a random sample from the UNIF(0, θ) distribution. Show that the prior distribution θ ∼ PAR(a, b) is a conjugate prior. 3.4.8 Problem Suppose X1 , . . . , Xn is a random sample from the N(μ, 1θ ) where μ and θ are unknown. Show that the joint prior given by ½ ¾ ¤ θ£ π(μ, θ) = cθb1 /2 exp − a1 + b2 (a2 − μ)2 , θ > 0, μ ∈ < 2 where a1 , a2 , b1 and b2 are parameters, is a conjugate prior. This prior is called a normal-gamma prior. Why? Hint: π(μ, θ) = π1 (μ|θ)π2 (θ). 3.4. BAYES ESTIMATION 3.4.9 127 Empirical Bayes In the conjugate prior given in (3.8) there are two parameters, a and b, which must be specified. In an empirical Bayes approach the parameters of the prior are assumed to be unknown constants and are estimated from the data. Suppose the prior distribution for θ is π(θ; λ) where λ is an unknown parameter (possibly a vector) and X1 , . . . , Xn is a random sample from f (x; θ). The marginal distribution of X1 , . . . , Xn is given by f (x1 , . . . , xn ; λ) = Z∞ π(θ; λ) n Q f (xi ; θ)dθ i=1 −∞ which depends on the data X1 , . . . , Xn and λ and therefore can be used to estimate λ. 3.4.10 Example In Example 3.4.5 find the marginal distribution of (X1 , . . . , Xn ) and indicate how it could be used to estimate the parameters a and b of the conjugate prior. 3.4.11 Problem Suppose X1 , . . . , Xn is a random sample from the POI(θ) distribution. If a conjugate prior is assumed for θ find the marginal distribution of (X1 , . . . , Xn ) and indicate how it could be used to estimate the parameters a and b of the conjugate prior. 3.4.12 Problem An insurance company insures n drivers. For each driver the company knows Xi the number of accidents driver i has had in the past three years. To estimate each driver’s accident rate λi the company assumes (λ1 , . . . , λn ) is a random sample from the GAM(a, b) distribution where a and b are unknown constants and Xi ∼ POI(λi ), i = 1, . . . , n independently. Find the marginal distribution of (X1 , . . . , Xn ) and indicate how you would find the M.L. 
estimates of a and b using this distribution. Another approach to estimating a and b would be to use the estimators X̄ ã = , b̃ b̃ = n P i=1 n P i=1 Xi2 Xi ¡ ¢ − 1 + X̄ . 128 CHAPTER 3. OTHER METHODS OF ESTIMATION Show that these are consistent estimators of a and b respectively. 3.4.13 Noninformative Prior Distributions The choice of the prior distribution to be the conjugate prior is often motivated by mathematical convenience. However, a Bayesian would also like the prior to accurately represent the preliminary uncertainty about the plausible values of the parameter, and this may not be easily translated into one of the conjugate prior distributions. Noninformative priors are the usual way of representing ignorance about θ and they are frequently used in practice. It can be argued that they are more objective than a subjectively assessed prior distribution since the latter may contain personal bias as well as background knowledge. Also, in some applications the amount of prior information available is far less than the information contained in the data. In this case there seems little point in worrying about a precise specification of the prior distribution. If in Example 3.4.2 there were no reason to prefer one value of θ over any other then a noninformative or ‘flat’ prior disribution for θ that could be used is the UNIF(0, 1) distribution. For estimating the mean θ of a N(θ, 1) distribution the possible values for θ are (−∞, ∞). If we take the prior distribution to be uniform on (−∞, ∞), that is, π(θ) = c, −∞<θ <∞ then this is not a proper density since Z∞ −∞ π(θ)dθ = c Z∞ −∞ dθ = ∞. Prior densities of this type are called improper priors. In this case we could consider a sequence of prior distributions such as the UNIF(−M, M ) which approximates this prior as M → ∞. Suppose we call such a prior density function πM . Then the posterior distribution of the parameter is given by πM (θ|x) = cπM (θ)L(θ; x) and it is easy to see that as M → ∞, this approaches a constant multiple of the likelihood function L(θ). This provides another interpretation of the likelihood function. We can consider it as proportional to the posterior distribution of the parameter when using a uniform improper prior on the whole real line. The language is somewhat sloppy here since, as we have seen, the uniform distribution on the whole real line really makes sense only through taking limits for uniform distributions on finite intervals. 3.4. BAYES ESTIMATION 129 In the case of a scale parameter, which must take positive values such as the normal variance, it is usual to express ignorance of the prior distribution of the parameter by assuming that the logarithm of the parameter is uniform on the real line. 3.4.14 Example Let X1 , . . . , Xn be a random sample from a N(μ, σ 2 ) distribution and assume that the prior distributions of μ, and log(σ 2 ) are independent improper uniform distributions. Show that the marginal√posterior distribution of μ given the data x = (x1 , . . . , xn ) is such that n (μ − x̄) /s has a t distribution with n − 1 degrees of freedom. Show also that the marginal posterior distribution of σ 2 given the data x is such that 1/σ 2 has a GAM ³ ´ n−1 2 distribution. 2 , (n−1)s2 3.4.15 Jeffreys’ Prior A problem with nonformative prior distributions is whether the prior distribution should be uniform for θ or some function of θ, such as θ2 or log(θ). It is common to use a uniform prior for τ = h(θ) where h(θ) is the function of θ whose Fisher information does not depend on the unknown parameter. 
This idea is due to Jeffreys and leads to a prior distribution which is proportional to [J(θ)]1/2 . Such a prior is referred to as a Jeffreys’ prior. 3.4.16 Problem h i ∂2 Suppose {f (x; θ) ; θ ∈ Ω} is a regular model and J(θ) = E − ∂θ 2 logf (X; θ) ; θ is the Fisher information. Consider the reparameterization Zθ p τ = h(θ) = J(u)du , (3.9) θ0 where θ0 is a constant. Show that the Fisher information for the reparameterization is equal to one (see Problem 2.3.4). (Note: Since the asymptotic variance of the M.L. estimator τ̂n is equal to 1/n, which does not depend on τ, (3.9) is called a variance stabilizing transformation.) 3.4.17 Example Find the Jeffreys’ prior for θ if X has a BIN(n, θ) distribution. What function of θ has a uniform prior distribution? 130 CHAPTER 3. OTHER METHODS OF ESTIMATION 3.4.18 Problem Find the Jeffreys’ prior distribution for a random sample X1 , . . . , Xn from each of the following distributions. In each case, find the posterior distribution of the parameter θ given the data x = (x1 , . . . , xn ). What function of θ has a uniform prior distribution? (a) POI(θ) (b) N(θ, σ 2 ), σ2 known (c) N(μ, θ), μ known (d) GAM(α, θ), α known. 3.4.19 Problem If θ is a vector then the Jeffreys’ prior is taken to be proportional to the square root of the determinant of the Fisher information matrix. Suppose (X1 , X2 ) v MULT(n, θ1 , θ2 ). Find the Jeffreys’ prior for (θ1 , θ2 ). Find the posterior distribution of (θ1 , θ2 ) given (x1 , x2 ). Find the marginal posterior distribution of θ1 given (x1 , x2 ) and the marginal posterior distribution of θ2 given (x1 , x2 ). Hint: Show Z1 1−x Z Γ (a) Γ (b) Γ (c) xa−1 y b−1 (1 − x − y)c−1 dydx = , Γ (a + b + c) 0 a, b, c > 0 0 3.4.20 Problem Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n, X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is a vector of unknown parameters. Let ¢−1 T ¡ X y and s2e = (y − X β̂)T (y − X β̂)/ (n − k) β̂ = X T X where y = (y1 , . . . , yn )T are the observed data. (a) Find the joint posterior distribution of β and σ 2 given the data if the joint (improper) prior distribution of β and σ2 is assumed to be proportional to σ −2 . (b) Show that the marginal³posterior distribution of σ 2 given the data y is ´ n−k 2 −2 such that σ has a GAM distribution. 2 , (n−k)s2 e (c) Find the marginal posterior distribution of β given the data y. 3.4. BAYES ESTIMATION 131 (d) Show that the posterior distribution of β given σ 2 and the ³ conditional ¡ T ¢−1 ´ 2 data y is MVN β̂, σ X X . ¡ ¢ (e) Show that (β − β̂)T X T (Xβ − β̂)/ ks2e has a Fk,n−k distribution. 3.4.21 Bayes Point Estimators One method of obtaining a point estimator of θ is to use the posterior distribution and a suitable loss function. 3.4.22 Theorem Suppose X has p.f./p.d.f. f (x; θ) and θ has prior distribution π(θ). The Bayes estimator of θ for squared error loss with respect to the prior π(θ) given X is Z∞ θ̃ = θ̃(X) = θπ(θ|X)dθ = E(θ|X) −∞ which is the mean of the posterior distribution π(θ|X). This estimator minimizes ⎡ ⎤ Z∞ Z∞ ³ ´2 ⎣ θ̃ − θ f (x; θ) dx⎦ π(θ)dθ. E[(θ̃ − θ)2 ] = −∞ 3.4.23 −∞ Example Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function f (x; θ) = θxθ−1 , 0 < x < 1, θ > 0. Using a conjugate prior for θ find the Bayes estimator of θ for squared error loss. What is the Bayes estimator of τ = 1/θ for squared error loss? 
Do Bayes estimators satisfy an invariance property? 3.4.24 Example In Example 3.4.14 find the Bayes estimators of μ and σ2 for squared error loss based on their respective marginal posterior distributions. 3.4.25 Problem Prove Theorem 3.4.22. Hint: Show that E[(X − c)2 ] is minimized by the value c = E(X). 132 CHAPTER 3. OTHER METHODS OF ESTIMATION 3.4.26 Problem For each case in Problems 3.4.7 and 3.4.18 find the Bayes estimator of θ for squared error loss and compare the estimator with the U.M.V.U.E. as n → ∞. 3.4.27 Problem In Problem 3.4.12 find the Bayes estimators of (λ1 , . . . , λn ) for squared error loss. 3.4.28 Problem Let X1 , . . . , Xn be a random sample from a GAM(α, β) distribution where α is known. Find the posterior distribution of λ = 1/β given X1 , . . . , Xn if the improper prior distribution of λ is assumed to be proportional to 1/λ. Find the Bayes estimator of β for squared error loss and compare it to the U.M.V.U.E. of β. 3.4.29 Problem In Problem 3.4.19 find the Bayes estimators of θ1 and θ2 for squared error loss using their respective marginal posterior distributions. Compare these to the U.M.V.U.E.’s. 3.4.30 Problem In Problem 3.4.20 find the Bayes estimators of β and σ 2 for squared error loss using their respective marginal posterior distributions. Compare these to the U.M.V.U.E.’s. 3.4.31 Problem Show that the Bayes estimator of θ for absolute error loss with respect to the prior π(θ) given data X is the median of the posterior distribution. Hint: d dy b(y) b(y) Z Z ∂g(x, y) 0 0 g(x, y)dx = g(b (y) , y) · b (y) − g(a (y) , y) · a (y) + dx. ∂y a(y) a(y) 3.4. BAYES ESTIMATION 3.4.32 133 Bayesian Intervals There remains, after many decades, a controversy between Bayesians and frequentists about which approach to estimation is more suitable to the real world. The Bayesian has advantages at least in the ease of interpretation of the results. For example, a Bayesian can use the posterior distribution given the data x = (x1 , . . . , xn ) to determine points a = a(x), b = b(x) such that Za π(θ|x)dθ = 0.95 a and then give a Bayesian confidence interval (a, b) for the parameter. If this results in [2, 5] the Bayesian will state that (in a Bayesian model, subject to the validity of the prior) the conditional probability given the data that the parameter falls in the interval [2, 5] is 0.95. No such probability can be ascribed to a confidence interval for frequentists, who see no randomness in the parameter to which this probability statement is supposed to apply. Bayesian confidence regions are also called credible regions in order to make clear the distinction between the interpretation of Bayesian confidence regions and frequentist confidence regions. Suppose π(θ|x) is the posterior distribution of θ given the data x and A is a subset of Ω. If Z P (θ ∈ A|x) = π(θ|x)dθ = p A then A is called a p credible region for θ. A credible region can be formed in many ways. If (a, b) is an interval such that P (θ < a|x) = 1−p = P (θ > b|x) 2 then [a, b] is called a p equal-tailed credible region. A highest posterior density (H.P.D.) credible region is constructed in a manner similar to likelihood regions. The p H.P.D. credible region is given by {θ : π(θ|x) ≥ c} where c is chosen such that Z π(θ|x)dθ. p= {θ:π(θ|x)≥c} A H.P.D. credible region is optimal in the sense that it is the shortest interval for a given value of p. 134 3.4.33 CHAPTER 3. OTHER METHODS OF ESTIMATION Example Suppose X1 , . . . 
, Xn is a random sample from the N(μ, σ 2 ) distribution where σ 2 is known and μ has the conjugate prior. Find the p = 0.95 H.P.D. credible region for μ. Compare this to a 95% C.I. for μ. 3.4.34 Problem Suppose (X1 , . . . , X10 ) is a random sample from the GAM(2, 1θ ) distribu10 P tion. If θ has the Jeffreys’ prior and xi = 4 then find and compare i=1 (a) the 0.95 equal-tailed credible region for θ (b) the 0.95 H.P.D. credible region for θ (c) the 95% exact equal tail C.I. for θ. Finally, although statisticians argue whether the Bayesian or the frequentist approach is better, there is really no one right way to do statistics. Some problems are best solved using a frequentist approach while others are best solved using a Bayesian approach. There are certainly instances in which a Bayesian approach seems sensible— particularly for example if the parameter is a measurement on a possibly randomly chosen individual (say the expected total annual claim of a client of an insurance company). Chapter 4 Hypothesis Tests 4.1 Introduction Statistical estimation usually concerns the estimation of the value of a parameter when we know little about it except perhaps that it lies in a given parameter space, and when we have no a priori reason to prefer one value of the parameter over another. If, however, we are asked to decide between two possible values of the parameter, the consequences of one choice of the parameter value may be quite different from another choice. For example, if we believe Yi is normally distributed with mean α + βxi and variance σ 2 for some explanatory variables xi , then the value β = 0 means there is no relation between Yi and xi . We need neither collect the values of xi nor build a model around them. Thus the two choices β = 0 and β = 1 are quite different in their consequences. This is often the case. An excellent example of the complete asymmetry in the costs attached to these two choices is Problem 4.4.17. A hypothesis test involves a (usually natural) separation of the parameter space Ω into two disjoint regions, Ω0 and Ω − Ω0 . By the difference between the two sets we mean those points in the former (Ω) that are not in the latter (Ω0 ). This partition of the parameter space corresponds to testing the null hypothesis that the parameter is in Ω0 . We usually write this hypothesis in the form H0 : θ ∈ Ω0 . The null hypothesis is usually the status quo. For example in a test of a new drug, the null hypothesis would be that the drug had no effect, or no more of an effect than drugs already on the market. The null hypothesis 135 136 CHAPTER 4. HYPOTHESIS TESTS is only rejected if there is reasonably strong evidence against it. The alternative hypothesis determines what departures from the null hypothesis are anticipated. In this case, it might be simply H1 : θ ∈ Ω − Ω0 . Since we do not know the true value of the parameter, we must base our decision on the observed value of X. The hypothesis test is conducted by determining a partition of the sample space into two sets, the critical or rejection region R and its complement R̄ which is called the acceptance region. We declare that H0 is false (in favour of the alternative) if we observe x ∈ R. When a test of hypothesis is conducted there are two types of possible errors: reject the null hypothesis H0 when it is true (Type I error) and accept H0 when it is false (Type II error). 4.1.1 Definition The power function of a test with rejection region R is the function β(θ) = P (X ∈ R; θ) = P (reject H0 ; θ), θ ∈ Ω. 
Note that
$$
\beta(\theta) = 1 - P(\text{accept } H_0;\theta) = 1 - P(X \in \bar{R};\theta) = 1 - P(\text{type II error};\theta) \quad \text{for } \theta \in \Omega - \Omega_0.
$$
In order to minimize the two types of possible errors in a test of hypothesis, it is obviously desirable that the power function β(θ) be small for θ ∈ Ω0 but large for θ ∈ Ω − Ω0. The probability of rejecting H0 when it is true determines one important measure of the performance of a test, the level of significance.

4.1.2 Definition

A test has level of significance α if β(θ) ≤ α for all θ ∈ Ω0.

The level of significance is simply an upper bound on the probability of a type I error. There is no assurance that the upper bound is tight, that is, that equality is achieved somewhere. The lowest such upper bound is often called the size of the test.

4.1.3 Definition

The size of a test is equal to $\sup_{\theta \in \Omega_0} \beta(\theta)$.

4.1.4 Example

Suppose we toss a coin 100 times to determine if the coin is fair. Let X = number of heads observed. Then the model is X ∼ BIN(n, θ), θ ∈ Ω = {θ : 0 < θ < 1}. The null hypothesis is H0 : θ = 0.5 and Ω0 = {0.5}. This is an example of a simple hypothesis. A simple null hypothesis is one for which Ω0 contains a single point. The alternative hypothesis is H1 : θ ≠ 0.5 and Ω − Ω0 = {θ : 0 < θ < 1, θ ≠ 0.5}. The alternative hypothesis is not a simple hypothesis since Ω − Ω0 contains more than one point. It is an example of a composite hypothesis. The sample space is S = {x : x = 0, 1, . . . , 100}. Suppose we choose the rejection region to be
$$
R = \{x : |x - 50| \ge 10\} = \{x : x \le 40 \text{ or } x \ge 60\}.
$$
The test of hypothesis is conducted by rejecting the null hypothesis H0 in favour of the alternative H1 if x ∈ R. The acceptance region is R̄ = {x : 41 ≤ x ≤ 59}. The power function is
$$
\beta(\theta) = P(X \in R;\theta) = P(X \le 40 \cup X \ge 60;\theta) = 1 - \sum_{x=41}^{59}\binom{100}{x}\theta^{x}(1-\theta)^{100-x}.
$$
A graph of the power function is given below. For this example Ω0 = {0.5} consists of a single point and therefore
$$
P(\text{type I error}) = \text{size of test} = \beta(0.5) = P(X \in R;\theta = 0.5) = P(X \le 40 \cup X \ge 60;\theta = 0.5) = 1 - \sum_{x=41}^{59}\binom{100}{x}(0.5)^{x}(0.5)^{100-x} \approx 0.05689.
$$

Figure 4.1: Power Function for Binomial Example (β(θ) plotted against θ, 0.3 ≤ θ ≤ 0.7)

4.2 Uniformly Most Powerful Tests

Tests are often constructed by specifying the size of the test, which in turn determines the probability of the type I error, and then attempting to minimize the probability that the null hypothesis is accepted when it is false (type II error). Equivalently, we try to maximize the power function of the test for θ ∈ Ω − Ω0.

4.2.1 Definition

A test with power function β(θ) is a uniformly most powerful (U.M.P.) test of size α if, for all other tests of the same size α having power function β*(θ), we have β(θ) ≥ β*(θ) for all θ ∈ Ω − Ω0.

The word "uniformly" above refers to the fact that one function dominates another, that is, β(θ) ≥ β*(θ) uniformly for all θ ∈ Ω − Ω0. When the alternative Ω − Ω0 consists of a single point {θ1} then the construction of a best test is particularly easy. In this case, we may drop the word "uniformly" and refer to a "most powerful test". The construction of a best test, by this definition, is possible under rather special circumstances. First, we often require a simple null hypothesis. This is the case when Ω0 consists of a single point {θ0} and so we are testing the null hypothesis H0 : θ = θ0.
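Before turning to the Neyman-Pearson Lemma, here is a quick numerical check of Example 4.1.4. It is a minimal sketch, assuming Python with the scipy library is available; it evaluates the power function of the rejection region R = {x : x ≤ 40 or x ≥ 60} and confirms that the size is approximately 0.05689.

```python
from scipy import stats

n = 100  # number of tosses in Example 4.1.4

def power(theta):
    # beta(theta) = P(X <= 40 or X >= 60; theta) = 1 - P(41 <= X <= 59; theta)
    return 1.0 - (stats.binom.cdf(59, n, theta) - stats.binom.cdf(40, n, theta))

print("size = beta(0.5) =", round(power(0.5), 5))   # approximately 0.05689
for theta in (0.4, 0.45, 0.55, 0.6):
    print("beta(", theta, ") =", round(power(theta), 4))
```

Evaluating power(theta) over a grid of values between 0.3 and 0.7 reproduces the curve shown in Figure 4.1.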
4.2.2 Neyman-Pearson Lemma

Let X have probability (density) function f(x; θ), θ ∈ Ω. Consider testing a simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1. For a constant c, suppose the rejection region defined by
$$
R = \left\{x;\ \frac{f(x;\theta_1)}{f(x;\theta_0)} > c\right\}
$$
corresponds to a test of size α. Then the test with this rejection region is a most powerful test of size α for testing H0 : θ = θ0 against H1 : θ = θ1.

4.2.3 Proof

Consider another rejection region R1 with the same size. Then
$$
P(X \in R;\theta_0) = P(X \in R_1;\theta_0) = \alpha \quad \text{or} \quad \int_{R} f(x;\theta_0)\,dx = \int_{R_1} f(x;\theta_0)\,dx.
$$
Therefore
$$
\int_{R \cap R_1} f(x;\theta_0)\,dx + \int_{R \cap \bar{R}_1} f(x;\theta_0)\,dx = \int_{R \cap R_1} f(x;\theta_0)\,dx + \int_{\bar{R} \cap R_1} f(x;\theta_0)\,dx
$$
and
$$
\int_{R \cap \bar{R}_1} f(x;\theta_0)\,dx = \int_{\bar{R} \cap R_1} f(x;\theta_0)\,dx. \qquad (4.1)
$$
For x ∈ R ∩ R̄1, f(x; θ1)/f(x; θ0) > c, or f(x; θ1) > c f(x; θ0), and thus
$$
\int_{R \cap \bar{R}_1} f(x;\theta_1)\,dx \ge c\int_{R \cap \bar{R}_1} f(x;\theta_0)\,dx. \qquad (4.2)
$$
For x ∈ R̄ ∩ R1, f(x; θ1) ≤ c f(x; θ0), and thus
$$
-\int_{\bar{R} \cap R_1} f(x;\theta_1)\,dx \ge -c\int_{\bar{R} \cap R_1} f(x;\theta_0)\,dx. \qquad (4.3)
$$
Now
$$
\beta(\theta_1) = P(X \in R;\theta_1) = P(X \in R \cap R_1;\theta_1) + P(X \in R \cap \bar{R}_1;\theta_1) = \int_{R \cap R_1} f(x;\theta_1)\,dx + \int_{R \cap \bar{R}_1} f(x;\theta_1)\,dx
$$
and
$$
\beta_1(\theta_1) = P(X \in R_1;\theta_1) = \int_{R \cap R_1} f(x;\theta_1)\,dx + \int_{\bar{R} \cap R_1} f(x;\theta_1)\,dx.
$$
Therefore, using (4.1), (4.2), and (4.3), we have
$$
\beta(\theta_1) - \beta_1(\theta_1) = \int_{R \cap \bar{R}_1} f(x;\theta_1)\,dx - \int_{\bar{R} \cap R_1} f(x;\theta_1)\,dx \ge c\int_{R \cap \bar{R}_1} f(x;\theta_0)\,dx - c\int_{\bar{R} \cap R_1} f(x;\theta_0)\,dx = c\left[\int_{R \cap \bar{R}_1} f(x;\theta_0)\,dx - \int_{\bar{R} \cap R_1} f(x;\theta_0)\,dx\right] = 0
$$
and the test with rejection region R is therefore the most powerful. ∎

4.2.4 Example

Suppose X1, . . . , Xn are independent N(θ, 1) random variables. We consider only the parameter space Ω = [0, ∞). Suppose we wish to test the hypothesis H0 : θ = 0 against H1 : θ > 0.
(a) Choose an arbitrary θ1 > 0 and obtain the rejection region for the most powerful test of size 0.05 of H0 against H1 : θ = θ1.
(b) Does this test depend on the value of θ1 you chose? Can you conclude that it is uniformly most powerful?
(c) Graph the power function of the test.
(d) Find the rejection region for the uniformly most powerful test of H0 : θ = 0 against H1 : θ < 0. Find and graph the power function of this test.

Figure 4.2: Power Functions for Examples 4.2.4 and 4.2.5: — β(θ), - - β1(θ), -· β2(θ) (plotted against θ)

4.2.5 Example

Let X1, . . . , Xn be a random sample from the N(θ, 1) distribution. Consider the rejection region {(x1, . . . , xn); |x̄| > 1.96/√n} for testing the hypothesis H0 : θ = 0 against H1 : θ ≠ 0. What is the size of this test? Graph the power function of this test. Is this test uniformly most powerful?

4.2.6 Problem - Sufficient Statistics and Hypothesis Tests

Suppose X has probability (density) function f(x; θ), θ ∈ Ω. Suppose also that T = T(X) is a minimal sufficient statistic for θ. Show that the rejection region of the most powerful test of H0 : θ = θ0 against H1 : θ = θ1 depends on the data X only through T.

4.2.7 Problem

Let X1, . . . , X5 be a random sample from the distribution with probability density function
$$
f(x;\theta) = \frac{\theta}{x^{\theta+1}}, \qquad x \ge 1,\ \theta > 0.
$$
(a) Find the rejection region for the most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ = θ1 where θ1 > 1. Note: log(Xi) ∼ EXP(1/θ).
(b) Explain why the rejection region in (a) is also the rejection region for the uniformly most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ > 1. Sketch the power function of this test.
(c) Find the uniformly most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ < 1. On the same graph as in (b) sketch the power function of this test. (d) Explain why there is no uniformly most powerful test of H0 against H1 : θ 6= 1. What reasonable test of size 0.05 might be used for testing H0 against H1 : θ 6= 1? On the same graph as in (b) sketch the power function of this test. 4.2.8 Problem Let X1 , . . . , X10 be a random sample from the GAM( 12 , θ) distribution. (a) Find the rejection region for the most powerful test of size 0.05 of H0 : θ = 2 against the alternative H1 : θ = θ1 where θ1 < 2. (b) Explain why the rejection region in (a) is also the rejection region for the uniformly most powerful test of size 0.05 of H0 : θ = 2 against the alternative H1 : θ < 2. Sketch the power function of this test. (c) Find the uniformly most powerful test of size 0.05 of H0 : θ = 2 against the alternative H1 : θ > 2. On the same graph as in (b) sketch the power function of this test. (d) Explain why there is no uniformly most powerful test of H0 against H1 : θ 6= 2. What reasonable test of size 0.05 might be used for testing H0 against H1 : θ 6= 2? On the same graph as in (b) sketch the power function of this test. 4.2. UNIFORMLY MOST POWERFUL TESTS 4.2.9 143 Problem Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Find the rejection region for the uniformly most powerful test of H0 : θ = 1 against the alternative H1 : θ > 1 of size 0.01. Sketch the power function of this test for n = 10. 4.2.10 Problem We anticipate collecting observations (X1 , . . . , Xn ) from a N(μ, σ 2 ) distribution in order to test the hypothesis H0 : μ = 0 against the alternative H1 : μ > 0 at level of significance 0.05. A preliminary investigation yields σ ≈ 2. How large a sample must we take in order to have power equal to 0.95 when μ = 1? 4.2.11 Relationship Betweeen Hypothesis Tests and Confidence Intervals There is a close relationship between hypothesis tests and confidence intervals as the following example illustrates. Suppose X1 , . . . , Xn is a random sample from the N(θ, 1) distribution and we wish to test the hypothesis √ H0 : θ = θ0 against H1 : θ 6= θ0 . The rejection region {x; |x̄ − θ0 | > 1.96/ n} is a size α = 0.05 rejection √ region which has a corresponding acceptance region {x; |x̄ − θ0 | ≤ 1.96/ n}. Note that the hypothesis H0 : θ = θ0 would √ not be rejected at the 0.05 level if |x̄ − θ0 | ≤ 1.96/ n or equivalently √ √ x̄ − 1.96/ n ≤ θ0 ≤ x̄ + 1.96/ n which is a 95% C.I. for θ. 4.2.12 Problem Let (X1 , . . . , X5 ) be a random sample from the GAM(2, θ) distribution. Show that ½ 5 ¾ 5 P P R = x; xi < 4.7955θ0 or xi > 17.085θ0 i=1 i=1 is a size 0.05 rejection region for testing H0 : θ = θ0 . Show how this rejection region may be used to construct a 95% C.I. for θ. 144 4.3 CHAPTER 4. HYPOTHESIS TESTS Locally Most Powerful Tests It is not always possible to construct a uniformly most powerful test. For this reason, and because alternative values of the parameter close to those under H0 are the hardest to differentiate from H0 itself, one may wish to develop a test that is best able to test the hypthesis H0 : θ = θ0 against alternatives very close to θ0 . Such a test is called locally most powerful. 4.3.1 Definition A test of H0 : θ = θ0 against H1 : θ > θ0 with power function β(θ) is locally most powerful if, for any other test having the same size and having power function β ∗ (θ), there exists an ² > 0 such that β(θ) ≥ β ∗ (θ) for all θ0 < θ < θ0 + ². 
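The need for a local criterion can be seen numerically: near θ0 every size-α test has power close to α, so competing tests are nearly indistinguishable there, which is exactly the situation Definition 4.3.1 addresses. The sketch below, assuming Python with the scipy library is available and using an illustrative sample size n = 25 (not taken from the notes), compares the power of the one-sided size-0.05 test of Example 4.2.4 with that of the two-sided test of Example 4.2.5 at alternatives close to and far from θ0 = 0.

```python
import numpy as np
from scipy.stats import norm

n = 25                 # illustrative sample size
rt_n = np.sqrt(n)

def power_one_sided(theta):
    # size-0.05 test of H0: theta = 0 vs H1: theta > 0 (Example 4.2.4)
    return 1.0 - norm.cdf(1.645 - rt_n * theta)

def power_two_sided(theta):
    # size-0.05 test of H0: theta = 0 vs H1: theta != 0 (Example 4.2.5)
    return (1.0 - norm.cdf(1.96 - rt_n * theta)) + norm.cdf(-1.96 - rt_n * theta)

for theta in (0.0, 0.02, 0.05, 0.1, 0.3, 0.5):
    print(theta, round(power_one_sided(theta), 4), round(power_two_sided(theta), 4))
```

For θ just above 0 both powers are barely above 0.05, while for larger θ the one-sided test has the higher power against alternatives θ > 0.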
This definition asserts that there is a neighbourhood of the null hypothesis in which the test is most powerful. 4.3.2 Theorem Suppose {f (x; θ) ; θ ∈ Ω} is a regular statistical model with corresponding score function ∂ S(θ; x) = log f (x; θ) . ∂θ A locally most powerful test of H0 : θ = θ0 against H1 : θ > θ0 has rejection region R = {x; S(θ0 ; x) > c} , where c is a constant determined by P [S(θ0 ; X) > c; θ0 ] = size of test. Since this test is based on the score function, it is also called a score test. 4.3.3 Example Suppose X1 , . . . , Xn is a random sample from a N(θ, 1) distribution. Show that the locally most powerful test of H0 : θ = 0 against H1 : θ > 0 is also the uniformly most powerful test. 4.3. LOCALLY MOST POWERFUL TESTS 4.3.4 145 Problem Consider a single observation X from the LOG(1, θ) distribution. Find the rejection region for the locally most powerful test of H0 : θ = 0 against H1 : θ > 0. Is this test also uniformly most powerful? What is the power function of the test? Suppose X = (X1 , . . . , Xn ) is a random sample from a regular statistical model {f (x; θ) ; θ ∈ Ω} and the exact distribution of S(θ0 ; X) = n ∂ P log f (Xi ; θ0 ) i=1 ∂θ is difficult to obtain. Since, under H0 : θ = θ0 , S(θ0 ; X) p →D Z v N(0, 1) J (θ0 ) by the C.L.T., an approximate size α rejection region for testing H0 : θ = θ0 against H1 : θ > θ0 is given by ) ( S(θ0 ; x) ≥a x; p J (θ0 ) where P (Z ≥ a) = α and Z v N(0, 1). J(θ0 ) may be replaced by I(θ0 ; x). 4.3.5 Example Suppose X1 , . . . , Xn is a random sample from the CAU(1, θ) distribution. Find an approximate rejection region for a locally most powerful size 0.05 test of H0 : θ = θ0 against H1 : θ < θ0 . Hint: Show J(θ) = n/2. 4.3.6 Problem Suppose X1 , . . . , Xn is a random sample from the WEI(1, θ) distribution. Find an approximate rejection region for a locally most powerful size 0.01 test of H0 : θ = θ0 against H1 : θ > θ0 . Hint: Show that n π2 J(θ) = 2 (1 + + γ 2 − 2γ) θ 6 where Z∞ γ = − (log y)e−y dy ≈ 0.5772 0 is Euler’s constant. 146 4.4 CHAPTER 4. HYPOTHESIS TESTS Likelihood Ratio Tests Consider a test of the hypothesis H0 : θ ∈ Ω0 against H1 : θ ∈ Ω − Ω0 . We have seen that for prescribed θ0 ∈ Ω0 , θ1 ∈ Ω − Ω0 , the most powerful test of the simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1 is based on the likelihood ratio fθ1 (x)/fθ0 (x). By the NeymanPearson Lemma it has rejection region ½ ¾ f (x; θ1 ) R = x; >c f (x; θ0 ) where c is a constant determined by the size of the test. When either the null or the alternative hypothesis are composite (i.e. contain more than one point) and there is no uniformly most powerful test, it seems reasonable to use a test with rejection region R for some choice of θ1 , θ0 . The likelihood ratio test does this with θ1 replaced by θ̂, the M.L. estimator over all possible values of the parameter, and θ0 replaced by the M.L. estimator of the parameter when it is restricted to Ω0 . Thus, the likelihood ratio test of H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω − Ω0 has rejection region R = {x; Λ(x) > c} where Λ(x) = sup f (x; θ) sup L (θ; x) θ∈Ω θ∈Ω sup f (x; θ) θ∈Ω0 = sup L (θ; x) θ∈Ω0 and c is determined by the size of the test. In general, the distribution of the test statistic Λ(X) may be difficult to find. Fortunately, however, the asymptotic distribution is known under fairly general conditions. In a few cases, we can show that the likelihood ratio test is equivalent to the use of a statistic with known distribution. 
However, in many cases, we need to rely on the asymptotic chi-squared distribution of Theorem 4.4.7. 4.4.1 Example Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution where μ and σ 2 are unknown. Consider a test of H0 : μ = 0, 0 < σ 2 < ∞ against the alternative H1 : μ 6= 0, 0 < σ 2 < ∞. (a) Show that the likelihood ratio test of H0 against H1 has rejection region 4.4. LIKELIHOOD RATIO TESTS 147 R = {x; nx̄2 /s2 > c}. (b) Show under H0 that the statistic T = nX̄ 2 /S 2 has a F(1, n − 1) distribution and thus find a size 0.05 test for n = 20. (c) What rejection region would you use for testing H0 : μ = 0, 0 < σ 2 < ∞ against the one-sided alternative H1 : μ > 0, 0 < σ 2 < ∞? 4.4.2 Problem Suppose X ∼ GAM(2, β1 ) and Y ∼ GAM(2, β2 ) independently. (a) Show that the likelihood ratio statistic for testing the hypothesis H0 : β1 = β2 against the alternative H1 : β1 6= β2 is a function of the statistic T = X/(X + Y ). (b) Find the distribution of T under H0 . (b) Find the rejection region for a size 0.01 test. What rejection region would you use for testing H0 : β1 = β2 against the one-sided alternative H1 : β1 > β2 ? 4.4.3 Problem Let (X1 , . . . , Xn ) be a random sample from the N(μ, σ2 ) distribution and independently let (Y1 , . . . , Yn ) be a random sample from the N(θ, σ 2 ) distribution where σ 2 is known. (a) Show that the likelihood ratio statistic for testing the hypothesis H0 : μ = θ against the alternative H1 : μ 6= θ is a function of T = |X̄ − Ȳ |. (b) Find the rejection region for a size 0.05 test. Is this test U.M.P.? Why? 4.4.4 Problem Suppose X1 , . . . , Xn are independent EXP(λ) random variables and independently Y1 , . . . , Ym are independent EXP(μ) random variables. (a) Show that the likelihood ratio statistic for testing the hypothesis H0 : λ = μ against the alternative H1 : λ 6= μ is a function of T =∙ n P Xi i=1 n P i=1 Xi + m P i=1 Yi ¸. (b) Find the distribution of T under H0 . Explain clearly how you would find a size α = 0.05 rejection region. 148 CHAPTER 4. HYPOTHESIS TESTS (c) For n = 20 find the rejection region for the one-sided alternative H1 : λ > μ for a size 0.05 test. 4.4.5 Problem Suppose X1 , . . . , Xn is a random sample from the EXP(β, μ) distribution where β and μ are unknown. (a) Show that the likelihood ratio statistic for testing the hypothesis H0 : β = 1 against the alternative H1 : β 6= 1 is a function of the statistic T = n ¡ ¢ P Xi − X(1) . i=1 (b) Show that under H0 , 2T has a chi-squared distribution (see Problem 1.8.11). (c) For n = 12 find the rejection region for the one-sided alternative H1 : β > 1 for a size 0.05 test. 4.4.6 Problem Let X1 , . . . , Xn be a random sample from the distribution with p.d.f. f (x; α, β) = αxα−1 , βα 0 < x ≤ β. (a) Show that the likelihood ratio statistic for testing the hypothesis H0 : α = 1 against the alternative H1 : α 6= 1 is a function of the statistic T = n ¡ ¢ Q Xi /X(n) i=1 (b) Show that under H0 , −2 log T has a chi-squared distribution (see Problem 1.8.12). (c) For n = 14 find the rejection region for the one-sided alternative H1 : α > 1 for a size 0.05 test. 4.4.7 Problem Suppose Yi ∼ N(α+βxi , σ 2 ), i = 1, 2, . . . , n independently where x1 , . . . , xn are known constants and α, β and σ 2 are unknown parameters. 4.4. LIKELIHOOD RATIO TESTS 149 (a) Show that the likelihood ratio statistic for testing H0 : β = 0 against the alternative H1 : β 6= 0 is a function of β̂ 2 T = where Se2 = n P i=1 (xi − x̄) 2 Se2 n 1 P (Yi − α̂ − β̂xi )2 . 
n − 2 i=1 (b) What is the distribution of T under H0 ? 4.4.8 Theorem - Asymptotic Distribution of the Likelihood Ratio Statistic (Regular Model) Suppose X = (X1 , . . . , Xn ) is a random sample from a regular statistical model {f (x; θ) ; θ ∈ Ω} with Ω an open set in k−dimensional Euclidean space. Consider a subset of Ω defined by Ω0 = {θ(η); η ∈ open subset of q-dimensional Euclidean space }. Then the likelihood ratio statistic defined by n Q f (X; θ) sup L (θ; X) sup θ∈Ω i=1 θ∈Ω Λn (X)= = n Q sup L (θ; X) sup f (X; θ) θ∈Ω0 θ∈Ω0 i=1 is such that, under the hypothesis H0 : θ ∈ Ω0 , 2 log Λn (X) →D W v χ2 (k − q). Note: The number of degrees of freedom is the difference between the number of parameters that need to be estimated in the general model, and the number of parameters left to be estimated under the restrictions imposed by H0 . 4.4.9 Example Suppose X1 , . . . , Xn are independent POI(λ) random variables and independently Y1 , . . . , Yn are independent POI(μ) random variables. (a) Find the likelihood ratio test statistic for testing H0 : λ = μ against the alternative H1 : λ 6= μ. (b) Find the approximate rejection region for a size α = 0.05 test. Be sure to justify the approximation. (c) Find the rejection region for the one-sided alternative H1 : λ < μ for a size 0.05 test. 150 4.4.10 CHAPTER 4. HYPOTHESIS TESTS Problem Suppose (X1 , X2 ) ∼ MULT(n, θ1 , θ2 ). (a) Find the likelihood ratio statistic for testing H0 : θ1 = θ2 = θ3 against all alternatives. (b) Find the approximate rejection region for a size 0.05 test. Be sure to justify the approximation. 4.4.11 Problem Suppose (X1 , X2 ) ∼ MULT(n, θ1 , θ2 ). (a) Find the likelihood ratio statistic for testing H0 : θ1 = θ2 , θ2 = 2θ(1−θ) against all alternatives. (a) Find the approximate rejection region for a size 0.05 test. Be sure to justify the approximation. 4.4.12 Problem Suppose (X1 , Y1 ), . . . , (Xn , Yn ) is a random sample from the BVN(μ, Σ) distribution with (μ, Σ) unknown. (a) Find the likelihood ratio statistic for testing H0 : ρ = 0 against the alternative H1 : ρ 6= 0. (b) Find the approximate size 0.05 rejection region. Be sure to justify the approximation. 4.4.13 Problem Suppose in Problem 2.1.25 we wish to test the hypothesis that the data arise from the assumed model. Show that the likelihood ratio statistic is given by µ ¶ k P Fi Fi log Λ=2 E i i=1 where Ei = npi (θ̂) and θ̂ is the M.L. estimator of θ. What is the asymptotic distribution of Λ? Another test statistic which is commonly used is the Pearson goodness of fit statistic given by k (F − E )2 P i i E i i=1 which also has an approximate χ2 distribution. 4.4. LIKELIHOOD RATIO TESTS 4.4.14 151 Problem In Example 2.1.32 test the hypothesis that the data arise from the assumed model using the likelihood ratio statistic. Compare this with the answer that you obtain using the Pearson goodness of fit statistic. 4.4.15 Problem In Example 2.1.33 test the hypothesis that the data arise from the assumed model using the likelihood ratio statistic. Compare this with the answer that you obtain using the Pearson goodness of fit statistic. 4.4.16 Problem In Example 2.9.11 test the hypothesis that the data arise from the assumed model using the likelihood ratio statistic. Compare this with the answer that you obtain using the Pearson goodness of fit statistic. 4.4.17 Problem Suppose we have n independent repetitions of an experiment in which each outcome is classified according to whether event A occurred or not as well as whether event B occurred or not. 
The observed data can be arranged in a 2 × 2 contingency table as follows: A A Total B f11 f21 c1 B f12 f22 c2 Total r1 r2 n Find the likelihood ratio statistic for testing the hypothesis that the events A and B are independent, that is, H0 : P (A ∩ B) = P (A)P (B). 4.4.18 Problem Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n, X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is a vector of unknown parameters. Find the likelihood ratio statistic for testing the hypothesis H0 : βi = 0 against the alternative H1 : βi 6= 0 where βi is the ith element of β. 152 4.4.19 CHAPTER 4. HYPOTHESIS TESTS Signed Square-root Likelihood Ratio Statistic Suppose X = (X1 , . . . , Xn ) is a random sample from a regular statistical T model {f (x; θ) ; θ ∈ Ω} where θ = (θ1 , θ2 ) , θ1 is a scalar and Ω is an k open set in < . Suppose also that the null hypothesis is H0 : θ1 = θ10 . Let θ̂ = (θ̂1 , θ̂2 ) be the maximum likelihood of θ and let θ̃ = (θ10 , θ̃2 (θ10 )) where θ̃2 (θ10 ) is the maximum likelihood value of θ2 assuming θ1 = θ10 . Then by Theorem 4.4.8 2 log Λn (X) = 2l(θ̂; X) − 2l(θ̃; X) →D W v χ2 (1) under H0 or equivalently h i1/2 →D Z v N (0, 1) 2l(θ̂; X) − 2l(θ̃; X) under H0 . The signed square-root likelihood ratio statistic defined by h i1/2 sign(θ̂1 − θ10 ) 2l(θ̂; X) − 2l(θ̃; X) can be used to test one-sided alternatives such as H1 : θ1 > θ10 or H1 : θ1 < θ10 . For example if the alternative hypothesis were H1 : θ1 > θ10 then the rejection region for an approximate size 0.05 test would be given by ¾ ½ h i1/2 > 1.645 . x; sign(θ̂1 − θ10 ) 2l(θ̂; X) − 2l(θ̃; X) 4.4.20 Problem - The Challenger Data In Problem 2.8.9 test the hypothesis that β = 0. What would a sensible alternative be? Describe in detail the null and the alternative hypotheses that you have in mind and the relative costs of the two different kinds of errors. 4.4.21 Significance Tests and p-values We have seen that a test of hypothesis is a rule which allows us to decide whether to accept the null hypothesis H0 or to reject it in favour of the alternative hypothesis H1 based on the observed data. A test of significance can be used in situations in which H1 is difficult to specify. A (pure) test of significance is a procedure for measuring the strength of the evidence provided by the observed data against H0 . This method usually involves looking at the distribution of a test statistic or discrepancy measure T under H0 . The p-value or significance level for the test is the probability, 4.5. SCORE AND MAXIMUM LIKELIHOOD TESTS 153 computed under H0 , of observing a T value at least as extreme as the value observed. The smaller the observed p-value, the stronger the evidence against H0 . The difficulty with this approach is how to find a statistic with ‘good properties’. The likelihood ratio statistic provides a general test statistic which may be used. 4.5 Score and Maximum Likelihood Tests 4.5.1 Score or Rao Tests In Section 4.3 we saw that the locally most powerful test was a score test. Score tests can be viewed as a more general class of tests of H0 : θ = θ0 against H1 : θ ∈ Ω−{θ0 }. If the usual regularity conditions hold then under H0 : θ = θ0 we have S(θ0 ; X)[J(θ0 )]−1/2 →D Z v N(0, 1). and thus R(X; θ0 ) = [S(θ0 ; X)]2 [J(θ0 )]−1 →D Y v χ2 (1). For a vector θ = (θ1 , . . . , θk )T we have R(X; θ0 ) = [S(θ0 ; X)]T [J(θ0 )]−1 S(θ0 ; X) →D Y v χ2 (k). 
(4.4) The corresponding rejection region is R = {x; R(x; θ0 ) > c} where c is determined by the size of the test, that is, c satisfies P [R(X; θ0 ) > c; θ0 ] = α. An approximate value for c can be determined using P (Y > c) = α where Y v χ2 (k). The test based on R(X; θ0 ) is asymptotically equivalent to the likelihood ratio test. In (4.4) J(θ0 ) may be replaced by I(θ0 ) for an asymptotically equivalent test. Such test statistics are called score or Rao test statistics. 4.5.2 Maximum Likelihood or Wald Tests Suppose that θ̂ is the M.L. estimator of θ over all θ ∈ Ω and we wish to test H0 : θ = θ0 against H1 : θ ∈ Ω − {θ0 }. If the usual regularity conditions hold then under H0 : θ = θ0 W (X; θ0 ) = (θ̂ − θ0 )T J(θ0 )(θ̂ − θ0 ) →D Y v χ2 (k). (4.5) 154 CHAPTER 4. HYPOTHESIS TESTS The corresponding rejection region is R = {x; W (x; θ0 ) > c} where c is determined by the size of the test, that is, c satisfies P [W (X; θ0 ) > c; θ0 ] = α. An approximate value for c can be determined using P (Y > c) = α where Y v χ2 (k). The test based on W (X; θ0 ) is asymptotically equivalent to the likelihood ratio test. In (4.5) J(θ0 ) may also be replaced by J(θ̂), I(θ0 ) or I(θ̂) to obtain an asymptotically equivalent test statistic. Such statistics are called maximum likelihood or Wald test statistics. 4.5.3 Example Suppose X v POI (θ). Find the score test statistic (4.4) and the maximum likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 . 4.5.4 Problem Find the score test statistic (4.4) and the Wald test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 based on a random sample (X1 , . . . , Xn ) from each of the following distributions: (a) EXP(θ) (b) BIN(n, θ) (c) N(θ, σ 2 ), σ 2 known (d) EXP(θ, μ), μ known (e) GAM(α, θ), α known 4.5.5 Problem Let (X1 , . . . , Xn ) be a random sample from the PAR(1, θ) distribution. Find the score test statistic (4.4) and the maximum likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 . 4.5.6 Problem Suppose (X1 , . . . , Xn ) is a random sample from an exponential family model {f (x; θ) ; θ ∈ Ω}. Show that the score test statistic (4.4) and the maximum likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 are identical if the maximum likelihood estimator of θ is a linear function of the natural sufficient statistic. 4.6. BAYESIAN HYPOTHESIS TESTS 4.6 155 Bayesian Hypothesis Tests Suppose we have two simple hypotheses H0 : θ = θ0 and H1 : θ = θ1 . The prior probability that H0 is true is denoted by P (H0 ) and the prior probability that H1 is true is P (H1 ) = 1 − P (H0 ). P (H0 )/P (H1 ) are the prior odds. Suppose also that the data x have probability (density) function f (x; θ). The posterior probability that Hi is true is denoted by P (Hi |x), i = 0, 1. The Bayesian aim in hypothesis testing is to determine the posterior odds based on the data x given by P (H0 |x) P (H0 ) f (x; θ0 ) = × . P (H1 |x) P (H1 ) f (x; θ1 ) The ratio f (x; θ0 )/f (x; θ1 ) is called the Bayes factor. If P (H0 ) = P (H1 ) then the posterior odds are just a likelihood ratio. The Bayes factor measures how the data have changed the odds as to which hypothesis is true. If the posterior odds were equal to q then a Bayesian would conclude that H0 is q times more likely to be true than H1 . A Bayesian may also decide to accept H0 rather than H1 if q is suitably large. If we have two composite hypotheses H0 : θ ∈ Ω0 and H1 : θ ∈ Ω − Ω0 then a prior distribution for θ must be specified for each hypothesis. 
We denote these by π0 (θ|H0 ) and π1 (θ|H1 ). In this case the posterior odds are P (H0 |x) P (H0 ) = ·B P (H1 |x) P (H1 ) where B is the Bayes factor given by R f (x; θ) π0 (θ|H0 )dθ Ω0 . B= R f (x; θ) π1 (θ|H1 )dθ Ω−Ω0 For the hypotheses H0 : θ = θ0 and H1 : θ 6= θ0 the Bayes factor is B= R f (x; θ0 ) . f (x; θ) π1 (θ|H1 )dθ θ6=θ0 4.6.1 Problem Suppose (X1 , . . . , Xn ) is a random sample from a POI(θ) distribution and we wish to test H0 : θ = θ0 against H1 : θ 6= θ0 . Find the Bayes factor if under H1 the prior distribution for θ is the conjugate prior. 156 CHAPTER 4. HYPOTHESIS TESTS Chapter 5 Appendix 5.1 5.1.1 Inequalities and Useful Results Hölder’s Inequality Suppose X and Y are random variables and p and q are positive numbers satisfying 1 1 + = 1. p q Then 1/p |E (XY )| ≤ E (|XY |) ≤ [E (|X|p )] [E (|Y |q )] 1/q . Letting Y = 1 we have 1/p E (|X|) ≤ [E (|X|p )] 5.1.2 , p > 1. Covariance Inequality If X and Y are random variables with variances σ12 and σ22 respectively then [Cov (X, Y )]2 ≤ σ12 σ22 . 5.1.3 Chebyshev’s Inequality If X is a random variable with E(X) = μ and V ar(X) = σ 2 < ∞ then P (|X − μ| ≥ k) ≤ for any k > 0. 157 σ2 . k2 158 5.1.4 CHAPTER 5. APPENDIX Jensen’s Inequality If X is a random variable and g (x) is a convex function then E [g (X)] ≥ g [E (X)] . 5.1.5 Corollary If X is a non-degenerate random variable and g (x) is a strictly convex function. Then E [g (X)] > g [E (X)] . 5.1.6 Stirling’s Formula For large n Γ (n + 1) ≈ 5.1.7 √ 2πnn+1/2 e−n . Matrix Differentiation T T Suppose x = (x1 , . . . , xk ) , b = (b1 , . . . , bk ) and A is a k × k symmetric matrix. Then ∙ ¸T ∂ ¡ T ¢ ∂ ¡ T ¢ ∂ ¡ T ¢ =b x b ,..., x b x b = ∂x ∂x1 ∂xk and ∙ ¸T ∂ ¡ T ¢ ∂ ¡ T ¢ ∂ ¡ T ¢ = 2Ax. x Ax , . . . , x Ax x Ax = ∂x ∂x1 ∂xk 5.2. DISTRIBUTIONAL RESULTS 5.2 5.2.1 159 Distributional Results Functions of Random Variables Univariate One-to-One Transformation Suppose X is a continuous random variable with p.d.f. f (x) and support set A. Let Y = h (X) be a real-valued, one-to-one function from A to B. Then the probability density function of Y is ¯ ¯ ¯ ¡ −1 ¢ ¯ d −1 ¯ g (y) = f h (y) · ¯ h (y)¯¯ , y ∈ B. dy Multivariate One-to-One Transformation Suppose (X1 , . . . , Xn ) is a vector of random variables with joint p.d.f. f (x1 , . . . , xn ) and support set RX . Suppose the transformation S defined by Ui = hi (X1 , . . . , Xn ), i = 1, . . . , n is a one-to-one, real-valued transformation with inverse transformation Xi = wi (U1 , . . . , Un ), i = 1, . . . , n. Suppose also that S maps RX into RU . Then g(u1 , . . . , un ), the joint p.d.f. of (U1 , . . . , Un ) , is given by ¯ ¯ ¯ ∂(x1 , . . . , xn ) ¯ ¯ , (u1 , . . . , un ) ∈ RU g(u) = f (w1 (u), . . . , wn (u)) ¯¯ ∂(u1 , . . . , un ) ¯ where ¯ ¯ ∂(x1 , . . . , xn ) ¯¯ =¯ ∂(u1 , . . . , un ) ¯ ¯ ∂x1 ∂u1 ··· ∂x1 ∂un ∂xn ∂u1 ··· ∂xn ∂un .. . is the Jacobian of the transformation. .. . ¯ ¯ ∙ ¸−1 ¯ ∂(u1 , . . . , un ) ¯ ¯= ¯ ∂(x1 , . . . , xn ) ¯ 160 5.2.2 CHAPTER 5. APPENDIX Order Statistic The following results are derived in Casella and Berger, Section 5.4. Joint Distribution of the Order Statistic Suppose X1 , . . . , Xn is a random sample from a continuous distribution with probability density function probability density function ¡ f (x). The joint ¢ of the order statistic T = X(1) , . . . , X(n) = (T1 , . . . , Tn ) is g (t1 , . . . , tn ) = n! n Q i=1 f (ti ) , − ∞ < t1 < · · · < tn < ∞. Distribution of the Maximum and the Minimum of a Vector of Random Variables Suppose X1 , . . . 
, Xn is a random sample from a continuous distribution with probability density function f (x), support set A, and cumulative distribution function F (x). The probability density function of U = X(i) , i = 1, . . . , n is n! i−1 n−i [1 − F (u)] , f (u) [F (u)] (i − 1)! (n − i)! u ∈ A. In particular the probability density function of T = X(n) = max (X1 , ..., Xn ) is g1 (t) = nf (t) [F (t)]n−1 , t∈A and the probability density function of S = X(1) = min (X1 , ..., Xn ) is g2 (s) = nf (s) [1 − F (s)]n−1 , s ∈ A. 5.2. DISTRIBUTIONAL RESULTS 161 The joint p.d.f. of U = X(i) and V = X(j) , 1 ≤ i < j ≤ n is given by n! i−1 n−j [F (v) − F (u)] [1 − F (v)] , f (u) f (v) [F (u)] (i − 1)! (j − 1 − i)! (n − j)! u < v, u ∈ A, v ∈ A. In particular the joint probability density function of S = X(1) and T = X(n) is n−2 g (s, t) = n (n − 1) f (s) f (t) [F (t) − F (s)] 5.2.3 , s < t, s ∈ A, t ∈ A. Problem If Xi ∼ UNIF(a, b), i = 1, . . . , n independently, then show X(1) − a ∼ BETA (1, n) b−a X(n) − a ∼ BETA (n, 1) b−a 162 5.2.4 CHAPTER 5. APPENDIX Distribution of Sums of Random Variables (1) If Xi ∼ POI(μi ), i = 1, . . . , n independently, then µn ¶ n P P Xi ∼ POI μi . i=1 i=1 (2) If Xi ∼ BIN(ni , p), i = 1, . . . , n independently, then µn ¶ n P P Xi ∼ BIN ni , p . i=1 i=1 (3) If Xi ∼ NB(ki , p), i = 1, . . . , n independently, then µn ¶ n P P Xi ∼ NB ki , p . i=1 (4) If Xi ∼ i=1 N(μi , σi2 ), n P i=1 2 i = 1, . . . , n independently, then µn ¶ n P P 2 2 ai Xi ∼ N ai μi , ai σi . i=1 i=1 (5) If Xi ∼ N(μ, σ ), i = 1, . . . , n independently, then n ¡ ¢ ¢ ¡ P Xi ∼ N nμ, nσ 2 and X̄ ∼ N μ, σ 2 /n . i=1 (6) If Xi ∼ GAM(αi , β), i = 1, . . . , n independently, then µn ¶ n P P Xi ∼ GAM αi , β . i=1 i=1 (7) If Xi ∼ GAM(1, β) = EXP(β), i = 1, . . . , n independently, then n P i=1 Xi ∼ GAM (n, β) . (8) If Xi ∼ χ2 (ki ), i = 1, . . . , n independently, then µn ¶ n P P 2 Xi ∼ χ ki . i=1 i=1 (9) If Xi ∼ GAM(αi , β), i = 1, . . . , n independently where αi is a positive integer, then µ n ¶ n P 2 P Xi ∼ χ2 2 αi . β i=1 i=1 (10) If Xi ∼ N(μ, σ 2 ), i = 1, . . . , n independently, then µ ¶2 n P Xi − μ ∼ χ2 (n) . σ i=1 5.2. DISTRIBUTIONAL RESULTS 5.2.5 163 Theorem - Properties of the Multinomial Distribution Suppose (X1 , . . . , Xk ) ∼ MULT (n, p1 , . . . , pk ) with joint p.f. f (x1 , . . . , xk ) = n! xk+1 px1 px2 · · · pk+1 x1 !x2 ! · · · xk+1 ! 1 2 xi = 0, . . . , n, i = 1, . . . , k+1, xk+1 = n− k P xi , 0 < pi < 1, i = 1, . . . , k+1, i=1 k+1 P pi = 1. Then i=1 (1) (X1 , . . . , Xk ) has joint m.g.f. M (t1 , . . . , tk ) = (p1 et1 + · · · + pk etk + pk+1 )n , (t1 , . . . , tk ) ∈ <k . (2) Any subset of X1 , . . . , Xk+1 also has a multinomial distribution. In particular Xi v BIN (n, pi ) , i = 1, . . . , k + 1. (3) If T = Xi + Xj , i 6= j, then T v BIN (n, pi + pj ) . (4) Cov (Xi , Xj ) = −nθi θj . (5) The conditional distribution of any subset of (X1 , . . . , Xk+1 ) given the rest of the coordinates is a multinomial distribution. In particular the conditional p.f. of Xi given Xj = xj , i 6= j, is µ ¶ pi Xi |Xj = xj ∼ BIN n − xj , . 1 − pj (6) The conditional distribution of Xi given T = Xi + Xj = t, i 6= j, is µ Xi |Xi + Xj = t ∼ BIN t, pi pi + pj ¶ . 164 5.2.6 CHAPTER 5. APPENDIX Definition - Multivariate Normal Distribution Let X = (X1 , . . . , Xk )T be a k × 1 random vector with E(Xi ) = μi and Cov(Xi , Xj ) = σij , i, j = 1, . . . , k. (Note: Cov(Xi , Xi ) = σii = V ar(Xi ) = σi2 .) Let μ = (μ1 , . . . , μk )T be the mean vector and Σ be the k×k symmetric covariance matrix whose (i, j) entry is σij . 
Suppose also that Σ−1 exists. If the joint p.d.f. of (X1 , . . . , Xk ) is given by ∙ ¸ 1 1 T −1 exp − Σ (x − μ) , x ∈ Rk f (x1 , . . . , xk ) = (x − μ) 2 (2π)k/2 |Σ|1/2 where x = (x1 , . . . , xk )T then X is said to have a multivariate normal distribution. We write X v MVN(μ, Σ). 5.2.7 Theorem - Properties of the MVN Distribution Suppose X = (X1 , . . . , Xk )T v MVN(μ, Σ). Then (1) X has joint m.g.f. ¶ µ 1 T T M (t) = exp μ t + t Σt , 2 t = (t1 , . . . , tk )T ∈ <k . (2) Any subset of X1 , . . . , Xk also has a MVN distribution and in particular Xi v N(μi , σi2 ), i = 1, . . . , k. (3) (X − μ)T Σ−1 (X − μ) v χ2 (k). (4) Let c = (c1 , . . . , ck )T be a nonzero vector of constants then cT X = k P i=1 ci Xi v N(cT μ, cT Σc). (5) Let A be a k × p vector of constants of rank p then AT X v N(AT μ, AT ΣA). (6) The conditional distribution of any subset of (X1 , . . . , Xk ) given the rest of the coordinates is a multivariate normal distribution. In particular the conditional p.d.f. of Xi given Xj = xj , i 6= j, is Xi |Xj = xj ∼ N(μi + ρij σi (xj − μj )/σj , (1 − ρ2ij )σi2 ) 5.2. DISTRIBUTIONAL RESULTS 165 In following figures the BVN joint p.d.f. is graphed. The graphs all have the same mean vector μ = [0 0]T but different variance/covariance matrices Σ. The axes all have the same scale. 0.2 f(x,y) 0.15 0.1 0.05 0 3 3 2 2 1 1 0 0 -1 -1 -2 -2 y -3 -3 x Figure 5.1: Graph of BVN p.d.f. with μ = [0 0]T and Σ = [1 0 ; 0 1]. 166 CHAPTER 5. APPENDIX 0.2 f(x,y) 0.15 0.1 0.05 0 3 3 2 2 1 1 0 0 -1 -1 -2 -2 y -3 -3 x Graph of BVN p.d.f. with μ = [0 0]T and Σ = [1 0.5; 0.5 1]. 5.2. DISTRIBUTIONAL RESULTS 167 0.2 f(x,y) 0.15 0.1 0.05 0 3 3 2 2 1 1 0 0 -1 -1 -2 -2 y -3 -3 x Graph of BVN p.d.f. with μ = [0 0]T and Σ = [0.6 0.5; 0.5 1]. 168 5.3 5.3.1 CHAPTER 5. APPENDIX Limiting Distributions Definition - Convergence in Probability to a Constant The sequence of random variables X1 , X2 , . . . , Xn , . . . converges in probability to the constant c if for each ² > 0 lim P (|Xn − c| ≥ ²) = 0. n→∞ We write Xn →p c. 5.3.2 Theorem If X1 , X2 , ..., Xn , ... is a sequence of random variables such that ½ 0 x<b lim P (Xn ≤ x) = n→∞ 1 x>b then Xn →p b. 5.3.3 Theorem - Weak Law of Large Numbers If X1 , . . . , Xn is a random sample from a distribution with E(Xi ) = μ and V ar(Xi ) = σ 2 < ∞ then X̄n = 5.3.4 n 1 P Xi →p μ. n i=1 Problem Suppose X1 , X2 , . . . , Xn , . . . is a sequence of random variables such that E(Xn ) = c and lim V ar(Xn ) = 0. Show that Xn →p c. n→∞ 5.3.5 Problem Suppose X1 , X2 , . . . , Xn , . . . is a sequence of random variables such that Xn /n →p b < 0. Show that lim P (Xn < 0) = 1. n→∞ 5.3.6 Problem Show that if Yn →p a and lim P (|Xn | ≤ Yn ) = 1 n→∞ 5.3. LIMITING DISTRIBUTIONS 169 then Xn is bounded in probability, that is, there exists b > 0 such that lim P (|Xn | ≤ b) = 1. n→∞ 5.3.7 Definition - Convergence in Distribution The sequence of random variables X1 , X2 , . . . , Xn , . . . converges in distribution to a random variable X if lim P (Xn ≤ x) = P (X ≤ x) = F (x) n→∞ for all values of x at which F (x) is continuous. We write Xn →D X. 5.3.8 Theorem Suppose X1 , ..., Xn , ... is a sequence of random variables with E(Xn ) = μn and V ar(Xn ) = σn2 . If lim μn = μ and lim σn2 = 0 then Xn →p μ. n→∞ 5.3.9 n→∞ Central Limit Theorem If X1 , . . . , Xn is a random sample from a distribution with E(Xi ) = μ and V ar(Xi ) = σ 2 < ∞ then Yn = 5.3.10 n P i=1 √ Xi − nμ n(X̄n − μ) √ = →D Z v N(0, 1). σ nσ Limit Theorems 1. If Xn →p a and g is continuous at a, then g(Xn ) →p g(a). 2. 
If Xn →p a, Yn →p b and g(x, y) is continuous at (a, b) then g(Xn , Yn ) →p g(a, b). 3. (Slutsky) If Xn →D X, Yn →p b and g(x, b) is continuous for all x ∈ support of X then g(Xn , Yn ) →D g(X, b). 4. (Delta Method) If X1 , X2 , ..., Xn , ... is a sequence of random variables such that nb (Xn − a) →D X for some b > 0 and if the function g(x) is differentiable at a with g 0 (a) 6= 0 then nb [g(Xn ) − g(a)] →D g 0 (a)X. 170 5.3.11 CHAPTER 5. APPENDIX Problem If Xn →p a > 0, Yn →p b 6= 0 and Zn →D Z ∼ N (0, 1) then find the limiting distributions of p (1) Xn2 (2) Xn (3) Xn Yn (4) Xn + Yn (5) Xn /Yn (6) 2Zn (7) Zn + Yn (8) Xn Zn (9) Zn2 (10) 1/Zn 5.4. PROOFS 5.4 5.4.1 171 Proofs Theorem Suppose the model is {f (x; θ) ; θ ∈ Ω} and let A = support of X. Partition A into the equivalence classes defined by ½ ¾ f (x; θ) Ay = x; = H(x, y) for all θ ∈ Ω , y ∈ A. (5.1) f (y; θ) This is a minimal sufficient partition. The statistic T (X) which induces this partition is a minimal sufficient statistic. 5.4.2 Proof We give the proof for the case in which A does not depend on θ. Let T (X) be the statistic which induces the partition in (5.1). To show that T (X) is sufficient we define B = {t : t = T (x) for some x ∈ A} . Then the set A can be written as A = {∪t∈B At } where At = {x : T (x) = t} , t ∈ B. The statistic T (X) induces the partition defined by At , t ∈ B. For each At we can choose and fix one element, xt ∈ A. Obviously T (xt ) = t. Let g (t; θ) be a function defined on B such that g (t; θ) = f (xt ; θ) , t ∈ B. Consider any x ∈ A. For this x we can calculate T (x) = t and thus determine the set At to which x belongs as well as the value xt which was chosen for this set. Obviously T (x) = T (xt ). By the definition of the partition induced by T (X), we know that for all x ∈ At , f (x; θ) /f (xt ; θ) is a constant function of θ. Therefore for any x ∈ A we can define a function h (x) = where T (x) = T (xt ) = t. f (x; θ) f (xt ; θ) 172 CHAPTER 5. APPENDIX Therefore for all x ∈ A and θ ∈ Ω we have f (x; θ) f (xt ; θ) = g (t; θ) h (x) = g (T (xt ) ; θ) h (x) = g (T (x) ; θ) h (x) f (x; θ) = f (xt ; θ) and by the Factorization Criterion for Sufficieny T (X) is a sufficient statistic. To show that T (X) is a minimal sufficient statistic suppose that T1 (X) is any other sufficient statistic. By the Factorization Criterion for Sufficieny, there exist functions h1 (x) and g1 (t; θ) such that f (x; θ) = g1 (T1 (x) ; θ) h1 (x) for all x ∈ A and θ ∈ Ω. Let x and y be any two points in A with T1 (x) = T1 (y). Then f (x; θ) f (y; θ) = = = g1 (T1 (x) ; θ) h1 (x) g1 (T1 (y) ; θ) h1 (y) h1 (x) h1 (y) function of x and y which does not depend on θ and therefore by the definition of T (X) this implies T (x) = T (y). This implies that T1 induces either the same partition of A as T (X) or it induces a finer partition of A than T (X) and therefore T (X) is a function of T1 (X). Since T (X) is a function of every other sufficient statistic, therefore T (X) is a minimal sufficient statistic. 5.4.3 Theorem If T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} then T (X) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω}. 5.4.4 Proof Suppose U = U (X) is a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}. The function E (T |U ) is a function of U which does not depend on θ since U is a sufficient statistic. Also by Definition 1.5.2, U is 5.4. PROOFS 173 a function of the sufficient statistic T which implies E (T |U ) is a function of T . Let h (T ) = T − E (T |U ) . 
5.3.11 Problem

If X_n →_p a > 0, Y_n →_p b ≠ 0 and Z_n →_D Z ∼ N(0, 1), find the limiting distributions of

(1) X_n²  (2) √X_n  (3) X_n Y_n  (4) X_n + Y_n  (5) X_n/Y_n  (6) 2Z_n  (7) Z_n + Y_n  (8) X_n Z_n  (9) Z_n²  (10) 1/Z_n

5.4 Proofs

5.4.1 Theorem

Suppose the model is {f(x; θ); θ ∈ Ω} and let A = support of X. Partition A into the equivalence classes defined by

  A_y = { x : f(x; θ)/f(y; θ) = H(x, y) for all θ ∈ Ω },  y ∈ A.    (5.1)

This is a minimal sufficient partition. The statistic T(X) which induces this partition is a minimal sufficient statistic.

5.4.2 Proof

We give the proof for the case in which A does not depend on θ. Let T(X) be the statistic which induces the partition in (5.1). To show that T(X) is sufficient we define

  B = { t : t = T(x) for some x ∈ A }.

Then the set A can be written as A = ∪_{t∈B} A_t where

  A_t = { x : T(x) = t },  t ∈ B.

The statistic T(X) induces the partition defined by A_t, t ∈ B. For each A_t we can choose and fix one element x_t ∈ A. Obviously T(x_t) = t. Let g(t; θ) be a function defined on B such that

  g(t; θ) = f(x_t; θ),  t ∈ B.

Consider any x ∈ A. For this x we can calculate T(x) = t and thus determine the set A_t to which x belongs, as well as the value x_t which was chosen for this set. Obviously T(x) = T(x_t). By the definition of the partition induced by T(X), we know that for all x ∈ A_t, f(x; θ)/f(x_t; θ) is a constant function of θ. Therefore for any x ∈ A we can define a function

  h(x) = f(x; θ)/f(x_t; θ)

where T(x) = T(x_t) = t. Therefore for all x ∈ A and θ ∈ Ω we have

  f(x; θ) = f(x_t; θ) [ f(x; θ)/f(x_t; θ) ]
          = g(t; θ) h(x)
          = g(T(x_t); θ) h(x)
          = g(T(x); θ) h(x)

and by the Factorization Criterion for Sufficiency, T(X) is a sufficient statistic.

To show that T(X) is a minimal sufficient statistic, suppose that T_1(X) is any other sufficient statistic. By the Factorization Criterion for Sufficiency, there exist functions h_1(x) and g_1(t; θ) such that

  f(x; θ) = g_1(T_1(x); θ) h_1(x) for all x ∈ A and θ ∈ Ω.

Let x and y be any two points in A with T_1(x) = T_1(y). Then

  f(x; θ)/f(y; θ) = [ g_1(T_1(x); θ) h_1(x) ] / [ g_1(T_1(y); θ) h_1(y) ]
                  = h_1(x)/h_1(y)
                  = a function of x and y which does not depend on θ,

and therefore, by the definition of T(X), this implies T(x) = T(y). This implies that T_1 induces either the same partition of A as T(X) or a finer partition of A than T(X), and therefore T(X) is a function of T_1(X). Since T(X) is a function of every other sufficient statistic, T(X) is a minimal sufficient statistic.

5.4.3 Theorem

If T(X) is a complete sufficient statistic for the model {f(x; θ); θ ∈ Ω} then T(X) is a minimal sufficient statistic for {f(x; θ); θ ∈ Ω}.

5.4.4 Proof

Suppose U = U(X) is a minimal sufficient statistic for the model {f(x; θ); θ ∈ Ω}. The function E(T|U) is a function of U which does not depend on θ since U is a sufficient statistic. Also, by Definition 1.5.2, U is a function of the sufficient statistic T, which implies E(T|U) is a function of T. Let

  h(T) = T − E(T|U).

Now

  E[h(T); θ] = E[T − E(T|U); θ]
             = E(T; θ) − E[E(T|U); θ]
             = E(T; θ) − E(T; θ)
             = 0  for all θ ∈ Ω.

Since T is complete this implies

  P[h(T) = 0; θ] = 1  for all θ ∈ Ω

or

  P[T = E(T|U); θ] = 1  for all θ ∈ Ω,

and therefore T is a function of U. Since U is minimal sufficient, U is a function of every sufficient statistic, and hence so is T. Since T is itself sufficient, it follows that T is a minimal sufficient statistic for the model.

The regularity conditions are repeated here since they are used in the proofs that follow.

5.4.5 Regularity Conditions

Consider the model {f(x; θ); θ ∈ Ω}. Suppose that:

(R1) The parameter space Ω is an open interval in the real line.

(R2) The densities f(x; θ) have common support, so that the set A = {x : f(x; θ) > 0} does not depend on θ.

(R3) For all x ∈ A, f(x; θ) is a continuous, three times differentiable function of θ.

(R4) The integral ∫_A f(x; θ) dx can be twice differentiated with respect to θ under the integral sign, that is,

  (∂^k/∂θ^k) ∫_A f(x; θ) dx = ∫_A (∂^k/∂θ^k) f(x; θ) dx,  k = 1, 2, for all θ ∈ Ω.

(R5) For each θ_0 ∈ Ω there exist a positive number c and a function M(x) (both of which may depend on θ_0) such that

  | ∂³ log f(x; θ)/∂θ³ | < M(x)

holds for all x ∈ A and all θ ∈ (θ_0 − c, θ_0 + c), and

  E[M(X); θ] < ∞ for all θ ∈ (θ_0 − c, θ_0 + c).

(R6) For each θ ∈ Ω,

  0 < E{ [ ∂² log f(X; θ)/∂θ² ]² ; θ } < ∞.

(R7) The probability (density) functions corresponding to different values of the parameter are distinct, that is, θ ≠ θ* implies f(x; θ) ≠ f(x; θ*).

The following lemma is required for the proof of consistency of the M.L. estimator.

5.4.6 Lemma

If X is a non-degenerate random variable with model {f(x; θ); θ ∈ Ω} satisfying (R1)−(R7), then

  E[ log f(X; θ) − log f(X; θ_0); θ_0 ] < 0 for all θ, θ_0 ∈ Ω with θ ≠ θ_0.

5.4.7 Proof

Since g(x) = −log x is strictly convex and X is a non-degenerate random variable, the corollary to Jensen's inequality gives

  E[ log f(X; θ) − log f(X; θ_0); θ_0 ] = E[ log{ f(X; θ)/f(X; θ_0) }; θ_0 ]
                                        < log E[ f(X; θ)/f(X; θ_0); θ_0 ]

for all θ, θ_0 ∈ Ω with θ ≠ θ_0. Since

  E[ f(X; θ)/f(X; θ_0); θ_0 ] = ∫_A [ f(x; θ)/f(x; θ_0) ] f(x; θ_0) dx = ∫_A f(x; θ) dx = 1 for all θ ∈ Ω,

therefore

  E[ log f(X; θ) − log f(X; θ_0); θ_0 ] < log(1) = 0 for all θ, θ_0 ∈ Ω with θ ≠ θ_0.
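The inequality in Lemma 5.4.6 can be checked numerically for a specific model (a small Monte Carlo sketch, not part of the original notes; it assumes numpy and scipy are available and all names in it are illustrative). Here X ∼ N(θ_0, 1) with θ_0 = 0, and E[log f(X; θ) − log f(X; θ_0); θ_0] is estimated for a few values of θ ≠ θ_0; each estimate should be negative, and in this model the exact value is −(θ − θ_0)²/2.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta0 = 0.0
x = rng.normal(loc=theta0, scale=1.0, size=200_000)   # sample from f(x; theta0)

for theta in [0.5, 1.0, 2.0]:
    # Monte Carlo estimate of E[log f(X; theta) - log f(X; theta0); theta0]
    est = np.mean(norm.logpdf(x, loc=theta) - norm.logpdf(x, loc=theta0))
    exact = -0.5 * (theta - theta0) ** 2   # closed form for the N(theta, 1) model
    print(f"theta = {theta}: estimate = {est:.4f}, exact = {exact:.4f}")
```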
5.4.8 Theorem

Suppose (X_1, ..., X_n) is a random sample from a model {f(x; θ); θ ∈ Ω} satisfying regularity conditions (R1)−(R7). Then with probability tending to 1 as n → ∞, the likelihood equation or score equation

  ∑_{i=1}^{n} (∂/∂θ) log f(X_i; θ) = 0

has a root θ̂_n such that θ̂_n converges in probability to θ_0, the true value of the parameter, as n → ∞.

5.4.9 Proof

Let

  l_n(θ; X) = l_n(θ; X_1, ..., X_n) = log[ ∏_{i=1}^{n} f(X_i; θ) ] = ∑_{i=1}^{n} log f(X_i; θ),  θ ∈ Ω.

Since f(x; θ) is differentiable with respect to θ for all θ ∈ Ω, l_n(θ; x) is differentiable, and hence continuous, in θ for all θ ∈ Ω. By Lemma 5.4.6 we have, for any δ > 0 such that θ_0 ± δ ∈ Ω,

  E[ l_n(θ_0 + δ; X) − l_n(θ_0; X); θ_0 ] < 0    (5.2)

and

  E[ l_n(θ_0 − δ; X) − l_n(θ_0; X); θ_0 ] < 0.    (5.3)

By (5.2) and the WLLN,

  (1/n)[ l_n(θ_0 + δ; X) − l_n(θ_0; X) ] →_p b < 0,

which implies

  lim_{n→∞} P[ l_n(θ_0 + δ; X) − l_n(θ_0; X) < 0 ] = 1

(see Problem 5.3.5). Therefore there exists a sequence of constants {a_n} such that 0 < a_n < 1, lim_{n→∞} a_n = 0 and

  P[ l_n(θ_0 + δ; X) − l_n(θ_0; X) < 0 ] = 1 − a_n.

Let

  A_n = A_n(δ) = { x : l_n(θ_0 + δ; x) − l_n(θ_0; x) < 0 }

where x = (x_1, ..., x_n). Then

  lim_{n→∞} P(A_n; θ_0) = lim_{n→∞} (1 − a_n) = 1.

Let

  B_n = B_n(δ) = { x : l_n(θ_0 − δ; x) − l_n(θ_0; x) < 0 }.

Then by the same argument, using (5.3), there exists a sequence of constants {b_n} such that 0 < b_n < 1, lim_{n→∞} b_n = 0 and

  lim_{n→∞} P(B_n; θ_0) = lim_{n→∞} (1 − b_n) = 1.

Now

  P(A_n ∩ B_n; θ_0) = P(A_n; θ_0) + P(B_n; θ_0) − P(A_n ∪ B_n; θ_0)
                    = 1 − a_n + 1 − b_n − P(A_n ∪ B_n; θ_0)
                    = 1 − a_n − b_n + [ 1 − P(A_n ∪ B_n; θ_0) ]
                    ≥ 1 − a_n − b_n

since 1 − P(A_n ∪ B_n; θ_0) ≥ 0. Therefore, since P(A_n ∩ B_n; θ_0) ≤ 1 and lim_{n→∞} (1 − a_n − b_n) = 1,

  lim_{n→∞} P(A_n ∩ B_n; θ_0) = 1.    (5.4)

Continuity of l_n(θ; x) in θ for all θ ∈ Ω implies that for any x ∈ A_n ∩ B_n there exists a value θ̂_n(δ) = θ̂_n(δ; x) ∈ (θ_0 − δ, θ_0 + δ) such that l_n(θ; x) has a local maximum at θ = θ̂_n(δ). Since l_n(θ; x) is differentiable with respect to θ, this implies (Fermat's theorem)

  ∂l_n(θ; x)/∂θ = ∑_{i=1}^{n} (∂/∂θ) log f(x_i; θ) = 0 for θ = θ̂_n(δ).

Note that l_n(θ; x) may have more than one local maximum on the interval (θ_0 − δ, θ_0 + δ) and therefore θ̂_n(δ) may not be unique. If x ∉ A_n ∩ B_n, then θ̂_n(δ) may not exist, in which case we define θ̂_n(δ) to be a fixed arbitrary value. Note also that the sequence of roots {θ̂_n(δ)} depends on δ.

Let θ̂_n = θ̂_n(x) be the value of θ closest to θ_0 such that ∂l_n(θ; x)/∂θ = 0. If such a root does not exist we define θ̂_n to be a fixed arbitrary value. Since

  1 ≥ P[ θ̂_n ∈ (θ_0 − δ, θ_0 + δ); θ_0 ] ≥ P[ θ̂_n(δ) ∈ (θ_0 − δ, θ_0 + δ); θ_0 ] ≥ P(A_n ∩ B_n; θ_0),    (5.5)

then by (5.4) and the Squeeze Theorem we have

  lim_{n→∞} P[ θ̂_n ∈ (θ_0 − δ, θ_0 + δ); θ_0 ] = 1.

Since this is true for all δ > 0, θ̂_n →_p θ_0.
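As an illustration of Theorem 5.4.8 (a minimal simulation sketch, not from the original notes; it assumes numpy and scipy are available and all names in it are illustrative), consider the logistic location model f(x; θ) = e^{−(x−θ)}/[1 + e^{−(x−θ)}]², for which ∂ log f(x; θ)/∂θ = tanh((x − θ)/2), so the score is strictly decreasing in θ and the score equation has a unique root. The code solves the score equation numerically for increasing n; the root approaches the true value θ_0.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
theta0 = 1.0   # true location of the logistic(theta0, 1) model

def score(theta, x):
    # sum of d/dtheta log f(x_i; theta) = sum of tanh((x_i - theta)/2)
    return np.sum(np.tanh((x - theta) / 2.0))

for n in [20, 200, 2000, 20000]:
    x = rng.logistic(loc=theta0, scale=1.0, size=n)
    # the score is positive at min(x) and negative at max(x), so this bracket works
    theta_hat = brentq(score, x.min(), x.max(), args=(x,))
    print(f"n = {n:6d}   root of score equation = {theta_hat:.4f}")
```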
5.4.10 Theorem

Suppose (R1)−(R7) hold and θ̂_n is a consistent root of the likelihood equation as in Theorem 5.4.8. Then

  √(J(θ_0)) (θ̂_n − θ_0) →_D Z ∼ N(0, 1)

where θ_0 is the true value of the parameter.

5.4.11 Proof

Let

  S_1(θ; x) = (∂/∂θ) log f(x; θ)

and

  I_1(θ; x) = −(∂/∂θ) S_1(θ; x) = −(∂²/∂θ²) log f(x; θ)

be the score and information functions respectively for one observation from {f(x; θ); θ ∈ Ω}, so that J(θ) = nJ_1(θ) where J_1(θ) = E[I_1(θ; X); θ]. Since {f(x; θ); θ ∈ Ω} is a regular model,

  E[S_1(θ; X); θ] = 0,  θ ∈ Ω    (5.6)

and

  Var[S_1(θ; X); θ] = E[I_1(θ; X); θ] = J_1(θ) < ∞,  θ ∈ Ω.    (5.7)

Let

  A_n = { (x_1, ..., x_n) : ∑_{i=1}^{n} S_1(θ; x_i) = 0 has a solution }

and for (x_1, ..., x_n) ∈ A_n let θ̂_n = θ̂_n(x_1, ..., x_n) be the value of θ such that ∑_{i=1}^{n} S_1(θ̂_n; x_i) = 0.

Expand ∑_{i=1}^{n} S_1(θ̂_n; x_i) as a function of θ̂_n about θ_0 to obtain

  ∑_{i=1}^{n} S_1(θ̂_n; x_i) = ∑_{i=1}^{n} S_1(θ_0; x_i) − (θ̂_n − θ_0) ∑_{i=1}^{n} I_1(θ_0; x_i)
                              + (1/2)(θ̂_n − θ_0)² ∑_{i=1}^{n} (∂³/∂θ³) log f(x_i; θ)|_{θ=θ_n*}    (5.8)

where θ_n* = θ_n*(x_1, ..., x_n) lies between θ_0 and θ̂_n, by Taylor's Theorem.

Suppose (x_1, ..., x_n) ∈ A_n. Then the left side of (5.8) equals zero and thus

  ∑_{i=1}^{n} S_1(θ_0; x_i) = (θ̂_n − θ_0) [ ∑_{i=1}^{n} I_1(θ_0; x_i) − (1/2)(θ̂_n − θ_0) ∑_{i=1}^{n} (∂³/∂θ³) log f(x_i; θ)|_{θ=θ_n*} ]

or, dividing both sides by √(nJ_1(θ_0)),

  ∑_{i=1}^{n} S_1(θ_0; x_i) / √(nJ_1(θ_0))
    = √(nJ_1(θ_0)) (θ̂_n − θ_0) [ (1/n)∑_{i=1}^{n} I_1(θ_0; x_i) / J_1(θ_0)
        − (θ̂_n − θ_0) (1/n)∑_{i=1}^{n} (∂³/∂θ³) log f(x_i; θ)|_{θ=θ_n*} / (2J_1(θ_0)) ].

Therefore for (X_1, ..., X_n) we have

  I{(X_1, ..., X_n) ∈ A_n} ∑_{i=1}^{n} S_1(θ_0; X_i) / √(nJ_1(θ_0))
    = I{(X_1, ..., X_n) ∈ A_n} √(nJ_1(θ_0)) (θ̂_n − θ_0) [ (1/n)∑_{i=1}^{n} I_1(θ_0; X_i) / J_1(θ_0)
        − (θ̂_n − θ_0) (1/n)∑_{i=1}^{n} (∂³/∂θ³) log f(X_i; θ)|_{θ=θ_n*} / (2J_1(θ_0)) ]    (5.9)

where θ_n* = θ_n*(X_1, ..., X_n). By an argument similar to that used in Proof 5.4.9,

  lim_{n→∞} P[ (X_1, ..., X_n) ∈ A_n; θ_0 ] = 1.    (5.10)

Since S_1(θ_0; X_i), i = 1, ..., n are i.i.d. random variables with mean and variance given by (5.6) and (5.7), then by the CLT

  ∑_{i=1}^{n} S_1(θ_0; X_i) / √(nJ_1(θ_0)) →_D Z ∼ N(0, 1).    (5.11)

Since I_1(θ_0; X_i), i = 1, ..., n are i.i.d. random variables with mean J_1(θ_0), then by the WLLN

  (1/n)∑_{i=1}^{n} I_1(θ_0; X_i) →_p J_1(θ_0)

and thus

  (1/n)∑_{i=1}^{n} I_1(θ_0; X_i) / J_1(θ_0) →_p 1    (5.12)

by the Limit Theorems. To complete the proof we need to show

  (θ̂_n − θ_0) [ (1/n)∑_{i=1}^{n} (∂³/∂θ³) log f(X_i; θ)|_{θ=θ_n*} ] →_p 0.    (5.13)

Since θ̂_n →_p θ_0, we only need to show that

  (1/n)∑_{i=1}^{n} (∂³/∂θ³) log f(X_i; θ)|_{θ=θ_n*}    (5.14)

is bounded in probability. Since θ̂_n →_p θ_0 implies θ_n* →_p θ_0, then by (R5)

  lim_{n→∞} P[ | (1/n)∑_{i=1}^{n} (∂³/∂θ³) log f(X_i; θ)|_{θ=θ_n*} | ≤ (1/n)∑_{i=1}^{n} M(X_i); θ_0 ] = 1.

Also by (R5) and the WLLN,

  (1/n)∑_{i=1}^{n} M(X_i) →_p E[M(X); θ_0] < ∞.

It follows that (5.14) is bounded in probability (see Problem 5.3.6). Therefore

  √(J(θ_0)) (θ̂_n − θ_0) →_D Z ∼ N(0, 1)

follows from (5.9), (5.11)−(5.13) and Slutsky's Theorem.
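Continuing the logistic location sketch given after the consistency proof (a minimal simulation sketch, not part of the original notes; it assumes numpy and scipy are available and all names in it are illustrative), the Fisher information per observation in that model is the standard value J_1(θ) = 1/3, so Theorem 5.4.10 says √(n/3)(θ̂_n − θ_0) is approximately N(0, 1). The code replicates the experiment and checks the standardized root of the score equation against this limit.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
theta0, n, reps = 1.0, 400, 2000
J1 = 1.0 / 3.0     # Fisher information per observation, logistic(theta, 1) location model

def score(theta, x):
    # score function of the logistic(theta, 1) location model
    return np.sum(np.tanh((x - theta) / 2.0))

z = np.empty(reps)
for r in range(reps):
    x = rng.logistic(loc=theta0, scale=1.0, size=n)
    theta_hat = brentq(score, x.min(), x.max(), args=(x,))
    z[r] = np.sqrt(n * J1) * (theta_hat - theta0)   # sqrt(J(theta0)) (theta_hat - theta0)

print("mean, var:", z.mean().round(3), z.var().round(3))        # approx 0 and 1
print("P(Z_n <= 1.645) approx:", np.mean(z <= 1.645), "(Phi(1.645) = 0.95)")
```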
Special Discrete Distributions

Binomial: X ∼ BIN(n, p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = (n choose x) p^x q^{n−x}, x = 0, 1, ..., n
  Mean: np    Variance: npq    m.g.f.: (p e^t + q)^n

Bernoulli: X ∼ Bernoulli(p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = p^x q^{1−x}, x = 0, 1
  Mean: p    Variance: pq    m.g.f.: p e^t + q

Negative Binomial: X ∼ NB(k, p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = (−k choose x) p^k (−q)^x, x = 0, 1, ...
  Mean: kq/p    Variance: kq/p²    m.g.f.: [p/(1 − q e^t)]^k, t < −log q

Geometric: X ∼ GEO(p), 0 < p < 1, q = 1 − p
  p.f.: f(x) = p q^x, x = 0, 1, ...
  Mean: q/p    Variance: q/p²    m.g.f.: p/(1 − q e^t), t < −log q

Hypergeometric: X ∼ HYP(n, M, N), n = 1, 2, ..., N, M = 0, 1, ..., N
  p.f.: f(x) = (M choose x)(N − M choose n − x)/(N choose n), x = 0, 1, ..., n
  Mean: nM/N    Variance: n(M/N)(1 − M/N)(N − n)/(N − 1)    m.g.f.: not tractable

Poisson: X ∼ POI(θ), θ > 0
  p.f.: f(x) = e^{−θ} θ^x/x!, x = 0, 1, ...
  Mean: θ    Variance: θ    m.g.f.: e^{θ(e^t − 1)}

Discrete Uniform: X ∼ DU(N), N = 1, 2, ...
  p.f.: f(x) = 1/N, x = 1, 2, ..., N
  Mean: (N + 1)/2    Variance: (N² − 1)/12    m.g.f.: (e^t − e^{(N+1)t})/[N(1 − e^t)], t ≠ 0

Special Continuous Distributions

Uniform: X ∼ UNIF(a, b)
  p.d.f.: f(x) = 1/(b − a), a ≤ x ≤ b
  Mean: (a + b)/2    Variance: (b − a)²/12    m.g.f.: (e^{bt} − e^{at})/[(b − a)t], t ≠ 0

Normal: X ∼ N(μ, σ²), σ > 0
  p.d.f.: f(x) = [1/(σ√(2π))] e^{−(x−μ)²/(2σ²)}
  Mean: μ    Variance: σ²    m.g.f.: e^{μt + σ²t²/2}

Gamma: X ∼ GAM(α, β), α > 0, β > 0
  p.d.f.: f(x) = x^{α−1} e^{−x/β}/[Γ(α) β^α], x > 0
  Mean: αβ    Variance: αβ²    m.g.f.: (1 − βt)^{−α}, t < 1/β

Inverted Gamma: X ∼ IG(α, β), α > 0, β > 0
  p.d.f.: f(x) = x^{−α−1} e^{−1/(βx)}/[Γ(α) β^α], x > 0
  Mean: 1/[β(α − 1)], α > 1    Variance: 1/[β²(α − 1)²(α − 2)], α > 2    m.g.f.: not tractable

Exponential: X ∼ EXP(θ), θ > 0
  p.d.f.: f(x) = (1/θ) e^{−x/θ}, x ≥ 0
  Mean: θ    Variance: θ²    m.g.f.: (1 − θt)^{−1}, t < 1/θ

Two-Parameter Exponential: X ∼ EXP(θ, μ), θ > 0
  p.d.f.: f(x) = (1/θ) e^{−(x−μ)/θ}, x ≥ μ
  Mean: μ + θ    Variance: θ²    m.g.f.: e^{μt}(1 − θt)^{−1}, t < 1/θ

Double Exponential: X ∼ DE(θ, μ), θ > 0
  p.d.f.: f(x) = [1/(2θ)] e^{−|x−μ|/θ}
  Mean: μ    Variance: 2θ²    m.g.f.: e^{μt}(1 − θ²t²)^{−1}, |t| < 1/θ

Weibull: X ∼ WEI(θ, β), θ > 0, β > 0
  p.d.f.: f(x) = (β/θ^β) x^{β−1} e^{−(x/θ)^β}, x > 0
  Mean: θΓ(1 + 1/β)    Variance: θ²[Γ(1 + 2/β) − Γ²(1 + 1/β)]    m.g.f.: not tractable

Extreme Value: X ∼ EV(θ, μ), θ > 0
  p.d.f.: f(x) = (1/θ) exp[(x − μ)/θ − e^{(x−μ)/θ}]
  Mean: μ − γθ, where γ ≈ 0.5772 (Euler's constant)    Variance: π²θ²/6    m.g.f.: e^{μt}Γ(1 + θt), t > −1/θ

Cauchy: X ∼ CAU(θ, μ), θ > 0
  p.d.f.: f(x) = 1/{πθ[1 + ((x − μ)/θ)²]}
  Mean, Variance and m.g.f.: do not exist

Pareto: X ∼ PAR(θ, β), θ > 0, β > 0
  p.d.f.: f(x) = βθ^β/x^{β+1}, x ≥ θ
  Mean: βθ/(β − 1), β > 1    Variance: βθ²/[(β − 1)²(β − 2)], β > 2    m.g.f.: does not exist

Logistic: X ∼ LOG(θ, μ), θ > 0
  p.d.f.: f(x) = e^{−(x−μ)/θ}/{θ[1 + e^{−(x−μ)/θ}]²}
  Mean: μ    Variance: π²θ²/3    m.g.f.: e^{μt}Γ(1 − θt)Γ(1 + θt), |t| < 1/θ

Chi-Squared: X ∼ χ²(ν), ν = 1, 2, ...
  p.d.f.: f(x) = x^{ν/2−1} e^{−x/2}/[2^{ν/2}Γ(ν/2)], x > 0
  Mean: ν    Variance: 2ν    m.g.f.: (1 − 2t)^{−ν/2}, t < 1/2

Student's t: X ∼ t(ν), ν = 1, 2, ...
  p.d.f.: f(x) = {Γ((ν + 1)/2)/[√(νπ) Γ(ν/2)]} (1 + x²/ν)^{−(ν+1)/2}
  Mean: 0, ν > 1    Variance: ν/(ν − 2), ν > 2    m.g.f.: does not exist

Snedecor's F: X ∼ F(ν_1, ν_2), ν_1, ν_2 = 1, 2, ...
  p.d.f.: f(x) = {Γ((ν_1 + ν_2)/2)/[Γ(ν_1/2)Γ(ν_2/2)]} (ν_1/ν_2)^{ν_1/2} x^{ν_1/2 − 1} (1 + ν_1 x/ν_2)^{−(ν_1+ν_2)/2}, x > 0
  Mean: ν_2/(ν_2 − 2), ν_2 > 2    Variance: 2ν_2²(ν_1 + ν_2 − 2)/[ν_1(ν_2 − 2)²(ν_2 − 4)], ν_2 > 4    m.g.f.: does not exist

Beta: X ∼ BETA(a, b), a > 0, b > 0
  p.d.f.: f(x) = {Γ(a + b)/[Γ(a)Γ(b)]} x^{a−1}(1 − x)^{b−1}, 0 < x < 1
  Mean: a/(a + b)    Variance: ab/[(a + b)²(a + b + 1)]    m.g.f.: not tractable

Special Multivariate Distributions

Multinomial: X = (X_1, X_2, ..., X_k) ∼ MULT(n; p_1, ..., p_k), 0 < p_i < 1, ∑_{i=1}^{k+1} p_i = 1
  p.f.: f(x_1, ..., x_k) = [n!/(x_1! x_2! ··· x_{k+1}!)] p_1^{x_1} p_2^{x_2} ··· p_{k+1}^{x_{k+1}},
        0 ≤ x_i ≤ n, x_{k+1} = n − ∑_{i=1}^{k} x_i
  m.g.f.: M(t_1, ..., t_k) = (p_1 e^{t_1} + ··· + p_k e^{t_k} + p_{k+1})^n

Bivariate Normal: X = (X_1, X_2)^T ∼ BVN(μ, Σ), μ = (μ_1, μ_2)^T, σ_1 > 0, σ_2 > 0, −1 < ρ < 1,
  Σ = [ σ_1²  ρσ_1σ_2 ; ρσ_1σ_2  σ_2² ]
  p.d.f.: f(x_1, x_2) = [1/(2πσ_1σ_2√(1 − ρ²))]
          × exp{ −[1/(2(1 − ρ²))] [ ((x_1 − μ_1)/σ_1)² − 2ρ((x_1 − μ_1)/σ_1)((x_2 − μ_2)/σ_2) + ((x_2 − μ_2)/σ_2)² ] }
        = [1/(2π|Σ|^{1/2})] exp[ −(1/2)(x − μ)^T Σ^{-1}(x − μ) ]
  m.g.f.: M(t) = exp( μ^T t + (1/2) t^T Σ t ), t = (t_1, t_2)^T
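A few table entries can be spot-checked numerically (a small sketch, not part of the original notes; it assumes scipy is available and uses scipy's own parameterizations, which do not always match the notation above; all names in it are illustrative). Here the Gamma, Weibull and Beta means and variances from the table are compared with the values reported by scipy.stats.

```python
from scipy.stats import gamma, weibull_min, beta
from math import gamma as G   # the gamma function, for the Weibull moments

alpha, scale_b = 3.0, 2.0          # Gamma with shape alpha and scale beta
m, v = gamma.stats(a=alpha, scale=scale_b, moments="mv")
print("Gamma  :", m, v, "table:", alpha * scale_b, alpha * scale_b**2)

th, shape_b = 2.0, 1.5             # Weibull with scale theta and shape beta
m, v = weibull_min.stats(c=shape_b, scale=th, moments="mv")
print("Weibull:", m, v, "table:", th * G(1 + 1/shape_b),
      th**2 * (G(1 + 2/shape_b) - G(1 + 1/shape_b)**2))

a, b = 2.0, 5.0                    # Beta(a, b)
m, v = beta.stats(a=a, b=b, moments="mv")
print("Beta   :", m, v, "table:", a/(a + b), a*b/((a + b)**2 * (a + b + 1)))
```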