Ch. 7

1 Estimation Introduction

The objective of data collection is to learn about the distribution, or aspects of the distribution, of some characteristic of the units of a population of interest. In Chapter 6, we saw how to estimate certain key descriptive parameters of a population distribution, such as the mean, the variance, percentiles, and probabilities (or proportions), from the corresponding sample quantities. In this chapter we will learn another approach to the estimation of such population parameters. This approach is based on the assumption that the population distribution belongs in a certain family of distribution models, and on methods for fitting a particular family of distribution models to data. Different fitting methods will be presented and discussed.

The estimators obtained from this approach will occasionally differ from the estimators we saw in Chapter 6. For example, under the assumption that the population distribution is normal, estimators of population percentiles and proportions depend only on the sample mean and sample variance, and thus differ from the sample percentiles and proportions; the assumption of a uniform distribution yields an estimator of the population mean which differs from the sample mean; and the assumption of a Poisson distribution yields an estimator of the population variance which differs from the sample variance.

Another learning objective of this chapter is to develop criteria for selecting the best among different estimators of the same quantity, or parameter. For example, should the alternative estimators mentioned in the preceding paragraph be preferred over those discussed in Chapter 6? The same criteria can also help us decide whether a stratified sample is preferable to a simple random sample for estimating the population mean or a population proportion.
Finally, in this chapter we will learn how to report the uncertainty of estimators through their standard error, and how that leads to confidence intervals for estimators which have (at least approximately) a normal distribution. The above estimation concepts will be developed here in the context of a single sample, but will be applied in later chapters to samples from several populations.

2 Overview, Notation and Terminology

Many families of distribution models, including all we have discussed, depend on a small number of parameters; for example, a Poisson distribution model is identified by the single parameter λ, and normal models are identified by two parameters, µ and σ². Such families of distribution models are called parametric. An approach to extrapolating sample information to the population is to assume that the population distribution is a member of (or belongs in) a specific parametric family of distribution models, and then fit the assumed family to the data, i.e. identify the member of the parametric family that best fits the data. There are several methods/criteria for fitting a parametric family of distribution models to data. They all amount to estimating the model parameters, and taking as the fitted model the one that corresponds to the estimated parameters.

Example 2.1. Car manufacturers often advertise damage results from low impact crash experiments. In an experiment crashing n = 20 randomly selected cars of a certain type against a wall at 5 mph, let X denote the number of cars that sustain no visible damage. Here it is reasonable to assume that the distribution of X, which is the population distribution in this case, is a member of the family of binomial probability models. A binomial distribution is identified by the sample size used (here n = 20), and the parameter p, which is the probability that a randomly selected car will sustain no visible damage when crashed at 5 mph.
The best fitting model is the binomial distribution that corresponds to the estimated value of p. For example, if X = 12 of the 20 cars in the experiment sustain no visible damage, the estimate of p is p̂ = 12/20 = 0.6, and the best fitting model is Bin(20, 0.6).

Example 2.2. The response time, X, of a robot to a certain malfunction of a production process (e.g. car manufacturing) is often the variable of interest. Let X1, ..., X36 denote 36 response times that are to be measured. Here it is not clear what the population distribution (i.e. the distribution of each Xi) might be, but it might be assumed (at least tentatively) that this distribution is a member of the normal family of distributions. The model parameters that identify a normal distribution are its mean and variance. Thus, the best fitting model is the normal distribution with mean and variance equal to the sample mean and sample variance, respectively, obtained from the data (i.e. the 36 measured response times). For example, if the sample mean of the 36 response times is X̄ = 9.3, and the sample variance is S² = 4.9, the best fitting model is N(9.3, 4.9).

Example 2.3. The lifetime of electric components is often the variable of interest in reliability studies. Let T1, ..., T25 denote the life times, in hours, of a random sample of 25 components. Here it is not clear what the population distribution might be, but it might be assumed (at least tentatively) that it is a member of the exponential family of models, which was introduced in Example 3.7, page 14 of Chapter 3. Thus, each Ti has pdf

    f_λ(t) = λ exp(−λt) for t ≥ 0,  f_λ(t) = 0 for t < 0,

for some λ > 0. Here the single model parameter λ identifies the exponential distribution. Since the model mean value (i.e. the mean value of a population having the exponential distribution) is λ⁻¹, and since the population mean value can be estimated by the sample mean, X̄, of the 25 life times, the model parameter λ can be estimated by λ̂ = 1/X̄.
Thus, the best fitting exponential model is the exponential distribution with model parameter equal to λ̂. For example, if the average of the 25 life times is 113.5 hours, the best fitting model is the exponential distribution with λ = 113.5⁻¹.

Example 2.4. Suppose, as in the previous example, that interest lies in the distribution of the life time of some type of electric component, and let T1, ..., T25 denote the life times, in hours, of a random sample of 25 such components. If the assumption that the population distribution belongs in the exponential family does not appear credible, it might be assumed that it is a member of the gamma family of distribution models. This is a richer family of models, and it includes the exponential distributions as special cases. The gamma distribution is identified by two parameters, α and β, and its pdf is of the form

    f_{α,β}(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β},  x > 0,

where Γ(α) is the gamma function. The model parameters α and β are related to the model mean and variance through

    α = µ²/σ²,  β = σ²/µ.

Since the model mean and variance can be estimated by the sample mean and variance, the parameters α and β can be estimated by

    α̂ = X̄²/S²,  β̂ = S²/X̄,

where X̄, S² denote the sample mean and sample variance of the 25 life times. For example, if the average of the 25 life times is 113.5 hours, and their sample variance is 1205.55 hours², the best fitting model is the gamma distribution with α̂ = 113.5²/1205.55 = 10.69, and β̂ = 1205.55/113.5 = 10.62.

The method used in the above examples to fit parametric families of models to data, namely matching the nonparametric estimates of the population mean and population variance to the model parameters, is called the method of moments. It will be discussed further later in this chapter, along with other methods of fitting. Having a fitted distribution model means that the entire distribution has been estimated.
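As a quick illustration, the method-of-moments fit of Example 2.4 can be reproduced numerically. The summary statistics 113.5 and 1205.55 are taken from the example; `scipy` is used only to confirm that the fitted model has the intended mean and variance:

```python
import numpy as np
from scipy import stats

# Summary statistics from Example 2.4 (25 component life times, in hours)
xbar = 113.5    # sample mean
s2 = 1205.55    # sample variance

# Method-of-moments estimates: alpha = mu^2/sigma^2, beta = sigma^2/mu
alpha_hat = xbar**2 / s2
beta_hat = s2 / xbar
print(alpha_hat, beta_hat)  # roughly 10.69 and 10.62, as in the example

# The fitted gamma model; scipy parameterizes the gamma by shape a and scale
fitted = stats.gamma(a=alpha_hat, scale=beta_hat)
print(fitted.mean(), fitted.var())  # recovers 113.5 and 1205.55
```

By construction, matching moments means the fitted distribution reproduces the sample mean and variance exactly; that is the defining property of the method.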
In particular, the fitted model provides immediate and direct estimation of any percentile and probability that might be of interest.

Example 2.5. The fitted model of Example 2.2 allows direct and immediate estimation of the probability that the reaction time of the robot will exceed 50 nanoseconds, and of the 95th percentile of the robot reaction time, as those of the fitted normal distribution:

    P̂(X > 50) = 1 − Φ((50 − X̄)/S),  and  x̂_0.05 = X̄ + z_0.05 S,    (2.1)

where a hat placed above a quantity denotes an estimator of that quantity, and X̄ and S are the sample mean and sample standard deviation obtained from the 36 measured response times.

Of course, the nonparametric estimates of proportions and percentiles that were discussed in Chapter 6 can still be used. For example, the probability P(X > 50) in Example 2.5 can be estimated as the proportion of the 36 robot reaction times that exceed 50 nanoseconds, and x̂_0.05 can be estimated by the 95th sample percentile. If the assumption of normality is correct, i.e. if the population distribution of the robot reaction time is normal, then the estimates in (2.1) are to be preferred over the nonparametric ones, according to criteria that will be discussed in this chapter. Keep in mind that, if the parametric assumption is not correct, i.e. if the population distribution is not a member of the assumed parametric family of models, then the fitted model is an estimator of the best-approximating model, which is the member of the parametric family that best approximates the population distribution. Consequently, estimators of probabilities and percentiles based on the fitted model (as in Example 2.5) will not have desirable properties. The following example illustrates this point.

Example 2.6.
To demonstrate the effect that an incorrect modeling assumption can have on the estimation of probabilities and percentiles, suppose, as in Example 2.4, that interest lies in the distribution of the life time of some type of electric component, and that the life times of a random sample of 25 such components are observed. With the information given in that example, the best fitting gamma model is the one with α = 113.5²/1205.55 = 10.69, and β = 1205.55/113.5 = 10.62. Thus, under the assumption of a gamma distribution model, the probability that a randomly selected component will last more than 140 hours, and the 95th percentile of the life times, are estimated as those of the best fitting gamma model. Using a statistical software package we obtain the estimates P̂(X > 140) = 0.21, and x̂_0.05 = 176.02. However, if the assumption of an exponential distribution model is made, as in Example 2.3, the best fitting exponential model gives the estimates P̂(X > 140) = 0.29, and x̂_0.05 = 340.02.

Fortunately, there are diagnostic tests that can help decide whether a parametric assumption is not correct, or which of two parametric families provides a better fit to the data. A way of gaining confidence that the population distribution belongs in the assumed parametric family (the ideal case), or at least is well approximated by the best-approximating model, is the probability plot, which we saw in Chapter 6.

In all that follows, the Greek letter θ serves as a generic notation for any model or population parameter(s) that we are interested in estimating. Thus, if we are interested in the population mean value, then θ = µ, and, if we are interested in the population mean value and variance, then θ = (µ, σ²). If θ denotes a population parameter, then by the true value of θ we mean the (unknown to us) population value of θ. For example, if θ denotes the population mean or variance, then the true value of θ is the (unknown to us) value of the population mean or variance.
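The two sets of model-based estimates in Example 2.6 can be reproduced with `scipy`; the fitted parameters come from that example, and the last couple of decimals depend on whether the rounded or unrounded method-of-moments estimates are used:

```python
from scipy import stats

xbar, s2 = 113.5, 1205.55  # summary statistics from Example 2.4

# Fitted gamma model (method-of-moments estimates, as in Example 2.6)
gam = stats.gamma(a=xbar**2 / s2, scale=s2 / xbar)
print(gam.sf(140))     # P(X > 140) under the gamma fit: about 0.21
print(gam.ppf(0.95))   # 95th percentile under the gamma fit: about 176

# Fitted exponential model (lambda-hat = 1/xbar, i.e. scale = xbar)
expo = stats.expon(scale=xbar)
print(expo.sf(140))    # P(X > 140) under the exponential fit: about 0.29
print(expo.ppf(0.95))  # 95th percentile under the exponential fit: about 340
```

The two fitted models have the same mean, yet give noticeably different tail probabilities and percentiles, which is exactly the point of Example 2.6.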
If θ denotes a model parameter, then by the true value of θ we mean the (unknown to us) value of θ that corresponds to the best-approximating model. For example, if we use the exponential family to model the distribution of the life times of a certain component, then the unknown to us value of λ = 1/µ, where µ is the population mean, is the true value of θ = λ.

3 Estimators and Estimates

An estimator, θ̂, of the true value of θ is a function of the random sample X1, ..., Xn,

    θ̂ = θ̂(X1, ..., Xn).

Here we refer to the random sample in the planning stage of the experiment, when X1, ..., Xn are random variables (hence the capital letters). Being a function of random variables, i.e. a statistic, an estimator is a random variable, and thus it has a sampling distribution. If θ is a population parameter, or if it is a model parameter and the assumed parametric model is correct, the sampling distribution of an estimator θ̂ depends on the true value of θ. This dependence of the sampling distribution of θ̂ on θ is often indicated by writing the expected value, E(θ̂), of θ̂ as

    E_{θ=θ0}(θ̂),

read "the expected value of θ̂ when the true value of θ is θ0".

Example 3.1. a) In Example 2.1, we used the proportion, p̂ = X/20, of cars that sustain no visible damage as an estimator of the probability, p, of a car sustaining no visible damage when crashed at 5 mph. Because X = 20p̂ ∼ Bin(20, p), we see that the distribution of p̂ depends on the true value of p. For example, if p = 0.7, then np̂ = X ∼ Bin(20, 0.7). This dependence of the distribution of X on the true value of p is often indicated in the formula for the expected value of a binomial random variable, which is E(X) = 20p, by writing it as E_p(X) = 20p. In particular, if the true value of p is p = 0.7, we write

    E_{p=0.7}(X) = 20 × 0.7 = 14,  E_{p=0.7}(p̂) = 0.7,

read "when the true value of p is 0.7, the expected value of X is 14 and the expected value of p̂ is 0.7".
b) In Example 2.2, we used the average, X̄, of the 36 response times as an estimator of the expected value, µ, of the robot's reaction time to a future malfunction. Because X̄ ∼ N(µ, σ²/36), we see that the distribution of X̄ depends on the true value of µ (and also on that of σ²). If the true value of µ is µ = 8.5, we write

    E_{µ=8.5}(X̄) = 8.5,

read "when the true value of µ is 8.5, the expected value of X̄ is 8.5".

Let X1 = x1, ..., Xn = xn be the observed values when the experiment has been carried out. The value, θ̂(x1, ..., xn), of the estimator evaluated at the observed sample values will be called a point estimate, or simply an estimate. Point estimates will also be denoted by θ̂. Thus, an estimator is a random variable, while a (point) estimate is a specific value that the estimator takes.

Example 3.2. a) In Example 2.1, the proportion p̂ = X/20 of cars that will sustain no visible damage among the 20 that are to be crashed is an estimator of p, and, if the experiment yields that X = 12 sustain no visible damage, p̂ = 12/20 is the point estimate of p. b) In Example 2.2, µ̂ = X̄ = (X1 + ··· + X36)/36 is an estimator of µ, and, if the measured response times X1 = x1, ..., X36 = x36 yield an average of (x1 + ··· + x36)/36 = 9.3, then µ̂ = 9.3 is a point estimate of µ.

4 Biased and Unbiased Estimators

Being a random variable, θ̂ serves only as an approximation to the true value of θ. With some samples, θ̂ will overestimate the true θ, whereas with others it will underestimate it. Therefore, properties of estimators and criteria for deciding among competing estimators are most meaningfully stated in terms of their distribution, or the distribution of

    θ̂ − θ = error of estimation.    (4.1)

The first criterion we will discuss is that of unbiasedness.

Definition 4.1. The estimator θ̂ of θ is called unbiased for θ if E(θ̂) = θ. (More exactly, θ̂ is unbiased for θ if E_θ(θ̂) = θ.) The difference E(θ̂) − θ is called the bias of θ̂ and is denoted by bias(θ̂).
(More exactly, bias_θ(θ̂) = E_θ(θ̂) − θ.)

Example 4.1. a) For the estimator p̂ discussed in Example 2.1, E_p(p̂) = p, where p is the population proportion. Thus p̂ is an unbiased estimator of p. b) For the estimator µ̂ = X̄ discussed in Example 2.2, E_µ(X̄) = µ, where µ is the true population mean. Thus, X̄ is an unbiased estimator of µ.

Unbiased estimators have zero bias. This means that, though with any given sample θ̂ may underestimate or overestimate the true value of θ, the estimation error θ̂ − θ averages to zero. Thus, when using unbiased estimators, there is no tendency to overestimate or underestimate the true value of θ. Other commonly used unbiased estimators in statistics include the estimators of the regression coefficients, which are described in the next example.

Example 4.2. ESTIMATION OF THE REGRESSION COEFFICIENTS. Let (X, Y) be a bivariate random variable and suppose that its regression function (or the regression function of its underlying population) is of the form E(Y | X = x) = α + βx. Then, it can be shown that

    β = Cov(X, Y)/Var(X),  and  α = E(Y) − βE(X).    (4.2)

Thus, if we have a sample (X1, Y1), ..., (Xn, Yn) from the underlying population, the method of moments estimator of the regression parameters is

    β̂ = Ĉov(X, Y)/V̂ar(X),  and  α̂ = Ȳ − β̂X̄,    (4.3)

where

    Ĉov(X, Y) = (1/(n − 1)) Σᵢ (Xi − X̄)(Yi − Ȳ),  and  V̂ar(X) = (1/(n − 1)) Σᵢ (Xi − X̄)²

are the sample covariance of the (Xi, Yi)'s and the sample variance of the Xi's, respectively.

REMARK: The estimators of the regression coefficients in (4.3) can also be derived with the method of least squares (see Section 10), and thus are commonly referred to as the least squares estimators. Under the additional assumption of normality, i.e. if it is assumed that the conditional distribution of Y given X = x is Y | X = x ∼ N(α + βx, σ²), the estimators in (4.3) can also be derived by the method of maximum likelihood (see Section 10).

Proposition 4.1.
The least squares estimators of the regression parameters, given in (4.3), are unbiased. Thus,

    E_β(β̂) = β,  and  E_α(α̂) = α.

We close this section with a proposition which shows that another common estimator, the sample variance, is an unbiased estimator of the population variance.

Proposition 4.2. Let X1, ..., Xn be a sample from a population with variance σ², and let S² = (n − 1)⁻¹ Σᵢ (Xi − X̄)² be the sample variance. Then,

    E(S²) = σ²,  or  E_σ(S²) = σ²,

i.e. S² is an unbiased estimator of σ².

5 The Standard Error and the Mean Square Error

Though unbiasedness is a desirable property, many commonly used estimators have a small, but non-zero, bias. Moreover, there are cases where there is more than one unbiased estimator. The following examples illustrate both possibilities.

Example 5.1. a) Let X1, ..., Xn be a random sample from a population having a continuous symmetric distribution, so that the mean and the median coincide. Then the sample mean X̄, the sample median X̃, and any trimmed mean are unbiased estimators of µ, as are hybrid estimators such as

    median{(Xi + Xj)/2 ; i, j = 1, ..., n, i ≠ j}.

b) The sample standard deviation S is a biased estimator of the population standard deviation σ. c) Let X1, ..., Xn be the life times of a random sample of n valves, and assume that each life time has the exponential distribution with parameter λ (so µ = 1/λ). Then 1/X̄ is a biased estimator of λ = 1/µ, and exp{−500/X̄} is a biased estimator of exp{−500λ} = P(X > 500), the probability that the life of a randomly chosen valve exceeds 500 time units of operation. d) Let X1, ..., Xn be a sample from a normal distribution with parameters µ, σ². Then

    Φ((17.8 − X̄)/S) − Φ((14.5 − X̄)/S)

is a biased estimator of P(14.5 < X < 17.8), where X ∼ N(µ, σ²), and x̂_α = X̄ + S z_α is a biased estimator of the (1 − α)100th percentile of X.
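The method-of-moments (least squares) estimators in (4.3) amount to a few lines of code. The sketch below uses a small made-up data set purely for illustration, and checks the result against `numpy.polyfit`, which fits the same least squares line:

```python
import numpy as np

# Small made-up data set for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Estimators from (4.3): beta-hat = sample covariance / sample variance,
# alpha-hat = ybar - beta-hat * xbar
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

# numpy's degree-1 least squares fit returns (slope, intercept)
b, a = np.polyfit(x, y, 1)
print(beta_hat, alpha_hat)  # agrees with (b, a) from polyfit
```

That the two computations agree reflects the remark following (4.3): for the simple linear model, the method-of-moments and least squares estimators coincide.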
In this section we will describe a criterion, based on the concept of mean square error, for comparing the performance, or quality, of estimators that are not necessarily unbiased. As a first step, we will discuss a criterion for choosing between two unbiased estimators. This criterion is based on the standard error, a term synonymous with standard deviation.

Definition 5.1. The standard error of an estimator θ̂ is its standard deviation, σ_θ̂ = √(σ²_θ̂), and an estimator/estimate of the standard error, σ̂_θ̂ = √(σ̂²_θ̂), is called the estimated standard error.

Example 5.2. a) The standard error, and the estimated standard error, of the estimator p̂ discussed in Example 2.1 are

    σ_p̂ = √(p(1 − p)/n),  and  σ̂_p̂ = √(p̂(1 − p̂)/n).

With the given information that 12 of the 20 cars sustain no visible damage, we have p̂ = 12/20 = 0.6, so that the estimated standard error is

    σ̂_p̂ = √(0.6 × 0.4/20) = 0.11.

b) The standard error, and the estimated standard error, of X̄ are

    σ_X̄ = σ/√n,  and  σ̂_X̄ = S/√n,

where S is the sample standard deviation. If the sample standard deviation of the n = 36 robot reaction times mentioned in Example 2.2 is S = 1.3, the estimated standard error of X̄ in that example is

    σ̂_X̄ = 1.3/√36 = 0.22.

A comparison of unbiased estimators is possible on the basis of their standard errors: Among two unbiased estimators of a parameter θ, choose the one with the smaller standard error. This is the standard error selection criterion for unbiased estimators. The rationale behind this criterion is that a smaller standard error implies that the distribution of the estimator is more concentrated about the true value of θ. This is illustrated in the following figure.

[Figure: the pdfs of two unbiased estimators, θ̂1 and θ̂2; the pdf of θ̂2 is more concentrated about the true value of θ, so θ̂2 is more reliable as an estimator of θ.]

The standard error selection criterion for unbiased estimators will be used in Section 6 to determine whether or not stratified random sampling yields a better estimator of the population mean than simple random sampling.
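The two estimated standard errors in Example 5.2 are easy to compute directly:

```python
import math

# a) Estimated standard error of p-hat (12 of 20 cars with no visible damage)
p_hat = 12 / 20
se_p = math.sqrt(p_hat * (1 - p_hat) / 20)
print(round(se_p, 2))  # 0.11

# b) Estimated standard error of the sample mean (n = 36, S = 1.3)
se_xbar = 1.3 / math.sqrt(36)
print(round(se_xbar, 2))  # 0.22
```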
Depending on the population distribution and on θ, it may be possible to find an unbiased estimator of θ that has the smallest variance among all unbiased estimators of θ. When such an estimator exists it is called a minimum variance unbiased estimator (MVUE). The following proposition gives some examples of MVUEs.

Proposition 5.1. a) If X1, ..., Xn is a sample from a normal distribution, then X̄ is MVUE for µ, and S² is MVUE for σ². b) If X ∼ Bin(n, p), then p̂ = X/n is MVUE for p.

REMARK 1: The fact that X̄ is MVUE for µ when sampling from a normal distribution does not imply that X̄ is always the preferred estimator of µ. For instance, if the population distribution is logistic or Cauchy, then the sample median, X̃, is better than X̄. Estimators such as the various trimmed means and the hybrid estimator discussed in Example 5.1 perform well over a wide range of underlying population distributions.

We now proceed with the definition of the mean square error and the corresponding more general selection criterion.

Definition 5.2. The mean square error (MSE) of an estimator θ̂ for the parameter θ is defined to be

    MSE(θ̂) = E(θ̂ − θ)².

(More exactly, MSE_θ(θ̂) = E_θ(θ̂ − θ)².) The MSE selection criterion says that among two estimators, the one with smaller MSE is preferred.

The mean square error of an unbiased estimator equals its variance, i.e. the square of its standard error. Thus, the MSE selection criterion reduces to the standard error selection criterion when the estimators are unbiased. The next proposition reiterates this, and shows how the MSE criterion incorporates both the standard error and the bias in order to compare estimators that are not necessarily unbiased.

Proposition 5.2. If θ̂ is unbiased for θ, then

    MSE(θ̂) = σ²_θ̂.

In general,

    MSE(θ̂) = σ²_θ̂ + [bias(θ̂)]².

In the next example, the MSE selection criterion is used to decide which is the better of two estimators for the mean value of a uniform population.

Example 5.3. Suppose that X1, . . .
, Xn is a sample from a uniform in (0, θ) population. Since the mean value of such a uniform population is µ = θ/2, and since the sample mean, X̄, is an unbiased estimator of µ, it follows that θ̂1 = 2X̄ is an unbiased estimator of θ. An alternative estimator of θ, the supremum of the population values, is the maximum of the sample values, θ̂2 = X(n). Because the largest observation, X(n), is always smaller than θ, it follows that θ̂2 always underestimates θ, i.e. it is a biased estimator of θ. To decide which of the two is better, we look at their MSEs. Because the variance of a uniform in (0, θ) population is σ² = θ²/12, and θ̂1 = 2X̄ is unbiased, we have

    MSE(θ̂1) = Var(2X̄) = (4/n)σ² = θ²/(3n).

To find MSE(θ̂2), we first find the pdf of θ̂2. Because

    F_θ̂2(y) = P(X(n) ≤ y) = P(X1 ≤ y, X2 ≤ y, ..., Xn ≤ y) = (y/θ)ⁿ, for 0 < y < θ,

the pdf is

    f_θ̂2(y) = (d/dy) F_θ̂2(y) = n yⁿ⁻¹/θⁿ, for 0 < y < θ.

Using this we obtain

    E_θ(θ̂2) = (n/(n + 1)) θ,  E_θ(θ̂2²) = (n/(n + 2)) θ²,  so that  Var(θ̂2) = n θ²/((n + 1)²(n + 2)).

Thus, the bias of θ̂2 is

    E_θ(θ̂2) − θ = −θ/(n + 1).

Combining the above, we obtain

    MSE(θ̂2) = n θ²/((n + 1)²(n + 2)) + θ²/(n + 1)² = (1/(n + 1)²)(n/(n + 2) + 1) θ² = 2θ²/((n + 1)(n + 2)).

It is easy to check that for every sample size n ≥ 3,

    2/((n + 1)(n + 2)) < 1/(3n),

which implies that, according to the MSE selection criterion, the biased estimator θ̂2 is to be preferred over the unbiased θ̂1. In the section on maximum likelihood estimation, the MSE criterion suggests that, with large enough sample sizes, the biased estimators of probabilities and percentiles given in parts c) and d) of Example 5.1 are to be preferred over the corresponding nonparametric estimators, i.e. sample proportions and sample percentiles, provided that the assumption of an exponential distribution made in part c), and of a normal distribution in part d), holds true.

6∗ Simple Random Sampling or Stratified Sampling?
In this section we employ the MSE selection principle to decide whether the nonparametric estimators of the population mean and of probabilities from stratified random samples are to be preferred over those obtained from simple random sampling. Because the estimators involved are unbiased, the MSE selection criterion here coincides with the standard error selection criterion.

Suppose that a component is produced in two production facilities, and the produced components are mixed before being packaged for shipment. One of the two facilities (facility A) has modernized its equipment, resulting in faster production of more reliable components than those of facility B. In particular, suppose that 60% of all components come from facility A and 40% come from facility B. Let X denote the lifetime of a randomly chosen component from the combined, or overall, population of such components. For example, the component can be randomly chosen from a randomly chosen package. We are interested in estimating µ = E(X) and p = P(X > 500) from a sample of size n = 100. We will consider estimation using two different sampling schemes.

First, let X1, ..., X100 be the lifetimes of 100 components obtained by simple random sampling from the overall population of components. Then,

    X̄ = (1/100) Σᵢ Xi,  and  p̂ = (# of Xi > 500)/100,

are the usual nonparametric estimators of µ and p.

The second sampling scheme is the stratified sampling introduced briefly in Chapter 1, Section 1.4. A stratified sample consists of a simple random sample of n1 components from facility A and n2 = 100 − n1 components from facility B. Since 60% of all components come from facility A, it makes sense to take n1 = 60 and n2 = 40. Let XA1, ..., XA60 denote the simple random sample of 60 components from facility A, and XB1, ..., XB40 the corresponding sample from facility B.
Let µA = E(XA), pA = P(XA > 500), µB = E(XB), pB = P(XB > 500) be the mean values and population proportions for each of the two facilities. Then,

    µ = 0.6µA + 0.4µB,  p = 0.6pA + 0.4pB.

The stratified estimators of µ and p, called the stratified sample mean and stratified sample proportion, respectively, are

    X̄s = 0.6X̄A + 0.4X̄B,  p̂s = 0.6p̂A + 0.4p̂B,

where

    X̄A = (1/60) Σᵢ XAi,  p̂A = (# of XAi > 500)/60,

and similarly for X̄B and p̂B.

Using the linearity property of the expected value, it is easy to verify that the stratified sample mean and stratified sample proportion are unbiased estimators of µ and p, respectively. Should they be preferred over the estimators of µ and p based on simple random sampling? According to the MSE criterion, the answer is yes. Indeed, consider the variances of the two estimators of p:

    Var(p̂s) = 0.6² Var(p̂A) + 0.4² Var(p̂B)
             = 0.6² pA(1 − pA)/60 + 0.4² pB(1 − pB)/40
             = 0.6 pA(1 − pA)/100 + 0.4 pB(1 − pB)/100,

    Var(p̂) = p(1 − p)/100
            = 0.6 pA(1 − p)/100 + 0.4 pB(1 − p)/100.

Their difference is

    Var(p̂) − Var(p̂s) = 0.6 (pA/100)(pA − p) + 0.4 (pB/100)(pB − p)
                      = 0.6 (pA/100)(0.4pA − 0.4pB) + 0.4 (pB/100)(0.6pB − 0.6pA)
                      = ((0.6)(0.4)/100)(pA − pB)².

This shows that the variance of the stratified sample proportion is smaller than that of the proportion based on a simple random sample; the two variances are equal only if pA = pB. Thus, on the basis of the MSE selection criterion, the stratified sample proportion is to be preferred over the simple random sampling proportion. Similar (but slightly more complicated) calculations show that Var(X̄) ≥ Var(X̄s).

7 Confidence Intervals

By virtue of the Central Limit Theorem, if the sample size n is large enough, many estimators, θ̂, are approximately normally distributed. Moreover, such estimators are typically unbiased (or nearly unbiased), and their estimated standard error, σ̂_θ̂, typically provides a reliable estimate of σ_θ̂.
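The variance comparison for stratified sampling in Section 6 is easy to confirm numerically: for any choice of pA and pB, the difference of the two variances should equal (0.6)(0.4)(pA − pB)²/100. A small sketch with made-up stratum proportions:

```python
# Numerical check of the Section 6 variance comparison, with made-up values
pA, pB = 0.8, 0.5          # hypothetical P(X > 500) for facilities A and B
p = 0.6 * pA + 0.4 * pB    # overall proportion

var_srs = p * (1 - p) / 100                                      # simple random sample, n = 100
var_strat = 0.6**2 * pA * (1 - pA) / 60 + 0.4**2 * pB * (1 - pB) / 40  # stratified, 60 + 40

diff = var_srs - var_strat
print(diff)  # equals 0.6 * 0.4 * (pA - pB)**2 / 100, and is positive when pA != pB
```

Changing pA and pB shows the difference shrinking to zero as the two strata become alike, in line with the algebraic identity.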
Thus, if n is large enough, many estimators θ̂ satisfy

    θ̂ ~ N(θ, σ̂²_θ̂), approximately,    (7.1)

where θ is the true value of the parameter. For example, this is the case for the nonparametric estimators, moment estimators and many maximum likelihood estimators, which are described in Section 10, and least squares estimators, which are described in Chapter 10. For such estimators, it is customary to report point estimates together with their estimated standard errors. The estimated standard error helps assess the size of the estimation error through the 68-95-99.7% rule of the normal distribution. For example, (7.1) implies that

    |θ̂ − θ| ≤ 2σ̂_θ̂

holds approximately 95% of the time. Alternatively, this can be written as

    θ̂ − 2σ̂_θ̂ ≤ θ ≤ θ̂ + 2σ̂_θ̂,

which gives an interval of plausible values for the true value of θ, with degree of plausibility approximately 95%. Such intervals are called confidence intervals. The abbreviation CI will be used for "confidence interval". Note that if θ̂ is unbiased and we believe that the normal approximation to its distribution is quite accurate, the 95% CI uses z_0.025 = 1.96 instead of 2, i.e.

    θ̂ − 1.96σ̂_θ̂ ≤ θ ≤ θ̂ + 1.96σ̂_θ̂.

The general technique for constructing (1 − α)100% confidence intervals for a parameter θ, based on an approximately unbiased and normally distributed estimator θ̂, consists of two steps:

a) Obtain an error bound which holds with probability 1 − α. This error bound is of the form

    |θ̂ − θ| ≤ z_{α/2} σ̂_θ̂.

b) Convert the error bound into an interval of plausible values for θ of the form

    θ̂ − z_{α/2} σ̂_θ̂ ≤ θ ≤ θ̂ + z_{α/2} σ̂_θ̂,

or, in short-hand notation, θ̂ ± z_{α/2} σ̂_θ̂. The degree of plausibility, or confidence level, of the interval will be (1 − α)100%.

In this section we will discuss the interpretation of CIs, and present nonparametric confidence intervals for population means and proportions, as well as an alternative, normal-based CI for the population mean.
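The two-step recipe can be sketched as a small helper; the point estimate and its estimated standard error are whatever the problem supplies, and `scipy` provides the normal quantile z_{α/2}. The numbers below are the cotton percent-elongation summary statistics that reappear later in Example 7.2:

```python
from scipy.stats import norm

def normal_ci(theta_hat, se_hat, conf=0.95):
    """(1 - alpha)100% CI of the form theta-hat +/- z_{alpha/2} * se-hat."""
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)   # z_{alpha/2}; about 1.96 for a 95% CI
    return theta_hat - z * se_hat, theta_hat + z * se_hat

# n = 56, sample mean 8.17, sample standard deviation 1.42
lo, hi = normal_ci(8.17, 1.42 / 56 ** 0.5)
print(round(lo, 2), round(hi, 2))  # (7.8, 8.54)
```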
The issue of precision in estimation will be considered, and we will see how the sample size can be chosen in order to increase the precision of the estimation of a population mean or proportion. Furthermore, we will discuss the related issue of constructing prediction intervals under the normality assumption, and we will present CIs for other parameters, such as percentiles and the variance. Finally, some methods for fitting models to data will be discussed.

7.1 Interpreting Confidence Intervals

There are two ways to view any CI, depending on whether θ̂ is viewed as an estimator (i.e. a random variable) or as a point estimate. When θ̂ is viewed as a random variable, the CI is a random interval. Hence we can say that the statement "the true value of θ belongs in θ̂ ± z_{α/2} σ̂_θ̂" holds with probability 1 − α. But when θ̂ is viewed as a point estimate, the CI is a fixed interval which either contains the true (and unknown to us) value of θ or it does not. For example, as we will see in the next subsection, the estimate p̂ = 0.6, based on n = 20 Bernoulli trials, leads to the fixed 95% confidence interval (0.38, 0.82) for p. This either contains the true value of p or it does not.

So how is a computed (1 − α)100% CI to be interpreted? To answer this question, think of the process of constructing a (1 − α)100% CI as performing a Bernoulli trial where the outcome is "success" if the true value of θ belongs in the CI. Thus, the probability of "success" is 1 − α, but we are not able to observe the outcome of the trial. Not knowing the outcome, the only thing we can say is that our degree of confidence that the outcome was "success" is measured by (1 − α)100%.

7.2 Nonparametric CIs for Means and Proportions

In this subsection we will describe a CI for the mean of a population when no parametric assumption about the population distribution is made. As a special case, we will consider CIs for a population proportion.

Let X1, . . .
, Xn denote a simple random sample from a population with mean value µ and variance σ². If the sample size is large (n > 30), the Central Limit Theorem asserts that

X̄ ∼ N(µ, σ²/n), approximately.    (7.2)

Moreover, if n is large enough, S is a good estimator of σ. Thus,

(X̄ − µ)/(S/√n) ∼ N(0, 1), approximately.    (7.3)

Relation (7.3) implies that the bound on the error of estimation

|X̄ − µ| ≤ zα/2 S/√n    (7.4)

holds with approximate probability 1 − α, and this leads to the (approximate) (1 − α)100% confidence interval

X̄ ± zα/2 S/√n    (7.5)

for µ. (In the rare cases where the true value of σ is known, we can use it instead of its estimate S.) This is a nonparametric CI for the population mean, because it does not use any distributional assumptions (although it does assume, as the Central Limit Theorem does, that the population mean and variance exist and are finite).

Example 7.1. In Examples 2.2 and 5.2, a sample of n = 36 measured reaction times yielded a sample average of X̄ = 5.4 and a sample standard deviation of S = 1.3. Since n = 36 is large enough to apply the Central Limit Theorem, the available information yields an approximate 68% CI for the true value of µ (the mean response time of the robot in the conceptual population of all malfunctions of that type) of

5.4 − 1.3/√36 ≤ µ ≤ 5.4 + 1.3/√36, or 5.18 ≤ µ ≤ 5.62.

Example 7.2. A random sample of n = 56 cotton samples gave an average percent elongation of 8.17 and a sample standard deviation of S = 1.42. Calculate a 95% CI for µ, the true average percent elongation.

Solution. The sample size n = 56 is large enough for the application of the Central Limit Theorem, which asserts that the distribution of the sample mean, X̄, is well approximated by the normal distribution. Thus, the error bound (7.4) and the resulting nonparametric CI (7.5) for µ can be used. The information given yields the CI

X̄ ± zα/2 S/√n = 8.17 ± 1.96 (1.42/√56) = 8.17 ± 0.37 = (7.80, 8.54).

A special case of the nonparametric CI (7.5) arises when X1 , . .
. , Xn is a sample from a Bernoulli population. Thus, each Xi takes the value 1 or 0, the population mean is µ = p (the probability of 1), and the population variance is σ² = p(1 − p). When sampling from a Bernoulli population, we are typically given only the binomial random variable T = X1 + · · · + Xn, or the observed proportion of 1s (which is the sample average), X̄ = p̂. By the Central Limit Theorem, the normal distribution provides an adequate approximation to the distribution of T, or that of p̂, whenever np ≥ 5 and n(1 − p) ≥ 5, and thus (7.2) is replaced by

p̂ ∼ N(p, p(1 − p)/n), approximately.

The above approximate distribution of p̂ leads to the (approximate) (1 − α)100% CI

p̂ ± zα/2 √(p̂(1 − p̂)/n).    (7.6)

Note that, because p is unknown, the condition np ≥ 5 and n(1 − p) ≥ 5 for applying the CLT is replaced in practice by np̂ ≥ 5 and n(1 − p̂) ≥ 5, i.e. at least five 1s and at least five 0s.

Example 7.3. In the car crash experiment of Example 2.1, it was given that 12 of 20 cars sustained no visible damage. Thus, the number of cars that did not sustain visible damage and the number of those that did both exceed five. With this information, we have p̂ = 12/20 = 0.6, so that an approximate 95% CI of the form (7.6) for the true value of p (the population proportion of cars that sustain no visible damage) is

0.6 − 1.96 √(0.6 × 0.4/20) ≤ p ≤ 0.6 + 1.96 √(0.6 × 0.4/20),

or, rounding outward, 0.38 ≤ p ≤ 0.82.

Example 7.4. The point estimate of the probability that a certain component functions properly for at least 1000 hours, based on a sample of n = 100 such components, is p̂ = 0.91. Give a 95% CI for p.

Solution. The sample size conditions for applying the aforementioned CI for a binomial parameter p hold here. Thus the desired CI is

p̂ ± z0.025 √(p̂(1 − p̂)/n) = 0.91 ± 1.96 √(0.91 × 0.09/100) = 0.91 ± 0.056.

7.3 Confidence Intervals for a Normal Mean

In this subsection we will assume that X1 , . . .
, Xn is a sample from a population having a normal distribution, and we will present an alternative confidence interval for µ which does not require the sample size to be large. In Chapter 5 we saw that, if X1 , . . . , Xn is a random sample from a normal population, then by subtracting the population mean from the sample mean, X̄, and dividing the difference by the estimated standard error, S/√n, we obtain a random variable which has a t-distribution with n − 1 degrees of freedom. That is,

(X̄ − µ)/(S/√n) ∼ tn−1.    (7.7)

Note that (7.7) gives the exact distribution of (X̄ − µ)/(S/√n), whereas (7.3) gives the approximate distribution when n is sufficiently large. Using relation (7.7) we obtain an alternative bound on the error of estimation of µ, namely

|X̄ − µ| ≤ tn−1,α/2 S/√n,    (7.8)

which holds with probability 1 − α. We note again that the probability for this error bound is exact, whereas the error bound (7.4) holds with probability approximately 1 − α, provided that n is sufficiently large. Of course, (7.8) requires the normality assumption, whereas (7.4) does not. The notation tn−1,α/2 is similar to the zα notation, i.e. it corresponds to the 100(1 − α/2)th percentile of the t-distribution with n − 1 degrees of freedom.

[Figure: p.d.f. of the t-distribution with ν degrees of freedom; tν,α cuts off area α in the upper tail.]

Note that, as the degrees of freedom ν = n − 1 get large, tν,α/2 approaches zα/2; for example, the 95th percentiles of the t-distributions with ν = 9, 19, 60 and 120 are 1.833, 1.729, 1.671 and 1.658, respectively, while z0.05 = 1.645. Selected percentiles of t-distributions are given in the t-table. The error bound (7.8) leads to the following 100(1 − α)% CI for the normal mean:

(X̄ − tn−1,α/2 S/√n, X̄ + tn−1,α/2 S/√n).    (7.9)

Example 7.5. The mean weight loss of n = 16 grinding balls after a certain length of time in mill slurry is 3.42 g, with S = 0.68 g. Construct a 99% CI for the true mean weight loss.

Solution. Here α = 0.01 and tn−1,α/2 = t15,0.005 = 2.947.
Thus a 99% CI for µ is

X̄ ± tn−1,α/2 S/√n = 3.42 ± 2.947 (0.68/√16), or 2.92 < µ < 3.92.

7.4 The Issue of Precision

Precision in the estimation of a parameter θ is quantified by the size of the bound on the error of estimation |θ̂ − θ|. Equivalently, it can be quantified by the length of the corresponding CI, which is twice the size of the error bound. A smaller error bound, or shorter CI, implies more precise estimation. The bounds we have seen on the error of estimation of a population mean µ and a population proportion p are of the form

|X̄ − µ| ≤ zα/2 S/√n (nonparametric case, n > 30),
|X̄ − µ| ≤ tn−1,α/2 S/√n (normal case, any n),
|p̂ − p| ≤ zα/2 √(p̂(1 − p̂)/n) (np̂ ≥ 5, n(1 − p̂) ≥ 5).

The first and third of these bounds hold with probability approximately 1 − α, while the second holds with probability exactly 1 − α if the underlying population has a normal distribution. In the rare cases when σ can be considered known, the S of the first error bound is replaced by σ. The above expressions show that the size of the error bound (or length of the CI) depends on the sample size n. In particular, a larger sample size yields a smaller error bound, and thus more precise estimation. Shortly we will see how to choose n in order to achieve a prescribed degree of precision. It can also be seen that the probability with which the error bound holds, i.e. 1 − α, affects the size of the error bound, or length of the CI. For example, a 90% CI (α = 0.1, so α/2 = 0.05) is narrower than a 95% CI (α = 0.05, so α/2 = 0.025), which in turn is narrower than a 99% CI (α = 0.01, so α/2 = 0.005). This is so because

z0.05 = 1.645 < z0.025 = 1.96 < z0.005 = 2.575,

and similar inequalities hold for the t-critical values. The increase of the length of the CI with the level of confidence is to be expected. Indeed, we are more confident that the wider CI will contain the true value of the parameter.
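The three bounds above translate directly into code. Here is a minimal Python sketch (the function names mean_ci and prop_ci are ours; the critical value, zα/2 or tn−1,α/2, must be looked up in a table or computed separately):

```python
import math

def mean_ci(xbar, s, n, crit):
    """CI for the mean: x̄ ± crit·s/√n, where crit is z_{α/2}
    (large-sample case) or t_{n−1,α/2} (normal population, any n)."""
    half = crit * s / math.sqrt(n)
    return (xbar - half, xbar + half)

def prop_ci(p_hat, n, z):
    """CI (7.6) for a proportion: p̂ ± z·√(p̂(1−p̂)/n).
    Requires np̂ ≥ 5 and n(1 − p̂) ≥ 5."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# Example 7.2: large-sample 95% CI for the mean elongation
print(mean_ci(8.17, 1.42, 56, 1.96))   # ≈ (7.80, 8.54)
# Example 7.5: normal-theory 99% CI, t_{15,0.005} = 2.947
print(mean_ci(3.42, 0.68, 16, 2.947))  # ≈ (2.92, 3.92)
# Example 7.4: 95% CI for p
print(prop_ci(0.91, 100, 1.96))        # ≈ 0.91 ± 0.056
```

Only the critical value distinguishes the z-interval from the t-interval, which is why a single helper covers both cases.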
However, we rarely want to reduce the length of the CI by decreasing the level of confidence. We will now deal with the main learning objective of this section, which is how to choose n in order to achieve a prescribed degree of precision in the estimation of µ and of p. As we will see, the main obstacle to a completely satisfactory answer to this question lies in the fact that the estimated standard error, which enters all expressions of error bounds, is unknown prior to the data collection. For the two estimation problems, we will discuss separately ways to bypass this difficulty.

7.4.1 Sample size determination for µ

Consider first the task of choosing n for precise estimation of µ, in the rare case that σ is known. In that case, if normality is assumed, or if we know that the needed sample size will be > 30, the required n for achieving a prescribed length L of the (1 − α)100% CI is found by equating the expression for the length of the CI to L and solving the resulting equation for n. That is, we solve

2 zα/2 σ/√n = L

for n. The solution is

n = (2 zα/2 σ / L)².    (7.10)

More likely than not, the solution will not be an integer, in which case the recommended procedure is to round up. The practice of rounding up guarantees that the prescribed objective will be more than met.

Example 7.6. The time to response (in milliseconds) to an editing command with a new operating system is normally distributed with an unknown mean µ and σ = 25. We want a 95% CI for µ of length L = 10 milliseconds. What sample size n should be used?

Solution. For a 95% CI, α = 0.05, α/2 = 0.025 and z0.025 = 1.96. Thus, from formula (7.10) we obtain

n = (2 × 1.96 × 25 / 10)² = 96.04,

which is rounded up to n = 97.

Typically, however, σ is unknown, and thus sample size determinations must rely on some preliminary approximation, Sprl , to it.
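Formula (7.10), together with the rounding-up rule, can be sketched in a few lines of Python (the function name n_for_mean is ours):

```python
import math

def n_for_mean(sigma, length, z=1.96):
    """Sample size (7.10) for a CI of prescribed length L:
    n = (2·z·σ/L)², rounded up so the objective is more than met."""
    return math.ceil((2 * z * sigma / length) ** 2)

# Example 7.6: σ = 25 ms, desired 95% CI length L = 10 ms
print(n_for_mean(25, 10))  # → 97
```

When σ is unknown, the same function would be called with a preliminary approximation Sprl in place of sigma, with the caveats discussed next.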
Two commonly used methods for obtaining this approximation are:

a) If the range of the population values is known, then σ can be approximated by dividing the range by 3.5 or 4. That is, use

Sprl = range/3.5, or Sprl = range/4,

instead of σ in formula (7.10). This approximation is suggested by the standard deviation of a uniform random variable on (a, b), which is σ = (b − a)/√12 = (b − a)/3.464.

b) Alternatively, the approximation can be based on the sample standard deviation, Sprl , of a preliminary sample.

The reason why this is not a completely satisfactory solution is that the standard deviation of the final sample, upon which the CI will be calculated, will differ from the preliminary approximation, regardless of how this approximation was obtained. Thus the prescribed precision objective might not be met, and some trial-and-error iteration might be involved. The trial-and-error process gets slightly more complicated with the t-distribution CIs, because the t-percentiles change with the sample size.

7.4.2 Sample size determination for p

Consider next the selection of sample size for meeting a prescribed level of precision in the estimation of the binomial parameter p. As noted previously, the required n for achieving a prescribed length L of the (1 − α)100% CI is found by equating the expression for the length of the CI for p to L, and solving the resulting equation for n. That is, we solve

2 zα/2 √(p̂(1 − p̂)/n) = L

for n. The solution is

n = 4 zα/2² p̂(1 − p̂) / L².    (7.11)

The problem is that the value of p̂ is not known before the sample is collected, and thus sample size determinations must rely on some preliminary approximation, p̂prl , to it. Two commonly used methods for obtaining this approximation are:

a) When preliminary information about p exists. This preliminary information may come either from a small pilot sample or from expert opinion.
In either case, the preliminary p̂prl is entered in (7.11) in place of p̂ for the sample size calculation. Thus,

n = 4 zα/2² p̂prl (1 − p̂prl) / L²,    (7.12)

and we round up.

b) When no preliminary information about p exists. In this case, we replace p̂(1 − p̂) in (7.11) by 0.25. The rationale for doing so is that p̂(1 − p̂) ≤ 0.25; thus, by using the larger value 0.25, the calculated sample size will be at least as large as needed for meeting the precision specification. This gives

n = zα/2² / L².    (7.13)

Example 7.7. A preliminary sample gave p̂prl = 0.91. How large should n be to estimate the probability of interest to within 0.01 with 95% confidence?

Solution. "To within 0.01" is another way of saying that the 95% bound on the error of estimation should be 0.01, or the desired CI should have length L = 0.02. Since we have preliminary information, we use (7.12):

n = 4(1.96)²(0.91)(0.09)/(0.02)² = 3146.27.

This is rounded up to 3147.

Example 7.8. A new method of pre-coating fittings used in oil, brake and other fluid systems in heavy-duty trucks is being studied. How large an n is needed to estimate the proportion of fittings that leak to within 0.02 with 90% confidence? (No prior information is available.)

Solution. Here we have no preliminary information about p. Thus, we apply formula (7.13) and obtain

n = zα/2²/L² = (1.645)²/(0.04)² = 1691.27.

This is rounded up to 1692.

8 Prediction Intervals

The meaning of the word prediction is related to, but distinct from, that of estimation. The latter is used when we are interested in learning the value of a population or model parameter, while the former is used when we want to learn about the value that a future observation might take. For a concrete example, suppose you contemplate eating a hot dog and you wonder about the amount of fat in the hot dog which you will eat.
This is different from the question "what is the expected (mean) amount of fat in hot dogs?" To further emphasize the difference between the two, suppose that the amount of fat in a randomly selected hot dog is known to be N(20, 9). Then there are no unknown parameters to be estimated; in particular, we know that the expected amount of fat in hot dogs is 20 g. Still, the amount of fat in the hot dog which you will eat is unknown, simply because it is a random variable. How do we predict it? According to well-accepted criteria, the best point predictor of a normal random variable with mean µ is µ. A (1 − α)100% prediction interval, or PI, is an interval that contains the random variable (which is being predicted) with probability 1 − α. If both µ and σ are known, then a (1 − α)100% PI is µ ± zα/2 σ. In our particular example, X ∼ N(20, 9), so the best point predictor of X is 20, and a 95% PI is 20 ± (1.96)(3) = (14.12, 25.88). When µ and σ are unknown (as is typically the case), we use a sample X1 , . . . , Xn to estimate µ and σ by X̄ and S, respectively. Then, as the best point predictor of a future observation, we use X̄. But now the prediction interval (always assuming normality) must take into account the variability of X̄ and S as estimators of µ and σ. Doing so yields the following (1 − α)100% PI for the next observation X:

( X̄ − tα/2,n−1 S √(1 + 1/n), X̄ + tα/2,n−1 S √(1 + 1/n) ).    (8.1)

In the above formula, the variability of X̄ is accounted for by the 1/n term, and the variability of S is accounted for by the use of the t-percentiles.

Example 8.1. The fat content measurements from a sample of n = 10 hot dogs gave a sample mean and sample standard deviation of X̄ = 21.9 and S = 4.134. Give a 95% PI for the fat content of the next hot dog to be sampled.

Solution. Using the given information in formula (8.1), we obtain the PI

X̄ ± t0.025,9 S √(1 + 1/n) = (12.09, 31.71).
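Formula (8.1) can be sketched directly in Python (the function name pred_interval is ours; the t-percentile is supplied by hand from a table):

```python
import math

def pred_interval(xbar, s, n, t_crit):
    """Normal-theory prediction interval (8.1) for the next observation:
    x̄ ± t_{α/2,n−1}·s·√(1 + 1/n)."""
    half = t_crit * s * math.sqrt(1 + 1 / n)
    return (xbar - half, xbar + half)

# Example 8.1: n = 10, x̄ = 21.9, s = 4.134, t_{0.025,9} = 2.262
lo, hi = pred_interval(21.9, 4.134, 10, 2.262)
print(round(lo, 2), round(hi, 2))  # ≈ 12.09 31.71
```

Note the contrast with mean_ci: the factor √(1 + 1/n) does not shrink to zero as n grows, because the next observation remains random no matter how much data we collect.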
9* Other Confidence Intervals

In this section we will present confidence intervals for other parameters of interest, such as the median and other percentiles, the variance, and regression parameters. In addition, we will give a general method for constructing CIs for functions of parameters. The techniques for constructing some of these CIs differ from the one used in Section 7.

9.1 Nonparametric CIs for percentiles

Let X1 , . . . , Xn denote a sample from a population having a continuous distribution, and let xp denote the (1 − p)100th percentile. The basic idea for constructing a nonparametric CI for xp is to associate a Bernoulli trial Yi with each observation Xi:

Yi = 1 if Xi > xp, and Yi = 0 if Xi < xp.

Thus, the probability of a 1 (or success) in each Bernoulli trial is p. Let Y = Σi Yi be the Binomial(n, p) random variable, and let X(1) < · · · < X(n) denote the ordered sample values. Then the events

X(k) < xp < X(k+1), X(k) < xp, xp < X(k+1)

are equivalent to

Y = n − k, Y ≤ n − k, Y ≥ n − k,

respectively. Nonparametric (1 − α)100% CIs for xp will be of the form X(a) < xp < X(b), where the indices a, b are found from the requirements that

P(xp < X(a)) = α/2,
P(X(b) < xp) = α/2.

These requirements can equivalently be expressed in terms of Y as

P(Y ≥ n − a + 1) = α/2,    (9.1)
P(Y ≤ n − b) = α/2,    (9.2)

and thus a and b can be found using either the binomial tables (since Y ∼ Binomial(n, p)) or the normal approximation to the binomial. Using the normal approximation with continuity correction, a and b are found from:

(n − a − np + 0.5)/√(np(1 − p)) = zα/2,  (n − b − np + 0.5)/√(np(1 − p)) = −zα/2.

The special case of p = 0.5, which corresponds to the median, deserves separate consideration. In particular, the (1 − α)100% CIs for the median x0.5 will be of the form

X(a) < x0.5 < X(n−a+1),    (9.3)

so that only a must be found.
To see that this is so, note that if p = 0.5 then Y, the number of 1s, has the same distribution as n − Y, the number of 0s (both are Binomial(n, 0.5)). Thus,

P(Y ≥ n − a + 1) = P(n − Y ≥ n − a + 1) = P(Y ≤ a − 1),

which implies that if a, b satisfy (9.1), (9.2), respectively, then they are related by n − b = a − 1, or b = n − a + 1. The a in relation (9.3) can be found from the requirement that

P(Y ≤ a − 1) = α/2.    (9.4)

Example 9.1. Let X1 , . . . , X25 be a sample from a continuous population. Find the confidence level of the following CI for the median:

(X(8) , X(18)).

Solution. First note that the CI (X(8) , X(18)) is of the form (9.3) with a = 8. According to formula (9.4) and the binomial tables,

α = 2P(Y ≤ 7) = 2(0.022) = 0.044.

Thus, the confidence level of the CI (X(8) , X(18)) is (1 − α)100% = 95.6%.

9.2 CIs for the variance

9.2.1 CIs for a normal variance

Let X1 , . . . , Xn be a random sample from a population whose distribution belongs in the normal family. In Chapter 5 we saw that, for normal samples, the sampling distribution of S² is a multiple of a χ²n−1 random variable. Namely,

(n − 1)S²/σ² ∼ χ²n−1.

This fact implies that

χ²1−α/2,n−1 < (n − 1)S²/σ² < χ²α/2,n−1

will be true (1 − α)100% of the time, where χ²α/2,n−1 and χ²1−α/2,n−1 denote percentiles of the χ²n−1 distribution, cutting off upper-tail areas of α/2 and 1 − α/2, respectively.

[Figure: p.d.f. of the χ²n−1 distribution, with area α/2 to the left of χ²1−α/2,n−1 and area α/2 to the right of χ²α/2,n−1.]

Note that the bounds on the error of estimation of σ² by S² are given in terms of the ratio S²/σ². After some algebraic manipulation, we obtain that

(n − 1)S²/χ²α/2,n−1 < σ² < (n − 1)S²/χ²1−α/2,n−1    (9.5)

is true (1 − α)100% of the time. Selected percentiles of χ² distributions are given in the chi-square table.

Example 9.2. An optical firm purchases glass to be ground into lenses. As it is important that the various pieces of glass have nearly the same index of refraction, interest lies in the variability.
A random sample of n = 20 measurements yields S² = 1.2 × 10⁻⁴. Find a 95% CI for σ.

Solution. Here n − 1 = 19, χ²0.975,19 = 8.906, and χ²0.025,19 = 32.852. Thus, according to (9.5),

(19)(1.2 × 10⁻⁴)/32.852 < σ² < (19)(1.2 × 10⁻⁴)/8.906.

It follows that a 95% CI for σ is

√[(19)(1.2 × 10⁻⁴)/32.852] < σ < √[(19)(1.2 × 10⁻⁴)/8.906],

or 0.0083 < σ < 0.0160.

9.2.2 Nonparametric CIs for the variance

9.3 Nonparametric CIs for functions of parameters

9.4 CIs for Regression Parameters

In this subsection we will describe confidence intervals for the slope parameter, β, of the simple linear regression model for the regression function of Y on X,

µY|X(x) = E(Y|X = x) = α + βx.

As seen in Example 4.2, if (X1 , Y1), . . . , (Xn , Yn) are the observations to be made, the LSE estimator of the slope is

β̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = [n ΣXiYi − (ΣXi)(ΣYi)] / [n ΣXi² − (ΣXi)²].

In Proposition 4.1 we saw that β̂ is unbiased for β. The following proposition gives the standard error and the estimated standard error of β̂, and, under the additional assumption of normality, it gives the distribution of β̂.

Proposition 9.1. Assume that the conditional variance of Y, given that X = x, is σ². Let (X1 , Y1), . . . , (Xn , Yn) be the observations to be made, and let β̂ be the LSE of the slope parameter β. Then:

1. The standard error and the estimated standard error of β̂ are

σβ̂ = √[ σ² / (ΣXi² − (ΣXi)²/n) ], and σ̂β̂ = √[ S² / (ΣXi² − (ΣXi)²/n) ],

respectively, where

S² = (1/(n − 2)) Σi (Yi − α̂ − β̂Xi)²

is the estimator of σ².

2. Assume now, in addition, that the conditional distribution of Y given X = x is normal, i.e. Y|X = x ∼ N(α + βx, σ²). Then β̂ has a normal distribution. Thus,

β̂ ∼ N(β, σβ̂²), or (β̂ − β)/σβ̂ ∼ N(0, 1).

3. Under the above assumption of normality,

(β̂ − β)/σ̂β̂ ∼ tn−2.
REMARK: The quantity Σi (Yi − α̂ − β̂Xi)², which is used in the definition of the estimator S² of σ², is called the error sum of squares, and is denoted by SSE. A computational formula for SSE is

SSE = ΣYi² − α̂ ΣYi − β̂ ΣXiYi.    (9.6)

Part 3 of Proposition 9.1 can be used for constructing bounds on the estimation error, β̂ − β, and CIs for β, just as relation (7.7) was used for such purposes in the estimation of a normal mean. Thus, the bound

|β̂ − β| ≤ tn−2,α/2 σ̂β̂    (9.7)

on the error of estimation of β holds with probability 1 − α. This bound leads to a 100(1 − α)% CI for the slope β of the true regression line of the form

β̂ ± tn−2,α/2 σ̂β̂.    (9.8)

Example 9.3. The data in this example come from a study on the dependence of Y = propagation of an ultrasonic stress wave through a substance on X = tensile strength of the substance. The n = 14 observations are:

X: 12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105
Y: 3.3, 3.2, 3.4, 3.0, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2.0, 2.3, 2.1

Assuming that the regression function of Y on X is described by the simple linear regression model, compute the LSEs α̂ and β̂, and construct a 95% CI for β.

Solution. With the given data, ΣXi = 890, ΣXi² = 67,182, ΣYi = 37.6, ΣYi² = 103.54 and ΣXiYi = 2234.30. From this we get β̂ = −0.014711 and α̂ = 3.620905. Moreover, the point estimate of σ² is

S² = SSE/(n − 2) = [103.54 − α̂(37.6) − β̂(2234.30)]/12 = 0.26246/12 = 0.02187.

Using the above, the estimated standard error of β̂ is

σ̂β̂ = √[S²/(ΣXi² − (ΣXi)²/n)] = √[0.02187/(67,182 − 890²/14)] = 0.001436.

Finally, the 95% CI for β is

−0.014711 ± t0.025,12 × 0.001436 = −0.014711 ± 2.179 × 0.001436 = −0.014711 ± 0.00313 = (−0.01784, −0.01158).

10* Methods for Fitting Models to Data

We have already seen the method of moments, as it applies to fitting both distribution models and the simple linear regression model.
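As a quick check of Example 9.3, Proposition 9.1 and the CI (9.8) can be computed directly in Python (the function name slope_ci is ours; t0.025,12 = 2.179 is supplied by hand):

```python
import math

def slope_ci(x, y, t_crit):
    """LSE of the slope with its estimated standard error (Proposition 9.1),
    and the resulting CI (9.8): β̂ ± t_{n−2,α/2}·σ̂_β̂."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    beta = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    alpha = sy / n - beta * sx / n
    sse = syy - alpha * sy - beta * sxy          # computational formula (9.6)
    s2 = sse / (n - 2)                           # estimator of σ²
    se = math.sqrt(s2 / (sxx - sx * sx / n))     # estimated standard error of β̂
    return beta - t_crit * se, beta + t_crit * se

# Example 9.3 data; t_{0.025,12} = 2.179
X = [12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105]
Y = [3.3, 3.2, 3.4, 3.0, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2.0, 2.3, 2.1]
print(slope_ci(X, Y, 2.179))
```

Up to rounding, this reproduces the estimates of Example 9.3 (β̂ ≈ −0.0147, CI ≈ (−0.0178, −0.0116)).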
In this section we will present two other popular methods for fitting models to data: the method of least squares and the method of maximum likelihood.

10.1 The Method of Least Squares

In this subsection we will present the method of least squares, which is the most common method for fitting regression models. We will describe this method for the case of fitting the simple linear regression model, namely the model which assumes that the regression function of Y on X is

µY|X(x) = E(Y|X = x) = α + βx.

The estimators we will obtain, called the least squares estimators, are the same as the moment estimators that were derived in Example 4.2. To explain the LS method, consider the problem of deciding which of two lines fits the data better.

[Figure: a typical data set with two candidate lines, Line 1 and Line 2, and the vertical distance of one point from Line 1 indicated.]

To answer the question of which of two lines fits the data better, one must first adopt a principle on the basis of which to judge the quality of a fit. The principle we will use is the principle of least squares. According to this principle, the quality of the fit of a line to data (x1 , y1), . . . , (xn , yn) is judged by the sum of the squared vertical distances of each point (xi , yi) from the line. The line for which this sum of squared vertical distances is smaller is said to provide a better fit to the data. The least squares estimates of the intercept, α, and the slope, β, are the intercept and the slope, respectively, of the best-fitting line, i.e. of the line with the smallest sum of squared vertical distances. The best-fitting line is also called the estimated regression line. Since the vertical distance of (xi , yi) from a line a + bx is yi − (a + bxi), the method of least squares finds the values α̂, β̂ which minimize

Σi (yi − a − bxi)²

with respect to a, b.
This minimization problem has a simple closed-form solution:

β̂ = [n Σxiyi − (Σxi)(Σyi)] / [n Σxi² − (Σxi)²],  α̂ = ȳ − β̂x̄.

Thus, the estimated regression line is µ̂Y|X(x) = α̂ + β̂x.

Example 10.1. With n = 10 data points on X = stress applied and Y = time to failure, we have the summary statistics Σxi = 200, Σxi² = 5412.5, Σyi = 484, Σxiyi = 8407.5. Thus the best-fitting line has slope and intercept

β̂ = [10(8407.5) − (200)(484)]/[10(5412.5) − (200)²] = −0.900885,
α̂ = (484/10) − (−0.900885)(200/10) = 66.4177,

respectively.

10.2 The Method of Maximum Likelihood (ML)

The method of ML estimates θ by addressing the question "what value of the parameter is most likely to have generated the data?" The answer to this question, which is the ML estimator (MLE), is obtained by maximizing (with respect to the parameter) the so-called likelihood function, which is simply the joint p.d.f. (or p.m.f.) of X1 , . . . , Xn evaluated at the sample points:

lik(θ) = Πi f(xi|θ).

Typically, it is more convenient to maximize the logarithm of the likelihood function, which is called the log-likelihood function. Since the logarithm is a monotone function, this is equivalent to maximizing the likelihood function.

Example 10.2. A Bernoulli experiment with outcomes 0 or 1 is repeated independently 20 times. If we observe five 1s, find the MLE of the probability of 1, p.

Solution. Here the observed random variable X has a Binomial(n = 20, p) distribution. Thus,

P(X = 5) = C(20, 5) p⁵(1 − p)¹⁵.

The value of the parameter which is most likely to have generated the data X = 5 is the one that maximizes this probability, which, in this case, is the likelihood function. The log-likelihood is

ln P(X = 5) = ln C(20, 5) + 5 ln(p) + 15 ln(1 − p).

Setting its first derivative to zero yields the MLE p̂ = 5/20. In general, the MLE of the binomial probability p is p̂ = X/n.

Example 10.3. Let X1 = x1 , . . .
, Xn = xn be a sample from a population having the exponential distribution, i.e. f(x|λ) = λe^(−λx). Find the MLE of λ.

Solution. The likelihood function here is

lik(λ) = λe^(−λx1) · · · λe^(−λxn) = λⁿ e^(−λΣxi),

and the first derivative of the log-likelihood function is

(d/dλ)[n ln(λ) − λ ΣXi] = n/λ − ΣXi.

Setting this to zero yields λ̂ = n/ΣXi = 1/X̄ as the MLE of λ.

Example 10.4. The lifetime of a certain component is assumed to have the exp(λ) distribution. A sample of n components is tested. Due to time constraints, the test is terminated at time L. So instead of observing the lifetimes X1 , . . . , Xn we observe Y1 , . . . , Yn, where

Yi = Xi if Xi ≤ L, and Yi = L if Xi > L.

We want to estimate λ.

Solution. For simplicity, set Y1 = X1 , . . . , Yk = Xk, Yk+1 = L, . . . , Yn = L. The likelihood function is

λe^(−λY1) · · · λe^(−λYk) · e^(−λYk+1) · · · e^(−λYn),

and the log-likelihood is

[ln(λ) − λY1] + · · · + [ln(λ) − λYk] − λL − · · · − λL = k ln(λ) − λ Σ(i=1..k) Yi − (n − k)λL.

Setting the derivative with respect to λ to zero and solving gives the MLE

λ̂ = k / [Σ(i=1..k) Yi + (n − k)L] = k / Σ(i=1..n) Yi = k/(nȲ).

Example 10.5. Let X1 = x1 , . . . , Xn = xn be a sample from a population having the uniform distribution on (0, θ). Find the MLE of θ.

Solution. Here f(x|θ) = 1/θ if 0 < x < θ, and 0 otherwise. Thus the likelihood function is 1/θⁿ provided 0 < xi < θ for all i, and is 0 otherwise. This is maximized by taking θ as small as possible. However, if θ is smaller than max(X1 , . . . , Xn), the likelihood function is zero. Thus the MLE is θ̂ = max(X1 , . . . , Xn).

Theorem 10.1. (Optimality of MLEs) Under smoothness conditions on f(x|θ), when n is large, the MLE θ̂ has a sampling distribution which is approximately normal, with mean value equal to (or approximately equal to) the true value of θ, and variance nearly as small as that of any other estimator. Thus, the MLE θ̂ is approximately a MVUE of θ.
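The censored-data MLE of Example 10.4 is easy to compute once the data are in hand. A minimal sketch, using hypothetical lifetimes (the function name censored_exp_mle and the numbers are ours):

```python
def censored_exp_mle(lifetimes, cutoff):
    """MLE of λ for exponential lifetimes censored at time L (Example 10.4):
    λ̂ = k / (Σ Y_i + (n − k)·L), where k is the number of uncensored units
    and the Y_i are the observed (possibly censored) values."""
    y = [min(t, cutoff) for t in lifetimes]    # observed values Y_i
    k = sum(t <= cutoff for t in lifetimes)    # uncensored count
    return k / sum(y)

# Hypothetical data: 5 components, test terminated at L = 10;
# the last two lifetimes are censored at 10.
print(censored_exp_mle([2.0, 4.5, 7.0, 12.0, 15.0], 10.0))
```

Note that the censored units still contribute to the denominator through the factor (n − k)·L, which is why ignoring them would overestimate λ.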
REMARK: Among the conditions needed for the validity of Theorem 10.1 is that the set of x-values for which f(x|θ) > 0 should not depend on θ. Thus, the sampling distribution of the MLE θ̂ = max(X1 , . . . , Xn) of Example 10.5 is not approximately normal, even for large n. However, application of the MSE criterion yields that the biased estimator max(X1 , . . . , Xn) should be preferred over the unbiased estimator 2X̄ if n is sufficiently large. Moreover, in this case, it is not difficult to remove the bias of θ̂ = max(X1 , . . . , Xn).

Theorem 10.2. (Invariance of MLEs) If θ̂ is the MLE of θ, and we are interested in estimating a function ϑ = g(θ) of θ, then ϑ̂ = g(θ̂) is the MLE of ϑ. Thus, ϑ̂ has the optimality of MLEs stated in Theorem 10.1.

According to Theorem 10.2, the estimators given in Example 5c), d) are optimal. The following examples revisit some of them.

Example 10.6. Consider the setting of Example 10.3, but suppose we are interested in the mean lifetime. For the exponential distribution, µ = 1/λ. (So here θ = λ, ϑ = µ and ϑ = g(θ) = 1/θ.) Thus µ̂ = 1/λ̂ = X̄ is the MLE of µ.

Example 10.7. Let X1 , . . . , Xn be a sample from N(µ, σ²). Estimate: a) P(X ≤ 400), and b) x0.1.

Solution. a) ϑ = P(X ≤ 400) = Φ((400 − µ)/σ) = g(µ, σ²). Thus

ϑ̂ = g(µ̂, σ̂²) = Φ((400 − X̄)/σ̂).

b) ϑ̂ = x̂0.1 = µ̂ + σ̂z0.1 = X̄ + σ̂z0.1.

Note: As remarked also in Section 6.2, Example 10.7 shows that the estimator we choose depends on what assumptions we are willing to make. If we do not assume normality (or any other distribution), a) P(X ≤ 400) would be estimated by p̂, the proportion of Xi's that are ≤ 400, and b) x0.1 would be estimated by the sample 90th percentile. If the normality assumption is correct, the MLEs of Example 10.7 are to be preferred, by Theorem 10.1.
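The plug-in (invariance) estimators of Example 10.7 can be sketched as follows, using a hypothetical sample (the function name normal_plugin_estimates and the data are ours; σ̂ is the MLE of σ, i.e. with divisor n):

```python
import math
from statistics import NormalDist

def normal_plugin_estimates(sample, cutoff=400.0, z_10=1.2816):
    """Plug-in MLEs under normality, as in Example 10.7:
    a) P(X ≤ cutoff) estimated by Φ((cutoff − x̄)/σ̂);
    b) the 90th percentile x_0.1 estimated by x̄ + σ̂·z_0.1."""
    n = len(sample)
    xbar = sum(sample) / n
    sigma_hat = math.sqrt(sum((v - xbar) ** 2 for v in sample) / n)  # MLE of σ
    prob = NormalDist().cdf((cutoff - xbar) / sigma_hat)
    pctl_90 = xbar + sigma_hat * z_10
    return prob, pctl_90

# Hypothetical measurements
p, q = normal_plugin_estimates([392.0, 401.0, 397.0, 410.0, 385.0])
```

The nonparametric alternatives mentioned in the Note (the sample proportion below 400 and the sample 90th percentile) require no distributional assumption, but when normality holds, the plug-in MLEs above are the preferred estimators, by Theorem 10.1.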