Module 3: Estimation and Properties of Estimators
Math 4820/5320

Introduction

This section of the book will examine how to find estimators of unknown parameters. For example, if the population mean is unknown and is of interest, we can estimate it through a variety of methods. We will also explore interval estimation, which gives an interval in which we can have a certain degree of confidence that the true population parameter is contained. Finally, we will explore properties of different types of estimators.

Reading: Chapters 8–9 of Mathematical Statistics with Applications by Wackerly, Mendenhall, and Scheaffer.

Estimation

Def: An estimator is a rule, often expressed as a formula, that tells how to calculate the value of an estimate based on the measurements contained in the sample.

Def: A single number that estimates a population parameter is called a point estimate.

Def: An interval between two values that is intended to enclose a population parameter is called an interval estimate.

Example: Suppose we collect the following random sample of exam scores:
98, 100, 0, 58, 86, 77, 90, 83, 70
What are some potential point estimates and interval estimates?

Bias and Mean Squared Error of Point Estimators

(Figure: illustration of the distribution of estimates.)

Def: Let θ̂ be a point estimator for a parameter θ. Then θ̂ is an unbiased estimator if E[θ̂] = θ. If E[θ̂] ≠ θ, θ̂ is said to be biased.

Def: The bias of a point estimator θ̂ is given by B(θ̂) = E[θ̂] − θ.

Def: The mean square error (MSE) of a point estimator is given by MSE(θ̂) = E[(θ̂ − θ)²].

Note: MSE(θ̂) = Var(θ̂) + [B(θ̂)]².

(Figure: sampling distributions of two unbiased estimators.)

Some Common Unbiased Point Estimators

Note: Some specific methods for finding point estimators are given later in this module (see Chapter 9). These are commonly used point estimators.

Def: The standard deviation of the sampling distribution of the estimator θ̂, σ_θ̂, is called the standard error of the estimator θ̂.

If the population mean, µ, is unknown, it is common to estimate the population mean with the sample mean.

Example:

Note: If we are interested in a population proportion, we will denote it by p. The table of common unbiased point estimators referred to above is Table 8.1 from the textbook.

Example:

If the population variance, σ², is unknown, it is typical to estimate the population variance with the sample variance.

Note: S² = (1/(n − 1)) Σ_{i=1}^n (Xᵢ − X̄)² is an unbiased point estimator for σ².

Note: S′² = (1/n) Σ_{i=1}^n (Xᵢ − X̄)² is a biased point estimator for σ².

Evaluating the Goodness of a Point Estimator

Def: The error of estimation is the distance between an estimator and its target parameter. That is, ε = |θ̂ − θ|.

Example: A sample of n = 1000 voters, randomly selected from a city, showed y = 560 in favor of candidate Jones. Estimate p, the fraction of voters in the population favoring Jones, and place a 2-standard-error bound on the error of estimation.

Confidence Intervals

Recall: An interval estimator is a rule specifying the method for using the sample measurements to calculate two numbers that form the endpoints of the interval. Interval estimators are commonly called confidence intervals.

Suppose that θ̂_L and θ̂_U are the (random) lower and upper confidence limits, respectively, for a parameter θ. Then if

P(θ̂_L ≤ θ ≤ θ̂_U) = 1 − α,

(1 − α) is called the confidence coefficient and [θ̂_L, θ̂_U] is called a two-sided confidence interval.
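To illustrate what the confidence coefficient means, here is a small simulation sketch in R (in the spirit of the R code given later in these notes, and anticipating the large-sample interval ȳ ± 1.96·σ/√n of the next section). The population values µ = 4 and σ = 10, the sample size n = 25, and the number of replications are arbitrary illustrative choices, not part of the notes: the code repeatedly draws normal samples, builds a 95% interval for µ with σ treated as known, and records how often the interval captures the true mean. The observed proportion should be close to 0.95.

set.seed(2016)
mu = 4; sigma = 10        # true population values (illustrative choices)
n = 25                    # sample size for each simulated sample
reps = 10000              # number of simulated samples
covered = logical(reps)
for (r in 1:reps) {
  y = rnorm(n, mean = mu, sd = sigma)
  lower = mean(y) - 1.96 * sigma / sqrt(n)   # 95% interval, sigma treated as known
  upper = mean(y) + 1.96 * sigma / sqrt(n)
  covered[r] = (lower <= mu) & (mu <= upper)
}
mean(covered)             # proportion of intervals containing mu; should be near 0.95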
It is also possible to construct a one-sided confidence interval. For example, if P(θ̂_L ≤ θ) = 1 − α, then [θ̂_L, ∞) is a one-sided confidence interval. Similarly, if P(θ ≤ θ̂_U) = 1 − α, then (−∞, θ̂_U] is a one-sided confidence interval.

One common method for finding confidence intervals is called the pivotal method. This method depends on finding a pivotal quantity that possesses two characteristics:
1. It is a function of the sample measurements and the unknown parameter θ, where θ is the only unknown quantity.
2. Its probability distribution does not depend on the parameter θ.

The following facts are useful when using such a method:
1. If c > 0, then P(a ≤ Y ≤ b) = P(ca ≤ cY ≤ cb).
2. If d ∈ R, then P(a ≤ Y ≤ b) = P(a − d ≤ Y − d ≤ b − d).

Example:

Example:

Large-Sample Confidence Intervals

In the previous section, we found point estimators for µ, p, µ₁ − µ₂, and p₁ − p₂, along with their corresponding standard errors. Utilizing the Central Limit Theorem, we can argue that all of these point estimators have an approximately normal distribution. For instance, if we have a sufficiently large sample size, then

Z = (θ̂ − θ)/σ_θ̂

follows an approximately standard normal distribution. Using this approximation, we can construct 100(1 − α)% confidence intervals; in particular, a two-sided 100(1 − α)% confidence interval for θ is θ̂ ± z_{α/2}·σ_θ̂.

Example:

Example:

Similar to these two-sided confidence intervals, we can determine that the 100(1 − α)% one-sided confidence limits, often called lower and upper bounds, respectively, are given by:
The 100(1 − α)% lower bound for θ is θ̂ − z_α·σ_θ̂.
The 100(1 − α)% upper bound for θ is θ̂ + z_α·σ_θ̂.

Example:

Illustration of Confidence Intervals:

Interpretation of Confidence Intervals:

Note: If the sample size is reasonably large (n ≥ 30), we can use the sample variance as an estimate of the population variance when constructing confidence intervals.

Note: When constructing a confidence interval for population proportions, we can estimate the standard error by plugging in our estimated proportions.

Example:

Selecting Sample Size

In most cases, increasing the sample size is not free. As a result, we can plan ahead by choosing a sample size large enough to achieve a particular standard error (or margin of error).

Example: Suppose the weights of copper pipes are normally distributed with mean µ = 5 pounds and variance σ² = 4 pounds². Find the necessary sample size so that the margin of error for a 95% confidence interval is at most 0.2 pounds.

Example: Suppose we want to poll Americans on their preference of candidates for president. It is estimated that 44% of Americans currently support Hillary Clinton. Estimate what sample size is needed if we want to create a 95% confidence interval for Hillary Clinton's support with a margin of error of 3%.
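The following R lines sketch the arithmetic behind these two sample-size examples. They assume the standard large-sample formulas n ≥ (z_{α/2}·σ/E)² for a mean and n ≥ z_{α/2}²·p(1 − p)/E² for a proportion, where E is the desired margin of error; the use of qnorm and the rounding up are implementation choices, not part of the notes.

z = qnorm(0.975)                    # z_{alpha/2} for a 95% interval, roughly 1.96

# Copper pipes: sigma^2 = 4 (so sigma = 2), desired margin of error E = 0.2 pounds
sigma = 2; E = 0.2
ceiling((z * sigma / E)^2)          # required sample size, about 385

# Polling: anticipated proportion p = 0.44, desired margin of error E = 0.03
p = 0.44; E = 0.03
ceiling(z^2 * p * (1 - p) / E^2)    # required sample size, about 1052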
Small-Sample Confidence Intervals for µ and µ₁ − µ₂

In the previous section, we assumed that the data came from a normal distribution and that either the population standard deviation or the population variance was known. In fact, even if the population standard deviation or variance is unknown, as long as the sample size is sufficiently large we have σ² ≈ s², and we can use the methods of the previous section to create a confidence interval. This section examines the cases where either the sample size is small, so that s² is not necessarily close to σ², or the distribution of our data is not normal.

Let's examine the first case. If we are interested in the distribution of X̄ and our data come from a normal distribution, then we know from Chapter 6 that

(X̄ − µ)/(S/√n) ∼ t(n − 1).

The second case is a bit more hairy. If the distribution of our data is only approximately normal (and, in particular, not too skewed), then the Central Limit Theorem kicks in and we can still create an approximate confidence interval.

Example: Suppose a group of 30 students is given an exam, and further suppose that scores are normally distributed with an unknown mean. If the sample mean of the 30 students is 101 and the sample standard deviation is 14.9, create a 95% confidence interval for µ.

If we are interested in creating a confidence interval for the difference in means, there are two methods. If the variances of the two populations are assumed to be equal, we can "pool" variances to get a better estimate of σ. In particular, we can create a confidence interval as follows:

(Ȳ₁ − Ȳ₂) ± t_{α/2} · S_p · √(1/n₁ + 1/n₂),   df = n₁ + n₂ − 2,

where

S_p² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2).

Example: Suppose that on the exam mentioned in the previous example, we are given the following breakdown by alma mater.

School   Sample Mean   Sample Variance   Sample Size
UCD      105.1         15.2              15
UCB      96.9          14.6              15

Compute a 95% confidence interval for the difference in means. Assume that the variances of the two populations are approximately equal. Is there evidence that students from the two schools have different average scores?

If the variances of the two populations are not equal, then we cannot "pool" variances. Below is how to create a two-sided confidence interval when we don't pool variances:

(Ȳ₁ − Ȳ₂) ± t_{α/2} · √(s₁²/n₁ + s₂²/n₂),   df = ν,

where

ν = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ].

Example: Rework the previous example without making the assumption that the two populations have the same variances.

Confidence Intervals for σ²

Recall that if Y₁, Y₂, . . . , Yn are normally distributed, then Σ_{i=1}^n (Yᵢ − Ȳ)²/σ² = (n − 1)·S²/σ² has a χ² distribution with (n − 1) degrees of freedom. We can use this fact to create a confidence interval for σ². In particular, if we want to create a two-sided confidence interval for σ² with confidence coefficient (1 − α), consider the following:

P[ χ²_L ≤ (n − 1)·S²/σ² ≤ χ²_U ] = 1 − α

P[ (n − 1)·S²/χ²_{α/2} ≤ σ² ≤ (n − 1)·S²/χ²_{1−α/2} ] = 1 − α,

where χ²_{α/2} and χ²_{1−α/2} are the values that cut off areas of α/2 and 1 − α/2, respectively, in the upper tail of the χ² distribution with n − 1 degrees of freedom.

Example: Suppose a group of 30 students is given an exam, and further suppose that scores are normally distributed with an unknown mean. If the sample mean of the 30 students is 101 and the sample standard deviation is 14.9, create a 95% confidence interval for σ.
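As a sketch of the computation in this last example (taking the sample size to be the n = 30 students mentioned at the start of the example, and using R's qchisq for the χ² quantiles; both choices are my reading of the example rather than something worked out in the notes):

n = 30; s = 14.9; alpha = 0.05
chi.upper = qchisq(1 - alpha/2, df = n - 1)    # cuts off alpha/2 in the upper tail
chi.lower = qchisq(alpha/2, df = n - 1)        # cuts off alpha/2 in the lower tail
ci.var = c((n - 1) * s^2 / chi.upper, (n - 1) * s^2 / chi.lower)
ci.var          # 95% confidence interval for sigma^2
sqrt(ci.var)    # taking square roots gives the corresponding interval for sigma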
Properties of Point Estimators and Methods of Estimation

Note: The following notes cover Chapter 9 of the textbook.

In the previous section (Chapter 8), we considered some common point estimators (e.g., estimating µ with X̄). In this chapter, we will examine some properties of point estimators, as well as how to derive other point estimators. Recall that an estimator θ̂ is an unbiased estimator for θ if E[θ̂] = θ. Typically, unbiased estimators are preferred to biased estimators. Suppose we are given two unbiased estimators, θ̂₁ and θ̂₂. Which one is preferred? In this section, we will examine additional properties of estimators such as efficiency, consistency, and sufficiency.

Relative Efficiency

Def: Given two unbiased estimators, θ̂₁ and θ̂₂, of a parameter θ, with variances Var(θ̂₁) and Var(θ̂₂), respectively, the efficiency of θ̂₁ relative to θ̂₂, denoted eff(θ̂₁, θ̂₂), is defined to be the ratio

eff(θ̂₁, θ̂₂) = Var(θ̂₂) / Var(θ̂₁).

If θ̂₁ and θ̂₂ are both unbiased estimators and eff(θ̂₁, θ̂₂) ≥ 1, then θ̂₁ is preferred to θ̂₂, since Var(θ̂₁) ≤ Var(θ̂₂). If eff(θ̂₁, θ̂₂) ≤ 1, then θ̂₂ is preferred to θ̂₁, since Var(θ̂₁) ≥ Var(θ̂₂). In other words, when choosing between two unbiased estimators, it is common to select the estimator with the smaller variance.

Example: Let Y₁, Y₂, . . . , Yn be a random sample of (continuous) Uniform[0, θ] random variables. Two unbiased estimators of θ are

θ̂₁ = 2Ȳ   and   θ̂₂ = ((n + 1)/n) · Y_(n),   where Y_(n) = max(Y₁, Y₂, . . . , Yn).

Find the efficiency of θ̂₁ relative to θ̂₂.
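Here is a quick Monte Carlo check of this example in R. The values θ = 10 and n = 20 and the number of replications are arbitrary illustrative choices; the code simulates both estimators many times, and the ratio of the two sample variances approximates eff(θ̂₁, θ̂₂), which can be compared with the value derived in class.

set.seed(2016)
theta = 10; n = 20; reps = 20000      # illustrative values, not from the notes
est1 = numeric(reps); est2 = numeric(reps)
for (r in 1:reps) {
  y = runif(n, min = 0, max = theta)
  est1[r] = 2 * mean(y)               # theta-hat 1 = 2 * Y-bar
  est2[r] = (n + 1) / n * max(y)      # theta-hat 2 = ((n+1)/n) * Y_(n)
}
c(mean(est1), mean(est2))             # both close to theta (both unbiased)
var(est2) / var(est1)                 # Monte Carlo approximation of eff(theta-hat 1, theta-hat 2)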
Consistency

The law of large numbers essentially states that if an experiment is repeated many times, then the sample mean will converge to the population mean. We have to be careful to define what we mean by "converge," since we are discussing random quantities.

Def: The weak law of large numbers states that the sample mean converges in probability to the population mean (or expected value):

X̄_n →p µ as n → ∞, i.e., lim_{n→∞} P(|X̄_n − µ| > ε) = 0 for all ε > 0.

Def: The strong law of large numbers states that the sample mean converges almost surely to the population mean (or expected value):

X̄_n →a.s. µ as n → ∞, i.e., P(lim_{n→∞} X̄_n = µ) = 1.

Note: Almost sure convergence is a stronger condition than convergence in probability.

Def: The estimator θ̂_n is said to be a (weakly) consistent estimator of θ if θ̂_n →p θ, i.e., for all ε > 0, lim_{n→∞} P(|θ̂_n − θ| > ε) = 0.

Note: We will focus on determining whether an estimator is (weakly) consistent, but if the estimator converges almost surely to the parameter of interest, it is said to be strongly consistent.

Figure 1: Sample mean of n N(µ = 4, σ² = 100) random variables.

R code for the previous figure:

set.seed(2016)
n = 10000
idx = 1:n
xn.bar = rep(1,n)
# generate 10000 normal(mu = 4, sd = 10) random variables
x = rnorm(n = n, mean = 4, sd = 10)
for (i in 1:n) {
  xn.bar[i] = mean(x[1:i])
}
plot(xn.bar ~ idx, type = "l", xlab = 'Sample Size', ylab = 'Sample Mean')
abline(h = 4, lty = 2)

Theorem: An unbiased estimator θ̂_n for θ is a consistent estimator for θ if lim_{n→∞} Var(θ̂_n) = 0.

Example: Show that 2Ȳ is a consistent estimator for θ if Y₁, Y₂, . . . , Yn are iid Uniform(0, θ).

Theorem: (Slutsky's Theorem) Suppose that θ̂_n →p θ and that θ̂′_n →p θ′. Then the following are true.
(a) θ̂_n + θ̂′_n →p θ + θ′.
(b) θ̂_n · θ̂′_n →p θ · θ′.
(c) If θ′ ≠ 0, then θ̂_n / θ̂′_n →p θ / θ′.
(d) If g(·) is a real-valued function that is continuous at θ, then g(θ̂_n) →p g(θ).

Sufficiency

Up to this point, we have considered estimators that "make sense" in some regard (such as estimating µ with X̄). One unresolved issue that we would like to address is data reduction. If we simply report X̄ as an estimate for µ, have we "lost" any information about µ? In this section, we will examine how to find statistics that, in some sense, summarize all of the information in a sample about a target parameter. These statistics will be called sufficient statistics.

Def: Let Y₁, Y₂, . . . , Yn denote a random sample from a probability distribution with an unknown parameter θ. Then the statistic U = g(Y₁, Y₂, . . . , Yn) is said to be a sufficient statistic for θ if the conditional distribution of Y₁, Y₂, . . . , Yn given U does not depend on θ.

Def: Let y₁, y₂, . . . , yn be sample observations taken on corresponding random variables Y₁, Y₂, . . . , Yn whose distribution depends on a parameter θ. If Y₁, Y₂, . . . , Yn are discrete random variables, the likelihood of the sample, L(y₁, y₂, . . . , yn | θ), is defined to be the joint probability of y₁, y₂, . . . , yn. If Y₁, Y₂, . . . , Yn are continuous random variables, the likelihood L(y₁, y₂, . . . , yn | θ) is defined to be the joint density evaluated at y₁, y₂, . . . , yn.

Theorem: Let U be a statistic based on the random sample Y₁, Y₂, . . . , Yn. Then U is a sufficient statistic for the estimation of a parameter θ if and only if the likelihood L(y₁, y₂, . . . , yn | θ) can be factored into two non-negative functions,

L(y₁, y₂, . . . , yn | θ) = g(u, θ) · h(y₁, y₂, . . . , yn),

where g(u, θ) is a function only of u and θ, and h(y₁, y₂, . . . , yn) is not a function of θ.

Example: (9.39) Let Y₁, Y₂, . . . , Yn be a random sample of Poisson(λ) random variables. Show by conditioning that Σ_{i=1}^n Yᵢ is sufficient for λ.

Example: (9.38) Let Y₁, Y₂, . . . , Yn be a random sample of normal random variables with mean µ and variance σ².
(a) Write the likelihood function.
(b) If µ is unknown and σ² is known, show that Ȳ is sufficient for µ.
(c) If µ is known and σ² is unknown, show that Σ_{i=1}^n (Yᵢ − µ)² is sufficient for σ².
(d) If µ and σ² are both unknown, show that Σ_{i=1}^n Yᵢ and Σ_{i=1}^n Yᵢ² are jointly sufficient for µ and σ². (Note: This will imply that Ȳ and Σ_{i=1}^n (Yᵢ − Ȳ)², or Ȳ and S², are jointly sufficient for µ and σ².)

The Rao-Blackwell Theorem and Minimum-Variance Unbiased Estimation

Theorem: (Rao-Blackwell) Let θ̂ be an unbiased estimator for θ such that Var(θ̂) < ∞. If U is a sufficient statistic for θ, define θ̂* = E[θ̂ | U]. Then, for all θ, E[θ̂*] = θ and Var(θ̂*) ≤ Var(θ̂).

The Rao-Blackwell Theorem allows us to condition on a sufficient statistic to obtain unbiased estimators with smaller variance. It turns out that if you condition on a minimal sufficient statistic, then you can obtain what is called the minimum-variance unbiased estimator (MVUE). (A small numerical illustration of this conditioning idea appears after the examples below.)

Def: A statistic S(X) is called a minimal sufficient statistic if
(i) S(X) is a sufficient statistic, and
(ii) whenever T(X) is a sufficient statistic, S(X) = f(T(X)) for some function f.

Def: θ̂ is a minimum-variance unbiased estimator (MVUE) if θ̂ is an unbiased estimator for a parameter θ such that, for any other unbiased estimator θ̃, Var(θ̂) ≤ Var(θ̃).

Example: (9.59) The number of breakdowns Y per day for a certain machine is a Poisson random variable with mean θ. The daily cost of repairing these breakdowns is given by C = 3Y². If Y₁, Y₂, . . . , Yn denote the observed numbers of breakdowns for n independently selected days, find an MVUE for E(C).

Example: Let Y₁, Y₂, . . . , Yn denote a random sample from the uniform distribution over the interval (0, θ).
(a) (9.49) Show that Y_(n) = max(Y₁, Y₂, . . . , Yn) is sufficient for θ.
(b) (9.61) Use Y_(n) to find an MVUE of θ.
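The Rao-Blackwell construction can also be seen numerically. The R sketch below uses a Poisson(θ) sample with θ = 3; the starting estimator Y₁ (a single observation, which is unbiased for θ) and all of the numerical choices are illustrative assumptions, not taken from the notes. Conditioning Y₁ on the sufficient statistic Σ Yᵢ gives E[Y₁ | Σ Yᵢ] = Ȳ by symmetry, and the simulation shows that the conditioned estimator keeps the same mean but has much smaller variance, exactly as the theorem promises.

set.seed(2016)
theta = 3; n = 25; reps = 20000       # illustrative values, not from the notes
raw = numeric(reps); cond = numeric(reps)
for (r in 1:reps) {
  y = rpois(n, lambda = theta)
  raw[r] = y[1]                       # unbiased but crude estimator: a single observation
  cond[r] = mean(y)                   # E[Y_1 | sum of the Y_i] = Y-bar (Rao-Blackwellized version)
}
c(mean(raw), mean(cond))              # both are close to theta (both unbiased)
c(var(raw), var(cond))                # the conditioned estimator has far smaller variance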
Methods of Estimation

Method of Moments

The method of moments is a very simple procedure for finding point estimates of population parameters. The method is as follows:

Let µ′_k = E[Y^k] and m′_k = (1/n) Σ_{i=1}^n Yᵢ^k. Choose as estimates those values of the parameters that are solutions of the equations µ′_k = m′_k for k = 1, 2, . . . , t, where t is the number of parameters to be estimated.

Example: Let Y₁, Y₂, . . . , Yn be a random sample of Uniform[0, θ] random variables, where θ is unknown. Use the method of moments to find an estimator θ̂ of θ. Is θ̂ an unbiased, consistent estimator of θ?

Example: (9.72) Let Y₁, Y₂, . . . , Yn be a random sample from a normal distribution with mean µ and variance σ². Find the method of moments estimators for µ and σ².

Example: (9.74) Let Y₁, Y₂, . . . , Yn be a random sample from the following distribution:

f(y | θ) = (2/θ²)(θ − y) for 0 ≤ y ≤ θ, and f(y | θ) = 0 otherwise.

(a) Find an estimator for θ by using the method of moments.
(b) Is this estimator a sufficient statistic for θ?

Method of Maximum Likelihood

In the previous section, we saw how to derive method of moments estimators. While these are easy to compute, they don't always lead to particularly good estimators. The Rao-Blackwell Theorem allows us to find a minimum-variance unbiased estimator; however, while it isn't particularly difficult to find a sufficient statistic, determining the function of the minimal sufficient statistic that gives us an unbiased estimator can be largely hit or miss. This section will focus on maximum likelihood estimators, which often lead to minimum-variance unbiased estimators.

Method of Maximum Likelihood: Suppose that the likelihood function depends on k parameters, θ₁, . . . , θ_k. Choose as estimates those values of the parameters that maximize the likelihood L(y₁, y₂, . . . , yn | θ₁, . . . , θ_k). (A numerical illustration of maximizing a likelihood appears at the end of this module.)

Note: One very nice property of maximum likelihood estimates is called the invariance property of MLEs. It states that if f(θ) is a one-to-one function of θ, and if θ̂ is the MLE for θ, then the MLE of f(θ) is given by f(θ̂).

Example: Let Y₁, Y₂, . . . , Yn be a random sample from a normal distribution with mean µ and variance σ². Find the maximum likelihood estimators for µ and σ².

Example: (9.80) Let Y₁, Y₂, . . . , Yn be a random sample of Poisson random variables with mean λ.
(a) Find the MLE λ̂ for λ.
(b) Find the expected value and variance of λ̂.
(c) Is λ̂ a consistent estimator for λ? Explain.
(d) What is the MLE for P(Y = 0) = e^{−λ}?
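In practice, the likelihood is often maximized numerically rather than in closed form. The short R sketch below illustrates this for an exponential model with an unknown rate; the simulated data, the rate parameterization, and the use of optimize are all illustrative choices rather than anything from the notes. For this model the closed-form MLE of the rate is 1/Ȳ (a standard result), which gives a check on the numerical answer.

set.seed(2016)
# Illustrative data: an exponential sample with true rate 0.5 (so true mean 2)
y = rexp(100, rate = 0.5)

# Negative log-likelihood for an exponential(rate) model
negloglik = function(rate) -sum(dexp(y, rate = rate, log = TRUE))

# Maximize the likelihood by minimizing the negative log-likelihood over a wide interval
fit = optimize(negloglik, interval = c(1e-6, 100))
fit$minimum     # numerical MLE of the rate
1 / mean(y)     # closed-form MLE of the rate, for comparison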