Chapter 2. Elements of Statistical Inference

2.1 Principles of Data Reduction

Example 2.1. Cholesterol levels, continued. Suppose we want to make inference about the mean cholesterol level on the second day after a heart attack in a population of people in a northeastern American state. We have data on 28 patients, which are a realization of a random sample of size $n = 28$. One of the first things we can do is calculate estimates of the population mean ($\mu$) and of the population variance ($\sigma^2$). To do this we can use functions of the random sample, such as the sample mean ($\bar{X}$) and the sample variance ($S^2$), respectively, where
\[
\bar{X} = \frac{1}{28}\sum_{i=1}^{28} X_i, \qquad S^2 = \frac{1}{27}\sum_{i=1}^{28}\bigl(X_i - \bar{X}\bigr)^2.
\]
Here we have $\bar{x} = 257$ and $s^2 = 32$. Note that $\bar{X}$ and $S^2$ are random variables, as they are functions of random variables, while $\bar{x}$ and $s^2$ are their values obtained for the particular values of the rvs; in this case, for the 28 patients who took part in the study. A different group of 28 patients in the state who suffered a heart attack would give different values of $\bar{X}$ and $S^2$.

A random variable which is a function of a random sample, $T(X_1, \ldots, X_n)$, is called an estimator of a population parameter $\vartheta$, while its value is called an estimate of the population parameter $\vartheta$.

Notation

Special symbols, such as $\bar{X}$ or $S^2$, are used to denote estimators of some common parameters, in these cases the population mean and variance. Their "small letter" counterparts denote the respective estimates. A hat over the symbol of the parameter of interest, for example $\hat{\vartheta}$, indicates an estimator of $\vartheta$. Here we have to be careful, because $\hat{\vartheta}$ is often used to denote an estimate of $\vartheta$ as well. It is better to write $\hat{\vartheta}_{\mathrm{obs}}$ to indicate that this is a value of the estimator $\hat{\vartheta}$ obtained for the observed values of a random sample. A random sample will often be denoted by a bold capital letter, and its realization by a bold lower-case letter.
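The distinction between an estimator and an estimate can be illustrated with a short Python simulation. The data below are hypothetical (the real cholesterol measurements are not reproduced in these notes); the simulated population $N(257, 32)$ is an assumption chosen only to echo the reported values $\bar{x} = 257$ and $s^2 = 32$.

```python
import random
import statistics

# Hypothetical cholesterol-like population: N(mu, sigma^2) with mu = 257, sigma^2 = 32.
# These parameter values are assumptions for illustration, not the real study data.
random.seed(1)

def draw_sample(n=28, mu=257.0, sigma=32 ** 0.5):
    return [random.gauss(mu, sigma) for _ in range(n)]

sample1 = draw_sample()   # one realization of the random sample
sample2 = draw_sample()   # a different group of 28 "patients"

# Estimates: realized values of the estimators Xbar and S^2
xbar1, s2_1 = statistics.mean(sample1), statistics.variance(sample1)  # variance() divides by n - 1
xbar2, s2_2 = statistics.mean(sample2), statistics.variance(sample2)

print(f"sample 1: xbar = {xbar1:.1f}, s2 = {s2_1:.1f}")
print(f"sample 2: xbar = {xbar2:.1f}, s2 = {s2_2:.1f}")  # a different realization gives different estimates
```

The estimators $\bar{X}$ and $S^2$ are the rules (the functions applied to a random sample); the printed numbers are estimates, and they change from one realized sample to the next.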
For example, $\mathbf{X}$ denotes a random sample $(X_1, \ldots, X_n)$ and $\mathbf{x}$ denotes its realization $(x_1, \ldots, x_n)$.

We often know the family of distributions of the variable of interest, which we denote by $\mathcal{P} = \{P_\vartheta : \vartheta \in \Theta \subset \mathbb{R}^s\}$ (see Definition 1.10). The family depends on a parameter vector $\vartheta$ which belongs to a parameter space $\Theta$. For example, for the family of normal distributions $\mathcal{P} = \{f(y; \mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\}$, the parameter vector is two-dimensional, $\vartheta = (\mu, \sigma^2)$, and the parameter space is $\Theta = \mathbb{R} \times \mathbb{R}^{+}\setminus\{0\}$.

Properties of the estimators of the unknown parameters, and methods of finding them, will be our topics for the next several lectures.

2.1.1 Sufficient statistics

A statistic $T(Y)$ represents a random sample $Y = (Y_1, \ldots, Y_n)$ and serves as an estimator of a population parameter $\vartheta$ or of a function of $\vartheta$. For inference based on the random sample it is often enough to consider $T(Y)$, or a function of it, rather than the whole random sample. Hence, the statistic should represent the data well: it should retain all the information about the parameter which is contained in the original observations. Such a statistic has the property of sufficiency. Any additional information in the sample, besides the value of the sufficient statistic, contains no further information about the parameter of interest.

The principle of sufficiency says that if $T(Y)$ is a sufficient statistic for $\vartheta$, then any inference about $\vartheta$ should depend on the sample only through the value of the statistic. That is, if $y_1$ and $y_2$ are two realizations of the random sample $Y$ such that $T(y_1) = T(y_2)$, then the inference about $\vartheta$ should be the same whichever of the realizations, $y_1$ or $y_2$, is used.

Definition 2.1. Let $X = (X_1, \ldots, X_n)$ denote a random sample from a probability distribution with unknown parameters $\vartheta = (\vartheta_1, \ldots, \vartheta_p)$. Then the statistics $T = (T_1, \ldots, T_q)$ are said to be jointly sufficient for $\vartheta$ if the conditional distribution of $X$ given $T$ does not depend on $\vartheta$.

Example 2.2. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a Bernoulli distribution with probability of success equal to $p$, i.e.,
\[
P(X_i = x_i) = p^{x_i}(1-p)^{1-x_i}, \quad i = 1, 2, \ldots, n,
\]
where $x_i \in \{0, 1\}$ and $p \in (0, 1)$. Then the joint mass function of $X$ is
\[
P(X = x) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^{n} x_i}(1-p)^{n - \sum_{i=1}^{n} x_i}.
\]
Let us define a statistic $T = T(X)$ as
\[
T(X) = \sum_{i=1}^{n} X_i.
\]
This is a random variable which counts the number of successes in $n$ trials, and it has a Binomial distribution, that is,
\[
P(T = t) = \binom{n}{t} p^t (1-p)^{n-t}, \quad t = 0, 1, \ldots, n.
\]
The joint probability of $X$ and $T$ is zero unless $\sum_{i=1}^{n} x_i = t$. That is, for $\sum_{i=1}^{n} x_i = t$ we can write
\[
P(X = x, T = t) = P(X = x) = p^{\sum_{i=1}^{n} x_i}(1-p)^{n - \sum_{i=1}^{n} x_i} = p^t (1-p)^{n-t}.
\]
Hence, the conditional distribution of $X \mid T$ is
\[
P(X = x \mid T = t) = \frac{P(X = x, T = t)}{P(T = t)} = \frac{p^t (1-p)^{n-t}}{\binom{n}{t} p^t (1-p)^{n-t}} = \frac{1}{\binom{n}{t}}.
\]
The conditional distribution does not depend on the parameter $p$, so the statistic $T(X) = \sum_{i=1}^{n} X_i$ is sufficient for $p$.

This can be interpreted as follows: if we know the number of successes in $n$ trials, $t = \sum_{i=1}^{n} x_i$, then the order in which the successes occurred in the Bernoulli experiments gives no more information about the probability of success, which could be estimated as $\sum_{i=1}^{n} x_i / n$.

We can identify sufficient statistics more easily using the following result.

Theorem 2.1. Neyman's factorization. The statistic $T(Y)$ is sufficient for $\vartheta$ if and only if the joint pdf (pmf) of $Y$ can be factorized as
\[
f_Y(y; \vartheta) = g[T(y); \vartheta]\, h(y),
\]
where $h(y)$ does not depend on the parameters, and the function $g$, which depends on the parameters, depends on the values of $Y$ only through the statistic $T$.

Proof. Discrete case for a single parameter. First, assume that $T(Y)$ is a sufficient statistic for $\vartheta$.
If $T(y) = t$, then
\[
P(Y = y) = P(Y = y, T(Y) = t) = \underbrace{P(T(Y) = t)}_{=\, g_\vartheta(T(y))} \,\underbrace{P(Y = y \mid T(Y) = t)}_{=\, h(y) \text{ if } T \text{ is a suff. stat.}}
\]
So we get the factorization.

Conversely, let the factorization hold, that is, $P(Y = y) = g[T(y); \vartheta]\, h(y)$. Then
\[
P(Y = y \mid T(Y) = t) = \frac{P(Y = y, T(Y) = t)}{P(T(Y) = t)}
= \begin{cases} 0, & \text{if } T(y) \neq t; \\[4pt] \dfrac{P(Y = y)}{P(T(Y) = t)}, & \text{if } T(y) = t \end{cases}
= \begin{cases} 0, & \text{if } T(y) \neq t; \\[4pt] \dfrac{P(Y = y)}{\sum_{y' : T(y') = t} P(Y = y')}, & \text{if } T(y) = t \end{cases}
\]
\[
= \begin{cases} 0, & \text{if } T(y) \neq t; \\[4pt] \dfrac{g[T(y); \vartheta]\, h(y)}{g[T(y); \vartheta] \sum_{y' : T(y') = t} h(y')}, & \text{if } T(y) = t \end{cases}
= \begin{cases} 0, & \text{if } T(y) \neq t; \\[4pt] \dfrac{h(y)}{\sum_{y' : T(y') = t} h(y')}, & \text{if } T(y) = t. \end{cases}
\]
This does not depend on the parameter $\vartheta$, hence $T$ is a sufficient statistic for $\vartheta$.

As we will see later, sufficient statistics provide optimal estimators, as well as pivots for confidence intervals and test statistics.

Example 2.3. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from a Poisson($\lambda$) distribution. Then we may write
\[
P(Y = y; \lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \frac{\lambda^{\sum_{i=1}^{n} y_i}\, e^{-n\lambda}}{\prod_{i=1}^{n} y_i!} = \lambda^{\sum_{i=1}^{n} y_i}\, e^{-n\lambda} \times \frac{1}{\prod_{i=1}^{n} y_i!}.
\]
It follows that $\sum_{i=1}^{n} Y_i$ is a sufficient statistic for $\lambda$.

Exercise 2.1. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from an Exp($\lambda$) distribution. Obtain a sufficient statistic for $\lambda$.

Example 2.4. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from a $N(\mu, \sigma^2)$ distribution. Then we may write
\begin{align*}
f_Y(y; \mu, \sigma^2) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - \mu)^2}{2\sigma^2}\right\} \\
&= \frac{1}{\left(2\pi\sigma^2\right)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2\right\} \\
&= (2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \left(\sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2\right)\right\} \\
&= (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \left(\sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2\right)\right\} \times (2\pi)^{-n/2}.
\end{align*}
Hence, $\sum_{i=1}^{n} Y_i$ and $\sum_{i=1}^{n} Y_i^2$ are jointly sufficient statistics for $\mu$ and $\sigma^2$.

Example 2.5. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from a $U[0, \vartheta]$ distribution. Then we may write
\[
f_Y(y; \vartheta) = \prod_{i=1}^{n} \frac{1}{\vartheta} = \frac{1}{\vartheta^n}, \qquad \max_i y_i \le \vartheta.
\]
Now, this may be rewritten as
\[
f_Y(y; \vartheta) = \frac{1}{\vartheta^n}\, I_{\{\max_i y_i \le \vartheta\}} \times 1.
\]
It follows that $\max_i Y_i$ is a sufficient statistic for $\vartheta$.

There can be many sufficient statistics for a parameter of interest, so the question is which one to use. As the purpose of a statistic is to reduce the data without loss of information about the parameter, we would like to find a function of the random sample that achieves the most data reduction while still retaining all the information about the parameter.

Definition 2.2. A sufficient statistic $T(Y)$ is called a minimal sufficient statistic if, for any other sufficient statistic $T'(Y)$, $T(Y)$ is a function of $T'(Y)$.

Lemma 2.1. Let $f(y; \vartheta)$ be the pdf of a random sample $Y$. If there exists a function $T(Y)$ such that, for any two sample points $x$ and $y$, the ratio $f(x; \vartheta)/f(y; \vartheta)$ is constant as a function of $\vartheta$ if and only if $T(x) = T(y)$, then $T(Y)$ is a minimal sufficient statistic for $\vartheta$. An analogous lemma holds for discrete rvs.

Example 2.6. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from a Poisson($\lambda$) distribution. From the factorization theorem we know that $T(Y) = \sum_{i=1}^{n} Y_i$ is a sufficient statistic for $\lambda$. The ratio of the joint probability mass functions is
\[
\frac{P(Y = x; \lambda)}{P(Y = y; \lambda)} = \frac{\lambda^{\sum_{i=1}^{n} x_i}\, e^{-n\lambda} \prod_{i=1}^{n} y_i!}{\lambda^{\sum_{i=1}^{n} y_i}\, e^{-n\lambda} \prod_{i=1}^{n} x_i!} = \prod_{i=1}^{n} \frac{y_i!}{x_i!}\; \lambda^{\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i}.
\]
This is constant in $\lambda$ iff $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} x_i$; that is, $T(Y) = \sum_{i=1}^{n} Y_i$ is a minimal sufficient statistic. Note that $\bar{Y}$ is also a minimal sufficient statistic for $\lambda$. In general, a minimal sufficient statistic is not unique: any one-to-one function of a minimal sufficient statistic is also a minimal sufficient statistic.

Exercise 2.2. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from an Exp($\lambda$) distribution. Show that $T(Y) = \sum_{i=1}^{n} Y_i$ is a minimal sufficient statistic for $\lambda$.

Example 2.7. Suppose that $Y = (Y_1, \ldots, Y_n)$ is a random sample from a $N(\mu, \sigma^2)$ distribution.
Then $T_1 = \sum_{i=1}^{n} Y_i$ and $T_2 = \sum_{i=1}^{n} Y_i^2$ are jointly sufficient statistics for $\mu$ and $\sigma^2$. The ratio of the joint densities is
\[
\frac{f_Y(x; \mu, \sigma^2)}{f_Y(y; \mu, \sigma^2)} = \exp\left\{-\frac{1}{2\sigma^2} \left[\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} y_i^2 - 2\mu \left(\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i\right)\right]\right\}.
\]
The ratio does not depend on the parameters $\mu, \sigma^2$ iff $\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2$ and $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$. Hence, $\left(\sum_{i=1}^{n} Y_i, \sum_{i=1}^{n} Y_i^2\right)$ are jointly minimal sufficient statistics for these parameters.

Note that $\bar{Y}$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\right)$ are one-to-one functions of $\left(\sum_{i=1}^{n} Y_i, \sum_{i=1}^{n} Y_i^2\right)$. Hence, $(\bar{Y}, S^2)$ are also jointly minimal sufficient statistics for $\mu$ and $\sigma^2$.

2.2 Point Estimation

Suppose that we have data $y_1, y_2, \ldots, y_n$ which are a realization of $Y_1, Y_2, \ldots, Y_n$. We are interested in the population from which the data are sampled. Suppose that the probability model is that $Y_i \sim N(\mu, \sigma^2)$ independently for $i = 1, 2, \ldots, n$. Then the probability model for the population is completely specified, except for the unknown parameters $\mu$ and $\sigma^2$.

In general, $Y_1, Y_2, \ldots, Y_n$ have a joint distribution which is specified, except for the unknown parameters $\vartheta = (\vartheta_1, \vartheta_2, \ldots, \vartheta_p)^T$. We are interested in estimating some function of the parameters, $\phi = g(\vartheta_1, \ldots, \vartheta_p)$. Note that $\phi$ could be just a single parameter, such as $\mu$ or $\sigma^2$. We will use a sample statistic $T(Y_1, \ldots, Y_n)$ as a point estimator of $\phi$. For example, $T(Y_1, \ldots, Y_n) = \bar{Y}$ is an estimator of $\mu$. The observed value of $T$, namely $T(y_1, \ldots, y_n)$, is the point estimate of $\phi$.

2.2.1 Properties of Point Estimators

An estimator $T(Y)$ is a random variable (a function of the rvs $Y_1, \ldots, Y_n$) and the estimate is a single value taken from the distribution of $T(Y)$. Since we want our estimate, $t$, to be close to $\phi$, the random variable $T$ should be centred close to $\phi$ and have small variance.
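A small simulation sketch makes this concrete for the estimator $T = \bar{Y}$ of $\mu$: across repeated samples, the realizations of $\bar{Y}$ are centred at $\mu$ and their variance shrinks like $\sigma^2/n$. The values $\mu = 10$, $\sigma = 3$ below are illustrative assumptions.

```python
import random
from statistics import fmean, pvariance

# Sampling distribution of the estimator T = Ybar for mu, under N(mu, sigma^2).
# The values mu = 10, sigma = 3 are assumptions for illustration only.
random.seed(0)
mu, sigma = 10.0, 3.0
reps = 5000

for n in (5, 50, 500):
    ybars = [fmean(random.gauss(mu, sigma) for _ in range(n)) for _ in range(reps)]
    print(f"n={n:3d}: mean of Ybar ~ {fmean(ybars):.3f}, "
          f"var of Ybar ~ {pvariance(ybars):.4f} (theory {sigma**2 / n:.4f})")
# The mean of the realizations stays near mu = 10, while their variance falls like sigma^2/n.
```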
Also, we would want our estimator to be such that, as $n \to \infty$, $T \to \phi$ with probability one. Below are definitions of some measures of goodness of estimators.

Definition 2.3. Let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ be a random sample from a distribution $P_\vartheta$, let $\phi = g(\vartheta)$ be a function of the parameters, and let $T(Y)$ be its point estimator. Then:

1. The error in estimating $\phi$ by the estimator $T(Y)$ is $T - \phi$;
2. The bias of the estimator $T(Y)$ for $\phi$ is $\operatorname{bias}(T) = E(T) - \phi$, that is, the expected error;
3. The estimator $T(Y)$ is unbiased for $\phi$ if $E(T) = \phi$, that is, it has zero bias.

A property weaker than unbiasedness is defined below.

Definition 2.4. The estimator $T(Y)$ is asymptotically unbiased for $\phi$ if $E(T) \to \phi$ as $n \to \infty$, that is, the bias tends to zero as $n \to \infty$.

Another common measure of goodness of an estimator is given in the following definition.

Definition 2.5. The mean square error of $T(Y)$ as an estimator of $\phi$ is
\[
\operatorname{MSE}(T) = E\left[(T - \phi)^2\right].
\]
Note: the mean square error can be written as the sum of the variance and the squared bias of the estimator:
\begin{align*}
\operatorname{MSE}(T) &= E\left[\{T - E(T)\} + \{E(T) - \phi\}\right]^2 \\
&= E\left[\{T - E(T)\}^2 + 2\{T - E(T)\}\{E(T) - \phi\} + \{E(T) - \phi\}^2\right] \\
&= E\{T - E(T)\}^2 + 2\{E(T) - E(T)\}\{E(T) - \phi\} + \{E(T) - \phi\}^2 \\
&= \operatorname{var}(T) + \{\operatorname{bias}(T)\}^2.
\end{align*}
It seems natural that if we increase the sample size $n$ we should get a more precise estimate of $\phi$. Estimators which have this property are called consistent. In the following definition we use the notion of convergence in probability: a sequence of rvs $X_1, X_2, \ldots$ converges in probability to a random variable $X$ if for every $\varepsilon > 0$
\[
\lim_{n \to \infty} P\left(|X_n - X| < \varepsilon\right) = 1.
\]
Definition 2.6. A sequence $T_n = T_n(Y_1, Y_2, \ldots, Y_n)$, $n = 1, 2, \ldots$, is called a consistent sequence of estimators for $\phi$ if $T_n$ converges in probability to $\phi$, that is, for all $\varepsilon > 0$ and for all $\vartheta \in \Theta$,
\[
\lim_{n \to \infty} P\left(|T_n - \phi| < \varepsilon\right) = 1.
\]
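The decomposition $\operatorname{MSE}(T) = \operatorname{var}(T) + \{\operatorname{bias}(T)\}^2$ can be checked numerically. The following Python sketch estimates all three quantities by simulation for the biased variance estimator $T = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$; the choice of the population $N(0, 4)$ and $n = 10$ is an assumption for illustration.

```python
import random
import statistics

# Numerical check of MSE(T) = var(T) + bias(T)^2 for the biased variance
# estimator T = (1/n) * sum (Y_i - Ybar)^2 under N(mu, sigma^2).
random.seed(42)
mu, sigma2, n, reps = 0.0, 4.0, 10, 20000

estimates = []
for _ in range(reps):
    y = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    ybar = sum(y) / n
    estimates.append(sum((v - ybar) ** 2 for v in y) / n)  # divides by n, not n - 1

mse = statistics.fmean((t - sigma2) ** 2 for t in estimates)   # ~ E[(T - sigma^2)^2]
var = statistics.pvariance(estimates)                          # ~ var(T)
bias = statistics.fmean(estimates) - sigma2                    # theory: -sigma2/n = -0.4

print(f"bias ~ {bias:.3f} (theory {-sigma2 / n}), MSE ~ {mse:.3f}, var + bias^2 ~ {var + bias**2:.3f}")
```

For the simulated sample moments the identity holds exactly (up to float rounding), and the empirical bias is close to the theoretical value $-\sigma^2/n$, which tends to zero: this estimator is biased but asymptotically unbiased.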
Note: consistency means that the probability of our estimator being within some small $\varepsilon$ of $\phi$ can be made as close to one as we like by making the sample size $n$ sufficiently large. The following lemma gives a useful tool for establishing consistency of estimators.

Lemma 2.2. A sufficient condition for consistency is that $\operatorname{MSE}(T) \to 0$ as $n \to \infty$ or, equivalently, $\operatorname{var}(T) \to 0$ and $\operatorname{bias}(T) \to 0$ as $n \to \infty$.

Example 2.8. Suppose that $Y_1, \ldots, Y_n$ are independent Poisson($\lambda$) random variables and consider using $\bar{Y}$ to estimate $\lambda$. Then we have
\[
E(\bar{Y}) = \frac{1}{n} E(Y_1 + \cdots + Y_n) = \frac{1}{n}\{E(Y_1) + \cdots + E(Y_n)\} = \frac{1}{n}\, n\lambda = \lambda,
\]
and so $\bar{Y}$ is an unbiased estimator of $\lambda$. Next, by independence, we have
\[
\operatorname{var}(\bar{Y}) = \frac{1}{n^2}\{\operatorname{var}(Y_1) + \cdots + \operatorname{var}(Y_n)\} = \frac{1}{n^2}\, n\lambda = \frac{\lambda}{n},
\]
so that $\operatorname{MSE}(\bar{Y}) = \lambda/n$. Thus $\operatorname{MSE}(\bar{Y}) \to 0$ as $n \to \infty$, and so $\bar{Y}$ is a consistent estimator of $\lambda$.

A more general result about the mean is given in the following theorem, which states that the sample mean approaches the population mean as $n \to \infty$; it holds for any distribution with finite variance.

Theorem 2.2. Weak Law of Large Numbers. Let $X_1, X_2, \ldots$ be IID random variables with $E(X_i) = \mu$ and $\operatorname{var}(X_i) = \sigma^2 < \infty$. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for every $\varepsilon > 0$,
\[
\lim_{n \to \infty} P\left(|\bar{X}_n - \mu| < \varepsilon\right) = 1,
\]
that is, $\bar{X}_n$ converges in probability to $\mu$.

Proof. The proof is a straightforward application of Markov's Inequality, which says the following: let $X$ be a random variable and let $g(\cdot)$ be a non-negative function. Then, for any $k > 0$,
\[
P\left(g(X) \ge k\right) \le \frac{E[g(X)]}{k}. \tag{2.1}
\]
Proof of Markov's inequality (continuous case): for any rv $X$ and a non-negative function $g(\cdot)$ we can write
\[
E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx \ge \int_{x : g(x) \ge k} g(x) f_X(x)\, dx \ge \int_{x : g(x) \ge k} k\, f_X(x)\, dx = k\, P(g(X) \ge k).
\]
Hence (2.1) holds. Then,
\[
P\left(|\bar{X}_n - \mu| < \varepsilon\right) = 1 - P\left(|\bar{X}_n - \mu| \ge \varepsilon\right) = 1 - P\left((\bar{X}_n - \mu)^2 \ge \varepsilon^2\right) \ge 1 - \frac{E(\bar{X}_n - \mu)^2}{\varepsilon^2} = 1 - \frac{\operatorname{var}(\bar{X}_n)}{\varepsilon^2} = 1 - \frac{\sigma^2}{n\varepsilon^2} \to 1 \text{ as } n \to \infty.
\]
Markov's inequality is a generalization of Chebyshev's inequality. If we take $g(X) = (X - \mu)^2$ and $k = \sigma^2 t^2$ for some positive $t$, then Markov's inequality states that
\[
P\left((X - \mu)^2 \ge \sigma^2 t^2\right) \le \frac{E[(X - \mu)^2]}{\sigma^2 t^2} = \frac{\sigma^2}{\sigma^2 t^2} = \frac{1}{t^2}.
\]
This is equivalent to
\[
P\left(|X - \mu| \ge \sigma t\right) \le \frac{1}{t^2},
\]
which is known as Chebyshev's inequality.

Example 2.9. The rule of three sigma. This rule basically says that the chance that an rv takes values outside the interval $(\mu - 3\sigma, \mu + 3\sigma)$ is close to zero. Chebyshev's inequality, however, says that for any random variable the probability of deviating from its mean by more than three standard deviations is at most $1/9$, i.e.,
\[
P\left(|X - \mu| \ge 3\sigma\right) \le \frac{1}{9}.
\]
This suggests that the rule should be applied with caution. The rule of three sigma works well for normal rvs, or for variables whose distributions are close to normal. Let $X \sim N(\mu, \sigma^2)$. Then
\begin{align*}
P(|X - \mu| \ge 3\sigma) &= 1 - P(|X - \mu| \le 3\sigma) = 1 - P(\mu - 3\sigma \le X \le \mu + 3\sigma) \\
&= 1 - P(-3\sigma \le X - \mu \le 3\sigma) = 1 - P\left(-3 \le \frac{X - \mu}{\sigma} \le 3\right) \\
&= 1 - [\Phi(3) - \Phi(-3)] = 1 - [\Phi(3) - (1 - \Phi(3))] \\
&= 2(1 - \Phi(3)) = 2(1 - 0.99865) \approx 0.0027.
\end{align*}
Another important result, widely used in statistics, is the Central Limit Theorem (CLT), which uses the notion of convergence in distribution. We have already used this notion, but here is its formal definition.

Definition 2.7. A sequence of random variables $X_1, X_2, \ldots$ converges in distribution to a random variable $X$ if
\[
\lim_{n \to \infty} F_{X_n}(x) = F_X(x)
\]
at all points $x$ where the cdf $F_X(x)$ is continuous.

The following theorem describes the large-sample behaviour of the sample mean (compare the Weak Law of Large Numbers).

Theorem 2.3. The Central Limit Theorem. Let $X_1, X_2, \ldots$ be a sequence of independent random variables such that $E(X_i) = \mu$ and $\operatorname{var}(X_i) = \sigma^2 < \infty$. Define
\[
U_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.
\]
Then the sequence of random variables $U_1, U_2, \ldots$
converges in distribution to a random variable $U \sim N(0, 1)$, that is,
\[
\lim_{n \to \infty} F_{U_n}(u) = F_U(u),
\]
where $F_U(u)$ is the cdf of a standard normal rv.

Proof. The proof is based on the mgf and so it is valid only for the cases where the mgf exists. Here we assume that $M_{X_i}(t)$ exists in some neighbourhood of zero, say for $|t| < h$ for some positive $h$. We have
\[
U_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i,
\]
where $Y_i = \frac{X_i - \mu}{\sigma}$. The variables $X_i$ are IID, hence by Theorem 1.15 the variables $Y_i$ are also IID. Hence,
\[
M_{U_n}(t) = M_{\frac{1}{\sqrt{n}}\sum_{i=1}^{n} Y_i}(t) = E\left(e^{\frac{t}{\sqrt{n}}\sum_{i=1}^{n} Y_i}\right) = E\left(\prod_{i=1}^{n} e^{\frac{t}{\sqrt{n}} Y_i}\right) = \prod_{i=1}^{n} M_{Y_i}\left(\frac{t}{\sqrt{n}}\right) = \left[M_{Y_i}\left(\frac{t}{\sqrt{n}}\right)\right]^n.
\]
The common mgf of the $Y_i$ can be expanded into a Taylor series around zero, that is,
\[
M_{Y_i}\left(\frac{t}{\sqrt{n}}\right) = \sum_{k=0}^{\infty} \frac{M_{Y_i}^{(k)}(0)}{k!} \left(\frac{t}{\sqrt{n}}\right)^k,
\]
where $M_{Y_i}^{(k)}(0) = \frac{d^k}{dt^k} M_{Y_i}(t)\big|_{t=0}$. Since the mgfs exist for $|t| < h$, for some positive $h$, the Taylor series expansion is valid for $t < \sqrt{n}\, h$.

Now, $M_{Y_i}^{(1)}(0) = E(Y_i) = 0$, $M_{Y_i}^{(2)}(0) = E(Y_i^2) = \operatorname{var}(Y_i) = 1$ and $M_{Y_i}^{(0)}(0) = M_{Y_i}(0) = E(e^{t Y_i})\big|_{t=0} = E(1) = 1$. Hence, the series can be written as
\[
M_{Y_i}\left(\frac{t}{\sqrt{n}}\right) = 1 + 0 + \frac{(t/\sqrt{n})^2}{2!} + \sum_{k=3}^{\infty} \frac{M_{Y_i}^{(k)}(0)}{k!} \left(\frac{t}{\sqrt{n}}\right)^k = 1 + \frac{(t/\sqrt{n})^2}{2!} + R_{Y_i}(t/\sqrt{n}),
\]
where the remainder term $R_{Y_i}(t/\sqrt{n})$ tends to zero faster than the highest-order explicit term, that is,
\[
\lim_{n \to \infty} \frac{R_{Y_i}(t/\sqrt{n})}{(t/\sqrt{n})^2} = 0.
\]
This means that also $\lim_{n \to \infty} \frac{R_{Y_i}(t/\sqrt{n})}{1/n} = \lim_{n \to \infty} n R_{Y_i}(t/\sqrt{n}) = 0$ (as $t$ is fixed here and can be taken equal to 1).

Thus,
\begin{align*}
\lim_{n \to \infty} M_{U_n}(t) &= \lim_{n \to \infty} \left[M_{Y_i}\left(\frac{t}{\sqrt{n}}\right)\right]^n = \lim_{n \to \infty} \left[1 + \frac{(t/\sqrt{n})^2}{2!} + R_{Y_i}(t/\sqrt{n})\right]^n \\
&= \lim_{n \to \infty} \left[1 + \frac{1}{n}\underbrace{\left(\frac{t^2}{2} + n R_{Y_i}(t/\sqrt{n})\right)}_{\to\, t^2/2}\right]^n = e^{t^2/2} = M_U(t),
\end{align*}
which is the mgf of a standard normal rv.

The CLT means that the sample mean is asymptotically normally distributed, whatever the distribution of the original rvs. However, we have no way of knowing in general how good the approximation is; this depends on the original distribution of the $X_i$.

Exercise 2.3.
Let $Y = (Y_1, \ldots, Y_n)^T$ be a random sample from the distribution with the pdf given by
\[
f(y; \vartheta) = \begin{cases} \dfrac{2}{\vartheta^2}(\vartheta - y), & y \in [0, \vartheta], \\[4pt] 0, & \text{elsewhere.} \end{cases}
\]
For the estimator $T(Y) = 3\bar{Y}$, calculate its bias and variance and check whether it is a consistent estimator for $\vartheta$.

Exercise 2.4. Let $X_i \overset{\text{iid}}{\sim}$ Bernoulli($p$). Denote by $\bar{X}$ the sample mean.

(a) Show that $\hat{p} = \bar{X}$ is a consistent estimator of $p$.

(b) Show that $\widehat{pq} = \bar{X}(1 - \bar{X})$ is an asymptotically unbiased estimator of $pq$, where $q = 1 - p$.
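Before proving the claims of Exercise 2.4 analytically, it can help to see them numerically. The following Python sketch is a sanity check, not a proof; the value $p = 0.3$ and the tolerance $\varepsilon = 0.05$ are arbitrary choices.

```python
import random
from statistics import fmean

# Numerical sanity check for Exercise 2.4 (not a proof): as n grows, phat = Xbar
# concentrates around p, and the average of Xbar*(1 - Xbar) approaches pq = p(1 - p).
random.seed(3)
p, reps = 0.3, 4000
pq = p * (1 - p)   # 0.21

for n in (10, 100, 1000):
    xbars = [fmean(random.random() < p for _ in range(n)) for _ in range(reps)]
    within = fmean(abs(xb - p) < 0.05 for xb in xbars)    # ~ P(|Xbar - p| < eps)
    mean_pbqb = fmean(xb * (1 - xb) for xb in xbars)      # ~ E[Xbar * (1 - Xbar)]
    print(f"n={n:4d}: P(|Xbar - p| < 0.05) ~ {within:.3f}, E[Xbar(1-Xbar)] ~ {mean_pbqb:.4f}")
# The first column climbs toward 1 (consistency of Xbar); the second approaches
# pq = 0.21, in line with E[Xbar(1 - Xbar)] = pq * (n - 1) / n.
```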