MTH 202 : Probability and Statistics
Lecture S1 : 10. Statistics

10.1 : Sampling

Statistics deals with studying samples collected from various experiments or observations. For obvious reasons it would be best if we could obtain a large amount of such data, which would let us detect the statistically best expectation. However, this is often practically impossible. We therefore study a reasonably convenient number of data points, often called samples, chosen randomly.

Definition 10.1.1 : A population is a large set of values of observations or measurements of some experiment.

Definition 10.1.2 : A smaller subset of values from the population is called a sample.

Suppose we have a population of size $N$. Corresponding to each member of the population, we associate a numerical value, denoted by $x_1, x_2, \ldots, x_N$. We now introduce some parameters:

Population mean: $\mu := \frac{1}{N}\sum_{i=1}^{N} x_i$

Population total: $\tau := \sum_{i=1}^{N} x_i$

Population variance: $\sigma^2 := \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

We will often use the following identity:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$

There is a special case in which the values $x_1, x_2, \ldots, x_N$ are either 0 or 1 (often referred to as the dichotomous case), simply representing the presence or absence of a certain characteristic. In this case the population mean is the proportion of members with the characteristic, say $p$, and the population variance is $p(1-p)$. (Why?)

We choose a set of $n$ random samples (the choices determined by a random number generator) from the total population. We call it simple random sampling (SRS for short) if each particular sample of size $n$ has the same probability of occurrence. We will also assume that the $n$ sample members are chosen from the population without replacement (unless otherwise mentioned). In this case there are $\binom{N}{n}$ such choices. Since it is convenient to study samples of small size $n$ compared to $N$, the number $\binom{N}{n}$ is often very large. Thus it is practically impossible to study all $\binom{N}{n}$ samples.
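As a quick numerical sanity check, the identities above (the variance shortcut and the dichotomous variance $p(1-p)$) can be verified on a small population; the values below are invented purely for illustration:

```python
import random

# Hypothetical population values, invented purely for illustration.
population = [2.0, 5.0, 1.0, 7.0, 4.0, 6.0, 3.0, 8.0]
N = len(population)

mu = sum(population) / N                              # population mean
tau = sum(population)                                 # population total
var = sum((x - mu) ** 2 for x in population) / N      # population variance

# The identity sigma^2 = (1/N) * sum(x_i^2) - mu^2:
var_identity = sum(x * x for x in population) / N - mu ** 2
assert abs(var - var_identity) < 1e-12

# Dichotomous case: 0/1 values with proportion p give variance p(1 - p).
dichotomous = [1, 0, 1, 1, 0]
p = sum(dichotomous) / len(dichotomous)
var_d = sum((x - p) ** 2 for x in dichotomous) / len(dichotomous)
assert abs(var_d - p * (1 - p)) < 1e-12

# One simple random sample of size n = 3, drawn without replacement.
sample = random.sample(population, 3)
```

`random.sample` draws without replacement, matching the SRS convention adopted above.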
Since the samples of size $n$ are chosen randomly, it is important to realize them through a set $X_1, X_2, \ldots, X_n$ of random variables (mostly with unknown distribution). These are obviously dependent on each other. The following random variables will be important to us:

Sample mean: $\overline{X} := \frac{1}{n}\sum_{i=1}^{n} X_i$

Sample variance: $S^2 := \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2$

Estimate of the population total: $T := N\overline{X}$

The probability distributions of $\overline{X}, X_1, X_2, \ldots, X_n$ are referred to as sampling distributions. The $i$-th sample member is equally likely to be any of the $N$ population members, i.e., $P(X_i = x_j) = \frac{1}{N}$. However, in general a particular value is repeated among the population members:

Lemma 10.1.3 : Let $\zeta_1, \zeta_2, \ldots, \zeta_m$ denote the distinct population values, and let $n_j$ be the number of population members that have value $\zeta_j$. Then $X_i$ is a discrete RV with PMF
$$P(X_i = \zeta_j) = \frac{n_j}{N} \quad (1 \le i \le n,\ 1 \le j \le m).$$
Also, $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$.

Proof : See [RI, Page-205, Lemma A, Section 7.3].

Using this it can easily be verified that:

Theorem 10.1.4 : With SRS, $E(\overline{X}) = \mu$ and $E(T) = \tau$.

Lemma 10.1.5 : For SRS,
$$\mathrm{Cov}(X_i, X_j) = -\frac{\sigma^2}{N-1} \quad \text{if } i \ne j.$$

Proof : See [RI, Page-207, Lemma B, Section 7.3].

Theorem 10.1.6 : With SRS,
$$\mathrm{Var}(\overline{X}) = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right).$$

Proof : See [RI, Page-208, Theorem B, Section 7.3].

In case the sampling is done with replacement, it is easy to calculate that $\mathrm{Var}(\overline{X}) = \frac{\sigma^2}{n}$. The extra factor
$$1 - \frac{n-1}{N-1} = \frac{N-n}{N-1}$$
measures the difference of this sampling from the ideal scenario, which is sampling with replacement. The ideal scenario can occur when the population size is infinite. The above quantity is therefore called the finite population correction. The number $n/N$ is called the sampling fraction.

Corollary 10.1.7 : With SRS,
$$\mathrm{Var}(T) = \frac{N^2\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right).$$

10.2 : Estimation of Bias

Definition 10.2.1 : A statistic is a random variable which is a function of a set of random variables $X_1, X_2, \ldots, X_n$ constituting a random sample. E.g.
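Because a tiny population lets us enumerate every one of the $\binom{N}{n}$ equally likely samples, the exact sampling distribution of $\overline{X}$ can be computed directly. The sketch below (with invented population values) checks Theorems 10.1.4 and 10.1.6:

```python
from itertools import combinations

# Small hypothetical population so we can enumerate every SRS of size n.
population = [1.0, 3.0, 4.0, 7.0, 10.0]
N, n = len(population), 2

mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N

# Sample means over all C(N, n) equally likely samples without replacement.
means = [sum(s) / n for s in combinations(population, n)]
e_xbar = sum(means) / len(means)
var_xbar = sum((m - e_xbar) ** 2 for m in means) / len(means)

# Theorem 10.1.4: E(Xbar) = mu.
assert abs(e_xbar - mu) < 1e-12
# Theorem 10.1.6: Var(Xbar) = (sigma^2/n)(1 - (n-1)/(N-1)).
assert abs(var_xbar - (sigma2 / n) * (1 - (n - 1) / (N - 1))) < 1e-12
```

The enumeration is exact, so both assertions hold without any simulation error.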
the sample mean $\overline{X}$ and the sample variance $S^2$ are statistics.

We have come across certain parameters which arise while studying certain distributions. For example, $n$ and $p$ are the parameters of the binomial distribution $\mathrm{Bin}(n, p)$; we have discussed the parameter $\lambda$ of the $\mathrm{Poisson}(\lambda)$ distribution.

Definition 10.2.2 : A statistic $\hat{\Theta}$ is called an unbiased estimator of the parameter $\theta$ if $E(\hat{\Theta}) = \theta$; otherwise $\hat{\Theta}$ is called biased.

For example, if $X \sim \mathrm{Bin}(n, p)$, then $E(X) = np$, which means that $X$ is a biased estimator of the parameter $p$. However, $E(X/n) = p$, and hence the scaled random variable $X/n$ is an unbiased estimator of $p$.

In the previous section we noticed that $\overline{X}$ and $T$ are unbiased estimators of the (population) parameters $\mu$ and $\tau$ respectively.

Let $\hat{\sigma}^2$ be defined by
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \overline{X})^2.$$

Theorem 10.2.3 : With SRS,
$$E(\hat{\sigma}^2) = \sigma^2 \cdot \frac{n-1}{n} \cdot \frac{N}{N-1}.$$

Proof : See [RI, Page-211, Theorem A, Section 7.3.2].

Corollary 10.2.4 : With SRS, an unbiased estimator of $\mathrm{Var}(\overline{X})$ is
$$s_{\overline{X}}^2 = \frac{\hat{\sigma}^2}{n-1}\cdot\frac{N-n}{N} = \frac{s^2}{n}\left(1 - \frac{n}{N}\right),$$
where
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2.$$

10.3 : Confidence Interval

Let us presume the ideal scenario in which we have infinitely many random variables $X_1, X_2, \ldots$ which are independent and identically distributed with common mean $\mu$ and variance $\sigma^2$. We have the $n$-th average
$$\overline{X}_n = \frac{1}{n}(X_1 + X_2 + \ldots + X_n).$$
The Central Limit Theorem states that
$$P\left(\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \le z\right) \longrightarrow \Phi(z) \quad \text{as } n \to \infty.$$
However, in practice, neither are the variables independent, nor is there an infinite supply of them. For this reason we have earlier used a kind of method of approximation. The confidence interval is another device, estimating the error with an interval through probability.

Definition 10.3.1 : An interval estimate of a parameter $\theta$ is an interval of the form $\hat{\theta}_1 < \theta < \hat{\theta}_2$, where $\hat{\theta}_1, \hat{\theta}_2$ are values of appropriate RVs $\hat{\Theta}_1$ and $\hat{\Theta}_2$ respectively.
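The bias of $\hat{\sigma}^2$ in Theorem 10.2.3 can likewise be checked by exact enumeration over all equally likely SRS samples (population values again invented for illustration):

```python
from itertools import combinations

# Small hypothetical population; enumerate all SRS samples of size n.
population = [2.0, 4.0, 6.0, 9.0, 14.0]
N, n = len(population), 3

mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N

def sigma_hat2(sample):
    """The (biased) variance estimate (1/n) * sum (X_i - Xbar)^2."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample) / len(sample)

# Averaging over all equally likely samples gives the exact expectation.
vals = [sigma_hat2(s) for s in combinations(population, n)]
e_sigma_hat2 = sum(vals) / len(vals)

# Theorem 10.2.3: E(sigma_hat^2) = sigma^2 * ((n-1)/n) * (N/(N-1)).
assert abs(e_sigma_hat2 - sigma2 * ((n - 1) / n) * (N / (N - 1))) < 1e-9
```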
By "appropriate" we mean
$$P(\hat{\Theta}_1 < \theta < \hat{\Theta}_2) = 1 - \alpha$$
for some specified probability $1 - \alpha$, and we say that $(\hat{\theta}_1, \hat{\theta}_2)$ is a $100(1-\alpha)\%$ confidence interval.

10.3.2 Confidence interval of the mean $\mu$ :

Let $z(\alpha)$ (where $0 \le \alpha \le 1$) denote the point on the $x$-axis such that the area under the standard normal density curve over the interval $[z(\alpha), \infty)$ is $\alpha$. Since the rest of the area is $1 - \alpha$ and the curve is symmetric about the $y$-axis, we have $z(1-\alpha) = -z(\alpha)$. If a random variable $Z$ follows a standard normal distribution, then
$$P(-z(\alpha/2) < Z < z(\alpha/2)) = 1 - \alpha.$$
From the Central Limit Theorem we have learned that $(\overline{X} - \mu)/\sigma_{\overline{X}}$ is approximately standard normal, which means
$$P\left(-z(\alpha/2) < \frac{\overline{X} - \mu}{\sigma_{\overline{X}}} < z(\alpha/2)\right) \approx 1 - \alpha.$$
In other words,
$$P\left(\overline{X} - z(\alpha/2)\sigma_{\overline{X}} < \mu < \overline{X} + z(\alpha/2)\sigma_{\overline{X}}\right) \approx 1 - \alpha.$$
Hence the probability that the mean $\mu$ lies in the interval $(\overline{x}_0 - z(\alpha/2)\sigma_{\overline{X}},\ \overline{x}_0 + z(\alpha/2)\sigma_{\overline{X}})$ is approximately $1 - \alpha$ (for an appropriate observed value $\overline{x}_0$). We call this interval the $100(1-\alpha)\%$ confidence interval for $\mu$.

Exercise 10.3.3 : Suppose that a simple random sample is used to estimate the proportion of families in a certain area that are living below the poverty level. If this proportion is roughly 0.15, what sample size is necessary so that the standard error of the estimate is 0.02? (Page 240, Ex-7, Chapter-7, Rice)

Solution : Counting the proportion here is a dichotomous case. Hence $p = 0.15$, and therefore $\sigma^2 = p(1-p) = 0.15 \times 0.85$. Next, the standard error of the estimate is $\sigma_{\overline{X}} = 0.02 = \sigma/\sqrt{n}$ (ignoring the finite population correction). Hence
$$n = \frac{0.15 \times 0.85}{0.0004} \approx 319.$$

Exercise 10.3.4 : In a simple random sample of 1,500 voters, 55% said they planned to vote for a particular proposition, and 45% said they planned to vote against it. The estimated margin of victory for the proposition is thus 10%. What is the standard error of this estimated margin? What is an approximate 95% confidence interval for the margin?
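The arithmetic of Exercise 10.3.3 can be reproduced in a few lines:

```python
import math

# Exercise 10.3.3: dichotomous population with p roughly 0.15; find n so that
# the standard error sigma / sqrt(n) of the estimated proportion is 0.02
# (finite population correction ignored).
p = 0.15
se_target = 0.02

sigma2 = p * (1 - p)             # variance of a 0/1 value: p(1 - p)
n = sigma2 / se_target ** 2      # solve se = sqrt(sigma2 / n) for n

# n = 0.1275 / 0.0004 = 318.75, so a sample of about 319 is needed.
assert math.ceil(n) == 319
```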
(Page 240, Ex-9, Chapter-7, Rice)

Solution : The sample size is $n = 1500$. Let $p = 0.55$ denote the proportion of votes for the particular proposition, say $Q$. Then $1 - p = 0.45$ is the proportion of votes against $Q$. Since the estimator of $p$ is
$$\overline{X} = \frac{1}{n}(X_1 + \ldots + X_n),$$
the estimator of the margin of victory $0.1 = p - (1-p) = 2p - 1$ is $Y = 2\overline{X} - 1$, i.e., $E(Y) = 2p - 1$. The variance of $Y$ is
$$\sigma_Y^2 = \mathrm{Var}(Y) = E(Y - E(Y))^2 = 4\,\mathrm{Var}(\overline{X}),$$
and $\mathrm{Var}(\overline{X}) = p(1-p)/n$ with $p(1-p) = 0.55 \times 0.45$. Hence the standard error is
$$\sigma_Y = \sqrt{\frac{4 \times 0.55 \times 0.45}{1500}} = \sqrt{0.00066} \approx 0.026.$$
An approximate 95% confidence interval for the margin is given by
$$(E(Y) - 1.96\sigma_Y,\ E(Y) + 1.96\sigma_Y) = (0.1 - 1.96 \times 0.026,\ 0.1 + 1.96 \times 0.026) = (0.049,\ 0.151).$$

10.4 : Approximation and Estimation of Ratio

In a few statistical problems it is important to estimate certain ratios, e.g. the ratio of adults who have high school degrees, or the ratio of children in a family who are not affected by polio. Thus if there are two sets of values corresponding to the population members,
$$x_1, x_2, \ldots, x_N; \quad y_1, y_2, \ldots, y_N,$$
the ratio of interest is
$$r = \frac{y_1 + y_2 + \ldots + y_N}{x_1 + x_2 + \ldots + x_N} = \frac{\mu_y}{\mu_x},$$
where $\mu_x, \mu_y$ are the population means of the $x$-values and the $y$-values respectively. The suffixes $x$ and $y$ are naturally picked.

We introduce the variable $R$ defined by
$$R := \frac{\overline{Y}}{\overline{X}},$$
and we will be estimating $E(R)$ and $\mathrm{Var}(R)$.

The population covariance $\sigma_{xy}$ of $x$ and $y$ is defined by
$$\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y).$$

Exercise 10.4.1 : With SRS show that
$$\mathrm{Cov}(\overline{X}, \overline{Y}) = \frac{\sigma_{xy}}{n}\left(1 - \frac{n-1}{N-1}\right).$$

We have already seen that it is not so easy to deal with $\overline{Y}/\overline{X}$. Before we discuss how to estimate $E(R)$, we need to establish some approximation formulae.
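Similarly, the computation in Exercise 10.3.4 can be reproduced directly:

```python
import math

# Exercise 10.3.4: margin Y = 2*Xbar - 1 with p = 0.55 and n = 1500.
n, p = 1500, 0.55
margin = 2 * p - 1                        # estimated margin, 0.10

# Var(Y) = 4 * Var(Xbar) = 4 * p * (1 - p) / n, so the standard error is:
se_margin = math.sqrt(4 * p * (1 - p) / n)

# Approximate 95% confidence interval: margin +/- 1.96 * se.
lo = margin - 1.96 * se_margin
hi = margin + 1.96 * se_margin

assert abs(se_margin - 0.026) < 1e-3
assert abs(lo - 0.049) < 1e-2 and abs(hi - 0.151) < 1e-2
```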
Let $g$ be a twice differentiable function. If $U$ is a continuous random variable, we can use Taylor's approximation to establish
$$V = g(U) \approx g(\mu_U) + (U - \mu_U)g'(\mu_U),$$
and, a little more accurately,
$$V = g(U) \approx g(\mu_U) + (U - \mu_U)g'(\mu_U) + \frac{1}{2}(U - \mu_U)^2 g''(\mu_U).$$
Applying the formulations of expectation and variance we deduce
$$\mu_V \approx g(\mu_U), \qquad \mathrm{Var}(V) = \sigma_V^2 \approx [g'(\mu_U)]^2\sigma_U^2.$$
The mean can be improved with the second-order term, e.g.
$$\mu_V \approx g(\mu_U) + \frac{1}{2}\sigma_U^2\, g''(\mu_U).$$

If $Z = g(U, V)$, where $g$ is differentiable up to second order and $\mu = (\mu_U, \mu_V)$ (with $\mu_U, \mu_V$ the marginal expectations of $U, V$), then
$$Z = g(U, V) \approx g(\mu) + (U - \mu_U)\frac{\partial g}{\partial u}(\mu) + (V - \mu_V)\frac{\partial g}{\partial v}(\mu),$$
which implies $E(Z) \approx g(\mu)$ and
$$\mathrm{Var}(Z) \approx \left[\frac{\partial g}{\partial u}(\mu)\right]^2\sigma_U^2 + \left[\frac{\partial g}{\partial v}(\mu)\right]^2\sigma_V^2 + 2\sigma_{UV}\,\frac{\partial g}{\partial u}(\mu)\,\frac{\partial g}{\partial v}(\mu),$$
where $\sigma_{UV} = \mathrm{Cov}(U, V)$.

For the function $Z = g(U, V) = V/U$ (i.e., $g(u, v) = v/u$) we have
$$\frac{\partial g}{\partial u} = -\frac{v}{u^2}, \quad \frac{\partial g}{\partial v} = \frac{1}{u}, \quad \frac{\partial^2 g}{\partial u^2} = \frac{2v}{u^3}, \quad \frac{\partial^2 g}{\partial v^2} = 0, \quad \frac{\partial^2 g}{\partial u\,\partial v} = -\frac{1}{u^2}.$$
Hence if $\mu_U \ne 0$, we have
$$E(Z) \approx \frac{\mu_V}{\mu_U} + \frac{1}{\mu_U^2}\left(\frac{\mu_V}{\mu_U}\sigma_U^2 - \rho\sigma_U\sigma_V\right),$$
where $\rho$ is the correlation coefficient given by $\sigma_{UV} = \rho\sigma_U\sigma_V$. Similarly,
$$\mathrm{Var}(Z) \approx \frac{1}{\mu_U^2}\left(\frac{\mu_V^2}{\mu_U^2}\sigma_U^2 + \sigma_V^2 - 2\rho\sigma_U\sigma_V\frac{\mu_V}{\mu_U}\right).$$
For more details see Page-161, Sec-4.6, Chapter-4, Rice.

Now we apply this to $R = \overline{Y}/\overline{X}$.

Theorem 10.4.2 : With SRS, the approximate variance of $R$ is
$$\mathrm{Var}(R) \approx \frac{1}{\mu_x^2}\left(r^2\sigma_{\overline{X}}^2 + \sigma_{\overline{Y}}^2 - 2r\sigma_{\overline{X}\,\overline{Y}}\right) = \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\frac{1}{\mu_x^2}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\sigma_{xy}\right).$$

Proof : Since $\mu_{\overline{X}} = \mu_x$ and $\mu_{\overline{Y}} = \mu_y$, the first equality follows from the form of $\mathrm{Var}(Z)$ above. The second follows from Theorem 10.1.6.

Using the definition of the correlation coefficient $\rho$ we can also express
$$\mathrm{Var}(R) \approx \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\frac{1}{\mu_x^2}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y\right).$$

Theorem 10.4.3 : With SRS,
$$E(R) \approx r + \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\frac{1}{\mu_x^2}\left(r\sigma_x^2 - \rho\sigma_x\sigma_y\right).$$

10.4.4 : Estimating the standard error of R

The population variances of $x$ and $y$ are estimated by $s_x^2$ and $s_y^2$. The population covariance is estimated by
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y}).$$
The population correlation is estimated by $\hat{\rho} = \frac{s_{xy}}{s_x s_y}$.
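The specialization of the delta method to $g(u, v) = v/u$ can be checked numerically: with hypothetical moments (the numbers below are invented), the general first-order variance formula and the closed form above agree exactly:

```python
# Hypothetical moments, invented for illustration only.
mu_u, mu_v = 2.0, 3.0
sigma_u, sigma_v, rho = 0.1, 0.2, 0.4
sigma_uv = rho * sigma_u * sigma_v

# Partial derivatives of g(u, v) = v/u evaluated at (mu_u, mu_v).
g_u = -mu_v / mu_u ** 2
g_v = 1 / mu_u

# General first-order delta-method variance approximation.
var_general = (g_u ** 2 * sigma_u ** 2 + g_v ** 2 * sigma_v ** 2
               + 2 * sigma_uv * g_u * g_v)

# Closed form for Var(Z), Z = V/U, as in the notes.
var_closed = (1 / mu_u ** 2) * ((mu_v ** 2 / mu_u ** 2) * sigma_u ** 2
                                + sigma_v ** 2
                                - 2 * rho * sigma_u * sigma_v * mu_v / mu_u)
assert abs(var_general - var_closed) < 1e-10

# Second-order mean correction:
# E(Z) ~ mu_v/mu_u + (1/mu_u^2)(sigma_u^2 * mu_v/mu_u - rho*sigma_u*sigma_v)
e_z = mu_v / mu_u + (1 / mu_u ** 2) * (sigma_u ** 2 * mu_v / mu_u
                                       - rho * sigma_u * sigma_v)
```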
Hence the variance of $R$ is estimated by
$$s_R^2 = \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\frac{1}{\overline{X}^2}\left(R^2 s_x^2 + s_y^2 - 2R s_{xy}\right).$$

Note 10.4.5 : The following statements can be verified by the properties discussed above:

a. An approximate $100(1-\alpha)\%$ confidence interval for $r$ is $R \pm z(\frac{\alpha}{2})s_R$.

b. The ratio estimate of $\mu_y$ is
$$\overline{Y}_R = \mu_x R = \mu_x\frac{\overline{Y}}{\overline{X}}.$$

c. The approximate variance of the ratio estimate of $\mu_y$ is
$$\mathrm{Var}(\overline{Y}_R) \approx \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y\right).$$

d. The approximate bias of the ratio estimate of $\mu_y$ is
$$E(\overline{Y}_R) - \mu_y \approx \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\frac{1}{\mu_x}\left(r\sigma_x^2 - \rho\sigma_x\sigma_y\right).$$

Corollary 10.4.6 : The variance of $\overline{Y}_R$ can be estimated by
$$s_{\overline{Y}_R}^2 = \frac{1}{n}\left(1 - \frac{n-1}{N-1}\right)\left(R^2 s_x^2 + s_y^2 - 2R s_{xy}\right),$$
and an approximate $100(1-\alpha)\%$ confidence interval for $\mu_y$ is
$$\overline{Y}_R \pm z\left(\frac{\alpha}{2}\right)s_{\overline{Y}_R}.$$

10.5 : Stratified Random Sampling

Studying raw data alone is often of limited use, since it gives a very naive view. For instance, if we are trying to understand education levels, or the proportion of people living below the poverty level, we need to keep in mind that there are certain regions where the expected values are considerably larger, whereas certain states marked as backward would have lower expected values of the population parameters. For this reason it is often necessary to divide the total population into smaller groups, called "strata" (singular "stratum"), depending on, say, geographical locations, or on certain natural properties, which are independent of each other.

We first introduce formal notation. Suppose that there are $L$ strata, and denote by $N_l$ the number of population members in the $l$-th stratum. The number of members in the total population is then $N = N_1 + N_2 + \ldots + N_L$. Let the values corresponding to the population members of the $l$-th stratum be $x_{1l}, x_{2l}, \ldots, x_{N_l l}$, and let the mean and the variance of the $l$-th stratum be $\mu_l$ and $\sigma_l^2$ respectively ($1 \le l \le L$). The $l$-th population ratio is $W_l = N_l/N$.
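A sketch of the computation in 10.4.4 on paired sample data; both the data and the population size $N$ below are hypothetical, invented for illustration:

```python
import math

# Hypothetical paired sample (X_i, Y_i) and an assumed population size N.
xs = [12.0, 15.0, 9.0, 14.0, 11.0, 13.0]
ys = [30.0, 37.0, 24.0, 36.0, 27.0, 33.0]
n, N = len(xs), 100

xbar = sum(xs) / n
ybar = sum(ys) / n
R = ybar / xbar                           # ratio estimator R = Ybar / Xbar

# Sample variances and covariance (divisor n - 1).
sx2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
sy2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

# s_R^2 = (1/n)(1 - (n-1)/(N-1)) (1/xbar^2) (R^2 sx2 + sy2 - 2 R sxy)
sR2 = ((1 / n) * (1 - (n - 1) / (N - 1))
       * (R ** 2 * sx2 + sy2 - 2 * R * sxy) / xbar ** 2)
sR = math.sqrt(sR2)

# Note 10.4.5(a): approximate 95% confidence interval for r.
ci = (R - 1.96 * sR, R + 1.96 * sR)
```

The quadratic form $R^2 s_x^2 + s_y^2 - 2R s_{xy}$ equals $\frac{1}{n-1}\sum\big((Y_i - \overline{Y}) - R(X_i - \overline{X})\big)^2$, so it is always nonnegative and the square root is safe.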
If the population mean is $\mu$, then
$$\mu = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N_l} x_{il} = \frac{1}{N}\sum_{l=1}^{L} N_l\mu_l = \sum_{l=1}^{L} W_l\mu_l.$$

Assume that, within each stratum, an SRS of sample size $n_l$ is taken. We realize these by the random variables $X_{1l}, X_{2l}, \ldots, X_{n_l l}$, meaning in particular that $X_{il_1}$ and $X_{jl_2}$ are independent of each other if $l_1 \ne l_2$ (the independence condition). We define the sample mean $\overline{X}_l$ as
$$\overline{X}_l := \frac{1}{n_l}\sum_{i=1}^{n_l} X_{il} \quad (1 \le l \le L).$$
It is easy to see that $\overline{X}_{l_1}$ and $\overline{X}_{l_2}$ are independent if $l_1 \ne l_2$.

The stratified estimate $\overline{X}_s$ of the mean is defined by
$$\overline{X}_s = \frac{1}{N}\sum_{l=1}^{L} N_l\overline{X}_l = \sum_{l=1}^{L} W_l\overline{X}_l,$$
where the suffix $s$ simply stands for "stratified" and is not a numeral.

Theorem 10.5.1 : The stratified estimate $\overline{X}_s$ of the population mean is unbiased.

Proof :
$$E(\overline{X}_s) = \sum_{l=1}^{L} W_l E(\overline{X}_l) = \sum_{l=1}^{L} W_l\mu_l = \mu.$$

From now on, we will describe this setup by calling it stratified SRS.

Theorem 10.5.2 : With stratified SRS,
$$\mathrm{Var}(\overline{X}_s) = \sum_{l=1}^{L} W_l^2\cdot\frac{\sigma_l^2}{n_l}\left(1 - \frac{n_l-1}{N_l-1}\right).$$

Proof : See [RI, Page-229, Theorem B, Section 7.5.2, Chapter-7].

If the sampling fractions $(n_l-1)/(N_l-1)$ within all strata are small, then from the above theorem we have
$$\mathrm{Var}(\overline{X}_s) \approx \sum_{l=1}^{L} W_l^2\cdot\frac{\sigma_l^2}{n_l}.$$
See [RI, Example A, Page-230, Section 7.5.2, Chapter-7].

Now if $T_s = N\overline{X}_s$ denotes the stratified estimate of the population total, we have:

Corollary 10.5.3 : With stratified SRS, $E(T_s) = \tau$ and
$$\mathrm{Var}(T_s) = N^2\,\mathrm{Var}(\overline{X}_s) = \sum_{l=1}^{L} N_l^2\cdot\frac{\sigma_l^2}{n_l}\left(1 - \frac{n_l-1}{N_l-1}\right).$$

The estimate of $\sigma_l^2$ is given by
$$s_l^2 = \frac{1}{n_l-1}\sum_{i=1}^{n_l} (X_{il} - \overline{X}_l)^2,$$
and $\mathrm{Var}(\overline{X}_s)$ is estimated by
$$s_{\overline{X}_s}^2 = \sum_{l=1}^{L} W_l^2\cdot\frac{s_l^2}{n_l}\left(1 - \frac{n_l}{N_l}\right).$$
(Compare this with Corollary 10.2.4 above.)

10.5.4 : Methods of allocation

It is important to regulate the allocation by appropriate methods. There are essentially two methods we will discuss here. We have seen that, neglecting the finite population correction,
$$\mathrm{Var}(\overline{X}_s) = \sum_{l=1}^{L}\frac{W_l^2\sigma_l^2}{n_l}.$$
The first method of allocation is called Neyman allocation, which minimizes $\mathrm{Var}(\overline{X}_s)$ subject to the condition $n_1 + n_2 + \ldots + n_L = n$.
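A sketch of Theorem 10.5.2 on invented strata, where the stratum sizes, means, and variances are assumed known so the formulas can be evaluated directly:

```python
# Hypothetical strata: (N_l, n_l, mu_l, sigma_l^2), all numbers invented.
strata = [
    (500, 10, 20.0, 4.0),
    (300, 10, 35.0, 9.0),
    (200, 10, 50.0, 16.0),
]
N = sum(Nl for Nl, _, _, _ in strata)

# Population mean: mu = sum_l W_l * mu_l, with W_l = N_l / N.
mu = sum((Nl / N) * mul for Nl, _, mul, _ in strata)

# Theorem 10.5.2:
# Var(Xbar_s) = sum_l W_l^2 * (sigma_l^2 / n_l) * (1 - (n_l - 1)/(N_l - 1))
var_xbar_s = sum(
    (Nl / N) ** 2 * (s2 / nl) * (1 - (nl - 1) / (Nl - 1))
    for Nl, nl, _, s2 in strata
)
```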
Theorem 10.5.5 : The sample sizes $n_1, n_2, \ldots, n_L$ that minimize $\mathrm{Var}(\overline{X}_s)$ subject to the constraint $n_1 + n_2 + \ldots + n_L = n$ are given by
$$n_l = n\cdot\frac{W_l\sigma_l}{W_1\sigma_1 + \ldots + W_L\sigma_L}, \quad l = 1, 2, \ldots, L.$$

Proof : See [RI, Page-232, Section 7.5.3, Chapter-7].

Substituting the optimal value of $n_l$ from the above theorem into the expression for $\mathrm{Var}(\overline{X}_s)$ we obtain:

Corollary 10.5.6 : Denoting by $\overline{X}_{so}$ the stratified estimate using the optimal allocations as given in the previous theorem, and neglecting the finite population correction,
$$\mathrm{Var}(\overline{X}_{so}) = \frac{1}{n}\left(\sum_{l=1}^{L} W_l\sigma_l\right)^2.$$

The previous method is technically difficult to employ in sampling. A rather simpler method, called proportional allocation, is easier to compute. Suppose the sampling fraction is constant in each stratum, i.e.,
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \ldots = \frac{n_L}{N_L} = \frac{n_1 + \ldots + n_L}{N_1 + \ldots + N_L} = \frac{n}{N}.$$
The estimate of the population mean based on proportional allocation is
$$\overline{X}_{sp} = \sum_{l=1}^{L} W_l\overline{X}_l = \frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n_l} X_{il}.$$

Theorem 10.5.7 : For the stratified estimate using proportional allocation, neglecting the finite population correction,
$$\mathrm{Var}(\overline{X}_{sp}) = \frac{1}{n}\sum_{l=1}^{L} W_l\sigma_l^2.$$

Finally, a comparison between the two variances is necessary to judge which method is more suitable for a particular case of study.

Theorem 10.5.8 : With SRS, the difference between the above two variances, ignoring the finite population correction, is given by
$$\mathrm{Var}(\overline{X}_{sp}) - \mathrm{Var}(\overline{X}_{so}) = \frac{1}{n}\sum_{l=1}^{L} W_l(\sigma_l - \overline{\sigma})^2,$$
where
$$\overline{\sigma} = \sum_{l=1}^{L} W_l\sigma_l.$$

Proof : See [RI, Page-235, Section 7.5.3, Chapter-7].

Theorem 10.5.9 : With SRS, neglecting the finite population correction,
$$\mathrm{Var}(\overline{X}) - \mathrm{Var}(\overline{X}_{sp}) = \frac{1}{n}\sum_{l=1}^{L} W_l(\mu_l - \mu)^2.$$

Proof : See [RI, Pages 236-237, Section 7.5.3, Chapter-7].

References :

[RS] V.K. Rohatgi and A.K. Saleh, An Introduction to Probability and Statistics, Second Edition, Wiley Students Edition.

[RI] John A. Rice, Mathematical Statistics and Data Analysis, Cengage Learning, 2013.