ST3239: Survey Methodology
by Wang ZHOU
Chapter 1
Elements of the sampling problem
1.1 Introduction
Often we are interested in some characteristics of a finite population, e.g. the average income of last year's graduates from NUS. Since the population is usually very large, we would like to say something (i.e. make inferences) about the population by collecting and analysing only a part of that population. The principles and methods of collecting and analysing data from a finite population form a branch of statistics known as Sample Survey Methods. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs and medicine.
1.2 Some technical terms
1. An element is an object on which a measurement is taken.
2. A population is a collection of elements about which we require information.
3. A population characteristic is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from NUS, or the total wheat yield of all farmers in a certain country.
4. Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.
5. A frame is a list of sampling units, e.g., a telephone directory.
6. A sample is a collection of sampling units drawn from a frame or frames.
1.3 Why sample?
If a sample is equal to the population, then we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:
• cost (money is limited),
• time (time is limited),
• destructiveness (testing a product can be destructive, e.g. light bulbs),
• accessibility (non-response can be a serious issue).
In such cases, sampling is the only alternative.
1.4 How to select the sample: the design of the sample survey
The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are "representative" of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and other sampling schemes.
1. Probability sampling is a sampling scheme whereby the possible samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as "our estimate is unbiased and we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.
2. Some other sampling schemes
a) 'volunteer sampling': TV telephone polls, medical volunteers for research.
b) 'subjective sampling': we choose samples that we consider to be typical or "representative" of the population.
c) 'quota sampling': one keeps sampling until a certain quota is filled.
All these sampling procedures provide some information about the population, but it is
hard to deduce the nature of the population from the studies as the samples are very subjective
and often very biased. Furthermore, it is hard to measure the precision of these estimates.
1.5 How to design a questionnaire and plan a survey
This can be the most important and perhaps most difficult part of the survey sampling problem. We shall come back to this point in more detail later.
Chapter 2
Simple random sampling
Definition: If a sample of size n is drawn from a population of size N in such a way that every
possible sample of size n has the same probability of being selected, the sampling procedure
is called simple random sampling. The sample thus obtained is called a simple random
sample. Simple random sampling is often written as s.r.s. for short and is the simplest
sampling procedure.
2.1 How to draw a simple random sample
Suppose that the population of size N has values
{u_1, u_2, ..., u_N}.
If we draw n (distinct) items without replacement from the population, there are altogether C(N, n) = N!/(n!(N−n)!) different ways of doing it. So if we assign probability 1/C(N, n) to each of the different samples, then each sample thus obtained is a simple random sample. We denote this sample by
{y_1, y_2, ..., y_n}.
Remark: In our previous statistics courses, we always used upper-case letters like X, Y etc. to denote random variables and lower-case letters like x, y etc. to represent fixed values. However, in this sample survey course, by convention, we use lower-case letters like y_1, y_2 etc. to denote random variables.
Theorem 2.1.1 For simple random sampling, we have
P(y_1 = u_{i_1}, y_2 = u_{i_2}, ..., y_n = u_{i_n}) = (1/N) · (1/(N−1)) ··· (1/(N−n+1)) = (N−n)!/N!,
where i_1, i_2, ..., i_n are mutually different.

Proof. By the definition of s.r.s., the probability of obtaining the sample {u_{i_1}, u_{i_2}, ..., u_{i_n}} (where the order is not important) is 1/C(N, n). There are n! ways of ordering {u_{i_1}, u_{i_2}, ..., u_{i_n}}. Therefore,
P(y_1 = u_{i_1}, y_2 = u_{i_2}, ..., y_n = u_{i_n}) = (1/C(N, n)) · (1/n!) = (N−n)! n!/(N! n!) = (N−n)!/N!.
Remark: Recall that the total number of all possible samples is C(N, n), which could be very large if N and n are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw n values at random without replacement from the N population values. That is, we first draw one value at random from the N population values, and then draw another value at random from the remaining N − 1 population values, and so on, until we get a sample of n (different) values.
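As a side note, this sequential procedure is easy to program. Here is a minimal Python sketch (the function name draw_srs is ours, purely for illustration); the standard library call random.sample(population, n) implements the same idea.

import random

def draw_srs(population, n):
    # Draw one value at a time from the remaining pool, each remaining
    # unit being equally likely at every step.
    pool = list(population)
    sample = []
    for _ in range(n):
        sample.append(pool.pop(random.randrange(len(pool))))
    return sample

print(draw_srs(range(10), 3))   # e.g. [7, 2, 5]

Theorem 2.1.2 below confirms that this sequential scheme indeed produces a simple random sample.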
Theorem 2.1.2 A sample obtained by drawing n values successively without replacement from
the N population values is a simple random sample.
Proof. Suppose that our sample obtained by drawing n values without replacement from the
N population values is
{a_1, a_2, ..., a_n},
where the order is not important. Let {a_{i_1}, a_{i_2}, ..., a_{i_n}} be any permutation of {a_1, a_2, ..., a_n}. Since the sample is drawn without replacement, we have
P(y_1 = a_{i_1}, ..., y_n = a_{i_n}) = (1/N) · (1/(N−1)) ··· (1/(N−n+1)) = (N−n)!/N!.
Hence, the probability of obtaining the sample {a_1, ..., a_n} (where the order is not important) is
Σ_{all (i_1,...,i_n)} P(y_1 = a_{i_1}, ..., y_n = a_{i_n}) = Σ_{all (i_1,...,i_n)} (N−n)!/N! = n! × (N−n)!/N! = 1/C(N, n).
The theorem is thus proved by the definition of simple random sampling.
Two special cases, corresponding to one and two fixed coordinates of the sample, will be used later.
Theorem 2.1.3 For any i, j = 1, ..., n and s, t = 1, ..., N,
(i) P(y_i = u_s) = 1/N;
(ii) P(y_i = u_s, y_j = u_t) = 1/(N(N−1)), for i ≠ j, s ≠ t.

Proof. For (i),
P(y_k = u_s) = Σ_{all (i_1,...,i_n) with i_k = s} P(y_1 = u_{i_1}, ..., y_k = u_{i_k}, ..., y_n = u_{i_n})
= C(N−1, n−1) (n−1)! × (N−n)!/N! = ((N−1)!/(N−n)!) × ((N−n)!/N!) = 1/N.
For (ii),
P(y_k = u_s, y_j = u_t) = Σ_{all (i_1,...,i_n) with i_k = s, i_j = t} P(y_1 = u_{i_1}, ..., y_n = u_{i_n})
= C(N−2, n−2) (n−2)! × (N−n)!/N! = ((N−2)!/(N−n)!) × ((N−n)!/N!) = 1/(N(N−1)).
Example 1. A population contains {a, b, c, d}. We wish to draw a s.r.s. of size 2. List all possible samples and find the probability of drawing {b, d}.
Solution. The possible samples of size 2 are
{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.
The probability of drawing {b, d} is 1/6.
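The enumeration can be checked in a couple of lines of Python (a sketch; each tuple stands for an unordered sample):

from itertools import combinations

population = ['a', 'b', 'c', 'd']
samples = list(combinations(population, 2))   # all C(4, 2) = 6 samples
print(samples)
print(1 / len(samples))                       # P({b, d}) = 1/6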
2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

Suppose that the population of size N has values {u_1, u_2, ..., u_N}. We can define
1) the population mean
μ = (u_1 + u_2 + ··· + u_N)/N = (1/N) Σ_{i=1}^N u_i,
2) the population variance
σ² = (1/N) Σ_{i=1}^N (u_i − μ)².
We wish to estimate the quantities μ and σ² and to study the accuracy of their estimators. Suppose that a simple random sample of size n is drawn, resulting in {y_1, y_2, ..., y_n}. Then an obvious estimator for μ is the sample mean:
μ̂ = ȳ = (1/n) Σ_{i=1}^n y_i.
Theorem 2.2.1
(i) E(y_i) = μ, Var(y_i) = σ²;
(ii) Cov(y_i, y_j) = −σ²/(N−1), for i ≠ j.

Proof. (i). By an earlier theorem (Theorem 2.1.3),
E(y_i) = Σ_{k=1}^N u_k P(y_i = u_k) = Σ_{k=1}^N u_k (1/N) = μ,
Var(y_i) = Σ_{k=1}^N (u_k − μ)² P(y_i = u_k) = Σ_{k=1}^N (u_k − μ)² (1/N) = σ².
(ii). By definition, Cov(y_i, y_j) = E(y_i y_j) − E(y_i)E(y_j) = E(y_i y_j) − μ². Now, by Theorem 2.1.3(ii),
E(y_i y_j) = Σ_{s≠t} u_s u_t P(y_i = u_s, y_j = u_t) = (1/(N(N−1))) Σ_{s≠t} u_s u_t
= (1/(N(N−1))) [ Σ_{all s,t} u_s u_t − Σ_{s=t} u_s u_t ]
= (1/(N(N−1))) [ (Σ_{s=1}^N u_s)(Σ_{t=1}^N u_t) − Σ_{s=1}^N u_s² ]
= (1/(N(N−1))) [ (Nμ)² − (Σ_{s=1}^N (u_s − μ)² + Nμ²) ]
= (1/(N(N−1))) [ (Nμ)² − Nσ² − Nμ² ]
= −σ²/(N−1) + μ².
Thus, Cov(y_i, y_j) = E(y_i y_j) − μ² = −σ²/(N−1).
Theorem 2.2.2
E(ȳ) = μ,  Var(ȳ) = (σ²/n) · (N−n)/(N−1).
Proof. Note ȳ = (1/n)(y_1 + ··· + y_n). So
E(ȳ) = (1/n)(E y_1 + ··· + E y_n) = (1/n)(nμ) = μ.
Now
Var(ȳ) = (1/n²) Cov(Σ_{i=1}^n y_i, Σ_{j=1}^n y_j) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Cov(y_i, y_j)
= (1/n²) [ Σ_{i≠j} Cov(y_i, y_j) + Σ_{i=j} Cov(y_i, y_j) ]
= (1/n²) [ Σ_{i≠j} (−σ²/(N−1)) + Σ_{i=1}^n Var(y_i) ]
= (1/n²) [ n(n−1) · (−σ²/(N−1)) + nσ² ]
= (σ²/n) [ (n−1) · (−1/(N−1)) + 1 ]
= (σ²/n) · (N−n)/(N−1).
Remark: From Theorem 2.2.2, we see that ȳ is an unbiased estimator for μ. Also, as n gets larger (but n ≤ N), Var(ȳ) decreases towards 0. This implies that ȳ becomes a more accurate estimator for μ as n grows. In particular, when n = N, we have a census and Var(ȳ) = 0.
Remark: In our previous statistics courses, we usually sampled {y_1, y_2, ..., y_n} from the population with replacement, so that {y_1, y_2, ..., y_n} are independent and identically distributed (i.i.d.). Recall that in that case
E_iid(ȳ) = μ,  Var_iid(ȳ) = σ²/n.
Notice that Var_iid(ȳ) is different from Var(ȳ) in Theorem 2.2.2. In fact, for n > 1,
Var(ȳ) = (σ²/n) · (N−n)/(N−1) < σ²/n = Var_iid(ȳ).
Thus, for the same sample size n, sampling without replacement produces a less variable estimator of μ. Why?
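The inequality is easy to see empirically. Below is a minimal Monte Carlo sketch (the population values are made up for illustration): the sample mean under sampling without replacement shows the smaller variance predicted by Theorem 2.2.2.

import random
import statistics

# A made-up population of N = 20 values, and sample size n = 5.
population = [3, 7, 1, 9, 4, 6, 2, 8, 5, 10, 3, 7, 1, 9, 4, 6, 2, 8, 5, 10]
N, n, reps = len(population), 5, 100_000

wor = [statistics.mean(random.sample(population, n)) for _ in range(reps)]    # without replacement
wr = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]  # with replacement

sigma2 = statistics.pvariance(population)                          # population variance (divisor N)
print(statistics.pvariance(wor), sigma2 / n * (N - n) / (N - 1))   # both ≈ 1.30
print(statistics.pvariance(wr), sigma2 / n)                        # both ≈ 1.65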
Summary
1. How to draw a simple random sample? (Purpose, method.) Simple random sampling is the basic survey methodology.
2. After getting a s.r.s., how to describe the population, or how to analyze the data? Estimate the population mean by the sample mean.

Estimation of σ² and Var(ȳ)
The population variance σ² is usually unknown. Now define
s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² = (1/(n−1)) [ Σ_{i=1}^n y_i² − n ȳ² ].
Example. When a few data points are repeated in a data set, the results are often arrayed in a frequency table. For example, a quiz given to 25 students was graded on a 4-point scale 0, 1, 2, 3, with 3 being a perfect score. Here are the results:

Score (X)   Frequency (F)   Proportion (P)
3           16              0.64
2           4               0.16
1           2               0.08
0           3               0.12

(a) Calculate the average score by using frequencies.
(b) Calculate the average score by using proportions.
(c) Calculate the standard deviation.

Solution
(a) (3 × 16 + 2 × 4 + 1 × 2)/25 = 58/25 = 2.32.
(b) μ = Σ x p(x) = 3 × 0.64 + 2 × 0.16 + 1 × 0.08 + 0 × 0.12 = 2.32.
(c) Var(X) = Σ x² p(x) − μ² = 3² × 0.64 + 2² × 0.16 + 1² × 0.08 + 0² × 0.12 − 2.32² = 1.0976, so σ = √Var(X) = 1.05.
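The computations in (a) to (c) are easy to reproduce; a small Python sketch:

# Frequency-table arithmetic for the quiz example.
freq = {3: 16, 2: 4, 1: 2, 0: 3}          # score -> frequency
n = sum(freq.values())                     # 25 students

mean = sum(x * f for x, f in freq.items()) / n                  # 2.32
var = sum(x ** 2 * f for x, f in freq.items()) / n - mean ** 2  # 1.0976
print(mean, var, var ** 0.5)               # 2.32  1.0976  ≈ 1.05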
If the above 25 students constitute a random sample, then s² = (n/(n−1)) × 1.0976 = (25/24) × 1.0976 = 1.1433.
Let us look at some properties of s². Is it unbiased?

Theorem 2.2.3
E(s²) = (N/(N−1)) σ².
Proof.
E(s²) = (1/(n−1)) [ Σ_{i=1}^n E(y_i²) − n E(ȳ²) ]
= (1/(n−1)) [ Σ_{i=1}^n (Var(y_i) + (E y_i)²) − n (Var(ȳ) + (E ȳ)²) ]
= (1/(n−1)) [ n(σ² + μ²) − n((σ²/n) · (N−n)/(N−1) + μ²) ]
= (nσ²/(n−1)) [ 1 − (1/n) · (N−n)/(N−1) ]
= (nσ²/(n−1)) · (nN − n − (N−n))/(n(N−1))
= Nσ²/(N−1).
The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 σ̂² := ((N−1)/N) s² is an unbiased estimator of σ², i.e.
E[ ((N−1)/N) s² ] = σ².

We shall define
f = n/N to be the sampling fraction,
1 − f = 1 − n/N to be the finite population correction (abbreviated fpc).
Then we have the following theorem.
Theorem 2.2.5 An unbiased estimator for Var(ȳ) is
V̂ar(ȳ) = (s²/n)(1 − f).

Proof.
E[V̂ar(ȳ)] = (E(s²)/n)(1 − f) = (Nσ²/(n(N−1))) (1 − n/N) = (σ²/n) · (N−n)/(N−1) = Var(ȳ).
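For a small population, Theorems 2.2.3 and 2.2.5 can be verified exactly by enumerating all possible samples. A Python sketch with a made-up population of N = 5 values:

from itertools import combinations
from statistics import mean, pvariance, variance

u = [1, 4, 6, 7, 10]        # a toy population; the values are arbitrary
N, n = len(u), 3
sigma2 = pvariance(u)       # population variance σ² (divisor N)

samples = list(combinations(u, n))            # all C(5, 3) = 10 equally likely samples
print(mean(variance(s) for s in samples))     # E(s²), with s² using divisor n−1
print(N / (N - 1) * sigma2)                   # equals Nσ²/(N−1): Theorem 2.2.3

vhat = [variance(s) / n * (1 - n / N) for s in samples]
print(mean(vhat))                             # E[V̂ar(ȳ)]
print(pvariance([mean(s) for s in samples]))  # exact Var(ȳ): Theorem 2.2.5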
Confidence intervals for μ
It can be shown that the sample average ȳ under simple random sampling is approximately normally distributed provided n is large (≥ 30, say) and f = n/N is not too close to 0 or 1.

Central limit theorem: If n → ∞ such that n/N → λ ∈ (0, 1), then
(ȳ − μ)/√Var(ȳ) ∼ N(0, 1) approximately.
If Var(ȳ) is replaced by its estimator V̂ar(ȳ), we still have
(ȳ − μ)/√V̂ar(ȳ) ∼ N(0, 1) approximately, as n/N → λ > 0.
Thus,
1 − α ≈ P( |ȳ − μ|/√V̂ar(ȳ) ≤ z_{α/2} ) = P( ȳ − z_{α/2} √V̂ar(ȳ) ≤ μ ≤ ȳ + z_{α/2} √V̂ar(ȳ) ).
Therefore, an approximate (1 − α) confidence interval for μ is
ȳ ∓ z_{α/2} √V̂ar(ȳ) = ȳ ∓ z_{α/2} (s/√n) √(1 − f).
B := z_{α/2} √V̂ar(ȳ) is called the bound on the error of estimation.
Example. Suppose that a s.r.s. of size n = 200 is taken from a population of size N = 1000, resulting in ȳ = 94 and s² = 400. Find a 95% C.I. for μ.
Solution
94 ∓ 1.96 × (20/√200) × √(1 − 1/5) = 94 ∓ 2.479.

Example. A simple random sample of n = 100 water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be ȳ = 12.5 and s² = 1252. If we assume that there are N = 10,000 households within the community, estimate μ, the true average daily consumption, and find a 95% confidence interval for μ.
Solution
μ̂ = ȳ = 12.5.
V̂ar(ȳ) = (σ̂²/n) · (N−n)/(N−1) = (s²/n)(1 − n/N) = (1252/100)(1 − 100/10000) = 12.3948.
√V̂ar(ȳ) = 3.5206.
A 95% C.I. for μ is 12.5 ± 1.96 × 3.5206 = (5.6, 19.4).
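Both examples follow the same recipe, compact enough for a small Python helper (a sketch; the function name is ours). It uses (1 − n/N) rather than (N − n)/(N − 1), matching V̂ar(ȳ) above.

def srs_mean_ci(ybar, s2, n, N, z=1.96):
    # Approximate CI for μ under s.r.s.; z = 1.96 gives a 95% interval.
    vhat = s2 / n * (1 - n / N)      # estimated Var(ȳ), with fpc
    b = z * vhat ** 0.5              # bound on the error of estimation
    return ybar - b, ybar + b

print(srs_mean_ci(94, 400, 200, 1000))      # first example: (91.52, 96.48)
print(srs_mean_ci(12.5, 1252, 100, 10000))  # water meters: (5.6, 19.4)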
2.3 Selecting the sample size for estimating population means

We have seen that Var(ȳ) = (σ²/n) · (N−n)/(N−1). So the bigger the sample size n is (but n ≤ N), the more accurate our estimate ȳ is. It is of interest to find the minimum n such that our estimate is within an error bound B with certain probability 1 − α, say,
P(|ȳ − μ| < B) ≈ 1 − α,
i.e.,
P( |ȳ − μ|/√Var(ȳ) < B/√Var(ȳ) ) ≈ 1 − α.
By the central limit theorem,
B/√Var(ȳ) ≈ z_{α/2}
⇔ (σ²/n) · (N−n)/(N−1) = B²/z²_{α/2} = D
⇔ N/n − 1 = (N−1)D/σ²
⇔ N/n = 1 + (N−1)D/σ² = ((N−1)D + σ²)/σ².
Thus,
n ≈ Nσ²/((N−1)D + σ²), where D = B²/z²_{α/2}.

Remark 1: if α = 5%, then z_{α/2} = 1.96 ≈ 2, so D ≈ B²/4. This coincides with the formula in the textbook (page 93).
Remark 2: the above formula requires knowledge of the population variance σ², which is typically unknown in practice. However, we can approximate σ² by the following methods:
1) from pilot studies,
2) from previous surveys,
3) from other studies.
e.g. Suppose that a total of 1500 students are to graduate next year. Determine the sample size n needed to ensure that the sample average starting salary is within $40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately $400.
Solution. n = (1500 × 400²)/(1499 × 40²/1.645² + 400²) = 229.37 ≈ 230.
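The sample-size formula is one line of code; a Python sketch (function name ours), checked against the example above:

from math import ceil

def srs_sample_size(N, sigma2, B, z=1.96):
    # Minimum n with P(|ȳ − μ| < B) ≈ 1 − α, where z = z_{α/2};
    # round up, since a larger n can only improve the accuracy.
    D = B ** 2 / z ** 2
    return ceil(N * sigma2 / ((N - 1) * D + sigma2))

# Graduating students: N = 1500, σ ≈ 400, B = 40, probability 0.9 (z = 1.645)
print(srs_sample_size(1500, 400 ** 2, 40, z=1.645))   # 230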
e.g. Example 4.5 (p. 94, 5th edition). The average amount of money μ for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance σ², it is known that most accounts lie within a $100 range. There are 1000 open accounts. Find the sample size needed to estimate μ with a bound on the error of estimation B = $3 with probability 0.95.
Remark. The solution depends on how one interprets "most accounts": whether it means 70%, 90%, 95% or 99% of all accounts.
Solution. We need an estimate of σ². For the normal distribution N(0, σ²), we have P(|N(0, σ²)| ≤ 1.96σ) = P(|N(0, 1)| ≤ 1.96) = 95% and P(|N(0, σ²)| ≤ 3σ) = P(|N(0, 1)| ≤ 3) = 99.73%. So 95% of accounts lie within a 4σ range and 99.73% of accounts lie within a 6σ range. Here B = 3 and N = 1000.
If "most" means 95%, we take 2 × (2σ) = 100, so σ = 25. Then n = 210.76 ≈ 211.
If "most" means 99.73%, we take 2 × (3σ) = 100, so σ = 50/3. Then n ≈ 107.
2.3.1 A quick summary on estimation of population mean

The population mean is defined to be
μ = (1/N)(u_1 + u_2 + ··· + u_N).
Suppose a simple random sample is {y_1, ..., y_n}.
1) Estimators of the population mean μ and variance σ² are
μ̂ = ȳ = (1/n) Σ_{i=1}^n y_i,  s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)².
2) The mean and variance of ȳ are
E(ȳ) = μ,  Var(ȳ) = (σ²/n) · (N−n)/(N−1).
3) An estimator of the variance of ȳ is
V̂ar(ȳ) = (s²/n)(1 − f), where f = n/N.
4) An approximate (1 − α) confidence interval for μ is
ȳ ∓ z_{α/2} √V̂ar(ȳ) = ȳ ∓ z_{α/2} (s/√n) √(1 − f).
5) The minimum sample size n needed to have an error bound B with probability 1 − α is
n ≈ Nσ²/((N−1)D + σ²), where D = B²/z²_{α/2}.
2.3.2 Estimation of population total

The population total is defined to be
τ = u_1 + u_2 + ··· + u_N = Nμ.
Suppose a simple random sample is {y_1, ..., y_n}.
1) An estimator of the population total τ is
τ̂ = N ȳ.
2) The mean and variance of τ̂ are
E(τ̂) = τ,  Var(τ̂) = N² (σ²/n) · (N−n)/(N−1).
3) An estimator of the variance of τ̂ is
V̂ar(τ̂) = V̂ar(N ȳ) = N² (s²/n)(1 − f).

Central limit theorem: If n → ∞ such that n/N → λ ∈ (0, 1), then
(τ̂ − τ)/√Var(τ̂) ∼ N(0, 1) approximately.
If Var(τ̂) is replaced by its estimator V̂ar(τ̂), we still have
(τ̂ − τ)/√V̂ar(τ̂) ∼ N(0, 1) approximately, as n/N → λ > 0.
Thus,
1 − α ≈ P( |τ̂ − τ|/√V̂ar(τ̂) ≤ z_{α/2} ) = P( τ̂ − z_{α/2} √V̂ar(τ̂) ≤ τ ≤ τ̂ + z_{α/2} √V̂ar(τ̂) ).
4) Therefore, an approximate (1 − α) confidence interval for τ is
τ̂ ∓ z_{α/2} √V̂ar(τ̂) = τ̂ ∓ z_{α/2} N (s/√n) √(1 − f) = N ( ȳ ∓ z_{α/2} (s/√n) √(1 − f) ).
B := z_{α/2} √V̂ar(τ̂) = N z_{α/2} √V̂ar(ȳ) is called the bound on the error of estimation.
5) The minimum sample size n needed to have an error bound B with probability 1 − α is
n ≈ Nσ²/((N−1)D + σ²), where D = B²/(N² z²_{α/2}).
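Everything above mirrors the formulas for the mean, with an extra factor of N; a Python sketch (helper names ours):

from math import ceil, sqrt

def srs_total_ci(ybar, s2, n, N, z=1.96):
    # Approximate CI for the total τ = Nμ under s.r.s.
    tau_hat = N * ybar
    b = z * sqrt(N ** 2 * s2 / n * (1 - n / N))   # bound on the error of estimation
    return tau_hat - b, tau_hat + b

def srs_total_sample_size(N, sigma2, B, z=1.96):
    # Note the extra N² in D compared with the formula for the mean.
    D = B ** 2 / (N ** 2 * z ** 2)
    return ceil(N * sigma2 / ((N - 1) * D + sigma2))

print(srs_total_sample_size(1000, 36.0, 1000))   # Example 4.6 below: 122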
Example 4.6 (page 95 of the textbook). An investigator is interested in estimating the total weight gain in 0 to 4 weeks for N = 1000 chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate τ within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that σ², the population variance, was approximately 36.00 (grams)². Determine the required sample size.
Solution
D = B²/(1.96 N)² = 1000²/(1.96² × 1000²) = 0.26.
n = Nσ²/((N−1)D + σ²) = 1000 × 36/(999 × 0.26 + 36) = 121.72 ≈ 122.
2.4 Estimation of population proportion

Suppose we are interested in the proportion p of the population with a specified characteristic. Let
y_i = 1 if the i-th element has the characteristic, and y_i = 0 if not.
It is easy to see that E(y_i) = E(y_i²) = p (why?). Therefore, we have
μ = E(y_i) = p,  σ² = Var(y_i) = p − p² = pq, where q = 1 − p.
The total number of elements in the sample of size n possessing the specified characteristic is Σ_{i=1}^n y_i. Therefore:
1. An estimator of the population proportion p is
p̂ = ȳ = (1/n) Σ_{i=1}^n y_i.
An estimator of the population variance σ² = pq is
s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² = (1/(n−1)) [ Σ_{i=1}^n y_i² − n ȳ² ] = (1/(n−1)) (n p̂ − n p̂²) = (n/(n−1)) p̂ q̂, where q̂ = 1 − p̂.
From Theorems 2.2.2 and 2.2.3, we have
E(p̂) = p,  E(s²) = (N/(N−1)) σ² = (N/(N−1)) pq.  (4.1)
2. Again, from Theorem 2.2.2, the variance of p̂ is
Var(p̂) = (σ²/n) · (N−n)/(N−1) = (pq/n) · (N−n)/(N−1).
3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of p̂ is
V̂ar(p̂) = (s²/n)(1 − f) = (p̂q̂/(n−1))(1 − f).
4. An approximate (1 − α) confidence interval for p is
p̂ ∓ z_{α/2} √V̂ar(p̂) = p̂ ∓ z_{α/2} (√(p̂q̂)/√(n−1)) √(1 − f).
5. The minimum sample size n required to estimate p such that our estimate p̂ is within an error bound B with probability 1 − α is
n ≈ Npq/((N−1)D + pq), where D = B²/z²_{α/2}.
Note that the right hand side is an increasing function of σ² = pq.
a) p is often unknown, so we can replace it by some estimate (from a previous study, pilot study, etc.).
b) If we don't have an estimate of p, we can use p = 1/2, so that pq = 1/4, its largest possible value; this gives a conservative sample size.
e.g. Suppose that a small town has a population of N = 800 people. Let p be the proportion of people with blood type A.
(1) What sample size n must be drawn in order to estimate p to within 0.04 with probability 0.95?
(2) Suppose that we know that no more than 10% of the population have blood type A. Find n again as in (1). Comment on the difference between (1) and (2).
(3) A simple random sample of size n = 200 is taken and it is found that 7% of the sample has blood type A. Find a 90% confidence interval for p.
Solution. N = 800, α = 0.05, B = 0.04.
(1) Taking p = 1/2 in the formula, we get n = 344.
(2) p ≤ 0.10, so σ² = pq ≤ 0.09. A simple calculation yields n = 171. Knowing a bound on p reduces the required sample size considerably.
(3) (0.044, 0.096).
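The three parts can be reproduced with the formulas above; a Python sketch (helper name ours):

from math import ceil, sqrt

def srs_prop_sample_size(N, p, B, z=1.96):
    # Minimum n to estimate p within B; p = 0.5 is the conservative choice.
    D = B ** 2 / z ** 2
    return ceil(N * p * (1 - p) / ((N - 1) * D + p * (1 - p)))

print(srs_prop_sample_size(800, 0.5, 0.04))   # (1): 344
print(srs_prop_sample_size(800, 0.1, 0.04))   # (2): 171

# (3): 90% C.I. with n = 200, p_hat = 0.07, z_{0.05} = 1.645
n, N, p_hat, z = 200, 800, 0.07, 1.645
vhat = p_hat * (1 - p_hat) / (n - 1) * (1 - n / N)
print(p_hat - z * sqrt(vhat), p_hat + z * sqrt(vhat))   # ≈ (0.044, 0.096)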
Example A simple random sample of n = 40 college students was interviewed to determine
the proportion of students in favor of converting from the semester to the quarter system. 25
students answered affirmatively. Estimate p, the proportion of students on campus in favor of
the change. (Assume N = 2000.) Find a 95% confidence interval for p.
Solution
p̂ = ȳ = 25/40 = 0.625.
V̂ar(p̂) = (p̂q̂/(n−1))(1 − n/N) = (0.625 × 0.375/39) × (1 − 40/2000) = 5.889 × 10⁻³.
√V̂ar(p̂) = 0.07674.
A 95% C.I. for p is 0.625 ± 1.96 × 0.0767 = (0.4746, 0.7754).
2.5 Comparing estimates

Suppose x_1, ..., x_m is a random sample from a population with mean μ_x and y_1, ..., y_n is a random sample from a population with mean μ_y. We are interested in the difference of means μ_y − μ_x, which can be estimated without bias by ȳ − x̄, since
E(ȳ − x̄) = μ_y − μ_x.
Further,
Var(ȳ − x̄) = Var(ȳ) + Var(x̄) − 2 Cov(ȳ, x̄).
Remark: If the two samples x_1, ..., x_m and y_1, ..., y_n are independent, then Cov(ȳ, x̄) = 0. However, a more interesting case is when the two samples are dependent, which is illustrated in the following example.
A dependent example
Suppose an opinion poll asks n people the question "Do you favor abortion?" The possible answers are
YES, NO, NO OPINION.
Let the proportions of people who answer 'YES', 'NO', 'NO OPINION' be p_1, p_2 and p_3, respectively. In particular, we are interested in comparing p_1 and p_2 by looking at p_1 − p_2. Clearly, the sample proportions of 'YES' and 'NO' answers are dependent, since if one is high, the other is likely to be low.
Let p̂_1, p̂_2 and p̂_3 be the three respective sample proportions amongst the sample of size n. Then (X, Y, Z) = (n p̂_1, n p̂_2, n p̂_3) follows a multinomial distribution with parameters (n, p_1, p_2, p_3). That is,
P(X = x, Y = y, Z = z) = (n!/(x! y! z!)) p_1^x p_2^y p_3^z.
Please note that
Σ_{x,y,z ≥ 0, x+y+z=n} (n!/(x! y! z!)) p_1^x p_2^y p_3^z = 1.
Question: What is the distribution of X? (Hint: Classify the people into “Yes” and “Not
Yes”)
Theorem 2.5.1
E(X) = n p_1,  E(Y) = n p_2,  E(Z) = n p_3,
Var(X) = n p_1 q_1,  Var(Y) = n p_2 q_2,  Cov(X, Y) = −n p_1 p_2.

Proof. X = number of people saying "YES" ∼ Bin(n, p_1). So E(X) = n p_1 and Var(X) = n p_1 q_1. Now Cov(X, Y) = E(XY) − (EX)(EY) = E(XY) − n² p_1 p_2. But
E(XY) = Σ_{x,y≥0, x+y≤n} x y P(X = x, Y = y)
= Σ_{x,y≥1, x+y≤n} x y P(X = x, Y = y, Z = n−x−y)
= Σ_{x,y≥1, x+y≤n} x y (n!/(x! y! (n−x−y)!)) p_1^x p_2^y p_3^{n−x−y}
= Σ_{x,y≥1, x+y≤n} (n!/((x−1)! (y−1)! (n−x−y)!)) p_1^x p_2^y p_3^{n−x−y}
= n(n−1) p_1 p_2 Σ_{x−1,y−1≥0, (x−1)+(y−1)≤n−2} ((n−2)!/((x−1)! (y−1)! ((n−2)−(x−1)−(y−1))!)) p_1^{x−1} p_2^{y−1} p_3^{(n−2)−(x−1)−(y−1)}
= n(n−1) p_1 p_2 Σ_{x_1,y_1≥0, x_1+y_1≤n−2} ((n−2)!/(x_1! y_1! ((n−2)−x_1−y_1)!)) p_1^{x_1} p_2^{y_1} p_3^{(n−2)−x_1−y_1}
= n(n−1) p_1 p_2 = n² p_1 p_2 − n p_1 p_2.
Therefore, Cov(X, Y) = E(XY) − n² p_1 p_2 = −n p_1 p_2.
Theorem 2.5.2
E(p̂_1) = p_1,  E(p̂_2) = p_2,
Var(p̂_1) = p_1 q_1/n,  Var(p̂_2) = p_2 q_2/n,  Cov(p̂_1, p̂_2) = −p_1 p_2/n.

Proof. Note that p̂_1 = X/n and p̂_2 = Y/n. Apply the last theorem.
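The negative covariance is easy to confirm by simulation; a Python sketch with made-up parameters:

import random
from statistics import mean

# Made-up parameters: n = 100 respondents, p1 = 0.3, p2 = 0.5.
n, p1, p2, reps = 100, 0.3, 0.5, 50_000
xs, ys = [], []
for _ in range(reps):
    draws = random.choices(['yes', 'no', 'none'], weights=[p1, p2, 1 - p1 - p2], k=n)
    xs.append(draws.count('yes'))   # X = number of YES answers
    ys.append(draws.count('no'))    # Y = number of NO answers

mx, my = mean(xs), mean(ys)
cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
print(cov, -n * p1 * p2)            # both ≈ −15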
From the last theorem, we have
Var(p̂_1 − p̂_2) = Var(p̂_1) + Var(p̂_2) − 2 Cov(p̂_1, p̂_2) = p_1 q_1/n + p_2 q_2/n + 2 p_1 p_2/n.
One estimator of Var(p̂_1 − p̂_2) is
V̂ar(p̂_1 − p̂_2) = p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n.
Is it unbiased? No! An unbiased estimator of the variance of p̂_1 is V̂ar(p̂_1) = (p̂_1 q̂_1/(n−1))(1 − f). Also, E(p̂_1 p̂_2) = E(XY)/n² = p_1 p_2 (1 − 1/n), which implies that an unbiased estimator of p_1 p_2 is p̂_1 p̂_2 (1 − 1/n)⁻¹. So
V̂ar(p̂_1) + V̂ar(p̂_2) + 2 n⁻¹ p̂_1 p̂_2 (1 − 1/n)⁻¹
is an unbiased estimator of Var(p̂_1 − p̂_2). But it is easier to use
V̂ar(p̂_1 − p̂_2) = p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n.
Therefore, an approximate (1 − α) confidence interval for p_1 − p_2 is
(p̂_1 − p̂_2) ∓ z_{α/2} √V̂ar(p̂_1 − p̂_2) = (p̂_1 − p̂_2) ∓ z_{α/2} √( p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n ).
e.g. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6-7, 1994 gave the following results:

             Banned   Special areas   No restrictions
Nonsmokers   44%      52%             3%
Smokers      8%       80%             11%

Based on a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for
(1) the true difference between the proportions choosing "Banned" among nonsmokers and among smokers;
(2) the true difference among nonsmokers between the proportions choosing "Banned" and "Special areas".
Solution
A. The proportions choosing "Banned" among nonsmokers and among smokers are independent of each other; a high value of one does not force a low value of the other. Thus, an appropriate estimate of this difference, with its bound on the error of estimation, is
0.44 − 0.08 ± 2 √( 0.44 × 0.56/600 + 0.08 × 0.92/200 ) = 0.36 ± 0.06.
B. The proportion of nonsmokers choosing "Special areas" is dependent on the proportion choosing "Banned"; if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is
0.52 − 0.44 ± 2 √( 0.44 × 0.56/600 + 0.52 × 0.48/600 + 2 × 0.44 × 0.52/600 ) = 0.08 ± 0.08.
Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?
Let p_1 be the proportion of Americans who blamed the players and p_2 the proportion who blamed the owners. Then
V̂ar(p̂_1 − p̂_2) = p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n
= 0.29 × 0.71/600 + 0.34 × 0.66/600 + 2 × 0.29 × 0.34/600
= 1.0458 × 10⁻³.
So an approximate 95% C.I. for p_1 − p_2 is
0.29 − 0.34 ± z_{0.025} √V̂ar(p̂_1 − p̂_2) = −0.05 ± 1.96 × 0.03234 = (−0.11339, 0.01339).
Since this interval contains 0, the evidence does not suggest a real difference between the two proportions.
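As a closing sketch, the dependent-proportions interval fits in one Python helper (the name is ours), checked against the baseball poll:

from math import sqrt

def dependent_prop_ci(p1_hat, p2_hat, n, z=1.96):
    # Approximate CI for p1 − p2 with dependent multinomial proportions,
    # using the simpler (slightly biased) variance estimator above.
    vhat = (p1_hat * (1 - p1_hat) + p2_hat * (1 - p2_hat)
            + 2 * p1_hat * p2_hat) / n
    b = z * sqrt(vhat)
    return p1_hat - p2_hat - b, p1_hat - p2_hat + b

print(dependent_prop_ci(0.29, 0.34, 600))   # baseball poll: (−0.113, 0.013)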