Statistical Sampling
Interdepartmental Postgraduate Programme: Techno-economic Systems
Dimitris Fouskakis

What do you think about Statistics?

Introduction to Statistics
Why do we need statistics? Descriptive statistics. Inferential statistics.

Why do we need statistics?
"A distinctive function of statistics is this: it enables the scientist to make a numerical evaluation of the uncertainty of his conclusion." (Snedecor, 1950)

The fundamental problem: sampling
How representative is my sample? Statistics can tell us how good the chances are that the characteristics of a given sample represent the characteristics of the target population, provided each individual of the target population had the same chance of being sampled (the assumption of randomness).

Population: the set of all units of interest, described by a random variable X. X follows a distribution f with unknown mean µ and standard deviation σ, and more generally with an unknown parameter θ.
Random sample: X1, …, Xn, independent and identically distributed random variables (each following the same distribution as X). The observed values of the random sample (the sample values, or sample data) x1, …, xn help us make inferences.

Descriptive and inferential statistics
Descriptive statistics helps to describe the characteristics of a sample. Inferential statistics is a collection of methods that quantify how certain we can be when we make inferences from a given sample.

Types of data
Categorical:
- nominal (married, single, divorced, …)
- ordinal (minimal, moderate, severe, …)
- binary (success, failure)
Quantitative:
- discrete (0, 1, 2, 3, 4, 5, …), e.g. number of road accidents
- continuous, e.g. height

Descriptive statistics
Measures of location:
- Sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (the sum of all the scores divided by the number of observations).
- Median (the score that lies at the midpoint when the data are ranked in order).
- Mode (the most frequently occurring score).
- Trimmed mean (some of the largest and smallest observations are removed before calculating the mean).

Measures of spread:
- Range (the lowest and highest values).
- Centiles (two values that encompass most, rather than all, of the data values, e.g. quartiles).
- Standard deviation (SD) $s$ (the idea is based on averaging the distance of each value from the mean).
- Variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (the square of the SD).

Graphical representations of variability: histogram, boxplot, frequency polygon, stem-and-leaf diagram.
[Figure: example stem-and-leaf display of the sample data.]

Estimating the shape of the p.d.f. of X
To estimate the shape of the p.d.f. f(x) of X, one can create a frequency table of the sample values x1, …, xn by dividing the range of the values into a set of intervals, then draw a histogram and use it as an estimate of the shape of f(x).

Estimating probabilities
Suppose that we want to calculate the probability $p = P(a \le X \le b)$. Let $\hat{p}$ denote the fraction of the sample data x1, …, xn that lie between the values a and b. Then $\hat{p}$ is an estimate of the required probability.

Estimating the mean and the variance
The observed sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ can be used to estimate the true mean µ of X. The observed variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ can be used to estimate the true variance σ² of X.

Sample mean
The definitions of the observed sample mean and variance pertain to the observed values x1, …, xn. Let us instead look at the problem before the random sample is collected. Recall that before the sample is collected, the random variables X1, …, Xn denote the uncertain values that will be obtained from the random sample.
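As a concrete illustration, the location and spread measures defined above can be computed in a few lines of Python; the data values here are made up for illustration.

```python
# A minimal sketch of the descriptive statistics above, on a
# small made-up sample (hypothetical values x1,...,xn).
import statistics

x = [2.1, 3.4, 2.8, 5.0, 3.4, 4.2, 2.9, 3.7]
n = len(x)

mean = sum(x) / n                                    # sample mean x-bar
median = statistics.median(x)                        # midpoint of the ranked data
mode = statistics.mode(x)                            # most frequent value
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)    # s^2, with divisor n-1
sd = var ** 0.5                                      # standard deviation s

# Estimating p = P(a <= X <= b) by the fraction of the sample in [a, b]:
a, b = 2.5, 4.0
p_hat = sum(a <= xi <= b for xi in x) / n
```

Note the `n - 1` divisor in the variance, matching the definition of $s^2$ above.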
Sample mean
Before the sample is observed, the sample mean and sample variance are themselves random variables:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
Their moments are
$$E(\bar{X}) = \mu, \qquad Var(\bar{X}) = \frac{\sigma^2}{n}, \qquad E(S^2) = \sigma^2.$$
How good an estimate of the mean µ is the observed sample mean $\bar{x}$? How reliable is this estimate? From the Central Limit Theorem (for n ≥ 30), $\bar{X} \sim N(\mu,\; \sigma/\sqrt{n})$ approximately.

Example
Berkshire Power Company (BPC) is an electric utility company that provides electric power. It has recently implemented a variety of incentive programs to encourage households to conserve energy in winter months, and would like to estimate the mean µ and standard deviation σ of the distribution of household electricity consumption for January, based on a sample of n = 100 households. From the sample,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = 3011 \text{ KWH}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = 540483.7 \;\Rightarrow\; s = 735.18 \text{ KWH}.$$
Suppose we now chose a different sample of 100 households. How different would the answers be? What if instead we chose n = 10? Remember that from the Central Limit Theorem $\bar{X} \sim N(\mu,\; \sigma/\sqrt{n})$: the standard deviation of the distribution of the sample mean is lower when n is larger.

Confidence intervals for the mean (large sample size)
The observed sample mean $\bar{x}$ is a more reliable estimate of µ when the sample size n is larger. We can quantify this intuitive notion of reliability by developing the concept of a confidence interval (C.I.). Consider the following problem: compute the quantity b such that
$$P(\mu - b \le \bar{X} \le \mu + b) = 0.95,$$
i.e.
$$P\left(-\frac{b}{\sigma/\sqrt{n}} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le \frac{b}{\sigma/\sqrt{n}}\right) = 0.95,$$
where $Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$ for n > 29. Since $P(-1.96 \le Z \le 1.96) = 0.95$, it follows that
$$P\left(\bar{X} - \frac{1.96\,\sigma}{\sqrt{n}} \le \mu \le \bar{X} + \frac{1.96\,\sigma}{\sqrt{n}}\right) = 0.95.$$
If n ≥ 30, a 95% confidence interval for the mean µ is therefore the interval
$$\left[\bar{x} - \frac{1.96\,s}{\sqrt{n}},\; \bar{x} + \frac{1.96\,s}{\sqrt{n}}\right].$$
Interpretation of a confidence interval: since both the sample mean $\bar{X}$ and the sample variance $S^2$ are random variables, each time we take a random sample we find different values for the observed sample mean $\bar{x}$ and the observed variance $s^2$.
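This sampling variability can be seen in a small simulation (a sketch with made-up population parameters, loosely echoing the BPC setting): each new sample of size n yields a different observed mean, and the spread of those means is close to $\sigma/\sqrt{n}$.

```python
# Sketch: repeated samples give different sample means, whose standard
# deviation is approximately sigma/sqrt(n). Population is a hypothetical
# Normal with made-up parameters.
import random
import statistics

random.seed(0)
mu, sigma, n = 3000.0, 700.0, 100     # made-up population parameters

means = []
for _ in range(2000):                 # draw many independent samples...
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)     # ...and record each observed mean

print(statistics.mean(means))         # close to mu
print(statistics.stdev(means))        # close to sigma / sqrt(n) = 70
```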
This results in a different confidence interval each time we sample. A 95% confidence interval means that 95% of the resulting intervals will contain the actual mean µ.

In our previous example with the Berkshire Power Company, with a sample size of n = 100 we get the following 95% confidence interval for the true mean:
$$\left[\bar{x} - \frac{1.96\,s}{\sqrt{n}},\; \bar{x} + \frac{1.96\,s}{\sqrt{n}}\right] = [2866.9,\; 3155.1].$$
If our sample size were smaller, our uncertainty about the true value of µ would be larger, and thus we should expect a wider confidence interval.

More generally, suppose that $\bar{x}$ is the observed sample mean and $s^2$ the observed variance. If n ≥ 30, a β% confidence interval for the mean µ is the interval
$$\left[\bar{x} - \frac{z_{\alpha/2}\, s}{\sqrt{n}},\; \bar{x} + \frac{z_{\alpha/2}\, s}{\sqrt{n}}\right],$$
where $z_{\alpha/2}$ is such that $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100$ and $\alpha = 1 - \beta/100$:
- for β = 90%, α = 0.10, $z_{\alpha/2}$ = 1.645
- for β = 95%, α = 0.05, $z_{\alpha/2}$ = 1.960
- for β = 98%, α = 0.02, $z_{\alpha/2}$ = 2.326
- for β = 99%, α = 0.01, $z_{\alpha/2}$ = 2.576
Thus, in our previous example with the sample of 100 households, a 99% confidence interval for the true mean is
$$\left[\bar{x} - \frac{2.576\,s}{\sqrt{n}},\; \bar{x} + \frac{2.576\,s}{\sqrt{n}}\right] = [2821.6,\; 3200.4],$$
wider than the 95% one.

[Table: the standard Normal distribution.]

Confidence intervals for the mean (small sample size)
What if our sample size is less than 30? The procedure for constructing a confidence interval for the true mean is the same as before, but this time
$$T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$$
follows approximately a t-distribution with k = n − 1 degrees of freedom (this approximation works well only if the Xi are approximately Normally distributed).
Thus the β% confidence interval for the true mean is
$$\left[\bar{x} - \frac{c\,s}{\sqrt{n}},\; \bar{x} + \frac{c\,s}{\sqrt{n}}\right],$$
where c is such that $P(-c \le T \le c) = \beta/100$ and T follows the t-distribution with n − 1 degrees of freedom.

Example
In the Berkshire Power Company example, suppose that our sample was from only n = 10 households, and gave us an observed sample mean of 3056 KWH and an observed sample standard deviation of 800 KWH. Then a 99% C.I. for the true mean is
$$\left[\bar{x} - \frac{c\,s}{\sqrt{n}},\; \bar{x} + \frac{c\,s}{\sqrt{n}}\right] = \left[3056 - \frac{3.250 \times 800}{\sqrt{10}},\; 3056 + \frac{3.250 \times 800}{\sqrt{10}}\right],$$
where the value 3.250 is obtained from the tables of the t-distribution with k = 10 − 1 = 9 degrees of freedom and β = 99%.

[Table: Student's t-distribution.]

Confidence interval for the population proportion
Suppose that the National Institutes of Health (NIH) would like to estimate the proportion of teenagers who smoke. They randomly sampled 1000 teenagers and found that 253 of them are smokers. Thus the observed sample proportion is $\hat{p} = 253/1000 = 0.253$. We would like to construct a C.I. for the true proportion of teenagers who smoke.

Let X be the number of teenagers in the sample of size n who smoke. Then X ~ B(n, p), and therefore E(X) = np and Var(X) = np(1 − p). If $\hat{P} = X/n$ is the sample proportion (a random variable), then $E(\hat{P}) = p$ and $Var(\hat{P}) = p(1-p)/n$.

If $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$, then from the Central Limit Theorem
$$Z = \frac{\hat{P} - p}{\sqrt{\hat{P}(1-\hat{P})/n}}$$
approximately obeys the standard Normal distribution. Using this fact we can derive the following result: if $\hat{p}$ is the observed sample proportion in a sample of size n, and $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$, then a β% C.I. for the population proportion p is
$$\left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right],$$
where $z_{\alpha/2}$ is such that $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100$ and $\alpha = 1 - \beta/100$.

So in our example, let us compute a 99% C.I. for the proportion of teenagers who smoke.
Note that $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$, so we can use the preceding method. From the tables of the standard Normal distribution we find that c = 2.576, and thus the required C.I. is
$$\left[\hat{p} - c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right] = \left[0.253 - 2.576\sqrt{\frac{0.253(1-0.253)}{1000}},\; 0.253 + 2.576\sqrt{\frac{0.253(1-0.253)}{1000}}\right] = [0.218,\; 0.288].$$

Experimental design for estimating the mean µ
The sample size n affects the width of the C.I. How large should n be in order to satisfy a pre-specified tolerance in the width of the β% C.I.? This is a question of experimental design. The required sample size is
$$n = \frac{z_{\alpha/2}^2\, s^2}{L^2},$$
where L is the tolerance level, i.e. our estimate $\bar{x}$ is within plus or minus L of the true value µ with probability β/100, and $z_{\alpha/2}$ is such that $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100$ with $\alpha = 1 - \beta/100$.

If the value of n computed from this expression is less than 30, we set n = 30. One difficulty in using the expression is that we have to know the value of the sample standard deviation in advance. However, one can typically obtain a rough estimate of s by first conducting a small pilot sample.

Example
Suppose that a marketing research firm wants to conduct a survey to estimate the mean µ of the distribution of the amount spent on entertainment by each adult who visits a certain popular resort. The firm would like to estimate this mean to within $120.00 with 95% confidence. From data on past operations at the resort, it has been estimated that the standard deviation of entertainment expenditures is no more than $400.00. How large should the sample size be?
$$n = \frac{z_{\alpha/2}^2\, s^2}{L^2} = \frac{1.96^2 \times 400^2}{120^2} = 42.68 \approx 43.$$

Experimental design for estimating the proportion p
Suppose we want a β% C.I. for the proportion with a tolerance level of L.
Then we obtain
$$n = \frac{c^2\, \hat{p}(1-\hat{p})}{L^2},$$
where c is such that $P(-c \le Z \le c) = \beta/100$. The problem with using this formula directly is that we do not know the value of the observed sample proportion in advance. However, it is easily proved that $\hat{p}(1-\hat{p}) \le 1/4$. Thus, if we use the value 1/4 in place of $\hat{p}(1-\hat{p})$, we obtain the "conservative" estimate
$$n = \frac{z_{\alpha/2}^2}{4L^2}.$$

Example
Suppose that a major American television network is interested in estimating the proportion p of American adults who are in favor of a particular national issue, such as handgun control. They would like to compute a 95% C.I. whose tolerance level is plus or minus 3%. How many adults would the television network need to poll?
$$n = \frac{z_{\alpha/2}^2}{4L^2} = \frac{1.96^2}{4 \times 0.03^2} = 1067.11 \approx 1068.$$
This is a rather remarkable fact. No matter how small or large the proportion we want to estimate, if we randomly sample 1,068 adults, then in 19 cases out of 20 (95%), the results based on such a sample will differ by no more than 3% in either direction from what would have been obtained by polling all American adults.

Comparing estimates of the means of two distributions
Suppose that a national department store chain is considering whether or not to promote its products via a direct-mail promotion campaign. It has chosen two randomly selected groups of consumers, with n1 and n2 consumers respectively. It plans to mail the promotional material to all the consumers in the first group but to none in the second group, and then to monitor the spending of each consumer in each group at its stores in the coming month, in order to estimate the effectiveness of the promotional campaign. Suppose that the true mean of the first group is µ1 with standard deviation σ1, and of the second group µ2 with standard deviation σ2. Our objective is to estimate the difference µ1 − µ2.
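Both sample-size formulas above can be evaluated directly; this sketch reproduces the two worked examples (the resort survey and the opinion poll), rounding up since n must be an integer.

```python
# Sketch: the two sample-size formulas above, with the z-value quoted
# in the text for 95% confidence.
import math

z = 1.96  # z_{alpha/2} for a 95% confidence level

# Sample size to estimate a mean to within L, given a rough SD estimate s:
s, L_mean = 400.0, 120.0
n_mean = math.ceil((z * s / L_mean) ** 2)        # -> 43, as in the text

# Conservative sample size for a proportion (uses p(1-p) <= 1/4):
L_prop = 0.03
n_prop = math.ceil(z ** 2 / (4 * L_prop ** 2))   # -> 1068, as in the text
```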
Suppose that we plan to randomly sample n1 observations X1, …, Xn1 from the first population and n2 observations Y1, …, Yn2 from the second population. The two sample means are then
$$\bar{X} = \frac{1}{n_1}\sum_{i=1}^{n_1} X_i, \qquad \bar{Y} = \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i,$$
and
$$E(\bar{X} - \bar{Y}) = \mu_1 - \mu_2, \qquad Var(\bar{X} - \bar{Y}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
From the Central Limit Theorem we then have
$$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1) \quad \text{when } n_1, n_2 \ge 30.$$

If $\bar{x}, \bar{y}$ are the two observed sample means and $s_1, s_2$ the two observed standard deviations, then the estimate of µ1 − µ2 is the difference of the observed sample means, $\bar{x} - \bar{y}$. A β% C.I. for the true difference µ1 − µ2 of the two population means is
$$\left[\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\; \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right],$$
where $z_{\alpha/2}$ is such that $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100$ and $\alpha = 1 - \beta/100$.

Back in our example, suppose that n1 = 500 and n2 = 400 consumers, that the observed sample mean of consumer sales is $387 in the first group and $365 in the second, and that the observed standard deviations are $233 and $274 respectively. Let us compute a 98% C.I. for the difference µ1 − µ2 between the mean sales of the two groups:
$$\left[387 - 365 - 2.326\sqrt{\frac{233^2}{500} + \frac{274^2}{400}},\; 387 - 365 + 2.326\sqrt{\frac{233^2}{500} + \frac{274^2}{400}}\right] = [-\$18.04,\; \$62.04].$$
Because this C.I. contains zero, we are not 98% confident that the promotional campaign will result in any increase in consumer spending.

Comparing estimates of the population proportions of two populations
We now need to estimate the difference p1 − p2 between the proportions of two independent populations. Suppose we sample from both populations, obtaining n1 and n2 observations respectively.
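As a check on the two-sample interval for µ1 − µ2, a short sketch with the department-store numbers:

```python
# Sketch: 98% C.I. for mu1 - mu2 with the department-store numbers above.
import math

n1, n2 = 500, 400
xbar, ybar = 387.0, 365.0
s1, s2 = 233.0, 274.0
z = 2.326                     # z_{alpha/2} for a 98% confidence level

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
lo = (xbar - ybar) - z * se
hi = (xbar - ybar) + z * se
# the interval [lo, hi] contains zero, so the campaign's effect
# is not established at the 98% level
```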
Let X denote the number of observations in the first sample with the characteristic of interest, and Y the number in the second sample. The sample proportions of the two populations are then
$$\hat{P}_1 = \frac{X}{n_1}, \qquad \hat{P}_2 = \frac{Y}{n_2},$$
and
$$E(\hat{P}_1 - \hat{P}_2) = p_1 - p_2, \qquad Var(\hat{P}_1 - \hat{P}_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}.$$
From the Central Limit Theorem we then have
$$Z = \frac{\hat{P}_1 - \hat{P}_2 - (p_1 - p_2)}{\sqrt{\dfrac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \dfrac{\hat{P}_2(1-\hat{P}_2)}{n_2}}} \sim N(0,1).$$

If the observed sample proportions are $\hat{p}_1, \hat{p}_2$, then the estimate of the difference p1 − p2 is the difference of the observed sample proportions, $\hat{p}_1 - \hat{p}_2$. If also $n_1\hat{p}_1,\; n_2\hat{p}_2,\; n_1(1-\hat{p}_1),\; n_2(1-\hat{p}_2) \ge 5$, then a β% C.I. for the difference p1 − p2 is
$$\left[\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\; \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right],$$
where $z_{\alpha/2}$ is such that $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = \beta/100$ and $\alpha = 1 - \beta/100$.

Example
In a ten-year study, 3,806 middle-aged men with high cholesterol levels but no known heart problems were randomly divided into two equal groups. Members of the first group received a new drug designed to lower cholesterol levels, while those in the second group received daily dosages of a placebo. Besides lowering cholesterol levels, the drug appeared to be effective in reducing the incidence of heart attacks: during the 10 years, 155 of those in the first group had a heart attack, compared to 187 in the second group. Let p1 denote the proportion of middle-aged men with high cholesterol who will suffer a heart attack within ten years if they receive the new drug, and let p2 denote the corresponding proportion if they do not receive the new drug. Let us compute the 90% C.I. for the difference p1 − p2.
Here we have n1 = 1,903, n2 = 1,903 and
$$\hat{p}_1 = 155/1903 = 0.08145, \qquad \hat{p}_2 = 187/1903 = 0.09827.$$
For β = 90% we find that c = 1.645. Therefore a 90% C.I. is
$$\left[\hat{p}_1 - \hat{p}_2 - c\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},\; \hat{p}_1 - \hat{p}_2 + c\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\right] = [-0.032,\; -0.0016].$$
Note that this entire range is less than zero; therefore we are 90% confident that the new drug is effective in reducing the incidence of heart attacks in middle-aged men with high cholesterol.
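The interval above can be reproduced with a short sketch:

```python
# Sketch: 90% C.I. for p1 - p2 in the cholesterol-drug example above.
import math

n1 = n2 = 1903
p1 = 155 / n1                # observed heart-attack proportion, drug group
p2 = 187 / n2                # observed proportion, placebo group
c = 1.645                    # z_{alpha/2} for a 90% confidence level

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo = (p1 - p2) - c * se
hi = (p1 - p2) + c * se
# the whole interval [lo, hi] lies below zero, matching the
# conclusion that the drug reduces heart-attack incidence
```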