Lecture 6
Selected material from: Ch. 9, Estimation using a single sample

© 2008 Brooks/Cole, a division of Thomson Learning, Inc.

Point estimate

A point estimate of a population characteristic is a single number that is based on sample data and represents a plausible value of the characteristic.

Example: A sample of 200 students at a large university is selected to estimate the proportion of students at the university that wear contact lenses ($\pi$). In this sample 47 wore contact lenses.

The statistic

$$p = \frac{\text{number of successes in the sample}}{n}$$

is a reasonable choice for a formula to obtain a point estimate for $\pi$. Such a point estimate is

$$p = \frac{47}{200} = 0.235$$

Example: Weights

A sample of weights (pounds) of 34 US male university students was obtained.

185 202 197 188 166 148 161 139 214 170 231 180
174 177 283 207 176 194 175 170 184 180 184 176
202 151 189 167 179 178 176 168 177 155

To estimate the true mean weight of all US male university students, you might use the sample mean as a point estimate for the true mean.

sample mean $\bar{x} = 182.44$ lb (82.8 kg)

Example: Weights

After looking at a histogram and boxplot of the data you might notice that the data seem reasonably symmetric with an outlier, so you could also use either the sample median or the sample trimmed mean as a point estimate.

sample mean $\bar{x} = 182.44$ lb (82.8 kg)
5% trimmed mean $= 180.07$ lb (81.7 kg)
sample median $= \frac{177 + 178}{2} = 177.5$ lb (80.5 kg)

[Figure: boxplot of the weights, axis from 140 to 260 lb (max 283 lb = 128.4 kg). Photo: Pittsburgh Steelers.]

Motivation for confidence interval in addition to point estimate

Contact lens example: A sample of 200 students at a large university is selected to estimate the proportion of students that wear contact lenses. In this sample 47 wore contact lenses. The point estimate is $p = 47/200 = 0.235$, i.e. 23.5% of students wear contact lenses.

A shortcoming of the point estimate is that it gives no idea of the uncertainty in it.
The sample size here (200) is fairly large, so we can be confident that the point estimate is fairly accurate. But what if the sample size had only been 50 or 10? The point estimate alone does not tell us how accurate it is.

Motivation for confidence interval in addition to point estimate

The point estimate is $p = 47/200 = 0.235$. We could report the standard deviation:

$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \approx \sqrt{\frac{p(1-p)}{n}} \approx 0.03$$

where we substitute the sample proportion $p = 0.235$ for the population proportion $\pi$ since we do not know $\pi$. But a standard deviation can be hard to understand. People understand a range of plausible values more easily, and that is what the confidence interval gives.

Confidence interval

[Photo: Longhorn stadium, Austin, TX]

A confidence interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed from a random sample so that, with a chosen degree of confidence, the population characteristic will be captured inside the interval.

The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.

Confidence interval (CI)

For example, under repeated random sampling and construction of the interval, a 95% confidence interval should include the true population characteristic in 95% of the intervals.

Sample 1 (n = 50)    →  95% CI 1
Sample 2 (n = 50)    →  95% CI 2
...
Sample 100 (n = 50)  →  95% CI 100

≈ 95 of these 100 intervals (95%) should contain the true population characteristic.

Note: in practice we never know the true population characteristic, so we can never check this directly; there are advanced mathematical arguments that show it.
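Although the claim cannot be checked on real data, it can be illustrated by simulation, where the true proportion is known because we choose it. The following is a minimal Python sketch, not part of the original lecture. The true proportion `TRUE_PI = 0.235`, the number of repetitions `REPS`, and the random seed are assumptions chosen purely for illustration, and the interval used is the large-sample interval $p \pm 1.96\sqrt{p(1-p)/n}$.

```python
import math
import random

TRUE_PI = 0.235   # assumed "true" proportion; known only because this is a simulation
N = 50            # sample size per repetition, as in the illustration
REPS = 2000       # number of simulated samples (assumption; the slide pictures 100)


def wald_interval(successes, n, z=1.96):
    """Large-sample interval p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def simulate_coverage(true_pi, n, reps, seed=1):
    """Fraction of simulated intervals that contain true_pi."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and count the successes.
        successes = sum(rng.random() < true_pi for _ in range(n))
        lo, hi = wald_interval(successes, n)
        if lo <= true_pi <= hi:
            hits += 1
    return hits / reps


coverage = simulate_coverage(TRUE_PI, N, REPS)
print(f"Fraction of intervals containing the true proportion: {coverage:.3f}")
```

The printed fraction should be close to, though not exactly, 0.95, since the interval is only approximately valid for finite n.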
95% confidence interval for a population proportion $\pi$

When n is large, a 95% confidence interval for $\pi$ is

$$\left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}} \right)$$

where p is the sample proportion based on a random sample of size n. The endpoints of the interval are often abbreviated by

$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$

where − gives the lower endpoint and + the upper endpoint.

95% CI for a population proportion

For large n, the following interval constructed from a simple random sample with statistic p can be expected to include the true population proportion with probability .95.

$$\left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}} \right)$$

Practical test to see if n is large enough: n is large enough if $np \ge 10$ and $n(1-p) \ge 10$.

Example: Gym

A student asked a random sample of 182 students at a large university whether they want a new fitness club (gym), and found that 75 said they do want it. Provide a point estimate and 95% confidence interval for the proportion of all university students who want a new gym on campus.

$$p = \frac{75}{182} = 0.4121$$

$np = 182(0.4121) = 75 > 10$ and $n(1-p) = 182(0.5879) = 107 > 10$, so we can use the formulas given on the previous slide to find a 95% confidence interval for $\pi$.

$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}} = 0.4121 \pm 1.96\sqrt{\frac{0.4121(0.5879)}{182}} = 0.4121 \pm 0.07151$$

95% confidence interval for $\pi$: (0.341, 0.484)

General confidence levels

Suppose instead of a 95% confidence interval, you want to be more confident, say 99%, or less confident, say 80%, of estimating the population proportion.

Question: How do you construct a C% confidence interval for a general confidence level C?

Answer:

$$p \pm (z\ \text{critical value})\sqrt{\frac{p(1-p)}{n}}$$

where the z critical value is chosen so that the area of the standard Normal curve between −z and z is C%.

Finding a z critical value

Example: a z critical value for a 98% confidence interval.
A central area of 0.98 leaves 0.01 in each tail, so the cumulative area to the left of z is 0.99. Looking up the cumulative area (0.99) in the standard Normal table we find z = 2.33. By symmetry the area less than −z = −2.33 is 0.01.

[Figure: standard Normal curve with z = 2.33 marked.]

Some common critical values

Confidence level    z critical value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.09
99.9%               3.29

As the confidence level increases, so does the z critical value, and the confidence interval gets wider.

For the gym example, p = 0.4121:

80% CI      (0.37, 0.46)
95% CI      (0.34, 0.48)
99.9% CI    (0.29, 0.53)

Estimated standard error

The estimated standard error of a statistic is the estimated standard deviation of the statistic. It is sometimes simply referred to as the standard error.

For sample proportions, the standard deviation is

$$\sqrt{\frac{\pi(1-\pi)}{n}}$$

Recall we did not know $\pi$ so used p instead. This means that the standard error of the sample proportion is

$$\sqrt{\frac{p(1-p)}{n}}$$

Standard error

For proportions,

$$\text{standard error} = \sqrt{\frac{p(1-p)}{n}}$$

$$\text{95\% CI} = \left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}} \right) = \left( p - 1.96\,(\text{st error } p),\; p + 1.96\,(\text{st error } p) \right)$$

The quantity $1.96\,(\text{st error } p)$ bounds how far the interval extends on either side of p.

Bounds on errors

The bound on error of estimation, B, associated with a 95% confidence interval, is (1.96)∙(standard error of the statistic).

The bound on error of estimation, B, associated with a general confidence interval, is (z critical value)∙(standard error of the statistic).

Standard error

For proportions,

$$\text{95\% CI} = \left( p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}} \right) = \left( p - 1.96\,(\text{st error } p),\; p + 1.96\,(\text{st error } p) \right) = (p - B,\; p + B) = p \pm B$$

If $\pi$ is in the interval, then p is within B of $\pi$. Hence B tells you the bound on how far your sample estimate is from the "truth" for a given level of confidence.
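The critical values, the standard error, and the bound B for the gym example can be reproduced with a few lines of code. This is a minimal Python sketch rather than part of the original lecture. The function names `z_critical` and `proportion_ci` are made up for illustration, and the standard library's `statistics.NormalDist` stands in for the standard Normal table.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """z such that the standard Normal area between -z and z equals `confidence`."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def proportion_ci(successes, n, confidence=0.95):
    """Return (p, B, lower, upper) for the large-sample interval p +/- B."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    bound = z_critical(confidence) * standard_error  # B = z * (standard error)
    return p, bound, p - bound, p + bound


# Gym example from the lecture: 75 of 182 students want a new gym.
for level in (0.80, 0.95, 0.999):
    p, bound, lo, hi = proportion_ci(75, 182, level)
    print(f"{level:.1%} CI: z = {z_critical(level):.3f}, "
          f"B = {bound:.4f}, interval = ({lo:.3f}, {hi:.3f})")
```

Because the table values are rounded (for example 1.96 instead of 1.95996...), results computed this way can differ from the slide values in the last digit.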
Bounds on errors

The bound on error of estimation, B, associated with a general confidence interval is (z critical value)∙(standard error of the statistic).

As the formula for B shows, the bound increases with increasing z critical value and/or standard error of the statistic.

The z critical value increases as the confidence level increases, as shown before. So the higher the confidence you want, the larger the error bound you are stuck with.

Confidence level    z critical value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.09
99.9%               3.29

Bounds on errors

The bound on error of estimation, B, associated with a general confidence interval is (z critical value)∙(standard error of the statistic).

The bound increases with increasing z critical value and/or standard error of the statistic.

For a proportion,

$$\text{st error of } p = \sqrt{\frac{p(1-p)}{n}}$$

decreases as n increases. By increasing the sample size n of your experiment you can obtain a lower error bound B for your point estimate.

Design of experiments

Suppose an investigator wants to conduct an experiment to estimate a proportion $\pi$. He wants to know what sample size (n) he should use.

To determine the sample size he needs to answer three things:

a.) What is his educated guess of the true value of $\pi$?
b.) What degree of confidence does he want in his answer? 95%?
c.) What bound B is acceptable for the error?

Then solve the equation B = (z critical value)∙(standard error of the statistic) for n.

Sample size for a proportion

For a specified $\pi$ and error bound B, and for 95% confidence:

$$B = 1.96\sqrt{\frac{\pi(1-\pi)}{n}} \;\Rightarrow\; \sqrt{n}\,B = 1.96\sqrt{\pi(1-\pi)} \;\Rightarrow\; \sqrt{n} = \frac{1.96\sqrt{\pi(1-\pi)}}{B} \;\Rightarrow\; n = \frac{1.96^2\,\pi(1-\pi)}{B^2}$$
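The final formula translates directly into code. The sketch below is a hedged Python illustration, not part of the original lecture. The function name `sample_size_for_proportion` and the planning values (a guess of 0.30 for $\pi$ and a bound B = 0.05) are hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math


def sample_size_for_proportion(pi_guess, bound, z=1.96):
    """Smallest integer n with z * sqrt(pi(1 - pi) / n) <= bound."""
    n_exact = pi_guess * (1 - pi_guess) * (z / bound) ** 2
    return math.ceil(n_exact)  # round up: fractions of a unit cannot be sampled


# Hypothetical planning values (not from the lecture): guess pi = 0.30, want B = 0.05.
print(sample_size_for_proportion(0.30, 0.05))

# pi = 0.5 maximizes pi(1 - pi), so it gives the largest (most cautious) n.
print(sample_size_for_proportion(0.50, 0.05))
```

Passing a different `z` (for example 2.58 for 99% confidence) covers other confidence levels with the same function.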
Sample size for proportions

The sample size required to estimate a population proportion $\pi$ to within an amount B with 95% confidence is

$$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2$$

The value of $\pi$ should be based on prior information. If no prior information is available, use $\pi = 0.5$ in the formula to obtain a conservatively large value for n.

Round the result up to the nearest integer, since you cannot sample fractions of people or things.

Example: Germany’s Next Top Model

[Photo: Luisa, 2012]

ProSieben would like to find a 95% confidence interval estimate within 0.03 for the proportion of all households that watch Germany’s Next Top Model regularly. How large a sample is needed if a prior estimate for $\pi$ is 0.15?

B = 0.03, prior estimate of $\pi$ = 0.15

$$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2 = (0.15)(0.85)\left(\frac{1.96}{0.03}\right)^2 = 544.2$$

A sample of 545 households would be needed.

Example: Germany’s Next Top Model

How large a sample is needed if we have no prior estimate of $\pi$, i.e. have no idea how many people watch that show? We should use $\pi = 0.5$ for a conservative estimate.

$$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2 = (0.5)(0.5)\left(\frac{1.96}{0.03}\right)^2 = 1067.1$$

The required sample size is now 1068. A good prior estimate for $\pi$ can lower the required sample size: on the previous slide, with the prior estimate $\pi = 0.15$, the required n = 545 is much lower.

Example: Germany’s Next Top Model

How large a sample is needed if we have no prior estimate of $\pi$ and we want higher confidence at 99%? We should use the z critical value 2.58 (for a 99% confidence interval) in place of 1.96.

$$n = \pi(1-\pi)\left(\frac{2.58}{B}\right)^2 = (0.5)(0.5)\left(\frac{2.58}{0.03}\right)^2 = 1849$$

The required sample size is now 1849. The higher the confidence wanted, the larger the sample size needed.

Confidence interval for a continuous population mean

If
1. $\bar{x}$ is the sample mean from a random sample,
2. the sample size n is large (generally $n \ge 30$),
3. $\sigma$, the population standard deviation, is known,
then the general formula for a confidence interval for a population mean $\mu$ is given by

$$\bar{x} \pm (z\ \text{critical value})\frac{\sigma}{\sqrt{n}}$$

Small sample sizes

$$\bar{x} \pm (z\ \text{critical value})\frac{\sigma}{\sqrt{n}}$$

Here the term after ± is the bound for estimating the mean. Actually this confidence interval is valid when $\sigma$ is known and either
1. n is large ($n \ge 30$), or
2. the population distribution is Normal (any sample size).

So if the sample size is small (n < 30) but a histogram or other graphical display indicates the data are normally distributed, then this same confidence interval can be used.

Example: Ketchup

[Photo: Pittsburgh, PA]

A filling machine has a true population standard deviation $\sigma = 0.228$ ounces when used to fill ketchup bottles. A random sample of 36 “6 ounce” bottles of ketchup was selected from the output of this machine and the sample mean was 6.018 ounces. Find a 90% confidence interval estimate for the true mean fill of ketchup from this machine.

$\bar{x} = 6.018$, $\sigma = 0.228$, $n = 36$

From the tables, the z critical value is 1.645.

90% confidence interval:

$$\bar{x} \pm (z\ \text{critical value})\frac{\sigma}{\sqrt{n}} = 6.018 \pm 1.645\,\frac{0.228}{\sqrt{36}} = 6.018 \pm 0.063 = (5.955,\; 6.081)$$

Unknown $\sigma$

[Photo: W. S. Gosset]

The statistician W. S. Gosset derived the Student’s t distribution, which describes the behavior of

$$\frac{\bar{x} - \mu}{s/\sqrt{n}}$$

where s is the sample standard deviation (Ch. 4):

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{\frac{S_{xx}}{n-1}}$$

Gosset derived this while working at the Guinness brewery in the early 1900's and published it under the name "Student" rather than his own so that Guinness would not think he was giving away company trade secrets.

t Distributions

The statistic

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

follows a t distribution with df = n − 1 (degrees of freedom) if n ≥ 30, or for smaller sample sizes if the distribution of the data can be assumed to follow a Normal distribution.
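As a check on the known-$\sigma$ case, the interval from the ketchup example can be reproduced numerically. The following is a minimal Python sketch under the lecture's assumptions (random sample, $\sigma$ known, n large or Normal population). The function name `mean_ci_known_sigma` is made up for illustration, and `statistics.NormalDist` replaces the table lookup, so the critical value is 1.6449 rather than the rounded 1.645.

```python
import math
from statistics import NormalDist


def mean_ci_known_sigma(sample_mean, sigma, n, confidence):
    """Interval x_bar +/- z * sigma / sqrt(n) for a known population sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # z critical value
    bound = z * sigma / math.sqrt(n)                # bound on error of estimation
    return sample_mean - bound, sample_mean + bound


# Ketchup example from the lecture: x_bar = 6.018 oz, sigma = 0.228 oz, n = 36, 90% confidence.
lower, upper = mean_ci_known_sigma(6.018, 0.228, 36, 0.90)
print(f"90% CI for the mean fill: ({lower:.3f}, {upper:.3f}) ounces")
```

When $\sigma$ is not known, the same structure applies with s in place of $\sigma$ and a t critical value in place of z.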
t Distributions

[Figure: comparison of Normal and t distributions, showing t densities for df = 2, 5, 10 and 25 together with the standard Normal curve, horizontal axis from −4 to 4.]

As the degrees of freedom (df) increase, the t-distribution approaches the standard Normal distribution.

One-sample t procedures

Suppose that a simple random sample of size n is drawn from a population having unknown mean $\mu$. Then the confidence interval for $\mu$ is

$$\left( \bar{x} - (t\ \text{critical value})\frac{s}{\sqrt{n}},\; \bar{x} + (t\ \text{critical value})\frac{s}{\sqrt{n}} \right)$$

where the t critical value is the t value that gives central area C for the t distribution with df = n − 1, chosen as the quantile with area C + (1 − C)/2 to the left (e.g. the .975 quantile for a .95 CI).

Example: TV

Ten randomly selected students were locked up for a week in their houses and asked to list how many hours of television they watched. The results are

82 66 90 84 75 88 80 94 110 91

Find a 90% confidence interval estimate for the true mean number of hours of television they watched.

Since n = 10 is small, we need to assume viewing times are normally distributed in order to use the t procedure. We construct a histogram and it looks like the assumption of normality for these data is reasonable.

[Figure: histogram of x (hours of television), horizontal axis from 60 to 110, vertical axis Frequency from 0 to 4.]

Example: TV

Calculating the sample mean and standard deviation we have n = 10, $\bar{x} = 86$, s = 11.842.

The 95% quantile of the t-distribution with df = 9 is provided as 1.833 from the R statistical package.

The 90% confidence interval for $\mu$ is given as

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}} = 86 \pm (1.833)\frac{11.842}{\sqrt{10}} = 86 \pm 6.86 = (79.14,\; 92.86)$$

Answer: We are 90% confident that the true mean number of hours of television watched by students locked up in their houses for a week is between 79 and 93 hours.

That was clearly an American study. What would you expect from Germans?
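The TV calculation can be reproduced from the raw data. This is a minimal Python sketch, not part of the original lecture. The Python standard library has no t quantile function, so the critical value 1.833 quoted in the lecture is entered by hand as the constant `T_CRITICAL`; the other variable names are illustrative.

```python
import math
from statistics import mean, stdev

hours = [82, 66, 90, 84, 75, 88, 80, 94, 110, 91]  # TV data from the lecture

# t critical value for 90% confidence with df = 9, as quoted in the lecture
# (the 0.95 quantile of the t distribution). Entered by hand because the
# standard library does not provide t quantiles.
T_CRITICAL = 1.833

n = len(hours)
x_bar = mean(hours)
s = stdev(hours)                       # sample standard deviation (divides by n - 1)
bound = T_CRITICAL * s / math.sqrt(n)  # t critical value times the standard error

print(f"n = {n}, mean = {x_bar}, s = {s:.3f}")
print(f"90% CI for the mean: ({x_bar - bound:.2f}, {x_bar + bound:.2f})")
```

For other sample sizes or confidence levels the constant must be replaced by the matching t quantile, for example from R's `qt` or a t table.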
Tutorial 6: Impact of working on finishing college

[Photo: University of California Santa Barbara]

An article in the Journal of College Student Development examined whether having a job while at the university was associated with a greater tendency to drop out (quit the university). The researchers collected a sample of 44 students who dropped out and 257 students who did not drop out and asked them the number of hours worked per week. The data are summarized in the following table.

Group                     Mean (hours/week)    Standard deviation
Drop-out (n = 44)         25.62                14.41
Not drop-out (n = 257)    18.10                15.31

Why do you think the sample size for the drop-outs is so much smaller than the other one?

1.) Consider the 44 drop-outs as a random sample of all drop-outs from the university and compute a 98% confidence interval (CI) for the mean number of hours worked by all students who drop out at that university.

2.) Now do the same to compute a 98% confidence interval for the mean number of hours worked for students who do not drop out.

3.) Compare the two CIs. Which interval is wider? Comment on why that is so.

In R, qt(x,df) gives the x quantile of the t-distribution with df degrees of freedom. R output:

qt(.98,44)    2.116        qt(.98,257)    2.064
qt(.98,43)    2.118        qt(.98,256)    2.064
qt(.99,44)    2.414        qt(.99,257)    2.342
qt(.99,43)    2.416        qt(.99,256)    2.341

4.) Based on the CI for drop-outs, would it be safe to say that drop-outs work on average more than 20 hours? Describe your conclusion.

5.) Now, treating the estimated standard errors as if they were known, calculate the bounds on the error of estimation for a 98% CI for both groups and compare. Use the z.99 critical value of 2.33.

6.) If you wanted to collect more drop-outs to get the same bound (98% CI) as for the non-drop-outs, how many more would you need?