Chapter 9: Confidence Intervals

Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl. How might we go about estimating this proportion?
We could take a sample of candies and compute the proportion of blue candies in our sample. We would have a sample proportion, or a statistic: a single value for the estimate.

Point Estimate
• A single number (a statistic) based on sample data that is used to estimate a population characteristic
• But not always equal to the population characteristic, due to sampling variation
"Point" refers to the single value on a number line. Different samples may produce different statistics.

The paper "U.S. College Students' Internet Use: Race, Gender and Digital Divides" (Journal of Computer-Mediated Communication, 2009) reports the results of a survey of 7421 students at 40 colleges and universities. (The sample was selected in such a way that it is representative of the population of college students.)
The authors want to estimate the proportion ($p$) of college students who spend more than 3 hours a day on the Internet.
2998 out of 7421 students reported using the Internet more than 3 hours a day.
$\hat{p} = 2998/7421 = .404$
This is a point estimate for the population proportion of college students who spend more than 3 hours a day on the Internet.

The paper "The Impact of Internet and Television Use on the Reading Habits and Practices of College Students" (Journal of Adolescence and Adult Literacy, 2009) investigates the reading habits of college students.
If a point estimate of $\mu$, the mean academic reading time per week for all college students, is desired, an obvious choice of a statistic for estimating $\mu$ is the sample mean $\bar{x}$. However, there are other possibilities: a trimmed mean or the sample median.
The following observations represent the number of hours spent on academic reading in 1 week by 20 college students. The dotplot suggests this data is approximately symmetrical.
1.7 3.8 4.7 9.6 11.7 12.3 12.3 12.4 12.6 13.4
14.1 14.2 15.8 15.9 18.7 19.4 21.2 21.9 23.3 28.2

College Reading Continued . . .
For these 20 observations:
sample mean: $\bar{x} = 287.2/20 = 14.36$
sample median: $(13.4 + 14.1)/2 = 13.75$ (the mean of the middle 2 observations)
10% trimmed mean: $230.2/16 = 14.39$ (the mean of the middle 16 observations)
So which of these point estimates should we use?

Choosing a Statistic for Computing an Estimate
• Choose a statistic that is unbiased (accurate)
A statistic whose mean value is equal to the value of the population characteristic being estimated is said to be an unbiased statistic.
[Figure: three sampling distributions. Two are unbiased, since the distribution is centered at the true value; one is biased, since the distribution is NOT centered at the true value.]

Choosing a Statistic for Computing an Estimate
• Choose a statistic that is unbiased (accurate)
• Choose a statistic with the smallest standard deviation
[Figure: two sampling distributions. One is unbiased, but has a smaller standard deviation so it is more precise; the other is unbiased, but has a larger standard deviation so it is not as precise.]
If the population distribution is normal, then $\bar{x}$ has a smaller standard deviation than any other unbiased statistic for estimating $\mu$.

Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl. We could take a sample of candies and compute the proportion of blue candies in our sample.
How much confidence do you have in your point estimate? Would you have more confidence if the answer were an interval?

Confidence intervals
A confidence interval (CI) for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence, the actual value of the characteristic will be between the lower and upper endpoints of the interval.
The primary goal of a confidence interval is to estimate an unknown population characteristic.

Rate your confidence (0 - 100%)
How confident (%) are you that you can ...
Guess my age within 10 years? (What does it mean to be within 10 years?)
. . . within 5 years?
. . . within 1 year?
What happened to your level of confidence as the interval became smaller?

Confidence level
The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.
If this method was used to generate an interval estimate over and over again from different samples, in the long run 95% of the resulting intervals would include the actual value of the characteristic being estimated.
Our confidence is in the method, NOT in any one particular interval!
The most common confidence levels are 90%, 95%, and 99% confidence.

Recall the General Properties for Sampling Distributions of $\hat{p}$
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$, as long as the sample size is less than 10% of the population
3. As long as $n$ is large ($np > 10$ and $n(1-p) > 10$) the sampling distribution of $\hat{p}$ is approximately normal.
These are the conditions that must be true in order to calculate a large-sample confidence interval for $p$.

Let's develop the equation for the large-sample confidence interval.
To begin, we will use a 95% confidence level. Use the table of standard normal curve areas to determine the value of $z^*$ such that a central area of .95 falls between $-z^*$ and $z^*$.
[Figure: standard normal curve. Central area = .95 between -1.96 and 1.96; lower tail area = .025; upper tail area = .025.]
About 95% of these values are within 1.96 standard deviations of the mean. We can generalize this to normal distributions other than the standard normal distribution.
For large random samples, the sampling distribution of $\hat{p}$ is approximately normal. So about 95% of the possible $\hat{p}$ will fall within
$1.96\sqrt{\dfrac{p(1-p)}{n}}$
of $p$.

Developing a Confidence Interval Continued . . .
If $\hat{p}$ is within $1.96\sqrt{\dfrac{p(1-p)}{n}}$ of $p$, this means the interval
$\hat{p} - 1.96\sqrt{\dfrac{p(1-p)}{n}}$ to $\hat{p} + 1.96\sqrt{\dfrac{p(1-p)}{n}}$
will capture $p$.
And this will happen for 95% of all possible samples!
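
The long-run claim in the derivation above can be checked with a small simulation. The following is a minimal sketch in Python and is not part of the original slides: the function names, the true proportion 0.40, the sample size 200, the number of samples, and the seed are all illustrative assumptions. It looks up $z^*$ with the standard library's `statistics.NormalDist` and counts how often the interval $\hat{p} \pm z^*\sqrt{p(1-p)/n}$ captures $p$.

```python
import math
import random
from statistics import NormalDist


def z_critical(confidence_level):
    """Return z* such that the central area under the standard normal
    curve between -z* and z* equals confidence_level."""
    tail_area = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail_area)


def coverage_rate(p, n, confidence_level=0.95, num_samples=2000, seed=1):
    """Draw num_samples random samples of size n from a population whose
    true proportion is p (an assumed value) and return the fraction of
    intervals p_hat +/- z* sqrt(p(1-p)/n) that capture p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    half_width = z_critical(confidence_level) * math.sqrt(p * (1 - p) / n)
    captured = 0
    for _ in range(num_samples):
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        if abs(p_hat - p) <= half_width:
            captured += 1
    return captured / num_samples


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.99):
        print(level, round(z_critical(level), 3))
    # Hypothetical population: true proportion 0.40, samples of size 200.
    print(coverage_rate(p=0.40, n=200))
```

The capture rate should land close to the chosen confidence level, though not exactly on it, because only a finite number of samples is simulated and $\hat{p}$ takes discrete values.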
Developing a Confidence Interval Continued . . .
[Figure: approximate sampling distribution of $\hat{p}$, centered at $p$ (the mean of the sampling distribution), with lines marking 1.96 standard deviations, $1.96\sqrt{\dfrac{p(1-p)}{n}}$, below and above the mean. Three sample values of $\hat{p}$ are shown, each with an interval created around it.]
• Suppose we get this $\hat{p}$ and create an interval. This $\hat{p}$ fell within 1.96 standard deviations of the mean AND its confidence interval "captures" $p$.
• Suppose we get this $\hat{p}$ and create an interval. This $\hat{p}$ also fell within 1.96 standard deviations of the mean AND its confidence interval "captures" $p$.
• Suppose we get this $\hat{p}$ and create an interval. This $\hat{p}$ doesn't fall within 1.96 standard deviations of the mean AND its confidence interval does NOT "capture" $p$.
Using this method of calculation, the confidence interval will not capture $p$ 5% of the time.
Notice that the length of each half of the interval equals $1.96\sqrt{\dfrac{p(1-p)}{n}}$.
When $n$ is large, a 95% confidence interval for $p$ is
$\hat{p} \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$

The diagram to the right is 100 confidence intervals for $p$ computed from 100 different random samples. Note that the ones with asterisks do not capture $p$.
If we were to compute 100 more confidence intervals for $p$ from 100 different random samples, would we get the same results?

The Large-Sample Confidence Interval for p
Now let's look at a more general formula.
The general formula for a confidence interval for a population proportion $p$ when
• $\hat{p}$ is the sample proportion from a random sample
• the sample size $n$ is large ($n\hat{p} > 10$ and $n(1-\hat{p}) > 10$), and
• if the sample is selected without replacement, the sample size is small relative to the population size (at most 10% of the population)

The Large-Sample Confidence Interval for p
The general formula for a confidence interval for a population proportion $p$ . . .
is
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Here $\hat{p}$ is the point estimate, and $\sqrt{\hat{p}(1-\hat{p})/n}$ is the estimate of the standard deviation of $\hat{p}$, or standard error.
The standard error of a statistic is the estimated standard deviation of the statistic.

The Large-Sample Confidence Interval for p
In this formula, the quantity $(z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ is called the bound on the error of estimation. This is also called the margin of error.
The 95% confidence interval is based on the fact that, for approximately 95% of all random samples, $\hat{p}$ is within the bound on error of estimation of $p$.

The article "How Well Are U.S. Colleges Run?" (USA Today, February 17, 2010) describes a survey of 1031 adult Americans. The survey was carried out by the National Center for Public Policy and the sample was selected in a way that makes it reasonable to regard the sample as representative of adult Americans. Of those surveyed, 567 indicated that they believe a college education is essential for success.
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success?
The point estimate is $\hat{p} = 567/1031 = .55$
Before computing the confidence interval, we need to verify the conditions.

College Education Continued . . .
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success?
Conditions:
1) $n\hat{p} = 1031(.55) = 567$ and $n(1-\hat{p}) = 1031(.45) = 464$. Since both of these are greater than 10, the sample size is large enough to proceed.
2) The sample size of $n = 1031$ is much smaller than 10% of the population size (adult Americans).
3) The sample was selected in a way designed to produce a representative sample. So we can regard the sample as a random sample from the population.
All our conditions are verified, so it is safe to proceed with the calculation of the confidence interval.

College Education Continued . . .
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success?
Calculation:
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = .55 \pm 1.96\sqrt{\dfrac{.55(.45)}{1031}} = (.520, .580)$
What does this interval mean in the context of this problem?
Conclusion:
We are 95% confident that the population proportion of adult Americans who believe that a college education is essential for success is between 52.0% and 58.0%.

College Education Revisited . . .
Recall the "Rate your Confidence" Activity.
A 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success is:
$.55 \pm 1.96\sqrt{\dfrac{.55(.45)}{1031}} = (.520, .580)$
Compute a 90% confidence interval for this proportion.
$.55 \pm 1.645\sqrt{\dfrac{.55(.45)}{1031}} = (.524, .575)$
Compute a 99% confidence interval for this proportion.
$.55 \pm 2.58\sqrt{\dfrac{.55(.45)}{1031}} = (.510, .590)$
What do you notice about the relationship between the confidence level of an interval and the width of the interval?

Choosing a Sample Size
The bound on error of estimation for a 95% confidence interval is
$B = 1.96\sqrt{\dfrac{p(1-p)}{n}}$
Before collecting any data, an investigator may wish to determine a sample size needed to achieve a certain bound on error of estimation. If we solve this for $n$ . . .
$n = p(1-p)\left(\dfrac{1.96}{B}\right)^2$
What value should be used for the unknown value $p$?
• Sometimes, it is feasible to perform a preliminary study to estimate the value for $p$.
• In other cases, prior knowledge may suggest a reasonable estimate for $p$.
• If there is no prior knowledge and a preliminary study is not feasible, then the conservative estimate for $p$ is 0.5.

Why is the conservative estimate for p = 0.5?
.1(.9) = .09
.2(.8) = .16
.3(.7) = .21
.4(.6) = .24
.5(.5) = .25
By using .5 for $p$, we are using the largest value for $p(1-p)$ in our calculations.

In spite of the potential safety hazards, some people would like to have an internet connection in their car.
Determine the sample size required to estimate the proportion of adult Americans who would like an internet connection in their car to within 0.03 with 95% confidence.
What value should be used for $p$? With no prior estimate, use the conservative value $p = .5$, so $p(1-p) = .25$. The value 0.03 is the bound on error of estimation $B$.
$n = p(1-p)\left(\dfrac{1.96}{B}\right)^2 = .25\left(\dfrac{1.96}{.03}\right)^2 = 1067.111$
$n = 1068$ people
Always round the sample size up to the next whole number.

Now let's look at confidence intervals to estimate the mean $\mu$ of a population.

Confidence intervals for $\mu$ when $\sigma$ is known
The general formula for a confidence interval for a population mean $\mu$ when . . .
1) $\bar{x}$ is the sample mean from a random sample,
2) the sample size $n$ is large ($n > 30$), and
3) $\sigma$, the population standard deviation, is known (is this typically known?)
is
$\bar{x} \pm (z \text{ critical value})\dfrac{\sigma}{\sqrt{n}}$
Here $\bar{x}$ is the point estimate, $\sigma/\sqrt{n}$ is the standard deviation of the statistic, and $(z \text{ critical value})\,\sigma/\sqrt{n}$ is the bound on error of estimation. These are the properties of the sampling distribution of $\bar{x}$.
This confidence interval is appropriate even when $n$ is small, as long as it is reasonable to think that the population distribution is normal in shape.

Cosmic radiation levels rise with increasing altitude, prompting researchers to consider how pilots and flight crews might be affected by increased exposure to cosmic radiation. A study reported a mean annual cosmic radiation dose of 219 mrems for a sample of flight personnel of Xinjiang Airlines. Suppose this mean is based on a random sample of 100 flight crew members. Let $\sigma = 35$ mrems.
Calculate and interpret a 95% confidence interval for the actual mean annual cosmic radiation exposure for Xinjiang flight crew members.
First, verify that the conditions are met.
1) Data is from a random sample of crew members
2) Sample size $n$ is large ($n > 30$)
3) $\sigma$ is known

Cosmic Radiation Continued . . .
Question to consider: what would happen to the width of the 95% interval if the confidence level was 90% instead of 95%?
Let $\bar{x} = 219$ mrems, $n = 100$ flight crew members, $\sigma = 35$ mrems.
Calculate and interpret a 95% confidence interval for the actual mean annual cosmic radiation exposure for Xinjiang flight crew members.
$\bar{x} \pm (z \text{ critical value})\dfrac{\sigma}{\sqrt{n}} = 219 \pm 1.96\left(\dfrac{35}{\sqrt{100}}\right) = (212.14, 225.86)$
What does this mean in context?
We are 95% confident that the actual mean annual cosmic radiation exposure for Xinjiang flight crew members is between 212.14 mrems and 225.86 mrems.

Confidence intervals for $\mu$ when $\sigma$ is unknown
When $\sigma$ is unknown, we use the sample standard deviation $s$ to estimate $\sigma$. In place of z-scores, we must use the following to standardize the values:
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$
The use of the value of $s$ introduces extra variability. Therefore the distribution of $t$ values has more variability than a standard normal curve.

Important Properties of t Distributions
t distributions are described by degrees of freedom (df).
1) The t distribution corresponding to any particular number of degrees of freedom is bell shaped and centered at zero (just like the standard normal (z) distribution).
2) Each t distribution is more spread out than the standard normal distribution.
[Figure: z curve and t curve for 2 df, both centered at 0.]
Why is the z curve taller than the t curve for 2 df?

Important Properties of t Distributions Continued . . .
3) As the number of degrees of freedom increases, the spread of the corresponding t distribution decreases.
[Figure: t curve for 8 df and t curve for 2 df, both centered at 0.]
4) As the number of degrees of freedom increases, the corresponding sequence of t distributions approaches the standard normal distribution.
[Figure: z curve, t curve for 2 df, and t curve for 5 df, all centered at 0.]
For what df would the t distribution be approximately the same as a standard normal distribution?

Confidence intervals for $\mu$ when $\sigma$ is unknown
The general formula for a confidence interval for a population mean $\mu$ based on a sample of size $n$ when . . .
1) $\bar{x}$ is the sample mean from a random sample,
2) the population distribution is normal, or the sample size $n$ is large ($n > 30$), and
3) $\sigma$, the population standard deviation, is unknown
is
$\bar{x} \pm (t \text{ critical value})\dfrac{s}{\sqrt{n}}$
where the t critical value is based on df = $n - 1$. t critical values are found in Table 3.
This confidence interval is appropriate for small $n$ ONLY when the population distribution is (at least approximately) normal.

The article "Chimps Aren't Charitable" (Newsday, November 2, 2005) summarized the results of a research study published in the journal Nature. In this study, chimpanzees learned to use an apparatus that dispensed food when either of two ropes was pulled. When one of the ropes was pulled, only the chimp controlling the apparatus received food. When the other rope was pulled, food was dispensed both to the chimp controlling the apparatus and also to a chimp in the adjoining cage. The accompanying data represent the number of times out of 36 trials that each of seven chimps chose the option that would provide food to both chimps (charitable response).
23 22 21 24 19 20 20
Compute a 99% confidence interval for the mean number of charitable responses for the population of all chimps.
First verify that the conditions for a t-interval are met.

Chimps Continued . . .
23 22 21 24 19 20 20
Let's suppose it is reasonable to regard this sample of seven chimps as representative of the chimp population.
Since $n$ is small, we need to verify if it is plausible that this sample is from a population that is approximately normal. Let's use a normal probability plot.
[Figure: normal probability plot. Vertical axis: Normal Scores; horizontal axis: Number of Charitable Responses.]
The plot is reasonably straight, so it seems plausible that the population distribution of number of charitable responses is approximately normal.

Chimps Continued . . .
23 22 21 24 19 20 20
$\bar{x} = 21.29$ and $s = 1.80$
df = 7 - 1 = 6
$\bar{x} \pm (t \text{ critical value})\dfrac{s}{\sqrt{n}} = 21.29 \pm 3.71\left(\dfrac{1.80}{\sqrt{7}}\right) = (18.77, 23.81)$
We are 99% confident that the mean number of charitable responses for the population of all chimps is between 18.77 and 23.81.

Choosing a Sample Size
The bound on error of estimation associated with a 95% confidence interval is
$B = 1.96\dfrac{\sigma}{\sqrt{n}}$
We can use this to find the necessary sample size for a particular bound on error of estimation. Solve this for $n$:
$n = \left(\dfrac{1.96\,\sigma}{B}\right)^2$
This requires $\sigma$ to be known, which is rarely the case!
When $\sigma$ is unknown, a preliminary study can be performed to estimate $\sigma$ OR make an educated guess of the value of $\sigma$.
A rough estimate for $\sigma$ (used with distributions that are not too skewed) is the range divided by 4.

The financial aid office wishes to estimate the mean cost of textbooks per quarter for students at a particular university. For the estimate to be useful, it should be within $20 of the true population mean. How large a sample should be used to be 95% confident of achieving this level of accuracy?
The financial aid office believes that the amount spent on books varies, with most values between $150 and $550.
To estimate $\sigma$:
$\sigma \approx \dfrac{550 - 150}{4} = \$100$
$n = \left(\dfrac{1.96(100)}{20}\right)^2 = 96.04$
$n = 97$
Always round sample size up to the next whole number!
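
The two calculations above (the t interval for the chimp data and the sample size for the textbook-cost estimate) can be reproduced with a short Python sketch. This is an illustration only and not part of the original slides: the function names are hypothetical, and the t critical value 3.71 is taken as given from the slide's table lookup for df = 6 rather than computed.

```python
import math
from statistics import mean, stdev


def t_interval(data, t_critical):
    """Return (lower, upper) for x_bar +/- t* s / sqrt(n).

    t_critical must be looked up separately for df = n - 1,
    for example from a t table."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    bound = t_critical * s / math.sqrt(n)
    return x_bar - bound, x_bar + bound


def sample_size_for_mean(sigma, bound, z_critical=1.96):
    """Return the smallest whole n with z* sigma / sqrt(n) <= bound."""
    n = (z_critical * sigma / bound) ** 2
    return math.ceil(n)  # always round up to the next whole number


if __name__ == "__main__":
    chimps = [23, 22, 21, 24, 19, 20, 20]
    low, high = t_interval(chimps, t_critical=3.71)  # 99%, df = 6
    print(round(low, 2), round(high, 2))

    sigma_guess = (550 - 150) / 4  # range divided by 4
    print(sample_size_for_mean(sigma_guess, bound=20))
```

Because the code keeps $\bar{x}$ and $s$ unrounded, its interval endpoints can differ from the slide's (18.77, 23.81) in the second decimal place.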