Confidence Interval for a Proportion

Confidence Interval for Proportions and Means
Dr. Tom Ilvento
BUAD 820

Proportions
• The Pepsi Challenge asked soda drinkers to compare Diet Coke and Diet Pepsi in a blind taste test.
• Pepsi claimed that more than ½ of Diet Coke drinkers said they preferred Diet Pepsi.
• Suppose we take a random sample of 100 Diet Coke drinkers and find that 56 preferred Diet Pepsi.

Proportions
• $p_s$ = sample proportion.
• If $x$ represents the number of successes in our sample, then our estimator of $p$ (the population parameter) from a sample is $p_s = x/n$.
• The variance of a proportion is $s^2 = p_s q_s$ and the standard deviation is $s = (p_s q_s)^{.5}$, where $q_s = 1 - p_s$.
• The Standard Error of the Sampling Distribution of a proportion is $SE = (pq/n)^{.5}$.
• If we don't know $p$ and $q$, we use the sample estimates, $p_s$ and $q_s$.
• Note: we will think of there being a population proportion, $p$, with variance equal to $\sigma^2$.

Pepsi Challenge
• My sample: n = 100
• Calculate $p_s$: 56/100 = .56
• Calculate $q_s$: 1 - .56 = .44
• Calculate the Variance and Standard Deviation: $s^2$ = (.56)(.44) = .2464, $s$ = .496
• Calculate the Standard Error: (.2464/100)^.5 = .0496
• The sample provides a Point Estimate: a single value computed from a sample and used to estimate the value of the target population parameter.
• The sample proportion $p_s$ and $s$ are point estimates of the population proportion $p$ and the population standard deviation $\sigma$, respectively.
• I would like to place a bound of error around the estimate: a Confidence Interval.

Pepsi Challenge
• I need to think of my sample as one of many possible samples.
• I know from our work on the Normal curve that z-values of ±1.96 enclose 95 percent of the values.
• A z-value of 1.96 is associated with a probability of .475 on one side of the normal curve.
• 2 times that value yields 95% of the area under the normal curve.

Pepsi Challenge
• If I think of my sample as part of the sampling distribution, I can place ±1.96(standard error) around my estimate.
• Like this:
  .56 ± 1.96(.0496)
  .56 ± .097
  .463 to .657
• Notice that this interval includes values less than .50, which are below Pepsi's claim!
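The Pepsi Challenge arithmetic can be checked with a short Python sketch. The function name `proportion_ci` and its arguments are made up for illustration and are not part of the slides; the z-value comes from Python's standard-library normal distribution rather than a printed table, so the last digits may differ slightly from hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample (normal approximation) C.I. for a proportion.

    Hypothetical helper for illustration; returns (lower, upper, standard_error).
    """
    p_s = successes / n              # sample proportion, p_s = x / n
    q_s = 1 - p_s                    # q_s = 1 - p_s
    se = sqrt(p_s * q_s / n)         # standard error, (p_s * q_s / n)^.5
    alpha = 1 - confidence           # combined area in both tails
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z-value leaving alpha/2 in the right tail
    margin = z * se                  # bound of error
    return p_s - margin, p_s + margin, se


# Pepsi Challenge sample: 56 of 100 Diet Coke drinkers preferred Diet Pepsi
low, high, se = proportion_ci(56, 100)
print(f"SE = {se:.4f}, 95% C.I. = {low:.3f} to {high:.3f}")
```

Changing `confidence` to 0.90 or 0.99 should show the interval narrowing or widening around the same point estimate of .56.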
Why did I use the standard error in my formula?
• I am asking the question about the proportion of Diet Coke drinkers who prefer Pepsi.
• I want some sense of how well my sample estimates the population.
• If it is drawn randomly, it will represent the population, plus some sampling error.
• A 95% confidence interval means that:
  - if I had taken all possible samples
  - and calculated a confidence interval for each one,
  - 95% of them would have contained the true population parameter.

What is a Confidence Interval?
• It is an interval estimate of a population parameter.
• The plus or minus part is also known as a Bound of Error or Margin of Error.
• It is placed in a probability framework.

What is a Confidence Interval?
• We calculate the probability that the estimation process will result in an interval that contains the true value of the population proportion or mean.
• If we had repeated samples:
  - most of the C.I.s would contain the population parameter,
  - but not all of them would.

To construct a confidence interval we need
• A point estimator.
• A sample and a sample estimate using the estimator, $p_s$.
• Knowledge of the Sampling Distribution of the point estimator:
  - the Standard Error of the estimator
  - the form of the sampling distribution

To construct a confidence interval we need
• A probability level we are comfortable with, i.e. how much certainty. It is also called the "Confidence Coefficient".
• A level of Error: α refers to the combined area to the right and left of our interval.
• A 95% C.I. has α = 1 - .95 = .05.
• α refers to the probability of being wrong in our confidence interval.

Confidence Interval for a Population Proportion p
• Formula for C.I. for a Proportion:
  $p_s \pm z_{\alpha/2}\,\sigma_{p_s} \approx p_s \pm z_{\alpha/2}\sqrt{\dfrac{p_s(1-p_s)}{n}}$
• It is approximate because we are using the Normal Approximation to the Binomial Distribution.
• Assumption: a sufficiently large random sample of size n is selected from the population.

The C.I. formula
  $p_s \pm z_{\alpha/2}\sqrt{\dfrac{p_s(1-p_s)}{n}}$
• $z_{\alpha/2}$ is the z-score that leaves an area of α/2 in the right tail of the normal curve.
• α refers to the area in the tails of the distribution.
• We divide by 2 because we split α equally between the two sides of the mean, which gives the probability in each tail of the normal curve.

The C.I. Formula
• The larger the probability level for a C.I.:
  - the smaller the value of α, and α/2
  - the larger the z value

  CONFIDENCE LEVEL 100(1-α)    α      α/2     z(α/2)
  90%                          .10    .05     1.645
  95%                          .05    .025    1.96
  99%                          .01    .005    2.575

For the Pepsi Challenge Example
  90% C.I.   .56 ± 1.645(.0496) = .56 ± .0816
  95% C.I.   .56 ± 1.96(.0496)  = .56 ± .0972
  99% C.I.   .56 ± 2.575(.0496) = .56 ± .1277
• For any given sample size, the width of the Confidence Interval depends on α.
• For any given sample size, if you want to be more certain (smaller α) you have to accept a wider interval.

Problem
• Survey questionnaire: who would you vote for?
• 1,052 adults were surveyed by a major newspaper.
• The percentage who indicated Candidate B was 35%.
• Construct a 95% C.I. for this proportion.

Newspaper Confidence Interval Problem
• Standard Error = [(.35 * .65)/1,052]^.5 = .0147
• C.I.: .35 ± 1.96(.0147) = .35 ± .0288, or .3212 to .3788

Newspaper C.I.
• The newspaper said "there is a ± 3.0% margin of error."
• Where did this figure come from? It doesn't match our previous figure of 2.88%.
• And what does it mean?

Newspaper C.I.
• They calculated a general C.I. for a proportion at .5.
• Standard Error = [(.5 * .5)/1,052]^.5 = .0154
• C.I.: .5 ± 1.96(.0154) = .5 ± .0302
• Variance is largest at .5.

• For a proportion, the variance is largest at .5, or an equal split:
  - At .5, $s^2$ = (.5)(.5) = .25
  - At .7, $s^2$ = (.7)(.3) = .21
  - At .3, $s^2$ = (.3)(.7) = .21
• Which brings up another unique thing about proportions: once you specify a value of p for the population, the variance ($\sigma^2$) is known.

Confidence Interval for the mean
• Suppose I am concerned about the quality of drinking water for people who use wells in a particular geographic area.
• I will test for nitrogen, as Nitrate+Nitrite.
• The U.S. EPA sets an MCL of 10 mg/l of Nitrate/Nitrite (MCL = Maximum Contaminant Level).
• Below the threshold is considered safe.

Water Quality Example
• Let's say there are 2,500 households in the area.
• I could try to test them all, but at $50 a test it would cost $125,000 and weeks of work.
• So, I decide to take 50 well water samples and test for the presence of nitrogen.

My sample
• n = 50
• Mean = 7 mg/l
• s = 3.003 mg/l
• Standard error = 3.003/(50)^.5 = .425

From Excel

  Nitrate+Nitrite
  Mean                        7.000
  Standard Error              0.425
  Median                      7.050
  Mode                        7.100
  Standard Deviation          3.003
  Sample Variance             9.018
  Kurtosis                   -0.723
  Skewness                    0.101
  Range                      11.600
  Minimum                     1.600
  Maximum                    13.200
  Sum                       350.000
  Count                          50
  Confidence Level(95.0%)     0.853

[Stem-and-Leaf Display for Nitrate+Nitrite; stem unit: 1; stems 1 through 13]

Water Quality Example
• If I think of my sample as part of the sampling distribution, I can place a Bound of Error around my estimate.
• But I have one problem with this approach with the mean. I have two estimates:
  - the estimate of the mean
  - the estimate of the standard deviation (s), which is used to estimate the standard error
• If σ is known, we don't have a problem and we would use a z-value for the confidence interval.
• But σ is rarely known! What can I do about this?

t-distribution
• Similar to the standard normal distribution.
• The t-distribution varies with n (sample size) via degrees of freedom, df = n - 1.
• As n gets larger, the t-distribution approximates the z distribution.

Relax and have a beer!
• W.S. Gosset worked for the Guinness Brewery in Ireland around 1900.
• In quality control tests he noticed the problem of using the z-distribution.
• His solution was the t-distribution.

Comparison of a z and t-value as the Sample Size Gets Larger
The Value of a z-value or t-value for a 95% C.I.

  Sample size    Z-value      t-value
  10             too small    2.262
  20             too small    2.093
  30             1.960        2.045
  50             1.960        2.010
  100            1.960        1.984
  500            1.960        1.965
  1000           1.960        1.962

The t-Table
• Organized with degrees of freedom as rows.
• Probabilities in the right tail (α) are the columns.
• We substitute the t-value from the table for a z-value in the C.I.

The formula for Confidence Interval for the Mean
  $\bar{x} \pm t_{\alpha/2,\,n-1\ \mathrm{d.f.}}\left(\dfrac{s}{\sqrt{n}}\right)$
• $t_{\alpha/2}$ is based on (n - 1) degrees of freedom.
• In the case of a small sample, n < 30, we cannot rely on the Central Limit Theorem. To construct a C.I. with a small sample, we must make a big assumption: that the population is distributed approximately normally.
• For any probability level, as the degrees of freedom get larger, the t-value gets smaller.

The meaning of the t-value
• The t-value is interpreted like the z-value from the standardized normal table.
• NOTE: for a Confidence Interval, the t-value represents the corresponding value at α/2, which is out in the right tail of the curve.
• So a t-value for 30 degrees of freedom at the .025 level is 2.042.
• This corresponds to a z-value of 1.96
• and is used for a 95% C.I.
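To make the z versus t comparison concrete, here is a small Python sketch. It assumes the 95% t-values from the comparison table (typed in by hand, since the standard library has no t-distribution) and computes the z-value with `statistics.NormalDist`; the dictionary name `T_VALUES_95` is made up for illustration.

```python
from statistics import NormalDist

# t-values for a 95% C.I., copied from the comparison table (keyed by sample size n)
T_VALUES_95 = {
    10: 2.262,
    20: 2.093,
    30: 2.045,
    50: 2.010,
    100: 1.984,
    500: 1.965,
    1000: 1.962,
}

# z-value that leaves .025 in the right tail of the standard normal curve
z = NormalDist().inv_cdf(0.975)

for n, t in T_VALUES_95.items():
    # How much wider the t-based interval is than a z-based one, in percent
    extra_width = (t / z - 1) * 100
    print(f"n = {n:>4}  d.f. = {n - 1:>3}  t = {t:.3f}  z = {z:.3f}  "
          f"t interval is {extra_width:.1f}% wider")
```

The printed percentages shrink as n grows, which is the sense in which the t-distribution approaches the z distribution.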
Comparing z-distribution and t-distribution

  Degrees of Freedom    t.100    t.050    t.025     t.0005
  1                     3.078    6.314    12.706    636.62
  2                     1.886    2.920    4.303     31.598
  3                     1.638    2.353    3.182     12.924
  30                    1.310    1.697    2.042     3.646
  ∞                     1.282    1.645    1.960     3.291

• As the degrees of freedom get to 30, the t-value approaches z.

Formation of a Confidence Interval of the Mean
BASIC STEPS
• Set a probability that an interval estimator encloses the population parameter: p = .95
• Set an alpha level as 1 - p: .05
• Divide the alpha by 2: .025
• Calculate the degrees of freedom as n - 1
• Locate the α/2 probability column for your degrees of freedom in the t-Table
• Find the corresponding t value for the α/2 probability: 2.010

Formula for C.I. for the mean
• Use the population parameter σ if it is known, and a z-value:
  $\bar{x} \pm z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$
• Use the sample estimate s, and a t-value, if σ is not known:
  $\bar{x} \pm t_{\alpha/2,\,n-1\ \mathrm{d.f.}}\left(\dfrac{s}{\sqrt{n}}\right)$

Back to the Water Quality Example
• Took a sample estimate of the mean.
• Treated it as one of many samples from a sampling distribution with a standard error.
• Since σ is not known, we used the sample estimate of the standard deviation, s, and we will use a t-value.

Back to the Water Quality Example
• For a specified probability level, e.g. .95, we generate a t value.
• That puts a bound around our estimate of the mean that represents 2.010 standard errors around the mean in a sampling distribution.
• ±2.010 standard errors around the mean represents 95% of the values in a t-distribution with 49 d.f.
• $t_{.05/2,\ 49\ \mathrm{d.f.}}$ = 2.010
• mean = 7.0
• standard error = 3.003/(50)^.5 = .425
• Confidence Interval:
  7.0 ± 2.010(.425)
  7.0 ± .854
  6.146 to 7.854
• Remember, we only have one sample, and thus one interval estimate.
• If we could draw repeated samples, 95 percent of the Confidence Intervals calculated on the sample mean would contain the true population parameter.
• Our one sample interval estimate may not contain the true population parameter.

[Figure: 90% Confidence Intervals for 100 Samples, n = 50, from a Sampling Exercise from a Population with µ = 75 and σ = 10]

[Figure: 95% Confidence Intervals for 100 Samples, n = 50, from a Sampling Exercise from a Population with µ = 75 and σ = 10]

[Figure: 99% Confidence Intervals for 100 Samples, n = 50, from a Sampling Exercise from a Population with µ = 75 and σ = 10]

Now you try it
• A furniture company wants to test a random sample of sofas to determine how long the cushions last.
• They simulate people sitting on the sofas by dropping a heavy object on the cushions until they wear out; they count the number of drops it takes.
• This test involves 9 sofas.
• Mean = 12,648.889
• s = 1,898.673
• Assume it follows a normal distribution.
• Generate a 95% Confidence Interval for this problem.

Sofa Test Answer
• Mean = 12,648.889
• s = 1,898.673
• SE = 1,898.673/(9)^.5 = 632.891
• $t_{.05/2,\ 8\ \mathrm{d.f.}}$ = 2.306
• 12,648.889 ± 2.306(632.891) = 12,648.889 ± 1,459.447

Excel Output for Sofa problem

  Sofa Drops
  Mean                        12,648.889
  Standard Error                 632.891
  Median                          12,742
  Mode                              #N/A
  Standard Deviation           1,898.673
  Sample Variance          3,604,958.111
  Kurtosis                        -0.676
  Skewness                        -0.372
  Range                            5,886
  Minimum                          9,459
  Maximum                         15,345
  Sum                            113,840
  Count                                9
  Confidence Level(95.0%)      1,459.450

Solving this problem with Excel
• I entered the data into a column in Excel.
• I then used the following sequence:
  - Tools
  - Data Analysis
  - Descriptive Statistics
• I then followed the options, including:
  - identify the Input Range, marking that a label is in the first row
  - Output range
  - Descriptive statistics
  - a 95% Confidence Interval
• Excel's Data Analysis will only construct a Confidence Interval as part of the Descriptive Statistics.

PhStat and Confidence Intervals
• PHStat will construct C.I. for:
  - Mean, sigma known
  - Mean, sigma unknown
  - Proportion
  - Variance
  - Population Confidences
• You can enter a range of data, or just the mean, standard deviation, and sample size.
• Try it for this problem!

A few more points on small sample C.I.
• If we cannot assume a normal distribution:
  - the probability associated with our interval is not (1 - α)
  - we really shouldn't construct a C.I.
  - or we should get more data
• If σ is known, we can use the z instead of the t, but we still need to have an approximately normal distribution.

What influences the width of a confidence interval?
• The sample size
• The level of α
• The level of the confidence coefficient (1 - α)
• The variability of the data, i.e., the standard deviation

What influences the width of a confidence interval?
• Sample size, or n
• The larger the sample size, the narrower the C.I.
• For a 95% Confidence Interval when s = 25:
  n = 50     2.010(25/(50)^.5) = 7.11
  n = 500    1.9647(25/(500)^.5) = 2.20

What influences the width of a confidence interval?
• The level of α
• The larger the level of α, the narrower the C.I.
• When s = 25 and n = 50:
  α = .05    2.010(25/(50)^.5) = 7.11
  α = .10    1.6766(25/(50)^.5) = 5.93

What influences the width of a confidence interval?
• The level of the confidence coefficient (1 - α)
• The larger the confidence coefficient, the wider the C.I.
• When s = 25 and n = 50:
  95% C.I.   2.010(25/(50)^.5) = 7.11
  99% C.I.   2.680(25/(50)^.5) = 9.48

Focus in on sample size (n)
• For a given (1 - α) C.I. and a given bound of error (B), which is what we add or subtract to the sample estimate,
• we can calculate the needed sample size as
  $n = \dfrac{(z_{\alpha/2})^2\,\sigma^2}{B^2}$
• PhStat will do this for you, under Sample Size.

Confidence Interval Summary
• Provides an interval estimate of a population parameter.
• Requires knowledge of the sampling distribution of the estimator.
• We treat our estimate from a sample as one of many possible estimates from many possible samples.

Confidence Interval Summary
• Figure a C.I. probability level as (1 - α), where α/2 represents the probability in either tail of the sampling distribution.
• (1 - α) is referred to as the confidence coefficient.
• For proportions, you can use a z-score provided the sample size is large enough (normal approximation to the Binomial):
  $p_s \pm z_{\alpha/2}\sqrt{\dfrac{p_s q_s}{n}}$

Confidence Interval Summary
• For the mean:
  - If σ is known, use a z-value for the C.I., similar to proportions.
  - If σ is unknown, use the t-table with n - 1 degrees of freedom.
  - If the sample size is small (< 30) and the distribution is approximately normal, use the t-table with n - 1 degrees of freedom.
  $\bar{x} \pm t_{\alpha/2,\,n-1\ \mathrm{d.f.}}\left(\dfrac{s}{\sqrt{n}}\right)$
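As a recap of the mean C.I. and the sample size formula, here is a minimal Python sketch. The function names `mean_ci` and `required_sample_size` are hypothetical. The t-values are passed in by hand: 2.010 for 49 d.f. as used in the water example, and 2.306 for 8 d.f., which is the standard table value consistent with the sofa margin of 1,459.447. The sample size inputs σ = 25 and B = 5 are made-up illustration values, not figures from the slides.

```python
from math import ceil, sqrt
from statistics import NormalDist


def mean_ci(mean, s, n, t_value):
    """C.I. for a mean when sigma is unknown: mean +/- t * s / sqrt(n).

    Hypothetical helper; t_value must be looked up for n - 1 degrees of freedom.
    Returns (lower, upper, margin).
    """
    margin = t_value * s / sqrt(n)   # bound of error
    return mean - margin, mean + margin, margin


def required_sample_size(sigma, bound, confidence=0.95):
    """Sample size from n = z^2 * sigma^2 / B^2, rounded up to a whole unit."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * sigma ** 2 / bound ** 2)


# Water quality example: mean 7.0 mg/l, s = 3.003, n = 50, t(.025, 49 d.f.) = 2.010
water_low, water_high, water_margin = mean_ci(7.0, 3.003, 50, 2.010)
print(f"Water: 7.0 +/- {water_margin:.3f} -> {water_low:.3f} to {water_high:.3f}")

# Sofa example: n = 9, so 8 d.f.; t(.025, 8 d.f.) = 2.306 (assumed from a standard t-table)
sofa_low, sofa_high, sofa_margin = mean_ci(12648.889, 1898.673, 9, 2.306)
print(f"Sofa: 12,648.889 +/- {sofa_margin:,.3f}")

# Illustrative sample size: sigma = 25 and bound of error B = 5 are assumed values
n_needed = required_sample_size(sigma=25, bound=5)
print(f"Needed sample size for B = 5 at 95%: {n_needed}")
```

Rounding the sample size up rather than to the nearest integer is a deliberate choice here: a smaller n would give a bound of error slightly larger than B.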