Unit 3: Foundations for inference
Lecture 2: Hypothesis testing and confidence intervals
Statistics 101, Gary Larson
Sta 101 (Gary Larson), U3 - L2: HT and CI, July 10, 2015

Announcements

Review the Learning Objectives.
I promised you Quizzes!
Midterm: Monday July 20. Practice midterm with solutions will be posted soon.
I am out of town Friday July 17 through Sunday July 19, so start studying now! Bring questions next week to Office Hours.

Announcements: Exercise 2.18(d)

How the solution makes sense with what we've learned so far.
How the question was ambiguous: "overweight" to the book meant the overweight or obese column.
Students may get some credit back if they did it correctly for only the overweight column.
But note that we're interested in VARIABLES being related, so that may have given you a clue to think about all the categories of weight rather than just the overweight column.

Today

"Hypothesis testing" ≡ "testing" ≡ "tests"
Hypothesis testing using theoretical methods. (Simulation-based was the 7/2 lecture.)
p-values
Single-sided and two-sided hypothesis tests
Using confidence intervals for testing
Close relationship between confidence intervals and hypothesis tests
Error rates in hypothesis testing

Hypothesis testing framework: Remember when...

Speed skating lane advantage:

                 Lane
Rank          Inner  Outer  Total
Top 10            6      4     10
Not Top 10       14     16     30
Total            20     20     40

p̂I = 0.3, p̂O = 0.2 ⇒ p̂diff = 0.1

Possible explanations: a null hypothesis and an alternative hypothesis.
H0: Rank and lane are independent. No lane advantage. Observed difference in proportions is due to chance.
HA: Rank and lane are dependent. There is a lane advantage. Observed difference in proportions is not due to chance.

Result of hypothesis testing by simulation / randomization

We failed to reject the null hypothesis, since values equal to or more extreme than our observed p̂diff = 0.1 weren't extremely unlikely under the null distribution (a.k.a. the randomization distribution).

Recap: hypothesis testing framework (from 7/2 slides)

We state a null hypothesis (H0) that represents the status quo, or the absence of an effect, or the hypothesis that "nothing is going on."
We also state an alternative hypothesis (HA) that represents our research question, i.e. what we're testing for.
Is there enough evidence in the data to reject H0? To find out, we conduct a hypothesis test under the assumption that the null hypothesis is true, by one of two methods:
1. simulation of additional data collection, e.g. using randomization (the 7/2 lecture)
2. theoretical methods (this lecture).

Let's introduce the method of hypothesis testing (theory version) using an example which tests a claim about a population mean.

Number of college applications

A survey asked how many colleges Duke students applied to, and 206 students responded. This sample yielded an average of x̄ = 9.7 college applications with a standard deviation of s = 7. College Board says counselors recommend students apply to 8 colleges. Does our survey provide convincing evidence that the average number of colleges all Duke students apply to is higher than recommended?
Source: http://www.collegeboard.com/student/apply/the-application/151680.html

Setting the hypotheses

The parameter of interest µ is the mean number of schools applied to by all Duke students.
The sample statistic, x̄ = 9.7, is the average number of schools applied to by Duke students in our sample.
There are two possible explanations why our sample mean is higher than the recommended 8 schools:
The true population mean is different. Duke students on average truly apply to more than 8 schools.
The true population mean is 8. Duke students on average truly apply to exactly 8 schools. Our sample mean is higher than 8 simply due to natural sampling variability.

Setting the hypotheses (cont.)

We start with the assumption that the average number of colleges Duke students apply to is 8 (as recommended):
H0: µ = 8
We test the claim that the average number of colleges Duke students apply to is greater than 8:
HA: µ > 8

Formal testing using p-values

With hypotheses in place, assume H0 is true. So, we pretend that µ = 8. Then how unusual is a sample statistic like x̄ = 9.7?
To answer that, we need the probability that if we took a random sample of n = 206 students (with H0: µ = 8 true), we would obtain a sample mean x̄ ≥ 9.7.
To calculate that probability, start with: what is the sampling distribution of x̄ if H0 is true? (Use the CLT, if conditions are met!)

Central limit theorem

Under certain conditions,
$\bar{x} \sim N\left(\text{mean} = \mu,\ SE = \frac{s}{\sqrt{n}}\right)$
Great! We did this yesterday.
Make sure to check conditions for the CLT to hold: (1) independence, and (2) sample size/skew.
Number of college applications - conditions

Participation question
Which of the following is not a condition that needs to be met to proceed with this hypothesis test?
(a) Students in the sample should be independent of each other with respect to how many colleges they applied to.
(b) Sampling should have been done randomly.
(c) The sample size should be less than 10% of the population of all Duke students.
(d) There should be at least 10 successes and 10 failures in the sample.
(e) The distribution of the number of colleges students apply to should not be extremely skewed.

Applying the CLT

n = 206 students, x̄ = 9.7, s = 7, and we're assuming H0: µ = 8.
Here the conditions for inference for the CLT are met. So:
$\bar{x} \sim N\left(\mu = 8,\ \sigma_{\bar{x}} = SE = \frac{7}{\sqrt{206}}\right)$
$\bar{x} \sim N\left(\mu = 8,\ \sigma_{\bar{x}} \approx 0.5\right)$
How unusual is x̄ = 9.7 in this distribution?
If it's very unusual, that's probably not the right distribution! (i.e. µ isn't really 8)

We're getting closer

To determine if our observed sample mean is unusual if H0 is true, we determine how many standard errors it is from the null value (µ = 8), i.e. we calculate the Z-score.
The sample mean is 3.4 standard errors away from the hypothesized value. Is this considered unusually (significantly) high?

Number of college applications - p-value

With the Z-score calculated, we use it to calculate the p-value.
p-value: probability under H0 (µ = 8) of observing data at least as extreme as what we observed (a sample mean greater than 9.7).
$\bar{x} \sim N\left(\mu = 8,\ SE = \frac{7}{\sqrt{206}} \approx 0.5\right)$
$Z = \frac{9.7 - 8}{0.5} = 3.4$
$P(\bar{x} > 9.7 \mid \mu = 8) = P(Z > 3.4) = 0.0003$

p-values

We then use this test statistic to calculate the p-value.
If the p-value is low (lower than the significance level, α, which is usually 5%) we say that it would be very unlikely to observe the data if the null hypothesis were true, and hence reject H0.
If the p-value is high (higher than α) we say that it is likely to observe the data even if the null hypothesis were true, and hence do not reject H0.

Number of college applications - Making a decision

p-value = 0.0003
If the true average of the number of colleges Duke students applied to is 8, there is only a 0.03% chance of observing a random sample of 206 Duke students who on average apply to 9.7 or more schools.
This is a pretty low probability for us to think that a sample mean of 9.7 or more schools is likely to happen simply by chance.
Since the p-value is low (lower than 5%) we reject H0.
The data provide convincing evidence that Duke students on average apply to more than 8 schools.
The difference between the null value of 8 schools and the observed sample mean of 9.7 schools is not due to chance or sampling variability.
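The arithmetic in the college applications example can be checked with a few lines of Python. This is only a sketch using the standard library's NormalDist; the variable names are made up for illustration. It keeps the unrounded standard error (about 0.49 rather than 0.5), so the Z-score comes out a little above the 3.4 on the slides and the p-value a little below 0.0003, but the conclusion is the same.

```python
from math import sqrt
from statistics import NormalDist

# Numbers from the college applications example
n = 206        # sample size
x_bar = 9.7    # sample mean
s = 7.0        # sample standard deviation
mu_0 = 8.0     # null value from H0: mu = 8

se = s / sqrt(n)                    # standard error of the sample mean
z = (x_bar - mu_0) / se             # standard errors above the null value
p_value = 1 - NormalDist().cdf(z)   # upper tail only, because HA: mu > 8

print(f"SE = {se:.3f}, Z = {z:.2f}, p-value = {p_value:.5f}")
```

Because HA is one-sided (µ > 8), only the upper tail area counts toward the p-value.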
The next slide is provided as a brief summary of hypothesis testing...

Recap: Hypothesis testing for a population mean

1. Set the hypotheses
   H0: µ = null value
   HA: µ < or > or ≠ null value
2. Check assumptions and conditions
   Independence: random sample/assignment, 10% condition when sampling without replacement
   Normality: nearly normal population or n ≥ 30, no extreme skew
3. Calculate a test statistic and a p-value (draw a picture!)
   $Z = \frac{\bar{x} - \mu}{SE}$, where $SE = \frac{s}{\sqrt{n}}$
4. Make a decision, and interpret it in context of the research question
   If p-value < α, reject H0, data provide evidence for HA
   If p-value > α, do not reject H0, data do not provide evidence for HA

You should (and can!) understand every aspect of this process before the midterm. If not, come to Office Hours or email me :)

Hypothesis testing for µ, from beginning to end

You want to make a statement about a population parameter. So, state your H0 and HA and the significance level of this test.
Then, observe a point estimate of the population parameter you're interested in (i.e. collect data for testing the hypothesis).
After verifying the CLT's conditions are met, use the CLT to state what the sampling distribution for the sample mean would be if H0 were true.
Calculate the Z-score of your point estimate, and use the probability tables to find the associated p-value.
If the p-value is lower than the significance level, reject H0 in favor of HA. Otherwise fail to reject H0. Never "accept" H0.
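Steps 3 and 4 of the recap can be sketched as one small Python function. The function name z_test_mean and its arguments are hypothetical, chosen here for illustration; it assumes the conditions in step 2 have already been checked by hand, and it uses the normal distribution from the standard library in place of the probability tables.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(x_bar, s, n, mu_0, alternative="greater", alpha=0.05):
    """Z-test for one mean. Returns (z, p_value, decision).

    alternative is "greater" (HA: mu > mu_0), "less" (HA: mu < mu_0),
    or "two-sided" (HA: mu != mu_0).
    """
    se = s / sqrt(n)              # SE = s / sqrt(n)
    z = (x_bar - mu_0) / se       # Z = (x_bar - mu) / SE
    std_normal = NormalDist()     # standard normal, mean 0 and sd 1
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * std_normal.cdf(-abs(z))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    # Never "accept" H0: the only two outcomes are reject / fail to reject.
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# College applications example: H0: mu = 8 vs HA: mu > 8
z, p_value, decision = z_test_mean(9.7, 7, 206, 8, alternative="greater")
print(f"Z = {z:.2f}, p-value = {p_value:.5f}, decision: {decision}")
```

Choosing the tail from the alternative hypothesis is the only part that changes between one-sided and two-sided tests; the Z-score itself is computed the same way in every case.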
Come see me, even (especially) if you're thinking "I really don't understand most of this or what's happening".
(The "from beginning to end" slide is also provided as a brief summary of hypothesis testing.)

Example

A poll by the National Sleep Foundation found that college students average about 7 hours of sleep per night.
Edit by Gary: This seems totally false... can we do a survey now?
A sample of 169 Duke students yielded an average of 6.88 hours, with a standard deviation of 0.94 hours. Assuming that this is a random sample representative of all Duke students (bit of a leap of faith?), conduct a hypothesis test to evaluate if Duke students on average sleep less than 7 hours per night.

Participation question
The p-value for this hypothesis test is 0.0485. Which of the following is correct?
(a) Fail to reject H0, the data provide convincing evidence that Duke students sleep less than 7 hours on average.
(b) Reject H0, the data provide convincing evidence that Duke students sleep less than 7 hours on average.
(c) Reject H0, the data prove that Duke students sleep more than 7 hours on average.
(d) Fail to reject H0, the data do not provide convincing evidence that Duke students sleep less than 7 hours on average.
(e) Reject H0, the data provide convincing evidence that Duke students in this sample sleep less than 7 hours on average.

Confidence Intervals

Construct a 95% confidence interval for the number of hours Duke students sleep on average.
x̄ = 6.88, s = 0.94, n = 169, SE ≈ 0.07
Confidence interval, a general formula:
point estimate ± z* × SE = point estimate ± ME
For a 95% confidence interval, z* = 1.96.
6.88 ± 1.96 × 0.07 = (6.74, 7.02)
We are 95% confident that the true average number of hours Duke students sleep is between 6.74 and 7.02 hours.

Two-sided hypothesis testing with p-values

If the research question was "Do the data provide convincing evidence that the average amount of sleep Duke students get per night is different from the national average?", the alternative hypothesis would be different.
H0: µ = 7
HA: µ ≠ 7
Hence the p-value would change as well:
p-value = 0.0485 × 2 = 0.097
[Figure: sampling distribution centered at 7.00 with both tails shaded, beyond 6.88 and 7.12]
Do we reject now?

Connect HT and CI

We are 95% confident that the average number of hours Duke students sleep is between 6.738 and 7.022 hours.
6.88 ± 1.96 × 0.07 = (6.74, 7.02)
Is the null value in this interval? (i.e. is µ = 7 plausible?) Yes!
Did we fail to reject the null? (i.e. did we decide we did not have enough evidence to claim that µ is something other than 7?) Yes!

Agreement of CI and HT

Confidence intervals and hypothesis tests (almost) always agree, as long as the two methods use equivalent levels of significance / confidence.
A two-sided hypothesis test with threshold α is equivalent to a confidence interval with CL = 1 − α.
A one-sided hypothesis test with threshold α is equivalent to a confidence interval with CL = 1 − (2 × α).
If H0 is rejected, an "agreeing" confidence interval does not include the null value; the null value wasn't plausible.
If H0 is not rejected, an "agreeing" confidence interval does include the null value; the null value was plausible.
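The agreement between the two methods can be checked numerically for the sleep example. The sketch below is illustrative only: the helper confidence_interval is a made-up name, and the code keeps the unrounded SE (0.94 / 13 ≈ 0.0723), so the endpoints match the three-decimal interval (6.738, 7.022) rather than the one built from SE ≈ 0.07. It pairs the two-sided test with a 95% interval and the one-sided test (HA: µ < 7, p-value 0.0485) with a 90% interval, following CL = 1 − α and CL = 1 − (2 × α).

```python
from math import sqrt
from statistics import NormalDist

# Sleep example: x_bar = 6.88, s = 0.94, n = 169, null value 7
x_bar, s, n, mu_0 = 6.88, 0.94, 169, 7.0
se = s / sqrt(n)

def confidence_interval(x_bar, se, confidence_level):
    """point estimate +/- z* x SE, with z* taken from the standard normal."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_star * se
    return x_bar - margin, x_bar + margin

z = (x_bar - mu_0) / se
p_one_sided = NormalDist().cdf(z)             # HA: mu < 7, lower tail
p_two_sided = 2 * NormalDist().cdf(-abs(z))   # HA: mu != 7, both tails

ci_95 = confidence_interval(x_bar, se, 0.95)  # pairs with two-sided, alpha = 0.05
ci_90 = confidence_interval(x_bar, se, 0.90)  # pairs with one-sided, alpha = 0.05

for label, p, (low, high) in [("two-sided / 95% CI", p_two_sided, ci_95),
                              ("one-sided / 90% CI", p_one_sided, ci_90)]:
    rejected = p < 0.05
    null_inside = low <= mu_0 <= high
    print(f"{label}: p = {p:.4f}, CI = ({low:.3f}, {high:.3f}), "
          f"reject H0: {rejected}, 7 inside CI: {null_inside}")
```

In both pairings "reject H0" and "7 inside CI" should come out as opposites, which is what agreement means. The upper end of the 90% interval lands just below 7, so rounding the endpoints too early can hide the agreement.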
Confidence Intervals

What confidence level do we need to create a confidence interval which will agree with our one-sided hypothesis test?
A one-sided hypothesis test with threshold α is equivalent to a confidence interval with CL = 1 − (2 × α).

Construct a 90% confidence interval for the number of hours Duke students sleep on average.
x̄ = 6.88, SE ≈ 0.07
For a 90% confidence interval, z* = 1.64.
6.88 ± 1.64 × 0.07 = (6.76, 7.00)
We are 90% confident that the average number of hours Duke students sleep is between 6.76 and 7.00 hours.
We rejected the null hypothesis with a p-value of 0.0485.
What is the connection between the p-value and the interval? Note the importance of rounding properly!

Participation question
A 95% confidence interval for the average waiting time at an emergency room is (128 minutes, 147 minutes). Which of the following is false?
(a) A hypothesis test of HA: µ ≠ 120 min at α = 0.05 is equivalent to this CI.
(b) A hypothesis test of HA: µ > 120 min at α = 0.025 is equivalent to this CI.
(c) This interval does not support the claim that the average wait time is 120 minutes.
(d) The claim that the average wait time is 120 minutes would not be rejected using a 90% confidence interval.

Significance level vs. confidence level

Two sided / One sided (H0: µ = 7, HA: µ > 7)
[Figure: two standard normal curves centered at 0; the two-sided panel has central area 0.95 with cutoff 1.96, the one-sided panel has tail area 0.05 beyond cutoff 1.65]
Two-sided HT with α = 0.05 is equivalent to a 95% confidence interval.
One-sided HT with α = 0.05 is equivalent to a 90% confidence interval.
[Figure labels: two-sided panel, area 0.025 in each tail beyond ±1.96; one-sided panel, central area 0.9]
CL = 1 − 2(0.05) = 0.90 ⇒ 90% CI

Type 1 and Type 2 errors

Decision errors

Hypothesis tests are not flawless.
In the court system innocent people are sometimes wrongly convicted and the guilty sometimes walk free.
Similarly, we can make a wrong decision in statistical hypothesis tests as well.
The difference is that we have the tools necessary to quantify how often we make errors in statistics.

Decision errors (cont.)

When conducting a hypothesis test, there are two ways we could be right, and two ways we could be wrong!

                      Decision
Truth       fail to reject H0    reject H0
H0 true     ✓                    Type 1 Error
HA true     Type 2 Error         ✓

A Type 1 Error is rejecting the null hypothesis when H0 is true.
A Type 2 Error is failing to reject the null hypothesis when HA is true.
We (almost) never know if H0 or HA is true, but we can still look at these error rates!

Hypothesis test as a trial

H0: Defendant is innocent
HA: Defendant is guilty
Which type of error is being committed in the following circumstances?
Declaring the defendant innocent when they are actually guilty
Declaring the defendant guilty when they are actually innocent
Which error do you think is the worse error to make?

Error rates & power: Type 1 error rate

We usually use a significance level of 0.05, α = 0.05, i.e. reject H0 when p < 0.05.
One way to think about the significance level: when using a 5% significance level there is about a 5% chance of making a Type 1 error (incorrectly rejecting a true H0). Why? Because we defined rare data to be the rarest 5% of data!
P(Type 1 error) = α
This is why we like small values of α: increasing α increases the Type 1 error rate.

On the trial analogy: "better that ten guilty persons escape than that one innocent suffer" (William Blackstone)

Error rates & power: Filling in the table...

                      Decision
Truth       fail to reject H0    reject H0
H0 true     1 − α                Type 1 Error, α
HA true     Type 2 Error, β      Power, 1 − β

Type 1 error is rejecting H0 when you shouldn't have, and the probability of doing so is α (significance level).
Type 2 error is failing to reject H0 when you should have, and the probability of doing so is β (a little more complicated to calculate).
Power of a test is the probability of correctly rejecting H0, and the probability of doing so is 1 − β.
In hypothesis testing, we want to keep α and β low, but there is an inherent trade-off.
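The claim that P(Type 1 error) = α can be illustrated with a small simulation. This is a sketch under loudly stated assumptions, not part of the lecture: it pretends H0 is true (µ = 8) and that the number of applications is normally distributed with standard deviation 7, which real application counts are not. The seed, the number of trials and all names are arbitrary choices made for this example. Each simulated survey of 206 students is tested against HA: µ > 8 at α = 0.05, and the code counts how often H0 is (wrongly) rejected.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(101)   # arbitrary seed so the run is repeatable

MU_0 = 8.0     # null value; H0 is TRUE in this simulation
SIGMA = 7.0    # assumed population sd (illustrative assumption)
N = 206        # sample size, as in the college applications survey
ALPHA = 0.05   # significance level
TRIALS = 2000  # number of simulated surveys

def rejects_h0(sample):
    """One-sided Z-test of H0: mu = MU_0 against HA: mu > MU_0."""
    se = stdev(sample) / sqrt(len(sample))
    z = (mean(sample) - MU_0) / se
    p_value = 1 - NormalDist().cdf(z)
    return p_value < ALPHA

# Every rejection here is a Type 1 error, because H0 really is true.
rejections = sum(
    rejects_h0([random.gauss(MU_0, SIGMA) for _ in range(N)])
    for _ in range(TRIALS)
)
type_1_rate = rejections / TRIALS
print(f"Simulated Type 1 error rate: {type_1_rate:.3f} (alpha = {ALPHA})")
```

The simulated rate should land near 0.05. Rerunning with a larger ALPHA should push it up, which is the reason for preferring small values of α.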