
Impact of a simulation/randomization-based curriculum on student understanding of p-values and confidence intervals

Beth Chance, Karen McGaughey, Jimmy Wong
Cal Poly – San Luis Obispo
ICOTS9

Outline
• About the curriculum (Karen)
• Evaluating the curriculum (Beth)
• Benefits/Cautions/Suggestions (Karen)
• Next Steps (Beth)

Background
• Randomization-based introductory statistics courses (Saturday workshop)
• Introducing all inferential techniques through simulation and randomization-based methods
  • e.g., permutation tests, bootstrapping
• Tintle et al. (2015) text (Roy, Session 4A)
• Focus on overall statistical process via genuine research studies
• Normal-based methods presented as alternative approximation to simulation results

• Spiraled just-in-time curriculum:
  • Brief introduction to probability through simulation
    • e.g., Monty Hall problem, coin tossing
    • Develop understanding of probability as a long-run proportion
  • Statistical Inference (Ch. 1)
    • Process probability/one proportion
    • One mean, two proportions, two means, matched pairs, multiple proportions, multiple means, regression
    • Deeper dive in each iteration
  • Interspersed as needed: discussions of random sampling, random assignment, graphical displays, scope of conclusions, etc.

• Ch. 1: Test of significance
  • One proportion
  • Facial Prototyping – “Bob & Tim” (Lea, Thomas, Lamkin, & Bell, 2007)
  • Binary response
  • Overwhelmingly name left picture “Tim” (e.g., ~80%)
• Two competing explanations for the study outcome:
  • “Random chance alone”
  • Research conjecture
• Could the observed statistic plausibly have happened by random chance alone?
• Design the simulation:
  • What does “by random chance alone” look like?
  • Coin tossing model
  • Tactile & via computer

• Ch. 3: Confidence Intervals = Interval of plausible values
• Example: Reese’s Pieces (n = 40, p̂ = 16/40 = 0.40)
• Test (via simulation) for plausible values of population proportion given observed sample proportion

Test            Two-sided p-value   Decision at 0.05 significance level   Plausible?
Ho: π = 0.26    0.0430              Reject Ho                             No
Ho: π = 0.27    0.0800              Fail to reject Ho                     Yes
:               :                   Fail to reject Ho                     Yes
:               :                   Fail to reject Ho                     Yes
Ho: π = 0.55    0.0770              Fail to reject Ho                     Yes
Ho: π = 0.56    0.0450              Reject Ho                             No

• Ch. 5: Two proportions
• Dolphin Therapy (Antonioli & Reveley, 2005)

Therapy group      Dolphin   Control
Improved           10        3
Did not improve    5         12

• Binary response
• Designed experiment
• Two competing explanations:
  • Ho: “random chance alone”
  • Ha: research conjecture
• Could the observed statistic plausibly have happened by random chance alone?
• Design the simulation:
  • Card shuffling
  • Tactile & via computer

2013-2014 Evaluation
• New and experienced teachers
  • 15 institutions (HS, community college, university)
  • 15 instructors (fall) and 23 instructors (spring, 12 new)
  • Over 1500 students
• Assessment
  • (Modified) CAOS pre and post tests (Tintle, Session 8A)
  • SATS attitudes pre and post tests (Swanson, Session 1F)
  • Set of common multiple choice exam questions
    • 25 instructors, 774-826 students
  • Final exam transfer question

One Proportion (Exam 1)
• Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q1: Picking the correct null hypothesis (overall percentages)
• Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 92.9%
• Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 5.8%
• Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 0.6%
• Other 0.6%

Q2: Picking the correct alternative hypothesis
• Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 1.7%
• Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 90.1%
• Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 5.6%
• Other 2.7%

Q3: Result is statistically significant (p = 0.012); which explanation is more plausible?
• More than half of the adult residents in her city prefer to watch the movie at home. 65.6%
• There is no overall preference for movie-watching-at-home in her city, but by pure chance her sample just happened to have an unusually high number of people choose to watch the movie at home. 6.0%
• (a) and (b) are equally plausible explanations. 29.7%
Substantial section-to-section variability!

Q4: Most valid interpretation of p-value?
• A sample proportion as large as or larger than hers would rarely occur. 14.0%
• A sample proportion as large as or larger than hers would rarely occur if the study had been conducted properly. 6.9%
• A sample proportion as large as or larger than hers would rarely occur if 50% of adults in the population prefer to watch the movie at home. 59.9% (Higher for experienced instructors)
• A sample proportion as large as or larger than hers would rarely occur if more than 50% of adults in the population prefer to watch the movie at home. 20.3%

Q5: Would 95% confidence interval contain 0.5?
• Yes 25.3%
• No 43.8%
• Not enough information 31.0%

Two Proportions (Exam 2)
• Research question: Are women more likely to dream in color than men?

Q1: Best conclusion from a not significant (not small p-value) result?
• You have found strong evidence that there is no difference between the proportions of men and women in your community that dream in color. 14.5%
• You have not found enough evidence to conclude that there is a difference between the proportions of men and women in your community that dream in color. 72.8% (Higher for new instructors)
• You have found strong evidence against the claim that there is a difference between the proportions of men and women that dream in color. 10.7%
• Because the result is not significant, we can’t conclude anything from this study. 4.1%

Q2: Best interpretation from small p-value?
• It would not be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 5.0%
• It would be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 56.5%
• It would be very surprising to obtain the observed sample results if there is really a difference between the proportion of men and women in your community that dream in color. 7.9%
• The probability is very small that there is no difference between the proportions of men and women in your community that dream in color. 22.6%
• The probability is very small that there is a difference between the proportions of men and women in your community that dream in color. 8.4%

Q3: If there really is a difference, why might you get a large p-value?
• Something went wrong with the analysis, and the results of this study cannot be trusted. 6.1%
• There must not be a difference after all and the other research studies were flawed. 3.8%
• The sample size might have been too small to detect a difference even if there is one. 90.1%

Q4: Which has stronger evidence of a difference: Study A vs. Study B?
• Study A: 40/100 vs. 20/100. 80.3%
• Study B: 35/100 vs. 25/100. 4.4%
• The strength of evidence would be similar for these two studies. 15.3%

Q5: Which has stronger evidence of a difference: Study C vs. Study D (30% vs. 20%)?
• Study C: sample sizes of 100 and 100. 83.0%
• Study D: sample sizes of 40 and 40. 6.0%
• The strength of evidence would be similar for these two studies. 10.8%

Q6: Small p-value, which explanation is more plausible?
• Men and women in your community do not differ on this issue but by chance alone the random sampling led to the difference we observed between the two groups. 13.6%
• Men and women in your community differ on this issue. 58.1%
• (a) and (b) are equally plausible explanations. 28.2%
36% correct with draft curriculum four years ago

Q7: Main purpose of the randomness in the simulation? (n = 404 students, 8 instructors)
• To allow me to draw a cause-and-effect conclusion from the study. 19.1%
• To allow me to generalize my results to a larger population. 11.4%
• To simulate values of the statistic under the null hypothesis. 58.8%
• To replicate the study and increase the accuracy of the results. 8.2%

Two Means (Exam 2/Final)
• 717 students, 14 instructors
• Want to compare mean score on video game with and without monetary incentive
• Simulation process is described and the null distribution is given

Q1: Main motivation for this process?
• This process allows her to compare her actual result to what could have happened by chance if gamers’ performances were not affected by whether they were asked to do their best or offered an incentive. 83.0%
• This process allows her to determine the percentage of time the $5 incentive strategy would outperform the “do your best” strategy for all possible scenarios. 12.0%
• This process allows her to determine how many times she needs to replicate the experiment for valid results. 2.2%
• This process allows her to determine whether the normal distribution fits the data. 2.8%

Q2: What’s assumed in carrying out the simulation?
• The $5 incentive is more effective than the “do your best” incentive for improving performance. 25.8%
• The $5 incentive and the “do your best” incentive are equally effective at improving performance. 60.9%
• The “do your best” incentive is more effective than a $5 incentive for improving performance. 6.0%
• Both (a) and (b) but not (c). 7.3%

Q3: Approximate p-value from graph
• 0.501 (using null value) 14.0%
• 0.047 (two-sided) 16.9%
• 0.022 52.5%
• 0.001 (small) 16.2%

Q4: What does histogram tell us about research question?
• The $5 incentive is not effective because the distribution of differences generated is centered at zero. 16.3%
• The $5 incentive is effective because distribution of differences generated is centered at zero. 14.8%
• The $5 incentive is not effective because the p-value is greater than 0.05. 5.1%
• The $5 incentive is effective because the p-value is less than 0.05. 63.4%

Q5: Appropriate interpretation of p-value?
• The p-value is the probability that the $5 incentive is not really helpful. 3.7%
• The p-value is the probability that the $5 incentive is really helpful. 12.9%
• The p-value is the probability that she would get a result at least as extreme as the one she actually found, if the $5 incentive is really not helpful. 82.3%
• The p-value is the probability that a student wins on the video game. 0.9%

CAOS Significance questions (n ≈ 2,000 pre, 1,500 post)
• Valid/invalid interpretations

                                                                Pre    Post: Exp   New    Non
Large or small p-value, no impact                               50%    89%         85%    62%
Probability of results at least as extreme under null: valid    50%    65%         66%    52%
Probability of alternative: invalid                             40%    53%         58%    48%
Probability of null: invalid                                    60%    53%         72%    67%
CAOS column (row assignment unclear in source): 68%, 58%, 57%, 54%

CAOS Conf interval questions
• Valid/invalid interpretations

                                                                Pre    Post: Exp   New    Non
95% of all observations in population in interval: invalid      57%    63%         64%    56%
95% confident an observational unit is in interval: invalid     27%    41%         37%    21%
95% of sample means from population are in interval: invalid    51%    60%         60%    64%
95% confident population mean is in interval: valid             71%    80%         80%    82%
CAOS column: 65%, 49%, 48%, 76%

CAOS Sampling variability questions

                                                            Pre    Post: Exp   New    Non    CAOS
Small sample (n = 60) may fail to detect difference         71%    58%         57%    49%
Necessary sample size for all 310 million U.S. residents    10%    19%         22%    11%
“Hospital problem”                                          33%    39%         38%    34%    33%
Values of 10 sample proportions                             42%    44%         52%    43%    52%
Simulation design                                           24%    40%         35%    24%    22%
67% (row unclear in source)

Topic areas – Summary
• Auth = author team member
• Mid = non-author but have used materials more than once

                        Pre                        Post
                        Auth   Mid   New   Non     Auth   Mid   New   Non
Significance            52%    43%   47%   46%     72%    67%   69%   55%*
Confidence              55%    51%   51%   49%     63%    60%   60%   56%
Sampling variability    35%    36%   36%   35%     41%    40%   41%   32%

Transfer Question (Final exam)
• A constant theme of course: Could the statistic have happened by chance alone?
  • Applicable in any situation vs. statistical test applicable in only one specific situation
• Can students apply the same logic to a novel problem?
• Spring 2014: Two Cal Poly instructors (169 students)
  • Final exam: mean/median as a measure of skewness to make inference about population shape (adapted from 2009 AP Statistics exam)
  • Earlier midterm: Ratio of standard deviations or relative risk

• Do the sample data provide convincing evidence the population is right skewed?
• Calculate statistic: mean/median = 1.05
• What values would you expect for the statistic with a normally distributed population? With a skewed right population?
  • 39% answered both questions correctly
  • Common errors:
    • Mean/median > 1.05 if right skewed
    • Wrong direction: mean/median < 1 if right skewed

• Do the sample data provide evidence the population is right skewed?
• Calculate statistic: mean/median = 1.05
• Given a simulated null distribution from a symmetric population (centered at 1)
• Evidence against the null hypothesis?

• Multiple choice version based on common responses from open-ended version:
• Answer choices focus on 3 characteristics of the null distribution:
  • There is strong evidence (or not) to suggest the actual population distribution is right skewed…
    • Due to symmetric shape
    • Because the center is at 1
    • Because most values vary between 0.96 and 1.04

Two instructors (5 sections/169 students) from Cal Poly
• does not provide strong evidence … because this null distribution is symmetric. 11%
• provides strong evidence … because this null distribution is symmetric. 12%
• does not provide strong evidence … because this null distribution is centered around one. 20%
• provides strong evidence … because this null distribution is centered around one. 26%
• does not provide strong evidence … because most of the values in this null distribution vary between 0.96 and 1.04. 10%
• provides strong evidence … because most of the values in this null distribution vary between 0.96 and 1.04. 18%
• Other: provided correct reasoning 7%
* 25% answered correctly and an additional 8% showed work indicating correct reasoning

Benefits
• Little to no confusion that small p-values → statistical significance
• Students very comfortable (even initially) with idea of “could this have happened by chance alone”
  • Idea of large z-score or t-score (beyond 2 SE) also clicks
• Address difficult inferential reasoning earlier in course
  • Repeated exposures allow a synthesis of the ideas
• Understanding “Inference process” as statistical method, rather than stand-alone methods for testing means, proportions, etc.
• Efficiency gains:
  • Still possible to do both simulation and normal-based methods
  • Exploration of other statistics (e.g., MAD for multiple means)
• Instructors enjoy approach, research study focus, richer student questions

Cautions
• Inferential reasoning is difficult and, initially, there is little carry-over of learning:
  • Non 50/50 cases
  • Comparing groups
  • Need several repeated exposures
• May introduce a misconception of “repeating the study”
• Possible increase in misconception that we are “providing evidence for the null hypothesis”
• Continue to struggle with identifying & defining parameters
• Balance inferential with descriptive statistics (less as Common Core comes on line?)

Main Suggestions
• Emphasize the ideas of model and simulation
  • Repeatedly test their ability to design a simulation
  • Ask students to predict simulation results (where will it be centered, why)
  • Focus on variability in null distribution as the key
• Clearly delineate observed data from simulation
• Explicitly discuss roles of randomness in the study design vs. randomness in simulation
• Use early experiential examples that give students ownership of the data (“observed” statistic)

Future Steps
• Three-year NSF grant (DUE/TUES – 1323210) to continue data collection across institutions
  • More “non-users” and other randomization-based curricula (e.g., Lock5, CATALST)
  • More studies of student retention of concepts
• Next theme of common exam questions: Confidence intervals
• Email Nathan Tintle ([email protected]) or Beth Chance ([email protected]) if you would like to participate

Questions?
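The coin-tossing model (Ch. 1), the interval of plausible values (Ch. 3, Reese’s Pieces) and the card-shuffling model (Ch. 5, Dolphin Therapy) described in the Background slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applets or software used in the curriculum: the function names, the distance-based two-sided rule, the number of repetitions and the seeds are all hypothetical choices made for this sketch, so simulated p-values will differ somewhat from those shown on the slides.

```python
import random


def simulate_one_proportion(successes, n, null_pi, reps=2000, seed=0):
    """Two-sided simulation p-value for Ho: pi = null_pi (coin-tossing model).

    Assumption: "at least as extreme" means at least as far from the
    expected count n * null_pi as the observed count, in either direction.
    """
    rng = random.Random(seed)
    expected = n * null_pi
    observed_distance = abs(successes - expected)
    extreme = 0
    for _ in range(reps):
        # One simulated sample: n tosses, each a success with probability null_pi.
        sim = sum(rng.random() < null_pi for _ in range(n))
        # Small tolerance guards against floating-point ties.
        if abs(sim - expected) >= observed_distance - 1e-9:
            extreme += 1
    return extreme / reps


def plausible_values(successes, n, candidates, alpha=0.05, reps=2000, seed=0):
    """Candidate null values that are NOT rejected at level alpha."""
    return [pi for pi in candidates
            if simulate_one_proportion(successes, n, pi, reps, seed) >= alpha]


def shuffle_two_proportions(table, reps=5000, seed=0):
    """One-sided card-shuffling p-value for a 2x2 table ((a, b), (c, d)).

    Rows are success / failure, columns are group 1 / group 2.
    Each card is one subject's outcome; shuffling re-deals outcomes to groups
    as if group membership had no effect.
    """
    (a, b), (c, d) = table
    n1, n2 = a + c, b + d
    cards = [1] * (a + b) + [0] * (c + d)
    observed_diff = a / n1 - b / n2
    rng = random.Random(seed)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(cards)
        s1 = sum(cards[:n1])      # successes dealt to group 1
        s2 = (a + b) - s1         # remaining successes go to group 2
        if s1 / n1 - s2 / n2 >= observed_diff - 1e-9:
            extreme += 1
    return extreme / reps


if __name__ == "__main__":
    # Reese's Pieces: 16 of 40; test one null value, then scan a grid.
    print(simulate_one_proportion(16, 40, 0.26))
    grid = [k / 100 for k in range(20, 61)]
    kept = plausible_values(16, 40, grid, reps=1000)
    print(min(kept), max(kept))
    # Dolphin Therapy: improved 10 vs 3, did not improve 5 vs 12.
    print(shuffle_two_proportions(((10, 3), (5, 12))))
```

Scanning a grid of null values and keeping those that are not rejected mirrors the table on the Ch. 3 slide; a finer grid or more repetitions would sharpen the interval endpoints at the cost of run time.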
