POWER AND EFFECT SIZE

Previous Weeks
- A few weeks ago (week 9) I made a small chart outlining all the different statistical tests we’ve covered
- I want to complete that chart using information from the past week
- Most of this is a repeat, but a few new tests have been added
- It is important that you are familiar with these tests, know when they are appropriate to use, and know how to run (most of) them in SPSS
  - You are excused from running ANCOVA and RM ANOVA

When to use specific statistical tests…

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a ‘known’ population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)

# of IV (format) | # of DV (format) | Examining… | Test/Notes
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of CoVariance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample (e.g., longitudinal) | Repeated Measures ANOVA with Estimated Marginal Means (F ratio)

Tonight…
- A break from learning a new statistical ‘test’
- Focus will be on two critical statistical ‘concepts’
  - Statistical Power
    - Related to Alpha/Statistical Significance
  - Brief overview of Effect Size
    - Statistically significant results vs Meaningful results

First, a quick review of error in testing…

Example Hypothesis
- Pretend my master’s thesis topic is the influence of exercise on body composition
  - I believe people that exercise more will have lower %BF
- To study this:
  - I draw a sample and group subjects by how much they exercise: High and Low Exercise Groups (this is my IV)
  - I also assess %BF in each subject as a continuous variable (DV)
  - I plan to see if the two groups have different mean %BF
- My hypotheses (HO and HA):
  - HA: There is a difference in %BF between the groups
  - HO: There is not a difference in %BF between the groups

Example Continued
- Now I’m going to run my statistical test, get my test statistic, and calculate a p-value
  - I’ve set alpha at the standard 0.05 level
  - By the way, what statistical test should I use…?
- My final decision on my hypotheses is going to be based on that p-value:
  - I could reject the null hypothesis (accept HA)
  - I could fail to reject the null hypothesis (loosely, ‘accept HO’ and reject HA)

Statistical Errors…
- Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:
  - Type I Error: We could reject the null hypothesis although it was really true (should have accepted null)
  - Type II Error: We could fail to reject the null hypothesis when it was really untrue (should have rejected null)
- HA: There is a difference in %BF between the groups
- HO: There is not a difference in %BF between the groups
- There are really 4 potential outcomes, based on what is “true” and what we “decide”

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error

Statistical Errors…
- Remember: my final decision is based on the p-value
  - If p ≤ 0.05, our decision is reject HO
  - If p > 0.05, our decision is accept HO

Statistical Errors…
- In my analysis, I find:
  - High Exercise Group mean %BF = 22%
  - Low Exercise Group mean %BF = 26%
  - p = 0.08
- What is my decision?
  - Accept HO: There is NOT a difference in %BF between the groups
- Why is that my decision? The means ARE different?
  - I can’t be confident that the 4% difference between the two groups is not due to random sampling error
- Is it possible I’ve made an error in my decision?

Possible Error…?
- If I did make an error, what type would it be?
  - Type II Error
- When you find a p-value greater than alpha
  - The only possible error is Type II error
- When you find a p-value less than alpha
  - The only possible error is Type I error
- Our p = 0.08, so we accepted HO
  - The only possible error is Type II

Possible Error…?
- Compare Type I and Type II error like this:
  - The only concern when you find statistical significance (p < 0.05) is Type I Error
    - Is the difference between groups REAL or due to Random Sampling Error?
  - Thankfully, the p-value tells you how likely a difference at least this large would be if only random sampling error were at work (that is, if HO were true)
    - In other words, a small p-value means random sampling error alone is an unlikely explanation, and alpha caps how often we make a Type I error when HO is true
  - But, does the p-value tell you how likely Type II error is?
    - No. The probability of Type II error is addressed by Power

Possible Error…?
- Probability of Type II error is addressed by Power
  - The probability of Type II error is β; Statistical Power is 1 - β (Power is often loosely called ‘Beta’)
  - We will not discuss the specific calculation of power in this class
  - SPSS can calculate this for you
- Power is related to Alpha, but:
  - Alpha is the probability of having Type I error (when HO is really true)
    - Lower number is better (i.e., 0.05 vs 0.01 vs 0.001)
  - Power is the probability of NOT having Type II error (when HO is really false)
    - The probability of being right (correctly rejecting the null hypothesis)
    - Higher number is better (typical goal is 0.80)

Let’s continue this in the context of my ‘thesis’ example

Statistical Errors…
- In my analysis, I found:
  - High Exercise Group mean %BF = 22%
  - Low Exercise Group mean %BF = 26%
  - p = 0.08
  - Decided to accept the null
- What do I do when I don’t find statistical significance?
- What happens when the result does not reflect expectations?
- First, consider the situation

Should it be statistically significant?
- The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result
  - Just because you wanted your test to be significant doesn’t mean it should be
  - This wouldn’t be Type II error; it would just be the correct decision!
- In my example, researchers have shown in several studies that exercise does influence %BF
  - This result ‘should’ be statistically significant, right?
- If the answer is yes, then you need to consider power

In my ‘thesis’
- This result ‘should’ be statistically significant, right?
  - Probably an issue with Statistical Power
- This scenario plays out at least once a year between me and a grad student working on a thesis or research project
  - How can I increase the chance that I will find statistically significant results?
  - Why was this analysis not statistically significant?
  - What can I do to decrease the chance of Type II error?
- Several different factors influence power
  - Power is your ability to detect a true difference

How can I increase Power?
- 1) Increase Alpha level
  - Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results)
  - Downsides to increasing your alpha level?
    - This will increase the chance of Type I error!
  - This is rarely acceptable in practice
  - Only really an option when working in a new area:
    - Researchers are unsure of how to measure a new variable
    - Researchers are unaware of confounders to control for

How can I increase Power?
- 2) Increase N
  - Sample size is directly used when calculating p-values
  - Including more subjects will increase your chance of finding statistically significant results
  - Downsides to increasing sample size?
    - More subjects means more time/money
  - More subjects is ALWAYS a better option if possible

How can I increase Power?
- 3) Use fewer groups/variables (simpler designs)
  - Related to sample size but different
    - ‘Use fewer groups’, NOT ‘Use fewer subjects’
  - ↑ groups negatively affects your degrees of freedom
    - Remember, df is calculated with # groups and # subjects
  - Lots of variables, groups and interactions make it more difficult to find statistically significant differences
    - The purpose of controlling the Family-wise error rate is to make it harder to find significant results!
  - Downsides to fewer groups/variables?
    - Sometimes you NEED to make several comparisons and test for interactions; this is unavoidable

How can I increase Power?
- 4) Measure variables more accurately
  - If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc…) this increases measurement error
  - More measurement error decreases confidence in the result
    - For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error.
  - More of an internal validity problem than a statistical problem
  - Downsides to measuring more accurately?
    - None, if you can afford the best tools

How can I increase Power?
- 5) Decrease subject variability
  - Subjects will have various characteristics that may also be correlated with your variables
    - SES, sex, race/ethnicity, age, etc…
  - These variables can confound your results, making it harder to find statistically significant results
  - When planning your sample (to enhance power), select subjects that are very similar to each other
    - This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results
  - Downside to decreasing subject variability?
    - Will decrease your external validity (generalizability)
    - If you only test women, your results do not apply to men

How can I increase Power?
- 6) Increase magnitude of the mean difference
  - If your groups are not different enough, make them more different!
  - For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?
    - Compare a ‘very’ high exercise group to a ‘very’ low exercise group
    - Sampling at the extremes, getting rid of the middle group
  - Downsides to using the extremes?
    - Similar to decreasing subject variability, this will decrease your external validity

Questions on Power/Increasing Power?

The Catch-22 of Power and P-values
- I’ve mentioned this previously, but once you are able to draw a large sample, this will ruin the utility of p/statistical significance
- The larger your sample, the more likely you’ll find statistically significant results
  - Sometimes minuscule differences between groups or tiny correlations are ‘significant’
  - This becomes relevant once sample size grows to roughly 100 to 150 subjects per group
  - Once you approach 1000 subjects, it’s hard not to find p < 0.05
- Example from most highly cited paper in Psych, 2004…
  - This paper was the first to find a link between playing video games/TV and aggression in children
  - Every correlation in this table except 1 has p < 0.05
- Do you remember what a correlation of 0.10 looks like?
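The sample-size effect described above can be sketched numerically. The block below is only an illustration under stated assumptions: it uses the Fisher z approximation for testing a Pearson correlation against zero (not the exact t-test SPSS reports), and the function names `p_value_for_r` and `approx_power_for_r` are made up for this example. It holds the correlation at r = 0.10 (so r² = 0.01) and varies only N.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def p_value_for_r(r: float, n: int) -> float:
    """Approximate two-sided p-value for an observed Pearson r with n subjects.

    Assumption: Fisher's z transform, where atanh(r) * sqrt(n - 3) is treated
    as a standard normal score when the true correlation is zero.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def approx_power_for_r(true_r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of the two-sided test of 'correlation = 0'.

    Power is the chance of landing beyond either critical value when the
    true correlation is true_r (same Fisher z approximation as above).
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = math.atanh(true_r) * math.sqrt(n - 3)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Same tiny correlation every time; only the sample size changes.
for n in (50, 100, 400, 1000, 5000):
    p = p_value_for_r(0.10, n)
    power = approx_power_for_r(0.10, n)
    print(f"N = {n:>4}: r = 0.10, r^2 = 0.01, p = {p:.4f}, power = {power:.2f}")
```

The effect size (r² = 0.01) never changes across the rows; only the p-value and the power do, which is the Catch-22 in miniature. Passing `alpha=0.10` to `approx_power_for_r` shows the alpha/power trade-off from option 1.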
[Scatterplot of two variables with r = 0.10]
- Do you see a relationship between these two variables?

What now?
- This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values)
  - Moving towards reporting with 95% confidence intervals
  - Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.)
- Some people interpret ‘statistically significant’ as being ‘important’
  - We’ve mentioned several times this is NOT true
  - Statistically significant just means the result is unlikely to be due to random sampling error alone
  - Can have ‘important’ results that aren’t statistically significant

Effect Size
- To get an idea of how ‘important’ a difference or association is, we can use Effect Size
- There are over 40 different types of effect size
  - Depends on statistical test used
  - SPSS will NOT always calculate effect size
- Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference
  - Not impacted by statistical significance
  - Effect size can stay the same even if p-value changes
  - Present the two together when possible
- The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it

Effect Size
- Understanding effect size from correlations and regressions is easy (and you already know it):
  - r², coefficient of determination
  - % Variance accounted for
- Pearson correlations between %BF and 3 variables:
  - r = 0.54, r = -0.92, r = 0.70
- Which of the three correlations has the most important association with %BF?
  - Squaring each r: r² = 0.29, r² = 0.85, r² = 0.49 (so r = -0.92 is the most important association; the sign does not matter)

Interpreting Effect Size
- Usually, guidelines are given for interpreting the effect size
  - Help you to know how important the effect is
  - Only a guide, you can use your own brain to compare
- In general, r² is interpreted as:
  - 0.01 or smaller, a Trivial Effect
  - 0.01 to 0.09, a Small Effect
  - 0.09 to 0.25, a Moderate Effect
  - > 0.25, a Large Effect

Effect Size in Regression
- Two regression equations each contain 4 predictors of %BF. Each ‘model’ is statistically significant. Here are their r² values:
  - 0.29 and 0.15
- Which has the largest effect size? Do either of the regression models have a large effect size?
  - The 0.29 model is the most important, and has a ‘large effect size’
  - The 0.15 model is of ‘moderate’ importance

Effect Size for Group Differences
- Effect size in t-tests and ANOVAs is a bit more complicated
- In general, effect size is a ratio of the mean difference between two groups and the standard deviation
  - Does this remind you of anything we’ve previously seen?
  - Z-score = (Score - Mean)/SD
- Effect size, when calculated this way, is basically determining how many standard deviations the two groups are different by
  - E.g., effect size of 1 means the two groups are different by 1 standard deviation (this would be a big difference)!

Example
- When working with t-tests, calculating effect size by the mean difference/SD is called Cohen’s d
  - < 0.1 Trivial effect
  - 0.1-0.3 Small effect
  - 0.3-0.5 Medium effect
  - > 0.5 Large effect
- The next slide is the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d

Paired-Samples t-test Output
- Mean difference = 2.9, Std. Deviation = 5.2
- Cohen’s d = 2.9/5.2 = 0.55, a large effect size
  - Essentially, the weight loss program reduced body weight by just about half a standard deviation

Other example
- I sample a group of 100 ISU students and find their average IQ is 103.
- Recall, the population mean for IQ is 100, SD = 15.
- I run a one-sample t-test and find it to be statistically significant (p < 0.05)
- However, effect size is…
  - (103 - 100)/15 = 0.2, or Small Effect
- Interpretation: While this difference is likely not due to random sampling error, it’s not very important either

Other types of effect sizes
- SPSS will not calculate Cohen’s d for t-tests
- However, it will calculate effect size for ANOVAs (if you request it)
  - Not Cohen’s d, but Partial Eta Squared (η²)
  - Similar to r², interpreted the same way (same scale)
- Here is last week’s cancer example
  - Do Tumor Size and Lymph Node Involvement affect Survival Time?
  - I’ll re-run and request effect size…
- Notice, η² can be used for the entire ‘model’, or each main effect and interaction individually
- How would you describe the effect of Tumor Size, or our interaction?
  - Trivial to Small Effect. How did we get a significant p-value?
  - Other factors not in our model are also very important
- Notice that the r² is equal to the η² of the full model
  - The advantage of η² is that you can evaluate individual effects

Effect Size Summary
- Many other types of effect sizes are out there. I just wanted to show you the effect sizes most commonly used with the tests we know:
  - Correlation and Regression: r²
  - T-tests: Cohen’s d
  - ANOVA: Partial eta squared (η²) and/or r²
- You are responsible for knowing:
  - The general theory behind effect size/why to use them
  - What tests they are associated with
  - How to interpret them

QUESTIONS ON POWER? EFFECT SIZE?

Upcoming…
- In-class activity
- Homework:
  - Cronk: Read Appendix A (pg. 115-19) on Effect Size
  - Holcomb Exercises 21 and 22
  - No out-of-class SPSS work this week
- Things are slowing down. Next week we’ll discuss non-parametric tests
  - Chi-Square and Odds Ratio
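The effect sizes from the lecture examples can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the cut-offs are the guideline values given in these slides (not a universal standard), and how a value sitting exactly on a boundary is labeled is an assumption. From the rounded values on the paired-samples slide, 2.9/5.2 comes out at about 0.56; the 0.55 reported there presumably reflects unrounded SPSS output.

```python
def cohens_d(mean_difference: float, sd: float) -> float:
    """Cohen's d as used in the lecture: mean difference divided by the SD."""
    return mean_difference / sd


def r_squared(r: float) -> float:
    """Coefficient of determination: proportion of variance accounted for."""
    return r * r


def label_r_squared(value: float) -> str:
    """Label r² (or partial eta squared) using the lecture's guideline cut-offs."""
    if value <= 0.01:
        return "trivial"
    if value <= 0.09:
        return "small"
    if value <= 0.25:
        return "moderate"
    return "large"


def label_cohens_d(d: float) -> str:
    """Label Cohen's d using the lecture's cut-offs (boundary handling assumed)."""
    size = abs(d)  # direction of the difference does not change its magnitude
    if size < 0.1:
        return "trivial"
    if size < 0.3:
        return "small"
    if size <= 0.5:
        return "medium"
    return "large"


# Paired-samples example: mean difference 2.9, SD 5.2
d_paired = cohens_d(2.9, 5.2)
print(f"Paired t-test: d = {d_paired:.2f} ({label_cohens_d(d_paired)})")

# One-sample IQ example: sample mean 103 vs population mean 100, SD 15
d_iq = cohens_d(103 - 100, 15)
print(f"IQ example:    d = {d_iq:.2f} ({label_cohens_d(d_iq)})")

# Pearson correlations with %BF from the effect size slide
for r in (0.54, -0.92, 0.70):
    r2 = r_squared(r)
    print(f"r = {r:+.2f} -> r^2 = {r2:.2f} ({label_r_squared(r2)})")
```

Because partial eta squared is read on the same scale as r² in this lecture, `label_r_squared` can be reused for the η² values in SPSS ANOVA output.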
