POWER AND EFFECT SIZE

Previous Weeks
- A few weeks ago I made a small chart outlining all the different statistical tests we’ve covered (week 9)
- I want to complete that chart using information from the past week
  - Most of this is a repeat, but a few new tests have been added
- Important that you are familiar with these tests, know when they are appropriate to use, and how to run (most of) them in SPSS
  - Excused from running ANCOVA, RM ANOVA

When to use specific statistical tests…

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a ‘known’ population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)

# of IV (format) | # of DV (format) | Examining… | Test/Notes
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of CoVariance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample | Repeated Measures ANOVA with Estimated Marginal Means (F ratio) (e.g., longitudinal)

Tonight…
- A break from learning a new statistical ‘test’
- Focus will be on two critical statistical ‘concepts’:
  - Statistical Power
    - Related to Alpha/Statistical Significance
  - Brief overview of
    Effect Size
    - Statistically significant results vs meaningful results

First, a quick review of error in testing…

Example Hypothesis
- Pretend my master’s thesis topic is the influence of exercise on body composition
  - I believe people who exercise more will have lower %BF
- To study this:
  - I draw a sample and group subjects by how much they exercise: High and Low Exercise Groups (this is my IV)
  - I also assess %BF in each subject as a continuous variable (DV)
- I plan to see if the two groups have different mean %BF
- My hypotheses (HO and HA):
  - HA: There is a difference in %BF between the groups
  - HO: There is not a difference in %BF between the groups

Example Continued
- Now I’m going to run my statistical test, get my test statistic, and calculate a p-value
  - I’ve set alpha at the standard 0.05 level
  - By the way, what statistical test should I use…?
- My final decision on my hypotheses is going to be based on that p-value:
  - I could reject the null hypothesis (accept HA)
  - I could accept the null hypothesis (reject HA)

Statistical Errors…
- Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:
  - Type I Error: we could reject the null hypothesis although it was really true (should have accepted null)
  - Type II Error: we could fail to reject the null hypothesis when it was really untrue (should have rejected null)

- HA: There is a difference in %BF between the groups
- HO: There is not a difference in %BF between the groups
- There are really 4 potential outcomes, based on what is “true” and what we “decide”

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error

Statistical Errors…
- Remember: my final decision is based on the p-value
  - If p <= 0.05, our decision is reject HO
  - If p > 0.05, our decision is accept HO

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error

Statistical Errors…
- In my analysis, I find:
  - High Exercise Group mean %BF = 22%
  - Low Exercise
    Group mean %BF = 26%
  - p = 0.08
- What is my decision? Is it possible I’ve made an error in my decision?
  - Accept HO: there is NOT a difference in %BF between the groups
- Why is that my decision? The means ARE different?
  - I can’t be confident that the 4% difference between the two groups is not due to random sampling error

Possible Error…?
- If I did make an error, what type would it be?
  - Type II Error
- When you find a p-value greater than alpha
  - The only possible error is Type II error
- When you find a p-value less than alpha
  - The only possible error is Type I error
- If p <= 0.05, our decision is reject HO
- If p > 0.05, our decision is accept HO
- Our p = 0.08, we accepted HO
  - The only possible error is Type II

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error

Possible Error…?
- Compare Type I and Type II error like this:
- The only concern when you find statistical significance (p < 0.05) is Type I Error
  - Is the difference between groups REAL or due to random sampling error?
- Thankfully, the p-value tells you how likely a difference this large would be from random sampling error alone (that is, if HO were true)
  - In other words, the p-value is what you use to judge the risk of Type I error
- But, does the p-value tell you how likely Type II error is?
  - The probability of Type II error is better provided by Power

Possible Error…?
- Probability of Type II error is provided by Power
  - Statistical Power is 1 - β, where β is the probability of Type II error
  - We will not discuss the specific calculation of power in this class
  - SPSS can calculate this for you
- Power (1 - β) is related to Alpha, but:
  - Alpha is the probability of having Type I error
    - Lower number is better (e.g., 0.05 vs 0.01 vs 0.001)
  - Power is the probability of NOT having Type II error
    - The probability of being right (correctly rejecting the null hypothesis)
    - Higher number is better (typical goal is 0.80)
- Let’s continue this in the context of my ‘thesis’ example

Statistical Errors…
- In my analysis, I found:
  - High Exercise Group mean %BF = 22%
  - Low Exercise Group mean %BF = 26%
  - p = 0.08
- Decided to accept the null
- What do I do when I don’t find statistical significance? What happens when the result does not reflect expectations?
  - First, consider the situation

Should it be statistically significant?
- The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result
  - Just because you wanted your test to be significant doesn’t mean it should be
  - This wouldn’t be Type II error, it would just be the correct decision!
- In my example, researchers have shown in several studies that exercise does influence %BF
  - This result ‘should’ be statistically significant, right?
- If the answer is yes, then you need to consider power

In my ‘thesis’
- This result ‘should’ be statistically significant, right?
  - Probably an issue with Statistical Power
- This scenario plays out at least once a year between myself and a grad student working on a thesis or research project

How can I increase the chance that I will find statistically significant results?
- Why was this analysis not statistically significant?
- What can I do to decrease the chance of Type II error?
- Several different factors influence power
  - Your ability to detect a true difference

How can I increase Power?
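Before going through the individual factors, here is a rough sketch of how power responds to alpha and sample size for a two-group comparison. It uses a normal approximation, not the calculation SPSS performs. The function name `approx_power`, the SD of 8 %BF, and the group sizes are illustrative assumptions; only the 4% mean difference comes from the thesis example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation with equal group sizes and a common SD
    (an assumption made for this sketch).
    """
    z = NormalDist()
    d = mean_diff / sd                       # standardized mean difference
    shift = abs(d) * sqrt(n_per_group / 2)   # expected test statistic if HA is true
    z_crit = z.inv_cdf(1 - alpha / 2)        # two-sided critical value
    # Probability that the statistic falls beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Assumed SD of 8 %BF; 4 %BF is the difference from the example
for n in (20, 40, 80):
    at_05 = approx_power(4.0, 8.0, n, alpha=0.05)
    at_10 = approx_power(4.0, 8.0, n, alpha=0.10)
    print(f"n per group = {n}: power at alpha 0.05 = {at_05:.2f}, at alpha 0.10 = {at_10:.2f}")
```

Under these assumptions, power rises as either n or alpha increases, and when the true difference is zero the function returns alpha itself, which is the Type I error rate.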
1) Increase Alpha level
- Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results)
- Downsides to increasing your alpha level?
  - This will increase the chance of Type I error!
- This is rarely acceptable in practice
- Only really an option when working in a new area:
  - Researchers are unsure of how to measure a new variable
  - Researchers are unaware of confounders to control for

How can I increase Power?
2) Increase N
- Sample size is directly used when calculating p-values
- Including more subjects will increase your chance of finding statistically significant results
- Downsides to increasing sample size?
  - More subjects means more time/money
- More subjects is ALWAYS a better option if possible

How can I increase Power?
3) Use fewer groups/variables (simpler designs)
- Related to sample size but different
  - ‘Use fewer groups’ NOT ‘Use fewer subjects’
- More groups negatively affects your degrees of freedom
  - Remember, df is calculated with # groups and # subjects
- Lots of variables, groups and interactions make it more difficult to find statistically significant differences
  - Corrections for the Family-wise error rate are designed to make it harder to find significant results!
- Downsides to fewer groups/variables?
  - Sometimes you NEED to make several comparisons and test for interactions; this is unavoidable

How can I increase Power?
4) Measure variables more accurately
- If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc…) this increases measurement error
  - More measurement error decreases confidence in the result
  - For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error.
  - More of an internal validity problem than a statistical problem
- Downsides to measuring more accurately?
  - None, if you can afford the best tools

How can I increase Power?
5) Decrease subject variability
- Subjects will have various characteristics that may also be correlated with your variables
  - SES, sex, race/ethnicity, age, etc…
  - These variables can confound your results, making it harder to find statistically significant results
- When planning your sample (to enhance power), select subjects that are very similar to each other
  - This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results
- Downside to decreasing subject variability?
  - Will decrease your external validity (generalizability)
  - If you only test women, your results do not apply to men

How can I increase Power?
6) Increase magnitude of the mean difference
- If your groups are not different enough, make them more different!
- For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?
  - Compare a ‘very’ high exercise to a ‘very’ low exercise group
  - Sampling at the extremes, getting rid of the middle group
- Downsides to using the extremes?
  - Similar to decreasing subject variability, this will decrease your external validity

Questions on Power/Increasing Power?

The Catch-22 of Power and P-values
- I’ve mentioned this previously, but once you are able to draw a large sample, this will ruin the utility of p/statistical significance
- The larger your sample, the more likely you’ll find statistically significant results
  - Sometimes minuscule differences between groups or tiny correlations are ‘significant’
  - This becomes relevant once sample size grows to 100~150 subjects per group
  - Once you approach 1000 subjects, it’s hard not to find p < 0.05

Example from most highly cited paper in Psych, 2004…
- This paper was the first to find a link between playing video games/TV and aggression in children:
- Every correlation in this table except 1 has p < 0.05
- Do you remember what a correlation of 0.10 looks like?
  - r = 0.10
  - Do you see a relationship between these two variables?

What now?
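To make the sample-size effect described above concrete, the sketch below converts a fixed correlation of r = 0.10 into a two-tailed p-value at several sample sizes, using t = r * sqrt(n - 2) / sqrt(1 - r^2). The function name `approx_p_for_r` and the sample sizes are illustrative assumptions, and the t distribution is replaced by a normal approximation (a simplification that is reasonable at these sample sizes) so that only the standard library is needed.

```python
from math import sqrt
from statistics import NormalDist


def approx_p_for_r(r, n):
    """Approximate two-tailed p-value for a Pearson correlation r with n subjects.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and treats t as standard
    normal, which is a simplification for this sketch.
    """
    t = r * sqrt(n - 2) / sqrt(1 - r * r)
    return 2 * (1 - NormalDist().cdf(abs(t)))


r = 0.10  # same tiny correlation throughout; r^2 is about 0.01
for n in (50, 150, 1000):
    p = approx_p_for_r(r, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n}: r^2 = {r * r:.2f}, p = {p:.3f} ({verdict})")
```

The correlation, and so the variance accounted for, is identical in every row; only the sample size changes whether it crosses p < 0.05.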
- This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values)
  - Moving towards reporting with 95% confidence intervals
  - Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.)
- Some people interpret ‘statistically significant’ as being ‘important’
  - We’ve mentioned several times this is NOT true
  - Statistically significant just means it’s likely not Type I error
  - Can have ‘important’ results that aren’t statistically significant

Effect Size
- To get an idea of how ‘important’ a difference or association is, we can use Effect Size
- There are over 40 different types of effect size
  - Depends on statistical test used
  - SPSS will NOT always calculate effect size
- Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference
  - Not impacted by statistical significance
  - Effect size can stay the same even if p-value changes
  - Present the two together when possible
- The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it

Effect Size
- Understanding effect size from correlations and regressions is easy (and you already know it):
  - r², coefficient of determination
  - % Variance accounted for
- Pearson correlations between %BF and 3 variables:
  - r = 0.54, r = -0.92, r = 0.70
- Which of the three correlations has the most important association with %BF?
  - r² = 0.29, r² = 0.85, r² = 0.49

Interpreting Effect Size
- Usually, guidelines are given for interpreting the effect size
  - Help you to know how important the effect is
  - Only a guide, you can use your own brain to compare
- In general, r² is interpreted as:
  - 0.01 or smaller, a Trivial Effect
  - 0.01 to 0.09, a Small Effect
  - 0.09 to 0.25, a Moderate Effect
  - > 0.25, a Large Effect

Effect Size in Regression
- Two regression equations contain 4 predictors of %BF. Each ‘model’ is statistically significant.
  - Here are their r² values: 0.29 and 0.15
- Which has the largest effect size? Does either of the regression models have a large effect size?
  - 0.29 model is the most important, and has a ‘large effect size’. 0.15 model is of ‘moderate’ importance.

Effect Size for Group Differences
- Effect size in t-tests and ANOVAs is a bit more complicated
- In general, effect size is a ratio of the mean difference between two groups and the standard deviation
  - Does this remind you of anything we’ve previously seen?
  - Z-score = (Score - Mean)/SD
- Effect size, when calculated this way, is basically determining how many standard deviations the two groups are different by
  - E.g., effect size of 1 means the two groups are different by 1 standard deviation (this would be a big difference)!

Example
- When working with t-tests, calculating effect size by the mean difference/SD is called Cohen’s d
  - < 0.1 Trivial effect
  - 0.1-0.3 Small effect
  - 0.3-0.5 Medium effect
  - > 0.5 Large effect
- The next slide is the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d

Paired-Samples t-test Output
- Mean difference = 2.9, Std. Deviation = 5.2
- Cohen’s d = 2.9 / 5.2 ≈ 0.56, a large effect size
- Essentially, the weight loss program reduced body weight by just about half a standard deviation

Other example
- I sample a group of 100 ISU students and find their average IQ is 103. Recall, the population mean for IQ is 100, SD = 15.
- I run a one-sample t-test and find it to be statistically significant (p < 0.05)
- However, effect size is… (103 - 100)/15 = 0.2, or Small Effect
- Interpretation: while this difference is likely not due to random sampling error, it’s not very important either

Other types of effect sizes
- SPSS will not calculate Cohen’s d for t-tests
- However, it will calculate effect size for ANOVAs (if you request it)
  - Not Cohen’s d, but Partial Eta Squared (η²)
  - Similar to r², interpreted the same way (same scale)
- Here is last week’s cancer example
  - Do Tumor Size and Lymph Node Involvement affect Survival Time?
- I’ll re-run and request effect size…
- Notice, η² can be used for the entire ‘model’, or each main effect and interaction individually
- How would you describe the effect of Tumor Size, or our interaction?
  - Trivial to Small Effect. How did we get a significant p-value?
  - Other factors not in our model are also very important
- Notice that the r² is equal to the η² of the full model
  - The advantage of η² is that you can evaluate individual effects

Effect Size Summary
- Many other types of effect sizes are out there; I just wanted to show you the effect sizes most commonly used with the tests we know:
  - Correlation and Regression: r²
  - T-tests: Cohen’s d
  - ANOVA: Partial eta squared (η²) and/or r²
- You are responsible for knowing:
  - The general theory behind effect size/why to use them
  - What tests they are associated with
  - How to interpret them

QUESTIONS ON POWER? EFFECT SIZE?

Upcoming…
- In-class activity
- Homework:
  - Cronk: Read Appendix A (pg. 115-19) on Effect Size
  - Holcomb Exercises 21 and 22
  - No out-of-class SPSS work this week
- Things are slowing down: next week we’ll discuss non-parametric tests
  - Chi-Square and Odds Ratio
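As a recap of the two group-difference examples from the lecture, the sketch below computes effect size as mean difference / SD and labels it with the Cohen's d guideline from the slides. The function names are illustrative assumptions, the guideline ranges overlap at their edges so the boundary handling is a choice made here, and the p-value for the IQ example uses a z approximation with the known population SD, not the one-sample t-test that SPSS would run.

```python
from math import sqrt
from statistics import NormalDist


def cohens_d(mean_diff, sd):
    """Effect size as the mean difference expressed in standard deviations."""
    return mean_diff / sd


def label_d(d):
    """Label an effect size using the lecture's guideline.

    The slide's ranges share their endpoints (0.1, 0.3, 0.5); which side
    each endpoint falls on is an assumption of this sketch.
    """
    size = abs(d)
    if size < 0.1:
        return "trivial"
    if size < 0.3:
        return "small"
    if size <= 0.5:
        return "medium"
    return "large"


# Paired-samples example: mean difference 2.9, SD of the differences 5.2
d_paired = cohens_d(2.9, 5.2)

# IQ example: sample mean 103, population mean 100, SD 15, n = 100
d_iq = cohens_d(103 - 100, 15)
z_iq = (103 - 100) / (15 / sqrt(100))        # standard error = SD / sqrt(n)
p_iq = 2 * (1 - NormalDist().cdf(abs(z_iq)))  # two-tailed, z approximation

print(f"Paired example: d = {d_paired:.2f} ({label_d(d_paired)})")
print(f"IQ example: d = {d_iq:.2f} ({label_d(d_iq)}), approx p = {p_iq:.3f}")
```

The IQ example shows the lecture's main point in two numbers: the result is statistically significant with 100 subjects, yet the effect is only a fifth of a standard deviation.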