POWER AND EFFECT SIZE
Previous Weeks

A few weeks ago I made a small chart outlining all the different statistical tests we’ve covered (week 9)
 I want to complete that chart using information from the past week

Most of this is a repeat, but a few new tests have been added
 It is important that you are familiar with these tests, know when they are appropriate to use, and how to run (most of) them in SPSS
 Excused from running ANCOVA, RM ANOVA
When to use specific statistical tests…

| # of IV (format) | # of DV (format) | Examining… | Test/Notes |
|---|---|---|---|
| 1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r) |
| 1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b) |
| Multiple | 1 (continuous) | Prediction | Multiple Linear Regression (m + b) |
| 1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a ‘known’ population = One-Sample t-test |
| 1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test |
| 1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test |
| 1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio) |
| ∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio) |
| ∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of CoVariance) with Estimated Marginal Means (F ratio) |
| ∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample | Repeated Measures ANOVA with Estimated Marginal Means (F ratio) (e.g., longitudinal) |
Tonight…

A break from learning a new statistical ‘test’

Focus will be on two critical statistical ‘concepts’
 Statistical Power
 Related to Alpha/Statistical Significance
 Brief overview of Effect Size
 Statistically significant results vs Meaningful results
First, a quick review of error in testing…
Example Hypothesis

Pretend my master’s thesis topic is the influence of exercise on body composition
 I believe people who exercise more will have lower percent body fat (%BF)
 To study this:
 I draw a sample and group subjects by how much they exercise: High and Low Exercise Groups (this is my IV)
 I also assess %BF in each subject as a continuous variable (DV)
 I plan to see if the two groups have different mean %BF

My hypotheses (HO and HA):
 HA: There is a difference in %BF between the groups
 HO: There is not a difference in %BF between the groups
Example Continued

Now I’m going to run my statistical test, get my test statistic, and calculate a p-value
 I’ve set alpha at the standard 0.05 level
 By the way, what statistical test should I use…?

My final decision on my hypotheses is going to be based on that p-value:
 I could reject the null hypothesis (accept HA)
 I could accept the null hypothesis (reject HA); strictly speaking, this means failing to reject HO
Statistical Errors…


Since there are two potential decisions (and only one
of them can be correct), there are two possible errors
I can make:
Type I Error
 We could reject the null hypothesis although it was really true (should have accepted null)

Type II Error
 We could fail to reject the null hypothesis when it was really untrue (should have rejected null)
HA: There is a difference in %BF between the groups
HO: There is not a difference in %BF between the groups
There are really 4 potential outcomes, based on what is “true” and what we “decide”

| What is True | Our Decision: Reject HO | Our Decision: Accept HO |
|---|---|---|
| HO | Type I Error | Correct |
| HA | Correct | Type II Error |
Statistical Errors…

Remember: my final decision is based on the p-value
 If p ≤ 0.05, our decision is reject HO
 If p > 0.05, our decision is accept HO
Statistical Errors…


In my analysis, I find:
 High Exercise Group mean %BF = 22%
 Low Exercise Group mean %BF = 26%
 p = 0.08

What is my decision?
 Accept HO
 There is NOT a difference in %BF between the groups

Why is that my decision? The means ARE different?
 I can’t be confident that the 4 percentage point difference between the two groups is not due to random sampling error

Is it possible I’ve made an error in my decision?
Possible Error…?

If I did make an error, what type would it be?
 Type II Error

When you find a p-value greater than alpha
 The only possible error is Type II error

When you find a p-value at or below alpha
 The only possible error is Type I error

If p ≤ 0.05, our decision is reject HO
If p > 0.05, our decision is accept HO
Our p = 0.08, we accepted HO
The only possible error is Type II
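The decision rule and the single error each decision leaves open can be written as a small Python sketch. The function name `decide` and its return labels are illustrative choices, not part of SPSS or the lecture materials.

```python
def decide(p_value, alpha=0.05):
    """Return (decision, only possible error) for a p-value and alpha level."""
    if p_value <= alpha:
        # Rejecting HO can only be wrong if HO was actually true.
        return "reject HO", "Type I"
    # Accepting (failing to reject) HO can only be wrong if HA was actually true.
    return "accept HO", "Type II"


# The thesis example: p = 0.08 with alpha set at 0.05
decision, possible_error = decide(0.08)
print(decision, "-> only possible error:", possible_error)
```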
Possible Error…?

Compare Type I and Type II error like this:
 The only concern when you find statistical significance (p < 0.05) is Type I Error
 Is the difference between groups REAL or due to Random Sampling Error?
 Thankfully, the p-value tells you how likely a difference at least this large would be if only random sampling error were at work
 In other words, the p-value speaks to the risk of Type I error

But, does the p-value tell you how likely Type II error is?
 The probability of Type II error is better provided by Power
Possible Error…?

Probability of Type II error is provided by Power
 The probability of Type II error is β; Statistical Power is 1 - β
 We will not discuss the specific calculation of power in this class
 SPSS can calculate this for you

Power (1 - β) is related to Alpha, but:
 Alpha is the probability of having Type I error
 Lower number is better (i.e., 0.05 vs 0.01 vs 0.001)
 Power is the probability of NOT having Type II error
 The probability of being right (correctly rejecting the null hypothesis)
 Higher number is better (typical goal is 0.80)
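Although the class leaves the calculation to software, a rough sketch can make the link between alpha, β, and power concrete. The Python below uses a normal approximation for a two-sided comparison of two group means. The function name `approx_power`, the standardized difference of 0.5, and the 30 subjects per group are illustrative assumptions, not values from the thesis example or from SPSS.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power for a two-sided comparison of two group means.

    d is the true mean difference in standard deviation units (assumed known).
    Normal approximation; the tiny chance of rejecting in the wrong
    direction is ignored.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)            # cutoff implied by alpha
    return z.cdf(d * (n_per_group / 2) ** 0.5 - z_crit)


power = approx_power(d=0.5, n_per_group=30)      # hypothetical scenario
beta = 1 - power                                 # probability of Type II error
print(f"power = {power:.2f}, beta = {beta:.2f}")

# Loosening alpha from 0.05 to 0.10 raises power in the same scenario
print(f"power at alpha 0.10 = {approx_power(0.5, 30, alpha=0.10):.2f}")
```

In this sketch, power falls well short of the 0.80 goal at alpha = 0.05, which is the situation where a Type II error becomes a real concern.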
Let’s continue this in the context of my ‘thesis’ example
Statistical Errors…

In my analysis, I found:

High Exercise Group mean %BF = 22%
Low Exercise Group mean %BF = 26%
p = 0.08

Decided to accept the null




What do I do when I don’t find statistical significance?
What happens when the result does not reflect
expectations?
First, consider the situation
Should it be statistically significant?

The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result
 Just because you wanted your test to be significant doesn’t mean it should be
 This wouldn’t be Type II error; it would just be the correct decision!

In my example, researchers have shown in several studies that exercise does influence %BF
 This result ‘should’ be statistically significant, right?
 If the answer is yes, then you need to consider power
In my ‘thesis’


This result ‘should’ be statistically significant, right?

Probably an issue with Statistical Power
 This scenario plays out at least once a year between me and a grad student working on a thesis or research project
 How can I increase the chance that I will find statistically significant results?
 Why was this analysis not statistically significant?
 What can I do to decrease the chance of Type II error?

Several different factors influence power
 Power is your ability to detect a true difference
How can I increase Power?

1) Increase Alpha level
 Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results)
 Downsides to increasing your alpha level?
 This will increase the chance of Type I error!
 This is rarely acceptable in practice
 Only really an option when working in a new area:
 Researchers are unsure of how to measure a new variable
 Researchers are unaware of confounders to control for
How can I increase Power?

2) Increase N
 Sample size is directly used when calculating p-values
 Including more subjects will increase your chance of finding statistically significant results
 Downsides to increasing sample size?
 More subjects means more time/money
 More subjects is ALWAYS a better option if possible
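To get a feel for how many subjects ‘more’ might mean, here is a hedged Python sketch that inverts the usual normal-approximation power formula, n per group = 2 × ((z for alpha + z for power) / d)². The function name `n_per_group` and the three standardized differences are assumptions chosen for illustration; dedicated power software uses the t distribution and will give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects needed per group for a two-sided, two-group test.

    d is the expected mean difference in standard deviation units.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # cutoff implied by alpha
    z_power = z.inv_cdf(power)           # extra distance needed for the power goal
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):                # hypothetical expected differences
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

The smaller the difference you expect, the faster the required sample (and the time/money cost) grows.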
How can I increase Power?

3) Use fewer groups/variables (simpler designs)
 Related to sample size but different
 ‘Use fewer groups’ NOT ‘Use fewer subjects’
 ↑ groups negatively affects your degrees of freedom
 Remember, df is calculated with # groups and # subjects
 Lots of variables, groups and interactions make it more difficult to find statistically significant differences
 Controlling the Family-wise error rate is meant to make it harder to find significant results!
 Downsides to fewer groups/variables?
 Sometimes you NEED to make several comparisons and test for interactions - unavoidable
How can I increase Power?

4) Measure variables more accurately
 If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc.) this increases measurement error
 More measurement error decreases confidence in the result
 For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error.
 More of an internal validity problem than statistical problem
 Downsides to measuring more accurately?
 None, if you can afford the best tools
How can I increase Power?

5) Decrease subject variability
 Subjects will have various characteristics that may also be correlated with your variables
 SES, sex, race/ethnicity, age, etc.
 These variables can confound your results, making it harder to find statistically significant results
 When planning your sample (to enhance power), select subjects that are very similar to each other
 This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results
 Downside to decreasing subject variability?
 Will decrease your external validity (generalizability)
 If you only test women, your results do not apply to men
How can I increase Power?

6) Increase magnitude of the mean difference
 If your groups are not different enough, make them more different!
 For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?
 Compare a ‘very’ high exercise to a ‘very’ low exercise group
 Sampling at the extremes, getting rid of the middle group
 Downsides to using the extremes?
 Similar to decreasing subject variability, this will decrease your external validity
Questions on Power/Increasing Power?
The Catch-22 of Power and P-values

I’ve mentioned this previously, but once you are able to draw a large sample, this will ruin the utility of p/statistical significance
 The larger your sample, the more likely you’ll find statistically significant results
 Sometimes minuscule differences between groups or tiny correlations are ‘significant’
 This becomes relevant once sample size grows to 100 to 150 subjects per group
 Once you approach 1000 subjects, it’s hard not to find p < 0.05
 Example from most highly cited paper in Psych, 2004…



This paper was the first to find a link between playing
video games/TV and aggression in children:
Every correlation in this table except 1 has p < 0.05
Do you remember what a correlation of 0.10 looks like?
r = 0.10
Do you see a relationship
between these two variables?
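To see how sample size alone can turn the same r = 0.10 from ‘not significant’ into ‘significant’, here is a rough Python sketch. It uses the Fisher z approximation rather than the exact t-based test SPSS reports, and the function name `approx_p_for_r` and the three sample sizes are illustrative assumptions, not values from the cited paper.

```python
import math
from statistics import NormalDist


def approx_p_for_r(r, n):
    """Approximate two-sided p-value for a Pearson correlation r with n subjects.

    Fisher z approximation: atanh(r) * sqrt(n - 3) is treated as a z-score.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


for n in (50, 150, 1000):                # hypothetical sample sizes
    print(f"r = 0.10, n = {n:4d}: p is roughly {approx_p_for_r(0.10, n):.3f}")
```

The correlation itself never changes in this sketch; only the p-value does, falling below 0.05 by the time the sample reaches 1000.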
What now?

This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values)
 Moving towards reporting with 95% confidence intervals
 Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.)

Some people interpret ‘statistically significant’ as being ‘important’
 We’ve mentioned several times this is NOT true
 Statistically significant just means it’s likely not Type I error
 Can have ‘important’ results that aren’t statistically significant
Effect Size

To get an idea of how ‘important’ a difference or association is, we can use Effect Size
 There are over 40 different types of effect size
 Depends on statistical test used
 SPSS will NOT always calculate effect size
 Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference
 Not impacted by statistical significance
 Effect size can stay the same even if p-value changes
 Present the two together when possible
 The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it
Effect Size

Understanding effect size from correlations and regressions is easy (and you already know it):
 r², coefficient of determination
 % Variance accounted for

Pearson correlations between %BF and 3 variables:
 r = 0.54, r = -0.92, r = 0.70

Which of the three correlations has the most important association with %BF?
 r² = 0.29, r² = 0.85, r² = 0.49
Interpreting Effect Size

Usually, guidelines are given for interpreting the effect size
 Help you to know how important the effect is
 Only a guide, you can use your own brain to compare

In general, r² is interpreted as:
 0.01 or smaller, a Trivial Effect
 0.01 to 0.09, a Small Effect
 0.09 to 0.25, a Moderate Effect
 > 0.25, a Large Effect
Effect Size in Regression

Two regression equations contain 4 predictors of %BF. Each ‘model’ is statistically significant. Here are their r² values:
 0.29 and 0.15

Which has the largest effect size? Does either of the regression models have a large effect size?
 0.29 model is the most important, and has a ‘large effect size’.
 0.15 model is of ‘moderate’ importance.
Effect Size for Group Differences


Effect size in t-tests and ANOVAs is a bit more complicated

In general, effect size is a ratio of the mean difference between two groups and the standard deviation
 Does this remind you of anything we’ve previously seen?
 Z-score = (Score - Mean)/SD

Effect size, when calculated this way, is basically determining how many standard deviations the two groups are different by
 E.g., effect size of 1 means the two groups are different by 1 standard deviation (this would be a big difference)!
Example

When working with t-tests, calculating effect size by the mean difference/SD is called Cohen’s d
 < 0.1 Trivial effect
 0.1-0.3 Small effect
 0.3-0.5 Medium effect
 > 0.5 Large effect

The next slide is the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d
Paired-Samples t-test Output


Mean difference = 2.9, Std. Deviation = 5.2
Cohen’s d = 2.9 / 5.2 ≈ 0.56, a large effect size
 Essentially, the weight loss program reduced body weight by just about half a standard deviation
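The same calculation as a short Python sketch, using the mean difference and standard deviation shown in the output. The function names `cohens_d` and `interpret_d` are illustrative, and placing a value that falls exactly on a guideline boundary in the higher category is an assumption.

```python
def cohens_d(mean_difference, sd):
    """Cohen's d: the mean difference expressed in standard deviation units."""
    return mean_difference / sd


def interpret_d(d):
    """Label Cohen's d using the guideline ranges from the slide."""
    size = abs(d)                        # direction does not affect magnitude
    if size < 0.1:
        return "Trivial"
    if size < 0.3:
        return "Small"
    if size < 0.5:
        return "Medium"
    return "Large"


d = cohens_d(2.9, 5.2)                   # values from the paired-samples output
print(f"Cohen's d = {d:.2f}, {interpret_d(d)} effect")
```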
Other example

I sample a group of 100 ISU students and find their average IQ is 103.
 Recall, the population mean for IQ is 100, SD = 15.
 I run a one-sample t-test and find it to be statistically significant (p < 0.05)
 However, effect size is…
 (103 - 100) / 15 = 0.2, or Small Effect
 Interpretation: While this difference is likely not due to random sampling error, it’s not very important either
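A hedged Python sketch of this example shows why the two numbers can disagree: the test statistic grows with sample size, while the effect size does not. It assumes the sample standard deviation equals the population value of 15 and uses a normal approximation in place of the t distribution SPSS would use. The function name `one_sample_summary` and the sample sizes of 25 and 400 are hypothetical additions for comparison.

```python
import math
from statistics import NormalDist


def one_sample_summary(sample_mean, pop_mean, sd, n):
    """Return (effect size d, test statistic, approximate two-sided p-value)."""
    d = (sample_mean - pop_mean) / sd                        # ignores n entirely
    test_stat = (sample_mean - pop_mean) / (sd / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(test_stat)))           # normal approximation
    return d, test_stat, p


for n in (25, 100, 400):                 # only n = 100 comes from the example
    d, stat, p = one_sample_summary(103, 100, 15, n)
    print(f"n = {n:3d}: d = {d:.2f}, test statistic = {stat:.2f}, p ~ {p:.4f}")
```

Under these assumptions the effect size stays at 0.2 for every sample size, while the p-value crosses below 0.05 once n reaches 100.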
Other types of effect sizes


SPSS will not calculate Cohen’s d for t-tests

However, it will calculate effect size for ANOVAs (if you request it)
 Not Cohen’s d, but Partial Eta Squared (η²)
 Similar to r², interpreted the same way (same scale)

Here is last week’s cancer example
 Do Tumor Size and Lymph Node Involvement affect Survival Time?
 I’ll re-run and request effect size…

Notice, η² can be used for the entire ‘model’, or each main effect and interaction individually
 How would you describe the effect of Tumor Size, or our interaction?
 Trivial to Small Effect. How did we get a significant p-value?
 Other factors not in our model are also very important

Notice that the r² is equal to the η² of the full model
 The advantage of η² is that you can evaluate individual effects
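For reference, partial eta squared for a single effect is that effect’s sum of squares divided by the sum of its own and the error sums of squares. The Python sketch below applies that formula; the function name `partial_eta_squared` and every sum-of-squares value are hypothetical placeholders, not numbers from the cancer example output.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


# Hypothetical sums of squares, as they might be read off an ANOVA table
ss_error = 285.0
effects = {"Main effect A": 12.0, "A x B interaction": 3.0}

for name, ss_effect in effects.items():
    value = partial_eta_squared(ss_effect, ss_error)
    print(f"{name}: partial eta squared = {value:.3f}")
```

Each effect gets its own value, which is what allows individual main effects and interactions to be judged separately on the same scale as r².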
Effect Size Summary

Many other types of effect sizes are out there. I just wanted to show you the effect sizes most commonly used with the tests we know:
 Correlation and Regression: r²
 T-tests: Cohen’s d
 ANOVA: Partial eta squared (η²) and/or r²

You are responsible for knowing:
 The general theory behind effect size/why to use them
 What tests they are associated with
 How to interpret them
QUESTIONS ON POWER?
EFFECT SIZE?
Upcoming…


In-class activity

Homework:
 Cronk: Read Appendix A (pg. 115-19) on Effect Size
 Holcomb Exercises 21 and 22
 No out-of-class SPSS work this week

Things are slowing down - next week we’ll discuss non-parametric tests
 Chi-Square and Odds Ratio