POWER AND EFFECT SIZE
Previous Weeks

A few weeks ago I made a small chart outlining all the different statistical tests we’ve covered (week 9)
 I want to complete that chart using information from the past week

Most of this is a repeat – but a few new tests have been added
 Important that you are familiar with these tests, know when they are appropriate to use, and how to run (most of) them in SPSS
 Excused from running ANCOVA, RM ANOVA
When to use specific statistical tests…

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a ‘known’ population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of CoVariance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample (e.g., longitudinal) | Repeated Measures ANOVA with Estimated Marginal Means (F ratio)
Tonight…

A break from learning a new statistical ‘test’

Focus will be on two critical statistical ‘concepts’
 Statistical Power
 Related to Alpha/Statistical Significance
 Brief overview of Effect Size
 Statistically significant results vs Meaningful results
First, a quick review of error in testing…
Example Hypothesis

Pretend my master’s thesis topic is the influence of exercise on body composition
 I believe people who exercise more will have lower %BF
 To study this:
 I draw a sample and group subjects by how much they exercise – High and Low Exercise Groups (this is my IV)
 I also assess %BF in each subject as a continuous variable (DV)
 I plan to see if the two groups have different mean %BF

My hypotheses (HO and HA):
 HA: There is a difference in %BF between the groups
 HO: There is not a difference in %BF between the groups
Example Continued

Now I’m going to run my statistical test, get my test statistic, and calculate a p-value
 I’ve set alpha at the standard 0.05 level
 By the way, what statistical test should I use…?

My final decision on my hypotheses is going to be based on that p-value:
 I could reject the null hypothesis (accept HA)
 I could accept the null hypothesis (reject HA); strictly speaking, this means failing to reject the null
Statistical Errors…

Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:

Type I Error
 We could reject the null hypothesis although it was really true (should have accepted null)

Type II Error
 We could fail to reject the null hypothesis when it was really untrue (should have rejected null)

HA: There is a difference in %BF between the groups
HO: There is not a difference in %BF between the groups

There are really 4 potential outcomes, based on what is “true” and what we “decide”:

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error
Statistical Errors…

Remember – my final decision is based on the p-value
 If p ≤ 0.05, our decision is reject HO
 If p > 0.05, our decision is accept HO

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error
Statistical Errors…

In my analysis, I find:
 High Exercise Group mean %BF = 22%
 Low Exercise Group mean %BF = 26%
 p = 0.08

What is my decision?
 Accept HO
 There is NOT a difference in %BF between the groups

Why is that my decision? The means ARE different?
 I can’t be confident that the 4% difference between the two groups is not due to random sampling error

Is it possible I’ve made an error in my decision?
Possible Error…?

If I did make an error, what type would it be?
 Type II Error

When you find a p-value greater than alpha
 The only possible error is Type II error

When you find a p-value less than alpha
 The only possible error is Type I error

If p ≤ 0.05, our decision is reject HO
If p > 0.05, our decision is accept HO
Our p = 0.08, so we accepted HO
The only possible error is Type II

What is True | Our Decision: Reject HO | Our Decision: Accept HO
HO | Type I Error | Correct
HA | Correct | Type II Error
Possible Error…?

Compare Type I and Type II error like this:
 The only concern when you find statistical significance (p < 0.05) is Type I Error
 Is the difference between groups REAL or due to Random Sampling Error?
 Thankfully, the p-value tells you how likely a difference at least this large would be if only random sampling error were at work (that is, if HO were true)
 In other words, a small p-value means Type I error is unlikely, and alpha caps how often you will make one

But, does the p-value tell you how likely Type II error is?
 No. The probability of Type II error is better provided by Power
Possible Error…?

Probability of Type II error is provided by Power
 Statistical Power is 1 – β, where β is the probability of Type II error
 We will not discuss the specific calculation of power in this class
 SPSS can calculate this for you

Power (1 – β) is related to Alpha, but:
 Alpha is the probability of having Type I error
 Lower number is better (e.g., 0.05 vs 0.01 vs 0.001)
 Power is the probability of NOT having Type II error
 The probability of being right (correctly rejecting a false null hypothesis)
 Higher number is better (typical goal is 0.80)
Let’s continue this in the context of my ‘thesis’ example
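To make the link between power and β concrete, here is a rough sketch in Python. It is not how SPSS does it: it uses a normal approximation, and the function name `approx_power`, the standardized difference of 0.5, and the 20 subjects per group are all made-up assumptions for illustration.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    d is the standardized mean difference (mean difference / SD).
    Normal approximation, so it runs slightly optimistic for small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # cutoff for p <= alpha
    shift = abs(d) * (n_per_group / 2) ** 0.5  # expected test statistic when HA is true
    # Chance the test statistic lands beyond either cutoff
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

power = approx_power(d=0.5, n_per_group=20)  # hypothetical values, not from the lecture
beta = 1 - power                             # probability of Type II error
print(f"power = {power:.2f}, beta = {beta:.2f}")
```

Power and β always sum to 1, which is why a higher power means a lower chance of Type II error.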
Statistical Errors…

In my analysis, I found:
 High Exercise Group mean %BF = 22%
 Low Exercise Group mean %BF = 26%
 p = 0.08

Decided to accept the null

What do I do when I don’t find statistical significance?
What happens when the result does not reflect expectations?
First, consider the situation
Should it be statistically significant?

The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result
 Just because you wanted your test to be significant doesn’t mean it should be
 This wouldn’t be Type II error – it would just be the correct decision!

In my example, researchers have shown in several studies that exercise does influence %BF
 This result ‘should’ be statistically significant, right?
 If the answer is yes, then you need to consider power
In my ‘thesis’

This result ‘should’ be statistically significant, right?
Probably an issue with Statistical Power
 This scenario plays out at least once a year between me and a grad student working on a thesis or research project
 How can I increase the chance that I will find statistically significant results?
 Why was this analysis not statistically significant?
 What can I do to decrease the chance of Type II error?

Several different factors influence power
 Power is your ability to detect a true difference
How can I increase Power?

1) Increase Alpha level
 Changing alpha from 0.05 to 0.10 will increase your power (better chance of finding significant results)
 Downsides to increasing your alpha level?
 This will increase the chance of Type I error!
 This is rarely acceptable in practice
 Only really an option when working in a new area:
 Researchers are unsure of how to measure a new variable
 Researchers are unaware of confounders to control for
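The downside of raising alpha can be seen in a small simulation. This is only a sketch under stated assumptions: both groups are drawn from the same population (so HO is true), the SD is treated as known and equal to 1, and the function name `false_positive_rate`, the group size, and the number of simulated studies are made up for illustration.

```python
import random
from statistics import NormalDist

def false_positive_rate(alpha, n_per_group=20, trials=4000, seed=1):
    """Share of simulated studies that reject HO even though HO is true."""
    rng = random.Random(seed)
    z = NormalDist()
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference when SD = 1
    rejections = 0
    for _ in range(trials):
        # Both groups come from the SAME population, so any difference is sampling error
        high = [rng.gauss(0, 1) for _ in range(n_per_group)]
        low = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z_stat = (sum(high) / n_per_group - sum(low) / n_per_group) / se
        p = 2 * (1 - z.cdf(abs(z_stat)))
        if p <= alpha:
            rejections += 1  # a Type I error
    return rejections / trials

for alpha in (0.05, 0.10):
    print(f"alpha = {alpha:.2f}: Type I error rate = {false_positive_rate(alpha):.3f}")
```

The observed Type I error rate should sit close to whatever alpha you chose, so moving from 0.05 to 0.10 roughly doubles how often a true null gets rejected.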
How can I increase Power?

2) Increase N
 Sample size is directly used when calculating p-values
 Including more subjects will increase your chance of finding statistically significant results
 Downsides to increasing sample size?
 More subjects means more time/money
 More subjects is ALWAYS a better option if possible
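To get a feel for how many subjects ‘more’ means, here is a rough sample size sketch. It assumes a two-sided comparison of two equal-sized groups and a normal approximation; the function name `n_per_group` and the three standardized differences are hypothetical, and a dedicated power program will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects needed per group to detect a standardized
    mean difference d (mean difference / SD)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cutoff for statistical significance
    z_power = z.inv_cdf(power)          # extra margin needed to reach the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.2, 0.5, 0.8):  # hypothetical effect sizes
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

Smaller true differences need far more subjects to reach the usual 0.80 power goal, which is where the time/money downside comes from.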
How can I increase Power?

3) Use fewer groups/variables (simpler designs)
 Related to sample size but different
 ‘Use fewer groups’ NOT ‘Use fewer subjects’
 More groups negatively affects your degrees of freedom
 Remember, df is calculated with # groups and # subjects
 Lots of variables, groups and interactions make it more difficult to find statistically significant differences
 Controlling the Family-wise error rate purposely makes it harder to find significant results!
 Downsides to fewer groups/variables?
 Sometimes you NEED to make several comparisons and test for interactions - unavoidable
How can I increase Power?

4) Measure variables more accurately
 If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc…) this increases measurement error
 More measurement error decreases confidence in the result
 For example, perhaps I underestimated %BF in my ‘low exercise’ group? This could lead to Type II Error.
 More of an internal validity problem than statistical problem
 Downsides to measuring more accurately?
 None – if you can afford the best tools
How can I increase Power?

5) Decrease subject variability
 Subjects will have various characteristics that may also be correlated with your variables
 SES, sex, race/ethnicity, age, etc…
 These variables can confound your results, making it harder to find statistically significant results
 When planning your sample (to enhance power), select subjects that are very similar to each other
 This is a reason why repeated measures tests and paired samples are more likely to have statistically significant results
 Downside to decreasing subject variability?
 Will decrease your external validity – generalizability
 If you only test women, your results do not apply to men
How can I increase Power?

6) Increase magnitude of the mean difference
 If your groups are not different enough, make them more different!
 For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs completely sedentary people?
 Compare a ‘very’ high exercise to a ‘very’ low exercise group
 Sampling at the extremes, getting rid of the middle group
 Downsides to using the extremes?
 Similar to decreasing subject variability, this will decrease your external validity
Questions on Power/Increasing Power?
The Catch-22 of Power and P-values

I’ve mentioned this previously – but once you are able to draw a large sample, this will ruin the utility of p/statistical significance
 The larger your sample, the more likely you’ll find statistically significant results
 Sometimes minuscule differences between groups or tiny correlations are ‘significant’
 This becomes relevant once sample size grows to 100 to 150 subjects per group
 Once you approach 1000 subjects, it’s hard not to find p < 0.05
 Example from most highly cited paper in Psych, 2004…



This paper was the first to find a link between playing
video games/TV and aggression in children:
Every correlation in this table except 1 has p < 0.05
Do you remember what a correlation of 0.10 looks like?
r = 0.10
Do you see a relationship
between these two variables?
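A quick sketch of why a correlation of 0.10 can come out ‘significant’: the same r is tested at several sample sizes. The p-value here uses the Fisher z approximation rather than SPSS’s exact test, and the function name `correlation_p_value` and the sample sizes are assumptions chosen for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson r based on n subjects
    (Fisher z transformation, reasonable for larger samples)."""
    z_stat = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

r = 0.10
for n in (50, 100, 400, 1000):  # hypothetical sample sizes
    p = correlation_p_value(r, n)
    print(f"n = {n:4d}: r = {r:.2f}, r2 = {r ** 2:.2f}, p = {p:.3f}")
```

The correlation and its r2 never change across the rows; only the p-value shrinks as the sample grows.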
What now?

This realization has led scientists to begin to avoid p-values (or at least avoid just reporting p-values)
 Moving towards reporting with 95% confidence intervals
 Especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.)

Some people interpret ‘statistically significant’ as being ‘important’
 We’ve mentioned several times this is NOT true
 Statistically significant just means the result is unlikely to be random sampling error alone (likely not Type I error)
 Can have ‘important’ results that aren’t statistically significant
Effect Size

To get an idea of how ‘important’ a difference or association is, we can use Effect Size
 There are over 40 different types of effect size
 Depends on statistical test used
 SPSS will NOT always calculate effect size
 Effect size is like a ‘descriptive’ statistic that tells you about the magnitude of the association or group difference
 Not impacted by statistical significance
 Effect size can stay the same even if p-value changes
 Present the two together when possible
 The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it
Effect Size

Understanding effect size from correlations and regressions is easy (and you already know it):
 r2, coefficient of determination
 % Variance accounted for

Pearson correlations between %BF and 3 variables:
 r = 0.54, r = -0.92, r = 0.70

Which of the three correlations has the most important association with %BF?
 r2 = 0.29, r2 = 0.85, r2 = 0.49
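The r2 values on this slide come from squaring each correlation, as this short sketch shows. The variable names are made up; the three r values are the ones from the slide.

```python
correlations = [0.54, -0.92, 0.70]  # Pearson r values from the slide

# Squaring drops the sign, so a strong negative r still explains a lot of variance
r_squared = [round(r ** 2, 2) for r in correlations]

for r, r2 in zip(correlations, r_squared):
    print(f"r = {r:+.2f} -> r2 = {r2:.2f} ({r2:.0%} of variance accounted for)")

strongest = max(correlations, key=abs)
print(f"Most important association: r = {strongest}")
```

The negative correlation ends up with the largest r2, because the direction of the association does not matter for effect size.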
Interpreting Effect Size

Usually, guidelines are given for interpreting the effect size
 Help you to know how important the effect is
 Only a guide, you can use your own brain to compare

In general, r2 is interpreted as:
 0.01 or smaller, a Trivial Effect
 0.01 to 0.09, a Small Effect
 0.09 to 0.25, a Moderate Effect
 > 0.25, a Large Effect
Effect Size in Regression

Two regression equations each contain 4 predictors of %BF. Each ‘model’ is statistically significant. Here are their r2 values:
 0.29 and 0.15

Which has the largest effect size? Does either of the regression models have a large effect size?
 0.29 model is the most important, and has a ‘large effect size’.
 0.15 model is of ‘moderate’ importance.
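The r2 guideline can be written as a small lookup and applied to the two regression models. This is only a sketch: the function name `interpret_r_squared` is made up, and putting values that fall exactly on a cutoff into the lower category is an assumption, since the guideline’s ranges share their endpoints.

```python
def interpret_r_squared(r2):
    """Label an r2 value using the lecture's rough guideline."""
    if r2 <= 0.01:
        return "Trivial"
    if r2 <= 0.09:
        return "Small"
    if r2 <= 0.25:
        return "Moderate"
    return "Large"

for r2 in (0.29, 0.15):  # the two regression models
    print(f"r2 = {r2:.2f}: {interpret_r_squared(r2)} Effect")
```

Remember the labels are only a guide; two models that land in the same category can still be compared directly by their r2 values.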
Effect Size for Group Differences

Effect size in t-tests and ANOVAs is a bit more complicated
In general, effect size is a ratio of the mean difference between two groups and the standard deviation
 Does this remind you of anything we’ve previously seen?
 Z-score = (Score – Mean)/SD

Effect size, when calculated this way, is basically determining how many standard deviations the two groups are different by
 E.g., effect size of 1 means the two groups are different by 1 standard deviation (this would be a big difference)!
Example

When working with t-tests, calculating effect size by the mean difference/SD is called Cohen’s d
 < 0.1 Trivial effect
 0.1-0.3 Small effect
 0.3-0.5 Medium effect
 > 0.5 Large effect

The next slide is the result of a repeated measures t-test from a past lecture; we’ll calculate Cohen’s d
Paired-Samples t-test Output

Mean difference = 2.9, Std. Deviation = 5.2
Cohen’s d = 2.9 / 5.2 ≈ 0.56, a large effect size
 Essentially, the weight loss program reduced body weight by just over half a standard deviation
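The Cohen’s d calculation for this output takes one line, sketched below along with the lecture’s d guideline. The function names are made up, and treating a value exactly on a cutoff as the lower category is an assumption.

```python
def cohens_d(mean_difference, sd):
    """Effect size as the mean difference expressed in standard deviations."""
    return mean_difference / sd

def interpret_d(d):
    """Label the size of d using the lecture's guideline."""
    size = abs(d)
    if size < 0.1:
        return "Trivial"
    if size <= 0.3:
        return "Small"
    if size <= 0.5:
        return "Medium"
    return "Large"

d = cohens_d(mean_difference=2.9, sd=5.2)  # values from the paired-samples output
print(f"Cohen's d = {d:.2f} ({interpret_d(d)} effect)")
```

Because d is just a ratio, it can be worked out by hand from any t-test output that reports the mean difference and its standard deviation.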
Other example

I sample a group of 100 ISU students and find their average IQ is 103.
 Recall, the population mean for IQ is 100, SD = 15.
 I run a one-sample t-test and find it to be statistically significant (p < 0.05)
 However, effect size is…
 (103 – 100)/15 = 0.2, or Small Effect
 Interpretation: While this difference is likely not due to random sampling error – it’s not very important either
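The IQ example can be checked with a few lines. One loud assumption: the sample’s own SD is not given, so this sketch uses the population SD of 15 and a z-test in place of the one-sample t-test, which gives nearly the same answer with 100 subjects. The variable names are made up.

```python
from statistics import NormalDist

sample_mean, n = 103, 100       # from the example
pop_mean, pop_sd = 100, 15      # known IQ population values

# Effect size: how many standard deviations the sample sits from the population mean
d = (sample_mean - pop_mean) / pop_sd

# Significance: the same difference judged against the standard error, which shrinks with n
z_stat = (sample_mean - pop_mean) / (pop_sd / n ** 0.5)
p = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"effect size d = {d:.1f}, z = {z_stat:.1f}, p = {p:.3f}")
```

The effect size ignores sample size while the test statistic does not, which is how a small effect can still be statistically significant.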
Other types of effect sizes

SPSS will not calculate Cohen’s d for t-tests
However, it will calculate effect size for ANOVAs (if you request it)
 Not Cohen’s d, but Partial Eta Squared (η2)
 Similar to r2, interpreted the same way (same scale)

Here is last week’s cancer example
 Do Tumor Size and Lymph Node Involvement affect Survival Time?
 I’ll re-run and request effect size…

Notice, η2 can be used for the entire ‘model’, or each main effect and interaction individually
 How would you describe the effect of Tumor Size, or our interaction?
 Trivial to Small Effect – How did we get a significant p-value?
 Other factors not in our model are also very important

Notice that the r2 is equal to the η2 of the full model
 The advantage of η2 is that you can evaluate individual effects
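For reference, partial eta squared is the effect’s sum of squares divided by the effect’s plus the error sum of squares. The sketch below uses made-up sums of squares for a balanced two-factor design; they are NOT the numbers from the cancer example output, and the function name `partial_eta_squared` is hypothetical.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Share of variance tied to an effect once the other effects are set aside."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares from a balanced two-factor ANOVA (not the lecture's output)
ss_effects = {"Tumor Size": 40.0, "Lymph Nodes": 300.0, "Interaction": 20.0}
ss_error = 1640.0

for effect, ss in ss_effects.items():
    print(f"{effect}: partial eta squared = {partial_eta_squared(ss, ss_error):.3f}")

# Full model: pool every effect. In a balanced design this equals r2.
ss_model = sum(ss_effects.values())
ss_total = ss_model + ss_error
full_model_eta = partial_eta_squared(ss_model, ss_error)
r2 = ss_model / ss_total
print(f"Full model: eta squared = {full_model_eta:.2f}, r2 = {r2:.2f}")
```

Each effect gets its own η2, while the full model’s value lines up with r2, which matches what the SPSS output shows.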
Effect Size Summary

Many other types of effect sizes are out there – I just wanted to show you the effect sizes most commonly used with the tests we know:
 Correlation and Regression: r2
 T-tests: Cohen’s d
 ANOVA: Partial eta squared (η2) and/or r2

You are responsible for knowing:
 The general theory behind effect size/why to use them
 What tests they are associated with
 How to interpret them
QUESTIONS ON POWER?
EFFECT SIZE?
Upcoming…

In-class activity
Homework:
 Cronk – Read Appendix A (pg. 115-19) on Effect Size
 Holcomb Exercises 21 and 22
 No out-of-class SPSS work this week

Things are slowing down - next week we’ll discuss non-parametric tests
 Chi-Square and Odds Ratio