Statistical Inference I:
Hypothesis testing; sample
size
Statistics Primer

Statistical Inference
- Hypothesis testing
- P-values
- Type I error
- Type II error
- Statistical power
- Sample size calculations
What is a statistic?

A statistic is any value that can be calculated from the sample data.
Sample statistics are calculated to give us an idea about the larger
population.

Examples of statistics:
- mean
  The average cost of a gallon of gas in the US is $2.65.
- difference in means
  The difference in the average gas price in Los Angeles ($2.96)
  compared with Des Moines, Iowa ($2.32) is 64 cents.
- proportion
  67% of high school students in the U.S. exercise regularly.
- difference in proportions
  The difference in the proportion of Republicans who approve of
  George W. Bush (66%) and Democrats who do (11%) is 55%.
What is a statistic?

Sample statistics are estimates of population parameters:
- Truth (not observable): the mean IQ of some population of 100,000
  people is 100.
- Sample (observation): 5 subjects with IQs 110, 105, 96, 124, 115.
- Sample statistic: mean IQ of the 5 subjects =
  (110 + 105 + 96 + 124 + 115)/5 = 110.
- From the sample statistic, we make guesses about the whole
  population.
Sampling Distributions
Most experiments are one-shot deals. So, how do we know if
an observed effect from a single experiment is real or is just an
artifact of sampling variability (chance variation)?
This requires a priori knowledge about how sampling variability
works…
Question: Why have I made you learn about probability
distributions and about how to calculate and
manipulate expected value and variance?
Answer: Because they form the basis of describing the
distribution of a sample statistic.
What is sampling variation?

Statistics vary from sample to sample due to random chance.
Example:
A population of 100,000 people has an
average IQ of 100 (If you actually could
measure them all!)
If you sample 5 random people from this
population, what will you get?
Sampling Variation

Truth (not observable): mean IQ = 100. Repeated samples of 5 people
give different sample means:
- 120, 160, 180, 95, 95  → mean = 130
- 90, 85, 95, 92, 88     → mean = 90
- 100, 105, 86, 104, 95  → mean = 98
- 110, 105, 96, 124, 115 → mean = 110
Sampling Variation and Sample Size

Do you expect more or less sampling variability in samples of 10
people? Of 50 people? Of 1000 people? Of 100,000 people?
Standard error

Standard error is the standard deviation of a sample statistic. It's
a measure of sampling variability.
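A quick empirical check of this definition, again assuming a normal IQ population with SD 15: the standard deviation of many sample means (the standard error of the mean) should match σ/√n:

```python
# Sketch: the SD of many sample means approximates sigma/sqrt(n).
# Population SD of 15 is an assumed value for illustration.
import math
import random
import statistics

random.seed(0)
sigma, n, reps = 15, 5, 5000

means = [statistics.mean(random.gauss(100, sigma) for _ in range(n))
         for _ in range(reps)]
empirical_se = statistics.stdev(means)       # SD of the sample statistic
theoretical_se = sigma / math.sqrt(n)        # 15/sqrt(5) ≈ 6.7
print(round(empirical_se, 2), round(theoretical_se, 2))
```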
What is statistical inference?

The field of statistics provides guidance on how to make conclusions
in the face of this chance variation.
Example 1: Difference in proportions

Research Question: Are antidepressants a risk factor for suicide
attempts in children and adolescents?

Example modified from: “Antidepressant Drug Therapy and Suicide in
Severely Depressed Children and Adults”; Olfson et al. Arch Gen
Psychiatry. 2006;63:865-872.
Example 1

Design: Case-control study

Methods: Researchers used Medicaid records to compare prescription
histories between 263 children and teenagers (6-18 years) who had
attempted suicide and 1241 controls who had never attempted suicide
(all subjects suffered from depression).

Statistical question: Is a history of use of antidepressants more
common among cases than controls?
Example 1

Statistical question: Is a history of use of antidepressants more
common among cases than controls?

What will we actually compare?
- Proportion of cases who used antidepressants in the past vs.
  proportion of controls who did.
Results

                              No (%) of cases   No (%) of controls
                              (n=263)           (n=1241)
Any antidepressant drug ever  120 (46%)         448 (36%)

46% vs. 36%: Difference = 10%
What does a 10% difference mean?

Before we perform any formal statistical analysis on these data, we
already have a lot of information. Look at the basic numbers first;
THEN consider statistical significance as a secondary guide.
Is the association statistically significant?

This 10% difference could reflect a true association or it could be a
fluke in this particular sample. The question: is 10% bigger or
smaller than the expected sampling variability?
What is hypothesis testing?

Statisticians try to answer this question with a formal hypothesis
test.
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association
between antidepressant use and suicide
attempts in the target population (= the
difference is 0%)
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null
hypothesis is true—math theory:

The standard error of the difference in two proportions is:

  SE = sqrt( p(1-p)/n1 + p(1-p)/n2 )

where p is the pooled proportion. Here, 568 of the 1504 subjects had
ever used an antidepressant, so p = 568/1504 and:

  SE = sqrt( (568/1504)(1 - 568/1504)/263
           + (568/1504)(1 - 568/1504)/1241 ) ≈ .033

We expect to see differences between the groups as big as about 6%
(2 standard errors) just by chance…
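The standard-error calculation above can be reproduced in a few lines (numbers taken from the study's results):

```python
# Sketch: pooled standard error of the difference in two proportions.
import math

cases_exposed, n_cases = 120, 263
controls_exposed, n_controls = 448, 1241

# Pooled proportion under the null: 568 exposed out of 1504 subjects.
p = (cases_exposed + controls_exposed) / (n_cases + n_controls)
se = math.sqrt(p * (1 - p) / n_cases + p * (1 - p) / n_controls)
print(round(se, 3))  # → 0.033
```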
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null
hypothesis is true—computer simulation:

In a computer simulation, you simulate taking repeated samples of
the same size from the same population and observe the sampling
variability. I used computer simulation to take 1000 samples of 263
cases and 1241 controls.
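A sketch of such a simulation (the details are assumptions): under the null, cases and controls share the pooled exposure probability 568/1504, and we watch how the difference in proportions varies across 1000 repeated samples.

```python
# Sketch: simulating the null distribution of the difference in
# proportions for 263 cases and 1241 controls.
import random
import statistics

random.seed(1)
p_null = 568 / 1504
n_cases, n_controls, reps = 263, 1241, 1000

def sim_diff():
    exposed_cases = sum(random.random() < p_null for _ in range(n_cases))
    exposed_controls = sum(random.random() < p_null for _ in range(n_controls))
    return exposed_cases / n_cases - exposed_controls / n_controls

diffs = [sim_diff() for _ in range(reps)]
# The SD of the simulated differences approximates the 3.3% standard error.
print(round(statistics.stdev(diffs), 3))
```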
Computer Simulation Results

What is the standard error? Standard error is a measure of the
variability of sample statistics. Here, the standard error is about
3.3%.
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between
cases and controls.
Hypothesis Testing
Step 4: Calculate a p-value
P-value=the probability of your data or
something more extreme under the null
hypothesis.
Hypothesis Testing

Step 4: Calculate a p-value—mathematical theory:

  Z = .10/.033 = 3.0;  p = .003
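The same p-value can be obtained from the standard normal CDF, built here from math.erf so no external libraries are needed (using the slide's rounded Z of 3.0):

```python
# Sketch: two-sided p-value from a Z statistic via the normal CDF.
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 3.0                              # ≈ 0.10 / 0.033, as on the slide
p_value = 2 * (1 - normal_cdf(z))    # two-sided tail area
print(round(p_value, 3))             # → 0.003
```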
What is a P-value?

When we ran this study 1000 times, we got 1 result as big or bigger
than 10%, and 2 results as small as or smaller than –10%.
P-value

P-value = the probability of your data or something more extreme
under the null hypothesis. From our simulation, we estimate the
p-value to be: 3/1000 or .003.
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association
between antidepressant use and suicide in the
target population.
What does a 10% difference mean?

- Is it “statistically significant”? YES
- Is it clinically significant?
- Is this a causal association?
What does a 10% difference mean?

- Is it “statistically significant”? YES
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

Statistical significance does not necessarily imply clinical
significance. Statistical significance does not necessarily imply a
cause-and-effect relationship.
What would a lack of statistical significance mean?

If this study had sampled only 50 cases and 50 controls, the
sampling variability would have been much higher—as shown in this
computer simulation…
With 263 cases and 1241 controls, the standard error is about 3.3%;
with 50 cases and 50 controls, the standard error is about 10%.
With only 50 cases and 50 controls, the standard error is about 10%.
If we ran this study 1000 times, we would expect to get values of
10% or higher 170 times (or 17% of the time).
Two-tailed p-value

Two-tailed p-value = 17% × 2 = 34%
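The normal curve gives a similar two-tailed p-value: with a standard error of about 10%, the observed 10% difference sits only 1 standard error from zero.

```python
# Sketch: two-tailed p-value when SE ≈ 10% and the observed
# difference is 10%, so Z = 1.0.
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 0.10 / 0.10                       # observed difference / SE = 1.0
p_two_tailed = 2 * (1 - normal_cdf(z))
print(round(p_two_tailed, 2))         # → 0.32, near the 34% from the simulation
```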
What does a 10% difference mean (50 cases/50 controls)?

- Is it “statistically significant”? NO
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

No evidence of an effect ≠ evidence of no effect.
Example 2: Difference in means

Example: Rosenthal, R. and Jacobson, L. (1966) Teachers’
expectancies: Determinants of pupils’ I.Q. gains. Psychological
Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)

- Grade 3 students at Oak School were given an IQ test at the
  beginning of the academic year (n=90).
- Classroom teachers were given a list of names of students in their
  classes who had supposedly scored in the top 20 percent; these
  students were identified as “academic bloomers” (n=18).
- BUT: the children on the teachers’ lists had actually been
  randomly assigned to the list.
- At the end of the year, the same I.Q. test was re-administered.
Example 2

Statistical question: Do students in the treatment group have more
improvement in IQ than students in the control group?

What will we actually compare?
- One-year change in IQ score in the treatment group vs. one-year
  change in IQ score in the control group.
Results:

                     “Academic bloomers”   Controls
                     (n=18)                (n=72)
Change in IQ score:  12.2 (2.0)            8.2 (2.0)

12.2 points vs. 8.2 points: Difference = 4 points

The standard deviation of change scores was 2.0 in both groups. This
affects statistical significance…
What does a 4-point difference mean?

Before we perform any formal statistical analysis on these data, we
already have a lot of information. Look at the basic numbers first;
THEN consider statistical significance as a secondary guide.
Is the association statistically significant?

This 4-point difference could reflect a true effect or it could be a
fluke. The question: is a 4-point difference bigger or smaller than
the expected sampling variability?
Hypothesis testing

Step 1: Assume the null hypothesis.

Null hypothesis: There is no difference between “academic bloomers”
and normal students (= the difference is 0 points).
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null
hypothesis is true—math theory:

The standard error of the difference in two means is:

  SE = sqrt( s²/n1 + s²/n2 ) = sqrt( 4/18 + 4/72 ) ≈ 0.52

We expect to see differences between the groups as big as about 1.0
(2 standard errors) just by chance…
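The arithmetic above, reproduced in code (direct computation gives about 0.527, which the slides round to 0.52):

```python
# Sketch: standard error of the difference in two means,
# with s = 2.0 in both groups, n1 = 18, n2 = 72.
import math

s, n1, n2 = 2.0, 18, 72
se = math.sqrt(s**2 / n1 + s**2 / n2)
print(round(se, 2))  # → 0.53 (the slides round to 0.52)
```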
Hypothesis Testing

Step 2: Predict the sampling variability assuming the null
hypothesis is true—computer simulation:

In a computer simulation, you simulate taking repeated samples of
the same size from the same population and observe the sampling
variability. I used computer simulation to take 1000 samples of 18
treated and 72 controls.
Computer Simulation Results

What is the standard error? Standard error is a measure of the
variability of sample statistics. Here, the standard error is about
0.52.
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 4 points between treated and controls.
Hypothesis Testing
Step 4: Calculate a p-value
P-value=the probability of your data or
something more extreme under the null
hypothesis.
Hypothesis Testing

Step 4: Calculate a p-value—mathematical theory:

A t-curve with 88 df’s has slightly wider cut-offs for 95% area
(t=1.99) than a normal curve (Z=1.96).

  t88 = 4/.52 ≈ 8;  p-value < .0001
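A sketch of this test statistic (Python's standard library has no t distribution, so we simply compare t with the quoted 95% cutoff of 1.99):

```python
# Sketch: t statistic for the 4-point difference with SE ≈ 0.53.
import math

diff = 4.0
se = math.sqrt(2.0**2 / 18 + 2.0**2 / 72)  # ≈ 0.53
t = diff / se                              # ≈ 7.6 (the slides show ≈ 8)
print(round(t, 1), t > 1.99)               # far beyond the 95% cutoff
```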
What is the P-value?

If we ran this study 1000 times, we wouldn’t expect to get even 1
result as big as a difference of 4 (under the null hypothesis).
P-value

P-value = the probability of your data or something more extreme
under the null hypothesis. Here, p-value < .0001.
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association
between being labeled as gifted and subsequent
academic achievement.
What does a 4-point difference mean?

- Is it “statistically significant”? YES
- Is it clinically significant?
- Is this a causal association?
What does a 4-point difference mean?

- Is it “statistically significant”? YES
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

Statistical significance does not necessarily imply clinical
significance. Statistical significance does not necessarily imply a
cause-and-effect relationship.
What if our standard deviation had been higher?

The standard deviation for change scores in both treatment and
control was 2.0. What if change scores had been much more
variable—say a standard deviation of 10.0?

With a std. dev. in change scores of 2.0, the standard error is
0.52; with a std. dev. of 10.0, the standard error is 2.58.
With a std. dev. of 10.0, the standard error is 2.58. If we ran this
study 1000 times, we would expect to get +4.0 or –4.0 12% of the
time. P-value = .12
What would a 4.0 difference mean (std. dev = 10)?

- Is it “statistically significant”? NO
- Is it clinically significant? MAYBE
- Is this a causal association? MAYBE

No evidence of an effect ≠ evidence of no effect.
Hypothesis Testing Summary
The Steps:
1. Define your hypotheses (null, alternative)
2. Specify your null distribution
3. Do an experiment
4. Calculate the p-value of what you observed
5. Reject or fail to reject the null hypothesis
Follows the logic: If A then B; not B; therefore, not A.
Hypothesis testing summary

Null hypothesis: the hypothesis of no effect (usually the opposite
of what you hope to prove). The straw man you are trying to shoot
down.
- Example: antidepressants have no effect on suicide risk.

P-value: the probability of your observed data (or something more
extreme) if the null hypothesis is true.
- Example: the probability that the study would have found 10%
  higher suicide attempts in the antidepressant group (compared with
  control) if antidepressants had no effect (i.e., just by chance).

If this probability is low enough (i.e., if our data are very
unlikely given the null hypothesis), this is evidence that the null
hypothesis is wrong. If the p-value is low enough (typically <.05),
we reject the null hypothesis and conclude that antidepressants do
have an effect.
Summary: The Underlying
Logic of hypothesis tests…
Follows this logic:
Assume A.
If A, then B.
Not B.
Therefore, Not A.
But throw in a bit of uncertainty…If A, then probably B…
Error and power

Type I error rate (or significance level): the probability of
finding an effect that isn’t real (false positive).
- If we require p-value <.05 for statistical significance, this
  means that 1/20 times we will find a positive result just by
  chance.

Type II error rate: the probability of missing an effect (false
negative).

Statistical power: the probability of finding an effect if it is
there (the probability of not making a type II error).
- When we design studies, we typically aim for a power of 80%
  (allowing a false negative rate, or type II error rate, of 20%).
Type I and Type II Error in a box

                      True state of null hypothesis
Your Statistical
Decision              H0 True            H0 False
Reject H0             Type I error (α)   Correct
Do not reject H0      Correct            Type II error (β)
Reminds me of… Pascal’s Wager

                      The TRUTH
Your Decision         God Exists            God Doesn’t Exist
Reject God            BIG MISTAKE           Correct
Accept God            Correct—Big Pay Off   MINOR MISTAKE
Review Question 1

If we have a p-value of 0.03 and so decide that our effect is
statistically significant, what is the probability that we’re wrong
(i.e., that the hypothesis test gave us a false positive)?

a. .03
b. .06
c. Cannot tell
d. 1.96
e. 95%
Review Question 1 (Answer)

Answer: c. Cannot tell.
Review Question 2

Standard error is:

a. For a given variable, its standard deviation divided by the
   square root of n.
b. A measure of the variability of a sample statistic.
c. The inverse of sample size.
d. A measure of the variability of a characteristic.
e. All of the above.
Review Question 2 (Answer)

Answer: b. A measure of the variability of a sample statistic.
Review Question 3

A randomized trial of two treatments for depression failed to show a
statistically significant difference in improvement from depressive
symptoms (p-value = .50). It follows that:

a. The treatments are equally effective.
b. Neither treatment is effective.
c. The study lacked sufficient power to detect a difference.
d. The null hypothesis should be rejected.
e. There is not enough evidence to reject the null hypothesis.
Review Question 3 (Answer)

Answer: e. There is not enough evidence to reject the null
hypothesis.
Review Question 4

Following the introduction of a new treatment regime in a rehab
facility, alcoholism “cure” rates increased. The proportion of
successful outcomes in the two years following the change was
significantly higher than in the preceding two years (p-value:
<.005). It follows that:

a. The improvement in treatment outcome is clinically important.
b. The new regime cannot be worse than the old treatment.
c. Assuming that there are no biases in the study method, the new
   treatment should be recommended in preference to the old.
d. All of the above.
e. None of the above.
Review Question 4 (Answer)

Answer: e. None of the above.
Statistical Power

Statistical power is the probability of finding an effect if it’s
real.

Can we quantify how much power we have for given sample sizes?
Study 1: 263 cases, 1241 controls

Null distribution: difference = 0. Rejection region: any value ≥ 6.5
(0 + 3.3×1.96). For a 5% significance level, the one-tail area =
2.5% (Zα/2 = 1.96). Clinically relevant alternative: difference =
10%. Power = the chance of being in the rejection region if the
alternative is true = the area beyond this cutoff.
Study 1: 263 cases, 1241 controls

Rejection region: any value ≥ 6.5 (0 + 3.3×1.96). Power = the chance
of being in the rejection region if the alternative is true. Power
here = >80%.
Study 1: 50 cases, 50 controls

Critical value = 0 + 10×1.96 ≈ 20 (Zα/2 = 1.96, 2.5% one-tail area).
Power is closer to 20% now.
Study 2: 18 treated, 72 controls, std. dev. = 2

Critical value = 0 + 0.52×1.96 ≈ 1. Clinically relevant alternative:
difference = 4 points. Power is nearly 100%!
Study 2: 18 treated, 72 controls, std. dev. = 10

Critical value = 0 + 2.59×1.96 ≈ 5. Power is about 40%.
Study 2: 18 treated, 72 controls, effect size = 1.0

Critical value = 0 + 0.52×1.96 ≈ 1. Clinically relevant alternative:
difference = 1 point. Power is about 50%.
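The power figures above can be reproduced from the normal curve. A sketch for this last scenario (standard error 0.52, critical value ≈ 1.02, true difference 1.0):

```python
# Sketch: power = chance the observed difference lands beyond the
# critical value when the alternative (difference = 1.0) is true.
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

se, alternative = 0.52, 1.0
critical_value = 0 + se * 1.96                  # ≈ 1.02
power = 1 - normal_cdf((critical_value - alternative) / se)
print(round(power, 2))                          # close to the slide's "about 50%"
```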
Factors Affecting Power

1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired
1. Bigger difference from the null mean
[Figure: null vs. clinically relevant alternative distributions of
average weight from samples of 100]

2. Bigger standard deviation
[Figure: average weight from samples of 100]

3. Bigger sample size
[Figure: average weight from samples of 100]

4. Higher significance level
[Figure: rejection region; average weight from samples of 100]
Sample size calculations

Based on these elements, you can write a formal mathematical
equation that relates power, sample size, effect size, standard
deviation, and significance level…
Simple formula for difference in proportions

  n = 2 p̄(1 - p̄)(Zβ + Zα/2)² / (p1 - p2)²

where:
- n = sample size in each group (assumes equal-sized groups)
- p̄(1 - p̄) = a measure of variability (similar to standard
  deviation), using the average of the two proportions
- Zβ represents the desired power (typically .84 for 80% power)
- Zα/2 represents the desired level of statistical significance
  (typically 1.96)
- (p1 - p2) = effect size (the difference in proportions)
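The formula can be sketched in code, plugging in the suicide study's 46% vs. 36% as an illustrative effect size (an assumption; the slides do not run this particular calculation):

```python
# Sketch: per-group sample size for a difference in proportions.
z_beta, z_alpha_2 = 0.84, 1.96    # 80% power, 5% two-sided significance
p1, p2 = 0.46, 0.36               # illustrative proportions
p_bar = (p1 + p2) / 2             # average of the two proportions

n = 2 * p_bar * (1 - p_bar) * (z_beta + z_alpha_2) ** 2 / (p1 - p2) ** 2
print(round(n))  # per-group sample size needed
```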
Simple formula for difference in means

  n = 2 σ²(Zβ + Zα/2)² / difference²

where:
- n = sample size in each group (assumes equal-sized groups)
- σ = standard deviation of the outcome variable
- Zβ represents the desired power (typically .84 for 80% power)
- Zα/2 represents the desired level of statistical significance
  (typically 1.96)
- difference = effect size (the difference in means)
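A sketch of this formula using study 2's numbers (SD 2.0, 4-point difference) as illustrative inputs:

```python
# Sketch: per-group sample size for a difference in means.
z_beta, z_alpha_2 = 0.84, 1.96    # 80% power, 5% two-sided significance
sigma, difference = 2.0, 4.0      # illustrative SD and effect size

n = 2 * sigma**2 * (z_beta + z_alpha_2) ** 2 / difference**2
print(round(n, 1))  # tiny per-group n: a 4-point effect with SD 2 is huge
```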
Sample size calculators on the web…

- http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
- http://calculators.stat.ucla.edu
- http://hedwig.mgh.harvard.edu/sample_size/size.html
These sample size calculations are idealized

- They do not account for losses to follow-up (prospective studies)
- They do not account for non-compliance (for intervention trials or
  RCTs)
- They assume that individuals are independent observations (not
  true in clustered designs)
- Consult a statistician!
Review Question 5

Which of the following elements does not increase statistical power?

a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size
Review Question 5 (Answer)

Answer: c. A significance level of .01 rather than .05.
Review Question 6

Most sample size calculators ask you to input a value for σ. What
are they asking for?

a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of deviation
e. The variance
Review Question 6 (Answer)

Answer: b. The standard deviation.
Review Question 7

For your RCT, you want 80% power to detect a reduction of 10 points
or more in the treatment group relative to placebo. What is 10 in
your sample size formula?

a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level
Review Question 7 (Answer)

Answer: c. Effect size.
Homework

- Problem Set 3
- Continue reading textbook
- Journal article