Primer on Statistics
for Interventional
Cardiologists
Giuseppe Sangiorgi, MD
Pierfrancesco Agostoni, MD
Giuseppe Biondi-Zoccai, MD
What you will learn
• Introduction
• Basics
• Descriptive statistics
• Probability distributions
• Inferential statistics
• Finding differences in mean between two groups
• Finding differences in mean between more than 2 groups
• Linear regression and correlation for bivariate analysis
• Analysis of categorical data (contingency tables)
• Analysis of time-to-event data (survival analysis)
• Advanced statistics at a glance
• Conclusions and take home messages
What you will learn
• Inferential statistics:
– pivotal concepts
– point estimation and confidence intervals
– hypothesis testing:
• rationale and significance
• type I and type II error
• p values and confidence intervals
• multiple testing issues
• one-tailed and two-tailed
• power and sample size computation
Methods of inquiry
Statistical inquiry may be…
Descriptive
(to summarize or describe an observation)
or
Inferential
(to use the observations to make estimates or predictions)
Population and sample: at the heart of
descriptive and inferential statistics
Again: statistical inquiry may be…
Descriptive
(to describe a sample/population)
or
Inferential
(to measure the likelihood that estimates generated from the
sample may truly represent the underlying population)
Accuracy and precision
[figures: target diagrams showing repeated measurements around the true value, with their spread]
Accuracy measures the distance from the true value
Precision measures the spread in the measurements
Accuracy and precision example
Schultz et al, Am Heart J 2004
Accuracy and precision
Thus:
• Precision expresses
the extent of
RANDOM ERROR
• Accuracy expresses
the extent of
SYSTEMATIC ERROR
(ie bias)
Bias
Bias is a systematic DEVIATION from the TRUTH
Thus:
• in itself it can never be recognized
• there is a need for an external gold standard, one or more reference standards, and/or permanent surveillance
An incomplete list of biases
· Selection bias (subject selection/sampling bias)
· Information (measurement) bias
· Confounding
· Observation bias
· Investigator’s bias (enthusiasm bias)
· Patient’s background bias
· Distribution of pathological changes bias
· Small sample size bias
· Reporting bias
· Referral bias
· Variation bias
· Recall bias
· Statistical bias
· Intervention bias
· Interpretation bias
· Publication bias
Simplest classification:
1. Selection bias
2. Information bias
Sackett, J Chronic Dis 1979
Validity
Internal validity entails both PRECISION and ACCURACY (ie does a study provide a truthful answer to the research question?)
External validity expresses the extent to which the results can be applied to other contexts and settings. It corresponds to the distinction between SAMPLE and POPULATION
Validity
Meredith, EuroIntervention 2005: 100 patients, lesions ≤15 mm
Fajadet, Circulation 2006: 1197 patients, lesions 15-27 mm
Rothwell, Lancet 2005
An easy comparison
[figure: late loss distributions (frequency vs late loss, -0.10 to 1.0) for Cypher™ vs Bx Velocity™, clearly separated]
A tough comparison
[figure: late loss distributions (frequency vs late loss, -0.10 to 0.40) for Cypher™ vs Cypher Select™, largely overlapping]
Point estimation & confidence intervals
• Using summary statistics (mean and standard deviation for normal variables, or proportion for a categorical variable) and factoring in sample size, we can build confidence intervals or test the hypothesis that we are sampling from a given population
• This can be done by creating a powerful tool, which weighs our dispersion measures by means of the sample size: the standard error
Measures of dispersion are just descriptive
• Range
– top to bottom
– not very useful
• Interquartile range (75% CI)
– used with median
– ¼ way to ¾ way
• Standard deviation (SD)
– used with mean
– very useful
From standard deviation…
Standard deviation (SD):
– approximates the population σ as N increases
– SD = √[ Σ(x − x̄)² / (N − 1) ]
Advantages:
– with the mean, enables a powerful synthesis:
mean ± 1*SD → 68% of data
mean ± 2*SD → 95% of data (1.96)
mean ± 3*SD → 99% of data (2.58)
Disadvantages:
– is based on normal assumptions
Mean ± 1 standard deviation
[figure: normal curve, 68% of data between −1 SD and +1 SD]
Mean ± 2 standard deviations
[figure: normal curve, 95% of data between −2 SD and +2 SD]
Mean ± 3 standard deviations
[figure: normal curve, 99% of data between −3 SD and +3 SD]
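The 68/95/99 rule can be checked directly against the normal distribution; a minimal illustrative sketch (Python, not part of the original slides; the helper name is ours):

```python
import math

def coverage(k):
    """Fraction of a normal distribution lying within mean ± k standard deviations.
    For a standard normal Z, P(|Z| <= k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

# mean ± 1, 2 and 3 SD cover about 68%, 95% and 99.7% of the data
cov1, cov2, cov3 = coverage(1), coverage(2), coverage(3)
```

Note that ±3 SD actually covers about 99.7% of the data; the exact multiplier for 99% coverage is 2.58.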
…to confidence intervals
Standard error (SE or SEM) can be used to test a hypothesis or create a confidence interval (CI) around a mean for a continuous variable (eg lesion length)
SE = SD / √n
95% CI = mean ± 2 SE
95% means that we can be sure at a proportion of 0.95 (almost 1!) of including the true population value in the confidence interval
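The steps above can be sketched numerically; a minimal Python example (the lesion lengths are made-up illustrative numbers, not data from the slides):

```python
import math

def mean_ci(values, z=1.96):
    """Mean, standard error and 95% CI for a continuous variable."""
    n = len(values)
    mean = sum(values) / n
    # sample SD with the N-1 denominator, as in the SD formula above
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    se = sd / math.sqrt(n)  # SE = SD / sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)

# hypothetical lesion lengths in mm (illustrative only)
lengths = [12.0, 14.5, 13.2, 15.1, 11.8, 14.0, 13.6, 12.9]
m, se, (lo, hi) = mean_ci(lengths)
```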
What about proportions?
• We can easily build the standard error of a proportion, according to the following formula:
SE = √[ P * (1−P) / n ]
where variance = P * (1−P) and n is the sample size
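The same construction for a proportion, as a minimal Python sketch (the 30-events-in-200-patients example is ours, purely illustrative):

```python
import math

def proportion_ci(events, n, z=1.96):
    """Proportion, its standard error and 95% CI."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # SE = sqrt(P*(1-P)/n)
    return p, se, (p - z * se, p + z * se)

# hypothetical example: 30 events among 200 patients
p, se, (lo, hi) = proportion_ci(30, 200)
```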
Point estimation & confidence intervals
• We can then create a simple test to check whether the summary estimate we have found is compatible, allowing for random variation, with the corresponding reference population mean
• The z test (when the population SD is known) and the t test (when the population SD is only estimated) are thus used, and both can be viewed as a signal to noise ratio
Signal to noise ratio
Signal to noise ratio = Signal / Noise
Z test
z score = Signal / Noise = (absolute difference in summary estimates) / (standard error)
Results of the z score correspond to a distinct tail probability of the Gaussian curve (eg 1.96 corresponds to a 0.025 one-tailed probability or 0.050 two-tailed probability)
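The z-score-to-tail-probability correspondence can be verified numerically; a minimal Python sketch using the error function (an addition of ours, not from the slides):

```python
import math

def z_to_two_tailed_p(z):
    """Two-tailed tail probability of the standard normal for a given z score."""
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# 1.96 gives roughly 0.050 two-tailed (0.025 per tail); 2.58 gives roughly 0.01
p_196 = z_to_two_tailed_p(1.96)
```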
t test
t score = Signal / Noise = (absolute difference in summary estimates) / (standard error)
Results of the t score correspond to a distinct tail probability of the t distribution (eg, with large samples, 1.96 corresponds to a 0.025 one-tailed probability or 0.050 two-tailed probability)
t test
• The t test differs from the z test in that the variance is only estimated from the sample, as s² = Σ(x − x̄)² / (n − 1)
• However, given the central limit theorem, when n>30 (ie with >29 degrees of freedom) the t distribution approximately corresponds to the normal distribution, thus we can use the z test and z score instead
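The signal-to-noise idea can be sketched as a one-sample z test (Python; the late loss numbers are illustrative assumptions of ours, not slide data):

```python
import math

def z_score(sample_mean, pop_mean, pop_sd, n):
    """z test as a signal-to-noise ratio: difference (signal) over standard error (noise)."""
    se = pop_sd / math.sqrt(n)            # noise: SE = SD / sqrt(n)
    return (sample_mean - pop_mean) / se  # signal / noise

# hypothetical: observed mean late loss 0.30 mm vs a reference mean of 0.20 mm,
# population SD 0.25 mm, n = 50 lesions
z = z_score(0.30, 0.20, 0.25, 50)  # exceeds 1.96, hence p < 0.05 two-tailed
```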
An easy comparison
[figure: late loss distributions (frequency vs late loss, -0.10 to 1.0) for Stent A vs Stent B, clearly separated]
A tough comparison
[figure: late loss distributions (frequency vs late loss, -0.10 to 0.40) for Stent C vs Stent D, largely overlapping]
Any comparison can be viewed as…
A fight between a null hypothesis (H0), stating that there is no difference beyond random variation between two or more populations of interest (from which we are sampling), and an alternative hypothesis (H1), which implies that there is a non-random difference between two or more populations of interest
Any statistical test tries to tell us whether H0 is false (thus implying that H1 may be true)
Why falsify H0 instead of proving H1 true?
You can never prove that something is correct in science; you can only disprove something, ie show it is wrong
Thus, only falsifiable hypotheses are scientific
Sampling distribution of a difference
We may create a sampling distribution of a difference for any comparison of interest, eg late loss, peak CK-MB or survival rates (A − B: ranging from a big difference in one direction, through 0 = no difference, to a big difference in the other)
[figure: true difference distribution vs the difference found in our study, centred on 0 (no difference, ie null hypothesis or H0)]
Potential difference distributions
[figures: several potential difference distributions, centred on or shifted away from 0 (no difference, ie null hypothesis or H0)]
Potential pitfalls (alpha error)
[figure: sampling distributions of a difference around 0]
Gray zone where… we may inappropriately reject a true null hypothesis (H0), ie providing a false positive result
Potential pitfalls (beta error)
Another gray zone where… we may fail to reject a false null hypothesis (H0), ie providing a false negative result
True positive test
Sampling here means correctly rejecting a false null hypothesis (H0), ie providing a true positive result
True negative test
Sampling here means correctly retaining a true null hypothesis (H0), ie providing a true negative result
Statistical or clinical significance
• Clinical and statistical significance are two highly different concepts
• A clinically significant difference, if proved true, would be considered clinically relevant and thus worthwhile (pending costs and tolerability)
• A statistically significant difference is a probability concept, and should be viewed in light of the distance from the null hypothesis and the chosen significance threshold
Alpha and type I error
Whenever I perform a test, there is thus a risk of a FALSE POSITIVE result, ie REJECTING A TRUE null hypothesis
This error is called type I, is measured as alpha, and is quantified by the p value
The lower the p value, the lower the risk of falling into a type I error (ie the HIGHER the SPECIFICITY of the test)
Alpha and type I error
Type I error is
like a MIRAGE
Because I see something
that does NOT exist
Beta and type II error
Whenever I perform a test, there is also a risk of a FALSE NEGATIVE result, ie NOT REJECTING A FALSE null hypothesis
This error is called type II, is measured as beta, and its unit is a probability
The complement of beta is called power
The lower the beta, the lower the risk of missing a true difference (ie the HIGHER the SENSITIVITY of the test)
Beta and type II error
Type II error is
like being BLIND
Because I do NOT see
something that exists
Non-invasive diagnosis of CAD

                  Stress testing
                  Abnormal          Normal
CAD   Yes         True positive     False negative
      No          False positive    True negative
Summary of errors

                  Experimental study
                  H0 accepted       H0 rejected
Truth  H0 true    correct           Type I error
       H0 false   Type II error     correct
Type I error
Pitt et al, Lancet 1997
Type I error
Pitt et al, Lancet 2000
Type II error
Burzotta, J Am Coll Cardiol
Type II error
De Luca, Eur Heart J 2008
Another example of beta error?
Kandzari et al, JACC 2006
Another example of beta error?
The PROSPECT Trial
– Inclusion criteria: consecutive patients with PCI of up to 4 lesions
– Comparison: Endeavor vs Cypher
– Sample size: 8800
– Primary end-point: stent thrombosis at 3-year follow-up and MACE
Melikian et al, Heart 2007
Shapes of distribution & analytical errors
[figures: frequency vs value histograms, showing non-normal distribution shapes as another potential cause of analytic errors]
P values & 95% confidence intervals
95% Confidence interval: the RANGE of values where we would have CONFIDENCE that the population value lies in 95 cases, if we were to perform 100 studies
[figure: summary point estimate with its 95% and 99% confidence intervals]
P values & confidence intervals
P values and confidence intervals are strictly connected
Any hypothesis test providing a significant result (eg p=0.045) means that we can be confident at 95.5% that the population average difference lies far from zero (ie the null hypothesis)
Ps and confidence intervals
Thus this statistical analysis reports an odds ratio of 0.111, with a 95% confidence interval of 0.016 to 0.778, and a concordantly significant p value of 0.027
P values and confidence intervals
[figure: confidence intervals plotted against H0 and a threshold between trivial and important differences; a significant difference (p<0.05) has a CI excluding H0, a non-significant difference (p>0.05) has a CI crossing H0]
Multiple testing
• What happens when you perform the same hypothesis test several times? The risk of at least one false positive result inflates: with k independent tests at alpha=0.05 it reaches 1 − (1 − 0.05)^k, eg about 40% with 10 tests
• The answer is restricting the analyses only to prespecified and biologically plausible sub-analyses, and using suitable corrections:
– Bonferroni
– Dunn
– Tukey
– Keuls
– interaction tests
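The simplest of these corrections, Bonferroni, can be sketched in a few lines of Python (the p values below are made-up illustrative numbers):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: call a test significant only if its p value is
    at most alpha divided by the number of tests performed."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# with k independent tests at alpha = 0.05, the family-wise chance of at
# least one false positive is 1 - 0.95**k (about 0.40 for k = 10)
# hypothetical p values from 5 subgroup analyses; corrected threshold = 0.01
flags = bonferroni([0.004, 0.03, 0.20, 0.011, 0.55])
```

Note how 0.03 and 0.011, nominally significant at 0.05, no longer pass the corrected threshold.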
Multiple testing & subgroups
ENDEAVOR IV – 24-month TLR rates

Subgroup            Risk Ratio   Endeavor    Taxus       P value*
Diabetes            1.29         8.7% (20)   6.7% (15)   0.956
Non-diabetes        1.27         4.7% (24)   3.7% (19)
RVD ≤2.5mm          0.98         6.4% (16)   6.5% (17)   0.187
RVD >2.5 <3.0mm     1.21         5.9% (17)   4.8% (14)
RVD ≥3.0mm          3.45         5.5% (11)   1.6% (3)
Lesion ≤10mm        1.07         5.2% (12)   4.9% (11)   0.412
Lesion >10 <20mm    1.38         6.2% (26)   4.5% (18)
Lesion ≥20mm        1.21         5.6% (5)    4.6% (5)
Single stent        1.61         5.7% (39)   3.5% (23)   0.324
Multiple stents     0.84         9.1% (4)    10.8% (8)

[forest plot: risk ratios on a 0.1 to 10 scale; <1 favors Endeavor, >1 favors Taxus]
*interaction p values calculated using logistic regression
One- or two-tailed tests
[figure: normal curve, mean ± 1.96*SD, with <2.5% probability in each tail]
When can we use a one-tailed test? When you assume that the difference can only be in one direction: ALMOST NEVER for superiority or equivalence comparisons
When should we use a two-tailed test? When you cannot assume the direction of the difference: ALMOST ALWAYS, except for non-inferiority comparisons
Sample size calculation
To compute the sample size for a study we thus need:
1. Preferred alpha value
2. Preferred beta value
3. Control event rate or average value (with measure of dispersion if applicable)
4. Expected relative reduction in experimental group
Svilaas et al, NEJM 2008
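These four ingredients feed the standard normal-approximation formula for comparing two proportions; a minimal Python sketch (the function name, the 10% control rate and the 40% relative reduction are our illustrative assumptions, not values from the slides):

```python
import math

def sample_size_two_proportions(p_control, rel_reduction, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for comparing two proportions.
    Defaults: two-tailed alpha = 0.05 (z = 1.96) and beta = 0.20, ie 80% power (z = 0.84)."""
    p_exp = p_control * (1 - rel_reduction)        # expected experimental event rate
    variance = p_control * (1 - p_control) + p_exp * (1 - p_exp)
    n = ((z_alpha + z_beta) ** 2) * variance / (p_control - p_exp) ** 2
    return math.ceil(n)

# hypothetical: 10% control event rate, expecting a 40% relative reduction
n_per_group = sample_size_two_proportions(0.10, 0.40)
```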
Another sample size example
Fajadet, Circulation 2006
Power and sample size
Whenever designing a study or analyzing a dataset, it is
important to estimate the sample size or the power of
the comparison
SAMPLE SIZE
Setting a specific alpha and a specific beta, you
calculate the necessary sample size given the average
inter-group difference and its variation
POWER
Given a specific sample size and alpha, in light of the
calculated average inter-group difference and its
variation, you obtain an estimate of the power (ie 1-beta)
Power analysis
To compute the power of a study we thus need:
1. Preferred or actual alpha value
2. Control event rate or average value (with measure of dispersion if applicable)
3. Expected or actual relative reduction in experimental group
4. Expected or actual sample size
Biondi-Zoccai et al, Ital Heart J 2003
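Inverting the sample size formula gives the power for a fixed n; a minimal Python sketch using the same normal approximation (all rates and the n of 718 per group are our illustrative assumptions):

```python
import math

def power_two_proportions(p_control, rel_reduction, n_per_group, z_alpha=1.96):
    """Approximate power (1 - beta) for a two-group comparison of proportions."""
    p_exp = p_control * (1 - rel_reduction)
    # standard error of the difference under the given per-group sample size
    se = math.sqrt((p_control * (1 - p_control) + p_exp * (1 - p_exp)) / n_per_group)
    z_beta = abs(p_control - p_exp) / se - z_alpha
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z_beta / math.sqrt(2)))

# hypothetical: 10% control rate, 40% relative reduction, 718 patients per group
power = power_two_proportions(0.10, 0.40, 718)  # roughly 80%
```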
Thank you for your attention
For any correspondence:
[email protected]
For further slides on these topics feel
free to visit the metcardio.org website:
http://www.metcardio.org/slides.html