Download Basic statistical methods

Document related concepts

History of statistics wikipedia , lookup

Confidence interval wikipedia , lookup

Foundations of statistics wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
Basic Statistical Concepts
Chapter 2 Reading instructions
•
•
•
•
•
•
•
•
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
Introduction: Not very important
Uncertainty and probability: Read
Bias and variability: Read
Confounding and interaction: Read
Descriptive and inferential statistics: Repetition
Hypothesis testing and p-values: Read
Clinical significance and clinical equivalence: Read
Reproducibility and generalizability: Read
Bias and variability
Eˆ   
Bias: Systemtic deviation from the true value
Design, Conduct, Analysis, Evaluation
Lots of examples on page 49-51
Bias and variability
Larger study does not decrease bias
ˆ n    
;
n 
Distribution of sample means:
Drog X - Placebo
-10
-4
n=40
mm Hg
-10
Population mean
bias
= population mean
Drog X - Placebo
-7


Drog X - Placebo
-7
-4
n=200
mm Hg
-10
-7
-4
N=2000
mm Hg
Bias and variability
There is a multitude of sources for bias
Publication bias
Positive results tend to be published while negative of
inconclusive results tend to not to be published
Selection bias
The outcome is correlated with the exposure. As an example,
treatments tends to be prescribed to those thought to
benefit from them. Can be controlled by randomization
Exposure bias
Differences in exposure e.g. compliance to treatment could
be associated with the outcome, e.g. patents with side
effects stops taking their treatment
Detection bias
The outcome is observed with different intensity depending
no the exposure. Can be controlled by blinding investigators
and patients
Analysis bias
Essentially the I error, but also bias caused by model miss
specifications and choice of estimation technique
Interpretation bias
Strong preconceived views can influence how analysis results
are interpreted.
Bias and variability
Amount of difference between observations
True biological:
Variation between subject due
to biological factors (covariates)
including the treatment.
Temporal:
Variation over time (and space)
Often within subjects.
Measurement error:
Related to instruments or observers
Design, Conduct, Analysis, Evaluation
Raw Blood pressure data
Placebo
DBP
(mmHg)
Drug X
Baseline
Subset of plotted data
8 weeks
Bias and variability
Y  X  
Variation in = Explained + Unexplained
observations
variation
variation
Bias and variability
Is there any difference between drug A and drug B?
Drug A
Drug B
Outcome
Bias and variability
Model: Yij   i  xij   ij
Y=μB+βx
μB
Y=μA+βx
μA
x=age
Confounding
Confounders
Predictors
of treatment
Predictors
of outcome
A
Treatment
allocation
Outcome
B
Example
Smoking Cigaretts is not so bad but watch out for
Cigars or Pipes (at least in Canada)
Variable
Non smokers
Cigarette
smokers
Cigar or pipe
smokers
Mortality rate*
20.2
20.5
35.5
*) per 1000 person-years %
Cochran, Biometrics 1968
Example
Smoking Cigaretts is not so bad but watch out for
Cigars or Pipes (at least in Canada)
Variable
Non smokers
Cigarette
smokers
Cigar or pipe
smokers
Mortality rate*
20.2
20.5
35.5
Average age
54.9
50.5
65.9
*) per 1000 person-years %
Cochran, Biometrics 1968
Example
Smoking Cigaretts is not so bad but watch out for
Cigars or Pipes (at least in Canada)
Variable
Non smokers
Cigarette
smokers
Cigar or pipe
smokers
Mortality rate*
20.2
20.5
35.5
Average age
54.9
50.5
65.9
Adjusted
mortality rate*
20.2
26.4
24.0
*) per 1000 person-years %
Cochran, Biometrics 1968
Confounding
The effect of two or more factors can not be separated
Example: Compare survival for
surgery and drug
Survival
Life long treatment with drug
R
Surgery at time 0
Looks ok but:
•Surgery only if healty enough
•Patients in the surgery arm may take drug
•Complience in the drug arm May be poor
Time
Confounding
Can be sometimes be handled in the design
Example: Different effects in males and females
Imbalance between genders affects result
Stratify by gender
A
M
A
R
Gender
R
B
Balance on average
B
A
F
R
Always balance
B
Interaction
The outcome on one variable depends
on the value of another variable.
Example
Interaction between two drugs
A
B
A=AZD1234
Wash
out
R
B
B=AZD1234 +
Clarithromycin
A
Interaction
Example: Drug interaction
Plasma concentration (µmol/L)
linear scale
Mean
5
4
AZD0865 alone
3
Combination of clarithromycin
and AZD0865
2
1
0
0
4
8
12
16
20
24
Time after dose
AUC AZD0865:
19.75 (µmol*h/L)
AUC AZD0865 + Clarithromycin:
Ratio:
36.62 (µmol*h/L)
0.55 [0.51, 0.61]
Interaction
Example:
Treatment by center interaction
Treatment difference in diastolic blood pressure
15
10
mmHg
5
0
-5
0
5
10
15
20
25
30
-10
-15
-20
-25
Ordered center number
Average treatment effect: -4.39 [-6.3, -2.4] mmHg
Treatment by center: p=0.01
What can be said about the treatment effect?
Descriptive and inferential
statistics
The presentation of the results from a clinical trial
can be split in three categories:
•Descriptive statistics
•Inferential statistics
•Explorative statistics
Descriptive and inferential
statistics
Descriptive statistics aims to describe various
aspects of the data obtained in the study.
•Listings.
•Summary statistics (Mean, Standard Deviation…).
•Graphics.
Descriptive and inferential
statistics
Inferential statistics forms a basis for a conclusion
regarding a prespecified objective addressing the
underlying population.
Confirmatory analysis:
Hypothesis
Results
Conclusion
Descriptive and inferential
statistics
Explorative statistics aims to find interesting results that
Can be used to formulate new objectives/hypothesis for
further investigation in future studies.
Explorative analysis:
Results
Hypothesis
Conclusion?
Hypothesis testing, p-values and
confidence intervals
Objectives
Variable
Design
Estimate
p-value
Confidence interval
Statistical Model
Null hypothesis
Results
Interpretation
Hypothesis testing, p-values
n


X

X
,

X

R
Statistical model: Observations
1
n
from a class of distribution functions
  P :   
Hypothesis test: Set up a null hypothesis: H0:    0
and an alternative H1:   1
Reject H0 if
X  Sc  R n
P X  Sc |   0   
Rejection region
Significance level
p-value: The smallest significance level for which the
null hypothesis can be rejected.
What is statistical hypothesis testing?
A statistical hypothesis test is a method of making
decisions using data from a scientific study.
Decisions:
Conclude that substance is likely to be
efficacious in a target population.
Scientific study:
Collect data measuring efficacy of a
substance
on a random sample of patients
representative
of the target population
Formalizing hypothesis testing
We try to prove that collected data with little likelihood can
correspond to a statement or a hypothesis which we do not
want to be true!
Hypothesis:
•The effect of our substance has no effect on blood
pressure compared to placebo
•An active comparator is superior to our substance
What do we want to be true then?
•The effect of our substance has a different effect on
blood pressure compared to placebo
•Our substance is non-inferior to active comparator
Formalizing hypothesis testing cont.
Formalizing it even further…
As null hypothesis, H0, we shall choose the hypothesis we
want to reject
H0: μX = μplacebo
As alternative hypothesis, H1, we shall choose the
hypothesis we want to prove.
H1: μX ≠ μplacebo
(two sided alternative hypothesis)
μ is a parameter that characterize the efficacy in the target
population, for example the Mean.
Example, two sided H1:
H0: Treatment with substance
X has no effect on blood
pressure compared to placebo
0
H1: Treatment with
substance X influence
blood pressure
compared to placebo
Difference in BP
Example: one sided H1
H0: Treatment with substance
X has no effect on blood
pressure compared to placebo
0
H1: Treatment with
substance X decrease
the blood pressure more
than placebo
Difference in BP
Structure of a hypothesis test
Given our H0 and H1
Assume that H0 is true
H0: Treatment with substance X has no effect on
blood pressure
compared to placebo (μX = μplacebo)
1) We collect data from a scientific study.
2) Calculate the likelihood/probability/chance that data would
actually turn out as it did (or even more extreme) if H0 was
true;
P-value:
Probability to get at least as extreme as what we
observed given H0
How to interpret the p-value?
P-value:
Probability to get at least as extreme as what we observed
given H0
P-value small:
Correct: our data is unlikely given the H0. We don’t believe in
unlikely things so something is wrong. Reject H0.
Common language: It is unlikely that H0 is true; we should
reject H0
P-value large:
We can´t rule out that H0 is true; we can´t reject H0
A statistical test can either reject (prove false) or fail to reject
(fail to prove false) a null hypothesis, but never prove it true
(i.e., failing to reject a null hypothesis does not prove it true).
How to interpret the p-value? Cont.
From the assumption that if the treatment does not have effect we
calculate: How likely is it to get the data we collected in the study?
Expected outcome of the study if
H0 is true (μX = μplacebo )
μX = μplacebo
Likely outcome if the treatments
are equal:
The p-value is big
Expected outcome of the study
for some H1 (μX < μplacebo )
μX
μplacebo
Unlikely outcome if the
treatments are equal:
The p-value is small
Risks of making incorrect decisions
We can make two types of incorrect decisions:
1)Incorrect rejection of a true H0
“Type I error”; False Positive
2)
Fail to reject a false null hypothesis H0
“Type II error”; False Negative
These two errors can´t be completely avoided.
We can try to control and minimize the risks of making
error is several ways:
1) Optimal study design
2) Deciding the burden of evidence in favor of H1
required to reject H0
Misconceptions and warnings
Easy to misinterpret and mix things up
(p-value, size of test, significance level, etc).
P-value is not the probability that the null hypothesis
is true
P-value says nothing about the size of the effect.
A statistical significant difference does not need to be
clinically relevant!
A high p-value is not evidence that H0 is true
Example:
Distribution of average
effect if Ho is true
H0: Treatment with
substance X has no effect
on blood pressure
compared to placebo
p=0.0156
0
Xobs=10.4mmHg
•If the null hypothesis is true it is unlikely to observe 10.4
•We don’t believe in unlikely things so something is wrong
•The data is ”never” wrong
•The null hypothesis must be wrong!
Difference in BP
Example:
Distribution of average
effect if Ho is true
H0: Treatment with
substance X has no effect
on blood pressure
compared to placebo
p=0.35
0
Xobs=3.2mmHg
Difference in BP
•If the null hypothesis is true it is not that unlikely to observe 3.2
•Nothing unlikely has happened
•There is no reason to reject the null hypothesis
•This does not prove that the null hypothesis is true!
Confidence intervals
Let X, *  
1 if H 0 :    * rejected
0 if H 0 :    * not rejected
(critical function)
Confidence set: CX   :  X,   0
The set of parameter values correponding to hypotheses
that can not be rejected.
A confidence set is a random subset CX  
covering the true parameter value with probability at
least 1   .
Example
• We study the effect of treatment X on
change in DBP from baseline to 8 weeks
• The true treatment effect is:
d = μX – μplacebo
• The estimated treatment effect is:
x X  x placebo  7.5mmHg
– How reliable is this estimation?
Hypothesis testing
• Test the hypothesis that there is no
difference between treatment X and placebo
• Use data to calculate the value of a test
function: High value -> reject hypothesis,
since it is unlikely given not true difference
• For a long time it was enough to present the
point estimate and the result of the test
– Previous example: The estimated treatment effect
is 7.5 mmHg, which is significantly different from
zero at the 5% level.
Famous (?) quotes
“We do not perform an experiment to find out if
two varieties of wheat or two drugs are equal.
We know in advance, without spending a dollar
on an experiment, that they are not equal”
(Deming 1975)
“Overemphasis on tests of significance at the
expense especially of interval estimation has
long been condemned” (Cox 1977)
“The continued very extensive use of
significance tests is alarming” (Cox 1986)
41
Estimations + Interval
Better than just hypothesis test
• The estimation gives the best guess of the true
effect
• The confidence interval gives the accuracy of the
estimation, i.e. limits within which the true effect
is located with large certainty
• One can decide if the result is statistically
significant, but also if it has biological/clinical
relevance
• Now required by Regulatory Authorities, journals,
etc
Neyman, Jerzy (1937). "Outline of a Theory of Statistical
Estimation Based on the Classical Theory of Probability".
Philosophical Transactions of the Royal Society of London.
Series A.
Example
• The 95% confidence interval is
[-9.5, -4.9] mmHg (based on a
model assuming normally distributed
data)
• With great likelihood (95%), the
interval [-9.5, -4.9] contains the true
treatment effect μX – μplacebo
• Provided that our model is correct…
-9.5
-4.5
0
Model:
Drug X
σ
μX
Placebo
σ
μplacebo
General principle
• There is a True fixed value that we
estimate using data from a clinical trial or
other experiment
• The estimate deviates from the true
value (by an unknown amount)
• If we were to repeat the study a number
of times, the different estimates would
be centred around the true value
• A 95% confidence interval is an interval
estimated so that if the study is repeated
a number of times, the interval will cover
the true value in 95% of these
General principle
• Estimate a parameter, for example
– Mean difference,
– Relative risk reduction,
– Proportion of treatment responders, etc
• Calculate an interval that contains
the true parameter value with large
probability
General principle
Depends on
confidence level
Estimate
±
Quantile
*
Variability
Of the estimate
Model
General principle – Normal model
67% confidence
level
Mean
±
1
*

n
SD of the
mean
~67%
Normal model,
σ known
- / n
+ / n
General principle – Normal model
95 % confidence
level
Mean
±
1.96
*

n
SD of the
mean
95%
Normal model,
σ known
-1.96 / n
+1.96 / n
General principle – Normal model
(1-α)% confidence
level
Mean
± t(1-α/2, n-1)
*
sd
n
SD of the
mean
Normal model,
σ not known
Width
• The width of the confidence interval
depends on:
– Degree of Confidence (90%, 95% or
99%)
– Size of the sample
– Variability
– Statistical model
– Design of the study
Effect of confidence level
95%
95%
95%
90% 17% narrower
80% 36% narrower
99% 35% wider
(normal model, n=30)
“Excuse me, professor. Why a
95% confidence interval and not
a 94% or 96% interval?”, asked
the student.
“Shut up,” he explained.
51
Effect of Sample size and SD
Sample size
Double
+50%
+30%
Double SD
52
30% narrower
20% narrower
12% narrower
double width, etc
Model Dependence
• Can we assume normally distributed data?
• Is the variability the same in all treatment
groups?
• Are there important covariates, e.g. baseline
level or confounders?
• Can we assume a covariance structure e.g.
repeated measurements per individual?
• For time to event data – can we assume
proportional hazards?
• Multiple events data – can we assume that
there is no overdispersion?
Model Dependence
• Statistical packages provides confidence
intervals for most models
• But assumptions are not always easily checked
and you usually have to pre-specify the model
• Given a sufficient sample size you may still ‘get
away with it’, since mean values are
approximately normally distributed (or a log
transform might do the trick)
• Otherwise you may want to use a nonparametric method which does not depend on
a (possibly approximate) model
Non-parametric CIs - Example
Calculated using the available values.
Example - say we have calculated the
following the following treatment
differences (in sorted order):
-7.8
-6.9
-4.7
3.7
6.5
8.7
9.1
10.1
10.8
13.6
14.4
15.6
20.2
median
99.2%
96.5%
88.2%
Confidence level (interval for the median)
22.4
23.5
Analogy with tests
• For most statistical tests there is a
corresponding confidence interval
• If a 95% confidence interval does not
cover ‘the zero’, the corresponding test is
to give a p-value < 0.05.
• If for example your p-value is based on
the median and the confidence interval is
based on the mean you can get conflicting
results = trouble
Comparing two CIs
• Can we judge whether groups are significantly
different depending on whether or not their
confidence intervals overlap? - Not always.
• Parallel group:
– Non-overlapping confidence intervals significantly different.
– Overlapping confidence intervals - not
certain.
– Depends on the confidence level of the
individual CIs: Comparing two 84%
confidence intervals corresponds to a test
at the 5% level (approximately)
Sample size
• Based on the precision of the interval
instead of the power of a test
• Examples:
– Interval of treatment difference not
wider than x, with some probability
– Lower limit/mean > 0.6 and upper
limit/mean < 1.4 with at least 80%
probability* - suitable when
estimating geometric means
* Based on ‘FDA recommendation’ for pediatric PK studies
Example
Objective: To compare sitting diastolic blood pressure (DBP) lowering effect of
hypersartan 16 mg with that of hypersartan 8 mg
Variable: The change from baseline to end of study in sitting DBP
(sitting SBP) will be described with an ANCOVA model,
with treatment as a factor and baseline blood pressure
as a covariate
Model:
yij = μ + τi + β (xij - x··) + εij
Parameter space: R 4
1   2   3  0
treatment effect
i = 1,2,3
{16 mg, 8 mg, 4 mg}
Null hypoteses (subsets of
H01: τ1 = τ2 (DBP)
H02: τ1 = τ2 (SBP)
H03: τ2 = τ3 (DBP)
H04: τ2 = τ3 (SBP)
R 4):
Example contined
Hypothesis
Variable
LS Mean
CI (95%)
p-value
1: 16 mg vs 8 mg
Sitting DBP
-3.7 mmHg
[-4.6, -2.8]
<0.001
2: 16 mg vs 8 mg
Sitting SBP
-7.6 mmHg
[-9.2, -6.1]
<0.001
3: 8 mg vs 4 mg
Sitting DBP
-0.9 mmHg
[-1.8, 0.0]
0.055
4 : 8 mg vs 4 mg
Sitting SBP
-2.1 mmHg
[-3.6, -0.6]
0.005
This is a t-test where the test statistic follows a t-distribution
Rejection region:
X : T X  c
-c
0
c
P-value: The null hypothesis can pre rejected at 
-4.6
-2.8
0
T X
 0.001
1  2
P-value says nothing about the
size of the effect!
Example: Simulated data. The difference between treatment and
placebo is 0.3 mmHg
No. of patients per group
Estimation of effect
p-value
10
1.94 mmHg
0.376
100
-0.65 mmHg
0.378
1000
0.33 mmHg
0.129
10000
0.28 mmHg
<0.0001
100000
0.30 mmHg
<0.0001
A statistical significant difference does NOT
need to be clinically relevant!
Statistical and clinical
significance
Statistical significance:
Is there any difference between
the evaluated treatments?
Clinical significance:
Does this difference have any
meaning for the patients?
Health ecominical relevance: Is there any economical
benefit for the society in
using the new treatment?
Statistical and clinical
significance
A study comparing gastroprazole 40 mg and mygloprazole 30 mg
with respect to healing of erosived eosophagitis after 8 weeks
treatment.
Drug
Healing rate
gastroprazole 40 mg
87.6%
mygloprazole 30 mg
84.2%
Cochran Mantel Haenszel p-value = 0.0007
Statistically significant!
Clinically significant?
Health economically relevant??