Overview of Hypothesis
Testing
Laura Lee Johnson, Ph.D.
Statistician
Chapter 21 (parts of 18 and 24), 3rd Edition
Objectives
• Formulate questions using
 P-values
 Confidence intervals
 Type I and Type II errors
• Identify a few commonly used statistical
tests for comparing two groups
• Define testable hypotheses to address
questions of interest
• Interpret hypothesis tests and confidence
intervals
Read Chapter 21
• Several important elements of hypothesis
testing will not be covered in this lecture
 Covered in the textbook chapter
 Covered in a video lecture with handouts
found at
https://ippcr.nihtraining.com/lecture_detail.php?lecture_id=216&year=2013
Read Journals
• Nature Methods: Points of Significance
(typically by Krzywinski and Altman)
• BMJ: Endgames (several are design or
statistical in nature; look in particular for the
ones by Philip Sedgwick)
Use Hypothesis Testing
• Analyze the data to find results
 Programs and formulas not presented here
in detail
• Software makes it seem like anyone can
analyze the data, but be careful
Randomized Trial to Determine Effects of
a Low Glycemic Index Diet in Pregnancy
• Population: at risk women/fetuses
 Women in second pregnancy
 Previously delivered infant weighing
greater than 4000g
• Outcomes of interest
 Birth weight of infants (mean)
 Primary outcome
 Incidence of infant macrosomia (yes/no)
 Gestational weight gain
 Maternal glucose intolerance
BMJ 2012;345:e5605 Jennifer Walsh et al
Low glycaemic index diet in pregnancy to prevent macrosomia (ROLO study): randomised control trial
BMJ 2012;345:e7951 (Philip Sedgwick discussion)
Statistical Inference
• Inferences about a population are made on
the basis of results obtained from a sample
drawn from that population
• Want to talk about the larger population from
which the subjects are drawn, not the
particular subjects!
Use Hypothesis Testing
• Designing a study
• Reviewing the design of other studies
 Grant or application review (e.g. NIH
study section, IRB)
• Interpreting study results
• Interpreting others’ study results
 Reviewing a manuscript or journal
 Interpreting the news
Outline
Estimation and Hypotheses
• How to Test Hypotheses
• Confidence Intervals
• Regression
• Error
• Diagnostic Testing
• Questions & Appendix
Analysis Follows Design
Questions → Hypotheses →
Experimental Design → Samples →
Data → Analyses →Conclusions
What Do We Test
• Effect or Difference we are interested in
 Difference in Means or Proportions
 Odds Ratio (OR)
 Relative Risk (RR)
 Correlation Coefficient
• Clinically important difference
 Smallest difference considered
biologically or clinically relevant
• Medicine: usually 2 group comparison
of population means
Estimation: From the Sample
• Point estimation
 Mean
 Median
 Change in mean/median
• Interval estimation
 Variation (e.g. range, σ², σ, σ/√n)
 95% Confidence interval (CI)
Pictures, Not Numbers
• Scatter plots (can be useful)
• Histograms (can be useful)
• Box plots (can be useful)
• Bar plots
 PLEASE DO NOT USE!
 Use a table instead
• Not Estimation
 See the data and check assumptions
Z or Standard Normal
Distribution
2 of the Continuous Distributions
• Normal/Gaussian distribution: N(μ, σ²)
 μ = mean, σ² = variance
 Z or standard normal = N(0,1)
• t distribution: t_ν
 ν = degrees of freedom (df)
 Usually a function of sample size
 Mean = X̄ (sample mean)
 Variance = s² (sample variance)
Binary Distribution
• Two options
 Yes/no
• Binomial distribution: B (n, p)
 Sample size = n
 Proportion ‘yes’ = p
 Mean = np
 Variance = np(1-p)
• Can do exact or use Normal
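The "exact or Normal" choice above can be sketched in a few lines. This is a minimal illustration, not a recommended analysis: the values n = 20, p = 0.5 and the cutoff of 15 are hypothetical, and the continuity correction is one common convention rather than something stated on the slide.

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail_exact(k, n, p):
    """Exact P(X >= k), summing the binomial probabilities."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def upper_tail_normal(k, n, p):
    """Normal approximation to P(X >= k) with a continuity correction."""
    mean = n * p                      # Mean = np
    sd = math.sqrt(n * p * (1 - p))   # Variance = np(1-p)
    return 1 - NormalDist(mean, sd).cdf(k - 0.5)


# Hypothetical example: 20 trials, p = 0.5, probability of 15 or more "yes".
n, p = 20, 0.5
mean = n * p
variance = n * p * (1 - p)
exact = upper_tail_exact(15, n, p)
approx = upper_tail_normal(15, n, p)
print(f"mean={mean}, variance={variance}")
print(f"exact={exact:.4f}, normal approx={approx:.4f}")
```

The two tail probabilities are close here; the approximation tends to degrade when n is small or p is near 0 or 1, which is when the exact calculation is the safer choice.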
Many More Distributions
• Not going to cover
 Poisson
 Log normal
 Gamma
 Beta
 Weibull
 Many more
Hypothesis Testing
• Null hypothesis (H0)
• Alternative hypothesis (H1 or Ha)
Superiority Studies
• For example
 Average systolic blood pressure (SBP) on
Drug A is different than average SBP on
Drug B
Alternative Hypothesis
H1 Ha HA
• There is an effect
• What you want to prove
• If equivalence trial, special way to do this
• Contradicts the null
Null Hypothesis
• Null of that? Usually that there is no effect
 Difference in the Means = 0
 Correlation Coefficient = 0 [do not use]
 Odds Ratio (OR) = 1
 Relative Risk (RR) = 1
• Sometimes compare to a fixed value so Null
 Mean systolic blood pressure = 120 mmHg
• If an equivalence trial, look at NEJM paper or
other specific resources
Example Hypotheses
• H0: μ1 = μ2
 Mean birth weight (g) of babies born to
mothers in the low glycemic diet group =
Mean birth weight (g) of babies born to
mothers in the non-diet (control) group
• HA: μ1 ≠ μ2
 Two-sided test
• HA: μ1 < μ2
 One-sided test
1 vs. 2 Sided Tests
• Two-sided test
 No a priori reason 1 group should have
stronger effect
 Difference in any direction is notable
 Used for most tests
• One-sided test
 Specific interest in only one direction
 Not scientifically relevant/interesting if
reverse situation true
Use a 2-Sided Test
• Almost always
• If you use a one-sided test
 Explain yourself
 Penalize yourself on the alpha
 0.05 2-sided test becomes a 0.025 1-sided test
Never “Accept” Anything
• Reject the null hypothesis
• Fail to reject the null hypothesis
• Failing to reject the null hypothesis does
NOT mean the null (H0) is true
• Failing to reject the null means
 Not enough evidence in your sample to
reject the null hypothesis
 In one sample saw what you saw
 Deviation from the null might be too
small to detect reliably with the study’s
sample size
Outline
 Estimation and Hypotheses
 How to Test Hypotheses
• Confidence Intervals
• Regression
• Error
• Diagnostic Testing
• Questions & Appendix
Experiment
 Develop hypotheses
 Collect sample/Conduct experiment
• Calculate test statistic
• Compare test statistic with what is expected
when H0 is true
Information at Hand
• 1 or 2 sample test?
• Outcome variable
 Binary, Categorical, Ordered, Continuous,
Survival
• Population
• Numbers (e.g. mean, standard deviation)
Example:
Hypertension/Cholesterol
• Men 20-74 years old
• Mean cholesterol hypertensive men
• Mean cholesterol in male general
(normotensive) population
• In the 20-74 year old male
normotensive population the mean
serum cholesterol is 211 mg/ml with
a standard deviation of 46 mg/ml
One Sample:
Cholesterol Sample Data
• Have data on 25 hypertensive men
• Mean serum cholesterol level is 220mg/ml
 Point estimate of the mean, X̄
• Sample standard deviation: s = 38.6 mg/ml
 Point estimate of the variance = s²
Compare Sample to Population
• Is 25 enough?
 Sample Size and Power lecture
• What difference in cholesterol is clinically or
biologically meaningful?
• Have an available sample and want to know if
hypertensives are different than general
population
Cholesterol Hypotheses
• H0: μ1 = μ2
• H0: μ = 211 mg/ml
 μ = POPULATION mean serum
cholesterol for male hypertensives
 Mean cholesterol for hypertensive
men = mean for general male
population
• HA: μ1 ≠ μ2
• HA: μ ≠ 211 mg/ml
Cholesterol Sample Data
• Population information (general)
 μ = 211 mg/ml
 σ = 46 mg/ml (σ² = 2116)
• Sample information (hypertensives)
 X̄ = 220 mg/ml
 s = 38.6 mg/ml (s² = 1489.96)
 N = 25
Experiment
Develop hypotheses
Collect sample/Conduct experiment
Calculate test statistic
• Compare test statistic with what is
expected when H0 is true
Test Statistic
• Basic test statistic for a mean
$$\text{test statistic} = \frac{\text{point estimate of } \mu - \text{target value of } \mu}{\sigma_{\text{point estimate of } \mu}}$$
• σ = standard deviation (sometimes use σ/√n)
• For 2-sided test: Reject H0 when the test
statistic is in the upper or lower 100*α/2% of
the reference distribution
• What is α?
Vocabulary
• Types of errors
 Type I (α) (false positives)
 Type II (β) (false negatives)
• Related words
 Significance Level: α level
 Power: 1- β
Source of Image: Effect Size FAQs by Paul Ellis http://tinyurl.com/pn2dt68
Lydia Flynn has a nice video also explaining type I and II errors at
https://www.youtube.com/watch?v=Dsa9ly4OSBk
How to test?
Z test or Critical Value [Chapter 21]
 N(0,1) distribution and alpha
t test or Critical Value [Chapter 21]
 t distribution and alpha
P-value
• Confidence interval
P-value
• Smallest α the observed sample would
reject H0
• Given H0 is true, probability of
obtaining a result as extreme or more
extreme than the actual sample
• MUST be based on a model
 Normal, t, binomial, etc.
Cholesterol Example
• P-value for two sided test
• X̄ = 220 mg/ml, σ = 46 mg/ml
• n = 25
• H0: μ = 211 mg/ml
• HA: μ ≠ 211 mg/ml
$$2 \cdot P[\bar{X} \ge 220] \approx 0.33$$
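The p-value in the cholesterol example can be reproduced with a short Z-test sketch. It uses only the numbers on the slide (sample mean 220, null mean 211, population σ = 46, n = 25) and assumes the Normal model with known σ; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar = 220.0   # sample mean (mg/ml, as written on the slide)
mu_0 = 211.0    # mean under H0
sigma = 46.0    # population standard deviation, treated as known
n = 25

# Standard error of the sample mean: sigma / sqrt(n)
se = sigma / math.sqrt(n)

# Test statistic: (point estimate - target value) / standard error
z = (x_bar - mu_0) / se

# Two-sided p-value: probability of a result at least this extreme under H0
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}, two-sided p = {p_two_sided:.2f}")
```

Using the sample standard deviation s = 38.6 instead of σ would call for a t distribution with n - 1 degrees of freedom, which is a different reference model and gives a somewhat different p-value.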
Determining Statistical
Significance: P-Value Method
• Compute the exact p-value (0.33)
• Compare to the predetermined α-level
(0.05)
• If p-value < predetermined α-level
 Reject H0
 Results are statistically significant
• If p-value > predetermined α-level
 Do not reject H0
 Results are not statistically significant
Unknown Truth and the Data
Data (decision) | Truth: H0 Correct | Truth: HA Correct
Decide H0 (“fail to reject H0”) | 1- α (True Negative) | β (False Negative)
Decide HA (“reject H0”) | α (False Positive) | 1- β (True Positive)
α = significance level
1- β = power
P-value Interpretation Reminders
• Measure of the strength of evidence in the
data that the null is not true
• A random variable whose value lies between
0 and 1
• NOT the probability that the null hypothesis
is true.
Other Methods to Get p-Values
• Permutation test
• Bootstrap
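As a rough illustration of the permutation idea, the sketch below shuffles group labels and counts how often the shuffled difference in means is at least as large as the observed one. The data, the function name `permutation_p_value`, and the choice of the absolute difference in means as the statistic are all assumptions made for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs_mean_diff(pooled) >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two small groups.
treated = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2]
control = [4.2, 4.9, 4.4, 5.0, 4.1, 4.6]
print(f"permutation p-value: {permutation_p_value(treated, control):.4f}")
```

The result is still a p-value based on a model: here the model is that, under H0, the group labels are exchangeable.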
Type I Error
• α = P( reject H0 | H0 true)
• Probability reject the null hypothesis
given the null is true
• False positive
• Probability we reject that hypertensives’
μ = 211 mg/ml when in truth the mean
cholesterol for hypertensives is 211 mg/ml
Type II Error (or, 1- Power)
• β = P( do not reject H0 | H1 true )
• False Negative
• Probability we do NOT reject that male
hypertensives’ cholesterol is that of the
general population when in truth the mean
cholesterol for hypertensives is different
than the general male population
Power
• Power = 1-β = P( reject H0 | H1 true )
• Everyone wants high power, and therefore
low Type II error
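Power can be made concrete with the cholesterol setup. The sketch below computes the power of a two-sided Z test under the assumption, made here purely for illustration, that the true mean for hypertensive men is 220; σ = 46 and n = 25 come from the example, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def power_two_sided_z(mu_0, mu_true, sigma, n, alpha=0.05):
    """P(reject H0 | true mean is mu_true) for a two-sided Z test."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How many standard errors the true mean sits away from the null mean.
    shift = (mu_true - mu_0) / (sigma / math.sqrt(n))
    upper = 1 - std_normal.cdf(z_crit - shift)   # reject in the upper tail
    lower = std_normal.cdf(-z_crit - shift)      # reject in the lower tail
    return upper + lower


power_25 = power_two_sided_z(211, 220, 46, 25)
power_200 = power_two_sided_z(211, 220, 46, 200)
print(f"power with n=25:  {power_25:.2f}")
print(f"power with n=200: {power_200:.2f}")
print(f"Type II error (beta) with n=25: {1 - power_25:.2f}")
```

With only 25 men the test would usually miss a true 9-unit difference, which is one reason failing to reject H0 is not evidence that H0 is true.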
Is α or β more important ?
• Depends on the question
• Most will say protect against Type I error
 Multiple comparisons
• Need to think about individual and
population health implications and costs
Multiple Hypothesis Testing
• Adjust critical level of significance
• Try to reduce probability of a type I error
• Desire overall type I error rate no greater
than alpha across all hypothesis tests (not
only for each individual test)
 Usually done across the primary outcomes
• Assumption of independent outcomes
 Rarely the case
 Err toward non-significance
 Inflate type II error rate
• Which inflation is more damaging?
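One simple way to adjust the critical level is the Bonferroni correction, used here only as an example since the slide does not name a method. The p-values are hypothetical, and the familywise error formula assumes independent tests, which the slide notes is rarely the case.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Reject H0 for a test only if its p-value is below alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


# Hypothetical p-values for four primary outcomes.
p_values = [0.004, 0.020, 0.030, 0.450]
decisions = bonferroni_decisions(p_values)
fwer = familywise_error_rate(len(p_values))

print(f"unadjusted familywise error rate for 4 tests: {fwer:.3f}")
for p, reject in zip(p_values, decisions):
    print(f"p = {p:.3f}: {'reject H0' if reject else 'fail to reject H0'}")
```

Two of these p-values are below 0.05 but above the adjusted threshold of 0.0125, which shows the trade-off on the slide: the correction protects against Type I error at the cost of a higher Type II error rate.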
How to test?
Z test or Critical Value
 N(0,1) distribution and alpha
t test or Critical Value
 t distribution and alpha
P-value
• Confidence interval
Outline
 Estimation and Hypotheses
 How to Test Hypotheses
 Confidence Intervals
• Regression
• Error
• Diagnostic Testing
• Questions & Appendix
Hypothesis Testing
and Confidence Intervals
• Hypothesis testing focuses on where
the sample mean is located
• Confidence intervals focus on plausible
values for the population mean
• In general, the best way to estimate a
confidence interval is to bootstrap
(details: see a statistician)
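A percentile bootstrap is one of several bootstrap variants; the sketch below shows that variant for a mean, as an assumption about what "bootstrap" might look like in the simplest case. The data values and the function name `bootstrap_ci_mean` are hypothetical.

```python
import random


def bootstrap_ci_mean(sample, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, same size as the original sample, many times.
    means = sorted(
        sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical cholesterol-like measurements.
data = [182, 195, 201, 210, 214, 219, 223, 228, 236, 241, 250, 263]
sample_mean = sum(data) / len(data)
ci_low, ci_high = bootstrap_ci_mean(data)
print(f"mean = {sample_mean:.1f}, 95% bootstrap CI = ({ci_low:.1f}, {ci_high:.1f})")
```

Other variants (for example bias-corrected intervals) can behave better with small or skewed samples, which is part of why the slide suggests involving a statistician.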
CI Interpretation
• Cannot determine if a particular interval
does/does not contain true mean
• Can say in the long run
 Take many samples
 Same sample size
 From the same population
 95% of similarly constructed confidence
intervals will contain true mean
• Think about meta analyses
Interpret a 95% Confidence
Interval (CI) for the population
mean, μ
• “If we were to find many such intervals, each
from a different random sample but in
exactly the same fashion, then, in the long
run, about 95% of our intervals would include
the population mean, μ, and 5% would not.”
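The long-run interpretation can be checked by simulation. The sketch below draws many samples from a Normal population with a known mean, builds a 95% interval from each, and counts how many contain the true mean. The population values (μ = 211, σ = 46, n = 25) are borrowed from the cholesterol example purely for illustration, and σ is treated as known.

```python
import math
import random
from statistics import NormalDist


def simulated_coverage(mu=211.0, sigma=46.0, n=25, n_studies=4000,
                       level=0.95, seed=1):
    """Fraction of simulated known-sigma intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + level) / 2)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_studies):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / n_studies


coverage = simulated_coverage()
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

Any single interval either contains μ or it does not; the 95% describes the procedure over repeated samples, not one interval.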
Do NOT interpret a 95% CI…
• “There is a 95% probability that the true
mean lies between the two confidence
values we obtained from a particular
sample”
• “We can say that we are 95% confident
that the true mean does lie between
these two values.”
• Also NOTE: Overlapping CIs do NOT
imply non-significance
14 Week Change
Outcome | Fortified Spread, Mean (95% CI) | Corn-Soy Blend, Mean (95% CI)
Fat-free body mass (kg) | 2.9 (2.50, 3.30) | 2.2 (1.82, 2.58)
BMI | 2.2 (1.96, 2.44) | 1.7 (1.49, 1.91)
BMJ 2009; 338:b1867 Ndekha MJ et al Supplementary feeding with either ready-to-use fortified spread or corn-soy blend in wasted adults starting antiretroviral therapy in Malawi
Which of the following
statements, if any, are true?
A. The difference in BMI between treatment
groups was significant at the 5% level
because the 95% confidence intervals for
the two groups did not overlap
B. The difference in fat-free body mass
between treatment groups was not
significant at the 5% level because the 95%
confidence intervals for the two groups
overlapped
BMJ 2014; 349:g5196
18 August 2014 Philip Sedgwick
14 Week Change: part A
TRUE!
• Can say statistically significant difference at
5% level
 Because the 95% confidence intervals for
the two groups did not overlap
 Non-overlapping confidence intervals
14 Week Change: part B
FALSE!
• The 95% confidence intervals for each
group’s mean overlap?
 Difference in fat-free body mass between
treatment groups
 Might be statistically significantly
different
 Might not be
 We do not know
OK – We know
Because We Can Calculate the 95% Confidence
Interval of the Mean Difference
• Mean difference = 0.7 kg
• Using data from the paper
• 95% CI = (0.2 kg, 1.2 kg)
 95% CI around the mean difference does
NOT include 0
 The difference in fat-free body mass
between treatment groups was
statistically significant at the 5% level
Interpreting a Single CI for a
Single Study
• Provide reasonable ranges, based on the
sample data, that should be considered when
powering future studies
• Use for comparisons and looking at direction
of study results
• Look at the CI for the correct value
 Hypothesis about group differences
 Do not separately evaluate CI for each
group
• Do not force writers to articulate a CI in
words
General Formula (1-α)% CI for μ

$$\left( \bar{X} - \frac{Z_{1-\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{Z_{1-\alpha/2}\,\sigma}{\sqrt{n}} \right)$$
• Construct an interval around the point
estimate
• Look to see if the population/null mean
is inside
But I Have All Zeros!
Calculate 95% upper bound
• Known # of trials without an event (2.11
van Belle 2002, Louis 1981)
• Given no observed events in n trials,
95% upper bound on rate of occurrence
is 3 / (n + 1)
 No fatal outcomes in 20 operations
 95% upper bound on rate of
occurrence = 3 / (20 + 1) = 0.143, so
the rate of occurrence of fatalities
could be as high as 14.3%
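The slide's arithmetic is easy to check. The sketch below computes the 3 / (n + 1) bound from the slide and, for comparison, the exact one-sided 95% binomial bound for zero events, 1 - 0.05^(1/n); the exact formula is added here for context and is not from the slide.

```python
def upper_bound_rule(n_trials):
    """Approximate 95% upper bound on the event rate after 0 events in n trials."""
    return 3 / (n_trials + 1)


def upper_bound_exact(n_trials, level=0.95):
    """Exact one-sided upper bound: the rate p solving (1 - p)^n = 1 - level."""
    return 1 - (1 - level) ** (1 / n_trials)


n_operations = 20  # no fatal outcomes in 20 operations, as on the slide
rule = upper_bound_rule(n_operations)
exact = upper_bound_exact(n_operations)
print(f"rule-based bound: {rule:.3f}")
print(f"exact bound:      {exact:.3f}")
```

The two bounds agree to within about half a percentage point at n = 20, so the shortcut is adequate for a quick check.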
Take Home: Hypothesis Testing
• Many ways to test
 Rejection interval [book]
 Z test, t test, or Critical Value [book]
 P-value
 Confidence interval
• For a given (regression) method, all of
these approaches will agree
 If not: math wrong, rounding errors,
assumption violation?
• Make sure you interpret correctly
Take Home Hypothesis Testing
• How to turn questions into hypotheses
• Failing to reject the null hypothesis
DOES NOT mean that the null is true
• Every test has assumptions
 A statistician can check all the
assumptions
 If the data does not meet the assumptions
there are non-parametric versions of tests
(see text)
 Non-parametric tests also have
assumptions
Take Home: CI
• Meaning/interpretation of the CI
• Give us some idea of the size of the
difference between two groups and the
direction of the difference
• In practice use Bootstrap
Take Home: Vocabulary
• Null Hypothesis: H0
• Alternative Hypothesis: H1 or Ha or HA
• Significance Level: α level
• Statistically Significant
• P-value, Confidence Interval
Outline
 Estimation and Hypotheses
 How to Test Hypotheses
 Confidence Intervals
 Regression
• Error
• Diagnostic Testing
• Questions & Appendix
Regression
• Continuous outcome
 Linear
• Binary outcome
 Logistic
• Many other types
Warning!
• New meaning for the letter β on the next
several slides
How a Statistician Sees a
Research Study
• Model: y = β0 + β1x1 + β2x2 + ε
• Y is called outcome, endpoint, but not
CAUSAL
• Coefficient = Coef = β ≈ Association
• Variable (in model) = Covariate = x
 Treatment
 Age
 Gender
• Everything impacts the statistical analysis
Linear Regression
• Model for simple linear regression
 Yi = β0 + β1x1i + εi
 β0 = intercept
 β1 = slope
• Hypothesis testing
 H0: β1 = 0 vs. HA: β1 ≠ 0
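A least-squares fit of the simple model can be written out by hand. The sketch below estimates β0 and β1 and forms the t statistic for H0: β1 = 0; the age and blood pressure values are hypothetical, and turning the t statistic into a p-value would need a t distribution with n - 2 degrees of freedom, which is left out here.

```python
import math


def simple_linear_regression(x, y):
    """Least-squares intercept, slope, slope standard error, and t statistic."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # estimate of beta1
    intercept = y_bar - slope * x_bar      # estimate of beta0
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    se_slope = math.sqrt(residual_variance / sxx)
    t_stat = slope / se_slope              # test statistic for H0: beta1 = 0
    return intercept, slope, se_slope, t_stat


# Hypothetical data: age (years) and systolic blood pressure (mmHg).
age = [30, 40, 50, 60, 70]
sbp = [118, 125, 129, 138, 141]
intercept, slope, se_slope, t_stat = simple_linear_regression(age, sbp)
print(f"intercept = {intercept:.1f}, slope = {slope:.2f} mmHg per year")
print(f"SE(slope) = {se_slope:.3f}, t = {t_stat:.2f} on {len(age) - 2} df")
```

The slope describes an association in these data, not a causal effect, consistent with the warning on the earlier slide.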
In Order of Importance
1. Independence: Observations are
independent
2. Equal variance
3. Normality: Normally distributed error terms
with constant variance
(for ANOVA and linear regression)
More Than One Covariate
• Yi = β0 + β1x1i + β2x2i + β3x3i + εi
• Systolic Blood Pressure =
β0 + β1 Drug + β2 Male + β3 Age
• β1
 Association between Drug and SBP
 Average difference in SBP between
the Drug and Control groups, given
sex and age
Testing?
• Each coefficient β has a p-value
associated with it
• Each model will have an F-test
• Other methods to determine fit
 Residuals
• See a statistician and/or take a
biostatistics class. Or 3.
Repeated Measures
(3 or more time points)
• Do NOT use repeated measures AN(C)OVA
 Assumptions quite stringent
• Talk to a statistician
 Mixed model
 Generalized estimating equations
 Other
An Aside: Correlation
• Range: -1 to 1
• Test is whether correlation ≠ 0
• With N=10,000, easy to have highly
significant (p<0.001) correlation = 0.05
 Statistically significant that is
 Nowhere CLOSE to meaningfully
different from 0
• Partial Correlation Coefficient
Do Not Use Correlation.
Use Regression
• Some fields: Correlation still popular
 Partial regression coefficients
• High correlation is > 0.8 (in absolute
value). Maybe 0.7
• Never believe a p-value from a
correlation test
• Regression coefficients are more
meaningful
Outline
 Estimation and Hypotheses
 How to Test Hypotheses
 Confidence Intervals
 Regression
 Error
• Diagnostic Testing
• Questions & Appendix
Omics, Imaging, Testing
Millions of Things at Once
• False negative (Type II error)
 Miss what could be important
 Are these samples going to be looked at
again?
• False positive (Type I error)
 Waste resources following dead ends
What do you need to think
about?
• Where is the evidence? Will this be the last
study or will there be other studies,
regardless of your findings?
• Implications on science, public health, health
care, policy
 If false negative result
 If false positive result
• These answers will help guide you as to what
amount of error you are willing to tolerate in
your trial design
Outline
 Estimation and Hypotheses
 How to Test Hypotheses
 Confidence Intervals
 Regression
 Error
 Diagnostic Testing
• Questions & Appendix
Little Diagnostic Testing
Lingo
• False Positive/False Negative (α, β)
• Positive Predictive Value (PPV)
 Probability diseased given POSITIVE
test result
• Negative Predictive Value (NPV)
 Probability NOT diseased given
NEGATIVE test result
• Predictive values depend on disease
prevalence
Sensitivity, Specificity
• Sensitivity: how good is a test at correctly
IDing people who have disease
 Can be 100% if you say everyone is ill (all
have positive result)
 Useless test with bad Specificity
• Specificity: how good is the test at correctly
IDing people who are well
Example: Western vs. ELISA
• 1 million people
• ELISA Sensitivity = 99.9%
• ELISA Specificity = 99.9%
• 1% prevalence of infection
 10,000 positive by Western (gold
standard)
 9990 true positives (TP) by ELISA
 10 false negatives (FN) by ELISA
Medical University South Carolina HIV test example
Similar example http://www.cdc.gov/hiv/testing/lab/clia/rtcounseling.html CDC HIV Counseling with Rapid Tests
information
1% Prevalence
• 990,000 not infected
 989,010 True Negatives (TN)
 990 False Positives (FP)
• Without confirmatory test
 Tell 990 or ~0.1% of the population
they are infected when in reality they
are not
 PPV = 91%, NPV = 99.999%
1% Prevalence
• 10980 total test positive by ELISA
 9990 true positive
 990 false positive
• 9990/10980 = probability diseased GIVEN
positive by ELISA = PPV = 0.91 = 91%
• 989,020 total test negatives by ELISA
 989,010 true negatives
 10 false negatives
• 989010/989020 = NPV = 99.999%
0.1% Prevalence
• 1,000 infected – ELISA picks up 999
 1 FN
• 999,000 not infected
 998,001 True Negatives (TN)
 999 False Positives (FP)
• Positive predictive value = 50%
• Negative predictive value = 99.999%
Prevalence Matters
(Population You Sample to Estimate
Prevalence, too)
• Numbers look “good” with high
prevalence
 Testing at STD clinic in high risk
populations
• Low prevalence means even very high
sensitivity and specificity will result in
middling PPV
• Calculate PPV and NPV for 0.01%
prevalence found in United States
female blood donors
10% Prevalence
• 99% PPV
• 99.99% NPV
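The predictive values on these slides all follow from the same counts, so they can be recomputed for any prevalence. The sketch below uses the slide's sensitivity and specificity (99.9% each) and a population of 1 million; the function name `predictive_values` is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence, population=1_000_000):
    """Return (PPV, NPV) from expected counts in a population of the given size."""
    infected = population * prevalence
    not_infected = population - infected
    true_pos = sensitivity * infected
    false_neg = infected - true_pos
    true_neg = specificity * not_infected
    false_pos = not_infected - true_neg
    ppv = true_pos / (true_pos + false_pos)   # P(diseased | positive test)
    npv = true_neg / (true_neg + false_neg)   # P(not diseased | negative test)
    return ppv, npv


for prevalence in (0.10, 0.01, 0.001, 0.0001):
    ppv, npv = predictive_values(0.999, 0.999, prevalence)
    print(f"prevalence {prevalence:.2%}: PPV = {ppv:.1%}, NPV = {npv:.4%}")
```

Sensitivity and specificity stay fixed across the four rows; only prevalence changes, and PPV falls sharply as the condition becomes rarer.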
Prevalence Matters
• PPV and NPV tend to come from good cohort
data
• Can estimate PPV/NPV from case control
studies but the formulas are hard and you
need to be REALLY sure about the
prevalence
 Triple sure
High OR
Does Not a Good Test Make
• Diagnostic tests need separation
• ROC curves
 Not logistic regression with high OR
• Strong association between 2 variables does
not mean good prediction of separation
Analysis Follows Design
Questions → Hypotheses →
Experimental Design → Samples →
Data → Analyses →Conclusions
Outline
Estimation and Hypotheses
How to Test Hypotheses
Confidence Intervals
Regression
Error
Diagnostic Testing
Questions
Example Questions
Appendix
Questions?
Statistical Testing Information
• Independent samples t test of primary
outcome of birth weight
• Two-tailed hypothesis testing
• Critical level of significance (allowable type I
error) 0.05 (5%)
• Superiority trial
BMJ 2012;345:e5605
Low glycaemic index diet in pregnancy to prevent macrosomia (ROLO study): randomised control trial
Interventions and Proportion Birth
Weight > 4000g
• Interventions
 Low glycemic diet from early pregnancy
 189 of 372 infants (51%)
 No dietary intervention
 199 of 387 infants (51%)
• P = 0.88
BMJ 2012;345:e5605
Low glycaemic index diet in pregnancy to prevent macrosomia (ROLO study): randomised control trial
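The comparison of proportions can be approximated with a two-sample Z test. This is a sketch under the assumption of a pooled-variance Normal approximation without continuity correction; the paper may have used a different test, so the result should land near, not necessarily exactly on, the reported P = 0.88.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(events_1, n_1, events_2, n_2):
    """Two-sided Z test for a difference in proportions, pooled standard error."""
    p_1 = events_1 / n_1
    p_2 = events_2 / n_2
    pooled = (events_1 + events_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_1, p_2, z, p_value


# Counts from the slide: birth weight > 4000g in each arm.
p_diet, p_control, z, p_value = two_proportion_z_test(189, 372, 199, 387)
print(f"diet: {p_diet:.1%}, control: {p_control:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```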
Outcome | Low Glycemic Diet (n=372*), Mean (95% CI) | Control Group (n=387), Mean (95% CI)
Birth weight (g) | 4034 (3982, 4086) | 4006 (3956, 4056)
Gestational weight gain (kg) | 12.2 (11.8, 12.6) | 13.7 (13.2, 14.2)
*If interested, read the article and later discussion pieces
for more on the study. Good lessons are discussed. Not
a perfect study.
Which of the following
statements, if any, are true?
A. The difference in birth weight between
treatment groups was not significant at the
5% level because the 95% confidence
intervals for the two groups overlapped
B. The difference in gestational weight gain
between treatment groups was significant at
the 5% level because the 95% confidence
intervals for the two groups did not overlap
BMJ 2014; 349:g5196
18 August 2014 Philip Sedgwick
Interventions and Birth Weight Results
• Interventions
 Low glycemic diet from early pregnancy
 Mean birth weight 4034g
 Standard deviation (SD) 510
 No dietary intervention
 Mean birth weight 4006g, SD = 497
• Mean difference in birth weight = 28.6g
• 95% CI = (-45.6 to 102.8)
• P = 0.449
• “absence of evidence is not evidence of
absence”
BMJ 2012;345:e5605
Low glycaemic index diet in pregnancy to prevent macrosomia (ROLO study): randomised control trial
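The interval for the difference, rather than the two per-group intervals, is what answers the question. The sketch below rebuilds an approximate 95% CI and p-value from the rounded summary numbers on the slide, assuming a large-sample Normal approximation with unpooled variances; because the inputs are rounded and the paper's exact method is not shown here, the output is expected to be close to, but not identical with, the reported 28.6g (-45.6 to 102.8), P = 0.449.

```python
import math
from statistics import NormalDist


def difference_in_means(mean_1, sd_1, n_1, mean_2, sd_2, n_2, level=0.95):
    """Difference in means with a large-sample CI and two-sided p-value."""
    diff = mean_1 - mean_2
    se = math.sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    z_crit = NormalDist().inv_cdf((1 + level) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, ci, p_value


# Summary statistics as shown on the slide (birth weight in grams).
diff, ci, p_value = difference_in_means(4034, 510, 372, 4006, 497, 387)
print(f"difference = {diff} g")
print(f"approx. 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"approx. two-sided p = {p_value:.3f}")
```

The interval includes 0, so the data do not rule out "no difference", but it also includes differences of either sign, which is why the slide stresses that absence of evidence is not evidence of absence.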
Interventions and Gestational Weight
Gain Results
• Interventions
 Low glycemic diet from early pregnancy
 Mean = 12.2 kg (4.4)
 No dietary intervention
 Mean = 13.7 kg (4.9)
• Mean difference = -1.3 kg
• 95% CI = (-2.4, -0.2)
• P = 0.017
BMJ 2012;345:e5605
Low glycaemic index diet in pregnancy to prevent macrosomia (ROLO study): randomised control trial
Which statement best describes the
information provided by a 95% confidence
interval (CI) for mean gestational weight gain?
A. 95% of sample participants in the diet group
achieved a weight gain between 11.8 and 12.6 kg
B. 95% of the population would achieve a weight gain
between 11.8 kg and 12.6 kg if they received the
low glycemic diet
C. There is a probability of 0.95 that the population
mean gestational weight gain would be between
11.8 and 12.6 kg
D. There is a probability of 0.95 that the sample mean
weight gain for the diet group was between 11.8
and 12.6 kg
BMJ 2012; 344:e3147
9 May 2012 Philip Sedgwick
Want to continue with these?
• Yes? Raise your card
Which of the following
statements, if any, are true?
A. The P value provides a direct statement about
the size of the difference between groups in the
mean gestational weight gain
B. The P value provides a direct statement about
the directions of the difference between groups
in the mean gestational weight gain
C. The P value provides a dichotomous test of
significance of the statistical hypothesis
D. The 95% confidence interval provides a
dichotomous test of significance of the
statistical hypothesis
BMJ 2013; 346:f3212
17 May 2013 Philip Sedgwick
Which of the following
statements, if any, are true?
A. The alternative hypothesis states that, in the
population sampled, treatment with the low
glycemic diet is inferior or superior to
placebo with regard to the secondary
(gestational weight gain) endpoint
B. It can be inferred that the null hypothesis
was not true
BMJ 2014; 348:g3557
30 May 2014 Philip Sedgwick
Which of the following
statements, if any, are true?
A. Statistical hypothesis testing based on a
critical level of significance is a
dichotomous test
B. The P value provides a direct statement
about the direction of a difference between
treatment groups in mean birth weights
C. The P value is the probability that the
alternative hypothesis was true
BMJ 2014; 349:g4550
11 July 2014 Philip Sedgwick
Which of the following
statements, if any, are true?
A. In the population, it can be inferred that no
difference exists between low glycemic diet
and the control treatment in mean birth
weight
B. The lack of significance between treatment
groups in birth weight could have been the
result of a type II error
C. The alternative hypothesis of the statistical
hypothesis test for gestational weight gain
is true
BMJ 2014; 349:g4751
1 August 2014 Philip Sedgwick
Which of the following
statements, if any, are true?
A. A type I error would have occurred if no
difference between treatment groups in
mean gestational weight gain had existed in
the population
B. The type I error rate for each statistical
hypothesis test was 5%
C. The type I error rate for the multiple
statistical hypothesis tests was 5%
BMJ 2014; 349:g5310
29 August 2014 Philip Sedgwick
Additional Materials
Favorite Useful Reading
BMJ Endgames: Statistical Question
• Many by Philip Sedgwick
• Randomised controlled trials: inferring
significance of treatment effects based on
confidence intervals
 BMJ 2014;349:g5196 (18 August 2014)
http://dx.doi.org/10.1136/bmj.g5196
• Pitfalls of statistical hypothesis testing:
multiple testing
 BMJ 2014;349:g5310 (29 August 2014)
http://dx.doi.org/10.1136/bmj.g5310
Articles
• Comparisons against baseline within
randomised groups are often used and can
be highly misleading
 J Martin Bland and Douglas G Altman
 http://www.trialsjournal.com/content/12/1/264
Good Studies and Reporting
• A call for transparent reporting to optimize the
predictive value of preclinical research by Landis
et al
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3511845/
NIH’s NINDS has more about this on their website
• NIH Proposed Principles and Guidelines for
Reporting Preclinical Research
http://www.nih.gov/about/reporting-preclinical-research.htm
Articles on Reproducible
Research, Best Practices
• Rigor or Mortis: Best Practices for Preclinical
Research in Neuroscience by Steward and
Balice-Gordon
http://dx.doi.org/10.1016/j.neuron.2014.10.042
• Preclinical research: Make mouse studies work
by Perrin http://www.nature.com/news/preclinical-research-make-mouse-studies-work-1.14913
• Policy: NIH to balance sex in cell and animal
studies (Nature Comment by Clayton and
Collins) http://www.nature.com/news/policy-nih-to-balance-sex-in-cell-and-animal-studies-1.15195
More Resources via NIH
• NIH NIMH Enhancing the Reliability of NIMH-Supported Research through Rigorous Study
Design and Reporting
http://www.nimh.nih.gov/research-priorities/policies/enhancing-the-reliability-of-nimh-supported-research-through-rigorous-study-design-and-reporting.shtml
• NIMH director’s blog on p-hacking (by Insel)
http://www.nimh.nih.gov/about/director/2014/p-hacking.shtml
More Articles
• False-positive psychology: undisclosed
flexibility in data collection and analysis
allows presenting anything as significant by
Simmons, Nelson, and Simonsohn
http://www.ncbi.nlm.nih.gov/pubmed/22006061
• Common misconceptions about data
analysis and statistics by Motulsky
http://www.ncbi.nlm.nih.gov/pubmed/25204545
Resources: General Books
• Hulley et al (2001) Designing Clinical
Research, 2nd ed. LWW
• Rosenthal (2006) Struck by Lightning: The
curious world of probabilities
• Bland (2000) An Introduction to Medical
Statistics, 3rd. ed. Oxford University Press
• Armitage, Berry and Matthews (2002)
Statistical Methods in Medical Research, 4th
ed. Blackwell, Oxford
• Altman (1991) Practical Statistics for Medical
Research. Chapman and Hall
Books
• Statistical Rules of Thumb by Gerald van
Belle (vanbelle.org for updates & monthly
rule)
• Hosmer and Lemeshow books
• Epidemiology by Leon Gordis
More Books
• Statistical Reasoning in Medicine: The
Intuitive P-Value Primer by Lemuel Moye
• Designing Clinical Research: An
Epidemiologic Approach, edited by Stephen
Hulley
• Critical Appraisal of Epidemiological Studies
and Clinical Trials by Mark Elwood
• Sequential trials: Whitehead, J. (1997) The
Design and Analysis of Sequential Clinical
Trials, revised 2nd. ed. Wiley
• Equivalence trials: Pocock SJ. (1983) Clinical
Trials: A Practical Approach. Wiley
And More Books
• Data Monitoring Committees in Clinical
Trials: A Practical Perspective by Ellenberg,
Fleming, DeMets.
• Fundamentals of Clinical Trials by Friedman,
Furberg, DeMets
• The Statistical Evaluation of Medical Tests
for Classification and Prediction by Margaret
Sullivan Pepe
Normal/Large Sample Data? Yes
• Inference on means?
 Yes: Independent?
   No: Paired t
   Yes: Variance known?
     Yes: Z test
     No: Variances equal?
       Yes: T test w/ pooled variance
       No: T test w/ unequal variance
 No: Inference on variance?
   Yes: F test for variances
Normal/Large Sample Data? No
• Binomial?
 No: Nonparametric test
 Yes: Independent?
   No: McNemar’s test
   Yes: Expected ≥5?
     Yes: 2 sample Z test for proportions or contingency table
     No: Fisher’s Exact test