Epidemiologic Methods
September 28    Understanding Measurement: Aspects of Reproducibility & Validity
October 3       Study Design
October 10      Measures of Disease Occurrence I
October 17      Measures of Disease Occurrence II
October 24      Measures of Disease Association I
October 31      Measures of Disease Association II
October 31      Measures of Attributable Risk
November 7      Bias in Epidemiologic Studies: Selection Bias
November 14     Bias in Epidemiologic Studies: Measurement Bias
November 14     Confounding and Interaction I: General Principles
November 21     Confounding and Interaction II: Assessing Interaction
November 28     Confounding and Interaction III: Stratified Analysis
December 5      Conceptual Approach to Multivariable Analysis I
December 7      Conceptual Approach to Multivariable Analysis II
December 12     Conceptual Approach to Multivariable Analysis III
Definitions of Epidemiology
• The study of the distribution and determinants (causes) of disease
  – e.g. cardiovascular epidemiology
• The method used to conduct human subject research
  – the methodologic foundation of any research where individual humans or groups of humans are the unit of observation
Understanding Measurement:
Aspects of Reproducibility and Validity
• Review Measurement Scales
• Reproducibility
– importance
– methods of assessment
• by variable type: interval vs categorical
• intra- vs. inter-observer comparison
• Validity
– methods of assessment
• gold standards present
• no gold standard available
Clinical Research
Sample → Measure → Analyze → Infer

"A study can only be as good as the data . . ."
– Martin Bland
Measurement Scales
Scale                  Example
Interval
  continuous           weight
  discrete             WBC count
Categorical
  ordinal              tumor stage
  nominal              race
  dichotomous          death
Reproducibility vs Validity
• Reproducibility
– the degree to which a measurement provides the
same result each time it is performed on a given
subject or specimen
• Validity
– from the Latin validus - strong
– the degree to which a measurement truly
measures (represents) what it purports to
measure (represent)
Reproducibility vs Validity
• Reproducibility
– aka: reliability, repeatability, precision, variability,
dependability, consistency, stability
• Validity
– aka: accuracy
Relationship Between Reproducibility and Validity
[Figure: target diagrams contrasting good vs. poor reproducibility with good vs. poor validity]
Why Care About Reproducibility?
Impact on Validity
• Mathematically, the upper limit of a measurement’s
validity is a function of its reproducibility
• Consider a study to measure height in the community:
  – if we measure height twice on a given person and get two different values, then at least one of the two values must be wrong (invalid)
  – if the study measures everyone only once, the errors, despite being random, may not balance out
  – the final inferences are likely to be wrong (invalid)
Why Care About Reproducibility?
Impact on Statistical Precision
• Classical Measurement Theory:
  observed value (O) = true value (T) + measurement error (E)
  where E is random and $E \sim N(0, \sigma^2_E)$
• Therefore, when measuring a group of subjects, the variability of observed values is a combination of the variability in their true values and measurement error:
  $\sigma^2_O = \sigma^2_T + \sigma^2_E$
Why Care About Reproducibility?
$\sigma^2_O = \sigma^2_T + \sigma^2_E$
• More measurement error means more variability in
observed measurements
• More variability of observed measurements has
profound influences on statistical precision/power:
– Descriptive study: less precise estimates of given traits
– RCTs: power to detect a treatment difference is reduced
– Observational studies: power to detect an influence of a
particular exposure upon a given outcome is reduced.
Conceptual Definition of Reproducibility
• Reproducibility $= \frac{\sigma^2_T}{\sigma^2_O} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$
• Varies from 0 (poor) to 1 (optimal)
• As $\sigma^2_E$ approaches 0 (no error), reproducibility approaches 1
Phillips and Smith, J Clin Epi 1993
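As an illustration of this variance-ratio definition (not part of the original lecture), a short simulation under the classical measurement model can be sketched in Python; the sample size, mean, and variances below are arbitrary choices.

```python
import numpy as np

# Illustrative simulation of the classical measurement model O = T + E,
# with E ~ N(0, sigma_E^2). All numbers below are arbitrary.
rng = np.random.default_rng(0)
n_subjects = 100_000
sigma_T, sigma_E = 10.0, 5.0                            # SDs of true values and of error

true = rng.normal(50.0, sigma_T, n_subjects)            # true values T
observed = true + rng.normal(0.0, sigma_E, n_subjects)  # observed values O

# Variance decomposition: var(O) is approximately var(T) + var(E)
print(observed.var(), sigma_T**2 + sigma_E**2)

# Reproducibility = sigma_T^2 / (sigma_T^2 + sigma_E^2) = 100/125 = 0.8
print(sigma_T**2 / (sigma_T**2 + sigma_E**2))
print(true.var() / observed.var())                      # empirical estimate, close to 0.8
```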
Sources of Measurement Variability
• Observer
• within-observer (intrarater)
• between-observer (interrater)
• Instrument
• within-instrument
• between-instrument
• Subject
• within-subject
Sources of Measurement Variability
• e.g. plasma HIV viral load
– observer: measurement to measurement
differences in tube filling, time before processing
– instrument: run to run differences in reagent
concentration, PCR cycle times, enzymatic
efficiency
– subject: biologic variation in viral load
Assessing Reproducibility
Depends on measurement scale
• Interval Scale
– within-subject standard deviation
– coefficient of variation
• Categorical Scale
– Cohen’s Kappa
Reproducibility of an Interval Scale Measurement: Peak Flow
• Assessment requires >1 measurement per subject
• Peak Flow Rate in 17 adults (Bland & Altman)

Subject    Meas. 1    Meas. 2
1          494        490
2          395        397
3          516        512
4          434        401
5          476        470
6          557        611
7          413        415
8          442        431
9          650        638
10         433        429
11         417        420
12         656        633
13         267        275
14         478        492
15         178        165
16         423        372
17         427        421
Assessment by Simple Correlation
[Scatterplot: Meas. 2 vs. Meas. 1 for the 17 peak flow pairs]
Pearson Product-Moment Correlation Coefficient
• r (rho) ranges from -1 to +1
• $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$
• r describes the strength of the association
• r2 = proportion of variance (variability) of one
variable accounted for by the other variable
[Example scatterplots illustrating r = 1.0, r = -1.0, r = 0.8, and r = 0.0]
Correlation Coefficient for Peak Flow Data
r (meas. 1, meas. 2) = 0.98
Limitations of Simple Correlation for
Assessment of Reproducibility
• Depends upon range of data
– e.g. Peak Flow
• r (full range of data) = 0.98
• r (peak flow <450) = 0.97
• r (peak flow >450) = 0.94
Limitations of Simple Correlation for
Assessment of Reproducibility
• Depends upon ordering of data
• Measures linear association only
[Scatterplot: Meas. 2 vs. Meas. 1, illustrating that the correlation coefficient captures only linear association]
Limitations of Simple Correlation for
Assessment of Reproducibility
• Gives no meaningful parameter on the actual scale of measurement
Within-Subject Standard Deviation
subject    meas1    meas2    mean    s
1          494      490      492     2.83
2          395      397      396     1.41
3          516      512      514     2.83
.          .        .        .       .
15         178      165      172     9.19
16         423      372      398     36.06
17         427      421      424     4.24

• Mean within-subject standard deviation (sw):
  $s_w = \sqrt{\frac{\sum_i s_i^2}{n}} = \sqrt{\frac{2.83^2 + \ldots + 4.24^2}{17}} = 15.3$ l/min
Computationally easier with an ANOVA table:

Analysis of Variance
Source            SS           df    MS           F        Prob > F
--------------------------------------------------------------------
Between groups    441598.529   16    27599.9081   117.80   0.0000
Within groups     3983.00      17    234.294118
--------------------------------------------------------------------
Total             445581.529   33    13502.4706

$\sum_i s_i^2 = \text{within-group sum of squares}$

$\frac{\sum_i s_i^2}{17} = \text{within-group mean square} = 234$

• Mean within-subject standard deviation (sw):
  $s_w = \sqrt{\text{within-group mean square}} = 15.3$ l/min
sw: Further Interpretation
• If we assume that the replicate results are:
  – normally distributed
  – with the mean of the replicates estimating the true value
  – with standard deviation estimated by sw
• Then 95% of replicates will be within
(1.96)(sw) of the true value
• For Peak Flow data:
– 95% of replicates will be within (1.96)(15.3) =
30.0 l/min of the true value
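For readers who want to reproduce the arithmetic, here is a minimal Python sketch (not from the lecture) that computes sw, the 30.0 l/min figure above, and the 2.77·sw "repeatability" discussed on the slides that follow, directly from the 17 peak flow pairs tabulated earlier.

```python
import numpy as np

# Peak flow pairs (Meas. 1, Meas. 2) for the 17 subjects tabulated above.
meas1 = np.array([494, 395, 516, 434, 476, 557, 413, 442, 650,
                  433, 417, 656, 267, 478, 178, 423, 427], dtype=float)
meas2 = np.array([490, 397, 512, 401, 470, 611, 415, 431, 638,
                  429, 420, 633, 275, 492, 165, 372, 421], dtype=float)

# With two replicates per subject, each subject's within-subject
# variance is s_i^2 = (meas1 - meas2)^2 / 2.
s2_within = (meas1 - meas2) ** 2 / 2

# sw = sqrt(mean of the per-subject variances); this equals the square
# root of the within-group mean square from the one-way ANOVA.
sw = np.sqrt(s2_within.mean())
print(round(sw, 1))            # 15.3 (l/min)

# 95% of single replicates expected within 1.96*sw of the true value.
print(round(1.96 * sw, 1))     # ~30.0 (l/min)

# "Repeatability" (see the slides that follow): 95% of pairs of
# replicates are expected to differ by less than 2.77 * sw.
print(round(2.77 * sw, 1))     # ~42.4 (l/min)
```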
sw: Further Interpretation
• Difference between any 2 replicates for the same person: diff = meas1 - meas2
• Because var(diff) = var(meas1) + var(meas2),
  $s^2_{diff} = s^2_w + s^2_w = 2 s^2_w$, so $s_{diff} = \sqrt{2 s^2_w} = \sqrt{2}\, s_w$
• If we assume the differences between pairs are distributed $N(0, \sigma^2_{diff})$, then the difference between 2 measurements for the same subject is expected to be less than $(1.96)(s_{diff}) = (1.96)(\sqrt{2})\, s_w \approx 2.77\, s_w$ for 95% of all pairs of measurements
sw: Further Interpretation
• For Peak Flow data:
• The difference between 2 measurements for the
same subject is expected to be less than 2.77sw
=(2.77)(15.3) = 42.4 l/min for 95% of all pairs
• Bland and Altman refer to this as the "repeatability" of the measurement
Interpreting sw
Within-Subject Std Deviation
• Appropriate only if there is one sw, i.e. if sw does not vary with the true underlying value

[Plot: absolute difference between replicates vs. subject mean peak flow; Kendall's correlation coefficient = 0.17, p = 0.36]
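The Kendall statistic quoted above can be checked with scipy; here is a sketch (not from the lecture) that reuses the same 17 peak flow pairs. scipy's default tau-b should come out close to the 0.17 reported on the slide.

```python
import numpy as np
from scipy.stats import kendalltau

# Same peak flow pairs as in the earlier sketch.
meas1 = np.array([494, 395, 516, 434, 476, 557, 413, 442, 650,
                  433, 417, 656, 267, 478, 178, 423, 427], dtype=float)
meas2 = np.array([490, 397, 512, 401, 470, 611, 415, 431, 638,
                  429, 420, 633, 275, 492, 165, 372, 421], dtype=float)

abs_diff = np.abs(meas1 - meas2)      # within-subject spread
subj_mean = (meas1 + meas2) / 2       # within-subject mean

# If there really is "one sw", the spread should be unrelated to the mean.
tau, p = kendalltau(subj_mean, abs_diff)
print(round(tau, 2), round(p, 2))     # approximately 0.17 and 0.36
```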
Another Interval Scale Example
• Salivary cotinine in children (Bland-Altman)
• n = 20 participants measured twice
subject    trial 1    trial 2
1          0.1        0.1
2          0.2        0.1
3          0.2        0.3
.          .          .
18         4.9        1.4
19         4.9        3.9
20         7.0        4.0
Simple Correlation of Two Trials
[Scatterplot: trial 1 vs. trial 2 cotinine values]
Correlation of Cotinine Replicates
Range of data    r
Full range       0.70
< 1.0            0.37
1.0 to 2.7       0.57
> 2.7            -0.01
Cotinine: Absolute Difference vs. Mean
[Plot: subject absolute difference vs. subject mean cotinine; Kendall's tau = 0.62, p = 0.001]
Logarithmic Transformation
subject    trial 1    trial 2    log trial 1    log trial 2
1          0.1        0.1        -1             -1
2          0.2        0.1        -0.69897       -1
3          0.2        0.3        -0.69897       -0.52288
.          .          .          .              .
18         4.9        1.4        0.690196       0.146128
19         4.9        3.9        0.690196       0.591065
20         7.0        4.0        0.845098       0.60206
Log Transformed: Absolute Difference vs. Mean
[Plot: subject absolute difference of log values vs. subject mean log cotinine; Kendall's tau = 0.07, p = 0.7]
sw for log-transformed cotinine data
Analysis of Variance
Source            SS            df    MS            F       Prob > F
---------------------------------------------------------------------
Between groups    10.4912641    19    .552171793    18.10   0.0000
Within groups     .610149715    20    .030507486
---------------------------------------------------------------------
Total             11.1014138    39    .284651636

• $s_w = \sqrt{0.0305} = 0.175$
• Back-transforming to original units: antilog(sw) = antilog(0.175) = 1.49
Coefficient of Variation
• On the natural scale, there is not one common within-subject standard deviation for the cotinine data
• Therefore, there is not one absolute number that can represent how far any replicate is expected to be from the true value or from another replicate
• Instead:
  $\text{antilog}(s_w) - 1 \approx \frac{\text{within-subject standard deviation}}{\text{within-subject mean}} = \text{coefficient of variation}$
Cotinine Data
• Coefficient of variation = 1.49 -1 = 0.49
• At any level of cotinine, the within-subject
standard deviation of repeated measures is
49% of the level
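Because the full 20-subject cotinine data set is abridged above, the sketch below (not from the lecture) only encodes the recipe: log-transform the replicates, compute sw on the log scale, and back-transform. It then reproduces the roughly 49% figure from the lecture's log-scale within-group mean square of 0.0305.

```python
import numpy as np

def cv_from_log_replicates(rep1, rep2):
    """Coefficient of variation when the within-subject SD is proportional
    to the mean: log10-transform, compute sw on the log scale, and
    back-transform (antilog(sw) - 1). Assumes two replicates per subject."""
    log1, log2 = np.log10(rep1), np.log10(rep2)
    s2_within = (log1 - log2) ** 2 / 2        # per-subject variance on log scale
    sw_log = np.sqrt(np.mean(s2_within))      # sw on the log10 scale
    return 10 ** sw_log - 1                   # CV on the natural scale

# Using the within-group mean square (0.0305) reported in the ANOVA above:
sw_log = np.sqrt(0.0305)                      # ~0.175
print(10 ** sw_log - 1)                       # ~0.49, i.e. a CV of about 49%
```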
Coefficient of Variation for Peak Flow Data
• When the within-subject standard deviation is not proportional to the mean value, as in the Peak Flow data, there is no constant ratio between the within-subject standard deviation and the mean
• Therefore, there is not one common coefficient of variation
• Estimating a coefficient of variation by dividing the common within-subject standard deviation by the overall mean of the subjects is not very meaningful
Intraclass Correlation Coefficient, rI
• $r_I = \frac{\sigma^2_T}{\sigma^2_O} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$
• Averages the correlation across all possible orderings of the replicates
• Varies from 0 (poor) to 1 (optimal)
• As $\sigma^2_E$ approaches 0 (no error), rI approaches 1
• Advantages: not dependent upon ordering of replicates; does not mistake linear association for agreement; allows >2 replicates
• Disadvantages: still dependent upon the range of data in the sample; still does not give a meaningful parameter on the actual scale of measurement in question
Intraclass Correlation Coefficient, rI
• $r_I = \frac{m\, SS_b - SS_t}{(m - 1)\, SS_t}$
• where:
  – m = no. of replicates per person
  – SSb = sum of squares between subjects
  – SSt = total sum of squares
• rI (peak flow) = 0.98
• rI (cotinine) = 0.69
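A small helper (not from the lecture) can reproduce the peak flow rI from the between-subjects and total sums of squares reported in the earlier ANOVA table.

```python
def intraclass_correlation(ss_between, ss_total, m):
    """Intraclass correlation r_I = (m*SSb - SSt) / ((m - 1)*SSt),
    where m is the number of replicates per subject."""
    return (m * ss_between - ss_total) / ((m - 1) * ss_total)

# Peak flow ANOVA: SSb = 441598.529, SSt = 445581.529, m = 2 replicates.
print(round(intraclass_correlation(441598.529, 445581.529, 2), 2))   # 0.98
```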
Reproducibility of a Categorical Measurement:
Chest X-Rays
• On 2 different occasions, a radiologist is given the same 100 CXRs from a group of high-risk smokers to evaluate for masses
• How should reproducibility in reading be assessed?
                     time 2
time 1       No Mass    Mass    Total
No Mass           50       0       50
Mass               0      50       50
Total             50      50      100
Agreement = (50 + 50)/100 = 100%

                     time 2
time 1       No Mass    Mass    Total
No Mass            0      50       50
Mass              50       0       50
Total             50      50      100
Agreement = (0 + 0)/100 = 0%

                     time 2
time 1       No Mass    Mass    Total
No Mass           40      10       50
Mass              10      40       50
Total             50      50      100
Agreement = (40 + 40)/100 = 80%

                     time 2
time 1       No Mass    Mass    Total
No Mass           25      25       50
Mass              25      25       50
Total             50      50      100
Agreement = (25 + 25)/100 = 50%
Kappa
• Agreement above that expected by chance:
  $\text{kappa} = \frac{\text{observed agreement} - \text{chance agreement}}{1 - \text{chance agreement}}$
• (observed agreement - chance agreement) is the amount of agreement above chance
• If the maximum amount of agreement is 1.0, then (1 - chance agreement) is the maximum amount of agreement above chance that is possible
• Therefore, kappa is the ratio of "agreement beyond chance" to "maximal possible agreement beyond chance"
Determining agreement expected by chance
Observed:
                     time 2
time 1       No Mass    Mass    Total
No Mass           42       3       45
Mass              18      37       55
Total             60      40      100
Observed agreement = 79%

Fix the margins:
                     time 2
time 1       No Mass    Mass    Total
No Mass                            45
Mass                               55
Total             60      40      100

Fill in the expected values for the cells under the assumption of independence:
                     time 2
time 1       No Mass    Mass    Total
No Mass           27      18       45
Mass              33      22       55
Total             60      40      100
Agreement expected by chance = (27 + 22)/100 = 49%

$\text{kappa} = \frac{0.79 - 0.49}{1 - 0.49} = 0.59$
• Suggested interpretations for kappa

kappa          Interpretation
< 0            No agreement
0 – 0.19       Poor agreement
0.20 – 0.39    Fair agreement
0.40 – 0.59    Moderate agreement
0.60 – 0.79    Substantial agreement
0.80 – 1.00    Near perfect agreement
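The kappa calculation can be reproduced with a short function (not from the lecture); the second call also reproduces the extreme-prevalence example shown on the next slide.

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows = time 1, columns = time 2)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    observed = np.trace(table) / n                  # observed agreement
    row_marginals = table.sum(axis=1) / n
    col_marginals = table.sum(axis=0) / n
    chance = (row_marginals * col_marginals).sum()  # agreement expected by chance
    return (observed - chance) / (1 - chance)

# Radiologist example above: observed agreement 0.79, chance agreement 0.49.
print(round(cohens_kappa([[42, 3], [18, 37]]), 2))   # 0.59

# Extreme-prevalence example on the next slide: agreement 94%, kappa only ~0.37.
print(round(cohens_kappa([[92, 3], [3, 2]]), 2))     # 0.37
```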
Kappa: problematic at the extremes of prevalence
Observed:
                     time 2
time 1       No Mass    Mass    Total
No Mass           92       3       95
Mass               3       2        5
Total             95       5      100
Observed agreement = 94%

Fix the margins:
                     time 2
time 1       No Mass    Mass    Total
No Mass                            95
Mass                                5
Total             95       5      100

Fill in the expected values for the cells under the assumption of independence:
                     time 2
time 1       No Mass    Mass    Total
No Mass        90.25    4.75       95
Mass            4.75    0.25        5
Total             95       5      100
Agreement expected by chance = (90.25 + 0.25)/100 = 90.5%

$\text{kappa} = \frac{0.94 - 0.905}{1 - 0.905} = 0.37$
Sources of Measurement Variability:
Which to Assess?
• Observer
• within-observer (intrarater)
• between-observer (interrater)
• Instrument
• within-instrument
• between-instrument
• Subject
• within-subject
• Which to assess depends upon the use of the measurement
and how it will be made.
– For clinical use: all of the above are needed
– For research: depends upon logistics of study (i.e.
intrarater and within-instrument only if just one
person/instrument used throughout study)
Improving Reproducibility
• See Hulley text
• Make more than one measurement!
– But know where the source of your variation
exists!
Assessing Validity - With Gold Standards
• A new and simpler device to measure peak flow becomes available (Bland & Altman)

subject    gold std    new
1          494         512
2          395         430
3          516         520
.          .           .
15         178         259
16         423         350
17         427         451
Plot of Difference vs. Gold Standard
[Scatterplot: difference between the two methods vs. gold standard value]
[Histogram: frequency distribution of the differences]
• The mean difference describes any systematic difference between the gold standard and the new device:
  $\bar{d} = \frac{1}{n}\sum_i d_i = \frac{1}{n}\left[(494 - 512) + \ldots + (427 - 451)\right] = -2.3$
• The standard deviation of the differences:
  $s_d = \sqrt{\frac{\sum_i (d_i - \bar{d})^2}{n - 1}} = 38.8$
• 95% of differences will lie between $\bar{d} \pm 1.96\, s_d = -2.3 \pm (1.96)(38.8)$, or from about -78 to 74 l/min
• These are the 95% limits of agreement
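The limits-of-agreement calculation generalizes to any paired method-comparison data. Below is a minimal sketch (not from the lecture); because the full 17-subject data set is abridged above, only the lecture's summary numbers are plugged in at the end.

```python
import numpy as np

def limits_of_agreement(gold, new):
    """Bland-Altman 95% limits of agreement for two methods measured
    on the same subjects (difference taken as gold minus new here)."""
    gold, new = np.asarray(gold, dtype=float), np.asarray(new, dtype=float)
    diff = gold - new
    mean_diff = diff.mean()           # systematic difference (bias)
    sd_diff = diff.std(ddof=1)        # standard deviation of the differences
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff

# With the lecture's summary values (mean difference -2.3, SD 38.8):
print(-2.3 - 1.96 * 38.8, -2.3 + 1.96 * 38.8)   # about -78 to 74 l/min
```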
Assessing Validity of Categorical Measures
• Dichotomous (a sketch of the calculation follows this slide):

                            Gold Standard
                            Present    Absent
New            Present      a          b
Measurement    Absent       c          d

  Sensitivity = a/(a + c)
  Specificity = d/(b + d)

• More than 2 levels:
  – Collapse, or
  – Kappa
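The sensitivity/specificity calculation referred to above is a one-liner; the counts in the example call below are hypothetical, purely to show the arithmetic.

```python
def sensitivity_specificity(a, b, c, d):
    """2x2 validity table against a gold standard:
    a = new+/gold+, b = new+/gold-, c = new-/gold+, d = new-/gold-."""
    sensitivity = a / (a + c)   # proportion of gold-standard positives detected
    specificity = d / (b + d)   # proportion of gold-standard negatives called negative
    return sensitivity, specificity

# Hypothetical counts, not from the lecture:
print(sensitivity_specificity(a=45, b=5, c=10, d=40))   # (0.818..., 0.888...)
```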
Assessing Validity - Without Gold Standards
• When gold standards are not present, measures
can be assessed for validity in 3 ways:
– Content validity
• Face
• Sampling
– Construct validity
– Empirical validity (aka criterion)
• Concurrent
• Predictive
Conclusions
• Measurement reproducibility plays a key role in determining validity
and statistical precision in all different study designs
– When assessing reproducibility,
• avoid correlation coefficients
• use within-subject standard deviation if constant
• or coefficient of variation if within-subject sd is proportional to
the magnitude of measurement
• Acceptable reproducibility depends upon desired use
• For validity, plot difference vs mean and determine “limits of
agreement” or determine sensitivity/specificity
– Be aware of how your measurements have been validated!