Epidemiologic Methods – Course Schedule
• September 28 – Understanding Measurement: Aspects of Reproducibility & Validity
• October 3 – Study Design
• October 10 – Measures of Disease Occurrence I
• October 17 – Measures of Disease Occurrence II
• October 24 – Measures of Disease Association I
• October 31 – Measures of Disease Association II
• October 31 – Measures of Attributable Risk
• November 7 – Bias in Epidemiologic Studies: Selection Bias
• November 14 – Bias in Epidemiologic Studies: Measurement Bias
• November 14 – Confounding and Interaction I: General Principles
• November 21 – Confounding and Interaction II: Assessing Interaction
• November 28 – Confounding and Interaction III: Stratified Analysis
• December 5 – Conceptual Approach to Multivariable Analysis I
• December 7 – Conceptual Approach to Multivariable Analysis II
• December 12 – Conceptual Approach to Multivariable Analysis III

Definitions of Epidemiology
• The study of the distribution and determinants (causes) of disease – e.g., cardiovascular epidemiology
• The method used to conduct human subject research – the methodologic foundation of any research where individual humans or groups of humans are the unit of observation

Understanding Measurement: Aspects of Reproducibility and Validity
• Review of measurement scales
• Reproducibility
  – importance
  – methods of assessment
    • by variable type: interval vs. categorical
    • intra- vs. inter-observer comparison
• Validity
  – methods of assessment
    • gold standard present
    • no gold standard available

Clinical Research: Sample → Measure → Analyze → Infer
"A study can only be as good as the data . . ." – Martin Bland

Measurement Scales
Scale                        Example
Interval – continuous        weight
Interval – discrete          WBC count
Categorical – ordinal        tumor stage
Categorical – nominal        race
Categorical – dichotomous    death

Reproducibility vs. Validity
• Reproducibility – the degree to which a measurement provides the same result each time it is performed on a given subject or specimen
• Validity – from the Latin validus, strong – the degree to which a measurement truly measures (represents) what it purports to measure (represent)
• Reproducibility is also known as reliability, repeatability, precision, variability, dependability, consistency, or stability
• Validity is also known as accuracy

Relationship Between Reproducibility and Validity
[Target diagrams contrasting the four combinations: good reproducibility with poor validity, poor reproducibility with good validity, good reproducibility with good validity, and poor reproducibility with poor validity]

Why Care About Reproducibility? Impact on Validity
• Mathematically, the upper limit of a measurement's validity is a function of its reproducibility
• Consider a study to measure height in the community:
  – if we measure height twice on a given person and get two different values, one of the two values must be wrong (invalid)
  – if the study measures everyone only once, errors, despite being random, may not balance out
  – final inferences are likely to be wrong (invalid)

Why Care About Reproducibility? Impact on Statistical Precision
• Classical measurement theory: observed value (O) = true value (T) + measurement error (E), where E is random and E ~ N(0, σ²_E)
• Therefore, when measuring a group of subjects, the variability of observed values is a combination of the variability in their true values and measurement error:
  σ²_O = σ²_T + σ²_E
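The variance decomposition above is easy to illustrate with a small simulation. The following is a minimal sketch in Python (not part of the original lecture; the parameter values are hypothetical): it draws true values and random errors and confirms that the observed variance is approximately the sum of the two components.

```python
import numpy as np

# Illustrative simulation of classical measurement theory: O = T + E
# (hypothetical parameters, not data from the lecture)
rng = np.random.default_rng(0)

n = 100_000
sigma_T = 10.0   # between-subject SD of the true values
sigma_E = 4.0    # SD of the random measurement error

T = rng.normal(500, sigma_T, size=n)   # true values
E = rng.normal(0, sigma_E, size=n)     # measurement error
O = T + E                              # observed values

print(O.var())            # ~ sigma_T**2 + sigma_E**2 = 116
print(T.var() + E.var())  # essentially the same quantity
print(T.var() / O.var())  # ~ 100/116 = 0.86, the reproducibility ratio defined below
```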
Why Care About Reproducibility? Impact on Statistical Precision (continued)
• σ²_O = σ²_T + σ²_E
• More measurement error means more variability in observed measurements
• More variability of observed measurements has profound influences on statistical precision/power:
  – Descriptive studies: less precise estimates of the traits of interest
  – RCTs: reduced power to detect a treatment difference
  – Observational studies: reduced power to detect an influence of a particular exposure upon a given outcome

Conceptual Definition of Reproducibility
• Reproducibility = σ²_T / (σ²_T + σ²_E) = σ²_T / σ²_O
• Varies from 0 (poor) to 1 (optimal)
• As σ²_E approaches 0 (no error), reproducibility approaches 1
(Phillips and Smith, J Clin Epi 1993)

Sources of Measurement Variability
• Observer
  – within-observer (intrarater)
  – between-observer (interrater)
• Instrument
  – within-instrument
  – between-instrument
• Subject
  – within-subject

Sources of Measurement Variability: Example
• e.g., plasma HIV viral load
  – observer: measurement-to-measurement differences in tube filling, time before processing
  – instrument: run-to-run differences in reagent concentration, PCR cycle times, enzymatic efficiency
  – subject: biologic variation in viral load

Assessing Reproducibility
• Depends on the measurement scale
• Interval scale
  – within-subject standard deviation
  – coefficient of variation
• Categorical scale
  – Cohen's kappa

Reproducibility of an Interval Scale Measurement: Peak Flow
• Assessment requires >1 measurement per subject
• Peak flow rate (l/min) in 17 adults, each measured twice (Bland & Altman):

Subject   Meas. 1   Meas. 2
1         494       490
2         395       397
3         516       512
4         434       401
5         476       470
6         557       611
7         413       415
8         442       431
9         650       638
10        433       429
11        417       420
12        656       633
13        267       275
14        478       492
15        178       165
16        423       372
17        427       421

Assessment by Simple Correlation
[Scatter plot of Meas. 2 vs. Meas. 1, roughly 200–800 l/min on both axes]

Pearson Product-Moment Correlation Coefficient
• r ranges from -1 to +1
• r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]
• r describes the strength of the association
• r² = proportion of variance (variability) of one variable accounted for by the other variable
[Example scatter plots illustrating r = 1.0, r = -1.0, r = 0.8, and r = 0.0]

Correlation Coefficient for the Peak Flow Data
• r (Meas. 1, Meas. 2) = 0.98

Limitations of Simple Correlation for Assessment of Reproducibility
• Depends upon the range of the data – e.g., for peak flow:
  – r (full range of data) = 0.98
  – r (peak flow <450) = 0.97
  – r (peak flow >450) = 0.94
• Depends upon the ordering of the data
• Measures linear association only
  [Scatter plot: two measurements can be perfectly linearly related yet systematically disagree]
• Gives no meaningful parameter on the scale of measurement describing how far apart replicates are expected to be

Within-Subject Standard Deviation

Subject   Meas. 1   Meas. 2   Mean   s
1         494       490       492    2.83
2         395       397       396    1.41
3         516       512       514    2.83
...       ...       ...       ...    ...
15        178       165       172    9.19
16        423       372       398    36.06
17        427       421       424    4.24

• Mean within-subject standard deviation (sw):
  sw = √( Σᵢ sᵢ² / n ) = √( (2.83² + ... + 4.24²) / 17 ) = 15.3 l/min
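As a quick check of the arithmetic, the within-subject standard deviation can be computed directly from the 17 pairs of peak flow readings listed above. This is a minimal sketch in Python with NumPy (not part of the original lecture); it should reproduce sw ≈ 15.3 l/min.

```python
import numpy as np

# Peak flow replicates (l/min) from the Bland & Altman example above
meas1 = np.array([494, 395, 516, 434, 476, 557, 413, 442, 650,
                  433, 417, 656, 267, 478, 178, 423, 427])
meas2 = np.array([490, 397, 512, 401, 470, 611, 415, 431, 638,
                  429, 420, 633, 275, 492, 165, 372, 421])

# Per-subject standard deviation of the two replicates
pairs = np.stack([meas1, meas2], axis=1)
s_i = pairs.std(axis=1, ddof=1)

# Mean within-subject SD: square root of the mean per-subject variance
sw = np.sqrt(np.mean(s_i**2))
print(round(sw, 1))   # ~15.3 l/min
```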
• Computationally easier with a one-way ANOVA table (subjects as groups):

Analysis of Variance
Source            SS           df   MS           F        Prob > F
Between groups    441598.529   16   27599.9081   117.80   0.0000
Within groups     3983.00      17   234.294118
Total             445581.529   33   13502.4706

• Σᵢ sᵢ² = within-group sum of squares = 3983
• Σᵢ sᵢ² / 17 = within-group mean square = 234.29
• Mean within-subject standard deviation: sw = √(within-group mean square) = √234.29 = 15.3 l/min

sw: Further Interpretation
• If we assume that replicate results are:
  – normally distributed
  – centered on the true value (the mean of the replicates estimates the true value)
  – with standard deviation estimated by sw
• Then 95% of replicates will be within (1.96)(sw) of the true value
• For the peak flow data: 95% of replicates will be within (1.96)(15.3) = 30.0 l/min of the true value

• The difference between any 2 replicates for the same person: diff = meas1 − meas2
• Because var(diff) = var(meas1) + var(meas2), s²_diff = s²_w + s²_w = 2·s²_w, so s_diff = √2 · sw ≈ 1.41·sw
• If we assume the differences between pairs are distributed N(0, σ²_diff), then the difference between 2 measurements for the same subject is expected to be less than (1.96)(s_diff) = (1.96)(1.41)(sw) = 2.77·sw for 95% of all pairs of measurements
• For the peak flow data: the difference between 2 measurements for the same subject is expected to be less than 2.77·sw = (2.77)(15.3) = 42.4 l/min for 95% of all pairs
• Bland & Altman refer to this as the "repeatability" of the measurement

Interpreting sw
• Appropriate only if there is one sw, i.e., if sw does not vary with the true underlying value
• For the peak flow data, the within-subject differences do not track the subject means:
  [Plot: per-subject absolute difference vs. subject mean peak flow; Kendall's correlation coefficient = 0.17, p = 0.36]

Another Interval Scale Example: Salivary Cotinine
• Salivary cotinine in children (Bland & Altman)
• n = 20 participants, each measured twice:

Subject   Trial 1   Trial 2
1         0.1       0.1
2         0.2       0.1
3         0.2       0.3
...       ...       ...
18        4.9       1.4
19        4.9       3.9
20        7.0       4.0

Simple Correlation of the Two Trials
[Scatter plot of Trial 2 vs. Trial 1]

Correlation of Cotinine Replicates
Range of data   r
Full range      0.70
< 1.0           0.37
1.0 to 2.7      0.57
> 2.7           -0.01

Cotinine: Absolute Difference vs. Mean
[Plot: per-subject absolute difference vs. subject mean cotinine; Kendall's tau = 0.62, p = 0.001]
• Here, unlike peak flow, the within-subject variability grows with the underlying level.
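Before deciding whether a single sw is a fair summary, it helps to check whether the size of the within-subject differences tracks the subject means, as done above with Kendall's correlation. The following is a minimal sketch (assuming SciPy is available), applied to the peak flow replicates because their full data are listed above; the exact value may differ slightly from the slide's tau = 0.17 depending on how ties are handled.

```python
import numpy as np
from scipy.stats import kendalltau

# Peak flow replicates (l/min) from the earlier table
meas1 = np.array([494, 395, 516, 434, 476, 557, 413, 442, 650,
                  433, 417, 656, 267, 478, 178, 423, 427])
meas2 = np.array([490, 397, 512, 401, 470, 611, 415, 431, 638,
                  429, 420, 633, 275, 492, 165, 372, 421])

abs_diff = np.abs(meas1 - meas2)   # per-subject absolute difference
subj_mean = (meas1 + meas2) / 2    # per-subject mean

tau, p = kendalltau(abs_diff, subj_mean)
print(tau, p)   # the lecture reports tau = 0.17, p = 0.36 for these data
```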
Logarithmic Transformation
• Because the within-subject standard deviation increases with the level, analyze the cotinine data on the log scale:

Subject   Trial 1   Trial 2   log Trial 1   log Trial 2
1         0.1       0.1       -1.000        -1.000
2         0.2       0.1       -0.699        -1.000
3         0.2       0.3       -0.699        -0.523
...       ...       ...       ...           ...
18        4.9       1.4       0.690         0.146
19        4.9       3.9       0.690         0.591
20        7.0       4.0       0.845         0.602

Log Transformed: Absolute Difference vs. Mean
[Plot: per-subject absolute difference of log cotinine vs. subject mean log cotinine; Kendall's tau = 0.07, p = 0.7]
• After log transformation, the within-subject variability no longer depends on the level

sw for the Log-Transformed Cotinine Data

Analysis of Variance
Source            SS            df   MS           F       Prob > F
Between groups    10.4912641    19   .552171793   18.10   0.0000
Within groups     .610149715    20   .030507486
Total             11.1014138    39   .284651636

• sw = √0.0305 = 0.175
• Back-transforming to the original units: antilog(sw) = antilog(0.175) = 1.49

Coefficient of Variation
• On the natural scale, there is not one common within-subject standard deviation for the cotinine data
• Therefore, there is not one absolute number that can represent how far any replicate is expected to be from the true value or from another replicate
• Instead, when the within-subject standard deviation is proportional to the mean:
  coefficient of variation = within-subject standard deviation / within-subject mean = antilog(sw of the log-transformed data) − 1

Cotinine Data
• Coefficient of variation = 1.49 − 1 = 0.49
• At any level of cotinine, the within-subject standard deviation of repeated measures is 49% of the level

Coefficient of Variation for the Peak Flow Data
• When the within-subject standard deviation is not proportional to the mean value, as in the peak flow data, there is not a constant ratio between the within-subject standard deviation and the mean
• Therefore, there is not one common coefficient of variation
• Estimating a coefficient of variation by dividing the common within-subject standard deviation by the overall mean of the subjects is not very meaningful

Intraclass Correlation Coefficient, rI
• rI = σ²_T / (σ²_T + σ²_E)
• Averages the correlation across all possible orderings of the replicates
• Varies from 0 (poor) to 1 (optimal)
• As σ²_E approaches 0 (no error), rI approaches 1
• Advantages: not dependent upon the ordering of the replicates; does not mistake linear association for agreement; allows >2 replicates
• Disadvantages: still dependent upon the range of data in the sample; still does not give a meaningful parameter on the actual scale of measurement in question

• Computed from sums of squares:
  rI = (m·SSb − SSt) / ((m − 1)·SSt)
  where:
  – m = number of replicates per subject
  – SSb = sum of squares between subjects
  – SSt = total sum of squares
• rI (peak flow) = 0.98
• rI (cotinine) = 0.69
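The sum-of-squares formula for rI can be applied directly to the peak flow replicates, using a one-way decomposition with subjects as groups. The sketch below (not from the lecture; the helper name intraclass_correlation is illustrative) should give rI ≈ 0.98, matching the value quoted above.

```python
import numpy as np

def intraclass_correlation(data):
    """One-way ICC via the formula rI = (m*SSb - SSt) / ((m-1)*SSt).

    data: array of shape (n_subjects, m_replicates)."""
    data = np.asarray(data, dtype=float)
    n, m = data.shape
    grand_mean = data.mean()
    ss_total = ((data - grand_mean) ** 2).sum()                      # SSt
    ss_between = m * ((data.mean(axis=1) - grand_mean) ** 2).sum()   # SSb
    return (m * ss_between - ss_total) / ((m - 1) * ss_total)

meas1 = [494, 395, 516, 434, 476, 557, 413, 442, 650,
         433, 417, 656, 267, 478, 178, 423, 427]
meas2 = [490, 397, 512, 401, 470, 611, 415, 431, 638,
         429, 420, 633, 275, 492, 165, 372, 421]

print(round(intraclass_correlation(np.column_stack([meas1, meas2])), 2))  # ~0.98
```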
Reproducibility of a Categorical Measurement: Chest X-Rays
• On 2 different occasions, a radiologist is given the same 100 CXRs from a group of high-risk smokers to evaluate for masses
• How should reproducibility in reading be assessed?

Four hypothetical patterns of agreement (rows = time 1, columns = time 2):

Time 1 \ Time 2   No Mass   Mass   Total
No Mass           50        0      50
Mass              0         50     50
Total             50        50     100
Agreement = (50 + 50)/100 = 100%

Time 1 \ Time 2   No Mass   Mass   Total
No Mass           0         50     50
Mass              50        0      50
Total             50        50     100
Agreement = (0 + 0)/100 = 0%

Time 1 \ Time 2   No Mass   Mass   Total
No Mass           40        10     50
Mass              10        40     50
Total             50        50     100
Agreement = (40 + 40)/100 = 80%

Time 1 \ Time 2   No Mass   Mass   Total
No Mass           25        25     50
Mass              25        25     50
Total             50        50     100
Agreement = (25 + 25)/100 = 50%

Kappa
• Agreement above that expected by chance:
  kappa = (observed agreement − chance agreement) / (1 − chance agreement)
• (observed agreement − chance agreement) is the amount of agreement above chance
• If the maximum possible agreement is 1.0, then (1 − chance agreement) is the maximum amount of agreement above chance that is possible
• Therefore, kappa is the ratio of "agreement beyond chance" to "maximal possible agreement beyond chance"

Determining the Agreement Expected by Chance
Observed:
Time 1 \ Time 2   No Mass   Mass   Total
No Mass           42        3      45
Mass              18        37     55
Total             60        40     100
Observed agreement = (42 + 37)/100 = 79%

Fix the margins, then fill in the expected cell values under the assumption of independence:
Time 1 \ Time 2   No Mass   Mass   Total
No Mass           27        18     45
Mass              33        22     55
Total             60        40     100
Agreement expected by chance = (27 + 22)/100 = 49%

kappa = (0.79 − 0.49) / (1 − 0.49) = 0.59

• Suggested interpretations for kappa:
kappa          Interpretation
< 0            No agreement
0 – 0.19       Poor agreement
0.20 – 0.39    Fair agreement
0.40 – 0.59    Moderate agreement
0.60 – 0.79    Substantial agreement
0.80 – 1.00    Near-perfect agreement

Kappa: Problematic at the Extremes of Prevalence
Observed:
Time 1 \ Time 2   No Mass   Mass   Total
No Mass           92        3      95
Mass              3         2      5
Total             95        5      100
Observed agreement = 94%

Expected under independence (margins fixed):
Time 1 \ Time 2   No Mass   Mass   Total
No Mass           90.25     4.75   95
Mass              4.75      0.25   5
Total             95        5      100
Agreement expected by chance = 90.5%

kappa = (0.94 − 0.905) / (1 − 0.905) = 0.37
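The kappa calculations above can be written out in a few lines. This is a minimal sketch (not part of the lecture; the helper name cohens_kappa is illustrative); it reproduces kappa ≈ 0.59 for the observed chest x-ray table and ≈ 0.37 for the extreme-prevalence table.

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows = rating 1, cols = rating 2)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    observed = np.trace(table) / n
    # expected agreement under independence, computed from the fixed margins
    expected = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa([[42, 3], [18, 37]]), 2))   # ~0.59
print(round(cohens_kappa([[92, 3], [3, 2]]), 2))     # ~0.37
```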
Sources of Measurement Variability: Which to Assess?
• Observer: within-observer (intrarater), between-observer (interrater)
• Instrument: within-instrument, between-instrument
• Subject: within-subject
• Which to assess depends upon the use of the measurement and how it will be made:
  – For clinical use: all of the above are needed
  – For research: depends upon the logistics of the study (e.g., intrarater and within-instrument reproducibility only, if just one person/instrument is used throughout the study)

Improving Reproducibility
• See the Hulley text
• Make more than one measurement!
  – But know where the source of your variation exists!

Assessing Validity – With a Gold Standard
• A new and simpler device to measure peak flow becomes available (Bland & Altman):

Subject   Gold standard   New device
1         494             512
2         395             430
3         516             520
...       ...             ...
15        178             259
16        423             350
17        427             451

[Plot: difference (new − gold standard) vs. gold standard, and histogram of the differences]

• The mean difference describes any systematic difference between the gold standard and the new device:
  d̄ = (1/n)·Σ dᵢ = (1/n)·[(512 − 494) + ... + (451 − 427)] = −2.3
• The standard deviation of the differences:
  s_d = √( Σ(dᵢ − d̄)² / (n − 1) ) = 38.8
• 95% of differences will lie between −2.3 ± (1.96)(38.8), i.e., from about −78 to 74 l/min
• These are the 95% limits of agreement

Assessing Validity of Categorical Measurements
• Dichotomous measurement with a dichotomous gold standard:

                              Gold standard
                              Present   Absent
New measurement   Present     a         b
                  Absent      c         d

Sensitivity = a/(a + c)
Specificity = d/(b + d)

• With more than 2 levels: collapse to 2 levels, or use kappa

Assessing Validity – Without a Gold Standard
• When a gold standard is not available, measures can be assessed for validity in 3 ways:
  – Content validity
    • Face
    • Sampling
  – Construct validity
  – Empirical validity (aka criterion validity)
    • Concurrent
    • Predictive

Conclusions
• Measurement reproducibility plays a key role in determining validity and statistical precision in all study designs
• When assessing reproducibility:
  – avoid correlation coefficients
  – use the within-subject standard deviation if it is constant across the range of measurement
  – or the coefficient of variation if the within-subject standard deviation is proportional to the magnitude of the measurement
• Acceptable reproducibility depends upon the desired use
• For validity, plot the difference vs. the mean and determine the "limits of agreement", or determine sensitivity/specificity
• Be aware of how your measurements have been validated!
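As a practical companion to the validity discussion above, the limits-of-agreement calculation can be sketched in a few lines. This is not part of the lecture: the helper name limits_of_agreement is illustrative, and because only 6 of the 17 gold-standard/new-device pairs are reproduced in the table above, the example below uses that subset and will not reproduce the full-data values (d̄ ≈ −2.3, s_d ≈ 38.8) quoted earlier.

```python
import numpy as np

def limits_of_agreement(new, gold):
    """Bland-Altman style 95% limits of agreement for two measurement methods."""
    new, gold = np.asarray(new, dtype=float), np.asarray(gold, dtype=float)
    d = new - gold                 # per-subject differences
    mean_diff = d.mean()           # systematic difference between the methods
    sd_diff = d.std(ddof=1)        # SD of the differences
    return mean_diff, (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

# Subset of pairs shown in the table above (subjects 1-3 and 15-17)
gold = [494, 395, 516, 178, 423, 427]
new = [512, 430, 520, 259, 350, 451]
print(limits_of_agreement(new, gold))
```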