Download Statistical Methods in Assessing Agreement: Models, Issues, and

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
Statistical Methods in Assessing Agreement:
Models, Issues, and Tools
Lawrence Lin, A. S. Hedayat, Bikas Sinha, and Min Yang
Measurements of agreement are needed to assess the acceptability of a new or generic process, methodology, and formulation in areas of
laboratory performance, instrument or assay validation, method comparisons, statistical process control, goodness of Ž t, and individual
bioequivalence. In all of these areas, one needs measurements that capture a large proportion of data that are within a meaningful
boundary from target values. Target values can be considered random (measured with error) or Ž xed (known), depending on the situation.
Various meaningful measures to cope with such diverse and complex situations have become available only in the last decade. These
measures often assume that the target values are random. This article reviews the literature and presents methodologies in terms of
“coverage probability.” In addition, analytical expressions are introduced for all of the aforementioned measurements when the target
values are Ž xed and when the error structure is homogenous or heterogeneous (proportional to target values). This article compares the
asymptotic power of accepting the agreement across all competing methods and discusses the pros and cons of each. Data when the
target values are random or Ž xed are used for illustration. A SAS macro program to compute all of the proposed methods is available
for download at http://www.uic.edu/ Q hedayat/.
KEY WORDS: Accuracy; Concordance correlation coefŽ cient; Coverage probability; Mean squared deviation; Precision; Target values;
Total deviation index.
1.
INTRODUCTION
Measurements of agreement are needed to assess the acceptability of a new or generic process, methodology, and formulation in areas of laboratory performance, instrument or
assay validation, method comparisons, statistical process control, goodness of Ž t, and individual bioequivalence. Examples
include the agreement of laboratory measurements collected
in various laboratories, the agreement of a newly developed
method with a gold standard method, the agreement of manufacturing process measurements with speciŽ cations, the agreement of observed values with predicted values, and the agreement in bioavailability of a new or generic formulation with a
commonly used formulation. In all of these areas, one needs
measurements that capture a large proportion of data that are
within a meaningful boundary from target values. Examples
of target values include mean, gold standard, quality control
speciŽ cation, predicted, and common formulation values. Target values can be considered random (measured with error) or
Ž xed (known), depending on the situation. There are also situations in which one is interested in comparing two methods
without a designated gold standard or target values.
When the agreement measurements show evidence of lack of
agreement, we need to address the sources of the deŽ ciencies.
Lawrence Lin is Senior Research Scientist (Statistics), Baxter Healthcare
International Corp. and Adjunct Professor of Statistics, Department of Mathematics, Statistics, and Computer Science, University of Illinois, Chicago,
IL 60607 (E-mail: [email protected]). A. S. Hedayat is Professor
of Statistics, Department of Mathematics, Statistics, and Computer Science,
University of Illinois, Chicago, IL 60607 (E-mail: [email protected]). Bikas
Sinha is Professor of Statistics, Indian Statistical Institute, Calcutta, India (Email: [email protected]). Min Yang is Ph.D. student, Department of Mathematics, Statistics, and Computer Science, University of Illinois, Chicago,
IL 60607 (E-mail: [email protected]). This research was primarily
sponsored by National Science Foundation grants DMS-9803596 and DMS0103727, National Cancer Institute grant P01-CA48112-08, and National
Institutes of Health (N IH) National Center for Complementary and Alterative
Medicine grant P50-AT00155. The contents are solely the responsibility of
the authors and do not necessarily represent the ofŽ cial views of N IH. The
authors acknowledge numerous constructive comments from four anonymous
referees and an associate editor that have been extremely helpful in clarifying
the presentation of the ideas related to agreement measurements.
When there is a disagreement between the two marginal distributions, the source is deŽ ned as constant and/or scale “shift,”
or lack of “accuracy.” When there is a disagreement due to
large within-sample variation, the source is deŽ ned as lack of
“precision.”
The question of assessment of agreement has received considerable attention in the literature. Cohen (1960, 1968) discussed this problem in the context of categorical data. Bland
and Altman (1986) proposed a simple and meaningful graphical approach for assessing the agreement between two clinical measurements. In a series of articles, Lin (1989, 1992,
1997, 2000) and Lin and Torbeck (1998) examined this problem critically in the framework of method reproducibility and
suggested a few measures and studied their properties.
In the context of bioequivalence, similar studies have been
reported by Anderson and Hauck (1990), Sheiner (1992),
Holder and Hsuan (1993), Schall and Luus (1993), Schall
(1995), Schall and Williams (1996), and Lin (2000). In
the context of goodness of Ž t, Vonesh, Chinchilli, and Pu
(1996) and Vonesh and Chinchilli (1997) have modiŽ ed Lin’s
approach for choosing models that have better agreement
between the observed and the predicted values.
The article is organized as follows. In Section 2 we brie y
discuss existing methods and add analytical solutions for these
methods when target values are Ž xed. In Section 3 we propose
methods in terms of coverage probability (CP). In Section 4
we compare the power of accepting the agreement among all
competing methods, and in Section 5 we perform a simulation
study to examine the Ž nite-sample performance of the methods. In Section 6 we present two examples based on real data,
one when the target values are random and one when the target values are Ž xed. We discuss general extension problems
and future studies in Section 7, followed by a conclusion in
Section 8.
257
© 2002 American Statistical Association
Journal of the American Statistical Association
March 2002, Vol. 97, No. 457, Theory and Methods
258
Journal of the American Statistical Association, March 2002
2.
2.1
METHODS
2.2
Mean Squared Deviation
2.1.1 When the Target Values are Random. A meaningful statistic to measure the agreement of observations (Y ) with
their target values (X) has been the mean squared deviation
(MSD). Let D D Y ƒ X; then
MSD D …2 D E4D2 5 D E4Y ƒ X5 2 0
(2)
and the sample counterpart of MSD can be computed as
N 2 C sy2 C sx2 ƒ 2syx 1
e2 D 4yN ƒ x5
N x1
N sy2 1 sx2 , and syx represent the usual sample based on
where y1
n paired observations on 4y1 x5 and each with divisor n.
The bootstrap method has been proposed for inference
based on the MSD estimate for individual bioequivalence
(Schall and Luus 1993). Lin (2000) showed that W D ln4e2 5
has an asymptotic normal distribution with mean — D ln4… 2 5
and variance
‘ W2 D
261 ƒ 4Œx ƒ Œy 54 =e4 7
0
nƒ2
(3)
For lesser bias of estimating — and ‘ W2 , we use eQ2 to estimate
… 2 , where
eQ2 D
n
1 X
4y ƒ xi 52 0
n ƒ 1 iD1 i
(4)
Throughout the article, we use n ƒ 1, n ƒ 2, or n ƒ 3 instead
of n in the denominator for lesser bias with smaller sample
size, based on results of the simulation study in Section 5.
2.1.2 When the Target Values are Fixed. When X is
Ž xed, we consider 84yi —xi 5 — i D 11 : : : 1 n9 as observations in a
random sample from the model
Y D ‚0 C ‚1 X C e Y 0
(5)
Here eY is the residual error with mean 0 and variance ‘ e2 .
The familiar quantities b0 , b1 , se2 , and r should be used as the
estimators for ‚0 , ‚1 , ‘ e2 , and .
We compute E4Y ƒ X5 2 for each xi by model (5) and then
take the average across xi . Therefore, the MSD becomes
N 2 C sx2 41 ƒ ‚1 52 C ‘ e2 1
…—X 2 D 4Œy—xN ƒ x5
(6)
N The MSD estimate remains the same
where ŒY —xN D ‚0 C ‚1 x.
as in the case where X is random, except that the variance of
W becomes smaller. It can be shown that (3) is replaced by
³
´
N 2 C sx2 41 ƒ ‚1 52 72
64Œy—xN ƒ x5
2
‘ W2 —X D
0 (7)
1ƒ
nƒ2
…—X 4
For variance estimation, we proceed as usual.
2.2.1 When the Target Values are Random. Lin (1989,
1992) presented another measurement, the concordance correlation coefŽ cient (CCC), for measuring agreement (reproducibility) between two methods. The MSD in (2) can be standardized as a correlation coefŽ cient to yield the CCC denoted
by c ,
(1)
We assume that the joint distribution of Y and X has Ž nite
second moments with means Œy and Œx , variances ‘ y2 and ‘ x2 ,
and covariance ‘ yx . Next, note that (1) can be expressed as
… 2 D 4Œy ƒ Œx 52 C ‘ y2 C ‘ x2 ƒ 2‘ yx 1
Concordance Correlation Coef’ cient
c D 1 ƒ
2‘ yx
…2
D
0
2
2
D 0 ‘ y C ‘ x C 4Œy ƒ Œx 52
… 2 —
2
…
Here …2 —D
is the ratio of the mean square of within-sample
0
total deviation and the total deviation. The mean square of
the within-sample total deviation contains the within-sample
variance and the bias square. The mean square of the total
deviation contains the largest possible variance among nonnegative correlated samples and the bias square. The CCC
can be written as the product of the accuracy and the precision coefŽ cients c D •a . The accuracy coefŽ cient is •a D
4Œ ƒŒ 5 2
‘
2
1 where “ 2 D ‘y ‘ x and š D ‘ y . Here the marginal
š C1=š C“ 2
y x
x
distributions of Y and X are equal (i.e., both means and variances are equal) if and only if the accuracy coefŽ cient is 1.
The precision coefŽ cient is the Pearson correlation coefŽ cient.
The CCC translates the MSD into a correlation coefŽ cient
that measures the agreement along the identity line, in which
a value of 1 represents a perfect agreement 4Y D X5, a value
of ƒ1 represents a perfect disagreement 4Y D ƒX5, and a
value of 0 represents no agreement. The sample counterpart
of CCC is given as
2r sy sx
0
N 2
sy2 C sx2 C 4yN ƒ x5
Cr ¢
Lin (1989) showed that Z D 05 ln 11ƒrcc has an asymptotic
normal distribution with mean
³
´
1 C c
ƒ1
† D tanh 4c 5 D 05 ln
1
1 ƒ c
rc D
ƒ
where tanh 1 4¢5 is the inverse hyperbolic tangent function. Its
variance is
µ
¶
41 ƒ 2 52c 2“ 2 41 ƒ c 53c
“ 4 4c
1
2D
C
ƒ
‘Z
0 (8)
n ƒ 2 41 ƒ 2c 52
41 ƒ 2c 52 
241 ƒ 2c 52 2
2.2.2 When the Target Values are Fixed. When X is
Ž xed, under (5), the CCC is computed in the same way as in
the case when X is random. However, the variance of the Z
transformation of the CCC estimate becomes smaller,
µ
2c 41 ƒ 2 5
2
‘ Z—X D
2 “ 2 š C 4c š ƒ 152
4n ƒ 252 41 ƒ 2c 52 c
¶
1
C 2c š 2 41 ƒ 2 5
2
D
N2
2c 41 ƒ 2 5
2c 6‚0 C 4‚1 ƒ 15x7
4n ƒ 252 41 ƒ 2c 52
sx2
C 4c ‚1 ƒ 152 C
2c ‚21 41 ƒ 2 5
0 (9)
22
Lin et al.: Assessing Agreement
2.3
259
Precision and Accuracy
When agreement measurements show evidence of lack of
agreement, we need to address the sources of the deŽ ciencies.
It is important to know whether the deŽ ciencies are coming
from the large within-sample variation (imprecision) or from
the shift in marginal distributions (inaccuracy). The former
would become a variance reduction exercise, which typically
is more cumbersome than the latter. The latter most likely is
a calibration problem. The CCC has meaningful components
of precision 45 and accuracy 4•a 5. In a way, the precision
squared is in the same scale as accuracy, where 0 represents no
agreement and 1 represents perfect agreement. The inference
on  when X is random has been routinely used for decades.
However, the inference on  when X is Ž xed is not known to
be addressed in the literature. To develop a large-sample inference for , we can simply let “ D 0, š D 1, and c D  in (8)
and (9) when X is random and Ž xed. Thus asymptotic vari1
ance of the Z transformation of r is nƒ3
when X is random,
2
1
and nƒ3 41 ƒ 2 5 when X is Ž xed.
We now present the inference on •a . Let the accuracy estimate be
ca D
u2
2
1
C v C 1=v
µ
¶
Œ2d
2
2 D 2
2
D
 Pr 4D < Š 5 • Š 1 11 2 1
‘d
where • 2 4¢5 is the cumulative noncentral chi-squared distribuŒ2
tion with 1 degree of freedom and noncentrality parameter ‘ d2 .
d
This noncentrality parameter is the relative bias squared. The
total deviation index (TD I) for measuring the boundary Š is
deŽ ned as
TD I D
2
and variance (Robieson 1999)
•a2 “ 2 4š C 1=š ƒ 25 C •a2 4š 2 C 1=š 2 C 22 5=2 C 41 C 2 54•a “ 2 ƒ 15
0
4n ƒ 2541 ƒ •a 52
D
2.4
“ 2 š• a2 41 ƒ 2 5 C 41 ƒ š•a 52 41 ƒ 4 5=2
4n ƒ 2541 ƒ •a 52
N • 41 ƒ  5=s C 41 ƒ ‚1 •a =5 41 ƒ  5=2
6‚0 C 4‚ 1 ƒ 15x7
0
4n ƒ 2541 ƒ •a 52
2
2
a
2
2
x
µ
¶
Œ2
• 2 4ƒ15  1 11 d2 1
‘d
(10)
where • 2 4¢5 is the inverse function of • 2 4¢5. Inference based on estimate of this TD I is intractable. According to Chebyshev’s inequality, this probability has a lower
bound of
…2
Pr 4D2 < Š2 5 > 1 ƒ 2 0
Š
Therefore, the lower bound of the Š2 value is proportional
to … 2 , the MSD, in (2). Lin (2000) suggested using the TD I
to approximate the value of Š that yields P4D2 < Š2 5 D  ,
which is
³
´
1ƒ
ƒ1
D
D
ƒ
—… —1
TD I Š ê
1
2
When X is Ž xed (known), the foregoing variance estimate
becomes smaller,
2
D
‘ L—X
s
4ƒ15
s
N x5
N
where u2 D 4yƒ
and v D syx . Then the logit function of ca
sy sx
c
is L D ln4 1ƒca a 5. The random variable L has an asymptotic
normal distribution with mean
± • ²
a
å D ln
1 ƒ •a
‘ L2 D
the Š value, and compare this Š value with the predetermined
boundary (10 mmHg in this example). In Section 3 we present
the method for estimating  for a given Š. In this section we
present the method for estimating Š for a given  .
Assuming that the distribution of D is normal with mean
Œd D Œy ƒ Œx and variance ‘ d2 D ‘ y2 C‘ x2 ƒ 2‘ yx , the proportion
of the population with —D— (absolute difference) less than Š,
Š > 0, becomes
2
4
Total Deviation Index
An intuitively clear measurement of agreement is a measure that captures a large proportion of data within a predetermined boundary from target values. In other words, we
want the probability of the absolute value of D less than the
boundary, Š, to be large. For example, consider the agreement assessment between the digital instrument used at home
and the manual instrument used in a hospital for measuring
diastolic blood pressure. In this case, a widely acceptable criterion is that at least 90% of the digital observations must
be within 10 mmHg measured from the manual instrument.
There are two approaches to measure agreement. We can Ž x
the predetermined Š value (10 mmHg in this example), compute the coverage probability  (CP), and compare this CP
with the predetermined probability level (90% in this example). We can also Ž x the predetermined CP value, compute
(11)
where ê ƒ1 4¢5 is the inverse cumulative normal distribution and
— ¢ — is the absolute value. This can be estimated by replacing
… in (11) with eQ in (4). Lin (2000) showed that the approximation is good under the following conditions:  D 075 and
Π2d
Œ2
Œ2
µ 12 1  D 080 and ‘ d2 µ 81  D 085 and ‘ d2 µ 21  D 090
‘2
d
Œ2
d
Œ2
d
and ‘ d2 µ 1, and  D 099 and ‘ d2 µ 12 . We may then use the
d
d
sample counterpart and perform statistical inference in the
same way as shown in Section 2.1 on MSD through
the asymptotic normality of W D ln4eQ2 5 when X is either random or Ž xed. Note that when X is Ž xed, we would compute
the Š2 for each xi by model (5) and take the average across
xi . The square root of this average is equivalent to (11).
The TD I is proportional to the square root of the MSD
and is intuitively much clearer than the MSD. It is the 100
percentile of the absolute difference of paired observations.
The TD I is similar in concept to the prediction limit, and its
conŽ dence limit is similar in concept to the tolerance limit for
capturing individual observations. The difference is that the
boundary is set to deviate from target values, instead of from
the mean.
260
2.5
Journal of the American Statistical Association, March 2002
Intraclass Correlation Coef’ cient
The intraclass correlation coefŽ cient ( ICC) (Fisher 1925)
in its original form is the ratio of between-sample variance
and the total variance (between- and within-sample) to measure precision under the model of equal marginal distributions. However, when the marginal distributions are not equal
(inaccuracy), the ICC captures the deviations and considers
those as imprecision. In contrast, the Pearson correlation coefŽ cient ignores the inaccuracy component, and the CCC could
segregate inaccuracy from imprecision. The ICC is closely
related to the CCC. The subtle difference is that the ICC value
remains the same when some pairs of yi and xi are interchanged, whereas the CCC does not. Unlike the CCC, the ICC
does not have meaningful components of accuracy and precision. For these reasons, we prefer to use CCC. However, we
expect the performance of the ICC and the special forms that
have evolved to be very similar to that of CCC.
3.
COVERAGE PROBABILITY
In this section we present a method to compute  for
a given Š. When the target values are considered random,
Anderson and Hauck (1990) proposed using this method to
assess individual bioequivalence. They used nonparametric
counting and inference for the assessment. Schall (1995) later
advocated such a proposal through the normal distribution,
and proposed using bootstrap estimation for inference. From
(11), one can also approximate  for a given Š. Here we
have  Š û 1 ƒ 261 ƒ ê4Š=—…—57 D • 2 4Š2 =…2 1 15. Like the TD I,
the approximation is subject to the restriction of reasonable
relative bias squared values. This section provides the exact
parametric estimation through the normal distribution and provides the asymptotic variance analytically for inference when
the target values are random or Ž xed.
3.1
Consider the problem of assessment of agreement when D
is from the normal distribution with mean Œd and variance ‘ d2 .
Then the coverage probability for a given Š, as discussed in
Section 2.4, is
(12)
where ê4¢5 is the cumulative normal distribution function. The
n
4sy2 C sx2 ƒ
estimates of Œd and ‘ d2 are ŒO d D yN ƒ xN and sd2 D nƒ3
2
2syx 5. Furthermore, ŒO d and sd are independent. Consequently,
an estimator of CPŠ can be taken to be
pŠ D ê
µ
¶
µ
¶
ƒŠ ƒ ŒO d
Š ƒ ŒO d
Đ
0
sd
sd
C O64ŒO d ƒ Œd 52 7 C O64sd ƒ ‘ d 52 7
C O64ŒO d ƒ Œd 54sd ƒ ‘ d 571
where ”4x5 is the density function of standard normal distriO4x5
bution and lim x!0 x < ˆ.
Therefore, it is clear that
³ ´
1
E4pŠ 5 D  Š C O
1
n
and V 4pŠ 5 becomes
(µ ³
´
³
´¶
ƒŠ ƒ Œd
Š ƒ Œd 2
1
2 D
ƒ”
‘p
”
nƒ3
‘d
‘d
µ
³
´
³
´¶ )
ƒŠ ƒ Œ d 2
Š ƒ Œd
Š C Œd
1 Š ƒ Œd
C
”
”
‘d
‘d
‘d
‘d
2
³ ´
1
CO
0
(13)
n2
C
Because CPŠ is bounded by 0 and 1, it is better to¢ use
p
the logit transformation for inference. Let T D ln 1ƒpŠ Š . Its
¢
Š
asymptotic mean is ’ D ln 1ƒ Š , and its asymptotic variance
is ‘ T2 D
When the Target Values are Random
CPk D  k D Pr4—Y ƒ X — < Š5
³
´
Œ2d
2
2
D • Š 1 11 2
‘d
hŠ ƒ Œ i
h ƒŠ ƒ Œ i
d
d
Dê
Đ
1
‘d
‘d
Next, we can use the Ž rst-order approximation to compute
E4pŠ 5, and V 4pŠ 5,
³
´
³
´
Š ƒ Œd
Š ƒ Œd
1
ƒ ”
pŠ D ê
4ŒO d ƒ Œd 5
‘d
‘d
‘d
³
´
³
´
ƒŠ ƒ Œd
Š ƒ Œd
Š ƒ Œd
ƒ
”
4sd ƒ ‘ d 5 ƒ ê
‘ d2
‘d
‘d
³
´
ƒ
ƒ
Š Œd
1
C ”
4ŒO d ƒ Œd 5
‘d
‘d
³
´
ƒŠ ƒ Œd
Š C Œd
ƒ
”
4sd ƒ ‘ d 5
‘ d2
‘d
3.2
‘ p2
 Š2 41ƒ Š 52
.
When the Target Values are Fixed
Consider the problem of assessment of agreement when target values X are Ž xed. If in model (5) we further assume that
eY has the normal distribution with mean 0 and variance ‘ e2 ,
then the coverage probability of the ith observation is
 Ši D Pr4—yi ƒ xi — < Š5
h Š ƒ ‚ ƒ 4‚ ƒ 15x i
i
0
1
Dê
‘e
h ƒŠ ƒ ‚ ƒ 4‚ ƒ 15x i
i
0
1
Đ
0
‘e
(14)
We deŽ ne the overall coverage probability as
 Š—X D
n
1X
 0
n iD1 Ši
(15)
Suppose that we have a random sample 84yi 1 xi 5 — i D
11 : : : 1 n9 and that ‚0 1 ‚1 1 and ‘ e2 are estimated by b0 1 b1 ,
Lin et al.: Assessing Agreement
261
and se2 ; then b0 and b1 are independent of se . Here we use
n
se2 D nƒ3
41 ƒ r 2 5sy2 . An estimate of  Ši is
µ
¶
µ
¶
ƒŠ ƒ b0 ƒ 4b1 ƒ 15xi
Š ƒ b0 ƒ 4b1 ƒ 15xi
Đ
pŠi D ê
1
se
se
and an estimate of  Š—X is
pŠ—X D
n
1X
p 0
n iD1 Ši
By the same method as in Section 3.1, it can be shown that
³ ´
1
E4pŠ—X 5 D  Š—X C O
1
n
and that the asymptotic variance of pŠ—X is
2 D
‘ p—X
C02 4C0 xN ƒ C1 52
C2
1
C
C 2 0
n ƒ 3 n2
n2 sx2
2n2
(16)
Here
´
n µ ³
X
ƒŠ ƒ ‚0 ƒ 4‚1 ƒ 15xi
”
‘e
iD1
³
´¶
Š ƒ ‚0 ƒ 4‚1 ƒ 15xi
ƒ”
1
‘e
´
n µ ³
X
ƒŠ ƒ ‚0 ƒ 4‚1 ƒ 15xi
D
C1
”
‘e
iD1
³
´¶
Š ƒ ‚0 ƒ 4‚1 ƒ 15xi
ƒ”
xi 1
‘e
C0 D
and
C2 D
³
´
n µ
X
ƒŠ ƒ ‚0 ƒ 4‚1 ƒ 15xi
ƒŠ ƒ ‚0 ƒ 4‚1 ƒ 15xi
”
‘e
‘e
iD1
³
´¶
Š ƒ ‚0 ƒ 4‚1 ƒ 15xi
Š ƒ ‚0 ƒ 4‚1 ƒ 15xi
ƒ
”
0
‘e
‘e
The logit transformation is again used for inference, and the
corresponding expression is the same as in Section 3.1.
3.3
Sectional Coverage Probability
In certain situations, agreement requirements might be different for certain sections of an analytical range. For example,
in measuring a clinical marker for cancer cells, there might
be a threshold value for which certain therapies would be prescribed when the marker value exceeds the threshold value.
Therefore, we might require stricter agreement criteria near
the threshold window. Suppose that the analytical range in the
interval of a to b can be classiŽ ed into m sections based on
the gold standard method. Then the section di becomes
di D ai ƒ aiƒ1 1 i D 11 21 : : : 1 m1
where a0 D a1 ai > aiƒ1 1 am D b, and ai values are prespeciŽ ed.
For di , an acceptable form of agreement would mean that
Y should largely lie inside an interval of length 2Ii centered
around X. Under this assumption, we need to consider the
sectional coverage probability. For this purpose, the integrated
sectional coverage probability ( ISCP) between Y and X over
those intervals is deŽ ned as
Pm
d Pr6 —Y ƒ X — < Ii 7
0
ISCP D  I D iD1 i
(17)
bƒa
This coverage probability involves only the distribution of D D
Y ƒ X for both random and Ž xed X, which for the purpose
of this article is assumed to have a normal distribution with
parameters Œd and ‘ d2 . Thus (17) becomes
h ±
²
±
²i
Pm
ƒ aiƒ1 5 ê Ii ‘ƒŒ d ƒ ê ƒI‘i ƒŒ d
iD1 4ai
d
d
I D
0
(18)
bƒa
The estimator of  I , denoted by pI , is obtained by substituting
ŒO d and sd for Œd and ‘ d in (18). By the same method as in
Section 3.1, it can be shown that pI is a consistent estimator
of  I with asymptotic variance
µm
³ ³
´
X
Ii ƒ Œd
1
ƒ
‘ p2I D
4a
a
5
”
iƒ1
n4b ƒ a52 iD1 i
‘d
³
´´¶2
I C Œd
ƒ” i
‘
µm
³ d
³
´
Ii ƒ Œd
Ii ƒ Œd
1 X
C
4ai ƒ aiƒ1 5
”
‘d
‘d
2 iD1
³
´´¶2
I C Œd
I C Œd
C i
” i
0
(19)
‘d
‘d
Again, the logit transformation should be used for inference.
3.4
Proportional Error Case
In practice, both Y and X are usually positive-valued variables. However, XY has a bounded variance. Here we evaluate the proportion change 4 XY 5 rather than absolute difference
4Y ƒ X5, because the error is proportional to the measurement.
Let 100ˆ% represent the percent change between Y and X.
A simpliŽ ed approach in this case is to assume that ln Y and
ln X have a bivariate normal distribution. Thus the probability
X
that Y lies in the interval 1Cˆ
to X41 C ˆ5 is given by
µ
¶
X
< Y < X41 C ˆ5
CPˆ D  ˆ D Pr
1Cˆ
µ
¶
Y
1
D Pr
< < 1Cˆ
X
1Cˆ
D Pr6— ln Y ƒ ln X — < ln41 C ˆ570
Let D D ln Y ƒ ln X and Š D ln41 C ˆ5. Then all algorithms
of the previous section can be applied to the logs to obtain
CPŠ and TD I . Here the TD I is a percent,
TD I % D 100ˆ % D 1004eŠ  ƒ 15%0
In this proportional error case, we should also compute the
CCC and MSD from the log transformation of the data.
4.
ASYMPTOTIC POWER
In this section we investigate the asymptotic power of
accepting agreement among estimates of MSD, CCC, and CP.
Inference based on TD I can be assessed through MSD. Therefore, the asymptotic power of the estimate of TD I is the same
262
Journal of the American Statistical Association, March 2002
as that of MSD. For comparison, we must establish a common
agreement criteria. We say that Y and X are in agreement if
“ 2 < “02 , š1 < š < š0 , and  > 0 where “0 1 š0 , and 0 are
0
prespeciŽ ed values. We refer to these as null values. We compute the chance of declaring agreement under the alternative
values at “ 2 D “12 1 š D š1 , and  D 1 . We refer to these as
power. In addition, let the null and alternative values of ‘ y‘ x
‘ ‘
be ‘ y0‘ x0 and ‘ y1‘ x1 , and let h D ‘ y0y1‘ x0
. Let the null and alterx1
ƒ
ƒ
native values of Œy Œx be Œy0 Œx0 and Œy1 ƒ Œx1 . Furthermore, let sign0 be positive when Œy0 ƒ Œx0 ¶ 0 and negative
otherwise, and let sign1 be positive when Œy1 ƒ Œx1 ¶ 0 and
negative otherwise.
4.1
The asymptotic variances of W 1 Z, and T statistics evaluated
at “m2 1 šm , and m become
2 D ƒm
‘ Wm
1
nƒ2
2 D ‡m
‘ Zm
1
nƒ2
and
–m
0
nƒ3
The regions for declaring agreement by using W 1 Z, and T at
the  signiŽ cance level are
2 D
‘ Tm
W µ —0 ƒ êƒ 1 41 ƒ 5‘ W 0 1
When the Target Values are Random
Let the corresponding log MSD, Z value of CCC, and logit
CP values evaluated at “ 2 D “m2 1 š D šm , and  D m 1 m D
01 1, be
—m D ln4…m2 51
³
´
1 C cm
†m D ln
1
1 ƒ cm
and
’m D ln
³
´
 Šm
1
1 ƒ  Šm
(20)
1
ƒ 2m 51
šm
2 D
‘ dm
‘ ym‘ xm 4šm C 1=šm ƒ 2m 51
Let
µ
D
ƒm 2 1 ƒ
‡m D
“m
1
4šm C 1=šm ƒ 2m 5 2
0
“m4
4“m2 C šm C 1=šm ƒ 2m 52
and
¶
1
“m4 4cm
ƒ
1
241 ƒ 2cm 52 2m
4.2
(22)
1
€
"
µ
p
p #2
êƒ1 41 ƒ ‚5 –1 ƒ êƒ1 41 ƒ 5 –0
C 30
’0 ƒ ’1
We do not show the asymptotic power and sample size for
precision and accuracy here; one can easily follow the same
principles for precision and accuracy.
and
2
 km
41 ƒ  km 52
nT D
(21)
41 ƒ 2m 52cm 2“m2 41 ƒ cm 53cm
C
41 ƒ 2cm 52m
41 ƒ 2cm 52 m
–m D
The asymptotic powers of accepting agreement at “ 2 D
“12 1 š D š1 , and  D 1 by using W 1 Z, and T are
µ
¶
— ƒ —1 ƒ ê ƒ1 41 ƒ 5‘ W 0
PW D ê 0
1
‘W1
µ
¶
†0 ƒ †1 C êƒ1 41 ƒ 5‘ Z0
D
ƒ
PZ 1 ê
1
‘ Z1
¶
’0 ƒ ’1 C êƒ1 41 ƒ 5‘ T 0
0
‘T1
The required sample sizes yielding the same power 1 ƒ ‚ by
using W 1 Z, and T are
"
p
p #2
ê ƒ1 41 ƒ ‚5 ƒ1 C êƒ1 41 ƒ 5 ƒ0
C 21
nW D
—0 ƒ —1
"
p
p #2
ê ƒ1 41 ƒ ‚5 ‡1 ƒ êƒ1 41 ƒ 5 ‡0
C 21
nZ D
†0 ƒ †1
2m
1
“m2 C šm C 1=šm
³
´
³
´
k
k
ƒ „m ƒ ê ƒ
ƒ „m 1
 km D ê
‘ dm
‘ dm
„m D signm
T ¶ ’0 C ê ƒ1 41 ƒ 5‘ T 0 0
PT D 1 ƒ ê
cm D
and
and
and
where
…m2 D ‘ ym‘ xm 4“m2 C šm C
Z ¶ †0 C ê ƒ1 41 ƒ 5‘ Z0 1
µ ³
´
³
´¶2
k
k
ƒ „m ƒ ” ƒ
ƒ „m
”
‘ dm
‘ dm
µ³
´ ³
´
k
k
1
C
ƒ „m ”
ƒ „m
‘ dm
2 ‘ dm
³
´ ³
´¶2
k
k
C
C „m ”
C „m
0 (23)
‘ dm
‘ dm
When the Target Values are Fixed
2
When X is Ž xed, we simply use xN and sxm
in place of
2
Œxm and ‘ xm . Equations (14), (7), (9), and (16) are evaluated
at both the null and alternative values to obtain the Ž xed X
equivalents of equations (20), (21), (22), and (23).
4.3
Comparisons of Asymptotic Powers
The above powers and sample sizes for TD I (MSD) and CP
when X is random and for TD I when X is Ž xed depend on
‘ ‘
the value of h D ‘ y0y1‘ x0
. To compare these powers, we assume
x1
that h D 1.
The power of CP depends on the choice of Š. We let
Š D 105‘ d0 1 2‘ d0 1 205‘ d0 when X is random and Š D ‘ e0 1
105‘ e0 1 2‘ e0 when X is Ž xed. We let “0 D 0151 š0 D 1015,
Lin et al.: Assessing Agreement
and 0 D 081 091 0951 099 and evaluate the power at “1 D
ƒ
005 and 0101 š1 D 1005 and 1010, and 1 D tanh6tanh 1 40 5 C
ƒ1
017 and tanh6tanh 40 5 C 0257. The null hypotheses correspond
to a 15% location shift per 4‘ y‘ x 51=2 and a 15% scale shift for
various precision values. These null hypotheses are compared
to alternatives with two levels of improved precision, location,
and scale shifts. The ranges were chosen so that most power
values are in the broad range of 02 to 099. When X is Ž xed,
from model (5) we let X D ƒ11 0, and 1. We let Y have values equally distributed among the three levels of X values. We
let ‚1 D š1 ‚0 D 4“ 2 sx2 š51=2 , and ‘ e2 D sx2 š 2 41 ƒ 2 5 for the
corresponding “1 š, and  values in the null and alternative
cases.
Table 1 presents the asymptotic powers among TD I, CCC,
and CP when X is random and Ž xed for n D 30 and  D
005. The results indicate that the asymptotic powers of TD I,
CPŠ1 1 CPŠ2 , and CPŠ3 are in general quite similar whenever X
is random or Ž xed. Therefore, the choice of Š has little impact
on the power.
The asymptotic power of CCC is inferior to that of TD I
(MSD) in all test cases when X is random or Ž xed. Presumably, the estimation for the denominator of the CCC increases
the noise. Therefore, for statistical inferences, CP and TD I
are always preferable to CCC, especially with normally distributed data. However, in the case of a higher correlation coefŽ cient (.99) when X is Ž xed, the CCC has a similar power.
Regardless of the fact that the power performance is not very
appealing, the CCC, precision, and accuracy remain as useful
descriptive tools.
5.
SIMULATION
For statistical inference based on any of the foregoing agreement measurement estimates, we would replace the parameters with their sample counterparts in the respective variance
expressions. To assess the asymptotic normality and power of
the methods, we performed two Monte Carlo simulations for
X when random and Ž xed. In each simulation, we examined
two cases: one case representing the null hypothesis and the
other representing the alternative hypothesis. For comparison,
we selected the same cases as shown in Table 1. The simulations also attempted to cover those CP, CCC, precision, and
accuracy values near their boundaries.
For the simulation when X is random, paired samples were
generated from each of the following bivariate normal distribution cases:
1. The null hypothesis case with mean (0151 0), variance
(1015, 1=1015), and correlation 0 . Here “0 D 015 and š0 D
1015.
2. The alternative hypothesis case with mean (01, 0), variance (101, 1=101), and correlation tanh[tanhƒ1 40 C 0257. Here
“1 D 01, š1 D 101, and h D 1.
For the simulation when X is Ž xed, we generated univariate normal samples, with equal sample size among the
X D ƒ11 01 1 values. We let ‚1 D š, ‚0 D 4“ 2 sx2 š51=2 , and
‘ e2 D sx2 š 2 41 ƒ 2 5 under (5) for the corresponding “1 š, and
 values in the following null and alternative cases:
1. The null hypothesis case with “0 D 015, š0 D 1015, and
correlation 0 .
263
2. The alternative hypothesis case with “1 D 01, š1 D 101,
ƒ
and correlation 1 D tanh6tanh 1 40 C 0257.
In each of the these random and Ž xed cases, we let 0 D 095
and 099. The samples generated correspond to n D 15, n D 30,
and n D 60. We have a total of 24 situations: 2 cases by 2
levels of precision by 3 levels of sample size for X when
random and when Ž xed. For each situation, 5,000 runs were
performed. Tables 2–5 present the simulation results for cases
where X is random and Ž xed. In all of these tables, the fourth
column presents the theoretical value of precision, accuracy,
CCC, TD I, and CP for the null and alternative hypotheses.
To assess the robustness of each agreement statistics estimate, in each run, we calculated estimates of the respective
transformation of precision (Z), accuracy (logit), CCC (Z),
TD I09 (ln MSD), CPŠ1 , and CPŠ3 (logit). The Š1 and Š3 values
correspond with those in Table 1. The mean and standard deviation of each estimate based on 5,000 runs were computed.
The respective antitransformation of each mean estimate is
reported in the Ž fth column. Comparisons between the fourth
and Ž fth columns were used to assess the robustness of the
estimates. The standard deviation of each estimate is reported
in the sixth column (“Std of estimate”). The mean of the standard deviation estimate of each estimate based on 5,000 runs
was also computed. This is reported in the seventh column
(“Mean of std”). Comparisons between the sixth and seventh
columns were used to assess the robustness of the variance
estimates.
To assess the asymptotic normality for the signiŽ cance level
(case 1) and power (case 2) of each estimate at  D 005,
for each run we computed the proportion of each estimate
among 5,000 runs that falls into the respective rejection region
(accepting agreement) in Section 4. These proportions are
reported in the eighth column (“Proportions in reject region”).
The proportions in case 1 represent the signiŽ cance level while
the proportions in case 2 represent the power of accepting case
2 against the null hypothesis of case 1 at  D 005. For comparison, the corresponding theoretical probabilities are shown
in the last column.
For point estimates, the results showed that all but the precision estimates are robust (i.e., have little bias) for all 24 situations in this study, even when n D 15. The precision estimate
performs well when X is random, but tends to overestimate
when X is Ž xed and the sample size is smaller.
For standard deviation estimates, there was practically no
discrepancy between the sixth and seventh columns in all situations. For signiŽ cance level and power, the results showed
that all but the precision estimates are accurate for all 24 situations in this study, even when n D 15. The precision estimate
performs well when X is random, but tends to reject more
often due to overestimates when X is Ž xed and the sample
size is smaller. The power of the CCC estimate tends to be
larger in the simulation study than its theoretical value.
6.
EXAMPLES
This section presents two examples based on real data. One
example compares the agreement of two instruments in measuring blood counts in human samples, and the other compares the agreement of an assay to measure factor V III against
264
Journal of the American Statistical Association, March 2002
Table 1. Asymptotic Power of Accepting the Agreement, Where Š1 D 1.5‘ d 0 , Š2 D 2‘ d 0 , and Š3 D 2.5‘ d 0 (Random); Š1 D‘ e 0 , Š2 D 1.5‘ e0 , and
Š3 D 2‘ e0 (Fixed); and “0 D .15, š0 D 1.15, n D 30, and  D .05
Random
Fixed
0
“1
š1
1
TDI
CCC
CPŠ1
CPŠ2
CPŠ3
TDI
CCC
CPŠ1
CPŠ2
CPŠ3
.80
.05
1.05
.8332
.8614
.8332
.8614
.2601
.5149
.2368
.4798
.1936
.3666
.1783
.3432
.3128
.5781
.2854
.5451
.3258
.5916
.2975
.5592
.3330
.5986
.3043
.5667
.3905
.6609
.2981
.5589
.2623
.5120
.2357
.4815
.4246
.6793
.3294
.5875
.4468
.6947
.3488
.6070
.4625
.7048
.3627
.6197
.8332
.8614
.8332
.8614
.2341
.4757
.2126
.4417
.1787
.3421
.1643
.3196
.2822
.5413
.2564
.5082
.2946
.5562
.2677
.5235
.3017
.5645
.2743
.5320
.3586
.6238
.2702
.5201
.2399
.4786
.2144
.4477
.3920
.6459
.2993
.5510
.4141
.6637
.3178
.5721
.4300
.6756
.3313
.5863
.9174
.9318
.9174
.9318
.3752
.6478
.3217
.5815
.2644
.4551
.2286
.4060
.4379
.6931
.3805
.6362
.4513
.7017
.3933
.6467
.4574
.7054
.3991
.6514
.5209
.7771
.3928
.6585
.4008
.6763
.3438
.6203
.5512
.7794
.4264
.6766
.5706
.7869
.4468
.6910
.5828
.7914
.4601
.6996
.9174
.9318
.9174
.9318
.3155
.5738
.2679
.5082
.2290
.4026
.1968
.3562
.3738
.6298
.3200
.5700
.3880
.6430
.3328
.5846
.3958
.6506
.3398
.5927
.4553
.7145
.3333
.5872
.3437
.6086
.2900
.5495
.4873
.7240
.3647
.6110
.5090
.7374
.3849
.6303
.5239
.7464
.3992
.6434
.9589
.9662
.9589
.9662
.5814
.8146
.4712
.7134
.4117
.6112
.3321
.5202
.6323
.8239
.5302
.7441
.6391
.8243
.5386
.7475
.6391
.8228
.5390
.7467
.7259
.9036
.5560
.7851
.6291
.8522
.5234
.7734
.7357
.8882
.5836
.7860
.7431
.8854
.5979
.7909
.7454
.8829
.6042
.7924
.9589
.9662
.9589
.9662
.4585
.7020
.3588
.5894
.3293
.5089
.2598
.4211
.5182
.7356
.4164
.6399
.5326
.7468
.4296
.6535
.5397
.7532
.4355
.6608
.6131
.8249
.4390
.6730
.5102
.7532
.4058
.6555
.6313
.8153
.4684
.6836
.6486
.8234
.4881
.7001
.6595
.8290
.5006
.7110
.9918
.9933
.9918
.9933
.9936
.9989
.9270
.9697
.9486
.9813
.8388
.9186
.9787
.9902
.9118
.9502
.9726
.9866
.8982
.9401
.9675
.9839
.8830
.9301
.9993
.9999
.9805
.9951
.9985
.9998
.9796
.9951
.9963
.9986
.9705
.9867
.9923
.9965
.9630
.9824
.9870
.9937
.9514
.9763
.9918
.9933
.9918
.9933
.9241
.9712
.7201
.8240
.7875
.8702
.5880
.6957
.9099
.9544
.7376
.8249
.9195
.9598
.7493
.8376
.9220
.9612
.7496
.8402
.9856
.9966
.8803
.9473
.9711
.9923
.8658
.9413
.9676
.9848
.8702
.9243
.9692
.9858
.8689
.9246
.9640
.9828
.8598
.9230
1.10
.10
1.05
1.10
.90
.05
1.05
1.10
.10
1.05
1.10
.95
.05
1.05
1.10
.10
1.05
1.10
.95
.05
1.05
1.10
.10
1.05
1.10
NOTE:
h D 1 for TDI and CP when X is random, and for TDI when X is ’ xed.
known target values in test tubes (in vitro). The former is the
constant error case when X is random, and the latter is the
proportional error case when X is Ž xed.
6.1
Constant Error When the Target
Values are Random
Diaspirin crosslinked hemoglobin (DCLHb) is a solution
containing oxygen-carrying hemoglobin. The solution was created as a blood substitute to treat acute trauma patients and to
replace blood loss during surgery. Measurements of DCLHb
in patient’s serum after infusion are routinely performed using
a Sigma instrument. A method of measuring hemoglobin
called the HemoCue photometer was modiŽ ed to reproduce
the Sigma instrument DCLHb results. To validate this modiŽ ed method, serum samples from 299 patients over the analytical range of 50–2000 mg/dL were collected. DCLHb values of each sample were measured simultaneously with the
HemoCue and Sigma methods. Agreement was deŽ ned as
having at least 90% of pair observations over the analytical
range of 50–2000 mg/dL within 150 mg/dL of each other
and a within-sample total deviation not more than 15% of the
total deviation. This translates into a least acceptable CCC of
09775 41 ƒ 0152 5.
Figure 1 presents the plot of HemoCue versus Sigma measurements of DCLHb. The plot indicates that the withinsample error is relatively constant across the clinical range.
The plot also indicates that the HemoCue accuracy is excellent and that the precision is adequate.
Table 6 presents the agreement statistics and the appropriate 95% upper or lower conŽ dence limits. The CCC estimate
is .9866, which means that the within-sample total deviation
is about 11.6% of the total deviation. The CCC one-sided
lower conŽ dence limit is .9838, which is greater than .9775.
The precision estimate is .9867 with a one-sided lower conŽ dence limit of .9839. The accuracy estimate is .9999 with a
one-sided lower conŽ dence limit of .9989. The MSD estimate
is 6,007 with a one-sided upper conŽ dence limit of 6,875.
The TD I09 estimate is 127.5 mg/dL, which means that 90%
of HemoCue observations were within 127.5 mg/dL of their
target values. The one-sided upper conŽ dence limit for TD I09
is 136.4 mg/dL, which is less than 150 mg/dL. Finally, the
CP150 estimate is .9463, which means that 94.6% of HemoCue
observations are within 150 mg/dL of their target values. The
Lin et al.: Assessing Agreement
265
Table 2. Results of the Simulation Study When X is Random for Moderate Precision
Case
Sample
size
15
H0
30
60
15
H1
30
60
NOTE:
Statistics
Theoretical
value
Mean of
estimate
Std of
estimate
Mean of
std
Proportion in
reject region
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
.9500
.9794
.6200
.9304
.8303
.9789
.9500
.9794
.6200
.9304
.8303
.9789
.9500
.9794
.6200
.9304
.8303
.9789
.9662
.9905
.4843
.9571
.9220
.9969
.9662
.9905
.4843
.9571
.9220
.9969
.9662
.9905
.4843
.9571
.9220
.9969
.9531
.9767
.6213
.9248
.8143
.9765
.9516
.9783
.6209
.9280
.8220
.9776
.9506
.9788
.6211
.9290
.8267
.9785
.9685
.9892
.4848
.9539
.9132
.9965
.9673
.9901
.4856
.9555
.9180
.9967
.9669
.9904
.4831
.9566
.9192
.9967
02848
09218
03696
02531
05971
102977
01913
06448
02607
01727
04017
08381
01314
04324
01799
01188
02789
05775
02824
100575
03756
02552
08592
200338
01901
07743
02610
01734
05679
103489
01327
05293
01825
01230
03817
08922
02887
09669
03804
02437
06098
102469
01925
06391
02616
01690
04096
08353
01325
04375
01825
01186
02837
05786
02887
101993
03841
02505
08286
109178
01925
07884
02634
01739
05577
102914
01325
05310
01838
01221
03840
08868
.0590
.0498
.0486
.0364
.0474
.0596
.0548
.0558
.0532
.0354
.0492
.0592
.0528
.0518
.0488
.0368
.0482
.0550
.1926
.1736
.3250
.2084
.3088
.3486
.2952
.2890
.5712
.3764
.5534
.5946
.4744
.5088
.8626
.6442
.8434
.8604
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
01777
02554
03568
02594
04347
04682
02786
03744
05894
04211
06399
06608
04513
05575
08516
06613
08573
08613
“ 0 D 0151 š 0 D 10151 0 D 095; “ 1 D 011 š 1 D 1011 1 D 09662.
one-sided lower conŽ dence limit for CP150 is .9276, which is
greater than .9. The agreement between HemoCue and Sigma
is acceptable with excellent accuracy and adequate precision.
The relative bias squared is estimated to be .003, and so the
approximation of TD I should be excellent.
6.2
Theoretical
probability
Proportional Error When the Target
Values are Fixed
A study was designed to evaluate Dade International’s
reagent test system for clottable factor V III (FV III) assay.
The FV III assay involved measuring modiŽ ed activated partial thromboplastin time (APTT) with varying dilutions of
plasma and speciŽ c factor-deŽ cient substrate. Using a reference plasma of known FV III activity, a standard curve was
prepared. The dilution scheme of the standard curve started
at either 1:5 or 1:10, and serial dilutions were prepared until
the target values were reached. The percent of FV III activity
present in plasma was determined by the degree of correction of the APTT. The reagent test system consisted of Dade
actin-activated cephaloplastin reagent and Dade factor assay
reference plasma. The target values were 3%1 8%1 38%1 91%,
and 108%. Each level was assayed for six FV III observations
starting at 1:5 and 1:10. One FV III value started at 1:5 at the
91% target value was missing.
Figure 2 presents the results started at 1:5, and Figure 3
presents the same at 1:10 for the plots of observed FV III
assay results versus targeted values in log2 scale. Note that
in Figure 2, four 3% and two 2% were observed at the target value of 3%. Circles at the target value of 8% represents
duplicate readings of 8%, 9%, and 10%. Duplicate readings
of 45% were observed at target values of 38%. Also note that
in Figure 3, four 5% and two 4% were observed at the target
value of 3%, three 11% and two 12% were observed at the
target value of 8%, duplicate readings of 49% were observed
at target values of 38%, and duplicate readings of 124% were
observed at target values of 91%. The plots indicate that the
within-sample error was relatively constant across the target
values in log scale. The precision was good for both, but the
accuracy was not as good for the assay started at 1:10.
The client deŽ ned an acceptable agreement as having 80%
of FV III assay values over the analytical range of 3%–108%
within 50% from the target percentage values. The client also
wanted, in log scale due to proportional error by dilution, the
within-sample total deviation to be not more than 15% of the
total deviation. This translated into a least acceptable CCC of
09775. Table 7 presents the agreement statistics and their 95%
upper or lower conŽ dence limits for the assays started at 1:5
and 1:10.
266
Journal of the American Statistical Association, March 2002
Table 3. Results of the Simulation Study When X is Random for High Precision
Case
Sample
size
15
H0
30
60
15
H1
30
60
NOTE:
Statistics
Theoretical
value
Mean of
estimate
Std of
estimate
Mean of
std
Proportion in
reject region
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
.9900
.9794
.4098
.9696
.7600
.9590
.9900
.9794
.4098
.9696
.7600
.9590
.9900
.9794
.4098
.9696
.7600
.9590
.9933
.9905
.2965
.9839
.9031
.9959
.9933
.9905
.2965
.9839
.9031
.9959
.9933
.9905
.2965
.9839
.9031
.9959
.9907
.9774
.4104
.9671
.7448
.9563
.9903
.9784
.4116
.9683
.7545
.9586
.9902
.9790
.4103
.9691
.7571
.9587
.9938
.9897
.2978
.9826
.8957
.9957
.9936
.9902
.2969
.9834
.8998
.9958
.9934
.9904
.2968
.9837
.9018
.9959
02820
05252
03567
02176
05103
100542
01893
03556
02490
01472
03472
06958
01301
02434
01724
01005
02483
04875
02831
05879
03636
02235
07845
200080
01898
03924
02494
01531
05075
102364
01301
02772
01743
01074
03539
08481
02887
05086
03592
02108
05373
100348
01925
03486
02465
01455
03606
06934
01325
02432
01721
01016
02483
04750
02887
05753
03664
02205
07728
108336
01925
03908
02518
01521
05166
102190
01325
02730
01758
01065
03564
08397
.0618
.0390
.0558
.0320
.0462
.0624
.0568
.0440
.0520
.0350
.0462
.0672
.0554
.0436
.0544
.0394
.0500
.0632
.1986
.3690
.5166
.3704
.4778
.5410
.3096
.6678
.8270
.6696
.8182
.8500
.4782
.9138
.9850
.9124
.9834
.9898
Theoretical
probability
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
01788
04594
05481
04400
05949
06382
02807
07043
08240
06957
08249
08402
04548
09218
09797
09231
09697
09691
“ 0 D 0151 š 0 D 10151 0 D 099; “ 1 D 011 š 1 D 1011 1 D 09933.
For the results started at 1:5, the CCC was estimated to
be 09917, which means that the within-sample total deviation is about 901% of the total deviation. The one-sided lower
conŽ dence limit was 09875, which is greater than 09775. The
precision was estimated to be 09942 with a one-sided lower
conŽ dence limit of 09908, the accuracy was estimated to be
09975 with a one-sided lower conŽ dence limit of 09935, and
the MSD was estimated to be 00356 with a one-sided upper
conŽ dence limit of 00549 (both at log scale). The TD I08 % was
estimated to be 2703%, which means that 80% of observations
were within a 2703% change from the target values (percentage
of percentage values). The one-sided upper conŽ dence limit
was 35%, which is less than 50%. Finally, the CP50% was estimated to be 09653, which means that 9605% of observations
were within a 50% change from target values. The one-sided
lower conŽ dence limit was 08921, which is greater than 08.
The agreement between the FV III assay and the actual concentration was acceptable, with good precision and accuracy.
The relative bias squared was estimated to be 0009, so that the
approximation of TD I should be excellent.
For the results started at 1:10, the CCC was estimated to be
09669, which means that the within-sample total deviation is
about 1802% of the total deviation. The one-sided lower conŽ dence limit was 09584, which is less than 09775. The precision
was estimated to be 09947 with a one-sided lower conŽ dence
limit of 09917, the accuracy was estimated to be 09721 with a
one-sided lower conŽ dence limit of 09638, and the MSD was
estimated to be 01308 with a one-sided upper conŽ dence limit
of 01677 (both at log scale). The TD I08 % was estimated to be
5809%, which means that 80% of observations were within
a 5809% change from the target percentage values. The onesided upper conŽ dence limit was 6900%, which is greater than
50%. Finally, the CP50% was estimated to be 07016, which
means that 7002% of the observations were within a 50%
change from the target values. The one-sided lower conŽ dence
limit was 05898, which is less than 08. The agreement between
the FV III assay and actual concentration was not acceptable
with good precision and mediocre accuracy. The relative bias
squared was estimated to be 30747, so the approximation of
TD I should be acceptable.
7.
7.1
DISCUSSION AND FUTURE STUDY
Agreement Measurements Summary
Under the normal or log-normal distribution, each of the
agreement measurements (MSD, CCC, TD I, and CP) basically
measures the same information but from different perspectives. Note that the asymptotic variances of MSD, CCC, precision, and accuracy were derived from the covariance matrix of
the sample moments, in which the normality assumption was
Lin et al.: Assessing Agreement
267
Table 4. Results of the Simulation Study When X is Fixed for Moderate Precision
Case
Sample
size
15
H0
30
60
15
H1
30
60
NOTE:
Statistics
Theoretical
value
Mean of
estimate
Std of
estimate
Mean of
std
Proportion in
reject region
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
.9500
.9794
.6339
.9304
.8302
.9339
.9500
.9794
.6339
.9304
.8302
.9399
.9500
.9794
.6339
.9304
.8302
.9339
.9662
.9905
.4843
.9571
.9285
.9841
.9662
.9905
.4843
.9571
.9285
.9841
.9662
.9905
.4843
.9571
.9285
.9841
.9597
.9800
.6320
.9351
.8191
.9305
.9543
.9797
.6359
.9322
.8222
.9305
.9525
.9796
.6327
.9317
.8276
.9331
.9730
.9904
.4822
.9602
.9235
.9837
.9692
.9906
.4862
.9581
.9238
.9830
.9678
.9906
.4844
.9577
.9267
.9837
02106
08585
03751
01740
06109
09218
01417
05895
02564
01189
03996
05856
00959
04019
01779
00823
02761
04014
02115
09707
03732
01779
08696
104270
01407
07395
02596
01229
05703
09098
00980
04951
01843
00876
04012
06344
02126
08914
03706
01710
06066
08825
01422
05878
02569
01189
04044
05840
00980
04011
01800
00832
02811
04060
02099
101114
03754
01774
08562
103586
01403
07750
02600
01232
05677
08926
00966
04975
01821
00862
03951
06205
.1282
.0670
.0546
.0760
.0504
.0594
.0992
.0662
.0516
.0670
.0486
.0552
.0792
.0650
.0496
.0612
.0484
.0546
.4034
.2094
.3896
.4654
.3696
.4078
.5204
.3494
.6340
.6828
.6226
.6498
.7380
.5750
.9060
.9154
.9026
.9104
Theoretical
probability
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
02357
02824
04264
04076
04537
05052
04028
04141
06730
06555
06836
07110
06548
06093
09088
08987
09006
09023
“ 0 D 0151 š 0 D 10151 0 D 095; “ 1 D 011 š 1 D 1011 1 D 09662.
used, even though the point estimates do not depend on the
normality assumption. None of the methods proposed in this
article is expected to be robust against outliers and/or large
deviation from normality.
The interpretation of MSD is difŽ cult to understand. The
TD I is desirable as an alternative because of its straightforward
interpretation. The CP is the most intuitively clear approach;
it mirrors the information provided by the TD I. Both TD I and
CP depend on the normality assumption and offer better power
for inference than the CCC. The CP would have difŽ culty discriminating among instruments or assays that have excellent
agreement, all because the CP values would be very close to
1. In this case, the TD I can be used to discriminate among
these.
When a meaningful clinical range is known and the study is
conducted over that range, the CCC offers a meaningful geometric interpretation and is unit free. Furthermore, the accuracy and precision components of the CCC offer more insight.
Therefore, the CCC, accuracy, and precision remain very useful tools. Note that when Y and X are not linearly related, the
CCC will capture the total deviation. However, it will treat the
nonlinear deviation as imprecision rather than inaccuracy.
The CCC, ICC, and Pearson correlation coefŽ cient depend
largely on the analytical range and the intrasample variation.
This property makes sense when we want to assess agreement over the range of all potential outcomes. Good agreement over a small range of measurements (e.g., at low concentration only) can not be extrapolated to infer good agreement
over a larger range of measurements (e.g., at higher concen-
Figure 1. HemoCue (Vertical) and Sigma Readings (Horizontal) on
Measuring DCLHb.
268
Journal of the American Statistical Association, March 2002
Table 5. Results of the Simulation Study When X is Fixed for High Precision
Sample
size
Case
15
30
H0
60
15
30
H1
60
NOTE:
Statistics
Theoretical
value
Mean of
estimate
Std of
estimate
Mean of
std
Proportion in
reject region
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
Precision
Accuracy
TDI
CCC
CPŠ1
CPŠ3
.9900
.9794
.4190
.9696
.7534
.8879
.9900
.9794
.4190
.9696
.7534
.8879
.9900
.9794
.4190
.9696
.7534
.8879
.9933
.9905
.2965
.9839
.9101
.9809
.9933
.9905
.2965
.9839
.9101
.9809
.9933
.9905
.2965
.9839
.9101
.9809
.9919
.9795
.4256
.9707
.7473
.8880
.9910
.9795
.4214
.9703
.7514
.8888
.9905
.9795
.4199
.9700
.7527
.8885
.9947
.9906
.2993
.9847
.9109
.9832
.9940
.9906
.2984
.9842
.9096
.9817
.9937
.9906
.2968
.9841
.9106
.9816
02010
03690
02923
01342
04232
06513
01364
02556
02030
00932
02899
04337
00943
01773
01410
00648
02010
02963
02021
04564
03165
01494
07608
103544
01381
03104
02172
01020
04879
08371
00945
02169
01554
00732
03422
05759
02059
03599
03013
01311
04527
06695
01373
02529
02059
00922
03011
04421
00946
01784
01434
00651
02065
03020
02053
04379
03198
01450
07546
102804
01369
03079
02205
01022
04944
08259
00943
02166
01542
00721
03397
05643
.1200
.0586
.0456
.0702
.0520
.0632
.1048
.0566
.0492
.0660
.0538
.0634
.0872
.0558
.0500
.0630
.0536
.0610
.4276
.6090
.6960
.7802
.7250
.7614
.5596
.8886
.9482
.9664
.9556
.9644
.7766
.9942
.9992
.9994
.9992
.9992
Theoretical
probability
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
005
02509
06351
07452
07236
07165
07437
04276
08730
09473
09413
09243
09230
06853
09874
09985
09982
09954
09933
“ 0 D 0151 š 0 D 10151 0 D 099; “ 1 D 01š 1 D 1011 1 D 09933.
tration). However, caution must be taken when using these
correlation coefŽ cients. Comparisons among these coefŽ cients
are meaningful only when the clinical study ranges are similar. Ranges by different units can be compared as long as they
have similar clinical interpretations.
7.2
Categorical Data
The CP has long been used in categorical data by summing
the diagonal elements of the joint probability matrix based on
assigning subjects’ scores to two raters. Cohen (1960) proposed using the kappa coefŽ cient to correct for the probability of agreement by chance. Cohen (1968) later improved the
Table 6. Agreement Statistics and Their Con’ dence Limits
for Example 1
Statistics
Estimate
95% Con’ dence limit
CCC
Accuracy
Precision
MSD
TDI09
CP150
09866
09999
09867
61007
12705
09463
09838
09989
09839
61875
13604
09276
NOTE:
Relative bias squared was estimated to be .003 (µ1; see sec. 2.4).
Allowance
Figure 2. Observed FVIII Assay Results (Vertical, %) Versus Targeted
Values (Horizontal, %) Started at 1:5.
09775
150
09
kappa coefŽ cient by assigning different weights according to
the degree of disagreements. Interestingly, CCC becomes the
weighted kappa proposed by Cohen (1968). Therefore, we can
use the CCC for categorical data in most situations (Robieson
1999; King and Chinchilli 2001).
Lin et al.: Assessing Agreement
269
Figure 3. Observed FVIII Assay Results (Vertical, %) Versus Targeted
Values (Horizontal, %) Started at 1:10.
7.3
More General Cases
This article provides a systematic treatment of modeling
agreement measurement methodology based on the most basic
bivariate model when X is random and on the regression
model when X is Ž xed. When X is random, the model does
not allow us to assess agreement among repeated estimates
of the target method. Therefore, it is not known whether any
mediocre agreement between two methods may have resulted
from imprecision in the target method. In an individual bioequivalence ( IBE) study, measures of the bioavailability are
recorded when a patient is given a test formulation twice (T1
and T2 ) and a reference formulation twice (R1 and R2 ) by a
four-period crossover design (Schall and Williams 1996). The
current FDA guideline is based on the absolute and relative
measures of E4T ƒ R5 2 ƒ E4R1 ƒ R2 52 . The denominator of
the relative measure is E4R1 ƒ R2 52 .
The many gold standard assays include immunoassays or
enzyme assays, which are imprecise. The methodologies and
designs used in an IBE study should be adopted for imprecise
Table 7. Agreement Statistics and Their Con’ dence Limits
for Example 2
Statistics
Estimate
95% Con’ dence limits
CCC
Accuracy
Precision
MSD
TDI08 %
CP50%
09917
09975
09942
00356
270347
09653
Started at 1:5
09875
09935
09908
00549
350010
08921
CCC
Accuracy
Precision
MSD
TDI08 %
CP50%
09669
09721
09947
01308
580949
07016
Started at 1:10
09584
09638
09917
01677
690007
05898
Allowance
09775
50
08
09775
50
08
NOTE: Relative bias squared was estimated to be 0009 for the assay started at 1:5 and 30747
for the assay started at 1:10 (µ8; see sec. 2.4).
assays. More importantly, to prove that a new method is better
than a gold standard method, one should adopt the four-period
design and principles used in the IBE study. Here we would
be in a position to demonstrate an acceptable E4T ƒ R52 and
further that E4T1 ƒ T2 52 < E4R1 ƒ R2 52 . We would then translate the measurement into CCC, precision, accuracy, TD I, and
CP for better interpretation. Our future articles will address
those more general cases.
When assessing the agreement coefŽ cients, there might be
situations where some other controllable factors could be
present. For example, in a four period crossover bioequivalence study, treatment-order effects might be present. In these
situations, we can eliminate these effects by Ž tting a mixedeffects model with these Ž xed effects and a random subject
effect in the model. Then the methods presented here can
be applied asymptotically by letting Y and X represent the
respective residuals computed from the model.
Chinchilli Martel, Kumanyka, and Lloyd (1996) extended
Lin’s CCC to repeated-measures designs by using a weighted
CCC. The CCC on multiple raters with robust estimates has
been studied by King and Chinchilli (2001). For more general approaches, the use of generalized estimating equations
(GEEs) possibly could be used to model the functions of the
agreement coefŽ cients. Barnhart and Williamson (2001) used
GEEs to model CCC with good results. They used three estimating equations, one to model location sums, one to model
the sums of squares, and one to model the cross-products with
Z transformation of CCC values computed from functions of
the foregoing. Potentially, GEEs also can be used to model
MSD, TD I, and CP. This approach is  exible enough to allow
comparisons of multiple agreement coefŽ cients in the presence of some explanatory covariates. Such an approach would
allow us to compare, for example, the agreements of two competing methods (A and B) with a gold standard method (C)
or, in other words, to compare the agreement of A and C to
the agreement of B and C. Thus GEE will Ž t nicely into the
framework of an IBE study for comparing the agreement of T
and R relative to the agreement of R1 and R2 , while controlling for period and order effects.
8.
CONCLUSION
We have summarized various methods for assessing the
agreement among individual paired samples when X is random or Ž xed and when error is constant or proportional.
We suggest using CCC, TD I, and CP to summarize the agreement results. These offer the same information from different perspectives. In addition, the coefŽ cients of accuracy and
precision should also accompany the results to identify the
sources of any disagreement. When we are conŽ dent that the
data have normal or lognormal distribution, inference should
be based on TD I and CP for better power of accepting
the agreement. We plan to provide more general approaches
regarding all of the above in the future.
For convenience, a validated SAS macro is provided at
http://www.uic.edu/ hedayat/ that computes the estimates
and conŽ dence limits for CCC, precision, accuracy, TD I, and
CP. We can specify when X is random or Ž xed and the error
is constant or proportional, along with the conŽ dence level,
CCC, CP, and TD I allowances. The program also generates
270
Journal of the American Statistical Association, March 2002
the agreement plot of Y versus X with the identity line under
a customized scale.
[Received December 2000. Revised July 2001.]
REFERENCES
Anderson, S., and Hauck, W. W. (1990), “Considersation of Individual
Bioequivalence,” Journal of Pharmacokinetics and Biopharmaceutics, 18,
259–273.
Barnhart, H. X., and Williamson, J. M. (2001), “Modeling Concordance Correlation via GEE to Evaluate Reproducibility,” Biometrics, 57, 931–940.
Bland. J. M., and Altman, D. (1986), “Statistical Methods for Assessing
Agreement Between Two Methods of Clinical Measurement,” Lancet, 8,
307–310.
Chinchilli, V. M., Martel, J. K., Kumanyka, S., and Lloyd, T. (1996), “A
Weighted Concordance Correlation CoefŽ cient for Repeated Measurement
Designs,” Biometrics, 52, 341–353.
Cohen, J. (1960), “A CoefŽ cient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 20, 37–46.
(1968), “Weighted Kappa: Nominal Scale Agreement With Provision for Scaled Disagreement or Partial Credit,” Psychological Bulletin, 70,
213–220.
Fisher, R. A. (1925), Statistical Methods for Research Workers, Edinburgh:
Oliver & Boyd.
Holder, D. J., and Hsuan, F. (1993), “Moment-Based Criteria for Determining
Bioequivalence,” Biometrika, 80, 835–846.
King, T. S., and Chinchilli, V. M. (2001), “A Generalized Concordance Correlation CoefŽ cient for Continuous and Categorical Data,” Statistics in
Medicine, 20, 2131–2147.
Lin. L. I. (1989), “A Concordance Correlation CoefŽ cient to Evaluate Reproducibility,” Biometrics, 45, 255–268.
(1992), “Assay Validation Using the Concordance Correlation CoefŽ cient,” Biometrics, 48, 599–604.
(1997), Rejoinder to the Letter to the Editor by Atkinson and Nevill,
Biometrics, 53, 777–778.
(2000), “Total Deviation Index for Measuring Individual Agreement:
With Application in Lab Performance and Bioequivalence,” Statistics in
Medicine, 19, 255–270.
Lin. L. I., and Torbeck, L. D. (1998), “CoefŽ cient of Accuracy and Concordance Correlation CoefŽ cient: New Satistics for Method Comparison,”
PDA Journal of Pharmaceutical Science and Technology, 52, 55–59.
Robieson, W. Z. (1999), “On the Weighted Kappa and Concordance Correlation CoefŽ cient,” Ph.D. dissertation, University of Illinois at Chicago.
Schall. R. (1995), “Assessment of Individual and Population Bioequivalence Using the Probability That Bioavailities are Similar,” Biometrics, 51,
615–626.
Schall, R., and Luus, H. G. (1993), “On Population and Individual Bioequivalence,” Statistics in Medicine, 12, 1109–1124.
Schall, R., and Williams, R. L. (1996), “Towards a Practical Strategy for
Assessing Individual Bioequivalence,” Journal of Pharmacokinetics and
Biopharmaceutics, 24, 133–149.
Sheiner, L. B. (1992), “Bioequivalence Revisted,” Statistics in Medicine, 11,
1777–1788.
Vonesh. E. F., and Chinchilli, V. M. (1997), Linear and Nonlinear Models for
the Analysis of Repeated Measurement, New York: Marcel Dekker.
Vonesh, E. F., Chinchilli, V. M., and Pu, K. (1996), “Goodness of
Fit in Generalized Nonlinear Mixed-Effect Models,” Biometrics, 52,
572–587.