Chapter 2-17. Bland-Altman Analysis
<< The section for the clustered data case is still under construction >>
In this chapter we see how to assess the agreement between two methods of clinical
measurement. Statisticians have labeled this type of analysis a methods comparison study.
The most popular methods comparison approach is called a Bland-Altman analysis. D.G.
Altman and J.M. Bland first published this approach in 1983 in a statistical journal (Altman and
Bland, 1983) and later in Lancet (Bland and Altman, 1986) to appeal to medical investigators.
Even though the approach is simple, some investigators make errors in applying the method.
Mantha et al. (2000) reviewed how the method was applied in seven anesthesia journals, reporting
that the quality of Bland-Altman analysis frequently varied. They proposed a reporting standard
for a Bland-Altman analysis.
We will practice with a dataset provided in the Bland and Altman (1986) paper.
Bringing this dataset into Stata,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on blandaltmanlancet1986.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
blandaltmanlancet1986", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use blandaltmanlancet1986, clear
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Chapter 2-17 (revision 9 Jan 2011)
p. 1
Listing the data,
Data
Describe data
List data
Main tab: Override minimum abbreviation of variable names: Characters: 15
OK
list , abbrev(15)
     +---------------------------------------------------------+
     | subject   wright1   wright2   miniwright1   miniwright2 |
     |---------------------------------------------------------|
  1. |       1       494       490           512           525 |
  2. |       2       395       397           430           415 |
  3. |       3       516       512           520           508 |
  4. |       4       434       401           428           444 |
  5. |       5       476       470           500           500 |
     |---------------------------------------------------------|
  6. |       6       557       611           600           625 |
  7. |       7       413       415           364           460 |
  8. |       8       442       431           380           390 |
  9. |       9       650       638           658           642 |
 10. |      10       433       429           445           432 |
     |---------------------------------------------------------|
 11. |      11       417       420           432           420 |
 12. |      12       656       633           626           605 |
 13. |      13       267       275           260           227 |
 14. |      14       478       492           477           467 |
 15. |      15       178       165           259           268 |
     |---------------------------------------------------------|
 16. |      16       423       372           350           370 |
 17. |      17       427       421           451           443 |
     +---------------------------------------------------------+
The study aim is to compare two methods of measuring peak expiratory flow rate (PEFR). For
each subject, two measurements were taken with a Wright peak flow meter and two with a mini
Wright meter, in random order.
The first measurement by each method will be used to illustrate the comparison of methods. The
second measurement will be used to assess repeatability.
An initial visual assessment of agreement is made using a scatterplot of the two methods,
overlaying a line of equality. If the two methods provide identical measurements, the pairs of
measurements will lie on this line.
Finding the minimum and maximum to use for graphing the line of equality,
sum wright1 miniwright1
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     wright1 |        17    450.3529    116.3126        178        656
 miniwright1 |        17    452.4706    113.1151        259        658
A line of equality that connects the ordered pairs (178 , 178) and (658 , 658) will pass through
the entire range of values.
Overlaying a scatterplot of the two methods with the line of equality,
twoway (scatter miniwright1 wright1 )(pci 178 178 658 658) , ///
xtitle(Wright) ytitle("Mini Wright") legend(off)
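[Figure: scatterplot of Mini Wright (y-axis) against Wright (x-axis), both axes running from 200 to 700, with the line of equality overlaid]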
Here we used the “pci” plot type to get a “paired coordinates” graph. The “i” stands for immediate,
which tells Stata that the data, being the two pairs of y-x coordinates, follow the command name
rather than being contained in variables.
The syntax for such a graph is:
twoway pci #_y1 #_x1 #_y2 #_x2
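As a hedged alternative to typing the endpoints by hand, the line of equality can also be drawn with the “function” plot type, taking the range from the stored results of summarize. This is only a sketch: it assumes the blandaltmanlancet1986 dataset is in memory, and the local macro names lo and hi are made up for this illustration. Because it uses local macros, the lines must be run together as one block from the do-file editor.

```stata
* Sketch: line of equality via -twoway function- instead of -pci-
* Assumes blandaltmanlancet1986 is in memory; lo and hi are illustrative names
quietly summarize wright1
local lo = r(min)
local hi = r(max)
quietly summarize miniwright1
local lo = min(`lo', r(min))     // smallest value across both meters
local hi = max(`hi', r(max))     // largest value across both meters
display "line of equality runs from " `lo' " to " `hi'
twoway (scatter miniwright1 wright1) ///
       (function y = x, range(`lo' `hi')) , ///
       xtitle(Wright) ytitle("Mini Wright") legend(off)
```

With this dataset the range works out to 178 and 658, the same endpoints used in the pci version.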
Some white space at the low and high ends will make it easier to visualize,
twoway (scatter miniwright1 wright1 )(pci 0 0 800 800) , ///
xtitle(Wright) ytitle("Mini Wright") legend(off)
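[Figure: scatterplot of Mini Wright (y-axis) against Wright (x-axis), both axes running from 0 to 800, with the line of equality overlaid]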
Although interesting to look at, with this graph it is difficult to tell just how close the agreement
is between the two methods.
A more informative graph is the Bland-Altman graph.
We do not know the true value of PEFR, since both meters are subject to error, so the best
estimate we have is the mean of the two measurements.
In a Bland-Altman graph, we form a scatterplot using the difference between the two
measurements, which is the amount of disagreement, on the y-axis, and the mean of the two
measurements on the x-axis.
NOTE: Bland and Altman (1986, p. 308, last sentence of first column) point out it is erroneous
to plot the difference against either one of the measurements alone, because the difference will be
related to whichever value we select. This is a well-known statistical artifact, called
mathematical coupling.
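The coupling can be made explicit with a short covariance calculation. As a sketch, write A and B for the measurements by the two methods, with variances \sigma_A^2 and \sigma_B^2 and correlation \rho (these symbols are introduced here for illustration and are not used elsewhere in the chapter):

```latex
\begin{align*}
\operatorname{Cov}(A - B,\; A) &= \sigma_A^2 - \rho\,\sigma_A\sigma_B, \\
\operatorname{Cov}\!\left(A - B,\; \tfrac{A + B}{2}\right) &= \tfrac{1}{2}\left(\sigma_A^2 - \sigma_B^2\right).
\end{align*}
```

The first covariance is generally nonzero, so the difference is related to A simply because A is part of the difference. The second vanishes whenever the two methods have equal variance, so a trend in a plot against the mean is more likely to reflect a real feature of the disagreement.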
Computing the difference between the two methods and requesting descriptive statistics,
gen diff = wright1 - miniwright1
sum diff
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        diff |        17   -2.117647    38.76513        -81         73
The “limits of agreement” are the mean difference ± 2 × standard deviation of the differences,
which are
display "lower limit = " -2.117647-2*38.76513
display "upper limit = " -2.117647+2*38.76513
lower limit = -79.647907
upper limit = 75.412613
Assuming the differences are normally distributed, these limits bound the middle 95% of the
differences in the sample. [Note: Using 1.96 in place of 2 would more precisely bound the
middle 95% of the differences if they are truly normally distributed, but using 2 provides
adequate precision and 2 is what is advocated by Bland and Altman (1986).]
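To see how little the choice between 2 and 1.96 matters here, the two sets of limits can be displayed side by side. This is a minimal sketch, assuming the blandaltmanlancet1986 dataset is in memory; invnormal() is Stata's inverse cumulative standard normal function.

```stata
* Sketch: limits of agreement using 2 versus the normal quantile 1.96
* Assumes blandaltmanlancet1986 is in memory
capture drop diff
gen diff = wright1 - miniwright1
quietly summarize diff
display "normal 97.5th percentile = " invnormal(0.975)
display "limits using 2:    " r(mean)-2*r(sd) " , " r(mean)+2*r(sd)
display "limits using 1.96: " r(mean)-invnormal(0.975)*r(sd) ///
    " , " r(mean)+invnormal(0.975)*r(sd)
```

For these data the 1.96 version gives limits of roughly -78.1 and 73.9 l/min, compared with -79.6 and 75.4 using 2.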
There will be inaccuracy in these limits bounding the middle 95% of differences in future
samples, however, since every sample will produce a different mean difference and standard
deviation of the differences, due simply to sampling variation. Therefore, analogous to reporting
a 95% confidence interval (CI) for a mean, or any effect estimate, a 95% CI should always be
reported for the limits of agreement (Mantha et al., 2000).
The formula for the confidence interval for the limits of agreement is given by Bland and Altman
(1986). Mantha et al. (2000) present the same CI formula more explicitly as,
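CI for lower limit of agreement (mean-2SD): $(\bar{d} - 2\,SD) \pm t \times \sqrt{\dfrac{3\,SD^2}{n}}$
CI for upper limit of agreement (mean+2SD): $(\bar{d} + 2\,SD) \pm t \times \sqrt{\dfrac{3\,SD^2}{n}}$
where $\bar{d}$ is the mean difference, SD is the standard deviation of the differences, n is the
sample size, and t is the critical value of the t distribution with n-1 degrees of freedom.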
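The factor of 3 under the square root in the CI formula can be motivated with a short large-sample argument. This is a sketch under the assumption that the differences are normally distributed, so that the sample mean and SD are independent; \sigma^2 denotes the true variance of the differences, a symbol introduced here for illustration.

```latex
\begin{align*}
\operatorname{Var}(\bar{d}) &= \frac{\sigma^2}{n}, \qquad
\operatorname{Var}(SD) \approx \frac{\sigma^2}{2n}, \\
\operatorname{Var}(\bar{d} \pm 2\,SD) &\approx \frac{\sigma^2}{n} + 4 \cdot \frac{\sigma^2}{2n}
  = \frac{3\,\sigma^2}{n}.
\end{align*}
```

Replacing \sigma^2 with the sample value SD^2 and taking the square root gives the standard error used for both limits. With 1.96 in place of 2 the factor would be about 2.92, which rounds to the same 3.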
The mean, standard deviation, and sample size for the difference are stored in the scalar names
r(mean), r(sd), and r(N) following the summarize, or sum, command. To see this,
capture drop diff
gen diff = wright1 - miniwright1
sum diff
return list // see macro names for results from previous command
scalars:
                  r(N) =  17
              r(sum_w) =  17
               r(mean) =  -2.117647058823529
                r(Var) =  1502.735294117647
                 r(sd) =  38.76512987360738
                r(min) =  -81
                r(max) =  73
                r(sum) =  -36
In this output, Stata calls them scalars, consistent with matrix algebra terminology. A scalar is a
single number, rather than a variable with many observations, which is a vector. We can display
these in Stata using,
display r(mean)
display r(sd)
display r(N)
-2.1176471
38.76513
17
We can now write some Stata code with the CI formula, using these scalar names, which will
work for any dataset, rather than having to be modified by typing in the numbers themselves.
The two-tailed alpha 0.05, or two-sided 95% confidence level, critical value of the t distribution
which we need in the CI formula is given in Stata by
display invttail(r(N)-1,0.025)
2.1199053
Putting this all together,
capture drop diff
gen diff = wright1 - miniwright1
sum diff
display "lower limit of agreement: " r(mean)-2*r(sd)
display "95% CI for lower limit: (" ///
    r(mean)-2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
    " , " ///
    r(mean)-2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
display "upper limit of agreement: " r(mean)+2*r(sd)
display "95% CI for upper limit: (" ///
    r(mean)+2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
    " , " ///
    r(mean)+2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
. display "lower limit of agreement: " r(mean)-2*r(sd)
lower limit of agreement: -79.647907
. display "95% CI for lower limit: (" ///
>     r(mean)-2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
>     " , " ///
>     r(mean)-2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
95% CI for lower limit: (-114.16974 , -45.126072)
. display "upper limit of agreement: " r(mean)+2*r(sd)
upper limit of agreement: 75.412613
. display "95% CI for upper limit: (" ///
>     r(mean)+2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
>     " , " ///
>     r(mean)+2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
95% CI for upper limit: (40.890778 , 109.93445)
This is kind of messy output, since Stata displays the commands along with the output. If we
want to see the output by itself, we can set it up as a program.
First run the following block of Stata commands inside the do-file editor,
capture program drop blandstats
program define blandstats
    args var1 var2
    capture drop diff
    gen diff = `var1' - `var2'
    quietly sum diff  // quietly suppresses the summary table
    * each limit and its 95% CI are displayed on a single line
    display _newline "lower limit of agreement: " r(mean)-2*r(sd) ///
        " , 95% CI(" ///
        r(mean)-2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
        " , " ///
        r(mean)-2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
    display "upper limit of agreement: " r(mean)+2*r(sd) ///
        " , 95% CI(" ///
        r(mean)+2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
        " , " ///
        r(mean)+2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
end
That sets up the program that defines the command “blandstats”, which requires passing it two
variable names as arguments. It will work for any dataset, without modifying it.
Next run the command “blandstats”, with the variables of the two methods being compared,
blandstats wright1 miniwright1
lower limit of agreement: -79.647907 , 95% CI(-114.16974 , -45.126072)
upper limit of agreement: 75.412613 , 95% CI(40.890778 , 109.93445)
These are the values given in Bland and Altman (1986), except for rounding in the Bland and
Altman paper. In their paper, they used 2.12, instead of 2.1199053, for the t critical value, and
only one decimal place for the mean and standard deviations, resulting in CIs of -114.3 to -45.1
and 40.9 to 110.1. Thus, we have verified we programmed it correctly.
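If the limits and their CIs are needed in later commands rather than only on the screen, one option is to return them as stored results. The following is a sketch of that idea, not part of the original course material: the program name blandstats2 and the result names r(lower), r(lower_lb), and so on are hypothetical choices made for this illustration. It uses a temporary variable so that no variable named diff is left behind in the dataset.

```stata
* Sketch: an r-class variant that returns the limits instead of only
* displaying them. blandstats2 and the r() names are illustrative.
capture program drop blandstats2
program define blandstats2, rclass
    args var1 var2
    tempvar d
    quietly generate double `d' = `var1' - `var2'
    quietly summarize `d'
    local m  = r(mean)
    local s  = r(sd)
    local n  = r(N)
    * half-width of the 95% CI around each limit of agreement
    local hw = invttail(`n'-1, 0.025)*sqrt(3*(`s'^2)/`n')
    return scalar lower    = `m' - 2*`s'
    return scalar lower_lb = `m' - 2*`s' - `hw'
    return scalar lower_ub = `m' - 2*`s' + `hw'
    return scalar upper    = `m' + 2*`s'
    return scalar upper_lb = `m' + 2*`s' - `hw'
    return scalar upper_ub = `m' + 2*`s' + `hw'
end

* usage, assuming blandaltmanlancet1986 is in memory
blandstats2 wright1 miniwright1
return list
```

After the call, a value such as r(lower) can be used in any expression until another r-class command replaces the stored results.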
Graphing a Bland-Altman plot, which is a scatterplot of the differences, with reference lines at
the mean difference, and mean difference ± 2 × standard deviation of the differences (limits of
agreement),
* --- Bland-Altman plot ---
capture drop diff
capture drop meanval
gen diff = wright1 - miniwright1
sum diff
local sddiff = r(sd)
local meandiff = r(mean)
gen meanval = (wright1+miniwright1)/2
local lowerlimit = `meandiff' - 2*`sddiff'
local upperlimit = `meandiff' + 2*`sddiff'
#delimit ;
twoway (scatter diff meanval , color(black) symbol(square))
(pci `upperlimit' 0 `upperlimit' 780 , lcolor(black))
(pci `lowerlimit' 0 `lowerlimit' 780 , lcolor(black))
(pci `meandiff' 0 `meandiff' 780 , lcolor(black))
, text(`upperlimit' 790 "Mean + 2SD",placement(e))
text(`lowerlimit' 790 "Mean - 2SD",placement(e))
text(`meandiff' 790 "Mean",placement(e))
xlabel(0(100)810)
ylabel(-100(20)100, angle(horizontal))
ytitle("Difference in PEFR (Wright - Mini Wright) (l/min)")
xtitle("Average PEFR by two meters (l/min)", height(5))
r1title(" ") r2title(" ")
legend(off)
scheme(s1mono) plotregion(style(none))
;
#delimit cr
Note: In this graph, the commands must all be run as a block of commands, by highlighting them
in the do-file editor and hitting the last icon on the right (the run button). Otherwise, the “local”
values do not pass correctly into the other Stata commands in this block of Stata code.
[Figure: Bland-Altman plot. y-axis: Difference in PEFR (Wright - Mini Wright) (l/min), labeled from -100 to 100; x-axis: Average PEFR by two meters (l/min), labeled from 0 to 800; horizontal reference lines labeled Mean + 2SD, Mean, and Mean - 2SD]
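Because local macros disappear once a block of do-file commands finishes running, one hedged alternative is to hold the limits in Stata scalars, which stay in memory for the rest of the session. The sketch below also draws the reference lines with the yline() option rather than pci; it omits the text labels of the full graph, and the scalar names are illustrative. It assumes the blandaltmanlancet1986 dataset is in memory.

```stata
* Sketch: Bland-Altman reference lines via scalars and yline()
* Assumes blandaltmanlancet1986 is in memory; scalar names are illustrative
capture drop diff
capture drop meanval
gen diff = wright1 - miniwright1
gen meanval = (wright1 + miniwright1)/2
quietly summarize diff
scalar meandiff   = r(mean)
scalar lowerlimit = r(mean) - 2*r(sd)
scalar upperlimit = r(mean) + 2*r(sd)
* `=exp' evaluates the expression and substitutes the result in place
twoway (scatter diff meanval, mcolor(black) msymbol(square)) , ///
    yline(`=scalar(meandiff)' `=scalar(lowerlimit)' ///
          `=scalar(upperlimit)', lcolor(black)) ///
    ylabel(-100(20)100, angle(horizontal)) ///
    ytitle("Difference in PEFR (Wright - Mini Wright) (l/min)") ///
    xtitle("Average PEFR by two meters (l/min)") ///
    legend(off)
```

Since the scalars persist, the twoway command can be rerun on its own later in the session without first rerunning the lines that compute the limits.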
Protocol Suggestion
Continuing with Bland and Altman’s example, where the Wright and mini Wright meters are
compared, you might say something like the following in your protocol to describe the
Bland-Altman analysis. (For the sake of illustration, I am assuming the mini Wright is more rapid
and less expensive.)
Aim 1
We will compare two methods of measuring peak expiratory flow rate (PEFR), the Wright peak
flow meter and the mini-Wright meter. If we can demonstrate that the mini-Wright meter
measurement is in clinically acceptable agreement with the Wright meter measurement, this
would promote widely accepted use of the mini-Wright meter, thus providing a more rapid and
less expensive clinical assessment of PEFR.
Hypothesis 1
PEFR measured with the mini-Wright meter will have clinically acceptable agreement with the
Wright meter.
Statistical Methods
To test the Aim 1 hypothesis, a Bland-Altman analysis will be used. A Bland-Altman analysis
has been accepted as the standard statistical approach to assess the agreement between two
methods of clinical measurement (Altman and Bland, 1983; Bland and Altman, 1986; Mantha et
al., 2000). In this approach, for each patient, the new method (mini-Wright) measurement is
subtracted from the standard method (Wright) measurement, representing the “measurement error”
observed with that patient. The mean of these differences is computed, along with a standard
deviation. A 95% tolerance bound, mean±2SD, is then computed, which is called the “limits of
agreement.” These are the limits within which we can be 95% confident that the measurement
error will fall. If the limits of agreement are contained in, or are narrower than, what would be
clinically acceptable measurement error, it can be concluded that the new measurement method
can be used interchangeably with the standard measurement method. A 95% confidence interval
will be computed for the lower limit of agreement and for the upper limit of agreement. Since
the amount of error that represents “clinically acceptable measurement error” has not been
established, the limits of agreement will be reported descriptively, with 95% confidence intervals
around the limits. This will permit the reader to assess the results FINISH
Clustered Data
When clustered data are used, such as multiple observations taken on the same person, the SD
needs to be corrected using the design effect (McCarthy and Thompson, 2007).
We will practice with a dataset provided in the Bland and Altman (1986) paper.
Bringing this dataset into Stata,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on blandaltmanlancet1986.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
blandaltmanlancet1986", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use blandaltmanlancet1986, clear
This is not a clustered dataset. To artificially make it a clustered dataset, for purposes of
illustrating the clustered analysis approach, we will replace the subject ID with one that identifies
five subjects, making the subjects systematically differ by sorting on the outcome variable before
assigning the subject ID.
sort miniwright1
replace subject=1 in 1/2
replace subject=2 in 3/5
replace subject=3 in 6/11
replace subject=4 in 12/14
replace subject=5 in 15/17
Listing the data, with a line separator between subject ID,
list , abbrev(15) sepby(subject)
     +---------------------------------------------------------+
     | subject   wright1   wright2   miniwright1   miniwright2 |
     |---------------------------------------------------------|
  1. |       1       178       165           259           268 |
  2. |       1       267       275           260           227 |
     |---------------------------------------------------------|
  3. |       2       423       372           350           370 |
  4. |       2       413       415           364           460 |
  5. |       2       442       431           380           390 |
     |---------------------------------------------------------|
  6. |       3       434       401           428           444 |
  7. |       3       395       397           430           415 |
  8. |       3       417       420           432           420 |
  9. |       3       433       429           445           432 |
 10. |       3       427       421           451           443 |
 11. |       3       478       492           477           467 |
     |---------------------------------------------------------|
 12. |       4       476       470           500           500 |
 13. |       4       494       490           512           525 |
 14. |       4       516       512           520           508 |
     |---------------------------------------------------------|
 15. |       5       557       611           600           625 |
 16. |       5       656       633           626           605 |
 17. |       5       650       638           658           642 |
     +---------------------------------------------------------+
We see that subject 1 has miniwright1 scores in the 200’s, subject 2 has scores in the 300’s, and so
on. This is a common feature of a clustered dataset, in that scores within the same subject are
more alike than scores from different subjects.
Computing the difference between the two methods and requesting descriptive statistics,
gen diff = wright1 - miniwright1
sum diff
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        diff |        17   -2.117647    38.76513        -81         73
<< finish this section >>
References
Altman DG, Bland JM. (1983). Measurement in medicine: the analysis of method comparison
studies. Statistician 32(3):307-317.
Bland JM, Altman DG. (1986). Statistical methods for assessing agreement between two
methods of clinical measurement. Lancet Feb 8:307-310.
Hamilton C, Stamey J. (2007). Using Bland-Altman to assess agreement between two medical
devices—don’t forget the confidence intervals! J Clin Monit Comput 21:331-33.
Mantha S, Roizen MF, Fleisher LA, et al. (2000). Comparing methods of clinical measurement:
reporting standards for Bland and Altman analysis. Anesth Analg 90:593-602.
McCarthy WF, Thompson DR. (2007). The analysis of pixel intensity (myocardial signal density)
data: the quantification of myocardial perfusion by imaging methods. (May 2007).
COBRA Preprint Series. Article 23. http://biostats.bepress.com/cobra/ps/art23