Chapter 2-17. Bland-Altman Analysis

<< The section for the clustered data case is still under construction >>

In this chapter we see how to assess the agreement between two methods of clinical measurement. Statisticians call this type of analysis a methods comparison study. The most popular methods comparison approach is called a Bland-Altman analysis. D.G. Altman and J.M. Bland first published this approach in 1983 in a statistical journal (Altman and Bland, 1983) and later in Lancet (Bland and Altman, 1986) to appeal to medical investigators.

Even though the approach is simple, some investigators make errors in applying the method. Mantha et al (2000) reviewed how the method was applied in seven anesthesia journals, reporting that the quality of the Bland-Altman analyses varied. They proposed a reporting standard for a Bland-Altman analysis.

We will practice with the dataset provided in the Bland and Altman (1986) paper.

Bringing this dataset into Stata,

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on blandaltmanlancet1986.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\blandaltmanlancet1986", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use blandaltmanlancet1986, clear

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Chapter 2-17 (revision 9 Jan 2011)

Listing the data,

    Data
      Describe data
        List data
          Main tab: Override minimum abbreviation of variable names: Characters: 15
          OK

    list , abbrev(15)

         +---------------------------------------------------------+
         | subject   wright1   wright2   miniwright1   miniwright2 |
         |---------------------------------------------------------|
      1. |       1       494       490           512           525 |
      2. |       2       395       397           430           415 |
      3. |       3       516       512           520           508 |
      4. |       4       434       401           428           444 |
      5. |       5       476       470           500           500 |
         |---------------------------------------------------------|
      6. |       6       557       611           600           625 |
      7. |       7       413       415           364           460 |
      8. |       8       442       431           380           390 |
      9. |       9       650       638           658           642 |
     10. |      10       433       429           445           432 |
         |---------------------------------------------------------|
     11. |      11       417       420           432           420 |
     12. |      12       656       633           626           605 |
     13. |      13       267       275           260           227 |
     14. |      14       478       492           477           467 |
     15. |      15       178       165           259           268 |
         |---------------------------------------------------------|
     16. |      16       423       372           350           370 |
     17. |      17       427       421           451           443 |
         +---------------------------------------------------------+

The study aim is to compare two methods of measuring peak expiratory flow rate (PEFR). For each subject, two measurements were taken with a Wright peak flow meter and two with a mini Wright meter, in random order. The first measurement by each method will be used to illustrate the comparison of methods. The second measurement will be used to assess repeatability.

An initial visual assessment of agreement is made using a scatterplot of the two methods, overlaying a line of equality. If the two methods provide identical measurements, the pairs of measurements will lie on this line.

Finding the minimum and maximum to use for graphing the line of equality,

    sum wright1 miniwright1

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         wright1 |        17    450.3529    116.3126        178        656
     miniwright1 |        17    452.4706    113.1151        259        658

A line of equality that connects the ordered pairs (178 , 178) and (658 , 658) will pass through the entire range of values.
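The two endpoints can also be picked up from Stata's stored results rather than read off the table. The following is a minimal sketch, assuming the blandaltmanlancet1986 dataset is still in memory; the local macro names lo and hi are made up for this illustration. Like any code that uses locals, it should be run as one block from the do-file editor.

```stata
* Smallest and largest value across both methods (lo and hi are illustrative names)
quietly summarize wright1
local lo = r(min)
local hi = r(max)
quietly summarize miniwright1
local lo = min(`lo', r(min))
local hi = max(`hi', r(max))
* Endpoints for the line of equality
display "line of equality from (`lo' , `lo') to (`hi' , `hi')"
```

The values displayed should match the 178 and 658 read from the summary table, and `lo' and `hi' could be passed to pci in place of typed numbers.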
Overlaying a scatterplot of the two methods with the line of equality,

    twoway (scatter miniwright1 wright1)(pci 178 178 658 658) , ///
        xtitle(Wright) ytitle("Mini Wright") legend(off)

[Figure: scatterplot of Mini Wright (y-axis) against Wright (x-axis), both axes running from 200 to 700, with the line of equality overlaid]

Here we used the “pci” command to get a “paired coordinates” graph. The “i” stands for immediate, which tells the command that the data, here the two x-y coordinates, follow the command name rather than being contained in variables. The syntax for such a graph is:

    twoway pci #_y1 #_x1 #_y2 #_x2

Some white space at the low and high ends will make it easier to visualize,

    twoway (scatter miniwright1 wright1)(pci 0 0 800 800) , ///
        xtitle(Wright) ytitle("Mini Wright") legend(off)

[Figure: the same scatterplot with both axes running from 0 to 800]

Although interesting to look at, with this graph it is difficult to tell just how close the agreement is between the two methods. A more informative graph is the Bland-Altman graph.

We do not know the true value of PEFR, since both meters are subject to error, so the best estimate we have is the mean of the two measurements. In a Bland-Altman graph, we form a scatterplot using the difference between the two measurements, which is the amount of disagreement, on the y-axis, and the mean of the two measurements on the x-axis.

NOTE: Bland and Altman (1986, p. 308, last sentence of first column) point out it is erroneous to plot the difference against either one of the measurements, because the difference will be related to whichever value we select. This is a well-known statistical artifact, called mathematical coupling.

Computing the difference between the two methods and requesting descriptive statistics,

    gen diff = wright1 - miniwright1
    sum diff

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            diff |        17   -2.117647    38.76513        -81         73

The “limits of agreement” are the mean difference ± 2 × standard deviation of the differences, which are

    display "lower limit = " -2.117647-2*38.76513
    display "upper limit = " -2.117647+2*38.76513

    lower limit = -79.647907
    upper limit = 75.412613

Assuming the differences are normally distributed, these limits bound the middle 95% of the differences in the sample. [Note: Using 1.96 in place of 2 would more precisely bound the middle 95% of the differences if they are truly normally distributed, but using 2 provides adequate precision and 2 is what is advocated by Bland and Altman (1986).]

These limits will not exactly bound the middle 95% of differences in future samples, however, since every sample will produce a different mean difference and standard deviation of the differences, due simply to sampling variation. Therefore, analogous to reporting a 95% confidence interval (CI) for a mean, or any effect estimate, a 95% CI should always be reported for the limits of agreement (Mantha et al., 2000).

The formula for the confidence interval for the limits of agreement is given by Bland and Altman (1986). Mantha et al. (2000) present the same CI formula more explicitly as,

    CI for lower limit of agreement (mean-2SD):  $(\bar{d} - 2\,SD) \pm t\,\sqrt{3\,SD^2/n}$

    CI for upper limit of agreement (mean+2SD):  $(\bar{d} + 2\,SD) \pm t\,\sqrt{3\,SD^2/n}$

where $\bar{d}$ is the mean difference, $SD$ is the standard deviation of the differences, $n$ is the number of pairs, and $t$ is the two-sided 95% critical value of the t distribution with $n-1$ degrees of freedom.

The mean, standard deviation, and sample size for the difference are stored in the scalar names r(mean), r(sd), and r(N) following the summarize, or sum, command.
To see this,

    capture drop diff
    gen diff = wright1 - miniwright1
    sum diff
    return list // see macro names for results from previous command

    scalars:
                      r(N) =  17
                  r(sum_w) =  17
                   r(mean) =  -2.117647058823529
                    r(Var) =  1502.735294117647
                     r(sd) =  38.76512987360738
                    r(min) =  -81
                    r(max) =  73
                    r(sum) =  -36

In this output, Stata calls them scalars, consistent with matrix algebra terminology. A scalar is a single number, rather than a variable with many observations, which is a vector.

We can display these in Stata using,

    display r(mean)
    display r(sd)
    display r(N)

    -2.1176471
    38.76513
    17

We can now write some Stata code with the CI formula, using these scalar names, which will work for any dataset, rather than having to be modified by typing in the numbers themselves.

The two-tailed alpha 0.05, or two-sided 95% confidence level, critical value of the t distribution, which we need in the CI formula, is given in Stata by

    display invttail(r(N)-1,0.025)

    2.1199053

Putting this all together,

    capture drop diff
    gen diff = wright1 - miniwright1
    sum diff
    display "lower limit of agreement: " r(mean)-2*r(sd)
    display "95% CI for lower limit: (" ///
        r(mean)-2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
        " , " ///
        r(mean)-2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
    display "upper limit of agreement: " r(mean)+2*r(sd)
    display "95% CI for upper limit: (" ///
        r(mean)+2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
        " , " ///
        r(mean)+2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"

    . display "lower limit of agreement: " r(mean)-2*r(sd)
    lower limit of agreement: -79.647907

    . display "95% CI for lower limit: (" ///
    > r(mean)-2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
    > " , " ///
    > r(mean)-2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
    95% CI for lower limit: (-114.16974 , -45.126072)

    . display "upper limit of agreement: " r(mean)+2*r(sd)
    upper limit of agreement: 75.412613

    . display "95% CI for upper limit: (" ///
    > r(mean)+2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
    > " , " ///
    > r(mean)+2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
    95% CI for upper limit: (40.890778 , 109.93445)

This output is somewhat messy, since Stata displays the commands along with the output. If we want to see the output by itself, we can set it up as a program. First run the following block of Stata commands inside the do-file editor,

    capture program drop blandstats
    program define blandstats
        args var1 var2
        capture drop diff
        gen diff = `var1' - `var2'
        * quietly suppresses the summary table; r(mean), r(sd), r(N) are still stored
        quietly sum diff
        * each limit and its 95% CI are displayed on a single line
        display _newline "lower limit of agreement: " r(mean)-2*r(sd) ///
            " , 95% CI(" ///
            r(mean)-2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
            " , " ///
            r(mean)-2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
        display "upper limit of agreement: " r(mean)+2*r(sd) ///
            " , 95% CI(" ///
            r(mean)+2*r(sd)-invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ///
            " , " ///
            r(mean)+2*r(sd)+invttail(r(N)-1,0.025)*sqrt(3*(r(sd)^2)/r(N)) ")"
    end

That sets up the program that defines the command “blandstats”, which requires passing it two variable names as arguments. It will work for any dataset, without modifying it.

Next run the command “blandstats”, with the variables of the two methods being compared,

    blandstats wright1 miniwright1

    lower limit of agreement: -79.647907 , 95% CI(-114.16974 , -45.126072)
    upper limit of agreement: 75.412613 , 95% CI(40.890778 , 109.93445)

These are the values given in Bland and Altman (1986), except for rounding in the Bland and Altman paper. In their paper, they used 2.12, instead of 2.1199053, for the t critical value, and only one decimal place for the mean and standard deviations, resulting in CIs of -114.3 to -45.1 and 40.9 to 110.1. Thus, we have verified we programmed it correctly.
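As a check on the arithmetic, the half-width of each interval can be worked out by hand from the values reported above (n = 17, SD = 38.76513, t = 2.1199053). The intermediate values in this sketch are rounded to four decimal places, so the last digit may differ slightly from the Stata output.

```latex
\begin{align*}
SE &= \sqrt{\frac{3\,SD^2}{n}} = \sqrt{\frac{3\,(38.76513)^2}{17}} \approx 16.2846 \\
t \times SE &\approx 2.1199053 \times 16.2846 \approx 34.5218 \\
\text{lower limit:}\quad -79.6479 \pm 34.5218 &= (-114.1697,\ -45.1261) \\
\text{upper limit:}\quad \phantom{-}75.4126 \pm 34.5218 &= (40.8908,\ 109.9344)
\end{align*}
```

Both intervals have the same half-width, because the same standard error applies to the lower and the upper limit of agreement.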
Graphing a Bland-Altman plot, which is a scatterplot of the differences against the means, with reference lines at the mean difference and at the mean difference ± 2 × standard deviation of the differences (limits of agreement),

    * --- Bland-Altman plot ---
    capture drop diff
    capture drop meanval
    gen diff = wright1 - miniwright1
    sum diff
    local sddiff = r(sd)
    local meandiff = r(mean)
    gen meanval = (wright1+miniwright1)/2
    * locals must be referenced with `name' when used in an expression
    local lowerlimit = `meandiff' - 2*`sddiff'
    local upperlimit = `meandiff' + 2*`sddiff'
    #delimit ;
    twoway (scatter diff meanval , color(black) symbol(square))
        (pci `upperlimit' 0 `upperlimit' 780 , lcolor(black))
        (pci `lowerlimit' 0 `lowerlimit' 780 , lcolor(black))
        (pci `meandiff' 0 `meandiff' 780 , lcolor(black))
        , text(`upperlimit' 790 "Mean + 2SD",placement(e))
        text(`lowerlimit' 790 "Mean - 2SD",placement(e))
        text(`meandiff' 790 "Mean",placement(e))
        xlabel(0(100)810) ylabel(-100(20)100, angle(horizontal))
        ytitle("Difference in PEFR (Wright - Mini Wright) (l/min)")
        xtitle("Average PEFR by two meters (l/min)", height(5))
        r1title(" ") r2title(" ")
        legend(off) scheme(s1mono) plotregion(style(none))
        ;
    #delimit cr

Note: In this graph, the commands must all be run as a block of commands, by highlighting them in the do-file editor and hitting the last icon on the right (the run button). Otherwise, the “local” values do not pass correctly into the other Stata commands in this block of Stata code.

[Figure: Bland-Altman plot. Y-axis: Difference in PEFR (Wright - Mini Wright) (l/min), -100 to 100. X-axis: Average PEFR by two meters (l/min), 0 to 800. Horizontal reference lines are labeled Mean + 2SD, Mean, and Mean - 2SD.]

Protocol Suggestion

Continuing with Bland and Altman’s example, where the Wright and mini Wright meters are compared, you might say something like the following in your protocol to describe the Bland-Altman analysis. (For the sake of illustration, I am assuming the mini Wright is more rapid and less expensive.)
Aim 1

We will compare two methods of measuring peak expiratory flow rate (PEFR), the Wright peak flow meter and the mini-Wright meter. If we can demonstrate that the mini-Wright meter measurement is within clinically acceptable agreement with the Wright meter measurement, this would promote widely accepted use of the mini-Wright meter, thus providing a more rapid and less expensive clinical assessment of PEFR.

Hypothesis 1

PEFR measured with the mini-Wright meter will have clinically acceptable agreement with the Wright meter.

Statistical Methods

To test the Aim 1 hypothesis, a Bland-Altman analysis will be used. A Bland-Altman analysis has been accepted as the standard statistical approach to assess the agreement between two methods of clinical measurement (Altman and Bland, 1983; Bland and Altman, 1986; Mantha et al, 2000). In this approach, for each patient, the new method (mini-Wright) measurement is subtracted from the standard method (Wright) measurement, with the difference representing the “measurement error” observed with that patient. The mean of these differences is computed, along with a standard deviation. A 95% tolerance bound, mean±2SD, is then computed, which is called the “limits of agreement.” These are the limits within which we expect 95% of the measurement errors to fall. If the limits of agreement are contained in, or are narrower than, what would be clinically acceptable measurement error, it can be concluded that the new measurement method can be used interchangeably with the standard measurement method. A 95% confidence interval will be computed for the lower limit of agreement and for the upper limit of agreement.

Since the amount of error that represents “clinically accepted measurement error” has not been established, the limits of agreement will be reported descriptively, with 95% confidence intervals around the limits. This will permit the reader to assess the results << FINISH >>
Clustered Data

When clustered data are used, such as multiple observations taken on the same person, the SD needs to be corrected using the design effect (McCarthy and Thompson, 2007).

We will practice with the dataset provided in the Bland and Altman (1986) paper.

Bringing this dataset into Stata,

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on blandaltmanlancet1986.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\blandaltmanlancet1986", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use blandaltmanlancet1986, clear

This is not a clustered dataset. To artificially make it a clustered dataset, for purposes of illustrating the clustered analysis approach, we will replace the subject ID with one that identifies five subjects, making the subjects systematically differ by sorting on the outcome variable before assigning the subject ID.

    sort miniwright1
    replace subject=1 in 1/2
    replace subject=2 in 3/5
    replace subject=3 in 6/11
    replace subject=4 in 12/14
    replace subject=5 in 15/17

Listing the data, with a line separator between subject IDs,

    list , abbrev(15) sepby(subject)

         +---------------------------------------------------------+
         | subject   wright1   wright2   miniwright1   miniwright2 |
         |---------------------------------------------------------|
      1. |       1       178       165           259           268 |
      2. |       1       267       275           260           227 |
         |---------------------------------------------------------|
      3. |       2       423       372           350           370 |
      4. |       2       413       415           364           460 |
      5. |       2       442       431           380           390 |
         |---------------------------------------------------------|
      6. |       3       434       401           428           444 |
      7. |       3       395       397           430           415 |
      8. |       3       417       420           432           420 |
      9. |       3       433       429           445           432 |
     10. |       3       427       421           451           443 |
     11. |       3       478       492           477           467 |
         |---------------------------------------------------------|
     12. |       4       476       470           500           500 |
     13. |       4       494       490           512           525 |
     14. |       4       516       512           520           508 |
         |---------------------------------------------------------|
     15. |       5       557       611           600           625 |
     16. |       5       656       633           626           605 |
     17. |       5       650       638           658           642 |
         +---------------------------------------------------------+

We see that subject 1 has miniwright scores in the 200’s, subject 2 has scores in the 300’s, and so on. This is a common feature of a clustered dataset, in that scores within the same subject are more alike than scores from different subjects.

Computing the difference between the two methods and requesting descriptive statistics,

    gen diff = wright1 - miniwright1
    sum diff

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            diff |        17   -2.117647    38.76513        -81         73

<< finish this section >>

References

Altman DG, Bland JM. (1983). Measurement in medicine: the analysis of method comparison studies. Statistician 32(3):307-317.

Bland JM, Altman DG. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet Feb 8:307-310.

Hamilton C, Stamey J. (2007). Using Bland-Altman to assess agreement between two medical devices - don’t forget the confidence intervals! J Clin Monit Comput 21:331-33.

Mantha S, Roizen MF, Fleischer LA, et al. (2000). Comparing methods of clinical measurement: reporting standards for Bland and Altman analysis. Anesth Analg 90:593-602.

McCarthy WF, Thompson DR. (2007). The analysis of pixel intensity (myocardial signal density) data: the quantification of myocardial perfusion by imaging methods. (May 2007). COBRA Preprint Series. Article 23. http://biostats.bepress.com/cobra/ps/art23