Chapter 5-15. Correlated Data: Response Feature (Summary Measure) Analysis

_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.
Chapter 5-15 (revision 16 May 2010)

In this chapter we begin discussing how to model data that are correlated. Correlation arises when repeated measurements are taken on the same research subject. It also arises with hierarchical clusters, such as patients within physicians, and physicians within hospitals.

The linear regression, logistic regression, and Cox regression models we have discussed thus far, as well as the simple statistical tests that compare groups, all assume that observations are independent. The statistical formulas and p value calculations are correct only when this assumption is met. The following example illustrates this.

Autocorrelation and Type I Error Example

van Belle (2002, pp 7-11) provides an example of what happens to the Type I error rate when serially acquired observations, such as sequential laboratory measurements, are correlated, which is referred to as autocorrelation. Measurements taken spatially are also known to exhibit autocorrelation, such as inflammation measurements taken distally from a wound site.

In van Belle’s example, observations ordered by time are correlated as follows: adjacent observations have correlation ρ, where ρ is the population correlation coefficient, observations two steps apart have correlation $\rho^2$, and so on. This is called a first order autoregressive process, AR(1), with correlation ρ.

For reasonably large n, these data will have

    true standard error of $\bar{x}$ = $\sqrt{\dfrac{1+\rho}{1-\rho}} \cdot \dfrac{s}{\sqrt{n}}$

rather than standard error = $s/\sqrt{n}$, which applies to the independent observation case. Using the one-sample t test that assumes independence,

    $t = \dfrac{\bar{x}}{s/\sqrt{n}}$ ,   H0: µ = 0

with AR(1) data, then, inflates the Type I error when ρ is positive and deflates it when ρ is negative. In either case, significance is not achieved the expected proportion of times. The effect is quite dramatic, as shown in the following table (van Belle, 2002, p.9):

Effect of Autocorrelation on Type I Error, when Assuming Independent Observations

    ρ               0      0.1    0.2    0.3    0.4    0.5
    Type I error    0.05   0.08   0.11   0.15   0.20   0.26

We see that with autocorrelation as low as 0.2, the Type I error more than doubles. As this example illustrates, lack of independence in the data makes a shambles of hypothesis testing when statistical methods assuming independence are used.

Progesterone Dataset

Dalton et al. (1987) report a study in which they obtained absorption profiles from women following the administration of ointment containing 20, 30, and 40 mg of progesterone to the nasal mucosa. Their dataset is reproduced in Altman (1991, pp.427-428). The four treatment groups are:

    Group 1 (0.2ml of 100 mg/ml one nostril)
    Group 2 (0.3ml of 100 mg/ml one nostril)
    Group 3 (0.2ml of 200 mg/ml one nostril)
    Group 4 (0.2ml of 100 mg/ml each nostril)

Opening the progesterone.dta dataset in Stata,

File
  Open
    Find the directory where you copied the course CD
    Change to the subdirectory datasets & do-files
    Single click on progesterone.dta
    Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\progesterone.dta", clear
    * which must be all on one line, or use:
    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use progesterone.dta, clear

Listing the variable names:

Data
  Describe data
    Describe variables in memory
      Options: Display only variable names
      OK

    describe, simple
    * <or>
    ds

    id         progest0   progest3   progest10  progest30  progest60
    group      progest1   progest5   progest15  progest45  progest120

We see that the data represent the serum level of progesterone (nmol/l) at baseline (time 0) and after nasal administration (1, 3, 5, …, 120 minutes).
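Returning briefly to the autocorrelation table earlier in this chapter: the tabled Type I error rates can be approximated directly from the AR(1) standard error formula. The following Python sketch is a cross-check outside Stata and is not part of van Belle's presentation. It assumes n is large enough that the t statistic can be treated as standard normal, and the function name `naive_type1_error` is introduced here for illustration only.

```python
from statistics import NormalDist


def naive_type1_error(rho: float, alpha: float = 0.05) -> float:
    """Approximate true Type I error of a test that assumes independence
    when the data actually follow an AR(1) process with correlation rho.

    Assumption: large n, so the t statistic is treated as standard normal.
    """
    z = NormalDist()                     # standard normal distribution
    crit = z.inv_cdf(1 - alpha / 2)      # nominal two-sided critical value (about 1.96)
    # The naive standard error is too small by the factor
    # sqrt((1 + rho) / (1 - rho)), so under H0 the naive statistic is
    # normal with that standard deviation rather than 1.
    inflation = ((1 + rho) / (1 - rho)) ** 0.5
    return 2 * (1 - z.cdf(crit / inflation))


for rho in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"rho = {rho:.1f}   Type I error = {naive_type1_error(rho):.2f}")
```

Under these assumptions the printed values should agree, to two decimals, with the table of Type I errors given earlier. A negative ρ gives a value below 0.05, which is the deflation described in the text.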
Listing the data,

Data
  Describe data
    List data
      Main tab: Column widths: Compress width of columns in both tables and display formats
      Main tab: Do not list observation numbers
      Options tab: Separators: When these variables change: group
      Options tab: Display numeric codes rather than label values
      OK

    list, compress noobs sepby(group) nolabel

  +---------------------------------------------------------------------------------+
  | id  group  pr~t0  pro~1  pro~3  pr~t5  pr~10  pr~15  pr~30  pr~45  pr~60  p~120 |
  |---------------------------------------------------------------------------------|
  |  1      1      1      .     10     16     22     20     16      .     18     14 |
  |  2      1    6.5    5.7    9.5   11.6   17.5   27.3   28.5   22.4   19.3     10 |
  |  3      1      3      4      4     13   15.8   19.5   21.2   17.9   10.7   13.4 |
  |  4      1      1    2.1    9.7      .   21.8      .   27.5      .   15.5    6.2 |
  |  5      1      1      1      1    4.2   22.6   23.9   45.5   42.6     35   10.6 |
  |  6      1      1      1      1      1    3.9   14.7   17.6   16.1    8.8   10.8 |
  |---------------------------------------------------------------------------------|
  |  7      2      1    1.5      5     11     16     23     15      9      6      5 |
  |  8      2      1      1    6.5     20   22.5   27.8     19      9    8.2      8 |
  |  9      2      1      1    7.3    7.5     18     20   18.9   12.8    6.3    4.8 |
  | 10      2      3    2.5      2    2.7    3.4    3.6     14    7.3    7.7    4.7 |
  | 11      2    8.3    7.5    9.6     11   11.5   15.7   15.2   15.8     14   11.5 |
  | 12      2    6.2    5.9    6.8    7.7      9    9.3   12.1   12.2     11      9 |
  |---------------------------------------------------------------------------------|
  | 13      3    8.4   10.8    8.1    7.8    8.5     12   19.8   22.2   25.2   40.5 |
  | 14      3    3.5    3.2    3.4    3.3    8.5    9.4   14.5   12.7   11.5   10.2 |
  | 15      3    3.5      4    4.8    3.5    3.7     13   12.5     15     22   10.5 |
  | 16      3    3.7    3.2    4.3    4.5    5.5    8.5   10.3   11.1      8      6 |
  |---------------------------------------------------------------------------------|
  | 17      4      5    5.6    6.1    7.2   13.8     26   26.1   25.7   20.5     11 |
  | 18      4    4.5    5.1   13.2     21   26.8     28     22   17.8   15.7     14 |
  | 19      4    8.4    6.2      8   18.5   33.8     35   26.2     23     19   12.6 |
  | 20      4    4.2    3.2    4.2    4.8   10.3   13.7   17.1   18.3   17.4   15.8 |
  +---------------------------------------------------------------------------------+

These data represent a longitudinal dataset.
Twisk (2003, p.1) explains, “Longitudinal studies are defined as studies in which the outcome variable is repeatedly measured; i.e., the outcome variable is measured in the same individual on several different occasions.”

In this dataset, each of the several occasions represents a repeated measurement of serum progesterone across time. Within each research subject, we can expect autocorrelation, also called serial correlation, and so our analysis strategy must take this into account. Our first step, however, will be simply to graph the data.

Parallel Coordinate Plots

A popular way to graph such data is with parallel coordinate plots (Cox, 2004). If you are interested in why such plots have this name, or you want to see a more general presentation of this graphing strategy, refer to Wegman (1990).

First, you might have to update your Stata to get the parplot command.

    findit parplot

    parplot from http://fmwww.bc.edu/RePEc/bocode/p
        'PARPLOT': module for parallel coordinates plots / parplot draws
        parallel coordinates plots. Stata 8 is required. / KW: graphics /
        KW: multivariate / KW: parallel coordinates plot / Requires: Stata
        version 8.0 / Author: Nicholas J. Cox, Durham University / Support: email

Click on the blue link, which gives you

    INSTALLATION FILES                    (click here to install)
        parplot.ado
        parplot.hlp

Click on the blue link. To generate the graph, use

    #delimit ;
    parplot progest0-progest120 , transform(raw)
        xlabel(1 "0" 2 "1" 3 "3" 4 "5" 5 "10" 6 "15" 7 "30" 8 "45"
               9 "60" 10 "120")
        ylabel(0(5)50, angle(horizontal))
        xtitle("Minutes Post Dose Administration")
        yline(0) ytitle("Serum Progesterone(nmol/l)")
        by(group)
        ;
    #delimit cr

[Figure: parallel coordinate plots of serum progesterone (nmol/l), 0 to 50, against minutes post dose administration (0, 1, 3, 5, 10, 15, 30, 45, 60, 120, evenly spaced), one panel per dose group: Grp 1 (0.2ml of 100 mg/ml one nostril), Grp 2 (0.3ml of 100 mg/ml one nostril), Grp 3 (0.2ml of 200 mg/ml one nostril), Grp 4 (0.2ml of 100 mg/ml each nostril). Graphs by Dose Group.]

You can get a similar looking graph using Stata’s “twoway line” command. To do that, however, you must first reshape the data into long format.

Beginning with the data in wide format,

  +---------------------------------------------------------------------------------+
  | id  group  pr~t0  pro~1  pro~3  pr~t5  pr~10  pr~15  pr~30  pr~45  pr~60  p~120 |
  |---------------------------------------------------------------------------------|
  |  1      1      1      .     10     16     22     20     16      .     18     14 |
  |  2      1    6.5    5.7    9.5   11.6   17.5   27.3   28.5   22.4   19.3     10 |
  …

Reshaping it to long format,

    reshape long progest , i(id) j(time)

In this command, “progest” is called the stub variable (prefix variable would be a more intuitive name). Stata uses the j subscript variable, time, to store the suffix that followed “progest” in the variable names. Stata uses the i subscript variable to identify each subject. This variable must contain a unique number on each row of the wide data, or Stata cannot tell the subjects apart. If any other variables are in the dataset, Stata then duplicates their values onto each newly created row.

Listing the first two subjects to check that the reshape did what was expected,

    list if id<=2, noobs nolabel sepby(id)

  +-----------------------------+
  | id   time   group   progest |
  |-----------------------------|
  |  1      0       1         1 |
  |  1      1       1         . |
  |  1      3       1        10 |
  |  1      5       1        16 |
  |  1     10       1        22 |
  |  1     15       1        20 |
  |  1     30       1        16 |
  |  1     45       1         . |
  |  1     60       1        18 |
  |  1    120       1        14 |
  |-----------------------------|
  |  2      0       1       6.5 |
  |  2      1       1       5.7 |
  |  2      3       1       9.5 |
  |  2      5       1      11.6 |
  |  2     10       1      17.5 |
  |  2     15       1      27.3 |
  |  2     30       1      28.5 |
  |  2     45       1      22.4 |
  |  2     60       1      19.3 |
  |  2    120       1        10 |
  +-----------------------------+

Graphing the data using twoway line with the connect(ascending) option,

    sort id time
    #delimit ;
    graph twoway (line progest time , connect(ascending))
        , by(group) ylabel(0(5)50, angle(horizontal))
        xtitle("Minutes Post Dose Administration")
        ytitle("Serum Progesterone(nmol/l)")
        xlabel(0 "0" 1 " " 3 " " 5 "5" 10 "10" 15 "15" 30 "30"
               45 "45" 60 "60" 120 "120" ,labsize(small))
        ;
    #delimit cr

[Figure: line plots of serum progesterone (nmol/l), 0 to 50, against minutes post dose administration drawn on a true time scale (0 to 120), one line per subject and one panel per dose group (Grp 1 to Grp 4). Graphs by Dose Group.]

It is the connect(ascending) option that instructs Stata to draw a separate line for each subject: a new line is started whenever time fails to increase, which in data sorted by id and time happens at the first record of each subject.

This graph has an advantage over the parplot in that the time points are drawn at their actual, unevenly spaced, positions.

[Figure: the parallel coordinate plot shown earlier, repeated for comparison, with the ten time points evenly spaced. Graphs by Dose Group.]

However, notice that for this example the parplot provides better resolution for the low values of time, which is a different advantage.

Summary Measure (Response Feature) Analysis

A common approach to analyzing longitudinal data is to use a summary measure computed directly from the observed data.
In this way, all of the repeated measurements are reduced to a single number per subject, which eliminates the need to account for the correlation structure of the data. The analysis then reduces to comparing the groups in a cross-sectional fashion, such as with an independent groups t test (for two groups) or a oneway ANOVA (for the four groups shown here).

The summary measure approach is also called response feature analysis (Dupont, 2002, pp. 345-356).

Altman (1991, pp.430-431) lists some of the more frequently derived summary measures:

    “-- mean of all the measurements (i.e., ignore the time response)
     -- height of peak
     -- time to reach peak
     -- time to reach a given level
     -- time to change by a given amount
     -- time above a given level
     -- time to achieve maximum change from original level (baseline)
     -- time to return (near) to baseline level
     -- change from first to last measurement
     -- final level (perhaps the average of the last few measurements)
     -- area under the curve (AUC)”

    “Several of these suggestions incorporate some arbitrary definitions which should be chosen in advance of the analysis rather than after inspection of the data. Several are specifically aimed at data with peaks. Where initial values vary considerably the change from baseline may be used.”

    “…In general it is reasonable to consider two or three derived statistics, but as in any study it is highly desirable to identify a single measure of primary interest. The choice of appropriate measures should relate to the study objectives. For example, if the study is one of treatment efficacy we may reasonably be most interested in the values at the end of the study, perhaps in relation to starting values. If the study is to evaluate the effectiveness of analgesics, then we would probably be interested in the rapid effectiveness of the drug, perhaps by looking at the timing of the peak and the level achieved, and perhaps also the time above some critical level.”

Response Feature Analysis: Area Under the Curve (AUC)

Dupont (2002, pp.355-356), Altman (1991, pp.431-433), and Twisk (2003, pp.184-185) describe how to apply the AUC approach. It is done by approximating the area under the curve with a sum of trapezoids, called the trapezoidal rule (the American term) or trapezium rule (the British term).

Altman (1991, pp. 431-433) explains,

    “The area under the curve (AUC) is a useful way of summarizing the information from a series of measurements on one individual. It is frequently used in clinical pharmacology, where the AUC from serum levels can be interpreted as the total uptake or bioavailability of whatever had been administered.

    “The data are joined by straight lines to get a ‘curve’. The AUC is usually calculated by adding the areas under the curve between each pair of consecutive observations. If we have measurements $y_1$ and $y_2$ at times $t_1$ and $t_2$, then the AUC between those two times is the product of the time difference and the average of the two measurements. Thus we get $(t_2 - t_1)(y_1 + y_2)/2$. This is known as the trapezium rule because of the shape of each segment of the area under the curve. If we have n + 1 measurements $y_i$ at times $t_i$ (i = 0, … , n) then the AUC is calculated as

        $\dfrac{1}{2}\sum_{i=0}^{n-1}(t_{i+1} - t_i)(y_i + y_{i+1})$.

    The units of the AUC are the product of the units used for $y_i$ and $t_i$, for example nmol.min/l, and are not easy to understand. It may be useful to divide the AUC by the total time to get a sort of weighted average level over the time period. …We can calculate the AUC even when there are missing data, except when the final observation is missing.”

After computing the AUC, the groups are compared in a cross-sectional fashion, using t tests, ANOVA, etc., treating the AUC as we would any other continuous variable.

It is easier to perform the AUC calculation in long format, so we continue to use the reshaped data, which was reshaped earlier using,

    reshape long progest , i(id) j(time)

Calculating the AUC,

    capture drop area
    capture drop auc
    sort id time  // the [_n-1] subscripts below rely on this order
    generate area=(time-time[_n-1])*(progest+progest[_n-1])/2 if id==id[_n-1]
    list id time progest area if id==1
    replace area=(time-time[_n-2])*(progest+progest[_n-2])/2 ///
        if (id==id[_n-2] & area==.) // fill in if missing 1 consecutive time
    list id time progest area if id==1
    bysort id: egen auc=sum(area)
    list if id<=2, noobs nolabel sepby(id)
    egen tag = tag(id) // indicator for first observation per subject
    drop area
    list if id<=2, noobs nolabel sepby(id)

After the generate command, the area is missing wherever the preceding time point is missing:

       +----------------------------+
       | id   time   progest   area |
       |----------------------------|
    1. |  1      0         1      . |
    2. |  1      1         .      . |
    3. |  1      3        10      . |  <- preceding time point is missing
    4. |  1      5        16     26 |
    5. |  1     10        22     95 |
    6. |  1     15        20    105 |
    7. |  1     30        16    270 |
    8. |  1     45         .      . |
    9. |  1     60        18      . |  <- preceding time point is missing
   10. |  1    120        14    960 |
       +----------------------------+

After the replace command, those segments are bridged across the missing point:

       +----------------------------+
       | id   time   progest   area |
       |----------------------------|
    1. |  1      0         1      . |
    2. |  1      1         .      . |
    3. |  1      3        10   16.5 |  <- 16.5 comes from replace command
    4. |  1      5        16     26 |
    5. |  1     10        22     95 |
    6. |  1     15        20    105 |
    7. |  1     30        16    270 |
    8. |  1     45         .      . |
    9. |  1     60        18    510 |  <- 510 comes from replace command
   10. |  1    120        14    960 |
       +----------------------------+

After the egen command, the AUC is repeated on every line of each subject:

  +----------------------------------------------+
  | id   time   group   progest    area      auc |
  |----------------------------------------------|
  |  1      0       1         1       .   1982.5 |  <- have AUC repeated on every line
  |  1      1       1         .       .   1982.5 |
  |  1      3       1        10    16.5   1982.5 |
  |  1      5       1        16      26   1982.5 |
  |  1     10       1        22      95   1982.5 |
  |  1     15       1        20     105   1982.5 |
  |  1     30       1        16     270   1982.5 |
  |  1     45       1         .       .   1982.5 |
  |  1     60       1        18     510   1982.5 |
  |  1    120       1        14     960   1982.5 |
  |----------------------------------------------|
  |  2      0       1       6.5       .  2219.15 |
  |  2      1       1       5.7     6.1  2219.15 |
  |  2      3       1       9.5    15.2  2219.15 |
  |  2      5       1      11.6    21.1  2219.15 |
  |  2     10       1      17.5   72.75  2219.15 |
  |  2     15       1      27.3     112  2219.15 |
  |  2     30       1      28.5   418.5  2219.15 |
  |  2     45       1      22.4  381.75  2219.15 |
  |  2     60       1      19.3  312.75  2219.15 |
  |  2    120       1        10     879  2219.15 |
  +----------------------------------------------+

After creating the tag variable and dropping area:

  +-------------------------------------------+
  | id   time   group   progest      auc  tag |
  |-------------------------------------------|
  |  1      0       1         1   1982.5    1 |  <- tag first observation per subject
  |  1      1       1         .   1982.5    0 |
  |  1      3       1        10   1982.5    0 |
  |  1      5       1        16   1982.5    0 |
  |  1     10       1        22   1982.5    0 |
  |  1     15       1        20   1982.5    0 |
  |  1     30       1        16   1982.5    0 |
  |  1     45       1         .   1982.5    0 |
  |  1     60       1        18   1982.5    0 |
  |  1    120       1        14   1982.5    0 |
  |-------------------------------------------|
  |  2      0       1       6.5  2219.15    1 |  <- tag first observation per subject
  |  2      1       1       5.7  2219.15    0 |
  |  2      3       1       9.5  2219.15    0 |
  |  2      5       1      11.6  2219.15    0 |
  |  2     10       1      17.5  2219.15    0 |
  |  2     15       1      27.3  2219.15    0 |
  |  2     30       1      28.5  2219.15    0 |
  |  2     45       1      22.4  2219.15    0 |
  |  2     60       1      19.3  2219.15    0 |
  |  2    120       1        10  2219.15    0 |
  +-------------------------------------------+

When we analyze the AUC variable, we need to use only one AUC value per subject. In subsequent commands that use the AUC, we will include an “if tag” to limit the analysis to one observation per subject, which maintains the correct sample size.

To make this work for any number of consecutive missing follow-up observations, we can use the following,

    capture drop area
    capture drop auc
    sort id time  // the [_n-`i'] subscripts below rely on this order
    generate area=(time-time[_n-1])*(progest+progest[_n-1])/2 if id==id[_n-1]
    list id time progest area if id==1
    *-- begin fill in for any number of missing values
    capture drop num_records
    bysort id: egen num_records=count(time)
    sum num_records
    scalar max_records=r(max) // maximum number of records per id
    capture drop num_records
    local i=2
    while `i' < max_records {
        replace area=(time-time[_n-`i'])*(progest+progest[_n-`i'])/2 ///
            if (area==. & id==id[_n-`i']) // fill in if missing
        local i = `i'+1
    }
    *-- end fill in for missing
    list id time progest area if id==1
    bysort id: egen auc=sum(area)
    list if id<=2, noobs nolabel sepby(id)
    capture drop tag
    egen tag = tag(id) // indicator for first observation per subject
    drop area
    list if id<=2, noobs nolabel sepby(id)

Graphing the AUC data,

    graph box auc if tag, over(group, ///
        relabel(1 "grp 1" 2 "grp 2" 3 "grp 3" 4 "grp 4"))

[Figure: box plots of auc for grp 1, grp 2, grp 3, and grp 4; vertical axis labeled from 1,000 to 3,500.]

We see one outlier in the lowest dose group, group 1, which was also a suspicious looking subject in the parallel coordinate plot.

[Figure: the parallel coordinate plot shown earlier, repeated for reference. Graphs by Dose Group.]

Since that subject was elevated at three adjacent time points, however, it is probably a true response to the drug, and so should not be treated as an outlier.

Now that we have eliminated the correlation structure in the data, by reducing the data to a single point per subject, we can analyze these data in the ordinary cross-sectional fashion, using a oneway ANOVA between independent groups.

Statistics
  Linear models and related
    ANOVA
      One-way ANOVA
        Main tab: Response variable: auc
        Main tab: Factor variable: group
        Main tab: Output: produce summary table
        by/if/in tab: If: (expression): tag
        OK

    oneway auc group if tag , tabulate

                |           Summary of auc
     Dose Group |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
      Grp 1 (0. |   2082.5333   675.96171           6
      Grp 2 (0. |   1234.8167   277.97327           6
      Grp 3 (0. |   1749.5375   903.61011           4
      Grp 4 (0. |   2175.6125   236.26341           4
    ------------+------------------------------------
          Total |    1780.235    658.9552          20

                            Analysis of Variance
        Source              SS         df      MS            F     Prob > F
    ------------------------------------------------------------------------
    Between groups      2962255.46      3   987418.487      2.99     0.0622
     Within groups      5287961.76     16    330497.61
    ------------------------------------------------------------------------
        Total           8250217.22     19   434221.959

    Bartlett's test for equal variances:  chi2(3) =   7.4390  Prob>chi2 = 0.059

The result falls just short of significance at the 0.05 level (p = 0.062).

Next, we try the nonparametric ANOVA, the Kruskal-Wallis ANOVA, which loosely speaking compares the medians and uses rank scores, so the outlier is no longer an influential point.

Statistics
  Summaries, tables & tests
    Nonparametric tests of hypotheses
      Kruskal-Wallis rank test
        Main tab: Outcome variable: auc
        Main tab: Variable defining groups: group
        if/in tab: If: (expression): tag
        OK

    kwallis auc if tag , by(group)

    Test: Equality of populations (Kruskal-Wallis test)

      +------------------------------------------------------------+
      | group                                    | Obs | Rank Sum  |
      |------------------------------------------+-----+-----------|
      | Grp 1 (0.2ml of 100 mg/ml one nostril)   |   6 |    80.00  |
      | Grp 2 (0.3ml of 100 mg/ml one nostril)   |   6 |    30.00  |
      | Grp 3 (0.2ml of 200 mg/ml one nostril)   |   4 |    38.00  |
      | Grp 4 (0.2ml of 100 mg/ml each nostril)  |   4 |    62.00  |
      +------------------------------------------------------------+

    chi-squared =     9.533 with 3 d.f.
    probability =     0.0230

    chi-squared with ties =     9.533 with 3 d.f.
    probability =     0.0230

This time we find a significant difference among the groups (p = 0.023).

Similarly, we could have computed all possible pairwise significance tests, such as t tests, and then adjusted the p values for multiple comparisons. (See Chapter 2-8 for multiple comparison procedures.)

Shortcoming with this Analysis

Twisk (2002, p.185) points out that an AUC analysis like the one we performed here has a shortcoming.
We did nothing to take into account any differences in baseline. Even though this was an experiment, where the groups were randomized, baseline differences could be large enough to affect the result and lead to this lack of a detected effect.

One approach to adjusting for baseline differences is to subtract the baseline value from each of the posttreatment values before calculating the AUC, which is the change score approach. With such an approach, one must decide what to do with any negative changes. That is, AUC segments above the baseline reference line have positive values and AUC segments under the baseline reference line have negative values. One popular approach, which is used for blood glucose measurements following food intake, for example, is to use only the positive AUC segments. This is called the IAUC (incremental AUC).

Subtracting the baseline measurement, however, is still subject to regression towards the mean bias. That is, subjects with high baseline measurements are more likely to have lower subsequent measurements and subjects with low baseline measurements are more likely to have higher subsequent measurements, which will occur independently of the treatment effect.

A better approach, then, is to include the baseline measurement in the model as a predictor variable. Controlling for the baseline measurement in this way is called the analysis of covariance (ANCOVA) approach. The ANCOVA approach, unlike the change score approach, basically corrects for regression to the mean (Twisk, 2002, p.169).

We cannot be certain that ANCOVA will entirely adjust for regression to the mean, because measurement error in the baseline value will lead to some amount of under- or over-adjustment (Cook and Campbell, 1979, p.164).

The ANCOVA approach will be explained in more detail in the next chapter.

To use the ANCOVA approach, we need a variable with the baseline score.
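As an aside, the trapezoidal AUC described earlier, and the change score variant discussed above, can be checked by hand for a single subject. The following Python sketch is an illustration outside Stata under stated assumptions: missing values are coded as None and are simply bridged over, and the function names `trapezoid_auc` and `change_score_auc` are hypothetical. It does not implement the positive-segments-only IAUC.

```python
from typing import Optional, Sequence

# Progesterone values (nmol/l) for subjects 1 and 2, taken from the listing;
# None marks a missing measurement.
TIMES = [0, 1, 3, 5, 10, 15, 30, 45, 60, 120]  # minutes
SUBJ1 = [1, None, 10, 16, 22, 20, 16, None, 18, 14]
SUBJ2 = [6.5, 5.7, 9.5, 11.6, 17.5, 27.3, 28.5, 22.4, 19.3, 10]


def trapezoid_auc(times: Sequence[float],
                  values: Sequence[Optional[float]]) -> float:
    """Trapezoidal-rule AUC, joining the observed points on either side
    of any missing value with a straight line."""
    pts = [(t, y) for t, y in zip(times, values) if y is not None]
    return sum((t1 - t0) * (y0 + y1) / 2
               for (t0, y0), (t1, y1) in zip(pts, pts[1:]))


def change_score_auc(times: Sequence[float],
                     values: Sequence[Optional[float]]) -> float:
    """AUC of (value - baseline); segments below baseline count as negative."""
    baseline = values[0]
    if baseline is None:
        raise ValueError("baseline (first) value is missing")
    shifted = [None if y is None else y - baseline for y in values]
    return trapezoid_auc(times, shifted)


print(trapezoid_auc(TIMES, SUBJ1))            # subject 1
print(round(trapezoid_auc(TIMES, SUBJ2), 2))  # subject 2
print(change_score_auc(TIMES, SUBJ1))         # subject 1, baseline subtracted
```

For subject 1 the first value printed is 1982.5 and for subject 2 it is 2219.15, matching the auc values in the Stata listing. The change score AUC for subject 1 is 1862.5, which is the raw AUC minus the baseline value (1) times the 120 minutes of follow-up.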
Currently our baseline value is contained in the first occurrence, or time 0, of the progest variable, as the earlier listing of subjects 1 and 2 (variables id, time, group, progest, auc, tag) shows.

To create a separate variable containing the baseline value of progest,

    capture drop progbase // progesterone baseline
    bysort id: gen progbase=progest[1]
    list if id<=2, noobs nolabel sepby(id)

  +------------------------------------------------------+
  | id   time   group   progest      auc  tag   progbase |
  |------------------------------------------------------|
  |  1      0       1         1   1982.5    1          1 |
  |  1      1       1         .   1982.5    0          1 |
  |  1      3       1        10   1982.5    0          1 |
  |  1      5       1        16   1982.5    0          1 |
  |  1     10       1        22   1982.5    0          1 |
  |  1     15       1        20   1982.5    0          1 |
  |  1     30       1        16   1982.5    0          1 |
  |  1     45       1         .   1982.5    0          1 |
  |  1     60       1        18   1982.5    0          1 |
  |  1    120       1        14   1982.5    0          1 |
  |------------------------------------------------------|
  |  2      0       1       6.5  2219.15    1        6.5 |
  |  2      1       1       5.7  2219.15    0        6.5 |
  |  2      3       1       9.5  2219.15    0        6.5 |
  |  2      5       1      11.6  2219.15    0        6.5 |
  |  2     10       1      17.5  2219.15    0        6.5 |
  |  2     15       1      27.3  2219.15    0        6.5 |
  |  2     30       1      28.5  2219.15    0        6.5 |
  |  2     45       1      22.4  2219.15    0        6.5 |
  |  2     60       1      19.3  2219.15    0        6.5 |
  |  2    120       1        10  2219.15    0        6.5 |
  +------------------------------------------------------+

Note: the square brackets in “progest[1]” represent a subscript, here the first observation for each ID number.

Now, using the ANCOVA approach, where we control for baseline,

Stata Version 10 (specify continuous variable with “continuous” option)

Statistics
  Linear models and related
    ANOVA
      Analysis of variance and covariance
        Model tab: Dependent variable: auc
        Model tab: Model: group progbase
        Model tab: Model variables: Categorical except the following continuous variables: progbase
        by/if/in tab: If: (expression): tag
        OK

    anova auc group progbase if tag, continuous(progbase)

Stata Version 11 (specify continuous variable with “c.” prefix)

Statistics
  Linear models and related
    ANOVA/MANOVA
      Analysis of variance and covariance
        Model tab: Dependent variable: auc
        Model tab: Model: group c.progbase
        by/if/in tab: If: (expression): tag
        OK

    anova auc group c.progbase if tag

                       Number of obs =      20     R-squared     =  0.4600
                       Root MSE      = 544.975     Adj R-squared =  0.3160

          Source |  Partial SS    df       MS           F     Prob > F
      -----------+----------------------------------------------------
           Model |  3795254.05     4  948813.514       3.19     0.0438
                 |
           group |  3001109.23     3  1000369.74       3.37     0.0467
        progbase |  832998.594     1  832998.594       2.80     0.1147
                 |
        Residual |  4454963.17    15  296997.544
      -----------+----------------------------------------------------
           Total |  8250217.22    19  434221.959

We see that baseline progesterone was not a significant predictor of the AUC (p = 0.115). Note that the progbase row tests the association between baseline and AUC, adjusted for group, rather than a difference in baseline between the groups; large baseline differences between groups are not expected, since randomization was used. However, the association was strong enough to possibly influence the result. After controlling for baseline, there is a significant difference among the groups (p = 0.047).

We could accomplish the same ANCOVA analysis using linear regression, followed by a posttest simultaneously comparing the group indicators to the referent group.
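Written as a regression model, the ANCOVA fitted above can be sketched as follows. The notation is introduced here for illustration only: grp2, grp3, and grp4 denote 0/1 indicators for dose groups 2 to 4, with group 1 taken as the referent. The choice of referent is an assumption and does not change the overall test of the group effect.

```latex
\begin{aligned}
\text{auc}_i &= \beta_0 + \beta_2\,\text{grp2}_i + \beta_3\,\text{grp3}_i
              + \beta_4\,\text{grp4}_i + \beta_5\,\text{progbase}_i + \varepsilon_i , \\[4pt]
H_0 &: \beta_2 = \beta_3 = \beta_4 = 0 , \\[4pt]
F &= \frac{\text{SS}_{\text{group}}/3}{\text{MS}_{\text{residual}}}
   = \frac{3001109.23/3}{296997.544} \approx 3.37
   \quad \text{on } (3,\,15) \text{ d.f.}
\end{aligned}
```

The F statistic is the group row of the ANCOVA table, so a joint test of the three indicator coefficients in the regression should reproduce p = 0.0467.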
Stata Version 10: * Stata Version 10 anova auc group progbase if tag, continuous(progbase) anova, regress Source | SS df MS -------------+-----------------------------Model | 3795254.05 4 948813.514 Residual | 4454963.17 15 296997.544 -------------+-----------------------------Total | 8250217.22 19 434221.959 Number of obs F( 4, 15) Prob > F R-squared Adj R-squared Root MSE = = = = = = 20 3.19 0.0438 0.4600 0.3160 544.97 -----------------------------------------------------------------------------auc Coef. Std. Err. t P>|t| [95% Conf. Interval] -----------------------------------------------------------------------------_cons 1678.891 402.7646 4.17 0.001 820.4187 2537.364 group 1 201.3574 393.2665 0.51 0.616 -636.8702 1039.585 2 -751.2476 369.5388 -2.03 0.060 -1538.901 36.4058 3 -358.6468 387.453 -0.93 0.369 -1184.483 467.1897 4 (dropped) progbase 89.90432 53.68277 1.67 0.115 -24.51778 204.3264 ------------------------------------------------------------------------------ First creating the indicator variables, so we can drop group 4 to match the anova command, capture drop grp* tab group, gen(grp) regress auc grp1 grp2 grp3 progbase if tag test grp1=grp2=grp3=0 Source | SS df MS -------------+-----------------------------Model | 3795254.05 4 948813.514 Residual | 4454963.17 15 296997.544 -------------+-----------------------------Total | 8250217.22 19 434221.959 Number of obs F( 4, 15) Prob > F R-squared Adj R-squared Root MSE = = = = = = 20 3.19 0.0438 0.4600 0.3160 544.97 -----------------------------------------------------------------------------auc | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------grp1 | 201.3574 393.2665 0.51 0.616 -636.8702 1039.585 grp2 | -751.2476 369.5388 -2.03 0.060 -1538.901 36.4058 grp3 | -358.6468 387.453 -0.93 0.369 -1184.483 467.1897 progbase | 89.90432 53.68277 1.67 0.115 -24.51778 204.3264 _cons | 1678.891 402.7646 4.17 0.001 820.4187 2537.364 -----------------------------------------------------------------------------. test grp1=grp2=grp3=0 ( 1) ( 2) ( 3) grp1 - grp2 = 0 grp1 - grp3 = 0 grp1 = 0 F( 3, 15) = Prob > F = 3.37 0.0467 As expected, we get the same result as the ANCOVA using the anova command (p=0.0467). Chapter 5-15 (revision 16 May 2010) p. 19 Stata Version 11: * Stata Version 10 anova auc group c.progbase if tag regress Source | SS df MS -------------+-----------------------------Model | 3795254.05 4 948813.514 Residual | 4454963.17 15 296997.544 -------------+-----------------------------Total | 8250217.22 19 434221.959 Number of obs F( 4, 15) Prob > F R-squared Adj R-squared Root MSE = = = = = = 20 3.19 0.0438 0.4600 0.3160 544.97 -----------------------------------------------------------------------------auc | Coef. Std. Err. t P>|t| [95% Conf. 
Interval] -------------+---------------------------------------------------------------group | 2 | -952.605 320.8141 -2.97 0.010 -1636.404 -268.806 3 | -560.0042 376.9914 -1.49 0.158 -1363.542 243.5339 4 | -201.3574 393.2665 -0.51 0.616 -1039.585 636.8702 | progbase | 89.90432 53.68277 1.67 0.115 -24.51778 204.3264 _cons | 1880.249 253.1579 7.43 0.000 1340.655 2419.842 ------------------------------------------------------------------------------ First creating the indicator variables, so we can drop group 1 to match the anova command, capture drop grp* tab group, gen(grp) regress auc grp2 grp3 grp4 progbase if tag test grp2=grp3=grp4=0 Source | SS df MS -------------+-----------------------------Model | 3795254.05 4 948813.514 Residual | 4454963.17 15 296997.544 -------------+-----------------------------Total | 8250217.22 19 434221.959 Number of obs F( 4, 15) Prob > F R-squared Adj R-squared Root MSE = = = = = = 20 3.19 0.0438 0.4600 0.3160 544.97 -----------------------------------------------------------------------------auc | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------grp2 | -952.605 320.8141 -2.97 0.010 -1636.404 -268.806 grp3 | -560.0042 376.9914 -1.49 0.158 -1363.542 243.5339 grp4 | -201.3574 393.2665 -0.51 0.616 -1039.585 636.8702 progbase | 89.90432 53.68277 1.67 0.115 -24.51778 204.3264 _cons | 1880.249 253.1579 7.43 0.000 1340.655 2419.842 -----------------------------------------------------------------------------. test grp2=grp3=grp4=0 ( 1) ( 2) ( 3) grp2 - grp3 = 0 grp2 - grp4 = 0 grp2 = 0 F( 3, 15) = Prob > F = 3.37 0.0467 As expected, we get the same result as the ANCOVA using the anova command (p=0.0467). Chapter 5-15 (revision 16 May 2010) p. 20 Response Feature Analysis: Linear Slope of Repeated Measures Next, we will use the 11.2.Isoproterenol.dta dataset provided with the Dupont (2002, p.338) textbook, described as, “Lang et al. 
(1995) studied the effect of isoproterenol, a β-adrenergic agonist, on forearm blood flow in a group of 22 normotensive men. Nine of the study subjects were black and 13 were white. Each subject’s blood flow was measured at baseline and then at escalating doses of isoproterenol.”

Reading the data in

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on 11.2.Isoproterenol.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\11.2.Isoproterenol.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use 11.2.Isoproterenol.dta, clear

Listing the data

    list , nolabel

     +---------------------------------------------------------------------+
     | id   race   fbf0   fbf10   fbf20   fbf60   fbf150   fbf300   fbf400 |
     |---------------------------------------------------------------------|
  1. |  1      1      1     1.4     6.4    19.1       25     24.6       28 |
  2. |  2      1    2.1     2.8     8.3    15.7     21.9     21.7     30.1 |
  3. |  3      1    1.1     2.2     5.7     8.2      9.3     12.5     21.6 |
  4. |  4      1   2.44     2.9     4.6    13.2     17.3     17.6     19.4 |
  5. |  5      1    2.9     3.5     5.7    11.5     14.9     19.7     19.3 |
     |---------------------------------------------------------------------|
  6. |  6      1    4.1     3.7     5.8    19.8     17.7     20.8     30.3 |
  7. |  7      1   1.24     1.2     3.3     5.3      5.4     10.1     10.6 |
  8. |  8      1    3.1       .       .   15.45        .        .     31.3 |
  9. |  9      1    5.8     8.8    13.2    33.3     38.5     39.8     43.3 |
 10. | 10      1    3.9     6.6     9.5    20.2     21.5     30.1     29.6 |
     |---------------------------------------------------------------------|
 11. | 11      1   1.91     1.7     6.3     9.9     12.6     12.7     15.4 |
 12. | 12      1      2     2.3       4     8.4      8.3     12.8     16.7 |
 13. | 13      1    3.7     3.9     4.7    10.5     14.6       20     21.7 |
 14. | 14      2   2.46     2.7    2.54    3.95     4.16      5.1     4.16 |
 15. | 15      2      2     1.8    4.22    5.76     7.08    10.92     7.08 |
     |---------------------------------------------------------------------|
 16. | 16      2   2.26       3    2.99    4.07     3.74     4.58     3.74 |
 17. | 17      2    1.8     2.9    3.41    4.84     7.05     7.48     7.05 |
 18. | 18      2   3.13       4    5.33    7.31     8.81    11.09     8.81 |
 19. | 19      2   1.36     2.7    3.05       4      4.1     6.95      4.1 |
 20. | 20      2   2.82     2.6    2.63   10.03      9.6    12.65      9.6 |
     |---------------------------------------------------------------------|
 21. | 21      2    1.7     1.6    1.73    2.96     4.17     6.04     4.17 |
 22. | 22      2    2.1     1.9       3     4.8      7.4     16.7     21.2 |
     +---------------------------------------------------------------------+

We see that the data are in wide format, with variables

    id       patient ID (1 to 22)
    race     race (1=white, 2=black)
    fbf0     forearm blood flow (ml/min/dl) at isoproterenol dose 0 ng/min
    fbf10    forearm blood flow (ml/min/dl) at isoproterenol dose 10 ng/min
    …
    fbf400   forearm blood flow (ml/min/dl) at isoproterenol dose 400 ng/min

In this dataset, each of the several occasions represents an increasing dose, so the response can be thought of as an effect across dose, rather than as an effect across time.

Graphing the data with a parallel coordinate plot

    #delimit ;
    parplot fbf0-fbf400 , transform(raw)
        xlabel(1 "0" 2 "10" 3 "20" 4 "60" 5 "150" 6 "300" 7 "400")
        ylabel(0(5)45, angle(horizontal))
        xtitle("isoproterenol dose (ng/min)")
        yline(0) ytitle("forearm blood flow (ml/min/dl)")
        by(race)
        ;
    #delimit cr

[Figure: parallel coordinate plots of forearm blood flow (ml/min/dl, axis 0 to 45) against isoproterenol dose (ng/min; 0, 10, 20, 60, 150, 300, 400, equally spaced), in two panels, White and Black. Graphs by Race.]

Given that the dose range is so wide, to see if the increase is linear, a scatterplot with the correct spacing between doses might be worthwhile to examine.
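Although such a plot is not pursued in this chapter, a minimal sketch of one is given here for reference. It assumes the wide-format data are in memory and uses the built-in twoway line command in place of parplot; the axis titles are our own choice, and the /// line continuations assume the code is run from a do-file.

```stata
* Sketch only: plot blood flow against the actual dose, so that the
* x-axis spacing reflects the true dose values.
* Assumes the wide-format 11.2.Isoproterenol.dta is in memory.
preserve                               // keep the wide data untouched
reshape long fbf , i(id) j(dose)       // one row per subject-dose pair
sort id dose
* connect(ascending) starts a new line whenever dose stops increasing,
* that is, at the start of each subject's block of rows
twoway line fbf dose , connect(ascending) by(race) ///
    ylabel(0(5)45, angle(horizontal))              ///
    xtitle("isoproterenol dose (ng/min)")          ///
    ytitle("forearm blood flow (ml/min/dl)")
restore                                // return to the wide data
```

On this scale the doses 0, 10, and 20 sit close together at the left edge of each panel, which is the spacing issue raised above.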
We won’t bother, however, because there is not a good way to come up with a single slope value from the linear and quadratic terms of the regression model, $\hat{Y} = a + b_1 X + b_2 X^2$. We are stuck with just using a linear model.

First we convert the data into long format, which will be easier to work with.

    reshape long fbf , i(id) j(dose)
    list if id<=2, nolabel sepby(id)

     +-------------------------+
     | id   dose   race    fbf |
     |-------------------------|
  1. |  1      0      1      1 |
  2. |  1     10      1    1.4 |
  3. |  1     20      1    6.4 |
  4. |  1     60      1   19.1 |
  5. |  1    150      1     25 |
  6. |  1    300      1   24.6 |
  7. |  1    400      1     28 |
     |-------------------------|
  8. |  2      0      1    2.1 |
  9. |  2     10      1    2.8 |
 10. |  2     20      1    8.3 |
 11. |  2     60      1   15.7 |
 12. |  2    150      1   21.9 |
 13. |  2    300      1   21.7 |
 14. |  2    400      1   30.1 |
     +-------------------------+

Next, convert the 1-2 race variable into a 0-1 black indicator.

    capture drop black
    recode race 1=0 2=1 , gen(black)
    tab black race, nolabel

 RECODE of |
      race |         Race
    (Race) |         1          2 |     Total
-----------+----------------------+----------
         0 |        91          0 |        91
         1 |         0         63 |        63
-----------+----------------------+----------
     Total |        91         63 |       154

Unlike the progesterone absorption profiles, which increased and then decreased, these blood flow graphs appear to increase monotonically, more or less, across the dose range. This suggests that a linear slope would provide an adequate summary measure for comparing whites with blacks.

For completeness, we note that Dupont (2002, p.346) uses log dose to derive the slope summary measure in his textbook. We will skip that, since the small improvement in linear fit (R² = 0.55 vs R² = 0.52) does not seem to justify the added complexity of the presentation.

To derive the summary measure, the slope, we fit a linear regression line to each subject’s data, the 7 dose-fbf pairs, and retrieve the slope using the _b[ ] Stata variable (see box).
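As an aside, the per-subject fits just described can also be sketched with Stata's statsby prefix, which runs a command within each by-group and collects the requested results. This is an alternative to the program developed later in this chapter, not the chapter's own method. The variable name slope is our own choice, and the long-format data (id, dose, fbf) are assumed to be in memory.

```stata
* Sketch only: one slope per subject via statsby.
* Assumes long-format data with variables id, dose, and fbf in memory.
preserve                           // statsby with clear replaces the data
statsby slope=_b[dose], by(id) clear: regress fbf dose
list id slope in 1/5               // one row per subject
restore                            // bring back the long-format data
```

Because statsby replaces the data in memory with one row per subject, the result would need to be saved and merged back if the other variables (race, baseline) are needed in the same dataset.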
Capturing Results from Regression Models

If all you need are the regression coefficients and standard errors, the easy way to retrieve them is to use the Stata system variables _b[ ], or synonymously _coef[ ], and _se[ ] (Stata User’s Guide, version 11, p. 149).

Using the following data for demonstration,

     +------------------+
     | id   y   x1   x2 |
     |------------------|
  1. |  1   5    1   33 |
  2. |  2   4    1   14 |
  3. |  3   5    1   10 |
  4. |  4   3    1    5 |
  5. |  5   6    0   17 |
     |------------------|
  6. |  6   7    0   18 |
  7. |  7   3    0    4 |
  8. |  8   5    0   10 |
  9. |  9   4    0    8 |
     +------------------+

fitting a linear regression

    regress y x1 x2

      Source |       SS       df       MS              Number of obs =       9
-------------+------------------------------           F(  2,     6) =    3.20
       Model |  7.22813239     2  3.61406619           Prob > F      =  0.1132
    Residual |  6.77186761     6   1.1286446           R-squared     =  0.5163
-------------+------------------------------           Adj R-squared =  0.3551
       Total |          14     8        1.75           Root MSE      =  1.0624

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -1.161939   .7347975    -1.58   0.165    -2.959923    .6360463
          x2 |   .1004728    .043656     2.30   0.061    -.0063497    .2072953
       _cons |    3.85461   .6880503     5.60   0.001     2.171011    5.538208
------------------------------------------------------------------------------

The three regression coefficients are stored in system variables of the form _b[variable name] or synonymously _coef[variable name], which in this example are

    _b[x1]   _b[x2]   _b[_cons]

as well as in

    _coef[x1]   _coef[x2]   _coef[_cons]

To find this in Stata’s help, use

    help _b

We can verify this with the display command

    display _b[x1]
    display _b[x2]
    display _b[_cons]
    display _coef[x1]
    display _coef[x2]
    display _coef[_cons]

. display _b[x1]
-1.1619385
. display _b[x2]
.10047281
. display _b[_cons]
3.8546099
. display _coef[x1]
-1.1619385
. display _coef[x2]
.10047281
. display _coef[_cons]
3.8546099

The three standard errors are stored in system variables of the form _se[variable name], which in this example are

    _se[x1]   _se[x2]   _se[_cons]

Verifying this

    display _se[x1]
    display _se[x2]
    display _se[_cons]

. display _se[x1]
.73479753
. display _se[x2]
.04365605
. display _se[_cons]
.68805032

As a test of our Stata code, we check what the slope should be for subject 1

    regress fbf dose if id==1    // see what slope should be

      Source |       SS       df       MS              Number of obs =       7
-------------+------------------------------           F(  1,     5) =   12.43
       Model |  593.987183     1  593.987183           Prob > F      =  0.0168
    Residual |  238.867112     5  47.7734224           R-squared     =  0.7132
-------------+------------------------------           Adj R-squared =  0.6558
       Total |  832.854295     6  138.809049           Root MSE      =  6.9118

------------------------------------------------------------------------------
         fbf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   .0628501   .0178242     3.53   0.017     .0170315    .1086687
       _cons |    6.63156   3.543134     1.87   0.120    -2.476356    15.73948
------------------------------------------------------------------------------

For subject 1, the slope summary measure is 0.0628501.

Now, doing this for all subjects (see box for a more complicated version),

    *-- program to compute slope for each subject
    capture program drop calcslope
    program define calcslope , byable(recall)
        marksample touse
        quietly regress fbf dose if `touse'
        quietly replace doseslope=_b[dose] if `touse'
    end

    * -- call program to compute slope
    capture drop doseslope
    gen doseslope=.                  // variable to hold slope
    quietly bysort id: calcslope     // call program for each subject

Checking how it worked,

    list if id<=2, nolabel sepby(id)

     +--------------------------------------------+
     | id   dose   race    fbf   black   dosesl~e |
     |--------------------------------------------|
  1. |  1      0      1      1       0   .0628501 |
  2. |  1     10      1    1.4       0   .0628501 |
  3. |  1     20      1    6.4       0   .0628501 |
  4. |  1     60      1   19.1       0   .0628501 |
  5. |  1    150      1     25       0   .0628501 |
  6. |  1    300      1   24.6       0   .0628501 |
  7. |  1    400      1     28       0   .0628501 |
     |--------------------------------------------|
  8. |  2      0      1    2.1       0   .0611372 |
  9. |  2     10      1    2.8       0   .0611372 |
 10. |  2     20      1    8.3       0   .0611372 |
 11. |  2     60      1   15.7       0   .0611372 |
 12. |  2    150      1   21.9       0   .0611372 |
 13. |  2    300      1   21.7       0   .0611372 |
 14. |  2    400      1   30.1       0   .0611372 |
     +--------------------------------------------+

This time, just to see if we like it better, we will save only one copy of the slope value per subject. That way, we will not need to tag the first observation and then bother with “if tag” in subsequent analysis commands.

    bysort id: replace doseslope=. if _n~=1    // keep only one value
    list if id<=2, nolabel sepby(id)

     +--------------------------------------------+
     | id   dose   race    fbf   black   dosesl~e |
     |--------------------------------------------|
  1. |  1      0      1      1       0   .0628501 |
  2. |  1     10      1    1.4       0          . |
  3. |  1     20      1    6.4       0          . |
  4. |  1     60      1   19.1       0          . |
  5. |  1    150      1     25       0          . |
  6. |  1    300      1   24.6       0          . |
  7. |  1    400      1     28       0          . |
     |--------------------------------------------|
  8. |  2      0      1    2.1       0   .0611372 |
  9. |  2     10      1    2.8       0          . |
 10. |  2     20      1    8.3       0          . |
 11. |  2     60      1   15.7       0          . |
 12. |  2    150      1   21.9       0          . |
 13. |  2    300      1   21.7       0          . |
 14. |  2    400      1   30.1       0          . |
     +--------------------------------------------+

We can now compare blacks with whites using a simple independent samples t test.

    Statistics
      Summaries, tables & tests
        Classical tests of hypotheses
          Group mean comparison test
            Variable name: doseslope
            Group variable name: race
            OK

    ttest doseslope , by(race)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   White |      13    .0498816    .0047465    .0171139    .0395398    .0602234
   Black |       9    .0152201    .0045469    .0136407    .0047349    .0257052
---------+--------------------------------------------------------------------
combined |      22    .0357019    .0049658    .0232917    .0253749    .0460288
---------+--------------------------------------------------------------------
    diff |            .0346616    .0068584                .0203551     .048968
------------------------------------------------------------------------------
Degrees of freedom: 20

                  Ho: mean(White) - mean(Black) = diff = 0

     Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
       t =   5.0538                t =   5.0538              t =   5.0538
   P < t =   1.0000          P > |t| =   0.0001          P > t =   0.0000

From this, we would conclude that the forearm blood flow increases more rapidly in whites than in blacks when the isoproterenol dosage is increased (p < 0.001).

Shortcoming with this Analysis

We made no adjustment for differences in blood flow that might exist between blacks and whites in the absence of the drug. That is, we made no adjustment for the baseline value. We could use the change approach, subtracting baseline flow from each dose flow, which would adjust for differences in baseline, and then repeat the slope analysis on the change scores. However, that would not adjust for regression towards the mean bias. To do both, we can use an ANCOVA approach, once again.

Adding a baseline variable

    capture drop fbfbase    // forearm blood flow baseline
    bysort id: gen fbfbase=fbf if _n==1
    list if id<=2, noobs nolabel sepby(id) abbrev(15)

  +-------------------------------------------------------+
  | id   dose   race    fbf   black   doseslope   fbfbase |
  |-------------------------------------------------------|
  |  1      0      1      1       0    .0628501         1 |
  |  1     10      1    1.4       0           .         . |
  |  1     20      1    6.4       0           .         . |
  |  1     60      1   19.1       0           .         . |
  |  1    150      1     25       0           .         . |
  |  1    300      1   24.6       0           .         . |
  |  1    400      1     28       0           .         . |
  |-------------------------------------------------------|
  |  2      0      1    2.1       0    .0611372       2.1 |
  |  2     10      1    2.8       0           .         . |
  |  2     20      1    8.3       0           .         . |
  |  2     60      1   15.7       0           .         . |
  |  2    150      1   21.9       0           .         . |
  |  2    300      1   21.7       0           .         . |
  |  2    400      1   30.1       0           .         . |
  +-------------------------------------------------------+

Running the ANCOVA using linear regression,

    regress doseslope black fbfbase

      Source |       SS       df       MS              Number of obs =      22
-------------+------------------------------           F(  2,    19) =   21.11
       Model |  .007856551     2  .003928276           Prob > F      =  0.0000
    Residual |  .003536004    19  .000186105           R-squared     =  0.6896
-------------+------------------------------           Adj R-squared =  0.6570
       Total |  .011392555    21  .000542503           Root MSE      =  .01364

------------------------------------------------------------------------------
   doseslope |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -.0306394   .0060866    -5.03   0.000    -.0433788   -.0179001
     fbfbase |    .007539   .0026851     2.81   0.011     .0019191     .013159
       _cons |    .029416   .0082125     3.58   0.002     .0122271    .0466049
------------------------------------------------------------------------------

We arrive at the same conclusion.

If we wanted to do this type of analysis for several variables, we could modify the calcslope program to allow us to pass a variable name as an argument (see box).

Passing a variable name into a Stata program

The program we used above, replicated here,

    *-- program to compute slope for each subject
    capture program drop calcslope
    program define calcslope , byable(recall)
        marksample touse
        quietly regress fbf dose if `touse'
        quietly replace doseslope=_b[dose] if `touse'
    end

    * -- call program to compute slope
    capture drop doseslope
    gen doseslope=.                  // variable to hold slope
    quietly bysort id: calcslope     // call program for each subject

would have to be modified for each outcome variable we wished to create a slope variable for.
A simple modification is to pass a variable name when the program is called. Here is what it would look like:

    *-- program to compute slope for each subject
    capture program drop calcslope
    program define calcslope , byable(recall)
        marksample touse
        args v1
        quietly regress `v1' dose if `touse'
        quietly replace doseslope_`v1'=_b[dose] if `touse'
    end

    * -- call program to compute slope
    capture drop doseslope_fbf
    gen doseslope_fbf=.                  // variable to hold slope
    quietly bysort id: calcslope fbf     // call program for each subject

This time, the slopes are stored in doseslope_fbf, rather than doseslope. If called using

    capture drop doseslope_heartrate
    gen doseslope_heartrate=.            // variable to hold slope
    quietly bysort id: calcslope heartrate

the slopes would be stored in doseslope_heartrate.

Response Feature Analysis: Mean of Repeated Measurements

When analyzing longitudinal data using a summary measure, Rabe-Hesketh and Everitt (2004, p.145) state,

“The most commonly used measure is the mean of the responses over time because many investigations, e.g., clinical trials, are most concerned with differences in overall levels rather than more subtle effects.”

Bringing the wide format data back in,

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on 11.2.Isoproterenol.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\11.2.Isoproterenol.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use 11.2.Isoproterenol.dta, clear

Listing it,

    list

     +----------------------------------------------------------------------+
     | id    race   fbf0   fbf10   fbf20   fbf60   fbf150   fbf300   fbf400 |
     |----------------------------------------------------------------------|
  1. |  1   White      1     1.4     6.4    19.1       25     24.6       28 |
  2. |  2   White    2.1     2.8     8.3    15.7     21.9     21.7     30.1 |
  3. |  3   White    1.1     2.2     5.7     8.2      9.3     12.5     21.6 |
  4. |  4   White   2.44     2.9     4.6    13.2     17.3     17.6     19.4 |
  5. |  5   White    2.9     3.5     5.7    11.5     14.9     19.7     19.3 |
     |----------------------------------------------------------------------|
  6. |  6   White    4.1     3.7     5.8    19.8     17.7     20.8     30.3 |
  7. |  7   White   1.24     1.2     3.3     5.3      5.4     10.1     10.6 |
  8. |  8   White    3.1       .       .   15.45        .        .     31.3 |
  9. |  9   White    5.8     8.8    13.2    33.3     38.5     39.8     43.3 |
 10. | 10   White    3.9     6.6     9.5    20.2     21.5     30.1     29.6 |
     |----------------------------------------------------------------------|
 11. | 11   White   1.91     1.7     6.3     9.9     12.6     12.7     15.4 |
 12. | 12   White      2     2.3       4     8.4      8.3     12.8     16.7 |
 13. | 13   White    3.7     3.9     4.7    10.5     14.6       20     21.7 |
 14. | 14   Black   2.46     2.7    2.54    3.95     4.16      5.1     4.16 |
 15. | 15   Black      2     1.8    4.22    5.76     7.08    10.92     7.08 |
     |----------------------------------------------------------------------|
 16. | 16   Black   2.26       3    2.99    4.07     3.74     4.58     3.74 |
 17. | 17   Black    1.8     2.9    3.41    4.84     7.05     7.48     7.05 |
 18. | 18   Black   3.13       4    5.33    7.31     8.81    11.09     8.81 |
 19. | 19   Black   1.36     2.7    3.05       4      4.1     6.95      4.1 |
 20. | 20   Black   2.82     2.6    2.63   10.03      9.6    12.65      9.6 |
     |----------------------------------------------------------------------|
 21. | 21   Black    1.7     1.6    1.73    2.96     4.17     6.04     4.17 |
 22. | 22   Black    2.1     1.9       3     4.8      7.4     16.7     21.2 |
     +----------------------------------------------------------------------+

Missing values are permissible, since the mean is computed on the non-missing repeated measurements.
Computing the mean of the nonmissing post isoproterenol administration measurements

    Data
      Create or change variables
        Create new variable (extended)
          Generate variable: meanfbf
          Egen function: row mean
          Egen function argument:
            Variables: fbf10-fbf400
          OK

    capture drop meanfbf
    egen meanfbf=rmean(fbf10-fbf400)

Listing the data to check the calculation,

    list

     +---------------------------------------------------------------------------------+
     | id    race   fbf0   fbf10   fbf20   fbf60   fbf150   fbf300   fbf400    meanfbf |
     |---------------------------------------------------------------------------------|
  1. |  1   White      1     1.4     6.4    19.1       25     24.6       28   17.41667 |
  2. |  2   White    2.1     2.8     8.3    15.7     21.9     21.7     30.1      16.75 |
  3. |  3   White    1.1     2.2     5.7     8.2      9.3     12.5     21.6   9.916667 |
  4. |  4   White   2.44     2.9     4.6    13.2     17.3     17.6     19.4       12.5 |
  5. |  5   White    2.9     3.5     5.7    11.5     14.9     19.7     19.3   12.43333 |
     |---------------------------------------------------------------------------------|
  6. |  6   White    4.1     3.7     5.8    19.8     17.7     20.8     30.3      16.35 |
  7. |  7   White   1.24     1.2     3.3     5.3      5.4     10.1     10.6   5.983334 |
  8. |  8   White    3.1       .       .   15.45        .        .     31.3     23.375 |
  9. |  9   White    5.8     8.8    13.2    33.3     38.5     39.8     43.3   29.48333 |
 10. | 10   White    3.9     6.6     9.5    20.2     21.5     30.1     29.6   19.58333 |
     |---------------------------------------------------------------------------------|
 11. | 11   White   1.91     1.7     6.3     9.9     12.6     12.7     15.4   9.766666 |
 12. | 12   White      2     2.3       4     8.4      8.3     12.8     16.7       8.75 |
 13. | 13   White    3.7     3.9     4.7    10.5     14.6       20     21.7   12.56667 |
 14. | 14   Black   2.46     2.7    2.54    3.95     4.16      5.1     4.16   3.768333 |
 15. | 15   Black      2     1.8    4.22    5.76     7.08    10.92     7.08   6.143333 |
     |---------------------------------------------------------------------------------|
 16. | 16   Black   2.26       3    2.99    4.07     3.74     4.58     3.74   3.686667 |
 17. | 17   Black    1.8     2.9    3.41    4.84     7.05     7.48     7.05      5.455 |
 18. | 18   Black   3.13       4    5.33    7.31     8.81    11.09     8.81   7.558333 |
 19. | 19   Black   1.36     2.7    3.05       4      4.1     6.95      4.1       4.15 |
 20. | 20   Black   2.82     2.6    2.63   10.03      9.6    12.65      9.6   7.851666 |
     |---------------------------------------------------------------------------------|
 21. | 21   Black    1.7     1.6    1.73    2.96     4.17     6.04     4.17      3.445 |
 22. | 22   Black    2.1     1.9       3     4.8      7.4     16.7     21.2   9.166667 |
     +---------------------------------------------------------------------------------+

Checking the calculation in observation 8,

    display (15.45+31.3)/2    // mean omitting baseline

23.375

We leave out the baseline fbf because it is not part of the post drug administration outcome.

We can now analyze these data using an ANCOVA approach, controlling for baseline.
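Written as an equation, the ANCOVA fitted here is a linear regression of the per-subject mean on the race indicator and the baseline value. The Greek-letter notation and the subject index i are our own shorthand, not taken from the source textbooks; black is the 0-1 indicator for black subjects and fbf0 is the baseline flow.

```latex
\[
\text{meanfbf}_i \;=\; \beta_0 \;+\; \beta_1\,\text{black}_i \;+\; \beta_2\,\text{fbf0}_i \;+\; \varepsilon_i ,
\qquad i = 1,\dots,22
\]
```

Under this model, $\beta_1$ is the difference in mean post-dose blood flow between blacks and whites at the same baseline value, and the test of $\beta_1 = 0$ is the baseline-adjusted group comparison.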
First recoding the race variable, again, and then using a linear regression to obtain the ANCOVA,

    capture drop black
    recode race 1=0 2=1 , gen(black)
    tab black race
    regress meanfbf black fbf0

      Source |       SS       df       MS              Number of obs =      22
-------------+------------------------------           F(  2,    19) =   24.93
       Model |  723.655489     2  361.827745           Prob > F      =  0.0000
    Residual |  275.804739    19  14.5160389           R-squared     =  0.7240
-------------+------------------------------           Adj R-squared =  0.6950
       Total |  999.460228    21  47.5933442           Root MSE      =    3.81

------------------------------------------------------------------------------
     meanfbf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -7.593173   1.699874    -4.47   0.000    -11.15105   -4.035297
        fbf0 |   3.196871   .7498965     4.26   0.000      1.62732    4.766423
       _cons |   6.312108   2.293603     2.75   0.013     1.511542    11.11267
------------------------------------------------------------------------------

Just for practice, we will do the same thing in long format.

    capture drop race
    capture drop meanfbf
    reshape long fbf , i(id) j(dose)
    list if id<=2, noobs nolabel sepby(id)

  +--------------------------+
  | id   dose    fbf   black |
  |--------------------------|
  |  1      0      1       0 |
  |  1     10    1.4       0 |
  |  1     20    6.4       0 |
  |  1     60   19.1       0 |
  |  1    150     25       0 |
  |  1    300   24.6       0 |
  |  1    400     28       0 |
  |--------------------------|
  |  2      0    2.1       0 |
  |  2     10    2.8       0 |
  |  2     20    8.3       0 |
  |  2     60   15.7       0 |
  |  2    150   21.9       0 |
  |  2    300   21.7       0 |
  |  2    400   30.1       0 |
  +--------------------------+

Computing the mean of the post isoproterenol administration measurements

    capture drop meanfbf
    egen meanfbf=mean(fbf) if dose>0 ,by(id)
    list if id==7 | id==8, noobs nolabel sepby(id)

  +--------------------------------------+
  | id   dose     fbf   black    meanfbf |
  |--------------------------------------|
  |  7      0    1.24       0          . |
  |  7     10     1.2       0   5.983334 |
  |  7     20     3.3       0   5.983334 |
  |  7     60     5.3       0   5.983334 |
  |  7    150     5.4       0   5.983334 |
  |  7    300    10.1       0   5.983334 |
  |  7    400    10.6       0   5.983334 |
  |--------------------------------------|
  |  8      0     3.1       0          . |
  |  8     10       .       0     23.375 |
  |  8     20       .       0     23.375 |
  |  8     60   15.45       0     23.375 |
  |  8    150       .       0     23.375 |
  |  8    300       .       0     23.375 |
  |  8    400    31.3       0     23.375 |
  +--------------------------------------+

We see that the mean is computed correctly on the nonmissing values.

This time we will put the baseline fbf value in the last observation for each subject, since meanfbf is missing in the first observation.

Creating a baseline variable and setting all values of the meanfbf variable to missing except for the last line,

    capture drop fbfbase
    bysort id: gen fbfbase=fbf[1] if _n==_N
    bysort id: replace meanfbf=. if _n~=_N
    list if id<=2, noobs nolabel sepby(id)

  +-----------------------------------------------+
  | id   dose    fbf   black    meanfbf   fbfbase |
  |-----------------------------------------------|
  |  1      0      1       0          .         . |
  |  1     10    1.4       0          .         . |
  |  1     20    6.4       0          .         . |
  |  1     60   19.1       0          .         . |
  |  1    150     25       0          .         . |
  |  1    300   24.6       0          .         . |
  |  1    400     28       0   17.41667         1 |
  |-----------------------------------------------|
  |  2      0    2.1       0          .         . |
  |  2     10    2.8       0          .         . |
  |  2     20    8.3       0          .         . |
  |  2     60   15.7       0          .         . |
  |  2    150   21.9       0          .         . |
  |  2    300   21.7       0          .         . |
  |  2    400   30.1       0      16.75       2.1 |
  +-----------------------------------------------+

Requesting the ANCOVA,

    regress meanfbf black fbfbase

      Source |       SS       df       MS              Number of obs =      22
-------------+------------------------------           F(  2,    19) =   24.93
       Model |  723.655489     2  361.827745           Prob > F      =  0.0000
    Residual |  275.804739    19  14.5160389           R-squared     =  0.7240
-------------+------------------------------           Adj R-squared =  0.6950
       Total |  999.460228    21  47.5933442           Root MSE      =    3.81

------------------------------------------------------------------------------
     meanfbf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -7.593173   1.699874    -4.47   0.000    -11.15105   -4.035297
     fbfbase |   3.196871   .7498965     4.26   0.000      1.62732    4.766423
       _cons |   6.312108   2.293603     2.75   0.013     1.511542    11.11267
------------------------------------------------------------------------------

When to Use Which Summary Measure

Senn et al (2000) discuss this question.

The AUC method is often used in pharmacokinetic studies, expressed as the area under the concentration time curve (Senn et al, 2000, p.869). If one is more interested in the total exposure to a drug in the body between absorption and excretion, and less in how quickly it is absorbed or excreted, the AUC measure expresses this well.

If it is thought that the effect is to rapidly reach a plateau and then remain there, the mean approach, leaving baseline out of the computation of the mean summary measure, with baseline used as a covariate in an ANCOVA, is a good choice (Senn et al, 2000, p.869, Figure 1). [Exercise: look at Figure 1 in Senn et al, p.869]

If one is interested in how rapidly the effect changes, without a plateau effect, then a slope is an appropriate summary measure (Senn et al, 2000, p.869, Figure 1).

A more sophisticated approach is a hierarchical model (also called multilevel model, or mixed model), which we will cover later in this course. Senn et al (2000, p.873) point out, however,

“The summary measures approach is a simple and robust approach to analysing clinical trials. In many cases the loss of efficiency compared to fitting more formal hierarchical models is not great.”

Exercise

Look at the Grantham et al (N Engl J Med, 2006) paper. They have longitudinal data, which they show in their figures. They analyzed their data using a response feature analysis. Look at the first paragraph on page 2124.
They 1) computed a slope for each patient as one summary measure, and 2) used (year 3 minus baseline)/interval as a change-per-year summary measure. In their Table 1, they show the comparison of group means of individual slopes.

References

Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman & Hall/CRC, pp.426-433.

Cook TD, Campbell DT. (1979). Quasi-Experimentation: Design & Analysis Issues for Field Settings. Boston, Houghton Mifflin Company.

Cox NJ. (2004). Speaking Stata: graphing agreement and disagreement. The Stata Journal 4(3):329-349. [available free at: http://www.stata-journal.com/archives.html]

Dalton ME, Bromhan DR, Ambrose CL, Osborne J, Dalton KD. (1987). Nasal absorption of progesterone in women. Br J Obstet Gynaecol 94(1):85-8.

Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: a Simple Introduction to the Analysis of Complex Data. Cambridge, Cambridge University Press.

Lang CC, Stein CM, Brown RM, et al. (1995). Attenuation of isoproterenol-mediated vasodilation in blacks. N Engl J Med 333:155-60.

Rabe-Hesketh S, Everitt B. (2003). A Handbook of Statistical Analyses Using Stata. 3rd Ed. New York, Chapman & Hall/CRC.

Senn S, Stevens L, Chaturvedi N. (2000). Tutorial in biostatistics: repeated measures in clinical trials: simple strategies for analysis using summary measures. Statist Med 19:861-877.

Twisk JWR. (2003). Applied Longitudinal Data Analysis for Epidemiology: A Practical Guide. Cambridge, Cambridge University Press.

van Belle G. (2002). Statistical Rules of Thumb. New York, John Wiley & Sons.

Wegman EJ. (1990). Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 85:664-675.