Chapter 5-15. Correlated Data: Response Feature (Summary Measure) Analysis
In this chapter we will begin discussing how to model data that are correlated. This occurs when
repeated measurements are taken from the same research subject. It also occurs with hierarchical
clusters, such as patients within physicians, and physicians within hospitals.
The linear regression, logistic regression, and Cox regression models we have discussed thus far,
as well as simple statistical tests that compare groups, all assume that observations are
independent. The statistical formulas and p value calculations are only correct, then, when this
assumption is met.
The following example illustrates this.
Autocorrelation and Type I Error Example
van Belle (2002, pp 7-11) provides an example of what happens to the Type I error rate when
serially acquired observations, such as sequential laboratory measurements, are correlated, which is
referred to as autocorrelation.
Measurements taken spatially are also known to exhibit autocorrelation, such as inflammation
measurements taken distally from a wound site.
In van Belle’s example, observations ordered by time are correlated as follows: adjacent
observations have correlation ρ, where ρ is the population correlation coefficient, observations
two steps apart have correlation ρ², and so on. This is called a first order autoregressive process,
AR(1), with correlation ρ. For reasonably large n, these data will have

$$\text{true standard error of } \bar{x} = \sqrt{\frac{1+\rho}{1-\rho}}\;\frac{s}{\sqrt{n}}$$

rather than standard error $= s/\sqrt{n}$, which applies to the independent observation case.
Using the one-sample t test that assumes independence,

$$t = \frac{\bar{x}}{s/\sqrt{n}}, \qquad H_0\colon \mu = 0,$$

with AR(1) data, then, inflates the Type I error when ρ is positive and deflates it when ρ is
negative. In either case, significance is not achieved the expected proportion of times. The
effect is quite dramatic, as shown in the following table (van Belle, 2002, p.9):
_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Chapter 5-15 (revision 16 May 2010)
p. 1
Effect of Autocorrelation on Type I Error, when Assuming Independent Observations

ρ                0      0.1    0.2    0.3    0.4    0.5
Type I error     0.05   0.08   0.11   0.15   0.20   0.26

We see that with autocorrelation as low as 0.2, the Type I error more than doubles (from 0.05 to
0.11). As this example illustrates, lack of independence in the data makes a shambles of hypothesis
testing when statistical methods assuming independence are used.
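The Type I error rates quoted from van Belle can be approximated directly from the standard error formula. The following is a minimal Python sketch, not part of the course's Stata materials; it assumes a large-sample normal approximation to the t test, and the function name type1_error_ar1 is introduced here only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def type1_error_ar1(rho, alpha=0.05):
    """Approximate Type I error of a test that assumes independence when the
    data actually follow an AR(1) process with correlation rho (large n)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    inflation = sqrt((1 + rho) / (1 - rho))      # true SE divided by naive SE
    # The naive statistic is `inflation` times too large, so it exceeds z_crit
    # whenever a correctly scaled statistic exceeds z_crit / inflation.
    return 2 * (1 - std_normal.cdf(z_crit / inflation))


for rho in (0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"rho = {rho:.1f}   Type I error = {type1_error_ar1(rho):.2f}")
```

Under these assumptions the printed values agree with the table to two decimal places. A negative rho gives a value below 0.05, which is the deflation described above.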
Progesterone Dataset
Dalton et al. (1987) report a study where they obtain absorption profiles from women following
the administration of ointment containing 20, 30, and 40 mg of progesterone to the nasal mucosa.
Their dataset is reproduced in Altman (1991, pp.427-428). The four treatment groups are:
Group 1 (0.2ml of 100 mg/ml one nostril)
Group 2 (0.3ml of 100 mg/ml one nostril)
Group 3 (0.2ml of 200 mg/ml one nostril)
Group 4 (0.2ml of 100 mg/ml each nostril)
Opening the progesterone.dta dataset in Stata,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on progesterone.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\progesterone.dta",
clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use progesterone.dta, clear
Listing the variable names:
Data
Describe data
Describe variables in memory
Options: Display only variable names
OK
describe, simple
*
<or>
ds
id
group
progest0
progest1
progest3
progest5
progest10
progest15
progest30
progest45
progest60
progest120
we see that the data represent the serum level of progesterone (nmol/l) at baseline (time 0) and
after nasal administration (1, 3, 5, …, 120 minutes).
Listing the data,
Data
Describe data
List data
Main tab: Column widths: Compress width of columns in both tables and
display formats
Main tab: Do not list observation numbers
Options tab: Separators: When these variables change: group
Options tab: Display numeric codes rather than label values
OK
list, compress noobs sepby(group) nolabel
+--------------------------------------------------------------------------------------------+
| id   group   pr~t0   pro~1   pro~3   pr~t5   pr~10   pr~15   pr~30   pr~45   pr~60   p~120 |
|--------------------------------------------------------------------------------------------|
|  1       1       1       .      10      16      22      20      16       .      18      14 |
|  2       1     6.5     5.7     9.5    11.6    17.5    27.3    28.5    22.4    19.3      10 |
|  3       1       3       4       4      13    15.8    19.5    21.2    17.9    10.7    13.4 |
|  4       1       1     2.1     9.7       .    21.8       .    27.5       .    15.5     6.2 |
|  5       1       1       1       1     4.2    22.6    23.9    45.5    42.6      35    10.6 |
|  6       1       1       1       1       1     3.9    14.7    17.6    16.1     8.8    10.8 |
|--------------------------------------------------------------------------------------------|
|  7       2       1     1.5       5      11      16      23      15       9       6       5 |
|  8       2       1       1     6.5      20    22.5    27.8      19       9     8.2       8 |
|  9       2       1       1     7.3     7.5      18      20    18.9    12.8     6.3     4.8 |
| 10       2       3     2.5       2     2.7     3.4     3.6      14     7.3     7.7     4.7 |
| 11       2     8.3     7.5     9.6      11    11.5    15.7    15.2    15.8      14    11.5 |
| 12       2     6.2     5.9     6.8     7.7       9     9.3    12.1    12.2      11       9 |
|--------------------------------------------------------------------------------------------|
| 13       3     8.4    10.8     8.1     7.8     8.5      12    19.8    22.2    25.2    40.5 |
| 14       3     3.5     3.2     3.4     3.3     8.5     9.4    14.5    12.7    11.5    10.2 |
| 15       3     3.5       4     4.8     3.5     3.7      13    12.5      15      22    10.5 |
| 16       3     3.7     3.2     4.3     4.5     5.5     8.5    10.3    11.1       8       6 |
|--------------------------------------------------------------------------------------------|
| 17       4       5     5.6     6.1     7.2    13.8      26    26.1    25.7    20.5      11 |
| 18       4     4.5     5.1    13.2      21    26.8      28      22    17.8    15.7      14 |
| 19       4     8.4     6.2       8    18.5    33.8      35    26.2      23      19    12.6 |
| 20       4     4.2     3.2     4.2     4.8    10.3    13.7    17.1    18.3    17.4    15.8 |
+--------------------------------------------------------------------------------------------+
These data represent a longitudinal dataset. Twisk (2003, p.1) explains,
“Longitudinal studies are defined as studies in which the outcome variable is repeatedly
measured; i.e., the outcome variable is measured in the same individual on several
different occasions.”
In this dataset, each of the several occasions represents a repeated measurement of serum
progesterone across time.
Within each research subject, we can expect autocorrelation, also called serial correlation, and so
our analysis strategy must take this into account.
Our first step, however, will be simply to graph the data.
Parallel Coordinate Plots
A popular approach to graphing such data is the parallel coordinate plot (Cox, 2004). If you are
interested in why such plots have this name, or you want to see a more general presentation of
this graphing strategy, refer to Wegman (1990).
First, you might have to update your Stata to get the parplot command.
findit parplot
parplot from http://fmwww.bc.edu/RePEc/bocode/p
'PARPLOT': module for parallel coordinates plots / parplot draws parallel
coordinates plots. Stata 8 is required. d / KW: graphics / KW:
multivariate / KW: parallel coordinates plot / Requires: Stata version 8.0
/ Author: Nicholas J. Cox, Durham University / Support: email
Click on the blue link, which gives you
INSTALLATION FILES
parplot.ado
parplot.hlp
(click here to install)
Click on the blue link.
To generate the graph, use
#delimit ;
parplot progest0-progest120
, transform(raw)
xlabel(1 "0" 2 "1" 3 "3" 4 "5" 5 "10" 6 "15" 7 "30"
8 "45" 9 "60" 10 "120")
ylabel(0(5)50, angle(horizontal))
xtitle("Minutes Post Dose Administration")
yline(0) ytitle("Serum Progesterone(nmol/l)")
by(group)
;
#delimit cr
[Figure: parallel coordinate plots of Serum Progesterone (nmol/l), axis 0 to 50, against Minutes
Post Dose Administration (0, 1, 3, 5, 10, 15, 30, 45, 60, 120, equally spaced). Graphs by Dose
Group, four panels: Grp 1 (0.2ml of 100 mg/ml one nostril), Grp 2 (0.3ml of 100 mg/ml one nostril),
Grp 3 (0.2ml of 200 mg/ml one nostril), Grp 4 (0.2ml of 100 mg/ml each nostril).]
You can get a similar looking graph using Stata’s “twoway line” command. To do that,
however, you must first reshape the data into long format.
Beginning with the data in wide format,
+--------------------------------------------------------------------------------------------+
| id   group   pr~t0   pro~1   pro~3   pr~t5   pr~10   pr~15   pr~30   pr~45   pr~60   p~120 |
|--------------------------------------------------------------------------------------------|
|  1       1       1       .      10      16      22      20      16       .      18      14 |
|  2       1     6.5     5.7     9.5    11.6    17.5    27.3    28.5    22.4    19.3      10 |
…
Reshaping it to long format,
reshape long progest , i(id) j(time)
In this command, “progest” is called the stub variable (prefix variable would be a more intuitive
name). Stata uses the j() variable, time, to store the suffix that followed “progest” in the
variable names.
Stata uses the i() variable, id, to identify each subject. This variable must take a unique value
on each row of the wide data; otherwise reshape stops with an error.
Stata then copies the values of any other variables in the dataset (here, group) onto each newly
created row.
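To make the roles of the stub, i, and j concrete, here is a minimal Python sketch of the same wide-to-long idea. It is an illustration only, not how Stata implements reshape: the function reshape_long and the use of None for Stata's missing value "." are assumptions made for this example, and only the first three time points of subjects 1 and 2 are included.

```python
# First three time points of subjects 1 and 2, in wide format.
# None stands in for Stata's missing value ".".
wide = [
    {"id": 1, "group": 1, "progest0": 1.0, "progest1": None, "progest3": 10.0},
    {"id": 2, "group": 1, "progest0": 6.5, "progest1": 5.7, "progest3": 9.5},
]


def reshape_long(rows, stub, i, j):
    """Turn one wide row per subject into one long row per subject and suffix."""
    ids = [row[i] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError(f"{i} does not uniquely identify the wide rows")
    long_rows = []
    for row in rows:
        # Variables that do not begin with the stub are copied to every new row.
        constant = {k: v for k, v in row.items() if not k.startswith(stub)}
        suffixes = sorted(int(k[len(stub):]) for k in row if k.startswith(stub))
        for suffix in suffixes:
            long_rows.append({**constant, j: suffix, stub: row[f"{stub}{suffix}"]})
    return long_rows


long_data = reshape_long(wide, stub="progest", i="id", j="time")
for record in long_data:
    print(record)
```

Each printed record carries id and group unchanged, the numeric suffix in time, and the measurement in progest, which mirrors the long listing that Stata produces.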
Listing the first two subjects to check if the reshape did what was expected.
list if id<=2, noobs nolabel sepby(id)
+-----------------------------+
| id   time   group   progest |
|-----------------------------|
|  1      0       1         1 |
|  1      1       1         . |
|  1      3       1        10 |
|  1      5       1        16 |
|  1     10       1        22 |
|  1     15       1        20 |
|  1     30       1        16 |
|  1     45       1         . |
|  1     60       1        18 |
|  1    120       1        14 |
|-----------------------------|
|  2      0       1       6.5 |
|  2      1       1       5.7 |
|  2      3       1       9.5 |
|  2      5       1      11.6 |
|  2     10       1      17.5 |
|  2     15       1      27.3 |
|  2     30       1      28.5 |
|  2     45       1      22.4 |
|  2     60       1      19.3 |
|  2    120       1        10 |
+-----------------------------+
Graphing the data using twoway line with the connect(ascending) option,
sort id time
#delimit ;
graph twoway (line progest time , connect(ascending))
, by(group)
ylabel(0(5)50, angle(horizontal))
xtitle("Minutes Post Dose Administration")
ytitle("Serum Progesterone(nmol/l)")
xlabel(0 "0" 1 " " 3 " " 5 "5" 10 "10" 15 "15" 30 "30"
45 "45" 60 "60" 120 "120" ,labsize(small))
;
#delimit cr
[Figure: twoway line plots of Serum Progesterone (nmol/l), axis 0 to 50, against Minutes Post Dose
Administration (0 to 120, drawn to scale), one line per subject. Graphs by Dose Group, four panels:
Grp 1 (0.2ml of 100 mg/ml one nostril), Grp 2 (0.3ml of 100 mg/ml one nostril),
Grp 3 (0.2ml of 200 mg/ml one nostril), Grp 4 (0.2ml of 100 mg/ml each nostril).]
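It is the connect(ascending) option that instructs Stata to draw a separate line for each subject:
points are connected only while time is increasing, which is why the data are first sorted by id
and time. This graph has an advantage over the parplot in that the time axis is drawn to scale, so
the unequal spacing of the time points is displayed correctly.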
[Figure: the parallel coordinate plots by dose group, repeated here for comparison, with the ten
time points equally spaced on the horizontal axis.]
However, notice for this example the parplot provides better resolution for the low values of
time, which is a different advantage.
Summary Measure (Response Feature) Analysis
A common approach to analyzing longitudinal data is to use a summary measure computed
directly from the observed data. In this way, all of the repeated measurements are reduced to a
single number per subject, which eliminates the need to account for the correlation structure of
the data. Then, the analysis reduces to comparing the groups in a cross-sectional fashion, such as
an independent groups t test (for two groups) or a oneway ANOVA for the four groups shown
here.
The summary measure approach is also called response feature analysis (Dupont, 2002, pp. 345-356).
Altman (1991, pp.430-431) lists some of the more frequently derived summary measures:
“-- mean of all the measurements (i.e., ignore the time response)
-- height of peak
-- time to reach peak
-- time to reach a given level
-- time to change by a given amount
-- time above a given level
-- time to achieve maximum change from original level (baseline)
-- time to return (near) to baseline level
-- change from first to last measurement
-- final level (perhaps the average of the last few measurements)
-- area under the curve (AUC)”
“Several of these suggestions incorporate some arbitrary definitions which should be
chosen in advance of the analysis rather than after inspection of the data. Several are
specifically aimed at data with peaks. Where initial values vary considerably the change
from baseline may be used.”
“…In general it is reasonable to consider two or three derived statistics, but as in any
study it is highly desirable to identify a single measure of primary interest. The choice of
appropriate measures should relate to the study objectives. For example, if the study is
one of treatment efficacy we may reasonably be most interested in the values at the end of
the study, perhaps in relation to starting values. If the study is to evaluate the
effectiveness of analgesics, then we would probably be interested in the rapid
effectiveness of the drug, perhaps by looking at the timing of the peak and the level
achieved, and perhaps also the time above some critical level.”
Response Feature Analysis: Area Under the Curve (AUC)
Dupont (2002, p.355-356), Altman (1991, pp.431-433), Twisk (2003, pp.184-185) describe how
to apply the AUC approach. It is done by approximating the area under the curve with the sum
of trapezoids, called the trapezoidal rule (the American term) or trapezium rule (the British
term).
Altman (1991, pp. 431-433) explains,
“The area under the curve (AUC) is a useful way of summarizing the information
from a series of measurements on one individual. It is frequently used in clinical
pharmacology, where the AUC from serum levels can be interpreted as the total uptake or
bioavailability of whatever had been administered.
“The data are joined by straight lines to get a ‘curve’. The AUC is usually
calculated by adding the areas under the curve between each pair of consecutive
observations. If we have measurements y1 and y2 at times t1 and t2 , then the AUC
between those two times is the product of the time difference and the average of the two
measurements. Thus we get (t2 – t1) (y1 + y2)/2. This is known as the trapezium rule
because of the shape of each segment of the area under the curve.
If we have n + 1 measurements $y_i$ at times $t_i$ (i = 0, … , n) then the AUC is
calculated as

$$\frac{1}{2}\sum_{i=0}^{n-1}(t_{i+1}-t_i)(y_i+y_{i+1}).$$

The units of the AUC are the product of the units used for yi and ti , for example
nmol.min/l, and are not easy to understand. It may be useful to divide the AUC by the
total time to get a sort of weighted average level over the time period.
…We can calculate the AUC even when there are missing data, except when the
final observation is missing.”
After computing the AUC, the groups are compared in a cross-sectional fashion, using t tests,
ANOVA, etc., treating the AUC as we would any other continuous variable.
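The trapezoidal rule quoted from Altman can be sketched in a few lines. The Python below is a minimal illustration rather than the Stata approach used in this chapter: the function name auc_trapezoid is hypothetical, None stands in for a missing value, and the function does not guard against a missing first or last observation, the case Altman singles out.

```python
def auc_trapezoid(times, values):
    """Trapezoidal AUC. Missing values (None) are skipped, so a straight line
    joins the last observed point to the next observed point."""
    observed = [(t, y) for t, y in zip(times, values) if y is not None]
    area = 0.0
    for (t0, y0), (t1, y1) in zip(observed, observed[1:]):
        area += (t1 - t0) * (y0 + y1) / 2   # (time difference) x (average level)
    return area


minutes = [0, 1, 3, 5, 10, 15, 30, 45, 60, 120]
subject1 = [1, None, 10, 16, 22, 20, 16, None, 18, 14]          # id 1
subject2 = [6.5, 5.7, 9.5, 11.6, 17.5, 27.3, 28.5, 22.4, 19.3, 10]  # id 2

auc1 = auc_trapezoid(minutes, subject1)
auc2 = auc_trapezoid(minutes, subject2)
print("AUC subject 1:", auc1)
print("AUC subject 2:", round(auc2, 2))
# Dividing by the total time gives Altman's "weighted average level".
print("Weighted average level, subject 1:", round(auc1 / (minutes[-1] - minutes[0]), 2))
```

For the first two subjects of the progesterone data this gives 1982.5 and 2219.15 nmol.min/l, respectively.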
It is easier to perform the AUC calculation in long format, so we continue to use the reshaped
data, which was reshaped earlier using,
reshape long progest , i(id) j(time)
Calculating the AUC,
capture drop area
capture drop auc
generate area=(time-time[_n-1])*(progest+progest[_n-1])/2 if id==id[_n-1]
list id time progest area if id==1
replace area=(time-time[_n-2])*(progest+progest[_n-2])/2 ///
if (id==id[_n-2] & area==.) // fill in if missing 1 consecutive time
list id time progest area if id==1
bysort id: egen auc=sum(area)
list if id<=2, noobs nolabel sepby(id)
egen tag = tag(id) // indicator for first observation per subject
drop area
list if id<=2, noobs nolabel sepby(id)
     +----------------------------+
     | id   time   progest   area |
     |----------------------------|
  1. |  1      0         1      . |
  2. |  1      1         .      . |
  3. |  1      3        10      . | <- area missing: the one preceding time point is missing
  4. |  1      5        16     26 |
  5. |  1     10        22     95 |
  6. |  1     15        20    105 |
  7. |  1     30        16    270 |
  8. |  1     45         .      . |
  9. |  1     60        18      . | <- area missing: the one preceding time point is missing
 10. |  1    120        14    960 |
     +----------------------------+
     +----------------------------+
     | id   time   progest   area |
     |----------------------------|
  1. |  1      0         1      . |
  2. |  1      1         .      . |
  3. |  1      3        10   16.5 | <- 16.5 comes from replace command
  4. |  1      5        16     26 |
  5. |  1     10        22     95 |
  6. |  1     15        20    105 |
  7. |  1     30        16    270 |
  8. |  1     45         .      . |
  9. |  1     60        18    510 | <- 510 comes from replace command
 10. |  1    120        14    960 |
     +----------------------------+
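As a check on the two values filled in by the replace command, the arithmetic for subject 1 can be written out. The notation $A_{a \to b}$ for the trapezoid between minutes a and b is introduced here only for this worked example.

```latex
\begin{align*}
A_{0 \to 3}   &= \frac{(3 - 0)(1 + 10)}{2} = 16.5, \\
A_{30 \to 60} &= \frac{(60 - 30)(16 + 18)}{2} = 510, \\
\text{AUC}_{\text{subject 1}} &= 16.5 + 26 + 95 + 105 + 270 + 510 + 960 = 1982.5 .
\end{align*}
```

In each filled-in segment the missing time point is simply bridged by a straight line between its two observed neighbours.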
+------------------------------------------------+
| id   time   group   progest    area        auc |
|------------------------------------------------|
|  1      0       1         1       .     1982.5 | <- have AUC repeated on every line
|  1      1       1         .       .     1982.5 |
|  1      3       1        10    16.5     1982.5 |
|  1      5       1        16      26     1982.5 |
|  1     10       1        22      95     1982.5 |
|  1     15       1        20     105     1982.5 |
|  1     30       1        16     270     1982.5 |
|  1     45       1         .       .     1982.5 |
|  1     60       1        18     510     1982.5 |
|  1    120       1        14     960     1982.5 |
|------------------------------------------------|
|  2      0       1       6.5       .    2219.15 |
|  2      1       1       5.7     6.1    2219.15 |
|  2      3       1       9.5    15.2    2219.15 |
|  2      5       1      11.6    21.1    2219.15 |
|  2     10       1      17.5   72.75    2219.15 |
|  2     15       1      27.3     112    2219.15 |
|  2     30       1      28.5   418.5    2219.15 |
|  2     45       1      22.4  381.75    2219.15 |
|  2     60       1      19.3  312.75    2219.15 |
|  2    120       1        10     879    2219.15 |
+------------------------------------------------+
+---------------------------------------------+
| id   time   group   progest       auc   tag |
|---------------------------------------------|
|  1      0       1         1    1982.5     1 | <- tag first observation per subject
|  1      1       1         .    1982.5     0 |
|  1      3       1        10    1982.5     0 |
|  1      5       1        16    1982.5     0 |
|  1     10       1        22    1982.5     0 |
|  1     15       1        20    1982.5     0 |
|  1     30       1        16    1982.5     0 |
|  1     45       1         .    1982.5     0 |
|  1     60       1        18    1982.5     0 |
|  1    120       1        14    1982.5     0 |
|---------------------------------------------|
|  2      0       1       6.5   2219.15     1 | <- tag first observation per subject
|  2      1       1       5.7   2219.15     0 |
|  2      3       1       9.5   2219.15     0 |
|  2      5       1      11.6   2219.15     0 |
|  2     10       1      17.5   2219.15     0 |
|  2     15       1      27.3   2219.15     0 |
|  2     30       1      28.5   2219.15     0 |
|  2     45       1      22.4   2219.15     0 |
|  2     60       1      19.3   2219.15     0 |
|  2    120       1        10   2219.15     0 |
+---------------------------------------------+
When we analyze the AUC variable, we need to only use one AUC value per subject. In
subsequent commands that use the AUC, we will need to include an “if tag” to limit the analysis
to one observation per subject, which maintains the correct sample size.
To make this work for any number of missing follow-up observations, we can use the following,
capture drop area
capture drop auc
generate area=(time-time[_n-1])*(progest+progest[_n-1])/2 if id==id[_n-1]
list id time progest area if id==1
*-- begin fill in for any number of missing values
capture drop num_records
bysort id: egen num_records=count(time)
sum num_records
scalar max_records=r(max) // maximum number of records per id
capture drop num_records
local i=2
while `i' < max_records {
    replace area=(time-time[_n-`i'])*(progest+progest[_n-`i'])/2 ///
        if (area==. & id==id[_n-`i']) // fill in if missing
    local i = `i'+1
}
* -- end fill in for missing
list id time progest area if id==1
bysort id: egen auc=sum(area)
list if id<=2, noobs nolabel sepby(id)
capture drop tag
egen tag = tag(id) // indicator for first observation per subject
drop area
list if id<=2, noobs nolabel sepby(id)
Graphing the AUC data,
graph box auc if tag, over(group, ///
relabel(1 "grp 1" 2 "grp 2" 3 "grp 3" 4 "grp 4"))
[Figure: box plots of AUC by dose group (grp 1, grp 2, grp 3, grp 4), with the AUC axis running
from 1,000 to 3,500.]
We see one outlier in the lowest dose group, group 1. This was also a suspicious-looking subject
in the parallel coordinate plot.
[Figure: the parallel coordinate plots by dose group, repeated here so the outlying subject in
Grp 1 can be located.]
Since that subject was elevated at three adjacent time points, however, it is probably a true
response to the drug, and so should not be treated as an outlier.
Now that we have eliminated the correlation structure in the data, by reducing the data into a
single point per subject, we can analyze these data in the ordinary cross-sectional fashion, using a
oneway ANOVA between independent groups.
Statistics
Linear models and related
ANOVA
One-way ANOVA
Main tab: Response variable: auc
Main tab: Factor variable: group
Main tab: Output: produce summary table
by/if/in tab: If: (expression): tag
OK
oneway auc group if tag , tabulate
            |           Summary of auc
 Dose Group |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  Grp 1 (0. |   2082.5333   675.96171           6
  Grp 2 (0. |   1234.8167   277.97327           6
  Grp 3 (0. |   1749.5375   903.61011           4
  Grp 4 (0. |   2175.6125   236.26341           4
------------+------------------------------------
      Total |    1780.235    658.9552          20

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      2962255.46      3   987418.487      2.99     0.0622
 Within groups      5287961.76     16    330497.61
------------------------------------------------------------------------
    Total           8250217.22     19   434221.959

Bartlett's test for equal variances:  chi2(3) =   7.4390  Prob>chi2 = 0.059
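The analysis of variance table can be rebuilt from the group summary statistics alone, which shows where the F statistic comes from. The following Python sketch is illustrative only; the function name oneway_f is hypothetical, and the p value is not computed because it needs the F distribution, which the Python standard library does not provide.

```python
# (mean, standard deviation, n) for dose groups 1 to 4, copied from the
# "Summary of auc" output.
groups = [
    (2082.5333, 675.96171, 6),
    (1234.8167, 277.97327, 6),
    (1749.5375, 903.61011, 4),
    (2175.6125, 236.26341, 4),
]


def oneway_f(summary):
    """One-way ANOVA sums of squares and F from per-group mean, SD, and n."""
    n_total = sum(n for _, _, n in summary)
    grand_mean = sum(mean * n for mean, _, n in summary) / n_total
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in summary)
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in summary)
    df_between = len(summary) - 1
    df_within = n_total - len(summary)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return grand_mean, ss_between, ss_within, f_stat


grand_mean, ss_between, ss_within, f_stat = oneway_f(groups)
print(f"grand mean = {grand_mean:.3f}")
print(f"SS between = {ss_between:.1f}")
print(f"SS within  = {ss_within:.1f}")
print(f"F(3, 16)   = {f_stat:.2f}")
```

Up to rounding in the inputs, these quantities match the Between groups and Within groups rows of the Stata output.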
The difference narrowly misses statistical significance (p = 0.062). Next, we try the
nonparametric analogue of the oneway ANOVA, the Kruskal-Wallis test, which basically compares the
medians and uses rank scores, so the outlier is no longer an influential point.
Statistics
Summaries, tables & tests
Nonparametric tests of hypotheses
Kruskal-Wallis rank test
Main tab: Outcome variable: auc
Main tab: Variable defining groups: group
if/in tab: If: (expression): tag
OK
kwallis auc if tag , by(group)
Test: Equality of populations (Kruskal-Wallis test)
  +----------------------------------------------------------+
  |                                   group | Obs | Rank Sum |
  |-----------------------------------------+-----+----------|
  | Grp 1 (0.2ml of 100 mg/ml one nostril)  |   6 |    80.00 |
  | Grp 2 (0.3ml of 100 mg/ml one nostril)  |   6 |    30.00 |
  | Grp 3 (0.2ml of 200 mg/ml one nostril)  |   4 |    38.00 |
  | Grp 4 (0.2ml of 100 mg/ml each nostril) |   4 |    62.00 |
  +----------------------------------------------------------+
chi-squared =     9.533 with 3 d.f.
probability =     0.0230
chi-squared with ties =     9.533 with 3 d.f.
probability =     0.0230
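The Kruskal-Wallis chi-squared statistic depends only on the rank sums and group sizes shown in the output. As a minimal Python sketch (the function name kruskal_wallis_h is hypothetical, and no tie correction is applied, which is consistent with the two identical chi-squared lines above):

```python
# Rank sums and group sizes copied from the Kruskal-Wallis output.
rank_sums = {1: 80.0, 2: 30.0, 3: 38.0, 4: 62.0}
group_sizes = {1: 6, 2: 6, 3: 4, 4: 4}


def kruskal_wallis_h(rank_sums, sizes):
    """Kruskal-Wallis H statistic without tie correction."""
    n = sum(sizes.values())
    # The ranks 1..n must add up to n(n+1)/2; this catches transcription errors.
    if abs(sum(rank_sums.values()) - n * (n + 1) / 2) > 1e-9:
        raise ValueError("rank sums are inconsistent with the group sizes")
    weighted = sum(rank_sums[g] ** 2 / sizes[g] for g in rank_sums)
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)


h_stat = kruskal_wallis_h(rank_sums, group_sizes)
print(f"H = {h_stat:.3f} on {len(rank_sums) - 1} d.f.")
```

Because the statistic uses only ranks, the outlying AUC in group 1 contributes no more than the largest rank, 20, however extreme its raw value.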
This time we discovered a significant difference among the groups (p = 0.023). Similarly, we
could have computed all possible pairwise significance tests, such as t tests, and then adjusted
the p values for multiple comparisons. (See Chapter 2-8 for multiple comparison procedures.)
Shortcoming with this Analysis. Twisk (2003, p.185) points out that an AUC analysis like the
one we performed here has a shortcoming. We did nothing to take into account any differences
in baseline. Even though this was an experiment, where the groups were randomized, baseline
differences could be large enough to affect the result and lead to the lack of a detected effect
seen in the oneway ANOVA.
One approach to adjust for baseline differences would be to subtract the baseline value from
each of the posttreatment times before calculating the AUC, the change score approach. With
such an approach, one must decide what to do with any negative changes. That is, AUC
segments above the baseline reference line have positive values and AUC segments under the
baseline reference line have negative values. One popular approach, which is used in blood
glucose measurements following food intake for example, is to use only the positive AUC
segments. This is called the IAUC (incremental AUC).
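The change score idea can be sketched by subtracting the baseline before applying the trapezoidal rule. The Python below is an illustration under stated assumptions, not a definitive IAUC implementation: the function names are hypothetical, the baseline is assumed to be observed, and the positive-only variant simply sets negative changes to zero before integrating, which is only one of several conventions for handling segments that cross the baseline.

```python
def auc_trapezoid(times, values):
    """Trapezoidal AUC, skipping missing values (None)."""
    observed = [(t, y) for t, y in zip(times, values) if y is not None]
    return sum((t1 - t0) * (y0 + y1) / 2
               for (t0, y0), (t1, y1) in zip(observed, observed[1:]))


def change_from_baseline_auc(times, values, positive_only=False):
    """AUC of (value - baseline). With positive_only=True, negative changes
    are set to zero first (an assumed, simplified incremental AUC)."""
    baseline = values[0]                      # assumed to be non-missing
    changes = [None if y is None else y - baseline for y in values]
    if positive_only:
        changes = [None if c is None else max(c, 0.0) for c in changes]
    return auc_trapezoid(times, changes)


minutes = [0, 1, 3, 5, 10, 15, 30, 45, 60, 120]
subject2 = [6.5, 5.7, 9.5, 11.6, 17.5, 27.3, 28.5, 22.4, 19.3, 10]  # id 2

net_auc = change_from_baseline_auc(minutes, subject2)
incremental_auc = change_from_baseline_auc(minutes, subject2, positive_only=True)
print("net change-score AUC:", round(net_auc, 2))
print("positive-only AUC:   ", round(incremental_auc, 2))
```

For subject 2 the two versions differ only slightly, because the level dips below baseline at a single time point (minute 1).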
Using the approach of subtracting the baseline measurement, however, is still subject to
regression towards the mean bias. That is, subjects with high baseline measurements are more
likely to have lower subsequent measurements and subjects with low baseline measurements are
more likely to have higher subsequent measurements, which will occur independently from the
treatment effect.
A better approach, then, is to include the baseline measurement in the model as a predictor
variable. Controlling for the baseline measurement in this way is called the analysis of covariance
(ANCOVA) approach. The ANCOVA approach, unlike the change approach, basically corrects
for regression to the mean (Twisk, 2003, p.169). We cannot be certain that ANCOVA will
entirely adjust for regression to the mean, because measurement error in the baseline value will
lead to some amount of under- or over-adjustment (Cook and Campbell, 1979, p.164).
The ANCOVA approach will be explained in more detail in the next chapter.
To use the ANCOVA approach, we need a variable with the baseline score. Currently our
baseline value is contained in the first occurrence, or time 0, of the progest variable.
+---------------------------------------------+
| id   time   group   progest       auc   tag |
|---------------------------------------------|
|  1      0       1         1    1982.5     1 |
|  1      1       1         .    1982.5     0 |
|  1      3       1        10    1982.5     0 |
|  1      5       1        16    1982.5     0 |
|  1     10       1        22    1982.5     0 |
|  1     15       1        20    1982.5     0 |
|  1     30       1        16    1982.5     0 |
|  1     45       1         .    1982.5     0 |
|  1     60       1        18    1982.5     0 |
|  1    120       1        14    1982.5     0 |
|---------------------------------------------|
|  2      0       1       6.5   2219.15     1 |
|  2      1       1       5.7   2219.15     0 |
|  2      3       1       9.5   2219.15     0 |
|  2      5       1      11.6   2219.15     0 |
|  2     10       1      17.5   2219.15     0 |
|  2     15       1      27.3   2219.15     0 |
|  2     30       1      28.5   2219.15     0 |
|  2     45       1      22.4   2219.15     0 |
|  2     60       1      19.3   2219.15     0 |
|  2    120       1        10   2219.15     0 |
+---------------------------------------------+
To create a separate variable containing the baseline value of progest,
capture drop progbase // progesterone baseline
bysort id (time): gen progbase=progest[1] // sort by time within id so [1] is the time 0 value
list if id<=2, noobs nolabel sepby(id)
+--------------------------------------------------------+
| id   time   group   progest       auc   tag   progbase |
|--------------------------------------------------------|
|  1      0       1         1    1982.5     1          1 |
|  1      1       1         .    1982.5     0          1 |
|  1      3       1        10    1982.5     0          1 |
|  1      5       1        16    1982.5     0          1 |
|  1     10       1        22    1982.5     0          1 |
|  1     15       1        20    1982.5     0          1 |
|  1     30       1        16    1982.5     0          1 |
|  1     45       1         .    1982.5     0          1 |
|  1     60       1        18    1982.5     0          1 |
|  1    120       1        14    1982.5     0          1 |
|--------------------------------------------------------|
|  2      0       1       6.5   2219.15     1        6.5 |
|  2      1       1       5.7   2219.15     0        6.5 |
|  2      3       1       9.5   2219.15     0        6.5 |
|  2      5       1      11.6   2219.15     0        6.5 |
|  2     10       1      17.5   2219.15     0        6.5 |
|  2     15       1      27.3   2219.15     0        6.5 |
|  2     30       1      28.5   2219.15     0        6.5 |
|  2     45       1      22.4   2219.15     0        6.5 |
|  2     60       1      19.3   2219.15     0        6.5 |
|  2    120       1        10   2219.15     0        6.5 |
+--------------------------------------------------------+
Note: the square brackets in “progest[1]” represent a subscript, being the first observation for
each ID number.
Now, using the ANCOVA approach, where we control for baseline,
Stata Version 10 (specify continuous variable with “continuous” option)
Statistics
Linear models and related
ANOVA
Analysis of variance and covariance
Model tab: Dependent variable: auc
Model tab: Model: group progbase
Model tab: Model variables: Categorical except the following continuous
variables: progbase
by/if/in tab: If: (expression): tag
OK
anova auc group progbase if tag, continuous(progbase)
Stata Version 11 (specify continuous variable with “c.” prefix)
Statistics
Linear models and related
ANOVA/MANOVA
Analysis of variance and covariance
Model tab: Dependent variable: auc
Model tab: Model: group c.progbase
by/if/in tab: If: (expression): tag
OK
anova auc group c.progbase if tag
                  Number of obs =      20     R-squared     =  0.4600
                  Root MSE      = 544.975     Adj R-squared =  0.3160

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  3795254.05     4  948813.514       3.19     0.0438
           |
     group |  3001109.23     3  1000369.74       3.37     0.0467
  progbase |  832998.594     1  832998.594       2.80     0.1147
           |
  Residual |  4454963.17    15  296997.544
-----------+----------------------------------------------------
     Total |  8250217.22    19  434221.959
We see that baseline progesterone was not a significant predictor of the AUC (p = 0.115). Note
that this p value tests the association between baseline and AUC, not whether baseline differed
between the groups; systematic baseline differences are not expected, since randomization was
used. Nevertheless, the adjustment was enough to influence the result. After controlling for
baseline, there is a significant difference among the groups (p = 0.047).
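Each F statistic in the ANCOVA table is the term's mean square (partial SS divided by its degrees of freedom) divided by the residual mean square. A minimal Python sketch of that arithmetic, using the values printed in the table (the function name partial_f is hypothetical):

```python
def partial_f(partial_ss, df, residual_ms):
    """F statistic for one model term: its mean square over the residual mean square."""
    return (partial_ss / df) / residual_ms


residual_ms = 4454963.17 / 15            # Residual SS / residual d.f.
f_group = partial_f(3001109.23, 3, residual_ms)
f_progbase = partial_f(832998.594, 1, residual_ms)

print(f"residual MS = {residual_ms:.3f}")
print(f"F(group)    = {f_group:.2f}")
print(f"F(progbase) = {f_progbase:.2f}")
```

Comparing the residual mean square here (about 296,998) with the within-groups mean square from the oneway ANOVA (330,497.61) shows how adjusting for baseline reduced the unexplained variation.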
We could accomplish the same ANCOVA analysis using linear regression, followed by a posttest
simultaneously comparing the group indicators to the referent group.
Stata Version 10:
* Stata Version 10
anova auc group progbase if tag, continuous(progbase)
anova, regress
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  4,    15) =    3.19
       Model |  3795254.05     4  948813.514           Prob > F      =  0.0438
    Residual |  4454963.17    15  296997.544           R-squared     =  0.4600
-------------+------------------------------           Adj R-squared =  0.3160
       Total |  8250217.22    19  434221.959           Root MSE      =  544.97

------------------------------------------------------------------------------
         auc        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------------------------------
       _cons     1678.891   402.7646     4.17   0.001     820.4187    2537.364
       group
           1     201.3574   393.2665     0.51   0.616    -636.8702    1039.585
           2    -751.2476   369.5388    -2.03   0.060    -1538.901     36.4058
           3    -358.6468    387.453    -0.93   0.369    -1184.483    467.1897
           4    (dropped)
    progbase     89.90432   53.68277     1.67   0.115    -24.51778    204.3264
------------------------------------------------------------------------------
First creating the indicator variables, so we can drop group 4 to match the anova command,
capture drop grp*
tab group, gen(grp)
regress auc grp1 grp2 grp3 progbase if tag
test grp1=grp2=grp3=0
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  4,    15) =    3.19
       Model |  3795254.05     4  948813.514           Prob > F      =  0.0438
    Residual |  4454963.17    15  296997.544           R-squared     =  0.4600
-------------+------------------------------           Adj R-squared =  0.3160
       Total |  8250217.22    19  434221.959           Root MSE      =  544.97

------------------------------------------------------------------------------
         auc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        grp1 |   201.3574   393.2665     0.51   0.616    -636.8702    1039.585
        grp2 |  -751.2476   369.5388    -2.03   0.060    -1538.901     36.4058
        grp3 |  -358.6468    387.453    -0.93   0.369    -1184.483    467.1897
    progbase |   89.90432   53.68277     1.67   0.115    -24.51778    204.3264
       _cons |   1678.891   402.7646     4.17   0.001     820.4187    2537.364
------------------------------------------------------------------------------

. test grp1=grp2=grp3=0

 ( 1)  grp1 - grp2 = 0
 ( 2)  grp1 - grp3 = 0
 ( 3)  grp1 = 0

       F(  3,    15) =    3.37
            Prob > F =    0.0467
As expected, we get the same result as the ANCOVA using the anova command (p=0.0467).
Stata Version 11:
* Stata Version 11
anova auc group c.progbase if tag
regress
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  4,    15) =    3.19
       Model |  3795254.05     4  948813.514           Prob > F      =  0.0438
    Residual |  4454963.17    15  296997.544           R-squared     =  0.4600
-------------+------------------------------           Adj R-squared =  0.3160
       Total |  8250217.22    19  434221.959           Root MSE      =  544.97

------------------------------------------------------------------------------
         auc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |
          2  |   -952.605   320.8141    -2.97   0.010    -1636.404    -268.806
          3  |  -560.0042   376.9914    -1.49   0.158    -1363.542    243.5339
          4  |  -201.3574   393.2665    -0.51   0.616    -1039.585    636.8702
             |
    progbase |   89.90432   53.68277     1.67   0.115    -24.51778    204.3264
       _cons |   1880.249   253.1579     7.43   0.000     1340.655    2419.842
------------------------------------------------------------------------------
First creating the indicator variables, so we can drop group 1 to match the anova command,
capture drop grp*
tab group, gen(grp)
regress auc grp2 grp3 grp4 progbase if tag
test grp2=grp3=grp4=0
      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  4,    15) =    3.19
       Model |  3795254.05     4  948813.514           Prob > F      =  0.0438
    Residual |  4454963.17    15  296997.544           R-squared     =  0.4600
-------------+------------------------------           Adj R-squared =  0.3160
       Total |  8250217.22    19  434221.959           Root MSE      =  544.97

------------------------------------------------------------------------------
         auc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        grp2 |   -952.605   320.8141    -2.97   0.010    -1636.404    -268.806
        grp3 |  -560.0042   376.9914    -1.49   0.158    -1363.542    243.5339
        grp4 |  -201.3574   393.2665    -0.51   0.616    -1039.585    636.8702
    progbase |   89.90432   53.68277     1.67   0.115    -24.51778    204.3264
       _cons |   1880.249   253.1579     7.43   0.000     1340.655    2419.842
------------------------------------------------------------------------------

. test grp2=grp3=grp4=0

 ( 1)  grp2 - grp3 = 0
 ( 2)  grp2 - grp4 = 0
 ( 3)  grp2 = 0

       F(  3,    15) =    3.37
            Prob > F =    0.0467
As expected, we get the same result as the ANCOVA using the anova command (p=0.0467).
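The two indicator codings describe the same model and differ only in which group is the reference category, so their coefficients are related by simple arithmetic. The following Python sketch is only an arithmetic cross-check (Python is not part of the course's Stata code, and the dictionary names are illustrative assumptions). It starts from the coefficients printed with group 1 as the reference and recovers those printed with group 4 as the reference.

```python
# Coefficients copied from the Stata output with group 1 as the reference
# category (regress auc grp2 grp3 grp4 progbase if tag).
ref1 = {"grp2": -952.605, "grp3": -560.0042, "grp4": -201.3574, "_cons": 1880.249}

# Switching the reference to group 4: every group effect is re-expressed
# relative to group 4, and the intercept moves to the group 4 level.
shift = ref1["grp4"]
ref4 = {
    "grp1": 0.0 - shift,             # group 1 vs group 4
    "grp2": ref1["grp2"] - shift,    # group 2 vs group 4
    "grp3": ref1["grp3"] - shift,    # group 3 vs group 4
    "_cons": ref1["_cons"] + shift,  # adjusted intercept for group 4
}

for name, value in ref4.items():
    print(f"{name:>6} {value:12.4f}")
```

Up to rounding, the results agree with the coefficients in the output that drops group 4 (201.3574, -751.2476, -358.6468, and 1678.891), which is consistent with the overall group test being identical under either coding.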
Response Feature Analysis: Linear Slope of Repeated Measures
Next, we will use the 11.2.Isoproterenol.dta dataset provided with the Dupont (2002, p.338)
textbook, described as,
“Lang et al. (1995) studied the effect of isoproterenol, a β-adrenergic agonist, on forearm
blood flow in a group of 22 normotensive men. Nine of the study subjects were black and
13 were white. Each subject’s blood flow was measured at baseline and then at
escalating doses of isoproterenol.”
Reading the data in
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on 11.2.Isoproterenol.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
11.2.Isoproterenol.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use 11.2.Isoproterenol.dta, clear
Listing the data
list , nolabel
     +---------------------------------------------------------------------+
     | id   race   fbf0   fbf10   fbf20   fbf60   fbf150   fbf300   fbf400 |
     |---------------------------------------------------------------------|
  1. |  1      1      1     1.4     6.4    19.1       25     24.6       28 |
  2. |  2      1    2.1     2.8     8.3    15.7     21.9     21.7     30.1 |
  3. |  3      1    1.1     2.2     5.7     8.2      9.3     12.5     21.6 |
  4. |  4      1   2.44     2.9     4.6    13.2     17.3     17.6     19.4 |
  5. |  5      1    2.9     3.5     5.7    11.5     14.9     19.7     19.3 |
     |---------------------------------------------------------------------|
  6. |  6      1    4.1     3.7     5.8    19.8     17.7     20.8     30.3 |
  7. |  7      1   1.24     1.2     3.3     5.3      5.4     10.1     10.6 |
  8. |  8      1    3.1       .       .   15.45        .        .     31.3 |
  9. |  9      1    5.8     8.8    13.2    33.3     38.5     39.8     43.3 |
 10. | 10      1    3.9     6.6     9.5    20.2     21.5     30.1     29.6 |
     |---------------------------------------------------------------------|
 11. | 11      1   1.91     1.7     6.3     9.9     12.6     12.7     15.4 |
 12. | 12      1      2     2.3       4     8.4      8.3     12.8     16.7 |
 13. | 13      1    3.7     3.9     4.7    10.5     14.6       20     21.7 |
 14. | 14      2   2.46     2.7    2.54    3.95     4.16      5.1     4.16 |
 15. | 15      2      2     1.8    4.22    5.76     7.08    10.92     7.08 |
     |---------------------------------------------------------------------|
 16. | 16      2   2.26       3    2.99    4.07     3.74     4.58     3.74 |
 17. | 17      2    1.8     2.9    3.41    4.84     7.05     7.48     7.05 |
 18. | 18      2   3.13       4    5.33    7.31     8.81    11.09     8.81 |
 19. | 19      2   1.36     2.7    3.05       4      4.1     6.95      4.1 |
 20. | 20      2   2.82     2.6    2.63   10.03      9.6    12.65      9.6 |
     |---------------------------------------------------------------------|
 21. | 21      2    1.7     1.6    1.73    2.96     4.17     6.04     4.17 |
 22. | 22      2    2.1     1.9       3     4.8      7.4     16.7     21.2 |
     +---------------------------------------------------------------------+
We see that the data are in wide format, with variables
id      patient ID (1 to 22)
race    race (1=white, 2=black)
fbf0    forearm blood flow (ml/min/dl) at isoproterenol dose 0 ng/min
fbf10   forearm blood flow (ml/min/dl) at isoproterenol dose 10 ng/min
…
fbf400  forearm blood flow (ml/min/dl) at isoproterenol dose 400 ng/min
In this dataset, each successive measurement occasion corresponds to a higher dose, so the
repeated measurements can be thought of as an effect across dose, rather than across time.
Graphing the data with a parallel coordinate plot
#delimit ;
parplot fbf0-fbf400
, transform(raw)
xlabel(1 "0" 2 "10" 3 "20" 4 "60" 5 "150" 6 "300" 7 "400")
ylabel(0(5)45, angle(horizontal))
xtitle("isoproterenol dose (ng/min)")
yline(0) ytitle("forearm blood flow (ml/min/dl)")
by(race)
;
#delimit cr
[Figure: parallel coordinate plots of forearm blood flow (ml/min/dl, y-axis 0 to 45) against
isoproterenol dose (0, 10, 20, 60, 150, 300, 400 ng/min), with separate panels for White and
Black subjects. Graphs by Race.]
Given that the dose range is so wide, a scatterplot with the correct spacing between doses might
be worth examining to see whether the increase is linear.
We won’t bother, however, because there is no good way to come up with a single slope value
from the linear and quadratic terms of the regression model,
$\hat{Y} = a + b_1 X + b_2 X^2$
We are stuck with just using a linear model.
First we convert the data into long format, which will be easier to work with.
reshape long fbf , i(id) j(dose)
list if id<=2, nolabel sepby(id)
     +-------------------------+
     | id   dose   race    fbf |
     |-------------------------|
  1. |  1      0      1      1 |
  2. |  1     10      1    1.4 |
  3. |  1     20      1    6.4 |
  4. |  1     60      1   19.1 |
  5. |  1    150      1     25 |
  6. |  1    300      1   24.6 |
  7. |  1    400      1     28 |
     |-------------------------|
  8. |  2      0      1    2.1 |
  9. |  2     10      1    2.8 |
 10. |  2     20      1    8.3 |
 11. |  2     60      1   15.7 |
 12. |  2    150      1   21.9 |
 13. |  2    300      1   21.7 |
 14. |  2    400      1   30.1 |
     +-------------------------+
Next, convert the 1-2 race variable into a 0-1 black indicator.
capture drop black
recode race 1=0 2=1 , gen(black)
tab black race, nolabel
 RECODE of |
      race |         Race
    (Race) |         1          2 |     Total
-----------+----------------------+----------
         0 |        91          0 |        91
         1 |         0         63 |        63
-----------+----------------------+----------
     Total |        91         63 |       154
Unlike the progesterone absorption profiles, which increased and then decreased, these blood
flow graphs appear to monotonically increase, more or less, across the dose range. This suggests
that a linear slope would provide an adequate summary measure for comparison of whites with
blacks.
For completeness, we note that Dupont (2002, p.346) uses log dose to derive the slope
summary measure. We will skip that, since the small improvement in linear fit (R² = 0.55 vs R²
= 0.52) does not seem to justify the added complexity of the presentation.
To derive the summary measure, the slope, we fit a linear regression line to each subject’s data,
the 7 dose-fbf pairs, and retrieve the slope using the _b[ ] Stata system variable (see box).
Capturing Results from Regression Models
If all you need are the regression coefficients and standard errors, the easy way to retrieve them is
to use the Stata system variables _b[ ], or synonymously _coef[ ], and _se[ ] (Stata User’s Guide,
version 11, p. 149).
Using the following data for demonstration,
     +------------------+
     | id   y   x1   x2 |
     |------------------|
  1. |  1   5    1   33 |
  2. |  2   4    1   14 |
  3. |  3   5    1   10 |
  4. |  4   3    1    5 |
  5. |  5   6    0   17 |
     |------------------|
  6. |  6   7    0   18 |
  7. |  7   3    0    4 |
  8. |  8   5    0   10 |
  9. |  9   4    0    8 |
     +------------------+
fitting a linear regression
regress y x1 x2
      Source |       SS       df       MS              Number of obs =       9
-------------+------------------------------           F(  2,     6) =    3.20
       Model |  7.22813239     2  3.61406619           Prob > F      =  0.1132
    Residual |  6.77186761     6   1.1286446           R-squared     =  0.5163
-------------+------------------------------           Adj R-squared =  0.3551
       Total |          14     8        1.75           Root MSE      =  1.0624

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -1.161939   .7347975    -1.58   0.165    -2.959923    .6360463
          x2 |   .1004728    .043656     2.30   0.061    -.0063497    .2072953
       _cons |    3.85461   .6880503     5.60   0.001     2.171011    5.538208
------------------------------------------------------------------------------
The three regression coefficients are stored in system variables of the form
_b[variable name] or synonymously _coef[variable name], which in this example are
_b[x1]
_b[x2]
_b[_cons]
as well as in
_coef[x1]
_coef[x2]
_coef[_cons]
To find this in Stata’s help, use
help _b
We can verify this with the display command
display _b[x1]
display _b[x2]
display _b[_cons]
display _coef[x1]
display _coef[x2]
display _coef[_cons]
. display _b[x1]
-1.1619385
. display _b[x2]
.10047281
. display _b[_cons]
3.8546099
. display _coef[x1]
-1.1619385
. display _coef[x2]
.10047281
. display _coef[_cons]
3.8546099
The three standard errors are stored in system variables of the form _se[variable name], which in
this example are
_se[x1]
_se[x2]
_se[_cons]
Verifying this
display _se[x1]
display _se[x2]
display _se[_cons]
. display _se[x1]
.73479753
. display _se[x2]
.04365605
. display _se[_cons]
.68805032
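One use of the stored values is to rebuild other columns of the regression table. As a small cross-check outside Stata (a Python sketch, with dictionary names chosen here purely for illustration), dividing each coefficient by its standard error gives the t statistic.

```python
# Coefficients and standard errors copied from the display output above.
b = {"x1": -1.1619385, "x2": 0.10047281, "_cons": 3.8546099}
se = {"x1": 0.73479753, "x2": 0.04365605, "_cons": 0.68805032}

# Each t statistic is the coefficient divided by its standard error.
t_stats = {name: b[name] / se[name] for name in b}

for name, t in t_stats.items():
    print(f"{name:>6}  t = {t:6.2f}")
```

Rounded to two decimals, these agree with the t column of the regression table (-1.58, 2.30, and 5.60). Inside Stata the same quantity could be obtained with an expression such as _b[x1]/_se[x1].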
As a test of our Stata code, we check what the slope should be for subject 1
regress fbf dose if id==1 // see what slope should be
      Source |       SS       df       MS              Number of obs =       7
-------------+------------------------------           F(  1,     5) =   12.43
       Model |  593.987183     1  593.987183           Prob > F      =  0.0168
    Residual |  238.867112     5  47.7734224           R-squared     =  0.7132
-------------+------------------------------           Adj R-squared =  0.6558
       Total |  832.854295     6  138.809049           Root MSE      =  6.9118

------------------------------------------------------------------------------
         fbf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dose |   .0628501   .0178242     3.53   0.017     .0170315    .1086687
       _cons |    6.63156   3.543134     1.87   0.120    -2.476356    15.73948
------------------------------------------------------------------------------
For subject 1, the slope summary measure is 0.0628501.
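The slope can also be checked by hand from the least-squares formulas. The sketch below is in Python rather than Stata and is intended only as an independent check. It uses subject 1's seven dose-fbf pairs from the long-format listing, and the variable names are illustrative.

```python
# Subject 1's seven dose-fbf pairs, copied from the long-format listing.
dose = [0, 10, 20, 60, 150, 300, 400]
fbf = [1, 1.4, 6.4, 19.1, 25, 24.6, 28]

n = len(dose)
mean_dose = sum(dose) / n
mean_fbf = sum(fbf) / n

# Least-squares slope = Sxy / Sxx; intercept = ybar - slope * xbar.
sxy = sum((x - mean_dose) * (y - mean_fbf) for x, y in zip(dose, fbf))
sxx = sum((x - mean_dose) ** 2 for x in dose)
slope = sxy / sxx
intercept = mean_fbf - slope * mean_dose

print(f"slope = {slope:.7f}, intercept = {intercept:.5f}")
```

The computed slope and intercept agree with the Stata coefficients for dose (.0628501) and _cons (6.63156).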
Now, doing this for all subjects (see box for a more complicated version),
*-- program to compute slope for each subject
capture program drop calcslope
program define calcslope , byable(recall)
marksample touse
quietly regress fbf dose if `touse'
quietly replace doseslope=_b[dose] if `touse'
end
* -- call program to compute slope
capture drop doseslope
gen doseslope=. // variable to hold slope
quietly bysort id: calcslope // call program for each subject
Checking how it worked,
list if id<=2, nolabel sepby(id)
     +--------------------------------------------+
     | id   dose   race    fbf   black   dosesl~e |
     |--------------------------------------------|
  1. |  1      0      1      1       0   .0628501 |
  2. |  1     10      1    1.4       0   .0628501 |
  3. |  1     20      1    6.4       0   .0628501 |
  4. |  1     60      1   19.1       0   .0628501 |
  5. |  1    150      1     25       0   .0628501 |
  6. |  1    300      1   24.6       0   .0628501 |
  7. |  1    400      1     28       0   .0628501 |
     |--------------------------------------------|
  8. |  2      0      1    2.1       0   .0611372 |
  9. |  2     10      1    2.8       0   .0611372 |
 10. |  2     20      1    8.3       0   .0611372 |
 11. |  2     60      1   15.7       0   .0611372 |
 12. |  2    150      1   21.9       0   .0611372 |
 13. |  2    300      1   21.7       0   .0611372 |
 14. |  2    400      1   30.1       0   .0611372 |
     +--------------------------------------------+
This time, just to see if we like it better, we will save only one copy of the slope value per
subject. That way, we will not need to tag the first observation and then bother with “if tag” in
subsequent analysis commands.
bysort id: replace doseslope=. if _n~=1 // keep only one value
list if id<=2, nolabel sepby(id)
     +--------------------------------------------+
     | id   dose   race    fbf   black   dosesl~e |
     |--------------------------------------------|
  1. |  1      0      1      1       0   .0628501 |
  2. |  1     10      1    1.4       0          . |
  3. |  1     20      1    6.4       0          . |
  4. |  1     60      1   19.1       0          . |
  5. |  1    150      1     25       0          . |
  6. |  1    300      1   24.6       0          . |
  7. |  1    400      1     28       0          . |
     |--------------------------------------------|
  8. |  2      0      1    2.1       0   .0611372 |
  9. |  2     10      1    2.8       0          . |
 10. |  2     20      1    8.3       0          . |
 11. |  2     60      1   15.7       0          . |
 12. |  2    150      1   21.9       0          . |
 13. |  2    300      1   21.7       0          . |
 14. |  2    400      1   30.1       0          . |
     +--------------------------------------------+
We can now compare blacks with whites using a simple independent samples t test.
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Group mean comparison test
Variable name: doseslope
Group variable name: race
OK
ttest doseslope , by(race)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   White |      13    .0498816    .0047465    .0171139    .0395398    .0602234
   Black |       9    .0152201    .0045469    .0136407    .0047349    .0257052
---------+--------------------------------------------------------------------
combined |      22    .0357019    .0049658    .0232917    .0253749    .0460288
---------+--------------------------------------------------------------------
    diff |            .0346616    .0068584                .0203551     .048968
------------------------------------------------------------------------------
Degrees of freedom: 20

                  Ho: mean(White) - mean(Black) = diff = 0

     Ha: diff < 0               Ha: diff != 0               Ha: diff > 0
       t =   5.0538                t =   5.0538               t =   5.0538
   P < t =   1.0000          P > |t| =   0.0001           P > t =   0.0000
From this, we would conclude that the forearm blood flow increases more rapidly in whites than
blacks when the isoproterenol dosage is increased (p < 0.001).
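For readers who want to see where the t statistic comes from, the following Python sketch recomputes it from the group summaries in the output above using the pooled (equal variance) formula. It is offered only as an arithmetic check outside Stata, with illustrative variable names. The inputs are the rounded values printed by Stata, so the last digit may differ slightly.

```python
import math

# Group summaries copied from the ttest output above.
n_w, mean_w, sd_w = 13, 0.0498816, 0.0171139   # White
n_b, mean_b, sd_b = 9, 0.0152201, 0.0136407    # Black

# Pooled variance under the equal-variance assumption.
df = n_w + n_b - 2
pooled_var = ((n_w - 1) * sd_w**2 + (n_b - 1) * sd_b**2) / df

# Standard error of the difference in means, then the t statistic.
se_diff = math.sqrt(pooled_var * (1 / n_w + 1 / n_b))
t_stat = (mean_w - mean_b) / se_diff

print(f"df = {df}, se = {se_diff:.7f}, t = {t_stat:.3f}")
```

The degrees of freedom, the standard error of the difference, and the t statistic agree with the Stata output (20, .0068584, and 5.0538) to within rounding.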
Shortcoming with this Analysis. We made no adjustment for differences in blood flow that might
exist between blacks and whites in the absence of the drug. That is, we made no adjustment for
the baseline value.
We could use the change approach, subtracting baseline flow from each dose flow, which would
adjust for differences in baseline, and then repeat the slope analysis on the change scores.
However, that would not adjust for regression towards the mean bias. To do both, we can use an
ANCOVA approach, once again.
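Written out, the ANCOVA for this example regresses each subject's slope on the race indicator and the baseline flow. As a sketch, with i indexing subjects and the Greek symbols introduced here only for illustration (they do not appear in the original notes):

```latex
\text{doseslope}_i = \beta_0 + \beta_1\,\text{black}_i + \beta_2\,\text{fbfbase}_i + \varepsilon_i
```

Under this model, $\beta_1$ is the baseline-adjusted difference in mean slope between blacks and whites, and $\beta_2$ describes how the slope varies with baseline blood flow.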
Adding a baseline variable
capture drop fbfbase // forearm blood flow baseline
bysort id: gen fbfbase=fbf if _n==1
list if id<=2, noobs nolabel sepby(id) abbrev(15)
  +-------------------------------------------------------+
  | id   dose   race    fbf   black   doseslope   fbfbase |
  |-------------------------------------------------------|
  |  1      0      1      1       0    .0628501         1 |
  |  1     10      1    1.4       0           .         . |
  |  1     20      1    6.4       0           .         . |
  |  1     60      1   19.1       0           .         . |
  |  1    150      1     25       0           .         . |
  |  1    300      1   24.6       0           .         . |
  |  1    400      1     28       0           .         . |
  |-------------------------------------------------------|
  |  2      0      1    2.1       0    .0611372       2.1 |
  |  2     10      1    2.8       0           .         . |
  |  2     20      1    8.3       0           .         . |
  |  2     60      1   15.7       0           .         . |
  |  2    150      1   21.9       0           .         . |
  |  2    300      1   21.7       0           .         . |
  |  2    400      1   30.1       0           .         . |
  +-------------------------------------------------------+
Running the ANCOVA using linear regression,
regress doseslope black fbfbase
      Source |       SS       df       MS              Number of obs =      22
-------------+------------------------------           F(  2,    19) =   21.11
       Model |  .007856551     2  .003928276           Prob > F      =  0.0000
    Residual |  .003536004    19  .000186105           R-squared     =  0.6896
-------------+------------------------------           Adj R-squared =  0.6570
       Total |  .011392555    21  .000542503           Root MSE      =  .01364

------------------------------------------------------------------------------
   doseslope |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -.0306394   .0060866    -5.03   0.000    -.0433788   -.0179001
     fbfbase |    .007539   .0026851     2.81   0.011     .0019191     .013159
       _cons |    .029416   .0082125     3.58   0.002     .0122271    .0466049
------------------------------------------------------------------------------
We arrive at the same conclusion.
If we wanted to do this type of analysis for several variables, we could modify the calcslope
program to allow us to pass a variable name as an argument (see box).
Passing a variable name into a Stata program
The program we used above, replicated here,
*-- program to compute slope for each subject
capture program drop calcslope
program define calcslope , byable(recall)
marksample touse
quietly regress fbf dose if `touse'
quietly replace doseslope=_b[dose] if `touse'
end
* -- call program to compute slope
capture drop doseslope
gen doseslope=. // variable to hold slope
quietly bysort id: calcslope // call program for each subject
would have to be modified for each outcome variable we wished to create a slope variable for.
A simple modification is to pass a variable name when the program is called. Here is what it
would look like:
*-- program to compute slope for each subject
capture program drop calcslope
program define calcslope , byable(recall)
marksample touse
args v1
quietly regress `v1' dose if `touse'
quietly replace doseslope_`v1'=_b[dose] if `touse'
end
* -- call program to compute slope
capture drop doseslope_fbf
gen doseslope_fbf=. // variable to hold slope
quietly bysort id: calcslope fbf // call program for each subject
This time, the slopes are stored in doseslope_fbf, rather than doseslope. If called using
capture drop doseslope_heartrate
gen doseslope_heartrate=. // variable to hold slope
quietly bysort id: calcslope heartrate
the slopes would be stored in doseslope_heartrate.
Response Feature Analysis: Mean of Repeated Measurements
When analyzing longitudinal data using a summary measure, Rabe-Hesketh and Everitt (2004,
p.145) state,
“The most commonly used measure is the mean of the responses over time because many
investigations, eg., clinical trials, are most concerned with differences in overall levels
rather than more subtle effects.”
Bringing the wide format data back in,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on 11.2.Isoproterenol.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
11.2.Isoproterenol.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use 11.2.Isoproterenol.dta, clear
Listing it,
list
     +----------------------------------------------------------------------+
     | id    race   fbf0   fbf10   fbf20   fbf60   fbf150   fbf300   fbf400 |
     |----------------------------------------------------------------------|
  1. |  1   White      1     1.4     6.4    19.1       25     24.6       28 |
  2. |  2   White    2.1     2.8     8.3    15.7     21.9     21.7     30.1 |
  3. |  3   White    1.1     2.2     5.7     8.2      9.3     12.5     21.6 |
  4. |  4   White   2.44     2.9     4.6    13.2     17.3     17.6     19.4 |
  5. |  5   White    2.9     3.5     5.7    11.5     14.9     19.7     19.3 |
     |----------------------------------------------------------------------|
  6. |  6   White    4.1     3.7     5.8    19.8     17.7     20.8     30.3 |
  7. |  7   White   1.24     1.2     3.3     5.3      5.4     10.1     10.6 |
  8. |  8   White    3.1       .       .   15.45        .        .     31.3 |
  9. |  9   White    5.8     8.8    13.2    33.3     38.5     39.8     43.3 |
 10. | 10   White    3.9     6.6     9.5    20.2     21.5     30.1     29.6 |
     |----------------------------------------------------------------------|
 11. | 11   White   1.91     1.7     6.3     9.9     12.6     12.7     15.4 |
 12. | 12   White      2     2.3       4     8.4      8.3     12.8     16.7 |
 13. | 13   White    3.7     3.9     4.7    10.5     14.6       20     21.7 |
 14. | 14   Black   2.46     2.7    2.54    3.95     4.16      5.1     4.16 |
 15. | 15   Black      2     1.8    4.22    5.76     7.08    10.92     7.08 |
     |----------------------------------------------------------------------|
 16. | 16   Black   2.26       3    2.99    4.07     3.74     4.58     3.74 |
 17. | 17   Black    1.8     2.9    3.41    4.84     7.05     7.48     7.05 |
 18. | 18   Black   3.13       4    5.33    7.31     8.81    11.09     8.81 |
 19. | 19   Black   1.36     2.7    3.05       4      4.1     6.95      4.1 |
 20. | 20   Black   2.82     2.6    2.63   10.03      9.6    12.65      9.6 |
     |----------------------------------------------------------------------|
 21. | 21   Black    1.7     1.6    1.73    2.96     4.17     6.04     4.17 |
 22. | 22   Black    2.1     1.9       3     4.8      7.4     16.7     21.2 |
     +----------------------------------------------------------------------+
Missing values are permissible, since the mean is computed on the non-missing repeated
measurements.
Computing the mean of the nonmissing post isoproterenol administration measurements
Data
Create or change variables
Create new variable (extended)
Generate variable: meanfbf
Egen function: row mean
Egen function argument: Variables: fbf10-fbf400
OK
capture drop meanfbf
egen meanfbf=rmean(fbf10-fbf400)
Listing the data to check the calculation,
list
     +---------------------------------------------------------------------------------+
     | id    race   fbf0   fbf10   fbf20   fbf60   fbf150   fbf300   fbf400    meanfbf |
     |---------------------------------------------------------------------------------|
  1. |  1   White      1     1.4     6.4    19.1       25     24.6       28   17.41667 |
  2. |  2   White    2.1     2.8     8.3    15.7     21.9     21.7     30.1      16.75 |
  3. |  3   White    1.1     2.2     5.7     8.2      9.3     12.5     21.6   9.916667 |
  4. |  4   White   2.44     2.9     4.6    13.2     17.3     17.6     19.4       12.5 |
  5. |  5   White    2.9     3.5     5.7    11.5     14.9     19.7     19.3   12.43333 |
     |---------------------------------------------------------------------------------|
  6. |  6   White    4.1     3.7     5.8    19.8     17.7     20.8     30.3      16.35 |
  7. |  7   White   1.24     1.2     3.3     5.3      5.4     10.1     10.6   5.983334 |
  8. |  8   White    3.1       .       .   15.45        .        .     31.3     23.375 |
  9. |  9   White    5.8     8.8    13.2    33.3     38.5     39.8     43.3   29.48333 |
 10. | 10   White    3.9     6.6     9.5    20.2     21.5     30.1     29.6   19.58333 |
     |---------------------------------------------------------------------------------|
 11. | 11   White   1.91     1.7     6.3     9.9     12.6     12.7     15.4   9.766666 |
 12. | 12   White      2     2.3       4     8.4      8.3     12.8     16.7       8.75 |
 13. | 13   White    3.7     3.9     4.7    10.5     14.6       20     21.7   12.56667 |
 14. | 14   Black   2.46     2.7    2.54    3.95     4.16      5.1     4.16   3.768333 |
 15. | 15   Black      2     1.8    4.22    5.76     7.08    10.92     7.08   6.143333 |
     |---------------------------------------------------------------------------------|
 16. | 16   Black   2.26       3    2.99    4.07     3.74     4.58     3.74   3.686667 |
 17. | 17   Black    1.8     2.9    3.41    4.84     7.05     7.48     7.05      5.455 |
 18. | 18   Black   3.13       4    5.33    7.31     8.81    11.09     8.81   7.558333 |
 19. | 19   Black   1.36     2.7    3.05       4      4.1     6.95      4.1       4.15 |
 20. | 20   Black   2.82     2.6    2.63   10.03      9.6    12.65      9.6   7.851666 |
     |---------------------------------------------------------------------------------|
 21. | 21   Black    1.7     1.6    1.73    2.96     4.17     6.04     4.17      3.445 |
 22. | 22   Black    2.1     1.9       3     4.8      7.4     16.7     21.2   9.166667 |
     +---------------------------------------------------------------------------------+
Checking the calculation in observation 8,
display (15.45+31.3)/2 // mean omitting baseline
23.375
We leave out the baseline fbf because it is not part of the post drug administration outcome.
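The row mean over non-missing values can be sketched outside Stata as follows. This is an illustrative Python helper, not part of the course code: the function name row_mean is an assumption, and None stands in for Stata's missing value. It is applied to subject 8's post-dose measurements.

```python
# Post-dose flows for subject 8 (fbf10 ... fbf400); None marks a missing value.
post_dose = [None, None, 15.45, None, None, 31.3]

def row_mean(values):
    """Mean of the non-missing entries, or None if every entry is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

mean_fbf = row_mean(post_dose)
print(mean_fbf)
```

The denominator is the number of non-missing values (2 for subject 8), not the number of dose levels, which is why subject 8 still contributes a mean of 23.375.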
We can now analyze these data using an ANCOVA approach, controlling for baseline.
First recoding the race variable, again, and then using a linear regression to obtain the
ANCOVA,
capture drop black
recode race 1=0 2=1 , gen(black)
tab black race
regress meanfbf black fbf0
      Source |       SS       df       MS              Number of obs =      22
-------------+------------------------------           F(  2,    19) =   24.93
       Model |  723.655489     2  361.827745           Prob > F      =  0.0000
    Residual |  275.804739    19  14.5160389           R-squared     =  0.7240
-------------+------------------------------           Adj R-squared =  0.6950
       Total |  999.460228    21  47.5933442           Root MSE      =    3.81

------------------------------------------------------------------------------
     meanfbf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -7.593173   1.699874    -4.47   0.000    -11.15105   -4.035297
        fbf0 |   3.196871   .7498965     4.26   0.000      1.62732    4.766423
       _cons |   6.312108   2.293603     2.75   0.013     1.511542    11.11267
------------------------------------------------------------------------------
Just for practice, we will do the same thing in long format.
capture drop race
capture drop meanfbf
reshape long fbf , i(id) j(dose)
list if id<=2, noobs nolabel sepby(id)
  +--------------------------+
  | id   dose    fbf   black |
  |--------------------------|
  |  1      0      1       0 |
  |  1     10    1.4       0 |
  |  1     20    6.4       0 |
  |  1     60   19.1       0 |
  |  1    150     25       0 |
  |  1    300   24.6       0 |
  |  1    400     28       0 |
  |--------------------------|
  |  2      0    2.1       0 |
  |  2     10    2.8       0 |
  |  2     20    8.3       0 |
  |  2     60   15.7       0 |
  |  2    150   21.9       0 |
  |  2    300   21.7       0 |
  |  2    400   30.1       0 |
  +--------------------------+
Computing the mean of the post isoproterenol administration measurements
capture drop meanfbf
egen meanfbf=mean(fbf) if dose>0 ,by(id)
list if id==7 | id==8, noobs nolabel sepby(id)
  +--------------------------------------+
  | id   dose    fbf   black     meanfbf |
  |--------------------------------------|
  |  7      0   1.24       0           . |
  |  7     10    1.2       0    5.983334 |
  |  7     20    3.3       0    5.983334 |
  |  7     60    5.3       0    5.983334 |
  |  7    150    5.4       0    5.983334 |
  |  7    300   10.1       0    5.983334 |
  |  7    400   10.6       0    5.983334 |
  |--------------------------------------|
  |  8      0    3.1       0           . |
  |  8     10      .       0      23.375 |
  |  8     20      .       0      23.375 |
  |  8     60  15.45       0      23.375 |
  |  8    150      .       0      23.375 |
  |  8    300      .       0      23.375 |
  |  8    400   31.3       0      23.375 |
  +--------------------------------------+
We see that the mean is computed correctly on the nonmissing values.
This time we will put the baseline fbf value in the last observation for each subject, since the
meanfbf is missing in the first observation. Creating a baseline variable and setting all values of
the meanfbf variable to missing except for the last line,
capture drop fbfbase
bysort id: gen fbfbase=fbf[1] if _n==_N
bysort id: replace meanfbf=. if _n~=_N
list if id<=2, noobs nolabel sepby(id)
  +-----------------------------------------------+
  | id   dose    fbf   black    meanfbf   fbfbase |
  |-----------------------------------------------|
  |  1      0      1       0          .         . |
  |  1     10    1.4       0          .         . |
  |  1     20    6.4       0          .         . |
  |  1     60   19.1       0          .         . |
  |  1    150     25       0          .         . |
  |  1    300   24.6       0          .         . |
  |  1    400     28       0   17.41667         1 |
  |-----------------------------------------------|
  |  2      0    2.1       0          .         . |
  |  2     10    2.8       0          .         . |
  |  2     20    8.3       0          .         . |
  |  2     60   15.7       0          .         . |
  |  2    150   21.9       0          .         . |
  |  2    300   21.7       0          .         . |
  |  2    400   30.1       0      16.75       2.1 |
  +-----------------------------------------------+
Requesting the ANCOVA,
regress meanfbf black fbfbase
      Source |       SS       df       MS              Number of obs =      22
-------------+------------------------------           F(  2,    19) =   24.93
       Model |  723.655489     2  361.827745           Prob > F      =  0.0000
    Residual |  275.804739    19  14.5160389           R-squared     =  0.7240
-------------+------------------------------           Adj R-squared =  0.6950
       Total |  999.460228    21  47.5933442           Root MSE      =    3.81

------------------------------------------------------------------------------
     meanfbf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       black |  -7.593173   1.699874    -4.47   0.000    -11.15105   -4.035297
     fbfbase |   3.196871   .7498965     4.26   0.000      1.62732    4.766423
       _cons |   6.312108   2.293603     2.75   0.013     1.511542    11.11267
------------------------------------------------------------------------------
When to Use Which Summary Measure
Senn et al (2000) discuss this question. The AUC method is often used in pharmacokinetic
studies, expressed as the area under the concentration time curve (Senn et al, 2000, p.869).
If one is more interested in the total concentration of a drug in the body over the time
between absorption and excretion, and not so much in how quickly it is absorbed or excreted, the
AUC measure would do a nice job of expressing this.
If it is thought that the effect is to rapidly reach a plateau and then remain there, the mean
approach, leaving baseline out of the computation of the mean summary measure, with baseline
used as a covariate in an ANCOVA, is a good choice (Senn et al, 2000, p.869, Figure 1).
[Exercise: look at Figure 1 in Senn et al, p.869]
If one is interested in how rapidly the effect changes, without a plateau effect, then a slope is an
appropriate summary measure (Senn et al, 2000, p.869, Figure 1).
A more sophisticated approach is a hierarchical model (also called multilevel model, or mixed
model), which we will cover later in this course. Senn et al (2000, p.873) point out, however,
“The summary measures approach is a simple and robust approach to analysing clinical
trials. In many cases the loss of efficiency compared to fitting more formal hierarchical
models is not great.”
Exercise. Look at the Grantham et al (N Engl J Med, 2006) paper. They have longitudinal data,
which they show in their figures. They analyzed their data using a response feature analysis.
Look at the first paragraph on page 2124. They 1) computed a slope for each patient as one
summary measure, and 2) used (year 3 minus baseline)/interval as a change per year
summary measure.
In their Table 1, they show the comparison of group means of individual slopes.
References
Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman &
Hall/CRC, pp.426-433.
Cook TD, Campbell DT. (1979). Quasi-Experimentation: Design & Analysis Issues for
Field Settings. Boston, Houghton Mifflin Company.
Cox NJ. (2004). Speaking Stata: graphing agreement and disagreement. The Stata Journal
4(3):329-349. [available free at: http://www.stata-journal.com/archives.html]
Dalton ME, Bromhan DR, Ambrose CL, Osborne J, Dalton KD. (1987). Nasal absorption of
progesterone in women. Br J Obstet Gynaecol 94(1):85-8.
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: a Simple
Introduction to the Analysis of Complex Data. Cambridge, Cambridge University
Press.
Lang CC, Stein CM, Brown RM, et al. (1995). Attenuation of isoproterenol-mediated vasodilation in
blacks. N Engl J Med 333:155-60.
Rabe-Hesketh S, Everitt B. (2003). A Handbook of Statistical Analyses Using Stata. 3rd
Ed. New York, Chapman & Hall/CRC.
Senn S, Stevens L, Chaturvedi N. (2000). Tutorial in biostatistics: repeated measures in clinical
trials: simple strategies for analysis using summary measures. Statist Med 19:861-877.
Twisk JWR. (2003). Applied Longitudinal Data Analysis for Epidemiology: A Practical
Guide. Cambridge, Cambridge University Press.
van Belle G. (2002). Statistical Rules of Thumb. New York, John Wiley & Sons.
Wegman EJ. (1990). Hyperdimensional data analysis using parallel coordinates. Journal of the
American Statistical Association 85: 664-675.