ANATOMIC PATHOLOGY
Review Article
Multivariate Statistical Analysis
for Pathologists
Part I, The Logistic Model
ROBIN T. VOLLMER, MD, MS
This paper reviews concepts of multivariate statistical modeling via the
logistic regression model, which has become very popular for modeling
the relationship between a positive clinical outcome and a variety of
predictor variables. The process is illustrated using a composite of data
from three large prostate specific antigen based screening studies of
prostate cancer. (Key words: Statistics; Logistic regression; Multivariate analysis; Prostate specific antigen; Prostate cancer) Am J Clin Pathol 1996;105:115-126.
The purpose of this paper is to introduce and summarize
multivariate statistical analysis as seen through the logistic regression model, which is the most popular method
for relating a binary clinical outcome to several predictors.1-5
Most often, we use statistical models to either explain
or predict a clinical outcome. We view the outcome as
a dependent random variable, y, because it depends on
random variables, x1, x2, . . . xj, that we hypothesize either explain or predict the outcome. What the models do
then is to provide a function, f, so that:
y = f(x1, x2, . . . , xj)   (1)

If we could truly understand the biochemical, physiologic, and epidemiologic mechanisms of a disease, then we would know the exact form of the function f. Without that understanding, we resort to a few empirical functions that seem to work. The logistic function is one of the more popular ones. To make it work for a given dataset, we need to fit it to the data. For the logistic model, the outcome, y, is most often a binary one (ie, either absent or present, 0 or 1, or true or false). This corresponds to the presence or absence of disease, complication, response to treatment, relapse, survival, or other binary outcomes. The logistic model can also apply to categorical outcomes, for which y takes values of 0, 1, 2, 3, and so forth. However, if the outcome we are interested in is continuous or nearly continuous, such as serum concentrations of creatine kinase, then we should turn to other models such as the general linear model.2

One of the major problems with the medical literature in general and anatomic pathology literature specifically is that clinical outcome is often linked to a single predictor in isolation and without consideration of additional predictors. The new factor is often not tested for predictive capacity independent of other, and perhaps more established, predictors. Furthermore, it is often not tested for the capacity to add additional information to that imparted by other predictors. For example, a new immunohistochemical marker of a cancer may be linked to patient prognosis, but often it is not determined if this new marker provides independent or additional information beyond that given by the standard morphologic prognosticators of tumor size, grade, and stage. Multivariate analysis is a statistical tool that may be used to achieve such multifactor modeling.
THE LOGISTIC MODEL
Let us symbolize by y the patient's clinical outcome. If the outcome is negative, then y = 0; if the outcome is positive, then y = 1. The logistic model with a single x variable is then written:

p(y = 1 | x) = e^(a + b*x)/(1 + e^(a + b*x))   (2)

where p(y = 1 | x) symbolizes the probability that y = 1 given the value of x. The a and b are coefficients that are adjusted to make the model fit the data. The a is a kind of intercept reflecting the overall probability of positive outcome, and the b is the slope parameter controlling the influence of x on the probability of positive outcome. Often the model is written in terms of the logit function defined as:

logit(p) = log[p(y = 1 | x)/(1 - p(y = 1 | x))]   (3)

From the Department of Laboratory Medicine, VA Medical Center and Duke University, Durham, North Carolina.

Manuscript received August 22, 1995; accepted August 23, 1995.

Address reprint requests to Dr. Vollmer: Laboratory Medicine (113), VA Medical Center, Durham, NC 27705.

TABLE 1. DATA FOR PROSTATE SPECIFIC ANTIGEN SCREENING FOR PROSTATE CANCER

Reference No. | No. of Patients Screened | No. of Patients Biopsied | No. of Patients Discovered with Cancer | 5% Estimate* | Residual Carcinoma† (%)
6 | 984 | 984 | 54 | 49 | 0 (0)
7 | 6,630 | 1,167 | 264 | 332 | 68 (1.2)
8 | 1,249 | 105 | 32 | 62 | 30 (2.6)

* Five percent of total number of patients screened, ie, an estimate of the total number of significant cancers (ref 9).
† Difference between the previous two columns, ie, the estimate of expected significant cancers that would have been found if all patients had been biopsied.
The ratio inside the logarithm on the right side of equation 3 is the odds of a positive outcome. To shorten the notation, we will sometimes write p(y = 1 | x) as just p as we did above.
By substituting equation 2 for p(y = 1 | x) and doing some algebra we can find that:

logit(p) = a + b*x   (4)
Thus, if the data follow the logistic model and we plot
the logit(p) against x, then the data points would cluster
about a straight line.
Models for multiple x variables are written as if the variables have an additive effect. For example, if there are two x variables, x1 and x2, then the logistic model becomes:

logit(p) = a + b1*x1 + b2*x2   (5)
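Equations 2 through 5 are easy to verify numerically. The short Python sketch below uses arbitrary, illustrative coefficients (not fitted to any data) to show that the logit of the modeled probability recovers the linear form a + b*x exactly:

```python
import math

def logistic_p(a, b, x):
    """Equation 2: p(y = 1 | x) for the single-variable logistic model."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def logit(p):
    """Equation 3: the log odds of a positive outcome."""
    return math.log(p / (1.0 - p))

# Equation 4: logit(p) is linear in x, so plotting logit(p) against x
# would give a straight line (coefficients here are arbitrary).
a, b = -2.0, 0.5
for x in (0.0, 2.0, 4.0):
    p = logistic_p(a, b, x)
    assert abs(logit(p) - (a + b * x)) < 1e-9
```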
DATA SCREENING FOR PROSTATIC CARCINOMA FROM PSA

To illustrate logistic analysis, we will use a composite of data from three studies on prostate specific antigen (PSA) based screening for prostate cancer.6-8 Although the reports did not publish the raw data, they gave sufficient detail to calculate the frequencies of positive biopsy for groups of patients with relatively narrow ranges of PSA. For each such group the midpoint of that range was chosen to represent their PSA. Altogether, the three studies comprised more than 8,800 male patients older than 45. We excluded 18 patients with PSA between 10 and 20 ng/mL from Labrie and colleagues' study,6 because the proportion (3 of 18) having a positive biopsy differed so much from the rest at this PSA level that they appeared as outliers. Furthermore, in two studies the investigators biopsied mostly just those with elevated PSA, so that we had to make a guess about the total frequency of positive biopsy for those with PSA less than 4 ng/mL.

We based that guess on Littrup's9 suggestion that in this age group approximately 5% will have a "significant" cancer. By first calculating what 5% of the total was and then by subtracting the cancers found from this 5%, we obtained an estimate of how many significant cancers remained undetected in the patients who were not biopsied. This step was an important one if we were to obtain a logistic model applicable to all patients in the screening population. Table 1 summarizes the data, and shows that the guess of residual cancers is close to Brawer and coworkers' experience that less than 2% of cancers remain undetected after screening.10 Figure 1 shows the plot of observed fractions of those with carcinoma versus the midpoint PSA level for all three studies (symbolized as 1, 2, and 3). The lower end of the plot demonstrates that our guesses for the fractions with cancer at PSA levels less than 4 ng/mL were consistent with the rest of the data, and the plot suggested that all three studies appear to follow a single curve.

[FIG. 1. Plot of observed fractions of those with carcinoma versus the midpoint PSA level for three studies of the use of PSA for screening for prostate cancer (references 6, 7 and 8 symbolized respectively as 1, 2 and 3).]

Ideally, to perform the following analysis we need the raw data. Also, ideally, at a minimum that data should provide not only the biopsy result, but also PSA level, digital rectal exam result, patient's age, and family history of prostate cancer. Thus, because we are working
with limited data, the following results are intended primarily to illustrate key points of logistic analysis, not to
produce a final model. However, to the degree the developed models match the summarized data, they may be
useful for predicting results from such limited data on
new patients, and they may also be a good starting point
for developing models with the full data.
FITTING THE MODEL TO THE DATA

The data for a logistic model with a single x variable consist of a table of two columns, one for y and one for x. There is one row entry for each patient. Thus, the observation for the ith patient consists of a pair of values: yi and xi. Next we must consider the likelihood function, L, which is the probability of getting the dataset if the logistic model applies. L is the key to both the solution of the analysis as well as to the calculation of statistics to test hypotheses, and for the logistic model it is defined as:

L = Π[i = 1 to n] p(y = 1 | xi)^yi * (1 - p(y = 1 | xi))^(1 - yi)   (6)

Given that each p(y = 1 | xi) term in the equation is a shorthand for the more complex function in equation 2, we see that this equation for L collects all the terms of the problem: the overall likelihood for the data, the y outcome for each patient, the x value for each patient, and the a and b coefficients.

Because -2*ln(L) (ln means natural logarithm) relates closely to the chi-square distribution and test statistics,12 this form of L is often more emphasized. In fact, because of its importance some logistic analysis programs such as LOGIST4 print out -2*ln(L) rather than L, and we will simplify its writing for the rest of this paper by omitting the parentheses around L and write it as just -2*ln L. Expressing the likelihood as a logarithm converts the products in equation 6 into sums. Note also that as L increases, -2*ln L decreases, so that the user must expect to see decreasing values of -2*ln L with improved models, rather than increases. What the software programs then do is to iteratively change a and b until -2*ln L is minimized (that is, L is maximized), because when the model fits the data, the likelihood should be maximum. The estimates that we obtain for a and b are then called maximum likelihood estimates.

TEST OF SIGNIFICANCE

To test for the significance of an x variable we compare the -2*ln L obtained for the model fitted with just the parameter a (-2*ln L[a]) to the -2*ln L with both a and b (-2*ln L[a,b]). It turns out that -2*ln(L[a]/L[a,b]), called the likelihood ratio (LR), is a statistic having a chi-square distribution under the null hypothesis that there is no x effect.1-5 The LR for our model with one x variable is then:

LR = -2*ln L(a) - (-2*ln L(a,b))   (7)

Because LR follows a chi-square distribution with one degree of freedom, it is also called the model chi-square, and from this chi-square, we may calculate a P value for the null hypothesis that x has no effect on outcome. If the LR is large because -2*ln L(a) is much larger than -2*ln L(a,b), this is equivalent to saying that L(a,b) is much larger than L(a), that is, the likelihood for getting this raw data result is higher if the x variable is acting than if it is not. In this circumstance, the P value will be small, and we may reject the null hypothesis that x has no effect on outcome.

The form of equation 7 also gives the LR statistic for comparing a logistic regression model with several x variables against one with just the intercept a. Thus, if there are three x variables in the model with coefficients b1, b2, and b3, then the model chi-square becomes:

LR = -2*ln L(a) - (-2*ln L(a,b1,b2,b3))   (8)

Furthermore, if we have two alternative models, each of which has its own model LR (or chi-square), then we can compare the two by taking the difference between their model chi-squares. Because the terms for the intercept cancel out in the subtraction, this is equivalent to a likelihood ratio test for one model against the other.

Using the PSA screening data of Table 1, a positive outcome of biopsy result diagnostic of adenocarcinoma, and the SAS program LOGIST,4 we obtained a value of 3,547 for -2*ln L(a) and a value of 2,790 for -2*ln L(a,b), so that the model chi-square was 3,547 - 2,790 = 757. Because there was just one parameter b involved, this chi-square implied a single degree of freedom and a P value of .0001. The result clearly suggests that elevated PSA predicts a positive biopsy. Furthermore, using the iterative maximum likelihood technique LOGIST estimated the model fit for p(Ca | PSA) as:

p(Ca | PSA) = 1/(1 + e^-(-4.19 + 0.344*PSA))   (9)

This logistic fit estimates that at a PSA level of zero the probability of a biopsy diagnostic of significant cancer is approximately 0.02, which is close to Brawer and colleagues' experience.10 The model then predicts that probability rises gradually to 0.06 at PSA of 4 ng/mL and to 0.32 at 10 ng/mL.
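The fitting and testing machinery above can be sketched in a few lines of Python. The data here are synthetic, generated from the coefficients of equation 9, because the raw screening data were not published; a general-purpose optimizer stands in for LOGIST's iterative maximum likelihood routine:

```python
import numpy as np
from scipy.optimize import minimize

def neg2_ln_L(params, x, y):
    """-2*ln L for the single-variable logistic model (equations 2 and 6)."""
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Synthetic data drawn from the coefficients of equation 9 (not the real study data)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 20.0, 2000)
p_true = 1.0 / (1.0 + np.exp(-(-4.19 + 0.344 * x)))
y = (rng.uniform(size=2000) < p_true).astype(float)

# Intercept-only model: -2*ln L(a)
fit_a = minimize(lambda t: neg2_ln_L([t[0], 0.0], x, y), x0=[0.0])
# Full model: -2*ln L(a,b); the minimizing a and b are the maximum likelihood estimates
fit_ab = minimize(neg2_ln_L, x0=[0.0, 0.0], args=(x, y))

LR = fit_a.fun - fit_ab.fun  # model chi-square with 1 degree of freedom (equation 7)
```

With a real effect present, the fitted -2*ln L(a,b) is much smaller than -2*ln L(a), and the LR statistic is correspondingly large.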
TEST OF FIT
Although logistic regression analysis has become common in published studies in pathology, we seldom see
[FIG. 2. Plot of observed fractions ("o") and calculated values ("#") of p(Ca | PSA) for the model with untransformed PSA as ng/mL. Note that the #'s are too low at lower levels of PSA and too high at higher levels of PSA.]
any indication of how well the published models fit the
data. Instead most authors seem satisfied with just the
chi-squares and P values for the x variables of interest.
As we will see, one can obtain significant P values and
still have poor fit for the model. Fortunately, there are
several ways to examine and test for goodness of fit, and
if the fit is not good, then there may be ways to improve
it without abandoning the logistic model. We just need
to take a few extra steps in the analysis.
Perhaps, the most helpful way to see how well the model fits the data is to plot the calculated p(y = 1 | x) and the observed probabilities against the x variable (or against several x variables if there are more than one). If the fit is good, then the observed and predicted probabilities fall close to one another. If the dataset is not already large enough to have several patients at each level of x, then we can divide it arbitrarily into 10 percentiles of increasing values of x, calculate an average x for the group and a p(y = 1 | x) for that average x and then once again plot and compare the observed percentages of y = 1 in these percentiles to the calculated p(y = 1 | x). Figure 2 shows the observed and calculated values of p(Ca | PSA) for the screening data analyzed above. We see that the calculated p(Ca | PSA) from equation 9 appears too low in the PSA range of 5 to 10 ng/mL and then too high for PSA greater than 15 ng/mL. Thus the fit is not ideal even though the model chi-square was high and the P value low.
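The grouping check just described can be sketched as follows; the helper name and the use of equal-sized groups are my choices, not the paper's:

```python
import numpy as np

def calibration_groups(x, y, p_hat, n_groups=10):
    """Split patients into n_groups of increasing x; for each group return
    (mean x, observed fraction with y = 1, mean calculated p(y = 1 | x)).
    A good fit puts the last two numbers close together in every group."""
    order = np.argsort(x)
    groups = np.array_split(order, n_groups)
    return [(float(x[g].mean()), float(y[g].mean()), float(p_hat[g].mean()))
            for g in groups]
```

Plotting the second column against the third then gives the observed-versus-predicted comparison described in the text.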
Another way to plot and compare the observed and estimated p(y = 1 | x) is to plot the observed frequencies on the y axis and the calculated p(y = 1 | x) on the x axis. If the fit is good, then the plot clusters about a line with a 45° angle. This sort of plot works no matter how many x variables there might be.
A third way to visualize and test for the goodness of fit uses the Pearson or deviance residuals, which are analogous to the difference between observed and predicted values in a linear regression.1 A good fit gives residuals close to zero and without any trend with respect to the x variable. The sum of the squares of either of these residuals has a chi-square distribution if the dataset is large enough, so that large values of this sum suggest a poor fit.
Figure 3 shows the plot of the deviance residuals for the
logistic model of equation 9 and the PSA screening dataset, and it illustrates more explicitly how the fit is not
ideal. Between PSA values of 2 and 10 there are many
points where the deviance is too positive, and overall the
plot of deviance does not suggest the ideal of random
scatter about zero. The lack of fit in the region from 2 to
10 is of special concern, because this is where we desire
greatest accuracy.
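For ungrouped, per-patient data the deviance residuals can be computed directly; a sketch (grouped data, as in this paper's tables, would use the binomial analogue):

```python
import numpy as np

def deviance_residuals(y, p_hat):
    """Deviance residuals for a fitted logistic model. A good fit gives
    residuals scattered randomly about zero; the sum of their squares
    equals the model deviance, -2*ln L."""
    p = np.clip(p_hat, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    d_sq = -2.0 * (y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return np.sign(y - p) * np.sqrt(d_sq)
```

Plotting these against x is exactly the check shown in Figure 3: a trend or one-sided run of residuals signals lack of fit.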
SCALING OF X VARIABLES, CUT-OFF POINTS,
AND CONTINUITY
The beauty of the logistic model is that it can relate a
binary y outcome to one or more continuous x variables.
Think in terms of equation 1. On the left hand side, we
[FIG. 3. Plot of deviance residuals versus PSA in ng/mL for the model of untransformed PSA as ng/mL. Note that the distribution of points appears as a non-random scatter.]
often have binary y outcomes or choices. The patient
does, or does not, have a disease. His tumor can, or cannot, be resected. She has, or has not, suffered a relapse.
We will, or will not, treat. For a given patient the number
of diseases we can diagnose, the number of treatments
we can offer, or the number of stages of disease we can
determine tend to be limited to a few categories, often
just two. However, the x variables on the right hand side
of the equation are usually more complex, or even continuous.
Tumor diameter, Breslow thickness, patient weight,
and serum PSA are examples of continuous x variables.
If such a variable is useful, it will map the patients and
their disease into a continuous spectrum, so that if we
compare any two patients the one with higher x will consistently have more disease (or less, if the relationship
between y and x is negative). Thus, the potential information that continuous x variables can offer is great.
However, what we commonly see in papers are transformations of continuous x variables that turn them into
categorical variables or even binary ones. By using a single cut-off point, authors will change a continuous x variable into one that is either positive or negative. For clinical chemistries, we get normal and elevated. For tumor
diameter, we have tumors less than or greater than 2 cm.
For Breslow thickness, we have less than or greater than
1 mm.
Mostly, these cut-off points attempt to simplify the
choices of clinical actions and treatments. If the action is
a binary one, then by making the x variable binary the
choice of action becomes automatic. Receiver-operator curves (ROC) are then used to optimize the cut-off point of x.11 For example, we use a cut-off point of 4 ng/mL on PSA to decide whether or not to biopsy the prostate.
However, using these cut-off points hides natural variance in the data and may deprive us of useful information. I suggest that before using a cut-off point to
transform a continuous x variable into a binary one, we
try the x variable in its natural scale or use a continuous
transformation such as log(x), exp(x), or square root of
x. Then we can use the logistic model to relate x to the
binary outcome or choice in which we are interested.
Choosing the scale for a continuous x variable is important, and finding a satisfactory choice depends on intuition and trial and error. Certainly, to begin it is good
to try the natural scale for x as measured. Although we
can also plot the logit(y) against x to get an idea of what
transformation of x might be useful,1 we should try several different scales or transformations of x to optimize
its performance in the model. Furthermore, we need not
confine ourselves to just one measure of x. We can add a
second term in x such as x^2. In this case the logistic model becomes:

logit(p) = a + b1*x + b2*x^2   (10)

and we can continue this by adding terms of x^3 or higher exponents or terms such as ln(x) and exp(x). Each additional term in x requires an additional coefficient (that is the b1, b2, b3, . . . etc.), so that the final fitted logistic model may appear like:

p(y = 1 | x) = 1/(1 + e^-tr(x))   (11)

with an example of tr(x) (tr symbolizes "transformation") given as:

tr(x) = a + b1*x + b2*x^2 + b3*ln(x)   (12)
Some also favor restricted cubic splines of x for modeling nonlinearities between the logit(p) and x.12,13
We can illustrate the issues of cut-off points and transformations with a further analysis of the PSA screening
data. Let us first reduce PSA levels to just two binary x variables, PSA4 and PSA10, by using cut-off points at 4 ng/mL and 10 ng/mL and the following algorithm:

if PSA < 4, then PSA4 = 0 and PSA10 = 0.
if 4 ≤ PSA < 10, then PSA4 = 1 and PSA10 = 0.
if PSA ≥ 10, then PSA4 = 1 and PSA10 = 1.
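The same coding written out in Python (how to assign the boundary values of exactly 4 and 10 ng/mL is my assumption, since the printed inequalities leave it open):

```python
def psa_dummies(psa):
    """Reduce a continuous PSA (ng/mL) to the two binary variables PSA4
    and PSA10 using cut-off points at 4 and 10 ng/mL."""
    psa4 = 1 if psa >= 4 else 0
    psa10 = 1 if psa >= 10 else 0
    return psa4, psa10
```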
Running the LOGIST program with this model produced an overall chi-square of 900, representing a significant improvement over the previous model chi-square of 757 with PSA alone as the x variable. The difference, 900 - 757 = 143, implies a P value of less than 0.001. However, plotting the fitted model against the observed probabilities in Figure 4 shows that this model is unnatural. Instead of a continuous increase in p(Ca | PSA) with increasing PSA, it gives three horizontal plots of the "#" at locally constant levels of p(Ca | PSA) depending on the PSA: 0.02 for PSA < 4, 0.26 for 4 ≤ PSA < 10, and 0.57 for PSA ≥ 10. Now although one may be tempted to conclude from just the P values alone that this is a good model for predicting the probability of a positive biopsy, the plotted test of fit shows that it is not ideal.
After trial and error, we settled on the following transform (tr) of PSA:

tr(PSA) = ln(PSA) + ln^2(PSA)   (13)
This resulted in an overall chi-square of 941, the best of the three models reported here and significantly better than that of the previous model, because the difference in their model chi-squares was 41. Figure 5 shows that its predicted values of p(Ca | PSA) fall close to the observed ones. Figure 6 shows that the deviance residuals now
center about zero and with a random scatter. The equation for the fit of this model is given as:

p(Ca | PSA) = 1/(1 + e^-[-6.07 + 3.57*ln(PSA) - 0.448*ln^2(PSA)])   (14)

[FIG. 4. Plot of observed fractions ("o") and calculated values ("#") of p(Ca | PSA4, PSA10) for the model with PSA4 and PSA10 (cut-off points of PSA at 4 ng/mL and 10 ng/mL). Note that the #'s appear on three horizontal lines.]
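Equation 14 can be evaluated directly; a small sketch using the coefficients as printed:

```python
import math

def p_ca_given_psa(psa):
    """Equation 14: probability of a positive biopsy given PSA (ng/mL),
    using the ln(PSA) and ln^2(PSA) transformation of equation 13."""
    ln_psa = math.log(psa)
    tr = -6.07 + 3.57 * ln_psa - 0.448 * ln_psa ** 2
    return 1.0 / (1.0 + math.exp(-tr))
```

Unlike the cut-off model, this form gives a smooth, continuously rising probability over the screening range of PSA.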
MULTIPLE X VARIABLES
The greatest interest in the logistic model comes from
its ability to use multiple variables (x1, x2, . . . xj) to
predict a binary outcome (y = 0 or 1). After all, what we
achieved above by modeling p(Ca | PSA) with equation 14 is not much different from the raw data plotted in Figure 1. If that raw data is sufficiently dense, then we can
use it alone for predicting outcomes for new patients, because the logistic model then functions only to interpolate between observed values in the raw data. Alternatively, imagine attempting the prediction from raw data
and with multiple x variables. In that circumstance, we
would require multiple plots or multiple tables. Here the
logistic model goes beyond simple interpolation to provide us a very concise tool for prediction.
With advances in molecular technology, the number
of potentially prognostic variables is rising rapidly. Just
consider the complexity regarding the prognosis in breast
cancer. In addition to the traditional prognostic measures of tumor size, grade, and nodal status we have
ploidy, S phase, ER/PR, p53, c-erbB-2/neu, HCAM/
CD44, PDGFR, BCL-2, GST-PI, MiB-1, mitotic rate,
angiogenesis, nuclear morphometry, and so on. With
just these 16, even logistic regression analysis may not be
able to reach a satisfactory predictive model without using thousands of patients. The more x variables there are,
the more data are required to perform the analysis. In
general, we need a sufficient number of patients to produce
at least one positive outcome (and preferably five positive outcomes) for every possible combination of x variable levels. The lower the probability of a positive outcome, then the more patients we need. Thus, 16 binary x
variables implies 2' 6 = 256 possible categories or cells of
x variables. If the a priori prevalence of positive outcome
were 0.01 in one of these cells, then we might require
more than 100 patients in just this one to get one positive
patient. To get more than 100 in this cell in a random
population could require even greater numbers in the
others. Harrell3 also suggests that there should be at least
10 times as many patients with a positive outcome as
there are x variables (assuming that the number with y =
1 is less than with y = 0). Thus it is easy to see that the
proving of prognostic importance of new markers could
be expensive or possibly never done adequately because
of limited numbers of patients. Furthermore, the more x
variables there are, the more tedious, time consuming,
and subjective the model building process can be.
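The bookkeeping in the paragraph above is simple arithmetic; in this sketch the 0.01 prevalence is the text's hypothetical figure:

```python
n_vars = 16
n_cells = 2 ** n_vars          # possible combinations of 16 binary x variables
prevalence = 0.01              # a priori probability of a positive outcome in one cell
n_for_one_positive = int(1 / prevalence)   # roughly 100 patients for one positive
harrell_minimum = 10 * n_vars  # Harrell's rule: at least 10 positives per x variable
```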
We have seen that each model has a LR statistic for
comparing it to the model with just an intercept, so that
if we want to test for the importance of adding an x2
variable to the model with a single xl variable, all we
need do is compare the differences between their model
LR statistics. This difference is:
LR = LR(a,b1,b2) - LR(a,b1)   (15)

[FIG. 5. Plot of observed fractions ("o") and calculated values ("#") of p(Ca | PSA) for the model with ln(PSA) and ln^2(PSA).]

[FIG. 6. Plot of deviance residuals versus PSA in ng/mL for the model with ln(PSA) and ln^2(PSA). Note that now the deviances appear more as random scatter than in Figure 3.]
This new LR gives a chi-square statistic for testing the
importance of the x2 parameter. Another useful statistic
for comparing several x variables in the model is the
Wald statistic. It can be used for testing the null hypothesis that the b coefficient is zero (that is, that the x variable is unimportant for predicting y). For the xi variable
the Wald statistic for this null hypothesis is defined as:

W = (maximum likelihood estimate of bi)/(standard error of bi estimate)   (16)
W^2 has a chi-square distribution, and programs such
as LOGIST print a table of these with P values, one for
every x variable. The Wald statistics, or more likely their
P values, are probably the ones most commonly seen
published next to x variables in the results of pathology
papers using the logistic model.
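Equation 16 in code; the coefficient estimate and its standard error would come from the fitting program's output, so the inputs here are placeholders:

```python
from scipy import stats

def wald_test(b_hat, se_b):
    """Equation 16: Wald statistic for the null hypothesis b = 0.
    W^2 follows a chi-square distribution with 1 degree of freedom,
    giving the P value usually printed next to each x variable."""
    W = b_hat / se_b
    p_value = stats.chi2.sf(W ** 2, df=1)
    return W, p_value
```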
Although in general there seems to be no one ideal way
to select x variables for model building, there is a tendency now to move away from computer driven stepwise
techniques.1,3 Many prefer instead a more purposeful selection. We begin with the one, or ones, we know from
prior studies are important. As each x variable is added,
we examine the overall model LR as well as the Wald
statistics to see if there has been significant improvement. We can add newer x variables in an order of preference based on experience, intuition, a motivation to
test a particular variable with others, or other factors
such as the availability and cost of x variables. We can
examine whole subsets of related variables, such as
multiple measures of proliferation, to see which have the
largest Wald statistic and then work more carefully with
them. Or we can resort to the computer's selection to get
a preliminary idea about the x variables.
The forward stepwise approach has been popular in
pathology. It selects the x variables one at a time based
on their effect on the likelihood L of equation 6. At each
step the x variable producing the greatest increase in L is
chosen to enter the model, and ones already in are retested to see if they remain significant. As an alternative
to this forward stepwise approach, we can put all x variables into the model and then proceed stepwise to eliminate ones that are not significant. This is called backward
stepwise analysis. With either automatic stepwise selection, beware x variables that become significant only after many others have entered the model, especially if their
relationship with the y outcome reverses during the
steps. This may be due to overfitting and can yield a
model that does not validate well with new data. After
all, if perfect fit is what we are after, we can achieve this
simply by adding one x variable for every patient. This is
called the full or saturated model, but it provides no useful prediction for new datasets.
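One forward step can be sketched as follows, using a scipy-based fit of my own rather than the LOGIST procedure: among the candidate x variables, enter the one whose addition most increases L, ie, most decreases -2*ln L:

```python
import numpy as np
from scipy.optimize import minimize

def fitted_neg2_ln_L(X, y):
    """Minimized -2*ln L for a logistic model with an intercept plus the
    columns of X (equation 6, maximized over the coefficients)."""
    def obj(beta):
        p = 1.0 / (1.0 + np.exp(-(beta[0] + X @ beta[1:])))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return minimize(obj, np.zeros(X.shape[1] + 1)).fun

def forward_step(X_all, y, in_model):
    """Try each x variable not yet in the model; return the index of the
    one giving the smallest -2*ln L (the largest likelihood) when added."""
    scores = {j: fitted_neg2_ln_L(X_all[:, in_model + [j]], y)
              for j in range(X_all.shape[1]) if j not in in_model}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

Backward elimination inverts the loop: start with all columns in the model and drop, one at a time, the variable whose removal costs the least likelihood.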
Having satisfied ourselves that our choice of x variables and model building are sufficient, we can test for
ones left out by adding them and seeing if they improve
the overall model chi-square. For example, if we have selected x variables x1-xk and derived a transformation tr(x1-xk) of these that produced an adequate model, we can compare the LR of this model with ones left unused (xk+1 - xn) as follows:

LR = LR(a, tr(x1-xk), xk+1 - xn) - LR(a, tr(x1-xk))   (17)

This statistic should have a chi-square distribution with n-k degrees of freedom. If this LR is small, we can reasonably omit the remaining xk+1 - xn variables.
Nevertheless, we should also remember that with a large
list of potential x variables it is unlikely that there is a
single optimal model, but instead several models of close
or equal performance and involving different subsets of
the x variables.2
INTERACTION BETWEEN X VARIABLES
If we have two x variables, xl and x2, then a logistic
model with interaction includes a third term, which is
the product xl * x2:
Vol. 105-No. I
122
ANATOMIC PATHOLOGY
Review Article
(18) nificant age effect on the probability of obtaining a positive biopsy. This was in addition to the effect of PSA.
This model then allows for the possibility that the
Because the coefficient for agel was positive, the resulteffect of xl on outcome y is different for different levels
ing model implied that as age increases the probability of
of x2. For example, if the disease were breast cancer and
positive biopsy increases even after accounting for the
y were some arbitrary positive outcome such as tumor recurrence, then x1 might be ER/PR status and x2 sex. If men and women differ in the way their outcome depends on ER/PR, then the interaction term b3*x1*x2 in the model logit(p) = a + b1*x1 + b2*x2 + b3*x1*x2 models this difference. To test for a significant difference in the way men's and women's ER/PR status affects outcome, we compare the model chi-squares with and without the interaction coefficient, b3:

LR = [-2*ln L(a,b1,b2)] - [-2*ln L(a,b1,b2,b3)]     (19)

or we examine the Wald statistic (and its P value) for the coefficient b3. If either test shows a low chi-square and a high P value, then we conclude that the ER/PR effect was the same for men and women.

SCREENING FOR PROSTATE CANCER: PSA AND AGE

To illustrate the importance of interaction between x variables, we continue the example of screening for prostate cancer, but now with two variables: PSA and patient age. Because only one of the three data sets8 published sufficient information about age to do the analysis, the total size of the data drops to just 6,630. This makes the model chi-squares smaller. Furthermore, even though age was given in the three broad categories of 50-60, >60-70, and >70, stratifying this smaller data set into PSA levels as well as three age levels produced several age-PSA cells with so few patients that we had to combine them to get numbers exceeding 10 patients. Whenever we did this, we took the group's final PSA midpoint as representative for the combined category. These combinations did not appear to alter the shape of the p(Ca|PSA) versus PSA plot.

We began this analysis with the model of equation 14. The LOGIST program on the smaller data set now yielded a model chi-square of 712, but the plot (not shown) of predicted p(Ca|PSA) once again matched closely the observed values over the entire range of PSA. Next we added age using a graded factor age1 defined as:

age1 = 0 if age = 50 to 59;
age1 = 1 if age = 60 to 69;
age1 = 2 if age > 69.

The LOGIST program this time yielded a model chi-square of 720. The difference between these two model chi-squares is 720 - 712 = 8, implying a P value of 0.005, and the Wald chi-square statistic for age1 was significant at a P value of 0.0063. Thus, there was a small but significant age effect in addition to the PSA effect. This fits our prior understanding about the diagnosis of prostate cancer.

Next, we added an interaction term, thinking that the association between a positive biopsy and PSA might differ for different ages. For example, this could hold if age-specific thresholds for PSA were important for predicting a positive biopsy. Because in the last model the most important x variable was ln(PSA), we looked for an interaction between age1 and this term by forming the variable age1*ln(PSA). This time LOGIST produced a model chi-square of 732, or an improvement of 12 (P < .001) over the previous model, and the Wald chi-square statistic for the age1*ln(PSA) interaction term was 12.03 (P = .0005). The final model for p(Ca|PSA, age), incorporating both PSA and age effects, is given by:

p(Ca|PSA, age) = 1/(1 + e^(-tr(PSA, age)))     (20)

where tr(PSA, age) symbolizes:

tr(PSA, age) = -7.6 + 0.854*age1 + 5.0*ln(PSA) - 0.739*ln2(PSA) - 0.442*age1*ln(PSA)     (21)
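Equations 20 and 21 fully specify the fitted model, so readers can evaluate it themselves. The sketch below (a helper written for this review; the function name and the example PSA values are illustrative, not from the original analysis) computes p(Ca|PSA, age) from the published coefficients and confirms the crossing of the age curves that Figure 7 displays:

```python
import math

def p_ca(psa, age1):
    """p(Ca | PSA, age) from equations 20 and 21.

    psa  -- serum PSA in ng/mL (must be positive, since ln(PSA) is used)
    age1 -- graded age factor: 0 for 50-59, 1 for 60-69, 2 for >69
    """
    lnpsa = math.log(psa)
    # Equation 21: the linear predictor tr(PSA, age)
    tr = (-7.6 + 0.854 * age1 + 5.0 * lnpsa
          - 0.739 * lnpsa ** 2 - 0.442 * age1 * lnpsa)
    # Equation 20: the logistic transform of tr
    return 1.0 / (1.0 + math.exp(-tr))

# The two age terms cancel where 0.854 = 0.442*ln(PSA), ie, near PSA =
# 6.9 ng/mL: below that the older groups have the higher predicted p(Ca),
# above it the lower (the crossing seen in Figure 7).
print(p_ca(2.0, 2) > p_ca(2.0, 0))    # True
print(p_ca(30.0, 2) < p_ca(30.0, 0))  # True
```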
FIG. 7. Plot of the predicted p(Ca) versus PSA in ng/mL for the three age ranges 50-59, 60-69, and >69, indicated respectively by the numbers 1, 2, and 3 on the plot.
A.J.C.P.-January 1996
VOLLMER
123
Statistics for Pathology
TABLE 2. SUMMARY OF MODEL DEVELOPMENT FOR PROSTATE SPECIFIC ANTIGEN SCREENING FOR PROSTATE CANCER

Model No.  Y   n      Npos  X Variables                              LR   Figures
1          Ca  8,863  448   PSA                                      757  2, 3
2          Ca  8,863  448   PSA4, PSA10                              900  4
3          Ca  8,863  448   ln(PSA), ln2(PSA)                        941  5, 6
4          Ca  6,630  332   ln(PSA), ln2(PSA)                        712
5          Ca  6,630  332   ln(PSA), ln2(PSA), age1                  720
6          Ca  6,630  332   ln(PSA), ln2(PSA), age1, age1*ln(PSA)    732  7, 8

Y = outcome parameter, ie, a biopsy positive for carcinoma; n = total number of patients; Npos = total number with a positive biopsy; LR = likelihood ratio statistic.
Whereas the positive sign for the coefficient of age1 implies that in general older men have a higher probability of a positive biopsy, the negative sign for the interaction term means that for older men the probability of a positive biopsy is less than the PSA level alone predicts. We can see this more easily by looking at the plot of predicted p(Ca|PSA, age) versus PSA in Figure 7. In the plot the categories of the age1 factor are indicated by the numbers 1, 2, and 3. At lower PSA levels the model predicts that older men have a higher probability of a positive biopsy, but the curves reverse at higher levels of PSA, where the model implies that older men have a lower probability of a positive biopsy than their PSA level alone predicts. Perhaps this is because older men have higher baseline PSA levels. However, for most of the PSA levels, including the critical range of less than 10 ng/mL, the curves for the three age ranges are so close that they nearly overlap. This relates well to Catalona and colleagues'7 conclusion that age-specific PSA values do not add much to the diagnostic accuracy of a binary cut-off point in PSA.
REPORT OF FITTED MODEL

Publication of the results of a logistic analysis should include at least a partial list of the models tried. For example, Table 2 summarizes the models tried here for the PSA screening data. For the final or best model, we should then list the x variables, their coefficients, standard errors, Wald statistics, and P values. Table 3 summarizes these results for the final two models developed for the PSA screening. For a significant continuous x variable it is useful to see the observed and calculated p(y = 1) plotted, overlaid, against the x, as in Figure 5, and it is helpful to see either the Pearson or deviance residuals plotted the same way, as in Figure 6. If the model is complex and includes a number of x variables, one can plot the observed p(y = 1) against the predicted p(y = 1). For example, Figure 8 shows the plot of the observed probability of a positive prostate biopsy versus the predicted probability for the logistic model that included both age and PSA as x variables.

TABLE 3. SUMMARY OF MODEL RESULTS FOR FINAL MODELS

Model No.*  X Variable     Coefficient  SE      Wald  P Value
1           Intercept      -6.0745      0.2319  656   .0001
1           ln(PSA)         3.5651      0.3351  113   .0001
1           ln2(PSA)       -0.4477      0.1016   19   .0001
2           Intercept      -7.6035      0.3737  414   .0001
2           ln(PSA)         4.9973      0.4851  106   .0001
2           ln2(PSA)       -0.7394      0.1489   25   .0001
2           age1            0.8540      0.1968   19   .0001
2           age1*ln(PSA)   -0.4418      0.1274   12   .0005

PSA = prostate specific antigen.
* Model 1 used data from references 6-8. Model 2 used data from reference 7.
FIG. 8. Plot of the observed fractions against the calculated values of p(Ca) for the logistic model of equations 20 and 21, using both PSA and patient age. Because the scatter of points follows a 45° line, the fit of the logistic model is reasonable.
TABLE 4. COMPARISON OF MODEL FITS WITH REDUCED DATA

                                          Data Set*
                                     1        2        3
No. of patients with biopsy
  positive for carcinoma           332      347      448
Total no. of patients            6,630    6,647    8,863
X variable coefficients
  Intercept                      -6.92    -6.49    -6.07
  ln(PSA)                         4.82     4.05     3.57
  ln2(PSA)                      -0.829   -0.568   -0.448

PSA = prostate specific antigen.
* Data set 1 used data from reference 7. Data set 2 used data from reference 7 plus those from reference 6 with PSA > 20 ng/mL. Data set 3 used data from references 6-8 collectively.
Although there is some scatter to the points in Figure 8, they appear clustered about a 45° line, and this indicates reasonable agreement. It is probable that some of the residual variation in the plot is due to important factors, such as the results of digital rectal examination, that were left out of the analysis.
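The bookkeeping behind a plot like Figure 8 is simple: group the patients on their predicted probability and compare each group's observed fraction of positive outcomes with its mean predicted value. The sketch below runs on simulated (predicted p, outcome) pairs rather than the original patient records; the function name and bin count are choices of this illustration:

```python
import random

def calibration_points(preds, outcomes, n_bins=10):
    """For each non-empty bin of predicted p(y = 1), return the pair
    (mean predicted p, observed fraction with y = 1)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, y))
    points = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            points.append((mean_p, observed))
    return points

# Simulated, well-calibrated predictions: each outcome is drawn with its
# predicted probability, so the points should hug the 45-degree line.
rng = random.Random(1)
preds = [rng.random() for _ in range(5000)]
outcomes = [1 if rng.random() < p else 0 for p in preds]
for mean_p, observed in calibration_points(preds, outcomes):
    print(f"predicted {mean_p:.2f}  observed {observed:.2f}")
```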
DISCRIMINATION AND VALIDATION
A measure of a model's ability to discriminate between y = 0 and y = 1 outcomes is the concordance, c.3 To see how this works, imagine taking two patients from the dataset such that the first had a positive outcome (y = 1) and the second a negative outcome (y = 0). If the first also has a higher calculated p(y = 1|x), then the pair is considered concordant. Otherwise it is discordant, and pairs with tied p(y = 1|x) are not used. If we repeat this process for all possible pairings of patients, one with y = 1 and the second with y = 0, then c is defined as the proportion of concordant pairs:
c = (no. of concordant pairs)/(total possible pairings)     (22)
A good model should give mostly concordant pairs. Specifically, a model without any ability to discriminate outcome gives a c of 0.5, whereas a model with perfect prediction gives a c of 1.0. The final models for PSA screening for prostate cancer with and without age gave c values of 0.82 and 0.84, respectively, on the original data, which are reasonably good concordance results.
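Equation 22 can be computed by brute force over all (y = 1, y = 0) pairs. A minimal sketch, not the implementation used for the paper's own figures, and practical only for modest datasets since it examines every pair:

```python
def concordance(preds, outcomes):
    """Concordance c of equation 22: among all pairs of one positive and
    one negative patient, the fraction in which the positive patient has
    the higher predicted p(y = 1 | x).  Tied pairs are dropped from both
    numerator and denominator."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    concordant = sum(1 for pp in pos for pn in neg if pp > pn)
    tied = sum(1 for pp in pos for pn in neg if pp == pn)
    usable = len(pos) * len(neg) - tied
    return concordant / usable

print(concordance([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect
print(concordance([0.9, 0.2, 0.8, 0.1], [1, 0, 0, 1]))  # 0.5: no discrimination
```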
Probably the best way to validate a developed model is to apply it to a new dataset. Some researchers divide the initial data into two parts: one for developing the model and one for testing the model. Regardless of whether we split the initial data or gather new data, we designate the first as the training set and the second as the test set. Using the fitted model from the training set, we can then calculate p(y = 1|x) for the test patients and compare this calculated estimate with whether or not they had a positive outcome. Hosmer and Lemeshow9 recommend using the same summary statistics one uses for tests of fit.
FIG. 9. Plot of the observed fractions ("o") and calculated values ("#") of p(Ca|PSA) for the training data from reference 7, using the logistic model of Table 4, column 1. The test data for this model appear in the next figure.

FIG. 10. Plot of the observed fractions ("o") for the test data from references 6 and 8. The calculated values of p(Ca|PSA) ("#") come from the model developed from just the data of reference 7.
Thus, to study how well the model validates, we can examine the Pearson and deviance residuals on the test data and plot the results.
Several authors5,14 also recommend studying validation by reusing the logistic regression model on the test data, only this time with a new x variable defined as logit(p(y = 1|x)). Let us take the simple example with just one x variable. First, we fit the logistic model of equation 2 to the training data. Then we calculate p(y = 1|x) and logit(p) for the patients from the test data. Next we do a second logistic analysis, this time on the test data, and we use the calculated logit(p) as the new x variable. If the second logistic analysis yields maximum likelihood estimates of the coefficients such that a = 0 and b = 1, then the validation is perfect. Of course, because the estimates of a and b mostly will not exactly equal 0 and 1, we must perform statistical tests of this null hypothesis (that a = 0 and b = 1), and if the resulting P values are large we may conclude that the test data validate the model.
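This recalibration check is straightforward to program. The sketch below is illustrative only: it uses a minimal Newton-Raphson logistic fitter and simulated data from an assumed model, logit(p) = 0.5 + 1.5*x, rather than the PSA data, and all names are inventions of this sketch:

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(x, y, iters=25):
    """Newton-Raphson maximum likelihood fit of logit(p) = a + b*x."""
    a = b = 0.0
    for _ in range(iters):
        p = [sigmoid(a + b * xi) for xi in x]
        # Score (gradient of the log likelihood)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Observed information (2 x 2), with weights p*(1 - p)
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

random.seed(7)
def simulate(n):
    """Hypothetical data from the assumed model logit(p) = 0.5 + 1.5*x."""
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [1 if random.random() < sigmoid(0.5 + 1.5 * xi) else 0 for xi in x]
    return x, y

x_train, y_train = simulate(4000)
x_test, y_test = simulate(4000)

a_hat, b_hat = fit_logistic(x_train, y_train)       # fit on the training set
logit_hat = [a_hat + b_hat * xi for xi in x_test]   # logit(p) for test patients
a_val, b_val = fit_logistic(logit_hat, y_test)      # recalibration on test set
print(f"validation intercept {a_val:.2f} (ideal 0), slope {b_val:.2f} (ideal 1)")
```

Because the test data were generated from the same model as the training data, the recalibration coefficients should land near the ideal (0, 1) up to sampling error.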
An alternative to splitting the data is to use the bootstrap technique.5,15 Here, we randomly resample the original data with replacement to get a different collection of patients. This new sample, also of size n, is called the bootstrap sample, and even though it contains patients found in the original sample, the overall mix is different. Some of the original patients may not appear in the bootstrap sample, and some may appear more than once because of the replacement. Because there are a variety of ways the bootstrapping technique can be used to validate the model, its coefficients, its predictive ability, or even the model building process, I refer the reader to references 5 and 15 for further details.
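The resampling step itself is simple to express in code. A minimal illustration on dummy patient identifiers (the validation machinery built on top of the resampling is left to references 5 and 15):

```python
import random

def bootstrap_sample(patients, rng):
    """Draw a bootstrap sample: n draws from the original n patients,
    with replacement, so the sample size matches the original."""
    return [rng.choice(patients) for _ in range(len(patients))]

rng = random.Random(42)
patients = list(range(1, 11))             # stand-ins for 10 patient records
boot = bootstrap_sample(patients, rng)
print(sorted(boot))                       # same size; typically some ids repeat
print(sorted(set(patients) - set(boot)))  # ids that happened not to be drawn
```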
Because the ability to predict outcome for new patients seems of great importance, it is surprising that
more studies are not devoted to validating previously
published statistical models. Certainly, the data must exist, especially since there are many major medical centers
collecting large numbers of patients with diseases such as
breast and prostate cancers.
To illustrate some aspects of validation, let us use once
again the PSA screening data. Now pretend that for the
training step we have just the data from reference 7. Performing the logistic regression analysis on this limited
data using the transform of PSA in equation 13 gives us
the results in column 1 of Table 4. Note that the coefficients differ from those of equation 14 that resulted from
using all the data (repeated in column 3 of Table 4). Figure 9 shows that this model fit this smaller training dataset well, and Figure 10 shows that it also fit the test data
well up to a PSA level of 13 ng/mL. Beyond this was just
a single group of patients at PSA level of 35 ng/mL, and
this point was fit poorly. We can see then that if we want
the modeling from the training set to validate, we must
ensure that the training data has the full range of the x
variables. Next, we fixed this problem by moving the patients with these higher PSA values from the test data to
the training data and repeated the analysis. This produced a better model (column 2 of Table 4), and it fit the
test data better (not shown). Note that the coefficients of
this model come closer to those of the model in column
3, which used all the data. Thus we can see that one of
the costs of working with limited data is that we develop
a model that may not be as good as if we used all the
data. Furthermore, in this example I have cheated a little,
because I began with equation 13, which had come from
all the data. In the real training-test situation, one should
develop both the model form as well as the coefficients
from just the training data, and in this circumstance the
resulting validation could have been worse.
DISCUSSION
Some may be tempted to believe that models such as the logistic model allow us to predict an exact outcome for a new, specific patient, but this is seldom if ever true. What the logistic model provides instead is the probability of a positive outcome, p(y = 1), not the outcome itself, and it does this not for a single patient but for a group of similar patients. This is what is implied by the difference between equations 1 and 2. Although what we are most interested in is an equation like 1 that gives the outcome y as its output, with the logistic model we never achieve perfect prediction for a single patient but instead an average result for a group. There remains uncertainty about the outcome for a single patient, especially because there are almost always important and unknown factors operating outside the model. In the face of this uncertainty some may abandon important prognosticators,16 but in my opinion this is really nothing more than what we should expect from modeling complex biologic systems.
The fact that the output of the logistic model is a probability raises another important issue: continuity versus discrete phenomena. Often we deal with binary outcomes, such as whether the patient does or does not have cancer. However, the output of the logistic model is a probability, which is a continuous phenomenon. It goes continuously from 0 to 1, and all cut-off points in probability are arbitrary. Because many of the x variables we deal with are also continuous, we must recognize that binary decisions are to be made from inputs that are continuous. Attempts to ease the decision by dividing one or more x variables such as PSA level into arbitrary "high" and "low" levels should not mislead us into thinking the
underlying biologic process is binary. The logistic model is helpful in this regard because it takes both continuous and discrete inputs and summarizes the problem with a single continuous quantity, the probability of a positive outcome, or p(y = 1). Instead of basing binary decisions on cut-off points in the x variables, we can base the decision on our estimate of p(y = 1). Perhaps the patient can at this point help in the decision of what to do next, so that some might opt for biopsy if their probability of cancer was 0.20 (1 in 5), whereas others might prefer a threshold for p(Ca) of 0.05 (1 in 20). In this way, I believe the logistic model can be especially useful.
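That final, patient-specific step amounts to comparing the estimated p(Ca) with a threshold the patient helps choose. A trivial sketch, using the illustrative 1-in-5 and 1-in-20 thresholds from the text and a hypothetical probability estimate:

```python
def recommend_biopsy(p_ca, threshold):
    """Compare the model's estimated probability of cancer with the
    threshold a given patient has chosen for acting on it."""
    return p_ca >= threshold

p = 0.12  # a hypothetical estimate of p(Ca) for one patient group
print(recommend_biopsy(p, 0.20))  # False: the 1-in-5 patient would wait
print(recommend_biopsy(p, 0.05))  # True: the 1-in-20 patient would proceed
```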
REFERENCES

1. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley and Sons, 1989.
2. McCullagh P, Nelder JA. Generalized Linear Models, ed 2. London: Chapman and Hall, 1989.
3. Harrell FE Jr, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic prediction: Advantages, problems, and suggested solutions. Cancer Treat Rep 1985;69:1071-1077.
4. The LOGIST procedure. SAS/STAT User's Guide, Version 6, ed 4. Cary, NC: SAS Institute, 1990.
5. Harrell FE Jr. Predicting Outcomes: Applied Survival Analysis and Logistic Regression. Durham, NC: Duke University Medical Center, 1994.
6. Labrie F, Dupont A, Subura R, Cusan L, et al. Serum prostate specific antigen as pre-screening test for prostate cancer. J Urol 1992;147:846-852.
7. Catalona WJ, Hudson M, Scardino PT, et al. Selection of optimal prostate specific antigen cut-offs for early detection of prostate cancer: Receiver operating characteristic curves. J Urol 1994;152:2037-2042.
8. Brawer MK, Chetner MP, Beatie J, Buchner DM, Vessella RL, Lange PH. Screening for prostatic carcinoma with prostate specific antigen. J Urol 1992;147:841-845.
9. Littrup PJ, Lee F, Mettlin C. Prostate cancer screening: Current trends and future implications. CA Cancer J Clin 1992;42:198-210.
10. Brawer MK, Beatie J, Wener MH, et al. Screening for prostatic carcinoma with prostate specific antigen: Results of the second year. J Urol 1993;150:106-109.
11. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-298.
12. Harrell FE Jr, Lee KL, Pollock BG. Regression models in clinical studies: Determining relationships between predictors and response. J Natl Cancer Inst 1988;80:1198-1202.
13. Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med 1989;8:551-561.
14. Miller ME, Hui SL, Tierney WM. Validation techniques for logistic regression models. Stat Med 1991;10:1213-1226.
15. Efron B, Tibshirani R. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.
16. Green MS, Ackerman AB. Thickness is not an accurate gauge of prognosis of primary cutaneous melanoma. Am J Dermatopathol 1993;15:461-473.