ANATOMIC PATHOLOGY Review Article

Multivariate Statistical Analysis for Pathologists. Part I: The Logistic Model

ROBIN T. VOLLMER, MD, MS

From the Department of Laboratory Medicine, VA Medical Center and Duke University, Durham, North Carolina. Manuscript received August 22, 1995; accepted August 23, 1995. Address reprint requests to Dr. Vollmer: Laboratory Medicine (113), VA Medical Center, Durham, NC 27705.

This paper reviews concepts of multivariate statistical modeling via the logistic regression model, which has become very popular for modeling the relationship between a positive clinical outcome and a variety of predictor variables. The process is illustrated using a composite of data from three large prostate specific antigen based screening studies of prostate cancer. (Key words: Statistics; Logistic regression; Multivariate analysis; Prostate specific antigen; Prostate cancer) Am J Clin Pathol 1996;105:115-126.

The purpose of this paper is to introduce and summarize multivariate statistical analysis as seen through the logistic regression model, which is the most popular method for relating a binary clinical outcome to several predictors.1-5 Most often, we use statistical models either to explain or to predict a clinical outcome. We view the outcome as a dependent random variable, y, because it depends on random variables, x1, x2, ..., xj, that we hypothesize either explain or predict the outcome. What the models do then is provide a function, f, so that:

y = f(x1, x2, ..., xj)   (1)

If we could truly understand the biochemical, physiologic, and epidemiologic mechanisms of a disease, then we would know the exact form of the function f. Without that understanding, we resort to a few empirical functions that seem to work. The logistic function is one of the more popular ones. To make it work for a given dataset, we need to fit it to the data. For the logistic model, the outcome, y, is most often binary (ie, absent or present, 0 or 1, true or false). This corresponds to the presence or absence of disease, complication, response to treatment, relapse, survival, or other binary outcomes. The logistic model can also apply to categorical outcomes, for which y takes values of 0, 1, 2, 3, and so forth. However, if the outcome we are interested in is continuous or nearly continuous, such as the serum concentration of creatine kinase, then we should turn to other models such as the general linear model.2

One of the major problems with the medical literature in general, and the anatomic pathology literature specifically, is that clinical outcome is often linked to a single predictor in isolation, without consideration of additional predictors. The new factor is often not tested for predictive capacity independent of other, perhaps more established, predictors. Furthermore, it is often not tested for the capacity to add information to that imparted by other predictors. For example, a new immunohistochemical marker of a cancer may be linked to patient prognosis, but often it is not determined whether this new marker provides independent or additional information beyond that given by the standard morphologic prognosticators of tumor size, grade, and stage. Multivariate analysis is a statistical tool that may be used to achieve such multifactor modeling.

THE LOGISTIC MODEL

Let us symbolize by y the patient's clinical outcome. If the outcome is negative, then y = 0; if the outcome is positive, then y = 1. The logistic model with a single x variable is then written:

p(y = 1 | x) = e^(a + b*x) / (1 + e^(a + b*x))   (2)

where p(y = 1 | x) symbolizes the probability that y = 1 given the value of x. The a and b are coefficients that are adjusted to make the model fit the data. The a is a kind of intercept function reflecting the overall probability of
positive outcome, and the b is the slope parameter controlling the influence of x on the probability of a positive outcome. Often the model is written in terms of the logit function, defined as:

logit(p) = log[p(y = 1 | x) / (1 - p(y = 1 | x))]   (3)

The ratio inside the logarithm on the right side of equation 3 is the odds of a positive outcome, or odds ratio. To shorten the notation, we will sometimes write p(y = 1 | x) as just p, as we did above. By substituting equation 2 for p(y = 1 | x) and doing some algebra, we find that:

logit(p) = a + b*x   (4)

Thus, if the data follow the logistic model and we plot logit(p) against x, the data points should cluster about a straight line. Models for multiple x variables are written as if the variables have an additive effect. For example, if there are two x variables, x1 and x2, then the logistic model becomes:

logit(p) = a + b1*x1 + b2*x2   (5)

TABLE 1. DATA FOR PROSTATE SPECIFIC ANTIGEN SCREENING FOR PROSTATE CANCER

Reference No.   No. of Patients   No. of Patients   No. of Patients          5%          Residual
                Screened          Biopsied          Discovered with Cancer   Estimate*   Carcinoma† (%)
6               984               984               54                       49          0 (0)
7               6,630             1,167             264                      332         68 (1.2)
8               1,249             105               32                       62          30 (2.6)

* Five percent of the total number of patients screened, ie, an estimate of the total number of significant cancers (ref 9).
† Difference between the previous two columns, ie, the estimate of expected significant cancers that would have been found if all patients had been biopsied.

DATA FROM PSA SCREENING FOR PROSTATIC CARCINOMA

To illustrate logistic analysis, we will use a composite of data from three studies of prostate specific antigen (PSA) based screening for prostate cancer.6-8 Although the reports did not publish the raw data, they gave sufficient detail to calculate the frequencies of positive biopsy for groups of patients with relatively narrow ranges of PSA. For each such group the midpoint of that range was chosen to represent its PSA. Altogether, the three studies comprised more than 8,800 male patients older than 45. We excluded 18 patients with PSA between 10 and 20 ng/mL from Labrie and colleagues' study,6 because the proportion (3 of 18) having a positive biopsy differed so much from the rest at this PSA level that they appeared as outliers. Furthermore, in two studies the investigators biopsied mostly just those with elevated PSA, so that we had to make a guess about the total frequency of positive biopsy for those with PSA less than 4 ng/mL. We based that guess on Littrup's9 suggestion that in this age group approximately 5% will have a "significant" cancer. By first calculating what 5% of the total was, and then subtracting the cancers found from this 5%, we obtained an estimate of how many significant cancers remained undetected in the patients who were not biopsied. This step was important if we were to obtain a logistic model applicable to all patients in the screening population. Table 1 summarizes the data and shows that the guess of residual cancers is close to Brawer and coworkers' experience that less than 2% of cancers remain undetected after screening.10

Figure 1 shows the plot of observed fractions of those with carcinoma versus the midpoint PSA level for all three studies (symbolized as 1, 2, and 3). The lower end of the plot demonstrates that our guesses for the fractions with cancer at PSA levels less than 4 ng/mL were consistent with the rest of the data, and the plot suggested that all three studies follow a single curve.

Ideally, to perform the following analysis we would have the raw data. At a minimum, those data should provide not only the biopsy result but also the PSA level, digital rectal examination result, patient's age, and family history of prostate cancer.
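The imputation arithmetic behind Table 1 can be sketched in a few lines of Python. The counts come from the table itself; treating the residual percentage as a fraction of the unbiopsied patients is an assumption on my part, made because it reproduces the printed percentages:

```python
# Sketch of Table 1's 5% imputation: 5% of those screened estimates the
# significant cancers present (ref 9); subtracting the cancers actually found
# gives the residual undetected cancers among the unbiopsied patients.
screened = {6: 984, 7: 6630, 8: 1249}   # patients screened, by reference no.
biopsied = {6: 984, 7: 1167, 8: 105}    # patients biopsied
found    = {6: 54,  7: 264,  8: 32}     # cancers found among those biopsied

residuals = {}
for ref in screened:
    estimate = round(0.05 * screened[ref])    # expected significant cancers
    residual = max(0, estimate - found[ref])  # cannot be negative
    unbiopsied = screened[ref] - biopsied[ref]
    # Assumption: the table's percentage is taken over unbiopsied patients.
    pct = 100 * residual / unbiopsied if unbiopsied else 0.0
    residuals[ref] = (residual, round(pct, 1))

print(residuals)   # {6: (0, 0.0), 7: (68, 1.2), 8: (30, 2.6)}
```

The cancers found plus the imputed residuals total 448 positive outcomes, the Npos figure that reappears in Table 2 for the three-study models.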
FIG. 1. Plot of observed fractions of those with carcinoma versus the midpoint PSA level for three studies of the use of PSA for screening for prostate cancer (references 6, 7, and 8, symbolized respectively as 1, 2, and 3).

Thus, because we are working with limited data, the following results are intended primarily to illustrate key points of logistic analysis, not to produce a final model. However, to the degree the developed models match the summarized data, they may be useful for predicting results from such limited data on new patients, and they may also be a good starting point for developing models with the full data.

FITTING THE MODEL TO THE DATA

The data for a logistic model with a single x variable consist of a table of two columns, one for y and one for x. There is one row entry for each patient. Thus, the observation for the ith patient consists of a pair of values: yi and xi. Next we must consider the likelihood function, L, which is the probability of getting the dataset if the logistic model applies. L is the key both to the solution of the analysis and to the calculation of statistics to test hypotheses, and for the logistic model it is defined as:

L = Π(i = 1 to n) p(y = 1 | xi)^yi * (1 - p(y = 1 | xi))^(1 - yi)   (6)

Given that each p(y = 1 | xi) term in the equation is shorthand for the more complex function in equation 2, we see that this equation for L collects all the terms of the problem: the overall likelihood for the data, the y outcome for each patient, the x value for each patient, and the a and b coefficients. Because -2*ln(L) (ln means natural logarithm) relates closely to the chi-square distribution and test statistics,12 this form of L is often emphasized. In fact, because of its importance some logistic analysis programs such as LOGIST4 print out -2*ln(L) rather than L, and we will simplify its writing for the rest of this paper by omitting the parentheses around L and writing it as just -2*ln L. Expressing the likelihood as a logarithm converts the products in equation 6 into sums. Note also that as L increases, -2*ln L decreases, so that the user should expect to see decreasing values of -2*ln L with improved models, rather than increases. What the software programs then do is iteratively change a and b until -2*ln L is minimized (that is, L is maximized), because when the model fits the data, the likelihood should be maximal. The estimates that we obtain for a and b are then called maximum likelihood estimates.

TEST OF SIGNIFICANCE

To test for the significance of an x variable we compare the -2*ln L obtained for the model fitted with just the parameter a (-2*ln L(a)) to the -2*ln L with both a and b (-2*ln L(a,b)). It turns out that -2*ln(L(a)/L(a,b)), called the likelihood ratio (LR), is a statistic having a chi-square distribution under the null hypothesis that there is no x effect.1-5 The LR for our model with one x variable is then:

LR = -2*ln L(a) - (-2*ln L(a,b))   (7)

Because LR follows a chi-square distribution with one degree of freedom, it is also called the model chi-square, and from this chi-square we may calculate a P value for the null hypothesis that x has no effect on outcome. If the LR is large because -2*ln L(a) is much larger than -2*ln L(a,b), this is equivalent to saying that L(a,b) is much larger than L(a); that is, the likelihood of getting this raw data result is higher if the x variable is acting than if it is not. In this circumstance, the P value will be small, and we may reject the null hypothesis that x has no effect on outcome.

The form of equation 7 also gives the LR statistic for comparing a logistic regression model with several x variables against one with just the intercept a. Thus, if there are three x variables in the model with coefficients b1, b2, and b3, then the model chi-square becomes:

LR = -2*ln L(a) - (-2*ln L(a,b1,b2,b3))   (8)

Furthermore, if we have two alternative models, each of which has its own model LR (or chi-square), then we can compare the two by taking the difference between their model chi-squares. Because the terms for the intercept cancel out in the subtraction, this is equivalent to a likelihood ratio test of one model against the other.

Using the PSA screening data of Table 1, a positive outcome defined as a biopsy result diagnostic of adenocarcinoma, and the SAS program LOGIST,4 we obtained a value of 3,547 for -2*ln L(a) and a value of 2,790 for -2*ln L(a,b), so that the model chi-square was 3,547 - 2,790 = 757. Because there was just one parameter b involved, this chi-square implied a single degree of freedom and a P value of .0001. The result clearly suggests that elevated PSA predicts a positive biopsy. Furthermore, using the iterative maximum likelihood technique, LOGIST estimated the model fit for p(Ca | PSA) as:

p(Ca | PSA) = 1 / (1 + e^-(-4.19 + 0.344*PSA))   (9)

This logistic fit estimates that at a PSA level of zero the probability of a biopsy diagnostic of significant cancer is approximately 0.02, which is close to Brawer and colleagues' experience.10 The model then predicts that the probability rises gradually to 0.06 at a PSA of 4 ng/mL and to 0.32 at 10 ng/mL.

TEST OF FIT

Although logistic regression analysis has become common in published studies in pathology, we seldom see any indication of how well the published models fit the data.

FIG. 2. Plot of observed fractions ("o") and calculated values ("#") of p(Ca | PSA) for the model with untransformed PSA as ng/mL. Note that the #'s are too low at lower levels of PSA and too high at higher levels of PSA.
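The fitting and likelihood-ratio machinery of the last two sections can be sketched in pure Python with a small Newton-Raphson fit on simulated data. This is a sketch only, with made-up data; the paper's own fits used the SAS procedure LOGIST:

```python
import math
import random

def fit_logistic(data, use_x=True, iters=25):
    """Newton-Raphson maximum likelihood fit of logit(p) = a (+ b*x).
    Returns (a, b, minus_2_ln_L); b stays 0.0 when use_x is False."""
    a = b = 0.0
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0   # gradient and information terms
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            ga += y - p
            gb += (y - p) * x
            haa += w
            hab += w * x
            hbb += w * x * x
        if use_x:
            det = haa * hbb - hab * hab
            a += (hbb * ga - hab * gb) / det
            b += (haa * gb - hab * ga) / det
        else:
            a += ga / haa
    m2ll = -2.0 * sum(
        y * math.log(1.0 / (1.0 + math.exp(-(a + b * x))))
        + (1 - y) * math.log(1.0 - 1.0 / (1.0 + math.exp(-(a + b * x))))
        for x, y in data)
    return a, b, m2ll

# Simulated screening-like data: a true logistic relationship plus noise.
random.seed(1)
data = []
for _ in range(400):
    x = random.uniform(0.0, 10.0)
    p_true = 1.0 / (1.0 + math.exp(-(-2.0 + 0.5 * x)))
    data.append((x, 1 if random.random() < p_true else 0))

_, _, m2ll_a = fit_logistic(data, use_x=False)   # intercept-only: -2*ln L(a)
a1, b1, m2ll_ab = fit_logistic(data)             # full model:     -2*ln L(a,b)
LR = m2ll_a - m2ll_ab   # the model chi-square of equation 7 (1 df)
print(round(LR, 1), round(a1, 2), round(b1, 2))
```

Because -2*ln L always drops when b is added, the fitted -2*ln L(a,b) is smaller than -2*ln L(a), and their difference is the model chi-square.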
Instead, most authors seem satisfied with just the chi-squares and P values for the x variables of interest. As we will see, one can obtain significant P values and still have a poor fit for the model. Fortunately, there are several ways to examine and test goodness of fit, and if the fit is not good, there may be ways to improve it without abandoning the logistic model. We just need to take a few extra steps in the analysis.

Perhaps the most helpful way to see how well the model fits the data is to plot the calculated p(y = 1 | x) and the observed probabilities against the x variable (or against several x variables if there is more than one). If the fit is good, then the observed and predicted probabilities fall close to one another. If the dataset is not large enough to have several patients at each level of x, then we can divide it into 10 groups (deciles) of increasing values of x, calculate an average x and a p(y = 1 | x) for that average x in each group, and then plot and compare the observed percentages of y = 1 in these groups to the calculated p(y = 1 | x). Figure 2 shows the observed and calculated values of p(Ca | PSA) for the screening data analyzed above. We see that the calculated p(Ca | PSA) from equation 9 appears too low in the PSA range of 5 to 10 ng/mL and too high for PSA greater than 15 ng/mL. Thus the fit is not ideal even though the model chi-square was high and the P value low.

Another way to compare the observed and estimated p(y = 1 | x) is to plot the observed frequencies on the y axis and the calculated p(y = 1 | x) on the x axis. If the fit is good, then the plot clusters about a line at a 45° angle. This sort of plot works no matter how many x variables there might be.
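The decile-style comparison can be sketched as follows. The patients here are simulated from the model of equation 9 itself, so observed and predicted fractions should track each other; with real data, systematic gaps between the two columns would signal lack of fit:

```python
import math
import random

def p_model(psa):
    # the fitted model of equation 9
    return 1.0 / (1.0 + math.exp(-(-4.19 + 0.344 * psa)))

random.seed(2)
data = [(psa, 1 if random.random() < p_model(psa) else 0)
        for psa in (random.uniform(0.0, 20.0) for _ in range(2000))]

data.sort()                      # order by PSA, then cut into 10 equal groups
size = len(data) // 10
pairs = []
for d in range(10):
    group = data[d * size:(d + 1) * size]
    observed = sum(y for _, y in group) / len(group)
    predicted = sum(p_model(x) for x, _ in group) / len(group)
    pairs.append((observed, predicted))
    print(f"group {d + 1}: observed {observed:.3f}  predicted {predicted:.3f}")
```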
A third way to visualize and test goodness of fit uses the Pearson or deviance residuals, which are analogous to the differences between observed and predicted values in a linear regression.1 A good fit gives residuals close to zero and without any trend with respect to the x variable. The sum of the squares of either of these residuals has a chi-square distribution if the dataset is large enough, so that large values of this sum suggest a poor fit. Figure 3 shows the plot of the deviance residuals for the logistic model of equation 9 and the PSA screening dataset, and it illustrates more explicitly how the fit is not ideal. Between PSA values of 2 and 10 there are many points where the deviance is too positive, and overall the plot of deviance does not suggest the ideal of random scatter about zero. The lack of fit in the region from 2 to 10 is of special concern, because this is where we desire the greatest accuracy.

FIG. 3. Plot of deviance residuals versus PSA in ng/mL for the model of untransformed PSA as ng/mL. Note that the distribution of points appears as a non-random scatter.

SCALING OF X VARIABLES, CUT-OFF POINTS, AND CONTINUITY

The beauty of the logistic model is that it can relate a binary y outcome to one or more continuous x variables. Think in terms of equation 1. On the left hand side, we often have binary y outcomes or choices. The patient does, or does not, have a disease. His tumor can, or cannot, be resected. She has, or has not, suffered a relapse. We will, or will not, treat. For a given patient the number of diseases we can diagnose, the number of treatments we can offer, or the number of stages of disease we can determine tend to be limited to a few categories, often just two. However, the x variables on the right hand side of the equation are usually more complex, or even continuous.
Tumor diameter, Breslow thickness, patient weight, and serum PSA are examples of continuous x variables. If such a variable is useful, it will map the patients and their disease onto a continuous spectrum, so that if we compare any two patients, the one with the higher x will consistently have more disease (or less, if the relationship between y and x is negative). Thus, the potential information that continuous x variables can offer is great. However, what we commonly see in papers are transformations of continuous x variables that turn them into categorical or even binary variables. By using a single cut-off point, authors will change a continuous x variable into one that is either positive or negative. For clinical chemistries, we get normal and elevated. For tumor diameter, we have tumors less than or greater than 2 cm. For Breslow thickness, we have less than or greater than 1 mm. Mostly, these cut-off points attempt to simplify the choices of clinical actions and treatments. If the action is a binary one, then by making the x variable binary the choice of action becomes automatic. Receiver operating characteristic (ROC) curves are then used to optimize the cut-off point of x.11 For example, we use a cut-off point of 4 ng/mL on PSA to decide whether or not to biopsy the prostate. However, using these cut-off points hides natural variance in the data and may deprive us of useful information. I suggest that before using a cut-off point to transform a continuous x variable into a binary one, we try the x variable in its natural scale or use a continuous transformation such as log(x), exp(x), or the square root of x. Then we can use the logistic model to relate x to the binary outcome or choice in which we are interested. Choosing the scale for a continuous x variable is important, and finding a satisfactory choice depends on intuition and trial and error. Certainly, to begin it is good to try the natural scale for x as measured.
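The information a single cut-off point discards is easy to see in a toy sketch (the PSA values here are hypothetical):

```python
import math

psa_values = [0.5, 3.9, 4.1, 8.0, 25.0]           # hypothetical patients

# Dichotomizing at 4 ng/mL: very different patients become indistinguishable.
binary = [1 if psa >= 4 else 0 for psa in psa_values]
print(binary)                                      # [0, 0, 1, 1, 1]

# A continuous transformation such as log(x) keeps the full ordering intact.
log_scale = [round(math.log(psa), 2) for psa in psa_values]
print(log_scale)
```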
Although we can also plot logit(y) against x to get an idea of what transformation of x might be useful,1 we should try several different scales or transformations of x to optimize its performance in the model. Furthermore, we need not confine ourselves to just one measure of x. We can add a second term in x, such as x^2. In this case the logistic model becomes:

logit(p) = a + b1*x + b2*x^2   (10)

and we can continue by adding terms of x^3 or higher exponents, or terms such as ln(x) and exp(x). Each additional term in x requires an additional coefficient (that is, b1, b2, b3, etc.), so that the final fitted logistic model may appear as:

p(y = 1 | x) = 1 / (1 + e^-tr(x))   (11)

with an example of tr(x) (tr symbolizes "transformation") given as:

tr(x) = a + b1*x + b2*x^2 + b3*ln(x)   (12)

Some also favor restricted cubic splines of x for modeling nonlinearities between logit(p) and x.12,13

We can illustrate the issues of cut-off points and transformations with a further analysis of the PSA screening data. Let us first reduce PSA levels to just two binary x variables, PSA4 and PSA10, by using cut-off points at 4 ng/mL and 10 ng/mL and the following algorithm:

if PSA < 4, then PSA4 = 0 and PSA10 = 0;
if 4 <= PSA < 10, then PSA4 = 1 and PSA10 = 0;
if PSA >= 10, then PSA4 = 1 and PSA10 = 1.

Running the LOGIST program with this model produced an overall chi-square of 900, a significant improvement over the previous model chi-square of 757 with PSA alone as the x variable. The difference, 900 - 757 = 143, implies a P value of less than 0.001. However, plotting the fitted model against the observed probabilities in Figure 4 shows that this model is unnatural. Instead of a continuous increase in p(Ca | PSA) with increasing PSA, it gives three horizontal runs of the "#" at locally constant levels of p(Ca | PSA): 0.02 for PSA < 4, 0.26 for 4 <= PSA < 10, and 0.57 for PSA >= 10.
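The PSA4/PSA10 coding and the resulting three-step predictions can be sketched as follows. The three probabilities (0.02, 0.26, 0.57) are the ones quoted above; back-calculating coefficients from them is an assumption made purely for illustration:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Coefficients back-calculated (an assumption) from the three fitted levels.
A = logit(0.02)                  # intercept: level when PSA < 4
B1 = logit(0.26) - logit(0.02)   # effect of PSA4 = 1
B2 = logit(0.57) - logit(0.26)   # additional effect of PSA10 = 1

def step_model_p(psa):
    """p(Ca) from logit(p) = a + b1*PSA4 + b2*PSA10 (equation 5 with dummies)."""
    psa4 = 1 if psa >= 4 else 0
    psa10 = 1 if psa >= 10 else 0
    z = A + B1 * psa4 + B2 * psa10
    return 1.0 / (1.0 + math.exp(-z))

print([round(step_model_p(psa), 2) for psa in (1, 5, 9.9, 10, 30)])
# -> [0.02, 0.26, 0.26, 0.57, 0.57]
```

However many dummy variables we add, the predictions stay piecewise constant, which is exactly the "unnatural" behavior visible in Figure 4.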
Now although one may be tempted to conclude from the P values alone that this is a good model for predicting the probability of a positive biopsy, the plotted test of fit shows that it is not ideal.

FIG. 4. Plot of observed fractions ("o") and calculated values ("#") of p(Ca | PSA4, PSA10) for the model with PSA4 and PSA10 (cut-off points of PSA at 4 ng/mL and 10 ng/mL). Note that the #'s appear on three horizontal lines.

After trial and error, we settled on the following transform (tr) of PSA:

tr(PSA) = ln(PSA) + ln^2(PSA)   (13)

This resulted in an overall chi-square of 941, the best of the three models reported here and significantly better than that of the previous model, because the difference in their model chi-squares was 41. Figure 5 shows that its predicted values of p(Ca | PSA) fall close to the observed ones. Figure 6 shows that the deviance residuals now center about zero with a random scatter. The equation for the fit of this model is:

p(Ca | PSA) = 1 / (1 + e^-[-6.07 + 3.57*ln(PSA) - 0.448*ln^2(PSA)])   (14)

MULTIPLE X VARIABLES

The greatest interest in the logistic model comes from its ability to use multiple variables (x1, x2, ..., xj) to predict a binary outcome (y = 0 or 1). After all, what we achieved above by modeling p(Ca | PSA) with equation 14 is not much different from the raw data plotted in Figure 1. If that raw data is sufficiently dense, then we can use it alone to predict outcomes for new patients, because the logistic model then functions only to interpolate between observed values in the raw data. Alternatively, imagine attempting the prediction from raw data with multiple x variables. In that circumstance, we would require multiple plots or multiple tables. Here the logistic model goes beyond simple interpolation to provide a very concise tool for prediction.
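Staying with the single-variable fit for a moment, equation 14 is easy to evaluate directly (coefficients as quoted above; a sketch):

```python
import math

def p_ca_given_psa(psa):
    """The fitted model of equation 14, with ln(PSA) and ln(PSA)**2 terms."""
    tr = -6.07 + 3.57 * math.log(psa) - 0.448 * math.log(psa) ** 2
    return 1.0 / (1.0 + math.exp(-tr))

# Probabilities rise smoothly with PSA over the screening range.
for psa in (1.0, 2.0, 4.0, 10.0, 20.0):
    print(psa, round(p_ca_given_psa(psa), 3))
```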
With advances in molecular technology, the number of potentially prognostic variables is rising rapidly. Just consider the complexity of prognosis in breast cancer. In addition to the traditional prognostic measures of tumor size, grade, and nodal status, we have ploidy, S phase, ER/PR, p53, c-erbB-2/neu, HCAM/CD44, PDGFR, BCL-2, GST-PI, MiB-1, mitotic rate, angiogenesis, nuclear morphometry, and so on. With just these 16, even logistic regression analysis may not be able to reach a satisfactory predictive model without using thousands of patients. The more x variables there are, the more data are required to perform the analysis. In general, we need a sufficient number of patients to produce at least one positive outcome (and preferably five positive outcomes) for every possible combination of x variable levels. The lower the probability of a positive outcome, the more patients we need. Thus, 16 binary x variables imply 2^16 = 65,536 possible categories or cells of x variables. If the a priori prevalence of positive outcome were 0.01 in one of these cells, then we might require more than 100 patients in just this one cell to get one positive patient. To get more than 100 in this cell in a random population could require even greater numbers in the others. Harrell3 also suggests that there should be at least 10 times as many patients with a positive outcome as there are x variables (assuming that the number with y = 1 is less than the number with y = 0). Thus it is easy to see that proving the prognostic importance of new markers could be expensive, or possibly never done adequately, because of limited numbers of patients. Furthermore, the more x variables there are, the more tedious, time consuming, and subjective the model building process can be.
FIG. 5. Plot of observed fractions ("o") and calculated values ("#") of p(Ca | PSA) for the model with ln(PSA) and ln^2(PSA).

FIG. 6. Plot of deviance residuals versus PSA in ng/mL for the model with ln(PSA) and ln^2(PSA). Note that the deviances now appear more as random scatter than in Figure 3.

We have seen that each model has an LR statistic comparing it to the model with just an intercept, so if we want to test the importance of adding an x2 variable to the model with a single x1 variable, all we need do is compare the difference between their model LR statistics. This difference is:

LR = LR(a,b1,b2) - LR(a,b1)   (15)

This new LR gives a chi-square statistic for testing the importance of the x2 parameter.

Another useful statistic for comparing several x variables in the model is the Wald statistic. It can be used to test the null hypothesis that a b coefficient is zero (that is, that the x variable is unimportant for predicting y). For the xi variable, the Wald statistic for this null hypothesis is defined as:

W = (maximum likelihood estimate of bi) / (standard error of the bi estimate)   (16)

W^2 has a chi-square distribution, and programs such as LOGIST print a table of these with P values, one for every x variable. The Wald statistics, or more likely their P values, are probably the ones most commonly seen published next to x variables in the results of pathology papers using the logistic model.

Although in general there seems to be no single ideal way to select x variables for model building, there is a tendency now to move away from computer driven stepwise techniques.1,3 Many prefer instead a more purposeful selection. We begin with the one, or ones, we know from prior studies to be important. As each x variable is added, we examine the overall model LR as well as the Wald statistics to see if there has been significant improvement.
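Equation 16 and its conversion to a chi-square P value can be sketched as follows; the one-degree-of-freedom tail probability is computed from the standard normal via math.erfc, and the coefficient and standard error in the last example are hypothetical:

```python
import math

def wald_chi_square(b_hat, se_b):
    """Equation 16: W = b / SE(b); W**2 is chi-square with 1 df."""
    w = b_hat / se_b
    return w * w

def chi_square_1df_p(chi2):
    """P value for a 1-df chi-square: P(|Z| > sqrt(chi2)), Z standard normal."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# The familiar 5% critical value of a 1-df chi-square is about 3.84.
print(round(chi_square_1df_p(3.841), 3))

# The interaction chi-square of 12.03 reported below reproduces P = .0005:
print(round(chi_square_1df_p(12.03), 4))    # -> 0.0005

# A hypothetical coefficient b = 0.8 with an assumed standard error of 0.25:
chi2 = wald_chi_square(0.8, 0.25)
print(round(chi2, 2), chi_square_1df_p(chi2))
```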
We can add newer x variables in an order of preference based on experience, intuition, a motivation to test a particular variable with others, or other factors such as the availability and cost of the x variables. We can examine whole subsets of related variables, such as multiple measures of proliferation, to see which have the largest Wald statistics and then work more carefully with them. Or we can resort to the computer's selection to get a preliminary idea about the x variables. The forward stepwise approach has been popular in pathology. It selects the x variables one at a time based on their effect on the likelihood L of equation 6. At each step the x variable producing the greatest increase in L is chosen to enter the model, and ones already in are retested to see if they remain significant. As an alternative to this forward stepwise approach, we can put all the x variables into the model and then proceed stepwise to eliminate ones that are not significant. This is called backward stepwise analysis. With either automatic stepwise selection, beware of x variables that become significant only after many others have entered the model, especially if their relationship with the y outcome reverses during the steps. This may be due to overfitting and can yield a model that does not validate well with new data. After all, if a perfect fit is what we are after, we can achieve it simply by adding one x variable for every patient. This is called the full or saturated model, but it provides no useful prediction for new datasets. Having satisfied ourselves that our choice of x variables and model building are sufficient, we can test for ones left out by adding them and seeing if they improve the overall model chi-square.
For example, if we have selected x variables x1 through xk and derived a transformation tr(x1...xk) of these that produced an adequate model, we can compare the LR of this model with one that also includes the unused variables x(k+1) through xn, as follows:

LR = LR(a, tr(x1...xk), x(k+1)...xn) - LR(a, tr(x1...xk))   (17)

This statistic should have a chi-square distribution with n - k degrees of freedom. If this LR is small, we can reasonably omit the remaining variables x(k+1) through xn. Nevertheless, we should also remember that with a large list of potential x variables it is unlikely that there is a single optimal model; instead, there are usually several models of close or equal performance involving different subsets of the x variables.2

INTERACTION BETWEEN X VARIABLES

If we have two x variables, x1 and x2, then a logistic model with interaction includes a third term, which is the product x1*x2:

logit(p) = a + b1*x1 + b2*x2 + b3*x1*x2   (18)

This model then allows for the possibility that the effect of x1 on outcome y is different for different levels of x2. For example, if the disease were breast cancer and y were some arbitrary positive outcome such as tumor recurrence, then x1 might be ER/PR status and x2 sex. If men and women differ in the way their outcome depends on ER/PR, then the interaction term b3*x1*x2 models this difference. To test for a significant difference in the way men and women's ER/PR status affects outcome, we compare the model chi-squares with and without the interaction coefficient, b3:

LR = -2*ln L(a,b1,b2) - (-2*ln L(a,b1,b2,b3))   (19)

or we examine the Wald statistic (and its P value) for the coefficient b3. If either test shows a low chi-square and a high P value, then we conclude the ER/PR effect is the same for men and women.

SCREENING FOR PROSTATE CANCER: PSA AND AGE

To illustrate the importance of interaction between x variables, we continue the example of screening for prostate cancer, but now with two variables: PSA and patient age. Because only one of the three datasets8 published sufficient information about age to do the analysis, the total size of the data drops to 6,630. This makes the model chi-squares smaller. Furthermore, even though age was given in the three broad categories of 50-60, >60-70, and >70, stratifying this smaller dataset into PSA levels as well as three age levels produced several age-PSA cells with so few patients that we had to combine them to get numbers exceeding 10 patients. Whenever we did this, we took the group's final PSA midpoint as representative for the combined category. These combinations did not appear to alter the shape of the p(Ca | PSA) versus PSA plot.

We began this analysis with the model of equation 14. The LOGIST program on the smaller dataset now yielded a model chi-square of 712, but the plot (not shown) of predicted p(Ca | PSA) once again matched the observed values closely over the entire range of PSA. Next we added age using a graded factor age1 defined as:

age1 = 0 if age = 50 to 59;
age1 = 1 if age = 60 to 69;
age1 = 2 if age > 69.

The LOGIST program this time yielded a model chi-square of 720. The difference between these two model chi-squares is 720 - 712 = 8, implying a P value of 0.005, and the Wald chi-square statistic for age1 was significant at a P value of 0.0063. Thus, there was a small but significant age effect on the probability of obtaining a positive biopsy. This was in addition to the effect of PSA. Because the coefficient for age1 was positive, the resulting model implied that as age increases the probability of positive biopsy increases, even after accounting for the PSA effect. This fits our prior understanding about the diagnosis of prostate cancer.

Next, we added an interaction term, thinking that the association between a positive biopsy and PSA might differ for different ages. For example, this could hold if age specific thresholds for PSA were important for predicting a positive biopsy. Because in the last model the most important x variable was ln(PSA), we looked for an interaction between age1 and this term by forming the variable age1*ln(PSA). This time LOGIST produced a model chi-square of 732, an improvement of 12 (P < .001) over the previous model, and the Wald chi-square statistic for the age1*ln(PSA) interaction term was 12.03 (P = .0005). The final model for p(Ca | PSA,age), incorporating both PSA and age effects, is given by:

p(Ca | PSA,age) = 1 / (1 + e^-tr(PSA,age))   (20)

where tr(PSA,age) symbolizes:

tr(PSA,age) = -7.6 + 0.854*age1 + 5.0*ln(PSA) - 0.739*ln^2(PSA) - 0.442*age1*ln(PSA)   (21)

FIG. 7. Plot of the predicted p(Ca) versus PSA in ng/mL for the three age ranges 50-59, 60-69, and >69, indicated respectively by the numbers 1, 2, and 3 on the plot.

TABLE 2. SUMMARY OF MODEL DEVELOPMENT FOR PROSTATE SPECIFIC ANTIGEN SCREENING FOR PROSTATE CANCER

Model No.   Y    n       Npos   X Variables                               LR    Figures
1           Ca   8,863   448    PSA                                       757   2, 3
2           Ca   8,863   448    PSA4, PSA10                               900   4
3           Ca   8,863   448    ln(PSA), ln^2(PSA)                        941   5, 6
4           Ca   6,630   332    ln(PSA), ln^2(PSA)                        712
5           Ca   6,630   332    ln(PSA), ln^2(PSA), age1                  720
6           Ca   6,630   332    ln(PSA), ln^2(PSA), age1, age1*ln(PSA)    732   7, 8

Y = outcome parameter, ie, a biopsy positive for carcinoma; n = total number of patients; Npos = total number with positive biopsy; LR = likelihood ratio statistic.

Whereas the positive sign for the coefficient of age1 implies that in general older men have a higher probability of positive biopsy, the negative sign for the interaction term means that for older men the probability of positive biopsy is less than the PSA level alone predicts. We can see this more easily by looking at the plot of predicted p(Ca | PSA,age) versus PSA in Figure 7.
In the plot the categories of the agel factor are indicated by the numbers 1, 2 and 3 on the plot. For each PSA at lower levels the model predicts that older men have a higher probability of a positive biopsy, but the curve reverses at higher levels of PSA, where the model implies that older men have a lower probability of a positive biopsy than their PSA level alone predicts. Perhaps this is because older men have higher baselines. However, for most of the PSA levels, including the critical range of less than 10 ng/mL, the curves for the three age ranges are so close they nearly overlap. This relates well to Catalona and colleagues'7 conclusion that age specific PSA values do not add much to the diagnostic accuracy of a binary cut-off point in PSA. pie, Table 2 summarizes the models tried here for the PSA screening data. For the final or best model, then we should list the x variables, their coefficients, standard errors, Wald statistics, and P values. Table 3 summarizes these results for the final two models developed for the PSA screening. For a significant continuous x variable it is useful to see the observed and calculated p(y = 1) plotted overlayed and against the x as in Figure 5, and it is helpful to see either the Pearson or deviance residuals plotted the same way as in Figure 6. If the model is complex and includes a number of x variables, one can plot the observed p(y = 1) against the predicted p(y = 1). For example, Figure 8 shows the plot of the observed probability of positive prostate biopsy versus the predicted probability for the logistic model that included both age REPORT OF FITTED MODEL Publication of the results of logistic analysis should include at least a partial list of the models tried. For examTABLE 3. 
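The fitted model of equations 20 and 21 is easy to evaluate directly. The following sketch is my illustration, not part of the original analysis; it simply codes the published coefficients in Python:

```python
import math

def p_ca(psa, age1):
    """Predicted probability of a positive biopsy, p(Ca|PSA, age),
    from equations 20 and 21 (PSA in ng/mL; age1 = 0, 1, or 2)."""
    lp = math.log(psa)
    # Equation 21: the transform tr(PSA, age)
    tr = (-7.6 + 0.854 * age1 + 5.0 * lp
          - 0.739 * lp ** 2 - 0.442 * age1 * lp)
    # Equation 20: the logistic function of tr
    return 1.0 / (1.0 + math.exp(-tr))
```

For example, at a PSA of 4 ng/mL the model gives a probability of about 0.11 for a man in his 50s (age1 = 0) and a higher value for a man over 69 (age1 = 2). The net age effect is age1*(0.854 - 0.442*ln(PSA)), which changes sign near PSA = 7 ng/mL, reproducing the crossover of the curves seen in Figure 7.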
TABLE 3. SUMMARY OF MODEL RESULTS FOR FINAL MODELS

Model No.*  X Variable     Coefficient  SE      Wald  P Value
1           Intercept      -6.0745      0.2319  656   .0001
1           ln(PSA)         3.5651      0.3351  113   .0001
1           ln²(PSA)       -0.4477      0.1016   19   .0001
2           Intercept      -7.6035      0.3737  414   .0001
2           ln(PSA)         4.9973      0.4851  106   .0001
2           ln²(PSA)       -0.7394      0.1489   25   .0001
2           age1            0.8540      0.1968   19   .0001
2           age1*ln(PSA)   -0.4418      0.1274   12   .0005

PSA = prostate specific antigen.
* Model 1 used data from references 6-8. Model 2 used data from reference 7.

FIG. 8. Plot of the observed fractions against the calculated values of p(Ca) for the logistic model of equations 20 and 21 and using both PSA and patient age. Because the scatter of points follows a 45° line, the fit of the logistic model is reasonable.

TABLE 4. COMPARISON OF MODEL FITS WITH REDUCED DATA

                                                Data Set*
                                                1       2       3
No. of patients with biopsy positive
  for carcinoma                                 332     347     448
Total no. of patients                           6,630   6,647   8,863
X variable coefficients
  Intercept                                     -6.92   -6.49   -6.07
  ln(PSA)                                        4.82    4.05    3.57
  ln²(PSA)                                      -0.829  -0.568  -0.448

PSA = prostate specific antigen.
* Data set 1 used data from reference 7. Data set 2 used data from reference 7 plus those from reference 6 with PSA > 20 ng/mL. Data set 3 used data from references 6-8 collectively.

Although there is some scatter to the points in Figure 8, they appear clustered about a 45° line, and this indicates reasonable agreement. It is probable that some of the residual variation in the plot is due to important factors, such as the results of digital rectal examination, that were left out of the analysis.

DISCRIMINATION AND VALIDATION

A measure of a model's ability to discriminate between y = 0 and y = 1 outcomes is the concordance, c.3 To see how this works, imagine taking two patients from the dataset such that the first had a positive outcome (y = 1) and the second a negative outcome (y = 0).
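The P values in Table 3 come from comparing each Wald chi-square statistic to a chi-square distribution with 1 degree of freedom. As a quick check that needs no statistical software (my addition, valid for 1 degree of freedom only), the upper-tail probability can be written with the complementary error function:

```python
import math

def chi2_pvalue_1df(x):
    """Upper-tail P value of a chi-square statistic x with 1 degree
    of freedom, using the identity P = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Wald statistic 12.03 for the age1*ln(PSA) term in Table 3
p_interaction = chi2_pvalue_1df(12.03)
# Likelihood-ratio improvement of 8 for adding age1 (720 - 712)
p_age = chi2_pvalue_1df(8)
```

This reproduces the tabled value of about .0005 for the interaction term and the P value of about 0.005 quoted in the text for the age1 likelihood-ratio difference.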
If the first also has a higher calculated p(y= 11 x), then the pair is considered concordant. Otherwise it is discordant, and pairs with tied p(y = 11 x) are not used. If we repeat this process for all possible pairings of patients, one with y = 1 and the second with y = 0, then c is defined as the proportion of concordant pairs: c = no. of concordant pairs total possible pairings (22) A good model should give concordant pairs. Specifically, a model without any ability to discriminate outcome gives a c of 0.5; whereas, a model with perfect prediction gives a c of 1.0. The final models for PSA screening for prostate cancer with and without age gave c values of 0.82 and 0.84, respectively, on the original data, which are reasonably good concordance results. Probably, the best way to validate a developed model is to apply it to a new dataset. Some researchers divide the initial data into two parts: one for developing the model and one for testing the model. Regardless of whether we split the initial data or gather new data, we designate the first as the training set and the second as the test set. Using the fitted model from the training set, we can then calculate p(y = 11 x) for the test patients and compare this calculated estimate to whether or not they had a positive outcome. Hosmer and Lemeshow9 recommend using the same summary statistics one uses for test 0.4 s o f 0.3 o * o 8 o 10 20 PSA NG/ML PSA NG/ML FIG. 9. Plot of the observed fractions ("o") and calculated values ("#") of p(Ca|PSA) for the training data from reference 7 and using the logistic model of Table 4, column 1. The test data for this model appear in the next figure. FIG. 10. Plot of the observed fractions ("o") for the test data from references 6 and 8. The calculated values of p(Ca | PSA) ("#") come from the model developed from just the data of reference 7. 
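The concordance of equation 22 is simple enough to compute by brute force over all outcome-discordant patient pairs. The sketch below is my illustration with a toy dataset (not the PSA series), dropping tied pairs as described in the text:

```python
def concordance(y, p):
    """Concordance c (equation 22): the fraction of (y=1, y=0) patient
    pairs in which the y=1 patient has the higher predicted p(y=1|x).
    Pairs with tied predictions are not used."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = sum(1 for a in pos for b in neg if a > b)
    discordant = sum(1 for a in pos for b in neg if a < b)
    return concordant / (concordant + discordant)

# Toy example: four patients, two with positive biopsies
c = concordance([1, 1, 0, 0], [0.8, 0.3, 0.4, 0.1])
```

Here c = 0.75: of the four usable pairs, three are concordant and one is discordant. As stated above, 0.5 means no discrimination and 1.0 perfect discrimination.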
Thus, to study how well the model validates, we can examine the Pearson and deviance residuals on the test data and plot the results to see how well the model works on the test data. Several authors5,14 also recommend studying validation by reusing the logistic regression model on the test data, only this time with a new x variable defined as logit(p(y = 1|x)). Let us take the simple example with just one x variable. First, we fit the logistic model of equation 2 to the training data. Then we calculate p(y = 1|x) and logit(p) for the patients from the test data. Next we do a second logistic analysis, this time on the test data, and we use the calculated logit(p) as the new x variable. If the second logistic analysis yields maximum likelihood estimates of the coefficients such that a = 0 and b = 1, then the validation is perfect. Of course, because a usually will not equal exactly 0 and b will not equal exactly 1, we must perform statistical tests of this null hypothesis (that a = 0 and b = 1), and if the resulting P values are large we may conclude that the test data validate the model.

An alternative to splitting the data is to use the bootstrap technique.5,15 Here, we randomly resample the original data with replacement to get a different collection of patients. This new sample, also of size n, is called the bootstrap sample, and even though it contains patients found in the original sample, the overall mix is different. Some of the original patients may not appear in the bootstrap sample, and some may appear more than once because of the replacement. Because there are a variety of ways the bootstrapping technique can be used to validate the model, its coefficients, its predictive ability, or even the model-building process, I refer the reader to references 5 and 15 for further details.

Because the ability to predict outcome for new patients seems of great importance, it is surprising that more studies are not devoted to validating previously published statistical models.
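The bootstrap resampling step can be sketched in a few lines. The patient records below are hypothetical stand-ins of my own; in real use one would refit the logistic model to each bootstrap sample (see references 5 and 15):

```python
import random

def bootstrap_sample(patients, rng):
    """Draw one bootstrap sample: n patients sampled with replacement
    from the original n, so some patients may appear more than once
    and some not at all."""
    return [rng.choice(patients) for _ in range(len(patients))]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
original = [{"psa": 4.0, "ca": 0}, {"psa": 9.5, "ca": 1}, {"psa": 1.2, "ca": 0}]
boot = bootstrap_sample(original, rng)
```

Repeating this many times and refitting the model each time yields a distribution of coefficients (or of concordance values) from which bias and optimism can be estimated.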
Certainly, the data must exist, especially since many major medical centers collect large numbers of patients with diseases such as breast and prostate cancers. To illustrate some aspects of validation, let us use the PSA screening data once again. Now pretend that for the training step we have just the data from reference 7. Performing the logistic regression analysis on this limited data using the transform of PSA in equation 13 gives us the results in column 1 of Table 4. Note that the coefficients differ from those of equation 14, which resulted from using all the data (repeated in column 3 of Table 4). Figure 9 shows that this model fit the smaller training dataset well, and Figure 10 shows that it also fit the test data well up to a PSA level of 13 ng/mL. Beyond this there was just a single group of patients at a PSA level of 35 ng/mL, and this point was fit poorly. We can see then that if we want the model from the training set to validate, we must ensure that the training data cover the full range of the x variables. Next, we fixed this problem by moving the patients with these higher PSA values from the test data to the training data and repeated the analysis. This produced a better model (column 2 of Table 4), and it fit the test data better (not shown). Note that the coefficients of this model come closer to those of the model in column 3, which used all the data. Thus we can see that one of the costs of working with limited data is that we may develop a model that is not as good as if we had used all the data. Furthermore, in this example I have cheated a little, because I began with equation 13, which had come from all the data. In the real training-test situation, one should develop both the model form as well as the coefficients from just the training data, and in this circumstance the resulting validation could have been worse.
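The failure of the limited training model beyond the range of its data can be checked numerically from Table 4. This sketch is my illustration: it evaluates the column 1 model (reference 7 data only) and the column 3 model (all three datasets) at a PSA within and beyond the training range, assuming the ln(PSA), ln²(PSA) transform:

```python
import math

def p_ca_psa(coeffs, psa):
    """p(Ca|PSA) from an (intercept, b_ln, b_ln2) coefficient triple,
    using the ln(PSA), ln^2(PSA) transform."""
    a, b1, b2 = coeffs
    lp = math.log(psa)
    return 1.0 / (1.0 + math.exp(-(a + b1 * lp + b2 * lp ** 2)))

train_only = (-6.92, 4.82, -0.829)  # Table 4, column 1 (reference 7 data)
all_data = (-6.07, 3.57, -0.448)    # Table 4, column 3 (all datasets)

# Within the range of the training data the two models nearly agree...
gap_low = abs(p_ca_psa(train_only, 4.0) - p_ca_psa(all_data, 4.0))
# ...but at PSA = 35 ng/mL, beyond the training data, they diverge badly.
gap_high = abs(p_ca_psa(train_only, 35.0) - p_ca_psa(all_data, 35.0))
```

The gap in predicted probability is about 0.02 at 4 ng/mL but nearly 0.3 at 35 ng/mL, which is the numerical face of the poorly fit point in Figure 10.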
DISCUSSION

Some may be tempted to believe that models such as the logistic model allow us to predict an exact outcome for a new, specific patient, but this is seldom if ever true. What the logistic model provides us instead is the probability of positive outcome, p(y = 1), not the outcome itself, and it does this not for a single patient but for a group of similar patients. This is what is implied by the difference between equations 1 and 2. Although what we are most interested in is an equation like 1 that gives the outcome y as its output, with the logistic model we never achieve perfect prediction for a single patient but instead an average result for a group. There remains uncertainty about the outcome for a single patient, especially because there are almost always important and unknown factors operating outside the model. In the face of this uncertainty some may abandon important prognosticators,16 but in my opinion this is really nothing more than what we should expect from modeling complex biologic systems.

The fact that the output of the logistic model is a probability raises another important issue, that of continuous versus discrete phenomena. Often we deal with binary outcomes, such as the patient does or does not have cancer. However, the output of the logistic model is a probability, which is a continuous phenomenon. It goes continuously from 0 to 1, and all cut-off points in probability are arbitrary. Because many of the x variables we deal with are also continuous, we must recognize that binary decisions are to be made from inputs that are continuous. Attempts to ease the decision by dividing one or more x variables such as PSA level into arbitrary "high" and "low" levels should not mislead us into thinking the underlying biologic process is binary.
The logistic model is helpful in this regard because it takes both continuous and discrete inputs and summarizes the problem with a single continuous quantity, the probability of positive outcome, or p(y = 1). Instead of basing binary decisions on cut-off points in the x variables, we can base the decision on our estimate of p(y = 1). Perhaps the patient can at this point help in the decision of what to do next, so that some might opt for biopsy if their probability of cancer were 0.20 (1 in 5), whereas others might prefer a threshold for p(Ca) of 0.05 (1 in 20). In this way, I believe the logistic model can be especially useful.

REFERENCES

1. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley and Sons, 1989.
2. McCullagh P, Nelder JA. Generalized Linear Models, ed 2. London: Chapman and Hall, 1989.
3. Harrell FE Jr, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic prediction: Advantages, problems, and suggested solutions. Cancer Treat Rep 1985;69:1071-1077.
4. The LOGIST procedure. SAS/STAT User's Guide, Version 6, ed 4. Cary, NC: SAS Institute, 1990.
5. Harrell FE Jr. Predicting Outcomes: Applied Survival Analysis and Logistic Regression. Durham, NC: Duke University Medical Center, 1994.
6. Labrie F, Dupont A, Subura R, Cusan L, et al. Serum prostate specific antigen as pre-screening test for prostate cancer. J Urol 1992;147:846-852.
7. Catalona WJ, Hudson M, Scardino PT, et al. Selection of optimal prostate specific antigen cut-offs for early detection of prostate cancer: Receiver operating characteristic curves. J Urol 1994;152:2037-2042.
8. Brawer MK, Chetner MP, Beatie J, Buchner DM, Vessella RL, Lange PH. Screening for prostatic carcinoma with prostate specific antigen. J Urol 1992;147:841-845.
9. Littrup PJ, Lee F, Mettlin C. Prostate cancer screening: Current trends and future implications. CA Cancer J Clin 1992;42:198-210.
10. Brawer MK, Beatie J, Wener MH, et al.
Screening for prostatic carcinoma with prostate specific antigen: Results of the second year. J Urol 1993;150:106-109.
11. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-298.
12. Harrell FE Jr, Lee KL, Pollock BG. Regression models in clinical studies: Determining relationships between predictors and response. J Natl Cancer Inst 1988;80:1198-1202.
13. Durrleman S, Simon R. Flexible regression models with cubic splines. Stat Med 1989;8:551-561.
14. Miller ME, Hui SL, Tierney WM. Validation techniques for logistic regression models. Stat Med 1991;10:1213-1226.
15. Efron B, Tibshirani R. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.
16. Green MS, Ackerman AB. Thickness is not an accurate gauge of prognosis of primary cutaneous melanoma. Am J Dermatopathol 1993;15:461-473.