Marginal Probabilities: An Intuitive Alternative to Logistic Regression Coefficients

P. M. Wright
The Johns Hopkins University

Abstract

This paper presents an algorithm for using output from PROC LOGISTIC to compute marginal probabilities. Logistic regression coefficients are difficult to interpret, because they reflect the effect that a change in an independent variable would have on the term ln(P/(1-P)), where P is the estimated probability that the dependent variable is one. Marginal probabilities, on the other hand, reflect the effect that a change in an independent variable would have directly on P. Because marginal probabilities are more intuitively interpretable than logistic regression coefficients, they can be a more useful statistic. PROC LOGISTIC does not provide marginal probabilities directly, but it does provide everything needed to compute them.

Introduction

Statistical software packages such as SAS® have made producing sophisticated statistical results as easy as π. This has been a tremendous boon to research. One can learn a few SAS procedures and produce quite complex statistical results almost blindly. Unfortunately, not all statistical procedures produce results that are easy to interpret. For example, many researchers use logistic regression, but are unclear as to how to interpret the results. They know how to use PROC LOGISTIC to produce logistic regression coefficients, but nevertheless are not sure what these coefficients mean.

The purpose of this paper is to present an algorithm for using logistic regression coefficients from PROC LOGISTIC to calculate marginal probabilities, a statistic that is easier to interpret than logistic regression coefficients. This paper will also describe what marginal probabilities are and why they are easier to interpret than logistic regression coefficients.

Logistic Regression Coefficients

Logistic regression is a procedure that has come to be accepted in most social science disciplines as the correct methodology when one wishes to perform regression analysis with dichotomous dependent variables. In contrast to ordinary regression, which models the data as a line, logistic regression models the data as an S-shaped curve. Specifically, it uses a linear combination of the independent variables to calculate a value L for each observation, then it plots P as a function of L according to this equation:

    (1)   ln(P/(1-P)) = L

where P = the probability that the dependent variable is 1, and L = a + bX, a linear combination of the independent variables. In short, logistic regression attempts to find a linear function L for which L is positive when the dependent variable is 1 and L is negative when the dependent variable is 0.

The reason that logistic regression coefficients are difficult to interpret is that the reported coefficients are the b's in this equation:

    (2)   ln(P/(1-P)) = a + bX

In other words, logistic regression coefficients tell you that a one-unit change in an independent variable will result in a b-unit change in the natural logarithm of the ratio of the probability that the dependent variable is one to one minus the probability that the dependent variable is one. Go ahead, read it a few more times.

The difficulty in understanding the meaning of these coefficients is partly due to the fact that it is so cumbersome to express in English. In order to overcome the linguistic tangle of this explanation, the left-hand side of equation (2) has been given the name "log odds". The explanation can now be stated like this: "a one-unit change in an independent variable results in a b-unit change in the log odds of the dependent variable". While the explanation has become easier to say, it is not any easier to understand. This is due to the fact that natural logarithms are not a very intuitive concept, and taking the natural logarithm of the ratio P/(1-P) makes matters even worse. For this reason many papers in the social sciences that use logistic regression report another statistic, called marginal probability [e.g., Newman et al. (1988) and Hanley and Wiener (1991)].

Marginal probability is a statistic that reflects what effect a one-unit change in an independent variable will have directly on the probability that the dependent variable is one. This concept is straightforward. The key to making sense of logistic regression, therefore, is to solve equation (2) for P, since P is simply interpreted as the estimated probability that the dependent variable is one. When this is done, we have this equation instead:

    (3)   P = 1/(1 + exp(-(a + bX)))

While this equation seems no easier to understand than the original one, it is more useful because the left-hand side is simple to interpret. Even if one is not clear about what is happening on the right-hand side of equation (3), it gives an answer -- P -- which can be interpreted directly.

To compute the marginal probability associated with an independent variable, first compute P for each observation, using equation (3). Then, change the value of that independent variable by one unit and recompute P with this new value. The marginal probability with respect to the varied independent variable is then simply the difference between the two P's. This marginal probability is computed for each case, and the mean of this statistic across all cases is then reported. In other words, for each case begin by computing the expected probability that the dependent variable is one. Then change an independent variable by one, and recompute the expected probability to see the effect of that change on the dependent variable. PROC LOGISTIC does not provide this statistic directly, but it does provide everything one needs to compute it.

Using SAS and PROC LOGISTIC to Compute Marginal Probabilities

The appendix provides a template for programming the algorithm in SAS. Using the standard SAS convention, comments are enclosed between /* and */. In the sample program that I have provided I used the name SASDAT for my SAS data set; DEPVAR for my dependent variable; and VAR1 and VAR2 for my independent variables. The reader should replace these with the actual names from the data set they will be analyzing. Similarly, I have created some variables using a prefix for VAR1 and VAR2 (e.g. CFVAR1 and CFVAR2). These names are arbitrary, and the reader should provide any preferred unique name for these variables. I also use the arbitrary names ONE, TWO, etc. for the data sets created for this procedure. The reader should feel free to use these names, or provide others arbitrarily.

The algorithm computes a marginal probability for each case. Most readers will want to produce an aggregate measure of marginal probability. I suggest calculating and reporting the mean as the most easily understood. The reader can use the marginal probabilities, however, to compute any descriptive statistic that might be useful and insightful.

References

Hanley, Raymond J. and Joshua M. Wiener (1991), "Use of Paid Home Care by the Chronically Disabled Elderly." Research on Aging 13(3):310-332.

Newman, Sandra, Raymond Struyk, Paul Wright, and Michelle Rice (1988), "Overwhelming Odds: Caregiving and the Risk of Institutionalization." Urban Institute Report 3691-01. Washington, DC: The Urban Institute.

SAS Institute Inc. (1990), SAS/STAT® User's Guide, Version 6, Fourth Edition, Volume 2, Cary, NC: SAS Institute Inc.

SAS and SAS/STAT are registered trademarks of SAS Institute Inc., Cary, NC, USA.

Appendix

/* This program begins with a simple logistic regression on a SAS
   data set called SASDAT with a dependent variable called DEPVAR,
   and two independent variables called VAR1 and VAR2. */

/* The option OUTEST is used to save the logistic regression
   coefficients to a SAS file. By default the coefficients are given
   the same name as the variables; they need to be renamed so they
   may be merged back into the output data set. */

/* The OUTPUT statement is used to add XBETA to the original SAS
   data set. [Note: XBETA is the equivalent of L in equation (1).] */

PROC LOGISTIC DATA=SASDAT
              OUTEST=ONE(RENAME=(VAR1=CFVAR1 VAR2=CFVAR2)) ;
   MODEL DEPVAR = VAR1 VAR2 ;
   OUTPUT OUT=TWO XBETA=XB ;
RUN ;

/* The logistic regression coefficients are merged back into the
   OUTPUT data set, which contains the original data, as well as
   XBETA. */

/* Note that an unusual SET statement is used for the "merge"
   because data set ONE has only one observation, and the values
   need to be repeated for each observation in data set TWO. */

DATA THREE ;
   IF _N_ = 1 THEN SET ONE ;
   SET TWO ;

   /* Data set THREE now contains everything needed for computing
      marginal probabilities. */

   /* For each case compute PPP, which is the probability that the
      dependent variable is one. [Note: This is equation (3)
      above.] */

   PPP = 1 / (1 + EXP(-XB)) ;

   /* Compute PVAR# for each case when the value of the variable
      VAR# has been changed by one. [PVAR# is the probability that
      the dependent variable is one, when the value of VAR# has been
      changed by one, and all the other variables retain their
      original value.] */

   /* Note that this requires two separate cases -- one for
      unbounded data and one for bounded data. */

   /* Note that adding (or subtracting) one to an independent
      variable manifests itself in equation (3) as simply adding (or
      subtracting) the value of its coefficient from XBETA. */

   /* UNBOUNDED VARIABLES */

   /* In the case of unbounded data just add one to the independent
      variable; compute PVAR# and then subtract PPP from PVAR#. */

   /* Assume VAR1 is unbounded: */

   PVAR1 = 1 / (1 + EXP(-(XB+CFVAR1))) ;
   MPVAR1 = PVAR1 - PPP ;

   /* BOUNDED VARIABLES */

   /* In the case of bounded data, one cannot be added to the
      independent variable if it is already within one of its upper
      bound, so first check to see if it is at this bound or not. */

   /* Assume VAR2 is bounded and "mxv2" is its upper bound. */

   /* If it is not within one of its upper bound then treat it just
      as an unbounded variable; i.e. add one to the independent
      variable, compute PVAR# and then subtract PPP from PVAR#. */

   IF (mxv2 - VAR2) >= 1 THEN DO ;
      PVAR2 = 1 / (1 + EXP(-(XB+CFVAR2))) ;
      MPVAR2 = PVAR2 - PPP ;
   END ;

   /* If it is within one of its upper bound, then SUBTRACT one from
      the independent variable, compute PVAR# and then subtract
      PVAR# from PPP. */

   ELSE IF (mxv2 - VAR2) < 1 THEN DO ;
      PVAR2 = 1 / (1 + EXP(-(XB-CFVAR2))) ;
      MPVAR2 = PPP - PVAR2 ;
   END ;

   /* This gets repeated for each variable in the model, which will
      produce a marginal probability for each variable, for each
      case. */
RUN ;

/* You can now generate many descriptive statistics on these
   marginal probabilities. In most instances the mean is computed
   and reported. */

PROC MEANS ;
   VAR MPVAR1 MPVAR2 ;
RUN ;
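
The per-case arithmetic in the appendix can be cross-checked by hand or outside SAS. What follows is a minimal sketch in Python, not part of the original program: the coefficient values, the three cases, and the function names `logistic` and `marginal_probability` are all hypothetical and chosen only for illustration. It applies equation (3), shifts the linear predictor by one coefficient (adding for an unbounded variable, subtracting for a bounded variable that is within one unit of its upper bound), and averages the per-case differences.

```python
import math


def logistic(l):
    """Equation (3): P = 1 / (1 + exp(-L))."""
    return 1.0 / (1.0 + math.exp(-l))


def marginal_probability(xb, coef, value=None, upper_bound=None):
    """Per-case marginal probability for one independent variable.

    xb          -- linear predictor L = a + bX for the case (XB in the appendix)
    coef        -- coefficient of the variable being varied (CFVAR# in the appendix)
    value,
    upper_bound -- supplied only for a bounded variable
    """
    ppp = logistic(xb)
    bounded = value is not None and upper_bound is not None
    if bounded and (upper_bound - value) < 1:
        # Within one unit of the upper bound: subtract one unit instead,
        # and take PPP minus the recomputed probability.
        return ppp - logistic(xb - coef)
    # Otherwise add one unit, which shifts L by exactly one coefficient.
    return logistic(xb + coef) - ppp


# Hypothetical intercept and coefficients (assumptions, not fitted values).
a, b1, b2 = -1.0, 0.5, -0.8
# Hypothetical cases as (VAR1, VAR2); VAR2 is treated as bounded above by 1.
cases = [(2.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
mxv2 = 1.0

mp1, mp2 = [], []
for var1, var2 in cases:
    xb = a + b1 * var1 + b2 * var2
    mp1.append(marginal_probability(xb, b1))
    mp2.append(marginal_probability(xb, b2, value=var2, upper_bound=mxv2))

# Aggregate as the paper suggests: the mean across cases.
mean_mp1 = sum(mp1) / len(mp1)
mean_mp2 = sum(mp2) / len(mp2)
print("mean marginal probability, VAR1:", round(mean_mp1, 4))
print("mean marginal probability, VAR2:", round(mean_mp2, 4))
```

In this sketch the marginal probability keeps the sign of the coefficient in both branches, which is why the bounded branch subtracts the recomputed probability from PPP rather than the reverse.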