Marginal Probabilities:
an Intuitive Alternative to Logistic Regression Coefficients
P. M. Wright
The Johns Hopkins University
Abstract
This paper presents an algorithm for using output from PROC LOGISTIC to compute marginal probabilities. Logistic regression coefficients are difficult to interpret, because they reflect the effect that a change in an independent variable would have on the term ln(P/(1-P)), where P is the estimated probability that the dependent variable is one. Marginal probabilities, on the other hand, reflect the effect that a change in an independent variable would have directly on P. Because marginal probabilities are more intuitively interpretable than logistic regression coefficients, they can be a more useful statistic. PROC LOGISTIC does not provide marginal probabilities directly, but it does provide everything needed to compute them.
Introduction
Statistical software packages such as SAS® have made producing sophisticated statistical results as easy as π. This has been a tremendous boon to research. One can learn a few SAS procedures and produce quite complex statistical results almost blindly. Unfortunately, not all statistical procedures produce results that are easy to interpret. For example, many researchers use logistic regression, but are unclear as to how to interpret the results. They know how to use PROC LOGISTIC to produce logistic regression coefficients, but nevertheless are not sure what these coefficients mean.
The purpose of this paper is to present an algorithm for using logistic regression coefficients from PROC LOGISTIC to calculate marginal probabilities, a statistic that is easier to interpret than logistic regression coefficients. This paper will also describe what marginal probabilities are and why they are easier to interpret than logistic regression coefficients.

Logistic Regression Coefficients

Logistic regression is a procedure that has come to be accepted in most social science disciplines as the correct methodology when one wishes to perform regression analysis with dichotomous dependent variables. In contrast to ordinary regression, which models the data as a line, logistic regression models the data as an S-shaped curve.
Specifically, it uses a linear combination of the independent variables to calculate a value L for each observation, then it plots P as a function of L according to this equation:

(1)    ln(P/(1-P)) = L

where P = the probability that the dependent variable is 1, and L = a + bX, a linear combination of the independent variables. In short, logistic regression attempts to find a linear function L for which L is positive when the dependent variable is 1 and L is negative when the dependent variable is 0.

The reason that logistic regression coefficients are difficult to interpret is that the reported coefficients are the b's in this equation:

(2)    ln(P/(1-P)) = a + bX.

In other words, logistic regression coefficients tell you that a one-unit change in an independent variable will result in a b-unit change in the natural logarithm of the ratio of the probability that the dependent variable is one to one minus the probability that the dependent variable is one. Go ahead, read it a few more times. The difficulty understanding the meaning of these coefficients is partly due to the fact that it is so cumbersome to express in English. In order to overcome the linguistic tangle of this explanation, the left-hand side of equation (2) has been given the name "log odds". The explanation can now be stated like this: "a one-unit change in an independent variable results in a b-unit change in the log odds of the dependent variable". While the explanation has become easier to say, it is not any easier to understand. This is due to the fact that natural logarithms are not a very intuitive concept, and taking the natural logarithm of the ratio (P/(1-P)) makes matters even worse.
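To put purely illustrative numbers on the difficulty (these are not from any fitted model): suppose b = 0.5 and a case currently has P = 0.5, so its log odds are ln(0.5/0.5) = 0. A one-unit increase in X moves the log odds to 0.5, which corresponds to P = 1/(1 + exp(-0.5)), or about 0.62, a gain of roughly 0.12 in probability. For a case whose log odds are already 3 (P of about 0.95), the same coefficient moves P only to about 0.97. The constant b therefore does not translate into any constant change in P, which is what makes it so hard to interpret.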
accepted in most social science disciplines as the correct
methodology when one wishes to perform regression
analysis with dichotomous dependent variables. In contrast
to ordinary regression which models the data to a line,
logistic regrcssion models the data as an S-shaped curve.
Marginal probability is a statistic that reflects what effect a one-unit change in an independent variable will have directly on the probability that the dependent variable is one. The key to making sense of logistic regression, therefore, is to solve equation (2) for P, since P is simply interpreted as the estimated probability that the dependent variable is one. This concept is straightforward. When this is done, we have this equation instead:

(3)    P = 1/(1 + exp(-(a + bX)))
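For readers who want the intermediate algebra (a brief expansion of the step the text takes in one move), exponentiate both sides of equation (2) and solve for P:

    P/(1-P) = exp(a + bX)
    P = (1 - P) exp(a + bX)
    P (1 + exp(a + bX)) = exp(a + bX)
    P = exp(a + bX)/(1 + exp(a + bX)) = 1/(1 + exp(-(a + bX)))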
While this equation seems no easier to understand than the
original one, it is more useful because the left-hand side is
simple to interpret. Even if one is not clear about what is
happening on the right-hand side of equation (3), it gives
an answer -- P -- which can be interpreted directly.
To compute the marginal probability associated with an independent variable, first compute P for each observation, using equation (3). Then, change the value of the independent variable by one unit and recompute P with this new value. The marginal probability with respect to the varied independent variable is then simply the difference between the two P's. This marginal probability is computed for each case, and the mean of this statistic across all cases is then reported. In other words, for each case begin by computing the expected probability that the dependent variable is one. Then change an independent variable by one, and recompute the expected probability to see the effect of that change on the probability. PROC LOGISTIC does not provide this statistic directly, but it does provide everything one needs to compute it.
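As a quick illustration with hypothetical numbers (not from a fitted model): for a case with L = 0, equation (3) gives P = 0.50. If the coefficient on the varied variable is 0.5, recomputing with that variable increased by one gives P = 1/(1 + exp(-0.5)), or about 0.62, so the marginal probability for this case is about 0.12. The reported statistic is the mean of these case-level differences across all cases.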
Using SAS and PROC LOGISTIC to Compute Marginal Probabilities

The appendix provides a template for programming the algorithm in SAS. Using the standard SAS convention, comments are enclosed between /* and */. In the sample program that I have provided I used the name SASDAT for my SAS data set; DEPVAR for my dependent variable; and VAR1 and VAR2 for my independent variables. The reader should replace these with the actual names from the data set they will be analyzing. Similarly, I have created some variables using a prefix for VAR1 and VAR2 (e.g., CFVAR1 and CFVAR2). These names are arbitrary, and the reader should provide any preferred unique name for these variables. I also use the arbitrary names ONE, TWO, etc. for the data sets created for this procedure. The reader should feel free to use these names, or provide others arbitrarily. The algorithm computes a marginal probability for each case. Most readers will want to produce an aggregate measure of marginal probability. I suggest calculating and reporting the mean as the most easily understood. The reader can use the marginal probabilities, however, to compute any descriptive statistic that might be useful and insightful.

Appendix

/* This program begins with a simple logistic regression on a SAS data set called SASDAT with a dependent variable called DEPVAR, and two independent variables called VAR1 and VAR2. */

/* The option OUTEST is used to save the logistic regression coefficients to a SAS file. By default the coefficients are given the same name as the variables; they need to be renamed so they may be merged back into the output data set. */

/* The OUTPUT statement is used to add XBETA to the original SAS data set. [Note: XBETA is the equivalent of L in equation (1).] */

PROC LOGISTIC DATA=SASDAT
     OUTEST=ONE(RENAME=(VAR1=CFVAR1 VAR2=CFVAR2));
   MODEL DEPVAR = VAR1 VAR2;
   OUTPUT OUT=TWO XBETA=XB;
RUN;

/* The logistic regression coefficients are merged back into the OUTPUT data set, which contains the original data, as well as XBETA. */

/* Note that an unusual SET statement is used for the "merge" because data set ONE has only one observation, and the values need to be repeated for each observation in data set TWO. */

DATA THREE;
   IF _N_ = 1 THEN SET ONE;
   SET TWO;

/* Data set THREE now contains everything needed for computing marginal probabilities. */

/* For each case compute PPP, which is the probability that the dependent variable is one. [Note: This is equation (3) above.] */

   PPP = 1 / (1 + EXP(-XB));

/* Compute PVAR# for each case when the value of the variable VAR# has been changed by one. (PVAR# is the probability that the dependent variable is one, when the value of VAR# has been changed by one, and all the other variables retain their original value.) */

/* Note that this requires two separate cases -- one for unbounded data and one for bounded data. */

/* Note that adding (or subtracting) one to an independent variable manifests itself in equation (3) as simply adding (or subtracting) the value of its coefficient from XBETA. */

/* UNBOUNDED VARIABLES */

/* In the case of unbounded data just add one to the independent variable; compute PVAR# and then subtract PPP from PVAR#. */

/* Assume VAR1 is unbounded: */

   PVAR1 = 1 / (1 + EXP(-(XB + CFVAR1)));
   MPVAR1 = PVAR1 - PPP;

/* BOUNDED VARIABLES */

/* In the case of bounded data, one cannot be added to the independent variable if it is already within one of its upper bound, so first check to see if it is at this bound or not. */

/* Assume VAR2 is bounded and "mxv2" is its upper bound. */

/* If it is not within one of its upper bound then treat it just as an unbounded variable; i.e., add one to the independent variable, compute PVAR# and then subtract PPP from PVAR#. */

   IF (mxv2 - VAR2) >= 1 THEN DO;
      PVAR2 = 1 / (1 + EXP(-(XB + CFVAR2)));
      MPVAR2 = PVAR2 - PPP;
   END;

/* If it is within one of its upper bound, then SUBTRACT one from the independent variable, compute PVAR# and then subtract PVAR# from PPP. */

   ELSE IF (mxv2 - VAR2) < 1 THEN DO;
      PVAR2 = 1 / (1 + EXP(-(XB - CFVAR2)));
      MPVAR2 = PPP - PVAR2;
   END;
RUN;

/* This gets repeated for each variable in the model, which will produce a marginal probability for each variable, for each case. */

/* You can now generate many descriptive statistics on these marginal probabilities. In most instances the mean is computed and reported. */

PROC MEANS;
   VAR MPVAR1 MPVAR2;
RUN;
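As a small variation on this template (a sketch in the same spirit, not part of the original program): the OUTPUT statement of PROC LOGISTIC can also return the predicted probability directly through its P= option, which saves computing PPP by hand; XBETA is still needed for the recomputation step. The placeholder names are the same ones used above. Depending on the SAS release and the coding of DEPVAR, the DESCENDING option may be needed so that the model predicts the probability that DEPVAR is 1 rather than 0.

PROC LOGISTIC DATA=SASDAT DESCENDING
     OUTEST=ONE(RENAME=(VAR1=CFVAR1 VAR2=CFVAR2));
   MODEL DEPVAR = VAR1 VAR2;
   /* P=PPP writes each case's predicted probability;   */
   /* XBETA=XB is kept for the shift-by-one-coefficient */
   /* recomputation.                                    */
   OUTPUT OUT=TWO P=PPP XBETA=XB;
RUN;

DATA THREE;
   IF _N_ = 1 THEN SET ONE;
   SET TWO;
   /* Unbounded VAR1: add its coefficient to XB, recompute */
   /* the probability, and difference.                     */
   PVAR1  = 1 / (1 + EXP(-(XB + CFVAR1)));
   MPVAR1 = PVAR1 - PPP;
RUN;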
References

Hanley, Raymond J., and Joshua M. Wiener (1991), "Use of Paid Home Care by the Chronically Disabled Elderly," Research on Aging 13(3): 310-332.

Newman, Sandra, Raymond Struyk, Paul Wright, and Michelle Rice (1988), "Overwhelming Odds: Caregiving and the Risk of Institutionalization," Urban Institute Report 3691-01. Washington, DC: The Urban Institute.

SAS Institute Inc. (1990), SAS/STAT® User's Guide, Version 6, Fourth Edition, Volume 2. Cary, NC: SAS Institute Inc.

SAS and SAS/STAT are registered trademarks of SAS Institute Inc., Cary, NC, USA.