INFORMATION THEORY MAKES LOGISTIC REGRESSION
SPECIAL
Ernest S. Shtatland, PhD
Mary B. Barton, MD, MPP
Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA
ABSTRACT
This paper is a continuation of our previous presentations
at NESUG 1997 and SUGI 1998 ([1], [2]). Our first aim is
to provide a theoretical justification for using logistic
regression (vs. probit, gompit, or angular regression
models, for example). The consensus is that logistic
regression is a very powerful, convenient, and flexible
statistical tool; however, it is completely empirical.
Information theory can guide our interpretation of logistic
regression as a whole, and of its coefficients; through this
interpretation we will demonstrate how logistic regression
is special, and unlike the other regression models mentioned
above. A similar approach will be used to interpret Bayes'
formula in terms of information.
Our second goal is to propose a test of significance that in
the case of small samples is superior to the conventional
Chi-Square test. This is important because, in addition to the
unreliability of the Chi-Square, small sample sizes are typical
and unavoidable in many fields, including medical and health
services research. The proposed test can also be
interpreted in terms of information theory.
LOGISTIC REGRESSION AND INFORMATION
To model the relationship between a dichotomous
outcome variable (YES vs. NO, DEAD vs. ALIVE,
SELL vs. NOT SELL, etc.) and a set of explanatory
variables, we have a fairly wide "menu" of alternatives,
such as the logistic, probit, gompit, and angular regression
models. In [3] we can find a longer list of seven
alternatives. Only two of them, probit and logit, have
received significant attention. According to [3], p. 79,
even probit and logit (not to mention the other possible
nonlinear specifications) are arbitrary. See also [4], p.
388, about the arbitrariness of the logit models: "The logit
transformation is similar to the probit but on biological
grounds is more arbitrary." In many sources the opinion
has been expressed that logistic regression is a very
powerful, convenient, and flexible statistical tool;
however, it is completely empirical, with no theoretical
justification ([5], p. 164; [6], p. 1724). To the best of our
knowledge, [6] is the first and only work that provides
some theoretical justification for logistic models.
However, this justification is given on deterministic
grounds, in terms of general systems theory.
We will show that logistic regression is special, and unlike
the other regression models mentioned above, by justifying it
in statistical terms (within information theory) rather
than on deterministic grounds as in [6]. This is more
natural because logistic analysis is first and foremost a
statistical tool.
A typical logistic regression model is of the form
log (P / (1 - P)) = b0 + b1 X1 + b2 X2 + ... + bk Xk
(1)
where P is the probability of the event of interest,
b0, b1, b2, ..., bk are the parameters of the model, X1,
X2, ..., Xk are the explanatory variables, and log is the
natural logarithm. In this form the logistic model with the
logit link looks genuinely arbitrary, with no advantages over
any of the other models discussed above.
As is well known, for any random event E we have two
numbers: the event's probability P(E) and its information
I(E), the information contained in the message that E
occurred. These quantities are connected according to the
formula
I(E) = - log P(E)
(2)
Usually log in (2) is the binary logarithm, and in this case
information is measured in bits. Of course, other bases of
the logarithm can be used, and as a result the information
units can vary. Information is as fundamental a concept
as probability, and there are cases (in particular, in physics
and engineering) in which information is even more
convenient and natural than probability. Perhaps logistic
regression is one of these cases.
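Definition (2) is a one-liner in any language; a small Python illustration (the helper name is ours) shows how the unit of information depends on the base of the logarithm:

```python
import math

def information(p, base=2):
    """I(E) = -log P(E): the information in the message that event E occurred."""
    return -math.log(p, base)

print(information(0.5))               # a fair coin flip carries 1.0 bit
print(information(0.5, base=math.e))  # the same message in nats (~0.693)
```

Halving the probability of an event adds exactly one bit to its information, which is why rare events are the most "informative" ones.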
Taking into consideration the definition of information
(2), it is easy to see that the left side of (1) is the
difference in information content between the event of
interest E and nonevent NE. The appearance of the
information difference (ID) between E and NE seems
logical because logistic regression could be treated as a
variant of discriminant analysis when the assumption of
normality is not justified (see, for example, [7], p. 232,
[8], pp. 19-20, 34-36, or [9], pp. 355-356). That is why
the information difference ID could also be called the
discriminant information difference.
The interpretability in terms of information is a
unique property of logistic regression and constitutes the
advantage of the logistic model over probit, gompit and
other similar models.
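Both readings (the logit as an information difference, and the coefficient of a dichotomous covariate as a change in that difference) can be verified numerically; a minimal Python sketch with hypothetical group probabilities:

```python
import math

def information(p):
    """I(E) = -log P(E), here in nats to match the natural log in model (1)."""
    return -math.log(p)

def discriminant_id(p):
    """ID(E vs. NE) = I(NE) - I(E), which equals the logit log(p / (1 - p))."""
    return information(1 - p) - information(p)

# Hypothetical event probabilities at the two levels of a dichotomous X1
p0 = 0.25   # P(E | X1 = 0)
p1 = 0.60   # P(E | X1 = 1)

# The logit is exactly the information difference
assert abs(discriminant_id(p1) - math.log(p1 / (1 - p1))) < 1e-12

# The slope for X1 is the change in ID, i.e. the log of the odds ratio
b1 = discriminant_id(p1) - discriminant_id(p0)
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
print(round(b1, 6), round(math.log(odds_ratio), 6))  # the same number twice
```

The identity holds for any probabilities; nothing about it depends on the particular values chosen here.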
INFORMATION AND BAYES’ FORMULA
We will also give a new interpretation to coefficients b0,
b1, b2,....,bk in (1). Usually they are interpreted as
logarithms of odds ratios. According to [8], p. 41, "This
fact concerning the interpretability of the coefficients is
the fundamental reason why logistic regression has proven
such a powerful analytic tool for epidemiologic research.”
And further, see [8], p. 47: “This relationship between
the logistic regression coefficient and the odds ratio
provides the foundation for our interpretation of all
logistic regression results.” But the question arises
whether odds ratios themselves are a solid foundation for
this interpretation. According to the same authors ([8], p.
42), “The interpretation given for the odds ratio is based
on the fact that in many instances it approximates a
quantity called the relative risk.” Other authors are not so
optimistic about this approximation and the value of odds
ratios. For example, the opinions vary from "Odds ratios
are hard to comprehend directly" ([10], p. 989), to "odds
ratio is hard to interpret clinically" ([11], p. 1233), to
Miettinen's opinion that the odds ratio is
epidemiologically "unintelligible". Also, according to
Altman ([9], p. 271): "The odds ratio is approximately
the same as the relative risk if the outcome of interest is
rare. For common events, however, they can be quite
different, so it is best to think of the odds ratio as a
measure in its own right." Altman's opinion
that "it is best to think of the odds ratio as a measure in its
own right" is especially to the point. We think that
logistic regression coefficients also need a new
interpretation in their own right, and that this
interpretation should be done in terms of information. It
is easy to show that the coefficients b1, b2, ..., bk have the
meaning of the change in the discriminant information
difference (ID) as the corresponding explanatory variable
gets a one-unit increase (with statistical adjustment for the
other variables). For example, if X1 is a dichotomous
explanatory variable with values 0 and 1, then
b1 = ID(E vs. NE | X1 = 1) - ID(E vs. NE | X1 = 0)
(3)
Thus, equation (1) can be treated as the decomposition of
the discriminant information difference between the event
E and the nonevent NE into the sum of the contributions of
the explanatory variables. This decomposition in terms of
information is linear, unlike the original logistic model,
which is nonlinear in terms of probabilities.
It is well known how important Bayes' formula is for
modifying disease probabilities based on diagnostic test
results (see, for example, [12] and [13]). The most popular
and intuitive variant of Bayes' formula used in the medical
literature is the odds-likelihood ratio form:
posterior odds in favor of disease =
prior odds in favor of disease * likelihood ratio
(4)
or
P(D | R) / P(ND | R) =
(P(D) / P(ND)) * (P(R | D) / P(R | ND))
(4')
where D stands for disease, ND for nondisease, and R for
a test result. Even this "odds-instead-of-probabilities"
form is not intuitive enough. The first problem here is
similar to the problem with odds ratios: unlike risks, odds
are difficult to understand; they are fairly easy to visualize
when they are greater than one but less easily grasped
when the value is less than one ([10], pp. 989-990).
The second problem with odds is that,
although they are related to risk, the relation is not
straightforward: the two characteristics become
increasingly different in the upper part of the scale. The
third problem is that formula (4') is
inherently multiplicative, while human thinking grasps
additive relationships more easily. This is the reason why
researchers working with the more conventional forms of
Bayes' formula like (4) or (4') sometimes use special
nomograms and tables to calculate the posterior odds and
probabilities ([13], pp. 124-126).
Taking the logarithms of both sides of (4') (in this case the
binary logarithm is more appropriate), we arrive at the
following relationship between information quantities:
ID(D, ND | R) = ID(D, ND) + ID(R | D, ND) (5)
where ID(D,ND|R) is the posterior information difference
between disease D and nondisease ND given the result R
of the test, ID(D,ND) is the corresponding prior
information difference, and ID(R|D,ND) is the difference
in information contents of the test result between disease
and nondisease cases. In other words, (5) could be
reformulated as follows:
Discrimination information difference between disease
and nondisease after the test =
Discrimination information difference between disease
and nondisease before the test +
Information contained in the test result about
the disease/nondisease dilemma.
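The additivity in (5) is easy to check numerically; a minimal Python sketch with hypothetical test characteristics (the prevalence, sensitivity, and false-positive rate are chosen purely for illustration):

```python
import math

def id_bits(p):
    """Information difference ID = I(complement) - I(event) = log2(p / (1 - p)), in bits."""
    return math.log2(p / (1 - p))

# Hypothetical diagnostic-test characteristics
prior = 0.10        # P(D): prevalence of disease
sens = 0.90         # P(R | D): positive result given disease
false_pos = 0.20    # P(R | ND): positive result given nondisease

# Multiplicative form (4'): posterior odds = prior odds * likelihood ratio
posterior_odds = (prior / (1 - prior)) * (sens / false_pos)

# Additive form (5): posterior ID = prior ID + information in the test result
posterior_id = id_bits(prior) + math.log2(sens / false_pos)

# The two forms agree: log2 of the posterior odds is the posterior ID
print(round(math.log2(posterior_odds), 6), round(posterior_id, 6))
```

With these numbers the test contributes log2(0.90 / 0.20), about 2.17 bits, toward "disease", but the prior ID is about -3.17 bits, so the posterior still favors nondisease: exactly the "increase and balance of information" reading of Bayes' theorem.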
Thus the information variant of Bayes' formula (5)
conveys literally what we always imply when working
with the conventional Bayes' theorem: increase and
balance of information.
LOGISTIC REGRESSION AND SMALL SAMPLES
Testing the statistical significance of the logistic model is
based on the fact that the difference
2(log L(b) - log L(0))
(6)
has approximately a Chi-Square distribution with K
degrees of freedom when the number of observations N is
large enough (theoretically, N is infinite). Here log L(b)
and log L(0) denote the log-likelihoods of the fitted and
"null" (intercept-only) models respectively; b =
(b0, b1, b2, ..., bk), and log is the natural logarithm. The
questions immediately arise: what does "large enough"
mean, how often is the assumption of "large enough" not
satisfied, and what should be done in that case? We have to
add to these problems the fact that the likelihood-ratio test
based on the statistic (6) is too liberal and tends to
overparameterize the model ([14], pp. 502-503) for both
large and small samples. Usually statisticians avoid these
questions, but the problem still remains. We will try to
address the questions mentioned above.
As to the question of how often a small-sample situation
can be encountered, we mention the results of only two
papers on meta-analysis of clinical trials ([15], [16]),
which demonstrate that small numbers of participants in
the trials (e.g., fewer than 50) are very common. Small
samples are also common in many studies in fields
such as the behavioral sciences, psychology, etc. (especially
in studies of the exploratory rather than the confirmatory
type). Much more difficult is the practical question of what
to do if we do have a small sample.
As shown in [1] and [2], we can think of
2(log L(b) - log L(0)) / N
(7)
as a sample estimate of the difference between the
information characteristics of the "null" model
(b = 0) and the model under consideration. The most
natural interpretation of this difference is the information
gain, IG, that we obtain in
moving from the simplest "null" model to the fitted model
([17], pp. 163-173). As a result we have
IG * (N / K) = 2(log L(b) - log L(0)) / K
(8)
or
IG * (N / K) = Chi-Square(K) / K
(9)
With Chi-Square approximations being at least
questionable or even misleading when the sample size is
not large enough, a realistic approach, adopted by some
practitioners, is to treat the right side of (8) and (9) as an
F statistic with K and N - K - 1 degrees of freedom (see,
for example, [3], p. 89, and [2]). In [3] this statistic is
called the "asymptotic F". It is similar to the Wald F
statistic (see, for example, [18]). Using the asymptotic F
instead of the Chi-Square makes the test of significance
more conservative. And the smaller the sample size is, the
more conservative the asymptotic F is in comparison with
the original Chi-Square. By using this "practitioners'"
approach we arrive at the following equation:
Fasymptotic = IG * (N / K)
(10)
The term "asymptotic F" is most probably related to the
fact that its critical values approximate the corresponding
critical values of the Chi-Square as N increases. Thus
Fasymptotic literally becomes the Chi-Square for very large
values of N. But, as we mentioned above, the limiting
Chi-Square is too liberal according to [14]. To make our
test of significance even more conservative, we can
multiply the asymptotic F statistic (which is equal to the
right side in (8) and (9)) by (N - K - 1) / N. The result is the
"adjusted F" statistic:
Fadjusted = Fasymptotic * (N - K - 1) / N
(11)
which is similar to the adjusted Wald statistic ([18], [19]).
Our new F statistic can be given in information terms as
follows:
Fadjusted = IG * (N - K - 1) / K
(12)
Note that this formula is the same as the one we derived in
our SUGI '98 paper [2] for linear regression.
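The whole chain (6)-(12) is simple arithmetic once the two model log-likelihoods are known (PROC LOGISTIC reports them via -2 Log L for the intercept-only and fitted models); a Python sketch with hypothetical values (the variable names are ours):

```python
# Hypothetical log-likelihoods of the fitted and intercept-only ("null") models
logL_b = -48.2   # log L(b), fitted model
logL_0 = -57.9   # log L(0), null model
N, K = 40, 3     # sample size and number of explanatory variables

chi_square = 2 * (logL_b - logL_0)            # statistic (6), compared to Chi-Square(K)
IG = chi_square / N                           # information gain (7), per observation
F_asymptotic = IG * (N / K)                   # (10); equals Chi-Square(K) / K, as in (9)
F_adjusted = F_asymptotic * (N - K - 1) / N   # (11); equals IG * (N - K - 1) / K, as in (12)

print(round(chi_square, 3), round(F_asymptotic, 3), round(F_adjusted, 3))
```

The conservatism comes not from the value of F_asymptotic (which is just Chi-Square(K) / K) but from referring it to an F distribution with K and N - K - 1 degrees of freedom instead of to the Chi-Square, with F_adjusted shrinking the statistic further for small N.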
Thus, we have a “menu” of tests of significance: from the
most liberal Chi-Square usually used in PROC
LOGISTIC to more conservative Fasymptotic, to even more
conservative Fadjusted. We propose to use Fasymptotic and
Fadjusted not as a substitute but rather as a supplement to
the conventional Chi-Square. The superiority of Fasymptotic
and Fadjusted statistics over the Chi-Square is apparent. To
find the extent of this superiority, we need simulation
studies.
Finally, we would like to comment on the meaning of the
Fasymptotic and Fadjusted statistics. It is usually thought that
F statistics, in spite of their enormous practical value, are
not an estimate of anything meaningful. Equations (10)
and (12) assign a meaning in terms of information to
both F statistics. Fasymptotic has the meaning of the
information contained in all the data per parameter.
Fadjusted is the information gain (per parameter) left after
estimating the model parameters by using K + 1 degrees
of freedom. Thus, we have a bridge between statistical
significance in terms of critical values of the F
distribution and substantive significance in terms of
information.
REFERENCES
1. Shtatland, E. S. & Barton, M. B. (1997). Information as
a unifying measure of fit in SAS statistical modeling
procedures. NESUG '97 Proceedings, Baltimore, MD,
875-880.
2. Shtatland, E. S. & Barton, M. B. (1998).
An information-gain measure of fit in PROC LOGISTIC.
SUGI ’98 Proceedings, Cary, NC:
SAS Institute Inc., pp.1194-1199
3. Aldrich, J. H. & Nelson, F. D. (1984). Linear
Probability, Logit, And Probit Models, Sage University
Paper series on Quantitative Applications in the Social
Sciences, series no. 07-045, Beverly Hills and London:
Sage Publications, Inc.
4. Armitage, P. & Berry, G. (1987). Statistical Methods in
Medical Research, Oxford: Blackwell Scientific
Publications.
5. Anderson, S. , Auquier, A. , Hauck W. W.,
Oakes, D., Vandaele, W., and Weisberg, H. I. (1980).
Statistical Methods for Comparative Studies, New York:
John Wiley & Sons, Inc.
6. Voit, E. O. & Knapp, R. G. (1997). Derivation of the
linear-logistic model and Cox’s proportional hazard
model from a canonical system description. Statistics in
Medicine, 16, 1705-1729.
7. Munro, B. H. & Page, E. B. (1993). Statistical Methods
for Health Care Research, Philadelphia, Pennsylvania: J.
B. Lippincott Company.
8. Hosmer, D. W. & Lemeshow, S. (1989). Applied
Logistic Regression, New York: John Wiley & Sons, Inc.
9. Altman, D. G. (1991). Practical Statistics for Medical
Research, London: Chapman & Hall.
10. Davies, H. T. O., Crombie, I. K., Tavakoli, M. (1998).
When can odds ratios mislead? BMJ, 316, 989-991.
11. Eccles, M., Freemantle, N., and Mason, J. (1998).
North of England evidence based guidelines development
project: methods of developing guidelines for efficient
drug use in primary care, BMJ, 316, 1232-1235.
12. Ingelfinger, J. A., Mosteller, F.,
Thibodeau, L. A., and Ware, J. H. (1987). Biostatistics in
Clinical Medicine, New York: Macmillan Publishing Co.,
Inc.
13. Sackett, D. L., Haynes, R. B., Guyatt, G. H., and
Tugwell, P. (1991). Clinical Epidemiology: a Basic
Science for Clinical Medicine, London: Little, Brown
and Co.
14. Gelfand, A. E. & Dey, D. K. (1994). Bayesian model
choice: asymptotics and exact calculations. Journal of the
Royal Statistical Society, Series B, 56, 501-504.
15. Lau, J., Schmid, C. H., Chalmers, T. C. (1995).
Cumulative meta-analysis of clinical trials builds evidence
for exemplary medical care. Journal of Clinical
Epidemiology, 48, 45-57.
16. Gotzsche, P. C., Johansen, H. K. (1998). Meta-analysis
of short term low dose prednisolone versus
placebo and non-steroidal anti-inflammatory drugs in
rheumatoid arthritis. BMJ, 316, 811-817.
17. Kent J. T. (1983). Information gain and a general
measure of correlation. Biometrika, 70, 163-173.
18. Korn, E. L., and Graubard, B. I. (1990). Simultaneous
testing of regression coefficients with complex survey
data: use of Bonferroni t-statistics. The American
Statistician, pp. 270-276.
19. Fellegi, I. P. (1980). Approximate tests of
independence and goodness of fit based on stratified
multistage samples. Journal of the American Statistical
Association, 75, 261-268.
CONTACT INFORMATION:
Ernest S. Shtatland
Department of Ambulatory Care and Prevention
Harvard Pilgrim Health Care & Harvard Medical School
126 Brookline Avenue, Suite 200
Boston, MA 02215
tel: (617) 421-2671
email: [email protected]
Mary B. Barton, MD, MPP
Department of Ambulatory Care and Prevention
Harvard Medical School & Harvard Pilgrim
Health Care
126 Brookline Avenue, Suite 200
Boston, MA 02215
tel: (617) 421-6011
email: [email protected]