Logistic Regression and Logistic Analysis
(Alan Pickering, 2nd December 2003)

To understand these notes you should first have understood the material in the notes on
“Associations in two-way categorical data” (henceforth referred to as the A2WCD notes), which
are available electronically in the usual places.
MORE GENERAL PROCEDURES
FOR ANALYSING CATEGORICAL DATA
There are a number of techniques that are more general than the Pearson chi-squared test (χ2),
likelihood ratio chi-squared test (G2), and odds-ratio (OR) methods reviewed in the A2WCD notes.
In particular, these general methods can be used to analyse contingency tables with more than 2
variables. Next we consider the range of these procedures available in SPSS.
Choosing A Procedure In SPSS
There are lots of different ways to analyse contingency tables and categorical dependent variables
within SPSS. Each of the following procedures can sometimes be used:
Analyze > REGRESSION >> BINARY LOGISTIC
Analyze > REGRESSION >> MULTINOMIAL LOGISTIC
Analyze > LOGLINEAR >> GENERAL
Analyze > LOGLINEAR >> LOGIT
Analyze > LOGLINEAR >> MODEL SELECTION
Each procedure works best for a particular type of statistical question, although the procedures can
often be “forced” to carry out analyses for which they were not specifically designed. The outputs
of each procedure look quite different, even though many of the results, buried within the printout,
will be identical. However, some of the output information is unique to each procedure.
My general advice is that the Multinomial Logistic Regression procedure is by far the most user-friendly and will deal with the most common data analyses of this type that we wish to carry out.
The SPSS Help menu gives advice on choosing between procedures (select Help > TOPICS and
then select the Index tab). Unfortunately, SPSS uses slightly different names for the procedures in
its Help menu (and on printed output) than those that appear as the options in the Analyze menu.
The following table will clarify the names that SPSS uses:-
Procedure Name in Analyze Menu          Procedure Name used in Help Menu and in Output
REGRESSION >> BINARY LOGISTIC           Help:   Logistic Regression
                                        Output: Logistic Regression
REGRESSION >> MULTINOMIAL LOGISTIC      Help:   Multinomial Logistic Regression
                                        Output: Nominal Regression
LOGLINEAR >> GENERAL                    Help:   General Loglinear Analysis
                                        Output: General Loglinear (Analysis)
LOGLINEAR >> LOGIT                      Help:   Logit Loglinear Analysis
                                        Output: General Loglinear (Analysis)
LOGLINEAR >> MODEL SELECTION            Help:   Model Selection Loglinear Analysis
                                        Output: HiLog; Hierarchical Log Linear
Table 1. The varying names used by SPSS to describe its categorical data analysis procedures
Finally, note that the SPSS printed output, for the various types of categorical data analysis, is fairly
confusing because it contains a lot of technical detail and jargon. That is why it is important to have
a clear understanding of some of the basic issues covered below.
Three Types of Analysis
• Logistic (or Logit) Regression: describes a general procedure in which one attempts to predict a
categorical dependent variable (DV) from a group of predictors (IVs). These predictors can be
categorical or numerical variables (the latter are referred to in SPSS as covariates). The DV can
have two or more levels (binary or multinomial, respectively). This analysis can be thought of
as analogous to (multiple) linear regression, but with categorical DVs. It is most easily carried
out in SPSS using the following procedures:
Analyze > REGRESSION >> BINARY LOGISTIC
Analyze > REGRESSION >> MULTINOMIAL LOGISTIC
• Logistic (or Logit) Analysis: describes a special case of logistic regression in which all the
predictor variables are categorical, and these analyses often include interaction terms formed
from the predictor variables. This analysis can be thought of as analogous to ANOVA, but with
categorical DVs. It is most easily carried out in SPSS using the following procedures:
Analyze > REGRESSION >> BINARY LOGISTIC
Analyze > REGRESSION >> MULTINOMIAL LOGISTIC
Analyze > LOGLINEAR >> LOGIT
• (Hierarchical) Loglinear Modelling, Loglinear Analysis, or Multiway Frequency Table
Analysis: describes a procedure in which there is no separation into DVs and predictors, and one
is concerned with the interrelationships between all the categorical variables in the table. It is
most easily carried out in SPSS using the following procedures:
Analyze > LOGLINEAR >> GENERAL
Analyze > LOGLINEAR >> MODEL SELECTION
Example of Logistic Analysis Using SPSS
Logistic analysis may be the most straightforward place to start looking at the more general
contingency table analysis methods available in SPSS. This is because: (a) logistic analysis
resembles ANOVA (which is familiar to psychologists); (b) categorical data from psychological
experiments are probably most often in a form requiring logistic analysis (i.e., there is a DV and one
or more categorical IVs); and (c) the relevant SPSS procedures are probably easier to execute and
interpret than the other types of contingency table analysis methods. Before analysing multiway
tables, we start with an example of a two-way analysis.
Logistic Analysis Example: small parks data
The data concerns subjects with Parkinson’s disease (PD), and the dataset contains disease status
(PDstatus: 1= has disease; 2= no disease) and smoking history (Smokehis: 1=is or was a smoker;
2=never smoked). Table 2 is a key contingency table:
                       PDstatus
Smokehis          yes (=1)    no (=2)    Row Totals
yes (=1)              3          11          14
no (=2)               6           2           8
Column Totals         9          13      Grand Total = 22
Table 2. Observed frequency counts of current Parkinson’s disease status by
smoking history.
The data were analysed using the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure.
The presence or absence of PD (PDstatus) was selected into the “Dependent variable” box and
cigarette smoking history (Smokehis) was selected into the “Factor(s)” box. The Statistics button
was selected and in the resulting subwindow, the “Likelihood ratio test” and “Parameter estimates”
options were checked. The key resulting printed output was as follows:

Model Fitting Information
Model             -2 Log Likelihood    Chi-Square    df    Sig.
Intercept Only         11.308
Final                   5.087            6.222        1    .013

Likelihood Ratio Tests
Effect        -2 Log Likelihood of Reduced Model    Chi-Square    df    Sig.
Intercept                  5.087                       .000        0     .
SMOKEHIS                  11.308                      6.222        1    .013
The chi-square statistic is the difference in -2 log-likelihoods
between the final model and a reduced model. The reduced
model is formed by omitting an effect from the final model. The null
hypothesis is that all parameters of that effect are 0.

Parameter Estimates
Has got Parkinson's disease? = yes
                                                                      95% Confidence Interval for Exp(B)
                  B        Std. Error    Wald     df    Sig.    Exp(B)      Lower Bound    Upper Bound
Intercept        1.099       .816        1.810     1    .178
[SMOKEHIS=1]    -2.398      1.044        5.271     1    .022    9.09E-02     1.17E-02        .704
[SMOKEHIS=2]     0a          0            .        0     .        .            .              .
a. This parameter is set to zero because it is redundant.
Understanding the Printout
(The key jargon from the printout is highlighted in bold below.) In this analysis, there is only one
effect of interest: the effect of smoking history Smokehis on PDstatus. As a result, the first two
output tables (“Model Fitting Information”; “Likelihood Ratio Tests”) are completely redundant.
Later, when we look at a 3-way example, these two tables provide different information. The final
model is a model containing all the possible effects. As will be explained in more detail later, this
model has two free parameters (parameters are just the independent components in the
mathematical formula for the model). The final model proposes that the probability of PD is
different in each of the two samples with differing smoking histories (i.e., differing values of
Smokehis). Therefore, the final model needs 2 parameters: effectively these parameters correspond
to the probability of having PD in each of the 2 samples. As there are only 2 outcomes (has PD vs.
doesn’t have PD), we do not need to specify the probability of not having PD, because that is
simply 1 minus the probability of having PD.
The final model has a likelihood which we can denote with the symbol Lfinal. This likelihood is just
the probability that exactly the observed data would have been obtained if the final model were true.
These analyses use the natural logarithms (loge) of various values. Those who are not very familiar
with logarithms should review the quick tutorial on logarithms that was given in the A2WCD notes.
The value of -2*loge(Lfinal) is 5.087 (see “Model Fitting Information” in the above printout). The
likelihood of getting exactly the data in Table 2, if the final model were true, is therefore given by
e^(-5.087/2) (=0.08). Although this probability may seem low, the final model is the best possible model
one could specify for these data. Later in these notes we consider how these likelihoods are
calculated.
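As a rough check on the value of 0.08, the likelihood under the final model can be reproduced directly from Table 2. The sketch below assumes that the likelihood is the product of two independent binomial likelihoods (one for each smoking-history sample), with the PD probability in each sample set to its observed proportion. The function and variable names are illustrative only and are not part of SPSS.

```python
from math import comb, log

def binomial_likelihood(n_trials, n_hits, p):
    """Probability of exactly n_hits occurrences in n_trials independent trials."""
    return comb(n_trials, n_hits) * p**n_hits * (1 - p)**(n_trials - n_hits)

# Table 2: PD cases among smokers (3 of 14) and among never-smokers (6 of 8)
l_smokers = binomial_likelihood(14, 3, 3 / 14)
l_never = binomial_likelihood(8, 6, 6 / 8)

# Assumption: the two samples are independent, so their likelihoods multiply
l_final = l_smokers * l_never
minus2ll_final = -2 * log(l_final)

print(round(l_final, 4), round(minus2ll_final, 3))
```

Under these assumptions the two printed values should come out close to the 0.08 and 5.087 quoted above.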
The analysis also produces a reduced model that is simpler than the final model. The reduced
model is called an intercept only model in the Model Fitting Information output table. Because the
reduced model is formed from the final model by removing the effect of Smokehis on PDstatus, the
reduced model appears in the row labelled SMOKEHIS in the table of Likelihood Ratio Tests. (The
row labelled Intercept in this table should be ignored.) The reduced model has only a single
parameter because it proposes only one rate of PD occurrence (i.e., the rate is the same for both
samples differing in smoking history under this model).
The reduced model has a likelihood that can be represented Lreduced. From the printout shown above,
the value of -2*loge(Lreduced) is 11.308, which corresponds to a likelihood of e^(-11.308/2) (=0.0035).
The observed data are therefore less likely under this model than under the final model. A reduced
model can never have a higher likelihood than the more complete model from which it is formed
(because reduced models have fewer parameters with which to fit the data). In
general, scientific modelling attempts to find the simplest model (i.e., the one with fewest
parameters) that provides an adequate fit to the data. The key decision in this kind of analysis is
therefore whether the lower likelihood of the reduced model is a statistically acceptable “trade” for
the reduced number of parameters involved in the reduced model. The likelihood ratio test
compares the likelihoods of the two models in order to make this decision.
If the reduced model were true then it turns out that a function of the ratio of the two likelihoods
(specifically –2*loge[Lreduced/Lfinal]) would have a distribution that is approximated by the χ2
distribution, with degrees of freedom given by the difference in the number of free parameters
between the final and reduced models (2-1=1 in this case). From the properties of logs (see the
A2WCD notes) we know that the log likelihood ratio statistic given above is equivalent to:
2*loge(Lfinal) – 2*loge(Lreduced).
The value of the statistic in the present example is (-5.087) – (-11.308) = 6.221, which agrees (within
rounding) with the value of 6.222 that appears under the “Chi-Square” column heading in the table of
Likelihood Ratio Tests. The above
notes (and the footnote on the SPSS output) should have made it clear why it is called a likelihood
ratio test statistic and why it is tested against the chi-squared distribution. The value obtained
(6.222) is considerably greater than the critical value for χ2 with df=1 (for p=0.05, this is 3.84)
and so the reduced model can be rejected in favour of the final model.
This result means that Smokehis and PDstatus are not independent in this dataset: i.e., there is
a significant effect of Smokehis on PDstatus (p=0.013).
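The likelihood ratio test just described can be sketched in a few lines. The two -2 log-likelihood values are copied from the SPSS printout; everything else (variable names, and the use of the error function to obtain the p-value) is an illustrative assumption rather than a description of how SPSS works internally.

```python
from math import erfc, exp, sqrt

# -2 log-likelihood values copied from the SPSS printout
minus2ll_reduced = 11.308   # intercept-only model (1 parameter)
minus2ll_final = 5.087      # model including Smokehis (2 parameters)

# Likelihoods implied by those values
l_reduced = exp(-minus2ll_reduced / 2)
l_final = exp(-minus2ll_final / 2)

# Likelihood ratio statistic: difference between the two -2LL values
lr_statistic = minus2ll_reduced - minus2ll_final

# For df = 1 only, the chi-squared upper-tail probability has the closed form
# P(X > x) = erfc(sqrt(x / 2)); this shortcut does not apply to other df.
p_value = erfc(sqrt(lr_statistic / 2))

print(round(l_final, 3), round(l_reduced, 4))
print(round(lr_statistic, 3), round(p_value, 3))
```

The statistic exceeds the df=1 critical value of 3.84, and the p-value should round to the .013 shown in the printout.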
We can compare the value of the likelihood ratio test statistic in this analysis with the value for the
likelihood ratio statistic (G2) for two-way tables, which we can obtain using the SPSS CROSSTABS
procedure (see A2WCD notes). The value is identical. The G2 statistic is a special case of logistic
analysis when there are only 2 variables in the contingency table.
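The equivalence with G2 can be checked by hand. The sketch below applies the usual formula for the likelihood ratio chi-squared statistic, G2 = 2 * sum of observed * loge(observed / expected), to the counts in Table 2, with expected frequencies calculated under independence. The variable names are illustrative.

```python
from math import log

# Observed counts from Table 2: rows = Smokehis (yes, no), columns = PDstatus (yes, no)
observed = [[3, 11],
            [6, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

g2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected frequency if Smokehis and PDstatus were independent
        expected = row_totals[i] * col_totals[j] / grand_total
        g2 += 2 * count * log(count / expected)

print(round(g2, 3))
```

The result should agree with the Chi-Square value of 6.222 in the Likelihood Ratio Tests table.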
We should also note that the Parameter Estimates output table shows estimates for the two
parameters of the final model (B values). Note that the B parameter with a value of -2.398
represents the natural logarithm of the odds-ratio {loge(OR)}. Once again, we can calculate this for
a 2x2 table using CROSSTABS (see A2WCD notes). The value Exp(B) “undoes” the effect of taking
the logarithm, because the exponentiation function {Exp()} is the inverse operation to taking a
logarithm (in much the same way as dividing is the inverse of multiplication). Exp(B) therefore
gives us the odds ratio itself (OR) and its associated 95% confidence interval (CI; also available
from CROSSTABS). Later on in these notes, we will discuss how these parameter estimates are
constructed, and thus how they can be interpreted. When we move on to tables bigger than 2x2, the
B parameter values shown in the output will each be different log(OR) values calculated from 4
cells within the larger table.
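For readers who want to see where the numbers in the Parameter Estimates table might come from, the sketch below recomputes them from the four cells of Table 2. It assumes that Smokehis=2 is the reference category (as the redundant parameter in the printout suggests) and uses the standard large-sample standard error of a log odds ratio; it is offered as a plausible reconstruction, not as a statement of SPSS's internal method. The variable names are illustrative.

```python
from math import exp, log, sqrt

# Cell counts from Table 2
pd_smoker, nopd_smoker = 3, 11      # Smokehis = 1
pd_never, nopd_never = 6, 2         # Smokehis = 2 (assumed reference category)

odds_smoker = pd_smoker / nopd_smoker
odds_never = pd_never / nopd_never

intercept = log(odds_never)                 # log odds of PD in the reference category
b = log(odds_smoker / odds_never)           # log odds ratio
# Standard large-sample standard error of a log odds ratio
se_b = sqrt(1 / pd_smoker + 1 / nopd_smoker + 1 / pd_never + 1 / nopd_never)

odds_ratio = exp(b)                         # Exp(B) "undoes" the logarithm
ci_lower = exp(b - 1.96 * se_b)             # 95% confidence limits for the OR
ci_upper = exp(b + 1.96 * se_b)
wald = (b / se_b) ** 2

print(round(intercept, 3), round(b, 3), round(se_b, 3), round(wald, 3))
print(round(odds_ratio, 4), round(ci_lower, 4), round(ci_upper, 3))
```

The printed values should match the B, Std. Error, Wald, Exp(B) and confidence interval entries in the printout to the precision shown there.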
PART II – UNDERSTANDING THE STATISTICAL MODELLING TECHNIQUES USED
IN LOGISTIC REGRESSION
Sample Estimates and Population Probabilities
This section is quite simple conceptually (and the technical bits are in boxes). If you grasp what’s
going on, even if only roughly, then it will really help you to understand: (a) the process of
executing logistic regression; (b) what to look at in the printed output; and (c) the jargon used in the
printout.
Imagine we tested a random sample of 100 female subjects in their twenties in a dart-throwing
experiment. We got each subject to take one throw at the board. We scored the data very simply:
did the subject hit the scoring portion of the board? This generated a categorical variable hitboard
with values: 1=yes; 2=no.
hitboard
1=yes
2=no
60
40
Total
100
Table 3. The summary data for the hitboard variable in the dart-throwing data
So the overall probability of hitting the dartboard was 0.6 (60/100; let’s call that probability q). The
measured value of 0.6, from this particular measurement sample, might be used to estimate a
particular population probability that we are interested in (e.g., the probability with which women
in their twenties would hit a dartboard given a single throw; this population probability will be
denoted by the letter p). We might ask what is the most likely value of the population probability
given the sample value (q=0.6) that we obtained. For any hypothetical value of p, we can easily
calculate the likelihood of getting exactly 60 hits from 100 women using probability theory. The
underlying theory and mechanics of the calculation are described in the box on the following page.
Using Probability Theory To Derive Likelihoods
Take tossing a coin as an easy example, which involves all the processes we are interested in.
What is the likelihood of getting exactly 2 heads in 3 tosses of a completely fair coin? This is
a binomial problem as there are just two outcomes for each trial (Heads, H; Tails, T). We can
count the answer. There are 8 (i.e. 2^3) possible sequences of 3 tosses which are all equally
likely:
TTT; TTH; THT; HTT; HHT; HTH; THH; HHH
Only 3 of the sequences have exactly 2 Heads (THH; HTH; HHT), so the likelihood is 3/8
(=0.375). It is important that the outcome on each toss is independent of the outcome on every
other toss. Independence therefore means, for example, that tossing a H on one trial does not
change the chance of getting a H on the next trial. In this way the 8 possible sequences shown
above are equally likely. This binomial problem, for 2 possible outcomes, can be described
generally as trying to find the likelihood, L, of getting exactly m occurrences of outcome 1 in
a total of N independent trials, when the probability of outcome 1 on a single trial is p. For
our example above: N=3; m=2; p=0.5. (The value of p is 0.5 because it is a fair coin.) The
general formula is:
L = NCm * p^m * (1 – p)^(N-m)
Where NCm is the number of ways (combinations) that m outcomes of a particular type can be
arranged in a series of N outcomes (answer=3 in our example). NCm is itself given by the
following formula:
NCm = N!/(m!*[N-m]!)
Where the symbol ! means the factorial function. X! = X*(X-1)*(X-2)…*2*1. Thus, for
example, 3! = 3*2*1. Check that the above formulae generate 0.375 as the answer to our coin-tossing
problem. The multinomial formulae are an extension of the above to deal with cases
where there are more than 2 types of outcome.
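The check suggested in the box can also be done by computer. The following is a minimal sketch of the binomial formula, with the combinations term written out using factorials exactly as in the formula for NCm. The function names are illustrative.

```python
from math import comb, factorial

def n_choose_m(n, m):
    """Number of combinations, written out with factorials as in the formula for NCm."""
    return factorial(n) // (factorial(m) * factorial(n - m))

def binomial_likelihood(n, m, p):
    """Likelihood of exactly m occurrences of outcome 1 in n independent trials."""
    return n_choose_m(n, m) * p**m * (1 - p)**(n - m)

print(n_choose_m(3, 2))                  # 3 ways to arrange 2 heads in 3 tosses
print(binomial_likelihood(3, 2, 0.5))    # 0.375

# Python's built-in math.comb gives the same number of combinations
assert n_choose_m(3, 2) == comb(3, 2)
```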
Applying Probability Theory to the Dart-Throwing Data
For our darts data, the values to plug into the formula are thus: there are 100 trials (i.e., N=100); and we
are interested in the case where we obtained exactly 60 “hitboard=yes” outcomes (i.e., m=60). We
can allow the value of p to vary in small steps from 0.05 to 0.95 and calculate a likelihood for each
value of p. Putting these values into the formulae, we get the likelihoods that are shown in the
following graph:
[Figure: likelihood (vertical axis, 0 to 0.09) plotted against values of the p parameter (horizontal axis, 0 to 1).]
It is fairly clear from the graph that the likelihood of getting the result q=0.6 is at a maximum for
the value p=0.6. In fact, this might seem intuitively obvious: if the true value of p were, say, 0.5
then to get a sample estimate of 0.6 must mean that the random sample used was slightly better than
expected. This seems less likely to occur than getting a sample which performs exactly as expected.
A coin-tossing example may help: I think many people would intuitively “know” that the likelihood
of getting 10 heads in 20 tosses of a fair coin is greater than the likelihood of getting 8 (or 12) heads
(and much greater than the likelihood of getting 2 or 18 heads). This means that our sample value
(q=0.6) is the best estimate for p we can make, given the data.
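The scan over values of p described above can be sketched as follows. The step size of 0.05 is an assumption (the notes say only "small steps"), and the function and variable names are illustrative. The last few lines check the coin-tossing intuition for 10, 8 and 2 heads in 20 tosses.

```python
from math import comb

def binomial_likelihood(n, m, p):
    """Likelihood of exactly m occurrences in n independent trials."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

n_throws, n_hits = 100, 60

# Candidate values of p from 0.05 to 0.95 in steps of 0.05 (step size is an assumption)
candidates = [k / 100 for k in range(5, 100, 5)]
likelihoods = {p: binomial_likelihood(n_throws, n_hits, p) for p in candidates}

best_p = max(likelihoods, key=likelihoods.get)
print(best_p, round(likelihoods[best_p], 3))

# Coin-tossing intuition: 10 heads in 20 tosses versus 8 heads or 2 heads
for heads in (10, 8, 2):
    print(heads, round(binomial_likelihood(20, heads, 0.5), 4))
```

The largest likelihood in the scan occurs at p=0.6, and the likelihood of 10 heads exceeds that of 8 heads, which in turn greatly exceeds that of 2 heads.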
Another, possibly surprising, point that one might notice from the graph is that the likelihoods are
all quite low. Even the maximum likelihood (for p=0.6) is only around 0.08. Even if the population
probability was really 0.6, we would get sample values which differ from this value 92% of the
time. The low values arise because here we are talking about the likelihood of the sample value being exactly 0.6
and, in psychology, we are more used to giving ranges of values. For example, we might more
usefully give the 95% confidence intervals (CIs) around our sample estimate of q=0.6. These CIs
give a range of values which, with 95% confidence, would be expected to contain the true value of
p. (How to calculate such CIs is not discussed here.)
Maximum Likelihood Estimation
In general, if one has frequency data of this kind, and an underlying hypothesis (or model)
that can be expressed in terms of particular probabilities, then one can create a computer
program to estimate the values of those probabilities which are associated with the maximum
likelihood of leading to the data values obtained. This is the process called maximum
likelihood estimation. In the darts example, we would therefore say that 0.6 is the maximum
likelihood estimate (MLE) of the underlying population probability parameter (p), given the
data obtained. It can also be said that the value p=0.6 provides the best fit to the data obtained
in the experiment.
For the simple dart-throwing example it was possible to work out the MLE for p by
logic/intuition. For a more complex model, with several probabilities, numerical estimation by
computer is often the only way to derive MLEs. Statistical packages, such as SPSS, use
numerical methods to generate MLEs in several different kinds of analyses, including those
involved in logistic regression.
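To give a flavour of what "numerical estimation by computer" means, here is a minimal sketch that finds the MLE for the darts data by a simple search, without using the fact that the answer is 60/100. It uses a golden-section search on the log-likelihood, chosen only because it is short and easy to follow; this is an illustrative technique and should not be taken as the algorithm that SPSS uses. The function names are hypothetical.

```python
from math import log, sqrt

def log_likelihood(p, n=100, m=60):
    # The combinations term does not depend on p, so it can be left out
    # when we only want to locate the maximum.
    return m * log(p) + (n - m) * log(1 - p)

def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximum of a single-peaked function f on the interval [lo, hi]."""
    ratio = (sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) > f(d):
            b = d      # the maximum lies in [a, d]
        else:
            a = c      # the maximum lies in [c, b]
    return (a + b) / 2

p_hat = golden_section_max(log_likelihood, 0.001, 0.999)
print(round(p_hat, 4))
```

The search converges on a value very close to 0.6, the same answer reached by intuition above. Working with the log of the likelihood, rather than the likelihood itself, avoids multiplying very small numbers together.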
Comparing Likelihoods
Recall that the likelihood of obtaining our sample result (q=0.6) was about 0.08 when the true value of p is taken to be 0.6.
Statisticians do not apply any conventional likelihood values in order to draw conclusions about the
real value of p. We do not apply the 0.05 convention used in hypothesis testing to evaluate
likelihoods in these situations. (CIs, as described earlier, show how the 0.05 convention can be used
in this situation.) Instead, statistical modelling works by comparing the likelihoods under two
different hypotheses.
Let us suppose that we have calculated the (maximum) likelihood under hypothesis 1 and a
(maximum) likelihood under hypothesis 2. We can denote these hypotheses as H1 and H2, and the
associated likelihoods as L1 and L2. As already noted, it has been found that -2 times the natural
logarithm of the ratio between these two likelihoods (i.e., -2*loge[L1/L2]) has approximately the χ2
distribution. Thus, we can use the log likelihood ratio and the χ2 distribution to test whether H1 is
significantly less likely than H2.
In analysing frequency data, this approach is typically used when there is a hierarchy of hypotheses
of increasing complexity. Hence, loglinear modelling of frequency data is often referred to as
hierarchical loglinear modelling. Analysis proceeds by finding the simplest hypothesis which is
able to account for the observed data with a likelihood that is not significantly lower than the next
most complex hypothesis in the hierarchy. We will illustrate this rather abstract and wordy
statement with a concrete worked example.
Comparing Likelihoods in the Darts Data
In fact the darts data used above were not collected from 100 women but from a single individual.
She was a right-handed woman with no previous experience of darts. The data reflect 100 throws at
the board using her right and left hand on alternate throws. (The data from this experiment are on
the J drive as darts study.sav.) The full contingency table is illustrated below.
                        hitboard
Throwing Hand       1=yes     2=no     Row Totals
Right (=1)            40        10         50
Left (=2)             20        30         50
Column Totals         60        40     Grand Total = 100
Table 4. The overall contingency data for the dart-throwing data
We might generate a hierarchy of two simple hypotheses about the subject’s performance:
H1: her ability to hit the board is unaffected by the hand she uses to throw
H2: her ability to hit the board is affected by the hand she uses to throw (see footnote 1)
For the full contingency table 2 independent probability values were measured in the experiment:
the probability of hitting the board measured for her right hand and the probability measured for her
left hand. We will denote the measured sample probability values by qL and qR for the left and right
hand respectively. (The probability values for missing the board are not independent of the
probabilities for hitting the board: the probability for a hit plus that for a miss must add up to 1.)
We can represent the hypothesis H1 and H2 in terms of underlying population probabilities. H1 is an
independence hypothesis (“throwing ability is independent of hand used”). According to H1 the
true probability for a hit with the right hand (denoted pR) equals that for the left hand (pL). Because
pR = pL we can replace these probabilities with a single value (denoted p; p = pR = pL). Hypotheses
have parameters (probabilities in this case) and degrees of freedom. H1 is thus a “single-parameter”
hypothesis (as it specifies only one probability value; i.e., p). The degrees of freedom (df) for a
hypothesis are given by the number of independent data points (the independent probabilities
measured in this experiment; 2 in this case) minus the number of freely varying parameters of the
hypothesis. Thus, for H1, df=(2-1)=1.
H2 is a more complex hypothesis than H1 because it has two parameters. H2 says that the
probabilities for a hit with the left and right hands are not equal; i.e., pR ≠ pL. Thus, 2 separate
probabilities are needed to specify the hypothesis.
It also follows that H2 has df=0. A hypothesis produces a specific model when values are provided
for the parameters of the hypothesis. A hypothesis with df=0 can always generate a model that is
described as saturated. Saturated models are not very interesting because they describe (or fit) the
data perfectly (in the sense that there is no discrepancy between the observed frequencies and those
expected according to the model). The saturated model under hypothesis H2 would have the
following values: pR=0.8 and pL=0.4. This model is the best-fitting version of hypothesis H2, given
the data obtained: it is the version of H2 that has the maximum likelihood of generating the
observed data.
Footnote 1: This is a nondirectional hypothesis. Given that the subject is right-handed, we might have had a directional hypothesis
specifying that her performance is better with her right, than her left, hand.
Question 1
Can you explain why the best-fitting parameter values for H2 are pR=0.8 and
pL=0.4? By looking at the formulae given in textbooks (or the A2WCD
notes), work out what the values of the χ2 and G2 statistics would be under the
expected frequencies generated by these probability values (Hint: you do not
need a calculator to do this, as long as you remember what log(1) is.)
From these probability values given above one can work out the likelihood of the data for the
sample of right hand throws (=0.140) and the likelihood of the data for the left hand throws
(=0.115). Assuming that the observed probabilities in the two samples of throws are independent of
one another (i.e., the success or failure of the throwing trials for the left hand does not influence the
success or failure of the trials for the right hand, nor vice versa), then the overall likelihood across
both samples can be worked out by multiplying the likelihoods for the two separate samples (see footnote 2). The
overall likelihood is therefore 0.016 (=0.140*0.115). This independence assumption must be met in
order to apply any kind of analysis of categorical data (from simple χ2 tests to logistic regression).
In addition, as noted earlier, the probability of hitting the board with any throw (with left or right
hand) must be independent of the probability of hitting the board with any other throw; if this
independence assumption is violated then the maximum likelihood estimation process (described
above) will not estimate the true likelihoods, and the test statistics will not follow a chi-squared
distribution.
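The likelihoods quoted above for hypothesis H2 can be reproduced with the binomial formula given earlier. This is a minimal sketch; the function and variable names are illustrative, and the multiplication step relies on the independence assumption just discussed.

```python
from math import comb

def binomial_likelihood(n, m, p):
    """Likelihood of exactly m occurrences in n independent trials."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

l_right = binomial_likelihood(50, 40, 0.8)   # right hand: 40 hits in 50 throws, pR = 0.8
l_left = binomial_likelihood(50, 20, 0.4)    # left hand: 20 hits in 50 throws, pL = 0.4

# Assumes the two samples of throws are independent of one another
l_h2 = l_right * l_left

print(round(l_right, 3), round(l_left, 3), round(l_h2, 3))
```

The three printed values should agree with the 0.140, 0.115 and 0.016 given in the text.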
Question 2
(The questions in this box illustrate the fact that the use of a particular
statistical analysis technique may inform the choice between similar, but
subtly different designs, for the same experiment.)
Is the assumption of independence between the left hand and right hand darts
data samples justified? Would it have been more or less justified if the subject
had taken all her right hand throws first, followed by all her left hand throws?
Is the probability of success with each throw of the dart likely to be
independent of the probability of success of any other throw? From the
independence point of view, would a better design have been to test 100
separate right-handed women for one throw each, with half of them (selected
at random) being asked to use their left hand?
In general, a more complex hypothesis (such as H2) will be able to fit a set of data better than a
simpler hypothesis with fewer parameters (such as H1). Using the log-likelihood ratio technique,
outlined earlier, we can see if the best-fitting version of the simpler hypothesis (H1) can fit the
observed darts data with a likelihood that is not significantly lower than the likelihood calculated
for H2. If the likelihood for the best-fitting H1 model is not significantly lower than that for the best-fitting H2 model, then we adopt H1 as the best-fitting hypothesis and conclude that dart-throwing
accuracy was independent of the hand used. However, if the fit of the H1 model is significantly
poorer than that of H2 (i.e., the likelihood of H1 is significantly lower than that of H2) then we can
reject H1 and conclude that throwing accuracy was not independent of the hand used. (The details of
the likelihood ratio calculation are given below and then checked out using SPSS.)
Footnote 2: A basic axiom of probability theory states that if event A (occurring with probability pA) and event B (occurring with
probability pB) are independent, then the probability of both A and B occurring is given by (pA * pB).
To emphasise the analogy with ANOVA, one can think of the likelihood ratio statistic as testing the
interaction between the variables Hand and hitboard. A significant interaction would simply mean
that the probability of hitting the board was affected by the hand used (i.e. supporting H2); the lack
of significant interaction therefore supports H1. This way of thinking of the data is particularly
helpful when we later analyse tables with more than two variables. Note also that we are usually not
interested in the main effects under such analyses. The main effects in the darts data (i.e., for Hand
and hitboard) would correspond to questions about the distributions of categories in the row and
column totals. Specifically, the main effect for hitboard would tell us whether the ratio of
hitboard=yes: hitboard=no responses, across the whole experiment, deviated from 50:50. This is
not something of particular interest. Because we sampled the data such that there were equal
numbers of left hand and right hand throws, the Hand main effect is completely meaningless. When
contingency table data are sampled with a clear separation between DVs and IVs (and thus are
suitable for logistic analysis) it will generally be the case that the main effects of the IVs will be
meaningless.
Calculating Log-Likelihood Ratios for the Darts Data
We already calculated that the likelihood for the best-fitting model under
hypothesis H2 was 0.016. We denote this value by L2. The corresponding log-likelihood is –4.135. This model has 2 independent parameters (i.e., 2
probabilities). Hypothesis H1 has only a single parameter, the probability of
hitting the board (independent of hand used). It turns out that the best estimate we
have for this probability is the overall probability of hitting the board in Table 4
(i.e., 60/100 = 0.6). We can use the likelihood formulae given earlier to calculate
the likelihood of getting 40 hits out of 50 with the right hand if the true
probability were 0.6. This likelihood is 0.0014. Similarly, the likelihood of
getting 20 hits out of 50 for the left hand (if the true probability were 0.6) is
0.002. The overall likelihood (L1) for the table is therefore (0.0014*0.002), i.e.
2.9 x 10^-6 (this is 2.9 in a million). The corresponding log-likelihood is –12.764.
The ratio of the likelihood for the simpler model to the likelihood of the more complex model is thus L1/L2. We already noted that, if the
simpler model were true, then the statistic –2*loge(L1/L2) would be distributed
approximately as χ2, with df equal to the difference in number of parameters for
the two models (here H2 has two parameters and H1 has 1; df = 1). But,
-2*loge(L1/L2) = (-2*loge[L1]) – (-2* loge[L2]). Therefore, the test statistic for the
darts data is (–2*-12.764) – (-2*-4.135) = 17.258. This is very much greater than
the critical value for χ2 with 1 df and so we can reject H1 in favour of H2. There is
a highly significant effect of Hand on ability to hit the dartboard.
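The whole calculation in this box can be sketched in code. The counts come from Table 4; the function and variable names are illustrative, and the closed-form p-value used here is valid only because the test has 1 degree of freedom.

```python
from math import comb, erfc, log, sqrt

def binomial_likelihood(n, m, p):
    """Likelihood of exactly m occurrences in n independent trials."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

# H2: separate hit probabilities for each hand (pR = 0.8, pL = 0.4)
l2 = binomial_likelihood(50, 40, 0.8) * binomial_likelihood(50, 20, 0.4)

# H1: one common hit probability for both hands (p = 60/100 = 0.6)
l1 = binomial_likelihood(50, 40, 0.6) * binomial_likelihood(50, 20, 0.6)

# Likelihood ratio statistic, -2*loge(L1/L2), tested against chi-squared with df = 1
lr_statistic = -2 * log(l1 / l2)

# Closed form for the chi-squared upper tail when df = 1 (not valid for other df)
p_value = erfc(sqrt(lr_statistic / 2))

print(round(log(l1), 3), round(log(l2), 3))
print(round(lr_statistic, 2), p_value)
```

Because the likelihoods are not rounded before taking logs, the statistic produced this way may differ in the second or third decimal place from the hand calculation above.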
Checking The Result With SPSS
We can run a logistic regression on the darts data using the Analyze > REGRESSION >>
MULTINOMIAL LOGISTIC procedure in SPSS. The key part of the printed output is shown below. The
final model corresponds to best-fitting probabilities under hypothesis H2. This model is found to
have a –2*log-likelihood (-2LL) of 8.268. When the simpler model (H1) is fitted to the data, this
reduced model corresponds to omitting the effect of the Hand variable from the full model. The
reduced model (with Hand omitted) is found to have a –2LL of 25.529. The likelihood ratio test
involves subtracting the –2LL value for the full model from the –2LL value for the reduced model.
The resulting value (in this case 17.261) is tested against the χ2 distribution, with df equal to the
difference in number of parameters between the two models (1 in this case). The result is highly
significant. The values are the same as we got by hand earlier (within rounding errors).
Model Fitting Information

Model             -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only          25.529
Final                    8.268          17.261      1   .000
Likelihood Ratio Tests

Effect       -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                   8.268                     .000      0    .
HAND                       25.529                   17.261      1   .000

The chi-square statistic is the difference in -2 log-likelihoods
between the final model and a reduced model. The reduced
model is formed by omitting an effect from the final model. The
null hypothesis is that all parameters of that effect are 0.
PART III – EXTENDING THE ANALYSES TO MULTIWAY DESIGNS
The analysis of larger contingency tables will be illustrated using a logistic analysis of a 3-way
table.
3-Way Logistic Analysis Example: more parks data
The data relate to a larger study of Parkinson’s Disease (PD) and smoking history. The data are in
the file called more parks data.sav. The Smokehis and PDstatus variables, familiar from the small
parks dataset, each now have 3 levels. The Sex of each subject is also recorded. PDstatus:
Clinicians have rated the presence or absence of the disease, or they called patients “borderline” if
they did not quite meet the full clinical criteria [3]. Smokehis: Subjects are classified into those who
have never smoked, those who gave up smoking more than 20 years ago, and those who have
smoked in the last 20 years (including current smokers). The output from the CROSSTABS procedure
(using the “layer” option) is as follows:
[3] Note that we might regard the 3 values of PDstatus as having a natural order. Logistic regression analyses, such as the
logistic analysis conducted in these notes, completely ignore this information.
smoking history * Parkinson's Disease Status * SEX Crosstabulation

Count
                                                 Parkinson's Disease Status
SEX      smoking history                      has disease   borderline   no disease   Total
female   never smoked                               9            3            5         17
         used to smoke (>20 years ago)              4            2           12         18
         smokes or smoked in last 20 years          1            3           11         15
         Total                                     14            8           28         50
male     never smoked                               8            2            6         16
         used to smoke (>20 years ago)              3            4           12         19
         smokes or smoked in last 20 years          3            2           10         15
         Total                                     14            8           28         50
In a logistic analysis the following effects of IVs on PDstatus need to be explored: Smokehis; Sex;
and Smokehis*Sex. We noted earlier that each of these effects is really an interaction with PDstatus,
and also noted that we are not interested in the true main effects of Smokehis or Sex. However, in
this type of analysis, the Smokehis*PDstatus and Sex*PDstatus effects are conventionally referred
to as main effects. The analyses reported below use a hierarchical approach to try to find the
simplest model that provides an adequate model for the data. The underlying maths is just an
extension of what we have already seen for 2-way tables.
Executing Step 1 of the Analysis
The data were analysed using the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure. In
the first step of the analysis, the goal is to test whether it is necessary to include the highest-order IV
effect (Smokehis*Sex) to adequately explain the data. To do this, PDstatus was selected into the
“Dependent variable” box and Smokehis and Sex were selected into the “Factor(s)” box. The
Statistics button was selected and, in the resulting subwindow, only the “Likelihood ratio test” and
“Goodness of fit chi-square statistics” options were checked. The Model button was selected and, in
the resulting subwindow, the “Main effects model” option was checked (this is the default).
The selection of the main effects model means that the model includes only the main effects of
Smokehis and Sex on PDstatus. The selected statistics options allow us to see a likelihood ratio test
comparing the main effects model with the more complete model including the Smokehis*Sex
effect.
The SPSS Output for Step 1 and Its Interpretation
The key printed output resulting from the first step of the analysis was as follows:

Model Fitting Information

Model             -2 Log Likelihood   Chi-Square   df   Sig.
Intercept Only          49.361
Final                   35.013          14.348      6   .026
The first part of the output (above) compares the main effects model (termed the “Final” model on
the output) with the simplest possible (“Intercept Only”) model. The intercept only model includes
neither the Smokehis nor Sex effects. The Smokehis effect on PDstatus involves a total of 4 free
parameters and the Sex effect on PDstatus involves a total of 2 free parameters [4]. The intercept only
model does not include either of these effects and therefore has 6 fewer parameters (i.e., 6 more degrees of freedom) than the
final model. The difference in –2*log-likelihoods for the two models is 14.348 (labelled Chi-Square
on the output). When compared with the χ2 distribution with 6 df, the likelihood ratio test statistic is
significant (p=0.026). This means that the best-fitting model cannot afford to drop the combination
of the Smokehis and Sex effects, although it may be the case that we do not need to include both of
these two effects in our ultimate model. We find out the answer to this question later.
The next output table shows two Goodness-of-Fit indices (GFIs). These GFIs compare the fit of the
model requested in the analysis (i.e., the main effects model) with a saturated model formed from
all the possible IV effects (i.e., a model including Smokehis, Sex and Smokehis*Sex in this case).
The Deviance GFI is the one to consult as it is a likelihood ratio test statistic comparing these two
models. The 4 df arise because the effect which differentiates the two models (i.e., Smokehis*Sex)
has four parameters in it. (When calculating the number of parameters, one must remember that this
effect is really Smokehis*Sex*PDstatus.) This test statistic (again labelled Chi-Square in the output)
has a value of 2.406, which does not even approach significance (p=0.662) when compared with the
χ2 distribution with 4 df. This tells us that there is no evidence that the interaction of the IVs Smokehis and Sex
affects PDstatus.
Goodness-of-Fit

             Chi-Square   df   Sig.
Pearson         2.345      4   .673
Deviance        2.406      4   .662
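The deviance statistic can be checked outside SPSS. The sketch below rests on an assumption not stated in these notes: that the main effects logistic model is equivalent to a loglinear model containing every two-way association among Sex, Smokehis and PDstatus but no three-way term. It fits that model by iterative proportional fitting, using the cell counts from the crosstabulation. The function and variable names are our own, and the number of fitting cycles (200) is an arbitrary choice.

```python
from math import log

# Observed counts from the crosstabulation: counts[sex][smokehis][pdstatus]
# sex: 0 = female, 1 = male
# smokehis: 0 = never smoked, 1 = quit >20 years ago, 2 = smoked in last 20 years
# pdstatus: 0 = has disease, 1 = borderline, 2 = no disease
counts = [
    [[9, 3, 5], [4, 2, 12], [1, 3, 11]],
    [[8, 2, 6], [3, 4, 12], [3, 2, 10]],
]
S, H, P = 2, 3, 3

def margin(table, keep):
    """Sum a 3-way table down to the two dimensions listed in `keep`."""
    out = {}
    for s in range(S):
        for h in range(H):
            for p in range(P):
                key = tuple((s, h, p)[d] for d in keep)
                out[key] = out.get(key, 0.0) + table[s][h][p]
    return out

# Iterative proportional fitting: start from a flat table and rescale it in turn to
# match each observed two-way margin (Sex*Smokehis, Smokehis*PDstatus, Sex*PDstatus).
fitted = [[[1.0] * P for _ in range(H)] for _ in range(S)]
for _ in range(200):
    for keep in [(0, 1), (1, 2), (0, 2)]:
        target, current = margin(counts, keep), margin(fitted, keep)
        for s in range(S):
            for h in range(H):
                for p in range(P):
                    key = tuple((s, h, p)[d] for d in keep)
                    fitted[s][h][p] *= target[key] / current[key]

# Deviance: 2 * sum(observed * loge(observed / fitted)); the saturated model is the baseline.
deviance = 2 * sum(
    counts[s][h][p] * log(counts[s][h][p] / fitted[s][h][p])
    for s in range(S) for h in range(H) for p in range(P)
)
print(f"deviance = {deviance:.3f}")
```

If the assumed equivalence holds, the printed deviance should be comparable with the Deviance value in the Goodness-of-Fit table.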
The final output table presents two further likelihood ratio tests. Each test compares the final model
(containing both Smokehis and Sex) with a simpler reduced model formed by leaving out one of the
two effects in the final model. Leaving out Smokehis from the model is associated with a likelihood
ratio statistic of 14.348, which is highly significant (p=0.006) when compared
with the χ2 distribution with 4 df (df =4 because 4 parameters are needed to specify the Smokehis
effect on PDstatus). We must therefore include the IV Smokehis in the model which captures the
data in the fewest parameters. Leaving out Sex from the model is associated with a likelihood ratio
statistic of 0.007, which does not approach significance (p=0.997) when
compared with the χ2 distribution with 2 df (df =2 because 2 parameters are needed to specify the
Sex effect on PDstatus). We are therefore able to drop the IV Sex from the model which captures the
data in the fewest parameters.
[4] A simple way of working out the number of parameters for an IV*DV effect is as follows: if A is the number of
categories in the IV and B is the number of categories in the DV, then number of parameters = (A-1)*(B-1). For the IV
Smokehis there are 3 category levels (=A), and for the DV PDstatus there are 3 category levels (=B); thus number of
parameters =4.
Likelihood Ratio Tests

Effect       -2 Log Likelihood of Reduced Model   Chi-Square   df   Sig.
Intercept                  35.013                     .000      0    .
SMOKEHIS                   49.361                   14.348      4   .006
SEX                        35.019                     .007      2   .997

The chi-square statistic is the difference in -2 log-likelihoods
between the final model and a reduced model. The reduced
model is formed by omitting an effect from the final model. The null
hypothesis is that all parameters of that effect are 0.
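The degrees of freedom and p-values reported for Step 1 can be checked with a few lines of Python. This is a minimal sketch: it uses the (A-1)*(B-1) rule for counting parameters and a closed-form expression for the upper tail of the χ2 distribution that is valid only when df is even (which covers the 2, 4 and 6 df tests here). The function names are our own.

```python
from math import exp

def n_parameters(iv_levels, dv_levels):
    """Number of free parameters for an IV*DV effect: (A-1)*(B-1)."""
    return (iv_levels - 1) * (dv_levels - 1)

def chi2_upper_tail(x, df):
    """P(chi-squared with `df` degrees of freedom > x); closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    # exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

df_smokehis = n_parameters(3, 3)  # Smokehis (3 levels) on PDstatus (3 levels)
df_sex = n_parameters(2, 3)       # Sex (2 levels) on PDstatus (3 levels)

tests = [
    ("main effects vs intercept only", 14.348, df_smokehis + df_sex),
    ("drop Smokehis", 14.348, df_smokehis),
    ("drop Sex", 0.007, df_sex),
    ("deviance goodness of fit", 2.406, 4),
]
for label, statistic, df in tests:
    p = chi2_upper_tail(statistic, df)
    print(f"{label}: chi-square = {statistic}, df = {df}, p = {p:.3f}")
```

For odd df (such as the 1 df tests used elsewhere in these notes) a different formula, or a statistics library, would be needed.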
Step 2 of the Analysis
The preceding step confirms that a model including only an effect of Smokehis is the best model for
the present data. The key contingency table is therefore given in Table 5 (i.e., it is collapsed across
both sexes). We can then fit this model and look at the parameter estimates as these can allow us to
investigate which cells in Table 5 contribute to the Smokehis effect on PDstatus. We might be
interested to know whether both groups of smokers (quit long ago vs. smoked more recently) differ
in their rates of PD, and which of these differ from nonsmokers. We might also be interested if any
effects of smoking history are confined only to the definite presence or absence of PD, or whether
there might be an effect for cases with milder, more borderline symptoms.
                                              PDstatus
Smokehis                          has disease   borderline   no disease   Row
                                     (=0)          (=1)         (=2)      Totals
never smoked (=0)                     17*            5           11*        33
last smoked >20 years ago (=1)         7             6           24         37
smoked in last 20 years (=2)           4*            5           21*        30
Column Totals                         28            16           56         Grand Total = 100

Table 5. The Smokehis by PDstatus contingency data for the more parks data dataset
(cells marked * were shaded gray in the original table)
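As a cross-check on the collapsed table, the likelihood ratio chi-squared statistic (G2, as in the A2WCD notes) for Table 5 can be computed directly. The sketch below compares the Smokehis-only model with the intercept-only model, so the result should correspond to the difference between the two relevant –2LL values in the Step 1 output (49.361 for the intercept-only model and 35.019 for the model with Sex omitted). The variable names are our own.

```python
from math import log

# Table 5 counts: rows = Smokehis (never, quit >20 years ago, smoked in last 20 years),
# columns = PDstatus (has disease, borderline, no disease).
table5 = [
    [17, 5, 11],
    [7, 6, 24],
    [4, 5, 21],
]

row_totals = [sum(row) for row in table5]
col_totals = [sum(col) for col in zip(*table5)]
grand_total = sum(row_totals)

# Expected counts if PDstatus did not depend on Smokehis (the intercept-only model).
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Likelihood ratio chi-squared: G2 = 2 * sum(observed * loge(observed / expected)).
g2 = 2 * sum(
    obs * log(obs / exp_)
    for obs_row, exp_row in zip(table5, expected)
    for obs, exp_ in zip(obs_row, exp_row)
)
df = (len(table5) - 1) * (len(table5[0]) - 1)
print(f"G2 = {g2:.3f} on {df} df")
```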
For step 2, once again we use the Analyze > REGRESSION >> MULTINOMIAL LOGISTIC procedure.
PDstatus was selected into the “Dependent variable” box and Smokehis was selected into the
“Factor(s)” box. The Statistics button was selected and, in the resulting subwindow, only the
“Parameter Estimates” option was checked. (One doesn’t need to worry about the Model button
here, because there is only one factor in our model.) The key output is as follows:
Parameter Estimates

                                                                                   95% Confidence Interval for Exp(B)
Parkinson's Disease Status       B         Std. Error   Wald    df   Sig.   Exp(B)   Lower Bound   Upper Bound
has disease   Intercept        -1.658        .546       9.239    1   .002
              [SMOKEHIS=0]      2.094        .669       9.798    1   .002    8.114      2.187        30.098
              [SMOKEHIS=1]       .426        .694        .377    1   .539    1.531       .393         5.972
              [SMOKEHIS=2]      0a           0           .       0    .       .          .             .
borderline    Intercept        -1.435        .498       8.317    1   .004
              [SMOKEHIS=0]       .647        .734        .776    1   .378    1.909       .453         8.044
              [SMOKEHIS=1]      4.88E-02     .675        .005    1   .942    1.050       .280         3.944
              [SMOKEHIS=2]      0a           0           .       0    .       .          .             .

a. This parameter is set to zero because it is redundant.
As noted earlier the parameter estimates (B) are logs of odds ratios (ORs). Each parameter is
therefore based on a comparison of 4 cells. The Parameter Estimates printed output table gives you
enough information to work out how each B parameter is calculated.
One can ignore the intercept parameters; they are not of interest. Notice that only 2 of the 3
categories of Parkinson’s Disease status are shown in the table (“has disease”=0 and
“borderline”=1). This is because the other value (“no disease”=2) acts as the reference category.
Thus the parameters in the “has disease” part of the table refer to the odds of having the disease
relative to not having the disease. The parameters in the “borderline” part of the table refer to the
odds of being borderline, once again relative to not having the disease. Notice also that there are
parameters against only 2 of the Smokehis categories: Smokehis=0 (i.e., never smoked) and
Smokehis=1 (i.e., smoked more than 20 years ago). Once again it is because the missing category
(recent smokers) acts as a reference. The parameters listed against Smokehis=0 refer to ratios of
odds for never smokers divided by odds for the reference category (recent smokers). Similarly, the
parameters listed against Smokehis=1 refer to ratios of odds for long ago quitters divided by odds
for recent smokers.
Consider the B parameter for “has disease” and Smokehis=0. This parameter therefore is the natural
logarithm of the following OR: the odds of having the disease (relative to not having the disease)
amongst never smokers divided by the odds of having the disease (relative to not having the
disease) amongst recent smokers. This is an important comparison for this research and involves the
4 cells shaded gray in Table 5 (the “has disease” and “no disease” cells for never smokers and for recent smokers). We can, therefore, calculate the OR concerned using the usual odds
ratio formula given in textbooks (and in the A2WCD notes). The value is given by
(17*21)/(4*11)=8.114. This checks with the value of Exp(B) given in the output, and loge(8.114)
also agrees with the value for B (=2.094). As noted in the A2WCD notes, the standard error (SE) of
the estimate for the loge(OR) is given by the square root of the sum of the reciprocals of the 4 cell
frequencies used to calculate the odds ratio. In this case the value is:
√(1/17 + 1/21 + 1/4 + 1/11)=0.669
(this checks with value given by SPSS). This SE is used to produce the 95% CI around the
estimated OR. Note that for this parameter our best estimate of the OR is 8.114 and we are 95%
confident that it lies between 2.2 and 30.1. This means that we can confidently reject the hypothesis
that the OR in question is 1 (the value expected if there were no effect). In this particular case, we
conclude that the odds of having PD (relative to not having PD) are about 8 times higher for never smokers than for
recent smokers, and that this difference is significant.
The SE(loge[OR]) is also used to calculate a Wald statistic that is used to test whether the OR differs
significantly from 1. The Wald statistic is simply given by (B/SE[B])^2. If the OR in question were 1,
then the Wald statistic would have a distribution approximated by the χ2 distribution with 1 df. In
this case the value of 9.798 is comfortably greater than the critical value for χ2, allowing us to reject
the hypothesis that the OR concerned is 1 (p=0.002).
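The calculations described in the last few paragraphs (OR, B, SE, 95% CI and Wald statistic) can be reproduced from the Table 5 counts with the following sketch. It assumes the same reference categories as the SPSS output (“no disease” for PDstatus and recent smokers for Smokehis) and uses 1.959964 as the two-sided 95% point of the standard normal distribution. The function and variable names are our own.

```python
from math import erfc, exp, log, sqrt

# Table 5 counts, keyed by (Smokehis, PDstatus).
table5 = {
    (0, 0): 17, (0, 1): 5, (0, 2): 11,
    (1, 0): 7,  (1, 1): 6, (1, 2): 24,
    (2, 0): 4,  (2, 1): 5, (2, 2): 21,
}
REF_SMOKE, REF_PD = 2, 2  # reference categories: recent smokers, no disease
Z_95 = 1.959964           # two-sided 95% point of the standard normal distribution

def odds_ratio_summary(smoke, pd):
    """Return OR, B, SE, Wald, p and 95% CI for one (Smokehis, PDstatus) parameter."""
    a, b = table5[(smoke, pd)], table5[(smoke, REF_PD)]
    c, d = table5[(REF_SMOKE, pd)], table5[(REF_SMOKE, REF_PD)]
    odds_ratio = (a * d) / (b * c)
    B = log(odds_ratio)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    wald = (B / se) ** 2
    p_value = erfc(sqrt(wald / 2))  # chi-squared upper tail with 1 df
    ci = (exp(B - Z_95 * se), exp(B + Z_95 * se))
    return odds_ratio, B, se, wald, p_value, ci

for pd, pd_label in [(0, "has disease"), (1, "borderline")]:
    for smoke in (0, 1):
        odds_ratio, B, se, wald, p_value, ci = odds_ratio_summary(smoke, pd)
        print(f"{pd_label}, SMOKEHIS={smoke}: B={B:.3f} SE={se:.3f} Wald={wald:.3f} "
              f"p={p_value:.3f} OR={odds_ratio:.3f} CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Each printed row can be compared with the corresponding row of the Parameter Estimates table.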
The other parameters in the Parameter Estimates table tell us that long ago smokers and recent
smokers do not differ from one another in their odds of PD relative to no PD (OR=1.53, ns), nor in
their odds of borderline PD relative to no PD (OR=1.05, ns). Similarly, never smokers and recent
smokers do not differ from one another in their odds of borderline PD relative to no PD (OR=1.91,
ns).
Question 3
We might well want to know whether long ago smokers and never smokers
differed in their odds of getting PD (or borderline PD). Can you recode the
Smokehis variable in the dataset to give the relevant ORs in the parameter
estimates table? Hint: SPSS’s MULTINOMIAL LOGISTIC procedure always uses
the highest numbered category as the reference category when calculating
parameters in these analyses.
A Warning About Parameter Estimates
The same logistic analysis model can be represented mathematically in many different (but
essentially identical) ways. Although the overall likelihood ratios and p-values are not affected, the
“alternative parametrisations” naturally lead to differing parameter estimates with different
interpretations. (This is similar to the issue of dummy and effect coding in multiple linear
regression.) The parametrisation used by SPSS MULTINOMIAL LOGISTIC recodes the IVs in the model
by using indicator variables, which take values of 0 and 1. Thus, for an IV with 3 categories (e.g.,
Smokehis in the above example), 2 indicator variables are needed. Indicator 1 will be 1 for the first
category and 0 for the others; indicator 2 will be 1 for the second category and 0 for the others. As
we have seen, the third category, given a zero in both indicator variables, will act as the reference
category in the parameter estimates table (so indicator coding here is the same as dummy coding in
multiple linear regression). Although indicator variable parametrisation is common, and leads to
easy-to-interpret parameters, other statistical packages (and indeed other procedures within SPSS)
use alternative methods. The parameter estimates obtained under packages or procedures with
different parametrisations will differ from those obtained by SPSS MULTINOMIAL LOGISTIC.
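The indicator coding just described can be illustrated as follows. This is only a sketch of the coding scheme, with the highest-numbered category as the reference; it is not a description of how SPSS implements it internally, and the function name is our own.

```python
def indicator_code(category, n_categories):
    """Indicator (dummy) coding with the highest-numbered category as the reference."""
    return tuple(1 if category == k else 0 for k in range(n_categories - 1))

# Smokehis has 3 categories, so 2 indicator variables are needed.
for category in range(3):
    print(f"Smokehis={category} -> indicators {indicator_code(category, 3)}")
```

The reference category (Smokehis=2) is the one coded zero on both indicators.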
Logistic Regression
The approach illustrated in these notes has been for logistic analysis (LA). Recall that LA is a
special case of logistic regression (LR): all the IVs in LA are categorical whereas some (or all) of
the IVs in LR can be continuous numerical variables. The relationship between LA and LR is
identical to the relationship between ANOVA and multiple linear regression. To carry out an LR
analysis including some continuous IVs, all that one needs to do when running the MULTINOMIAL
LOGISTIC procedure is to enter the continuous IVs as “covariates” and the categorical IVs as
“factors”. The Computer Class for week 10 will explore a LR analysis with a mixture of categorical
and continuous IVs, and this should illustrate all the minor differences between LR and LA.
Question 4
Can you use the MULTINOMIAL LOGISTIC procedure with the small sample data
on predicting a skier's fall from the difficulty of the run and the season of the
year (this example is from chapter 12 of Tabachnick and Fidell)? These data
are on the J drive under the file name tab and fid log reg example.sav. Hint: to
get the same parameter estimates as them you will have to use the recoded fall
and season variables fallrev and seasonr respectively. (N.B. Although the
likelihood ratio test statistics and p-values come out the same as in the book,
the absolute log-likelihoods differ slightly owing to a minor difference in how
they are calculated.)
A few brief comments about LR are warranted. Firstly, the OR parameters obtained for a
continuous IV reflect the increase (or decrease) in odds of a particular category in the DV (relative
to a reference category) for a one unit increase in the continuous IV. In other words, the odds ratios
compare the odds of a particular DV category for subjects with a continuous IV score of (x+1) with
the odds of the DV category for subjects with a continuous IV score of x.
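A small numerical illustration may help here. The intercept and B coefficient below are invented purely for illustration (they do not come from any dataset in these notes); the point is only that the ratio of the odds at x+1 to the odds at x equals Exp(B), whatever the value of x.

```python
from math import exp

# Hypothetical coefficients for one DV category versus the reference category.
intercept = -1.2    # assumed value, for illustration only
b_covariate = 0.35  # assumed change in log odds per unit of the continuous IV

def odds(x):
    """Odds of the DV category (relative to the reference) at covariate score x."""
    return exp(intercept + b_covariate * x)

for x in (0.0, 4.0, 10.0):
    ratio = odds(x + 1) / odds(x)
    print(f"x = {x}: odds = {odds(x):.3f}, odds at x+1 = {odds(x + 1):.3f}, ratio = {ratio:.3f}")

print(f"Exp(B) = {exp(b_covariate):.3f}")
```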
Secondly, it is well-known that contingency table analyses using statistics such as the Pearson χ2
test, need adequate expected frequencies in all or most cells of the table. A related recommendation
is that few (or none) of the cells in the table should have zero observed frequencies. If these
conditions are not met, the statistics calculated may not be well approximated by the χ2 distribution
and so the resulting p-values may be inaccurate. The same issue applies to logistic regression, and
indeed the SPSS MULTINOMIAL LOGISTIC procedure prints a warning about zero observed frequency
cells (when you have the Goodness-of-fit option selected under the Statistics button). When one has
a continuous IV in a LR, then that IV can potentially take many differing values across the whole
dataset. The LR analysis is based on a so-called “covariate pattern” which is formed by crossing
each observed value of each continuous IV (and each value of each factor) in the model with every
value of the categorical DV in the model. Any cell in this covariate pattern, which does not have at
least one observation in it, is a zero observed frequency cell in terms of logistic regression.
There are not likely to be (m)any zero frequency cells in the covariate pattern if you have very large
samples and/or continuous IVs with a small range of possible values. However, in other cases, it
should be obvious that using continuous IVs will often produce empty cells -- particular values of
the continuous variable might be rare or unique in the dataset and thus will not occur with each
value of the DV.
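The idea of empty cells in the covariate pattern can be made concrete with a toy example. The data below are hypothetical (12 subjects, one continuous IV and a 3-category DV); the sketch simply crosses every observed covariate value with every DV category and counts the cells with no observations.

```python
from collections import Counter
from itertools import product

# Hypothetical data: (covariate score, DV category) for 12 subjects.
observations = [
    (3.1, 0), (3.1, 1), (3.1, 2), (4.0, 0), (4.0, 2), (4.7, 2),
    (5.2, 0), (5.2, 0), (5.2, 1), (6.8, 2), (7.3, 1), (7.3, 2),
]

cell_counts = Counter(observations)
covariate_values = sorted({x for x, _ in observations})
dv_categories = sorted({y for _, y in observations})

# Cross every observed covariate value with every DV category and find the empty cells.
all_cells = list(product(covariate_values, dv_categories))
empty_cells = [cell for cell in all_cells if cell_counts[cell] == 0]

print(f"{len(empty_cells)} of {len(all_cells)} cells are empty")
for cell in empty_cells:
    print(f"  covariate = {cell[0]}, DV category = {cell[1]}")
```

Even in this tiny dataset a sizeable fraction of the cells are empty, because several covariate values occur only once or twice.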
One can minimise the zero-cell problem by reducing the continuous IV scores into a small number
of ordered values (e.g., quartile scores). For the particular IV, and quartile scoring, all subjects with
scores in the lowest 25% of the sample are given a value of 1 for the recoded IV; subjects in the
next 25% are given a score of 2 and so on. When this kind of recoded continuous IV is entered as a
covariate in a LR this preserves the ordinal and interval nature of subjects’ (recoded) scores on that
IV. This contrasts with a categorical IV (factor) where the category levels have no particular
numerical relationship to one another. We will carry out this kind of recoding when analysing the
LR example in the computer class.
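The quartile recoding described above might be carried out along the following lines outside SPSS. The scores are hypothetical, and the cut points come from Python's `statistics.quantiles` with its default method; other packages may place the cut points slightly differently, so this is a sketch rather than a reproduction of any particular SPSS recode.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Hypothetical continuous IV scores for 20 subjects.
scores = [12.4, 7.9, 15.2, 9.3, 21.0, 5.6, 18.7, 11.1, 8.2, 14.8,
          6.4, 19.9, 10.5, 16.3, 13.0, 22.5, 4.8, 17.1, 9.9, 20.2]

# Three cut points that split the sample into four groups of equal size.
cut_points = quantiles(scores, n=4)

def quartile_score(x):
    """Recode a raw score as 1 (lowest quarter of the sample) to 4 (highest quarter)."""
    return bisect_right(cut_points, x) + 1

recoded = [quartile_score(x) for x in scores]
print("cut points:", cut_points)
print("group sizes:", sorted(Counter(recoded).items()))
```

The recoded variable takes only the values 1 to 4, and it would be entered as a covariate (not a factor) to preserve its ordering.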
Thirdly, a “full factorial” model in these analyses is conventionally considered to include all the
main effects and interactions formed by the categorical IVs (factors) plus all the main effects of the
continuous IVs (covariates). Interactions between covariates or between covariates and factors are
not included in such a specification, and so if one wants to explore these effects one will have to use
the “Custom model” option in SPSS.
Finally, the nature of the saturated model in logistic regression may not be obvious. (Recall that
saturated models are uninteresting models which are perfectly able to capture the observed DV
frequencies. They can do this because they have the same number of parameters as there are
independent data points and so have df=0.) The way that continuous IVs (covariates) are specified
in the model is very efficient: we use only (m-1) parameters across the whole range of covariate
values, where m is the number of categories of the DV. Hence, the df of likelihood ratio tests
relating to the removal of a covariate from a model will be (m-1). As already noted, for a covariate
and a particular pair of categories of the DV, the parameter specifies the increase in odds of one of
the DV categories (relative to the other) for each unit increase in covariate score. The saturated
model, by contrast, requires a different parameter for each value of the covariate but one [5] (as if it
were being treated like a categorical factor). So, in your dataset if there are k different observed
values of the covariate, the saturated model will include (m-1)*(k-1) parameters for each covariate.
(Note that k is not necessarily the same as the number of possible values for the covariate, as there
may well be no subjects in your data who score at particular covariate values.) The saturated model
will also include (m-1)*(k-1)*(j-1) parameters for each interaction with each of the other effects in
the model, where j is the number of observed values in the dataset for the other covariate or factor
(there would also be parameters for higher order interaction terms as well, if present). We can see
that the number of parameters for saturated models with covariates can rapidly become very large if
the covariates are not recoded into a small number of values.
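The parameter counts given in this paragraph are easy to tabulate. In the sketch below the values of m, k and j are hypothetical, chosen only to show how quickly the saturated model grows and how much recoding a covariate into quartiles reduces it. The function names are our own.

```python
def covariate_parameters(m):
    """Parameters used by one covariate in the fitted model: (m-1)."""
    return m - 1

def saturated_covariate_parameters(m, k):
    """Parameters the saturated model needs for one covariate: (m-1)*(k-1)."""
    return (m - 1) * (k - 1)

def saturated_interaction_parameters(m, k, j):
    """Saturated-model parameters for the interaction with one other effect: (m-1)*(k-1)*(j-1)."""
    return (m - 1) * (k - 1) * (j - 1)

# Hypothetical example: a 3-category DV (m), a covariate with 40 distinct observed
# values (k), and a 2-level factor (j).
m, k, j = 3, 40, 2
print("fitted model, covariate:", covariate_parameters(m))
print("saturated model, covariate:", saturated_covariate_parameters(m, k))
print("saturated model, covariate*factor:", saturated_interaction_parameters(m, k, j))

# The same covariate recoded into quartile scores has only k = 4 observed values.
print("saturated model, recoded covariate:", saturated_covariate_parameters(m, 4))
```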
The (deviance) goodness-of-fit tests for a specific model are log-likelihood ratio tests which
compare the log-likelihood of obtaining the observed data based on the specific model with the log-likelihood based on the saturated model [6]. The df for this test are the difference between the number
of parameters needed for the specific model and the larger number of parameters needed for the
saturated model. Without very large samples of subjects, these tests will be unreliable if the
covariate scores are not recoded into a small number of possible values.
[5] This is the usual “minus one” rule for df. If we have k different values of the covariate we need parameters to specify
the (conditional) probability of a particular DV outcome at k-1 of these values. The probability for the final (kth) value
of the covariate is not free to vary because the probabilities must sum to 1.
[6] The likelihood of the observed data given the saturated model is 1 and so the log-likelihood is 0.