NUIT, Newcastle University
IBM SPSS STATISTICS for Windows Intermediate / Advanced
A Training Manual for Intermediate / Experienced Users, Faculty of Medical Sciences
Dr S. T. Kometa
Table of Contents
Ordinary Regression
Repeated Measures Analysis
Data Analysis Using Crosstabulation Techniques
Types of Survival Analysis / Kaplan-Meier
The Ordinal Regression Model
Binary Logistic Regression
Ordinary Linear Regression Model with Two Independent
Variables
Why fit a regression model?
 To build a model for predicting the outcome variable for a new sample of data.
 To see how well the independent (explanatory) variables explain the dependent
(response) variable.
 To identify which subset of many independent variables is most effective for
estimating the dependent variable.
Open the data set called world95.sav. To do this, follow these instructions:
1. Select Start -> Programs -> Statistical Software -> IBM SPSS Statistics -> IBM
SPSS Statistics 19.
2. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
3. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
4. Select the file world95.sav and click on Open.
5. Spend some time to study the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
6. Are there any missing values in the data? Yes No
Assumptions for Ordinary Linear Regression
 All observations should be independent.
 Your data should not suffer from multicollinearity. That is, the independent variables
should not be highly correlated. To find out if your data suffer from multicollinearity,
look at the tolerances for each of the independent variables in the model. These are
printed if you select Collinearity Diagnostics in the Linear Regression Statistics
dialogue box. If any of the tolerances are small, less than 0.1 for example,
multicollinearity may be a problem.
 Residuals from the model fit should follow a normal distribution.
 Each of the independent (explanatory or predictor) continuous variables should have a
linear relationship with the dependent (response or outcome) variable. It is always a
good idea to check this assumption using scatterplots.
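The tolerance figure mentioned above can be understood with a small calculation: the tolerance of a predictor is 1 − R² from regressing that predictor on the remaining predictors, which with only two predictors reduces to 1 − r². The sketch below is pure Python with invented data, not part of the SPSS workflow:

```python
# Sketch of the tolerance statistic behind Collinearity Diagnostics:
# tolerance of a predictor = 1 - R^2 from regressing it on the other
# predictors; with two predictors this is 1 - r^2. Data are invented.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # nearly an exact multiple of x1
tolerance = 1 - pearson_r(x1, x2) ** 2
print(tolerance)   # well below 0.1, so multicollinearity is a problem here
```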
Simple Linear Regression
Is the female literacy of a country useful in predicting its life expectancy? We want to build
a model of the form:
Average female life expectancy = b0 + b1 × female literacy + ε
where Average female life expectancy [lifeexpf] is the dependent (response, y, or outcome)
variable, females who can read (%) [lit_fema] is the independent (explanatory or predictor)
variable, b0 is the intercept of the line of best fit, b1 is its slope and ε is the error term.
Is there a linear relationship between average female life expectancy and female literacy?
Produce a scatter plot to help you answer this question.
To produce the output for the regression model, from the menus choose:
Analyze -> Regression -> Linear….
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: female who can read (%) [lit_fema]
Statistics…
Descriptives
Make sure that Estimates and Model fit are selected.
Select Collinearity diagnostics
Residuals
Casewise diagnostics
Select Outliers outside 1.0 standard deviations
Plots…
Y: *ZRESID
X: *ZPRED
Click Next
Y: *ZPRED
X: Dependent
Select Histogram and Normal probability plot
These steps will generate lots of output. Now examine the output and attempt to interpret it.
Look at the table Descriptive Statistics. What will you conclude?
Look at the table Correlations. What are the hypotheses being tested? What will you
conclude?
Look at the table Model Summary. What do you conclude?
Look at the table ANOVA. Explain what the Degrees of Freedom (DF), Sums of Squares (SS)
and Mean Squares (MS) represent. How are they related?
State the hypotheses being tested in the ANOVA table. How is the test statistic calculated and
what would your decision be?
Look at the table Coefficients. What do you conclude? Write an equation for the
regression model and use it to predict the average female life expectancy of a country
whose female literacy is 86%. What are the hypotheses being tested?
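The coefficients SPSS reports are ordinary least-squares estimates, which can be sketched in a few lines of pure Python. The values below are invented for illustration; the real coefficients come from fitting lit_fema to lifeexpf in SPSS, so your prediction will differ:

```python
# Least-squares fit of life expectancy on female literacy, in pure Python.
# The (x, y) values are invented; real estimates come from world95.sav.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx           # intercept: the line passes through the means
    return b0, b1

literacy = [20, 40, 55, 70, 85, 95]    # female literacy (%)
life_exp = [45, 55, 62, 68, 74, 78]    # female life expectancy (years)
b0, b1 = fit_line(literacy, life_exp)
prediction = b0 + b1 * 86              # predicted life expectancy at 86% literacy
```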
The last two columns of the Coefficient table give information about collinearity
statistics. Looking at the Tolerance, can you say if there is any problem with
multicollinearity?
The rest of the output deals with the residuals. This helps to find out if the assumptions for
running a linear regression are met and to identify any outliers or influential cases.
Look at the table Casewise Diagnostics. What is a standardised residual? What do you
conclude?
Look at the table Residual Statistics. What do you conclude?
Look at the Histogram and Normal P-P Plot. What do you conclude about the
residuals?
Now look at the two Scatter Plots. What do you conclude?
Can you think of any restriction when using your model to predict female life
expectancy?
How would you validate a model like this?
Multiple Linear Regression
While a simple linear regression can have just one independent variable, a multiple linear
regression can have more than one independent variable. The following is a model with two
independent variables:
Average female life expectancy = b0 + b1 × infant mortality + b2 × fertility + ε
where infant mortality (deaths per 1000 live births) [babymort] is the number of infants who
die during their first year per thousand live births, and average number of kids [fertilty] is
the average number of children per family.
We found that literacy explained 67% of the variability in life expectancy. Now we examine
a model using infant mortality [babymort] and fertility [fertilty] to predict life expectancy.
To run the analysis, select
Analyze -> Regression -> Linear….Click on Reset.
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: average number of kids [fertilty], infant mortality (deaths per 1000 live births)
[babymort]
Case Labels: country
Statistics…
Descriptives
Make sure that Estimates and Model fit are selected.
Select Collinearity diagnostics
Plots…
Produce all partial plots
Save...
Predicted values: Standardised
Look at the table Descriptive Statistics. What do you conclude?
Look at the table Correlations. What do you conclude?
Look at the table Model Summary. What do you conclude?
Look at the ANOVA table. What do you conclude?
Look at the table Coefficients. What do you conclude? Write an equation for the
regression model and use it to predict the female life expectancy of a country whose
fertility is 3 and infant mortality is 23 per 1000 live births.
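With two predictors the least-squares estimates still have a closed form, obtained by solving the normal equations on the centred variables. The numbers below are invented, not the actual world95.sav values, so the coefficients will differ from your SPSS output:

```python
# Two-predictor least-squares fit via the closed-form normal-equation
# solution on centred variables. All data are invented for illustration.
def fit_two_predictors(x1, x2, y):
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

mortality = [10, 25, 40, 60, 90, 120]      # infant deaths per 1000 live births
fertility = [1.5, 2.0, 3.0, 4.0, 5.5, 6.5]
life_exp = [80, 75, 70, 63, 55, 48]
b0, b1, b2 = fit_two_predictors(mortality, fertility, life_exp)
prediction = b0 + b1 * 23 + b2 * 3         # mortality 23, fertility 3
```

Note that both slopes come out negative, as expected: higher infant mortality and higher fertility are both associated with lower life expectancy.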
Repeated Measures Analysis of Variance
Does the anxiety rating of a person affect performance on a learning task? Twelve subjects
were assigned to one of two anxiety groups on the basis of an anxiety test, and the number of
errors made in four blocks of trials on a learning task was measured. We use repeated
measures analysis of variance technique to study the data.
Open the SPSS data file called anxiety2. Notice that there is one case for each subject and
four trials variables (trial1, trial2, trial3 and trial4).
In the repeated measures analysis-of-variance technique, we distinguish two types of factors
in the model: between-subjects factors and within-subjects factors. A between-subjects
factor, as the name suggests, divides the subjects into discrete subgroups, for example anxiety
in this data file. Anxiety divides the cases into two groups: high anxiety scores and low
anxiety scores. A within-subjects factor is any factor that distinguishes measurements made
on the same subject. For example, trial distinguishes the four measurements taken for each
subject.
To produce the output in this example, from the menus choose:
Analyze
General Linear Model
Repeated Measures…
Within-Subject Factors name: replace factor1 with trial
Number of Levels: 4
Click Add and Click Define
Within-Subjects Variables (trial): trial1, trial2, trial3 and trial4
Between-Subjects Factor(s): anxiety
Options…
Select Homogeneity tests
Contrasts…
Factors: trial
Contrasts: Repeated (click Change)
Click on Continue and then OK.
Examine the results and try to interpret them.
Between-Subject Test
The test of between-subjects effects is shown in the table Tests of Between-Subjects
Effects. Examine this table. What do you conclude?
Multivariate Tests
The multivariate table contains tests of the within-subjects factor, trial, and the interaction of
the within-subjects factor and the between-subjects factor, trial*anxiety.
Examine the Multivariate Tests table. What do you conclude?
Assumptions
The vector of the dependent variables follows a normal distribution, and the variance-covariance matrices are equal across the cells formed by the between-subjects effects.
The test for this assumption is shown on the table Box’s test of Equality of Covariance
Matrices. Examine this table, what do you conclude?
It is assumed that the variance-covariance matrix of the dependent variables is circular.
The test of this assumption is shown on the table Mauchly’s Test of Sphericity. Examine
this table, what do you conclude?
If the test of sphericity is not satisfied, use Greenhouse-Geisser, Huynh-Feldt or Lower-bound to draw your conclusion.
Now let us look at the table of Tests of Within-Subjects Effects. Examine the table, what
can you conclude?
Contrasts
A repeated contrast compares each level of trial with the subsequent level. The first
column (Source) indicates the effect being tested. For example, the label trial tests the
hypothesis that, averaged over the two anxiety groups, the mean of the specified contrast is
zero.
The second column, trial, represents the contrasts. For example, Level 1 vs Level 2 represents
the transformation trial1 − trial2. This compares the first level of trial with the second level
of trial, and so on.
The label trial*anxiety tests the hypothesis that the mean of the specified contrast is the same
for the two anxiety groups.
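What the repeated contrast computes can be shown directly: each contrast is simply the difference between one trial and the next. The error scores below are invented, not taken from the anxiety2 data file:

```python
# Each "repeated" contrast is the difference between a trial and the
# subsequent one: trial1 - trial2, trial2 - trial3, trial3 - trial4.
def repeated_contrasts(trials):
    return [trials[j] - trials[j + 1] for j in range(len(trials) - 1)]

subject_scores = [18, 14, 12, 6]            # invented errors in trials 1-4
print(repeated_contrasts(subject_scores))   # [4, 2, 6]
```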
Now look at the Tests of Within-Subjects Contrasts. What do you conclude?
Data Analysis Using Crosstabulation Techniques in SPSS
Introduction
Crosstabulation is a powerful technique that helps you to describe the relationships between
categorical (nominal or ordinal) variables. With Crosstabulation, we can produce the
following statistics:
 Observed Counts and Percentages
 Expected Counts and Percentages
 Residuals
 Chi-Square
 Relative Risk and Odds Ratio for a 2 x 2 table
 Kappa Measure of agreement for an R x R table
Examples will be used to demonstrate how to produce these statistics using SPSS. The data
set used for the demonstration comes with SPSS and it is called GSS_93.sav. It has 67
variables and 1500 cases (observations). Open this data file which is located in the SPSS
folder. Study the data file in order to understand it before performing the following exercises.
Exercise 1: An R x C Table with Chi-Square Test of Independence
Chi-Square tests the hypothesis that the row and column variables are independent, without
indicating the strength or direction of the relationship. Like most statistical tests, the Chi-Square test requires certain assumptions to be met. They are:
 No cell should have an expected value (count) less than 1, and
 No more than 20% of the cells should have expected values (counts) less than 5.
In the SPSS file, there is a variable called relig short for religion (Protestant, Catholic,
Jewish, None, Other) and another one called region4 (Northeast, Midwest, South, West). In
this example, we want to find out if religious preferences vary by region of the country.
To produce the output, from the menu choose:
Analyze -> Descriptive Statistics -> Crosstabs….
Row(s): Religious Preferences [relig]
Column(s): Region [region4]
Statistics… select Chi-Square, click Continue then OK
In the SPSS output, Pearson chi-square, likelihood-ratio chi-square, and linear-by-linear
association chi-square are displayed. Fisher's exact test and Yates' corrected chi-square are
computed for 2x2 tables.
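The Pearson chi-square statistic in that output is built from observed and expected counts: expected = row total × column total / grand total, and the statistic sums (observed − expected)² / expected over all cells. A pure-Python sketch on an invented 2 x 2 table, not on the GSS_93.sav data:

```python
# Pearson chi-square for an R x C table: expected counts come from the
# marginal totals, and df = (rows - 1) * (columns - 1). Counts invented.
def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

table = [[30, 20],
         [10, 40]]          # invented 2 x 2 example
chi2, df = chi_square(table)
```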
State the null and alternative hypothesis that is being tested.
Examine the output. What conclusion can you draw from the output?
However, you will notice that certain assumptions are not met. The results could be
misleading. What should you do? We will discuss this further in example 2 below.
Example 2: Percentages, Expected Values, Residuals, and Omitting Categories
From the last example, we noticed that 40% of the cells had expected counts less than 5. So
this assumption was violated. Since Other and Jewish had just 15 cases each, we can drop
them out of the analysis by using Select Cases. In other words, the religious preference is
restricted to Protestant, Catholic and None.
To produce the output, use Select Cases from the Data menu to select cases with relig not
equal to 3 and relig not equal to 5 (relig ~=3 & relig ~=5). Call up the dialogue box for
Crosstabs. Reset it to default and select:
Row(s): Region [region4]
Column(s): Religious Preferences [relig]
Statistics…
Select Chi-Square
Nominal: select Contingency coefficient, Phi and Cramer’s V, Lambda,
Uncertainty coefficient, click Continue
Cells…
Counts: select Expected
Percentages: select Row
Residuals: select Adjusted Standardized, click Continue then OK
Now examine the output and try to interpret it.
You can pivot the table so each group of statistics appears in its own panel. To demonstrate:
double-click the table and drag region from the row tray to the right of statistics.
Look at the Region4*Religion Preference Crosstabulation. What can you conclude?
Look at the Chi-Square Tests table. What can you conclude?
Look at the Symmetric Measures table. What can you conclude about the strength of
the relationship between religion preference and region?
Examine the table Directional Measures what do you conclude?
Example 3: Tests Within Layers of a Multiway Table
A multiway table allows you to examine the relationship between two categorical variables
within levels of a controlling variable. For example, is the relationship between marital status
and view of life the same for males and females? This example shows you how to answer this
type of question in SPSS.
Use Select Cases from the Data menu to select cases with marital not equal to 4 (marital ~=
4).
Can you think of any reason why we have decided to exclude cases where marital status
is equal to 4 (i.e. separated)?
Call up the Crosstabs dialogue box. Click Reset to restore the dialogue box defaults. Then
select:
Row(s): Marital status [marital]
Column(s): Is Life Exciting or Dull [life]
Layer 1 of 1: Respondent’s Sex [sex]
Statistics…
Select Chi-Square
Cells…
Counts: select Expected
Percentages: select Row
Residuals: select Standardized and Adjusted Standardized
Examine the results and try to interpret them.
Is there a relationship between marital status and view on life? Is this relationship the
same between male and female?
Example 4: The Relative Risk and Odds Ratio for a 2 x 2 Table
The Relative Risk for 2 x 2 tables is a measure of the strength of the association between the
presence of a factor and the occurrence of an event. If the confidence interval for the statistic
includes the value 1, you cannot assume that the factor is associated with the event. The odds
ratio can be used as an estimate of relative risk when the occurrence of the event is rare.
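For a 2 x 2 table with the factor on the rows and the event in the first column, both statistics have simple formulas: relative risk compares the event proportions in the two rows, and the odds ratio is the cross-product ratio. The counts below are invented, not from GSS93:

```python
# Relative risk and odds ratio for a 2 x 2 table laid out as
#               event  no event
#   factor    [   a  ,    b    ]
#   no factor [   c  ,    d    ]
def risk_estimates(a, b, c, d):
    rr = (a / (a + b)) / (c / (c + d))   # ratio of event proportions
    oddsr = (a * d) / (b * c)            # cross-product (odds) ratio
    return rr, oddsr

rr, oddsr = risk_estimates(20, 80, 10, 90)   # invented counts
```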
In the GSS93 data file, there is a variable (dwelown) that measures home ownership (owner or
renter) and another variable (vote92) that measures voting (voted or did not vote). We would
like to find out whether home owners are more likely to vote than renters.
In Variable View, note all the codes that have been used for the two variables of
interest. For example, dwelown uses code 3 for other and code 8 for don't know, while vote92
uses code 3 for not eligible and code 4 for refused. Select the cases with dwelown less than 3
and vote92 less than 3.
From the menus choose:
Data -> Select Cases
Select If condition is satisfied and click If.
Enter dwelown < 3 & vote92 < 3 as the condition and click Continue then OK.
In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then
select:
Row(s): Homeowner or Renter [dwelown]
Column(s): Voting in 1992 Election [vote92]
Cells…
Percentages select Row, click Continue then OK
Examine and interpret the output.
From the crosstabulation table, what can you conclude?
Recall the crosstabs dialogue box. In the Crosstabs dialogue box, select:
Statistics…
Select Risk, click Continue
Examine the output and interpret it.
Look at the table called Risk Estimate, what can you conclude?
The odds ratio should be used as an approximation to the relative risk when the following
conditions are met:
 The probability of the event is small (<0.1). This condition guarantees that the odds
ratio will make a good approximation to the relative risk.
 The design of the study is case-control.
These conditions are not met in the present example. In a smoking and lung cancer study, the
conditions would be met, and you could use the odds ratio.
Example 5: The Kappa Measure of Agreement for an R x R Table
Cohen's kappa measures the agreement between the evaluations of two raters when both are
rating the same objects. A value of 1 indicates perfect agreement. A value of 0 indicates that
agreement is no better than chance. Values of kappa greater than 0.75 indicate excellent
agreement beyond chance; values between 0.40 and 0.75 indicate fair to good agreement; and
values below 0.40 indicate poor agreement. Kappa is only available for tables in which both
variables use the same category values and both variables have the same number of
categories.
The table structure for the kappa statistic is a square R x R table with the same row and
column categories, because each subject is classified or rated twice. For example, doctor A
and doctor B diagnose the same patients as schizophrenic, manic depressive, or behaviour-disordered. Do the two doctors agree or disagree in their diagnoses? Two teachers assess a
class of 18-year-old students. Do the teachers agree or disagree in their assessments?
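Kappa compares the observed proportion of agreement (the diagonal of the table) with the proportion expected by chance from the row and column totals. A pure-Python sketch on an invented 3 x 3 rating table, not the padeg/madeg data:

```python
# Cohen's kappa for a square agreement table: rows are rater A's
# categories, columns are rater B's. All counts are invented.
def cohens_kappa(table):
    n = sum(sum(row) for row in table)
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

table = [[40, 5, 5],
         [5, 30, 5],
         [0, 5, 25]]
kappa = cohens_kappa(table)   # about 0.68: fair-to-good agreement
```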
In the GSS93 subset data file, we have variables that assess the educational level of the
respondent's father (padeg) and mother (madeg). Is there any agreement between the father's
and mother's educational levels?
To produce the output, use Select Cases from the Data menu to select cases with madeg not
equal to 2 and padeg not equal to 2 (madeg ~= 2 & padeg ~= 2). In the Crosstabs dialogue
box, click Reset to restore the dialogue box defaults, and then select:
Row(s): Father’s Highest Degree [padeg]
Column(s): Mother’s Highest Degree [madeg]
Statistics…
Select kappa, click Continue
Cells…
Percentages: select Total, click Continue then OK
Examine and interpret the output.
Look at the tables from the output. What can you conclude?
Intraclass Correlation Coefficients (ICC)
We can use ICC to assess inter-rater agreement when there are more than two raters. For
example, the International Olympic Committee (IOC) trains judges to assess gymnastics
competitions. How can we find out if the judges are in agreement? ICC can help us to answer
this question. Judges have to be trained to ensure that good performances receive higher
scores than average performances, and average performances receive higher scores than poor
performances; even though two judges may differ on the precise score that should be
assigned to a particular performance.
Use the data set judges.sav to illustrate how to use SPSS to calculate ICC.
Open the data set. From the menu select:
Analyze -> Scale -> Reliability
Items: judge1, judge2, judge3, judge4, judge5, judge6, judge7
Statistics… Under Descriptives for, check Item.
Check Intraclass correlation coefficient
Model: Two-Way Random
Type: Consistency
Confidence interval: 95%
Test value: 0
Examine and interpret the output. What would you conclude?
Types of Survival Analyses and when to use them in SPSS
Life Tables: Use life tables if cases can be classified into meaningful, equal time intervals.
A life table can be used to calculate the probability of a terminal event during any interval
under study.
Kaplan-Meier: Use this technique if cases cannot be classified into equal time intervals as
above. This is common in many clinical and experimental studies.
Cox Regression: Use this technique if you want to see the relation between survival time and
a predictor variable, for instance age or tumour type.
Using Kaplan-Meier Survival Analysis to Test Competing Pain
Relief Treatments
A pharmaceutical company is developing an anti-inflammatory medication for treating
chronic arthritic pain. Of particular interest is the time it takes for the drug to take effect and
how it compares to an existing medication. Shorter times to effect are considered better.
The results of a clinical trial are collected in pain_medication.sav. This data file is stored in
the following folder: \\campus\software\dept\spss. Open the file and study it. Use Kaplan-Meier Survival Analysis to examine the distribution of "time to effect" and compare the
effectiveness of the two treatments.
To run a Kaplan-Meier Survival Analysis, from the menus choose:
Analyze
Survival
Kaplan-Meier...
 Select Time to effect [time] as the Time variable.
 Select Effect status [status] as the Status variable.
 Click Define Event.
 Under Value(s) Indicating Event Has Occurred, type 1 in the text area next to Single value:.
 Click Continue.
 Select Treatment [treatment] as a Factor.
 Click Compare Factor.
 Select Log rank, Breslow, and Tarone-Ware.
 Click Continue.
 Click Options in the Kaplan-Meier dialog box.
 Select Quartiles in the Statistics group and Survival in the Plots group.
 Click Continue.
 Click OK in the Kaplan-Meier dialog box.
Interpretation
Survival Table
The survival table is a descriptive table that details the time until the drug takes effect. The
table is sectioned by each level of Treatment, and each observation occupies its own row in
the table.
Time: The time at which the event or censoring occurred.
Status: Indicates whether the case experienced the terminal event or was censored.
Cumulative Proportion Surviving at the Time: The proportion of cases surviving from the
start of the table until this time. When multiple cases experience the terminal event at the
same time, these estimates are printed once for that time period and apply to all the cases
whose drug took effect at that time.
N of Cumulative Events: The number of cases that have experienced the terminal event from
the start of the table until this time.
N of Remaining Cases: The number of cases that, at this time, have yet to experience the
terminal event or be censored.
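The cumulative proportion surviving in the table is the Kaplan-Meier product-limit estimate: at each event time, the previous estimate is multiplied by the fraction of at-risk cases that survive that time. A minimal pure-Python sketch with invented (time, event) data, assuming distinct event times (not the pain_medication.sav data):

```python
# Kaplan-Meier product-limit estimate. Each observation is (time, event),
# where event = 1 means the drug took effect and event = 0 means censored.
# This sketch assumes at most one event per time point. Data are invented.
def kaplan_meier(observations):
    observations = sorted(observations)
    n_at_risk = len(observations)
    survival, curve = 1.0, []
    for time, event in observations:
        if event == 1:
            survival *= (n_at_risk - 1) / n_at_risk   # step down at each event
            curve.append((time, survival))
        n_at_risk -= 1                                # censored cases leave too
    return curve

data = [(2, 1), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1)]
curve = kaplan_meier(data)   # (time, cumulative proportion surviving) pairs
```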
Survival Functions (Curves)
The survival curves give a visual representation of the life tables. The horizontal axis shows
the time to event. In this plot, drops in the survival curve occur whenever the medication
takes effect in a patient. The vertical axis shows the probability of survival. Thus, any point
on the survival curve shows the probability that a patient on a given treatment will not have
experienced relief by that time.
The plot for the New drug lies below that of the Existing drug throughout most of the trial,
which suggests that the new drug may give faster relief than the old one. To determine
whether these differences are due to chance, look at the comparisons tables.
Means and Medians for Survival Time
The means and medians for survival time table offers a quick numerical comparison of the
"typical" times to effect for each of the medications. Since there is a lot of overlap in the
confidence intervals, it is unlikely that there is much difference in the "average" survival
time.
Percentiles
The percentiles table gives estimates of the first quartile, median, and third quartile of the
survival distribution. The interpretation of percentiles for survival curves is that the 75th
percentile is the latest time that at least 75 percent of the patients have yet to feel relief.
Overall Comparisons
This table provides overall tests of the equality of survival times across groups. Since the
significance values of the tests are all greater than 0.05, there is no evidence of a difference
between the survival curves.
Summary
With the Kaplan-Meier Survival Analysis procedure, you have examined the distribution of
time to effect for two different medications. The comparison tests show that there is not a
statistically significant difference between them.
Recommended Readings
1. Hosmer, D. W., and S. Lemeshow. 1999. Applied Survival Analysis. New York: John
Wiley and Sons.
2. Kleinbaum, D. G. 1996. Survival Analysis: A Self-Learning Text. New York:
Springer-Verlag.
3. Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion. Upper
Saddle River, N.J.: Prentice Hall, Inc.
The Ordinal Regression Model
Generalized linear models are a very powerful class of models, which can be used to answer
a wide range of statistical questions. The basic form of a generalized linear model is shown in
the following equation.
link(γij) = θj − [β1xi1 + β2xi2 + ... + βpxip]
where
link( ) is the link function
γij is the cumulative probability of the jth category for the ith case
θj is the threshold for the jth category
p is the number of regression coefficients
xi1 ... xip are the values of the predictors for the ith case
β1 ... βp are the regression coefficients
There are several important things to notice here.
 The model is based on the notion that there is some latent continuous outcome
variable, and that the ordinal outcome variable arises from discretizing the underlying
continuum into ordered groups. The cutoff values that define the categories are
estimated by the thresholds. In some cases, there is good theoretical justification for
assuming such an underlying distribution. However, even in cases in which there is no
theoretical concept that links to the latent variable, the model can still perform quite
well and give valid results.
 The thresholds or constants in the model (corresponding to the intercept in linear
regression models) depend only on which category's probability is being predicted.
Values of the predictor (independent) variables do not affect this part of the model.
 The prediction part of the model depends only on the predictors and is independent of
the outcome category. These first two properties imply that the results will be a set of
parallel lines or planes, one for each category of the outcome variable.
 Rather than predicting the actual cumulative probabilities, the model predicts a
function of those values. This function is called the link function, and you choose the
form of the link function when you build the model. This allows you to choose a link
function based on the problem under consideration to optimize your results. Several
link functions are available in the Ordinal Regression procedure.
As you can see, these are very powerful and general models. Of course, there is also a bit
more to keep track of here than in a typical linear regression model. There are three major
components in an ordinal regression model:
Location component: The portion of the equation shown above that includes the
coefficients and predictor variables is called the location component of the model. The
location is the "meat" of the model. It uses the predictor variables to calculate predicted
probabilities of membership in the categories for each case.
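How the location component produces cumulative probabilities can be sketched with a logit link: P(Y ≤ j) is the inverse link applied to θj − x·β. The thresholds and coefficients below are invented for illustration, not estimates from german_credit.sav:

```python
import math

# Cumulative probabilities from a location-only ordinal model with a
# logit link: P(Y <= j) = 1 / (1 + exp(-(theta_j - x.beta))).
# Thresholds and coefficients are invented, not fitted values.
def cumulative_probs(thresholds, coefs, x):
    location = sum(b * v for b, v in zip(coefs, x))
    return [1 / (1 + math.exp(-(t - location))) for t in thresholds]

thetas = [-2.0, -0.5, 1.0, 2.5]   # J - 1 thresholds for J = 5 categories
betas = [0.03, -0.2]              # hypothetical coefficients for two predictors
probs = cumulative_probs(thetas, betas, [30, 2])   # e.g. age 30, 2 credits
# The cumulative probabilities increase across categories; on the logit
# scale the category "lines" are parallel, as described above.
```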
Scale component: The scale component is an optional modification to the basic model to
account for differences in variability for different values of the predictor variables. For
example, if men have more variability than women in their account status values, using a
scale component to account for this may improve your model. The model with a scale
component follows the form shown in this equation.
link (ij ) 
j   xi1   2 xi 2  ...  pxiJ
1
e 1zi1...mzim
where
zi1...zim are scale component predictors (a subset of the x's)
τ1...τm are scale component coefficients
Link function: The link function is a transformation of the cumulative probabilities that
allows estimation of the model. Five link functions are available in the ordinal regression
procedure, summarized in the following table.
Function                  Form                Typical Application
Logit                     log(γ / (1 − γ))    evenly distributed categories
Complementary log-log     log(−log(1 − γ))    higher categories more probable
Negative log-log          −log(−log(γ))       lower categories more probable
Probit                    Φ⁻¹(γ)              latent variable is normally distributed
Cauchit (inverse Cauchy)  tan(π(γ − 0.5))     latent variable has many extreme values
Using Ordinal Regression to Build a Credit Scoring Model
A creditor wants to be able to determine whether an applicant is a good credit risk, given
various financial and personal characteristics, using information from their customer
database. The outcome (dependent) variable is account status, with five ordinal levels: no
debt history, no current debt, debt payments current, debt payments past due, and critical
account. Potential predictors consist of various financial and personal characteristics of
applicants, including age, number of credits at the bank, housing type, checking account
status, and so on.
This information is collected in german_credit.sav. Use Ordinal Regression to build a model
for scoring applicants.
Constructing a Model
Constructing your initial ordinal regression model entails several decisions. First, of course,
you need to identify the ordinal outcome variable. Then, you need to decide which predictors
to use for the location component of the model. Next, you need to decide whether to use a
scale component and, if you do, what predictors to use for it. Finally, you need to decide
which link function best fits your research question and the structure of the data.
Identifying the Outcome Variable
In most cases, you will already have a specific target variable in mind by the time you begin
building an ordinal regression model. After all, the reason you use an ordinal regression
model is that you know you want to predict an ordinal outcome. In this example, the ordinal
outcome is Account status, with five categories: No debt history, No current debt, Payments
current, Payments delayed, and Critical account.
Note that this particular ordering may not, in fact, be the best possible ordering of the
outcomes. You can easily argue that a known customer with no current debt, or with
payments current, is a better credit risk than a customer with no known credit history.
Choosing Predictors for the Location Model
The process of choosing predictors for the location component of the model is similar to the
process of selecting predictors in a linear regression model. You should take both theoretical
and empirical considerations into account in selecting predictors. Ideally, your model would
include all of the important predictors and none of the others. In practice, you often don't
know exactly which predictors will prove to be important until you build the model. In that
case, it's usually better to start off by including all of the predictors that you think might be
important. If you discover that some of those predictors seem not to be helpful in the model,
you can remove them and re-estimate the model.
In this case, previous experience and some preliminary exploratory analysis have identified
five likely predictors: age, duration of loan, number of credits at the bank, other instalment
debts, and housing type. You will include these predictors in the initial analysis and then
evaluate the importance of each predictor. Number of credits, other instalment debts, and
housing type are categorical predictors, entered as factors in the model. Age and duration of
loan are continuous predictors, entered as covariates in the model.
Scale Component
The next decision has two stages. The first decision is whether to include a scale component
in the model at all. In many cases, the scale component will not be necessary, and the
location-only model will provide a good summary of the data. In the interests of keeping
things simple, it's usually best to start with a location-only model, and add a scale component
only if there is evidence that the location-only model is inadequate for your data. Following
this philosophy, you will begin with a location-only model, and after estimating the model,
decide whether a scale component is warranted.
Choosing a Link Function
To choose a link function, it is helpful to examine the distribution of values for the
outcome variable. To create a bar chart for Account status [chist], from the menus choose:
Graphs -> Bar...
 Click Define.
 Select % of cases in the Bars Represent group.
 Select Account status [chist] as the variable to plot on the Category Axis.
 Click OK.
The resulting bar chart shows the distribution for the account status categories. The bulk of
cases are in the higher categories, especially categories 3 (payments current) and 5 (critical
account). The higher categories are also where most of the "action" is, since the most
important distinctions from a business perspective are between categories 3, 4, and 5. For this
reason, you will begin with the complementary log-log link function, since that function
focuses on the higher outcome categories. The high number of cases in the extreme category
5 (critical account) indicates that the Cauchit distribution might be a reasonable alternative.
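The candidate link functions can be compared numerically. The following Python sketch (purely illustrative; SPSS does all of this internally) shows the inverse of each link, i.e. the function that converts a linear predictor into a cumulative probability:

```python
import math

def inv_logit(z):
    # Inverse logit link: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def inv_cloglog(z):
    # Inverse complementary log-log link: 1 - exp(-exp(z))
    return 1.0 - math.exp(-math.exp(z))

def inv_cauchit(z):
    # Inverse Cauchit link: the CDF of the standard Cauchy distribution
    return 0.5 + math.atan(z) / math.pi

# The complementary log-log curve approaches 1 faster than the logit,
# which is why it emphasises the higher outcome categories, while the
# heavy-tailed Cauchit accommodates extreme categories.
for z in (-2.0, 0.0, 2.0):
    print(z, round(inv_logit(z), 3), round(inv_cloglog(z), 3), round(inv_cauchit(z), 3))
```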
Running the Analysis
To run the Ordinal Regression analysis, from the menus choose:
Analyze -> Regression -> Ordinal...
 Select Account status [chist] as the Dependent variable.
 Select # of existing credits [numcred], Other installment debts [othnstal], and
Housing [housing] as Factors.
 Select Age in years [age] and Duration in months [duration] as Covariates.
 Click Options.
 Select Complementary Log-Log as the Link function.
 Click Continue.
 Click Output in the Ordinal Regression dialog box.
 Select Test of Parallel Lines in the Display group.
 Select Predicted Category in the Saved Variables group.
 Click Continue.
 Click OK in the Ordinal Regression dialog box.
Evaluating the Model
The first thing you see in the output is a warning about cells with zero frequencies. The
reason this warning comes up is that the model includes continuous covariates. Certain fit
statistics for the model depend on aggregating the data based on unique predictor and
outcome value patterns. For instance, all cases where the applicants have current payments on
debt, one other credit at the bank, own their home, have no other instalment debts, are 49
years old and are seeking a 12-month loan are combined to form a cell.
However, because Duration in months and Age in years are both continuous, most cases have
unique values for those variables. This results in a very large table with many empty cells,
which makes it difficult to interpret some of the fit statistics. You have to be careful in
evaluating this model, particularly when looking at chi-square-based fit statistics.
For relatively simple models with a few factors, you can display information about individual
cells by selecting Cell Information on the Output dialog box. However, this is not
recommended for models with many factors (or factors with many levels), or models with
continuous covariates, since such models typically result in very large tables. Such large
tables are often of limited value in evaluating the model, and they can take a long time to
process.
Predictive Value of the Model
Before you start looking at the individual predictors in the model, you need to find out if the
model gives adequate predictions. To answer this question, you can examine the Model-Fitting
Information table.
Here you see the -2 log-likelihood values for the intercept only (baseline) model and the final
model (with the predictors).
While the log-likelihood statistics themselves are suspect due to the large number of empty
cells in the model, the difference of log-likelihoods can usually still be interpreted as a
chi-square-distributed statistic (McCullagh and Nelder, 1989). The chi-square reported in the
table is just that: the difference between -2 times the log-likelihood for the intercept-only
model and that for the final model, within rounding error.
The significant chi-square statistic indicates that the model gives a significant improvement
over the baseline intercept-only model. This basically tells you that the model gives better
predictions than if you just guessed based on the marginal probabilities for the outcome
categories. That's a good sign, but what you really want to know is how much better the
model is.
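As a back-of-the-envelope check, the chi-square in the Model-Fitting Information table is simply the difference of the two -2 log-likelihood values. A Python sketch with made-up -2 log-likelihood values (your output will show different numbers, and the degrees of freedom depend on how many parameters the predictors add):

```python
# Likelihood-ratio chi-square from the Model-Fitting Information table.
# The -2 log-likelihood values below are illustrative placeholders, not
# the actual numbers from the german_credit.sav output.
neg2ll_intercept_only = 1200.0   # baseline (intercept-only) model
neg2ll_final = 846.7             # final model with predictors

lr_chi_square = neg2ll_intercept_only - neg2ll_final
print(round(lr_chi_square, 1))  # 353.3

# Compare against the chi-square critical value for the model's degrees
# of freedom (the number of parameters added by the predictors; df = 8
# is assumed here for illustration).
critical_value_df8_05 = 15.507  # upper 5% point of chi-square with 8 df
print(lr_chi_square > critical_value_df8_05)  # True: significant improvement
```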
Chi-Square-Based Fit Statistics
The next table in the output is the Goodness-of-Fit table. This table contains Pearson's
chi-square statistic for the model and another chi-square statistic based on the deviance.
These statistics are intended to test whether the observed data are inconsistent with the fitted
model. If they are not (that is, if the significance values are large), then you would conclude
that the data and the model predictions are similar and that you have a good model.
These statistics can be very useful for models with a small number of categorical predictors.
Unfortunately, these statistics are both sensitive to empty cells. When estimating models with
continuous covariates, there are often many empty cells, as in this example. Therefore, you
shouldn't rely on either of these test statistics with such models. Because of the empty cells,
you can't be sure that these statistics will really follow the chi-square distribution, and the
significance values won't be accurate.
Pseudo R-Squared Measures
The next tool to turn to in assessing the overall goodness of fit of the model is the pseudo
r-squared measures. These measures attempt to serve the same function as the coefficient of
determination in linear regression models, namely to summarize the proportion of variance in
the dependent variable associated with the predictor (independent) variables. For ordinal
regression models, these measures are based on likelihood ratios rather than raw residuals.
Three different methods are used to estimate the coefficient of determination.
Cox and Snell's r-squared (Cox and Snell, 1989) is a well-known generalization of the usual
measure designed to apply when maximum likelihood estimation is used, as with ordinal
regression. However, with categorical outcomes, it has a theoretical maximum value of less
than 1.0. For this reason, Nagelkerke (Nagelkerke, 1991) proposed a modification that allows
the index to take values in the full zero-to-one range. McFadden's r-squared (McFadden,
1974) is another version, based on the log-likelihood kernels for the intercept-only model and
the full estimated model.
Here, the pseudo r-squared values are respectable but leave something to be desired. It will
probably be worth the effort to revise the model to try to make better predictions.
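If you want to reproduce the three pseudo r-squared measures by hand, they can all be computed from the two log-likelihoods and the sample size. A Python sketch with placeholder values (not the actual german_credit.sav output):

```python
import math

# Pseudo r-squared measures from log-likelihoods. The values below are
# illustrative placeholders, not the actual german_credit.sav output.
ll_intercept = -600.0   # log-likelihood of the intercept-only model
ll_final = -423.35      # log-likelihood of the final model
n = 1000                # number of cases

# Cox and Snell (1989): generalisation of R^2 for maximum likelihood fits;
# its theoretical maximum for categorical outcomes is below 1.0.
cox_snell = 1.0 - math.exp(2.0 * (ll_intercept - ll_final) / n)

# Nagelkerke (1991): rescales Cox and Snell so the index can reach 1.0.
max_cox_snell = 1.0 - math.exp(2.0 * ll_intercept / n)
nagelkerke = cox_snell / max_cox_snell

# McFadden (1974): based on the ratio of the log-likelihood kernels.
mcfadden = 1.0 - ll_final / ll_intercept

print(round(cox_snell, 3), round(nagelkerke, 3), round(mcfadden, 3))
```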
Classification Table
The next step in evaluating the model is to examine the predictions generated by the model.
Recall that the model is based on predicting cumulative probabilities. However, what you're
probably most interested in is how often the model can produce correct predicted categories
based on the values of the predictor variables. To see how well the model does, you can
construct a classification table (also called a confusion matrix) by cross-tabulating the
predicted categories with the actual categories. You can create a classification table in
another procedure, using the saved model-predicted categories.
To produce the classification table, from the menus choose:
Analyze -> Descriptive Statistics -> Crosstabs…
 Choose Account status [chist] as Row variable.
 Choose Predicted Response Category [PRE_1] as the Column variable.
 Click Cells.
 Select Row under Percentages group.
 Click Continue.
 Click OK.
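Outside SPSS, the same classification table can be built by counting observed/predicted pairs. A minimal Python sketch with made-up observed and predicted categories (the real table uses chist and the saved PRE_1 variable):

```python
from collections import Counter

# Sketch of the classification table (confusion matrix) computed by hand.
# 'observed' stands in for the actual account status and 'predicted' for
# the saved model-predicted category; the values here are invented.
observed  = [3, 3, 5, 5, 3, 2, 5, 3, 1, 4]
predicted = [3, 3, 5, 3, 3, 3, 5, 3, 5, 3]

counts = Counter(zip(observed, predicted))
row_totals = Counter(observed)

# Row percentages: of the cases observed in each category, what share
# the model assigned to each predicted category.
for (obs, pred), n in sorted(counts.items()):
    pct = 100.0 * n / row_totals[obs]
    print(f"observed {obs} -> predicted {pred}: {pct:.1f}%")
```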
The model seems to be doing a respectable job of predicting outcome categories, at least for
the most frequent categories: category 3 (debt payments current) and category 5 (critical
account). The model correctly classifies 90.6% of the category 3 cases and 75.1% of the
category 5 cases. In addition, cases in category 2 are more likely to be classified as category
3 than category 5, a desirable result for predicting ordinal responses.
On the other hand, category 1 (no credit history) cases are somewhat poorly predicted, with
the majority of cases being assigned to category 5 (critical account), a category that should
theoretically be most dissimilar to category 1. This may indicate a problem in the way the
ordinal outcome scale is defined. In the interest of brevity, you will not pursue this issue
further here, but in an actual data analysis situation, you would probably want to investigate
this and try to discover whether the ordinal scale itself could be improved by reordering,
merging, or excluding certain categories.
Test of Parallel Lines
For location-only models, the test of parallel lines can help you assess whether the
assumption that the parameters are the same for all categories is reasonable. This test
compares the estimated model with one set of coefficients for all categories to a model with a
separate set of coefficients for each category.
You can see that the general model (with separate parameters for each category) gives a
significant improvement in the model fit. This can be due to several things, including use of
an incorrect link function or a mis-specified model.
It is also possible that the poor model fit is due to the chosen ordering of the categories of the
dependent variable. An ordering that places No debt history as a greater credit risk may have
a better fit. It would also be interesting to examine this data file using the Multinomial
Logistic Regression procedure, since it allows you to avoid the ordering issues and also
allows different effects of predictors.
Evaluating the Choice of Link Function
Often, there will not be a clear theoretical choice of link function based on the data. In cases
where the initial model performs poorly, it is usually worth trying alternative link functions to
see if a better model can be constructed with a different link function. Although some of the
link functions perform quite similarly in many instances (particularly the logit,
complementary log-log and negative log-log functions), there are situations where choice of
link function can make or break your model.
In this example, there are at least two link functions (complementary log-log and Cauchit)
that may be appropriate. Although the model does fairly well with the complementary log-log
link, it might be possible to improve the model fit by using the Cauchit link function.
You can now estimate a new model with a Cauchit link function to see whether the change
increases the predictive utility of the model. It is recommended to keep the same set of
predictor variables in the model until you have finished evaluating link functions. If you
change the link function and the set of predictors at the same time, you won't know which of
them caused any change in model fit.
Revising the Model
 Recall the Ordinal Regression dialog box.
 Click Options in the Ordinal Regression dialog box.
 Select Cauchit as the Link function.
 Click Continue.
 Click OK in the Ordinal Regression dialog box.
Model-Fitting Information
The significance level for the chi-square statistic is less than 0.05, indicating that the Cauchit
model is better than simple guessing.
The chi-square statistic for the Cauchit link (459.860) is larger than that for the
complementary log-log link (353.336). This suggests that the Cauchit link is better.
Pseudo R-squared Measures
The pseudo r-squared statistics are larger for the Cauchit link than the complementary log-log
link, which further suggests that the Cauchit link is better.
Classification Table
The Cauchit model seems to be slightly better at predicting the lower categories (1, 2, and 3)
and slightly worse at predicting the higher categories than the previous model. You can check
this by recalling the Crosstabs dialogue box and replacing the column variable with Predicted
Response Category [PRE_2]. Since the most important goal of credit scoring is to correctly identify
accounts that are likely to become critical (category 5), you would probably choose to retain
the complementary log-log model, even though the fit statistics favor the Cauchit model.
Interpreting the Model
Having chosen the model with the Complementary log-log link, you can make some
interpretations based on the parameter estimates.
The significance of the test for Age in years is less than 0.05, suggesting that its observed
effect is not due to chance.
By contrast, Duration in months adds little to the model.
While there is no single category of NUMCRED that is significant on its own, there are two
that are marginally significant. Usually, it is worth keeping such a variable in the model,
since the small effects of each category accumulate and provide useful information to the
model.
OTHNSTAL also seems to be an important predictor on empirical grounds.
On the other hand, HOUSNG doesn't seem to contribute to the model in a meaningful way
and could probably be dropped without substantially worsening the model.
While direct interpretation of the coefficients in this model is difficult due to the nature of the
link function, the signs of the coefficients can give important insights into the effects of the
predictors in the model. The signs essentially indicate the direction of the effect.
Positive coefficients (such as that for age) indicate positive relationships between predictors
and outcome. In this example, as age increases, so does the probability of being in one of the
higher categories of account status.
Negative coefficients (such as that for the first category of numcred) indicate inverse
relationships. In this model, for example, those with one credit at the bank are likely to be in
the lower outcome categories.
Summary : Using the Model to Make Predictions
Because the model attempts to predict cumulative probabilities rather than category
membership, two steps are required to get predicted categories. First, for each case, the
probabilities must be estimated for each category. Second, those probabilities must be used to
select the most likely outcome category for each case.
The probabilities themselves are estimated by using the predictor values for a case in the
model equations and taking the inverse of the link function. The result is the cumulative
probability for each group, conditional on the pattern of predictor values for the case. The
probabilities for individual categories can then be derived by taking the differences of the
cumulative probabilities for the groups in order. In other words, the probability for the first
category is the first cumulative probability; the probability for the second category is the
second cumulative probability minus the first; the probability for the third category is the
third cumulative probability minus the second; and so on.
For each case, the predicted outcome category is simply the category with the highest
probability, given the pattern of predictor values for that case. For example, suppose you
have an applicant who wants a 48-month loan (duration), is 22 years old (age), has one credit
with the bank (numcred), has no other instalment debt (othnstal), and owns her home
(housng). Inserting these values into the prediction equations, this applicant has predicted
values of -2.78, -1.95, 0.63, and 0.97. (Remember that there is one equation for each category
except the last.)
For example, the model equation for the first category is:
link(y) = -3.549 - [-0.002(duration) + 0.015(age) - 1.134(numcred) + 0(othnstal) +
0.132(housing)]
Write down the model equations for the other categories.
To get -2.78, substitute the applicant's values into the equation:
link(y) = -3.549 - [-0.002(48) + 0.015(22) - 1.134 + 0 + 0.132]
Note that for the factor variables we use only the coefficients corresponding to the
applicant's categories.
Taking the inverse of the complementary log-log link function gives the cumulative
probabilities of 0.06, 0.13, 0.85, and 0.93 (and, of course, 1.0 for the last category).
The inverse of the complementary log-log link is 1 - e^(-e^z), so to get 0.06 from -2.78 we
compute 1 - e^(-e^(-2.78)) ≈ 0.06.
Taking differences gives the following individual category probabilities: category 1: .06,
category 2: 0.13-0.06=0.07, category 3: 0.85-0.13=0.72, category 4: 0.93-0.85=0.08, and
category 5: 1.0-0.93=0.07. Clearly, category 3 (debt payments current) is the most likely
category for this case according to the model, with a predicted probability of 0.72. Thus, you
would predict that this applicant would keep her payments current and the account would not
become critical.
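The whole two-step prediction above can be sketched in a few lines of Python, starting from the four predicted link values reported for this applicant. Note that differencing the unrounded cumulative probabilities gives 0.71 for category 3 rather than the 0.72 obtained from the rounded figures:

```python
import math

# Predicted link values for the 22-year-old applicant (one equation per
# category except the last), as given in the worked example.
link_values = [-2.78, -1.95, 0.63, 0.97]

def inv_cloglog(z):
    # Inverse complementary log-log link: cumulative prob = 1 - exp(-exp(z))
    return 1.0 - math.exp(-math.exp(z))

# Step 1: cumulative probabilities (the last category is always 1.0).
cumulative = [inv_cloglog(z) for z in link_values] + [1.0]
print([round(p, 2) for p in cumulative])  # [0.06, 0.13, 0.85, 0.93, 1.0]

# Step 2: difference the cumulative probabilities to get individual
# category probabilities, then pick the largest.
probs = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
print([round(p, 2) for p in probs])       # [0.06, 0.07, 0.71, 0.08, 0.07]
predicted_category = probs.index(max(probs)) + 1
print(predicted_category)                 # 3: debt payments current
```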
Recommended Readings
See the following text for more information on generalized linear models for ordinal data:
1. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models, 2nd ed. London:
Chapman & Hall.
2. Cox, D. R., and E. J. Snell. 1989. The Analysis of Binary Data, 2nd ed. London:
Chapman and Hall.
3. McFadden, D. 1974. Conditional logit analysis of qualitative choice behavior. In:
Frontiers in Econometrics, P. Zarembka, ed. New York: Academic Press.
Binary Logistic Regression Model
In this type of model you estimate the probability of an event occurring. The model can be
written as:
Prob(event) = 1 / (1 + e^(-z))
For a single independent variable
z = b0 + b1x1
For multiple independent variables:
z = b0 + b1x1 + b2x2 + … + bnxn
where b0, b1, b2, … are coefficients estimated from the data, x1, x2, … are the independent
variables, n is the number of independent variables and e is the base of natural logarithms
(approximately 2.718).
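As a quick numerical illustration of these formulas, the sketch below computes Prob(event) for a single independent variable with hypothetical coefficients (b0 = -1.5 and b1 = 0.8 are invented for illustration, not estimated from any data):

```python
import math

# Minimal sketch of the binary logistic model with a single predictor.
# The coefficients are hypothetical, purely to show how z maps to a
# probability between 0 and 1.
b0, b1 = -1.5, 0.8   # intercept and slope (invented)
x1 = 2.0             # value of the single independent variable

z = b0 + b1 * x1
prob_event = 1.0 / (1.0 + math.exp(-z))
print(round(prob_event, 3))  # 0.525
```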
Exercise
The data held in the file cancer.sav are from a study reported by Brown (1980) and are
commonly cited in texts considering binary logistic regression. The prognosis for prostate
cancer is based upon whether or not the cancer has spread to the surrounding lymph nodes. In
this classic study Brown et al. (see Brown, 1980) explored the following separate indicators
for lymph node involvement in a group of 53 men known to have prostate cancer. To open
the data file, follow these instructions:
1. From SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on
Open.
3. Select the file cancer.sav and click on Open.
4. Spend some time studying the data file. How many cases and variables make up the
data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
The variables (corresponding to columns in the data file) are:
1) age - age of patients in years.
2) acid - level of serum acid phosphatase (in King-Armstrong units)
3) xray - x-ray result (0 = negative, 1 = positive)
4) size - size of tumour (0 = small, 1 = large)
5) stage - stage of tumour (0 = less serious, 1 = more serious)
6) nodes - nodal involvement (0 = not involved, 1 = involved)
Modelling
Carry out a Forward Conditional logistic regression analysis of the data using nodal
involvement as the dependent variable and the other variables as independent variables (i.e.
covariates). You do not need to define xray, size or stage as being categorical variables, since
they are already binary variables. Follow these steps to carry out the Forward Conditional
binary logistic regression:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Method: Forward Conditional
Use the output to answer the following questions.
Look at the table Case Processing Summary. What do you conclude?
Now look at the three tables under Block 0: Beginning Block. What do you conclude?
Now look at the tables under Block 1: Method=Forward (stepwise) conditional. What do
you conclude?
Give the logistic regression equation for the final model.
Predictions
Carry out another logistic regression analysis of the data using nodal involvement as the
dependent variable but this time including ALL the covariates in the model, i.e. using the
ENTER method. Also request the Odds Ratio (OR) and the 95% Confidence Interval (CI) of
OR. Follow these steps:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Save: Under Predicted Values select Probabilities
Options: select CI for exp(B)
Method: Enter
1. Give the coefficients for the full model, i.e. including all the variables. [Normally
you would only consider the statistically significant variables].
2. Which coefficients are statistically significant and why?
3. What is the probability of nodal involvement for each man in the data set? Which
case has the highest probability and which case the lowest probability of nodal
involvement?
4. Select one significant variable and give its OR and 95% CI. How would you interpret
the OR and its 95% CI?
Reference
Brown, B. W., Jr et al. 1980 Prediction Analyses for Binary Data. In Biostatistics Casebook,
New York: John Wiley and Sons.