NUIT, Newcastle University
IBM SPSS STATISTICS for Windows
Intermediate / Advanced
A Training Manual for Intermediate / Experienced Users
Faculty of Medical Sciences
Dr S. T. Kometa

Table of Contents
Ordinary Regression
Repeated Measures Analysis
Data Analysis Using Crosstabulation Techniques
Types of Survival Analysis / Kaplan-Meier
The Ordinal Regression Model
Binary Logistic Regression

Ordinary Linear Regression Model with Two Independent Variables

Why fit a regression model?
- To build a model for predicting the outcome variable for a new sample of data.
- To see how well the independent (explanatory) variables explain the dependent (response) variable.
- To identify which subset of many independent variables is most effective for estimating the dependent variable.

Open the data set called world95.sav. To do this, follow these instructions:
1. Select Start -> Programs -> Statistical Software -> IBM SPSS Statistics -> IBM SPSS Statistics 19.
2. From the SPSS menu bar select File -> Open -> Data… and a dialogue box will appear.
3. In the text area for File name: type \\campus\software\dept\spss and then click on Open.
4. Select the file world95.sav and click on Open.
5. Spend some time studying the data file. How many cases and variables make up the data file? Cases:…….. Variables:………
6. Are there any missing values in the data? Yes No

Assumptions for Ordinary Linear Regression
- All observations should be independent.
- Your data should not suffer from multicollinearity; that is, the independent variables should not be highly related. To find out whether your data suffer from multicollinearity, look at the tolerances for each of the independent variables in the model. These are printed if you select Collinearity diagnostics in the Linear Regression Statistics dialogue box. If any of the tolerances are small (less than 0.1, for example), multicollinearity may be a problem.
- Residuals from the model fit should follow a normal distribution.
- Each of the independent (explanatory or predictor) continuous variables should have a linear relationship with the dependent (response or outcome) variable. It is always a good idea to check this assumption using scatterplots.

Simple Linear Regression

Is the female literacy of a country useful in predicting its life expectancy? We want to build a model of the form:

Average female life expectancy = b0 + b1 * female literacy + ε

where Average female life expectancy (lifeexpf) is the dependent (response, y, or outcome) variable, females who can read (%) (lit_fema) is the independent (explanatory or predictor) variable, b0 is the intercept of the line of best fit, b1 is its slope and ε is the error term.

Is there a linear relationship between average female life expectancy and female literacy? Produce a scatter plot to help you answer this question.

To produce the output for the regression model, from the menus choose:
Analyze -> Regression -> Linear…
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: females who can read (%) [lit_fema]
Statistics…
  Select Descriptives
  Make sure that Estimates and Model fit are selected.
  Select Collinearity diagnostics
  Residuals: select Casewise diagnostics
  Select Outliers outside 1.0 standard deviations
Plots…
  Y: *ZRESID  X: *ZPRED
  Click Next
  Y: *ZPRED  X: Dependent
  Select Histogram and Normal probability plot

These steps will generate a lot of output. Now examine the output and attempt to interpret it.
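The model the steps above fit can also be sketched outside SPSS. The following is a minimal Python illustration with scipy, using invented literacy and life-expectancy values rather than the world95.sav data, including the 86% literacy prediction asked for later:

```python
# Minimal sketch of a simple linear regression, assuming invented data
# (these numbers are NOT from world95.sav).
import numpy as np
from scipy import stats

lit_fema = np.array([30.0, 45.0, 60.0, 75.0, 86.0, 95.0, 99.0])   # literacy (%)
lifeexpf = np.array([48.0, 55.0, 62.0, 68.0, 72.0, 77.0, 80.0])   # life expectancy

res = stats.linregress(lit_fema, lifeexpf)
print(f"lifeexpf = {res.intercept:.2f} + {res.slope:.3f} * lit_fema")
print(f"R-squared = {res.rvalue**2:.3f}, p = {res.pvalue:.4g}")

# Predict life expectancy for a country whose female literacy is 86%.
pred_86 = res.intercept + res.slope * 86
print(f"Predicted life expectancy at 86% literacy: {pred_86:.1f}")
```

The slope, intercept, R-squared and p-value correspond to the entries SPSS prints in the Coefficients and Model Summary tables.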
Look at the table Descriptive Statistics. What do you conclude?

Look at the table Correlations. What are the hypotheses being tested? What do you conclude?

Look at the table Model Summary. What do you conclude?

Look at the table ANOVA. Explain what the Degrees of Freedom (DF), Sums of Squares (SS) and Mean Squares (MS) represent. How are they related? State the hypotheses being tested in the ANOVA table. How is the test statistic calculated and what would your decision be?

Look at the table Coefficients. What do you conclude? Write an equation for the regression model and use it to predict the average female life expectancy of a country whose female literacy is 86%. What are the hypotheses being tested?

The last two columns of the Coefficients table give information about collinearity statistics. Looking at the Tolerance, can you say whether there is any problem with multicollinearity?

The rest of the output deals with the residuals. This helps you to find out whether the assumptions for running a linear regression are met and to identify any outliers or influential cases.

Look at the table Casewise Diagnostics. What is a standardised residual? What do you conclude?

Look at the table Residuals Statistics. What do you conclude?

Look at the Histogram and Normal P-P Plot. What do you conclude about the residuals?

Now look at the two scatter plots. What do you conclude? Can you think of any restriction when using your model to predict female life expectancy?

How would you validate a model like this?

Multiple Linear Regression

While a simple linear regression has just one independent variable, a multiple linear regression can have more than one independent variable.
The following is a model with two independent variables:

Average female life expectancy = b0 + b1 * infant mortality + b2 * fertility + ε

where infant mortality (deaths per 1000 live births) [babymort] is the number of infant deaths during the first year per thousand live births and average number of kids [fertilty] is the average number of children per family.

We found that literacy explained 67% of the variability in life expectancy. Now we examine a model using infant mortality (babymort) and fertility (fertilty) to predict life expectancy.

To run the analysis, select Analyze -> Regression -> Linear… and click on Reset.
Dependent Variable: Average female life expectancy [lifeexpf]
Independent: average number of kids [fertilty], infant mortality (deaths per 1000 live births) [babymort]
Case Labels: country
Statistics…
  Select Descriptives
  Make sure that Estimates and Model fit are selected.
  Select Collinearity diagnostics
Plots…
  Produce all partial plots
Save…
  Predicted Values: Standardized

Look at the table Descriptive Statistics. What do you conclude?

Look at the table Correlations. What do you conclude?

Look at the table Model Summary. What do you conclude?

Look at the ANOVA table. What do you conclude?

Look at the table Coefficients. What do you conclude? Write an equation for the regression model and use it to predict the female life expectancy of a country whose fertility is 3 and infant mortality is 23 per 1000 live births.

Repeated Measures Analysis of Variance

Does the anxiety rating of a person affect performance on a learning task? Twelve subjects were assigned to one of two anxiety groups on the basis of an anxiety test, and the number of errors made in four blocks of trials on a learning task was measured. We use the repeated measures analysis of variance technique to study the data.

Open the SPSS data file called anxiety2. Notice that there is one case for each subject and four trial variables (trial1, trial2, trial3 and trial4).
In the repeated measures analysis of variance technique, we distinguish two types of factors in the model: between-subjects factors and within-subjects factors. A between-subjects factor, as the name suggests, divides the subjects into discrete subgroups; for example, anxiety in this data file divides the cases into a high anxiety score group and a low anxiety score group. A within-subjects factor is any factor that distinguishes measurements made on the same subject; for example, trial distinguishes the four measurements taken for each subject.

To produce the output in this example, from the menus choose:
Analyze -> General Linear Model -> Repeated Measures…
Within-Subject Factor Name: replace factor1 with trial
Number of Levels: 4
Click Add and click Define
Within-Subjects Variables (trial): trial1, trial2, trial3 and trial4
Between-Subjects Factor(s): anxiety
Options…
  Select Homogeneity tests
Contrasts…
  Factors: trial
  Contrast: Repeated (click Change)
Click on Continue and then OK.

Examine the results and try to interpret them.

Between-Subjects Test
The test of between-subjects effects is shown in the table Tests of Between-Subjects Effects. Examine this table. What do you conclude?

Multivariate Tests
The multivariate table contains tests of the within-subjects factor, trial, and of the interaction of the within-subjects factor with the between-subjects factor, trial*anxiety. Examine the Multivariate Tests table. What do you conclude?

Assumptions
The vector of the dependent variables follows a normal distribution, and the variance-covariance matrices are equal across the cells formed by the between-subjects effects. The test for this assumption is shown in the table Box's Test of Equality of Covariance Matrices. Examine this table; what do you conclude?

It is assumed that the variance-covariance matrix of the dependent variables is circular. The test of this assumption is shown in the table Mauchly's Test of Sphericity. Examine this table; what do you conclude?
If the test of sphericity is not satisfied, use Greenhouse-Geisser, Huynh-Feldt or Lower-bound to make your conclusion. Now let us look at the table Tests of Within-Subjects Effects. Examine the table; what can you conclude?

Contrasts
A repeated measures contrast compares one level of trial with the subsequent level. The first column (Source) indicates the effect being tested. For example, the label trial tests the hypothesis that, averaged over the two anxiety groups, the mean of the specified contrast is zero. The second column, trial, represents the contrasts: for example, Level 1 vs Level 2 represents the transformation trial 1 – trial 2, which compares the first level of trial with the second level of trial, and so on. The label trial*anxiety tests the hypothesis that the mean of the specified contrast is the same for the two anxiety groups.

Now look at the Tests of Within-Subjects Contrasts. What do you conclude?

Data Analysis Using Crosstabulation Techniques in SPSS

Introduction
Crosstabulation is a powerful technique that helps you to describe the relationships between categorical (nominal or ordinal) variables. With crosstabulation, we can produce the following statistics:
- Observed counts and percentages
- Expected counts and percentages
- Residuals
- Chi-square
- Relative risk and odds ratio for a 2 x 2 table
- Kappa measure of agreement for an R x R table

Examples will be used to demonstrate how to produce these statistics using SPSS. The data set used for the demonstration comes with SPSS and is called GSS_93.sav. It has 67 variables and 1500 cases (observations). Open this data file, which is located in the SPSS folder. Study the data file in order to understand it before performing the following exercises.

Exercise 1: An R x C Table with Chi-Square Test of Independence
Chi-square tests the hypothesis that the row and column variables are independent, without indicating the strength or direction of the relationship.
Like most statistical tests, the chi-square test requires certain assumptions to be met if it is to be used successfully:
- No cell should have an expected value (count) less than 1, and
- No more than 20% of the cells should have expected values (counts) less than 5.

In the SPSS file there is a variable called relig, short for religion (Protestant, Catholic, Jewish, None, Other), and another called region4 (Northeast, Midwest, South, West). In this example, we want to find out if religious preferences vary by region of the country. To produce the output, from the menus choose:
Analyze -> Descriptive Statistics -> Crosstabs…
Row(s): Religious Preference [relig]
Column(s): Region [region4]
Statistics… select Chi-square, click Continue then OK

In the SPSS output, the Pearson chi-square, likelihood-ratio chi-square, and linear-by-linear association chi-square are displayed. Fisher's exact test and Yates' corrected chi-square are computed for 2 x 2 tables. State the null and alternative hypotheses being tested.

Examine the output. What conclusion can you draw from it? You will notice, however, that certain assumptions are not met, so the results could be misleading. What should you do? We will discuss this further in Example 2 below.

Example 2: Percentages, Expected Values, Residuals and Omitting Categories
From the last example, we noticed that 40% of the cells had expected counts less than 5, so this assumption was violated. Since Other and Jewish had just 15 cases each, we can drop them from the analysis by using Select Cases. In other words, religious preference is restricted to Protestant, Catholic and None. To produce the output, use Select Cases from the Data menu to select cases with relig not equal to 3 and relig not equal to 5 (relig ~= 3 & relig ~= 5). Call up the dialogue box for Crosstabs.
Reset it to its defaults and select:
Row(s): Region [region4]
Column(s): Religious Preference [relig]
Statistics…
  Select Chi-square
  Nominal: select Contingency coefficient, Phi and Cramer's V, Lambda, Uncertainty coefficient; click Continue
Cells…
  Counts: select Expected
  Percentages: select Row
  Residuals: select Adjusted standardized; click Continue then OK

Now examine the output and try to interpret it. You can pivot the table so that each group of statistics appears in its own panel. To demonstrate, double-click the table and drag Region on the row tray to the right of Statistics.

Look at the Region4 * Religious Preference Crosstabulation. What can you conclude?

Look at the Chi-Square Tests table. What can you conclude?

Look at the Symmetric Measures table. What can you conclude about the strength of the relationship between religious preference and region?

Examine the table Directional Measures. What do you conclude?

Example 3: Tests Within Layers of a Multiway Table
A multiway table allows you to examine the relationship between two categorical variables within levels of a controlling variable. For example, is the relationship between marital status and view of life the same for males and females? This example shows you how to answer this type of question in SPSS.

Use Select Cases from the Data menu to select cases with marital not equal to 4 (marital ~= 4). Can you think of any reason why we have decided to exclude cases where marital status is equal to 4 (i.e. separated)?

Call up the Crosstabs dialogue box. Click Reset to restore the dialogue box defaults. Then select:
Row(s): Marital Status [marital]
Column(s): Is Life Exciting or Dull [life]
Layer 1 of 1: Respondent's Sex [sex]
Statistics…
  Select Chi-square
Cells…
  Counts: select Expected
  Percentages: select Row
  Residuals: select Standardized and Adjusted standardized

Examine the results and try to interpret them. Is there a relationship between marital status and view on life? Is this relationship the same for males and females?
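The chi-square test of independence used in the examples above can be reproduced outside SPSS. Here is a sketch with scipy on an invented region-by-religion table (the counts are not from GSS_93.sav), including a check of the two assumptions stated earlier:

```python
# Chi-square test of independence on an invented R x C table
# (counts are illustrative, NOT the GSS_93.sav data).
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Northeast, Midwest, South, West; columns: Protestant, Catholic, None.
observed = np.array([
    [80,  95, 30],
    [120, 70, 35],
    [200, 60, 40],
    [90,  75, 50],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4g}")

# Check the assumptions: no expected count below 1, and
# no more than 20% of cells with expected counts below 5.
print("minimum expected count:", expected.min().round(2))
print("% of cells with expected count < 5:", 100 * (expected < 5).mean())
```

The degrees of freedom are (rows − 1) × (columns − 1), matching the df SPSS reports in the Chi-Square Tests table.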
Example 4: The Relative Risk and Odds Ratio for a 2 x 2 Table
The relative risk for a 2 x 2 table is a measure of the strength of the association between the presence of a factor and the occurrence of an event. If the confidence interval for the statistic includes the value 1, you cannot assume that the factor is associated with the event. The odds ratio can be used as an estimate of relative risk when the occurrence of the factor is rare.

In the GSS93 data file, there is a variable (dwelown) that measures home ownership (owner or renter) and another variable (vote92) that measures voting (voted or did not vote). We would like to find out whether home owners are more likely to vote than renters. Through the Variable View window, note all the codes that have been used for the two variables of interest. For example, dwelown uses code 3 for other and code 8 for don't know, while vote92 uses code 3 for not eligible and code 4 for refused. Select the cases with dwelown less than 3 and vote92 less than 3. From the menus choose:
Data -> Select Cases
Select If condition is satisfied and click If. Enter dwelown < 3 & vote92 < 3 as the condition and click Continue then OK.

In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then select:
Row(s): Homeowner or Renter [dwelown]
Column(s): Voting in 1992 Election [vote92]
Cells…
  Percentages: select Row; click Continue then OK

Examine and interpret the output. From the crosstabulation table, what can you conclude?

Recall the Crosstabs dialogue box and select:
Statistics…
  Select Risk; click Continue

Examine the output and interpret it. Look at the table called Risk Estimate. What can you conclude?

The odds ratio should be used as an approximation to the relative risk when the following conditions are met:
- The probability of the event is small (< 0.1). This condition guarantees that the odds ratio will make a good approximation to the relative risk.
- The design of the study is case-control.

These conditions are not met in the present example. In a smoking and lung cancer study, the conditions would be met and you could use the odds ratio.

Example 5: The Kappa Measure of Agreement for an R x R Table
Cohen's kappa measures the agreement between the evaluations of two raters when both are rating the same objects. A value of 1 indicates perfect agreement; a value of 0 indicates that agreement is no better than chance. Values of kappa greater than 0.75 indicate excellent agreement beyond chance, values between 0.40 and 0.75 indicate fair to good agreement, and values below 0.40 indicate poor agreement. Kappa is only available for tables in which both variables use the same category values and both variables have the same number of categories. The table structure for the kappa statistic is square (R x R), with the same row and column categories, because each subject is classified or rated twice. For example, doctor A and doctor B diagnose the same patients as schizophrenic, manic depressive, or behaviour-disordered: do the two doctors agree or disagree in their diagnoses? Two teachers assess a class of 18-year-old students: do the teachers agree or disagree in their assessments?

In the GSS93 subset data file, we have variables that assess the educational level of the respondent's father (padeg) and mother (madeg). Is there any agreement between the father's and mother's educational levels? To produce the output, use Select Cases from the Data menu to select cases with madeg not equal to 2 and padeg not equal to 2 (madeg ~= 2 & padeg ~= 2). In the Crosstabs dialogue box, click Reset to restore the dialogue box defaults, and then select:
Row(s): Father's Highest Degree [padeg]
Column(s): Mother's Highest Degree [madeg]
Statistics…
  Select Kappa; click Continue
Cells…
  Percentages: select Total; click Continue then OK

Examine and interpret the output. Look at the tables from the output. What can you conclude?
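The statistics behind Examples 4 and 5 can be computed by hand, which makes their definitions concrete. The sketch below uses invented counts (not the GSS93 tables): a 2 x 2 table for relative risk and odds ratio, and a square agreement table for Cohen's kappa.

```python
# Hand computation of the Example 4 and Example 5 statistics on
# invented counts (NOT the GSS93 data).
import numpy as np

# Example 4 style: rows owner/renter, columns voted/did not vote.
(a, b), (c, d) = np.array([[520, 180],
                           [240, 260]])
relative_risk = (a / (a + b)) / (c / (c + d))   # risk of "voted", owners vs renters
odds_ratio = (a * d) / (b * c)                  # cross-product ratio
print(f"relative risk = {relative_risk:.3f}, odds ratio = {odds_ratio:.3f}")

# Example 5 style: rows rater A, columns rater B, same three categories.
table = np.array([[25,  3,  2],
                  [ 4, 30,  6],
                  [ 1,  5, 24]], dtype=float)
n = table.sum()
p_obs = np.trace(table) / n                     # observed proportion agreeing
p_exp = (table.sum(1) @ table.sum(0)) / n**2    # agreement expected by chance
kappa = (p_obs - p_exp) / (1 - p_exp)
print(f"kappa = {kappa:.3f}")                   # 0.40-0.75 indicates fair to good
```

Note that the odds ratio here (about 3.1) is much larger than the relative risk (about 1.5): the event is common in this invented table, so the rare-event condition above is violated and the odds ratio is a poor approximation to the relative risk.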
Intraclass Correlation Coefficients (ICC)
We can use the ICC to assess inter-rater agreement when there are more than two raters. For example, the International Olympic Committee (IOC) trains judges to assess gymnastics competitions. How can we find out if the judges are in agreement? The ICC can help us to answer this question. Judges have to be trained to ensure that good performances receive higher scores than average performances, and average performances receive higher scores than poor performances, even though two judges may differ on the precise score that should be assigned to a particular performance.

Use the data set judges.sav to illustrate how to use SPSS to calculate the ICC. Open the data set. From the menus select:
Analyze -> Scale -> Reliability Analysis
Items: judge1, judge2, judge3, judge4, judge5, judge6, judge7
Statistics…
  Under Descriptives for, check Item.
  Check Intraclass correlation coefficient
  Model: Two-Way Random
  Type: Consistency
  Confidence interval: 95%
  Test value: 0

Examine and interpret the output. What would you conclude?

Types of Survival Analysis and When to Use Them in SPSS
Life Tables: Use life tables if cases can be classified into meaningful equal time intervals. A life table can be used to calculate the probability of a terminal event during any interval under study.
Kaplan-Meier: Use this technique if cases cannot be classified into equal time intervals as above. This is common in many clinical and experimental studies.
Cox Regression: Use this technique if you want to see the relation between survival time and a predictor variable, for instance age or tumour type.

Using Kaplan-Meier Survival Analysis to Test Competing Pain Relief Treatments
A pharmaceutical company is developing an anti-inflammatory medication for treating chronic arthritic pain. Of particular interest is the time it takes for the drug to take effect and how it compares to an existing medication. Shorter times to effect are considered better.
The results of a clinical trial are collected in pain_medication.sav. This data file is stored in the folder \\campus\software\dept\spss. Open the file and study it. Use Kaplan-Meier survival analysis to examine the distribution of "time to effect" and compare the effectiveness of the two treatments.

To run a Kaplan-Meier survival analysis, from the menus choose:
Analyze -> Survival -> Kaplan-Meier…
Select Time to effect [time] as the Time variable.
Select Effect status [status] as the Status variable. Click Define Event. Under Value(s) Indicating Event Has Occurred, type 1 in the text area next to Single value. Click Continue.
Select Treatment [treatment] as a Factor.
Click Compare Factor. Select Log rank, Breslow, and Tarone-Ware. Click Continue.
Click Options in the Kaplan-Meier dialogue box. Select Quartiles in the Statistics group and Survival in the Plots group. Click Continue.
Click OK in the Kaplan-Meier dialogue box.

Interpretation

Survival Table
The survival table is a descriptive table that details the time until the drug takes effect. The table is sectioned by each level of Treatment, and each observation occupies its own row in the table.
Time: The time at which the event or censoring occurred.
Status: Indicates whether the case experienced the terminal event or was censored.
Cumulative Proportion Surviving at the Time: The proportion of cases surviving from the start of the table until this time. When multiple cases experience the terminal event at the same time, these estimates are printed once for that time period and apply to all the cases whose drug took effect at that time.
N of Cumulative Events: The number of cases that have experienced the terminal event from the start of the table until this time.
N of Remaining Cases: The number of cases that, at this time, have yet to experience the terminal event or be censored.

Survival Functions (Curves)
The survival curves give a visual representation of the life tables.
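The quantities in the survival table can be reproduced with a hand-rolled Kaplan-Meier estimator. The sketch below uses invented times, not pain_medication.sav; at each distinct event time it computes the number at risk, the number of events, and the cumulative proportion surviving:

```python
# Hand-rolled Kaplan-Meier estimate on invented data (NOT pain_medication.sav).
import numpy as np

# Times to effect; status 1 = event (drug took effect), 0 = censored.
time   = np.array([2, 3, 3, 5, 6, 7, 8, 8, 9, 12])
status = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

surv, km = 1.0, []
for t in np.unique(time[status == 1]):           # distinct event times
    at_risk = np.sum(time >= t)                  # N of remaining cases just before t
    events = np.sum((time == t) & (status == 1))
    surv *= 1 - events / at_risk                 # cumulative proportion surviving
    km.append(surv)
    print(f"t={t}: at risk={at_risk}, events={events}, S(t)={surv:.3f}")
```

Censored cases leave the risk set without causing a drop in the curve, which is exactly why Kaplan-Meier suits data that cannot be grouped into equal time intervals.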
The horizontal axis shows the time to event. In this plot, drops in the survival curve occur whenever the medication takes effect in a patient. The vertical axis shows the probability of survival; thus, any point on the survival curve shows the probability that a patient on a given treatment will not have experienced relief by that time. The plot for the New drug lies below that of the Existing drug throughout most of the trial, which suggests that the new drug may give faster relief than the old. To determine whether these differences are due to chance, look at the comparison tables.

Means and Medians for Survival Time
The means and medians for survival time table offers a quick numerical comparison of the "typical" times to effect for each of the medications. Since there is a lot of overlap in the confidence intervals, it is unlikely that there is much difference in the "average" survival time.

Percentiles
The percentiles table gives estimates of the first quartile, median, and third quartile of the survival distribution. For survival curves, the 75th percentile is interpreted as the latest time at which at least 75 percent of the patients have yet to feel relief.

Overall Comparisons
This table provides overall tests of the equality of survival times across groups. Since the significance values of the tests are all greater than 0.05, you cannot determine a difference between the survival curves.

Summary
With the Kaplan-Meier Survival Analysis procedure, you have examined the distribution of time to effect for two different medications. The comparison tests show that there is not a statistically significant difference between them.

Recommended Readings
1. Hosmer, D. W., and S. Lemeshow. 1999. Applied Survival Analysis. New York: John Wiley and Sons.
2. Kleinbaum, D. G. 1996. Survival Analysis: A Self-Learning Text. New York: Springer-Verlag.
3. Norusis, M. 2004. SPSS 13.0 Advanced Statistical Procedures Companion.
Upper Saddle River, N.J.: Prentice Hall, Inc.

The Ordinal Regression Model
Generalized linear models are a very powerful class of models which can be used to answer a wide range of statistical questions. The basic form of the model is shown in the following equation:

link(γij) = θj − [β1xi1 + β2xi2 + ... + βpxip]

where
link( ) is the link function,
γij is the cumulative probability of the jth category for the ith case,
θj is the threshold for the jth category,
p is the number of regression coefficients,
xi1 ... xip are the values of the predictors for the ith case, and
β1 ... βp are the regression coefficients.

There are several important things to notice here. The model is based on the notion that there is some latent continuous outcome variable, and that the ordinal outcome variable arises from discretizing the underlying continuum into ordered groups. The cutoff values that define the categories are estimated by the thresholds. In some cases there is good theoretical justification for assuming such an underlying distribution; however, even in cases in which there is no theoretical concept that links to the latent variable, the model can still perform quite well and give valid results.

The thresholds or constants in the model (corresponding to the intercept in linear regression models) depend only on which category's probability is being predicted; values of the predictor (independent) variables do not affect this part of the model. The prediction part of the model depends only on the predictors and is independent of the outcome category. These first two properties imply that the results will be a set of parallel lines or planes, one for each category of the outcome variable.

Rather than predicting the actual cumulative probabilities, the model predicts a function of those values. This function is called the link function, and you choose the form of the link function when you build the model.
This allows you to choose a link function based on the problem under consideration, to optimize your results. Several link functions are available in the Ordinal Regression procedure. As you can see, these are very powerful and general models. Of course, there is also a bit more to keep track of here than in a typical linear regression model. There are three major components in an ordinal regression model:

Location component: The portion of the equation shown above which includes the coefficients and predictor variables is called the location component of the model. The location is the "meat" of the model: it uses the predictor variables to calculate the predicted probabilities of membership in the categories for each case.

Scale component: The scale component is an optional modification to the basic model to account for differences in variability for different values of the predictor variables. For example, if men have more variability than women in their account status values, using a scale component to account for this may improve your model. The model with a scale component follows the form shown in this equation:

link(γij) = [θj − (β1xi1 + β2xi2 + ... + βpxip)] / exp(τ1zi1 + ... + τmzim)

where
zi1 ... zim are the scale component predictors (a subset of the x's), and
τ1 ... τm are the scale component coefficients.

Link function: The link function is a transformation of the cumulative probabilities that allows estimation of the model. Five link functions are available in the Ordinal Regression procedure, summarized in the following table.
Function                  Form                 Typical application
Logit                     log(γ / (1 − γ))     Evenly distributed categories
Complementary log-log     log(−log(1 − γ))     Higher categories more probable
Negative log-log          −log(−log(γ))        Lower categories more probable
Probit                    Φ⁻¹(γ)               Latent variable is normally distributed
Cauchit (inverse Cauchy)  tan(π(γ − 0.5))      Latent variable has many extreme values

Using Ordinal Regression to Build a Credit Scoring Model
A creditor wants to be able to determine whether an applicant is a good credit risk, given various financial and personal characteristics. From their customer database, the outcome (dependent) variable is account status, with five ordinal levels: no debt history, no current debt, debt payments current, debt payments past due, and critical account. Potential predictors consist of various financial and personal characteristics of applicants, including age, number of credits at the bank, housing type, checking account status, and so on. This information is collected in german_credit.sav. Use Ordinal Regression to build a model for scoring applicants.

Constructing a Model
Constructing your initial ordinal regression model entails several decisions. First, of course, you need to identify the ordinal outcome variable. Then you need to decide which predictors to use for the location component of the model. Next, you need to decide whether to use a scale component and, if you do, what predictors to use for it. Finally, you need to decide which link function best fits your research question and the structure of the data.

Identifying the Outcome Variable
In most cases, you will already have a specific target variable in mind by the time you begin building an ordinal regression model. After all, the reason you use an ordinal regression model is that you know you want to predict an ordinal outcome. In this example, the ordinal outcome is Account status, with five categories: No debt history, No current debt, Payments current, Payments delayed, and Critical account.
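To make the model's pieces concrete before running the procedure, the following is a hand-rolled sketch of how thresholds, the location term and the complementary log-log link combine into category probabilities. The threshold and coefficient values are invented for illustration; they are not estimates from german_credit.sav.

```python
# Sketch of the ordinal model's mechanics with invented parameters
# (NOT estimates from german_credit.sav).
import numpy as np

def cloglog_inv(eta):
    """Inverse of the complementary log-log link: maps link values to probabilities."""
    return 1 - np.exp(-np.exp(eta))

thresholds = np.array([-2.0, -1.0, 0.5, 1.5])   # theta_j (J - 1 thresholds for J = 5)
beta = np.array([0.03, -0.02])                  # invented coefficients
x = np.array([49.0, 12.0])                      # e.g. age in years, duration in months

eta = thresholds - beta @ x                     # link(gamma_ij) = theta_j - x'beta
cum = cloglog_inv(eta)                          # cumulative P(Y <= j)
probs = np.diff(np.concatenate([[0.0], cum, [1.0]]))  # per-category probabilities
print("cumulative probabilities:", cum.round(3))
print("category probabilities:  ", probs.round(3))
```

Note that the location term x'beta is the same for every threshold, which is exactly the "parallel lines" property described earlier: changing the predictors shifts all the cumulative probabilities together.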
Note that this particular ordering may not, in fact, be the best possible ordering of the outcomes. You could easily argue that a known customer with no current debt, or with payments current, is a better credit risk than a customer with no known credit history.

Choosing Predictors for the Location Model
The process of choosing predictors for the location component of the model is similar to the process of selecting predictors in a linear regression model. You should take both theoretical and empirical considerations into account when selecting predictors. Ideally, your model would include all of the important predictors and none of the others. In practice, you often don't know exactly which predictors will prove to be important until you build the model. In that case, it is usually better to start off by including all of the predictors that you think might be important; if you discover that some of those predictors seem not to be helpful in the model, you can remove them and re-estimate the model.

In this case, previous experience and some preliminary exploratory analysis have identified five likely predictors: age, duration of loan, number of credits at the bank, other instalment debts, and housing type. You will include these predictors in the initial analysis and then evaluate the importance of each predictor. Number of credits, other instalment debts, and housing type are categorical predictors, entered as factors in the model. Age and duration of loan are continuous predictors, entered as covariates in the model.

Scale Component
The next decision has two stages. The first is whether to include a scale component in the model at all. In many cases the scale component will not be necessary, and the location-only model will provide a good summary of the data.
In the interests of keeping things simple, it's usually best to start with a location-only model, and add a scale component only if there is evidence that the location-only model is inadequate for your data. Following this philosophy, you will begin with a location-only model, and after estimating the model, decide whether a scale component is warranted.

Choosing a Link Function

To choose a link function, it is helpful to examine the distribution of values for the outcome variable. To create a bar chart for Account status [chist], from the menus choose:
Graphs -> Bar…
Click Define. Select % of cases in the Bars Represent group. Select Account status [chist] as the variable to plot on the Category Axis. Click OK.
The resulting bar chart shows the distribution for the account status categories. The bulk of cases are in the higher categories, especially categories 3 (payments current) and 5 (critical account). The higher categories are also where most of the "action" is, since the most important distinctions from a business perspective are between categories 3, 4, and 5. For this reason, you will begin with the complementary log-log link function, since that function focuses on the higher outcome categories. The high number of cases in the extreme category 5 (critical account) indicates that the Cauchit distribution might be a reasonable alternative.

Running the Analysis

To run the Ordinal Regression analysis, from the menus choose:
Analyze -> Regression -> Ordinal…
Select Account status [chist] as the Dependent variable. Select # of existing credits [numcred], Other installment debts [othnstal], and Housing [housing] as Factors. Select Age in years [age] and Duration in months [duration] as Covariates. Click Options. Select Complementary Log-Log as the Link function. Click Continue. Click Output in the Ordinal Regression dialog box. Select Test of Parallel Lines in the Display group. Select Predicted Category in the Saved Variables group. Click Continue.
Click OK in the Ordinal Regression dialog box.

Evaluating the Model

The first thing you see in the output is a warning about cells with zero frequencies. The reason this warning comes up is that the model includes continuous covariates. Certain fit statistics for the model depend on aggregating the data based on unique predictor and outcome value patterns. For instance, all cases where the applicants have current payments on debt, one other credit at the bank, own their home, have no other instalment debts, are 49 years old and are seeking a 12-month loan are combined to form a cell. However, because Duration in months and Age in years are both continuous, most cases have unique values for those variables. This results in a very large table with many empty cells, which makes it difficult to interpret some of the fit statistics. You have to be careful in evaluating this model, particularly when looking at chi-square-based fit statistics. For relatively simple models with a few factors, you can display information about individual cells by selecting Cell Information on the Output dialog box. However, this is not recommended for models with many factors (or factors with many levels), or models with continuous covariates, since such models typically result in very large tables. Such large tables are often of limited value in evaluating the model, and they can take a long time to process.

Predictive Value of the Model

Before you start looking at the individual predictors in the model, you need to find out if the model gives adequate predictions. To answer this question, you can examine the Model-Fitting Information table. Here you see the -2 log-likelihood values for the intercept only (baseline) model and the final model (with the predictors).
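The likelihood-ratio chi-square in the Model-Fitting Information table is simply the drop in -2 log-likelihood from the intercept-only model to the final model. A minimal sketch; the two -2LL values below are hypothetical, chosen so that their difference matches the 353.336 statistic quoted later in the text for this model:

```python
# Likelihood-ratio chi-square: baseline -2LL minus final-model -2LL.
# Both -2LL values are hypothetical illustration figures, not SPSS output.
neg2ll_intercept_only = 2000.000  # intercept-only (baseline) model
neg2ll_final = 1646.664           # final model with predictors

chi_square = neg2ll_intercept_only - neg2ll_final
print(round(chi_square, 3))  # 353.336
```

This statistic is compared against a chi-square distribution with degrees of freedom equal to the number of model parameters added beyond the intercepts.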
While the log-likelihood statistics themselves are suspect due to the large number of empty cells in the model, the difference of log-likelihoods can usually still be interpreted as a chi-square distributed statistic (McCullagh and Nelder, 1989). The chi-square reported in the table is just that: the difference between -2 times the log-likelihood for the intercept-only model and that for the final model, within rounding error. The significant chi-square statistic indicates that the model gives a significant improvement over the baseline intercept-only model. This basically tells you that the model gives better predictions than if you just guessed based on the marginal probabilities for the outcome categories. That's a good sign, but what you really want to know is how much better the model is.

Chi-Square-Based Fit Statistics

The next table in the output is the Goodness-of-Fit table. This table contains Pearson's chi-square statistic for the model and another chi-square statistic based on the deviance. These statistics are intended to test whether the observed data are inconsistent with the fitted model. If they are not (that is, if the significance values are large), then you would conclude that the data and the model predictions are similar and that you have a good model. These statistics can be very useful for models with a small number of categorical predictors. Unfortunately, these statistics are both sensitive to empty cells. When estimating models with continuous covariates, there are often many empty cells, as in this example. Therefore, you shouldn't rely on either of these test statistics with such models. Because of the empty cells, you can't be sure that these statistics will really follow the chi-square distribution, and the significance values won't be accurate.

Pseudo R-Squared Measures

The next tool to turn to in assessing the overall goodness of fit of the model is the pseudo r-squared measures.
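The three measures can be sketched from the log-likelihood of the intercept-only model (ll0), the log-likelihood of the final model (ll1), and the sample size n, using the standard Cox-Snell, Nagelkerke, and McFadden definitions. The ll0, ll1, and n values below are hypothetical, not the output for this model:

```python
import math

# Pseudo r-squared measures from two log-likelihoods (hypothetical values).
def pseudo_r_squared(ll0, ll1, n):
    cox_snell = 1 - math.exp(2 * (ll0 - ll1) / n)          # capped below 1.0
    nagelkerke = cox_snell / (1 - math.exp(2 * ll0 / n))   # rescaled to [0, 1]
    mcfadden = 1 - ll1 / ll0                               # likelihood-kernel ratio
    return cox_snell, nagelkerke, mcfadden

cs, nk, mf = pseudo_r_squared(ll0=-1000.0, ll1=-850.0, n=1000)
print(f"Cox and Snell: {cs:.3f}  Nagelkerke: {nk:.3f}  McFadden: {mf:.3f}")
```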
These measures attempt to serve the same function as the coefficient of determination in linear regression models, namely, to summarize the proportion of variance in the dependent variable associated with the predictor (independent) variables. For ordinal regression models, these measures are based on likelihood ratios rather than raw residuals. Three different methods are used to estimate the coefficient of determination. Cox and Snell's r-squared (Cox and Snell, 1989) is a well-known generalization of the usual measure designed to apply when maximum likelihood estimation is used, as with ordinal regression. However, with categorical outcomes, it has a theoretical maximum value of less than 1.0. For this reason, Nagelkerke (Nagelkerke, 1991) proposed a modification that allows the index to take values in the full zero-to-one range. McFadden's r-squared (McFadden, 1974) is another version, based on the log-likelihood kernels for the intercept-only model and the full estimated model. Here, the pseudo r-squared values are respectable but leave something to be desired. It will probably be worth the effort to revise the model to try to make better predictions.

Classification Table

The next step in evaluating the model is to examine the predictions generated by the model. Recall that the model is based on predicting cumulative probabilities. However, what you're probably most interested in is how often the model can produce correct predicted categories based on the values of the predictor variables. To see how well the model does, you can construct a classification table (also called a confusion matrix) by cross-tabulating the predicted categories with the actual categories. You can create a classification table in another procedure, using the saved model-predicted categories. To produce the classification table, from the menus choose:
Analyze -> Descriptive Statistics -> Crosstabs…
Choose Account status [chist] as Row variable.
Choose Predicted Response Category [PRE_1] as Column variable. Click Cells. Select Row under the Percentages group. Click Continue. Click OK.
The model seems to be doing a respectable job of predicting outcome categories, at least for the most frequent categories: category 3 (debt payments current) and category 5 (critical account). The model correctly classifies 90.6% of the category 3 cases and 75.1% of the category 5 cases. In addition, cases in category 2 are more likely to be classified as category 3 than category 5, a desirable result for predicting ordinal responses. On the other hand, category 1 (no credit history) cases are somewhat poorly predicted, with the majority of cases being assigned to category 5 (critical account), a category that should theoretically be most dissimilar to category 1. This may indicate a problem in the way the ordinal outcome scale is defined. In the interest of brevity, you will not pursue this issue further here, but in an actual data analysis situation, you would probably want to investigate this and try to discover whether the ordinal scale itself could be improved by reordering, merging, or excluding certain categories.

Test of Parallel Lines

For location-only models, the test of parallel lines can help you assess whether the assumption that the parameters are the same for all categories is reasonable. This test compares the estimated model with one set of coefficients for all categories to a model with a separate set of coefficients for each category. You can see that the general model (with separate parameters for each category) gives a significant improvement in the model fit. This can be due to several things, including use of an incorrect link function or use of the wrong model. It is also possible that the poor model fit is due to the chosen ordering of the categories of the dependent variable. An ordering that places No debt history as a greater credit risk may have a better fit.
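A classification table like the one built above can also be tallied by hand from the actual categories and the saved predicted categories (PRE_1). A sketch with small hypothetical illustration data, not the german_credit.sav output:

```python
from collections import Counter

# Confusion matrix from (actual, predicted) category pairs; both lists are
# hypothetical illustration data standing in for chist and PRE_1.
actual    = [3, 3, 3, 3, 5, 5, 5, 2, 1, 4]
predicted = [3, 3, 3, 5, 5, 5, 3, 3, 5, 5]

table = Counter(zip(actual, predicted))  # (actual, predicted) -> count
for cat in sorted(set(actual)):
    row_total = sum(n for (a, _), n in table.items() if a == cat)
    pct_correct = 100 * table[(cat, cat)] / row_total
    print(f"category {cat}: {pct_correct:.1f}% correctly classified")
```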
It would also be interesting to examine this data file using the Multinomial Logistic Regression procedure, since it allows you to avoid the ordering issues and also allows different effects of predictors.

Evaluating the Choice of Link Function

Often, there will not be a clear theoretical choice of link function based on the data. In cases where the initial model performs poorly, it is usually worth trying alternative link functions to see if a better model can be constructed with a different link function. Although some of the link functions perform quite similarly in many instances (particularly the logit, complementary log-log and negative log-log functions), there are situations where the choice of link function can make or break your model. In this example, there are at least two link functions (complementary log-log and Cauchit) that may be appropriate. Although the model does fairly well with the complementary log-log link, it might be possible to improve the model fit by using the Cauchit link function. You can now estimate a new model with a Cauchit link function to see whether the change increases the predictive utility of the model. It is recommended to keep the same set of predictor variables in the model until you have finished evaluating link functions. If you change the link function and the set of predictors at the same time, you won't know which of them caused any change in model fit.

Revising the Model

Recall the Ordinal Regression dialog box. Click Options in the Ordinal Regression dialog box. Select Cauchit as the Link function. Click Continue. Click OK in the Ordinal Regression dialog box.

Model-Fitting Information

The significance level for the chi-square statistic is less than 0.05, indicating that the Cauchit model is better than simple guessing. The chi-square statistic for the Cauchit link (459.860) is larger than that for the complementary log-log link (353.336). This suggests that the Cauchit link is better.
Pseudo R-Squared Measures

The pseudo r-squared statistics are larger for the Cauchit link than for the complementary log-log link, which further suggests that the Cauchit link is better.

Classification Table

The Cauchit model seems to be slightly better at predicting the lower categories (1, 2, and 3) and slightly worse at predicting the higher categories than the previous model. You can check this by recalling the Crosstabs dialogue box and replacing the column variable with Predicted Response Category [PRE_2]. Since the most important goal of credit scoring is to correctly identify accounts that are likely to become critical (category 5), you would probably choose to retain the complementary log-log model, even though the fit statistics favor the Cauchit model.

Interpreting the Model

Having chosen the model with the complementary log-log link, you can make some interpretations based on the parameter estimates. The significance of the test for Age in years is less than 0.05, suggesting that its observed effect is not due to chance. By contrast, Duration in months adds little to the model. While there is no single category of NUMCRED that is significant on its own, there are two that are marginally significant. Usually, it is worth keeping such a variable in the model, since the small effects of each category accumulate and provide useful information to the model. OTHNSTAL also seems to be an important predictor on empirical grounds. On the other hand, HOUSNG doesn't seem to contribute to the model in a meaningful way and could probably be dropped without substantially worsening the model. While direct interpretation of the coefficients in this model is difficult due to the nature of the link function, the signs of the coefficients can give important insights into the effects of the predictors in the model. The signs essentially indicate the direction of the effect. Positive coefficients (such as that for age) indicate positive relationships between predictors and outcome.
In this example, as age increases, so does the probability of being in one of the higher categories of account status. Negative coefficients (such as that for the first category of numcred) indicate inverse relationships. In this model, for example, those with one credit at the bank are likely to be in the lower outcome categories.

Summary: Using the Model to Make Predictions

Because the model attempts to predict cumulative probabilities rather than category membership, two steps are required to get predicted categories. First, for each case, the probabilities must be estimated for each category. Second, those probabilities must be used to select the most likely outcome category for each case. The probabilities themselves are estimated by using the predictor values for a case in the model equations and taking the inverse of the link function. The result is the cumulative probability for each group, conditional on the pattern of predictor values for the case. The probabilities for individual categories can then be derived by taking the differences of the cumulative probabilities for the groups in order. In other words, the probability for the first category is the first cumulative probability; the probability for the second category is the second cumulative probability minus the first; the probability for the third category is the third cumulative probability minus the second; and so on. For each case, the predicted outcome category is simply the category with the highest probability, given the pattern of predictor values for that case. For example, suppose you have an applicant who wants a 48-month loan (duration), is 22 years old (age), has one credit with the bank (numcred), has no other instalment debt (othnstal), and owns her home (housng). Inserting these values into the prediction equations, this applicant has predicted values of -2.78, -1.95, 0.63, and 0.97. (Remember that there is one equation for each category except the last.)
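These two steps can be sketched in Python, starting from the four predicted values just quoted and the inverse of the complementary log-log link, 1 - e^(-e^x). Because the differences below are taken from unrounded cumulative probabilities, the category 3 figure comes out at about 0.71 rather than the 0.72 obtained from the rounded hand calculation:

```python
import math

def inv_cloglog(x):
    # inverse of the complementary log-log link: a cumulative probability
    return 1 - math.exp(-math.exp(x))

# Predicted values from the worked example (one per category except the last)
predicted_values = [-2.78, -1.95, 0.63, 0.97]

# Step 1: cumulative probability for each category (the last is always 1.0)
cumulative = [inv_cloglog(x) for x in predicted_values] + [1.0]

# Step 2: category probabilities are successive differences of cumulatives;
# the predicted category is the one with the highest probability
probs = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
best = max(range(1, 6), key=lambda k: probs[k - 1])
print([round(p, 2) for p in probs], "-> predicted category:", best)
```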
For example, the model equation for the first category is:
link(γ1) = -3.549 - [-0.002(duration) + 0.015(age) - 1.134(numcred) + 0(othnstal) + 0.132(housng)]
Write down the model equations for the other categories. To get -2.78, substitute the values for this applicant into the equation, that is,
link(γ1) = -3.549 - [-0.002(48) + 0.015(22) - 1.134 + 0 + 0.132] = -2.78
Note that for the factor variables we substitute the coefficients for the applicant's particular categories. Taking the inverse of the complementary log-log link function gives the cumulative probabilities 0.06, 0.13, 0.85, and 0.93 (and, of course, 1.0 for the last category). The inverse of the complementary log-log link is 1 - e^(-e^x), so the first cumulative probability is 1 - e^(-e^-2.78) ≈ 0.06. Taking differences gives the following individual category probabilities: category 1: 0.06; category 2: 0.13 - 0.06 = 0.07; category 3: 0.85 - 0.13 = 0.72; category 4: 0.93 - 0.85 = 0.08; and category 5: 1.0 - 0.93 = 0.07. Clearly, category 3 (debt payments current) is the most likely category for this case according to the model, with a predicted probability of 0.72. Thus, you would predict that this applicant would keep her payments current and the account would not become critical.

Recommended Readings

See the following texts for more information on generalized linear models for ordinal data:
1. McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models, 2nd ed. London: Chapman & Hall.
2. Cox, D. R., and E. J. Snell. 1989. The Analysis of Binary Data, 2nd ed. London: Chapman and Hall.
3. McFadden, D. 1974. Conditional logit analysis of qualitative choice behavior. In: Frontiers in Econometrics, P. Zarembka, ed. New York: Academic Press.

Binary Logistic Regression Model

In this type of model you estimate the probability of an event occurring. The model can be written as:
Prob(event) = 1 / (1 + e^-z)
For a single independent variable:
z = b0 + b1x1
For multiple independent variables:
z = b0 + b1x1 + b2x2 + … +
bnxn
where b0, b1, b2, …, bn are coefficients estimated from the data, x1, x2, …, xn are the independent variables, n is the number of independent variables and e is the base of natural logarithms (approximately 2.718).

Exercise

The data held in the file cancer.sav are from a study reported by Brown (1980) and are commonly cited in texts covering binary logistic regression. The prognosis for prostate cancer is based upon whether or not the cancer has spread to the surrounding lymph nodes. In this classic study, Brown et al. (see Brown, 1980) explored the following separate indicators for lymph node involvement in a group of 53 men known to have prostate cancer. To open the data file, follow these instructions:
1. From the SPSS menu bar select File -> Open -> Data… a dialogue box will appear.
2. In the text area for File name: type \\campus\software\dept\spss and then click on Open.
3. Select the file cancer.sav and click on Open.
4. Spend some time studying the data file. How many cases and variables make up the data file? Cases:…….. Variables:………
5. Are there any missing values in the data? Yes No
The variables (corresponding to columns in the data file) are:
1) age - age of the patient in years
2) acid - level of serum acid phosphatase (in King-Armstrong units)
3) xray - X-ray result (0 = negative, 1 = positive)
4) size - size of tumour (0 = small, 1 = large)
5) stage - stage of tumour (0 = less serious, 1 = more serious)
6) nodes - nodal involvement (0 = not involved, 1 = involved)

Modelling

Carry out a Forward Conditional logistic regression analysis of the data using nodal involvement as the dependent variable and the other variables as independent variables (i.e. covariates). You do not need to define xray, size or stage as categorical variables, since they are already binary. Follow these steps to carry out the Forward Conditional binary logistic regression:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Method: Forward Conditional
Use the output to answer the following questions. Look at the table Case Processing Summary. What do you conclude? Now look at the three tables under Block 0: Beginning Block. What do you conclude? Now look at the tables under Block 1: Method = Forward Stepwise (Conditional). What do you conclude? Give the logistic regression equation for the final model.

Predictions

Carry out another logistic regression analysis of the data using nodal involvement as the dependent variable, but this time include ALL the covariates in the model, i.e. use the ENTER method. Also request the Odds Ratio (OR) and the 95% Confidence Interval (CI) of the OR. Follow these steps:
Analyze -> Regression -> Binary Logistic….
Dependent: Nodal involvement [nodes]
Covariates: age acid xray size stage
Save: under Predicted Values select Probabilities
Options: select CI for exp(B)
Method: Enter
1. Give the coefficients for the full model, i.e. including all the variables. [Normally you would only consider the statistically significant variables.]
2. Which coefficients are statistically significant and why?
3. What is the probability of nodal involvement for each man in the data set? Which case has the highest probability and which case the lowest probability of nodal involvement?
4. Select one significant variable and give its OR and 95% CI. How would you interpret the OR and its 95% CI?

Reference

Brown, B. W., Jr., et al. 1980. Prediction Analyses for Binary Data. In: Biostatistics Casebook. New York: John Wiley and Sons.
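For question 4 above, the OR and its CI come directly from a coefficient b and its standard error SE: OR = e^b, with 95% CI (e^(b - 1.96·SE), e^(b + 1.96·SE)). A minimal sketch, also showing the probability formula from the start of this section; all numeric values are hypothetical, not the cancer.sav estimates:

```python
import math

# Binary logistic model probability and odds-ratio computations.
# All numeric values here are hypothetical illustration figures.

def prob_event(b0, coeffs, xs):
    # Prob(event) = 1 / (1 + e^-z), where z = b0 + b1*x1 + ... + bn*xn
    z = b0 + sum(b * x for b, x in zip(coeffs, xs))
    return 1 / (1 + math.exp(-z))

def odds_ratio_ci(b, se, z=1.96):
    # OR = e^b with 95% CI (e^(b - 1.96*SE), e^(b + 1.96*SE))
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Hypothetical model: intercept -2.0, coefficients for two binary predictors
p = prob_event(b0=-2.0, coeffs=[2.0, 1.5], xs=[1, 0])
or_, lo, hi = odds_ratio_ci(b=2.0, se=0.8)
print(f"probability = {p:.3f}")
print(f"OR = {or_:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

A 95% CI for the OR that excludes 1 corresponds to a coefficient that is statistically significant at the 5% level.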