STA 4107/5107 Chapter 5: Multiple Discriminant Analysis and Logistic Regression

March 19, 2007

1 Key Terms

Please review and learn these terms.

2 What are Discriminant Analysis and Logistic Regression?

Discriminant analysis and logistic regression are dependence techniques whose goal is to classify cases into categories (some refer to these as classification techniques): the response is categorical and the predictors are metric. For both techniques we must have a "training set" containing the values of both the class variable and the predictor variables; that is, you must have a data set where the group (or population) memberships of all the cases are known. Once the discriminant function has been estimated from the training set with either technique, you can classify an unknown case based on the values of its predictor variables. If group membership is unknown in the training set, then cluster analysis is the appropriate technique.

2.1 Examples

1. An archaeologist wishes to determine which of three possible tribes created a particular statue found in a dig. The archaeologist takes measurements from statues produced by the three tribes, as well as from the unknown statue. The known statues are used to train a discriminant function, and the measurements from the unknown statue are then plugged into the discriminant function, which classifies the statue into one of the three tribes.

2. Lubischew (1962) considers the problem of discriminating among three species of flea beetle, Chaetocnema concinna, C. heikertingeri, and C. heptapotamica, on the basis of various physical measurements.

3. The US Forest Service would like to identify the personal characteristics of residents near a reservoir that predict whether a person will fish as an adult, with the goal of increasing recreational fishing in the area.

4. Investigators are interested in the relationship between island size and bird extinctions. On each island they count the number of species that went extinct out of all the species on the island.
The investigator would like to characterize the relationship between the area of an island and the probability of extinction of the birds present on it.

5. Investigators are interested in moth coloration and natural selection. At a number of distances from Liverpool they count the number of moths of each morph that were taken by predators. They would like to quantify the relationship between the distance from Liverpool, where trees are dark from industrial soot, and the probability of predation on the light morph and the dark (carbonaria) morph of the moth.

6. Researchers are interested in survival in the Donner Party. What is the relationship between the age and sex of individuals in the Donner Party and whether or not they survived?

7. In a study of winter habitat selection by pronghorn in the Red Rim area in south-central Wyoming (Source: Manly, McDonald, and Thomas 1993, Resource Selection by Animals, Chapman and Hall, pp. 16-24; data from Ryder 1983), the presence or absence of pronghorn during the winters of 1980-81 and 1981-82 was recorded for a systematic sample of 256 plots of 4 ha each. Other variables recorded for each plot were: the density (in thousands/ha) of sagebrush, black greasewood, Nuttal's saltbrush, and Douglas rabbitbrush, as well as the slope, distance to water, and aspect. The investigators were interested in which variables are most strongly associated with the presence or absence of pronghorn, and in whether they could formulate a model to predict the probability that pronghorn will be present on a plot.

3 Discriminant Analysis

Discriminant analysis is used to classify cases into one of two or more groups or populations on the basis of a set of variables measured on each case. The populations are known by the researcher to be distinct, and each individual is known to belong to one of them. The discriminant variate, also called the discriminant function, is the linear combination of the independent variables that best discriminates between the a priori identified groups.
Discrimination is achieved by finding the linear combination that maximizes the differences between the groups. With n observations and p independent variables, 1 ≤ i ≤ p and 1 ≤ k ≤ n, the jth discriminant function is given by

Z_jk = a + W_1 X_1k + W_2 X_2k + · · · + W_p X_pk    (1)

where Z_jk is the discriminant Z score calculated using the jth discriminant function on the kth observation, a is the intercept, W_i is the discriminant weight for independent variable i, and X_ik is the ith independent variable measured on the kth observation.

For purposes of this discussion we will consider the two-group case, even though the technique is most useful for more than two populations, because logistic regression is the more common choice when there are only two groups. The main reason for restricting attention to two groups here is conceptual and graphical ease. Suppose that we have a representative sample from both populations and just one measurement on each observation (again for graphical simplicity). We could choose a discriminant function to separate the two populations and then look at the distributions of the discriminant scores. These distributions might look something like those shown in the figure below.

The discriminant function shown in the upper portion of the graph does a better job of separating the two populations, though the separation is not perfect. The dividing point is the point on the X axis directly below the point of overlap between the two groups. Discriminant scores that are "low" will be classified as Group A and those that are "high" will be classified as Group B. The lighter shaded region on the left represents the percent of Group B that will be falsely classified as Group A; the darker shaded region on the right represents the percent of Group A that will be falsely classified as Group B. Our goal in discriminant analysis is to find a discriminant function that minimizes both of these errors.
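The score-then-cut classification rule just described can be sketched in a few lines. The data and the simple one-variable setup below are invented for illustration (with one predictor the discriminant score is just the measurement itself), and the cutoff is taken at the midpoint of the two group means, which is appropriate when the groups have equal variances:

```python
# Sketch: classify two groups with a one-variable discriminant score
# and a midpoint cutoff. The data are invented for illustration.

group_a = [2.1, 2.5, 1.9, 2.3, 2.7]   # "low" scores
group_b = [4.0, 4.4, 3.8, 4.6, 4.2]   # "high" scores

def mean(xs):
    return sum(xs) / len(xs)

# With equal variances, the dividing point is the average of the
# two group mean scores.
cutoff = (mean(group_a) + mean(group_b)) / 2

def classify(z, cutoff=cutoff):
    # scores below the cutoff go to Group A, at or above it to Group B
    return "A" if z < cutoff else "B"

print(cutoff)            # midpoint of the two sample means
print(classify(2.4))     # a "low" score  -> A
print(classify(4.1))     # a "high" score -> B
```

Any observation whose score falls in a group's tail on the wrong side of the cutoff is misclassified; shrinking those tails is exactly the goal described above.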
For example, if the two populations have the same variance, then the dividing point C will be

C = (Z̄_A + Z̄_B) / 2,

the average of the two sample averages.

3.1 Fisher's Iris Data: an Example

The Iris dataset we will use was originally introduced by R. A. Fisher as an example for discriminant analysis. The data report four characteristics (sepal width, sepal length, petal width and petal length) of three species of Iris flower: Iris setosa, Iris versicolor, and Iris virginica. The first figure shows the separation of the three species using petal length and width, that is, in two dimensions. The second figure shows the sample distribution by petal width. You can see the overlapping observations; we expect better separation between setosa and versicolor than between versicolor and virginica.

3.1.1 Stages of Analysis

Identify the Objectives As with all statistical analyses, we need to be sure of our objectives before we begin. For these data we are interested in testing whether the 4 morphological characteristics are good classifiers for the three species.

• profile analysis: We may wish to test whether the differences between the species are significant. We can do this via a statistical hypothesis test on the mean score profiles.

• classification: Or we could have a newly discovered flower whose species is disputed by the experts, and we would like to see to which of the three species it belongs.

Research Design

• The Dependent Variable In most cases the researcher will know which is the dependent variable and how many categories it has. However, this is not always straightforward. It could be the case that the investigator is using a variable that could be viewed as continuous, but feels it is more appropriate to categorize it due to precision or for simplification. In this case it is usually better to create a small number of categories.
The categories should be exhaustive and mutually exclusive (every possible observation belongs to exactly one category), and there should be no more categories than are necessary.

• The Independent Variables The researcher should, again, choose or measure the variables of interest and no more. It is my opinion that looking at the data to decide which variables to include as independent variables in the analysis is a form of data snooping. The researcher should have a scientific theory that dictates which variables to include. If there is no theory, then the analysis might still be appropriate, but it should be reported as exploratory and not as a confirmatory scientific analysis.

• Sample Size A rule of thumb is that there should be at least 5 observations for each predictor variable, and that the number of observations in each category should be at least one more than the number of predictors. For the iris example we have 3 categories and 4 independent variables, so we need at least 5 observations in each of the three species categories and at least 20 observations altogether. The analysis also works better when there are approximately the same number of observations in each category, so our minimal experimental design is to have at least 7 observations for each species, for a total of 21 observations. Of course, we would do well to have quite a few more than this. If the sample size becomes too large, even small differences in the discriminant scores between the groups will be significant, so the researcher will always need to be cognizant of what effect size is scientifically meaningful. This is not straightforward when we are looking at differences in discriminant scores, so some exploratory analysis beforehand is advisable.

• Cross-Validation Another consideration when planning an experiment is a sample size that allows cross-validation.
Cross-validation consists of removing a subset of observations from the sample, performing the analysis on the remaining observations, and then using the removed subset as a test set to see how well the procedure performs on observations not included in the original analysis. Your text suggests that using error estimates obtained from the entire data set is better than none at all, though it does admit that such estimates will be biased. However, leave-one-out cross-validation is a much better approach. In leave-one-out cross-validation, only one observation is removed at a time. The analysis is performed on the remaining n − 1 observations, and the left-out observation is then used as a test case. This procedure is repeated for every data point, yielding a prediction-error estimate based on all observations. This approach will work well for any sample larger than, say, 5, and is the most commonly used cross-validation procedure. We will see how to do this in SAS later.

Assumptions of Discriminant Analysis The main assumption of discriminant analysis is that the independent variables are normally distributed within each of the populations. There are different options for data sets with equal covariance matrices versus unequal covariance matrices. There is some disagreement as to how robust the method is to violations of these assumptions. If the investigator is only interested in characterization, then distributional assumptions are not necessary; that is, unless we would like to conduct formal hypothesis tests, we do not need to know the distribution. When we wish to conduct hypothesis tests with data that do not meet the assumption of normality, we can try the standard transformations. There are also non-parametric methods available, in which no distributional assumptions are made. If the equal covariance matrices assumption is violated, we can use the individual within-group covariance matrices rather than the pooled matrix.
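The leave-one-out procedure described above can be sketched directly. This is an illustration only: the data are invented, and a simple nearest-group-mean classifier stands in for the full discriminant analysis.

```python
# Leave-one-out cross-validation sketch. Each observation is held out
# in turn, group means are recomputed from the remaining data, and the
# held-out case is classified to the nearest group mean.
# The data are invented for illustration.

data = [(2.1, "A"), (2.5, "A"), (1.9, "A"), (2.3, "A"),
        (4.0, "B"), (4.4, "B"), (3.8, "B"), (4.6, "B")]

def nearest_mean_classify(x, training):
    means = {}
    for label in {lab for _, lab in training}:
        vals = [v for v, lab in training if lab == label]
        means[label] = sum(vals) / len(vals)
    return min(means, key=lambda lab: abs(x - means[lab]))

errors = 0
for i, (x, true_label) in enumerate(data):
    training = data[:i] + data[i + 1:]        # leave observation i out
    if nearest_mean_classify(x, training) != true_label:
        errors += 1

error_rate = errors / len(data)
print(error_rate)   # 0.0 for this well-separated toy sample
```

Because each case is scored by a rule fit without it, the resulting error rate is a far less biased estimate than the error rate computed on the training data itself.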
If the assumption is valid, we will have more power if we use the pooled covariance matrix. We can test both of these assumptions in SAS. The normality assumption can be tested using proc univariate. The equal covariance assumption can be tested within the proc discrim analysis.

3.1.2 SAS Code and Output

proc univariate data=iris normal;
var SepalLength;
by species;
run;

Results for I. virginica for the variable Sepal Length are shown below. The results for the other variables and for the other two species are similar. The tests for normality are not significant, so we can proceed.

------------------------------- species=3 -------------------------------

The UNIVARIATE Procedure
Variable: SepalLength

                    Tests for Normality

Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W      0.971179   Pr < W      0.2583
Kolmogorov-Smirnov    D      0.115034   Pr > D      0.0953
Cramer-von Mises      W-Sq   0.089467   Pr > W-Sq   0.1538
Anderson-Darling      A-Sq   0.551641   Pr > A-Sq   0.1506

Estimation of the Discriminant Model and Assessing Overall Fit With SAS we have a large number of options. Because the data are normal, we can use methods that require the normality assumption. However, we'd like to be careful about the equal covariance assumption. The option pool=test will test whether it is appropriate to use the pooled covariance matrix or the individual within-group covariance matrices.
SAS Code and Output

proc discrim data=iris outstat=irisstat wcov pcov method=normal pool=test
             distance anova manova listerr crosslisterr;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
title2 'Discriminant Analysis of Iris Data';
run;

The DISCRIM Procedure
Within-Class Covariance Matrices

species = 1, DF = 49
Variable       SepalLength    SepalWidth     PetalLength    PetalWidth
SepalLength    0.1242489796   0.1002979592   0.0161387755   0.0105469388
SepalWidth     0.1002979592   0.1451795918   0.0116816327   0.0114367347
PetalLength    0.0161387755   0.0116816327   0.0301061224   0.0056979592
PetalWidth     0.0105469388   0.0114367347   0.0056979592   0.0114938776

species = 2, DF = 49
Variable       SepalLength    SepalWidth     PetalLength    PetalWidth
SepalLength    0.2664326531   0.0851836735   0.1828979592   0.0557795918
SepalWidth     0.0851836735   0.0984693878   0.0826530612   0.0412040816
PetalLength    0.1828979592   0.0826530612   0.2208163265   0.0731020408
PetalWidth     0.0557795918   0.0412040816   0.0731020408   0.0391061224

species = 3, DF = 49
Variable       SepalLength    SepalWidth     PetalLength    PetalWidth
SepalLength    0.4043428571   0.0937632653   0.3032897959   0.0490938776
SepalWidth     0.0937632653   0.1040040816   0.0713795918   0.0476285714
PetalLength    0.3032897959   0.0713795918   0.3045877551   0.0488244898
PetalWidth     0.0490938776   0.0476285714   0.0488244898   0.0754326531

Pooled Within-Class Covariance Matrix, DF = 147
Variable       SepalLength    SepalWidth     PetalLength    PetalWidth
SepalLength    0.2650081633   0.0930816327   0.1674421769   0.0384734694
SepalWidth     0.0930816327   0.1158843537   0.0552380952   0.0334231293
PetalLength    0.1674421769   0.0552380952   0.1851700680   0.0425414966
PetalWidth     0.0384734694   0.0334231293   0.0425414966   0.0420108844

The DISCRIM Procedure
Test of
Homogeneity of Within Covariance Matrices

Notation:
K    = Number of Groups
P    = Number of Variables
N    = Total Number of Observations − Number of Groups
N(i) = Number of Observations in the i'th Group − 1

V = Π_i |Within SS Matrix(i)|^(N(i)/2) / |Pooled SS Matrix|^(N/2)

RHO = 1.0 − [ Σ_i 1/N(i) − 1/N ] · (2P² + 3P − 1) / ( 6(P+1)(K−1) )

DF = 0.5 (K−1) P (P+1)

Under the null hypothesis,

−2 RHO ln( N^(PN/2) V / Π_i N(i)^(P N(i)/2) )

is distributed approximately as Chi-Square(DF).

Chi-Square      DF    Pr > ChiSq
139.236945      20    <.0001

Since the Chi-Square value is significant at the 0.1 level, the within covariance matrices will be used in the discriminant function.

Univariate Test Statistics (F Statistics, Num DF = 2, Den DF = 147)

Variable       Total SD   Pooled SD   Between SD   R-Square   RSq/(1-RSq)   F Value   Pr > F
SepalLength    0.8281     0.5148      0.7951       0.6187       1.6226       119.26   <.0001
SepalWidth     0.4336     0.3404      0.3313       0.3919       0.6444        47.36   <.0001
PetalLength    1.7644     0.4303      2.0896       0.9413      16.0413      1179.03   <.0001
PetalWidth     0.7632     0.2050      0.8978       0.9288      13.0520       959.32   <.0001

Average R-Square: Unweighted 0.7201854, Weighted by Variance 0.868708

Multivariate Statistics and F Approximations (S=2, M=0.5, N=71)

Statistic                  Value          F Value   Num DF   Den DF   Pr > F
Wilks' Lambda               0.02352545     198.71        8      288   <.0001
Pillai's Trace              1.18720676      52.95        8      290   <.0001
Hotelling-Lawley Trace     32.54952466     583.49        8    203.4   <.0001
Roy's Greatest Root        32.27195780    1169.86        4      145   <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
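Two of the numbers in this output are easy to verify by hand, which is a useful sanity check when reading SAS listings: the pooled within-class covariance entries are degrees-of-freedom-weighted averages of the within-class entries, and the chi-square degrees of freedom come from the formula in the notation block.

```python
# Two arithmetic checks on the SAS output above.

# 1) Pooled within-class variance is the DF-weighted average of the
#    within-class variances. SepalLength entries, each with DF = 49:
within_var = [0.1242489796, 0.2664326531, 0.4043428571]
dfs = [49, 49, 49]
pooled = sum(d * v for d, v in zip(dfs, within_var)) / sum(dfs)
print(round(pooled, 10))   # 0.2650081633, the pooled-matrix entry

# 2) Degrees of freedom for the homogeneity chi-square test:
#    DF = 0.5 * (K - 1) * P * (P + 1), with K = 3 groups, P = 4 variables.
K, P = 3, 4
test_df = 0.5 * (K - 1) * P * (P + 1)
print(test_df)   # 20.0, matching DF = 20 in the output
```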
Posterior Probability of Membership in species

Obs    From species    Classified into species        1         2         3
 71         2                   3 *               0.0000    0.3359    0.6641
 84         2                   3 *               0.0000    0.1543    0.8457
134         3                   2 *               0.0000    0.6050    0.3950

* Misclassified observation

Number of Observations and Percent Classified into species

From species          1           2           3       Total
1                    50           0           0          50
                 100.00        0.00        0.00      100.00
2                     0          48           2          50
                   0.00       96.00        4.00      100.00
3                     0           1          49          50
                   0.00        2.00       98.00      100.00
Total                50          49          51         150
                  33.33       32.67       34.00      100.00
Priors          0.33333     0.33333     0.33333

Error Count Estimates for species

                  1          2          3      Total
Rate         0.0000     0.0400     0.0200     0.0200
Priors       0.3333     0.3333     0.3333

The assumption of equal covariances is violated for these data. We can still use discriminant analysis to separate the data, but we cannot use the pooled estimate. Simulations have shown that discriminant analysis works well most of the time, even when the assumptions are violated; we just need to be careful about how we interpret the results. The discrimination results are quite good: there are only three misclassified observations. Now let's look at our cross-validation results. To do this we need to go into SAS and look at the data set created by the cross-validation procedure. The same three points that were misclassified during "training" were also misclassified during cross-validation.

4 Logistic Regression

The simple logistic regression model is useful in the situation where there is a categorical response variable with two levels and a continuous predictor. The two levels of the response can be labeled with the numbers 1 and 0.
Of course, the values "one" and "zero" are arbitrary assignments to the levels of y, such as "in remission" and "not in remission," or "toxic effect" and "no toxic effect," or "success" and "failure." It is traditional to refer to the y = 1 level as "success" even though it might be a label for an event like "gets cancer" or "dies." The multiple logistic regression model is a way to model the probability of a success as a function of continuous predictors X1, · · · , Xp. It is appropriate when this probability is either increasing over the whole range of x-values, or decreasing over the whole range. When the response y can be either zero or one, the interest is in estimating the probability that y = 1 at given values of the predictors, instead of estimating the response y itself. The logistic regression function is a probability curve. The curve is described by the parameters β0, β1, · · · , βp, which are estimated by b0, b1, · · · , bp. The estimates and their standard errors are computed by SAS, given data consisting of tuples (y, x1, · · · , xp). For a particular set of values of the explanatory variables:

µ(Y | X1, · · · , Xp) = π = the proportion of 1s in the population.

Rather than model µ(Y | X1, · · · , Xp) as a linear function of the explanatory variables, we model

logit(π) = ln( π / (1 − π) ) = β0 + β1 X1 + · · · + βp Xp,

that is, we model the logit of π as a linear function of the explanatory variables. Note that the probability that Y = 1 is equal to its population proportion, π. Note also that if η = logit(π), then π = e^η / (1 + e^η), i.e., the inverse logit gives us back the population proportion.

Example: Carcinogens in Rats A classic example in logistic regression is the dose-response problem, in which increasing doses are thought to be associated with an increasing probability of response. Suppose lab rats are exposed to a sequence of doses of a substance thought to cause cancer.
The independent variable is the size of the dose (say, in milligrams), and the response is whether or not the rat develops cancer. Let's say y = 1 if the rat develops cancer, and y = 0 otherwise. Suppose we have these data:

x    1   2   3   4   5   6   7   8   9  10
y    0   0   0   0   0   1   0   0   1   0

x   11  12  13  14  15  16  17  18  19  20
y    0   0   1   0   1   1   1   1   1   1

If you plot the data, the "scatterplot" has only two response values. [Figure: scatterplot of y (0 or 1) against x = size of dose (mg).] It is clear that fitting a line to the data would not represent the relationship between y and x. It looks like higher doses correspond to higher probabilities of cancer. In other words, if you have a large population of rats all receiving the same dose, a certain proportion of them will get cancer, and this proportion is larger for larger doses. To model the probability of getting cancer at dose x, we need a functional form with certain properties. First, the range of the function used to model a probability must be between zero and one. Second, the function must either be increasing with x over the whole range of values, or decreasing with x over the whole range of values. The logistic regression model does not use the actual value of the response (zero or one) as the underlying value of the function we are fitting to the data. Instead, the probability of y = 1, or P(y = 1), is used as the underlying function. We need a function to fit whose range is between zero and one (because it models a probability), and which is increasing as the independent variable increases. For the ith response, we have

P(yi = 1) = f(xi) = e^(β0 + β1 xi) / (1 + e^(β0 + β1 xi)).

This is called the logistic function. We can see that the function is always between zero and one, and that it is increasing if β1 > 0 and decreasing if β1 < 0. We use the logistic function as our "probability curve." Note that if β1 = 0, then the probability that y = 1 is

P(yi = 1) = f(xi) = e^β0 / (1 + e^β0)

for all values of x.
In other words, if β1 = 0, then the probability of success does not depend on the value of x.

The object in simple logistic regression is to estimate the probability curve by estimating the parameters β0 and β1, and to do inference about the curve. Usually, interest is in determining whether β1 = 0, that is, whether the probability that y = 1 depends on x. SAS provides estimates b0 and b1 of β0 and β1, respectively. There are no closed-form formulas for these estimates; a computer must be used to calculate them.

Suppose the estimated parameters for the rat data are b0 = −4.097 and b1 = 0.3544. The estimated probability curve is shown below, with the data marked as circles. [Figure: fitted logistic curve of P(cancer) against x = size of dose (mg).] Now we can calculate the probability that a rat will develop cancer at a dose x = 10. We plug 10 into the formula:

P(y = 1) = e^(−4.097 + (0.3544)(10)) / (1 + e^(−4.097 + (0.3544)(10))) = e^(−0.553) / (1 + e^(−0.553)) ≈ 0.365

and see that the estimated probability of cancer at x = 10 is about 0.365.

Now you try it:

1. What is the estimated probability of cancer for dose x = 15?
2. What is the probability of y = 1 at x = 4?
3. What is the probability of y = 1 at x = 14?

4.1 Odds Ratios

Let f(x) represent the logistic function, or the probability of a success at x. Then the probability of failure, or of y = 0, is

1 − f(x) = 1 − e^(β0 + β1 x) / (1 + e^(β0 + β1 x)) = 1 / (1 + e^(β0 + β1 x)).

Now we define the odds as

odds = P(Yi = 1) / P(Yi = 0) = f(x) / (1 − f(x)) = e^(β0 + β1 x).

The odds of a success are the probability of success divided by the probability of failure. If "odds = 2" then success is twice as likely as failure; in this case, the probability of success is 2/3. Now we define the log odds of success, given x, as

log odds = log( f(x) / (1 − f(x)) ) = β0 + β1 x,

so the log odds is linear in the predictor variable.
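A short script, using the estimates b0 = −4.097 and b1 = 0.3544 from the rat example, lets you check the "Now you try it" answers and also illustrates the odds relationship: each one-unit increase in dose multiplies the odds by the constant factor e^b1.

```python
import math

b0, b1 = -4.097, 0.3544   # estimates from the rat dose-response example

def prob(x):
    # logistic function: P(y=1) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(x):
    # odds = P(y=1) / P(y=0) = e^(b0 + b1*x)
    return math.exp(b0 + b1 * x)

print(round(prob(15), 3))             # estimated P(cancer) at dose 15
# One more unit of dose multiplies the odds by e^b1, whatever x is:
print(round(odds(11) / odds(10), 4))  # equals round(e^0.3544, 4)
```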
This gives us some language to use when we interpret the parameters: the parameter β1 is the increase in the log odds associated with an increase of one unit in the x variable, and the parameter β0 is the log odds associated with x = 0.

Pronghorn Example: In the pronghorn example, suppose, hypothetically, that the true relationship between the probability of use and the distance to water (in meters) followed the logit model

logit(π) = 3 − 0.0015 · Water.

Then

π = e^(3 − 0.0015 · Water) / (1 + e^(3 − 0.0015 · Water)).

Calculate π for the following distances (note that it might be easier to compute 1 − π and then subtract from 1): Water = 100 m, 1000 m, 3000 m.

• What is the interpretation of the coefficient −0.0015 for the variable distance to water? For every 1 m increase in the distance to water, the log-odds of use decrease by 0.0015; for every 1 km increase in distance to water, the log-odds of use decrease by 1.5.

• More meaningful: for every 1 m increase in the distance to water, the odds of use change by a multiplicative factor of e^(−0.0015) = 0.999. For every 1 km increase in distance to water, the odds of use change by a multiplicative factor of e^(−1.5) = 0.223. (We could also reverse these statements; for example, the odds of use increase by a factor of e^1.5 for every km closer to water.)

4.2 Variance in the Logistic Regression Model

The 0/1 response variable Y is a Bernoulli random variable:

µ(Y | X1, · · · , Xp) = π
SD(Y | X1, · · · , Xp) = √(π(1 − π))

so the variance of Y is not constant. The logistic regression model is an example of a generalized linear model (in contrast to a general linear model, which is the usual regression model with normal errors and constant variance). A generalized linear model is specified by:

1. a link function, which specifies what function of µ(Y) is a linear function of X1, · · · , Xp. In logistic regression, the link function is the logit function.

2. the distribution of Y for a fixed set of values of X1, · · · , Xp.
In the logit model, this is the Bernoulli distribution.

The usual linear regression model is also a generalized linear model: the link function is the identity, f(µ) = µ, and the distribution is normal with constant variance. There are other useful generalized linear models (e.g., a Poisson response distribution with a log link function). A general methodology has been developed to fit and analyze these models.

4.3 Estimation of Logistic Regression Coefficients

Estimation of the parameters in the linear regression model was by least squares. If we assume normal errors with constant variance, the least squares estimators are the same as the maximum likelihood estimators (MLEs). Maximum likelihood estimation is based on a simple principle: the estimates of the parameters in a model are the values which maximize the probability (likelihood) of observing the sample data we have.

Example: Suppose we select a random sample of 10 UFL students in order to estimate what proportion of students own a car. We find that 6 out of the 10 own a car. What is the MLE of the proportion π of all students who own a car? What do we think it should turn out to be?

We model the responses from the students as 10 independent Bernoulli trials with probability of success π on each trial. Then the total number of successes, say Y, in 10 trials follows a binomial model:

Pr(Y = y) = (10 choose y) π^y (1 − π)^(10 − y),   y = 0, 1, · · · , 10.

The maximum likelihood principle says to find the value of π which maximizes the probability of observing the number of successes we actually observed in the sample. That is, maximize

Pr(Y = 6) = (10 choose 6) π^6 (1 − π)^4.

Can you guess what value of π maximizes this expression? We can also find the exact solution using calculus. Note that finding the value of π that maximizes Pr(Y = 6) is equivalent to finding the value of π which maximizes
ln[Pr(Y = 6)] = ln(10 choose 6) + ln[ π^6 (1 − π)^4 ] = ln(10 choose 6) + 6 ln π + 4 ln(1 − π).

Let's solve this for the general case of n and y. Taking the derivative with respect to π and setting it equal to 0, we have

d/dπ [ ln(n choose y) + y ln π + (n − y) ln(1 − π) ] = y/π − (n − y)/(1 − π) = 0.

Now, solving for π:

π̂ = y/n.

So, for our example, our maximum likelihood estimate for π is π̂ = 6/10 = 0.6.

Back to logistic regression: we use the maximum likelihood principle to find estimators of the βs in the logistic regression model. The likelihood function is the probability of observing the particular set of failures and successes that we observed in the sample. But there is a difference from the binomial model above: the model says that the probability of success is possibly different for each subject, because it depends on the explanatory variables X1, · · · , Xp. Recall the link function: π = e^η / (1 + e^η). In logistic regression the parameter estimates β̂0, · · · , β̂p are chosen to maximize the expression (called the likelihood)

L(β0, · · · , βp) = exp( Σ_i yi (β0 + β1 x1i + · · · + βp xpi) ) / Π_i ( 1 + exp(β0 + β1 x1i + · · · + βp xpi) ),

where the sum and product run over i = 1, · · · , n. The natural logarithm of this expression is called the log-likelihood:

l(β0, · · · , βp) = Σ_i yi (β0 + β1 x1i + · · · + βp xpi) − Σ_i log( 1 + exp(β0 + β1 x1i + · · · + βp xpi) ),

and the best-fit parameters are said to minimize the negative log-likelihood. This is not as intuitive as minimizing the sum of squared errors. We can try to find formulas for the best-fit β0, β1, · · · , βp by taking derivatives of this last expression and setting them equal to zero, but this leads to a set of equations that cannot be solved in closed form. There is no formula for the estimates β̂0, β̂1, · · · , β̂p like there is in simple linear regression; rather, iterative computer algorithms built into the statistical packages find the solution. SAS reports the negative log likelihood for the model, evaluated at the best-fit parameters β̂0, β̂1, · · · , β̂p.
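The binomial MLE can also be checked numerically. The brute-force grid search below (an illustration, not how real software fits these models) simply evaluates Pr(Y = 6) over many candidate values of π and keeps the best one, which lands on the sample proportion:

```python
import math

# Maximize Pr(Y = 6) = C(10,6) * pi^6 * (1 - pi)^4 over a fine grid of
# candidate values of pi (a brute-force stand-in for the calculus).
def binom_pmf(y, n, p):
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

candidates = [i / 1000 for i in range(1, 1000)]
best = max(candidates, key=lambda p: binom_pmf(6, 10, p))
print(best)   # 0.6, the sample proportion y/n
```

Logistic regression replaces this one-parameter search with an iterative maximization over all the βs at once, but the principle is identical.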
SAS automatically fits another model as well as the specified model:

P(Yi = 1) = e^β0 / (1 + e^β0).

This is called the "reduced model," as compared to the "full model" described above. Notice that this reduced model does not use the predictor variables; it assumes that all of the responses have the same probability of success. This is used for a test of

H0: βk = 0  vs.  Ha: βk ≠ 0.

Clearly, if βk = 0, the probability that Y = 1 does not depend on xk. The graph of the logistic function is a horizontal line if β1 = 0, where the vertical placement of the line is determined by the value of β0. The negative log likelihood will always be larger for the reduced model. Mathematicians have proved the following result: for large sample sizes n, under the null hypothesis βk = 0, twice the difference between the negative log likelihoods of the reduced and full models is a random variable with (approximately) a chi-squared distribution with one degree of freedom. The Greek letter λ is used to represent the test statistic "twice the difference between the negative log likelihoods." Large values of this statistic support the alternative hypothesis. A simpler way to test H0: βk = 0 is to use the p-value reported in the parameter estimate table, under "Analysis of Maximum Likelihood Estimates."

Example: Teenage Drivers and the Risk of Getting a Ticket

You can find the data on the course website in teendriver.txt. Here's how to analyze the probability of getting a ticket predicted by age and GPA:

proc logistic data=teendriver;
model ticket(event='1')=age GPA;
output out=teenLogRegout predprobs=I p=probpreb;
run;

The event='1' option in the model statement is how we tell SAS to model the probability of getting a ticket. If we leave that option out, SAS will automatically model the probability that Y = 0, i.e., the probability of not getting a ticket, because SAS chooses the response with the "lowest" value (0 is less than 1). We could also have coded ticket as yes or no.
In that case, since "no" comes first alphabetically, SAS would model the probability of not getting a ticket.

The SAS output for logistic regression is rather extensive:

The SAS System    12:24 Saturday, February 3, 2007

The LOGISTIC Procedure

Model Information
Data Set                    WORK.TEENDRIVER
Response Variable           ticket
Number of Response Levels   2
Model                       binary logit
Optimization Technique      Fisher's scoring

Number of Observations Read    52
Number of Observations Used    52

Response Profile
Ordered Value    ticket    Total Frequency
1                0         42
2                1         10

Probability modeled is ticket='1'.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          52.913            47.538
SC           54.865            53.392
-2 Log L     50.913            41.538

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    9.3750        2     0.0092
Score               8.4071        2     0.0149
Wald                6.7558        2     0.0341

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1     12.3331     5.4389            5.1418             0.0234
age          1     -0.5241     0.2488            4.4382             0.0351
GPA          1     -1.6469     0.7465            4.8673             0.0274

Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
age       0.592             0.364    0.964
GPA       0.193             0.045    0.832

Association of Predicted Probabilities and Observed Responses
Percent Concordant    80.7    Somers' D    0.617
Percent Discordant    19.0    Gamma        0.618
Percent Tied           0.2    Tau-a        0.195
Pairs                  420    c            0.808

The negative value for β1 (age) means that the probability of getting a ticket goes down as age goes up. The interpretation of β̂1 is this: for every unit increase in age, the log-odds of getting a ticket decreases by 0.5241. The odds ratio estimate for age is 0.592; note that this is e^(−0.5241). This means that the odds of getting a ticket at age n + 1 are 59.2% of the odds of getting a ticket at age n. In other words, when age goes up by one year, the odds of getting a ticket are only 59.2% of what they were the year before.
The interpretation for β2 is similar. As in linear regression, the parameters are interpreted in the context of holding the other variables constant.

There are three different test statistics for H0: β1 = β2 = 0, the null hypothesis that the probability of getting a ticket depends on neither the age nor the GPA of the driver. The likelihood ratio test compares the likelihoods under the full model and the reduced model with β1 and β2 set to zero. Here the full model has a smaller negative log likelihood than the reduced model. The test statistic λ is twice the difference in negative log likelihoods: λ = 9.375. Larger values of λ support the alternative hypothesis, so the p-value is the area to the right of 9.375 under a chi-squared density with two degrees of freedom (one for each parameter set to zero). We conclude that there is strong evidence that at least one of age or GPA is related to the probability of getting a ticket. The Wald statistic uses the approximate distribution of the estimates of the βs and can be found in the "Analysis of Maximum Likelihood Estimates" part of the output as well as the "Testing Global Null Hypothesis" part. Notice that here the Wald p-value is larger; with larger data sets there is usually not such a big difference between these two p-values. The Score statistic is beyond the scope of this discussion.

The last bit of output shows how the observed data would be classified using the model in a logistic discrimination. In logistic discrimination, the predicted value of Y is obtained by classifying the observation into the category that has the highest probability under the model. So, if a given age and GPA combination has a predicted probability of .52, then that case would be classified as Y = 1, i.e., this person got a ticket. The percent discordant gives an indication of how many observations would be misclassified by the model.

4.4 Model Selection in Multiple Logistic Regression

Just as there were various criteria for model selection in linear regression, we have similar criteria in logistic regression.
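The global likelihood ratio test can be reproduced by hand from the "Model Fit Statistics" table. For a chi-squared variable with two degrees of freedom, the upper tail area has the closed form exp(−λ/2), so no table lookup is needed; a Python sketch using the −2 Log L values reported above:

```python
import math

# -2 log likelihoods from the Model Fit Statistics table
neg2logL_reduced = 50.913  # intercept only
neg2logL_full = 41.538     # intercept, age, and GPA

# lambda = twice the difference in negative log likelihoods
lam = neg2logL_reduced - neg2logL_full

# For df = 2, the chi-squared upper tail probability is exp(-x/2)
p_value = math.exp(-lam / 2)

print(round(lam, 3), round(p_value, 4))  # 9.375 0.0092, matching the SAS output
```

The agreement with SAS's "Likelihood Ratio 9.3750 ... 0.0092" row confirms that the reported statistic really is just the drop in −2 Log L between the intercept-only and full models.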
4.4.1 Likelihood Ratio Tests

In linear regression, we used the extra-sum-of-squares F test to compare two nested models (two models are nested if one is a special case of the other). The analogous test in logistic regression compares the values of -2 ln(maximized likelihood) for the two models.

• The quantity -2 ln(maximized likelihood) is also called the deviance of a model, since larger values indicate greater deviation from the assumed model. Comparing two nested models by the difference in deviances is a drop-in-deviance test.
• The difference between the values of -2 ln(maximized likelihood) for a full and null model has approximately a chi-square distribution if the null hypothesis that the extra parameters are all 0 is true. The degrees of freedom equal the difference in the number of parameters between the two models. SAS performs this test automatically.
• The drop-in-deviance test is a likelihood ratio test (LR test) because it is based on the natural log of the ratio of the maximized likelihoods (the difference of logs is the log of the ratio). The extra-sum-of-squares F test in linear regression also turns out to be a likelihood ratio test.
• If the full and reduced models differ by only one parameter, then the likelihood ratio test is testing the same thing as the Wald test for that coefficient, although the two statistics are generally slightly different. The two tests will usually give similar results, but not always; the likelihood ratio test is preferred.
• The relationship between the Wald test and the likelihood ratio test is analogous to the relationship between the t-test for a single coefficient and the F-test in linear regression. However, in linear regression the two tests are exactly equivalent; not so in logistic regression.
• To use the likelihood ratio test to compare the full model against a reduced model larger than the null model, we obtain the two -2 log likelihoods, find their difference, and then find the p-value from a chi-square distribution with degrees of freedom equal to the number of parameters dropped.

4.4.2 AIC and BIC

Both AIC and BIC can be used as model selection criteria. As with linear regression models, they are only relative measures of fit, not absolute measures of fit.

AIC = Deviance + 2p
BIC = Deviance + p ln(n)

where p is the number of parameters in the model. You probably noticed that AIC was in the basic output. SAS does not output the BIC directly, but it does output the negative log likelihood, so we can calculate BIC ourselves. Let's compare the two reduced models with just one predictor variable to the full model we saw above. First, consider the model with just age as the predictor:

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         52.913           51.355
SC          54.865           55.258
-2 Log L    50.913           47.355

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   3.5583       1    0.0592
Score              3.4813       1    0.0621
Wald               3.2208       1    0.0727

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    6.0373     4.1040           2.1640            0.1413
age         1    -0.4095    0.2282           3.2208            0.0727

Association of Predicted Probabilities and Observed Responses
Percent Concordant   61.7   Somers' D   0.367
Percent Discordant   25.0   Gamma       0.423
Percent Tied         13.3   Tau-a       0.116
Pairs                 420   c           0.683

Now the model with GPA as the predictor:

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         52.913           50.721
SC          54.865           54.623
-2 Log L    50.913           46.721

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   4.1925       1    0.0406
Score              4.1323       1    0.0421
Wald               3.7846       1    0.0517

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    1.6965     1.5831           1.1484            0.2839
GPA         1    -1.2155    0.6248           3.7846            0.0517

Association of Predicted Probabilities and Observed Responses
Percent Concordant   65.5   Somers' D   0.314
Percent Discordant   34.0   Gamma       0.316
Percent Tied          0.5   Tau-a       0.100
Pairs                 420   c           0.657

The model with the smallest AIC is the full model. Notice that the percent concordant is also much better for the full model than for either of the two smaller models. The percent concordant is directly analogous to the prediction error in a linear regression model and so is also a good measure of model fit and predictive ability, though it is still not cross-validation, which we will consider shortly.

4.4.3 Interactions in Logistic Regression

We can include interaction terms in a logistic regression model. Interpretation is analogous to that of linear regression. Proc Logistic handles interaction terms easily:

SAS Code and Output

proc logistic data=teendriver outest=teendriverAgeGPA;
class ticket;
model ticket(event='1')= age GPA age*GPA;
run;

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         52.913           45.021
SC          54.865           52.826
-2 Log L    50.913           37.021

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   13.8923      3    0.0031
Score              9.7159       3    0.0211
Wald               6.9647       3    0.0730

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate    Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -28.4576    20.9376          1.8473            0.1741
age         1    1.6680      1.1267           2.1915            0.1388
GPA         1    16.2074     9.0674           3.1949            0.0739
age*GPA     1    -0.9726     0.5011           3.7672            0.0523

Association of Predicted Probabilities and Observed Responses
Percent Concordant   85.5   Somers' D   0.710
Percent Discordant   14.5   Gamma       0.710
Percent Tied          0.0   Tau-a       0.225
Pairs                 420   c           0.855

Note that the AIC is smaller for the full(er) model with both predictors and the interaction term, and the percent concordant improves as well. Note also that age falls out of significance and GPA is now marginal.
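To see what the fitted interaction model predicts, we can plug a hypothetical driver into the estimated equation (age 16 and GPA 3.0 are illustrative values of my choosing, not a case from the data set):

```python
import math

# Estimates from the interaction model's maximum likelihood table
b0, b_age, b_gpa, b_int = -28.4576, 1.6680, 16.2074, -0.9726

age, gpa = 16, 3.0  # hypothetical driver
eta = b0 + b_age*age + b_gpa*gpa + b_int*age*gpa  # linear predictor (log-odds)
p_ticket = math.exp(eta) / (1 + math.exp(eta))    # logistic transform

print(round(p_ticket, 2))  # about 0.54
```

Since the predicted probability is above .5, logistic discrimination would classify this hypothetical driver as getting a ticket.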
This is not of any real concern: the model selection criteria have improved, so we have a better model.

4.5 Residual Analysis in Logistic Regression

Residual analysis in logistic regression focuses on the identification of outliers and influential points. There are two standard ways to define a residual in logistic regression. The residuals in a logistic regression are not as useful as in a linear regression: they are not assumed to have a normal distribution, so the usual ±2 or ±3 cutoffs do not necessarily apply, and we cannot interpret any patterns we see in a straightforward manner. Instead, we can simply look for outliers in the distribution of Pearson or deviance residuals. We can plot the values against the predicted probabilities, the predictor variables, or the observation number and look for points that seem to lie far from the others. We can plot the residuals using Proc GPLOT as before. You can see from the plots that things are not as clear with the residuals from a logistic regression; we do not necessarily expect normal residuals or residuals scattered around zero.

Pearson's Residual:

R_P = (yi − π̂i) / sqrt( π̂i (1 − π̂i) )

Deviance Residual:

R_D = sqrt( −2 ln π̂i )         if yi = 1
R_D = −sqrt( −2 ln(1 − π̂i) )   if yi = 0

Pearson's residual is sometimes called the "standardized residual." (Why?) The deviance residuals have the property that the sum of the squared deviance residuals is the deviance, D = −2 ln(maximized likelihood), for the model. The Pearson residual is more easily understood, but the deviance residual directly gives the contribution of each point to the lack of fit of the model.
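The two residual formulas translate directly into code; a small sketch for a single binary observation (the function names are mine, not SAS's):

```python
import math

def pearson_resid(y, p_hat):
    """Pearson residual for one binary observation with fitted probability p_hat."""
    return (y - p_hat) / math.sqrt(p_hat * (1 - p_hat))

def deviance_resid(y, p_hat):
    """Deviance residual: its square is the case's contribution to the deviance."""
    if y == 1:
        return math.sqrt(-2 * math.log(p_hat))
    return -math.sqrt(-2 * math.log(1 - p_hat))

# A success predicted with probability 0.5 has Pearson residual exactly 1
print(round(pearson_resid(1, 0.5), 3))   # 1.0
print(round(deviance_resid(1, 0.5), 3))  # 1.177
```

Note the sign convention: the deviance residual is positive for observed successes and negative for observed failures, matching the direction of yi − π̂i.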
data TEENDRIVER;
infile "C:\STA5701\Datasets\teendriver.txt" firstobs=2;
input age ticket$ GPA;
obsno=_N_;
run;

proc logistic data=teendriver;
class ticket;
model ticket(event='1')= age GPA age*GPA;
output out=teenLogRegOut predprobs=I p=predprob resdev=resdev reschi=pearres;
run;

proc gplot data=teenLogRegOut;
plot resdev*obsno;
plot pearres*obsno;
run;
quit;

4.5.1 Measures of Influence

Measures of influence attempt to measure how much individual observations influence the fitted model. Observations with high influence merit special examination. Many measures of influence have been suggested and are used for linear regression, and some of these have analogues for logistic regression. However, the guidelines for deciding what value of an influence measure is big enough to merit special attention are less developed for logistic regression than for linear regression, and the guidelines developed for linear regression are sometimes not appropriate for logistic regression.

Cook's Distance: This is a measure of how much the fitted model changes when a case is deleted. Large values indicate a large change when the observation is left out. Plot Di against case number or against the predicted probabilities and look for outliers.

Leverage: The leverage is a measure of the potential influence of an observation. In linear regression, the leverage of an observation is a function of how far its covariate vector xi is from the average; it is a function only of the covariate vector. In logistic regression, the leverage is a function of both xi and πi (which must be estimated), so it is not necessarily the observations with the most extreme xi's that have the most leverage.

You can also get these and a large number of other diagnostic plots within the Logistic procedure:

proc logistic data=teendriver;
class ticket;
model ticket(event='1')= age GPA age*GPA/influence iplots;
run;

These plots will appear within the output and are difficult to export and incorporate into other programs, but try this on your own.
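The idea behind Cook's distance (how much the fit changes when one case is dropped) can be sketched outside SAS by refitting a small logistic regression with each case deleted in turn. Everything below is illustrative: the toy data, the hand-rolled Newton fit, and the names are mine, and this is not SAS's actual influence computation.

```python
import math

def fit_logistic(xs, ys, iters=30):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson (IRLS), pure Python."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0          # score (gradient of the log likelihood)
        h00 = h01 = h11 = 0.0  # information matrix (2x2, symmetric)
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # Newton step: solve H * step = g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data: binary outcome vs an age-like predictor, with overlapping
# groups so the maximum likelihood estimates exist
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_full, b1_full = fit_logistic(xs, ys)

# Influence sketch: absolute change in the slope when each case is deleted
influence = []
for i in range(len(xs)):
    _, b1_i = fit_logistic(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
    influence.append(abs(b1_i - b1_full))

print(round(b1_full, 2), max(influence) > 0)
```

Plotting the influence values against case number, as suggested for Di above, makes the high-influence cases stand out.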
You can also try using the ODS graphics options in SAS:

ods html;
ods graphics on;
title 'Teenage Drivers: Age and GPA in predicting Tickets';
proc logistic data=teendriver;
class ticket;
model ticket(event='1')= age GPA age*GPA/influence iplots;
run;
ods graphics off;
ods html close;

4.6 Logistic Regression using Counts or Proportions

As we saw in some of the examples at the beginning of the chapter notes, we can also use logistic regression to model counts or the proportion of successes rather than the binary (0/1) outcome. Not all proportions are appropriate to model with logistic regression: proportions like fat calories/total calories are usually modeled using normal theory. The only proportions that are appropriate in this context are those that result from an integer count of a certain outcome out of a total number of trials. In the case that the response variable is a binomial count, i.e., a sum of binary variables, we have: if Xi ∼ Bernoulli(π), then Y = Σ Xi ∼ Binomial(n, π).

Example 1: Island size and bird extinctions. On each island we count the number of species that went extinct out of all the species on the island. What is the relationship between the area of an island and the probability of extinction of birds present on the island?

Example 2: Moth coloration and natural selection. At each distance from Liverpool we count the number of moths of each morph that were taken by predators. What is the relationship between the distance from Liverpool, where trees are dark from industrial soot, and the probability of predation on the light and dark morphs of the moth Carbonaria?

4.6.1 The Logistic Regression Model for Binomial Counts

• Y = the number of successes in m binomial trials. For example, how many species went extinct on each island during the 10-year period of the study?
• Yi ∼ Binomial(mi, πi), where mi is the number of species on the ith island and πi is the probability of extinction on the ith island.
• X1, ..., Xp are the explanatory variables; in the extinction example, X is the area of the island.
• For a particular set of values of the explanatory variables, µ(Y | X1, ..., Xp) = π = the probability of success (extinction in our example).
• π̄ = Y/m = the observed binomial proportion.
• Note that the sample size in the bird extinction study is the number of islands, not the number of species.
• We model µ(Y | X1, ..., Xp) = π just as we did for the binary response model:

logit(π) = η = β0 + β1 X1 + · · · + βp Xp

• As before: π = e^η / (1 + e^η)

4.6.2 Variance

Since the response variable Y is a binomial random variable, we have

SD(Yi | X1i, ..., Xpi) = sqrt( mi πi (1 − πi) )

so the variance of Y is not constant across the Yi.

4.6.3 Estimation of Logistic Regression Coefficients

As for the binary response model, we use the maximum likelihood estimates (MLEs).

4.6.4 SAS Code and Output

Below is the SAS code for the bird extinction and island size problem mentioned at the beginning of these notes.
proc logistic data=IslandBirds;
model extinct/nspecies=area;
run;

The LOGISTIC Procedure

Model Information
Data Set                    WORK.ISLANDBIRDS
Response Variable (Events)  extinct
Response Variable (Trials)  nspecies
Model                       binary logit
Optimization Technique      Fisher's scoring

Response Profile
Ordered Value   Binary Outcome   Total Frequency
1               Event            108
2               Nonevent         524

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         580.013          561.335
SC          584.461          570.233
-2 Log L    578.013          557.335

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   20.6774      1    <.0001
Score              16.5183      1    <.0001
Wald               14.2202      1    0.0002

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.3060    0.1173           123.8728          <.0001
area        1    -0.0101    0.00268          14.2202           0.0002

Odds Ratio Estimates
Effect   Point Estimate   95% Wald Confidence Limits
area     0.990            0.985   0.995

Association of Predicted Probabilities and Observed Responses
Percent Concordant   60.2    Somers' D   0.339
Percent Discordant   26.4    Gamma       0.391
Percent Tied         13.4    Tau-a       0.096
Pairs                56592   c           0.669

4.6.5 Model Assessment

Model assessment for logistic regression using counts is similar to that using the binary outcome variable.

Estimated versus observed: One way to assess the appropriateness of the model and the efficacy of the estimation routine is to plot the estimated probability, π̂i, against the observed response proportion, π̄i. Additionally, plots of the π̂i versus one or more of the explanatory variables are useful for visual examination, just as ordinary scatterplots are in linear regression.

Residual Analysis: As in the binary response case, there are two standard ways to define a residual for binomial counts models.
Pearson's Residual:

R_P = (yi − mi π̂i) / sqrt( mi π̂i (1 − π̂i) )

Deviance Residual:

D_r = sign(Yi − mi π̂i) · sqrt( 2 [ Yi ln( Yi / (mi π̂i) ) + (mi − Yi) ln( (mi − Yi) / (mi − mi π̂i) ) ] )

The Pearson residual is more easily understood, but the deviance residual directly gives the contribution of each point to the lack of fit of the model. Since the data are grouped, the residuals in a binomial counts logistic regression (either Pearson or deviance) are more useful than in the binary response regression. The residuals should be plotted against the predicted values of the πi's and examined for outliers or remaining patterns.

4.6.6 Model Selection

Likelihood Ratio Tests: As in the binary response case, we use the value −2 ln(maximized likelihood) to compare models. Recall that the MLEs of the βs are the values that maximize the likelihood function of the data. So we find the values of the βs that maximize the likelihood function, take the natural log, and multiply by −2.

• The quantity −2 ln(maximized likelihood) is also called the deviance of a model, since larger values indicate greater deviation from the assumed model. Comparing two nested models by the difference in deviances is a drop-in-deviance test.
• The difference between the values of −2 ln(maximized likelihood) for a full and reduced model has approximately a chi-square distribution if the null hypothesis that the extra parameters are all 0 is true. The degrees of freedom equal the difference in the number of parameters between the two models.

AIC and BIC: Both AIC and BIC can be used as model selection criteria. As with linear regression models, they are only relative measures of fit, not absolute measures of fit.

• AIC = Deviance + 2p
• BIC = Deviance + p ln(n)

where p is the number of parameters in the model.

5 Appendix

• Afifi, Abdelmonem, V.A. Clark, and S. May 2004. Computer-Aided Multivariate Analysis. Chapman & Hall/CRC.
• Quinn, Gerry P. and Michael J. Keough 2002.
Experimental Design and Data Analysis for Biologists. Cambridge University Press.
• McCullagh, P. and J.A. Nelder 1998. Generalized Linear Models. Chapman & Hall/CRC.