582729638 Revised: 4/30/2017

Chapter 17. Logistic Regression.

17:1 What is Logistic Regression?

In general, logistic regression is a method to classify objects, plots, observations, cases, or individuals (all are synonyms in this subject) into pre-existing, non-overlapping classes, categories, or groups. From this point of view, logistic regression has exactly the same goals as discriminant analysis.

Example 1. You want to predict a discrete outcome of operating a farm: the farm is or is not economically sustainable. For this prediction you can use a number of characteristics measured on other farms that operated for a while and either failed or remained in operation, and develop a prediction equation that describes the probability of success or failure.

Example 2. A population of animals that live about 4 years has to be described by its age structure. An equation to classify each individual into an age class can be developed based on a random sample of individuals that were tagged at birth over the last 10 years. Length, weight, and other continuous and categorical variables can be tested through logistic regression to produce an optimal age-predicting curve that can be applied to any animal trapped.

17:2 When and why to use Logistic Regression?

As indicated before, logistic regression has the same uses as discriminant analysis, but there are some differences.

1. The response variable has to be binary or ordinal.
2. Logistic regression does not require normally distributed errors or predictors; the response is modeled as a Bernoulli variable.
3. Predictors can be continuous, discrete, or combinations of variables.
4. Non-linear relationships between the response and predictors are accommodated.
5. Because of its similarity with regression, logistic regression offers easy model-building or variable selection procedures.
6. The logistic model directly predicts the probability that each object belongs to each group as a function of the values of the predictors.
7. Parameter estimates are obtained by maximum likelihood methods that require computationally intensive numerical solutions.

These differences suggest that logistic regression is a better choice than discriminant analysis when there are categorical predictors, when the assumption of multivariate normality is not met, when the effects of a predictor on the outcome are not linear, and when a large number of predictors have to be screened for predictive power. When the assumptions of multivariate normality and linearity are met, discriminant analysis, if applicable, is more efficient than logistic regression.

17:3 Model and assumptions.

Consider a binary response, for example the sex (Y) of an individual before any obvious external dimorphism has developed. Suppose that females tend to have a slightly different shape from males, as evinced by the weight/length ratio (X). Given that an individual has a certain shape, the probability that it is male is p, so the probability that it is female is q = 1 - p. For practical purposes, let Y = 1 when male and Y = 0 when female. Logistic regression determines if and how p varies with shape.

17:3.1 Model

The expected value of Y, given X, is calculated as usual:

$$Y = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

$$E\{Y\} = 1 \cdot p + 0 \cdot (1 - p) = p$$

Theoretical and practical considerations indicate that the effects of the predictor X on the expected value of Y, if any, can be represented by the logistic model. This model is flexible and can represent curves that range from almost straight horizontal to almost straight vertical, within the interval [0,1] of valid probability values.

$$p = E\{Y\} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

Through a logit transformation the model is linearized:

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X$$

so we can write the model as follows: the observations $Y_i$ are independent Bernoulli random variables with expected values $E\{Y_i\} = p_i$, where $p_i$ is the logistic function above evaluated at $X_i$. The estimated parameters b0 and b1 can be used to get a predicted logit(p), and from it a predicted probability p, for any level of X.
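The relationship between the linear predictor, logit(p), and p can be sketched numerically. The following Python snippet is a minimal illustration and not part of the original analysis. It borrows the estimates b0 = 9.6830 and b1 = -1.1448 from the SAS output in section 17:5.2, and it assumes that those estimates describe the probability of the first ordered response level (female), which is the usual SAS default. The function names are made up for this example.

```python
import math

# Estimates copied from the SAS output in section 17:5.2.
# Assumption: they model P(female), the first ordered response level.
B0 = 9.6830
B1 = -1.1448


def logistic(eta):
    """Back-transform a logit value eta into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predicted_probability(x, b0=B0, b1=B1):
    """Predicted probability at predictor value x."""
    return logistic(b0 + b1 * x)


# Shape value at which the predicted logit is 0, that is, p = 0.5.
cutoff = -B0 / B1

if __name__ == "__main__":
    for x in (7.0, 8.0, cutoff, 9.0, 10.0):
        p = predicted_probability(x)
        print(f"shape = {x:6.3f}   logit = {logit(p):7.3f}   p = {p:.3f}")
```

Under these assumptions the predicted probability crosses 0.5 at a shape value of about 8.46, which is in line with the cutoff of roughly 8.4 read from Figure 17-1.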
The logit function can be quite general, because it can accept continuous and discrete variables, and no assumptions are imposed on their distributions. In this example of simple logistic regression, the "back-transformed" values can be plotted against X, and a simple decision rule emerges, as seen in Figure 17-1.

Figure 17-1. Logistic regression of sex on an index of shape. Fictitious data.

The figure can be interpreted as the proportion of females and males in the population for each value of X. Based on the model and the figure, a rule to classify individuals with shape X lower than 8.4 as females is established. The model also gives us a continuous measure of error rate as a function of the X variable. A curve that becomes steep reflects better discrimination, whereas a flat line shows that the predictor contains no information about the response.

Note that the meaning of the scatter of points in these plots is limited, because there is only one predicted probability associated with each point, but no observed probability. Each point is placed horizontally at its observed value of X and then vertically at a height randomly chosen within the correct region. In Figure 17-1, the points above the blue line are all male, and those below the line are female. Because the heights are randomly chosen, a different scatter plot is produced each time you run the logistic regression for a given dataset.

Figure 17-2. Simulation of situations when shapeX is a good (left) and a poor (right) predictor of sex. The adequacy of the model is given by the Rsquare (U) value, which is 0.75 for the left panel and 0.03 for the right panel.

17:3.2 Assumptions and limitations

As usual, the training sample has to be a random sample of the population for which the equation will be used.
Logistic regression requires no additional assumptions about the distributions of the predictors or predicted variables, so it is quite useful from this point of view. However, if the usual assumptions of multivariate normality are met, discriminant analysis is usually a more efficient and more powerful method. Some limitations or cautions must be considered for logistic regression.

17:3.2.1 Ratio of observations to predictors

If too many predictors are included relative to the number of cases or observations, the analysis can produce very large parameter estimates and standard errors. This is particularly problematic when there are several nominal predictors that generate a large number of cells or “dummy” variables in the linear model. This situation can be corrected by merging categories and by obtaining more observations such that all possible cells are represented in the sample. As a guideline, have a minimum of 30 observations per continuous predictor plus 6-10 for each combination of values of each nominal predictor.

17:3.2.2 Observations or cases per cell

Because the analysis is based on a test of goodness of fit, the presence of cells with expected values smaller than one or with fewer than 5 observations significantly reduces the power of the test. Check all pairs of nominal variables and merge categories as necessary to obtain cells with expected frequencies greater than 1 and to have fewer than 20% of cells with observed frequencies less than 5.

17:3.2.3 Collinearity among predictors

A multiple linear regression, solved by maximum likelihood, is at the core of logistic regression. Thus, the method is subject to exactly the same collinearity problems described for multiple linear regression. This is addressed by a process of backward elimination, whereby all variables and interactions are included in the model at first, and then one proceeds to eliminate the least significant interaction and run the modified model again.
Continue deleting one effect at a time until the model contains only significant interactions, significant simple effects, and non-significant simple effects that are involved in significant interactions.

17:3.2.4 Extreme values of predictors

The equation is sensitive to extreme values of continuous predictors. Although no distribution is assumed for the predictors, they should be explored by standardizing them and flagging observations whose standardized values exceed 3 in absolute value. Multivariate outliers can also be studied by standard techniques.

17:4 Detection as classification.

This type of analysis is frequent in the health sciences, as individuals have to be “classified” as having or not having a condition or disease based on the result of a test (X variable or predictor). Frequently, tests involve titration or quantitative measurements of antibodies or chemicals that exist both in individuals with the disease and in those without it.

17:4.1 Structure of the problem.

The population of individuals can be exhaustively partitioned into those who fall in the, say, “infected” class and those who fall in the “not infected” class. Based on more expensive tests or on tracking the evolution of patients, a test has been developed to determine if people are infected or not. The test yields a value X, for example the concentration of a certain protein in the blood, which is related to the infection. The application of a logistic regression to assign individuals to “positive” or “negative” groups results in individuals in each of the four possible classes, as shown in the table below, which contains fictional data.

                             Test result
    True state      Positive               Negative
    Infected        Correct: 52            False negative: 8
    Not infected    False positive: 20     Correct: 100

In assessing the classification procedure, it is important to take into account the false positives and false negatives, both together and separately.
Consider, in the sex example above, how the numbers of females classified as males and males classified as females change as the “critical” shapeX is varied from the minimum to the maximum. In the case of diagnosis and signal detection, this relationship is important to assess the consequences of making mistakes and to decide on the overall performance of the test.

Because the subjects tested are usually not a random sample of the population (people who are feeling well are less likely to be tested), one has to correct the probabilities to assess how the test would do in the general population. When the training sample does not represent the prevalence of infection in the general population, it still yields correct conditional probabilities within rows: given that the subject is infected (or not), the row frequencies are the probabilities of positive and negative test results. The row frequencies for the table above are presented below. The row frequencies should add up to 1 across columns.

                         Test result
    True state      Positive    Negative
    Infected        52/60       8/60
    Not infected    20/120      100/120

The correction to determine what proportion of the positives are actually infected in the general population is done on the basis of an a priori estimate of the probability that any subject from the population is infected (the prevalence of the infection). One is interested in the proportion of randomly tested individuals with a positive result who are actually infected.

Assuming that the prevalence in the general population is 15%, the proportion of individuals that test positive who are actually infected is:

$$P(\text{inf.} \mid \text{pos.}) = \frac{P(\text{inf.} \cap \text{pos.})}{P(\text{pos.})} = \frac{0.15 \cdot \frac{52}{60}}{0.15 \cdot \frac{52}{60} + 0.85 \cdot \frac{20}{120}} \approx 0.479$$

The problem is that the prevalence is usually a small number, so most of the subjects in the population are not infected and the total number of positives becomes highly “contaminated” by false positives.

17:4.2 Measures of usefulness of the classification function.
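The prevalence correction worked out in 17:4.1 uses the row proportions that this section goes on to name: the sensitivity and one minus the specificity. As a minimal sketch, the calculation can be reproduced in Python. The function and argument names are illustrative and do not come from the original text; the inputs are the fictional table values and the assumed 15% prevalence.

```python
def proportion_infected_given_positive(prevalence,
                                       p_pos_given_infected,
                                       p_pos_given_not_infected):
    """Bayes' rule: P(infected | positive) in the general population."""
    # Joint probability of being infected and testing positive.
    true_positive = prevalence * p_pos_given_infected
    # Joint probability of being not infected and testing positive.
    false_positive = (1.0 - prevalence) * p_pos_given_not_infected
    return true_positive / (true_positive + false_positive)


# Row proportions from the fictional table in 17:4.1.
P_POS_GIVEN_INFECTED = 52 / 60        # positives among the infected
P_POS_GIVEN_NOT_INFECTED = 20 / 120   # positives among the not infected

ppv_15 = proportion_infected_given_positive(
    0.15, P_POS_GIVEN_INFECTED, P_POS_GIVEN_NOT_INFECTED)

if __name__ == "__main__":
    # Show how the proportion shrinks as the prevalence gets smaller.
    for prevalence in (0.15, 0.05, 0.01):
        value = proportion_infected_given_positive(
            prevalence, P_POS_GIVEN_INFECTED, P_POS_GIVEN_NOT_INFECTED)
        print(f"prevalence = {prevalence:.2f}   P(inf. | pos.) = {value:.3f}")
```

With a 15% prevalence the result is about 0.48, and lowering the prevalence argument shows how quickly false positives come to dominate the positive results.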
Two measures, sensitivity and specificity, are calculated for 2x2 tables.

17:4.2.1 Sensitivity

Sensitivity is the probability that the test correctly identifies the presence of infection. In the table above, sensitivity is the proportion of correct positives within the infected individuals, P(positive | infected), or probability of positive given infected (52/60).

17:4.2.2 Specificity

Specificity is the proportion of individuals correctly identified as not being infected. This is the probability of negative given not infected (100/120). Note that the correction to determine the proportion of true positives in the population depends on both measures: it uses the sensitivity directly, and it uses the specificity through P(positive | not infected) = 1 - specificity (20/120).

17:5 Obtaining and interpreting output with SAS.

17:5.1 SAS code.

    proc logistic data=sex;
      model sex=shape / corrb ctable;
    run;

17:5.2 SAS output.

    The LOGISTIC Procedure

    Data Set: WORK.SEX
    Response Variable: SEX
    Response Levels: 2
    Number of Observations: 100
    Link Function: Logit

    Response Profile

    Ordered
      Value    SEX       Count
          1    female       50
          2    male         50

    Model Fitting Information and Testing Global Null Hypothesis BETA=0

                 Intercept    Intercept and
    Criterion    Only         Covariates       Chi-Square for Covariates
    AIC          140.629      116.517          .
    SC           143.235      121.727          .
    -2 LOG L     138.629      112.517          26.112 with 1 DF (p=0.0001)
    Score        .            .                23.136 with 1 DF (p=0.0001)

    Analysis of Maximum Likelihood Estimates

                     Parameter   Standard   Wald         Pr >         Standardized   Odds
    Variable    DF   Estimate    Error      Chi-Square   Chi-Square   Estimate       Ratio
    INTERCPT    1     9.6830     2.2634     18.3013      0.0001       .              .
    SHAPE       1    -1.1448     0.2668     18.4147      0.0001       -0.686124      0.318

    Association of Predicted Probabilities and Observed Responses

    Concordant = 77.6%        Somers' D = 0.553
    Discordant = 22.3%        Gamma     = 0.554
    Tied       =  0.1%        Tau-a     = 0.279
    (2500 pairs)              c         = 0.777

    Estimated Correlation Matrix

    Variable      INTERCPT       SHAPE
    INTERCPT       1.00000    -0.99487
    SHAPE         -0.99487     1.00000

    Classification Table

             Correct      Incorrect                 Percentages
    Prob           Non-          Non-           Sensi-  Speci-   False   False
    Level  Event  Event  Event  Event  Correct  tivity  ficity     POS     NEG
    0.040     50      0     50      0     50.0   100.0     0.0    50.0       .
    0.060     50      1     49      0     51.0   100.0     2.0    49.5     0.0
    0.080     50      2     48      0     52.0   100.0     4.0    49.0     0.0
    0.100     50      4     46      0     54.0   100.0     8.0    47.9     0.0
    0.120     50      6     44      0     56.0   100.0    12.0    46.8     0.0
    0.140     49      9     41      1     58.0    98.0    18.0    45.6    10.0
    0.160     49     11     39      1     60.0    98.0    22.0    44.3     8.3
    0.180     49     12     38      1     61.0    98.0    24.0    43.7     7.7
    0.200     48     12     38      2     60.0    96.0    24.0    44.2    14.3
    0.220     47     14     36      3     61.0    94.0    28.0    43.4    17.6
    0.240     47     15     35      3     62.0    94.0    30.0    42.7    16.7
    0.260     45     17     33      5     62.0    90.0    34.0    42.3    22.7
    0.280     44     20     30      6     64.0    88.0    40.0    40.5    23.1
    0.300     44     20     30      6     64.0    88.0    40.0    40.5    23.1
    0.320     44     20     30      6     64.0    88.0    40.0    40.5    23.1
    0.340     43     20     30      7     63.0    86.0    40.0    41.1    25.9
    0.360     43     21     29      7     64.0    86.0    42.0    40.3    25.0
    0.380     41     25     25      9     66.0    82.0    50.0    37.9    26.5
    0.400     40     26     24     10     66.0    80.0    52.0    37.5    27.8
    0.420     36     26     24     14     62.0    72.0    52.0    40.0    35.0
    0.440     36     28     22     14     64.0    72.0    56.0    37.9    33.3
    0.460     35     30     20     15     65.0    70.0    60.0    36.4    33.3
    0.480     35     32     18     15     67.0    70.0    64.0    34.0    31.9
    0.500     34     33     17     16     67.0    68.0    66.0    33.3    32.7
    0.520     34     33     17     16     67.0    68.0    66.0    33.3    32.7
    0.540     31     37     13     19     68.0    62.0    74.0    29.5    33.9
    0.560     31     37     13     19     68.0    62.0    74.0    29.5    33.9
    0.580     30     39     11     20     69.0    60.0    78.0    26.8    33.9
    0.600     29     41      9     21     70.0    58.0    82.0    23.7    33.9
    0.620     28     42      8     22     70.0    56.0    84.0    22.2    34.4
    0.640     26     43      7     24     69.0    52.0    86.0    21.2    35.8
    0.660     23     44      6     27     67.0    46.0    88.0    20.7    38.0
    0.680     22     45      5     28     67.0    44.0    90.0    18.5    38.4
    0.700     21     46      4     29     67.0    42.0    92.0    16.0    38.7
    0.720     20     46      4     30     66.0    40.0    92.0    16.7    39.5
    0.740     18     48      2     32     66.0    36.0    96.0    10.0    40.0
    0.760     17     49      1     33     66.0    34.0    98.0     5.6    40.2
    0.780     15     49      1     35     64.0    30.0    98.0     6.3    41.7
    0.800     15     49      1     35     64.0    30.0    98.0     6.3    41.7
    0.820     10     49      1     40     59.0    20.0    98.0     9.1    44.9
    0.840      6     49      1     44     55.0    12.0    98.0    14.3    47.3
    0.860      4     49      1     46     53.0     8.0    98.0    20.0    48.4
    0.880      3     49      1     47     52.0     6.0    98.0    25.0    49.0
    0.900      2     49      1     48     51.0     4.0    98.0    33.3    49.5
    0.920      1     50      0     49     51.0     2.0   100.0     0.0    49.5
    0.940      1     50      0     49     51.0     2.0   100.0     0.0    49.5
    0.960      1     50      0     49     51.0     2.0   100.0     0.0    49.5
    0.980      0     50      0     50     50.0     0.0   100.0       .    50.0
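The percentage columns of the classification table can be recomputed from the four count columns, which is a useful check on how each column is defined. The Python sketch below is illustrative only. The definitions are inferred from the table values rather than taken from SAS documentation, and the function and key names are hypothetical.

```python
def classification_summary(correct_event, correct_nonevent,
                           incorrect_event, incorrect_nonevent):
    """Recompute the percentage columns for one row of the table.

    incorrect_event:    non-events that were classified as events
    incorrect_nonevent: events that were classified as non-events
    """
    def pct(numerator, denominator):
        # SAS prints "." when the denominator is zero; None is used here.
        if denominator == 0:
            return None
        return round(100.0 * numerator / denominator, 1)

    total = (correct_event + correct_nonevent
             + incorrect_event + incorrect_nonevent)
    return {
        "correct": pct(correct_event + correct_nonevent, total),
        # Share of true events that were classified as events.
        "sensitivity": pct(correct_event, correct_event + incorrect_nonevent),
        # Share of true non-events that were classified as non-events.
        "specificity": pct(correct_nonevent, correct_nonevent + incorrect_event),
        # Share of predicted events that were actually non-events.
        "false_pos": pct(incorrect_event, correct_event + incorrect_event),
        # Share of predicted non-events that were actually events.
        "false_neg": pct(incorrect_nonevent,
                         correct_nonevent + incorrect_nonevent),
    }


# Counts from the row at probability level 0.500.
row_050 = classification_summary(34, 33, 17, 16)
# Counts from the last row (0.980), where no case is classified as an event.
row_098 = classification_summary(0, 50, 0, 50)

if __name__ == "__main__":
    print(row_050)
    print(row_098)
```

Note that the False POS and False NEG columns are proportions of the predicted events and predicted non-events, so they are not simply the complements of specificity and sensitivity.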