Download Logistic regression - UC Davis Plant Sciences

582729638
Revised: 4/30/2017
Chapter 17. Logistic Regression.
17:1 What is Logistic Regression?
In general, logistic regression is a method to classify objects, plots, observations, cases, or individuals (all
are synonyms in this subject) into pre-existing non-overlapping classes, categories, or groups. From this point of
view, logistic regression has exactly the same goals as discriminant analysis.
Example 1. You want to predict a discrete outcome of operating a farm: the farm is or is not economically
sustainable. For this prediction you can use a number of characteristics measured on other farms that operated
for a while and either failed or remained in operation and develop a prediction equation that describes the
probability of success or failure.
Example 2. A population of animals that live about 4 years has to be described by its age structure. An
equation to classify each individual into an age class can be developed based on a random sample of
individuals that were tagged at birth over the last 10 years. Length, weight and other continuous and categorical
variables can be tested through logistic regression to produce an optimal age-predicting curve that can be
applied to any animal trapped.
17:2 When and why to use Logistic Regression?
As indicated before, logistic regression has the same uses as discriminant analysis, but there are some
differences.
1. The response variable has to be binary or ordinal.
2. Logistic regression requires no specific distribution of the predictors; in particular, multivariate normality
is not assumed. The response is only assumed to follow a Bernoulli (binomial) distribution.
3. Predictors can be continuous, discrete, or combinations of variables.
4. Non-linear relationships between the response and predictors are accommodated.
5. Because of its similarity with regression, logistic regression offers easy model-building or variable selection
procedures.
6. The logistic model directly predicts the probability that each object belongs to each group as a function of
the values of the predictors.
7. Parameter estimates are obtained by maximum likelihood methods that require computationally intensive
numerical solutions.
These differences suggest that logistic regression is a better choice than discriminant analysis when there
are categorical predictors, when the assumption of multivariate normality is not met, when the effects of a
predictor on the outcome are not linear, and when a large number of predictors have to be screened for
predictive power.
When the assumptions of multivariate normality and linearity are met, discriminant analysis, if applicable, is
more efficient than logistic regression.
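To illustrate item 7 above, the following is a minimal sketch of how maximum likelihood estimates can be obtained iteratively for a single predictor by Newton-Raphson. It is not the algorithm of any particular package; the function name `fit_simple_logistic` and the small data set are hypothetical and chosen only for illustration.

```python
import math


def fit_simple_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by maximum likelihood (Newton-Raphson)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2 x 2 system (information matrix) * step = gradient.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data (not from this chapter). The two classes overlap,
# so finite maximum likelihood estimates exist.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_simple_logistic(x, y)
print(b0, b1)
```

Each pass recomputes the predicted probabilities with the current estimates, which is why the solution is iterative rather than closed-form as in least squares.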
17:3 Model and assumptions.
Consider a binary response, for example sex (Y) of an individual before any obvious external dimorphism is
developed. Suppose that females tend to have a slightly different shape from males, as evinced by the
weight/length ratio (X). Given that it has a certain shape, the probability that any individual is male is p, so the
probability of female is q = 1 - p. For practical purposes, let Y = 1 when male and Y = 0 when female. Logistic regression
determines if and how p varies with shape.
17:3.1 Model
The expected value of Y, given X, is calculated as usual:
$$Y = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$
$$E\{Y\} = 1 \cdot p + 0 \cdot (1 - p) = p$$
Theoretical and practical considerations indicate that the effects of the predictor X on the expected value of
Y, if any, can be represented by the logistic model. This model is flexible and can represent lines that range
from almost straight horizontal to almost straight vertical, within the interval [0,1] of valid probability values.
$$p = E\{Y\} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
Through a logit transformation the model is linearized:
$$\mathrm{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X$$
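so we can write the model as: the observations $Y_i$ are independent Bernoulli random variables with expected
values $E\{Y_i\} = p_i$, where $\mathrm{logit}(p_i) = \beta_0 + \beta_1 X_i$.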
The estimated parameters b0 and b1 can be used to get a predicted logit(p) and then a predicted probability p for
any level of X. The logit function can be quite general, because it can accept continuous and discrete variables,
and no assumptions are imposed on their distributions.
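The back-transformation from a predicted logit to a probability can be sketched as follows. The function names and the values of b0 and b1 are hypothetical and serve only to show the arithmetic; `inverse_logit` uses the form 1/(1 + e^(-eta)), which is algebraically the same as e^eta/(1 + e^eta).

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Back-transform a logit (log odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def predicted_probability(b0, b1, x):
    """Predicted p for a given level of X under logit(p) = b0 + b1 * X."""
    return inverse_logit(b0 + b1 * x)


# Hypothetical estimates, for illustration only.
b0, b1 = -3.0, 0.5
for x in (2.0, 6.0, 10.0):
    print(x, round(predicted_probability(b0, b1, x), 3))
```

With these hypothetical values the predicted logit is zero at X = 6, so the predicted probability there is exactly 0.5.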
In this example of simple logistic regression, the "back-transformed" values can be plotted against X and a
simple decision rule emerges as seen in the following Figure 17-1.
Figure 17-1. Logistic regression of sex on an index of shape. Fictitious
data. The figure can be interpreted as the proportion of
females and males in the population for each value of X.
Based on the model and the figure, a rule to classify individuals with shape X lower than 8.4 as females is
established. The model also gives us a continuous measure of error rate as a function of the X variable. A curve
that becomes steep reflects a better discrimination, whereas a flat line shows that the predictor contains no
information about the response. Note that the meaning of the scatter of points in these plots is limited, because
there is only one predicted probability associated with each point, but no observed probability. Each point is
placed horizontally at its observed value of X and then vertically at a height randomly chosen within the correct
region. In Figure 17-1, the points above the blue line are all male, and those below the line are female. Because
the heights are randomly chosen, each time you run the logistic regression for a given dataset a different scatter
plot will be produced.
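The critical value of X in the classification rule is the point where the predicted probability crosses the chosen cutoff. A minimal sketch follows, using the estimates reported in the SAS output of section 17:5.2 (intercept 9.6830, slope -1.1448) and assuming, as the response profile there suggests, that the fitted probability is that of "female". The function name `critical_x` is hypothetical.

```python
import math


def critical_x(b0, b1, cutoff=0.5):
    """Value of X at which the predicted probability equals `cutoff`."""
    return (math.log(cutoff / (1.0 - cutoff)) - b0) / b1


# Estimates taken from the SAS output in section 17:5.2.
b0 = 9.6830
b1 = -1.1448
print(round(critical_x(b0, b1), 3))  # about 8.458
```

At a cutoff of 0.5 the logit is zero, so the critical value reduces to -b0/b1, which agrees with the rounded value of 8.4 used in the text. Other cutoffs shift the critical value and therefore the balance between the two kinds of misclassification.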
Figure 17-2. Simulation of situations when shapeX is a good (left) and a poor (right)
predictor of sex. The adequacy of the model is given by the Rsquare or U
value, which is 0.75 and 0.03 for the left and right panels.
17:3.2 Assumptions and limitations
As usual, the training sample has to be a random sample of the population for which the equation will be
used. Logistic regression requires no additional assumptions about the distributions of the predictors or
predicted variables, so it is quite useful from this point of view. However, if the usual assumptions of multivariate
normality are met, discriminant analysis is usually a more efficient and stronger method.
Some limitations or cautions must be considered for logistic regression.
17:3.2.1 Ratio of observations to predictors
If too many predictors are included relative to the number of cases or observations, the analysis can
produce large values for the parameters and standard errors. This is particularly problematic when there are
several nominal predictors that generate a large number of cells or “dummy” variables in the linear model. This
situation can be corrected by merging categories and by obtaining more observations such that all possible cells
are represented in the sample. As a guideline, have a minimum of 30 observations per continuous predictor plus
6-10 for each combination of values of each nominal predictor.
17:3.2.2 Observations or cases per cell
Because the analysis is based on a test of goodness of fit, the presence of cells with expected values
smaller than 1 or with fewer than 5 observations substantially reduces the power of the test. Check all pairs of
nominal variables and merge categories as necessary to obtain cells with expected frequencies greater than 1
and to have fewer than 20% of cells with observed frequencies less than 5.
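The check described above can be sketched for one pair of nominal variables as follows. Expected frequencies are computed under independence (row total times column total divided by the grand total); the function names and the cross-tabulation are hypothetical.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def check_cells(table):
    """Return (ok, smallest expected count, share of cells with observed count < 5)."""
    expected = [e for row in expected_frequencies(table) for e in row]
    observed = [o for row in table for o in row]
    min_expected = min(expected)
    share_small = sum(1 for o in observed if o < 5) / len(observed)
    ok = min_expected > 1 and share_small < 0.20
    return ok, min_expected, share_small


# Hypothetical 3 x 3 cross-tabulation of two nominal predictors.
table = [[12, 7, 3],
         [9, 15, 6],
         [4, 8, 11]]
print(check_cells(table))

# Merging the last two categories of the column variable.
merged = [[row[0], row[1] + row[2]] for row in table]
print(check_cells(merged))
```

In this made-up example two of the nine cells have fewer than 5 observations, so the guideline is not met until the last two column categories are merged.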
17:3.2.3 Collinearity among predictors
A linear model for the logit, solved by maximum likelihood, is at the core of logistic regression. Thus, the
method is subject to exactly the same collinearity problems described for multiple linear regression. This is
addressed by a process of backward elimination, whereby all variables and interactions are included in the
model at first, and then one eliminates the least significant interaction and runs the modified model
again. Continue deleting one effect at a time until the model contains only significant interactions, significant
simple effects, and non-significant simple effects involved in significant interactions.
17:3.2.4 Extreme values of predictors
The equation is sensitive to extreme values of continuous predictors. Although no distribution is assumed for
the predictors, they should be explored by standardizing them and flagging observations whose standardized
values are greater than 3 in absolute value. Multivariate outliers can also be studied by standard techniques.
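A minimal sketch of the univariate screening step, assuming standardization with the sample mean and sample standard deviation. The function name `flag_extreme` and the data are hypothetical.

```python
import statistics


def flag_extreme(values, limit=3.0):
    """Return (value, z) pairs whose standardized value exceeds `limit` in absolute value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    flagged = []
    for v in values:
        z = (v - mean) / sd
        if abs(z) > limit:
            flagged.append((v, z))
    return flagged


# Hypothetical values of a continuous predictor; the last one is an extreme value.
values = [7.8, 8.1, 8.4, 7.9, 8.6, 8.2, 8.0, 8.3, 7.7, 8.5,
          8.1, 8.2, 7.9, 8.4, 8.0, 8.3, 8.1, 7.8, 8.2, 15.0]
print(flag_extreme(values))
```

Note that an extreme observation inflates the standard deviation used to standardize it, so in very small samples no observation can reach a standardized value of 3; the rule is more useful with larger samples.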
17:4 Detection as classification.
This type of analysis is frequent in the health sciences, as individuals have to be “classified” as having or not
having a condition or disease based on the result of a test (X variable or predictor). Frequently, tests involve
titration or quantitative measurements of antibodies or chemicals that exist in individuals both with and without
the disease.
17:4.1 Structure of the problem.
The population of individuals can be exhaustively partitioned into those who fall in the, say, “infected” class and
those who are in the “not infected” class. Based on more expensive tests or on tracking the evolution of
patients, a test has been developed to determine if people are infected or not. The test yields a value X, for
example, the concentration of a certain protein in the blood, which is related to the infection. Applying a
logistic regression to assign individuals to “positive” or “negative” groups places individuals in each of the four
possible classes shown in the table below, which contains fictional data.
                          Test result
True state        Positive                Negative
Infected          Correct: 52             False negative: 8
Not infected      False positive: 20      Correct: 100
In assessing the classification procedure, it is important to take into account the false positives and false negatives
together and separately. Consider, in the sex example above, how the numbers of females classified as males
and males classified as females change as the “critical” shapeX is varied from the minimum to the maximum. In
the case of diagnosis and signal detection, this relationship is important for assessing the consequences of making
mistakes and for judging the overall performance of the test.
Because the subjects tested are usually not a random sample of the population (people who are feeling well
are less likely to be tested), one has to correct the probabilities to assess how the test would do in the general
population. When the training sample does not represent the prevalence of infection in the general population, it
still yields correct conditional probabilities within rows: given that the subject is infected (or not), the row
frequencies are the probabilities of positive and negative test results. The row frequencies for the table above
are presented below; they add up to 1 across columns.
The correction to determine what proportion of the positives are actually infected in the general population is
based on an a priori estimate of the probability that any subject from the population is infected
(prevalence of the infection). One is interested in the proportion of randomly chosen individuals testing positive
who are actually infected.
                          Test result
True state        Positive      Negative
Infected          52/60         8/60
Not infected      20/120        100/120
Assuming that the prevalence in the general population is 15%, the proportion of individuals that test positive
who are actually infected is:
$$P(\text{inf.} \mid \text{pos.}) = \frac{P(\text{inf.} \cap \text{pos.})}{P(\text{pos.})} = \frac{0.15 \times 52/60}{0.15 \times 52/60 + 0.85 \times 20/120} = 0.478$$
The problem is that, because the prevalence is usually a small number, most of the subjects in the population are
not infected, so the total number of positives becomes highly “contaminated” by false positives.
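The calculation above, and the effect of a low prevalence, can be reproduced with a short sketch. The function name `positive_predictive_value` is hypothetical, and the 1% prevalence in the second call is an assumed value used only to show the contamination by false positives.

```python
def positive_predictive_value(prevalence, p_pos_given_inf, p_pos_given_not_inf):
    """P(infected | positive) by Bayes' rule."""
    true_pos = prevalence * p_pos_given_inf
    false_pos = (1.0 - prevalence) * p_pos_given_not_inf
    return true_pos / (true_pos + false_pos)


# Row frequencies from the table above.
p_pos_given_inf = 52 / 60
p_pos_given_not_inf = 20 / 120

# Prevalence of 15%, as in the text.
print(round(positive_predictive_value(0.15, p_pos_given_inf, p_pos_given_not_inf), 3))

# Assumed prevalence of 1%, for illustration only.
print(round(positive_predictive_value(0.01, p_pos_given_inf, p_pos_given_not_inf), 3))
```

With the same test, lowering the prevalence from 15% to the assumed 1% drops the proportion of positives who are actually infected from about 0.48 to about 0.05.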

17:4.2 Measures of usefulness of the classification function.
Two measures, sensitivity and specificity, are calculated for 2x2 tables.
17:4.2.1 Sensitivity
Sensitivity is the probability that the test correctly identifies the presence of infection. In the table above,
sensitivity is the proportion of correct positives within the infected individuals, P(positive | infected) or probability
of positive given infected (52/60).
17:4.2.2 Specificity
Specificity is the proportion of individuals correctly identified as not being infected. This is the probability of
negative given not infected (100/120).
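Note that the correction to determine the proportion of true positives in the population depends on the
prevalence and on both the sensitivity and the specificity: the false positive term in the denominator is
(1 - prevalence) × (1 - specificity), where 1 - specificity = 20/120 in the table above.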
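Both measures follow directly from the counts in the 2x2 table of section 17:4.1. A minimal sketch, with hypothetical function names:

```python
def sensitivity(true_pos, false_neg):
    """P(positive | infected)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """P(negative | not infected)."""
    return true_neg / (true_neg + false_pos)


# Counts from the fictional table in section 17:4.1.
true_pos, false_neg = 52, 8
false_pos, true_neg = 20, 100

print(round(sensitivity(true_pos, false_neg), 3))   # 52/60
print(round(specificity(true_neg, false_pos), 3))   # 100/120
```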
17:5 Obtaining and interpreting output with SAS.
17:5.1 SAS code.
proc logistic data=sex;
   /* corrb prints the correlation matrix of the estimates; ctable prints the classification table */
   model sex=shape / corrb ctable;
run;
17:5.2 SAS output.
The LOGISTIC Procedure

Data Set: WORK.SEX
Response Variable: SEX
Response Levels: 2
Number of Observations: 100
Link Function: Logit

Response Profile

Ordered
  Value    SEX       Count
      1    female       50
      2    male         50

Model Fitting Information and Testing Global Null Hypothesis BETA=0

             Intercept    Intercept and
Criterion    Only         Covariates       Chi-Square for Covariates
AIC          140.629      116.517          .
SC           143.235      121.727          .
-2 LOG L     138.629      112.517          26.112 with 1 DF (p=0.0001)
Score        .            .                23.136 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

                 Parameter   Standard   Wald         Pr >         Standardized   Odds
Variable   DF    Estimate    Error      Chi-Square   Chi-Square   Estimate       Ratio
INTERCPT   1      9.6830     2.2634     18.3013      0.0001       .              .
SHAPE      1     -1.1448     0.2668     18.4147      0.0001       -0.686124      0.318

Association of Predicted Probabilities and Observed Responses

Concordant = 77.6%       Somers' D = 0.553
Discordant = 22.3%       Gamma     = 0.554
Tied       =  0.1%       Tau-a     = 0.279
(2500 pairs)             c         = 0.777

Estimated Correlation Matrix

Variable      INTERCPT       SHAPE
INTERCPT       1.00000    -0.99487
SHAPE         -0.99487     1.00000

Classification Table

           Correct        Incorrect                   Percentages
        -------------   -------------   ----------------------------------------
 Prob            Non-            Non-             Sensi-   Speci-   False  False
Level   Event   Event   Event   Event   Correct   tivity   ficity    POS    NEG
--------------------------------------------------------------------------------
0.040     50       0      50       0     50.0     100.0      0.0    50.0     .
0.060     50       1      49       0     51.0     100.0      2.0    49.5    0.0
0.080     50       2      48       0     52.0     100.0      4.0    49.0    0.0
0.100     50       4      46       0     54.0     100.0      8.0    47.9    0.0
0.120     50       6      44       0     56.0     100.0     12.0    46.8    0.0
0.140     49       9      41       1     58.0      98.0     18.0    45.6   10.0
0.160     49      11      39       1     60.0      98.0     22.0    44.3    8.3
0.180     49      12      38       1     61.0      98.0     24.0    43.7    7.7
0.200     48      12      38       2     60.0      96.0     24.0    44.2   14.3
0.220     47      14      36       3     61.0      94.0     28.0    43.4   17.6
0.240     47      15      35       3     62.0      94.0     30.0    42.7   16.7
0.260     45      17      33       5     62.0      90.0     34.0    42.3   22.7
0.280     44      20      30       6     64.0      88.0     40.0    40.5   23.1
0.300     44      20      30       6     64.0      88.0     40.0    40.5   23.1
0.320     44      20      30       6     64.0      88.0     40.0    40.5   23.1
0.340     43      20      30       7     63.0      86.0     40.0    41.1   25.9
0.360     43      21      29       7     64.0      86.0     42.0    40.3   25.0
0.380     41      25      25       9     66.0      82.0     50.0    37.9   26.5
0.400     40      26      24      10     66.0      80.0     52.0    37.5   27.8
0.420     36      26      24      14     62.0      72.0     52.0    40.0   35.0
0.440     36      28      22      14     64.0      72.0     56.0    37.9   33.3
0.460     35      30      20      15     65.0      70.0     60.0    36.4   33.3
0.480     35      32      18      15     67.0      70.0     64.0    34.0   31.9
0.500     34      33      17      16     67.0      68.0     66.0    33.3   32.7
0.520     34      33      17      16     67.0      68.0     66.0    33.3   32.7
0.540     31      37      13      19     68.0      62.0     74.0    29.5   33.9
0.560     31      37      13      19     68.0      62.0     74.0    29.5   33.9
0.580     30      39      11      20     69.0      60.0     78.0    26.8   33.9
0.600     29      41       9      21     70.0      58.0     82.0    23.7   33.9
0.620     28      42       8      22     70.0      56.0     84.0    22.2   34.4
0.640     26      43       7      24     69.0      52.0     86.0    21.2   35.8
0.660     23      44       6      27     67.0      46.0     88.0    20.7   38.0
0.680     22      45       5      28     67.0      44.0     90.0    18.5   38.4
0.700     21      46       4      29     67.0      42.0     92.0    16.0   38.7
0.720     20      46       4      30     66.0      40.0     92.0    16.7   39.5
0.740     18      48       2      32     66.0      36.0     96.0    10.0   40.0
0.760     17      49       1      33     66.0      34.0     98.0     5.6   40.2
0.780     15      49       1      35     64.0      30.0     98.0     6.3   41.7
0.800     15      49       1      35     64.0      30.0     98.0     6.3   41.7
0.820     10      49       1      40     59.0      20.0     98.0     9.1   44.9
0.840      6      49       1      44     55.0      12.0     98.0    14.3   47.3
0.860      4      49       1      46     53.0       8.0     98.0    20.0   48.4
0.880      3      49       1      47     52.0       6.0     98.0    25.0   49.0
0.900      2      49       1      48     51.0       4.0     98.0    33.3   49.5
0.920      1      50       0      49     51.0       2.0    100.0     0.0   49.5
0.940      1      50       0      49     51.0       2.0    100.0     0.0   49.5
0.960      1      50       0      49     51.0       2.0    100.0     0.0   49.5
0.980      0      50       0      50     50.0       0.0    100.0      .    50.0
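As an aid to reading the output, several of the reported statistics can be recomputed from others in the same listing. The sketch below uses the standard definitions (AIC = -2 log L + 2k, SC = -2 log L + k ln n, Wald chi-square = (estimate / standard error)^2, odds ratio = e^b); the function names are hypothetical, and small differences from the listing are expected because the printed estimates are rounded.

```python
import math


def aic(neg2_log_l, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return neg2_log_l + 2 * k


def sc(neg2_log_l, k, n):
    """Schwarz criterion for k parameters and n observations."""
    return neg2_log_l + k * math.log(n)


def wald_chi_square(estimate, std_error):
    """Wald chi-square statistic for a single parameter."""
    return (estimate / std_error) ** 2


def odds_ratio(slope):
    """Multiplicative change in the odds per unit increase in the predictor."""
    return math.exp(slope)


n_obs = 100
neg2_null, neg2_model = 138.629, 112.517   # -2 LOG L from the listing

print(round(aic(neg2_null, 1), 3), round(aic(neg2_model, 2), 3))
print(round(sc(neg2_null, 1, n_obs), 3), round(sc(neg2_model, 2, n_obs), 3))
print(round(neg2_null - neg2_model, 3))             # chi-square for covariates
print(round(wald_chi_square(-1.1448, 0.2668), 2))   # SHAPE
print(round(odds_ratio(-1.1448), 3))                # SHAPE
```

An odds ratio of about 0.318 means that each unit increase in SHAPE multiplies the odds of the modeled response level by roughly one third.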