AGRO 6005

Logistic regression for binary data
In a variety of applications we may have a response variable that has only two possible
outcomes (alive/dead, male/female, sick/not sick, improved/not improved, etc.).
Typically one of the outcomes is coded as 0 (“failure”) and the other as 1 (“success”).
The coding is arbitrary. We are interested in studying how the outcome depends on one or
more explanatory variables. As in linear regression models, we are interested in
explaining the expected value (mean) of our dependent variable as a function of
independent variable(s). The simplest model is the same as the simple linear regression
model:
E Y   0  1 x
It can be shown that the mean of the dependent variable is the proportion of successes
(1s) in the population. Hence, what we are modeling is how the proportion of 1s changes
with x. For example, suppose that we are testing different doses of an antibiotic and
record whether the condition improves (1) or not (0). We are interested in predicting the
probability of improvement (the same as the proportion of 1s) at each dose.
Unfortunately, this simple linear model has three major problems:
1. The Ys do not have a normal distribution.
2. The variances of the Ys are not constant: $\mathrm{Var}(Y) = E(Y)\,[1 - E(Y)]$.
3. The model could yield predicted values larger than 1 or smaller than 0 (remember
that we are modeling proportions or probabilities).
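Problems 2 and 3 can be illustrated with a small Python sketch. The coefficients below are hypothetical values chosen only for illustration; they do not come from any data set in these notes.

```python
# Hypothetical straight-line model for a proportion (illustrative coefficients only).
b0, b1 = -0.5, 0.4

def linear_mean(x):
    """E(Y) under the simple linear model b0 + b1*x."""
    return b0 + b1 * x

def bernoulli_variance(p):
    """Var(Y) = E(Y) * (1 - E(Y)) for a 0/1 response with mean p."""
    return p * (1.0 - p)

# Problem 2: the variance depends on the mean, so it cannot be constant.
for p in (0.1, 0.5, 0.9):
    print(f"E(Y) = {p:.1f}  Var(Y) = {bernoulli_variance(p):.2f}")

# Problem 3: a straight line eventually leaves the [0, 1] range.
for x in (0, 1, 2, 3, 4):
    m = linear_mean(x)
    note = "" if 0.0 <= m <= 1.0 else "  <- not a valid probability"
    print(f"x = {x}  predicted E(Y) = {m:+.2f}{note}")
```

With these assumed coefficients the line gives a negative "probability" at x = 0 and a value above 1 at x = 4.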
Because of these problems, the linear model is rarely used, and more suitable
models are available. Among these, the logistic regression model is the most common:
$$E(Y) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$$
This is a particular nonlinear model. The curve it describes has the following
properties:
1. As x becomes large, E(Y) approaches 1 (if $\beta_1 > 0$) or 0 (if $\beta_1 < 0$).
2. The curve has an S shape. It is monotone (it either increases or decreases
everywhere).
3. $E(Y) = 1/2$ when $x = -\beta_0/\beta_1$.
One interpretation of this curve is through the “logit” transform:
$$\log\left(\frac{E(Y)}{1 - E(Y)}\right) = \beta_0 + \beta_1 x$$
where the logarithms are natural logs (sometimes denoted as ln). Here we can see that in
this scale the model is linear. The ratio between probability of success and probability of
failure is called the “odds”. If the odds are 1, the probability of success is the same as the
probability of failure (0.5). If the odds are 0, the probability of failure is 1, and if the odds
tend to infinity, the probability of success tends to 1. When we take the log of the odds,
we are “stretching” the odds scale from $(0, \infty)$ to $(-\infty, \infty)$:
Probability of success, E(Y)         0       0.5     1
Odds, E(Y)/[1 − E(Y)]                0       1       ∞
Logits, log{E(Y)/[1 − E(Y)]}         −∞      0       ∞
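The three scales can be converted into one another with a few lines of Python. This is only a sketch; the helper names `odds`, `logit` and `inv_logit` are introduced here, and the probabilities are arbitrary interior values (the endpoints 0 and 1 map to infinite odds or logits and are left out).

```python
import math

def odds(p):
    """Odds of success: probability of success over probability of failure."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

def inv_logit(eta):
    """Back-transform a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

print("prob    odds      logit")
for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    print(f"{p:4.2f}  {odds(p):8.4f}  {logit(p):+8.4f}")
```

Probabilities below 0.5 give odds below 1 and negative logits; probabilities above 0.5 give odds above 1 and positive logits.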
Using the odds scale we can interpret the slope (or the partial slope in
multiple regression): the value of $\exp(\beta_1)$ indicates by how much the odds of success are
multiplied when the independent variable increases by 1 unit. For example, suppose the
relationship between dose and the probability of improving a certain condition is given by
$$E(Y) = \frac{\exp(3 + 2x)}{1 + \exp(3 + 2x)} \quad\text{or}\quad \log\left(\frac{E(Y)}{1 - E(Y)}\right) = 3 + 2x$$
The slope indicates that if the dose increases by 1 unit, the odds of improving are multiplied
by exp(2). Note that this is not the same as saying that the probability of improving
increases by exp(2) units (remember that this model is nonlinear on the probability scale).
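The difference between the two scales can be seen numerically. The sketch below uses a slope of 2 as in the example; the intercept of 3 is an assumed value and cancels out of the odds ratio, so it does not affect the conclusion about the slope.

```python
import math

b0, b1 = 3.0, 2.0   # slope from the example; the intercept is an assumed value

def odds(x):
    """Odds of improving at dose x: exp(b0 + b1*x)."""
    return math.exp(b0 + b1 * x)

def prob(x):
    """Probability of improving at dose x."""
    return odds(x) / (1.0 + odds(x))

for x in (0, 1, 2):
    ratio = odds(x + 1) / odds(x)          # always exp(b1)
    change = prob(x + 1) - prob(x)         # depends on where we are on the curve
    print(f"x = {x}: odds ratio = {ratio:.4f}, change in probability = {change:.4f}")

print(f"exp(b1) = {math.exp(b1):.4f}")
```

The odds ratio is the same at every dose, while the change in probability is different at each dose.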
To fit these models and make inferences about their parameters we need, as in
linear regression, a random sample of n independent observations (x, y). The parameter
estimates are obtained by maximum likelihood, and the tests can be done through a
likelihood ratio test or a Wald test. Both use chi-squared ($\chi^2$) statistics and the
chi-squared table to find the critical values. The details of these tests are well beyond the
scope of this course, but statistical software will provide the test statistics and associated
p-values.
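Example 1: The following are the numbers of dead insects under different doses of an
insecticide.
data escar;
  /* dosis = dose, muerta = number of dead insects, total = insects exposed */
  input dosis muerta total;
  p = muerta/total;   /* observed proportion dead */
  datalines;
16.9 4 59
17.2 13 60
17.5 18 62
17.8 28 56
18.1 52 63
18.4 53 59
18.6 61 62
18.8 60 60
;
run;
proc glimmix data=escar;
  /* events/trials syntax: binomial response with a logit link */
  model muerta/total = dosis / link=logit dist=bin solution;
run;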
Model Information
Data Set                      WORK.ESCAR
Response Variable (Events)    muerta
Response Variable (Trials)    total
Response Distribution         Binomial
Link Function                 Logit
Variance Function             Default
Variance Matrix               Diagonal
Estimation Technique          Maximum Likelihood
Degrees of Freedom Method     Residual

Number of Observations Read   8
Number of Observations Used   8
Number of Events              289
Number of Trials              481

Dimensions
Columns in X                  2
Columns in Z                  0
Subjects (Blocks in V)        1
Max Obs per Subject           8
Fit Statistics
-2 Log Likelihood             34.54
AIC (smaller is better)       38.54
AICC (smaller is better)      40.94
BIC (smaller is better)       38.70
CAIC (smaller is better)      40.70
HQIC (smaller is better)      37.47
Pearson Chi-Square            7.42
Pearson Chi-Square / DF       1.24
Parameter Estimates
Effect      Estimate   Standard Error   DF   t Value   Pr > |t|
Intercept   -61.7894   5.1906           6    -11.90    <.0001
dosis         3.4912   0.2923           6     11.94    <.0001
Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
dosis    1        6        142.62    <.0001
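To show what the maximum likelihood fit involves, the following Python sketch fits the same model to the Example 1 data with a hand-written Newton-Raphson iteration. It is an illustration under stated assumptions, not a description of what PROC GLIMMIX does internally: the function name `fit_logistic`, the centering of the dose and the starting values are choices made here. The estimates should land close to the ones in the output above, and the Wald statistic is the square of the slope divided by its standard error.

```python
import math

# Example 1 data: dose, number of dead insects, number of insects exposed.
dose  = [16.9, 17.2, 17.5, 17.8, 18.1, 18.4, 18.6, 18.8]
dead  = [4, 13, 18, 28, 52, 53, 61, 60]
total = [59, 60, 62, 56, 63, 59, 62, 60]

def fit_logistic(x, y, n, iterations=50):
    """Maximum likelihood for logit(p) = b0 + b1*x by Newton-Raphson.

    x is centered internally to keep the 2x2 system well conditioned;
    the returned intercept is on the original (uncentered) scale.
    """
    xbar = sum(x) / len(x)
    xc = [xi - xbar for xi in x]
    a, b = 0.0, 0.0                        # start at p = 0.5 for every dose
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in xc]
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        # Score vector (first derivatives of the log-likelihood).
        u0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        u1 = sum((yi - ni * pi) * xi for yi, ni, pi, xi in zip(y, n, p, xc))
        # Information matrix (minus the second derivatives).
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, xc))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, xc))
        det = i00 * i11 - i01 * i01
        da = (i11 * u0 - i01 * u1) / det
        db = (i00 * u1 - i01 * u0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    # Slope standard error from the inverse of the last information matrix computed.
    se_b = math.sqrt(i00 / det)
    return a - b * xbar, b, se_b

b0_hat, b1_hat, se_b1 = fit_logistic(dose, dead, total)
wald_chi2 = (b1_hat / se_b1) ** 2
p_value = math.erfc(math.sqrt(wald_chi2 / 2.0))   # upper tail of chi-squared, 1 df
ld50 = -b0_hat / b1_hat                           # dose at which E(Y) = 1/2

print(f"intercept = {b0_hat:.4f}, slope = {b1_hat:.4f} (SE {se_b1:.4f})")
print(f"Wald chi-squared = {wald_chi2:.2f}, p-value = {p_value:.3g}")
print(f"odds multiplier per unit dose = {math.exp(b1_hat):.2f}")
print(f"dose with 50% mortality = {ld50:.2f}")
```

The last line uses property 3 of the logistic curve: the estimated probability of death is 1/2 at a dose of minus the intercept divided by the slope.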
Example 2: The following data come from a study to evaluate the need to apply a
chilling treatment to azalea plants before selling them, in order to promote abundant,
uniform flowering at the time of sale. Plants were obtained (six in October, six in
December, and six in February); half of them were treated with cold and the others were
left untreated. The numbers of closed and open flower buds were counted on each plant.
Note that this is a 3x2 factorial experiment with 3 replicates.
data azalea;
  /* trat = treatment (frio = chilled, nofrio = not chilled), epoca = time of year,
     abiertas = open buds, cerradas = closed buds */
  input rep trat $ epoca abiertas cerradas;
  total = abiertas + cerradas;
  datalines;
1 nofrio 1  83  75
1 nofrio 2 115  53
1 nofrio 3 188   5
1 frio   1 103  99
1 frio   2  76  77
1 frio   3 176   3
2 nofrio 1  81  78
2 nofrio 2 110  62
2 nofrio 3 174   9
2 frio   1  98 101
2 frio   2  65  68
2 frio   3 110   5
3 nofrio 1  97  81
3 nofrio 2 101  48
3 nofrio 3 201  11
3 frio   1 114 101
3 frio   2  85  79
3 frio   3 201  12
;
proc glimmix;
class trat epoca rep;
model abiertas/total=rep trat epoca trat*epoca / dist=bin link=logit;
contrast 'trat en epoca=1' trat 1 -1 trat*epoca 1 0 0 -1 0 0;
contrast 'trat en epoca=2' trat 1 -1 trat*epoca 0 1 0 0 -1 0;
contrast 'trat en epoca=3' trat 1 -1 trat*epoca 0 0 1 0 0 -1;
lsmeans trat*epoca / ilink pdiff adjust=tukey lines
plot=meanplot(sliceby=trat join ilink) slice=epoca;
run;
Model Information
Data Set                      WORK.AZALEA
Response Variable (Events)    abiertas
Response Variable (Trials)    total
Response Distribution         Binomial
Link Function                 Logit
Estimation Technique          Maximum Likelihood
Degrees of Freedom Method     Residual

Class Level Information
Class    Levels   Values
trat     2        frio nofrio
epoca    3        1 2 3
rep      3        1 2 3

Number of Observations Read   18
Number of Observations Used   18
Number of Events              2178
Number of Trials              3145
Fit Statistics
-2 Log Likelihood             96.00
AIC (smaller is better)       112.00
AICC (smaller is better)      128.00
BIC (smaller is better)       119.13
CAIC (smaller is better)      127.13
HQIC (smaller is better)      112.99
Pearson Chi-Square            6.55
Pearson Chi-Square / DF       0.65
Type III Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
rep          2        10         0.94    0.4225
trat         1        10         3.73    0.0823
epoca        2        10       175.12    <.0001
trat*epoca   2        10         6.84    0.0134

Contrasts
Label             Num DF   Den DF   F Value   Pr > F
trat en epoca=1   1        10        0.28     0.6096
trat en epoca=2   1        10       26.33     0.0004
trat en epoca=3   1        10        0.05     0.8299
trat*epoca Least Squares Means
trat     epoca   Estimate   Standard Error   DF   t Value   Pr > |t|   Mean     Standard Error Mean
frio     1       0.04438    0.08065          10    0.55     0.5942     0.5111   0.02015
frio     2       0.004356   0.09439          10    0.05     0.9641     0.5011   0.02360
frio     3       3.1811     0.2283           10   13.93     <.0001     0.9601   0.008744
nofrio   1       0.1081     0.09010          10    1.20     0.2578     0.5270   0.02246
nofrio   2       0.6957     0.09602          10    7.25     <.0001     0.6672   0.02132
nofrio   3       3.1135     0.2044           10   15.23     <.0001     0.9574   0.008329
Tukey-Kramer Grouping for trat*epoca Least Squares Means (Alpha=0.05)
LS-means with the same letter are not significantly different.
trat     epoca   Estimate   Group
frio     3       3.1811     A
nofrio   3       3.1135     A
nofrio   2       0.6957     B
nofrio   1       0.1081     C
frio     1       0.04438    C
frio     2       0.004356   C
Tests of Effect Slices for trat*epoca Sliced By epoca
epoca   Num DF   Den DF   F Value   Pr > F
1       1        10        0.28     0.6096
2       1        10       26.33     0.0004
3       1        10        0.05     0.8299
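The ilink option reports the least squares means on the probability scale as well as on the logit scale. The Python sketch below back-transforms three of the logit-scale estimates from the LS-means output; the function name `ilink_logit` is introduced here. The standard error on the probability scale is computed with a delta-method approximation, which is an assumption about how the "Standard Error Mean" values are obtained, although the numbers agree with the reported ones.

```python
import math

def ilink_logit(estimate, se):
    """Back-transform a logit-scale estimate and its standard error.

    mean    = 1 / (1 + exp(-estimate))
    se_mean = mean * (1 - mean) * se      (delta-method approximation)
    """
    mean = 1.0 / (1.0 + math.exp(-estimate))
    return mean, mean * (1.0 - mean) * se

# (trat, epoca, estimate, standard error) copied from the LS-means output.
lsmeans = [
    ("frio",   1, 0.04438, 0.08065),
    ("frio",   3, 3.1811,  0.2283),
    ("nofrio", 2, 0.6957,  0.09602),
]

for trat, epoca, est, se in lsmeans:
    mean, se_mean = ilink_logit(est, se)
    print(f"{trat:6s} epoca {epoca}: proportion open = {mean:.4f} (SE {se_mean:.5f})")
```

For example, the estimate 0.6957 for untreated plants in period 2 corresponds to an estimated proportion of open buds of about 0.667.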
Generalized linear models
The models we have seen so far are special cases of generalized linear
models. The data come from a distribution that belongs to an
exponential family (for example, normal, gamma, binomial, Poisson,
etc.). A generalized linear model is a model that links the responses
(“dependent” variables) to other “independent” or “explanatory” variables.
We need to consider three components:
1. The random component (the distribution of the $Y_i$). In general, the
$Y_i$ are assumed to be independent, with a distribution in the linear
exponential family (for example: normal, binomial, Poisson, gamma, etc.).
2. The systematic component, which describes how the independent
variables are combined. This is a linear model (that is, the parameters enter
the model linearly). For example, $\alpha + \beta_1 x_1 + \beta_2 x_2$.
3. The link function, which connects the mean (expected value) of the
distribution of the $Y_i$ to the systematic component. For example,
$g(\mu_i) = \log(\mu_i) = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i}$.
Some examples of generalized linear models:
a. $Y_i \sim N(\alpha + \beta_1 x_{1i} + \beta_2 x_{2i},\ \sigma^2)$
b. $Y_i \sim \mathrm{Bin}(\pi_i, N)$; $\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i$
c. $Y_i \sim \mathrm{Poisson}(\lambda_i)$; $\log(\lambda_i) = \alpha + \tau_i$
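The three examples differ in the link function that connects the mean to the linear predictor. The following Python sketch pairs each link with its inverse; the dictionary `LINKS`, the helper `linear_predictor`, and the coefficient and covariate values are all hypothetical and used only for illustration.

```python
import math

# Each link g maps the mean to the linear predictor; its inverse maps back.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),                  # normal, example a
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),           # binomial, example b
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),   # Poisson, example c
}

def linear_predictor(coefs, xs):
    """Systematic component: intercept plus the sum of coefficient * covariate."""
    intercept, *slopes = coefs
    return intercept + sum(b * x for b, x in zip(slopes, xs))

# Hypothetical intercept, two slopes and two covariate values.
eta = linear_predictor([0.5, 0.3, -0.2], [1.0, 2.0])

for name, (g, g_inv) in LINKS.items():
    mean = g_inv(eta)
    print(f"{name:8s} eta = {eta:.2f}  mean = {mean:.4f}  g(mean) = {g(mean):.2f}")
```

The same linear predictor gives a different mean under each link: any real number with the identity link, a value between 0 and 1 with the logit link, and a positive value with the log link.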
Example of a Poisson regression model
The following experiment (Gbur et al., 2011) compares management strategies for
a type of weed found in pepper crops. Weed tubers of four different sizes were
selected and planted in individual pots, and soil tillage was simulated at four
different frequencies (weekly, biweekly, monthly, and no tillage, equivalent to
12 weeks). The design was a randomized complete block design, and the experiment is
a 4x4 factorial (4 sizes x 4 frequencies) with 4 replicates.
The response variable is the total number of new tubers produced. Note that the
variable is a count, so we will use a model that assumes a Poisson distribution and
a log link function (example c above, but with a factorial treatment structure and
with block effects).
data weed;
input tillage_freq $ initial_weight $ 16-26 block tubers;
datalines;
2.BW           <0.25       1 7
2.BW           <0.25       2 2
2.BW           <0.25       3 3
2.BW           <0.25       4 3
2.BW           0.25-0.50   1 6
2.BW           0.25-0.50   2 3
2.BW           0.25-0.50   3 2
2.BW           0.25-0.50   4 5
2.BW           0.50-0.75   1 6
2.BW           0.50-0.75   2 5
2.BW           0.50-0.75   3 3
2.BW           0.50-0.75   4 4
2.BW           0.75-1.00   1 6
2.BW           0.75-1.00   2 3
2.BW           0.75-1.00   3 4
2.BW           0.75-1.00   4 5
3.MO           <0.25       1 13
3.MO           <0.25       2 5
3.MO           <0.25       3 13
3.MO           <0.25       4 8
3.MO           0.25-0.50   1 21
3.MO           0.25-0.50   2 19
3.MO           0.25-0.50   3 16
3.MO           0.25-0.50   4 16
3.MO           0.50-0.75   1 32
3.MO           0.50-0.75   2 25
3.MO           0.50-0.75   3 29
3.MO           0.50-0.75   4 38
3.MO           0.75-1.00   1 84
3.MO           0.75-1.00   2 72
3.MO           0.75-1.00   3 43
3.MO           0.75-1.00   4 40
1.WE           <0.25       1 6
1.WE           <0.25       2 3
1.WE           <0.25       3 4
1.WE           <0.25       4 3
1.WE           0.25-0.50   1 5
1.WE           0.25-0.50   2 4
1.WE           0.25-0.50   3 5
1.WE           0.25-0.50   4 4
1.WE           0.50-0.75   1 6
1.WE           0.50-0.75   2 2
1.WE           0.50-0.75   3 5
1.WE           0.50-0.75   4 5
1.WE           0.75-1.00   1 7
1.WE           0.75-1.00   2 4
1.WE           0.75-1.00   3 5
1.WE           0.75-1.00   4 5
4.ZE           <0.25       1 123
4.ZE           <0.25       2 122
4.ZE           <0.25       3 112
4.ZE           <0.25       4 118
4.ZE           0.25-0.50   1 121
4.ZE           0.25-0.50   2 118
4.ZE           0.25-0.50   3 124
4.ZE           0.25-0.50   4 119
4.ZE           0.50-0.75   1 112
4.ZE           0.50-0.75   2 121
4.ZE           0.50-0.75   3 128
4.ZE           0.50-0.75   4 125
4.ZE           0.75-1.00   1 130
4.ZE           0.75-1.00   2 124
;
ods rtf;
proc glimmix data=weed plots=studentpanel;
class block tillage_freq initial_weight;
model tubers=block tillage_freq | initial_weight /
dist=poisson link=log;
lsmeans tillage_freq * initial_weight / ilink
plot=meanplot(sliceby= initial_weight join);
lsmeans tillage_freq * initial_weight / ilink
plot=meanplot(sliceby= initial_weight join ilink)
slice=tillage_freq;
run;
ods rtf close;
Model Information
Data Set                      WORK.WEED
Response Variable             tubers
Response Distribution         Poisson
Link Function                 Log
Variance Function             Default
Variance Matrix               Diagonal
Estimation Technique          Maximum Likelihood
Degrees of Freedom Method     Residual

Class Level Information
Class            Levels   Values
block            4        1 2 3 4
tillage_freq     4        1.WE 2.BW 3.MO 4.ZE
initial_weight   4        0.25-0.50 0.50-0.75 0.75-1.00 <0.25

Number of Observations Read   62
Number of Observations Used   62
Fit Statistics
-2 Log Likelihood             321.88
AIC (smaller is better)       359.88
AICC (smaller is better)      377.98
BIC (smaller is better)       400.30
CAIC (smaller is better)      419.30
HQIC (smaller is better)      375.75
Pearson Chi-Square            42.73
Pearson Chi-Square / DF       0.99
Type III Tests of Fixed Effects
Effect                 Num DF   Den DF   F Value   Pr > F
block                  3        43         1.50    0.2276
tillage_freq           3        43       623.93    <.0001
initial_weight         3        43         7.24    0.0005
tillage_f*initial_we   9        43        12.74    <.0001
tillage_f*initial_we Least Squares Means
tillage_freq   initial_weight   Estimate   Standard Error   DF   t Value   Pr > |t|   Mean       Standard Error Mean
1.WE           0.25-0.50        1.5031     0.2357           43     6.38    <.0001       4.4956   1.0596
1.WE           0.50-0.75        1.5031     0.2357           43     6.38    <.0001       4.4956   1.0596
1.WE           0.75-1.00        1.6573     0.2182           43     7.59    <.0001       5.2449   1.1445
1.WE           <0.25            1.3853     0.2500           43     5.54    <.0001       3.9961   0.9990
2.BW           0.25-0.50        1.3853     0.2500           43     5.54    <.0001       3.9961   0.9990
2.BW           0.50-0.75        1.5031     0.2357           43     6.38    <.0001       4.4956   1.0596
2.BW           0.75-1.00        1.5031     0.2357           43     6.38    <.0001       4.4956   1.0596
2.BW           <0.25            1.3208     0.2582           43     5.12    <.0001       3.7464   0.9673
3.MO           0.25-0.50        2.8894     0.1179           43    24.52    <.0001      17.9826   2.1193
3.MO           0.50-0.75        3.4330     0.08981          43    38.23    <.0001      30.9700   2.7813
3.MO           0.75-1.00        4.0892     0.06469          43    63.21    <.0001      59.6921   3.8616
3.MO           <0.25            2.2763     0.1601           43    14.22    <.0001       9.7406   1.5598
4.ZE           0.25-0.50        4.7907     0.04556          43   105.16    <.0001     120.38     5.4844
4.ZE           0.50-0.75        4.7989     0.04537          43   105.77    <.0001     121.38     5.5072
4.ZE           0.75-1.00        4.8102     0.06651          43    72.32    <.0001     122.76     8.1651
4.ZE           <0.25            4.7761     0.04589          43   104.07    <.0001     118.63     5.4445
Tests of Effect Slices for tillage_f*initial_we Sliced By tillage_freq
tillage_freq   Num DF   Den DF   F Value   Pr > F
1.WE           3        43        0.23     0.8736
2.BW           3        43        0.13     0.9392
3.MO           3        43       55.08     <.0001
4.ZE           3        43        0.07     0.9739
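With a log link, the least squares means are estimated on the log scale and the ilink option reports them as counts. The Python sketch below back-transforms three of the estimates from the LS-means output; the function name `ilink_log` is introduced here, and the standard error on the count scale uses a delta-method approximation, which is an assumption that happens to agree with the reported values.

```python
import math

def ilink_log(estimate, se):
    """Back-transform a log-scale estimate and its standard error.

    mean    = exp(estimate)
    se_mean = mean * se        (delta-method approximation)
    """
    mean = math.exp(estimate)
    return mean, mean * se

# (tillage_freq, initial_weight, estimate, standard error) from the LS-means output.
lsmeans = [
    ("1.WE", "0.25-0.50", 1.5031, 0.2357),
    ("3.MO", "0.75-1.00", 4.0892, 0.06469),
    ("4.ZE", "<0.25",     4.7761, 0.04589),
]

for freq, weight, est, se in lsmeans:
    mean, se_mean = ilink_log(est, se)
    print(f"{freq} {weight:9s}: mean tubers = {mean:8.3f} (SE {se_mean:.3f})")
```

A difference between two estimates on the log scale corresponds to a ratio of mean counts on the original scale.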