Logistic regression for binary data

In a variety of applications the response variable has only two possible outcomes (alive/dead, male/female, sick/not sick, improved/not improved, etc.). Typically one of the outcomes is coded as 0 ("failure") and the other as 1 ("success"); the coding is arbitrary. We are interested in studying how the outcome depends on one or more explanatory variables. As in linear regression models, we want to explain the expected value (mean) of the dependent variable as a function of the independent variable(s). The simplest model is the same as the simple linear regression model:

E(Y) = \beta_0 + \beta_1 x

It can be shown that the mean of the dependent variable is the proportion of successes (1s) in the population. Hence, what we are modeling is how the proportion of 1s changes with x. For example, suppose we are testing different doses of an antibiotic and record whether the condition improves (1) or not (0). We want to predict the probability of improving (the same as the proportion of 1s) for each dose. Unfortunately, this simple linear model has three major problems:

1. The Ys do not have a normal distribution.
2. The variances of the Ys are not constant: \mathrm{Var}(Y) = E(Y)\,[1 - E(Y)].
3. The model can yield predicted values larger than 1 or smaller than 0 (remember that we are modeling proportions or probabilities).

Because of these problems the linear model is rarely used, and more suitable models are available. Among them, the logistic regression model is the most common:

E(Y) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}

This is a particular nonlinear model. The curve it describes has the following properties:

1. As x becomes large, E(Y) approaches 1 (if \beta_1 > 0) or 0 (if \beta_1 < 0).
2. The curve has an S shape. It is monotone (it either increases or decreases everywhere).
3. E(Y) = 1/2 when x = -\beta_0/\beta_1.

One interpretation of this curve is through the "logit" transform:

\log\left(\frac{E(Y)}{1 - E(Y)}\right) = \beta_0 + \beta_1 x

where the logarithms are natural logs (sometimes denoted ln). On this scale the model is linear. The ratio between the probability of success and the probability of failure is called the "odds". If the odds are 1, the probability of success is the same as the probability of failure (0.5). If the odds are 0, the probability of failure is 1, and as the odds tend to infinity, the probability of success tends to 1. When we take the log of the odds, we "stretch" the odds scale from (0, \infty) to (-\infty, \infty):

Probability of success, E(Y):          0          0.5        1
Odds, E(Y)/[1 - E(Y)]:                 0          1          \infty
Logit, \log{ E(Y)/[1 - E(Y)] }:        -\infty    0          \infty

On the odds scale we can interpret the slope (or partial slope, if we have multiple regression): \exp(\beta_1) indicates by how much the odds of success are multiplied when the independent variable increases by 1 unit. For example, suppose the relationship between dose and the probability of improving a certain condition is

E(Y) = \frac{\exp(-3 + 2x)}{1 + \exp(-3 + 2x)}, or equivalently \log\left(\frac{E(Y)}{1 - E(Y)}\right) = -3 + 2x.

The slope indicates that if the dose increases by 1 unit, the odds of improving are multiplied by \exp(2). Note that this is not the same as saying that the probability of improving increases by \exp(2) units (remember that this model is nonlinear on the probability scale); a small numeric sketch below illustrates the difference.

To fit these models and to make inferences about their parameters we need, as in linear regression, a random sample of n independent observations (x, y). Parameter estimates are obtained by maximum likelihood, and tests can be carried out with a likelihood ratio test or a Wald test. Both use chi-squared (\chi^2) statistics and the chi-squared table to find the critical values. The details of these tests are well beyond the scope of this course, but statistical software provides the test statistics and the associated p-values.
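To make the difference between the odds scale and the probability scale concrete, the following minimal SAS sketch (not part of the original notes; the data set name odds_sketch and its variable names are invented) evaluates the illustrative model \log[E(Y)/(1 - E(Y))] = -3 + 2x at a few doses:

data odds_sketch;
   do dose = 0 to 4;
      logit = -3 + 2*dose;         /* linear predictor on the logit scale */
      odds  = exp(logit);          /* odds of success                     */
      p     = odds / (1 + odds);   /* probability of success, E(Y)        */
      output;
   end;
run;

proc print data=odds_sketch; run;

Each one-unit increase in dose multiplies the odds by exp(2), about 7.4, but the change on the probability scale is different at every step (for example, from about 0.27 at dose 1 to about 0.73 at dose 2).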
Example 1: The following are the numbers of dead insects under different doses of an insecticide.

data escar;
   input dosis muerta total;
   p = muerta/total;
   datalines;
16.9 4 59
17.2 13 60
17.5 18 62
17.8 28 56
18.1 52 63
18.4 53 59
18.6 61 62
18.8 60 60
;
run;

proc glimmix;
   model muerta/total = dosis / link=logit dist=bin solution;
run;

Model Information
Data Set                     WORK.ESCAR
Response Variable (Events)   muerta
Response Variable (Trials)   total
Response Distribution        Binomial
Link Function                Logit
Variance Function            Default
Variance Matrix              Diagonal
Estimation Technique         Maximum Likelihood
Degrees of Freedom Method    Residual

Number of Observations Read     8
Number of Observations Used     8
Number of Events              289
Number of Trials              481

Dimensions
Columns in X             2
Columns in Z             0
Subjects (Blocks in V)   1
Max Obs per Subject      8

Fit Statistics
-2 Log Likelihood            34.54
AIC  (smaller is better)     38.54
AICC (smaller is better)     40.94
BIC  (smaller is better)     38.70
CAIC (smaller is better)     40.70
HQIC (smaller is better)     37.47
Pearson Chi-Square            7.42
Pearson Chi-Square / DF       1.24

Parameter Estimates
Effect      Estimate   Standard Error   DF   t Value   Pr > |t|
Intercept   -61.7894   5.1906            6   -11.90     <.0001
dosis         3.4912   0.2923            6    11.94     <.0001

Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
dosis         1        6    142.62   <.0001
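Two quantities from the theory section can be read directly off these estimates. The following is a hypothetical follow-up step, not part of the original program (the data set and variable names are invented):

data dose_summary;
   b0 = -61.7894;           /* intercept estimate from the output above    */
   b1 = 3.4912;             /* dosis slope estimate from the output above  */
   dose_half  = -b0 / b1;   /* dose at which E(Y) = 1/2, i.e. -b0/b1       */
   odds_ratio = exp(b1);    /* factor multiplying the odds of death per
                               unit increase in dose                       */
run;

proc print data=dose_summary; run;

The dose at which half of the insects are expected to die is 61.7894/3.4912, roughly 17.7, which lies inside the range of doses tested (16.9 to 18.8), and each additional unit of dose multiplies the odds of death by exp(3.4912), roughly 33.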
Example 2: The following data come from a study carried out to evaluate whether azalea plants need a chilling treatment before being sold, in order to promote abundant and uniform flowering at the time of sale. Plants were obtained on three dates (six in October, six in December and six in February); half of them were given the cold treatment and the other half were left untreated. The numbers of closed and open flower buds were counted on each plant. Note that this is a 3x2 factorial experiment with 3 replicates.

data azalea;
   input rep trat $ epoca abiertas cerradas;
   total = abiertas + cerradas;
   datalines;
1 nofrio 1 83 75
1 nofrio 2 115 53
1 nofrio 3 188 5
1 frio 1 103 99
1 frio 2 76 77
1 frio 3 176 3
2 nofrio 1 81 78
2 nofrio 2 110 62
2 nofrio 3 174 9
2 frio 1 98 101
2 frio 2 65 68
2 frio 3 110 5
3 nofrio 1 97 81
3 nofrio 2 101 48
3 nofrio 3 201 11
3 frio 1 114 101
3 frio 2 85 79
3 frio 3 201 12
;
run;

proc glimmix;
   class trat epoca rep;
   model abiertas/total = rep trat epoca trat*epoca / dist=bin link=logit;
   contrast 'trat en epoca=1' trat 1 -1 trat*epoca 1 0 0 -1 0 0;
   contrast 'trat en epoca=2' trat 1 -1 trat*epoca 0 1 0 0 -1 0;
   contrast 'trat en epoca=3' trat 1 -1 trat*epoca 0 0 1 0 0 -1;
   lsmeans trat*epoca / ilink pdiff adjust=tukey lines
           plot=meanplot(sliceby=trat join ilink) slice=epoca;
run;

Model Information
Data Set                     WORK.AZALEA
Response Variable (Events)   abiertas
Response Variable (Trials)   total
Response Distribution        Binomial
Link Function                Logit
Estimation Technique         Maximum Likelihood
Degrees of Freedom Method    Residual

Class Level Information
Class   Levels   Values
trat         2   frio nofrio
epoca        3   1 2 3
rep          3   1 2 3

Number of Observations Read     18
Number of Observations Used     18
Number of Events              2178
Number of Trials              3145

Fit Statistics
-2 Log Likelihood             96.00
AIC  (smaller is better)     112.00
AICC (smaller is better)     128.00
BIC  (smaller is better)     119.13
CAIC (smaller is better)     127.13
HQIC (smaller is better)     112.99
Pearson Chi-Square             6.55
Pearson Chi-Square / DF        0.65

Type III Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
rep               2       10      0.94   0.4225
trat              1       10      3.73   0.0823
epoca             2       10    175.12   <.0001
trat*epoca        2       10      6.84   0.0134

Contrasts
Label             Num DF   Den DF   F Value   Pr > F
trat en epoca=1        1       10      0.28   0.6096
trat en epoca=2        1       10     26.33   0.0004
trat en epoca=3        1       10      0.05   0.8299

trat*epoca Least Squares Means
trat     epoca   Estimate   Standard Error   DF   t Value   Pr > |t|     Mean   Standard Error Mean
frio         1    0.04438          0.08065   10      0.55     0.5942   0.5111   0.02015
frio         2   0.004356          0.09439   10      0.05     0.9641   0.5011   0.02360
frio         3     3.1811           0.2283   10     13.93     <.0001   0.9601   0.008744
nofrio       1     0.1081          0.09010   10      1.20     0.2578   0.5270   0.02246
nofrio       2     0.6957          0.09602   10      7.25     <.0001   0.6672   0.02132
nofrio       3     3.1135           0.2044   10     15.23     <.0001   0.9574   0.008329

Tukey-Kramer Grouping for trat*epoca Least Squares Means (Alpha=0.05)
LS-means with the same letter are not significantly different.
trat     epoca   Estimate
frio         3     3.1811   A
nofrio       3     3.1135   A
nofrio       2     0.6957   B
nofrio       1     0.1081   C
frio         1    0.04438   C
frio         2   0.004356   C

Tests of Effect Slices for trat*epoca Sliced By epoca
epoca   Num DF   Den DF   F Value   Pr > F
1            1       10      0.28   0.6096
2            1       10     26.33   0.0004
3            1       10      0.05   0.8299
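The Mean column of the LS-means table is simply the inverse link (ILINK) applied to the logit-scale Estimate column. As a check, here is a small hand back-transformation; this is a hypothetical sketch, not part of the original program, and the data set name azalea_ilink is invented:

data azalea_ilink;
   input trat $ epoca estimate;
   mean = exp(estimate) / (1 + exp(estimate));   /* inverse logit (ilink) */
   datalines;
frio   3 3.1811
nofrio 2 0.6957
frio   2 0.004356
;
run;

proc print data=azalea_ilink; run;

For example, exp(3.1811)/(1 + exp(3.1811)) is about 0.9601, the estimated proportion of open buds for the cold treatment in the third season, which matches the Mean reported above.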
Generalized linear models

The models we have seen so far are special cases of generalized linear models. The data come from distributions that belong to an exponential family (for example, the normal, gamma, binomial or Poisson distributions). A generalized linear model is a model that links the responses ("dependent" variables) to other "independent" or "explanatory" variables. We have to consider three components:

1. The random component (the distribution of the Y_i). In general the Y_i are assumed to be independent, with a distribution that belongs to a linear exponential family (for example: normal, binomial, Poisson, gamma, etc.).
2. The systematic component, which specifies how the independent variables enter the model. This is a linear model (that is, the parameters enter the model linearly), e.g. \beta_1 x_1 + \beta_2 x_2.
3. The link function, which connects the mean (expectation) of the distribution of the Y_i to the systematic component. For example, g(\mu_i) = \log(\mu_i) = \beta_1 x_{1i} + \beta_2 x_{2i}.

Some examples of generalized linear models:

a. Y_i \sim N(\beta_1 x_{1i} + \beta_2 x_{2i}, \sigma^2)
b. Y_i \sim \mathrm{Bin}(\pi_i, N_i); \log[\pi_i/(1 - \pi_i)] = \beta_0 + \beta_1 x_i
c. Y_i \sim \mathrm{Poisson}(\lambda_i); \log(\lambda_i) = \eta_i, where \eta_i is the systematic component

Example of a Poisson regression model

The following experiment (Gbur et al., 2011) compares management strategies for a weed found in pepper crops. Weed tubers of four different sizes were selected, planted in individual pots, and tillage of the soil was simulated at four different frequencies (weekly, biweekly, monthly, and untilled, equivalent to 12 weeks). The design was a randomized complete block design, and the experiment is a 4x4 factorial (4 sizes x 4 frequencies) with 4 replicates. The response variable is the total number of new tubers produced. Note that the response is a count, so a model assuming a Poisson distribution and a log link function will be used (example c above, but with a factorial treatment structure and block effects).

data weed;
   input tillage_freq $ initial_weight :$10. block tubers;   /* space-delimited values; :$10. keeps the full weight-class label */
   datalines;
2.BW <0.25 1 7
2.BW <0.25 2 2
2.BW <0.25 3 3
2.BW <0.25 4 3
2.BW 0.25-0.50 1 6
2.BW 0.25-0.50 2 3
2.BW 0.25-0.50 3 2
2.BW 0.25-0.50 4 5
2.BW 0.50-0.75 1 6
2.BW 0.50-0.75 2 5
2.BW 0.50-0.75 3 3
2.BW 0.50-0.75 4 4
2.BW 0.75-1.00 1 6
2.BW 0.75-1.00 2 3
2.BW 0.75-1.00 3 4
2.BW 0.75-1.00 4 5
3.MO <0.25 1 13
3.MO <0.25 2 5
3.MO <0.25 3 13
3.MO <0.25 4 8
3.MO 0.25-0.50 1 21
3.MO 0.25-0.50 2 19
3.MO 0.25-0.50 3 16
3.MO 0.25-0.50 4 16
3.MO 0.50-0.75 1 32
3.MO 0.50-0.75 2 25
3.MO 0.50-0.75 3 29
3.MO 0.50-0.75 4 38
3.MO 0.75-1.00 1 84
3.MO 0.75-1.00 2 72
3.MO 0.75-1.00 3 43
3.MO 0.75-1.00 4 40
1.WE <0.25 1 6
1.WE <0.25 2 3
1.WE <0.25 3 4
1.WE <0.25 4 3
1.WE 0.25-0.50 1 5
1.WE 0.25-0.50 2 4
1.WE 0.25-0.50 3 5
1.WE 0.25-0.50 4 4
1.WE 0.50-0.75 1 6
1.WE 0.50-0.75 2 2
1.WE 0.50-0.75 3 5
1.WE 0.50-0.75 4 5
1.WE 0.75-1.00 1 7
1.WE 0.75-1.00 2 4
1.WE 0.75-1.00 3 5
1.WE 0.75-1.00 4 5
4.ZE <0.25 1 123
4.ZE <0.25 2 122
4.ZE <0.25 3 112
4.ZE <0.25 4 118
4.ZE 0.25-0.50 1 121
4.ZE 0.25-0.50 2 118
4.ZE 0.25-0.50 3 124
4.ZE 0.25-0.50 4 119
4.ZE 0.50-0.75 1 112
4.ZE 0.50-0.75 2 121
4.ZE 0.50-0.75 3 128
4.ZE 0.50-0.75 4 125
4.ZE 0.75-1.00 1 130
4.ZE 0.75-1.00 2 124
;
run;

ods rtf;
proc glimmix data=weed plots=studentpanel;
   class block tillage_freq initial_weight;
   model tubers = block tillage_freq | initial_weight / dist=poisson link=log;
   lsmeans tillage_freq * initial_weight / ilink
           plot=meanplot(sliceby=initial_weight join);
   lsmeans tillage_freq * initial_weight / ilink
           plot=meanplot(sliceby=initial_weight join ilink) slice=tillage_freq;
run;
ods rtf close;
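Written out, the model requested by the model statement above can be sketched as follows; the symbols are ad hoc and do not appear in the original notes. For Y_{ijk}, the count of new tubers at tillage frequency i, initial-weight class j and block k,

Y_{ijk} \sim \mathrm{Poisson}(\lambda_{ijk}), \qquad \log(\lambda_{ijk}) = \eta + \rho_k + \alpha_i + \beta_j + (\alpha\beta)_{ij}

where \rho_k is the block effect, \alpha_i the tillage-frequency effect, \beta_j the initial-weight effect and (\alpha\beta)_{ij} their interaction. This is example (c) above, with a factorial treatment structure and block effects in the systematic component.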
Model Information
Data Set                    WORK.WEED
Response Variable           tubers
Response Distribution       Poisson
Link Function               Log
Variance Function           Default
Variance Matrix             Diagonal
Estimation Technique        Maximum Likelihood
Degrees of Freedom Method   Residual

Class Level Information
Class            Levels   Values
block                 4   1 2 3 4
tillage_freq          4   1.WE 2.BW 3.MO 4.ZE
initial_weight        4   0.25-0.50 0.50-0.75 0.75-1.00 <0.25

Number of Observations Read   62
Number of Observations Used   62

Fit Statistics
-2 Log Likelihood            321.88
AIC  (smaller is better)     359.88
AICC (smaller is better)     377.98
BIC  (smaller is better)     400.30
CAIC (smaller is better)     419.30
HQIC (smaller is better)     375.75
Pearson Chi-Square            42.73
Pearson Chi-Square / DF        0.99

Type III Tests of Fixed Effects
Effect                 Num DF   Den DF   F Value   Pr > F
block                       3       43      1.50   0.2276
tillage_freq                3       43    623.93   <.0001
initial_weight              3       43      7.24   0.0005
tillage_f*initial_we        9       43     12.74   <.0001

tillage_f*initial_we Least Squares Means
tillage_freq   initial_weight   Estimate   Standard Error   DF   t Value   Pr > |t|      Mean   Standard Error Mean
1.WE           0.25-0.50          1.5031           0.2357   43      6.38     <.0001    4.4956   1.0596
1.WE           0.50-0.75          1.5031           0.2357   43      6.38     <.0001    4.4956   1.0596
1.WE           0.75-1.00          1.6573           0.2182   43      7.59     <.0001    5.2449   1.1445
1.WE           <0.25              1.3853           0.2500   43      5.54     <.0001    3.9961   0.9990
2.BW           0.25-0.50          1.3853           0.2500   43      5.54     <.0001    3.9961   0.9990
2.BW           0.50-0.75          1.5031           0.2357   43      6.38     <.0001    4.4956   1.0596
2.BW           0.75-1.00          1.5031           0.2357   43      6.38     <.0001    4.4956   1.0596
2.BW           <0.25              1.3208           0.2582   43      5.12     <.0001    3.7464   0.9673
3.MO           0.25-0.50          2.8894           0.1179   43     24.52     <.0001   17.9826   2.1193
3.MO           0.50-0.75          3.4330          0.08981   43     38.23     <.0001   30.9700   2.7813
3.MO           0.75-1.00          4.0892          0.06469   43     63.21     <.0001   59.6921   3.8616
3.MO           <0.25              2.2763           0.1601   43     14.22     <.0001    9.7406   1.5598
4.ZE           0.25-0.50          4.7907          0.04556   43    105.16     <.0001    120.38   5.4844
4.ZE           0.50-0.75          4.7989          0.04537   43    105.77     <.0001    121.38   5.5072
4.ZE           0.75-1.00          4.8102          0.06651   43     72.32     <.0001    122.76   8.1651
4.ZE           <0.25              4.7761          0.04589   43    104.07     <.0001    118.63   5.4445

Tests of Effect Slices for tillage_f*initial_we Sliced By tillage_freq
tillage_freq   Num DF   Den DF   F Value   Pr > F
1.WE                3       43      0.23   0.8736
2.BW                3       43      0.13   0.9392
3.MO                3       43     55.08   <.0001
4.ZE                3       43      0.07   0.9739
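As in the logistic examples, the Mean column of this LS-means table is the inverse link applied to the log-scale Estimate, here simply exp(estimate), and the reported Standard Error Mean agrees with the delta-method approximation exp(estimate) times the standard error. A small hypothetical check (not part of the original program; the data set name poisson_ilink is invented):

data poisson_ilink;
   input estimate se;
   mean_count = exp(estimate);     /* inverse of the log link               */
   se_count   = mean_count * se;   /* delta-method standard error (approx.) */
   datalines;
4.7907 0.04556
2.8894 0.1179
;
run;

proc print data=poisson_ilink; run;

This reproduces, for example, 120.38 and 5.48 for the 4.ZE (untilled), 0.25-0.50 combination, and about 17.98 and 2.12 for 3.MO (monthly tillage) with the same initial-weight class.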