Logistic Regression using SAS
Prepared by Voytek Grus for the SAS user group, Halifax
February 24, 2006

What is Logistic Regression?
• Regression analysis where the response variable Y is discrete and represents either categories or counts. There are no restrictions on predictors.

  Y      | X1   | X2 | X3
  green  | 46.8 | 1  | No
  yellow | 15.9 | 0  | No
  red    | 51.8 | 0  | Yes

• A linear regression equation of the type $y_i = \alpha + \beta x_i + \varepsilon_i$ is not appropriate …
• … but, as in linear regression analysis, logistic regression is used to
  – test the statistical significance of the relationship between response and predictor variables
  – predict the category of an outcome given its predictors
• Falls into the category of generalized linear models and either complements or offers a flexible alternative to
  – Multiple linear regression: similarity in equations, statistical diagnostics
  – Contingency tables (cross tabulation)
  – Loglinear models
  – Discriminant analysis: answers similar questions but is less restrictive
• Relatively new statistical tool for the analysis of categorical data
  – Contingency tables: 1900's
  – Regression analysis: 1970's
  – Loglinear models: 1975
  – Logistic regression: late 70's, early 80's, but became more popular in the 90's

Fields of application
• Health sciences: questions about disease, yes or no?
• Social sciences: deal with a great many dichotomous variables: employed vs unemployed, married vs unmarried, etc.
  – Attitude to work as based on demographic or behavioral predictors
  – Racial bias in judicial decisions, etc.
• Political science:
  – Which party will voters vote for and why?
  – Which voters will vote for a particular party?
  – Public opinion polls
• Used in economics and marketing to study consumer choice.
  – Banks use it to assess the credit rating of customers
  – Some regulators require that utilities submit customer choice studies on energy conservation options.
  – Choice of mode of transportation
  – Used in demand forecasting

PART I
Conceptual Framework of Logistic Regression

Why not use OLS to estimate the categorical response equation?
• Multiple linear regression of a categorical response variable violates two of the linear model assumptions needed to produce unbiased and efficient coefficients (items 3 and 5 below).
  1. Linearity in coefficients: $y_i = \alpha + \beta x_i + \varepsilon_i$
  2. $E(\varepsilon_i) = 0$
  3. Heteroscedasticity: $\mathrm{var}(\varepsilon_i) \neq \sigma^2$
     • $E(y_i) = 1 \cdot P(y_i = 1) + 0 \cdot P(y_i = 0) = p_i = \alpha + \beta x_i$
     • $\mathrm{var}(\varepsilon_i) = \mathrm{var}(y_i) = p_i(1 - p_i) = (\alpha + \beta x_i)(1 - \alpha - \beta x_i)$
  4. Errors are uncorrelated: $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$
  5. Errors are not normally distributed: $\varepsilon_i \sim$ Binomial
     – Errors take on only two values, $\varepsilon_i = 1 - \alpha - \beta x_i$ or $\varepsilon_i = -\alpha - \beta x_i$, and are bounded between -1 and 1.
  6. As a result
     – coefficient estimates are no longer efficient
     – standard error estimates are no longer consistent
     – estimated values of the response variable Y may be implausible because
       • the linear function is unbounded (estimates can fall outside the (0, 1) interval), but the binary regression is a linear probability model: $E(y_i) = p_i = \alpha + \beta x_i$

Logit Transformation: a remedy to the violation of OLS assumptions
• Instead of estimating this linear equation:
    $y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$
  can apply the logit transformation:
    $\log[p_i/(1 - p_i)] = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$
  where $p_i/(1 - p_i)$ is the odds that the event y=1 will occur.
• Consequences:
  – $p_i = \exp(\alpha + \beta_1 x_{i1} + \dots + \beta_k x_{ik}) / (1 + \exp(\alpha + \beta_1 x_{i1} + \dots + \beta_k x_{ik}))$ happens to be a cumulative logistic distribution function.
  – No matter what the coefficients are, $p_i$ is always between 0 and 1.
  – Absence of $\varepsilon_i$ complicates statistical analysis: standardized coefficients?
  – The derivative of p with respect to x is a function of p: $dp_i/dx_i = \beta p_i(1 - p_i)$. It reflects the changing slope of the S curve, which makes interpretation of coefficients difficult. Need to be cautious when interpreting coefficients in terms of probability.
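The logit consequences listed above can be checked numerically with a short DATA step. This is only a sketch: the coefficient values (alpha = -1, beta = 0.5), the grid of x values, and the data set name logit_demo are assumptions made up for illustration, not values from the talk.

```sas
/* Hypothetical coefficients, chosen only for illustration */
data logit_demo;
  alpha = -1;
  beta  = 0.5;
  do x = -6 to 10 by 2;
    eta   = alpha + beta * x;           /* linear predictor = log odds */
    p     = exp(eta) / (1 + exp(eta));  /* inverse logit, always in (0, 1) */
    odds  = p / (1 - p);                /* equals exp(eta) */
    slope = beta * p * (1 - p);         /* dp/dx, changes along the S curve */
    output;
  end;
run;

proc print data=logit_demo noobs;
  var x eta p odds slope;
run;
```

With these assumed values the slope is largest at x = 2, where p = 0.5 and dp/dx = beta/4 = 0.125, and it shrinks toward zero in both tails. The same coefficient therefore implies different changes in probability at different values of x.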
Alternatives to the logit transformation in the context of latent variables: probit and complementary log-log
• In a perfect world there is a model for a continuous response variable $z_i$. The dichotomous logit model is only its simplification. There is a true equation $z_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \dots + \alpha_k x_{ik} + \sigma\varepsilon_i$, but it cannot be observed. It is latent. Instead we observe a dichotomous y whose values of 1 and 0 depend on z. The relationship of Y with the predictors X depends on the probability distribution of $\varepsilon$.
• Assumptions about the distribution of $\varepsilon$ help determine standardized coefficients.

  Link | Distribution of ε | Standard deviation of ε | Link function (inverse of the CDF of ε) | CDF of ε
  Logit | Logistic distribution: $f(\varepsilon) = e^{\varepsilon}/(1 + e^{\varepsilon})^2$ | $\sigma = \pi/\sqrt{3} = 1.8138$ | $f(p) = \log(p/(1-p))$ | $F(x) = e^{x}/(1 + e^{x})$
  Probit | Standard normal distribution | $\sigma = 1$ | $f(p) = \Phi^{-1}(p)$ | $\Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} \exp(-z^2/2)\,dz$
  Complementary log-log | Double exponential distribution | $\sigma = \pi/\sqrt{6} = 1.28$ | $f(p) = \log(-\log(1-p))$ | $F(x) = 1 - \exp(-\exp(x))$

Logistic Regression in the context of generalized linear models

  Type of regression | Link | Link function | Distribution of the response variable Y | Regression model | Error distribution | Estimation procedure
  Linear regression | Identity | $E(Y) = X^T\beta$ | Normal | $E(Y) = X^T\beta$ | Normal | OLS
  Logistic regression | Logit | $f(p) = \log(p/(1-p))$ | Binomial or multinomial | $E(Y) = \exp(X^T\beta)/(1 + \exp(X^T\beta))$ | Binomial | ML, sometimes WLS
  Logistic regression | Probit | $f(p) = \Phi^{-1}(p)$ | Binomial or multinomial | $\Phi(y) = (2\pi)^{-1/2} \int_{-\infty}^{y} \exp(-z^2/2)\,dz$ | Binomial | ML, sometimes WLS
  Logistic regression | Complementary log-log | $f(p) = \log(-\log(1-p))$ | Gompertz (extreme value) | $E(Y) = 1 - \exp(-\exp(x))$ | Poisson, binomial, etc. | ML
  Poisson regression | Log-linear | $f(y) = \log(y)$ | Poisson | $E(Y) = \exp(X^T\beta)$ | Poisson | ML
  ? | Inverse | $f(y) = 1/y$ | Gamma | $E(Y) = 1/(X^T\beta)$ | Gamma | ML
  Log-linear regression | Cumulative log-log | $f(p) = \log(-\log(1-p))$ | Gompertz (extreme value) | $E(Y) = 1 - \exp(-\exp(x))$ | Poisson, binomial, etc. | ML

I. Logistic Regression compared to ordinary linear regression

  Analytical tools | Ordinary linear regression | Logistic regression
  Coefficient interpretation | In general, have meaning. | Have no intuitive meaning except for the sign. Use the adjusted odds ratio instead ($e^{\text{coefficient}}$).
  Coefficient confidence intervals and hypothesis testing | t test, partial and sequential F tests | Wald confidence intervals / profile likelihood confidence intervals / max. likelihood interval
  Global hypothesis testing, H0 vs H1 | F test $= (SS_{reg}/1)/(SS_{res}/(n-2))$ | Likelihood ratio test: $\Lambda = \max_{\theta \in \omega_0} \mathrm{lik}(\theta) / \max_{\theta \in \Omega} \mathrm{lik}(\theta)$; under H0, $-2\log\Lambda \sim \chi^2_1$. Wald chi-square statistic. Score.
  Goodness of fit | $R^2 = SS_{reg}/SS_{total} = 1 - SS_{res}/SS_{total}$; AIC, adjusted R-square, PRESS | Deviance (is there a better model than this one?): $-2\log(\max \mathrm{lik}(\theta)_{fitted} / \max \mathrm{lik}(\theta)_{saturated})$. Global chi-square (is this model better than nothing?): $\sum_{cells} (O_i - E_i)^2/E_i \sim \chi^2_1$. Hosmer-Lemeshow test. ROC curve.
  Model (variable) selection method | Direct, Forward, Backward, Stepwise, Maxr, Minr, Rsqr, Rsqradj, Mallows Cp | Direct, Forward, Backward, Stepwise, Score
  Multicollinearity | Detect collinear variables or groups of variables using PROC REG: TOL, VIF, COLLINOINT. And/or PROC CORR. | The same.

II. Logistic Regression compared to ordinary linear regression

  Analytical tools | Ordinary linear regression | Logistic regression
  Influence diagnostics, residuals, predictive power, etc. | DFBETAS, DFFITS, Cook's distance, studentized residuals, partial residual plots, predicted values | DFBETAS, DIFCHISQ, DIFDEV; residuals: deviance, Pearson, raw; predicted values
  Non-constant error variance | May transform the response Y to stabilize variance (log(Y), 1/Y, sqrt(Y)) or run WLS. | Over-dispersion/under-dispersion. Lack of fit: due to an underspecified model or dependence of observations.
  Autocorrelation / dependence of observations | Cause: correlation of ε's in time series regression. Durbin-Watson to diagnose, use autoregression to combat. | Cause: clustered or longitudinal data. Use GEE estimation or conditional logit analysis.
  Estimation | OLS | ML, WLS
  Unobserved heterogeneity (heterogeneity shrinkage) | | Coefficients are related to the underlying continuous model: $\beta_j = \alpha_j/\sigma$. The random disturbance may reflect omitted explanatory variables. Include predictors known to be important even in the absence of statistical significance.

PART II
SAS Application of Logistic Regression

Summary of SAS procedures for logistic regression analysis
• Binary logit analysis:
  – Procedures: LOGISTIC, GENMOD, CATMOD, PROBIT, MDC, NLMIXED.
• Multinomial logit analysis
  – Predictors are characteristics of the individual
    • Nominal (no ordering of Y's): PROC LOGISTIC; PROC CATMOD
    • Ordinal (inherent ordering of Y's): PROC LOGISTIC; PROC CATMOD; PROC GENMOD
• Conditional logit analysis
  – Predictors are the characteristics of the response variable
    • Can use PROC MDC and PROC PHREG.
  – Logit analysis of clustered data:
    • PROC LOGISTIC or (PROC PHREG)
    • PROC GENMOD (GEE)

Binary Logit Models
• PROC LOGISTIC at its simplest: main effect model
  1. Individual-level data:

       PROC LOGISTIC DATA=input;
         FREQ frequency;  /* optional */
         MODEL y = X1 X2;
       RUN;

     or
  2. Grouped data:

       PROC LOGISTIC DATA=input;
         MODEL events/trials = X1 X2;
       RUN;

• PROC LOGISTIC with more features

       PROC LOGISTIC DATA=lrdata.penalty DESCENDING;
         CLASS culp;
         /* CLODDS=BOTH requests profile-likelihood and Wald intervals together */
         MODEL death = blackd|whitvic|culp / STB LACKFIT AGGREGATE RSQ
               LINK=logit TECHNIQUE=newton CLODDS=BOTH
               SELECTION=stepwise SCALE=WILLIAMS CORRB INFLUENCE IPLOTS;
         UNITS culp=2 / DEFAULT=1;
         OUTPUT OUT=results PRED=phat LOWER=lb UPPER=up RESCHI=stres DFBETAS=dfs;
       RUN;

• PROC GENMOD at its simplest

       PROC GENMOD DATA=lrdata.penalty;
         MODEL y = X1 X2 / DIST=BINOMIAL;
       RUN;

Multinomial Logit Models
• Multinomial logit for nominal response (generalized logit)
  – The logit transformation of the type $\log(p_i/(1 - p_i))$ does not work for more than 2 categories because the fitted probabilities would not sum to 1 ($\sum_{i=1}^{k} p_i \neq 1$).
  – k-1 equations are estimated: $\log(p_{ij}/p_{ik}) = \alpha_j + \beta_j x_i$, where j = 1, 2, …, k-1.
• Multinomial logit for ordinal response (cumulative, adjacent categories, continuation ratio)
  – The inherent ordering of the Y responses allows relaxing the assumption of multiple odds equations.
  – Estimate k-1 equations of odds of cumulative probabilities $F_{ij}$
    • $\log(F_{ij}/(1 - F_{ij})) = \alpha_j + \beta x_i$: all coefficients except the intercept stay the same across equations
  – Because there is a hierarchy in the categories of the response variable
    • The model is easier to estimate and interpret
    • Hypothesis tests are more powerful
    • There is one coefficient for each predictor but k-1 intercepts.
• Available tools in SAS:
  1. PROC LOGISTIC

       PROC LOGISTIC DATA=lrdata.wallet;
         MODEL wallet = male business punish explain / LINK=glogit;  /* or LINK=clogit */
       RUN;

  2. PROC CATMOD

       PROC CATMOD DATA=lrdata.wallet;
         DIRECT male business punish explain;
         MODEL wallet = male business punish explain / NOITER PRED;
       RUN;

Conditional Logit Models
• Consumer choice studies
  – Consumer taste preferences, choice of mode of transportation, locational characteristics for a retail store
  – Conditional logit:

       proc mdc;
         model decision = x1 x2 / type=clogit choice=(mode 1 2 3);
         id pid;
       run;

  – Nested logit:

       proc mdc data=newdata;
         model decision = ttime / type=nlogit choice=(mode 1 2 3) covest=hess;
         id pid;
         utility u(1,) = ttime;
         nest level(1) = (1 2 @ 1, 3 @ 2),
              level(2) = (1 2 @ 1);
       run;

• Analysis of clustered data
  – Observations within clusters can often be dependent: longitudinal data, students clustered in classrooms or schools, husbands and wives clustered in families, etc.
  – Dependent observations produce underestimated standard errors, overestimated test statistics, and inefficient coefficient estimates.
  – Remedies: can use GEE (PROC GENMOD) or conditional logit (PROC LOGISTIC or PROC PHREG), and other methods such as mixed models or hybrids of the above.

Consumer Choice Modeling: Nested Logit Example
• Decision tree (figure):
  – Top
  – Level 2: 1 (public), 2 (private)
  – Level 1: 1 (plane), 2 (train), 3 (bus) under public; 4 (car) under private
• Example

       proc mdc data=travel2 maxit=200 outest=a;
         model choice = ttime time cost / type=nlogit choice=(mode);
         id id;
         utility u(1, 1 2 3 @ 1) = ttime time cost,
                 u(1, 4 @ 2) = time cost;
         nest level(1) = (1 2 3 @ 1, 4 @ 2),
              level(2) = (1 2 @ 1);
       run;

Literature
• Logistic Regression Using the SAS System, by Paul D. Allison (4th edition, August 2003)
• Categorical Data Analysis Using the SAS System, by Maura E. Stokes, Charles S. Davis, Gary G. Koch (4th edition, January 2005)
• Multivariate Statistical Methods, by B. Tabachnik (1996)
• SAS Help Examples

Questions?
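Worked example for the Binary Logit Models slide: the individual-level and grouped-data forms of PROC LOGISTIC can be compared on simulated data. This is a minimal sketch under stated assumptions. The data set names (individual, grouped), the seed, the sample size, and the true coefficients (-1 and 0.8) are invented for illustration and do not come from the talk.

```sas
/* Simulated data; all names and parameter values are illustrative */
data individual;
  call streaminit(2006);
  do x1 = 0 to 1;
    do i = 1 to 200;
      p = 1 / (1 + exp(-(-1 + 0.8 * x1)));  /* true P(y = 1) */
      y = rand('bernoulli', p);
      output;
    end;
  end;
  drop i p;
run;

/* Collapse to one row per x1: events = count of y = 1, trials = row count */
proc summary data=individual nway;
  class x1;
  var y;
  output out=grouped(drop=_type_ _freq_) sum=events n=trials;
run;

/* Individual-level form; DESCENDING makes the procedure model P(y = 1) */
proc logistic data=individual descending;
  model y = x1;
run;

/* Grouped form; events/trials models the event probability directly */
proc logistic data=grouped;
  model events/trials = x1;
run;
```

Both calls should report the same intercept and x1 estimates, since the grouped table carries the same information about the relationship between y and x1 as the individual rows.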