Count Data Models in SAS
May 25, 2017
© 2006 ChoicePoint Asset Company. All Rights Reserved.

Introduction
A comprehensive survey of models for count data in SAS.
Why? Count models have been gaining popularity since 1980:
=> Insurance: # of auto/medical insurance claims
=> Banking: # of delinquencies / missed payments
=> Marketing: # of responses / purchases
5 models to be covered: Poisson regression, negative binomial (NB) regression, hurdle Poisson regression, zero-inflated Poisson (ZIP) regression, and finite mixture (latent class, LC) Poisson regression.

SAS Capability
[Table: procedures (GENMOD, GLIMMIX, NLIN, NLMIXED, COUNTREG, MODEL) against models (Poisson regression, NB regression, hurdle regression, ZIP regression, LC Poisson regression), with check marks for the supported combinations. The cell positions of the check marks were lost in extraction.]

Count Data
Nature of count data:
- nonnegative, discrete, skewed distribution
- high proportion of zero outcomes
- potential problems: over-dispersion (variance >> mean), excess zeroes
Why won't OLS work?
- counts are heteroskedastic (variance depends on the mean)
- predictions must be nonnegative, and a log transformation won't work because log(0) is undefined
A case study: model # of hospital stays

Data Summary
Classical data for count models:
- 4406 elderly respondents sampled from the National Medical Expenditure Survey (NMES) in 1987
- Information included: 7 health, demographic, and socio-economic variables

Starting Point
Observations:
1) 80% zeroes ==> excess zeroes
2) Variance = 2 * Mean ==> possible over-dispersion
3) Poor fit with univariate Poisson
[Figure: observed probability vs. univariate Poisson probability, counts 0 to 8, y-axis 0% to 100%]

Baseline Model
Probability Function of Poisson Regression
$$f(Y_i \mid X_i) = \frac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!}$$
with $u_i = \exp(X_i \beta)$.
proc nlmixed data = data;
  params b0 = 0 b1 = 0 b2 = 0 ...
    ...;
  mu = exp(b0 + b1 * x1 + b2 * x2...);
  p = exp(-mu) * mu ** y / fact(y);   /* identical to the probability function */
  ll = log(p);
  model y ~ general(ll);
run;

Result of Poisson Model
Observations:
1) Improvement by including observed heterogeneity
2) Significantly under-fit at zeroes
What's wrong? ==> Over-dispersion
[Figure: observed probability vs. predicted probability of Poisson regression, counts 0 to 8, y-axis 0% to 100%]

Test for Over-Dispersion
Auxiliary OLS regression (Cameron, 1996):
$$\frac{(y_i - u_i)^2 - y_i}{u_i} = \alpha u_i + e_i$$
data ols_tmp;
  set poi_out;
  dep = ((y - yhat) ** 2 - y) / yhat;   /* yhat is the fitted Poisson mean */
run;
proc reg data = ols_tmp;
  model dep = yhat / noint;             /* a significant yhat indicates over-dispersion */
run;

Alternative I
Most common alternative: Negative Binomial Regression
NB can be considered a generalized Poisson by including a dispersion parameter.
$$u_i = \exp(X_i \beta + e_i) = \exp(X_i \beta)\,\exp(e_i)$$
where $\exp(e_i) \sim \mathrm{Gamma}(\alpha^{-1}, \alpha^{-1})$ s.t. $E[\exp(e_i)] = 1$ and $V[\exp(e_i)] = \alpha$

Probability Function of Negative Binomial Regression
$$f(Y_i \mid X_i) = \frac{\Gamma(Y_i + \alpha^{-1})}{\Gamma(Y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + u_i}\right)^{\alpha^{-1}} \left(\frac{u_i}{\alpha^{-1} + u_i}\right)^{Y_i}$$
proc nlmixed data = data;
  /* if alpha is not among the elided entries, NLMIXED starts it at its default value of 1 */
  params b0 = 0 b1 = 0 b2 = 0 ... ...;
  mu = exp(b0 + b1 * x1 + b2 * x2 ... ...);
  p = gamma(y + 1/alpha) / (gamma(y + 1) * gamma(1/alpha))
      * ((1/alpha) / (1/alpha + mu)) ** (1/alpha)
      * (mu / (1/alpha + mu)) ** y;
  ll = log(p);
  model y ~ general(ll);
run;

Result of NB Model
Observations:
1) Significant improvement by including unobserved heterogeneity
Comparison with Poisson model:
Likelihood Ratio = 2 * (LL_nb - LL_poi) = 2 * (-2857 - (-3048)), reported as 378 (the rounded log-likelihoods shown give 382)
[Figure: observed probability vs. predicted probability of NB regression, counts 0 to 8, y-axis 0% to 100%]
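
The likelihood-ratio comparison between the Poisson and NB fits can be turned into a p-value. What follows is a minimal sketch in a SAS DATA step, not part of the original slides. It assumes the rounded log-likelihoods quoted on the NB result slide and one extra parameter (alpha) in the NB model; the variable names are illustrative. Because alpha = 0 sits on the boundary of the parameter space, the plain chi-square(1) p-value is conservative, and halving it is a common adjustment.

```sas
data _null_;
  ll_poi = -3048;                  /* Poisson log-likelihood (rounded, from the slide) */
  ll_nb  = -2857;                  /* NB log-likelihood (rounded, from the slide)      */
  lr     = 2 * (ll_nb - ll_poi);   /* unrestricted minus restricted model              */
  p_chi1 = sdf('chisq', lr, 1);    /* upper-tail chi-square probability, 1 df          */
  p_adj  = p_chi1 / 2;             /* boundary adjustment, since alpha >= 0            */
  put lr= p_chi1= p_adj=;          /* results are written to the SAS log               */
run;
```

With a statistic this large the p-value is effectively zero, which agrees with the slide's conclusion that NB is a significant improvement over Poisson.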

Alternative II
Hurdle Regression (Mullahy, 1986)
Two parts:
- zero outcomes: logistic regression
- positive outcomes: truncated Poisson regression
Probability Function of Hurdle Regression
$$f(Y_i \mid X_i) = \begin{cases} \theta_i & \text{for } Y_i = 0 \\ \dfrac{1 - \theta_i}{1 - \exp(-u_i)} \cdot \dfrac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!} & \text{for } Y_i > 0 \end{cases}$$
where $\theta_i$ is the logistic probability of a zero outcome, coded as exp(xa) / (1 + exp(xa)).

proc nlmixed data = data;
  params b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mu = exp(xb);
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  /* prob function for zeroes */
  if y = 0 then p = exp(xa) / (1 + exp(xa));
  /* prob function for positive counts (zero-truncated Poisson) */
  else p = (1 - exp(xa) / (1 + exp(xa))) / (1 - exp(-mu))
           * (exp(-mu) * mu ** y / fact(y));
  ll = log(p);
  model y ~ general(ll);
run;

Result of Hurdle Model
Observations:
1) Significant improvement by modeling zeroes separately
How to compare with Poisson model? AIC, BIC, & Vuong statistic
[Figure: observed probability vs. predicted probability of hurdle regression, counts 0 to 8, y-axis 0% to 100%]

Alternative III
Zero-inflated Poisson Regression (Lambert, 1992)
Two sources of zeroes:
- a point mass of zeroes
- zeroes from the standard Poisson distribution
Probability Function of Zero-inflated Poisson Regression
$$f(Y_i \mid X_i) = \begin{cases} \theta_i + (1 - \theta_i)\exp(-u_i) & \text{for } Y_i = 0 \\ (1 - \theta_i)\, \dfrac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!} & \text{for } Y_i > 0 \end{cases}$$
where $\theta_i$ is the logistic probability of the point mass at zero, coded as exp(xa) / (1 + exp(xa)).

proc nlmixed data = data;
  params b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mu = exp(xb);
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  /* prob function for zeroes: point mass plus Poisson zeroes */
  if y = 0 then p = exp(xa) / (1 + exp(xa))
                    + (1 - exp(xa) / (1 + exp(xa))) * exp(-mu);
  /* prob function for positive counts */
  else p = (1 - exp(xa) / (1 + exp(xa))) * (exp(-mu) * mu ** y / fact(y));
  ll = log(p);
  model y ~ general(ll);
run;
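
To see how the two sources of zeroes in the ZIP probability function combine, the formula can be evaluated directly. This is a minimal sketch, not from the original slides. The values of theta (the point-mass probability, exp(xa) / (1 + exp(xa)) in the code above) and mu are hypothetical and chosen only for illustration. The loop stops at 50 because the Poisson tail beyond that is negligible for this mean.

```sas
data _null_;
  theta   = 0.6;                              /* hypothetical point-mass probability     */
  mu      = 1.5;                              /* hypothetical Poisson mean               */
  p0_pois = exp(-mu);                         /* P(Y = 0) under plain Poisson            */
  p0_zip  = theta + (1 - theta) * p0_pois;    /* P(Y = 0) under ZIP: both zero sources   */
  total   = p0_zip;
  do y = 1 to 50;
    /* positive counts come only from the Poisson component */
    total = total + (1 - theta) * pdf('poisson', y, mu);
  end;
  put p0_pois= p0_zip= total=;                /* total should be very close to 1         */
run;
```

The ZIP zero probability always exceeds the plain Poisson one whenever theta > 0, which is why the model can absorb excess zeroes without distorting the fit at positive counts.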

Result of ZIP Model
Observations:
1) Significant improvement by assuming 2 sources of zeroes
How to compare with other models? AIC, BIC, & Vuong statistic
[Figure: observed probability vs. predicted probability of ZIP regression, counts 0 to 8, y-axis 0% to 100%]

Alternative IV
Latent Class Poisson Regression (Wedel, 1993):
- Existence of S >= 2 classes of latent segments in the data
- Each latent segment is Poisson with a different parameter
- Each case is drawn from one of these latent segments with a certain probability
- Interesting in marketing: segment and model at the same time
Probability Function of LC Poisson Regression
$$f(Y_i \mid X_i) = \sum_{s=1}^{S} p_s\, \frac{\exp(-u_{i|s})\, u_{i|s}^{Y_i}}{Y_i!}$$

proc nlmixed data = data;
  /* prior1 and prior2 start from a grid search over 0 to 1 */
  params a0 = 0 ... b0 = 1 ... c0 = 2 ...
         prior1 = 0 to 1 by 0.1 prior2 = 0 to 1 by 0.1;
  /* class 1 */
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  ma = exp(xa);
  pa = exp(-ma) * ma ** y / fact(y);
  /* class 2 */
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mb = exp(xb);
  pb = exp(-mb) * mb ** y / fact(y);
  /* class 3 */
  xc = c0 + c1 * x1 + c2 * x2 ... ...;
  mc = exp(xc);
  pc = exp(-mc) * mc ** y / fact(y);
  /* mixture of the three classes */
  p = prior1 * pa + prior2 * pb + (1 - prior1 - prior2) * pc;
  ll = log(p);
  ... ...

Result of LC Poisson
Observations:
1) Significant improvement by assuming 3 latent classes with different sets of parameters
How to compare with other models? AIC, BIC, & Vuong statistic
[Figure: observed probability vs. predicted probability of LC Poisson regression, counts 0 to 8, y-axis 0% to 100%]

Models Prediction
1) Poisson cannot give an adequate fit for the data.
2) Hurdle and ZIP are better at modeling excess zeroes.
3) NB and LC are better at handling heterogeneity.

Models Comparison
1) AIC and BIC are convenient and easy to compute for model comparison, and good enough for practitioners. BIC tends to select a more parsimonious model.
2) The Vuong test is good but computationally tedious (code available in the paper); recommended for researchers.

Conclusion
- In practice, the Poisson model is usually not sufficient for over-dispersed data, but it is useful as a baseline model. (Rule of thumb for over-dispersion: Variance ≥ 2 * Mean)
- It is important to identify the reason for over-dispersion: long tail, excess zeroes, or … … ? (Excess zeroes might be the most common reason.)
- Statistics shouldn't be the only consideration for model selection. Examples:
1) Both Hurdle and ZIP suggest a positive effect of private insurance on hospital stays, which makes perfect sense.
2) LC makes it possible to segment the population, which is invaluable in marketing, insurance, and credit risk.
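
The AIC and BIC comparison recommended under Models Comparison takes only a few lines once the log-likelihoods are known. The sketch below is illustrative and not from the original slides. It uses the rounded Poisson and NB log-likelihoods quoted on the NB result slide and n = 4406 from the data summary. The parameter counts k are assumptions (an intercept plus 7 covariates for Poisson, one more for the NB dispersion parameter), since the slides do not state them.

```sas
data ic;
  length model $ 8;
  input model $ ll k;              /* k values in the datalines are assumed, not from the slides */
  n   = 4406;                      /* respondents, from the Data Summary slide                   */
  aic = -2 * ll + 2 * k;           /* Akaike information criterion                               */
  bic = -2 * ll + k * log(n);      /* Bayesian information criterion; log is the natural log     */
  datalines;
Poisson -3048 8
NB -2857 9
;
run;

proc print data = ic noobs;
  var model ll k aic bic;
run;
```

Smaller values are better for both criteria. PROC NLMIXED also prints AIC and BIC in its Fit Statistics table, so in practice the hand calculation mainly serves as a cross-check.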