Download 070-31: An Application of Survival Analysis to Population Dynamics

SUGI 31 Data Mining and Predictive Modeling Paper 070-31 An application of Survival Analysis to Population Dynamics in Human Capital Management Martin Jetton/Dr Robert Yerex, Unicru, Inc, Beaverton, OR ABSTRACT Using Base SAS®, SAS/STAT® and SAS/GRAPH® we have built a decision support infrastructure to monitor and forecast workforce retention dynamics relative to the economic value of Unicru Personality Assessments. Using PROC LIFEREG® we develop best fit survival models for employee lengths of stay and with these best fit model’s hazard rates and apply a semi-Markov population dynamic model (described in this paper) to forecast the rate of replacement of employees with new hires with targeted personality traits. SAS/GRAPH® is utilized to present historical data, fitted survival models and forecasted replacement levels. INTRODUCTION Unicru Inc. provides many retail customers with personality assessments to target specific traits in the hiring process in an hourly workforce. These personality assessments address specific strategic issues such as improved sales, customer service, time and attendance behavior and productivity. We help our customers understand the economic impact and project future benefits of Unicru’s personality assessments by applying a population dynamics model based upon employment spell hazard rates. Many measurable factors within the work environment have an impact on, or can be related to, employee asset value. Unicru’s Incremental Employee Contribution or IEC model has been developed to help organizations address key questions that relate to the value of their employees, and to examine alternative interventions aimed at optimizing the asset value of their employee pool. To date, the IEC model has been used successfully to examine relationships between length of service, time to competency, productivity, the value of retention, employee asset value, and ultimately employee profitability. In the examination of alternative interventions for optimizing employee asset value, a logical intervention is to select for effecting behaviors at the time of the hiring decision. Unicru’s personality assessments provide an effective intervention mechanism for organizations to target effecting behaviors in the hiring process. With each new hire selected using Unicru’s personality assessments, the number of employees predicted to display the effecting behavior increases. The rate at which this selection pressure impacts the employee pool can be forecasted using two aspects of the hiring process; the rate at which employees are retained and compliance in hiring employees with the targeted behavioral trait. The portion of the employee pool (or headcount) resulting from applying selection pressure using the Unicru System is referred to as the Unicru Replacement Level. For the IEC model we are interested in calculating the Unicru Replacement Level at a point in time from Unicru implementation given different retention and compliance rates. Replacement rate estimation is, in practice, much more complex than it may initially appear to be. The stochastic and semi-deterministic factors lead to difficulty in creating useable closed form solution estimators. Simulation techniques are often used to control for these factors. In practice such simulation techniques can lead to reasonably accurate predictions, but they are often time consuming and cumbersome to carry out. The closed form estimator of Unicru Replacement level developed in this paper is practical and reasonably accurate when compared to actual and simulation results. For further discussion of the IEC see the white paper “Human Capital Management – Establishing a Meaningful Metric Framework“, by Dr Robert Yerex. This white paper is available from Unicru, Inc at www.unicru.com. VALUE OF UNICRU REPLACEMENT LEVEL The success in applying Unicru assessment strategies is dependant upon the rate at which employees with the effecting behavior(s) replace headcount in the employee pool. The following example illustrates the rate of replacement and the time it takes to build value associated with the Unicru Sales Assessment. Example: A customer signs up to use the Unicru Sales Assessment. Their goal is to achieve an improvement in sales productivity for commissioned sales personnel. In comparing this customer to similar Unicru customers it is expected that the Unicru Sales Assessment selects hires providing 2% more sales per hour than non-Unicru hires. 1 SUGI 31 Data Mining and Predictive Modeling Assuming that both Unicru and non-Unicru exhibit 100% turnover rates and a 100% selection pressure compliance, we would expect the sales per hour to reflect a 1% increase at 253 days (½ of the 2%). We would expect an aggregate increase in sales to reach 2% some 4 years after adoption of the Unicru process. If the customer experiences an 80% compliance rate, the customer would not realize the 1% sales per hour impact until one full year after introducing Unicru. And with the 80% compliance rate, the customer’s maximum benefit would be 80% of 2% or 1.6% sales per hour. As you can see it takes time for this customer to realize the potential of Unicru Sales Assessment selection pressure impact on sales per hour. The example also illustrates how reduced compliance reduces the selection pressure and the long run impact the desired goal of an increase in sales per hour. A closed form solution to the Unicru Replacement Level will help customers understand that the impact of selection pressure takes time. UNICRU REPLACEMENT LEVEL Underlying Unicru Replacement Level are two fundamental drivers, compliance in hiring individuals with the affecting personality trait and the rate at which employees are retained. The first driver, selection pressure compliance, is easy to understand and easy to measure. Compliance level is the rate at which new hires are indicated as “Green” on Unicru personality assessments for specific effecting behaviors of interest. Selection pressure compliance is measured using payroll hires compared to applicants scoring green on assessments of interest. Retention, the second driver of Unicru Replacement Level, relates to how long people stay. Often confused with turnover rates, retention is the measure of the length of time people are employed. Turnover measures the number of separations. Turnover does not measure the length of time that people stay with an employer. Retention analysis provides for the application of far richer analytical techniques by focusing on the length of stay. Underlying these analytical techniques is the concept of survival analysis and hazard rates, or the probability of exit of an employee given their length of stay. RETENTION MODELING Modeling lengths of stay or retention is critical in the development of a closed form estimator of Unicru Replacement Level. We define retention as the length of continuous employment from hire date to separation date. We’ll think of this length of stay as a continuous random variable, S, and consider a large population of people who hired at a time S=0. S does not refer to calendar time but rather it measures a time on a person-specific clock that are each set to zero at the moment a person is hired. S is the duration of stay as a hire. The population is assumed to be homogenous with respect to the systematic factors, regressor variables that affect the distribution of S. This means that everyone’s duration of stay will be a realization of a random variable from the same probability distribution. This length of time S, or duration, can be understood with the following three functions of time: 1) 2) 3) Distribution Function Survival Function Density Function F(s) = Prob(S < s) S(s) = Prob(S > s) = 1 – F(s) f(s) = dF(s)/ds = -dS(s)/ds The distribution function, F(s), is defined as the probability that a hire will stay at least s days. The survival function, S(s), is defined as the probability that a hire will stay more than s days. The density function, f(s), is the probability that a hire will stay exactly s days. For the modeling of Replacement we need to know, for a given length of stay s, what is the conditional probability of exit during the next increment of time ( or delta): h(s, ∆) = Prob (s < S < s+∆ | S>=s) The rate of exit or hazard rate is defined as: h(s) = lim∆->0 h(s,∆)/∆ = f(s) / S(s) (1) The hazard rate represents the instantaneous probability that the employee separates at time s, conditional on the fact that they have lasted up to time s. The hazard rate is equal to the probability a hire will stay exactly s days divided by the probability a hire will stay more than s days. In practice, to measure hazard rates we review historical lengths of stay equal to separation date minus hire date. This length of stay is used to develop a density function of durations of stay, s. The density function, f(s), is the 2 SUGI 31 Data Mining and Predictive Modeling probability that a hire will stay exactly s days. From the density function, the distribution function and survival functions can easily be calculated. The hazard rate is the density function divided by the survival function. With system compliance and hazard rates, or the probability of exit of an employee given their length of stay, we develop the closed form estimator of Unicru Replacement Level. UNICRU REPLACEMENT LEVEL The Unicru Replacement Level at time s from adoption of Unicru is given by the formula: UR(s) = PN −>U PN −>U − PX − e PX PX ex is the natural log. The closed form estimator of Unicru Replacement Level is composed of two parts. First ‘non-compliant’ hires become Unicru compliant as percent of total hires. The second, rate us based upon length of time from Unicru adoption. The function, and thus UR(s) approaches PN −>U PX e − PX PN −>U PX , is the rate at which PN −>U − PX e , provides that the PX , goes to 0 (zero) as time, s, increase . The terms PN->U and PX are developed from the cumulative hazard rate, compliance rate and time from Unicru adoption. Using the cumulative hazard multiplied times the compliance rate, we calculate the probability of a position alternating from a non-Unicru hire to Unicru hire: PN->U = Compliance * CH(s)N where CH(s)N is the Cumulative hazard for non-Unicru Hires. The probability of exit of a non-Unicru hire over a period of time. The probability of a non-compliant separation is defined as the probability of alternating between non-Unicru and Unicru and the probability of alternating between Unicru and non-Unicru. PX = PN->U + PU->N where PU->N = (1-Compliance) * CH(s)U. The probability of a position alternating from a Unicru hire to a non-Unicru hire up to time s. CH(s)U is the Cumulative Hazard for Unicru Hires. The probability of exit of a Unicru hire up to time s. The closed form estimator of Unicru Replacement Level represents the rate of change of a position from a nonUnicru hire to a Unicru hire, PN −>U PX , adjusted over time for non-compliance, PN −>U − PX e . PX See appendix A for the development of UR(s). SAS IMPLEMENTATION To implement the tracking of and forecasting for Unicru’s Replacement Level, we need to identify the hazard rate functions for a new or prospective customer. To do this we use survival analysis and the procedure available SAS/STAT®, PROC LIFEREG®. At Unicru, as an ASP (application software provider), we collect total hires and separations at our customers for reporting back to them retention data. We (Unicru) and our customers use this data to evaluate the impact of personality assessments on retention. Given that we have ongoing interactions with our current customers payroll data (where the hires and terminations are captured) we find it easy to work with prospects to evaluate their historical retention from payroll data. Using our regular analysis of current customers, we have found the best distributions for employment spell data are the exponential, Weibull, log-logistic and log-normal. Consistently these four distributions return the best fit of the available distributions in PROC LIFEREG®. For purposes of our estimation of new customer retention behavior we focus our efforts on these four parametric functions. Extensive discussion of survival analysis and SAS can be found in “Survival Analysis using SAS, A Practical Guide” by Paul Allison. Paul Allison outlines survival analysis and the fitted hazard functions we needed for the Unicru Replacement Level described above. For retention analysis we use the difference between hire date and termination date as the duration and consider censored observations as those still employed at the time of the analysis. One of the advantages that PROC 3 SUGI 31 Data Mining and Predictive Modeling LIFEREG provides is the handling of the random censoring nature of employment spell data. Employees are starting and ending at random points over time. IDENTIFYING THE BEST FIT SURVIVAL MODEL The first step is isolating the data for the time frame of interest. We use two macro variables ‘max_obs_date’ and ‘min_obs_date’ to select the date range of interest. In the first data step we calculate the duration and whether the observation was censored. Individuals who survived past the max date are considered censored. Source code: %let %let %let %let %let Mdlofint=; Unicru_impact = 1.0; Unicru_compliance = 1.0; max_obs_date = %sysfunc(mdy(6,30,2005)); min_obs_date = %sysfunc(mdy(7,1,2003)); DATA work.SURVIVAL_DATA; set mywork.infoallemployees_spec; *if uhirejobcodedescription in ("Clerk","Cashier"); if termdate_sas <= &max_obs_date then end_date = termdate_sas; end_date = &max_obs_date; if hiredate_sas <= &min_obs_date then delete; DUR = end_date - hiredate_sas; if DUR>0; if termdate_missing = 1 or hiredate_missing = 1 if termdate_sas >= &max_obs_date then status=0; else then STATUS=0; else STATUS=1; KEEP DUR STATUS HIREMTHYR hireyear hiremonth hiredate_SAS exposure tenure employee; RUN; To capture the parameters provided in the modeling output we utilize SAS/ODS®. At the same time we are capturing the output from the modeling process for review. ODS listing close; ODS pdf file='pdflifereg.pdf'; ODS output "Analysis of Parameter Estimates" (MATCH_ALL=parmsall PERSIST=PROC)=ParmEstimates "Model Information" (MATCH_ALL=modelsall PERSIST=PROC)=ModelInfo; PROC LIFEREG data=SURVIVAL_DATA ; title model dur*status(0) = &Mdlofint / PROC LIFEREG data=SURVIVAL_DATA ; title model dur*status(0) = &Mdlofint / PROC LIFEREG data=SURVIVAL_DATA ; title model dur*status(0) = &Mdlofint / PROC LIFEREG data=SURVIVAL_DATA ; title model dur*status(0) = &Mdlofint / ODS output close; 'Log Normal'; DIST=Lnormal; RUN; 'Log Logistic'; DIST=llogistic; RUN; 'Exponential'; DIST=exponential; RUN; 'Weibull'; DIST=weibull ; RUN; After closing the ODS OUTPUT we have captured the model information to evaluate the best fit model. We then print the sorted models by the highest log likelihood. DATA allmodels (keep=depndvar distrib loglikeli) ; set &modelsall; length depndvar $15.; length distrib $15.; retain depndvar distrib loglikeli; if Label1 eq "Dependent Variable" then depndvar = cValue1; if Label1 eq "Name of Distribution" then distrib = cValue1; 4 SUGI 31 Data Mining and Predictive Modeling if Label1 eq "Log Likelihood" then do; output allmodels; end; else delete; RUN; loglikeli = nValue1; PROC SORT data=allmodels; by descending loglikeli; run; PROC PRINT data=allmodels; title 'Best Models Sorted by Log Likelihood'; run; ods pdf close; ods listing; IDENTIFYING THE BEST FIT SURVIVAL MODEL PARAMETERS The ODS OUTPUT statement is used to capture the parameter estimates in ‘parmestimates’ data sets. Here we recombine them and re-label to select the best fit parameters to be used in estimating the survival functions. The best fit model parameters are selected into macro variables for use in forecasting. Source code: PROC PROC PROC PROC TRANSPOSE TRANSPOSE TRANSPOSE TRANSPOSE data=parmestimates out=parms; var estimate; run; data=parmestimates1 out=parms1; var estimate; run; data=parmestimates2 out=parms2; var estimate; run; data=parmestimates3 out=parms3; var estimate; run; DATA testparms; set parms(in=inlognorm) parms1(in=inloglog) parms2(in=inexpon) parms3(in=inweibull); length distrib $15; if inlognorm then distrib ="Lognormal"; if inloglog then distrib ="LLogistic"; if inexpon then distrib ="Exponential"; if inweibull then distrib ="Weibull"; rename col1=Intercept; rename col2=Scale; rename col3=WiebullScale; rename col4=WiebullShape; RUN; PROC SQL noprint; select distrib into :keydistrib from allmodels having loglikeli = (select max(loglikeli) from allmodels); select intercept,scale, WiebullScale, WiebullShape into :intercept,:scale,:WiebullScale,:WiebullShape from testparms where distrib = left("&keydistrib"); QUIT; ESTIMATING SURVIVAL FUNCTIONS AND GRAPHING Now that we have captured the parameters and the best fit model we can use the parametric definition of the models to create a forecast of replacement levels based upon survival functions. The first DATA step creates a replacement curve for each 7 days (a week) increment out to 450 days after the implementation of Unicru. The replacement levels are set based upon the identified best fit survival functions. The remainder of this code compares the observed density distribution and hazard function to the fitted functions used PROC SQL® to summarize the data and SAS/GRAPH’s ® PROC GPLOT® to plot the functions for evaluation. Source code: %let max_survival_days=450; %let survival_incr=7; DATA survival_theo (drop=alpha gamma); survival_days=1; 5 SUGI 31 Data Mining and Predictive Modeling currdistrib = "&keydistrib"; cummhazrd_unicru = 0; cummhazrd_nonunicru = 0; do while (survival_days <= &max_survival_days ) ; if currdistrib eq 'Lognormal' then do; alpha = 1/(sqrt(2*constant('pi'))*&scale*survival_days); gamma = exp(-.5*((log(survival_days)-&intercept)/&scale)**2); surv = 1-probnorm((log(survival_days)-&intercept)/&scale); func = &survival_incr * alpha * gamma ; hazrd = func / surv; end; if currdistrib eq 'LLogistic' then do; alpha = exp(-&intercept/&scale); gamma = 1/&scale; surv = 1/(1+alpha*(survival_days**gamma)) ; func = &survival_incr * alpha*gamma*(survival_days**(gamma-1)) /((1+alpha*(survival_days**gamma))**2) ; hazrd = func / surv; end; if currdistrib eq 'Weibull' then do; alpha = exp(-&intercept/&scale); gamma = 1/&scale; surv =exp(-alpha*(survival_days**gamma)); func = &survival_incr * gamma*alpha*(survival_days**(gamma-1)) * exp(alpha*(survival_days**gamma)); hazrd = func / surv; end; if currdistrib eq 'Exponential' then do; alpha = exp(-&intercept); surv = exp(-alpha*survival_days) ; func = &survival_incr*alpha* exp(-alpha*survival_days) ; hazrd = func / surv; end; hazrd_nonunicru = hazrd; hazrd_unicru = hazrd_nonunicru * &Unicru_impact; cummhazrd_unicru = cummhazrd_unicru + (1 - &Unicru_compliance) *hazrd_unicru ; cummhazrd_nonunicru = cummhazrd_nonunicru + &Unicru_compliance * hazrd_nonunicru; cummhazrd = cummhazrd_unicru +cummhazrd_nonunicru; Uni_replace = (cummhazrd_nonunicru / cummhazrd ) - ( cummhazrd_nonunicru / cummhazrd ) * exp(-cummhazrd); min_exposure = survival_days; max_exposure = survival_days + &survival_incr; output; survival_days = survival_days + &survival_incr; end; RUN; PROC GPLOT data=survival_theo; title ' Unicru Replacement vs Theoretical Survival'; plot uni_replace*survival_days surv*survival_days / overlay legend; RUN; QUIT; PROC SQL; create table sumexposure as select s.survival_days, sum(employee) as emps_exposed from survival_theo s, survival_data a where a.exposure >= s.min_exposure and hiredate_SAS >= &min_obs_date and hiredate_SAS <= &max_obs_date group by s.survival_days; QUIT; PROC SQL; create table sumtenure as select s.survival_days, sum(employee) as emps_severed 6 SUGI 31 Data Mining and Predictive Modeling from survival_theo s, survival_data a where a.tenure >= s.min_exposure and a.tenure < s.max_exposure and hiredate_SAS > &min_obs_date and hiredate_SAS <= &max_obs_date group by s.survival_days; QUIT; PROC SQL data=sumexposure; by survival_days ; RUN; PROC SORT data=sumtenure; by survival_days ; RUN; DATA density_dist; merge sumexposure sumtenure; by survival_days ; if emps_exposed<0 then emps_exposed=0; if emps_severed<0 then emps_severed=0; if emps_exposed>0 then density_dist= emps_severed / emps_exposed; density_dist= 0; retain cumm_density_dist 0; if _n_=1 then cumm_density_dist=0; else cumm_density_dist=cumm_density_dist+density_dist; surv_density_dist = 1 - cumm_density_dist; hazard_dist = density_dist / surv_density_dist; RUN; else DATA survival_actual;merge density_dist survival_theo;by survival_days; RUN; PROC GPLOT data=survival_actual; title ' Survival Function (Actual vs Theo.)'; plot surv_density_dist * survival_days surv * survival_days /overlay legend; RUN;QUIT; PROC GPLOT data=survival_actual; title ' Hazard Function'; plot hazard_dist * survival_days hazrd * survival_days /overlay legend; RUN; QUIT; PROC GPLOT data=survival_actual; title ' Density Function'; plot density_dist * survival_days func * survival_days /overlay legend; RUN; QUIT; ACTUAL REPLACEMENT COMPARISON Once a customer has implemented Unicru we can track the estimate to actual replacement levels. The following code selects observations relevant to post-Unicru implementation and compares to the fitted replacement levels out to 450 days. Source code: DATA replace_DATA;set mywork.infoallemployees_spec; keep employee unicruapp hiredate_sas termdate_sas; RUN; DATA replace_days; survival_days=1; do while (survival_days < &max_survival_days); min_date = survival_days + &max_obs_date; max_date = min_date + &survival_incr; output; survival_days = survival_days + &survival_incr; 7 SUGI 31 Data Mining and Predictive Modeling end; PROC SQL; create table sum_unicru as select s.survival_days, sum(employee) as unicru_hires from replace_days s, replace_data a where hiredate_SAS <= max_date and termdate_SAS > min_date and unicruapp="Y" and hiredate_SAS <=&max_obs_date+&max_survival_days group by s.survival_days;RUN; PROC SQL; create table sum_nonunicru as select s.survival_days, sum(employee) as nonunicru_hires from replace_days s, replace_data a where hiredate_SAS <= max_date and termdate_SAS > min_date and unicruapp="N" and hiredate_SAS<= &max_obs_date+&max_survival_days group by s.survival_days;QUIT; DATA compare_replace; merge survival_theo (keep=survival_days Uni_replace) sum_unicru sum_nonunicru; by survival_days; if nonunicru_hires < 0 then nonunicru_hires = 0 ; if unicru_hires < 0 then unicru_hires = 0 ; ttl_headcount = nonunicru_hires + unicru_hires; Act_Replace = unicru_hires / ttl_headcount; RUN; PROC GPLOT data=compare_replace; title 'Actual vs Predicted'; plot Act_Replace * survival_days Uni_Replace * survival_days / overlay legend; RUN; QUIT; CONCLUSION We run into issues that dramatically impact our ability to forecast Replacement rates and levels. Primary among these problems are: 1) Rollout at customers takes time. With customers having many sites to implement the Unicru system for hiring, the lag effect of implementation can take several months. Also, when customers run tests for significant periods of time, these stores/sites need to be removed from the analysis. 2) Seasonality. Customers with significant seasonality will impact the prediction of retention. We attempt to identify previous hires who may have been seasonal hires through the data provided by customers. We exclude these hires from the analysis of history. 3) Diversity of positions. We know from experience that the attributes of positions such as full time/part time, regular/temporary (seasonal) and job categories such as manager, clerks or cashiers, will impact retention. We will split these into different retention analysis as needed. REFERENCES Allison, Paul, 1995, Survival Analysis using SAS, A Practical Guide, Cary, NC: SAS Institute Inc. RECOMMENDED READING Lancaster, Tony, The Econometric Analysis of Transition Data, Cambridge University Press, 1990 (1992 paperback printing), New York, NY, USA. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: 8 SUGI 31 Data Mining and Predictive Modeling Martin Jetton, Senior Analytic Consultant, Unicru, Inc 955 SW Gemini Dr Beaverton, OR 97008 Work Phone: 503-596-3181 E-mail: [email protected] Web: www.unicru.com Dr Robert Yerex, Director, Analytic Research Group Unicru, Inc 955 SW Gemini Dr Beaverton, OR 97008 Work Phone: 503-596-3181 E-mail: [email protected] Web: www.unicru.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 9

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download 070-31: An Application of Survival Analysis to Population Dynamics