Download Note

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Probability wikipedia , lookup

Statistics wikipedia , lookup

Transcript
DURATION (SURVIVAL) MODELS FOR TIME TO EVENT DATA
INTRODUCTION
A relatively new area in econometrics is the analysis of duration data (also called time to event
data). The econometrics literature on the analysis of duration data draws heavily from
statistical methods that have been developed by industrial engineers and biomedical
researchers, who use these methods to analyze such phenomena as the useful lives of
machines and survival times of patients after a particular type of operation.
Dependent Variable
In duration analysis, the dependent variable being studied is a duration. Duration is defined as:
i)
ii)
The amount of time that elapses until some event occurs,
or
The amount of time that elapses until measurement is taken before the event
actually occurs.
Duration is often called time to event (e.g., time to death, time to machine failure, time to
employment). If an observed duration corresponds to i, it is said to be uncensored. If the
observed duration corresponds to ii, it is said to be censored. The following points should be
noted about a duration variable:
1. A duration variable is always measured in units of time (e.g. minutes, days, weeks,
months).
2. A duration variable must be non-negative (you can’t have a negative time to event).
Censoring
It is usually the case that some of the observations on a duration variable are censored. An
observation is said to be censored when it is measured from the beginning of the period of
interest until some point before the event takes place. For example, suppose that the duration
variable is time to death after a heart transplant. Suppose that this variable is measured for a
sample of 30 persons. Suppose that when the measurement is taken 20 of these individuals
have died, but 10 are still alive. The 10 observations for the individuals still alive are censored
observations.
Duration Data
No Censoring
Let the variable duration be denoted by T. Duration is a random variable that measures time to
event. Because T is a random variable, its behavior can be described by a probability
distribution (T). Let
t1, t2, …tn be a random sample of n-observations on the random variable T. The sample will
usually consist of a cross-section of n times to event (durations) on individuals, firms, machines,
etc.
Censoring
Let T* be the value of duration in the absence of censoring. Let T be the observed value of
duration. Let c be the value of duration when it is censored at time c. The observed value of
duration is given by
T = T* if T* < c
T=c
if T*  c
The censoring time, c, can either be a known constant or a random variable. If c is a random
variable, then it must be independent of T*. To indicate whether an observation is censored, a
censor status variable is usually created. This variable is an indicator variable that takes a value
of 1 if the observation is not censored and 0 if the observation is censored.
Approaches to Analyzing Duration Data
Three alternative approaches can be used to analyze duration data. These are:
1. Parametric approach
2. Semiparametric approach
3. Nonparametric approach
The parametric approach makes assumptions about the probability distribution of T. This
allows you to analyze duration data using regression models or regression-like models. The
semiparametric approach makes only minimal assumptions about the probability distribution of
T. The nonparametric approach makes no assumptions about the probability distribution of T.
PARAMETRIC APPROACH
There are two major types of parametric models of duration. These are:
1. Regression models
2. Regression-like models
Regression Models
A regression model is the appropriate model to use when your objective is to better understand
how a set of variables, X1, X2, …, Xn influence the expected (average) time to event, E(T).
Example
Let T be the amount of time an individual is unemployed measured in weeks. Thus, the event of
interest is finding a job. Let X1 be the level of unemployment benefits in hundreds of dollars per
month, X2 be years of work experience, and X3 be marital status; X3 =1 if single, X3 = 0 if
married. You have a sample of 200 individuals. Some of these observations are censored at T =
80 weeks. Suppose your objective is to better understand how the level of unemployment
benefits, work experience, and marital status influence the average amount of time an
individual is unemployed. One way to proceed is to estimate the following classical linear
regression model,
T = 0 + 1X1 + 2X2 + 3X3 + 
The coefficient 1 measures the effect of a one unit change ($100) in unemployment benefits
on the average amount of time an individual is unemployed. The coefficients 2 and 3 have
similar interpretations. However, there are 3 potential problems with this model.
1. The observations on T are censored. As a result, the OLS estimator will yield estimates of
the coefficients that are biased and inconsistent. Thus, the appropriate model would be the
censored regression model, which accounts for censored observations in the estimation
procedure.
2. The classical linear regression model assumes that T has a normal distribution. There are a
number of reasons to believe that duration (such as length of unemployment) does not
have a normal distribution (the most obvious reason is that T is positive by construction).
Jeffrey Wooldridge suggests that one way to deal with this problem is to use the logarithm
of duration as the dependent variable; that is ln(T). This is because ln(T) usually has a
distribution that is closer to a normal distribution than T itself. In this case, the slope
coefficients (multiplied by 100) measures the approximate percentage change in T for a one
unit change in X.
3. The dependent variable, T, measures a process that takes place over the length of time (0,
t). Regression analysis assumes that the value of X does not change during the period that T
is being observed. For example, suppose that an individual is unemployed for 12 months.
Regression analysis assumes the level of unemployment benefits he received (X1) and his
marital status (X3) did not change during this period of time. If either of these variables
does change during the time an individual is unemployed, this greatly complicates the
analysis.
Regression-Like Models
A regression-like model is the appropriate model to use when analyzing duration data if your
objective is any of the following.
1.
2.
3.
4.
The probability that an event will occur before time t.
The probability that an event will occur after time t.
The probability that an event will occur between time t and time t+1.
The probability that an event will occur between time t and time t+1, given that it has not
occurred up to time t.
Notice that we are not interested in expected duration, rather we are interested in the
probability of duration. However, a regression-like model can also be used to analyze average
duration or median duration.
Example
Let T be the amount of time an individual is unemployed. We might be interested in the
following questions.
1.
2.
3.
4.
What is the probability that an individual will be unemployed for 6 months or less?
What is the probability that an individual will be unemployed for more than 6 months?
What is the probability that an individual will be unemployed between 6 and 7 months?
Given that an individual has been unemployed for 6 months, what is the probability that he
will find a job within the next month?
5. Will the probability that an individual finds a job increase or decrease the longer he is
unemployed?
6. What is the average or median amount of time an individual is unemployed?
Probability Distributions for a Duration Variable
A continuous duration random variable, T, can be described by four alternative probability
distributions. These are:
1. Probability density function
2. Cumulative distribution function
3. Survival function
4. Hazard function
Once you choose a particular type of probability density function (e.g., normal, exponential,
Weibull, etc.) you can derive the other three functions. Thus, all four functions have the same
parameters and are simply different ways of describing the same system of probabilities.
Probability Density Function
Let T be a continuous duration random variable. Let t be a specific value of the random variable
T. Let T have a probability density function given by (t), where t is a specific value of T. The
probability density function (t) allows you to calculate the probability that T will fall in the
interval between t1 and t2; that is,
t2
Pr(t1 T  t2) =  (t)dt
t1
Thus, the probability that T will fall in the interval between t 1 and t2 is equal to the area under
the curve of (t) between the values t1 and t2. For example, if T is length of unemployment in
weeks and t1=40 and t2=42, then you can find the probability that an individual will be
unemployed between 40 and 42 weeks.
Cumulative Distribution Function
Given the probability density function (t), the cumulative distribution function F(t) can be
derived as follows,
t
F(t) = Pr(T  t) =  (t)dt
0
Thus, the probability that T will take a value that is less than or equal to t is equal to the area
under the curve of (t) between 0 and t. For example, if T is the length of unemployment in
weeks and t=52, then you can find the probability that an individual will be unemployed for 52
weeks or less.
Survival Function
Given the cumulative distribution function F(t), the survival function S(t) can be derived as
follows,
S(t) = Pr(T  t) = 1 – F(t)
Thus, the probability that T will take a value that is greater than or equal to t is equal to one
minus the area of the curve of (t) between 0 and t. This is equal to the area under the curve of
(t) between t and the maximum value of t. For example, if T is the length of unemployment in
weeks and t=52, then you can find the probability that an individual will be unemployed for 52
weeks or more; that is, you can find the probability that an individual will be unemployed for at
least 52 weeks.
Hazard Function
Given the probability density function (t) and the survival function S(t), the hazard function
h(t) can be derived as follows,
h(t) = (t) / S(t)
The hazard function is a particular type of conditional probability function. It tells you the
probability that an event will occur in the next short interval of time, given that it has not
occurred up to time t. Roughly speaking, it tells you the rate at which the event will occur at
time t. For example, if T is length of unemployment in weeks and t = 52, then you can find the
probability that individual will find employment during the next week, given that he has been
unemployed for 52 weeks. That is, the hazard function tells you the rate at which individuals
who have been unemployed for 52 weeks are finding jobs. For example, a hazard rate of 0.05
at t = 52 implies that 5 of 100 individuals who are unemployed for 52 weeks are expected to
find a job shortly after that time.
The Hazard Function and Duration Dependence
Often times we are interested in questions like the following.
1. Is it more likely, less likely, or equally likely that an individual will find a job the longer he is
unemployed?
2. Is it more likely, less likely, or equally likely that an a strike will end the longer it lasts?
3. Is it more likely, less likely, or equally likely that a patient will die the longer he has survived
after open heart surgery?
The answer to these questions depends upon the slope of the hazard function. We have the
following definitions.
1. If the hazard function has a positive slope, then the distribution of the duration variable has
positive duration dependence. In this case, the longer the duration (e.g., unemployment)
the greater the probability the event will occur in the next short period (e.g., the greater the
probability the individual will find employment).
2. If the hazard function has a negative slope, then the distribution of the duration variable
has negative duration dependence. In this case, the longer the duration the smaller the
probability the event will occur in the next short period.
3. If the hazard function has a constant slope, then the probability that an event will occur in
the next short period does not depend upon duration. In this case, the event is said to have
no memory.
Duration Dependence and Functional Form
If economic theory unequivocally indicates that a duration variable has, for example, negative
duration dependence, then you should choose a functional form for the probability distribution
that imposes this structure on the data. However, if economic theory allows a duration variable
to have, for example, positive or negative duration dependence, then you should not choose a
functional form for the probability distribution that imposes either positive or negative duration
dependence on the data. If you do you create a fete accompli. In this case, you must choose a
functional form for the probability distribution that is flexible enough to allow for both positive
and negative duration dependence, and allow the data to determine the outcome.
Choosing a Functional Form for the Probability Distribution of a Duration Variable
To specify a parametric duration model, and estimate the parameters of this model, you must
choose a particular functional form for the distribution of the duration variable. The 2
distributions chosen most often are the following:
1. Exponential distribution
2. Weibull distribution
Exponential Distribution
The exponential distribution is used frequently to specify a parametric duration model. This is
because it is easy to work with, easy to interpret, and often times can be justified as a
reasonable approximation of the data generation process. The probability density function,
cumulative distribution function, survival function, and hazard function for the exponential
distribution are as follows.
(t) = exp(-t)
F(t) = 1 – exp(-t)
S(t) = exp(-t)
h(t) = 
where the parameter  > 0. The probability density function, cumulative distribution function,
survival function and hazard function for the parameter value =1 are illustrated in Figure 3.
Note the following.
1. The exponential distribution has only one parameter, .
2. Both the mean and variance of the distribution are given by 1/; that is E(t) = 1/ and Var(t)
= 1/. The median of the distribution is given by (0.69314718)(1/).
3. The hazard function is constant and equal to . Thus, this distribution imposes the
restriction of no duration dependence. Because of this characteristic, the exponential
distribution is sometimes call memoryless.
4. The major shortcoming of the exponential distribution is that it depends on only one
parameter: . The family of distributions obtained by varying the value of  is not very
flexible. Because both the mean and variance are given by 1/, they cannot be adjusted
separately. Thus, the exponential distribution will not be a good approximation of the data
generation process if the sample contains both very long and very short durations.
Weibull Distribution
The distribution that is probably used most often to specify a parametric duration model is the
Weibull distribution. The Weibull distribution is a generalization of the exponential distribution
and has the latter as a special case. The probability density function, cumulative distribution
function, survival function, and hazard function for the Weibull distribution are as follows
(t) = (t)-1exp[-(t)]
F(t) = 1 – exp[-(t)]
S(t) = exp[-(t)]
h(t) = (t)-1 = t-1
where  > 0 and  > 0. The probability density function, cumulative distribution function,
survival function and hazard function for the parameter values =1 and  = 0.5 are illustrated in
Figure 1 and for parameter values =1 and  = 3 in Figure 2. Note the following.
1. The Weibull distribution has two parameters,  and .
2. The Weibull distribution collapses to the exponential distribution when  =1. Thus, the
exponential distribution is a special case of the more general Weibull distribution.
3. The hazard function can be either monotonically increasing, monotonically decreasing, or
constant, depending on whether the parameter  is greater than one, less than one, or
equal to one. Thus, the Weibull distribution has positive duration dependence if  >1,
negative duration dependence if
4.  <1, and no duration dependence if  =1.
5. The parameter  represents the shape of the distribution and the parameter  represents
the location of the distribution.
6. The median of the distribution is given by: Median = (0.69314718)1/(1 / ). Because the
Weibull and exponential distributions are skewed to the right, the median may be a better
measure of central tendency than the mean.
Adding Explanatory Variables to a Parametric Duration Model
The duration model developed above is a univariate duration model; it includes one variable:
the dependent variable. However, one or more explanatory variables can also be included in
the duration model. It is possible to allow changes in the explanatory variables to influence the
probability distribution of the dependent variable in various ways. Many parametric duration
models allow changes in the explanatory variables to change the probability density,
cumulative distribution, survival, and hazard functions by rescaling the horizontal axis. This is
called an accelerated failure time model. The coefficients of the explanatory variables are
relatively easy to interpret for most distributions and sometimes have a regression-like
interpretation.
Adding Explanatory Variables to the Weibull Duration Model
The Weibull duration model is an accelerated failure time model. To include explanatory
variables in the Weibull model, which has the exponential model as a special case, proceed as
follows. Let T be a duration random variable that has a Weibull distribution. This distribution is
described by two parameters:  and . The parameter  represents the shape of the
distribution and the parameter  represents the location of the distribution. If  increases
(decreases), the distribution shifts to the left (right). Assume  is a constant. Let  be a function
of the explanatory variables. For the unemployment example,
 = g(X1, X2, X3)
where the X’s are the explanatory variables. Because the parameter  is a function of the
explanatory variables, whenever an explanatory variable changes this will rescale the T-axis,
thereby changing the location of the distribution. To simplify the estimation procedure, let g be
an exponential function,
 = exp[-(0 + 1X1 + 2X2 + 3X3)]
Estimation
The Weibull model can be specified in survival function form as follows,
S(t) = exp[-(t)]
where
 = exp[-(0 + 1X1 + 2X2 + 3X3)]
To obtain estimates of the parameters of the Weibull model, 0, 1, 2, 3 and , the maximum
likelihood estimation procedure is used. The estimates of 0, 1, 2, …n and  are the values
that maximize the likelihood function for the sample of observations. The likelihood function
accounts for both uncensored and censored observations.
Interpretation of Parameter Estimates
The Weibull model has two interpretations. 1) The effects of the explanatory variables on
median duration. 2) The effects of the explanatory variables on the hazard rate.
Median Duration
Suppose that we want to better understand how a set of explanatory variables influences the
center of the distribution of a duration variable. For a skewed distribution, such as the Weibull,
the most appropriate measure would be the median rather than the mean. For example,
suppose you want to better understand how unemployment benefits (X1), years of work
experience (X2) and marital status (X3) influence the amount of time that a typical worker is
unemployed. To make this interpretation, you can derive the median duration function for the
Weibull distribution. The median of the Weibull distribution is given by
M = (0.69314718)1/ (1/)
Substituting  = exp[-(0 + 1X1 + 2X2 + 3X3)] and taking the logarithm of both sides yields,
lnM =  + 1X1 + 2X2 + 3X3
where lnM is the natural logarithm of median length of time unemployed, and the constant  is
a function of the parameters 0 and , and the number 0.69314718. Note that the median
duration function is a log-linear function.
The coefficients are interpreted as follows. 1 is the approximate proportional change
(1*100 is the approximate percentage change) in median length of unemployment that results
from a one $100 increase in monthly unemployment benefits. The exact proportional change is
given by exp(1) – 1. 2 is the approximate proportional change (2*100 is the approximate
percentage change) in median length of unemployment that results from one additional year of
work experience. The exact proportional change is given by exp(2) – 1. 3 is the approximate
proportional difference (2100 is the approximate percentage difference) in median length of
unemployment between a married worker and a single worker. The exact proportional
difference is given by exp(2) – 1. Some statistical programs, such as Stata, calculate and report
the measures exp(1), exp(2), and exp(3). These are called the time ratios.
Hazard Rate
Suppose that we want to better understand how a set of explanatory variables influences the
hazard rate. For example, suppose you want to better understand how unemployment benefits
(X1), years of work experience (X2) and marital status (X3) influence the rate at which
unemployed workers find jobs, given that they have been unemployed for some given period of
time. To make this interpretation, you can derive the hazard function for the Weibull
distribution. The hazard function for the Weibull distribution is given by
h(t) = t-1
Substituting  = exp[-(0 + 1x1 + 2x2 + 3x3)] and taking the logarithm of both sides yields,
lnh = lnh0(t) – 1X1 – 2X2 – 3X3
where lnh is the natural logarithm of the hazard rate, and lnh0(t) is the natural logarithm of the
baseline hazard function, which is a function of duration, t. The baseline hazard function is
given by
h0 = exp(-0)t-1. Note that for the Weibull model, the hazard rate is a log-linear function of
the explanatory variables, X1, X2, and X3. This means that changes in the explanatory variables
have a proportional effect on the hazard rate. Because of this, the Weibull duration model is
both an accelerated failure time model and a proportional hazards model. In a proportional
hazards model, changes in the explanatory variables do not change the shape of the hazard
function; rather they shift the hazard function in a proportional manner. The shape of the
hazard function is given by the baseline hazard function and depends upon the shape
parameter . Note that the effect of an explanatory variable on the log hazard rate is given by
the negative of the product of two parameters: i.
The interpretation is as follows. -1 is the approximate proportional change (-1*100
is the approximate percentage change) in the hazard rate that results from a $100 increase in
monthly unemployment benefits. The exact proportional change is given by exp(-1) – 1. -2
is the approximate proportional change (-2*100 is the approximate percentage change) in
the hazard rate that results from a one year increase in work experience. The exact
proportional change is given by exp(-2) – 1. -3 is the approximate proportional difference (3100 is the approximate percentage difference) in the hazard rate between a married worker
and a single worker. The exact proportional difference is given by exp(-3) – 1. Some statistical
programs, such as Stata, calculate and report the measures exp(-1), exp(-2), exp(-3).
These are called hazard ratios.
The parameter  has the following interpretation:
1. If  > 1, then the hazard function has a positive slope. The longer a worker is
unemployed, the greater the probability he will find a job in the next short period.
2. If  < 1, then the hazard function has a negative slope. The longer a worker is
unemployed, the smaller the probability that he will find a job next short period.
3. If  = 0, then the hazard function has a constant slope. The probability that a worker
will find a job in the next short period of time is independent of the amount of time he is
unemployed.
A Quick Interpretation of the Coefficients of the Explanatory Varaibles
The median duration function and hazard function for the Weibull Model imply the following:
1. If  > 0, then an increase in x will shift the hazard function downward and increase the
median duration. A decrease in x will result in the opposite.
2. If  < 0, then an increase in x will shift the hazard function upward and decrease the
median duration. A decrease in x will result in the opposite.
An Empirical Example
We want to better understand the length of time that an individual is unemployed. The
dependent variable is length of unemployment measured in weeks (UNEMP). The explanatory
variables are the level of unemployment benefits measured in hundreds of dollars per month
(BENEFIT), work experience measured in years (EXPER), and marital status (MARITAL) a dummy
variable for which marital =1 if single, marital = 0 if married. We have a sample of 200
individuals. Some of these observations are censored at UNEMP = 80 weeks. The results are as
follows.
Variable
Coefficient
Standard Errort-statistic
Constant
BENEFIT
EXPER
MARITAL
4.80
0.085
-0.052
0.100
1.20
0.025
0.030
0.005
4.00
3.40
1.73
2.00

1.20
0.05
Algebraic Signs and Statistical Significance of Estimates of Coefficients of Explanatory Variables
and 
Since  > 1, the hazard function is upward sloping. Thus, the longer an individual in
unemployed, the greater the probability that he will find a job in the next short period of time
(e.g., the next week). Since the 95% confidence interval for  is (1.10 , 1.30), the estimate of 
is significantly different from 1 at the 5% level of significance. Thus, the Weibull distribution is
more appropriate than the exponential distribution. The coefficient of BENEFIT is positive and
significant at the 5% level. There is strong evidence that an increase in monthly unemployment
benefits will increase the median number of weeks that a typical individual is unemployed and
decrease the probability that an typical individual will find a job during the next week, for any
given number of weeks unemployed. The coefficient of EXPER is negative and significant at the
10% level. There is some evidence that an individual with more years of work experience will
have a smaller median number of weeks unemployed and have a higher probability of finding a
job during the next week, for any given number of weeks unemployed. The coefficient of
MARITAL is positive and significant at the 5% level. There is strong evidence that a typical single
individual is unemployed longer and has a smaller probability of finding a job during the next
week, for any given number of weeks unemployed, than a typical married individual.
Median Duration of Unemployment
The estimated median duration function is used to obtain estimates of the magnitude of the
effects of the explanatory variables on median length of unemployment. It is given by
lnM =  + 0.085*BENEFIT –0.052*EXPER + 0.010*MARITAL
A $100 increase in monthly unemployment benefits results in an approximate 8.5% increase in
the median number of weeks unemployed. The exact increase is: exp(0.085) – 1 = 1.089 – 1 =
0.89 = 8.9%. One year of additional work experience results in an approximate 5.2% decrease
in the median number of weeks unemployed. The exact decrease is: exp(-0.052) – 1 = 0.95 – 1
= 0.05 = 5.0%. The median number of weeks unemployed for a single individual is
approximately 10% greater than for a married individual. The exact percentage difference is:
exp(0.10) – 1 = 1.101 – 1 = 0.101 = 10.1%.
Some statistical programs report the time ratios rather than the coefficients. For this
example, the time ratio results are as follows.
Variable
Time Ratio
Standard Errort-statistic
BENEFIT
EXPER
MARITAL

1.089
0.95
1.101
1.20
0.025
0.030
0.005
0.05
3.40
1.73
2.00
The standard errors and t-statistics for the time ratios are identical to those for the coefficients.
However, the t-statistic for time ratio is not the ratio of the time-ratio estimate to the
estimated standard error. If you subtract one from the time ratio estimate this gives you an
estimate of the exact proportional change in median number of weeks unemployed that results
from a change in the explanatory variable.
Hazard Rate
The estimated hazard function is used to obtain estimates of the magnitude of the effects of
the explanatory variables on the hazard rate. It is given by
lnh = lnh0(t) – (1.20)(0.085)*BENEFIT – (1.20)(-0.052)*EXPER – (1.20)(0.100)*MARITAL
= lnh0(t) – 0.102*BENEFIT + 0.0624*EXPER – 0.12*MARITAL
A $100 increase in monthly unemployment benefits results in an approximate 10.2% decrease
in the hazard rate (the rate at which unemployed workers find jobs). The exact decrease is:
exp(-0.102) – 1 = 0.903 –1 = - 0.097 = - 9.7%. One year of additional work experience results in
an approximate 6.24% increase in the hazard rate. The exact increase is: exp(0.0624) – 1 =
1.064 – 1 = 0.064 = 6.4%. The hazard rate for a single individual is approximately 12% lower
than for a married individual. The exact percentage difference is: exp(-0.12) – 1 = 1.127 – 1 =
0.127 = 12.7%.
Some statistical programs report estimates of the hazard ratios. For this example, the
hazard ratio results are as follows.
Variable
Time Ratio
Standard Errort-statistic
BENEFIT
EXPER
MARITAL

0.903
1.064
1.127
1.20
0.0023
0.0041
0.0004
0.05
3.93
1.68
2.03
The standard errors and t-statistics for the hazard ratios can differ from those for the
coefficients. Also, the t-statistic for hazard ratio is not the ratio of the hazard-ratio estimate to
the estimated standard error. If you subtract one from the hazard ratio estimate this gives you
an estimate of the exact proportional change in the hazard rate that results from a change in
the explanatory variable.
Survival Function
The estimated (uncensored) survival function is used to obtain estimates of the probability that
a typical individual will be unemployed for more than a specific number of weeks. It is given by
S(t) = exp[-(t)1.20]
where
 = exp[- (4.80 + 0.085*BENEFIT – 0.052*EXPER + 0.100*MARITAL)]
The explanatory variables BENEFIT, EXPER, and MARITAL are set equal to their sample mean
values. The estimated (uncensored) survival function can be used to obtain an estimate of the
probability that a typical unemployed worker will be unemployed for more than t weeks, for
any given value of t you choose.
Hypothesis Tests
To test restrictions on the parameters of a parametric duration model, any of the following
large sample tests can be used: asymptotic t-test, Likelihood ratio test, Wald test, Lagrange
multiplier test.
Time-Varying Explanatory Variables
We have assumed that the independent variables are constant from the beginning of the time
period until the event occurs, or until the measurement is taken (in the case of censored
observations). For example, we assume that the level of monthly unemployment benefits did
not change during this 52 week period of time. However, it is possible that the level of monthly
unemployment benefits changed one or more times during this 52 week period. In this case, X
is a time-varying explanatory variable; that is, its value changes during spells of unemployment.
We can write this as X(t). This type of time-varying explanatory variable can be included into
the accelerated failure time model.
Heterogeneity
Heterogeneity exists in a population when different individuals have different distributions of
the dependent variable. For example, it is possible that married individuals have a different
distribution than single individuals. It is possible that the distribution of length of
unemployment differs for individuals who have different amounts of work experience. To
control for heterogeneity, we include explanatory variables in econometric models.
Heterogeneity can occur when unobserved factors have an important effect on the duration
variable. For instance, an individual’s innate ability may be an important factor influencing
length of unemployment; however, innate ability cannot be measured, and therefore is not
included as an explanatory variable in the duration model. If heterogeneity still exists across
individuals after we include our explanatory variables in the model, this can make it difficult to
interpret the data, maximum likelihood parameter estimates will be inconsistent, and
estimated standard errors will be incorrect. To deal with this problem, you can incorporate the
heterogeneity into the survival distribution.
SEMIPARAMETRIC APPROACH
Introduction
The semiparametric approach makes minimal assumptions about the probability distribution of
the duration variable. This approach is based on direct estimation of the hazard function. It only
requires an assumption about the general form of the hazard function. The general form that is
usually adopted is the proportional hazard specification.
Cox Proportional Hazard Model
The most often used semiparametric model is the Cox proportional hazard model. This model
adopts a general specification for the hazard function that allows changes in explanatory
variables to multiply the hazard function by a scale factor.
Specification of the Cox Proportional Hazard Model
It is assumed that the general form of the hazard function is given by,
h(t) = 0(t) (X1, X2, X3)
The hazard rate is a function of duration, t, and a set of explanatory variables, X 1, X2, X3. It is
assumed that the hazard rate is the product of two functions: 0(t) and (X1, X2, X3). The
function 0(t) depends only on the value of duration, while the function (X1, X2, X3) depends
only on the values of the explanatory variables. When (X1, X2, X3) = 1, then the hazard function
is given by
h(t) = 0(t)
Thus, the function 0(t) is called the baseline hazard function. Note that changes in the values
of the explanatory variables will result in changes in the value of the function (X1, X2, X3). In
turn this will shift the hazard function up or down, such that for any given value of t the hazard
rate will increase or decrease. The Cox model assumes that (X1, X2, X3) is an exponential
function,
(X1, X2, X3) = exp(α1X1 + α2X2 + α3X3)
The exponential functional form simplifies estimation of the parameters and has a
straightforward interpretation. The hazard function with exponential functional form is
h(t,x;,0) = 0(t) exp(α1X1 + α2X2 + α3X3)
Taking the natural log of both sides yields the Cox proportional hazard model,
lnh(t) = ln0(t) + α1X1 + α2X2 + α3X3
The Cox proportional hazard model makes no assumption about the baseline hazard function
0(t). It assumes that when the values of the explanatory variables equal zero, the baseline
hazard function has some unspecified shape given by 0(t). As a result, the coefficients of the
explanatory variables (α’s) can be estimated without making an assumption a priori about the
shape of the hazard function. As a result, this model imposes less structure on the data than a
parametric duration model. It is because the baseline hazard function is unspecified that the
Cox proportional hazards model is a semiparametric model. The hazard function is allowed to
have any shape (upward sloping, downward sloping, constant, hill shaped, U-shaped, etc.).
Interpretation of the Coefficients
The interpretation of the coefficients of the hazard function is as follows. For the
unemployment example, α1 is the approximate proportional change (α1 *100 is the
approximate percentage change) in the hazard rate that results from a $100 increase in
monthly unemployment benefits. The exact proportional change is given by exp(α 1) – 1. α2 is
the approximate proportional change (α2*100 is the approximate percentage change) in the
hazard rate that results from a one year increase in work experience. The exact proportional
change is given by exp(α2) – 1. α3 is the approximate proportional difference (α3*100 is the
approximate percentage difference) in the hazard rate between a married worker and a single
worker. The exact proportional difference is given by exp(α3) – 1. Some statistical programs,
such as Stata, calculate and report the measures exp(α1), exp(α2), exp(α3). These are called
hazard ratios.
Estimation
To obtain estimates of the parameters of the Cox model, α1, α2, α3, the maximum likelihood
estimation procedure is used. We choose as estimates of α1, α2, α3, the values that maximize
the partial likelihood function for the sample of observations. A partial likelihood function,
rather than a full likelihood function, must be maximized because the baseline hazard function,
ln0(t), is unspecified. The partial likelihood function accounts for both uncensored and
censored observations.
Stratification
For the Cox proportional hazard model, the sample can be stratified into different groups of
observations. This allows different groups of observations to have different baseline hazard
functions, even though these baseline hazard functions are not specified. Differences in
baseline hazard functions across groups are captured by a coefficient which is a fixed
parameter.
Time-Varying Explanatory Variables
Time-varying explanatory variables can be included in the Cox proportional hazard model.
Weibull Parametric Duration Model as a Special Case of the Cox Proportional Hazards Model
The parametric Weibull duration model can be viewed as a special case of the semiparametric
Cox proportional hazards model. For the case of three explanatory variables, the Cox model is
given by
lnh(t) = ln0(t) + α1X1 + α2X2 + α3X3
where the baseline hazard function, 0(t), is unspecified. The Cox model reduces to the Weibull
model when the baseline hazard function is specified as 0(t) = exp(-β0)t-1. The relationship
between the parameters of the Cox model and the Weibull model is α1 = – β1, α2 = – β2, and
α3 = – β3, where β0, β1, β2, and β3 denote the coefficients of the explanatory variables for the
Weibull model.
NONPARAMETRIC APPROACH
Introduction
The nonparametric approach makes no assumptions about the probability distribution of the
duration variable, T. Thus, it is a strictly empirical approach to the estimation of survival and
hazard functions. That is, it allows the sample data to determine the shape of the survival
and/or hazard curves. The nonparametric approach is most often used to describe the
behavior of a duration variable and/or analyze experimental data. It is usually not used to
analyze observational data.
Two Nonparametric Estimators
The two most often used nonparametric estimators are:
1. Life-table estimator
2. Kaplan-Meier estimator
Life Table Estimator
The first step in a duration analysis should be to describe the duration variable, T, for the
sample. This is tantamount to reporting descriptive statistics for a duration variable. The best
way to describe a duration variable is to estimate the univariate survival function and hazard
function without making any assumptions about how the duration variable is distributed. To do
this, you can use the life-table estimator. To use the life-table estimator, you group the
duration variable into about 8 to 20 different time intervals. You then use this grouped data to
estimate the survival and hazard functions. To use the life-table estimator, you must have
enough observations so that duration times can be meaningfully grouped into intervals. The
life-table estimator can be used with censored data and makes a correction for censoring.
Kaplan-Meier Estimator
Like the life-table estimator, the Kaplan-Meier estimator can be used to estimate a univariate
survival function without making any assumptions about how the duration variable is
distributed. Also, like the life-table estimator it can be used with censored data and makes a
correction for censoring. However, unlike the life-table estimator it cannot be used to directly
estimate a univariate hazard function. To estimate the univariate survival function, the KaplanMeier estimator uses ordered observations rather than grouped data. Therefore, the estimated
survival function does not depend on the size the time intervals that are chosen, and can be
used to construct survival functions for small samples. The Kaplan-Meier Estimator has two
potential uses.
1. Describe the behavior of a duration variable.
2. Analyze duration data that comes from a controlled experiment.
Descriptive Study of Data
Like the life-table estimator, the Kaplan Meier estimator can be used to describe the behavior
of a censored or uncensored duration variable by constructing a univariate survival function for
the sample.
Analysis of Experimental Data
In a randomized experiment, subjects are randomly assigned to a control group and one or
more treatment groups. Random assignment is used to control for extraneous factors. The
Kaplan-Meier estimator can be used to estimate a different survival function for each group,
and then conduct nonparametric statistical tests, such as the generalized Wilcoxon test or logrank test, to test whether there is a significant difference in the survival functions among two
or more groups.
ADVANTAGES AND DISADVANTAGES OF THE DIFFERENT DURATION MODELS
Parametric Weibull Duration Model
Advantages
1.
2.
3.
4.
You can determine the general shape of the hazard function (monotonic upward
sloping, downward sloping, constant), and therefore draw conclusions about duration
dependence.
It is a flexible functional form with the exponential duration model as a special case.
You can obtain estimates of the magnitude of the effects of the explanatory variables
on the hazard rate.
You can obtain estimates of the magnitude of the effects of the explanatory variables
on median duration.
5.
6.
You can obtain direct estimates of the parameters of the uncensored survival function.
This facilitates the estimation of survival probabilities and making predictions.
You obtain parameter estimates by maximizing the full maximum likelihood function.
This should result in more precise estimates.
Disadvantages
1.
2.
You must make a specific assumption about the probability distribution of the duration
variable. Specifically, you assume that the duration variable has a Weibull distribution.
By making a specific assumption about the probability distribution of the duration
variable you risk imposing an inappropriate structure on the data a priori.
You require the explanatory variables to have a proportional effect on the hazard rate
and median duration. If this assumption isn’t valid then you are imposing an incorrect
structure on the data
Semiparametric Cox Proportional Hazards Model
Advantages
1. You can make minimal assumptions about the probability distribution of the duration
variable and therefore impose minimal structure on the data a priori.
2. You must only make an assumption about the general form of the hazard function (e.g.,
explanatory variables to multiply the hazard function by a scale factor).
3. By ignoring the shape of the hazard function, you can focus on how explanatory
variables proportionately increase or decrease the hazard function.
4. You can obtain estimates of the magnitude of the effects of the explanatory variables on
the hazard rate.
Disadvantages
1.
2.
3.
4.
You cannot determine the shape of the hazard function. Therefore, you can’t obtain
information on whether the hazard function is upward sloping, downward sloping, or
constant. Because of this, you can’t make any conclusions about duration dependence.
You cannot obtain estimates of the magnitude of the effects of the explanatory
variables on median duration.
You do not obtain direct estimates of the parameters of the uncensored survival
function. This makes it difficult to obtain estimates of survival probabilities. To obtain
estimates of survival probabilities, you must use a two-step approach. In step #1, you
estimate the parameters of the hazard function. In step #2, you obtain a separate
estimate of the baseline survival function. You then combine these to get estimates of
survival probabilities.
To obtain parameter estimates you maximize a partial likelihood function rather than a
full likelihood function. This may result in less precise estimates.
5.
You must assume that the explanatory variables have a proportional effect on the
hazard rate. If this assumption isn’t valid then you are imposing an incorrect structure
on the data.
Nonparametric Life-Table Estimator
Advantages
1.
2.
3.
4.
5.
It does not require you to make an assumption about the probability distribution of the
duration variable.
You can estimate both the univariate survival and hazard functions for a duration
variable.
You can determine the shape of the descriptive hazard function.
You can impose minimal structure on the data, a priori, and therefore allow the data to
determine the shape of the survival and hazard functions. This may yield better
estimates of survival probabilities and hazard rates.
It is a useful tool for preliminary descriptive analysis of duration data.
Disadvantages
1.
2.
3.
4.
5.
You cannot estimate a covariate adjusted survival and hazard functions that control for
explanatory variables.
You cannot analyze the effect of explanatory variables on the duration variable.
You cannot conduct statistical tests to determine if different groups of subjects have
different survival and hazard functions.
It is largely a descriptive tool.
It cannot be used with small samples.
Nonparametric Kaplan-Meier Estimator
Advantages
1. It does not require you to make an assumption about the probability distribution of the
duration variable, T.
2. You can estimate the univariate survival function for a duration variable.
3. You can impose minimal structure on the data, a priori.
4. Parametric models, by imposing more structure on the data a priori, may yield distorted
estimates of survival probabilities. It is possible that the Kaplan-Meier estimator will
yield a better representation of the survival function and better estimates of survival
probabilities.
5. Heterogeneity of observations can often times obtained by placing observations into
different groups (stratification) and constructing different survival distributions for the
different groups.
6. You can conduct nonparametric statistical tests to determine if different groups of
subjects have different survival functions and hence survival probabilities.
7. It can be used with small samples.
Disadvantages
1. You cannot directly estimate the hazard function and hazard rates.
2. You cannot determine the shape of the hazard function, and therefore you cannot draw
conclusions about duration dependence.
3. It is much more difficult to analyze the effects of explanatory variables using the
nonparametric Kaplan-Meier approach than for the parametric or semiparametric
approach. The only way to analyze the effect of an explanatory variable is to create a
categorical variable with n-categories, and then estimate separate survival functions for
the n-groups. You cannot analyze the effects of continuous explanatory variables on the
duration variable.
4. You cannot obtain estimates of the magnitudes of explanatory variables on the duration
variable.
5. It is less informative than the parametric Weibull model and semiparametric Cox
proportional hazard model.