Predicting the Probability of Being a Smoker: A Probit Analysis
Department of Economics
Florida State University
Tallahassee, FL 32306-2180
Abstract
This paper models the probability of being a smoker as a function of 23 variables, using a
probit model. Specifically, age, gender, marital status, location, race, risky
behavior, health insurance coverage, obtaining routine medical care and highest degree
obtained are the basis of the construction of the model. They are hypothesized to be
significant factors. Fourteen variables are individually significant at the 5% level, and
thirteen of these remain significant at the 1% level. The regressors are jointly significant
at both levels. The probit marginal effects indicate that race and possessing a high school
degree can change the probability of smoking by between -10 and 34 percentage points.
1 INTRODUCTION
Tobacco use in the United States is a behavior that has been studied intensely due to its
perceived benefits by users and extreme externalities. A problem of interest with tobacco
use, primarily smoking, is the ability to predict who is a current smoker. Annually, the
United States Department of Health and Human Services (DHHS) conducts the Medical
Expenditure Panel Survey (MEPS). MEPS is a comprehensive examination of individual
health and medical expenditures. There are approximately 1,099 variables within MEPS
and 33,691 observations. Smoking and tobacco use do not comprise a majority of this data
set. However, several variables, along with a binary indicator for individuals who are current
smokers, should enable an econometrician to explore this aspect of human behavior. Using
MEPS, which is compiled by the Agency for Healthcare Research and Quality (AHRQ), and
applying limited dependent variable regression analysis, namely a probit model, this analysis
seeks to predict the probability of a person being a smoker.
1.1 Problem of Interest
The ability to predict the behavior of certain individuals is of paramount importance
to statisticians, econometricians and economists. Smoking is unique in that it is a form of
behavior that is extremely restricted. Almost every aspect of smoking is restricted by government, groups and individuals, via the use of cultural “rules.” Enabling someone with the
power to foretell who is a smoker based on a particular set of characteristics would create
enormous benefits in the form of reduced transaction costs and greater efficiency. Improvements in the provision of healthcare and insurance could be realized. Healthcare providers
would possibly be able to make informed and optimal decisions as opposed to decisions in
the face of uncertainty. Another interesting aspect, or result, would be lower transaction
costs in the search for personal relationships. Individuals could lower their search costs as
well as make informed decisions. Hence, the power of a model that predicts the likelihood
someone is a smoker would be invaluable.
1.2 Application of Limited Dependent Variable Methods
Chester Ittner Bliss (1934), a biologist, first introduced the notion of a probit. Bliss was
concerned with the treatment of a particular type of data. Specifically, Bliss (1934) sought
to express the percentage of organisms killed by pesticides. Maddala (1983, p. 22) notes that
Goldberger (1964) developed the probit analysis model.
Theoretically, a binary indicator, $J_i$, is observed instead of $Y_i$, which is an unobserved
(latent) dependent variable. Recall that an econometrician is faced with a classical regression
model subject to qualitative observation of the dependent variable. In the context of classical
regression, $Y_i$ is observable in the following model: $Y_i = \alpha_i + X_i\beta + \varepsilon_i$. This is not the case
with the current problem of interest. Here, $Y_i$ is not observable. Only the indicator $J_i$ is
observed, where $J_i = 1$ if the individual currently smokes and $J_i = 0$ otherwise. The binary
choice model, $Y_i = X_i\beta - \sigma\varepsilon_i$, is needed to analyze this problem. Estimation of the binary
choice model requires establishing the relationship between $J_i$ and $X_i$. Recall that $J_i = 1$
if and only if $Y_i > 0$. This implies that $X_i\beta - \sigma\varepsilon_i > 0$. Solving for $\varepsilon_i$ yields $\varepsilon_i < X_i\beta/\sigma$.
Therefore,
$$P(J_i = 1) = P\left(\varepsilon_i < \frac{X_i\beta}{\sigma}\right) = F\left(\frac{X_i\beta}{\sigma}\right) \tag{1}$$
which implies that
$$P(J_i = 0) = 1 - F\left(\frac{X_i\beta}{\sigma}\right). \tag{2}$$
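The observed indicator, $J_i$, takes the value of 1 or 0. Thus, the density function for $J_i$ is:
$$f(J_i) = \left[F\left(\frac{X_i\beta}{\sigma}\right)\right]^{J_i}\left[1 - F\left(\frac{X_i\beta}{\sigma}\right)\right]^{1-J_i}. \tag{3}$$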
The parameters $\beta$ and $\sigma$ are not separately identified; however, $\delta = \sigma^{-1}\beta$ is identified. Using this fact,
the log-likelihood function for the binary choice model is:
$$\ln L(\delta) = \sum_{i=1}^{n} \Big\{ J_i \ln F(X_i\delta) + (1 - J_i) \ln\big[1 - F(X_i\delta)\big] \Big\}. \tag{4}$$
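The identification point can be checked numerically. The sketch below is only an illustration: it assumes the standard normal distribution function as a stand-in for $F$, and the regressor row, `beta` and `sigma` are hypothetical values rather than estimates from this paper. Scaling $\beta$ and $\sigma$ by the same constant leaves $P(J_i = 1)$ unchanged, which is why only the ratio $\delta$ can be recovered from the data.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, assumed here as one possible choice of F."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_smoker(x, beta, sigma):
    """P(J_i = 1) = F(X_i beta / sigma), following equation (1)."""
    index = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    return std_normal_cdf(index / sigma)


# Hypothetical regressor row: constant, one dummy, one continuous variable.
x = [1.0, 1.0, 40.0]
beta = [-1.0, 0.3, 0.01]  # illustrative values only
sigma = 2.0

p_original = prob_smoker(x, beta, sigma)

# Multiply beta and sigma by the same constant; delta = beta / sigma is unchanged.
c = 5.0
p_rescaled = prob_smoker(x, [c * b for b in beta], c * sigma)

print(p_original, p_rescaled)
```

Both calls return the same probability up to floating-point error, so the likelihood cannot distinguish $(\beta, \sigma)$ from $(c\beta, c\sigma)$.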
The dependent variable, $smoke_i$, takes on discrete values; it is an indicator for individuals
who currently smoke. One can infer from equations (1) and (2) that a binary choice model
allows for a clear statement of the relationship between the observed indicator, $smoke_i$, and
the regressors. This does not occur in the context of the classical regression model. Hence,
limited dependent variable methods must be used to predict the likelihood of an individual
being a smoker.
2 DISCUSSION OF MODEL
A binary choice model, specifically a probit model, is to be employed to derive the
probability that someone smokes. Using the data, a probit model is constructed:
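$$
\begin{aligned}
P(\mathit{smoke}_i = 1) = \Phi\big(&\alpha + \beta_{sex}\,\mathit{sex} + \beta_{age}\,\mathit{age} + \beta_{race1}\,\mathit{race1} + \beta_{race2}\,\mathit{race2} + \beta_{race3}\,\mathit{race3} \\
&+ \beta_{race4}\,\mathit{race4} + \beta_{race5}\,\mathit{race5} + \beta_{married}\,\mathit{married} + \beta_{ged}\,\mathit{ged} + \beta_{hidipl}\,\mathit{hidipl} \\
&+ \beta_{bach}\,\mathit{bach} + \beta_{mastr}\,\mathit{mastr} + \beta_{medcare}\,\mathit{medcare} + \beta_{hrwg}\,\mathit{hrwg} + \beta_{hourwk}\,\mathit{hourwk} \\
&+ \beta_{inscov}\,\mathit{inscov} + \beta_{risk1}\,\mathit{risk1} + \beta_{risk2}\,\mathit{risk2} + \beta_{risk3}\,\mathit{risk3} + \beta_{risk4}\,\mathit{risk4} \\
&+ \beta_{region1}\,\mathit{region1} + \beta_{region2}\,\mathit{region2} + \beta_{region3}\,\mathit{region3}\big)
\end{aligned}
\tag{5}
$$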
The following variables, which were extracted from the MEPS panel, are purported to have
explanatory power on the decision to smoke: sex (gender), age, race, marital status, education (in terms of highest degree obtained), routine medical care, hourly wage, hours
worked per week, health insurance coverage, willingness to take risks and location in the
U.S. Descriptive statistics are provided in Table 1.
2.1 Regressors
A discussion of the regressors and their implication in an individual’s choice to smoke is in
order. Gender and age are believed to play an ambiguous role. This is due to the fact that
men and women of all ages smoke. $race_1$, $race_2$, $race_3$, $race_4$ and $race_5$ are dummy variables
that were used to indicate if persons are White, Black, American Indian/Alaska Native, Asian
or Native Hawaiian/Pacific Islander, respectively. These are intriguing variables in the sense
that different races and cultures accept smoking, or at least perceive it, differently.
An indicator is included in the model for marital status. Marriage is thought to be an
important factor when an individual decides to smoke. Spouses can influence their significant other, especially with respect to decisions regarding health. The variable for highest
degree obtained was decomposed into five binary variables. A higher degree should be associated with an individual who is more health conscious. Whether an individual obtains
routine medical care and currently maintains health insurance coverage or not are important
determinants. These determinants are represented by the variables medcare and inscov,
respectively. An individual’s employment environment, work week schedule and income can
obviously create undue stress and frustration. Hourwk and hrwg are variables that attempt to capture these aspects, or byproducts of employment, and hopefully will explain an
individual’s decision to smoke.
Within this panel of data, AHRQ includes a variable that describes an individual’s willingness to take risks. If an individual is willing to take risks, then she should be willing, to some
degree, to smoke or be open to smoking (This statement is based heavily on the assumption
that smoking is a risk). Analogous to the reasoning for binary variables for race, there exist
binary variables for the individual’s location within the U.S. A more detailed examination
of these variables is conducted in the Data Appendix.
The probit model is a special case where the error terms are independent and identically
distributed with mean 0 and variance 1, $\varepsilon_i \sim \text{iid } N(0, 1)$. Regarding the binary choice model,
this assumption about the error terms implies $F(X_i\delta) = \Phi(X_i\delta)$, where $\Phi(\cdot)$ is the standard
normal distribution function. The log-likelihood function is now simply:
$$\ln L(\delta) = \sum_{i=1}^{n} \Big\{ J_i \ln \Phi(X_i\delta) + (1 - J_i) \ln\big[1 - \Phi(X_i\delta)\big] \Big\} \tag{6}$$
where $J_i$ is the observed indicator $smoke_i$, $X_i$ are the regressors in equation (5) and $\delta$ is the
ratio $\sigma^{-1}\beta$.
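The log-likelihood in equation (6) can be written out directly. The following is a minimal sketch using only the Python standard library; the four-observation data set and both candidate values of $\delta$ are hypothetical and serve only to show how the function is evaluated. It is not the estimation routine used for this paper, which is not described in the text.

```python
import math


def std_normal_cdf(z):
    """Phi(z), the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_loglik(delta, X, J):
    """Sum over i of J_i ln Phi(X_i delta) + (1 - J_i) ln[1 - Phi(X_i delta)]."""
    total = 0.0
    for x_i, j_i in zip(X, J):
        index = sum(a * b for a, b in zip(x_i, delta))
        p = std_normal_cdf(index)
        # For extreme indices p can round to 0 or 1; a production routine
        # would guard the logarithms, which this sketch does not do.
        total += j_i * math.log(p) + (1 - j_i) * math.log(1.0 - p)
    return total


# Hypothetical toy data: a constant and one dummy regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
J = [0, 1, 0, 0]

ll_zero = probit_loglik([0.0, 0.0], X, J)   # every probability equals 0.5
ll_alt = probit_loglik([-1.0, 1.0], X, J)   # a candidate that fits better

print(ll_zero, ll_alt)
```

Maximum likelihood estimation searches for the $\delta$ that makes this value as large as possible; here the second candidate yields a higher log-likelihood than the all-zero vector.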
3 RESULTS
The probit model, equation (5), was estimated. Results from this analysis can be found
in Table 2. Coefficients, standard errors, t-statistics and p-values are reported for the twenty-four regressors. The value of the log-likelihood function is -3534.0952.
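To illustrate how the estimates translate into a predicted probability, the sketch below plugs the rounded coefficients from Table 2 into equation (5). The individual described in `person` is hypothetical, and because the coefficients are rounded to the precision of the table, the result is only approximate.

```python
import math


def std_normal_cdf(z):
    """Phi(z), the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Rounded coefficient estimates copied from Table 2.
coef = {
    "con": -1.42, "sex": 0.14, "age": 0.002,
    "race1": -0.28, "race2": -0.41, "race3": -0.09, "race4": -0.58, "race5": 0.16,
    "married": -0.15, "ged": 1.21, "hidipl": 0.74, "bach": 0.29, "mastr": 0.12,
    "medcare": -0.14, "hrwg": -0.008, "hourwk": 0.01, "inscov": -0.15,
    "risk1": -0.11, "risk2": -0.11, "risk3": -0.10, "risk4": -0.05,
    "region1": 0.23, "region2": 0.30, "region3": 0.16,
}

# Hypothetical individual: a 40-year-old married White male with a high school
# diploma, routine medical care, insurance, a $17 hourly wage and a 40-hour week,
# living in region 3. Regressors not listed are taken to be zero.
person = {
    "con": 1, "sex": 1, "age": 40, "race1": 1, "married": 1, "hidipl": 1,
    "medcare": 1, "hrwg": 17.0, "hourwk": 40, "inscov": 1, "region3": 1,
}

index = sum(coef[name] * value for name, value in person.items())
prob = std_normal_cdf(index)

print(round(index, 3), round(prob, 3))
```

Changing a single entry in `person` and re-running the calculation shows how the predicted probability responds to that characteristic, holding the others fixed.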
Probit model estimates can be used to test the joint significance of the regression and the
individual significance of the estimates. The following is the null-alternative pair for testing
the significance of each coefficient estimate:
$$H_0: \hat{\beta}_i = 0 \qquad H_A: \hat{\beta}_i \neq 0 \quad \text{for } i = \mathit{sex}, \mathit{age}, \cdots, \mathit{region3} \tag{7}$$
The significance of each regressor is tested at α = 0.05 and α = 0.01. Because the alternative
hypothesis is two-sided, a two-tailed test is used. Based on the number of observations and number
of regressors, n = 7628 and k = 24, the degrees of freedom (d.f.) are n − k = 7604. A level of
significance, α = 0.05, yields a two-tailed critical value of κ = ±1.96. Based on this κ, 14 of
the 24 regressors are individually significant. These are: sex, $race_1$, $race_2$, $race_4$, married,
ged, hidipl, medcare, hrwg, hourwk, inscov, region1, region2 and region3. Choosing
α = 0.01 yields a two-tailed critical value of κ = ±2.57. At this level, all of the variables
mentioned are still significant with the exception of $race_1$.
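The counts above can be reproduced from the t-statistics reported in Table 2. The sketch below simply compares each reported statistic (excluding the constant) with the two critical values quoted in the text; the helper name `significant` is introduced here for illustration.

```python
# t-statistics for the 23 non-constant regressors, copied from Table 2.
t_stats = {
    "sex": 3.84, "age": 1.47, "race1": -2.13, "race2": -3.03, "race3": -0.43,
    "race4": -3.55, "race5": 0.62, "married": -4.08, "ged": 6.17, "hidipl": 3.94,
    "bach": 1.53, "mastr": 0.61, "medcare": -3.84, "hrwg": -3.51, "hourwk": 6.10,
    "inscov": -3.00, "risk1": -1.40, "risk2": -1.30, "risk3": -1.15, "risk4": -0.55,
    "region1": 3.95, "region2": 5.78, "region3": 3.32,
}


def significant(stats, critical):
    """Names of regressors whose |t| exceeds the two-tailed critical value."""
    return [name for name, t in stats.items() if abs(t) > critical]


sig_5 = significant(t_stats, 1.96)  # alpha = 0.05
sig_1 = significant(t_stats, 2.57)  # alpha = 0.01

print(len(sig_5), sig_5)
print(len(sig_1), sig_1)
print(sorted(set(sig_5) - set(sig_1)))
```

The last line lists the regressors that are significant at the 5% level but not at the 1% level.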
To test the joint significance of the regressors, the log-likelihood ratio is employed. The
null-alternative hypothesis pair is:
$$H_0: \hat{\beta}_{sex} = \hat{\beta}_{age} = \cdots = \hat{\beta}_{region3} = 0 \qquad H_A: \text{at least one } \hat{\beta}_i \neq 0. \tag{8}$$
Essentially, the null hypothesis states that all of the regressors have no explanatory
power in the variation of the dependent variable, $smoke_i$. Using the log-likelihood ratio,
$$-2\left[\ln L(\tilde{\delta}) - \ln L(\hat{\delta})\right] \overset{A}{\sim} \chi^2_{k-1}, \tag{9}$$
the following results. Note that $\ln L(\tilde{\delta})$ is the value of the constrained log-likelihood function and
$\ln L(\hat{\delta})$ is the value of the unconstrained log-likelihood function, which have respective values of
-3817.6241 and -3534.0952. This yields a value of 567.06; hence $\hat{t} = 567.06 \overset{A}{\sim} \chi^2_{k-1}$, where
k − 1 = 23. The critical value for $\chi^2_{23}$ at α = 0.05 is 35.17 ($\kappa_1$) and at α = 0.01, it is 41.64
($\kappa_2$). Thus, since $\hat{t} > \kappa_1$ and $\hat{t} > \kappa_2$, the null hypothesis is rejected. This result implies the
regressors are jointly significant at the 5% and 1% level.
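The arithmetic behind the likelihood ratio test is short enough to verify directly. The sketch below uses the two log-likelihood values and the two critical values quoted in the text; nothing else is assumed.

```python
# Log-likelihood values reported in the text.
ll_constrained = -3817.6241    # constant-only model
ll_unconstrained = -3534.0952  # full model, equation (5)

# Likelihood ratio statistic from equation (9).
lr_stat = -2.0 * (ll_constrained - ll_unconstrained)

# Chi-square critical values with 23 degrees of freedom, as quoted in the text.
crit_5 = 35.17
crit_1 = 41.64

reject_at_5 = lr_stat > crit_5
reject_at_1 = lr_stat > crit_1

print(round(lr_stat, 2), reject_at_5, reject_at_1)
```

The statistic exceeds both critical values by a wide margin, so the conclusion does not depend on the choice between the 5% and 1% level.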
Marginal effects for the probit model were then calculated. Results can be found in
Table 3. Recall that marginal effects, in the context of the probit model, are the vector of
standardized coefficients. That is,
$$\frac{\partial P(\mathit{smoke}_i = 1)}{\partial X_i^T} = \frac{\partial \Phi(X_i\delta)}{\partial X_i^T} = \phi(X_i\delta)\,\delta \tag{10}$$
Because $\phi(X_i\delta)$ depends on $X_i$, the scaling factor varies across observations $i$. The coefficients
are therefore scaled differently for each observation, but still proportionately. The marginal effects
allow for a more appropriate analysis when determining the specific effect of a one-unit change in $X_i$
on the probability that $smoke_i = 1$.
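Equation (10) can be sketched in a few lines. The regressor row and $\delta$ below are hypothetical, and the point at which the paper's marginal effects in Table 3 were evaluated is not stated, so this is an illustration of the formula rather than a reproduction of the table. For a binary regressor, a discrete change in the predicted probability is a common alternative to the derivative; it is shown alongside for comparison and is not claimed to be the method used here.

```python
import math


def std_normal_pdf(z):
    """phi(z), the standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Phi(z), the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def marginal_effects(x, delta):
    """phi(X_i delta) * delta, the derivative in equation (10)."""
    index = sum(a * b for a, b in zip(x, delta))
    return [std_normal_pdf(index) * d for d in delta]


def discrete_effect(x, delta, k):
    """Change in Phi(X_i delta) when regressor k switches from 0 to 1."""
    x_off = list(x)
    x_on = list(x)
    x_off[k], x_on[k] = 0.0, 1.0
    p_off = std_normal_cdf(sum(a * b for a, b in zip(x_off, delta)))
    p_on = std_normal_cdf(sum(a * b for a, b in zip(x_on, delta)))
    return p_on - p_off


# Hypothetical values: a constant and one dummy regressor.
x = [1.0, 0.0]
delta = [0.0, 0.5]

me = marginal_effects(x, delta)
de = discrete_effect(x, delta, 1)

print(me, de)
```

Every element of `me` shares the same scaling factor $\phi(X_i\delta)$, which is why the marginal effects remain proportional to the coefficients for a given observation.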
Table 3 indicates that $race_2$, $race_4$, ged and hidipl have effects of -10, -15, 31 and 19
percentage points, respectively, on the probability that $smoke_i = 1$. Specifically, if the value of
$race_4$ changes from 0 to 1, then this implies a 15 percentage point decrease in the probability that
the individual is a smoker. Analogously, if $race_2$ were to change in value from 0 to 1, there would
be a 10 percentage point decrease in the probability of a person being a smoker. On the other hand,
possessing a general equivalence diploma (ged) or a high school diploma (hidipl) causes the
probability of a person being a smoker to increase by 31 and 19 percentage points, respectively.
The remaining regressors have a marginal effect of -3 to 8 percentage points on the probability of
$smoke_i$ being 1. In terms of demographics, behavior and smoking, the four variables named above
would be of primary interest since they have an effect of 10 percentage points or greater on
$P(smoke_i = 1)$.
After having compiled all of the results, one should note that the binaries for willingness
to take risks were not statistically significant at the 5% or 1% level. Furthermore, the
marginal effects for $risk_1$, $risk_2$ and $risk_3$ are -3% while $risk_4$ is -1%. This is an interesting
result in the sense that one could reasonably assume that an individual’s willingness to take
risks would be an integral part of predicting if they are a smoker. The disparity, though,
may lie in the fact that the individual does not perceive smoking as risky behavior.
4 CONCLUSION
Predicting the probability of an individual being a smoker could be an invaluable addition to
the toolbox of an econometrician, economist or policy analyst. The purpose
of this paper was to utilize a probit model to estimate or predict this probability based on
several variables that were thought to explain the decision to smoke.
The probit model estimation and probit marginal effects yield interesting results. The regressors
in this model were jointly significant at both the 5% and 1% levels. Individually, 14 regressors
were significant at the 5% level, and all of these except $race_1$ remained significant at the 1% level.
One was able to infer from the probit marginal effects that $race_2$, $race_4$, ged and hidipl
have the greatest impact on $smoke_i$. Having only a high school education substantially
raises the probability of an individual being a smoker. In regard to policy analysis, health
professionals seeking to curb smoking rates would then know where to direct their efforts.
Other explanatory variables may provide better results in the sense of predicting the
probability of an individual being a smoker. Based on the results contained in this paper,
one might be better off constructing a model using variables that describe an individual’s
ethnicity and education. However, the model used in this paper may serve as a benchmark
for others who wish to pursue an invaluable tool to add to their toolbox.
5 APPENDIX: MEPS DATA
Data for this analysis are taken from the MEPS HC-090: 2005 Full Year Population Characteristics. According to AHRQ, the data set is comprised of a nationally
representative sample of the civilian non-institutionalized population of the United States.
It is compiled annually in rounds by the AHRQ. The data is coded by the AHRQ.
MEPS consists of two panels of data which were collected in 2005. This model is built
using one of these panels that contains 33,691 observations, or persons, and 1,099 variables.
The following variables were extracted from the data: ADSMOK42, SEX, AGE05X,
RACE05X, MARRY31X, HIDEG, ADRTCR42, HRWG31X, HOUR31, INSCOV05,
ADRISK42 and REGION05. Missing and inapplicable observations were then dropped
from this original sample. The process of elimination yielded a total of 7,628 observations.
ADSMOK42 is a binary variable for individuals who currently smoke, which is denoted
by $smoke_i$. The variable SEX is a binary for male. INSCOV05 is an indicator for health insurance coverage, both public and private. Important transformations were performed on the
already coded variables, RACE05X, MARRY31X, HIDEG, ADRTCR42, ADRISK42
and REGION05. It was necessary to decompose these variables into binaries. Specifically,
RACE05X was used to create a total of six dummy variables, namely $race_1$, $race_2$, $race_3$,
$race_4$, $race_5$ and $race_6$. Within the MEPS panel, MARRY31X is a variable that indicates
the marital status of the individual. The categories for this variable are married, widowed,
single, divorced, separated, “don’t know,” inapplicable and refused. To construct the variable, married, all observations that are married are coded as one and the other categories
are assigned zero. Note that the AHRQ uses metropolitan statistical areas (MSAs) from the
U.S. Census to classify individuals in regard to their location within the U.S. Similar procedures were performed to construct the remaining variables from HIDEG, ADRTCR42,
ADRISK42 and REGION05.
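The decomposition described above can be sketched as follows. The records, field names and category labels are hypothetical placeholders; MEPS stores these fields as numeric codes, and the actual code values are not reproduced here. Treating one region as the omitted reference category is likewise an assumption made for the illustration.

```python
# Hypothetical raw records standing in for the coded MEPS fields.
raw = [
    {"marital": "married", "region": "south"},
    {"marital": "divorced", "region": "west"},
    {"marital": None, "region": "midwest"},  # missing value, to be dropped
    {"marital": "single", "region": "northeast"},
]


def drop_missing(records):
    """Keep only records in which every field is present."""
    return [r for r in records if all(v is not None for v in r.values())]


def indicator(records, field, category):
    """1 if the record falls in the given category, 0 otherwise."""
    return [1 if r[field] == category else 0 for r in records]


sample = drop_missing(raw)

# married = 1 for married observations, 0 for every other category.
married = indicator(sample, "marital", "married")

# One dummy per region, leaving one region out as the reference category.
region_dummies = {
    name: indicator(sample, "region", name)
    for name in ["northeast", "midwest", "south"]
}

print(len(sample), married, region_dummies)
```

The same `indicator` helper applies to any of the coded variables, one call per category that is to receive its own binary.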
In all, 23 variables are constructed using solely the MEPS panel. Descriptive statistics,
including the mean and standard deviation, for these variables are contained in Table 1.
6 TABLES
Table 1: Descriptive Statistics
Variable          Mean     Std. Deviation
smoke (d.v.)       0.20     0.40
age               40.80    12.77
sex (d.v.)         0.48     0.50
married (d.v.)     0.58     0.49
race 1 (d.v.)      0.77     0.42
race 2 (d.v.)      0.16     0.37
race 3 (d.v.)      0.01     0.09
race 4 (d.v.)      0.05     0.21
race 5 (d.v.)      0.01     0.07
region1 (d.v.)     0.16     0.37
region2 (d.v.)     0.23     0.42
region3 (d.v.)     0.38     0.49
ged (d.v.)         0.06     0.23
hidipl (d.v.)      0.63     0.48
bach (d.v.)        0.22     0.41
mastr (d.v.)       0.08     0.28
hourwk            38.94    11.06
hrwg              17.24    10.88
inscov (d.v.)      0.87     0.34
medcare (d.v.)     0.63     0.48
risk 1 (d.v.)      0.38     0.49
risk 2 (d.v.)      0.24     0.43
risk 3 (d.v.)      0.15     0.35
risk 4 (d.v.)      0.18     0.38
Table 2: Probit Model Estimation
Regressor    Coefficient    Std. Error    t-stat    Prob > |t|
Con            -1.42          0.26         -5.47      0.00
sex             0.14          0.04          3.84      0.00
age             0.002         0.002         1.47      0.14
race 1         -0.28          0.13         -2.13      0.03
race 2         -0.41          0.14         -3.03      0.00
race 3         -0.09          0.21         -0.43      0.66
race 4         -0.58          0.16         -3.55      0.00
race 5          0.16          0.25          0.62      0.53
married        -0.15          0.04         -4.08      0.00
ged             1.21          0.20          6.17      0.00
hidipl          0.74          0.19          3.94      0.00
bach            0.29          0.19          1.53      0.13
mastr           0.12          0.20          0.61      0.55
medcare        -0.14          0.04         -3.84      0.00
hrwg           -0.008         0.002        -3.51      0.00
hourwk          0.01          0.001         6.10      0.00
inscov         -0.15          0.05         -3.00      0.00
risk 1         -0.11          0.08         -1.40      0.16
risk 2         -0.11          0.08         -1.30      0.19
risk 3         -0.10          0.08         -1.15      0.25
risk 4         -0.05          0.08         -0.55      0.58
region1         0.23          0.06          3.95      0.00
region2         0.30          0.05          5.78      0.00
region3         0.16          0.05          3.32      0.00
Table 3: Probit Marginal Effects
Regressor    Marginal    Std. Error    t-stat    Prob > |t|
Con           -0.37        0.07         -5.50      0.00
sex            0.04        0.01          3.84      0.00
age            0.0006      0.0004        1.47      0.14
race 1        -0.07        0.03         -2.13      0.03
race 2        -0.11        0.04         -3.03      0.00
race 3        -0.02        0.06         -0.43      0.66
race 4        -0.15        0.04         -3.56      0.00
race 5         0.04        0.07          0.62      0.53
married       -0.04        0.01         -4.08      0.00
ged            0.32        0.05          6.20      0.00
hidipl         0.19        0.05          3.96      0.00
bach           0.08        0.05          1.54      0.12
mastr          0.03        0.05          0.61      0.55
medcare       -0.04        0.01         -3.85      0.00
hrwg          -0.002       0.001        -3.52      0.00
hourwk         0.003       0.0004        6.10      0.00
inscov        -0.04        0.01         -2.99      0.00
risk 1        -0.03        0.02         -1.40      0.16
risk 2        -0.03        0.02         -1.30      0.19
risk 3        -0.03        0.02         -1.15      0.25
risk 4        -0.01        0.02         -0.55      0.58
region1        0.06        0.01          3.96      0.00
region2        0.08        0.01          5.78      0.00
region3        0.04        0.01          3.32      0.00
REFERENCES
Bliss, C. I. (1934). “The Method of Probits.” Science, 79, 38-39.
Goldberger, A. S. (1964). Econometric Theory. New York: Wiley.
Maddala, G. S. (1983). Limited-Dependent and Qualitative Variables in Econometrics. New
York: Cambridge University Press.