
Predicting the Probability of Being a Smoker: A Probit Analysis

Department of Economics
Florida State University
Tallahassee, FL 32306-2180

Abstract

This paper models the probability of being a smoker, based on 23 variables, using a probit model. Specifically, age, gender, marital status, location, race, risky behavior, health insurance coverage, routine medical care and highest degree obtained form the basis of the model. They are hypothesized to be significant factors. Fourteen regressors are individually significant at the 5% level, and all but one of these remain significant at the 1% level. The regressors are jointly significant at both levels. Moreover, the probit marginal effects demonstrate that race and possessing a high school degree can affect the probability of smoking by -10% to 34%.

1 INTRODUCTION

Tobacco use in the United States has been studied intensely because of its perceived benefits to users and its extreme externalities. A problem of interest with tobacco use, primarily smoking, is the ability to predict who is a current smoker. Annually, the United States Department of Health and Human Services (DHHS) conducts the Medical Expenditure Panel Survey (MEPS). MEPS is a comprehensive examination of individual health and medical expenditures. There are approximately 1,099 variables and 33,691 observations within MEPS. Smoking and tobacco use do not comprise a majority of this data set. However, several variables, along with a binary indicator for individuals who are current smokers, should enable an econometrician to explore this aspect of human behavior. By utilizing MEPS, which is compiled by the Agency for Healthcare Research and Quality (AHRQ), and applying limited dependent variable regression analysis, namely a probit model, this analysis aims to predict the probability of a person being a smoker.

1.1 Problem of Interest

The ability to predict the behavior of certain individuals is of paramount importance to statisticians, econometricians and economists.
Smoking is unique in that it is a form of behavior that is extremely restricted. Almost every aspect of smoking is restricted by government, groups and individuals, via the use of cultural “rules.” The ability to predict who is a smoker from a particular set of characteristics would create enormous benefits in the form of reduced transaction costs and greater efficiency. Improvements in the provision of healthcare and insurance could be realized. Healthcare providers would possibly be able to make informed and optimal decisions as opposed to decisions in the face of uncertainty. Another interesting result would be lower transaction costs in the search for personal relationships. Individuals could lower their search costs as well as make informed decisions. Hence, a model that predicts the likelihood that someone is a smoker would be invaluable.

1.2 Application of Limited Dependent Variable Methods

Chester Ittner Bliss (1934), a biologist, first introduced the notion of a probit. Bliss was concerned with the treatment of a particular type of data. Specifically, Bliss (1934) sought to express the percentage of organisms killed by pesticides. Maddala (1983, p. 22) notes that Goldberger (1964) developed the probit analysis model. Theoretically, an indicator variable, $J_i$, is observed instead of $Y_i$, which is an unobserved (latent) dependent variable. Recall that the econometrician faces a classical regression model in which the dependent variable is observed only qualitatively. In the context of classical regression, $Y_i$ is observable in the following model: $Y_i = \alpha_i + X_i\beta + \epsilon_i$. This is not the case with the current problem of interest. Here, $Y_i$ is not observable. Instead, the indicator $J_i$ is observed, where $J_i = 1$ if the individual currently smokes and $J_i = 0$ otherwise. The binary choice model, $Y_i = X_i\beta - \sigma\epsilon_i$, is needed to analyze this problem.
Estimation of the binary choice model requires establishing the relationship between $J_i$ and $X_i$. Recall that $J_i = 1$ if and only if $Y_i > 0$. This implies that $X_i\beta - \sigma\epsilon_i > 0$. Solving for $\epsilon_i$ yields $\epsilon_i < X_i\beta/\sigma$. Therefore,

$$P(J_i = 1) = P\left(\epsilon_i < \frac{X_i\beta}{\sigma}\right) = F\left(\frac{X_i\beta}{\sigma}\right) \tag{1}$$

which implies that

$$P(J_i = 0) = 1 - F\left(\frac{X_i\beta}{\sigma}\right), \tag{2}$$

where $F$ denotes the distribution function of $\epsilon_i$. The indicator, $J_i$, takes the value of 1 or 0. Thus, the density function for $J_i$ is:

$$f(J_i) = F\left(\frac{X_i\beta}{\sigma}\right)^{J_i}\left[1 - F\left(\frac{X_i\beta}{\sigma}\right)\right]^{1-J_i}. \tag{3}$$

The parameters $\beta$ and $\sigma$ are not separately identified; however, $\delta = \sigma^{-1}\beta$ is identified. Using this fact, the log-likelihood function for the binary choice model is:

$$\ln L(\delta) = \sum_{i=1}^{n}\Big\{J_i \ln F(X_i\delta) + (1 - J_i)\ln\big[1 - F(X_i\delta)\big]\Big\}. \tag{4}$$

The dependent variable, $smoke_i$, takes on discrete values; it is an indicator for individuals who currently smoke. One can infer from equations (1) and (2) that a binary choice model allows for a clear statement of the relationship between the observed indicator, $smoke_i$, and the regressors. This does not occur in the context of the classical regression model. Hence, limited dependent variable methods must be used to predict the likelihood of an individual being a smoker.

2 DISCUSSION OF MODEL

A binary choice model, specifically a probit model, is employed to derive the probability that someone smokes.
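
As a minimal sketch of the log-likelihood in equation (4), the Python fragment below takes F to be the standard normal distribution function, as the probit model does. The function name, the toy regressor rows and the two trial parameter vectors are illustrative assumptions; none of them come from the MEPS data or from the estimation reported in this paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1


def probit_log_likelihood(delta, X, J):
    """Sum over i of J_i ln Phi(X_i delta) + (1 - J_i) ln[1 - Phi(X_i delta)]."""
    total = 0.0
    for x_i, j_i in zip(X, J):
        index = sum(d * v for d, v in zip(delta, x_i))  # X_i delta
        p = STD_NORMAL.cdf(index)                       # P(J_i = 1)
        total += j_i * math.log(p) + (1 - j_i) * math.log(1.0 - p)
    return total


# Hypothetical toy data: a constant and one regressor (not the paper's sample).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
J = [0, 1, 1]

ll_zero = probit_log_likelihood([0.0, 0.0], X, J)   # every P(J_i = 1) is 0.5
ll_fit = probit_log_likelihood([-0.5, 1.0], X, J)   # a trial vector that fits better
```

With all coefficients at zero, each observation contributes ln(0.5), so the trial vector that tracks the outcomes yields a larger (less negative) value. Maximum likelihood estimation searches for the delta that makes this value as large as possible.
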
Using the data, a probit model is constructed:

$$
\begin{aligned}
P(\text{smoke}_i = 1) = \Phi\big(&\alpha + \beta_{\text{sex}}\,\text{sex} + \beta_{\text{age}}\,\text{age} + \beta_{\text{race1}}\,\text{race1} + \beta_{\text{race2}}\,\text{race2} + \beta_{\text{race3}}\,\text{race3} \\
&+ \beta_{\text{race4}}\,\text{race4} + \beta_{\text{race5}}\,\text{race5} + \beta_{\text{married}}\,\text{married} + \beta_{\text{ged}}\,\text{ged} + \beta_{\text{hidipl}}\,\text{hidipl} \\
&+ \beta_{\text{bach}}\,\text{bach} + \beta_{\text{mastr}}\,\text{mastr} + \beta_{\text{medcare}}\,\text{medcare} + \beta_{\text{hrwg}}\,\text{hrwg} + \beta_{\text{hourwk}}\,\text{hourwk} \\
&+ \beta_{\text{inscov}}\,\text{inscov} + \beta_{\text{risk1}}\,\text{risk1} + \beta_{\text{risk2}}\,\text{risk2} + \beta_{\text{risk3}}\,\text{risk3} + \beta_{\text{risk4}}\,\text{risk4} \\
&+ \beta_{\text{region1}}\,\text{region1} + \beta_{\text{region2}}\,\text{region2} + \beta_{\text{region3}}\,\text{region3}\big)
\end{aligned} \tag{5}
$$

The following variables, which were extracted from the MEPS panel, are purported to have explanatory power on the decision to smoke: sex (gender), age, race, marital status, education (in terms of highest degree obtained), routine medical care, hourly wage, hours worked per week, health insurance coverage, willingness to take risks and location in the U.S. Descriptive statistics are provided in Table 1.

2.1 Regressors

A discussion of the regressors and their implications for an individual’s choice to smoke is in order. Gender and age are believed to play an ambiguous role, because men and women of all ages smoke. The variables race1, race2, race3, race4 and race5 are dummy variables indicating whether a person is White, Black, American Indian/Alaska Native, Asian or Native Hawaiian/Pacific Islander, respectively. These are intriguing variables in the sense that different races and cultures accept smoking, or at least perceive it, differently. An indicator is included in the model for marital status. Marriage is thought to be an important factor when an individual decides to smoke. Spouses can influence their significant other, especially with respect to decisions regarding health. The variable for highest degree obtained was decomposed into five binary variables. A higher degree should be associated with an individual who is more health conscious. Whether an individual obtains routine medical care and whether he or she currently maintains health insurance coverage are important determinants.
These determinants are represented by the variables medcare and inscov, respectively. An individual’s employment environment, work week schedule and income can create undue stress and frustration. The variables hourwk and hrwg attempt to capture these aspects, or byproducts of employment, and may help explain an individual’s decision to smoke. Within this panel of data, AHRQ includes a variable that describes an individual’s willingness to take risks. If an individual is willing to take risks, then she should be willing, to some degree, to smoke or be open to smoking (this statement is based heavily on the assumption that smoking is a risk). Analogous to the reasoning for binary variables for race, there exist binary variables for the individual’s location within the U.S. A more detailed examination of these variables is conducted in the Data Appendix.

The probit model is a special case where the error terms are independent and identically normally distributed with mean 0 and variance 1, $\epsilon_i \sim \text{iid } N(0, 1)$. Regarding the binary choice model, this assumption about the error terms implies $F(X_i\delta) = \Phi(X_i\delta)$, where $\Phi(\cdot)$ is the standard normal distribution function. The log-likelihood function is now simply:

$$\ln L(\delta) = \sum_{i=1}^{n}\Big\{J_i \ln \Phi(X_i\delta) + (1 - J_i)\ln\big[1 - \Phi(X_i\delta)\big]\Big\} \tag{6}$$

where $J_i$ is the observed indicator $smoke_i$, $X_i$ are the regressors in equation (5) and $\delta$ is the ratio $\sigma^{-1}\beta$.

3 RESULTS

The probit model, equation (5), was estimated. Results from this analysis can be found in Table 2. Coefficients, standard errors, t-statistics and p-values are reported for the twenty-four regressors. The value of the log-likelihood function is -3534.0952. Probit model estimates can be used to test the joint significance of the regression and the individual significance of the estimates.
The following is the null-alternative pair for testing the significance of each coefficient estimate:

$$H_0: \hat\beta_i = 0 \qquad H_A: \hat\beta_i \neq 0 \qquad \text{for } i = \text{sex}, \text{age}, \ldots, \text{region3} \tag{7}$$

The significance of each regressor is tested at $\alpha = 0.05$ and $\alpha = 0.01$. Because the alternative is two-sided, a two-tailed test is used. Based on the number of observations and number of regressors, $n = 7628$ and $k = 24$, the degrees of freedom (d.f.) are $n - k = 7604$. A level of significance of $\alpha = 0.05$ yields a two-tailed critical value of $\kappa = \pm 1.96$. Based on this $\kappa$, 14 of the 24 regressors are individually significant. These are: sex, race1, race2, race4, married, ged, hidipl, medcare, hrwg, hourwk, inscov, region1, region2 and region3. Choosing $\alpha = 0.01$ yields a two-tailed critical value of $\kappa = \pm 2.57$. At this level, all of the variables mentioned are still significant with the exception of race1.

To test the joint significance of the regressors, the log-likelihood ratio is employed. The null-alternative hypothesis pair is:

$$H_0: \hat\beta_{\text{sex}} = \hat\beta_{\text{age}} = \cdots = \hat\beta_{\text{region3}} = 0 \qquad H_A: \text{at least one } \hat\beta_i \neq 0. \tag{8}$$

Essentially, the null hypothesis states that none of the regressors has explanatory power for the variation of the dependent variable, $smoke_i$. The log-likelihood ratio statistic is

$$-2\left[\ln L(\tilde\delta) - \ln L(\hat\delta)\right] \overset{A}{\sim} \chi^2_{k-1}, \tag{9}$$

where $\ln L(\tilde\delta)$ is the value of the constrained likelihood function and $\ln L(\hat\delta)$ is the value of the unconstrained likelihood function, which have respective values of -3817.6241 and -3534.0952. This yields a value of 567.06; hence $\hat t = 567.06 \overset{A}{\sim} \chi^2_{k-1}$, where $k - 1 = 23$. The critical value for $\chi^2_{23}$ at $\alpha = 0.05$ is 35.17 ($\kappa_1$) and at $\alpha = 0.01$ it is 41.64 ($\kappa_2$). Thus, since $\hat t > \kappa_1$ and $\hat t > \kappa_2$, the null hypothesis is rejected. This result implies the regressors are jointly significant at the 5% and 1% levels.

Marginal effects for the probit model were then calculated. Results can be found in Table 3.
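
Returning briefly to the joint test, the arithmetic behind equation (9) can be reproduced from the two log-likelihood values reported in the text. The sketch below is only a check of that arithmetic: the critical values are copied from the text rather than computed, and the variable names are illustrative.

```python
# Log-likelihood values reported in the text.
ll_constrained = -3817.6241    # ln L(delta tilde)
ll_unconstrained = -3534.0952  # ln L(delta hat)

# Likelihood-ratio statistic from equation (9).
lr_stat = -2.0 * (ll_constrained - ll_unconstrained)

# Chi-square critical values with k - 1 = 23 degrees of freedom,
# taken from the text, keyed by significance level.
critical_values = {0.05: 35.17, 0.01: 41.64}

# The null of no joint explanatory power is rejected when the statistic
# exceeds the critical value.
reject_null = {alpha: lr_stat > kappa for alpha, kappa in critical_values.items()}

print(round(lr_stat, 2), reject_null)
```

The statistic rounds to 567.06, far above both critical values, which is why the null hypothesis is rejected at either level.
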
Recall that marginal effects, in the context of the probit model, are a scaled version of the coefficient vector. That is,

$$\frac{\partial P(smoke_i = 1)}{\partial X_i} = \frac{\partial \Phi(X_i\delta)}{\partial X_i^T} = \phi(X_i\delta)\,\delta^T \tag{10}$$

Note that $\phi(X_i\delta)$ varies across observations $i$, so the coefficients are scaled differently for each observation but always proportionately. The marginal effects allow for a more appropriate analysis when determining the specific effect of a one unit change in $X_i$ on the probability that $smoke_i = 1$. Table 3 indicates that race2, race4, ged and hidipl have a -10%, -15%, 31% and 19% effect on $smoke_i$. Specifically, if the value of race4 changes from 0 to 1, then this implies that there is a 15% decrease in the probability that the individual is a smoker. Analogously, if race2 were to change in value from 0 to 1, there would be a 10% decrease in the probability of a person being a smoker. On the other hand, possessing a general equivalence diploma (ged) or a high school diploma (hidipl) causes the probability of a person being a smoker to increase by 31% and 19%, respectively. The remaining regressors have a marginal effect of -3% to 8% on the probability of $smoke_i$ being 1. In terms of demographics, behavior and smoking, these four variables would be of primary interest since they have an effect of 10% or greater on $P(smoke_i = 1)$.

After having compiled all of the results, one should note that the binaries for willingness to take risks were not statistically significant at the 5% or 1% level. Furthermore, the marginal effects for risk1, risk2 and risk3 are -3% while that for risk4 is -1%. This is an interesting result in the sense that one could reasonably assume that an individual’s willingness to take risks would be an integral part of predicting whether they are a smoker. The disparity, though, may lie in the fact that the individual does not perceive smoking as risky behavior.
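
The scaling in equation (10) can be illustrated with a short Python sketch. The coefficient vector and evaluation point below are hypothetical, not the estimates in Table 2, and the function names are assumptions made for the example. The second function shows a discrete 0-to-1 change for a dummy regressor, a common alternative for binary variables that this paper does not report; it is included only for comparison.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1


def linear_index(delta, x):
    """Return X_i delta for one observation."""
    return sum(d * v for d, v in zip(delta, x))


def marginal_effects(delta, x):
    """Equation (10): phi(X_i delta) times each element of delta."""
    scale = STD_NORMAL.pdf(linear_index(delta, x))
    return [scale * d for d in delta]


def discrete_change(delta, x, k):
    """Phi(X_i delta) with x[k] = 1 minus Phi(X_i delta) with x[k] = 0."""
    x_on, x_off = list(x), list(x)
    x_on[k], x_off[k] = 1.0, 0.0
    return (STD_NORMAL.cdf(linear_index(delta, x_on))
            - STD_NORMAL.cdf(linear_index(delta, x_off)))


# Hypothetical example: a constant and one dummy regressor.
delta = [-1.0, 0.5]
x = [1.0, 0.0]

effects = marginal_effects(delta, x)   # scaled copy of delta
jump = discrete_change(delta, x, 1)    # change in probability from 0 to 1
```

Because every coefficient is multiplied by the same density value at a given evaluation point, the ratios between marginal effects equal the ratios between coefficients, which is the sense in which the effects are scaled "proportionately."
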
4 CONCLUSION

Predicting the probability of an individual being a smoker could be an invaluable tool for an econometrician’s, as well as an economist’s and policy analyst’s, toolbox. The purpose of this paper was to utilize a probit model to estimate this probability based on several variables that were thought to explain the decision to smoke. The probit model estimation and probit marginal effects yield interesting results. The regressors were jointly significant at both the 5% and 1% levels. Fourteen of the 24 regressors were individually significant at the 5% level, and all of these except race1 remained significant at the 1% level. One was able to infer from the probit marginal effects that race2, race4, ged and hidipl have the greatest impact on $smoke_i$. Having only a high school education substantially raises the estimated probability of an individual being a smoker. In regard to policy analysis, health professionals seeking to curb smoking rates would then know where to direct their efforts.

Other explanatory variables may provide better results in the sense of predicting the probability of an individual being a smoker. Based on the results contained in this paper, one might be better off constructing a model using variables that describe an individual’s ethnicity and education. However, the model used in this paper may serve as a benchmark for others who wish to pursue an invaluable tool to add to their toolbox.

5 APPENDIX: MEPS DATA

Data for this analysis are taken from the MEPS HC-090: 2005 Full Year Population Characteristics file. According to AHRQ, the data set comprises a nationally representative sample of the civilian non-institutionalized population of the United States. It is compiled annually in rounds and coded by the AHRQ. MEPS consists of two panels of data which were collected in 2005. This model is built using one of these panels, which contains 33,691 observations, or persons, and 1,099 variables.
The following variables were extracted from the data: ADSMOK42, SEX, AGE05X, RACE05X, MARRY31X, HIDEG, ADRTCR42, HRWG31X, HOUR31, INSCOV05, ADRISK42 and REGION05. Missing and inapplicable observations were then dropped from this original sample. The process of elimination yielded a total of 7,628 observations. ADSMOK42 is a binary variable for individuals who currently smoke, which is denoted by $smoke_i$. The variable SEX is a binary for male. INSCOV05 is an indicator for health insurance coverage, both public and private. Important transformations were performed on the already coded variables RACE05X, MARRY31X, HIDEG, ADRTCR42, ADRISK42 and REGION05. It was necessary to decompose these variables into binaries. Specifically, RACE05X was used to create a total of six dummy variables, namely race1, race2, race3, race4, race5 and race6. Within the MEPS panel, MARRY31X is a variable that indicates the marital status of the individual. The categories for this variable are married, widowed, single, divorced, separated, “don’t know,” inapplicable and refused. To construct the variable married, all observations that are married are coded as one and the other categories are assigned zero. Note that the AHRQ uses metropolitan statistical areas (MSAs) from the U.S. Census to classify individuals in regard to their location within the U.S. Similar procedures were performed to construct the remaining variables from HIDEG, ADRTCR42, ADRISK42 and REGION05. In all, 23 variables are constructed using solely the MEPS panel. Descriptive statistics, including the mean and standard deviation, for these variables are contained in Table 1.

6 TABLES

Table 1: Descriptive Statistics

Variable          Mean     Std. Deviation
smoke (d.v.)      0.20     0.40
age               40.80    12.77
sex (d.v.)        0.48     0.50
married (d.v.)    0.58     0.49
race1 (d.v.)      0.77     0.42
race2 (d.v.)      0.16     0.37
race3 (d.v.)      0.01     0.09
race4 (d.v.)      0.05     0.21
race5 (d.v.)      0.01     0.07
region1 (d.v.)    0.16     0.37
region2 (d.v.)    0.23     0.42
region3 (d.v.)    0.38     0.49
ged (d.v.)        0.06     0.23
hidipl (d.v.)     0.63     0.48
bach (d.v.)       0.22     0.41
mastr (d.v.)      0.08     0.28
hourwk            38.94    11.06
hrwg              17.24    10.88
inscov (d.v.)     0.87     0.34
medcare (d.v.)    0.63     0.48
risk1 (d.v.)      0.38     0.49
risk2 (d.v.)      0.24     0.43
risk3 (d.v.)      0.15     0.35
risk4 (d.v.)      0.18     0.38

Note: (d.v.) denotes a dummy variable.

Table 2: Probit Model Estimation

Regressor    Coefficient    Std. Error    t-stat    Prob > |t|
Con          -1.42          0.26          -5.47     0.00
sex           0.14          0.04           3.84     0.00
age           0.002         0.002          1.47     0.14
race1        -0.28          0.13          -2.13     0.03
race2        -0.41          0.14          -3.03     0.00
race3        -0.09          0.21          -0.43     0.66
race4        -0.58          0.16          -3.55     0.00
race5         0.16          0.25           0.62     0.53
married      -0.15          0.04          -4.08     0.00
ged           1.21          0.20           6.17     0.00
hidipl        0.74          0.19           3.94     0.00
bach          0.29          0.19           1.53     0.13
mastr         0.12          0.20           0.61     0.55
medcare      -0.14          0.04          -3.84     0.00
hrwg         -0.008         0.002         -3.51     0.00
hourwk        0.01          0.001          6.10     0.00
inscov       -0.15          0.05          -3.00     0.00
risk1        -0.11          0.08          -1.40     0.16
risk2        -0.11          0.08          -1.30     0.19
risk3        -0.10          0.08          -1.15     0.25
risk4        -0.05          0.08          -0.55     0.58
region1       0.23          0.06           3.95     0.00
region2       0.30          0.05           5.78     0.00
region3       0.16          0.05           3.32     0.00

Table 3: Probit Marginal Effects

Regressor    Marginal    Std. Error    t-stat    Prob > |t|
Con          -0.37       0.07          -5.50     0.00
sex           0.04       0.01           3.84     0.00
age           0.0006     0.0004         1.47     0.14
race1        -0.07       0.03          -2.13     0.03
race2        -0.11       0.04          -3.03     0.00
race3        -0.02       0.06          -0.43     0.66
race4        -0.15       0.04          -3.56     0.00
race5         0.04       0.07           0.62     0.53
married      -0.04       0.01          -4.08     0.00
ged           0.32       0.05           6.20     0.00
hidipl        0.19       0.05           3.96     0.00
bach          0.08       0.05           1.54     0.12
mastr         0.03       0.05           0.61     0.55
medcare      -0.04       0.01          -3.85     0.00
hrwg         -0.002      0.001         -3.52     0.00
hourwk        0.003      0.0004         6.10     0.00
inscov       -0.04       0.01          -2.99     0.00
risk1        -0.03       0.02          -1.40     0.16
risk2        -0.03       0.02          -1.30     0.19
risk3        -0.03       0.02          -1.15     0.25
risk4        -0.01       0.02          -0.55     0.58
region1       0.06       0.01           3.96     0.00
region2       0.08       0.01           5.78     0.00
region3       0.04       0.01           3.32     0.00

REFERENCES

Bliss, C. I. (1934). “The Method of Probits.” Science, 79, 38-39.
Goldberger, A. S. (1964). Econometric Theory. New York: Wiley.
Maddala, G. S. (1983). Limited-Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
