Smoking, Genes, and Health: Evidence from the Health and
Retirement Study∗
Daniel Benjamin, Andrew Caplin†, David Cesarini, Kevin Thom, and Patrick Turley
September 20 2015
PRELIMINARY
Abstract
Genetic understanding of smoking is advancing rapidly. Three recent studies discovered
variants in nicotinic receptor genes that impact measured smoking behavior. We document
associations between these variants and multiple smoking and health outcomes available in the
Health and Retirement Study (HRS). Although these variants are associated with relatively
modest differences in measures of past smoking intensity, we find substantial effects on later-life
health and mortality outcomes. To understand this set of reduced-form patterns, we develop
and estimate a dynamic model of smoking, health, and mortality that explicitly incorporates
genetic heterogeneity. Structural estimates will allow us to understand the mechanisms by
which these genes operate (preferences vs. addiction dynamics), shedding light on how policies
differentially affect individuals by genotype. The estimated model will permit counterfactual
simulations assessing the consequences of genetic testing interventions that provide individuals
with more information about their predisposition for addiction.
∗ We thank Laura Beirut and Li-Shiun Chen for their invaluable contributions. We are grateful
for helpful comments and feedback from Jeff Smith, Chris Taber, and all of the participants at the
2015 IRP Summer Research Workshop at the University of Wisconsin.
† Center for Experimental Social Science and Department of Economics, New York University
and National Bureau of Economic Research.
1 Introduction
Smoking behaviors are a central focus of biological and epidemiological research due to their
profound health consequences. As a result, our understanding of the neurophysiological effects
of smoking has grown rapidly. Specific nicotinic and dopaminergic channels have been identified
that may contribute to the addictive nature of smoking behaviors. The importance of these channels has been confirmed by three recent studies that discovered genetic variants (single nucleotide
polymorphisms, or SNPs) in nicotinic receptor genes that impact measured smoking behavior.1
While the genetics literature has produced a set of credible gene-smoking associations, little is
known about how the biological processes affected by these genes map into the behavioral mechanisms that determine an individual’s incentives to smoke, increase or decrease smoking intensity,
and eventually quit. Following the seminal work of Becker and Murphy (1988) on the consumption
of addictive goods, the economic literature on smoking has produced a set of dynamic models that
clearly delineate such mechanisms and richly capture the life-cycle of smoking, cessation, and later
life health outcomes (e.g. Arcidiacono et al. 2007, Darden 2014). Two of the most important
behavioral mechanisms in these models are baseline preferences for nicotine, and the addiction
process that makes it difficult for individuals to reduce their level of cigarette consumption.
We develop a dynamic life-cycle model of smoking and health that explicitly incorporates genetic heterogeneity. The model allows genes to impact smoking through both preferences and the
addictive process (specifically the cost of reduction relative to past consumption). While individuals
are assumed to be fully aware of their preferences for nicotine, they are uncertain about their
future reduction costs. Identifying the mechanisms through which these genes operate is important for understanding how the effects of policies might systematically differ across individuals of
different genotypes. Furthermore, the model allows us to simulate the effects of a unique counterfactual policy. If knowledge about one’s own genotype reduces uncertainty about the future cost of
quitting, then early genetic screening could emerge as a policy tool for altering smoking behavior.
The estimated model will allow us to evaluate the consequences of such an intervention.
1 The three studies were related and published in the same issue of Nature Genetics (The Tobacco
and Genetics Consortium 2010, Liu, Tozzi, Waterworth, Pillai, Muglia, Middleton, Berrettini,
Knouff, Yuan, Waeber, et al. 2010, Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller,
Sulem, Rafnar, Esko, Walter, et al. 2010).
We estimate the model using a recently genotyped subsample of the Health and Retirement
Study (HRS). We focus on four SNPs identified by the existing genetic literature. The panel
structure of the HRS enables us to evaluate how these SNPs impact smoking behaviors and smoking-related health outcomes. A key finding is that certain SNPs affect smoking-related diseases and mortality risk on a substantially larger scale than they affect
measured smoking. Indeed, while smoking-related SNPs explain a large amount of variation relative
to SNPs found for other behavioral traits (Rietveld, Conley, Eriksson, Esko, Medland, Vinkhuyzen,
Yang, Boardman, Chabris, Dawes, et al. 2014), their joint explanatory power remains modest.2
Yet the effects on health are of an altogether different scale. For example, two of our four SNPs
are associated with 30% greater risk of chronic obstructive pulmonary disease (COPD). Amongst
non-smokers, we find no evidence that the SNPs increase disease or mortality risk, suggesting that
the SNPs are influencing health through their impact on smoking behavior rather than other causal
channels.
Our paper illustrates interdisciplinary gains from trade. Panel data sets of the form that are
used to fit structural models, such as the HRS, involve samples that are too small for de novo
gene discovery. That is why our research builds upon successful discovery in the literature on
genetic epidemiology. Their discoveries have generally stood the test of time, being based on far
larger samples albeit at the expense of crude behavioral measurement: see section 2. We leverage
these findings in the HRS and use them to identify links between health beliefs, smoking behavior,
and health outcomes in a formal structural model. Inclusion of biological factors in structural
economic models will become increasingly important as genetic discovery advances. Our work
suggests that behavioral understanding and genetically-informed policy making rest on further
such interdisciplinary work.
The paper is structured as follows. Section 2 provides a brief review of recent findings on genes,
smoking, and health, and details our procedure for selecting the SNPs we study. In section 3 we
introduce the data and present our basic findings on how the SNPs impact smoking and health. In
section 4 we introduce the structural model.
2 Small effect sizes have caused some researchers to argue that indexes of SNPs will be necessary for
“genoeconomic” research to be of practical import.
2 Background to Study
In this section, we detail our procedure for selecting a set of “smoking-associated SNPs” on
which the ensuing analyses rest. We first provide the briefest of primers on DNA, the biology
of smoking, and genetic association studies of smoking behavior. We then describe the precise
method we used to identify the SNPs for which the evidence of a relationship to smoking behavior
is particularly strong. We also summarize what is known about the biological function of these
smoking-associated SNPs.
2.1 DNA
Human DNA is composed of a sequence of about 3 billion pairs of nucleotide molecules (spread
across 23 chromosomes), each of which can be indexed by its location in the sequence. The sequence
is comprised of about 25,000 subsequences called “genes,” which code for proteins that have specific
functions in the human body, and regions in between genes, which help to regulate when certain
genes are transcribed into proteins.
At the overwhelming majority of DNA locations, there is virtually no variation in the nucleotides
across individuals. The segments of DNA where individuals do differ are called “polymorphisms.”
The most common polymorphisms are called single-nucleotide polymorphisms (SNPs). SNPs are
locations in the DNA sequence where individuals differ from one another in terms of a single
nucleotide. At the vast majority of SNP locations, there are only two possible nucleotides that
occur. Each type of nucleotide is referred to as an allele, and an individual inherits one allele from
each biological parent. A person’s genotype for a particular SNP is then defined by designating
one of these two alleles as the “reference allele” and counting the number of reference alleles (0,1 or
2) the person is endowed with. A typical gene contains hundreds or thousands of SNPs, and there
are also many SNPs in the intergenic regions.
Because entire segments of DNA are transmitted from parent to child, SNPs (more precisely,
SNP genotypes) tend to be highly correlated with other SNPs in the same region of the genome.
Such correlated SNPs are said to be in “linkage disequilibrium.”
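
As a rough illustration of the genotype coding and of how correlation between nearby SNPs can be measured, the sketch below counts reference alleles and computes the squared correlation between two genotype vectors. The allele pairs, the choice of reference alleles, and the function names are hypothetical and chosen only for illustration.

```python
from statistics import mean

def genotype(alleles, reference):
    """Count copies of the reference allele (0, 1, or 2) in a pair of inherited alleles."""
    if len(alleles) != 2:
        raise ValueError("expected exactly two alleles, one from each parent")
    return sum(1 for a in alleles if a == reference)

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a SNP with no variation has an undefined correlation")
    return sxy * sxy / (sxx * syy)

# Hypothetical allele pairs for five people at two nearby SNPs.
snp_a = [genotype(pair, "A") for pair in ["AA", "AG", "GG", "AG", "AA"]]
snp_b = [genotype(pair, "C") for pair in ["CC", "CT", "TT", "CT", "CC"]]

print(snp_a)                    # genotype counts at the first SNP
print(r_squared(snp_a, snp_b))  # close to 1 when the SNPs are inherited together
```

In this toy sample the two SNPs always travel together, so knowing one genotype reveals the other.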
2.2 Smoking Biology
Cigarette smoke contains thousands of chemically distinct particulates, one of which is nicotine.
When nicotine is inhaled through smoking, it is absorbed into the blood stream via the lungs and
is delivered to the brain within a few seconds (Benowitz 1990). Nicotine’s addictive properties
come primarily from its molecular similarities to acetylcholine, which cause it to bind with the
body’s nicotinic acetylcholine receptors (NAChRs). Acetylcholine is an important neurotransmitter
that plays a role in a wide variety of biological processes, including muscle contraction, sweating,
REM sleep, memory, attention, arousal, and reward. Normally, when acetylcholine binds with the
NAChRs, it opens an ion channel, causing sodium and potassium ions to pass through, leading to
a net flow of positive ions into the cell. This triggers the cell to release other neurotransmitters
related to the processes listed above.
Nicotine is believed to cause addiction mainly by triggering the dopaminergic neurons located in
the ventral tegmental area (VTA) of the brain, which release dopamine into the nucleus accumbens
core (NAC) (Glimcher 2011). The presence of dopamine in the NAC is associated with cognition,
motivation, and positive reward prediction errors. Reinforcing this effect, nicotine also causes the
release of other neurotransmitters (e.g. glutamate, serotonin, and norepinephrine), which may
increase the responsiveness of the NAC to dopamine and also may contribute to an independent
addictive effect (Centers for Disease Control and Prevention 2010, p. 136). Counterbalancing the effect
of dopamine, nicotine also triggers the GABA system and the medial habenula, which play an
inhibitory role, but it appears that the response of these systems to nicotine decays more quickly
than does the response of the dopaminergic system (Fowler 2011).
2.3 Molecular Genetics and Smoking
There is a large body of work that infers heritability of a behavioral trait – the fraction of
variance accounted for by genetic factors taken as a whole – by studying twins or adoptees. Smoking
is one of several traits that have been studied in this literature (Gilbert 2011, Li et al 2003).
Though there is now strong evidence that genetic factors taken altogether influence various aspects
of smoking behavior, researchers are only now beginning to reliably identify specific genetic variants
that underlie the heritability of smoking behavior.
Most early molecular genetic studies were candidate gene studies, which studied variation in
genes in biological systems known to play an important role in nicotine addiction. For a comprehensive review of these early candidate gene studies, which were conducted beginning in the mid-1990s,
see the Surgeon General’s Report (Centers for Disease Control and Prevention 2010, chap. 4). The early
studies focused almost exclusively on nicotine-metabolizing genes (primarily CYP2A6), nicotinic
receptor genes (such as CHRNA4 and CHRNB2), and a handful of other genes, most prominently
the dopamine receptor D2 gene DRD2 and the serotonin-transporter-linked region 5HTT.
The replication record of these early studies turned out to be disappointing, and the estimates of
the effect sizes were often highly heterogeneous across studies. An influential review (Munafo, Clark,
Johnstone, Murphy, and Walton 2004, p. 583) concluded that the “evidence for a contribution
of specific genes to smoking behavior remains modest.” Today, it is understood that a major
contributing factor to the inconsistent replication record of early candidate genes for smoking (and
to an even greater extent, for other behavioral traits) is that the studies relied on sample sizes
far too small to ensure adequate power (Rietveld, Conley, Eriksson, Esko, Medland, Vinkhuyzen,
Yang, Boardman, Chabris, Dawes, et al. 2014).
Beginning around 2005, medical-genetics research began to undergo a paradigm shift, moving
away from candidate-gene studies to what are called genome-wide-association (GWA) studies. In
these studies, researchers regress the outcome of interest on each of
the (typically millions of) measured single-nucleotide polymorphisms (SNPs), testing each for association. It was only recently
that these studies became feasible, as genotyping technologies with dense coverage of common
SNPs across the entire genome became available at modest costs. Because of the large number of
hypotheses tested in a GWAS, a SNP association is considered established only if it reaches the
“genome-wide significance” threshold of $p < 5 \times 10^{-8}$. Adequate statistical power at this stringent
significance threshold requires very large samples. Since individual samples are generally too small,
many GWA studies are conducted within research consortia that meta-analyze results from multiple
samples. Empirically, it is now well established that results from such GWA studies replicate very
consistently (Visscher 2012). There are several reasons for the robustness of GWAS findings (see
Rietveld et al 2013 for a discussion). One important reason is that, even if such a study has only
modest statistical power to detect an association at the genome-wide significance level, it follows
from Bayes’ rule that conditional on finding such an association, it is likely to be true (see Benjamin
2012 for heuristic calculations).
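
The Bayes' rule argument can be made concrete with a small calculation. The sketch below is only a heuristic illustration: the prior probability that a given SNP is truly associated and the statistical power are hypothetical values that do not come from the paper or the cited studies. It also shows that the conventional threshold matches a Bonferroni-style correction of a 5% significance level for roughly one million tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate near alpha."""
    return alpha / n_tests

def posterior_true(prior, power, alpha):
    """P(association is real | test is significant), by Bayes' rule."""
    true_positive = power * prior
    false_positive = alpha * (1.0 - prior)
    return true_positive / (true_positive + false_positive)

# Hypothetical inputs, for illustration only.
prior = 1e-5   # assumed share of tested SNPs that are truly associated
power = 0.10   # assumed (modest) power to detect a true association

gwas_alpha = bonferroni_threshold(0.05, 1_000_000)  # about 5e-8
candidate_alpha = 0.05                              # conventional single-test level

print(posterior_true(prior, power, gwas_alpha))       # high: a significant hit is probably real
print(posterior_true(prior, power, candidate_alpha))  # tiny: a significant hit is probably noise
```

Under these assumed inputs, the stringent threshold does most of the work: even with low power, very few false positives survive it.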
2.4 SNPs Selected For This Paper
A landmark event in the study of the genetics of smoking was the publication of three GWA
studies in the May 2010 issue of Nature Genetics (The Tobacco and Genetics Consortium 2010,
Liu, Tozzi, Waterworth, Pillai, Muglia, Middleton, Berrettini, Knouff, Yuan, Waeber, et al. 2010,
Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller, Sulem, Rafnar, Esko, Walter, et al.
2010). The three papers represented the culmination of the work of three separate consortia:
the Tobacco and Genetics (TAG) Consortium, the European Network of Genetic and Genomic
Epidemiology (ENGAGE) Consortium and the Oxford-GlaxoSmithKline (Ox-GSK) Consortium.
The studies examined a range of smoking outcome variables, including age at initiation, cigarettes
smoked per day while smoking, and whether the smoker had succeeded in quitting (cessation).
Because the consortium studies pooled data from multiple sources, the definition of cigarettes
smoked per day varied across the twelve cohorts that contributed to the meta-analysis: some
cohorts used the maximum number of cigarettes per day, whereas others used an average
constructed from panel data; we refer to the hybrid measure as CCPD. Prior to publication, each
consortium shared its results with the other two. Two of the published papers (The Tobacco
and Genetics Consortium 2010, Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller, Sulem,
Rafnar, Esko, Walter, et al. 2010) meta-analyzed the results for CCPD from all three consortia.
Consequently, there is much overlap between the conclusions of the two papers, and we focus our
discussion below on the findings from these overall meta-analyses. All analyses of CCPD were
restricted to samples of individuals who smoked regularly at some point in their life (ever-smokers).
Because GWA studies attempt to find associations with SNPs scattered fairly evenly across the
entire genome, it was not obvious a priori that the analyses would identify SNPs in or near genes
implicated in the biological systems already understood to be relevant. However, the majority of
SNPs that reached the genome-wide significance threshold were in fact in or near genes in biological
systems that were known to play an important role in nicotine addiction.
To select the SNPs we study in our analysis, we proceeded in two steps. First, we sought
to determine the total number of independent genetic signals identified in the two studies, by
using the software SNAP (Johnson, Handsaker, Pulit, Nizzari, O’Donnell, and de Bakker 2008)
to compute the linkage disequilibrium between all pairs of genome-wide significant SNPs reported
in the meta-analyses. In the genetic literature, the standard measure of linkage disequilibrium
between two SNPs is the $R^2$ obtained from the regression of one SNP genotype on the other SNP
genotype. Following the literature, we assume that any pair of SNPs whose linkage disequilibrium
exceeds 0.4 reflects a single genetic signal. This criterion leaves us with five genomic regions with
at least one SNP that reached genome-wide significance in at least one of the studies: the nicotinic
receptor cluster on chromosome 15 (CHRNA3/CHRNA5/CHRNB4), the cluster on chromosome 8
(CHRNB3/CHRNA6), two distinct regions near the nicotine-metabolizing gene CYP2A6, and the
chromosome 10 region.
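
The grouping step can be sketched as follows. This is a minimal illustration, not the procedure implemented in SNAP: the SNP names and pairwise values are hypothetical, and the sketch assumes that SNPs connected through any chain of pairs above the 0.4 cutoff belong to the same signal.

```python
def group_signals(snps, r2, threshold=0.4):
    """Group SNPs into signals; SNPs linked by a chain of pairs with r2 > threshold share a group."""
    parent = {s: s for s in snps}

    def find(s):
        # Follow parent links to the group representative, shortening the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), value in r2.items():
        if value > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in snps:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical SNPs and pairwise linkage disequilibrium values.
snps = ["snp1", "snp2", "snp3", "snp4"]
r2 = {
    ("snp1", "snp2"): 0.95,
    ("snp2", "snp3"): 0.10,
    ("snp3", "snp4"): 0.55,
    ("snp1", "snp3"): 0.02,
}

signals = group_signals(snps, r2)
print(signals)  # two independent signals in this toy example
```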
One of the regions near CYP2A6 contains SNPs that are not in close linkage disequilibrium
with any genetic variables available in the HRS data (the best proxy only explains 22.6% of the
variation in rs4105144). From each of the remaining four regions, we selected the SNP with the
lowest p-value, with one exception. The exception is the CHRNA3/CHRNA5/CHRNB4 cluster
on chromosome 15, for which we retained the SNP with the second lowest p-value, rs16969968
($p < 6 \times 10^{-72}$, compared to $p < 5 \times 10^{-73}$ for rs1051730). We focus on this SNP, known
colloquially among researchers as “Mr. Big,” because there is reason to believe it is the biologically
relevant “causal” variant (with other, nearby SNPs reaching genome-wide significance due to their
correlation with it). In particular, it is known to cause an amino acid change in the alpha-5 subunit
of the nicotinic receptors, and experiments have found that this change alters the responsiveness of
the nicotinic receptors to nicotine (Wang 2009, Falvella 2009). Studies have also found that the
SNP influences the expression of the CHRNA5 gene in brain and lung tissue (Wang 2009, Falvella
2009). In practice, the linkage disequilibrium between Mr. Big and rs1051730 is nearly perfect,
so all of our results are substantively identical if rs1051730 is used instead. This process leaves us
with four SNPs:
• rs16969968 (in the gene CHRNA5 in the CHRNA3/CHRNA5/CHRNB4 cluster)
• rs13280604 (in the gene CHRNB3 in the CHRNB3/CHRNA6 cluster)
• rs7937 (near the nicotine-metabolizing gene CYP2A6)
• rs1329650 (in the chromosome 10 region with unknown functional significance)
In what follows, we refer to these as our smoking-associated SNPs. Figure XXX shows graphical
illustrations, one for each SNP (pictured with a large orange diamond), of the genes located in the
proximity of the SNPs.
3 Data and Reduced Form Findings
3.1 HRS and Genomic Data
The data for our analysis come from the Health and Retirement Study (HRS), which is a
nationally representative longitudinal survey of Americans over 50 years of age and their spouses.
The initial HRS sample was collected in 1992 and included individuals born between 1931 and 1941.
The survey is administered every two years with only minor adjustments from wave to wave. More
cohorts have been added over time, making the current HRS sample representative of individuals
born between 1890 and 1954 who survived until the sample period.
From 2006 to 2008, 12,507 HRS respondents were genotyped from saliva samples. To avoid
detecting spurious genetic associations due to genotyping errors, it is important to analyze data
that have undergone quality control filtering (see Beauchamp, Cesarini, Johannesson, van der Loos,
Koellinger, Groenen, Fowler, Rosenquist, Thurik, and Christakis (2011) for discussion). We work
with the public-release version of the genotypic data which has been quality controlled by researchers
at the University of Washington (XXX 2012). We further restrict our sample to Caucasians, since
the genetic associations that motivate our study are largely found in all-Caucasian samples. Our
final genotyped sample consists of 68,288 person-year observations on 8,122 unique individuals. The
Appendix offers a complete discussion of the criteria used to select this sample. Table 1 presents
some basic cross-sectional characteristics of the individuals in our sample, with the variables measured as of an individual’s most recent appearance in the panel. Following the consortium studies,
we refer to individuals who smoked at some point in their life as “ever smokers,” and others as
“never smokers.” As indicated in Table 1, about 57% of our sample report having ever smoked, and
conditional on smoking, the average maximum number of cigarettes consumed per day is just over
25.
In the descriptive analyses that follow, we estimate regressions of the following form:
$$y_i = \beta_0 + \sum_j \beta_j SNP_{ji} + X_i \gamma + \epsilon_i, \qquad (1)$$
where $SNP_{ji} \in \{0, 1, 2\}$ is the genotype of individual $i$ at SNP $j$ and $X_i$ is a vector of controls. In
all analyses, we define the reference allele to be the allele that reduces the level of smoking, so when
the dependent variable is smoking or a health outcome impaired by smoking, we expect negative
coefficients. Controlling for potential confounds that may be correlated with genotype is critical in
order to avoid spurious findings.
In practice, the most common concern is confounding due to population stratification: different
groups within the sample differ in allele frequencies and also differ in their outcome for non-genetic
reasons.3 For this reason, it is common practice in genetic association studies to include as control
variables the first 10 or 20 principal components of all the genotypes measured in the dense SNP
chip. These principal components seem to pick up much of the subtle genetic structure within
a population (Price 2006). Our analyses therefore control for the first ten principal components,
provided by the HRS. We also include a dummy for Male gender, a full set of age dummies, and
interactions between the Male and Age dummies. Reported standard errors are clustered at the
person level in all specifications where the unit of observation is a person-year.
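
A stripped-down version of this regression can be sketched as follows. The data are synthetic and noise-free, the only control is a Male dummy, and the principal components, age dummies, interactions, and clustered standard errors used in the paper are omitted; the point is only to show the additive 0/1/2 genotype coding and the expected negative sign on the reference allele.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic sample: (copies of the reference allele, male dummy).
people = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
X = [[1.0, float(snp), float(male)] for snp, male in people]
# Hypothetical outcome: each reference allele lowers peak cigarettes per day by 1.5.
y = [20.0 - 1.5 * snp + 2.0 * male for snp, male in people]

beta = ols(X, y)
print(beta)  # intercept, SNP coefficient (negative), Male coefficient
```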
3.2 Genes, Smoking, and Health Outcomes
We begin by examining how each of the four smoking-associated SNPs is associated with
smoking behavior and health outcomes in the HRS sample. To maximize comparability with the
consortium studies, all analyses in this section are based on the sample of ever-smokers in the HRS.
3 A famous illustration of stratification is the “chopsticks effect” (Lander & Schork, 1994). Imagine a
study that tries to identify genetic markers for chopstick use by comparing an Asian population
(cases) to a Caucasian population (controls). Without controlling for population stratification,
any markers which differ appreciably in frequency between the Caucasian and Asian populations
will be found to be associated with chopstick use, but those associations are of course spurious.
This example might seem to suggest that a simple fix would be to control for race or ethnicity.
Indeed, it is standard practice to restrict a genetic association study to subjects of a common
ethnic background, as we do here. It has been found, however, that allele frequencies can differ
substantially even within ethnically homogeneous populations, such as different regions within
Iceland (Price et al., 2009).
Table 2 presents estimates of the relationship between our SNPs of interest and CPD_MAX: the
number of cigarettes at peak consumption as measured in the HRS. In Columns 1-4, we consider
the influence of each SNP separately. The estimated effect of each SNP is negative, as expected,
and the estimated effect sizes are never statistically distinguishable from the effects reported in the
consortium studies. For example, an additional copy of the protective allele of Mr. Big reduces the
number of cigarettes smoked at maximum consumption by 1.32 (s.e. = 0.39), an effect similar to
what TAG reported. For two of our SNPs, Mr. Big and rs7937, the estimated effect is statistically
distinguishable from zero at the 1% level. We find no statistically significant association between
rs1329650 and CPD_MAX and a borderline significant association between rs13280604 and CPD_MAX.
Column 5 shows that the coefficient estimates do not change appreciably in a model which includes
all four SNPs.
Though informative, CPD_MAX is only one facet of life-cycle smoking. Reduction or cessation
represents another critical feature of smoking behavior. Table 3 therefore shows the results from
panel regressions that exploit the longitudinal nature of the HRS data. In Column 1, our outcome variable is an indicator for smoking a non-zero quantity. We find no statistically significant
association between later-life smoking on the extensive margin and any of the smoking-associated
SNPs. In Column 2, the dependent variable is an indicator for not smoking. In these regressions,
we restrict the sample to person-wave observations in which the person smoked a positive quantity
in the previous wave and hence could quit. There are hints that rs13280604 is associated with
quitting behavior. Finally, columns 3 and 4 show results from a panel specification in which the
dependent variable is defined as intensive-margin smoking in each of the survey waves (CPD_CONT).
We report one specification restricted to person-wave observations with positive values of CPD_CONT and one specification that includes zeros. An important
message from Table 3 is that there are no strong relationships between any of our SNPs and
contemporaneous smoking quantities.
A major advantage of the HRS for this analysis is the availability of life-cycle data on health
outcomes and mortality. This allows us to directly estimate the relationship between specific SNPs
related to cigarette consumption and major illnesses associated with smoking. We are particularly
interested in non-cancerous lung disease, heart disease, and cancer, since these are the major
conditions directly linked to smoking. Note that the cancer measure is non-specific: the HRS
only asks if an individual has ever been diagnosed with any cancer, regardless of the type. HRS
respondents are asked about their current health status in each of these three categories, along with
a series of follow-up questions. For example, the first question about lung disease asks subjects if
they have ever been told by a doctor that they have a lung condition such as “chronic bronchitis or
emphysema.” In subsequent surveys, respondents are asked if their medical condition is improving
or deteriorating and information is also collected about any treatment received or medications
prescribed.
Tables 4-6 report estimates of linear probability models explaining health outcomes as a function of the SNPs and the maximum number of cigarettes smoked per day. The dependent variables
are indicators for the incidence of (non-cancerous) lung illness, heart illness, and cancer. These
are cross-sectional regressions with samples restricted to include only the most recent person-year
observation in the HRS. The samples for these regressions only include those individuals that have
reported smoking at some point in their lives. To obtain a baseline association between cigarette
consumption and lung health risks, Column 1 presents an estimate of the relationship between the
maximum reported number of cigarettes smoked per day and the incidence of major non-cancerous
lung illness. These estimates suggest a positive and significant relationship: a one-unit increase in
CPD_MAX is associated with an increase in the probability of lung illness of about 0.4 percentage
points. In Column 2, we regress the lung illness indicator on the SNPs. We find large,
statistically significant coefficients on Mr. Big and rs1329650, but no significant relationship with
our other two SNPs. The coefficients on Mr. Big and rs1329650 are consistent with the
direction of their associations with smoking behavior. The coefficient estimate of 2.9 for Mr. Big implies
that amongst smokers, those with 2 copies of the allele that increases smoking are 5.8 percentage
points more likely than individuals with 0 copies to be diagnosed with lung disease. Since the
baseline probability of lung illness is 18%, the implied effect on the risk of lung illness is roughly 30%.
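
The arithmetic behind these figures can be written out explicitly, using only the numbers reported above (a per-allele coefficient of 2.9 percentage points and a baseline probability of 18%):

```latex
\begin{align*}
\text{difference between 2 and 0 copies} &= 2 \times 2.9 = 5.8 \text{ percentage points}, \\
\text{relative to the baseline} &= \frac{5.8}{18} \approx 0.32 .
\end{align*}
```

The ratio of about 0.32 is the quantity that the text rounds to 30%.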
In Tables 5-6, we conduct similar analyses for two other health indicators: the incidence of
a major heart illness, and the diagnosis of cancer. Although the maximum number of cigarettes
per day is associated with elevated risks for heart disease and cancer, we generally find small,
statistically insignificant relationships between our SNPs of interest and these outcomes. For the
cancer outcome, this is partially explained by the fact that we are unable to specifically isolate
lung cancer. Since smoking is less strongly associated with other cancers, the lack of a strong
relationship is not surprising.
Finally, we investigate the relationship between our SNPs and mortality in Table 7. Specifically,
we estimate a linear probability model to explain death in the next year. We pool all person-year observations for ever-smokers from 2006 onwards. The year restriction is imposed because
the individual had to survive until 2006 in order to be genotyped. As shown in Column 1, the
coefficients are always in the predicted direction, with estimates suggesting that each copy of the
reference allele reduces one-year mortality risk by 0.1 to 0.3 percentage points, but no single estimate
is statistically distinguishable from zero. One concern about these estimates is that the sample
could be selected because it is restricted to individuals who survived until 2006 (when genotyping
began). Endogenous attrition due to mortality is unlikely to be a major source of bias in the
younger HRS respondents. Reassuringly, when we split the HRS sample into two cohorts, we find
stronger evidence that the protective SNPs reduce mortality in the younger cohorts. Specifically, we
find that rs16969968, rs13280604 and rs7937 are all associated with reduced mortality risk, with
point estimates suggesting that each reference allele reduces mortality risk by 0.3 to 0.5 percentage
points.
3.2.1 Interpreting the Reduced Form Evidence: Questions and Puzzles
One challenge in interpreting the gene-health associations is that our SNPs could work through
channels other than smoking. For example, if Mr Big affects both smoking and other biological
processes related to lung health or mortality (e.g. fragility of lung tissue), it becomes difficult to
credibly model the causal chain running through genes, smoking, and health. If our SNPs operate
through non-smoking channels, we expect that the SNPs should be associated with health outcomes
also amongst never-smokers. To test this hypothesis, we ran placebo tests in which we re-estimated
our basic health and mortality specifications using the sample of genotyped never-smokers. As
shown in Table 8, we find no statistically significant relationships between our SNPs and these
outcomes among never-smokers. The fact that the SNPs are not predictive of health in never-smokers
suggests that the gene-health associations documented in the previous section are driven
primarily by differences in smoking behavior.
The collection of reduced form evidence presented here suggests a complicated set of relationships between individual SNPs, smoking behavior, and health. For example, we find strong effects
of Mr Big and rs7937 on CPDMAX and mortality, but not on late-in-life quitting. Mr Big is
robustly associated with lung health, as is rs1329650, despite the fact that the latter was not strongly related
to cigarette quantity. And rs13280604 appears to be related to quitting, but not CPDMAX. The
magnitude of the association between rs13280604 and lung health is particularly noteworthy, as
it is much stronger than would be predicted by naively multiplying the estimated relationship
between rs13280604 and CPDMAX by the estimated relationship between CPDMAX and lung
illness. Specifically, such a calculation suggests a relationship of roughly 0.5, less than one fifth of
our point estimates. An analogous calculation for Mr Big yields similar conclusions.
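The naive multiplication described above can be sketched as follows. This is only an illustration: it uses the coefficients as they appear in the flattened Tables 2 and 4 for the two SNPs that Table 4 reports as significantly associated with lung illness, and the choice of which table column to read each coefficient from is an assumption rather than the authors' stated procedure.

```python
# First reported coefficient for each SNP in Table 2 (cigarettes per day per allele).
cpdmax_on_snp = {"rs16969968": -1.320, "rs1329650": -0.129}
# Coefficients from Table 4 (change in the probability of lung illness per allele).
lung_on_snp = {"rs16969968": -0.029, "rs1329650": -0.026}
# Table 4 coefficient on MaxCigsPerDay when entered without the SNPs.
lung_on_cpdmax = 0.004


def naive_indirect_effect(snp):
    """Product of (SNP -> CPDMAX) and (CPDMAX -> lung illness)."""
    return cpdmax_on_snp[snp] * lung_on_cpdmax


def share_explained(snp):
    """Implied indirect effect as a fraction of the direct SNP estimate."""
    return naive_indirect_effect(snp) / lung_on_snp[snp]


for snp in cpdmax_on_snp:
    implied_pp = 100 * naive_indirect_effect(snp)   # in percentage points
    direct_pp = 100 * lung_on_snp[snp]
    print(snp, round(implied_pp, 2), round(direct_pp, 2), round(share_explained(snp), 3))
```

Under these assumptions the implied effect through maximum cigarettes per day is a small fraction of the directly estimated association for both SNPs.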
How can we reconcile the modest effects on CPDMAX with the substantial effects that two of
our SNPs appear to have on lung health? One possibility rests on the insufficiency of a simple metric
like CPDMAX as a measurement of life-cycle smoking behavior. SNPs that generate large life-cycle
differences may have only modest effects on CPDMAX, as the latter is only a highly imperfect proxy
for the total accumulated damage that an individual has sustained over their lifetime due to
smoking (which is captured in the health variables). Since rs13280604 has no known relationship
with the functioning of lung tissue, the association with health could emerge because different
SNPs differently impact life-cycle smoking patterns. An individual’s health is a function of not
only maximum smoking intensity, but also the total length of time spent smoking. The lung health
association might better reflect the total life-cycle effect of rs16969968 on cumulative smoking
behavior than the observed associations between rs16969968 and maximum cigarettes. Indeed,
as shown in Figure 1, individuals who continuously smoke in the NLSY on average experience a
substantial reduction in the quantity of cigarettes that they smoke per day over their life-cycle.
It is possible that a SNP like rs16969968 not only affects peak quantity but also the evolution of
quantity over time. The operation of dynamic behavioral channels can also potentially explain the
set of associations observed for rs13280604. For example, it appears that rs13280604 affects the
ease of cessation. It is possible that this association is independent of the behavioral channels that
affect the maximum quantity consumed (e.g. one’s preference for nicotine).
The results here highlight the promise of using GWAS results as a starting point for the further
exploration of genetic relationships with behavior and health outcomes. Although the Consortium
data were not sufficiently rich to investigate the health impacts of these SNPs, the results on
rs16969968 and rs1329650 suggested natural hypotheses on health which could be tested in a
smaller but richer data set like the HRS. Rationalizing the collected associations between genes,
smoking, and health requires the development of a unified dynamic model, a task to which we now
turn.
4 Model
Here we develop a dynamic structural model of life-cycle smoking behavior. A sizable existing
literature uses the theory of rational addiction (Becker and Murphy 1988) to organize the empirical
analysis of smoking. Chaloupka (1991) and Becker et al. (1994) present evidence in favor of the
model’s prediction that both past and future cigarette prices should affect current consumption
(see Chaloupka (2000) for a survey). Chaloupka (1991) also finds indirect evidence that less educated and younger individuals are more myopic because their contemporaneous cigarette demand
is less related to future consumption and prices. Gilleskie and Strumpf (2005) find evidence for
state dependence in cigarette consumption, consistent with the notion of habit-formation present
in the Becker-Murphy model.
A smaller, fully structural literature jointly models smoking decisions along with health and
mortality processes. This approach allows for a rigorous quantification of how health risks (or
beliefs about health risks) alter the incentives to smoke over the life-cycle. Arcidiacono et al.
(2007) develop and estimate one of the first structural models of smoking, health, and mortality in
a sample of mature adults from the Health and Retirement Study (HRS). They find evidence in
favor of forward looking behavior and support for habit formation in the form of substantial quitting
costs. Darden (2013) develops and estimates a structural model of smoking decisions and focuses
on the role of individual (Bayesian) learning about the health risks of smoking. He finds evidence
that smokers quit in response to the onset of chronic illnesses, but are less likely to respond to new
information about individual health markers such as blood pressure and high-density lipoprotein.
Our model builds on the basic framework present in Arcidiacono et al. (2007) and Darden (2013).
4.1 Choice Set and Addiction Stocks
We model smoking as a discrete choice. Each period, individuals choose one of $J + 1$ levels
of smoking: $\{c_0, c_1, \ldots, c_J\}$, where $c_0 = 0$ represents the non-smoking option, and more generally
$c_j$ represents the quantity of cigarettes consumed per day under option $j$. We allow smokers to
choose one of four intensities: $\{0, 5, 20, 30\}$. Let $C_{it}$ represent individual $i$’s cigarette consumption
in period $t$.
We assume that smoking is associated with two kinds of persistent effects. First, current
cigarette consumption fuels an addiction to nicotine that makes it difficult to reduce cigarette
consumption in the future. The intensity of this addiction is captured by the addiction stock $S^a_{it}$.
We assume that this evolves deterministically according to the following law of motion:
$$
S^a_{it+1} =
\begin{cases}
(1 - \delta_{a1}) S^a_{it} + \delta_{a1} C_{it}, & \text{if } C_{it} > S^a_{it}; \\
(1 - \delta_{a2}) S^a_{it} + \delta_{a2} C_{it}, & \text{if } C_{it} \le S^a_{it}.
\end{cases}
\tag{2}
$$
That is, the addiction stock in the next period is equal to a weighted average of the prior addiction
stock, $S^a_{it}$, and the current level of smoking, $C_{it}$. The weight is allowed to differ depending on
whether an individual is consuming more or less than their addiction stock. This flexibly allows for
differences in addiction dynamics between build-up and reduction phases. In addition to fueling a
behavioral habit, smoking may also have a persistent effect on an individual’s health. We assume
that such effects are related to a separate stock, $S^h_{it}$, which reflects the latent potential for past
smoking to induce negative health events. We refer to this as the smoking health stock, and it
evolves deterministically according to the following law of motion:
$$
S^h_{it+1} = (1 - \delta_h) S^h_{it} + \zeta_h C_{it}
\tag{3}
$$
Here $\delta_h$ represents the annual depreciation rate for the health stock, and $\zeta_h$ represents the rate at
which cigarette consumption builds this stock.
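The two laws of motion can be sketched in a few lines of Python. The depreciation rates below are the point estimates listed in Table 9; the value of `ZETA_H` is a placeholder, since that estimate does not appear in this part of the paper, and the function names are ours.

```python
DELTA_A1 = 0.3874  # build-up weight (Table 9)
DELTA_A2 = 0.0500  # reduction weight (Table 9)
DELTA_H = 0.4125   # health stock depreciation (Table 9)
ZETA_H = 1.0       # hypothetical value, not reported in this excerpt


def next_addiction_stock(s_a, c):
    """Equation 2: weighted average with a regime-dependent weight."""
    delta = DELTA_A1 if c > s_a else DELTA_A2
    return (1.0 - delta) * s_a + delta * c


def next_health_stock(s_h, c, zeta_h=ZETA_H):
    """Equation 3: depreciated stock plus the contribution of current smoking."""
    return (1.0 - DELTA_H) * s_h + zeta_h * c


def simulate_stocks(consumption_path, s_a=0.0, s_h=0.0):
    """Return the list of (S^a, S^h) pairs after each period of consumption."""
    path = []
    for c in consumption_path:
        s_a, s_h = next_addiction_stock(s_a, c), next_health_stock(s_h, c)
        path.append((s_a, s_h))
    return path


# Five years at 20 cigarettes per day followed by five years of abstinence.
for stocks in simulate_stocks([20] * 5 + [0] * 5):
    print(tuple(round(x, 2) for x in stocks))
```

With these estimates the addiction stock builds up much faster than it decays, which is the asymmetry that the two separate weights are meant to capture.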
4.2 The Health Process
We assume that individuals can enter into two kinds of bad health states: chronic, non-cancerous
conditions related to the lungs, and all other conditions. Individuals can experience both bad
health conditions simultaneously. Even though both events play important roles in influencing
health behaviors and mortality, the medical literature suggests that smoking most directly affects
the pulmonary system. Furthermore, we would like to explain the reduced-form patterns that
we observe between our SNPs of interest and lung health, so we treat such illness as a separate
category. Let $B^S_{it} \in \{0, 1\}$ indicate that individual $i$ experiences a bad health state related to the
lungs in period $t$, and let $B^O_{it} \in \{0, 1\}$ indicate a bad health state related to other conditions.
Since our data on lung illness indicate whether an individual has ever experienced a chronic lung
condition, we model lung illness as an absorbing state, so $B^S_{it+1} = 1$ if $B^S_{it} = 1$. For an individual
who has never been diagnosed with a chronic lung condition, we model the joint distribution of $B^S_{it}$
and $B^O_{it}$ through a bivariate probit specification. Let $b^{S*}_{it}$ and $b^{O*}_{it}$ be continuous indices reflecting
an individual’s propensity to fall into the various bad-health states. We assume that:
$$
b^{S*}_{it} = \beta^s_0 + \beta^s_{age} Age + \beta^s_{age2} Age^2/100 + \beta^s_h S^h_{it} + \beta^s_{ageh} Age \, S^h_{it} + \epsilon^S_{it}
\tag{4}
$$
$$
b^{O*}_{it} = \beta^o_0 + \beta^o_{age} Age + \beta^o_{age2} Age^2/100 + \beta^o_h S^h_{it} + \beta^o_{ageh} Age \, S^h_{it} + \beta^o_{bo} B^O_{i,t-1} + \beta^o_{boa} Age \, B^O_{i,t-1} + \epsilon^O_{it}
\tag{5}
$$
Here $(\epsilon^S_{it}, \epsilon^O_{it}) \sim N(0, \Sigma)$, $\Sigma$ is a variance-covariance matrix, and we allow $\sigma_{12} \neq 0$. Then $B^S_{it} = 1$ if
$b^{S*}_{it} > 0$, and $B^S_{it} = 0$ otherwise. Similarly, $B^O_{it} = 1$ if $b^{O*}_{it} > 0$, and $B^O_{it} = 0$ otherwise. If $B^S_{it-1} = 1$,
then the process determining $B^O_{it}$ collapses to the single-equation probit specified by Equation 5.
At the beginning of a period, before $B^S_{it}$ and $B^O_{it}$ are determined, an individual dies with
probability:
$$
\pi^D_{it} = \Phi(\beta^d X^d_{it})
\tag{6}
$$
Here $\Phi$ is the standard normal c.d.f., and $X^d_{it}$ is a vector of regressors that includes a quadratic in
age, $B^S_{it-1}$, $B^O_{it-1}$, and $S^h_{it}$. The survival probability is given by $\pi^S_{it} = (1 - \pi^D_{it})$.
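A minimal simulation sketch of the health and mortality process in Equations 4-6 follows. Every coefficient passed to these functions is hypothetical, since the health parameter estimates are not reported in this section. The shocks are normalized to unit variance with correlation `rho`, which is a standard probit normalization but an assumption here.

```python
import math
import random


def norm_cdf(x):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def death_probability(beta_d, x_d):
    """Equation 6: probit death probability for regressor vector x_d."""
    return norm_cdf(sum(b * x for b, x in zip(beta_d, x_d)))


def base_index(age, s_h, b):
    """Terms shared by Equations 4 and 5 (age quadratic, stock, interaction)."""
    return (b["const"] + b["age"] * age + b["age2"] * age ** 2 / 100.0
            + b["h"] * s_h + b["ageh"] * age * s_h)


def draw_health(age, s_h, b_s_prev, b_o_prev, beta_s, beta_o, rho, rng):
    """Draw (B^S, B^O) from the bivariate probit, with lung illness absorbing."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    eps_s = z1
    eps_o = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # correlated with eps_s
    if b_s_prev == 1:
        b_s = 1  # absorbing state
    else:
        b_s = int(base_index(age, s_h, beta_s) + eps_s > 0)
    other = (base_index(age, s_h, beta_o)
             + beta_o["bo"] * b_o_prev + beta_o["boa"] * age * b_o_prev)
    b_o = int(other + eps_o > 0)
    return b_s, b_o


# Hypothetical coefficients, chosen only to make the example run.
beta_s = {"const": -3.0, "age": 0.02, "age2": 0.0, "h": 0.05, "ageh": 0.0}
beta_o = {"const": -2.0, "age": 0.02, "age2": 0.0, "h": 0.01, "ageh": 0.0,
          "bo": 1.0, "boa": 0.0}
print(draw_health(70, 10.0, 0, 0, beta_s, beta_o, 0.3, random.Random(0)))
```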
4.3 Period Utility
Here we present the model of the behavior of ever-smokers, or individuals who have
already decided to smoke for at least one period in their lives. Later we will describe how we
model the initiation process. Let $Z_{it}$ refer to the set of state variables for individual $i$’s decision
problem in period $t$, excluding transitory shocks to utility. This vector contains the addiction
and smoking health stocks, as well as age, the current health states, and a reduction cost draw
($\epsilon^{cost}_{it}$): $Z_{it} = \{S^a_{it}, S^h_{it}, B^O_{it}, B^S_{it}, \epsilon^{cost}_{it}, t\}$. The period utility associated with choosing option $j$ from
the choice set is given by $\tilde{u}_j(Z_{it}, \epsilon_{jit}) = u_j(Z_{it}) + \epsilon_{jit}$. That is, period utility for choice $j$ is the
sum of a component that depends on the state variables, $u_j$, and a random shock, $\epsilon_{jit}$, with the
state-dependent component specified as:
$$
\begin{aligned}
u_j(Z_{it}) = {} & \left(\alpha_{0i} + \alpha_{0S} 1\{B^S_{it} = 1\}\right) \ln(1 + c_j) + \ln(y_{it} - e(p_{it}, c_j)) \\
& - \exp\left(\alpha_{1i} + \epsilon^{cost}_{it}\right) (S^a_{it} - C_{it})^{1+\alpha_2} \, 1(C_{it} < S^a_{it}) \\
& + \alpha_3 1\{B^S_{it} = 1\} + \alpha_4 1\{B^O_{it} = 1\}
\end{aligned}
\tag{7}
$$
Here the $\left(\alpha_{0i} + \alpha_{0S} 1\{B^S_{it} = 1\}\right) \ln(1 + c_j)$ term represents the part of period utility that an individual receives from smoking at level $c_j$. Note that the marginal utility of cigarette consumption is
influenced both by the term $\alpha_{0i}$, which is heterogeneous in the population, and by $\alpha_{0S} 1\{B^S_{it} = 1\}$,
which allows the marginal utility of cigarette consumption to differ depending on the lung health
of the individual.[4] Utility from all other consumption goods is reflected in the term $\ln(y_{it} - e(p_{it}, c_j))$.
Income in period $t$ is given by $y_{it}$, and $e(p_{it}, c_j)$ represents expenditure on cigarettes, which depends
on both the level of cigarette consumption, $c_j$, and $p_{it}$, the cigarette price that individual $i$
faces in period $t$. The term $-\exp\left(\alpha_{1i} + \epsilon^{cost}_{it}\right) (S^a_{it} - C_{it})^{1+\alpha_2}$ captures the disutility that an individual
receives when deciding to smoke less than the current level of their addiction stock. Notice
that this is potentially nonlinear in the distance between current consumption and the addiction
stock, so that larger reductions generate increasingly greater disutility. The parameter $\alpha_2$ governs
the curvature of this disutility term. The reduction cost also depends on a stochastic component
$\epsilon^{cost}_{it}$. The $\epsilon_{jit}$ terms are shocks to each choice that are assumed to be i.i.d. across individuals, choices, and time periods and are drawn from a Type I extreme value distribution. Finally, the
parameters $\alpha_3$ and $\alpha_4$ indicate the flow disutility associated with entering the smoking-related bad health
state and the other bad health state, respectively.
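The state-dependent part of period utility in Equation 7 can be written as a small function. The constants are the point estimates listed in Table 9. The expenditure function `e(p, c)` is not specified in this section, so the annual-spending form used below (price per cigarette times daily quantity times 365) is purely an assumption, as are the income and price values in the example.

```python
import math

ALPHA_0S = -0.1115            # shift in smoking utility when lung-ill (Table 9)
ALPHA_3 = -0.1569             # flow utility of the lung illness state (Table 9)
ALPHA_4 = -0.2446             # flow utility of the other bad health state (Table 9)
ALPHA_2 = math.exp(-4.1629)   # curvature; Table 9 reports log(alpha_2)


def period_utility(c_j, s_a, lung_ill, other_ill, alpha_0i, alpha_1i,
                   eps_cost, income, price_per_cig):
    """Deterministic component u_j(Z_it) of Equation 7."""
    smoking = (alpha_0i + ALPHA_0S * lung_ill) * math.log(1.0 + c_j)
    expenditure = price_per_cig * c_j * 365.0   # hypothetical form of e(p, c)
    other_goods = math.log(income - expenditure)
    reduction_cost = 0.0
    if c_j < s_a:  # the cost applies only when smoking below the addiction stock
        reduction_cost = math.exp(alpha_1i + eps_cost) * (s_a - c_j) ** (1.0 + ALPHA_2)
    return (smoking + other_goods - reduction_cost
            + ALPHA_3 * lung_ill + ALPHA_4 * other_ill)


# Utility of each intensity for a middle preference type with a stock of 20.
for c in (0, 5, 20, 30):
    print(c, round(period_utility(c, 20.0, 0, 0, 0.0677, -1.8752, 0.0, 30000.0, 0.25), 3))
```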
4.4 Stochastic Reduction Costs and Learning
The stochastic component of reduction costs, $\epsilon^{cost}_{it}$, is assumed to be drawn i.i.d. across time
periods from a mean-zero normal distribution with variance $\sigma^2_{cost}$. This means that the term
$\exp\left(\alpha_{1i} + \epsilon^{cost}_{it}\right)$ is log-normally distributed with location parameter $\alpha_{1i}$. It is assumed that $\alpha_{1i}$
is heterogeneous in the population, and that individuals are uncertain about their own value.
Specifically, we assume that there are two types in the population, $\{\underline{\alpha}_1, \overline{\alpha}_1\}$. Let $\pi^{low}$ refer to the
probability that an individual is a low-cost type.
[4] This is consistent with suggestive evidence that the enjoyment of smoking may decline as respiratory
function worsens.
Let $\eta_{it} = \exp\left(\alpha_{1i} + \epsilon^{cost}_{it}\right)$ refer to the cost parameter drawn by individual $i$ in period $t$. When
an individual is informed about their own type, they correctly believe that $\eta_{it}$ is log-normally distributed with the true location parameter for their type. However, when individuals are uninformed,
they believe that $\eta_{it}$ is drawn from a mixture distribution of the two log-normal distributions in
the population. That is, individuals believe that, each period, they will receive a draw from the low-cost
distribution with probability $\pi^{low,b}$, and will receive a draw from the high-cost distribution
with probability $(1 - \pi^{low,b})$. We allow $\pi^{low,b}$ to be a belief parameter that is not necessarily equal
to the true mixing probability $\pi^{low}$.
We assume that individuals are initially unaware of their exact cost type, and only come to learn
the true value of $\alpha_{1i}$ through experiential learning. Let $Info^{cost}_{it}$ be a dummy variable that indicates
whether or not an individual is informed of their own type in time period $t$. We assume that individuals start out
life uninformed ($Info^{cost}_{i0} = 0$). However, after every period in which an individual smokes, there
is some probability $\pi^{learn}$ that an individual learns their true type. Let $\Omega_{it}$ refer to an individual’s
information set at time $t$. Information about an individual’s true cost type is one element of this
set: $Info^{cost}_{it} \in \Omega_{it}$.
We have proposed a rather crude learning mechanism. A rational, Bayesian agent would update their belief about the distribution of $\eta_{it}$ on the basis of the observed sequence of past cost
draws. However, Bayesian learning dynamics would greatly complicate the numerical solution
of the model by necessitating an additional state variable - the history of past cost draws. To
avoid excessive computational burden, we make the starker assumption that individuals randomly
transition from the uninformed to the informed state with probability $\pi^{\ell}$ (denoted $\pi^{learn}$ above) after every period during
which they smoke.
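The learning mechanism can be sketched as follows, using the Table 9 point estimates. Treating the addiction type with the smaller location parameter as the low-cost type is our reading of the estimates, and the function names are illustrative only.

```python
import math
import random

ALPHA_1_LOW = -1.8752             # Add. Type 1 (Table 9), read here as the low-cost type
ALPHA_1_HIGH = 0.2627             # Add. Type 2 (Table 9)
SIGMA_COST = math.exp(-0.4029)    # Table 9 reports the log of this standard deviation
PI_LOW_BELIEF = 0.5906            # believed mixing probability (Table 9)
PI_LEARN = 0.1636                 # per-period learning probability (Table 9)


def draw_cost(alpha_1i, rng):
    """One draw of eta_it = exp(alpha_1i + eps_cost)."""
    return math.exp(alpha_1i + rng.gauss(0.0, SIGMA_COST))


def lognormal_mean(location):
    """Mean of a log-normal variable with the given location and SIGMA_COST."""
    return math.exp(location + SIGMA_COST ** 2 / 2.0)


def believed_mean_cost(informed, alpha_1i):
    """Expected eta_it under the agent's beliefs."""
    if informed:
        return lognormal_mean(alpha_1i)
    return (PI_LOW_BELIEF * lognormal_mean(ALPHA_1_LOW)
            + (1.0 - PI_LOW_BELIEF) * lognormal_mean(ALPHA_1_HIGH))


def update_informed(informed, smoked, rng):
    """Being informed is absorbing; learning can occur only after smoking."""
    if informed:
        return True
    return bool(smoked) and rng.random() < PI_LEARN


rng = random.Random(1)
informed = False
for period in range(10):
    informed = update_informed(informed, smoked=True, rng=rng)
print(informed, round(believed_mean_cost(informed, ALPHA_1_LOW), 3))
```

An uninformed low-cost individual overstates their expected reduction cost, because the believed mixture places weight on the high-cost distribution.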
4.5 Decision Problem
The individual’s decision problem can be expressed as:
$$
\max_{j \in \{0,1,2,3\}} \tilde{u}_j(Z_{it}, \epsilon_{jit}) + \beta \pi^S_{it+1}(Z_{it}, S^h_{it+1}) E\left[V_{t+1}(Z_{it+1}) \mid \Omega_{it}\right]
\tag{8}
$$
Here we recognize that the probability of dying between periods $t$ and $t + 1$ depends on the period
$t$ state variables, $Z_{it}$, as well as the updated smoking health stock $S^h_{it+1}$. Also, $V_{t+1}(Z_{it+1})$ is the
value of the decision problem in period $t + 1$ given the state vector $Z_{it+1}$. The expectation of
$V_{t+1}(Z_{it+1})$ is taken with respect to the random state vector $Z_{it+1}$ and the vector of shocks $\epsilon_{it}$
conditional on survival, $Z_{it}$, and choice $j$. Note also that the expectation depends on the current
information set $\Omega_{it}$.
Following Rust (1987), if we assume that the shocks $\epsilon_{jit}$ are additively separable, satisfy the
conditional independence assumption, and follow a Type I extreme value distribution, then the
expected value of the next period’s value function (conditional on survival) can be expressed as:
$$
E\left[V_{t+1}(Z_{it+1}) \mid \Omega_{it}\right] = E\left[\ln\left(\sum_{k} \exp\{\nu_{kt+1}(Z_{it+1}, \Omega_{it})\}\right)\right]
\tag{9}
$$
Here $\nu_{jt}(Z_{it}, \Omega_{it})$ is the conditional value function associated with making choice $j$ in time period
$t$. This is the expected value of making the choice, net of the $\epsilon_{jit}$ shock:
$$
\nu_{jt}(Z_{it}, \Omega_{it}) = u_j(Z_{it}) + \beta \pi^S_{it+1}(Z_{it}, S^h_{it+1}) E\left[\ln\left(\sum_{k} \exp\{\nu_{kt+1}(Z_{it+1}, \Omega_{it})\}\right)\right]
\tag{10}
$$
In the terminal period $t = T$, the conditional value functions reduce to $\nu_{jT}(Z_{iT}, \Omega_{iT}) = u_j(Z_{iT})$.
The individual’s decision problem can thus be expressed as:
$$
\max_{j \in \{0,1,2,3\}} \nu_{jt}(Z_{it}, \Omega_{it}) + \epsilon_{jit}
\tag{11}
$$
The conditional choice probabilities associated with this optimization problem can be expressed as:
$$
Prob(j = j' \mid Z_{it}, \Omega_{it}) = \frac{\exp(\nu_{j't}(Z_{it}, \Omega_{it}))}{\sum_{j} \exp(\nu_{jt}(Z_{it}, \Omega_{it}))}
\tag{12}
$$
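Equations 9-12 reduce to log-sum-exp and softmax operations on the conditional value functions. The sketch below illustrates one backward-induction step for a single state. It assumes, for simplicity, that each choice leads to one known next state, so the outer expectation in Equation 10 is dropped; the flow utilities and survival probabilities in the example are hypothetical, while the discount factor is the Table 9 estimate.

```python
import math

BETA = 0.8801  # discount factor (Table 9)


def log_sum_exp(values):
    """Numerically stable ln(sum_k exp(v_k)), the inclusive value in Equation 9."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def choice_probabilities(cond_values):
    """Equation 12: logit probabilities implied by the conditional values."""
    lse = log_sum_exp(cond_values)
    return [math.exp(v - lse) for v in cond_values]


def conditional_values(flow_utils, survival_probs, next_values_by_choice, beta=BETA):
    """Equation 10 for one state, with a deterministic next state per choice.

    next_values_by_choice[j] lists the period t+1 conditional values in the
    state reached after making choice j in period t.
    """
    return [u + beta * s * log_sum_exp(nxt)
            for u, s, nxt in zip(flow_utils, survival_probs, next_values_by_choice)]


# Terminal period: conditional values equal flow utilities (hypothetical numbers).
terminal = [0.0, 0.1, 0.3, 0.2]
values_t = conditional_values([0.0, 0.1, 0.3, 0.2], [0.99, 0.99, 0.98, 0.97],
                              [terminal] * 4)
print([round(p, 3) for p in choice_probabilities(values_t)])
```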
4.6 Parameter Heterogeneity and Genes
In the population of ever-smokers, we assume that there are $J = 3$ cigarette preference types
characterized by distinct values of $\alpha_{0i}$. Let $\tau^p_i$ indicate an individual’s preference type. The probability of being preference type $\tau \in \{1, 2\}$, conditional on being a smoking type and on an individual’s
genotype $G_i$, is given by
$$
P(\tau^p_i = \tau \mid G_i) = \frac{\exp(\theta_{p\tau} X_{pi})}{1 + \exp(\theta_{p1} X_{pi}) + \exp(\theta_{p2} X_{pi})}
\tag{13}
$$
where $X_{pi}$ contains a constant and allele counts for each of our four SNPs of interest. The
probability of preference type 3 is given by $P(\tau^p_i = 3 \mid X_{pi}) = 1 - P(\tau^p_i = 1 \mid X_{pi}) - P(\tau^p_i = 2 \mid X_{pi})$.
In addition to preference type $\tau^p_i$, individuals also possess an addiction type $\tau^a_i \in \{1, 2\}$. Different
addiction types possess different values of the location parameter $\alpha_{1i}$ in the reduction cost process.
The probability of addiction type 1 is given by:
$$
Prob(\tau^a_i = 1 \mid G_i, \tau^p_i) = \frac{\exp(\theta_a X_{ai})}{1 + \exp(\theta_a X_{ai})}
\tag{14}
$$
Here $X_{ai}$ includes allele counts for each SNP, and dummies for preference types 2 and 3. That is,
we allow for addiction types and preference types to be correlated, but impose the above nested
structure for the type probabilities.
Equations 13-14 describe how genes enter the structural model. Allele counts for each of
our four SNPs of interest affect the linear indices that determine an individual’s preference type
probability, and the addiction type probability, conditional on preference type. That is, different
genotypes are associated with different distributions of the pair $\langle \alpha_{0i}, \alpha_{1i} \rangle$ in the population.
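The nested type probabilities in Equations 13-14 can be sketched as a multinomial logit followed by a binary logit. The coefficient vectors in the example are hypothetical (the estimated mixing parameters are not listed in this section), and the regressor ordering is an assumption.

```python
import math


def dot(theta, x):
    """Inner product of a coefficient vector and a regressor vector."""
    return sum(t * v for t, v in zip(theta, x))


def preference_type_probs(allele_counts, theta_p1, theta_p2):
    """Equation 13: probabilities of preference types 1, 2 and 3."""
    x_p = [1.0] + list(allele_counts)        # constant plus four allele counts
    e1 = math.exp(dot(theta_p1, x_p))
    e2 = math.exp(dot(theta_p2, x_p))
    denom = 1.0 + e1 + e2
    return [e1 / denom, e2 / denom, 1.0 / denom]


def addiction_type1_prob(allele_counts, pref_type, theta_a):
    """Equation 14: probability of addiction type 1 given preference type."""
    x_a = list(allele_counts) + [float(pref_type == 2), float(pref_type == 3)]
    return 1.0 / (1.0 + math.exp(-dot(theta_a, x_a)))


def joint_type_probs(allele_counts, theta_p1, theta_p2, theta_a):
    """Joint distribution over (preference type, addiction type) pairs."""
    joint = {}
    for k, p_pref in enumerate(preference_type_probs(allele_counts, theta_p1, theta_p2), 1):
        p_a1 = addiction_type1_prob(allele_counts, k, theta_a)
        joint[(k, 1)] = p_pref * p_a1
        joint[(k, 2)] = p_pref * (1.0 - p_a1)
    return joint


# Hypothetical coefficients for a genotype with allele counts (2, 0, 1, 1).
probs = joint_type_probs((2, 0, 1, 1),
                         [0.5, 0.1, 0.0, 0.1, 0.0],
                         [-0.2, -0.3, 0.0, -0.1, 0.0],
                         [0.0, 0.0, 0.0, 0.2, 0.1, -0.1])
print({k: round(v, 3) for k, v in probs.items()})
```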
4.7 The Initiation Process
The behavioral model and distribution of model parameters discussed so far apply to the population of ever-smokers. We choose to separately model the process of initiation. We do this because
the SNPs studied here have not been linked to initiation. However, in any forward-looking economic
model of smoking, changes in preference for nicotine or in the cost of reduction should affect the
probability that an individual ever becomes a smoker. The presence of robust associations between
our SNPs of interest and various smoking outcomes, and the lack of any such correlation with
initiation, suggest that individuals are not aware of their preference or cost types when making
initiation decisions. That is, the initiation decision can be thought of as largely separate from the
processes that determine consumption and cessation later in life, at least for the channels through
which these SNPs operate. We approximate the underlying behavioral model that drives initiation
by assuming that it is random and uncorrelated with our SNPs of interest. With some probability
$\pi^{NoSmoke}$, an individual is a non-smoking type that will always abstain from cigarettes. With
probability $(1 - \pi^{NoSmoke})$, an individual is a possible smoker, and the distribution of $\langle \alpha_{0i}, \alpha_{1i} \rangle$
within this population is determined by the type distribution in the previous section.
Within the population of possible smokers, individuals start life at age 10 with zero values of the
stocks $S^a_{it}$ and $S^h_{it}$. The probability that an individual starts smoking for the first time is governed
by an exogenous initiation process. Specifically, we assume a probit initiation process where $I^*_{it}$
represents a latent initiation index:
$$
I^*_{it} = \gamma_0 + \gamma_1 Age_{it} + \gamma_2 Age^2_{it} + \gamma_3 YearBorn_i + \epsilon^{Init}_{it}
\tag{15}
$$
Here $\epsilon^{Init}_{it}$ is distributed i.i.d. standard normal. If a never-smoker draws $I^*_{it} > 0$, then they
receive draws for the random components of utility and solve the problem in Equation 8. If they
choose not to smoke, they continue to be a never-smoker and will receive a draw for $I^*_{it+1}$ in the
next period. If they choose to smoke, then behavior is determined based on the smoker’s problem
outlined in the previous section. For never-smokers, health outcomes are also determined by the
system in Equations 4-6.
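The probit initiation process implies a per-period probability of drawing a positive index, sketched below with the Table 9 estimates. How `YearBorn` is scaled is not stated in this section, so the `year_born_index` argument (for example, birth year measured from some base year) is an assumption and the printed probabilities are illustrative only.

```python
import math

GAMMA_0 = -5.3384   # constant (Table 9)
GAMMA_1 = 0.4292    # age (Table 9)
GAMMA_2 = -0.0107   # age squared (Table 9)
GAMMA_3 = -0.0121   # year born (Table 9)


def norm_cdf(x):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def initiation_index(age, year_born_index):
    """Deterministic part of the latent index in Equation 15."""
    return GAMMA_0 + GAMMA_1 * age + GAMMA_2 * age ** 2 + GAMMA_3 * year_born_index


def initiation_probability(age, year_born_index):
    """P(I* > 0) when the shock is standard normal."""
    return norm_cdf(initiation_index(age, year_born_index))


# Age at which the quadratic age profile peaks, implied by the estimates.
peak_age = -GAMMA_1 / (2.0 * GAMMA_2)
print(round(peak_age, 1))
for age in (10, 15, 20, 30):
    print(age, round(initiation_probability(age, year_born_index=40), 4))
```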
We have assumed an exogenous initiation process for convenience and to avoid adding an extra
layer of uncertainty regarding an individual’s preference for cigarettes. The cost of this approach
is that our initiation process will not be policy invariant. Thus, any counterfactual policy analysis
performed with the estimated model is subject to the limitation that the interventions might affect
the initiation process in an unmodeled manner.
4.8 Information about the Health Risks of Smoking
Our sample consists of individuals born between 1920 and 1959. These cohorts reached maturity
and made smoking decisions during a period of tremendous change in society’s understanding of the
health risks associated with smoking. Although there were early concerns about the health consequences
of smoking, the health risks of smoking were not fully recognized by the medical establishment.
As smoking rates rose in the 1930s and 1940s, cigarette advertisements often featured doctors,
promoting the idea that smoking was safe (Gardner and Brandt 2006). A key turning point in the
public perception of the health risks of smoking was the issuance of the Surgeon General’s Report
on Smoking and Health in 1964. The report marshalled epidemiological evidence and precipitated
a decline in smoking rates for many groups (De Walque 2010). Failure to account for this large,
population-wide change in information on the health risks of smoking could bias our estimates of
parameters related to the life-cycle consumption of cigarettes. To address this, we introduce a state
variable to the information set, $Surg_t \in \Omega_{it}$, which is an indicator for the years 1964 and later.
$Surg_t$ affects beliefs about the health risks of smoking, and therefore alters the way that people
evaluate the conditional expectation in Equation 8. Specifically, we assume that before 1964,
optimal behavior is determined under the assumption that there are no health risks associated
with smoking, so that $\beta^s_h$, $\beta^s_{ageh}$, $\beta^o_h$, $\beta^o_{ageh}$, and $\beta^d_h$ are all assumed to be zero. During and after
1964, individuals form expectations based on assuming the true values of these health parameters.
Practically, this means solving for two sets of value functions: one under the assumption of no
health risks, and one under the assumption of the true health risks. Behavior is then simulated
using the $Surg_t = 0$ value functions before 1964, and with the $Surg_t = 1$ value functions thereafter.
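The switch between the two belief regimes can be sketched as a function that returns the health-risk coefficients agents use when forming expectations in a given calendar year. The dictionary keys and the example values are hypothetical names for the parameters listed above, not estimates from the paper.

```python
# Health-risk coefficients that agents believe to be zero before 1964.
SMOKING_RISK_KEYS = {"beta_s_h", "beta_s_ageh", "beta_o_h", "beta_o_ageh", "beta_d_h"}


def surg_indicator(year):
    """Surg_t: 1 in 1964 and later, 0 before."""
    return 1 if year >= 1964 else 0


def believed_health_params(year, true_params):
    """Coefficients used to evaluate the expectation in the decision problem."""
    if surg_indicator(year) == 1:
        return dict(true_params)
    return {k: (0.0 if k in SMOKING_RISK_KEYS else v) for k, v in true_params.items()}


# Hypothetical values, for illustration only.
true_params = {"beta_s_0": -3.0, "beta_s_h": 0.05, "beta_s_ageh": 0.001,
               "beta_o_h": 0.01, "beta_o_ageh": 0.0, "beta_d_h": 0.02}
print(believed_health_params(1950, true_params))
print(believed_health_params(1970, true_params))
```

In a full solution, the model would be solved once under each of the two parameter sets, and the simulation would pick the value functions matching `surg_indicator(year)`.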
5 Empirical Implementation and Estimation Results
We estimate the parameters of the structural model using the Method of Simulated Moments.
We solve and simulate the model for each distinct combination of preference and addiction types
for a number of birth cohorts, and search for model parameters that best match a set of moments
from the empirical data. Let $S_P = \{1, 2, 3, 4\}$ refer to the set of preference types, with preference
type 4 denoting the never-smoking type introduced in Section 4.7. Similarly, let $S_A = \{1, 2\}$ refer
to the set of addiction types, and let $S_{BC} = \{1925, 1930, 1935, 1940, 1945, 1950\}$ refer to the set
of birth cohorts for which the model is simulated.[5] Finally, let $S_G$ refer to the set of genotypes
formed by all relevant combinations of the SNPs rs16969968, rs13280604, rs7937, and rs1329650.[6]
Let $S = S_P \times S_A \times S_{BC} \times S_G$ refer to the combined set of distinct simulation groups. For each
group $f \in S$, we simulate 1,000 histories of smoking behavior, health, and mortality.
Let $M_\ell$ refer to the empirical sample average for the $\ell$th moment, and let $\tilde{M}_\ell$ refer to the
corresponding simulated average. The $\ell$th simulated moment is constructed as:
$$
\tilde{M}_\ell = \frac{\sum_{f \in S} \omega_f N_{\ell f} \tilde{m}_{\ell f}}{\sum_{f \in S} \omega_f N_{\ell f}}
\tag{16}
$$
Here $\tilde{m}_{\ell f}$ represents the average value of the moment $\ell$ calculated from simulated observations from
group $f$. $N_{\ell f}$ indicates how many simulation observations contributed to the group $f$ average for
moment $\ell$, and $\omega_f$ represents the population weight assigned to group $f$. The group weight $\omega_f$ is
determined by:
$$
\omega_f =
\begin{cases}
Freq^{BC}_f \, Freq^{G}_f \, \pi^{NoSmoke}, & \text{if never-smoker type;} \\
Freq^{BC}_f \, Freq^{G}_f \, (1 - \pi^{NoSmoke}) \, P(\tau^p_i = \tau^p_f \mid G_f) \, P(\tau^a_i = \tau^a_f \mid G_f, \tau^p_f), & \text{otherwise.}
\end{cases}
\tag{17}
$$
Here $Freq^{BC}_f$ measures the relative frequency of group $f$’s birth cohort, and $Freq^{G}_f$ measures the
relative frequency of group $f$’s genotype.[7]
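The aggregation in Equations 16-17 is a weighted average of group-level simulated moments. A minimal sketch, with hypothetical frequencies and probabilities in the example:

```python
def group_weight(freq_bc, freq_g, pi_no_smoke, never_smoker, p_pref=1.0, p_add=1.0):
    """Equation 17: population weight for a simulation group."""
    if never_smoker:
        return freq_bc * freq_g * pi_no_smoke
    return freq_bc * freq_g * (1.0 - pi_no_smoke) * p_pref * p_add


def simulated_moment(groups):
    """Equation 16.

    groups is an iterable of (omega_f, n_lf, m_lf) triples: the group weight,
    the number of simulated observations behind the group average, and the
    group average of the moment.
    """
    groups = list(groups)
    numerator = sum(w * n * m for w, n, m in groups)
    denominator = sum(w * n for w, n, _ in groups)
    return numerator / denominator


# Two hypothetical groups sharing a cohort and genotype.
w_never = group_weight(0.2, 0.1, pi_no_smoke=0.4, never_smoker=True)
w_smoker = group_weight(0.2, 0.1, pi_no_smoke=0.4, never_smoker=False,
                        p_pref=0.7, p_add=0.6)
print(simulated_moment([(w_never, 1000, 0.0), (w_smoker, 1000, 0.3)]))
```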
Our estimator minimizes the weighted sum of squared distances between simulated and empirical
values for the 171 moments described in Appendix Section 7.1.[8] These moments include age-specific smoking rates, the frequency of intensity categories conditional on smoking, the fractions of
individuals who are in bad health and who have a major lung illness, as well as annual death rates. Moments
based on maximum cigarettes ever smoked and the lung illness indicator are evaluated conditional
on genotype, matching the descriptive regressions examined earlier.
[5] Note that $S_{BC}$ does not include all birth cohorts in the empirical sample. Since calendar time is a
state variable in the model, every birth cohort that is simulated requires distinct value functions,
increasing computational expense. Although we use all birth cohorts when calculating our empirical
moments, we only simulate the model for an evenly spaced subset of the birth years spanned by
our sample.
[6] There are 81 possible genotypic combinations (four SNPs and three possible values for the allele
counts at each SNP), and we observe 80 in our sample. To cut down the computational expense
of searching for the type mixing parameters, we also exclude from the simulated model genotypic
combinations that are extremely rare. Specifically, we exclude the 22 smallest genotypic groups
in our sample. These groups together account for about 2% of our sample. We use data on all
individuals for constructing the empirical moments.
[7] The $Freq^{BC}_f$ measure is based on the sizes of these birth cohorts in U.S. Census data. Using IPUMS
data from the 1960 and 1980 U.S. Censuses, we sum up the sampling weights of all individuals born
in each cohort. For birth cohorts between 1920-1940, we use the 1960 Census, and for those cohorts
between 1941-1960, we use the 1980 Census. We split the cohorts in this way to make sure that
mortality does not bias our calculation of relative cohort sizes for older birth cohorts. The relative
genotypic frequencies $Freq^{G}_f$ are directly calculated from the HRS sample.
Tables 9-10 present the structural parameter estimates. We estimate three distinct preference
types for the parameter α0 : 0.008, 0.068, and 0.231. Conditional on being an ever-smoker type,
these preferences occur with probabilities 0.72, 0.17, and 0.12, respectively. Individuals are further
differentiated by addiction types. We estimate substantial differences in the two assumed addiction
types, which take α1 values of -1.88 and 0.26, respectively. We estimate that 69 percent of the
ever-smoking population is the low cost type, and the freely estimated belief parameter of 0.59 is
quite close to this true proportion.
To assess how well the model fits the data, Table 11 compares several simulated moments with
their empirical counterparts. In general, the model fits the data quite well, with some noteworthy
exceptions. The model seems to under-predict binary smoking in the late 50s (0.22 vs. 0.28 in
the data), and over-predict smoking later in life. The model also over-predicts light smoking and
under-predicts heavy smoking at older ages. However, the model matches the decline in smoking
as individuals pass through their 60s and 70s, and it matches the distribution of maximum cigarettes
per day at age 55 fairly well.
Table 12 displays the type probabilities associated with different genotypes. These probabilities
directly inform us about the channels through which these SNPs operate. Our estimates suggest
that rs16969968 has a large effect on the distribution of cigarette preference parameters. About
15 percent of individuals with no copies of the protective reference allele at rs16969968 fall into
the highest preference category, while this is true for only 4 percent of individuals with two copies.
Conversely, while 44 percent of individuals with zero copies are in the low preference category,
this number rises to 51 percent for those with two copies. We find no clear relationship between
rs16969968 and an individual’s addiction type. The results for SNP rs7937 follow a similar pattern,
with extra copies being associated with a smaller probability for the highest preference category,
and a larger probability for the lowest preference category. Our estimates suggest no relationship
between this SNP and the addiction type. For rs1329650, the estimates suggest the opposite
pattern. While extra copies of the reference allele at rs1329650 do not shift the distribution of
preference types, they seem to reduce the probability that an individual is in the high reduction-cost category. Thus it appears that rs1329650 may be working through an addiction channel
distinct from the other SNPs under study. Finally, we note that our estimates for rs13280604 are
difficult to interpret neatly. It appears that extra copies of the reference allele at this location
are associated with a smaller probability of being in the highest preference category, and a much
higher probability of being in the high-cost addiction type. Taken together, the results in Table
12 demonstrate the feasibility of using observational data to map genotypic heterogeneity into the
parameters of a dynamic model of smoking behavior. Furthermore, the results suggest that SNPs
such as rs16969968 and rs1329650 might operate through distinct channels.
[8] We weight all moments equally, with the exception of the extensive margin smoking moments,
which receive 10 times the weight of other moments.
Table 1: Cross Sectional Characteristics in HRS (At Last Observation)
Variable             Mean     Std. Dev.   N
Age                  73.74    7.63        8140
Male                 0.43     0.50        8140
Ever Smoked          0.57     0.49        8140
Max. Cigs Per Day    25.68    17.90       4603
Ever Lung Illness    0.18     0.38        8060
Ever Heart Illness   0.38     0.48        8140
Ever Cancer          0.23     0.42        8140
Bad Health           0.27     0.45        8134
Table 2: Maximum Cigarettes Per Day (MaxCigs>0)
               (1)          (2)          (3)          (4)          (5)
rs16969968     -1.320***                                           -1.326***
               (0.389)                                             (0.389)
rs13280604                  -0.744*                                -0.757*
                            (0.434)                                (0.433)
rs7937                                   -1.008***                 -1.030***
                                         (0.362)                   (0.361)
rs1329650                                             -0.129       -0.136
                                                      (0.401)      (0.401)
Observations   4603         4603         4603         4603         4603
R2             0.097        0.096        0.097        0.095        0.100
Table 3: Contemporaneous Smoking Outcomes - Ever-Smokers, All Person-Year Obs
               Smoke      Quit       Quant.        Quant.
                                     (w/ zeros)    (given smoking)
rs16969968     0.013+     0.005      0.060         -0.680*
               (0.008)    (0.007)    (0.183)       (0.349)
rs13280604     -0.014+    -0.015*    -0.390*       -0.017
               (0.009)    (0.008)    (0.201)       (0.409)
rs7937         0.008      0.006      -0.060        -0.725**
               (0.008)    (0.007)    (0.176)       (0.323)
rs1329650      -0.008     -0.007     -0.098        0.102
               (0.008)    (0.008)    (0.193)       (0.376)
Observations   38782      8060       35114         9061
R2             0.067      0.020      0.058         0.093
Table 4: Lung Illness
                (1)          (2)          (3)
rs16969968      -0.029***                 -0.024**
                (0.010)                   (0.010)
rs13280604      0.003                     0.005
                (0.011)                   (0.011)
rs7937          -0.004                    -0.000
                (0.009)                   (0.009)
rs1329650       -0.026***                 -0.027***
                (0.010)                   (0.010)
MaxCigsPerDay                0.004***     0.003***
                             (0.000)      (0.000)
Observations    4603         4633         4603
R2              0.045        0.029        0.048
Table 5: Heart Illness
rs16969968      0.017+      0.020*
                (0.011)     (0.011)
rs13280604      0.021*      0.022*
                (0.012)     (0.012)
rs7937          -0.007      -0.003
                (0.010)     (0.010)
rs1329650       0.003       0.003
                (0.011)     (0.011)
MaxCigsPerDay   0.002***    0.002***
                (0.000)     (0.000)
Observations    4603        4633        4603
R2              0.070       0.068       0.071
Table 6: Ever Cancer
rs16969968      -0.003      -0.001
                (0.010)     (0.010)
rs13280604      0.012       0.014
                (0.011)     (0.011)
rs7937          -0.010      -0.006
                (0.009)     (0.009)
rs1329650       -0.012      -0.011
                (0.010)     (0.010)
MaxCigsPerDay   0.002***    0.002***
                (0.000)     (0.000)
Observations    4603        4633        4603
R2              0.044       0.041       0.045
Table 7: Mortality (One-Year Death Rate), Linear Probability
                All Cohorts   Born 1920-1939   Born 1940-1949
rs16969968      -0.003        -0.002           -0.004**
                (0.002)       (0.003)          (0.002)
rs13280604      -0.003+       -0.001           -0.005***
                (0.002)       (0.003)          (0.002)
rs7937          -0.002+       -0.002           -0.003*
                (0.002)       (0.002)          (0.002)
rs1329650       -0.001        -0.001           -0.002
                (0.002)       (0.003)          (0.002)
Observations    17566         11069            6497
R2              0.019         0.016            0.009
Table 8: Health Outcomes - Never Smokers
                Ever Lung   Ever Heart   Ever Cancer   Bad Health
rs16969968      -0.011      -0.002       -0.008        0.005
                (0.008)     (0.012)      (0.010)       (0.011)
rs13280604      0.012       0.008        -0.010        0.012
                (0.009)     (0.013)      (0.012)       (0.012)
rs7937          0.006       0.007        0.013         0.003
                (0.007)     (0.012)      (0.010)       (0.010)
rs1329650       -0.003      0.009        -0.007        -0.002
                (0.008)     (0.013)      (0.011)       (0.011)
Observations    3427        3427         3427          3426
R2              0.031       0.064        0.040         0.083
Figure 1: Smoking Intensity by Age for Continuous Smokers - NLSY
Table 9: Parameter Estimates - Period Utility and Stocks
Utility Parameters
  α0, Pref. Type 1          0.0082
  α0, Pref. Type 2          0.0677
  α0, Pref. Type 3          0.2309
  α0S                       -0.1115
  α3 (Smoking Illness)      -0.1569
  α4 (Bad Health)           -0.2446
  log(σ)                    -0.9708
Reduction Cost Params
  α1, Add. Type 1           -1.8752
  α1, Add. Type 2           0.2627
  log(σ^cost)               -0.4029
  log α2                    -4.1629
Initiation
  γ0                        -5.3384
  γ1 (Age)                  0.4292
  γ2 (Age Sq.)              -0.0107
  γ3 (Year Born)            -0.0121
Other Parameters
  δa1                       0.3874
  δa2                       0.0500
  δh                        0.4125
  β                         0.8801
  π^{low,b}                 0.5906
  π^ℓ                       0.1636
Avg. Type Probabilities
  Non-Smoking Type          0.3098
  Pref 1                    0.7168
  Pref 2                    0.1662
  Pref 3                    0.1171
  Add 1                     0.6922
  Add 2                     0.3078
Table 10: Parameter Estimates - Health Processes
Death Process
  β^d_0         -4.0472
  β^d_age       -0.0081
  β^d_age2      0.0295
  β^d_s         0.3357
  β^d_o         1.6855
  β^d_h         0.0017
Lung Illness Process
  β^s_0         -3.4998
  β^s_age       0.0133
  β^s_age2      0.0006
  β^s_h         0.0259
  β^s_ageh      0.0033
Bad Health Process
  β^o_0         -1.3899
  β^o_age       -0.0001
  β^o_age2      0.0002
  β^o_h         -0.1589
  β^o_ageh      0.0365
  β^o_bo        0.0092
  β^o_boa       0.0007
Health Process Correlation
  σ12           0.3700
Table 11: Empirical and Simulated Moments

          Smoking (Binary)     Smoke Light          Smoke Heavy
Ages      Emp.      Sim.       Emp.      Sim.       Emp.      Sim.
55-59     0.2789    0.2231     0.3904    0.3636     0.2358    0.3183
60-64     0.2317    0.1983     0.4243    0.4375     0.2022    0.2309
65-69     0.1944    0.1756     0.4765    0.5347     0.1773    0.1327
70-74     0.1459    0.1542     0.5236    0.6020     0.1293    0.0724
75-79     0.1051    0.1366     0.6256    0.6379     0.0869    0.0486
80-84     0.0805    0.1312     0.6945    0.6423     0.0576    0.0414
85-89     0.0627    0.1412

Smoke Categories Age 55
Category  Emp.      Sim.
Medium    0.2174    0.2387
Heavy     0.2714    0.2701

          Bad Health           Death if Smoke       Ever Lung Illness
Ages      Emp.      Sim.       Emp.      Sim.       Emp.      Sim.
55-59     0.2114    0.2159     0.0094    0.0110     0.0485    0.0814
60-64     0.2291    0.2448     0.0152    0.0162     0.0779    0.1038
65-69     0.2504    0.2746     0.0278    0.0258     0.0988    0.1277
70-74     0.2750    0.3041     0.0418    0.0383     0.1326    0.1478
75-79     0.3159    0.3279     0.0572    0.0598     0.1597    0.1664
80-84     0.3428    0.3455     0.1069    0.0808     0.1742    0.1862
85-89     0.3782    0.3359     0.0909    0.1094     0.1930    0.2059

          Bad Health if Smoke  Ever Lung if Smoke   Death
Ages      Emp.      Sim.       Emp.      Sim.       Emp.      Sim.
55-59     0.2982    0.2521     0.1680    0.1594     0.0030    0.0088
60-64     0.3198    0.2896     0.2256    0.2084     0.0064    0.0125
65-69     0.3393    0.3164     0.2728    0.2523     0.0107    0.0199
70-74     0.3590    0.3613     0.2724    0.2914     0.0231    0.0292
75-79     0.4214    0.4003     0.3195    0.3250     0.0390    0.0449
80-84     0.4479    0.4096     0.3549    0.3577     0.0632    0.0627
85-89     0.4179    0.3905     0.4706    0.3722     0.0960    0.0863
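One simple way to read the fit in Table 11 is to compute the gap between each empirical and simulated moment. The sketch below does this for the unconditional Death panel only. The unweighted absolute gap is an illustrative summary chosen here, not the criterion used in estimation, and the names `absolute_gaps`, `DEATH_EMP`, and `DEATH_SIM` are hypothetical.

```python
# Unconditional one-year death rates from the "Death" panel of Table 11.
AGE_GROUPS = ["55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85-89"]
DEATH_EMP = [0.0030, 0.0064, 0.0107, 0.0231, 0.0390, 0.0632, 0.0960]
DEATH_SIM = [0.0088, 0.0125, 0.0199, 0.0292, 0.0449, 0.0627, 0.0863]


def absolute_gaps(empirical, simulated):
    """Return |simulated - empirical| for each pair of moments."""
    if len(empirical) != len(simulated):
        raise ValueError("moment vectors must have the same length")
    return [abs(s - e) for e, s in zip(empirical, simulated)]


gaps = absolute_gaps(DEATH_EMP, DEATH_SIM)
mean_gap = sum(gaps) / len(gaps)                 # unweighted average gap
worst_age = AGE_GROUPS[gaps.index(max(gaps))]    # age group with the largest gap

for age, gap in zip(AGE_GROUPS, gaps):
    print(f"{age}: {gap:.4f}")
print(f"mean absolute gap: {mean_gap:.4f}, largest at ages {worst_age}")
```

The same function can be applied to any other panel of the table by swapping in that panel's two columns.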
Table 12: Type Probabilities by Genotype
Genotype        Prob. Pref 1   Prob. Pref 2   Prob. Pref 3   Prob. Add. 1   Prob. Add. 2
rs16969968
  0 copies      0.4372         0.1014         0.1516         0.4725         0.2177
  1 copy        0.4784         0.0939         0.1178         0.5137         0.1765
  2 copies      0.5113         0.1355         0.0433         0.4418         0.2483
rs13280604
  0 copies      0.4988         0.0973         0.0941         0.5242         0.1660
  1 copy        0.4921         0.1348         0.0633         0.4252         0.2650
  2 copies      0.4670         0.1741         0.0491         0.3114         0.3788
rs7937
  0 copies      0.4825         0.1079         0.0999         0.4781         0.2121
  1 copy        0.4972         0.1157         0.0773         0.4777         0.2124
  2 copies      0.5071         0.1225         0.0606         0.4772         0.2130
rs1329650
  0 copies      0.4920         0.1240         0.0742         0.4354         0.2548
  1 copy        0.4967         0.1072         0.0863         0.5140         0.1762
  2 copies      0.5012         0.0937         0.0952         0.5633         0.1269
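A quick adding-up check helps in reading Table 12. For every SNP and allele count, the three preference-type probabilities and the two addiction-type probabilities each sum to roughly 0.690, which matches one minus the average non-smoking type probability reported in Table 9 (0.3098). This suggests, though the text does not say so explicitly, that the Table 12 entries are not conditioned on being a smoking type. The sketch below verifies the arithmetic; `TABLE_12`, `P_NON_SMOKING`, and `adding_up_gaps` are hypothetical names, and the interpretation is an assumption.

```python
# Table 12 values: SNP -> copies of reference allele ->
# (Pref 1, Pref 2, Pref 3, Add. 1, Add. 2)
TABLE_12 = {
    "rs16969968": {0: (0.4372, 0.1014, 0.1516, 0.4725, 0.2177),
                   1: (0.4784, 0.0939, 0.1178, 0.5137, 0.1765),
                   2: (0.5113, 0.1355, 0.0433, 0.4418, 0.2483)},
    "rs13280604": {0: (0.4988, 0.0973, 0.0941, 0.5242, 0.1660),
                   1: (0.4921, 0.1348, 0.0633, 0.4252, 0.2650),
                   2: (0.4670, 0.1741, 0.0491, 0.3114, 0.3788)},
    "rs7937":     {0: (0.4825, 0.1079, 0.0999, 0.4781, 0.2121),
                   1: (0.4972, 0.1157, 0.0773, 0.4777, 0.2124),
                   2: (0.5071, 0.1225, 0.0606, 0.4772, 0.2130)},
    "rs1329650":  {0: (0.4920, 0.1240, 0.0742, 0.4354, 0.2548),
                   1: (0.4967, 0.1072, 0.0863, 0.5140, 0.1762),
                   2: (0.5012, 0.0937, 0.0952, 0.5633, 0.1269)},
}
P_NON_SMOKING = 0.3098  # average non-smoking type probability, Table 9


def adding_up_gaps(table, p_non_smoking):
    """Deviation of each row's preference and addiction sums from 1 - P(non-smoking)."""
    target = 1.0 - p_non_smoking
    gaps = {}
    for snp, by_copies in table.items():
        for copies, (p1, p2, p3, a1, a2) in by_copies.items():
            gaps[(snp, copies)] = (abs(p1 + p2 + p3 - target),
                                   abs(a1 + a2 - target))
    return gaps


gaps = adding_up_gaps(TABLE_12, P_NON_SMOKING)
worst = max(max(pair) for pair in gaps.values())
print(f"largest deviation from 1 - P(non-smoking): {worst:.4f}")
```

Deviations of this size are what four-decimal rounding of the table entries would produce.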
6 Appendix
6.1 Moments used in Estimation
A total of 171 moments are used in the estimation, which we itemize below. Unless otherwise
noted, we condition on age by considering means across the following seven age groups: 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, and 85-89.
• Initiation Moments: average age at start, fraction of starts occurring before age 15, and the
fraction of starts occurring after age 30. (3 moments)
• Smoking Extensive Margin: Fraction of individuals who are smoking by age group. (7 more
to 10 moments)
• Smoking Intensive Margin: Fraction of smokers choosing categories 1 and 3, by age group.
We do not calculate these moments for the oldest age group due to concerns over small sample
sizes. (12 more to 22 moments)
• Quitting: Fraction of smokers quitting two years later, by age group. We do not calculate
this for the oldest age group. (6 more to 28)
• Death Rates: Fraction of individuals that die each year, by age group. These are computed both unconditionally and conditional on current (binary) smoking status. (14 more
to 42 moments)
• Lung Illness: Fraction of individuals that have ever reported a major lung illness, by age
group. These are also computed conditional on current (binary) smoking status. (14 more to
56 moments)
• Bad Health Status: Fraction of individuals that are currently in bad health, by age group.
These are also computed conditional on current (binary) smoking status. (14 more to 70
moments)
• Early Death Rate: Fraction of individuals that die each year between the ages of 36 and 40.
(1 more moment to 71).
• Ever Smoking Rates (Age ≈55): Fraction of individuals at age 55 that have ever smoked. This
is calculated conditional on inclusion in each of 5 birth year groups: 1930-1934, 1935-1939,
1940-1944, 1945-1949, and 1950-1954. (5 more moments to 76).
• Maximum Cigarettes Per Day: Fraction of ever smokers whose maximum past consumption
was equal to the second and third intensity categories, respectively. This is evaluated at age
55. (2 more moments to 78).
• Smoking Category Transitions: We look at two-year transition rates between smoking categories. For each of the three intensity categories, we calculate the fraction of smokers in each
category who end up smoking in each of the three categories two years later. We do not
calculate these for the oldest two age categories. (45 more moments to 123).
• Conditional Death Rates: Fraction of individuals that die each year conditional on lung illness
state and bad health state. We do this for each age category with the exception of the oldest
group. (12 more moments to 135).
• Genotype Specific Moments: For each of the four SNPs of interest, we calculate the fraction
of ever-smokers whose maximum consumption at age 55 was equal to the second and third
intensity categories conditional on having 0, 1, or 2 copies of the reference allele for that SNP.
By allele counts for each SNP, we also calculate the fraction of ever smokers who have been
diagnosed with a lung illness. (36 more moments to 171).
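The running totals in the list above can be checked with a short tally. In the sketch below the group counts are taken from the list, while the factorizations in the comments (for example 3 x 3 transitions over 5 age groups) are one reading of how each count arises and should be treated as assumptions where the text does not spell them out. The names `MOMENT_GROUPS` and `running_totals` are hypothetical.

```python
# Moment groups from the itemized list; each count is as stated in the text.
# Comments give an assumed factorization consistent with that count.
MOMENT_GROUPS = [
    ("Initiation", 3),
    ("Smoking extensive margin", 7),               # 7 age groups
    ("Smoking intensive margin", 2 * 6),           # categories 1 and 3, oldest group dropped
    ("Quitting", 6),                               # oldest group dropped
    ("Death rates", 2 * 7),                        # unconditional and by smoking status
    ("Lung illness", 2 * 7),
    ("Bad health status", 2 * 7),
    ("Early death rate", 1),
    ("Ever smoking rates", 5),                     # 5 birth-year groups
    ("Maximum cigarettes per day", 2),             # intensity categories 2 and 3
    ("Smoking category transitions", 3 * 3 * 5),   # oldest two age groups dropped
    ("Conditional death rates", 2 * 6),            # 2 health states, oldest group dropped
    ("Genotype-specific", 4 * 3 * (2 + 1)),        # 4 SNPs, 3 allele counts, 2 intensity + 1 lung
]


def running_totals(groups):
    """Return (label, count, cumulative total) for each moment group."""
    rows, total = [], 0
    for label, count in groups:
        total += count
        rows.append((label, count, total))
    return rows


rows = running_totals(MOMENT_GROUPS)
for label, count, total in rows:
    print(f"{label:<32}{count:>4}{total:>6}")
```

The cumulative column reproduces the totals quoted in each bullet and ends at 171.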
References
Beauchamp, J. P., D. Cesarini, M. Johannesson, M. J. H. M. van der Loos, P. D.
Koellinger, P. J. F. Groenen, J. H. Fowler, J. N. Rosenquist, A. R. Thurik, and
N. A. Christakis (2011): “Molecular Genetics and Economics,” Journal of Economic Perspectives, 25(4), 57–82.
Centers for Disease Control and Prevention (2010): How Tobacco Smoke Causes Disease: The
Biology and Behavioral Basis for Smoking-Attributable Disease: A Report of the Surgeon General.
Centers for Disease Control and Prevention, Atlanta, Georgia.
Johnson, A. D., R. E. Handsaker, S. L. Pulit, M. M. Nizzari, C. J. O’Donnell, and
P. I. de Bakker (2008): “SNAP: A Web-Based Tool for Identification and Annotation of Proxy
SNPs Using HapMap,” Bioinformatics, 24(24), 2938–2939.
Liu, J. Z., F. Tozzi, D. M. Waterworth, S. G. Pillai, P. Muglia, L. Middleton,
W. Berrettini, C. W. Knouff, X. Yuan, G. Waeber, et al. (2010): “Meta-Analysis
and Imputation Refines the Association of 15q25 with Smoking Quantity,” Nature Genetics,
42(5), 436–440.
Munafo, M. R., T. G. Clark, E. C. Johnstone, M. F. Murphy, and R. T. Walton (2004):
“The Genetic Basis for Smoking Behavior: A Systematic Review and Meta-Analysis,” Nicotine
& Tobacco Research, 6(4), 583–597.
Rietveld, C. A., D. Conley, N. Eriksson, T. Esko, S. E. Medland, A. A. Vinkhuyzen,
J. Yang, J. D. Boardman, C. F. Chabris, C. T. Dawes, et al. (2014): “Replicability and
Robustness of Genome-Wide-Association Studies for Behavioral Traits,” Psychological Science,
25(11), 1975–1986.
The Tobacco and Genetics Consortium (2010): “Genome-wide Meta-analyses Identify Multiple Loci Associated with Smoking Behavior,” Nature Genetics, 42, 441–449.
Thorgeirsson, T. E., D. F. Gudbjartsson, I. Surakka, J. M. Vink, N. Amin, F. Geller,
P. Sulem, T. Rafnar, T. Esko, S. Walter, et al. (2010): “Sequence Variants at CHRNB3-CHRNA6 and CYP2A6 Affect Smoking Behavior,” Nature Genetics, 42(5), 448–453.