Download Personalizing public health

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Genetic drift wikipedia , lookup

Heritability of IQ wikipedia , lookup

Gene expression programming wikipedia , lookup

Tag SNP wikipedia , lookup

Dominance (genetics) wikipedia , lookup

Fetal origins hypothesis wikipedia , lookup

Public health genomics wikipedia , lookup

Transcript
TOOLS
OF
PERSONALIZED M EDICINE
Personalizing public health
Theodore R Holford1,
Andreas Windemuth2 &
Gualberto Ruaño2†
†Author
for correspondence
University School of
Medicine,
Division of Biostatistics,
Department of Epidemiology
and Public Health,
New Haven,
Connecticut, 06520, USA
2Genomas, Inc.,
67 Jefferson Street, Hartford,
Connecticut, 06106, USA
Tel.: +1 860 545 4574;
E-mail: g.ruano
@genomas.net
1Yale
Keywords: generalized linear
models, linear logistic model,
personalized health,
physiogenomics, regression,
statistical genetics
Public health and medicine share a common
objective of prolonging life. However, the means
by which this can be attained have long represented very different perspectives and strategies.
Medicine has traditionally focused on treating
patients who, by definition, are ill and in need of
restoration to health, while public health seeks to
reduce the number of individuals acquiring illness
in the first place. Hence, medicine generally
focuses on the needs of individuals in order to
deliver the best treatment available to that person,
taking into account their personal and clinical
characteristics. On the other hand, public health
focuses on the needs of the population as a whole,
developing a health program that will serve the
common good by reducing overall disease risk;
with the understanding that a small minority may
not realize the intended benefit.
The two perspectives can be illustrated by the
approaches for dealing with dental caries. A community that adopts a public health approach to
the problem might add fluoride to the water supply, thus increasing the resistance of individuals to
caries by making tooth enamel more resistant.
Alternatively, the community may instead choose
to provide its citizens with dental insurance, in
which caries are treated once they develop. The
former is certainly more cost effective from the
perspective of the population, but the latter would
avoid inducing any rare side effects that may result
from the addition of something to the public
water supply. Some individuals may have a gene
or a lifestyle that makes them more resistant to the
development of caries, so that they would personally receive little benefit from the addition of fluoride to the water, yet they may suffer some risk of
harm from it. If it were possible to identify those
individuals who would benefit most from a particular form of treatment or disease prevention, then
it may be possible to optimize the program as a
whole, by avoiding side effects and increasing efficiency. In effect, this effort would bring together
aspects of the medical and the public health perspectives by personalizing the public health strategies, thus providing a method by which
individuals can optimize their own health, and in
the process benefit the population at large.
10.2217/17410541.2.3.239 © 2005 Future Medicine Ltd ISSN 1741-0541
In this report, the manner in which data from a
variety of sources can be used to arrive at statistical
models that may be used in prediction is considered. First, the study designs that are employed, in
addition to the behavior of the variables, are descibed. In addition, the basic structure of a generalized linear model, indicating how this can be
adapted in order to cover a wide variety of different
types of data, is set out. These require specific ways
for dealing with measurement types when the factor is a response and when it is a predictor. Some of
the issues involved in selecting prediction models
and assessing their adequacy are also discussed.
Cause versus prediction in genomics
The issue of causal inference involves some of the
deepest concepts in the philosophy of science, and
it is arguable whether a strict adherence to causal
reasoning can ever be fully realized in health science. Knowledge that a particular genetic polymorphism will produce a given effect on the
outcome may be very difficult, especially in light of
the fact that most biochemical reactions have multiple steps involving different proteins, each one of
which can have their own effect on the phenotypic
result for the individual. A somewhat lower standard would be to determine whether a particular
polymorphism was associated with the outcome, a
relationship that may or may not be causal.
The ultimate goal of science is to understand
causal mechanisms that underlie physiological
processes, thus providing an approach for predicting outcome. However, this may be difficult
when the development of knowledge is at an
early stage, so that a more pragmatic approach
would be to learn about associations that exist
between individual characteristics and the outcome of interest. This would allow us to improve
our prediction of the response by incorporating
information that is related to the outcome. However, when we cannot be certain that the associations are causal, we must maintain a level of
caution in using the predictions. These limitations that exist in a prediction model could be
summarized to users by carefully specifying the
level of error present in the estimates provided by
the prediction model.
Personalized Medicine (2005) 2(3), 239–249
239
TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño
Causal inference in genomics
Causal inference requires that we have knowledge regarding the underlying biology of a
physiological process that is affecting an outcome, so that we can specify quantitatively the
manner in which an individual’s physiology
might behave under a specified set of circumstances. Understanding the process may not
include information about the magnitude of
various rates of reaction, or the manner in
which the effect takes place, so that statistical
inference amounts to the development of estimates of unknown quantities and assessing the
mode of action by determining the agreement
between observed values and levels predicted by
a model. In the case of genomic information,
the factor level would require the identification
of a particular gene that may be recognized by
detecting the number of copies of a known single nucleotide polymorphism (SNP). The identification of causal SNPs, and the mechanisms
by which they act, are ideal for predicting an
outcome, and these should be used whenever
available. When specific polymorphisms that
cause disease are known, the magnitude of their
association with the response is huge and highly
predictive for an individual who harbors it (for
example, the effect of BRCA1 and BRCA2 on
breast cancer risk [1,2]). By the same token, the
process required to find these polymorphisms
can take considerable time. However, this process continues to improve, as in the case of the
elegant work used to discover the gene causing
macular degeneration [3].
Prediction of health outcomes
Statistical models that can be used to predict
health outcomes would make use of demographic, clinical and genomic information, in
order to arrive at a model that predicts the outcome. The goal is not necessarily to discover
clinical variables and alleles that cause an effect,
but to make use of the associations that exist. In
this sense, the aims are somewhat different
from those in a strictly scientific endeavor that
seeks to understand modes of action, but
instead adopts a more pragmatic approach of
developing methodology that provides predictions that have sufficient accuracy to be of practical use to clinicians and healthcare providers.
The outcomes may be qualitative, as in the case
of binary indicators of disease status, or continuous indicators such as body mass index (BMI),
blood sugar or cholesterol. While there are
aspects of the process that will need to be
240
modified in order to arrive at assumptions that
agree with the data, the approaches are similar,
in that they can be dealt with by using the concepts underlying generalized linear models for
the analysis of health data.
Data sources
Study designs
Controlled experiments
A controlled experiment is a carefully constructed study in which the groups being compared only differ from each other by the factor
under study. In that sense, the intent is to mimic
a laboratory experiment, in which the careful
scientist can exercise considerable control over
the experimental units so that they are all handled in the same way, except for the factor or
treatment of interest. In population studies, the
randomized control trial is the most commonly
used design for this type of study, in which the
groups to be compared are made comparable
through random assignment of subjects to the
groups in question. In this case, the rules of
chance make it possible to achieve comparability, although there is always the prospect that
the luck of the draw may turn out to be unfavorable in terms of balance. Hence, the underlying
approach may be augmented by adopting alternative strategies that force balance among the
groups to be compared, especially for factors
that are known to be strongly related to the outcome. A considerable effort is required to
launch a controlled experiment, including the
recruitment of subjects and physicians who are
willing to meet the requirements of random
assignment. If the outcome is the occurrence of
a relatively rare event, such as the development
of a disease or syndrome, then the sample size
required for valid inferences could be large and
the follow-up time lengthy, thus further increasing the cost of the study. In general, these studies, although extremely valuable because of their
validity if carefully conducted, can be extremely
costly. Thus, they are often not the first
approach used to look at a particular association, but are instead used as a final justification
for a particular treatment strategy.
Observational studies
An observational study involves the assessment
of associations discovered in subjects who have
been recruited for a particular project. In some
instances, the subjects are selected as representatives of the population at large. This is sometimes referred to as a cross-sectional study. In
Personalized Medicine (2005) 2(3)
Personalizing public health – TOOLS OF PERSONALIZED MEDICINE
other instances the subjects are recruited as representatives of particular groups, and then followed to determine their clinical course – a
cohort study. Alternatively, subjects with a particular disease and an appropriate control group
may be recruited and their past clinical history
or exposure gleaned through the use of interviews or chart reviews, which are known as
case–control studies. The disadvantage in
observational studies is that the carefully structured study design provided by a controlled
experiment no longer applies. Nevertheless, the
importance of avoiding the bias that can arise
when groups are not comparable remains as
important as ever. Therefore, there is considerable effort to control for imbalances by employing statistical models that make use of variables
that may be confounding an estimated association. Typically, these are variables that have an
imbalanced distribution in the groups being
compared, and they are associated with the
response of interest.
equation, and these may be similarly classified as
continuous, binary or polychotomous. For a prediction model, the regressor variables comprise
the information that is to be used to predict the
response, and in a physiogenomic model, these
would include a mixture of demographic, physiologic and clinical data, along with information
on the genotype for the individual.
Generalized linear models provide an
extremely powerful framework for developing
statistical models that may be used for any of
these types of outcomes. The mathematical
expression, or model, that relates the outcome,
Y, to a set of regressor variables,
X = ( X 1, X2, …, Xp ) is given by:
Y = µ( X) + ε
where µ(X) is the systematic part of the equation
which describes the manner in which the regressor variables affect the average level of the outcome, µ ( X ) = E [ Y X ] , and ε is the unexplained
or random part of the model. Properties of a
good model are:
Types of predicted outcomes
The goal of prediction modeling is to develop a
mathematical expression that may be used to
predict or estimate the unobserved outcome in
an individual based on an observed set of predictor variables. The outcome or response (sometimes called the dependent variable) may be:
Accurate representation of the systematic
component of the model, µ(X)
Continuous measure
Appropriate characterization of random error, ε
•
•
•
•
Body mass
High-density lipoprotein cholesterol
Low-density lipoprotein cholesterol
Blood pressure
Binary
• Obese (yes/no)
• Neuralgia (yes/no)
• Hypertension (yes/no)
Ordered polychotomous
• Obesity (normal/overweight/obese/morbidly
obese)
• Muscular pain (none/mild/moderate/severe)
Nominal (not ordered) polychotomous
• Organ system affected (respiratory, digestive,
circulatory, neurological)
Regressor variables (sometimes called independent variables) are used in the prediction
www.futuremedicine.com
• Correct form for the equation
• Include all relevant regressor variables
• Preferably do not include unimportant variables, i.e., parsimony
• Zero on average, indicating model and data
are not biased
• Probability distribution is appropriately characterized
- Continuous – normal or Gaussian, γ
- Binary – binomial
- Counts – Poisson or negative binomial
• Variance is small – precision is great
In a linear model [4], the systematic component
is comprised of the sum of each regressor variable
multiplied by a regression parameter that is usually not known, but estimated from the data, i.e.,
µ ( X ) = β0 + X1 β1 + X2 β2 + … + Xp βp
Models of this form are the standard regression
model. The error term, ε, is assumed to be normal with a mean of zero and unspecified variance, which is a constant that does not depend
on the mean. However, a generalization in which
a model has this form after taking a specified
transformation of the mean greatly enhances
241
TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño
ones ability to consider different types of data, i.e.,
a model of the form:
η ( X ) = h [ µ ( X ) ] = β0 + X1 β1 + X2 β2 + … + Xp βp
The term η ( X ) is called the linear predictor, and
the function, h(.), which transforms the mean
into the linear part of the model, is called the
link function [5].
As an example, consider a binary response,
Y = 0 or 1, for which the linear logistic model is
commonly used [6]. On average, the response takes
the value of the probability of observing a one,
and for the commonly used linear logistic model,
this is expressed as:
β0 + X1 β 1 + … + Xp βp
e
Pr { Y X } = µ ( X ) = ---------------------------------------------------β0 + X1 β 1 + … + Xp Bp
1+e
Notice that the model contains a part that has the
familiar mathematic form used in standard regression, i.e., the sum of products of regressor variables and regression coefficients. To obtain an
expression that only has that form, one would use
a logit link function to transform the mean into
the linear predictor [5,7]:
µ( X)
logit[µ ( X ) ] = log --------------------- = β0 + X1 β 1 + … + Xp β p
1 – µ(X )
Clinical predictors
Demographic and personal characteristics
The most commonly-used demographic characteristics are age, gender and ethnicity. Aging
clearly affects the clinical course of most diseases,
as gender also does, perhaps due to the hormonal
and physiological differences between men and
women. Ethnic differences in disease risk may, in
part, reflect genetic variation, but they could also
be the result of diet and other practices that influence the prognosis of individuals in these groups.
Other social and economic characteristics of individuals can influence personal health, either
through knowledge or resources that empower
them to take control of factors that affect disease
risk. In developing a model that can be used to
predict or to characterize a likely course of symptomatology, the effect of each of these factors
needs to be explored.
Clinical measurements
In the interest of planning for the health needs of
individual patients, it is important to include
information on the status of clinical symptoms at
baseline, or even before if the information is available. In some instances, the baseline data may
consist of information on levels prior to the treatment regimen. For example, BMI at baseline is
likely to be strongly related to BMI 6 months
242
later, so it would clearly be relevant to include this
information in a prediction model. In addition,
blood sugar, insulin level or other measures at
baseline that might be related to an individual’s
metabolic rate could add information to the prediction. Perhaps not as obvious would be the
problem of making sense of the effect of baseline
BMI on temporal changes. An individual with a
high BMI may find it easier or harder to lose
weight, but an analysis that suggests that higher
baseline levels are associated with greater weight
loss may not be an indication of a greater ability to
lose weight as BMI increases. It could actually be
due, in part, to regression to the mean, which
results in patients with extreme levels of any variable tending to have subsequent values that are
closer to the average. Nevertheless, for the purpose
of obtaining a predicted value for a future level of
BMI, this distinction may not be important,
because the objective is to obtain a good prediction model, which is not necessarily the same as
identifying causal relationships.
Coding categorical predictors
For a binary variable there is no difference analytically between a comparison of two nominal levels
(presence vs absence of a factor) and two ordinal
levels (high vs low). In each case, a single parameter provides a measure of comparison, for example, the difference in the outcome when the factor
is present and absent, or the difference between a
high and low level of the factor. One convenient
way to specify such a comparison in a regression
model is to introduce a dummy variable that takes
the value of 1 when the factor is present or high
and 0 otherwise. In this case, the mean levels of
the response for the two groups are:
η ( 0 ) = β0 + 0 ⋅ β1 = β0
for X = 0, the reference level and
η ( 1 ) = β0 + 1 ⋅ β1 = β0 + β1
for X = 1, the comparison level. Notice that the
parameter represents the difference in the linear
predictors for the two levels of the factor,
∆η = η ( 1 ) – n ( 0 ) .
For a categorical variable with more than two
levels, one must first consider whether one is dealing with nominal categories, in which there is no
particular form for the relationship, or ordered categories, which would enable one to capitalize on
the relationship among the levels. Extending the
approach used for two categorical levels, one can
introduce additional dummy variables which take
the value 1 for the comparison level of interest, and
Personalized Medicine (2005) 2(3)
Personalizing public health – TOOLS OF PERSONALIZED MEDICINE
0 otherwise. When we apply this rule, the reference level will always be 0. In general, there will
be one fewer dummy variables than there are categories. As an example, consider a four level categorical variable, which would require the
specification of three dummy variables defined as
follows:
X1
X2
X3
Reference
0
0
0
Level 1
1
0
0
Level 2
0
1
0
Level 3
0
0
1
The linear predictors that correspond to these
four levels are:
η ( 0 ) = β0 + 0 ⋅ β1 + 0 ⋅ β2 + 0 ⋅ β3 = β0
for reference,
η ( 1 ) = β0 + 1 ⋅ β1 + 0 ⋅ β2 + 0 ⋅ β3 = β0 + β1
for level 1,
η ( 2 ) = β0 + 0 ⋅ β1 + 1 ⋅ β2 + 0 ⋅ β3 = β0 + β2
for level 2, and
η ( 3 ) = β0 + 0 ⋅ β1 + 0 ⋅ β2 + 1 ⋅ β3 = β0 + β3
for level 3, so that β1 represents the difference
between level 1 and the reference level, and β2
and β3 are similarly interpreted.
The interpretation of the individual parameters depends on the particular model being used
which is specified through the link function in
the generalized linear model. For the standard
regression model that is normally used for a continuous response, an identity link is employed,
µ = η , so that the difference in the linear predictor may be interpreted as a difference in
means. When additional covariates are also
included in the model, these are regarded as
adjusted means, in the same way that adjusted
means are typically compared using analysis of
covariance. On the other hand, when a linear
logistic model is used, the change in the linear
predictor, ∆η , represents a change in a link
function of the means, which is the log odds
ratio for the probability of a binary outcome:
µ1
µ2
– log ---------------∆η = log ---------------n2 – µ2
n 1 – µ1
µ2 ⁄ n2 – µ2
= log ------------------------µ1 ⁄ n1 – µ1
µ2 ( n1 – µ1 )
= log --------------------------µ1 ( n2 – µ2 )
www.futuremedicine.com
In some cases there is a natural candidate for
the reference category, for example, if one is
interested in the effect of introducing new
treatments, then the current standard treatment would be an obvious choice as a reference. However, there are many instances in
which the choice is arbitrary, or up to the discretion of the investigator. For this reason, it is
often useful to first consider a joint test for all
of the parameters associated with the dummy
variables for a factor, before trying to interpret
individual parameters. For example, in the four
category case, one would test the null hypothesis, H0: β1 = β2 = β3 = 0. In addition, because
each of these parameters represents a difference
from the reference level, the standard error can
never be less than the standard error for the reference. Therefore, it is best not to choose an
uncommon category as the reference, because
standard errors for the regression parameters
would tend to be large. A common category
would yield the greatest level of precision for
the parameters.
For a factor with many categories, the overall
test tends to have little power when the null
hypothesis is unfocused. For example, if one is
investigating the effect of six ordered categories
on a binary outcome, the overall null hypothesis
would be tested using a Pearson χ2 test with 5
degrees of freedom (df ) with a critical value of
11.07. On the other hand, a more focused
hypothesis concentrating on linear trend,
described quantitatively as the slope, could be
tested using the method described by Armitage
[8], which would be compared to χ2 with 1 df,
with a critical value of 3.84. If the trend is in fact
linear, the latter hypothesis is easier to reject, giving it greater power. On the other hand, a linear
trend can be a strong assumption that would
miss some patterns in ordinal variables, such as
the situation when the trend increases for low
values then decreases for high values.
Biomarkers
Biochemical markers may consist of hormone
levels, metabolic parameters or cumulative levels of environmental exposures, such as polychlorinated biphenyl levels in sera or adipose
tissue. However, the human genome project,
with the accompanying advances in technology
that enables one to identify genomic detail for
an individual, offer exciting new opportunities
that could prove to be useful for planning individual health regimens. As the technology in
this field continues to advance, the cost of
243
TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño
Table 1. Regressor variables used for linear, dominant, recessive, or codominant
effects of a single nucleotide polymorphism.
No. copies
Linear
Dominant
Recessive
Codominant
0
−1/2
-2/3
-1/3
-1/3
1
0
1/3
-1/3
2/3
2
1/2
1/3
2/3
-1/3
identifying large numbers of genetic markers
become more feasible. An important question
is how best to use this information for the good
of an individual’s health.
Genomic data
The contribution of genes to disorders, such as
Huntington’s disease or Down’s syndrome, has
long been recognized. However, with the
increased power provided by new genomic
technologies, one is able to identify specific
gene polymorphisms that are associated with
diseases such as breast cancer [1] and macular
degeneration [3]. Cellular decoding of DNA
results in the construction of proteins, so that
variations in genetic expression can ultimately
lead to defective proteins that may, in turn,
affect disease risk. Not all variants in the
genetic code may affect disease risk, so it is not
always clear which variants in genetic expression are important identifiers for disease risk.
In some instances, a measure of the extent of
the match for a particular portion of a gene is
used as the measure of interest. Alternatively,
libraries of specific SNPs are being constructed for use as specific genetic markers.
Ideally, the SNPs will identify critical changes
that result in clinically important changes in
protein structure, and knowledge continues to
grow as more information is gleaned about the
important variants in the genetic sequence and
the effects these may have.
Tissue samples can provide information on
the number of copies of particular alleles
through the identification of SNPs. Current
technological advances in highly parallel genotyping can provide such information on hundreds of alleles at relatively low cost. The
importance of this capability in personalized
public health is the valuable new information
provided to an individual and his/her healthcare
provider. With respect to a prediction model,
SNPs provide the number of copies of a particular allele that is an ordinal categorical variable.
This could be parameterized as any other categorical predictor, but the mode of action suggests particular approaches that would be
244
motivated by the underlying biology, which are
discussed below.
Proteomic data
While DNA provides the necessary code needed
to order the amino acids in a protein, some variants in this order will have little effect on functionality in a biochemical reaction. Thus, the
analysis of protein structure or proteomics offers
a more direct measurement of an enzyme that
may actually be involved in a relevant biochemical reaction for a particular health outcome. At
this stage, this technology is still being rapidly
developed. However, from the perspective of the
statistical analysis, the issues are very similar to
those that arise in the analysis of genomic data,
which are discussed in further detail below.
Coding the genomic markers
SNPs are represented in a data set as a three
level categorical variable, i.e., homozygous
with respect to one of two alleles, or heterozygous. These can be coded by identifying the
number of copies of the specified polymorphism (0, 1 or 2). One approach for parameterizing a SNP effect is to choose a particular
mechanism by which it would have its effect
on the outcome, and to introduce the corresponding regressor variable into the regression
models. For example, if there is an additive
effect of each copy for a particular allele, so
that the difference between 0 and 1 copy was
the same as the difference between 1 and 2
copies, then this would be parameterized as a
linear effect. Similarly, if the effect was dominant, then either 1 or 2 copies would have the
same effect, but these would be different from
0 copies. Table 1 shows the choice of regressor
variables that one might use for linear, dominant, recessive or codominant effects of a SNP.
The key in choosing which regressor variable
to represent a SNP is to appropriately represent
the mode of action, and if this selection is correct then one has the assurance that the estimated effect is both valid (i.e., unbiased) and
optimal (i.e., most precise or smallest standard
error). However, this would be unrealistic in
Personalized Medicine (2005) 2(3)
Personalizing public health – TOOLS OF PERSONALIZED MEDICINE
Table 2. Linear and dominance representation of SNP effects, along with particular combinations of
parameters for linear, dominant, recessive and codominant effects.
Linear
Dominant
Recessive
Codominant
No.
X1
X2
β1X1 + β2X2
β1 = β
β2 = 0
β1 =β2 = β
β1 = −β2 = β
β1 = 0
β2 = 2β
0
-1/2
-1/6
-β1/2-β2/6
-β/2
-2β/3
-β/3
-β/3
1
0
1/3
β2/3
0
β/3
-β/3
2β/3
2
-1/2
-1/6
β1/2-β2/6
β/2
β/3
2β/3
-β/3
that one is typically analyzing a SNP to determine whether it has an effect, usually not
knowing its mode of action on the response.
An alternative method of representing SNPs
in the framework of a generalized linear model is
to introduce two dummy variables, as described
in Table 2. In this particular parameterization, X1
represents a linear or additive contribution of
each copy of a particular allele, and X2 can be
thought of as a dominance factor. The advantage
of this choice of regressor variables is that it
offers one the opportunity to estimate the mode
of action and to test particular hypotheses that
would suggest whether the data indicates a significant departure from a particular mode of
action. Some specific examples are:
• When only X1 is included in the model
(equivalent to specifying β2 = 0), then an
additive effect of the number of copies of the
SNP is implied, with the parameter β1 measuring the difference between the effects of the
two homozygous individuals with the heterozygous lying halfway between.
• When β1 = β2 = β, then the model specifies no
difference in effect when either 1 or 2 copies of
the SNP are present, which is the behavior
expected when the SNP behaves as a classically
dominant gene, and the regression parameter,
β, measures the difference between those with
the SNP and those without.
• When β1 = -β2 = β, then those with 0 or 1
copies have the same effect on the response
and the model measures the effect of a classically recessive SNP.
• When β1 = 0, β2 = 2β, the comparison estimates the difference between the heterozygous
and the homozygous individuals, which represents a codominant mode of action.
• When both X1 and X2 are included in the
model, then the models can specify a rich
variety of modes of action, other than those
that satisfy the classical modes, including
intermediate situations.
www.futuremedicine.com
The relationship between a SNP and an outcome can be visualized in two ways. First, we
can consider the variability of the allele frequency with the level of the response, which
can be displayed graphically by giving the mean
frequency on the vertical axis and the level of
response on the horizontal axis. Alternatively,
we can display the distribution of the response
among groups defined by the allele frequency.
Of course, the form for these relationships
depends on the underlying biological mechanism for the genetic action, i.e., whether the
phenotypic effect of a SNP on the mean
response is linear, dominant, recessive, or codominant. In addition, the effect may not be on
the mean response, but on the variability also.
The graphs in Figure 1 show the manner in
which we might expect to see a particular relationship between the distribution of the response
in different SNP genotypes and the mean allele
frequency as a function of a particular phenotype. We assume that the relative frequency of
SNP haplotypes is governed by Hardy-Wienberg
equilibrium, whereby p = 1-q defines the SNP
frequency. Thus, p2 is the relative frequency of
individuals who are homozygous for the SNP
(AA), q2 would be the frequency of homozygous
wild type (aa), and 2pq heterozygous (Aa). In
Figure 1a we illustrate a display that might be
expected for a linear effect on the mean when the
distribution is Gaussian with constant variance.
The heavy purple line shows the distribution of
the phenotype without regard to the genotype,
and the three other lines give the distributions of
each genotype. In this case, the overall distribution looks very similar to a typical Gaussian
curve, even though the means are quite different.
For even larger differences, multiple modes
might begin to appear for the overall distribution, but they would only be expected for differences that were very large indeed when
compared to the variance or spread of the distribution. If data were available, we would expect a
series of overlaid histograms to appear to be very
245
TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño
Highlights
• Genomic markers can blur the classical distinction between the medical perspective of treating the
individual and the public health perspective of reducing risk for the population by tailoring preventive
interventions to individuals.
• Statistical models for predicting health outcomes use the pragmatic approach of making effective use
of associations between factors and outcome, some of which may not be causally related.
• Generalized linear models provide a framework for quantitative prediction for a variety of different
types of outcome variable, including those that are continuous (cholesterol or blood pressure) and
those that are binary (presence or absence of disease).
• Coding of regressor variables is a powerful tool for estimating the association between the outcome
and potential predicting factors that may be either continuous (age or baseline cholesterol) or
categorical (gender or ethnogeographic ancestry).
• Phenotypic effects of single nucleotide polymorphisms can be coded as linear, dominant, recessive, or
codominant, in order to enable the mode of action of a gene to be specified or inferred for
physiogenomic analysis.
• The relationship between gene expression and a continuous phenotype can be described in terms or
the distribution of the outcome for each genotype, or the number of copies of a particular allele
expected among individuals with a particular level of the phenotype.
• If only the mean level of a phenotype is affected by genotype, and the distribution is Gaussian, then
the relationship is often sigmoid, but can appear to be linear within a narrow range. If the variance of
the distribution is also affected by genotype, then the relationship between the expected number of
allele copies for individuals with a specified level of the outcome can become complex.
• Bringing together physiogenomic elements can provide a powerful approach for personalized health,
initially by identifying individuals who are most likely to benefit from a particular health program and,
eventually, by identifying the health program, amongst various options, which is most likely to benefit
an individual.
similar to the distributions shown in the graph.
Notice that in this hypothetical example, the
mean responses progress in equal increments, so
that the difference in the mean for 1 and 0 copies
of an allele is equal to the difference between 2
and 1. If we were to interchange the axes, displaying the mean number of alleles as a function
of the phenotypic response, we would expect the
relationship to be that shown in the corresponding sigmoid curve. Notice that for low values of
the response, the mean frequency is close to 0,
indicating that these individuals are primarily
homozygous aa. Gradually this begins to
increase, and for intermediate values the relationship is almost linear in this instance. Finally,
the curve approaches an asymptote of 2, which
corresponds to homozygous AA.
Figure 1b shows a similar pair of graphs for a
hypothetical gene with a dominant expression
on the genotype. In this case, the mean
response for Aa and AA would be the same,
with a higher mean response than aa in this
hypothetical example. The relationship
between the mean number of alleles as a function of the phenotypic response is once again a
sigmoid curve, starting near zero, indicating
that these individuals are primarily aa. However, the curve then gradually increases and,
ultimately, approaches an asymptote, which
246
lies between 1 and 2, the actual value being a
weighted average depending on the relative
frequency of Aa and AA.
A recessive SNP would be expected to behave
as shown in Figure 1c, which would be identical to
the case where a SNP is dominant. The graph
showing the sigmoid relationship between the
mean number of alleles and the phenotype now
starts between 0 and 1, a weighted average
depending on the relative frequency of aa and
Aa, and approaches the asymptote of 2.
In the rare event of a codominant effect on the
phenotype, the mean for the homozygous individuals would be the same, in this case lower
than the heterozygous, Aa. The relationship
between the genotype and the mean allele frequency is once again a sigmoid curve starting at
the weighted average of 0 and 2, depending on
the relative frequency of aa and Aa, and
approaching an asymptote of 1. In this example
the sigmoid curve is increasing because allele A is
less common than a. If the converse were true
the curve would be decreasing.
For each of the examples considered thus far,
we have assumed that the effect of the SNP is
on the mean. However, in principle it could
affect any property of the distribution. Suppose
that the variance were affected, and not the
mean, a case illustrated in Figure 1e. For this
Personalized Medicine (2005) 2(3)
Personalizing public health – TOOLS OF PERSONALIZED MEDICINE
Figure 1. The relationship between the distribution of responses for alternative
modes of gene expression and allele frequency.
b)
a)
Total
aa
Aa
AA
Total
aa
Aa
AA
-6
-4
-2
0
2
4
-6
6
-4
-2
0
-2
0
6
2
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
Mean number of As
Mean number of As
2
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
-4
4
Mean
Mean
-6
2
2
4
-6
6
-4
-2
0
2
4
6
Mean response
Mean response
c)
d)
Total
aa
Aa
AA
-6
-4
-2
0
2
4
Total
aa
Aa
AA
6
-6
-4
-2
2
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
-6
-4
-2
0
2
4
6
Mean
2
1.8
1.6
1.4
1.2
1
0.8
0.6
0.4
0.2
Mean number of As
Mean number of As
Mean
0
2
Mean response
4
6
-6
-4
-2
0
2
4
6
Mean response
The relationship between the distribution of responses for alternative modes of gene expression and the
allele frequency for the (a) linear model for mean; (b) dominance model for mean; (c) recessive model for
mean; (d) hybrid model for mean; (e) variance model; and (f) model with varying mean and variance.
www.futuremedicine.com
247
TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño
Figure 1 (continued). The relationship between the distribution of responses for
alternative modes of gene expression and allele frequency.
e)
f)
Total
aa
Aa
AA
-6
-4
-2
0
2
4
Total
aa
Aa
AA
6
-6
-4
-2
Mean
0
2
4
2
4
6
Mean number of As
Mean number of As
Mean
-6
-4
-2
0
2
4
6
Mean response
-6
-4
-2
0
6
Mean response
The relationship between the distribution of responses for alternative modes of gene expression and the
allele frequency for the (a) linear model for mean; (b) dominance model for mean; (c) recessive model for
mean; (d) hybrid model for mean; (e) variance model; and (f) model with varying mean and variance.
example the means are all 0, but the variance or
spread of the distribution is much greater for aa
than AA, while Aa is in between. The relationship between the phenotype and the mean allele
frequency in this instance increases then
decreases, reflecting the fact that for the
extremely high and low values the allele frequency 0 predominates, i.e., aa. On the other
hand, for values in the middle, AA and Aa are
more common, thus resulting in a mode for the
allele frequency. In this case, this SNP would
not be a good predictor of the phenotype,
because the average response would be the same
in each case. The SNP would say something
about the spread, but that would generally be
less relevant, for example, aa individuals would
tend to be either very high or very low, but we
would not know which was the case.
Of course, in practice, the genomic effects
could be realized on both the mean and the
variance, as in Figure 1(f). For this example, the
relationship between the phenotype and the
allele frequency is much more complicated,
248
with the possibility of changing directions
more than once. If the relationship is this complicated, the particular SNP may not be a good
predictor of the phenotype, even though there
is a relationship with the response. Ideally, we
would like to identify SNPs that affect the
mean value for the phenotype, as these would
be the better predictors.
Prospects
Genomics has revolutionized health science,
and characterizing an individual’s genotype
offers the possibility to learn a considerable
amount of detail regarding their physiology.
However, the efficacy of an individualized
health program is not solely written in the
genes, but also in other aspects of their nature
and environment [9]. All of these features may
come together in a physiogenomic milieu that
can help to identify the individual and their
response to a particular health program. We
have discussed some of the approaches that can
be employed to quantify the problem.
Personalized Medicine (2005) 2(3)
Personalizing public health – TOOLS OF PERSONALIZED MEDICINE
Bibliography
1.
2.
3.
Miki Y, Swensen J, Shattuck-Eidens D et al.:
A strong candidate for the breast and
ovarian cancer susceptibility gene BRCA1.
Science 266, 66–71 (1994).
Wooster R, Neuhausen SL, Mangion J et al.:
Localization of a breast cancer susceptibility
gene, BRCA2, to chromosome 13q12-13.
Science 265, 2088–2090 (1994).
Klein RJ, Zeiss C, Chew EY et al.:
Complement factor H polymorphism in
www.futuremedicine.com
4.
5.
age-related macular degeneration. Science
308, 385–389 (2005).
Neter J, Kutner MH, Nachtsheim CJ,
Wasserman W: Applied Linear Statistical
Models. WCB McGraw-Hill, MA, USA
(1996).
McCullagh P, Nelder JA: Generalized Linear
Models. Chapman and Hall, London, UK
(1989).
6.
7.
8.
9.
Hosmer DW, Lemeshow S: Applied logistic
regression. John Wiley & Sons, New York,
USA (1989).
Holford TR: Multivariate Methods in
Epidemiology. Oxford University Press, New
York, USA (2002).
Armitage P: Test for linear trend in
proportions and frequencies. Biometrics
11, 375–386 (1955).
Ruano G: Quo vadis personalized medicine?
Personalized Med. 1, 1–7 (2004).
249