Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
TOOLS OF PERSONALIZED M EDICINE Personalizing public health Theodore R Holford1, Andreas Windemuth2 & Gualberto Ruaño2† †Author for correspondence University School of Medicine, Division of Biostatistics, Department of Epidemiology and Public Health, New Haven, Connecticut, 06520, USA 2Genomas, Inc., 67 Jefferson Street, Hartford, Connecticut, 06106, USA Tel.: +1 860 545 4574; E-mail: g.ruano @genomas.net 1Yale Keywords: generalized linear models, linear logistic model, personalized health, physiogenomics, regression, statistical genetics Public health and medicine share a common objective of prolonging life. However, the means by which this can be attained have long represented very different perspectives and strategies. Medicine has traditionally focused on treating patients who, by definition, are ill and in need of restoration to health, while public health seeks to reduce the number of individuals acquiring illness in the first place. Hence, medicine generally focuses on the needs of individuals in order to deliver the best treatment available to that person, taking into account their personal and clinical characteristics. On the other hand, public health focuses on the needs of the population as a whole, developing a health program that will serve the common good by reducing overall disease risk; with the understanding that a small minority may not realize the intended benefit. The two perspectives can be illustrated by the approaches for dealing with dental caries. A community that adopts a public health approach to the problem might add fluoride to the water supply, thus increasing the resistance of individuals to caries by making tooth enamel more resistant. Alternatively, the community may instead choose to provide its citizens with dental insurance, in which caries are treated once they develop. The former is certainly more cost effective from the perspective of the population, but the latter would avoid inducing any rare side effects that may result from the addition of something to the public water supply. Some individuals may have a gene or a lifestyle that makes them more resistant to the development of caries, so that they would personally receive little benefit from the addition of fluoride to the water, yet they may suffer some risk of harm from it. If it were possible to identify those individuals who would benefit most from a particular form of treatment or disease prevention, then it may be possible to optimize the program as a whole, by avoiding side effects and increasing efficiency. In effect, this effort would bring together aspects of the medical and the public health perspectives by personalizing the public health strategies, thus providing a method by which individuals can optimize their own health, and in the process benefit the population at large. 10.2217/17410541.2.3.239 © 2005 Future Medicine Ltd ISSN 1741-0541 In this report, the manner in which data from a variety of sources can be used to arrive at statistical models that may be used in prediction is considered. First, the study designs that are employed, in addition to the behavior of the variables, are descibed. In addition, the basic structure of a generalized linear model, indicating how this can be adapted in order to cover a wide variety of different types of data, is set out. These require specific ways for dealing with measurement types when the factor is a response and when it is a predictor. Some of the issues involved in selecting prediction models and assessing their adequacy are also discussed. Cause versus prediction in genomics The issue of causal inference involves some of the deepest concepts in the philosophy of science, and it is arguable whether a strict adherence to causal reasoning can ever be fully realized in health science. Knowledge that a particular genetic polymorphism will produce a given effect on the outcome may be very difficult, especially in light of the fact that most biochemical reactions have multiple steps involving different proteins, each one of which can have their own effect on the phenotypic result for the individual. A somewhat lower standard would be to determine whether a particular polymorphism was associated with the outcome, a relationship that may or may not be causal. The ultimate goal of science is to understand causal mechanisms that underlie physiological processes, thus providing an approach for predicting outcome. However, this may be difficult when the development of knowledge is at an early stage, so that a more pragmatic approach would be to learn about associations that exist between individual characteristics and the outcome of interest. This would allow us to improve our prediction of the response by incorporating information that is related to the outcome. However, when we cannot be certain that the associations are causal, we must maintain a level of caution in using the predictions. These limitations that exist in a prediction model could be summarized to users by carefully specifying the level of error present in the estimates provided by the prediction model. Personalized Medicine (2005) 2(3), 239–249 239 TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño Causal inference in genomics Causal inference requires that we have knowledge regarding the underlying biology of a physiological process that is affecting an outcome, so that we can specify quantitatively the manner in which an individual’s physiology might behave under a specified set of circumstances. Understanding the process may not include information about the magnitude of various rates of reaction, or the manner in which the effect takes place, so that statistical inference amounts to the development of estimates of unknown quantities and assessing the mode of action by determining the agreement between observed values and levels predicted by a model. In the case of genomic information, the factor level would require the identification of a particular gene that may be recognized by detecting the number of copies of a known single nucleotide polymorphism (SNP). The identification of causal SNPs, and the mechanisms by which they act, are ideal for predicting an outcome, and these should be used whenever available. When specific polymorphisms that cause disease are known, the magnitude of their association with the response is huge and highly predictive for an individual who harbors it (for example, the effect of BRCA1 and BRCA2 on breast cancer risk [1,2]). By the same token, the process required to find these polymorphisms can take considerable time. However, this process continues to improve, as in the case of the elegant work used to discover the gene causing macular degeneration [3]. Prediction of health outcomes Statistical models that can be used to predict health outcomes would make use of demographic, clinical and genomic information, in order to arrive at a model that predicts the outcome. The goal is not necessarily to discover clinical variables and alleles that cause an effect, but to make use of the associations that exist. In this sense, the aims are somewhat different from those in a strictly scientific endeavor that seeks to understand modes of action, but instead adopts a more pragmatic approach of developing methodology that provides predictions that have sufficient accuracy to be of practical use to clinicians and healthcare providers. The outcomes may be qualitative, as in the case of binary indicators of disease status, or continuous indicators such as body mass index (BMI), blood sugar or cholesterol. While there are aspects of the process that will need to be 240 modified in order to arrive at assumptions that agree with the data, the approaches are similar, in that they can be dealt with by using the concepts underlying generalized linear models for the analysis of health data. Data sources Study designs Controlled experiments A controlled experiment is a carefully constructed study in which the groups being compared only differ from each other by the factor under study. In that sense, the intent is to mimic a laboratory experiment, in which the careful scientist can exercise considerable control over the experimental units so that they are all handled in the same way, except for the factor or treatment of interest. In population studies, the randomized control trial is the most commonly used design for this type of study, in which the groups to be compared are made comparable through random assignment of subjects to the groups in question. In this case, the rules of chance make it possible to achieve comparability, although there is always the prospect that the luck of the draw may turn out to be unfavorable in terms of balance. Hence, the underlying approach may be augmented by adopting alternative strategies that force balance among the groups to be compared, especially for factors that are known to be strongly related to the outcome. A considerable effort is required to launch a controlled experiment, including the recruitment of subjects and physicians who are willing to meet the requirements of random assignment. If the outcome is the occurrence of a relatively rare event, such as the development of a disease or syndrome, then the sample size required for valid inferences could be large and the follow-up time lengthy, thus further increasing the cost of the study. In general, these studies, although extremely valuable because of their validity if carefully conducted, can be extremely costly. Thus, they are often not the first approach used to look at a particular association, but are instead used as a final justification for a particular treatment strategy. Observational studies An observational study involves the assessment of associations discovered in subjects who have been recruited for a particular project. In some instances, the subjects are selected as representatives of the population at large. This is sometimes referred to as a cross-sectional study. In Personalized Medicine (2005) 2(3) Personalizing public health – TOOLS OF PERSONALIZED MEDICINE other instances the subjects are recruited as representatives of particular groups, and then followed to determine their clinical course – a cohort study. Alternatively, subjects with a particular disease and an appropriate control group may be recruited and their past clinical history or exposure gleaned through the use of interviews or chart reviews, which are known as case–control studies. The disadvantage in observational studies is that the carefully structured study design provided by a controlled experiment no longer applies. Nevertheless, the importance of avoiding the bias that can arise when groups are not comparable remains as important as ever. Therefore, there is considerable effort to control for imbalances by employing statistical models that make use of variables that may be confounding an estimated association. Typically, these are variables that have an imbalanced distribution in the groups being compared, and they are associated with the response of interest. equation, and these may be similarly classified as continuous, binary or polychotomous. For a prediction model, the regressor variables comprise the information that is to be used to predict the response, and in a physiogenomic model, these would include a mixture of demographic, physiologic and clinical data, along with information on the genotype for the individual. Generalized linear models provide an extremely powerful framework for developing statistical models that may be used for any of these types of outcomes. The mathematical expression, or model, that relates the outcome, Y, to a set of regressor variables, X = ( X 1, X2, …, Xp ) is given by: Y = µ( X) + ε where µ(X) is the systematic part of the equation which describes the manner in which the regressor variables affect the average level of the outcome, µ ( X ) = E [ Y X ] , and ε is the unexplained or random part of the model. Properties of a good model are: Types of predicted outcomes The goal of prediction modeling is to develop a mathematical expression that may be used to predict or estimate the unobserved outcome in an individual based on an observed set of predictor variables. The outcome or response (sometimes called the dependent variable) may be: Accurate representation of the systematic component of the model, µ(X) Continuous measure Appropriate characterization of random error, ε • • • • Body mass High-density lipoprotein cholesterol Low-density lipoprotein cholesterol Blood pressure Binary • Obese (yes/no) • Neuralgia (yes/no) • Hypertension (yes/no) Ordered polychotomous • Obesity (normal/overweight/obese/morbidly obese) • Muscular pain (none/mild/moderate/severe) Nominal (not ordered) polychotomous • Organ system affected (respiratory, digestive, circulatory, neurological) Regressor variables (sometimes called independent variables) are used in the prediction www.futuremedicine.com • Correct form for the equation • Include all relevant regressor variables • Preferably do not include unimportant variables, i.e., parsimony • Zero on average, indicating model and data are not biased • Probability distribution is appropriately characterized - Continuous – normal or Gaussian, γ - Binary – binomial - Counts – Poisson or negative binomial • Variance is small – precision is great In a linear model [4], the systematic component is comprised of the sum of each regressor variable multiplied by a regression parameter that is usually not known, but estimated from the data, i.e., µ ( X ) = β0 + X1 β1 + X2 β2 + … + Xp βp Models of this form are the standard regression model. The error term, ε, is assumed to be normal with a mean of zero and unspecified variance, which is a constant that does not depend on the mean. However, a generalization in which a model has this form after taking a specified transformation of the mean greatly enhances 241 TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño ones ability to consider different types of data, i.e., a model of the form: η ( X ) = h [ µ ( X ) ] = β0 + X1 β1 + X2 β2 + … + Xp βp The term η ( X ) is called the linear predictor, and the function, h(.), which transforms the mean into the linear part of the model, is called the link function [5]. As an example, consider a binary response, Y = 0 or 1, for which the linear logistic model is commonly used [6]. On average, the response takes the value of the probability of observing a one, and for the commonly used linear logistic model, this is expressed as: β0 + X1 β 1 + … + Xp βp e Pr { Y X } = µ ( X ) = ---------------------------------------------------β0 + X1 β 1 + … + Xp Bp 1+e Notice that the model contains a part that has the familiar mathematic form used in standard regression, i.e., the sum of products of regressor variables and regression coefficients. To obtain an expression that only has that form, one would use a logit link function to transform the mean into the linear predictor [5,7]: µ( X) logit[µ ( X ) ] = log --------------------- = β0 + X1 β 1 + … + Xp β p 1 – µ(X ) Clinical predictors Demographic and personal characteristics The most commonly-used demographic characteristics are age, gender and ethnicity. Aging clearly affects the clinical course of most diseases, as gender also does, perhaps due to the hormonal and physiological differences between men and women. Ethnic differences in disease risk may, in part, reflect genetic variation, but they could also be the result of diet and other practices that influence the prognosis of individuals in these groups. Other social and economic characteristics of individuals can influence personal health, either through knowledge or resources that empower them to take control of factors that affect disease risk. In developing a model that can be used to predict or to characterize a likely course of symptomatology, the effect of each of these factors needs to be explored. Clinical measurements In the interest of planning for the health needs of individual patients, it is important to include information on the status of clinical symptoms at baseline, or even before if the information is available. In some instances, the baseline data may consist of information on levels prior to the treatment regimen. For example, BMI at baseline is likely to be strongly related to BMI 6 months 242 later, so it would clearly be relevant to include this information in a prediction model. In addition, blood sugar, insulin level or other measures at baseline that might be related to an individual’s metabolic rate could add information to the prediction. Perhaps not as obvious would be the problem of making sense of the effect of baseline BMI on temporal changes. An individual with a high BMI may find it easier or harder to lose weight, but an analysis that suggests that higher baseline levels are associated with greater weight loss may not be an indication of a greater ability to lose weight as BMI increases. It could actually be due, in part, to regression to the mean, which results in patients with extreme levels of any variable tending to have subsequent values that are closer to the average. Nevertheless, for the purpose of obtaining a predicted value for a future level of BMI, this distinction may not be important, because the objective is to obtain a good prediction model, which is not necessarily the same as identifying causal relationships. Coding categorical predictors For a binary variable there is no difference analytically between a comparison of two nominal levels (presence vs absence of a factor) and two ordinal levels (high vs low). In each case, a single parameter provides a measure of comparison, for example, the difference in the outcome when the factor is present and absent, or the difference between a high and low level of the factor. One convenient way to specify such a comparison in a regression model is to introduce a dummy variable that takes the value of 1 when the factor is present or high and 0 otherwise. In this case, the mean levels of the response for the two groups are: η ( 0 ) = β0 + 0 ⋅ β1 = β0 for X = 0, the reference level and η ( 1 ) = β0 + 1 ⋅ β1 = β0 + β1 for X = 1, the comparison level. Notice that the parameter represents the difference in the linear predictors for the two levels of the factor, ∆η = η ( 1 ) – n ( 0 ) . For a categorical variable with more than two levels, one must first consider whether one is dealing with nominal categories, in which there is no particular form for the relationship, or ordered categories, which would enable one to capitalize on the relationship among the levels. Extending the approach used for two categorical levels, one can introduce additional dummy variables which take the value 1 for the comparison level of interest, and Personalized Medicine (2005) 2(3) Personalizing public health – TOOLS OF PERSONALIZED MEDICINE 0 otherwise. When we apply this rule, the reference level will always be 0. In general, there will be one fewer dummy variables than there are categories. As an example, consider a four level categorical variable, which would require the specification of three dummy variables defined as follows: X1 X2 X3 Reference 0 0 0 Level 1 1 0 0 Level 2 0 1 0 Level 3 0 0 1 The linear predictors that correspond to these four levels are: η ( 0 ) = β0 + 0 ⋅ β1 + 0 ⋅ β2 + 0 ⋅ β3 = β0 for reference, η ( 1 ) = β0 + 1 ⋅ β1 + 0 ⋅ β2 + 0 ⋅ β3 = β0 + β1 for level 1, η ( 2 ) = β0 + 0 ⋅ β1 + 1 ⋅ β2 + 0 ⋅ β3 = β0 + β2 for level 2, and η ( 3 ) = β0 + 0 ⋅ β1 + 0 ⋅ β2 + 1 ⋅ β3 = β0 + β3 for level 3, so that β1 represents the difference between level 1 and the reference level, and β2 and β3 are similarly interpreted. The interpretation of the individual parameters depends on the particular model being used which is specified through the link function in the generalized linear model. For the standard regression model that is normally used for a continuous response, an identity link is employed, µ = η , so that the difference in the linear predictor may be interpreted as a difference in means. When additional covariates are also included in the model, these are regarded as adjusted means, in the same way that adjusted means are typically compared using analysis of covariance. On the other hand, when a linear logistic model is used, the change in the linear predictor, ∆η , represents a change in a link function of the means, which is the log odds ratio for the probability of a binary outcome: µ1 µ2 – log ---------------∆η = log ---------------n2 – µ2 n 1 – µ1 µ2 ⁄ n2 – µ2 = log ------------------------µ1 ⁄ n1 – µ1 µ2 ( n1 – µ1 ) = log --------------------------µ1 ( n2 – µ2 ) www.futuremedicine.com In some cases there is a natural candidate for the reference category, for example, if one is interested in the effect of introducing new treatments, then the current standard treatment would be an obvious choice as a reference. However, there are many instances in which the choice is arbitrary, or up to the discretion of the investigator. For this reason, it is often useful to first consider a joint test for all of the parameters associated with the dummy variables for a factor, before trying to interpret individual parameters. For example, in the four category case, one would test the null hypothesis, H0: β1 = β2 = β3 = 0. In addition, because each of these parameters represents a difference from the reference level, the standard error can never be less than the standard error for the reference. Therefore, it is best not to choose an uncommon category as the reference, because standard errors for the regression parameters would tend to be large. A common category would yield the greatest level of precision for the parameters. For a factor with many categories, the overall test tends to have little power when the null hypothesis is unfocused. For example, if one is investigating the effect of six ordered categories on a binary outcome, the overall null hypothesis would be tested using a Pearson χ2 test with 5 degrees of freedom (df ) with a critical value of 11.07. On the other hand, a more focused hypothesis concentrating on linear trend, described quantitatively as the slope, could be tested using the method described by Armitage [8], which would be compared to χ2 with 1 df, with a critical value of 3.84. If the trend is in fact linear, the latter hypothesis is easier to reject, giving it greater power. On the other hand, a linear trend can be a strong assumption that would miss some patterns in ordinal variables, such as the situation when the trend increases for low values then decreases for high values. Biomarkers Biochemical markers may consist of hormone levels, metabolic parameters or cumulative levels of environmental exposures, such as polychlorinated biphenyl levels in sera or adipose tissue. However, the human genome project, with the accompanying advances in technology that enables one to identify genomic detail for an individual, offer exciting new opportunities that could prove to be useful for planning individual health regimens. As the technology in this field continues to advance, the cost of 243 TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño Table 1. Regressor variables used for linear, dominant, recessive, or codominant effects of a single nucleotide polymorphism. No. copies Linear Dominant Recessive Codominant 0 −1/2 -2/3 -1/3 -1/3 1 0 1/3 -1/3 2/3 2 1/2 1/3 2/3 -1/3 identifying large numbers of genetic markers become more feasible. An important question is how best to use this information for the good of an individual’s health. Genomic data The contribution of genes to disorders, such as Huntington’s disease or Down’s syndrome, has long been recognized. However, with the increased power provided by new genomic technologies, one is able to identify specific gene polymorphisms that are associated with diseases such as breast cancer [1] and macular degeneration [3]. Cellular decoding of DNA results in the construction of proteins, so that variations in genetic expression can ultimately lead to defective proteins that may, in turn, affect disease risk. Not all variants in the genetic code may affect disease risk, so it is not always clear which variants in genetic expression are important identifiers for disease risk. In some instances, a measure of the extent of the match for a particular portion of a gene is used as the measure of interest. Alternatively, libraries of specific SNPs are being constructed for use as specific genetic markers. Ideally, the SNPs will identify critical changes that result in clinically important changes in protein structure, and knowledge continues to grow as more information is gleaned about the important variants in the genetic sequence and the effects these may have. Tissue samples can provide information on the number of copies of particular alleles through the identification of SNPs. Current technological advances in highly parallel genotyping can provide such information on hundreds of alleles at relatively low cost. The importance of this capability in personalized public health is the valuable new information provided to an individual and his/her healthcare provider. With respect to a prediction model, SNPs provide the number of copies of a particular allele that is an ordinal categorical variable. This could be parameterized as any other categorical predictor, but the mode of action suggests particular approaches that would be 244 motivated by the underlying biology, which are discussed below. Proteomic data While DNA provides the necessary code needed to order the amino acids in a protein, some variants in this order will have little effect on functionality in a biochemical reaction. Thus, the analysis of protein structure or proteomics offers a more direct measurement of an enzyme that may actually be involved in a relevant biochemical reaction for a particular health outcome. At this stage, this technology is still being rapidly developed. However, from the perspective of the statistical analysis, the issues are very similar to those that arise in the analysis of genomic data, which are discussed in further detail below. Coding the genomic markers SNPs are represented in a data set as a three level categorical variable, i.e., homozygous with respect to one of two alleles, or heterozygous. These can be coded by identifying the number of copies of the specified polymorphism (0, 1 or 2). One approach for parameterizing a SNP effect is to choose a particular mechanism by which it would have its effect on the outcome, and to introduce the corresponding regressor variable into the regression models. For example, if there is an additive effect of each copy for a particular allele, so that the difference between 0 and 1 copy was the same as the difference between 1 and 2 copies, then this would be parameterized as a linear effect. Similarly, if the effect was dominant, then either 1 or 2 copies would have the same effect, but these would be different from 0 copies. Table 1 shows the choice of regressor variables that one might use for linear, dominant, recessive or codominant effects of a SNP. The key in choosing which regressor variable to represent a SNP is to appropriately represent the mode of action, and if this selection is correct then one has the assurance that the estimated effect is both valid (i.e., unbiased) and optimal (i.e., most precise or smallest standard error). However, this would be unrealistic in Personalized Medicine (2005) 2(3) Personalizing public health – TOOLS OF PERSONALIZED MEDICINE Table 2. Linear and dominance representation of SNP effects, along with particular combinations of parameters for linear, dominant, recessive and codominant effects. Linear Dominant Recessive Codominant No. X1 X2 β1X1 + β2X2 β1 = β β2 = 0 β1 =β2 = β β1 = −β2 = β β1 = 0 β2 = 2β 0 -1/2 -1/6 -β1/2-β2/6 -β/2 -2β/3 -β/3 -β/3 1 0 1/3 β2/3 0 β/3 -β/3 2β/3 2 -1/2 -1/6 β1/2-β2/6 β/2 β/3 2β/3 -β/3 that one is typically analyzing a SNP to determine whether it has an effect, usually not knowing its mode of action on the response. An alternative method of representing SNPs in the framework of a generalized linear model is to introduce two dummy variables, as described in Table 2. In this particular parameterization, X1 represents a linear or additive contribution of each copy of a particular allele, and X2 can be thought of as a dominance factor. The advantage of this choice of regressor variables is that it offers one the opportunity to estimate the mode of action and to test particular hypotheses that would suggest whether the data indicates a significant departure from a particular mode of action. Some specific examples are: • When only X1 is included in the model (equivalent to specifying β2 = 0), then an additive effect of the number of copies of the SNP is implied, with the parameter β1 measuring the difference between the effects of the two homozygous individuals with the heterozygous lying halfway between. • When β1 = β2 = β, then the model specifies no difference in effect when either 1 or 2 copies of the SNP are present, which is the behavior expected when the SNP behaves as a classically dominant gene, and the regression parameter, β, measures the difference between those with the SNP and those without. • When β1 = -β2 = β, then those with 0 or 1 copies have the same effect on the response and the model measures the effect of a classically recessive SNP. • When β1 = 0, β2 = 2β, the comparison estimates the difference between the heterozygous and the homozygous individuals, which represents a codominant mode of action. • When both X1 and X2 are included in the model, then the models can specify a rich variety of modes of action, other than those that satisfy the classical modes, including intermediate situations. www.futuremedicine.com The relationship between a SNP and an outcome can be visualized in two ways. First, we can consider the variability of the allele frequency with the level of the response, which can be displayed graphically by giving the mean frequency on the vertical axis and the level of response on the horizontal axis. Alternatively, we can display the distribution of the response among groups defined by the allele frequency. Of course, the form for these relationships depends on the underlying biological mechanism for the genetic action, i.e., whether the phenotypic effect of a SNP on the mean response is linear, dominant, recessive, or codominant. In addition, the effect may not be on the mean response, but on the variability also. The graphs in Figure 1 show the manner in which we might expect to see a particular relationship between the distribution of the response in different SNP genotypes and the mean allele frequency as a function of a particular phenotype. We assume that the relative frequency of SNP haplotypes is governed by Hardy-Wienberg equilibrium, whereby p = 1-q defines the SNP frequency. Thus, p2 is the relative frequency of individuals who are homozygous for the SNP (AA), q2 would be the frequency of homozygous wild type (aa), and 2pq heterozygous (Aa). In Figure 1a we illustrate a display that might be expected for a linear effect on the mean when the distribution is Gaussian with constant variance. The heavy purple line shows the distribution of the phenotype without regard to the genotype, and the three other lines give the distributions of each genotype. In this case, the overall distribution looks very similar to a typical Gaussian curve, even though the means are quite different. For even larger differences, multiple modes might begin to appear for the overall distribution, but they would only be expected for differences that were very large indeed when compared to the variance or spread of the distribution. If data were available, we would expect a series of overlaid histograms to appear to be very 245 TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño Highlights • Genomic markers can blur the classical distinction between the medical perspective of treating the individual and the public health perspective of reducing risk for the population by tailoring preventive interventions to individuals. • Statistical models for predicting health outcomes use the pragmatic approach of making effective use of associations between factors and outcome, some of which may not be causally related. • Generalized linear models provide a framework for quantitative prediction for a variety of different types of outcome variable, including those that are continuous (cholesterol or blood pressure) and those that are binary (presence or absence of disease). • Coding of regressor variables is a powerful tool for estimating the association between the outcome and potential predicting factors that may be either continuous (age or baseline cholesterol) or categorical (gender or ethnogeographic ancestry). • Phenotypic effects of single nucleotide polymorphisms can be coded as linear, dominant, recessive, or codominant, in order to enable the mode of action of a gene to be specified or inferred for physiogenomic analysis. • The relationship between gene expression and a continuous phenotype can be described in terms or the distribution of the outcome for each genotype, or the number of copies of a particular allele expected among individuals with a particular level of the phenotype. • If only the mean level of a phenotype is affected by genotype, and the distribution is Gaussian, then the relationship is often sigmoid, but can appear to be linear within a narrow range. If the variance of the distribution is also affected by genotype, then the relationship between the expected number of allele copies for individuals with a specified level of the outcome can become complex. • Bringing together physiogenomic elements can provide a powerful approach for personalized health, initially by identifying individuals who are most likely to benefit from a particular health program and, eventually, by identifying the health program, amongst various options, which is most likely to benefit an individual. similar to the distributions shown in the graph. Notice that in this hypothetical example, the mean responses progress in equal increments, so that the difference in the mean for 1 and 0 copies of an allele is equal to the difference between 2 and 1. If we were to interchange the axes, displaying the mean number of alleles as a function of the phenotypic response, we would expect the relationship to be that shown in the corresponding sigmoid curve. Notice that for low values of the response, the mean frequency is close to 0, indicating that these individuals are primarily homozygous aa. Gradually this begins to increase, and for intermediate values the relationship is almost linear in this instance. Finally, the curve approaches an asymptote of 2, which corresponds to homozygous AA. Figure 1b shows a similar pair of graphs for a hypothetical gene with a dominant expression on the genotype. In this case, the mean response for Aa and AA would be the same, with a higher mean response than aa in this hypothetical example. The relationship between the mean number of alleles as a function of the phenotypic response is once again a sigmoid curve, starting near zero, indicating that these individuals are primarily aa. However, the curve then gradually increases and, ultimately, approaches an asymptote, which 246 lies between 1 and 2, the actual value being a weighted average depending on the relative frequency of Aa and AA. A recessive SNP would be expected to behave as shown in Figure 1c, which would be identical to the case where a SNP is dominant. The graph showing the sigmoid relationship between the mean number of alleles and the phenotype now starts between 0 and 1, a weighted average depending on the relative frequency of aa and Aa, and approaches the asymptote of 2. In the rare event of a codominant effect on the phenotype, the mean for the homozygous individuals would be the same, in this case lower than the heterozygous, Aa. The relationship between the genotype and the mean allele frequency is once again a sigmoid curve starting at the weighted average of 0 and 2, depending on the relative frequency of aa and Aa, and approaching an asymptote of 1. In this example the sigmoid curve is increasing because allele A is less common than a. If the converse were true the curve would be decreasing. For each of the examples considered thus far, we have assumed that the effect of the SNP is on the mean. However, in principle it could affect any property of the distribution. Suppose that the variance were affected, and not the mean, a case illustrated in Figure 1e. For this Personalized Medicine (2005) 2(3) Personalizing public health – TOOLS OF PERSONALIZED MEDICINE Figure 1. The relationship between the distribution of responses for alternative modes of gene expression and allele frequency. b) a) Total aa Aa AA Total aa Aa AA -6 -4 -2 0 2 4 -6 6 -4 -2 0 -2 0 6 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 Mean number of As Mean number of As 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 -4 4 Mean Mean -6 2 2 4 -6 6 -4 -2 0 2 4 6 Mean response Mean response c) d) Total aa Aa AA -6 -4 -2 0 2 4 Total aa Aa AA 6 -6 -4 -2 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 -6 -4 -2 0 2 4 6 Mean 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 Mean number of As Mean number of As Mean 0 2 Mean response 4 6 -6 -4 -2 0 2 4 6 Mean response The relationship between the distribution of responses for alternative modes of gene expression and the allele frequency for the (a) linear model for mean; (b) dominance model for mean; (c) recessive model for mean; (d) hybrid model for mean; (e) variance model; and (f) model with varying mean and variance. www.futuremedicine.com 247 TOOLS OF PERSONALIZED MEDICINE – Holford, Windemuth & Ruaño Figure 1 (continued). The relationship between the distribution of responses for alternative modes of gene expression and allele frequency. e) f) Total aa Aa AA -6 -4 -2 0 2 4 Total aa Aa AA 6 -6 -4 -2 Mean 0 2 4 2 4 6 Mean number of As Mean number of As Mean -6 -4 -2 0 2 4 6 Mean response -6 -4 -2 0 6 Mean response The relationship between the distribution of responses for alternative modes of gene expression and the allele frequency for the (a) linear model for mean; (b) dominance model for mean; (c) recessive model for mean; (d) hybrid model for mean; (e) variance model; and (f) model with varying mean and variance. example the means are all 0, but the variance or spread of the distribution is much greater for aa than AA, while Aa is in between. The relationship between the phenotype and the mean allele frequency in this instance increases then decreases, reflecting the fact that for the extremely high and low values the allele frequency 0 predominates, i.e., aa. On the other hand, for values in the middle, AA and Aa are more common, thus resulting in a mode for the allele frequency. In this case, this SNP would not be a good predictor of the phenotype, because the average response would be the same in each case. The SNP would say something about the spread, but that would generally be less relevant, for example, aa individuals would tend to be either very high or very low, but we would not know which was the case. Of course, in practice, the genomic effects could be realized on both the mean and the variance, as in Figure 1(f). For this example, the relationship between the phenotype and the allele frequency is much more complicated, 248 with the possibility of changing directions more than once. If the relationship is this complicated, the particular SNP may not be a good predictor of the phenotype, even though there is a relationship with the response. Ideally, we would like to identify SNPs that affect the mean value for the phenotype, as these would be the better predictors. Prospects Genomics has revolutionized health science, and characterizing an individual’s genotype offers the possibility to learn a considerable amount of detail regarding their physiology. However, the efficacy of an individualized health program is not solely written in the genes, but also in other aspects of their nature and environment [9]. All of these features may come together in a physiogenomic milieu that can help to identify the individual and their response to a particular health program. We have discussed some of the approaches that can be employed to quantify the problem. Personalized Medicine (2005) 2(3) Personalizing public health – TOOLS OF PERSONALIZED MEDICINE Bibliography 1. 2. 3. Miki Y, Swensen J, Shattuck-Eidens D et al.: A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science 266, 66–71 (1994). Wooster R, Neuhausen SL, Mangion J et al.: Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12-13. Science 265, 2088–2090 (1994). Klein RJ, Zeiss C, Chew EY et al.: Complement factor H polymorphism in www.futuremedicine.com 4. 5. age-related macular degeneration. Science 308, 385–389 (2005). Neter J, Kutner MH, Nachtsheim CJ, Wasserman W: Applied Linear Statistical Models. WCB McGraw-Hill, MA, USA (1996). McCullagh P, Nelder JA: Generalized Linear Models. Chapman and Hall, London, UK (1989). 6. 7. 8. 9. Hosmer DW, Lemeshow S: Applied logistic regression. John Wiley & Sons, New York, USA (1989). Holford TR: Multivariate Methods in Epidemiology. Oxford University Press, New York, USA (2002). Armitage P: Test for linear trend in proportions and frequencies. Biometrics 11, 375–386 (1955). Ruano G: Quo vadis personalized medicine? Personalized Med. 1, 1–7 (2004). 249