REVIEWS

ASSESSING THE FUNCTION OF GENETIC VARIANTS IN CANDIDATE GENE ASSOCIATION STUDIES

Timothy R. Rebbeck*, Margaret Spitz‡ and Xifeng Wu‡

Knowledge of inherited genetic variation has a fundamental impact on understanding human disease. Unfortunately, our understanding of the functional significance of many inherited genetic variants is limited. New approaches to assessing functional significance of inherited genetic variation, which combine molecular genetics, epidemiology and bioinformatics, promise to enhance reproducibility and plausibility of associations between genotypes and disease.

*Department of Biostatistics and Epidemiology, and Abramson Cancer Center, University of Pennsylvania School of Medicine, 904 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA.
‡Department of Epidemiology, M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, Texas 77030, USA.
Correspondence to T.R.R. e-mail: trebbeck@cceb.med.upenn.edu
doi:10.1038/nrg1403
NATURE REVIEWS | GENETICS

The human genome contains about ten million SNPs, with an estimated two common missense variants per gene1. At least five million SNPs have already been reported in public databases2,3. Many of these variants might be involved in human disease aetiology, but it is often difficult to assess their function on the basis of nucleotide sequence alone. This is particularly true when variants do not alter an amino acid or do not disrupt a well-characterized motif that affects protein function or structure. In addition, only a small subset of variants that affect the phenotype will confer small to moderate effects on phenotypes that are causally related to disease risk. So, an important challenge that faces molecular epidemiological association studies of candidate disease-susceptibility genes is to define the variants that are functionally implicated in disease.
This is particularly urgent because the amount of genomic information that is available greatly exceeds the information about the function of variants that are used in human disease studies. There is insufficient guidance for molecular epidemiologists to optimally select variants for an epidemiology study, highlighting the need for methods that prioritize the choice of genetic variants to be genotyped in molecular epidemiological studies4. Here, we bring together examples of experimental, population genetics and evolutionary approaches to assessing variant function, and evaluate the potential for using this information to improve inferences from disease association studies.

Genomic variation and molecular epidemiology

The protein coding sequences of the human genome contain approximately 100,000–300,000 common SNPs, and additional SNPs lie within putative regulatory regions of genes that might be relevant for studies of human health and disease1. Regulatory and coding SNPs are of particular interest to molecular epidemiological association studies. Non-synonymous SNPs (nsSNPs), or missense variants, translate into amino-acid polymorphisms in the proteins they encode. Regulatory SNPs (rSNPs) can affect the expression, tissue-specificity or function of relevant proteins. It seems that both nsSNPs and rSNPs are relatively rare compared with the total number of SNPs in the human genome5. The rarity of nsSNPs might be a consequence of selection against the functional disruptions of amino-acid variation. Some molecular functional diversity is attributable to the effects on protein function caused by nsSNPs. For example, the kinetic parameters of enzymes, the DNA-binding properties of proteins that regulate transcription, the signal transduction activities of transmembrane receptors, and the architectural roles of structural proteins are all susceptible to perturbation by nsSNPs and their associated amino-acid polymorphisms.
In addition to ongoing efforts to identify and characterize these genetic variants, molecular epidemiological disease association studies are under way to better understand the role of inherited genetic variation in disease.

VOLUME 5 | AUGUST 2004 | 589

A major challenge for epidemiologists undertaking candidate gene–disease association studies is to choose target SNPs that are most likely to affect the phenotype and that ultimately contribute to disease development. Variants in biologically plausible candidate genes are usually selected for study on the basis of both variant allele frequency and the functional effect of the variant on relevant traits. Although there is often sufficient information to assess the allele frequency of a candidate variant, understanding the functional significance of genetic variants is usually more difficult. Knowledge of gene and SNP function is crucial to direct the appropriate design and interpretation of candidate gene association studies. This is in contrast to genome-wide scans and other approaches that rely on LINKAGE DISEQUILIBRIUM between loci to identify genotype–disease associations, for which knowledge of function is not required. Most epidemiologists are not trained in experimental laboratory methods and do not maintain laboratories in which detailed laboratory assessment of variant function in target tissues or individuals of interest can be made. They must therefore often rely on making an assessment of the function of a candidate gene or SNP from the published literature. However, published experimental evidence might not be adequate to guide the design or interpretation of a molecular epidemiological study.
For example, inadequate information about the function of a genetic variant impedes the ability to evaluate whether the association reflects a causative event that is consistent with a biological mechanism, whether the association is a reflection of linkage disequilibrium with a truly causative variant, or whether the association represents a false positive result. The lack of reproducibility of many association studies might reflect the number of studies that involve genetic variants with no functional significance6,7.

LINKAGE DISEQUILIBRIUM: The observation that two or more alleles, usually at loci that are physically close together on a chromosome, are not inherited independently but are observed to occur together more frequently than predicted under Mendel’s law of independent assortment.
NIFEDIPINE: A calcium-blocker drug (also called Procardia) that was one of the first drugs recognized to be metabolized by CYP3A4, and for which a regulatory element specific to the CYP3A4 gene was named.
MENARCHE: The first occurrence of menstruation in a woman.

Experimental approaches

Inferences about function of a genetic variant can be made using experimental systems, including in vitro systems and in vivo animal models. These include the effect of variants on regulatory-region control of DNA expression, RNA stability or degradation, protein structure, protein denaturing, protein expression, other measures of in vivo and in vitro control of protein levels, and tissue specificity8–10. Because the scope of experimental approaches for assessing SNP function is large, the purpose of this section is not to provide a comprehensive review of experimental approaches for evaluating SNP function, but instead to provide a context in which epidemiological studies can use experimental data in the design and interpretation of association studies that involve candidate genes.
These approaches can provide the strongest evidence for the functional role of a genetic variant, but can also be difficult to interpret in the context of studies of complex human traits. The extent of the effects of genetic variants on relevant phenotypes involved in the disease process is probably small. These effects may be highly dependent on the context in which the genetic variant is acting, including influence from environmental exposure, tissue-specific effects or experimental conditions. For example, the functional effect of a variant might be small in magnitude in an experimental system, but might become more important in a specific human tissue or on exposure to relevant environmental agents. Similarly, large experimental effects might be observed that reflect negligible in vivo effects in humans. Similarly, effects that are small in magnitude in a single-time-point experiment might be amplified or only have a phenotypically relevant effect over long time periods. Indeed, most genetic variants that are studied in complex traits and disease would be expected to have small effects on function. So, the use of model systems that are appropriate to detect large effects (for example, as might be observed in high penetrance ‘inborn errors of metabolism’ settings) might not be optimal.

As outlined above, experimental evidence of genetic variant function might not be consistent with results of epidemiological studies. For example, CYP3A4 metabolizes drugs and other compounds, including steroid hormones, that are important in the aetiology of many common diseases11. A variant has been identified in the CYP3A4 promoter (denoted CYP3A4*1B) that consists of an A→G nucleotide substitution at position –290 (denoted A–290G) in the NIFEDIPINE-specific element (NFSE)12.
Subsequently, the CYP3A4*1B variant was epidemiologically associated with various characteristics, including a more advanced stage of prostate tumours13,14, decreased risk for treatment-related leukaemia15, early age of MENARCHE16,17 and increased plasma levels of insulin-like growth factor-I among users of oral contraception18. Although these data provide evidence for functionally significant effects of CYP3A4*1B, the basic science literature has not consistently supported the hypothesis that CYP3A4*1B has a functionally significant effect. Hashimoto et al.12 identified several regulatory elements, including a putative repressor fragment and an NFSE in the CYP3A4 promoter, which indicates that variants in this promoter region might affect CYP3A4 transcription (FIG. 1). Lamba et al.19 reported that CYP3A4*1B alleles were found significantly more frequently in Caucasians with low CYP3A4 protein levels than in those with higher levels of the protein. Many authors20–26 have studied the relationship of CYP3A4 expression or function to the CYP3A4*1B promoter (FIG. 1). Most of them concluded that there were no biologically meaningful effects, given the small magnitude of effects that were observed. However, most studies reported consistently elevated expression associated with CYP3A4*1B (a 20%–200% increase over the consensus CYP3A4*1A)20–26. On the basis of these data, it is possible that CYP3A4*1B has only a small to moderate phenotypic effect. These small phenotypic effects will probably not have a clinically meaningful impact on drug disposition. It is not clear, however, whether this magnitude of phenotypic perturbation is sufficient to alter metabolism on environmental exposure to steroid hormones or other agents that might confer disease risk over the lifetime of an individual.
Figure 1 | In vitro studies of the effect of CYP3A4*1B compared with CYP3A4*1A. At the top of the figure on the left is a schematic representation of the structure of the CYP3A4 region (labelled elements: *1B, putative repressor region, NFSE, ATG). Blue bars represent the genomic regions that are used in the CYP3A4-containing constructs for each study. The positive and negative numbers on each of the bars represent the position of the terminal basepairs of these regions relative to the position of the SNP. The studies summarized here encompass substantial variability in experimental conditions. Nonetheless, there is a small but consistent effect of increased CYP3A4 expression in the presence of CYP3A4*1B. NFSE, nifedipine-specific element.

Construct end points; cell system; effect of CYP3A4*1B versus CYP3A4*1A; refs
–490 to +10; human hepatocytes; 190% increase in testosterone 6β-oxidation; 20
–1019 to –6; MCF7; 90% increase in luciferase expression; 21
–1019 to –6; HepG2; 40% increase in luciferase expression; 21
–570 to +22; human hepatocytes; 40% increase in nifedipine oxidase activity; 22
–570 to +22; human hepatocytes; 110% increase in CYP3A4 protein expression; 22
–362 to +53; HepG2; no change in luciferase expression; 23
–362 to +53; HepG2 + enhancer; 20% increase in luciferase expression; 23
–572 to –6; MCF7; 20% increase in luciferase expression; 25
–572 to –6; HepG2; 40% increase in luciferase expression; 25
–572 to –6; human hepatocytes; 20–90% increase in luciferase expression; 25
In response to xenobiotics:
–1203 to –61; HepG2; 20–117% increase in transcriptional activation; 26
–1203 to –61; HuH7; 75–147% increase in transcriptional activation; 26
–1203 to –61; CaCo-2; 4–19% increase in transcriptional activation; 26

For example, would a 20% greater metabolism of testosterone by CYP3A4*1B over the course of a man’s lifetime (as indicated by the data in FIG. 1) sufficiently increase the risk of prostate cancer to the extent that it could be detected in an epidemiological context?
If so, the apparent discrepancy between epidemiological associations of this genetic variant and the functional effects of this variant in experimental systems might not exist.

Variation in experimental approaches used to assess the function of genetic variants in complex metabolic systems could affect inferences about function (see FIG. 1 and BOX 1). Results might vary depending on the specific experimental conditions, unrecognized effects of regulatory elements, and post-transcriptional or post-translational processing. Hashimoto et al.12 suggest that removing the repressor region upstream of the CYP3A4*1B variant sequence might unmask a promoter effect. Studies that evaluate regulatory region genetic variants in CYP3A4 might differ substantially depending on whether this region is present or absent in the experimental systems. In fact, various constructs were used to evaluate in vitro functional effects of CYP3A4*1B (FIG. 1), and this experimental variability could have influenced the inferences made in these studies. The inclusion of enhancer elements that might be required in expression assays can similarly affect expression. Regulation of expression might only become apparent under exposure to the relevant compounds in the proper cellular context. The use of primary cells versus cell culture systems can also affect inferences about function. The use of different cell or tissue types (for example, see FIG. 1) might reflect different regulatory influences that affect the functional assessment of a genetic variant. These effects might be further compounded if the magnitude of functional effects of individual variants is small, leading to a greater potential for artificial masking or enhancement of functional effects in experimental systems.
Therefore, the ability to detect functionally relevant effects of genetic variants might be highly dependent on the context of the experimental system used, and experimental systems might not reflect relevant in vivo effects in humans. Given the differences between the relevant phenotypes in humans and experimental systems, information obtained from the latter might not be consistent with results of epidemiological association studies. Therefore, it might not be possible to make clear inferences about in vivo genotype function in humans from experimental data. Inconsistencies between experimental data and epidemiological studies can also reflect a potential study bias that is inherent to some epidemiological investigations. Poor study design, including insufficient statistical power, could influence the results of an epidemiological study to produce false positive or false negative inferences (BOX 1).

In addition to more traditional experimental approaches that are used to assess SNP function, a wide variety of novel approaches for assessing gene function have been proposed, including gene tagging27, gene trapping28, N-ethyl-N-nitrosourea (ENU) mutagenesis29, proteomics methods30 and evaluation of epigenetic mechanisms31. The HaploCHIP method32 is an illustrative example. HaploCHIP analyzes SNPs that affect gene regulation in vivo using chromatin immunoprecipitation and mass spectrometry to identify differential protein–DNA binding in vivo that is associated with allelic variants of a gene as a surrogate measure of transcriptional activity. This allele-specific quantification method uses haplotype-specific chromatin immunoprecipitation (CHIP) to measure the amount of phosphorylated RNA polymerase II that is bound to different alleles, thereby estimating the differences in protein binding between the alleles. This assay can be adapted for high-throughput analysis because of its sensitivity and ability to quantify the relative abundance of two different alleles in a sample of immunoprecipitated chromatin. Using this approach, Knight et al.32 showed that there was a close correlation between the level of bound phosphorylated (active) RNA polymerase II at the imprinted small nuclear ribonucleoprotein polypeptide N locus and allele-specific expression. They also used this method to identify an SNP of the cytokine tumour necrosis factor (TNF), which plays a pivotal role in inflammation, immunity and apoptosis, although the involvement of this SNP in modulating TNF transcription is yet to be shown. Application of the HaploCHIP approach to the TNF/lymphotoxin-α (LTA) locus identified functionally important haplotypes that correlate with allele-specific transcription of LTA32.

Despite reports of both traditional and new ways of assessing the function of a genetic variant, little has been done to determine whether genotypic differences that lead to small phenotypic perturbations are functionally significant in the context of complex human disease. No criteria have been established to evaluate the importance of genetic variants with small but potentially pleiotropic effects on disease risk. This should be a main focus of future studies attempting to assess the functional significance of genetic variants.

Box 1 | Factors influencing consistency of gene–disease associations

Variables affecting inferences from experimental studies:
• In vitro or in vivo system studied
• Cell type studied
• Cultured versus fresh cells studied
• Genetic background of the system
• DNA constructs
    • DNA segments that are included in functional (for example, expression) constructs
    • Use of additional promoter or enhancer elements
• Exposures
    • Use of compounds that induce or repress expression
    • Influence of diet or other exposures on animal studies

Variables affecting epidemiological inferences:
• Inclusion/exclusion criteria for study subject selection
• Sample size and statistical power
• Candidate gene choice
    • A biologically plausible candidate gene
    • Functional relevance of the candidate genetic variant
    • Frequency of allelic variant
• Statistical analysis
    • Consideration of confounding variables, including ethnicity, gender or age
    • Whether an appropriate statistical model was applied (for example, were interactions considered in addition to main effects of genes?)
    • Violation of model assumptions

HARDY-WEINBERG PROPORTIONS: The binomial distribution of genotypes (that is, frequencies of genotypes AA, Aa and aa will be p², 2pq and q², respectively, where p is the frequency of allele A, and q is the frequency of allele a) that result in a population when there are no external pressures that cause deviations from p², 2pq and q².
TEST STATISTIC: A quantity whose value is used to decide whether or not the null hypothesis should be rejected, usually based on quantities computed using observed data.
RESTENOSIS: The constriction, narrowing or blockage of a coronary artery after an initial treatment such as angioplasty aimed at removing this blockage.
ANGIOPLASTY: An operation that is used to repair a damaged blood vessel or unblock a coronary artery.

Population genetics approaches

Knowledge of population genetic structure might provide insight into the functional relevance of a genetic variant on a disease trait. For example, genetic variants with a functionally significant impact on relevant phenotypes, including disease endpoints, might be more likely to deviate from expected allele and genotype frequencies compared with alleles that are functionally neutral.
It remains unclear whether there is evidence of selection for or against low penetrance alleles over evolutionary time, although it is clear that selection has influenced the pattern of genetic variability in the human genome33,34. However, a number of diseases that could confer selective pressures on genes of interest are relatively new with respect to evolutionary genetics history. For example, diabetes and obesity might confer disease risks that could lead to selective pressure on allele or genotype frequencies, but these effects still occur relatively late in life and have been a problem to human health on a population level for only a few generations. So, whilst substantial new information about evolutionary genetics history and low penetrance SNPs is becoming available, the link between this knowledge and the functional importance of specific SNPs has yet to be fully elucidated.

Deviations from expected allele or genotype frequencies across relevant phenotypic groups could be used to identify alleles that are more likely to be functionally associated with disease aetiology. For example, alleles that are truly causative of a disease state would be expected to be disproportionately over-represented among cases with the disease but under-represented among the disease-free controls35,36. As a result, deviations from HARDY-WEINBERG PROPORTIONS or other measures of population frequency might provide clues to the role of genes in the aetiology of this disease. For example, Hoh et al.37 developed a so-called ‘set association’ approach that evaluates sets of SNPs at various positions in the genome. Information about allelic association and Hardy-Weinberg equilibrium is combined over multiple markers in the genome. A genome-wide TEST STATISTIC is generated by summing contributions from many SNPs located in different genomic regions.
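As a rough illustration of what a deviation from Hardy-Weinberg proportions means at a single SNP, the sketch below runs a standard chi-square goodness-of-fit test (1 degree of freedom) on observed genotype counts. It is not the set-association method of Hoh et al.; the function name and the genotype-count arguments are assumptions introduced here for illustration only.

```python
from math import erfc, sqrt


def hardy_weinberg_test(n_AA, n_Aa, n_aa):
    """Chi-square test of observed genotype counts against p^2, 2pq, q^2.

    Hypothetical helper for illustration; returns the estimated frequency of
    allele A, the expected genotype counts, the chi-square statistic and its
    p-value (1 degree of freedom).
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes observed")

    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p

    # Expected counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)

    # Skip cells with zero expectation (monomorphic sample).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return p, expected, chi2, p_value
```

In a case-control setting, one might apply such a test separately to cases and to controls; a single-SNP test of this kind does not control for multiple testing, genotyping error or population admixture.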
The method performs a significance test on several sets of loci simultaneously, while using conservative measures of inference to control for the potential of false positive associations. Hoh et al.37 applied their approach to an SNP association study of RESTENOSIS. Among 779 patients with heart disease, 342 showed restenosis (cases) 6 months after ANGIOPLASTY. The remaining patients, on whom angioplasty was not performed, were the controls. Eighty-nine SNP markers were genotyped in 62 candidate genes for each individual. The algorithm identified nine genes that conferred susceptibility to the disease. Unfortunately, despite promising results, the method is sensitive to genotyping errors. Population ADMIXTURE, in which cases and controls belong to different ethnic groups with different SNP allele frequencies, can also adversely distort the test results. Therefore, although incorporating population genetics information can provide further information to a genotype–disease association study, additional research is required to determine the conditions under which these approaches will be optimally applied.

ADMIXTURE: Combining two or more populations into a single group. Combining two populations has implications for studies of genotype–disease associations if the component populations have different genotypic distributions.
CELL CYCLE CHECKPOINTS: Steps in the normal sequence of development and division in the cell. Disruption can lead to uncontrolled cell growth, and possibly cancer.
ODDS RATIO: A measure of relative risk that is usually estimated from case control studies.

Evolutionary and structural approaches

Natural selection shapes patterns of genetic variation in populations such that favourable variants increase in frequency relative to less favourable variants over time. Mutations in functionally important sites can be eliminated or kept at low frequencies.
So, nucleotide or amino-acid residues that have been conserved across species or within a gene family are more likely to be involved in regulation of vital functions, expression or tissue specificity than residues that are not conserved. Knowledge of the functional domains can therefore be useful when assessing the functional impact of an amino-acid change. The value of this knowledge was demonstrated early on for the effects of mutations in the primary sequence of haemoglobin38. In this case, the molecular basis of the clinical effects caused by mutations could be inferred as soon as the structural information became available. These pioneering studies recognized crucial links between the putative functional motifs and potential effects of mutations on function.

Recently, a number of computational algorithms have been developed to predict the impact of nucleotide or amino-acid substitutions on protein structure, expression and function. Evolutionary conservation of the amino-acid sequence can be determined by aligning amino-acid sequences of related proteins from unrelated organisms or across gene families. A number of algorithms have been proposed that use DNA or amino-acid sequence data to identify potentially functional residues or domains in a comparison with sequences in the public databases. These include methods that are based on protein three-dimensional (3D) structure39–44, methods that are based on evolutionary considerations45,46, and machine-learning approaches47. As examples of these approaches, we present two algorithms, SIFT (‘sorting intolerant from tolerant’; REFS 48–50; see Further information for website) and PolyPhen (‘polymorphism phenotyping’; REF. 51; see Further information for website). Both are based entirely around amino-acid sequence, and are useful for predicting the impact of exonic amino-acid substitutions on disease risk.
A third algorithm, known as CODDLE (‘choosing codons to optimize discovery of deleterious lesions’; see Further information for website) uses nucleotide sequence (for example, a PCR amplicon) to identify all potentially deleterious nucleotide substitutions, including nonsense, frameshift, missense and splicing mutations.

The SIFT algorithm. This algorithm is based on the assumption that amino-acid positions that are important for the correct biological function of the protein are conserved across the protein family and/or across evolutionary history. SIFT uses protein sequence and multiple alignment information of these sequences to estimate ‘tolerance indices’ that predict tolerated and deleterious (that is, intolerant) substitutions for every position of the query sequence. Substitutions at each position with normalized tolerance indices that are below a chosen cut-off point are predicted to be deleterious. Substitutions with indices that are greater than or equal to the cut-off point are predicted as being tolerated (that is, putatively non-functional). Using three examples (LacI, HIV-1 protease and bacteriophage T4 lysozyme), Ng et al.48 showed that a high proportion of substitutions that are predicted to be deleterious by SIFT did affect phenotypes in experimental assays. However, some positions that were predicted to be intolerant by SIFT were tolerated in experimental assays. In these cases, the positions are usually involved in an unknown function that the assay does not detect. There is also evidence that SIFT fails to identify residues that are vital for protein function but that have not been conserved throughout the family48. SIFT predictions are based on sequence data alone and do not require knowledge of protein structure or function. So, substitutions in uncharacterized proteins can be evaluated by SIFT only when homologous sequences are provided.
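The cut-off rule described above can be illustrated with a deliberately simplified sketch. It is not the SIFT implementation: here the 'tolerance index' of a residue is simply its count in one alignment column divided by the count of the most common residue, whereas the published method estimates its indices differently. The function names, the toy scoring rule and the default cut-off of 0.05 are assumptions made for illustration.

```python
from collections import Counter


def tolerance_indices(column):
    """Toy normalized index per residue for one alignment column.

    Simplification (not SIFT's scoring): the most frequent residue gets 1.0
    and every other observed residue is scaled by that count. Gaps ('-')
    are ignored.
    """
    counts = Counter(res for res in column if res != "-")
    if not counts:
        return {}
    top = max(counts.values())
    return {res: n / top for res, n in counts.items()}


def predict_substitution(alignment, position, new_residue, cutoff=0.05):
    """Classify a substitution at a 0-based alignment position.

    Indices below the cut-off are called 'deleterious'; indices greater than
    or equal to it are called 'tolerated'. Residues never observed in the
    column get an index of 0.0. The cut-off value is a placeholder.
    """
    column = [seq[position] for seq in alignment]
    index = tolerance_indices(column).get(new_residue, 0.0)
    label = "tolerated" if index >= cutoff else "deleterious"
    return index, label


# Hypothetical aligned homologues, all of equal length.
toy_alignment = ["MKV", "MKI", "MRV", "MKV"]
print(predict_substitution(toy_alignment, 0, "L"))  # fully conserved column
print(predict_substitution(toy_alignment, 2, "I"))  # variable column
```

A toy score of this kind shares the limitation noted in the text: it can only be as informative as the homologous sequences supplied to it.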
Recently, Zhu et al.6 compared the relationship of tolerance to amino-acid change, as predicted by SIFT, with the reported association with SNPs in cancer-related genes (FIG. 2). For the study, 166 published case-control studies that reported associations in 46 SNPs, in 39 different genes, in 16 different cancer sites, were chosen. All of these SNPs are located in biologically plausible cancer-related genes, such as those involved in DNA repair, carcinogen metabolism or CELL CYCLE CHECKPOINTS. The putative functional significance of these affected SNPs was calculated using tolerance indices from SIFT by comparing sequences from different species. The analyses showed a significant inverse correlation between estimates of cancer risk as assessed by the ODDS RATIO (OR) and the tolerance indices of the amino-acid variants. These findings indicate that alterations in conserved amino-acids in SNPs are more likely to be associated with cancer susceptibility. So, using a molecular evolutionary approach might help identify SNPs to be genotyped in future molecular epidemiological studies.

Bayesian phylogenetic analysis is another comparative evolutionary method that identifies potential functionally important amino-acid sites that disrupt gene function52. This approach aligns amino-acid sequences for different species and constructs a phylogenetic tree. It then obtains maximum likelihood estimates of nucleotide substitution rates, identifies conserved regions and calculates ancestral sequences. Finally, regions that evolve under positive selection are identified and compared with the distribution of conserved sites with missense changes that have been reported in public databases. Fleming et al.52 used this approach, as well as the SIFT program, to identify missense changes in exon 11 of the BRCA1 gene. The phylogenetic approach inferred that 38 of 139 missense changes affected function in exon 11.
By contrast, SIFT identified 36 of these 38 changes, and an additional 34. Fleming et al.52 hypothesized that SIFT predicted more changes because substitutions in all taxa are given the identical weight in SIFT. By contrast, under the assumption of the phylogenetic method, if substitutions are not shared by sister taxa, they are more likely to be sequencing errors. In addition, non-conservative substitutions maintained at sites in one pair of sister taxa are unlikely to be functionally significant. Applying this approach, Fleming et al.52 successfully predicted >85% of the known detrimental missense changes in several domains of BRCA1, showing the utility of this approach in identifying genetic variants that could be prioritized in further association studies.

Figure 2 | Relationship of in silico indices of SNP function and associations (measured by log2-transformed odds ratios) taken from the literature. a | The relationship between the position-specific independent count (PSIC) score difference for studied SNPs from PolyPhen (‘polymorphism phenotyping’) analyses and log2-transformed odds ratios (Log2ORs) from published associations (x axis, PSIC score; y axis, Log2ORs). b | The relationship between the tolerance index for studied SNPs from SIFT (‘sorting intolerant from tolerant’) analysis and log2ORs from associations (x axis, tolerance index; y axis, Log2ORs). The data indicate that SNPs that are inferred as functional are more likely to be associated with higher odds-ratio effects. By inference, they are more likely to be causative of disease. Modified, with permission, from REF. 6 © (2004) The American Association for Cancer Research.

Concerted efforts to further understand the functional role of genomic elements, including SNPs, are also underway.
For example, the National Institutes of Health has set up a public research consortium known as ‘ENCODE’ (‘encyclopaedia of DNA elements’; REF. 53), with the goal of identifying and categorizing functional DNA elements, including transcriptional regulatory sequences and determinants of chromosome structure and function. This initiative involves comparing existing computational and experimental approaches, and developing new methods for the identification and characterization of function.

Although these and other approaches can assist molecular epidemiologists in identifying the genetic variants, at specific nucleotides or residues, that are most likely to be functionally significant, they will probably provide only limited information about the true function of a genetic variant. The methods described above generally cannot identify functional effects in non-coding regions, and cannot evaluate the effect of other factors on function, such as exposures, that might be involved in the aetiology of the disease. Recently, Rogan et al.54 proposed a computational approach to evaluate the putative effect of splicing mutations. They used information theory-based models to evaluate the relationship between variants and predicted splice sites and relevant phenotypes. The results presented by Rogan et al.54 corresponded to known functions of the genes studied (CYP2C19, CYP2D6 and CYP3A5) and serve as a model for additional studies that evaluate the effect of both coding and non-coding variation on functionality (see also REF. 55). Even after the results from SIFT and related approaches have been interpreted, it might not be possible to determine whether a single variant within a putative functional domain is sufficient to disrupt function. Such approaches require sequence data and might therefore be limited by the sequence information that is available in public databases. Therefore, inferences of conservation (or the lack thereof) might simply reflect data limitations.
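The point that inferred conservation might simply reflect data limitations can be illustrated with a minimal sketch. It scores a single alignment column by one minus its normalized Shannon entropy, which is a simple stand-in for conservation and is not the SIFT algorithm. The function name `column_conservation` and the two toy columns are assumptions made purely for illustration.

```python
import math
from collections import Counter

def column_conservation(column):
    """Return 1 - normalized Shannon entropy of an alignment column.

    1.0 means every aligned residue is identical; lower values mean
    more variability. Gap characters ('-') are ignored.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(20)  # 20 standard amino acids
    return 1.0 - entropy / max_entropy

# Hypothetical data: the same site sampled from few or many homologues
few_sequences = "LLL"
many_sequences = "LLLLIVLLMLLI"

print(column_conservation(few_sequences))   # appears fully conserved
print(column_conservation(many_sequences))  # variability becomes visible
```

With only three sequences the site appears perfectly conserved, whereas a deeper alignment of the same hypothetical site reveals tolerated substitutions, so the apparent conservation in the first case reflects sampling rather than biology.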
Inconsistencies of inference using different classes of sequence data might also arise. For example, an analysis of sequences among human gene families might provide inferences that are different from across-species sequence analysis. Although potentially helpful in uncovering the role of a particular genetic region or variant, such differences might also obscure inferences about putative functional significance.

The PolyPhen algorithm. Missense variants might affect protein folding, binding or interaction sites, as well as the solubility or stability of the protein. These effects can be estimated from physical considerations and from the context of an amino-acid replacement within the family of homologous proteins. Sunyaev et al.56 demonstrated that a significant fraction of missense variants (nsSNPs) are likely to affect protein structure or function. Their approach, implemented in the PolyPhen algorithm, uses amino-acid sequence, phylogenetic and structural information to characterize the potential functional significance of a missense change. PolyPhen was developed to identify functionally important SNPs by predicting whether an amino-acid change is likely to be deleterious for the protein on the basis of 3D structure and multiple alignment of homologous sequences57. The possibly damaging effect of a variant in an amino-acid sequence can be determined if the substitution is in an annotated active or binding site, affects interaction with ligands present in the crystallographic structure, leads to a hydrophobicity or electrostatic charge change in a buried site, destroys a disulfide bond, affects the protein’s solubility, inserts proline in an α-helix, or is incompatible with the profile of amino-acid substitutions observed at this site in the set of homologous proteins.
Mapping the amino-acid substitution to the known 3D structure reveals whether the change is likely to destroy the hydrophobic core of a protein, electrostatic interactions, interactions with ligands, or other features of a protein. Briefly, PolyPhen uses protein sequence and variant data to search homologous sequence data from publicly available protein databases. Only sequences with 30% or more identity to the protein of interest are considered. On the basis of the alignment of these homologous sequences, profile scores can be computed for allelic variants. Profile scores, known as position-specific independent counts (PSICs), are logarithmic ratios of the likelihood that a given amino acid occurs at a particular site to the background probability of the amino acid occurring at random at a given position. Large differences in PSIC values for specific genetic variants might indicate that the substitution of interest is rarely or never observed in the protein family. Finally, PolyPhen maps the amino-acid substitution to the known 3D structure of the protein to examine whether the substitution might destroy the protein’s hydrophobic core, electrostatic interactions, interactions with ligands or other important features of the protein; this assessment is based on the analysis of several structural parameters and several contact parameters. Therefore, PolyPhen can provide information about whether an nsSNP is probably damaging, possibly damaging or benign, or whether its function is unknown. Using this approach, Sunyaev et al.57 estimated that 20% of human nsSNPs might affect protein function, although the proportion of SNPs that are truly deleterious is probably substantially lower. Sunyaev et al.57 further estimated that there are 20,000–60,000 functionally relevant nsSNPs in the human genome.
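The profile-score idea described above (a logarithmic ratio of the likelihood of an amino acid at a site to its background probability) can be sketched in a few lines. This is a deliberately simplified illustration and not PolyPhen's actual PSIC computation: it assumes a uniform background over the 20 amino acids, uses simple pseudocounts, and applies no sequence weighting. The function names and the toy alignment column are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_score(column, residue, pseudocount=1.0):
    """Log ratio of the smoothed frequency of `residue` in an alignment
    column to a uniform background frequency (an assumption)."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    observed = (counts[residue] + pseudocount) / total
    background = 1.0 / len(AMINO_ACIDS)
    return math.log(observed / background)

def score_difference(column, wild_type, variant):
    """Absolute difference between the profile scores of two alleles."""
    return abs(profile_score(column, wild_type) - profile_score(column, variant))

# Hypothetical column: arginine at nine homologues, lysine at one
column = "RRRRRRKRRR"
print(score_difference(column, "R", "K"))  # substitution seen in the family
print(score_difference(column, "R", "W"))  # substitution never seen: larger
```

In this toy case the never-observed tryptophan substitution yields a larger score difference than the occasionally observed lysine substitution, which mirrors the statement that large differences suggest a substitution is rarely or never seen in the protein family.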
By comparison, Fay et al.33 used a different model-based approach to estimate that 20–45% of nsSNPs are slightly deleterious and reach population frequencies of 1–10%, although as many as 80% of nsSNPs might have some functional impact. Zhu et al.6 applied PolyPhen to the 166 molecular epidemiological studies mentioned above to examine the correlation between PSIC score and the OR associated with a particular nsSNP (FIG. 2). They found a significant inverse correlation between the ORs and PSIC score difference, and between tolerance index and PSIC score difference. So, the more probable it is that an SNP is functionally relevant, the higher the OR effect and the more likely it is that causative associations will be identified. These results imply that using a method to assess functional significance, such as PolyPhen, can optimize the ability to identify meaningful and reproducible molecular epidemiological associations.

In addition to identifying important genetic variants for research prioritization, genotyping efforts could be reduced by eliminating amino-acid substitutions that have been deemed ‘neutral’ by algorithms such as PolyPhen. In some cases, some genes could be removed from further consideration if all of the variant alleles were deemed to be non-functional. Because prediction algorithms provide numerical data, it is feasible to further subdivide the variant alleles of a gene into those that might have a moderate, modest or no impact. In particular, when a gene is known to have many genetic variants, prediction algorithms can also reduce the effective number of alleles that need to be considered in association studies. Therefore, the use of these approaches should limit the number of genotypes that need to be considered, thereby limiting the potential for false positive associations that might result from performing unnecessary hypothesis tests.
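The effect of dropping predicted-neutral substitutions on the number of hypothesis tests can be sketched as follows. The variant records, gene names and prediction labels are hypothetical, and the Bonferroni-style per-test threshold is an assumption introduced here only to make the multiple-testing consequence concrete; the review itself does not prescribe a correction method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    gene: str
    name: str
    prediction: str  # e.g. "probably damaging", "possibly damaging", "benign", "unknown"

def prioritize(variants, drop=("benign",)):
    """Keep variants whose predicted class is not in `drop`."""
    return [v for v in variants if v.prediction not in drop]

def per_test_alpha(n_tests, alpha=0.05):
    """Bonferroni-style significance threshold for n_tests tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# Hypothetical candidate list (not real predictions)
candidates = [
    Variant("GENE_A", "A1", "probably damaging"),
    Variant("GENE_A", "A2", "benign"),
    Variant("GENE_A", "A3", "benign"),
    Variant("GENE_B", "B1", "possibly damaging"),
    Variant("GENE_C", "C1", "benign"),
]

selected = prioritize(candidates)
genes_remaining = sorted({v.gene for v in selected})

print(len(candidates), per_test_alpha(len(candidates)))
print(len(selected), per_test_alpha(len(selected)))
print(genes_remaining)  # GENE_C drops out: all of its variants were predicted benign
```

Here variants of unknown function are retained by default, because the text only proposes eliminating substitutions deemed neutral; a stricter study could pass a different `drop` tuple.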
The drawback of these approaches is that potential interactions among combinations of substitutions at a gene might be ignored, and the dependence of functional impact on genotype at other genes or on exposure might not be addressed. Nonetheless, the algorithms for predicting genotype impact on allele function provide an initial and important step towards simplifying the seemingly overwhelming complexity of the genotype data. The concept of equivalent alleles provides a basis for additional steps towards reducing the complexity of the data for molecular epidemiological studies.

Optimizing molecular epidemiological studies

To improve the probability of obtaining biologically and clinically meaningful associations between genotypes and disease outcomes, the design and interpretation of molecular epidemiological studies should include an assessment of functional significance. As outlined above and in TABLE 1, a number of criteria and/or approaches can be used to assess the functional significance of a genetic variant. These include: genetic characteristics, such as the type of genetic change (missense, frameshift, nonsense, regulatory, splicing, disruption of a known functional motif, etc.); population genetics characteristics and evolutionary conservation of nucleotide sequences; experimental evidence from in vitro or animal model studies that consider reproducibility of functional findings; the experimental conditions (such as cell/animal system, inducers/repressors) that are used to evaluate a functional effect; the magnitude of the inferred biological effects; and the relevance of the experimental model to human systems, including knowledge of the effect of the variant in the target tissue (that is, tissue specificity) and consideration of timing, dose or duration of relevant exposures.
Table 1 | Criteria for assessing the functional significance of a genetic variant in candidate gene association studies

Criterion: Nucleotide sequence
  Strong support for functional significance: Variant disrupts a known functional or structural motif
  Moderate support for functional significance: Variant is a missense change or disrupts a putative functional motif; changes to protein structure might occur
  Evidence against functional significance: Variant disrupts a non-coding region with no known functional or structural motif

Criterion: Evolutionary conservation
  Strong support for functional significance: Consistent evidence from multiple approaches for conservation across species and multigene families
  Moderate support for functional significance: Evidence for conservation across species or multigene families
  Evidence against functional significance: Nucleotide or amino-acid residue not conserved

Criterion: Population genetics
  Strong support for functional significance: In the absence of laboratory error, strong deviations from expected population frequencies in cases and/or controls in a particular ethnicity
  Moderate support for functional significance: In the absence of laboratory error, moderate to small deviations from expected population frequencies in cases and/or controls; effects are not well characterized by ethnicity
  Evidence against functional significance: Population genetics data indicate no deviations from expected proportions

Criterion: Experimental evidence
  Strong support for functional significance: Consistent effects from multiple lines of experimental evidence; effect in human context is established; effect in target tissue is known
  Moderate support for functional significance: Some (possibly inconsistent) evidence for function from experimental data; effect in human context or target tissue is unclear
  Evidence against functional significance: Experimental evidence consistently indicates no functional effect

Criterion: Exposures (for example, genotype–environment interaction studies)
  Strong support for functional significance: Variant is known to affect the metabolism of the exposure in the relevant target tissue
  Moderate support for functional significance: Variant might affect metabolism of the exposure or one of its components; effect in target tissue might not be known
  Evidence against functional significance: Variant does not affect metabolism of exposure of interest

Criterion: Epidemiological evidence
  Strong support for functional significance: Consistent and reproducible reports of moderate-to-large magnitude associations
  Moderate support for functional significance: Reports of association exist; replication studies are not available
  Evidence against functional significance: Prior studies show no effect of variant

TYPE I ERROR
Incorrectly rejecting a null hypothesis when the null hypothesis is correct. Similarly, the false positive rate.

Information from previously published epidemiological investigations indicating the effect of an SNP can also be considered for studies that attempt to replicate association findings. For example, an ad hoc measure of support for the functional significance of a particular variant could be created by weighting in favour of a functionally significant effect if the criteria for ‘strong support’ or ‘moderate support’ for functional significance are met; such variants might therefore be prioritized for association studies. Similarly, variants for which there is no or neutral functional information might be ranked lower in priority for association studies. Variants for which there is evidence against functional significance should not be considered in association studies unless the hypothesis and association study methods consider linkage disequilibrium approaches to identify candidate regions of interest. In this case, the study would assume that the variants under investigation are in linkage disequilibrium with the causative allele(s). Variants would be chosen for analysis on the basis of polymorphic content and haplotype or population genetics structure at that locus.

How would this approach be applied for a specific candidate SNP? For example, applying the criteria outlined in TABLE 1 for CYP3A4*1B, we learn first that CYP3A4*1B disrupts a regulatory motif (NFSE)12. Because the function of NFSE is not clear, we might therefore conclude that there is ‘moderate support’ for function on the basis of the ‘nucleotide sequence’ criterion. The regulatory region of CYP3A4 is similar, at the sequence level, to many members of the CYP3A multigene family, so we might conclude that there is strong support for function on the basis of the ‘evolutionary conservation’ criterion. However, because the role of putative regulatory domains in CYP3A genes58 is still debatable, the assessment of ‘evolutionary conservation’ is not completely clear. The population genetics of CYP3A4*1B has been well described59, but no data have been reported about deviation of frequencies from Hardy-Weinberg proportions in cases versus controls, so this criterion provides little support for or against function. The ‘experimental evidence’ about CYP3A4*1B function (FIG. 1) has been controversial, but there seems to be a small but consistent increase in expression associated with CYP3A4*1B. This relatively small effect might be insufficient to confer major effects on drug metabolism, but indicates possible associations with altered disease risk. We could therefore conclude that the ‘experimental evidence’ criterion suggests ‘moderate support’ for functional significance. It is well known that CYP3A4 metabolizes compounds that are strongly associated with disease risk, such as testosterone in prostate cancer aetiology11. Therefore, there is strong support that the product of this gene metabolizes relevant aetiological exposures, including steroid hormones and carcinogens11. Finally, there are numerous epidemiological studies that report an association of CYP3A4*1B with various phenotypes, thereby providing strong support for function on the basis of the ‘epidemiological evidence’ criterion. So, these criteria suggest moderate to strong support for the hypothesis that CYP3A4*1B is a functionally meaningful DNA change.
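The ad hoc measure of support mentioned above can be sketched as a simple score over the TABLE 1 criteria. The support levels below follow the CYP3A4*1B assessment given in the text; the numeric weights, the ‘neutral’ label for a criterion that gives little support either way, and the function names are assumptions chosen for illustration, and other weighting schemes would be equally defensible.

```python
# Hypothetical numeric weights for the support levels (assumed, not from the review)
SUPPORT_WEIGHTS = {"strong": 2, "moderate": 1, "neutral": 0, "against": -1}

# Support levels for CYP3A4*1B as assessed in the text
CYP3A4_1B = {
    "nucleotide sequence": "moderate",
    "evolutionary conservation": "strong",
    "population genetics": "neutral",
    "experimental evidence": "moderate",
    "exposures": "strong",
    "epidemiological evidence": "strong",
}

def support_score(assessment, weights=SUPPORT_WEIGHTS):
    """Mean weight across criteria; higher values suggest stronger support."""
    if not assessment:
        raise ValueError("assessment must contain at least one criterion")
    return sum(weights[level] for level in assessment.values()) / len(assessment)

def rank_variants(assessments):
    """Order variant names from highest to lowest support score."""
    return sorted(assessments, key=lambda name: support_score(assessments[name]),
                  reverse=True)

print(round(support_score(CYP3A4_1B), 2))  # 1.33, between 'moderate' (1) and 'strong' (2)
```

Under these assumed weights the score falls between the values for ‘moderate’ and ‘strong’, which agrees with the qualitative conclusion reached in the text. A mean is used rather than a sum so that variants assessed on different numbers of criteria remain comparable.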
By applying these or similar criteria, molecular epidemiologists might be able to maximize the chance that association studies involve functionally significant genetic variants, and therefore could reduce the likelihood of TYPE I ERRORS, increase reproducibility of candidate gene association studies, and facilitate interpretation of positive associations. Even in the absence of a definitive function of a genetic variant, candidate gene association studies should not be undertaken without some consideration of function.

1. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
2. Salisbury, B. A. et al. SNP and haplotype variation in the human genome. Mutat. Res. 526, 53–61 (2003).
3. Schneider, J. A. et al. DNA variability of human genes. Mech. Ageing Dev. 124, 17–25 (2003).
4. Schork, N. J., Fallin, D. & Lanchbury, J. S. Single nucleotide polymorphisms and the future of genetic epidemiology. Clin. Genet. 58, 250–264 (2000).
5. Sachidanandam, R. et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
6. Zhu, Y. et al. An evolutionary perspective on SNP screening in molecular cancer epidemiology. Cancer Res. 64, 2251–2257 (2004).
   The first comprehensive evaluation and comparison of SIFT and PolyPhen algorithms in molecular epidemiological association studies.
7. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature Genet. 33, 177–182 (2003).
   A comprehensive evaluation of the consistency of association studies that demonstrates the need for functional correlates in achieving consistency in association study results.
8. Botstein, D. & Risch, N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genet. 33 (Suppl.), 228–237 (2003).
9. Stoilov, P. et al. Defects in pre-mRNA processing as causes of and predisposition to diseases. DNA Cell Biol. 21, 803–818 (2002).
10. Knight, J. C. Functional implications of genetic variation in non-coding DNA for disease susceptibility and gene regulation. Clin. Sci. (Lond.) 104, 493–501 (2003).
11. Li, A. P., Kaminski, D. L. & Rasmussen, A. R. Substrates of human hepatic cytochrome P450 3A4. Toxicology 104, 1–8 (1995).
12. Hashimoto, H. et al. Gene structure of CYP3A4, an adult-specific form of cytochrome P450 in human livers and its transcriptional control. Eur. J. Biochem. 218, 585–595 (1993).
13. Rebbeck, T. R., Jaffe, J. M., Walker, A. H., Wein, A. J. & Malkowicz, S. B. Modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 90, 1225–1229 (1998).
14. Paris, P. L. et al. Association between a CYP3A4 genetic variant and clinical presentation in African-American prostate cancer patients. Cancer Epidemiol. Biomarkers Prev. 8, 901–905 (1999).
15. Felix, C. A. et al. Association of CYP3A4 genotype with treatment-related leukemia. Proc. Natl Acad. Sci. USA 95, 13176–13181 (1998).
16. Kadlubar, F. F. et al. The putative high activity variant, CYP3A4*1B, predicts the onset of puberty in young girls. Cancer Epidemiol. Biomarkers Prev. 12, 327–331 (2003).
17. Lai, J., Vesprini, D., Chu, W., Jernstrom, H. & Narod, S. A. CYP gene polymorphisms and early menarche. Mol. Genet. Metab. 74, 449–457 (2001).
18. Jernstrom, H. et al. Genetic factors related to racial variation in plasma levels of insulin-like growth factor-1: implications for premenopausal breast cancer risk. Mol. Genet. Metab. 72, 144–154 (2001).
19. Lamba, J. K. et al. Common allelic variants in cytochrome P4503A4 and their prevalence in different populations. Pharmacogenetics 12, 121–132 (2002).
20. Westlind, A., Lofberg, L., Tindberg, N., Andersson, T. B. & Ingelman-Sundberg, M. Interindividual differences in hepatic expression of CYP3A4: relationship to genetic polymorphism in the 5′-upstream regulatory region. Biochem. Biophys. Res. Commun. 259, 201–205 (1999).
21. Amirimani, B., Walker, A. H., Weber, B. L. & Rebbeck, T. R. Response: re: modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 91, 1588–1590 (1999).
22. Ando, Y. et al. Re: modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 91, 1587–1590 (1999).
23. Spurdle, A. B. et al. The CYP3A4*1B polymorphism has no functional significance and is not associated with risk of breast or ovarian cancer. Pharmacogenetics 12, 355–366 (2002).
24. Floyd, M. D. et al. Genotype-phenotype associations for common CYP3A4 and CYP3A5 variants in the basal and induced metabolism of midazolam in European- and African-American men and women. Pharmacogenetics 13, 595–606 (2003).
25. Amirimani, B. et al. Transcriptional activity effects of a CYP3A4 promoter variant. Environ. Mol. Mutagen. 42, 299–305 (2003).
26. Hamzeiy, H., Bombail, V., Plant, N., Gibson, G. & Goldfarb, P. Transcriptional regulation of cytochrome P4503A4 gene expression: effects of inherited mutations in the 5′-flanking region. Xenobiotica 33, 1085–1095 (2003).
27. Jeon, J. & An, G. Gene tagging in rice: a high throughput system for functional genomics. Plant Sci. 161, 211–219 (2001).
28. Cecconi, F. & Meyer, B. I. Gene trap: a way to identify novel genes and unravel their biological function. FEBS Lett. 480, 63–71 (2000).
29. Adams, M. D. ENU mutagenesis for pharma. Drug Discov. Today 8, 199–200 (2003).
30. Lee, Y. S. & Mrksich, M. Protein chips: from concept to practice. Trends Biotechnol. 20 (Suppl.), S14–18 (2002).
31. Nikaido, I. et al. EICO (expression-based imprint candidate organizer): finding disease-related imprinted genes. Nucleic Acids Res. 32 (database issue), D548–551 (2004).
32. Knight, J. C., Keating, B. J., Rockett, K. A. & Kwiatkowski, D. P. In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nature Genet. 33, 469–475 (2003).
   The authors report a new method and application of experimental approaches to assessing genotype function.
33. Fay, J. C., Wyckoff, G. J. & Wu, C. I. Positive and negative selection on the human genome. Genetics 158, 1227–1234 (2001).
34. Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12, 1805–1814 (2002).
35. Feder, J. N. et al. A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nature Genet. 13, 399–408 (1996).
36. Nielsen, D. M., Ehm, M. G. & Weir, B. S. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am. J. Hum. Genet. 63, 1531–1540 (1998).
37. Hoh, J., Wille, A. & Ott, J. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 11, 2115–2119 (2001).
   The authors propose a novel approach to association studies that incorporates both association and population genetics information in identifying disease genes, including the possibility of genome-wide associations.
38. Perutz, M. F. Structure and function of haemoglobin. I. A tentative atomic model of horse oxyhaemoglobin. J. Mol. Biol. 13, 646–668 (1965).
39. Wang, Z. & Moult, J. Three-dimensional structural location and molecular functional effects of missense SNPs in the T cell receptor Vβ domain. Proteins 53, 748–757 (2003).
40. Wang, Z. & Moult, J. SNPs, protein structure, and disease. Hum. Mutat. 17, 263–270 (2001).
41. Chasman, D. & Adams, R. M. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. 307, 683–706 (2001).
42. Ferrer-Costa, C., Orozco, M. & de la Cruz, X. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol. 315, 771–786 (2002).
43. Saunders, C. T. & Baker, D. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol. 322, 891–901 (2002).
44. Herrgard, S. et al. Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors. Proteins 53, 806–816 (2003).
45. Miller, M. P. & Kumar, S. Understanding human disease mutations through the use of interspecific genetic variation. Hum. Mol. Genet. 10, 2319–2328 (2001).
46. Koref, M. E. S., Gangeswaran, R., Koref, I. P. S., Shanahan, N. & Hancock, J. M. A phylogenetic approach to assessing the significance of missense mutations in disease genes. Hum. Mutat. 22, 51–58 (2003).
47. Krishnan, V. G. & Westhead, D. R. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics 19, 2199–2209 (2003).
48. Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874 (2001).
   An outline of the SIFT approach to assessing missense variant function using evolutionary similarity.
49. Ng, P. C. & Henikoff, S. Accounting for human polymorphisms predicted to affect protein function. Genome Res. 12, 436–446 (2002).
50. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814 (2003).
51. Ramensky, V., Bork, P. & Sunyaev, S. Human nonsynonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900 (2002).
   An outline of the PolyPhen methodology for using evolutionary and structure data to assess SNP function.
52. Fleming, M. A., Potter, J. D., Ramirez, C. J., Ostrander, G. K. & Ostrander, E. A. Understanding missense mutations in the BRCA1 gene: an evolutionary approach. Proc. Natl Acad. Sci. USA 100, 1151–1156 (2003).
53. National Institutes of Health. The ENCODE Project: ENCyclopedia Of DNA Elements [online], <http://www.genome.gov/10005107> (2003).
54. Rogan, P. K., Svojanovsky, S. & Leeder, J. S. Information theory-based analysis of CYP2C19, CYP2D6 and CYP3A5 splicing mutations. Pharmacogenetics 13, 207–218 (2003).
55. Pagani, F. & Baralle, F. E. Genomic variants in exons and introns: identifying the splicing spoilers. Nature Rev. Genet. 5, 389–396 (2004).
56. Sunyaev, S., Ramensky, V. & Bork, P. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet. 16, 198–200 (2000).
57. Sunyaev, S. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10, 591–597 (2001).
58. Schuetz, E. G. Lessons from the CYP3A4 promoter. Mol. Pharmacol. 65, 279–281 (2004).
59. Zeigler-Johnson, C. M. et al. Ethnic differences in the frequency of prostate cancer susceptibility alleles at SRD5A2 and CYP3A4. Hum. Hered. 54, 13–21 (2002).

Acknowledgements
Some of the work discussed in this review was supported by grants from the Public Health Service and the University of Pennsylvania Cancer Center.

Competing interests statement
The authors declare that they have no competing financial interests.

Online links

DATABASES
The following terms in this article are linked online to:
Entrez Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
CYP3A4 | TNF

FURTHER INFORMATION
CODDLE: http://www.proweb.org/coddle
PolyPhen: http://www.bork.embl-heidelberg.de/PolyPhen
SIFT: http://blocks.fhcrc.org/%7Epauline/SIFT

NATURE REVIEWS | GENETICS VOLUME 5 | AUGUST 2004 | 597