REVIEWS

Mapping complex disease traits with global gene expression

William Cookson*, Liming Liang‡, Gonçalo Abecasis‡, Miriam Moffatt* and Mark Lathrop§

Abstract | Variation in gene expression is an important mechanism underlying susceptibility to complex disease. The simultaneous genome-wide assay of gene expression and genetic variation allows the mapping of the genetic factors that underpin individual differences in quantitative levels of expression (expression QTLs; eQTLs). The availability of systematically generated eQTL information could provide immediate insight into a biological basis for disease associations identified through genome-wide association (GWA) studies, and can help to identify networks of genes involved in disease pathogenesis. Although there are limitations to current eQTL maps, understanding of disease will be enhanced with novel technologies and international efforts that extend to a wide range of new samples and tissues.

Genome-wide association study (GWA study). An examination of common genetic variation across the genome designed to identify associations with traits such as common diseases. Typically, several hundred thousand SNPs are interrogated using microarray or bead chip technologies.

*National Heart and Lung Institute, Imperial College London, London SW3 6LY, UK.
‡Center for Statistical Genetics, Department of Biostatistics, SPH II, Ann Arbor, Michigan 48109-2029, USA.
§CEA / Centre National de Genotypage, 91057 Evry, France.
Correspondence to W.C., L.L., G.A., M.M. or M.L.
doi:10.1038/nrg2537

Genome-wide association (GWA) studies of common complex or multifactorial diseases have been spectacularly successful in the last 2 years, with many new loci identified with levels of probability that were once thought unattainable.
However, the extraordinary levels of significance of the association signals have yet to be translated into a full understanding of the genes or genetic elements that are mediating disease susceptibility at particular loci.

The functional effects of DNA polymorphism on multifactorial disease can be mediated through several mechanisms. Polymorphisms that alter protein function can have very important effects, such as NOD2 (nucleotide-binding oligomerization domain-containing 2; also known as CARD15) mutations in inflammatory bowel disease1 and FLG (filaggrin) mutations in eczema (atopic dermatitis)2. However, systematic study of complex diseases with known non-synonymous SNPs has not yielded many highly significant results3, and variation in gene expression is probably a more important mechanism underlying susceptibility to complex disease. The abundance of a gene transcript is directly modified by polymorphism in regulatory elements. Consequently, transcript abundance might be considered as a quantitative trait that can be mapped with considerable power. The loci that control transcript abundance have been named expression QTLs (eQTLs)4,5.

There is a substantial gap between SNP associations from a GWA study and understanding how the locus contributes to disease. Further genotyping and statistical analyses are often necessary to identify causal variants, which are then functionally investigated. This Review explores the value of systematic identification of eQTLs as one means of characterizing the function of loci underlying complex disease traits.

The combination of whole-genome genetic association studies and the measurement of global gene expression allows the systematic identification of eQTLs. By assaying gene expression and genetic variation simultaneously on a genome-wide basis in a large number of individuals, statistical genetic methods can be used to map the genetic factors that underpin individual differences in quantitative levels of expression of many thousands of transcripts.
The resulting comprehensive eQTL maps provide an important reference source for categorizing both cis and trans effects of disease-associated SNPs on gene expression. In addition to providing information about the biological control of gene expression, such data aid in interpreting the results of GWA studies. Once the statistical evidence for association of genetic markers to a disease trait has been established, genome-wide eQTL mapping data can be examined to see if the same genetic markers are also associated with quantitative transcript levels of one or more genes (such markers are known as eSNPs). The availability of systematically generated eQTL information provides immediate insight into a probable biological basis for the disease associations, and can help to identify networks of genes involved in disease pathogenesis.

184 | MARCH 2009 | VOLUME 10 | www.nature.com/reviews/genetics
© 2009 Macmillan Publishers Limited. All rights reserved

[Figure 1: flow diagram (Nature Reviews | Genetics). Panels: Gene expression (microarray analysis of transcript levels in tissue of interest); Genotyping (genome-wide genotyping of SNPs and other variants); eQTLs (association of genetic markers with gene expression); Epigenetic modifications; Non-genetic effects; Environmental stimuli; Integration with genome-wide association studies; Network analysis; Disease susceptibility.]

Figure 1 | eQTL mapping. Expression QTL (eQTL) mapping begins with the measurement of gene expression in a target cell or tissue from multiple individuals. This information is the substrate for investigating the effects of DNA polymorphism (of any type) on the expression of individual genes. Other factors that can alter transcription, such as epigenetic CpG methylation, may also be mapped. Network analyses build upon strong correlations that are present between transcripts, and allow the identification of modules of genes that mediate complex functions.
This information can then be made available and used to interpret genetic associations and mapping information from the study of complex disease.

Epigenetic. A mitotically stable change in gene expression that depends not on a change in DNA sequence, but on covalent modifications of DNA or chromatin proteins such as histones.

Heritability (H²). The heritability of an individual trait is estimated by the ratio of genetic variance to total trait variance, so that 0 indicates no genetic effects on trait variance and 1 indicates that all variance is under genetic control.

The potential of genome-wide eQTL identification was shown originally in the yeast Saccharomyces cerevisiae6 and then in humans, animals and plants4,7. The history of eQTL mapping has been comprehensively reviewed7–9, and will not be described in detail here. This Review will show how the combination of genetics and global gene expression can be a powerful tool for systematically unravelling the effects of variation in transcription on disease. First, we briefly introduce the principles and current methods of eQTL mapping and describe the basis of eQTLs. We then explore the relevance of these results to disease gene identification. The limits of current eQTL mapping data are discussed, as are the expected impact of new technologies, international efforts to extend results to new samples and tissues, and how cell lines might be tested with stimuli that are relevant to disease.

eQTL mapping

In practical terms, the starting point for eQTL mapping is the measurement of gene expression in a target cell or tissue from multiple individuals (FIG. 1). This information is the substrate for investigating the effects of DNA polymorphism (of any type) on the expression of individual genes. The use of microarray technology to measure gene expression from many thousands of genes simultaneously has been a principal driving force for systematic mapping of eQTLs7.
The field is benefiting from progressively more sophisticated platforms for such studies, which are described in the later sections of this Review.

Procedures for eQTL mapping are based on the insight that expression levels can be analysed with genetic approaches in the same manner as any other quantitative trait phenotype, such as body weight or blood lipids. In particular, study designs and statistical methods that are used traditionally to map QTLs can be successfully applied to the identification of eQTLs10–12. Interpretation of eQTL data can then be developed further by the incorporation of additional biological information, such as epigenetic modifications and analysis of regulatory networks, which are discussed below.

eQTLs are influenced not only by genetic polymorphisms, but also by a range of other biological factors. These can be dissected systematically, starting with the measurement of heritability (H²).

Heritability. Family studies have shown that many human eQTLs are highly heritable13,14. The linkage approach, in which family members are studied, has been valuable in demonstrating that genetic factors have widespread and identifiable influences on eQTLs in humans, and such studies have provided broad localization for some of the underlying genetic factors15,16.

GWA mapping of common genetic variants that underlie eQTLs has recently become possible owing to the wide availability of high-throughput and low-cost SNP genotyping. These results are particularly relevant to disease mapping that is also focused on common SNPs characterized with similar SNP arrays. Moreover, the interpretation of these eQTL data relies strongly on methodologies that have been developed for disease GWA13. For example, a family study of lymphoblastoid cell lines (LCLs) identified nearly 15,000 traits (each corresponding to an individual Affymetrix probe) with an estimated H² > 0.3, indicating that genetic influences on gene expression seem to be widespread13.
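The idea that a transcript level can be analysed like any other quantitative trait can be illustrated with a single SNP and a single transcript. The following is a minimal sketch, assuming an additive model in which genotypes are coded as 0, 1 or 2 copies of an allele; the function name `eqtl_regression` and the toy data are invented for illustration and are not taken from the studies cited. It returns the regression slope and the fraction of expression variance explained by the SNP.

```python
def eqtl_regression(dosages, expression):
    """Least-squares fit of expression = a + b * dosage.

    dosages: allele counts (0, 1 or 2) per individual (assumed additive coding).
    expression: transcript level per individual, in the same order.
    Returns (slope b, r2), where r2 is the variance explained by the SNP.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least 3 individuals")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic SNP or constant expression")
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2


# Toy data, invented for illustration only.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, r2 = eqtl_regression(dosages, expression)
print(f"slope per allele = {slope:.2f}, variance explained = {r2:.3f}")
```

In a genome-wide setting the same test would be repeated for every SNP and transcript pair, so a stringent significance threshold and, for family data, a model that accounts for relatedness would be needed; neither is shown here.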
Other studies have similarly described a high H² of many eQTLs in LCLs and other tissues4,17,18.

Genetic factors (with both cis-acting and trans-acting effects; see below) are often identified for eQTLs that have high H². For example, in the LCL study mentioned above13, eQTLs for 81% of traits with H² > 0.8 could be mapped to one or more SNPs at genome-wide significance. However, the SNP map on average accounted for less than 20% of the estimated trait H², consistent with results obtained by other studies16. This indicates the presence of genetic or other factors affecting familial clustering on transcription that are not detectable in these genetic associations. Factors other than SNPs that might affect H² are discussed in more detail below. Further understanding of disease phenotypes can also be gained from analysing whether particular types of genes have more heritable variation in expression level13,19 (BOX 1).

Major histocompatibility complex (MHC). A complex locus on chromosome 6p that comprises numerous genes, including the human leukocyte antigen genes, which are involved in the immune response.

Gene Ontology (GO). A widely used classification system of gene functions and other gene attributes that uses a standardized vocabulary. The system uses a hierarchical organization of concepts (an ontology) with three organizing principles: molecular functions (the tasks done by individual gene products), biological processes (for example, mitosis) and cellular components (examples include the nucleus and the telomere).

Cis and trans effects. Statistical analyses of eQTLs need to take into account that the loci identified can influence gene expression either in cis or in trans. The definition of
a cis effect is somewhat arbitrary, but cis-acting eQTLs are typically considered to include SNPs within 100 kb upstream and downstream of the gene that is affected by that eQTL. This definition becomes more problematic in regions of extended linkage disequilibrium, such as the major histocompatibility complex (MHC) locus. Detailed analysis of the positions of mapped cis-acting eQTL effects has shown that these are enriched around transcription start sites and within 250 bp upstream of transcription end sites, and they rarely reside more than 20 kb away from the gene20. Cis-acting variants also seem to occur more often in exonic SNPs20.

Trans effects are usually weaker than cis effects in humans4,5 and in rats21, but they are more numerous. It is not known if trans effects are mostly mediated through transcription factor variants or through other mechanisms. ‘Master regulators’ are trans-acting factors with multiple effects on gene expression that have been identified in S. cerevisiae22, in rat tissues21 and in the human genome5. It is of interest that, at least in yeast, master regulators are not enriched for transcription factors, and trans-regulatory variation seems to be broadly dispersed across classes of genes with different molecular functions22.

Other types of variant. The function of DNA can be altered by many mechanisms in addition to SNPs. Transcription can also be modified by copy number variants (CNVs), insertions and deletions, short tandem repeats and single amino acid repeats23. A systematic investigation of the effects of CNVs in individuals who are part of the International HapMap Project showed that SNPs and CNVs captured 84% and 18%, respectively, of the total detected genetic variation in gene expression, but the signals from the two types of variation had little overlap24. It has been shown that CNVs in regulatory hot spots in the malaria parasite genome dictate transcriptional variation25.
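The distance-based convention for a cis effect described earlier in this section can be written down as a simple rule. The sketch below is illustrative only: the function name `classify_eqtl`, the coordinate layout and the example positions are assumptions, and the 100 kb window is the conventional value quoted in the text, exposed as a parameter because the definition is acknowledged to be somewhat arbitrary.

```python
CIS_WINDOW_BP = 100_000  # the 100 kb convention quoted in the text


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=CIS_WINDOW_BP):
    """Label an eQTL as 'cis' or 'trans' by distance from the affected gene.

    A SNP is called cis if it lies on the same chromosome as the gene and
    within `window` base pairs upstream or downstream of the gene body.
    Everything else is called trans under this simplified rule.
    """
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only.
print(classify_eqtl("17", 1_050_000, "17", 1_100_000, 1_120_000))  # 50 kb upstream
print(classify_eqtl("17", 900_000, "17", 1_100_000, 1_120_000))    # 200 kb upstream
print(classify_eqtl("5", 1_110_000, "17", 1_100_000, 1_120_000))   # other chromosome
```

A fixed window of this kind would label a long-range effect several hundred kilobases from its gene as trans, and it takes no account of extended linkage disequilibrium, so in practice the window might be widened or replaced by a region-specific rule.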
It has also been observed that small-scale copy number variation (that is, a single or few copies) can lead to multiple orders of magnitude change in gene expression and, in some cases, switches in deterministic control26.

Box 1 | Gene ontology analyses

Many expression QTLs (eQTLs) are highly heritable. Therefore, Gene Ontology (GO) analyses can be applied to eQTL databases to identify the types of gene that show the most inherited variation in their levels of expression (at least in the cell type studied, usually lymphoblastoid cell lines; LCLs). The most highly heritable GO biological process for eQTLs in LCLs in one study was, unexpectedly, ‘response to unfolded proteins’, a group containing numerous chaperonins and heat shock proteins. The individual variation in response to unfolded proteins may be an evolutionary response to cellular stress, and these genes could be candidates in the study of neurodegenerative diseases and the ageing processes. Genes that regulate RNA processing, DNA repair and progression through the cell cycle were also exceptionally heritable. The evolutionary advantage of individual variation in these genes is unclear.

As expected, genes with significant heritability are also enriched in GO categories of immune response13,19. These highly heritable immune genes may be of particular value for the study of infectious and inflammatory diseases. The most heritable traits can be considered as candidate genes for effects on particular disease traits, but they could also be studied in large population samples, such as those contained in national biobanks, to investigate their actions on unexpected phenotypes.

Epigenetic factors. In addition to DNA sequence variants, gene transcription is also modulated by epigenetic modifications (discussed further in the ‘Limitations of mapping studies’ section below). For example, non-germ line epigenetic methylation of CpG residues that regulate gene expression is common in the human genome27.
In a limited study of three chromosomes, 17% of genes can be differentially methylated in their 5′ UTRs and approximately one-third of the differentially methylated 5′ UTRs are inversely correlated with transcription27. A further level of complexity comes from post-translational modifications of histones that modulate DNA accessibility and chromatin stability to provide an enormous variety of alternative interaction surfaces for trans-acting factors (reviewed in REF. 28).

eQTLs and disease gene mapping

Combining eQTL and GWA studies. One of the most important consequences of eQTL mapping is the link that it provides between genetic markers of disease identified in GWA studies and the expression of a specific gene or genes. In particular, the power of these studies depends upon the identification of specific genetic markers that are simultaneously associated with disease and eQTLs, whereas simply comparing differences in gene expression in cases and controls might not provide sufficient power to detect important differences with the available sample sizes. The value of this is illustrated by several recent investigations in which eQTL analysis was incorporated directly as a component of the GWA study design (included in TABLE 1).

The number of GWA studies continues to rise rapidly. In GWA studies to date, 10–15% of the top hits have affected a known eQTL in a public data set (TABLE 1). We will therefore discuss selected instances of these to show the value of the method. For example, a recent study generated genome-wide transcriptional profiles of lymphocyte samples from participants in the San Antonio Family Heart Study, and showed that high-density lipoprotein cholesterol concentration was influenced by the cis-regulated vanin 1 (VNN1) gene15. Similarly, a study of post-mortem brain tissue identified eQTLs affecting the MAPT (microtubule-associated protein tau) and APOE (apolipoprotein E) genes, which play an important part in Alzheimer’s disease29.
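The link between GWA markers and eQTLs described in this section amounts to asking whether a disease-associated SNP also appears in an eQTL data set. The following is a minimal sketch of that lookup, assuming the eQTL data are available as (SNP, transcript, LOD score) records; the function name `annotate_gwas_hits`, the SNP and transcript identifiers and the LOD threshold of 6.0 are all hypothetical choices made for illustration.

```python
def annotate_gwas_hits(gwas_hits, eqtl_table, lod_threshold=6.0):
    """Return {snp: [(transcript, lod), ...]} for GWA SNPs that are also eSNPs.

    gwas_hits: iterable of SNP identifiers associated with a disease trait.
    eqtl_table: iterable of (snp, transcript, lod) records from an eQTL map.
    lod_threshold: minimum LOD score for an eQTL to be reported (assumed value).
    """
    by_snp = {}
    for snp, transcript, lod in eqtl_table:
        if lod >= lod_threshold:
            by_snp.setdefault(snp, []).append((transcript, lod))
    return {snp: by_snp[snp] for snp in gwas_hits if snp in by_snp}


# Hypothetical identifiers and scores, for illustration only.
eqtl_table = [
    ("snp_A", "TRANSCRIPT_1", 14.0),
    ("snp_A", "TRANSCRIPT_2", 6.5),
    ("snp_B", "TRANSCRIPT_3", 3.1),   # below threshold, ignored
    ("snp_C", "TRANSCRIPT_4", 9.2),   # eQTL, but not a GWA hit
]
gwas_hits = ["snp_A", "snp_B", "snp_D"]

annotated = annotate_gwas_hits(gwas_hits, eqtl_table)
fraction = len(annotated) / len(gwas_hits)
print(annotated)
print(f"{fraction:.0%} of GWA hits coincide with an eQTL in this toy data set")
```

An exact match on SNP identifier is the simplest case. A fuller analysis would also consider markers in strong linkage disequilibrium with the GWA hit, which this sketch does not attempt.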
At the same time as the San Antonio study, a GWA study of asthma13,30 identified a series of SNPs in strong linkage disequilibrium and spanning more than 200 kb of chromosome 17q21. The study showed that these SNPs were strongly associated with the risk of asthma30. The region of association contains 19 genes, none of which is an obvious candidate for disease. Examination of eQTL data derived from Affymetrix Hu133A arrays13,30 on the same families showed that the disease-associated SNPs had highly significant (p < 10⁻²²) effects in cis on the expression of one of the genes: ORMDL3 (ORM1-like 3). This locus illustrates the utility of combining eQTL and disease mapping studies. Despite the highly significant association with both expression and disease, the predicted expression difference between cases and controls, which was averaged over all genotypes, was not expected to be significant given the sample size: this was in agreement with the observed results30. In these data, borderline significant effects were also observed in the expression of the gene neighbouring ORMDL3, GSDML (gasdermin-like)30. Subsequent eQTL studies with the Illumina platform and RT-PCR experiments confirmed that the same SNPs determine eQTLs with both genes. These results focus attention on one or both of these genes as probable candidates for a role in disease pathology. Many additional studies are now underway to investigate the biological functions of these two genes and their relationship to asthma31–35.

Table 1 | Disease-linked associations with significant expression QTLs from the literature and public databases

| Study | Trait | Region | Candidate gene(s) | Transcript affected by SNP | Transcript region | Logarithm of odds (LOD) score |
| Gudbjartsson et al.102* | Height | 7p22 | GNA12 | GNA12 | 7p22 | 13 |
| | | 11q13.2 | Intergenic | CCND1 | 11q13 | 7.4 |
| | | 7q21.3 | LMTK2 | C17orf37 | 17q21 | 6.0 |
| | | | | HSD17B8 | 6 | 6.4 |
| | | | | NDUFS8 | 11 | 6.1 |
| | | 3p14.3 | PXK | RPP14 | 3 | 9.2 |
| Göring et al.15 | High-density lipoprotein cholesterol levels | 6q21 | VNN1 | HDL (serum) | Multiple sites | 8.0 |
| Kathiresan et al.40 | Polygenic dyslipidaemia | 20q13 | PLTP | PLTP | 20q13 | 16 |
| | | 15q22 | LIPC | LIPC | 15q22 | 17 |
| | | 11q12 | FADS1, FADS2, FADS3 | FADS1 | 11q12 | 35 |
| | | | | FADS3 | 11q12 | 8.0 |
| | | 9p22 | TTC39B | TTC39B | 9p22 | 7.0 |
| | | 1p13 | CELSR2, PSRC1, SORT1 | SORT1 | 1p13 | 270 |
| | | | | PSRC1 | 1p13 | 249 |
| | | | | CELSR2 | 1p13 | 80 |
| | | 12q24 | MMAB, MVK | MMAB | 12q24 | 43 |
| | | 1p31 | ANGPTL3 | DOCK7 | 1p31 | 27 |
| | | | | ANGPTL3 | 1p31 | 11 |
| Libioulle et al.37 | Crohn’s disease | 5p13 | Intergenic | PTGER4 | 5p13 | 3.0 |
| Barrett et al.36 | Crohn’s disease | 5q31 | OCTN1, SLC22A4, SLC22A5 | SLC22A5 | 5q31 | Unknown |
| Hom et al.103* | Systemic lupus erythematosus | 8p23.1 | C8orf13, BLK | BLK | 8p23.1 | 20 |
| | | | | C8orf13 | 8p23.1 | 28 |
| Hakonarson et al.104* | Type 1 diabetes | 12q13 | RAB5B, SUOX, IKZF4 | RPS26 | 12q13 | 33 |
| | | 1p31.3 | ANGPTL3 | DOCK7 | 1p31.3 | 16 |
| Wellcome Trust Case Control Consortium105* | Type 1 diabetes | 12q13.2 | ERBB3 | RPS26 | 12q13.2 | 43.2 |
| Todd et al.106* | Type 1 diabetes | 12q13.2 | ERBB3 | RPS26 | 12q13.2 | 30.3 |
| Plenge et al.107* | Rheumatoid arthritis | 9q34 | TRAF1-C5 | LOC253039 | 9q34 | 6.3 |
| Thein et al.108 | Fetal haemoglobin F production | 6q23.3 | Intergenic | HBS1L | 6q23.3 | 6.0 |
| Moffatt et al.30 | Childhood asthma | 17q21 | Intergenic | ORMDL3 | 17 | 14 |
| Wellcome Trust Case Control Consortium105* | Bipolar disorder | 16p12 | PALB2, NDUFAB1, DCTN5 | DCTN5 | 16p12 | 9.2 |
| | | 6p21 | NR | HLA-DQB1 | 6p21 | 8.9 |
| | | | | HLA-DRB4 | 6p21 | 11 |
| Di Bernardo et al.109* | Chronic lymphocytic leukaemia | 2q37 | SP140 | SP140 | 2q37 | 8.8 |

*Identified through comparison of the National Human Genome Research Institute’s Catalog of Published Genome-Wide Association Studies and the mRNA by SNP Browser v 1.0.1.

Using eQTLs to interpret GWA studies. Such findings have encouraged the use of these eQTL data as a general tool for interpreting results from GWA studies.

Human leukocyte antigen (HLA).
A glycoprotein, encoded at the major histocompatibility complex locus, that is found on the surface of antigen-presenting cells and that presents antigen for recognition by helper T cells.

Recent analyses of Crohn’s disease (CD) illustrate this approach36,37. Initially, markers on chromosome 5 were shown to be strongly associated with CD in one GWA scan, but their biological effects could not be readily deduced as they reside in a 1.25 Mb gene desert. Examination of the LCL eQTL database showed that one or more of these polymorphisms act as a long-range cis-acting factor influencing expression of PTGER4 (prostaglandin E receptor 4), a gene that resides approximately 270 kb proximal to the association region37. The homologue of this gene has been implicated in phenotypes similar to CD in the mouse37,38. Thus, research is now focused on PTGER4 as a primary candidate gene for this disease susceptibility locus.

Subsequently, the eQTL approach has been applied systematically in a meta-analysis of GWA studies of CD, and several other interesting results have been obtained36. For example, eQTLs were used to address an outstanding question in CD genetics related to the identification of the CD susceptibility gene or genes in the cytokine cluster on chromosome 5q31, where SNPs have an established association with disease39. The disease-associated SNPs in the meta-analysis of this region were all shown to be correlated with decreased SLC22A5 (solute carrier family 22, member 5) mRNA expression levels. Another CD locus identified in the meta-analysis coincided with the asthma risk locus on chromosome 17, in which the disease markers are also correlated with expression of ORMDL3 and GSDML, as described above. Thus the same genetic variants contribute to susceptibility to both CD and asthma, possibly by perturbing expression of one or both of these genes. Several additional examples of eQTLs within CD susceptibility loci have also been reported36.
These co-localizations greatly exceed the number that would be expected by chance, suggesting that many of them are indicative of underlying biological processes involved in disease susceptibility36.

Public GWA study results are available at the National Human Genome Research Institute’s Catalog of Published Genome-Wide Association Studies. Examination of these results identifies many other disease associations for which eQTL data provide similar insights (TABLE 1). For example, a recent large study of polygenic dyslipidaemia identified 30 loci with highly significant effects on blood lipid measurements40. Examination of gene expression in samples of liver from 957 subjects allowed highly significant eQTLs to be identified for 7 of the 30 loci40 (TABLE 1).

In some cases, the eQTL data give genetic evidence to support a candidate gene for which a role was previously suggested from location and biological hypotheses (such as GNA12 for height on chromosome 7p22, and BLK and C8orf13 for autoimmune systemic lupus erythematosus on 8p23.1). More often, the gene expression data identify different genes or suggest a particular gene from a number of candidates. Examples of this include the cluster of trans-acting genes from the height locus on chromosome 7q21.3, the RPS26 gene from the type 1 diabetes locus on 12q13.2, and the DCTN5 gene from the bipolar disorder locus on chromosome 16p12.1.

Not all examples of eQTL findings are straightforward, as exemplified by the association reported between the SH2B1 (SH2B adaptor protein 1) locus and body mass index (BMI)41. In this study, a missense SNP in SH2B1 was also associated with significant variation in transcript abundances of EIF3C (eukaryotic translation initiation factor 3, subunit C) and TUFM (Tu translation elongation factor, mitochondrial). When mutated, the homologue of SH2B1 leads to extreme obesity in mice, apparently because of a failure in proper regulation of appetite.
The authors speculate that the SH2B1 variant has a causal role but is in linkage disequilibrium with a different variant that influences EIF3C and TUFM mRNA levels; alternatively, regulation of EIF3C or TUFM mRNA levels could have a causal role instead of, or in addition to, variation in SH2B1 (REF. 41).

eQTL databases. mRNA by SNP Browser v 1.0.1 is a database of eQTLs from asthma studies13,30 that allows searches by genes, chromosomal regions and SNPs, and is a good example of how data from this kind of research can be examined. VarySysDB is another public database and contains 190,000 extensively annotated mRNA transcripts from 36,000 loci. VarySysDB offers information encompassing published human genetic polymorphisms for each of these transcripts separately. In addition to SNP effects on transcription, VarySysDB includes deletion–insertion polymorphisms from dbSNP, CNVs from the Database of Genomic Variants, short tandem repeats and single amino acid repeats from H-InvDB and linkage disequilibrium regions from D-HaploDB23.

MHC locus. Analysis of eQTLs in the MHC locus is of particular interest for studies of diseases in which infection and autoimmunity are major components42. Intense study of the MHC locus over many years has revealed many genes that are duplicated or polymorphic, and DNA variants in the MHC locus have been associated with more diseases than any other region of the human genome42. Many disease associations have been attributed to selective binding of processed antigen to the antigen-presenting grooves of human leukocyte antigen (HLA) variants. The results of eQTL studies on the MHC locus must be interpreted with caution. This is because the high degree of genetic variability and linkage disequilibrium across the MHC locus could introduce some spurious results owing to polymorphism in sequences corresponding to probes used for expression measurements43 (see below).
Nevertheless, global gene expression data have shown very strong effects of particular SNPs on the level of expression of the classical MHC antigens HLA-A, HLA-C, HLA-DP, HLA-DQ and HLA-DR (p < 10⁻²⁰ to p = 10⁻³⁰)13. This confirmed the effect of genetic variation on the level of HLA-DQ expression observed previously44. The strength of these effects suggests that associations of MHC class I and class II polymorphism might depend on the level of gene transcription as much as restriction of response to antigen13. A possible example is type 1 diabetes, in which the functional effects of the long-recognized association to the class II MHC genes45 have not been elucidated, despite combined p values of less than 10⁻¹⁰⁰. These results suggest that even in this intensively studied region, the investigation of eQTLs could add to our understanding of the many known genetic associations.

Serial analysis of gene expression (SAGE). A method for quantitative and simultaneous analysis of a large number of transcripts. Short sequence tags are isolated, concentrated and cloned; their sequencing reveals a gene expression pattern that is characteristic of the tissue or cell type from which the tags were isolated.

Additional biological interpretation and validation. A genome exerts its functions not through particular genes or proteins, but through highly complex networks that produce a range of responses46. As perturbations of such networks underlie the pathogenesis of many diseases47,48, network analysis incorporating eQTL data has recently provided important novel insights into mechanisms underlying multifactorial diseases16,17,49 (BOX 2).
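Network analyses of this kind build on strong correlations between transcript levels across individuals. As a deliberately simplified illustration, and not the weighted co-expression method used in the studies cited, the sketch below links two transcripts when the absolute Pearson correlation of their expression levels reaches a threshold, and then reports connected groups of transcripts as modules. The function names, the threshold of 0.9 and the toy expression values are assumptions made for this example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant values assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coexpression_modules(expr, threshold=0.9):
    """Group transcripts into modules of strongly correlated expression.

    expr: {transcript: [expression level per individual]}.
    Two transcripts are linked when |r| >= threshold (assumed cut-off);
    modules are the connected components of the resulting graph.
    """
    neighbours = {t: set() for t in expr}
    for a, b in combinations(expr, 2):
        if abs(pearson(expr[a], expr[b])) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    modules, seen = [], set()
    for start in expr:
        if start in seen:
            continue
        stack, module = [start], set()
        while stack:  # depth-first walk over linked transcripts
            current = stack.pop()
            if current in module:
                continue
            module.add(current)
            stack.extend(neighbours[current] - module)
        seen |= module
        modules.append(module)
    return modules


# Toy expression levels for four hypothetical transcripts in five individuals.
expr = {
    "T1": [1, 2, 3, 4, 5],
    "T2": [2, 4, 6, 8, 10.5],
    "T3": [5, 4, 3, 2, 1],
    "T4": [3, 1, 4, 1, 5],
}
modules = coexpression_modules(expr)
print(modules)
```

Once modules are defined, eQTL mapping can be applied to the transcripts in a module as a group, which is the step that connects network structure back to genetic variation.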
Extensive investigations of human populations, animal models and cellular systems are required to provide biological validation of the relationship between specific genes and multifactorial disease traits, even when the relationship is identified through eQTL analysis. Given the substantial effort that is required for validation, careful selection of only the strongest candidates is essential. As shown in the above examples, the combination of GWA studies and eQTL analysis is a powerful way of identifying a small number of candidate genes and pathways. With the deployment of new technologies, such as exon arrays and RNA resequencing, and expansion of the tissue types covered, as described below, we expect future eQTL databases to be even more powerful tools for such identifications.

Box 2 | Networks and other analytical tools

Traditional genetics and cellular biology have rested on the assumption that a single stimulus (or DNA variant) when applied to a cell (or gene) will have a single outcome. The reality is that even a simple stimulus will induce changes in transcription in many genes that interact in complex networks, with an outcome that affects many different transcripts and processes. The networks can be considered to be made up of multiple pathways that act at genetic, genomic, cellular, tissue and whole-organism levels46.

The technology that is already available to gather global information on gene expression, proteins and metabolites is now allowing the systematic identification of the networks of genes that interact in disease processes92,93. Analysis of genetic variants that perturb networks through the effects of expression QTLs (eQTLs) has recently provided important novel insights into mechanisms underlying multifactorial diseases16,17,49. This type of analysis may also lead to the systematic identification of transcription modules94 and the construction of regulatory networks95.
The potential of genetic mapping approaches to identify networks of genes operating on haematopoietic stem cells96 and immune responses97 is among the examples that have been discussed in the literature.

The impact of combining eQTL analysis with an investigation of gene networks is shown by the recent detection of genetic variants associated with transcript abundance of a macrophage-enriched network and obesity-related traits in human subjects. Parallel studies in mice and humans identified a network module for obesity-related traits that was enriched for genes involved in the inflammatory and immune response. eQTL mapping was then used to identify cis-acting genetic variants associated with this network of genes. The authors characterized these genetic variants in a large cohort of individuals, and showed statistical enrichment for variants that were associated with obesity-related biometric traits16. This approach allowed the identification of genetic variants that had minor individual effects on the trait, but that can be identified as a group because of the overall perturbation of the network. Three genes in this network, lipoprotein lipase (Lpl), lactamase β (Lactb) and protein phosphatase 1-like (Ppm1l), were validated by gene knockouts, strengthening the association between this network and metabolic disease traits49.

A bibliography and a range of statistical routines for network analysis can be found on the Weighted Gene Co-expression Network site.

Potential limitations and future directions

Despite the power of eQTL mapping to help identify the genetic basis of disease, there are many limitations to current methodologies and potential for considerable improvements as technologies develop. The best appreciated technical barriers to optimal eQTL mapping are in the use of microarrays to measure gene expression (BOX 3). Other problems and their potential solutions are discussed below.

Comparisons between microarray platforms.
It was assumed that different microarray platforms give broadly comparable results50. However, numerous studies are now showing that the overlap in transcript detection between platforms is only ~30–40%, whether considered as presence or absence of detectable transcripts or the absolute level of transcript abundance51–53. The same level of discordance appears whether comparisons are made between Affymetrix arrays and serial analysis of gene expression (SAGE)52, Affymetrix and Illumina arrays50, Affymetrix and Applied Biosystems arrays53, or across multiple platforms51. Some of this discrepancy may be because individual genes are commonly interrogated by different sequences on different platforms. The situation can be improved when matching of genes is sought using genomic sequence rather than sequences inferred from the UniGene database of transcripts54. Concordance between platforms is improved further when probes are compared only when they target overlapping transcript sequence regions on cDNA microarrays or gene chips55. These discrepancies may follow from the complex and unpredictable factors that determine hybridization of particular nucleic acids to complementary array-bound sequences56,57. In addition, the selection of sequences on microarrays has been strongly biased to the 3′ end of genes, simply because public cDNA databases were first populated with genes identified by 3′ tags. A consistent conclusion of comparison studies has been that different platforms provide complementary results51,52, probably because they are all sampling only a selected fraction of the total transcriptome from the cells or tissue under study. The use of multiple platforms to extract all the expression information from a cell or tissue is impractical.

New platforms for measuring gene expression.
A more comprehensive measurement of gene expression comes from arrays that interrogate all known human exons.
Affymetrix have produced global exon arrays58, which show a high degree of correspondence in terms of fold changes with their pre-existing ‘classical microarrays’, suggesting that the additional probe sets on the exon arrays will provide reliable as well as more detailed coverage of the transcriptome59. The use of exon arrays allows the identification of tissue-specific alternative splicing events as well as significant expression outside of known exons and well-annotated genes60. Exon arrays on other platforms are likely to provide similarly robust results.

Nature Reviews | Genetics, Volume 10, March 2009, 189. © 2009 Macmillan Publishers Limited. All rights reserved.

Box 3 | Pitfalls with microarrays
The use of microarrays to measure gene expression has led directly to the development of expression QTL (eQTL) analyses. However, the microarray approaches that underlie most eQTL studies to date provide only partial gene coverage and have a limited dynamic range for quantitative detection of expression. Specific problems inherent in the use of these microarrays include: the systematic bias that can be introduced during sample preparation, hybridization and measurement of expression; batch-to-batch variation in array manufacture; and day-to-day variation in laboratory conditions98. These types of effects are probably under-recognized, as exemplified by a report of large-scale differences in gene expression between ethnic groups98,99. In this case, the highly significant differences in gene expression that the data had suggested between the groups98 were found to be due to the separate processing of expression measurements in lymphoblastoid cell lines (LCLs) from subjects of European and Asian ancestry. Cis-eQTL artefacts can also arise from the overlap of SNPs with transcript probes100.
Alterations in hybridization efficiency owing to the SNP can give an erroneous impression of differences in transcript abundance attributable to the SNP (and to other DNA variants with which it is in linkage disequilibrium)100. It has been estimated that 15% of microarray probes for any given gene will overlap with SNPs that are polymorphic in the population under study100. However, most coding SNPs in the human genome are uncommon, and it also seems that measurements of abundances are robust against mismatches between the probe and RNA sequences101. Although evidence that these artefacts have an effect has been presented43, it is reassuring that in a large study in humans Emilsson et al.16 found no evidence of systematic or specific hybridization artefacts from SNPs in their eQTL data. Nevertheless, important findings from microarrays need confirmation by specific assays, such as quantitative PCR, that avoid polymorphic sequences. Statistical methodology to account for batch effects, polymorphism and other sources of artefact is discussed by Alberts et al.102.
Most human studies of eQTLs have been performed in LCLs, primarily because LCLs were often created as a source of nucleic acids for genetic studies. However, LCLs can exhibit progressive genomic instability with multiple passages of storage and re-growth, with the resulting potential for artefacts.

Additive genetic effects. A mechanism of quantitative inheritance such that the combined effects of genetic alleles at two or more gene loci are equal to the sum of their individual effects.

Many of the problems that are inherent in the use of microarrays can be solved by massively parallel, ultra-high-throughput DNA sequencing systems (reviewed in REF. 61). These systems allow direct ultra-high-throughput sequencing of RNA, which can then be mapped back to the genome.
Sequencing of RNA provides a generic tool that can support a family of assays for measuring the genome-wide profiles of mRNAs, small RNAs, transcription factor binding, chromatin structure, DNase hypersensitivity and DNA methylation status61. RNA splices may also be effectively mapped by sequence-based methods. Despite its formidable promise, ultra-high-throughput sequencing is still not without problems. The machines can produce terabytes of data daily, and make profound demands on bioinformatics for data storage and assembly of reads. Short reads may pose severe problems for the interpretation of transcripts arising from gene families with high homology or repetitive regions of the genome. Nevertheless, it can be anticipated that within 2 years many studies will rely on this technology, and that alternative or complementary approaches, such as large-scale real-time PCR-based expression assays (for example, as described by Watson et al.62 and developed by WaferGen), will continue to evolve.

Limitations of mapping studies.
As discussed in the section on heritability, currently mapped loci account for only a portion of the estimated heritability of eQTLs. A similar degree of unattributed or ‘dark’ heritability has been observed in GWA studies of common complex traits and diseases. A large GWA meta-analysis, for example, recently identified 20 variants that are significantly associated with adult height. The combined effects of the 20 SNPs explained only 3% of height variation, taking into account such factors as age and population63. Similarly, a large GWA meta-analysis of Crohn’s disease identified 32 loci that significantly affect the disease, which together explained only 10% of the overall variance in disease risk and 20% of the genetic risk36. A large portion of the unattributed heritability is expected to result from the effects of multiple loci that are too weak to detect using current sample sizes18.
This explanation would be consistent with data in yeast, in which only 3% of highly heritable transcript abundances are explained by single-locus (monogenic) inheritance and 50% are consistent with more than five controlling loci of equal effect64. Although current SNP arrays provide relatively comprehensive coverage of the genome (more than 80%), some of the unattributed heritability will be due to genetic factors that reside in unmapped regions, or to variation that is not effectively tagged at present, such as CNVs. Dominance and interaction effects may also account for some of the unattributed heritability, as these may be confounded with additive genetic effects in the heritability estimates with some study designs. A previously described global eQTL study was based on sibling pairs, allowing estimates of heritability for all the transcripts measured13. The study suggested that dominance had a minimal effect on gene transcription13. Interestingly, it seemed that genetic interactions may have important influences on regulation of expression for some genes, but inclusion of interaction effects had a minimal impact on the overall attributable heritability13.
Epigenetic modifications and other factors that affect transcript abundance might not be accounted for in SNP-based association studies (see ‘The basis of eQTLs’ section above). Genomic imprinting is a particular case of an epigenetic effect with a parent-of-origin-dependent pattern. Monoallelic expression is established at imprinted loci, via epigenetic marks transmitted through the germ line. Several common complex diseases exhibit parent-of-origin effects that might indicate underlying imprinting, including asthma65, type 1 diabetes66,67, rheumatoid arthritis68, psoriasis69, inflammatory bowel disease70 and selective immunoglobulin A deficiency71, but systematic analysis of parent-of-origin effects in eQTL data has not yet been reported.
Finally, transcript abundance is a function of transcript stability as well as transcript production. Many factors mediate transcript stability, particularly in trans, either through protein–RNA interaction or through mechanisms mediated by small interfering RNAs (siRNAs)72. It seems clear that future studies of disease susceptibility as well as eQTLs will need to take these mechanisms into account.

Gene expression in tissues.
Although RNA for eQTL analyses would ideally be obtained from a wide variety of tissues, most human studies of eQTLs have been performed in LCLs, primarily because LCLs were often created as a renewable source of nucleic acids for genetic studies. Gene expression in LCLs, however, represents the particular circumstances of Epstein–Barr virus infection of B-cells and their subsequent uncontrolled growth. LCLs may also exhibit extreme clonality with random patterns of monoallelic expression in single clones73. Although only 60% of genes from any particular cell type will also be found in LCLs4,13, it has been established that LCLs provide information about gene expression for some genes that do not function primarily in these cells4,74–76. In addition, a recent comparison of eQTLs derived from the analysis of blood and adipose tissue showed little difference in the number of eQTLs that could be mapped, and there was an approximately 50% overlap of mapped loci from the two RNA sources16. Similarly, comparison between four different tissues showed no statistically significant differences in the number of mapped transcripts in experiments involving mapped recombinant inbred strains of mice18.
Despite the continued utility and convenience of LCL studies of gene expression, it is evident that many of the transcripts expressed in LCLs may be housekeeping genes, and transcripts that determine specialized cell functions and that modify disease may be more parsimoniously distributed. In addition, LCLs are removed from the stimuli that can induce disordered gene transcription in disease. This is exemplified by the differences that are observed in gene expression between LCLs derived from asthmatics and genes known to be expressed in asthmatic airways30. These factors all indicate that the direct examination of tissues that are involved in disease can provide much more information than LCLs alone.
Some eQTL studies of human tissue have already been carried out, notably of liver17, adipose16,49 and brain29 tissue. These show that approximately 60% of the transcriptome is expressed in each tissue, and that eQTLs from these tissues may be a valuable source of information for genetic mapping. Data from animal models suggest that tissue samples can allow detection of trans-eQTLs that are important in determining the composition of individual tissues18. Tissue samples will also allow the use of network analyses to identify the complex interactions that may underlie disease16,17,49 (BOX 2).
The costs of reagents and the limited availability of appropriate tissues have to date restricted studies in humans to several hundred subjects. Although a formal evaluation of optimal study sizes is difficult because of unknown trait heritability, we know empirically that studies with a few hundred subjects have consistently identified numerous eQTLs with vanishingly small p values4,5,75. It is also clear that subtle effects, particularly in trans, would be detected more reliably with larger samples.
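The dependence of detection on sample size can be illustrated with a rough calculation. The following is only a minimal sketch under stated assumptions, not a method taken from the studies cited: it assumes a single additive SNP that explains a fixed fraction of the variance of one transcript, uses a normal approximation to the association test statistic, and takes a hypothetical genome-wide threshold of 5e-8. The function name `eqtl_power` and all numerical inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def eqtl_power(n_subjects: int, variance_explained: float, alpha: float) -> float:
    """Approximate power of a two-sided test for one additive SNP effect.

    Assumption (normal approximation): the test statistic is N(lam, 1) with
    lam = sqrt(n * r2 / (1 - r2)), where r2 is the fraction of transcript
    variance explained by the SNP.
    """
    if not 0.0 < variance_explained < 1.0:
        raise ValueError("variance_explained must lie strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    std_normal = NormalDist()
    # Expected value of the test statistic under the alternative.
    lam = sqrt(n_subjects * variance_explained / (1.0 - variance_explained))
    # Two-sided critical value for the chosen significance threshold.
    critical_z = -std_normal.inv_cdf(alpha / 2.0)
    # Probability that the statistic falls in either rejection tail.
    return std_normal.cdf(lam - critical_z) + std_normal.cdf(-lam - critical_z)


ALPHA = 5e-8  # hypothetical threshold, chosen only for illustration
for n in (200, 2000):
    for r2 in (0.20, 0.02):  # a strong (cis-like) and a weak (trans-like) effect
        print(f"n={n:5d}  r2={r2:.2f}  power={eqtl_power(n, r2, ALPHA):.3f}")
```

Under these assumptions, an effect explaining a fifth of the variance is detectable with a few hundred subjects, whereas an effect explaining a fiftieth of the variance requires samples in the thousands to be detected with comparable reliability.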
It is therefore timely that the promise of eQTLs as a tool for disease genetics has been sufficiently exciting to prompt a National Institutes of Health (NIH) proposal for the ambitious Genotype-Tissue Expression (GTEx) project, a database that might include 1,000 samples from each of 30 different tissues. The GTEx project is currently running as a 2-year pilot study with the primary goal of testing the feasibility of collecting high-quality RNA and DNA from multiple tissues from approximately 160 donors identified through low post-mortem interval autopsy or organ transplant settings. If the pilot phase proves successful, the project will be scaled up to involve approximately 1,000 donors, with the eventual creation of a database to house existing and GTEx-generated eQTL data.
The use of tissues poses a number of problems that need to be resolved. Normal and diseased tissue samples can be difficult to access, and their use requires careful attention to ethical, legal and social issues. Samples taken at post-mortem from many tissues robustly retain their histological architecture and contain RNA that can be of sufficient quality for measurements of gene expression. However, the changes in gene expression that might accompany death or surgical resection have not yet been documented in any detail. Tissues typically consist of different cell types, and their composition can vary inconsistently in the presence of disease. Finally, tissue-specific DNA methylation profiles may affect 20% of genes27, and are expected to be important in understanding tissue eQTLs. Although some of these problems may be expected to degrade the information available from the study of any particular tissue, it should be appreciated that they will not systematically lead to false positives in eQTL analyses17, emphasizing the robustness of the eQTL approach.

Exercising the genome.
Tissue biopsies and other samples extend the ‘expression space’ that can be examined by eQTL studies.
They nevertheless still have limitations for functional analyses (particularly in humans as opposed to model organisms) when compared with cells that can be grown freely in culture and manipulated by systematic knockdowns. Although the transcripts in a particular cell under particular conditions reflect only part of the function of a particular genome, the range of transcripts from a given cell type can be widened by stimulating the cell in a variety of ways. The experimental extension of the genome expression space has been called ‘exercising the genome’77, and this strategy can be used to learn much more about gene expression and integrated gene functions.
Experimentally, evidence is already emerging that environmental actions on gene expression are profound in humans78 and model organisms79,80 (reviewed in REF. 81), and it is reasonable to assume that these components of gene expression can be fruitfully accessed through exposure to relevant stimuli. It is interesting that, in model organisms, environmentally induced changes in gene expression seem to act through prominent trans effects79,80 that may not be present in unstressed cells and tissues. It is therefore desirable that the genome of human LCLs or primary cells of particular interest be exercised by stimulating their gene expression in different ways. Model stimuli that could be tested in these systems include pro-inflammatory stresses, metabolic stresses (such as high or low glucose, or hypoxia), the response to radiation, the response to signalling molecules (such as neurotransmitters, hormones and peptides) and the response to therapeutic and chemotherapeutic agents.

Box 4 | eQTLs and network analyses of cancer
Mutations that disrupt cell growth control mechanisms are a feature of cancer. In addition, the unchecked cell division that is characteristic of cancer can in time result in many secondary mutations and progressive genomic disorganization103. Genetic studies of cancer tissue (so-called somatic cell genetics) have been used to identify the most common mutations in various tumours. Global gene expression studies have also been used in many cancer types, typically to identify gene signatures that can predict the clinical outcome104. However, most signature-based outcome predictions have not been replicated by independent studies104, perhaps owing to the innate heterogeneity of cancerous tissue and the problems of deriving statistically stringent results from the measurement of thousands of transcripts in limited numbers of samples.
Expression QTL (eQTL) analyses are a powerful tool to identify the functional consequences of the numerous copy number variants (CNVs), deletions and epigenetic modifications that are a feature of neoplastic cells. eQTL mapping allows the identification not only of genes underlying malignant processes105 but also of genes modifying disease progression106 and genes modulating individual responses to chemotherapy107.
Network analyses have not yet been widely applied to the study of cancer, but they have already led to interesting findings, such as the identification of the ASPM gene as a molecular target in patients with glioblastoma. The application of network analyses to cancer eQTLs may be expected to greatly alleviate problems with multiple comparisons and to lead to easier biological interpretation of results108,109. Direct comparison of the transcript network architecture of cancerous tissue against normal tissues may also allow much deeper understanding of cancer biology.
Conclusions
It is now well established that transcript abundances of genes may be considered as quantitative traits that can be mapped with considerable power, and that assaying gene expression and genetic variation simultaneously on a genome-wide basis in a large number of individuals will provide valuable tools for identifying the function of previously mapped susceptibility alleles underlying common complex diseases.
Although eQTLs are shown to be effective in mapping complex traits, there are many levels of information that are inherent in the measurement of global gene expression that have yet to be accessed, such as the effects of transcript stability, epigenetic effects or environmental stimuli. In addition, larger studies involving thousands of subjects may be necessary to identify weak trans effects with the same precision as the more powerful effects that are often observed in cis. Although trans effects can be relatively weak, the genes they modify (the trans-transcriptome) are likely to contain master regulators with wide effects on key processes that might feature more strongly in tissues or in cells subjected to particular environmental stimuli.
Many genes are only expressed in particular tissues or at specific times during development. Thus, although systematic studies of eQTLs are already being planned for a wide variety of tissues, other strategies will need to be formed to study particular cell types and tissues at specific stages of differentiation and development.
Understanding the genome of cancer cells and tissues is particularly challenging because the primary lesions that initially drive cellular proliferation are difficult to find when uncontrolled division results in progressive secondary damage to the genome and the transcriptome. eQTL analyses may be of particular value in malignant disease, because they allow a more integrated picture of what is happening in cancer cells (BOX 4).
Good progress is being made in cataloguing the SNPs and other polymorphisms that regulate transcription, and this could be the basis for a systematic listing of regulatory sequences and regulatory proteins. Identifying epigenetic effects is likely to be more difficult, particularly if they are mediated through histone modifications (which are difficult to detect on a large scale) rather than through differential CpG methylation.
The remarkable diversity of human transcriptional regulation raises new questions about the evolutionary value of unexpected variation in genes that mediate basic mechanisms, such as heat shock proteins or genes influencing the cell cycle and DNA repair. ‘Inverse genetics’ could be used to study the SNPs with the strongest effects on expression of such genes, and to investigate their actions on unexpected phenotypes measured in epidemiological samples.
New analytical techniques, particularly network analyses, promise rapid advances in reducing the complexity of expression data. Modules of co-expressed genes mediating complex functions may also be identified by time-series studies of the response of particular cell types to environmental stimuli82. In future, integration of eQTLs with data from large-scale approaches for genome resequencing, from proteomic and metabolomic analyses, from epigenomic studies and from functional screening of genes may provide a powerful set of tools for a systems biology approach to multifactorial disease, as well as a way to identify and biologically validate susceptibility genes83.
In the future, complex disease geneticists will require integrated public databases. Existing databases include the asthma study database (mRNA by SNP Browser v 1.0.1) and VarySysDB. A more comprehensive database is planned as part of the NIH GTEx project, which will house existing as well as GTEx-generated eQTL data. Future databases should include eQTL maps with SNPs, epigenetic marks, trans and cis effects, as well as effects that are specific for particular cells, tissues and environmental stimuli. Ultimately, they will also allow browsing for networks, modules and comparisons with model organisms.

1. Hugot, J. P. et al. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn’s disease. Nature 411, 599–603 (2001).
2. Palmer, C. N. et al. Common loss-of-function variants of the epidermal barrier protein filaggrin are a major predisposing factor for atopic dermatitis. Nature Genet. 38, 441–446 (2006).
3. Burton, P. R. et al. Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nature Genet. 39, 1329–1337 (2007).
4. Schadt, E. E. et al. Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302 (2003). This paper shows the power of eQTL analysis in humans.
5. Morley, M. et al. Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743–747 (2004).
6. Brem, R. B., Yvert, G., Clinton, R. & Kruglyak, L. Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755 (2002).
7. Rockman, M. V. & Kruglyak, L. Genetics of global gene expression. Nature Rev. Genet. 7, 862–872 (2006).
8. Gilad, Y., Rifkin, S. A. & Pritchard, J. K. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet. 24, 408–415 (2008).
9. Jia, Z. & Xu, S. Mapping quantitative trait loci for expression abundance. Genetics 176, 611–623 (2007).
10. Carlborg, O. et al. Methodological aspects of the genetic dissection of gene expression. Bioinformatics 21, 2383–2393 (2005).
www.nature.com/reviews/genetics © 2009 Macmillan Publishers Limited. All rights reserved REVIEWS 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. Kendziorski, C. M., Chen, M., Yuan, M., Lan, H. & Attie, A. D. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics 62, 19–27 (2006). Schliekelman, P. Statistical power of expression quantitative trait loci for mapping of complex trait loci in natural populations. Genetics 178, 2201–2216 (2008). Dixon, A. L. et al. A genome-wide association study of global gene expression. Nature Genet. 39, 1202–1207 (2007). Visscher, P. M., Hill, W. G. & Wray, N. R. Heritability in the genomics era — concepts and misconceptions. Nature Rev. Genet. 9, 255–266 (2008). Goring, H. H. et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nature Genet. 39, 1208–1216 (2007). Emilsson, V. et al. Genetics of gene expression and its effect on disease. Nature 452, 423–428 (2008). This paper illustrates the power of eQTL and network analysis in unravelling complex trait genetics. Schadt, E. E. et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol. 6, e107 (2008). Petretto, E. et al. Heritability and tissue specificity of expression quantitative trait loci. PLoS Genet. 2, e172 (2006). Monks, S. A. et al. Genetic inheritance of gene expression in human cell lines. Am. J. Hum. Genet. 75, 1094–1105 (2004). Veyrieras, J. B. et al. High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4, e1000214 (2008). Hubner, N. et al. Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nature Genet. 37, 243–253 (2005). Yvert, G. et al. Trans‑acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nature Genet. 35, 57–64 (2003). Shimada, M. K. et al. 
VarySysDB: a human genetic polymorphism database based on all H-InvDB transcripts. Nucleic Acids Res. 37, D810–D815 (2008). Stranger, B. E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853 (2007). Gonzales, J. M. et al. Regulatory hotspots in the malaria parasite genome dictate transcriptional variation. PLoS Biol. 6, e238 (2008). Mileyko, Y., Joh, R. I. & Weitz, J. S. Small-scale copy number variation and large-scale changes in gene expression. Proc. Natl Acad. Sci. USA 105, 16659–16664 (2008). Eckhardt, F. et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nature Genet. 38, 1378–1385 (2006). This paper shows the extent and distribution of methylation in the human genome. Krebs, J. E. Moving marks: dynamic histone modifications in yeast. Mol. Biosyst. 3, 590–597 (2007). Myers, A. J. et al. A survey of genetic human cortical gene expression. Nature Genet. 39, 1494–1499 (2007). Moffatt, M. F. et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448, 470–473 (2007). Bouzigon, E. et al. Effect of 17q21 variants and smoking exposure in early-onset asthma. N. Engl. J. Med. 359, 1985–1994 (2008). Duan, S. et al. Genetic architecture of transcript-level variation in humans. Am. J. Hum. Genet. 82, 1101–1113 (2008). Galanter, J. et al. ORMDL3 gene is associated with asthma in three ethnically diverse populations. Am. J. Respir. Crit. Care Med. 177, 1194–1200 (2008). Sleiman, P. M. et al. ORMDL3 variants associated with asthma susceptibility in North Americans of European ancestry. J. Allergy Clin. Immunol. 122, 1225–1227 (2008). Tavendale, R., Macgregor, D. F., Mukhopadhyay, S. & Palmer, C. N. A polymorphism controlling ORMDL3 expression is associated with asthma that is poorly controlled by current medications. J. Allergy Clin. Immunol. 121, 860–863 (2008). 36. Barrett, J. C. et al. 
Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nature Genet. 40, 955–962 (2008). A substantial meta-analysis of susceptibility loci underlying Crohn’s disease that illustrates the problem of unattributed heritability and the utility of eQTL data in understanding the function of disease-associated SNPs. 37. Libioulle, C. et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet. 3, e58 (2007). 38. Kabashima, K. et al. The prostaglandin receptor EP4 suppresses colitis, mucosal damage and CD4 cell activation in the gut. J. Clin. Invest. 109, 883–893 (2002). 39. Rioux, J. D. et al. Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nature Genet. 29, 223–228 (2001). 40. Kathiresan, S. et al. Common variants at 30 loci contribute to polygenic dyslipidemia. Nature Genet. 41, 56–65 (2009). 41. Willer, C. J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nature Genet. 41, 25–34 (2009). 42. Horton, R. et al. Gene map of the extended human MHC. Nature Rev. Genet. 5, 889–899 (2004). 43. Alberts, R. et al. Sequence polymorphisms cause many false cis eQTLs. PLoS ONE 2, e622 (2007). 44. Beaty, J. S., West, K. A. & Nepom, G. T. Functional effects of a natural polymorphism in the transcriptional regulatory sequence of HLA-DQB1. Mol. Cell. Biol. 15, 4771–4782 (1995). 45. Nejentsev, S. et al. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA‑B and HLA‑A. Nature 450, 887–892 (2007). 46. Sieberts, S. K. & Schadt, E. E. Moving toward a system genetics view of disease. Mamm. Genome 18, 389–401 (2007). 47. Lamb, J. et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935 (2006). 48. Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. 
USA 104, 8685–8690 (2007). 49. Chen, Y. et al. Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429–435 (2008). 50. Barnes, M., Freudenberg, J., Thompson, S., Aronow, B. & Pavlidis, P. Experimental comparison and crossvalidation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Res. 33, 5914–5923 (2005). 51. Pedotti, P. et al. Can subtle changes in gene expression be consistently detected with different microarray platforms? BMC Genomics 9, 124 (2008). 52. van Ruissen, F. et al. Evaluation of the similarity of gene expression data estimated with SAGE and Affymetrix GeneChips. BMC Genomics 6, 91 (2005). 53. Bosotti, R. et al. Cross platform microarray analysis for robust identification of differentially expressed genes. BMC Bioinformatics 8 (Suppl. 1), S5 (2007). 54. Ji, Y. et al. RefSeq refinements of UniGene-based gene matching improve the correlation of expression measurements between two microarray platforms. Appl. Bioinformatics 5, 89–98 (2006). 55. Carter, S. L., Eklund, A. C., Mecham, B. H., Kohane, I. S. & Szallasi, Z. Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC Bioinformatics 6, 107 (2005). 56. Sohail, M., Akhtar, S. & Southern, E. M. The folding of large RNAs studied by hybridization to arrays of complementary oligonucleotides. RNA 5, 646–655 (1999). 57. Southern, E., Mir, K. & Shchepinov, M. Molecular interactions on microarrays. Nature Genet. 21, 5–9 (1999). This review, by the inventor of DNA microarrays, highlights the complexity and unpredictability of the interactions between nucleic acids in solution and target sequences on solid supports. 58. Kapur, K., Xing, Y., Ouyang, Z. & Wong, W. H. Exon arrays provide accurate assessments of gene expression. Genome Biol. 8, R82 (2007). NATuRe ReVIeWS | GeneTics 59. Okoniewski, M. J., Hey, Y., Pepper, S. D. 
& Miller, C. J. High correspondence between Affymetrix exon and standard expression arrays. Biotechniques 42, 181–185 (2007).
60. Clark, T. A. et al. Discovery of tissue-specific exons using comprehensive human exon microarrays. Genome Biol. 8, R64 (2007).
61. Wold, B. & Myers, R. M. Sequence census methods for functional genomics. Nature Methods 5, 19–21 (2008).
62. Watson, R. M., Griaznova, O. I., Long, C. M. & Holland, M. J. Increased sample capacity for genotyping and expression profiling by kinetic polymerase chain reaction. Anal. Biochem. 329, 58–67 (2004).
63. Weedon, M. N. et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nature Genet. 40, 575–583 (2008).
64. Brem, R. B. & Kruglyak, L. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl Acad. Sci. USA 102, 1572–1577 (2005).
65. Moffatt, M. & Cookson, W. The genetics of asthma. Maternal effects in atopic disease. Clin. Exp. Allergy 28 (Suppl. 1), 56–61 (1998).
66. Bennett, S. & Todd, J. Human type 1 diabetes and the insulin gene: principles of mapping polygenes. Annu. Rev. Genet. 30, 343–370 (1996).
67. Warram, J. H., Krolewski, A. S., Gottlieb, M. S. & Kahn, C. R. Differences in risk of insulin-dependent diabetes in offspring of diabetic mothers and diabetic fathers. N. Engl. J. Med. 311, 149–152 (1984).
68. Koumantaki, Y. et al. Family history as a risk factor for rheumatoid arthritis: a case–control study. J. Rheumatol. 24, 1522–1526 (1997).
69. Burden, A. et al. Genetics of psoriasis: paternal inheritance and a locus on chromosome 6p. J. Invest. Dermatol. 110, 958–960 (1998); comment 112, 514–516 (1999).
70. Akolkar, P. N. et al. Differences in risk of Crohn’s disease in offspring of mothers and fathers with inflammatory bowel disease. Am. J. Gastroenterol. 92, 2241–2244 (1997).
71. Vorechovsky, I., Webster, A. D., Plebani, A. & Hammarstrom, L.
Genetic linkage of IgA deficiency to the major histocompatibility complex: evidence for allele segregation distortion, parent-of-origin penetrance differences, and the role of anti-IgA antibodies in disease predisposition. Am. J. Hum. Genet. 64, 1096–1109 (1999).
72. Grosshans, H. & Filipowicz, W. Molecular biology: the expanding world of small RNAs. Nature 451, 414–416 (2008).
73. Plagnol, V. et al. Extreme clonality in lymphoblastoid cell lines with implications for allele specific expression analyses. PLoS ONE 3, e2966 (2008).
74. Yan, H., Yuan, W., Velculescu, V. E., Vogelstein, B. & Kinzler, K. W. Allelic variation in human gene expression. Science 297, 1143 (2002).
75. Cheung, V. G. et al. Natural variation in human gene expression assessed in lymphoblastoid cells. Nature Genet. 33, 422–425 (2003).
76. Gretarsdottir, S. et al. The gene encoding phosphodiesterase 4D confers risk of ischemic stroke. Nature Genet. 35, 131–138 (2003).
77. Kohane, I. S., Kho, A. T. & Butte, A. J. Microarrays for an Integrative Genomics (MIT Press, Cambridge, Massachusetts, 2002).
78. Idaghdour, Y., Storey, J. D., Jadallah, S. J. & Gibson, G. A genome-wide gene expression signature of environmental geography in leukocytes of Moroccan Amazighs. PLoS Genet. 4, e1000052 (2008).
Although this paper describes a small study, it shows the profound effects of different environments on gene expression in peripheral blood lymphocytes.
79. Li, Y. et al. Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet. 2, e222 (2006).
80. Smith, E. N. & Kruglyak, L. Gene–environment interaction in yeast gene expression. PLoS Biol. 6, e83 (2008).
81. Gibson, G. The environmental contribution to gene expression profiles. Nature Rev. Genet. 9, 575–581 (2008).
82. Reis, B. Y., Butte, A. S. & Kohane, I. S. Extracting knowledge from dynamics in gene expression. J. Biomed. Inform. 34, 15–27 (2001).
This paper shows the utility of using time-series measurements of gene expression to identify co-regulated modules of genes.
83. Schadt, E. E. & Lum, P. Y. Thematic review series: systems biology approaches to metabolic and cardiovascular disorders. Reverse engineering gene networks to identify key drivers of complex disease phenotypes. J. Lipid Res. 47, 2601–2613 (2006).
84. Gudbjartsson, D. F. et al. Many sequence variants affecting diversity of adult human height. Nature Genet. 40, 609–615 (2008).
85. Hom, G. et al. Association of systemic lupus erythematosus with C8orf13–BLK and ITGAM–ITGAX. N. Engl. J. Med. 358, 900–909 (2008).
86. Hakonarson, H. et al. A novel susceptibility locus for type 1 diabetes on Chr12q13 identified by a genomewide association study. Diabetes 57, 1143–1146 (2008).
87. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
88. Todd, J. A. et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nature Genet. 39, 857–864 (2007).
89. Plenge, R. M. et al. TRAF1–C5 as a risk locus for rheumatoid arthritis — a genomewide study. N. Engl. J. Med. 357, 1199–1209 (2007).
90. Thein, S. L. et al. Intergenic variants of HBS1L–MYB are responsible for a major QTL on chromosome 6q23 influencing HbF levels in adults. Proc. Natl Acad. Sci. USA (in the press).
91. Di Bernardo, M. C. et al. A genome-wide association study identifies six susceptibility loci for chronic lymphocytic leukemia. Nature Genet. 40, 1204–1210 (2008).
92. Gardner, T. S., di Bernardo, D., Lorenz, D. & Collins, J. J. Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105 (2003).
93. Sontag, E., Kiyatkin, A. & Kholodenko, B. N.
Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data. Bioinformatics 20, 1877–1886 (2004).
94. Li, H. et al. Integrative genetic analysis of transcription modules: towards filling the gap between genetic loci and inherited traits. Hum. Mol. Genet. 15, 481–492 (2006).
95. Keurentjes, J. J. et al. Regulatory network construction in Arabidopsis by using genome-wide gene expression quantitative trait loci. Proc. Natl Acad. Sci. USA 104, 1708–1713 (2007).
96. Gerrits, A., Dykstra, B., Otten, M., Bystrykh, L. & de Haan, G. Combining transcriptional profiling and genetic linkage analysis to uncover gene networks operating in hematopoietic stem cells and their progeny. Immunogenetics 60, 411–422 (2008).
97. de Koning, D. J., Carlborg, O. & Haley, C. S. The genetic dissection of immune response using gene-expression studies and genome mapping. Vet. Immunol. Immunopathol. 105, 343–352 (2005).
98. Akey, J. M., Biswas, S., Leek, J. T. & Storey, J. D. On the design and analysis of gene expression studies in human populations. Nature Genet. 39, 807–808; author reply 808–809 (2007).
99. Spielman, R. S. et al. Common genetic variants account for differences in gene expression among ethnic groups. Nature Genet. 39, 226–231 (2007).
100. Doss, S., Schadt, E. E., Drake, T. A. & Lusis, A. J. Cis-acting expression quantitative trait loci in mice. Genome Res. 15, 681–691 (2005).
101. Hughes, T. R. et al. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nature Biotechnol. 19, 342–347 (2001).
102. Alberts, R., Terpstra, P., Bystrykh, L. V., de Haan, G. & Jansen, R. C. A statistical multiprobe model for analyzing cis and trans genes in genetical genomics experiments with short-oligonucleotide arrays. Genetics 171, 1437–1439 (2005).
103. Halazonetis, T. D., Gorgoulis, V. G. & Bartek, J. An oncogene-induced DNA damage model for cancer development. Science 319, 1352–1355 (2008).
104.
Sun, Z., Wigle, D. A. & Yang, P. Non-overlapping and non-cell-type-specific gene expression signatures predict lung cancer survival. J. Clin. Oncol. 26, 877–883 (2008).
105. Walker, B. A. et al. Integration of global SNP-based mapping and expression arrays reveals key regions, mechanisms, and genes important in the pathogenesis of multiple myeloma. Blood 108, 1733–1743 (2006).
106. Lastowska, M. et al. Identification of candidate genes involved in neuroblastoma progression by combining genomic and expression microarrays with survival data. Oncogene 26, 7432–7444 (2007).
107. Huang, R. S. et al. A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity. Proc. Natl Acad. Sci. USA 104, 9758–9763 (2007).
108. Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, Article17 (2005).
This paper describes a statistical approach to network analyses and provides a set of software tools for their implementation.
109. Horvath, S. et al. Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc. Natl Acad. Sci. USA 103, 17402–17407 (2006).

Acknowledgements
The work was supported by the Wellcome Trust and the EC funded GABRIEL project, the French Ministry of Research and Higher Education and by grants from the National Institutes of Health.
FURTHER INFORMATION
Liming Liang’s homepage: http://www.sph.umich.edu/csg/liang
Abecasis laboratory homepage (contains programs for genome-scale data analysis): http://www.sph.umich.edu/csg/abecasis
Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/gwastudies
Database of Genomic Variants: http://projects.tcag.ca/variation
dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP
D-HaploDB: http://orca.gen.kyushu-u.ac.jp
Genotype-Tissue Expression (GTEx): http://nihroadmap.nih.gov/GTEx
H-InvDB: http://www.h-invitational.jp
mRNA by SNP Browser v 1.0.1: http://www.sph.umich.edu/csg/liang/asthma
UniGene: http://www.ncbi.nlm.nih.gov/unigene
VarySysDB: http://www.h-invitational.jp/varygene/home.htm
WaferGen: http://www.wafergen.com

© 2009 Macmillan Publishers Limited. All rights reserved