Genome-Wide Association Studies: Issues and Approaches

Association Studies: Candidate Gene or GWAS
[Figure: Hirschhorn & Daly, Nat Rev Genet 2005]

Genome-wide Association Studies (GWAS)
[Figure: Affymetrix array; Altshuler & Clark, Science 2005]

One- and Two-Stage GWA Designs
[Figure: one-stage design (samples 1,2,3,…,N typed on SNPs 1,2,3,…,M) versus two-stage design (Stage 1 and Stage 2 samples); two-stage data are analyzed by replication-based analysis or by joint analysis]

Multistage Designs
• Joint analysis has more power than replication-based analysis.
• The p-value threshold in Stage 1 must be liberal.
• The benefit is lower cost, not a gain in power.
• http://www.sph.umich.edu/csg/abecasis/CaTS/index.html

QC Steps
• Filter SNPs and individuals: MAF, low call rates.
• Test for HWE among controls and within ethnic groups. Use a conservative alpha level.
• Check for relatedness using identity-by-state calculations.

Analysis of GWAS
• Most common approach: look at each SNP one at a time.
• Possibly add in multi-marker information.
• Further investigate / report top SNPs only.
• Or backwards replication…
[Figure: P-values]

GWAS Analysis
• Most commonly the trend test.
• Log-additive model, logistic regression.
• Adjust for potential population stratification.

Example: GWAS of Prostate Cancer
[Figure: results by chromosome; http://cgems.cancer.gov]

Multiple prostate cancer loci on 8q24
[Figure: -log(p-value) against position on 8q24 (Mb), 128.10 to 128.70, for Gudmundsson et al., Haiman et al., Yeager et al., and combined (adjusted); Region 1: rs1447295, Region 2: rs16901979, Region 3: rs6983267; Witte, Nat Genet 2007]

Prostate Cancer Replications (Witte, Nat Rev Genet 2009)

Chr region  SNP         Alleles  Freq (Cntrl)  Freq (Case)  OR    p value      Nearby genes / function
2p15        rs721048    G/A      0.19          0.21         1.15  7.7x10^-9    EHBP1: endocytic trafficking
3p12        rs2660753   C/T      0.10          0.12         1.30  2.7x10^-8    Intergenic
6q25        rs9364554   C/T      0.29          0.33         1.21  5.5x10^-10   SLC22A3: drugs and toxins
7q21        rs6465657   T/C      0.46          0.50         1.19  1.1x10^-9    LMTK2: endosomal trafficking
8q24 (2)    rs16901979  C/A      0.04          0.06         1.52  1.1x10^-12   Intergenic
8q24 (3)    rs6983267   T/G      0.50          0.56         1.25  9.4x10^-13   Intergenic
8q24 (1)    rs1447295   C/A      0.10          0.14         1.42  6.4x10^-18   Intergenic
10q11       rs10993994  C/T      0.38          0.46         1.38  8.7x10^-29   MSMB: suppressor prop.
10q26       rs4962416   T/C      0.27          0.32         1.18  2.7x10^-8    CTBP2: antiapoptotic activity
11q13       rs7931342   T/G      0.51          0.56         1.21  1.7x10^-12   Intergenic
17q12       rs4430796   G/A      0.49          0.55         1.22  1.4x10^-11   HNF1B: suppressor properties
17q24       rs1859962   T/G      0.46          0.51         1.20  2.5x10^-10   Intergenic
19q13       rs2735839   A/G      0.83          0.87         1.37  1.5x10^-18   KLK2/KLK3: PSA
Xp11        rs5945619   T/C      0.36          0.41         1.29  1.5x10^-9    NUDT10, NUDT11: apoptosis

• Modest ORs.

SNPs Missed in Replication?
The same table, with rs10993994 (10q11) annotated: "24,223 smallest P-value!" (Witte, Nat Rev Genet 2009)

Prostate Cancer
[Figure: www.genome.gov/gwastudies; Manolio et al. J Clin Invest 2008]

Limitations of GWAS
• Not very predictive
  – Example: AUC for breast cancer risk (Wacholder et al. NEJM 2010): Gail model = 58%, SNPs = 58.9%, Gail + SNPs = 61.8%
• Explain little heritability
• Focus on common variation
• Many associated variants are not causal
(Witte, Nat Rev Genet 2009)

Where's the Heritability?
Common disease rare variant (CDRV) hypothesis: diseases are due to multiple rare variants with intermediate penetrances (allelic heterogeneity). Many more of these?
See: NEJM, April 30, 2009; McCarthy et al., 2008

Will GWAS results explain more heritability?
• Possibly, if…
1. Causal SNPs have not yet been detected because of power / practical issues (e.g., not yet included in replication studies).
2. Causal SNPs have stronger effects: an associated SNP may only serve as a marker for multiple different causal SNPs.
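The GWAS Analysis slide above names the trend test as the most common single-SNP test. As an illustration only, here is a minimal Python sketch of the Cochran-Armitage trend test with additive scores 0, 1, 2. The function name `armitage_trend_test` is hypothetical, the genotype counts are the illustrative case-control counts from the bi-allelic marker slide later in this deck, and no adjustment for population stratification is included.

```python
import math
from typing import Sequence, Tuple


def armitage_trend_test(cases: Sequence[int],
                        controls: Sequence[int],
                        scores: Sequence[float] = (0, 1, 2)) -> Tuple[float, float]:
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases / controls: genotype counts ordered to match `scores`
    (here aa, Aa, AA, i.e. 0, 1, 2 copies of allele A).
    Returns (chi-square statistic with 1 df, p-value).
    Assumes the marker is polymorphic (otherwise the denominator is zero).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls

    sum_xr = sum(x * r for x, r in zip(scores, cases))        # score total in cases
    sum_xn = sum(x * t for x, t in zip(scores, totals))       # score total overall
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))  # squared-score total

    numerator = n * (n * sum_xr - n_cases * sum_xn) ** 2
    denominator = n_cases * n_controls * (n * sum_x2n - sum_xn ** 2)
    chi2 = numerator / denominator

    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Counts ordered aa, Aa, AA (cases: 15, 45, 40; controls: 20, 44, 36).
chi2, p = armitage_trend_test(cases=(15, 45, 40), controls=(20, 44, 36))
print(f"trend chi-square = {chi2:.3f}, p = {p:.3f}")
```

In a genome-wide scan this statistic would be computed once per SNP, and the resulting p-values would then go through a multiple-testing correction.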
Imputation of SNP Genotypes
• Estimate unmeasured or missing genotypes.
• Based on measured SNPs and external information (e.g., the haplotype structure of HapMap).
• Increases GWAS power.
• Allows data from different platforms (e.g., Affy & Illumina) to be combined for replication / meta-analysis.

Imputation Example
[Figure: observed genotypes in the study sample, typed at only a few SNPs, shown above the reference haplotypes (HapMap / 1K genomes); Gonçalo Abecasis]

Identify Match with Reference
[Figure: the same observed genotypes matched against the reference haplotypes; Gonçalo Abecasis]

Phase chromosomes, impute missing genotypes
[Figure: study-sample chromosomes phased, with missing genotypes filled in from the matching reference haplotypes; Gonçalo Abecasis]
http://www.sph.umich.edu/csg/abecasis/MACH

Imputation Application
TCF7L2 gene region & T2D from the WTCCC data.
[Figure: association results by chromosomal position; observed genotypes in black, imputed genotypes in red; Marchini, Nature Genetics 2007]
http://www.stats.ox.ac.uk/~marchini/#software

Genome-wide Sequence Studies
• Trade-off between number of samples, depth, and genomic coverage.

Sample size  Depth  MAF 0.5-1%  MAF 2-5%
1,000        20x    perfect     perfect
2,000        10x    r^2=0.98    r^2=0.995
4,000        5x     r^2=0.90    r^2=0.98

BUT: accounting for interaction greatly increases the required sample size.
(Gonçalo Abecasis)

Near-term Design Choices
• For example, between:
1. Sequencing a few subjects with extreme phenotypes:
   • e.g., 200 cases, 200 controls, 4x coverage. Then follow up in a larger population.
2. 10M SNP chip based on 1000 Genomes.
   • 5K cases, 5K controls.
• Which design will work best…?

Polygenic Models
• Do many weak associations combine to influence risk?
• Score model:
$$x_j = \sum_{i=1}^{m} \ln(\mathrm{OR}_i) \times \mathrm{SNP}_{ij}$$
  where
  – ln(OR_i) = 'score' for SNP i from the 'discovery' sample
  – SNP_ij = # of alleles (0, 1, 2) for SNP i, person j in the 'validation' sample
  – Large number of SNPs (m)
• Is x_j associated with disease?
(ISC / Purcell et al. Nature 2009)

Complex diseases: Many causes = many causal pathways!
[Figure: causal diagram linking genetic susceptibility, physical activity, diet, obesity, hyperlipidemia, diabetes, hypertension, atherosclerosis, vulnerable plaques, and MI]

Data Analysis Approaches in Human Disease Gene Discovery
• Linkage analysis (families with ≥2 affected individuals)
• Association analysis (case-control data, case-parents trios, etc.)
• Candidate genes: <10 – 200 markers
• Genome scan: 300 – 6000 polymorphic markers; 300K – 500K SNPs (1000K SNP chips coming soon)

Genome-Wide Association Studies
Technology makes it feasible.
– Affymetrix 500K chip costs ~$400/subject; 1M chip arrives in early 2007 and costs ~$700/subject.
– Illumina 300K chip costs ~$700/subject, 550K chip costs ~$1000/subject.
(http://www.cidr.jhmi.edu/pricing.html, 05/26/2006)
Simple requirements on data make it favorable.
– Case-control data or case-parents trio data are enough.
Power advantage over the approach of linkage scan followed by fine mapping?

Association Analysis in Case-Control Studies
Rationale: Cases are more likely to carry disease-predisposing variants than controls. In other words, there is an association between disease status and the fraction of disease variants.
To detect disease-marker association (in a homogeneous population), the marker must either contain a variant or be associated with a variant (i.e., be in linkage disequilibrium, LD, with it).
[Figure: disease variant linked to disease status by the underlying association and to the genetic marker by LD; the marker-disease association is due to both the underlying association and LD]
The level of LD between the disease variant and a marker determines how much of the disease association remains visible at the marker.

Measure of LD: r^2
$$r^2 = \frac{(P_{AB} - p_A p_B)^2}{p_A \, p_a \, p_B \, p_b}$$

                               Alleles of marker 2 (freq.)
                               B (pB)       b (pb)
Alleles of marker 1   A (pA)   AB (PAB)     Ab (PAb)
(freq.)               a (pa)   aB (PaB)     ab (Pab)

• 0 ≤ r^2 ≤ 1.
• Suppose N cases and N controls are needed so that the power to detect disease-variant association is β. To have the same power to detect disease-marker association, we need N/r^2 cases and N/r^2 controls.
• r^2 = χ^2/K, where K is the number of chromosomes.
• r^2 is the square of the correlation coefficient when alleles are coded as 0 and 1.

Data Quality and Quality Checking
Sensitivity of genotype-calling algorithms.
Family data: Mendelian inconsistencies (e.g., parents with genotypes 11 and 12 and a child with genotype 22).
Statistical checking:
– Hardy-Weinberg equilibrium (HWE)
– Relationship checking

HWE Checking in Shanghai Breast Cancer Study (SBCS)
[Figure: HWE test results with potentially bad markers highlighted; courtesy of Dr. Wei Zheng]

Association Analysis of Bi-allelic Markers in Case-Control Studies
Commonly used tests:
Genotype-based: Pearson's χ^2 test on the 2×3 table.

          AA   Aa   aa
Case      40   45   15
Control   36   44   20

Allele-based: Pearson's χ^2 test on the 2×2 table (additive model).

          A     a
Case      125   75
Control   116   84

Other genotype-based tests:
– Trend test (additive model).
– Dominant model: collapse AA/Aa and test on the resulting 2×2 table.
– Recessive model: collapse Aa/aa and test on the resulting 2×2 table.

Simulation of Genome Data
The simulations are based on HapMap Phase II phased CEU data.
a. Designate the disease variant and the disease model at the variant.
b. For each person, simulate the genotype at the disease variant, locus 0.
c. For each allele at locus 0, grow the whole chromosome:
   1. Generate a five-marker haplotype at [-2, 2] given the allele at 0.
   2. Grow upward ("4 + 1", like a 4th-order Markov chain):
      1) Generate an allele at locus 3 given the haplotype at [-1, 2];
      2) Generate an allele at locus 4 given the haplotype at [0, 3];
      3) …
   3. Grow downward ("4 + 1" again).
[Figure: example chromosome with alleles at loci -6 to 6]
Algorithm described in Durrant et al. 2004 Am. J. Hum. Genet.

LDU Comparison
This algorithm retains local LD very well, but tends to break up long-range LD.
[Figure: LDU comparison]

Disease Loci in Simulations
[Figure]

Power: One-Stage, Bonferroni (α = .05)
[Figure]

Power: One-Stage, FDR (q = .05)
[Figure]

Variation in LD Estimation
[Figure]

Power Drop Due to LD Over-estimation
1000 cases, 1000 controls, 300K SNPs, λ=1.05
[Figure]

Prioritized Subset Analysis (PSA)
Rationale: Often a list of candidate genes or candidate regions (e.g., determined through linkage studies) exists. It may be more efficient to use such information to prioritize the genome in data analysis.
Traditional approaches to GWA such as Bonferroni and FDR ignore such information, inherently treating all markers equally.
Prioritized subset analysis (PSA):
– Markers are partitioned and prioritized into subsets based on supplemental data.
– FDR is then applied to each subset.

Power of PSA
In the previous simulation setup, we define priority subsets to consist of various numbers of chromosomal regions, each 10 Mb long, with various fractions of the disease loci in the subset. 500 cases and 500 controls, 100K SNPs.
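The PSA procedure described above (partition the markers by priority, then apply FDR control within each subset) can be sketched as follows. This is a minimal illustration that assumes the Benjamini-Hochberg step-up procedure as the FDR method, which the slides do not specify; the function names, marker names, and p-values are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[int]:
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank  # keep the largest rank meeting its threshold
    return sorted(order[:largest_rank])


def prioritized_subset_analysis(p_values: Dict[str, float],
                                priority: Set[str],
                                q: float = 0.05) -> Set[str]:
    """Apply BH separately to priority and non-priority markers.

    Returns the union of the markers declared significant in either subset.
    """
    in_priority = [name for name in p_values if name in priority]
    remainder = [name for name in p_values if name not in priority]
    significant: Set[str] = set()
    for subset in (in_priority, remainder):
        subset_p = [p_values[name] for name in subset]
        significant.update(subset[i] for i in benjamini_hochberg(subset_p, q))
    return significant


# Hypothetical p-values: m1 and m2 lie in a candidate region.
pvals = {"m1": 0.02, "m2": 0.5, "m3": 0.2, "m4": 0.3, "m5": 0.4,
         "m6": 0.5, "m7": 0.6, "m8": 0.7, "m9": 0.8, "m10": 0.9}

genome_wide = benjamini_hochberg(list(pvals.values()), q=0.05)
psa = prioritized_subset_analysis(pvals, priority={"m1", "m2"}, q=0.05)
print("genome-wide BH rejections:", genome_wide)
print("PSA rejections:", sorted(psa))
```

In this toy example m1 is not significant when all ten markers are corrected together, but it is when the two priority markers are corrected as their own subset, which is the gain PSA aims for when the prior information is good.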
Power (%) to detect each disease locus, and the overall FDR:

Method   # regions  # disease loci  loc1  loc2  loc3  loc4   loc5  loc6  FDR
Overall  -          -               14.2  4.8   0.7   99.8   68.9  34.6  0.063
PSA      6          2               9.6   4.5   2.4   99.8   66.9  54.3  0.074
PSA      6          4               13.8  26.8  6.7   99.5   91.9  74.7  0.057
PSA      6          6               53.4  33.5  14.1  100.0  97.4  82.2  0.051
PSA      10         6               43.3  23.9  8.6   100.0  95.4  70.3  0.060

Advantage and Caveat of PSA
If a disease gene is not included in the priority subset, the decrease in power is very small.
– This is an advantage of FDR over Bonferroni correction.
The overall FDR can inflate as the number of subsets increases.
– If F1/R1 ≤ q and F2/R2 ≤ q, then (F1 + F2)/(R1 + R2) ≤ q.
– But the FDR procedure only guarantees E[F1/R1] ≤ q and E[F2/R2] ≤ q, which does not imply E[(F1 + F2)/(R1 + R2)] ≤ q.
– When the genome is partitioned into only a few subsets (≤5), the inflation is negligible and the overall FDR is practically under control.

SBCS Results (100 Cases, 100 Controls)
Among the 354,905 SNPs that were analyzed, 18,021 SNPs have p-value ≤ .05.
– Compared to 17,745 expected under the assumption of a uniform distribution.
– This over-representation of small p-values is statistically significant (p = .017).
Issue: The smaller the MAF or the sample size, the shorter the tail of the test statistic's distribution.
– We carried out simulations to take into account the distributions of MAF and sample size in our data.

            All data (354,905 SNPs)          Candidate genes (27,224 SNPs)
            Observed (expected)   Ratio      Observed (expected)   Ratio
P ≤ .05     18,021 (16,236)       1.11       1,420 (1,189)         1.19
P ≤ .01     3,347 (2,806)         1.19       262 (203)             1.29
P ≤ .001    292 (204)             1.43       28 (11)               2.55
P ≤ .0001   27 (15)               1.80       10 (1)                10.00

Two-Stage Approach
Goal: Save money and sacrifice little in power.
Traditional, replication-based analysis:
1. A subset of subjects is typed for many markers, which are screened for promising markers. The tests are liberal, focusing on maximizing power.
2. The remaining subjects are typed for the promising markers, which are tested for replication. The tests are stringent, focusing on controlling type I error.
Joint analysis is more powerful.
– In the second stage, analyze all subjects for the promising markers and correct for the number of tests in the first stage (Satagopan and Elston 2003 Genet. Epidemiol.; Skol et al. 2006 Nat. Genet.).

Population Stratification
A population under study may have sub-populations, which may lead to
– Spurious association.
– Loss of power to detect real association.
EIGENSTRAT (Price et al. 2006 Nat. Genet.) uses principal components to extract information on stratification and to adjust for it in association analysis.

          Mixed population  =  Sub-population 1  +  Sub-population 2
          A     a              A     a              A     a
Case      70    80             10    40             60    40
Control   50    100            20    80             30    20

In this example the allelic odds ratio is 1.0 within each sub-population but (70 × 100)/(80 × 50) = 1.75 in the mixed population, a spurious association.

Traditional Issues Persist
Allelic heterogeneity
– When multiple disease variants exist at the same gene, a single marker may not capture them well enough.
– Haplotype-based association analysis is good theoretically, but it hasn't shown its advantage in practice.
Locus heterogeneity
– Multiple genes may influence the disease risk independently. As a result, for any single gene, a fraction of the cases may be no different from the controls.
Effect modification (a.k.a. interaction) between two genes may exist with weak or no marginal effects.
– It is unknown how often this happens in reality. But when it does, analyses that only look at marginal effects won't be useful.
– Detecting interaction effects with reasonable power often requires a larger sample size than detecting marginal effects.
Multiple comparisons
– Need smarter ways of analyzing data.

Need for Smarter Approaches
• Multi-marker haplotype analysis
  – Small improvement in power (Pe'er et al. 2006 Nat. Genet.).
• Prioritized subset analysis
• Analyses treating each gene as a unit
  – Correcting for effective number of tests.
  – Principal components as a tool to summarize markers at each gene.

Need for Better Coverage
Many polymorphisms in the genome are not well captured by the current commercial products. If a disease variant is one of them, the power diminishes quickly.
[Table: coverage of polymorphisms with MAF ≥ 0.05 by commercial products; from Barrett and Cardon 2006 Nat. Genet.]

Moving Beyond Genome
Systems Biology
• Transcriptome: all messenger RNA molecules ('transcripts').
• Proteome: all proteins in a cell or organism.
• Metabolome: all metabolites in a biological organism (end products of its gene expression).
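The coverage point above (power diminishes quickly when the disease variant is poorly captured) follows from the N/r^2 rule on the "Measure of LD" slide. Here is a rough sketch of that relationship, assuming a 1-df chi-square test whose noncentrality at the marker is r^2 times the noncentrality at the disease variant, and using a normal approximation. The function name `power_at_marker`, the noncentrality value of 30, and the Bonferroni threshold over 300K SNPs are illustrative assumptions, not figures from the slides.

```python
from statistics import NormalDist


def power_at_marker(ncp_at_variant: float, r2: float, alpha: float) -> float:
    """Approximate power of a two-sided 1-df test at a marker.

    ncp_at_variant: noncentrality the test would have if the disease
        variant itself were genotyped (proportional to sample size).
    r2: LD between the marker and the disease variant (0 to 1).
    alpha: per-test significance level.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha / 2.0)
    shift = (r2 * ncp_at_variant) ** 0.5  # LD scales the noncentrality by r2
    return (1.0 - z.cdf(critical - shift)) + z.cdf(-critical - shift)


alpha = 0.05 / 300_000  # illustrative Bonferroni level for 300K SNPs
for r2 in (1.0, 0.8, 0.5, 0.2):
    print(f"r2 = {r2:.1f}: power = {power_at_marker(30.0, r2, alpha):.3f}")

# Doubling the sample size doubles the noncentrality, which exactly
# offsets r2 = 0.5: the N / r2 rule.
print(power_at_marker(60.0, 0.5, alpha) == power_at_marker(30.0, 1.0, alpha))
```

The loop shows how power decays as tagging gets worse at a fixed sample size, and the final line shows that scaling the sample size by 1/r^2 restores the power available at the variant itself.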