Genome-wide association studies
(GWAS)
Thomas Hoffmann
Outline
GWAS Overview
Design
Microarray
Sequencing
Which to use and other considerations
QC
Analysis
Population stratification adjustment
Imputation
Replication & Meta-analysis
Manolio et al., Clin Invest 2008
Genetic association studies
(guilt by association)
Candidate Gene or GWAS
Hirschhorn & Daly, Nat Rev Genet 2005
GWAS Microarray
Assay ~ 0.7 - 5M SNPs (keeps increasing)
Affymetrix, http://www.affymetrix.com
Genotype calls
Good calls!
Bad calls!
Genome-wide association studies (GWAS)
One- and two-stage GWA designs
[Figure: One-Stage Design (all SNPs genotyped in all samples) versus Two-Stage Design (all markers genotyped in the Stage 1 samples, then a subset of SNPs genotyped in the Stage 2 samples). The two-stage design can be analyzed with a replication-based analysis or a joint analysis of Stages 1 and 2.]
Multistage Designs
• Joint analysis has more power than replication
• p-value threshold in Stage 1 must be liberal
• Lower cost, but no gain in power over a one-stage design
• CaTS power calculator: http://www.sph.umich.edu/csg/abecasis/CaTS/index.html
Genome-wide Sequence Studies
• Trade off between number of samples, depth, and genomic coverage.

Sample Size   Depth   MAF 0.5-1%   MAF 2-5%
1,000         20x     “perfect”    “perfect”
2,000         10x     r2=0.98      r2=0.995
4,000         5x      r2=0.90      r2=0.98

Goncalo Abecasis
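The three designs listed above appear to hold total sequencing effort fixed while trading depth for sample size. A minimal arithmetic check, assuming effort is simply samples × depth (an illustrative simplification, not something stated on the slide):

```python
# Designs from the slide: (sample size, sequencing depth as x-fold coverage).
designs = [(1000, 20), (2000, 10), (4000, 5)]

def total_effort(samples, depth):
    """Total sequencing effort in 'genome-equivalents'.

    Assumption for illustration only: cost scales with samples * depth.
    """
    return samples * depth

efforts = [total_effort(n, d) for n, d in designs]
for (n, d), e in zip(designs, efforts):
    print(f"{n} samples at {d}x -> {e} genome-equivalents")
```

Under that assumption, all three rows cost the same; what changes is how well variants in each MAF band are recovered (the r2 columns).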
Near-term sequencing design choices
• For example, between:
1. Sequencing few subjects with extreme phenotypes:
• e.g., 200 cases, 200 controls, 4x coverage. Then follow-up in
larger population.
2. 10M SNP chip based on the 1000 Genomes Project.
• 5K cases, 5K controls.
• Which design will work best…?
Design choices
• GWAS Microarray
– Only assays SNPs designed into the array (0.7-5 million)
– Much cheaper (so many more subjects)
• GWAS Sequencing
– “De novo” discovery (particularly good for rare variants)
– More expensive (but costs are falling), so many fewer subjects
– Needs much more expansive IT support
– Lots of interesting interpretation problems (field rapidly evolving)
Design choices
• Exome Microarray
– Only assays SNPs designed into the array (~300K + custom); in exons only, and only SNPs that could affect protein-coding function
– Cheapest (so many more subjects)
• Exome Sequencing
– “De novo” discovery (particularly good for rare variants); a percentage of exons only
– More expensive than microarrays, less expensive than GWAS sequencing
– Needs more expansive IT support
– Lots of interesting interpretation problems
Size of study
Visscher, AJHG
QC Steps
• Remove SNPs with low call rate (e.g., <97%)
– Call rate: proportion of genotypes actually called by the software
– If it's low, the clusters aren't well defined (artifacts)
• Remove those with low minor allele frequency?
– Rarer variants are more likely to be artifacts / underpowered
– Exome arrays: rare variants are the whole point!
• Remove SNPs / individuals that have too much missing data
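The call-rate and minor-allele-frequency filters can be sketched in a few lines of Python. This is a toy illustration on made-up genotypes, not the pipeline used in the lecture; the function names and the 0.10 MAF cutoff are assumptions chosen so the tiny example shows both kinds of failure.

```python
# Genotypes coded as the count of one allele (0, 1, 2); None = no call.
# Each entry is one SNP across 10 individuals. Toy data, not from the slides.
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],
    "snp2": [0, None, None, 1, 0, None, 0, 0, 1, 0],  # low call rate
    "snp3": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],           # low MAF
}

def call_rate(calls):
    """Proportion of genotypes that were actually called."""
    return sum(g is not None for g in calls) / len(calls)

def minor_allele_freq(calls):
    """Minor allele frequency among called genotypes."""
    called = [g for g in calls if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def passes_qc(calls, min_call_rate=0.97, min_maf=0.01):
    """Thresholds are study-specific; 0.97 follows the slide's example."""
    return call_rate(calls) >= min_call_rate and minor_allele_freq(calls) >= min_maf

# A high MAF cutoff is used here only because the toy sample is tiny.
kept = [snp for snp, calls in genotypes.items() if passes_qc(calls, min_maf=0.10)]
print(kept)
```

For an exome array, the MAF filter would be loosened or dropped, since rare variants are the target.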
QC Steps (2)
SNPs that fail Hardy-Weinberg
Suppose a SNP with alleles A and B, where allele A has
frequency p. Under random mating:
AA has frequency p*p
AB has frequency 2*p*(1-p)
BB has frequency (1-p)*(1-p)
Test for this (e.g., chi-squared test)
In practice do for homogeneous populations (more
later)
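A minimal sketch of the chi-squared version of this test, using only the Python standard library. The function name and the genotype counts are hypothetical; the p-value uses the closed-form tail of a chi-squared distribution with 1 degree of freedom.

```python
import math

def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Pearson chi-squared test (1 df) of Hardy-Weinberg proportions.

    Returns (chi2, p_value). Assumes both alleles are observed.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    expected = [n * p * p, n * 2 * p * (1 - p), n * (1 - p) * (1 - p)]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2_ok, p_ok = hwe_chi2_test(250, 500, 250)    # exactly in HWE proportions
chi2_bad, p_bad = hwe_chi2_test(400, 200, 400)  # strong heterozygote deficit
print(chi2_ok, p_ok)
print(chi2_bad, p_bad)
```

With small genotype counts the chi-squared approximation can be poor, so an exact test is often used in practice.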
QC Steps
• Check sex inferred from genotypes against reported gender
• Filter Mendelian inheritance errors (family-based, or
potentially cryptic relatedness, if large enough sample)
• Check for relatedness...
Check for relatedness, e.g., HapMap
Pemberton et al., AJHG 2010
GWAS analysis
• Most common approach: look at each SNP one-at-a-time
– Additive coding of SNP most common, e.g., # of A alleles
– Just a covariate in a regression framework
• Dichotomous phenotype: logistic regression
• Continuous phenotype: linear regression
– {BMI} = B1{SNP} + B2{Age} + ...
• Further investigate / report top SNPs only
• Adjust for population stratification...
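The single-SNP regression can be sketched as follows: code the genotype additively, then fit a linear model with the SNP and a covariate. Everything here is illustrative: the toy data are simulated without noise, and `additive_code` and `ols` are hypothetical helper names. A real analysis would use a statistics package, and logistic regression for a dichotomous phenotype.

```python
def additive_code(genotype, effect_allele="A"):
    """Additive coding: number of copies of the effect allele (0, 1, or 2)."""
    return genotype.count(effect_allele)

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y.

    X is a list of rows; solved by Gauss-Jordan elimination with partial pivoting.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy data (hypothetical): BMI generated as 20 + 1.5*SNP + 0.1*age, no noise.
people = [("AA", 30), ("AB", 40), ("BB", 50), ("AB", 60), ("AA", 45), ("BB", 35)]
X = [[1.0, additive_code(g), age] for g, age in people]  # intercept, SNP, age
y = [20 + 1.5 * row[1] + 0.1 * row[2] for row in X]

intercept, beta_snp, beta_age = ols(X, y)
print(round(beta_snp, 6), round(beta_age, 6))
```

In a GWAS the same fit is repeated for every SNP, and the p-value for the SNP coefficient is what gets plotted and corrected for multiple testing.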
P-values
What is population stratification?
Balding, Nature Reviews Genetics 2010
Adjusting for PC's
• Li et al., Science 2008
Adjusting for PC's
• Razib, Current Biology 2008
Adjusting for PC's
• Wang, BMC Proc 2009
Aside: “random” mating?
Sebro, Gen Epi, 2010
Multiple comparison correction
• If you conduct 20 tests at α=0.05, you expect one to be significant by chance
(http://xkcd.com/882/). If you conduct 1 million tests...
• Correct for multiple comparisons
– e.g., Bonferroni: 1 million tests gives α = 0.05 / 10^6 = 5x10^-8
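The arithmetic behind these numbers, as a short Python sketch (function names are illustrative):

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests significant by chance when every null is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate <= family_alpha."""
    return family_alpha / n_tests

print(expected_false_positives(20, 0.05))         # about 1
print(expected_false_positives(1_000_000, 0.05))  # about 50,000
print(bonferroni_threshold(1_000_000))            # about 5e-8
```

Bonferroni treats the tests as independent; because neighboring SNPs are correlated through LD, it is conservative for GWAS.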
QQ-plots and PC adjustment
• Wang, BMC Proc 2009
Example: GWAS of Prostate Cancer
[Figure: Multiple prostate cancer loci on 8q24 (Regions 1, 2 and 3). y-axis: -log(p-value), 0 to 30; x-axis: position on 8q24 (Mb), 128.10 to 128.70. Labeled SNPs: rs1447295, rs16901979, rs6983267. Data: http://cgems.cancer.gov]
Witte, Nat Genet 2007
Imputation of SNP Genotypes
• Combine data from different platforms (e.g., Affy & Illumina) (for replication / meta-analysis).
• Estimate unmeasured or missing genotypes.
• Based on measured SNPs and external info (e.g., haplotype structure of HapMap).
• Increase GWAS power (impute and analyze all), e.g., sick sinus syndrome, where the most significant SNP was imputed from 1000 Genomes (Holm et al., Nature Genetics, 2011)
• HapMap as reference, now 1000 Genomes Project?
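To make the idea concrete, here is a deliberately simplified toy: untyped SNPs are filled in by copying from the reference haplotype that best matches the typed SNPs. Real imputation software uses probabilistic models over phased reference panels and reports uncertainty; the data, the `impute` function, and the nearest-match rule below are assumptions for illustration only.

```python
# Reference haplotypes over 5 SNPs (toy data; 0/1 = the two alleles at each SNP).
reference = [
    (0, 0, 1, 0, 1),
    (1, 1, 0, 1, 0),
    (0, 1, 1, 0, 0),
]

def impute(study_hap, reference_haps):
    """Fill None entries by copying from the closest reference haplotype.

    'Closest' = fewest mismatches at the typed (non-None) SNPs; ties go to
    the first such haplotype in the list.
    """
    def mismatches(ref):
        return sum(s is not None and s != r for s, r in zip(study_hap, ref))
    best = min(reference_haps, key=mismatches)
    return tuple(r if s is None else s for s, r in zip(study_hap, best))

# The second and fourth SNPs were not on the study's array.
typed = (1, None, 0, None, 0)
imputed = impute(typed, reference)
print(imputed)
```

The same mechanism is what lets studies genotyped on different arrays be analyzed on a common SNP set.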
Imputation Example
Li et al., Ann Rev Genom Human Genet, 2009
Imputation Application
TCF7L2 gene region & T2D from the WTCCC data
Observed genotypes black
Imputed genotypes red.
Chromosomal Position
Marchini, Nature Genetics 2007
http://www.stats.ox.ac.uk/~marchini/#software
Replication
• To replicate:
– Association test for replication sample significant at
0.05 alpha level
– Same mode of inheritance
– Same direction
– Sufficient sample size for replication
• A non-replication does not necessarily mean the original
finding was a false positive
– LD structures, different populations (e.g., flip-flop)
– covariates, phenotype definition, underpowered
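Two of the criteria above (significance at the 0.05 level and same direction of effect) are easy to check mechanically. A minimal sketch with hypothetical odds ratios and p-values; mode of inheritance and sample-size adequacy are left out:

```python
import math

def replicates(discovery_or, replication_or, replication_p, alpha=0.05):
    """True if the replication result is significant and in the same direction.

    Direction is compared on the log-odds scale (OR > 1 vs. OR < 1).
    """
    same_direction = math.log(discovery_or) * math.log(replication_or) > 0
    return replication_p < alpha and same_direction

print(replicates(1.25, 1.18, 0.003))  # True: significant, same direction
print(replicates(1.25, 0.85, 0.003))  # False: effect flipped (possible flip-flop)
print(replicates(1.25, 1.10, 0.20))   # False: not significant (maybe underpowered)
```

A False here is a prompt to look at LD structure, population, covariates, phenotype definition and power before calling the original finding a false positive.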
Prostate Cancer Replications

Locus                            A Freq        Association
Chr Reg    SNP         Alleles   Cntrl  Case   OR     p value      Nearby Genes / Fcn
2p15       rs721048    G/A       0.19   0.21   1.15   7.7x10^-9    EHBP1: endocytic trafficking
3p12       rs2660753   C/T       0.10   0.12   1.30   2.7x10^-8    Intergenic
6q25       rs9364554   C/T       0.29   0.33   1.21   5.5x10^-10   SLC22A3: drugs and toxins.
7q21       rs6465657   T/C       0.46   0.50   1.19   1.1x10^-9    LMTK2: endosomal trafficking
8q24 (2)   rs16901979  C/A       0.04   0.06   1.52   1.1x10^-12   Intergenic
8q24 (3)   rs6983267   T/G       0.50   0.56   1.25   9.4x10^-13   Intergenic
8q24 (1)   rs1447295   C/A       0.10   0.14   1.42   6.4x10^-18   Intergenic
10q11      rs10993994  C/T       0.38   0.46   1.38   8.7x10^-29   MSMB: suppressor prop.
10q26      rs4962416   T/C       0.27   0.32   1.18   2.7x10^-8    CTBP2: antiapoptotic activity
11q13      rs7931342   T/G       0.51   0.56   1.21   1.7x10^-12   Intergenic
17q12      rs4430796   G/A       0.49   0.55   1.22   1.4x10^-11   HNF1B: suppressor properties
17q24      rs1859962   T/G       0.46   0.51   1.20   2.5x10^-10   Intergenic

Slide annotation: Modest ORs
Witte, Nat Rev Genet 2009
SNPs Missed in Replication?
[Same table as on the "Prostate Cancer Replications" slide. Slide annotation: "24,223 smallest P-value!"]
Witte, Nat Rev Genet, 2009
Meta-analysis
• Combine multiple studies to increase power
• Either combine p-values (Fisher’s test),
• or z-scores (better)
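Both ways of combining studies fit in a few lines of standard-library Python. This is a sketch with hypothetical inputs: the square-root-of-sample-size weights in the z-score version are one common choice, not something specified on the slide.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) ~ chi-squared with 2k df under the null."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-squared survival function for even df = 2k.
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))

def stouffer_z(z_scores, sample_sizes):
    """Weighted z-score combination with weights sqrt(n); returns (Z, two-sided p)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    z = (sum(w * zi for w, zi in zip(weights, z_scores))
         / math.sqrt(sum(w * w for w in weights)))
    return z, math.erfc(abs(z) / math.sqrt(2))

p_fisher = fisher_combined_p([0.05, 0.05])
z_meta, p_meta = stouffer_z([2.0, 2.0], [100, 100])
print(p_fisher, z_meta, p_meta)
```

One reason z-scores are preferred: they carry the direction of effect and can be weighted by study size, whereas combining p-values alone treats opposite-direction results as agreeing.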
Example: GWAS of Prostate Cancer
[Figure: Multiple prostate cancer loci on 8q24 (Regions 1, 2 and 3). y-axis: -log(p-value), 0 to 30; x-axis: position on 8q24 (Mb), 128.10 to 128.70. Labeled SNPs: rs1447295, rs16901979, rs6983267. Slide labels: "Replication & Meta-analysis", "Meta-analysis". Data: http://cgems.cancer.gov]
Witte, Nat Genet 2007
Limitations of GWAS
• Not very predictive
Witte, Nat Rev Genet 2009
Example: AUC for Breast Cancer Risk
58%: Gail model (# first-degree relatives with breast cancer, age at menarche, age at first live birth, number of previous biopsies) + age, study, entry year
58.9%: SNPs
61.8%: Combined
Wacholder et al., NEJM 2010
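For readers less familiar with AUC: it is the probability that a randomly chosen case gets a higher risk score than a randomly chosen control, so 50% is chance and 100% is perfect discrimination. A minimal sketch on made-up scores (not the Wacholder et al. data):

```python
def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical risk scores, for illustration only.
cases = [0.30, 0.55, 0.60, 0.80]
controls = [0.20, 0.35, 0.50, 0.70]
print(auc(cases, controls))  # 0.6875
```

Against that scale, values of 58% to 62% are only modestly better than chance, which is the point of the slide.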
Limitations of GWAS
• Not very predictive
• Explain little heritability
• Focus on common variation
• Many associated variants are not causal
Where's the heritability?
Visscher, AJHG 2011
Where’s the heritability?
Common disease rare variant (CDRV) hypothesis: diseases due to
multiple rare variants with intermediate penetrances (allelic heterogeneity)
Many more
of these?
See: NEJM, April 30, 2009
McCarthy et al., 2008
Where's the heritability?
• Power & sample size issues?
• Polygenic models?
• Gene-gene interactions, gene-environment interactions?
• Rare variants?