Genome-Wide Association
Studies: Issues and Approaches
Association Studies
Hirschhorn & Daly, Nat Rev Genet 2005
Candidate Gene or GWAS
Genome-wide Association Studies
Affymetrix Array
Altshuler & Clark, Science 2005
Genome-wide Association Studies (GWAS)
One- and Two-Stage GWA Designs
[Figure: One-Stage Design: all samples (1, 2, 3, …, N) are genotyped on all SNPs (markers 1, 2, 3, …, M). Two-Stage Design: Stage 1 genotypes a subset of the samples on all M SNPs; Stage 2 genotypes the remaining samples on a subset of the SNPs.]
One-Stage Design versus Two-Stage Design
[Figure: samples-by-SNPs layout of the one-stage design, and of the two-stage design under replication-based analysis and under joint analysis of Stage 1 and Stage 2.]
Multistage Designs
• Joint analysis has more power than replication
• p-value in Stage 1 must be liberal
• Lower cost, but no gain in power
• http://www.sph.umich.edu/csg/abecasis/CaTS/index.html
QC Steps
• Filter SNPs and Individuals
– MAF, Low call rates
• Test for HWE among controls & within ethnic
groups. Use conservative alpha-level
• Check for relatedness. Identity-by-state
calculations.
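The HWE check and SNP filters listed above can be sketched as follows. This is a minimal illustration, not the procedure used in any particular study: the function names, the example counts, and all thresholds (call rate 0.95, MAF 0.01, HWE alpha 1e-4) are assumptions chosen for the example.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """1-df chi-square test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of the first allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail probability, 1 df
    return chi2, p_value

def passes_qc(control_counts, call_rate, maf,
              min_call_rate=0.95, min_maf=0.01, hwe_alpha=1e-4):
    """Keep a SNP only if it passes call-rate, MAF, and HWE (in controls) filters.

    All thresholds are hypothetical defaults, not values from the slides.
    """
    _, hwe_p = hwe_chi_square(*control_counts)
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= hwe_alpha

# Illustrative control genotype counts (AA, Aa, aa)
chi2, p_value = hwe_chi_square(36, 44, 20)
print(f"HWE chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

A small alpha is used for the HWE filter because the test is applied to every SNP, which matches the "conservative alpha-level" bullet above.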
Analysis of GWAS
• Most common approach: look at each SNP one-at-a-time.
• Possibly add in multi-marker information.
• Further investigate / report top SNPs only.
• Or backwards replication…
P-values
GWAS Analysis
• Most commonly trend test.
• Log additive model, logistic regression.
• Adjust for potential population stratification.
Example: GWAS of Prostate Cancer
http://cgems.cancer.gov
[Figure: Multiple prostate cancer loci on 8q24. x-axis: position on 8q24 (Mb), 128.10 to 128.70; y-axis: -log(p-value), 0 to 30. Regions 1, 2, and 3 are marked, with SNPs rs1447295, rs16901979, and rs6983267 labeled. Series: Gudmundsson et al., Haiman et al., Yeager et al., Combined (adjusted).]
Witte, Nat Genet 2007
Prostate Cancer Replications
Locus (Chr Reg)  SNP         Alleles  A Freq Cntrl  A Freq Case  OR    p value     Nearby Genes / Fcn
2p15             rs721048    G/A      0.19          0.21         1.15  7.7x10^-9   EHBP1: endocytic trafficking
3p12             rs2660753   C/T      0.10          0.12         1.30  2.7x10^-8   Intergenic
6q25             rs9364554   C/T      0.29          0.33         1.21  5.5x10^-10  SLC22A3: drugs and toxins
7q21             rs6465657   T/C      0.46          0.50         1.19  1.1x10^-9   LMTK2: endosomal trafficking
8q24 (2)         rs16901979  C/A      0.04          0.06         1.52  1.1x10^-12  Intergenic
8q24 (3)         rs6983267   T/G      0.50          0.56         1.25  9.4x10^-13  Intergenic
8q24 (1)         rs1447295   C/A      0.10          0.14         1.42  6.4x10^-18  Intergenic
10q11            rs10993994  C/T      0.38          0.46         1.38  8.7x10^-29  MSMB: suppressor prop.
10q26            rs4962416   T/C      0.27          0.32         1.18  2.7x10^-8   CTBP2: antiapoptotic activity
11q13            rs7931342   T/G      0.51          0.56         1.21  1.7x10^-12  Intergenic
17q12            rs4430796   G/A      0.49          0.55         1.22  1.4x10^-11  HNF1B: suppressor properties
17q24            rs1859962   T/G      0.46          0.51         1.20  2.5x10^-10  Intergenic
19q13            rs2735839   A/G      0.83          0.87         1.37  1.5x10^-18  KLK2/KLK3: PSA
Xp11             rs5945619   T/C      0.36          0.41         1.29  1.5x10^-9   NUDT10, NUDT11: apoptosis
Modest ORs
Witte, Nat Rev Genet 2009
SNPs Missed in Replication?
Same loci as in the Prostate Cancer Replications table, with one row highlighted:
Locus (Chr Reg)  SNP         Alleles  A Freq Cntrl  A Freq Case  OR    p value     Nearby Genes / Fcn
10q11            rs10993994  C/T      0.38          0.46         1.38  8.7x10^-29  MSMB: suppressor prop.
24,223 smallest P-value!
Witte, Nat Rev Genet, 2009
Prostate Cancer
www.genome.gov/gwastudies
Manolio et al. Clin Invest 2008
Limitations of GWAS
• Not very predictive
Example: AUC for Breast Cancer Risk
Gail model = 58%
SNPs = 58.9%
Gail + SNPs = 61.8%
Wacholder et al. NEJM 2010
Witte, Nat Rev Genet 2009
Limitations of GWAS
• Not very predictive
• Explain little heritability
• Focus on common variation
• Many associated variants are not causal
Where’s the Heritability?
Common disease rare variant (CDRV) hypothesis: diseases due to
multiple rare variants with intermediate penetrances (allelic heterogeneity)
Many more
of these?
See: NEJM, April 30, 2009
McCarthy et al., 2008
Will GWAS results explain
more heritability?
• Possibly, if…
1. Causal SNPs not yet detected due to power /
practical issues (e.g., not yet included in
replication studies).
2. Stronger effects for causal SNPs:
Associated SNP may only serve as a
marker for multiple different causal
SNPs.
Imputation of SNP Genotypes
• Estimate unmeasured or missing genotypes.
• Based on measured SNPs and external info (e.g.,
haplotype structure of HapMap).
• Increase GWAS power.
• Allow for combining data across different platforms
(e.g., Affy & Illumina) (for replication / meta-analysis).
Imputation Example
[Figure: Observed genotypes for the study sample, typed at only a few markers (rows such as ". A . . . . A . ." and ". C . . . . A . .", with "." marking untyped positions), shown above a panel of reference haplotypes from HapMap / 1K genomes.]
Gonçalo Abecasis
Identify Match with Reference
[Figure: the same observed genotypes and reference haplotype panel as on the Imputation Example slide.]
Gonçalo Abecasis
Phase chromosomes, impute missing genotypes
Observed Genotypes
c g a g A t c t c c c g A c c t c A t g g
c g a a G c t c t t t t C t t t c A t g g
[Figure: reference haplotype panel, as on the previous slides.]
http://www.sph.umich.edu/csg/abecasis/MACH
Gonçalo Abecasis
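The matching idea on these slides can be sketched with a toy example. This is a deliberately simplified illustration, not the MaCH algorithm (which uses a hidden Markov model over unphased genotypes): it assumes the study chromosomes are already phased, assumes uppercase letters on the slide mark typed alleles, and simply copies the single best-matching reference haplotype. The function name and the third reference haplotype are made up.

```python
def impute_from_reference(observed, reference):
    """Fill untyped positions ('.') from the best-matching reference haplotype.

    Typed alleles are kept in uppercase; imputed alleles are returned in lowercase.
    """
    def mismatches(ref):
        return sum(1 for o, r in zip(observed, ref) if o != "." and o != r)

    best = min(reference, key=mismatches)
    return "".join(o if o != "." else r.lower() for o, r in zip(observed, best))

# Toy reference panel: the first two haplotypes follow the slide, the third is made up.
reference = [
    "CGAGATCTCCCGACCTCATGG",
    "CGAAGCTCTTTTCTTTCATGG",
    "TGGGATCTCCCGACCTTGTGC",
]

# Typed marker positions (0-based), as suggested by the uppercase letters on the slide.
typed = (4, 12, 17)

def mask(haplotype):
    """Keep only the typed positions of a haplotype; everything else becomes '.'."""
    return "".join(a if i in typed else "." for i, a in enumerate(haplotype))

observed_1 = mask(reference[0])   # A, A, A at the typed markers
observed_2 = mask(reference[1])   # G, C, A at the typed markers

imputed_1 = impute_from_reference(observed_1, reference)
imputed_2 = impute_from_reference(observed_2, reference)
print(imputed_1)
print(imputed_2)
```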
Imputation Application
TCF7L2 gene region & T2D from the WTCCC data
Observed genotypes black
Imputed genotypes red.
Chromosomal Position
Marchini, Nature Genetics 2007
http://www.stats.ox.ac.uk/~marchini/#software
Genome-wide Sequence Studies
• Trade off between number of samples, depth, and genomic coverage.
Sample Size   Depth   MAF 0.5-1%   MAF 2-5%
1,000         20x     perfect      perfect
2,000         10x     r2=0.98      r2=0.995
4,000         5x      r2=0.90      r2=0.98
BUT: Interaction needs to be accounted for >>>> required sample size
Gonçalo Abecasis
Near-term Design Choices
• For example, between:
1. Sequencing a few subjects with extreme phenotypes:
• e.g., 200 cases, 200 controls, 4x coverage. Then follow up in a larger population.
2. 10M SNP chip based on 1,000 genomes.
• 5K cases, 5K controls.
• Which design will work best…?
Polygenic Models
• Many weak associations combine to affect risk?
• Score model:
$x_j = \sum_{i=1}^{m} \ln(OR_i) \times SNP_{ij}$
where
– $\ln(OR_i)$ = ‘score’ for SNP $i$ from the ‘discovery’ sample
– $SNP_{ij}$ = # of alleles (0, 1, 2) for SNP $i$, person $j$ in the ‘validation’ sample.
– Large number of SNPs ($m$)
• Is $x_j$ associated with disease?
ISC / Purcell et al. Nature 2009
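The score model can be computed directly. The sketch below is illustrative only: the function name is hypothetical, the three odds ratios are borrowed from the prostate cancer table purely as example weights, and the validation genotypes are made up.

```python
import math

def polygenic_scores(odds_ratios, genotypes):
    """Return x_j = sum_i ln(OR_i) * SNP_ij for each person j.

    odds_ratios: OR_i estimated in the discovery sample (one per SNP).
    genotypes:   one list per person in the validation sample, holding the
                 allele count (0, 1, or 2) at each SNP.
    """
    weights = [math.log(or_i) for or_i in odds_ratios]
    return [sum(w * g for w, g in zip(weights, person)) for person in genotypes]

# Example weights and made-up allele counts for three people
odds_ratios = [1.15, 1.30, 1.21]
genotypes = [
    [0, 0, 0],
    [2, 1, 0],
    [1, 1, 2],
]
scores = polygenic_scores(odds_ratios, genotypes)
print(scores)
```

The final step on the slide, testing whether the score is associated with disease, would then regress case-control status on these scores in the validation sample.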
Complex diseases
[Figure: causal diagram linking physical activity, diet, genetic susceptibility, obesity, hyperlipidemia, diabetes, hypertension, atherosclerosis, vulnerable plaques, and MI.]
Complex diseases: Many causes = many causal pathways!
Data Analysis Approaches in Human Disease Gene Discovery
Study types:
– Linkage analysis (families with ≥2 affected individuals)
– Association analysis (case-control data, case-parents trios, etc.)
Marker sets:
– Candidate genes: <10 – 200 markers
– Genome scan: 300 – 6000 polymorphic markers; 300K – 500K SNPs (1000K SNP chips coming soon)
Genome-wide Association Studies
Genome-Wide Association Studies
Technology makes it feasible.
– Affymetrix 500K chip costs ~$400/subject; 1M chip arrives in early 2007 and costs ~$700/subject.
– Illumina 300K chip costs ~$700/subject, 550K chip costs ~$1000/subject. (http://www.cidr.jhmi.edu/pricing.html, 05/26/2006)
Simple requirements on data make it favorable.
– Case-control data or case-parents trio data are enough.
Power advantage over the approach of linkage scan followed by fine mapping?
Association Analysis in Case-Control Studies
Rationale: Cases are more likely to carry disease-predisposing variants than controls. In other words, there is association between disease status and the fraction of disease variants.
In order to detect disease-marker association (in a homogeneous population), the marker must either contain a variant or be associated (a.k.a. in linkage disequilibrium, LD) with a variant.
[Figure: the disease variant is connected to disease status by the underlying association and to the genetic marker by the association between variant and marker (i.e. LD); the association between genetic marker and disease status is due to both the underlying association and LD.]
The level of LD between disease variant and a marker determines how much disease association is left to be seen with the marker.
Measure of LD: $r^2$
$r^2 = \frac{(P_{AB} - p_A p_B)^2}{p_A \, p_a \, p_B \, p_b}$
Haplotype frequencies:
                                      Alleles of marker 2 (freq.)
                                      B (pB)       b (pb)
Alleles of marker 1 (freq.)  A (pA)   AB (PAB)     Ab (PAb)
                             a (pa)   aB (PaB)     ab (Pab)
• $0 \le r^2 \le 1$.
• Suppose N cases and N controls are needed so that the power to detect disease-variant association is β. To have the same power to detect disease-marker association, we need to have $N/r^2$ cases and $N/r^2$ controls.
• $r^2 = \chi^2 / K$, where K is the number of chromosomes.
• $r^2$ is the square of the correlation coefficient when alleles are coded as 0 and 1.
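A short numerical sketch of the r² formula and the N/r² rule follows. The function names and the example frequencies are made up for illustration.

```python
import math

def ld_r2(p_ab, p_a, p_b):
    """r^2 = (P_AB - p_A * p_B)^2 / (p_A * p_a * p_B * p_b)."""
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def required_sample_size(n_at_variant, r2):
    """Cases (and controls) needed at the marker to match the power at the variant."""
    return math.ceil(n_at_variant / r2)

# Hypothetical frequencies: haplotype AB = 0.40, allele A = 0.50, allele B = 0.50
r2 = ld_r2(0.40, 0.50, 0.50)
n_marker = required_sample_size(1000, r2)
print(f"r^2 = {r2:.2f}; N at the marker = {n_marker}")
```

With these numbers a study that needs 1,000 cases when typing the disease variant itself needs nearly three times as many when relying on the marker.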
Data Quality and Quality Checking
Sensitivity of genotype-calling algorithms.
Family data: Mendelian inconsistencies
[Figure: pedigree with genotypes 11, 12, and 22, illustrating a Mendelian inconsistency.]
Statistical checking:
– Hardy-Weinberg equilibrium (HWE)
– Relationship checking
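A Mendelian inconsistency check for a parent-child trio can be sketched as below. The function name and the example trios are assumptions; alleles are coded 1 and 2 as on the slide.

```python
def mendelian_consistent(father, mother, child):
    """True if the child could have inherited one allele from each parent.

    Genotypes are 2-tuples of allele labels, e.g. (1, 2).
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

# Hypothetical trios: parents 11 and 12 cannot have a 22 child
print(mendelian_consistent((1, 1), (1, 2), (2, 2)))
print(mendelian_consistent((1, 1), (1, 2), (1, 2)))
```

Markers with many inconsistent trios are candidates for genotyping error; families with many inconsistent markers suggest a pedigree error instead.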
HWE Checking in
Shanghai Breast Cancer Study (SBCS)
Potentially
bad markers
Courtesy of Dr. Wei Zheng
Association Analysis of Bi-allelic Markers in Case-Control Studies
Commonly used tests:
Genotype-based: Pearson’s χ2 test on 2×3 table.
           AA    Aa    aa
Case       40    45    15
Control    36    44    20
Allele-based: Pearson’s χ2 test on 2×2 table (additive model).
           A     a
Case       125   75
Control    116   84
Other genotype-based tests:
Trend test (additive model).
Dominant model: Collapse AA/Aa and test on resulting 2×2 table.
Recessive model: Collapse Aa/aa and test on resulting 2×2 table.
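The tests named on this slide can be run on the slide's own counts. The sketch below uses only the standard library; the function names are hypothetical, the trend test is written as the Cochran-Armitage chi-square with allele-count weights 0, 1, 2 (an assumption about which trend test is meant), and the p-value helper covers only 1 and 2 degrees of freedom.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) on genotype counts."""
    n = [r + s for r, s in zip(cases, controls)]
    big_n, big_r = sum(n), sum(cases)
    swr = sum(w * r for w, r in zip(weights, cases))
    swn = sum(w * k for w, k in zip(weights, n))
    sw2n = sum(w * w * k for w, k in zip(weights, n))
    return (big_n * (big_n * swr - big_r * swn) ** 2
            / (big_r * (big_n - big_r) * (big_n * sw2n - swn ** 2)))

def chi2_p_value(stat, df):
    """Upper-tail chi-square probability for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or 2 supported in this sketch")

cases, controls = (40, 45, 15), (36, 44, 20)          # AA, Aa, aa

genotype_stat = pearson_chi2([cases, controls])        # 2x3 table, 2 df
allele_stat = pearson_chi2([
    [2 * cases[0] + cases[1], cases[1] + 2 * cases[2]],            # 125, 75
    [2 * controls[0] + controls[1], controls[1] + 2 * controls[2]],  # 116, 84
])                                                     # 2x2 table, 1 df
trend_stat = trend_chi2(cases, controls)               # 1 df
dominant_stat = pearson_chi2([
    [cases[0] + cases[1], cases[2]],                   # AA/Aa collapsed
    [controls[0] + controls[1], controls[2]],
])                                                     # 1 df

for name, stat, df in [("genotype", genotype_stat, 2), ("allele", allele_stat, 1),
                       ("trend", trend_stat, 1), ("dominant", dominant_stat, 1)]:
    print(f"{name:9s} chi2 = {stat:.3f}  p = {chi2_p_value(stat, df):.3f}")
```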
Simulation of Genome Data
The simulations are based on HapMap Phase II phased CEU data.
a. Designate disease variant and disease model at the variant.
b. For each person, simulate genotype at disease variant locus 0.
c. For each allele at locus 0, grow the whole chromosome:
   1. Generate a five-marker haplotype at [-2, 2] given the allele at 0.
   2. Grow upward: (“4 + 1”, like a 4th-order Markov chain)
      1) Generate an allele at locus 3 given the haplotype at [-1, 2];
      2) Generate an allele at locus 4 given the haplotype at [0, 3];
      3) …
   3. Grow downward (“4 + 1” again).
[Figure: example chromosome with alleles shown at loci -6, …, 0, …, 6.]
Algorithm described in Durrant et al. 2004 Am. J. Hum. Genet.
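The "4 + 1" growing step can be sketched as follows. This is a loose illustration of the idea rather than the published algorithm: it grows only upward, the tiny reference panel is made up (it is not HapMap data), and the fallback used when no reference haplotype matches the four-marker context is an assumption.

```python
import random

def grow_upward(start, reference, order=4, rng=random):
    """Extend `start` one locus at a time, like an order-4 Markov chain.

    At each step the next allele is drawn from the reference haplotypes that
    carry the same alleles at the previous `order` loci.
    """
    hap = list(start)
    n_loci = len(reference[0])
    while len(hap) < n_loci:
        pos = len(hap)
        context = "".join(hap[pos - order:pos])
        candidates = [ref[pos] for ref in reference if ref[pos - order:pos] == context]
        if not candidates:
            # Assumed fallback: ignore the context if no reference haplotype matches
            candidates = [ref[pos] for ref in reference]
        hap.append(rng.choice(candidates))
    return "".join(hap)

# Made-up phased reference haplotypes over 10 loci
reference = [
    "ACCAGCCAGT",
    "ACCAGTCAGA",
    "TCCTGCCTGT",
    "TGCAGCTAGT",
]

rng = random.Random(0)
start = rng.choice(reference)[:5]      # stands in for the five-marker seed haplotype
simulated = grow_upward(start, reference, order=4, rng=rng)
print(start, "->", simulated)
```

Growing downward would mirror the same step on the reversed haplotypes. Because each allele depends only on the four preceding markers, short-range LD is preserved while long-range LD is not.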
LDU Comparison
This algorithm retains local LD very well, but tends to break
up long-range LD.
Disease Loci in Simulations
Power: One-Stage, Bonferroni (α = .05)
Power: One-Stage, FDR (q = .05)
Variation in LD Estimation
Power Drop Due to LD Over-estimation
1000 cases, 1000 controls, 300K SNPs, λ=1.05
Prioritized Subset Analysis (PSA)
Rationale: Often a list of candidate genes or candidate regions
(e.g. determined through linkage studies) exists. It may be
more efficient to use such information to prioritize the genome
in data analysis.
Traditional approaches to GWA such as Bonferroni and FDR
ignore such information, inherently treating all markers
equally.
Prioritized subset analysis (PSA):
– Markers are partitioned and prioritized into subsets based on
supplemental data.
– FDR is then applied to each subset.
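The two PSA steps can be sketched with the Benjamini-Hochberg procedure standing in for "FDR". Which FDR procedure the authors used is not stated on the slide, so that choice is an assumption, as are the function names and the made-up p-values.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the set of indices rejected by the BH step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff = rank                      # largest rank meeting the BH bound
    return set(order[:cutoff])

def prioritized_subset_analysis(pvalues, priority, q=0.05):
    """Apply BH separately to the priority subset and to the remaining markers."""
    prioritized = [i for i in range(len(pvalues)) if i in priority]
    remaining = [i for i in range(len(pvalues)) if i not in priority]
    rejected = set()
    for subset in (prioritized, remaining):
        hits = benjamini_hochberg([pvalues[i] for i in subset], q)
        rejected |= {subset[k] for k in hits}
    return rejected

# Made-up p-values; markers 0 and 1 lie in a hypothetical candidate region
pvalues = [0.004, 0.03, 0.20, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
overall = benjamini_hochberg(pvalues, q=0.05)
psa = prioritized_subset_analysis(pvalues, priority={0, 1}, q=0.05)
print("overall FDR:", sorted(overall))
print("PSA:        ", sorted(psa))
```

In this toy case marker 1 is declared significant only under PSA, because it competes with one other marker in its subset instead of nine.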
Power of PSA
In the previous simulation setup, we define priority subsets to consist of various numbers of chromosomal regions, each 10 Mb long, with various fractions of disease loci in the subset.
500 cases and 500 controls, 100K SNPs.
          # regions  # disease loci  loc1   loc2   loc3   loc4    loc5   loc6   FDR
Overall                              14.2   4.8    0.7    99.8    68.9   34.6   0.063
PSA       6          2               9.6    4.5    2.4    99.8    66.9   54.3   0.074
                     4               13.8   26.8   6.7    99.5    91.9   74.7   0.057
                     6               53.4   33.5   14.1   100.0   97.4   82.2   0.051
          10         6               43.3   23.9   8.6    100.0   95.4   70.3   0.060
Advantage and Caveat of PSA
If a disease gene is not included in the priority subset, the decrease in power is very small.
– This is an advantage of FDR over Bonferroni correction.
The overall FDR can inflate as the number of subsets increases.
– If F1/R1 ≤ q and F2/R2 ≤ q, then (F1 + F2)/(R1 + R2) ≤ q.
– But the FDR procedure only guarantees E[F1/R1] ≤ q and E[F2/R2] ≤ q, which do not imply E[(F1 + F2)/(R1 + R2)] ≤ q.
– When the genome is partitioned into only a few subsets (≤5), the amount of inflation is negligible and the overall FDR is practically under control.
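The two statements about the overall FDR can be worked through briefly. Here F and R are read as the numbers of false rejections and total rejections in a subset (the usual meaning, though the slide does not define them), and the toy distribution in the second part is made up to show that the expectation version can fail.

```latex
% Pointwise version: the bound carries over to the pooled ratio.
\[
F_1 \le q R_1,\quad F_2 \le q R_2
\;\Longrightarrow\;
F_1 + F_2 \le q\,(R_1 + R_2)
\;\Longrightarrow\;
\frac{F_1 + F_2}{R_1 + R_2} \le q .
\]

% Expectation version: a toy counterexample with q = 0.5.
% Subset 1: (F_1, R_1) = (0, 1) or (9, 9), each with probability 1/2.
% Subset 2: (F_2, R_2) = (1, 2) always.
\[
E\!\left[\frac{F_1}{R_1}\right] = \tfrac12 \cdot 0 + \tfrac12 \cdot 1 = 0.5,
\qquad
E\!\left[\frac{F_2}{R_2}\right] = 0.5,
\]
\[
E\!\left[\frac{F_1 + F_2}{R_1 + R_2}\right]
= \tfrac12 \cdot \frac{1}{3} + \tfrac12 \cdot \frac{10}{11}
= \frac{41}{66} \approx 0.62 > 0.5 .
\]
```

The pooled ratio weights each subset by its number of rejections, so a subset that rejects many markers exactly when they are false can dominate the overall FDR.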
SBCS Results (100 Cases, 100 Controls)
Among the 354,905 SNPs that were analyzed, 18,021 SNPs have p-value ≤ .05.
– Compared to 17,745 expected under the assumption of uniform distribution.
– This over-representation of small p-values is statistically significant (p = .017).
Issue: The smaller the MAF or the sample size, the shorter the tail of the test statistic.
– We carried out simulations to take into account the distributions of MAF and sample size in our data.
            All data (354,905 SNPs)         Candidate genes (27,224 SNPs)
            Observed (expected)   Ratio     Observed (expected)   Ratio
P ≤ .05     18,021 (16,236)       1.11      1,420 (1,189)         1.19
P ≤ .01     3,347 (2,806)         1.19      262 (203)             1.29
P ≤ .001    292 (204)             1.43      28 (11)               2.55
P ≤ .0001   27 (15)               1.80      10 (1)                10.00
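The first comparison on this slide (18,021 observed versus 17,745 expected, p = .017) can be reproduced approximately with a one-sided normal approximation to the binomial. Treating the SNPs as independent tests is an assumption of this sketch; the slide does not say how the p-value was obtained.

```python
import math

n_snps = 354_905
alpha = 0.05
observed = 18_021

expected = n_snps * alpha                              # expected count under uniform p-values
sd = math.sqrt(n_snps * alpha * (1 - alpha))           # binomial standard deviation
z = (observed - expected) / sd
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))        # upper-tail normal probability

print(f"expected = {expected:.0f}, z = {z:.2f}, one-sided p = {p_one_sided:.3f}")

# Observed/expected ratio for the first table row, using the simulation-based expected count
ratio_all = 18_021 / 16_236
```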
Two-Stage Approach
Goal: Save money and sacrifice little in power.
Traditional, replication-based analysis:
1. A subset of subjects is typed for many markers, which are screened for promising markers. The tests are liberal, focusing on maximizing power.
2. The remaining subjects are typed for the promising markers, which are tested for replication. The tests are stringent, focusing on controlling type I error.
Joint analysis is more powerful.
– In the second stage, analyze all subjects for the promising markers and correct for the number of tests in the first stage (Satagopan and Elston 2003 Genet. Epidemiol.; Skol et al. 2006 Nat. Genet.).
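One way to form a joint statistic for a marker carried into Stage 2 is to weight the two stage-specific z-statistics by the square roots of the sample fractions, in the spirit of Skol et al. The sketch below shows only that combination step; the function names and example values are made up, and it does not compute the joint significance threshold, which must account for the Stage 1 selection.

```python
import math

def joint_z(z_stage1, z_stage2, frac_stage1):
    """Combine stage-specific z-statistics, weighting by sample fraction."""
    return math.sqrt(frac_stage1) * z_stage1 + math.sqrt(1.0 - frac_stage1) * z_stage2

def two_sided_p(z):
    """Two-sided normal p-value for a z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical marker: 30% of samples in Stage 1, z = 2.5 in Stage 1 and 3.0 in Stage 2
z_joint = joint_z(2.5, 3.0, frac_stage1=0.3)
print(f"joint z = {z_joint:.2f}, nominal two-sided p = {two_sided_p(z_joint):.1e}")
```

The nominal p-value printed here ignores the fact that the marker was selected on its Stage 1 result, so it should not be compared to an unadjusted threshold.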
Population Stratification
A population under study may have sub-populations, which may
lead to
– Spurious association.
– Loss of power to detect real association.
EIGENSTRAT (Price et al. 2006 Nat. Genet.) uses principal
components to extract information on stratification and adjust
for the stratification in association analysis.
Mixed Population = Sub-population 1 + Sub-population 2
           Mixed Population      Sub-population 1      Sub-population 2
           A      a              A      a              A      a
Case       70     80        =    10     40        +    60     40
Control    50     100            20     80             30     20
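The spurious association in this table can be checked numerically. The sketch computes the allelic odds ratio in the mixed population and in each sub-population, plus a Mantel-Haenszel pooled odds ratio as one standard stratified estimate (the slide itself names EIGENSTRAT, not Mantel-Haenszel, so that choice is only for illustration). Function names are hypothetical.

```python
def odds_ratio(case_A, case_a, control_A, control_a):
    """Allelic odds ratio from a 2x2 table of allele counts."""
    return (case_A * control_a) / (case_a * control_A)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over 2x2 tables (case_A, case_a, control_A, control_a)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

sub1 = (10, 40, 20, 80)
sub2 = (60, 40, 30, 20)
mixed = tuple(x + y for x, y in zip(sub1, sub2))       # (70, 80, 50, 100)

or_mixed = odds_ratio(*mixed)
or_sub1 = odds_ratio(*sub1)
or_sub2 = odds_ratio(*sub2)
or_mh = mantel_haenszel_or([sub1, sub2])
print(or_mixed, or_sub1, or_sub2, or_mh)
```

Neither sub-population shows any association, yet pooling them produces an odds ratio of 1.75 because allele frequency and case fraction both differ between the sub-populations.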
Traditional Issues Persist
Allelic heterogeneity
– When multiple disease variants exist at the same gene, a single marker may not
capture them well enough.
– Haplotype-based association analysis is good theoretically, but it hasn’t shown
its advantage in practice.
Locus heterogeneity
– Multiple genes may influence the disease risk independently. As a result, for
any single gene, a fraction of the cases may be no different from the controls.
Effect modification (a.k.a. interaction) between two genes may exist with
weak/no marginal effects.
– It is unknown how often this happens in reality. But when this happens,
analyses that only look at marginal effects won’t be useful.
– It often requires larger sample size to have reasonable power to detect
interaction effects than the sample size needed to detect marginal effects.
Multiple Comparisons
– Need smarter ways of analyzing data.
Need for Smarter Approaches
• Multi-marker haplotype analysis
– Small improvement in power (Pe’er et al. 2006
Nat. Genet.).
• Prioritized subset analysis
• Analyses treating each gene as a unit
– Correcting for effective number of tests.
– Principal components as a tool to summarize
markers at each gene.
Need for Better Coverage
Many polymorphisms in the genome are not well captured by the
current commercial products.
If a disease variant is one of them, the power diminishes quickly.
[Table: coverage of commercial products for SNPs with MAF ≥ 0.05.]
Table from Barrett and Cardon 2006 Nat. Genet.
Moving Beyond Genome
Systems Biology
Transcriptome:
All messenger RNA molecules (‘transcripts’)
Proteome:
All proteins in cell or organism
Metabolome:
All metabolites in a biological organism
(end products of its gene expression).