Association Analysis
University of Louisville
Center for Genetics and Molecular Medicine
January 11, 2008
Dana Crawford, PhD
Vanderbilt University
Center for Human Genetics Research
Association Analysis Outline
• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function
Study Design
Does your trait or phenotype have a genetic component?
• Segregation analysis
• Recurrence risks
• Heritability
• Other sources of evidence for a genetic component
Classic Segregation Analysis
• Determines if a major gene is involved
• Compares data to Mendelian models, such as
Autosomal dominant
Autosomal recessive
X-linked
• Results can be used as parameters for
linkage analysis (e.g. parametric LOD)
• Subject to ascertainment bias
Note: More complex methods needed for complex traits
Recurrence Risks
The chance that a disease present in the family will recur in that family
“Lightning striking twice”
If the recurrence risk is greater within the family than among unrelated individuals, the disease has a “genetic” component
Suggests familial aggregation
Recurrence Risks
Measured using the risk ratio (λ)
Sibling risk ratio = λs
$\lambda_s = \dfrac{\text{sibling recurrence risk}}{\text{population prevalence}}$
Cystic fibrosis: $\lambda_s = 0.25 / 0.0004 = 625$
Huntington disease: $\lambda_s = 0.50 / 0.0001 = 5000$
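A minimal sketch of the sibling risk ratio calculation, using the recurrence risks and prevalences quoted on this slide. The function name is illustrative only.

```python
def sibling_risk_ratio(sibling_recurrence_risk, population_prevalence):
    """Return lambda_s = sibling recurrence risk / population prevalence."""
    if population_prevalence <= 0:
        raise ValueError("population prevalence must be positive")
    return sibling_recurrence_risk / population_prevalence


# Inputs as quoted on the slide.
lambda_cf = sibling_risk_ratio(0.25, 0.0004)  # cystic fibrosis
lambda_hd = sibling_risk_ratio(0.50, 0.0001)  # Huntington disease

print(f"Cystic fibrosis lambda_s: {lambda_cf:.0f}")
print(f"Huntington disease lambda_s: {lambda_hd:.0f}")
```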
Recurrence Risks: Complex traits
λ here is for first degree relative
Merikangas and Risch (2003) Science 302:599-601.
Heritability
The proportion of phenotypic variation in a
population attributable to genetic variation
Heritability measured as $h^2$
Think “twin studies”
(Can also be family studies)
Quantitative traits
Heritability and Quantitative Traits
Example: Height
[Figure: height of boys and girls among Mexican Americans, Blacks, and Whites; NHANES 1971-1974 versus NHANES 1999-2002]
Determined by genes and environment
Freedman et al (2006) Obesity 14:301-308
Heritability and Quantitative Traits
$\sigma_T^2 = \sigma_G^2 + \sigma_E^2$
Trait variation = genetic + environment
$\sigma_G^2 = \sigma_a^2 + \sigma_d^2$
Genetic variation = additive + dominant
$\sigma_E^2 = \sigma_f^2 + \sigma_e^2$
Environmental variation = familial/household + random/individual
$h_B^2 = \sigma_G^2 / \sigma_T^2$
Broad-sense heritability
$h_N^2 = \sigma_a^2 / \sigma_T^2$
Narrow-sense heritability
Heritability and Twins Studies
$h^2 = 2(r_{MZ} - r_{DZ})$,
where r is the trait correlation coefficient between co-twins
Monozygotic twins share the same genetic material (~100%)
Dizygotic twins share half their genetic material (~50%)
Heritability and Twins Studies
Trait             r(MZ)   r(DZ)   Reference
Cholesterol       0.76    0.39    Fenger et al
SBP               0.60    0.32    Evans et al
BMI               0.67    0.32    Schousboe et al
Perceived pitch   0.67    0.44    Drayna et al
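As a worked illustration, the twin-study formula $h^2 = 2(r_{MZ} - r_{DZ})$ (Falconer's formula) can be applied to the correlations in this table. This is a sketch only; the function and variable names are illustrative.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's estimate of heritability: h2 = 2 * (rMZ - rDZ)."""
    return 2.0 * (r_mz - r_dz)


# (rMZ, rDZ) pairs as listed in the table on this slide.
twin_correlations = {
    "Cholesterol": (0.76, 0.39),
    "SBP": (0.60, 0.32),
    "BMI": (0.67, 0.32),
    "Perceived pitch": (0.67, 0.44),
}

h2_estimates = {
    trait: falconer_h2(r_mz, r_dz)
    for trait, (r_mz, r_dz) in twin_correlations.items()
}

for trait, h2 in h2_estimates.items():
    print(f"{trait}: h2 = {h2:.2f}")
```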
Heritability: Is everything genetic?
Trait           r(MZ)   r(DZ)   Reference
Vote choice     0.81    0.69    Hatemi et al
Religiousness   0.62    0.42    Koenig et al
Other Evidence For A Genetic Component
Monogenic disorders
Example:
Phenotype of interest is sensitivity to warfarin dosing, but there are no heritability estimates
Solution:
Rare, familial disorder of warfarin resistance
Other Evidence For A Genetic Component
Case Reports
Example:
Phenotype of interest is susceptibility to Neisseria meningitidis (prevalence: 1/100,000)
Solution:
Case report of recurrent N. meningitidis in patient
Other Evidence For A Genetic Component
Other good arguments…
• Animal models
• Biochemistry or biological pathways
• Expression data
• Previous genetic association studies
Study Design
How well can you diagnose the disease or measure the trait?
• Narrow definitions better than all-inclusive definitions
There are many paths that lead to the same phenotype
• Avoid misclassification and measurement error
Direct measurement versus recall/survey data or indirect proxies
• Be aware of age of onset
Can your control become a case over time?
Arguably the most important step in study design
Target Phenotypes
Disease or Quantitative trait?
[Diagram labels: LDLR, LDL-C, Diet, MI; IL6, Acute Illness, CRP]
Note: SNPs associated with quantitative traits may not be associated with clinical endpoint
Carlson et al. (2004) Nature 429:446-452
Study Design
How many cases and controls will you need to detect an association?
Statistical Power
• Null hypothesis: all alleles confer equal risk
• Given that a risk allele exists, how likely is a study to reject the null?
• Study sample size ideally determined before you begin to recruit and genotype
Study Design
What are the thresholds/variables in a general power calculation?
• Statistical significance
– Significance = p(false positive)
– Traditional threshold 5%
• Statistical power
– Power = 1- p(false negative)
– Traditional threshold 80%
• Traditional thresholds balance confidence in results
against reasonable sample size
Note: Significance threshold for 1 SNP tested
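To make the thresholds concrete, here is a rough sketch of a sample size calculation for one SNP, using the standard normal-approximation formula for comparing two proportions. It is not the method used by Quanto or the Genetic Power Calculator. The frequencies 0.30 and 0.20 are hypothetical, and the result counts observations per group (alleles, if the proportions are allele frequencies).

```python
import math
from statistics import NormalDist


def observations_per_group(p_case, p_control, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test of two proportions
    (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    p_bar = (p_case + p_control) / 2
    term_null = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * math.sqrt(
        p_case * (1 - p_case) + p_control * (1 - p_control)
    )
    return math.ceil((term_null + term_alt) ** 2 / (p_case - p_control) ** 2)


# Hypothetical allele frequencies: 30% in cases, 20% in controls.
n_80 = observations_per_group(0.30, 0.20, alpha=0.05, power=0.80)
n_90 = observations_per_group(0.30, 0.20, alpha=0.05, power=0.90)
print(n_80, n_90)
```

Raising power or tightening the significance threshold both increase the required sample size, which is the balance the traditional 5% and 80% thresholds strike.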
Study Design
Power Calculation Resources
• Quanto (hydra.usc.edu/gxe/)
Supports quantitative, discrete traits (unrelated and family based)
• Genetic Power Calculator (pngu.mgh.harvard.edu/~purcell/gpc/)
Supports discrete traits, variance components, quantitative traits for linkage and association studies
(List of other software: linkage.rockefeller.edu/soft/)
Study Design
How can you maximize power for your study?
• Large sample size
Better estimate of variability or risk
Chance of misclassification / measurement error
• Large genetic effect size
SNP risk allele with large odds ratio or explains a lot of trait variance
This is unknown at beginning of study
• Risk SNP is common
This is unknown at beginning of study
Calculate power for a range of common MAFs (5-45%)
• Genotype the risk SNP directly
Risk SNP is unknown at beginning of study
Remember tagSNPs are imperfect proxies
Adjust sample size by $1/r^2$
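A small sketch of the $1/r^2$ adjustment: if a tagSNP is correlated with the untyped risk SNP at a given $r^2$, the sample size needed for direct genotyping is inflated by $1/r^2$. The starting sample size of 1000 is hypothetical.

```python
import math


def adjusted_sample_size(n_direct, r_squared):
    """Inflate the sample size needed when the risk SNP is genotyped
    directly by 1/r^2 to account for an imperfect tagSNP proxy."""
    if not 0 < r_squared <= 1:
        raise ValueError("r_squared must be in (0, 1]")
    return math.ceil(n_direct / r_squared)


n_direct = 1000  # hypothetical sample size if the risk SNP were typed directly
for r2 in (1.0, 0.8, 0.5):
    print(f"r2 = {r2}: n = {adjusted_sample_size(n_direct, r2)}")
```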
Study Design
Power calculation example:
Cases: Adverse reaction (wheezing) to flu vaccination
Controls: Vaccinated children with no adverse reactions
[Figure: sample size (cases, axis 0 to 160) versus genotype relative risk (2 to 6, additive model), one curve per MAF (0.05, 0.1, 0.15, 0.2, 0.25). Calculated using Quanto 1.1.1]
Study Design
Power calculation example:
Immunogenicity to influenza A (H5N1) vaccine
[Figure: sample size (axis 0 to 900) versus $R^2$ (0.01 to 0.49, additive model). Calculated using Quanto 1.1.1]
Study Design
Why are you considering an association study instead of linkage?
• Linkage analysis is powerful for disorders with
– Discernible pattern of inheritance
– Rare alleles with large genetic effect sizes
– High penetrance
• Not powerful for disorders that
– have complex pattern of inheritance
– are common
– have many risk alleles with small effect sizes
– have low penetrance
Study Design
Common variant/common disease hypothesis
• Common genetic variants confer susceptibility
• Risk-conferring alleles ancient; common across most populations
• Risk-conferring allele has small effect
• Multiple risk alleles expected for common disease; also environment
Study Design
Should you design a candidate gene or whole genome study?
• Candidate gene association study
– Interrogate specific genes or regions
– Based on previous knowledge or biological plausibility
– Hypothesis testing
• Whole genome association study
– Interrogate the “entire” genome
– No previous knowledge required
– Hypothesis generation
Candidate gene association studies
• Choose gene based on previous knowledge
– Gene function
– Biological pathway
– Previous linkage or association study
• Choose DNA variations for genotyping
– Direct association approach
– Indirect association approach
Direct Candidate Gene Association Study
Genotype “functional” SNPs
Example: Nonsynonymous SNPs
Collins et al (1997) Science 278:1580-1581
Direct Candidate Gene Association Study
Problem: We don’t know what is functional and what is not functional
Botstein and Risch (2003) Nat Genet 33 Suppl:228-37.
Direct Candidate Gene Association Study
What would we miss?
Functional synonymous SNPs in MDR1 alter
P-glycoprotein activity
Komar (2007) Science 315:466-467
Direct Candidate Gene Association Study
What would we miss?
• Non-coding SNPs or DNA variations in
– Introns
– Intergenic regulatory regions
• 99% of the human genome is non-coding
Indirect Candidate Gene Association Study
Kruglyak (2005) Nat Genet 37:1299-1300
• Genotype a fraction of all SNPs regardless of “function”
• Rely on SNP-SNP correlations (linkage disequilibrium) to capture information for SNPs not genotyped
Indirect Candidate Gene Association Study
Linkage disequilibrium (LD)
Measured by $r^2$
$r^2 = \dfrac{[f(A_1B_1) - f(A_1)f(B_1)]^2}{f(A_1)f(A_2)f(B_1)f(B_2)}$
$r^2 = 0$: SNPs are independent
$r^2 = 1$: SNPs are perfectly correlated AND have the same minor allele frequency
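The $r^2$ definition above can be sketched in a few lines for two biallelic SNPs. The haplotype and allele frequencies below are hypothetical and chosen to show the two limiting cases.

```python
def ld_r_squared(f_a1b1, f_a1, f_b1):
    """r^2 between two biallelic SNPs, from the A1B1 haplotype frequency
    and the A1 and B1 allele frequencies."""
    d = f_a1b1 - f_a1 * f_b1                      # disequilibrium coefficient D
    denom = f_a1 * (1 - f_a1) * f_b1 * (1 - f_b1)  # f(A1) f(A2) f(B1) f(B2)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Hypothetical frequencies.
# Independent SNPs: haplotype frequency equals the product of allele frequencies.
r2_independent = ld_r_squared(f_a1b1=0.2 * 0.3, f_a1=0.2, f_b1=0.3)
# Perfect correlation: A1 always travels with B1, and both have frequency 0.2.
r2_perfect = ld_r_squared(f_a1b1=0.2, f_a1=0.2, f_b1=0.2)
# Complete but imperfect LD: A1 only occurs with B1, but frequencies differ.
r2_unequal_maf = ld_r_squared(f_a1b1=0.1, f_a1=0.1, f_b1=0.3)

print(r2_independent, r2_perfect, r2_unequal_maf)
```

The third case illustrates why $r^2 = 1$ requires matching allele frequencies: even when one allele always occurs on the background of the other, unequal frequencies keep $r^2$ below 1.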
Indirect Candidate Gene Association Study
Using LD to pick “tagSNPs” ($r^2 > 0.80$)
CRP, European-descent: 10 SNPs >5% MAF
CRP, European-descent: 4 tagSNPs
Indirect Candidate Gene Association Study
“tagSNPs” are population specific
CRP, European-descent: 4 tagSNPs
CRP, African-descent: 10 tagSNPs
Indirect Candidate Gene Association Study
• “tagSNPs” are population specific
• Merge sets for “cosmopolitan” set
http://gvs.gs.washington.edu/GVS/
Indirect Candidate Gene Association Study
Multiple testing
• Testing many SNPs for association with disease status
• No consensus on correcting p-value
– Bonferroni
– False Discovery Rate
• Need to replicate findings in independent study
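The two corrections named above can be sketched as follows: a Bonferroni threshold, and the Benjamini-Hochberg step-up procedure as one common way to control the False Discovery Rate. The p-values are hypothetical, and the slide does not say which FDR procedure was intended.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Indices of tests declared significant at FDR level q
    (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank  # keep the largest rank that passes
    return sorted(order[:largest_rank])


# Hypothetical p-values for 10 SNPs in one candidate gene.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

threshold = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= threshold]
fdr_hits = benjamini_hochberg(p_values, q=0.05)

print("Bonferroni:", bonferroni_hits)
print("Benjamini-Hochberg:", fdr_hits)
```

On these made-up values Bonferroni keeps fewer SNPs than the FDR procedure, which is the usual trade-off between the two; neither replaces replication in an independent study.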
Indirect Candidate Gene Association Study: Pros and Cons
• Can interrogate all common SNPs in gene
• SNPs must be known and genotypes available to calculate LD and pick tagSNPs
• Multiple testing within a gene
• Limited to previous knowledge
Whole Genome Association Study
• Can now genotype 100K – 1 million SNPs
• Coverage depends on platform and chip
– tagSNPs capturing HapMap common SNPs
– Genic SNPs overrepresented
– Conserved non-coding SNPs represented
– Evenly spaced across genome
Illumina Infinium assay
Affymetrix GeneChips
Whole Genome Association Study
• Same study design and challenges as candidate gene
– Mostly case-control (retrospective)
– Multiple testing
• Data storage and higher-order interaction testing issues
• Hypothesis generation tool (replication)
Case/Control Study Designs
For either candidate gene or whole genome
Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)
Case/Control Study Designs: Pros and Cons
Study          Pros                                Cons
Case/Control   Easier to collect; less expensive   Subject to bias; no risk estimates
Prospective    Risk estimates                      Harder to collect; more expensive; subject to bias
For rare outcomes, case/control design may be the only option
Case/Control Study Designs: Pros and Cons
Types of bias
• Bias in selection of cases
Those that are currently living
Miss fatal or short episodes of disease
Might miss mild diseases
Referral/admission bias
• Non-response bias
• Exposure suspicion bias
• Family information bias
• Recall bias
Often ignored in genetic association studies
Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)
Analysis Methods
Genotype QC
• Test for departures from Hardy-Weinberg Equilibrium
• Test for gender inconsistencies
• Eliminate very rare SNPs (no power)
• Eliminate SNPs with low genotyping efficiency
• Eliminate samples with low genotyping efficiency
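As one example of the QC steps above, a Hardy-Weinberg check can be sketched as a 1-degree-of-freedom chi-square test on genotype counts. The counts are hypothetical; in practice an exact test is often preferred for rare alleles.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for departure from Hardy-Weinberg equilibrium.

    Returns (chi-square statistic, p-value)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (AA, AB, BB).
chi2_ok, p_ok = hwe_chi_square(360, 480, 160)     # matches HWE proportions
chi2_bad, p_bad = hwe_chi_square(500, 200, 300)   # heterozygote deficit
print(f"in HWE:  chi2 = {chi2_ok:.2f}, p = {p_ok:.3g}")
print(f"off HWE: chi2 = {chi2_bad:.2f}, p = {p_bad:.3g}")
```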
Analysis Methods
What statistical methods do you use to analyze your data?
• SNP by SNP (borrowed from epidemiology)
Chi-square and Fisher’s exact
2x2 table
2x3 table
Logistic and linear regression
Covariates
• Haplotypes
Haplo.stats and regression
• Interactions
Traditional regression
MDR (Ritchie et al)
Analysis Methods
The Case/Control Study
               Case   Control
Minor allele   A      B
Major allele   C      D
Odds ratio (OR) = ratio of the odds of the minor allele in cases (A/C) to the odds in controls (B/D)
$OR = \dfrac{A/C}{B/D} = \dfrac{A \times D}{B \times C}$
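A minimal sketch of the allelic odds ratio from the 2x2 table above, with a confidence interval from Woolf's log-based method. The interval method is one common choice, not something the slide specifies, and the allele counts are hypothetical.

```python
import math
from statistics import NormalDist


def allelic_odds_ratio(a, b, c, d, confidence=0.95):
    """Odds ratio and confidence interval from a 2x2 allele-count table.

    a, b: minor-allele counts in cases and controls
    c, d: major-allele counts in cases and controls
    """
    odds_ratio = (a * d) / (b * c)
    # Woolf's method: standard error of ln(OR).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: 200 cases and 200 controls (400 alleles each).
or_est, ci_low, ci_high = allelic_odds_ratio(a=120, b=80, c=280, d=320)
print(f"OR = {or_est:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```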
Analysis Methods
For genotypes, set homozygous for major allele (A) as “referent” genotype, and calculate 2 odds ratios:
     Case   Control
Aa   A      B
AA   C      D

     Case   Control
aa   A      B
AA   C      D
Analysis Methods
Case/control:
Interpretation of Odds Ratio
1.0 – Referent
>1.0 – Greater odds of disease compared with referent
<1.0 – Lesser odds of disease compared with referent
Confidence Intervals: probably contain true OR
OR does not measure risk*
Analysis Methods
Prospective cohort
• Disease free at beginning of study
• Followed over time for disease (“incident”)
• Follow “exposed” and “unexposed” groups
• Gold-standard study design
Analysis Methods
Prospective cohort
            Case   Control   Total
Exposed     A      B         (A+B)
Unexposed   C      D         (C+D)
Risk Ratio (RR) = incidence of disease in exposed / incidence of disease in unexposed
$RR = \dfrac{A/(A+B)}{C/(C+D)}$
Analysis Methods
Prospective Study:
Interpretation of Risk Ratio
1.0 – Referent
>1.0 – Risk for disease increases
<1.0 – Risk for disease decreases
Confidence Intervals: probably contain true RR
*For rare diseases, OR ~ RR
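The risk ratio, and the footnote that the OR approximates the RR for rare diseases, can be illustrated with two hypothetical cohorts that share the same risk ratio but differ in how common the disease is.

```python
def risk_ratio(a, b, c, d):
    """RR = [A/(A+B)] / [C/(C+D)].

    a, b: exposed with and without disease
    c, d: unexposed with and without disease
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = (A/B) / (C/D) = (A*D) / (B*C), same cell layout as risk_ratio."""
    return (a * d) / (b * c)


# Hypothetical cohorts of 10,000 exposed and 10,000 unexposed.
rare = (20, 9980, 10, 9990)         # incidence 0.2% versus 0.1%
common = (4000, 6000, 2000, 8000)   # incidence 40% versus 20%

rr_rare, or_rare = risk_ratio(*rare), odds_ratio(*rare)
rr_common, or_common = risk_ratio(*common), odds_ratio(*common)

print(f"rare:   RR = {rr_rare:.3f}, OR = {or_rare:.3f}")
print(f"common: RR = {rr_common:.3f}, OR = {or_common:.3f}")
```

With the rare outcome the two measures nearly coincide; with the common outcome the OR overstates the RR.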
Analysis Methods
Case/control: Matching
Age
Gender
Race
Bad Example:
Cases: Adults with heart disease
Controls: Newborns without heart disease
Warning: Can “over match” and miss describing an interesting factor
Analysis Methods
Case/control: Stratifying
Age
Gender
Race
Ex. Cases with heart disease
Age-matched controls without heart disease
(Exposure: smoking status)
Stratify for Gender Specific Risks
Warning: Need sufficient sample size to stratify or split the data into males and females
Analysis Methods
Problems in Case/Control genetic association studies –
• “Confounding” by race or ancestry
• AKA population stratification
• Solutions:
Match
Stratify
Adjust (using genetic markers)
“Trios”
Cardon and Palmer (2003) Lancet 361:598-604
Analysis Methods
Regression
• Given
– Height as “target” or “dependent” variable
– Sex as “explanatory” or “independent” variable
• Fit regression model
$\text{height} = \beta \cdot \text{sex} + \varepsilon$
Analysis Methods
Regression
• Given
– Quantitative “target” or “dependent” variable y
– Quantitative or binary “explanatory” or “independent” variables $x_i$
• Fit regression model
$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$
Analysis Methods
Regression
• Works best for normal y and x
• Can include covariates
• Fit regression model
$y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$
• Estimate errors on β’s
• Use t-statistic to evaluate significance of β’s
• Use F-statistic to evaluate model overall
• Use $R^2$ to evaluate variance explained by model
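A sketch of the regression steps for the single-predictor case (height on sex), written in plain Python so each quantity is visible. The heights are hypothetical, and the fit includes an intercept term, which is an assumption beyond the model as written on the slide.

```python
import math


def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x + error.

    Returns slope, intercept, standard error and t-statistic of the slope,
    and R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    se_slope = math.sqrt(sse / (n - 2) / sxx)  # residual variance uses n - 2 df
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "t_stat": slope / se_slope,
        "r_squared": 1 - sse / sst,
    }


# Hypothetical data: sex coded 0 = female, 1 = male; height in cm.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
height = [160, 162, 165, 161, 175, 178, 172, 179]

fit = simple_linear_regression(sex, height)
print({name: round(value, 3) for name, value in fit.items()})
```

With a binary predictor the slope is simply the difference between the two group means, which makes the output easy to sanity-check by hand.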
Analysis Methods
Coding Genotypes
Genotype    AA   AG   GG
Dominant    1    1    0
Additive    2    1    0
Recessive   1    0    0
Genotype can be re-coded in any number of ways for regression analysis
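The coding table above can be written as a small helper that counts copies of the coded allele (A in the table) and maps the count to each genetic model. The function name and signature are illustrative only.

```python
def code_genotype(genotype, coded_allele="A", model="additive"):
    """Numerically code a two-character genotype such as 'AG'.

    additive:  number of copies of the coded allele (0, 1, 2)
    dominant:  1 if at least one copy, else 0
    recessive: 1 if two copies, else 0
    """
    copies = genotype.count(coded_allele)
    if model == "additive":
        return copies
    if model == "dominant":
        return 1 if copies >= 1 else 0
    if model == "recessive":
        return 1 if copies == 2 else 0
    raise ValueError(f"unknown model: {model}")


# Reproduce the table on this slide.
coding_table = {
    model: [code_genotype(g, "A", model) for g in ("AA", "AG", "GG")]
    for model in ("dominant", "additive", "recessive")
}
for model, codes in coding_table.items():
    print(f"{model:<10}", codes)
```

The resulting numeric column can be used as an explanatory variable in a linear or logistic regression.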
Example of gene-environment interaction and traditional regression
Analysis Methods
Statistical Packages for Genetic Association Studies
• Candidate gene association study
SAS/Genetics
STATA
SPSS
R
PLINK
• Whole genome association study
R
PLINK
Analysis Methods
Whole genome in PLINK
(pngu.mgh.harvard.edu/~purcell/plink/)
[Figure: genome-wide association results with MHC removed; labeled thresholds P<2x10^-11, P<1x10^-100, P<5x10^-8; genome-wide significance at P=5x10^-8]
Can adjust for population stratification
Can add covariates
Plenge et al 2007 NEJM
SNPs versus Haplotypes
• There is no right answer: explore both
• The only thing that matters is the correlation
between the assayed variable and the causal
variable
• Sometimes the best assayed variable is a SNP,
sometimes a haplotype
SNPs versus Haplotypes
Statistical Packages for Genetic Association Studies with haplotypes
• Haplo.stats (haplotype regression)
Lake et al, Hum Hered. 2003;55(1):56-65.
• PHASE (case/control haplotype)
Stephens et al, Am J Hum Genet. 2005 Mar;76(3):449-62
• Haploview (case/control SNP analysis)
Barrett et al, Bioinformatics. 2005 Jan 15;21(2):263-5.
• SNPHAP (haplotype regression?)
Sham et al Behav Genet. 2004 Mar;34(2):207-14.
Analysis Methods
Multiple testing
• Bonferroni correction
Too conservative because the SNPs tested may not be independent (LD)
How many independent tests did you do?
See Conneely and Boehnke AJHG (in press)
• False Discovery Rate
Also has arbitrary threshold
• Best bet is replication
Statistical Replication
CRP SNPs and CRP levels in NHANES III
Crawford et al Circulation 2006; 114:2458-2465
[Figure: change in ln(CRP) per copy relative to H2 (axis 0 to 0.6) for haplotypes H2, H5, H6, H7, and H8 in Black, Mexican-American, and White groups]
Results consistent with CARDIA
Carlson et al. AJHG 2005;77:64-77
Functional Replication
• Statistical replication is not always possible
• Association may imply mechanism
• Test for mechanism at the bench
– Is predicted effect in the right direction?
– Dissect haplotype effects to define functional SNPs
Functional Replication
CRP Evolutionary Conservation
• TATA box: 1697
• Transcript start: 1741
• CRP Promoter region (bp 1444-1650) >75% conserved in mouse
Functional Replication
Low CRP Levels Associated with H1-4
• USF1 (Upstream Stimulating Factor)
– Polymorphism at 1440 alters USF1 binding site
bp             1420      1430      1440
H1-4 gcagctacCACGTGcacccagatggcCACTCGtt
H7-8 gcagctacCACGTGcacccagatggcCACTAGtt
H5-6 gcagctacCACGTGcacccagatggcCACTTGtt
Functional Replication
High CRP Levels Associated with H6
• USF1 (Upstream Stimulating Factor)
– Polymorphism at 1421 alters another USF1 binding site
bp             1420      1430      1440
H1-4 gcagctacCACGTGcacccagatggcCACTCGtt
H7-8 gcagctacCACGTGcacccagatggcCACTAGtt
H5   gcagctacCACGTGcacccagatggcCACTTGtt
H6   gcagctacCACATGcacccagatggcCACTTGtt
Functional Replication
CRP Promoter Luciferase Assay
[Figure: fold change over H1-3 (axis 0.0 to 4.0) for H1-3, H4, H5, H6, H7-8, empty, and SV40p]
Carlson et al, AJHG v77 p64
Association Analysis Outline
• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function