Parametric versus Non-parametric
Genetic Association Analysis
Kristel Van Steen, PhD, ScD
([email protected])
Université de Liège - Institut Montefiore
Ghent University – StepGen cvba
December 18th, 2007
Genetic Association Studies

Aim:
detect association between one or more genetic
polymorphisms and a trait, which may be
  measured,
  dichotomous,
  time to onset.
(Genuine) Genetic associations arise only because
human populations share common ancestry
Terminology
(Roche Genetics)
Terminology
(Courtesy of Ed Silverman)
Genetic Association Studies
Reflection I:

In linkage analysis, data from distantly related
individuals are more powerful for detecting small
effects
→ Increased possibility for linkage to be destroyed by
recombination
→ linkage extends over smaller distances
→ denser maps required
Linkage Disequilibrium
(Roche Genetics)
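Linkage Disequilibrium
Marker locus: alleles 1 and 2, with frequencies $p_1$ and $p_2$
Disease locus: alleles D and d, with frequencies $p_D$ and $p_d$
$D = p_{D1} - p_D \, p_1$
where $p_{D1}$ is the frequency of the haplotype carrying disease allele D and marker allele 1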
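The disequilibrium coefficient on this slide can be computed directly from haplotype and allele frequencies. The sketch below reads the slide's formula as D = p_D1 - p_D * p_1 and also returns the r^2 measure used later in the talk; the function name, argument names, and the example frequencies are illustrative assumptions, not values from the slides.

```python
def ld_measures(p_hap, p_d, p_1):
    """Return (D, r_squared) for two bi-allelic loci.

    p_hap : frequency of the haplotype carrying disease allele D and marker allele 1
    p_d   : frequency of disease allele D
    p_1   : frequency of marker allele 1
    """
    # D is the excess of the observed haplotype frequency over its value under independence
    d = p_hap - p_d * p_1
    denom = p_d * (1 - p_d) * p_1 * (1 - p_1)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    # r^2 is D^2 scaled by the product of the four allele frequencies
    return d, d * d / denom


# Made-up frequencies, for illustration only
d, r2 = ld_measures(p_hap=0.3, p_d=0.4, p_1=0.5)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

Under linkage equilibrium the haplotype frequency equals the product of the allele frequencies, so both D and r^2 are zero.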
Genetic Association Studies
Reflection II:


Association study is special form of linkage study:
the extended family is the wider population
Association studies have greater power than linkage
studies to detect small effects, but require looking at
more places
(Risch and Merikangas 1996)
Genetic Association Studies
Reflections III:


Genetic susceptibility to common complex disorders
involves many genes, most of which have small
effects
A large number of “markers” have been identified
Complex Disorders
(Roche Genetics)
Markers
(Roche Genetics)
Genetic Association
[Diagram: disease phenotype, marker, and DSL (disease susceptibility locus).
Direct: test for genetic association between the phenotype and the DSL.
Indirect: test for association between phenotype and marker locus, where the marker is in LD / correlation with the DSL.]
Indirect Associations

The polymorphism is a surrogate for the causal
locus:



Indirect associations are weaker than the direct
associations they reflect
Essential to type several surrounding markers
Try to exclude the possibility that a causal variant
exists but is not picked up by the marker set:
Genome-wide vs Candidate gene approach
Statistical Requirements for a Successful Genome-wide Association Study
• LD coverage
• Genotyping quality
• Sufficient sample sizes
• Design of genome-wide association studies
• Handling of the multiple testing problem
Study Designs
(Cordell and Clayton, 2005)
Example for Required Sample Sizes
Required sample sizes to achieve 80% power in a
case/control study for a significance level of 10^-7

Allele freq    Odds ratio
               1.25      1.5       1.75
0.1            8,859     2,608     1,350
0.2            5,283     1,616       869
0.3            4,281     1,342       727
0.4            3,886     1,301       750
The interpretation of r^2
r^2 N is the “effective sample size”
If a marker M and causal gene G are in LD, then a study
with N cases and controls which measures M (but not G) will
have the same power to detect an association as a study
with r^2 N cases and controls that directly measured G
So … the markers that are genotyped should be selected so
that they have high r^2 values (preferably at least 80%)
with the markers that are not genotyped
Good SNP selection will be key for the success of GWAs
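The "effective sample size" rule is simple arithmetic, sketched below under the slide's statement that a marker study of size N has the power of a direct study of size r^2 N. The function names are hypothetical, and rounding the required size up to a whole individual is an added assumption.

```python
import math


def effective_sample_size(n, r2):
    """Size of the direct study (typing G) that matches the power of typing marker M."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return r2 * n


def required_sample_size(n_direct, r2):
    """Size of the marker study needed to match a direct study of n_direct individuals."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    # Inverting r2 * N = n_direct gives N = n_direct / r2, rounded up
    return math.ceil(n_direct / r2)


# Illustrative numbers only: a marker with r^2 = 0.5 to the causal variant
print(effective_sample_size(1000, 0.5))
print(required_sample_size(1000, 0.5))
```

The inflation factor 1/r^2 is why tag SNPs with low r^2 to untyped variants are costly: halving r^2 doubles the number of individuals needed.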
Power – a Statistical Concept
Online Calculators








General Statistical Calculators Including a Power Calculator (UCLA);
Statistical Power Calculator for Frequencies;
Retrospective Power Calculation;
Genetic Power Calculator;
Wise Project Applets: Power Applet;
Downloadable calculators: CaTS (Skol, 2006), Quanto (sample size
or power calculation for association studies of genes, gene-environment or gene-gene interactions);
Calculation of Power for Genetic Association Studies 'AssocPow'
(Ambrosius, 2004), PS: Power and Sample Size Calculation;
Power & Sample Size Calculations on STATA.
(http://www.dorak.info/epi/glosge.html)
Type I and Type II errors
Statistical Analysis depends on
Study Design ...
(Cordell and Clayton, 2005)
Statistical analysis
depends on …
Assessing Association

Direct association:
patterns of genotype-phenotype
relationship
→ From dose-response models
to models accounting for
epistatic effects

Indirect association:
patterns of linkage-disequilibrium
→ r^2 relates to the power to
detect association: the sample size must be
0.56/0.2 (= 2.8) times as large
to detect the indirect association
with A as to detect the indirect
association with C
→ Haplotype blocks / haplotype
tagging SNPs

r squared measures of LD; Locus B is assumed to be causal:
       A       B       C
A      1
B      0.2     1
C              0.56    1
Human Genetic Disorders

Single gene disorder


Less than 0.05% (rare), e.g., Huntington disease,
cystic fibrosis
Disorders with polygenic or multifactorial
inheritance




1% or more (common); e.g., diabetes, obesity
Do not show Mendelian modes of transmission
Genetically relevant phenotype often unclear
Under the influence of multiple interacting genes
Mendelian Traits
[Figure: pedigree annotated with two-locus genotypes and affected individuals, next to a 3 x 3 grid of Locus 1 genotypes (AA, Aa, aa) by Locus 2 genotypes (BB, Bb, bb) marking the affected genotype combinations.]
Complex Traits
[Figure: pedigree annotated with two-locus genotypes and affected individuals, next to a 3 x 3 grid of Locus 1 genotypes (AA, Aa, aa) by Locus 2 genotypes (BB, Bb, bb) marking the affected genotype combinations.]
Genetic Etiology I
Independent effect
Gene1
Gene2
Gene3
Disease
Gene4
Gene5
Any one bad gene
results in the disease.
Genes have no effect
on each other.
Genetic Heterogeneity
Genetic Etiology II
Interactive effect
Gene1
Gene2
Gene3
Disease
E.g. Any bad gene
results in disease.
Genes have an
effect on other
genes in the
pathway.
Epistasis
Genetic Etiology III
Incomplete penetrance
Gene1
Disease
Gene1
No Disease
Gene1
Disease
Gene1
No Disease
Some individuals
with genotype do
not manifest trait.
Genetic Etiology IV
Phenocopy
Assuming a dominant model, and
disease allele A, normal allele a.
AA
Disease
Aa
Disease
AA
Disease
aa
Disease
Maybe caused by
environmental factors
And now we should be able
to start
modeling, testing, estimating, …
Association Analysis

Case-control studies


Test for association between marker alleles and the
disease phenotype in a group of affected and
unaffected individuals randomly from the population
Family-based studies

Test for association between marker alleles and the
disease phenotype in a group of affected individuals
and unaffected family members
Case-control data structure

Status  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
1       1     2     2     1     2     1     2     2     1     2
1       0     0     0     1     0     0     0     0     1     0
1       0     2     0     1     1     0     2     0     1     1
1       2     0     1     1     0     2     0     1     1     0
1       2     1     1     0     0     2     1     1     0     0
1       1     0     0     0     0     1     0     0     0     0
1       1     1     0     1     2     1     1     0     1     2
1       1     0     1     0     2     1     0     1     0     2
1       0     0     0     2     0     0     0     0     2     0
1       0     0     1     0     1     0     0     1     0     1
0       2     1     0     1     0     2     1     0     1     0
0       0     1     1     0     0     0     1     1     0     0
0       1     1     0     2     1     1     1     0     2     1
0       0     0     2     0     1     0     0     2     0     1
0       2     1     0     1     1     2     1     0     1     1
0       0     0     2     0     0     0     0     2     0     0
0       1     0     0     1     2     1     0     0     1     2
0       0     1     1     1     2     0     1     1     1     2
0       1     1     0     0     2     1     1     0     0     2
0       0     1     2     0     0     0     1     2     0     0
Standard Method: Genotype Case-Control

            # copies of ‘0’ allele
            0      1      2      Total
Case        r0     r1     r2     R
Control     s0     s1     s2     S
Total       n0     n1     n2     N

$\chi^2 = \sum_{i=0}^{2} \left[ \frac{(N r_i - R n_i)^2}{N R n_i} + \frac{(N s_i - S n_i)^2}{N S n_i} \right]$ (2 degrees of freedom)
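The genotype test above can be sketched in a few lines. The code follows the slide's notation (r_i, s_i, n_i, R, S, N) and uses the fact that the upper tail of a chi-square distribution with 2 degrees of freedom is exp(-x/2); the function name and the example counts are made up for illustration.

```python
import math


def genotype_chi_square(cases, controls):
    """2 x 3 genotype test for one SNP.

    cases, controls: genotype counts ordered by number of copies (0, 1, 2) of the allele.
    Returns (chi-square statistic, p-value on 2 degrees of freedom).
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    stat = 0.0
    for r_i, s_i in zip(cases, controls):
        n_i = r_i + s_i
        if n_i == 0:
            raise ValueError("empty genotype column: the 2-df test does not apply")
        # Each term is (observed - expected)^2 / expected, written as on the slide
        stat += (N * r_i - R * n_i) ** 2 / (N * R * n_i)
        stat += (N * s_i - S * n_i) ** 2 / (N * S * n_i)
    # Survival function of a chi-square variable with 2 df
    return stat, math.exp(-stat / 2)


stat, p = genotype_chi_square(cases=[10, 20, 30], controls=[30, 20, 10])
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

In a genome-wide scan this function would be applied to every SNP column in turn, which is what creates the multiple testing problem addressed next.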
The Bonferroni correction for multiple comparisons
0.05/(# SNPs tested)
(Gibson and Muse, 2002)
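As a small worked example of the correction, the sketch below computes the per-SNP threshold 0.05/(# SNPs tested) and flags which p-values pass it. The helper names are hypothetical, and treating p-values equal to the cutoff as significant is an arbitrary convention chosen here.

```python
def bonferroni_threshold(n_snps, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_snps < 1:
        raise ValueError("need at least one test")
    return alpha / n_snps


def significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    cutoff = bonferroni_threshold(len(p_values), alpha)
    return [p <= cutoff for p in p_values]


# With 500,000 SNPs the per-SNP level is on the order of 10^-7
print(bonferroni_threshold(500_000))
print(significant([0.001, 0.04, 0.5]))
```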
A Pure Epistatic Inheritance Model

            AA      Aa      aa      Marginal
BB          0       0       0.2     0.2
Bb          0       0.2     0       0.2
bb          0.2     0       0       0.2
Marginal    0.2     0.2     0.2

p = 0.5, q = 0.5
Comparison of allele or genotype frequencies between cases and controls will
not show anything unusual.
Virtually no power!
Traditional Method suffers

A large number of SNPs are genotyped


“multiple comparisons” problem, very small p-values
required for significance.
Genetic loci may interact (epistasis) in their
influence on the phenotype


loci with small marginal effects may go undetected
interested in the interaction itself
Curse of Dimensionality
[Figure: N = 100 individuals (50 cases, 50 controls) spread over the genotype cells formed by SNP 1 (AA, Aa, aa) alone, then crossed with SNP 2 (BB, Bb, bb), SNP 3 (CC, Cc, cc) and SNP 4 (DD, Dd, dd).]
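The point of the figure can be made numerically: with three genotypes per SNP, the number of multi-locus genotype cells grows as 3^k, so a fixed sample is spread ever more thinly. This is a minimal sketch using the slide's N = 100; the function names are illustrative.

```python
def genotype_cells(n_snps, genotypes_per_snp=3):
    """Number of multi-locus genotype combinations for n_snps bi-allelic SNPs."""
    return genotypes_per_snp ** n_snps


def mean_per_cell(n_individuals, n_snps):
    """Average number of individuals per cell if they were spread evenly."""
    return n_individuals / genotype_cells(n_snps)


for k in range(1, 5):
    print(k, genotype_cells(k), round(mean_per_cell(100, k), 2))
```

In real data the spread is uneven, so many cells are empty well before the average drops below one individual per cell.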
Curse of Dimensionality

Bellman R (1961) Adaptive control processes: A
guided tour. Princeton University Press:
“... Multidimensional variational problems cannot be
solved routinely ... . This does not mean that we cannot
attack them. It merely means that we must employ
some more sophisticated techniques.”
Traditional Methods suffer
Alternatives
Tree-based methods:
  Recursive Partitioning (Helix Tree)
  Random Forests (R, CART)
Pattern recognition methods:
  Symbolic Discriminant Analysis (SDA)
  Mining association rules
  Neural networks (NN)
  Support vector machines (SVM)
Data reduction methods:
  DICE (Detection of Informative Combined Effects)
  MDR (Multifactor Dimensionality Reduction)
  Logic regression …
(e.g., Onkamo and Toivonen 2006)
[Decision tree: choosing a hypothesis test by type of data. Where two tests are listed, the first is parametric and the second is its nonparametric counterpart.]
Qualitative (categorical) data
  1 independent variable: Goodness of fit χ²
  2 independent variables: Independence test χ²
  2 dependent variables: McNemar test
Quantitative (measurement) data
  Relationships
    1 predictor (by primary interest)
      Degree of relationship: Pearson r (continuous measurement) / Spearman rs (ranks)
      Form of relationship: Regression
    Multiple predictors: Multiple regression
  Differences
    2 groups
      independent: 2-sample t / Mann-Whitney U
      dependent: Related sample t / Wilcoxon T
    Multiple groups
      independent
        1 IV: One-way ANOVA / Kruskal-Wallis H
        Multiple IVs: Factorial ANOVA
      dependent: Repeated measures ANOVA / Friedman
Multi-locus Methods
Parametric methods:
  Regression
  → Logistic or (Bagged) logic regression
Non-parametric methods:
  Combinatorial Partitioning Method (CPM)
    quantitative phenotypes; interactions
  Multifactor-Dimensionality Reduction (MDR)
    qualitative phenotypes; interactions
  Machine learning and data mining
Limitation of Regression
Having too many independent variables in relation
to the number of observed outcome events
Assuming 10 bi-allelic loci:
# of parameters for k-locus terms among n loci = $\binom{n}{k} \cdot 2^k$

                   Main effect   2-locus interaction   3-locus interaction   4-locus interaction
# of parameters    20            180                   960                   3360
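The parameter counts on this slide follow from the binomial formula, with two parameters per locus because a three-level genotype needs two indicator variables. A short check, with a hypothetical function name:

```python
from math import comb


def interaction_parameters(n_loci, k, params_per_locus=2):
    """Number of k-locus terms: choose k of n loci, 2 genotype contrasts per locus."""
    return comb(n_loci, k) * params_per_locus ** k


# 10 bi-allelic loci, main effects (k = 1) up to 4-locus interactions
counts = [interaction_parameters(10, k) for k in range(1, 5)]
print(counts)  # [20, 180, 960, 3360]
```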
Limitation of Regression

Fewer than 10 outcome events per independent variable
can lead to biased estimates of the regression
coefficients and to an increase in Type 1 and Type 2
errors.
# of parameters P ≤ min(ncase, ncontrol)/10 - 1

For 200 cases and 200 controls, this formula suggests
that no more than 19 (= 200/10 – 1) parameters should
be estimated in a logistic regression model.
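The rule of thumb can be written as a one-line budget. In this sketch the function name is hypothetical and rounding down with integer division is an added assumption for group sizes that are not multiples of 10.

```python
def max_parameters(n_cases, n_controls, events_per_parameter=10):
    """Upper bound on logistic-regression parameters under the 10-events rule."""
    return min(n_cases, n_controls) // events_per_parameter - 1


budget = max_parameters(200, 200)
print(budget)  # 19
# Main effects alone for 10 bi-allelic loci need 20 parameters, already over budget
print(20 <= budget)  # False
```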
MDR



An extension of CPM, which finds the genotype partitions
within which a (quantitative) trait variability is much
lower than between partitions
MDR reduces the dimensionality of multi-locus
information to one-dimension, thereby improving the
identification of polymorphism combinations associated
with disease risk
The one-dimensional multi-locus genotype variable is
evaluated for its ability to classify and predict disease
status through cross-validation and permutation testing
Two Measures for Selection of
Best n-locus model

Misclassification error:
The proportion of incorrect classification in the training set.

Prediction error (PE):
The proportion of incorrect prediction in the test set.
10-fold cross-validation → 10 best models.
The model with minimum PE is the
best n-locus model.
MDR Steps
9/10 training data
All combinations
of 2 factors =
10*9/2 = 45
1/10 test data
10 runs
A single model with
minimum classification
error is the best Model
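The steps above can be sketched as a toy implementation for a fixed number of factors. This is not the MDR software: the high-risk rule (case/control ratio above 1.0), the interleaved fold assignment, treating genotype cells unseen in training as low-risk, and the `simulate` helper that plants an interaction between the SNPs at indices 0 and 5 are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import random


def cell_of(row, loci):
    """Multi-locus genotype of one individual at the chosen loci."""
    return tuple(row[j] for j in loci)


def high_risk_cells(rows, labels, loci, threshold=1.0):
    """Cells whose case/control ratio exceeds the threshold (assumed 1.0)."""
    cases, controls = Counter(), Counter()
    for row, y in zip(rows, labels):
        (cases if y == 1 else controls)[cell_of(row, loci)] += 1
    return {c for c in cases if cases[c] > threshold * controls[c]}


def error_rate(rows, labels, loci, high_risk):
    """Share of individuals whose status disagrees with their cell's risk label."""
    wrong = sum((cell_of(r, loci) in high_risk) != (y == 1)
                for r, y in zip(rows, labels))
    return wrong / len(rows)


def mdr(rows, labels, n_loci=2, folds=10):
    """Return (most frequently selected model, its CVC, its mean prediction error)."""
    picks = []
    for f in range(folds):
        test_idx = set(range(f, len(rows), folds))          # 1/10 test data
        train_idx = [i for i in range(len(rows)) if i not in test_idx]
        tr_rows = [rows[i] for i in train_idx]
        tr_y = [labels[i] for i in train_idx]
        te_rows = [rows[i] for i in sorted(test_idx)]
        te_y = [labels[i] for i in sorted(test_idx)]

        def train_error(loci):
            return error_rate(tr_rows, tr_y, loci,
                              high_risk_cells(tr_rows, tr_y, loci))

        # Exhaustive search over all combinations of n_loci factors
        best = min(combinations(range(len(rows[0])), n_loci), key=train_error)
        pe = error_rate(te_rows, te_y, best,
                        high_risk_cells(tr_rows, tr_y, best))
        picks.append((best, pe))
    model, cvc = Counter(m for m, _ in picks).most_common(1)[0]
    errors = [pe for m, pe in picks if m == model]
    return model, cvc, sum(errors) / len(errors)


def simulate(n=400, n_snps=10, seed=0):
    """Toy data: status is fully determined by an interaction of SNP indices 0 and 5."""
    rng = random.Random(seed)
    rows = [[rng.randrange(3) for _ in range(n_snps)] for _ in range(n)]
    labels = [int((r[0] == 1) != (r[5] == 1)) for r in rows]
    return rows, labels


rows, labels = simulate()
print(mdr(rows, labels))
```

Neither SNP predicts status on its own in the toy data, which is the situation MDR is designed for. The exhaustive `combinations` loop is also where the computational cost comes from as the number of factors grows.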
Best Multi-factor Models
Best 2-factor model
Best 3-factor model
Best 4-factor model
Best 5-factor model
Best 6-factor model
.
.
Best n-factor model
Model Selection and Evaluation

Among the best n-factor models, the best model
is:



The model with the minimum average PE.
The model with the maximum average CVC.
Rule of parsimony: If there is a tie, select the smaller
model.
MDR Analysis Window
(MDR_Overview.pdf)
Significance of the Final Model
Via permutation tests:




Randomize the case and control labels in the
original dataset multiple times to create a set of
permuted datasets.
Run MDR on each permuted dataset.
The maximum CVC and minimum PE identified for each
dataset are saved and used to create an empirical
distribution for estimation of a P-value.
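The permutation procedure can be sketched generically for any statistic where smaller is better, such as a minimum prediction error. The function name, the default of 1000 permutations, and the add-one correction in the p-value are assumptions chosen for this example; the toy statistic stands in for a full MDR run.

```python
import random


def permutation_p_value(rows, labels, statistic, n_permutations=1000, seed=0):
    """Empirical p-value for a statistic where smaller values mean a better fit."""
    rng = random.Random(seed)
    observed = statistic(rows, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any genotype-status association
        if statistic(rows, shuffled) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


def single_snp_error(rows, labels):
    """Toy statistic: error when status is predicted by carrying genotype 1 at SNP 0."""
    return sum(r[0] != y for r, y in zip(rows, labels)) / len(rows)


rows = [[0]] * 10 + [[1]] * 10
labels = [0] * 10 + [1] * 10
print(permutation_p_value(rows, labels, single_snp_error, n_permutations=199))
```

For a statistic where larger is better, such as CVC, the comparison would be reversed to count permutations that reach or exceed the observed value.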
Measures in Selection of Final model
Cross-validation consistency (CVC)
  In every run, # of times the same MDR model is identified in m
  cross-validations. 1 ≤ CVC ≤ m.
Average cross-validation consistency
  Average of CVC across all runs.
Average misclassification error
  Average across all cross-validations and all runs.
Average prediction error
  Average prediction error across all cross-validations and all runs.
Simulation I
200 cases and 200 controls;
10 SNPs: 1, 2, 3 , …, 10.
Disease etiology due to interaction
between SNP 1 and SNP 6.
Over 10 CVs and
10 runs
Simulation II
50 replicates of 200 cases and 200 controls;
10 SNPs: 1, 2, 3 , …, 10.
Disease risk is dependent on whether two deleterious
alleles and two normal alleles are present, from either
one locus or both loci.
2-locus epistasis model;
3-locus epistasis model;
4-locus epistasis model;
5-locus epistasis model.
Power
Mean and standard error
of the mean calculated
from 50 replicates.
78%
82%
94%
90%
(Ritchie et al, 2001)
Power of MDR in Presence of Genotyping Error,
Missing Data, Phenocopy, and Genetic Heterogeneity
no noise
5% genotyping error -- GE
5% missing data -- MS
50% phenocopy -- PC
50% genetic heterogeneity – GH
GE + MS
…
…
GE+MS+PC
…
…
6 models
4 models
GE+MS+PC+GH
Total
16 models
Advantages of MDR


Simultaneous detection of multiple genetic loci
associated with a discrete clinical endpoint in absence of
main effect.
Non-parametric:
Overcomes “curse of dimensionality” from which logistic
regression models suffer.

No particular genetic model

Low false positive rates
Disadvantages of MDR



Computationally very intensive. Only feasible for
relatively small number of factors. Impractical to test
very high-dimensional models.
When the dimensionality of the best model is relatively
high and the sample is relatively small, many
observations in the test set can not be predicted. This
impacts the SEM of prediction error.
Low power in the presence of heterogeneity
Issues to Consider

I: Variable selection

II: Model selection

III: Interpretation
I: Variable Selection



How can you determine which variables to
select?
Not computationally feasible to evaluate all
possible combinations
Need to select correct variables to detect
interactions
How many combinations are there?
~500,000 SNPs span 80% of common variation in genome (HapMap)

SNPs in each subset    Number of possible combinations
1                      5 x 10^5
2                      1 x 10^11
3                      2 x 10^16
4                      3 x 10^21
5                      2 x 10^26
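The size of the search space is a binomial coefficient: the number of ways to choose k SNPs out of roughly 500,000. The sketch below computes exact counts for k = 1 to 5, whose orders of magnitude can be compared with the figure; the function name is illustrative.

```python
from math import comb


def subset_counts(n_snps, max_size):
    """Exact number of k-SNP subsets for k = 1 .. max_size."""
    return {k: comb(n_snps, k) for k in range(1, max_size + 1)}


counts = subset_counts(500_000, 5)
for k, count in counts.items():
    # Convert to float only for compact scientific-notation display
    print(k, f"{float(count):.1e}")
```

Each extra SNP in the subset multiplies the count by roughly 10^5, which is why exhaustive evaluation stops being feasible after pairs or triples.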
II: Model Selection


For each variable subset, evaluate a statistical
model
Goal is to identify the best subset of variables
that compose the best model
III: Interpretation


Selection of best statistical model in a vast
search space of possible models
Statistical or computational model may not
translate into biology


May not be able to identify prevention or
treatment strategies directly
Wet lab experiments will be necessary, but
may not be sufficient
Interpretation

Strategies to assess biological interpretation of
gene-gene interaction models



Consider current knowledge about the biochemistry of
the system and the biological plausibility of the models
Perform experiments in the wet lab to measure the
effect of small perturbations to the system
Computer simulation algorithms to model biochemical
systems
MDR: To keep in Mind
Candidate SNP selection:
  The selection of the final model is highly dependent on the selection
  of n factors at the beginning.
Selection of the best n-factor model:
  Keeping one best n-factor model from all combinations is
  actually a greedy search algorithm, which might lead to a local
  maximum; yet good power results and practice have proven its
  usefulness.
Performance when heterogeneity is present in the data:
  Phenotypic (different clinical expressions), genetic (different inheritance
  patterns), locus (different genes), allelic (different alleles in same gene)
References for MDR
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001 Jul;69(1):138-47.
Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003 Feb;24(2):150-7.
Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003 Feb 12;19(3):376-82.
Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82.
Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004 Mar;47(3):549-54.
Ritchie MD, Motsinger AA. Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005 Dec;6(8):823-34.
Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH. A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol. 2006 Feb;30(2):111-23.
Andrew AS, Nelson HH, Kelsey KT, Moore JH, Meng AC, Casella DP, Tosteson TD, Schned AR, Karagas MR. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking and bladder cancer susceptibility. Carcinogenesis. 2006 May;27(5):1030-7.
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61.
Acknowledgements
Slides content based on material from
Jie Chen, Frank Emmert-Streib, Earl F Glynn, Hua Li, Bolan Linghu,
Arcady R Mushegian, Yan Meng, Jurg Ott, Marylyn Ritchie, Antonio
Salas, Chris Seidel, Matt McQueen, Christoph Lange
and discussions with
Steve Horvath, Nan M. Laird, Stephen Lake, Christoph
Lange, Ross Lazarus, Matthew McQueen, Benjamin Raby,
Nuria Malats, Marylyn Ritchie (lab),
Edwin K. Silverman, Scott T. Weiss, Xin Xu, …