Download AA - Institut Montefiore

Parametric versus Non-parametric Genetic Association Analysis
Kristel Van Steen, PhD, ScD ([email protected])
Université de Liège - Institut Montefiore
Ghent University - StepGen cvba
December 18th, 2007

Genetic Association Studies
- Aim: detect association between one or more genetic polymorphisms and a trait, which may be
  - measured,
  - dichotomous,
  - time to onset.
- (Genuine) genetic associations arise only because human populations share common ancestry.

Terminology
[Figure slides: Roche Genetics; courtesy of Ed Silverman]

Genetic Association Studies
Reflection I:
- In linkage analysis, data from distantly related individuals are more powerful for detecting small effects.
- Increased possibility for linkage to be destroyed by recombination -> linkage extends over smaller distances -> denser maps required.

Linkage Disequilibrium
[Figure: Roche Genetics]

Linkage Disequilibrium
- Marker locus: alleles 1 and 2, with frequencies $p_1$ and $p_2$
- Disease locus: alleles D and d, with frequencies $p_D$ and $p_d$
- Haplotype frequency under linkage equilibrium: $p_{D1} = p_D \, p_1$

Genetic Association Studies
Reflection II:
- An association study is a special form of linkage study: the extended family is the wider population.
- Association studies have greater power than linkage studies to detect small effects, but require looking at more places (Risch and Merikangas 1996).

Genetic Association Studies
Reflection III:
- Genetic susceptibility to common complex disorders involves many genes, most of which have small effects.
- A large number of "markers" have been identified.

Complex Disorders
[Figure: Roche Genetics]

Markers
[Figure: Roche Genetics]

Genetic Association
[Diagram: the disease phenotype is tested for association with a marker locus. The marker is in LD (correlated) with the DSL, the disease susceptibility locus, so this is indirectly a test for genetic association between the phenotype and the DSL.]

Indirect Associations
- The polymorphism is a surrogate for the causal locus:
  - Indirect associations are weaker than the direct associations they reflect.
  - It is essential to type several surrounding markers.
  - Try to exclude the possibility that a causal variant exists but is not picked up by the marker set: genome-wide vs. candidate gene approach.

Statistical Requirements for a Successful Genome-wide Association Study
- LD coverage
- Genotyping quality
- Sufficient sample sizes
- Design of genome-wide association studies
- Handling of the multiple testing problem

Study Designs
[Figure: Cordell and Clayton, 2005]

Example for Required Sample Sizes
Required sample sizes to achieve 80% power in a case/control study for a significance level of $10^{-7}$:

Allele freq   OR 1.25   OR 1.5   OR 1.75
0.1           8,859     2,608    1,350
0.2           5,283     1,616      869
0.3           4,281     1,342      727
0.4           3,886     1,301      750

The Interpretation of r^2
- $r^2 N$ is the "effective sample size".
- If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with $r^2 N$ cases and controls that directly measured G.
So ...
- The markers that are genotyped should be selected so that they have high r^2 values (preferably at least 80%) with the markers that are not genotyped.
- A good SNP selection will be key to the success of GWAs.

Power - a Statistical Concept
[Figure]

Online Calculators
- General Statistical Calculators Including a Power Calculator (UCLA)
- Statistical Power Calculator for Frequencies
- Retrospective Power Calculation
- Genetic Power Calculator
- Wise Project Applets: Power Applet
- Downloadable calculators:
  - CaTS (Skol, 2006)
  - Quanto (sample size or power calculation for association studies of genes, gene-environment or gene-gene interactions)
  - Calculation of Power for Genetic Association Studies 'AssocPow' (Ambrosius, 2004)
  - PS: Power and Sample Size Calculation
  - Power & Sample Size Calculations on STATA
(http://www.dorak.info/epi/glosge.html)

Type I and Type II Errors
[Figure]

Statistical Analysis Depends on Study Design ...
(Cordell and Clayton, 2005)

Statistical Analysis Depends on ...

Assessing Association
- Direct association: patterns of genotype-phenotype relationship
  - From dose-response models to models accounting for epistatic effects
- Indirect association: patterns of linkage disequilibrium
  - r^2 relates to the power to detect association: the sample size must be 0.56/0.2 (2.8) times as large to detect indirect association with A as to detect indirect association with C
  - Haplotype blocks / haplotype tagging SNPs

r^2 measures of LD (locus B is assumed to be causal):

      A      B      C
A     1
B     0.2    1
C            0.56   1

Human Genetic Disorders
- Single gene disorders
  - Less than 0.05% (rare), e.g., Huntington disease, cystic fibrosis
- Disorders with polygenic or multifactorial inheritance
  - 1% or more (common), e.g., diabetes, obesity
  - Do not show Mendelian modes of transmission
  - Genetically relevant phenotype often unclear
  - Under the influence of multiple interacting genes

Mendelian Traits
[Figure: pedigree with two-locus genotypes and a 3 x 3 grid of genotypes (Locus 1: AA, Aa, aa; Locus 2: BB, Bb, bb) marking the affected genotypes]

Complex Traits
[Figure: pedigree with two-locus genotypes and a 3 x 3 grid of genotypes (Locus 1: AA, Aa, aa; Locus 2: BB, Bb, bb) marking the affected genotypes]

Genetic Etiology I: Independent Effect
[Diagram: Gene1, Gene2, Gene3, Gene4, Gene5 -> Disease]
- Any one bad gene results in the disease.
- Genes have no effect on each other.
- Genetic heterogeneity

Genetic Etiology II: Interactive Effect
[Diagram: Gene1, Gene2, Gene3 -> Disease]
- E.g., any bad gene results in disease.
- Genes have an effect on other genes in the pathway.
- Epistasis

Genetic Etiology III: Incomplete Penetrance
[Diagram: Gene1 -> Disease; Gene1 -> No Disease; Gene1 -> Disease; Gene1 -> No Disease]
- Some individuals with the genotype do not manifest the trait.

Genetic Etiology IV: Phenocopy
Assuming a dominant model, with disease allele A and normal allele a:
[Diagram: AA -> Disease; Aa -> Disease; AA -> Disease; aa -> Disease]
- Disease in the aa individual may be caused by environmental factors.

And now we should be able to start modeling, testing, estimating, ...

Association Analysis
- Case-control studies
  - Test for association between marker alleles and the disease phenotype in a group of affected and unaffected individuals sampled randomly from the population
- Family-based studies
  - Test for association between marker alleles and the disease phenotype in a group of affected individuals and unaffected family members

Case-control Data Structure

Status  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9  SNP10
1       1     2     2     1     2     1     2     2     1     2
1       0     0     0     1     0     0     0     0     1     0
1       0     2     0     1     1     0     2     0     1     1
1       2     0     1     1     0     2     0     1     1     0
1       2     1     1     0     0     2     1     1     0     0
1       1     0     0     0     0     1     0     0     0     0
1       1     1     0     1     2     1     1     0     1     2
1       1     0     1     0     2     1     0     1     0     2
1       0     0     0     2     0     0     0     0     2     0
1       0     0     1     0     1     0     0     1     0     1
0       2     1     0     1     0     2     1     0     1     0
0       0     1     1     0     0     0     1     1     0     0
0       1     1     0     2     1     1     1     0     2     1
0       0     0     2     0     1     0     0     2     0     1
0       2     1     0     1     1     2     1     0     1     1
0       0     0     2     0     0     0     0     2     0     0
0       1     0     0     1     2     1     0     0     1     2
0       0     1     1     1     2     0     1     1     1     2
0       1     1     0     0     2     1     1     0     0     2
0       0     1     2     0     0     0     1     2     0     0

Standard Method: Genotype Case-Control

# copies of '0' allele   0     1     2     Total
Case                     r0    r1    r2    R
Control                  s0    s1    s2    S
Total                    n0    n1    n2    N

$\chi^2 = \sum_{i} \left[ \frac{(N r_i - R n_i)^2}{N R n_i} + \frac{(N s_i - S n_i)^2}{N S n_i} \right]$, with 2 degrees of freedom.

The Bonferroni correction for multiple comparisons: 0.05/(# SNPs tested)
(Gibson and Muse, 2002)

A Pure Epistatic Inheritance Model

           AA     Aa     aa     Marginal
BB         0      0      0.2    0.2
Bb         0      0.2    0      0.2
bb         0.2    0      0      0.2
Marginal   0.2    0.2    0.2

p = 0.5, q = 0.5
- Comparison of allele or genotype frequencies between cases and controls will not show anything unusual. Virtually no power!

Traditional Method Suffers
- A large number of SNPs are genotyped
  - "multiple comparisons" problem: very small p-values required for significance.
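
The genotype-based test and the Bonferroni threshold from the "Standard Method" slide can be sketched as follows. This is a minimal illustration, not the author's code: the function names and the example counts are hypothetical, and the p-value uses the closed form exp(-x/2) that holds for a chi-square variable with 2 degrees of freedom, so it assumes that all three genotype classes are observed.

```python
import math

def genotype_chi_square(cases, controls):
    """Chi-square statistic and p-value for a 2 x 3 genotype table.

    cases    = (r0, r1, r2): case counts by number of copies of the '0' allele
    controls = (s0, s1, s2): control counts with the same coding
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    chi2 = 0.0
    for r_i, s_i in zip(cases, controls):
        n_i = r_i + s_i
        if n_i == 0:
            # With an empty genotype class the test no longer has 2 df.
            raise ValueError("every genotype class must be observed")
        chi2 += (N * r_i - R * n_i) ** 2 / (N * R * n_i)
        chi2 += (N * s_i - S * n_i) ** 2 / (N * S * n_i)
    # Survival function of a chi-square distribution with 2 df.
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

def bonferroni_threshold(n_snps, alpha=0.05):
    """Per-test significance level when n_snps SNPs are tested."""
    return alpha / n_snps

# Hypothetical counts for one SNP out of 10 tested.
chi2, p = genotype_chi_square((30, 50, 20), (20, 50, 30))
threshold = bonferroni_threshold(10)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, significant: {p < threshold}")
```

The statistic is computed term by term as in the formula, so each genotype column contributes one case term and one control term.
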
- Genetic loci may interact (epistasis) in their influence on the phenotype
  - loci with small marginal effects may go undetected
  - interested in the interaction itself

Curse of Dimensionality
[Figure: N = 100 subjects (50 cases, 50 controls) spread over the genotype cells of SNP 1 (AA, Aa, aa), then SNP 1 x SNP 2 (BB, Bb, bb), then additionally SNP 3 (CC, Cc, cc) and SNP 4 (DD, Dd, dd)]

Curse of Dimensionality
- Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press: "... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques."

Traditional Methods Suffer: Alternatives
- Tree-based methods:
  - Recursive Partitioning (Helix Tree)
  - Random Forests (R, CART)
- Pattern recognition methods:
  - Symbolic Discriminant Analysis (SDA)
  - Mining association rules
  - Neural networks (NN)
  - Support vector machines (SVM)
- Data reduction methods:
  - DICE (Detection of Informative Combined Effects)
  - MDR (Multifactor Dimensionality Reduction)
  - Logic regression
- ...
(e.g., Onkamo and Toivonen 2006)

Hypothesis Testing
[Figure: decision tree for choosing a test by type of data, listing the parametric test first and its nonparametric counterpart second.
Qualitative (categorical) data: chi-square goodness-of-fit test (1 independent variable); chi-square independence test (2 independent variables); McNemar test (2 dependent variables).
Quantitative (measurement) data, relationships: degree of relationship by Pearson r (continuous measurement) or Spearman r_s (ranks); form of relationship by regression (1 predictor) or multiple regression (multiple predictors).
Quantitative (measurement) data, differences: 2 independent groups by 2-sample t or Mann-Whitney U; 2 dependent groups by related-sample t or Wilcoxon T; multiple independent groups with 1 IV by one-way ANOVA or Kruskal-Wallis H; multiple dependent groups by repeated-measures ANOVA or Friedman; multiple IVs by factorial ANOVA.]

Multi-locus Methods
- Parametric methods: regression
  - Logistic or (bagged) logic regression
- Non-parametric methods:
  - Combinatorial Partitioning Method (CPM): quantitative phenotypes; interactions
  - Multifactor-Dimensionality Reduction (MDR): qualitative phenotypes; interactions
  - Machine learning and data mining

Limitation of Regression
- Having too many independent variables in relation to the number of observed outcome events
- Assuming 10 bi-allelic loci: # of parameters = $\binom{n}{k} 2^k$

                  Main effect   2-locus interaction   3-locus interaction   4-locus interaction
# of parameters   20            180                   960                   3360

Limitation of Regression
- Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.
- # of parameters: $P \le \min(n_{case}, n_{control})/10 - 1$
- For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.

MDR
- An extension of CPM, which finds the genotype partitions within which a (quantitative) trait variability is much lower than between partitions
- MDR reduces the dimensionality of multi-locus information to one dimension, thereby improving the identification of polymorphism combinations associated with disease risk
- The one-dimensional multi-locus genotype variable is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing

Two Measures for Selection of Best n-locus Model
- Misclassification error: the proportion of incorrect classifications in the training set.
- Prediction error (PE): the proportion of incorrect predictions in the test set.
- 10-fold cross-validation -> 10 best models. The model with minimum PE is the best n-locus model.

MDR Steps
[Diagram: the data are split into 9/10 training data and 1/10 test data; all combinations of 2 factors are evaluated (10*9/2 = 45); this is repeated for 10 runs; a single model with minimum classification error is the best model]

Best Multi-factor Models
- Best 2-factor model
- Best 3-factor model
- Best 4-factor model
- Best 5-factor model
- Best 6-factor model
- ...
- Best n-factor model

Model Selection and Evaluation
- Among the best n-factor models, the best model is:
  - The model with the minimum average PE.
  - The model with the maximum average CVC.
  - Rule of parsimony: if there is a tie, select the smaller model.

MDR Analysis Window
[Figure: MDR_Overview.pdf]

Significance of the Final Model
Via permutation tests:
- Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets.
- Run MDR on each permuted dataset.
- The maximum CVC and minimum PE identified for each dataset are saved and used to create an empirical distribution for estimation of a P-value.

Measures in Selection of Final Model
- Cross-validation consistency (CVC)
  - In every run, the number of times the same MDR model is identified in m cross-validations; $1 \le CVC \le m$.
- Average cross-validation consistency
  - Average of CVC across all runs.
- Average misclassification error
  - Average across all cross-validations and all runs.
- Average prediction error
  - Average prediction error across all cross-validations and all runs.

Simulation I
- 200 cases and 200 controls; 10 SNPs: 1, 2, 3, ..., 10.
- Disease etiology due to interaction between SNP 1 and SNP 6.
- Over 10 CVs and 10 runs.

Simulation II
- 50 replicates of 200 cases and 200 controls; 10 SNPs: 1, 2, 3, ..., 10.
- Disease risk is dependent on whether two deleterious alleles and two normal alleles are present, from either one locus or both loci.
- 2-locus epistasis model; 3-locus epistasis model; 4-locus epistasis model; 5-locus epistasis model.
- Mean and standard error of the mean calculated from 50 replicates.
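
The MDR procedure described above (pooling multi-locus genotype cells into high-risk and low-risk groups, picking the model with minimum misclassification error on the training data, and scoring it by prediction error and CVC) can be sketched as follows. This is a simplified illustration under stated assumptions, not the MDR software: all function names are hypothetical, a single run is performed, test genotypes not seen in training are predicted as low risk, and the toy data make the status a deterministic function of SNP 1 and SNP 6 (indices 0 and 5) with no marginal effect.

```python
import itertools
import random
from collections import defaultdict

def cell_counts(rows, status, loci):
    """Count [controls, cases] in each multi-locus genotype cell."""
    counts = defaultdict(lambda: [0, 0])
    for g, y in zip(rows, status):
        counts[tuple(g[j] for j in loci)][y] += 1
    return counts

def high_risk_cells(counts, threshold):
    """Cells whose case:control ratio reaches the threshold."""
    return {c for c, (n0, n1) in counts.items() if n1 >= threshold * n0}

def error_rate(rows, status, loci, high):
    """Proportion misclassified when high-risk cells predict 'case'."""
    wrong = sum((tuple(g[j] for j in loci) in high) != bool(y)
                for g, y in zip(rows, status))
    return wrong / len(rows)

def mdr(rows, status, order, n_folds=10, seed=0):
    """Return (best loci, CVC, average prediction error) for one run."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    wins, pred_err = defaultdict(int), defaultdict(list)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        tr_rows = [rows[i] for i in train]
        tr_y = [status[i] for i in train]
        # Threshold: overall case:control ratio in the training data.
        threshold = sum(tr_y) / (len(tr_y) - sum(tr_y))
        best = None
        for loci in itertools.combinations(range(len(rows[0])), order):
            high = high_risk_cells(cell_counts(tr_rows, tr_y, loci), threshold)
            err = error_rate(tr_rows, tr_y, loci, high)
            if best is None or err < best[0]:
                best = (err, loci, high)
        _, loci, high = best
        wins[loci] += 1
        pred_err[loci].append(
            error_rate([rows[i] for i in test], [status[i] for i in test],
                       loci, high))
    mean_pe = {m: sum(v) / len(v) for m, v in pred_err.items()}
    model = max(wins, key=lambda m: (wins[m], -mean_pe[m]))
    return model, wins[model], mean_pe[model]

# Toy data: 400 individuals, 10 SNPs coded 0/1/2, purely epistatic status.
rng = random.Random(1)
rows = [[rng.choice((0, 1, 1, 2)) for _ in range(10)] for _ in range(400)]
status = [int((g[0] == 1) != (g[5] == 1)) for g in rows]
model, cvc, pe = mdr(rows, status, order=2)
print("best 2-locus model (0-based SNP indices):", model, "CVC:", cvc, "PE:", pe)
```

A permutation test along the lines of the "Significance of the Final Model" slide would wrap `mdr` in a loop that shuffles `status` and records the CVC and PE of each permuted dataset.
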
[Figure: power; values shown: 78%, 82%, 94%, 90% (Ritchie et al., 2001)]

Power of MDR in Presence of Genotyping Error, Missing Data, Phenocopy, and Genetic Heterogeneity
- no noise
- 5% genotyping error (GE)
- 5% missing data (MS)
- 50% phenocopy (PC)
- 50% genetic heterogeneity (GH)
- GE + MS, ... (6 models)
- GE + MS + PC, ... (4 models)
- GE + MS + PC + GH
Total: 16 models

Advantages of MDR
- Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in the absence of main effects.
- Non-parametric: overcomes the "curse of dimensionality" from which logistic regression models suffer.
  - No particular genetic model
  - Low false positive rates

Disadvantages of MDR
- Computationally very intensive. Only feasible for a relatively small number of factors. Impractical to test very high-dimensional models.
- When the dimensionality of the best model is relatively high and the sample is relatively small, many observations in the test set cannot be predicted. This impacts the SEM of prediction error.
- Low power in the presence of heterogeneity

Issues to Consider
- I: Variable selection
- II: Model selection
- III: Interpretation

I: Variable Selection
- How can you determine which variables to select?
- Not computationally feasible to evaluate all possible combinations
- Need to select correct variables to detect interactions

How many combinations are there?
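
The size of the search space can be checked with a short calculation. The sketch below counts the subsets of k SNPs, assuming a panel of 500,000 SNPs for the genome-wide case and 10 SNPs for the small example used in the MDR steps; the function name is hypothetical.

```python
import math

def n_subsets(n_snps, k):
    """Number of distinct subsets of k SNPs out of n_snps."""
    return math.comb(n_snps, k)

# Small example: all 2-factor combinations of 10 SNPs.
print("10 SNPs, 2 per subset:", n_subsets(10, 2))

# Genome-wide panel (assumed size): orders of magnitude for k = 1..5.
for k in range(1, 6):
    print(f"{k} SNPs per subset: {n_subsets(500_000, k):.1e} combinations")
```

Each additional SNP in the subset multiplies the count by roughly five orders of magnitude, which is why exhaustive evaluation quickly becomes infeasible.
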
~500,000 SNPs span 80% of common variation in genome (HapMap)

SNPs in each subset              1          2           3           4           5
Number of possible combinations  5 x 10^5   1 x 10^11   2 x 10^16   3 x 10^21   2 x 10^26

II: Model Selection
- For each variable subset, evaluate a statistical model
- Goal is to identify the best subset of variables that compose the best model

III: Interpretation
- Selection of best statistical model in a vast search space of possible models
- Statistical or computational model may not translate into biology
  - May not be able to identify prevention or treatment strategies directly
  - Wet lab experiments will be necessary, but may not be sufficient

Interpretation
- Strategies to assess biological interpretation of gene-gene interaction models
  - Consider current knowledge about the biochemistry of the system and the biological plausibility of the models
  - Perform experiments in the wet lab to measure the effect of small perturbations to the system
  - Computer simulation algorithms to model biochemical systems

MDR: To Keep in Mind
- Candidate SNP selection:
  - The selection of the final model is highly dependent on the selection of n factors at the beginning.
- Selection of the best n-factor model:
  - Keeping one best n-factor model from all combinations is actually a greedy search algorithm, which might lead to a local maximum; yet the power results are good and practice has proven its usefulness.
- Performance when heterogeneity is present in the data:
  - Phenotypic (different clinical expressions), genetic (different inheritance patterns), locus (different genes), allelic (different alleles in the same gene)

References for MDR
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001 Jul;69(1):138-47.
- Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003 Feb;24(2):150-7.
- Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003 Feb 12;19(3):376-82.
- Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82.
- Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004 Mar;47(3):549-54.
- Ritchie MD, Motsinger AA. Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005 Dec;6(8):823-34.
- Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH. A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol. 2006 Feb;30(2):111-23.
- Andrew AS, Nelson HH, Kelsey KT, Moore JH, Meng AC, Casella DP, Tosteson TD, Schned AR, Karagas MR. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking and bladder cancer susceptibility. Carcinogenesis. 2006 May;27(5):1030-7.
- Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61.

Acknowledgements
Slide content based on material from Jie Chen, Frank Emmert-Streib, Earl F Glynn, Hua Li, Bolan Linghu, Arcady R Mushegian, Yan Meng, Jurg Ott, Marylyn Ritchie, Antonio Salas, Chris Seidel, Matt McQueen, Christoph Lange, and discussions with Steve Horvath, Nan M. Laird, Stephen Lake, Christoph Lange, Ross Lazarus, Matthew McQueen, Benjamin Raby, Nuria Malats, Marylyn Ritchie (lab), Edwin K. Silverman, Scott T. Weiss, Xin Xu, ...
