Identifying and estimating gene-gene and gene-environment interactions
Christopher Amos1,2 and Carol Etzel1
Departments of Epidemiology and Bioinformatics and Computational Genetics
U.T. M.D. Anderson Cancer Center, Houston, TX

Overview of talk
• Description of terminology
• Epistasis modeling for quantitative traits
• Epistasis modeling of linkage data in humans
• Approaches to interaction modeling in human/outbred populations
• Modeling gene-environment interactions

What is an ‘interaction’
• Interaction is a kind of action that occurs as two or more objects have an effect upon one another. The idea of a two-way effect is essential in the concept of interaction, as opposed to a one-way causal effect. (Wikipedia)

Gene-Environmental Interaction (Schulte, 1994)
               Environment −      Environment +                         Total
Genetics −     spontaneous 15%    pure environment 2%                   17%
Genetics +     pure genetic 5%    gene-environmental interaction 78%    83%
Total          20%                80%                                   100%

Interactions
• Biological interpretation
– Two or more factors jointly modify a phenotype
– e.g. lung cancer risk is increased 14-fold by tobacco smoke, 3-fold by asbestos exposure, and 42-fold by both
[Figure: bar chart of risk, additive expectation versus observed]
• Statistical interpretation
– Deviation from an additive model (on some scale). On a log scale the above risks are additive (log 14 + log 3 = log 42), so there is no evidence for interaction on a multiplicative scale
[Figure: log risk for asbestos, smoking, and both, compared with the log-additive model]

Definitions of Epistasis (Interaction among alleles at different loci)
• Bateson: gene interaction, in a physical sense of the direct interaction between gene products.
– First noticed when crossing chicken strains: the single comb was produced only rarely.
Using a Punnett square, this feature was shown to arise as a doubly homozygous recessive trait
– Another example is the Bombay phenotype

Batesonian Epistasis – Bombay Phenotype
The H allele provides the precursor of the ABO blood group; in its absence (h) the ABO phenotypes do not mature, so the hh genotype appears to express as the O phenotype

Fisherian Epistasis
• Joint effects of alleles at two loci do not influence a trait in an additive fashion
• Deviation from a simple oligogenic or polygenic model – higher correlation among siblings than parent-offspring
• Further developed by Cockerham (Cockerham, C. C., 1954. An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 39: 859–882.)

Types of Epistatic Interactions
[Figure: two-locus genotype diagram with corners AABB, AAbb, aaBB, aabb and labelled effects 2a, 2b, 2c, d, f, g, h]
• Additive-Additive Epistasis: b ≠ c
• Additive-Dominant Epistasis: f ≠ g
• Dominant-Dominant Epistasis: h ≠ (d-e-(f-g))/2

Joint effects from multiple loci for quantitative traits
• Let loci have alleles A, a and B, b
• A typical approach (F∞) is to set design matrices
– Then define interactions as additive x additive epistasis x1*x2, additive x dominant interactions x1*z2, etc.

A characteristic of the F∞ model is confounding with epistasis (Kao and Zeng, Genetics 160:1243-1261, 2001)

Preferred model with epistasis for F2 intercross

Estimates from Cockerham model

Implications of using a model with confounding of effects
• Inferences about effects can be biased depending upon the modeling procedure.
• The additive main effect estimate includes a component due to dominance by additive epistasis
• The dominance by additive epistasis estimate includes a component due to additive effects
• If the Type I sums of squares procedure is used, then the main effect estimate is inflated and the epistasis estimate is reduced. If the Type III sums of squares procedure is used, then both effects are reduced.
• The ML method could be used to estimate parameters if the model is correctly specified.

Effects of Scale
• For the quantitative trait just indicated, if additive by additive interaction is noted, it may be possible to change scale to remove this source of epistasis. However, if multiple genetic factors influence the trait, a change of scale may not be sufficient.

Heterogeneity versus Interaction
• In epidemiological studies, we usually treat all subjects as if they are exchangeable – i.e. they are all identically distributed
• In genetics, we often assume that our population reflects a mixture of features, which may be modeled with admixture or heterogeneity parameters
• Admixture/heterogeneity ideas are not well described in the interaction literature

Linkage analysis for multilocus models in humans
• For ‘independence’ models, in which the joint genotype-specific penetrances are products of the marginal genotype-specific penetrances, modeling marginal penetrances yields sufficiently accurate models to permit linkage detection (whether using a parametric or nonparametric approach).
• For ‘additive’ models, in which penetrance is increased by presence of either factor, heterogeneity models are fitted.

Linkage Analysis of Lung cancer, with and without heterogeneity among families

Epistasis modeling
• Linkage analysis is modeled according to a generalization of Risch’s lambda (MLS) score method
• Weights depend upon the assumptions of the model, which can be fitted to multiplicative (independence) models or to more general models (allowing for epistasis)

Joint effects of loci influencing risk for hypertension
From Bell JT et al.
Human Molecular Genetics 2006 15(8):1365-1374

Associating disease with mutations
1) Usual approach for qualitative data: logistic regression (unconditional)
\[ \Pr(y_i = 1) = \frac{1}{1 + e^{-(a + b_1 x_1 + b_2 x_2)}} \quad\text{or}\quad \log\frac{\Pr(y_i = 1)}{\Pr(y_i = 0)} = a + b_1 x_1 + b_2 x_2 \]
Where y is the disease outcome: y = 1 if case, 0 if control
x1 – design matrix with genotype AA = 1, Aa = 0, aa = -1
x2 – genotype AA = 0, Aa = 1, aa = 0
Additive effect if b2 = 0; dominance effects if b2 ≠ 0
If b2 ≠ 0, can then fit
x1 – design matrix with genotype AA = 1, Aa = 1, aa = 0 (A dominant)
x2 – genotype AA = 1, Aa = 0, aa = 0 (A recessive)

Epistasis Modeling Humans
\[ \log\frac{\Pr(y_i = 1)}{\Pr(y_i = 0)} = a + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 \]
• Where x1 and x2 are chosen to reflect the best marginal models (dominant, recessive or additive) from consideration of univariable analyses, and b3 is the coefficient of the interaction term

Changing scale may remove ‘interactions’
• Lung cancer risk and smoking and asbestos: interaction on an additive scale (risk from smoking is 14, from asbestos is 3; the sum is 17, but the observed joint risk is 42)
• Lung cancer risk and smoking and asbestos shows no interaction on a multiplicative scale (14 x 3 = 42)
• What if you add in radon, which has an additive effect on risk? E.g. risk from radon is 2, risk from smoking is 14, risk from radon plus smoking is 16. If someone smokes and has both radon exposure and asbestos exposure, is there any scale on which the effects are additive?

“Curse of Dimensionality”
• For 2 SNPs, there are 9 = 3^2 possible two-locus genotype combinations.
• If the alleles are rare (MAF ≤ 10%), then some cells will be empty
[Figure: 3 x 3 grid of SNP1 genotypes (AA, Aa, aa) by SNP2 genotypes (BB, Bb, bb), with an empty cell marked]

“Curse of Dimensionality”
4 SNPs: 81 possible combinations, with more possible empty cells
[Figure: 3 x 3 grids of SNP1 (AA, Aa, aa) by SNP2 (BB, Bb, bb) genotypes, nested within SNP3 (CC, Cc, cc) and SNP4 (DD, Dd, dd) genotypes, with an empty cell marked]

Tree Models
• Response variable can be
– Simple
  • disease indicator (categorical)
  • IBD sharing (continuous)
  • Number of chromosome breaks (counts)
– Complex
  • Survival object
  • Regression object
  • Multivariate object
• Predictor variables can be categorical, counts or continuous
• Tree models provide some benefit over logistic regression with respect to identifying highest risk groups and not requiring assumptions, but tend to overfit data

Tree Models
• First you “grow” the tree
– Like forward regression
– Only “important” predictors are put in the model
• Control the growth of the tree
– Setting limits on how many predictors to allow in the model
• Then you “prune” the tree
– Like backward regression
– Only “significant” predictors are left in the model

Growing a Classification Tree
[Figure: root node containing 15 Affected (A) and 15 Unaffected (U) subjects; Pr(A) = 0.50, Pr(U) = 0.50]
• Data are recursively partitioned into increasingly homogeneous subgroups
• Partitions of the data are ‘branched out’ through binary splits

All Possible Binary Splits
• Male vs Female
• Genotypes: BB vs Bb & bb; BB & Bb vs bb; BB & bb vs Bb
• DNA repair capacity (measure of risk of cancer), range [6.26, 8.96]: ≤6.265 vs >6.265; ≤6.275 vs >6.275

[Figure: split of the root node (15 Affected, 15 Unaffected; Pr(A) = 0.50, Pr(U) = 0.50) by sex. Female: Pr(A) = 0.50, Pr(U) = 0.50. Male: Pr(A) = 0.50, Pr(U) = 0.50]

Split of the root node (15 Affected, 15 Unaffected) on Family History Of
Cancer versus No Family History Of Cancer. The root node has Pr(A) = 0.50, Pr(U) = 0.50; the two daughter nodes have Pr(A) = 10/15 = 0.667, Pr(U) = 5/15 = 0.333 and Pr(A) = 5/15 = 0.333, Pr(U) = 10/15 = 0.667.

Purity-Impurity of a Node
[Figure: three example nodes. IMPURE: Pr(A) = 0.50, Pr(U) = 0.50. Intermediate: Pr(A) = 0.667, Pr(U) = 0.333. PURE: Pr(A) = 1.0, Pr(U) = 0.0]

Choosing splits
• Different measures are used
– For the ith group, let Prob(Yi = 1) = Ci, and let wi be the proportion of the sample in a given node
– Entropy measure: $\sum_i w_i \{ -C_i \log_2 C_i - (1 - C_i) \log_2 (1 - C_i) \}$
– Gini index: $\sum_i w_i C_i (1 - C_i)$
– Bayesian (misclassification rate based on sample): $\sum_i w_i \min\{ C_i, 1 - C_i \}$

Measuring Purity-Impurity of a Node
[Figure: Bayes, Gini and entropy impurity curves plotted over the interval 0 to 1]

Goodness of a Split
$IS = p - \Pr(A_L)\,L - \Pr(A_R)\,R$
where p is the entropy of the parent node, Pr(A_L) is the proportion affected in the left daughter node, L is the entropy of the left daughter node, Pr(A_R) is the proportion affected in the right daughter node, and R is the entropy of the right daughter node.

Split on sex (Female vs Male): P = 0.69, L = 0.69, R = 0.69
IS1 = 0.69 - 0.50*0.69 - 0.50*0.69 = 0

Split on family history (Family History Of Cancer vs No Family History Of Cancer): P = 0.69, L = 0.64, R = 0.64
IS2 = 0.69 - 0.667*0.64 - 0.333*0.64 = 0.05
(The entropy values in these two examples are in natural-log units: 0.69 = ln 2.)

Choice of Best Split
Variable          Goodness of Split
Sex               0.00
Family History    0.05

Stopping the Growth of a Tree
• Minimum size of a node to split
• Minimum size of a daughter node after a split
• Misclassification cost: no more splits if no gain

Pruning a Tree
• Minimum Error: Prune off branches such that the subtree has minimum CV error
• 1-SE Rule: Prune off branches such that the subtree has CV error not exceeding $\hat{R} + SE$
• Alternative Pruning Rules

Alternative Pruning Rule
• Tree is allowed
to overgrow
• At each node, an OR is calculated and tested: Ho: OR = 1.0 versus Ha: OR > 1.0.
[Figure: Parent Node (NA Affected, NU Unaffected) split into Daughter Node 1 (Vi ≤ k; NA1 Affected, NU1 Unaffected) and Daughter Node 2 (Vi > k; NA2 Affected, NU2 Unaffected)]
          Vi ≤ k    Vi > k
Aff       NA1       NA2
UnAff     NU1       NU2
$OR = \frac{N_{A1} N_{U2}}{N_{A2} N_{U1}}$

Alternative Pruning Rule
• Under Ho, the natural log of the odds ratio, ln(OR), approximately follows a normal distribution with a mean of ln(1) = 0
• At each node, we can calculate a standard normal variate given by $Z = \ln(OR) / SE$

Overgrown Tree
Prune if max Z < Z.01 = 2.32
[Figure: overgrown tree annotated with OR, Z and max Z at each node (max Z ranges from 0.10 to 8.23), with a pruned branch marked]

Application of Tree Models
• Data Exploration
• Subgroup Selection
– stratification
– most informative group
– risk groups
• Variable Selection
• Identify Interactions among Covariates
• Model Assessment
• Parsimony
• Conversion of Continuous Covariates to Categorical
• Include Individuals with Missing Data in Modeling

Lung Cancer Risk
[Figure: classification tree for lung cancer risk. Root node 1816/1651*. Split on self-reported emphysema diagnosis into 1661/1301 and 155/350 (OR = 2.91, 2.37-3.58). The 1661/1301 node splits on asbestos exposure into 1186/795 and 475/506. The 1186/795 node splits on Repair ≥ 11.46 vs Repair < 11.46 into 131/43 and 1055/752 (OR = 2.17, 1.52-3.10). The 475/506 node splits on self-reported hay fever into 101/69 and 374/437; OR = 1.69 (1.21-2.37) is annotated in this branch. The 155/350 node splits on Repair < 4.59, 4.59 ≤ Repair < 12.94 and Repair ≥ 12.94 into 12/10, 143/321 (OR = 3.14, 1.29-7.68) and 0/19 (OR = 636.1, 29.33-Infinity ‡)]
Node Legend
Green nodes are reference
* No. of controls/no.
of cases
† ORs adjusted for age, sex, and ethnicity
‡ OR and 95% CI from exact logistic regression

Lung Cancer Risk (continued)
[Figure: the same tree with an additional split on DNA repair capacity (Repair ≥ 12.94, 4.59 ≤ Repair < 12.94, Repair < 4.59), annotated OR = 1.18 (0.98-1.44) and OR = 1.18 (1.01-1.37)]

Random Forests
1. Building Phase
– Select m < M variables at random [default in RF software: m = sqrt(M)]
– Select n < N of the subjects at random
2. Grow tree completely; do not prune.

Random Forests
3. Testing Phase:
• Use the data not selected to build the tree (out-of-bag, “OOB”, data) to test the tree
– Take j to be the class that got the most votes every time the case was OOB
– OOB error estimate = the proportion of times j is not equal to the true class of the case, averaged over all cases

Variable selection in Random Forests model with 782 participants and 36 variables
[Figure: variable importance plot; Pack Years is labelled]

Scree plot of the Random Forests model with 782 participants and 36 variables, including 8 pseudo-variables of pack years, showing that random forests identify important but collinear variables.
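
The building and testing phases above can be sketched in a few lines of Python. This is a minimal illustration, not the RF software itself: the function names `draw_bag` and `oob_error`, the input layout, and the toy votes are all assumptions made here, and subjects are drawn without replacement only because the slide says n < N.

```python
import math
import random
from collections import Counter


def draw_bag(n_variables, n_subjects, n_draw, rng):
    """Building phase (sketch): pick m = sqrt(M) variables and n < N subjects.

    Returns the chosen variable indices, the in-bag subject indices, and the
    out-of-bag (OOB) subject indices left over for testing the tree.
    Sampling without replacement is an assumption of this sketch.
    """
    m = max(1, int(math.sqrt(n_variables)))
    variables = rng.sample(range(n_variables), m)
    in_bag = set(rng.sample(range(n_subjects), n_draw))
    oob = [i for i in range(n_subjects) if i not in in_bag]
    return variables, sorted(in_bag), oob


def oob_error(oob_votes, true_classes):
    """Testing phase (sketch): share of cases whose OOB majority vote is wrong.

    oob_votes[i] lists the class predicted for case i by every tree for
    which case i was out-of-bag (hypothetical input layout).
    """
    wrong = 0
    scored = 0
    for votes, truth in zip(oob_votes, true_classes):
        if not votes:  # case was never OOB, so it cannot be scored
            continue
        j = Counter(votes).most_common(1)[0][0]  # class with the most votes
        scored += 1
        wrong += (j != truth)
    if scored == 0:
        raise ValueError("no case was ever out-of-bag")
    return wrong / scored


# Toy example: classes "A" (affected) and "U" (unaffected).
votes = [["A", "A", "U"], ["U", "U"], ["A", "U", "U"], []]
truth = ["A", "U", "A", "U"]
print(oob_error(votes, truth))
```

In the toy example the third case is outvoted and the fourth is never out-of-bag, so one of the three scorable cases is misclassified.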
ROC curve of Random Forests model with 7 variables and 470 participants

Multifactor Dimensionality Reduction (MDR)
• Uses a combinatorial partitioning approach.
• It is a non-parametric and genetic-model-free alternative to logistic regression.
• In this procedure, the multilocus genotypes are pooled into high-risk and low-risk groups. As a result, the genotype predictors are reduced from n dimensions to one dimension.

1. The data are divided into a training set and a testing set
2. A group of factors is selected from the factor list
3 & 4. In each multifactor cell, the ratio of cases to controls in the training set is calculated; cells are labeled ‘high risk’ or ‘low risk’
5 & 6. The best factor model is picked after all possible combinations of number of factors for a given model are tested on how well cases and controls in the testing set are classified

Concepts behind MDR
• ‘Best’ Model Selection (based on cross-validation)
– Testing accuracy: how likely a model is to generalize to independent data
– CV consistency: the number of times a particular set of attributes is identified across the CV subsets
– Parsimony: if it is unclear which model is best, select the smaller model
• Imputation (separate program)
– Simple imputation to impute missing data

MDR Output
• Dendrogram (on latest version of MDR)
[Figure: dendrogram of X1, X2, X6 and X8; colour scale from red (synergy) through orange, brown and green to blue (redundancy)]

MDR Output
• Graphical Model
[Figure: 3 x 3 grid of X1 (0, 1, 2) by X8 (0, 1, 2) genotype cells, showing counts in each cell]

MDR Output
• If-Then Rules
IF X1 = 0 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 0 AND X8 = 1 THEN CLASSIFY AS 1.
IF X1 = 0 AND X8 = 2 THEN CLASSIFY AS 1.
IF X1 = 1 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 1 AND X8 = 1 THEN CLASSIFY AS 1.
IF X1 = 1 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 2 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 2 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 2 AND X8 = 2 THEN CLASSIFY AS 0.
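
The cell-labeling step behind such if-then rules can be sketched as follows. This is only an illustration under stated assumptions, not the MDR program: the function names, the case:control threshold of 1.0, the choice to classify unseen cells as low risk, and the toy genotype data are all hypothetical.

```python
from collections import defaultdict


def mdr_label_cells(genotypes, status, threshold=1.0):
    """Label each multilocus genotype cell as high risk (1) or low risk (0).

    genotypes: list of tuples, one multilocus genotype per subject.
    status: 1 for case, 0 for control.
    A cell is high risk when cases > threshold * controls (assumed rule).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [cases, controls]
    for g, y in zip(genotypes, status):
        counts[tuple(g)][0 if y == 1 else 1] += 1
    return {cell: int(cases > threshold * controls)
            for cell, (cases, controls) in counts.items()}


def mdr_accuracy(labels, genotypes, status):
    """Share of subjects whose cell label matches their case/control status.

    Cells never seen in training are classified as low risk (assumption).
    """
    hits = sum(labels.get(tuple(g), 0) == y for g, y in zip(genotypes, status))
    return hits / len(status)


# Hypothetical training and testing data for two loci (X1, X8).
train_g = [(0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (0, 1),
           (1, 1), (1, 1), (1, 2), (1, 2)]
train_y = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
test_g = [(0, 0), (0, 1), (1, 1), (2, 2)]
test_y = [0, 1, 0, 1]

labels = mdr_label_cells(train_g, train_y)
for (x1, x8), label in sorted(labels.items()):
    print(f"IF X1 = {x1} AND X8 = {x8} THEN CLASSIFY AS {label}.")
print("testing accuracy:", mdr_accuracy(labels, test_g, test_y))
```

Comparing cases against `threshold * controls` rather than dividing avoids a division by zero in cells that contain no controls.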
Esophageal Cancer
• 210 cases with histologically confirmed disease recruited from UT-MDACC
• Presented with resectable adenocarcinoma or squamous cell carcinoma of the esophagus or gastroesophageal junction
• Patients were treated with preoperative chemoradiation (platinum/5-FU) followed by esophagectomy
• Study endpoints were recurrence and survival

Esophageal Cancer
• 5-FU Pathway
– Folate metabolism pathway.
– MTHFR Glu429Ala, MTR Asp919Gly, TS a+227G, MTHFR Ala222Val, TS C157T
• Cisplatin Pathway
– Genes that potentially modulate cisplatin efficacy and toxicity.
– MDR1 C3435T, MDR1 Ala892Ser, GSTP Ile105Val, GSTP Ala114Val, MPO T-764C, P53 G/A, P53 Pro72Arg, FAS G-670A, FASL C-844T, NQO1 Pro187Ser
• Nucleotide Excision Repair (NER) Pathway
– NER is a major cellular system for repairing platinum-induced DNA adducts.
– XPA (5’ UTR), XPC Lys939Gln, XPD Lys751Gln, XPG1104, ERCC1 3’ UTR, ERCC6 Met1097Val, ERCC6 Arg1230Pro, CCNH Val270Ala, RAD23B Ala249Val

MDR Results
• Recurrence
– Genes involved: CCNH Val270Ala, MDRE21910, TS227
– Testing Accuracy: 66%
– Cross Validation: 100%
– HR (95% CI): 3.66 (1.76-7.62)
• Survival
– No higher order interaction detected

Caveats to MDR: Interpretation of ‘Interaction’
1 Locus Model
IF X1 = 0 THEN CLASSIFY AS 1
IF X1 = 1 THEN CLASSIFY AS 0
IF X1 = 2 THEN CLASSIFY AS 0
2 Locus Model
IF X1 = 0 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 0 AND X8 = 1 THEN CLASSIFY AS 1.
IF X1 = 0 AND X8 = 2 THEN CLASSIFY AS 1.
IF X1 = 1 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 1 AND X8 = 1 THEN CLASSIFY AS 1.
IF X1 = 1 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 2 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 2 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 2 AND X8 = 2 THEN CLASSIFY AS 0.

Caveats to MDR
• Interpretation of ‘Interaction’
3 Locus Model
IF X1 = 0 AND X6 = 0 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 0 AND X6 = 0 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 0 AND X6 = 0 AND X8 = 2 THEN CLASSIFY AS 1.
IF X1 = 0 AND X6 = 1 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 0 AND X6 = 1 AND X8 = 1 THEN CLASSIFY AS 1.
IF X1 = 0 AND X6 = 1 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 0 AND X6 = 2 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 0 AND X6 = 2 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 0 AND X6 = 2 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 0 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 0 AND X8 = 1 THEN CLASSIFY AS 1.
IF X1 = 1 AND X6 = 0 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 1 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 1 AND X6 = 1 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 1 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 2 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 2 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 1 AND X6 = 2 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 0 AND X8 = 0 THEN CLASSIFY AS 1.
IF X1 = 2 AND X6 = 0 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 0 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 1 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 1 AND X8 = 1 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 1 AND X8 = 2 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 2 AND X8 = 0 THEN CLASSIFY AS 0.
IF X1 = 2 AND X6 = 2 AND X8 = 2 THEN CLASSIFY AS 0.

Models for Gene-Environment Interaction

Definitions
• Reaction norm: function relating the mean phenotypic response of a genotype to change in the environment
• Labile trait: a phenotype that can react to the environment
• Nonlabile trait: a phenotype that, during development, becomes fixed
• Both can display gene x environment interactions, but for labile traits g x e interactions can be measured by using the same individuals repeatedly in different environments
• For nonlabile traits, different individuals must be studied in different environments
• Phenotypic stability: phenotype is similar in different environments
• Phenotypic plasticity: phenotype varies across environments

Effects of Gene-Environment Interactions
• Falconer suggests treating the phenotype in 2 different environments as 2 different traits.
• Then, since (when) the genetic background is the same for both environments, the correlation of the phenotype in the two environments is 1 for a model with no interaction, and < 1 if there is interaction
Assumptions:
• either work with a labile trait and measure the same individuals in 2 environments
• or work with inbred lines so that all animals are genetically identical if different animals are measured in different environments
• or assume only a finite number of measurable genotypes affect the trait and these are measured

G x E interactions for qualitative traits/diseases
• Usually analyzed in the form of case/control data, where G is a measured genotype and E is some environmental exposure
• Set up two cross classification tables

• Tools for sample size estimation to detect interactions
– Power program – developed by Garcia-Closas and Lubin http://dceg.cancer.gov/POWER/
– Quanto – developed by James Gauderman http://hydra.usc.edu/gxe/

Results from running Power
Sample sizes to detect a 2-fold interaction effect; main effects (marginal effect is 2.4); exposure frequency is 50%
For each exposure, equal number of cases and controls
NB: for alleles, sample sizes have to be doubled

Summaries
• Methods for gene-gene and gene-environment interaction analysis are widely available
• Results may depend subtly on (unstated) assumptions
• Epistasis is likely to be a key feature in disease development and expression of many traits
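
As a worked illustration of the two cross classification tables used for case/control G x E data, the sketch below computes the exposure odds ratio within each genotype stratum and takes their ratio as a multiplicative-scale interaction. The table layout, function names and counts are hypothetical, chosen only so that the interaction is 2-fold; they are not results from the talk or from the Power or Quanto programs.

```python
def odds_ratio(exposed_cases, exposed_controls,
               unexposed_cases, unexposed_controls):
    """Exposure odds ratio from one 2 x 2 cross classification table."""
    return ((exposed_cases * unexposed_controls)
            / (exposed_controls * unexposed_cases))


def interaction_or(table_g0, table_g1):
    """Ratio of the exposure OR in carriers (G = 1) to non-carriers (G = 0).

    Each table is (exposed cases, exposed controls,
                   unexposed cases, unexposed controls).
    A value of 1 means no interaction on the multiplicative scale.
    """
    return odds_ratio(*table_g1) / odds_ratio(*table_g0)


# Hypothetical counts, one table per genotype stratum.
table_g0 = (40, 20, 50, 50)  # non-carriers
table_g1 = (60, 15, 25, 25)  # carriers

print("OR for E among non-carriers:", odds_ratio(*table_g0))
print("OR for E among carriers:    ", odds_ratio(*table_g1))
print("interaction OR:             ", interaction_or(table_g0, table_g1))
```

Because the interaction is defined here as a ratio of odds ratios, a value of 1 corresponds to no departure from a multiplicative model; a departure from additivity would need a different contrast.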