Advanced Twin Workshop 2001
The Causes of Variation
Lindon Eaves and Tim York
Boulder, CO
March 2001

One Issue (Among Many!)
• Identifying genes that cause complex diseases and genes that contribute to variation in quantitative traits

Quantitative Trait Locus (QTL)
Any gene whose contribution to variation in a quantitative trait is large enough to stand out against the background noise of other genetic and environmental factors

Quantitative Trait
A continuously variable trait (in which variation may be caused by multiple genetic and/or environmental factors); any categorical trait in which differences between categories may be mapped onto variation in a continuous trait

Common diseases
• Estimated lifetime risk c. 60%
• Substantial genetic component
• “Non-Mendelian” inheritance
• Non-genetic risk factors
• Multiple interacting pathways
• Most genes still not mapped

Examples
• Ischaemic heart disease (30-50%, F-M)
• Breast cancer (12%, F)
• Colorectal cancer (5%)
• Recurrent major depression (10%)
• ADHD (5%)
• Non-insulin dependent diabetes (5%)
• Essential hypertension (10-25%)

Even for “simple” diseases: Number of alleles is large (Wright et al., 1999)
• Ischaemic heart disease (LDLR) >190
• Breast cancer (BRCA1) >300
• Colorectal cancer (MLH1) >140

Definitions
• Locus: One of c. 30-40,000 genes
• Allele: One of several variants of a specific gene
• Gene: a sequence of DNA that codes for a specific function
• Base pair: chemical “letter” of the genome (a gene has many 1000s of base pairs)
• Genome: all the genes considered together

Finding QTLs
• Linkage
• Association

Linkage
Finds QTLs by correlating phenotypic similarity with genetic similarity (“IBD”) in specific parts of genome

Linkage
• Doesn’t depend on “guessing gene”
• Works over broad regions (good for getting in right ball-park) and whole genome (“genome scan”)
• Only detects large effects (>10%)
• Requires large samples (10,000s?)
• Can’t guarantee close to gene

Association
• Looks for correlation between specific alleles and phenotype (trait value, disease risk)

Association
• More sensitive to small effects
• Need to “guess” gene/alleles (“candidate gene”) or be close enough for linkage disequilibrium with nearby loci
• May get spurious association (“stratification”) – need to have genetic controls to be convinced

“Reality”: For complex disorders and quantitative traits
Large number of alleles at large number of genes

Defining the Haystack
• 3 × 10^9 base pairs
• Markers every 6-10 kb for association in populations with no recent bottleneck history
• 1 SNP per 721 bp (Wang et al., 1998)
• c. 14 SNPs per 10 kb = 1000s of haplotypes/alleles
• O(10^4-10^5) genes

Problems
• Large number of loci and alleles/haplotypes
• Possible interactions between genes
• Possible interactions between genes and environment
• Relatively low frequencies of individual risk factors
• Functional form of genotype-phenotype relations not known
• Sorting out signal from noise – minimizing errors within budget
• Scaling of phenotype (continuous, discontinuous)
• Spurious association (stratification)

Prepare for the worst
Need statistical approaches that can screen enormous numbers of loci and alleles to identify reliably those that have impact on risk to disease

System Chosen for Study
• 100 loci
• 20 loci affect outcome, 80 “nuisance” genes
• 257 alleles/locus
• Allele frequencies c. 20-0.1%
• Disease genes each explain 2.5% variance in risk (c. 2-fold risk increase)
• 40% rarest alleles increase risk
• 50% variance non-genetic

It’s a Mess!
• Don’t know which genes – might have clues
• Don’t know which alleles – unordered categories
• >250^100 locus/allele combinations
• More predictor combinations than people (“curse of dimensionality”)
• Reality worse

Problems
• Informatics: large volume of data
• Computational: large number of combinations
• Statistical: large number of chance associations
• Genetic-epidemiological: secondary associations

How are we going to figure it out?

Data Mining (Steinberg and Cartel)
• Attempt to discover possibly very complex structure in huge databases (large number of records and large number of variables)
• Problems include classification, regression, clustering, association (market analysis)
• Need tools to partially or fully automate the discovery process
• Large databases support search for rare but important patterns and interactions (epistasis, GxE)

Some Approaches to DM
• Logistic regression
• Neural networks
• “CART” (Breiman et al. 1984)
• “MARS” (Friedman, 1991)

“MARS”
• Multivariate Adaptive Regression Splines
• Key references
– Friedman, J.H. (1991) Multivariate Adaptive Regression Splines (with discussion). Annals of Statistics, 19: 1-141.
– Steinberg, D., Bernstein, B., Colla, P., Martin, K., Friedman, J.H. (1999) MARS User Guide. San Diego, CA: Salford Systems.

The MARS Advantage
• Allows large number of predictors (loci/alleles/environments) to be screened
• Non-parametric
• Continuous and discontinuous outcomes
• Systematic search for detailed interactions
• Testing and cross-validation
• Continuous and categorical predictors
• Decides best form of relationship

Example Regression Spline: Impact of Non-Retail Business on Median Boston House Prices
[Figure: median house price (y-axis) against INDUS, industrial business (x-axis, 0 to 28), with the “knot” marked. Curve 1: maximum = 19.08890]
Model for spline:
b1 = max(0, INDUS - 8.140)
b2 = max(0, 8.140 - INDUS)
Y = 20.968 - 0.268 b1 + 1.802 b2

Fitting functions with Splines
• Piece-wise linear regression.
– Simplest form; allows the regression to bend.
• “Knots” define where the function changes behavior.
• Local fit vs. global fit.
[Figure: actual data and spline with 3 knots]

One predictor example
True knots at 20 and 45 (left)
Best single knot at about 35 (right)
[Figure: two panels of Y against X, with X running from 10 to 60]

Re-express variables as basis functions
• Done to generalize the search for knots. Difficult to illustrate splines with more than one dimension.
• Core building block of MARS model
– max(0, X – c)
– example: BF1 = max(0, ENV – 5); BF2 = max(0, ENV – 8)
– slope of the fitted function: 0 for ENV <= 5; β1 for 5 < ENV <= 8; β1 + β2 for ENV > 8
• Weighted sum of basis functions used to approximate the global function.
– i.e. y = constant + β1 * BF1 + β2 * BF2 + error

“Adaptive” Spline
• “Optimal” placement of knots
• “Optimal” selection of predictors and interactions

Adaptive splines
• Problem:
– What is the optimal location of knots?
– How many knots do you need?
– Best to test all variable / knot locations, but computationally burdensome.
• MARS solution:
– Develop an overfit model with too many knots.
– Remove all knots that contribute little to model quality.
– The final model should have approximately correct knot locations.

“Optimal”
• Explains “salient” features of data
• Ignores irrelevant features
• Stands up to replication
• Several ways to operationalize mathematically

MARS 2-step model building
• Step 1. Growing phase:
– begins with only a constant in the model.
– serially adds basis functions up to a user-defined limit, testing each for improvement when added to the model.
– addition of basis functions continues until an overly large model is found (theoretically the true model is captured).
• Step 2. Pruning phase:
– delete the basis function that contributes least to model fit.
– refit the model and delete the next term; repeat.
– the most parsimonious model is selected.
• GCV criterion to select optimal model (Craven 1979).
• MARS option uses 10-fold cross-validation to estimate DF.

Cross-validation
• Protects against overfitting data.
• Develops a model on a subset of data. Tests fit on the remaining set.
• Systematically assesses how many DF to charge each variable entered into model.
– Adding a basis function will always lower MSE.
– This reduction is penalized by DF charged.
• Only backwards deletion step is penalized.

Genetic Example: Regression spline for multi-allelic locus
Probability of disease = 0.037 + 0.114 b1, where:
b1 = 1 if LOCUS1 is any of 30, 37, 39, 43, 44, 46, 66, 73, 76, 78, 79, 80, 83, 87, 90, 95, 103, 106, 111, 113, 114, 116, 118, 128, 129, 133, 134, 139, 146, 147, 148, 170, 177, 179, 182, 183, 185, 192, 202, 208, 209, 214, 215, 218, 219, 222, 223, 226, 229, 230, 231, 232, 235, 236, 237, 240, 241, 242, 244, 253, 254
b1 = 0 otherwise

What happens when nothing is going on?
Including only “nuisance” loci (21-80). N=10,000.
Validation                  Loci Identified
None                        23 25 30 32 35-37 40 47 50 54 55 57 64 68 72 74 76 87 89 91 92 94 96 97
10-fold cross-validation    25

Loci Identified as contributing to variation in outcome
Sample Size   Validation     Loci Identified
1000          None           2 5-8 10-12 14-18 20 24 40 43 45 56 59 70 77 94
1000          Split-sample   7 10 14
1000          10-fold        14
2000          None           2 3 5 6 8-18 20 38 45 47 69 72 80 88 95 100
2000          Split-sample   12 14 20
2000          10-fold        14
5000          None           2-20 29 32 43 55 56 74
5000          Split-sample   10 15 16 20
5000          10-fold        2-19
10000         None           1-20 25 26 94
10000         Split-sample   1-20 25 94
10000         10-fold        1-20

Correct (+) and Incorrect (-) Assignment of Alleles to High- and Low-Risk Groups by MARS Model (N=10,000)
Locus   Low Risk (N=30)   High Risk (N=227)
          +     -           +     -
1        29     1         146    81
2        29     1         145    82
3        29     1         152    75
4        30     0         138    89
5        30     0         142    85
6        28     2         139    88
7        28     2         143    84
8        29     1         148    79
9        27     3         154    73
10       29     1         157    70
11       29     1         155    72
12       29     1         147    80
13       30     0         155    72
14       30     0         149    78
15       29     1         170    57
16       30     0         150    77
17       28     2         151    76
18       28     2         147    80
19       29     1         140    87
20       29     1         146    81

So Far:
Does quite well for largish random samples and continuous outcomes.
– What about disease (dichotomous) outcomes?
– What about selected (extreme) samples?

Generating Dichotomous Outcomes from Continuous Measure
Threshold   Prevalence
21          9.1%
22          4.9%
24          1.0%

Loci Identified by fitting MARS model to dichotomous outcomes (N=10,000)
Columns: Prevalence, No validation, 10-fold cross validation
9.1%   1 2 5 6 8 9 11-17 19
4.9%   1 2 4 5 6 9 10 13-15 17-20 8
1.0%   1 2 5 8 9 10-17 19 56 2 16

Loci cross-validated by MARS model for extremes from sample of 10,000 screened individuals
Upper %   Lower %   Total N   Outcome       Loci Cross-Validated
9.2       11.2      2024      Continuous    1-3 5-10 66 88 75
9.2       11.2      2024      Dichotomous   2 3 5-29 69
4.9       6.3       1116      Continuous    1-3 6-10 12-15 18 20 68
4.9       6.3       1116      Dichotomous   1-4 6-8 10-15 17 19 48

So?
• Can detect signal due to relatively large numbers of relatively rare unordered alleles of relatively small effect at relatively many loci amid the noise of still more loci and environmental effects
• “MARS” may provide elements for analyzing such data in this and similar contexts (?microarrays, SNPs, expression arrays?)
• Works with continuous data on random samples and dichotomous outcomes on selected samples

GAW12 – Simulated data
• Provided for two populations:
– large general pop.
– pop. isolate – founded 20 generations ago by 100 ind.
– limited migration b/w.
• Common disease:
– prevalence of 25%, increases with age
– middle age disease, some early onset
– more common in females than males

• General population
– 7 genes simulated
– 13 to 20 kb
– 12 to 40 diallelic sites at start of simulation
– passed through 120K to 200K generations of random mating:
  • mutation, intragenic recombination, gene conversion – allowed at diff. rates for diff. genes
  • each gene contains a 500 bp recombination hotspot – 15 to 65% of intragenic recombinations
  • 8 to 13 mutational hotspots per gene (6 – 300 x’s)
– 25% of genes isolated for 35 to 85K generations.

                                GENE1          GENE5
Length (kb)                     20             17
Start # of SNP                  40             20
Random mating                   150K           165K
Rec. rate                       .01            .002
Mutation rate                   4 × 10^-8      6 × 10^-9
Gene conv.                      .01            .002
Mean length conv.               1000           1600
Start of rec. hotspot / % in    10349 / 50%    4197 / 65%
# mutat. hotspot                13             8
Incr mut rate                   200            20

• Isolate population
– loosely modeled after pop. history of Old Order Amish in Lancaster Co., PA
– Founders: 200 chr.’s sampled from general pop.
– 20,000 chr.’s sampled from general pop. to create an “outside pop”
– Isolate: children <12, mean 4; Outside: children <12, 1
– migration allowed b/w pop.s at each generation
  • rate: migrants = 5% of current isolate size
– evolution progressed for 20 generations with recombination (no mutations, no intragenic rec.)
– founders were then sampled to create the isolate pop.
• 23 extended pedigrees with 1,497 individuals from each population (1,000 living).
• Pedigrees include the proband, spouse, and all first, second, and third degree relatives of each.
• Living individuals are provided:
– affected status, fid, mid, sex
– age at last exam
– age of onset if affected
– 5 quantitative risk factors
– 2 environmental risk factors (binary and quantitative)
– marker genotype for 1 cM whole genome screen: 2,855 total markers with an average of 9.1 alleles
– sequence data for 7 candidate genes – 1,176 sequence variants
• 50 replicates provided for each pop.

Sequence data
• Isolate and General population
• Intron and exon sequence from 7 candidate genes.
• Kept only those individuals with sequence data. Each set contains 7,000 individuals (64 MB MARS limit).
• 5 sets of 7 randomly selected replicates (used 35 of 50 replicates provided)
• 5 associated quantitative risk factors.
• Covariates included: E1, E2, Age, Sex, Age of onset.
• Affected status binary.
• Exon sequence coded for each individual as having 0, 1, or 2 ancestral variants.
• If intron variant present (whether 1 or 2 copies) given a value of 1. Coded in binary form as haplotypes of length four.

[Figure: path diagram linking Aff Status, Liability, Age of onset, Age, E1, E2, Q1-Q5, MG1-MG6, CG1, CG2 and CG6]

Trait    True Model                      Isolate pop.                                                General pop.
AFF      E1, Q1-Q5, MG6 [557]            E1, Q1-Q5, MG6 [(435 547 548 557) 5244 5268 6912 7281]      E1, Q1-Q5, MG6 [(27 57 76 110)(435 547 548 557)]
Q1       E1, MG1 [5782]                  MG1 [5007]                                                  MG1 [5782]
Q2       E1, MG1 [5782]                  E1, MG1 [5007]                                              E1, MG1 [5782]
Q3       E1, E2                          E1, E2                                                      E1, E2
Q4       E1, AGE                         E1, AGE                                                     E1, AGE
Q5       E1, MG5 [multi-allelic]         E1, MG5 [1289 3745 8657 8817]                               E1, MG5 [1289 3745 8657 8817]
ONSET    MG6 [557]                       none                                                        MG6 [15625]

Conclusions
• MARS works well to capture functional form of disease etiology in simulated data with dichotomous outcome.
• In most cases was within 1 kb of functional variant.
• Generated a predictive model that was replicable in at least 4 of 5 data sets.
• Highly interpretable output in the form of basis functions and Importance values.
• MARS may have problems with highly correlated variables.
• Pattern-recognition tools can be useful to narrow down search for genes.

Comparison of MARS and ANN
Both are non-parametric estimation schemes, allow for a high number of input predictors, allow for interactions, and non-linear mappings.
• MARS: Maximum allowable basis functions and degree of interactions.
  ANN: Type of network architecture needs to be specified.
• MARS: Models are developed fast.
  ANN: Models are trained more slowly (DeVeaux et al. 1993).
• MARS: Backwards elimination stage to remove unnecessary basis functions.
  ANN: Problem of overfitting the data, esp. with small data sets.
• MARS: Easily interpretable basis functions. Local interpretation of the function.
  ANN: Black box; weights have little meaning. Diff. to interpret predictor contribution.
• MARS: Penalizes model complexity. Tries to dev. a low order, interpretable model.
  ANN: Non-linear transformations and high connectivity allow for complexity.

But the Haystack is Very Large
• Reality worse than simulations
• More alleles at more loci
• Phenotypes more complex (multivariate)
• More irrelevant loci (?1000s)
• Interactions with environment and between loci
• Spurious associations

It Needs Collaboration
• Clinical
• Statistical
• Molecular
• Epidemiological
• Physiological
• Developmental
• Informational
• Evolutionary
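
Worked sketch: basis functions and the grow/prune idea
The basis functions max(0, X – c) and the two-step model building described in the MARS slides can be illustrated with a small sketch for a single continuous predictor. This is not the Salford MARS program or the analysis used for the results above. The function names, the candidate knot grid, the GCV charge (one unit per term plus an assumed penalty of 3 per distinct knot) and the synthetic data (true knots at 20 and 45, echoing the one-predictor example) are all assumptions made for illustration. Only the coefficients in boston_spline are taken from the Boston house price slide.

```python
import numpy as np


def hinge(x, knot, sign=1.0):
    """Basis function max(0, x - knot) for sign=+1, or max(0, knot - x) for sign=-1."""
    return np.maximum(0.0, sign * (np.asarray(x, dtype=float) - knot))


def boston_spline(indus):
    """The fitted spline quoted on the Boston house price slide."""
    b1 = hinge(indus, 8.140, +1.0)
    b2 = hinge(indus, 8.140, -1.0)
    return 20.968 - 0.268 * b1 + 1.802 * b2


def rss(x, y, terms):
    """Residual sum of squares for a least-squares fit of a constant plus hinge terms."""
    X = np.column_stack([np.ones(len(x))] + [hinge(x, k, s) for k, s in terms])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def gcv(x, y, terms, penalty=3.0):
    """GCV-style score: mean squared error inflated by an assumed complexity charge."""
    n = len(y)
    knots = {k for k, _ in terms}
    effective = 1 + len(terms) + penalty * len(knots)  # assumed form of the DF charge
    if effective >= n:
        return float("inf")
    return (rss(x, y, terms) / n) / (1.0 - effective / n) ** 2


def grow(x, y, candidates, max_terms):
    """Growing phase: start from a constant and greedily add mirrored hinge pairs."""
    terms = []
    while len(terms) < max_terms:
        unused = [k for k in candidates if all(k != t[0] for t in terms)]
        if not unused:
            break
        best = min(unused, key=lambda k: rss(x, y, terms + [(k, 1.0), (k, -1.0)]))
        terms += [(best, 1.0), (best, -1.0)]
    return terms


def prune(x, y, terms, penalty=3.0):
    """Pruning phase: repeatedly drop the least useful term; keep the best GCV model."""
    current = list(terms)
    best_terms, best_score = list(current), gcv(x, y, current, penalty)
    while current:
        # The least useful term is the one whose removal raises RSS the least.
        i = min(range(len(current)),
                key=lambda j: rss(x, y, current[:j] + current[j + 1:]))
        current = current[:i] + current[i + 1:]
        score = gcv(x, y, current, penalty)
        if score <= best_score:  # ties go to the smaller (more parsimonious) model
            best_terms, best_score = list(current), score
    return best_terms


# Synthetic one-predictor data with true knots at 20 and 45 (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 60.0, 200)
y = 10.0 + 0.5 * hinge(x, 20) - 1.0 * hinge(x, 45) + rng.normal(0.0, 0.5, x.size)

grown = grow(x, y, candidates=range(5, 60, 5), max_terms=6)
final = prune(x, y, grown)
print("grown terms :", grown)
print("pruned terms:", final)
```

Each term is a (knot, sign) pair, so the pruned list can be read directly as a set of basis functions in the style of the slides. A full MARS implementation also searches over predictors, interactions and categorical splits such as the allele groupings in the genetic example, none of which is attempted here.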