
Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mapping
Fei Zou
Department of Biostatistics
Email: [email protected]

Outline
• Introduction
  – Experimental crosses
  – Existing QTL mapping methods
• Bayesian semi-parametric QTL mapping
• Results
• Remarks and Conclusions

http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt

Overview
• One gene one trait: very unlikely
• The vast majority of biological traits are caused by complex polygenes
  – Potentially interacting with each other
• Most traits have significant environmental exposure components
  – Potentially interacting with polygenes

Experimental Crosses
[Figure: F2 and backcross (BC) crossing diagrams with parental lines P1 (AA) and P2 (BB), F1 (AB), F2 genotypes AA, AB, BB, and backcross genotypes AA, AB]

QTL Data Format
• Genotype coding: 0 = homozygous AA, 1 = heterozygous AB, 2 = homozygous BB
• Marker positions: [data table from the slide not recovered]

Linkage Analysis
• Data structure:
  – Marker data (genotypes plus positions)
  – Phenotypic trait(s)
  – Other nongenetic covariates, such as age, gender, environmental conditions, etc.
• Quantitative trait locus (QTL): a particular region of the genome containing one or more genes that are associated with the trait being assayed or measured

QTL Mapping of Experimental Crosses
• Single QTL mapping
  – Single marker analysis (Sax, 1923 Genetics)
  – Interval mapping: Lander & Botstein (1989, Genetics)
• Multiple QTL mapping
  – Composite interval mapping (Zeng 1993 PNAS, 1994 Genetics; Jansen & Stam, 1994 Genetics)
  – Multiple interval mapping (Kao et al., 1999 Genetics)
  – Bayesian analysis (Satagopan et al., 1997 Genetics)

Single QTL Interval Mapping
• For a backcross, the model assumes
  $y_i \mid (\text{QTL genotype is } AA) \sim N(\mu_{AA}, \sigma^2)$
  $y_i \mid (\text{QTL genotype is } Aa) \sim N(\mu_{Aa}, \sigma^2)$
• QTL analysis: $H_0: \mu_{AA} = \mu_{Aa}$ vs $H_A: \mu_{AA} \neq \mu_{Aa}$
• If QTL genotypes are observed, the analysis is trivial: simple t-test!
• However, the QTL position is unknown and therefore QTL genotypes are unobserved

Interval Mapping
• For a QTL between markers
  – QTL genotypes are missing: marker genotypes can be used to infer the conditional probabilities of the QTL genotypes for a given QTL position
  – The profile likelihood (LOD score) is calculated across the whole genome or candidate regions using the EM algorithm
  – In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the position with the highest LOD score

[Figure: profile LOD. LOD score (y-axis, 0 to 8) against chromosome (x-axis, 1 to 19 and X)]

Multiple QTL Mapping
• Most complicated traits are caused by multiple (potentially interacting) genes, which also interact with environmental stimuli
  – Single QTL interval mapping
    • Ghost QTL (Lander & Botstein 1989)
    • Low power

Multiple QTL Mapping
• Composite interval mapping (Zeng 1993, 1994; Jansen & Stam 1993): searches for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust for the effects of other QTLs outside the region
  – Which background markers to include; window size, etc.
• Multiple interval mapping (Kao et al. 1999): fits multiple QTLs simultaneously
  – Computationally intensive; how many QTLs to include?

Multiple QTL Mapping
• Bayesian methods (Stephens and Fisch 1998 Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetical Research; Yi et al. 2003 Genetics): treat the number of QTLs as a parameter by using the reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika)
  – The model dimension changes between moves, and the acceptance probability for such dimension changes may, in practice, not be handled correctly (Ven 2004 Genetics)

Multiple QTL Mapping
• Alternatively, multiple QTL mapping can be viewed as a variable selection problem
  – Forward and step-wise selection procedures (Broman and Speed 2002 JRSSB)
  – LASSO, etc.
  – Bayesian QTL mapping
    • Xu (2003 Genetics), Wang et al. (2005 Genetics), Huang et al. (2007 Genetics): Bayesian shrinkage
    • Yi et al. (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA)
    • Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat.)
    • Software: R/qtlbim by Yi's group

Multiple QTL Mapping
• Limitations of existing QTL mapping methods
  – They do not model covariates at all, or model covariate effects only linearly
  – They do not model interactions at all, or model only lower-order interactions, such as two-way interactions
• Multiple QTL mapping is a very large variable selection problem: for $p$ potential genes, with $p$ in the hundreds or thousands, there are $2^p$ possible main-effect models, $2^{\binom{p}{2}}$ possible sets of two-way interactions, and $2^{\binom{p}{k}}$ possible sets of higher-order ($k > 2$) interactions.

Semiparametric Multiple (Potentially Interacting) QTL Mapping
• Goal: map multiple potentially interacting QTLs without explicitly modeling all potential main and higher-order interaction effects
• Semiparametric model:
  $y_i = \eta(x_{i1}, x_{i2}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq}) + e_i, \quad i = 1, \ldots, n,$
  with $e_i \overset{iid}{\sim} N(0, \sigma^2)$, where the function $\eta$ is unspecified, $x_{i1}, x_{i2}, \ldots, x_{ip}$ are QTL genotypes, and $t_{i1}, \ldots, t_{iq}$ represent all non-genetic factors/covariates.
• When $\eta$ equals $x_{i1} x_{i2} + x_{i3} t_{i1}$: the model implicitly captures the two-way interaction between genes 1 and 2 and the gene-environment interaction between gene 3 and covariate 1.
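
As a rough illustration of the semiparametric model on this slide, the sketch below simulates backcross data in which the unspecified regression function (written η here) is the slide's example, the product of genes 1 and 2 plus the product of gene 3 and covariate 1. Everything beyond that example is an assumption made for illustration: the function name `simulate_semiparametric`, the 0/1 genotype coding, independent loci (no linkage), a standard normal covariate, and the default sample sizes.

```python
import numpy as np

def simulate_semiparametric(n=200, p=10, q=1, sigma=1.0, seed=0):
    """Simulate y_i = eta(x_i, t_i) + e_i with e_i ~ N(0, sigma^2).

    Hypothetical helper for illustration only; not the authors' code.
    """
    rng = np.random.default_rng(seed)
    # Backcross genotypes coded 0 (AA) or 1 (AB). Loci are drawn
    # independently, which ignores linkage (a simplifying assumption).
    x = rng.integers(0, 2, size=(n, p)).astype(float)
    # Non-genetic covariates, assumed standard normal here.
    t = rng.normal(size=(n, q))
    # Example eta from the slide: gene1 x gene2 interaction plus
    # gene3 x covariate1 interaction, with no main effects.
    eta = x[:, 0] * x[:, 1] + x[:, 2] * t[:, 0]
    y = eta + rng.normal(scale=sigma, size=n)
    return x, t, eta, y

x, t, eta, y = simulate_semiparametric()
```

In data generated this way, a method that only screens marginal main effects has less signal to work with for genes 1 to 3 than one that treats η as an unknown function, which is the motivation given on the slide.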
Bayesian Semi/non-parametric Methods
• Dirichlet process (Muller et al. 1996)
• Splines (Smith and Kohn 1996; Denison et al. 1998; DiMatteo et al. 2001)
• Wavelets (Abramovich et al. 1998 JRSSB)
• Kernel models (Liang et al. 2007)
• Gaussian process (Neal 1996, 1997)
  – Gaussian process priors have large support in the space of all smooth functions through an appropriate choice of covariance kernel
  – Gaussian processes are flexible for curve estimation because of their flexible sample path shapes
  – Gaussian processes are related to smoothing splines (Wahba 1978 JRSSB)

Prior Specification on $\eta$
• A Gaussian process such that all possible finite-dimensional distributions $(\eta_1, \ldots, \eta_n)^T$ follow a multivariate normal with mean 0 and covariance function
  $\operatorname{cov}(\eta_i, \eta_{i'}) = \frac{1}{\tau} \exp\left[-\sum_{k=1}^{p} \rho_{xk} (x_{ik} - x_{i'k})^2 - \sum_{j=1}^{q} \rho_{tj} (t_{ij} - t_{i'j})^2\right]$
  where $\tau$, the $\rho_{xk}$'s and the $\rho_{tj}$'s are hyperparameters and $\eta_i = \eta(x_{i1}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq})$
• Hyperparameter $\tau$ defines the vertical scale of variations, i.e., it controls the magnitude of the exponential part. Hyperparameters $\rho_{xk}$ ($\rho_{tj}$) are related to length scales $1/\rho_{xk}$, which characterize the distance in that particular direction over which $y$ is expected to vary significantly
  – $\tau$ controls the smoothness of $\eta$: when $\tau \to 0$ the posterior mean of $\eta$ almost interpolates the data, while it is centered around the prior mean function if $\tau \to \infty$
  – When $\rho_{xk} = 0$, $y$ is expected to be an essentially constant function of that input variable $x_k$, which is therefore deemed irrelevant (Mackay 1998)

Priors on the $\rho$'s
• The original papers on the Gaussian process (Mackay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on the $\rho$ parameters. However, $\rho$ does provide information about the relevance of any QTL, with a value near zero indicating an irrelevant QTL.
• For variable selection purposes, we can impose Gamma mixture priors on $1/\rho_{xk}$

Prior Specifications
• Inverse Gamma distributions are used for the priors of $\tau$ and $\sigma^2$.

Simulations
• Setup:
  – Backcross population
  – 200 or 500 individuals
  – 151 evenly spaced markers at 5 cM intervals
  – Four QTLs with varying heritabilities:
    • Main effect model: all four QTLs act additively
    • Main plus two-way interactions
    • Four-way interactions only

[Figure: results for n = 500 and the pure four-way interaction model]
[Figure: results for n = 500 and the pure additive model]

Real Data Analysis
• A mouse study
  – # samples: 187 backcross samples
  – # markers: 85, with average marker distance 20 cM
  – Phenotypes: inguinal, gonadal, retroperitoneal and mesenteric fat pad weights

Remarks
• For studies with a large number of samples and/or markers, MCMC converges very slowly
  – We employed the hybrid Monte Carlo method, which merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation
  – We also estimated the maximum a posteriori (MAP) via the conjugate gradient method (Hestenes et al. 1952 J. Research of National Bureau of Standards)
    • Point estimate

Real Study: Cardiovascular Disease
• 2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease
• 820 individuals
• Non-genetic covariates: gender, smoking status, age

Remarks
• Semiparametric mapping is powerful in mapping multiple QTL (potentially interacting at higher orders)
  – Picks up genes related to the trait regardless of their marginal main effects or joint epistasis effects
  – Cannot readily differentiate genetic contributions
    • Main effect? Interaction? Or both?
    • Fine-tuned parametric model with selected genes

Remarks and Future Research
• How to extend the methodologies to human genome-wide association (GWA) studies, where hundreds of thousands of markers are available
  – Is it possible?
  – Potential solutions: pathway analysis; data reduction techniques
• How to extend the method to human pedigree analysis, where a mixed effect model is used for correlated family members?
  – Use inheritance vector: so far results are very promising

Acknowledgement
• Joint work with
  – Hanwen Huang
  – Haibo Zhou
  – Fuxia Cheng
  – Ina Hoeschele
• Funding support
  – NIH R01 GM074175
