Regulatory variation and its functional consequences
Chris Cotsapas
[email protected]

Motivating questions
• How do phenotypes vary across individuals?
  – Regulatory changes drive cellular and organismal traits
  – Likely also drive evolutionary differences
• How are genes (co)regulated?
  – Pathways, processes, contexts

Regulatory variation
• What do “interesting” variants do?
• Genetic changes to:
  – Coding sequence **
  – Gene expression levels
  – Splice isomer levels
  – Methylation patterns
  – Chromatin accessibility
  – Transcription factor binding kinetics
  – Cell signaling
  – Protein-protein interactions
~88% of GWAS hits are regulatory

Genetic variation alters regulation
• Protein levels
  – Maize (Damerval 94)
• Expression levels
  – Yeast, maize, mouse, humans (Brem 02, Schadt 03, Stranger 05, Stranger 07)
• RNA splicing
  – Humans (Pickrell 12, Lappalainen 13)
• Methylation and DNase I peak strength
  – Humans (Degner 12; Gibbs 12)

Genetics of gene expression (eQTL)
• cis-eQTL
  – The position of the eQTL maps near the physical position of the gene.
  – Promoter polymorphism?
  – Insertion/Deletion?
  – Methylation, chromatin conformation?
• trans-eQTL
  – The position of the eQTL does not map near the physical position of the gene.
  – Regulator?
  – Direct or indirect?
Modified from Cheung and Spielman 2009 Nat Gen

Cis-eQTL analysis: test SNPs within a pre-defined distance of gene
[Figure: a gene and its probe, with SNPs tested in a window extending 1 Mb on either side]

QT association
• Analysis of the relationship between a dependent or outcome variable (phenotype) and one or more independent or predictor variables (SNP genotype)
Linear regression equation:
  $Y_i = b_0 + b_1 X_i + e_i$
Logistic regression equation:
  $\ln\left(\frac{p_i}{1 - p_i}\right) = b_0 + b_1 X_i + e_i$
[Figure: continuous trait value against number of A1 alleles (0, 1, 2), with a fitted line of intercept b0 and slope b1]

eQTL analysis: a GWAS for every gene
[Figure: one association scan each for gene 1, gene 2, gene 3, gene 4, gene 5, ..., gene N]

cis-eQTLs are rather common
Nica et al PLoS Genet 2011

Cis-eQTLs cluster around TSS
Stranger et al PLoS Genet 2012

trans hotspots (yeast)
Brem et al Science 2002
Yvert et al Nat Genet 2003

Candidate genes, perturbations underlying organismal phenotypes
DOES REGULATORY VARIATION ALTER PHENOTYPE?
APPLICATION TO GWAS

Rationale
• How do disease/trait variants actually alter biology?
• If they change regulation, then:
  – Change in gene expression/isoform use
  – Phenotypic consequence*
Risk variant → Molecular trait → Cellular trait → Disease risk

RTC and CPSM in the CD58 Locus
[Figure: eQTL and GWAS association signals across the CD58 locus (physical position, 116.5 to 117.5 Mbp), with RTC and CPSM results for markers M1 to M5 under Trait #1 and Trait #2]

The PAINTOR model
• If a SNP is causal, then r2 should predict association of other SNPs in the area:
• Correlations between test statistics Z are approximated by a multivariate normal (MVN) given local pairwise LD structure.
Parameters
  λ: standardized effect size
  Z: association statistic
  C: indicator of causality
  m: SNP considered
Kichaev et al. PLoS Genet. 2014

[Figure: schematic of Trait #1 and Trait #2 across markers M1 to M5, contrasting eQTL and GWAS signals that are Distinct with signals that are Shared]

Disease   Known*   Densely       With any eQTL***             Explained by a shared eQTL variant****
                   genotyped**   CD4+   CD14+   LCL   Total   CD4+   CD14+   LCL   Total
IBD       110      69            69     69      68    69      6      7       1     11
CD        30       19            18     18      18    18      2      1       0     3
UC        23       10            10     9       10    10      2      1       3     4
MS        93       55            54     55      55    56      8      3       3     10
T1D       44       40            39     40      36    40      2      0       2     4
CEL       38       34            34     34      34    34      3      1       0     4
RA        44       34            34     34      34    34      1      0       0     1

(Known and Densely genotyped are counts of disease loci.)
* Excluding conditional hits
** Defined by immunochip's densely genotyped fine-mapping intervals. Excluding MHC
*** Loosely identified by cis-eQTL signal within +/-100kb from index SNPs. cis-eQTL is defined by the association p-value < 0.05.
**** FDR < 0.05
The numbers are all based on unique loci.
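
The cis-eQTL test behind the "QT association" slide and the "With any eQTL" columns can be sketched as a simple regression of expression on allele count, restricted to SNPs inside a window around the gene. This is only an illustrative sketch: the function names, the toy data, and the omission of covariates such as sex are assumptions, not the analysis used in the slides. A p-value would come from comparing the returned t statistic to a t distribution with n - 2 degrees of freedom.

```python
import math


def in_cis_window(snp_pos, gene_pos, window=1_000_000):
    """True if the SNP lies within `window` bp of the gene position (hypothetical helper)."""
    return abs(snp_pos - gene_pos) <= window


def eqtl_regression(allele_counts, expression):
    """Fit Y_i = b0 + b1 * X_i + e_i by least squares.

    allele_counts: number of A1 alleles per individual (0, 1 or 2)
    expression:    expression level of one gene per individual
    Returns (b0, b1, t), where t is the t statistic for the slope b1.
    """
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_counts, expression))
    b1 = sxy / sxx                      # slope: expression change per allele
    b0 = mean_y - b1 * mean_x           # intercept
    rss = sum((y - (b0 + b1 * x)) ** 2
              for x, y in zip(allele_counts, expression))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    t = b1 / se if se > 0 else math.inf
    return b0, b1, t


# Toy data (made up for illustration only).
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
if in_cis_window(snp_pos=1_500_000, gene_pos=1_000_000):
    b0, b1, t = eqtl_regression(genotypes, expression)
    print(f"intercept={b0:.3f} slope={b1:.3f} t={t:.2f}")
```

Running one such regression per SNP-gene pair is what makes eQTL mapping "a GWAS for every gene".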
[Figure: regional association plots overlaying eQTL and GWAS signals at two loci (position axes spanning 45.60 to 45.75 and 7.7 to 8.3), each paired with a scatterplot of eQTL against GWAS association strength]
Shared
Tissues: LCL, CD14+, CD4+

MS GWAS                   eQTL association       LD (r2)   GWAS-eQTL joint likelihood result
SNP           p-value     Gene       p-value               Empirical P   Bonferroni
rs35967351    4.79E-08    SLAMF7     5.42E-08    0.94      0             7E-03
rs9989735     2.22E-16    SP140      1.29E-09    0.88      0             7E-03
rs1021156     4.20E-11    ZC2HC1A    3.35E-30    0.99      0             7E-03
rs1021156     4.20E-11    PKIA       1.12E-09    0.99      1E-05         7E-03
rs1359062     6.09E-13    RGS1       1.61E-21    0.95      0             7E-03
rs1021156     4.20E-11    ZC2HC1A    3.38E-40    1.00      0             7E-03
rs201202118   3.01E-16    METTL21B   1.95E-21    0.99      0             7E-03
rs35967351    4.79E-08    NHLH1      8.22E-05    0.99      0             7E-03
rs71624119    6.56E-10    ANKRD55    1.99E-10    1.00      3E-05         2E-02
rs917116      2.04E-09    JAZF1      6.16E-16    0.94      0             7E-03
rs60600003    3.22E-11    ELMO1      1.17E-08    0.81      1E-05         7E-03
rs1021156     4.20E-11    ZC2HC1A    4.51E-12    0.98      0             7E-03
rs201202118   3.01E-16    METTL21B   8.84E-21    1.00      0             7E-03
rs12946510    6.63E-05    GSDMB      4.15E-17    0.79      0             7E-03
rs12946510    6.63E-05    ORMDL3     5.65E-13    0.88      0             7E-03

Open question
DOES REGVAR REVEAL CO-REGULATION?
A.K.A. WHERE ARE THE TRANS eQTLS?

Whole-genome eQTL analysis is an independent GWAS for expression of each gene
[Figure: one association scan each for gene 1, gene 2, gene 3, gene 4, gene 5, ..., gene N]

Issues with trans mapping
• Power
  – Genome-wide significance is 5e-8
  – Multiple testing on ~20K genes
  – Sample sizes clearly inadequate
• Data structure
  – Bias corrections deflate variance
  – Non-normal distributions
• Sample sizes
  – Far too small

But…
• Assume that trans eQTLs affect many genes…
• …and you can use cross-trait methods!
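
To make the power problem listed under "Issues with trans mapping" concrete, the arithmetic below combines the two numbers on that slide. Applying a Bonferroni correction across genes is an assumption made here for illustration; the slide only states that there is multiple testing on ~20K genes.

```python
GENOME_WIDE_ALPHA = 5e-8   # genome-wide significance for a single trait (from the slide)
N_GENES = 20_000           # approximate number of expression traits (from the slide)


def per_test_threshold(alpha, n_traits):
    """Bonferroni-style threshold when the same genome scan is repeated per trait."""
    return alpha / n_traits


threshold = per_test_threshold(GENOME_WIDE_ALPHA, N_GENES)
print(f"per-test threshold: {threshold:.1e}")
```

Under that assumption a single SNP-gene test would need a p-value near 2.5e-12, which is why per-gene testing struggles at sample sizes of about a hundred individuals and pooling evidence across many genes is attractive.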
Association data
$\begin{pmatrix} Z_{1,1} & Z_{1,2} & \dots & Z_{1,p} \\ Z_{2,1} & & & \vdots \\ \vdots & & & \\ Z_{s,1} & \dots & & Z_{s,p} \end{pmatrix}$

Cross-phenotype meta-analysis
[Figure: three panels of −log(p) distributions, one labelled λ = 1 and two labelled λ ≠ 1]
$S_{CPMA} \sim \frac{L(\text{data} \mid \lambda \neq 1)}{L(\text{data} \mid \lambda = 1)}$
Cotsapas et al, PLoS Genetics

CPMA for correlated traits
• Empirical assessment to account for correlation
• Simulate Z scores under covariance, recalculate CPMA
• Construct distribution of CPMA for dataset, call significance
with Ben Voight, U Penn

Experimental design
• 610,180 SNPs: MAF > 0.15 in CEU and YRI, LD pruned (r2 < 0.2)
• 8368 transcripts detectable on Illumina arrays
• 108 CEU individuals* and 109 YRI individuals*
• plink, model Transcript ~ SNP, sex, run in each population: CEU p-values and YRI p-values
• CPMA on each set of p-values: CEU CPMA scores and YRI CPMA scores
• > 95%ile sim CPMA
* Stranger et al Nat Genet 2007 (LCL data; publicly available)

Target sets of genes
• trans-acting variant: SNP with CPMA evidence
• Target genes: genes affected by trans-acting variant (i.e. regulon)

Prediction 1
• Allelic effects should be conserved between two populations
  – Binomial test on paired observations for all genes P < 0.05 in at least one population
[Figure: genes with pCEU < 0.05 and genes with pYRI < 0.05; direction of allelic effect (+/-) per gene in CEU, compared with concordant and discordant directions in YRI]
True for 1124/1311 SNPs (binomial p < 0.05)

Prediction 2
• Target genes should overlap
  – Identify by mixture of gaussians classification
  – Empirical p from distribution of overlaps between NCEU and NYRI genes across SNPs.
[Figure: overlap of genes with pCEU < 0.05 and genes with pYRI < 0.05]
True for 600/1311 SNPs (empirical p < 0.05)

What about the target genes?
• Regulons:
  – Encode proteins more connected than expected by chance
www.broadinstitute.org/mpg/dapple.php
Rossin et al 2011 PLoS Genetics

What about the target genes?
• Regulons:
  – Encode proteins enriched for TF targets (ENCODE LCL data)
  – 24/67 filtered TFs significant
  – Binomial overlap test
[Figure: overlap of trans target genes with ChIP-seq LCL target genes]

TF       p-value
CEBPB    3.7 x 10^-142
HDAC8    7.8 x 10^-122
FOS      2.5 x 10^-96
JUND     3.7 x 10^-88
NFYB     3.3 x 10^-71
ETS1     3.8 x 10^-63
FAM48A   2.1 x 10^-61
FOXA1    1.4 x 10^-33
GATA1    4.6 x 10^-33
HEY1     7.8 x 10^-32

Summary
• Regulatory variation is common
• It affects gene expression levels
• Likely many other types:
  – DNA accessibility, chromatin states
  – Transcript splicing, processing, turnover
• Has phenotypic consequences
  – GWAS
  – Some cellular assays (not discussed here)

Open questions
• Discover regulatory elements (cis)
  – Promoters, enhancers etc
• Gene regulatory circuits (trans)
• Dynamics of regulation
  – Splicing variation, processing, degradation
• Phenotypic consequences
  – Cellular assays required
• Tie in to organismal phenotype

RNAseq, GTEx
NEXT-GEN SEQUENCING DATA

GTEx – Genotype-Tissue EXpression
An NIH Common Fund project
Current: 35 tissues from 50 donors
Scale up: 20K tissues from 900 donors.
Novel methods groups: 5 current + RFA

How can we make RNAseq useful?
• Standard eQTLs
  – Montgomery et al, Pickrell et al Nature 2010
• Isoform eQTLs
  – Depth of sequence!
• Long genes are preferentially sequenced
• Abundant genes/isoforms ditto
• Power!?
• Mapping biases due to SNPs

RNAseq combined with other techs
• Regulons: TF gene sets via ChIP-seq
  – Look for trans effects
• Open chromatin states (DNase I; methylation)
  – Find active genes
  – Changes in epigenetic marks correlated to RNA
  – Genetic effects
• RNA/DNA comparisons
  – Simultaneous SNP detection/genotyping
  – RNA editing ???
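
The cross-phenotype meta-analysis statistic introduced earlier compares the likelihood of one SNP's p-values across many traits under λ ≠ 1 with the likelihood under λ = 1. A minimal sketch follows, assuming that −ln(p) is modelled as exponential with rate λ (λ = 1 when p-values are uniform) and that λ is estimated by maximum likelihood. The function name and the chi-square (1 df) reference distribution are assumptions of this sketch, and the chi-square p-value presumes independent traits. For correlated transcripts the slides instead call significance empirically, by simulating Z scores under the observed covariance.

```python
import math


def cpma_statistic(pvalues):
    """Likelihood-ratio statistic for one SNP across many traits (illustrative sketch).

    pvalues: association p-values for the same SNP, one per trait, each in (0, 1].
    Returns (statistic, lambda_hat, chi2_pvalue).
    """
    x = [-math.log(p) for p in pvalues]   # -ln(p) is Exp(1) when p is uniform
    n = len(x)
    total = sum(x)
    if total == 0:
        raise ValueError("all p-values equal 1; rate cannot be estimated")
    lam_hat = n / total                   # maximum-likelihood rate of the exponential
    loglik_alt = n * math.log(lam_hat) - lam_hat * total   # log L(data | lambda_hat)
    loglik_null = -total                                    # log L(data | lambda = 1)
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # Survival function of a chi-square with 1 df, via the complementary error function.
    chi2_p = math.erfc(math.sqrt(stat / 2.0))
    return stat, lam_hat, chi2_p


# Toy inputs (made up): p-values consistent with the null, then strongly skewed ones.
null_like = [math.exp(-1.0)] * 10
skewed = [1e-4] * 10
print(cpma_statistic(null_like))
print(cpma_statistic(skewed))
```

A fitted rate well below 1 means the SNP shows more small p-values across transcripts than chance would give, which is the signature expected of a trans-acting variant that affects many genes.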