Causes of regulatory variation in the human genome
Manolis Dermitzakis
The Wellcome Trust Sanger Institute
Wellcome Trust Genome Campus
Cambridge, UK
[email protected]

Human Genome: ~25,000 genes
1-1.5% of the human DNA is coding
Is the remaining 98.5% “junk”?

Gene expression as a phenotype
• Altered patterns of gene expression can lead to disease.
  – e.g., Type 1 diabetes, Burkitt’s lymphomas.
• Widespread intraspecific variation.
• Heritable genetic variation for transcript levels.
  – Familial aggregation of expression profiles (Cheung et al. 2003).
  – In humans, ~30% of surveyed loci exhibited a genetic component for expression differences (Monks et al. 2004; Schadt et al. 2003).
• Much of the influential variation is located cis- to the coding locus.
  – In humans, mouse, and maize, 35-50% of the genetic basis for intraspecific differences in transcription level is cis- to the coding locus (e.g. Morley et al. 2004; Stranger and Dermitzakis 2006; Schadt et al. 2003; Stranger et al. 2005; Cheung et al. 2005, etc.).

Why study gene expression
• Describe and dissect regulatory variation
• Annotate regulatory elements in the human genome
• Support disease studies to interpret statistical signals
• Distribution of molecular effects in the genome
• Natural selection

Outline
• Gene expression variation – recent studies
• Analysis of gene expression with HapMap phase II SNPs
• Update on CNV-expression associations
• Natural selection and cis regulatory effects

Nature of regulatory variation
[Figure: schematic of a regulatory element (REG) and a gene (GENE) on DNA, with parts i) to iv) labelled DNA, pre-mRNA, mRNA, protein and expression.]
Stranger and Dermitzakis, Human Genomics 2005

Effects of Copy Number Variation on gene expression
[Figure: REG/GENE schematics with four panels: additional gene copy; increase of distance from regulatory element; new regulatory element; gene interruption.]

Gene expression association mapping
[Figure: histogram of expression levels for genotype classes AA, AG and GG; x-axis: expression levels; y-axis: frequency.]
Stranger et al. PLoS Genet 2005

Whole-genome gene expression
• Illumina Human 6 x 2 gene GEX arrays
  – ~48,000 transcripts: 24,000 RefSeq, 24,000 other transcripts
• 270 HapMap individuals
  – CEU: 30 trios, 90 total
  – CHB: 45 unrelated
  – JPT: 45 unrelated
  – YRI: 30 trios, 90 total
• Cell line, then RNA, for each person
  – 2 IVTs for each person (IVT1, IVT2)
  – 2 replicate hybridizations for each IVT (rep1 to rep4)
• Quantile normalization of all replicates of each individual.
• Median normalization across all individuals of a population.

HapMap SNPs
• Unrelated individuals: 60 CEU, 45 CHB, 44 JPT, 60 YRI
• 14,072 genes
• Phase I HapMap; MAF > 0.05; ~1 SNP per 5 kb
  – CEU: 762,447 SNPs
  – CHB: 695,601 SNPs
  – JPT: 689,295 SNPs
  – YRI: 799,242 SNPs

Copy Number Variation dataset
• Genome Structural Variation Consortium
  – Redon et al. Nature Nov 22, 2006
• Array-CGH using a whole genome tile path array
  – Median clone size ~170 kb
  – All 270 HapMap individuals
  – 26,563 clones; 93.7% of the euchromatic genome
• Quantitative values (log2 ratios) representing diploid genome copy number, not genotypes.
• 1117 CNVs called from log2 ratios
  – Calls based on standard deviation of log2 ratios
  – Many CNVs experimentally verified

Linear regression for SNPs; CNV and expression
[Figure: two regression panels. SNPs: expression level against genotype (CC, CT, TT coded 0, 1, 2). CNVs: expression level against clone signal (log2 ratio).]
Reported for each test:
- slope of line
- p-value
- r2

SNP cis-analysis: SNPs within 1 Mb of the probe midpoint (2 Mb window)
CNV cis-analysis: clone midpoint within 2 Mb of the probe midpoint (4 Mb window)

Permutation
[Figure: genotype matrix (g11 ... gin) next to the gene expression values (Exp1 ... Expi), with a "permute" step.]
- 10,000 permutations – each time keep the lowest p-value
- Null distribution of 10,000 extreme p-values
- Compare observed p-values to the tails of the null
Doerge and Churchill 1996

CNV vs. SNP associations
Stranger et al. Science 2007

CNVs and SNPs mostly capture different effects
• Relative impact on gene expression: 82% SNPs, 18% CNVs
• Only 13% of genes with a CNV association also had a SNP association in the same population
  – biased toward large effect size.
  – CNV and SNP variation are highly correlated (p-value 0.001).

Custom vs. Genome-wide [Stranger et al. 2005 PLoS Genet and Stranger et al. 2007 Science]
• 2 batches of 60 CEU individuals
  – grown independently at two different labs
  – RNA extraction and labelling by different labs and people
  – Run on custom and genome-wide (gw) Illumina arrays
  – 97% of associations at the 0.05 permutation threshold from the custom array analysis were also detected in the gw analysis

HapMap phase II analysis
• ~4 million SNP genotypes made publicly available for the 270 HapMap individuals.
• Density: 1 SNP / 700 bp
• Includes ~50% of expected common SNPs in these populations.
• 2.2 million SNPs analyzed (MAF > 0.05)

Phase I vs. Phase II cis- significant genes (0.001)

Population   phase I HapMap   % of phase I   both   % of phase II   phase II HapMap
CEU          286              90%            258    86%             299
CHB          317              85%            269    85%             318
JPT          337              87%            297    87%             341
YRI          356              87%            310    79%             394

Phase I vs. Phase II
[Figure: -log10(pvalue) against chromosome coordinate (64500000 to 66300000), one panel for phase I and one for phase II.]

Population sharing of cis- associations

Populations        Number of genes
CEU-CHB-JPT-YRI    66
CEU-CHB-JPT        38
CEU-CHB-YRI        13
CEU-JPT-YRI        9
CHB-JPT-YRI        30
CEU-CHB            20
CEU-JPT            14
CEU-YRI            36
CHB-JPT            45
CHB-YRI            21
JPT-YRI            28
CEU only           111
CHB only           94
JPT only           121
YRI only           205
SUM (non-redundant genes)   851

• Gene associations in at least 2 populations: 320 (0.38 of total)
• Gene associations in single populations: 531 (0.62 of total)

Associated SNP position relative to TSS
[Figure: -log10(pvalue) against SNP position relative to the TSS (-1,000,000 to 1,000,000), panels for CEU, CHB, JPT and YRI.]

Distribution of regulatory elements around the TSS
ENCODE Nature 2007

Direction of allelic effect: same SNP-gene combination across populations
[Figure: THAP5 log2 expression against rs40915 genotype (CC, CT, TT) in Population 1 and Population 2, illustrating AGREEMENT and OPPOSITE direction of effect.]

Direction of allelic effect
[Figure: pairwise comparison of regression gradients between populations: CEU-CHB, CEU-JPT, CEU-YRI, CHB-JPT, CHB-YRI, JPT-YRI.]

Pooling populations
Spurious associations
[Figure: frequency histogram of log2 expression for Pop1 and Pop2.]

Conditional permutations
Permute data within each pop separately, then perform the test
[Figure: genotype matrix (g11 ... gin) and gene expression values (Exp1 ... Expi) with a "permute" step, repeated for the 4 populations.]

Multi-population analysis overlap

                      #genes "multipop"   #genes "by_pop"
overlap with 0 pops   447                 NA
overlap with 1 pop    226                 530
overlap with 2 pops   120                 164
overlap with 3 pops   74                  90
overlap with 4 pops   62                  66
total                 929                 850

Proportion of single population cis associated genes detected in multi-population analysis
[Figure 2A: proportion detected (0.3 to 1.0) against the number of populations sharing the association in the cis- single population analyses (1 to 4); series: 2 multipop, 3 multipop, 4 multipop.]

SGPP2
[Figure: -log10(pvalue) against chromosome position (222500000 to 224000000) for the 4-pop multipop analysis and for CEU, CHB, JPT and YRI, alongside frequency histograms of Adjusted_R^2 (0.00 to 0.90); panel annotations N = 80, 31, 29, 34, 39.]

Trans- phase II HapMap association – Biological hypotheses: functional categories
• Regulatory SNPs identified from cis- analysis (52%)
• Non-synonymous SNPs (39%)
• Splice site SNPs (7%)
• miRNA SNPs (1%)
[Figure: positions of rSNPs, spliceSNPs, nsSNPs and miRNA SNPs relative to REG and GENE on DNA.]
Genome-wide associations: ~25,000 SNPs per population x 14,072 genes

Trans- associations (10^-3 threshold; 14,072 genes tested)

Analysis                                  CEU   CHB   JPT   YRI
1 linear regression, significant genes    44    37    38    23
2 CEU-CHB-JPT-YRI multipop                44    44    44    44
3 CEU-CHB-JPT multipop                    52    52    52    52
4 CHB-JPT multipop                        39    39    39    39
overlap 1&2                               9     10    10    7
overlap 1&3                               12    14    14    7
overlap 1&4                               12    15    16    7

• Analysis 1, non-redundant genes: 108 (4 pops: 5; >= 2 pops: 16)
• Correction at 0.001: 15 genes estimated false positives, FDR = 33%-39%
• Correction at 0.01: 150 genes estimated false positives, FDR = 60%-75%

Enrichment of regulatory SNPs and deficit of nsSNPs in trans- associations

       regulatory SNPs (cis 0.001)   ns SNPs              splice SNPs        miRNA SNPs
       ratio    p-value              ratio    p-value     ratio    p-value   ratio    p-value
CEU    6.05     3.23E-24             0.15     1.22E-21    0.49     0.07      0        1
CHB    3.69     7.90E-10             0.24     1.91E-09    0.76     0.71      0        1
JPT    3.15     2.06E-07             0.31     8.82E-07    0.71     0.55      0        1

3-6x more likely that a cis regulatory effect explains a trans regulatory effect

Multi-pop CNV analysis
• Combined 4 populations: 193 genes at 0.001 (48 overlap with the 99 from single population analysis)
• Combined 3 populations: 173 genes at 0.001 (42 overlap with the 99 from single population analysis)

CNV trans effects
[Figure: REG/GENE schematics of the cis mechanisms (additional gene copy; increase of distance from regulatory element; new regulatory element; gene interruption) leading to variable expression, with trans effects through a biological pathway or trans-position.]

Trans effects - CEU
[Figure: CNV_chromosome against GENE_chromosome.]

Trans effects - YRI
[Figure: CNV_chromosome against GENE_chromosome.]

Gene expression and natural selection
[Figure: -log10(pvalue) against SNP position relative to the TSS (-100,000 to 100,000), panels for CEU, CHB, JPT and YRI.]
With Sridhar Kudaravalli and Jonathan Pritchard (unpublished)

Co-segregating regulatory variants can drive differential isoform expression
[Figure: gene X with regulatory variants and coding variants on two haplotypes (alleles C, T and G, C), one labelled "drives high expression" and the other "drives low expression", and the resulting protein isoforms.]

SUMMARY
• Cis- and trans- acting genetic variation influences mRNA levels.
• CNV effects detected are largely not captured by SNPs.
• Structural variation (copy number polymorphism) influences transcript level variation.
• Many detected associations are shared across human populations – replication of effects.
• Signal is concentrated symmetrically within 100 kb of the promoter.
• Trans-acting effects of CNVs – interpretation.
• Primary effects of trans associations are largely cis regulatory effects.
• Cis regulatory effects under positive selection.

Acknowledgements
Cambridge University, Barbara Stranger, Alexandra Nica, Antigone Dimas, Christine Bird, Matthew Forrest, Catherine Ingle, Claude Beazley, Panos Deloukas, Matt Hurles, Mark Dunning, Natalie Thorne, Simon Tavaré
Stanford: Daphne Koller
Illumina: Jill Orwick, Mark Gibbs
Genome Structural Variation Consortium: Richard Redon, Nigel Carter, Charles Lee, Chris Tyler-Smith, Stephen Scherer
Wellcome Trust for funding
The HapMap Consortium
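
Appendix sketch: cis association test with a permutation threshold

The linear regression and permutation slides (genotypes coded 0, 1, 2; keep the most extreme result from each permutation; compare observed values to that null) can be illustrated roughly as below. This is a minimal sketch, not the pipeline used in the talk. The function names and the toy data are hypothetical, complete data are assumed, and r2 is used in place of the p-value: for a fixed sample size the regression p-value falls as r2 rises, so keeping the largest r2 per permutation stands in for keeping the lowest p-value.

```python
import random


def r_squared(x, y):
    """r2 of the simple linear regression y ~ x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP or constant expression: nothing to test
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def cis_permutation_test(genotypes, expression, n_perm=10000, seed=0):
    """Return (observed r2 per SNP, permutation-adjusted p-value per SNP).

    genotypes: one list per SNP in the cis window, coded 0/1/2 per individual.
    expression: one expression value per individual, in the same order.
    """
    rng = random.Random(seed)
    observed = [r_squared(snp, expression) for snp in genotypes]
    shuffled = list(expression)
    null_best = []
    for _ in range(n_perm):
        # Break the genotype-expression link, then keep the best SNP result.
        rng.shuffle(shuffled)
        null_best.append(max(r_squared(snp, shuffled) for snp in genotypes))
    # Adjusted p-value: share of permutations whose best r2 reaches the observed r2.
    adjusted = [
        (1 + sum(best >= obs for best in null_best)) / (n_perm + 1)
        for obs in observed
    ]
    return observed, adjusted


# Toy data (hypothetical): 30 individuals, two SNPs in the cis window.
snp_a = [0, 1, 2] * 10
snp_b = [1, 1, 1, 0, 0, 0, 2, 2, 2] * 3 + [1, 1, 1]
expression = [2 * g + 5 for g in snp_a]  # expression tracks snp_a only
observed, adjusted = cis_permutation_test([snp_a, snp_b], expression, n_perm=200)
print(observed, adjusted)
```

Taking the maximum over all SNPs in the window inside each permutation is what makes the resulting threshold account for the number of SNPs tested per gene.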
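
Appendix sketch: conditional permutations when pooling populations

The pooling slides note that combining populations can create spurious associations, and that the remedy is to permute within each population separately before testing. The sketch below is one possible illustration under stated assumptions: the helper names and the toy data are hypothetical, and r2 again stands in for the regression p-value. In the toy data the two populations differ in both allele frequency and mean expression, but there is no association inside either population.

```python
import random


def r_squared(x, y):
    """r2 of the simple linear regression y ~ x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def permute_within_populations(values, populations, rng):
    """Shuffle values only among individuals carrying the same population label."""
    permuted = list(values)
    members = {}
    for index, pop in enumerate(populations):
        members.setdefault(pop, []).append(index)
    for indices in members.values():
        block = [values[i] for i in indices]
        rng.shuffle(block)
        for i, value in zip(indices, block):
            permuted[i] = value
    return permuted


def pooled_p_value(genotype, expression, populations, conditional,
                   n_perm=500, seed=0):
    """Permutation p-value for the pooled regression of expression on genotype."""
    rng = random.Random(seed)
    observed = r_squared(genotype, expression)
    hits = 0
    for _ in range(n_perm):
        if conditional:
            permuted = permute_within_populations(expression, populations, rng)
        else:
            permuted = list(expression)
            rng.shuffle(permuted)  # naive: ignores population labels
        if r_squared(genotype, permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 20 individuals per population.
noise = [0.1, -0.1, 0.0, 0.0] * 5
genotype = [2, 2, 1, 2] * 5 + [0, 0, 1, 0] * 5
expression = [10 + e for e in noise] + [5 + e for e in noise]
populations = ["Pop1"] * 20 + ["Pop2"] * 20

naive_p = pooled_p_value(genotype, expression, populations, conditional=False)
conditional_p = pooled_p_value(genotype, expression, populations, conditional=True)
print(naive_p, conditional_p)
```

Because the within-population shuffle keeps each population's expression distribution intact, the null retains the between-population difference, so an association driven only by that difference no longer looks extreme.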