School of Medicine
Division of Genetics and Molecular Medicine

Motivation
• Use functional annotation to prioritise the "grey zone" of borderline-significant GWAS SNPs
• "Hypothesis-free" testing has a limited future
  – Too many variants and phenotypes to test
  – Increasingly undesirable to ignore all relevant parallel data
• Use evidence of enrichment in GWAS hits to inform choice of annotations
• Techniques applicable beyond GWAS, e.g. NGS

Aim
• To provide a Bayes factor for combination with a GWAS p-value to prioritise SNPs for follow-up
  – Method 1: Uses frequency differences between GWAS hits and nulls
  – Method 2: Uses a machine learning algorithm based on multiple logistic regression
• To provide insight about the importance of functional annotations

Data
• Full dataset
  – Affymetrix 500K & Illumina 550K panels
    • Filtered to remove the HLA region
  – Not all other SNPs – OR – All analysed SNPs
• Classifier
  – NHGRI catalogue (Hindorff et al. PNAS 2009)
  – Frequency-based method used 2215 GWAS hits
  – Elastic net based method used 8631 GWAS hits and tags (LD proxies, r² ≥ 0.8)

Annotations I
• nsSNPs
  – SNPs that change the amino acid sequence
    • UCSC genome browser
• eQTLs in open chromatin
  – eQTLs: SNPs that are associated with expression levels
    • GWAS of these eQTLs was undertaken (Dixon et al. 2007)
  – SNPs in open chromatin
    • UCSC genome browser (Duke/UNC/UT-Austin/EBI ENCODE group)
    • Two independent laboratory-based techniques
• Promoters
  – SNPs in the promoter region or the first exon
    • UCSC genome browser
    • FirstEF computer algorithm (Davuluri 2001)

www.biomedicalresearchcentre.org

Annotations II
• Transcription factor binding sites (15 types)
  – UCSC genome browser (HudsonAlpha Institute of Biotechnology)
  – Only common TFBS types used
• UCSC genes (start to end of transcription)
  – UCSC
• Binary at this stage

A Bayesian framework for application to association data
• $O_{\text{post}} = O_{\text{prior}} \times BF_{\text{annot}} \times BF_{\text{assoc}}$
• $O_{\text{prior}}$ – Prior belief in causality per SNP
• $BF_{\text{assoc}}$ – e.g. Wakefield AJHG 2007

Derivation of annotation Bayes factors
• Frequency difference based:
  $BF_{\text{annot}} = \dfrac{\#\,\text{annotated GWAS hits} \,/\, \#\,\text{annotated null SNPs}}{\#\,\text{not annotated GWAS hits} \,/\, \#\,\text{not annotated null SNPs}}$
• Elastic net:
  $BF_{\text{annot}} = \dfrac{\text{Hit Probability}}{1 - \text{Hit Probability}}$
  – No need to correct for training set prior due to weighting

Frequency-based method results I – Annotation proportion
There is a higher proportion of SNPs with functional characteristics in the GWAS hit list than in the random SNP list.

Frequency-based method results II – Sensitivity analysis
• Stratification by minor allele frequency
  – Enrichment evident except for cis eQTLs with MAF < 0.1
• SNPs with unique annotations only
• Annotated SNPs only (without LD proxies)
  – Enrichment evident except for promoter SNPs in the Hindorff list

Frequency-based method results III – Bayes factors using alternate GWAS hit definition
When GWAS hits were defined using more significant cut-offs, the prior odds were larger.

Problem!
• There are more than 3 annotations!

Method II: Machine learning - Elastic net
• Minimises the negative log likelihood plus a penalty function:
  $\min_{\beta}\left\{ -\log \text{lik}/N + \lambda\left(\alpha L_1 + (1-\alpha) L_2\right) \right\}$
  – Tuning parameters
    • Alpha ($\alpha$) and lambda ($\lambda$)
  – Penalties
    • $L_1 = \|\beta\|_{\ell_1}$, $L_2 = \tfrac{1}{2}\|\beta\|_{\ell_2}^{2}$

Elastic net and alternatives
[Figure: three panels plotted against lambda]
• Lasso (α=1, penalty all L1)
• Elastic net (α=0.2, penalty mixed)
• Ridge regression (α=0, penalty all L2)
Adapted from Friedman et al. 2010.
Journal of Statistical Software

Elastic net functionality
• Implements variable selection
• Allows for
  – correlation between attributes
  – large numbers of attributes (handling these appropriately)
  – quantitative and qualitative data

Elastic net - Procedure
• Up-weight GWAS hits to deal with unbalanced data and improve positive prediction accuracy
• Tuning of main parameters (α and λ)
  – 10-fold cross validation
  – Errors estimated from an independent test set
  – Choice of α and λ not critical
• Train the coefficients
  – Split data into training and test set
  – E.g. Chromosome 22: 1-21 = training, 22 = test
• Predict classification probabilities at all SNPs on the excluded chromosome

Results - Prediction distributions
[Figure]

Results - Deviance by chromosome
[Figure]

Results - Coefficients by excluded chromosome
[Figure: coefficient (y-axis, 0 to 0.8) by excluded chromosome (x-axis, 1 to 22) for nsSNP, prom, eQTL, UCSCgenes, cMyc, BATF, BCL11, EBF, IRF4, PAX5C, POL2-H, PU1]
p53, BCL3, PAX5N, POL2, POU2F, SP1 and TCF12 are all dropped from the models.

Annotation Bayes factors
1st Qu.  Median  Mean  3rd Qu.
0.72     0.73    0.95  0.73
1st Qu.  Median  Mean  3rd Qu.
0.72     0.73    1.69  1.99

Application to real data – Crohn's
• WTCCC1
  – The Wellcome Trust Case Control Consortium. Nature 447:661-667
  – 17 loci in strong/moderate association tables
• Meta-analysis
  – Franke et al. 2010. Nature Genetics 42, 1118–1125
  – 71 loci

Rank changes
Method                Set    Avg Rank   Avg Rank Change
Frequency difference  Null   211150     -205
Frequency difference  Hits   45514      10322
Elastic net           Null   211150     -745
Elastic net           Hits   45514      20207
Both methodologies give an increase in the rank.

Rank changes
Method                Set    Rank Up   Rank Down   Rank Same
Frequency difference  Hits   21        28          3
Elastic net           Hits   24        26          2
Some go up… some go down…

Rank changes
Method                Set    Rank Up   Rank Down   Rank Same
Frequency difference  Null   4         48          0
Frequency difference  Hits   21        28          3
Elastic net           Null   11        41          0
Elastic net           Hits   24        26          2
More go up… fewer go down…

Rank differences
[Figure]

Example: Psoriasis data
• GWAS – Strange et al. Nature Genetics, 2010.
• 21 SNPs
  – 8 known
  – 8 replicated in our study
  – 5 did not replicate

Psoriasis data - Annotation Bayes factors
[Figure: Bayes factors (axis 1 to 6) for hits and nulls]

Psoriasis data - Important annotations
Bayes Factor  nsSNP  prom  UCSC genes  EBF  PAX5C  POL2H  PU1
2.1           1      0     1           0    0      0      0
1.5           0      0     1           0    0      0      0
2.1           1      0     1           0    0      0      0
1.0           0      0     0           0    0      1      0
1.4           0      0     1           0    0      0      0
1.0           0      0     0           0    0      1      0
1.5           0      0     1           0    0      0      0
5.9           1      1     1           1    1      1      1
2.1           1      0     1           0    0      0      0
1.5           0      0     1           0    0      0      0
• UCSC gene important, as is non-synonymous SNP
• POL2 important

Autoimmune GWAS hits only - Prediction distributions
[Figure]

Autoimmune GWAS hits only – Bayes factors
[Figure: Bayes factors (axis 1 to 6) for hits and nulls, panels "All" and "Autoimmune"; one value labelled 17.5]

Conclusions
• Adding the annotation data is free and, on average across the SNPs, up-weights truly causal SNPs and down-weights non-causal SNPs
• We recommend its use for SNPs in the grey zone
• The same methodology can be applied to sequencing data, once empirical information becomes available

Acknowledgements
• Mike Weale
• Dan Crouch
• Jen Mollon
• Irene Rebollo Mesa
• Richard Trembath, GAP, WTCCC
• Department of Health – through the National Institute for Health Research (NIHR) comprehensive Biomedical Research Centre (BRC) awards
• Guy's and St. Thomas' National Health Service Foundation Trust
• MRC centre for transplantation
• King's College London
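
The arithmetic behind the Bayesian framework and the two annotation Bayes factor definitions given earlier in the deck can be sketched in a few lines of Python. This is only an illustration of the stated formulas, not the authors' implementation: the function names, the example counts, the prior odds and the association Bayes factor are all hypothetical values chosen for readability.

```python
def frequency_bayes_factor(annot_hits, annot_nulls, unannot_hits, unannot_nulls):
    """Frequency-difference annotation Bayes factor:
    (annotated hits / annotated nulls) / (unannotated hits / unannotated nulls)."""
    if min(annot_nulls, unannot_hits, unannot_nulls) <= 0:
        raise ValueError("denominator counts must be positive")
    return (annot_hits / annot_nulls) / (unannot_hits / unannot_nulls)


def elastic_net_bayes_factor(hit_probability):
    """Elastic-net annotation Bayes factor: predicted hit probability as odds."""
    if not 0.0 <= hit_probability < 1.0:
        raise ValueError("hit_probability must be in [0, 1)")
    return hit_probability / (1.0 - hit_probability)


def posterior_odds(prior_odds, bf_annot, bf_assoc):
    """O_post = O_prior * BF_annot * BF_assoc."""
    return prior_odds * bf_annot * bf_assoc


def odds_to_probability(odds):
    """Convert odds to a probability."""
    return odds / (1.0 + odds)


# Hypothetical counts for one annotation (not taken from the slides).
bf_freq = frequency_bayes_factor(
    annot_hits=200, annot_nulls=20_000,
    unannot_hits=2_000, unannot_nulls=400_000,
)

# Hypothetical prior odds and association Bayes factor for a single SNP.
o_post = posterior_odds(prior_odds=1e-4, bf_annot=bf_freq, bf_assoc=50.0)

print("frequency-based BF_annot:", bf_freq)
print("posterior odds:", o_post)
print("posterior probability:", odds_to_probability(o_post))
```

Because the three terms multiply, an annotation Bayes factor above 1 moves a borderline SNP up the follow-up list and one below 1 moves it down, which is the re-ranking behaviour reported in the rank-change tables.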