School of Medicine
Division of Genetics and Molecular Medicine

Motivation
• Use functional annotation to prioritise the “grey zone” of borderline-significant GWAS SNPs
• “Hypothesis-free” testing has a limited future
  – Too many variants and phenotypes to test
  – Increasingly undesirable to ignore all relevant parallel data
• Use evidence of enrichment in GWAS hits to inform choice of annotations
• Techniques applicable beyond GWAS, e.g. NGS

Aim
• To provide a Bayes factor for combination with a GWAS p-value to prioritise SNPs for follow-up
  – Method 1: Uses frequency differences between GWAS hits and nulls
  – Method 2: Uses machine learning algorithm based on multiple logistic regression
• To provide insight about the importance of functional annotations

Data
• Full Dataset
  – Affymetrix 500K & Illumina 550K panels
    • Filtered to remove the HLA region
  – Not all other SNPs – OR – All analysed SNPs
• Classifier
  – NHGRI catalogue - Hindorff et al. PNAS 2009
  – Frequency based method used 2215 GWAS hits
  – Elastic net based method used 8631 GWAS hits and tags (LD proxies, r² ≥ 0.8)

Annotations I
• nsSNPs – SNPs that change the amino acid sequence
  • UCSC genome browser
• eQTLs in open chromatin
  – eQTLs: SNPs that are associated with expression levels
    • GWAS of these eQTLs was undertaken – Dixon et al 2007
  – SNPs in open chromatin
    • UCSC genome browser - Duke/UNC/UT-Austin/EBI ENCODE group
    • Two independent laboratory based techniques
• Promoters – SNPs in the promoter region or the first exon
  • UCSC genome browser
  • FirstEF computer algorithm – Davuluri 2001

www.biomedicalresearchcentre.org

Annotations II
• Transcription Factor Binding Sites (15 types)
  – UCSC genome browser – HudsonAlpha Institute of Biotechnology
  – Only common TFBS types used
• UCSC genes (start to end of transcription)
  – UCSC
• Binary at this stage

A Bayesian framework for application to association data
• Opost = Oprior * BFannot * BFassoc
• Oprior – Prior belief in causality per SNP
• BFassoc – e.g.
Wakefield AJHG 2007

Derivation of annotation Bayes Factors
• Frequency difference based = (# annotated GWAS hits / # annotated null SNPs) / (# not annotated GWAS hits / # not annotated null SNPs)
• Elastic net = Hit Probability / (1 - Hit Probability)
  – No need to correct for training set prior due to weighting

Frequency based method results I – Annotation Proportion
There is a higher proportion of SNPs with functional characteristics in the GWAS hit list than in the random SNP list

Frequency based method results II – Sensitivity analysis
• Stratification by minor allele frequency
  – Enrichment evident except cis eQTLs MAF < 0.1
• SNPs with unique annotations only
• Annotated SNPs only (without LD proxies)
  – Enrichment evident except promoter SNPs in Hindorff list

Frequency based method results III – Bayes Factors using alternate GWAS hit definition
When GWAS hits were defined using more significant cut-offs the prior odds were larger

Problem!
• There are more than 3 annotations!

Method II: Machine Learning - Elastic net
• Minimises the negative log likelihood plus penalty function
  $$\min\left\{ -\frac{\log \mathrm{lik}}{N} + \lambda\left(\alpha L_1 + (1-\alpha) L_2\right) \right\}$$
  – Tuning parameters
    • Alpha and lambda
  – Penalties
    $$L_1 = \lVert\beta\rVert_{\ell_1}, \qquad L_2 = \lVert\beta\rVert_{\ell_2}^{2}$$

Elastic net and alternatives
[Figure: coefficient paths against lambda; panels: Lasso (α=1, penalty all L1), Elastic net (α=0.2, penalty mixed), Ridge regression (α=0, penalty all L2)]
Adapted from Friedman et al. 2010.
Journal of Statistical Software

Elastic net functionality
• Implements variable selection
• Allows for
  – correlation between attributes
  – large numbers of attributes (handling these appropriately)
  – quantitative and qualitative data

Elastic net - Procedure
• Up weight GWAS hits to deal with unbalanced data to improve positive prediction accuracy
• Tuning of main parameters (α and λ)
  – 10 fold cross validation
  – Errors estimated from an independent test set
  – Choice of α and λ not critical
• Train the coefficients
  – Split data into training and test set
  – E.g. Chromosome 22: 1-21=training, 22=test
• Predict classification probabilities at all SNPs on the excluded chromosome

Results - Prediction distributions

Results - Deviance by chromosome

Results - Coefficients by excluded chromosome
[Figure: coefficient (0 to 0.8) against excluded chromosome (1 to 22); series: nsSNP, prom, eQTL, UCSCgenes, cMyc, BATF, BCL11, EBF, IRF4, PAX5C, POL2-H, PU1]
p53, BCL3, PAX5N, POL2, POU2F, SP1 and TCF12 are all dropped from the models

Annotation Bayes Factors
1st Qu.   Median   Mean   3rd Qu.
0.72      0.73     0.95   0.73

1st Qu.   Median   Mean   3rd Qu.
0.72      0.73     1.69   1.99

Application to real data – Crohn’s
• WTCCC1
  – The Wellcome Trust Case Control Consortium. Nature 447:661-667
  – 17 loci in strong/moderate association tables
• Meta-analysis
  – Franke et al. 2010. Nature Genetics 42, 1118–1125
  – 71 loci

Rank changes
                            Avg Rank   Avg Rank Change
Frequency difference Null   211150     -205
Frequency difference Hits   45514      10322
Elastic net Null            211150     -745
Elastic net Hits            45514      20207
Both methodologies give an increase in the rank

Rank changes
                            Rank Up   Rank Down   Rank Same
Frequency difference Hits   21        28          3
Elastic net Hits            24        26          2
Some go up…some go down..

Rank changes
                            Rank Up   Rank Down   Rank Same
Frequency difference Null   4         48          0
Frequency difference Hits   21        28          3
Elastic net Null            11        41          0
Elastic net Hits            24        26          2
More go up…less go down..

Rank Differences

Example: Psoriasis data
• GWAS – Strange et al. Nature Genetics, 2010.
• 21 SNPs
  – 8 known
  – 8 replicated in our study
  – 5 did not replicate

Psoriasis data - Annotation Bayes Factors
[Figure: annotation Bayes factors (axis 1 to 6) for Hits and Nulls]

Psoriasis data - Important annotations
Bayes Factor   nsSNP   prom   UCSC genes   EBF   PAX5C   POL2H   PU1
2.1            1       0      1            0     0       0       0
1.5            0       0      1            0     0       0       0
2.1            1       0      1            0     0       0       0
1.0            0       0      0            0     0       1       0
1.4            0       0      1            0     0       0       0
1.0            0       0      0            0     0       1       0
1.5            0       0      1            0     0       0       0
5.9            1       1      1            1     1       1       1
2.1            1       0      1            0     0       0       0
1.5            0       0      1            0     0       0       0
UCSC gene important, as is non-synonymous SNP
POL2 important

Autoimmune GWAS hits only - Prediction distributions

Autoimmune GWAS hits only – Bayes Factors
[Figure: Bayes factors (axis 1 to 6) for Hits and Nulls; panels: All, Autoimmune; one value labelled 17.5]

Conclusions
• Adding the annotation data is free and, on average across the SNPs, up-weights truly causal SNPs and down-weights non-causal SNPs
• We recommend its use for SNPs in the grey zone
• The same methodology can be applied to sequencing data, once empirical information becomes available

Acknowledgements
• Mike Weale
• Dan Crouch
• Jen Mollon
• Irene Rebollo Mesa
• Richard Trembath, GAP, WTCCC
• Department of Health – through the National Institute for Health Research (NIHR) comprehensive Biomedical Research Centre (BRC) awards
• Guy’s and St. Thomas’ National Health Service Foundation Trust
• MRC centre for transplantation
• King’s College London
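
Worked illustration (not part of the original slides)
The Bayesian framework (Opost = Oprior * BFannot * BFassoc) and the two annotation Bayes factor definitions given earlier can be sketched in a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and every count, probability and odds value below is made up for illustration rather than taken from the study.

```python
def frequency_bayes_factor(hits_annot, nulls_annot, hits_not_annot, nulls_not_annot):
    """Frequency-difference Bayes factor for one binary annotation.

    (# annotated GWAS hits / # annotated null SNPs) divided by
    (# not annotated GWAS hits / # not annotated null SNPs).
    """
    return (hits_annot / nulls_annot) / (hits_not_annot / nulls_not_annot)


def elastic_net_bayes_factor(hit_probability):
    """Elastic-net Bayes factor: Hit Probability / (1 - Hit Probability)."""
    return hit_probability / (1.0 - hit_probability)


def posterior_odds(prior_odds, bf_annot, bf_assoc):
    """Opost = Oprior * BFannot * BFassoc."""
    return prior_odds * bf_annot * bf_assoc


# Hypothetical counts: 40 of 200 hits and 1000 of 9000 nulls carry the annotation.
bf_freq = frequency_bayes_factor(40, 1000, 160, 8000)

# Hypothetical predicted hit probability for one SNP from a fitted classifier.
bf_enet = elastic_net_bayes_factor(0.6)

# Hypothetical prior odds of causality and association Bayes factor.
o_post = posterior_odds(prior_odds=1e-4, bf_annot=bf_freq, bf_assoc=500.0)

print(bf_freq, bf_enet, o_post)
```

Because the three terms multiply, an annotation Bayes factor above 1 raises a borderline SNP's posterior odds, and one below 1 lowers them, without changing the association evidence itself.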