School of Medicine
Division of Genetics and Molecular Medicine

Motivation
•  Use functional annotation to prioritise the “grey zone” of borderline-significant GWAS SNPs
•  “Hypothesis-free” testing has a limited future
   –  Too many variants and phenotypes to test
   –  Increasingly undesirable to ignore all relevant parallel data
•  Use evidence of enrichment in GWAS hits to inform choice of annotations
•  Techniques applicable beyond GWAS, e.g. NGS

Aim
•  To provide a Bayes factor for combination with a GWAS p-value to prioritise SNPs for follow-up
   –  Method 1: Uses frequency differences between GWAS hits and nulls
   –  Method 2: Uses machine learning algorithm based on multiple logistic regression
•  To provide insight about the importance of functional annotations

Data
•  Full Dataset
   –  Affymetrix 500K & Illumina 550K panels
      •  Filtered to remove the HLA region
   –  Not all other SNPs – OR – All analysed SNPs
•  Classifier
   –  NHGRI catalogue - Hindorff et al. PNAS 2009
   –  Frequency based method used 2215 GWAS hits
   –  Elastic net based method used 8631 GWAS hits and tags (LD proxies, r² ≥ 0.8)

Annotations I
•  nsSNPs
   –  SNPs that change the amino acid sequence
      •  UCSC genome browser
•  eQTLs in open chromatin
   –  eQTLs: SNPs that are associated with expression levels
      •  GWAS of these eQTLs was undertaken – Dixon et al. 2007
   –  SNPs in open chromatin
      •  UCSC genome browser - Duke/UNC/UT-Austin/EBI ENCODE group
         –  Two independent laboratory-based techniques
•  Promoters
   –  SNPs in the promoter region or the first exon
      •  UCSC genome browser
      •  FirstEF computer algorithm – Davuluri 2001

www.biomedicalresearchcentre.org

Annotations II
•  Transcription Factor Binding Sites (15 types)
   –  UCSC genome browser – HudsonAlpha Institute of Biotechnology
   –  Only common TFBS types used
•  UCSC genes (start to end of transcription)
   –  UCSC
•  Binary at this stage

A Bayesian framework for application to association data
•  $O_{\text{post}} = O_{\text{prior}} \times BF_{\text{annot}} \times BF_{\text{assoc}}$
•  $O_{\text{prior}}$
   –  Prior belief in causality per SNP
•  $BF_{\text{assoc}}$
   –  e.g. Wakefield AJHG 2007

Derivation of annotation Bayes Factors
•  Frequency difference based
   $$BF_{\text{annot}} = \frac{\#\,\text{annotated GWAS hits} \,/\, \#\,\text{annotated null SNPs}}{\#\,\text{not annotated GWAS hits} \,/\, \#\,\text{not annotated null SNPs}}$$
•  Elastic net
   $$BF_{\text{annot}} = \frac{\text{Hit Probability}}{1 - \text{Hit Probability}}$$
   –  No need to correct for training set prior due to weighting
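
The two Bayes factor definitions and the odds update above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the counts, prior odds and association Bayes factor in the usage lines are made-up numbers, not values from the study.

```python
def frequency_bayes_factor(hits_annot, nulls_annot, hits_not_annot, nulls_not_annot):
    """Frequency-difference Bayes factor for an annotation.

    Ratio of (annotated hits / annotated nulls) to
    (not annotated hits / not annotated nulls), as on the slide.
    """
    return (hits_annot / nulls_annot) / (hits_not_annot / nulls_not_annot)


def elastic_net_bayes_factor(hit_probability):
    """Convert a predicted hit probability into odds: p / (1 - p)."""
    return hit_probability / (1.0 - hit_probability)


def posterior_odds(prior_odds, bf_annot, bf_assoc):
    """O_post = O_prior * BF_annot * BF_assoc."""
    return prior_odds * bf_annot * bf_assoc


# Hypothetical counts for one annotation (illustration only).
bf_annot = frequency_bayes_factor(
    hits_annot=300, nulls_annot=20000,
    hits_not_annot=1700, nulls_not_annot=380000,
)

# Hypothetical prior odds and association Bayes factor for a single SNP.
print(posterior_odds(prior_odds=1e-4, bf_annot=bf_annot, bf_assoc=50.0))
```

Because the three terms simply multiply, the annotation Bayes factor can be applied after the association analysis has been run, without refitting anything.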

Frequency based method results I – Annotation Proportion
There is a higher proportion of SNPs with functional characteristics in the GWAS hit list than in the random SNP list.

Frequency based method results II – Sensitivity analysis
•  Stratification by minor allele frequency
   –  Enrichment evident except cis eQTLs with MAF < 0.1
•  SNPs with unique annotations only
•  Annotated SNPs only (without LD proxies)
   –  Enrichment evident except promoter SNPs in the Hindorff list

Frequency based method results III – Bayes Factors using alternate GWAS hit definition
When GWAS hits were defined using more significant cut-offs, the prior odds were larger.

Problem!
•  There are more than 3 annotations!

Method II: Machine Learning - Elastic net
•  Minimises the negative log likelihood plus penalty function
   $$\min\left\{ -\frac{\log \text{lik}}{N} + \lambda\left(\alpha L_1 + (1-\alpha) L_2\right) \right\}$$
   –  Tuning parameters
      •  Alpha and lambda
   –  Penalties
      $$L_1 = \lVert \beta \rVert_{\ell_1}, \qquad L_2 = \lVert \beta \rVert_{\ell_2}^{2}$$
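
As a rough illustration of the objective above, the sketch below evaluates it for a logistic regression model in plain Python. It assumes L1 is the sum of absolute coefficients and L2 the sum of squared coefficients, and that the intercept is left unpenalised; the function name and argument layout are hypothetical. Note that the glmnet package scales the L2 term by 1/2, whereas this sketch follows the form written on the slide.

```python
import math


def elastic_net_objective(beta, intercept, X, y, lam, alpha):
    """Penalised negative log likelihood for logistic regression.

    beta      : list of coefficients (one per annotation)
    intercept : unpenalised intercept
    X         : list of rows, each a list of annotation values
    y         : list of 0/1 labels (1 = GWAS hit)
    lam, alpha: tuning parameters lambda and alpha
    """
    n = len(y)
    log_lik = 0.0
    for row, label in zip(X, y):
        eta = intercept + sum(b * x for b, x in zip(beta, row))
        # Numerically stable log(1 + exp(eta)).
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        # Bernoulli log likelihood under the logistic model.
        log_lik += label * eta - softplus
    l1 = sum(abs(b) for b in beta)   # L1 penalty: sum of |beta_j|
    l2 = sum(b * b for b in beta)    # L2 penalty: sum of beta_j^2
    return -log_lik / n + lam * (alpha * l1 + (1.0 - alpha) * l2)
```

Setting `alpha=1` leaves only the L1 term (lasso) and `alpha=0` only the L2 term (ridge); intermediate values mix the two.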

Elastic net and alternatives
[Figure: three panels plotted against lambda]
•  Lasso (α=1, Penalty all L1)
•  Elastic net (α=0.2, Penalty mixed)
•  Ridge regression (α=0, Penalty all L2)
Adapted from Friedman et al. 2010. Journal of Statistical Software

Elastic net functionality
•  Implements variable selection
•  Allows for
   –  correlation between attributes
   –  large numbers of attributes (handling these appropriately)
   –  quantitative and qualitative data

Elastic net - Procedure
•  Up-weight GWAS hits to deal with unbalanced data and improve positive prediction accuracy
•  Tuning of main parameters (α and λ)
   –  10-fold cross validation
   –  Errors estimated from an independent test set
   –  Choice of α and λ not critical
•  Train the coefficients
   –  Split data into training and test set
   –  E.g. Chromosome 22: 1-21=training, 22=test
•  Predict classification probabilities at all SNPs on the excluded chromosome

Results - Prediction distributions

Results - Deviance by chromosome

Results - Coefficients by excluded chromosome
[Figure: coefficient (0 to 0.8) against excluded chromosome (1 to 22); series: nsSNP, prom, eQTL, UCSCgenes, cMyc, BATF, BCL11, EBF, IRF4, PAX5C, POL2-H, PU1]
p53, BCL3, PAX5N, POL2, POU2F, SP1 and TCF12 are all dropped from the models

Annotation Bayes Factors
1st Qu.   Median   Mean   3rd Qu.
0.72      0.73     0.95   0.73
1st Qu.   Median   Mean   3rd Qu.
0.72      0.73     1.69   1.99

Application to real data – Crohn’s
•  WTCCC1
   –  The Wellcome Trust Case Control Consortium. Nature 447:661-667
   –  17 loci in strong/moderate association tables
•  Meta-analysis
   –  Franke et al. 2010. Nature Genetics 42, 1118–1125
   –  71 loci

Rank changes
Method                 Group   Avg Rank   Avg Rank Change
Frequency difference   Null    211150     -205
Frequency difference   Hits    45514      10322
Elastic net            Null    211150     -745
Elastic net            Hits    45514      20207
Both methodologies give an increase in the rank

Rank changes
Method                 Group   Rank Up   Rank Down   Rank Same
Frequency difference   Hits    21        28          3
Elastic net            Hits    24        26          2
Some go up… some go down.

Rank changes
Method                 Group   Rank Up   Rank Down   Rank Same
Frequency difference   Null    4         48          0
Frequency difference   Hits    21        28          3
Elastic net            Null    11        41          0
Elastic net            Hits    24        26          2
More go up… fewer go down.
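
One way such rank changes could be tabulated is sketched below in Python. It assumes SNPs are ranked by association Bayes factor before the update and by the product of association and annotation Bayes factors afterwards, with a positive change meaning the SNP moved up the list. These conventions, the function names and the example numbers are all assumptions for illustration; the slides do not state how the ranks were computed.

```python
def rank_descending(scores):
    """Rank 1 goes to the largest score; ties keep their input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_changes(bf_assoc, bf_annot):
    """Rank before minus rank after; positive means the SNP moved up."""
    before = rank_descending(bf_assoc)
    after = rank_descending([a * b for a, b in zip(bf_assoc, bf_annot)])
    return [b - a for b, a in zip(before, after)]


def summarise(changes):
    """Count SNPs moving up, down or staying put, plus the mean change."""
    return {
        "up": sum(c > 0 for c in changes),
        "down": sum(c < 0 for c in changes),
        "same": sum(c == 0 for c in changes),
        "mean_change": sum(changes) / len(changes),
    }


# Hypothetical example: four SNPs, the third being a known hit.
bf_assoc = [10.0, 5.0, 1.0, 0.5]
bf_annot = [1.0, 1.0, 20.0, 1.0]
is_hit = [False, False, True, False]

changes = rank_changes(bf_assoc, bf_annot)
print("hits ", summarise([c for c, h in zip(changes, is_hit) if h]))
print("nulls", summarise([c for c, h in zip(changes, is_hit) if not h]))
```

Summaries are computed separately for hits and nulls because the changes across all SNPs in one ranking always sum to zero.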

Rank Differences

Example: Psoriasis data
•  GWAS
   –  Strange et al. Nature Genetics, 2010.
•  21 SNPs
   –  8 known
   –  8 replicated in our study
   –  5 did not replicate

Psoriasis data - Annotation Bayes Factors
[Figure: Bayes factors (axis 1 to 6) for Hits and Nulls]

Psoriasis data - Important annotations
Bayes Factor   nsSNP   prom   UCSC genes   EBF   PAX5C   POL2H   PU1
2.1            1       0      1            0     0       0       0
1.5            0       0      1            0     0       0       0
2.1            1       0      1            0     0       0       0
1.0            0       0      0            0     0       1       0
1.4            0       0      1            0     0       0       0
1.0            0       0      0            0     0       1       0
1.5            0       0      1            0     0       0       0
5.9            1       1      1            1     1       1       1
2.1            1       0      1            0     0       0       0
1.5            0       0      1            0     0       0       0
UCSC gene important, as is non-synonymous SNP
POL2 important

Autoimmune GWAS hits only - Prediction distributions

Autoimmune GWAS hits only – Bayes Factors
[Figure: two panels, All and Autoimmune; Bayes factors (axis 1 to 6) for Hits and Nulls; one value labelled 17.5]

Conclusions
•  Adding the annotation data is free and, on average across the SNPs, up-weights truly causal SNPs and down-weights non-causal SNPs
•  We recommend its use for SNPs in the grey zone
•  The same methodology can be applied to sequencing data, once empirical information becomes available

Acknowledgements
•  Mike Weale
•  Dan Crouch
•  Jen Mollon
•  Irene Rebollo Mesa
•  Richard Trembath, GAP, WTCCC
•  Department of Health
   –  through the National Institute for Health Research (NIHR) comprehensive Biomedical Research Centre (BRC) awards
•  Guy’s and St. Thomas’ National Health Service Foundation Trust
•  MRC centre for transplantation
•  King’s College London