HIT’nDRIVE: Multi-driver Gene Prioritization Based on Hitting Time

Raunak Shrestha, Ermin Hodzic, Jake Yeung, Kendric Wang, Thomas Sauerwald, Phuong Dao, Shawn Anderson, Himisha Beltran, Mark A. Rubin, Colin C. Collins, Gholamreza Haffari and S. Cenk Sahinalp
RECOMB 2014
Speaker: Giulio Rossetti

Background

During the course of cancer evolution, tumor cells accumulate genomic aberrations
• Most are “passenger” aberrations
• A few are “driver” aberrations

Driver aberrations are expected to confer a growth advantage
– Thus they have the potential to be used as therapeutic targets

Problem Statement

Identify the “most parsimonious” set of driver genes that can collectively influence (possibly) distant “outlier” genes
– “Most parsimonious” set: the smallest set of driver genes
– Desired target: the widest portion of “outlier” genes

HIT’nDRIVE

Integrate genome and transcriptome data from tumor samples to identify and prioritize potential drivers

Goal:
– Identify the most parsimonious set of drivers that explains most of the observed gene expression alterations

Approach:
– “Link” aberrations at the genomic level to gene expression profile alterations
  • Gene/protein interaction network
– Random Walk Facility Location (RWFL)
  • Multi-source hitting time
  • ILP formulation

Multi-Source Hitting Time

Hitting Time: the expected number of hops of a random walk that starts from a given driver ($u$) and hits a given target gene ($v$) for the first time, where $\tau_{u,v}$ is the number of hops of a single walk.
$H_{u,v} = E[\tau_{u,v}]$

Multi-source Hitting Time: $H_{U,v} = E[\min_{u \in U} \tau_{u,v}]$ with $v \in V \setminus U$

Estimating Hitting Time

$H_{u,v}$ can be empirically estimated by performing independent random walks (from $u$ to $v$) and taking the average of the observations

Convergence Theorem (proof omitted)
Let $C > 0$ be a constant and $\varepsilon \in [1/n^4, 1]$. After $m = (128C)^2 (1/\varepsilon)^2 (\log_2 n)^3$ iterations, the empirical estimate $\hat{H}_{u,v}$ satisfies
$\Pr[\,|H_{u,v} - \hat{H}_{u,v}| \le \varepsilon n\,] \ge 1 - n^{-3}$

Multi-Source via Single-Source (accuracy proof omitted)
$H_{U,v}$ can be estimated by a function of the independent pairwise hitting times $H_{u_i,v}$ for all $u_i \in U$:
$H_{U,v} \approx \dfrac{1}{\sum_{i=1}^{k} \dfrac{1}{H_{u_i,v}}}$

RWFL: Random Walk Facility Location

Seek a set of “facilities” (nodes) in a graph such that the maximum distance from any node in the graph to its closest facility is minimized.

Distance function: multi-source hitting time

Given:
– $\mathcal{X}$: set of potential drivers
– $\mathcal{Y}$: set of outlier genes
– $k$: user-defined threshold

$X^{*} = \arg\min_{X \subseteq \mathcal{X},\ |X| = k} \ \max_{y \in \mathcal{Y}} H_{X,y}$

Observations
– Minimizing the hitting time maximizes the driver “influence” on the outliers
– Multi-source hitting time captures the uncertainty in molecular interactions during the propagation of one or more signals
– RWFL is an NP-hard problem
  • Introduce an estimate to transform it into a Weighted Multi-Set Cover (WMSC) problem, solvable through an ILP formulation

RWFL as Minimum Weight Multi-Set Cover

[Figure: bipartite graph between genomic aberrations and the expression-altered genes of patient p. Edge label: gene $g_i$ is mutated in patient p, weight $H^{-1}_{g_i,g_j}$]

WMSC asks for the smallest driver gene set that “sufficiently” covers “most” of the patient-specific expression-altered genes

ILP for WMSC
• $x_i$: potential driver
• $y_j$: expression alteration event
• $e_{i,j}$: edge in the bipartite graph

1. A selected driver contributes to the coverage of each expression alteration event it is connected to
2. The selected driver genes cover at least a fraction γ of the sum of all incoming weights to each expression alteration event
   – Sets a lower bound on the joint influence of the drivers
3. The selected driver set collectively covers at least a fraction α of the expression
alteration events

Experiments

Goal:
– Test whether HIT’nDRIVE predictions provide insight into cancer phenotype
– Improve driver classification accuracy

Evaluation approach:
– Classifiers based on network “modules” (sets of functionally related genes connected in an interaction network, including at least one driver)
  • Module identification by OptDis

Dataset:
– Glioblastoma Multiforme (GBM) samples from The Cancer Genome Atlas (TCGA)
– PPI network from the Human Protein Reference Database (HPRD)

Issue:
– Evaluating a cancer driver predictor is challenging when no ground truth is available
– Adoption of the Cancer Gene Census (CGC) and the Catalogue of Somatic Mutations in Cancer (COSMIC)

Evaluation based on CGC and COSMIC

Analyze the concordance of the predicted drivers with the genes annotated in CGC and COSMIC
– Tested with γ = 0.7 and α ∈ {0.1, 0.2, …, 0.9}

Results
– The fraction of predicted driver genes associated with cancer in the databases increases as α increases
– With γ = 0.7 and α = 0.9 we get 107 drivers covering the majority of outliers in 156 patients

Phenotype Classification using Dysregulated Modules Seeded with the Predicted Drivers

Approach
1. Drivers identified from TCGA were used as seeds for discovering discriminative subnetwork modules
2. Module expression profiles were used to classify normal vs. glioblastoma samples (KNN classifier, k = 1)

Results
– HIT’nDRIVE outperforms DriverNet: max accuracy 96.9%, avg accuracy 93.4%

Sensitivity and Prediction of Frequent/Rare Drivers

Sensitivity
Random swap of edge endpoints (20%) and recomputation of hitting times
– Less than 10% change with respect to the original values
– Limited impact on classification accuracy

Prediction
– The identified frequent drivers harbour different types of genomic aberrations in different patients
– HIT’nDRIVE also identifies infrequent drivers (genes aberrant in at most 2% of the cases)

Prediction of Low- and High-Degree Drivers

HIT’nDRIVE predictions include:
1. Well-known high-degree hubs that also have high betweenness in the PPI network (e.g. TP53, EGFR)
   – If perturbed, they dysregulate several other genes and the associated signaling pathways
2. Low-degree peripheral genes (e.g. IFNA2, UTY)

HIT’nDRIVE: Conclusion

– A combinatorial method to capture the collective effects of driver gene aberrations on “outlier” genes
– Based on Random Walk Facility Location
  • Multi-source hitting time
  • Reduction to minimum Weighted Multi-Set Cover (ILP formulation)
– Predicted driver genes are well supported in cancer gene databases
– Classifiers seeded with the identified drivers outperform state-of-the-art phenotype predictors
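
Worked sketch (not part of the original talk)

The estimation and selection steps summarized in the slides (empirical hitting times from independent random walks, the harmonic-style approximation of multi-source hitting time, and the RWFL objective) can be illustrated on a toy example. Everything named here is an assumption made for illustration: the 7-node path graph stands in for a PPI network, the function names are hypothetical, and the exhaustive search over k-subsets replaces the ILP formulation used by HIT’nDRIVE, so it is only practical for very small inputs.

```python
import itertools
import random


def estimate_hitting_time(graph, u, v, walks, rng):
    """Average hop count of `walks` independent random walks from u to v."""
    total = 0
    for _ in range(walks):
        node = u
        while node != v:
            node = rng.choice(graph[node])  # uniform step to a neighbour
            total += 1
    return total / walks


def multi_source_estimate(pairwise, drivers, v):
    """Approximate H_{U,v} by 1 / sum_i (1 / H_{u_i,v})."""
    return 1.0 / sum(1.0 / pairwise[(u, v)] for u in drivers)


def rwfl_brute_force(pairwise, candidates, outliers, k):
    """Return the k-subset of candidates with the smallest worst-case
    (approximate) multi-source hitting time over all outliers.
    Exhaustive search: a stand-in for the ILP, toy sizes only."""
    def worst_case(drivers):
        return max(multi_source_estimate(pairwise, drivers, y) for y in outliers)
    return min(itertools.combinations(candidates, k), key=worst_case)


# Toy stand-in for an interaction network: the path 0-1-2-3-4-5-6.
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
candidates = [0, 3, 6]   # hypothetical potential drivers
outliers = [1, 5]        # hypothetical outlier genes (disjoint from drivers)

rng = random.Random(42)  # fixed seed so the run is reproducible
pairwise = {
    (u, y): estimate_hitting_time(graph, u, y, 4000, rng)
    for u in candidates
    for y in outliers
}

best_one = rwfl_brute_force(pairwise, candidates, outliers, k=1)
best_two = rwfl_brute_force(pairwise, candidates, outliers, k=2)
print("k=1:", best_one, "k=2:", best_two)
```

On this path the exact hitting times are 16 from node 3 to either outlier and 25 from an end node to the far outlier, so a single driver should come out as the central node 3, while a pair should come out as the two end nodes 0 and 6, each sitting next to one outlier. The pairwise table is computed once and reused for every subset, which mirrors the idea of replacing multi-source walks with a function of single-source estimates.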