HIT’nDRIVE: Multi-driver Gene Prioritization
Based on Hitting Time
Raunak Shrestha, Ermin Hodzic, Jake Yeung, Kendric Wang, Thomas
Sauerwald, Phuong Dao, Shawn Anderson, Himisha Beltran, Mark A.
Rubin, Colin C. Collins, Gholamreza Haffari and S. Cenk Sahinalp
RECOMB 2014
Speaker: Giulio Rossetti
Background
During the course of cancer evolution, tumor cells accumulate genomic aberrations
• Most are “passenger” aberrations
• A few are “driver” aberrations
Driver aberrations are expected to confer a growth advantage
– Thus they are potential therapeutic targets
Problem Statement
Identify the “most parsimonious” set of driver genes that can
collectively influence (possibly distant) “outlier” genes
– “Most parsimonious” set: the smallest set of driver genes
– Desired target: the largest possible portion of the “outlier” genes
HIT’nDRIVE
Integrate genome and transcriptome data from tumor samples to
identify and prioritize potential drivers
Goal:
– Identify the most parsimonious set that explains most of the observed gene
expression alterations
Approach:
– “Link” aberrations at the genomic level to gene expression profile alterations
• Gene/protein interaction network
– Random Walk Facility Location (RWFL)
• Multi-source hitting time
• ILP formulation
Multi-Source Hitting Time
Hitting Time:
Let $\tau_{u,v}$ be the number of hops a random walk starting from a given
driver $u$ takes to hit a given target gene $v$ for the first time. The hitting
time is its expectation:
$H_{u,v} = E[\tau_{u,v}]$
Multi-source Hitting Time:
$H_{U,v} = E[\min_{u \in U} \tau_{u,v}]$
with $v \in V \setminus U$
Estimating Hitting Time
$H_{u,v}$ can be estimated empirically by performing independent random walks
(from $u$ to $v$) and taking the average of the observations
Convergence Theorem (proof omitted)
$C > 0$ constant, $\varepsilon \in [1/n^4, 1]$
After $m = (128C)^2 (1/\varepsilon)^2 (\log_2 n)^3$ iterations, the empirical estimate $\hat{H}_{u,v}$ satisfies:
$\Pr[\,|H_{u,v} - \hat{H}_{u,v}| \le \varepsilon n\,] \ge 1 - n^{-3}$
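The empirical estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the adjacency-list graph, the function name `estimate_hitting_time`, and the walk count `m=20000` are all assumptions chosen for the example (the theorem's bound on m is not used here).

```python
import random

def estimate_hitting_time(adj, u, v, m, rng):
    """Average the hop counts of m independent random walks from u to v.

    adj is a hypothetical adjacency list: node -> list of neighbours.
    """
    total_hops = 0
    for _ in range(m):
        node, hops = u, 0
        while node != v:                  # walk until v is hit for the first time
            node = rng.choice(adj[node])  # move to a uniformly random neighbour
            hops += 1
        total_hops += hops
    return total_hops / m

# Toy undirected path graph 0 - 1 - 2 (illustrative only).
adj = {0: [1], 1: [0, 2], 2: [1]}
h_hat = estimate_hitting_time(adj, 0, 2, m=20000, rng=random.Random(0))
```

For this toy graph the exact value is H(0,2) = 4 (solve h0 = 1 + h1, h1 = 1 + h0/2), so `h_hat` should land close to 4.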
Multi-Source via Single-Source (accuracy proof omitted)
$H_{U,v}$ can be estimated as a function of the independent pairwise hitting times $H_{u_i,v}$ for
all $u_i \in U$, with $k = |U|$:
$H_{U,v} \approx \dfrac{1}{\sum_{i=1}^{k} \dfrac{1}{H_{u_i,v}}}$
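As a small worked example of this approximation (the multi-source hitting time taken as the reciprocal of the sum of reciprocal single-source hitting times), the sketch below combines two pairwise values. The function name and the input numbers are hypothetical.

```python
def multi_source_estimate(pairwise_hitting_times):
    """Approximate H_{U,v} as 1 / sum_i (1 / H_{u_i,v})."""
    return 1.0 / sum(1.0 / h for h in pairwise_hitting_times)

# Two hypothetical drivers with pairwise hitting times 4 and 12 to the same target.
approx = multi_source_estimate([4.0, 12.0])   # 1 / (1/4 + 1/12) = 3
```

Under this approximation, adding a source can only lower the estimate, and a single source returns its own hitting time unchanged.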
RWFL: Random Walk Facility Location
Seek a set of “facilities” (nodes) in a graph such that the maximum
distance from any node in the graph to its closest facility is minimized.
Distance Function:
Multi-Source Hitting Time
Given:
$\mathcal{X}$: set of potential drivers
$Y$: set of outlier genes
$k$: user-defined threshold
$\arg\min_{X \subseteq \mathcal{X},\ |X| = k} \ \max_{y \in Y} H_{X,y}$
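To make the min-max objective concrete, here is a brute-force sketch on a tiny instance. It is only an illustration under stated assumptions: the pairwise hitting times in `H` are made up, the multi-source value is computed with the reciprocal-sum approximation, and exhaustive enumeration is feasible only for very small inputs.

```python
from itertools import combinations

def rwfl_brute_force(H, candidates, outliers, k):
    """Return the size-k driver set minimizing the worst-case hitting time.

    H maps (driver, outlier) pairs to pairwise hitting times (hypothetical data).
    """
    def h_multi(X, y):
        # Reciprocal-sum approximation of the multi-source hitting time.
        return 1.0 / sum(1.0 / H[(x, y)] for x in X)

    best_set, best_cost = None, float("inf")
    for X in combinations(candidates, k):
        cost = max(h_multi(X, y) for y in outliers)  # farthest outlier from X
        if cost < best_cost:
            best_set, best_cost = set(X), cost
    return best_set, best_cost

H = {
    ("a", "y1"): 2.0,  ("a", "y2"): 10.0,
    ("b", "y1"): 10.0, ("b", "y2"): 2.0,
    ("c", "y1"): 4.0,  ("c", "y2"): 4.0,
}
best_set, best_cost = rwfl_brute_force(H, ["a", "b", "c"], ["y1", "y2"], k=2)
```

Here drivers `a` and `b` are each close to a different outlier, so together they give the smallest worst-case value (5/3), beating any pair that includes the "average" driver `c`.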
Observations
– Minimizing the hitting time maximizes the “influence” of the
drivers on the outliers
– Multi-source hitting time captures the uncertainty in
molecular interactions during the propagation of one or
more signals
– RWFL is an NP-hard problem
• Introduce an estimate to transform it into a Weighted Multi-Set
Cover problem, solvable through an ILP formulation
RWFL as Minimum Weight Multi-set cover
[Figure: bipartite graph between genomic aberrations and each patient's expression-altered genes; if gene $g_i$ is mutated in patient $p$, its edge to expression-altered gene $g_j$ of that patient has weight $H^{-1}_{g_i,g_j}$]
WMSC asks for the smallest driver gene set that “sufficiently”
covers “most” of the patient-specific expression-altered genes
ILP for WMSC
• $x_i$: potential driver
• $y_j$: expression alteration event
• $e_{i,j}$: edge in the bipartite graph
1. A selected driver contributes to the coverage of
each expression alteration it is connected to
2. The selected driver genes cover at least a fraction γ of the
sum of all incoming weights to each expression
alteration event
– Sets a lower bound on the joint influence of the drivers
3. The selected driver set collectively covers at least a fraction α
of the expression alteration events
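The three conditions above can be checked directly on a toy instance. The sketch below is not an ILP solver: it stands in for one by enumerating driver subsets in order of increasing size, which is only practical for a handful of genes. The weight table, the gene and event names, and the function name `solve_wmsc` are all hypothetical.

```python
from itertools import combinations

def solve_wmsc(weights, gamma, alpha):
    """Smallest driver set covering at least a fraction alpha of the events.

    weights[event][driver] is the edge weight (e.g. an inverse hitting time).
    An event counts as covered when the selected drivers account for at least
    a fraction gamma of the sum of all its incoming weights.
    """
    drivers = sorted({d for w in weights.values() for d in w})
    needed = alpha * len(weights)
    for size in range(len(drivers) + 1):          # smallest sets first
        for selected in combinations(drivers, size):
            covered = sum(
                1 for w in weights.values()
                if sum(w.get(d, 0.0) for d in selected) >= gamma * sum(w.values())
            )
            if covered >= needed:
                return set(selected)
    return None

weights = {
    "y1": {"g1": 0.5, "g2": 0.1},
    "y2": {"g1": 0.4, "g3": 0.1},
    "y3": {"g2": 0.2, "g3": 0.2},
}
small = solve_wmsc(weights, gamma=0.7, alpha=0.6)
full = solve_wmsc(weights, gamma=0.7, alpha=1.0)
```

With α = 0.6 a single driver (`g1`) suffices because it dominates the incoming weight of two of the three events; demanding α = 1.0 forces all three drivers to be selected, which illustrates how α trades driver-set size against coverage.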
Experiments
Goal:
– Test if HIT’nDRIVE predictions provide insight into cancer phenotype
– Improve driver classification accuracy
Evaluation Approach:
– Classifiers based on network “modules” (sets of functionally related genes
connected in an interaction network, including at least one driver)
• Module identification by OptDis
Dataset:
Glioblastoma Multiforme (GBM) samples from The Cancer Genome Atlas (TCGA)
PPI network from the Human Protein Reference Database (HPRD)
Issue:
Evaluating cancer driver predictors is challenging when no ground truth is available
Adoption of the Cancer Gene Census (CGC) and the Catalogue of Somatic Mutations in Cancer (COSMIC)
Evaluation based on CGC and COSMIC
Analyze the concordance of predicted drivers with genes annotated in CGC
and COSMIC
– Test for γ = 0.7, α ∈ {0.1, 0.2, …, 0.9}
Results
– The fraction of driver genes affiliated with cancer in the databases increases as α increases
– With γ = 0.7, α = 0.9 we get 107 drivers covering the majority of outliers in 156 patients
Phenotype Classification using Dysregulated
Modules Seeded with the predicted Drivers
Approach
1. Drivers identified from TCGA were used as seeds for discovering discriminative
subnetwork modules
2. Module expression profiles were used to classify normal vs. glioblastoma samples (KNN
classifier, k=1)
Results
– HIT’nDRIVE outperforms DriverNet: max accuracy 96.9%, avg accuracy 93.4%
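The classification step (nearest-neighbour with k=1 on module expression profiles) can be sketched as follows. The feature vectors, the labels, and the leave-one-out evaluation protocol are assumptions made for illustration; the slides do not state how accuracy was measured, and the numbers below are not the reported results.

```python
import math

def one_nn_predict(train, query):
    """Return the label of the training sample closest to query (Euclidean)."""
    _, label = min(train, key=lambda sample: math.dist(sample[0], query))
    return label

def leave_one_out_accuracy(samples):
    """Classify each sample using all other samples as the training set."""
    correct = 0
    for idx, (features, label) in enumerate(samples):
        rest = samples[:idx] + samples[idx + 1:]
        correct += one_nn_predict(rest, features) == label
    return correct / len(samples)

# Hypothetical two-module expression profiles (one value per module).
samples = [
    ((0.1, 0.2), "normal"), ((0.2, 0.1), "normal"), ((0.0, 0.3), "normal"),
    ((2.1, 1.9), "tumor"),  ((1.8, 2.2), "tumor"),  ((2.0, 2.0), "tumor"),
]
accuracy = leave_one_out_accuracy(samples)
```

With k=1 there is nothing to tune, so classification quality depends entirely on how well the module features separate the two classes, which is what makes it a reasonable probe of the seeded modules.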
Sensitivity and Prediction of Frequent/Rare Drivers
Sensitivity
Random swap of edge endpoints (20%)
and recomputation of hitting times
– Less than 10% change w.r.t. the original
values
– Limited impact on classification
accuracy
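One way to realize the edge-endpoint perturbation is a double-edge swap, sketched below. This is an assumption: the slides only say that endpoints of 20% of the edges were randomly swapped, not that node degrees were preserved or how invalid swaps were handled. The function name and the ring graph are hypothetical.

```python
import random

def swap_edge_endpoints(edges, fraction, rng):
    """Rewire about `fraction` of the edges: (a-b, c-d) becomes (a-d, c-b)."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    target = int(fraction * len(edges))
    changed, attempts = 0, 0
    while changed < target and attempts < 100 * len(edges):
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        # Reject swaps that would create self-loops or duplicate edges.
        if len(new1) < 2 or len(new2) < 2 or new1 in present or new2 in present:
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        changed += 2                      # two edges were rewired by this swap
    return edges

# Hypothetical ring network with 20 nodes.
edges = [(i, (i + 1) % 20) for i in range(20)]
perturbed = swap_edge_endpoints(edges, 0.2, random.Random(1))
```

A double-edge swap keeps every node's degree unchanged, so any shift in the recomputed hitting times reflects the rewiring itself rather than a change in how connected each gene is.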
Prediction
– Identified frequent drivers harbour
different types of genomic aberrations
in different patients
– HIT’nDRIVE also identifies infrequent
drivers (genes aberrant in at most 2%
of the cases)
Prediction of Low and High Degree Drivers
HIT’nDRIVE predictions include:
1. Well-known high-degree hubs that also have high betweenness
in the PPI network (e.g. TP53, EGFR)
– If perturbed, they dysregulate several other genes and the
associated signaling pathway
2. Low-degree peripheral genes (e.g. IFNA2, UTY)
HIT’nDRIVE: Conclusion
– A combinatorial method to capture the collective effects of driver gene
aberrations on “outlier” genes
– Based on Random Walk Facility Location
• Multi-Source Hitting Time
• Reduction to minimum Weighted Multi-Set Cover (ILP formulation)
– Predicted driver genes are well supported in cancer gene databases
– Identified drivers are able to outperform state-of-the-art phenotype
predictors