Accelerating Gene Set Enrichment
Analysis on CUDA-Enabled GPUs
Bertil Schmidt
Christian Hundt
Contents
• Gene Set Enrichment Analysis (GSEA)
– Background
– Algorithmic details
• cudaGSEA
• Performance evaluation
GSEA and Bioinformatics
• High throughput technologies generate large-scale
gene expression data sets
– RNA-Seq
– Microarrays
• GSEA uses annotated gene sets to mine a given gene
expression matrix
– MSigDB contains over 10K signatures each containing
around 100 gene identifiers on average
• Typical GSEA study:
– identify metabolic pathways that are differentially
changed in human type-2 diabetes
Gene Set Enrichment Analysis
• Reveals correlation between gene
sets and diseases using gene
expression data
• State-of-the-art tool with over
10,000 citations
• Written in (multi-threaded) Java
• Highly time consuming
– analyzing 20,639 genes measured
in 200 patients with 4,726
pathways and 1M permutations
takes around 1 week with GSEA
2.2.2 software on a CPU
• We present
– GSEA parallelization on a GPU
using CUDA (cudaGSEA)
– cudaGSEA is around two orders of magnitude faster than BroadGSEA
GSEA Algorithm – Gene Ranking
• Gene expression matrix D obtained from RNA-Seq or Microarray experiments
• For each gene i and patient j with associated (binary) phenotype C, the expression value
D[i,j] is stored
• Diseases are driven by complex gene interactions → simply reporting top-ranked genes
produces many false positives
• Domain experts provide sets of genes that might explain the observed
phenotypes
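The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the cudaGSEA implementation: it assumes the signal-to-noise ratio as the ranking measure, and the names `signal_to_noise` and `rank_genes` as well as the toy matrix are made up for this example.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Signal-to-noise ratio of one gene: (mean_1 - mean_0) / (std_1 + std_0).

    Assumes at least two patients per phenotype and a non-zero
    standard deviation in at least one of the two groups.
    """
    case = [v for v, c in zip(values, labels) if c == 1]
    ctrl = [v for v, c in zip(values, labels) if c == 0]
    return (mean(case) - mean(ctrl)) / (stdev(case) + stdev(ctrl))

def rank_genes(D, labels):
    """Return gene indices sorted by descending score, plus the scores g(i).

    D holds one row of expression values per gene (D[i][j]);
    labels holds one binary phenotype (0 or 1) per patient j.
    """
    scores = [signal_to_noise(row, labels) for row in D]
    order = sorted(range(len(D)), key=lambda i: -scores[i])
    return order, scores

# Hypothetical toy data: 3 genes measured in 4 patients.
D = [
    [1.0, 2.0, 5.0, 6.0],  # gene 0: higher in phenotype 1
    [5.0, 6.0, 1.0, 2.0],  # gene 1: higher in phenotype 0
    [3.0, 3.5, 3.0, 3.5],  # gene 2: no difference between phenotypes
]
labels = [0, 0, 1, 1]
order, scores = rank_genes(D, labels)
print(order)  # [0, 2, 1]
```

Genes at the top of the ranking are over-expressed in phenotype 1, genes at the bottom are over-expressed in phenotype 0.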
GSEA Algorithm – Enrichment score
• Enrichment score (ES) measures the correlation between a given gene set S and
the calculated gene ranking g(i)
– Report the maximum deviation from zero of a running sum over rank positions k
– Sum increases if we hit a member of S and decreases otherwise
• How significant is ES = 0.857? → p-value calculation using permutation testing
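A running-sum enrichment score can be sketched as follows. The step sizes are an assumption: this is the unweighted, Kolmogorov-Smirnov-style variant in which every hit adds 1/|S| and every miss subtracts 1/(N - |S|), so the sum ends at zero. The slides do not state which weighting cudaGSEA uses, and `enrichment_score` is a hypothetical name.

```python
def enrichment_score(order, gene_set):
    """Maximum deviation from zero of the running sum along the ranking.

    order    : gene indices from top-ranked to bottom-ranked
    gene_set : collection of gene indices forming the set S
    """
    order = list(order)
    gene_set = set(gene_set)
    n_hit = sum(1 for g in order if g in gene_set)
    n_miss = len(order) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("S must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for g in order:
        # Increase on a member of S, decrease otherwise.
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranking = [0, 1, 2, 3, 4, 5]
print(enrichment_score(ranking, {0, 1}))  # 1.0  (S sits at the top)
print(enrichment_score(ranking, {4, 5}))  # -1.0 (S sits at the bottom)
print(enrichment_score(ranking, {0, 3}))  # 0.5  (S is spread out)
```

A score close to +1 or -1 means the members of S cluster at one end of the ranking; a score near 0 means they are scattered.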
GSEA Algorithm – Permutation testing
GSEA Algorithm
[Figure: histogram of permuted enrichment scores with the tails marked at -|ES| and |ES|]
• Histogram of 1,000,000 enrichment scores gained by permuting
patient phenotypes
• Estimate p-value by counting events in both tails
• Why so many permutations?
– When testing 1,000 gene sets at significance level 0.001, Bonferroni correction
requires 1,000 · p < 0.001, i.e. p < 0.000001 for each gene set
– Resolving such a p-value by counting needs more than 1,000,000 permutations
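The permutation test can be sketched end to end in plain Python. Several choices here are assumptions made for brevity rather than details taken from the slides: the ranking uses the difference of means, the enrichment score is the unweighted running sum, the p-value is the plain fraction of permutations in both tails, and all function names are hypothetical.

```python
import random

def diff_of_means(row, labels):
    case = [v for v, c in zip(row, labels) if c == 1]
    ctrl = [v for v, c in zip(row, labels) if c == 0]
    return sum(case) / len(case) - sum(ctrl) / len(ctrl)

def es_for_labels(D, labels, gene_set):
    """Rank genes for the given phenotype labels and return the ES of gene_set."""
    scores = [diff_of_means(row, labels) for row in D]
    order = sorted(range(len(D)), key=lambda i: -scores[i])
    n_hit = len(gene_set)
    n_miss = len(order) - n_hit
    running = best = 0.0
    for g in order:
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(D, labels, gene_set, n_perm=1000, seed=0):
    """Estimate a two-sided p-value by permuting the patient phenotypes."""
    rng = random.Random(seed)
    observed = es_for_labels(D, labels, gene_set)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # new random assignment of phenotypes
        if abs(es_for_labels(D, shuffled, gene_set)) >= abs(observed):
            extreme += 1       # event in one of the two tails
    return observed, extreme / n_perm
```

With this estimator the smallest non-zero p-value is 1/n_perm, which is why a Bonferroni-corrected threshold of 0.000001 calls for more than 1,000,000 permutations.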
CUDA Parallelization
• Transpose D to ensure coalesced memory accesses
CUDA Implementation Details
• Support for single-precision and double-precision
• Resulting matrix of enrichment scores (#gene sets x
#permutations) can be large
– e.g. 5K x 1M x 8B = 40GB
• p-value estimation, Family-wise error rate (FWER),
normalized enrichment score (NES) computation can
be accomplished on the GPU with (sum/max)
reduction kernels without the need for storing this
matrix
• For false discovery rate (FDR) computation, this matrix is
transferred to the CPU for post-processing
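The memory estimate and the idea of reducing on the fly can be illustrated with a small CPU-side sketch. It is not the GPU kernel: the batches of enrichment scores are plain nested lists standing in for what a kernel would produce, and `es_matrix_bytes` and `streaming_p_values` are hypothetical names.

```python
def es_matrix_bytes(n_gene_sets, n_permutations, bytes_per_value):
    """Size of the full (#gene sets x #permutations) enrichment score matrix."""
    return n_gene_sets * n_permutations * bytes_per_value

def streaming_p_values(observed, es_batches):
    """Sum-reduce tail counts batch by batch instead of storing all scores.

    observed   : observed ES per gene set
    es_batches : iterable of batches; each batch is a list of rows,
                 one row per permutation with one ES per gene set
    """
    counts = [0] * len(observed)
    total = 0
    for batch in es_batches:          # each batch can be discarded after use
        for row in batch:
            total += 1
            for s, es in enumerate(row):
                if abs(es) >= abs(observed[s]):
                    counts[s] += 1
    return [c / total for c in counts]

print(es_matrix_bytes(5_000, 1_000_000, 8))  # 40000000000 bytes = 40 GB

observed = [0.8, -0.5]
batches = [
    [[0.9, 0.1], [0.2, -0.6]],    # permutations 1-2
    [[-0.85, 0.4], [0.1, 0.3]],   # permutations 3-4
]
print(streaming_p_values(observed, batches))  # [0.5, 0.25]
```

Only one counter per gene set survives between batches, so the memory footprint is independent of the number of permutations.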
cudaGSEA Features
• Reading data sets directly in Broad Institute-compatible file
formats
• Supporting several local deviation measures
– Mean-based measures (difference/quotient/log-quotient of
means)
– Mean and standard deviation-based measures (signal-to-noise ratio, t-tests, one/two-pass estimation)
– Numerically stable summation schemes for local measures and
ES (Kahan etc.)
• Package for the R framework and standalone application
• Multi-threaded CPU version in C++ using OpenMP
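Kahan summation, named above as one of the numerically stable schemes, can be sketched as follows. This is the textbook algorithm in Python, shown for illustration only; how cudaGSEA applies it inside its kernels is not described in the slides.

```python
import math

def naive_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Compensated summation: carries the rounding error of each addition."""
    total = 0.0
    compensation = 0.0                      # running estimate of lost low-order bits
    for v in values:
        y = v - compensation                # re-inject previously lost bits
        t = total + y                       # low-order bits of y may be lost here
        compensation = (t - total) - y      # recover what was lost
        total = t
    return total

# One large value followed by many values too small to change it individually.
values = [1.0] + [1e-16] * 100_000
print(naive_sum(values) == 1.0)                                # True: small terms vanish
print(abs(kahan_sum(values) - math.fsum(values)) < 1e-14)      # True: small terms are kept
```

The same effect matters when means, standard deviations, or running sums are accumulated in single precision over many patients or genes.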
Performance Evaluation
• GSE19429 dataset
– collapsed to 20,639 gene symbols; 200 patients (183 cases + 17 controls)
• Hallmark: 50 gene sets
– MSigDB 5.1 smallest gene set collection
• GeForce Titan X (single precision) / Tesla K40c (double precision, ECC off), CUDA 7.5
• 10-core Xeon E5-2660v3, 20 threads, Ubuntu 14.04, gcc 4.8.4, 64-bit OpenJDK
• BroadGSEA v.2.2.2
Performance Evaluation
• GSE19429 dataset
– collapsed to 20,639 gene symbols; 200 patients (183 cases + 17 controls)
• C2: 4,726 gene sets
– MSigDB 5.1 largest gene set collection
• GeForce Titan X (single precision) / Tesla K40c (double precision, ECC off), CUDA 7.5
• 10-core Xeon E5-2660v3, 20 threads, Ubuntu 14.04, gcc 4.8.4, 64-bit OpenJDK
• BroadGSEA v.2.2.2
Conclusion
• High-throughput technologies establish the need for
scalable bioinformatics tools that can process large-scale gene expression data sets
• CUDA is a suitable technology to address this need
• cudaGSEA on one GPU achieves a speedup of around two orders of magnitude versus BroadGSEA on a CPU
– analyzing 20,639 genes measured in 200 patients with
4,726 pathways and 1M permutations takes around 1
week with GSEA 2.2.2 on a Xeon E5-2660v3 CPU while less
than 1 hour on a GeForce Titan X
• Source code available at:
– https://github.com/gravitino/cudaGSEA
• Group Website:
– https://www.hpc.informatik.uni-mainz.de/
Thank you!
Bertil Schmidt, Christian Hundt
Institute of Computer Science
Johannes Gutenberg University Mainz
{bertil.schmidt, hundt}@uni-mainz.de