Systems analysis of innate immune
mechanisms in infection – a role for HPC
Peter Ghazal
What is Pathway Biology?
Pathway biology is….
A systems biology approach for understanding a biological process
- empirically by functional association of
multiple gene products & metabolites
- computationally by defining networks of cause-effect
relationships.

Pathway Models link molecular; cellular; whole organism levels.
FORMAL MODELS --- ALLOW PREDICTING the outcome of Costly or
Intractable Experiments
Focus and outline of talk
• High-throughput approaches to mapping and
understanding the host response to infection.
• Targeting the host NOT the “bug” as anti-infective
strategy
• Making HPC more accessible: SPRINT a new framework
for high dimensional biostatistic computation
Story starts at the bed side
Differentially expressed genes in neonates
control vs infected (FDR p < 1×10⁻⁵, FC ± 4)
Sterol/
lipogenic
Dealing with HTP data:
Impact of data variability
• Model for introducing biological and technical variation:

  y_ij = μ + b_i + ε_ij ,  i = patient, j = replicate

  where b_i ~ N(0, σ_b²) is the patient (biological) effect and
  ε_ij ~ N(0, σ_ε²) is the replicate (technical) error.

  Total variation: σ² = σ_b² + σ_ε²
  (Total variation = Biological variation + Technical variation)
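The variance-components model can be checked with a short simulation; the parameter values and sample sizes below are illustrative assumptions, not figures from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Variance-components model: y_ij = mu + b_i + e_ij
# b_i  ~ N(0, sigma_b^2)   patient-to-patient (biological) variation
# e_ij ~ N(0, sigma_e^2)   replicate-to-replicate (technical) variation
mu, sigma_b, sigma_e = 10.0, 2.0, 1.0
n_patients, n_reps = 2000, 5

b = rng.normal(0.0, sigma_b, size=(n_patients, 1))        # one effect per patient
e = rng.normal(0.0, sigma_e, size=(n_patients, n_reps))   # one error per replicate
y = mu + b + e

# Total variation should be close to sigma_b^2 + sigma_e^2 = 5.0
print(round(y.var(), 2))
```

The empirical variance of the pooled measurements recovers the sum of the biological and technical components, which is the decomposition the slide states.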
Modelling patient variability and biomarkers for classification
How do different data characteristics affect misclassification errors?
Factors investigated:
Data variability (biological and technical variations)
Training set size
Number of replications
Correlation between RNA biomarkers
Machine Learning methods:
Random Forest (RF)
Support Vector Machine (SVM)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbour (K-NN)
Mizanur Khondoker
Error rate vs. (number of biomarkers, total variation)
An example of a simulation model to quantify number of biomarkers
and level of patient variability
Conclusions from simulations
• There is increased predictive value using multiple markers – although there is
no magic number that can be recommended as optimal in all situations.
• Optimal number greatly depends on the data under study.
• The important determining factors of optimal number of biomarkers are:
• The degree of differential expression (fold-change, p-values etc.)
• Amount of biological and technical variation in the data.
• The size of the training set upon which the classifier is to be built.
• The number of replications for each biomarker.
• The degree of correlation between biomarkers.
• Now possible to predict optimal number through simulation.
Rule of five: Criteria for pathogenesis
based biomarkers
• Readily accessible
• Multiple markers
• Appropriately powered statistical association
• Physiological relevance
• Causally linked to phenotype
Key challenge is mapping biomarkers into:
biological context and understanding
Requires an experimental model system
Bone Marrow
Blood
?
Tissue
Monocyte
?
Resident Macrophage
(immature)
Promonocyte
Lymphokines
Activated
T-Lymphocyte
Myeloid
Stem Cell
(Primary Signal)
Inflammation
IFN-gamma
Primed Macrophage
(Secondary Signal)
Endotoxin,
IFN-gamma
Pluripotent
Stem Cell
“Activated”
Cytolytic Macrophage
Transcriptional profile of MΦ activated by Ifng
How do we tackle this?
A sub-system study of cause effect relationships with a defined start (input)
and end (output).
Literature
Data-mining
PATHWAY BIOLOGY
Modelling
Network analysis
Experimentation
genetic screens
microarrays
Y2H
mechanism based
studies
Mapping new nodes
Literature
Data-mining
PATHWAY BIOLOGY
Experimentation
Transcriptional profile of MΦ
infected with CMV
Hypothesis generation
• Blue zone vs red zone
Down regulation of sterol pathway
BUT…
recorded changes are small – Do they have any
effect?
Next step modelling
PATHWAY BIOLOGY
Experimental
data
Pure and applied modelling
Network inference analysis
Workflow
Literature
derived model
Known parameters
Order of magnitude
estimation
ODE model
Unknown parameters
Vary parameters
by an order of
magnitude
Ensemble
average
Ensemble of
ODE models
Results
Cholesterol Synthesis
Modelling
ODE model, Michaelis-Menten interactions
• 57 Parameters
• 25 Known Parameters
• 32 Unknown Parameters
Algorithm
• Using the first three time points, calculate an equilibrium state
• Release model from equilibrium and simulate using enzyme data
• For each unknown, consider this model across 3 orders of magnitude,
holding the other unknowns parameters fixed.
Where available, parameters obtained from
the Brenda enzyme database
http://www.brenda-enzymes.info/
Cholesterol (output of sterol pathway) results
from simulation and expts
Free intra-cellular cholesterol concentration in NIH-3T3 fibroblast
Predictions:
Experiments:
120
Relative quantity in %
100
80
Mock
C3X moi 1
60
40
20
0H
6H
24h
48H
72H
Hours post infection
Cholesterol rate/flux
Cholesterol levels
Lipidomic – mass spec results
• Infection down regulate cholesterol biosynthesis
pathway and free intra-cellular cholesterol.
• Can now predict the behaviour of the pathway.
• But?
• Just as a good as UK (Met Office) weather
predictions……because……
Scalability issues related to
increased complexity
HPC for
High Throughput Post-Genomic Data
• Increasing complexity and size of biological data
• Solution: High Performance Computing (HPC)?
Problems with large biological data
sets
– Volume of data
• Many research groups can now routinely generate high volumes of
data
– Memory (RAM) handling:
• Input data size is too big
• Algorithms cause linear, exponential or other growth in
data volume
– CPU performance:
• Routine analyses take too long
Limitation examples:
Clustering
• Gene clustering using R on a high-spec workstation:
– 16,000 genes, k=12 gene clusters runs for ~30min
– 16,000 genes, k=40 gene clusters runs for ~10hrs
Partitioning-Around-Medoids, n genes, k=12
clusters requested
Memory fail limit
Outcome: Adverse effect on
research
•
•
•
•
•
Arbitrary size reduction of input data
Batch processing of data
Analyses in smaller steps
Avoidance of some algorithms
Failure to analyse
Solution: High Performance
Computing
• HPC takes many forms:
– clusters, networks, supercomputer, grid, GPUs, “cloud”, ...
• Provides more computational power
• HPC is technically accessible for most:
– Department own, Eddie, HECToR,...
However!
HPC Access Hurdles
• Cost of access
• Time to adapt
• Complex, require specialist skills
• Consultancy (e.g. EPCC) only feasible
on ad-hoc basis, not routinely
HPC Access Hurdles
HPC is (currently) optimal for:
- Specific problems that can be tackled as a project
- Individuals who are familiar with parallelisation and
system architectures
HPC is not optimal for:
- Routine/casual analyses of high-throughput data
- Ad-hoc and ever-changing analyses algorithms
- Data analysts without time or knowledge to sidestep
into parallelisation software/hardware.
Need a step change
(up!) to broaden HPC
access to all biologists
Challenge two fold!!
• Provide a generic solution
• Easy to use
SPRINT (DPM & EPCC))
A solution for analyses using R
Post
Genomic
Data
R
Biological Results
Very Large
Post Genomic
Data
R
Very Large
Post Genomic
Data
HPC
(Eddie)
R
Biological Results
SPRINT
SPRINT
SPRINT has 2 components:
1. HPC harness manages access to HPC
2. Library of parallel R functions
e.g. cor (correlation)
pam (clustering)
maxt (permutation
 Allows non-specialists to make use of HPC
resources, with analysis functions parallelised
by us or the R community.
Code comparison
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
quit(save="no")
library("sprint")
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
pterminate()
quit(save="no")
Permutation Benchmark
Input Array
Data Size
Permutation
Count
Estimated
maxt 1 CPU
Pmaxt on 256
CPUs
(s)
36,612 x 76
500,000
6 hrs
73.18
36,612 x 76
1,000,000
12 hrs
46.64
73,224 x 76
500,000
10 hrs
148.46
100,000 x 320
1,000,000
20 hrs
294.61
Correlation Benchmark
Input Array Data Size
Output Array Data Size
pcor() on 256 CPUs
(s)
11,000 x 320
(27 MB)
923 MB
4.76
22,000 x 320
3.6 GB
13.87
9.1 GB
36.64
15 GB
42.18
(54 MB)
35,000 x 320
(85 MB)
45,000 x 320
(110 MB)
Clustering Benchmarks
Future
• Cloud (confidentiality issues)
• GPU (limitations is data size)
Viral
Interaction
Networks
Host
Interaction
Networks
Bed-bench-models-almost back to bed
Virus
Antiviral
Systemic Therapeutic
Host
New therapeutic and diagnostic opportunities
THANK YOU
&
Acknowlegments to our sponsors
Acknowledgement
Mathieu Blanc
Steven Watterson
Mizanur Khondoker
Paul Dickinson
Thorsten Forster
Muriel Mewissen
EPCC
Terry Sloan
Jon Hill
Michal Piotrowski
Arthur Trew