BioScience on the TeraGrid
Daniel S. Katz
[email protected]
Director of Science, TeraGrid GIG
Senior Fellow, Computation Institute,
University of Chicago & Argonne National Laboratory
Affiliate Faculty, Center for Computation & Technology, LSU
Adjunct Associate Professor, Electrical and Computer Engineering, LSU
What is the TeraGrid
• World’s largest distributed cyberinfrastructure for open scientific research,
supported by US NSF
• Integrated high performance computers (>2 PF HPC & >27000 HTC CPUs),
data resources (>2 PB disk, >60 PB tape, data collections), visualization,
experimental facilities (VMs, GPUs, FPGAs), network at 11 Resource Provider
sites
• Allocated to US researchers and their collaborators through national peer-review
process
• DEEP: provide powerful computational resources to enable research that can’t
otherwise be accomplished
• WIDE: grow the community of computational science and make the resources
easily accessible
• OPEN: connect with new resources and institutions
• Integration: Single {portal, sign-on, help desk, allocations process, advanced
user support, EOT, campus champions}
http://www.teragrid.org/
Governance
• 11 Resource Providers (RPs) funded under separate
agreements with NSF
– Different start and end dates
– Different goals
– Different agreements
– Different funding models
• 1 Coordinating Body – Grid Infrastructure Group (GIG)
– University of Chicago/Argonne National Laboratory
– Subcontracts to all RPs and six other universities
– 7-8 Area Directors
– Working groups with members from many RPs
• TeraGrid Forum with Chair
Who Uses TeraGrid (2009)
[usage charts for 2009 and 2008 omitted]
How TeraGrid Is Used
Use Modality                                       Community Size (rough est., users)
Batch Computing on Individual Resources            850
Exploratory and Application Porting                650
Workflow, Ensemble, and Parameter Sweep            250
Science Gateway Access                             500
Remote Interactive Steering and Visualization       35
Tightly-Coupled Distributed Computation             10
(2006 data)
How One Uses TeraGrid
[Diagram: users reach TeraGrid through POPS (for now), the User Portal, Science Gateways, or the command line; the TeraGrid infrastructure (accounting, network, authorization, ...) connects them to compute, visualization, and data services at RP 1, RP 2, RP 3]
User Portal: portal.teragrid.org
http://portal.teragrid.org/
Science Gateways
• A natural extension of the Internet & Web 2.0
• Idea resonates with scientists
– Researchers can imagine scientific capabilities provided through a familiar interface
• Mostly web portals, or web or client-server programs
• Designed by communities; provide interfaces understood by those communities
– Also provide access to greater capabilities (back end)
– Without the user needing to understand the details of those capabilities
– Scientists know they can undertake more complex analyses, and that's all they want to focus on
– TeraGrid provides tools to help the developers
• Seamless access doesn't come for free
– Hinges on very capable developers (see the sketch after this slide)
Nancy Wilkins-Diehr
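A minimal sketch of the "familiar front end, powerful back end" pattern described above, assuming a gateway that turns a web-form request into a batch job. The form fields, queue name, and BLAST database here are hypothetical placeholders, and a production TeraGrid gateway would rely on community accounts and grid middleware rather than a bare qsub call; this only illustrates the idea.

import subprocess
import tempfile

def submit_gateway_job(sequence_file, program="blastp", queue="normal", walltime="01:00:00"):
    # Build a PBS batch script from the (already validated) form fields.
    # 'queue' and the blastp arguments are illustrative assumptions.
    script = f"""#!/bin/bash
#PBS -N gateway_job
#PBS -q {queue}
#PBS -l walltime={walltime}
cd $PBS_O_WORKDIR
{program} -query {sequence_file} -db nr -out {sequence_file}.out
"""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        script_path = f.name
    # qsub prints the new job ID; the gateway stores it so the user can poll
    # status and retrieve results through the same web interface later.
    result = subprocess.run(["qsub", script_path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

The gateway, not the scientist, owns this plumbing, which is why the slide stresses that seamless access hinges on very capable developers.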
TeraGrid -> XD Future
• Current RP agreements end in March 2011
– Except track 2 centers (current and future)
• TeraGrid XD (eXtreme Digital) starts in April 2011
– Era of potential interoperation with OSG and others
– New types of science applications?
• Current TG GIG continues through July 2011
– Allows four months of overlap in coordination
– Probable overlap between GIG and XD members
• Blue Waters (track 1) production in 2011
Grid Enabled Neurosurgical Imaging Using Simulation (GENIUS)
Model large-scale, patient-specific cerebral blood flow on clinically relevant time scales
• Provide simulation support within the operating theatre for neuroradiologists
• Provide new information to surgeons for patient management and therapy:
1. Diagnosis and risk assessment
2. Predictive simulation in therapy
• Provide patient-specific information to help plan
embolisation of arterio-venous malformations,
coiling of aneurysms, etc.
Clinical workflow (sketched below):
• Book computing resources in advance or use preemption
• Shift imaging data around quickly over high-bandwidth, low-latency dedicated links
• Interactive simulations and real-time visualization for immediate feedback
Peter Coveney, University College London
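The clinical workflow above is a fixed sequence with one scheduling decision (advance reservation vs. preemption). The Python sketch below, a rough illustration only, mirrors that control flow; every callable it takes (reserve_resources, preempt_jobs, transfer_images, run_interactive_simulation) is a hypothetical stand-in for site-specific machinery, not part of GENIUS itself.

from datetime import datetime, timedelta

def run_genius_case(patient_id, surgery_start, reserve_resources, preempt_jobs,
                    transfer_images, run_interactive_simulation):
    # Book computing resources in advance when there is enough lead time,
    # otherwise fall back to preemption (the 24 h threshold is an assumption).
    if surgery_start - datetime.now() > timedelta(hours=24):
        allocation = reserve_resources(start=surgery_start)
    else:
        allocation = preempt_jobs()
    # Shift imaging data quickly over dedicated high-bandwidth, low-latency links.
    images = transfer_images(patient_id, destination=allocation)
    # Interactive simulation with real-time visualization for immediate feedback.
    return run_interactive_simulation(allocation, images)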
OLSGW Gadgets
• OLSGW integrates bioinformatics applications
• BLAST, InterProScan, CLUSTALW, MUSCLE, PSIPRED, ACCPRO, VSL2
• 454 pyrosequencing service under development
• Four OLSGW gadgets have been published in the iGoogle gadget directory; search for "TeraGrid Life Science".
Wenjun Wu, Thomas Uram, Michael Papka, ANL
Multiscale Simulation of Arterial Tree
[Image: arterioles/venules (~50 microns) with activated platelets]
Platelet diameter is 2-4 µm
Normal platelet concentration in blood is 300,000/mm3
Functions: activation, adhesion to injured walls and to other platelets
Need to combine multi-scale models: 1D (arteries), 3D Navier-Stokes (organs, arterial junctions, etc.), Dissipative Particle Dynamics (capillaries, venules, arterioles, blood cells, etc.), Molecular Dynamics (blood cells, platelets, molecular adhesion, etc.)
NIH/NSF-IMAG project: George Em Karniadakis, Brown
Expressed Sequence Tag (EST) Pipeline
• ESTs are a collection of random cDNA sequences, sequenced from a cDNA library or sequencing devices
– Typical inputs are O(million) sequences
– Newer 454 devices produce higher volumes and are relatively easy to obtain and operate
– Stored using FASTA format
• ESTs are clustered and assembled to form contigs
• Contigs are then used to identify potential unknown genes by BLASTing against a known protein database
• Goal: use TeraGrid for backend computing, with existing software, and a gateway frontend
RepeatMasker (cleaning sequences): serial execution on split input, e.g., 1000 jobs for 2 million sequences
PaCE (clustering): 1 MPI job with a runtime of several hours; time grows exponentially with input data size; scales well
CAP3 (assembly): serial runs on the clusters generated by PaCE (clusters can be combined); varied sizes with varied resource requirements (run times: ms to days)
BLAST (identification): serial, takes CAP3 results; the number of jobs is controlled by adjusting the number of sequences per job
Initial results: a run that took 5 days on a local cluster completed in 2 days; more optimization is underway. (The input splitting is sketched after this slide.)
A. Kulshrestha, S. L. Pallickara, K. N. Muthuram, C. Kong, Q. Dong, M. Pierce, H. Tang, IU
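The "serial execution on split input" step above is essentially a scatter of one large FASTA file into many small job inputs. The sketch below shows one way such a split might be done, assuming roughly the numbers on the slide; the chunk size and file naming are arbitrary choices, not the pipeline's actual implementation.

def split_fasta(path, seqs_per_job=2000, prefix="est_chunk"):
    # Split a FASTA file into chunks of seqs_per_job sequences, one chunk per
    # serial job (e.g., ~1000 jobs for 2 million sequences at 2000 per job).
    chunk, n_seqs, n_files = [], 0, 0

    def flush():
        nonlocal chunk, n_seqs, n_files
        if chunk:
            with open(f"{prefix}_{n_files:04d}.fasta", "w") as out:
                out.writelines(chunk)
            n_files += 1
            chunk, n_seqs = [], 0

    with open(path) as f:
        for line in f:
            if line.startswith(">"):      # each FASTA header starts a new sequence
                if n_seqs == seqs_per_job:
                    flush()
                n_seqs += 1
            chunk.append(line)
    flush()                               # write the final, possibly short, chunk
    return n_files

The same splitter could be reused before the final BLAST step, since the slide notes that the number of BLAST jobs is controlled simply by adjusting the number of sequences per job.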
Multiscale Computer Simulation of the Immature HIV-1 Virion
[Diagram: experimental structures (Wright, Schooler, Ding, Kieffer, Fillmore, Sundquist, Jensen, EMBO J., 26, 2218, 2007) feed coarse-grained (CG) model development, CG simulation, and CG model refinement; atomic-level simulation provides key CG interactions and new CG interactions from MD]
An iterative modeling approach combining experimental imaging (cryo-electron
tomography), coarse-grained (CG) simulation, and atomic-level molecular dynamics (MD)
G. A. Voth, U. of Chicago
CIPRES Portal: A New Science Gateway for Systematics
• Systematics: study of diversification of life and relationships
among living things through time
• CIPRES: a flexible web application that can be sustained by the
community at minimal cost even beyond the funding period of
the project
• Tools include parallel versions of MrBayes, RAxML, GARLI
• User requirements include:
– Access to most or all native command line options (see the sketch after this slide)
– Add new tools quickly
– Provide personal user space for storing results
– Use TeraGrid resources to quickly provide results
• Cited in at least 35 publications, including Nature, PNAS, Cell
– Examples: New Family Tree for Arthropoda, Genome Sequence of a
Transitional Eukaryote, Co-evolution of Beetles and Flowering Plants
• Used routinely in at least 5 undergraduate classes
• Usage: 77% US (incl. 17 EPSCoR states), 23% from 33 other countries
Mark Miller, SDSC
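The requirement above for access to most or all native command line options usually means the gateway maps validated form fields onto the tool's real flags. The sketch below illustrates that mapping for RAxML under assumed form-field names and an assumed (much smaller than real) whitelist of options; it is not the CIPRES implementation.

def build_raxml_command(options):
    # Whitelist of native RAxML flags the form may pass through (illustrative subset):
    # -m model, -p parsimony seed, -N number of runs, -x bootstrap seed, -f algorithm.
    allowed = {"-m": str, "-p": int, "-N": int, "-x": int, "-f": str}
    cmd = ["raxmlHPC", "-s", options["alignment"], "-n", options["run_name"]]
    for flag, cast in allowed.items():
        if flag in options:
            cmd += [flag, str(cast(options[flag]))]
    return cmd

# Example: a form request for a GTR+Gamma analysis with 100 runs.
print(build_raxml_command({
    "alignment": "primates.phy", "run_name": "run1",
    "-m": "GTRGAMMA", "-p": 12345, "-N": 100,
}))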
Patient-Specific HIV Drug Therapy
HIV-1 Protease is a common target for HIV drug therapy
• Enzyme of HIV responsible for protein maturation
• Target for anti-retroviral inhibitors
• Example of structure-assisted drug design
• 9 FDA-approved inhibitors of HIV-1 protease
So what’s the problem?
• Emergence of drug-resistant mutations in protease
• These render drugs ineffective
• Drug-resistant mutants have emerged for all FDA-approved inhibitors
• Too many mutations to be interpreted by a clinician
Solution: build a Binding Affinity Calculator (BAC)
• Provide tools that allow simulations to be used in a clinical context, including a lightweight client
– User only needs to specify the enzyme, mutations relative to wildtype, and the drug
• Other options can be specified but begin as defaults
• Requires a large number of simulations to be constructed and run automatically (across distributed HPC resources); see the sketch after this slide
– To investigate generalisation
– Automation is critical for clinical use
• A turn-around time of around a week is required
• Trade-off between accuracy and time-to-solution
Initial results – ensemble MD calculations for lopinavir vs wildtype & five mutants –
appear promising; excellent relative ranking in binding free energies
Peter Coveney, University College London
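Since BAC must construct and run a large number of simulations automatically, its core is a fan-out over (drug, mutant) pairs and ensemble replicas. The sketch below shows only that fan-out, assuming placeholder mutant labels, an assumed replica count, and a caller-supplied submit function; it does not reproduce the real BAC client or its options.

import itertools

DRUGS = ["lopinavir"]                                            # from the initial results above
MUTANTS = ["wildtype", "mut1", "mut2", "mut3", "mut4", "mut5"]   # placeholder labels
REPLICAS = 10                                                    # assumed ensemble size per pair

def generate_ensemble_jobs(submit):
    # One MD job per (drug, mutant, replica); 'submit' is a caller-supplied callable,
    # e.g., a wrapper around the site's batch scheduler or grid middleware.
    jobs = []
    for drug, mutant in itertools.product(DRUGS, MUTANTS):
        for replica in range(REPLICAS):
            jobs.append(submit(drug, mutant, replica))
    return jobs

# Example with a stand-in submitter that just records what would be run.
job_list = generate_ensemble_jobs(lambda d, m, r: f"{d}-{m}-rep{r}")
print(len(job_list), "simulations to construct and run across distributed HPC resources")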
Scripting Protein Structure Prediction
// Swift script (schematic, as shown on the slide): sweep over proteins,
// starting temperatures, and delta-T values.
int nSim = 1000;
int maxRounds = 3;
Protein pSet[] <ext; exec="Protein.map">;
float startTemp[] = [100.0, 200.0];
float delT[] = [1.0, 1.5, 2.0, 5.0, 10.0];

foreach p, pn in pSet {
    foreach t in startTemp {
        foreach d in delT {
            ItFix(p, nSim, maxRounds, t, d);
        }
    }
}

// Each ItFix() call fans out into nSim (1000) parallel predict() calls,
// whose structures are then passed to analyze().
ItFix()
{
    foreach sim in [1:nSim] {
        (structure[sim], log[sim]) = predict(p, t, d);
    }
    result = analyze(structure);
}
10 proteins x 1000 simulations x 3 rounds x 2 temps x 5 delta-T’s
= 300K application runs
T. Sosnick, K. Freed, G. Hocky, J. DeBartolo, A. Adhikari, J. Xu, M. Wilde, U. Chicago