Towards a Data Cauldron

Microsoft Research
Faculty Summit 2008
Ian Foster
Computation Institute
University of Chicago & Argonne National Laboratory
If you want to build a ship, don’t drum up the men to gather wood, divide the work, and give orders. Instead, teach them to yearn for the vast and endless sea.
Antoine de Saint-Exupéry
Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.
Data in
“No limits”
• Storage
• Computing
• Format
• Program
Programs & rules in
Results out
Allowing for
• Versioning
• Provenance
• Collaboration
• Annotation
having the interior immediately accessible
relatively free of obstructions to sight, movement, or internal arrangement
generous, liberal, or bounteous
in operation; live
readily admitting new members
not constipated
Rules
Parallel programs
Workflows
Swift
MapReduce
R
Dryad
MATLAB
SQL
Octave
BPEL
SCFL
Virtualization
Run any program, store any data
Indexing
Automated maintenance
Provisioning
Policy-driven allocation of resources to competing demands
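One way to picture "policy-driven allocation of resources to competing demands" is weighted max-min fair sharing. The slide does not say which policy is used, so the sketch below is only an illustration under that assumption; the function name `allocate`, the claimant names, and the weights are all hypothetical.

```python
def allocate(capacity, demands, weights):
    """Weighted max-min fair split of `capacity` among competing demands.

    demands: dict mapping claimant name -> requested units
    weights: dict mapping claimant name -> policy weight (> 0)
    Hypothetical policy, not one described in the talk.
    """
    allocation = {name: 0.0 for name in demands}
    active = {name for name, d in demands.items() if d > 0}
    remaining = float(capacity)

    while active and remaining > 1e-9:
        total_weight = sum(weights[n] for n in active)
        # Offer each unsatisfied claimant its weighted share of what is left.
        share = {n: remaining * weights[n] / total_weight for n in active}
        # Claimants whose full demand fits inside their share are satisfied.
        satisfied = {n for n in active if demands[n] <= share[n]}
        if not satisfied:
            # Nobody can be fully served: hand out the shares and stop.
            for n in active:
                allocation[n] = share[n]
            remaining = 0.0
            break
        for n in satisfied:
            allocation[n] = float(demands[n])
            remaining -= demands[n]
        active -= satisfied

    return allocation


if __name__ == "__main__":
    # Hypothetical claimants: group_c has twice the policy weight of the others.
    result = allocate(
        capacity=100,
        demands={"group_a": 20, "group_b": 80, "group_c": 80},
        weights={"group_a": 1, "group_b": 1, "group_c": 2},
    )
    for name, units in sorted(result.items()):
        print(f"{name}: {units:.2f}")
```

In this example the small request is met in full, and the capacity left over is divided between the two larger requests in proportion to their weights.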
Data
Data
Transform
Annotate
Search
Add to
Tag
Visualize
Discover
Extend
Group
Share
Astrophysics
Cognitive science
East Asian studies
Economics
Environmental science
Epidemiology
Genomic medicine
Neuroscience
Political science
Sociology
Solid state physics
1000 TB tape backup
500 TB reliable storage (data, metadata)
Diverse data sources
Data ingest
PADS: 180 TB, 180 GB/s, 17 Top/s analysis
Dynamic provisioning
Parallel analysis
Remote access
Diverse users
Offload to remote data centers
CPU cores: 118784
Tasks: 934803
Elapsed time: 7257 sec
Compute time: 21.43 CPU yr
Average task time: 667 sec
Relative efficiency: 99.7% (from 16 to 32 racks)
Utilization: sustained 99.6%, overall 78.3%
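The overall utilization figure can be roughly cross-checked against the other numbers on this slide. The sketch below assumes that overall utilization means compute time divided by the core-seconds available during the run, and that a CPU year is 365 days; neither definition is stated on the slide.

```python
# Figures copied from the slide.
CPU_CORES = 118_784
ELAPSED_SEC = 7_257
COMPUTE_CPU_YEARS = 21.43

# Assumption: one CPU year is a 365-day year of one core.
SECONDS_PER_YEAR = 365 * 24 * 3600


def overall_utilization(cores, elapsed_sec, compute_cpu_years):
    """Fraction of the available core-seconds that was spent computing."""
    available_core_seconds = cores * elapsed_sec
    used_core_seconds = compute_cpu_years * SECONDS_PER_YEAR
    return used_core_seconds / available_core_seconds


if __name__ == "__main__":
    u = overall_utilization(CPU_CORES, ELAPSED_SEC, COMPUTE_CPU_YEARS)
    print(f"overall utilization: {u:.1%}")
```

Under these assumptions the result lands within a fraction of a percentage point of the 78.3% reported above, which suggests the reported figure was derived in a similar way.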
Ioan Raicu, Zhao Zhang, Mike Wilde
HPC systems software (MPICH, PVFS, ZeptOS)
Collaborative data tagging (GLOSS)
Data integration (XDTM)
HPC data analytics and visualization
Loosely coupled parallelism (Swift, Hadoop)
Dynamic provisioning (Falkon)
Service authoring (Introduce, caGrid, gRAVI)
Provenance recording and query (Swift)
Service composition and workflow (Taverna)
Virtualization management (Workspace Service)
Distributed data management (GridFTP, etc.)
Functional MRI
Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao
Diverse
experimental
data &
metadata
Browse data
Search
Content preview
Transcode
Download
Analyze
SIDgrid
Bennett Bertenthal
Mike Papka
Mike Wilde
… and others
TeraGrid
PADS
…