GEC23 Experimenter Developer Roundtable
Paul Ruth, Anirban Mandal (Presenter)
Experiences with Mid-Scale GENI Infrastructure
• What is mid-scale infrastructure?
– Bigger GENI racks (1000+ cores)
– NSFCloud: CloudLab
– NSFCloud: Chameleon
• Why mid-scale infrastructure?
– GENI technologies are ready to be tried at scale
– Real benefits to real domain science
NSFCloud
• Not a cloud!
– It is a testbed for developing clouds
• CloudLab
– Sites: Utah, Clemson, Wisconsin (plus APT)
– API: GENI (see the profile sketch after the Chameleon list below)
– Features: ARM, IB, Cisco UCS servers
– ~5,000 cores (when complete), + 2,000 cores from APT
• Chameleon
– Sites: TACC, Chicago
– API: OpenStack
– Features: IB, 100 Gbps network between sites
– ~14,500 cores (when complete)
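Since CloudLab speaks the GENI API, experiments are described as geni-lib profiles. The following is a minimal sketch, not taken from the talk, of requesting a single bare-metal APT node; the node name is illustrative, and the r320 hardware type matches the runs described later.

    # Minimal geni-lib profile sketch for a CloudLab/APT bare-metal request.
    # The node name is illustrative; r320 is the APT hardware type used later in the talk.
    import geni.portal as portal
    import geni.rspec.pg as pg

    pc = portal.Context()            # portal context for this profile
    request = pc.makeRequestRSpec()  # empty request RSpec to fill in

    node = pg.RawPC("node0")         # one bare-metal node
    node.hardware_type = "r320"      # APT r320s (IB-equipped)
    request.addResource(node)

    pc.printRequestRSpec(request)    # emit the request RSpec for the portal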
Experiences with NSFCloud
• Goal
– Deploy ExoGENI rack in NSFCloud
• Why?
– Target HPC domain science applications (IB)
• High performance is a first-order goal
– Test/debug new ExoGENI features
– Evaluate performance optimizations on identical hardware
– Add support for new resource types
• ARM, IB, etc.
ExoGENI on NSFCloud
[Diagram: cloud software deployed on an NSFCloud site]
ExoGENI on NSFCloud
[Diagram: ExoSM with AM and Broker controlling an ExoGENI rack deployed on an NSFCloud site]
ExoGENI on NSFCloud
[Diagram: ExoSM with AM and Broker controlling an ExoGENI rack on an NSFCloud site, connected to the rest of ExoGENI through network transit providers (I2, ESnet)]
ADCIRC on Mid-Scale Infrastructure
• Finite Element
• Very high spatial resolution (~1.2M triangles)
• Efficient MPI implementation, scales to thousands of cores
• Typically use 256-1024 cores for forecasting applications
• Used for coastal flooding simulations
– FEMA flood insurance studies
– Forecasting systems
– Research applications
ADCIRC on Mid-Scale Infrastructure
• Run ADCIRC at scale (bare metal)
– CloudLab
• APT cluster (r320s) w/ IB
• Largest run 512 core (64 node) MPI job (see the launch sketch below)
– Chameleon
• TACC Alamo (Early User Program)
• Largest run 160 core MPI job
• Run ADCIRC at scale (VM)
– CloudLab
• APT cluster (r320s) w/ IB using SR-IOV
• OpenStack
• Largest run 256 core (32 node) MPI job
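For reference, the runs above are ordinary MPI launches across the allocated nodes. The sketch below is not from the talk; it assumes Open MPI-style mpirun, 8 cores per r320, sequential hostnames, and ADCIRC's parallel binary padcirc in the working directory.

    # Sketch: launch a 512-core (64-node) ADCIRC run over the allocated nodes.
    # Hostnames, cores per node, and the padcirc path are illustrative assumptions.
    import subprocess

    nodes = ["node%d" % i for i in range(64)]   # assumed hostnames of the 64 r320s
    cores_per_node = 8                          # each r320 provides 8 cores
    total_cores = len(nodes) * cores_per_node   # 512 cores

    # Write an MPI hostfile listing each node and its slot count.
    with open("hostfile", "w") as f:
        for n in nodes:
            f.write("%s slots=%d\n" % (n, cores_per_node))

    # Launch ADCIRC's parallel binary (padcirc) with mpirun over the IB fabric.
    subprocess.check_call([
        "mpirun", "-np", str(total_cores),
        "--hostfile", "hostfile",
        "./padcirc",                            # assumed binary location
    ])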
VERY Preliminary Results (CloudLab)
• Hurricane Floyd, ensemble member 1:
– Bare metal 256 core MPI (IB): ~160 mins
– VM 256 core MPI (IB w/ SR-IOV): ~170 mins
– Approx. 6% overhead due to virtualization
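The quoted ~6% overhead follows directly from the two runtimes above:

    overhead = (170 min - 160 min) / 160 min ≈ 6.25% ≈ 6%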
Thoughts for GENI community
• CloudLab
– Feels like GENI (easier for this community to use)
• Chameleon
– Based on OpenStack
• Would be easy to interface with ExoGENI to enable GENI APIs (see the OpenStack sketch below)
– Need to wait for more nodes.
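Because Chameleon exposes a standard OpenStack API, an ExoGENI handler could drive it through ordinary OpenStack client libraries. A minimal sketch follows, assuming the openstacksdk Python package and a clouds.yaml entry named "chameleon"; the image, flavor, and network names are placeholders, not values from the talk.

    # Sketch: boot an instance on a Chameleon-style OpenStack site.
    # Assumes openstacksdk and a clouds.yaml entry named "chameleon";
    # image, flavor, and network names are placeholders.
    import openstack

    conn = openstack.connect(cloud="chameleon")

    # An ExoGENI handler would wrap provisioning calls like this one.
    server = conn.create_server(
        name="exogeni-worker-0",
        image="CC-CentOS7",       # placeholder image name
        flavor="baremetal",       # placeholder flavor name
        network="sharednet1",     # placeholder network name
        wait=True,
    )
    print(server.status)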
Functionality Requests
• Network stitching (CloudLab, Chameleon)
– Not just GENI stitching… we need to enable ExoGENI stitching
• Image management is not transparent (CloudLab)
– I’d like to be able to view/delete images that I have created
Future Work
• ExoGENI
– Install full ExoGENI software stack to control OpenStack cluster on APT
– Create ExoGENI handler that can interface with Chameleon OpenStack sites
• Port existing work to other sites
– Targeting Clemson, but it has QLogic IB cards that will require some time to set up