Using the Grid for Astronomy
Roy Williams, Caltech
Enzo Case Study
Simulated dark matter density in the early universe
• N-body gravitational dynamics (particle-mesh method)
• Hydrodynamics with PPM and ZEUS finite-difference methods
• Up to 9 species of H and He
• Radiative cooling
• Uniform UV background (Haardt & Madau)
• Star formation and feedback
• Metallicity fields
Adaptive Mesh Refinement (AMR)
• multilevel grid hierarchy
• automatic, adaptive, recursive
• no limits on depth or complexity of grids
• C++/F77
• Bryan & Norman (1998)
Source: J. Shalf
Distributed Computing Zoo
• Grid Computing
  – also called High-Performance Computing
  – big clusters, big data, big pipes, big centers
  – Globus backbone, which now includes Services and Gateways
  – decentralized control
• Cluster Computing
  – local interconnect between identical CPUs
• Peer-to-Peer (Napster, Kazaa)
  – systems for sharing data without a central server
• Internet Computing
  – screensaver cycle scavenging
  – eg SETI@home, Einstein@home, ClimatePrediction.net
• Access Grid
  – a videoconferencing system
• Globus
  – a popular software package to federate resources into a grid
• TeraGrid
  – a $150M award from NSF to the supercomputer centers (NCSA, SDSC, PSC, etc)
What is the Grid?
• The World Wide Web provides seamless access to information that is stored in many millions of different geographical locations.
• In contrast, the Grid is an emerging infrastructure that provides seamless access to computing power and data storage capacity distributed over the globe.
What is the Grid?
• "Grid" was coined by Ian Foster and Carl Kesselman in "The Grid: Blueprint for a New Computing Infrastructure".
• Analogy with the electric power grid: plug in to computing power without worrying where it comes from, like a toaster.
• The idea has been around under other names for a while (distributed computing, metacomputing, ...).
• The technology is now in place to realise the dream on a global scale.
What is Middleware?
The Grid middleware:
• finds convenient places for the scientist's "job" (computing task) to be run
• optimises use of the widely dispersed resources
• organises efficient access to scientific data
• deals with authentication to the different sites
• interfaces to local site authorisation / resource allocation
• runs the jobs
• monitors progress
• recovers from problems
... and tells you when the work is complete and transfers the result back!
Grid as Federation
• Grid as a federation
  – independent centers: flexibility
  – unified interface: power and strength
  – the large/small state compromise
Three Big Ideas of Grid
• Federation and Uniformity
  – independent management; uniform face; open standards
• Trust and Security
  – access policy; uniform authentication/authorization
• Distance doesn't matter
  – 20 Mbyte/sec, global file system
Grid projects in the world
• DOE Science Grid
• NSF National Virtual Observatory
• NSF GriPhyN/iVDGL
• DOE Particle Physics Data Grid
• NSF TeraGrid
• DOE Earth Systems Grid
• NEESGrid
• DOH BIRN
• UK e-Science Grid
• EUROGRID
• DataGrid (CERN, ...)
• EuroGrid (Unicore)
• DataTag (CERN, ...)
• GridLab (Cactus Toolkit)
• CrossGrid (Infrastructure Components)
TeraGrid Wide Area Network
TeraGrid Components
• Compute hardware
  – Intel/Linux clusters, Alpha SMP clusters, POWER4 cluster, ...
• Large-scale storage systems
  – hundreds of terabytes for secondary storage
• Very high-speed network backbone
  – bandwidth for rich interaction and tight coupling
• Grid middleware
  – Globus, data management, ...
• Next-generation applications
TeraGrid Resources
(Table of per-site resources for ANL/UC, Indiana U, NCSA, ORNL, PSC, Purdue U, SDSC, and TACC, with rows for compute resources, online storage, archival storage, global file system, data collections, visualization, instruments, and network connection (Gb/s, hub). Compute capacity ranges from about 1 Tflop to 35 Tflop per site; online storage from a few TByte to about 1000 TByte; archival storage from tens of TByte to several thousand TByte; network links are 10 or 30 Gb/s to hubs in Chicago, Los Angeles, or Atlanta.)
The TeraGrid Vision
Distributing the resources is better than putting them at one site.
• Build new, extensible, grid-based infrastructure
  – new hardware, new networks, new software, new practices, new policies
• Leverage homogeneity
  – run a single job across the entire TeraGrid
  – move executables between sites
• Catch-phrase: Open, Deep and Wide
  – Open: open to the US science community
  – Deep: heroic computing possible by programming Unix
  – Wide: easy to use through science gateways
TeraGrid Allocations Policies
• Any US researcher can request an allocation
– http://www.teragrid.org
Wide Variety of Usage Scenarios
• Tightly coupled simulation jobs storing vast amounts of data, performing visualization remotely, and making data available through online collections (ENZO)
• Thousands of independent jobs using data from a distributed data collection (NVO)
• Science Gateways: "not a Unix prompt"!
  – from a web browser, with security
  – from a SOAP client, for scripting
  – from an application, eg IRAF or IDL
Running jobs
Account Security
• Username/Password
– weak security, too many holes
– deprecated in many places
• SSH keys
– put public key on remote machine
– serves as single sign-on
• X.509 Certificates
– Proves identity
– Flexible
Ways to Submit a Job
1. Directly to the PBS batch scheduler
   – simple; scripts are portable among PBS TeraGrid clusters
2. Globus common batch script syntax
   – scripts are portable among other grids using Globus
3. Condor-G (= Condor + Globus)
4. Use a science gateway, eg Nesssi
   – specific tasks, easy to use
PBS Batch Submission
• Single executables to be run on a single remote machine
  – login to a head node, submit to the queue
• Direct, interactive execution
  – mpirun -np 16 ./a.out
• Through a batch job manager
  – qsub my_script
  – where my_script describes the executable location, runtime duration, redirection of stdout/err, mpirun specification, ...
• Example session:
  – ssh tg-login.sdsc.teragrid.org
  – qsub flatten.sh -v "FILE=f544"
  – qstat or showq
  – ls *.dat
  – pbs.out, pbs.err files
Remote submission
• Through Globus
  – globusrun -r [some-teragrid-headnode].teragrid.org/jobmanager -f my_rsl_script
  – where my_rsl_script describes the same details as in the qsub my_script!
• Through Condor-G
  – condor_submit my_condor_script
  – where my_condor_script describes the same details as the Globus my_rsl_script!
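As a rough sketch (not from the original slides), a small Python driver can write a GRAM RSL description and hand it to globusrun; the attribute names follow classic GRAM2 RSL, and the executable path is a placeholder for illustration.

import os

# The headnode placeholder comes from the slide above; substitute a real site.
headnode = "[some-teragrid-headnode].teragrid.org/jobmanager"

# A minimal RSL description: a 16-way MPI run of a.out with a one-hour limit.
rsl = ('&(executable="/home/roy/a.out")'
       '(count=16)(jobType=mpi)'
       '(maxWallTime=60)'
       '(stdout="pbs.out")(stderr="pbs.err")')

open("my_rsl_script", "w").write(rsl)

# globusrun reads the RSL file and submits to the remote jobmanager.
os.system("globusrun -r %s -f my_rsl_script" % headnode)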
Condor-G
A Grid-enabled version of Condor that provides robust job management for Globus clients.
– Robust replacement for globusrun
– Provides extensive fault-tolerance
– Can provide scheduling across multiple Globus sites
– Brings Condor's job management features to Globus jobs
Condor DAGMan
• Manages workflow interdependencies
• Each task is a Condor description file
• A DAG file controls the order in which the Condor files are run
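As an illustrative sketch (not part of the original slides), a small Python script can generate the per-task Condor description files and the DAG file that orders them, then hand the workflow to DAGMan. The task names and executables here are hypothetical; for Condor-G the universe would point at a Globus resource instead of the local pool.

import os

# Two hypothetical tasks: flatten an image, then build a mosaic from it.
tasks = {"flatten": "./flat", "mosaic": "./mosaic"}

# Write one Condor description file per task.
for name, exe in tasks.items():
    open(name + ".sub", "w").write(
        "universe = vanilla\n"
        "executable = %s\n"
        "output = %s.out\n"
        "error = %s.err\n"
        "log = dag.log\n"
        "queue\n" % (exe, name, name))

# The DAG file controls the order: mosaic runs only after flatten succeeds.
open("pipeline.dag", "w").write(
    "JOB flatten flatten.sub\n"
    "JOB mosaic mosaic.sub\n"
    "PARENT flatten CHILD mosaic\n")

# Submit the whole workflow to DAGMan.
os.system("condor_submit_dag pipeline.dag")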
Cluster Supercomputer
(Diagram: the user submits work through a job submission and queueing system (Condor, PBS, ...) to a login node, which feeds 100s of compute nodes; a purged /scratch area sits on a parallel file system with parallel I/O and a metadata node, while the backed-up /home area sits on a global file system.)
MPI parallel programming
• Each node runs the same program
  – it first finds its own number ("rank")
  – and the number of coordinating nodes ("size")
• Laplace solver example
  – Algorithm: each value becomes the average of the neighbor values
  – Serial: for each point, compute the average; remember the boundary conditions
  – Parallel: each node (node 0, node 1, ...) runs the algorithm with ghost points and uses messages to exchange the ghost points
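A minimal sketch of the ghost-point exchange in Python with mpi4py and numpy (these libraries are assumptions; the original example was presumably C or Fortran). Each rank owns a slab of the grid, trades boundary rows with its neighbours, then takes a Jacobi averaging step.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # this node's number
size = comm.Get_size()        # the number of coordinating nodes

# Each rank owns 'local' interior rows plus two ghost rows (first and last).
local = 100
u = np.zeros((local + 2, 128))
if rank == 0:
    u[0, :] = 1.0             # fixed boundary condition on the top edge

up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for it in range(1000):
    # Exchange ghost rows with neighbouring ranks (no-op at PROC_NULL).
    comm.Sendrecv(u[1, :],  dest=up,   recvbuf=u[-1, :], source=down)
    comm.Sendrecv(u[-2, :], dest=down, recvbuf=u[0, :],  source=up)
    # Each interior value becomes the average of its four neighbour values.
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])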
Globus
• Security
  – single sign-on, certificate handling, CAS, MyProxy
• Execution Management
  – remote jobs: GRAM and Condor-G
• Data Management
  – GridFTP, reliable file transfer, third-party file transfer
• Information Services
  – aggregating information from federated grid resources
• Common Runtime Components
  – web services through GT4
The following is a personal opinion; it is NOT the position of the NVO:
• Globus is a complex and difficult installation
• Globus needs frequent maintenance and updates
• Globus is monolithic (all or nothing)
Data storage
Typical types of HPC storage needs:
1. Home filesystem: 1-10 TB. A lot of small files, high metadata rates, interactive use.
2. Local scratch space: 100s of GB (per CPU). High-bandwidth data cache.
3. Global filesystem: 10-100 TB. High aggregate bandwidth, concurrent access to data; moderate latency tolerated.
4. Archival storage (optional): 100 TB-PB. Large storage pools with low cost, used for long-term storage of results.
Disk Farms (datawulf)
• Homogeneous disk farm (= parallel file system)
• Large files are striped over the disks for parallel I/O
• A metadata (management) node handles file creation, access, ls, etc
Parallel File System
• Large files are striped
  – very fast parallel access
• Medium files are distributed
  – stripes do not all start in the same place
• Small files choke the PFS manager
  – either containerize them
  – or use blobs in a database
  – not a file system anymore: a pool of 10^8 blobs with logical names
Storage Resource Broker (SRB)
• Single logical namespace while accessing distributed archival storage resources
• Effectively infinite storage
• Data replication
• Parallel transfers
• Interfaces: command-line, API, SOAP, web/portal
Storage Resource Broker (SRB): Virtual Resources, Replication
(Diagram: an SRB client (command-line or API) on a workstation talks to virtual storage resources such as hpss-sdsc and sfs-tape-sdsc at SDSC, hpss-caltech, NCSA, ...)
Similar to the VOSpace concept
(Diagram: a browser, SOAP client, or command-line client presents a certificate and accesses storage such as casjobs at JHU, tape at SDSC, or myDisk; a file may be replicated, and it comes with metadata, which may be customized.)
Containerizing
• Shared metadata
• Easier for bulk movement
(Diagram: many small files stored in a single container.)
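A minimal sketch of the containerizing idea in plain Python (an illustration, not the SRB container mechanism itself; the directory and file names are made up): many small files are bundled into one tar container so the file system and bulk transfers see a single large object.

import os
import tarfile

# Bundle all the small files in a directory into one container file.
container = tarfile.open("headers.tar", "w")
for name in os.listdir("headers"):
    container.add(os.path.join("headers", name))
container.close()

# Later, pull a single member back out without unpacking everything.
container = tarfile.open("headers.tar", "r")
member = container.extractfile("headers/f544.hdr")
print member.read()[:80]
container.close()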
Data intensive computing with NVO services
Two Key Ideas for Fault Tolerance
• Transactions
  – no partial completion: either all or nothing
  – eg copy to a tmp filename, then mv to the correct file name
• Idempotence
  – "acting as if done only once, even if used multiple times"
  – can run the script repeatedly until finished
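A minimal Python sketch of both ideas (the file names and the processing step are hypothetical): write to a temporary name and rename only on success, so a target either exists complete or not at all, and skip work whose output is already present so the script can be rerun safely.

import os

def flatten(src, dst):
    # Idempotent: if a complete target already exists, do nothing.
    if os.path.exists(dst):
        return

    # Transactional: write to a temporary name, then rename atomically,
    # so a crash never leaves a half-written target behind.
    tmp = dst + ".tmp"
    out = open(tmp, "w")
    out.write(open(src).read())   # stand-in for the real processing
    out.close()
    os.rename(tmp, dst)           # all or nothing

flatten("source/f544.fits", "target/f544.fits")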
DPOSS flattening
(Pipeline diagram: Source, 2650 x 1.1 Gbyte files; crop the borders; quadratic fit and subtract; Target, as virtual data.)
Driving the Queues

Here is the driver that makes and submits jobs:

import os

for f in os.listdir(inputDirectory):
    # if the target file exists, with the right size and age, then we
    # already have it; otherwise (re)submit the batch job
    file  = inputDirectory  + "/" + f
    ofile = outputDirectory + "/" + f
    if os.path.exists(ofile):
        osize = os.path.getsize(ofile)
        if osize != 1109404800:
            print " -- wrong target size, remaking",
        else:
            # filetime() is a helper (defined elsewhere) returning the
            # file's modification time
            time_tgt = filetime(ofile)
            time_src = filetime(file)
            if time_tgt < time_src:
                print " -- target too old or nonexistent, remaking",
            else:
                print " -- already have target file "
                continue
    cmd = "qsub flat.sh -v \"FILE=" + f + "\""
    print " -- submitting batch job: ", cmd
    os.system(cmd)
PBS script

A PBS script; run it with: qsub script.sh -v "FILE=f345"

#!/bin/sh
#PBS -N dposs
#PBS -V
#PBS -l nodes=1
#PBS -l walltime=1:00:00
cd /home/roy/dposs-flat/flat
./flat \
  -infile  /pvfs/mydata/source/${FILE}.fits \
  -outfile /pvfs/mydata/target/${FILE}.fits \
  -chop 0 0 1500 23552 \
  -chop 0 0 23552 1500 \
  -chop 0 22052 23552 23552 \
  -chop 22052 0 23552 23552 \
  -chop 18052 0 23552 4000
GET services from Python

This code uses a service to find the best hyperatlas page for a given sky location:

import urllib

hyperatlasURL = self.hyperatlasServer + "/getChart?atlas=" + atlas \
    + "&RA=" + str(center1) + "&Dec=" + str(center2)
stream = urllib.urlopen(hyperatlasURL)

# result is a tab-separated line, so use split() to tokenize
tokens = stream.readline().split('\t')
print "Using page ", tokens[0], " of atlas ", atlas
self.scale  = float(tokens[1])
self.CTYPE1 = tokens[2]
self.CTYPE2 = tokens[3]
rval1 = float(tokens[4])
rval2 = float(tokens[5])
VOTable parser in Python

From a SIAP URL, we get the XML and extract the columns that hold the image references, image format, and image RA/Dec:

import urllib
import xml.dom.minidom

stream = urllib.urlopen(SIAP_URL)
doc = xml.dom.minidom.parse(stream)

# Make a dictionary mapping each column's UCD to its column index
col_ucd_dict = {}
col_counter = 0
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_FIELD in XML_TABLE.getElementsByTagName("FIELD"):
        col_ucd = XML_FIELD.getAttribute("ucd")
        col_ucd_dict[col_ucd] = col_counter
        col_counter += 1

urlColumn    = col_ucd_dict["VOX:Image_AccessReference"]
formatColumn = col_ucd_dict["VOX:Image_Format"]
raColumn     = col_ucd_dict["POS_EQ_RA_MAIN"]
deColumn     = col_ucd_dict["POS_EQ_DEC_MAIN"]
VOTable parser in Python

The table is a list of rows, and each row is a list of table cells:

import xml.dom.minidom

table = []
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_DATA in XML_TABLE.getElementsByTagName("DATA"):
        for XML_TABLEDATA in XML_DATA.getElementsByTagName("TABLEDATA"):
            for XML_TR in XML_TABLEDATA.getElementsByTagName("TR"):
                row = []
                for XML_TD in XML_TR.getElementsByTagName("TD"):
                    data = ""
                    for child in XML_TD.childNodes:
                        data += child.data
                    row.append(data)
                table.append(row)
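Putting the two fragments together (a sketch, assuming the column dictionary and the table built above, with a made-up output file name pattern), the image references can be pulled out of each row and fetched:

import urllib

# Download every FITS image referenced in the SIAP result table.
for row in table:
    if row[formatColumn] == "image/fits":
        url = row[urlColumn]
        ra  = float(row[raColumn])
        dec = float(row[deColumn])
        print "fetching image at", ra, dec
        urllib.urlretrieve(url, "cutout_%f_%f.fits" % (ra, dec))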
Science Gateways
Grid Impediments
• Write proposal
• Wait 3 months for account
• Get logged in
• Get certificate
• Port code to Itanium
• Learn PBS
• Learn MPI
• Learn Globus
• ... and now do some science
A better way: Graduated Security for Science Gateways
• Web form: anonymous access
• Register (logging and reporting): some science
• Authenticate with X.509 (browser or cmd line): more science
• Power user (write proposal, own account): big-iron computing
2MASS Mosaicking portal: an NVO-TeraGrid project (Caltech IPAC)
Three Types of Science Gateways
• Web-based Portals
– User interacts with community-deployed web interface.
– Runs community-deployed codes
– Service requests forwarded to grid resources
• Scripted service call
– User writes code to submit and monitor jobs
• Grid-enabled applications
– Application programs on users' machines (eg IRAF)
– Also runs program on grid resource
Nesssi: Secure Web services for astronomy
(Architecture diagram: a client web form or SOAP client talks over HTTP/SOAP to the nesssi web portal; using a certificate repository and certificate policies, the portal selects a user account and fetches a proxy; jobs go through a queue to compute nodes; results land in sandbox storage served over open HTTP.)
Mosaic service
nesssiServer.dpossMosaic.mosaic("-ra 49.1 -dec 60.1 -rawidth 0.5 -decwidth 0.5 -filt f -bgcorr 0")

Coadd service
nesssiServer.hyperatlas.run("-bandpass z1 -ra 170.08 -dec 13.275 -rawidth 1.0 -decwidth 1.0")

Cutout service
nesssiServer.cutout.run(sessionID, "-surveys PQ:gr,PQ:gi,PQ:z1,PQ:z2,SDSS:r,SDSS:i,SDSS:z,2MASS:k,2MASS:h -size 64")
cutouts from Palomar-Quest, SDSS, and 2MASS of sources from the Veron quasar catalog
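A sketch of how such a call might be scripted from Python: the SOAPpy library and the endpoint URL here are assumptions, not part of the original slides, and the argument string is the mosaic example above.

from SOAPpy import SOAPProxy

# Hypothetical nesssi endpoint; the real portal URL would come from the service docs.
nesssiServer = SOAPProxy("http://nesssi.example.org/services")

# Start a mosaic job; the service returns a session ID used by later calls.
sessionID = nesssiServer.dpossMosaic.mosaic(
    "-ra 49.1 -dec 60.1 -rawidth 0.5 -decwidth 0.5 -filt f -bgcorr 0")
print "session:", sessionID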
Amazon Grid
(who will pay?)
Amazon Grid
• Simple Storage Service (S3)
  – Write, read, and delete objects.
  – Each object has a unique, developer-assigned key.
  – Authentication mechanisms: objects can be private or public; rights can be granted to specific users.
  – REST and SOAP interfaces.
  – Default download protocol is HTTP; BitTorrent(TM) also available.
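As a sketch of the programming model (the boto library is an assumption here, not something the slides prescribe; the bucket name, key, and credentials are made up):

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")

# Create a bucket and store an object under a developer-assigned key.
bucket = conn.create_bucket("nvo-mosaics")
k = Key(bucket)
k.key = "dposs/f544.fits"
k.set_contents_from_filename("f544.fits")

# Objects are private by default; rights can be granted, eg public read.
k.set_acl("public-read")

# Read it back over HTTP.
print k.get_contents_as_string()[:80]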
Amazon Grid
• Elastic Compute Cloud (EC2)
  – Create an Amazon Machine Image (AMI) containing your applications, libraries, data, and associated configuration settings.
  – Upload the AMI into Amazon Simple Storage Service.
  – Configure security and network access.
  – Start, terminate, and monitor as many instances of your AMI as needed.
  – Pay for the instance hours and bandwidth that you actually consume:
    • $0.10 per instance-hour consumed
    • $0.20 per GB of data transferred outside of Amazon
    • $0.15 per GB-month of Amazon S3 storage
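The same pay-per-use loop can be driven programmatically; a sketch with boto (again an assumption), using a placeholder AMI id and made-up credentials:

from boto.ec2.connection import EC2Connection

conn = EC2Connection("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")

# Start two instances of a previously registered AMI (the id is a placeholder).
reservation = conn.run_instances("ami-xxxxxxxx", min_count=2, max_count=2,
                                 instance_type="m1.small")

# Monitor the instances, then terminate them when the run is finished.
for instance in reservation.instances:
    print instance.id, instance.state
conn.terminate_instances([i.id for i in reservation.instances])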
Amazon Grid
• Simple Queue Service (SQS)
  – Move data between distributed application components performing different tasks, without losing messages or requiring each component to be always available.
  – Unlimited number of queues, unlimited number of messages.
  – New messages can be added at any time.
  – A computer can check a queue at any time for messages waiting to be read.
  – REST, SOAP, and query interfaces.
  – The queue creator determines which other users can write to or read from the queue.
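A sketch of the queue pattern with boto (an assumption; the queue name, message body, and credentials are made up):

from boto.sqs.connection import SQSConnection
from boto.sqs.message import Message

conn = SQSConnection("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")
queue = conn.create_queue("dposs-work")

# One component adds work items; they persist until read and deleted.
m = Message()
m.set_body("FILE=f544")
queue.write(m)

# Another component checks the queue whenever it is ready for more work.
msg = queue.read()
if msg is not None:
    print "got work item:", msg.get_body()
    queue.delete_message(msg)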