Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Reactive programming wikipedia , lookup
Structured programming wikipedia , lookup
Falcon (programming language) wikipedia , lookup
Stream processing wikipedia , lookup
Java (programming language) wikipedia , lookup
Parallel computing wikipedia , lookup
C Sharp (programming language) wikipedia , lookup
Supercomputer architecture wikipedia , lookup
Supercomputer wikipedia , lookup
Java ConcurrentMap wikipedia , lookup
Parallel Programming on Computational Grids Outline • • • • Grids Application-level tools for grids Parallel programming on grids Case study: Ibis Grids • Seamless integration of geographically distributed computers, databases, instruments – The name is an analogy with power grids • Highly active research area – Global Grid Forum (now called Open Grid Forum) – Globus middleware – Many European projects • e.g. Gridlab: Grid Application Toolkit and Testbed – VL-e (Virtual laboratory for e-Science) project – …. Why Grids? • New distributed applications that use data or instruments across multiple administrative domains and that need much CPU power – Computer-enhanced instruments – Collaborative engineering – Browsing of remote datasets – Use of remote software – Data-intensive computing – Very large-scale simulation – Large-scale parameter studies Web, Grids and e-Science • Web is about exchanging information • Grid is about sharing resources – Computers, data bases, instruments • e-Science supports experimental science by providing a virtual laboratory on top of Grids – Support for visualization, workflows, data management, security, authentication, high-performance computing The big picture Application Application Application Potential Potential Potential Generic Generic Generic Management Virtual Laboratory part Management Management part part of comm. & Application oriented of comm. & services of comm. & computing computing computing Grids Harness distributed resources Applications • e-Science experiments generate much data, that often is distributed and that need much (parallel) processing – high-resolution imaging: – Bio-informatics queries: – Satellite world imagery: – Current particle physics: – LHC physics (2007): ~ 1 GByte per measurement 500 GByte per database ~ 5 TByte/year 1 PByte per year 10-30 PByte per year Grid programming • The goal of a grid is to make resource sharing very easy (transparent) • Practice: grid programming is very difficult – Finding resources, running applications, dealing with heterogeneity and security, etc. • Grid middleware (Globus Toolkit) makes this somewhat easier, but is still low-level and changes frequently • Need application-level tools Application-level tools • • • • Builds on grid software infrastructure Isolates users from dynamics of the grid hardware infrastructure Generic (broad classes of applications) Easy-to-use Taxonomy of existing application-level tools • Grid programming models – RPC – Task parallelism – Message passing – Java programming • Grid application execution environments – Parameter sweeps – Workflow – Portals Remote Procedure Call (RPC) • GridRPC: specialize RPC-style (client/server) programming for grids – Allows coarse-grain task parallelism & remote access – Extended with resource discovery, scheduling, etc. • Example: NetSolve – Solves numerical problems on a grid • Current development: use web technology (WSDL, SOAP) for grid services • Web and grid technology are merging Task parallelism • Many systems for task parallelism (master-worker, replicated workers) exist for the grid • Examples – MW (master-worker) – Satin: divide&conquer (hierarchical master-worker) Message passing • Several MPI implementations exist for the grid • PACX MPI (Stutgart): – Runs on heterogeneous systems • MagPIe (Thilo Kielmann): – Optimizes MPI’s collective communication for hierarchical wide-area systems • MPICH-G2: – Similar to PACX and MagPIe, implemented on Globus Java programming • Java uses bytecode and is very portable – ``Write once, run anywhere’’ • Can be used to handle heterogeneity • Many systems now have Java interfaces: – Globus (Globus Java Commodity Grid) – MPI (MPIJava, MPJ, …) – Gridlab Application Toolkit (Java GAT) • Ibis is a Java-centric grid programming system Parameter sweep applications • Computations what are mostly independent – E.g. run same simulation many times with different parameters • Can tolerate high network latencies, can easily be made fault-tolerant • Many systems use this type of trivial parallelism to harness idle desktops – APST, SETI@home, Entropia, XtremWeb Workflow applications • Link and compose diverse software tools and data formats – Connect modules and data-filters • Results in coarse-grain, dataflow-like parallelism that can be run efficiently on a grid • Several workflow management systems exist – E.g. Virtual Lab Amsterdam (predecessor VL-e) Portals • Graphical interfaces to the grid • Often application-specific • Also portals for resource brokering, file transfers, etc. Outline • • • • Grids Application-level tools for grids Parallel programming on grids Case study: Ibis Distributed supercomputing • Parallel processing on geographically distributed computing systems (grids) • Examples: – SETI@home ( ), RSA-155, Entropia, Cactus • Currently limited to trivially parallel applications • Questions: – Can we generalize this to more HPC applications? – What high-level programming support is needed? Grids versus supercomputers • Performance/scalability – Speedups on geographically distributed systems? • Heterogeneity – Different types of processors, operating systems, etc. – Different networks (Ethernet, Myrinet, WANs) • General grid issues – Resource management, co-allocation, firewalls, security, authorization, accounting, …. Approaches • Performance/scalability – Exploit hierarchical structure of grids (previous project: Albatross) • Heterogeneity – Use Java + JVM (Java Virtual Machine) technology • General grid issues – Studied in many grid projects: VL-e, GridLab, GGF Speedups on a grid? • Grids usually are hierarchical – Collections of clusters, supercomputers – Fast local links, slow wide-area links • Can optimize algorithms to exploit this hierarchy – Minimize wide-area communication • Successful for many applications – Did many experiments on a homogeneous wide-area test bed (DAS) Example: N-body simulation • Much wide-area communication – Each node needs info about remote bodies CPU 1 CPU 1 CPU 2 CPU 2 Amsterdam Delft Trivial optimization CPU 1 CPU 1 CPU 2 CPU 2 Amsterdam Delft Wide-area optimizations • Message combining on wide-area links • Latency hiding on wide-area links • Collective operations for wide-area systems – Broadcast, reduction, all-to-all exchange • Load balancing Conclusions: – Many applications can be optimized to run efficiently on a hierarchical wide-area system – Need better programming support The Ibis system • High-level & efficient programming support for distributed supercomputing on heterogeneous grids • Use Java-centric approach + JVM technology – Inherently more portable than native compilation • “Write once, run anywhere ” – Requires entire system to be written in pure Java • Optimized special-case solutions with native code – E.g. native communication libraries Ibis programming support • Ibis provides – Remote Method Invocation (RMI) – Replicated objects (RepMI) - as in Orca – Group/collective communication (GMI) - as in MPI – Divide & conquer (Satin) - as in Cilk • All integrated in a clean, object-oriented way into Java, using special “marker” interfaces – Invoking native library (e.g. MPI) would give up Java’s “run anywhere” portability Compiling/optimizing programs JVM source source Java compiler bytecode bytecode rewriter bytecode JVM JVM • Optimizations are done by bytecode rewriting – E.g. compiler-generated serialization (as in Manta) Satin: a parallel divide-and-conquer system on top of Ibis • • • • Divide-and-conquer is inherently hierarchical More general than master/worker Satin: Cilk-like primitives (spawn/sync) in Java New load balancing algorithm (CRS) – Cluster-aware random work stealing fib(5) cpu 2 fib(2) fib(1) fib(1) fib(0) fib(4) fib(3) fib(3) fib(2) fib(2) fib(1) fib(1) fib(0) fib(1) fib(0) cpu 3 cpu 1 cpu 1 Example interface FibInter { public int fib(long n); } class Fib implements FibInter { int fib (int n) { if (n < 2) return n; return fib(n-1) + fib(n-2); } } Single-threaded Java Example interface FibInter extends ibis.satin.Spawnable { public int fib(long n); } class Fib extends ibis.satin.SatinObject implements FibInter { public int fib (int n) { if (n < 2) return n; int x = fib (n - 1); int y = fib (n - 2); sync(); return x + y; } } Java + divide&conquer GridLab testbed Ibis implementation • Want to exploit Java’s “run everywhere” property, but – That requires 100% pure Java implementation, no single line of native code – Hard to use native communication (e.g. Myrinet) or native compiler/runtime system • Ibis approach: – Reasonably efficient pure Java solution (for any JVM) – Optimized solutions with native code for special cases Ibis design Application RMI GMI Satin MPJ RepMI Ibis Portability Layer (IPL) TCP Smart Sockets P2P UDP GM MPI Challenges • How to make the system flexible enough – Run seamlessly on different hardware / protocols • Make the pure-Java solution efficient enough – Need fast local communication even for grid applications • Special-case optimizations Fast communication in pure Java • Manta system [ACM TOPLAS Nov. 2001] – RMI at RPC speed, but using native compiler & RTS • Ibis does similar optimizations, but in pure Java – Compiler-generated serialization at bytecode level • 5-9x faster than using runtime type inspection – Reduce copying overhead • Zero-copy native implementation for primitive arrays • Pure-Java requires type-conversion (=copy) to bytes Java/Ibis vs. C/MPI on Pentium-3 cluster (using SOR) Grid experiences with Ibis • Using Satin divide-and-conquer system – Implemented with Ibis in pure Java, using TCP/IP • Application measurements on – DAS-2 (homogeneous) – Testbed from EC GridLab project (heterogeneous) – Grid’5000 (France) – N Queens challenge Distributed ASCI Supercomputer (DAS) 2 VU (72 nodes) UvA (32) GigaPort (1-10 Gb) Leiden (32) Utrecht (32) Delft (32) Satin on wide-area DAS-2 70.0 60.0 40.0 30.0 20.0 10.0 single cluster of 64 machines P TS K N e qu ee Pr im ns e fa ct or s Ra yt ra ce r N ch oo s ac k ID A* Kn ap s Fi bo na ive cc in i te gr at i Se on tc Fi ov b. er th re sh ol d 0.0 Ad ap t speedup 50.0 4 clusters of 16 machines Satin on GridLab • Heterogeneous European grid testbed • Implemented Satin/Ibis on GridLab, using TCP • Source: van Nieuwpoort et al., AGRIDM’03 (Workshop on Adaptive Grid Middleware, New Orleans, Sept. 2003) GridLab • Latencies: – 9-200 ms (daytime), 9-66 ms (night) • Bandwidths: – 9-4000 KB/s Configuration Type OS CPU Location CPUs Cluster Linux Pentium-3 Amsterdam 8 1 SMP Solaris Sparc Amsterdam 1 2 Cluster Linux Xeon Brno SMP Linux Pentium-3 Cardiff 12 Origin 3000 Irix MIPS ZIB Berlin 1 16 SMP Unix Alpha Lecce 14 42 Experiences • No support for co-allocation yet (done manually) • Firewall problems everywhere – Addressed with ‘SmartSockets’ library • Java indeed runs anywhere modulo bugs in (old) JVMs • Need clever load balancing mechanism (CRS) Cluster-aware Random Stealing • Use Cilk’s Random Stealing (RS) inside cluster • When idle – Send asynchronous wide-area steal message – Do random steals locally, and execute stolen jobs – Only 1 wide-area steal attempt in progress at a time • Prefetching adapts – More idle nodes more prefetching • Source: van Nieuwpoort et al., ACM PPoPP’01 Performance on GridLab • Problem: how to define efficiency on a grid? • Our approach: – Benchmark each CPU with Raytracer on small input – Normalize CPU speeds (relative to a DAS-2 node) – Our case: 40 CPUs, equivalent to 24.7 DAS-2 nodes – Define: • T_perfect = sequential time / 24.7 • efficiency = T_perfect / actual runtime – Also compare against single 25-node DAS-2 cluster Results for Raytracer Time (sec) Efficiency (%) 878 62.6 CRS 677 81.3 RS 2084 26.4 CRS 693 79.3 1 cluster 580 96.1 sequential 13,564 100 Night RS Day RS = Random Stealing, CRS = Cluster-aware RS Some statistics • Variations in execution times: – RS @ day: 0.5 - 1 hour – CRS @ day: less than 20 secs variation • Internet communication (total): – RS: 11,000 (night) - 150,000 (day) messages 137 (night) - 154 (day) MB – CRS: 10,000 - 11,000 messages 82 - 86 MB N-Queens Challenge • How many ways are there to arrange N non-attacking queens on an NxN chessboard ? Known Solutions • Currently, all solutions up to N=25 are known • Result N=25 was computed using the 'spare time' of 260 machines • Took 6 months • Used over 53 cpu-years! N # Solutions 19 20 21 22 23 24 25 4968057848 39029188884 314666222712 2691008701644 24233937684440 227514171973736 2207893435808350 N-Queens Challenge • How many solutions can you calculate in 1 hour, using as many machines as you can get ? • Part of the PlugTest 2005 and 2006 held in Sophia Antipolis Satin & Ibis • Our submission used a Satin/Ibis application • Uses Satin (Divide & Conquer) to solve N-Queens recursively • Satin distributes different parts of the computation over the Grid – Uses Ibis for communication Satin & Ibis • Used the Grid 5000 testbed to run – Large testbed distributed over France • Currently contains some 2500 CPU's • Will (eventually) contain 5000 CPU's Largest Satin/Ibis Run • Our best run used 961 CPUs on 5 clusters for a single application – Largest number of CPUs of all contestants • Solved N=22 in 25 minutes • Second place in the 2005 contest Current work: VL-e • VL-e: Virtual Laboratories for e-Science • Large Dutch project (2004-2008): – 40 M€ (20 M€ BSIK funding from Dutch goverment) • 20 partners – Academia: Amsterdam, TU Delft, VU, CWI, NIKHEF, .. – Industry: Philips, IBM, Unilever, CMG, .... DAS-3 • Next generation grid in the Netherlands • Partners: – NWO & NCF (Dutch science foundation) – ASCI – Gigaport-NG/SURFnet: DWDM computer backplane (dedicated optical group of up to 8 lambdas) – VL-e and MultimediaN BSIK projects • Optical 10Gb/s interconnect with Grid’5000 NOC StarPlane project • Application-specific management of optical networks • Future applications can: – dynamically allocate light paths, of 10 Gbit/sec each – control topology through the Network Operations Center • Gives flexible, dynamic, high-bandwidth links • Research questions: – How to provide this flexibility (across domains)? – How to integrate optical networks with applications? • Joint project with Cees de Laat (Univ. of Amsterdam), funded by NWO Summary • Parallel computing on Grids (distributed supercomputing) is a challenging and promising research area • Many grid programming environmenents exist • Ibis: a Java-centric Grid programming environment – Portable (“run anywhere”) and efficient • Current work: Virtual Laboratory for e-Science (VL-e), DAS-3, StarPlane