STAPL-RTS: An Application Driven Runtime System

Ioannis Papadopoulos, Nathan Thomas, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger
{ipapadop, nthomas, fidel, amato, rwerger}@cse.tamu.edu
http://parasol.tamu.edu

Motivation
• Parallel programming is hard!
• We want:
  • Easy-to-write, maintainable programs (by non-experts)
  • Scalability
  • Portability
• It is difficult to satisfy all of these requirements at once.
• Modern platforms do not make it any easier:
  • Networked multi-socket nodes with multicore processors
  • Accelerators and co-processors
• Mixing different programming models (e.g., MPI+OpenMP) is confusing and error-prone.

A solution: distributed task dependence graphs
• Use a dataflow representation of the application:
  • Describe what your application does, not how to do it
  • Easy to write by non-experts
• Create a distributed task dependence graph:
  • Scalable
  • No data races
• The runtime system deals with task mapping, execution, and communication:
  • Abstracts the platform and offers a uniform communication layer
• New issues:
  • Loss of contextual information
  • Data is copied over and over
  • The memory hierarchy is ignored
  • How do we take advantage of shared memory?

Our contribution
• Address some of the issues of distributed dataflow through a high-level, application-to-runtime transfer of information:
  • Maintain the appropriate abstractions (platform, algorithm, etc.)
  • Preserve semantics
  • Transfer information from the algorithm, through the internal representation (the task dependence graph), into the runtime system
• Leverage the high-level information for performance:
  • Copy elision between tasks in shared memory: up to 7% faster in a data mining application
  • Increased communication aggregation: up to 70% faster on a graph algorithm (128K cores)

Related work: shared/distributed memory integration
• Hybrid MPI+OpenMP: ad-hoc solution
• Habanero-C + MPI [Chatterjee et al., IPDPS'13]: two different models
• HPX [Heller et al., ScalA'13]: distinction between shared and distributed memory ("localities")
• Charm++ [Kale et al., OOPSLA'93]: user-managed thread safety
• Chapel [Callahan et al., HIPS'04]: complex algorithms require locality knowledge

Related work: exploiting object immutability
• Pechtchanski et al., ISCOPE'05: lower-level optimizations
• Javari [Birka et al., OOPSLA'04] and IGJ [Zibin et al., ESEC-FSE'07]: protection against mutation, but no optimizations applied
• Gordon et al., OOPSLA'12: prove immutability for thread-safety guarantees, but no optimizations
• Kanor [Jiao et al., HiPC'11]: the compiler generates MPI or OpenMP code, but not both

STAPL: Standard Template Adaptive Parallel Library
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL).
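To make the STL analogy concrete before the side-by-side comparison that follows, here is a minimal sketch contrasting the two styles on an element-wise update. The STL half is standard C++; the STAPL half is only illustrative: the container, view, and skeleton names (stapl::array, array_view, map_func) are assumptions about the interface summarized below, not an exact reproduction of the library's API.

    #include <algorithm>
    #include <vector>

    void increment_all_stl() {
      // STL: a sequential algorithm expressed over iterators into a container.
      std::vector<int> v(100, 1);
      std::transform(v.begin(), v.end(), v.begin(),
                     [](int x) { return x + 1; });
    }

    // STAPL (illustrative sketch, not the library's exact API): the same operation
    // expressed over a view of a distributed container; the skeleton machinery turns
    // it into a task dependence graph (PARAGRAPH) that the runtime maps to the machine.
    void increment_all_stapl() {
      stapl::array<int> pa(100);                    // hypothetical distributed container
      stapl::array_view<stapl::array<int>> vw(pa);  // hypothetical view over the container
      stapl::map_func([](int& x) { ++x; }, vw);     // hypothetical data-parallel skeleton call
    }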
STL
• Iterators provide abstracted access to data stored in Containers.
• Algorithms are sequences of instructions that transform the data.

STAPL
• Views provide abstracted access to distributed data stored in Containers.
• Algorithms are specified by Skeletons, represented at run time as PARAGRAPHs: parallel task graphs that transform the input data.
• The task dependence graphs are mapped to the machine by STAPL.

The STAPL stack
• Users invoke algorithms on containers via views.
• Algorithms are specified through skeletons.
• Skeletons instantiate a task dependence graph:
  • They populate the PARAGRAPH.
• The Runtime System (STAPL-RTS):
  • Maps runnable tasks to the machine
  • Abstracts communication

Execution Model
• SPMD model with task parallelism support:
  • Implicit parallelism (e.g., MPI)
  • Task and data parallelism
  • Nested parallelism
• Each nested parallel section executes on a number of locations.
• A location is an isolated address space with associated execution capabilities (e.g., a thread).
• Users create distributed objects (p_objects) on locations.
• Locations communicate with each other only through Remote Method Invocations (RMIs) on p_objects.

Remote Method Invocation (RMI)
• Asynchronous communication through RMIs:
  • Ability to move work, data, or both
  • Scalable
• Unified communication API over distributed and shared memory:
  • A single model
  • Easy to program
• Copy semantics for RMI arguments:
  • Allow for asynchrony
  • Eliminate data races and side effects
• Implicit RMI ordering guarantees (consistency model).

Unifying shared/distributed memory
• Hybrid OpenMP+MPI (mixed-mode): two different programming models; typically different algorithms and implementations for user algorithms; integration and performance optimization are implemented by the user.
• STAPL-RTS (ARMI): a unified model over distributed and shared memory based on RMIs; one algorithm and one implementation for user algorithms; integration is implemented through the runtime, which also offers adaptive performance.
• Unification is achieved through the ARMI primitives.

RMI Usage

    struct A : public p_object {
      int m_value;
      void set(int t) { m_value = t; }
      int get() const { return m_value; }
    };

    void foo(…) {
      A a;
      auto h = a.get_rmi_handle();                // communication handle
      int t = 5;
      async_rmi(h, 1, &A::set, t);                // asynchronous call with a copy of t
      t = 6;                                      // does not affect the RMI already issued
      future<int> f = opaque_rmi(h, 1, &A::get);  // RMI that returns a value
      int y = f.get();
      assert(y == 5);                             // value is guaranteed to be 5
    }

Application Driven Optimizations
• High-level, application-to-runtime transfer of information:
  • Take advantage of the task dependence graph
  • Use high-level annotations
  • Maintain the appropriate abstraction
1. Copy elision in shared memory (zero-copy):
  • Reduce data copying
2. RMI tunnels (communication channels):
  • Relax consistency guarantees
  • Increase communication throughput

1. Copy elision in shared memory
• Avoid copies in shared memory with appropriate high-level information:
  • Skeletons drive the creation of the task dependence graph (PARAGRAPH).
  • The PARAGRAPH annotates the flow of data.
  • The STAPL-RTS uses the annotations at run time to avoid copying data.

PARAGRAPH rules for annotation
The task dependence graph carries contextual information; a small sketch of the resulting decision follows these rules.
• Zero-copy: if a producer task has a single consumer of its value and that consumer is on a different location, the value is moved into the STAPL-RTS.
[Diagram: the value produced by T1 on location L0 is moved into the STAPL-RTS and moved again to its consumer T2 on location L1.]
• Immutable sharing: if a task has multiple consumers and at least one of them is on a different location, the value is immutably shared between the consumers.
[Diagram: T1 on L0 moves its value into the STAPL-RTS, where it becomes an immutable shared object; consumer T2 on L1 receives it and consumer T3 on L0 gets a reference.]
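As a rough illustration of these two rules (a sketch only: this is not the PARAGRAPH's actual annotation code, and the names below are invented), the decision for each produced value reduces to counting its consumers and checking whether any of them resides on another location:

    #include <cstddef>

    // Hypothetical annotation attached to a produced value, mirroring the rules above.
    enum class value_flow { default_copy, moved_into_rts, immutable_shared };

    value_flow annotate_value(std::size_t num_consumers, bool has_remote_consumer)
    {
      if (has_remote_consumer) {
        if (num_consumers == 1)
          return value_flow::moved_into_rts;   // zero-copy: single consumer on another location
        return value_flow::immutable_shared;   // multiple consumers, at least one remote
      }
      return value_flow::default_copy;         // only local consumers: the rules above do not apply
    }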
STAPL-RTS copy elision support
• Move: transfer objects between locations using move semantics (C++11).

    async_rmi(dest, obj, &A::foo, std::move(x));   // x is handed over, not copied

• Immutable sharing: share an object via a read-only wrapper with reference counting.

    auto x_ref = immutable_shared(x);
    async_rmi(dest, obj, &A::foo, x_ref);
    auto y = x_ref;                                // copies the wrapper, not the data

• Avoid temporary copies of return values.

    auto r = opaque_rmi(dest, obj, &A::get);
    auto x = r.get();

Experimental Setup
• Cray XK7m-200
  • 24 compute nodes:
    • 12 single-socket nodes with GPUs and 32 GB of memory
    • 12 dual-socket nodes with 64 GB of memory
  • AMD Opteron 6272 (Interlagos), 16-core CPU @ 2.1 GHz
• IBM BG/Q (LLNL)
  • 24,576 nodes
  • Single-socket nodes with 16 GB of memory
  • IBM PowerPC A2, 16-core CPU @ 1.6 GHz

Microbenchmarks
• OSU put/get microbenchmarks on 1 node, 1 MPI process, 2 threads per process.

K-means algorithm
• Clustering algorithm used in data mining (http://en.wikipedia.org/wiki/K-means_clustering).
• Points are clustered iteratively until K clusters remain.
• Data heavy: the result of each step is a tuple of vectors.
• Reduce operation: find the centroids and the number of elements of each cluster.
• Broadcast operation: distribute the reduction result to all locations.
• We can reduce data copies by leveraging shared memory:
  • Move semantics during reductions (avoid temporary vector copies)
  • Immutable sharing to broadcast the reduction results (avoid early and unnecessary vector copies)

Copy elision in K-means clustering
• K-means clustering algorithm, main computation kernel.
• Up to 7% gain, with no change to user code.

2. RMI Tunnels
• Given the basic RMI function:

    async_rmi(location, obj, pmf, params…);

• What if the algorithm knows something about the pattern of communication?
  • E.g., homogeneous vertex updates in graphs.
• Use a partially evaluated RMI function:

    f = bind(async_rmi, dest, obj, &A::recv, _1);
    f(10);   // calls A::recv(10)
    f(5);    // calls A::recv(5)

• We call this an RMI tunnel: a communication channel with one or more fixed parameters (e.g., destination, target object, etc.).

Applying RMI Tunneling
• Example: visiting vertices of a stapl::graph.
• Basic version:

    for (auto v : my_vertices) {
      for (auto n : neighbors_of(v))
        async_rmi(location_of(n), &graph::visit, wf);
      // other traffic (e.g., container metadata, PARAGRAPH) shares the same path
    }

• Using a tunnel:

    auto t = bind(async_rmi, &graph::visit);
    for (auto v : my_vertices) {
      for (auto n : neighbors_of(v))
        t(location_of(n), wf);
      // other traffic does not get injected into the tunnel
    }

Connected Components algorithm
• Connected components: find the subgraphs in which there exists a path between any two vertices.
• Each vertex can visit its neighbors in parallel.
• The algorithm tolerates relaxed consistency.
• We can therefore use RMI tunneling at the algorithm level:
  • Create a dedicated communication channel for vertex updates
  • Reduces run-time checks
  • Decreases the message payload size by eliminating redundant information

Tunneling with Connected Components
• Connected Components algorithm on a Newman-Watts-Strogatz graph.
• From 1.5x faster at 32 cores to 1.7x faster at 128K cores (log-log scale). Improved scalability.

Conclusion
• How to get good performance from high-level parallel programming models:
  • Transfer contextual information from the application to the runtime system
  • It is not necessary to break the abstractions
• We optimized at the lower levels:
  • Took advantage of shared memory
  • Elided copies
  • Reduced communication cost
• Just the tip of the iceberg!
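As a closing illustration of the tunneling idea above: the mechanism is partial evaluation of the communication call. The self-contained sketch below reproduces it with plain std::bind; the send function and its arguments are invented stand-ins for async_rmi and its parameters, used only so the example compiles on its own.

    #include <functional>
    #include <iostream>

    // Stand-in for an RMI primitive (in the deck this would be async_rmi).
    void send(int dest, const char* target, int value) {
      std::cout << "send " << value << " to " << target
                << " on location " << dest << '\n';
    }

    int main() {
      using namespace std::placeholders;
      // "Tunnel": destination and target are fixed once; only the payload varies per call.
      auto tunnel = std::bind(send, /*dest=*/1, /*target=*/"graph::visit", _1);
      tunnel(10);   // send 10 through the fixed channel
      tunnel(5);    // send 5 through the fixed channel
    }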