Establishing ParalleX-based Computing Systems for Extreme Scale
Thomas Sterling
Professor of Informatics and Computing, Indiana University
Chief Scientist and Associate Director, Center for Research in Extreme Scale Technologies (CREST), Pervasive Technology Institute at Indiana University
Fellow, Sandia National Laboratories CSRI
June 26, 2013

Introduction
• ParalleX program objectives
  – Scalability to 10^9 concurrency
  – Efficiency in the presence of uncertainty of asynchrony
  – Practical factors: reliability, energy use, generality, performance portability
  – Exploitation of architecture features: heterogeneity, overhead mechanisms
• Relation to HSA
  – Advanced architecture concepts enable ParalleX software
    • Heterogeneous parallel structures
    • Mechanisms for low-cost resource management, task scheduling, and addressing
  – ParalleX software provides a possible framework for managing HSA
    • Asynchronous operation
    • Dynamic adaptive resource management and task scheduling
    • Rich synchronization fabric based on futures and dataflow

XPRESS World
• ParalleX
  – Execution model for scalability & efficiency through runtime information
  – Multi-threading, message-driven, global name space paradigm
• OpenX
  – Total system software architecture
• HPX
  – A set of runtime systems reflecting the ParalleX model
• LXK
  – Lightweight kernel OS for O(k) scalability
  – Based on commercial Catamount and current Kitten OS
• XPI
  – Low-level programming interface and intermediate representation

HPX Runtime System
• HPX-3
  – Original usable runtime, developed by H. Kaiser at LSU
  – Strong C++ bindings and Boost library use
  – Evolving international user community
• HPX-5
  – New C-based implementation developed at IU
  – Serves as an experimental code base to test new concepts
    • Reliability, debugging, real-time, energy awareness
    • Programming models, graph processing, dynamic applications
• HPX-4
  – XPRESS runtime software
  – Combines the HPX-3 threads package & the HPX-5 parcels and processes
  – Integration by SNL

Performance Factors – SLOWER
• P = e(L,O,W) * S(s) * a(r) * U(E)
  – P – performance (ops)
  – e – efficiency (0 < e < 1), a function of latency L, overhead O, and waiting W
  – S – scalability, with s the application's average parallelism
  – a – availability (0 < a < 1), a function of reliability r (0 < r < 1)
  – U – normalization factor per compute unit, with E the watts per average compute unit
• Starvation
  – Insufficiency of concurrency of work
  – Impacts scalability and latency hiding
  – Affects programmability
• Latency
  – Time-measured distance for remote access and services
  – Impacts efficiency
• Overhead
  – Critical-time additional work to manage tasks & resources
  – Impacts efficiency and granularity for scalability
• Waiting for contention resolution
  – Delays due to simultaneous access requests to shared physical or logical resources
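The slide gives the SLOWER model only in product form and leaves the functional forms of e, S, a, and U unspecified. The C++ sketch below composes the factors with deliberately simple stand-in functions and made-up parameter values; the function bodies and numbers are assumptions for illustration only, not part of the model.

#include <cstdio>

// Stand-ins for the SLOWER factors. Only the product form
// P = e(L,O,W) * S(s) * a(r) * U(E) comes from the talk; the
// bodies below are illustrative assumptions.

// Efficiency degraded by latency (L), overhead (O), and waiting (W),
// each expressed here as a fraction of time lost (0..1).
double e(double L, double O, double W) {
    return (1.0 - L) * (1.0 - O) * (1.0 - W);
}

// Scalability as a function of the application's average parallelism s.
double S(double s) { return s; }

// Availability as a function of reliability r (0 < r < 1); identity here.
double a(double r) { return r; }

// Normalization per compute unit; held constant regardless of E (watts
// per average compute unit) in this toy version.
double U(double E) { (void)E; return 1.0e9; }  // 1 Gops per unit

int main() {
    double L = 0.05, O = 0.10, W = 0.05;  // fractional time losses
    double s = 1.0e6;                      // average parallelism
    double r = 0.999;                      // reliability
    double E = 50.0;                       // watts per compute unit
    double P = e(L, O, W) * S(s) * a(r) * U(E);
    std::printf("Estimated performance P = %.3e ops\n", P);
}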
ParalleX Execution Model
• Lightweight multi-threading
  – Divides work into smaller tasks
  – Increases concurrency
• Message-driven computation
  – Move work to data
  – Keeps work local, stops blocking
• Constraint-based synchronization
  – Declarative criteria for work
  – Event driven
  – Eliminates global barriers
• Data-directed execution
  – Merger of flow control and data structure
• Shared name space
  – Global address space
  – Simplifies random gathers

ParalleX Advances
• New synthesis of advanced concepts
  – Some from prior art dating back 10–30 years
• Dynamic adaptive resource management and task allocation
• Message-driven computation instead of message-passing
• Global address space with object migration vs. distributed memory
• Elimination of global barriers
• Continuation migration – higher-order parallel control space
• “Processes” as contexts that span multiple nodes for performance portability and name hierarchy
• Infinite registers for single-assignment semantics to replace ILP parallelism

Localities
• A “locality” is a contiguous physical domain
• Guarantees compound atomic operations on local state
• Manages intra-locality latencies
• Exposes diverse temporal locality attributes
• Divides the world into synchronous and asynchronous
• The system comprises a set of mutually exclusive, collectively exhaustive localities
• A first-class object
• An attribute of other objects
• Heterogeneous
• Specific inalienable properties

Active Global Address Space (AGAS)
• Distributed
• Assumes no coherence between localities
• User variables
• Synchronization variables and objects
• Threads as first-class objects
• Moves virtual named elements in physical space
• Parcel sets (but not parcels!)
• Process
  – First-class object
  – Specifies a broad task
  – Defines a distributed environment

ParalleX Processes
• Provide execution contexts
  – Hierarchical (nested)
• Processes with internal parallel execution
  – Many activities within a process may operate concurrently
  – Threads & child processes
• Span multiple localities
  – May overlap localities
• Ephemeral
  – Created at any time by a parent process
  – May terminate at any time
• Calling sequence that is both functional and object-oriented

Multi-Grain Multithreading
• Threads are collections of related operations that perform on locally shared data
• A thread is a continuation combined with a local environment
  – Modifies local named data state and temporaries
  – Updates intra-thread and inter-thread control state
• Does not assume sequential execution
  – Other flow control for intra-thread operations is possible
• A thread can realize a transaction phase
• A thread does not assume dedicated execution resources
• A thread is a first-class object identified in the global name space
• Threads are ephemeral

ParalleX Computation Complex
[Figure: a computation complex shown as runtime-aware, logically active, and physically active.]

ParalleX Computation Complex State Diagram
• States:
  – C – Create-thread event
  – I – Initialized
  – P – Pending
  – E – Executing
  – B – Blocked
  – S – Suspended
  – D – Depleted
  – M – Migration
  – F – Finished
  – T – Terminated
• Transitions:
  – 1 – Allocate and initialize new thread
  – 2 – Register thread with runtime system
  – 3 – Delete thread
  – 4 – Assign thread to resources
  – 5 – Emergency termination of thread
  – 6 – Interrupt thread execution
  – 7 – Voluntary or emergency termination of thread
  – 8 – Thread encounters unsatisfied data dependency
  – 9 – Data dependency satisfied before losing resource allocation
  – 10 – Resource allocation lost
  – 11 – Data dependency satisfied after losing resource allocation
  – 12 – Archive thread, remove from runtime system
  – 13 – Recover archived thread, return to runtime system
  – 14 – Reuse thread for new thread instantiation
  – 15 – Migrate thread across localities
  – 16 – Remove thread from runtime system
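The diagram's arrows do not survive in this text rendering, so the following C++ sketch records the states and a handful of the numbered transitions as a table. The from/to pairing of each transition is inferred from its label, not read off the original figure.

#include <cstdio>

// States of a ParalleX computation complex (thread), per the slide's
// symbol table.
enum class State {
    Created,      // C - create-thread event
    Initialized,  // I
    Pending,      // P
    Executing,    // E
    Blocked,      // B
    Suspended,    // S
    Depleted,     // D
    Migration,    // M
    Finished,     // F
    Terminated    // T
};

// A plausible reading of some of the 16 numbered transitions, encoded
// as (event number, from, to). The pairings are inferred from the
// transition labels above, since the figure's arrows were lost.
struct Transition { int event; State from, to; };

const Transition table[] = {
    {1,  State::Created,     State::Initialized}, // allocate and initialize
    {2,  State::Initialized, State::Pending},     // register with runtime
    {4,  State::Pending,     State::Executing},   // assign to resources
    {6,  State::Executing,   State::Suspended},   // interrupt execution
    {8,  State::Executing,   State::Blocked},     // unsatisfied data dependency
    {9,  State::Blocked,     State::Pending},     // dependency satisfied in time
    {10, State::Blocked,     State::Depleted},    // resource allocation lost
    {11, State::Depleted,    State::Pending},     // dependency satisfied late
    {7,  State::Executing,   State::Terminated},  // voluntary/emergency termination
};

int main() {
    for (const Transition& t : table)
        std::printf("event %2d: state %d -> state %d\n",
                    t.event, (int)t.from, (int)t.to);
}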
Motivation for Parcels
• To achieve high scalability, efficiency, and programmability
• To enable new models of computation – e.g., ParalleX
• To facilitate conventional models of computation – e.g., MPI
• Hide latency
  – Support overlap of communication with computation
  – Move work to data, not always data to work
• Virtualization of computing work
  – Segregate physical resources from abstract tasks
  – Circumvent blocking of resource utilization
• Support asynchrony of operation
• Maintain symmetry of semantics between synchronous and asynchronous operation

Parcels Invoke Remote Tasks
[Figure: a parcel created at the source locality carries a destination, an action, and a payload; at the destination locality it instantiates a remote thread that applies the action's methods to the target operand, and a return parcel can carry results back.]

Parcel Structure
[Figure: a PX parcel consists of destination, action, payload, and continuations fields, framed by transport/network-layer protocol wrappers (header and CRC trailer).]
• Parcels may utilize underlying communication protocol fields to minimize the message footprint (e.g., destination address, checksum)

Parcel Destination
• Application-specific addressing
  – Global virtual address of the target (recipient) object
  – Current implementation: HPX GID (global ID), 128-bit
• System-specific addressing
  – Physical address of a hardware resource
  – Supports direct access to register space or state-machine manipulation
  – May be required for percolation

Parcel Actions
• Data movement
  – Block read and write
  – Lightweight scalar load/store
• Synchronization
  – Atomic memory operations (AMOs)
  – Basic LCOs
• Thread manipulation
  – Thread instantiation
  – Thread register access
  – Thread control and state management
• Direct hardware access
  – Counters
  – Physical memory
  – State machines

Parcel Payload
• Lightweight
  – Remote function arguments (including LCO actions)
  – AMO operands
  – Scalar load/store
• Heavyweight
  – Action-dependent
  – Migrating object state
  – Page relocation
  – Some percolation instances

Parcel Continuations
• Optional
• Format: list of arbitrary LCOs
• Accept result(s) returned by the parcel-invoked action
• Continuation types
  – Return the value to the requestor
  – Standard LCO evaluation
    • Spawn local computation
    • Propagate the result to another locality via a parcel
  – Perform a system call
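The parcel slides above enumerate the fields a PX parcel carries. As a reading of those slides, here is a minimal C++ sketch of that structure; the type names and representations (a GID as two 64-bit words, an action as an opaque integer ID) are illustrative assumptions, not HPX definitions.

#include <cstdint>
#include <vector>

// 128-bit global ID, as noted on the "Parcel Destination" slide;
// represented here as two 64-bit words for portability.
struct GlobalId { std::uint64_t hi, lo; };

// An action names the operation to invoke at the destination. A real
// runtime maps this to registered code; an opaque ID suffices here.
using ActionId = std::uint32_t;

// Sketch of a PX parcel following the "Parcel Structure" slide:
// destination, action, payload, continuations. The transport header
// and CRC trailer belong to the network layer and are omitted.
struct Parcel {
    GlobalId destination;                  // global virtual address of target object
    ActionId action;                       // operation to perform at the destination
    std::vector<std::uint8_t> payload;     // arguments or migrating object state
    std::vector<GlobalId> continuations;   // optional list of LCOs to receive results
};

int main() {
    // Construct a parcel addressed to a (made-up) global object, with a
    // three-byte payload and one continuation LCO.
    Parcel p{ GlobalId{0x1, 0x2A}, 7, {1, 2, 3}, { GlobalId{0x1, 0x2B} } };
    return p.payload.size() == 3 ? 0 : 1;
}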
Parcel Interaction with the System
[Figure: parcels and AGAS spanning localities 1 through n, interacting with processes, threads and metathreads, LCOs and DCOs, AMOs, and copy semantics.]

Goals of the ParalleX LCO
• Exploit parallelism in a diversity of forms and granularities
  – For extreme scalability
  – e.g., exploit metadata-defined parallelism
• Latency hiding at system-wide distances
  – Avoid conventional round-trip control patterns
  – Support latency-mitigating architectures
• Provide a framework for efficient fine-grain synchronization
  – Eliminate use of global barrier synchronization where possible
  – Mitigate effects of variable thread lengths within fork-join structures
• Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
• Migration of continuations across system and computation
• Support eager-lazy evaluation methods
• Support distributed control operations
• Semantics of failure response for graceful degradation
  – Used to establish points for “micro-checkpointing” and validation for error detection, propagation isolation, and recovery

LCOs
• A number of forms of synchronization are incorporated into the semantics
• Support message-driven remote thread instantiation
• Finite state machines (FSMs)
• In-memory synchronization
  – Control state is in the name space of the machine
  – Producer-consumer in memory
  – Local mutual exclusion protection
  – Synchronization mechanisms as well as state are presumed to be intrinsic to memory
• Basic synchronization objects:
  – Mutexes
  – Semaphores
  – Events
  – Full/empty bits
  – Dataflow
  – Futures
• User-defined (custom) LCOs

Generic LCO
• All LCOs incorporate these generic properties
  – Basis for custom user-defined LCOs
• First-class object
  – In the user name space
• Resides entirely within a single locality
  – Does not span locality boundaries
  – Allocated with the data it will affect
• Instantiates threads
  – Within the resident locality
  – Or, by generating parcels
• Lifespan
  – Persistent
  – Ephemeral
• Conditional
  – Actions depend on internal control state and incident events

[Figure: generic LCO structure – incident events enter an event buffer; an event assimilation method updates the control state; a predicate control method decides whether to fire; a thread-create method (among the inherited generic methods) instantiates a new thread.]

Generic LCO Flow Graph
[Figure: an incident event drives a data-state update and a control-state update; a predicate test that fails leaves the LCO waiting for further events, while one that succeeds fires the consequence action – possibly creating an LCO or thread – and returns results.]
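The generic LCO description and flow graph above translate naturally into code. The following C++ sketch models one concrete instance, a dataflow-style LCO: trigger() plays the role of the event assimilation method, predicate() the predicate control method, and the stored callable stands in for the thread-create method. Class and method names are invented for illustration and do not come from HPX.

#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// A dataflow-style LCO that fires its consequence action once all n
// operands (incident events) have arrived. Illustrative only.
class DataflowLCO {
public:
    DataflowLCO(std::size_t arity,
                std::function<void(const std::vector<int>&)> action)
        : operands_(arity), present_(arity, false), action_(std::move(action)) {}

    // Event assimilation method: record an incident event (an operand)
    // and update control state; fire if the predicate now succeeds.
    void trigger(std::size_t slot, int value) {
        operands_[slot] = value;
        present_[slot] = true;
        if (predicate()) action_(operands_);  // consequence action ("create thread")
    }

private:
    // Predicate control method: succeed when every operand has arrived.
    bool predicate() const {
        for (bool p : present_) if (!p) return false;
        return true;
    }

    std::vector<int> operands_;   // data state
    std::vector<bool> present_;   // control state
    std::function<void(const std::vector<int>&)> action_;
};

int main() {
    DataflowLCO add(2, [](const std::vector<int>& in) {
        std::printf("fired: %d + %d = %d\n", in[0], in[1], in[0] + in[1]);
    });
    add.trigger(0, 3);  // predicate fails: still waiting for slot 1
    add.trigger(1, 4);  // predicate succeeds: consequence action runs
}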
Application: Adaptive Mesh Refinement (AMR) for Astrophysics Simulations
• Binary black hole and black hole–neutron star mergers are LIGO candidates
• AMR simulations of black holes typically scale very poorly

OpenX Software Architecture
[Figure: OpenX software architecture diagram.]

HPX Runtime Design
• The current version of HPX provides the following infrastructure, as defined by the ParalleX execution model:
  – Complexes (ParalleX threads) and ParalleX thread management
  – Parcel transport and parcel management
  – Local Control Objects (LCOs)
  – Active Global Address Space (AGAS)

Introspection through Machine Intelligence
• Control of system resources and tasks
• Dynamic and adaptive to changing conditions
• Policies direct the choice space
• The complexity of the system and its operation requires sophisticated reasoning about, and management of, alternative actions
• This calls for a high degree of intelligence built into the machine itself
• CRIS – Cognitive Real-time Interactive System
  – An early project to isolate key principles of machine intelligence
  – To be inserted in the machine control loop and human interface
  – Enables declarative programming methods

XPI Goals & Objectives
• A programming interface for extreme-scale computing
• A syntactical representation of the ParalleX execution model
• Stable interface to the underlying runtime system
• Target for source-to-source compilation from high-level parallel programming languages
• Low-level, user-readable parallel programming syntax
• Schema for early experimentation
• Implemented through libraries
• Touch and feel of familiar MPI
• Enables dynamic adaptive execution and asynchrony management

Classes of XPI Operations
• Miscellaneous
  – Initialization and clean-up
• Parcels
  – Message-driven computation
• Data types
• Threads
  – Computing actions
• Local Control Objects for synchronization and continuations
• Active Global Address Space
  – System-wide flat virtual address space
• PX Processes
  – Hierarchical contexts and name spaces

Conclusions
• HPC is in a (6th) phase change
• Ultra-high-scale computing of the next decade will require a new model of computation to effectively exploit new technologies and guide system co-design
• ParalleX is an example of an experimental execution model that addresses key challenges on the way to Exascale
• HSA incorporates a number of structures and mechanisms that would facilitate HPX operation and would in turn benefit from it
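As a closing illustration of the "Classes of XPI Operations" above, and of the promised touch and feel of MPI, here is a hypothetical program skeleton. Every xpi_* name is invented for this sketch and stubbed locally so it compiles; none of it is taken from the actual XPI specification.

#include <cstdio>

// --- Stub declarations so the sketch is self-contained; a real XPI
// --- would be provided as a library. All names below are hypothetical.
struct xpi_parcel { int action; };
static void xpi_init(int* argc, char*** argv) { (void)argc; (void)argv; }
static void xpi_finalize() {}
static xpi_parcel xpi_parcel_create(int action) { return {action}; }
static void xpi_parcel_send(const xpi_parcel& p) {
    std::printf("sending parcel with action %d (stub)\n", p.action);
}

int main(int argc, char** argv) {
    xpi_init(&argc, &argv);               // Miscellaneous: initialization
    xpi_parcel p = xpi_parcel_create(42); // Parcels: message-driven computation
    xpi_parcel_send(p);                   // would target a global (AGAS) address
    xpi_finalize();                       // Miscellaneous: clean-up
}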