Establishing ParalleX-based Computing
Systems for Extreme Scale
Thomas Sterling
Professor of Informatics and Computing, Indiana University
Chief Scientist and Associate Director
Center for Research in Extreme Scale Technologies (CREST)
Pervasive Technology Institute at Indiana University
Fellow, Sandia National Laboratories CSRI
June 26, 2013
Introduction
• ParalleX Program objectives
– Scalability to 10^9 concurrency
– Efficiency in the presence of uncertainty of asynchrony
– Practical factors: reliability, energy use, generality, performance portability
– Exploitation of architecture features: heterogeneity, overhead mechanisms
• Relation to HSA
– Advanced architecture concepts enable ParalleX software
• Heterogeneous parallel structures
• Mechanisms for low cost resource management, task scheduling, and addressing
– ParalleX software provides possible framework for managing HSA
• Asynchronous operation
• Dynamic adaptive resource management and task scheduling
• Rich synchronization fabric based on futures and dataflow
XPRESS World
• ParalleX
– Execution model for scalability & efficiency through runtime info
– Multi-threading message-driven global name space paradigm
• OpenX
– Total system software architecture
• HPX
– A set of runtime systems reflecting the ParalleX model
• LXK
– Lightweight kernel OS for O(k) scalability
– Based on commercial Catamount and current Kitten OS
• XPI
– Low level programming interface and intermediate representation
HPX Runtime System
• HPX-3
– Original usable runtime developed by H. Kaiser at LSU
– Strong C++ bindings and Boost Library use
– Evolving international user community
• HPX-5
– New C-based implementation developed at IU
– Serves as experimental code base to test new concepts
• Reliability, debugging, real-time, energy aware
• Programming models, graph processing, dynamic applications
• HPX-4
– XPRESS runtime software
– Combines HPX-3 threads package & HPX-5 parcels and processes
– Integration by SNL
Performance Factors - SLOWER
• P = e(L,O,W) * S(s) * a(r) * U(E) (a numeric sketch follows this slide)
– P – performance (ops)
– e – efficiency (0 < e < 1), a function of latency L, overhead O, and waiting W
– S – scalability, a function of s, the application's average parallelism
– a – availability (0 < a < 1), a function of reliability r (0 < r < 1)
– U – normalization factor per compute unit, a function of E
– E – watts per average compute unit
• Starvation
– Insufficiency of concurrency of work
– Impacts scalability and latency hiding
– Affects programmability
• Latency
– Time measured distance for remote access and services
– Impacts efficiency
• Overhead
– Critical time additional work to manage tasks & resources
– Impacts efficiency and granularity for scalability
• Waiting for contention resolution
– Delays due to simultaneous access requests to shared physical or logical resources
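For concreteness, the relation can be evaluated once functional forms are assumed for e, S, a, and U. A minimal sketch in C follows; all constants and functional forms are invented for illustration, since the SLOWER model itself does not fix them.

#include <stdio.h>

/* Illustrative-only evaluation of P = e(L,O,W) * S(s) * a(r) * U(E).
 * All constants below are assumptions for this sketch; the SLOWER
 * model does not prescribe them. */
int main(void) {
    double e = 0.8;    /* efficiency after latency, overhead, waiting (0 < e < 1) */
    double S = 1.0e6;  /* scalability term from average application parallelism s */
    double a = 0.99;   /* availability, a function of reliability r (0 < a < 1) */
    double U = 1.0e9;  /* ops per average compute unit at E watts per unit */
    double P = e * S * a * U;
    printf("P = %.3e ops\n", P);
    return 0;
}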
ParalleX Execution Model
• Lightweight multi-threading
– Divides work into smaller tasks
– Increases concurrency
• Message-driven computation
– Move work to data
– Keeps work local, avoids blocking
• Constraint-based synchronization (see the sketch after this slide)
– Declarative criteria for work
– Event driven
– Eliminates global barriers
• Data-directed execution
– Merger of flow control and data structure
• Shared name space
– Global address space
– Simplifies random gathers
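To make the constraint-based synchronization bullet concrete, here is a minimal future-style LCO in C using POSIX threads: a consumer waits on a declarative condition ("value is set") rather than a global barrier. The names are illustrative for this sketch, not the HPX API.

#include <pthread.h>
#include <stdio.h>

/* Minimal future-style LCO sketch (illustrative names, not the HPX API). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             ready;   /* the declarative criterion for proceeding */
    int             value;
} future_t;

static void future_init(future_t *f) {
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->cond, NULL);
    f->ready = 0;
}

static void future_set(future_t *f, int v) {   /* producer: satisfy the constraint */
    pthread_mutex_lock(&f->lock);
    f->value = v;
    f->ready = 1;
    pthread_cond_broadcast(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

static int future_get(future_t *f) {           /* consumer: event-driven wait */
    pthread_mutex_lock(&f->lock);
    while (!f->ready)
        pthread_cond_wait(&f->cond, &f->lock);
    int v = f->value;
    pthread_mutex_unlock(&f->lock);
    return v;
}

static void *producer(void *arg) {
    future_set((future_t *)arg, 42);
    return NULL;
}

int main(void) {
    future_t f;
    pthread_t t;
    future_init(&f);
    pthread_create(&t, NULL, producer, &f);
    printf("future value: %d\n", future_get(&f));  /* prints 42 */
    pthread_join(t, NULL);
    return 0;
}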
ParalleX Advances
• New synthesis of advanced concepts
– Some from prior art dating 10 – 30 years
• Dynamic adaptive resource management and task allocation
• Message-driven computation instead of message-passing
• Global address space with object migration vs. distributed
• Elimination of global barriers
• Continuations migration – higher order parallel control space
• "Processes" as contexts that span multiple nodes for performance portability and name hierarchy
• Infinite registers for single assignment semantics to replace ILP parallelism
Localities
• A "locality" is a contiguous physical domain (see the sketch after this slide)
• Guarantees compound atomic operations on local state
• Manages intra-locality latencies
• Exposes diverse temporal locality attributes
• Divides the world into synchronous and asynchronous
• System comprises a set of mutually exclusive, collectively exhaustive localities
• A first class object
• An attribute of other objects
• Heterogeneous
• Specific inalienable properties
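A loose sketch of a locality as a first-class descriptor; the field names and types below are assumptions for illustration, not the actual HPX representation.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative locality descriptor (assumed fields, not the HPX layout):
 * a contiguous physical domain, itself a first-class object and an
 * attribute of the objects it hosts. */
typedef struct {
    uint32_t  id;            /* member of a mutually exclusive, exhaustive set */
    uintptr_t base;          /* start of the contiguous physical domain */
    size_t    size;          /* extent of local state */
    bool      heterogeneous; /* may contain diverse compute resources */
} locality_t;

int main(void) {
    locality_t loc = { 0, 0x10000000u, 1u << 30, true };
    /* within loc, compound atomic operations on local state are guaranteed;
     * interactions with other localities are asynchronous */
    printf("locality %u: %zu bytes\n", (unsigned)loc.id, loc.size);
    return 0;
}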
Active Global Address Space (AGAS)
• Distributed
• Assumes no coherence between localities
• User variables
• Synchronization variables and objects
• Threads as first-class objects
• Moves virtual named elements in physical space (see the sketch after this slide)
• Parcel sets (but not parcels!)
• Process
– First class object
– Specifies a broad task
– Defines a distributed environment
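A hedged sketch of how AGAS can move virtual named elements in physical space: a 128-bit GID (the width the Parcel Destination slide attributes to the current HPX implementation) resolves through a runtime table, so migrating an object only updates its table entry. The table and function names here are invented for illustration.

#include <stdint.h>
#include <stdio.h>

/* Illustrative AGAS resolution (names invented for this sketch).
 * A 128-bit GID names an object independently of where it lives. */
typedef struct { uint64_t hi, lo; } px_gid;

typedef struct {
    px_gid   gid;       /* virtual name, stable across migration */
    uint32_t locality;  /* current physical home */
} agas_entry;

/* With no coherence between localities, resolution is a lookup at send
 * time, not a cached translation. */
static agas_entry table[2] = {
    { {0, 1}, 0 },
    { {0, 2}, 3 },
};

static int agas_resolve(px_gid g, uint32_t *loc_out) {
    for (unsigned i = 0; i < 2; i++)
        if (table[i].gid.hi == g.hi && table[i].gid.lo == g.lo) {
            *loc_out = table[i].locality;
            return 0;
        }
    return -1;  /* unknown GID */
}

int main(void) {
    uint32_t loc;
    px_gid g = {0, 2};
    if (agas_resolve(g, &loc) == 0)
        printf("object {0,2} currently at locality %u\n", (unsigned)loc);
    /* migration = updating the table entry; the GID never changes */
    return 0;
}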
ParalleX Processes
• Provides execution contexts
– Hierarchical (nested)
• Processes with internal parallel execution
– Many activities within a process may operate concurrently
– Threads & child processes
• Span multiple localities
– May overlap localities
• Ephemeral
– Created at any time by parent process
– May terminate at any time
• Calling sequence that is both functional and object-oriented
Multi-Grain Multithreading
• Threads are collections of related operations that perform on
locally shared data
• A thread is a continuation combined with a local environment (see the sketch after this slide)
– Modifies local named data state and temporaries
– Updates intra-thread and inter-thread control state
• Does not assume sequential execution
– Other flow control for intra-thread operations possible
• Thread can realize transaction phase
• Thread does not assume dedicated execution resources
• Thread is first class object identified in global name space
• Thread is ephemeral
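One way to read the "continuation combined with a local environment" definition is sketched below; the types are illustrative assumptions, not HPX's thread representation.

#include <stdio.h>

/* Illustrative reading of "a thread is a continuation combined with a
 * local environment" (types invented for this sketch). */
typedef struct {
    void (*resume)(void *env);  /* where execution continues */
    void  *env;                 /* local named data state and temporaries */
} px_continuation;

static void step(void *env) {
    int *counter = env;
    (*counter)++;               /* modify local named data state */
    printf("counter = %d\n", *counter);
}

int main(void) {
    int counter = 0;
    px_continuation thread = { step, &counter };
    thread.resume(thread.env);  /* nothing here assumes sequential execution
                                   or dedicated resources */
    return 0;
}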
ParalleX Computation Complex
[State diagram annotations: Runtime Aware, Logically Active, Physically Active.]
ParalleX Computation Complex State Diagram Symbol Table
States:
C – Create Thread Event
I – Initialized
P – Pending
E – Executing
S – Suspended
B – Blocked
D – Depleted
M – Migration
T – Terminated
F – Finished
Transitions:
1 – Allocate and initialize new thread
2 – Register thread with runtime system
3 – Delete thread
4 – Assign thread to resources
5 – Emergency termination of thread
6 – Interrupt thread execution
7 – Voluntary or emergency termination of thread
8 – Thread encounters unsatisfied data dependency
9 – Data dependency satisfied before losing resource allocation
10 – Resource allocation lost
11 – Data dependency satisfied after losing resource allocation
12 – Archive thread, remove from runtime system
13 – Recover archived thread, return to runtime system
14 – Reuse thread for new thread instantiation
15 – Migrate thread across locality
16 – Remove thread from runtime system
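A hedged sketch of the symbol table as code: the enum mirrors the states listed above, and the walk-through follows one plausible path. The destination state of each numbered transition is not given on the slide, so the path below is inferred for illustration only.

#include <stdio.h>

/* Thread states from the symbol table above (enum names are illustrative). */
typedef enum {
    TS_CREATED,      /* C */
    TS_INITIALIZED,  /* I */
    TS_PENDING,      /* P */
    TS_EXECUTING,    /* E */
    TS_SUSPENDED,    /* S */
    TS_BLOCKED,      /* B */
    TS_DEPLETED,     /* D */
    TS_MIGRATION,    /* M */
    TS_TERMINATED,   /* T */
    TS_FINISHED      /* F */
} thread_state;

int main(void) {
    thread_state s = TS_CREATED;  /* C: create thread event */
    s = TS_INITIALIZED;           /* 1: allocate and initialize new thread */
    s = TS_PENDING;               /* 2: register thread with runtime system */
    s = TS_EXECUTING;             /* 4: assign thread to resources */
    s = TS_BLOCKED;               /* 8: unsatisfied data dependency */
    s = TS_EXECUTING;             /* 9: dependency satisfied before losing resources */
    s = TS_FINISHED;              /* normal completion (inferred endpoint) */
    printf("final state: %d\n", (int)s);
    return 0;
}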
Motivation for Parcels
• To achieve high scalability, efficiency, programmability
• To enable new models of computation
– e.g., ParalleX
• To facilitate conventional models of computation
– e.g., MPI
• Hide latency
– Support overlap of communication with computation
– Move work to data, not always data to work
• Virtualization of computing work
– Segregate physical resource from abstract task
– Circumvent blocking of resource utilization
• Support asynchrony of operation
• Maintain symmetry of semantics between synchronous and asynchronous
operation
Parcels Invoke Remote Tasks
[Figure: a parcel (destination, action, payload) travels from the source locality to the destination locality, where target action code methods operate on the target operand data and a remote thread with its thread frames is created; a return parcel (destination, action, payload) carries results back. A remote-thread-create parcel is shown as the example.]
Parcel Structure
[Figure: a PX parcel consists of destination, action, payload, and continuations fields, wrapped in transport/network layer protocol headers and a CRC trailer.]
Parcels may utilize underlying communication protocol fields to minimize the message footprint (e.g., destination address, checksum).
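A minimal sketch of those four fields as a C struct, assuming the 128-bit GID the next slide mentions; the field names and widths are illustrative for this sketch, not HPX's wire format.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Illustrative PX parcel layout (assumed names and widths, not the HPX
 * wire format); transport wrappers and CRC are supplied by the
 * underlying network layer, as the slide notes. */
typedef struct { uint64_t hi, lo; } px_gid;   /* 128-bit global ID */

typedef struct {
    px_gid   destination;    /* global virtual address of the target object */
    uint32_t action;         /* which method to invoke at the destination */
    size_t   payload_size;
    uint8_t *payload;        /* arguments, operands, or migrating state */
    size_t   n_continuations;
    px_gid  *continuations;  /* LCOs that receive the action's result(s) */
} px_parcel;

int main(void) {
    px_parcel p = {0};
    p.action = 7;  /* e.g., index of an AMO or thread-instantiation action */
    printf("parcel header bytes (this sketch): %zu\n", sizeof p);
    return 0;
}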
Parcel Destination
• Application specific addressing:
– Global virtual address of target (recipient) object
– Current implementation: HPX GID (global ID), 128-bit
• System specific addressing:
– Physical address of a hardware resource
– Supports direct access to register space or state machine
manipulation
– May be required for percolation
Parcel Actions
• Data movement
– Block read and write
– Lightweight scalar load/store
• Synchronization
– Atomic Memory Operations
– Basic LCOs
• Thread manipulation
– Thread instantiation
– Thread register access
– Thread control and state management
• Direct hardware access
– Counters
– Physical memory
– State machines
Parcel Payload
• Lightweight
– Remote function arguments (includes LCO actions)
– AMO operands
– Scalar load / store
• Heavyweight
– Action-dependent
– Migrating object state
– Page relocation
– Some percolation instances
Parcel Continuations
• Optional
• Format: list of arbitrary LCOs
• Accept result(s) returned by parcel-invoked action
• Continuation types
– Return the value to the requestor
– Standard LCO evaluation
• Spawn local computation
• Propagate the result to another locality via parcel
– Perform a system call
Parcel Interaction with the System
[Figure: system view of localities 1 through n exchanging parcels; components shown within localities include AGAS, the main process, processes, threads, metathreads, LCOs, DCOs, AMOs, and copy semantics.]
Goals of the ParalleX LCO
• Exploit parallelism in diversity of forms and granularity
– For extreme scalability
– e.g., exploit meta-data defined parallelism
• Latency hiding at system-wide distances
– Avoid conventional round trip control patterns
– Support latency mitigating architectures
• Provide a framework for efficient fine grain synchronization
– Eliminate use of global barrier synchronization where possible
– Mitigate effects of variable thread lengths within fork-join structures
• Enable optimized runtime adaptive resource management and task scheduling for dynamic load balancing
• Migration of continuations across system and computation
• Support eager-lazy evaluation methods
• Support distributed control operations
• Semantics of failure response for graceful degradation
– Used to establish points for "micro-checkpointing" and validation for error detection, propagation isolation, and recovery
LCOs
• A number of forms of synchronization are incorporated into the semantics
• Support message-driven remote thread instantiation
• Finite State Machines (FSM)
• In-memory synchronization
– Control state is in the name space of the machine
– Producer-consumer in memory
– Local mutual exclusion protection
– Synchronization mechanisms as well as state are presumed to be intrinsic to memory
• Basic synchronization objects:
– Mutexes
– Semaphores
– Events
– Full-Empty bits
– Dataflow
– Futures
• User-defined (custom) LCOs
Generic LCO
• All LCOs incorporate these generic properties
– Basis for custom user defined LCOs
• First class object
– In the user name space
• Resides only in any single locality
– Does not span locality boundaries
– Allocated to data it will affect
• Instantiates threads
– Within resident locality
– Or, by generating parcels
• Lifespan
– Persistent
– Ephemeral
• Conditional
– Actions dependent on internal control state and incident events
Generic LCO
[Figure: incident events enter an event buffer; an event assimilation method updates the control state, a predicate control method tests it, and on success a thread method performs thread create, producing a new thread. These are inherited generic methods.]
Generic LCO Flow Graph
[Figure: an incident event drives a data state update and a control state update, followed by a predicate test that selects between succeed (consequence action: create LCO thread, return results) and failed paths. A sketch of this flow follows.]
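A compact sketch of that flow in C, assuming the inherited generic methods are function pointers; the structure and names are illustrative for this sketch, not the HPX implementation.

#include <stdio.h>

/* Illustrative generic LCO (names invented for this sketch): an incident
 * event is assimilated into the control state, the predicate is tested,
 * and on success the consequence action fires (here, a thread method). */
typedef struct lco lco_t;
struct lco {
    int  count;                              /* control state: events seen */
    int  threshold;                          /* fire when count reaches this */
    void (*assimilate)(lco_t *, int event);  /* event assimilation method */
    int  (*predicate)(lco_t *);              /* predicate control method */
    void (*thread_method)(lco_t *);          /* consequence action */
};

static void and_assimilate(lco_t *l, int event) { (void)event; l->count++; }
static int  and_predicate(lco_t *l) { return l->count >= l->threshold; }
static void and_fire(lco_t *l) { printf("LCO fired after %d events\n", l->count); }

static void lco_incident_event(lco_t *l, int event) {
    l->assimilate(l, event);     /* data/control state update */
    if (l->predicate(l))         /* predicate test: succeed -> consequence */
        l->thread_method(l);     /* e.g., create a new thread */
}

int main(void) {
    /* an "and"-style LCO that waits for 3 events before firing */
    lco_t l = { 0, 3, and_assimilate, and_predicate, and_fire };
    for (int e = 0; e < 3; e++)
        lco_incident_event(&l, e);
    return 0;
}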
Application: Adaptive Mesh Refinement
(AMR) for Astrophysics simulations
• Binary black hole and black
hole neutron star mergers are
LIGO candidates
• AMR simulations of black holes
typically scale very poorly
OpenX Software Architecture
[Figure: OpenX software architecture diagram.]
HPX Runtime Design
• Current version of HPX provides the following infrastructure as defined by the ParalleX execution model
– Complexes (ParalleX Threads) and ParalleX Thread Management
– Parcel Transport and Parcel Management
– Local Control Objects (LCOs)
– Active Global Address Space (AGAS)
Introspection through Machine Intelligence
• Control of system resources and tasks
• Dynamic and adaptive to changing conditions
• Policies direct choice space
• Complexity of system and operation requires sophisticated reasoning and management about alternative actions
• This calls for a high degree of intelligence built into the machine itself
• CRIS – Cognitive Real-time Interactive System
– An early project to isolate key principles of machine intelligence
– To be inserted in machine control loop and human interface
– Enables declarative programming methods
XPI Goals & Objectives
• A programming interface for extreme scale computing
• A syntactical representation of the ParalleX execution model
• Stable interface to underlying runtime system
• Target for source-to-source compilation from high-level parallel programming languages
• Low-level user readable parallel programming syntax
• Schema for early experimentation
• Implemented through libraries
• Touch and feel of familiar MPI
• Enables dynamic adaptive execution and asynchrony management
Classes of XPI Operations
• Miscellaneous
– Initialization and clean up
• Parcels
– Message-driven computation
• Data types
• Threads
– Computing actions
• Local Control Objects for synchronization and continuations
• Active Global Address Space
– System wide flat virtual address space
• PX Processes
– Hierarchical contexts and name spaces
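A hedged sketch of what code against these classes of operations might look like, in the MPI-like style the previous slide describes. All XPI_* names and signatures below are invented for illustration and are not the published XPI specification; a toy single-locality "runtime" stands in so the sketch runs.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical XPI-flavored program (assumed API, not the real XPI spec). */
typedef struct { unsigned long hi, lo; } XPI_Addr;  /* assumed AGAS address */

static double toy_value;   /* toy future storage */
static int    toy_ready;

static int XPI_init(int *argc, char ***argv) { (void)argc; (void)argv; return 0; }
static int XPI_finalize(void) { return 0; }
static int XPI_future_new(XPI_Addr *f) { f->hi = f->lo = 0; toy_ready = 0; return 0; }

/* parcels: message-driven computation; the continuation LCO gets the result */
static int XPI_parcel_send(XPI_Addr target, int action, XPI_Addr cont) {
    (void)target; (void)cont;
    toy_value = action * 2.0;  /* toy "remote" action */
    toy_ready = 1;
    return 0;
}

/* LCO: constraint-based wait, no global barrier */
static int XPI_future_get(XPI_Addr f, double *out) {
    (void)f;
    while (!toy_ready) { /* a real runtime would suspend the thread, not spin */ }
    *out = toy_value;
    return 0;
}

int main(int argc, char **argv) {
    XPI_Addr obj = {0, 42}, f;
    double result;
    XPI_init(&argc, &argv);       /* miscellaneous: initialization */
    XPI_future_new(&f);           /* LCO for the continuation */
    XPI_parcel_send(obj, 21, f);  /* move work to data */
    XPI_future_get(f, &result);   /* event-driven completion */
    printf("result = %.1f\n", result);
    return XPI_finalize();        /* miscellaneous: clean up */
}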
Conclusions
• HPC is in a (6th) phase change
• Ultra high scale computing of the next decade will
require a new model of computation to effectively
exploit new technologies and guide system co-design
• ParalleX is an example of an experimental execution
model that addresses key challenges to Exascale
• HSA incorporates a number of structures and mechanisms that would facilitate HPX operation and would benefit from it