Cluster Computing
Cheng-Zhong Xu, Wayne State University
Outline
 Cluster Computing Basics
– Multicore architecture
– Cluster Interconnect
 Parallel Programming for Performance
 MapReduce Programming
 Systems Management
What’s a Cluster?
 Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:
– To improve performance (speed)
– To improve throughput
– To improve service availability (high-availability clusters)
 Built from commercial off-the-shelf components, the system is often more cost-effective than a single machine with comparable speed or availability
Highly Scalable Clusters
 High Performance Cluster (aka Compute Cluster)
– A form of parallel computer, which aims to solve problems faster by using multiple compute nodes
– For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network
 Server Cluster and Datacenter
– Aims to improve the system’s throughput, service availability, power consumption, etc. by using multiple nodes
Top500 Installation of Supercomputers
[Chart; source: Top500.com]
Clusters in Top500
[Chart: share of clusters among Top500 systems]
An Example of Top500 Submission (F’08)
Location: Tukwila, WA
Hardware – Machines: 256 dual-CPU machines with quad-core Intel 5320 Clovertown 1.86 GHz CPUs and 8 GB RAM each
Hardware – Networking: Private & Public: Broadcom GigE; MPI: Cisco Infiniband SDR, 34 IB switches in a leaf/node configuration
Number of Compute Nodes: 256
Total Number of Cores: 2048
Total Memory: 2 TB of RAM
Particulars for the current Linpack runs:
– Best Linpack Result: 11.75 TFLOPS
– Best Cluster Efficiency: 77.1%
For comparison:
– Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS
– Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%
– Typical Top500 efficiency for Clovertown motherboards w/ IB, regardless of operating system: 65-77% (2 instances of 79%)
Note: roughly a 30% improvement in efficiency on the same hardware; about one hour to deploy
Beowulf Cluster
 A cluster of inexpensive PCs for low-cost personal supercomputing
 Based on commodity off-the-shelf components:
– PCs running a Unix-like OS (BSD, Linux, or OpenSolaris)
– Interconnected by an Ethernet LAN
 A head node plus a group of compute nodes
– The head node controls the cluster and serves files to the compute nodes
 Standard, free and open-source software
– Programming in MPI (a minimal example follows)
– MapReduce
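To make the “Programming in MPI” bullet concrete, here is a minimal sketch of an MPI program. It assumes the Python binding mpi4py and a hypothetical file name hello_mpi.py; the slides do not prescribe a language or MPI binding.

# Minimal MPI sketch using mpi4py (an assumed binding, not named in the slides).
# Launch with, e.g.:  mpiexec -n 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator spanning all launched processes
rank = comm.Get_rank()           # this process's id within the communicator
size = comm.Get_size()           # total number of processes in the job

print(f"Hello from rank {rank} of {size} on {MPI.Get_processor_name()}")

# A tiny collective: rank 0 (e.g., on the head node) broadcasts a value to everyone.
data = comm.bcast({"job": "demo"} if rank == 0 else None, root=0)
print(f"rank {rank} received {data}")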
Why Clustering Today
 Powerful nodes (CPU, memory, storage)
– Today’s PC is yesterday’s supercomputer
– Multi-core processors
 High-speed networks
– Gigabit Ethernet (56% of Top500 systems as of Nov 2008)
– Infiniband System Area Network (SAN) (24.6%)
 Standard tools for parallel/distributed computing and their growing popularity
– MPI, PBS, etc.
– MapReduce for data-intensive computing
Major issues in Cluster Design
 Programmability
– Sequential vs Parallel Programming
– MPI, DSM, DSA: hybrid of multithreading and MPI
– MapReduce
 Cluster-aware Resource management
– Job scheduling (e.g. PBS)
– Load balancing, data locality, communication opt, etc
 System management
– Remote installation, monitoring, diagnosis,
– Failure management, power management, etc
Cluster Architecture
 Multi-core node architecture
 Cluster Interconnect
Single-core computer
[Figure: block diagram of a single-core computer]
Single-core CPU chip
[Figure: CPU chip with the single core highlighted]
Multicore Architecture
 Combines two or more independent cores (normally CPUs) into a single package
 Supports multitasking and multithreading in a single physical package
Multicore is Everywhere
 Dual-core commonplace in laptops
 Quad-core in desktops
 Dual quad-core in servers
 All major chip manufacturers produce multicore CPUs
– SUN Niagara (8 cores, 64 concurrent threads)
– Intel Xeon (6 cores)
– AMD Opteron (4 cores)
Multithreading on multi-core
[Figure: multithreading on multicore processors; source: David Geer, IEEE Computer, 2007]
Interaction with the OS
 OS perceives each core as a separate processor
 OS scheduler maps threads/processes
to different cores
 Most major OS support multi-core today:
Windows, Linux, Mac OS X, …
Cluster Interconnect
 Network fabric connecting the compute nodes
 Objective is to strike a balance between
– Processing power of compute nodes
– Communication ability of the interconnect
 A more specialized LAN, providing many
opportunities for perf. optimization
– Switch in the core
– Latency vs bw
[Figure: generic switch datapath – input ports and receivers feed input buffers, a cross-bar connects them to output buffers and transmitters on the output ports, all under control logic for routing and scheduling]
Goal: Bandwidth and Latency
[Figure: two plots – delivered bandwidth vs. offered bandwidth, and latency vs. delivered bandwidth, each marked with its saturation point]
Ethernet Switch: allows multiple simultaneous transmissions
 Hosts have a dedicated, direct connection to the switch
 Switches buffer packets
 Ethernet protocol used on each incoming link, but no collisions; full duplex
– each link is its own collision domain
 Switching: A-to-A’ and B-to-B’ simultaneously, without collisions
– not possible with a dumb hub
[Figure: hosts A, B, C, A’, B’, C’ attached to a switch with six interfaces (1,2,3,4,5,6)]
Switch Table
 Q: how does the switch know that A’ is reachable via interface 4, and B’ via interface 5?
 A: each switch has a switch table; each entry contains:
– (MAC address of host, interface to reach host, time stamp)
 looks like a routing table!
 Q: how are entries created and maintained in the switch table?
– something like a routing protocol?
[Figure: the same hosts attached to a switch with six interfaces (1,2,3,4,5,6)]
Switch: self-learning
 The switch learns which hosts can be reached through which interfaces
– when a frame is received, the switch “learns” the location of the sender: the incoming LAN segment
– it records the sender/location pair in the switch table
[Figure: A sends a frame (source A, dest A’) into interface 1; the switch table, initially empty, gains the entry (MAC addr A, interface 1, TTL 60)]
Self-learning, forwarding: example
 frame destination unknown: flood
 frame destination known: selective send
[Figure: A sends a frame (source A, dest A’); since A’ is unknown, the switch floods it out of all other interfaces. The reply from A’ arrives on interface 4 and is sent selectively to A, whose location is already known. The switch table ends up with entries (A, interface 1, TTL 60) and (A’, interface 4, TTL 60)]
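The two slides above spell out the self-learning and forwarding rules completely, so they are easy to sketch in code. The class below is an illustrative toy, not a real switch implementation; the port numbering and 60-second TTL follow the figure.

import time

class LearningSwitch:
    """Sketch of the self-learning Ethernet switch from the previous two slides."""

    def __init__(self, num_ports, ttl=60):
        self.num_ports = num_ports
        self.ttl = ttl
        self.table = {}          # MAC address -> (interface, expiry time)

    def receive(self, frame, in_port):
        src, dst = frame["src"], frame["dst"]
        # Learn: record the sender's location (the incoming interface).
        self.table[src] = (in_port, time.time() + self.ttl)
        # Forward: selective send if the destination is known and fresh, else flood.
        entry = self.table.get(dst)
        if entry and entry[1] > time.time():
            return [entry[0]]
        return [p for p in range(1, self.num_ports + 1) if p != in_port]

sw = LearningSwitch(num_ports=6)
print(sw.receive({"src": "A", "dst": "A'"}, in_port=1))   # unknown dest: flood [2..6]
print(sw.receive({"src": "A'", "dst": "A"}, in_port=4))   # A known: selective send [1]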
Interconnecting switches
 Switches can be connected together
[Figure: hosts A–I attached to four interconnected switches S1–S4]
 Q: sending from A to G – how does S1 know to forward a frame destined to G via S4 and S3?
 A: self-learning! (works exactly the same as in the single-switch case!)
 Q: latency and bandwidth for a large-scale network?
What characterizes a network?
 Topology (what)
– physical interconnection structure of the network graph
– regular vs. irregular
 Routing Algorithm (which)
– restricts the set of paths that msgs may follow
– table-driven, or routing-algorithm based
 Switching Strategy (how)
– how data in a msg traverses a route
– store-and-forward vs. cut-through
 Flow Control Mechanism (when)
– when a msg or portions of it traverse a route
– what happens when traffic is encountered?
 Interplay of all of these determines performance
Tree: An Example
 Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– address specified as a d-vector of radix-k coordinates describing the path down from the root
 Fixed degree
 Route up to the common ancestor and down (see the sketch below)
– R = B xor A
– let i be the position of the most significant 1 in R; route up i+1 levels
– then down in the direction given by the low i+1 bits of B
 Bandwidth and Bisection BW?
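A small sketch of the up/down routing rule above, for a binary tree (k = 2) with leaf addresses written in binary. The function name and the choice of k = 2 are illustrative assumptions, not part of the slides.

def tree_route(a: int, b: int) -> list:
    """Route from leaf a to leaf b in a complete binary tree.

    Slide rule: R = B xor A; with i the position of the most significant 1
    in R, go up i+1 levels to the common ancestor, then descend i+1 levels,
    steered by the low i+1 bits of B.
    """
    if a == b:
        return []                          # already there
    r = a ^ b
    i = r.bit_length() - 1                 # position of the most significant 1 in R
    hops = ["up"] * (i + 1)                # climb to the common ancestor
    for level in range(i, -1, -1):         # descend, one bit of B per level
        hops.append("down-right" if (b >> level) & 1 else "down-left")
    return hops

# Example: route from leaf 3 (0011) to leaf 13 (1101) in a tree of height 4.
print(tree_route(3, 13))
# ['up', 'up', 'up', 'up', 'down-right', 'down-right', 'down-left', 'down-right']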
Bandwidth
 Bandwidth
– Point-to-point bandwidth
– Bisection bandwidth of the interconnect fabric: the rate at which data can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes
 For a switch with N ports:
– If it is non-blocking, the bisection bandwidth = N * the point-to-point bandwidth
– An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which increasing the number of nodes decreases the available bandwidth per node
– Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth
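As a small worked example of the oversubscription ratio just defined, under assumed numbers that are not from the slides:

# Hypothetical edge switch: 20 hosts share 4 uplinks of the same speed toward
# the core, so in the worst case only 4/20 of the hosts' aggregate bandwidth
# can cross the bisection. All figures here are illustrative assumptions.
hosts_per_edge = 20
uplinks = 4
link_gbps = 1.0

ratio = hosts_per_edge / uplinks           # 5.0, usually quoted as "5:1"
worst_case_per_host = link_gbps / ratio    # bandwidth left per host in the worst case

print(f"oversubscription {ratio:.0f}:1 -> each host sees at most "
      f"{worst_case_per_host:.2f} Gbps ({100 / ratio:.0f}% of its link) in the worst case")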
How to Maintain Constant BW per Node?
 Limited ports in a single switch
– Use multiple switches
 The link between a pair of switches can become the bottleneck
– Fast uplinks
 How to organize multiple switches?
– Irregular topology
– Regular topologies: ease of management
Scalable Interconnect: Examples
[Figure: a fat tree, and a 16-node butterfly built from small switch building blocks]
Multidimensional Meshes and Tori
[Figure: a 2D mesh, a 2D torus, and a 3D cube]
 d-dimensional array
– n = k_{d-1} x ... x k_0 nodes
– described by a d-vector of coordinates (i_{d-1}, ..., i_0)
 d-dimensional k-ary mesh: N = k^d
– k = N^(1/d), the d-th root of N
– described by a d-vector of radix-k coordinates
 d-dimensional k-ary torus (or k-ary d-cube)?
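A brief sketch of how the radix-k coordinates above translate into neighbor links in a k-ary d-dimensional mesh or torus; the helper name and the small example sizes are illustrative.

def neighbors(coord, k, torus=False):
    """Neighbors of a node in a k-ary d-dimensional mesh (or torus).

    coord is a d-vector (i_{d-1}, ..., i_0) of radix-k coordinates, as on the
    slide. In a mesh, moves off the edge are dropped; in a torus they wrap.
    """
    result = []
    for dim in range(len(coord)):
        for step in (-1, +1):
            c = list(coord)
            c[dim] += step
            if torus:
                c[dim] %= k                   # wrap-around link
            elif not 0 <= c[dim] < k:
                continue                      # no link past the mesh edge
            result.append(tuple(c))
    return result

# 3-ary 2-D example (k=3, d=2): the corner (0, 0) has 2 neighbors in the mesh,
# while in the torus every node has 2d = 4 neighbors.
print(neighbors((0, 0), k=3))              # [(1, 0), (0, 1)]
print(neighbors((0, 0), k=3, torus=True))  # [(2, 0), (1, 0), (0, 2), (0, 1)]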
Packet Switching Strategies
 Store and Forward (SF)
– move entire packet one hop toward destination
– buffer till next hop permitted
 Virtual Cut-Through and Wormhole
– pipeline the hops: switch examines the header,
decides where to send the message, and then
starts forwarding it immediately
– Virtual Cut-Through: buffer on blockage
– Wormhole: leave message spread through network
on blockage
SF vs WH (VCT) Switching
[Figure: time-space diagrams contrasting store-and-forward routing, where the whole packet (flits 3 2 1 0) is buffered at each hop between source and destination, with cut-through routing, where the flits are pipelined across the hops]
 Unloaded latency: h(n/b + D) vs. n/b + hD
– h: distance (number of hops)
– n: size of message
– b: bandwidth
– D: additional routing delay per hop
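The unloaded-latency formulas above are easy to compare numerically; the sketch below simply evaluates both models for assumed message and link parameters.

def store_and_forward_latency(h, n, b, D):
    """h(n/b + D): the whole message is received at each of h hops before moving on."""
    return h * (n / b + D)

def cut_through_latency(h, n, b, D):
    """n/b + hD: the message is pipelined, so only the per-hop routing delay adds up."""
    return n / b + h * D

# Illustrative numbers (assumptions, not from the slides):
# a 4 KB message over 5 hops of 1 GB/s links with 1 microsecond routing delay per hop.
h, n, b, D = 5, 4096, 1e9, 1e-6
print(f"store-and-forward: {store_and_forward_latency(h, n, b, D) * 1e6:.1f} us")
print(f"cut-through:       {cut_through_latency(h, n, b, D) * 1e6:.1f} us")
# -> roughly 25.5 us vs 9.1 us for these parameters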
Conventional Datacenter Network
[Figure: a conventional datacenter network topology]
Problems with the Architecture
 Resource fragmentation
– If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
 Poor server-to-server connectivity
– Servers in different layer-2 domains have to communicate through the layer-3 portion of the network
 See the papers in the reading list on Datacenter Network Design for proposed approaches
Parallel Programming for
Performance
Steps in Creating a Parallel Program
[Figure: a sequential computation is decomposed into tasks, the tasks are assigned to processes p0–p3 (decomposition plus assignment is also called partitioning), orchestration turns them into a parallel program, and mapping places the processes onto processors P0–P3]
 4 steps: Decomposition, Assignment, Orchestration, Mapping
– Done by programmer or system software (compiler, runtime, ...)
– Issues are the same, so assume the programmer does it all explicitly
Some Important Concepts
 Task:
– Arbitrary piece of undecomposed work in parallel
computation
– Executed sequentially; concurrency is only across tasks
– Fine-grained versus coarse-grained tasks
 Process (thread):
– Abstract entity that performs the tasks assigned to processes
– Processes communicate and synchronize to perform their
tasks
 Processor:
– Physical engine on which process executes
– Processes virtualize machine to programmer
• first write program in terms of processes, then map to
processors
Decomposition
 Break up computation into tasks to be divided among
processes
– Tasks may become available dynamically
– No. of available tasks may vary with time
 Identify concurrency and decide level at which to exploit it
 Goal: Enough tasks to keep processes busy, but not too
many
– No. of tasks available at a time is upper bound on
achievable speedup
Assignment
 Specifying mechanism to divide work up among processes
– Together with decomposition, also called partitioning
– Balance workload, reduce communication and management cost
 Structured approaches usually work well
– Code inspection (parallel loops) or understanding of application
– Well-known heuristics
– Static versus dynamic assignment
 As programmers, we worry about partitioning first
– Usually independent of architecture or prog model
– But cost and complexity of using primitives may affect decisions
 As architects, we assume program does reasonable job of it
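To ground the static-versus-dynamic point above, here is a small sketch of two common static assignments of loop iterations to processes (block and cyclic); the function names and sizes are illustrative, not from the slides.

def block_assignment(num_tasks, num_procs):
    """Static block assignment: each process gets one contiguous chunk of iterations."""
    chunk = (num_tasks + num_procs - 1) // num_procs
    return {p: list(range(p * chunk, min((p + 1) * chunk, num_tasks)))
            for p in range(num_procs)}

def cyclic_assignment(num_tasks, num_procs):
    """Static cyclic assignment: iterations dealt out round-robin, which balances
    the load better when task cost grows with the iteration index."""
    return {p: list(range(p, num_tasks, num_procs)) for p in range(num_procs)}

print(block_assignment(10, 3))   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9]}
print(cyclic_assignment(10, 3))  # {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}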
Orchestration
– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally
 Goals
– Reduce cost of communication and synch. as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce overhead of parallelism management
 Closest to architecture (and programming model & language)
– Choices depend a lot on comm. abstraction, efficiency of primitives
– Architects should provide appropriate primitives efficiently
Orchestration (cont’)
 Shared address space
– Shared and private data explicitly separate
– Communication implicit in access patterns
– No correctness need for data distribution
– Synchronization via atomic operations on shared data
– Synchronization explicit and distinct from data communication
 Message passing
– Data distribution among local address spaces needed
– No explicit shared structures (implicit in comm. patterns)
– Communication is explicit
– Synchronization implicit in communication (at least in the synchronous case)
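A compact sketch contrasting the two styles above, using Python's multiprocessing module as an illustrative stand-in (the slides do not name a library): explicit synchronization on shared data versus message passing, where the communication itself carries the synchronization.

import multiprocessing as mp

def shared_worker(counter, lock, n):
    # Shared address space style: communication is implicit (both sides touch
    # the same counter); synchronization is explicit via the lock.
    for _ in range(n):
        with lock:
            counter.value += 1

def mp_worker(queue, n):
    # Message-passing style: no shared structures; the send both communicates
    # the data and (implicitly) synchronizes with the receiver.
    queue.put(sum(range(n)))

if __name__ == "__main__":
    counter, lock = mp.Value("i", 0), mp.Lock()
    p = mp.Process(target=shared_worker, args=(counter, lock, 1000))
    p.start(); p.join()
    print("shared counter:", counter.value)        # 1000

    q = mp.Queue()
    p = mp.Process(target=mp_worker, args=(q, 1000))
    p.start()
    print("message received:", q.get())            # 499500
    p.join()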
Mapping
 After orchestration, already have parallel program
 Two aspects of mapping:
– Which processes/threads will run on same processor (core), if necessary
– Which process/thread runs on which particular processor (core)
• mapping to a network topology
 One extreme: space-sharing
– Machine divided into subsets, only one app at a time in a subset
– Processes can be pinned to processors, or left to OS
 Another extreme: leave resource management control to OS
 Real world is between the two
– User specifies desires in some aspects, system may ignore
 Usually adopt the view: process <-> processor
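Since the slide notes that processes can be pinned to processors or left to the OS, here is a minimal pinning sketch. It assumes Linux, where os.sched_setaffinity is available; the core numbers are arbitrary examples.

import os

pid = 0  # 0 means "the calling process"

print("allowed cores before:", os.sched_getaffinity(pid))
os.sched_setaffinity(pid, {0, 1})          # explicit mapping: pin to cores 0 and 1
print("allowed cores after: ", os.sched_getaffinity(pid))

os.sched_setaffinity(pid, range(os.cpu_count()))   # hand control back to the OS scheduler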
Basic Trade-offs for Performance
Trade-offs
 Load Balance
– fine grain tasks
– random or dynamic assignment
 Parallelism Overhead
– coarse grain tasks
– simple assignment
 Communication
– decompose to obtain locality
– recompute from local data
– big transfers – amortize overhead and latency
– small transfers – reduce overhead and contention
Load Balancing in HPC
Based on notes of James Demmel
and David Culler
LB in Parallel and Distributed Systems
Load balancing problems differ in:
 Task costs
– Do all tasks have equal costs?
– If not, when are the costs known?
• Before starting, when task created, or only when task ends
 Task dependencies
– Can all tasks be run in any order (including parallel)?
– If not, when are the dependencies known?
• Before starting, when task created, or only when task ends
 Locality
– Is it important for some tasks to be scheduled on the same processor
(or nearby) to reduce communication cost?
– When is the information about communication between tasks known?
Task cost spectrum
[Figure]
Task Dependency Spectrum
[Figure]
Task Locality Spectrum (Data Dependencies)
[Figure]
Spectrum of Solutions
One of the key questions is when certain information about the load balancing problem is known.
This leads to a spectrum of solutions:
 Static scheduling. All information is available to the scheduling algorithm, which runs before any real computation starts. (offline algorithms)
 Semi-static scheduling. Information may be known at program startup, or at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.
 Dynamic scheduling. Information is not known until mid-execution. (online algorithms)
Representative Approaches
 Static load balancing
 Semi-static load balancing
 Self-scheduling
 Distributed task queues
 Diffusion-based load balancing
 DAG scheduling
 Mixed Parallelism
Self-Scheduling
 Basic idea (a sketch follows):
– Keep a centralized pool of tasks that are available to run
– When a processor completes its current task, it takes the next one from the pool
– If the computation of one task generates more tasks, add them to the pool
 It is useful when:
– there is a batch (or set) of tasks without dependencies
– the cost of each task is unknown
– locality is not important
– a shared-memory multiprocessor is used, so a centralized pool of tasks is fine (how about on a distributed-memory system like a cluster?)
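A minimal self-scheduling sketch for the shared-memory case described above, using a centralized task queue and worker threads; the task functions and counts are illustrative assumptions.

import queue
import threading

def self_schedule(tasks, num_workers=4):
    """Centralized pool: each idle worker pulls the next available task."""
    pool = queue.Queue()
    for t in tasks:
        pool.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()   # take the next task from the shared pool
            except queue.Empty:
                return                     # nothing left: this worker is done
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Illustrative batch of independent tasks with uneven, unknown costs.
jobs = [lambda n=n: sum(range(n)) for n in (10, 1_000_000, 100, 10_000)]
print(self_schedule(jobs))

In this sketch the pool is only drained; handling tasks that generate new tasks, as the slide allows, would additionally require tracking in-flight work so that workers do not exit while more tasks may still arrive.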
Cluster Management
Rocks Cluster Distribution: An Example
 www.rocksclusters.org
 Based on CentOS Linux
 Mass installation is a core part of the system
– Mass re-installation for application-specific config.
 Front-end central server + compute & storage nodes
 Rolls: collection of packages
– Base roll includes: PBS (portable batch system), PVM
(parallel virtual machine), MPI (message passing
interface), job launchers, …
– Rolls ver 5.1: support for virtual clusters, virtual front
ends, virtual compute nodes
Microsoft HPC Server 2008: Another
example
 Windows Server 2008 + clustering package
 Systems Management
– Management Console: plug-in to System Center UI with support for
Windows PowerShell
– RIS (Remote Installation Service)
 Networking
– MS-MPI (Message Passing Interface)
– ICS (Internet Connection Sharing) : NAT for cluster nodes
– Network Direct RDMA (Remote DMA)
 Job scheduler
 Storage: iSCSI SAN and SMB support
 Failover support
Microsoft’s Productivity Vision for HPC
Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users’ existing skills and integrating with the tools they are already using.
Administrator
– Integrated Turnkey HPC Cluster Solution
– Simplified Setup and Deployment
– Built-In Diagnostics
– Efficient Cluster Utilization
– Integrates with IT Infrastructure and Policies
Application Developer
– Integrated Tools for Parallel Programming
– Highly Productive Parallel Programming Frameworks
– Service-Oriented HPC Applications
– Support for Key HPC Development Standards
– Unix Application Migration
End-User
– Seamless Integration with Workstation Applications
– Integration with Existing Collaboration and Workflow Solutions
– Secure Job Execution and Data Access