Cluster Computing - Wayne State University
Transcript
Cluster Computing
Cheng-Zhong Xu

Outline
Cluster Computing Basics
– Multicore architecture
– Cluster interconnect
Parallel Programming for Performance
MapReduce Programming
Systems Management

What's a Cluster?
Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:
– To improve performance (speed)
– To improve throughput
– To improve service availability (high-availability clusters)
Based on commercial off-the-shelf components, the system is often more cost-effective than a single machine with comparable speed or availability.

Highly Scalable Clusters
High Performance Cluster (aka Compute Cluster)
– A form of parallel computer, which aims to solve problems faster by using multiple compute nodes.
– For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network.
Server Cluster and Datacenter
– Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes.

Top500 Installation of Supercomputers
Top500.com

Clusters in Top500

An Example of Top500 Submission (F'08)
Location: Tukwila, WA
Hardware – Machines: 256 machines, each with two quad-core Intel 5320 Clovertown 1.86 GHz CPUs and 8 GB RAM
Hardware – Networking: Private & public: Broadcom GigE; MPI: Cisco InfiniBand SDR, 34 IB switches in leaf/node configuration
Number of compute nodes: 256
Total number of cores: 2048
Total memory: 2 TB of RAM
Particulars for the current Linpack runs:
– Best Linpack result: 11.75 TFLOPS
– Best cluster efficiency: 77.1%
For comparison:
– Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS
– Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%
– Typical Top500 efficiency for Clovertown motherboards w/ IB, regardless of operating system: 65-77% (2 instances of 79%)
– 30% improvement in efficiency on the same hardware; about one hour to deploy

Beowulf Cluster
A cluster of inexpensive PCs for low-cost personal supercomputing
Based on commodity off-the-shelf components:
– PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)
– Interconnected by an Ethernet LAN
Head node, plus a group of compute nodes
– The head node controls the cluster and serves files to the compute nodes
Standard, free and open-source software
– Programming in MPI
– MapReduce

Why Clustering Today
Powerful nodes (CPU, memory, storage)
– Today's PC is yesterday's supercomputer
– Multi-core processors
High-speed networks
– Gigabit Ethernet (56% in the Top500 as of Nov 2008)
– InfiniBand System Area Network (SAN) (24.6%)
Standard tools for parallel/distributed computing and their growing popularity
– MPI, PBS, etc.
– MapReduce for data-intensive computing

Major Issues in Cluster Design
Programmability
– Sequential vs parallel programming
– MPI, DSM, DSA: hybrid of multithreading and MPI
– MapReduce
Cluster-aware resource management
– Job scheduling (e.g. PBS)
– Load balancing, data locality, communication optimization, etc.
System management
– Remote installation, monitoring, diagnosis
– Failure management, power management, etc.
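To make the MPI programmability point concrete, here is a minimal sketch in C; it is illustrative only and not taken from the course material. It assumes any standard MPI implementation and a launcher such as mpirun.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI sketch: every process reports its rank, and rank 0 sends one
 * integer to rank 1. Purely illustrative of explicit message passing. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);

    if (size > 1) {
        int value = 42;
        if (rank == 0)
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched with mpirun across the compute nodes, each rank prints its identity; this is the kind of explicit message passing the MPI bullets above refer to.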
Cluster Architecture
Multi-core node architecture
Cluster interconnect

Single-core computer

Single-core CPU chip
(Figure: chip layout with the single core.)

Multicore Architecture
Combines two or more independent cores (normally CPUs) into a single package
Supports multitasking and multithreading in a single physical package

Multicore is Everywhere
Dual-core commonplace in laptops
Quad-core in desktops
Dual quad-core in servers
All major chip manufacturers produce multicore CPUs
– Sun Niagara (8 cores, 64 concurrent threads)
– Intel Xeon (6 cores)
– AMD Opteron (4 cores)

Multithreading on multi-core
(Figure from David Geer, IEEE Computer, 2007.)

Interaction with the OS
The OS perceives each core as a separate processor
The OS scheduler maps threads/processes to different cores
Most major OSes support multi-core today: Windows, Linux, Mac OS X, …

Cluster Interconnect
Network fabric connecting the compute nodes
The objective is to strike a balance between
– Processing power of the compute nodes
– Communication ability of the interconnect
A more specialized LAN, providing many opportunities for performance optimization
– Switch in the core
– Latency vs bandwidth
(Figure: switch datapath — input ports, receiver, input buffers, cross-bar, output buffers, transmitter, output ports; the control unit handles routing and scheduling.)

Goal: Bandwidth and Latency
(Figure: latency and delivered bandwidth plotted against offered bandwidth; latency rises sharply and delivered bandwidth levels off as the network approaches saturation.)

Ethernet Switch: allows multiple simultaneous transmissions
Hosts have a dedicated, direct connection to the switch
Switches buffer packets
The Ethernet protocol is used on each incoming link, but there are no collisions; full duplex
– Each link is its own collision domain
Switching: A-to-A' and B-to-B' simultaneously, without collisions
– Not possible with a dumb hub
(Figure: switch with six interfaces (1-6) connecting hosts A, B, C, A', B', C'.)

Switch Table
Q: How does the switch know that A' is reachable via interface 4 and B' via interface 5?
A: Each switch has a switch table; each entry contains:
– (MAC address of host, interface to reach host, time stamp)
– Looks like a routing table!
Q: How are entries created and maintained in the switch table?
– Something like a routing protocol?

Switch: self-learning
The switch learns which hosts can be reached through which interfaces
– When a frame is received, the switch "learns" the location of the sender: the incoming LAN segment
– It records the sender/location pair in the switch table
(Figure: a frame with source A and destination A' arrives on interface 1; the initially empty switch table gains the entry (MAC addr A, interface 1, TTL 60).)

Self-learning, forwarding: example
Frame destination unknown: flood
Destination location known: selective send
(Figure: the frame from A to A' is flooded; the reply from A' to A is sent only on interface 1; the switch table now holds entries for both A and A'.)

Interconnecting switches
Switches can be connected together
Q: When sending from A to G, how does S1 know to forward the frame via S4 and S3?
A: Self-learning! (It works exactly the same as in the single-switch case.)
Q: What about latency and bandwidth for a large-scale network?
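As a rough illustration of the self-learning behaviour just described (a sketch under simplified assumptions, not real switch firmware; names, sizes, and the MAC-string placeholders are made up):

```c
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 64

/* One switch-table entry: (MAC address, interface, TTL), as on the slide. */
struct entry { char mac[18]; int iface; int ttl; };

static struct entry table[MAX_ENTRIES];
static int nentries = 0;

static int lookup(const char *mac)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].mac, mac) == 0) return table[i].iface;
    return -1;                                   /* unknown destination */
}

static void learn(const char *mac, int iface)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].mac, mac) == 0) {    /* refresh existing entry */
            table[i].iface = iface;
            table[i].ttl = 60;
            return;
        }
    if (nentries < MAX_ENTRIES) {                /* learn a new (MAC, interface) pair */
        snprintf(table[nentries].mac, sizeof table[nentries].mac, "%s", mac);
        table[nentries].iface = iface;
        table[nentries].ttl = 60;
        nentries++;
    }
}

/* A frame arrives on in_iface: learn the sender's location, then forward. */
static void on_frame(const char *src, const char *dst, int in_iface)
{
    learn(src, in_iface);
    int out = lookup(dst);
    if (out < 0)
        printf("dst %s unknown: flood on all interfaces except %d\n", dst, in_iface);
    else if (out != in_iface)
        printf("dst %s known: selective send on interface %d\n", dst, out);
}

int main(void)
{
    on_frame("AA-AA-AA-01", "AA-AA-AA-02", 1);   /* A -> A': flooded        */
    on_frame("AA-AA-AA-02", "AA-AA-AA-01", 4);   /* A' -> A: selective send */
    return 0;
}
```

The first frame is flooded because the destination is unknown; the reply is delivered selectively because the sender of the first frame has already been learned, mirroring the two-step example on the slides.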
What characterizes a network?
Topology (what)
– Physical interconnection structure of the network graph
– Regular vs irregular
Routing algorithm (which)
– Restricts the set of paths that messages may follow
– Table-driven, or based on a routing algorithm
Switching strategy (how)
– How data in a message traverses a route
– Store-and-forward vs cut-through
Flow control mechanism (when)
– When a message, or portions of it, traverses a route
– What happens when traffic is encountered?
The interplay of all of these determines performance.

Tree: An Example
Diameter and average distance are logarithmic
– k-ary tree, height d = log_k N
– An address is specified as a d-vector of radix-k coordinates describing the path down from the root
Fixed degree
Route up to the common ancestor and then down
– R = B xor A
– Let i be the position of the most significant 1 in R; route up i+1 levels
– Go down in the direction given by the low i+1 bits of B
Bandwidth and bisection BW?

Bandwidth
– Point-to-point bandwidth
– Bisection bandwidth of the interconnect fabric: the rate of data that can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes
For a switch with N ports:
– If it is non-blocking, the bisection bandwidth = N x the point-to-point bandwidth
– An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which increasing the number of nodes decreases the available bandwidth per node
– Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth

How to Maintain Constant BW per Node?
Limited ports in a single switch
– Multiple switches
The link between a pair of switches can become a bottleneck
– Fast uplink
How to organize multiple switches
– Irregular topology
– Regular topologies: ease of management

Scalable Interconnect: Examples
Fat tree
(Figure: a fat tree; a 16-node butterfly and its building block.)

Multidimensional Meshes and Tori
2D mesh, 2D torus, 3D cube
d-dimensional array
– n = k_{d-1} x ... x k_0 nodes
– Described by a d-vector of coordinates (i_{d-1}, ..., i_0)
d-dimensional k-ary mesh: N = k^d
– k = N^{1/d}
– Described by a d-vector of radix-k coordinates
d-dimensional k-ary torus (or k-ary d-cube)?
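The tree-routing rule above can be sketched in a few lines of C. This is a hypothetical illustration assuming a binary tree with leaf addresses written in binary; the function names and the printed route description are not from the slides.

```c
#include <stdio.h>

/* Position of the most significant 1 bit of r (r assumed nonzero). */
static int msb_position(unsigned r)
{
    int i = -1;
    while (r) { r >>= 1; i++; }
    return i;
}

/* Route from leaf A to leaf B: R = B xor A, go up i+1 levels where i is the
 * most significant 1 in R, then down following the low i+1 bits of B. */
static void tree_route(unsigned a, unsigned b)
{
    unsigned r = a ^ b;
    if (r == 0) { printf("same node\n"); return; }
    int i = msb_position(r);
    printf("route up %d levels, then down following the low %d bits of B (0x%x)\n",
           i + 1, i + 1, b & ((1u << (i + 1)) - 1));
}

int main(void)
{
    tree_route(0x5, 0x6);   /* e.g. leaves 0101 and 0110 share an ancestor 2 levels up */
    return 0;
}
```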
Packet Switching Strategies
Store and Forward (SF)
– Move the entire packet one hop toward the destination
– Buffer until the next hop is permitted
Virtual Cut-Through and Wormhole
– Pipeline the hops: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
– Virtual cut-through: buffer on blockage
– Wormhole: leave the message spread through the network on blockage

SF vs WH (VCT) Switching
(Figure: time-space diagrams of cut-through routing and store-and-forward routing from source to destination.)
Unloaded latency: h(n/b + D) for store-and-forward vs n/b + hD for cut-through
– h: distance (number of hops)
– n: size of the message
– b: bandwidth
– D: additional routing delay per hop

Conventional Datacenter Network

Problems with the Architecture
Resource fragmentation
– If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
Poor server-to-server connectivity
– Servers in different layer-2 domains must communicate through the layer-3 portion of the network
See the papers in the reading list on Datacenter Network Design for proposed approaches.

Parallel Programming for Performance

Steps in Creating a Parallel Program
(Figure: a sequential computation is decomposed into tasks, tasks are assigned to processes p0-p3, orchestration produces the parallel program, and mapping places processes on processors P0-P3.)
Four steps: Decomposition, Assignment, Orchestration, Mapping
– Done by the programmer or by system software (compiler, runtime, ...)
– The issues are the same, so assume the programmer does it all explicitly

Some Important Concepts
Task:
– An arbitrary piece of undecomposed work in the parallel computation
– Executed sequentially; concurrency is only across tasks
– Fine-grained versus coarse-grained tasks
Process (thread):
– An abstract entity that performs the tasks assigned to it
– Processes communicate and synchronize to perform their tasks
Processor:
– The physical engine on which a process executes
– Processes virtualize the machine to the programmer
  • First write the program in terms of processes, then map to processors

Decomposition
Break up the computation into tasks to be divided among processes
– Tasks may become available dynamically
– The number of available tasks may vary with time
Identify concurrency and decide the level at which to exploit it
Goal: enough tasks to keep processes busy, but not too many
– The number of tasks available at a time is an upper bound on the achievable speedup

Assignment
Specifying the mechanism to divide work among processes
– Together with decomposition, also called partitioning
– Balance the workload; reduce communication and management cost
Structured approaches usually work well
– Code inspection (parallel loops) or understanding of the application
– Well-known heuristics
– Static versus dynamic assignment
As programmers, we worry about partitioning first
– Usually independent of the architecture or programming model
– But the cost and complexity of using primitives may affect decisions
As architects, we assume the program does a reasonable job of it
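As an illustration of static assignment (a minimal sketch, not from the slides; the helper name and worker count are made up), block-partitioning N loop iterations across P workers can be computed as follows:

```c
#include <stdio.h>

/* Static (block) assignment: the iteration space [0, n) is split as evenly
 * as possible across nprocs workers; worker `rank` gets [*lo, *hi). */
static void block_range(long n, int nprocs, int rank, long *lo, long *hi)
{
    long base = n / nprocs, rem = n % nprocs;
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

int main(void)
{
    long lo, hi;
    for (int rank = 0; rank < 4; rank++) {   /* pretend there are 4 processes */
        block_range(10, 4, rank, &lo, &hi);
        printf("worker %d gets iterations [%ld, %ld)\n", rank, lo, hi);
    }
    return 0;
}
```

A dynamic alternative would hand out iterations from a shared counter at run time, which is essentially the self-scheduling idea discussed later in these slides.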
Orchestration
– Naming data
– Structuring communication
– Synchronization
– Organizing data structures and scheduling tasks temporally
Goals
– Reduce the cost of communication and synchronization as seen by processors
– Preserve locality of data reference (incl. data structure organization)
– Schedule tasks to satisfy dependences early
– Reduce the overhead of parallelism management
Closest to the architecture (and to the programming model and language)
– Choices depend a lot on the communication abstraction and the efficiency of primitives
– Architects should provide appropriate primitives efficiently

Orchestration (cont'd)
Shared address space
– Shared and private data explicitly separate
– Communication implicit in access patterns
– No correctness need for data distribution
– Synchronization via atomic operations on shared data
– Synchronization explicit and distinct from data communication
Message passing
– Data distribution among local address spaces needed
– No explicit shared structures (implicit in communication patterns)
– Communication is explicit
– Synchronization implicit in communication (at least in the synchronous case)

Mapping
After orchestration, we already have a parallel program
Two aspects of mapping:
– Which processes/threads will run on the same processor (core), if necessary
– Which process/thread runs on which particular processor (core)
  • Mapping to a network topology
One extreme: space-sharing
– The machine is divided into subsets, with only one application at a time in a subset
– Processes can be pinned to processors, or left to the OS
Another extreme: leave resource management control to the OS
The real world is in between
– The user specifies desires in some aspects; the system may ignore them
Usually adopt the view: process <-> processor

Basic Trade-offs for Performance

Trade-offs
Load balance
– Fine-grain tasks
– Random or dynamic assignment
Parallelism overhead
– Coarse-grain tasks
– Simple assignment
Communication
– Decompose to obtain locality
– Recompute from local data
– Big transfers: amortize overhead and latency
– Small transfers: reduce overhead and contention

Load Balancing in HPC
Based on notes by James Demmel and David Culler

LB in Parallel and Distributed Systems
Load balancing problems differ in:
Task costs
– Do all tasks have equal costs?
– If not, when are the costs known?
  • Before starting, when the task is created, or only when the task ends
Task dependencies
– Can all tasks be run in any order (including in parallel)?
– If not, when are the dependencies known?
  • Before starting, when the task is created, or only when the task ends
Locality
– Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?
– When is the information about communication between tasks known?

Task Cost Spectrum

Task Dependency Spectrum

Task Locality Spectrum (Data Dependencies)

Spectrum of Solutions
One of the key questions is when certain information about the load balancing problem is known.
This leads to a spectrum of solutions:
– Static scheduling: all information is available to the scheduling algorithm, which runs before any real computation starts (offline algorithms)
– Semi-static scheduling: information may be known at program startup, at the beginning of each timestep, or at other well-defined points; offline algorithms may be used even though the problem is dynamic
– Dynamic scheduling: information is not known until mid-execution (online algorithms)

Representative Approaches
Static load balancing
Semi-static load balancing
Self-scheduling
Distributed task queues
Diffusion-based load balancing
DAG scheduling
Mixed parallelism

Self-Scheduling
Basic ideas:
– Keep a centralized pool of tasks that are available to run
– When a processor completes its current task, it looks at the pool
– If the computation of one task generates more tasks, add them to the pool
It is useful when
– There is a batch (or set) of tasks without dependencies
– The cost of each task is unknown
– Locality is not important
– Using a shared-memory multiprocessor, so a centralized pool of tasks is fine
  (How about on a distributed-memory system like a cluster?)
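A minimal sketch of self-scheduling on a shared-memory node (not from the slides): a mutex-protected counter stands in for the centralized task pool, and the task body is a hypothetical placeholder. Compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>

#define NTASKS   32
#define NTHREADS 4

/* Centralized pool: simply the index of the next unclaimed task. */
static int next_task = 0;
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_task(int id)                 /* stand-in for real work */
{
    printf("task %d done\n", id);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pool_lock);     /* grab the next task, if any */
        int id = (next_task < NTASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&pool_lock);
        if (id < 0) break;                  /* pool is empty: stop */
        do_task(id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    return 0;
}
```

Because each thread pulls work only when it finishes its current task, cheap and expensive tasks balance out automatically, which is why the slide notes that self-scheduling suits tasks of unknown cost on a shared-memory machine.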
Cluster Management

Rocks Cluster Distribution: An Example
www.rocksclusters.org
Based on CentOS Linux
Mass installation is a core part of the system
– Mass re-installation for application-specific configuration
Front-end central server + compute and storage nodes
Rolls: collections of packages
– The base roll includes PBS (Portable Batch System), PVM (Parallel Virtual Machine), MPI (Message Passing Interface), job launchers, …
– Rolls ver. 5.1: support for virtual clusters, virtual front ends, virtual compute nodes

Microsoft HPC Server 2008: Another Example
Windows Server 2008 + clustering package
Systems management
– Management Console: plug-in to the System Center UI with support for Windows PowerShell
– RIS (Remote Installation Service)
Networking
– MS-MPI (Message Passing Interface)
– ICS (Internet Connection Sharing): NAT for cluster nodes
– Network Direct RDMA (Remote DMA)
Job scheduler
Storage: iSCSI SAN and SMB support
Failover support

Microsoft's Productivity Vision for HPC
Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.
Administrator
– Integrated turnkey HPC cluster solution
– Simplified setup and deployment
– Built-in diagnostics
– Efficient cluster utilization
– Integrates with IT infrastructure and policies
Application developer
– Integrated tools for parallel programming
– Highly productive parallel programming frameworks
– Service-oriented HPC applications
– Support for key HPC development standards
– Unix application migration
End user
– Seamless integration with workstation applications
– Integration with existing collaboration and workflow solutions
– Secure job execution and data access