Multicore Systems
Marko Bertogna, Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy

Course outline
- Why do we need more than one processor?
- Key design aspects of multicore platforms
- Overview of existing multicore architectures
- Parallel programming problems
- Multiprocessor real-time systems

Introduction to Multicore Systems: summary
- Limits of current single-processor technologies
- Superscalar and SMT architectures
- SMP vs ASMP
- UMA vs NUMA
- Cache coherency
- Key design aspects
- Examples of multicore systems

As Moore's law goes on...
- The number of transistors per chip doubles every 18 to 24 months...
- ...and heating becomes a problem: the Pentium Tejas was cancelled!
- [Chart: power density (W/cm²) vs. year, from the 4004 and 8008 up to the Pentium 4, crossing the hot-plate level and heading toward nuclear-reactor densities]
- Since P ∝ V²·f, clock speed is limited to less than 4 GHz

Why did it happen?
- Integration technology constantly improves (CMOS manufacturing process)
- Half-pitch: half the distance between identical features in an array of transistors

Technology trends
- Reduced gate sizes: higher frequencies allowed, more space available
- BUT there are physical limits of semiconductor-based microelectronics:
  - larger dynamic power consumed
  - leakage current becomes important
  - problematic data synchronization (GALS)

Intel's timeline

Year  Processor       Technology  Frequency  Transistors
1971  4004            10 µm       108 kHz    2,300
1972  8008            10 µm       800 kHz    3,500
1974  8080            6 µm        2 MHz      4,500
1978  8086            3 µm        5 MHz      29,000
1979  8088            3 µm        5 MHz      29,000
1982  286             1.5 µm      6 MHz      134,000
1985  386             1.5 µm      16 MHz     275,000
1989  486             1 µm        25 MHz     1,200,000
1993  Pentium         0.8 µm      66 MHz     3,100,000
1995  Pentium Pro     0.6 µm      200 MHz    5,500,000
1997  Pentium II      0.25 µm     300 MHz    7,500,000
1999  Pentium III     0.18 µm     500 MHz    9,500,000
2000  Pentium 4       0.18 µm     1.5 GHz    42,000,000
2002  Pentium M       90 nm       1.7 GHz    55,000,000
2005  Pentium D       65 nm       3.2 GHz    291,000,000
2006  Core 2 Duo      65 nm       2.93 GHz   291,000,000
2007  Core 2 Quad     65 nm       2.66 GHz   582,000,000
2008  Core 2 Quad X   45 nm       >3 GHz     820,000,000

Frequency
- Lead microprocessor frequency doubles every 2 years
- [Chart: frequency (MHz) vs. year, from the 4004 to the Pentium/P6 generation; courtesy of Intel]

Die size growth
- Die size grows by 14% every two years to satisfy Moore's law

Frequency and power
- f = operating frequency
- V = supply voltage, with V ≈ 0.3 + 0.7·f
- Reducing the voltage therefore causes a more-than-proportional frequency reduction
- P = Pdynamic + Pstatic = power consumed
- Pdynamic ∝ A·C·V²·f (the main contributor down to feature sizes of hundreds of nm)
- Pstatic ∝ V·Ileak (due to leakage; important below 100 nm)

Design considerations (I)
- P = A·C·V²·f + V·Ileak
- The number of transistors grows → A grows
- Die size grows → C grows
- Reducing V would give a quadratic reduction of dynamic power, but the clock frequency would decrease more than linearly (since V ≈ 0.3 + 0.7·f) unless Vth is reduced as well
- But, again, there are problems: Ileak ∝ Isub + Igox increases when Vth is low!

Design considerations (II)
- P = A·C·V²·f + V·Ileak
- Reducing Vth and gate dimensions → leakage current becomes dominant in recent process technologies → static dissipation
- Static power dissipation is always present, unless the device state is lost
- Summarizing: there is no way out for classic frequency scaling on single-core systems!
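To make the P = A·C·V²·f + V·Ileak discussion concrete, here is a minimal sketch that evaluates the two contributions. All parameter values are hypothetical placeholders chosen for illustration, not figures from the slides.

```c
/* Sketch of the power model P = A*C*V^2*f + V*Ileak from the slides.
   All parameter values below are purely illustrative, not measurements. */
#include <stdio.h>

int main(void) {
    double A     = 0.2;     /* activity factor (fraction of gates switching) */
    double C     = 1.0e-7;  /* total switched capacitance in farads (hypothetical) */
    double V     = 1.2;     /* supply voltage in volts */
    double f     = 3.0e9;   /* clock frequency in Hz */
    double Ileak = 20.0;    /* total leakage current in amperes (hypothetical) */

    double Pdyn  = A * C * V * V * f;  /* dynamic power */
    double Pstat = V * Ileak;          /* static (leakage) power */

    printf("Pdynamic = %.1f W, Pstatic = %.1f W, Ptotal = %.1f W\n",
           Pdyn, Pstat, Pdyn + Pstat);

    /* Halving V would cut the dynamic term by 4x, but the slides' rule of
       thumb V ~= 0.3 + 0.7*f means the sustainable frequency drops as well. */
    return 0;
}
```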
Power delivery and dissipation
- [Chart: power (W) vs. year, from the 4004 up to the Pentium generation, with extrapolations toward 500 W, 1.5 kW, 5 kW and 18 kW; courtesy of Intel]

Power density
- Power density is too high to keep junctions at low temperature
- [Chart: power density (W/cm²) vs. year, from the 4004 to the P6/Pentium generation, passing the hot-plate level and heading toward nuclear-reactor and rocket-nozzle densities; courtesy of Intel]

UltraSPARC power consumption
- The largest amount of power is consumed by the cores and by leakage

Has this ever happened before?

Keeping Moore's law alive
- Exploit the immense number of transistors in other ways
- Reduce gate sizes while keeping the frequency sufficiently low
- Use a higher number of slower logic gates
- In other words: switch to multicore systems!

What is this?
- 1979: David May's B0042 board, with 42 Transputers

Definitions
- Multicore CPU = two or more cores on a single package, composed of one or more dies (die = IC, Integrated Circuit)
- Monolithic processor = all cores on a single die
- Multi-Chip Module (MCM) = several dies on a single package
- Multi-CPU = multiple physically separated processing units (which often contain special circuitry to facilitate communication between each other)
- We will use the terms multiprocessor and multicore interchangeably

Terminology: threads and processes
- Processes (or tasks): independent, carry considerable state information, and have a separate address space
- Resources (memory, file handles, sockets, device handles) are owned by processes

Threads and processes
- Threads of the same process share the same resources
- Processes can share resources only through explicit methods
- The only resources exclusively owned by a thread are the thread stack, the register state and thread-local storage (if any)

Scheduling
- A scheduler is responsible for allocating tasks and threads to the available computing resources
- Various kinds of scheduling algorithms:
  - best-effort: minimize the makespan
  - real-time: meet deadlines
  - ...
- Other metrics to minimize: preemptions and context switches, accesses to shared resources

Multithreading
- Classic multithreading on single-processor systems requires Time Division Multiplexing (TDM), either time driven or event driven
- On multiprocessors, different threads and processes can run on different CPUs
- Multithreading is easier ("native") on multicore platforms, but scheduling requires more attention

Multithreading issues
- Race conditions
- Starvation, deadlock, livelock, priority inversion
- Synchronization (mutexes, locks)
- Atomic execution (semaphores)
- Communication: shared memory (requires locking) or message passing (slower but easier)

Different kinds of parallelism
- Instruction Level Parallelism (ILP)
- Data Level Parallelism (DLP)
- Thread Level Parallelism (TLP)

Instruction Level Parallelism (ILP)
- Execute multiple instructions per clock cycle
- Each functional unit on a core is an execution resource within a single CPU: Arithmetic Logic Unit (ALU), Floating Point Unit (FPU), bit shifter, multiplier, etc.
- Data dependencies need to be resolved

Data dependency
- Consider the sequential code:
  1. e = a + b
  2. f = c + d
  3. g = e * f
- Operation 3 depends on the results of operations 1 and 2: it cannot execute before 1 and 2 are completed

How to "parallelize" software?
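Operations 1 and 2 above are independent and could in principle run in parallel, while operation 3 must wait for both. A minimal sketch of this dependency structure in C with POSIX threads; the input values are arbitrary, and for such tiny work the thread-creation overhead would of course dominate, so this only illustrates the ordering constraint.

```c
/* The two independent additions (1) and (2) can proceed in parallel;
   the multiplication (3) must wait for both. */
#include <pthread.h>
#include <stdio.h>

static double a = 1, b = 2, c = 3, d = 4;   /* example inputs */
static double e, f;

static void *op1(void *arg) { (void)arg; e = a + b; return NULL; }  /* 1. e = a + b */
static void *op2(void *arg) { (void)arg; f = c + d; return NULL; }  /* 2. f = c + d */

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, op1, NULL);
    pthread_create(&t2, NULL, op2, NULL);
    pthread_join(t1, NULL);                 /* 3. depends on 1. */
    pthread_join(t2, NULL);                 /* 3. depends on 2. */
    double g = e * f;                       /* 3. g = e * f */
    printf("g = %g\n", g);
    return 0;
}
```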
Parallelism can be extracted from ordinary programs
- At run-time (by complex, dedicated hardware)
- At compile-time (simplifying CPU design and improving run-time performance)
- The degree of ILP is application dependent

ILP: superscalar architectures
- Data dependency checks in hardware
- Complex mechanisms: power-, die-space- and time-consuming
- Problematic when the code is difficult to predict or instructions have many interdependencies

Superscalar pipelining
- [Diagram: pipelined execution (5-stage) compared with superscalar (2×) execution]

ILP optimizations
- Instruction pipelining
- Superscalar execution
- Out-of-order execution: deferred memory accesses, combined loads and stores
- Speculative execution: branch prediction, speculative loads, ...

Processor front end and back end
- Intel describes its processors as having an "in-order front end" and an "out-of-order execution engine"

ILP: compile-time techniques
- The compiler decides which operations can run in parallel
- Moves the complexity of instruction scheduling from hardware to software
- New instruction sets explicitly encode multiple independent operations per instruction:
  - Very Long Instruction Word (VLIW): one instruction encodes multiple operations (one per execution unit)
  - Explicitly Parallel Instruction Computing (EPIC): adds features to VLIW (cache prefetching instructions, ...)

Data Level Parallelism (DLP)
- Higher parallelism than superscalar architectures
- SIMD instructions (Single Instruction, Multiple Data): Intel's MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX; AMD's 3DNow!, 3DNow! Professional, SSE5; ARM's NEON; IBM's AltiVec and SPE; etc.
- Graphics cards (GPUs), the Cell processor's SPUs
- Useful when the same operation has to be applied to a large set of data (e.g., multimedia, graphic operations on pixels, etc.)
- Multiple data items are read and/or modified at the same time

Thread Level Parallelism (TLP)
- Higher-level parallelism than ILP
- Different kinds of TLP:
  - Superthreading
  - Hyperthreading, or Simultaneous MultiThreading (SMT): needs a superscalar processor
  - Chip-level MultiProcessing (CMP): needs a multicore architecture
  - Combinations of the above solutions

Superthreading
- Temporal multithreading (fine- or coarse-grained): when the processor is idle, execute instructions of another thread
- Makes better use of the computing resources when a thread is blocked
- Requires adequate hardware support to minimize context-change overhead

Hyperthreading
- Simultaneous MultiThreading (SMT), introduced in the late 90s with Intel's Pentium 4
- Executes instructions from multiple threads simultaneously (needs superscalar support)
- Energy inefficient: can use up to 46% more power than dual-core designs
- Increases cache thrashing by 42%, whereas dual core results in a 37% decrease
- Temporarily dropped by Intel (will return with the Nehalem architecture)

Single-threaded CPU
- [Diagram: 4 running programs, but only the red program is executing; up to 4 instructions per clock cycle, 7 functional units, pipeline bubbles]

Super-threading (time-slice multithreading)
- Multithreaded processor: able to handle more than one thread at a time
- Interleaved execution of different threads (helps mask memory latency)
- All instructions in a pipeline stage must come from the same thread

Hyper-threading (Simultaneous MultiThreading)
- Interleaved execution of different threads (helps mask memory latency)
- Instructions in a pipeline stage may come from different threads

Hyper-threading (SMT)
- From the OS perspective: many "logical" processors
- Average ILP for a thread = 2.5 instructions/cycle, while the Pentium 4 issues at most 3 instructions/cycle to the execution core
- A hyperthreaded processor can therefore exploit parallelism beyond a single thread's ILP
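Since the OS sees each SMT context as a logical processor, a quick way to see how many it exposes is sysconf. This is a minimal sketch assuming a Linux/glibc system; the _SC_NPROCESSORS_* constants are glibc extensions rather than strict POSIX.

```c
/* Query the number of logical processors the OS exposes
   (physical cores plus SMT siblings). Values are system dependent. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online */
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* logical CPUs configured */
    printf("online: %ld, configured: %ld\n", online, configured);
    return 0;
}
```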
But SMT can be worse than non-SMT approaches:
- A thread monopolizing an execution unit for many consecutive pipeline stages can stall the available functional units (which could have been used by other threads)
- Cache thrashing problem: different logical processors can execute two threads accessing completely different memory areas

Not a problem for multicores
- A smart SMT-aware OS running on a multicore would schedule the two tasks on different processors, so resource contention is minimized

Comparing parallel solutions
- Dual-core vs. superscalar vs. SMT processor
- The dual core achieves the best performance with respect to power and area

Multicore examples (high-end)
- Intel's Core 2: 2, 4 cores
- AMD: 2, 4 cores
- IBM-Toshiba-Sony Cell processor: 8 cores
- Sun's Niagara UltraSPARC: 8 cores
- Microsoft's Xenon: 3 cores (Xbox 360)
- Tilera's TILE64: 64 cores
- Others (network processors, DSPs, GPUs, ...)

Multicore examples (embedded)
- ARM's MPCore: 4 cores
- Nios II: x cores
- Network processors are being replaced by multicore chips (Broadcom's 8-core processors)
- DSPs: TI, Freescale, Atmel, Picochip (up to 300 cores, communication domain)

How many cores in the future?
- Application dependent; typically few for high-end computing
- Many trade-offs: transistor density, technology limits, Amdahl's law

Beyond 2 billion transistors/chip
- Intel's Tukwila: Itanium based, 2.046 billion FETs, quad-core, 65 nm technology, 2 GHz at 170 W, 30 MB cache, 2-way SMT (8 threads in total)

How many cores in the future?
- Intel's 80-core prototype is already available
- Able to transfer a TB of data/s (while a Core 2 Duo reaches 1.66 GB of data/s)
- To be released in 5 years

How many cores in the future?
- Berkeley: weather simulation at 1.5 km resolution, 1000× realtime, with 3M custom-tailored Tensilica cores
- A petaflop computer, with power around 2.5 MW, estimated for 2011 at a cost of around 75 M$
- The main obstacle will be software

Supercomputers
- Petaflop supercomputers (current supercomputers have teraflop computing power)
- IBM's BlueGene/P: 3 Pflops, 884,736 PowerPC cores (quad-core chips), ready before 2010

Prediction
- Patterson & Hennessy: "the number of cores will double every 18 months, while power, clock frequency and costs will remain constant"

Amdahl's law
- Originally formulated for the speedup obtained by improving a portion of a program; later adapted to measure the speedup obtained by increasing the number of processors
- P = parallel portion of a given application, N = number of processors/cores
- The total speedup obtained by increasing N is Speedup(N) = 1 / ((1 − P) + P/N)

Considerations on Amdahl's law
- For N arbitrarily large, the maximum speedup tends to 1 / (1 − P)
- In practice, performance/price falls rapidly as N is increased once there is even a small component (1 − P)
- Example: P = 90%, (1 − P) = 10%, so the speedup is < 10 (see the numeric sketch at the end of this section)
- [Graph: speedup vs. number of processors for different values of P]

Consequences of Amdahl's law
- "Law of diminishing returns": picking the optimal improvement at each step, the gain is smaller each time; the same holds when adding processors
- Considering memory, bus and I/O bottlenecks as well, the situation gets worse
- Parallel computing is only useful for limited numbers of processors, or for problems with very high values of P ("embarrassingly parallel problems")

Embarrassingly parallel problems
- Problems for which no particular effort is needed to split the work into a very large number of parallel tasks: there is no essential dependency (or communication) between those tasks
- For them, 1 core @ 4 GHz = 2 cores @ 2 GHz
- Examples: GPU-handled problems, 3D projection (independent rendering of each pixel), brute-force searching in cryptography
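Going back to Amdahl's law, here is a minimal numeric sketch that evaluates Speedup(N) = 1 / ((1 − P) + P/N) for P = 0.9; the processor counts are arbitrary, and the output shows the speedup approaching but never reaching 10.

```c
/* Amdahl's law: Speedup(N) = 1 / ((1 - P) + P / N).
   Illustrative values only; P = 0.9 reproduces the "speedup < 10" bound. */
#include <stdio.h>

static double speedup(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void) {
    double P = 0.9;                       /* parallel fraction of the program */
    int counts[] = {1, 2, 4, 8, 16, 64, 1024};
    for (unsigned i = 0; i < sizeof counts / sizeof counts[0]; i++)
        printf("N = %4d  speedup = %.2f\n", counts[i], speedup(P, counts[i]));
    /* As N grows, the speedup tends to 1 / (1 - P) = 10 without exceeding it. */
    return 0;
}
```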
SMP
- Symmetric MultiProcessing: many identical processors are connected to a single shared main memory

SMP
- Any processor can work on any task, so migration is easy
- The access time to a memory location is independent of which processor makes the request and of which memory chip contains the data: UMA (Uniform Memory Access)
- Load balancing
- Typically limited to a small number of CPUs; for larger systems, alternatives like NUMA (Non-Uniform Memory Access) are used

SMP architectures
- Intel's Xeon, Pentium D, Core Duo, Core 2 Duo
- AMD's Athlon64 X2, Quad FX, Opteron (200 and 2000 series)
- Other non-x86: UltraSPARC, SPARC64, SGI MIPS, SGI f1200, Intel Itanium, PA-RISC, DEC Alpha, POWER and PowerPC (604, 604e, G4 and G5 series)

ASMP
- Asymmetric MultiProcessing: assigns certain tasks only to certain processors

ASMP
- Typically there is one master and one or more slave processors (unlike SMP)
- The master distributes the load to the slaves
- Limitations on the interactions among slave processors, I/O and memory
- It is also possible to use identical cores in an asymmetric way (dedicating tasks to cores)

NUMA
- Fast access to local memory, slower access to remote memory
- Improves throughput for localized data
- Inter-processor migration is more expensive
- Difficult load balancing
- Cache and memory coherence problems: software solutions, or hardware solutions (ccNUMA)

NUMA
- One can view NUMA as a very tightly coupled form of cluster computing

UMA or NUMA?
- Depends on the addressed application
- Locality favours NUMA; tight data dependencies among processes favour UMA

Caches
- Caches play a key role in modern architectures:
  - they reduce the average data access time
  - they reduce the bandwidth demand placed on the shared interconnect

Memory wall
- A.k.a. the Von Neumann bottleneck
- Growing disparity between CPU speed and the speed of memory outside the CPU chip: from 1986 to 2000, CPU speed improved at an annual rate of 55% while memory speed only improved by 10%
- Mainly due to limited communication bandwidth and memory latency
- Larger cache sizes are used to mask the latency to memory

Memory bottleneck
- [Diagram: the memory bottleneck between processors and main memory]

Write policy
- Write-through: every cache write causes a write to memory
- Write-back: cache writes are not immediately mirrored to memory
  - modified data is marked as dirty
  - dirty data needs to be written back to main memory before eviction
  - writes may be held in a temporary queue; processing multiple stores together improves bus usage

Cache coherency
- Private processor caches create a problem: copies of a variable can be present in multiple processors
- A write by one processor may not become visible to the others, which keep accessing the stale value in their caches ⇒ the cache coherence problem

Cache coherency protocols: MSI
- Modified (M): the cached data (new) differs from the data in main memory (old); it needs a write-back before eviction
- Shared (S): the data is unmodified and can be evicted without writing it back
- Invalid (I): the cached data is old; it must be fetched from memory or from another cache
- [State diagram: transitions among M, S and I triggered by processor reads/writes (PrRd, PrWr) and bus events (BusRd, BusRdX, Flush)]

MSI protocol
- Maintained through communication between the caches and main memory. Examples:
  - read in the M or S state: the cache supplies the data
  - read in the I state: must check whether the data is in the M state in another cache (snooping, cache directory, etc.)
  - write in the M state: the data is written locally
  - write in the S state: invalidate (via snooping, cache directory, etc.) the other caches in the S state before writing the data
  - write in the I state: invalidate (via snooping, cache directory, etc.) the other caches in the S or M state before writing the data
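As a compact way to read these rules, here is a toy sketch of the per-line MSI state transitions in C, driven by the example sequence discussed next (r1, r2, w3, r2). It only illustrates the state changes, not a full coherence simulator; the type and function names are invented for the sketch.

```c
/* Toy sketch of MSI states for one cache line, following the slides' rules.
   Bus broadcast between the caches is implied, not modelled. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_t;

/* Local processor read (PrRd) */
static msi_t pr_rd(msi_t s) {
    if (s == INVALID) return SHARED;   /* fetch from memory/another cache (BusRd) */
    return s;                          /* M or S: the cache supplies the data */
}

/* Local processor write (PrWr) */
static msi_t pr_wr(msi_t s) {
    (void)s;
    return MODIFIED;                   /* from S or I: BusRdX invalidates other copies */
}

/* Bus events observed by the snooper for this line */
static msi_t bus_rd(msi_t s)  { return (s == MODIFIED) ? SHARED : s; } /* flush, then share */
static msi_t bus_rdx(msi_t s) { (void)s; return INVALID; }             /* another cache writes */

int main(void) {
    const char *name[] = { "I", "S", "M" };
    msi_t p1 = INVALID, p2 = INVALID, p3 = INVALID;

    p1 = pr_rd(p1);                                      /* r1: P1 reads -> S            */
    p2 = pr_rd(p2);                                      /* r2: P2 reads -> S            */
    p3 = pr_wr(p3); p1 = bus_rdx(p1); p2 = bus_rdx(p2);  /* w3: P3 -> M, others -> I     */
    p2 = pr_rd(p2); p3 = bus_rd(p3);                     /* r2: P3 flushes -> S, P2 -> S */

    printf("P1=%s P2=%s P3=%s\n", name[p1], name[p2], name[p3]);
    return 0;
}
```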
MSI protocol (example)
- Three processors P1, P2, P3, each with a private cache and a bus snooper, share the main memory
- Notation: r1 = read by P1, w3 = write by P3
- Example sequence: r1, r2, w3, r2
  - r1: P1 issues PrRd/BusRd and fetches the value; P1's copy enters state S
  - r2: P2 issues PrRd/BusRd; both P1 and P2 now hold the value in state S
  - w3: P3 issues PrWr/BusRdX; P1 and P2 invalidate their copies (S → I) and P3's copy enters state M
  - r2: P2's BusRd forces P3 to flush the modified value; P3 goes M → S and P2 obtains the value in state S

Cache coherency mechanisms
- Directory-based: a central directory of cached blocks; point-to-point messages; slower, but more scalable
- Snooping (or sniffing): monitor the address lines for write accesses to cached locations; broadcast messages; faster, but less scalable
- Snarfing (less used): the cache controller watches both address and data, so cached data is always updated when modified

Snoopy cache-coherence control
- The bus is a broadcast medium, and the caches know what they contain
- The cache controller "snoops" all transactions on the shared bus and takes action to ensure coherence
- [Diagram: processors P1 ... Pn, each with a cache ($) and a bus snoop, on a shared bus with main memory and I/O devices; cache-memory transactions]

Cache coherency protocols
- MESI: introduced in the Pentium processor; adds an Exclusive state; reduces the traffic due to writes to data present exclusively in a single cache
- MOSI: adds an Owned state; reduces the traffic due to the write-back of blocks read by other caches
- MOESI: has both the Exclusive and the Owned state; beneficial when communication latency and bandwidth are better between two CPUs than to main memory (e.g., multicore systems with per-core L2 caches)

Cache vs scratchpad memory
- Both are the fastest kind of memory after the internal registers
- Scratchpads are used in the PS1's R3000, the PS2's R5000 and the PS3's Cell

Performance boost with multicore
- Interrupts can be handled on an idle processor instead of preempting the running process (also for programs written for a single core): not faster execution, but a smoother appearance
- For inherently parallel applications (graphic operations, servers, compilers, distributed computing): speedup proportional to the number of processors, with limitations due to serialized RAM access and cache coherency

Less likely to benefit from multicores
- I/O-bound tasks
- Tasks composed of a series of pipeline-dependent calculations
- Tasks that frequently communicate with each other
- Tasks that contend for shared resources

Programming
- Applications are written for uniprocessors (word processors, computer games)
- Adapting an application suited to a single core to a multiprocessor device results in performance losses
- Multicore architectures are much more widespread nowadays, so multicore programming is becoming more important

Exploiting multicores
- Multicore-capable OSs: Windows NT 4.0/2000/XP/2003 Server/Vista; Linux and Unix-like systems; Mac OS; VxWorks, QNX, etc.
- Multi-threaded applications
- Multicore optimizations for game engines: Half-Life 2: Episode Two, Crysis, etc.
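A minimal sketch of what a multi-threaded application looks like at this level: the work (summing an array) is split into one chunk per worker thread using POSIX threads. The array size and thread count are arbitrary illustrative values.

```c
/* Split an array sum across NTHREADS worker threads, one chunk per thread. */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 4

static double data[N];

typedef struct { int begin, end; double partial; } chunk_t;

static void *worker(void *arg) {
    chunk_t *c = arg;
    c->partial = 0.0;
    for (int i = c->begin; i < c->end; i++)
        c->partial += data[i];            /* each thread touches only its own slice */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        chunk[t].begin = t * (N / NTHREADS);
        chunk[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &chunk[t]);
    }

    double sum = 0.0;
    for (int t = 0; t < NTHREADS; t++) {  /* join and combine the partial results */
        pthread_join(tid[t], NULL);
        sum += chunk[t].partial;
    }
    printf("sum = %.0f\n", sum);
    return 0;
}
```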
Software development tools

Parallel programming
- Existing parallel programming models: OpenMP, MPI, IBM's X10, Intel's TBB (an abstraction for C++), Sun's Fortress, Cray's Chapel, Cilk (Cilk++), Codeplay's Sieve C++, RapidMind Development Platform (a minimal OpenMP sketch appears at the end of this section)

Design stages
- Problem decomposition: break the application into a large number of small concurrent tasks
- Communication analysis: detect the existing interdependencies among the various tasks
- Composition: combine/merge tasks to obtain the desired grain of parallelism
- Mapping: allocate tasks to the available computing resources (if not automatically done by a scheduler)

Partitioning tasks to processors
- Possible partitioning choices: partition by CPU load, by information-sharing requirements, or by functionality
- Use the smallest possible number of processors, or run at the lowest possible frequency, depending on considerations like fault tolerance, power consumption, temperature, etc.

Beyond multicore chips
- Many-core devices: tens or hundreds of cores
- Integration of SMT techniques into multicore chips
- Special-purpose heterogeneous cores
- Memory-on-chip
- Fine-grain power management: Dynamic Voltage and Frequency Scaling (DVFS)
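Of the models listed above, OpenMP is the most widely available one for shared-memory multicores. A minimal sketch follows; compile with an OpenMP-enabled compiler (e.g. gcc -fopenmp), and note that the array size is an arbitrary illustrative value.

```c
/* Minimal OpenMP sketch: the runtime splits loop iterations across cores. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double x[N];
    double sum = 0.0;

    /* Each iteration is independent, so the loop is embarrassingly parallel. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2.0 * i;

    /* The reduction clause combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("threads available: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}
```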