Multicores Cnt (EH) - Department of Information Technology

What Thread-Level Parallelism (TLP) can buy you
Erik Hagersten, Uppsala University, Sweden
AVDARK 2010
Dept of Information Technology | www.it.uu.se
© Erik Hagersten | user.it.uu.se/~eh

Old Trend 1: Deeper pipelines/higher freq.
• Exploring ILP (instruction-level parallelism)
[Diagram labels: PC, Regs, instructions (I) with a data dependence, Mem at 150 cycles.]

Old Trend 2: Wider pipelines
• Exploring more ILP
• More pipelines + deeper pipelines => need more independent instructions
[Diagram labels: Thread 1, PC, Regs, issue logic, several instructions (I) issued in parallel, Mem.]

Old Trend 3: Deeper memory hierarchy
• Exploring access locality
[Diagram labels: issue logic, four pipelines (I R B M MW), Regs, three caches (drawn as kr, €, £) and Mem.]
  Level        Size    Latency
  Cache (kr)   2kB     2 cycles
  Cache (€)    64kB    10 cycles
  Cache (£)    2MB     30 cycles
  Mem          1GB     150 cycles

CMP: Chip Multiprocessor (aka Multicores)
• Not enough MLP? More TLP & geographical locality
[Diagram labels: Thread 1 … Thread N, each with issue logic, pipelines (I R B M MW), PC and Regs; caches (SEK, €, $); Mem.]

Slooow memory
[Diagram labels: CPU + L1 cache, L2 cache, L3 Ctrl, memory holding A, B and C.]
A = B + C:
  Operation    Latency
  Read B       0.3 - 100 ns
  Read C       0.3 - 100 ns
  Add B & C    0.3 ns
  Write A      0.3 - 100 ns

TLP => MLP
• Not enough ILP => memory accesses from many threads (MLP)
[Diagram labels: Thread 1 and Thread 2, each computing A = B + C; issue logic; sloooow memory.]

TLP => MLP
• Thread-Level Parallelism => Memory-Level Parallelism (MLP)
[Diagram labels: CPU, L1, and an instruction stream that repeats the following pair five times:]
  LD X, R2
  ST R1, R2, R3

TLP => MLP
[Diagram labels: CPU, issue logic, L1, two instruction streams that each repeat the LD/ST pair above five times, caches (€, $), Mem.]

SMT: Simultaneous Multithreading
• "Combine TLP & ILP to find independent instr."
• => feed one superscalar with instr. from many threads (also used in GPUs)
[Diagram labels: Thread 1 … Thread N, one PC and register set per thread (Regs 1 … Regs N), shared issue logic, four pipelines (I R B M MW), cache (SEK).]

Thread-interleaved
• Historical examples: Denelcor HEP, Tera Computers [B. Smith] 1984
  Each thread executes every n:th cycle in a round-robin fashion
  -- Poor single-thread performance
  -- Expensive (due to early adoption)
• Intel's "Hyperthreading" (2002)
  -- Poor implementation

Design Issues for Multicores
Erik Hagersten, Uppsala University, Sweden

CMP bottlenecks/points of optimization
• Performance per Watt?
• Performance per memory byte?
• Performance per bandwidth?
• Performance per $?
• …
• How large a fraction of a CMP system's cost is in the CPU chip?
• Should the execution (MIPS/FLOPS) be viewed as a scarce resource?

Shared Bottlenecks
[Diagram labels: four CPUs, each with its own L1; shared resources: L2 cache and bandwidth.]

Capacity or Capability Computing?
• Capacity? (≈ several sequential jobs)
• or Capability? (≈ one parallel job)
• Issues:
  - Memory requirement?
  - Sharing in cache?
  - Memory bandwidth requirement?
• Memory: the major cost of a CMP system! How do we utilize it the best?
• Once the working set is in memory, work like crazy!
• => Capability computing suits CMPs the best (in general)

A Few Fat or Many Narrow Cores?
• Fat: fewer cores but…
  - wide issue?
  - O-O-O?
• Narrow: more cores but…
  - narrow issue?
  - in-order?
  - have you ever heard of Amdahl?
  - SMT, run-ahead, execute-ahead … to cure shortcomings?
• Read:
  - Maximizing CMP Throughput with Mediocre Cores. Davis, Laudon and Olukotun, PACT 2006
  - Amdahl's Law in the Multicore Era. Mark Hill, IEEE Computer, July 2008
    http://www.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf

Cores vs. caches
• Depends on your target applications…
• Niagara's answer: go for cores
  - In-order 5-stage pipeline
  - 8 cores with 4 SMT threads each => 32 threads
  - 3MB shared L2 cache (96 kB/thread)
  - SMT to hide memory latency
  - Memory bandwidth: 25 GB/s
  - Will this approach scale with technology?
• Others: go for cache
  - 2-4 cores for now

How to Hide Memory Latency (and create MLP)
• Options:
  - O-O-O
  - HW prefetching
  - SMT
  - Run-ahead/Execute-ahead

Fighting for shared resources
Handling shared resources
Erik Hagersten, Uppsala University, Sweden

1st Order MC Performance Problems
[Diagram labels: two binaries, core, private caches (¢), shared cache ($) with a "wasted" region, Mem.]
• Additional multicore issues:
  - Even less cache resources per application
  - Sharing of cache resources
  - Wasted cache usage

Cache Interference in Shared Cache
• Cache sharing strategies:
  1. Fight it out!
  2. Fair share: 50% of the cache each
  3. Maximize throughput: who will benefit the most?
[Figure: miss rate vs. $ size from single-threaded cache profiling; curves A and B; points 1, 2 and 3 mark the three strategies; "tot" marks the total cache size.]
• A examples: ammp, art, …
• B examples: vpr_place, vortex, …
• Read:
  - STATSHARE: A Statistical Model for Managing Cache Share via Decay. Pavlos Petoumenos et al., MOBS workshop, ISCA 2006
  - Predicting the inter-thread cache contention on a CMP. Chandra et al., HPCA 2005

Andreas' research: Using Prefetch-NT
• Avoid cache pollution using the existing x86 ISA: NT = Non-temporal
• AMD, Prefetch NT:
  - Install in L1 cache with NT bit set
  - Non-inclusive caching => not in L2, L3
  - Upon eviction from L1, do not install in L2, L3 (if NT is not set, the cache line will get installed)
• Intel Core2, Prefetch NT:
  - Install in L1 & L2 caches (inclusive caching)
  - Put in MRU place in L2 => replaced more easily
  - Upon eviction from L1, keep in L2
• Intel i7, Prefetch NT:
  - Install in L1 & L2 caches (inclusive caching)
  - Put in MRU place in L2 => replaced more easily
  - Upon eviction from L1, keep in L2
• All, Store-NT:
  - Keep the cache line in a special write buffer
  - When all bytes of the cache line have been updated, write to memory while bypassing the caches
  - Huge penalty if not all bytes are updated
[Diagram labels: binary, core, private caches (¢), shared cache ($) with a "wasted" region, Mem.]

Examples: hints to avoid cache pollution and hints for mixed workloads (non-temporal prefetches)
[Figure: miss rate (axis 0 to 0.25) vs. cache size for Libquantum, LBM and bzip; labels: "streaming"; "The larger cache, the better" (fewer cache misses); "Hint: Don't allocate!"]
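The cache-pollution effect that the non-temporal hints target can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a single fully associative LRU cache shared by one thread that reuses a small working set and one streaming thread that never reuses data. The names `LRUCache` and `run_mix`, the cache size, and the access pattern are all hypothetical and not taken from the slides, and `allocate=False` merely stands in for a "don't allocate" hint; it does not model the specific AMD or Intel Prefetch-NT behaviour listed above.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with LRU replacement (illustrative model only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry = least recently used

    def access(self, addr, allocate=True):
        """Return True on a hit. On a miss, install the line unless allocate=False."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        if allocate:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
            self.lines[addr] = None
        return False


def run_mix(capacity, reuse_set, rounds, stream_allocates):
    """Hit rate of the reusing thread while a streaming thread shares the cache.

    The reusing thread sweeps `reuse_set` lines `rounds` times. After each of
    its accesses, the streaming thread touches two brand-new lines.
    """
    cache = LRUCache(capacity)
    next_stream_addr = 1_000_000  # streaming addresses never repeat
    hits = 0
    accesses = 0
    for _round in range(rounds):
        for addr in range(reuse_set):
            accesses += 1
            if cache.access(addr):
                hits += 1
            for _ in range(2):
                cache.access(next_stream_addr, allocate=stream_allocates)
                next_stream_addr += 1
    return hits / accesses


if __name__ == "__main__":
    polluted = run_mix(capacity=64, reuse_set=48, rounds=10, stream_allocates=True)
    hinted = run_mix(capacity=64, reuse_set=48, rounds=10, stream_allocates=False)
    print(f"reuse-thread hit rate, streaming data allocated:     {polluted:.2f}")
    print(f"reuse-thread hit rate, streaming data not allocated: {hinted:.2f}")
```

In this toy setup the 48-line working set fits in the 64-line cache, so once the streaming thread stops allocating, the reusing thread only misses during its first sweep. When the streaming lines are installed, they push the working set out before it is touched again.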
[Figure: miss rate (axis 0 to 0.25) vs. cache size (axis up to 64M); regions labelled "tiny" and "bigger is better"; cache sizes "actual" and "actual/4" marked against "missrate" and "2x missrate". Hint: lim = actual/4.]
[Figure: throughput of one instance vs. four instances, original (Orig) vs. Lim = 1.7MB.]
[Figure: performance on AMD Opteron for bzip2, Libquantum, LBM and the geometric mean; bars: individually, in mix, in mix (patched); annotations: "40% faster", "25%".]

Looks and Smells Like an SMP (aka UMA)?
[Diagram labels: SMP system and multicore system, each with Memory, Interconnect, L2 and L1 caches and CPUs; threads (T) per CPU; "…8…"; "… 32 …".]

Wrapping up about multicores
Erik Hagersten, Uppsala Universitet

Trends (my guess!)
[Figure: six sketched curves over time t, each with "now" marked: Threads/Chip, Transistors/Thread, Memory/Chip, Cache/Thread, Bandwidth/Thread, Thr. Comm. Cost (temporal).]

What matters for multicore performance?
• Are we buying…
  - frequency?
  - Number of cores?
  - MIPS and FLOPS?
  - Memory bandwidth?
  - Cache capacity?
  - Memory capacity?
  - Performance/Watt?
• Well, how about:
  - Cost of parallelism?
  - Cache capacity per thread?
  - Memory bandwidth per thread?
  - Cost of thread communication?
  - …

MC Questions for the Future
• How to get parallelism?
• How to get performance/data locality?
• How to debug?
• A case for new funky languages?
• A case for automatic parallelization?
• Are we buying: compute power, memory capacity, or memory bandwidth?
• Will 128 cores be mainstream in 5 years?
• Will the CPU market diverge into desktop/capacity/capability/special-purpose CPUs again?
• A non-question: will it happen?

X86 Architecture
Erik Hagersten, Uppsala University, Sweden

Intel Archeology
  (8080:        1974, 6.0 kTransistors, 2 MHz, 8 bit)
  8086:         1978, 29 kT, 5-10 MHz, 16 bit (PC!)
  (80186:       1982, ? kT, 4-40 MHz, integration!)
  80286:        1982, 0.1 MT, 6-25 MHz, chipset (PC-AT)
  80386:        1985, 0.3 MT, 16-33 MHz, 32 bits
  80486:        1989, 1.2 MT, 25-50 MHz, I&D$, FPU
  Pentium:      1993, 3.1 MT, 66 MHz, superscalar
  Pentium Pro:  1997, 5.5 MT, 200 MHz, O-O-O, 3-way superscalar
  Pentium4:     2001, 42 MT, 1.5 GHz, super-pipe, L2$ on-chip
  …

8086 registers
• "General purpose" registers: AX (Accumulator), BX (Base), CX (Count), DX (Data)
• "Addressing registers": SP (Stack ptr), BP (Base ptr), SI (Source index), DI (Destination index)
• Segmented addressing (extending the address range): CS (Code segment), DS (Data segment), SS (Stack segment)

Complex instructions of x86
• RISC (Reduced Instruction Set Computer)
  - LD/ST with a limited set of address modes
  - ALU instructions (a minimum)
  - Many general purpose registers
  - Simplifications (e.g., read R0 returns the value 0)
  - Simpler ISA => more efficient implementations
• x86 CISC (Complex Instruction Set Computer)
  - ALU/Memory in the same instruction
  - Complicated instructions
  - Few specialized registers (actually accumulator architecture)
  - Variable instruction length
• x86 was lagging in performance behind RISC in the 90s

Micro-ops
• Newer pipelines implement RISC-ish μ-ops.
• Some complex x86 instructions are expanded to several micro-ops at runtime.
• The translated μ-ops may be cached in a trace cache [in their predicted order] (first: Pentium4)
• Expanded to "loop cache" in Core-2

x86-64
• ISA extension to x86 (by AMD 2001)
• 64-bit virtual address space
• 64-bit GP registers x16
  - x86's regs extended: rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi
  - x86-64 also has: r8, r9, ... r15 (i.e., a total of 16 regs)
  - NOTE: dynamic register renaming makes the effective number of regs higher
• SSEn: 16 128-bit SSE "vector" registers
• Backwards compatible with x86
• Intel adoptions: IA-32e, EM64T, Intel64
• NOTE: IA-64 is Itanium

x86 Vector instructions
• MMX: 64 bit vectors (e.g., two 32 bit ops)
• SSEn: 128 bit vectors (e.g., four 32 bit ops)
• AVX: 256 bit vectors (e.g., eight 32 bit ops) (in Sandy Bridge, ~Q1 2011)
• MIC: "16-way vectors". Is this 16 x 32 bits??

Examples of vector instructions
[Diagram labels: vector regs A, B, C, D, E, ...; SSE_MUL D, B, A performs four element-wise multiplications (x x x x).]

How to explore SIMD: nVIDIA
• 512 "processors" (P)
• 16 P/StreamProcessor (SP)
• SP is SIMD-ish (sort of)
• Full DP-FP IEEE support
• 64kB L1 cache /SP
• 768kB global shared cache (L2; less than the sum of L1:s)
• Atomic instructions
• ECC correction
• Debugging support
• Giant chip/high power
• ...

How to explore SIMD: Intel MIC
• "more than 50 cores"
• SIMD instructions: 16-way, how wide?

Exponential Growth
Erik Hagersten, Uppsala University, Sweden
[Figures: Ray Kurzweil pictures, www.KurzweilAI.net/pps/WorldHealthCongress/]

Doubling (or Halving) times [Kurzweil 2006]
  Dynamic RAM Memory (bits per dollar)       1.5 years
  Average Transistor Price                   1.6 years
  Microprocessor Cost per Transistor Cycle   1.1 years
  Total Bits Shipped                         1.1 years
  Processor Performance in MIPS              1.8 years
  Transistors in Intel Microprocessors       2.0 years
  Microprocessor Clock Speed                 2.7 years
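The doubling (or halving) times above can be turned into yearly growth factors: a quantity that doubles every T years changes by a factor of 2^(1/T) per year and by 2^(n/T) over n years. The sketch below applies that arithmetic to the listed values. The helper names `annual_factor` and `growth_over` are made up for illustration, and the formula assumes a perfectly steady exponential trend, which the slides do not claim.

```python
# Doubling (or halving) times in years, copied from the table above.
DOUBLING_YEARS = {
    "Dynamic RAM Memory (bits per dollar)": 1.5,
    "Average Transistor Price": 1.6,
    "Microprocessor Cost per Transistor Cycle": 1.1,
    "Total Bits Shipped": 1.1,
    "Processor Performance in MIPS": 1.8,
    "Transistors in Intel Microprocessors": 2.0,
    "Microprocessor Clock Speed": 2.7,
}


def annual_factor(doubling_time_years):
    """Yearly change factor for a quantity that doubles (or halves) every T years."""
    return 2.0 ** (1.0 / doubling_time_years)


def growth_over(doubling_time_years, years):
    """Total change factor after `years` years of steady exponential change."""
    return 2.0 ** (years / doubling_time_years)


if __name__ == "__main__":
    for name, t in DOUBLING_YEARS.items():
        print(f"{name:42s} x{annual_factor(t):.2f} per year, "
              f"x{growth_over(t, 10):.0f} per decade")
```

For example, a doubling time of 2.0 years corresponds to a factor of 2^5 = 32 over ten years. For quantities that shrink (prices, cost per transistor cycle), the same factor is the divisor.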
