Multiprocessor on a Chip & Simultaneous Multi-threads
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Review: Multiprocessor Basics
• Q1 – How do they share data?
• Q2 – How do they coordinate?
• Q3 – How scalable is the architecture? How many processors?

                                        # of Proc
Communication   Message passing         8 to 2048
model           Shared     NUMA         8 to 256
                address    UMA          2 to 64
Physical        Network                 8 to 256
connection      Bus                     2 to 36

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by the OS as several CPUs.
Multi-core: Integrating several processors that (partially) share a memory system on the same chip.

Recall: Bypass network prevents stalls
Instead of bypass: Interleave threads on the pipeline to prevent stalls ...

CMP: Multiprocessors On One Chip
• By placing multiple processors, their memories, and the interface all on one chip, the latencies of chip-to-chip communication are drastically reduced
  – ARM multi-chip core
[Figure: ARM multi-chip core. Blocks shown: configurable number of hardware interrupts and private IRQs feeding an Interrupt Distributor; per-CPU aliased peripherals on a private peripheral bus; between 1 and 4 symmetric CPUs (configurable), each with a CPU interface and L1 caches; Snoop Control Unit; primary AXI R/W 64-b bus and optional AXI R/W 64-b bus.]

Multithreading on a Chip
• Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
• Multithreading: increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  – Processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread
  – The caches, TLBs, BHT, and BTB can be shared (although the miss rates may increase if they are not sized accordingly)
  – The memory can be shared through virtual memory mechanisms
  – Hardware must support efficient thread context switching

Types of Multithreading
• Fine-grain – switch threads on every instruction issue
  – Round-robin thread interleaving (skipping stalled threads)
  – Processor must be able to switch threads on every clock cycle
  – Advantage: can hide throughput losses that come from both short and long stalls
  – Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
• Coarse-grain – switch threads only on costly stalls (e.g., L2 cache misses)
  – Advantages: thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  – Disadvantage: limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    • Pipeline must be flushed and refilled on thread switches

Multithreaded Example: Sun’s Niagara (UltraSparc T1)
• Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)

                  Ultra III                Niagara
Data width        64-b                     64-b
Clock rate        1.2 GHz                  1.0 GHz
Cache (I/D/L2)    32K/64K/(8M external)    16K/8K/3M
Issue rate        4 issue                  1 issue
Pipe stages       14 stages                6 stages
BHT entries       16K x 2-b                None
TLB entries       128I/512D                64I/64D
Memory BW         2.4 GB/s                 ~20 GB/s
Transistors       29 million               200 million
Power (max)       53 W                     <60 W

[Figure: Niagara chip. Blocks shown: eight 4-way MT SPARC pipes, crossbar, 4-way banked L2$, memory controllers, I/O and shared functions.]

Niagara Integer Pipeline
• Cores are simple (single-issue, 6-stage, no branch prediction), small, and power-efficient
[Figure: Niagara integer pipeline. Stages: Fetch, Thread Select, Decode, Execute, Memory, WB. Blocks shown: I$, ITLB, instruction buffers x4, PC logic x4, two thread-select muxes, Decode, register files x4, ALU, Mul, Shft, Div, D$, DTLB, store buffers x4, crossbar interface. Inputs to the thread select logic: instruction type, cache misses, traps & interrupts, resource conflicts. From MPR, Vol. 18, #9, Sept. 2004]

Simultaneous Multithreading (SMT)
• A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP)
  – Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., more than the programs have ILP)
  – With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    • Need separate rename tables (ROBs) for each thread
    • Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
• Intel’s Pentium 4 SMT is called hyperthreading
  – Supports just two threads (doubles the architecture state)

Threading on a 4-way SS Processor Example
[Figure: issue-slot diagrams for Coarse MT, Fine MT, and SMT. Axes: issue slots and time. Slots are filled by Thread A, Thread B, Thread C, and Thread D.]

Multicore Xbox360 – “Xenon” processor
• To provide game developers with a balanced and powerful platform
  – Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  – 165M transistors total
  – 3.2 GHz, near-POWER ISA
  – 2-issue, 21-stage pipeline, with 128 128-bit registers
  – Weak branch prediction, supported by software hinting
  – In-order instruction execution
  – Narrow cores: 2 INT units, 2 128-bit VMX (colloquial term for SIMD) units, 1 of anything else
• A 500 MHz GPU w/ 512MB of DDR3 DRAM
  – 337M transistors, 10MB framebuffer
  – 48 pixel shader cores, each with 4 ALUs

Xenon Diagram
[Figure: Xenon system diagram. Blocks shown: Core 0, Core 1, Core 2 (each with L1D and L1I), 1MB UL2, XMA Dec, MC0, MC1, 512MB DRAM, BIU/IO Intf, SMC, GPU, 3D Core, 10MB EDRAM, Video Out, Analog Chip, DVD, HDD Port, Front USBs (2), Wireless, MU ports (2 USBs), Rear USB (1), Ethernet, IR, Audio Out, Flash, Systems Control.]
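
Worked example: fine-grain versus coarse-grain thread switching

The trade-off on the Types of Multithreading slide can be made concrete with a toy model. The sketch below is not taken from the slides and is not how Niagara or any real core is built. It assumes a single-issue pipeline in which each thread is a list of stall lengths (entry i is the number of cycles the thread must wait after its instruction i issues). The names `Thread`, `fine_grain`, `coarse_grain`, `long_stall`, and `refill` are hypothetical. `fine_grain` issues round-robin every cycle and skips stalled threads. `coarse_grain` stays on one thread, switches only when that thread finishes or faces a stall of at least `long_stall` cycles, and charges `refill` cycles per switch to stand in for the pipeline flush and refill.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    stalls: list        # stalls[i]: cycles to wait after instruction i issues (assumed model)
    pc: int = 0         # index of the next instruction to issue
    ready_at: int = 0   # first cycle in which the next instruction may issue

    def done(self):
        return self.pc >= len(self.stalls)

    def ready(self, cycle):
        return not self.done() and self.ready_at <= cycle

    def issue(self, cycle):
        self.ready_at = cycle + 1 + self.stalls[self.pc]
        self.pc += 1


def fine_grain(programs):
    """Round-robin issue, one instruction per cycle, skipping stalled threads.

    Returns the number of cycles until the last instruction has issued.
    """
    threads = [Thread(list(p)) for p in programs]
    cycle, nxt = 0, 0
    while not all(t.done() for t in threads):
        for k in range(len(threads)):
            i = (nxt + k) % len(threads)
            if threads[i].ready(cycle):
                threads[i].issue(cycle)
                nxt = (i + 1) % len(threads)
                break
        cycle += 1  # this is an idle cycle if no thread was ready
    return cycle


def coarse_grain(programs, long_stall=4, refill=2):
    """Stay on one thread; switch only on a costly stall or when it finishes.

    Each switch costs `refill` cycles (hypothetical flush-and-refill penalty).
    Returns the number of cycles until the last instruction has issued.
    """
    threads = [Thread(list(p)) for p in programs]
    n = len(threads)
    cycle, cur = 0, 0
    while not all(t.done() for t in threads):
        t = threads[cur]
        if t.ready(cycle):
            t.issue(cycle)
            cycle += 1
            continue
        costly = t.done() or t.ready_at - cycle >= long_stall
        ready_others = [(cur + k) % n for k in range(1, n)
                        if threads[(cur + k) % n].ready(cycle)]
        if costly and ready_others:
            cur = ready_others[0]
            cycle += refill  # pay the pipeline start-up cost
        else:
            cycle += 1       # short stall, or nobody else is ready: idle


    return cycle


short_stalls = [[2, 2, 2]] * 3        # three threads, a 2-cycle stall after every instruction
long_stalls = [[0, 8, 0], [0, 8, 0]]  # two threads, one 8-cycle stall each

print(fine_grain(short_stalls), coarse_grain(short_stalls))
print(fine_grain(long_stalls), coarse_grain(long_stalls))
```

Under these assumptions, the fine-grain scheduler fills every issue slot for the short-stall workload, while the coarse-grain scheduler idles through each short stall because none reaches `long_stall`. For the long-stall workload, the coarse-grain scheduler does overlap some stall time, but the `refill` penalty on each switch eats into the gain, which is the "pipeline start-up costs" limitation named on the slide.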