Download 8 Cores

Networks for Multi-core Chip —A Controversial View Shekhar Borkar Intel Corp. 1 Outline Multi-core system outlook On die network challenges A simpler but controversial proposal Benefits Summary 2 A Sample Multi-core System 10mm Core 50% Cache 50% 45nm 10mm 65nm, 4 Cores 1V, 3GHz 10mm die, 5mm each core Core Logic: 6MT, Cache: 44MT Total transistors: 200M 32nm 10mm 22nm 10mm 16nm 10mm 8 Cores, 1V, 3GHz 3.5mm each core 16 Cores, 1V, 3GHz 2.5mm each core 32 Cores, 1V, 3GHz 1.8mm each core 64 Cores, 1V, 3GHz 1.3mm each core Total: 400MT Total: 800MT Total: 1.6BT Total: 3.2BT 3 A Sample MC Network 5mm 0.4mm Packet Switched Mesh 16B=128 bit each direction 0.4mm @ 1.5u pitch 192GB/s Bisection BW Tech Core Port size Bisection BW (mm) (mm) GB/sec@3GHz 65nm 5 0.4 192 45nm 3.5 0.4 272 32nm 2.5 0.4 384 22nm 1.8 0.4 543 16nm 1.3 0.4 768 4 100 1.E+00 1.E-01 Asci Red 1.E-02 TFLOP 1.E-03 1.E-04 Network Power (W) Active Cap (nf/bdir bit) Mesh Power @ 3GHz, 1V 80 60 1X 1.4X 40 20 0 0.5u 0.18u 65nm 65nm 45nm 32nm 22nm 16nm 1. Power too high 2. Worse if link width scales up each generation 3. Most of the power dissipation is in router logic (not in the metal busses) 4. Cache coherency mechanism is complex 5 Why Mesh (or any other complex Network)? Bus: Good at board level, does not extend well • Transmission line issues: loss and signal integrity, limited frequency • Width is limited by pins and board area • Broadcast, simple to implement Point to point busses: fast signaling over longer distance • Board level, between boards, and racks • High frequency, narrow links • 1D Ring, 2D Mesh and Torus to reduce latency • Higher complexity and latency in each node Do you need point to point busses on a chip? 6 Bus for Multi-Core Chip? Issues: Slow, < 300MHz Shared, limited scalability? Solutions: Repeaters to increase freq Wide busses for bandwidth Multiple busses for scalability Benefits: Power? Simpler cache coherency Move away from frequency, embrace parallelism 7 Repeated Bus O R R R R R R R R Assume: 10mm die, 1.5u bus pitch 50ps repeater delay Arbitration: Each cycle for the next cycle Decision visible to all nodes Repeaters: Align repeater direction No driving contention Core (mm) Bus Seg Max Bus Delay (ps) Freq (GHz) 65nm 5 195 2.2 45nm 3.5 99 2 32nm 2.5 51 1.8 22nm 1.8 26 1.5 16nm 1.3 13 1.2 8 Example of a Bus Repeater 9 Other Bus Enhancements Differential, low voltage swing Twisted to reduce cross-talk Optimal repeater placement • Not necessarily at the core • Higher bus frequency Wide bus, 1024 bit or more, transfer lots of data in one cycle Multiple busses for concurrency Employ interconnect engineering techniques 10 Bus Power and Bandwidth 100 80 1X Full Swing 1.4X 60 40 20 0 Bisection BW (GB/s) 65nm 45nm 32nm 22nm 16nm 4000 1X 3000 Mesh 1.4X 2000 1000 0 100mV Swing Bus Power (W) 120 7 1X 6 5 1.4X 4 0.1V Differential 3 2 1 0 65nm 45nm 32nm 22nm 16nm 600 Bus BW (GB/s) Full Swing Bus Power (W) Includes bus and repeater power 500 1X 400 1.4X Bus 300 200 100 0 65nm 45nm 32nm 22nm 16nm 65nm 45nm 32nm 22nm 16nm 11 Factors Affecting Latency Mesh Bus Arbitration in each node, multiple arbitration cycles Single arbitration for entire bus transaction Multiple hops from source to destination One cycle operation 3-5 Clock latency in each node None Fast clock (3 GHz) Slow clock (1 GHz) One source and destination Broadcast 12 Summary Point to point busses are not necessary for multi-core chip Rings and meshes were devised for point to point busses over long distances—overkill for on chip network? Router power could be prohibitive Wide bus or busses, may be adequate • Simple to implement • Simpler coherency • Lower power • Maybe lower latency Go slower, wider, and simpler 13

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download 8 Cores