MINIMISING DYNAMIC POWER CONSUMPTION IN ON-CHIP NETWORKS
Robert Mullins
Computer Architecture Group, Computer Laboratory, University of Cambridge, UK

Communication-Centric Architectures
• Future performance gains will primarily come from increasing the number of IP cores in a system, not their complexity or operating frequency
• Many reasons:
  – Diminishing returns from simply scaling what we have
  – Energy efficiency
  – Complexity
  – Fault tolerance
  – Economics

On-Chip Networks
• An efficient general-purpose chip-wide communication infrastructure is becoming essential
• One flexible networking option is to use packet-switched networks with support for virtual channels

The Lochside Router
• Router Architecture
  – Highly parameterised implementation
  – Packet-switched network with virtual-channel flow control
  – Best case latency is one cycle per network hop
• Results presented here are from post P&R simulations targeting a 90nm technology
[Figure: Lochside Chip (2004/05), 180nm Technology. Tile labels: Router (R); Traffic Generator, Debug & Test]

Exploiting Speculation to Reduce Communication Latency
[Figure: Peh/Dally (2001)]

Aims of this work
• Apply existing power saving techniques to an on-chip network design
  – e.g. clock and signal gating, gate-level optimisations etc.
  – Importance of applying such techniques before making comparisons
• Measure power consumption and provide an accurate breakdown of where the remaining power is dissipated
• Where is best place to look for future power savings?

Measuring and Optimizing Dynamic Power
• Our Test Case
  – 8mm x 8mm die
  – 4x4 mesh network
  – Low-latency routers, best case latency is one cycle per hop (incl. interconnect)
  – 1.2V, 90nm technology
  – 4 input-buffers/VC
  – 4 VC/input port
  – 48 x 80-bit network links
  – 800MHz @ WC PVT
    • ~32 FO4 clock period
  – Results reported at 250MHz

Interconnect Delay/Energy Trade-offs
• Power dissipated in network links depends on how links are spaced and buffered
• At least a factor of 3 difference in energy consumption over range of potential interconnect options
• Could move to low-swing differential schemes for even greater energy savings
For results we assume min. spaced wires, opt. energy x delay product

Clock Gating
• Clock gating optimisations applied at two levels:
  – Local Clock Gating
    • Automated clock gating within router
    • Some tuning of RTL involved to maximise opportunities for synthesis tool
  – Router Level Clock Gating
    • Exploit opportunities to gate clock as it enters the router
    • Isolates router’s clock completely, only static power consumption remains

Router-Level Clock Gating
• Clock gating exposes clock tree insertion delay
• Need to know early if router will be required
• Generate ‘early valid’ signals in neighbouring routers
  – Early-valid signals are slightly pessimistic
  – Based on what is requested not granted

Gate-Level Optimizations and Signal Gating
• Automated signal gating and gate-level power optimisations had minimal impact
• Inserting signal gating logic manually did reduce input FIFO power requirements significantly
• The reported results could be further improved (by 12%) by enabling logic optimisation across module boundaries
  – This was restricted to accurately determine where power is dissipated

Analysis of Power Consumption
[Figure: Power consumption of a single router and its links]
• Simple power optimisations can quarter power requirements + many more opportunities to save power
• Network is ~5% of core area
• Perhaps 10% of system power at present
• Don’t make comparisons without optimizing power!
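
The early-valid scheme from the Router-Level Clock Gating slide can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the Lochside implementation: the `Router` class, its fields, and the rule that a router with buffered flits stays clocked are hypothetical. The only properties taken from the slides are that early-valid is generated in the neighbouring router and follows what is requested, not what is granted.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router model. All names here are illustrative assumptions."""
    name: str
    # Output ports that a buffered flit is requesting this cycle.
    requests: set = field(default_factory=set)
    # Output ports the allocator actually granted this cycle.
    grants: set = field(default_factory=set)
    # Flits already waiting in this router's input FIFOs (assumed field).
    buffered_flits: int = 0

    def early_valid(self, port: str) -> bool:
        # Pessimistic by construction: asserted whenever the port is
        # requested, even if the request goes on to lose arbitration.
        return port in self.requests


def clock_enable(router: Router, upstream: list) -> bool:
    """Decide whether the router-level clock gate should open.

    `upstream` is a list of (neighbour, port) pairs, where `port` is the
    neighbour's output port that faces `router`.
    """
    # Assumption: a router holding flits must keep its clock running.
    if router.buffered_flits > 0:
        return True
    # Otherwise wake up only if some neighbour signals early-valid.
    return any(neighbour.early_valid(port) for neighbour, port in upstream)


# Usage: the west neighbour requests its east port but is not granted.
west = Router("west", requests={"east"}, grants=set())
idle = Router("idle")
me = Router("me")

print(clock_enable(me, [(west, "east"), (idle, "west")]))  # clock enabled
print(clock_enable(me, [(idle, "west")]))                  # clock gated
```

In the first call the router is clocked even though no flit will arrive, which is the cost of the pessimism; the benefit is that the enable is available early enough to hide the clock tree insertion delay.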
Analysis of Power Consumption
• 22% Static power, 11% Inter-Router Links
• ~1% Global Clock tree
• 65% Dynamic Power
  – Power Breakdown
    • ~50% of dynamic power is consumed in local clock tree and input FIFOs
    • ~30% on router datapath
    • ~20% on scheduling and arbitration
      – Scheduling is probably more complex than typical implementations due to speculation

Low-Power On-Chip Networks
• Interconnect and static power set to increase
  – Many low-power link technologies
    • Low-swing differential techniques
  – Power gating and other leakage reduction techniques
• Potential power savings begin to require lots of different techniques
  – no one silver bullet?

Low-Power On-Chip Networks
• Topology
  – Don’t want to sacrifice general or at least multipurpose nature of our networked SoC
  – Results suggest higher radix routers and longer interconnects could reduce power
    • Probably not a long term solution
    • Reduces path diversity, bad for fault-tolerance
• Architecture
  – Scope for minimising memory required to store precomputed router schedule (particular to our router)
  – Simpler routers
  – Single cycle routers reduce power? Speculation for low-power?
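
Returning to the breakdown on the Analysis of Power Consumption slide, the per-component percentages are quoted relative to dynamic power, so it can help to restate them as shares of the total. The sketch below does that arithmetic. It assumes that the ~50/30/20 split applies to the 65% dynamic figure and treats the approximate slide values as exact; the dictionary keys are labels chosen here, not names from the design.

```python
# Shares of total power for a single router and its links (from the slide).
# The quoted figures sum to 99% because the slide values are rounded.
total_shares = {
    "static": 0.22,
    "inter_router_links": 0.11,
    "global_clock_tree": 0.01,
    "dynamic": 0.65,
}

# Split of the dynamic portion (approximate figures from the slide).
dynamic_split = {
    "local_clock_tree_and_input_fifos": 0.50,
    "router_datapath": 0.30,
    "scheduling_and_arbitration": 0.20,
}


def share_of_total(fraction_of_dynamic: float) -> float:
    """Convert a fraction of dynamic power into a fraction of total power."""
    return total_shares["dynamic"] * fraction_of_dynamic


breakdown = {
    name: round(share_of_total(fraction), 3)
    for name, fraction in dynamic_split.items()
}

for name, share in breakdown.items():
    print(f"{name}: {share:.1%} of total power")
```

Under these assumptions the local clock tree and input FIFOs account for roughly a third of total power, which is more than static power (22%) and about three times the inter-router links (11%).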
Supporting Best-Effort (BE) and Guaranteed Services (GS) Efficiently
• Current timing of the datapath and link suggests additional GS data could be routed in the same clock cycle
  – Allocate datapath/link to GS traffic for first ½ of clock cycle
    • Double capacity of network
  – Exploit simpler GS circuit-switched routing when possible
  – Reduce power
• Very little additional overhead

Clocking On-Chip Networks
• Network system timing issues are interesting
  – naturally event-driven not synchronous
• Work is investigating placing local data-driven clock generators in each network router
  – Clock is stretched when no data to be routed
  – Clock matches rate of incoming data streams
  – Robust synchronisation solution (true GALS)
  – Also investigating incorporating power gating support
• See also Distributed Clock Generator – DCG (Fairbanks/Moore)

Challenges and Future Work
• These are early results in a much more rigorous study on the power requirements of networked on-chip communication
  – Much more soon!
• Exploiting a general-purpose on-chip network
  – Exploiting execution diversity to improve energy-efficiency
  – Multi-use platforms and Virtual-IP
  – Fault tolerance
  – Networks of processing elements or networks that process?
• Scope for removing unnecessary interfaces and boundaries
• Impact of networking on IP and processor core design

Thank You