ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration

Andrew B. Kahng¶, Bin Li‡, Li-Shiuan Peh‡, Kambiz Samadi¶
¶ University of California, San Diego
‡ Princeton University
April 21, 2009

Outline
• Motivation
• ORION 2.0 Framework
• Dynamic Power Modeling
• Leakage Power Modeling
• Area Modeling
• Validation and Significance Assessment
• Conclusions

Motivation
• Many-core chips: NoCs are needed to interconnect the cores
• Power efficiency of NoCs is important
  • Performance used to be the primary concern; now power efficiency is critical
  • 28% of total power in the Intel 80-core Teraflops chip is due to the interconnection network (routers + links)
• Rapid power estimation is needed to trade off alternative architectures
  • Rapid power-area tradeoffs at the architectural level
• Our goal: develop accurate models that system-level designers can easily use early in the design cycle

Related Work
• Real-chip power measurements (Isci et al. 03)
• RTL-level NoC power estimation (A. Banerjee et al. 07, and N. Banerjee et al. 04)
  • Simulation is slow
  • Requires detailed RTL modeling
  • Not suitable for early-stage NoC design space exploration
• Architectural-level power estimation
  • Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters, so it is not suitable for exploring tradeoffs in router microarchitecture
  • Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
  • NoC power modeling (ORION 1.0: Wang et al. 02)
• ORION 1.0 has been widely used in early-stage design space exploration for NoC power-performance tradeoff analysis

ORION 1.0 Modeling Methodology
• Power models are derived for the major building blocks (FIFO, crossbar, and arbiter)
• For each component, a canonical structure is described in terms of architectural and technological parameters
• Detailed analysis is performed to determine parameterized capacitance equations
• Capacitance equations and switching activity estimates are combined to determine power consumption
• Power models are based on detailed estimates of gate and wire capacitance and switching activity

Limitations of ORION 1.0

Model parameters (B through Vdd are listed for both ORION 1.0 and ORION 2.0; Npipeline, App, and D are listed for ORION 2.0 only):

Parameter   Description          Value shown
B           #buffers             16
F           flit-width           39
P           #ports               5
V           #virtual channels    2
X           #crossbar ports      5
tech        technology node      65nm
fclk        clock frequency      5.1 GHz
Vdd         supply voltage       1.2V
Npipeline   #pipeline stages
App         application domain
D           chip dimension

Power (mW)        Buffer   Crossbar   Arbiter   Link    Clock   Total
ORION 1.0 (V1)    25.2     53.2       11.1                      89.5
Intel 80-core     203.3    138.6      64.7      212.5   304.9   924

• Up to 8.1X difference per component
• 10.3X difference in total power
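The two "X diff." figures on the Limitations slide can be checked directly from the power table. The sketch below is only an illustration: the dictionary and function names are made up for this example, and it assumes the three ORION 1.0 values belong to buffer, crossbar, and arbiter (they add up to the 89.5 mW total, and ORION 1.0 shows no link or clock entry). Under that assumption the buffer ratio rounds to 8.1X and the total ratio to 10.3X.

```python
# Power figures in mW as read from the "Limitations of ORION 1.0" slide.
# Assumption: the three ORION 1.0 values map to buffer, crossbar, and arbiter
# (they sum to the 89.5 mW total); link and clock have no ORION 1.0 entry.
orion1_mw = {"buffer": 25.2, "crossbar": 53.2, "arbiter": 11.1}
intel_mw = {
    "buffer": 203.3,
    "crossbar": 138.6,
    "arbiter": 64.7,
    "link": 212.5,
    "clock": 304.9,
}


def underestimate_ratios(model, measured):
    """Return measured/model power for every component the model covers."""
    return {name: measured[name] / model[name] for name in model}


ratios = underestimate_ratios(orion1_mw, intel_mw)
total_ratio = sum(intel_mw.values()) / sum(orion1_mw.values())

for name, ratio in ratios.items():
    print(f"{name:9s} {ratio:4.1f}X")
print(f"{'total':9s} {total_ratio:4.1f}X")
```

Most of the total gap comes from link and clock power, which together account for more than half of the measured 924 mW and have no ORION 1.0 counterpart in the table.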
ORION 2.0: Accurate NoC Router Models
Inputs to ORION 2.0:
• Architectural parameters
  • # of ports; # of buffers
  • # of xbar ports; # of VCs
  • voltage, frequency
• Circuit implementation & buffering scheme
  • SRAM and register FIFO
  • MUX-tree and matrix crossbar
  • different arbitration schemes
  • hybrid buffering scheme
• Technology parameters
  • interconnect parameters
  • device parameters
  • scaling factors for future technologies
  • …
[Figure: router block diagram with an arbiter (request signals reqI, reqE, reqW, reqN, reqS; grant signals grantI, grantE, grantW, grantN, grantS), input buffers Buf I/E/W/N/S, a crossbar (inI..inS to outI..outS), and links]
• Built on top of ORION 1.0
• Uses our automatic/semi-automatic flows to obtain technology inputs
• Provides significant accuracy improvement compared with ORION 1.0

ORION 2.0 Improvements
ORION 1.0 power subcomponents:
• Buffer (SRAM-based)
• Arbiter (dynamic power)
• Crossbar
• Links (dynamic power)
• Area (router)
ORION 2.0 power subcomponents:
• Buffer
  • SRAM-based
  • Flip-flop-based
• Arbiter
  • VC allocator model
  • Leakage power
• Crossbar
• Links
  • Hybrid buffering
  • Leakage power
• Clock
• Area
  • More accurate router area model
  • Link area model
Model infrastructure:
• Application-specific technology-level adjustment
• Updated capacitance and transistor sizes

Model Technology Inputs
• Inputs for power calculation
  • Leakage current values (obtained from Liberty (.lib) / SPICE)
  • Input capacitance for different repeater sizes (Liberty, Predictive Technology Models (PTM))
• Inputs for area calculation
  • Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
  • Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
• We also provide data for (1) high-performance (HP) and (2) low-power (LOP) device types for 90nm and 65nm
• Scaling factors for 45nm and 32nm technologies were obtained from ITRS 2007 / MASTAR5.0

Dynamic Power Modeling
• Dynamic power: switching capacitance
• Clock power: $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$
  • $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
• Physical links: power is due to charging and discharging of the capacitive load
  • $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
• Register-based FIFO: implemented as shift registers
• Virtual channel allocator: added two models
• Other components: we use ORION 1.0 models with updated transistor and technology parameters

Clock Power (1)
• Clock power depends heavily on the distribution topology; we assume an H-tree topology
• $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
• Memory structures: precharge circuitry
  • The capacitive load on the clock network is due to the precharge transistor $T_c$
  • $C_{chg} = C_g(T_c) + C_d(T_c)$
  • $C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$, where $P_r$, $P_w$, $B$, $F$ are #read ports, #write ports, #buffers, and flit-width, respectively
• Pipeline registers: due to the different stages in a router
  • A D-flip-flop (DFF) is assumed as the building block for pipeline registers
  • $C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is the DFF capacitance
• Register-based FIFO: due to the DFF capacitance used in the registers
  • $C_{\text{register-fifo}} = F \times B \times C_{ff}$

Clock Power (2)
• Wiring load: due to (1) wiring and (2) clock tree buffers
• Example: 5-level H-tree clock distribution
  • $C_{wire}$ is the total H-tree wire length (the sum over the five levels of segment lengths, each proportional to $D$) times $C_w$, where $D$ and $C_w$ are the chip dimension and the per-unit-length wire capacitance, respectively
• The capacitive contribution of the clock buffers requires an estimate of the number of buffer stages, $k$:
  • $k = \sqrt{\dfrac{0.4 \times R_{int} \times C_{int}}{0.7 \times R_d \times C_{gate}}}$
  • $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are the clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
  • $R_{int} = \rho \times \dfrac{24 D}{w}$
  • $C_{int} = 24 \times D \times w \times C_{area} + 2 \times 24 \times D \times C_{fringe}$
  • $\rho$, $C_{area}$, and $C_{fringe}$ are the resistivity, unit area capacitance, and unit fringe capacitance, respectively; $w$ is the wire width
• $C_{\text{clock-wiring}} = k \times C_{gate} + C_{wire}$
• Clock leakage power is due to the clock buffers

Repeater and Wire Power Models
• Repeaters (buffers) are used in links and in the clock tree network
• Leakage power has two main components: (1) sub-threshold leakage and (2) gate-tunneling current
  • Leakage power is computed at the temperature that matches the design conditions: (1) 25°C, (2) 80°C, or (3) 110°C
  • Both components depend linearly on device size
  • $p_s = (p_{sn} + p_{sp}) / 2$
  • $p_{sn} = k_{0n} + k_{1n} \times w_n$
  • $p_{sp} = k_{0p} + k_{1p} \times w_p$
• Dynamic power is calculated as:
  • $p_d = \alpha \times c_l \times v_{dd}^2 \times f$
  • $c_l = c_i + c_g + c_c$
  • $p_d$, $\alpha$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
  • The load capacitance is composed of the input capacitance of the next repeater ($c_i$) and the ground ($c_g$) and coupling ($c_c$) capacitances of the driven wire

Interconnect Optimization: Buffering
• Conventional delay-optimal buffering
  • unrealistic buffer sizes
  • high dynamic / leakage power
  • suboptimal
[Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
• Our approach: iterative optimization of a hybrid objective (power + delay)
  • Search for the optimal number and size of repeaters
  • Can be extended to other interconnect optimizations (e.g., wire sizing and driver sizing)

Virtual Channel Allocator Model
• Provides three virtual channel (VC) allocation models
• Traditional two-stage VC allocator model
  • Most widely used
  • Power consumption increases rapidly as the number of VCs increases
[Figure: two-stage VC allocator. With 5 ports and 2 VCs per port, stage 1 has 40 arbiters in total (2:1 arbiters) and stage 2 has 10 arbiters in total (8:1 arbiters). With 5 ports and 4 VCs per port, stage 1 has 80 arbiters in total (4:1 arbiters) and stage 2 has 20 arbiters in total (16:1 arbiters).]
• Added one-stage VC allocator model
  • Lower power consumption
  • Lower matching probability
• Added VC selection model
  • Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
  • Low power and high performance

Leakage Power Modeling
• Leakage power: subthreshold and gate
  • From 65nm onward, gate leakage becomes significant
• $I_{leak}(Block) = \sum_i \sum_s Prob(i,s) \times \left( W_{sub}(i,s) \times I'_{sub}(i,s) + W_{gate}(i,s) \times I'_{gate}(i,s) \right)$
  • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are the subthreshold and gate leakage currents per unit transistor width for a specific technology
  • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
• Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
• Leakage currents are computed at different transistor junction temperatures: (1) 110°C, (2) 80°C, and (3) 25°C
• Same methodology as in ORION 1.0
• Leakage current values are all obtained through SPICE simulation using foundry SPICE models

Arbiter Leakage Power Model
• Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
• Example: matrix arbiter with $R$ requesters
  • One $R \times R$ matrix keeps the priorities
  • $gnt_n = req_n \times \prod_{i<n} \left( \overline{req_i} + \overline{m_{in}} \right) \times \prod_{i>n} \left( \overline{req_i} + m_{ni} \right)$
  • The grant logic can be implemented as a tree of NOR and INV gates, and the $R \times R$ matrix can be constructed using DFFs
  • $I_{leak}(Arbiter_{matrix}) = I_{leak}(NOR2) \times ((2R - 1) R) + I_{leak}(INV) \times R + I_{leak}(DFF) \times \dfrac{R(R-1)}{2}$
  • $P_{leak}(Arbiter_{matrix}) = I_{leak}(Arbiter_{matrix}) \times V_{dd}$
  • NOR2, INV, and DFF denote a 2-input NOR gate, an inverter, and a D-flip-flop, respectively
• Further details on the modeling methodology are in Chen et al. 2003

Router Area Model
• As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
• Gate area model by Yoshida et al. (DAC'04)
• Link area model by Carloni et al. (ASPDAC'08)
• Matrix arbiter: $Area_{arbiter} = (Area_{NOR2x1} \times 2(R-1)R) + (Area_{DFF} \times R(R-1)/2) + (Area_{INVx1} \times R)$

Repeater and Wire Area Models
• For existing technologies, the area of a repeater can be calculated as:
  • $a_r = \tau_0 + \tau_1 \times (w_n + w_p)$
  • $a_r$ denotes the repeater area; $\tau_0$ and $\tau_1$ are coefficients obtained by linear regression; $w_n$ and $w_p$ are the widths of the NMOS and PMOS devices, respectively
• For future technologies, feature size ($F$), contacted pitch ($CP$), row height ($RH$), and cell width ($CW$) can be used to estimate the area:
  • $N_F = (w_p + w_n + 2 \times F) / RH$
  • $CW = N_F \times (F + CP) + CP$
  • $a_r = RH \times CW$
• Wiring area can be calculated as:
  • $a_w = (n \times (w_w + s_w) + s_w) \times L$
  • $a_w$ denotes the wire area, $n$ is the bit width of the bus, and $w_w$, $s_w$, $L$ are the wire width, wire spacing, and wire length

ORION 2.0: Validations and Results
• Validation: two Intel NoC chips
  • (1) Intel 80-core Teraflops: high-performance many-core design
  • (2) Intel SCC: ultra low-power communication core
• ORION 2.0 offers significant accuracy improvement

                      Intel 80-core       Intel SCC
                      v1.0     v2.0       v1.0      v2.0
%diff (total power)   -85.3    -6.5       +202.4    +11.0
%diff (total area)    -80.9    -23.6      +31.9     +25.3

Component   %diff (ORION 2.0 vs. Intel 80-core)
Buffer      -14.8
Crossbar    16.9
Arbiter     -9.0
Clock       -20.9
Link        8.8

[Figure: pie charts of the router power breakdown (FIFO, crossbar, arbiter, link, clock) for ORION 1.0, ORION 2.0, and Intel 80-core; the ORION 1.0 chart shows FIFO 28%, crossbar 60%, arbiter 12%, link 0%, clock 0%]

Impact on System-Level Design
• Testcases
  • VPROC: video processor with 42 cores and 128-bit datawidth
  • dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth

SoC     P (mW)          A (mm^2)        # routers     max. # router ports   max. # hops
        v1.0    v2.0    v1.0    v2.0    v1.0   v2.0   v1.0   v2.0           v1.0   v2.0
VPROC   0.875   0.924   2.043   2.329   33     25     8      12             6      5
dVOPD   0.412   0.486   1.217   1.343   18     16     6      6              11     10

[Figure: synthesized NoC topologies built from R1 and R2 routers]
• System-level impact: communication-driven synthesis in COSI-OCC
  • Accurate ORION 2.0 models lead to a better-performing NoC
  • The relative power cost of an additional port is not as high in ORION 2.0 as in ORION 1.0

Conclusions
• Accurate models can drive effective NoC design space exploration
• ORION 1.0 is inaccurate for current and future technology nodes
• Proposed accurate power and area models for network routers (ORION 2.0)
• Presented a reproducible methodology for extracting inputs to our models
• Maintained the ORION 1.0 interface while significantly improving model accuracy: switching to ORION 2.0 is easy!

ORION 2.0 Release
• ORION 2.0 website: http://www.princeton.edu/~peh/orion.html

System-Level NoC Power Modeling Example
• Polaris toolchain: a design-space exploration tool
  • Step 1: Trident, synthetic traffic generation
  • Step 2: LUNA, high-level on-chip network analysis
  • Step 3: ORION, power and area models
• Works from microarchitecture parameters and produces projections of power consumption, CMOS area, and performance (latency) for NoC designs
• V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI'07
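Several of the closed-form expressions from the Clock Power (1), Repeater and Wire Power, Arbiter Leakage, and Virtual Channel Allocator slides are simple enough to evaluate by hand. The sketch below is a minimal illustration, not the ORION 2.0 implementation: all function and argument names are hypothetical, the capacitance and current values in the usage lines are placeholders rather than technology data, and the general arbiter-count rule in `two_stage_vc_allocator` is an assumption fitted to the two configurations in the VC allocator figure (5 ports with 2 and with 4 VCs per port).

```python
def clock_load_capacitance(pr, pw, buffers, flit_width, n_pipeline,
                           c_chg, c_ff, c_clock_wiring):
    """C_clk as the sum of the four terms on the Clock Power (1) slide."""
    c_sram_fifo = (pr + pw) * flit_width * buffers * c_chg
    c_pipeline_registers = n_pipeline * flit_width * c_ff
    c_register_fifo = flit_width * buffers * c_ff
    return (c_sram_fifo + c_pipeline_registers
            + c_register_fifo + c_clock_wiring)


def dynamic_power(activity, c_load, vdd, freq):
    """p_d = a * c_l * vdd^2 * f, as on the repeater and wire power slide."""
    return activity * c_load * vdd ** 2 * freq


def matrix_arbiter_leakage_power(r, i_nor2, i_inv, i_dff, vdd):
    """P_leak of an R-requester matrix arbiter, using the slide's gate counts."""
    i_leak = (i_nor2 * (2 * r - 1) * r   # NOR2 gates in the grant logic
              + i_inv * r                # one inverter per requester
              + i_dff * r * (r - 1) / 2)  # DFFs holding the priority matrix
    return i_leak * vdd


def two_stage_vc_allocator(ports, vcs):
    """Arbiter counts and sizes per stage.

    Assumed generalization of the figure: each of the ports*vcs input VCs has
    one vcs:1 arbiter per other port, and each of the ports*vcs output VCs
    has one arbiter over all VCs of the other ports.
    """
    stage1 = {"count": ports * vcs * (ports - 1), "inputs": vcs}
    stage2 = {"count": ports * vcs, "inputs": (ports - 1) * vcs}
    return stage1, stage2


# Usage with the router parameters from the Limitations slide (B=16, F=39,
# Vdd=1.2V, fclk=5.1GHz). The 1 fF capacitances, the 4 pipeline stages, the
# single read/write port, and the activity factor are placeholders.
c_clk = clock_load_capacitance(pr=1, pw=1, buffers=16, flit_width=39,
                               n_pipeline=4, c_chg=1e-15, c_ff=1e-15,
                               c_clock_wiring=0.0)
print("clock power (W):", dynamic_power(1.0, c_clk, 1.2, 5.1e9))
print("2 VCs per port:", two_stage_vc_allocator(5, 2))
print("4 VCs per port:", two_stage_vc_allocator(5, 4))
```

The allocator counts show why the slide says two-stage power grows quickly with the number of VCs: doubling the VCs per port doubles the number of arbiters in both stages and also doubles the size of each arbiter.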