Download Pipeline Optimization for Asynchronous Circuit - FORTH-ICS

A Channel-Based Asynchronous Low-Power High-Performance Standard-Cell-Based Sequential Decoder Implemented with QDI Templates
Recep Ö. Özdağ & Peter A. Beerel
University of Southern California

Motivation and Approach
Background
• Fine-grain asynchronous pipelines have demonstrated high performance in largely full-custom back-end flows
  • Caltech's MIPS R3000 Microprocessor [Martin97]
  • Fulcrum's PivotPoint High Performance Switch [HotChips03]
Problem
• However, full-custom flows are tedious, error-prone, and time-consuming, and often require significant in-house tool automation
Our Solution
• Create asynchronous cell library
• Integrate cell library into commercial P&R flow using Verilog modelling
• Evaluate on a real design
  • Target a digital communication chip implementing the Fano algorithm
Our Goal: Close to Full-Custom Performance with ASIC Design Times
USC Asynchronous Group

Channel Based Asynchronous Design
Dual-Rail Channel
[Figure: sender and receiver connected by Data and Ack wires]
• Two wires per data bit
• One acknowledgment wire
• Generalizes to 1-of-N coding
• Advantage:
  • Delay insensitive
System communication
[Figure: synchronous system (clock) versus asynchronous system (asynchronous channel)]
Synchronization and communication between blocks are implemented with handshaking over asynchronous channels, by sending and receiving "data tokens"

Channel-Based Design Characteristics
• Architecture is typically a multi-level hierarchy of communicating blocks
[Figure: ASIC hierarchy. Blocks: Main FSM, Reg A, Reg B, Memory, Adder, Register Bank, Multiplier, Subtract/Divider, Adder/Mult. Labels: leaf cells, channels, B_N-1, B_N-2, B_N-3]
[Figure labels, continued: Reg C, FA_N-1, FA_N-2, FA_N-3, FA_0]
Netlist consists of leaf cells communicating along channels

Asynchronous Leaf Cells
Definition
• Smallest block that communicates via asynchronous channels
Functionality
• Reads a subset of input channels
• Computes F and writes to a subset of output channels
Linear Pipelines
• Only one input and one output channel
Non-Linear Pipelines
• Joins and Forks
• Conditional Joins: Read only some of the input channels
• Conditional Splits: Write only to some of the output channels
[Figure: leaf cell L with input and output channels, shown in a linear pipeline, as a conditional join, and as a conditional split]

Template-Based Leaf-Cell Design
• Each pipeline style (QDI, timed…) has a different blueprint
• Create a library using a blueprint to implement the lowest level communicating blocks
[Figure: blueprint for a QDI N-input M-output pipeline stage built from F, LCD, RCD, and C blocks, with a 2-input 1-output pipeline stage and a 1-input 2-output pipeline stage as instances]
Generation of instances from templates is straightforward

Background: Caltech's QDI Templates
Precharged Half Buffer (PCHB) [Lines96]
• 1-of-N Rail Channels
  • Delay-insensitive communication
• Quasi-delay-insensitive design
  • Negligible timing assumptions
• Dynamic Logic Function Block
• Left and right completion detection
[Figure: Completion Detector (one OR gate per bit, bit0 through bitn, combined by C into Done) and Function Block (precharge control, nmos network, evaluation control) between L and R]

PCHB Performance Analysis
[Figure: three-stage PCHB pipeline; each stage has LCD, RCD, C, and a function block (F1, F2, F3)]
Cycle time = $3\,t_{Eval} + 2\,t_{CD} + 2\,t_{c} + t_{prech}$
2-D Pipelining: The key to high throughput [MiniMips97]
• Small forward latency per stage (as little as 2 gate delays)
• Smaller completion detection units, which reduce control overhead
• Only local communication between blocks
[Figure: 2-D array of leaf cells L11, L21, L31, L12, L22, L32]

Outline
• Background
  • Illustration of the Fano Algorithm
  • The
    base-line synchronous Fano design
• The Asynchronous Fano Design
• The Back-End Asynchronous Design Flow
• Summary of Contributions

Background on Fano Algorithm
• Fano algorithm is a depth-first tree-search algorithm [Fano64]
• Achieves good performance with a low average complexity
[Figure: walk-through of the tree search from the root. Branches carry branch bits (00, 01, 10, 11) and branch metrics (+3, -5, -10), with annotations "0 errors", "1 error", "Estimate that transmitted a 1", and "Estimate that transmitted a 0". Rows below the tree: Received Branch Bits, Decoded Bit, Index. Successive frames show Total Path Metric: 0, -2, +1]

The Synchronous Architecture [Asilomar99]
[Figure: synchronous datapath]
Critical path consists of two ALUs and two MUXes

Outline
• Introduction and Background
• The Asynchronous Fano Design
• The Back-End Asynchronous Design Flow
• Summary of Contributions

The Asynchronous Fano
At typical SNR most of the branches will be error free
• Key idea: optimize architecture for forward moves
Circuit can be partitioned into two units
• Skip Ahead Unit: operates at high speed for error-free sequences
• Error Logic: operates when errors are encountered
Circuit Operation Switches Back and Forth
• Between Skip Ahead and Error Logic until it reaches end of tree
Asynchronous Design Advantage
• Allows seamless switching between blocks

The Asynchronous Architecture
The Skip-Ahead Unit
[Figure: block diagram with XOR_SPLIT, ERROR-DETECT, FILTER, MERGE, FAST SHIFT REGISTER, FAST DECISION REGISTER, and XOR blocks. Signals: To BMU, From BMU, noError, Comparison Result, Decision_bit, SkipAhead Decision, BMU Decision. Annotation: Received Data compared with estimated branch bits]
The critical path of the Skip Ahead Unit runs at 450MHz (post layout)

The Memory Design
• Supports a packet length of only 128 words. Each word is a pair of branch bits.
• Used standard place and route tools for the physical design of the memories
  • Faster design time at the expense of more area and power consumption
• Unacknowledged tri-state buffers on the data bus
  • Efficiently allows multiple drivers of the bus.
  • Introduces minor timing assumptions
  • This is typical in synchronous design, but not typical of PCHB-based designs.
[Figure: memory organization, 8 sets of branch bits]

Fano: Error-Free Operation
[Figure: simulation waveform, 17971ns to 18449ns]
Total of 8x16 = 128 bits decoded

Fano: Error Operation
[Figure: simulation waveform, 17537ns to 25361ns. Annotations: Error Encountered, Move back]

The Layout
Asynchronous Fano Properties
• TSMC 0.25 µm
• Skip Ahead Unit runs at 450MHz
• 2600 µm x 2600 µm = 6.76 mm²
• Power dissipation: 32mW (@450MHz, 2.5V)
• 360,000 transistors
• 10 man months to design + 6 man months library and flow development
Compared to the Synchronous Fano
• 2.15x speed
• 1/3 the power
• 10 man months to design
• 5x the area
[Figure: chip layout. Blocks: Received Memory, Decision Memory, Threshold Adjust Unit, Branch Metric Calculator, Skip Ahead Unit, Counter, Lookup Table]

Outline
• Introduction and Background
• The Asynchronous Fano Design
• The Back-End Asynchronous Design Flow
• Summary of Contributions

Physical Design Flow
[Figure: flow diagram. Stages: Specification; Schematic (Virtuoso, Synopsys); Simulation and Analysis (Hspice/Nanosim/Verilog); Place & Route (Silicon Ensemble); Chip Assembly (Virtuoso); LVS & DRC (Virtuoso, Dracula); Chip Fabrication. Files passed between stages: Netlist (.v), Abstract (.lpe), Netlist (.sp), Layout (.gds), Netlist (.cir). The Asynchronous Leaf Cell/Gate Library provides the cell views Symbol, Schematic, Functional, Layout, and Abstract]
Standard Flow Works

Cell Library Flow: Alternatives
• Used for the Fano Algorithm
• More suitable for designs with relaxed timing assumptions at the leaf cell level
[Figure: flow boxes: Leaf Level Design, Leaf Cell Library, Technology Mapping, Leaf Cell Design, Layout, Physical P&R, Gate Level Netlist]
• Used for the STFB based adder
• More suitable for designs with strict timing assumptions at the leaf cell level
[Figure, continued: flow boxes: Technology Mapping, Template Gate Library, Physical P&R]
Leaf cell level or gate level place and route

Cell Library Flow
[Figure: flow diagram. Stages: Cell Design (Virtuoso) with views Symbol, Schematic, Functional, Layout; Simulation and Analysis (Hspice/Nanosim/Verilog); DRC & LVS (Virtuoso, Dracula); Cell Abstract (Abstract generator). Files: Layout (.gds), Netlist (.sp), Abstract (.lpe). Result: Asynchronous Gate Library]
Developed asynchronous gate library

Transistor Sizing
Initial cell sizes
• 2X for pull down network
• 8X for inverter drivers
• Staticizer inverter is ~10x weaker than pull down network
Additional sub-types added as necessary
Create a number of subtypes for different strengths

Charge-Sharing Considerations
• Output inverters and staticizers are internal to all dynamic cells and form part of the known minimum load on the dynamic node (allowing a 10% dip in voltage)
• On each dynamic gate, the minimum load is guaranteed, via extensive simulation, to be sufficient to ensure that no charge sharing problems exist
Output inverters and staticizers are encapsulated with the dynamic logic into a single gate

Netlist extraction
Verilog netlist (.v) for placement and routing

    // LAST TIME SAVED: Jun 4 17:49:17 2003
    // NETLIST TIME: Jun 4 17:51:34 2003
    `timescale 1ns / 1ns
    module Counter2 ( Backward_e, BmuErr_e[5], Forward_e, From_FSM_T, Go_Fast,
        Go_Slow_FSM, Go_Start_Pointer_F, Go_Start_Pointer_T, Go_e, LB, LFB, LFBTE,
        LFB_LFBTE, LFNB, NewStat_e0, NewStat_e1, Slow_ShiftB_e, Start, ZeroCheck,
        Zero_e, infi_e1, infi_e2, nReset);
    output Backward_e, Forward_e, From_FSM_T, Go_Fast, Go_Slow_FSM,
        Go_Start_Pointer_F, Go_Start_Pointer_T, Go_e, LB, LFB, LFBTE, Send_T_Re,
        ShiftB_e, ZeroCheck_e, Zero_False, Zero_True, infi_e;
    input BmuErr_e5a, BmuErr_e5b, BmuErr_e5c, BmuErr_e5d, ConnectGnd, Dec, Go,
        Go_Fast_Re, Go_Slow_FSM_e, Go_Start_Pointer_e,
        Inc, LFB_e1, LFNB_e1, NoZeroCheck, Re_LB, Re_LFB, Re_LFBTE, Re_LFNB,
        Re_S19, Zero_e, infi_e1, infi_e2, nReset;
    output [5:5] BmuErr_e;
    // Buses in the design
    wire [0:7] Forw_e;
    PCHB_SingleRail_SlowDataPath I54 ( .Ae(net01493), .A1(net0507),
        .BUFe(Send_Delta_to_Encode_e), .BUF1(Send_Delta_to_Encode),
        .nReset(nReset));
    PCHB_BUFFER1_for_Counter_1 I204 ( .Ae(net0489), .A1(net0486),
        .A0(ConnectGnd), .BUFe(LFB_e1), .BUF1(LFB_LFBTE), .BUF0(nc[30]),
        .Start(Start), .nReset(nReset))
    …

Verilog netlist of library gates is auto-generated

Placement, Routing and Extraction

    *
    * CADENCE/LPE SPICE FILE : SPICE
    * DATE : 5-JUN-2003
    *
    ****** MOS XTOR PARAMETERS FROM : 7MOSXREF
    ******
    ****** RESISTORS PARAMETERS FROM : 7RESXREF
    ****** CORNER ADJUSTMENT FACTOR = 0.0000000
    ******
    ****** DIODE PARAMETERS FROM : 7DIOXREF
    ******
    ****** CAPACITORS PARAMETERS FROM : 7CAPXREF
    ******
    ****** CAPACITORS PARAMETERS FROM : 7CAPXMER
    ******
    *.GLOBAL VDD! GND!
    *
    *----- TOTAL # OF MOS TRANSISTORS FOUND : 2018
    *----COMMENTED : 0
    .SUBCKT INC2 DATA REQ ACK NRST4 L0 L1
    ...
    MM1-XI59-3 NET72 XI59-NET35 VDD! VDD! PCH L=0.24U W=2.50U AD=0.93P
    + PD=3.24U AS=1.65P PS=6.32U NRS=0.088 NRD=0.088
    MM2-XI60-XI36 XI36-A NET0432 VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
    + PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
    MM3-XI60-XI36 XI36-A NR<6> VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
    + PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
    MM7-XI60-XI36 XI36-XI60-NET029 NET0432 XI36-A GND! NCH L=0.24U W=1.20U
    + AD=0.24P PD=1.60U AS=0.44P PS=1.94U NRS=0.183 NRD=0.167
    MM7-XI60-XI36-1 685 NET0432 GND! GND! NCH L=0.24U W=1.20U AD=0.24P
    + PD=1.60U AS=0.80P PS=3.74U NRS=0.183 NRD=0.167
    ...
    C1 NET77 GND! 8.00421E-15
    C2 NET209 GND! 1.06917E-14
    C3 NET188 GND! 1.16892E-14
    C4 NET121 GND! 1.34065E-14
    C5 NET215 GND! 1.02445E-14
    ...
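The extracted SPICE listing above ends with lumped capacitances to ground, one per net. As a rough illustration of how such a listing might be post-processed, here is a minimal Python sketch that totals those capacitor entries. The helper name `total_capacitance` and the regular expression are hypothetical and not part of the flow described in the slides; the sketch assumes capacitor lines have exactly the form `C<n> <net> <net> <value>` shown in the excerpt.

```python
import re

# Hypothetical pattern for lumped-capacitor lines such as
# "C1 NET77 GND! 8.00421E-15" (name, two nets, value in farads).
CAP_LINE = re.compile(r"^(C\d+)\s+(\S+)\s+(\S+)\s+([0-9.]+E[+-]?\d+)\s*$",
                      re.IGNORECASE)

def total_capacitance(spice_text):
    """Sum the values of all capacitor lines in an extracted netlist (farads)."""
    total = 0.0
    for line in spice_text.splitlines():
        match = CAP_LINE.match(line.strip())
        if match:  # comments, MOS devices, and "..." lines are skipped
            total += float(match.group(4))
    return total

# The five capacitor entries visible in the slide excerpt.
listing = """\
C1 NET77 GND! 8.00421E-15
C2 NET209 GND! 1.06917E-14
C3 NET188 GND! 1.16892E-14
C4 NET121 GND! 1.34065E-14
C5 NET215 GND! 1.02445E-14
...
"""

total = total_capacitance(listing)
print(f"{total * 1e15:.3f} fF")
```

Only the five entries visible on the slide are summed here; the full netlist is elided in the source, so the result says nothing about the total for the INC2 block.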
Chip Assembly
• Stream-in blocks layout (from SE to Virtuoso)
• Block placement and routing
• DRC, LVS and netlist extraction (.sp)
• Post-layout simulation
Future Work:
• Static timing
• Automatic block placement and routing
• Synthesis

Summary
Design Flow: Standard ASIC flow for channel based asynchronous circuits
• Asynchronous high-performance designs with ASIC design times are possible
• Verilog modelling and structural simulation are feasible
• Commercial P&R tool (Silicon Ensemble) works quite well
• Design flow is applicable to many templates (QDI or STFB)
Architectural: Design and implementation of the Fano Algorithm
• A complex design implemented in both synchronous and asynchronous styles
• Over 2x performance with 1/3 the power at the expense of 3-5x area
First freely available asynchronous library
• Working on characterization and Lib file generation

Thank You

Skip-Ahead Unit with RSPCHB
A 14% throughput improvement in the Skip-Ahead Unit using RSPCHB instead of PCHB
[Figure: Skip-Ahead Unit block diagram with XOR_SPLIT, ERROR-DETECT, FILTER, MERGE, FAST SHIFT REGISTER, FAST DECISION REGISTER, and XOR blocks. Signals: To BMU, From BMU, noError, Comparison Result, Decision_bit, SkipAhead Decision, BMU Decision. Annotation: Received Data compared with estimated branch bits]

Overview of New Pipeline Templates

    Style     Timing Assumptions   2-D Throughput
    PCHB      DI/QDI               772 MHz
    RSPCHB    QDI                  920 MHz
    LP2/2+    Moderate             1.0 GHz
    HC        Aggressive           1.2 GHz

Foundation of design space exploration trading robustness for performance
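To make the trade-off in the template overview concrete, the following Python sketch converts the quoted throughputs into cycle times and into speedups relative to PCHB. It assumes the flattened table pairs PCHB with 772 MHz, RSPCHB with 920 MHz, LP2/2+ with 1.0 GHz, and HC with 1.2 GHz; the dictionary and function names are hypothetical.

```python
# Throughputs in MHz as read from the template overview table
# (the style-to-number pairing is an assumption about the flattened slide).
THROUGHPUT_MHZ = {
    "PCHB": 772.0,
    "RSPCHB": 920.0,
    "LP2/2+": 1000.0,
    "HC": 1200.0,
}

def cycle_time_ns(throughput_mhz):
    """Cycle time in nanoseconds for a throughput given in MHz."""
    return 1000.0 / throughput_mhz

def speedup_over(baseline, style, table=THROUGHPUT_MHZ):
    """Throughput of `style` divided by the throughput of `baseline`."""
    return table[style] / table[baseline]

for style, mhz in THROUGHPUT_MHZ.items():
    print(f"{style:7s} {cycle_time_ns(mhz):.3f} ns  "
          f"{speedup_over('PCHB', style):.2f}x vs PCHB")
```

These ratios apply only to the throughput figures quoted in the table. They need not match the 14% improvement reported for the Skip-Ahead Unit, which is a measurement on a specific circuit.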
