ELEC 516 VLSI System Design and Design Automation, Spring 2010
Lecture 6: Timing and Clocking Issues
Reading Assignment: Rabaey, Chapter 10
Note: some of the figures in this slide set are adapted from the slide set of "Digital Integrated Circuits" by Rabaey et al., Copyright 2002.

System Timing
• Clocking is very important to ensure that improper values are never stored.
• Flip-flop-based pipeline system:
[Figure: clock; Reg. A -> Combinational Logic (Td) -> Reg. B; timing labels Tc, Tq, Td, Ts]
• Primary inputs change after the clock edge.
• Primary inputs must stabilize before the next clock edge.
• These rules allow changes to propagate through the combinational logic in time for the next cycle.
• Flip-flop outputs hold current-state values for the next-state computation.

Timing Definition: Latch Parameters
[Figure: latch with inputs D, Clk and output Q; timing labels T, PWm, tsu, thold, tc-q, td-q]
• Delays can be different for rising and falling data transitions.

Register Parameters
[Figure: register with inputs D, Clk and output Q; timing labels T, tsu, thold, tc-q]
• Delays can be different for rising and falling data transitions.

Clock Period
• For each clock cycle, the cycle period must be longer than the sum of:
  – the combinational delay;
  – the memory-element propagation delay (clock-to-Q plus setup).
• The period depends on the longest path.
• Unbalanced delays: logic with unbalanced delays leads to inefficient use of the logic.
[Figure: short clock period vs. long clock period]

Retiming
• Retiming moves memory elements through combinational logic.
• Retiming properties:
  – Retiming changes the encoding of values in registers, but the proper values can be reconstructed with combinational logic.
  – Retiming may increase the number of registers required.
  – Retiming must preserve the number of latches around a cycle; this may not be possible with reconvergent fanout.

Latch-Based Design
[Figure: Latch -> Combinational Logic A (Tda) -> Latch -> Combinational Logic B (Tdb) -> Latch C, with clock; labels Tq, Ts]
• Latch-based machines must use multiple ranks of latches.
• Multiple ranks require multiple phases of clock.
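The clock-period rule above (the period is set by the longest stage) can be made concrete with a small calculation. The Python sketch below assumes a linear flip-flop pipeline with known per-stage logic delays; the names `RegTiming`, `min_period`, and `utilization` and all delay values are hypothetical. "Retiming" is modeled only as redistributing the same total logic delay across the stages, which ignores any extra registers that real retiming may require.

```python
from dataclasses import dataclass


@dataclass
class RegTiming:
    t_cq: float  # clock-to-Q delay of the launching register (ns)
    t_su: float  # setup time of the capturing register (ns)


def min_period(stage_delays, reg):
    """Minimum clock period: register overhead plus the slowest stage."""
    return reg.t_cq + max(stage_delays) + reg.t_su


def utilization(stage_delays, reg):
    """Fraction of the clock period each stage spends on useful logic."""
    period = min_period(stage_delays, reg)
    return [d / period for d in stage_delays]


reg = RegTiming(t_cq=0.5, t_su=0.5)
unbalanced = [2.0, 8.0, 2.0]  # hypothetical stage delays (ns), 12 ns in total
balanced = [4.0, 4.0, 4.0]    # same 12 ns after moving the registers

print(min_period(unbalanced, reg))                          # 9.0
print([round(u, 2) for u in utilization(unbalanced, reg)])  # [0.22, 0.89, 0.22]
print(min_period(balanced, reg))                            # 5.0
```

With the same 12 ns of logic, balancing the stages cuts the minimum period from 9 ns to 5 ns; in the unbalanced case the two short stages sit idle for most of each cycle.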
Clock Race
• In a synchronous system, if the data input to a register does not obey the setup- and hold-time constraints, clock race problems may occur.
• A clock race results in erroneous data being stored in registers.
• Assuming a perfectly synchronous system with perfect clocks, zero-hold-time registers, and a clock-to-Q time greater than the setup time, no clock race problem should occur.
• However, at the chip level this may be hard to ensure.

Hold Time Violation
[Figure: registers M1 and M2 with delayed clocks Tc1 and Tc2 and logic between them; data arrives at M2 at Td2; waveforms show old data and new data at M2]
• Register M2 (clocked at Tc2) samples the new data when it is supposed to sample the old data.
• This happens when Tc2 lags behind the data arrival Td2, which is more likely with a long delay on the clock line and short delays through the registers and logic.
• The worst case corresponds to the minimum delay of the logic.

Hold Time Condition
• Data must be held properly, avoiding a race between data and clock.
• Hold-time constraint: $t_{c-q} + t_{logic,min} > t_{hold}$
• $t_{c-q} + t_{logic,min}$ is also called the contamination delay; it must exceed the threshold set by the hold time of the flip-flop.

How Fast Can We Run the Clock?
[Figure: registers M1 and M2 with clock delays Tc1 and Tc2 and logic between them; data becomes valid after Tq1 + Tlmax and must precede the next M2 edge by the setup time Tsetup2, leaving some margin; annotated "Problem: setup time violation"]
• Setup-time requirement, giving the minimum cycle time: $T = t_{c-q} + t_{su} + t_{logic,max}$

• The earliest that data appears at the input of register M2 is at time Tc1 + Tq1, assuming zero delay in the logic block.
• The clock appears at register M2 at time Tc2.
• Assuming zero setup and hold times, if Tc2 lags the data change (Tc2 > Tc1 + Tq1), module M2 will store the data from the current cycle rather than from the previous cycle. This is a hold-time violation; in practice it can be caused by Tc1 and Tq1 being close to zero while a delay is introduced into the Tc2 clock line.
• If the delay (Tc1 + Tq1 + Td) − Tc2, where Td is the logic delay, is larger than the cycle time Tc, the data will arrive late at M2. This causes a setup-time violation, which occurs when the circuit is too slow for the clock cycle used. Although Tc2 may be artificially increased to allow more time for the data to set up, the constraint Tc2 < (Tc1 + Tq1) then becomes harder to meet, and delays may have to be added artificially to the data paths to satisfy it.

Combating Races in Latch-Based Design
• Strict two-phase clocking discipline:
  – The strict two-phase discipline is conservative but works.
  – A strict two-phase machine makes a latch-based machine behave more like a flip-flop design, but it requires multiple phases.
  – The phases must not overlap: there must be a non-overlap region.

Two-Phase Clocking
• Each phase has a one-sided constraint: the phase must be long enough for all combinational delays.
• If there are no combinational loops, the phases can always be stretched to make that section of the machine work.
• The total clock period depends on the sum of the phase periods.
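The setup and hold constraints from the clock-race discussion can be checked mechanically. The sketch below follows the M1 -> logic -> M2 model with clock arrival times Tc1 and Tc2, so the skew is simply Tc2 − Tc1; the function name `check_timing` and all numeric values (in ns) are hypothetical.

```python
def check_timing(Tc1, Tc2, Tq1, t_logic_min, t_logic_max, t_su, t_hold, Tc):
    """Return (setup_ok, hold_ok) for a register-to-register path M1 -> M2.

    Tc1, Tc2: clock arrival times at M1 and M2 (ns).
    Tq1: clock-to-Q delay of M1; Tc: clock cycle time.
    """
    earliest = Tc1 + Tq1 + t_logic_min  # first moment new data can reach M2
    latest = Tc1 + Tq1 + t_logic_max    # moment the new data is finally stable
    # Hold: new data must not reach M2 until its current edge plus the hold time.
    hold_ok = earliest >= Tc2 + t_hold
    # Setup: data must be stable one setup time before M2's next edge.
    setup_ok = latest + t_su <= Tc2 + Tc
    return setup_ok, hold_ok


# Hypothetical path: 0.1 ns clock-to-Q, 0.05 to 2.0 ns of logic.
path = dict(Tq1=0.1, t_logic_min=0.05, t_logic_max=2.0, t_su=0.1, t_hold=0.05)

print(check_timing(Tc1=0.0, Tc2=0.0, Tc=3.0, **path))   # (True, True)
print(check_timing(Tc1=0.0, Tc2=0.3, Tc=3.0, **path))   # (True, False)
print(check_timing(Tc1=0.0, Tc2=0.3, Tc=10.0, **path))  # (True, False)
```

The last call illustrates why hold problems are so dangerous: a hold violation caused by a late clock at M2 does not go away when the clock is slowed down, because the hold check does not involve Tc at all.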
Clock Uncertainties
[Figure: sources of clock uncertainty: (1) clock generation, (2) devices, (3) interconnect, (4) power supply, (5) temperature, (6) capacitive load, (7) coupling to adjacent lines]

Clock Nonidealities
• Clock skew: spatial variation in temporally equivalent clock edges; deterministic + random, tSK
• Clock jitter: temporal variation in consecutive edges of the clock signal; modulation + random noise
  – Cycle-to-cycle (short-term), tJS
  – Long-term, tJL
• Variation of the pulse width: important for level-sensitive clocking

Clock Skew and Jitter
[Figure: Clk waveforms showing a constant edge offset tSK (skew) and a cycle-to-cycle edge variation tJS (jitter)]
• Both skew and jitter affect the effective cycle time.
• Only skew affects the race margin.

Clock Skew
[Figure: histogram of the number of registers vs. clock delay; the earliest clock edge occurs at nominal − δ/2 and the latest at nominal + δ/2, so the maximum clock skew is δ; the insertion delay is marked; a wide spread indicates a bad design]
• The absolute delay through a clock-distribution path is not important. What matters is the relative arrival time at the register points at the end of each path.
• Skew can be positive or negative.
• Skew causes no variation in the clock period, only a phase shift.

Sources of Skew and Jitter
• Systematic errors are nominally identical from chip to chip and are predictable, while random errors are due to manufacturing variations that are difficult to model.
• Clock-signal generation: a high-frequency clock is generated from a low-frequency one (VCO), which is sensitive to device noise, power-supply variations, and substrate coupling.
• Manufacturing device variations: matching of devices in the buffers along multiple clock paths is critical.
• Interconnect variations: vertical and lateral dimension variations cause the interconnect capacitance and resistance to vary. Source of the problem: inter-layer dielectric (ILD) thickness variations.
• Environmental variations: temperature and power supply. Temperature gradients across the chip are large as a consequence of clock gating.
  Device parameters (Vth and μ) depend on temperature, so the clock delay can vary from path to path. Does temperature contribute to skew or to jitter?
• Capacitive coupling: any coupling between the clock wire and an adjacent signal results in timing uncertainty.

The Clock Skew Problem
• Clock rates as high as 2 GHz in CMOS! (T = 0.5 ns)
[Figure: pipeline In -> CL1 -> R1 -> CL2 -> R2 -> CL3 -> R3 -> Out, with clock arrival times t', t'', t''' at the registers and logic delays tl,min, tl,max, tr,min, tr,max]
• Clock edge timing depends on position.
• Positive skew: data and clock are routed in the same direction (clk1, clk2).

Delay of Clock Wire
[Figure: clock driver with source resistance RS driving a distributed RC wire (r, c) of width W into a load CL]
• r = 0.07 Ω/□, c = 0.04 fF/μm² (tungsten wire)

Positive Skew
[Figure: R1 -> combinational logic -> R2; CLK2 is CLK1 delayed by δ; edges 1 and 3 on CLK1 at 0 and TCLK, edges 2 and 4 on CLK2 at δ and TCLK + δ, hold window ending at δ + th; parameters tc-q, tc-q,cd, tsu, thold, tlogic, tlogic,cd]
• The launching edge arrives before the receiving edge.

Positive Skew (continued)
• The output of the combinational circuit must be valid one setup time before the rising edge of CLK2 (point 4):
  $T + \delta \geq (t_{c-q} + t_{su} + t_{logic})_{max}$, or $T \geq (t_{c-q} + t_{su} + t_{logic})_{max} - \delta$
• This equation suggests that clock skew can actually improve the performance of the circuit. This is indeed true, but increasing the skew makes the circuit more susceptible to race conditions.
• A race may arise if the new value at the output of R1 propagates through the logic and becomes valid at the input of R2 before edge 2.
• To avoid this, we have to ensure that:
  $\delta + t_{hold} < (t_{c-q} + t_{logic})_{min}$, or $\delta < (t_{c-q} + t_{logic})_{min} - t_{hold}$

Negative Skew
[Figure: same R1 -> combinational logic -> R2 circuit; CLK2 now leads CLK1 by δ, so edge 2 occurs before edge 1 and edge 4 before edge 3]
• The receiving edge arrives before the launching edge.

Negative Skew (continued)
• Negative skew hurts performance, because the effective period (from position 1 to position 4) is shortened by |δ|:
  $T - |\delta| \geq (t_{c-q} + t_{su} + t_{logic})_{max}$, or $T \geq (t_{c-q} + t_{su} + t_{logic})_{max} + |\delta|$
• However, negative skew means the system cannot fail through a skew-induced race, since edge 2 happens before edge 1. There is no race issue.

Positive and Negative Skew
[Figure: (a) positive skew: a chain of CL/R stages with the clock routed in the same direction as the data flow; (b) negative skew: the clock routed in the opposite direction to the data flow]
• With positive skew, the skew has to be strictly controlled and must stay below its maximum allowed value; otherwise the circuit will malfunction. Reducing the clock frequency does not help.
• When the skew is negative, the race condition never happens, and the circuit operates correctly independent of skew.
• However, negative skew hurts throughput: the skew reduces the time available for the actual computation, so the clock period has to be increased by |δ|.

How to Counter Clock Skew?
• Routing the clock in the opposite direction to the data relieves the race problem caused by clock skew, but it hampers performance. Also, the data flow of a circuit is sometimes not unidirectional.
[Figure: data path In -> REG -> logic (log) -> REG -> Out with a feedback path; the clock distribution gives positive skew in one direction of data flow and negative skew in the other]
• The best solution is to ensure that the clock skew between communicating registers is bounded.

Example of Clock Skew
[Figure: REG -> gates -> MUX -> REG]
• Notation: tg = gate delay, tm = mux delay, ts = setup time, tq = register clock-to-Q delay, T = clock period.
• Assuming the input signals arrive early enough, the maximum bound on the skew is
  $\delta \leq t_q + t_g + t_m - t_s$
• The setup requirement at the time of latching imposes another constraint on the skew:
  $\delta \geq t_q + 5t_g + t_m + t_s - T$
• Combining these constraints, we have
  $t_q + 5t_g + t_m + t_s - T \leq \delta \leq t_q + t_g + t_m - t_s$

Example: Propagation and Contamination Delay Evaluation
• Propagation and contamination delays are not always easy to evaluate because of false paths.
[Figure: network with inputs A, B, C, D and In1, gates OR1, OR2, AND1, AND2, AND3, and a register; PATH1 and PATH2 are marked to Out]
• The contamination delay is 2 tgates (through OR1 and OR2).
• It would appear that the worst case is PATH1, with 5 tgates, but this is a false path (the output does not even depend on C and D):
  – If A = 1, the critical path (CP) is through OR1 and OR2.
  – If A = 0 and B = 0, the CP is through I1, OR1, and OR2.
  – If A = 0 and B = 1, the CP is through I1, OR1, AND3, and OR2, which is 4 tgates.
• Because of false paths, the worst-case delay cannot be obtained just by adding propagation delays.

Static Timing Analysis
• 0->1 and 1->0 delays are generally different.
• The simplest delay problem to analyze is to change the value at only one input and determine how long it takes for the effect to propagate to a single output (provided there is a path from the selected input to the output).
• A logic simulator can be used, but it has to simulate all possible transition values.
• Static timing analysis is value-independent: it builds a graph that models the delays through the network and identifies the longest (shortest) delay path.

Critical Path
• The longest delay path is known as the critical path, since that path limits the system performance.
• The critical path not only tells us the system cycle time; it also points out which part of the combinational logic must be changed to improve system performance.
• Speed up gates on the critical path by increasing transistor sizes or reducing wiring capacitance, or redesign the logic along the critical path to use a faster gate configuration.
• Speeding up the system may require modifying several sections of logic, since the critical path can have multiple branches. Identify the critical path and the cutset of the graph that represents it, then determine which edge (gate) to speed up.

False Path
• A false path is a critical path that can never be exercised during normal circuit operation. In this case the actual critical path is shorter than first-order analysis would predict.
• Detecting false paths is not easy, since it requires an understanding of the logic functionality of the network.
• Determining whether a path is false is an NP-complete problem; however, CAD tools and algorithms are now available to find false paths in practical networks.

Example of False Path
[Figure: gate network with inputs a, b and internal nodes c, d, e, y, z]
• The path a -> c -> d -> e -> z is a false path.

Impact of Jitter
[Figure: CLK with period TCLK and edge uncertainty ±tjitter driving REGS -> combinational logic; parameters tc-q, tc-q,cd, tsu, thold, tjitter, tlogic, tlogic,cd]
• Jitter is a temporal variation in the clock edge.
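The value-independent longest/shortest-path computation described under Static Timing Analysis can be sketched in a few lines. This is a simplified model, assuming a single delay per gate (it ignores the different 0->1 and 1->0 delays) and primary inputs that all switch at t = 0; the circuit and the function names are hypothetical. Because the analysis is purely topological, the path it reports may well be a false path.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def static_timing(fanin, delay, inputs):
    """Propagate latest/earliest arrival times through a gate-level DAG.

    fanin: maps each gate to the list of nodes that drive it.
    delay: maps each gate to its (value-independent) delay.
    inputs: primary inputs, assumed to switch at t = 0.
    """
    late, early, worst_pred = {}, {}, {}
    for node in TopologicalSorter(fanin).static_order():
        if node in inputs:
            late[node] = early[node] = 0.0
            continue
        preds = fanin[node]
        worst = max(preds, key=lambda p: late[p])  # latest-arriving fan-in
        worst_pred[node] = worst
        late[node] = late[worst] + delay[node]     # longest (critical) path
        early[node] = min(early[p] for p in preds) + delay[node]  # shortest path
    return late, early, worst_pred


def critical_path(worst_pred, sink):
    """Walk back from the sink along the latest-arriving fan-in of each gate."""
    path = [sink]
    while path[-1] in worst_pred:
        path.append(worst_pred[path[-1]])
    return path[::-1]


# Hypothetical circuit: g1(a, b) -> g2(g1, c) -> out(g2, g3), with g3(c).
fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["c"], "out": ["g2", "g3"]}
delay = {"g1": 1.0, "g2": 2.0, "g3": 1.0, "out": 1.0}

late, early, pred = static_timing(fanin, delay, inputs={"a", "b", "c"})
print(late["out"], early["out"])   # 4.0 2.0
print(critical_path(pred, "out"))  # ['a', 'g1', 'g2', 'out']
```

The latest arrival bounds the cycle time and the earliest arrival is the contamination delay used in the hold check; the back-pointers recover the critical path that the speed-up steps above would target.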
Longest Logic Path in Edge-Triggered Systems
[Figure: from the latest point of launching, the data path takes TClk-Q plus the maximum logic delay TLM and must meet the setup time TSU before the earliest arrival of the next cycle's edge, which is early by the jitter TJI and the skew δ]
• Setup-time condition: if the launching edge is late and the receiving edge is early, the data will not be too late if
  $T_{c-q} + T_{LM} + T_{SU} < T - T_{JI,1} - T_{JI,2} - \delta$
• The minimum cycle time is determined by the maximum delays through the logic:
  $T_{c-q} + T_{LM} + T_{SU} + \delta + 2T_{JI} < T$
• Skew can be either positive or negative.

Clock Constraints in Edge-Triggered Systems: Shortest Path
[Figure: from the earliest point of launching, the data path takes TClk-Q plus the minimum logic delay TLm; relative to the nominal clock edge, data must not arrive before the end of the hold time TH]
• Hold-time condition: if the launching edge is early and the receiving edge is late, the data must not arrive too early:
  $T_{c-q} + T_{Lm} - T_{JI,1} > T_H + T_{JI,2} + \delta$
• This sets the minimum logic delay:
  $T_{c-q} + T_{Lm} > T_H + 2T_{JI} + \delta$

Latch-Based Design
• The L1 latch is transparent when φ = 0.
• The L2 latch is transparent when φ = 1.
[Figure: L1 latch -> logic -> L2 latch -> logic]

Slack Borrowing
[Figure: In -> L1 (CLK1) -> CLB_A (tpd,A) -> L2 (CLK2) -> CLB_B (tpd,B) -> L1 (CLK1), with nodes a to e; timing diagram over one period TCLK showing when a through e become valid, the latch delay tDQ, and the slack passed to the next stage]

Clock-Distribution Network Design Parameters
• Interconnect material used for the clock network
• Shape of the clock-distribution network
• Clock driver and buffer scheme used
• Load on the clock lines (i.e., the clock fan-out)
• Rise and fall time of the clock

Clock Distribution to Bound Skew
[Figure: H-tree clock network driven by CLOCK at the center]
• H-tree network: very attractive for regular structures.
• Observe: only relative skew is important.

Clock Network with Distributed Buffering
[Figure: main clock driver feeding secondary clock drivers, each serving a local area of modules]
• Equalize the local clock delay through careful routing of the clock signals, combined with a hierarchical clock-buffering scheme.
• Reduces absolute delay and makes power-down easier.
• Sensitive to variations in buffer delay.

More Realistic H-Tree [Restle98]
[Figure]

The Grid System
[Figure: clock grid driven by GCLK drivers on all four sides]
• No RC matching needed
• Large power

Example: DEC Alpha 21164
• Uses a clock grid instead of a clock tree.
• Clock frequency: 300 MHz; 9.3 million transistors
• Total clock load: 3.75 nF
• Power in the clock-distribution network: 20 W (out of 50 W)
• Uses two-level clock distribution:
  – A single 6-stage driver at the center of the chip
  – Secondary buffers drive the left- and right-side clock grid in Metal3 and Metal4
• Total driver size: 58 cm!
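The equal-delay property of the H-tree comes from its geometry: every leaf is reached through the same sequence of segment lengths. The sketch below generates an H-tree recursively and records the wire length from the root to each leaf; the function name `h_tree` and the dimensions are hypothetical, and equal wire length only implies equal delay if the wire widths, buffers, and loads are also matched at every level.

```python
def h_tree(x, y, half, levels, horizontal=True, wire=0.0):
    """Return a list of (x, y, wire_length_from_root) for the leaves of an H-tree.

    Each level splits the current branch point into two branches of length
    `half`, alternating horizontal and vertical; the length is halved after
    each vertical split so that every 'H' is half the size of its parent.
    """
    if levels == 0:
        return [(x, y, wire)]
    leaves = []
    for sign in (-1, 1):
        if horizontal:
            nx, ny, next_half = x + sign * half, y, half
        else:
            nx, ny, next_half = x, y + sign * half, half / 2
        leaves += h_tree(nx, ny, next_half, levels - 1, not horizontal, wire + half)
    return leaves


# Four levels over a 2 x 2 square centred on the clock source: 16 leaves.
leaves = h_tree(0.0, 0.0, 0.5, levels=4)
print(len(leaves))                   # 16
print({w for _, _, w in leaves})     # {1.5}
```

Every leaf sees the same 1.5 units of wire, so any remaining skew comes from mismatch in the buffers, wires, and loads rather than from the topology, which is why the slide stresses that only relative skew matters.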
Clock Drivers
[Figure]

Clock Skew in Alpha Processor
[Figure]

EV6 (Alpha 21264) Clocking
• 600 MHz, 0.35 μm CMOS
• tcycle = 1.67 ns, trise = 0.35 ns
[Figure: global clock waveform]
• tskew = 50 ps
• 2-phase, with multiple conditional buffered clocks:
  – 2.8 nF clock load
  – 40 cm final driver width
• Local clocks can be gated "off" to save power
• Reduced load/skew
• Reduced thermal issues
• Multiple clocks complicate race checking
[Figure: PLL]

Hybrid Grid
• DEC Alpha 21264, Bailey, JSSC 11/98
[Figure]

DEC Alpha 21264 Global Clock Distribution Network
[Figure]

Global Clock Grid
[Figure]

EV7 Clock Hierarchy
• Active skew management and multiple clock domains
[Figure: clock domains GCLK (CPU core), NCLK (memory controller), L2R_CLK and L2L_CLK (L2 cache), and SYSCLK, generated with a PLL and three DLLs]
• + Widely dispersed drivers
• + DLLs compensate static and low-frequency variation
• + Divides design and verification effort
• − DLL design and verification is added work
• + Tailored clocks

Example 2: Intel IA-64 Itanium
• Use of deskew buffers
• 3-level hierarchy:
  – Global distribution
    · On-die phase-locked loop
    · Deskew buffer (DSK)
  – Regional distribution
    · From the deskew buffers to 30 clock regions (regional clock grid driven by regional clock drivers, RCD)
  – Local distribution
    · Local clock buffer (LCB)
    · Generation of opportunity-time-borrowing (OTB) delay clocks

Intel IA-64 Itanium Clock Distribution Topology
[Figure]

Global Clock Distribution
• Distribute two clocks:
  – the core clock and the reference clock,
  – using two identical, balanced H-trees on the top two metal layers.
• To reduce capacitive noise coupling and to ensure a good inductive return path, the H-trees are fully shielded laterally with Vcc/Vss.
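The deskew buffers (DSK) and DLLs above remove static and low-frequency skew by comparing a local clock with a reference and adjusting a delay. The following is a deliberately simplified, hypothetical model of such a loop, not the Itanium or EV7 circuit: a phase detector with a dead band of ±half a step decides whether the regional clock is early or late, and a digitally controlled delay line moves one step at a time. The function name, step size, and offsets are assumptions.

```python
def deskew(offset_ps, step_ps, n_steps):
    """Bang-bang deskew loop with a digitally controlled delay line.

    offset_ps: arrival time of the regional clock minus the reference clock
               when the delay line is at setting 0 (negative = early).
    n_steps:   number of delay-line settings (0 .. n_steps - 1).
    Returns (setting, residual_skew_ps).
    """
    setting = n_steps // 2  # start mid-range so the loop can move either way
    while True:
        skew = offset_ps + setting * step_ps  # skew seen by the phase detector
        if skew > step_ps / 2 and setting > 0:
            setting -= 1                      # regional clock late: remove delay
        elif skew < -step_ps / 2 and setting < n_steps - 1:
            setting += 1                      # regional clock early: add delay
        else:
            return setting, skew              # inside the dead band, or out of range


print(deskew(-130, 10, 32))  # (13, 0)
print(deskew(-127, 10, 32))  # (13, 3)
```

The residual skew is bounded by half a delay-line step as long as the offset lies within the line's range; an offset beyond the range leaves the loop pinned at its end setting. Because such a loop settles over many cycles, it tracks static and slowly varying skew, matching the EV7 note that DLLs compensate static and low-frequency variation, but it cannot remove cycle-to-cycle jitter.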
Regional Clock Distribution
• A distributed array of deskew buffers (DSK) reduces the effect of within-die process variations.
• A regional clock grid is driven by modular regional clock drivers:
  – 30 clock regions
  – M4 for the x-direction, M5 for the y-direction
  – Full support for scan and clock gating

Local Clock Distribution
• Local clock buffer
• Delay clocks that are needed for opportunity-time-borrowing (OTB) delay-clock generation, i.e., intentional-skew buffers
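Slack borrowing in latch-based design, and the intentional delay clocks used for opportunity time borrowing, rely on a latch passing data through while it is transparent. The sketch below propagates arrival times through a chain of latches with given transparency windows, under simplifying assumptions: ideal clocks with no skew or jitter, a single D-to-Q delay tDQ used for every latch (including data that waits for the latch to open), and hypothetical names and numbers.

```python
def latch_chain(windows, stage_delays, t_dq, t_su):
    """Propagate data through transparent latches separated by logic stages.

    windows[i] = (open, close) transparency window of latch i (ns);
    stage_delays[i] = logic delay from latch i to latch i + 1.
    Returns ([(arrival, borrowed), ...] for latches 1..n, setup_ok).
    """
    depart = windows[0][0] + t_dq  # data leaves the first latch when it opens
    report, setup_ok = [], True
    for (t_open, t_close), d in zip(windows[1:], stage_delays):
        arrival = depart + d
        if arrival > t_close - t_su:           # missed the closing edge
            setup_ok = False
        borrowed = max(0.0, arrival - t_open)  # time taken from the next phase
        depart = max(arrival, t_open) + t_dq   # transparent latch passes data on
        report.append((arrival, borrowed))
    return report, setup_ok


# Hypothetical two-phase timing: period 10 ns, each latch open for 5 ns.
windows = [(0, 5), (5, 10), (10, 15)]
print(latch_chain(windows, [7.0, 2.0], t_dq=0.5, t_su=0.3))
# ([(7.5, 2.5), (10.0, 0.0)], True)
print(latch_chain(windows, [9.5, 2.0], t_dq=0.5, t_su=0.3)[1])  # False
```

In the first call, logic block A takes 7 ns, longer than the 5 ns phase, yet the path still meets timing because it borrows 2.5 ns from the following stage, which needs only 2 ns. A flip-flop design clocked every 5 ns would fail on the same stage. In the second call the first stage is so long that the data misses the closing edge of the second latch, so no amount of borrowing helps.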