Download ASIC 2001: FORMATTING AND SUBMITTING YOUR PAPER

Survey of Techniques for Post-Silicon Tuning for High Performance Circuits

Xiong Liu (602995937)
210C Final Project

OUTLINE

This survey studies five papers on techniques for post-silicon tuning of high-speed circuits. The focus is on the first paper, “Post-Fabrication Clock-Timing Adjustment Using Genetic Algorithms” [1], from the IEEE Journal of Solid-State Circuits, 2004. This paper demonstrates the power of post-silicon tuning by improving the yield of a real medium-scale LSI chip from 42% to 93%, with manageable die cost for the additional tuning circuits and small ATE tester overhead. Intel used a similar approach in a proven large-volume product, the 3-GHz Pentium 4 microprocessor, as described in “Designing a 3GHz, 130nm, Pentium® 4 Processor” [2] from the VLSI Symposium 2002. This confirms that the cost overhead is manageable for large-volume consumer products. The same genetic algorithm (GA) was also applied to an analog IF filter, and similar yield and performance improvements were achieved, as shown in the third paper, “An AI-Calibrated IF Filter: A Yield Enhancement Method With Area and Power Dissipation Reductions” [3], from JSSC 2003. Other ways of post-silicon tuning, such as adaptive body biasing (ABB) and the associated clustering algorithms, are outlined in the fourth paper, “Design-Time Optimization of Post-Silicon Tuned Circuits Using Adaptive Body Bias” [4], from IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2008. The fifth paper, “TuneLogic: Post-Silicon Tuning of Dual-Vdd Designs” [5], from ISQED 2009, focuses on post-silicon tuning for dual-Vdd designs and the associated assignment algorithms. A table in the concluding section compares the approaches of these five papers and highlights their key differences. Further improvement directions and key challenges faced by these approaches are also discussed in that section.

1. PAPER1--GENETIC ALGORITHM FOR POST-SILICON TUNING

A GA-based clock adjustment architecture has been proposed and tested.
The GA-based clock adjustment method sets clock delays through control register bits, and those bits are tuned post-silicon by a genetic algorithm. Two test chips were fabricated to evaluate the implementation: one small LSI chip and one medium-scale LSI chip featuring multipliers and test pattern generators. The variable delay is achieved with a delay cell whose control voltage is selected by a simple DAC. The variable delay cell and DAC together use 30 transistors, so the area and complexity overhead is small. Delay adjustments as small as 30 ps were reported with this implementation.

Forty chips were fabricated, and a total of 80 multipliers and 80 memory-test-pattern generators were tested to measure the effectiveness of the GA. The experiment with the memory-test-pattern generators shows an 11% average enhancement and a maximum enhancement of 25% (from 1.0 to 1.25 GHz). The operational ratio of the memory-test-pattern generators at a frequency of 1.4 GHz was only 15% before adjustment and rose to 90% after adjustment. Moreover, while no chips could operate at a frequency of 1.58 GHz before adjustment, a few could after adjustment.

In these experiments, adjustment for each chip took 9 min, but a careful analysis of the timing shows that almost all of that time was spent on communication between the PC and the interface board, which is not necessary with LSI testers. It was estimated that adjustment on LSI testers can be completed in less than 1 s. The GA search for a circuit converged, i.e., the delay settings for correct operation stabilized, after 14 iterations. Hence the additional tester overhead was shown to be small as well.

Although the chips with a clock frequency of 1.25 GHz could all operate at 1.2 V, the operational yield fell to 0% when the supply was reduced to 0.8 V.
However, after timing adjustment, the operational yield for the chips was again at 100%. This represents a significant reduction in power dissipation.

An analysis of the total design flow, broken into its four main design steps, indicates that the GA-based approach could reduce the overall design time by 21% (23.0 person-days). Looking in more detail at the separate design stages, the GA-based approach achieves design-time reductions in the following areas. At the function design stage, the modeling of timing constraints is simplified. At the floor-planning stage, the time to assign pins is reduced. At the layout design and verification stages, the iterations required to satisfy timing constraints are reduced. For more complex large-scale integrations (LSIs), the improvements obtainable with the GA-based approach will be even more apparent, because timing design becomes more complicated and time consuming.

On the practical side, there were concerns about the size of the programmable delay circuits and the adjustment times. However, this study has clearly demonstrated that the developed circuits are sufficiently small and that adjustment times are sufficiently short. Accordingly, these concerns do not constitute real obstacles to applying this approach in mass production. The current programmable delay circuits, however, cannot compensate for temperature and power-supply voltage fluctuations. The remaining future tasks are: 1) to incorporate the GA-based clock adjustment technique into EDA tools, such as clock tree synthesis (CTS), and 2) to build IPs for the programmable delay circuits for specialized EDA tools. Both are already under development. By using EDA tools capable of handling the GA-based technique together with programmable delay circuits, users will be able to benefit from the three advantages described in the paper.
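The GA-based search over delay register settings described for Paper 1 can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the authors' implementation: the chip model (`make_chip`), the pass/fail rule in `passing_paths`, the number of delay levels, and the GA parameters are all hypothetical. On a real tester the fitness would come from running test patterns on silicon.

```python
import random

def make_chip(num_paths, num_regs, rng):
    """Hypothetical chip: each path has a launch register, a capture register,
    and a slack (in delay steps) that may be negative due to process variation."""
    paths = []
    for _ in range(num_paths):
        src, dst = rng.sample(range(num_regs), 2)
        paths.append((src, dst, rng.randint(-2, 4)))
    return paths

def passing_paths(paths, settings):
    """Fitness: number of paths that meet timing for the given delay settings.
    Delaying the capture clock or advancing the launch clock adds slack (assumed model)."""
    return sum(1 for src, dst, slack in paths
               if slack + settings[dst] - settings[src] >= 0)

def ga_tune(paths, num_regs, levels=8, pop_size=20, generations=14, seed=0):
    """Simple elitist GA over per-register delay codes (requires num_regs >= 2)."""
    rng = random.Random(seed)
    # Include the untuned all-zero setting so the result is never worse than it.
    pop = [[0] * num_regs]
    pop += [[rng.randrange(levels) for _ in range(num_regs)]
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda s: passing_paths(paths, s), reverse=True)
        elite = pop[:pop_size // 2]            # keep the better half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, num_regs)   # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(num_regs)] = rng.randrange(levels)  # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda s: passing_paths(paths, s))

NUM_REGS = 8
chip = make_chip(num_paths=40, num_regs=NUM_REGS, rng=random.Random(1))
untuned = passing_paths(chip, [0] * NUM_REGS)
best = ga_tune(chip, NUM_REGS)
print("paths passing before tuning:", untuned)
print("paths passing after tuning: ", passing_paths(chip, best))
```

Because the better half of the population is carried over unchanged and the untuned setting is part of the initial population, the tuned result in this sketch can never pass fewer paths than the untuned chip.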
Another important issue is how to deal with temperature variation after the calibration or tuning is done, since temperature variation plays an important role in circuit speed and power.

2. PAPER2--DESIGNING A 3GHZ, 130NM, PENTIUM® 4 PROCESSOR

This IA32 processor is fabricated on a 130nm CMOS process with six layers of dual-Damascene copper interconnect. The processor contains 55M transistors. Because the integer execution core runs at twice the nominal frequency, in the 6 GHz range, one of the main objectives must be to keep the clock distribution from wasting power. Many microprocessor designs spend devices and wires trying to reduce all static inaccuracies in the design (load mismatches, wire mismatches, process variation, VCC variation), which ultimately wastes power. This IA32 processor incorporates a circuit technique to detect and correct static mismatches in the global clock network post-silicon. The CPU was divided into 52 clock domains. By using the detection circuitry and a tester interface, the static clock skew between each global domain could be programmed to zero using a predetermined testing sequence. This enables the clock distribution to be designed with a minimal number of nets.

Another key recognition is that no matter how good the timing in pre-silicon simulation is, the true worst-case paths on silicon will actually run faster if the clock skew is not zero or the clock duty cycle is not 50%. The circuitry used to correct static mismatches of the global clock network is used in combination with a genetic algorithm on the tester. The algorithm looks at the maximum frequency of thousands of patterns to search out the best combination of built-in clock skew between local clocks, and the best duty cycle for the global and certain local clocks, to maximize frequency. As is usual for an industry paper, few details are disclosed, such as the algorithm itself or the tester overhead.
However, we can infer that the overhead is small since this processor is a large-volume product. Otherwise Intel would not use such an approach, for cost reasons. This is a further confirmation that post-silicon tuning has actually been adopted in large-volume consumer products.

3. PAPER3--AN AI-CALIBRATED IF FILTER: A YIELD ENHANCEMENT METHOD WITH AREA AND POWER DISSIPATION REDUCTIONS

An inherent problem in implementing analog large-scale integration (LSI) or integrated circuits is that the values of manufactured analog circuit components often differ from the precise design specifications. Such discrepancies cause poor yield rates, especially for high-end analog circuits. For example, in intermediate frequency (IF) filters, which are widely used in cellular phones, even a 1% discrepancy from the center frequency is unacceptable. It is, therefore, necessary to carefully examine the analog LSIs and to discard any which do not meet the specifications. In this paper, the authors propose an artificial intelligence (AI) calibration method which uses genetic algorithms (GAs) to correct these variations in the analog circuit values. The method provides three advantages:

1) Enhanced yield rates. If an analog LSI is found not to satisfy specifications, the GA is executed at wafer-sort testing to bring the defective analog circuit components in line with specifications. Once calibrated, the circuit values are fixed by laser trimming.

2) Smaller circuits and less power consumption. The conventional approach to overcoming discrepancies of the component values in analog LSIs has been to use large components. However, this requires more area, which means higher manufacturing costs and greater power consumption. In contrast, with the proposed approach, the area of the analog circuits can be made smaller, because chip performance can be calibrated after production with the GA software.

3) Integration of peripheral circuits into LSIs.
Process technology allows for smaller and smaller implementations. However, because it is extremely difficult to integrate analog circuits with high-end specifications due to process variations, these analog circuits are usually made as separate components such as ceramic filters. In contrast, with the proposed method, it is possible to integrate these analog circuits on a single LSI, resulting in cost reduction and circuit area reduction.

Although many filter structures could provide the desired 455-kHz center frequency and 21-kHz bandwidth frequency response, the authors adopted an 18th-order linear filter organization consisting of three cascaded structures. Each of the three filter blocks has an all-pole sixth-order leapfrog structure. The GA adjusts the biasing current of each filtering stage. The GA chromosome for the whole IF filter is 39x6 bits, with each 6-bit string determining the transconductance value of an amplifier. The fitness function consists of a weighted sum of gain deviations and group delay differences measured around the center frequency.

As a result of this new design method employing GA calibration, filter area was reduced by 63% compared with existing commercial products (from 3.36 to 1.26 mm²), resulting in a 26% reduction in power dissipation (from 3.4 to 2.5 mA, 38% reduction for the Gm amplifiers). As a further improvement, the authors subsequently achieved a 100% yield rate with GA calibration experiments involving only 18 parameters (three bits each). As a result, a commercial chip has been designed with adjustment circuitry for 18x3 bit registers (instead of 39x3 in the prototype chip) and 18 DACs within an area of 0.41 mm². Thus, even including the area required for calibration, the total chip area is reduced by 49%. On average, the GA was able to find an optimal set of architecture bits and terminate after 100 measurement iterations, within a few seconds on an LSI tester.
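The chromosome layout and the weighted-sum fitness described for the IF filter can be sketched as follows. This is a hedged illustration only: the function names, the choice of absolute deviations, the units, and the default weights are assumptions, since the survey does not give the exact form used in [3].

```python
def decode(chromosome, bits_per_gene=6):
    """Split a bit string into one integer code per Gm amplifier.
    With 39 genes of 6 bits each, the chromosome is 234 bits long."""
    return [int(chromosome[i:i + bits_per_gene], 2)
            for i in range(0, len(chromosome), bits_per_gene)]

def fitness(gain_dev_db, delay_diff_us, w_gain=1.0, w_delay=1.0):
    """Weighted sum of gain deviations and group-delay differences measured
    at sample points around the center frequency (lower is better).
    The weights and the use of absolute values are assumptions."""
    return (w_gain * sum(abs(g) for g in gain_dev_db)
            + w_delay * sum(abs(d) for d in delay_diff_us))

# Example: a 39x6-bit chromosome with every amplifier at mid-scale code 32.
codes = decode("100000" * 39)
print(len(codes), codes[0])
print(fitness([1.0, -2.0], [0.5]))
```

In a calibration loop, the decoded codes would be written to the on-chip registers, the filter response measured, and the resulting fitness returned to the GA.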
Clearly, GA calibration does not represent an obstacle to mass production. Twenty-nine of the 30 test chips could be calibrated to satisfy specifications (i.e., a 97% yield rate), although none of these chips conformed prior to calibration, and the remaining chip was successfully calibrated with additional iterations. This result is consistent with statistical simulations for GA calibration, in which 952 out of 1000 virtual chips were successfully calibrated. For comparison, the authors conducted calibration experiments with a conventional hill-climbing method instead of the GA. A run terminated after fitness was evaluated 100 times. With this method, only 17 chips (57%) could meet the specifications. These results suggest the effectiveness of the GA in avoiding local minima in the evaluation (fitness) function.

Compared with the first paper, the different nature of analog circuits shifts the focus of the genetic algorithm to yield and area/die cost. The first paper, by contrast, focuses on the combination of yield, speed and power. As is evident from these comparisons, a good fitness evaluation function is the key to the successful deployment of the GA.

4. PAPER4--DESIGN-TIME OPTIMIZATION OF POST-SILICON TUNED CIRCUITS USING ADAPTIVE BODY BIAS

The major trade-off in digital circuits is delay vs. power, and both metrics are related to device threshold voltage. A high-threshold device is slower, but its sub-threshold leakage power is lower, since the leakage falls exponentially as the threshold voltage rises. To summarize, for low delay we prefer a low threshold voltage, and for low leakage power we prefer a high threshold voltage. The threshold voltage of a device can be adjusted via process engineering or by tuning the body bias. Forward (positive) body bias lowers the threshold voltage, while reverse body bias raises it.
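The delay-versus-leakage trade-off above can be made concrete with two textbook first-order models: sub-threshold leakage proportional to exp(-Vth / (n·kT/q)) and the alpha-power-law gate delay proportional to Vdd / (Vdd - Vth)^alpha. These models and every parameter value below (`n`, `alpha`, `vdd`, the reference threshold) are illustrative assumptions and are not taken from [4].

```python
import math

def leakage_rel(vth, vth_ref=0.30, n=1.5, thermal_v=0.0259):
    """Sub-threshold leakage relative to a device with threshold vth_ref.
    thermal_v is kT/q in volts near room temperature."""
    return math.exp(-(vth - vth_ref) / (n * thermal_v))

def delay_rel(vth, vth_ref=0.30, vdd=1.0, alpha=1.3):
    """Alpha-power-law gate delay relative to a device with threshold vth_ref."""
    def delay(v):
        return vdd / (vdd - v) ** alpha
    return delay(vth) / delay(vth_ref)

# Sweep the threshold around the reference, as body bias would shift it.
for vth in (0.25, 0.30, 0.35):
    print(f"Vth={vth:.2f} V  leakage x{leakage_rel(vth):.2f}  delay x{delay_rel(vth):.3f}")
```

Under these assumed numbers a small threshold shift changes leakage by a large factor but delay by only a few percent, which is why body bias is attractive as a per-die tuning knob.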
Adaptive body biasing is a powerful technique that allows post-silicon tuning of individual manufactured dies such that each die optimally meets the delay and power constraints. Assigning an individual bias control to each gate leads to severe overhead, since each device needs to be sized much larger to accommodate the deep N-well, rendering the method impractical. However, assigning a single bias control to all gates in the circuit prevents the method from compensating for intra-die variation and greatly reduces its effectiveness. In this paper, the authors propose a new variability-aware method that clusters gates at design time into a handful of carefully chosen independent body-bias groups, which are then individually tuned post-silicon for each die.

A particular intra-die variation model, such as a linear gradient, is assumed, and a quadratic programming method is proposed to correlate body bias with speed and leakage. Multiple Monte Carlo runs were performed with the assumed statistical model. The correlation of body bias with speed and leakage is calculated for each gate in each Monte Carlo run, and a probability density function (PDF) for each gate is generated from those runs. Gates with similar PDFs are grouped (clustered) together at design time by a greedy algorithm. The authors show that this yields near-optimal performance and power characteristics with minimal overhead. The greedy algorithm considers both speed and leakage. The authors showed that the results converged after about 400 Monte Carlo runs for a fairly large circuit, a Viterbi decoder. Different numbers of clusters were studied, and the results show that the additional performance improvement diminishes as more clusters are added.
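The idea of grouping gates with similar statistics can be sketched with a simple greedy merge. This is a stand-in illustration, not the clustering algorithm of [4]: here each gate is represented by a hypothetical list of per-run ideal bias values, similarity is plain squared Euclidean distance between cluster means, and the closest pair of clusters is merged until `k` groups remain.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_cluster(samples, k):
    """samples maps gate name -> list of per-Monte-Carlo-run bias values.
    Start with one cluster per gate and greedily merge the closest pair
    until only k clusters remain (k >= 1)."""
    clusters = [[gate] for gate in samples]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = mean_vector([samples[g] for g in clusters[i]])
                cj = mean_vector([samples[g] for g in clusters[j]])
                d = sq_distance(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Four hypothetical gates, two Monte Carlo runs each (values are made up).
samples = {
    "g1": [0.10, 0.20], "g2": [0.11, 0.21],
    "g3": [0.40, 0.50], "g4": [0.41, 0.49],
}
groups = greedy_cluster(samples, k=2)
print(groups)
```

Each resulting group would share one body-bias control, which is what keeps the deep N-well and routing overhead limited to a handful of bias domains.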
A two- or three-cluster solution already improves the speed by 1X and also improves the leakage power. The authors study the physical design constraints and show how the area and wire length overhead can be significantly limited using the proposed method. Compared with a fixed design-time dual threshold voltage assignment method, the proposed method improves leakage power by 38%–68% while simultaneously reducing the standard deviation of delay by two to nine times. The speed also improved by 20% on average. The clustering algorithm was integrated into a placer, Capro, with ECO capability. The cells/gates are placed with Capro under initial constraints and then moved by ECO so that they are clustered together. Displacement of gates is less than 15%, with no major routing overhead or timing violations. Future work includes carrying out the actual post-silicon tuning to measure the overhead of running it on an LSI tester.

5. PAPER5--TUNELOGIC: POST-SILICON TUNING OF DUAL-VDD DESIGNS

Modern CMOS manufacturing processes have significant variability, which necessitates guard banding to achieve reasonable yield. The fundamental contribution the authors make is a dual-Vdd design style, and associated CAD algorithms, wherein supply voltages are assigned to logic based on post-manufacturing analysis rather than designing with nominal values and guard banding. The authors studied the power supply demand in a 64-bit multiplier and used SPICE simulation to show that the required pass gate is small, close to 1.75x the size of the NAND3x4 gates. Hence the dual-Vdd overhead is small. The authors perform a detailed case study of a custom-designed pipelined multiplier using realistic process data. Results show that, for comparable yield and target delay, the approach consumes significantly less power than a single-Vdd supply. For example, to achieve 100% yield at the same target delay, the dual-Vdd design uses 23.6 pJ/multiply while a single-Vdd design uses 34.6 pJ/multiply.
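As a quick check of what the two energy figures above imply, the relative saving of the dual-Vdd design over the single-Vdd design works out as follows (simple arithmetic on the quoted numbers, not a figure stated in [5]):

```latex
\[
\frac{E_{\text{single}} - E_{\text{dual}}}{E_{\text{single}}}
  = \frac{34.6 - 23.6}{34.6}
  = \frac{11.0}{34.6}
  \approx 0.32
\]
```

That is, roughly a 32% reduction in energy per multiply at the same yield and target delay.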
Initially all gates are assigned to the high voltage, and a greedy algorithm then tries to move as many as possible to the lower Vdd based on at-speed testing. Future work includes designing the at-speed testing and verifying the algorithm’s overhead.

6. CONCLUSION

Various post-silicon tuning methods were surveyed, and it was shown that the tuning/calibration overhead can be made small by exploiting the targeted design or by using optimization algorithms to cluster or group multiple cells together to achieve power and speed benefits. The methods and their characteristics are summarized in the table below.

Future work mostly involves the practical side of post-silicon tuning. The tester overhead must be verified; otherwise no large-volume product can use these approaches, since testers are so expensive that every minute of tester overhead directly adds 5-20 cents of extra cost to the chip. Process variations can be taken into account to speed up the post-silicon tuning algorithm and thus reduce tester overhead. This additional tester time and cost needs to be justified by performance improvement or power reduction. In addition, how to deal with temperature variation after tuning is also a challenging problem.

How to integrate these clustering/assignment algorithms with existing CAD tools such as placers and routers is also a big practical problem, since the algorithms need to be applied to a number of different applications. Paper 4 integrated its clustering algorithm with the placer Capro, which has ECO capabilities. This may be an example of how to integrate with existing CAD tools.
TABLE 1. METHOD & IMPLEMENTATION COMPARISON

Optimization method
  Paper 1: Genetic algorithm (post-fabrication)
  Paper 2: Genetic algorithm (post-fabrication)
  Paper 3: Genetic algorithm (post-fabrication)
  Paper 4: Greedy for clustering at design time, binary search post-silicon
  Paper 5: Greedy for voltage domain assignment at design time, worst-case sweep post-silicon

Purpose
  Paper 1: To control clock delay
  Paper 2: To control clock skew
  Paper 3: To adjust Gm cell currents
  Paper 4: To adjust body bias to change delay and leakage
  Paper 5: To enable dual-Vdd design to save power

Benefits
  Paper 1: Improves speed by 11% and yield by 51%
  Paper 2: Improves speed by an average of 7%
  Paper 3: Reduces area by 63%, improves yield from 0 to 95%, reduces power by 27%
  Paper 4: Improves leakage power by 38-68%
  Paper 5: Achieves 100% yield, reduces power by 30%

Overheads
  Paper 1: 1 min of tester time
  Paper 2: n/a
  Paper 3: Better than hill-climbing
  Paper 4: n/a
  Paper 5: n/a

REFERENCES

[1] E. Takahashi, Y. Kasai, M., and T. Higuchi, “Post-Fabrication Clock-Timing Adjustment Using Genetic Algorithms,” IEEE Journal of Solid-State Circuits, vol. 39, pp. 643-650, April 2004.
[2] D. Deleganes, J. Douglas, B. Kommandur, M. Patyram, “Designing a 3GHz, 130nm, Pentium® 4 Processor,” Symp. VLSI Circuits Dig. Papers, June 2002, pp. 130-133.
[3] M. Murakawa, T. Adachi, Y. Niino, Y. Kasai, E. Takahashi, K. Takasuka, and T. Higuchi, “AI-Calibrated IF Filter: A Yield Enhancement Method With Area and Power Dissipation Reductions,” IEEE Journal of Solid-State Circuits, vol. 38, pp. 495-502, March 2003.
[4] S. H. Kulkarni, D. M. Sylvester, and D. T. Blaauw, “Design-Time Optimization of Post-Silicon Tuned Circuits Using Adaptive Body Bias,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 3, pp. 481-494, March 2008.
[5] S. Bijansky, S. K. Lee, and A. Aziz, “TuneLogic: Post-Silicon Tuning of Dual-Vdd Designs,” Intl. Symp. on Quality of Electronic Design, pp. 394-400, 2009.
