Survey of Techniques for Post-Silicon Tuning for High Performance
Circuits
Xiong Liu (602995937)
210C Final Project
OUTLINE
This survey studies five papers on techniques for post-silicon tuning of high-speed circuits. The focus of the survey is on the first paper, “Post-Fabrication Clock-Timing Adjustment Using Genetic Algorithms” [1], from the IEEE Journal of Solid-State Circuits, 2004. This paper demonstrates the power of post-silicon tuning by improving the yield of a real medium-scale LSI chip from 42% to 93%, with manageable die cost for the additional tuning circuits and small ATE tester overhead. Intel used a similar approach in a proven large-volume product, the 3-GHz Pentium 4 microprocessor, as described in “Designing a 3GHz, 130nm, Pentium® 4 Processor” [2] from the VLSI Symposium, 2002. This confirms that the cost overhead is manageable for large-volume consumer products. The same genetic algorithm was also applied to an analog IF filter, achieving similar yield and performance improvements, as shown in the third paper, “An AI-Calibrated IF Filter: A Yield Enhancement Method With Area and Power Dissipation Reductions” [3], from JSSC 2003. Other ways of post-silicon tuning, such as adaptive body biasing (ABB) and its associated clustering algorithms, are outlined in the fourth paper, “Design-Time Optimization of Post-Silicon Tuned Circuits Using Adaptive Body Bias” [4], from IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2008. The fifth paper, “TuneLogic: Post-Silicon Tuning of Dual-Vdd Designs” [5], from ISQED 2009, focuses on post-silicon tuning for dual-Vdd designs and the associated assignment algorithms. A table in the summary section compares the approaches of these five papers and highlights their key differences. Directions for further improvement and the key challenges faced by these approaches are also discussed in that section.
1. PAPER1--GENETIC ALGORITHM FOR
POST-SILICON TUNING
A GA-based clock adjustment architecture has been proposed and tested. The method adjusts clock delays through control register bits, and the values of those bits are set post-silicon by a genetic algorithm (GA). Two test chips were fabricated to evaluate the implementation: one small LSI chip and one medium-scale LSI chip featuring multipliers and memory-test-pattern generators. The variable delay is realized by a delay cell whose control voltage is selected by a simple DAC. The variable delay cell and DAC together use 30 transistors, so the area and complexity overhead is small. Delay adjustments as small as 30 ps were reported with this implementation.
Forty chips were fabricated, and a total of 80 multipliers and 80 memory-test-pattern generators were tested to evaluate the effectiveness of the GA. The experiment with the memory-test-pattern generators shows an 11% average enhancement in operating frequency and a maximum enhancement of 25% (from 1.0 to 1.25 GHz). The operational ratio of the memory-test-pattern generators at 1.4 GHz was only 15% before adjustment and rose to 90% after adjustment. Moreover, while no chips could operate at 1.58 GHz before adjustment, a few could after adjustment.
In these experiments, adjustment took 9 min per chip, but a careful analysis of the timing shows that almost all of that time was spent on communication between the PC and the interface board, which is not necessary with LSI testers. It was estimated that adjustment on LSI testers can be completed in less than 1 s. The GA search for a circuit converged, i.e., the delay settings for correct operation stabilized, after 14 iterations. Hence the additional tester overhead was shown to be small as well.
The adjustment also allows a lower supply voltage. Although the chips with a clock frequency of 1.25 GHz could all operate at 1.2 V, the operational yield fell to 0% when the supply was reduced to 0.8 V. After timing adjustment, however, the operational yield at 0.8 V was again 100%. This represents a significant reduction in power dissipation.
An analysis of the total design flow, broken into four main design steps, indicates that the GA-based approach could reduce the overall design time by 21% (23.0 person-days). Looking in more detail at the separate design stages, the GA-based approach achieves design-time reductions in the following areas. At the function design stage, the modeling of timing constraints is simplified. At the floor-planning stage, the time to assign pins is reduced. At the layout design and verification stages, the iterations required to satisfy timing constraints are reduced. For more complex large-scale integrations (LSIs), the improvements obtainable with the GA-based approach will be even more apparent, because timing-design aspects become more complicated and time consuming.
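The search over control register bits described above can be sketched in a few lines. This is a minimal illustration and not the implementation from [1]: the toy chip model (`make_toy_chip`), the population size, the mutation rate, and the number of generations are all assumptions made for the example, and the fitness here is simply the number of paths that pass in the toy model.

```python
import random


def make_toy_chip(n_cells=8, bits_per_cell=3, step_ps=30, margin_ps=20, seed=1):
    """Hypothetical stand-in for the tester: counts passing paths for a setting."""
    rng = random.Random(seed)
    # Hidden per-path clock skew caused by process variation (made-up values).
    skew_ps = [rng.uniform(-90, 90) for _ in range(n_cells)]
    mid_code = 2 ** (bits_per_cell - 1)

    def passing_paths(bits):
        count = 0
        for i, skew in enumerate(skew_ps):
            chunk = bits[i * bits_per_cell:(i + 1) * bits_per_cell]
            code = int("".join(str(b) for b in chunk), 2)
            # Each code step shifts the clock edge by step_ps around the mid code.
            residual = skew + (code - mid_code) * step_ps
            if abs(residual) <= margin_ps:
                count += 1
        return count

    return passing_paths


def ga_tune(fitness, start_bits, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Search for register bits that maximize fitness (assumed GA parameters)."""
    rng = random.Random(seed)
    n_bits = len(start_bits)
    # The untuned setting stays in the population, so the result is never worse.
    pop = [list(start_bits)]
    pop += [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size - 1)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        if fitness(ranked[0]) > fitness(best):
            best = ranked[0]
        parents = ranked[:pop_size // 2]
        children = [list(best)]  # elitism: carry the best setting forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    return max(pop + [best], key=fitness)


chip = make_toy_chip()
untuned = [1, 0, 0] * 8  # every delay cell at its mid code (no offset)
tuned = ga_tune(chip, untuned)
print(chip(untuned), "->", chip(tuned), "paths passing")
```

On a real tester the fitness call would be a pass/fail measurement of test patterns rather than a software model, which is why the number of fitness evaluations drives the tester overhead.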
There have been practical concerns about this approach, especially about the size of the programmable delay circuits and the adjustment times. However, this study has clearly demonstrated that the developed circuits are sufficiently small and that adjustment times are sufficiently short. Accordingly, these concerns do not constitute real obstacles to applying this approach in mass production. The current programmable delay circuits, however, cannot compensate for temperature and power-supply voltage fluctuations.
The remaining future tasks are 1) to incorporate the GA-based clock adjustment technique into EDA tools, such as clock tree synthesis (CTS), and 2) to build IPs for the programmable delay circuits for specialized EDA tools; both are already under development. By using EDA tools capable of handling the GA-based technique together with programmable delay circuits, users will be able to benefit from the three advantages described in the paper. Another important issue is how to deal with temperature variation after the calibration or tuning is done, since temperature variation plays an important role in circuit speed and power.
2. PAPER2--DESIGNING A 3GHZ, 130NM,
PENTIUM® 4 PROCESSOR
This IA32 processor is fabricated on a 130nm CMOS process with six layers of dual-damascene copper interconnect and contains 55M transistors. Because the integer execution core runs at twice the nominal frequency, in the 6 GHz range, one of the main objectives must be to keep the clock distribution from wasting power. Many microprocessor designs waste devices and wires trying to reduce all static inaccuracies in the design (load mismatches, wire mismatches, process variation, VCC variation), which ultimately wastes power. This IA32 processor incorporates a circuit technique to detect and correct static mismatches in the global clock network post-silicon. The CPU was divided into 52 clock domains. Using the detection circuitry and a tester interface, the static clock skew between global domains could be programmed to zero with a predetermined testing sequence. This enables the clock distribution to be designed with a minimal number of nets. Another key recognition is that no matter how good the timing is in pre-silicon simulation, the true worst-case paths on silicon will actually run faster if the clock skew is not zero or the clock duty cycle is not 50%. The circuitry used to correct static mismatches of the global clock network is therefore used in combination with a genetic algorithm on the tester, which looks at the maximum frequency of thousands of patterns to search out the best combination of built-in clock skew between local clocks and the best duty cycle for the global and certain local clocks to maximize frequency.
As is usual for an industry paper, not much detail is disclosed, such as the details of the algorithm or the tester overhead. However, we can infer that the overhead is small since this processor is a large-volume product; otherwise Intel would not use such an approach, for cost reasons. This further confirms that post-silicon tuning has actually been adopted in large-volume consumer products.
3. PAPER3--AN AI-CALIBRATED IF FILTER:
A YIELD ENHANCEMENT METHOD WITH
AREA AND POWER DISSIPATION REDUCTIONS
An inherent problem in implementing analog
large-scale integration (LSI) or integrated circuits is that the values of manufactured analog
circuit components often differ from the precise design specifications. Such discrepancies
cause poor yield rates, especially for high-end
analog circuits. For example, in intermediate
frequency (IF) filters, which are widely used in
cellular phones, even a 1% discrepancy from
the center frequency is unacceptable. It is,
therefore, necessary to carefully examine the
analog LSIs and to discard any which do not
meet the specifications.
In this paper, the authors propose an artificial intelligence (AI) calibration method that corrects these variations in analog circuit values using genetic algorithms (GAs). The method provides three advantages:
1) Enhanced yield rates: If an analog LSI is found not to satisfy specifications, the GA is executed at wafer-sort testing to bring the defective analog circuit components in line with specifications. Once calibrated, the circuit values are fixed by laser trimming.
2) Smaller circuits and less power consumption: The conventional approach to overcoming discrepancies in component values in analog LSIs has been to use large components. However, this requires more area, which means higher manufacturing costs and greater power consumption. In contrast, with the proposed approach, the area of the analog circuits can be made smaller, because chip performance can be calibrated after production with the GA software.
3) Integration of peripheral circuits into LSIs: Process technology allows for smaller and smaller implementations. However, because it is extremely difficult to integrate analog circuits with high-end specifications due to process variations, these analog circuits are usually made as separate components such as ceramic filters. In contrast, with the proposed method, it is possible to integrate these analog circuits on a single LSI, resulting in cost reduction and circuit area reduction.
Although many filter structures could provide the desired frequency response, with a 455-kHz center frequency and a 21-kHz bandwidth, the authors adopted an 18th-order linear filter organization consisting of three cascaded blocks. Each of the three filter blocks has an all-pole sixth-order leapfrog structure. The GA adjusts the biasing current of each filtering stage. The GA chromosome for the whole IF filter is 39x6 bits, with each 6-bit string determining the transconductance value of one Gm amplifier. The fitness function consists of a weighted sum of gain deviations and group delay differences measured around the center frequency.
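The chromosome layout and the weighted-sum fitness described above can be sketched as follows. This is an illustrative reading only: the function names, the weights, the units, and the choice to measure group-delay differences as spread around the mean are assumptions, since the survey does not give the exact formula from [3]. The value is written as a cost, so lower is better.

```python
def decode_chromosome(bits, width=6):
    """Split a flat bit list into one integer code per Gm amplifier."""
    if len(bits) % width != 0:
        raise ValueError("chromosome length must be a multiple of the code width")
    return [
        int("".join(str(b) for b in bits[i:i + width]), 2)
        for i in range(0, len(bits), width)
    ]


def filter_cost(gain_db, target_gain_db, group_delay_us, w_gain=1.0, w_delay=1.0):
    """Weighted sum of gain deviations and group-delay spread (assumed form)."""
    # Gain deviation: distance of each measured point from its target value.
    gain_error = sum(abs(g - t) for g, t in zip(gain_db, target_gain_db))
    # Group-delay difference: spread of the measured delays around their mean.
    mean_delay = sum(group_delay_us) / len(group_delay_us)
    delay_error = sum(abs(d - mean_delay) for d in group_delay_us)
    return w_gain * gain_error + w_delay * delay_error


codes = decode_chromosome([0] * (39 * 6))  # 39 amplifiers, 6 bits each
cost = filter_cost([-3.5, 0.2, -2.8], [-3.0, 0.0, -3.0], [10.0, 12.0, 11.0], w_delay=0.5)
print(len(codes), round(cost, 3))
```

The relative weights decide whether the search favors a flat passband or a flat group delay, which is one reason the choice of fitness function matters so much for the final yield.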
As a result of this new design method employing GA calibration, filter area was reduced by 63% compared with existing commercial products (from 3.36 to 1.26 mm²), resulting in a 26% reduction in power dissipation (from 3.4 to 2.5 mA; 38% reduction for the Gm amplifiers). As a further improvement, the authors subsequently achieved a 100% yield rate in GA calibration experiments involving only 18 parameters (three bits each). As a result, a commercial chip has been designed with adjustment circuitry for 18x3-bit registers (instead of 39x3 in the prototype chip) and 18 DACs within an area of 0.41 mm². Thus, even including the area required for calibration, the total chip area is reduced by 49%.
On average, the GA was able to find an optimal set of architecture bits and terminate after 100 measurement iterations, within a few seconds on an LSI tester. Clearly, GA calibration does not represent an obstacle to mass production. Of the 30 test chips, 29 could be calibrated to satisfy specifications (i.e., a 97% yield rate), although none of these chips conformed prior to calibration, and the remaining chip was successfully calibrated with additional iterations. This result is consistent with statistical simulations of GA calibration, in which 952 out of 1000 virtual chips were successfully calibrated. For comparison, the authors conducted calibration experiments with a conventional hill-climbing method instead of the GA. A run terminated after fitness was evaluated 100 times. With this method, only 17 chips (57%) could meet the specifications. These results suggest the effectiveness of the GA in avoiding local minima in the evaluation (fitness) function.
Compared with the first paper, the different nature of analog circuits shifts the focus of the genetic algorithm to yield and area/die cost, whereas the first paper focuses on the combination of yield, speed and power. As is evident from these comparisons, a good fitness evaluation function is the key to the successful deployment of a GA.
4. PAPER4--DESIGN-TIME OPTIMIZATION
OF POST-SILICON TUNED CIRCUITS USING ADAPTIVE BODY BIAS
The major trade-off in digital circuits is delay versus power, and both metrics are related to the device threshold voltage. A high-threshold device is slower, but its sub-threshold leakage power is lower, since leakage falls exponentially as the threshold voltage rises. To summarize, low delay favors a low threshold voltage, and low leakage power favors a high threshold voltage. The threshold voltage of a device can be adjusted via process engineering or by tuning the body bias: reverse body bias increases the threshold voltage, while forward body bias lowers it. Adaptive body biasing is a powerful technique that allows post-silicon tuning of individual manufactured dies so that each die optimally meets its delay and power constraints.
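The delay versus leakage trade-off can be made concrete with two textbook first-order models. These are assumptions brought in for illustration and are not taken from [4]: sub-threshold leakage is modeled as exp(-Vth / (n * vT)) and gate delay by the alpha-power law, Vdd / (Vdd - Vth)^alpha. The parameter values (reference threshold, slope factor n, alpha, supply voltage) are likewise placeholders.

```python
import math


def relative_leakage(vth, vth_ref=0.30, n=1.5, v_thermal=0.026):
    """Sub-threshold leakage relative to a device with threshold vth_ref."""
    return math.exp(-(vth - vth_ref) / (n * v_thermal))


def relative_delay(vth, vdd=1.2, vth_ref=0.30, alpha=1.3):
    """Alpha-power-law gate delay relative to a device with threshold vth_ref."""
    return ((vdd - vth_ref) / (vdd - vth)) ** alpha


for vth in (0.25, 0.30, 0.35):
    print(f"Vth={vth:.2f} V  leakage x{relative_leakage(vth):.2f}  "
          f"delay x{relative_delay(vth):.3f}")
```

Under these assumed models, a small shift in threshold voltage changes leakage by a large factor while delay moves only slightly, which is what makes body bias an attractive post-silicon tuning knob.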
Assigning an individual bias control to each gate leads to severe overhead, since each device needs to be sized much larger to accommodate the deep N-well, rendering the method impractical. However, assigning a single bias control to all gates in the circuit prevents the method from compensating for intra-die variation and greatly reduces its effectiveness. In this paper, the authors propose a new variability-aware method that clusters gates at design time into a handful of carefully chosen independent body-bias groups, which are then individually tuned post-silicon for each die. A certain intra-die variation model, such as a linear gradient, is assumed, and a quadratic programming method is proposed to correlate body bias with speed and leakage. Multiple Monte Carlo runs were performed with the assumed statistical model. The correlation of body bias with speed and leakage is calculated for each gate in each Monte Carlo run, and a probability density function (PDF) for each gate is generated from those runs. Gates with similar PDFs are clustered together at design time by a greedy algorithm that considers both speed and leakage. The authors show that this yields near-optimal performance and power characteristics with minimal overhead. About 400 Monte Carlo runs appear to be sufficient: the authors showed that the results converged with about 400 runs for a fairly large circuit, the Viterbi decoder. Different numbers of clusters were studied, and the results show that the performance improvement diminishes as more clusters are added. A 2 or 3 cluster solution already improves the speed by 1X and the leakage power by
The authors study the physical design constraints and show how the area and wire-length overhead can be significantly limited using the proposed method. Compared with a fixed, design-time dual-threshold-voltage assignment method, the authors improve leakage power by 38% to 68% while simultaneously reducing the standard deviation of delay by two to nine times. Speed also improved by 20% on average.
The clustering algorithm was integrated into the placer Capro, which has ECO capability. The cells/gates are first placed by Capro with initial constraints and then moved by ECO so that they are clustered together. The displacement of gates is less than 15%, with no major routing overhead or timing violations.
Future work includes carrying out actual post-silicon tuning to measure the overhead of running it on an LSI tester.
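To give a feel for what the post-silicon step might cost on a tester, the following sketch searches for the strongest reverse bias setting that still meets timing, using a binary search over discrete bias codes. It is a simplified, hypothetical illustration: it tunes a single body-bias group, it assumes that timing only gets worse as the code increases, and `meets_timing` stands in for a real at-speed measurement. None of these details are taken from [4].

```python
def tune_body_bias(meets_timing, n_levels=16):
    """Return the largest bias code that still meets timing, or None.

    Assumes codes are ordered from least to most reverse bias, so a higher
    code means lower leakage but a slower die (monotonic behavior).
    """
    lo, hi = 0, n_levels - 1
    if not meets_timing(lo):
        return None  # the die fails even with the fastest setting
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if meets_timing(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


measurements = []


def toy_die(code):
    """Hypothetical die that meets timing for bias codes 0 through 9."""
    measurements.append(code)
    return code <= 9


best_code = tune_body_bias(toy_die)
print(best_code, "found with", len(measurements), "tester measurements")
```

With 16 bias levels the search needs only a handful of measurements per group, so under these assumptions the tester time grows with the number of body-bias clusters rather than with the number of gates.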
5. PAPER5--TUNELOGIC: POST-SILICON
TUNING OF DUAL-VDD DESIGNS
Modern CMOS manufacturing processes have significant variability, which necessitates guard banding to achieve reasonable yield. The fundamental contribution the authors make is a dual-Vdd design style, with associated CAD algorithms, in which supply voltages are assigned to logic based on post-manufacturing analysis rather than designing with nominal values and guard banding.
The authors studied the power supply demand in a 64-bit multiplier and used SPICE simulation to show that the pass gate required is small, close to 1.75x the size of the NAND3x4 gates. Hence the dual-Vdd overhead is small. The authors perform a detailed case study of a custom-designed pipelined multiplier using realistic process data. Results show that, for comparable yield and target delay, the dual-Vdd design consumes significantly less power than a single-Vdd design. For example, to achieve 100% yield at the same target delay, the dual-Vdd design uses 23.6 pJ/multiply while a single-Vdd design uses 34.6 pJ/multiply. Initially all gates are assigned to the high supply voltage, and a greedy algorithm then tries to move as many gates as possible to the lower Vdd based on at-speed testing.
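The greedy assignment just described can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the algorithm from [5]: the visiting order, the block names, the delay numbers, and the single-path timing model used as the at-speed test are all made up for illustration.

```python
def greedy_vdd_assignment(blocks, passes_at_speed):
    """Start with every block on high Vdd and move blocks to low Vdd greedily."""
    assignment = {block: "high" for block in blocks}
    for block in blocks:
        assignment[block] = "low"  # tentatively lower the supply
        if not passes_at_speed(assignment):
            assignment[block] = "high"  # revert if the chip no longer passes
    return assignment


# Hypothetical per-block delays in ns at (high Vdd, low Vdd) on one critical path.
DELAYS = {"a": (1.0, 1.4), "b": (2.0, 2.6), "c": (1.5, 1.9)}
TARGET_NS = 5.0


def toy_at_speed_test(assignment):
    """Stand-in for an at-speed test: the path delay must fit the target."""
    total = sum(
        DELAYS[block][0 if supply == "high" else 1]
        for block, supply in assignment.items()
    )
    return total <= TARGET_NS


result = greedy_vdd_assignment(list(DELAYS), toy_at_speed_test)
print(result)
```

Each candidate move costs one at-speed test, so in this sketch the tester time grows linearly with the number of independently switchable blocks.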
Future work includes designing the at-speed
testing and verifying the algorithm’s overhead.
6. CONCLUSION
Various post-silicon tuning methods were surveyed. It was shown that the tuning/calibration overhead can be made small by exploiting the targeted design or by using optimization algorithms that cluster or group multiple cells together to achieve power and speed benefits. The methods and their characteristics are summarized in the table below.
Future work mostly concerns the practical side of post-silicon tuning. The tester overhead must be verified; otherwise no large-volume product can use these approaches, since testers are so expensive that every minute of tester overhead directly adds 5 to 20 cents of extra cost to the chip. Process variations can be taken into account to speed up the post-tuning algorithm and thus reduce tester overhead. The additional tester time and cost need to be justified by performance improvement or power reduction. In addition, how to deal with temperature variation after post-silicon tuning is a challenging problem. How to integrate the clustering/assignment algorithms with existing CAD tools such as placers and routers is also a big practical problem, since those algorithms need to be applied to a number of different applications. Paper 4 integrated its clustering algorithm with the placer Capro, which has ECO capabilities; this may serve as an example of how to integrate with existing CAD tools.
TABLE 1. METHOD & IMPLEMENTATION COMPARISON
Optimization Method
  Paper1: Genetic algorithm (post fabrication)
  Paper2: Genetic algorithm (post fabrication)
  Paper3: Genetic algorithm (post fabrication)
  Paper4: Greedy for clustering at design time, binary search during post-silicon
  Paper5: Greedy for voltage domain assignment at design time, worst-case sweep post-silicon
Purpose
  Paper1: To control clock delay
  Paper2: To control clock skew
  Paper3: To adjust Gm cell currents
  Paper4: To adjust body bias to change delay and leakage
  Paper5: To enable dual-Vdd design to save power
Benefits
  Paper1: Improves speed by 11% and yield by 51%
  Paper2: Improves speed by an average of 7%
  Paper3: Reduces area by 63%, improves yield from 0 to 95%, reduces power by 27%
  Paper4: Improves leakage power by 38-68%
  Paper5: Achieves 100% yield, reduces power by 30%
Overheads
  Paper1: 1 min of tester time
  Paper2: n/a
  Paper3: Better than hill-climbing
  Paper4: n/a
  Paper5: n/a
REFERENCES
[1] E. Takahashi, Y. Kasai, M., and T. Higuchi, "Post-Fabrication Clock-Timing Adjustment Using Genetic Algorithms," IEEE Journal of Solid-State Circuits, vol. 39, pp. 643-650, April 2004.
[2] D. Deleganes, J. Douglas, B. Kommandur, M. Patyram, "Designing a 3GHz, 130nm, Pentium® 4 Processor," Symp. VLSI Circuits Dig. Papers, pp. 130-133, June 2002.
[3] M. Murakawa, T. Adachi, Y. Niino, Y. Kasai, E. Takahashi, K. Takasuka, and T. Higuchi, "AI-Calibrated IF Filter: A Yield Enhancement Method With Area and Power Dissipation Reductions," IEEE Journal of Solid-State Circuits, vol. 38, pp. 495-502, March 2003.
[4] S. H. Kulkarni, D. M. Sylvester, and D. T. Blaauw, "Design-Time Optimization of Post-Silicon Tuned Circuits Using Adaptive Body Bias," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 3, pp. 481-494, March 2008.
[5] S. Bijansky, S. K. Lee, and A. Aziz, "TuneLogic: Post-Silicon Tuning of Dual-Vdd Designs," Intl. Symp. on Quality of Electronic Design, pp. 394-400, 2009.