Download Lecture 6 - Courses - University of British Columbia

Post-Silicon Debug and Error Correction
Using Embedded Reconfigurable Logic
Brad Quinton,
Dept. of Electrical and Computer Engineering,
University of British Columbia
Vancouver, BC
Your laptop does not work...!
• ...at least, as originally intended.
• Biggest semiconductor company in the world.
• Flagship product.
• 100s of millions of $ invested.
• Intel Core 2 Duo:
– 50-page errata
– 75 known bugs (as of Jan. 2008)
  • 19 require BIOS changes
  • 34 require software changes
  • 33 have no known workaround...
2
• “Intel Core 2 Duo Processor E8000 and E7000 Series - Specification Update”, Intel Corp., Jan 2008.
3
Before you trade in your CPU
• AMD is not different.
• AMD Opteron:
– 95-page errata
– 71 known bugs
• No one is immune to bugs.
• Hardware is not a special case.
4
The Culprit: SoC Complexity
• The complexity of System-on-Chip (SoC) design continues to increase dramatically:
– Moore’s Law scaling enables an ever-increasing integration of functionality on a single chip
– And, the demand for low power and low cost devices continues to drive this integration (iPods, cell phones, automotive, notebooks…)
• Almost all large ICs are becoming SoCs, even “stand alone” processors…
5
SoC Development
“Pre-Silicon”: Verification (complexity)
~ $1.1 Million
“Post-Silicon”: Validation (complexity + visibility)
8
High Stakes
• The cost of a validation escape is enormous (the Pentium FDIV bug cost Intel $475 million)
• However, time-to-market pressure also hits its peak during the validation phase
• The validation process suffers from high complexity with low visibility…
10
A Solution: Design for Debug
• We propose a new Design for Debug (DFD) infrastructure based on embedded reconfigurable logic
12
Outline
1. The Need for Post-Silicon Debug
2. Existing Debug Solutions
3. Our New DFD Infrastructure
a. Overview
b. Network Topology
c. Network Implementation
d. Programmable Logic Interface
e. Overall Area Costs
4. DFD Summary and Research Directions
13
The Need for Post-Silicon Debug
14
Verification
• There will always be bugs that escape verification and end up in the silicon device
• Why?
– Simulations are many orders of magnitude slower than real operation
– Formal verification is restricted to well-defined subsets of the device
– Much of the real input stimulus is difficult to model, or is not well understood at design time
– Human error…
15
Intel Pentium 4
• Intel Pentium 4 Verification/Validation [1]:
– 6,000 CPUs, running simulations 24/7 for 2 years, produced less than 1 minute of real-time operation
– A 60 person-year formal verification effort found about 20 high-quality bugs
– 210441 distinct configurations in a “simple” out-of-order X86 processor model, but only 237 were covered in verification
– 10 months of validation from first silicon to release-to-production (by comparison, Intel aims at a new processor release every year…)
[1] B. Bentley, “Validating a modern microprocessor”, Proc. Int. Conf. CAV, July 2005.
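To put the simulation figure in perspective, the short calculation below estimates how many CPU-seconds of simulation were spent per second of real operation. It is only a back-of-the-envelope sketch: it treats "less than 1 minute" as exactly 60 seconds (so the result is a lower bound), and the function name `simulation_slowdown` is introduced here for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years


def simulation_slowdown(cpus, years, real_seconds):
    """Return (total CPU-seconds spent, CPU-seconds per second of real operation)."""
    cpu_seconds = cpus * years * SECONDS_PER_YEAR
    return cpu_seconds, cpu_seconds / real_seconds


# Figures from the slide: 6,000 CPUs for 2 years, under 1 minute of real time.
# Using 60 s makes the ratio a lower bound on the real slowdown.
cpu_seconds, ratio = simulation_slowdown(cpus=6_000, years=2, real_seconds=60)
print(f"CPU-seconds of simulation: {cpu_seconds:.2e}")
print(f"CPU-seconds per real second: at least {ratio:.1e}")
```

Under these assumptions the aggregate effort is on the order of billions of CPU-seconds per second of device operation, which is why real-time operation on silicon is such an advantage for validation.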
“The device won’t boot. Now what?!”
• The validation process inevitably follows this pattern:
1. The first packaged device arrives in the lab after manufacturing test is complete.
2. It is installed in a socket on a custom-designed printed circuit board (PCB): the “validation” board.
3. The validation engineer will power the device and attempt to start running basic tests.
4. Inevitably the device will not behave as expected.
5. The debug begins….
17
Visibility is Key
• The validation process has the advantage of real-time operation and real-world stimulus
• Unfortunately, it is severely hindered by the lack of internal visibility and control
• SoC integration has only increased the problem by moving busses and component interconnects inside the device
18
Existing Solutions
Existing Solutions
1. Software-Based - software monitor routines and processor-specific hardware allow some visibility
2. Test Feature-Based - the design-for-test (DFT) structures are re-purposed for functional debug
3. In-Circuit Emulation - a special “bond-out” version of the device is created that mirrors key internal signals on external device pins
4. On-chip Emulation - dedicated debug logic runs in parallel to the normal device logic (our solution)
21
Existing Solutions
• On-chip Emulation solves many of the problems with other methodologies:
– Dedicated circuits have little or no impact on the normal behaviour of the device
– The internal observability can be extended beyond the state of the software
– The debug logic can run at high speeds without the requirement of high-speed I/O
22
Existing Solutions - Industry
• IBM Cell BE - Trace Logic Analyzer (TLA) for storing and viewing internal signals. - IEEE TVLSI 2007
• AMD Opteron - HyperTransport Trace Buffer (HTTB) for observation of inter-core and inter-device transactions. - IEEE D&T 2007
• DAFCA ClearBlue - Proprietary DFD infrastructure targeting SoCs. - DAC 2006
23
Existing Solutions - Academic
• Hopkins and Maier - Trace message framework for multi-core processors - IEEE TVLSI 2006
• Anis and Nicolici - Lossy and lossless trace buffer compression schemes - ITC 2007
24
Our Proposal
Our Proposal
• The existing solutions are ad hoc and design-specific
• We are interested in a more universal solution
• At the heart of our proposal we use programmable logic to provide the flexibility and reconfigurability needed to extend debug throughout the SoC
26
Framework
Programmable Logic Cores (PLCs) are embedded blocks of reconfigurable logic: “embedded FPGAs”
27
High-level Architecture
28
High-level Architecture
Observability:
1. Select signals using the network
2. Process these signals with the PLC (triggers, compression..)
3. Return the test results
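The three observability steps can be pictured with a small software model. This is only a sketch of the idea, not the actual PLC circuit: the `TraceUnit` class, its trigger predicate, and the signal names `valid` and `addr` are all hypothetical. It keeps the last few samples of the selected signals in a circular buffer and freezes the buffer when a scenario-specific trigger fires.

```python
from collections import deque


class TraceUnit:
    """Hypothetical model of trigger + trace logic configured in the PLC."""

    def __init__(self, trigger, depth):
        self.trigger = trigger             # predicate over one sample of selected signals
        self.buffer = deque(maxlen=depth)  # circular buffer: keeps the last `depth` samples
        self.triggered = False

    def clock(self, sample):
        """Capture one sample per cycle until the trigger fires, then freeze."""
        if self.triggered:
            return
        self.buffer.append(sample)
        if self.trigger(sample):
            self.triggered = True

    def results(self):
        """Step 3: return the captured history (oldest sample first)."""
        return list(self.buffer)


# Step 1 (selection) is modelled by choosing which signals appear in each sample.
unit = TraceUnit(trigger=lambda s: s["valid"] and s["addr"] == 0x40, depth=4)
for cycle in range(10):
    unit.clock({"cycle": cycle, "valid": cycle % 2 == 0, "addr": 0x10 * cycle})

print([s["cycle"] for s in unit.results()])
```

Because the trigger is just a function passed in at construction time, changing the debug scenario means reconfiguring the predicate rather than redesigning the capture logic, which mirrors the role of the reconfigurable PLC.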
29
High-level Architecture
Signal Control:
1. Create circuits in the PLC that interact with the device
2. Selectively override signals using the network
3. Observe results
30
High-level Architecture
Error Detect/Correct:
1. Interrupt block output signals
2. Manipulate these signals using the PLC logic
3. Create new device behaviour
31
Our Proposal - Key Advantages
1. Enables the debug of arbitrary digital logic in the SoC.
2. Allows for reconfigurable, scenario-specific triggering, event filtering and trace compression.
3. Facilitates the detection and potential correction of design errors during normal operation.
35
Our Proposal - Overview
• There are many aspects to a full DFD solution:
– Hardware Infrastructure - architecture, design, implementation, cost overhead, etc. (This talk.)
– Debug Metrics - debug coverage requirements, identification and insertion of testpoints, etc. (On-going research.)
– Debug Software - control of debug resources, recreation of signal state, etc. (On-going research.)
38
Hardware Infrastructure
39
Hardware Infrastructure
1. Network Topology
2. Network Implementation
3. Programmable Logic Interface
4. Overall Area
40
Network Topology
Network Requirements
• To maximize the potential for debugging, the network should be non-blocking
– For example: the selection of pin ‘a’ on block #1 should not prevent simultaneous selection of pin ‘b’ on block #2
• This could be an expensive requirement; however, the configurability of the PLC has the potential to reduce the network cost
42
Network Flexibility
• Networks of this type are well known: Clos, Benes, etc.
• These non-blocking networks are often called permutation networks
43
Network Flexibility
44
Network Flexibility
• Each pin on the PLC is equivalent
45
Concentrator Networks
• A network that matches these requirements has been defined in previous network theory research
• A concentrator network provides full connectivity and takes advantage of the I/O flexibility of the PLC
• An (n,m)-concentrator is defined as: a network with n inputs and m outputs, with m ≤ n, for which every set of k ≤ m inputs can be mapped to some k outputs, but without the ability to distinguish between those outputs
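The definition can be checked by brute force on a toy example. The sketch below assumes a single-stage sparse crossbar, described by the set of outputs each input can reach; this is a simplification chosen for illustration and is not the multi-stage topology used in this work. The helper names `can_route` and `is_concentrator` are hypothetical. A set of inputs is routable if each input can be given its own distinct output, which is a bipartite matching problem.

```python
from itertools import combinations


def can_route(inputs, reach):
    """True if every input in `inputs` can be given its own distinct output."""
    owner = {}  # output -> input currently assigned to it

    def assign(i, seen):
        # Augmenting-path search: try each reachable output, displacing the
        # current owner if that owner can be moved elsewhere.
        for o in reach[i]:
            if o in seen:
                continue
            seen.add(o)
            if o not in owner or assign(owner[o], seen):
                owner[o] = i
                return True
        return False

    return all(assign(i, set()) for i in inputs)


def is_concentrator(reach, m):
    """Check the (n,m)-concentrator property: every set of k <= m inputs is routable."""
    n = len(reach)
    return all(can_route(subset, reach)
               for k in range(1, m + 1)
               for subset in combinations(range(n), k))


full = [{0, 1}] * 4                      # full 4x2 crossbar: 8 crosspoints
sparse = [{0}, {1}, {0, 1}, {0, 1}]      # 6 crosspoints
broken = [{0}, {0}, {1}, {1}]            # inputs 0 and 1 collide on output 0

print(is_concentrator(full, 2), is_concentrator(sparse, 2), is_concentrator(broken, 2))
```

The sparse example works only because it does not matter which output a given input lands on; that is exactly the freedom the equivalent pins of the PLC provide.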
46
Concentrator Networks
• Theoretical proofs have shown that it is possible to implement an n-input concentrator with O(n) crosspoints
• In contrast, an ordered (or permutation) network must have Ω(n lg n) crosspoints
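To give a feel for n lg n growth, the sketch below counts crosspoints in a Benes network, one well-known permutation network, and compares it with a full crossbar. It assumes n is a power of two and that each 2x2 switch is counted as 4 crosspoints; both the counting convention and the function names are choices made for this illustration, and no linear-cost concentrator is modelled because its constants are not given here.

```python
def crossbar_crosspoints(n):
    """Full n x n crossbar: one crosspoint per input/output pair."""
    return n * n


def benes_crosspoints(n):
    """Benes network on n = 2^k inputs: (2*lg(n) - 1) stages of n/2 2x2 switches."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    lg_n = n.bit_length() - 1
    stages = 2 * lg_n - 1
    return stages * (n // 2) * 4  # assumption: 4 crosspoints per 2x2 switch


for n in (8, 64, 1024):
    b = benes_crosspoints(n)
    print(n, crossbar_crosspoints(n), b, b // n)  # last column: crosspoints per input
```

The crosspoints-per-input column keeps growing with lg n for the permutation network, whereas a linear-cost network would keep it constant.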
47
Concentrator Networks
• A lot of research effort has been spent defining an explicit construction of a linear cost concentrator
• In the end, linear cost concentrators are not practical for small n
• However, it is possible to implement a concentrator with ~ 1/2 the area and depth of a permutation network for smaller values of n
48
Area Cost
49
Depth
50
Network Topology Summary
• Demonstrated that concentrator networks provide an advantage for this application
• Described a new concentrator network topology with an area savings and depth reduction
• Detailed Results:
B.R. Quinton and Steven J.E. Wilton, “Concentrator Access Networks for Programmable Logic Cores on SoCs”, IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005.
51
Network Implementation
Implementation
Figure labels: “local to each block”; “spans entire device or region”
55
Asynchronous Interconnect
• In modern process technologies wire delay can be significant with respect to gate delay; this makes communication that spans the entire die more complex
• Classic Synchronous Solution: Pipelining
– difficult global clock construction
• Asynchronous Techniques: Self Clocking
– no global clock requirement
57
Basic Structure
• By coordinating transfers between the source and destination, asynchronous techniques avoid the requirement of a global clock
58
Data Formats
• Two broad categories:
1) Bundled-data
  • control signaling is separate from the data
  • requires delay-matching*
2) Delay-insensitive
  • control signaling encoded with the data
  • no delay-matching* required
* Arbitrary delay-matching is a difficult CAD problem, and is not supported by most tools.
59
Basic Design - Data Encoding
• Many data encodings are possible for delay-insensitive circuits
• We choose ‘dual-rail’ encoding to minimize the depth of the control decode
• ‘Dual-rail’ encodings allow bit transitions to be detected with a simple XOR gate.
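One dual-rail code with this XOR property is level-encoded dual-rail (LEDR), sketched below. The slides do not say which dual-rail variant is used, so LEDR is an assumption made purely for illustration, and the class names are hypothetical. Each bit travels on a value rail and a parity rail; the XOR of the two rails is the phase, which flips every time a new bit is sent, so the receiver detects arrival with a single XOR.

```python
class LedrSender:
    """Two-phase dual-rail sender: exactly one rail toggles per transferred bit."""

    def __init__(self):
        self.phase = 0
        self.rails = (0, 0)  # (value rail, parity rail); value XOR parity == phase

    def send(self, bit):
        self.phase ^= 1                       # every new token flips the phase
        self.rails = (bit, bit ^ self.phase)  # value rail carries the data directly
        return self.rails


class LedrReceiver:
    def __init__(self):
        self.phase = 0  # phase of the last token accepted

    def sample(self, rails):
        """Return the new bit if one has arrived, else None."""
        value, parity = rails
        if (value ^ parity) == self.phase:    # the "simple XOR gate" check
            return None                       # rails unchanged since the last token
        self.phase ^= 1
        return value


tx, rx = LedrSender(), LedrReceiver()
received, toggles = [], []
for bit in [1, 1, 0, 0, 1]:
    before = tx.rails
    after = tx.send(bit)
    toggles.append((before[0] != after[0]) + (before[1] != after[1]))
    received.append(rx.sample(after))

print(received, toggles)
```

Because the data is recovered directly from the value rail and validity from one XOR, the control decode stays shallow, and no separate request wire needs to be delay-matched against the data.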
61
Basic Design - Sequential Gates
• We use a flip-flop based design to conform to standard IP and CAD tools
• 2 flops/bit are required because the data is encoded
62
Representative ICs
• We implemented networks on 9 representative
ICs using a TSMC 0.18µm process:
– 3 core die sizes:
• 3830x3830 µm (~1 million gates),
• 8560x8560 µm (~5 million gates),
• 12090x12090 µm (~10 million gates)
– 3 different block partitions:
• 16 blocks
• 64 blocks
• 256 blocks
63
Block / Network Placement
Figure labels: Second Stage Concentrator; Programmable Debug Logic
64
Throughput - No Global Clock
65
Power - 350 MHz
67
Area - 350 MHz
69
Implementation Summary
• Created a new asynchronous interconnect implementation
• Showed that for large, high-speed ICs it is possible to achieve a high throughput with asynchronous interconnect
• Quantified the relative performance and design costs of this new implementation versus classic synchronous pipelining
• Detailed Results:
B.R. Quinton, Mark R. Greenstreet and Steven J.E. Wilton, “Practical Asynchronous Interconnect Network Design”, accepted for publication in the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008.
70
Programmable Logic Interface
71
Hardware Infrastructure
Figure labels: Fast; Direct Synchronous; Slow; Fast; System Bus
73
Programmable Logic I/F
Figure labels: fixed logic (8-bit @ 500 MHz); interface; programmable logic (32-bit @ 125 MHz)
• The interface between the fixed function and programmable logic is a challenge
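The two figures on this slide carry the same bandwidth, so the interface has to trade width for clock rate. The sketch below checks that arithmetic and models a 1:4 widening step in software. It assumes the 8-bit @ 500 MHz side is the fixed logic and the 32-bit @ 125 MHz side is the programmable logic; the byte ordering and the function names are also assumptions made for illustration.

```python
def bandwidth_mbps(width_bits, clock_mhz):
    """Raw bandwidth in Mbit/s for a parallel interface."""
    return width_bits * clock_mhz


def widen(bytes_in, ratio=4):
    """Pack `ratio` consecutive 8-bit words into one wide word.

    Assumption: the first byte received lands in the least significant bits.
    A trailing group with fewer than `ratio` bytes is left unpacked.
    """
    words = []
    for i in range(0, len(bytes_in) - ratio + 1, ratio):
        word = 0
        for lane, b in enumerate(bytes_in[i:i + ratio]):
            word |= (b & 0xFF) << (8 * lane)
        words.append(word)
    return words


fast_side = bandwidth_mbps(8, 500)    # narrow, fast side
slow_side = bandwidth_mbps(32, 125)   # wide, slow side
print(fast_side, slow_side)
print([hex(w) for w in widen([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])])
```

The packing itself is simple; the difficulty is that the narrow side of such a circuit must run at the fast clock, which is what makes the interface a challenge for programmable logic.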
75
Programmable Logic I/F
• Two potential strategies for implementing rate adaptive interfaces:
1. Use the existing programmable fabric - Not Fast Enough.
2. Design fixed function circuits - Not Flexible Enough.
• Instead, we propose adding new programmable structures to the underlying fabric
79
Programmable Logic I/F
Shadow Cluster:
• We create new programmable structures integrated in the clustered logic blocks (CLBs) of the programmable fabric
System Bus I/F
82
Programmable Logic I/F
Benchmark      Size    W    Base Arch.      New Arch.       Incr.
                            (10^6 trans.)   (10^6 trans.)   (%)
alu4_apb       21x21   28   4.34            4.36            0.43
apex2_apb      23x23   34   6.14            6.16            0.36
apex4_apb      19x19   39   4.72            4.74            0.42
bigkey_apb     25x25   26   5.81            5.85            0.78
clma_apb       48x48   43   32.2            32.4            0.33
des_apb        25x25   30   6.51            6.55            0.73
diffeq_apb     20x20   29   4.09            4.10            0.24
dsip_apb       23x23   28   5.20            5.25            0.79
elliptic_apb   31x31   42   13.3            13.3            0.25
ex1010_apb     36x36   35   15.3            15.3            0.24
ex5p_apb       18x18   34   3.78            3.80            0.47
frisc_apb      31x31   41   13.0            13.0            0.26
misex3_apb     20x20   31   4.29            4.31            0.44
pdc_apb        35x35   53   20.6            2.07            0.21
s298_apb       23x23   24   4.61            4.63            0.43
s38417_apb     41x41   29   17.0            17.0            0.23
s38584.1_apb   42x42   34   20.3            20.4            0.41
seq_apb        22x22   37   6.04            6.06            0.38
spla_apb       32x32   47   15.5            15.6            0.23
tseng_apb      18x18   25   2.94            2.97            1.09
avg:                                                        0.44
Using shadow clusters, the overhead for new programmable circuits is very low.
85
Programmable Logic I/F
The timing improvements were significant:
System Bus Interfaces: 36.4% (144 MHz -> 294 MHz)
Direct Synchronous Interfaces: 68.0% (217 MHz -> 694 MHz)
86
Interface Summary
• Developed new programmable structures that enhance interface timing for embedded programmable logic
• Showed that these structures require a very small overhead (< 1%)
• Demonstrated that the routing architecture was not adversely impacted (channel width decrease…)
• Detailed Results:
B.R. Quinton and Steven J.E. Wilton, “Programmable Logic Core Enhancements for High Speed On-Chip Interfaces”, in review for IEEE Transactions on VLSI, 2008.
Post-Silicon Debug Area Overhead / Cost
Area Overhead
• To understand the area overhead of our scheme for a range of ICs we created a set of parameterized models
• We used a 90nm standard cell process
• We targeted the 90nm IBM/Xilinx PLC with a capacity of approximately 10,000 ASIC gates
• The network was implemented using standard cells
• All area numbers are post-synthesis, but pre-layout
89
Area Overhead - Overall
• 20M gate device, 7200 signals for ~ 5% overhead
91
Post-Silicon Debug Key Results
• We have shown that it is feasible to integrate a PLC in a fixed-function IC in such a way that it could be used to assist post-silicon debug.
• We have shown that for many ICs the area overhead of this scheme is well below 10%
• Detailed Results:
B.R. Quinton and Steven J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores”, IEEE International Conference on Field-Programmable Technology, Singapore, Dec. 2005.
92
DFD Summary And Research Directions
Summary - Design for Debug
• Reconfigurable Design for Debug (DFD):
– observe
– control
– detect/correct
95
What’s Next?
• We’ve demonstrated the feasibility of the basic components, now we need to tackle:
– the detailed usage model
– the integration with the existing SoC infrastructure
• Goal: A multi-purpose, reconfigurable platform that enables observation of and interaction with the internals of an SoC.
96
Signal Selection/Reconstruction
• It is not possible to insert enough debug logic to observe all the signals in a circuit
• However, it is possible to infer the values of many signals using the values of key signals and the circuit structure
• Key Questions:
– Given a circuit, what signals should be observed?
– Given a set of observations, what can be inferred about other signals in the circuit?
– How should triggers and trace compression be structured to facilitate these inferences?
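As a small illustration of the second question, the sketch below infers unobserved signal values from observed ones by forward propagation through a gate-level netlist, with None standing for an unknown value. It is a toy example under stated assumptions, not the reconstruction method under investigation: the netlist, the signal names, and the `infer` function are hypothetical, and backward implication (reasoning from outputs to inputs) is deliberately left out.

```python
def eval_gate(op, values):
    """Evaluate a gate over values in {0, 1, None}; None means unknown."""
    if op == "AND":
        if 0 in values:
            return 0                      # a single 0 forces the output
        return 1 if all(v == 1 for v in values) else None
    if op == "OR":
        if 1 in values:
            return 1                      # a single 1 forces the output
        return 0 if all(v == 0 for v in values) else None
    if op == "NOT":
        return None if values[0] is None else 1 - values[0]
    raise ValueError(f"unsupported gate: {op}")


def infer(netlist, observed):
    """Propagate observed values forward until no new signal can be determined."""
    known = dict(observed)
    changed = True
    while changed:
        changed = False
        for out, (op, ins) in netlist.items():
            if out in known:
                continue
            value = eval_gate(op, [known.get(i) for i in ins])
            if value is not None:
                known[out] = value
                changed = True
    return known


# Hypothetical circuit: n1 = a AND b, n2 = NOT n1, n3 = n2 OR c
netlist = {
    "n1": ("AND", ["a", "b"]),
    "n2": ("NOT", ["n1"]),
    "n3": ("OR", ["n2", "c"]),
}
print(infer(netlist, {"a": 0}))   # observing only `a` determines n1, n2 and n3
print(infer(netlist, {"a": 1}))   # here nothing else can be inferred
```

The two calls show why signal selection matters: how much can be reconstructed depends on which signals are observed and on the values they happen to take.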
97
DFT and BIST Integration
• Design for Test (DFT) and Built-in Self Test (BIST) circuits are intended to detect manufacturing defects
• Some of our DFD structures replicate the signal observability and control in DFT and BIST
• Key Questions:
– Can we reduce some of the DFT/BIST overhead by relying on existing DFD structures?
– Can we use the embedded programmable logic to generate and evaluate DFT vectors?
– Can we unify DFT and DFD?
98
SoC Feature Enhancement
• Our proposal naturally connects the key nodes in the SoC “hardware” with the embedded processor on the system bus
• It also provides a programmable intermediary between hardware and software
• Potential opportunities:
– Custom, programmable hardware-managed cache or data structures?
– Performance monitors, event timers, real-time processing, customizable hardware interrupt manager?
99
End.