Download Lecture 6 - Courses - University of British Columbia

Post-Silicon Debug and Error Correction Using Embedded Reconfigurable Logic

Brad Quinton
Dept. of Electrical and Computer Engineering
University of British Columbia, Vancouver, BC

Your laptop does not work...!
• ...at least, as originally intended.
• Biggest semiconductor company in the world.
• Flagship product.
• 100s of millions of $ invested.
• Intel Core 2 Duo:
  – 50-page errata
  – 75 known bugs (as of Jan. 2008)
    • 19 require BIOS changes
    • 34 require software changes
    • 33 have no known workaround...
• Source: “Intel Core 2 Duo Processor E8000 and E7000 Series - Specification Update”, Intel Corp., Jan. 2008.

Before you trade in your CPU
• AMD is no different.
• AMD Opteron:
  – 95-page errata
  – 71 known bugs
• No one is immune to bugs.
• Hardware is not a special case.

The Culprit: SoC Complexity
• The complexity of System-on-Chip (SoC) design continues to increase dramatically:
  – Moore’s Law scaling enables an ever-increasing integration of functionality on a single chip
  – And the demand for low-power, low-cost devices continues to drive this integration (iPods, cell phones, automotive, notebooks…)
• Almost all large ICs are becoming SoCs, even “stand-alone” processors…

SoC Development
[Figure: SoC development flow, labelled in order: “Pre-Silicon” Verification (complexity); ~ $1.1 Million; “Post-Silicon” Validation (complexity + visibility)]

High Stakes
• The cost of validation escapes is enormous (the Pentium FDIV bug cost Intel $475 million)
• However, time-to-market pressure also peaks during the validation phase
• The validation process suffers from high complexity with low visibility…

A Solution: Design for Debug
• We propose a new Design for Debug (DFD) infrastructure based on embedded
reconfigurable logic

Outline
1. The Need for Post-Silicon Debug
2. Existing Debug Solutions
3. Our New DFD Infrastructure
   a. Overview
   b. Network Topology
   c. Network Implementation
   d. Programmable Logic Interface
   e. Overall Area Costs
4. DFD Summary and Research Directions

The Need for Post-Silicon Debug

Verification
• There will always be bugs that escape verification and end up in the silicon device
• Why?
  – Simulations are many orders of magnitude slower than real operation
  – Formal verification is restricted to well-defined subsets of the device
  – Much of the real input stimulus is difficult to model, or is not well understood at design time
  – Human error…

Intel Pentium 4
• Intel Pentium 4 Verification/Validation¹:
  – 6,000 CPUs running simulations 24/7 for 2 years produced less than 1 minute of real-time operation
  – A 60 person-year formal verification effort found about 20 high-quality bugs
  – 210441 distinct configurations in a “simple” out-of-order x86 processor model, but only 237 were covered in verification
  – 10 months of validation from first silicon to release-to-production (by comparison, Intel aims to release a new processor every year…)
¹ B. Bentley, “Validating a modern microprocessor”, Proc. Int. Conf. CAV, July 2005.

“The device won’t boot. Now what?!”
• The validation process inevitably follows this pattern:
  1. The first packaged device arrives in the lab after manufacturing test is complete.
  2. It is installed in a socket on a custom-designed printed circuit board (PCB): the “validation” board.
  3. The validation engineer powers up the device and attempts to start running basic tests.
  4. Inevitably, the device does not behave as expected.
  5. The debug begins…
Visibility is Key
• The validation process has the advantage of real-time operation and real-world stimulus
• Unfortunately, it is severely hindered by the lack of internal visibility and control
• SoC integration has only increased the problem by moving busses and component interconnects inside the device

Existing Solutions
1. Software-Based - software monitor routines and processor-specific hardware allow some visibility
2. Test Feature-Based - the design-for-test (DFT) structures are re-purposed for functional debug
3. In-Circuit Emulation - a special “bond-out” version of the device is created that mirrors key internal signals on external device pins
4. On-chip Emulation - dedicated debug logic runs in parallel to the normal device logic (our solution)

Existing Solutions
• On-chip Emulation solves many of the problems with other methodologies:
  – Dedicated circuits have little or no impact on the normal behaviour of the device
  – The internal observability can be extended beyond the state of the software
  – The debug logic can run at high speeds without the requirement of high-speed I/O

Existing Solutions - Industry
• IBM Cell BE - Trace Logic Analyzer (TLA) for storing and viewing internal signals.
  - IEEE TVLSI 2007
• AMD Opteron - HyperTransport Trace Buffer (HTTB) for observation of inter-core and inter-device transactions.
  - IEEE D&T 2007
• DAFCA ClearBlue - Proprietary DFD infrastructure targeting SoCs.
  - DAC 2006

Existing Solutions - Academic
• Hopkins and Maier - Trace message framework for multi-core processors
  - IEEE TVLSI 2006
• Anis and Nicolici - Lossy and lossless trace buffer compression schemes
  - ITC 2007

Our Proposal
• The existing solutions are ad hoc and design-specific
• We are interested in a more universal solution
• At the heart of our proposal, we use programmable logic to provide the flexibility and reconfigurability needed to extend debug throughout the SoC

Framework
• Programmable Logic Cores (PLCs) are embedded blocks of reconfigurable logic: “embedded FPGAs”

High-level Architecture
Observability:
1. Select signals using the network
2. Process these signals with the PLC (triggers, compression...)
3. Return the test results

High-level Architecture
Signal Control:
1. Create circuits in the PLC that interact with the device
2. Selectively override signals using the network
3. Observe results

High-level Architecture
Error Detect/Correct:
1. Interrupt block output signals
2. Manipulate these signals using the PLC logic
3. Create new device behaviour

Our Proposal - Key Advantages
1. Enables the debug of arbitrary digital logic in the SoC.
2. Allows for reconfigurable, scenario-specific triggering, event filtering and trace compression.
3. Facilitates the detection and potential correction of design errors during normal operation.

Our Proposal - Overview
• There are many aspects to a full DFD solution:
  – Hardware Infrastructure - architecture, design, implementation, cost overhead, etc.
  – Debug Metrics - debug coverage requirements, identification and insertion of test points, etc.
  – Debug Software - control of debug resources, recreation of signal state, etc.
• This talk: Hardware Infrastructure. On-going research: Debug Metrics and Debug Software.

Hardware Infrastructure
1. Network Topology
2. Network Implementation
3. Programmable Logic Interface
4. Overall Area

Network Topology

Network Requirements
• To maximize the potential for debugging, the network should be non-blocking
• For example: the selection of pin ‘a’ on block #1 should not prevent the simultaneous selection of pin ‘b’ on block #2
• This could be an expensive requirement; however, the configurability of the PLC has the potential to reduce the network cost

Network Flexibility
• Networks of this type are well known: Clos, Benes, etc.
• These non-blocking networks are often called permutation networks
• Each pin on the PLC is equivalent

Concentrator Networks
• A network that matches these requirements has been defined in previous network theory research
• A concentrator network provides full connectivity and takes advantage of the I/O flexibility of the PLC
• An (n,m)-concentrator is defined as a network with n inputs and m outputs, with m ≤ n, for which every set of k ≤ m inputs can be mapped to some k outputs, but without the ability to distinguish between those outputs
• Theoretical proofs have shown that it is possible to implement an n-input concentrator with O(n) crosspoints
• In contrast, an ordered (or permutation) network must have at least O(n lg n) crosspoints
• A lot of research effort has been spent defining an explicit construction of a linear-cost concentrator
• In the end, linear-cost concentrators are not practical for small n
• However, it is possible to implement a concentrator with ~1/2 the area and depth of a permutation network for smaller values of n

[Figure: Area Cost]
[Figure: Depth]

Network Topology Summary
• Demonstrated that concentrator networks provide an advantage for this application
• Described a new concentrator network topology with an area savings and depth reduction
• Detailed Results: B.R. Quinton and Steven J.E. Wilton, “Concentrator Access Networks for Programmable Logic Cores on SoCs”, IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005.
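The (n,m)-concentrator definition from the Concentrator Networks slides can be checked mechanically on small examples. The sketch below is only an illustration of the definition, not the topology from the cited paper: it models a hypothetical single-stage sparse crossbar as a list of crosspoints per input and brute-forces every input subset with a standard augmenting-path matching. The names `can_route`, `is_concentrator`, and the three example networks are assumptions made for this example.

```python
from itertools import combinations

def can_route(crosspoints, chosen_inputs):
    """True if every chosen input can be routed to its own distinct output."""
    owner = {}  # output index -> input currently routed to it

    def place(inp, visited):
        for out in crosspoints[inp]:
            if out in visited:
                continue
            visited.add(out)
            # Take a free output, or re-route the input that currently holds it.
            if out not in owner or place(owner[out], visited):
                owner[out] = inp
                return True
        return False

    return all(place(inp, set()) for inp in chosen_inputs)

def is_concentrator(crosspoints, m):
    """Brute-force check of the (n,m)-concentrator property (small n only)."""
    n = len(crosspoints)
    return all(
        can_route(crosspoints, subset)
        for k in range(1, m + 1)
        for subset in combinations(range(n), k)
    )

def crosspoint_count(crosspoints):
    return sum(len(outs) for outs in crosspoints)

# Hypothetical 4-input, 2-output networks: crosspoints[i] lists the outputs
# that input i can be switched onto.
full = [[0, 1], [0, 1], [0, 1], [0, 1]]   # full crossbar
sparse = [[0], [0, 1], [0, 1], [1]]       # fewer crosspoints, still works
broken = [[0], [0], [1], [1]]             # inputs 0 and 1 collide on output 0

print(crosspoint_count(full), is_concentrator(full, 2))      # 8 True
print(crosspoint_count(sparse), is_concentrator(sparse, 2))  # 6 True
print(crosspoint_count(broken), is_concentrator(broken, 2))  # 4 False
```

Because the outputs are interchangeable, the check only asks whether some assignment exists, never which output a given input lands on. That relaxation is what lets the sparse example drop crosspoints that a permutation network would need.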
Network Implementation

Implementation
[Figure: network implementation, with one part local to each block and one part that spans the entire device or region]

Asynchronous Interconnect
• In modern process technologies, wire delay can be significant with respect to gate delay; this makes communication that spans the entire die more complex
• Classic Synchronous Solution: Pipelining
  - difficult global clock construction
• Asynchronous Techniques: Self-Clocking
  - no global clock requirement

Basic Structure
• By coordinating transfers between the source and destination, asynchronous techniques avoid the requirement of a global clock

Data Formats
• Two broad categories:
  1) Bundled-data
     • control signaling is separate from the data
     • requires delay-matching*
  2) Delay-insensitive
     • control signaling encoded with the data
     • no delay-matching* required
* Arbitrary delay-matching is a difficult CAD problem, and is not supported by most tools.

Basic Design - Data Encoding
• Many data encodings are possible for delay-insensitive circuits
• We choose ‘dual-rail’ encoding to minimize the depth of the control decode
• ‘Dual-rail’ encodings allow bit transitions to be detected with a simple XOR gate.
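To make the XOR remark concrete, here is a minimal software model of dual-rail signalling. The slides do not say which dual-rail variant is used, so this sketch assumes the common four-phase (return-to-null) convention, where each bit travels on two wires and the pair (0, 0) is a spacer between data values. The function names are hypothetical.

```python
NULL = (0, 0)  # assumed spacer: no data on this bit yet

def encode_bit(bit):
    """Return (true_rail, false_rail) for one data bit."""
    return (1, 0) if bit else (0, 1)

def rail_valid(pair):
    """XOR of the two rails: 1 when a data value is present, 0 for the spacer.

    The pair (1, 1) is not a legal code word here and also yields 0.
    """
    true_rail, false_rail = pair
    return true_rail ^ false_rail

def decode_word(pairs):
    """Decode a word once every bit is valid; None while any bit is still NULL."""
    if not all(rail_valid(pair) for pair in pairs):
        return None
    return [true_rail for true_rail, _ in pairs]

word = [encode_bit(b) for b in (1, 0, 1, 1)]
print(decode_word(word))               # [1, 0, 1, 1]
print(decode_word([NULL] + word[1:]))  # None: first bit has not arrived
```

The receiver needs no clock and no delay-matched strobe in this model: the data wires themselves say when a bit has arrived, which is the property the delay-insensitive category relies on. The price is two wires per bit.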
Basic Design - Sequential Gates
• We use a flip-flop-based design to conform to standard IP and CAD tools
• 2 flops/bit are required because the data is encoded

Representative ICs
• We implemented networks on 9 representative ICs using a TSMC 0.18 µm process:
  – 3 core die sizes:
    • 3830 x 3830 µm (~1 million gates)
    • 8560 x 8560 µm (~5 million gates)
    • 12090 x 12090 µm (~10 million gates)
  – 3 different block partitions:
    • 16 blocks
    • 64 blocks
    • 256 blocks

Block / Network Placement
[Figure: block and network placement, labelled “Second Stage Concentrator” and “Programmable Debug Logic”]

[Figure: Throughput - No Global Clock]
[Figure: Power - 350 MHz]
[Figure: Area - 350 MHz]

Implementation Summary
• Created a new asynchronous interconnect implementation
• Showed that for large, high-speed ICs it is possible to achieve a high throughput with asynchronous interconnect
• Quantified the relative performance and design costs of this new implementation versus classic synchronous pipelining
• Detailed Results: B.R. Quinton, Mark R. Greenstreet and Steven J.E. Wilton, “Practical Asynchronous Interconnect Network Design”, accepted for publication in the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008.

Programmable Logic Interface

Hardware Infrastructure
[Figure: hardware infrastructure, labelled Fast / Slow / Fast, with Direct Synchronous and System Bus interfaces]

Programmable Logic I/F
• Fixed logic: 8-bit @ 500 MHz
• Programmable logic: 32-bit @ 125 MHz
• The interface between the fixed-function and programmable logic is a challenge

Programmable Logic I/F
• Two potential strategies for implementing rate-adaptive interfaces:
  1. Use the existing programmable fabric: not fast enough.
  2.
Design fixed-function circuits: not flexible enough.
• Instead, we propose adding new programmable structures to the underlying fabric

Programmable Logic I/F
Shadow Cluster:
• We create new programmable structures integrated in the clustered logic blocks (CLBs) of the programmable fabric

[Figure: System Bus I/F]

Programmable Logic I/F

Benchmark      Size    W    Base Arch.      New Arch.       Incr.
                            (10^6 trans)    (10^6 trans)    (%)
alu4_apb       21x21   28   4.34            4.36            0.43
apex2_apb      23x23   34   6.14            6.16            0.36
apex4_apb      19x19   39   4.72            4.74            0.42
bigkey_apb     25x25   26   5.81            5.85            0.78
clma_apb       48x48   43   32.2            32.4            0.33
des_apb        25x25   30   6.51            6.55            0.73
diffeq_apb     20x20   29   4.09            4.10            0.24
dsip_apb       23x23   28   5.20            5.25            0.79
elliptic_apb   31x31   42   13.3            13.3            0.25
ex1010_apb     36x36   35   15.3            15.3            0.24
ex5p_apb       18x18   34   3.78            3.80            0.47
frisc_apb      31x31   41   13.0            13.0            0.26
misex3_apb     20x20   31   4.29            4.31            0.44
pdc_apb        35x35   53   20.6            20.7            0.21
s298_apb       23x23   24   4.61            4.63            0.43
s38417_apb     41x41   29   17.0            17.0            0.23
s38584.1_apb   42x42   34   20.3            20.4            0.41
seq_apb        22x22   37   6.04            6.06            0.38
spla_apb       32x32   47   15.5            15.6            0.23
tseng_apb      18x18   25   2.94            2.97            1.09
avg                                                         0.44

Using shadow clusters, the overhead for new programmable circuits is very low.
Programmable Logic I/F
• The timing improvements were significant:
  – System Bus Interfaces: 36.4% (144 MHz -> 294 MHz)
  – Direct Synchronous Interfaces: 68.0% (217 MHz -> 694 MHz)

Interface Summary
• Developed new programmable structures that enhance interface timing for embedded programmable logic
• Showed that these structures require a very small overhead (< 1%)
• Demonstrated that the routing architecture was not adversely impacted (channel width decrease…)
• Detailed Results: B.R. Quinton and Steven J.E. Wilton, “Programmable Logic Core Enhancements for High Speed On-Chip Interfaces”, in review for IEEE Transactions on VLSI, 2008.

Post-Silicon Debug Area Overhead / Cost

Area Overhead
• To understand the area overhead of our scheme for a range of ICs we created a set of parameterized models
• We used a 90 nm standard cell process
• We targeted the 90 nm IBM/Xilinx PLC with a capacity of approximately 10,000 ASIC gates
• The network was implemented using standard cells
• All area numbers are post-synthesis, but pre-layout

Area Overhead - Overall
[Figure: overall area overhead]
• 20M gate device, 7200 signals for ~5% overhead

Post-Silicon Debug Key Results
• We have shown that it is feasible to integrate a PLC in a fixed-function IC in such a way that it could be used to assist post-silicon debug.
• We have shown that for many ICs the area overhead of this scheme is well below 10%
• Detailed Results: B.R. Quinton and Steven J.E. Wilton, “Post-Silicon Debug Using Programmable Logic Cores”, IEEE International Conference on Field-Programmable Technology, Singapore, Dec. 2005.

DFD Summary and Research Directions

Summary - Design for Debug
• Reconfigurable Design for Debug (DFD):
  – observe
  – control
  – detect/correct

What’s Next?
• We’ve demonstrated the feasibility of the basic components; now we need to tackle:
  – the detailed usage model
  – the integration with the existing SoC infrastructure
• Goal: A multi-purpose, reconfigurable platform that enables observation of and interaction with the internals of an SoC.

Signal Selection/Reconstruction
• It is not possible to insert enough debug logic to observe all the signals in a circuit
• However, it is possible to infer the values of many signals using the values of key signals and the circuit structure
• Key Questions:
  – Given a circuit, what signals should be observed?
  – Given a set of observations, what can be inferred about other signals in the circuit?
  – How should triggers and trace compression be structured to facilitate these inferences?

DFT and BIST Integration
• Design for Test (DFT) and Built-in Self Test (BIST) circuits are intended to detect manufacturing defects
• Some of our DFD structures replicate the signal observability and control in DFT and BIST
• Key Questions:
  – Can we reduce some of the DFT/BIST overhead by relying on existing DFD structures?
  – Can we use the embedded programmable logic to generate and evaluate DFT vectors?
  – Unify DFT and DFD?

SoC Feature Enhancement
• Our proposal naturally connects the key nodes in the SoC “hardware” with the embedded processor on the system bus
• It also provides a programmable intermediary between hardware and software
• Potential opportunities:
  – Custom, programmable hardware-managed cache or data structures?
  – Performance monitors, event timers, real-time processing, customizable hardware interrupt manager?

End.
