INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS, 2001 EDITION
SYSTEM DRIVERS CHAPTER (DRAFT, 19-Jun-2017)

SYSTEM DRIVERS

Future semiconductor manufacturing and design technology capability is developed in response to economic drivers within the worldwide semiconductor industry. To identify future technology requirements, we must first understand how these requirements are defined by the different classes of applications that use integrated circuit technology. Previous ITRS editions have grounded technology projections using several generic product classes for CMOS, including SOC, ASIC, cost-performance MPU, high-performance MPU, and DRAM.
These product classes are further taxonomized by stages of the product life cycle, including Introduction, Ramp, and Volume Production. Product classes "drive" the ITRS because their business, product, and retooling cycles are felt throughout the semiconductor sector. As these product classes (e.g., "MPU") frame the ITRS, assumptions regarding their attributes must remain consistent across technology areas and the 15-year span of the ITRS. The purpose of the System Drivers chapter is to refine and formalize the definitions of system drivers used in previous ITRS editions, and – in conjunction with the ORTCs – to provide a consistent framework and motivation for specific technology requirements across the respective ITRS technology areas. We develop detailed models for three system drivers: high-volume custom – microprocessor (MPU), system-on-chip (SOC), and analog/mixed-signal (AMS). A fourth system driver, high-volume custom – memory (DRAM), is not discussed, since its commodity nature is better understood.

High-Volume Custom – Microprocessor (MPU). In high-volume custom designs, performance and manufacturing cost issues outweigh design and other NRE cost issues, primarily because of the large profits that these chips can potentially produce. These large profits result from very large sales volumes. Large volumes alone are neither necessary nor sufficient to warrant the custom design style, special process engineering and equipment, etc. often associated with such parts: the key is that the expected return on the combined NRE and manufacturing effort must be positive. MPUs use the most aggressive design styles, tools, and technologies to achieve their goals, and can sometimes warrant changes to the manufacturing flow if that will improve the resulting part.
It is for these high-volume parts that new design styles and supporting tools are created (the large revenue streams can pay for new tool creation), and subtle circuit issues are uncovered (not all risks taken by designers work out). Thus, while high-volume custom designs are the most labor-intensive, they create new technology and automation methods (in both design and fabrication) that are leveraged by the entire industry. Within the high-volume custom arena, the three dominant classes today are microprocessors, memory,1 and reprogrammable (FPGA); the first is defined in this chapter, while the latter two are discussed only as "implementation fabrics" available to the SOC system driver class. The key challenge for memory fabrics will be voltage scaling. Moreover, with the trend toward larger on-chip caches, technology must advance to allow densities greater than current SRAM without the added processing costs of traditional embedded DRAM.

Footnote 1: Memory is a special class of high-volume custom design because of the very high replication rate of the basic memory cells and supporting circuits. Since these cells are repeated millions of times on a chip, and millions of chips are sold, the amount of custom design for these parts is extraordinary. This has led to separate fabrication lines for DRAM devices, with some of the most careful circuit engineering needed to ensure correct operation.

Analog/Mixed-Signal (AMS). The AMS category is quite broad and contains chips that at least partially deal with signals whose precise input values matter. This includes RF chips, general analog chips, analog-to-digital and digital-to-analog converters, and, more recently, a large number of mixed-signal chips where at least part of the chip design needs to measure some signals with high precision. These chips, or at least the analog components of these chips, have very different design and technology demands from digital circuits. While technology scaling is always a positive for a digital circuit, reducing the power, area, and delay, it is not always positive for an analog design. Analog circuits sometimes need to deal with signals of a fixed voltage range, and decreasing supplies can make this difficult. Even when the input voltage range is not fixed, the need for high precision can make circuit design in a new technology with scaled supplies harder than dealing with an older technology. Thus, analog chips are often not aggressively scaled, and scaling analog circuits into new technologies is a critical issue for many mixed-signal chips. The need for precision also affects the design tasks and tool requirements for analog design. Digital circuit design creates a set of rules that allow the gates to function correctly: as long as these rules are followed, precise calculation of exact signal values is not needed. Analog designers, on the other hand, must be concerned with a number of "second-order" issues in order to obtain the needed precision. Such issues include analysis of all coupling sources (capacitance, inductance, and substrate), asymmetries from local supply drops, implantation, alignment, local etching differences, and other fabrication effects. Analysis tools for these issues are mostly in place but require expert users; synthesis tools are still very preliminary.

System-On-Chip (SOC). SOC is a still-evolving product class and design style that integrates pieces of technology from the other system driver classes (MPU, memory, analog/mixed-signal, reprogrammable, etc.). Design methods for SOC are typically developed originally for high-volume custom drivers.
The SOC driver class most closely resembles, and has evolved most directly from, the ASIC category, since reducing design costs and achieving higher levels of system integration are its principal goals.2 The primary difference from ASIC is that in SOC design the goal is to maximize leverage of existing blocks, i.e., to minimize the amount of the chip that is newly or directly created. Reused blocks in SOC include analog and high-volume custom cores, as well as blocks of software technology. One of the challenges for this design style is how to create and maintain these reusable blocks or cores so that they are available for SOC designers.3 How to create incentives for these modules is another issue, as is the simulation and validation of SOC designs while avoiding the full simulation/validation complexity of all the modules that are being integrated.4 There are several ways in which SOC represents a confluence of a number of previous product classes, and it is therefore an umbrella for a wide range of high-complexity, high-value semiconductor products. First, as noted above, SOCs integrate building blocks from the other system driver classes. Second, again as noted above, SOC is rapidly subsuming the ASIC category. Third, the quality headroom between "custom" and "ASIC/SOC" is diminishing. The logic layout density model for the 2001 ITRS assumes a 25% layout density difference between custom and "ASIC" fabrics; in the MPU context this is counteracted by the use of larger and more high-performance devices, so that the overall ASIC and MPU logic densities are the same. "Custom quality on an ASIC schedule" is increasingly achieved by on-the-fly ("liquid") or tuning-based standard-cell methodologies.

Footnote 2: The majority of digital designs today are considered to be application-specific integrated circuits (ASICs). The term ASIC connotes both a business model (with a particular "handoff" from design team to ASIC foundry) and a type of design methodology (where the chip designer works predominantly at the functional level, coding the design in Verilog/VHDL or higher-level description languages). In ASIC designs, custom functions are rarely created, mostly for economic reasons: reducing design cost and design risk is paramount, and outweighs a potential increase in performance or reduction in manufacturing cost. Hence, ASIC design is characterized by relatively conservative design methods and design goals (cf. MPU-ASIC differences in clock frequency and layout density in previous ITRS editions) but aggressive use of technology, since moving to a scaled technology is a cheap way of achieving a better (smaller, lower-power, and faster) part with little design risk (cf. convergence of MPU and ASIC process geometries in recent ITRS editions). "Convergence" of the ASIC business model and the SOC product class has been observed throughout the late 1990s: ASICs are rapidly becoming indistinguishable from SOCs in terms of both content and design methodology.

Footnote 3: For example, such cores might have specific noise or power attributes ("field of use", or "assumed design context") that are not normally specified. In general, it is always more difficult to create an IC design artifact that is understandable and reusable by others than to create such an artifact for one-time use.

Footnote 4: Indeed, the SOC paradigm depends on validation of a chip design using modules being easier than validation of a chip design that does not use modules.
Fourth, the MPU and SOC driver classes have also converged, with MPUs evolving into SOCs in two ways: (i) MPUs are increasingly designed as cores to be included in SOCs, and (ii) MPUs are themselves designed as SOCs to improve reuse and design productivity. As noted below, the 2001 ITRS MPU model is "multi-core", and closely resembles an SOC in organization.5 Key SOC futures include integration of multiple IC implementation fabrics (RF, reprogrammable, MEMS, optoelectronic, software, etc.) and the elimination of implementation productivity and manufacturing cost bottlenecks via platform-based design, silicon implementation regularity, or other novel paradigms.

Market Drivers. The following table shows how each system driver class is driven by particular markets for semiconductor products (e.g., computer, wireless). Products in specific markets are contrasted by factors such as manufacturing volume, die size, integration heterogeneity, system complexity, time-to-market, etc. There are accordingly different implications for both the manufacturing and design technologies that support product delivery.6

Footnote 5: SOCs and MPUs are also converging in the sense that the corresponding ASIC and structured-custom design methodologies are converging to a common "hierarchical ASIC/SOC" methodology. This is accelerated by customer-owned tooling business models on the ASIC side, and by tool limitations faced by both methodologies.

Footnote 6: The strongest distinctions between the driver classes are according to cost, time-to-market, and production volume. VLSI system cost is equal to manufacturing cost + design cost (for the present purposes, assume that the latter includes all test costs). Manufacturing cost further breaks down into non-recurring engineering (NRE) cost (masks, tools, etc.) and silicon cost (raw wafers + processing). The total system cost correlates with function, #I/Os, package cost, power, or speed, and these factors correlate well with each other (e.g., speed, power, package cost). Hence, distinctions made in the 1999 ITRS between SOC-C ("cost-driven") and SOC-P ("performance-driven") simply reflect a cost continuum. Different regions of the (manufacturing volume, time-to-market, system complexity) space are best served by ASIC, FPGA, or HVC implementation fabrics, and SOC or system-in-package (SIP) integration. This partitioning is continually evolving.

Market Drivers table (columns: market drivers; ASIC/SOC; Analog/MS; High-Vol Custom):

I. Portable & Wireless
   Drivers: (1) Size/weight: peak in 2002; (2) Battery life: peak in 2002; (3) Function: 2X / 2 years; (4) Time-to-market: ASAP; (5) Time-in-market: decreasing.
   ASIC/SOC: Low power paramount. Need SOC integration (DSP, MPU, I/O cores, etc.).
   Analog/MS: Migrating on-chip for voice processing, RF A/D sampling, etc.
   High-Vol Custom: Specialized cores to optimize function/uW.

II. Broadband
   Drivers: (1) Bandwidth: 2X / 9 months; (2) Function: 20%/yr increase; (3) Deployment/operating cost: flat; (4) Reliability: asymptotic 99.999% target; (5) Time-in-market: long.
   ASIC/SOC: Large gate counts. High reliability. Primarily SOC.
   Analog/MS: Migrating on-chip for signal recovery, RF A/D sampling, etc.
   High-Vol Custom: MPU cores and some specialized functions.

III. Internet Switching
   Drivers: (1) Bandwidth: 4X / 3-4 yrs; (2) Reliability; (3) Time-to-market: ASAP.
   ASIC/SOC: Large gate counts. High reliability. Primarily SOC, with more reprogrammability to accommodate custom functions.
   Analog/MS: Minimal on-chip analog. Migrating on-chip for RF I/O circuitry. MEMS for optical switching.
   High-Vol Custom: MPU cores, FPGA cores, and some specialized functions.

IV. Mass Storage
   Drivers: (1) Density: 60% increase / yr; (2) Speed: 2X by 2005; (3) Form factor: shift toward 2.5".
   ASIC/SOC: High-speed front-end for storage systems. Primarily ASSP.
   Analog/MS: Increased requirement for higher-precision position measurement; "inertia knowledgeable" actuator / power controllers integrated on-chip. MEMS on R/W head for sensing.
   High-Vol Custom: Demand for high-speed hardware for, e.g., "lookahead" for DB search, MPU instruction prefetch, data compression, S/N monitoring, failure prediction.

V. Consumer
   Drivers: (1) Cost: strong downward pressure; (2) Time-to-market: <12 mos; (3) Function: high novelty; (4) Form factor; (5) Durability/safety; (6) Conservation/ecology.
   ASIC/SOC: Shift toward large FPGA and COT, away from ASIC costs and design flows. High-end products only; reprogrammability possible. Mainly ASSP; increasing SOC for high-end digital using 3D graphics, parallel processing, RTOS kernel, MPU/MMU/DSP, voice synthesis and recognition, etc. cores.
   Analog/MS: Increased AMS integration for voice, visual, tactile, and physical measurement (e.g., sensor networks). CCD or CMOS sensing for cameras.
   High-Vol Custom: For "long-life" mature products only.

VI. Computer
   Drivers: (1) Speed: 2X / 2 yrs; (2) Memory density: 2X / 2 yrs; (3) Power: flat to decreasing; (4) Form factor: shrinking size.
   ASIC/SOC: Large gate counts. High speed. Drives demand for digital functionality. Primarily SOC integration of custom off-the-shelf MPU and I/O cores.
   Analog/MS: Minimal on-chip analog. Simple A/D and D/A. Video i/f for automated camera monitoring, video conferencing.
   High-Vol Custom: MPU cores and some specialized functions.

VII. Automotive
   Drivers: (1) Functionality; (2) Ruggedness (external environment); (3) Reliability/safety; (4) Cost.
   ASIC/SOC: Mainly entertainment systems. Mainly ASSP, but increasing SOC for high end using standard hardware platforms with RTOS kernel, embedded s/w.
   Analog/MS: Integrated high-speed A/D, D/A for logging, instrumentation, monitoring, range-speed-position resolution. Cost-driven on-chip ADC for sensor signals. Signal processing shifting to DSP for voice, visual, and physical measurement ("communicating sensors" for proximity, motion, positioning). MEMS for sensors.
   High-Vol Custom: Increased industry partnerships on common designs to reduce development costs (requires data sharing and reuse across multiple design systems).
   Decrease in long design cycles, and in use of high-cost non-prepackaged functions and design flows.

The MPU System Driver. The ITRS microprocessor (MPU) driver reflects general-purpose instruction-set architectures (ISAs) that are found standalone in desktop and server systems, and embedded as cores in SOC applications. MPUs, along with, e.g., FPGAs, form a high-volume custom segment that drives the semiconductor industry with respect to integration density and design complexity, power-speed performance envelope, large-team design process efficiency, test and verification, power management, and packaged system cost. The MPU system driver is subject to market forces that have historically led to (i) emergence of standard architecture platforms and multiple generations of derivatives, (ii) strong price sensitivities in the marketplace, and (iii) extremely high production volumes and manufacturing cost awareness.

General. The major 2001 MPU driver characteristics are as follows. There are two types of MPU: cost-performance (CP), reflecting "desktop", and high-performance (HP), reflecting "server". Each type is parameterized with respect to volume production in a particular node.

Die sizes are constant (140 mm2 for CP, 310 mm2 for HP) over the course of the roadmap, primarily due to power and cost limits. Area is broken down into logic, memory, and integration overhead.7 We believe that additional logic content would not be efficiently usable due to power limits. Additional memory content (e.g., larger caches, more levels of memory hierarchy integrated on-chip) would also fail to be cost-effective beyond a certain point.

The MPU logic content reflects a multi-core organization, with multiple processing units on-chip starting from the 130nm node.
This integrates several related trends that arise from power and reuse considerations, as well as from the convergence of MPU and SOC architectures and design methodologies. These trends include: (i) the organization of recent and planned commercial MPU products (both server and desktop); (ii) the increasing need to reuse verification and logic design, as well as standard ISAs; (iii) ISA "augmentations" in successive generations (e.g., x86, MMX, and EPIC), with likely continuations for encryption, graphics, multimedia, etc.; (iv) the need to enable flexible management of power at the architecture, OS, and application levels via SOC-like integration of less-efficient, general-purpose processor cores with more-efficient, special-purpose "helper engines";8 (v) the limited size of processor cores (the estimate of a constant 20-25 million transistors per core9 is a safe upper bound with respect to recent trends); and (vi) the convergence of SOC and MPU design methodologies due to design productivity needs. The number of cores on chip is projected to double with each successive technology node.

The MPU memory content is initially 512 KBytes (512 x 1024 x 9 bits) of SRAM for CP and 2 MBytes for HP in the 180nm node.10 Memory content, like logic content, is projected to double with each successive technology node, not with respect to absolute time intervals (e.g., every 18 months).11,12

Footnote 7: Integration overhead reflects the presence of whitespace for inter-block channels, floorplan packing losses, and a potentially increasing tradeoff of layout densities for design turnaround time.

Footnote 8: A "helper engine" is a form of "processing core". The trend is toward MPU architectures containing increasingly more special-purpose, and less general-purpose, logic.

Footnote 9: The CP core has 20 million transistors, and the HP core has 25 million transistors. The difference allows for more aggressive microarchitectural enhancements (trace caching, various prediction mechanisms, etc.) and other performance support.
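The content model just described (a roughly constant transistor count per core, with core count and memory content each doubling per technology node from a 180nm baseline) can be sketched as a few lines of arithmetic. This is an illustrative reading of the model, not official ITRS data; the node sequence used is an assumption.

```python
# Illustrative sketch of the 2001 MPU content model for the
# cost-performance (CP) part: ~20M transistors per core, with core
# count and SRAM content doubling at each successive technology node.
# Baseline values follow the text; the node list is assumed.
NODES = [180, 130, 90, 65, 45]  # nm

def cp_content(node_nm):
    """Return (cores, logic_transistors, sram_kbytes) at a node."""
    doublings = NODES.index(node_nm)    # nodes elapsed since 180nm
    cores = 2 ** doublings              # core count doubles per node
    logic = cores * 20_000_000          # 20M transistors per CP core
    sram_kbytes = 512 * 2 ** doublings  # 512 KB of SRAM at 180nm
    return cores, logic, sram_kbytes

for n in NODES:
    cores, logic, kb = cp_content(n)
    print(f"{n:>4}nm: {cores:>2} core(s), "
          f"{logic / 1e6:.0f}M logic transistors, {kb} KB SRAM")
```

Under the 9-bits-per-byte assumption of footnote 10, the 512 KB baseline corresponds to 512 x 1024 x 9 SRAM bits.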
Footnote 10: Small caches tend to have speed as the primary objective, with area and redundancy sacrificed for speed. Transistor counts for such small blocks of memory are slightly overestimated by the 9-bits-per-byte assumption.

Footnote 11: The doubling of logic and memory content with each technology node, rather than with each 18- or 24-month time interval, is due to essentially constant layout densities for logic and SRAM, as well as conformance with other parts of the ITRS. Specifically, the ITRS remains planar CMOS-centric, with little or no acknowledgment of dual-gate FET, FinFET, etc. yet incorporated into the roadmap. Adoption of such novel device architectures would allow improvements of layout densities beyond what is afforded by scaling alone, with possible restoration of the traditional "Moore's Law" trajectory.

Footnote 12: Deviation from the given model will likely occur around the 90nm node, when adoption of denser embedded memories (eDRAM, 1-T SRAM, or other) will occur. (Any such higher-density alternatives to traditional 6T SRAM will need to overcome the limited process window that arises from scaling of oxide thickness.) Alternatives to SRAM, and integrated on-chip L3 cache, respectively increase the on-chip memory density and memory transistor count by factors of approximately 3 from the given values. While this will significantly boost transistor counts, it is not projected to significantly affect the chip size or total chip power roadmap. Adoption of eDRAM or 1-T SRAM will also depend strongly on compatibility with logic processes, the size and partitioning of memory within the individual product architecture, and density-performance-cost sensitivities (e.g., an embedded processor for low-power handheld devices may have a transistor count dominated by memory, in which case eDRAM may be preferable, while a desktop (CP) MPU may find 1-T SRAM to be more cost-effective).

Futures. MPU system driver considerations will in general continue to focus on power, I/O bandwidth, and manufacturing yield. A likely bifurcation of requirements is due to the emergence of a "centralized processing" context consisting of (i) smart interface remedial processing (SIRP) providing power-efficient basic computing13 (see the SOC section of this chapter) and (ii) centralized computing servers providing high-performance computing via traditional MPUs (this section).

As modeled in the 2001 ITRS, future MPUs will consist of multiple cores per die, with cores being (i) smaller and faster to counter global interconnect scaling, and (ii) optimized for reuse across multiple applications and configurations. Multi-core architectures allow power savings as well as the use of redundancy to improve manufacturing yield.14 The 2001 ITRS MPU model also permits increasing amounts of the memory hierarchy on chip (consistent with processor-in-memory, or at least large on-chip eDRAM L3 starting in the 90nm generation). Higher memory content affords better control of leakage and total chip power. As noted above, evolutionary microarchitecture changes (superpipelining, superscalar, predictive methods) appear to be running out of steam. Thus, more multithreading support will emerge for parallel processing, as well as more complex "hardwired" functions (and/or specialized engines) for networking, graphics, communication, etc. The megatrend is toward shifting the flexibility-efficiency tradeoff point away from general-purpose processing.

With respect to I/O bandwidth, increased use of double-clocking or phase-pipelining in parallel-serial data conversion will continue to push against the roughly 6-8 FO4 INV delay limit of known circuit techniques. Signal I/O counts themselves may stay constant or even decrease as a result of system integration (due to "Rent's Rule" type power-law scaling phenomena that empirically hold for system interconnects and I/O). More practically, I/O counts will remain bounded by constant die and package sizes and concomitant power delivery requirements.

Circuit design may see increased use of dynamic circuits, but less pass-gate logic due to low Vth values. Error-correction for single-event upset (SEU) will increase, as will the use of redundancy to compensate for yield loss. Overall, power management will integrate a number of techniques from several component technologies: (i) application-, OS-, and architecture-level optimizations including parallelism and adaptive voltage and frequency scaling; (ii) process innovations including increased use of SOI; and (iii) design techniques including use of multi-Vth, multi-Vdd, minimum-energy sizing under throughput constraints, and multi-domain clock gating and scheduling.

Direct Impacts on Overall Roadmap Technology Characteristics. The MPU system driver directly impacts the following elements of the ITRS ORTCs and chip size models.

Footnote 13: An example of SIRP would arise in the SOC integration of RF, analog/mixed-signal, and digital functions within a wireless handheld multimedia platform (recall the convergence of SOC and MPU). The SOC system driver discussion in fact uses the wireless multimedia "PDA" application as a framework for defining low-power / low-cost SOC requirements.

Footnote 14: Replication enables power savings through lowering of frequency and Vdd while maintaining throughput (e.g., two cores running at half the frequency and half the supply voltage will save a factor of 4 in CV2f dynamic capacitive power, versus the "equivalent" single core). More generally, overheads of time-multiplexing of resources can be avoided, and the architecture and design focus can shift to better use of area than memory. Redundancy-based yield improvement occurs if, e.g., a die with k-1 instead of k functional cores is still useful.
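The factor-of-4 claim in footnote 14 follows directly from the dynamic power relation P = C x Vdd^2 x f, and can be checked in normalized units (this idealized check ignores the device slowdown that normally accompanies a lower supply voltage):

```python
# Dynamic capacitive power: P = C * Vdd^2 * f (normalized units).
def dynamic_power(c, vdd, f):
    return c * vdd ** 2 * f

single_core = dynamic_power(1.0, 1.0, 1.0)  # one core at full Vdd and f

# Two replicated cores at half frequency and half supply voltage
# sustain the same aggregate throughput (2 cores x f/2 = f).
dual_core = 2 * dynamic_power(1.0, 0.5, 0.5)

print(single_core / dual_core)  # 4.0: the factor-of-4 saving of footnote 14
```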
Layout Densities. Due to their high levels of system complexity and production volume, MPUs are the driver for improved layout density.15 The logic and SRAM layout densities in the 2001 ITRS ORTCs are analogous to the DRAM "A-factor", and have been calibrated to recent MPU products. Logic layout densities reflect average standard-cell gate layouts of approximately 320F2, where F is the minimum feature size of the technology node.16 As noted above, the logic layout density may improve significantly with the advent of novel devices. SRAM layout densities reflect use of a 6-transistor bitcell (the fitted expression for area per bitcell, in units of F2, is 223.19 x F (um) + 97.74) in MPUs, with 60% area overhead for peripheral circuitry.

Clock Frequencies. MPUs also drive maximum on-chip clock frequencies, which in turn drive various aspects of the Interconnect, PIDS, FEP, and Test roadmaps. The MPU maximum on-chip clock frequency has historically been increasing by a factor of 2 per generation. Of this, approximately 1.4x has been from device scaling (which runs into tox and other limits), and 1.4x has been from reduction in the number of logic stages in a pipeline stage (e.g., the equivalent of 32 fanout-of-4 inverter (FO4 INV) delays17 at 180nm, and 26 FO4 INV delays at 130nm). There are several reasons why this historical trend will not continue: (i) well-formed clock pulses cannot be generated with a period below 6-8 FO4 INV delays; (ii) there is increased overhead (diminishing returns) in pipelining (2-3 FO4 INV delays per flip-flop, 1-1.5 FO4 INV delays per pulse-mode latch); and (iii) around 14-16 FO4 INV delays is a practical lower limit for the clock period, given the latencies of L1 cache access, 64-bit integer addition, etc. The 2001 ITRS MPU model continues the historic rate of advance for maximum on-chip global clock frequencies, but flattens the clock period at 16 FO4 INV delays at the 90nm node.
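The clock-period model can be sketched numerically. The FO4-per-stage figures (32 at 180nm, 26 at 130nm, then a 16 FO4 floor) come from the text; the absolute FO4 delay values below are hypothetical placeholders assuming a ~1.4x device speedup per node, since the real values derive from PIDS device data.

```python
# Clock period = (FO4 delays per pipeline stage) * (one FO4 delay).
FO4_PER_STAGE = {180: 32, 130: 26, 90: 16, 65: 16, 45: 16}  # from the text
# Assumed FO4 delays (ps): 90 ps at 180nm, improving ~1.4x per node.
FO4_DELAY_PS = {180: 90.0, 130: 64.3, 90: 45.9, 65: 32.8, 45: 23.4}

def max_clock_ghz(node_nm):
    period_ps = FO4_PER_STAGE[node_nm] * FO4_DELAY_PS[node_nm]
    return 1000.0 / period_ps

for node in sorted(FO4_PER_STAGE, reverse=True):
    print(f"{node:>4}nm: {max_clock_ghz(node):5.2f} GHz")
```

Through 130nm, frequency gains combine the device-speed term with fewer FO4 delays per stage (approximating the historical 2x per generation); once stage depth pins at 16 FO4, frequency grows only with the device term.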
Hence, from the 90nm node on, clock frequencies will advance only with device performance.18

Footnote 15: The ASIC/SOC and MPU system driver products will likely have access to the same minimum process geometries, as has been forecast since the 1999 ITRS. This reflects the emergence of pure-play foundry models, and means that fabric layout densities (SRAM, logic) are the same for SOC and MPU. However, MPUs are the driver for high density and high performance, while SOCs are the driver for high integration, low cost, and low power.

Footnote 16: In the logic layout density estimate, a 2-input NAND gate is assumed to lay out in an 8x4 standard cell, where the dimensions are in units of contacted local metal pitch (MP = 3.16 x F). In other words, the average gate occupies 32 x (3.16)2 F2 = 320F2. For both semi-custom (ASIC/SOC) and full-custom (MPU) design methodologies, an overhead of 100% is assumed.

Footnote 17: An FO4 INV delay is defined to be the delay of an inverter driving a load equal to 4 times its own input capacitance (with no local interconnect). This is equivalent to roughly 14 times the CV/I device delay metric that is used in the PIDS Chapter to track device performance.

Footnote 18: Unlike previous ITRS clock frequency models (e.g., Fisher/Nesbitt 1999), the 2001 model does not have any local or global interconnect component in its prototypical "critical path". This is because local interconnect delays are negligible and scale with device performance. Furthermore, buffered global interconnect does not contribute to the minimum clock period, since long global interconnects are pipelined (cf. Intel Pentium 4 and Compaq Alpha 21264) – i.e., the clock frequency is determined primarily by the time needed to complete local computation loops, not by the time needed for global communication. Pipelining of global interconnects will become more widespread as the number of clock cycles required to signal cross-chip continues to increase beyond 1.

Figures of Merit and Impact on Supporting Technologies.
A tangential note is that “marketing” emphases for MPUs will necessarily shift from “frequency” to “throughput” or “utility”. 15 Preliminary Work in Progress – Not for Publishing System Drivers Page 9 6/19/2017 The nature of the MPU system driver is in a state of flux, with several looming “contradictions”. Several notable issues are as follows. Traditional microarchitecture (speculative execution, pipelining) knobs are running out of steam, as is the clock frequency knob. “Pollack’s Rule” observes that in a given process technology, a new microarchitecture occupies 2-3x the area of the old (previous-generation) microarchitecture, while providing only 40% more performance. 19 Power dissipation limits of cost-effective packaging (estimated to reach 50 W/cm2 for forcedair cooling by the end of the ITRS) cannot continue to support high supply voltages (historically scaling by 0.85x per generation instead of 0.7x ideal scaling) and frequencies (historically scaling by 2x per generation instead of 1.4x ideal scaling).20 Clock frequency trends in the MPU system driver have been taken to imply performance (switching speed) requirements for CMOS devices that lead to large off-currents and extremely thin gate oxides, as specified in the PIDS Chapter roadmap. With such devices, future HP MPUs that simply continue existing circuit and architecture techniques will exceed package power limits by factors of over 20x by the end of the ITRS; equivalently, the logic content and/or logic activity of MPUs must drop to near-zero in order to match package technology requirements. Portable and low-power embedded contexts have even more stringent power limits, and will encounter such obstacles earlier. Power efficiencies (GOPS/mW) are up to four orders of magnitude greater for direct-mapped hardware, versus general-purpose MPUs; moreover, this gap is increasing. Hence, SOClike multi-core MPUs may not be competitive with application-specific and/or reconfigurable processing engines. 
- The complexity and cost of design and verification of MPU products have rapidly increased to the point where thousands of engineer-years (and a design team of hundreds) are devoted to a single design, yet processors still reach market with hundreds of bugs.
- Parametric yield requirements ($/wafer after bin-sorting) are in conflict with the feature size and device architecture trends from PIDS and Lithography: thinner oxides, deep-subwavelength optical lithography requiring aggressive reticle enhancement techniques, etc.

The MPU system driver thus sets requirements for design and test technologies (distributed/collaborative design process, verification, at-speed test, tool capacity, power management), as well as for device (off-current), lithography (CD variability) and packaging (power dissipation and current delivery) technologies. Key challenges and figures of merit for the MPU system driver (relevant ITRS Chapters in parentheses) are:
- Power efficiency (GOPS/mW) (PIDS, Design, Assembly and Packaging)
- Design productivity and quality (total design cost $ per normalized transistor) (Design)
- Design verification (number of bug escapes at first volume production) (Design)
- Parametric yield at volume production (Lithography, PIDS, FEP, Design)

The SOC System Driver. The ITRS "System-on-Chip" (SOC) driver class is characterized by the heavy use of embedded, reusable IP cores to improve design productivity, and by the integration of heterogeneous technologies.

19 Performance metrics such as SPECint/MHz, SPECfp/MHz, SPECfp/Watt, etc. are also decreasing.

20 To maintain reasonable packaging cost, package pin counts and bump pitches for flip-chip are required to advance at a slower rate than integration densities (cf. the Assembly and Packaging Chapter). This increases pressure on design technology to manage larger wakeup and operational currents and larger supply voltage IR drops; power management problems are also passed to the architecture, OS and application levels of the system design.
SOCs are cost- and integration-driven; they drive the deployment of low-power process and low-cost packaging solutions, as well as fast-turnaround-time design methodologies.

ASIC vs. SOC. The focus on exploitation of reusable cores has a major impact on the SOC design process and tools21, which differ greatly from those used for traditional ASIC design. Of course, SOCs are also distinguished by other factors. As systems-on-chip, they will increasingly be mixed-technology designs, including such diverse combinations as embedded DRAM, high-performance or low-power logic, analog, RF, and even more esoteric technologies such as Micro-Electro-Mechanical Systems (MEMS) and optical input/output. However, improving design productivity through reusable IP is a key element of SOC. ASIC and SOC are similar in that time-to-market (TTM) and cost are key drivers.

SOC as a Driver. A number of factors are driven by the development of SOC design methodologies. The main goal of SOC – high integration through high productivity – is accomplished through design methods that promote reusability; therefore, the tools, reusable blocks, and infrastructure for reuse will be accelerated to enable more SOC design starts. SOC drives industry standards for IP description, testing, interfaces, verification, etc. SOC design also promotes the use of built-in self-test, and perhaps built-in self-repair strategies, to improve the testing of large complex designs. It further enhances the ability of designers to optimize the chip and package solution for the overall system. Finally, the generalization of SOC implies the integration of a number of heterogeneous technologies, whereby specific components/technological features may be added such as embedded Flash, embedded DRAM, MEMS, chemical sensors, or ferroelectric RAM (FRAM).
Hence, SOC is the driver for convergence of these technologies not only in the same system package, but potentially in the same manufacturing process. Key challenges that arise for the SOC system driver class are:
- Management of power, especially for low-power, wireless, multimedia applications
- Maintaining design productivity improvement of 50% or more per node via reuse, platform-based design22, and integration of programmable logic fabrics23
- Development of SOC test methodology with heavy use of analog/digital BIST
- Development of reusable analog IP
- System-level integration of heterogeneous technologies (MEMS, opto-electronics, etc.)

21 The SOC design process involves soft and hard IP selection, IP integration, and system verification. The IP authoring process makes heavy use of the ASIC or high-volume custom design flows.

22 Platform-based design is focused on a specific application and embodies the hardware architecture, embedded software architecture, design methodologies for IP authoring and integration, design guidelines and modeling standards, IP characterization and support, and hardware/software verification and prototyping. (Cf. H. Chang et al., "Surviving the SOC Revolution: A Guide to Platform-Based Design".)

23 A programmable logic core is a flexible logic fabric that can be customized to implement any digital logic function after fabrication. The structure of a programmable logic fabric may be similar to an FPGA capability within specific blocks of the SOC. Programmable cores allow reprogrammability, adaptability and reconfigurability, which greatly improve chip productivity. Applications include blocks that implement standards and protocols that continue to evolve, changing design specifications, and customization of logic for different, but related, applications and customers.

SOC Variants.
Three major variants of SOC are driven respectively to deliver high levels of (multi-technology) integration (MT), high performance (HP), and low power and low cost (LP).

SOC-MT. The need to build heterogeneous systems is driving SOC beyond CMOS to merge with MEMS, opto-electronic, chemical sensor technologies, etc. For SOC-MT, process complexity will be a major factor in the cost of SOC applications. The more combinations of technologies that are assembled on a single chip, the more complex the processing will be. The total cost of processing will be hard to predict for these new materials and combinations of processing steps. Initially, the number of such additional technological features on a specific SOC is likely to be limited to one or two for cost reasons. However, as costs decrease and processing capabilities improve, SOC will evolve to include additional fabrics. Figure 1 shows how first production use of each technology in an SOC might evolve.

[Figure 1 omitted: timeline (1998-2012) of first production integration of logic, SRAM, Flash, E-DRAM, CMOS RF, FPGA, FRAM, MEMS, chemical sensors, electro-optical, and electro-biological technologies.] Figure 1. Technologies integrated on SOC using a standard CMOS process.

SOC-HP. Examples include network processors and high-end gaming applications. The ITRS SOC-HP model for standard CMOS processes follows a similar trend to the MPU model. The only major difference between the two is that chip size for SOC will continue to grow at 20% per technology node, whereas the MPU chip size (both CP and HP) is projected to remain relatively constant. Therefore, the model for SOC-HP is not separately discussed here.

SOC-LP. Examples include portable and wireless applications such as PDAs or digital camera chips. Power will play an important role in dictating the logic/memory content of such future SOC designs.
Forecasts of the logic/memory composition of future low-power SOCs are based on the following model for a low-power, consumer-driven, handheld wireless device with multimedia processing capabilities (see Table 1), using the low operating power (LOP) and low standby power (LSTP) device models (see Table 2) that will be described below. The SOC-LP model is based on a PDA application comprising embedded blocks of CPU, DSP and other processing engines, plus SRAM and embedded DRAM circuits.24 In 2001 the SOC-LP power budget is taken to be 0.1 W for a 100 mm² PDA chip.
- The overhead area (e.g., I/O buffer cells, pad ring, whitespace due to block packing, etc.) for the LOP and LSTP models is 28%, and remains fixed for all succeeding nodes. Analog blocks are counted as part of the overhead area and are assumed to have insignificant power consumption.
- The PDA chip contains approximately 20 million transistors for both LOP and LSTP in 2001.
- Die size increases on average by 20% per node through 2016 to accommodate increased functionality (matching historical trends for this application domain); layout densities for memory and logic fabrics are the same as for the MPU driver. eDRAM density is assumed to be 3× SRAM density.
- Maximum on-chip clock frequency will be approximately 5-10% of the MPU clock frequency on a node-by-node basis.
- Peak power dissipation is initially constrained to 0.1 W at 100°C, and standby power to 2.1 mW, due to battery life.26
- Logic power consumption is estimated from an αCVdd²f + Ioff·Vdd model for dynamic plus static power, using area-based calculations similar to those in the MPU power analysis.27
- Memory is in the form of SRAM and eDRAM.28 The memory power consumption model also uses αCVdd²f + Ioff·Vdd, but with a different activity factor α.29
Two approaches can be used to derive the power dissipation for the given application: the first driven top-down from system requirements, and the second driven bottom-up from process and circuit parameters. Table 1 describes top-down requirements for a PDA and serves as a roadmap for the SOC-LP driver. The PDA consists of logic and memory portions. The logic portion of the design comprises CPU/DSP blocks and non-CPU/DSP blocks.30 The PDA functional requirements in Table 1 imply a 4X CPU/DSP increase per node to meet the performance demands. We also assume a 2-3X memory increase per node. The device roadmap of Table 2 and the above-mentioned power estimation methodology yield a "bottom-up" analysis of total power over the course of the ITRS.

24 The PDA model assumes that increasing levels of parallel computation will be required in each generation of the device, to support video, audio and voice recognition functionality. Accordingly, CPU and DSP content (e.g., number of cores) reflects increasing levels of parallel computation: total logic content increases four-fold per technology node to match the processing demands of the corresponding applications. By comparison, MPU logic content is projected to double with each node.

26 Battery life is 120 Wh/kg in 2001 and increases to 400 Wh/kg in 2016, i.e., at the rate of 25% per node. A 200 g battery allows operation for 7 days, 24 hours per day, at 0.1 W. The same battery weight would permit longer operation time; hence the flat power budget in Table 1.

27 Details of active capacitance density calculations, dependences on temperature and threshold, etc. may be found in the PIDS Chapter documentation and in accompanying spreadsheets and/or GTX models. Ioff denotes subthreshold current.

28 SRAM transistor count doubles (grows by 2X) every node. The composition of SRAM vs. DRAM depends on the ratio of memory to logic. We assume that embedded DRAM (eDRAM) is cost-effective when at least 30% of the chip area is memory. Its use is not invoked until the 30% trigger point, and begins at 16 Mb in 2004. Once triggered, the eDRAM content quadruples every technology node. (This discussion does not comment on whether multi-die strategies (MCM, System-in-Package) will be more cost-effective than single-die implementation.) Memory is assumed to dissipate negligible levels of static power.

29 The activity of logic blocks is fixed at 10%. The activity of a memory block is estimated to be α = 0.4%, based on the analysis of large memory designs. To obtain this value, we first assume that a memory cell contributes 2 gate capacitances of minimum-size transistors for switching purposes, accounting for source/drain capacitances, contact capacitances and wiring capacitance along the bit lines. A write access requires power in the row/column decoders, word line and M bit lines, sense amplifiers and output buffers. We consider the memory to be organized as a 2^N × 2^N array and assume that memory power is due primarily to the column capacitances; M × 2^N bits are accessed simultaneously out of the 2^N × 2^N possible bits. Then α = M/2^N, which is the ratio of accessed bits to total bits in the memory. For example, for a 16 Mbit memory, M = 16 and N = 12; hence α ≈ 0.4%.

30 The 20M transistor count in 2001 is broken down as follows. A typical CPU/DSP core (e.g., ARM) today is approximately 30-40K gates, or 125K transistors. We assume four such cores on board the PDA application in 2001, i.e., 500K CPU/DSP core transistors. In 2001, the "peripheral" logic transistor count is 11.5M transistors, and this count grows at 2X per node thereafter. SRAM transistor count is 8M in 2001, and grows at 2X per node thereafter as discussed above. eDRAM transistor count is discussed above.
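The per-block power model and memory activity factor above can be sketched numerically. This is our own illustration of the αCVdd²f + Ioff·Vdd formula, not ITRS code; the capacitance, frequency, and leakage values in the example are hypothetical round numbers, with only the α = M/2^N derivation taken from the text.

```python
# Hedged sketch (illustrative, not ITRS code) of the SOC-LP power model
# P = alpha*C*Vdd^2*f + Ioff*Vdd, with the memory activity factor
# alpha = M / 2^N derived in footnote 29.

def memory_activity(M: int, N: int) -> float:
    """alpha = fraction of bits accessed per cycle in a 2^N x 2^N array."""
    return M / 2 ** N

def block_power(alpha: float, c_farads: float, vdd: float,
                f_hz: float, ioff_amps: float) -> float:
    """Dynamic plus static power for one block, in watts (SI units)."""
    dynamic = alpha * c_farads * vdd ** 2 * f_hz
    static = ioff_amps * vdd
    return dynamic + static

# 16 Mbit example from footnote 29: M = 16, N = 12 gives alpha ~ 0.4%.
alpha_mem = memory_activity(16, 12)

# Hypothetical 2001-node values (assumed for illustration only):
# 10 nF switched capacitance, Vdd = 1.2 V, f = 150 MHz, 1 mA total leakage.
p_mem = block_power(alpha_mem, 10e-9, 1.2, 150e6, 1e-3)
print(alpha_mem, p_mem)
```

The low α is what keeps a large memory's dynamic power manageable: only the accessed columns switch each cycle, while logic blocks are modeled with a 25-fold higher activity of 10%.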
The requirement of low operating and low standby power has led to the roadmapping of two additional low-power device families which complement the high-performance device family, as described in the PIDS Chapter. Table 2 lists key attributes for the low standby power (LSTP) and low operating power (LOP) devices, and contrasts these with the high-performance device model used for MPU power and frequency analyses. Figure 2 shows the resulting lower bound for total chip power at an operating temperature of 100°C, even assuming that all logic is implemented with LOP or LSTP devices and operates as described above.31 The figure indicates that SOC-LP power levels will substantially exceed the low-power requirements of the PDA application. The figure further provides a breakdown of power contributions for each case. As expected, LOP power is primarily due to standby power dissipation, while LSTP power is primarily due to dynamic power dissipation.32

| Year of Production | 2001 | 2004 | 2007 | 2010 | 2013 | 2016 |
|---|---|---|---|---|---|---|
| Process Technology (nm) | 130 | 90 | 65 | 45 | 32 | 22 |
| Supply Voltage (V) | 1.2 | 1.0 | 0.8 | 0.6 | 0.5 | 0.4 |
| Clock Frequency (MHz) | 150 | 300 | 450 | 600 | 900 | 1200 |
| Processing Performance (GOPS) | 0.3 | 2 | 15 | 103 | 720 | 5042 |
| Communication Speed (Kbps) | 384 | 2304 | 13824 | 82944 | 497664 | 2985984 |
| Req'd Average Power (W) | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Req'd Standby Power (mW) | 2.1 | 2.1 | 2.1 | 2.1 | 2.1 | 2.1 |
| Addressable System Memory (Gb) | 0.1 | 1 | 10 | 100 | 1000 | 10000 |

The application rows of the original table (maximum required performance, and other applications) progress from still image processing through real-time video codec (MPEG4/CIF), TV telephone (1:1, then >3:1), and real-time interpretation; other applications include web browser, electric mailer, scheduler, voice recognition (input and operation), and authentication (crypto engine). Battery capacity grows from 120 Wh/kg (2001) through 200 Wh/kg mid-roadmap to 400 Wh/kg (2016).

Table 1. System functional requirements for the PDA SOC-LP driver.
| Parameter | Type | 99 | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 10 | 13 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tox (nm) | MPU | 3.00 | 2.30 | 2.20 | 2.20 | 2.00 | 1.80 | 1.70 | 1.70 | 1.30 | 1.10 | 1.00 | 0.90 |
| | LOP | 3.20 | 3.00 | 2.2 | 2.0 | 1.8 | 1.6 | 1.4 | 1.3 | 1.2 | 1.0 | 0.9 | 0.8 |
| | LSTP | 3.20 | 3.00 | 2.6 | 2.4 | 2.2 | 2.0 | 1.8 | 1.6 | 1.4 | 1.1 | 1.0 | 0.9 |
| Vdd (V) | MPU | 1.5 | 1.3 | 1.2 | 1.1 | 1.0 | 1.0 | 0.9 | 0.9 | 0.7 | 0.6 | 0.5 | 0.4 |
| | LOP | XXX | XXX | 1.2 | 1.2 | 1.1 | 1.1 | 1.0 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 |
| | LSTP | XXX | XXX | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 | 1.1 | 1.0 | 0.9 | 0.9 |
| Vth (V) | MPU | 0.21 | 0.19 | 0.19 | 0.15 | 0.13 | 0.12 | 0.09 | 0.06 | 0.05 | 0.021 | 0.003 | 0.003 |
| | LOP | 0.34 | 0.34 | 0.34 | 0.35 | 0.36 | 0.32 | 0.33 | 0.34 | 0.29 | 0.29 | 0.25 | 0.22 |
| | LSTP | 0.51 | 0.51 | 0.51 | 0.52 | 0.53 | 0.53 | 0.54 | 0.55 | 0.52 | 0.49 | 0.45 | 0.45 |
| Ion (uA/um) | MPU | 1041 | 1022 | 926 | 959 | 967 | 954 | 924 | 960 | 1091 | 1250 | 1492 | 1507 |
| | LOP | 636 | 591 | 600 | 600 | 600 | 600 | 600 | 600 | 700 | 700 | 800 | 900 |
| | LSTP | 300 | 300 | 300 | 300 | 400 | 400 | 400 | 400 | 500 | 500 | 600 | 800 |
| CV/I (ps) | MPU | 2.00 | 1.64 | 1.63 | 1.34 | 1.16 | 0.99 | 0.86 | 0.79 | 0.66 | 0.39 | 0.23 | 0.16 |
| | LOP | 3.50 | 2.87 | 2.55 | 2.45 | 2.02 | 1.84 | 1.58 | 1.41 | 1.14 | 0.85 | 0.56 | 0.35 |
| | LSTP | 4.21 | 3.46 | 4.61 | 4.41 | 2.96 | 2.68 | 2.51 | 2.32 | 1.81 | 1.43 | 0.91 | 0.57 |
| Ig (uA/um) | MPU | 2e-5 | 1e-2 | 2e-2 | 2e-2 | 1e-1 | 2e-1 | 3e-1 | 3e-1 | 3e-4 | 1e-5 | 4e-9 | 2e-16 |
| Ioff (uA/um) | MPU | 0.00 | 0.01 | 0.01 | 0.03 | 0.07 | 0.10 | 0.30 | 0.70 | 1.00 | 3 | 7 | 10 |
| | LOP | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 3e-4 | 3e-4 | 3e-4 | 7e-4 | 1e-3 | 3e-3 | 1e-2 |
| | LSTP | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 3e-6 | 7e-6 | 1e-5 |
| Gate L (nm) | MPU | 100 | 70 | 65 | 53 | 45 | 37 | 32 | 30 | 25 | 18 | 13 | 9 |
| | L(*)P | 110 | 100 | 90 | 80 | 65 | 53 | 45 | 37 | 32 | 22 | 16 | 11 |
| Gate cap (fF/um) | MPU | 1.39 | 1.29 | 1.26 | 1.07 | 1.02 | 0.95 | 0.87 | 0.85 | 0.90 | 0.81 | 0.69 | 0.59 |
| | LOP | 1.43 | 1.39 | 1.28 | 1.23 | 1.10 | 1.00 | 0.95 | 0.85 | 0.89 | 0.75 | 0.63 | 0.53 |
| | LSTP | 1.43 | 1.39 | 1.15 | 1.10 | 0.99 | 0.89 | 0.84 | 0.77 | 0.82 | 0.71 | 0.61 | 0.51 |

Table 2. Low operating power (LOP) and low standby power (LSTP) device and process attributes.

31 In practice, at least some logic would need to be implemented with higher-performance devices; hence the lower bound.

32 At 25°C, dynamic power dissipation dominates the total power in both the LOP and LSTP cases.

[Figure 2 omitted: chart of total chip power (W) over 2001-2016, with curves for LOP and LSTP dynamic power, static power, memory power, and bottom-up totals.] Figure 2: Total chip power trend (with breakdown into dynamic logic, static logic, and memory components) for the PDA application using LOP and LSTP low-power device models.

Total chip power using only LOP devices reaches 2.45 W in 2016, mostly due to a sharp rise in static power after 2010. Total chip power using only LSTP devices reaches 1.5 W in 2016; almost all of this is dynamic power. Future low-power SOCs will likely integrate multiple (LOP, LSTP, HP) technologies to afford flexible control of dynamic power, static power and performance.

| Year | 2001 | 2004 | 2007 | 2010 | 2013 | 2016 |
|---|---|---|---|---|---|---|
| Total LOP Dynamic Power Gap (x) | -0.06 | 0.59 | 1.03 | 2.04 | 6.43 | 23.34 |
| Total LSTP Dynamic Power Gap (x) | -0.19 | 0.55 | 1.35 | 2.57 | 5.81 | 14.00 |
| Total LOP Standby Power Gap (x) | 0.85 | 5.25 | 14.55 | 30.18 | 148.76 | 828.71 |
| Total LSTP Standby Power Gap (x) | -0.98 | -0.98 | -0.97 | -0.88 | -0.55 | 0.24 |

Table 3. Power management gap. Total Power Gap is defined as (Total Power − 0.1 W)/0.1 W (the PDA total power requirement). Total Standby Power Gap is defined as (Total Standby Power − 2.1 mW)/2.1 mW (the PDA total standby power requirement). Negative values indicate the absence of a power management gap (i.e., existing techniques suffice).

Table 3 shows the implied power management gap, i.e., the factor of improvement in power management that must be achieved jointly at the levels of application, operating system, architecture, and IC design.34 Required power reduction factors reach 23X for dynamic power, and 828X for standby power.
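The gap metric defined under Table 3 is simple enough to sketch directly. The code below is our illustration of that definition; the 2.45 W figure comes from the LOP discussion in the text, and the result differs slightly from the 23.34× table entry only because 2.45 W is itself a rounded value.

```python
# Illustrative sketch of the Table 3 "power management gap" metric:
# gap = (power - budget) / budget, so gap <= 0 means the budget is met.

def power_gap(power_w: float, budget_w: float) -> float:
    """Factor by which power exceeds the budget (negative => within budget)."""
    return (power_w - budget_w) / budget_w

# 2016 LOP total power quoted in the text (2.45 W) vs. the 0.1 W PDA budget.
lop_2016_gap = power_gap(2.45, 0.1)

# Standby example: 2.1 mW dissipated against the 2.1 mW budget => zero gap.
standby_gap = power_gap(2.1e-3, 2.1e-3)
print(round(lop_2016_gap, 2), standby_gap)
```

A design exactly on budget has a gap of zero, and any negative entry in Table 3 means the corresponding requirement is already met without further power management improvement.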
Figure 3 projects the logic/memory composition of SOC-LP designs, assuming that chip power is constrained to a budget of 0.1 W and that chip size is constrained to 100 mm². Memory content outstrips logic content faster with LSTP devices, since their operating power is much higher than that of LOP devices. Both models indicate that chips will be asymptotically dominated by memory by 2016 without substantial improvements in power management technology. Recall that functionality demands, e.g., the 4X/node increase in logic (CPU/DSP cores), mean the PDA chip size is actually projected to grow at approximately 20% per node, while the top-down power requirement in Table 1 remains flat at 0.1 W. These two considerations would lead to a much more extreme memory-logic imbalance in the long-term years.

34 For HP MPUs implemented using high-performance devices, the ITRS model implies a nearly 30X power management gap by the end of the roadmap with respect to package power limits. An alternative portrayal of the power reduction challenge is that the maximum chip area containing logic goes to zero if the chip is to remain within power constraints, and if we simply extrapolate current designs (without any improvement in application-level, OS-level, VLSI-architecture, and IC design technology for power management).

[Figure 3 omitted: chart of logic and memory area percentages over 2001-2016 for LOP and LSTP at a fixed 1 cm² die size.] Figure 3: Chip composition (with breakdown into logic and memory components) for the PDA application using LOP and LSTP low-power device models.

Design Productivity Figure of Merit for Low-Cost, Low-Power SOC. The concept of normalized gates/designer/day may be used to measure the productivity of an integrated circuit designer.
This figure of merit (FoM) addresses the required improvement in designer productivity, which depends on the availability of reusable IP, SOC design tools, automatic BIST insertion, etc. Normalized gates/designer/day can also be measured on a chip-by-chip basis to assess overall productivity in a design process. The importance of design productivity improvement can be seen from the effect of different rates of productivity improvement on the feasible logic and memory composition of the SOC. We assume that most of the available design resource is applied to the development of new logic blocks, while reusable blocks have some overhead associated with their use (learning curve, verification, integration, etc.). Memory design, through the use of compilers, is assumed to require minimal resources. These assumptions lead to the following model.
- Designer productivity for new logic is 360K gates/year per designer in 1999 (aggregate 1000 gates/day/designer) and grows at a specified percentage per node thereafter.
- Designer productivity for reused logic is 720K gates/year per designer in 1999 and grows at a specified percentage per node thereafter.
- Memory design is assumed to be of negligible cost due to the use of memory compilers.
- Available designer resources are fixed at 10 person-years.
- Logic and memory densities are assumed to grow according to the MPU roadmap.

Figure 4 plots the maximum amount of chip area that can be occupied by logic (y-axis), given that a prescribed amount of chip area is occupied by memory (x-axis). The feasible solutions for the amounts of reused logic and new logic35 are also plotted (y-axis). A constant die size of 1 cm² is assumed, along with 30% productivity growth per node; plots are shown for four different nodes in the ITRS. With the design resource constraint, the only valid solutions are those which require 10 or fewer person-years of effort. For example, in the year 2001 the designer can essentially dictate the memory/logic and new/reused ratios, since all solutions are under the 10 person-year budget. On the other hand, in the year 2016 there is only a small feasible region, where roughly 95% of the chip must be memory and the remaining 5% can be constructed from new logic or reusable logic blocks. Figure 4 thus shows that without adequate productivity growth, in order to fill up a fixed-size chip with a constant designer resource, we will asymptotically reach a point where only memory and reusable blocks are used in the design. The design productivity requirement also follows simply from the SOC-LP 4X trend for CPU/logic growth and the 2-3X memory growth per node: to keep design effort roughly constant, we require at least approximately 50% design productivity improvement at each technology node. (Figure 5 (left) shows how the evolution of logic-memory balance changes with different rates of design productivity improvement; Figure 5 (right) shows that a 50% improvement per node will essentially preserve the level of designer freedom enjoyed in the year 2001.) Achieving such levels of productivity improvement is possible in a number of ways. A higher level of reusability can be achieved for a given application using a platform-based design approach, where several derivative designs can be rapidly implemented from a single platform that has a fixed portion and a variable portion for proprietary or differentiated blocks. Furthermore, if a programmable fabric is utilized in the platform, tremendous gains are expected in overall productivity across a family of designs.

35 Chip area consists of memory plus new logic plus reused logic. A vertical line in the plot defines a possible combination of the three components. Only those solutions that require 10 person-years or less are considered to be valid.
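The feasibility test behind Figure 4 can be sketched in a few lines. This is our own illustration using the 1999 productivity figures from the model above; the gate counts in the example are hypothetical, chosen only to show one feasible and one infeasible split.

```python
# Illustrative sketch (example gate counts are assumed, not from the ITRS) of
# the Figure 4 feasibility test: a (memory, new logic, reused logic) split is
# valid only if it fits within 10 person-years of design effort.

DESIGN_BUDGET_PY = 10.0            # available designer resource, person-years
NEW_PROD = 360_000                 # 1999 new-logic productivity, gates/designer-year
REUSE_PROD = 720_000               # 1999 reused-logic productivity, gates/designer-year

def design_effort_py(new_gates: float, reused_gates: float) -> float:
    """Person-years to design the given gate counts (memory assumed free)."""
    return new_gates / NEW_PROD + reused_gates / REUSE_PROD

def is_feasible(new_gates: float, reused_gates: float) -> bool:
    return design_effort_py(new_gates, reused_gates) <= DESIGN_BUDGET_PY

# Hypothetical splits: 2M new + 3M reused gates fits; 5M all-new gates does not.
print(round(design_effort_py(2_000_000, 3_000_000), 2),
      is_feasible(2_000_000, 3_000_000),
      is_feasible(5_000_000, 0))
```

Because reused logic is twice as cheap per gate, shifting area from new to reused logic enlarges the feasible region, which is exactly the trade-off the Figure 4 panels trace out node by node.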
In any case, productivity gains must exceed 50% per node to manage the expected increases in the logic and memory content of future designs.

[Figure 4 omitted: four panels (2001, 2004, 2010, 2016) plotting new-circuit ratio, reuse-circuit ratio, and target design resource versus memory percentage at constant die size.] Figure 4: New and reused logic content vs. memory content with constant designer resource, constant die size, and insufficient (30% per technology node) design productivity growth.

[Figure 5 omitted: (left) logic/memory area distribution over 1999-2016 for productivity growth rates from 20% to 100% per node; (right) designer freedom under 50% productivity growth.] Figure 5. (Left) Evolution of logic-memory balance with different rates of design productivity improvement. (Right) 50% productivity improvement per node will preserve designer freedom throughout the ITRS forecast period, 2001-2016.

High-Speed Input/Output.
There is a growing gap between on-chip and off-chip input/output bandwidth.36 Currently, links that operate at 4-8 FO4 delays (defined above) per bit are fairly common. A natural limit on a link's bandwidth is the on-chip data rate, which corresponds to bit times of 1 FO4 delay. However, as technology scales, the ability to perform high-speed I/O below bit times of 1 FO4 will be critical. System throughput requirements for the PDA imply that electrical signaling rates increase from 64 Kbits/sec to 500 Mbits/sec, which is within current I/O capabilities. However, in MPU and memory systems, and in high-speed data communication systems, speeds greatly exceeding these levels will be needed.38 Research in the field of high-speed I/O is critical in the years to come, before the on-chip/off-chip communication gap reaches crisis proportions.

36 Today, most chip-to-chip interconnect uses CMOS inverters for transmitters and receivers. Processor-to-memory and processor-to-processor communication systems are built from physical busses: transmission lines with multiple loads attached along their length. These systems impose limits on signal speed of 1-2 Gbits/sec, while on-chip data rates can reach well above this level. This communication gap must be addressed from an electrical signaling viewpoint to leverage the terabits/sec of bandwidth available in conventional printed circuit boards today.

38 For example, 10 Gbits/sec will be needed for OC-192 and 40 Gbits/sec for OC-768 systems.

MIXED-SIGNAL SYSTEM DRIVER.

An analog roadmap must consider design needs and efforts to advance design. Since there are many different circuits and architectures in mixed-signal design, simplification is mandatory, especially since the roadmap is primarily used by individuals who are not directly expert in mixed-signal design issues.
Performance in analog, mixed-signal and RF design is often determined by specific basic circuits which are most relevant to the total performance of the design. To achieve a general and simple, yet relevant, approach to mapping of mixed-signal design issues, we restrict the discussion to four basic analog circuits: the low-noise amplifier (LNA), the voltage-controlled oscillator (VCO), the power amplifier (PA), and the analog-to-digital converter (ADC). The requirements for accurate design and process technology to build these circuits will also determine the performance of many other mixed-signal circuits. Thus, the performance of these four circuits, as described by figures of merit (FoMs), is a good basis for a mixed-signal roadmap. In the following sections these FoMs are described in detail. Note that by convention in our evaluations, all parameters (like the gain G) are taken as absolute values and not on a decibel scale. We have also purposely avoided preferences for specific solutions to given design problems; indeed, we have sought to be as open as possible to different types of solutions, since long-term experience shows that unexpected solutions have often helped to overcome barriers. (Competition, e.g., between alternative solutions, is a good driving force for all types of advances related to technology roadmapping.) Furthermore, we observe that a given type of circuit can have different requirements for different purposes; in such cases, certain performance indicators might be contradictory for different applications.39 To avoid such situations, we adjust the figures of merit to a mainstream product. The economic situation of a mainstream product is usually highly competitive: it has high volume and is therefore characterized by high R&D expenditure. The technology requirements of such a product can therefore drive mixed-signal technology as a whole. Today, the key product in this context is the mobile phone.
Finally, we evaluate the dependence of the FoMs on device parameters. This allows identification of requirements from circuit design which lead to specific device and technology specifications. Through discussions between design and technology experts, extrapolations are proposed that lead on the one hand to a significant advance in analog circuit performance, and on the other hand to realistic and feasible technology advances. These parameters are stated in Table 29 of the PIDS Chapter.

MIXED-SIGNAL TECHNOLOGY REQUIREMENTS

LOW NOISE AMPLIFIERS (LNAS)

Digital processing systems require interfaces to the analog world. Prominent examples of these interfaces are transmission media in wired or wireless communication. The LNA amplifies the input signal to a level which makes further signal processing insensitive to noise. The key performance issue of an LNA is to deliver the undistorted but amplified signal to further signal processing units without adding noise to the signal. There exist many different LNA applications (GSM, CDMA, UMTS, GPS, Bluetooth, etc.) operating in many frequency bands. The operating frequency and, in some cases, the operating bandwidth of the LNA impact the maximum achievable performance. Furthermore, nonlinearity has to be considered to meet the specifications in many applications. Consequently, these parameters have to be included in the FoM.

39 As will be discussed in detail, certain cases of application are omitted for the sake of simplicity. Arguments will be given for the cases selected. In many cases, we have limited our considerations to CMOS, since it is the prime technological driving force and in most cases the most important technology. Alternative solutions (especially other device families) and their relevance will be discussed for some cases, as well as at the end of this Mixed-Signal section.
On the other hand, different systems are often not directly comparable and thus have diverging requirements. For example, very wide bandwidth is needed for high-performance wired applications; this increases the power consumption, which is an important design attribute especially for low-bandwidth wireless applications. In wide-bandwidth systems, bandwidth may be more important than linearity in describing the performance of an LNA. To avoid contradictory design constraints, we will focus on the wireless communication context. The linearity of a low noise amplifier can be described by the output-referred third-order intercept point (OIP3 = G × IIP3, where G is the gain and IIP3 is the input-referred third-order intercept point). The minimum signal that is correctly amplified by an LNA is directly given by the noise figure of the amplifier, NF. However, for considering the contribution of the amplifier to the total noise, the value of (NF − 1) is an even better measure, since it directly gives the ratio between the noise added by the amplifier, Namplifier, and the noise already present at the input, Ninput. These two performance figures can be combined with the power consumption P. The resulting figure of merit captures the dynamic range of an amplifier versus the necessary DC power. For roadmap purposes it is preferable to have a performance measure that is independent of frequency and thus of the specific application. This can be achieved by assuming that the LNA is formed by a single amplification stage; then the FoM scales linearly with the operating frequency f. With these approximations and assumptions, a figure of merit (FoM_LNA) for LNAs is defined:

    FoM_LNA = (G × IIP3 × f) / ((NF − 1) × P)    (1)

Making further simplifying assumptions, and neglecting design intelligence for this purpose, the evolution of the FoM with technology scaling can be extrapolated [1].
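As an illustration, the FoM_LNA definition of equation (1) can be evaluated directly. The sketch below is not part of the roadmap itself, and all numeric device values in it are hypothetical; it only shows the unit bookkeeping, since gain, IIP3, and NF are customarily quoted in dB/dBm but enter the FoM as absolute values.

```python
def db_to_lin(x_db: float) -> float:
    """Convert a power ratio in dB (or a power in dBm) to an absolute value."""
    return 10.0 ** (x_db / 10.0)

def fom_lna(gain_db: float, iip3_dbm: float, nf_db: float,
            freq_ghz: float, power_mw: float) -> float:
    """FoM_LNA = G * IIP3 * f / ((NF - 1) * P), all parameters absolute.

    IIP3 (converted to mW) and P (in mW) share units and cancel, so the
    FoM carries the units of f, here GHz (as in Table 4).
    """
    g = db_to_lin(gain_db)         # power gain, absolute
    iip3_mw = db_to_lin(iip3_dbm)  # input-referred IP3, dBm -> mW
    nf = db_to_lin(nf_db)          # noise figure (dB) -> noise factor
    return g * iip3_mw * freq_ghz / ((nf - 1.0) * power_mw)

# Hypothetical 2-GHz CMOS LNA: 20 dB gain, -10 dBm IIP3, 2 dB NF, 10 mW.
print(round(fom_lna(20.0, -10.0, 2.0, 2.0, 10.0), 2))  # prints 3.42 (GHz)
```

Because IIP3 and P appear as a ratio, the FoM is independent of whether both are expressed in mW or W, as long as they match.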
The future trends of all relevant device parameters for LNA design, such as the maximum oscillation frequency fmax, the quality factor of the inductors, the inner gain of the MOSFETs (gm/gds at minimum channel length), and the RF supply voltages, are shown in Table 29 of the PIDS Chapter. The evolution of the FoM from recent best-in-class published accounts for CMOS LNAs shows a clear trend toward better performance for smaller device dimensions; this is in good agreement with the increase in the quality of the devices needed for LNA design. Extrapolating these data into the future, an estimate of future progress in LNA design is obtained, as shown in Table 4.

VOLTAGE-CONTROLLED OSCILLATORS (VCOS)

Another key component of RF signal processing systems is the VCO. The VCO is the most sophisticated part of a PLL (phase-locked loop), which is needed to synchronize the communication between an integrated circuit and the outside world in high-bandwidth and/or high-frequency applications. The key design issues for VCOs are to minimize the timing jitter of the generated waveform (or, equivalently, the phase noise) and the power consumption. From these parameters a figure of merit (FoM_VCO) for a VCO is defined:

    FoM_VCO = (f0 / Δf)² × 1 / (L{Δf} × P)    (2)

Here, f0 is the oscillation frequency, L{Δf} is the phase noise power spectral density measured at a frequency offset Δf from f0, and P is the total power consumption. There is no clear correlation between the operating frequency and the figure of merit. However, a good value of the figure of merit is usually more difficult to achieve at higher frequencies; therefore the figure of merit is not completely independent of the operating frequency. The definition also neglects the tuning range of the VCO, since the necessary tuning range strongly depends on the application. Typically, however, the VCO's phase noise or power consumption worsens if a larger tuning range is required.
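A hedged numeric sketch of equation (2): phase noise quoted in dBc/Hz is converted to an absolute (1/Hz) spectral density so that, with P in watts, the FoM comes out in 1/J, matching the units used in Table 4. The oscillator values below are hypothetical, chosen only to show the arithmetic.

```python
def fom_vco(f0_hz: float, offset_hz: float,
            phase_noise_dbc_hz: float, power_w: float) -> float:
    """FoM_VCO = (f0 / df)^2 / (L{df} * P).

    L{df} is converted from dBc/Hz to an absolute density (1/Hz);
    with P in watts the result has units of 1/J.
    """
    l_lin = 10.0 ** (phase_noise_dbc_hz / 10.0)  # dBc/Hz -> 1/Hz
    return (f0_hz / offset_hz) ** 2 / (l_lin * power_w)

# Hypothetical 2-GHz LC-tank VCO: -120 dBc/Hz at 1 MHz offset, 10 mW.
fom = fom_vco(2.0e9, 1.0e6, -120.0, 10.0e-3)  # 4e20 1/J
```

Note that improving the phase noise by 10 dB at fixed power raises this FoM by exactly one order of magnitude, which is why inductor quality dominates its evolution.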
By restricting to fully integrated CMOS tuned VCOs with an on-chip load (LC tank) and making further simplifications, the FoM can be linked to technology development [1]. The phase noise is mainly determined by the thermal noise and the quality factor of the LC tank. Thermal noise versus power consumption is approximately constant over the technology nodes. Finally, the evolution of the figure of merit versus technology node mainly depends on the quality of the available inductors [1]. The evolution of FoMs from recent best-in-class published accounts for VCOs shows a clear trend of increasing performance for decreasing CMOS minimum feature size. The FoMs are in good agreement with the data of the best available devices needed for VCO design in these technologies. Based on predictions of the relevant device parameters for future technology nodes (Table 29 in the PIDS Chapter), an extrapolation of the VCO FoM for future technology nodes is given in Table 4.

POWER AMPLIFIERS (PAS)

Power amplifiers are key components in the transmission path of wired or wireless communication systems. They deliver the transmission power required for transmitting information off-chip with a high linearity to minimize adjacent-channel power. Especially for battery-operated applications, there is a demand for minimum DC power at a given output power. To establish a performance figure of merit (FoM_PA) for power amplifiers, the key parameters of the amplifiers, including output power Pout, power gain G, carrier frequency f, linearity (in terms of IIP3), and power-added efficiency (PAE), are taken into account. Unfortunately, linearity strongly depends on the operating class of the amplifier, which makes it difficult to compare amplifiers of different classes. To remain independent of the design approach and the specifications of different applications, we omit this parameter in our figure of merit.
To compensate for the 20 dB/decade rolloff of the PA's RF gain [footnote: most CMOS PAs are currently operated in this regime; using the DC gain for applications far below fT would result in a slightly increased slope], a factor of f² is included in the figure of merit. The following representation is used for benchmarking power amplifiers:

    FoM_PA = Pout × G × PAE × f²    (3)

Restricting to class A operation and making further simplifications, a correlation between the FoM and device parameters can be established [1]. The key device parameters for building high-performance power amplifiers are the quality factor of the available inductors and fmax [1]. These values are mapped in Table 29 in the PIDS Chapter for future technology nodes. The FoMs of best-in-class CMOS PAs in recent years increase by approximately a factor of two per technology generation. The data are strongly correlated with progress in the active and passive device parameters for these technologies. From the required device parameters for future technology nodes (Table 29 of the PIDS Chapter), we can project requirements for future FoM values, as shown in Table 4.

ANALOG-TO-DIGITAL CONVERTERS (ADCS)

Digital processing systems have interfaces to the analog world: audio and video interfaces, interfaces to magnetic and optical storage media, and interfaces to transmission media, either wired or wireless. The point where the analog world reaches digital processing is the analog-to-digital converter. Here the analog signals, which are continuous in time and amplitude, are converted to discrete time (sampled) and discrete amplitude (quantized). This makes the A-to-D converter a useful vehicle to identify the advantages and limitations of new technology nodes for future system integration. The A-to-D converter is also the most prominent and most widely used mixed-signal circuit in today's integrated mixed-signal circuit design.
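Analogously to the LNA and VCO benchmarks, the PA figure of merit defined above (FoM_PA = Pout × G × PAE × f²) can be evaluated numerically. The amplifier values below are hypothetical, not roadmap benchmarks; with Pout in watts, PAE as a fraction, G absolute, and f in GHz, the result lands in the W·GHz² units of Table 4.

```python
def fom_pa(pout_w: float, gain_db: float, pae: float, freq_ghz: float) -> float:
    """FoM_PA = Pout * G * PAE * f^2, with G absolute and PAE a fraction.

    Units: W * GHz^2, matching the FoM_PA row of Table 4.
    """
    g = 10.0 ** (gain_db / 10.0)  # power gain, dB -> absolute
    return pout_w * g * pae * freq_ghz ** 2

# Hypothetical 2-GHz class-A CMOS PA: 1 W output, 20 dB gain, 40% PAE.
fom = fom_pa(1.0, 20.0, 0.40, 2.0)  # 160.0 W*GHz^2
```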
The main specification parameters of an A-to-D converter are related to sampling and quantization. The resolution of the converter, i.e., the number of quantization levels, is 2^n, where n is the "number of bits" of the converter. This parameter also defines the maximum signal-to-noise ratio:

    SNR = 6.02 × n + 1.76  [dB]

The sampling rate of the converter, i.e., the number of samples it can quantize per unit time (each sample being "number of bits" wide), is related to the bandwidth that needs to be converted and to the power consumption required for reaching these performance points. The Shannon/Nyquist criterion states that a signal can be reconstructed whenever the sample rate exceeds two times the converted bandwidth: fsample > 2 × BW. To gain insight into the potential of future technology nodes for this building block, it is useful to define a figure of merit that combines the three main parameters: dynamic range, sample rate fsample, and power consumption P. These parameters are nominal parameters and do not give accurate information on the effective performance of the converter; for defining a figure of merit, they should instead be based on effective performance extracted from measured data. The dynamic range is extracted from the low-frequency signal-to-noise-and-distortion (SINAD0) measurement minus the quantization error (both values in dB). From SINAD0 an "effective number of bits" can be derived through ENOB0 = (SINAD0 − 1.76) / 6.02. The sample rate is replaced by two times the effective resolution bandwidth (2 × ERBW) if the latter has a lower value, to make the link with the Nyquist criterion:

    FoM_ADC = 2^ENOB0 × min(fsample, 2 × ERBW) / P    (5)

The relation between the FoM and technology parameters is strongly dependent on the converter architecture and circuits used. Different classes of converters can be distinguished, and the relation between these classes and the figure of merit becomes more complicated.
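The ENOB and FoM_ADC relations above can be sketched as follows; the converter values are hypothetical, chosen only to make the unit handling explicit (fsample and ERBW in Hz, P in watts, FoM in conversion steps per joule).

```python
def enob(sinad0_db: float) -> float:
    """Effective number of bits from the low-frequency SINAD (in dB)."""
    return (sinad0_db - 1.76) / 6.02

def fom_adc(sinad0_db: float, fsample_hz: float,
            erbw_hz: float, power_w: float) -> float:
    """FoM_ADC = 2^ENOB0 * min(fsample, 2*ERBW) / P  [conversion steps / J]."""
    return 2.0 ** enob(sinad0_db) * min(fsample_hz, 2.0 * erbw_hz) / power_w

# Hypothetical converter: SINAD0 = 74 dB (so ENOB0 = 12.0), 100 MS/s,
# 40 MHz effective resolution bandwidth (so 2*ERBW = 80 MHz limits), 0.5 W.
fom = fom_adc(74.0, 100.0e6, 40.0e6, 0.5)  # ~6.6e11 conversion steps / J
```

Inverting the same relation gives a quick feasibility estimate, P = 2^ENOB0 × min(fsample, 2 × ERBW) / FoM: for instance, 14 effective bits at 300 MS/s against a FoM of 0.4 × 10^12 1/J implies roughly 12 W, which is the kind of back-of-envelope check the figure of merit enables.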
Due to the complexity and diversity of A-to-D converter designs, it is almost impossible to derive direct relations as for the basic RF circuits. Nevertheless, some general considerations regarding the parameters in the figure of merit are proposed in [1], and in some cases it is even possible to determine the performance requirements of the design by looking at the performance requirements of a critical subcircuit [1]. The device parameters relevant for the different ADC designs are stated in Table 29 of the PIDS Chapter. The trend in recent years shows that the figure of merit for A-to-D converters improves by approximately a factor of 2 every three years. Taking increasing design intelligence into account, the past improvements were in good agreement with the increase in analog device parameters. Best-in-class for stand-alone CMOS/BiCMOS converters is now approximately 800G [conversion steps/Joule]; for embedded CMOS this value is approximately 400G [conversion steps/Joule]. The expected values of the figure of merit for A-to-D converters in coming years are shown in Table 4. Major workarounds in design will be necessary (see Difficult Challenges) to keep the performance increase in ADC design on track while the voltage signal swing is decreasing. In the long run, fundamental physical limitations (thermal noise) will challenge further improvement of ADC figures of merit and may become a major roadblock.

COST/PERFORMANCE ISSUES

The above figures of merit are intended to measure mixed-signal performance. A link to cost of production may be formulated as follows. Unlike high-volume digital products, where cost is mostly determined by chip area, in mixed-signal design area is only one of several cost factors. The area consumption of analog circuits in a system-on-chip is typically in the range of 5–30%, which is not as significant as that of logic or memory.
The economic forces to reduce mixed-signal area are therefore not as strong as for the other parts of a system. However, several additional facts have to be considered:

- In some cases the analog area can be further reduced by shifting the partitioning of a system between analog and digital parts (e.g., auto-calibration of A-to-D converters).
- Process complexity is increased by introducing high-performance analog devices. This may result in a solution with less area but higher total cost.
- In some applications a system-in-package solution with two dies (a large, low-cost digital die and a small, high-performance analog die) may be cheaper than a single system-on-chip solution.

All these additional constraints make the estimation of cost issues very difficult in mixed-signal design. To quantify cost, we restrict our considerations to high-performance applications, since these also drive technology demands. Here, additional analog features such as high-performance passives or analog transistors are commonly used, and area can be taken as a measure for estimating cost. In analog designs power consumption is often proportional to area; since power is included in all four figures of merit, they partially account for area and cost constraints. Nevertheless, area requirements have to be stated explicitly in a roadmap. Scaling of transistors is driven by the increasing density of the digital parts of a system, so there is no need to address transistor area in the analog part of a roadmap. Furthermore, the total area in most of today's mixed-signal designs is determined by passives, whose area consumption dominates the cost of the mixed-signal part of a system. Table 29 in the PIDS Chapter gives the values necessary for a density increase of the passive devices to improve the cost/performance ratio for high-performance mixed-signal designs in the future. Practical interim solutions for moving the ADC barrier-line (cf.
Figure 6) more rapidly for a given power consumption are not readily foreseen. CMOS technologies have the bit resolutions needed, which imply large numbers of devices per unit area, but not the speed. Compound semiconductor technologies, on the other hand, have the speed but not the needed bit resolutions. Possible solutions to increase the rate of ADC performance improvement at reasonable cost include the use of compound semiconductors for their speed, perhaps combinations of HBTs, HEMTs, and resonant tunneling diodes (RTDs), and hybrids of both CMOS and compound semiconductor technologies. The challenge for the compound semiconductors is to increase the number of devices per unit area and to be co-integrated with CMOS processing. A more detailed ADC technology roadmap involving both elemental and compound semiconductors could increase the rate at which the conversion-energy lines in Figure 6 approach the Heisenberg uncertainty limit. Today, all ADC technologies are many orders of magnitude away from that limit.

Table 4    Mixed-Signal Figures of Merit for Benchmark Circuits

  YEAR OF PRODUCTION          2001   2004   2007     2010     2013      2016   DRIVER
  ASIC 1/2 PITCH (nm)          130     90     65       45       32        22
  FoM_LNA [GHz]                 10     15     25    30-40    40-50     50-70   PIDS Table 29
  FoM_VCO [1/J] x 10^22          5      6      7      8-9    10-11     12-14   PIDS Table 29
  FoM_PA [W*GHz^2] x 10^4        6     12     24    40-50    80-90   100-130   PIDS Table 29
  FoM_ADC [1/J] x 10^12        0.4    0.8  1-1.2  1.6-2.5    2.5-5      4-10   PIDS Table 29

MIXED-SIGNAL DIFFICULT CHALLENGES

In most of today's mixed-signal designs, especially in classical analog design, the processed signal is represented by a voltage difference. Hence, the maximum signal is determined by the supply voltage. Decreasing the supply voltage, a necessary consequence of constant-field scaling, means decreasing the maximum achievable signal level. This has a strong impact on mixed-signal product development for system-on-chip solutions.
The typical development time for a completely new mixed-signal part is much longer than the development time for the digital and memory parts of a system-on-chip. It is desirable to reuse existing mixed-signal designs and to adjust only specific parameters to meet the specifications within the interfaces of the system-on-chip and to the outside world. However, this procedure is incompatible with supply voltage scaling: as in the example of technology requirements for A-to-D converters, the performance of the design depends on the maximum signal, and design reuse would be impossible without a second type of MOSFET that does not scale its maximum operating voltage. This is the main reason for introducing an analog transistor into a mixed-signal CMOS technology: it uses a higher analog supply voltage and stays unchanged across several digital technology generations. Even with such a device, voltage reduction and the development time of analog circuit blocks are major obstacles to low-cost and efficient scaling of mixed-signal functions.

MIXED-SIGNAL FEATURES DRIVING SOCS

The usual strategy to increase unit shipments is cost reduction while increasing product performance. However, this is not the only driver for the semiconductor business, especially for products that include mixed-signal parts: improving technology and design performance leads to the ability to build new applications (comparable to the realization of the mobile handset in past years), thus pushing the semiconductor industry into new markets. Mixed-signal considerations can be used to estimate the design needs and the feasibility of designs for future applications and new markets. For example, from the A-to-D converter (ADC) power/performance relation (resolution × bandwidth), the ADC requirements for many new applications of the last couple of years can be derived, as shown in Figure 6.
Under conditions of constant performance, a constant power consumption is represented by a straight line with a slope of minus one in the graph (for more details see the ADC discussion above). Increasing performance, achievable with better technology or circuit design, is equivalent to a shift of the constant-power lines toward the upper right corner. It is also equivalent to the ability to develop new products that need higher performance or lower power consumption than is available in today's CMOS A-to-D converter designs. Once the specifications of a new product are known, one can estimate which technology is needed to fulfill them; alternatively, one can estimate the timeframe in which the semiconductor industry will be able to build that product at reasonable cost and performance. Independent of circuit, product, and specification, the figure-of-merit (FoM) concept can be used to evaluate the feasibility and the market of these potential new mixed-signal products. The ability to build high-performance mixed-signal circuitry at low cost will continuously drive the semiconductor industry into such new products and markets.

[Figure 6: Performance needs of ADCs for important product classes in recent years. The plot shows resolution (4 to 22 bits) versus signal bandwidth (1 kHz to 1 GHz) for application classes including super audio, audio, storage, ADSL, VDSL, cable, DTV, video, GSM, GSM base-station, UMTS telephony, Bluetooth, and interconnectivity, together with constant-power lines at 1 mW, 1 W, and 1 kW.]

Even though the current rate of improvement in ADC performance is adequate for handset applications, it is not adequate for some applications such as digital linearization of GSM base-stations and handheld/mobile high-data-rate digital video applications. For example, a multi-carrier GSM base-station with a typical setup of 32 carriers requires over 80 dB of dynamic range.
Implementing digital linearization in such a base-station with a 25 MHz transmitter band requires ADCs with sampling rates of 300 MHz and 14 bits of resolution. According to Table 4, and assuming progress continues as it has in the last few years, it will be many years, perhaps until after 2010, before ADCs with such performance are manufactured in sufficient volumes; however, system designers would like to have such ADCs now. These data show that a very slowly moving technological barrier-line exists in the 100s-of-MHz sampling rates for ADCs. For a given ADC sampling rate and power consumption, this barrier-line presently lies about 1 bit below the 1 W line shown in Figure 6 and has a slope of about 1 bit per octave. Most of today's ADC technologies (e.g., silicon, SiGe, and III-V compound semiconductors and their hybrids) lie below this barrier-line. As noted in the above discussion of cost/performance issues, silicon and SiGe semiconductor technologies have the bit resolutions needed but not the speed, whereas III-V compound semiconductor technologies have the speed but not the needed bit resolutions.

[1] R. Brederlow, S. Donnay, J. Sauerer, M. Vertregt, P. Wambacq, and W. Weber, "A Mixed-Signal Design Roadmap for the International Technology Roadmap for Semiconductors (ITRS)," IEEE Design & Test of Computers, December 2001.