The Limits of Semiconductor Technology & Coming Challenges in Microarchitecture and Architecture
Mile Stojčev, Teufik Tokić, Ivan Milentijević
Faculty of Electronic Engineering, Niš

Outline
•Technology Trends
•Process Technology Challenges – Low Power Design
•Microprocessors' Generations
•Challenges in Education

Outline – Technology Trends
•Moore's Law 1
•Moore's Law 2
•Performance and New Technology Generations
•Technology Trends – Example
•Trends in the Future
•Processor Technology
•Memory Technology

Moore's Law 1
In 1965, Gordon Moore, director of research and development at Fairchild Semiconductor and later a founder of Intel Corp., wrote a paper for Electronics entitled "Cramming more components onto integrated circuits". In the paper Moore observed that "The complexity for minimum component cost has increased at a rate of roughly a factor of two per year". This observation became known as Moore's law. In fact, by 1975 the leading chips had perhaps one-tenth as many components as Moore had predicted. The doubling period had stretched out to an average of 17 months in the decade ending in 1975, then slowed to 22 months through 1985 and 32 months through 1995. It has revived to a relatively peppy 22 to 24 months in recent years.

Moore's Law 1 – continued
Similar exponential growth rates have occurred for other aspects of computer technology: disk capacities, memory chip capacities, and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution.

                   Logic           DRAM             Disk
Capacity           2x in 3 years   4x in 3 years    4x in 3 years
Speed (latency)    2x in 3 years   2x in 10 years   2x in 10 years

Moore's Law 1 – Number of Transistors
(figure)

Moore's Law 1 – Linewidths
One of the key drivers behind the industry's ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enables more components to fit onto an IC (typically 2x per linewidth generation) but also lowers costs (typically 30% per linewidth generation).

Moore's Law 1 – Die Size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38 to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes.

Moore's Law in Action
The number of transistors on a chip doubles annually.
(figure)

Moore's Law 1 – Microprocessor
Moore's Law 1 – Capacity of a Single-Chip DRAM
(figures)

Improving Frequency via Pipelining
Process technology and microarchitecture innovations together enable a doubling of frequency every process generation. The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in a pipeline stage decreases.

Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.

Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2. Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially: the cost of semiconductor tools doubles every four years.
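Both laws are simple compound-growth statements, which a few lines of code make concrete. The baselines below (6 million transistors and a $2 billion fab in 1996) are the slides' own figures; treating both laws as smooth exponentials is this sketch's assumption.

```python
# Illustrative sketch of Moore's law 1 and Rock's law as compound growth.
# Baselines are the 1996 figures quoted in these slides; smooth exponential
# growth is an assumption of this sketch.

def compound(start_value, doubling_period_years, years_elapsed):
    """Value after `years_elapsed` if it doubles every `doubling_period_years`."""
    return start_value * 2 ** (years_elapsed / doubling_period_years)

# Moore's law 1: transistor count doubles roughly every 1.5 years (18 months).
transistors_1996 = 6e6          # Pentium-class count, per the slides
print(f"2002 transistor estimate: {compound(transistors_1996, 1.5, 6):.2e}")

# Moore's law 2 (Rock's law): fab/tool cost doubles every 4 years.
fab_cost_1996 = 2e9             # the $2 billion Pentium fab, per the slides
for year in (2000, 2004):
    cost = compound(fab_cost_1996, 4.0, year - 1996)
    print(f"{year} projected fab cost: ${cost / 1e9:.1f} billion")
# Prints ~$4B and ~$8B; the slides' rounder $5B/$10B figures follow from
# projecting the same doubling from slightly earlier baselines.
```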
By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.

Moore's Law 2 (Rock's Law) – continued
For example: in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996, a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.

Moore's Law 2 (Rock's Law)
The cost of semiconductor tools doubles every four years.

Machrone's Law
The PC you want to buy will always be $5000.

Metcalfe's Law
A network's value grows proportionally to the square of the number of its users.

Wirth's Law
Software is slowing down faster than hardware is accelerating.

Performance and New Technology Generations
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity. The increase in components per chip comes from the following key factors: a factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y); an additional factor of 2^0.5 comes from an increase in chip area; and a final factor of 2^0.5 comes from device and circuit cleverness. Multiplied together, 2 · 2^0.5 · 2^0.5 = 4, which accounts for the quadrupling of memory capacity per generation.

Development in ICs
Semiconductor Industry Association Roadmap Summary for High-End Processors

Specification/year           1997      1999      2001      2003      2006      2009      2012
Feature size (micron)        0.25      0.18      0.15      0.13      0.1       0.07      0.05
Supply voltage (V)           1.8-2.5   1.5-1.8   1.2-1.5   1.2-1.5   0.9-1.2   0.6-0.9   0.5-0.6
Transistors/chip (millions)  11        21        40        76        200       520       1400
DRAM bits/chip (mega)        167       1070      1700      4290      17200     68700     275000
Die size (mm2)               300       340       385       430       520       620       750
Global clock freq. (MHz)     750       1200      1400      1600      2000      2500      3000
Local clock freq. (MHz)      750       1250      1500      2100      3500      6000      10000
Maximum power/chip (W)       70        90        110       130       160       170       175

Total Transistors per Chip
(figure: transistors/chip in millions versus technology generation and year, rising from 11 million at 0.25 micron/1997 to 1400 million at 0.05 micron/2012)

Clock Frequency versus Year for Various Representative Machines
(figure)

Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
(figure: global and local clock frequency in MHz versus technology generation and year, 1997-2012)
For higher-frequency clocking (>10 GHz), new ideas and new ways of designing digital systems are needed.

Intel's Microprocessors – Clock Frequency
(figure)

Technology Trends – Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly. Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?

Solution
In 1987: the span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l.
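A short script confirms the compounding. The 384,400 km average earth-moon distance used at the end is an assumption of this sketch, not a figure from the slides.

```python
# A quick check of the car analogy's compound-growth arithmetic.
# The 384,400 km average earth-moon distance is an assumed outside figure.

speed_1977, economy_1977 = 150.0, 10.0       # km/h, km/l

factor_1987 = 1.35 ** 10                     # 35%/yr for 10 years -> ~20.1
factor_2000 = 1.5 ** 13                      # 50%/yr for 13 more years -> ~194.6

speed_2000 = speed_1977 * factor_1987 * factor_2000      # ~586,700 km/h
economy_2000 = economy_1977 * factor_1987 * factor_2000  # ~39,100 km/l

moon_km = 384_400.0
print(f"top speed in 2000:    {speed_2000:,.0f} km/h")   # matches the slides'
print(f"fuel economy in 2000: {economy_2000:,.0f} km/l") # figures up to rounding
print(f"time to the moon:     {moon_km / speed_2000 * 60:.1f} min")  # ~39 min
print(f"fuel for round trip:  {2 * moon_km / economy_2000:.1f} l")   # ~20 l
```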
This is fast enough to cover the distance from the earth to the moon in about 39 minutes, and to make the round trip on less than 20 liters of gasoline.

Feature Size versus Time in Silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law. The National Technology Roadmap for Semiconductors (NTRS), and most recently the International Technology Roadmap for Semiconductors (ITRS), now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with more than 10^11 components are expected to be available.

Trends in Feature Size over Time
(figure)

Processor Technology Today
The most advanced process technology today (year 2003) is 0.10 µm = 100 nm. Ideally, each process generation scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7. With such scaling, typical improvement figures are the following:
•1.4 – 1.5 times faster transistors
•two times smaller transistors
•1.35 times lower operating voltages
•three times lower switching power

Process Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry. Over 25 years it has delivered a 1000-fold growth in frequency (from 1 MHz to 1 GHz) and a 100-fold growth in integration (from ~10K to ~1M devices). Microarchitecture attempts to increase both IPC and frequency.

Process Technology and Microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC). Pipelining, as a microarchitecture idea, helps to increase frequency. A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.

Frequency and Performance Improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages. With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
(figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth, from 1 to 35 stages)

Performance of Memory and CPU
Memory in a computer system is hierarchically organized. In 1980 microprocessors were often designed without caches; nowadays, microprocessors often come with two levels of caches.

Memory Hierarchy
(figure)

Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986, and 55% per year since 1987. Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
(figure: relative processor/memory speed over time)

Types of Memories
MOS memories divide into RAMs, which are volatile (power off: contents lost), and ROMs, which are non-volatile (power off: contents kept):
•RAM: SRAM, DRAM
•ROM: ROM, EPROM, EEPROM, FLASH
(figures: percentage of usage; typical applications of DRAM)

An Anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired. In other words, three out of every four CPU cycles retired zero instructions: most were spent waiting for memory. Processor speed has seriously outstripped memory speed. Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.

An Anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
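The bandwidth arithmetic of this thought experiment is worth spelling out; a minimal sketch:

```python
# Pin-bandwidth arithmetic for the anecdote above: each of the three
# doublings (clock rate, issue width, instruction streams) multiplies
# the required pin bandwidth by two.

bytes_per_transfer = 16          # 16 bytes ...
transfer_interval_ns = 8         # ... every 8 ns
today_gb_s = bytes_per_transfer / transfer_interval_ns   # 2 GB/s

future_gb_s = today_gb_s * 2 * 2 * 2   # 2x clock, 2x issue, 2 streams
print(f"today:  {today_gb_s:.0f} GB/s")
print(f"future: {future_gb_s:.0f} GB/s = 32 bytes every 2 ns")
```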
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep such a chip busy. It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns?

Memory System
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance. To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.

A Small, Fast, Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data. A somewhat bigger, but slower and cheaper cache may be located between the microprocessor and the system bus which connects the microprocessor to the main memory.

Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip. The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses. The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.

Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles. A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance. Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.

Conclusions Concerning Memory – Problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data. The performance bottlenecks are no longer in the floating-point multiplier or in having only a single integer unit. The real design action is in the memory subsystems: caches, buses, bandwidth, and latency.

Conclusions Concerning Memory – Problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close. One expects that over the coming decade, memory subsystem design will be the only important design issue for microprocessors.

Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organization, DRAM architecture, DRAM speed) can affect total execution time by a factor of two. The system-level parameters that most affect performance are:
a) the number of independent channels and banks connecting the CPU to the DRAMs, which can effect a 25% performance change;
b) burst width, i.e., the data access granularity, which can effect a 15% performance change;
c) magnetic RAM (MRAM), a new type of memory.

Magnetic RAM – MRAM, or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology, a nanotechnology RAM device consists of tiny carbon nanotubes. Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
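Returning to the cache numbers quoted a few slides back, they are enough to estimate the average memory access time of such a two-level hierarchy. The sketch below picks mid-range values from the quoted intervals; that choice is this example's assumption.

```python
# Average memory access time (AMAT) for the two-level hierarchy described
# above. Mid-range values from the slides: L1 catches ~95% of accesses in
# 2-3 cycles, L2 catches ~50% of L1 misses in 6-10 cycles, memory ~100.

l1_hit_rate, l1_cycles = 0.95, 3
l2_local_hit_rate, l2_cycles = 0.50, 8
memory_cycles = 100

# AMAT = L1 time + L1 miss rate * (L2 time + L2 local miss rate * memory time)
amat = l1_cycles + (1 - l1_hit_rate) * (
    l2_cycles + (1 - l2_local_hit_rate) * memory_cycles
)
print(f"average memory access time: {amat:.1f} cycles")  # ~5.9 cycles

# Without any caches every access would cost the full ~100 cycles, which is
# why the hierarchy dominates delivered performance.
```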
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes, each 1 nm – just a few atoms – in diameter, on a silicon wafer. MRAM is nonvolatile, yet it has the performance of static RAM (SRAM): fast reads and writes.

MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Predictions are that these memories will be crystalline structures that users grow on silicon.

Capacity of DRAM and Flash – MRAM
(figure: density in Gb and design rule in nm versus year, 2003-2010; Flash grows from 2 Gb SLC and 4 Gb MLC (2 bits/cell) toward 8 Gb and 16 Gb MLC (3 bits/cell), while DRAM grows from 0.512 Gb toward 4 Gb as the design rule shrinks from 100 nm through 90 nm and 70 nm to 55 nm)

Surpassing the Prediction from Moore's Law – DRAM versus Flash
The famous Moore's law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
(figure: density in Gb versus year, 2000-2010; Flash doubles its density every year, from 1 Gb through 2 Gb and 4 Gb MLC (2 bits/cell) toward 12 Gb MLC (3 bits/cell), overtaking the Moore's-law trend line, while DRAM grows from 0.512 Gb toward 4 Gb)

Overall Memory Prediction Roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years. High-density memory growth will surpass the prediction from Moore's law.

Overall Memory Prediction Roadmap – continued
(figure)

Computer Food Chains
(figures: the 1988, 1997, and 2003 computer food chains, from mainframes and supercomputers downward)

Outline
•Technology Trends
•Process Technology Challenges – Low Power Design
•Microprocessors' Generations
•Challenges in Education

Outline – Low Power Design
•Power Trends in VLSI
•Viewpoint on Power
•Research Efforts in Low Power Design
•Is There an Optimal Design Point?

Power Consumption
•During 1995, the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh.
•During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production.
•For 2015, one expects the energy consumption of all PC machines to be 15% greater than in 1995, i.e., 69 × 10^6 MWh.

Typical Low-Power Applications
•battery-operated equipment,
•mobile communication equipment,
•wireless communication equipment,
•instrumentation,
•consumer electronics,
•biomedical technologies,
•industry, process control ...

Power Dissipation over Time
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai, 1995)
(figure: power in watts versus year, 1980-1995, growing 4x every 3 years from about 0.01 W toward 100 W)
"By the year 2000 the power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."

Gloom and Doom Predictions
Power density will increase.

VDD, Power, and Current Trends
(figure: supply voltage VDD in volts (falling from about 2.5 V), power per chip in watts (rising toward about 200 W), and supply current in amperes (rising toward about 500 A) versus year, 1998-2014)
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)

Power Delivery Problem (not just California)
Your car starter!
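The car-starter quip is just I = P / V applied to the trends above. The sketch below takes power and voltage points from the SIA roadmap table earlier in this deck; the 150-250 A starter-motor range is an assumed outside figure, not from the slides.

```python
# Supply current implied by the roadmap: I = P / V. As VDD falls and power
# per chip rises, the current heads into car-starter territory.

roadmap = {  # year: (max power per chip [W], low end of supply voltage [V])
    1999: (90, 1.5),
    2006: (160, 0.9),
    2012: (175, 0.5),
}
for year, (power_w, vdd_v) in sorted(roadmap.items()):
    print(f"{year}: {power_w} W / {vdd_v} V = {power_w / vdd_v:,.0f} A")

# 1999: 60 A, 2006: ~178 A, 2012: 350 A. A typical car starter motor draws
# roughly 150-250 A (assumed figure), hence the quip.
```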
Power Consumption – A New Dimension in Design

Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are captured by:

Pavg = pt·CL·Vdd²·fclk + Isc·Vdd + Ileakage·Vdd + Pstatic = P1 + P2 + P3 + P4

where:
•P1 – capacitive switching power (dynamic, dominant)
•P2 – short-circuit power (dynamic)
•P3 – leakage current power (static)
•P4 – static power dissipation (minor)

Research Efforts in Low-Power Design
Reduce the active load:
•minimize the circuits
•use more efficient design
•charge recycling
•more efficient layout
Technology scaling:
•the highest win
•thresholds should scale
•leakage starts to bite
•dynamic voltage scaling
Reduce switching activity (Psw = pt·CL·Vdd²·fclk):
•conditional clock
•conditional precharge
•switching off inactive blocks
•conditional execution
Run it slower:
•use parallelism
•fewer pipeline stages
•use double-edge flip-flops

Reducing the Power Dissipation
The power dissipation can be minimized by reducing the supply voltage, the load capacitance, and the switching activity.
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.

Gate Delay and Power Dissipation versus Supply Voltage
(figure: normalized power dissipation and normalized gate delay versus supply voltage from 0.6 V to 5.0 V; lowering Vdd cuts power steeply but increases gate delay)

The Need for Low Power
•Efficient methodologies and technologies for the design of high-throughput, low-power digital systems are needed.
•The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.

Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC. The term power management refers to such techniques in general. Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit; and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.

General Approaches to Reduce Power
a) Reduction of fclk is an acceptable option when some components may be idle or low-active during operation.
b) Reduction of Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle; the main efforts to reduce this capacitance are made at the system, architectural, RTL, circuit, or technology level.

Low-Power and Low-Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology; higher levels offer higher impact and more options:
•System level: design partitioning, power-down
•Algorithm level: complexity, concurrency, locality, regularity, data representation
•Architecture level: voltage scaling, parallelism, instruction set, signal correlations
•Circuit level: transistor sizing, logic optimization, activity-driven power-down, low-swing logic, adiabatic switching
•Process/device level: threshold reduction, multiple thresholds

Multiple Frequencies on the Chip as a Technique to Reduce Power
A less aggressive approach, which is attracting more attention, is to run different parts of the chip at different frequencies. This technique is standardly used in VLSI ICs in order to reduce power dissipation while maintaining the required operating speed.
(figure: a global clock fCLK distributed through four PLL/DLL blocks, each generating a family of local frequencies f11, f12, ..., f41, f42, ... for different on-chip domains)
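Techniques like multiple frequencies and multiple supply voltages work because of how fclk and Vdd enter the Pavg equation above. A short sketch shows the leverage of each knob; the operating point is an illustrative assumption, not a slide figure.

```python
# Dynamic switching power from the slides' equation: Psw = pt * CL * Vdd^2 * fclk.
# The operating point below is illustrative; only the ratios matter.

def switching_power(pt, cl_farads, vdd_volts, fclk_hz):
    """Capacitive switching power P1, the dominant CMOS power component."""
    return pt * cl_farads * vdd_volts**2 * fclk_hz

base = switching_power(pt=0.2, cl_farads=1e-9, vdd_volts=1.8, fclk_hz=1e9)

# Halving the clock frequency halves the power (linear in fclk)...
half_f = switching_power(0.2, 1e-9, 1.8, 0.5e9)
# ...but halving Vdd quarters it (quadratic in Vdd), at the cost of delay.
half_v = switching_power(0.2, 1e-9, 0.9, 1e9)

print(f"baseline: {base:.3f} W")
print(f"fclk/2:   {half_f:.3f} W ({half_f / base:.0%} of baseline)")
print(f"Vdd/2:    {half_v:.3f} W ({half_v / base:.0%} of baseline)")
```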
Energy Minimization Using Multiple Frequencies
(figures: a PLL-based digital system, in which reference clock CLKREF and feedback clock CLKFB feed a phase detector, charge pump, and loop filter driving a VCO, whose output passes through a clock distribution and frequency multiplier to supply the digital logic with f1, 2f1, ..., nf1, with a divide-by-N in the feedback path; and a DLL-based equivalent built around a voltage-controlled delay line (VCDL) of delay cells DC1 ... DCn)

Clock Gating and Clock Distribution as Techniques to Reduce Power
(figure: a PLL clock generator drives the clock distribution network; enable signals Enable_A, Enable_B, Enable_C are latched and ANDed with the clock to produce gated clocks for the target flip-flops)
Clock gating: the use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.

Energy Minimization Using Multiple Supply Voltages
•Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
•This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
•This scheme tends to result in a smaller area overhead compared to parallel architectures.

System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components, or a minimum load on such components. A power manager observes workload information and issues commands that drive the system's power state machine.
(figure: a power manager (observer plus controller) above a system with three power states – RUN at P = 400 mW, IDLE at P = 50 mW (wait for interrupt), and SLEEP at P = 0.16 mW (wait for wake-up event) – with transition times of roughly ~10 ms between RUN and IDLE, ~90 ms into SLEEP, and ~160 ms to wake up)

Power Breakdown in a High-Performance CPU and Dynamic Instruction Statistics
(figures: power breakdown – clock 43%, datapath 25%, memory 16%, control and I/O 12%; dynamic instruction statistics – data moves 43%, control flow 23%, arithmetic ops 15%, compare ops 13%, logical ops 5%, others 1%)

Architecture Trade-offs
(figures: reference, parallel, and pipelined datapaths for a simple architecture – the more parallel, the better)

Outline
•Technology Trends
•Process Technology Challenges – Low Power Design
•Microprocessors' Generations
•Challenges in Education

Outline – Microprocessors' Generations
•First generation: 1971-78 – behind the power curve
•Second generation: 1979-85 – becoming "real" computers
•Third generation: 1985-89 – challenging the "establishment"
•Fourth generation: 1990- – architectural and performance leadership

The Microprocessor Today
When we say "microprocessor" today, we generally mean the shaded area of the figure.

The First Generation: 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– narrow datapaths (= slow performance)
– awkward architectures
– assembly language + some BASIC
Processors:
– Intel 4004, 8008,
8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502

Intel 4004
First general-purpose, single-chip microprocessor
– shipped in 1971
– 4-bit architecture
– 2,300 transistors
– performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3,500 transistors
– first microprocessor-based computer (the Micral); targeted at laboratory instrumentation, mostly sold in Europe

Intel 8080
An 8-bit architecture with a 16-bit address space
– delivered in 1974
– 4,800 transistors
– performance < 0.2 MIPS
Used in the Altair 8800 system:
– kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy
– its 100-line bus became the S-100, the first microcomputer bus
– Gates & Allen write BASIC for it
– Wozniak builds one at the Homebrew Computer Club

Intel 8086
Introduced in 1978
– performance < 0.5 MIPS
– new 16-bit architecture, "assembly language" compatible with the 8080
– 29,000 transistors
– includes memory protection and support for an FP coprocessor
In 1981, IBM introduces the PC, based on the 8088, an 8-bit-bus version of the 8086.

The Second Generation: 1979-85
Becoming "real" computers
– first 32-bit architecture (68000)
– first virtual memory support
– workstations, Macs, and PCs based on microprocessors
Transistor counts > 50,000
Performance <= 1 MIPS
Processors:
– Motorola 68000, 68020
– Intel 80286, 80386

Motorola 68000
A major architectural step in microprocessors:
– first 32-bit architecture, with an initial 16-bit implementation
– first flat 32-bit address space, with support for paging
– general-purpose register architecture, loosely based on the PDP-11
First implementation in 1979:
– 68,000 transistors
– < 1 MIPS
Used in:
– the Apple Mac
– Sun, Silicon Graphics, and Apollo workstations

The Third Generation: 1985-89
Challenging the "establishment"
– microprocessors surpass minicomputers in performance and rival mainframes
– microprocessors become the implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistor counts < 500K
Performance > 5 MIPS
Processors:
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC

MIPS R2000
Several firsts:
– first RISC microprocessor
– first microprocessor to provide integrated support for instruction and data caches
– first pipelined microprocessor (sustaining 1 instruction/clock)
Implemented in 1985:
– 125,000 transistors
– 5-8 MIPS

The Fourth Generation: 1990-
Architectural and performance leadership
– first 64-bit architecture
– first multiple-issue machines
– first multilevel caches
Transistor counts > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors:
– Intel i860, Pentium; MIPS R4000, R10000; DEC Alpha; Sun UltraSPARC; HP PA-RISC; PowerPC
Generation 4.5: the same basic approach, but faster clock rates and wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium

Key Architectural Trends
Performance increases at 1.6x per year – true from 1985 to the present. This comes from a combination of technology and architectural enhancements:
– technology provides faster transistors and more of them
– faster transistors lead to higher clock rates
– more transistors: architectural ideas turn transistors into performance, responsible for about half the yearly performance growth
Two key architectural directions:
– sophisticated memory hierarchies
– exploiting instruction-level parallelism

Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth; the CPU-DRAM access gap has grown by a factor of 30-50!
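That 30-50x figure is consistent with the growth rates quoted earlier in the deck. A quick check, assuming DRAM latency improves only about 7% per year (a common textbook figure assumed here, not taken from the slides):

```python
# Rough check of the CPU-DRAM gap: CPU performance grows ~55%/year
# (slide figure, post-1987) while DRAM latency improves ~7%/year (assumed).

cpu_growth, dram_growth = 1.55, 1.07

for years in (9, 10):
    gap = (cpu_growth / dram_growth) ** years
    print(f"after {years} years the relative gap is about {gap:.0f}x")
# ~28x after 9 years, ~41x after 10 -- consistent with the 30-50x above.
```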
Trend 1: increasingly large caches
– on-chip: from 128 bytes (1984) to 100K+ bytes
– multilevel caches: add another level of caching; the first multilevel cache appeared in 1986, and secondary cache sizes today range from 128 KB to 4-16 MB
Trend 2: advances in caching techniques
– reduce or hide cache miss latencies: early restart after a cache miss (1992); nonblocking caches, which continue during a cache miss (1994)
– cache-aware combinations of computers, compilers, and code writers: prefetching, i.e., instructions to bring data into the cache early

Exploiting ILP
ILP is the implicit parallelism among instructions. It is exploited by:
– overlapping execution in a pipeline
– issuing multiple instructions per clock: superscalar designs use dynamic issue decisions (HW driven), while VLIW designs use static issue decisions (SW driven)
Milestones:
– 1985: simple microprocessor pipeline (1 instr/clock)
– 1990: first static multiple-issue microprocessors
– 1995: sophisticated dynamic schemes that determine parallelism dynamically, execute instructions out of order, and speculate based on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path.

MIPS R4000
First 64-bit architecture
– integrated caches: on-chip, with support for an off-chip secondary cache
– integrated floating point
Implemented in 1991:
– deep pipeline
– 1.4M transistors
– initially 100 MHz
– > 50 MIPS

Intel i860
First multiple-issue microprocessor:
– 2 instructions/clock
– dual-issue mode
– novel "push" pipeline
– novel cache bypass
Implemented in 1991:
– 1.3M transistors
– 50 MIPS
Used primarily as an attached processor (e.g., for graphics)

MIPS R10000
First speculative processor:
– instructions scheduled and executed out of order
– up to 4 instructions can complete per clock
– window of 32 instructions (up to 32 in flight)
– maintains precise state by completing instructions in order
Implemented in 1996:
– 6.8M transistors
– 200 MHz

Intel IA-64 and Itanium
EPIC architecture:
– uses a compiler-centric approach while avoiding its disadvantages
– parallelism is demarcated by the compiler
– many special instructions and features for exploiting ILP in the compiler
Itanium – first implementation (2001):
– 25M transistors
– 800 MHz
– 130 watts
(figure: breakdown of tasks between the compiler and the runtime hardware)

Today's Uniprocessor ILP Menu
A wide variety of approaches, both hardware and compiler intensive – with no clear-cut winners at present!
Software techniques (lower hardware complexity; more, longer-range analysis; more machine dependence):
– static scheduling
– static issue (i.e., VLIW)
– static branch prediction
– alias/pointer analysis
– static speculation
Hardware techniques (more stable performance; higher complexity; potential clock rate impact):
– dynamic scheduling
– dynamic issue (i.e., superscalar)
– dynamic branch prediction
– dynamic disambiguation
– dynamic speculation

The Big Picture – ILP and Memory Systems
My view: there is no performance wall, but steeper slopes lie ahead. The easier territory is behind us, the industry-research gap has vanished, and energy efficiency may be the key limit.
(figure: the "ILP mountain" – simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation, multipath – alongside the "cache mountain" – simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, prefetching)

Microprocessors Today – Where They Are and What They Can Do
(figure: performance versus year, 1965-1995, on a log scale; microprocessors overtake minicomputers and mainframes and approach supercomputers)

Microprocessors – Where They Go
Bit-level parallelism → instruction-level parallelism → thread-level parallelism (?)
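Before the deck turns to thread-level parallelism, the frequency-versus-CPI trade-off discussed earlier (frequency rises with pipeline depth while CPI worsens) can be made concrete with a toy model. Every constant below is an illustrative assumption, not a measurement from the slides.

```python
# Toy model of performance versus pipeline depth. Cycle time is the logic
# delay divided across stages plus a fixed latch overhead; IPC degrades
# with depth because hazard penalties grow with the number of stages.
# All constants are illustrative assumptions.

LOGIC_DELAY_NS = 20.0    # total combinational delay of the unpipelined path
LATCH_OVERHEAD_NS = 0.5  # per-stage latch/clock-skew overhead
HAZARD_RATE = 0.1        # pipeline-disrupting events per instruction
BASE_IPC = 1.0

def performance(depth: int) -> float:
    """Instructions per ns = frequency * IPC at a given pipeline depth."""
    cycle_ns = LOGIC_DELAY_NS / depth + LATCH_OVERHEAD_NS
    ipc = BASE_IPC / (1 + HAZARD_RATE * depth)  # deeper pipe, costlier flushes
    return ipc / cycle_ns

best = max(range(1, 41), key=performance)
for depth in (5, 10, best, 35):
    print(f"depth {depth:2d}: {performance(depth):.3f} instr/ns")
print(f"optimum depth under these assumptions: {best}")
# Performance climbs up to a depth of ~20 stages and then falls off, echoing
# the slides' observation that returns diminish beyond ~20 pipeline stages.
```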
Intel: More Transistors, Faster Devices
(figure: number of transistors in Intel's processors, 1970-2005, on a log scale, from the i4004 (~2,300 transistors) through the i8008, i8080, i8086, i80286, i80386, R2000/R3000, Pentium, and R10000, heading toward 100,000,000)

Higher-Level Parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency. The most pronounced are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).

Multithreading
A microprocessor can execute multiple operations at a time – 4 or 6 operations per cycle – but it is hard to achieve this level of parallelism from a single program. Can we run multiple programs (threads) on a (single) processor without much effort? Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.

Parallel Thread Sequencing Model / Principles of SMT
(figures)

Multithreading in Today's Processors
Today many high-end microprocessors are multithreaded (e.g., the Intel Pentium 4). They support 2-4 threads, but expect to get only about a 1.3x improvement in throughput.

Chip Multiprocessor
Several processor cores in one die, with shared L2 caches; chip-to-chip communication allows building multichip modules with many CMPs plus memory.

Chip Multiprocessor (CMP) Platform Model
CMP is a simple and very powerful technique for obtaining more performance in a power-efficient manner. The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC). The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.

Chip Multiprocessor (CMP) Platform Model – continued
CMP is an attractive option when moving to a new process technology, such as SoC. Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc. MPSoCs are usually implemented as heterogeneous systems. CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.

A Generic Microprocessor circa 2010
•4-8 general-purpose processing engines on chip, used to execute independent programs
•explicitly parallel programs (when possible)
•speculatively parallel threads
•special-purpose processing units (e.g., DSP functionality)
•an elaborate memory hierarchy
•elaborate inter-chip communication facilities

Characteristics of Superscalar, Simultaneous Multithreading, and Chip Multiprocessor Architectures

Characteristic                                   Superscalar   Simultaneous      Chip
                                                               multithreading    multiprocessor
Number of CPUs                                   1             1                 8
CPU issue width                                  12            12                2 per CPU
Number of threads                                1             8                 1 per CPU
Architectural registers (integer and FP)         32            32 per thread     32 per CPU
Physical registers (integer and FP)              32 + 256      256 + 256         32 + 32 per CPU
Instruction window size                          256           256               32 per CPU
Branch predictor table size (entries)            32,768        32,768            8 x 4,096
Return stack size                                64 entries    64 entries        8 x 8 entries
Instruction (I) and data (D) cache organization  1 x 8 banks   1 x 8 banks       1 bank
I and D cache sizes                              128 KB        128 KB            16 KB per CPU
I and D cache associativities                    4-way         4-way             4-way
I and D cache line sizes (bytes)                 32            32                32
I and D cache access times (cycles)              2             2                 1
Secondary cache organization                     1 x 8 banks   1 x 8 banks       1 x 8 banks
Secondary cache size (Mbytes)                    8             8                 8
Secondary cache associativity                    4-way         4-way             4-way
Secondary cache line size (bytes)                32            32                32
Secondary cache access time (cycles)             5             5                 7
Secondary cache occupancy per access (cycles)    1             1                 1
Memory organization (no. of banks)               4             4                 4
Memory access time (cycles)                      50            50                50
Memory occupancy per access (cycles)             13            13                13

The Microprocessor Tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.

Outline
•Technology Trends
•Process Technology Challenges – Low Power Design
•Microprocessors' Generations
•Challenges in Education

Outline – Challenges in Education
•Changes in curricula
•Fundamentals
•A sort of challenge we should accept

Challenges in Education
It has often been said that where you stand depends on where you sit. In this context, starting from our positions and experiences, this is our view concerning the theme. How shall we satisfy the long-term educational needs of engineers? How should the training of new engineers be organized? The engineers we are training today will still be practicing 40 years from now. Are we preparing them for what they will be doing then? Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years? We think not, on both counts.

Our View and Our Experience
Our view is that the practice of engineering is rapidly changing, and that engineering education is not keeping up. Our experiences are primarily in information technology (both in academia and industry), which, admittedly, has changed more rapidly than some other fields.

Changes in Curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications. But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.

What Are Fundamentals?
The undergraduate curriculum should teach (only) fundamentals. Everyone agrees with that. But what are the fundamentals? Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics. But, as we said earlier, engineering is changing.

What Kinds of Fundamentals Do We Need Now – Some Examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future – i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT; it is a new fundamental. Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.

Kinds of Fundamentals
Engineering systems are increasingly complex, and increasingly contain components from across the spectrum of traditional engineering fields; more knowledge of the full spectrum will be fundamental. Engineering is global, and is performed in a holistic business context; the engineer must design under constraints that include global cultural and business contexts, and so must understand them. These two are new fundamentals.

How to Add These New Fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the currently cherished fundamentals and either displace them or find ways to cover them much more rapidly.

What Will the Character and Essence of Electrical and Computer Engineering Education Look Like in the Future?
It is difficult to predict the future with any accuracy, but it is safe to say that web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.

A Sort of Challenge We Should Accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of its Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time. That is the sort of challenge we should accept for improving engineering education.