CS 15-447: Computer Architecture
Lecture 26: Emerging Architectures
November 19, 2007
Nael Abu-Ghazaleh, [email protected]
http://www.qatar.cmu.edu/~msakr/15447-f08
15-447 Computer Architecture, Fall 2008 ©

Last Time: Buses and I/O (Control Lines, Data Lines)
• Buses: a bunch of wires
• Shared interconnect: multiple "devices" connect to the same bus
• Versatile: new devices can connect (even ones we didn't know existed when the bus was designed)
• Can become a bottleneck
  – Shorter -> faster; fewer devices -> faster
• Have to:
  – Define the protocol that lets devices communicate
  – Come up with an arbitration mechanism

Types of Buses
• System (processor-memory) bus
  – Connects processor and memory
  – Short, fast, synchronous, design specific
• I/O bus
  – Usually lengthy and slower; industry standard
  – Needs to match a wide range of I/O devices
  – Connects to the processor-memory bus or backplane bus through a bus adaptor

Bus "Mechanics"
• Master / slave roles
• Have to define how we handshake
  – Depends on whether the bus is synchronous or not
• Bus arbitration protocol
  – Contention vs. reservation; centralized vs. distributed
• I/O model
  – Programmed I/O; interrupt-driven I/O; DMA
• Increasing performance (mainly bandwidth)
  – Shorter; closer; wider
  – Block transfers (instead of byte transfers)
  – Split-transaction buses
  – ...

Today: Emerging Architectures
• We are at an interesting point in computer architecture evolution
• What is emerging, and why is it emerging?

Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780, 1978-2006, log scale (1 to 10000); from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006. Growth slows from 52%/year to ??%/year around 2002, leaving roughly a 3X gap versus the old trend line.]
Sea change in chip design: what is emerging?
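A quick back-of-the-envelope check of how the annual rates in the plot compound over their eras (a sketch; the rates come from the slide, the era endpoints are read off the figure):

```python
def compound(rate_per_year, years):
    """Total speedup from a fixed annual improvement rate, compounded."""
    return (1 + rate_per_year) ** years

vax_era = compound(0.25, 1986 - 1978)    # 25%/year over 1978-1986
risc_era = compound(0.52, 2002 - 1986)   # 52%/year over 1986-2002

print(round(vax_era))    # about 6x in the VAX era
print(round(risc_era))   # about 800x in the RISC/x86 era
```

Sixteen years at 52%/year yields roughly three orders of magnitude, which is why the post-2002 slowdown is such a dramatic break.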
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

How did we get there?
• First, what allowed the ridiculous 52% improvement per year to continue for around 20 years?
  – If cars had improved as much, we would have 1-million-km/hr cars!
• Is it just the number of transistors / the clock rate?
• No! It's also all the stuff that we've been learning about!

Walk down memory lane
• What was the first processor organization we looked at?
  – Single-cycle processors
• How did multi-cycle processors improve on those?
• What did we do after that to improve performance?
  – Pipelining; why does that help? What are the limitations?
• From there we discussed superscalar architectures
  – Out-of-order execution; multiple ALUs
  – This is basically the state of the art in uniprocessors
  – What gave us problems there?

Detour: a couple of other design points
• Very Long Instruction Word (VLIW) architectures: let the compiler do the work
• Great for energy efficiency: less hardware spent finding instruction-level parallelism
• Not binary compatible? Transmeta Crusoe processor

SIMD ISA Extensions: Parallelism from the Data?
• The same instruction applied to multiple data elements at the same time
  – How can this help?
• MMX (Intel) and 3DNow! (AMD) ISA extensions
• Great for graphics; originally invented for scientific codes (vector processors)
  – Not a general solution
• End of detour!

Back to Moore's law
• Why are the "good times" over?
  – Three walls
Wall 1: the "Instruction Level Parallelism" (ILP) wall
  – Less parallelism available in programs (2 -> 4 -> 8 -> 16)
  – Tremendous increase in complexity to get more
  – Does VLIW help? What can help?
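One way to see why the ILP available to a single thread is limited: even with unlimited functional units, execution time is bounded by the program's longest chain of data dependences, not by its instruction count. A toy sketch over a hypothetical five-instruction fragment:

```python
from functools import lru_cache

# Each instruction lists the instructions whose results it consumes
# (a made-up fragment for illustration).
deps = {
    "i1": [],            # a = load x
    "i2": [],            # b = load y
    "i3": ["i1", "i2"],  # c = a + b
    "i4": ["i3"],        # d = c * 2
    "i5": ["i4"],        # e = d - 1   <- serial chain i3 -> i4 -> i5
}

@lru_cache(maxsize=None)
def depth(instr):
    """Length of the longest dependence chain ending at instr."""
    return 1 + max((depth(d) for d in deps[instr]), default=0)

critical_path = max(depth(i) for i in deps)  # best case even with infinite ALUs
avg_ilp = len(deps) / critical_path          # ops per cycle actually available
print(critical_path, avg_ilp)
```

Here five instructions still need four cycles, so the average exploitable ILP is only 1.25: adding more issue slots buys nothing once the dependence chain dominates.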
  – Conclusion: standard architectures cannot continue to do their part in sustaining Moore's law

Wall 2: the Memory Wall
[Figure: "Moore's Law" plot, 1980-2000, log scale (1 to 1000). Processor (µProc) performance grows 52%/yr (2X every 1.5 years) while DRAM performance grows 9%/yr (2X every 10 years); the processor-memory performance gap grows about 50% per year.]
• What did we do to help this?
  – Still very, very expensive to access memory
• How do we see the impact in practice?
• Very different from when I learned architecture!

Ways out? Multithreaded Processors
• Can we switch to other threads if we need to access memory?
  – When do we need to access memory?
• What support is needed?
• Can I use it to help with the ILP wall as well?

Simultaneous Multithreaded (SMT) Processors
• How do I switch between threads?
• Hardware support for that
• How does this help?
• But: increased contention for everything (bandwidth, TLB, caches, ...)

Third Wall: the Physics/Power Wall
• We're down to the level of playing with a few atoms
• More error prone; lower yield
• But also soft errors and wear-out
  – Logic that sometimes works!
  – Can we do something in architecture to recover?

Power! Our topic next class.

So, what is our way out? Any ideas?
Power Wall + Memory Wall + ILP Wall = Brick Wall
• Maybe architecture becomes a commodity; this is the best we can do
  – This happens to a lot of technologies: why don't we have the million-km/hr car?
• Do we actually need more processing power?
  – 8-bit embedded processors are good enough for calculators; 4-bit ones are probably good enough for elevators
  – Is there any sense in continuing to invest so much time and energy into this stuff?

A lifeline?
Multi-core architectures
• How does this help?
• Think of the three walls
• The new Moore's law:
  – The number of cores will double every 3 years!
  – Many-core architectures

Overcoming the three walls
• ILP wall?
  – Don't need to restrict myself to a single thread
  – Natural parallelism available across threads/programs
• Memory wall?
  – Hmm, that is a tough one; on the surface, it seems like we made it worse
  – Maybe help is coming from industry
• Physics/power wall?
  – Use less aggressive core technology
    • Simpler processors, shallower pipelines
    • But more processors
  – Throw-away cores to improve yield
• Do you buy it?

7 Questions for Parallelism
• Applications:
  1. What are the apps?
  2. What are the kernels of the apps?
• Hardware:
  3. What are the HW building blocks?
  4. How do we connect them?
• Programming models:
  5. How do we describe apps and kernels?
  6. How do we program the HW?
• Evaluation:
  7. How do we measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)

Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10-micron PMOS, 11 mm2 chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3-micron NMOS, 60 mm2 chip
• A 125 mm2 chip in 0.065-micron CMOS = 2312 copies of RISC II + FPU + I-cache + D-cache
  – RISC II shrinks to 0.02 mm2 at 65 nm
The processor is the new transistor!

Architecture Design Space
• What should each core look like?
• Should all cores look the same?
• How should the on-chip interconnect between them look?
• What level of the cache should they share?
  – And what are the implications of that?
• Are there new security issues?
  – Side-channel attacks; denial-of-service attacks
• Many other questions...
A brand new playground; an exciting time to do architecture research

Hardware Building Blocks: Small is Beautiful
• Given the difficulty of design/validation of large designs
• Given power limits on what we can build, parallelism is an energy-efficient way to achieve performance
  – Lower threshold voltage means much lower power
• Given that redundant processors can improve chip yield
  – Cisco Metro: 188 processors + 4 spares
  – Sun Niagara sells 6- or 8-processor versions
• Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
• One size fits all?
  – Amdahl's Law: a few fast cores + many small cores

Elephant in the room
• We tried this parallel processing thing before
  – Very difficult
• It failed, pretty much
  – A lot of academic progress and neat algorithms, but little commercial impact
• We actually have to do new programming
  – A lot of effort to develop; error prone; etc.
  – The La-Z-Boy programming era is over
  – Need new programming models
• Amdahl's law
• Applications: what will you use 1024 cores for?
• These concerns are being voiced by a substantial segment of academia/industry
  – What do you think?
  – It's coming, no matter what
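Amdahl's law, invoked twice above, is the quantitative core of the "elephant in the room": the serial fraction of a program caps the speedup no matter how many cores you add. A numeric sketch (the 95%-parallel figure is an illustrative assumption, not from the slides):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup = 1 / (serial + parallel/n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even a program that is 95% parallel tops out near 20x:
for n in (16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

With 1024 cores the 95%-parallel program gets under 20x, so over 98% of the machine's peak is wasted, which is exactly why "what will you use 1024 cores for?" is a hard question.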