CA226 – Advanced Computer Architecture
Lecturer: Mike Scott, L2.16, [email protected]
Notes: p:\public\mike\ca226
Lectures: C114 Monday 2.00pm, C114 Thursday 10.00am
Textbook: Computer Architecture: A Quantitative Approach, Hennessy & Patterson, Morgan Kaufmann Publishers Inc, 3rd Edition, 2002
Web page: http://www.computing.dcu.ie/~mike/ca226.html
Labs: will start around week 4/5. LG25/26, Mondays 4-6pm
Continuous Assessment: 40% (2 lab exams, 20% each)
Final Exam: 60%
Assessment will use the WinMIPS64 simulator.

Syllabus:
Fundamentals of Computer Design
Instruction set principles & design
Pipelining
Memory hierarchy design & cache memory
Efficient programming techniques and optimisation

Short History of CPUs – see http://www.pcmech.com/show/processors/35
1971 – 4-bit 4004 processor
1973 – 8-bit 8080 processor
1978 – 16-bit 8086 processor, clock 10MHz
1982 – 16-bit 80286 processor, clock 16MHz
1985 – 32-bit 80386 processor, clock 25MHz
1989 – 32-bit 80486 processor, clock 33MHz
1993 – 32-bit Pentium processor, clock 66 to 200MHz
1997 – 32-bit Pentium II, clock 200 to 450MHz
1999 – 32-bit Pentium III, clock 450MHz to 1GHz
2000 – 32-bit Pentium 4, clock 1GHz to 3.8GHz
2002 – 64-bit Itanium 2, clock 1GHz
2006 – 64-bit dual-core Itanium 2, clock 1.5GHz
2006 – 64-bit Core 2 (Centrino), clock 2-3GHz

Clock rates are no longer increasing. Instead, the number of clock cycles per instruction has decreased dramatically. Processors have gone superscalar – 2 or more instructions executed per clock cycle. Processors are going multi-core – 2 or more completely separate processors on a chip. Completely new architectures (like the Itanium) have been introduced.

Breaking News! Last year IBM and Sony announced a new microprocessor called CELL. This new cutting-edge architecture features in the recently announced games console (PlayStation 3) from Sony. Perhaps surprisingly, computer games are among the most performance-demanding of computer applications, and competition in this area is a major force driving architectural development. One blindingly obvious way to make computers go faster is to carry out tasks in parallel rather than sequentially. The CELL pushes this idea to the limit. The CELL chip contains one main 64-bit processor (a PowerPC) and 8 slave 32-bit processors (each with 128 x 128-bit registers). Each of the slave processors has its own registers and fast memory, so it is effectively distributed computing on a chip. The CELL is clocked at 4.6GHz. Note that the Pentium 4 maxes out at 3.8GHz, and it has only one processor. CELL processors are themselves designed to work in parallel. The tricky bit – how best to program such a beast?

Since 1945 computers have been getting faster and cheaper. This has largely been due to increased demand creating a larger market, and to technological advances. Demand has increased as computers became more friendly and easy to use. Technological advances led to an increase in performance of about 30% per year in the 1970s, rising to 50% per year by the 1990s.

Moore's law: the number of transistors that can be placed on a chip doubles every 18 months.

The virtual elimination of assembly language programming made it easier to introduce new architectures. This led to the introduction of RISC architectures in the 1980s. Microprocessor technology now predominates; even supercomputers are built from vast arrays of microprocessors. A very rough indication of computer performance is clock speed.
Technological advances make it possible to use much higher clock speeds. However, if faster electronics were the only factor, computers would be about 5 times slower than they are – other advances in architectural ideas have made a major contribution, pipelining for example. A major component of computer design is Instruction Set Architecture, but there are other issues of implementation, organisation and hardware which are at least as important.

Memory
A major subsystem is of course memory. The amount of memory needed by the average program has grown by a factor of 1.5 to 2.0 per year. This implies that 0.5 to 1 extra address bits are needed per year (since log2 1.5 ≈ 0.6 and log2 2 = 1). In the 1970s 16-bit addresses were very common, now we are used to 32-bit addresses, and 64-bit computers will soon be required. When 32 bits becomes too small, all 32-bit architectures become instantly redundant (gulp!). The cost per bit of DRAM memory has rapidly declined. Densities have increased by about 60% per year, while cycle time has improved much more slowly: in 1980 a typical cycle time was 250ns; by 2000 it had improved to 100ns. Many technological advances occur in discrete steps rather than smoothly. For example, DRAM sizes, when they increase, always increase by factors of 4. The cost of 1 megabyte of memory has decreased from over $15000 in 1977 to about $0.08 in 2001!

Compilers
Compiler technology, after a slow start, has improved dramatically. Compilers can now optimise code for RISC processors so as to optimise pipelining behaviour and memory accesses. Compiler optimisations have become more intelligent, and new architectures are designed with compilers very much in mind. It is a relatively easy procedure to port the GCC compiler to a new architecture. But compilers can still generate very poor code, as the following example shows (a hand-written alternative is sketched below, after the Cost section).

High-level C language:

void fadd2(big a,big b,big c)
{
    c->w[0] = a->w[0]^b->w[0];
    c->w[1] = a->w[1]^b->w[1];
    c->w[2] = a->w[2]^b->w[2];
    c->w[3] = a->w[3]^b->w[3];
    c->w[4] = a->w[4]^b->w[4];
    c->w[5] = a->w[5]^b->w[5];
    c->w[6] = a->w[6]^b->w[6];
    c->w[7] = a->w[7]^b->w[7];
    c->w[8] = a->w[8]^b->w[8];
}

Compiled with -O3 optimisation using the GCC compiler on an Apple Mac for the PPC processor:

    lwz  r9,4(r29)
    lwz  r11,4(r28)
    lwz  r10,4(r30)
    lwz  r2,0(r11)
    lwz  r0,0(r9)
    xor  r0,r0,r2
    stw  r0,0(r10)
    lwz  r9,4(r29)    ;again?
    lwz  r11,4(r28)   ;again?
    lwz  r10,4(r30)
    lwz  r2,4(r11)
    lwz  r0,4(r9)
    xor  r0,r0,r2
    stw  r0,4(r10)
    lwz  r9,4(r29)
    lwz  r11,4(r28)
    lwz  r10,4(r30)
    lwz  r2,8(r11)
    lwz  r0,8(r9)
    xor  r0,r0,r2
    stw  r0,8(r10)
    lwz  r9,4(r29)
    lwz  r11,4(r28)
    lwz  r10,4(r30)
    lwz  r2,12(r11)
    lwz  r0,12(r9)
    xor  r0,r0,r2
    stw  r0,12(r10)
    ......

This is terrible! The same three pointers are reloaded from memory for every single word.

Cost
An important phenomenon that drives cost is the learning curve. This applies, for example, to the cost of DRAM chips. When a new component is introduced it will be quite expensive. However, as manufacturers get better at producing a particular component (for example the yield gets better), prices come down and demand goes up, pushing prices down further. If demand exceeds supply, competitive pressures decrease and prices will temporarily stabilise as manufacturers see an opportunity to make money.

How much does a computer cost? Roughly:
Sheet metal & plastic - 2%
Power supply, fans - 2%
Cables, nuts, bolts - 1%
Shipping boxes, manuals - 1%
Processor - 22%
DRAM - 5%
Video system - 5%
I/O system - 5%
Keyboard & mouse - 3%
Monitor - 19%
Hard disk - 9%
DVD drive - 6%
Pre-loaded software - 20%
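Returning to the fadd2 example from the Compilers section: below is a rough hand-written sketch of what well-optimised code would do, namely fetch each w pointer once and then loop over the words. The definition of the big type is not given in these notes, so the structure layout here is purely an assumption for illustration.

#include <stdint.h>

/* Hypothetical layout for "big" -- the real definition is not shown in
   these notes. What matters is only that w refers to 32-bit words. */
struct big_s { int len; uint32_t *w; };
typedef struct big_s *big;

void fadd2_loop(big a, big b, big c)
{
    const uint32_t *aw = a->w;    /* fetch each w pointer once ...           */
    const uint32_t *bw = b->w;
    uint32_t *cw = c->w;
    for (int i = 0; i < 9; i++)   /* ... then one load/load/xor/store per word */
        cw[i] = aw[i] ^ bw[i];
}

This needs three pointer loads in total, instead of three per word as in the compiler output above.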
Performance
How do we measure a computer's performance? Increased performance means decreased execution time. The most obvious measure of time is real time (wall-clock time). Then there is CPU time (the CPU time required by your program), system CPU time (CPU time spent by your program in operating system code) and I/O waiting time (for example, time spent waiting for you to "hit any key", or for a disk access to complete).

Traditionally computers compare performances by running special benchmark programs. Benchmark programs might be:
Real programs – compilers, spreadsheets etc.
Kernels – small key pieces of code extracted from real programs
Toy benchmarks – like the sieve of Eratosthenes, or small classic algorithms like quicksort
Synthetic benchmarks – Whetstone & Dhrystone: artificially created programs that actually do nothing useful

Computer companies like to be able to boast the best possible figures, particularly for Whetstone and Dhrystone. To this end they "cheat" by optimising their compilers specifically to maximise performance on these benchmarks – targeted optimisations. In fact optimising compilers can discard 25% of the Dhrystone code. Many procedures are called only once, so a clever compiler can simply insert the function code in-line and avoid the function calling and parameter passing overhead. A well-coached compiler might spot this optimisation, taken from a key Whetstone loop:

x = sqrt(exp(alog(x)/t))

A mathematician could simplify this to

x = exp(alog(x)/(2*t))

since the square root of exp(y) is exp(y/2). Real compilers of course don't know maths. Another benchmark used a spreadsheet, but wrote the output to a file. Normally screen output is synchronous – one output completes before the next starts – which is what a normal spreadsheet user expects. But buffering all the I/O and only writing it to the disk file at the end nicely takes the I/O time out of the measurement. Another company, noting that the file is never read, simply tossed it out altogether.

A slightly better idea is to use a collection of benchmarks, for example SPEC2000. This includes programs to test for prime numbers, a chess playing program, a gzip data compression program, a 3-D graphics library, neural network simulations etc. – 26 in all, some integer only, many floating-point. People still cheated and added benchmark-specific optimising flags to their compilers. To defeat this the SPEC organisation insists on a base-line performance measurement which requires one set of compiler flags for all the programs.

Amdahl's Law
Which architectural optimisations should we make? Each has a cost, and some must be discarded. Common sense tells us to make the common case fast: spend time and resources on those optimisations which will improve the frequent event, rather than the rare case. Consider a particular computing task. Amdahl's law states that:

Speed-up = (Performance using enhancement) / (Performance without enhancement)

The important point is that typically only a proportion of the entire task can be converted to take advantage of the proposed enhancement. If a fraction F of the execution time is affected, and the enhancement speeds that fraction up by a factor S, then

Overall speed-up = 1 / ((1 - F) + F/S)

For example, a program that implements elliptic curve cryptography would run a lot faster with a special 32x32-bit GF2 multiplication instruction (this is basically multiplication without carries). GF2 multiplication takes up 70% of the execution time, and the special instruction would make this part of the code 20 times faster. How much faster will the program run?

S = 1/(0.3 + 0.7/20) = 3

So 3 times faster. The point is that intuitively we might have expected a much better performance increase, but Amdahl's law tells us otherwise.
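As a check on the arithmetic, here is a minimal sketch of Amdahl's law in C, reproducing the GF2 example above (the function name is just for illustration):

#include <stdio.h>

/* Overall speed-up when a fraction f of the execution time is
   accelerated by a factor s and the rest is left unchanged. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* GF2 multiplication: 70% of the time, made 20 times faster */
    printf("speed-up = %.2f\n", amdahl(0.7, 20.0));  /* prints 2.99, i.e. about 3 */
    return 0;
}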
Note also that making the GF2 instruction another 10 times faster would contribute very little – the law of diminishing returns kicks in.

Amdahl's law also allows comparison between two design alternatives. For example, a program that spends half of its time doing floating-point spends 20% of its time doing floating-point square roots. One proposal is to add floating-point square root hardware which speeds that operation up by a factor of 10. A second proposal is simply to make all floating-point operations run twice as fast. Which gives the better speed-up?

S1 = 1/(0.8 + 0.2/10) = 1.22
S2 = 1/(0.5 + 0.5/2) = 1.33

So the second proposal is better.

CPU Performance Equation

CPU time = CPU clock cycles / Clock rate

We can also count the number of instructions executed (IC). This in turn allows us to calculate the very important measure

CPI = CPU clock cycles / IC

where CPI is Cycles Per Instruction. So

CPU time = IC * CPI / Clock rate

Therefore reducing IC and CPI while increasing the clock rate improves performance. Of course in practice a more complex instruction might require more cycles to complete. For a particular program it is possible to compute an average value for CPI in the obvious way. In this course we will be particularly interested in finding ways to decrease CPI.

Consider the Intel 8086 instruction PUSH AX:
8086 - 11 clock cycles
80286 - 3 clock cycles
80386 - 2 clock cycles
80486 - 1 clock cycle

So it is not just clock speed that has improved over the years, it is CPI as well.

MIPS & MFLOPS
Another common (and deceptive) measure of computer performance is MIPS – millions of instructions per second – and its cousin MFLOPS – millions of floating-point operations per second.

MIPS = Clock rate / (CPI * 10^6)

But MIPS can vary inversely to performance! Consider a machine with an optional floating-point coprocessor. Without the coprocessor, floating-point is implemented in software and requires lots of integer instructions; with the coprocessor it requires far fewer. A single FSQRT instruction might take 10 clock cycles, and replace maybe 300 integer instructions which take 1 clock cycle each. So the average CPI goes up. Which has the greater MIPS? Similarly, an optimising compiler which succeeds in reducing the number of fast 1-cycle instructions will actually result in a lower MIPS rating! MFLOPS tries (and fails) to do the same thing specifically for floating-point. Another problem here is that it is difficult to compare different computers: some implement a floating-point division instruction (Pentium), others don't (Sparc). Cycle times for complex instructions like sine and cosine might be quite large, so it depends a lot on the instruction mix in the program.
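To make the MIPS pitfall concrete, here is a small sketch using the figures above (one 10-cycle FSQRT versus 300 one-cycle integer instructions). The 100MHz clock is an arbitrary assumption; only the ratios matter.

#include <stdio.h>

/* CPU time = IC * CPI / clock rate;   MIPS = clock rate / (CPI * 10^6).
   The instruction counts below are purely illustrative. */
int main(void)
{
    double clock_hz = 100e6;                 /* assume a 100MHz machine */

    double ic_sw = 300, cpi_sw = 1.0;        /* software FP: 300 one-cycle instructions */
    double ic_hw = 1,   cpi_hw = 10.0;       /* hardware FP: one 10-cycle FSQRT */

    double t_sw = ic_sw * cpi_sw / clock_hz;
    double t_hw = ic_hw * cpi_hw / clock_hz;
    double mips_sw = clock_hz / (cpi_sw * 1e6);
    double mips_hw = clock_hz / (cpi_hw * 1e6);

    printf("software FP: %g s, %g MIPS\n", t_sw, mips_sw);  /* 3e-06 s, 100 MIPS */
    printf("hardware FP: %g s, %g MIPS\n", t_hw, mips_hw);  /* 1e-07 s,  10 MIPS */
    /* The hardware version is 30 times faster yet has one tenth the MIPS rating. */
    return 0;
}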