CA226 – Advanced Computer Architecture
Lecturer:
Mike Scott L2.16 – [email protected]
Notes:
p:\public\mike\ca226
Lectures:
C114 Monday 2.00pm, C114 Thursday 10.00am
Textbook:
Computer Architecture a Quantitative Approach, Hennessy &
Patterson, Morgan Kaufmann Publishers Inc, 3rd Edition 2002
Web page: http://www.computing.dcu.ie/~mike/ca226.html
Labs:
Will start around week 4/5. LG25/26, Mondays 4-6 pm
Continuous Assessment: 40% (2 lab exams, 20% each)
Final Exam: 60%
Assessment will use the WinMIPS64 simulator.
Syllabus:
Fundamentals of Computer Design
Instruction set principles & design
Pipelining
Memory Hierarchy design & Cache memory
Efficient Programming Techniques and Optimisation
Short History of CPUs – see
http://www.pcmech.com/show/processors/35
1971 - 4-bit 4004 processor
1973 – 8-bit 8080 processor
1978 – 16-bit 8086 processor
Clock – 10MHz
1982 - 16-bit 80286 processor
Clock – 16MHz
1985 - 32-bit 80386 processor
Clock – 25MHz
1989 - 32-bit 80486 processor
Clock – 33MHz
1993 - 32-bit Pentium processor
Clock – 66 to 200MHz
1997 - 32-bit Pentium II
Clock – 200 to 450MHz
1999 - 32-bit Pentium III
Clock – 450MHz to 1GHz
2000 – 32-bit Pentium IV
Clock – 1GHz to 3.8GHz
2002 – 64-bit Itanium 2
Clock – 1GHz
2006 – 64-bit dual core Itanium 2
Clock – 1.5GHz
2006 – 64-bit Core 2 (Centrino)
Clock – 2-3GHz





Clock rates are not increasing any more.
Number of clock cycles per instruction has decreased
dramatically
Processors have gone superscalar – 2 or more instructions
executed per clock cycle
Processors are going multi-core – 2 or more completely
separate processors on a chip
Completely new architectures (like the Itanium) have been
introduced.
Breaking News!
Last year IBM and Sony announced a new microprocessor called
CELL. This cutting-edge architecture features in Sony's recently
announced games console, the Playstation 3. Perhaps surprisingly,
computer games are among the most performance-demanding of
computer applications, and competition in this area is a major force
driving architectural development.
One blindingly obvious way to make computers go faster is to
carry out tasks in parallel rather than sequentially. The CELL
pushes this idea to the limit.
The CELL chip contains one main 64-bit processor (a PowerPC)
and 8 slave 32-bit processors (each with 128 x 128-bit registers).
Each of the slave processors has its own registers and fast memory
– so it is effectively distributed computing on a chip.
The CELL is clocked at 4.6GHz. Note that the Pentium 4 maxes
out at 3.8GHz, and it has only one processor.
CELL processors are themselves designed to work in parallel.
The tricky bit – how best to program such a beast?
Since 1945 computers have been getting faster and cheaper. This has
largely been due to
 increased demand creating a larger market, and
 technological advances
Demand has increased as computers became friendlier and easier
to use. Technological advances led to an increase in
performance of about 30% per year in the 1970s, rising to
50% per year by the 1990s.
Moore's law: the number of transistors that can be placed on a
chip doubles every 18 months.
The virtual elimination of assembly language programming made
it easier to introduce new architectures. This led to the
introduction of RISC architectures in the 1980s.
Microprocessor technology now predominates. Even supercomputers are built from vast arrays of microprocessors.
A very rough indication of computer performance is clock speed.
Technological advances make it possible to use much higher clock
speeds. However if the only factor was faster electronics,
computers would be about 5 times slower than they are – other
advances in architectural ideas have made a major contribution –
pipelining for example.
A major component of computer design is Instruction Set
Architecture. But there are other issues of implementation,
organisation and hardware which are at least as important.
Memory
A major subsystem is of course memory. The amount of memory
needed by the average program has grown by a factor of 1.5 to 2.0
per year. This implies that 0.5 to 1 extra address bit is
needed per year.
In the 1970s 16-bit addresses were very common; now we are used to
32-bit addresses, and 64-bit computers will soon be required.
When 32 bits becomes too small, all 32-bit architectures become
instantly redundant (Gulp!).
The cost per bit of DRAM memory has rapidly declined. Densities
have increased by about 60% per year, while cycle time has
improved much more slowly. In 1980 typical cycle time was 250ns;
by 2000 it had improved to 100ns.
Many technological advances occur in discrete steps rather than
smoothly. For example, DRAM sizes, when they increase, always
increase by a factor of 4.
The cost of 1 megabyte of memory has decreased from over
$15,000 in 1977 to about $0.08 in 2001!
Compilers
Compiler technology after a slow start has started to improve
dramatically. Compilers can now optimise code for RISC
processors so as to optimise pipelining behaviour and memory
accesses. Compiler optimisations have become more intelligent.
New architectures are designed with compilers very much in mind.
It is a relatively easy procedure to port the GCC compiler to a new
architecture.
Compilers can generate very poor code!
High level C language (the definition of the big type is not shown in
the original; a plausible one is assumed here):
typedef struct { unsigned int w[9]; } bignum; /* assumed definition */
typedef bignum *big;

void fadd2(big a,big b,big c)
{
c->w[0] = a->w[0]^b->w[0];
c->w[1] = a->w[1]^b->w[1];
c->w[2] = a->w[2]^b->w[2];
c->w[3] = a->w[3]^b->w[3];
c->w[4] = a->w[4]^b->w[4];
c->w[5] = a->w[5]^b->w[5];
c->w[6] = a->w[6]^b->w[6];
c->w[7] = a->w[7]^b->w[7];
c->w[8] = a->w[8]^b->w[8];
}
Compiled with –O3 optimization using GCC compiler on Apple Mac for PPC
processor.
lwz r9,4(r29)
lwz r11,4(r28)
lwz r10,4(r30)
lwz r2,0(r11)
lwz r0,0(r9)
xor r0,r0,r2
stw r0,0(r10)
lwz r9,4(r29) ;again?
lwz r11,4(r28) ;again?
lwz r10,4(r30)
lwz r2,4(r11)
lwz r0,4(r9)
xor r0,r0,r2
stw r0,4(r10)
lwz r9,4(r29)
lwz r11,4(r28)
lwz r10,4(r30)
lwz r2,8(r11)
lwz r0,8(r9)
xor r0,r0,r2
stw r0,8(r10)
lwz r9,4(r29)
lwz r11,4(r28)
lwz r10,4(r30)
lwz r2,12(r11)
lwz r0,12(r9)
xor r0,r0,r2
stw r0,12(r10)
......
This is terrible!
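The redundant reloads (marked ";again?") happen because the compiler must assume that the stores through c might alter whatever a and b point to. A sketch of an aliasing-safe rewrite using C99 restrict, assuming a hypothetical definition of big since the original listing omits one:

```c
/* Hypothetical definition of 'big': the original listing omits it. */
typedef struct { unsigned int w[9]; } bignum;
typedef bignum *big;

/* 'restrict' promises the three structures do not overlap, so the
   compiler may keep the argument pointers in registers instead of
   reloading them before every store. */
void fadd2_restrict(bignum *restrict a, bignum *restrict b,
                    bignum *restrict c) {
    for (int i = 0; i < 9; i++)
        c->w[i] = a->w[i] ^ b->w[i];
}
```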
Cost
An important phenomenon that drives cost is the learning curve.
This applies for example to the cost of DRAM chips. When a new
component is introduced it will be quite expensive. However as
manufacturers get better at producing a particular component (for
example the yield gets better), prices come down and demand goes
up, pushing prices down further. If demand exceeds supply,
competitive pressures decrease and prices will temporarily stabilise
as manufacturers see an opportunity to make money.
How much does a computer cost? Roughly:
Sheet metal & plastic - 2%
Power supply, fans - 2%
Cables, nuts, bolts - 1%
Shipping boxes, manuals - 1%
Processor - 22%
DRAM - 5%
Video system - 5%
I/O system - 5%
Keyboard & mouse - 3%
Monitor - 19%
Hard disk - 9%
DVD drive - 6%
Pre-loaded software - 20%
Performance
How to measure a computer’s performance? Increased
performance means decreased execution time.
The most obvious measure of time is real time (or wall-clock
time). Then there is CPU time (CPU time required by your
program), system CPU time (CPU time spent by your program in
operating system code), and I/O waiting time (for example time
spent waiting for you to "hit any key", or for a disk access to
complete).
Traditionally computers compare performances by running special
benchmark programs.
Benchmark programs might be:
 Real programs – compilers, spreadsheets etc.
 Kernels – small key pieces of code extracted from real
programs
 Toy benchmarks – like the sieve of Eratosthenes or small
classic algorithms like quicksort
 Synthetic benchmarks – Whetstone & Dhrystone: artificially
created programs that actually do nothing useful
Computer companies like to be able to boast the best possible
figures, particularly for Whetstone and Dhrystone. To this end they
“cheat”, by optimising their compilers specifically to maximise
performance on these benchmarks – targeted optimisations.
In fact optimising compilers can discard 25% of the Dhrystone
code. Many procedures are called only once so a clever compiler
can simply insert the function code in-line and avoid the function
calling, parameter passing overhead.
A well-coached compiler might spot this optimisation, taken from a
key Whetstone loop:
x = sqrt(exp(alog(x)/t))
A mathematician could simplify this to
x = exp(alog(x)/(2*t))
Real compilers of course don't know maths.
Another benchmark used a spreadsheet, but wrote the output to a
file. Normally screen output is synchronous – one completes
before the next starts – this is what a normal spreadsheet user
expects. But buffering all the I/O and only writing it to the disk
file at the end nicely takes the I/O time out of the measurement.
Another company, noting that the file is never read, simply tossed it
out altogether.
A slightly better idea is to use a collection of benchmarks – for
example SPEC2000. This includes for example programs to test
for prime numbers, a chess playing program, a gzip data
compression program, a 3-D graphics library, neural network
simulations etc – 26 in all, some integer only, many floating-point.
People still cheated and added benchmark-specific optimising flags
to their compilers. To defeat this the SPEC organisation insists on a
base-line performance measurement, which requires one set of
compiler flags for all the programs.
Amdahl’s Law
Which architectural optimisations to make? Each has a cost, and
some must be discarded. Common sense tells us to make the
common cases fast. Spend time and resources on those
optimisations which will improve the frequent event, rather than
the rare case.
Consider a particular computing task.
Amdahl's law states that:
Speed-up = Performance using enhancement / Performance without enhancement
The important point is that typically only a proportion of the entire
task can be converted to take advantage of the proposed
enhancement.
For example a program that implements elliptic curve
cryptography would run a lot faster with a special 32x32 bit GF2
multiplication instruction (this is basically multiplication without
carries). GF2 multiplication takes up 70% of the execution time,
and the special instruction would make this part of the code 20
times faster. How much faster will the program run?
S = 1/(0.3 + 0.7/20) ≈ 3
So 3 times faster. The point is that intuitively we would have
expected a much better performance increase, but Amdahl's law
tells us different. Note also that making the GF2 instruction
another 10 times faster would contribute very little – the law of
diminishing returns kicks in.
Amdahl's law also allows comparison between two design
alternatives. For example, a program spends half of its time
doing floating-point operations, and 20% of its total time doing
floating-point square roots. One proposal is to add floating-point
square-root hardware which speeds up that operation by a factor of
10. A second proposal is simply to make all floating-point
operations run twice as fast. Which gives the better speed-up?
S1 = 1/(0.8 + 0.2/10) = 1.22
S2 = 1/(0.5 + 0.5/2) = 1.33
So the second proposal is better.
CPU Performance Equation
CPU time = CPU clock cycles/Clock rate
We could also count the number of instructions executed (IC). This
in turn allows us to calculate the very important measure
CPI = CPU clock cycles/IC
Where CPI is Cycles Per Instruction
So CPU time = IC*CPI/Clock rate
Therefore reducing IC and CPI while increasing the clock rate
improves performance.
Of course in practice a more complex instruction might require
more cycles to complete. For a particular program it is possible to
compute an average value for CPI in the obvious way.
In this course we will be particularly interested in finding ways to
decrease CPI.
Consider the Intel 8086 instruction PUSH AX:
8086 - 11 clock cycles
80286 - 3 clock cycles
80386 - 2 clock cycles
80486 - 1 clock cycle
So it's not just clock speed that has improved over the years, it's CPI
as well.
MIPS & MFLOPS
Another common (and deceptive) measure of Computer
performance is MIPS – millions of instructions per second and its
cousin MFLOPS – millions of floating-point operations per
second.
MIPS = Clock Rate/(CPI * 10^6)
But MIPS can vary inversely to performance! Consider a machine
with an optional floating-point coprocessor. Without the coprocessor,
FP implementation in software requires lots of integer instructions;
with the coprocessor it requires a lot less. A single FSQRT
instruction might take 10 clock cycles, and replace maybe 300
integer instructions which take 1 clock cycle each. So average CPI
goes up. Which has the greater MIPS?
An optimising compiler which is successful in reducing the
number of fast 1-cycle instructions, will actually result in a lower
MIPS rating!
MFLOPS tries (and fails) to do the same specifically for floating-point.
Another problem here is that it is difficult to compare
different computers. Some implement a floating-point division
instruction (Pentium), others don't (Sparc). Cycle times for complex
instructions like sine and cosine might be quite large, so it depends
a lot on the instruction mix in the program.