COMS 361
Computer Organization
Title: Performance
Date: 10/02/2004
Lecture Number: 3
1
Announcements
• Homework 1
– Due next Tuesday 9/07/04
2
Review
• Computer System Organization
• Instruction Set Architecture (ISA)
– Important from the programmer's point of view
– Allows families of processors
• Hardware Set Architecture (HSA)
– Defines operational components and their interconnections
– An implementation of an ISA
• Computer Organization
– The implementation of an HSA
3
Outline
• Trends
• Performance
– Measures
– Comparisons
4
VLSI Trends: Moore’s Law
• In 1965, Gordon Moore predicted that
transistors would continue to shrink, allowing:
– Doubled transistor density every 24 months
– Doubled performance every 18 months
• History has proven Moore right
• But, is the end in sight?
– Physical limitations
– Economic limitations
Gordon Moore, Intel Co-Founder and Chairman Emeritus
5
Microprocessor Trends (Log Scale)
6
Microprocessor Trends (Intel)
Year  Chip             L (feature size)  Transistors
1971  4004             10 µm             2.3K
1974  8080             6 µm              6.0K
1976  8088             3 µm              29K
1982  80286            1.5 µm            134K
1985  80386            1.5 µm            275K
1989  80486            0.8 µm            1.2M
1993  Pentium®         0.8 µm            3.1M
1995  Pentium® Pro     0.6 µm            15.5M
1999  Mobile PII       0.25 µm           27.4M
2000  Pentium® 4       0.18 µm           42M
2002  Pentium® 4 (N)   0.13 µm           55M
2003  Itanium® 2 (M)   0.13 µm           410M
7
Microprocessor Trends
[Figure: transistors (millions, linear scale from 0 to 450) versus year, 1970 to 2005, for Intel, Motorola, DEC/Compaq, and IBM; labeled points: I2M, Alpha (R.I.P), P4N, G5]
8
Microprocessor Trends (Log Scale)
[Figure: transistors (millions, log scale from 0.001 to 100) versus year, 1960 to 2000, for Intel, Motorola, and DEC/Compaq; labeled points: I2M, Alpha (R.I.P), P4N, G5, G4]
9
DRAM Memory Trends (Log Scale)
[Figure: DRAM size (Mb, log scale from 0.01 to 1000) versus year, 1975 to 2005; labeled sizes: 0.0625, 0.25, 1, 4, 16, 64, 128, 256, 512]
10
Performance Trends
[Figure: performance trends plot; the VAX-11/780 is the labeled reference point]
11
Summary - Technology Trends
• Processor
– Logic capacity: increases ~30% per year
– Clock frequency: increases ~20% per year
– Cost per function: decreases ~20% per year
• Memory
– DRAM capacity: increases ~60% per year (4x every 3 years)
– Speed: increases ~10% per year
– Cost per bit: decreases ~25% per year
• Disk
– Storage capacity: increases ~60% per year
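The annual rates in this summary compound, which is how a 60% yearly increase in DRAM capacity lines up with the "4x every 3 years" figure. A minimal sketch of that arithmetic follows; the function name `growth_over` is made up for illustration.

```python
def growth_over(years: float, annual_rate: float) -> float:
    """Total growth factor after compounding annual_rate for the given years.

    annual_rate is a fraction: 0.60 means +60% per year, -0.25 means -25%.
    """
    return (1.0 + annual_rate) ** years


# DRAM capacity: +60% per year, compounded over 3 years.
dram_3yr = growth_over(3, 0.60)
print(f"DRAM capacity after 3 years: {dram_3yr:.2f}x")   # 4.10x, close to 4x

# Cost per bit: -25% per year, compounded over 3 years.
cost_3yr = growth_over(3, -0.25)
print(f"Cost per bit after 3 years: {cost_3yr:.2f} of the original")
```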
12
Performance
• Measure, Report, and Summarize
– Understand major factors of performance
– Software performance is a function of
• Which instructions are used
• Execution time of the instructions used
13
Performance
• Performance measurement can be difficult
– Complex software
– Hardware performance enhancements
– Different applications may perform differently on the same
hardware
• Graphics application
• Computational
• Database
• Impossible to analytically compute a system's
performance
14
Performance
• An important attribute of a system
– Purchasers
– Designers
– Sales people
• Stretch the truth
• Performance claims may be meaningless for your application
• Focus
– Why software performs as it does
– How does the instruction set affect performance
– What hardware features are responsible for improved
performance
15
Performance: A Matter of Perspective
• What does this mean?
– One computer performs better than another?
• Subtle question
• Difficult to answer without an accurate meaning of the term
performance
• Example of a performance issue
16
Performance: A Matter of Perspective
Airplane           Passengers  Range (mi)  Speed (mph)
Boeing 737-100     101         630         598
Boeing 747         470         4150        610
BAC/Sud Concorde   132         4000        1350
Douglas DC-8-50    146         8720        544
• Which plane has the best performance?
– Concorde: fastest
– DC-8: longest range
– 747: largest capacity
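The "best" plane depends on which metric is chosen. The sketch below ranks the four planes from the table by each metric. The last metric, passengers times speed, is an assumption added here as one plausible stand-in for throughput; it does not appear on the slide.

```python
# Data copied from the airplane table: passengers, range (mi), speed (mph).
planes = {
    "Boeing 737-100":   {"passengers": 101, "range_mi": 630,  "speed_mph": 598},
    "Boeing 747":       {"passengers": 470, "range_mi": 4150, "speed_mph": 610},
    "BAC/Sud Concorde": {"passengers": 132, "range_mi": 4000, "speed_mph": 1350},
    "Douglas DC-8-50":  {"passengers": 146, "range_mi": 8720, "speed_mph": 544},
}


def best_by(metric):
    """Return the name of the plane that maximizes the given metric function."""
    return max(planes, key=lambda name: metric(planes[name]))


fastest = best_by(lambda p: p["speed_mph"])
longest_range = best_by(lambda p: p["range_mi"])
largest = best_by(lambda p: p["passengers"])
# Assumed throughput metric: passenger-miles per hour.
highest_throughput = best_by(lambda p: p["passengers"] * p["speed_mph"])

print("Fastest:           ", fastest)
print("Longest range:     ", longest_range)
print("Largest capacity:  ", largest)
print("Highest throughput:", highest_throughput)
```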
17
Performance: A Matter of Perspective
• Consider speed as a performance measure
– Speed is still not precise enough
– Concorde: fastest plane for an individual
– 747: fastest plane for transporting 450 people
• Same program running on two computers
– First machine done is the fastest, right?
– Shared system:
• The most tasks completed in the least amount of time has the
better performance
– Response time: individual user
– Throughput: system manager
18
Performance Equation
• Selfish motivation:
– Focus on response time
• Often faster response time means more throughput
– Maximize performance by minimizing response time
$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$
– Performance of machine X is better than the performance
of machine Y if:
$$\text{Performance}_X > \text{Performance}_Y$$
$$\frac{1}{\text{Execution time}_X} > \frac{1}{\text{Execution time}_Y}$$
$$\text{Execution time}_Y > \text{Execution time}_X$$
19
Performance Equation
• $\text{Performance}_X > \text{Performance}_Y$
• Means $\text{Execution time}_X < \text{Execution time}_Y$
• How much better is one machine with respect to another?
– Relative performance measure of two machines
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = n$$
– If X is n times faster than Y, the execution time on Y is n
times longer than on X
$$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
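As a small illustration of the relative performance measure, the sketch below computes n from two execution times. The 10 s and 15 s figures are hypothetical and the function names are illustrative only.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return 1.0 / execution_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """Return n such that machine X is n times faster than machine Y."""
    return performance(time_x_s) / performance(time_y_s)


# Hypothetical measurements: X runs the program in 10 s, Y in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

The same value falls out of dividing the execution time on Y by the execution time on X.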
20
Measuring Performance
• Time to execute a program is our measure
• Which time?
– Obvious measures:
• Wall-clock time, response time, or elapsed time
• The total time for task completion
• Time-sharing (common)
– Elapsed time
– Time to complete our task
• CPU time
– The time the CPU spends on our task
• User time (time executing the instructions in our program)
• System time (OS things: opening files, networking, …)
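These three kinds of time can also be observed from inside a program. The following is a minimal sketch using Python's standard `time.perf_counter` and `os.times`; the `measure` helper and the `busy_loop` workload are made up for illustration, and the numbers depend on the machine and its load.

```python
import os
import time


def measure(task):
    """Run task() once and report elapsed, user, and system time in seconds."""
    start_wall = time.perf_counter()
    start_cpu = os.times()
    task()
    end_cpu = os.times()
    end_wall = time.perf_counter()
    return {
        "elapsed": end_wall - start_wall,              # wall-clock time
        "user": end_cpu.user - start_cpu.user,         # CPU time in our code
        "system": end_cpu.system - start_cpu.system,   # CPU time in the OS
    }


def busy_loop():
    """A purely computational workload, so most of its time is user time."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


times = measure(busy_loop)
times["cpu"] = times["user"] + times["system"]   # combined CPU time
for name, seconds in times.items():
    print(f"{name:8s} {seconds:.4f} s")
```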
21
Measuring Performance
• Use the time command to determine
– user time
– system time
– elapsed time
• Combine user and system time for a macroscopic measure of
performance
• Designers
– How fast can the hardware perform basic functions
• Moving data, adding, multiplying
– Most computers have a constant speed clock that determines when
events happen
• Clock
– Ticks, cycles, period, rate, …
– Heartbeat of the processor
22
Measuring Performance
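[Figure: clock waveform over time t, with one clock cycle marked]
– 1 clock cycle, or clock period (measured in time): 0.25 ns = 250 ps
$$\text{clock rate} = \frac{1}{\text{clock period}} = \frac{1}{0.25\ \text{ns}} = 4\ \text{GHz}$$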
• Clock rate
– Measured as the number of clock cycles per unit time
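Clock rate and clock period are reciprocals, so converting between them is a one-line calculation. A brief sketch using the slide's 0.25 ns period (function names are illustrative):

```python
def clock_rate_hz(clock_period_s: float) -> float:
    """Clock rate in cycles per second from the clock period in seconds."""
    return 1.0 / clock_period_s


def clock_period_s(rate_hz: float) -> float:
    """Clock period in seconds from the clock rate in cycles per second."""
    return 1.0 / rate_hz


rate = clock_rate_hz(0.25e-9)        # 0.25 ns period
print(f"{rate / 1e9:.2f} GHz")       # 4.00 GHz

period = clock_period_s(4e9)         # 4 GHz clock
print(f"{period * 1e12:.0f} ps")     # 250 ps
```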
23
Orders of Magnitude
thousandths
10-3
milli, m
millionths
10-6
micro, u
billionths
10-9
nano, n
trillionths
10-12
pico, p
• Disk access: ms
• Memory access: us
• Clock cycle time: ns, ps
24
Orders of Magnitude
•
•
•
•
thousands
103
kilo, K
millions
106
mega, M
billions
109
giga, G
trillions
1012
tera, T
Size of L1 cache: k
Size of Main memory: M, G
Size of a disk drive: G
Size of a disk farm: T
25
Orders of Magnitude
Scientific Units
Computer Science Units
103 = 1,000
thousands
kilo, K
210 = 1024
106 = 1,000,000
millions
mega, M
220 = 1,048,576
109 = 1,000,000,000
billions
giga, G
230 = 1,073,741,824
1012 = 1,000,000,000,000
trillions
tera, T
240 = 1,099,511,627,776
26
Clock/execution time
• Relation between clock rate and execution time
– Tells what happens to execution time if the rate changes
$$\text{CPU time} = \frac{\text{number of clock cycles}}{\text{program}} \times \text{clock period}$$
$$\text{CPU time} = \frac{\text{number of clock cycles}}{\text{clock rate}}$$
• Increase performance
– Reduce the number of clocks to execute the program
– Reduce the clock period
• Increase the clock rate
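The two forms of the CPU time relation give the same answer, and either lever (fewer cycles or a shorter period) lowers it. A minimal sketch with hypothetical numbers, 4 billion cycles on a 400 MHz clock:

```python
def cpu_time_from_period(clock_cycles: float, clock_period_s: float) -> float:
    """CPU time = number of clock cycles * clock period."""
    return clock_cycles * clock_period_s


def cpu_time_from_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time = number of clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


cycles = 4e9           # hypothetical cycle count for one program
rate_hz = 400e6        # 400 MHz clock, so the period is 2.5 ns

print(cpu_time_from_rate(cycles, rate_hz), "s")
print(cpu_time_from_period(cycles, 1.0 / rate_hz), "s")

# Doubling the clock rate with the same cycle count halves the CPU time.
print(cpu_time_from_rate(cycles, 2 * rate_hz), "s")
```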
27
Example
• Our favorite program runs in 10 seconds on computer A, which has a
400 MHz clock. We are trying to help a computer designer build a
new machine B, that will run this program in 6 seconds. The
designer can use new (or perhaps more expensive) technology to
substantially increase the clock rate, but has informed us that this
increase will affect the rest of the CPU design, causing machine B to
require 1.2 times as many clock cycles as machine A for the same
program. What clock rate should we tell the designer to target?
• Don't panic: we can easily work this out from basic principles
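One way to work the example from basic principles is sketched below: find the cycle count on A from its time and clock rate, scale it by 1.2 for B, then divide by the 6-second target. The function name is illustrative.

```python
def required_clock_rate_hz(time_a_s: float, rate_a_hz: float,
                           cycle_ratio: float, target_time_s: float) -> float:
    """Clock rate machine B needs to finish the program in target_time_s."""
    cycles_a = time_a_s * rate_a_hz       # cycles = CPU time * clock rate
    cycles_b = cycle_ratio * cycles_a     # B needs 1.2x as many cycles
    return cycles_b / target_time_s       # rate = cycles / CPU time


rate_b = required_clock_rate_hz(time_a_s=10.0, rate_a_hz=400e6,
                                cycle_ratio=1.2, target_time_s=6.0)
print(f"Machine B needs about {rate_b / 1e6:.0f} MHz")   # 800 MHz
```

Machine B therefore needs twice the clock rate of A to gain a speedup of 10/6, because part of the faster clock is spent on the extra cycles.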
28
Example
• A given program will require
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
• We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– CPI (cycles per instruction)
• a floating-point-intensive application might have a higher CPI
– MIPS (millions of instructions per second)
• this would be higher for a program using simple instructions
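The vocabulary above ties together as CPU time = instructions × CPI × cycle time. The sketch below shows how the quantities relate; the program size, CPI, and clock rate are hypothetical values chosen only for illustration.

```python
def cpu_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Seconds = instructions * (cycles per instruction) / (cycles per second)."""
    return instruction_count * cpi / clock_rate_hz


def mips(instruction_count: float, execution_time_s: float) -> float:
    """Millions of instructions executed per second."""
    return instruction_count / (execution_time_s * 1e6)


# Hypothetical program: 2 million instructions, CPI of 2.5, 500 MHz clock.
instructions = 2e6
seconds = cpu_time_s(instructions, cpi=2.5, clock_rate_hz=500e6)
print(f"cycles:  {instructions * 2.5:.0f}")
print(f"seconds: {seconds:.3f}")
print(f"MIPS:    {mips(instructions, seconds):.0f}")
```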
29
Performance
• Performance is determined by execution time
• Do any of the other variables equal performance?
– # of cycles to execute program?
– # of instructions in program?
– # of cycles per second?
– average # of cycles per instruction?
– average # of instructions per second?
• Common pitfall: thinking one of the variables is
indicative of performance when it really isn’t
30
CPI Example
• Suppose we have two implementations of the same
instruction set architecture (ISA).
• For some program,
– Machine A has a clock cycle time of 10 ns and a CPI
of 2.0
– Machine B has a clock cycle time of 20 ns and a CPI
of 1.2
31
CPI Example
• Which machine is faster for this program, and by how
much?
• If two machines have the same ISA which of our quantities
(e.g., clock rate, CPI, execution time, # of instructions,
MIPS) will always be identical?
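A sketch of one way to answer the first question: both machines implement the same ISA, so they execute the same number of instructions for this program, and comparing the average time per instruction is enough.

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction = CPI * clock cycle time."""
    return cpi * cycle_time_ns


machine_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10.0)   # 20 ns
machine_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20.0)   # 24 ns

# Same instruction count on both machines, so it cancels out of the ratio.
ratio = machine_b / machine_a
print(f"A: {machine_a:.0f} ns per instruction")
print(f"B: {machine_b:.0f} ns per instruction")
print(f"A is {ratio:.1f} times faster than B")
```

Of the listed quantities, only the number of instructions is fixed by the ISA; clock rate, CPI, execution time, and MIPS all depend on the implementation.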
32
# of Instructions Example
• A compiler designer is trying to decide between two
code sequences for a particular machine. Based on
the hardware implementation, there are three
different classes of instructions: Class A, Class B,
and Class C, and they require one, two, and three
cycles (respectively)
• The first code sequence has 5 instructions:
– 2 of A, 1 of B, and 2 of C
• The second sequence has 6 instructions:
– 4 of A, 1 of B, and 1 of C
33
# of Instructions Example
• Which sequence will be faster?
• How much?
• What is the CPI for each sequence?
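The three questions can be checked by counting cycles for each sequence. A minimal sketch, with illustrative names:

```python
# Cycles required by each instruction class, as given in the problem.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts: dict) -> int:
    """Total clock cycles for a sequence given its per-class instruction counts."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def cpi(counts: dict) -> float:
    """Average cycles per instruction for the sequence."""
    return total_cycles(counts) / sum(counts.values())


sequence_1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    print(f"{name}: {total_cycles(seq)} cycles, CPI = {cpi(seq):.1f}")

speedup = total_cycles(sequence_1) / total_cycles(sequence_2)
print(f"Sequence 2 is {speedup:.2f} times faster")
```

Sequence 2 executes more instructions but takes 9 cycles against 10, so it is the faster one; its lower CPI (1.5 against 2.0) more than offsets the extra instruction.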
34
MIPS example
• Two different compilers are being tested for a 100
MHz machine with three different classes of
instructions: Class A, Class B, and Class C, which
require one, two, and three cycles (respectively).
Both compilers are used to produce code for a large
piece of software.
• The first compiler's code uses
– 5 million Class A instructions
– 1 million Class B instructions
– 1 million Class C instructions
35
MIPS example
• The second compiler's code uses
– 10 million Class A instructions
– 1 million Class B instructions
– 1 million Class C instructions
• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution
time?
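Working both questions side by side shows why MIPS can mislead. The sketch below computes execution time and MIPS for each compiler's code from the instruction counts given above.

```python
CLOCK_RATE_HZ = 100e6                          # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def execution_time_s(counts: dict) -> float:
    """Execution time = total cycles / clock rate."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts: dict) -> float:
    """Millions of instructions per second for this instruction mix."""
    return sum(counts.values()) / (execution_time_s(counts) * 1e6)


compiler_1 = {"A": 5e6, "B": 1e6, "C": 1e6}
compiler_2 = {"A": 10e6, "B": 1e6, "C": 1e6}

for name, counts in (("Compiler 1", compiler_1), ("Compiler 2", compiler_2)):
    print(f"{name}: {execution_time_s(counts):.2f} s, {mips(counts):.0f} MIPS")
```

Compiler 2 has the higher MIPS rating (80 against 70), yet compiler 1's code finishes sooner (0.10 s against 0.15 s), so the two measures disagree.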
36
Benchmarks
• Performance best determined by running a real
application
– Use programs typical of expected workload
– Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics,
etc.
• Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused
37
Benchmarks
• SPEC (System Performance Evaluation
Cooperative)
– companies have agreed on a set of real programs and
inputs
– can still be abused (Intel’s “other” bug)
– valuable indicator of performance (and compiler
technology)
38
SPEC ‘89
• Compiler “enhancements” and performance
[Figure: SPEC performance ratio (0 to 800) for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv; two series: Compiler and Enhanced compiler]
39
SPEC ‘95
Benchmark  Description
go         Artificial intelligence; plays the game of Go
m88ksim    Motorola 88k chip simulator; runs test program
gcc        The Gnu C compiler generating SPARC code
compress   Compresses and decompresses file in memory
li         Lisp interpreter
ijpeg      Graphic compression and decompression
perl       Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex     A database program
tomcatv    A mesh generation program
swim       Shallow water model with 513 x 513 grid
su2cor     Quantum physics; Monte Carlo simulation
hydro2d    Astrophysics; hydrodynamic Navier-Stokes equations
mgrid      Multigrid solver in 3-D potential field
applu      Parabolic/elliptic partial differential equations
turb3d     Simulates isotropic, homogeneous turbulence in a cube
apsi       Solves problems regarding temperature, wind velocity, and distribution of pollutant
fpppp      Quantum chemistry
wave5      Plasma physics; electromagnetic particle simulation
40
SPEC ‘95
[Figure: two panels, SPECint and SPECfp (each 0 to 10), plotted against clock rate (50 to 250 MHz) for the Pentium and Pentium Pro]
• Does doubling the clock rate double the performance?
• Can a machine with a slower clock rate have better
performance?
41
Amdahl's Law
$$\text{Execution time after improvement} = \text{Execution time unaffected} + \frac{\text{Execution time affected}}{\text{Amount of improvement}}$$
• Example:
"Suppose a program runs in 100 seconds on a machine, with
multiply responsible for 80 seconds of this time. How
much do we have to improve the speed of multiplication if
we want the program to run 4 times faster?"
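The multiply example can be worked by rearranging Amdahl's Law for the amount of improvement. A minimal sketch follows; the function names are illustrative, and the "5 times faster" case at the end is an extra check that is not part of the slide.

```python
def time_after_improvement(unaffected_s: float, affected_s: float,
                           improvement: float) -> float:
    """Amdahl's Law: unaffected time + affected time / amount of improvement."""
    return unaffected_s + affected_s / improvement


def required_improvement(unaffected_s: float, affected_s: float,
                         target_s: float):
    """Improvement factor needed to reach target_s, or None if unreachable."""
    remaining = target_s - unaffected_s    # time left for the affected part
    if remaining <= 0:
        return None                        # unaffected time alone is too long
    return affected_s / remaining


# 100 s total, 80 s in multiply, so 20 s is unaffected.
# Running 4 times faster means a target of 100 / 4 = 25 s.
factor = required_improvement(unaffected_s=20.0, affected_s=80.0, target_s=25.0)
print(f"Multiply must be {factor:.0f} times faster")                     # 16
print(f"Check: {time_after_improvement(20.0, 80.0, factor):.0f} s")      # 25

# Extra case (not on the slide): 5 times faster needs 20 s in total,
# which leaves no time at all for multiplication.
print(required_improvement(20.0, 80.0, 100.0 / 5))
```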
42
Example
• Suppose we enhance a machine making all floating-point
instructions run five times faster. If the execution time of
some benchmark before the floating-point enhancement is
10 seconds, what will the speedup be if half of the 10
seconds is spent executing floating-point instructions?
• We are looking for a benchmark to show off the new
floating-point unit described above, and want the overall
benchmark to show a speedup of 3. One benchmark we
are considering runs for 100 seconds with the old
floating-point hardware. How much of the execution time would
43
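For the first question above (a 10-second benchmark with half its time in floating-point instructions that become five times faster), the overall speedup follows from Amdahl's Law. A short sketch, with an illustrative function name:

```python
def overall_speedup(total_s: float, affected_s: float, improvement: float) -> float:
    """Overall speedup when only affected_s of total_s runs improvement times faster."""
    new_total_s = (total_s - affected_s) + affected_s / improvement
    return total_s / new_total_s


# 10 s benchmark, 5 s of it floating point, floating point 5 times faster:
# new time = 5 + 5 / 5 = 6 s.
speedup = overall_speedup(total_s=10.0, affected_s=5.0, improvement=5.0)
print(f"Overall speedup: {speedup:.2f}")   # 1.67
```

A fivefold improvement to half of the program yields well under a twofold gain overall, because the unaffected 5 seconds remain.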
Remember
• Performance is specific to a particular program/s
– Total execution time is a consistent summary of
performance
• For a given architecture performance increases
come from:
– increases in clock rate (without adverse CPI effects)
– improvements in processor organization that lower CPI
– compiler enhancements that lower CPI and/or instruction
count
44