Lecture 3
Benchmarks and Performance Metrics
Measurement Tools
• Benchmarks, Traces, Mixes
• Cost, Delay, Area, Power Estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws
The Bottom Line:
Performance (and Cost)

    Plane              Time (DC-Paris)   Speed      Passengers   Throughput (pmph)
    Boeing 747         6.5 hours         610 mph    470          286,700
    BAC/Sud Concorde   3.0 hours         1350 mph   132          178,200

    (pmph = passenger-miles per hour = passengers x speed; e.g., 470 x 610 = 286,700)

• Time to run the task (ExTime)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
  – Throughput, bandwidth
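To make the latency/throughput distinction concrete, here is a minimal sketch (Python; the figures are taken from the table above) that computes both metrics:

```python
# Latency vs. throughput for the two aircraft in the table above.
planes = {
    "Boeing 747":       {"hours": 6.5, "mph": 610,  "passengers": 470},
    "BAC/Sud Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}

for name, p in planes.items():
    pmph = p["passengers"] * p["mph"]  # throughput in passenger-miles per hour
    print(f"{name}: latency {p['hours']} h, throughput {pmph:,} pmph")

# The Concorde wins on latency (3.0 h vs 6.5 h); the 747 wins on throughput
# (286,700 vs 178,200 pmph). Which is "faster" depends on the metric chosen.
```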
The Bottom Line:
Performance (and Cost)

“X is n times faster than Y” means:

    n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
Performance Terminology

“X is n% faster than Y” means:

    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

    n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)
Example

Y takes 15 seconds to complete a task; X takes 10 seconds.
What % faster is X?

    ExTime(Y) / ExTime(X) = 15 / 10 = 1.5 = Performance(X) / Performance(Y)

    n = 100 x (1.5 - 1.0) / 1.0 = 50

X is 50% faster than Y.
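These two definitions translate directly into code. A minimal sketch (Python; the function names are mine, not from the lecture):

```python
def times_faster(extime_x, extime_y):
    """How many times faster X is than Y (ratio of execution times)."""
    return extime_y / extime_x

def percent_faster(extime_x, extime_y):
    """Percent faster: n = 100 * (Perf(X) - Perf(Y)) / Perf(Y)."""
    perf_x, perf_y = 1.0 / extime_x, 1.0 / extime_y
    return 100.0 * (perf_x - perf_y) / perf_y

print(times_faster(10, 15))    # 1.5  -> X is 1.5 times faster than Y
print(percent_faster(10, 15))  # 50.0 -> X is 50% faster than Y
```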
Programs to Evaluate Processor Performance

• (Toy) Benchmarks
  – 10-to-100-line programs
  – e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
  – Attempt to match average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
• Kernels
  – Time-critical excerpts of real programs
  – e.g., Livermore loops
• Real programs
  – e.g., gcc, spice
Benchmarking Games
• Differing configurations used to run the same
workload on two systems
• Compiler wired to optimize the workload
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize
performance
Common Benchmarking Mistakes

• Only average behavior represented in test workload
• Ignoring monitoring overhead
• Not ensuring same initial conditions
• “Benchmark Engineering”
  – particular optimization
  – different compilers or preprocessors
  – runtime libraries
SPEC: System Performance Evaluation Cooperative

• First Round, 1989
  – 10 programs yielding a single number
• Second Round, 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Reference machine: VAX-11/780
• Third Round, 1995
  – Single flag setting for all programs; new set of programs,
    “benchmarks useful for 3 years”
  – Reference machine: SPARCstation 10 Model 40
SPEC First Round

• One program (matrix300) spent 99% of its time in a single line of code
• A new front-end compiler could improve its result dramatically

[Bar chart: SPEC Perf, 0 to 800, per benchmark: gcc, espresso, spice, doduc,
nasa7, li, eqntott, matrix300, fpppp, tomcatv]
How to Summarize Performance

• Arithmetic Mean (weighted arithmetic mean)
  – tracks execution time: Σ(Ti)/n, or Σ(Wi x Ti)
• Harmonic Mean (weighted harmonic mean) of execution rates (e.g., MFLOPS)
  – tracks execution time: n/Σ(1/Ri), or 1/Σ(Wi/Ri)
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution times;
  use the geometric mean (Π Ri)^(1/n), where Ri = 1/Ti
Comparing and Summarizing Performance

                       Computer A   Computer B   Computer C
    P1 (secs)                   1           10           20
    P2 (secs)               1,000          100           20
    Total time (secs)       1,001          110           40

For program P1, A is 10 times faster than B; for program P2, B is 10 times
faster than A; and so on. Judged program by program, the relative performance
of the computers is unclear; total execution time provides one consistent
summary.
Summary Measure

Arithmetic Mean:

    AM = (1/n) x Σ(i=1..n) Execution Time_i

Harmonic Mean (when performance is expressed as rates):

    HM = n / Σ(i=1..n) (1 / Rate_i),   where Rate_i = 1 / Execution Time_i

Good if the programs are run equally often in the workload.
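A small sketch (Python, standard library only) applying these means to the A/B/C execution times from the table above:

```python
import statistics

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}  # P1, P2 in seconds

for machine, t in times.items():
    am = statistics.mean(t)               # arithmetic mean of times
    rates = [1 / x for x in t]            # performance expressed as rates
    hm = statistics.harmonic_mean(rates)  # harmonic mean of rates
    print(f"{machine}: AM = {am:7.1f} s, HM of rates = {hm:.5f} tasks/s")

# AM: A = 500.5, B = 55.0, C = 20.0 -- the same ranking as the
# total execution times (1,001, 110, 40): both track time.
```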
Unequal Job Mix: Relative Performance

• Weighted Execution Time
  – Weighted Arithmetic Mean:  Σ(i=1..n) Weight_i x Execution Time_i
  – Weighted Harmonic Mean:    1 / Σ(i=1..n) (Weight_i / Rate_i)
• Normalized Execution Time to a reference machine
  – Arithmetic Mean
  – Geometric Mean:  (Π(i=1..n) Execution Time Ratio_i)^(1/n),
    each ratio normalized to the reference machine
Weighted Arithmetic Mean

    WAM(i) = Σ(j=1..n) W(i)_j x Time_j

                A          B        C       W(1)   W(2)    W(3)
    P1 (secs)   1.00       10.00    20.00   0.50   0.909   0.999
    P2 (secs)   1,000.00   100.00   20.00   0.50   0.091   0.001
    WAM(1)      500.50     55.00    20.00
    WAM(2)      91.91      18.19    20.00
    WAM(3)      2.00       10.09    20.00

    e.g., WAM(1) for A = 1.0 x 0.5 + 1,000 x 0.5 = 500.50
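A sketch (Python) reproducing the WAM table; the weights and times are those given above:

```python
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
weights = {1: [0.50, 0.50], 2: [0.909, 0.091], 3: [0.999, 0.001]}

def wam(w, t):
    """Weighted arithmetic mean: SUM_j w_j * t_j."""
    return sum(wj * tj for wj, tj in zip(w, t))

for i, w in weights.items():
    row = {m: round(wam(w, t), 2) for m, t in times.items()}
    print(f"WAM({i}):", row)

# WAM(1): A 500.5, B 55.0, C 20.0; WAM(3): A 2.0, B 10.09, C 20.0 --
# which machine looks fastest depends entirely on the weights.
```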
Normalized Execution Time

Execution times (secs):

          A          B        C
    P1    1.00       10.00    20.00
    P2    1,000.00   100.00   20.00

    Geometric Mean = (Π(i=1..n) Execution Time Ratio_i)^(1/n)

                     Normalized to A      Normalized to B      Normalized to C
                     A     B      C       A     B     C        A      B     C
    P1               1.0   10.0   20.0    0.1   1.0   2.0      0.05   0.5   1.0
    P2               1.0   0.1    0.02    10.0  1.0   0.2      50.0   5.0   1.0
    Arithmetic mean  1.0   5.05   10.01   5.05  1.0   1.1      25.03  2.75  1.0
    Geometric mean   1.0   1.0    0.63    1.0   1.0   0.63     1.58   1.58  1.0
    Total time       1.0   0.11   0.04    9.1   1.0   0.36     25.03  2.75  1.0
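A sketch (Python) showing the reference-machine effect in this table: the arithmetic mean of normalized times changes with the reference, while the geometric mean does not:

```python
import math

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

def normalized_means(ref):
    out = {}
    for m, t in times.items():
        ratios = [ti / ri for ti, ri in zip(t, times[ref])]
        am = sum(ratios) / len(ratios)
        gm = math.prod(ratios) ** (1 / len(ratios))
        out[m] = (round(am, 2), round(gm, 2))
    return out

for ref in "ABC":
    print(f"normalized to {ref} (AM, GM):", normalized_means(ref))

# The AM column ranks the machines differently for each reference;
# the GM ratios (e.g., C/A = 0.63) are the same regardless of reference.
```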
Disadvantages of Arithmetic Mean

Performance varies depending on the reference machine:

                     Normalized to A      Normalized to B      Normalized to C
                     A     B      C       A     B     C        A      B     C
    P1               1.0   10.0   20.0    0.1   1.0   2.0      0.05   0.5   1.0
    P2               1.0   0.1    0.02    10.0  1.0   0.2      50.0   5.0   1.0
    Arithmetic mean  1.0   5.05   10.01   5.05  1.0   1.1      25.03  2.75  1.0

Normalized to A: B looks 5 times slower than A, and C looks slowest.
Normalized to B: A looks 5 times slower than B.
Normalized to C: C looks fastest.
The Pros and Cons of Geometric Means

• Independent of the running times of the individual programs
• Independent of the reference machine
• Do not predict execution time
  – here the geometric mean says A and B perform the same; that is only true
    when P1 runs 100 times for every occurrence of P2:
        Time on A: 1(P1) x 100 + 1,000(P2) x 1 = 1,100 secs
        Time on B: 10(P1) x 100 + 100(P2) x 1  = 1,100 secs

                     Normalized to A      Normalized to B      Normalized to C
                     A     B      C       A     B     C        A      B     C
    P1               1.0   10.0   20.0    0.1   1.0   2.0      0.05   0.5   1.0
    P2               1.0   0.1    0.02    10.0  1.0   0.2      50.0   5.0   1.0
    Geometric mean   1.0   1.0    0.63    1.0   1.0   0.63     1.58   1.58  1.0
Amdahl's Law

Speedup due to enhancement E:

    Speedup(E) = (ExTime w/o E) / (ExTime w/ E)
               = (Performance w/ E) / (Performance w/o E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S,
and the remainder of the task is unaffected. Then (derived on the next slide):

    ExTime(E) = ExTime x ((1 - F) + F/S)

    Speedup(E) = 1 / ((1 - F) + F/S)
Amdahl’s Law

    ExTimeE = ExTime x ((1 - FractionE) + FractionE / SpeedupE)

    Speedup = ExTime / ExTimeE
            = 1 / ((1 - FractionE) + FractionE / SpeedupE)
            = 1 / ((1 - F) + F/S)
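The law is one line of code. A minimal sketch (Python; the function name is mine):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl_speedup(0.1, 2))    # 1.0526... -- the FP example on the next slide
print(amdahl_speedup(0.95, 10))  # 6.9 -- even 95% coverage caps the gain well below 10x
```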
Amdahl’s Law

Floating-point instructions are improved to run 2 times faster (a 100%
improvement), but only 10% of the instructions executed are FP:

    Speedup = 1 / ((1 - F) + F/S)
            = 1 / ((1 - 0.1) + 0.1/2)
            = 1 / 0.95
            = 1.053, i.e., a 5.3% improvement
Corollary (Amdahl): Make the Common Case Fast

• All instructions require an instruction fetch; only a fraction require a
  data fetch/store
  – Optimize instruction access over data access
• Programs exhibit locality
  – Spatial locality
  – Temporal locality
• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent accesses are to
    the smallest (closest) memories

[Storage hierarchy diagram: Reg’s -> Cache -> Memory -> Disk / Tape]
Locality of Access

Spatial Locality:
    There is a high probability that data items whose addresses differ only
    slightly will be accessed close together in time.
Temporal Locality:
    There is a high probability that recently referenced data will be
    referenced again in the near future.
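As an illustration, a small sketch (Python; the buffer size and stride are arbitrary choices of mine, and absolute timings are machine dependent) contrasting an access pattern with good spatial locality against one without. CPython's interpreter overhead dampens the effect considerably; the same experiment in a compiled language shows a much larger gap:

```python
import time
import array

N = 1 << 22                        # 4M 4-byte ints: larger than a typical cache
data = array.array("i", [0]) * N   # one contiguous buffer of integers

def timed(label, index_order):
    t0 = time.perf_counter()
    s = 0
    for i in index_order:
        s += data[i]
    print(f"{label}: {time.perf_counter() - t0:.3f} s")

timed("sequential (good spatial locality)", range(N))

# Same total number of accesses, but each jump lands in a different cache line:
STRIDE = 4096
timed("strided (poor spatial locality)",
      (i for start in range(STRIDE) for i in range(start, N, STRIDE)))
```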
Rule of Thumb

• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware (faster) and be sure the rest can be
  handled correctly in software
Metrics of Performance

Each level of the machine has its own natural metrics:

    Application                  Answers per month
                                 Operations per second
    Programming Language
    Compiler
    ISA                          (millions of) Instructions per second: MIPS
                                 (millions of) (FP) operations per second: MFLOP/s
    Datapath, Control            Megabytes per second
    Function Units               Cycles per second (clock rate)
    Transistors, Wires, Pins
Aspects of CPU Performance

    CPU time = Seconds / Program
             = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

                    Inst Count   CPI   Clock Rate
    Program             X         X
    Compiler            X        (X)
    Inst. Set.          X         X
    Organization                  X        X
    Technology                             X
Marketing Metrics

    MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)

• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Not correlated with performance

    MFLOP/s = FP Operations / (Time x 10^6)

• Machine dependent
• Often not where time is spent

Normalized FP operation weights:
    add, sub, compare, mult   1
    divide, sqrt              4
    exp, sin, ...             8
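A sketch (Python) of both marketing metrics; the run below is hypothetical, invented purely for illustration:

```python
def mips(instr_count, seconds):
    return instr_count / (seconds * 1e6)

def native_mflops(fp_ops, seconds):
    return fp_ops / (seconds * 1e6)

# Normalized MFLOP/s weights the FP operation mix per the table above.
WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
           "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, seconds):
    weighted = sum(WEIGHTS[op] * n for op, n in op_counts.items())
    return weighted / (seconds * 1e6)

# Hypothetical run: 100M instructions, 50M adds and 10M divides, in 2 seconds.
print(mips(100e6, 2.0))                                       # 50.0 MIPS
print(native_mflops(60e6, 2.0))                               # 30.0 MFLOP/s
print(normalized_mflops({"add": 50e6, "divide": 10e6}, 2.0))  # 45.0 MFLOP/s
```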
Cycles Per Instruction

Average cycles per instruction:

    CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count

    CPU time = Cycle Time x Σ(i=1..n) (CPI_i x I_i)

Instruction frequency:

    CPI = Σ(i=1..n) (CPI_i x F_i),   where F_i = I_i / Instruction Count

Invest resources where time is spent!
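A minimal sketch (Python; the clock rate and instruction counts are hypothetical) of the CPU-time sum above:

```python
# CPU time = Cycle Time x SUM_i (CPI_i x I_i), with absolute counts I_i.
clock_hz = 500e6                 # hypothetical 500 MHz clock
cycle_time = 1 / clock_hz

# Hypothetical dynamic instruction counts per class and their cycle costs:
counts = {"ALU": (50e6, 1), "Load": (20e6, 2), "Store": (10e6, 2), "Branch": (20e6, 2)}

cycles = sum(i * c for i, c in counts.values())   # 150e6 cycles
instr = sum(i for i, _ in counts.values())        # 100e6 instructions
print(f"CPU time = {cycle_time * cycles * 1e3:.1f} ms")  # 300.0 ms
print(f"CPI = {cycles / instr}")                         # 1.5
```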
Organizational Trade-offs

The same stack shapes the three terms of CPU time:

    Application, Programming Language, Compiler   ->  Instruction Mix
    ISA, Datapath, Control                        ->  CPI
    Function Units, Transistors, Wires, Pins      ->  Cycle Time
Example: Calculating CPI

Base Machine (Reg / Reg), typical mix:

    Op       Freq   CPI_i   Freq x CPI_i   (% Time)
    ALU      50%    1       .5             (33%)
    Load     20%    2       .4             (27%)
    Store    10%    2       .2             (13%)
    Branch   20%    2       .4             (27%)
                            CPI = 1.5
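A sketch (Python) reproducing this table's CPI and time breakdown:

```python
# (frequency F_i, cycles CPI_i) per instruction class -- the typical mix above
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(f * c for f, c in mix.values())   # CPI = SUM F_i * CPI_i
print(f"CPI = {cpi}")                       # 1.5
for op, (f, c) in mix.items():
    # Each class's share of cycles is its share of execution time.
    print(f"{op:6s}: contributes {f * c:.1f} to CPI = {f * c / cpi:.0%} of time")
```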
Example

Some of the LD instructions could be eliminated by adding a register/memory
(R/M) form of ADD, e.g. ADD R1, X:
  – One source operand in memory
  – One source operand in register
  – Cycle count of 2
The branch cycle count would increase to 3.
What fraction of the loads must be eliminated for this to pay off?

Base Machine (Reg / Reg), typical mix:

    Op       Freq_i   CPI_i
    ALU      50%      1
    Load     20%      2
    Store    10%      2
    Branch   20%      2
Example Solution

Exec Time = Instr Cnt x CPI x Clock

    Op       Freq_i   CPI_i   Freq_i x CPI_i
    ALU      .50      1       .5
    Load     .20      2       .4
    Store    .10      2       .2
    Branch   .20      2       .4
    Total    1.00             CPI = 1.5
Example Solution

Exec Time = Instr Cnt x CPI x Clock

             ------ Old ------     ----------- New -----------
    Op       Freq_i  CPI_i  CPI    Freq_i   CPI_i   CPI
    ALU      .50     1      .5     .5 - X   1       .5 - X
    Load     .20     2      .4     .2 - X   2       .4 - 2X
    Store    .10     2      .2     .1       2       .2
    Branch   .20     2      .4     .2       3       .6
    Reg/Mem                        X        2       2X
    Total    1.00           1.5    1 - X            (1.7 - X)/(1 - X)

    CPI_NEW must be normalized to the new instruction count (1 - X).
Example Solution

Exec Time = Instr Cnt x CPI x Clock

             ------ Old ------     ----------- New -----------
    Op       Freq    Cycles  CPI   Freq     Cycles  CPI
    ALU      .50     1       .5    .5 - X   1       .5 - X
    Load     .20     2       .4    .2 - X   2       .4 - 2X
    Store    .10     2       .2    .1       2       .2
    Branch   .20     2       .4    .2       3       .6
    Reg/Mem                        X        2       2X
    Total    1.00            1.5   1 - X            (1.7 - X)/(1 - X)

    Instr Cnt_OLD x CPI_OLD x Clock = Instr Cnt_NEW x CPI_NEW x Clock
    1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
    1.5 = 1.7 - X
    X = 0.2

All LOADs must be eliminated for this to be a win!
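A quick numerical check of the break-even point (Python):

```python
def exec_time_ratio(x):
    """New/old execution time when a fraction x of all instructions
    (formerly loads) becomes Reg/Mem ADDs; the clock is unchanged."""
    old_cycles = 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2            # 1.5
    # New cycles per original instruction; the (1 - x) factors in the
    # instruction count and in CPI_NEW cancel, leaving 1.7 - x.
    new_cycles = (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2
    return new_cycles / old_cycles

for x in (0.0, 0.1, 0.2):
    print(f"X = {x:.1f}: new/old time = {exec_time_ratio(x):.3f}")
# 1.133, 1.067, 1.000 -- only X = 0.2 (all loads eliminated) breaks even.
```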
Fallacies and Pitfalls

• Fallacy: MIPS is an accurate measure for comparing performance among computers
  – dependent on the instruction set
  – varies between programs on the same computer
  – can vary inversely with performance
• Fallacy: MFLOPS is a consistent and useful measure of performance
  – dependent on the machine and on the program
  – not applicable outside floating-point programs
  – the set of floating-point operations is not consistent across machines