ECE 486/586
Computer Architecture
Chapter 5
Code Sequences
Herbert G. Mayer, PSU
Status 1/21/2017
1
Syllabus
 Moore’s Law
 Key Architecture Messages
 Memory is Slow
 Events Tend to Cluster
 Heat is Bad
 Resource Replication
 Code Sequences
 References
2
Processor Performance Growth
Moore’s Law from Webopedia:
“The observation made in 1965 by Gordon Moore, co-founder
of Intel, that the number of transistors per square inch on
integrated circuits had doubled every year since it was
invented. Moore predicted that this trend would continue for
the foreseeable future.”
In subsequent years, the pace slowed down a bit, but data
density doubled approximately every 18 months, and this is
the current definition of Moore's Law, which Moore himself
has blessed. Most experts, including Moore himself, expect
Moore's Law to hold for another two decades.
Others coin a more general law, a bit lamely stating that “the
circuit density increases predictably over time.”
3
Processor Performance Growth
So far, Moore’s Law has held true since ~1968
Some Intel fellows believe that an end to Moore’s Law
will be reached ~2018, due to physical limitations in
the process of manufacturing transistors from semiconductor material
Such phenomenal growth is unknown in any other
industry. For example, if a doubling of performance
could be achieved every 18 months, then by 2001
other industries would have achieved the following:
Cars would travel at 2,400,000 mph and get 600,000 MPG
Air travel LA to NYC would be at Mach 36,000, taking
0.5 seconds
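As a rough sanity check of this extrapolation, a minimal sketch written for this transcript: the baseline figures (a late-1970s car doing ~75 mph and ~18 MPG, and a start year of ~1978) are illustrative assumptions, not from the slides. Fifteen 18-month doublings span 22.5 years and multiply the baseline by 2^15 = 32,768, landing right at the quoted numbers by 2001.

#include <stdio.h>

int main(void) {
    /* Illustrative assumptions, not from the slides: a car of the late
       1970s doing ~75 mph and ~18 MPG, and one performance doubling
       every 18 months; 15 doublings cover 22.5 years, ending ~2001.  */
    const double base_mph  = 75.0;
    const double base_mpg  = 18.0;
    const int    doublings = 15;
    const double factor = (double)(1L << doublings);   /* 2^15 = 32768 */

    printf("growth factor after %d doublings: %.0f\n", doublings, factor);
    printf("speed:   %.0f mph\n", base_mph * factor);  /* ~2,400,000 mph */
    printf("economy: %.0f MPG\n", base_mpg * factor);  /* ~600,000 MPG   */
    return 0;
}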
4
Architecture Messages
5
Message 1: Memory is Slow

The inner core of the processor, the CPU or the μP, is getting
faster at a steady rate

Access to memory is also getting faster over time, but at a
slower rate. This rate differential has existed for quite some
time, with the strange effect that fast processors have to rely
on progressively slower memories –relatively speaking

On MP servers it is possible that a processor has to wait > 100
cycles before one single memory access completes. On a
Multi-Processor the bus protocol is more complex due to
snooping, backing-off, and arbitration, thus the number of
cycles to complete a memory access can grow that high; see
the pointer-chasing sketch after this list

IO simply compounds the problem of slow memory access
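A minimal sketch of why a single memory access can dominate, written for this transcript (array size and iteration count are arbitrary assumptions, not course code): each load in the chase loop below depends on the result of the previous load, so the processor cannot overlap them and pays roughly the full memory latency on every iteration.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M entries (128 MB), far larger than any cache */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    size_t i, j, p;
    if (!next) return 1;

    /* Build a single N-element cycle (Sattolo's shuffle), so the chase
       below wanders over the whole array instead of a tiny loop.       */
    for (i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (i = N - 1; i > 0; i--) {
        j = (size_t)rand() % i;     /* j < i guarantees one big cycle */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Each load depends on the previous one: no overlap is possible. */
    clock_t start = clock();
    for (p = 0, i = 0; i < N; i++) p = next[p];
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("end=%zu  %.1f ns per dependent load\n", p, secs * 1e9 / N);
    free(next);
    return 0;
}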
6
Slow Memory Slows Down . . .
7
Message 1: Memory is Slow

Discarding conventional memory altogether, relying only on
cache-like memories, is NOT an option for 64-bit architectures,
due to the price/size/cost/power if you pursue full memory
population with 2^64 bytes

Another way of seeing this: Using solely reasonably-priced
cache memories (say more than 10 times the cost of regular
memory) is not feasible: the resulting physical address space
would be too small, or the price too high

Significant intellectual effort in computer architecture
focuses on reducing the performance impact of fast
processors accessing slow, virtualized memories

All else, except IO, seems easy compared to this fundamental
problem! IO is even slower, by orders of magnitude
8
Message 1: Memory is Slow
[Figure: processor vs. DRAM performance, 1980-2002, log scale. µProc performance grows ~60%/yr while DRAM grows only ~7%/yr, so the processor-memory performance gap grows ~50%/yr; Moore's Law predicts the CPU curve to be a straight line on this log scale. Source: David Patterson, UC Berkeley]
9
Message 2: Events Tend to Cluster

A strange thing happens during program execution:
Seemingly unrelated events tend to cluster

Memory accesses tend to concentrate a majority of their
referenced addresses onto a small domain of the total address
space. Even if all of memory is accessed, during some periods
of time such clustering happens

Intuitively, one memory access seems independent of another,
but they both happen to fall onto the same cache line or the
same page, or the same working set of pages

We call this phenomenon Locality! Architects exploit locality
to speed up memory access via Caches and increase the
available address range beyond physical memory via Virtual
Memory Management

Distinguish spatial from temporal locality
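A minimal sketch of spatial locality, written for this transcript (matrix size is an arbitrary assumption): both loops sum the same matrix, but the row-major walk touches consecutive addresses and reuses each fetched cache line, while the column-major walk jumps a full row per access and typically runs several times slower.

#include <stdio.h>
#include <time.h>

#define ROWS 4096
#define COLS 4096

static double a[ROWS][COLS];         /* 128 MB, much larger than the caches */

int main(void) {
    size_t i, j;
    double sum;
    clock_t t;

    t = clock();                      /* row-major: consecutive addresses */
    for (sum = 0.0, i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += a[i][j];
    printf("row-major:    sum=%g  %.2f s\n", sum,
           (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();                      /* column-major: a new cache line per access */
    for (sum = 0.0, j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            sum += a[i][j];
    printf("column-major: sum=%g  %.2f s\n", sum,
           (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}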
10
Message 2: Events Tend to Cluster

Similarly, hash functions tend to concentrate a
disproportionately large number of keys onto a
small number of table entries

An incoming search key (say, a C++ program
identifier) is mapped into an index, but the next,
completely unrelated key happens to map onto
the same index. In an extreme case of high fill
factor, this may render a hash lookup slower than
a sequential, linear search

The programmer must watch out for the phenomenon
of clustering, as it is undesired in hashing!
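A minimal sketch of clustering in a hash table, written for this transcript (the identifiers and the table size are arbitrary assumptions): a character-sum hash sends anagram-like identifiers to the same few buckets, while a multiplicative mix spreads the very same keys far more evenly.

#include <stdio.h>

#define TABLE 16

/* Poor hash: character sum. Anagrams and similar identifiers collide. */
static unsigned hash_sum(const char *s) {
    unsigned h = 0;
    while (*s) h += (unsigned char)*s++;
    return h % TABLE;
}

/* Better mix (djb2-style): multiplications spread the keys around.    */
static unsigned hash_mix(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE;
}

int main(void) {
    const char *keys[] = { "tmp1", "tmp2", "tmp3", "1tmp", "2tmp", "3tmp",
                           "val1", "val2", "lav1", "lav2", "abc", "cba" };
    int n = (int)(sizeof keys / sizeof keys[0]);
    int sum_buckets[TABLE] = {0}, mix_buckets[TABLE] = {0};

    for (int i = 0; i < n; i++) {
        sum_buckets[hash_sum(keys[i])]++;   /* clusters into a few buckets */
        mix_buckets[hash_mix(keys[i])]++;   /* spreads out                 */
    }
    printf("bucket  char-sum  mix\n");
    for (int b = 0; b < TABLE; b++)
        printf("%6d  %8d  %3d\n", b, sum_buckets[b], mix_buckets[b]);
    return 0;
}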
11
Sample Clustering in Hash Function
12
Message 2: Events Tend to Cluster

Clustering happens in diverse modules of processor
architecture. For example, when a data cache is used to
speed-up memory accesses by having a copy of frequently
used data in a faster memory unit, it happens that a small
cache suffices to speed up execution

That is due to Data Locality (spatial and temporal): Data that
have been accessed recently will again be accessed in the
near future, or at least data that live close by will be accessed
in the near future; close by, as measured by cache line length!

Thus they happen to reside in cache, possibly even in the
identical cache line as prior access

Architects do exploit this to speed up execution, while keeping
the incremental cost for HW contained. Here clustering is an
exceedingly valuable performance phenomenon
13
Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) can increase
performance and thus generally “is good”

Other performance parameters, such as memory access
speed, peripheral access, etc. do not scale with the clock
speed. Still, increasing the clock to a higher rate is desirable

This comes at the cost of higher current, and thus more heat
generated in the same physical geometry (the so-called real
estate) of the silicon processor

But the silicon part conducts better as it gets warmer,
behaving like a negative temperature coefficient (NTC)
resistor

Since the power-supply is a constant-current source, a lower
resistance causes lower voltage, shown as VDroop in the
figure below
14
Message 3: Heat is Bad
15
Message 3: Heat is Bad

This in turn means that the voltage must be increased artificially
to sustain the clock rate, creating more heat and ultimately leading
to self-destruction of the part

Great efforts are being made to increase the clock speed,
requiring more voltage, while at the same time reducing heat
generation

Contemporary technologies include sleep-states of the Silicon
part (processor as well as chip-set), and Turbo Boost mode, to
contain heat generation while boosting clock speed just at the
right time

It is good that, to date, silicon manufacturing technologies allow the
shrinking of transistors and thus of whole dies; else CPUs
would become larger, more expensive, and above all hotter
16
Message 4: Resource Replication
 Architects cannot increase clock speed beyond
physical limitations
 One cannot decrease the die size beyond evolving
technology
 Yet performance improvements are desired and
needed; and they can be achieved via architecture!
 Improvements can be achieved by replicating
resources to compute more results at each step!
I.e. via parallelism! But careful!
 Why careful? Resources could be used for other,
better purposes! Typical HW optimization
17
Message 4: Resource Replication

Key obstacle to parallel execution is data
dependence in computation executed. A
datum cannot be used, before it has been
computed!

Compiler optimization technology calls this
use-def dependence (short for use-before-definition),
AKA true dependence, AKA data dependence

Goal is to search for program portions that
are independent of one another. This can be
at multiple levels of focus
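A minimal sketch of true (use-def) dependence, written for this transcript: the second statement below uses t and therefore cannot be moved ahead of its definition, while the last two statements depend neither on t nor on each other, so they may be evaluated in either order or simultaneously, given enough functional units.

/* Hypothetical fragment illustrating true (data) dependence. */
double dependent_and_independent(double a, double b, double c) {
    double t = a + 3.0;      /* defines t                           */
    double u = t * b;        /* uses t: true dependence, must wait  */

    double x = b * c;        /* independent of t and of y ...       */
    double y = a - c;        /* ... so x and y may run in parallel  */

    return u + x + y;
}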
18
Message 4: Resource Replication
 At the very low level of registers, at the machine
level –done by HW; see also score board
 At the low level of individual machine instructions
–done by HW; see also superscalar architecture
 At the medium level of subexpressions in a
program –done by compiler; see CSE
 At the higher level of several statements written in
sequence in high-level language program –done
by optimizing compiler or by human programmer
 Or at the very high level of different applications,
running on the same computer, but with
independent data, separate computations, and
independent results –done by the user running
concurrent programs
19
Message 4: Resource Replication
 Whenever program portions are independent of
one another, they can be computed in any order,
including at the same time, i.e. in parallel; but will
they?
 Architects provide resources for this parallelism
 Compilers need to uncover opportunities for
parallelism in programs
 If two actions are independent of one another, they
can be computed simultaneously
 Provided that HW resources exist, that the absence
of dependence has been proven, that independent
execution paths are scheduled on these replicated
HW resources
20
Message 4: Resource Replication
[Photo: board for a server with 4 CPUs, AKA a 4-way MP server]
21
Code Samples for 3 Different Architectures
22
The 3 Different Architectures
1. Single Accumulator Architecture

Has one implicit register for all/any operations: accumulator

Arithmetic operations frequently require intermediate temps!

Code relies heavily on load-store to-from temps
2. Three-Address GPR Architecture

Allows complex operations with multiple operands all in one
instruction

Hence complex opcode bits, many bits per instruction
3. Stack Machine Architecture

Operands are implied on the stack, except load/store

Hence all operations are simple, few bits, but all are memory
accesses
23
Code 1 for Different Architectures
Example 1: Code Sequence Without Optimization
 Strict left-to-right translation, no smarts in mapping
 Consider non-commutative subtraction and division
operators
 We’ll use no common subexpression elimination
(CSE), and no register reuse
 Conventional operator precedence
 For Single Accumulator SAA, Three-Address GPR,
Stack Architectures
Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c
24
Code 1 for Different Architectures
No   Single Accumulator     Three-Address GPR          Stack Machine
                            dest ← op1 op op2
 1   ld    a                add  r1, a,  #3            push    a
 2   add   #3               mult r2, r1, b             pushlit #3
 3   mult  b                add  r3, a,  #3            add
 4   st    temp1            div  r4, r3, c             push    b
 5   ld    a                sub  d,  r2, r4            mult
 6   add   #3                                          push    a
 7   div   c                                           pushlit #3
 8   st    temp2                                       add
 9   ld    temp1                                       push    c
10   sub   temp2                                       div
11   st    d                                           sub
12                                                     pop     d
25
Code 1 for Different Architectures
Three-address code looks shortest, w.r.t. number of
instructions
Maybe an optical illusion: must also consider number of bits
per instruction
Must consider number of I-fetches, operand fetches, and total
number of stores
Numerous memory accesses on SAA (Single Accumulator
Architecture) due to temporary values held in memory
We find the largest number of memory accesses on SA (Stack
Architecture): there are no registers, just memory to hold data
The Three-Address architecture is immune to the ordering constraint,
since operands may be placed in registers in either order
No need for reverse-operation opcodes for the Three-Address
architecture
26
Code 2 for Different Architectures
This time we eliminate the common subexpression (CSE)
The compiler handles left-to-right order for non-commutative operators on SAA
Better: d ← ( a + 3 ) * b - ( a + 3 ) / c
27
Code 2 for Different Architectures
No   Single Accumulator     Three-Address GPR          Stack Machine
                            dest ← op1 op op2
 1   ld    a                add  r1, a,  #3            push    a
 2   add   #3               mult r2, r1, b             pushlit #3
 3   st    temp1            div  r1, r1, c             add
 4   div   c                sub  d,  r2, r1            dup
 5   st    temp2                                       push    b
 6   ld    temp1                                       mult
 7   mult  b                                           xch
 8   sub   temp2                                       push    c
 9   st    d                                           div
10                                                     sub
11                                                     pop     d
28
Code 2 for Different Architectures
The Single Accumulator Architecture (SAA), even optimized, still
needs temporary storage; it uses temp1 for the common
subexpression; it has no other register for temps!!
SAA could use a negate instruction or a reverse subtract
Register use is optimized for the Three-Address
architecture
The common subexpression is optimized on the Stack
Machine by duplicating (dup) and exchanging (xch)
Instruction count is reduced 20% for Three-Address (5 → 4),
18% for SAA (11 → 9), and only 8% for the Stack Machine (12 → 11)
29
Code 3 for Different Architectures
 Analyze 2 similar expressions, the first with operator
precedence increasing left-to-right; in the 2nd case the
precedences are overridden by ( )
 One operator sequence associates right-to-left, due
to arithmetic precedence
 The compiler uses commutativity
 The other associates left-to-right, due to explicit parentheses ( )
 Use a simple-minded code generation model: no
cache, no optimization
 Will there be advantages/disadvantages caused by
the architecture?
Expression 1 is: e ← a + b * c ^ d
30
Code 3 for Different Architectures
Expression 1 is: e ← a + b * c ^ d

No   Single Accumulator     Three-Address GPR        Stack Machine
                            dest ← op1 op op2        implied operands
 1   ld    c                expo r1, c, d            push a
 2   expo  d                mult r1, b, r1           push b
 3   mult  b                add  e,  a, r1           push c
 4   add   a                                         push d
 5   st    e                                         expo
 6                                                   mult
 7                                                   add
 8                                                   pop  e

Expression 2 is: f ← ( ( g + h ) * i ) ^ j
Here the operators associate left-to-right due to parentheses
31
Code 3 for Different Architectures
Expression 2 is: f ← ( ( g + h ) * i ) ^ j

No   Single Accumulator     Three-Address GPR        Stack Machine
                            dest ← op1 op op2        implied operands
 1   ld    g                add  r1, g, h            push g
 2   add   h                mult r1, i, r1           push h
 3   mult  i                expo f,  r1, j           add
 4   expo  j                                         push i
 5   st    f                                         mult
 6                                                   push j
 7                                                   expo
 8                                                   pop  f
Observations, Interaction of Precedence and Architecture
 Software eliminates constraints imposed by precedence: looking ahead
 Execution times identical for the 2 different expressions on the same
architecture --unless blurred by secondary effect; see cache example below
 Conclusion: all architectures handle arithmetic and logic operations well
32
Code For Stack Architecture

Stack Machine with no register would be inherently slow, due
to: Memory Accesses!!!

To avoid slowness due to memory access: implement a few
top-of-stack elements via HW shadow registers → a cache

Let us then measure equivalent code sequences with and
without consideration for cache

Top-of-stack register tos identifies the last valid word on
physical stack

Two shadow registers may hold 0, 1, or 2 true top words; HW
design can and should use more than 2!

Top of stack cache counter tcc specifies number of shadow
registers actually used

Thus tos plus tcc jointly specify true top of stack
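A minimal sketch of this scheme, written for this transcript: tos and tcc are named as on the slides, everything else is an illustrative assumption, and cycle costs (covered by the timing table below) are not modeled. push spills the oldest shadow word to memory when both shadow registers are full, and pop refills from memory when both are empty; overflow/underflow of the memory stack is not checked in this sketch.

#include <stdio.h>

#define SHADOW 2                 /* number of HW shadow registers       */
#define DEPTH  1024              /* size of the in-memory stack         */

static long mem[DEPTH];          /* physical stack in memory            */
static int  tos = -1;            /* index of last valid word in mem     */
static long shadow[SHADOW];      /* shadow registers: true top words    */
static int  tcc = 0;             /* how many shadow entries are valid   */

/* tos and tcc jointly define the true top of stack:
   logical depth = (tos + 1) + tcc, with shadow[tcc-1] on top.          */

static void spill(void) {        /* move oldest shadow word to memory   */
    mem[++tos] = shadow[0];
    shadow[0] = shadow[1];
    tcc--;
}

static void fill(void) {         /* refill one word from memory (tcc == 0) */
    shadow[0] = mem[tos--];
    tcc = 1;
}

static void push(long x) {
    if (tcc == SHADOW) spill();
    shadow[tcc++] = x;
}

static long pop(void) {
    if (tcc == 0) fill();
    return shadow[--tcc];
}

static void add(void) {          /* binary op on the two top words      */
    long b = pop(), a = pop();
    push(a + b);
}

int main(void) {
    push(1); push(2); push(3);   /* third push spills the 1 to memory   */
    add();                       /* 2 + 3                               */
    add();                       /* 1 + 5: refills the 1 from memory    */
    printf("result = %ld, tos = %d, tcc = %d\n", pop(), tos, tcc);
    return 0;                    /* prints: result = 6, tos = -1, tcc = 0 */
}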
33
Code For Stack Architecture
[Figure: physical stack in memory with tos pointing at the last valid word; 2 top-of-stack shadow registers hold 0, 1, or 2 true top words, counted by tcc; the space above tos is free]
34
Code For Stack Architecture
Timings for push, pushlit, add, pop, etc. operations depend on
tcc
Operations entirely in shadow registers are fastest, typically 1 cycle,
which includes the (shadow) register access and the operation itself
In our simplistic model, a memory access adds 2 cycles; in
reality memory access costs far more than 2 cycles!
For stack changes, define some policy, e.g. keep tcc 50% full
The table below refines the timings for a stack with shadow registers
Note: push memory location x into cache with free space
requires 2 cycles, which are for the memory fetch: cache
adjustment is done at the same time as memory fetch
35
Code For Stack Architecture
operation     cycles   tcc before   tcc after   tos change   comment
add           1        tcc = 2      tcc = 1     no change
add           1+2      tcc = 1      tcc = 1     tos--        underflow?
add           1+2+2    tcc = 0      tcc = 1     tos -= 2     underflow?
push x        2        tcc = 0,1    tcc++       no change    tcc update in parallel
push x        2+2      tcc = 2      tcc = 2     tos++        overflow?
pushlit #3    1        tcc = 0,1    tcc++       no change
pushlit #3    1+2      tcc = 2      tcc = 2     tos++        overflow?
pop y         2        tcc = 1,2    tcc--       no change
pop y         2+2      tcc = 0      tcc = 0     tos--        underflow?
36
Code For Stack Architecture
Code emission for: a + b * c ^ ( d + e * f ^ g )
Let + and * be commutative, by language rule
Architecture here has 2 shadow registers,
compiler exploits this
Assume an initially empty 2-word top-of-stack
cache
37
Code For Stack Architecture
#    Left-to-Right   cycles    Exploit Cache           cycles
1    push a          2         push f                  2
2    push b          2         push g                  2
3    push c          4         expo                    1
4    push d          4         push e                  2
5    push e          4         mult                    1
6    push f          4         push d                  2
7    push g          4         add                     1
8    expo            1         push c                  2
9    mult            3         r_expo = swap + expo    1
10   add             3         push b                  2
11   expo            3         mult                    1
12   mult            3         push a                  2
13   add             3         add                     1
38
Code For Stack Architecture
Blind code emission, i.e. not taking advantage of tcc
knowledge, costs 40 cycles: it costs performance
Smart code emission with the shadow registers in mind costs 20 cycles
The true penalty for memory access is worse in practice, based on
the ratio of memory access time to register operation time
A tremendous speed-up is always possible when fixing a system with
severe flaws
The return on investment for 2 registers is a doubling of the original
performance
Such a strong speedup is an indicator that the starting
architecture was severely flawed! (Engineering Wisdom)
A Stack Machine can be fast, if purity of top-of-stack access is
sacrificed for performance
Indexing, looping, indirection, and call/return are not addressed here
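As a cross-check of these cycle counts, a minimal sketch written for this transcript: it encodes the costs from the timing table above (1 cycle per operation plus 2 cycles per memory access, 2 shadow registers) and replays both emission orders for a + b * c ^ ( d + e * f ^ g ), printing 40 and 20 cycles. The instruction encoding and names are illustrative assumptions, not course code.

#include <stdio.h>

#define SHADOW 2

enum op { PUSH, OP };    /* PUSH: push a variable from memory; OP: binary op */

/* Cycle cost per the timing table: the operation itself is 1 cycle,
   and each memory access (fetch, spill) adds 2 cycles.                */
static int run(const enum op *code, int n) {
    int tcc = 0, total = 0, i;
    for (i = 0; i < n; i++) {
        if (code[i] == PUSH) {
            total += (tcc == SHADOW) ? 2 + 2 : 2;  /* +2 if a spill is needed   */
            if (tcc < SHADOW) tcc++;
        } else {                                    /* expo, mult, add, r_expo  */
            int missing = 2 - tcc;                  /* operands not in shadow   */
            total += 1 + 2 * missing;
            tcc = 1;                                /* result stays in a shadow reg */
        }
    }
    return total;
}

int main(void) {
    /* Blind left-to-right emission: push a b c d e f g, then
       expo mult add expo mult add                                      */
    enum op blind[] = { PUSH, PUSH, PUSH, PUSH, PUSH, PUSH, PUSH,
                        OP, OP, OP, OP, OP, OP };
    /* Cache-aware emission: push f g, expo, push e, mult, push d, add,
       push c, r_expo, push b, mult, push a, add                        */
    enum op smart[] = { PUSH, PUSH, OP, PUSH, OP, PUSH, OP,
                        PUSH, OP, PUSH, OP, PUSH, OP };

    printf("left-to-right: %d cycles\n", run(blind, 13));   /* 40 */
    printf("exploit cache: %d cycles\n", run(smart, 13));   /* 20 */
    return 0;
}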
39
References
1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
2. Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations
3. Moore's Law: http://en.wikipedia.org/wiki/Moore's_law
4. C. A. R. Hoare's comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf
5. Gibbons, P. B., and Steven Muchnick [1986]. "Efficient Instruction Scheduling for a Pipelined Architecture", ACM SIGPLAN Notices, Proceedings of the '86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16
6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/
7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm
8. Words of wisdom: http://www.cs.yale.edu/quotes.html
9. John von Neumann's computer design: A. H. Taub (ed.), "Collected Works of John von Neumann", vol. 5, pp. 34-79, The MacMillan Co., New York, 1963
40