ECE 486/586 Computer Architecture
Chapter 5: Code Sequences
Herbert G. Mayer, PSU, Status 1/21/2017

Syllabus
  Moore's Law
  Key Architecture Messages
  Memory is Slow
  Events Tend to Cluster
  Heat is Bad
  Resource Replication
  Code Sequences
  References

Processor Performance Growth
Moore's Law, from Webopedia: "The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future." In subsequent years the pace slowed a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades. Others coin a more general law, stating a bit lamely that "circuit density increases predictably over time."

Processor Performance Growth
So far, Moore's Law has held true since ~1968. Some Intel fellows believe that an end to Moore's Law will be reached around 2018, due to physical limitations in the process of manufacturing transistors from semiconductor material. Such phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:
  Cars would travel at 2,400,000 MPH and get 600,000 MPG
  Air travel from LA to NYC would be at Mach 36,000, taking 0.5 seconds

Architecture Messages

Message 1: Memory is Slow
The inner core of the processor, the CPU or μP, is getting faster at a steady rate. Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on progressively slower memories – relatively speaking. On MP servers it is possible that a processor has to wait more than 100 cycles before a single memory access completes. On a multi-processor, the bus protocol is more complex due to snooping, backing off, and arbitration, so the number of cycles to complete a memory access can grow that high. IO simply compounds the problem of slow memory access.

Slow Memory Slows Down . . .

Message 1: Memory is Slow
Discarding conventional memory altogether and relying only on cache-like memories is NOT an option for 64-bit architectures, due to the price/size/cost/power of pursuing full memory population with 2^64 bytes. Another way of seeing this: using solely reasonably-priced cache memories (say, at more than 10 times the cost of regular memory) is not feasible; the resulting physical address space would be too small, or the price too high. Significant intellectual effort in computer architecture focuses on reducing the performance impact of fast processors accessing slow, virtualized memories. Everything else, except IO, seems easy compared to this fundamental problem! IO is slower still, by orders of magnitude.

Message 1: Memory is Slow
[Figure: Processor-Memory performance gap, 1980-2002, on a log scale from 1 to 1000. Processor (µProc) performance grows ~60%/year – Moore's Law, a straight line on the log plot – while DRAM performance grows only ~7%/year; the gap grows ~50%/year. Source: David Patterson, UC Berkeley]
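To put numbers on the gap in the figure, the short program below simply compounds the two growth rates quoted in the plot (roughly 60% per year for the processor core, roughly 7% per year for DRAM) over the 1980-2002 span. The rates and the resulting ~50%-per-year gap growth are the slide's figures; the program itself is only an illustration.

    // Sketch: compound the per-year growth rates quoted in the figure
    // (~60%/yr for the processor core, ~7%/yr for DRAM) to see how the
    // processor-memory gap widens. Rates are the slide's numbers; the
    // program itself is illustrative only.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double cpu_rate  = 0.60;   // ~60% per year (from the figure)
        const double dram_rate = 0.07;   // ~7% per year (from the figure)
        for (int year = 0; year <= 22; year += 2) {          // 1980 .. 2002
            double cpu  = std::pow(1.0 + cpu_rate,  year);   // relative CPU speed
            double dram = std::pow(1.0 + dram_rate, year);   // relative DRAM speed
            std::printf("%d: CPU x%9.1f  DRAM x%5.2f  gap x%8.1f\n",
                        1980 + year, cpu, dram, cpu / dram);
        }
        // The gap itself grows by (1.60 / 1.07) - 1, i.e. about 50% per year,
        // matching the "grows 50%/year" annotation on the plot.
        return 0;
    }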
Message 2: Events Tend to Cluster
A strange thing happens during program execution: seemingly unrelated events tend to cluster. Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, such clustering happens during some periods of time. Intuitively, one memory access seems independent of another, but both happen to fall onto the same cache line, or the same page, or the same working set of pages. We call this phenomenon locality! Architects exploit locality to speed up memory access via caches, and to increase the available address range beyond physical memory via virtual memory management. Distinguish spatial from temporal locality.

Message 2: Events Tend to Cluster
Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries. An incoming search key (say, a C++ program identifier) is mapped onto an index, but the next, completely unrelated key happens to map onto the same index. In an extreme case of high fill factor, this may render a hash lookup slower than a sequential, linear search. The programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!

Sample Clustering in Hash Function
[Figure: example of multiple unrelated keys hashing onto the same table entries]

Message 2: Events Tend to Cluster
Clustering happens in diverse modules of processor architecture. For example, when a data cache is used to speed up memory accesses by keeping a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices to speed up execution. That is due to data locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future – close by as measured by the cache line length! Thus they happen to reside in the cache, possibly even in the identical cache line as the prior access. Architects exploit this to speed up execution while keeping the incremental cost for HW contained. Here clustering is an exceedingly valuable performance phenomenon.
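The following sketch is not from the slides; it is a minimal illustration of spatial locality. It walks the same array twice, once sequentially, so consecutive accesses cluster within a cache line, and once with a large stride, so nearly every access touches a new line. On typical hardware the sequential pass finishes noticeably faster, which is exactly the clustering effect the cache exploits. The array size and stride are arbitrary choices.

    // Illustrative only: sequential vs. strided traversal of one array.
    // Sequential accesses cluster within cache lines (spatial locality);
    // the strided walk touches a new line on almost every access.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    static double walk(std::vector<int>& a, std::size_t stride) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t start = 0; start < stride; ++start)   // cover every element
            for (std::size_t i = start; i < a.size(); i += stride)
                sum += a[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %zu: sum=%lld\n", stride, sum);    // same sum both times
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        std::vector<int> a(1 << 24, 1);           // ~64 MB of ints
        double seq     = walk(a, 1);              // cache-friendly order
        double strided = walk(a, 4096);           // cache-hostile order
        std::printf("sequential %.3fs, strided %.3fs\n", seq, strided);
        return 0;
    }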
Message 3: Heat is Bad
Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally "is good". Other performance parameters, such as memory access speed, peripheral access, etc., do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable. It comes at the cost of higher current, and thus more heat generated in the identical physical geometry (the so-called real estate) of the silicon processor. But the silicon part behaves like a resistor with a negative temperature coefficient (NTC): it conducts better as it gets warmer. Since the power supply is a constant-current source, the lower resistance causes a lower voltage, shown as VDroop in the figure below.

Message 3: Heat is Bad
[Figure: supply voltage droop (VDroop) on the processor under load]

Message 3: Heat is Bad
This in turn means the voltage must be increased artificially to sustain the clock rate, creating more heat and ultimately leading to self-destruction of the part. Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Contemporary technologies include sleep states of the silicon part (the processor as well as the chip set) and Turbo Boost mode, to contain heat generation while boosting the clock speed just at the right time. It is good that to date silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies; else CPUs would become larger, more expensive, and above all hotter.

Message 4: Resource Replication
  Architects cannot increase clock speed beyond physical limitations
  One cannot decrease the die size beyond evolving technology
  Yet performance improvements are desired and needed, and they can be achieved via architecture!
  Improvements can be achieved by replicating resources to compute more results at each step, i.e. via parallelism!
  But careful! Why careful? Resources could be used for other, better purposes! Typical HW optimization

Message 4: Resource Replication
The key obstacle to parallel execution is data dependence in the computation being executed: a datum cannot be used before it has been computed! Compiler optimization technology calls this use-def dependence (short for use-before-definition), AKA true dependence, AKA data dependence. The goal is to search for program portions that are independent of one another. This can be done at multiple levels of focus:
  At the very low level of registers, at the machine level (done by HW; see also scoreboard)
  At the low level of individual machine instructions (done by HW; see also superscalar architecture)
  At the medium level of subexpressions in a program (done by the compiler; see CSE)
  At the higher level of several statements written in sequence in a high-level language program (done by the optimizing compiler or by the human programmer)
  Or at the very high level of different applications running on the same computer, but with independent data, separate computations, and independent results (done by the user running concurrent programs)

Message 4: Resource Replication
Whenever program portions are independent of one another, they can be computed in any order, including at the same time, i.e. in parallel; but will they be? Architects provide the resources for this parallelism; compilers need to uncover the opportunities for parallelism in programs. If two actions are independent of one another, they can be computed simultaneously, provided that HW resources exist, that the absence of dependence has been proven, and that the independent execution paths are scheduled onto these replicated HW resources. A small sketch after this message contrasts a dependent chain with independent computations.

Message 4: Resource Replication
[Figure: board for a server with 4 CPUs, AKA a 4-way MP server]
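As a small illustration of the use-before-definition (true) dependence discussed above, the sketch below is not from the slides: chained() carries a dependence from each loop iteration into the next, so its steps cannot be overlapped, while split_sum() computes two independent partial sums that may run in either order, or simultaneously on replicated resources (here, crudely, two threads). The function names and the modulus are arbitrary.

    // Illustrative only: true (use-before-definition) dependence vs. independence.
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Dependent chain: each step uses the value defined by the previous one,
    // so the iterations cannot be reordered or overlapped.
    long long chained(const std::vector<long long>& v) {
        long long acc = 0;
        for (long long x : v) acc = (acc * 3 + x) % 1000003;  // acc depends on prior acc
        return acc;
    }

    // Independent halves: neither sum uses the other's result, so the two
    // loops may run in either order or at the same time on replicated resources.
    long long split_sum(const std::vector<long long>& v) {
        long long lo = 0, hi = 0;
        auto mid = v.begin() + v.size() / 2;
        std::thread t([&] { lo = std::accumulate(v.begin(), mid, 0LL); });
        hi = std::accumulate(mid, v.end(), 0LL);
        t.join();
        return lo + hi;                          // single join point at the end
    }

    int main() {
        std::vector<long long> v(1000, 1);
        std::printf("chained=%lld  split=%lld\n", chained(v), split_sum(v));
        return 0;
    }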
Code Samples for 3 Different Architectures

The 3 Different Architectures
  1. Single Accumulator Architecture: has one implicit register, the accumulator, for all/any operations. Arithmetic operations frequently require intermediate temps! Code relies heavily on loads and stores to and from those temps.
  2. Three-Address GPR Architecture: allows complex operations with multiple operands all in one instruction. Hence complex opcodes and many bits per instruction.
  3. Stack Machine Architecture: operands are implied on the stack, except for load/store. Hence all operations are simple and need few bits, but all operand accesses are memory accesses.

Code 1 for Different Architectures
Example 1: code sequence without optimization
  Strict left-to-right translation, no smarts in the mapping
  Consider the non-commutative subtraction and division operators
  We use no common subexpression elimination (CSE) and no register reuse
  Conventional operator precedence
  For the Single Accumulator (SAA), Three-Address GPR, and Stack architectures
Sample source: d = ( a + 3 ) * b - ( a + 3 ) / c

Code 1 for Different Architectures
  No  Single Accumulator   Three-Address GPR (op dest, op1, op2)   Stack Machine
   1  ld   a               add  r1, a,  #3                         push    a
   2  add  #3              mult r2, r1, b                          pushlit #3
   3  mult b               add  r3, a,  #3                         add
   4  st   temp1           div  r4, r3, c                          push    b
   5  ld   a               sub  d,  r2, r4                         mult
   6  add  #3                                                      push    a
   7  div  c                                                       pushlit #3
   8  st   temp2                                                   add
   9  ld   temp1                                                   push    c
  10  sub  temp2                                                   div
  11  st   d                                                       sub
  12                                                               pop     d

Code 1 for Different Architectures
  Three-address code looks shortest with respect to the number of instructions
  Maybe an optical illusion: one must also consider the number of bits per instruction
  Must consider the number of I-fetches, operand fetches, and total number of stores
  Numerous memory accesses on the SAA (Single Accumulator Architecture), due to temporary values held in memory
  We find the largest number of memory accesses on the SA (Stack Architecture): there are no registers, just memory to hold data
  The Three-Address architecture is immune to the ordering constraint, since operands may be placed in registers in either order
  No need for reverse-operation opcodes on the Three-Address architecture

Code 2 for Different Architectures
This time we eliminate the common subexpression (CSE); the compiler handles the left-to-right order for non-commutative operators on the SAA.
Better: d = ( a + 3 ) * b - ( a + 3 ) / c

Code 2 for Different Architectures
  No  Single Accumulator   Three-Address GPR (op dest, op1, op2)   Stack Machine
   1  ld   a               add  r1, a,  #3                         push    a
   2  add  #3              mult r2, r1, b                          pushlit #3
   3  st   temp1           div  r1, r1, c                          add
   4  div  c               sub  d,  r2, r1                         dup
   5  st   temp2                                                   push    b
   6  ld   temp1                                                   mult
   7  mult b                                                       xch
   8  sub  temp2                                                   push    c
   9  st   d                                                       div
  10                                                               sub
  11                                                               pop     d

Code 2 for Different Architectures
  The Single Accumulator Architecture (SAA), even optimized, still needs temporary storage; it uses temp1 for the common subexpression and has no other register for temps!!
  The SAA could instead use a negate instruction or a reverse subtract
  Register use is optimized for the Three-Address architecture
  The common subexpression is optimized on the Stack Machine by duplicating (dup) and exchanging (xch)
  Instruction count is reduced 20% for Three-Address, 18% for the SAA, and only 8% for the Stack Machine
A small evaluator executing the Code 2 stack sequence is sketched below.
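A minimal sketch, not part of the course material, of a stack evaluator that executes exactly the Code 2 stack-machine sequence for d = ( a + 3 ) * b - ( a + 3 ) / c, including the dup and xch instructions used for the common subexpression. The opcode names follow the slides; the host-language details and the sample operand values are assumptions.

    // Minimal stack evaluator for the Code 2 sequence:
    //   push a; pushlit #3; add; dup; push b; mult; xch; push c; div; sub; pop d
    // Opcode names follow the slides; the evaluator itself is only a sketch.
    #include <cassert>
    #include <cstdio>
    #include <vector>

    int main() {
        double a = 5, b = 7, c = 2, d = 0;
        std::vector<double> s;                              // the operand stack
        auto push = [&](double v) { s.push_back(v); };
        auto pop  = [&]()         { double v = s.back(); s.pop_back(); return v; };

        push(a); push(3);                                        // push a; pushlit #3
        push(pop() + pop());                                     // add   -> a+3
        push(s.back());                                          // dup
        push(b); { double r = pop(), l = pop(); push(l * r); }   // push b; mult
        { double t = pop(), u = pop(); push(t); push(u); }       // xch (swap top two)
        push(c); { double r = pop(), l = pop(); push(l / r); }   // push c; div
        { double r = pop(), l = pop(); push(l - r); }            // sub
        d = pop();                                               // pop d
        assert(s.empty());
        std::printf("d = %g (expected %g)\n", d, (a + 3) * b - (a + 3) / c);
        return 0;
    }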
Code 3 for Different Architectures
Analyze 2 similar expressions. In the first, operator precedence increases left-to-right; in the 2nd, the precedences are overridden by ( ).
  One operator sequence associates right-to-left, due to arithmetic precedence; the compiler uses commutativity
  The other associates left-to-right, due to the explicit parentheses ( )
  Use a simple-minded code-generation model: no cache, no optimization
  Will there be advantages/disadvantages caused by the architecture?
Expression 1 is: e = a + b * c ^ d

Code 3 for Different Architectures
Expression 1 is: e = a + b * c ^ d
  No  Single Accumulator   Three-Address GPR (op dest, op1, op2)   Stack Machine (implied operands)
   1  ld   c               expo r1, c,  d                          push a
   2  expo d               mult r1, b,  r1                         push b
   3  mult b               add  e,  a,  r1                         push c
   4  add  a                                                       push d
   5  st   e                                                       expo
   6                                                               mult
   7                                                               add
   8                                                               pop  e
Expression 2 is: f = ( ( g + h ) * i ) ^ j
Here the operators associate left-to-right due to the parentheses.

Code 3 for Different Architectures
Expression 2 is: f = ( ( g + h ) * i ) ^ j
  No  Single Accumulator   Three-Address GPR (op dest, op1, op2)   Stack Machine (implied operands)
   1  ld   g               add  r1, g,  h                          push g
   2  add  h               mult r1, i,  r1                         push h
   3  mult i               expo f,  r1, j                          add
   4  expo j                                                       push i
   5  st   f                                                       mult
   6                                                               push j
   7                                                               expo
   8                                                               pop  f

Observations, Interaction of Precedence and Architecture
  Software eliminates the constraints imposed by precedence, by looking ahead
  Execution times are identical for the 2 different expressions on the same architecture, unless blurred by a secondary effect; see the cache example below
  Conclusion: all architectures handle arithmetic and logic operations well

Code For Stack Architecture
A Stack Machine with no registers would be inherently slow, due to memory accesses!!! To avoid slowness due to memory access, implement a few top-of-stack elements via HW shadow registers, i.e. a cache. Let us then measure equivalent code sequences with and without consideration for this cache.
  The top-of-stack register tos identifies the last valid word on the physical stack
  Two shadow registers may hold 0, 1, or 2 true top words; a HW design can and should use more than 2!
  The top-of-stack cache counter tcc specifies the number of shadow registers actually in use
  Thus tos plus tcc jointly specify the true top of stack

Code For Stack Architecture
[Figure: memory stack with top-of-stack pointer tos; 2 shadow (top-of-stack) registers hold the 0, 1, or 2 topmost words, counted by tcc; the remaining stack space is free]

Code For Stack Architecture
  Timings for the push, pushlit, add, pop, etc. operations depend on tcc
  Operations entirely in shadow registers are fastest, typically O(1) cycle, including the (shadow) register access and the operation itself
  In our simplistic model a memory access adds 2 cycles; in reality a memory access costs far more than 2 cycles!
  For stack changes define some policy, e.g. keep tcc 50% full
  The table below refines the timings for a stack with shadow registers
  Note: pushing memory location x into a cache with free space requires 2 cycles, which are for the memory fetch; the cache adjustment is done at the same time as the memory fetch

Code For Stack Architecture
  operation    cycles  tcc before  tcc after  tos change  comment
  add          1       tcc = 2     tcc = 1    no change
  add          1+2     tcc = 1     tcc = 1    tos--       underflow?
  add          1+2+2   tcc = 0     tcc = 1    tos -= 2    underflow?
  push x       2       tcc = 0,1   tcc++      no change   tcc update in parallel
  push x       2+2     tcc = 2     tcc = 2    tos++       overflow?
  pushlit #3   1       tcc = 0,1   tcc++      no change
  pushlit #3   1+2     tcc = 2     tcc = 2    tos++       overflow?
  pop y        2       tcc = 1,2   tcc--      no change
  pop y        2+2     tcc = 0     tcc = 0    tos--       underflow?
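To make the timing rules in the table executable, here is a sketch of the cost model as read from the table; it is an assumption-laden illustration, not the course's tool. A binary operation costs 1 cycle plus 2 cycles per operand that must come from the memory stack; a push from memory costs 2 cycles, plus 2 more when both shadow registers are full and one must be spilled; a literal push costs 1 cycle, plus 2 on spill; a pop to memory costs 2 cycles, plus 2 if the value must first be fetched from the memory stack. The tos bookkeeping is omitted. Applied to the two emission orders for a + b * c ^ ( d + e * f ^ g ) shown below, it reproduces their 40- and 20-cycle totals.

    // Sketch of the shadow-register cost model from the table above.
    // tcc = number of valid shadow registers (0..2).
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Model {
        int tcc = 0, cycles = 0;
        void binop() {                       // add, mult, expo, r_expo, ...
            int from_mem = 2 - tcc;          // operands not held in shadow registers
            if (from_mem < 0) from_mem = 0;
            cycles += 1 + 2 * from_mem;
            tcc = 1;                         // result lands in a shadow register
        }
        void push_mem() { cycles += (tcc == 2) ? 2 + 2 : 2; if (tcc < 2) ++tcc; }
        void push_lit() { cycles += (tcc == 2) ? 1 + 2 : 1; if (tcc < 2) ++tcc; }
        void pop_mem()  { cycles += (tcc == 0) ? 2 + 2 : 2; if (tcc > 0) --tcc; }
    };

    int cost(const std::vector<std::string>& ops) {
        Model m;
        for (const auto& op : ops) {
            if (op == "push") m.push_mem();
            else if (op == "pushlit") m.push_lit();
            else if (op == "pop") m.pop_mem();
            else m.binop();                  // add / mult / expo / r_expo
        }
        return m.cycles;
    }

    int main() {
        // a + b * c ^ (d + e * f ^ g), strict left-to-right emission:
        std::vector<std::string> blind = {"push","push","push","push","push","push",
                                          "push","expo","mult","add","expo","mult","add"};
        // Same expression, emitted to keep the 2-entry cache busy:
        std::vector<std::string> smart = {"push","push","expo","push","mult","push",
                                          "add","push","r_expo","push","mult","push","add"};
        std::printf("left-to-right: %d cycles, cache-aware: %d cycles\n",
                    cost(blind), cost(smart));   // expected 40 and 20
        return 0;
    }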
Code For Stack Architecture
Code emission for: a + b * c ^ ( d + e * f ^ g )
  Let + and * be commutative, by language rule
  The architecture here has 2 shadow registers, and the compiler exploits this
  Assume an initially empty 2-word top-of-stack cache

Code For Stack Architecture
  No  Left-to-Right  cycles   Exploit Cache         cycles
   1  push a         2        push f                2
   2  push b         2        push g                2
   3  push c         4        expo                  1
   4  push d         4        push e                2
   5  push e         4        mult                  1
   6  push f         4        push d                2
   7  push g         4        add                   1
   8  expo           1        push c                2
   9  mult           3        r_expo = swap + expo  1
  10  add            3        push b                2
  11  expo           3        mult                  1
  12  mult           3        push a                2
  13  add            3        add                   1

Code For Stack Architecture
  Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance
  Smart code emission with the shadow registers in mind: 20 cycles
  The true penalty for memory access is worse in practice, based on the quotient of memory access time over register operation time
  Tremendous speed-up is always possible when fixing a system with severe flaws
  The return on investment for 2 registers is twice the original performance
  Such a strong speedup is an indicator that the starting architecture was severely flawed! (Engineering wisdom)
  A Stack Machine can be fast, if the purity of top-of-stack access is sacrificed for performance
  Indexing, looping, indirection, and call/return are not addressed here

References
1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
2. Algorithm definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations
3. Moore's Law: http://en.wikipedia.org/wiki/Moore's_law
4. C. A. R. Hoare's comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf
5. Gibbons, P. B., and Steven Muchnick [1986]. "Efficient Instruction Scheduling for a Pipelined Architecture", ACM SIGPLAN Notices, Proceedings of the '86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16
6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/
7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm
8. Words of wisdom: http://www.cs.yale.edu/quotes.html
9. John von Neumann's computer design: A. H. Taub (ed.), "Collected Works of John von Neumann", vol. 5, pp. 34-79, The MacMillan Co., New York, 1963