CS2810 Spring 2007 Dan Watson [email protected] Course syllabus, calendar, and assignments found at http://www.cs.usu.edu/~watson/cs2810 These overheads are based on presentations courtesy of Professor Mary Jane Irwin, Penn State University and Professor Tod Amon, Southern Utah University 2004 Morgan Kaufmann Publishers 1
Chapter 1 2004 Morgan Kaufmann Publishers 2
Introduction • This course is all about how computers work • But what do we mean by a computer? – Different types: desktop, servers, embedded devices – Different uses: automobiles, graphics, finance, genomics… – Different manufacturers: Intel, Apple, IBM, Microsoft, Sun… – Different underlying technologies and different costs! • Analogy: Consider a course on "automotive vehicles" – Many similarities from vehicle to vehicle (e.g., wheels) – Huge differences from vehicle to vehicle (e.g., gas vs. electric) • Best way to learn: – Focus on a specific instance and learn how it works – While learning general principles and historical perspectives 2004 Morgan Kaufmann Publishers 3
Why learn this stuff? • You want to call yourself a "computer scientist" • You want to build software people use (need performance) • You need to make a purchasing decision or offer "expert" advice • Both hardware and software affect performance: – Algorithm determines number of source-level statements – Language/Compiler/Architecture determine machine instructions (Chapters 2 and 3) – Processor/Memory determine how fast instructions are executed (Chapters 5, 6, and 7) • Assessing and Understanding Performance in Chapter 4 2004 Morgan Kaufmann Publishers 4
What is a computer? • Components: – input (mouse, keyboard) – output (display, printer) – memory (disk drives, DRAM, SRAM, CD) – network • Our primary focus: the processor (datapath and control) – implemented using millions of transistors – impossible to understand by looking at each transistor – we need... 2004 Morgan Kaufmann Publishers 5
Where is the Market? [Figure: millions of computers sold per year, 1998–2002. Embedded: 290, 488, 862, 892, 1122; desktop: 93, 114, 135, 129, 131; servers: 3, 3, 4, 4, 5. Embedded devices dominate the market.] 2004 Morgan Kaufmann Publishers 6
"By the architecture of a system, I mean the complete and detailed specification of the user interface. … As Blaauw has said, 'Where architecture tells what happens, implementation tells how it is made to happen.'" — The Mythical Man-Month, Brooks, p. 45 2004 Morgan Kaufmann Publishers 7
Instruction Set Architecture (ISA) • ISA: An abstract interface between the hardware and the lowest-level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on. • "... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." – Amdahl, Blaauw, and Brooks, 1964 – Enables implementations of varying cost and performance to run identical software • ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.
2004 Morgan Kaufmann Publishers 8
ISA Type Sales [Figure: millions of processors sold per year by ISA (ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, other), 1998–2002 — a PowerPoint "comic" bar chart with approximate values (see text for correct values)] 2004 Morgan Kaufmann Publishers 9
Moore's Law • In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time). • Amazingly visionary – the million-transistor/chip barrier was crossed in the 1980's. – 2,300 transistors, 1 MHz clock (Intel 4004) – 1971 – 16 million transistors (UltraSPARC III) – 42 million transistors, 2 GHz clock (Intel Xeon) – 2001 – 55 million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4) – 2004 – 140 million transistors (HP PA-8500) 2004 Morgan Kaufmann Publishers 10
Historical Perspective • ENIAC, built in World War II, was the first general-purpose computer – Used for computing artillery firing tables – 80 feet long by 8.5 feet high and several feet wide – Each of the twenty 10-digit registers was 2 feet long – Used 18,000 vacuum tubes – Performed 1,900 additions per second • Since then, Moore's Law: transistor capacity doubles every 18–24 months 2004 Morgan Kaufmann Publishers 11
Processor Performance Increase [Figure: processor performance (SPECint), 1987–2003, log scale — from roughly 1 (MIPS M/120, SUN-4/260) through the MIPS M2000, IBM RS6000, HP 9000/750, IBM POWER 100, and the DEC Alpha line to roughly 10,000 (Intel Xeon/2000, Intel Pentium 4/3000)] 2004 Morgan Kaufmann Publishers 12
DRAM Capacity Growth [Figure: DRAM chip capacity (Kbit) by year of introduction, 1976–2002 — from 16K through 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M to 512M] 2004 Morgan Kaufmann Publishers 13
Impacts of Advancing Technology • Processor – logic capacity: increases about 30% per year – performance: 2x every 1.5 years • Memory – DRAM capacity: 4x every 3 years, now 2x every 2 years – memory speed: 1.5x every 10 years – cost per bit: decreases about 25% per year • Disk – capacity: increases about 60% per year • ClockCycle = 1/ClockRate: 500 MHz ClockRate = 2 nsec ClockCycle; 1 GHz ClockRate = 1 nsec ClockCycle; 4 GHz ClockRate = 250 psec ClockCycle (a runnable sketch of this relationship follows the Abstraction slide below) 2004 Morgan Kaufmann Publishers 15
Example Machine Organization • Workstation design target – 25% of cost on processor – 25% of cost on memory (minimum memory size) – rest on I/O devices, power supplies, box [Diagram: Computer = CPU (Control + Datapath) + Memory + Devices (Input, Output)] 2004 Morgan Kaufmann Publishers 16
PC Motherboard Closeup 2004 Morgan Kaufmann Publishers 17
Inside the Pentium 4 Processor Chip 2004 Morgan Kaufmann Publishers 18
Instruction Set Architecture • A very important abstraction – interface between hardware and low-level software – standardizes instructions, machine language bit patterns, etc. – advantage: different implementations of the same architecture – disadvantage: sometimes prevents using new innovations • True or False: Binary compatibility is extraordinarily important? • Modern instruction set architectures: – IA-32, PowerPC, MIPS, SPARC, ARM, and others 2004 Morgan Kaufmann Publishers 20
Abstraction • Delving into the depths reveals more information • An abstraction omits unneeded detail, helps us cope with complexity • What are some of the details that appear in these familiar abstractions?
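To make the ClockCycle = 1/ClockRate relationship concrete, here is a minimal C sketch (our own throwaway example, not from the text) that reproduces the three conversions on the technology-trends slide:

#include <stdio.h>

/* Cycle time is the reciprocal of clock rate: a 500 MHz clock
 * ticks every 2 ns, a 1 GHz clock every 1 ns, a 4 GHz clock every 250 ps. */
int main(void) {
    double rates_hz[] = { 500e6, 1e9, 4e9 };
    for (int i = 0; i < 3; i++) {
        double cycle_ps = 1e12 / rates_hz[i];   /* seconds -> picoseconds */
        printf("%6.0f MHz -> %6.0f ps per cycle\n",
               rates_hz[i] / 1e6, cycle_ps);
    }
    return 0;
}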
2004 Morgan Kaufmann Publishers 21
MIPS R3000 Instruction Set Architecture • Registers: R0 - R31, PC, HI, LO • Instruction Categories: – Load/Store – Computational – Jump and Branch – Floating Point (coprocessor) – Memory Management – Special • 3 Instruction Formats, all 32 bits wide: – OP | rs | rt | rd | sa | funct – OP | rs | rt | immediate – OP | jump target • Q: How many already familiar with MIPS ISA? 2004 Morgan Kaufmann Publishers 22
How do computers work? • Need to understand abstractions such as: – Applications software – Systems software – Assembly Language – Machine Language – Architectural Issues: i.e., Caches, Virtual Memory, Pipelining – Sequential logic, finite state machines – Combinational logic, arithmetic circuits – Boolean logic, 1s and 0s – Transistors used to build logic gates (CMOS) – Semiconductors/Silicon used to build transistors – Properties of atoms, electrons, and quantum dynamics • So much to learn! 2004 Morgan Kaufmann Publishers 23
Chapter 2 2004 Morgan Kaufmann Publishers 24
Instructions: Language of the Machine • We'll be working with the MIPS instruction set architecture – similar to other architectures developed since the 1980's – almost 100 million MIPS processors manufactured in 2002 – used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, … [Figure: millions of processors sold per year by ISA (ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, other), 1998–2002] 2004 Morgan Kaufmann Publishers 25
MIPS arithmetic • All instructions have 3 operands • Operand order is fixed (destination first) Example: C code: a = b + c MIPS 'code': add a, b, c (we'll talk about registers in a bit) "The natural number of operands for an operation like addition is three…requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple" 2004 Morgan Kaufmann Publishers 26
MIPS arithmetic • Design Principle: simplicity favors regularity. Of course this complicates some things... C code: a = b + c + d; MIPS code: add a, b, c add a, a, d • Operands must be registers; only 32 registers provided • Each register contains 32 bits • Design Principle: smaller is faster. Why? 2004 Morgan Kaufmann Publishers 27
Registers vs. Memory • Arithmetic instruction operands must be registers — only 32 registers provided • Compiler associates variables with registers • What about programs with lots of variables? [Diagram: Processor (Control + Datapath) connected to Memory, with Input and Output via I/O] 2004 Morgan Kaufmann Publishers 28
Memory Organization • Viewed as a large, single-dimension array, with an address • A memory address is an index into the array • "Byte addressing" means that the index points to a byte of memory: addresses 0, 1, 2, 3, 4, 5, 6, ... each hold 8 bits of data 2004 Morgan Kaufmann Publishers 29
Memory Organization • Bytes are nice, but most data items use larger "words" • For MIPS, a word is 32 bits or 4 bytes: addresses 0, 4, 8, 12, ... each hold 32 bits of data • Registers hold 32 bits of data • 2^32 bytes with byte addresses from 0 to 2^32−1 • 2^30 words with byte addresses 0, 4, 8, ..., 2^32−4 • Words are aligned; i.e., what are the least 2 significant bits of a word address?
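As a sketch of the alignment question above (the helper names are ours): a MIPS word address is a multiple of 4, so its two least-significant bits are always 00, and shifting the byte address right by 2 gives the word index.

#include <stdint.h>
#include <stdio.h>

/* MIPS words are 4 bytes, so a word address is a multiple of 4:
 * its two low-order bits are always 00. */
static int is_word_aligned(uint32_t byte_addr) {
    return (byte_addr & 0x3) == 0;      /* low 2 bits must be zero */
}

/* Byte address -> word index: addresses 0, 4, 8, ... map to words 0, 1, 2, ... */
static uint32_t word_index(uint32_t byte_addr) {
    return byte_addr >> 2;              /* divide by 4 */
}

int main(void) {
    printf("%d %d\n", is_word_aligned(8), is_word_aligned(6));  /* prints: 1 0 */
    printf("%u\n", word_index(12));                             /* prints: 3 */
    return 0;
}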
2004 Morgan Kaufmann Publishers 30
Instructions • Load and store instructions • Example: C code: A[12] = h + A[8]; MIPS code: lw $t0, 32($s3) add $t0, $s2, $t0 sw $t0, 48($s3) • Can refer to registers by name (e.g., $s2, $t2) instead of number • Store word has destination last • Remember arithmetic operands are registers, not memory! Can't write: add 48($s3), $s2, 32($s3) 2004 Morgan Kaufmann Publishers 31
Our First Example • Can we figure out the code? swap(int v[], int k) { int temp; temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; } swap: muli $2, $5, 4 add $2, $4, $2 lw $15, 0($2) lw $16, 4($2) sw $16, 0($2) sw $15, 4($2) jr $31 2004 Morgan Kaufmann Publishers 32
So far we've learned: • MIPS — loading words but addressing bytes — arithmetic on registers only • Instruction meanings: add $s1, $s2, $s3 → $s1 = $s2 + $s3; sub $s1, $s2, $s3 → $s1 = $s2 – $s3; lw $s1, 100($s2) → $s1 = Memory[$s2+100]; sw $s1, 100($s2) → Memory[$s2+100] = $s1 2004 Morgan Kaufmann Publishers 33
Machine Language • Instructions, like registers and words of data, are also 32 bits long – Example: add $t1, $s1, $s2 – registers have numbers: $t1=9, $s1=17, $s2=18 • Instruction Format: op=000000 rs=10001 rt=10010 rd=01000 shamt=00000 funct=100000 • Can you guess what the field names stand for? 2004 Morgan Kaufmann Publishers 34
Machine Language • Consider the load-word and store-word instructions – What would the regularity principle have us do? – New principle: Good design demands a compromise • Introduce a new type of instruction format – I-type for data transfer instructions – other format was R-type for register • Example: lw $t0, 32($s2) → op=35 rs=18 rt=9 immediate=32 (a 16-bit number) • Where's the compromise? 2004 Morgan Kaufmann Publishers 35
Stored Program Concept • Instructions are bits • Programs are stored in memory — to be read or written just like data [Diagram: Processor ↔ Memory holding data, programs, compilers, editors, etc.] • Fetch & Execute Cycle – Instructions are fetched and put into a special register – Bits in the register "control" the subsequent actions – Fetch the "next" instruction and continue 2004 Morgan Kaufmann Publishers 36
Control • Decision making instructions – alter the control flow, – i.e., change the "next" instruction to be executed • MIPS conditional branch instructions: bne $t0, $t1, Label beq $t0, $t1, Label • Example: if (i==j) h = i + j; bne $s0, $s1, Label add $s3, $s0, $s1 Label: .... 2004 Morgan Kaufmann Publishers 37
Control • MIPS unconditional branch instructions: j label • Example: if (i!=j) h=i+j; else h=i-j; beq $s4, $s5, Lab1 add $s3, $s4, $s5 j Lab2 Lab1: sub $s3, $s4, $s5 Lab2: ... • Can you build a simple for loop? 2004 Morgan Kaufmann Publishers 38
So far: • Instruction meanings: add $s1,$s2,$s3 → $s1 = $s2 + $s3; sub $s1,$s2,$s3 → $s1 = $s2 – $s3; lw $s1,100($s2) → $s1 = Memory[$s2+100]; sw $s1,100($s2) → Memory[$s2+100] = $s1; bne $s4,$s5,L → next instr. is at L if $s4 ≠ $s5; beq $s4,$s5,L → next instr. is at L if $s4 = $s5; j Label → next instr. is at Label • Formats: R: op | rs | rt | rd | shamt | funct; I: op | rs | rt | 16-bit address; J: op | 26-bit address 2004 Morgan Kaufmann Publishers 39
Control Flow • We have beq, bne; what about Branch-if-less-than? • New instruction: slt $t0, $s1, $s2 → if $s1 < $s2 then $t0 = 1 else $t0 = 0 • Can use this instruction to build "blt $s1, $s2, Label" — can now build general control structures • Note that the assembler needs a register to do this — there are policy of use conventions for registers 2004 Morgan Kaufmann Publishers 40
Policy of Use Conventions • Name, register number, usage: $zero 0 the constant value 0; $v0-$v1 2-3 values for results and expression evaluation; $a0-$a3 4-7 arguments; $t0-$t7 8-15 temporaries; $s0-$s7 16-23 saved; $t8-$t9 24-25 more temporaries; $gp 28 global pointer; $sp 29 stack pointer; $fp 30 frame pointer; $ra 31 return address • Register 1 ($at) is reserved for the assembler, 26-27 for the operating system 2004 Morgan Kaufmann Publishers 41
Constants • Small constants are used quite frequently (50% of operands) e.g., A = A + 5; B = B + 1; C = C - 18; • Solutions? Why not: – put 'typical constants' in memory and load them – create hard-wired registers (like $zero) for constants like one • MIPS Instructions: addi $29, $29, 4 slti $8, $18, 10 andi $29, $29, 6 ori $29, $29, 4 • Design Principle: Make the common case fast. Which format? 2004 Morgan Kaufmann Publishers 42
How about larger constants? • We'd like to be able to load a 32-bit constant into a register • Must use two instructions; a new "load upper immediate" instruction: lui $t0, 1010101010101010 puts the 16-bit immediate in the upper half of $t0 and fills the lower half with zeros: 1010101010101010 0000000000000000 • Then must get the lower-order bits right, i.e., ori $t0, $t0, 1010101010101010 ors in 0000000000000000 1010101010101010, giving 1010101010101010 1010101010101010 2004 Morgan Kaufmann Publishers 43
Assembly Language vs. Machine Language • Assembly provides convenient symbolic representation – much easier than writing down numbers – e.g., destination first • Machine language is the underlying reality – e.g., destination is no longer first • Assembly can provide 'pseudoinstructions' – e.g., "move $t0, $t1" exists only in Assembly – would be implemented using "add $t0,$t1,$zero" • When considering performance you should count real instructions 2004 Morgan Kaufmann Publishers 44
Other Issues • Discussed in your assembly language programming lab: support for procedures, linkers, loaders, memory layout, stacks, frames, recursion, manipulating strings and pointers, interrupts and exceptions, system calls and conventions • Some of these we'll talk more about later • We'll talk about compiler optimizations when we hit Chapter 4. 2004 Morgan Kaufmann Publishers 45
Overview of MIPS • simple instructions, all 32 bits wide • very structured, no unnecessary baggage • only three instruction formats: R: op | rs | rt | rd | shamt | funct; I: op | rs | rt | 16-bit address; J: op | 26-bit address • rely on compiler to achieve performance — what are the compiler's goals? • help compiler where we can 2004 Morgan Kaufmann Publishers 46
Addresses in Branches and Jumps • Instructions: bne $t4,$t5,Label → next instruction is at Label if $t4 ≠ $t5; beq $t4,$t5,Label → next instruction is at Label if $t4 = $t5; j Label → next instruction is at Label • Formats: I: op | rs | rt | 16-bit address; J: op | 26-bit address • Addresses are not 32 bits — how do we handle this with load and store instructions?
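Before moving on to branch addressing, here is a small C sketch of the lui/ori constant-building described a few slides back (the variable names are ours; this mimics the arithmetic an assembler would do, not any real assembler's API):

#include <stdint.h>
#include <stdio.h>

/* Loading a 32-bit constant takes two instructions:
 *   lui $t0, upper16        # upper bits into the high half, zeros below
 *   ori $t0, $t0, lower16   # fill in the low half
 */
int main(void) {
    uint32_t k = 0xAAAAAAAAu;                 /* the 1010...1010 pattern from the slide */
    uint16_t upper = (uint16_t)(k >> 16);     /* immediate for lui */
    uint16_t lower = (uint16_t)(k & 0xFFFFu); /* immediate for ori */

    /* Reassemble the constant the way the hardware does. */
    uint32_t rebuilt = ((uint32_t)upper << 16) | lower;
    printf("upper=0x%04X lower=0x%04X rebuilt=0x%08X\n", upper, lower, rebuilt);
    return 0;
}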
2004 Morgan Kaufmann Publishers 47
Addresses in Branches • Instructions: bne $t4,$t5,Label → next instruction is at Label if $t4≠$t5; beq $t4,$t5,Label → next instruction is at Label if $t4=$t5 • Format: I: op | rs | rt | 16-bit address • Could specify a register (like lw and sw) and add it to the address – use Instruction Address Register (PC = program counter) – most branches are local (principle of locality) • Jump instructions just use the high-order bits of the PC – address boundaries of 256 MB 2004 Morgan Kaufmann Publishers 48
To summarize: • MIPS operands: – 32 registers ($s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp, $fp, $sp, $ra, $at): fast locations for data. In MIPS, data must be in registers to perform arithmetic. MIPS register $zero always equals 0. Register $at is reserved for the assembler to handle large constants. – 2^30 memory words (Memory[0], Memory[4], ..., Memory[4294967292]): accessed only by data transfer instructions. MIPS uses byte addresses, so sequential words differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls. • MIPS assembly language: – Arithmetic: add $s1, $s2, $s3 → $s1 = $s2 + $s3 (three operands; data in registers); sub $s1, $s2, $s3 → $s1 = $s2 - $s3 (three operands; data in registers); addi $s1, $s2, 100 → $s1 = $s2 + 100 (used to add constants) – Data transfer: lw $s1, 100($s2) → $s1 = Memory[$s2 + 100] (word from memory to register); sw $s1, 100($s2) → Memory[$s2 + 100] = $s1 (word from register to memory); lb $s1, 100($s2) → $s1 = Memory[$s2 + 100] (byte from memory to register); sb $s1, 100($s2) → Memory[$s2 + 100] = $s1 (byte from register to memory); lui $s1, 100 → $s1 = 100 × 2^16 (loads constant in upper 16 bits) – Conditional branch: beq $s1, $s2, 25 → if ($s1 == $s2) go to PC + 4 + 100 (equal test; PC-relative branch); bne $s1, $s2, 25 → if ($s1 != $s2) go to PC + 4 + 100 (not equal test; PC-relative); slt $s1, $s2, $s3 → if ($s2 < $s3) $s1 = 1; else $s1 = 0 (compare less than; for beq, bne); slti $s1, $s2, 100 → if ($s2 < 100) $s1 = 1; else $s1 = 0 (compare less than constant) – Unconditional jump: j 2500 → go to 10000 (jump to target address); jr $ra → go to $ra (for switch, procedure return); jal 2500 → $ra = PC + 4; go to 10000 (for procedure call) 2004 Morgan Kaufmann Publishers 49
Addressing modes: 1. Immediate addressing: op | rs | rt | Immediate 2. Register addressing: op | rs | rt | rd | ... | funct → register holds the operand 3. Base addressing: op | rs | rt | Address → register + Address locates a byte, halfword, or word in Memory 4. PC-relative addressing: op | rs | rt | Address → PC + Address locates a word in Memory 5. Pseudodirect addressing: op | Address → Address concatenated with the PC locates a word in Memory 2004 Morgan Kaufmann Publishers 50
CSE 431 Computer Architecture Fall 2005 Lecture 02: MIPS ISA Review Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, UCB] 2004 Morgan Kaufmann Publishers 51
(von Neumann) Processor Organization • Control needs to 1. input instructions from Memory 2. issue signals to control the information flow between the Datapath components and to control what operations they perform 3. control instruction sequencing [Diagram: CPU (Control + Datapath) connected to Memory and Devices (Input, Output); Fetch → Decode → Exec cycle] • Datapath needs to have the – components – the functional units and storage (e.g., register file) needed to execute instructions – interconnects – components connected so that the instructions can be accomplished and so that data can be loaded from and stored to Memory 2004 Morgan Kaufmann Publishers 52
RISC - Reduced Instruction Set Computer • RISC philosophy – fixed instruction lengths – load-store instruction sets – limited addressing modes – limited operations • MIPS, Sun SPARC, HP PA-RISC, IBM PowerPC, Intel (Compaq) Alpha, … • Instruction sets are measured by how well compilers use them as opposed to how well assembly language programmers use them • Design goals: speed, cost (design, fabrication, test, packaging), size, power consumption, reliability, memory space (embedded systems) 2004 Morgan Kaufmann Publishers 53
MIPS R3000 Instruction Set Architecture (ISA) • Registers: R0 - R31, PC, HI, LO • Instruction Categories: – Computational – Load/Store – Jump and Branch – Floating Point (coprocessor) – Memory Management – Special • 3 Instruction Formats, all 32 bits wide: – R format: OP | rs | rt | rd | sa | funct – I format: OP | rs | rt | immediate – J format: OP | jump target 2004 Morgan Kaufmann Publishers 54
Review: Unsigned Binary Representation • Hex → binary → decimal: 0x00000000 = 0…0000 = 0; 0x00000001 = 0…0001 = 1; 0x00000002 = 0…0010 = 2; 0x00000003 = 0…0011 = 3; 0x00000004 = 0…0100 = 4; ...; 0x00000009 = 0…1001 = 9; …; 0xFFFFFFFC = 1…1100 = 2^32 − 4; 0xFFFFFFFD = 1…1101 = 2^32 − 3; 0xFFFFFFFE = 1…1110 = 2^32 − 2; 0xFFFFFFFF = 1…1111 = 2^32 − 1 • Bit weights are 2^31, 2^30, 2^29, ..., 2^3, 2^2, 2^1, 2^0 (bit positions 31 down to 0); e.g., a 1 followed by 31 zeros = 2^31, and all 32 ones = 2^32 − 1 2004 Morgan Kaufmann Publishers 55
Aside: Beyond Numbers • American Std Code for Info Interchange (ASCII): 8-bit bytes representing characters [Table: ASCII codes — control characters 0–31 (0 Null, 4 EOT, 6 ACK, 8 bksp, 9 tab, 10 LF, 12 FF), 32 space, punctuation 33–47 ('!', '"', '#', '$', '%', '&', ''', '(', ')', '*', '+', ',', '/'), 48–57 '0'–'9', 64 '@', 65–90 'A'–'Z', 96 '`', 97–122 'a'–'z', 127 DEL] 2004 Morgan Kaufmann Publishers 56
MIPS Arithmetic Instructions • MIPS assembly language arithmetic statement: add $t0, $s1, $s2 sub $t0, $s1, $s2 • Each arithmetic instruction performs only one operation • Each arithmetic instruction fits in 32 bits and specifies exactly three operands: destination ← source1 op source2 • Operand order is fixed (destination first) • Those operands are all contained in the datapath's register file ($t0, $s1, $s2) – indicated by $ 2004 Morgan Kaufmann Publishers 58
Aside: MIPS Register Convention • Name, register number, usage, preserve on call? $zero 0 constant 0 (hardware) n.a. $at 1 reserved for assembler n.a.
$v0-$v1 2-3 returned values no; $a0-$a3 4-7 arguments yes; $t0-$t7 8-15 temporaries no; $s0-$s7 16-23 saved values yes; $t8-$t9 24-25 temporaries no; $gp 28 global pointer yes; $sp 29 stack pointer yes; $fp 30 frame pointer yes; $ra 31 return addr (hardware) yes 2004 Morgan Kaufmann Publishers 59
MIPS Register File • Holds thirty-two 32-bit registers – two read ports and one write port [Diagram: register file with 5-bit src1 addr, src2 addr, and dst addr inputs, 32-bit write data in, 32-bit src1 data and src2 data out, 32 locations, write control] • Registers are – faster than main memory • but register files with more locations are slower (e.g., a 64-word file could be as much as 50% slower than a 32-word file) • read/write port increase impacts speed quadratically – easier for a compiler to use • e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order vs. a stack – can hold variables so that • code density improves (since registers are named with fewer bits than a memory location) 2004 Morgan Kaufmann Publishers 60
Machine Language - Add Instruction • Instructions, like registers and words of data, are 32 bits long • Arithmetic Instruction Format (R format): add $t0, $s1, $s2 → op | rs | rt | rd | shamt | funct – op: 6-bit opcode that specifies the operation – rs: 5-bit register file address of the first source operand – rt: 5-bit register file address of the second source operand – rd: 5-bit register file address of the result's destination – shamt: 5-bit shift amount (for shift instructions) – funct: 6-bit function code augmenting the opcode 2004 Morgan Kaufmann Publishers 61
MIPS Memory Access Instructions • MIPS has two basic data transfer instructions for accessing memory: lw $t0, 4($s3) #load word from memory sw $t0, 8($s3) #store word to memory • The data is loaded into (lw) or stored from (sw) a register in the register file – a 5-bit address • The memory address – a 32-bit address – is formed by adding the contents of the base address register to the offset value – a 16-bit field, meaning access is limited to memory locations within a region of ±2^13 or 8,192 words (±2^15 or 32,768 bytes) of the address in the base register – note that the offset can be positive or negative 2004 Morgan Kaufmann Publishers 62
Machine Language - Load Instruction • Load/Store Instruction Format (I format): lw $t0, 24($s2) → op | rs | rt | 16-bit offset • Example: with $s2 = 0x12004094, 24 decimal + $s2 = ...0001 1000 + ...1001 0100 = ...1010 1100 = 0x120040ac, so $t0 receives the memory word at address 0x120040ac [Diagram: memory from word address 0x00000000 up to 0xffffffff, with the data word at 0x120040ac] 2004 Morgan Kaufmann Publishers 63
Byte Addresses • Since 8-bit bytes are so useful, most architectures address individual bytes in memory – the memory address of a word must be a multiple of 4 (alignment restriction) • Big Endian: leftmost byte is word address — IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA • Little Endian: rightmost byte is word address — Intel 80x86, DEC Vax, DEC Alpha (Windows NT) [Diagram: little endian numbers the bytes 3, 2, 1, 0 from msb to lsb; big endian numbers them 0, 1, 2, 3] 2004 Morgan Kaufmann Publishers 64
Aside: Loading and Storing Bytes • MIPS provides special instructions to move bytes: lb $t0, 1($s3) #load byte from memory sb $t0, 6($s3) #store byte to memory (I format: op | rs | rt | 16-bit offset) • What 8 bits get loaded and stored? – load byte places the byte from memory in the rightmost 8 bits of the destination register • what happens to the other bits in the register? – store byte takes the byte from the rightmost 8 bits of a register and writes it to a byte in memory • what happens to the other bits in the memory word?
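A small C experiment suggesting answers to the lb/sb questions above (int8_t plays the role of a sign-extending lb, uint8_t the zero-extending lbu introduced later; the array stands in for a word of memory, and all names are ours):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t mem[4] = { 0x0C, 0xF0, 0x7F, 0x80 };   /* one word of memory, byte by byte */

    /* lb: the byte lands in the low 8 bits and is sign-extended to 32 bits. */
    int32_t lb_result = (int8_t)mem[1];     /* 0xF0 -> 0xFFFFFFF0 (i.e., -16) */

    /* lbu: same byte, but the upper 24 bits are zero-filled instead. */
    uint32_t lbu_result = (uint8_t)mem[1];  /* 0xF0 -> 0x000000F0 (i.e., 240) */

    /* sb: only the addressed byte changes; the rest of the word is untouched. */
    mem[2] = (uint8_t)0x1234;               /* stores 0x34; the high bits are dropped */

    printf("lb=0x%08X lbu=0x%08X mem[2]=0x%02X\n",
           (uint32_t)lb_result, lbu_result, mem[2]);
    return 0;
}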
2004 Morgan Kaufmann Publishers 65
MIPS Control Flow Instructions • MIPS conditional branch instructions: bne $s0, $s1, Lbl #go to Lbl if $s0≠$s1 beq $s0, $s1, Lbl #go to Lbl if $s0=$s1 – Ex: if (i==j) h = i + j; → bne $s0, $s1, Lbl1 add $s3, $s0, $s1 Lbl1: ... • Instruction Format (I format): op | rs | rt | 16-bit offset • How is the branch destination address specified? 2004 Morgan Kaufmann Publishers 66
Specifying Branch Destinations • Use a register (like in lw and sw) added to the 16-bit offset – which register? Instruction Address Register (the PC) • its use is automatically implied by the instruction • the PC gets updated (PC+4) during the fetch cycle so that it holds the address of the next instruction – limits the branch distance to −2^15 to +2^15−1 instructions from the (instruction after the) branch instruction, but most branches are local anyway [Diagram: the 16-bit offset from the low-order bits of the branch instruction is sign-extended, shifted left 2 bits (appending 00), and added to PC+4 to form the branch destination address] 2004 Morgan Kaufmann Publishers 67
More Branch Instructions • We have beq, bne, but what about other kinds of branches (e.g., branch-if-less-than)? For this, we need yet another instruction, slt • Set on less than instruction: slt $t0, $s0, $s1 # if $s0 < $s1 then $t0 = 1 else $t0 = 0 • Instruction format (R format): op | rs | rt | rd | funct 2004 Morgan Kaufmann Publishers 68
2 More Branch Instructions, Con't • Can use slt, beq, bne, and the fixed value of 0 in register $zero to create other conditions – less than: blt $s1, $s2, Label → slt $at, $s1, $s2 #$at set to 1 if $s1 < $s2 bne $at, $zero, Label – less than or equal to: ble $s1, $s2, Label – greater than: bgt $s1, $s2, Label – greater than or equal to: bge $s1, $s2, Label • Such branches are included in the instruction set as pseudo instructions – recognized (and expanded) by the assembler – it's why the assembler needs a reserved register ($at) 2004 Morgan Kaufmann Publishers 69
Other Control Flow Instructions • MIPS also has an unconditional branch instruction or jump instruction: j label #go to label • Instruction Format (J format): op | 26-bit address [Diagram: the 26-bit address from the low-order bits of the jump instruction is shifted left 2 bits (appending 00) and combined with the upper 4 bits of the PC] 2004 Morgan Kaufmann Publishers 70
Aside: Branching Far Away • What if the branch destination is further away than can be captured in 16 bits? The assembler comes to the rescue – it inserts an unconditional jump to the branch target and inverts the condition: beq $s0, $s1, L1 becomes bne $s0, $s1, L2 j L1 L2: 2004 Morgan Kaufmann Publishers 71
Instructions for Accessing Procedures • MIPS procedure call instruction: jal ProcedureAddress #jump and link • Saves PC+4 in register $ra to have a link to the next instruction for the procedure return • Machine format (J format): op | 26-bit address • Then can do the procedure return with jr $ra #return • Instruction format (R format): op | rs | funct 2004 Morgan Kaufmann Publishers 72
Aside: Spilling Registers • What if the callee needs more registers? What if the procedure is recursive? – uses a stack – a last-in-first-out queue – in memory for passing additional values or saving (recursive) return address(es) • One of the general registers, $sp, is used to address the stack (which "grows" from high address to low address) – add data onto the stack – push: $sp = $sp – 4, data on stack at new $sp – remove data from the stack – pop: data from stack at $sp, $sp = $sp + 4 2004 Morgan Kaufmann Publishers 73
MIPS Immediate Instructions • Small constants are used often in typical code • Possible approaches? – put "typical constants" in memory and load them – create hard-wired registers (like $zero) for constants like 1 – have special instructions that contain constants! addi $sp, $sp, 4 #$sp = $sp + 4 slti $t0, $s2, 15 #$t0 = 1 if $s2<15 • Machine format (I format): op | rs | rt | 16-bit immediate • The constant is kept inside the instruction itself! – Immediate format limits values to the range −2^15 to +2^15−1 2004 Morgan Kaufmann Publishers 74
Aside: How About Larger Constants? • We'd also like to be able to load a 32-bit constant into a register; for this we must use two instructions • a new "load upper immediate" instruction: lui $t0, 1010101010101010 → $t0 = 1010101010101010 0000000000000000 • Then must get the lower-order bits right; use ori $t0, $t0, 1010101010101010 → $t0 = 1010101010101010 1010101010101010 2004 Morgan Kaufmann Publishers 75
MIPS Organization So Far [Figure: processor and memory — a register file (32 32-bit registers $zero–$ra, two 5-bit source addresses and one 5-bit destination address, 32-bit read/write data), an ALU with 32-bit operands, a PC with a +4 incrementer and a branch-offset adder, and a 2^30-word memory (word addresses 0…0000, 0…0100, 0…1000, 0…1100 in binary; byte addresses shown big-endian) in a Fetch (PC = PC+4) / Decode / Exec loop] 2004 Morgan Kaufmann Publishers 76
MIPS ISA So Far • Arithmetic (R & I format): add, op 0 and 32: add $s1, $s2, $s3 → $s1 = $s2 + $s3; subtract, op 0 and 34: sub $s1, $s2, $s3 → $s1 = $s2 - $s3; add immediate, op 8: addi $s1, $s2, 6 → $s1 = $s2 + 6; or immediate, op 13: ori $s1, $s2, 6 → $s1 = $s2 ∨ 6 • Data Transfer (I format): load word, op 35: lw $s1, 24($s2) → $s1 = Memory($s2+24); store word, op 43: sw $s1, 24($s2) → Memory($s2+24) = $s1; load byte, op 32: lb $s1, 25($s2) → $s1 = Memory($s2+25); store byte, op 40: sb $s1, 25($s2) → Memory($s2+25) = $s1; load upper imm, op 15: lui $s1, 6 → $s1 = 6 × 2^16 • Cond. Branch (I & R format): br on equal, op 4: beq $s1, $s2, L → if ($s1==$s2) go to L; br on not equal, op 5: bne $s1, $s2, L → if ($s1 !=$s2) go to L; set on less than, op 0 and 42: slt $s1, $s2, $s3 → if ($s2<$s3) $s1=1 else $s1=0; set on less than immediate, op 10: slti $s1, $s2, 6 → if ($s2<6) $s1=1 else $s1=0 • Uncond. Jump (J & R format): jump, op 2: j 2500 → go to 10000; jump register, op 0 and 8: jr $t1 → go to $t1; jump and link, op 3: jal 2500 → go to 10000; $ra=PC+4 2004 Morgan Kaufmann Publishers 77
Review of MIPS Operand Addressing Modes • Register addressing – operand is in a register: op | rs | rt | rd | funct → the register holds the word operand • Base (displacement) addressing – operand is at the memory location whose address is the sum of a register and a 16-bit constant contained within the instruction: op | rs | rt | offset → base register + offset locates a word or byte operand in Memory – register relative (indirect) with 0($a0) – pseudo-direct with addr($zero) • Immediate addressing – operand is a 16-bit constant contained within the instruction: op | rs | rt | operand 2004 Morgan Kaufmann Publishers 78
Review of MIPS Instruction Addressing Modes • PC-relative addressing – instruction address is the sum of the PC and a 16-bit constant contained within the instruction: op | rs | rt | offset → Program Counter (PC) + offset locates the branch destination instruction in Memory • Pseudo-direct addressing – instruction address is the 26-bit constant contained within the instruction concatenated with the upper 4 bits of the PC: op | jump address → PC || jump address locates the jump destination instruction in Memory 2004 Morgan Kaufmann Publishers 79
MIPS (RISC) Design Principles • Simplicity favors regularity – fixed size instructions – 32-bits – small number of instruction formats – opcode always the first 6 bits • Good design demands good compromises – three instruction formats • Smaller is faster – limited instruction set – limited number of registers in register file – limited number of addressing modes • Make the common case fast – arithmetic operands from the register file (load-store machine) – allow instructions to contain immediate operands 2004 Morgan Kaufmann Publishers 80
Chapter Three 2004 Morgan Kaufmann Publishers 81
Numbers • Bits are just bits (no inherent meaning) — conventions define the relationship between bits and numbers • Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001... decimal: 0...2^n−1 • Of course it gets more complicated: numbers are finite (overflow); fractions and real numbers; negative numbers — e.g., there is no MIPS subi instruction, but addi can add a negative number • How do we represent negative numbers? i.e., which bit patterns will represent which numbers? 2004 Morgan Kaufmann Publishers 82
Possible Representations • Sign Magnitude: 000 = +0, 001 = +1, 010 = +2, 011 = +3, 100 = -0, 101 = -1, 110 = -2, 111 = -3 • One's Complement: 000 = +0, 001 = +1, 010 = +2, 011 = +3, 100 = -3, 101 = -2, 110 = -1, 111 = -0 • Two's Complement: 000 = +0, 001 = +1, 010 = +2, 011 = +3, 100 = -4, 101 = -3, 110 = -2, 111 = -1 • Issues: balance, number of zeros, ease of operations • Which one is best? Why? 2004 Morgan Kaufmann Publishers 83
MIPS • 32-bit signed numbers: 0000 0000 0000 0000 0000 0000 0000 0000two = 0ten; 0000…0001two = +1ten; 0000…0010two = +2ten; …; 0111…1110two = +2,147,483,646ten; 0111…1111two = +2,147,483,647ten (maxint); 1000…0000two = –2,147,483,648ten (minint); 1000…0001two = –2,147,483,647ten; 1000…0010two = –2,147,483,646ten; …; 1111…1101two = –3ten; 1111…1110two = –2ten; 1111…1111two = –1ten 2004 Morgan Kaufmann Publishers 84
MIPS Number Representations • 32-bit signed numbers (2's complement): the same table as above, running from 0 and +1 up to maxint = +2,147,483,647 (0111…1111two) and from minint = –2,147,483,648 (1000…0000two) up through –2 and –1 (1111…1111two), with the sign in the MSB and the LSB at the right • Converting <32-bit values into 32-bit values – copy the most significant bit (the sign bit) into the "empty" bits: 0010 -> 0000 0010; 1010 -> 1111 1010 – sign extend versus zero extend (lb vs. lbu) 2004 Morgan Kaufmann Publishers 85
MIPS Arithmetic Logic Unit (ALU) • Must support the Arithmetic/Logic operations of the ISA: add, addi, addiu, addu; sub, subu, neg; mult, multu, div, divu; sqrt; and, andi, nor, or, ori, xor, xori; beq, bne, slt, slti, sltiu, sltu [Diagram: ALU with 32-bit inputs A and B, a 4-bit operation select m, a 32-bit result, and zero and ovf outputs] • With special handling for – sign extend – addi, addiu, andi, ori, xori, slti, sltiu – zero extend – lbu, addiu, sltiu – no overflow detected – addu, addiu, subu, multu, divu, sltiu, sltu 2004 Morgan Kaufmann Publishers 86
Two's Complement Operations • Negating a two's complement number: invert all bits and add 1 – remember: "negate" and "invert" are quite different!
• Converting n-bit numbers into numbers with more than n bits: – MIPS 16-bit immediate gets converted to 32 bits for arithmetic – copy the most significant bit (the sign bit) into the other bits: 0010 -> 0000 0010; 1010 -> 1111 1010 – "sign extension" (lbu vs. lb) 2004 Morgan Kaufmann Publishers 87
Review: 2's Complement Binary Representation • 4-bit two's complement: 1000 = −8 (−2^3); 1001 = −7 (−(2^3 − 1)); 1010 = −6; 1011 = −5; 1100 = −4; 1101 = −3; 1110 = −2; 1111 = −1; 0000 = 0; 0001 = 1; 0010 = 2; 0011 = 3; 0100 = 4; 0101 = 5; 0110 = 6; 0111 = 7 (2^3 − 1) • To negate (e.g., 0101 = 5): complement all the bits (1010) and add a 1 (1011 = −5) • Note: negate and invert are different! 2004 Morgan Kaufmann Publishers 88
Review: A Full Adder [Diagram: 1-bit full adder with inputs A, B, carry_in and outputs S, carry_out] • Truth table (A B carry_in → carry_out S): 0 0 0 → 0 0; 0 0 1 → 0 1; 0 1 0 → 0 1; 0 1 1 → 1 0; 1 0 0 → 0 1; 1 0 1 → 1 0; 1 1 0 → 1 0; 1 1 1 → 1 1 • S = A ⊕ B ⊕ carry_in (odd parity function) • carry_out = A&B | A&carry_in | B&carry_in (majority function) • How can we use it to build a 32-bit adder? How can we modify it easily to build an adder/subtractor? 2004 Morgan Kaufmann Publishers 89
Addition & Subtraction • Just like in grade school (carry/borrow 1s): 0111 + 0110; 0111 − 0110; 0110 − 0101 • Two's complement operations are easy – subtraction using addition of negative numbers: 0111 − 0110 becomes 0111 + 1010 • Overflow (result too large for finite computer word): – e.g., adding two n-bit numbers does not yield an n-bit number: 0111 + 0001 = 1000 — note that the overflow term is somewhat misleading; it does not mean a carry "overflowed" 2004 Morgan Kaufmann Publishers 90
A 32-bit Ripple Carry Adder/Subtractor • Remember: 2's complement is just "complement all the bits and add a 1 in the least significant bit" – e.g., A − B = 0111 − 0110 = 0111 + 1001 + 1 = 0001 • control (0=add, 1=sub): each adder stage receives B_i if control = 0 and !B_i if control = 1, and c0 = carry_in = control supplies the added 1 [Diagram: chain of 1-bit full adders FA0 … FA31; stage i takes A_i and (B_i or !B_i) with carry c_i in, produces S_i and carry c_{i+1} out; c32 = carry_out] 2004 Morgan Kaufmann Publishers 92
Detecting Overflow • No overflow when adding a positive and a negative number • No overflow when signs are the same for subtraction • Overflow occurs when the value affects the sign: – overflow when adding two positives yields a negative – or, adding two negatives gives a positive – or, subtract a negative from a positive and get a negative – or, subtract a positive from a negative and get a positive • Consider the operations A + B and A – B – Can overflow occur if B is 0? – Can overflow occur if A is 0?
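Here is a hedged C sketch of the sign-based overflow rule just stated, together with negation as "invert all the bits and add 1" (the function and variable names are ours):

#include <stdint.h>
#include <stdio.h>

/* Signed addition overflows exactly when both operands have the same
 * sign and the result's sign differs; adding a positive and a negative
 * can never overflow. */
static int add_overflows(int32_t a, int32_t b) {
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);  /* wraparound add */
    return ((a ^ b) >= 0) && ((a ^ sum) < 0);
}

int main(void) {
    /* Negation: invert all the bits and add 1. */
    int32_t x = 6;
    int32_t neg = ~x + 1;                          /* -6 */
    printf("neg=%d\n", neg);

    printf("%d\n", add_overflows(2147483647, 1));  /* 1: maxint + 1 wraps */
    printf("%d\n", add_overflows(5, -3));          /* 0: mixed signs, no overflow */
    return 0;
}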
2004 Morgan Kaufmann Publishers 93
Overflow Detection • Overflow: the result is too large to represent in 32 bits • Overflow occurs when – adding two positives yields a negative – or, adding two negatives gives a positive – or, subtracting a negative from a positive gives a negative – or, subtracting a positive from a negative gives a positive • On your own: prove you can detect overflow by: – Carry into MSB xor Carry out of MSB — e.g., for 4-bit signed numbers: 0111 (7) + 0011 (3) = 1010 (–6), overflow; 1100 (–4) + 1011 (–5) = 0111 (7), overflow 2004 Morgan Kaufmann Publishers 95
Tailoring the ALU to the MIPS ISA • Need to support the logic operations (and, nor, or, xor) – bit-wise operations (no carry operation involved) – need a logic gate for each function, mux to choose the output • Need to support the set-on-less-than instruction (slt) – use subtraction to determine if (a – b) < 0 (implies a < b) – copy the sign bit into the low-order bit of the result, set remaining result bits to 0 • Need to support test for equality (bne, beq) – again use subtraction: (a - b) = 0 implies a = b – additional logic to "nor" all result bits together • Immediates are sign extended outside the ALU with wiring (i.e., no logic needed) 2004 Morgan Kaufmann Publishers 96
Shift Operations • Also need operations to pack and unpack 8-bit characters into 32-bit words • Shifts move all the bits in a word left or right: sll $t2, $s0, 8 #$t2 = $s0 << 8 bits srl $t2, $s0, 8 #$t2 = $s0 >> 8 bits (R format: op | rs | rt | rd | shamt | funct) • Notice that a 5-bit shamt field is enough to shift a 32-bit value 2^5 – 1 or 31 bit positions • Such shifts are logical because they fill with zeros 2004 Morgan Kaufmann Publishers 97
Shift Operations, con't • An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value) – so sra uses the most significant bit (sign bit) as the bit shifted in – note that there is no need for an sla when using two's complement number representation: sra $t2, $s0, 8 #$t2 = $s0 >> 8 bits • The shift operation is implemented by hardware separate from the ALU – using a barrel shifter (which would take lots of gates in discrete logic, but is pretty easy to implement in VLSI) 2004 Morgan Kaufmann Publishers 98
Multiply • Binary multiplication is just a bunch of right shifts and adds [Figure: an n-bit multiplicand times an n-bit multiplier produces a partial product array that sums to a 2n-bit double precision product; the partial products can be formed in parallel and added in parallel for faster multiplication] 2004 Morgan Kaufmann Publishers 99
MIPS Multiply Instruction • Multiply produces a double precision product: mult $s0, $s1 # hi||lo = $s0 * $s1 (R format: op | rs | rt | rd | shamt | funct) – The low-order word of the product is left in processor register lo and the high-order word is left in register hi – Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file • Multiplies are done by fast, dedicated hardware and are much more complex (and slower) than adders • Hardware dividers are even more complex and even slower; ditto for hardware square root 2004 Morgan Kaufmann Publishers 100
Effects of Overflow • An exception (interrupt) occurs – control jumps to a predefined address for the exception – the interrupted address is saved for possible resumption • Details based on software system / language – example: flight control vs. homework assignment • Don't always want to detect overflow — new MIPS instructions: addu, addiu, subu note: addiu still sign-extends! note: sltu, sltiu for unsigned comparisons 2004 Morgan Kaufmann Publishers 101
Multiplication • More complicated than addition – accomplished via shifting and addition • More time and more area • Let's look at 3 versions based on a gradeschool algorithm: 0010 (multiplicand) × 1011 (multiplier) • Negative numbers: convert and multiply – there are better techniques, we won't look at them 2004 Morgan Kaufmann Publishers 102
Multiplication: Implementation [Figure: first hardware version — a 64-bit Multiplicand register (shift left), a 64-bit ALU, a 64-bit Product register (write), a 32-bit Multiplier register (shift right), and control. Each repetition: 1. test Multiplier0; 1a. if Multiplier0 = 1, add the multiplicand to the product and place the result in the Product register; 2. shift the Multiplicand register left 1 bit; 3. shift the Multiplier register right 1 bit; done after the 32nd repetition] 2004 Morgan Kaufmann Publishers 103
Final Version [Figure: refined hardware — the multiplier starts in the right half of the 64-bit Product register, with a 32-bit Multiplicand register and a 32-bit ALU. Each repetition: 1. test Product0; if Product0 = 1, What goes here?; 3. shift the Product register right 1 bit; done after the 32nd repetition] 2004 Morgan Kaufmann Publishers 104
Floating Point (a brief look) • We need a way to represent – numbers with fractions, e.g., 3.1416 – very small numbers, e.g., .000000001 – very large numbers, e.g., 3.15576 × 10^9 • Representation: – sign, exponent, significand: (−1)^sign × significand × 2^exponent – more bits for significand gives more accuracy – more bits for exponent increases range • IEEE 754 floating point standard: – single precision: 8-bit exponent, 23-bit significand – double precision: 11-bit exponent, 52-bit significand 2004 Morgan Kaufmann Publishers 105
Representing Big (and Small) Numbers • What if we want to encode the approx. age of the earth? 4,600,000,000 or 4.6 × 10^9 or the weight in kg of one a.m.u. (atomic mass unit): 0.0000000000000000000000000166 or 1.6 × 10^−27 There is no way we can encode either of the above in a 32-bit integer. • Floating point representation: (−1)^sign × F × 2^E – Still have to fit everything in 32 bits (single precision): s (1 bit) | E (exponent, 8 bits) | F (fraction, 23 bits) – The base (2, not 10) is hardwired in the design of the FPALU – More bits in the fraction (F) or the exponent (E) is a trade-off between precision (accuracy of the number) and range (size of the number) 2004 Morgan Kaufmann Publishers 106
IEEE 754 floating-point standard • Leading "1" bit of significand is implicit • Exponent is "biased" to make sorting easier – all 0s is smallest exponent, all 1s is largest – bias of 127 for single precision and 1023 for double precision – summary: (−1)^sign × (1 + significand) × 2^(exponent − bias) • Example: – decimal: −.75 = −(½ + ¼) – binary: −.11 = −1.1 × 2^−1 – floating point: exponent = 126 = 01111110 – IEEE single precision: 1 01111110 10000000000000000000000 2004 Morgan Kaufmann Publishers 107
IEEE 754 FP Standard Encoding • Most (all?) computers these days conform to the IEEE 754 floating point standard: (−1)^sign × (1+F) × 2^(E−bias) – Formats for both single and double precision – F is stored in normalized form where the msb in the fraction is 1 (so there is no need to store it!)
– called the hidden bit – To simplify sorting FP numbers, E comes before F in the word and E is represented in excess (biased) notation • Encoding (single precision: E is 8 bits, F is 23 bits; double precision: E is 11 bits, F is 52 bits): – E = 0, F = 0: true zero (0) – E = 0, F nonzero: ± denormalized number – E = 1–254 (single) or 1–2046 (double), F = anything: ± floating point number – E = 255 (single) or 2047 (double), F = 0: ± infinity – E = 255 (single) or 2047 (double), F nonzero: not a number (NaN) 2004 Morgan Kaufmann Publishers 108
Floating Point Addition • Addition (and subtraction): (F1 × 2^E1) + (F2 × 2^E2) = F3 × 2^E3 – Step 0: Restore the hidden bit in F1 and in F2 – Step 1: Align fractions by right shifting F2 by E1 − E2 positions (assuming E1 ≥ E2), keeping track of (three of) the bits shifted out in a round bit, a guard bit, and a sticky bit – Step 2: Add the resulting F2 to F1 to form F3 – Step 3: Normalize F3 (so it is in the form 1.XXXXX …) • If F1 and F2 have the same sign, F3 ∈ [1,4): a 1-bit right shift of F3 and increment E3 • If F1 and F2 have different signs, F3 may require many left shifts, each time decrementing E3 – Step 4: Round F3 and possibly normalize F3 again – Step 5: Rehide the most significant bit of F3 before storing the result 2004 Morgan Kaufmann Publishers 109
Floating point addition [Figure: floating-point addition hardware — a small ALU computes the exponent difference; the smaller number is shifted right until its exponent matches the larger; a big ALU adds the significands; the sum is normalized (shift left or right, decrementing or incrementing the exponent), checked for overflow or underflow (exception), rounded, and renormalized if the result is still not normalized] 2004 Morgan Kaufmann Publishers 110
MIPS Floating Point Instructions • MIPS has a separate Floating Point Register File ($f0, $f1, …, $f31) (whose registers are used in pairs for double precision values) with special instructions to load to and store from them: lwc1 $f1,54($s2) #$f1 = Memory[$s2+54] swc1 $f1,58($s4) #Memory[$s4+58] = $f1 • And supports IEEE 754 single precision operations: add.s $f2,$f4,$f6 #$f2 = $f4 + $f6 and double precision operations: add.d $f2,$f4,$f6 #$f2||$f3 = $f4||$f5 + $f6||$f7 similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d 2004 Morgan Kaufmann Publishers 111
MIPS Floating Point Instructions, Con't • And floating point single precision comparison operations: c.x.s $f2,$f4 #if($f2 < $f4) cond=1; else cond=0 where x may be eq, neq, lt, le, gt, ge and branch operations: bc1t 25 #if(cond==1) go to PC+4+25 bc1f 25 #if(cond==0) go to PC+4+25 • And double precision comparison operations: c.x.d $f2,$f4 #if($f2||$f3 < $f4||$f5) cond=1; else cond=0 2004 Morgan Kaufmann Publishers 112
Floating Point Complexities • Operations are somewhat more complicated (see text) • In addition to overflow we can have "underflow" • Accuracy can be a big problem – IEEE 754 keeps two extra bits, guard and round – four rounding modes – positive divided by zero yields "infinity" – zero divided by zero yields "not a number" – other complexities • Implementing the standard can be tricky • Not using the standard can be even worse – see text for description of 80x86 and Pentium bug!
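To make the IEEE 754 field layout concrete, here is a minimal C sketch (assuming the usual 32-bit IEEE 754 single-precision float) that unpacks the sign, biased exponent, and fraction; the −0.75 example from the earlier slide should reproduce the encoding shown there:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = -0.75f;                  /* = -1.1two x 2^-1 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* reinterpret the float's bit pattern */

    uint32_t sign = bits >> 31;            /* 1 bit */
    uint32_t exp  = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    uint32_t frac = bits & 0x7FFFFF;       /* 23 bits; the hidden 1 is not stored */

    /* Expect: sign=1 exp=126 frac=0x400000 (the fraction .1two) */
    printf("sign=%u exp=%u frac=0x%06X\n", sign, exp, frac);
    printf("value = (-1)^%u x 1.frac x 2^(%u - 127)\n", sign, exp);
    return 0;
}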
2004 Morgan Kaufmann Publishers 113
Chapter Three Summary • Computer arithmetic is constrained by limited precision • Bit patterns have no inherent meaning but standards do exist – two's complement – IEEE 754 floating point • Computer instructions determine "meaning" of the bit patterns • Performance and accuracy are important so there are many complexities in real machines • Algorithm choice is important and may lead to hardware optimizations for both space and time (e.g., multiplication) • You may want to look back (Section 3.10 is great reading!) 2004 Morgan Kaufmann Publishers 114
Chapter 4 2004 Morgan Kaufmann Publishers 115
Performance • Measure, Report, and Summarize • Make intelligent choices • See through the marketing hype • Key to understanding underlying organizational motivation – Why is some hardware better than others for different programs? – What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) – How does the machine's instruction set affect performance? 2004 Morgan Kaufmann Publishers 116
Which of these airplanes has the best performance? • Airplane, passengers, range (mi), speed (mph): Boeing 737-100: 101, 630, 598; Boeing 747: 470, 4150, 610; BAC/Sud Concorde: 132, 4000, 1350; Douglas DC-8-50: 146, 8720, 544 • How much faster is the Concorde compared to the 747? • How much bigger is the 747 than the Douglas DC-8? 2004 Morgan Kaufmann Publishers 117
Computer Performance: TIME, TIME, TIME • Response Time (latency) — How long does it take for my job to run? — How long does it take to execute a job? — How long must I wait for the database query? • Throughput — How many jobs can the machine run at once? — What is the average execution rate? — How much work is getting done? • If we upgrade a machine with a new processor what do we increase? • If we add a new machine to the lab what do we increase? 2004 Morgan Kaufmann Publishers 118
Execution Time • Elapsed Time – counts everything (disk and memory accesses, I/O, etc.) – a useful number, but often not good for comparison purposes • CPU time – doesn't count I/O or time spent running other programs – can be broken up into system time and user time • Our focus: user CPU time – time spent executing the lines of code that are "in" our program 2004 Morgan Kaufmann Publishers 119
Book's Definition of Performance • For some program running on machine X, PerformanceX = 1 / Execution timeX • "X is n times faster than Y": PerformanceX / PerformanceY = n • Problem: – machine A runs a program in 20 seconds – machine B runs the same program in 25 seconds 2004 Morgan Kaufmann Publishers 120
Clock Cycles • Instead of reporting execution time in seconds, we often use cycles: seconds/program = cycles/program × seconds/cycle • Clock "ticks" indicate when to start activities (one abstraction) • cycle time = time between ticks = seconds per cycle • clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec) • A 4 GHz clock has a cycle time of 1/(4 × 10^9) seconds = 250 picoseconds (ps) 2004 Morgan Kaufmann Publishers 121
How to Improve Performance • seconds/program = cycles/program × seconds/cycle • So, to improve performance (everything else being equal) you can either (increase or decrease?) ________ the # of required cycles for a program, or ________ the clock cycle time or, said another way, ________ the clock rate. 2004 Morgan Kaufmann Publishers 122
How many cycles are required for a program? [Figure: timeline showing the 1st, 2nd, 3rd, … 6th instruction, one per clock cycle] • Could assume that the number of cycles equals the number of instructions • This assumption is incorrect; different instructions take different amounts of time on different machines. Why? hint: remember that these are machine instructions, not lines of C code 2004 Morgan Kaufmann Publishers 123
Different numbers of cycles for different instructions [Figure: timeline in which instructions occupy different numbers of cycles] • Multiplication takes more time than addition • Floating point operations take longer than integer ones • Accessing memory takes more time than accessing registers • Important point: changing the cycle time often changes the number of cycles required for various instructions (more later) 2004 Morgan Kaufmann Publishers 124
Example • Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target? • Don't panic — we can easily work this out from basic principles 2004 Morgan Kaufmann Publishers 125
Now that we understand cycles • A given program will require – some number of instructions (machine instructions) – some number of cycles – some number of seconds • We have a vocabulary that relates these quantities: – cycle time (seconds per cycle) – clock rate (cycles per second) – CPI (cycles per instruction): a floating point intensive application might have a higher CPI – MIPS (millions of instructions per second): this would be higher for a program using simple instructions 2004 Morgan Kaufmann Publishers 126
Performance • Performance is determined by execution time • Do any of the other variables equal performance? – # of cycles to execute program? – # of instructions in program? – # of cycles per second? – average # of cycles per instruction? – average # of instructions per second? • Common pitfall: thinking one of the variables is indicative of performance when it really isn't. 2004 Morgan Kaufmann Publishers 127
CPI Example • Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0; Machine B has a clock cycle time of 500 ps and a CPI of 1.2. Which machine is faster for this program, and by how much? • If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical? 2004 Morgan Kaufmann Publishers 128
# of Instructions Example • A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? What is the CPI for each sequence? 2004 Morgan Kaufmann Publishers 129
MIPS example • Two different compilers are being tested for a 4 GHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively).
Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. • Which sequence will be faster according to MIPS? • Which sequence will be faster according to execution time? (a worked sketch appears after the Experiment slide below) 2004 Morgan Kaufmann Publishers 130
Benchmarks • Performance best determined by running a real application – Use programs typical of expected workload – Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc. • Small benchmarks – nice for architects and designers – easy to standardize – can be abused • SPEC (System Performance Evaluation Cooperative) – companies have agreed on a set of real programs and inputs – valuable indicator of performance (and compiler technology) – can still be abused 2004 Morgan Kaufmann Publishers 131
Benchmark Games • An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of "cheating" on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola…came in a test known as SPECint92…Intel acknowledged that it had "optimized" its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel's problem is the practice of "tuning" compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code… Saturday, January 6, 1996 New York Times 2004 Morgan Kaufmann Publishers 132
SPEC '89 • Compiler "enhancements" and performance [Figure: SPEC performance ratio for each benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv) with the baseline compiler vs. the enhanced compiler — the enhanced compiler leaves most benchmarks nearly unchanged but inflates matrix300 to a ratio of roughly 800] 2004 Morgan Kaufmann Publishers 133
SPEC CPU2000 2004 Morgan Kaufmann Publishers 134
SPEC 2000 • Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance? [Figure: SPECint2000 and SPECfp2000 vs. clock rate (500–3500 MHz) for the Pentium III and Pentium 4 — performance grows more slowly than clock rate] [Figure: SPECINT2000 and SPECFP2000 for the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz under three power modes: always on/maximum clock, laptop mode/adaptive clock, minimum power/minimum clock] 2004 Morgan Kaufmann Publishers 135
Experiment • Phone a major computer retailer and tell them you are having trouble deciding between two different computers; specifically, you are confused about the processors' strengths and weaknesses (e.g., Pentium 4 at 2 GHz vs. Celeron M at 1.4 GHz) • What kind of response are you likely to get? • What kind of response could you give a friend with the same question?
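Picking up the two-compiler MIPS example from a few slides back, here is a quick C sketch of the arithmetic (the values come straight from the slides; native MIPS rating = instruction count / (execution time × 10^6)):

#include <stdio.h>

/* Classes A, B, C take 1, 2, and 3 cycles; the machine runs at 4 GHz. */
static void evaluate(const char *name, double a, double b, double c) {
    double clock_hz = 4e9;
    double instrs   = a + b + c;
    double cycles   = a * 1 + b * 2 + c * 3;
    double time_s   = cycles / clock_hz;
    double mips     = instrs / (time_s * 1e6);   /* native MIPS rating */
    printf("%s: time = %.2f ms, MIPS = %.0f\n", name, time_s * 1e3, mips);
}

int main(void) {
    evaluate("compiler 1", 5e6, 1e6, 1e6);   /* 2.50 ms, 2800 MIPS */
    evaluate("compiler 2", 10e6, 1e6, 1e6);  /* 3.75 ms, 3200 MIPS */
    return 0;
}

Compiler 2 earns the higher MIPS rating yet takes longer to run — exactly the pitfall the example is designed to expose.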
Amdahl's Law
• Execution time after improvement =
    execution time unaffected + (execution time affected / amount of improvement)
• Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster?
• Principle: make the common case fast

Example
• Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?
• We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?

Remember
• Performance is specific to a particular program(s)
  – Total execution time is a consistent summary of performance
• For a given architecture, performance increases come from:
  – increases in clock rate (without adverse CPI effects)
  – improvements in processor organization that lower CPI
  – compiler enhancements that lower CPI and/or instruction count
  – algorithm/language choices that affect instruction count
• Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance
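Not on the original slides: a small C sketch that works the Amdahl's Law examples above. The helper name amdahl is made up for illustration.

  #include <stdio.h>

  /* Amdahl's Law: new time = unaffected + affected / improvement */
  static double amdahl(double total, double affected, double k) {
      return (total - affected) + affected / k;
  }

  int main(void) {
      /* 100 s program, 80 s of multiply: improvement for 4x overall.
         Solve 100/4 = 20 + 80/n  =>  n = 80 / (25 - 20) = 16 */
      double n = 80.0 / (100.0 / 4.0 - 20.0);
      printf("multiply must get %gx faster for 4x overall\n", n);
      /* For 5x overall we would need 20 = 20 + 80/n: impossible,
         since the unaffected 20 s alone already uses the budget. */

      /* FP enhancement: 10 s run, 5 s of FP made 5x faster */
      printf("speedup = %.2f\n", 10.0 / amdahl(10.0, 5.0, 5.0));  /* 1.67 */
      return 0;
  }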
Performance Metrics
• Purchasing perspective: given a collection of machines, which has the
  – best performance?
  – least cost?
  – best cost/performance?
• Design perspective: faced with design options, which has the
  – best performance improvement?
  – least cost?
  – best cost/performance?
• Both require
  – a basis for comparison
  – a metric for evaluation
• Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Defining (Speed) Performance
• Normally interested in reducing
  – Response time (aka execution time): the time between the start and the completion of a task
    • Important to individual users
    • Thus, to maximize performance, we need to minimize execution time:
      performance_X = 1 / execution_time_X
      If X is n times faster than Y, then
      performance_X / performance_Y = execution_time_Y / execution_time_X = n
  – Throughput: the total amount of work done in a given time
    • Important to data center managers
    • Decreasing response time almost always improves throughput

Performance Factors
• Want to distinguish elapsed time and the time spent on our task
• CPU execution time (CPU time): the time the CPU spends working on a task
  – Does not include time waiting for I/O or running other programs
  CPU execution time for a program = # CPU clock cycles for a program x clock cycle time
  or
  CPU execution time for a program = # CPU clock cycles for a program / clock rate
• Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Review: Machine Clock Rate
• Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period): CC = 1 / CR
  10 nsec clock cycle  => 100 MHz clock rate
  5 nsec clock cycle   => 200 MHz clock rate
  2 nsec clock cycle   => 500 MHz clock rate
  1 nsec clock cycle   => 1 GHz clock rate
  500 psec clock cycle => 2 GHz clock rate
  250 psec clock cycle => 4 GHz clock rate
  200 psec clock cycle => 5 GHz clock rate

Clock Cycles per Instruction
• Not all instructions take the same amount of time to execute
  – One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction
  # CPU clock cycles for a program = # instructions for a program x average clock cycles per instruction
• Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
  – A way to compare two different implementations of the same ISA

  Instruction class   CPI for this instruction class
  A                   1
  B                   2
  C                   3

Effective CPI
• Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging:
  Overall effective CPI = Σ (CPI_i x IC_i), summed over i = 1 to n
  – where IC_i is the count (percentage) of the number of instructions of class i executed
  – CPI_i is the (average) number of clock cycles per instruction for that instruction class
  – n is the number of instruction classes
• The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation
• Our basic performance equation is then
  CPU time = instruction_count x CPI x clock_cycle
  or
  CPU time = instruction_count x CPI / clock_rate
• These equations separate the three key factors that affect performance
  – Can measure the CPU execution time by running the program
  – The clock rate is usually given
  – Can measure the overall instruction count by using profilers/simulators without knowing all of the implementation details
  – CPI varies by instruction type and ISA implementation, for which we must know the implementation details
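Not on the original slides: the performance equation applied to the earlier CPI example in a few lines of C. Since both machines run the same program on the same ISA, the instruction count cancels out of the comparison.

  #include <stdio.h>

  int main(void) {
      /* CPU time = instruction count x CPI x clock cycle time.
         Machine A: 250 ps cycle, CPI 2.0; machine B: 500 ps cycle, CPI 1.2 */
      double time_a = 2.0 * 250e-12;   /* CPI x cycle time, per instruction */
      double time_b = 1.2 * 500e-12;
      printf("A: %.0f ps/inst, B: %.0f ps/inst\n",
             time_a * 1e12, time_b * 1e12);          /* 500 vs. 600 */
      printf("A is %.2fx faster\n", time_b / time_a); /* 1.20 */
      return 0;
  }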
Determinants of CPU Performance
  CPU time = instruction_count x CPI x clock_cycle

                          Instruction_count   CPI   clock_cycle
  Algorithm               X                   X
  Programming language    X                   X
  Compiler                X                   X
  ISA                     X                   X     X
  Processor organization                      X     X
  Technology                                        X

A Simple Example
                       Freq x CPI_i
  Op      Freq  CPI_i  base   better cache  branch pred.  2x ALU
  ALU     50%   1      .5     .5            .5            .25
  Load    20%   5      1.0    .4            1.0           1.0
  Store   10%   3      .3     .3            .3            .3
  Branch  20%   2      .4     .4            .2            .4
  Σ                    2.2    1.6           2.0           1.95

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
  CPU time_new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the branch time?
  CPU time_new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
• What if two ALU instructions could be executed at once?
  CPU time_new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster

Comparing and Summarizing Performance
• How do we summarize the performance for a benchmark set with a single number?
  – The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):
    AM = (1/n) Σ Time_i, summed over i = 1 to n
  – where Time_i is the execution time for the ith program of a total of n programs in the workload
  – A smaller mean indicates a smaller average execution time and thus improved performance
• The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

SPEC Benchmarks (www.spec.org)

  Integer benchmarks                    FP benchmarks
  gzip     compression                  wupwise   quantum chromodynamics
  vpr      FPGA place & route           swim      shallow water model
  gcc      GNU C compiler               mgrid     multigrid solver in 3D fields
  mcf      combinatorial optimization   applu     parabolic/elliptic PDE
  crafty   chess program                mesa      3D graphics library
  parser   word processing program      galgel    computational fluid dynamics
  eon      computer visualization       art       image recognition (NN)
  perlbmk  perl application             equake    seismic wave propagation simulation
  gap      group theory interpreter     facerec   facial image recognition
  vortex   object oriented database     ammp      computational chemistry
  bzip2    compression                  lucas     primality testing
  twolf    circuit place & route        fma3d     finite-element crash simulation
                                        sixtrack  nuclear physics accelerator design
                                        apsi      pollutant distribution

Example SPEC Ratings
[Figure: example SPEC CPU2000 ratings]

Other Performance Metrics
• Power consumption, especially in the embedded market where battery life is important (and passive cooling)
  – For power-limited applications, the most important metric is energy efficiency

Summary: Evaluating ISAs
• Design-time metrics:
  – Can it be implemented, in how long, at what cost?
  – Can it be programmed? Ease of compilation?
• Static metrics:
  – How many bytes does the program occupy in memory?
• Dynamic metrics:
  – How many instructions are executed? How many bytes does the processor fetch to execute the program?
  – How many clocks are required per instruction?
  – How "lean" a clock is practical?
• Best metric: time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques.
[Figure: execution time as the product of Inst. Count, CPI, and Cycle Time]
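Not on the original slides: the "Simple Example" table above, recomputed in C. The arrays and names are made up for illustration; the frequencies and cycle counts are the slide's.

  #include <stdio.h>

  int main(void) {
      /* Effective CPI = sum over the mix of CPI_i x IC_i */
      double freq[4] = {0.50, 0.20, 0.10, 0.20};  /* ALU, load, store, branch */
      double cpi_base[4]   = {1.0, 5, 3, 2};
      double cpi_cache[4]  = {1.0, 2, 3, 2};      /* better data cache   */
      double cpi_branch[4] = {1.0, 5, 3, 1};      /* branch prediction   */
      double cpi_alu2[4]   = {0.5, 5, 3, 2};      /* two ALU ops at once */
      double *mix[4] = {cpi_base, cpi_cache, cpi_branch, cpi_alu2};
      const char *name[4] = {"base", "better cache", "branch pred.", "2x ALU"};
      for (int m = 0; m < 4; m++) {
          double cpi = 0;
          for (int i = 0; i < 4; i++) cpi += freq[i] * mix[m][i];
          printf("%-13s CPI = %.2f\n", name[m], cpi);  /* 2.2 1.6 2.0 1.95 */
      }
      return 0;
  }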
Chapter Five

Let's Build a Processor
• Almost ready to move into chapter 5 and start building a processor
• First, let's review Boolean logic and build the ALU we'll need (material from Appendix B)
[Figure: ALU symbol with 32-bit inputs a and b, an operation input, and a 32-bit result]

Review: Boolean Algebra & Gates
• Problem: consider a logic function with three inputs: A, B, and C.
  – Output D is true if at least one input is true
  – Output E is true if exactly two inputs are true
  – Output F is true only if all three inputs are true
• Show the truth table for these three functions.
• Show the Boolean equations for these three functions.
• Show an implementation consisting of inverters, AND, and OR gates.

An ALU (arithmetic logic unit)
• Let's build an ALU to support the andi and ori instructions
  – we'll just build a 1-bit ALU, and use 32 of them
[Figure: 1-bit ALU with inputs a and b, an operation select, and a result output]
• Possible implementation (sum-of-products)

Review: The Multiplexor
• Selects one of the inputs to be the output, based on a control input S
[Figure: 2-input mux with inputs A and B, select S, and output C]
  – note: we call this a 2-input mux even though it has 3 inputs!
• Let's build our ALU using a MUX:

Different Implementations
• Not easy to decide the "best" way to build something
  – Don't want too many inputs to a single gate
  – Don't want to have to go through too many gates
  – For our purposes, ease of comprehension is important
• Let's look at a 1-bit ALU for addition:
[Figure: 1-bit adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
  cout = a·b + a·cin + b·cin
  sum = a xor b xor cin
• How could we build a 1-bit ALU for add, and, and or?
• How could we build a 32-bit ALU?

Building a 32-bit ALU
[Figure: 32 one-bit ALUs (ALU0 through ALU31) chained through their carries; each takes a_i, b_i, and the Operation lines and produces Result_i]

What about subtraction (a – b)?
• Two's complement approach: just negate b and add.
• How do we negate? A very clever solution:
[Figure: 1-bit ALU with a Binvert line that selects b or its complement before the adder]

Adding a NOR function
• Can also choose to invert a. How do we get "a NOR b"?
[Figure: 1-bit ALU with both Ainvert and Binvert lines]

Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  – remember: slt is an arithmetic instruction
  – produces a 1 if rs < rt and 0 otherwise
  – use subtraction: (a – b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, $t7)
  – use subtraction: (a – b) = 0 implies a = b
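Not on the original slides: a C sketch of the 1-bit ALU just described, rippled 32 times. The function name alu1 and the op encoding (0 = AND, 1 = OR, 2 = ADD) are made up for illustration; Binvert plus an initial carry-in of 1 gives subtraction, as on the slide.

  #include <stdio.h>

  static unsigned alu1(unsigned a, unsigned b, unsigned cin,
                       int op, int binvert, unsigned *cout) {
      if (binvert) b = !b;                 /* the Binvert mux */
      switch (op) {
      case 0: *cout = 0; return a & b;
      case 1: *cout = 0; return a | b;
      default:                             /* the 1-bit adder from above */
          *cout = (a & b) | (a & cin) | (b & cin);
          return a ^ b ^ cin;
      }
  }

  int main(void) {   /* ripple 32 of them: Binvert + carry-in 1 => x - y */
      unsigned x = 7, y = 5, result = 0, c = 1;
      for (int i = 0; i < 32; i++) {
          unsigned bit = alu1((x >> i) & 1, (y >> i) & 1, c, 2, 1, &c);
          result |= bit << i;
      }
      printf("7 - 5 = %u\n", result);      /* 2 */
      return 0;
  }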
Supporting slt
• Can we figure out the idea?
[Figure: 1-bit ALU extended with a Less input and a 3-input result mux; the most significant bit's ALU also produces a Set output and overflow detection]
[Figure: 32-bit ALU for slt: the Set output of bit 31 feeds the Less input of bit 0; all other Less inputs are 0]

Test for equality
• Notice the control lines (Ainvert, Bnegate, Operation):
  0000 = and
  0001 = or
  0010 = add
  0110 = subtract
  0111 = slt
  1100 = NOR
• Note: Zero is a 1 when the result is zero!
[Figure: 32-bit ALU with a Zero output formed from all of the result bits]

Conclusion
• We can build an ALU to support the MIPS instruction set
  – key idea: use a multiplexor to select the output we want
  – we can efficiently perform subtraction using two's complement
  – we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
  – all of the gates are always working
  – the speed of a gate is affected by the number of inputs to the gate
  – the speed of a circuit is affected by the number of gates in series (on the "critical path" or the "deepest level of logic")
• Our primary focus is comprehension; however,
  – clever changes to organization can improve performance (similar to using better algorithms in software)
  – we saw this in multiplication; let's look at addition now

Problem: ripple carry adder is slow
• Is a 32-bit ALU as fast as a 1-bit ALU?
• Is there more than one way to do addition?
  – two extremes: ripple carry and sum-of-products
• Can you see the ripple? How could you get rid of it?
  c1 = b0·c0 + a0·c0 + a0·b0
  c2 = b1·c1 + a1·c1 + a1·b1
  c3 = b2·c2 + a2·c2 + a2·b2
  c4 = b3·c3 + a3·c3 + a3·b3
• Substituting each carry into the next, so that every c_i is written directly in terms of the a's, b's, and c0, is not feasible! Why?

Carry-lookahead adder
• An approach in between our two extremes
• Motivation:
  – If we didn't know the value of carry-in, what could we do?
  – When would we always generate a carry? g_i = a_i · b_i
  – When would we propagate the carry? p_i = a_i + b_i
• Did we get rid of the ripple?
  c1 = g0 + p0·c0
  c2 = g1 + p1·c1  (expand c2 in terms of c0)
  c3 = g2 + p2·c2  (expand c3 in terms of c0)
  c4 = g3 + p3·c3  (expand c4 in terms of c0)
• Feasible! Why?

Use principle to build bigger adders
[Figure: four 4-bit ALUs (Result0–3 through Result12–15) whose P_i and G_i outputs feed a carry-lookahead unit that produces C1 through C4]
• Can't build a 16-bit adder this way... (too big)
• Could use ripple carry of 4-bit CLA adders
• Better: use the CLA principle again!

ALU Summary
• We can build an ALU to support MIPS addition
• Our focus is on comprehension, not performance
• Real processors use more sophisticated techniques for arithmetic
• Where performance is not critical, hardware description languages allow designers to completely automate the creation of hardware!
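Not on the original slides: the expanded carry-lookahead equations written out in C for a 4-bit slice, assuming the generate/propagate definitions above. Every carry is two gate levels from the inputs; nothing ripples.

  #include <stdio.h>

  int main(void) {
      unsigned a = 0xB, b = 0x6, c0 = 0;    /* 1011 + 0110 */
      unsigned g[4], p[4];
      for (int i = 0; i < 4; i++) {
          g[i] = (a >> i) & (b >> i) & 1;   /* generate:  a_i AND b_i */
          p[i] = ((a | b) >> i) & 1;        /* propagate: a_i OR  b_i */
      }
      unsigned c1 = g[0] | (p[0] & c0);
      unsigned c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0);
      unsigned c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                         | (p[2] & p[1] & p[0] & c0);
      unsigned c4 = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                         | (p[3] & p[2] & p[1] & g[0])
                         | (p[3] & p[2] & p[1] & p[0] & c0);
      printf("c1=%u c2=%u c3=%u c4=%u\n", c1, c2, c3, c4);  /* 0 1 1 1 */
      return 0;
  }

The growing product terms are also why this is infeasible for the raw ripple equations across 32 bits, and why the slide instead reapplies the CLA principle hierarchically.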
Chapter Five

The Processor: Datapath & Control
• We're ready to look at an implementation of the MIPS
• Simplified to contain only:
  – memory-reference instructions: lw, sw
  – arithmetic-logical instructions: add, sub, and, or, slt
  – control flow instructions: beq, j
• Generic implementation:
  – use the program counter (PC) to supply the instruction address
  – get the instruction from memory
  – read registers
  – use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
  Why? memory-reference? arithmetic? control flow?

More Implementation Details
• Abstract / simplified view:
[Figure: PC feeds the instruction memory; register numbers from the instruction feed the register file; the ALU computes a result or an address; data memory is read or written]
• Two types of functional units:
  – elements that operate on data values (combinational)
  – elements that contain state (sequential)

State Elements
• Unclocked vs. clocked
• Clocks used in synchronous logic
  – when should an element that contains state be updated?
[Figure: clock waveform showing the clock period, rising edge, and falling edge]

An unclocked state element
• The set-reset latch
  – output depends on present inputs and also on past inputs
[Figure: cross-coupled gates with inputs R and S and complementary outputs Q]

Latches and Flip-flops
• Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
• Change of state (value) is based on the clock
• Latches: state changes whenever the inputs change and the clock is asserted
• Flip-flop: state changes only on a clock edge (edge-triggered methodology)
• "Logically true" could mean electrically low
• A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written

D-latch
• Two inputs:
  – the data value to be stored (D)
  – the clock signal (C) indicating when to read and store D
• Two outputs:
  – the value of the internal state (Q) and its complement
[Figure: D-latch gate diagram and timing waveform for C, D, and Q]

D flip-flop
• Output changes only on the clock edge
[Figure: two D-latches in series (master/slave) and the resulting timing waveform]

Our Implementation
• An edge-triggered methodology
• Typical execution:
  – read contents of some state elements
  – send values through some combinational logic
  – write results to one or more state elements
[Figure: state element 1 feeding combinational logic feeding state element 2, all within one clock cycle]

Register File
• Built using D flip-flops
[Figure: read ports implemented with multiplexors selecting among registers 0 through n–1, driven by the two read-register numbers]
• Do you understand? What is the "Mux" above?

Abstraction
• Make sure you understand the abstractions!
• Sometimes it is easy to think you do, when you don't
[Figure: a 32-bit-wide 2-to-1 mux drawn as 32 one-bit muxes sharing one select line]

Register File
• Note: we still use the real clock to determine when to write
[Figure: write port implemented with an n-to-2^n decoder that gates the write signal and clock into the selected register's C input]

Simple Implementation
• Include the functional units we need for each instruction
[Figure: instruction memory and program counter with an adder; data memory with MemRead/MemWrite; a 16-to-32-bit sign-extension unit; a register file with two read ports, one write port, and RegWrite; and the ALU with its 4-bit operation input and Zero output]
• Why do we need this stuff?
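Not on the original slides: the edge-triggered discipline in miniature, as C. All combinational work reads the old state; the "clock edge" is the single assignment at the end of the loop body, which is what lets a state element be read and written in the same cycle.

  #include <stdio.h>

  int main(void) {
      unsigned pc = 0;                  /* a state element */
      for (int cycle = 0; cycle < 3; cycle++) {
          unsigned next_pc = pc + 4;    /* combinational: the adder */
          /* ...more combinational logic would use the OLD pc here... */
          pc = next_pc;                 /* clock edge: state commits */
          printf("after cycle %d: PC = %u\n", cycle + 1, pc);
      }
      return 0;
  }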
Building the Datapath
• Use multiplexors to stitch them together
[Figure: single-cycle datapath joining the PC, instruction memory, register file, sign-extend unit, ALU, and data memory, with control points PCSrc, ALUSrc, MemtoReg, RegWrite, MemRead, and MemWrite]

Control
• Selecting the operations to perform (ALU, read/write, etc.)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example: add $8, $17, $18
  Instruction format:
  000000 10001 10010 01000 00000 100000
  op     rs    rt    rd    shamt funct
• ALU's operation is based on instruction type and function code

Control
• e.g., what should the ALU do with this instruction?
• Example: lw $1, 100($2)
  35  2   1   100
  op  rs  rt  16-bit offset
• ALU control input:
  0000 AND
  0001 OR
  0010 add
  0110 subtract
  0111 set-on-less-than
  1100 NOR
• Why is the code for subtract 0110 and not 0011?

Control
• Must describe hardware to compute the 4-bit ALU control input
  – given the instruction type (ALUOp, computed from instruction type):
    00 = lw, sw
    01 = beq
    10 = arithmetic
  – and the function code for arithmetic
• Describe it using a truth table (can turn into gates)

[Figure: single-cycle datapath with the main control unit; the opcode (instruction [31–26]) drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite, while the funct field (instruction [5–0]) drives the ALU control block]

  Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
  R-format     1       0       0         1         0        0         0       1       0
  lw           0       1       1         1         1        0         0       0       0
  sw           X       1       X         0         0        1         0       0       0
  beq          X       0       X         0         0        0         1       0       1

Control
• Simple combinational logic (truth tables)
[Figure: ALU control block with inputs ALUOp1, ALUOp0 and F3–F0, and the main control PLA mapping Op5–Op0 to the nine control outputs]

Our Simple Control Structure
• All of the logic is combinational
• We wait for everything to settle down, and the right thing to be done
  – the ALU might not produce the "right answer" right away
  – we use write signals along with the clock to determine when to write
• Cycle time is determined by the length of the longest path
[Figure: state element 1, combinational logic, state element 2 within one clock cycle]
  We are ignoring some details like setup and hold times

Single Cycle Implementation
• Calculate cycle time assuming negligible delays except:
  – memory (200 ps), ALU and adders (100 ps), register file access (50 ps)
[Figure: the single-cycle datapath again, with PCSrc, ALUSrc, MemtoReg, and the other control points]
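Not on the original slides: the two-level ALU control above as a C function, using the ALUOp and funct encodings from the tables.

  #include <stdio.h>

  /* ALUOp comes from the opcode class; for arithmetic (10) the
     funct field picks the operation. */
  static unsigned alu_control(unsigned aluop, unsigned funct) {
      if (aluop == 0) return 0x2;          /* lw/sw: add    */
      if (aluop == 1) return 0x6;          /* beq: subtract */
      switch (funct & 0x3F) {              /* arithmetic    */
      case 0x20: return 0x2;               /* add -> 0010   */
      case 0x22: return 0x6;               /* sub -> 0110   */
      case 0x24: return 0x0;               /* and -> 0000   */
      case 0x25: return 0x1;               /* or  -> 0001   */
      case 0x2A: return 0x7;               /* slt -> 0111   */
      default:   return 0xF;               /* undefined     */
      }
  }

  int main(void) {
      /* add $8, $17, $18: R-format, so ALUOp = 10, funct = 100000 */
      printf("ALU control = %X\n", alu_control(2, 0x20));   /* 2 */
      return 0;
  }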
Where we are headed
• Single-cycle problems:
  – what if we had a more complicated instruction like floating point?
  – wasteful of area
• One solution:
  – use a "smaller" cycle time
  – have different instructions take different numbers of cycles
  – a "multicycle" datapath:
[Figure: multicycle datapath with one memory for instructions and data, an instruction register, a memory data register, A and B registers, the ALU, and ALUOut]

Multicycle Approach
• We will be reusing functional units
  – the ALU is used to compute addresses and to increment the PC
  – memory is used for instructions and data
• Our control signals will not be determined directly by the instruction
  – e.g., what should the ALU do for a "subtract" instruction?
• We'll use a finite state machine for control

Multicycle Approach
• Break up the instructions into steps; each step takes a cycle
  – balance the amount of work to be done
  – restrict each cycle to use only one major functional unit
• At the end of a cycle
  – store values for use in later cycles (easiest thing to do)
  – introduce additional "internal" registers
[Figure: multicycle datapath showing the IorD, RegDst, and MemtoReg multiplexors, the sign-extend and shift-left-2 units, and the four-input ALUSrcB mux]

Instructions from ISA perspective
• Consider each instruction from the perspective of the ISA.
• Example:
  – The add instruction changes a register.
  – The register is specified by bits 15:11 of the instruction.
  – The instruction is specified by the PC.
  – The new value is the sum ("op") of two registers.
  – The registers are specified by bits 25:21 and 20:16 of the instruction:
    Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
  – In order to accomplish this we must break up the instruction (kind of like introducing variables when programming)

Breaking down an instruction
• ISA definition of arithmetic:
  Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
• Could break down to:
  – IR <= Memory[PC]
  – A <= Reg[IR[25:21]]
  – B <= Reg[IR[20:16]]
  – ALUOut <= A op B
  – Reg[IR[15:11]] <= ALUOut
• We forgot an important part of the definition of arithmetic!
  – PC <= PC + 4

Idea behind multicycle approach
• We define each instruction from the ISA perspective (do this!)
• Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)
• Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)
• Finally, try and pack as much work into each step as possible (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify the solution)
• Result: our book's multicycle implementation!

Five Execution Steps
• Instruction fetch
• Instruction decode and register fetch
• Execution, memory address computation, or branch completion
• Memory access or R-type instruction completion
• Write-back step
INSTRUCTIONS TAKE FROM 3 TO 5 CYCLES!

Step 1: Instruction Fetch
• Use the PC to get the instruction and put it in the instruction register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL ("Register-Transfer Language"):
  IR <= Memory[PC];
  PC <= PC + 4;
• Can we figure out the values of the control signals?
• What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch
• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:
  A <= Reg[IR[25:21]];
  B <= Reg[IR[20:16]];
  ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
• We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Step 3 (instruction dependent)
• The ALU is performing one of three functions, based on instruction type
• Memory reference: ALUOut <= A + sign-extend(IR[15:0]);
• R-type: ALUOut <= A op B;
• Branch: if (A == B) PC <= ALUOut;

Step 4 (R-type or memory access)
• Loads and stores access memory:
  MDR <= Memory[ALUOut];
  or
  Memory[ALUOut] <= B;
• R-type instructions finish:
  Reg[IR[15:11]] <= ALUOut;
  The write actually takes place at the end of the cycle on the edge

Write-back step
• Reg[IR[20:16]] <= MDR;
  Which instruction needs this?

Summary:
[Table: the actions taken in each of the five steps for R-type, memory-reference, branch, and jump instructions]

Simple Questions
• How many cycles will it take to execute this code?
  Label: lw  $t2, 0($t3)
         lw  $t3, 4($t3)
         beq $t2, $t3, Label   # assume not taken
         add $t5, $t2, $t3
         sw  $t5, 8($t3)
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?

[Figure: complete multicycle datapath with the control unit's outputs (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and the jump-address path built from instruction [25–0] shifted left 2 and PC[31–28]]

Review: finite state machines
• Finite state machines:
  – a set of states and
  – a next-state function (determined by current state and the input)
  – an output function (determined by current state and possibly input)
[Figure: next-state function and output function fed by the inputs and the current-state register]
  – We'll use a Moore machine (output based only on current state)

Review: finite state machines
• Example: B.37 A friend would like you to build an "electronic eye" for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light "moves" from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye's movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
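Not on the original slides: one possible tally for the "Simple Questions" above, in C, using the per-class cycle counts implied by the five steps (lw = 5, sw = 4, R-type = 4, beq = 3).

  #include <stdio.h>

  int main(void) {
      /* the sequence: lw, lw, beq (not taken), add, sw */
      int cycles[5] = {5, 5, 3, 4, 4};
      int total = 0;
      for (int i = 0; i < 5; i++) total += cycles[i];
      printf("total cycles = %d\n", total);   /* 21 */
      /* The 8th cycle is cycle 3 of the second lw: its memory address
         computation. The add's actual addition is its step 3, which is
         cycle 5 + 5 + 3 + 3 = 16. */
      return 0;
  }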
Implementing the Control
• The value of the control signals is dependent upon:
  – what instruction is being executed
  – which step is being performed
• Use the information we've accumulated to specify a finite state machine
  – specify the finite state machine graphically, or
  – use microprogramming
• The implementation can be derived from the specification

Graphical Specification of FSM
• Note:
  – don't care if not mentioned
  – asserted if name only
  – otherwise exact value
• How many state bits will we need?
[Figure: ten-state FSM; Start enters state 0, state 1 dispatches by opcode, and each path returns to state 0. Signals asserted per state:]
  State 0 (instruction fetch): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
  State 1 (instruction decode / register fetch): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
  State 2 (memory address computation): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
  State 3 (memory access, read): MemRead, IorD = 1
  State 4 (memory read completion): RegDst = 0, RegWrite, MemtoReg = 1
  State 5 (memory access, write): MemWrite, IorD = 1
  State 6 (execution): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
  State 7 (R-type completion): RegDst = 1, RegWrite, MemtoReg = 0
  State 8 (branch completion): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
  State 9 (jump completion): PCWrite, PCSource = 10

Finite State Machine for Control
• Implementation:
[Figure: control logic block with the opcode bits (Op5–Op0) and current-state bits (S3–S0) as inputs, the datapath control signals and next-state bits (NS3–NS0) as outputs, and a state register feeding back]

PLA Implementation
• If I picked a horizontal or vertical line, could you explain it?
[Figure: PLA with inputs Op5–Op0 and S3–S0 whose product terms drive PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and NS3–NS0]

ROM Implementation
• ROM = "Read Only Memory"
  – the values of the memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
  – if the address is m bits, we can address 2^m entries in the ROM
  – our outputs are the bits of data that the address points to

  address (m = 3)   data (n = 4)
  0 0 0             0 0 1 1
  0 0 1             1 1 0 0
  0 1 0             1 1 0 0
  0 1 1             1 0 0 0
  1 0 0             0 0 0 0
  1 0 1             0 0 0 1
  1 1 0             0 1 1 0
  1 1 1             0 1 1 1

  m is the "height", and n is the "width"

ROM Implementation
• How many inputs are there?
  6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
  16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits
• Rather wasteful, since for lots of the entries the outputs are the same; i.e., the opcode is often ignored (and a rather unusual size)

ROM vs. PLA
• Break up the table into two parts
  – 4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
  – 10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM
  – Total: 4.3K bits of ROM
• A PLA is much smaller
  – can share product terms
  – only needs entries that produce an active output
  – can take into account don't cares
• Size is (#inputs x #product-terms) + (#outputs x #product-terms)
  For this example: (10 x 17) + (20 x 17) = 510 PLA cells
• PLA cells are usually about the size of a ROM cell (slightly bigger)

Another Implementation Style
• Complex instructions: the "next state" is often current state + 1
[Figure: control unit built from a PLA or ROM whose outputs include the datapath signals and an AddrCtl field; the next state comes from an adder (state + 1), dispatch ROM 1, dispatch ROM 2, or 0, chosen by address select logic driven by AddrCtl and the opcode field of the instruction register]

Details

  Dispatch ROM 1                        Dispatch ROM 2
  Op      Opcode name  Value           Op      Opcode name  Value
  000000  R-format     0110           100011  lw           0011
  000010  jmp          1001           101011  sw           0101
  000100  beq          1000
  100011  lw           0010
  101011  sw           0010

  State number  Address-control action     Value of AddrCtl
  0             Use incremented state      3
  1             Use dispatch ROM 1         1
  2             Use dispatch ROM 2         2
  3             Use incremented state      3
  4             Replace state number by 0  0
  5             Replace state number by 0  0
  6             Use incremented state      3
  7             Replace state number by 0  0
  8             Replace state number by 0  0
  9             Replace state number by 0  0

Microprogramming
[Figure: control unit with a microcode memory, a microprogram counter, an adder, and address select logic in place of the FSM's next-state function]
• What are the "microinstructions"?

Microprogramming
• A specification methodology
  – appropriate if hundreds of opcodes, modes, cycles, etc.
  – signals specified symbolically using microinstructions

  Label     ALU control  SRC1  SRC2     Register control  Memory    PC write control  Sequencing
  Fetch     Add          PC    4                          Read PC   ALU               Seq
            Add          PC    Extshft  Read                                          Dispatch 1
  Mem1      Add          A     Extend                                                 Dispatch 2
  LW2                                                     Read ALU                    Seq
                                        Write MDR                                     Fetch
  SW2                                                     Write ALU                   Fetch
  Rformat1  Func code    A     B                                                      Seq
                                        Write ALU                                     Fetch
  BEQ1      Subt         A     B                          ALUOut-cond                 Fetch
  JUMP1                                                   Jump address                Fetch

• Will two implementations of the same architecture have the same microcode?
• What would a microassembler do?

Microinstruction format

  Field name        Value         Signals active                      Comment
  ALU control       Add           ALUOp = 00                          Cause the ALU to add.
                    Subt          ALUOp = 01                          Cause the ALU to subtract; this implements the compare for branches.
                    Func code     ALUOp = 10                          Use the instruction's function code to determine ALU control.
  SRC1              PC            ALUSrcA = 0                         Use the PC as the first ALU input.
                    A             ALUSrcA = 1                         Register A is the first ALU input.
  SRC2              B             ALUSrcB = 00                        Register B is the second ALU input.
                    4             ALUSrcB = 01                        Use 4 as the second ALU input.
                    Extend        ALUSrcB = 10                        Use the output of the sign extension unit as the second ALU input.
                    Extshft       ALUSrcB = 11                        Use the output of the shift-by-two unit as the second ALU input.
  Register control  Read                                              Read two registers using the rs and rt fields of the IR as the register numbers, putting the data into registers A and B.
                    Write ALU     RegWrite, RegDst = 1, MemtoReg = 0  Write a register using the rd field of the IR as the register number and the contents of ALUOut as the data.
                    Write MDR     RegWrite, RegDst = 0, MemtoReg = 1  Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
  Memory            Read PC       MemRead, IorD = 0                   Read memory using the PC as address; write the result into the IR (and the MDR).
                    Read ALU      MemRead, IorD = 1                   Read memory using ALUOut as address; write the result into the MDR.
                    Write ALU     MemWrite, IorD = 1                  Write memory using ALUOut as address, contents of B as the data.
  PC write control  ALU           PCSource = 00, PCWrite              Write the output of the ALU into the PC.
                    ALUOut-cond   PCSource = 01, PCWriteCond          If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
                    Jump address  PCSource = 10, PCWrite              Write the PC with the jump address from the instruction.
  Sequencing        Seq           AddrCtl = 11                        Choose the next microinstruction sequentially.
                    Fetch         AddrCtl = 00                        Go to the first microinstruction to begin a new instruction.
                    Dispatch 1    AddrCtl = 01                        Dispatch using ROM 1.
                    Dispatch 2    AddrCtl = 10                        Dispatch using ROM 2.
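Not on the original slides: a C sketch of the address select logic from the "Details" tables above. The function name next_state is made up; the dispatch ROM contents and AddrCtl values (converted to decimal) are the slide's.

  #include <stdio.h>

  static unsigned next_state(unsigned state, unsigned addrctl, unsigned op) {
      unsigned rom1 = 0, rom2 = 0;             /* dispatch ROM contents */
      switch (op) {
      case 0x00: rom1 = 6;          break;     /* R-format */
      case 0x02: rom1 = 9;          break;     /* jmp      */
      case 0x04: rom1 = 8;          break;     /* beq      */
      case 0x23: rom1 = 2; rom2 = 3; break;    /* lw       */
      case 0x2B: rom1 = 2; rom2 = 5; break;    /* sw       */
      }
      switch (addrctl) {
      case 0:  return 0;            /* replace state number by 0 */
      case 1:  return rom1;         /* dispatch ROM 1            */
      case 2:  return rom2;         /* dispatch ROM 2            */
      default: return state + 1;    /* use incremented state     */
      }
  }

  int main(void) {   /* walk an lw through the states: 0 1 2 3 4 0 */
      unsigned addrctl_by_state[10] = {3, 1, 2, 3, 0, 0, 3, 0, 0, 0};
      unsigned s = 0;
      for (int i = 0; i < 5; i++) {
          s = next_state(s, addrctl_by_state[s], 0x23);
          printf("-> %u ", s);
      }
      printf("\n");
      return 0;
  }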
Maximally vs. Minimally Encoded
• No encoding:
  – 1 bit for each datapath operation
  – faster, requires more memory (logic)
  – used for the VAX 780: an astonishing 400K of memory!
• Lots of encoding:
  – send the microinstructions through logic to get control signals
  – uses less memory, slower
• Historical context of CISC:
  – Too much logic to put on a single chip with everything else
  – Use a ROM (or even RAM) to hold the microcode
  – It's easy to add new instructions

Microcode: Trade-offs
• The distinction between specification and implementation is sometimes blurred
• Specification advantages:
  – Easy to design and write
  – Design architecture and microcode in parallel
• Implementation (off-chip ROM) advantages:
  – Easy to change since the values are in memory
  – Can emulate other architectures
  – Can make use of internal registers
• Implementation disadvantages: SLOWER, now that:
  – Control is implemented on the same chip as the processor
  – ROM is no longer faster than RAM
  – There is no need to go back and make changes

Historical Perspective
• In the '60s and '70s microprogramming was very important for implementing machines
• This led to more sophisticated ISAs and the VAX
• In the '80s RISC processors based on pipelining became popular
• Pipelining the microinstructions is also possible!
• Implementations of IA-32 architecture processors since the 486 use:
  – "hardwired control" for simpler instructions (few cycles, FSM control implemented using a PLA or random logic)
  – "microcoded control" for more complex instructions (large numbers of cycles, central control store)
• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

Pentium 4
• Pipelining is important (the last IA-32 without it was the 80386 in 1985)
[Figure: Pentium 4 die: I/O interface, instruction cache, data cache (Chapter 7), integer datapath, enhanced floating point and multimedia, advanced pipelining / hyperthreading support (Chapter 6), secondary cache and memory interface, and control blocks throughout]
• Pipelining is used for the simple instructions favored by compilers
  "Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions"

Pentium 4
• Somewhere in all that "control" we must handle complex instructions
• The processor executes simple microinstructions, 70 bits wide (hardwired)
• 120 control lines for the integer datapath (400 for floating point)
• If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)
• It's complicated!

Chapter 5 Summary
• If we understand the instructions… we can build a simple processor!
• If instructions take different amounts of time, multicycle is better
• Datapath implemented using:
  – Combinational logic for arithmetic
  – State-holding elements to remember bits
• Control implemented using:
  – Combinational logic for the single-cycle implementation
  – A finite state machine for the multicycle implementation

Chapter Six

Pipelining
• Improve performance by increasing instruction throughput
[Figure: three lw instructions executed sequentially, each taking 800 ps (instruction fetch, register read, ALU, data access, register write), vs. pipelined execution where a new instruction starts every 200 ps]
  Note: timing assumptions changed for this example
• Ideal speedup is the number of stages in the pipeline. Do we achieve this?

Pipelining
• What makes it easy
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• We'll build a simple pipeline and look at these issues
• We'll talk about modern processors and what really makes it hard:
  – exception handling
  – trying to improve performance with out-of-order execution, etc.
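Not on the original slides: a C sketch answering "do we achieve the ideal speedup?" with the timing numbers above (800 ps per unpipelined instruction, 200 ps per pipelined cycle, 5 stages).

  #include <stdio.h>

  int main(void) {
      for (int n = 3; n <= 1000000; n *= 100) {
          double seq  = 800.0 * n;                    /* one after another  */
          double pipe = 200.0 * (5 - 1) + 200.0 * n;  /* fill, then 1/cycle */
          printf("%7d instructions: speedup %.2f\n", n, seq / pipe);
      }
      /* Speedup approaches 800/200 = 4, not the stage count of 5,
         because the 800 ps of work was split into five 200 ps stages
         (the slowest stage, memory, sets the cycle time). */
      return 0;
  }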
Basic Idea
• Five stages: IF (instruction fetch), ID (instruction decode / register file read), EX (execute / address calculation), MEM (memory access), WB (write back)
[Figure: the single-cycle datapath divided into the five stages]
• What do we need to add to actually split the datapath into stages?

Pipelined Datapath
[Figure: the datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers inserted between the stages]
• Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

Corrected Datapath
[Figure: the same pipelined datapath with the write-register number carried forward through the pipeline registers to the register file's write port]

Graphically Representing Pipelines
[Figure: three lw instructions drawn across clock cycles CC1–CC7, each occupying IM, Reg, ALU, DM, and Reg in successive cycles]
• Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?
  – use this representation to help understand datapaths

Pipeline Control
[Figure: pipelined datapath with the control points PCSrc, RegWrite, Branch, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp, and RegDst; instruction [15–0] feeds the sign extend and ALU control, and instruction [20–16] and [15–11] feed the write-register mux]

Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction fetch and PC increment
  – Instruction decode / register fetch
  – Execution
  – Memory stage
  – Write back
• How would control be handled in an automobile plant?
  – a fancy control center telling everyone what to do?
  – should we use a finite state machine?
Pipeline Control
• Pass control signals along just like the data

  Instruction  Execution stage               Memory stage             Write-back stage
               RegDst ALUOp1 ALUOp0 ALUSrc   Branch MemRead MemWrite  RegWrite MemtoReg
  R-format     1      1      0      0        0      0       0         1        0
  lw           0      0      0      1        0      1       0         1        1
  sw           X      0      0      1        0      0       1         0        X
  beq          X      0      1      0        1      0       0         0        X

[Figure: the WB, M, and EX control fields produced in ID and carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers]

Datapath with Control
[Figure: the complete pipelined datapath with the control unit filling the ID/EX register's WB, M, and EX fields]

Dependencies
• Problem with starting the next instruction before the first is finished
  – dependencies that "go backward in time" are data hazards
[Figure: sub $2, $1, $3 followed by and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). The value of register $2 is 10 until cycle 5 and –20 afterwards, so the and and or read stale values]

Software Solution
• Have the compiler guarantee no hazards
• Where do we insert the "nops"?
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
• Problem: this really slows us down!

Forwarding
• Use temporary results; don't wait for them to be written
  – register file forwarding to handle read/write to the same register
  – ALU forwarding
[Figure: the same code with results forwarded from the EX/MEM and MEM/WB registers back to the ALU inputs; what if this $2 was $13?]

Forwarding
• The main idea (some details not shown)
[Figure: forwarding unit comparing EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs and Rt in ID/EX, steering the ForwardA and ForwardB muxes at the ALU inputs]
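Not on the original slides: the forwarding unit's comparisons as C, for the ForwardA mux (ForwardB is the same with Rt). The struct and function names are made up; the conditions are the ones the figure above implies.

  #include <stdio.h>

  struct pipe_regs {
      int exmem_regwrite, exmem_rd;   /* from the EX/MEM register */
      int memwb_regwrite, memwb_rd;   /* from the MEM/WB register */
      int idex_rs;                    /* source register in EX    */
  };

  static int forward_a(const struct pipe_regs *p) {
      if (p->exmem_regwrite && p->exmem_rd != 0 && p->exmem_rd == p->idex_rs)
          return 2;   /* 10: take the ALU result from EX/MEM */
      if (p->memwb_regwrite && p->memwb_rd != 0 && p->memwb_rd == p->idex_rs)
          return 1;   /* 01: take the value from MEM/WB      */
      return 0;       /* 00: use the register file value     */
  }

  int main(void) {
      /* sub $2,$1,$3 followed by and $12,$2,$5: $2 is in EX/MEM */
      struct pipe_regs p = {1, 2, 0, 0, 2};
      printf("ForwardA = %d\n", forward_a(&p));   /* 2 */
      return 0;
  }

The `!= 0` check matters: register $0 is hardwired to zero and must never be forwarded.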
Can't always forward
• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register
[Figure: lw $2, 20($1) followed by and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The and needs $2 before the lw's data access completes]
• Thus, we need a hazard detection unit to "stall" the load instruction

Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: the same sequence with the and held back one cycle; the bubble becomes a nop in the EX stage]

Hazard Detection Unit
• Stall by letting an instruction that won't write anything go forward
[Figure: hazard detection unit checking ID/EX.MemRead against IF/ID.RegisterRs and IF/ID.RegisterRt; on a hazard it holds the PC and IF/ID and zeroes the control signals entering ID/EX]

Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
[Figure: 40 beq $1, $3, 28 followed by 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; the branch target 72 lw $4, 50($7) enters the pipeline three cycles later]
• We are predicting "branch not taken"
  – need to add hardware for flushing instructions if we are wrong

Flushing Instructions
[Figure: pipeline with an IF.Flush signal that turns the fetched instruction into a nop; the branch comparison and target adder have been moved up into the ID stage]
• Note: we've also moved the branch decision to the ID stage

Branches
• If the branch is taken, we have a penalty of one cycle
• For our simple design, this is reasonable
• With deeper pipelines, the penalty increases and static branch prediction drastically hurts performance
• Solution: dynamic branch prediction
[Figure: a 2-bit prediction scheme with two "predict taken" and two "predict not taken" states; a prediction must be wrong twice before it flips]

Branch Prediction
• Sophisticated techniques:
  – A "branch target buffer" to help us look up the destination
  – Correlating predictors that base the prediction on global behavior and recently executed branches (e.g., the prediction for a specific branch instruction based on what happened in previous branches)
  – Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  – A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!

Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
• Dynamic pipeline scheduling
  – Hardware chooses which instructions to execute next
  – Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  – Speculates on branches and keeps the pipeline full (may need to roll back if a prediction is incorrect)
• Trying to exploit instruction-level parallelism

Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• "Superscalar" processors
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
• All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
• This class has given you the background you need to learn more!

Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
[Figure: designs from single-cycle (Section 5.4) through multicycle (5.5), pipelined, deeply pipelined, and multiple issue with deep pipelines (6.9, 6.10), placed by instruction latency vs. instructions per clock (IPC = 1/CPI)]

Chapter Seven

Memories: Review
• SRAM:
  – value is stored on a pair of inverting gates
  – very fast but takes up more space than DRAM (4 to 6 transistors)
• DRAM:
  – value is stored as a charge on a capacitor (must be refreshed)
  – very small but slower than SRAM (factor of 5 to 10)
[Figure: SRAM cell (cross-coupled inverters) and DRAM cell (word line, pass transistor, capacitor, bit line)]

Exploiting Memory Hierarchy
• Users want large and fast memories! (2004 figures:)
  – SRAM access times are .5 – 5 ns at a cost of $4000 to $10,000 per GB
  – DRAM access times are 50 – 70 ns at a cost of $100 to $200 per GB
  – Disk access times are 5 to 20 million ns at a cost of $.50 to $2 per GB
• Try and give it to them anyway: build a memory hierarchy
[Figure: CPU at the top, with levels 1, 2, … n of the memory hierarchy below; access time increases with distance from the CPU, and the size of the memory grows at each level]

Locality
• A principle that makes having a memory hierarchy a good idea
• If an item is referenced:
  – temporal locality: it will tend to be referenced again soon
  – spatial locality: nearby items will tend to be referenced soon
  Why does code have locality?
• Our initial focus: two levels (upper, lower)
  – block: minimum unit of data
  – hit: data requested is in the upper level
  – miss: data requested is not in the upper level

Cache
• Two issues:
  – How do we know if a data item is in the cache?
  – If it is, how do we find it?
• Our first example:
  – block size is one word of data
  – "direct mapped"
  For each item of data at the lower level, there is exactly one location in the cache where it might be; e.g., lots of items at the lower level share locations in the upper level

Direct Mapped Cache
• Mapping: address is modulo the number of blocks in the cache
[Figure: eight-block cache (indices 000 through 111); memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 map to the cache blocks given by their low-order 3 bits]

Direct Mapped Cache
• For MIPS:
[Figure: 32-bit address split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset; a 1024-entry cache of valid bit, tag, and data; hit when the entry is valid and the tags match]
• What kind of locality are we taking advantage of?
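Not on the original slides: the field extraction for the direct-mapped cache just shown, as C. The address value is arbitrary.

  #include <stdio.h>

  int main(void) {
      /* 1024 one-word blocks: 2-bit byte offset, 10-bit index, 20-bit tag */
      unsigned addr     = 0x1234ABCD;
      unsigned byte_off = addr & 0x3;
      unsigned index    = (addr >> 2) & 0x3FF;   /* block address mod 1024 */
      unsigned tag      = addr >> 12;
      printf("tag=0x%X index=%u offset=%u\n", tag, index, byte_off);
      return 0;
  }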
Direct Mapped Cache
• Taking advantage of spatial locality:
[Figure: direct-mapped cache with multiword blocks: an 18-bit tag (bits 31–14), an 8-bit index (bits 13–6) selecting one of 256 entries of 512 data bits each, and a block offset that picks one 32-bit word out of the block through a mux]

Hits vs. Misses
• Read hits
  – this is what we want!
• Read misses
  – stall the CPU, fetch the block from memory, deliver it to the cache, restart
• Write hits:
  – can replace data in cache and memory (write-through)
  – write the data only into the cache (write-back the cache later)
• Write misses:
  – read the entire block into the cache, then write the word

Hardware Issues
• Make reading multiple words easier by using banks of memory
[Figure: (a) one-word-wide memory organization, (b) wide memory organization with a multiplexor between cache and CPU, (c) interleaved memory organization with four memory banks on one bus]
• It can get a lot more complicated...

Performance
• Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0 to 40%) vs. block size (4 to 256 bytes) for 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB caches]
• Use split caches because there is more spatial locality in code:

  Program  Block size  Instruction  Data       Effective combined
           in words    miss rate    miss rate  miss rate
  gcc      1           6.1%         2.1%       5.4%
  gcc      4           2.0%         1.7%       1.9%
  spice    1           1.2%         1.3%       1.2%
  spice    4           0.3%         0.6%       0.4%

Performance
• Simplified model:
  execution time = (execution cycles + stall cycles) x cycle time
  stall cycles = # of instructions x miss ratio x miss penalty
• Two ways of improving performance:
  – decreasing the miss ratio
  – decreasing the miss penalty
  What happens if we increase block size?
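Not on the original slides: the simplified stall-cycle model above in C. The instruction count, miss rate, penalty, and clock rate here are made-up illustrative values, not the slides'.

  #include <stdio.h>

  int main(void) {
      double insts = 1e9, base_cpi = 1.0, cycle = 0.25e-9;  /* 4 GHz, assumed */
      double miss_per_inst = 0.02, penalty = 100;           /* assumed        */
      double exec  = insts * base_cpi;
      double stall = insts * miss_per_inst * penalty;
      printf("time = %.3f s (%.0f%% of cycles are stalls)\n",
             (exec + stall) * cycle, 100 * stall / (exec + stall));
      return 0;
  }

Even a 2% miss-per-instruction rate with a 100-cycle penalty makes stalls dominate, which is why the next slides attack both the miss ratio and the miss penalty.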
Decreasing miss ratio with associativity
[Figure: an eight-block cache organized as one-way set associative (direct mapped), two-way set associative (4 sets), four-way set associative (2 sets), and eight-way set associative (fully associative)]
• Compared to direct mapped, give a series of references that:
  – results in a lower miss ratio using a 2-way set associative cache
  – results in a higher miss ratio using a 2-way set associative cache
  assuming we use the "least recently used" replacement strategy

An implementation
[Figure: four-way set associative cache: a 22-bit tag and an 8-bit index selecting one of 256 sets; four comparators and a 4-to-1 multiplexor produce the hit signal and the data]

Performance
[Figure: miss rate (0 to 15%) vs. associativity (one-way to eight-way) for cache sizes from 1 KB to 128 KB]

Decreasing miss penalty with multilevel caches
• Add a second-level cache:
  – often the primary cache is on the same chip as the processor
  – use SRAMs to add another cache above primary memory (DRAM)
  – the miss penalty goes down if the data is in the 2nd-level cache
• Example:
  – CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access
  – Adding a 2nd-level cache with a 5 ns access time decreases the miss rate to .5%
• Using multilevel caches:
  – try and optimize the hit time on the 1st-level cache
  – try and optimize the miss rate on the 2nd-level cache
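Not on the original slides: the multilevel-cache example above worked in C, under the usual reading that every L1 miss pays the L2 access and the misses that also miss in L2 pay the DRAM access.

  #include <stdio.h>

  int main(void) {
      double cycle_ns = 0.2;                      /* 1 / 5 GHz    */
      double dram = 100.0 / cycle_ns;             /* 500 cycles   */
      double cpi_l1 = 1.0 + 0.05 * dram;          /* 26.0         */

      double l2 = 5.0 / cycle_ns;                 /* 25 cycles    */
      double cpi_l2 = 1.0 + 0.05 * l2 + 0.005 * dram;
      printf("CPI %.1f -> %.2f, speedup %.1fx\n",
             cpi_l1, cpi_l2, cpi_l1 / cpi_l2);    /* 26 -> 4.75, ~5.5x */
      return 0;
  }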
Cache Complexities

• It is not always easy to understand the implications of caches:

[Figure: radix sort vs. quicksort on 4K to 4096K items — the theoretical behavior (instructions per item) favors radix sort, but the observed behavior (clock cycles per item) shows radix sort falling badly behind quicksort at large sizes.]

Cache Complexities

• Here is why:

[Figure: cache misses per item for radix sort vs. quicksort on 4K to 4096K items — radix sort's miss count climbs steeply once the data outgrows the cache, while quicksort's stays low.]

• Memory system performance is often the critical factor:
  – multilevel caches and pipelined processors make it harder to predict outcomes
  – compiler optimizations that increase locality sometimes hurt ILP
• It is difficult to predict the best algorithm: you need experimental data.

Virtual Memory

• Main memory can act as a cache for secondary storage (disk).

[Figure: virtual addresses map through address translation to physical addresses; some virtual pages map instead to disk addresses.]

• Advantages:
  – the illusion of having more physical memory
  – program relocation
  – protection

Pages: virtual memory blocks

• Page faults: the data is not in memory, so retrieve it from disk
  – huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  – reducing page faults is important (LRU is worth the price)
  – faults can be handled in software instead of hardware
  – write-through is too expensive, so we use write-back

[Figure: translation of a 32-bit virtual address with 4 KB pages — the 20-bit virtual page number (bits 31–12) is translated to an 18-bit physical page number, while the 12-bit page offset (bits 11–0) passes through unchanged.]

Page Tables

[Figure: a page table maps each virtual page number to either a physical page in memory (valid bit = 1) or a disk address (valid bit = 0).]

Page Tables

[Figure: the page table register points to the page table in memory; the 20-bit virtual page number indexes the table, and the entry supplies a valid bit and an 18-bit physical page number. If the valid bit is 0, the page is not present in memory.]

Making Address Translation Fast

• A cache for address translations: the translation-lookaside buffer (TLB).

[Figure: each TLB entry holds valid, dirty, and reference bits, a tag (virtual page number), and a physical page address; on a TLB miss, the page table supplies either the translation or a disk address.]

• Typical TLB values: 16–512 entries, miss rate 0.01%–1%, miss penalty 10–100 cycles.

TLBs and caches

[Flowchart: a memory access first probes the TLB; a TLB miss raises an exception, while a TLB hit yields the physical address. A read then tries the cache, stalling while the block is read on a miss. A write first checks the write-access bit — if it is off, a write-protection exception is raised; otherwise the data is written into the cache, the dirty bit is updated, and the data and address go into the write buffer.]

TLBs and Caches

[Figure: a combined TLB and cache datapath — the 20-bit virtual page number is looked up in a fully associative TLB (parallel tag comparators), and the resulting physical address is split into an 18-bit cache tag, an 8-bit cache index, a 4-bit block offset, and a 2-bit byte offset for the cache lookup.]
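To make the preceding slides concrete, here is a minimal software model of the translation path: a small fully associative TLB backed by a single-level page table with 4 KB pages. The structure names, the toy table size, and the crude refill policy are illustrative assumptions, not details from the slides.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_BITS 12                /* 4 KB pages, as in the slides     */
    #define TLB_SIZE  16                /* within the 16-512 range quoted   */
    #define NUM_PAGES 256               /* toy page-table size (assumption) */

    typedef struct { bool valid; uint32_t vpn, ppn; } tlb_entry_t;
    typedef struct { bool valid; uint32_t ppn; }      pte_t;

    static tlb_entry_t tlb[TLB_SIZE];   /* fully associative, starts empty  */

    /* Probe the TLB first; on a miss, walk the single-level page table and
     * refill one TLB entry. A cleared valid bit in the page table means a
     * page fault, which (as the slides note) is handled in software — here
     * we simply report it to the caller.                                   */
    static bool translate(const pte_t *pt, uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn = vaddr >> PAGE_BITS;
        uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);

        for (int i = 0; i < TLB_SIZE; i++)           /* TLB hit?            */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *paddr = (tlb[i].ppn << PAGE_BITS) | off;
                return true;
            }

        if (!pt[vpn].valid)                          /* miss + page fault   */
            return false;

        tlb[vpn % TLB_SIZE] =                        /* crude refill policy */
            (tlb_entry_t){ .valid = true, .vpn = vpn, .ppn = pt[vpn].ppn };
        *paddr = (pt[vpn].ppn << PAGE_BITS) | off;
        return true;
    }

    int main(void) {
        static pte_t pt[NUM_PAGES];
        pt[5] = (pte_t){ .valid = true, .ppn = 42 };  /* page 5 -> frame 42 */

        uint32_t pa, va = (5u << PAGE_BITS) | 0x123;
        if (translate(pt, va, &pa))     /* a second call would hit the TLB  */
            printf("va 0x%x -> pa 0x%x\n", va, pa);
        else
            printf("page fault\n");
        return 0;
    }

Note how the page offset bypasses translation entirely — only the upper 20 bits go through the TLB and page table, which is what lets real hardware overlap the TLB lookup with the start of the cache access.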
Modern Systems

[Table: memory-hierarchy characteristics of some modern systems — details not recoverable from these notes.]

• Things are getting complicated!

Some Issues

• Processor speeds continue to increase very fast — much faster than either DRAM or disk access times.

[Figure: processor vs. memory performance on a log scale over two decades — CPU performance grows far faster than memory performance, opening a widening gap.]

• Design challenge: dealing with this growing disparity
  – Prefetching? 3rd-level caches and more? Memory design?

Chapters 8 & 9 (partial coverage)

Interfacing Processors and Peripherals

• I/O design is affected by many factors (expandability, resilience)
• Performance depends on:
  — access latency
  — throughput
  — the connection between the devices and the system
  — the memory hierarchy
  — the operating system
• A variety of different users (e.g., banks, supercomputers, engineers)

[Figure: a typical organization — the processor and cache sit on a memory–I/O bus with main memory and several I/O controllers; interrupts flow to the processor, and the controllers drive disks, graphics output, and a network.]

I/O

• Important but neglected:
  "The difficulties in assessing and designing I/O systems have often relegated I/O to second class status"
  "courses in every aspect of computing, from programming to computer architecture often ignore I/O or give it scanty coverage"
  "textbooks leave the subject to near the end, making it easier for students and instructors to skip it!"
• GUILTY!
  — we won't be looking at I/O in much detail
  — be sure to read Chapter 8 in its entirety
  — you should probably take a networking class!

I/O Devices

• Very diverse devices, differing in:
  — behavior (i.e., input vs. output)
  — partner (who is at the other end?)
  — data rate

I/O Example: Disk Drives

[Figure: a disk drive — a stack of platters, each divided into concentric tracks, each track divided into sectors.]

• To access data (a small calculation sketch follows the Pentium 4 slide below):
  — seek: position the head over the proper track (3 to 14 ms on average)
  — rotational latency: wait for the desired sector to rotate under the head (half a rotation on average, i.e., 0.5/RPM minutes)
  — transfer: read the data (one or more sectors) at 30 to 80 MB/sec

I/O Example: Buses

• Shared communication link (one or more wires)
• Difficult to design:
  — may become a bottleneck
  — length of the bus
  — number of devices
  — tradeoffs (buffers give higher bandwidth but increase latency)
  — support for many different devices
  — cost
• Types of buses:
  — processor–memory (short, high speed, custom design)
  — backplane (high speed, often standardized, e.g., PCI)
  — I/O (lengthy, many different devices, e.g., USB, FireWire)
• Synchronous vs. asynchronous:
  — synchronous buses use a clock and a fixed protocol: fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
  — asynchronous buses don't use a clock; they coordinate with a handshaking protocol instead

I/O Bus Standards

• Today we have two dominant I/O bus standards:

[Table comparing the two standards — not recoverable from these notes.]

Other important issues

• Bus arbitration:
  — daisy-chain arbitration (not very fair)
  — centralized arbitration (requires an arbiter), e.g., PCI
  — collision detection, e.g., Ethernet
• Operating system involvement:
  — polling
  — interrupts
  — direct memory access (DMA)
• Performance analysis techniques:
  — queuing theory
  — simulation
  — analysis, i.e., find the weakest link (see "I/O System Design")
• Many new developments

Pentium 4

• I/O options:

[Figure: Pentium 4 I/O organization. The processor connects over the 800 MHz system bus (6.4 GB/sec) to the 82875P memory controller hub (north bridge), which links two channels of DDR 400 main memory DIMMs (3.2 GB/sec each), AGP 8X graphics output (2.1 GB/sec), and CSA 1 Gbit Ethernet (0.266 GB/sec). A 266 MB/sec hub link joins it to the 82801EB I/O controller hub (south bridge), which drives two Serial ATA disks (150 MB/sec each), Parallel ATA (100 MB/sec) CD/DVD and tape (20 MB/sec), AC/97 surround-sound audio (1 MB/sec), USB 2.0 ports (60 MB/sec), 10/100 Mbit Ethernet, and the PCI bus (132 MB/sec).]
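As promised on the disk-drive slide, plugging numbers into the seek + rotation + transfer recipe. This is a sketch: the 7200 RPM, 6 ms seek, 4 KB request, and 50 MB/sec transfer rate are assumptions chosen from the ranges quoted above.

    #include <stdio.h>

    int main(void) {
        /* Average disk access time = seek + rotational latency + transfer. */
        double seek_ms = 6.0;                           /* in the 3-14 ms range */
        double rpm     = 7200.0;
        double rot_ms  = 0.5 / (rpm / 60.0) * 1000.0;   /* half a rotation      */
        double xfer_ms = 4096.0 / 50e6 * 1000.0;        /* 4 KB at 50 MB/sec    */

        printf("seek %.2f ms + rotation %.2f ms + transfer %.3f ms = %.2f ms\n",
               seek_ms, rot_ms, xfer_ms, seek_ms + rot_ms + xfer_ms);
        return 0;
    }

The result (roughly 6 + 4.17 + 0.08 ≈ 10.2 ms) shows why mechanical positioning, not the transfer itself, dominates small random accesses.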
Fallacies and Pitfalls

• Fallacy: the rated mean time to failure of disks is 1,200,000 hours, so disks practically never fail.
• Fallacy: magnetic disk storage is on its last legs and will soon be replaced.
• Fallacy: a 100 MB/sec bus can actually transfer 100 MB/sec.
• Pitfall: moving functions from the CPU to the I/O processor and expecting to improve performance without analysis.

Multiprocessors

• Idea: create powerful computers by connecting many smaller ones
  – good news: it works for timesharing (better than a supercomputer)
  – bad news: it's really hard to write good concurrent programs
  – many commercial failures

[Figure: two organizations — processors with private caches sharing a single bus and a single memory, vs. processor/cache/memory nodes connected by a network.]

Questions

• How do parallel processors share data?
  — a single address space (SMP vs. NUMA)
  — message passing
• How do parallel processors coordinate? (a minimal lock sketch appears before the Concluding Remarks)
  — synchronization (locks, semaphores)
  — coordination built into send/receive primitives
  — operating system protocols
• How are they implemented?
  — connected by a single bus
  — connected by a network

Supercomputers

[Figure: composition of the Top 500 supercomputer sites, 1993–2000 — uniprocessors and single-instruction-multiple-data (SIMD) machines fade away, while shared-memory multiprocessors (SMPs), massively parallel processors (MPPs), and especially clusters (networks of workstations and of SMPs) take over.]

Using multiple processors: an old idea

• Some SIMD designs:

[Table of early SIMD designs — not recoverable from these notes.]

• Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in 1972, despite completion of only ¼ of the machine. It took three more years before it was operational!
  "For better or worse, computer architects are not easily discouraged"
• Lots of interesting designs and ideas, lots of failures, few successes

Topologies

[Figure: interconnection network topologies — a 2-D grid/mesh of 16 nodes, an n-cube of 8 nodes (8 = 2³, so n = 3), a crossbar, and an omega network.]

Clusters

• Constructed from whole computers
• Independent, scalable networks
• Strengths:
  – many applications are amenable to loosely coupled machines
  – exploit local area networks
  – cost effective / easy to expand
• Weaknesses:
  – administration costs are not necessarily lower
  – connected using the I/O bus
• Highly available due to the separation of memories
• In theory, we should be able to do better

Google

• Serves an average of 1,000 queries per second
• Google uses 6,000 processors and 12,000 disks
• Two sites in Silicon Valley, two in Virginia
• Each site connects to the Internet via OC48 (2488 Mbit/sec)
• Reliability:
  – on an average day, 20 machines need to be rebooted (software errors)
  – 2% of the machines are replaced each year
• In some sense: simple ideas, well executed. Better (and cheaper) than other approaches involving increased complexity.
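As promised under the Questions slide, a minimal sketch of lock-based coordination in a single-address-space (shared-memory) machine: several threads update one shared counter, and a POSIX mutex serializes the updates. The thread and iteration counts are arbitrary illustration values; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITERS   1000000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < NITERS; i++) {
            pthread_mutex_lock(&lock);    /* acquire: one updater at a time */
            counter++;
            pthread_mutex_unlock(&lock);  /* release                        */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expect %d)\n", counter, NTHREADS * NITERS);
        return 0;
    }

Without the mutex, the read-modify-write races and the final count comes up short — a small taste of why the slides call good concurrent programs "really hard to write."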
Concluding Remarks

• Evolution vs. Revolution
  "More often the expense of innovation comes from being too disruptive to computer users"
  "Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers."