Embedded Systems in Silicon (TD5102)
Other Architectures

Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore, 2005/2006


Introduction

Design alternatives:
  - provide more powerful operations
      - goal is to reduce the number of instructions executed
      - danger is a slower cycle time and/or a higher CPI
  - provide even simpler operations
      - to reduce code size / interpreter complexity

Sometimes referred to as “RISC vs. CISC”
  - virtually all new instruction sets since 1982 have been RISC
  - VAX: minimize code size, make assembly language easy;
    instructions from 1 to 54 bytes long!

We'll look at IA-32 and the Java Virtual Machine

ACA 2003


Topics

  - Recap of MIPS architecture
  - Why RISC?
  - Other architecture styles
      - Accumulator architecture
      - Stack architecture
      - Memory-Memory architecture
      - Register architectures
  - Examples
      - 80x86
      - Pentium Pro, II, III, 4
      - JVM


Recap of MIPS

  - RISC architecture
  - Register space
  - Addressing
  - Instruction format
  - Pipelining


Why RISC?

Keep it simple

RISC characteristics:
  - Reduced number of instructions
  - Limited addressing modes
      - load-store architecture
      - enables pipelining
  - Large register set
      - uniform (no distinction between e.g. address and data registers)
  - Limited number of instruction sizes (preferably one)
      - know directly where the following instruction starts
  - Limited number of instruction formats
  - Memory alignment restrictions
  - ......

Based on quantitative analysis
  - "the famous MIPS one percent rule": don't even think about it when
    it's not used more than one percent


Register space

32 integer (and 32 floating point) registers of 32-bit

  Name      Register number   Usage
  $zero     0                 the constant value 0
  $v0-$v1   2-3               values for results and expression evaluation
  $a0-$a3   4-7               arguments
  $t0-$t7   8-15              temporaries
  $s0-$s7   16-23             saved (by callee)
  $t8-$t9   24-25             more temporaries
  $gp       28                global pointer
  $sp       29                stack pointer
  $fp       30                frame pointer
  $ra       31                return address


Addressing

  1. Immediate addressing (op rs rt immediate): the operand is the
     immediate field
  2. Register addressing (op rs rt rd ... funct): the operand is a register
  3. Base addressing (op rs rt address): memory address = register +
     address field; accesses a byte, halfword, or word
  4. PC-relative addressing (op rs rt address): memory address = PC +
     address field; accesses a word
  5. Pseudodirect addressing (op address): memory address is formed from
     the address field and the PC; accesses a word


Instruction format

  R   op | rs | rt | rd | shamt | funct
  I   op | rs | rt | 16 bit address
  J   op | 26 bit address

Example instructions

  Instruction          Meaning
  add  $s1,$s2,$s3     $s1 = $s2 + $s3
  addi $s2,$s3,4       $s2 = $s3 + 4
  lw   $s1,100($s2)    $s1 = Memory[$s2+100]
  bne  $s4,$s5,L       if $s4<>$s5 goto L
  j    Label           goto Label


Pipelining

All integer instructions fit into the following pipeline
(time runs left to right; each instruction starts one cycle after the
previous one):

  IF  ID  EX  MEM WB
      IF  ID  EX  MEM WB
          IF  ID  EX  MEM WB
              IF  ID  EX  MEM WB
                  IF  ID  EX  MEM WB


Other architecture styles

  - Accumulator architecture
  - Stack
  - Register (load store)
  - Register-Memory
  - Memory-Memory


Accumulator architecture

[Figure: datapath with accumulator, latches, ALU, registers, address, and Memory]

Example code: a = b+c;

  load b;      // accumulator is implicit operand
  add c;
  store a;


Stack architecture

[Figure: datapath with latches, stack, ALU, and Memory addressed by the stack pointer]

Example code: a = b+c;

  push b;      stack: b
  push c;      stack: c | b
  add;         stack: b+c
  pop a;       stack: (empty)


Other architecture styles

Let's look at the code for C = A + B

  Stack          Accumulator    Register-      Memory-        Register
  Architecture   Architecture   Memory         Memory         (load-store)
  Push A         Load A         Load r1,A      Add C,B,A      Load r1,A
  Push B         Add B          Add r1,B                      Load r2,B
  Add            Store C        Store C,r1                    Add r3,r1,r2
  Pop C                                                       Store C,r3

Q: What are the advantages / disadvantages of load-store (RISC) architecture?


Other architecture styles

  - Accumulator architecture
      - one operand (in register or memory), accumulator almost always
        implicitly used
  - Stack
      - zero operand: all operands implicit (on TOS)
  - Register (load store)
      - three operands, all in registers
      - loads and stores are the only instructions accessing memory
        (i.e. with a memory (indirect) addressing mode)
  - Register-Memory
      - two operands, one in memory
  - Memory-Memory
      - three operands, may be all in memory

(there are more varieties / combinations)


Examples

  - 80x86: extended accumulator
  - Pentium x (IA-32): extended accumulator
  - JVM: stack


A dominant architecture: x86/IA-32

A bit of history:
  - 1978: The Intel 8086 is announced (16 bit architecture)
  - 1980: The 8087 floating point coprocessor is added
  - 1981: IBM PC was launched, equipped with the Intel 8088
  - 1982: The 80286 increases address space to 24 bits + new instructions
  - 1985: The 80386 extends to 32 bits, new addressing modes
  - 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions
    (mostly designed for higher performance)
  - 1997: MMX is added
  - 2000: Pentium 4; very deeply pipelined; extends SIMD instructions
  - 2002: Hyper-Threading

“This history illustrates the impact of the “golden handcuffs” of compatibility”
“adding new features as someone might add clothing to a packed bag”
“an architecture that is difficult to explain and impossible to love”


IA-32 Overview

Complexity:
  - Instructions from 1 to 17 bytes long
  - two-address instructions: one operand must act as both a source and
    destination
        ADD EAX,EBX    ; EAX = EAX+EBX
  - one operand can come from memory
  - complex addressing modes,
    e.g., “base or scaled index with 8 or 32 bit displacement”

Saving grace:
  - the most frequently used instructions are not too difficult to build
  - compilers avoid the portions of the architecture that are slow

“what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective”


80x86 (IA-32) registers

  General purpose registers (the low 16 bits of each 32-bit register are
  addressable as a 16-bit register, which splits into two 8-bit halves):

    32 bits   16 bits   8 bits (high)   8 bits (low)
    EAX       AX        AH              AL
    EBX       BX        BH              BL
    ECX       CX        CH              CL
    EDX       DX        DH              DL

  Index registers:     ESI, EDI
  Pointer registers:   EBP, ESP
  Segment registers:   CS, SS, DS, ES, FS, GS
  PC:                  EIP
  Condition codes (among others): EFLAGS


IA-32 Addressing Modes

Addressing modes: where are the operands?

  Immediate
      MOV EAX,10         ; EAX = 10
  Direct
      MOV EAX,I          ; EAX = Mem[&I]
      I DW 3
  Register
      MOV EAX,EBX        ; EAX = EBX
  Register indirect
      MOV EAX,[EBX]      ; EAX = Memory[EBX]
  Based with 8- or 32-bit displacement
      MOV EAX,[EBX+8]    ; EAX = Mem[EBX+8]
  Based with scaled index (scale = 0 .. 3)
      MOV EAX,ECX[EBX]   ; EAX = Mem[EBX + 2^scale * ECX]
  Based plus scaled index with 8- or 32-bit displacement
      MOV EAX,ECX[EBX+8]


IA-32 Addressing Modes

  - Not all modes apply to all instructions
      - one of the operands must be a register
  - Not all registers can be used in all modes
  - Why? Simply not enough bits in the instruction


Control: condition codes

Many instructions set condition codes in the EFLAGS register

Some condition codes:
  - sign: set if the result of an operation was negative
  - zero: set if the result was zero
  - carry: set if the operation had a carry out
  - overflow: set if the operation caused an overflow
  - parity: set when the result had even parity

Subsequent conditional branch instructions test condition codes to
determine if they should jump or not


Control

Special instruction: compare

  CMP SRC1,SRC2    ; set cc's based on SRC1-SRC2

Example: for (i=0; i<10; i++) a[i]++;

         MOV EAX,0     ; EAX = i = 0
  _L:    CMP EAX,10    ; if (i<10)
         JNL _EXIT     ; jump to _EXIT if i>=10
         INC [EBX]     ; Mem[EBX](=a[i])++
         ADD EBX,4     ; EBX = &a[i+1]
         INC EAX       ; EAX++
         JMP _L        ; goto _L
  _EXIT: ...


Control

Peculiar control instruction

  LOOP _LABEL      ; decrease ECX, if (ECX!=0) goto _LABEL

Previous example rewritten:

         MOV ECX,10
  _L:    INC [EBX]
         ADD EBX,4
         LOOP _L

Fewer instructions, but LOOP is slow


Procedures/functions

Instructions:

  CALL AProcedure  ; push return address on stack and goto AProcedure
  RET              ; pop return address from stack and jump to it

EBP is used as a frame pointer which points to a fixed location within
the stack frame (to access locals)
ESP is used as stack pointer

Special instructions:

  PUSH EAX         ; ESP -= 4, Mem[ESP] = EAX
  POP EAX          ; EAX = Mem[ESP], ESP += 4


IA-32 Machine Language

IA-32 instruction formats:

  Field    prefix   opcode   mode   sib   displ   imm
  Bytes    0-5      1-2      0-1    0-1   0-4     0-4

  opcode byte:  opcode (6 bits) | source operand (1 bit) | byte/word (1 bit)
  mode byte:    mod (2 bits) | reg (3 bits) | r/m (3 bits)
  sib byte:     scale (2 bits) | index (3 bits) | base (3 bits)

  mod   meaning
  00    memory
  01    memory+d8
  10    memory+d16/d32
  11    register


Pentium, Pentium Pro, II, III, 4

Issue rate:
  - Pentium: 2 way issue, in-order
  - Pentium Pro .. 4: 3 way issue, out-of-order
      - IA-32 operations are translated into µops (by hardware)

Pipeline:
  - Pentium: 5 stage pipeline
  - Pentium Pro, II, III: 10 stage pipeline
  - Pentium 4: 20 stage pipeline

Extra SIMD instructions:
  - MMX (multi-media extensions), SSE/SSE-2 (streaming SIMD extensions)


Die example: Pentium 4

[Figure: Pentium 4 die]


Pentium 4 chip area breakdown

[Figure: Pentium 4 chip area breakdown]


Pentium 4

  - Trace cache
  - Hyper threading
  - Add with ½ cycle throughput (1 ½ cycle latency), in three ½-cycle steps:
      1. add least signif. 16 bits
      2. add most signif. 16 bits (forwarding carry)
      3. calculate flags


P4 slides from Doug Carmean, Intel


Pentium® 4 Processor Block Diagram

[Figure: block diagram. Blocks shown: System Interface (3.2 GB/s), L2 Cache
and Control, BTB & I-TLB, Decoder, Trace Cache, BTB, uCode ROM,
Rename/Alloc, uop Queues, Schedulers, Integer RF, FP RF, Store AGU,
Load AGU, four ALUs, FP move, FP store, FMul, FAdd, MMX, SSE,
L1 D-Cache and D-TLB]


P4 vs P II, PIII

Basic P6 Pipeline (intro at 733MHz, .18µ):

  Stage   1      2      3       4       5       6       7       8        9         10
          Fetch  Fetch  Decode  Decode  Decode  Rename  ROB Rd  Rdy/Sch  Dispatch  Exec

Basic Pentium® 4 Processor Pipeline (intro at 1.4GHz, .18µ):

  Stages 1-2     TC Nxt IP
  Stages 3-4     TC Fetch
  Stage  5       Drive
  Stage  6       Alloc
  Stages 7-8     Rename
  Stage  9       Que
  Stages 10-12   Sch
  Stages 13-14   Disp
  Stages 15-16   RF
  Stage  17      Ex
  Stage  18      Flgs
  Stage  19      Br Ck
  Stage  20      Drive


Example with Higher IPC and Faster Clock!

Code sequence: Ld, Add, Add, Ld, Add, Add

  Processor                       Clocks   Time     IPC
  P6 @1GHz                        10       10ns     0.6
  Pentium® 4 Processor @1.4GHz    6        4.3ns    1.0


The Execution Trace Cache

[Figure: the Pentium® 4 Processor block diagram again]


Execution Trace Cache

  - Advanced L1 instruction cache
      - Caches “decoded” IA-32 instructions (uops)
  - Removes decoder pipeline latency
  - Capacity is ~12K uOps
  - Integrates branches into single line
      - Follows predicted path of program execution

Execution Trace Cache feeds fast engine


Execution Trace Cache

Code as laid out in memory:

        1  cmp
        2  br -> T1
        ..  ... (unused code)
  T1:   3  sub
        4  br -> T2
        ..  ... (unused code)
  T2:   5  mov
        6  sub
        7  br -> T3
        ..  ... (unused code)
  T3:   8  add
        9  sub
        10 mul
        11 cmp
        12 br -> T4

Trace Cache Delivery:

  1 cmp | 2 br T1 | 3 T1: sub | 4 br T2 | 5 mov | 6 sub
  7 br T3 | 8 T3: add | 9 sub | 10 mul | 11 cmp | 12 br T4


Multi/Hyper-threading in Uniprocessor Architectures

[Figure: issue slots versus clock cycles for three designs: Superscalar,
Concurrent Multithreading, and Simultaneous Multithreading
(Hyperthreading). Legend: Empty Slot, Thread 1, Thread 2, Thread 3,
Thread 4]


JVM: Java Virtual Machine

  - Make JAVA code run everywhere
  - Use virtual architecture
  - Platform (processor) independent

  Java program -> Java compiler -> Java bytecode -> JVM (interpreter)

JVM = stack architecture


Stack Architecture

JVM follows stack model of execution
  - operands are pushed onto stack from memory and popped off stack to
    memory
  - operations take operands from stack and place result on stack

Example (not real Java bytecode): a = b+c;

  push b       ; stack: b
  push c       ; stack: c | b
  add          ; stack: b+c
  pop a        ; stack: (empty)


JVM Architecture

For each method invocation, the JVM creates a stack frame consisting of
  - Local variable frame: parameters and local variables,
    numbered 0, 1, 2, …
  - Operand stack: stack used for evaluating expressions

  static void add3(int x, int y, int z){   // x: local var 0, y: local var 1, z: local var 2
    int r = x+y+z;                         // r: local var 3
    System.out.println(r);
  }


Some JVM instructions

  - iload_n: push local variable n onto the stack
  - iconst_n: push constant n onto the stack (n=-1,0,...,5)
  - bipush imm8: push byte onto stack
  - sipush imm16: push short onto stack
  - istore_n: pop word from stack into local variable n
  - iadd, isub, ineg, imul, idiv, irem: usual arithmetic operations
  - if_icmpXX offset16 (XX can be eq, ne, lt, gt, le, ge):
        pop TOS into a
        pop TOS into b
        if (b XX a) PC = PC + offset16
  - goto offset16: PC = PC + offset16


Example 1

Translate the following expression to Java bytecode:

  v = 3*(x/y - 2/(u+y))

assume x is local var 0, y local var 1, u local var 3, v local var 4

                 ; Stack
  iconst_3       ; 3
  iload_0        ; x | 3
  iload_1        ; y | x | 3
  idiv           ; x/y | 3
  iconst_2       ; 2 | x/y | 3
  iload_3        ; u | 2 | x/y | 3
  iload_1        ; y | u | 2 | x/y | 3
  iadd           ; u+y | 2 | x/y | 3
  idiv           ; 2/(u+y) | x/y | 3
  isub           ; x/y - 2/(u+y) | 3
  imul           ; 3*(x/y - 2/(u+y))
  istore 4       ; v = 3*(x/y - 2/(u+y))

(The one-byte short forms iload_n and istore_n exist only for n = 0..3,
so local var 4 needs the general form istore 4.)


Example 2

Translate the following Java code to Java bytecode:

  if (x < 2) x = 0;

assume x is local var 0

                       ; Stack
  iload_0              ; x
  iconst_2             ; 2 | x
  if_icmpge endif      ; if (x>=2) goto endif
  iconst_0             ; 0
  istore_0             ; x = 0, stack empty
endif: ...
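
The two bytecode examples can be checked by hand, but it may help to run them. Below is a minimal sketch of a stack interpreter in Python for just the instructions listed under "Some JVM instructions". It is an illustration, not how a real JVM is built. Assumptions made here: branch targets are symbolic labels (a "name:" entry in the program list) instead of 16-bit offsets, 32-bit overflow is not modelled, and the names run, trunc_div, EXAMPLE_1, and EXAMPLE_2 are invented for this sketch.

```python
import operator

# if_icmpXX branches when (b XX a); a is popped first (the old top of stack).
COMPARE = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
           "gt": operator.gt, "le": operator.le, "ge": operator.ge}


def trunc_div(b, a):
    """Divide b by a, truncating toward zero as Java's idiv does."""
    q = abs(b) // abs(a)
    return q if (b < 0) == (a < 0) else -q


# Binary operations; each receives (b, a) with a popped first.
ARITH = {
    "iadd": lambda b, a: b + a,
    "isub": lambda b, a: b - a,
    "imul": lambda b, a: b * a,
    "idiv": trunc_div,
    "irem": lambda b, a: b - a * trunc_div(b, a),
}


def run(program, local_vars):
    """Execute a list of instruction strings and return the final locals."""
    instrs, labels = [], {}
    for line in program:
        if line.endswith(":"):
            labels[line[:-1]] = len(instrs)  # label marks the next instruction
        else:
            instrs.append(line.split())

    local_vars, stack, pc = list(local_vars), [], 0
    while pc < len(instrs):
        op, *args = instrs[pc]
        pc += 1
        if op in ARITH:
            a = stack.pop()
            b = stack.pop()
            stack.append(ARITH[op](b, a))
        elif op == "ineg":
            stack.append(-stack.pop())
        elif op in ("bipush", "sipush"):
            stack.append(int(args[0]))
        elif op.startswith("if_icmp"):
            a = stack.pop()
            b = stack.pop()
            if COMPARE[op[len("if_icmp"):]](b, a):
                pc = labels[args[0]]
        elif op == "goto":
            pc = labels[args[0]]
        else:
            # iconst_n / iload_n / istore_n, or the long form "istore n".
            name, _, suffix = op.partition("_")
            if name not in ("iconst", "iload", "istore"):
                raise ValueError(f"unsupported instruction: {op}")
            operand = suffix or args[0]
            n = -1 if operand == "m1" else int(operand)
            if name == "iconst":
                stack.append(n)
            elif name == "iload":
                stack.append(local_vars[n])
            else:
                local_vars[n] = stack.pop()
    return local_vars


# Example 1: v = 3*(x/y - 2/(u+y)); x, y, u, v are locals 0, 1, 3, 4.
# Real bytecode has istore_0..istore_3 only, so local 4 uses "istore 4".
EXAMPLE_1 = ["iconst_3", "iload_0", "iload_1", "idiv", "iconst_2", "iload_3",
             "iload_1", "iadd", "idiv", "isub", "imul", "istore 4"]

# Example 2: if (x < 2) x = 0; x is local 0.
EXAMPLE_2 = ["iload_0", "iconst_2", "if_icmpge endif",
             "iconst_0", "istore_0", "endif:"]

if __name__ == "__main__":
    print(run(EXAMPLE_1, [20, 4, 0, -3, 0]))
    print(run(EXAMPLE_2, [1]))
```

With x = 20, y = 4, u = -3, Example 1 leaves 9 in local var 4, since 3*(20/4 - 2/(-3+4)) = 3*(5 - 2). The operand order in ARITH matters: the value popped first is the right-hand operand, which is why isub computes b - a rather than a - b.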