Lesson 5: Processor Design
Topic 1 – Methods and Concepts
EE37E 2005

Introduction

References:
– Modern Processor Design (pp. 1–16)
– Computer Organization and Design (pp. 54–89)

• While introducing this topic we will focus on these points:
  – Evolution of microprocessors
  – Instruction set processor design
  – Principles
• Microprocessors are instruction set processors (ISPs).
• An ISP executes instructions from a predefined instruction set.
• A microprocessor's functionality is fully characterized by the instruction set it is capable of executing.
• This predefined instruction set is also called the instruction set architecture (ISA).
• An ISA serves as an interface between software and hardware.
• In terms of processor design methodology, the ISA is the specification of the design, while the microprocessor (the ISP) is the implementation of that design.

Computer System Components

[Figure: block diagram of a typical computer system. Recoverable annotations are listed below.]
• CPU: 1000 MHz – 3 GHz (a multiple of the system bus speed); pipelined (7–21 stages); superscalar (max ~4 instructions/cycle); single-threaded; dynamically scheduled or VLIW; dynamic and static branch prediction.
• Caches: L1, L2, L3.
• System bus examples: Alpha, AMD K7: EV6, 400 MHz; Intel PII, PIII: GTL+, 133 MHz; Intel P4: 800 MHz. Support for one or more CPUs.
• Memory (reached through the memory controller and memory bus):
  – SDRAM PC100/PC133: 100–133 MHz, 64–128 bits wide, 2-way interleaved, ~900 MB/s
  – Double Data Rate (DDR) SDRAM PC3200: 400 MHz (effective 200 × 2), 64–128 bits wide, 4-way interleaved, ~3.2 GB/s (second half of 2002)
  – Rambus DRAM (RDRAM) PC800, PC1066: 400–533 MHz (DDR), 16–32 bits wide channel, ~1.6–3.2 GB/s per channel
• I/O buses (reached through controllers and adapters), example: PCI, 33–66 MHz; PCI-X, 133 MHz; 32–64 bits wide; 133–1024 MB/s.
• I/O devices: disks, displays, keyboards, NICs, networks (Fast Ethernet, Gigabit Ethernet, ATM, Token Ring, ...).
• Bridges: North Bridge, South Bridge.
(The North Bridge and South Bridge together form the chipset.)

Computer System Components: Enhanced CPU Performance & Capabilities

[Figure: the same system block diagram (CPU with SMT/CMP, L1/L2/L3 caches, system bus, memory controller, memory bus, North Bridge/South Bridge chipset, I/O buses, and I/O devices: disks (RAID), displays, keyboards, networks, NICs), annotated with the enhancements below.]

• Memory latency reduction: conventional and block-based trace cache.
• Support for simultaneous multithreading (SMT): Alpha EV8.
• VLIW and intelligent compiler techniques: Intel/HP EPIC IA-64.
• More advanced branch prediction techniques.
• Chip multiprocessors (CMPs): the Hydra Project, IBM Power4 and Power5.
• Vector processing capability: Vector Intelligent RAM (VIRAM), or multimedia ISA extensions.
• Digital signal processing (DSP) capability in the system.
• Reconfigurable computing hardware capability in the system.
• Integrating the memory controller and a portion of main memory with the CPU: Intelligent RAM.
• Integrated memory controller: AMD Opteron, IBM Power5.

Recent Trends in Computer Design

• The cost/performance ratio of computing systems has declined steadily due to advances in:
  – Integrated circuit technology (decreasing feature size):
    • Clock rate improves roughly in proportion to the improvement in feature size.
    • Number of transistors improves in proportion to the square of that improvement (or faster).
  – Architectural improvements in CPU design.
• Microprocessor systems directly reflect IC improvement, with a yearly 35 to 55% improvement in performance.
• Assembly language programming has been mostly replaced by higher-level alternatives such as C or C++.
• Standard operating systems (UNIX, NT) lowered the cost of introducing new architectures.
• Emergence of RISC architectures and RISC-core architectures.
• Adoption of quantitative approaches to computer design based on empirical performance observations.
Microprocessor Architecture Trends

[Figure: evolution tree of microprocessor architectures. Recoverable node labels are listed below.]
• CISC machines: instructions take variable times to complete.
• RISC machines (microcode): simple instructions, optimized for speed.
• RISC machines (pipelined): same individual instruction latency; greater throughput through instruction "overlap".
• Superscalar processors: multiple instructions executing simultaneously.
• Multithreaded processors: additional HW resources (regs, PC, SP); each context gets the processor for x cycles.
• VLIW: "superinstructions" grouped together; decreased HW control complexity.
• CMPs (single-chip multiprocessors): duplicate entire processors (tech soon due to Moore's Law).
• Simultaneous multithreading (SMT): multiple HW contexts (regs, PC, SP); each cycle, any context may execute.
• SMT/CMPs (e.g. IBM Power5 in 2004).

Evolution of microprocessors

[Figure 1: Evolution of microprocessors. Log-scale plot of transistor count (1,000 to 100,000,000) against year (1970 to 2000), following Moore's Law through the i4004, i8080, i8086, i80286, i80386, i80486 and Pentium. "Graduation window" data points: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million. CMOS improvements: die size 2× every 3 years; line width halves every 4–7 years.]

• Three decades of the history of microprocessors tell a truly remarkable story of advances in the computer industry (Table 1).

                      1970–1980      1980–1990     1990–2000         2000–2010
Transistor count      2K – 100K      100K – 1M     1M – 100M         100M – 2B
Clock frequency       0.1 – 3 MHz    3 – 30 MHz    30 MHz – 1 GHz    1 – 15 GHz
Instructions/cycle    0.1            0.1 – 0.9     0.9 – 1.9         1.9 – 2.9

Table 1. The amazing decades of the evolution of microprocessors

Hierarchy of Computer Architecture

[Figure: Hierarchy of Computer Architecture, software side. Layers: application, operating system, compiler, firmware. Representations: high-level language programs, assembly language programs, machine language program. The software/hardware boundary sits at the instruction set processor.]
[Figure: Hierarchy of Computer Architecture, hardware side. At the instruction set architecture level: instruction set processor and I/O system. Below it: datapath & control, digital design, circuit design, layout. Representations: microprogram, register transfer notation (RTN), logic diagrams, circuit diagrams.]

Instruction Set Processor Design

• Critical to an ISP is the instruction set architecture, which specifies the functionality that must be implemented by the instruction set processor (ISP).

The Design Process

• "To Design Is To Represent"
  – Design activity yields a description/representation of an object
    • The traditional craftsman does not distinguish between the conceptualization and the artifact
    • Separation comes about because of complexity
    • The concept is captured in one or more representation languages
  – This process IS design
• Design Begins With Requirements
  – Functional capabilities: what it will do
  – Performance characteristics: speed, power, area, cost, ...

Design Process (cont.)

• Design Finishes As Assembly
  – Design understood in terms of components and how they have been assembled
  – Top-down decomposition of complex functions (behaviors) into more primitive functions
  – Bottom-up composition of primitive building blocks into more complex assemblies

[Figure: decomposition tree. CPU → Datapath, Control; Datapath → ALU, Regs, Shifter; down to the NAND gate.]

Design is a "creative process," not a simple method.

Design as Search

[Figure: search tree. Problem A → Strategy 1, Strategy 2; strategies → SubProb 1, SubProb 2, SubProb 3; sub-problems → building blocks BB1, BB2, BB3, ..., BBn.]

Design involves educated guesses and verification:
– Given the goals, how should these be prioritized?
– Given alternative design pieces, which should be selected?
– Given the design space of components & assemblies, which part will yield the best solution?

Feasible (good) choices vs. optimal choices

Instruction Set Architecture (subset of Computer Architecture)

“...
the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” – Amdahl, Blaauw, and Brooks, 1964

• Organization of programmable storage
• Data types & data structures: encodings & representations
• Instruction set
• Instruction formats
• Modes of addressing and accessing data items and instructions
• Exceptional conditions

The Instruction Set: a Critical Interface

[Figure 2: ISA. The instruction set sits between software (above) and hardware (below).]

Dynamic-Static Interface

• We have discussed two critical roles played by the ISA:
  – Contract between software and hardware, which facilitates the development of programs and machines
  – Specification for microprocessor design
• The third role is an associated definition of an interface that separates what is done statically at compile time from what is done dynamically at run time.
This interface is called the “dynamic-static interface” (DSI).

[Figure 3: The dynamic-static feature. A program (software) passes through the compiler to the machine (hardware); the architecture forms the DSI. Above the DSI: compiler complexity, exposed to software, "static". Below the DSI: hardware complexity, hidden in hardware, "dynamic".]

Computer Architecture Topics

[Figure: map of computer architecture topics. Recoverable labels are listed below.]
• Input/output and storage: disks, WORM, tape; RAID
• Emerging technologies; interleaving; bus protocols
• DRAM
• Memory hierarchy: L2 cache, L1 cache; coherence, bandwidth, latency
• VLSI
• Instruction set architecture: addressing, protection, exception handling
• Pipelining and instruction-level parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP

Principles of Processor Performance

Definitions

• Performance is in units of things per second; bigger is better.
• If we are primarily concerned with response time:

$$\text{Performance}(X) = \frac{1}{\text{Execution\_time}(X)}$$

• "X is n times faster than Y" means:

$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$$

Cycles Per Instruction

IC = instruction count
CPI = clock cycles per instruction

$$\text{CPU time} = \text{Number of clock cycles} \times \text{Clock cycle time} = \frac{\text{Number of clock cycles}}{\text{Clock frequency}}$$

$$\text{CPI} = \frac{\text{Number of clock cycles}}{\text{IC}}$$

$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}}$$

$$\text{CPU time} = \text{Clock cycle time} \times \sum_{j=1}^{n} \text{CPI}_j \times \text{IC}_j$$

Cycles Per Instruction (cont.)

We may separate the contribution of each type of instruction to the execution time by defining:

$$\text{Number of clock cycles} = \sum_{j=1}^{n} \text{CPI}_j \times \text{IC}_j$$

where IC_j is the number of times that instruction j is executed, and CPI_j is the average number of clock cycles required to execute instruction j.

Processor pipelining and memory interactions limit the accuracy of this approach, but it is a good first guess. For accuracy, it is necessary to simulate the instructions of an entire program with issue, pipeline and memory interactions.
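As a rough illustration of the per-class CPI sum above, the following Python sketch computes the total clock cycles, the average CPI and the CPU time for an instruction mix. The function names, the three instruction classes and all of the numbers are hypothetical values chosen for illustration; they do not come from the lecture or from any measured processor.

```python
def total_cycles(instruction_counts, cpis):
    """Number of clock cycles = sum over j of CPI_j * IC_j."""
    if len(instruction_counts) != len(cpis):
        raise ValueError("need one CPI value per instruction class")
    return sum(ic * cpi for ic, cpi in zip(instruction_counts, cpis))


def average_cpi(instruction_counts, cpis):
    """Average CPI = number of clock cycles / total instruction count."""
    return total_cycles(instruction_counts, cpis) / sum(instruction_counts)


def cpu_time(instruction_counts, cpis, clock_rate_hz):
    """CPU time = number of clock cycles / clock rate."""
    return total_cycles(instruction_counts, cpis) / clock_rate_hz


# Hypothetical instruction mix (made-up numbers, for illustration only).
counts = [500_000, 300_000, 200_000]  # ALU, load/store, branch
cpis = [1, 2, 3]                      # assumed average cycles per class
clock_rate = 1e9                      # assumed 1 GHz clock

print("cycles      :", total_cycles(counts, cpis))
print("average CPI :", average_cpi(counts, cpis))
print("CPU time (s):", cpu_time(counts, cpis, clock_rate))
```

As the slide warns, this is only a first guess: it ignores pipeline and memory interactions, which a simulator would have to model.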
Aspects of CPU Performance (CPU Law)

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Amdahl's Law

Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime without } E}{\text{ExTime with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected (e.g. special instructions, memory, I/O, parallel processing). Writing F as Fraction_enhanced and S as Speedup_enhanced:

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

• Example: floating point instructions are improved to run 2× faster, but only 10% of actual instructions are FP.

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - 0.1) + \frac{0.1}{2} \right] = 0.95 \times \text{ExTime}_{old}$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{\text{ExTime}_{old}}{0.95 \times \text{ExTime}_{old}} = \frac{1}{0.95} \approx 1.053$$

Topic 2: Instruction Set Architecture Design

Adapted from Prof. Jerry Breecher's Notes + my CS21Q Notes
(http://babbage.clarku.edu/~jbreecher/arch/arch.html)

Introduction

7.1 Introduction
7.2 Classifying Instruction Set Architectures
7.3 Memory Addressing
7.4 Operations in the Instruction Set
7.5 Type and Size of Operands
7.6 Encoding an Instruction Set
7.7 The Role of Compilers
7.8 The MIPS Architecture
7.9 Endianness (bonus)

Introduction

The Instruction Set Architecture is that portion of the machine visible to the assembly-level programmer or to the compiler writer.

[Figure: the instruction set sits between software (above) and hardware (below).]

Questions:
- What are the advantages and disadvantages of various instruction set alternatives?
- How do languages and compilers affect the ISA?

Classifying Instruction Set Architectures

Classifications can be by:
1. Stack/accumulator/register
2. Number of memory operands
3. Number of total operands
Instruction Set Architectures: Basic ISA Classes

Accumulator:
  1 address       add A          acc <- acc + mem[A]
  1+x address     addx A         acc <- acc + mem[A + x]

Stack:
  0 address       add            tos <- tos + next

General Purpose Register:
  2 address       add A B        EA(A) <- EA(A) + EA(B)
  3 address       add A B C      EA(A) <- EA(B) + EA(C)
  ALU instructions can have two or three operands.

Load/Store:
  0 memory        load R1, Mem1
                  load R2, Mem2
                  add R1, R2
  1 memory        add R1, Mem2
  ALU instructions can have 0, 1, 2 or 3 memory operands. Shown here are the cases of 0 and 1.

Instruction Set Architectures: Basic ISA Classes (cont.)

The results of the different address classes are easiest to see with the examples here, all of which implement the sequence for C = A + B.

Stack       Accumulator    Register (register-memory)    Register (load-store)
Push A      Load A         Load R1, A                     Load R1, A
Push B      Add B          Add R1, B                      Load R2, B
Add         Store C        Store C, R1                    Add R3, R1, R2
Pop C                                                     Store C, R3

Registers are the class that won out. The more registers on the CPU, the better.

Instruction Set Architectures: Intel 80x86 Integer Registers

GPR0    EAX       Accumulator
GPR1    ECX       Count register; string, loop
GPR2    EDX       Data register; multiply, divide
GPR3    EBX       Base address register
GPR4    ESP       Stack pointer
GPR5    EBP       Base pointer (for base of stack segment)
GPR6    ESI       Index register
GPR7    EDI       Index register
        CS        Code segment pointer
        SS        Stack segment pointer
        DS        Data segment pointer
        ES        Extra data segment pointer
        FS        Data segment 2
        GS        Data segment 3
PC      EIP       Instruction counter
        EFLAGS    Condition codes

Memory Addressing

Sections include:
  Interpreting Memory Addresses
  Addressing Modes
  Displacement Address Mode
  Immediate Address Mode

Memory Addressing: Interpreting Memory Addresses

What object is accessed as a function of the address and length?

Objects have byte addresses – an address refers to the number of bytes counted from the beginning of memory.

Little Endian – puts the byte whose address is xx00 at the least significant position in the word.
Big Endian – puts the byte whose address is xx00 at the most significant position in the word.

Alignment – data must be aligned on a boundary equal to its size. Misalignment typically results in an alignment fault that must be handled by the operating system.

Memory Addressing: Addressing Modes

This table shows the most common modes. A more complete set is in Figure 2.6.

Addressing mode      Example instruction    Meaning                             When used
Register             Add R4, R3             R[R4] <- R[R4] + R[R3]              When a value is in a register.
Immediate            Add R4, #3             R[R4] <- R[R4] + 3                  For constants.
Displacement         Add R4, 100(R1)        R[R4] <- R[R4] + M[100 + R[R1]]     Accessing local variables.
Register deferred    Add R4, (R1)           R[R4] <- R[R4] + M[R[R1]]           Using a pointer or a computed address.
Absolute             Add R4, (1001)         R[R4] <- R[R4] + M[1001]            Used for static data.

Memory Addressing: Displacement Addressing Mode

How big should the displacement be?

For addresses that do fit in the displacement size:
    Add R4, 10000(R0)

For addresses that don't fit in the displacement size, the compiler must do the following:
    Load R1, address
    Add R4, 0(R1)

How big the displacement field should be depends on the displacements that typically occur. On both IA32 and DLX, the space allocated is 16 bits.

Memory Addressing: Immediate Address Mode

Used where we want to get to a numerical value in an instruction.

At high level:       At assembler level:
a = b + 3;           Load R2, 3
                     Add R0, R1, R2
if (a > 17)          Load R2, 17
                     CMPBGT R1, R2
goto Addr            Load R1, Address
                     Jump (R1)

So how would you get a 32-bit value into a register?

Operations In The Instruction Set

Sections include:
  Detailed information about types of instructions.
  Instructions for control flow (conditional branches, jumps)

Operations In The Instruction Set: Operator Types

Arithmetic and logical    and, add
Data transfer             move, load
Control                   branch, jump, call
System                    system call, traps
Floating point            add, mul, div, sqrt
Decimal                   add, convert
String                    move, compare
Multimedia                2D, 3D?
                          e.g., Intel MMX and Sun VIS

Operations In The Instruction Set: Control Instructions

Conditional branches are 20% of all instructions!!

Control instruction issues:
– taken or not
– where is the target
– link return address
– save or restore

Instructions that change the PC:
– (conditional) branches, (unconditional) jumps
– function calls, function returns
– system calls, system returns

Type And Size of Operands

The type of the operand is usually encoded in the opcode – an LDW implies loading of a word.

Common sizes are:
  Character (1 byte)
  Half word (16 bits)
  Word (32 bits)
  Single precision floating point (1 word)
  Double precision floating point (2 words)

Integers are two's complement binary.
Floating point is IEEE 754.
Some languages (like COBOL) use packed decimal.

The MIPS Architecture

MIPS is very RISC oriented.

The MIPS Architecture: MIPS Characteristics

There's MIPS-32, which we learned in CS140:
  32-bit byte addresses, aligned
  Load/store, only displacement addressing
  Standard datatypes
  3 fixed-length formats
  32 32-bit GPRs (r0 = 0)
  16 64-bit (32 32-bit) FPRs
  FP status register
  No condition codes

There's MIPS-64, the current architecture:
  Standard datatypes
  4 fixed-length formats (8, 16, 32, 64)
  32 64-bit GPRs (r0 = 0)
  64 64-bit FPRs

Addressing modes
• Immediate
• Displacement
• (Register mode used only for ALU)

Data transfer
• load/store word, load/store byte/halfword signed?
• load/store FP single/double
• moves between GPRs and FPRs

ALU
• add/subtract signed? immediate?
• multiply/divide signed?
• and, or, xor immediate?; shifts: ll, rl, ra immediate?
• sets immediate?
The MIPS Architecture: MIPS Characteristics (cont.)

Control
• branches == 0, <> 0
• conditional branch testing FP bit
• jump, jump register
• jump & link, jump & link register
• trap, return from exception

Floating point
• add/sub/mul/div
• single/double
• fp converts, fp set

The MIPS Architecture: The MIPS Encoding

Register-Register
  bits 31–26    bits 25–21    bits 20–16    bits 15–11    bits 10–6              bits 5–0
  Op            Rs1           Rs2           Rd            (unlabeled on slide)   Opx

Register-Immediate
  bits 31–26    bits 25–21    bits 20–16    bits 15–0
  Op            Rs1           Rd            immediate

Branch
  bits 31–26    bits 25–21    bits 20–16    bits 15–0
  Op            Rs1           Rs2/Opx       immediate

Jump / Call
  bits 31–26    bits 25–0
  Op            target

Byte Ordering

• How should bytes within a multi-byte word be ordered in memory?
• Conventions
  – Suns and Macs are "Big Endian" machines
    • Least significant byte has highest address
  – Alphas and PCs are "Little Endian" machines
    • Least significant byte has lowest address

Byte Ordering Example

• Big Endian
  – Least significant byte has highest address
• Little Endian
  – Least significant byte has lowest address
• Example
  – Variable x has 4-byte representation 0x01234567
  – Address given by &x is 0x100

Address          0x100    0x101    0x102    0x103
Big Endian       01       23       45       67
Little Endian    67       45       23       01

Machine-Level Code Representation

• Encode program as sequence of instructions
  – Each is a simple operation
    • Arithmetic operation
    • Read or write memory
    • Conditional branch
  – Instructions encoded as bytes
    • Alphas, Suns and Macs use 4-byte instructions
      – Reduced Instruction Set Computer (RISC)
    • PCs use variable-length instructions
      – Complex Instruction Set Computer (CISC)
  – Different instruction types and encodings for different machines
    • Most code is not binary compatible
• Programs are byte sequences too!

Classification of Processors

• We can classify processors according to the areas in which they are mostly used.
• We can identify four different groups of processors:
  – General purpose processors, which are used in building computers
  – Digital signal processors, which are designed specifically for signal processing
  – Microcontrollers, which are small microcomputers that integrate a processor core, I/O elements and a small amount of memory on the same chip
  – Application-specific processors, which are designed to perform a specific function (e.g. network processors)

General Purpose Processors

• These processors are used to build the major computer platforms.
• We can name:
  – Intel/AMD-based computers, also called IBM compatible
  – Macintosh computers, built using PowerPC processors
  – Sun machines, which use UltraSPARC processors

Examples of General Purpose Processors

Type of computer    Processors used                                  Technology
Macintosh           PowerPC (IBM, Motorola)                          Superscalar
Sun                 UltraSPARC (Sun)                                 RISC
IBM compatible      Intel processors; Athlon, Duron (AMD); Cyrix     Superscalar

DSP

• Digital Signal Processing (DSP) is used in a wide variety of applications, and it is hard to find a good definition that is general.
• We can start with dictionary definitions of the words:
  – Digital: operating by the use of discrete signals to represent data in the form of numbers
  – Signal: a variable parameter by which information is conveyed through an electronic circuit
  – Processing: to perform operations on data according to programmed instructions
• This leads us to a simple definition:
  – Digital signal processing: changing or analyzing information which is measured as discrete sequences of numbers

• Note two unique features of digital signal processing as opposed to plain old ordinary digital processing:
  – Signals come from the real world. This intimate connection with the real world leads to many unique needs, such as the need to react in real time and the need to measure signals and convert them to digital numbers.
  – Signals are discrete, which means the information in between discrete samples is lost.
• The advantages of DSP are common to many digital systems and include:
  – Versatility:
    • digital systems can be reprogrammed for other applications (at least where programmable DSP chips are used)
    • digital systems can be ported to different hardware (for example a different DSP chip or board-level product)
  – Repeatability:
    • digital systems can be easily duplicated
    • digital systems do not depend on strict component tolerances
    • digital system responses do not drift with temperature
  – Simplicity:
    • some things can be done more easily digitally than with analogue systems

• DSP is used in a very wide variety of applications.
• But most share some common features:
  – they use a lot of math (multiplying and adding signals)
  – they deal with signals that come from the real world
  – they require a response in a certain time
• Where general purpose DSP processors are concerned, most applications deal with signal frequencies that are in the audio range.
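To make the "multiplying and adding signals" point concrete, here is a minimal Python sketch of a finite impulse response (FIR) filter, a common DSP operation built entirely from multiply-accumulate steps on a discrete sequence. The FIR filter is chosen here only as an illustration and is not named in the lecture; the function name, the sample values and the two-tap averaging coefficients are all hypothetical.

```python
def fir_filter(samples, coefficients):
    """Apply an FIR filter: y[n] = sum over k of coefficients[k] * samples[n - k].

    Samples before the start of the sequence are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0  # accumulator, as in a DSP multiply-accumulate unit
        for k, coeff in enumerate(coefficients):
            if n - k >= 0:
                acc += coeff * samples[n - k]  # one multiply-accumulate step
        output.append(acc)
    return output


# Hypothetical discrete input sequence and a two-tap moving average.
samples = [0.0, 4.0, 8.0, 4.0, 0.0]
coefficients = [0.5, 0.5]

print(fir_filter(samples, coefficients))  # prints [0.0, 2.0, 6.0, 6.0, 2.0]
```

Each output sample costs one multiply and one add per coefficient, which is why DSP processors are typically built around fast multiply-accumulate hardware and why the real-time response requirement above bounds the filter length that can be used.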