EE/CE/CS 6304 Computer Architecture
Lecture #2 (8/27/15)
Yiorgos Makris, Professor
Department of Electrical Engineering, University of Texas at Dallas
Course Web-site: http://www.utdallas.edu/~gxm112130/EE6304FA15

The Instruction Set: a Critical Interface
• The instruction set is the interface between software and hardware
• Properties of a good abstraction
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels

Instruction Set Architecture
"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." – Amdahl, Blaauw, and Brooks, 1964
The ISA, as seen by software, specifies:
– Organization of programmable storage
– Data types & data structures: encodings & representations
– Instruction formats
– Instruction (or operation code) set
– Modes of addressing and accessing data items and instructions
– Exceptional conditions

Example: MIPS R3000
• Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs r1–r31 (r0 = 0); 32 x 32-bit FP regs (paired for double precision); HI, LO, PC
• Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
• Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
• Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
• Format: 32-bit instructions on word boundaries
• Questions an ISA must answer: data types? formats? addressing modes?

ISA vs. Computer Architecture
• Old definition of computer architecture = instruction set design
– Other aspects of computer design were called implementation
– Insinuates that implementation is uninteresting or less challenging
• Our view: computer architecture >> ISA
• The architect's job is much more than instruction set design; today's technical hurdles are more challenging than those in instruction set design
• Since instruction set design is not where the action is, some conclude that computer architecture (under the old definition) is not where the action is
– We disagree with the conclusion
– We agree that the ISA is not where the action is

Execution is not just about hardware
• The full stack: application → source-to-source program transformations → compiler → libraries → linker → binary → library services → OS services → hypervisor → hardware
• The VAX fallacy
– Provide one instruction for every high-level concept
– Absurdity: polynomial multiply
» A single hardware instruction
» But why? Is this really faster? (see the sketch below)
• The RISC philosophy
– Full system design
– Hardware mechanisms viewed in the context of the complete system
– Cross-boundary optimization
• The modern programmer does not see assembly language
– Many do not even see "low-level" languages like "C".
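To ground the VAX-fallacy point, here is a minimal sketch of polynomial multiplication written in plain C: a pair of loops over multiplies and adds, i.e., exactly the simple RISC-style instructions a compiler would emit in place of a single complex "polynomial" instruction. The helper name poly_mul and the example coefficients are invented for illustration.

    /* c(x) = a(x) * b(x), as a convolution of coefficient arrays.
       Nothing here needs more than loads, multiplies, adds, stores. */
    #include <stdio.h>

    static void poly_mul(const double *a, int na,
                         const double *b, int nb,
                         double *c)              /* na+nb-1 coefficients */
    {
        for (int i = 0; i < na + nb - 1; i++)
            c[i] = 0.0;
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++)
                c[i + j] += a[i] * b[j];         /* one multiply, one add */
    }

    int main(void)
    {
        double a[] = {1, 2};                     /* 1 + 2x        */
        double b[] = {3, 4, 5};                  /* 3 + 4x + 5x^2 */
        double c[4];
        poly_mul(a, 2, b, 3, c);
        for (int i = 0; i < 4; i++)
            printf("c[%d] = %g\n", i, c[i]);     /* 3 10 13 10 */
        return 0;
    }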
Computer Architecture is an Integrated Approach
• What really matters is the functioning of the complete system – hardware, runtime system, compiler, operating system, and application
– In networking, this is called the "end-to-end argument"
• Computer architecture is not just about transistors, individual instructions, or particular implementations
– E.g., the original RISC projects replaced complex instructions with a compiler plus simple instructions
• It is very important to think across all hardware/software boundaries
– New technology → new capabilities → new architectures → new tradeoffs
– Delicate balance between backward compatibility and efficiency

Computer Architecture is Design and Analysis
• Architecture is an iterative process:
– Searching the space of possible designs
– At all levels of computer systems
• Creativity generates candidate designs; cost/performance analysis sorts the good ideas from the mediocre and bad ones

Computer Architecture Topics
• Input/output and storage: disks, WORM, tape; RAID; emerging technologies
• Memory hierarchy: DRAM, interleaving, bus protocols; L1/L2 caches; VLSI; coherence, bandwidth, latency
• Instruction set architecture: addressing, protection, exception handling
• Pipelining and instruction-level parallelism: pipelining, hazard resolution, superscalar execution, reordering, prediction, speculation, vector processing, dynamic compilation
• Multiprocessors (processor-memory-switch organizations connected by an interconnection network): shared memory, message passing, data parallelism
• Networks and interconnections: network interfaces; topologies, routing, bandwidth, latency, reliability

Building Hardware that Computes
• Finite state machines ("Moore machine", "Mealy machine"): implemented as combinational logic + latch
[Figure: state-transition diagram and table for a three-state FSM (states Alpha, Beta, Delta) mapping input and old state to new state and a "Div" output]

Microprogrammed Controllers
• A state machine in which part of the state is a "micro-PC"
– Explicit circuitry for incrementing or changing the PC
• Includes a ROM with "microinstructions"
– Controlled logic implements at least branches and jumps
[Figure: the micro-PC is incremented or loaded through a MUX under branch control and addresses a ROM of microinstructions that drive the controlled machine]
• Example microprogram:
0: forw 35 xxx
1: b_no_obstacles 000
2: back 10 xxx
3: rotate 90 xxx
4: goto 001

Fundamental Execution Cycle
• Instruction Fetch: obtain the instruction from program storage
• Instruction Decode: determine the required actions and instruction size
• Operand Fetch: locate and obtain the operand data
• Execute: compute the result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine the successor instruction
• The path between the processor (registers, functional units) and memory (program, data) is the von Neumann bottleneck

What's a Clock Cycle?
• A cycle spans a latch or register plus the combinational logic between latches
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues plus gate delays
– Clock propagation, wire lengths, drivers

Pipelined Instruction Execution
[Figure: five-stage pipeline diagram – instructions in program order flow through Ifetch, Reg, ALU, DMem, Reg stages, overlapped across clock cycles 1–7]
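The diagram above can be summarized arithmetically: in an ideal (hazard-free) five-stage pipeline, instruction i occupies stage s during cycle i + s + 1, so one instruction completes every cycle once the pipeline is full. A minimal sketch that prints this schedule (stage names follow the slide; everything else is invented for illustration):

    /* Print when each instruction occupies each pipeline stage,
       assuming ideal one-stage-per-cycle flow with no hazards. */
    #include <stdio.h>

    int main(void)
    {
        const char *stage[] = {"Ifetch", "Reg", "ALU", "DMem", "Reg"};
        const int nstages = 5, ninsts = 3;

        for (int i = 0; i < ninsts; i++) {
            printf("instr %d:", i);
            for (int s = 0; s < nstages; s++)
                printf("  cycle %d: %-6s", i + s + 1, stage[s]);
            printf("\n");
        }
        return 0;
    }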
Limits to pipelining
• Maintain the von Neumann "illusion" of one-instruction-at-a-time execution
• Hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: an attempt to use the same hardware to do two different things at once
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
• Power: too many things happening at once will melt your chip!
– Must disable parts of the system that are not being used
– Clock gating, asynchronous design, low voltage swings, …

Progression of ILP
• 1st generation RISC – pipelined
– A full 32-bit processor fit on a chip => issue almost 1 IPC
» Need to access memory 1+x times per cycle
– Floating-point unit on another chip, cache controller on a third, with the cache itself off-chip
– 1 board per processor → multiprocessor systems
• 2nd generation: superscalar
– Processor and floating-point unit on chip (and some cache)
– Issuing only one instruction per cycle uses at most half the hardware
– Fetch multiple instructions, issue a couple
» Grows from 2 to 4 to 8 …
– How to manage dependencies among all these instructions?
– Where does the parallelism come from?
• VLIW
– Expose some of the ILP to the compiler; allow it to schedule instructions to reduce dependences

Modern ILP
• Dynamically scheduled, out-of-order execution
– A current microprocessor issues 6-8 instructions per cycle
– Pipelines are 10s of cycles deep, so many instructions are in execution at once
– Unfortunately, hazards cause much of this work to be discarded
• What happens:
– Grab a bunch of instructions, determine all their dependences, eliminate dependences wherever possible, throw them all into the execution unit, and let each one move forward as its dependences are resolved
– The program appears to have executed sequentially
– On a trap or interrupt, capture the state of the machine between instructions perfectly
• Huge complexity
– The complexity of many components scales as n^2, where n is the issue width
– Power consumption is a big problem

IBM Power 4
• Combines superscalar and out-of-order execution
• Properties:
– 8 execution units in the out-of-order engine, each of which may issue an instruction every cycle
– In-order instruction fetch and decode (compute dependencies)
– Reordering for in-order commit

When all else fails – guess
• Programs make decisions as they go
– Conditionals, loops, calls
– These translate into branches and jumps (roughly 1 of every 5 instructions)
• How do you determine what instructions to fetch when the ones before them haven't executed?
– Branch prediction (a minimal sketch follows this slide)
– Lots of clever machine structures to predict the future based on history
– Machinery to back out of mispredictions
• Execute all the possible branch paths
– Likely to hit additional branches and to perform stores → speculative threads
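The "predict the future based on history" machinery can be as simple as a table of 2-bit saturating counters indexed by branch address. A minimal sketch, assuming a direct-mapped table; the table size, the PC hashing, and the test loop are illustrative choices, not from the lecture:

    #include <stdio.h>
    #include <stdint.h>

    /* One 2-bit saturating counter per entry:
       0,1 = predict not-taken; 2,3 = predict taken. */
    #define TABLE_SIZE 1024
    static uint8_t counters[TABLE_SIZE];         /* all start at 0 */

    static int predict(uint32_t pc)
    {
        return counters[pc % TABLE_SIZE] >= 2;   /* 1 = predict taken */
    }

    static void update(uint32_t pc, int taken)
    {
        uint8_t *c = &counters[pc % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;            /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;            /* saturate at 0 */
    }

    int main(void)
    {
        /* A loop branch taken 9 times then not taken once, repeated:
           after warm-up the predictor is wrong only on the loop exit. */
        int correct = 0, total = 0;
        for (int rep = 0; rep < 100; rep++)
            for (int i = 0; i < 10; i++) {
                int taken = (i < 9);
                correct += (predict(0x400) == taken);
                total++;
                update(0x400, taken);
            }
        printf("accuracy: %d/%d\n", correct, total);
        return 0;
    }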
What can hardware do to make programming (with performance) easier?
Have we reached the end of ILP?
• Multiple processors easily fit on a chip
• Every major microprocessor vendor has gone to multithreaded cores
– Thread: a locus of control, an execution context
– Fetch instructions from multiple threads at once and throw them all into the execution unit
– Intel: hyperthreading
– The concept has existed in high-performance computing for 20 years (or is it 40? the CDC 6600)
• Vector processing
– Each instruction processes many distinct data items
– Ex: MMX
• Raise the level of architecture: many processors per chip (e.g., the Tensilica configurable processor)

Limiting Forces: Clock Speed and ILP
• Chip density continues to increase ~2x every 2 years
– Clock speed does not
– The number of processors (cores) per chip may double instead
• There is little or no more instruction-level parallelism (ILP) to be found
– We can no longer allow the programmer to think in terms of a serial programming model
• Conclusion: parallelism must be exposed to software!
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Examples of MIMD Machines
• Symmetric multiprocessor
– Multiple processors in a box with shared-memory communication
– Current multicore chips are like this
– Every processor runs a copy of the OS
• Non-uniform shared memory with separate I/O through a host
– Multiple processors, each with local memory, on a general scalable network
– An extremely light "OS" on each node provides simple services
» Scheduling/synchronization
– A network-accessible host handles I/O
• Cluster
– Many independent machines connected by a general network
– Communication through messages
[Figure: a bus-based SMP (processors P sharing a bus and memory) contrasted with a grid of processor/memory (P/M) nodes attached to a host]

Categories of Thread Execution
[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading, with five threads and idle slots marked]
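At the software level, the shared-memory MIMD model above is what a threads library exposes. A minimal sketch using POSIX threads, with threads standing in for processors; the array-summing workload and all names are invented for illustration (compile with cc -pthread):

    #include <pthread.h>
    #include <stdio.h>

    /* Two threads each sum half of a shared array; communication
       happens entirely through shared memory, as in an SMP. */
    #define N 1000000
    static int data[N];
    static long partial[2];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long sum = 0;
        for (int i = (int)id * (N / 2); i < (int)(id + 1) * (N / 2); i++)
            sum += data[i];
        partial[id] = sum;                /* no race: disjoint slots */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < N; i++) data[i] = 1;
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("sum = %ld\n", partial[0] + partial[1]);   /* 1000000 */
        return 0;
    }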