Computational Methods in Astrophysics
ASTR 5210
Dr Rob Thacker (AT319E)
[email protected]

Today’s Lecture: More Computer Architecture
* Flynn’s Taxonomy
* Improving CPU performance (instructions per clock)
* Instruction Set Architecture classifications
* Future of CPU design

Machine architecture classifications
* Flynn’s taxonomy (see IEEE Trans. Comp. Vol C-21, pp 94, 1972)
* A way of describing the information flow in computers: an architectural definition
* Information is divided into instructions (I) and data (D)
* There can be single (S) or multiple (M) instances of both
* Four combinations: SISD, SIMD, MISD, MIMD

SISD: Single Instruction, Single Data
* An absolutely serial execution model
* Typically viewed as describing a serial computer, but today’s CPUs exploit parallelism
* Single data element, single processor
[Diagram: one processor P attached to one memory M]

SIMD: Single Instruction, Multiple Data
* One instruction is applied to multiple data streams at the same time
* A single instruction processor, K, broadcasts each instruction to an array of processing elements (PEs)
* Each processor typically has its own data memory
[Diagram: K feeding an array of processors P, each with its own data memory Ma]

MISD: Multiple Instruction, Single Data
* Largely useless definition (not important)
* Closest relevant example would be a CPU that can ‘pipeline’ instructions
* Each processor has its own instruction stream but operates on the same data stream
* Example: systolic array, a network of small elements connected in a regular grid, operating under a global clock and reading and writing elements from/to neighbours
[Diagram: processors P, each with its own instruction memory Mi, sharing one data memory Ma]

MIMD: Multiple Instruction, Multiple Data
* Covers a host of modern architectures
* Processors have independent data and instruction streams
* Processors may communicate directly or via shared memory
[Diagram: processors P with memories M, linked directly or through a shared memory M]

Instruction Set Architecture
* ISA: the interface between hardware and software
* ISAs are typically common to a CPU family, e.g. x86, MIPS (more alike than different)
* Assembly language is a realization of the ISA in a form that is easy to remember (and program)

Key Concept in ISA evolution and CPU design
* Efficiency gains are to be had by executing as many operations per clock cycle as possible
* Instruction level parallelism (ILP)
  * Exploit parallelism within the instruction stream
  * Programmer does not see this parallelism explicitly
* Goal of modern CPU design: maximize the number of instructions per clock cycle (IPC), or equivalently reduce cycles per instruction (CPI)

ILP versus thread level parallelism
* Many modern programs have more than one (parallel) “thread” of execution
* Instruction level parallelism breaks down a single thread of execution to try to find parallelism at the instruction level
* These instructions are executed in parallel even though there is only one thread
[Diagram: the numbered instructions of one thread, with independent instructions issued side by side]

ILP techniques
* The two main ILP techniques are
  * Pipelining, including additional techniques such as out-of-order execution
  * Superscalar execution

Pipelining
* Multiple instructions overlapped in execution
* Throughput optimization: doesn’t reduce the time for individual instructions
[Diagram: successive instructions overlapped across pipeline stages 1 to 7]

Design sweetspot
* Pipeline stepping time is determined by the slowest operation in the pipeline
* Best speed-up: if all operations take the same amount of time
  * Net time per instruction = unpipelined time per instruction / number of pipeline stages (ideal case)
  * Perfect speed-up factor = number of pipeline stages
  * Never achieved: start-up overheads to consider

Pipeline compromises
Time to issue one instruction, stage by stage:

  Stage        1     2     3     4     5     6     7     Total
  Unpipelined  10ns  10ns  5ns   10ns  5ns   10ns  5ns   55ns
  Pipelined    10ns  10ns  10ns  10ns  10ns  10ns  10ns  70ns

* In the pipelined case stages 3, 5 and 7 take longer than necessary

Superscalar execution
* Careful about definitions: superscalar execution is not simply about having multiple instructions in flight
* Superscalar processors have more than one of a given functional unit (such as the arithmetic logic unit (ALU) or load/store)

Benefits of superscalar design
* Having more than one functional unit of a given type can help schedule more instructions within the pipeline
* The Pentium IV pipeline was 20 stages deep!
  * Enormous throughput potential, but a big pipeline stall penalty
* Incorporation of multiple units into the pipeline is sometimes called superpipelining

Other ways of increasing ILP
* Branch prediction
  * Predict which path will be taken by assigning certain probabilities
* Out of order execution
  * Independent operations can be rescheduled in the instruction stream
* Pipelined functional units
  * Floating point units can be pipelined to increase throughput

Limits of ILP
* See D. Wall “Limits of ILP” 1991
* Probability of hitting hazards (instructions that cannot be pipelined) increases with added length
* Instruction fetch and decode rate
  * Remember the “von Neumann” bottleneck? Would be nice to have a single instruction for multiple operations…
* Branch prediction
  * Multiple condition statements increase branches severely
* Cache locality and memory limitations
  * Finite limits to the effectiveness of prefetch

Scalar Processor Architectures
* ‘Scalar’
  * Pipelined
  * Functional unit parallelism, e.g. load/store and arithmetic units can be used in parallel (instructions in parallel)
* Superscalar
  * Multiple functional units, e.g. 4 floating point units can operate at the same time
* Modern processors exploit parallelism, and can’t really be called SISD

Complex Instruction Set Computing
* CISC: older design idea (the x86 instruction set is CISC)
* Many (powerful) instructions supported within the ISA
* Upside: makes assembly programming much easier (lots of assembly programming in the 1960s-70s)
* Upside: reduced instruction memory usage
* Downside: designing the CPU is much harder

Reduced Instruction Set Computing
* RISC: newer concept than CISC (but still old)
* MIPS, PowerPC, SPARC are all RISC designs
* Small instruction set; a CISC type operation becomes a chain of RISC operations
* Upside: easier to design the CPU
* Upside: smaller instruction set => higher clock speed
* Downside: assembly language typically longer (a compiler design issue though)
* Most modern x86 processors are implemented using RISC techniques

Birth of RISC
* Roots can be traced to three research projects
  * IBM 801 (late 1970s, J. Cocke)
  * Berkeley RISC processor (~1980, D. Patterson)
  * Stanford MIPS processor (~1981, J. Hennessy)
* Stanford & Berkeley projects were driven by interest in building a simple chip that could be made in a university environment
* Commercialization benefited from 3 independent projects
  * Berkeley Project -> begat Sun Microsystems
  * Stanford Project -> begat MIPS (used by SGI)

Modern RISC processors
* Complexity has nonetheless increased significantly
* Superscalar execution (where the CPU has multiple functional units of the same type, e.g. two add units) requires complex circuitry to control the scheduling of operations
* What if we could remove the scheduling complexity by using a smart compiler…?
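
One way to picture a compiler taking over that scheduling job: the Python sketch below greedily packs operations into fixed-width issue groups so that no operation shares a group with one it depends on. Everything in it is an illustrative assumption rather than something from the lecture: the names `Op` and `pack_bundles`, unit latency for every operation, each register being written only once, and the default width of 3.

```python
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    """One operation: writes `dest`, reads the registers in `srcs` (hypothetical model)."""
    dest: str
    srcs: Tuple[str, ...]


def pack_bundles(ops: List[Op], width: int = 3) -> List[List[Op]]:
    """Greedy compile-time scheduling sketch.

    Each op is placed in the earliest issue group that (a) comes after every
    group producing one of its sources and (b) still has a free slot.
    Assumes unit latency and that each register is written exactly once.
    """
    bundles: List[List[Op]] = []
    written_in: Dict[str, int] = {}  # register -> index of the group that writes it

    for op in ops:
        # The op cannot issue before the group after its latest producer.
        earliest = 0
        for src in op.srcs:
            if src in written_in:
                earliest = max(earliest, written_in[src] + 1)

        # Skip forward past groups that are already full.
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1

        # Open new groups as needed, then place the op.
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(op)
        written_in[op.dest] = slot

    return bundles


if __name__ == "__main__":
    program = [
        Op("a", ("x", "y")),
        Op("b", ("x", "z")),
        Op("c", ("a", "b")),  # depends on a and b, so it must wait one group
        Op("d", ("y", "z")),  # independent, can join the first group
    ]
    for i, group in enumerate(pack_bundles(program)):
        print(i, [op.dest for op in group])
```

In this sketch all dependency checking happens before the program runs, so the hardware only has to dispatch each group to its execution units.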
VLIW & EPIC
* VLIW: very long instruction word
* Idea: pack a number of non-interdependent operations into one long instruction
* Strong emphasis on compilers to schedule instructions
* When executed, words are easily broken up, allowing operations to be dispatched to independent execution units
[Diagram: Instr 1, Instr 2, Instr 3: three instructions scheduled into one long instruction word]

VLIW & EPIC II
* Natural successor to RISC: designed to avoid the need for complex scheduling in RISC designs
* VLIW processors should be faster and less expensive than RISC
* EPIC: explicitly parallel instruction computing, Intel’s implementation (roughly) of VLIW
* The ISA is called IA-64

VLIW & EPIC III
* Hey, it’s 2015: why aren’t we all using Intel Itanium processors?
* AMD figured out an easy extension to make x86 support 64 bits & introduced multicore
* Backwards compatibility + “good enough performance” + poor Itanium compiler performance killed IA-64

RISC vs CISC recap
RISC (popular by mid 80s)
* Operations on registers
* Pro: small instruction set makes design easy
* Pro: decreased CPI, but also get a faster CPU through easier design (tc reduced)
* Con: complicated instructions must be built from simpler ones
* Con: efficient compiler technology absolutely essential
CISC (pre 1970s)
* Operations directly on memory
* Pro: many powerful instructions, easy to write assembly language*
* Pro: reduced memory requirement for instructions, reduced number of total instructions (Ni)*
* Con: ISA often large and wasteful (20-25% usage)
* Con: ISA hard to debug during development
*Driven by 1970s issues of memory size (SMALL) and speed (FASTER THAN CPU)

Who “won”? Not VLIW!
* Modern x86 processors are RISC-CISC hybrids
  * The ISA is translated at hardware level to shorter instructions
  * Very complicated designs though, with lots of scheduling hardware
* MIPS, Sun SPARC and DEC Alpha were much truer implementations of the RISC ideal
* Modern metric for determining the RISCkyness of a design: does the ISA have LOAD STORE instructions to memory?
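
The symbols used in the recap (Ni, CPI, tc) fit together in the standard CPU time relation, written out here as a reminder. The label T_CPU for total execution time is a name introduced for this note; Ni is the number of instructions executed, CPI the average cycles per instruction, and tc the clock cycle time.

```latex
T_{\mathrm{CPU}} = N_i \times \mathrm{CPI} \times t_c
```

Read against the recap: CISC aims to shrink Ni at the cost of a larger CPI and tc, while RISC accepts a larger Ni in exchange for a smaller CPI and tc.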
Evolution of Instruction Sets
From Patterson’s lectures (UC Berkeley CS252)
* Single Accumulator (EDSAC 1950)
* Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
* Separation of Programming Model from Implementation
  * High-level Language Based (B5000 1963)
  * Concept of a Family (IBM 360 1964)
* General Purpose Register Machines
  * Complex Instruction Sets (Vax, Intel 432 1977-80)
  * Load/Store Architecture (CDC 6600, Cray 1 1963-76)
* RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987)
* LIW/“EPIC”? (IA-64 . . . 1999)

Simultaneous multithreading
* Completely different technology to ILP
* NOT multi-core
* Designed to overcome the lack of fine grained parallelism in code
* Idea is to fill any potential gaps in the processor pipeline by switching between threads of execution on very short time scales
* Requires the programmer to have created a parallel program for this to work though
* One physical processor looks like two logical processors

Motivation for SMT
* Strong motivation for SMT: memory latency is making load operations take longer and longer
* Need some way to hide this bottleneck (memory wall again!)
* SMT: switch execution over to threads that have their data, and execute those
* Tera MTA (Tera later became Cray Inc.): an attempt to design a computer entirely around this concept

SMT Example: IBM Power 5 - 8
* Dual core; each core can support 2 SMT threads
* “MCM” package
  * 4 dual core processors
  * 144 MB of cache
* SMT gives ~40-60% improvement in performance
  * Not bad
* Intel Hyperthreading: ~10% improvement

Multiple cores
* Simply add more CPUs
* Easiest way to increase throughput now
* Why do this?
  * Response to the problem of increasing power output on modern CPUs
  * We’ve essentially reached the limit on improving individual core speeds
* Design involves compromise: n CPUs must now share the memory bus, so less bandwidth to each

Intel & AMD multi-core processors
* Intel 18-core processors
  * Codename “Haswell”
  * Design envelope 150 W, but divide by the number of cores => each core is very power efficient
* AMD has 16-core processors
  * Codename “Warsaw”
  * 115 W design envelope
  * Individual cores not as good as Intel though

Summary
* Flynn’s taxonomy categorizes instruction and data flow in computers
  * Modern processors are MIMD
* Pipelining and superscalar design improve CPU performance by increasing the instructions per clock
* CISC/RISC design approaches appear to be reaching the limits of their applicability
  * VLIW didn’t make an impact: will it return?
* In the absence of improved single core performance, designers are simply integrating more cores
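
Worked example for the pipelining point in the summary: a small Python sketch that reuses the seven stage times from the “Pipeline compromises” slide (10, 10, 5, 10, 5, 10, 5 ns). The function names are made up for this note, and the model assumes an ideal pipeline with no stalls or hazards, so treat it as an illustration rather than a performance model.

```python
# Stage times in nanoseconds, as quoted on the "Pipeline compromises" slide.
STAGE_NS = [10, 10, 5, 10, 5, 10, 5]


def unpipelined_ns(stage_ns, n_instr):
    """Each instruction runs through every stage before the next one starts."""
    return n_instr * sum(stage_ns)


def pipelined_ns(stage_ns, n_instr):
    """Ideal pipeline: every stage steps at the pace of the slowest stage.

    The first instruction needs one step per stage to drain through;
    after that, one instruction completes per step.
    """
    step = max(stage_ns)
    return (len(stage_ns) + n_instr - 1) * step


def speedup(stage_ns, n_instr):
    """Ratio of unpipelined to pipelined run time for n_instr instructions."""
    return unpipelined_ns(stage_ns, n_instr) / pipelined_ns(stage_ns, n_instr)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, unpipelined_ns(STAGE_NS, n), pipelined_ns(STAGE_NS, n),
              round(speedup(STAGE_NS, n), 2))
```

For a single instruction the pipelined machine is slower (70 ns against 55 ns), which is the compromise the slide points at. Over a long instruction stream the speed-up approaches 55/10 = 5.5 rather than the ideal factor of 7, because the stages do not all take the same time.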