MAMAS – Computer Structure 234267

Lecturers: Lihu Rappoport, Adi Yoaz
Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh

Computer Structure 2012 – Introduction

General Course Information
• Grade
  - 20% exercises (mandatory; the exercise grade is binding)
  - 80% final exam
  - No midterm exam
• Course web site: http://webcourse.cs.technion.ac.il/234267
• Foils will be on the web several days before the class

Class Focus
• CPU
  - Introduction: performance, instruction set (RISC vs. CISC)
  - Pipeline, hazards
  - Branch prediction
  - Out-of-order execution
• Memory hierarchy
  - Cache
  - Main memory
  - Virtual memory
• Advanced topics
• PC architecture
  - Motherboard & chipset, DRAM, I/O, disk, peripherals

Computer System – Sandy Bridge
[Figure: block diagram of a Sandy Bridge based system. Labels: cores, cache, GFX, System Agent and memory controller, with two DDRIII channels on the memory bus; external graphics card on PCI express ×16; display link to the South Bridge (PCH); HDMI; PCI express ×1; PCI with sound card (speakers) and LAN adapter (LAN); USB controller; SATA controllers (hard disk, DVD drive); IO controller (serial port, parallel port, floppy drive, keyboard, mouse).]

Architecture & Microarchitecture
• Architecture: the processor features seen by the "user"
  - Instruction set, addressing modes, data width, …
• Micro-architecture: the way a processor is implemented
  - Cache size and structure, number of execution units, …
  - Timing is considered uArch (though it is user visible)
• Processors with different uArch can support the same architecture

Compatibility
• Backward compatibility
  - New hardware can run existing software
    • Core2 Duo can run SW written for Pentium 4, Pentium M, Pentium III, Pentium II, Pentium, 486, 386, 286
• Forward compatibility
  - New software can run on existing hardware
  - Example: new software written with SSE2™ runs on an older processor which does not support SSE2™
  - Commonly supports one or two generations behind
• Architecture-independent SW
  - JIT (just-in-time compiler): Java and .NET
  - Binary translation

Moore's Law
• The number of transistors doubles every ~2 years

CPI – Cycles Per Instruction
• CPUs work according to a clock signal
  - Clock cycle is measured in nsec ($10^{-9}$ of a second)
  - Clock frequency (= 1 / clock cycle) is measured in GHz ($10^{9}$ cyc/sec)
• Instruction Count (IC)
  - Total number of instructions executed in the program
• CPI – Cycles Per Instruction
  - Average #cycles per instruction (in a given program)

$$\text{CPI} = \frac{\text{\#cycles required to execute the program}}{\text{IC}}$$

  - IPC (= 1/CPI): instructions per cycle

Calculating the CPI of a Program
• $IC_i$: #times an instruction of type $i$ is executed in the program
• $IC$: #instructions executed in the program:

$$IC = \sum_{i=1}^{n} IC_i$$

• $F_i$: relative frequency of instructions of type $i$: $F_i = IC_i / IC$
• $CPI_i$: #cycles to execute an instruction of type $i$
  - e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
• #cycles required to execute the entire program:

$$\text{\#cyc} = \sum_{i=1}^{n} CPI_i \cdot IC_i = CPI \cdot IC$$

• CPI:

$$CPI = \frac{\text{\#cyc}}{IC} = \frac{\sum_{i=1}^{n} CPI_i \cdot IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot F_i$$

CPU Time
• CPU Time: the time required to execute a program

$$\text{CPU Time} = IC \times CPI \times \text{clock cycle}$$

• Our goal: minimize CPU Time
  - Minimize clock cycle: more GHz (process, circuit, uArch)
  - Minimize CPI: uArch (e.g.: more execution units)
  - Minimize IC: architecture (e.g.: SSE™)

Amdahl's Law
Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. With $t_{exe}$ the execution time before the enhancement and $t'_{exe}$ the execution time after it:

$$t'_{exe} = t_{exe} \times \left( (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right)$$

$$\text{Speedup}_{overall} = \frac{t_{exe}}{t'_{exe}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

Amdahl's Law: Example
• Floating point instructions are improved to run at 2×, but only 10% of executed instructions are FP

$$t'_{exe} = t_{exe} \times (0.9 + 0.1 / 2) = 0.95 \times t_{exe}$$
$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$

• Corollary: Make The Common Case Fast

Comparing Performance
• Peak performance
  - MIPS, MFLOPS
  - Often not useful: unachievable / unsustainable in practice
• Benchmarks
  - Real applications, or representative parts of real apps
  - Targeted at the specific system usages
• SPEC INT – integer applications
  - Data compression, C compiler, Perl interpreter, database system, chess playing, text processing, …
• SPEC FP – floating point applications
  - Mostly important scientific applications
• TPC benchmarks
  - Measure transaction-processing throughput

Evaluating Performance of Future CPUs
• Use a performance simulator to evaluate the performance of a new feature / algorithm
  - Models the uArch in great detail
  - Run 100's of representative applications
• Produce the performance S-curve
  - Sort the applications according to the IPC increase
  - Baseline (0) is the processor without the new feature
[Figure: two example S-curves of IPC change per application. Bad S-curve: positive outliers, but also negative outliers. Good S-curve: positive outliers and only small negative outliers.]

Instruction Set Design
• The ISA is what the user / compiler see
• The HW implements the ISA
[Figure: the instruction set as the interface between software (above) and hardware (below).]

ISA Considerations
• Reduce the IC to reduce execution time
  - E.g., a single vector instruction performs the work of multiple scalar instructions
• Simple instructions ⇒ simpler HW implementation
  - Higher frequency, lower power, lower cost
• Code size
  - Long instructions take more time to fetch
  - Longer instructions require a larger memory
    • Important in small devices, e.g., cell phones

Architectural Consideration Example
• Immediate data size
[Figure: distribution of the number of immediate data bits (0 to 15), shown for the integer average and the FP average; vertical axis from 0% to 30%.]
  - 1% of data values > 16 bits
  - 12–16 bits of immediate are needed

CISC Processors
• CISC – Complex Instruction Set Computer
• The idea: a high level machine language
• Example: x86
• Characteristics
  - Many instruction types, with many addressing modes
  - Some of the instructions are complex
    • Execute complex tasks
    • Require many cycles
  - ALU operations directly on memory
    • Only a few registers, in many cases not orthogonal
  - Variable length instructions
    • Common instructions get short codes ⇒ save code length

Top 10 x86 Instructions

Rank  Instruction              % of total executed
1     load                     22%
2     conditional branch       20%
3     compare                  16%
4     store                    12%
5     add                      8%
6     and                      6%
7     sub                      5%
8     move register-register   4%
9     call                     1%
10    return                   1%
      Total                    96%

• Simple instructions dominate instruction frequency

CISC Drawbacks
• Complex instructions and complex addressing modes
  - complicate the processor
  - slow down the simple, common instructions
  - contradict Make The Common Case Fast
• Not compiler friendly
  - Non-orthogonal registers
  - Unused complex addressing modes
• Variable length instructions are a pain
  - Difficult to decode a few instructions in parallel
    • As long as an instruction is not decoded, its length is unknown
    • It is unknown where the instruction ends, and where the next instruction
      starts
  - An instruction may cross a cache line or a page

RISC Processors
• RISC – Reduced Instruction Set Computer
• The idea: simple instructions enable fast hardware
• Characteristics
  - A small instruction set, with few instruction formats
  - Simple instructions that execute simple tasks
    • Most of them require a single cycle (with pipeline)
  - A few indexing methods
  - ALU operations on registers only
    • Memory is accessed using Load and Store instructions only
  - Many orthogonal registers
  - Three address machine: Add dst, src1, src2
  - Fixed length instructions

RISC Processors (Cont.)
• Simple architecture ⇒ simple micro-architecture
  - Simple, small and fast control logic
  - Simpler to design and validate
  - Leaves space for large on-die caches
  - Shortens time-to-market
• Using a smart compiler
  - Better pipeline usage
  - Better register allocation
• Existing RISC processors are not "pure" RISC
  - e.g., they support division, which takes many cycles
• Examples: MIPS™, Sparc™, Alpha™, Power™

Compilers and ISA
• Ease of compilation
  - Orthogonality:
    • no special registers
    • few special cases
    • all operand modes available with any data type or instruction type
  - Regularity:
    • no overloading for the meanings of instruction fields
    • streamlined
    • resource needs easily determined
• Register assignment is critical too
  - Easier if there are lots of registers

CISC Is Dominant
• The x86 architecture, which is a CISC architecture, dominates the processor market
  - A vast amount of existing software
  - Intel, AMD, Microsoft and others benefit from this
    • Intel and AMD put a lot of money into making high performance x86 processors, despite the architectural disadvantage
    • Current x86 processors give the best cost/performance
• CISC processors use arch ideas from the RISC world
  - Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
    • the inside core looks
      much like that of a RISC processor

Software Specific Extensions
• Extend arch to accelerate exec of specific apps
• Example: SSE™ – Streaming SIMD Extensions
  - 128-bit packed (vector) / scalar single precision FP (4×32)
  - Introduced on Pentium® III in 1999
  - 8 new 128-bit registers (XMM0 – XMM7)
  - Accelerates graphics, video, scientific calculations, …
• Packed (128 bits): [x3, x2, x1, x0] + [y3, y2, y1, y0] = [x3+y3, x2+y2, x1+y1, x0+y0]
• Scalar (128 bits): [x3, x2, x1, x0] + [y3, y2, y1, y0] = [y3, y2, y1, x0+y0]
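
The difference between the packed and scalar SSE additions on the slide above can be illustrated with a small model. This is only a sketch in Python, not real SSE code: the function names `packed_add` and `scalar_add` are hypothetical, lists are indexed by lane (index 0 is lane 0), and ordinary Python floats stand in for 32-bit single precision values.

```python
def packed_add(x, y):
    """Packed add: all four lanes are added independently."""
    return [a + b for a, b in zip(x, y)]


def scalar_add(x, y):
    """Scalar add: only lane 0 is added.

    Lanes 1..3 of the result are copied from y, matching the
    result [y3, y2, y1, x0+y0] drawn on the slide.
    """
    return [x[0] + y[0]] + list(y[1:])


# Hypothetical register contents, listed as lanes 0..3.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

print(packed_add(x, y))  # four additions in one operation
print(scalar_add(x, y))  # one addition, three lanes passed through
```

One packed operation does the work of four scalar additions, which is how such an extension reduces the instruction count.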
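
Going back to the "Calculating the CPI of a Program" and "CPU Time" slides, the two formulas can be checked numerically. The sketch below is illustrative only: the function names, the instruction mix and the 0.5 ns clock cycle are assumptions, and only CPI_add = 1 and CPI_mul = 3 come from the slides (the load cost of 2 cycles is made up).

```python
def program_cpi(counts, cycles):
    """CPI = (sum of CPI_i * IC_i) / IC."""
    ic = sum(counts.values())
    total_cycles = sum(cycles[kind] * n for kind, n in counts.items())
    return total_cycles / ic


def program_cpi_from_frequencies(counts, cycles):
    """Same CPI, computed as the sum of CPI_i * F_i with F_i = IC_i / IC."""
    ic = sum(counts.values())
    return sum(cycles[kind] * (n / ic) for kind, n in counts.items())


def cpu_time_ns(ic, cpi, clock_cycle_ns):
    """CPU Time = IC * CPI * clock cycle."""
    return ic * cpi * clock_cycle_ns


# Hypothetical instruction mix (IC_i) and per-type cycle counts (CPI_i).
counts = {"add": 600, "mul": 200, "load": 200}
cycles = {"add": 1, "mul": 3, "load": 2}

ic = sum(counts.values())
cpi = program_cpi(counts, cycles)
print("IC =", ic)
print("CPI =", cpi)
print("CPU time (ns) =", cpu_time_ns(ic, cpi, clock_cycle_ns=0.5))
```

With this mix the program needs 1600 cycles for 1000 instructions, so the CPI is 1.6, and at a 0.5 ns clock cycle (2 GHz) the CPU time is about 800 ns.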
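
The Amdahl's Law slides can be reproduced the same way. This is a minimal sketch; the function name `amdahl_speedup` is an assumption, and its two parameters correspond to Fraction_enhanced and Speedup_enhanced in the formula.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the task is accelerated."""
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# The example from the slides: 10% of instructions are FP, FP runs 2x faster.
print(round(amdahl_speedup(0.1, 2.0), 3))  # 1.053

# Even an enormous FP speedup is bounded by the unaffected 90%.
print(round(amdahl_speedup(0.1, 1e9), 3))  # 1.111
```

The second call shows why the corollary is to make the common case fast: with only 10% of the work affected, the overall speedup can never exceed 1/0.9.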
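
The S-curve from the "Evaluating Performance of Future CPUs" slide is just the per-application IPC change, sorted. The sketch below assumes the simulator output is available as two dictionaries of IPC per application; the function name, application names and numbers are all hypothetical.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return (application, IPC change in %) pairs, sorted by the change.

    The baseline (0%) is the processor without the new feature.
    """
    gains = {
        app: (feature_ipc[app] / baseline_ipc[app] - 1.0) * 100.0
        for app in baseline_ipc
    }
    return sorted(gains.items(), key=lambda item: item[1])


# Hypothetical simulator results for three applications.
baseline = {"app_a": 1.00, "app_b": 2.00, "app_c": 1.00}
with_feature = {"app_a": 1.05, "app_b": 1.96, "app_c": 1.00}

for app, gain in s_curve(baseline, with_feature):
    print(f"{app}: {gain:+.1f}%")
```

Plotting the sorted values gives the S shape; the left end of the list holds the negative outliers and the right end the positive ones.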