Computer Architecture (“MAMAS”, 234267), Spring 2012
Lecturer: Dan Tsafrir
Reception: Mon 18:30, Taub 611
12/3/2012
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
Computer Architecture 2012 – Introduction (lec1)

General Info
- Grade
  - 20% exercises (mandatory; the grade is binding)
  - 80% final exam
- Textbook
  - “Computer Architecture: A Quantitative Approach” (4th Edition) by Hennessy & Patterson
- Other course information
  - Course web site: http://webcourse.cs.technion.ac.il/234267/Spring2012
  - Lectures will be uploaded to the web a day before the class

Computer System Structure

Classical Motherboard Diagram
[Figure: block diagram. More to the “north” = closer to the CPU = faster. Labels: CPU with cache; CPU bus; North Bridge (memory controller, IOMMU, on-board graphics); memory bus with DDR2 or DDR3 channels 1 and 2; PCI Express 2.0 with external graphics card; South Bridge; PCI Express ×1; PCI; USB controller with keyboard and mouse; SATA controller with hard disk and DVD drive; sound card with speakers; LAN adapter with LAN; I/O controller with serial port, parallel port, and floppy drive]

Intel Core 2
- Northbridge = MCH = memory controller hub
- Southbridge = ICH = I/O controller hub
- Notice the bandwidths
- 65 to 45 nm

Intel Nehalem (Core i3, i5, i7)
- For high-end i-Series chips, Northbridge functionality moved onto the processor (=> made faster)
- 45 to 32 nm

Intel Sandy Bridge (Core i3, i5, i7)
- 32 to 22 nm
- The trend continues

Course Focus
- Start from the CPU (= processor)
  - Instruction set, performance
  - Pipeline, hazards
  - Branch prediction
  - Out-of-order execution
- Move on to the memory hierarchy
  - Caching
  - Main memory
  - Virtual memory
- Move on to PC architecture
  - Motherboard & chipset, DRAM, I/O,
    disk, peripherals
- End with some advanced topics

The Processor

Architecture vs. Microarchitecture
- Architecture = the processor features as seen by its user = interface
  - Instruction set, number of registers, addressing modes, …
- Microarchitecture = the manner in which the processor is implemented = implementation details
  - Cache size and structure, number of execution units, …
- Note: different processors with different u-archs can support the same arch
  - Example: Intel Pentium-IV vs. Intel Core2 Duo
- We will address both

Why Should We Care?
- Abstractions enhance productivity, so:
  - If we know the arch (= interface), why should we care about the u-arch (= internals)?
- Same goes for the arch
  - Just details for a programmer of a high-level language
- Abstractions only work so long as what’s below works
  - The taxi story: http://vimeo.com/11478146 (4:50-6:00)

Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html

Well-Known Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm

The Story in a Nutshell
[Figure: plot of transistors (1000s), clock speed (MHz), power (W), and instructions/cycle (ILP)]

Took the Industry by Surprise

Dire Implications: Performance

Dire Implications: Sales (2 slides)

Dire Implications: Programmers

Supercomputing: “Top 500 list”

Dire Implications: Supercomputing

Processor Performance

Metrics: IC, CPI, IPC
- CPUs work according to a clock signal
  - Clock cycle: measured in nanoseconds ($10^{-9}$ of a second)
  - Clock frequency = 1 / |clock cycle|: measured in GHz ($10^{9}$ cycles/sec)
- Instruction Count (IC)
  - Total number of instructions executed in the program
- Cycles Per Instruction (CPI)
  - Average #cycles per instruction (in a given program)
  - $CPI = \frac{\#\text{cycles required to execute the program}}{IC}$
  - IPC (= 1/CPI): instructions per cycle. Can be > 1; see the “Story in a Nutshell” slide

Minimizing Execution Time
- CPU Time = time required to execute a program
  - $\text{CPU Time} = IC \times CPI \times \text{clock cycle}$
- Our goal: minimize CPU Time (via any of the above components)
  - Minimize clock cycle: increase GHz (processor design)
  - Minimize CPI: u-arch (e.g., more execution units)
  - Minimize IC: arch + u-arch (e.g., SSE™)
    - SSE = Streaming SIMD Extensions (Intel)

Alternative Way to Calculate CPI
- $IC_i$ = #times an instruction of type $i$ is executed in the program
- $IC$ = #instructions executed in the program: $IC = \sum_{i=1}^{n} IC_i$
- $F_i$ = relative frequency of type-$i$ instructions = $IC_i / IC$
- $CPI_i$ = #cycles to execute a type-$i$ instruction
  - e.g., $CPI_{add} = 1$, $CPI_{mul} = 3$
- #cycles required to execute the program: $\#cyc = \sum_{i=1}^{n} CPI_i \times IC_i$
- CPI: $CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \times IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times F_i$

Performance Evaluation: How?
- No simple answer
- Performance depends on
  - Application
  - Input
- Mathematical analysis
  - Typically impossible
- What to do?
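
The CPI and CPU Time formulas from the metrics slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instruction mix is hypothetical (the counts and the CPI of `load` are made up; CPI_add = 1 and CPI_mul = 3 come from the slide), and the function names `weighted_cpi` and `cpu_time_seconds` are not from the course material.

```python
def weighted_cpi(mix):
    """mix maps instruction type -> (IC_i, CPI_i).

    Returns (IC, #cyc, CPI) where IC = sum of IC_i,
    #cyc = sum of CPI_i * IC_i, and CPI = #cyc / IC.
    """
    ic = sum(count for count, _ in mix.values())
    cycles = sum(count * cpi for count, cpi in mix.values())
    return ic, cycles, cycles / ic


def cpu_time_seconds(ic, cpi, clock_ghz):
    """CPU Time = IC * CPI * clock cycle, with clock cycle = 1 / frequency."""
    clock_cycle_ns = 1.0 / clock_ghz  # 1 GHz corresponds to a 1 ns cycle
    return ic * cpi * clock_cycle_ns * 1e-9


# Hypothetical instruction mix (assumption, for illustration only).
mix = {
    "add": (600, 1),   # CPI_add = 1, as on the slide
    "mul": (200, 3),   # CPI_mul = 3, as on the slide
    "load": (200, 2),  # assumed value
}

ic, cycles, cpi = weighted_cpi(mix)

# Same CPI computed from relative frequencies: CPI = sum of CPI_i * F_i.
cpi_from_freq = sum(c * (count / ic) for count, c in mix.values())

print("IC =", ic, "cycles =", cycles, "CPI =", cpi)
print("CPI from frequencies =", cpi_from_freq)
print("CPU time at 2 GHz (s) =", cpu_time_seconds(ic, cpi, 2.0))
```

Both ways of computing CPI agree up to floating-point rounding, which mirrors the derivation on the slide: dividing the total cycle count by IC is the same as weighting each CPI_i by its relative frequency F_i.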

Benchmarks
- Use benchmarks & measure how long it takes
  - Use real applications (=> no absolute answers)
  - Preferably standardized benchmarks (+ input), e.g.,
    - SPEC INT: integer apps (compression, C compiler, Perl, text processing, …)
    - SPEC FP: floating point apps (mostly scientific)
    - TPC benchmarks: measure transaction throughput (DB)
    - SPEC JBB: models a wholesale company (Java server, DB)
- Sometimes you see FLOPS (“peak” or “sustained”)
  - Supercomputers (top500 list), measured against LINPACK

Evaluating Performance
- Use a performance simulator to evaluate the performance of a new feature / algorithm
  - Models the u-arch in great detail
  - Run 100’s of representative applications
- Produce the performance S-curve
  - Sort the applications according to the IPC increase
  - Baseline (0%) is the processor without the new feature
[Figure: two example S-curves of per-application IPC change. Bad S-curve: range of about -3% to 3%, with negative outliers. Good S-curve: range of about -4% to 6%, with positive outliers and only small negative outliers]

Amdahl’s Law
- Suppose we accelerate the computation such that
  - P = proportion of the computation we make faster
  - S = speedup experienced by the proportion we improved
- For example
  - If an improvement can speed up 40% of the computation => P = 0.4
  - If the improvement makes that portion run twice as fast => S = 2
- Then overall speedup = $\frac{1}{(1 - P) + \frac{P}{S}}$

Amdahl’s Law - Example
- FP operations improved to run 2x faster
  - S = 2, but…
  - P = 0.1: the improvement only affects 10% of the program
- Speedup: $\frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{(1 - 0.1) + \frac{0.1}{2}} = \frac{1}{0.95} \approx 1.053$
- Conclusion
  - Better to make the common case fast…

Amdahl’s Law – Parallelism
- When parallelizing a program
  - P = proportion of the program that can be made parallel
  - 1 - P = inherently serial
  - N = number of processing elements (say,
    cores)
- Speedup: $\frac{1}{(1 - P) + \frac{P}{N}}$
- The serial component imposes a hard limit: $\lim_{N \to \infty} \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{1 - P}$

Instruction Set Design
[Figure: three layers: software, instruction set, hardware]
- The ISA is what the user & compiler see
- The HW implements the ISA

Considerations in ISA Design
- Instruction size
  - Long instructions take more time to fetch from memory
  - Longer instructions require a larger memory
    - Important for small (embedded) devices, e.g., cell phones
- Number of instructions (IC)
  - Reduce IC => reduce runtime (at a given CPI & frequency)
- Virtues of instruction simplicity
  - Simpler HW allows for higher frequency & lower power
  - Optimization can be applied better to simpler code
  - Cheaper HW

Basing Design Decisions on Workload
[Figure: histogram of the immediate argument’s size in bits. x-axis: immediate data bits, 0 to 15; y-axis: 0% to 30%; series: Int. avg., FP avg.]
- 1% of data values > 16 bits
- Having 16 bits is likely good enough

CISC Processors
- CISC - Complex Instruction Set Computer
  - Example: x86
  - The idea: a high-level machine language
    - Back when people programmed in assembly, CISC was supposedly easier
- Characteristics
  - Many instruction types, with many addressing modes
  - Some of the instructions are complex
    - Execute complex tasks
    - Require many cycles
  - ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
    - Registers not used (and, accordingly, only a few registers exist)
  - Variable-length instructions
    - Common instructions get short codes => saves code length

But it Turns Out…

Rank   Instruction              % of total executed
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
       Total                    96%

Simple instructions dominate
instruction frequency

CISC Drawbacks
- Complex instructions and complex addressing modes
  - complicate the processor
  - slow down the simple, common instructions
  - contradict “make the common case fast”
- Compilers don’t use complex instructions / indexing methods
- Variable-length instructions are a real pain in the neck
  - Difficult to decode several instructions in parallel
    - As long as an instruction is not decoded, its length is unknown
      - It is unknown where the instruction ends
      - It is unknown where the next instruction starts
  - An instruction may span more than a single cache line
  - An instruction may span more than a single page

RISC Processors
- RISC - Reduced Instruction Set Computer
  - The idea: simple instructions enable fast hardware
- Characteristics
  - A small instruction set, with only a few instruction formats
  - Simple instructions
    - Execute simple tasks
    - Most of them require a single cycle (with pipeline)
  - A few indexing methods
  - ALU operations on registers only
    - Memory is accessed using Load and Store instructions only
    - Many orthogonal registers
    - Three-address machine: Add dst, src1, src2
  - Fixed-length instructions
- Examples: MIPS™, SPARC™, Alpha™, Power™

RISC Processors (Cont.)
- Simple arch => simple u-arch
  - Room for larger on-die caches
  - Smaller => faster
  - Easier to design & validate (=> cheaper to manufacture)
  - Shorter time-to-market
  - More general-purpose registers (=> fewer memory refs)
- Compiler can be smarter
  - Better pipeline usage
  - Better register allocation
- Existing RISC processors are not “pure” RISC
  - e.g., they support division, which takes many cycles

Compilers and ISA
- Ease of compilation
  - Orthogonality:
    - no special registers
    - few special cases
    - all operand modes available with any data type or instruction type
  - Regularity:
    - no overloading of the meanings of instruction fields
    - streamlined
    - resource needs easily determined
- Register assignment is critical too
  - Easier if there are lots of registers

Still, CISC Is Dominant
- x86 (CISC) dominates the processor market
- Legacy
  - A vast amount of existing software
  - Intel, AMD, Microsoft benefit
  - But they put a lot of money into compensating for the disadvantage
- The CISC arch internally emulates RISC
  - Starting with Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
  - Inside, the core looks much like that of a RISC processor

Software Specific Extensions
- Extend the arch to accelerate execution of specific apps
- Example: SSE™ – Streaming SIMD Extensions
  - 128-bit packed (vector) / scalar single-precision FP (4×32)
  - Introduced on Pentium® III in ’99
  - 8 new 128-bit registers (XMM0 – XMM7)
  - Accelerates graphics, video, scientific calculations, …
[Figure: 128-bit operands (x3, x2, x1, x0) and (y3, y2, y1, y0). Packed: result is (x3+y3, x2+y2, x1+y1, x0+y0). Scalar: result is (y3, y2, y1, x0+y0)]

BACKUP

Compatibility
- Backward compatibility (HW responsibility)
  - When buying new hardware, it can run existing software:
    - i5 can
      run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
- Forward compatibility (SW responsibility)
  - For example: MS Word 2003 can open an MS Word 2010 doc
  - Commonly supports one or two generations behind
- Architecture-independent SW
  - Run SW on top of a VM that does JIT (just-in-time compilation): JVM for Java and CLR for .NET
  - Interpreted languages: Perl, Python
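
Returning to the Amdahl’s Law slides, the speedup formula and its parallel limit can be checked numerically. The following is a minimal sketch, not part of the original deck: the function names `amdahl_speedup` and `parallel_limit` are illustrative, and the P = 0.9 parallel fraction used at the end is an assumed value.

```python
def amdahl_speedup(p, s):
    """Overall speedup = 1 / ((1 - P) + P / S).

    p: proportion of the computation that is improved (0 <= p <= 1)
    s: speedup of the improved proportion (s > 0); for parallelism, s = N
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def parallel_limit(p):
    """Limit of the speedup as N grows without bound: 1 / (1 - P)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1) for a finite limit")
    return 1.0 / (1.0 - p)


# Slide example: FP operations run 2x faster but are only 10% of the program.
print("FP example:", amdahl_speedup(0.1, 2))

# Assumed example: 90% of the program is parallelizable.
for n in (2, 10, 100, 1000):
    print("N =", n, "speedup =", amdahl_speedup(0.9, n))
print("hard limit for P = 0.9:", parallel_limit(0.9))
```

The first call reproduces the slide’s result of about 1.053. In the loop, the speedup approaches the hard limit set by the serial component but never reaches it, no matter how many cores are added.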
