Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS152 Computer Architecture and Engineering Lecture 3 Performance, Technology & Delay Modeling ©UCB Fall 2001 CS152 Lec3.1 Review: Salient features of MIPS I • 32-bit fixed format inst (3 formats) • 32 32-bit GPR (R0 contains zero) and 32 FP registers (+ HI LO) – partitioned by software convention • 3-address, reg-reg arithmetic instr. • Single address mode for load/store: base+displacement – no indirection, scaled • 16-bit immediate plus LUI • Simple branch conditions – compare against zero or two registers for =, – no integer condition codes • Support for 8bit, 16bit, and 32bit integers • Support for 32bit and 64bit floating point. Lec3.2 Review: MIPS Addressing Modes/Instruction Formats • All instructions 32 bits wide Register (direct) op rs rt rd register Immediate Base+index op rs rt immed op rs rt immed register PC-relative op rs PC rt Memory + immed Memory + Lec3.3 Review: When does MIPS sign extend? • When value is sign extended, copy upper bit to full value: Examples of sign extending 8 bits to 16 bits: 00001010 00000000 00001010 10001100 11111111 10001100 • When is an immediate value sign extended? – Arithmetic instructions (add, sub, etc.) sign extend immediates even for the unsigned versions of the instructions! – Logical instructions do not sign extend addi $r2, $r3, -1 has 0xFFFF in immediate field and will extend to 0xFFFFFFFF before adding andi $r2, $r3, -1 has 0xFFFF in immediate field and will extend to 0x0000FFFF before anding – Kinda weird to put negative numbers in logical instructions Lec3.4 Review: Details of the MIPS instruction set • Register zero always has the value zero (even if you try to write it) • Branch/jump and link put the return addr. PC+4 into the link register (R31), also called “ra” • All instructions change all 32 bits of the destination register (including lui, lb, lh) and all read all 32 bits of sources (add, and, …) • The difference between signed and unsigned versions: – For add and subtract: signed causes exception on overflow » No difference in sign-extension behavior! – For multiply and divide, distinguishes type of operation • Thus, overflow can occur in these arithmetic and logical instructions: – add, sub, addi – it cannot occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu, div, divu • Immediate arithmetic and logical instructions are extended as follows: – logical immediates ops are zero extended to 32 bits – arithmetic immediates ops are sign extended to 32 bits (including addu) • The data loaded by the instructions lb and lh are extended as follows: – lbu, lhu are zero extended – lb, lh are sign extended Lec3.5 Performance • Purchasing perspective – given a collection of machines, which has the » best performance ? » least cost ? » best performance / cost ? • Design perspective – faced with design options, which has the » best performance improvement ? » least cost ? » best performance / cost ? • Both require – basis for comparison – metric for evaluation • Our goal is to understand cost & performance implications of architectural choices Lec3.6 Two notions of “performance” Plane DC to Paris Speed Passengers Throughput (pmph) Boeing 747 6.5 hours 610 mph 470 286,700 BAD/Sud Concorde 3 hours 1350 mph 132 178,200 Which has higher performance? ° Time to do the task (Execution Time) – execution time, response time, latency ° Tasks per day, hour, week, sec, ns. .. (Performance) – throughput, bandwidth Response time and throughput often are in opposition Lec3.7 Definitions • Performance is in units of things-per-second – bigger is better • If we are primarily concerned with response time – performance(x) = 1 execution_time(x) " X is n times faster than Y" means Performance(X) n = ---------------------Performance(Y) Lec3.8 Example • Time of Concorde vs. Boeing 747? • Concord is 1350 mph / 610 mph = 2.2 times faster = 6.5 hours / 3 hours • Throughput of Concorde vs. Boeing 747 ? • Concord is 178,200 pmph / 286,700 pmph = 0.62 “times faster” • Boeing is 286,700 pmph / 178,200 pmph = 1.60 “times faster” • Boeing is 1.6 times (“60%”) faster in terms of throughput • Concord is 2.2 times (“120%”) faster in terms of flying time We will focus primarily on execution time for a single job Lots of instructions in a program => Instruction throughput important! Lec3.9 Basis of Evaluation Cons Pros • representative Actual Target Workload • portable • widely used • improvements useful in reality • easy to run, early in design cycle • identify peak capability and potential bottlenecks • very specific • non-portable • difficult to run, or measure • hard to identify cause •less representative Full Application Benchmarks Small “Kernel” Benchmarks Microbenchmarks • easy to “fool” • “peak” may be a long way from application performance Lec3.10 SPEC95 • Eighteen application benchmarks (with inputs) reflecting a technical computing workload • Eight integer – go, m88ksim, gcc, compress, li, ijpeg, perl, vortex • Ten floating-point intensive – tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5 • Must run with standard compiler flags – eliminate special undocumented incantations that may not even generate working code for real programs Lec3.11 Metrics of performance Seconds per program Application Useful Operations per second Programming Language Compiler ISA (millions) of Instructions per second – MIPS (millions) of (F.P.) operations per second – MFLOP/s Datapath Control Megabytes per second Function Units Transistors Wires Pins Cycles per second (clock rate) Each metric has a place and a purpose, and each can be misused Lec3.12 CPI “Average cycles per instruction” CPIave = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count n CPU time = ClockCycleTime * CPI i * I i i =1 n CPI = CPI i * i =1 F i where F i = I i Instruction Count "instruction frequency" Invest Resources where time is Spent! Lec3.13 Aspects of CPU Performance CPU time = Seconds Program = Instructions x Cycles Program instr count CPI Instruction x Seconds Cycle clock rate Program Compiler Instr. Set Organization Technology Lec3.14 Amdahl's Law Speedup due to enhancement E: ExTime w/o E Speedup(E) = -------------------- = ExTime w/ E Performance w/ E --------------------Performance w/o E Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then, ExTime(with E) = ((1-F) + F/S) X ExTime(without E) Speedup(with E) = 1 (1-F) + F/S Lec3.15 Example (RISC processor) Base Machine (Reg / Reg) Op Freq Cycles CPI(i) ALU 50% 1 .5 Load 20% 5 1.0 Store 10% 3 .3 Branch 20% 2 .4 2.2 % Time 23% 45% 14% 18% Typical Mix How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once? Lec3.16 Summary: Evaluating Instruction Sets and Implementation Design-time metrics: ° Can it be implemented, in how long, at what cost? ° Can it be programmed? Ease of compilation? Static Metrics: ° How many bytes does the program occupy in memory? Dynamic Metrics: ° How many instructions are executed? ° How many bytes does the processor fetch to execute the program? CPI ° How many clocks are required per instruction? ° How "lean" a clock is practical? Best Metric: Time to execute the program! Inst. Count Cycle Time NOTE: this depends on instructions set, processor organization, and compilation techniques. Lec3.17 Administrative Matters Lec3.18 Finite State Machines: • System state is explicit in representation • Transitions between states represented as arrows with inputs on arcs. • Output may be either part of state or on arcs 1 “Mod 3 Machine” Input (MSB first) Mod 3 106 1 Alpha/ 0 0 1 Beta/ 1 0 1 1 0101 0 Delta/ 100122 1 1 0 2 Lec3.19 Implementation as Combinational logic + Latch 1/0 “Moore Machine” “Mealey Machine” Latch Combinational Logic Alpha/ 0 0/0 1/1 Beta/ 0/1 Delta/ 1/1 00 01 10 00 01 10 0/0 2 Input State old State new 0 0 0 1 1 1 1 00 10 01 01 00 10 Div 0 0 1 0 1 1 Lec3.20 Performance and Technology Trends 1000 Supercomputers Performance 100 Mainframes 10 Minicomputers Microprocessors 1 0.1 1965 1970 1975 1980 1985 1990 1995 2000 Year • Technology Power: 1.2 x 1.2 x 1.2 = 1.7 x / year – Feature Size: shrinks 10% / yr. => Switching speed improves 1.2 / yr. – Density: improves 1.2x / yr. – Die Area: 1.2x / yr. • The lesson of RISC is to keep the ISA as simple as possible: – Shorter design cycle => fully exploit the advancing technology (~3yr) – Advanced branch prediction and pipeline techniques – Bigger and more sophisticated on-chip caches Lec3.21 Range of Design Styles Custom Control Logic Custom Design Standard Cell Gates Custom ALU Gates Routing Channel Standard ALU Custom Register File Gate Array/FPGA/CPLD Standard Registers Gates Routing Channel Gates Performance Design Complexity (Design Time) Compact Longer wires Lec3.22 Basic Technology: CMOS • CMOS: Complementary Metal Oxide Semiconductor – NMOS (N-Type Metal Oxide Semiconductor) transistors – PMOS (P-Type Metal Oxide Semiconductor) transistors • NMOS Transistor – Apply a HIGH (Vdd) to its gate turns the transistor into a “conductor” – Apply a LOW (GND) to its gate shuts off the conduction path • PMOS Transistor – Apply a HIGH (Vdd) to its gate shuts off the conduction path – Apply a LOW (GND) to its gate turns the transistor into a “conductor” Vdd = 5V GND = 0v Vdd = 5V GND = 0v Lec3.23 Basic Components: CMOS Inverter Vdd Symbol In Circuit PMOS In Out Out NMOS • Inverter Operation Vdd Vout Vdd Vdd Vdd Open Charge Out Open Vdd Vin Discharge Lec3.24 Basic Components: CMOS Logic Gates NOR Gate NAND Gate A A Out B Out 0 0 1 1 B 0 1 0 1 1 1 1 0 A A Out B Vdd 0 0 1 1 B Out 0 1 0 1 1 0 0 0 Vdd A Out B B Out A Lec3.25 Voltage waveforms versus time Voltage 1 => Vdd Vout Vin Vout Vin 0 => GND Time Lec3.26 Series Connection Vin V1 G1 Vdd Vout Vin G2 G1 Vdd V1 G2 C1 Vout Cout Voltage Vdd V1 Vout Vin Vdd/2 d1 d2 GND Time • Total Propagation Delay = Sum of individual delays = d1 + d2 • Capacitance C1 has two components: – Capacitance of the wire connecting the two gates – Input capacitance of the second inverter Lec3.27 Gate Comparison Vdd Vdd A Out B B Out A NAND Gate NOR Gate • PMOS are 3 times slower than NMOS (3 times higher resistance) so if all devices are the same size then a NAND Low to High will be • Better to put NMOS transistors in series Lec3.28 Calculating Delays Vin V1 Vdd V2 Vin G1 V3 Vdd V1 G2 V2 C1 Vdd • Sum delays along serial paths G3 V3 • Delay (Vin -> V2) ! = Delay (Vin -> V3) – Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2) – Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V2) + Delay (V1 > V3) • Critical Path = The longest delay path (Vin->V3) • C1 = Wire Capacitance + Cin of Gate 2 + Cin of Gate 3 Lec3.29 General C/L Cell Delay Model Vout A B . . . Combinational Logic Cell Delay Va -> Vout Cout X X X Internal Delay X delay per load (Cload) X Nanoseconds/femtoFarad = ns/fF Ccritical Cout • Combinational Cell (symbol) is fully specified by: – functional (input -> output) behavior » truth-table, logic equation, VHDL – load at each input – critical propagation delay from each input to each output for each transition » THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load capacitance – Linear model is good enough up to Ccritical Lec3.30 Characterize a Gate • Input capacitance for each input • For each input-to-output path: – For each output transition (H->L, L->H) » Internal delay (ns) - e.g. for low to high from A to O: TPAOlh » Load dependent delay (ns / fF) - e.g. TPAOlhf • Example: 2-input NAND Gate A Delay A -> O O: Low -> High O B Slope = 0.0021ns / fF For A and B: Input Load = 61 fF 0.5ns For A -> O : TPAOlh = 0.5ns TPAOlhf = 0.0021ns / fF TPAOhl = 0.1ns TPAOhlf = 0.0020ns / fF CO Lec3.31 A Specific Example: 2 to 1 MUX A Gate 3 B Gate 2 B Y = (A and !S) or (A and S) Wire 2 2 x 1 Mux Gate 1 Wire 0 A Wire 1 S S Inv: I.L. = 50 fF; T I.D.= .01 ns; T L.D.D. = .01 ns/fF • Input Load (I.L.) – A, B: I.L. (NAND) = 61 fF – S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF • Load Dependent Delay (L.D.D.): Set by Gate 3 – TAYlhf = 0.021 ns / fF TAYhlf = 0.020 ns / fF – TBYlhf = 0.021 ns / fF – TSYlhf = 0.021 ns / fF TBYhlf = 0.020 ns / fF TSYlhf = 0.020 ns / fF Lec3.32 Y 2 to 1 MUX: Internal Delay Calculation A Gate 1 Wire 0 Wire 1 Y = (A and !S) or (A and S) Gate 3 B Gate 2 S Wire 2 L.D.D. = Load Dependent Delay I.D. = Internal Delay • Internal Delay (I.D.): – A to Y: I.D. G1 + (Wire_1_C + G3_Input_C) * L.D.D G1 + I.D. G3 – B to Y: I.D. G2 + (Wire_2_C + G3_Input_C) * L.D.D. G2 + I.D. G3 – S to Y (Worst Case) : I.D. Inv + (Wire_0_C + G1_Input_C) * L.D.D. Inv + Internal Delay A to Y • We can approximate the size of “Wire_1_C” by: – Assume Wire 1 has the same C as the input capacitance off all the gates attached to it (G3 in this case). Therefore the total C that Gate1 needs to drive = 2.0 x G3_Input_C Lec3.33 2 to 1 MUX: Internal Delay Calculation (continue) A Gate 1 Wire 0 Wire 1 Y = (A and !S) or (A and S) Gate 3 B Gate 2 Wire 2 S • Internal Delay (I.D.): – A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D G1 + I.D. G3 – B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3 – S to Y (Worst Case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv + Internal Delay A to Y • Specific Example: – TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3 = 0.1ns + 122 fF * 0.0020 ns/fF + 0.5ns = 0.844 ns Lec3.34 Abstraction: 2 to 1 MUX A Gate 3 B Y B 2 x 1 Mux A Gate 1 Y Gate 2 S S • Input Load: A = 61 fF, B = 61 fF, S = 111 fF • Load Dependent Delay: – TAYlhf = 0.021 ns / fF – TBYlhf = 0.021 ns / fF – TSYlhf = 0.021 ns / fF TAYhlf = 0.020 ns / fF TBYhlf = 0.020 ns / fF TSYlhf = 0.020 ns / f F • Internal Delay: – TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3 = 0.1ns + 122 fF * 0.0020ns/fF + 0.5ns = 0.844ns – Fun Exercises!: TAYhl, TBYlh, TSYlh, TSYlh Lec3.35 Storage Element’s Timing Model Clk D Q Setup D Don’t Care Hold Don’t Care Clock-to-Q Q Unknown • Setup Time: Input must be stable BEFORE the trigger clock edge • Hold Time: Input must REMAIN stable after the trigger clock edge • Clock-to-Q time: – Output cannot change instantaneously at the trigger clock edge – Similar to delay in logic gates, two components: » Internal Clock-to-Q » Load dependent Clock-to-Q Lec3.36 CS152 Building blocks (maybe more….) • Logic elements – NAND2, NAND3, NAND 4 – NOR2, NOR3, NOR4 – INV1x (normal inverter) – INV4x (inverter with large output drive) – XOR2 – XNOR2 – PWR: Source of 1’s – GND: Source of 0’s – fast MUXes • Storage Element – D flip flop - negative edge triggered Lec3.37 Clocking Methodology Clk . . . . . . Combination Logic . . . . . . • All storage elements are clocked by the same clock edge (but there may be clock skews) • The combination logic block’s: – Inputs are updated at each clock tick – All outputs MUST be stable before the next clock tick Lec3.38 Hold Time Violations Clk-to-Q+Delay Clk1 Hold Time Clk2 . . . . . . Combination Logic Clk1 . . . . . . Clk2 • The worst case scenario for hold time consideration: – The input register sees CLK1 – The output register sees CLK2 – fast FF1 output must not change input to FF2 for same clock edge • (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time Lec3.39 Clock Skew’s Effect on Hold Time Violation Clk-to-Q+Delay Clk1 Hold Time Clk2 Clock Skew . . . . . . Combination Logic Clk1 . . . . . . Clk2 • The worst case scenario for hold time consideration: – The input register sees CLK1 – The output register sees CLK2 – fast FF1 output must not change input to FF2 for same clock edge • (CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew or (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time Lec3.40 Critical Path & Cycle Time Clk . . . . . . . . . . . . • Critical path: the slowest path between any two storage devices • Cycle time is a function of the critical path • must be greater than: – Clock-to-Q + Longest Path through the Combination Logic + Setup Lec3.41 Clock Skew’s Effect on Cycle Time Clk1 Clock Skew Clk2 Clock Skw Setup FF1 FF2 . . . . . . Clk1 . . . . . . Clk2 • The worst case scenario for cycle time consideration: – The input register sees CLK1 – The output register sees CLK2 • Cycle Time = CLK-to-Q(FF1) + Longest Delay(C/L) + Setup(FF2) + Clock Skew Lec3.42 Tricks to Reduce Cycle Time • Reduce the number of gate levels A A B B C C D D ° Pay attention to loading ° One gate driving many gates is a bad idea ° Avoid using a small gate to drive a long wire ° Use multiple stages to drive large load INV4x Clarge INV4x Lec3.43 How to Avoid Hold Time Violations? Clk . . . . . . Combination Logic . . . . . . • Hold time requirement: – Input to register must NOT change immediately after the clock tick • This is usually easy to meet in the “edge trigger” clocking scheme • Hold time of most FFs is <= 0 ns • CLK-to-Q + Shortest Delay Path must be greater than Hold Time Lec3.44 Hold Time Violation Clk1 Clk-to-Q+Delay Clk2 Hold Time . . . . . . Combination Logic Clk1 . . . . . . Clk2 • The worst case scenario for hold time consideration: – The input register sees CLK1 – The output register sees CLK2 – fast FF1 output must not change input to FF2 for same clock edge • For no violation (CLK-to-Q + Shortest Delay Path) > Hold Time • A violation is shown above Lec3.45 A hold time violation because of clock skew Clk-to-Q+Delay Clk1 Hold Time Clk2 Clock Skew . . . . . . Clk1 Combination Logic . . . . . . Clk2 • For no violation (CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew or (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time Lec3.46 Summary • Total execution time is the most reliable measure of performance • Amdall’s law: Law of Diminishing Returns • Performance and Technology Trends – Keep the design simple (KISS rule) to take advantage of the latest technology – CMOS inverter and CMOS logic gates • Delay Modeling and Gate Characterization – Delay = Internal Delay + (Load Dependent Delay x Output Load) • Clocking Methodology and Timing Considerations – Simplest clocking methodology » All storage elements use the SAME clock edge – Cycle Time CLK-to-Q + Longest Delay Path + Setup + Clock Skew – (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time Lec3.47 To Get More Information • EECS 141 - Digital Integrated Circuit Design - 105 no longer a prerequisite - only EECS 40 required! • Book: Digital Integrated Circuits - A design perspective - by Jan Rabaey • Web page (slides from book) – http://bwrc.eecs.berkeley.edu/icdesign/instructors.html Lec3.48