High Performance Computer
Architecture Challenges
Rajeev Balasubramonian
School of Computing, University of Utah
Dramatic Clock Speed Improvements!!

[Chart: clock speed (in MHz) of Intel processors, 1971 to 2003 – from
the 1st Intel processor at 108 KHz to the Intel Pentium 4 at 3.2 GHz]
Clock Speed = Performance ?
• The Intel Pentium4 has a higher clock speed than the IBM Power4 –
  does the Pentium4 execute your program faster?

[Diagram: two cases plotted against time, marking clock ticks and
completing instructions]
Performance = Clock Speed x Parallelism
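For instance (hypothetical numbers, purely to illustrate the formula):
a 3.2 GHz processor averaging 0.5 instructions per cycle delivers
1.6 billion instructions per second, while a 1.3 GHz processor
averaging 1.5 instructions per cycle delivers 1.95 billion – the
slower clock wins.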
What About Parallelism?

[Chart: parallelism, measured as SPECInt / MHz, for Intel, Alpha, and
HP processors from 1995 to 2000; y-axis from 0 to 0.09]
Dramatic Clock Speed Improvements!!

[Chart repeated: Intel clock speeds (in MHz), 1971 to 2003]
The Basic Pipeline
Consider an automobile assembly line:

[Diagram: Stage 1 → Stage 2 → Stage 3 → Stage 4, one day per stage –
a new car rolls out every day; with the work divided into half-day
stages, a new car rolls out every half day]

In each case, it takes 4 days to build a car, but…
More stages → more parallelism and less time between cars
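A minimal sketch of this arithmetic in code (the 4-day figure is from
the slide; the stage counts are assumed for illustration):

    # Fixed total work; more pipeline stages shrink the time per stage
    # (the clock period) and raise throughput, while the end-to-end
    # latency of one car (or instruction) stays the same.
    TOTAL_WORK_DAYS = 4.0

    for n_stages in (1, 4, 8):
        cycle = TOTAL_WORK_DAYS / n_stages   # time spent in each stage
        latency = n_stages * cycle           # always 4 days end to end
        throughput = 1.0 / cycle             # cars rolling out per day
        print(f"{n_stages} stages: latency {latency:.0f} days, "
              f"{throughput:.1f} cars/day")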
What Determines Clock Speed?
• Clock speed is a function of the work done in each stage – in the
  earlier examples, the clock speeds were 1 car/day and 2 cars/day
• Similarly, it takes plenty of “work” to execute an instruction, and
  this work is broken into stages

[Diagram: execution of a single instruction divided into pipeline
stages – a 250ps stage → a 4GHz clock speed]
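To spell out the arithmetic: the clock frequency is the reciprocal of
the per-stage delay, so a 250ps stage gives
f = 1 / (250 × 10^-12 s) = 4 × 10^9 Hz = 4 GHz.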
Clock Speed Improvements
• Why have we seen such dramatic improvements in clock speed?
  → work has been broken up into more stages
    – early Intel chips executed work equivalent to approximately
      56 logic gates per cycle; today’s chips execute only about
      12 logic gates’ worth of work
  → transistors have been getting faster
    – as technology improves, we can draw smaller and smaller
      transistors/gates on a chip, and that improves their speed
      (doubles every 5-6 years)
Will these Improvements Continue?
• Transistors will continue to shrink and become
faster for at least 10 more years
• Each pipeline stage is already pretty small –
improvements from this factor will cease
• If clock speed improvements stagnate, should
we turn our focus to parallelism?
Microprocessor Blocks

[Block diagram: Branch Predictor, L1 Instruction Cache, L2 Cache,
Decode & Rename, Issue Logic, Register File, four ALUs, and an L1
Data Cache]
Innovations: Branch Predictor
Improve prediction accuracy by detecting frequent patterns

[Block diagram repeated, highlighting the Branch Predictor]
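As an illustration of pattern-based prediction (a classic textbook
scheme, not necessarily the one meant on the slide), here is a
minimal sketch of a two-bit saturating-counter predictor: a branch
must go the “wrong” way twice before the prediction flips, which
handles loop branches well.

    # Minimal sketch of a 2-bit saturating-counter branch predictor.
    # States 0-1 predict not-taken, states 2-3 predict taken; two
    # consecutive mispredictions are needed to flip the prediction.
    class TwoBitPredictor:
        def __init__(self, n_entries=1024):
            self.counters = [2] * n_entries  # start weakly taken

        def predict(self, pc):
            return self.counters[pc % len(self.counters)] >= 2

        def update(self, pc, taken):
            i = pc % len(self.counters)
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)

    # A loop branch that is taken 9 times and then falls through is
    # mispredicted only at the exit, not again on loop re-entry.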
Innovations: Out-of-order Issue
Out-of-order issue: if later instructions do not depend on earlier
ones, execute them first

[Block diagram repeated, highlighting the Issue Logic]
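A minimal sketch of the idea (illustrative toy machinery, not any
real issue logic): each cycle, scan a window of decoded instructions
and issue those whose source registers are ready, even if older
instructions are still stalled.

    # Toy out-of-order issue: pick ready instructions from a window.
    # An instruction is (name, dest_reg, src_regs); ready_regs holds
    # the registers whose values are available.
    def issue_ready(window, ready_regs, width=2):
        issued = []
        for instr in list(window):
            name, dest, srcs = instr
            if all(s in ready_regs for s in srcs):
                issued.append(instr)
                window.remove(instr)
                if len(issued) == width:
                    break
        return issued

    window = [("load", "r1", ["r2"]),       # waiting on r2
              ("add",  "r3", ["r4", "r5"]),  # independent: goes first
              ("sub",  "r6", ["r1", "r3"])]  # depends on the load
    print(issue_ready(window, ready_regs={"r4", "r5"}))
    # -> [('add', 'r3', ['r4', 'r5'])]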
Innovations: Superscalar Architectures
Multiple ALUs: increase execution bandwidth

[Block diagram repeated, highlighting the four ALUs]
Innovations: Data Caches
2K papers on caches: efficient data layout, stride prefetching

[Block diagram repeated, highlighting the caches]
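To make stride prefetching concrete, a minimal sketch (an
illustration under assumed structures, not any particular design):
if successive accesses from the same load are a fixed distance
apart, fetch the next address in the pattern ahead of time.

    # Toy stride prefetcher: remember the last address and stride per
    # load instruction (keyed by its PC); once a stride repeats,
    # prefetch the next predicted address.
    last_addr, last_stride = {}, {}

    def observe(pc, addr):
        prefetch = None
        if pc in last_addr:
            stride = addr - last_addr[pc]
            if stride != 0 and last_stride.get(pc) == stride:
                prefetch = addr + stride   # pattern confirmed twice
            last_stride[pc] = stride
        last_addr[pc] = addr
        return prefetch

    # A load walking an array of 8-byte elements:
    for a in (1000, 1008, 1016, 1024):
        print(a, "->", observe(pc=0x40, addr=a))
    # prefetches 1024, then 1032, once the 8-byte stride repeats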
Summary
• Historically, computer engineers have focused on
performance
• Performance is a function of clock speed and
parallelism
• As technology improves, clock speeds will
improve, although at a slower rate
• Parallelism has been gradually improving, and much of the
low-hanging fruit has already been picked
Outline
• Recent Microprocessor History
• Current Trends and Challenges
• Solutions for Handling these Challenges
Trend I : An Opportunity
• Transistors on a chip have been doubling every
two years (Moore’s Law)
• In the past, transistors have been used for
out-of-order logic, large caches, etc…
• In the future, transistors can be employed for
multiple processors on a single chip
Chip Multiprocessors (CMP)
• The IBM Power4 has two processors on a die
• Sun has announced the 8-processor Niagara

[Diagram: four processor cores P1–P4 sharing an L2 cache]
The Challenge
• Nearly every chip will have multiple processors,
but where are the threads?
• Some applications will truly benefit – they can be
easily decomposed into threads
• Some applications are inherently sequential – can
we execute speculative threads to speed up these
programs? (open problem!)
Trend II : Power Consumption
• Power ∝ a · f · C · V², where a is the activity factor,
f is frequency, C is capacitance, and V is voltage
• Every new chip has higher frequency, more
transistors (higher C), and slightly lower voltage –
the net result is an increase in power consumption
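A small numeric sketch of why the terms conspire upward (the
generation-to-generation scaling factors below are hypothetical,
chosen only to exercise the formula):

    # Dynamic power scales as P ∝ a * f * C * V^2. Plug in a
    # hypothetical generational change: frequency up 1.5x, capacitance
    # (more transistors) up 1.4x, voltage down 10%.
    a, f, C, V = 1.0, 1.0, 1.0, 1.0          # normalized old chip
    f2, C2, V2 = 1.5 * f, 1.4 * C, 0.9 * V   # hypothetical new chip

    old_power = a * f * C * V**2
    new_power = a * f2 * C2 * V2**2
    print(f"power grows by {new_power / old_power:.2f}x")  # ~1.70x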
Scary Slide!
• Power density cannot be allowed to increase at
current rates (Source: Borkar et al., Intel)
Impact of Power Increases
• Well, UtahPower sends you fatter bills every month
• To maintain constant chip temperature, heat
produced on a chip has to be dissipated away –
every additional watt increases the cooling cost of a
chip by approximately $4!!
• If the temperature of a chip rises, the power dissipated
also increases (almost exponentially) → a vicious cycle!
Trend III : Wire Delays
• As technology improves, logic gates shrink →
their speed increases and clock speeds improve
• As logic gates shrink, wires shrink too –
unfortunately, their speed improves only
marginally
• In relative terms, future chips will have fast
transistors/gates and slow wires
Computation is cheap, communication is expensive!
Impact of Wire Delays
• Crossing the chip used to take one cycle
• In the future, crossing the chip can take up to 30
cycles
• Many structures on a chip are wire-constrained
(register file, cache) – their access times slow
down → throughput decreases as instructions
sit around waiting for values
• Long wires also consume power
Trend IV : Soft Errors
• High energy particles constantly collide with
objects and deposit charge
• Transistors are becoming smaller and on-chip
voltages are being lowered → it doesn’t take much
to toggle the state of the transistor
• The frequency of this occurrence is projected to
increase by nine orders of magnitude over a 20
year period
Impact of Soft Errors
• When a particle strike occurs, the component is
not rendered permanently faulty – only the value
it contains is erroneous
• Hence, this is termed a transient fault or soft error
• The error propagates when other instructions read
this faulty value
• This is already a problem for mission-critical apps
(space, defense, highly-available servers) and may
soon be a problem in other domains
Summary of Trends
• More transistors, more processors on a single chip
• High power consumption
• Long wire delays
• Frequent soft errors
We are attempting to exploit transistors to increase
parallelism – in light of the above challenges, we’d
be happy to even preserve parallelism
Transistors & Wire Delays
• Bring in a large window of instructions so you
can find high parallelism
• Distribute instructions across processors so that
communication is minimized

[Diagram: a window of instructions mapped across multiple processors]
Difficult Branches
• Mispredicted branches result in poor parallelism
and wasted work (power)
• Solution: when you arrive at a fork, take both
directions – execute on low-frequency units to
control power dissipation levels

[Diagram: instructions from both branch directions mapped across
processors]
Thermal Emergencies
• Heterogeneous units allow you to reduce cooling
costs
• If a chip’s peak power is 110W, allow enough
cooling to handle 100W average – save $40/chip!
• If the application starts consuming more than
100W and temperature starts to rise, start
favoring the low-power processor cores –
intelligent management allows you to make
forward progress even in a thermal emergency
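The arithmetic behind the $40 figure, using the roughly $4-per-watt
cooling cost from the earlier slide: provisioning cooling for the
100W average instead of the 110W peak removes 10W of cooling
capacity, and 10W × $4/W = $40 per chip.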
Handling Long Wire Delays
• Wires can be designed to have different properties
• Knob 1: wire width and spacing – fat wires are
faster, but fewer fit in a given area, so bandwidth is lower
Handling Wire Capacitance
• Knob 2: wires have repeaters/buffers – many,
large buffers → low delay, but high power consumption
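A minimal sketch of the repeater trade-off, using the standard
quadratic RC wire-delay model (the constants are hypothetical):

    # Unrepeated wire delay grows quadratically with length (RC
    # delay); inserting repeaters makes it roughly linear, at the
    # cost of the power burned in each repeater.
    R_PER_MM, C_PER_MM = 1.0, 1.0   # hypothetical normalized wire RC
    BUF_DELAY = 0.2                 # hypothetical repeater delay

    def wire_delay(length_mm, n_segments):
        seg = length_mm / n_segments
        rc = 0.5 * (R_PER_MM * seg) * (C_PER_MM * seg)  # per segment
        return n_segments * (rc + BUF_DELAY)

    for n in (1, 5, 10):
        print(f"{n:2d} segments: delay {wire_delay(10.0, n):.2f}")
    # 1 segment: 50.2; 10 segments: 7.0 – much faster, but each of
    # the 10 buffers burns power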
Mapping Data to Wires
• We can optimize wires for delay, bandwidth, power
• Different data transfers on a chip have different
latency and bandwidth needs – an intelligent
mapping of data to wires can improve performance
and lower power consumption
Handling Soft Errors
• Errors can be detected and corrected by providing
redundancy – execute two copies of a program
(perhaps, on a CMP) and compare results
• Note that this doubles power consumption!
[Diagram: a leading thread and a trailing thread run two copies of
the program and compare results]
Handling Soft Errors
• The trailing thread is capable of higher performance
than the leading thread – but there’s no point catching
up – hence, artificially slow the trailing thread by
lowering its frequency → lower power dissipation
[Diagram: the leading thread (peak throughput 1 BIPS) feeds a
trailing thread capable of 2 BIPS; the trailing thread never fetches
data from memory and never guesses at branches]
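A minimal sketch of the redundancy idea (illustrative only – real
designs compare results in hardware, often per instruction):

    # Toy redundant execution: run two copies of the same computation
    # and compare; a transient bit-flip in one copy shows up as a
    # mismatch rather than silently corrupting the output.
    def compute(x, fault_bit=None):
        y = x * x + 1
        if fault_bit is not None:        # model a particle strike
            y ^= 1 << fault_bit          # flip one bit of the value
        return y

    leading = compute(7)
    trailing = compute(7, fault_bit=3)   # injected soft error
    if leading != trailing:
        print("soft error detected: re-execute")  # recover by retry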
Summary of Solutions
• Heterogeneous wires and processors
• Instructions and data have different needs: map
them to appropriate wires and processors
• Note how these solutions target multiple issues
simultaneously: slow wires, many transistors,
soft errors, power/thermal emergencies
Conclusions
• Performance has improved because of clock
speed and parallelism advances
• Clock speed improvements will continue at a
slower rate
• Parallelism is on a downward trend because of
technology trends and because the low-hanging
fruit has been picked
• We must find creative ways to preserve or even
improve parallelism in the future