High Performance Computer Architecture Challenges
Rajeev Balasubramonian, School of Computing, University of Utah

Dramatic Clock Speed Improvements!!
[Chart: Intel processor clock speeds (MHz), 1971–2003, rising from 108 KHz for the 1st Intel processor to 3.2 GHz for the Pentium 4]

Clock Speed = Performance?
• The Intel Pentium4 has a higher clock speed than the IBM Power4 – does the Pentium4 execute your program faster?
[Diagram: two timelines of instructions completing per clock tick – Case 1 finishes more instructions than Case 2 in the same time]
• Performance = Clock Speed x Parallelism

What About Parallelism?
[Chart: parallelism, measured as SPECInt/MHz, for Intel, Alpha, and HP processors, 1995–2000]

The Basic Pipeline
• Consider an automobile assembly line with four stages of 1 day each: a new car rolls out every day
• Split each stage in half, giving eight half-day stages: a new car rolls out every half day
• In each case, it takes 4 days to build a car, but… more stages → more parallelism and less time between cars

What Determines Clock Speed?
• Clock speed is a function of the work done in each stage – in the assembly-line examples, the "clock speeds" were 1 car/day and 2 cars/day
• Similarly, it takes plenty of "work" to execute an instruction, and this work is broken into stages
• If the execution of a single instruction is divided into stages of 250 ps each, the clock can tick once every 250 ps → a 4 GHz clock speed

Clock Speed Improvements
• Why have we seen such dramatic improvements in clock speed?
  – The work has been broken up into more stages: early Intel chips executed work equivalent to approximately 56 logic gates per stage; today's chips execute about 12 logic gates' worth of work per stage
  – Transistors have been becoming faster: as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubles every 5–6 years)

Will these Improvements Continue?
• Transistors will continue to shrink and become faster for at least 10 more years
• Each pipeline stage is already pretty small – improvements from this factor will cease
• If clock speed improvements stagnate, should we turn our focus to parallelism?
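To make the arithmetic above concrete, here is a small sketch (Python, not from the talk) relating per-stage delay to clock speed; the 250 ps / 4 GHz figures come from the slide, while the 1 ns total work and the stage counts are illustrative assumptions:

```python
import math

def clock_speed_hz(stage_delay_s: float) -> float:
    """The clock ticks once per pipeline stage, so f = 1 / t_stage."""
    return 1.0 / stage_delay_s

# Assume ~1 ns of total work per instruction (illustrative number).
total_work_s = 1e-9
for stages in (1, 2, 4):
    stage_delay = total_work_s / stages      # work split evenly across stages
    print(f"{stages} stage(s): {stage_delay * 1e12:.0f} ps/stage "
          f"-> {clock_speed_hz(stage_delay) / 1e9:.0f} GHz")

# A 250 ps stage yields the 4 GHz clock mentioned on the slide.
assert math.isclose(clock_speed_hz(250e-12), 4e9)
```

Note that, as with the assembly line, deeper pipelining raises the clock speed (throughput) without shortening the end-to-end time for any single instruction.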
Microprocessor Blocks
[Block diagram: Branch Predictor, L1 Instr Cache, L2 Cache, Decode & Rename, Issue Logic, Register File, four ALUs, L1 Data Cache]

Innovations: Branch Predictor
• Improve prediction accuracy by detecting frequent patterns

Innovations: Out-of-order Issue
• Out-of-order issue: if later instructions do not depend on earlier ones, execute them first

Innovations: Superscalar Architectures
• Multiple ALUs: increase execution bandwidth

Innovations: Data Caches
• 2K papers on caches: efficient data layout, stride prefetching

Summary
• Historically, computer engineers have focused on performance
• Performance is a function of clock speed and parallelism
• As technology improves, clock speeds will improve, although at a slower rate
• Parallelism has been gradually improving, and plenty of the low-hanging fruit has been picked

Outline
• Recent Microprocessor History
• Current Trends and Challenges
• Solutions to Handling these Challenges

Trend I: An Opportunity
• Transistors on a chip have been doubling every two years (Moore's Law)
• In the past, transistors have been used for out-of-order logic, large caches, etc.
• In the future, transistors can be employed for multiple processors on a single chip

Chip Multiprocessors (CMP)
• The IBM Power4 has two processors on a die
• Sun has announced the 8-processor Niagara
[Diagram: processors P1–P4 sharing an L2 cache]

The Challenge
• Nearly every chip will have multiple processors, but where are the threads?
• Some applications will truly benefit – they can be easily decomposed into threads
• Some applications are inherently sequential – can we execute speculative threads to speed up these programs? (open problem!)

Trend II: Power Consumption
• Power ∝ a · f · C · V², where a is the activity factor, f is frequency, C is capacitance, and V is voltage
• Every new chip has a higher frequency, more transistors (higher C), and a slightly lower voltage – the net result is an increase in power consumption

Scary Slide!
• Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)

Impact of Power Increases
• Well, UtahPower sends you fatter bills every month
• To maintain a constant chip temperature, the heat produced on a chip has to be dissipated away – every additional watt increases the cooling cost of a chip by approximately $4!!
• If the temperature of a chip rises, the power dissipated also increases (almost exponentially) → a vicious cycle!

Trend III: Wire Delays
• As technology improves, logic gates shrink → their speed increases and clock speeds improve
• As logic gates shrink, wires shrink too – unfortunately, their speed improves only marginally
• In relative terms, future chips will have fast transistors/gates and slow wires
• Computation is cheap, communication is expensive!
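The dynamic-power relation from Trend II above (Power ∝ a · f · C · V²) can be illustrated with a short sketch; the baseline values and the generation-to-generation scaling factors below are hypothetical, chosen only to show how a frequency and capacitance increase can outrun a modest voltage drop:

```python
def dynamic_power(a: float, f_hz: float, c_farads: float, v_volts: float) -> float:
    """Dynamic switching power: P = a * f * C * V^2."""
    return a * f_hz * c_farads * v_volts ** 2

# Hypothetical baseline chip (numbers are illustrative, not from the talk):
p_old = dynamic_power(a=0.1, f_hz=2e9, c_farads=100e-9, v_volts=1.3)

# Next generation: 1.5x frequency, 1.5x capacitance (more transistors),
# but only a slight voltage reduction -- the scenario the slide describes.
p_new = dynamic_power(a=0.1, f_hz=3e9, c_farads=150e-9, v_volts=1.2)

print(f"old: {p_old:.1f} W, new: {p_new:.1f} W")
assert p_new > p_old  # the quadratic V term cannot offset the f and C growth here
```

The quadratic dependence on V is why voltage scaling was historically the main brake on power growth, and why its slowdown makes the trend "scary".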
Impact of Wire Delays
• Crossing the chip used to take one cycle
• In the future, crossing the chip can take up to 30 cycles
• Many structures on a chip are wire-constrained (register file, cache) – their access times slow down → throughput decreases as instructions sit around waiting for values
• Long wires also consume power

Trend IV: Soft Errors
• High-energy particles constantly collide with objects and deposit charge
• Transistors are becoming smaller and on-chip voltages are being lowered → it doesn't take much to toggle the state of a transistor
• The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period

Impact of Soft Errors
• When a particle strike occurs, the component is not rendered permanently faulty – only the value it contains is erroneous
• Hence, this is termed a transient fault or soft error
• The error propagates when other instructions read the faulty value
• This is already a problem for mission-critical apps (space, defense, highly available servers) and may soon be a problem in other domains

Summary of Trends
• More transistors → more processors on a single chip
• High power consumption
• Long wire delays
• Frequent soft errors
• We are attempting to exploit transistors to increase parallelism – in light of the above challenges, we'd be happy to even preserve parallelism

Transistors & Wire Delays
• Bring in a large window of instructions so you can find high parallelism
• Distribute instructions across processors so that communication is minimized
[Diagram: instructions mapped across processors]

Difficult Branches
• Mispredicted branches result in poor parallelism and wasted work (power)
• Solution: when you arrive at a fork, take both directions – execute on low-frequency units to control power dissipation levels

Thermal Emergencies
• Heterogeneous units allow you to reduce cooling costs
• If a chip's peak power is 110W, allow enough cooling to handle a 100W average – save $40/chip!
• If the application starts consuming more than 100W and the temperature starts to rise, start favoring the low-power processor cores – intelligent management allows you to make forward progress even in a thermal emergency

Handling Long Wire Delays
• Wires can be designed to have different properties
• Knob 1: wire width and spacing – fat wires are faster, but have low bandwidth

Handling Wire Capacitance
• Knob 2: wires have repeaters/buffers – many, large buffers → low delay, high power consumption

Mapping Data to Wires
• We can optimize wires for delay, bandwidth, or power
• Different data transfers on a chip have different latency and bandwidth needs – an intelligent mapping of data to wires can improve performance and lower power consumption

Handling Soft Errors
• Errors can be detected and corrected by providing redundancy – execute two copies of a program (perhaps on a CMP) and compare results
• Note that this doubles power consumption!
• The trailing thread is capable of higher performance than the leading thread (it never fetches data from memory and never guesses at branches, since the leading thread has already resolved those) – but there's no point catching up, so artificially slow the trailing thread by lowering its frequency → lower power dissipation
[Diagram: leading thread, peak throughput 1 BIPS; trailing thread, peak throughput 2 BIPS]

Summary of Solutions
• Heterogeneous wires and processors
• Instructions and data have different needs: map them to appropriate wires and processors
• Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies

Conclusions
• Performance has improved because of clock speed and parallelism advances
• Clock speed improvements will continue at a slower rate
• Parallelism is on a downward trend because of technology trends and because the low-hanging fruit has been picked
• We must find creative ways to preserve or even improve parallelism in the future
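As a closing sketch of the redundancy idea from the soft-error slides: the code below (Python, illustrative only – a real design compares results in hardware, at instruction or checkpoint granularity) runs two copies of a computation and recovers on a mismatch; the bit-flip helper is a hypothetical stand-in for a particle strike:

```python
import random

def compute(x: int) -> int:
    """The 'program' that both redundant copies execute."""
    return x * x + 1

def maybe_flip_bit(value: int, strike_prob: float) -> int:
    """Hypothetical stand-in for a particle strike flipping one bit."""
    if random.random() < strike_prob:
        return value ^ (1 << random.randrange(16))
    return value

def redundant_execute(x: int, strike_prob: float = 0.0) -> int:
    """Run leading and trailing copies; a mismatch reveals a soft error."""
    leading = maybe_flip_bit(compute(x), strike_prob)
    trailing = compute(x)            # trailing copy, assumed strike-free here
    if leading != trailing:
        # Transient fault detected: nothing is permanently broken,
        # so simply re-executing recovers the correct value.
        return compute(x)
    return leading

assert redundant_execute(7) == 50                   # fault-free run
assert redundant_execute(7, strike_prob=1.0) == 50  # strike detected, corrected
```

Comparing two copies detects the error; re-execution (or, in the slides' scheme, rolling the trailing thread forward from known-good state) provides the correction.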