CSCE 713 Computer Architecture
January 10, 2012

Topics
• Speedup
• Amdahl’s Law
• Execution time
• Readings
Overview

Readings for today
• The Landscape of Parallel Computing Research: A View from Berkeley, EECS-2006-183
• Parallel Benchmarks Inspired by Berkeley Dwarfs
• Ka10_7dwarfsOfSymbolicComputation (new)
Topics overview
• Syllabus and other course pragmatics
  • Website (not shown)
  • Dates
• Power wall, ILP wall → the move to multicore
• Seven Dwarfs
• Amdahl’s Law, Gustafson’s Law
Introduction: Single Processor Performance

[figure: growth in single-processor performance, the RISC era, and the move to multi-processor. Copyright © 2012, Elsevier Inc. All rights reserved.]
Power Wall

Note that both dynamic power and dynamic energy have voltage squared as the dominant term:

\[ \text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{CapacitiveLoad} \times \text{Voltage}^2 \times \text{FrequencySwitched} \]

\[ \text{Energy}_{\text{dynamic}} = \text{CapacitiveLoad} \times \text{Voltage}^2 \]

So lower voltage improves both; supply voltage dropped from 5V to 1V over a period of time, but the scaling can’t continue without errors.
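To see how strongly the squared term dominates, a quick worked example (illustrative numbers of my own, not from the slide): dropping the supply voltage from 5V to 1V at a fixed capacitive load and switching frequency cuts dynamic power by a factor of 25.

\[ \frac{\text{Power}_{5\text{V}}}{\text{Power}_{1\text{V}}} = \frac{\tfrac{1}{2}\,C \cdot 5^2 \cdot f}{\tfrac{1}{2}\,C \cdot 1^2 \cdot f} = 25 \]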
Static Power

CMOS chips have power loss due to current leakage even when the transistor is off. In 2006 the goal for leakage was 25%.

\[ \text{Power}_{\text{static}} = \text{Current}_{\text{static}} \times \text{Voltage} \]
Single CPU, Single Thread Programming Model

[figure]
CSAPP – Bryant & O’Hallaron

[figure]
Topics Covered

• The need for gains in performance
• The need for parallelism
• Amdahl’s and Gustafson’s laws
• Various problems: the 7 Dwarfs and …

Various approaches:
• Multithreaded: multicore, POSIX pthreads, Intel’s TBB
• Distributed – MPI
• Shared memory – OpenMP
• GPUs
• Grid computing
• Cloud computing
• Bridges between these
Top 10 Challenges in Parallel Computing

By Michael Wrinn (Intel), in priority order:

1. Finding concurrency in a program – how to help programmers “think parallel”?
2. Scheduling tasks at the right granularity onto the processors of a parallel machine.
3. The data locality problem: associating data with tasks and doing it in a way that our target audience will be able to use correctly.
4. Scalability support in hardware: bandwidth and latencies to memory plus interconnects between processing elements.
5. Scalability support in software: libraries, scalable algorithms, and adaptive runtimes to map high-level software onto platform details.
6. Synchronization constructs (and protocols) that enable programmers to write programs free from deadlock and race conditions (see the sketch after this list).
7. Tools, APIs, and methodologies to support the debugging process.
8. Error recovery and support for fault tolerance.
9. Support for good software engineering practices: composability, incremental parallelism, and code reuse.
10. Support for portable performance. What are the right models (or abstractions) so programmers can write code once and expect it to execute well on the important parallel platforms?

http://www.multicoreinfo.com/2009/01/wrinn-top-10-challenges/
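As a concrete instance of challenge 6, here is a minimal POSIX pthreads sketch (my own example, not from the slides) in which a mutex protects a shared counter so concurrent increments stay free of data races:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter; the mutex serializes
   the read-modify-write so no updates are lost to a race. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", counter); /* always 4000000 with the lock */
    return 0;
}

Without the lock, the final count varies from run to run, which is exactly the kind of bug challenge 6 asks synchronization constructs to rule out.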
Berkeley Conventional Wisdom

1. Old CW: Power is free, but transistors are expensive.
   · New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
   · New CW: For desktops and servers, static power due to leakage can be 40% of total power.
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
   · New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.
   · New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.
5. Old CW: Researchers demonstrate new architecture ideas by building chips.
   · New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.
6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
   · New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast.
   · New CW is the “Memory wall” [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
   · New CW is the “ILP wall”: There are diminishing returns on finding more ILP.
9. Old CW: Uniprocessor performance doubles every 18 months.
   · New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 of the report plots processor performance for almost 30 years. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
    · New CW: It will be a very long wait for a faster sequential computer.
11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
    · New CW: Increasing parallelism is the primary method of improving processor performance.
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
    · New CW: Given the switch to parallel computing, any speedup via parallelism is a success.

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Amdahl’s Law

Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the fraction of the time the enhancement can be used:

\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Frac}_{\text{enhanced}}) + \dfrac{\text{Frac}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]
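As a quick check of the formula, here is a small C sketch (my own, not course code); the function name and numbers are illustrative:

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction `frac` of execution
   time is accelerated by a factor `speedup`. */
static double amdahl(double frac, double speedup) {
    return 1.0 / ((1.0 - frac) + frac / speedup);
}

int main(void) {
    /* 90% of the time enhanced, e.g. parallelized across 8 processors. */
    printf("frac=0.9, speedup=8   -> %.2f\n", amdahl(0.9, 8.0));   /* ~4.71 */
    printf("frac=0.9, speedup=100 -> %.2f\n", amdahl(0.9, 100.0)); /* ~9.17 */
    return 0;
}

Note the ceiling: as the enhanced part gets arbitrarily fast, overall speedup only approaches 1 / (1 − 0.9) = 10.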
Exec Time of Parallel Computation

[figure]
Gustafson’s Law: Scale the Problem

http://en.wikipedia.org/wiki/Gustafson%27s_law
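For reference, one common statement of the law (per the cited Wikipedia page) gives the scaled speedup on N processors when a fraction s of the time measured on the parallel system is serial:

\[ S(N) = N - s\,(N - 1) \]

Unlike Amdahl’s Law, the problem size grows with N, so speedup scales nearly linearly when s is small.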
Matrix Multiplication – Scaling the Problem

Note we would really scale a model of a “real problem,” but matrix multiplication might be one step required.
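As an illustration (a sketch of my own, not the course’s code), a triple-loop matrix multiply whose problem size n can be scaled up, with the outer loop parallelized via OpenMP; the O(n³) work grows with n in the spirit of Gustafson’s scaled speedup:

#include <omp.h>

/* C = A * B for n x n matrices stored in row-major order.
   The outer loop parallelizes cleanly: each thread writes
   distinct rows of C, so no synchronization is needed. */
void matmul(const double *A, const double *B, double *C, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}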
Phillip Colella’s “Seven Dwarfs”

High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

If we add 4 dwarfs for embedded computing, the list covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational logic
11. Finite State Machine
Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same.

Well-defined targets from an algorithmic, software, and architecture standpoint.

Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004
www.eecs.berkeley.edu/bears/presentations/06/Patterson.ppt
Seven Dwarfs – Dense Linear Algebra

Data are dense matrices or vectors.
• Generally, such applications use unit-stride memory accesses to read data from rows, and
• strided accesses to read data from columns.
• Communication pattern [figure; black is no communication]
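To make the stride distinction concrete (my example, not from the report): reading a row of a row-major n×n matrix is unit-stride, while reading a column jumps n elements per access, wasting most of each cache line.

/* Row-major n x n matrix: element (i, j) lives at m[i * n + j]. */

/* Unit-stride: consecutive iterations touch adjacent memory,
   so each fetched cache line is fully used. */
double sum_row(const double *m, int n, int i) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        s += m[i * n + j];
    return s;
}

/* Stride-n: consecutive iterations jump n doubles apart,
   touching a new cache line on almost every access. */
double sum_col(const double *m, int n, int j) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += m[i * n + j];
    return s;
}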
Seven Dwarfs – Sparse Linear Algebra

[figure]
Seven Dwarfs – Spectral Methods (e.g., FFT)

[figure]
Seven Dwarfs – N-Body Methods

Depends on interactions between many discrete points. Variations include particle-particle methods, where every point depends on all others, leading to an O(N²) calculation, and hierarchical particle methods, which combine forces or potentials from multiple points to reduce the computational complexity to O(N log N) or O(N).
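A minimal particle-particle sketch (mine, with a made-up gravity-like kernel) showing where the O(N²) comes from: every point loops over every other point.

#include <math.h>

typedef struct { double x, y, m; } Particle;

/* Particle-particle method: the doubly nested loop over all pairs
   is what makes each step O(N^2). Hierarchical methods such as
   Barnes-Hut replace the inner loop with approximated clusters. */
void accumulate_forces(const Particle *p, double *fx, double *fy, int n) {
    for (int i = 0; i < n; i++) {
        fx[i] = fy[i] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double r2 = dx * dx + dy * dy + 1e-9; /* softening term */
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            fx[i] += p[i].m * p[j].m * dx * inv_r3;
            fy[i] += p[i].m * p[j].m * dy * inv_r3;
        }
    }
}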
Seven Dwarfs – Structured Grids

[figure]
Seven Dwarfs – Unstructured Grids

An irregular grid where data locations are selected, usually by underlying characteristics of the application.
Seven Dwarfs – Monte Carlo

Calculations depend on statistical results of repeated random trials. Considered embarrassingly parallel. Communication is typically not dominant in Monte Carlo methods.

Embarrassingly Parallel / NSF TeraGrid

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
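For example (a sketch of my own), a Monte Carlo estimate of π with OpenMP: the trials are independent, so the only communication is the final reduction of the hit count, which is why the dwarf counts as embarrassingly parallel.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Estimate pi by sampling random points in the unit square and
   counting those inside the quarter circle. Each thread keeps its
   own RNG state; `hits` is combined only at the end. */
int main(void) {
    const long trials = 10000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 12345u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < trials; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi ~= %f\n", 4.0 * hits / (double)trials);
    return 0;
}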
Principle of Locality

Rule of thumb: a program spends 90% of its execution time in only 10% of the code. So what do you try to optimize?

Locality of memory references:
• Temporal locality
• Spatial locality
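A small illustration of the two kinds of locality (my example): the loop below walks the array sequentially, so each cache line is used for several consecutive elements (spatial locality), while the accumulator and loop bounds are reused on every iteration (temporal locality).

/* Sequential scan: good spatial locality on a[], good temporal
   locality on `sum`, `n`, and `i`, which stay in registers. */
double average(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum / n;
}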
Taking Advantage of Parallelism

• Logic parallelism – carry-lookahead adder
• Word parallelism – SIMD
• Instruction pipelining – overlapping fetch and execute
• Multithreading – executing independent instructions at the same time
• Speculative execution
Linux – System Info

saluda> lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
CPU MHz:               2393.830
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
saluda>
Control Panel → System and Sec… → System

[figure]
Task Manager

[figure]