CS267/E233
Applications of Parallel
Computers
Lecture 1: Introduction
Kathy Yelick
[email protected]
www.cs.berkeley.edu/~yelick/cs267_spr07
Why powerful
computers are
parallel
circa 1991-2006
Tunnel Vision by Experts
• “I think there is a world market for maybe five
computers.”
- Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a
computer in their home”
- Ken Olson, president and founder of Digital Equipment
Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
- Bill Gates, chairman of Microsoft, 1981.
• “On several recent occasions, I have been asked
whether parallel computing will soon be relegated to
the trash heap reserved for promising technologies
that never quite make it.”
- Ken Kennedy, CRPC Director, 1994
Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity
Moore's Law: 2X transistors/chip every 1.5 years
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
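As a quick illustration of the doubling rule quoted above (a minimal Python sketch; the 1.5-year period is the slide's figure, the 15-year example is ours):

    # Growth factor if capacity doubles every `period` years.
    def transistor_growth(years, period=1.5):
        return 2 ** (years / period)

    # Example: over 15 years capacity grows by 2^10 = 1024x.
    print(transistor_growth(15))   # 1024.0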
Microprocessor Transistors per Chip
• Growth in transistors per chip
• Increase in clock rate
[Figure: log-scale plots of transistors per chip (i4004, i8080, i8086, i80286, i80386, R2000, R3000, R10000, Pentium) and clock rate (MHz), 1970-2005]
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks
by a factor of x ?
• Clock rate goes up by x because wires are shorter
- actually less than x, because of power consumption
• Transistors per unit area goes up by x^2
• Die size also tends to increase
  - typically another factor of ~x
• Raw computing power of the chip goes up by ~x^4 !
  - typically x^3 of this is devoted to either on-chip
    parallelism (hidden parallelism such as ILP)
    or locality (caches)
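A minimal Python sketch of the scaling argument above (the exponents are the slide's; the factor-of-2 shrink is just an example):

    # Feature size shrinks by x: clock ~ x, density ~ x^2, die area ~ x,
    # so raw compute ~ x * x^2 * x = x^4 (ignoring the power caveat above).
    def raw_compute_gain(x):
        clock = x        # wires are shorter
        density = x**2   # transistors per unit area
        die = x          # die size also tends to grow
        return clock * density * die

    print(raw_compute_gain(2))   # 16, i.e. 2^4 for a 2x shrink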
But there are limiting forces
Manufacturing costs and yield problems limit use of density
• Moore's 2nd law (Rock's law): costs go up
  [Figure: demo of 0.06 micron CMOS; Source: Forbes Magazine]
• Yield
  - What percentage of the chips are usable?
  - E.g., Cell processor (PS3) is sold with 7 out of 8 "on" to improve yield
Power Density Limits Serial Performance
More Limits: How fast can a serial computer be?
[Figure: a 1 Tflop/s, 1 Tbyte sequential machine must fit in a radius r = 0.3 mm]
• Consider the 1 Tflop/s sequential machine:
- Data must travel some distance, r, to get from memory
to CPU.
- To get 1 data element per cycle, this means 10^12 times
  per second at the speed of light, c = 3x10^8 m/s. Thus
  r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
- Each bit occupies about 1 square Angstrom, or the
size of a small atom.
• No choice but parallelism
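The back-of-the-envelope numbers above, as a minimal Python sketch (the speed of light and the 1 Tflop/s / 1 Tbyte targets are from the slide; the rest is arithmetic):

    c = 3e8             # speed of light, m/s
    rate = 1e12         # 1 Tflop/s => 1e12 memory accesses per second
    r = c / rate
    print(r)            # 3e-4 m = 0.3 mm: max distance data can travel per cycle

    bits = 8e12         # 1 Tbyte = 8e12 bits
    area = r * r        # 0.3 mm x 0.3 mm square, in m^2
    print(area / bits)  # ~1.1e-20 m^2 per bit, about 1 square Angstrom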
Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
- Clock speed is not
- Number of
processor cores
may double instead
• There is little or
no hidden
parallelism (ILP)
to be found
• Parallelism must
be exposed to
and managed by
software
Source: Intel, Microsoft (Sutter) and
Stanford
(Olukotun, Hammond)
Why Parallelism (2007)?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore
chips
- Every machine will soon be a parallel machine
- All programmers will be parallel programmers???
• New software model
- Want a new feature? Hide the “cost” by speeding up the code first
- All programmers will be performance programmers???
• Some may eventually be hidden in libraries, compilers,
and high level languages
- But a lot of work is needed to get there
• Big open questions:
- What will be the killer apps for multicore machines?
- How should the chips be designed, and how will they be programmed?
Outline
• Why powerful computers must be parallel processors
Including your laptop
• Large important problems require powerful computers
Even computer games
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
Why we need
powerful computers
Units of Measure in HPC
• High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flops/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
Mega    Mflop/s = 10^6 flop/sec     Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga    Gflop/s = 10^9 flop/sec     Gbyte = 2^30 ~ 10^9 bytes
Tera    Tflop/s = 10^12 flop/sec    Tbyte = 2^40 ~ 10^12 bytes
Peta    Pflop/s = 10^15 flop/sec    Pbyte = 2^50 ~ 10^15 bytes
Exa     Eflop/s = 10^18 flop/sec    Ebyte = 2^60 ~ 10^18 bytes
Zetta   Zflop/s = 10^21 flop/sec    Zbyte = 2^70 ~ 10^21 bytes
Yotta   Yflop/s = 10^24 flop/sec    Ybyte = 2^80 ~ 10^24 bytes
• See www.top500.org for current list of fastest machines
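A minimal Python sketch, just to make the decimal prefixes (rates) versus binary prefixes (memory sizes) in the table concrete:

    FLOP_PREFIX = {"M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15,
                   "E": 10**18, "Z": 10**21, "Y": 10**24}
    BYTE_PREFIX = {"M": 2**20, "G": 2**30, "T": 2**40, "P": 2**50,
                   "E": 2**60, "Z": 2**70, "Y": 2**80}

    print(FLOP_PREFIX["T"])   # 1000000000000 flop/s in a Tflop/s
    print(BYTE_PREFIX["T"])   # 1099511627776 bytes in a Tbyte (~10^12)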
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  - Too difficult -- build large wind tunnels.
  - Too expensive -- build a throw-away passenger jet.
  - Too slow -- wait for climate or galactic evolution.
  - Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
  - Based on known physical laws and efficient numerical methods.
Some Particularly Challenging Computations
• Science
  - Global climate modeling
  - Biology: genomics; protein folding; drug design
  - Astrophysical modeling
  - Computational Chemistry
  - Computational Material Sciences and Nanosciences
• Engineering
  - Semiconductor design
  - Earthquake and structural modeling
  - Computational fluid dynamics (airplane design)
  - Combustion (engine design)
  - Crash simulation
• Business
- Financial and economic modeling
- Transaction processing, web services and search engines
• Defense
- Nuclear weapons -- test by simulations
- Cryptography
Economic Impact of HPC
• Airlines:
- System-wide logistics optimization systems on parallel systems.
- Savings: approx. $100 million per airline per year.
• Automotive design:
- Major automotive companies use large systems (500+ CPUs) for:
- CAD-CAM, crash testing, structural integrity and
aerodynamics.
- One company has a 500+ CPU parallel system.
- Savings: approx. $1 billion per company per year.
• Semiconductor industry:
- Semiconductor firms use large systems (500+ CPUs) for
- device electronics simulation and logic validation
- Savings: approx. $1 billion per company per year.
• Securities industry:
- Savings: approx. $15 billion per year for U.S. home mortgages.
$5B World Market in Technical Computing
[Figure: stacked bar chart of the technical computing market by application segment, 1998-2003. Segments: Other, Technical Management and Support, Simulation, Scientific Research and R&D, Mechanical Design/Engineering Analysis, Mechanical Design and Drafting, Imaging, Geoscience and Geo-engineering, Electrical Design/Engineering Analysis, Economics/Financial, Digital Content Creation and Distribution, Classified Defense, Chemical Engineering, Biosciences]
Source: IDC 2004, from NRC Future of Supercomputing Report
Global Climate Modeling
• Problem is to compute:
  f(latitude, longitude, elevation, time) ->
  temperature, pressure, humidity, wind velocity
• Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t+dt given t
• Uses:
- Predict major events,
e.g., El Nino
- Use in setting air
emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
- Solve Navier-Stokes equations
- Roughly 100 Flops per grid point with 1 minute timestep
• Computational requirements:
- To match real-time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours) -> 56 Gflop/s
- Climate prediction (50 years in 30 days) -> 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours) -> 288 Tflop/s
  (a worked sketch of this arithmetic follows this list)
• To double the grid resolution, computation is 8x to 16x
• State of the art models require integration of
atmosphere, ocean, sea-ice, land models, plus possibly
carbon cycle, geochemistry and more
• Current models are coarser than this
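A minimal Python sketch of the rate arithmetic above (the grid cost and timestep are the slide's; treating "50 years" as 50 x 360 days is our assumption, chosen because it reproduces the slide's numbers):

    flops_per_sim_minute = 5e11    # whole-grid cost of one 1-minute timestep
    real_time_rate = 8e9           # 5e11 flops / 60 s ~ 8 Gflop/s (slide's rounding)

    def required_rate(sim_days, wall_days):
        """flop/s needed to simulate sim_days of climate in wall_days of wall-clock time."""
        return real_time_rate * sim_days / wall_days

    print(required_rate(7, 1))           # weather, 7 days in 24 hours:   5.6e10  (56 Gflop/s)
    print(required_rate(50 * 360, 30))   # climate, 50 years in 30 days:  4.8e12  (4.8 Tflop/s)
    print(required_rate(50 * 360, 0.5))  # policy, 50 years in 12 hours:  2.88e14 (288 Tflop/s)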
High Resolution
Climate Modeling on
NERSC-3 – P. Duffy,
et al., LLNL
A 1000 Year Climate Simulation
• Demonstration of the
Community Climate System Model (CCSM2)
• A 1000-year simulation
shows long-term, stable
representation of the
earth’s climate.
• 760,000 processor hours
used
• Temperature change
shown
• Warren Washington and Jerry Meehl, National Center for
Atmospheric Research; Bert Semtner, Naval Postgraduate
School; John Weatherly, U.S. Army Cold Regions Research and
Engineering Laboratory, et al.
• http://www.nersc.gov/news/science/bigsplash2002.pdf
Climate Modeling on the Earth Simulator System
• Development of ES started in 1997 in order to gain a
  comprehensive understanding of global environmental
  changes such as global warming.
• Its construction was completed at the end of February
  2002 and practical operation started on March 1, 2002.
• 35.86 Tflop/s (87.5% of peak performance) was achieved on the
  Linpack benchmark (world's fastest machine from 2002-2004).
• 26.58 Tflop/s was obtained by a global atmospheric circulation
  code.
Astrophysics: Binary Black Hole Dynamics
• Massive supernova cores collapse to black holes.
• At black hole center spacetime breaks down.
• Critical test of theories of gravity –
General Relativity to Quantum Gravity.
• Indirect observation – most galaxies
have a black hole at their center.
• Gravity waves show black hole directly
including detailed parameters.
• Binary black holes most powerful
sources of gravity waves.
• Simulation extraordinarily complex –
evolution disrupts the spacetime !
Heart
Simulation
• Problem is to compute blood flow in the heart
• Approach:
- Modeled as an elastic structure in an incompressible fluid.
- The “immersed boundary method” due to Peskin and McQueen.
- 20 years of development in model
- Many applications other than the heart: blood clotting, inner ear,
paper making, embryo growth, and others
- Use a regularly spaced mesh (set of points) for evaluating the fluid
• Uses
- Current model can be used to design artificial heart valves
- Can help in understanding effects of disease (leaky valves)
- Related projects look at the behavior of the heart during a heart
attack
- Ultimately: real-time clinical work
Heart Simulation Calculation
This involves solving the Navier-Stokes equations
- 64^3 was possible on Cray YMP, but 128^3 required for accurate
model (would have taken 3 years).
- Done on a Cray C90 -- 100x faster and 100x more memory
- Until recently, limited to vector machines
- Needs more features:
  - Electrical model of the heart, and details of muscles
    (e.g., work by Chris Johnson, Andrew McCulloch)
  - Lungs, circulatory systems
Heart Simulation
Animation of lower portion of the heart
Source: www.psc.org
Parallel Computing in Data Analysis
• Finding information amidst large quantities of data
• General themes of sifting through large, unstructured data
sets:
- Has there been an outbreak of some medical condition in a
community?
- Which doctors are most likely involved in fraudulent
charging to medicare?
- When should white socks go on sale?
- What advertisements should be sent to you?
• Data collected and stored at enormous speeds (Gbyte/hour)
- remote sensor on a satellite
- telescope scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- NSA analysis of telecommunications
Performance on Linpack Benchmark
www.top500.org
Why writing (fast) parallel
programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming
even harder than sequential programming.
“Automatic” Parallelism in Modern Machines
• Bit level parallelism
- within floating point operations, etc.
• Instruction level parallelism (ILP)
- multiple instructions execute per clock cycle
• Memory system parallelism
- overlap of memory operations with computation
• OS parallelism
- multiple jobs run in parallel on commodity SMPs
Limits to all of these -- for very high performance, need user
to identify, schedule and coordinate parallel tasks
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl’s law
- let s be the fraction of work done sequentially, so
(1-s) is fraction parallelizable
- P = number of processors
Speedup(P) = Time(1)/Time(P)
<= 1/(s + (1-s)/P)
<= 1/s
• Even if the parallel part speeds up perfectly
performance is limited by the sequential part
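The bound above as a minimal Python sketch:

    # Amdahl's law: s = sequential fraction, P = number of processors.
    def amdahl_speedup(s, P):
        return 1.0 / (s + (1.0 - s) / P)

    # Even 10% sequential work caps speedup at 1/s = 10, no matter how large P gets:
    for P in (10, 100, 1000, 10**6):
        print(P, amdahl_speedup(0.1, P))   # 5.3, 9.2, 9.9, ~10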
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to
getting desired speedup
• Parallelism overheads include:
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
• Each of these can be in the range of milliseconds
(=millions of flops) on some systems
• Tradeoff: Algorithm needs sufficiently large units of work
to run fast in parallel (i.e., large granularity), but not so
large that there is not enough parallel work
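A minimal Python sketch of the startup-cost point above (multiprocessing is used purely for illustration; actual overheads vary widely across systems):

    # Time the cost of starting and joining one worker process versus the
    # tiny unit of work it performs; the overhead typically dwarfs the work.
    import time
    from multiprocessing import Process

    def tiny_task():
        sum(i * i for i in range(1000))   # deliberately small granularity

    if __name__ == "__main__":
        t0 = time.perf_counter()
        p = Process(target=tiny_task)
        p.start(); p.join()
        print("spawn + join:", time.perf_counter() - t0, "s")

        t0 = time.perf_counter()
        tiny_task()
        print("same work, serial:", time.perf_counter() - t0, "s")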
Locality and Parallelism
[Figure: conventional storage hierarchy -- Proc / Cache / L2 Cache / L3 Cache / Memory for each of several processors, with potential interconnects between them]
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast cache
- the slow accesses to “remote” data we call “communication”
• Algorithm should do most work on local data
Processor-DRAM Gap (latency)
[Figure: processor-DRAM latency gap, 1980-2000, performance on a log scale vs. time. Processor ("Moore's Law") performance grows ~60%/yr, DRAM ~7%/yr, so the processor-memory performance gap grows ~50%/yr]
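A minimal Python sketch of the compounding behind the figure's 50%/year gap (the 60%/yr and 7%/yr growth rates are from the figure):

    cpu_growth, dram_growth = 1.60, 1.07     # ~60%/yr vs ~7%/yr
    gap_per_year = cpu_growth / dram_growth  # ~1.5, i.e. the gap grows ~50% per year
    print(gap_per_year)
    print(gap_per_year ** 10)                # after a decade the gap is ~56x wider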
Load Imbalance
• Load imbalance is the time that some processors in the
system are idle due to
- insufficient parallelism (during that phase)
- unequal size tasks
• Examples of the latter
- adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load
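A minimal Python sketch of why imbalance hurts: the parallel time is set by the busiest processor, so speedup is total work divided by the maximum per-processor load.

    def speedup(loads):
        """loads[i] = work assigned to processor i (any consistent units)."""
        return sum(loads) / max(loads)

    print(speedup([25, 25, 25, 25]))   # balanced:   4.0 on 4 processors
    print(speedup([70, 10, 10, 10]))   # imbalanced: ~1.4 despite 4 processors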
Measuring
Performance
Improving Real Performance
• Peak performance grows exponentially, a la Moore's Law
  - In the 1990s, peak performance increased 100x; in the 2000s, it will increase 1000x
• But efficiency (the performance relative to the hardware peak) has declined
  - was 40-50% on the vector supercomputers of the 1990s
  - now as little as 5-10% on parallel supercomputers of today
• Close the gap through ...
  - Mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
  - More efficient programming models and tools for massively parallel supercomputers
[Figure: Teraflops (log scale, 0.1 to 1,000) vs. year (1996-2004), showing a widening Performance Gap between Peak Performance and Real Performance]
Performance Levels
• Peak advertised performance (PAP)
- You can’t possibly compute faster than this speed
• LINPACK
- The “hello world” program for parallel computing
- Solve Ax=b using Gaussian Elimination, highly tuned
• Gordon Bell Prize winning applications performance
- The right application/algorithm/platform combination plus years of work
• Average sustained applications performance
- What one can reasonably expect for standard applications
When reporting performance results, these levels are
often confused, even in reviewed publications
Performance on Linpack Benchmark
www.top500.org
Performance Levels (for example on NERSC3)
• Peak advertised performance (PAP): 5 Tflop/s
• LINPACK (TPP): 3.05 Tflop/s
• Gordon Bell Prize winning applications performance :
2.46 Tflop/s
- Material Science application at SC01
• Average sustained applications performance: ~0.4
Tflop/s
- Less than 10% of peak!
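The same numbers expressed as fractions of peak, as a minimal Python sketch:

    peak = 5.0   # Tflop/s, peak advertised performance
    levels = {"LINPACK (TPP)": 3.05, "Gordon Bell app": 2.46, "Sustained apps": 0.4}
    for name, tflops in levels.items():
        print(f"{name}: {tflops} Tflop/s = {100 * tflops / peak:.0f}% of peak")
    # Sustained applications: ~8% of peak -- "less than 10%", as noted above.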
Course
Organization
Course Mechanics
• This class is listed as both a CS and Engineering class
• Normally a mix of CS, EE, and other engineering and
science students
• This class seems to be about:
- 15 grads + 4 undergrads
- X% CS, EE, Other (BioPhys, BioStat, Civil, Mechanical, Nuclear)
• For final projects we encourage interdisciplinary teams
- This is the way parallel scientific software is generally built
Rough Schedule of Topics
• Parallel Programming Models and Machines
- Shared Memory and Multithreading
- Distributed Memory and Message Passing
- Data parallelism
• Parallel languages and libraries
- Shared memory threads and OpenMP
- MPI
- Languages (UPC)
• “Seven Dwarfs” of Scientific Computing
- Dense Linear Algebra
- Structured grids
- Particle methods
- Sparse matrices
- Unstructured Grids
- Spectral methods (FFTs, etc.)
• Applications: biology, climate, combustion, astrophysics, …
• Work: 3 homeworks + 1 final project
Lecture Scribes
• Each student should scribe ~2 lectures
• Next lecture is on single processor performance (and
architecture)
• Sign up on list to choose 1 topic in first part of the course
Reading Materials
• Some on-line texts:
- Demmel’s notes from CS267 Spring 1999, which are similar to 2000
and 2001. However, they contain links to html notes from 1996.
- http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
- Simon’s notes from Fall 2002
- http://www.nersc.gov/~simon/cs267/
- Ian Foster's book, "Designing and Building Parallel Programs".
- http://www-unix.mcs.anl.gov/dbpp/
• Potentially useful texts:
- “Sourcebook for Parallel Computing”, by Dongarra, Foster, Fox, ..
- A general overview of parallel computing methods
- “Performance Optimization of Numerically Intensive Codes” by Stefan
Goedecker and Adolfy Hoisie
- This is a practical guide to optimization, mostly for those of you
who have never done any optimization
• More pointers will be on the web page
What you should get out of the course
In depth understanding of:
• When is parallel computing useful?
• Understanding of parallel computing hardware options.
• Overview of programming models (software) and tools.
• Some important parallel applications and the algorithms
• Performance analysis and tuning
Administrative Information
• Instructors:
- Kathy Yelick, 777 Soda, [email protected]
- TA: Marghoob Mohiyuddin ([email protected])
- Office hours TBD
• Lecture notes are based on previous semester notes:
- Jim Demmel, David Culler, David Bailey, Bob Lucas, Kathy Yelick
and Horst Simon
- Much material stolen from others (with sources noted)
• Most class material and lecture notes are at:
- http://www.cs.berkeley.edu/~yelick/cs267_spr07
Extra slides
Transaction Processing
(Mar. 15, 1996)
[Figure: transaction-processing throughput (tpmC, 0-25,000) vs. number of processors (0-120) for Tandem Himalaya, IBM PowerPC, DEC Alpha, SGI PowerChallenge, HP PA, and other systems]
• Parallelism is natural in relational operators: select, join, etc.
• Many difficult issues: data partitioning, locking, threading.
SIA Projections for Microprocessors
[Figure: SIA projections by year of introduction, 1995-2010 (log scale): feature size (microns) shrinking while transistors per chip (x 10^6) grow]
Compute power ~ 1/(feature size)^3
based on F.S. Preston, 1997
Much of the Performance is from Parallelism
[Figure: performance contributions over time from bit-level parallelism, then instruction-level parallelism, and next thread-level parallelism(?)]
Performance on Linpack Benchmark
www.top500.org
[Figure: TOP500 Linpack Rmax (log scale), Jun 1993 - Jun 2004: max, mean, and min Rmax per list, with ASCI Red, ASCI White, and the Earth Simulator highlighted]
Nov 2004: IBM Blue Gene L, 70.7 Tflop/s Rmax