CS 350: Principles of Parallel Computing
LECTURE 1
Applied Parallel Computing (2002 manuscript) by Dr. Yuefan Deng & Dr. Stephen Tse
All rights reserved.
1. Introduction

Computer accounts: Every registered student will get an account on the QC
Cluster (send an email to [email protected] to request your account). In
your email, please state the following:
1. Name (first and last names)
2. CS350 account username for the Parallel Machine
3. An active e-mail address


Projects: All four projects should be done on the Parallel Machine. All
but Project 4 will be collected three weeks after being assigned; Project 4
will have four weeks.
Lecture notes: At the start of the week, notes for the following two
lectures will be emailed to registered students.
Networked computing system in the Computer Science Lab (NSB A131)
2. Parallel Computing: What? Why? and How?
2.1 What:
Parallel computing is defined as "simultaneous processing by more than one
processing unit on a single application".
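As a minimal, concrete illustration of "more than one processing unit on a single application", here is a message-passing sketch in C. The lecture does not prescribe a particular library at this point; MPI is assumed below purely for illustration.

    #include <mpi.h>
    #include <stdio.h>

    /* Each process is one "processing unit"; together they run one application. */
    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the parallel runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processing unit am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many units run this program?  */

        printf("process %d of %d: working simultaneously on one application\n",
               rank, size);

        MPI_Finalize();
        return 0;
    }

Built with an MPI compiler wrapper (e.g., mpicc) and launched on several processors (e.g., mpirun -np 4 ./a.out), all of the processes execute at the same time; that simultaneity is what the definition above refers to.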
2.2 Why:
2.2.1 Improve response time (stock markets, hospitals, battlefields, etc.)
2.2.2 Increase the total amount of work done in a given time (fluids, QCD,
proteins, meteorology, materials: finer grids, larger domains, and more
parameters, etc.)
2.2.3 Cost-effectiveness: roughly 10X compared with mainframes (everyone enjoys
more money)
Typical Times

Machines                  | Grand Challenge Problems               | Moderate Problems
Applications              | Protein folding, QCD, Turbulence,      | 2D CFD, Simple designs
                          | Weather                                |
Sub-Pflop                 | O(1) Hrs                               | NA
Tflop Machine             | O(10) Hrs                              | O(1) Sec
1K-Node Beowulf Cluster   | O(1) Wks                               | O(1) Minute
High-End Workstation      | O(10) Yrs                              | O(1) Hrs
PC With 2 GHz Pentium     | O(100) Yrs                             | O(1) Days

2.2.4 Problems Requiring Parallel Computing
2.2.4.1 Prediction of Weather, Climate, and Global Change
2.2.4.2 Materials Science
2.2.4.3 Semiconductor Design
2.2.4.4 Superconductivity
2.2.4.5 Structural Biology
2.2.4.6 Drug Design
2.2.4.7 Human Genome
2.2.4.8 QCD
2.2.4.9 Astronomy
2.2.4.10 Transportation
2.2.4.11 Vehicle Signature---Military
2.2.4.12 Turbulence
2.2.4.13 Vehicle Dynamics
2.2.4.14 Nuclear Fusion
2.2.4.15 Combustion Systems
2.2.4.16 Oil and Gas Recovery
2.2.4.17 Ocean Science
2.2.4.18 Speech
2.2.4.19 Vision
2.2.4.20 Undersea Surveillance
2.2.5 Best Application Areas
2.2.5.1 Aerodynamics: Aircraft design, air-breathing propulsion, advanced sensors.
2.2.5.2 Applied mathematics: Fluid dynamics, turbulence, differential equations, numerical analysis, global optimization, numerical linear algebra.
2.2.5.3 Biology and engineering: Simulation of genetic compounds, neural networks, structural biology, conformation, drug design, protein folding, human genome.
2.2.5.4 Chemistry and engineering: Polymer simulation, reaction rate prediction.
2.2.5.5 Computer science: Simulation of devices and circuits, VLSI, artificial intelligence.
2.2.5.6 Electrical engineering: Electromagnetic scattering.
2.2.5.7 Geosciences: Oil and seismic exploration, enhanced oil and gas recovery.
2.2.5.8 Materials research: Material property prediction, modeling of new materials, superconductivity.
2.2.5.9 Mechanical engineering: Structural analysis, combustion simulation.
2.2.5.10 Meteorology: Weather forecasting, climate modeling.
2.2.5.11 Oceanography: Global ocean modeling.
2.2.5.12 Physics: Astrophysics, evolution of galaxies and the nature of black holes; particle physics, quark interactions, properties of new particles; plasma physics, fusion reaction modeling; nuclear physics, weapon design and modeling, remediation of nuclear contamination.
2.2.5.13 Others: Information superhighway, optical processing, and transportation.
2.2.6 Basic Physics Equations
2.2.6.1 Classical Mechanics---Newton's Second Law
2.2.6.2 Electrodynamics---Maxwell Equations
2.2.6.3 Quantum Dynamics---Schrödinger Equation
2.2.6.4 Statistical Mechanics
2.2.6.5 Quantum Chromodynamics---Yang-Mills Equation
(In future lectures, I'll gradually introduce these equations and analyze
the relevant structures for parallel computing.)
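For reference, standard textbook forms of these equations are sketched below in LaTeX; these are the conventional statements, and the later lectures may use different (e.g., discretized) forms.

    % Classical mechanics: Newton's second law
    \mathbf{F} = m\,\frac{d^{2}\mathbf{x}}{dt^{2}}

    % Electrodynamics: Maxwell equations (SI units)
    \nabla\cdot\mathbf{E} = \rho/\varepsilon_{0}, \quad
    \nabla\cdot\mathbf{B} = 0, \quad
    \nabla\times\mathbf{E} = -\partial\mathbf{B}/\partial t, \quad
    \nabla\times\mathbf{B} = \mu_{0}\mathbf{J} + \mu_{0}\varepsilon_{0}\,\partial\mathbf{E}/\partial t

    % Quantum dynamics: Schrödinger equation
    i\hbar\,\frac{\partial\psi}{\partial t} = \hat{H}\,\psi

    % Statistical mechanics: the canonical partition function (a central quantity)
    Z = \sum_{i} e^{-E_{i}/(k_{B}T)}

    % Quantum chromodynamics: Yang-Mills field equation
    D_{\mu}F^{\mu\nu} = J^{\nu}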
2.3 How? The subject of the entire semester.
3. Parallel Computers: Hardware Issues
3.1 Characterization
3.1.1 Number and properties of compute processors
3.1.2 Network topologies and communication bandwidth
3.1.3 Instruction and data streams
3.1.4 Processor-memory connectivity
3.1.5 Memory size and I/O
3.2 Node architectures
3.2.1 Simplicity in the processing units
3.2.2 Conventional, off-the-shelf, mass-produced processors (rather than specially developed processors, as was done for Cray processors and IBM mainframe CPUs)
3.2.3 The main processors used in parallel machines include: the i860 in the iPSC/860, Intel Delta, Transtech, Meiko, and Paragon; SPARC in the CM-5; DEC's Alpha in the DECmpp and Cray T3D; the RS/6000 in the SP-1 and SP-2; and Weitek in the MasPar, etc.
3.3 Network topology
3.3.1 1D array
3.3.2 2D mesh
3.3.3 3D hypercube
3.3.4 Ring
3.3.5 Four-level, complete binary tree
3.3.6 Bus and switches
3.3.7 Cross-bar network
3.4 Network performance measurement
3.4.1 Connectivity: the multiplicity of links between any two processors.
3.4.2 Diameter: the maximum distance between two processors, i.e., the number of hops between the two most distant processors.
3.4.3 Bisection bandwidth: the number of bits that can be transmitted in parallel multiplied by the bisection width, which is defined as the minimum number of communication links that must be removed to divide the network into two equal partitions.
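The short C sketch below (not part of the lecture) evaluates two of these measures, diameter and bisection width, for three of the topologies in Section 3.3, using the standard formulas: a p-node ring has diameter floor(p/2) and bisection width 2; a sqrt(p) x sqrt(p) mesh without wraparound has diameter 2(sqrt(p)-1) and bisection width sqrt(p); a d-dimensional hypercube has diameter d and bisection width 2^(d-1). The bisection bandwidth is then the bisection width times the number of bits each link carries in parallel.

    #include <math.h>
    #include <stdio.h>

    /* Diameter and bisection width of a p-node ring. */
    static void ring_metrics(int p, int *diameter, int *bisection_width) {
        *diameter = p / 2;               /* longest shortest path: halfway around    */
        *bisection_width = 2;            /* cutting the ring in half removes 2 links */
    }

    /* Metrics for a sqrt(p) x sqrt(p) 2D mesh (p a perfect square, no wraparound). */
    static void mesh2d_metrics(int p, int *diameter, int *bisection_width) {
        int side = (int)(sqrt((double)p) + 0.5);
        *diameter = 2 * (side - 1);      /* corner-to-opposite-corner path           */
        *bisection_width = side;         /* one column of links crosses the cut      */
    }

    /* Metrics for a d-dimensional hypercube with 2^d nodes. */
    static void hypercube_metrics(int d, int *diameter, int *bisection_width) {
        *diameter = d;                   /* one hop per differing address bit        */
        *bisection_width = 1 << (d - 1); /* half the nodes each keep one cut link    */
    }

    int main(void) {
        int diam, bw;

        ring_metrics(64, &diam, &bw);
        printf("64-node ring:  diameter = %2d, bisection width = %d\n", diam, bw);

        mesh2d_metrics(64, &diam, &bw);
        printf("8x8 mesh:      diameter = %2d, bisection width = %d\n", diam, bw);

        hypercube_metrics(6, &diam, &bw);   /* 2^6 = 64 nodes */
        printf("6-D hypercube: diameter = %2d, bisection width = %d\n", diam, bw);

        return 0;
    }

(Link with the math library, e.g., -lm.)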
3.5 Instruction and Data Streams
3.5.1 Based on the nature of their instruction and data streams, parallel computers can be classified as follows (Flynn's taxonomy):

Instruction stream:      | Single | Multiple
Data stream: Single      | SISD   | MISD
Data stream: Multiple    | SIMD   | MIMD

3.5.1.1 SISD: e.g., workstations and single-CPU PCs (disappearing)
3.5.1.2 SIMD: Easy to use for simple problems (CM-1/2, ...)
3.5.1.3 MISD: Rare
3.5.1.4 MIMD: The trend (Paragon, IBM SP-1/2, Cray T3D, ...)
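In practice, MIMD machines are most often programmed in the SPMD (single program, multiple data) style: one program runs on every processor, but each processor follows its own control path over its own data. A minimal sketch, again assuming MPI only for illustration:

    #include <mpi.h>
    #include <stdio.h>

    /* SPMD: one source program, but each process takes its own instruction path. */
    int main(int argc, char **argv) {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 coordinates: a different code path than the workers */
            printf("rank 0: collecting and reporting results\n");
        } else {
            /* other ranks compute: same executable, different control flow and data */
            printf("rank %d: computing my share of the data\n", rank);
        }

        MPI_Finalize();
        return 0;
    }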
3.6 Processor-memory connectivity
3.6.1 Distributed memory
3.6.2 Shared memory
3.6.3 Distributed shared memory
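To make the distinction concrete: on a shared-memory node, all threads address one common memory, as in the short sketch below (OpenMP is assumed here only as an example of shared-memory programming). On a distributed-memory machine, the same update would be done on per-process arrays, with any values a process does not own exchanged by explicit messages, as in the MPI sketches above. Distributed shared memory provides the shared-memory view on top of physically distributed memories.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N];   /* one array, visible to every thread (shared memory) */
        int i;

        #pragma omp parallel for      /* the threads split the loop iterations      */
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[%d] = %.1f, computed by up to %d threads\n",
               N - 1, a[N - 1], omp_get_max_threads());
        return 0;
    }

(Compile with the compiler's OpenMP flag, e.g., -fopenmp.)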
3.7 The Top 10 Supercomputers (TOP500 list of 11/2001)

Rank | Manufacturer | Computer | Rmax (GFlops) | Installation Site | Country | Year | Processors | Rpeak (GFlops) | Nmax | Nhalf
1 | IBM | ASCI White, SP Power3 375 MHz | 7226.00 | Lawrence Livermore National Laboratory | USA | 2000 | 8192 | 12288 | 518096 | 179000
2 | Compaq | AlphaServer SC ES45/1 GHz | 4059.00 | Pittsburgh Supercomputing Center | USA | 2001 | 3024 | 6048 | 525000 | 105000
3 | IBM | SP Power3 375 MHz 16 way | 3052.00 | NERSC/LBNL | USA | 2001 | 3328 | 4992 | 371712 | 102400
4 | Intel | ASCI Red | 2379.00 | Sandia National Labs | USA | 1999 | 9632 | 3207 | 362880 | 75400
5 | IBM | ASCI Blue-Pacific SST, IBM SP 604e | 2144.00 | Lawrence Livermore National Laboratory | USA | 1999 | 5808 | 3868 | 431344 | --
6 | Compaq | AlphaServer SC ES45/1 GHz | 2096.00 | Los Alamos National Laboratory | USA | 2001 | 1536 | 3072 | 390000 | 71000
7 | Hitachi | SR8000/MPP | 1709.10 | University of Tokyo | Japan | 2001 | 1152 | 2074 | 141000 | 16000
8 | SGI | ASCI Blue Mountain | 1608.00 | Los Alamos National Laboratory | USA | 1998 | 6144 | 3072 | 374400 | 138000
9 | IBM | SP Power3 375 MHz | 1417.00 | Naval Oceanographic Office (NAVOCEANO) | USA | 2000 | 1336 | 2004 | 374000 | --
10 | IBM | SP Power3 375 MHz 16 way | 1293.00 | Deutscher Wetterdienst | Germany | 2001 | 1280 | -- | -- | --
3.8 Cost-Effectiveness of Parallel Computers
3.8.1 Performance/Cost

System in | Number of processors | Performance in MF | Cost ($) | Cost-Performance Ratio (MF/$)
1997 | 128 | 40,000 | 250,000 | 0.16
2001 | 128 | 300,000 | 250,000 | 1.20
2001 | 1700 | 4,000,000 | 3,500,000 | 1.20
3.8.2 Benchmarking Definitions (the NAS Parallel Benchmark codes and classes)

Code | Benchmark name
EP | Embarrassingly parallel
MG | Multigrid
CG | Conjugate gradient
FT | 3-D FFT PDE
LU | LU solver
SP | Pentadiagonal solver
BT | Block tridiagonal solver

Problem-size classes: A, B, C
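To illustrate the "embarrassingly parallel" (EP) category, the sketch below estimates pi by Monte Carlo sampling: each process draws its own random points, and no communication is needed until one final reduction. This is a generic illustration, not the NAS EP kernel itself, and MPI is assumed only for illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        const long n_per_proc = 1000000;   /* samples per process (illustrative)    */
        long i, local_hits = 0, total_hits = 0;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(1234 + rank);                /* independent streams, no communication */
        for (i = 0; i < n_per_proc; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                local_hits++;
        }

        /* the only communication: a single reduction at the very end */
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi ~= %f from %ld samples on %d processes\n",
                   4.0 * total_hits / (double)(n_per_proc * size),
                   n_per_proc * size, size);

        MPI_Finalize();
        return 0;
    }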
3.9 Comparison of Human Brain with Supercomputers

Attribute | Typical Human Brain | Intel Pentium 4 (at 1.5 GHz)
Total number of neurons / transistors | 20-50 billion neurons | 42 million transistors
Neurons in cerebral cortex | ~20 billion | --
Number of synapses | 10^14 (2,000-5,000 per neuron) | --
Weight | ~1.5 kg | 0.3 kg
Power consumption | ~40 W (4 nW/neuron) | ~55 W (1 mW/transistor)
Percentage of body | 2% of weight, 0.04-0.07% of cells, 20-44% of power consumption | --
Genetic code influence | 1 bit per 10,000-1,000,000 synapses | --
Atrophy/death of neurons | 50,000 per day (ages 20-75) | --
Sleep requirement | 30% | 0%
Normal operating temperature | 37±2 °C | 15-85 °C
Maximum firing frequency of neuron | 250-2,000 Hz (0.5-4 ms intervals) | 1.5 GHz
Signal propagation speed inside axon | 90 m/s sheathed, <0.1 m/s unsheathed | Speed of light
Processing of complex stimuli | 0.5 s or 100-1,000 firings | Long time
3.10 Supercomputer Design Issues
3.10.1 Processors: Advanced pipelining, instruction-level parallelism, reduction of branch penalties with dynamic hardware prediction and scheduling.
3.10.2 Networks: Internetworking, and connecting processors to the networks.
3.10.3 Processor-memory connectivity: Caching, reduction of cache misses and the miss penalty, design of memory hierarchies, virtual memory, centralized shared-memory architectures, distributed shared-memory architectures, and synchronization.
3.10.4 Storage systems: Types and performance of storage devices, bus-connected storage devices, storage area networks, RAID, reliability, and availability.
3.11 Gordon Bell's 11 Rules of Supercomputer Design
3.11.1 Performance, performance, performance. People buy supercomputers for performance. Performance, within a broad price range, is everything.
3.11.2 Everything matters. The use of the harmonic mean for reporting performance on the Livermore Loops severely penalizes machines that run poorly on even one loop, and it gives little credit for loops that run significantly faster than the others. Since the Livermore Loops were designed to simulate the real computational load mix at Livermore Labs, there can be no holes in performance when striving for high performance on this realistic mix of computational loads.
3.11.3
Scalars matter the most. A well-designed vector unit will
probably be fast enough to make scalars the limiting factor.
Even if scalar operations can be issued efficiently, high latency
through a pipelined floating point unit such as the VPU can be
deadly in some applications.
3.11.4
Provide as much vector performance as price allows. Peak
vector performance is primarily determined by bus bandwidth
in some circumstances, and the use of vector registers in others.
Thus the bus was designed to be as fast as practical using a
cost-effective mix of TTL and ECL logic, and the VRF was
designed to be as large and flexible as possible within cost
limitations. Gordon Bell's rule of thumb is that each vector unit
must be able to produce at least two results per clock tick to
have acceptably high performance.
3.11.5
Avoid holes in the performance space. This is an amplification
of rule 2. Certain specific operations may not occur often in an
"average" application. But in those applications where they
occur, lack of high speed support can significantly degrade
performance.
3.11.6
Place peaks in performance. Marketing sells machines as much
as, or more than, technical excellence does. Benchmark and
specification wars are inevitable. Therefore the most important
inner loops or benchmarks for the targeted market should be
identified, and inexpensive methods should be used to increase
performance. It is vital that the system can be called the
"World's Fastest", even though only on a single program. A
typical way that this is done is to build special optimizations
into the compiler to recognize specific benchmark programs.
3.11.7
Provide a decade of addressing. Computers never have enough
address space. History is full of examples of computers that
have run out of memory addressing space for important
applications while still relatively early in their life (e.g., the
PDP-8, the IBM System 360, and the IBM PC). Ideally, a
system should be designed to last for 10 years without running
out of memory address space for the maximum amount of
memory that can be installed. Since dynamic RAM chips tend
to quadruple in size every three years, this means that the
address space should contain 7 bits more than required to
address installed memory on the initial system.
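A quick check of the arithmetic behind the "7 bits" figure, assuming memory demand tracks DRAM capacity (quadrupling every three years over a ten-year life):

    4^{10/3} = 2^{2 \cdot 10/3} = 2^{20/3} \approx 2^{6.7}

so roughly seven extra address bits cover ten years of memory growth.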
3.11.8
Make it easy to use. The "dusty deck" syndrome, in which users
want to reuse FORTRAN code written two or three decades
earlier, is rampant in the supercomputer world. Supercomputers
with parallel processors and vector units are expected to run this
code efficiently without any work on the part of the
programmer. While this may not be entirely realistic, it points
out the issue of making a complex system easy to use.
Technology changes too quickly for customers to have time to
become an expert on each and every machine version.
3.11.9
Build on others' work.
3.11.10
Design for the next one, and then do it again. In a small startup
company, resources are always scarce, and survival depends on
shipping the next product on schedule. It is often difficult to
look beyond the current design, yet this is vital for long term
success. Extra care must be taken in the design process to plan
ahead for future upgrades. The best way to do this is to start
designing the next generation before the current generation is
complete, using a pipelined hardware design process. Also, be
resigned to throwing away the first design as quickly as
possible.
3.11.11
Have slack resources. Expect the unexpected. No matter how
good the schedule, unscheduled events will occur. It is vital to
have spare resources available to deal with them, even in a
startup company with little extra manpower or capital.