Download ppt2 - Xputer Lab Configware Engineering for Reconfigurable

Dynamically Reconfigurable Architectures
Dagstuhl, Germany, April 2 - 7, 2006
Reiner Hartenstein
TU Kaiserslautern
Reconfigurable Supercomputing:
What are the Problems?
What are the Solutions?
The Supercomputing Paradox
COTS processor decreasing cost
Increasing number of processors running in parallel
Rapidly growing listed Teraflops
Almost stalled application implementation progress
Often limited sustained Teraflops
Very high total cost of the Tera(?)flops
Scientists waiting for affordable compute capacity
© 2006, [email protected]
http://hartenstein.de
dangerously telling this to the supercomputing people:
You … used the wrong roadmap the past 20 years !!!
progress stalled
3 Reconfigurable Computing Paradoxes
Reconfigurable Computing Education Paradox
The low power paradox
The high performance paradox
The Pervasiveness of RC
search “FPGA and ….”: # of hits by Google
[Table: the application-area labels were not recovered; hit counts as extracted: 647,000; 1,490,000; 171,000; 194,000; 398,000; 1,620,000; 127,000; 113,000; 158,000; 162,000; 915,000; 272,000]
going into every application area
Almost 10 million hits
…. educational deficits
In addition to the hardware / software chasm,
we now also have the hardware / configware / software chasm.
Curricula still ignore these extremely hot new challenges.
The Reconfigurable Computing Education Paradox:
its runaway accelerated pervasiveness, despite all these educational deficits
Computing Curricula 2004 (1)
Within about 500 pages the term “reconfigurable” is not found, nor its synonyms
obsolete
von Neumann's monopoly inside curricula is obsolete
von Neumann is not the common model
[Figure: two machine models
mainframe age: von Neumann instruction-stream-based machine; a CPU (DPU and program counter) connected to RAM memory through the von Neumann bottleneck
microprocessor age: an instruction-stream-based CPU together with data-stream-based accelerator co-processors; layers: hardware, morphware, software]
vN paradigm dominance? The tail is wagging the dog.
modern FPGA bestsellers:
The new model is reality:
FPGA fabrics, together with several µprocessors, several memory banks, and other IP cores, on the same COTS microchip
Bill Gates
Speech by Bill Gates at a summit meeting of US state governors:
"American high schools are obsolete."
"The high schools of today teach kids about today's computers like on a 50-year-old mainframe."
"Without re-design for the needs of the 21st century, we will keep limiting, even ruining, the lives of millions of Americans every year."
carved out of stone
The most important cultural revolution since the invention of text characters:
it's not the mainframe.
It is the Microchip!
RC education needed
Jürgen Becker
Jörg Henkel
R. Hartenstein
35 submissions from Australia, Brazil, India, USA, and throughout Europe
http://fpl.org/RCeducation/
Reconfigurable Computing Paradoxes
Reconfigurable Computing Education Paradox
The low power paradox
The high performance paradox
The FPGA Low Power Paradox
The awful technology of FPGAs:
"very power-hungry" [Rick Kornfeld*]
FPGAs run at lower clock frequencies, draw much more power and are more expensive.
*) personal communication
Reducing the electricity bill by an order of magnitude and more by supercomputer-to-FPGA migration
telling this to the low power design people ?
ISLPED, Oct 4 – 6, Tegernsee
PATMOS, Sep 13 – 15, Montpellier
1991: Kaiserslautern, Germany
1992: Paris, France
1993: Montpellier, France
you … used the wrong roadmap the past 15 years: use FPGAs!
Reconfigurable Computing Paradoxes
Reconfigurable Computing Education Paradox
The low power paradox
The high performance paradox
The High Performance Paradox
The awful technology of FPGAs:
Effective integration density much worse than the Gordon Moore curve: by a factor of more than 10,000
FPGAs run at lower clock frequencies, and are more expensive.
85% of all designers hate their tools
fine-grained RC: DeHon's 1st Law [1996: Ph.D., MIT]
[Figure: transistors / microchip (log scale, 10^0 to 10^9) over the years 1980 to 2010. Curves: density; FPGA physical (wiring overhead); FPGA logical (reconfigurability overhead); FPGA routed (routing congestion). Annotation: immense area inefficiency, a factor of >> 10 000]
coarse-grained RC: Hartenstein's Law [1996: ISIS, Austin, TX]
[Figure: transistors / microchip (log scale, 10^0 to 10^9) over the years 1980 to 2010. Annotations: FPGA routed, a factor of >> 10 000; area efficiency very close to Moore's law]
Claassen‘s Law
[Figure: MOPS / milliWatt (log scale, 0.001 to 1000) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); labeled curve: DSP]
Claassen‘s Law: Hartenstein‘s Amendment
[Figure: MOPS / milliWatt (log scale, 0.001 to 1000) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); labeled curve: DSP; the amended curve is not recoverable from the extracted text]
Selection of published speed-up factors
[Figure: relative performance (log scale, 10^0 to 10^9) over the years 1980 to 2010; further chart labels: 8080, P4]
Application areas: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics
Labels and factors that appear together in the extracted text:
pattern recognition: 730
SPIHT wavelet-based image compression: 457
Viterbi Decoding: 400
molecular dynamics simulation: 88
Los Alamos traffic simulation: 47
Lee Routing (DPLA by TU-KL): 160
2-D FIR filter (no FPGA: DPLA by TU-KL): 39,4
Other labels: Reed-Solomon Decoding; real-time face detection; video-rate stereo vision; MAC; crypto; Grid-based DRC („fair comparison“; no FPGA: DPLA on MoM by TU-KL); Smith-Waterman pattern matching; protein identification; BLAST; FFT; GRAPE
Other factors, whose labels are not recoverable: 2400; 6000; 1000; 900; 288; 15000; 2000; 100 000; 100; 52; 40; 20
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html
DeHon's 2nd Law [IEEE COMPUTER, 2000]
[Figure: Computational Density (log scale, 1 to 1000) versus feature size in µ (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07); curves: FPGA, RISC]
The three RC Paradoxes
Why supercomputing / HPC failed
because of the interconnect network architecture:
the wrong way in which the data are moved around
instruction-stream-based: memory-cycle-hungry
The law of More:
instruction fetch overhead
sequencing overhead
address computation overhead
and other overhead
moving data around inside the Earth Simulator
Crossbar weight: 220 t, 3000 km of cable
5120 processors, 5000 pins each
ES: 20 TFLOPS
data moved around by software
i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall
P&R: move locality of operation, not data!
(stolen from Bob Colwell)
An Archetype Common Model needed
from the Configware Industry
Progress stalled by the software/configware chasm
Useful simple archetype not widely accepted
An archetype common model should provide ....
Guidance for organizing efficient solutions
Make the project manageable
Allow sharing lessons between applications and between disciplines
Support undergraduate education
The new paradigm: how the data are traveling
no, not by instruction execution
transport-triggered: an old hat
pipeline, or chaining
asynchronous (via handshake)
systolic array
wavefront array
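A minimal sketch of the pipelining idea, in Python. It is an illustration only: the function run_pipeline, the bubble convention (None marks an empty register) and the three example stages are assumptions, not part of the talk. Data items travel from register to register on each clock tick; no instruction is fetched to move them.

```python
def run_pipeline(stages, inputs):
    """Clock a linear pipeline; None marks an empty register (a bubble)."""
    regs = [None] * len(stages)          # one output register per stage
    outputs = []
    # Feed the real inputs, then enough bubbles to drain the pipe.
    for item in list(inputs) + [None] * len(stages):
        if regs[-1] is not None:         # a result leaves the last stage
            outputs.append(regs[-1])
        # Update back to front so every stage sees the previous tick's value.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else stages[i](regs[i - 1])
        regs[0] = None if item is None else stages[0](item)
    return outputs


# Hypothetical three-stage pipe computing (x + 1) * 2 - 3.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results = run_pipeline(stages, [1, 2, 3, 4])
print(results)  # [1, 3, 5, 7]
```

Once the pipe is filled, one result leaves per tick, independent of the number of stages.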
Def.: data streams (flowware)
Flowware defines:
... which data item
at which time
at which port
[Figure: a DPA (pipe network) with input data streams entering and output data streams leaving; each stream is drawn as data items (x) plotted over time and port #]
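The flowware definition (which data item, at which time, at which port) can be sketched as a small data structure. This is a hedged Python illustration: the names StreamEntry, skewed_schedule and items_at are hypothetical, and the staggered arrival (stream p delayed by p time steps) is an assumption suggested by the slanted patterns in the figure, not a rule stated in the talk.

```python
from collections import namedtuple

# One flowware entry: which data item arrives at which port at which time.
StreamEntry = namedtuple("StreamEntry", ["time", "port", "item"])


def skewed_schedule(streams):
    """Delay the stream of port p by p time steps (assumed skew)."""
    schedule = []
    for port, items in enumerate(streams):
        for k, item in enumerate(items):
            schedule.append(StreamEntry(time=k + port, port=port, item=item))
    return sorted(schedule)              # ordered by time, then port


def items_at(schedule, time):
    """Return {port: item} for everything arriving at the given time step."""
    return {e.port: e.item for e in schedule if e.time == time}


streams = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
schedule = skewed_schedule(streams)
print(items_at(schedule, 2))  # {0: 'a2', 1: 'b1', 2: 'c0'}
```

The schedule is fixed before run time, so nothing has to be decided by an instruction stream while the data flow.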
source and sink ?
Data streams source and sink: not my job
Not my Job!
[Figure: the input data streams and output data streams of the pipe network, implemented by distributed on-chip memory: ASM blocks at the ports; labels inside an ASM: RAM, GAG]
ASM: Auto-Sequencing Memory (on-chip distributed memory)
How the data are moved
DMA
vN move processor [Jack Lipovski, EUROMICRO, Nice, 1975]
ASMs use a GAG (generic address generator) [TU-KL publ.: Tokyo 1989 + NH journal]
by the way: GAG st…. by TI [TI patent 1995]
Henk Corporaal coins the term “transport-triggered”
MoM: GAG-based storage scheme methodology [Herz*]
Application-specific distributed memory [Catthoor et al.]
*) see Michael Herz et al.: ICECS 2002 (Dubrovnik)
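The slides do not say how a GAG works internally. As a rough Python sketch, assume it is a parameterized counter scheme that emits the address sequence of a 2-D scan without any instruction fetch; the function name and all parameters below are hypothetical.

```python
def generic_address_generator(base, row_stride, rows, cols, col_step=1):
    """Yield memory addresses for a row-by-row scan of a 2-D window.

    Assumed model only: a real GAG may support many more scan patterns.
    """
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_step


# Hypothetical example: scan a 2 x 3 window inside an array with
# 8 words per row, starting at address 16.
addresses = list(generic_address_generator(base=16, row_stride=8, rows=2, cols=3))
print(addresses)  # [16, 17, 18, 24, 25, 26]
```

An auto-sequencing memory would use such a sequence to feed one data stream of the pipe network.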
The dual paradigm approach
Software Engineering: CPU, von Neumann paradigm
Configware Engineering: ASM, Kress-Kung paradigm
Mathematical Synthesis Methods
algebraic methods, i.e., linear projections:
yield only uniform arrays with linear pipes
only for applications with regular data dependencies
Coarse-grained reconfigurable arrays are a Generalization of the Systolic Array .... [Rainer Kress]
discard algebraic synthesis methods
use optimization algorithms instead,
for example: simulated annealing
the achievement: also non-linear and non-uniform pipes, and even wilder pipe structures, are possible
now reconfigurability really makes sense
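A toy sketch of placement by simulated annealing, in Python. It is not the DPSS mapper: the cost function (total Manhattan wire length), the linear cooling schedule, and all names and parameters are assumptions chosen for illustration.

```python
import math
import random


def anneal_placement(nodes, nets, rows, cols, steps=5000, t0=5.0, seed=0):
    """Place operators on a rows x cols array, minimizing wire length."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # slots[i] holds the operator placed on cells[i], or None if unused.
    slots = list(nodes) + [None] * (len(cells) - len(nodes))
    rng.shuffle(slots)

    def cost_of(current):
        pos = {n: cells[i] for i, n in enumerate(current) if n is not None}
        return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                   for a, b in nets)

    cost = initial_cost = cost_of(slots)
    best, best_cost = list(slots), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling
        i, j = rng.randrange(len(slots)), rng.randrange(len(slots))
        slots[i], slots[j] = slots[j], slots[i]   # trial move: swap two cells
        new_cost = cost_of(slots)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]   # reject: undo the swap
    placement = {n: cells[i] for i, n in enumerate(best) if n is not None}
    return placement, best_cost, initial_cost


# Hypothetical four-operator pipe on a 3 x 3 array.
nodes = ["in", "mul", "add", "out"]
nets = [("in", "mul"), ("mul", "add"), ("add", "out")]
placement, best_cost, initial_cost = anneal_placement(nodes, nets, 3, 3)
print(placement, best_cost, initial_cost)
```

Unlike a linear projection, nothing in this search forces the resulting pipe to be linear or uniform; the optimizer may bend it in any way the array allows.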
Coarse grain is about computing, not logic
Example: mapping onto rDPA by DPSS: based on simulated annealing
SNN filter on KressArray (mainly a pipe network)
array size: 10 x 16 = 160 rDPUs (rDPU, 32 bit); no CPU
[Figure: the mapped KressArray [Ulrich Nageldinger]. Legend: rDPU not used; used for routing only (route-through only); operator and routing; backbus connect; port location marker]
tool: KressArray Xplorer: diss. Ulrich Nageldinger (downloadable)
Software / Configware Co-Compilation
[Juergen Becker’s CoDe-X, 1996]
[Figure: co-compilation flow. C language source enters the Partitioner, which works with the Analyzer / Profiler and with Resource Parameters (supporting different platforms). The SW compiler (“vN" machine paradigm) emits SW code; the CW compiler (anti machine paradigm) emits CW Code and FW Code]
Distributed Memory Parallelism Capability
[Figure: a KressArray (array size example: 10 x 16) surrounded by ASM blocks; NN ports interconnect layer; backbus connect layers …. Legend: operator and routing; rDPU not used; used for routing only; backbus connect; port location marker]
Applications for coarse-grained arrays
(on-chip distributed memory for intermediate results)
with steady I/O data streams at constant speed:
Multi-standard world HDTV receiver
Wide variety of multimedia applications
Wide variety of real-time applications
Many other applications
The wrong mind set ....
"but you can't implement decisions!"
(remark of a high-ranking industrial research head, in the discussion after a talk by Ulrich Nageldinger at RAW, Orlando)
a tiny section of the pipe network
[Figure: a tiny section of the pipe network: an adder (+) with output S. Legend: rDPU not used; backbus connect; used for routing only; operator and routing; port location marker]
The wrong mind set ....
"but you can't implement decisions!"
[Figure: section of a very large pipe network: inputs R, B, A; selection by C (=1 / =0); adder (+)]
not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm
introducing hardware description languages (in the mid-seventies):
“The decision box becomes a (de)multiplexer”
This is so simple: why did it take decades to find out ?
The wrong mind set – the wrong road map!
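A short Python sketch of the point that a decision box becomes a multiplexer, using the statement S = R + (if C then A else B) from the talk. The function names are hypothetical; the two forms are meant to show the change of view, not any particular tool's output.

```python
def mux(select, when_one, when_zero):
    """2-to-1 multiplexer: both data inputs are present, select picks one."""
    return when_one if select else when_zero


def s_control_flow(r, a, b, c):
    # Software view: a decision box chooses which statement executes.
    if c:
        t = a
    else:
        t = b
    return r + t


def s_data_flow(r, a, b, c):
    # Configware view: the decision has become a multiplexer feeding the adder.
    return r + mux(c, a, b)


print(s_control_flow(10, 1, 2, 1), s_data_flow(10, 1, 2, 1))  # 11 11
print(s_control_flow(10, 1, 2, 0), s_data_flow(10, 1, 2, 0))  # 12 12
```

In the data-flow form there is no branch to take: A and B both arrive at the multiplexer, and C only selects which one is routed on.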
section of a major pipe network on rDPU
hypothetical branching example to illustrate
software-to-configware migration
S = R + (if C then A else B endif);
[Figure: pipe network section on rDPU: inputs R, B, A; selection by C (=1); adder (+); output S; clock: 200 MHz (5 nanosec)]
simple conservative CPU example (C=1):

step                                memory cycles   nanoseconds
if C then read A:
  read instruction                        1             100
  instruction decoding
  read operand*                           1             100
  operate & reg. transfers
if not C then read B:
  read instruction                        1             100
  instruction decoding
add & store:
  read instruction                        1             100
  instruction decoding
  operate & reg. transfers
  store result                            1             100
total                                     5             500

*) if no intermediate storage in register file
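The arithmetic behind this comparison, as a small Python sketch. The figures (5 memory cycles, 100 ns each, 200 MHz clock) are taken from the slide; the assumption that the pipe network delivers one result per clock period is mine, and decoding and register-transfer time is ignored, as in the table.

```python
MEMORY_CYCLE_NS = 100     # duration of one memory cycle (from the table)
CPU_MEMORY_CYCLES = 5     # 3 instruction reads + 1 operand read + 1 result store
CLOCK_MHZ = 200           # clock of the pipe network (from the slide)

cpu_ns = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS   # 500 ns on the CPU
clock_period_ns = 1000 / CLOCK_MHZ             # 5 ns per clock period
ratio = cpu_ns / clock_period_ns               # assumed: one result per clock

print(cpu_ns, clock_period_ns, ratio)  # 500 5.0 100.0
```

Under these assumptions the CPU version spends about a hundred clock periods of the pipe network on a single result.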
why the RC paradigm shift is so important
by Software
by Configware
Move the stool or the grand piano?
… understand only this parallelism solution:
the instruction-stream-based approach: von Neumann bottlenecks
the data-stream-based approach: has no von Neumann bottleneck
What means Reconfigurable Computing?
switching the multiplexers?
routing ALU result to a register?
microprogramming?
concurrency of 64 or 256 CPUs on a single chip?
it means using the Kress/Kung machine paradigm !
vN paradigm losing its dominance
http://bwrc.eecs.berkeley.edu/Research/RAMP/people.htm
RAMP project proposes:
Run LINUX on FPGAs
vN paradigm losing its dominance
Xilinx inside !
Cray XD1
Recommended Pentium successor
Discard most caches
Have 64* cores
with clever interconnect for:
concurrent processes,
multithreading, and
a Kung-Kress rDPA array
The Desk-top Supercomputer!
What means Reconfigurable Computing ?
The key issue: which is the underlying paradigm?
Operation not based on instruction-streams at run time
No instruction fetch at run time
machine paradigm is data stream-based: Kress-Kung
Undergraduate education needs a dual paradigm approach: symbiosis of von Neumann / Kress-Kung
thank you
END
Backup for Discussion:
Term to be used for "soft hardware"
accelware
adaptware
adjustware
altware
alterware
arrangeware
changeware
conformware
doughware
fabricsware
fabrixware
fitware
flexware
formware
FPware
unfortunately “Morphware” is trademarked
gateware
gateroutware
hpcware
LUTware
matchware
modiware
morphware®
morfware
mouldware
muxware
parware
paraware
passware
pathware
patchware
send your proposal to:
© 2006, [email protected]
performware
perfware
perware
pipeware
platformware
railware
rangeware
RCware
ressourceware
routware
routeware
routingware
RTware
shapeware
shuntware
shuntingware
speedware
speedupware
suiteware
switchware
switchingware
streamware
structware
transferware
transware
variware
varyware
warpware
xferware
xware
Compilation: Software vs. Configware
[Figure: two compilation flows side by side; source languages: C, FORTRAN, MATLAB
Software Engineering: source program, software compiler, software code
Configware Engineering: source "program", configware compiler (mapper with placement & routing, data scheduler), configware code and flowware code]
Co-Compilation
[Figure: C, FORTRAN, MATLAB source enters the Software / Configware Co-Compiler. An automatic SW / CW partitioner feeds the software compiler, which emits software code, and the configware compiler (mapper, data scheduler), which emits configware code and flowware code]
Why use Reconfigurable Computing
instead of software?
Exploit spatial parallelism, and ..
… high bandwidth and low latency memory access
… and fine-grained parallelism when useful
instead of spec. hardware?
Ride the technology curve, avoiding specific silicon
Adapt to change: standards, trends, …..
Adapt to application / deployment requirements
Reduce risk
Computing Curricula 2004 (2)
CE missing
Computing Curricula 2004 (3)
2.2.1.
Computing Curricula 2004 (4)
2.2.1.
… how it should be: morphware and configware added
[Figure labels: MORPHWARE, CONFIGWARE]