A tiled processor architecture prototype: the Raw microprocessor
October 2002
Tiled Processor Architecture (TPA)
[Tile diagram: compute processor (PC, REG, ALU, FPU, IMEM, DMEM) coupled to a programmable switch (PC, SMEM)]
 Lots of ALUs and regs
 Short, programmable wires
 Lower power
Programmable: supports ILP and streams.
A Prototype TPA: The Raw Microprocessor
[Billion-transistor issue of IEEE Computer, ’97]
A Raw Tile
[Diagram: each tile of the Raw chip pairs a compute processor (PC, REG, ALU, FPU, IMEM, DMEM) with a Raw switch (PC, SMEM); the chip's edges carry packet, disk, and video streams and RDRAM]
Software-scheduled interconnects (can use static or dynamic routing, but the compiler determines instruction placement and routes)
Tight integration of interconnect
A Raw Tile
[Diagram: the compute pipeline (stages IF, D, RF, E, M1, M2, A, TL, F, P, F4, TV, U, WB) with its 0-cycle “local bypass network”; input FIFOs from the static router and output FIFOs to the static router are mapped onto registers r24–r27; chip-level view as above (tiles, packet/disk/video streams, RDRAM)]
Point-to-point, bypass-integrated, compiler-orchestrated on-chip networks
How to “program the wires”
[Diagram: Tile 10 executes fmul r24, r3, r4, and its software-controlled crossbar routes the result east (route P->E); Tile 11's software-controlled crossbar routes the incoming word from the west into the processor (route W->P), where fadd r5, r3, r25 consumes it]
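To make the route semantics concrete, here is a minimal single-threaded C sketch (our own model, not Raw's toolchain; chan_t, chan_send, and chan_recv are illustrative names) of the register-mapped FIFOs: writing r24 feeds the output network, reading r25 consumes from the input network, and the switch's route instructions carry the word in between.

#include <stdio.h>
#include <stdbool.h>

/* One-word channel standing in for the switch path Tile 10 -> Tile 11.
 * On Raw, writing r24 enqueues into the output FIFO and reading r25
 * dequeues from the input FIFO; the switch code moves the word. */
typedef struct { double word; bool full; } chan_t;

static bool chan_send(chan_t *c, double v) {   /* "write r24" */
    if (c->full) return false;                 /* tile would stall here */
    c->word = v; c->full = true; return true;
}

static bool chan_recv(chan_t *c, double *v) {  /* "read r25" */
    if (!c->full) return false;                /* tile would stall here */
    *v = c->word; c->full = false; return true;
}

int main(void) {
    chan_t east = {0};   /* route P->E on Tile 10, route W->P on Tile 11 */
    double r3 = 2.0, r4 = 8.0;

    /* Tile 10: fmul r24, r3, r4 -- the product goes straight to the net */
    chan_send(&east, r3 * r4);

    /* Tile 11: fadd r5, r3, r25 -- the operand comes straight off the net */
    double r25, r5;
    if (chan_recv(&east, &r25)) {
        r5 = 3.0 + r25;  /* Tile 11's own r3 happens to hold 3.0 */
        printf("Tile 11: r5 = %g\n", r5);
    }
    return 0;
}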
The result of orchestrating the wires
[Diagram: different workloads occupy different regions of the tile array simultaneously: an ILP computation compiled from a C program, an MPI program, an httpd server, memory tiles, idle (“Zzzz”) tiles, and a custom datapath pipeline]
Perspective
We have replaced the bypass paths, ALU-register bus, FPU-integer bus, register-cache bus, cache-memory bus, etc. with a general, point-to-point, routed interconnect called a scalar operand network (SON): a fundamentally new kind of network, optimized for both scalar and stream transport.
Programming models and software for tiled processor architectures
o Conventional scalar programs (C, C++, Java); or, how to do ILP
o Stream programs
Scalar (ILP) program mapping
E.g., start with a C program; several transformations later:
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
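For orientation, here is a plausible C source for this example, reconstructed from the transformed statements above (not copied from the cited paper), written so that each assignment lowers to one of the SSA values shown:

/* Hypothetical original kernel: tmp0..tmp3 and v0..v3 match the
 * SSA names tmp0.1, tmp1.3, tmp2.5, tmp3.6 and v0.9, v1.8, v2.7, v3.10. */
void kernel(double seed,
            double *v0, double *v1, double *v2, double *v3,
            double *tmp0, double *tmp1, double *tmp2, double *tmp3) {
    *tmp0 = (seed * 3.0 + 2.0) / 2.0;
    *tmp1 = seed * *v1 + 2.0;
    *tmp2 = seed * *v2 + 2.0;
    *tmp3 = (seed * 6.0 + 2.0) / 3.0;
    *v2 = (*tmp1 - *tmp2) * 5.0;
    *v1 = (*tmp1 + *tmp2) * 3.0;
    *v0 = *tmp0 - *v1;
    *v3 = *tmp3 - *v2;
}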
Existing languages will work.
[Lee, Amarasinghe et al., “Space-time scheduling”, ASPLOS ’98]
Scalar program mapping
[Figure: the program code above (left) is redrawn as its data-flow graph (right); each statement, from seed.0=seed down to the final copies v0=v0.9 … v3=v3.10, becomes a graph node with edges for the value dependences]
Program graph clustering
[Figure: the data-flow graph nodes are grouped into clusters of related statements (e.g. the chains computing tmp0, tmp1, tmp2, and tmp3), keeping tightly coupled statements together]
Placement
[Figure: the clusters are assigned to Tile1, Tile2, Tile3, and Tile4]
Routing
[Figure: inter-cluster edges are expanded into processor code and switch code on the tiles along each route]
Instruction Scheduling
[Figure: the final schedule for Tile1–Tile4, laid out over time. Each tile's processor stream interleaves computation with blocking send()/recv() operations (e.g. seed.0=seed; send(seed.0) on the tile that owns seed, and seed.0=recv(); pval5=seed.0*6.0 on a consumer tile), while each tile's switch stream runs the matching route operations (route(t,E), route(W,t), route(W,S), …) that carry the operands between tiles]
Raw die photo
0.18 micron process, 16 tiles, 425 MHz, 18 W (running vpenta)
Of course, a custom IC designed by an industrial design team could do much better.
Raw motherboard
Conventional architectures (Nexperia)
[Block diagram: a “Generation 1” Nexperia SoC with a MIPS (PR4450) control CPU and two TriMedia TM32 VLIW cores, a central memory controller (MS), DCS/PI buses and bridges (DCS-SEC, DCS-CTR, M-PI, T-PI, C-Bridge, M-DCS, T-DCS, CAB, PMA), and dozens of function-specific peripherals (MPEG, 1394, VMPG, DVDD, EDMA, VLD, QVCP1/2, MBS1/2, QTNR, VIP1/2, VPK, TSDMA, DENC, SPDIO, AIO1–3, GPIO, MSP1/2, UART1–3, IIC1–3, USB, SMC1/2, conditional access, EJTAG, BOOT, …), each attached as a master (M) and/or slave (S) with read (R) / write (W) ports]
• focus on computation
  • programmable cores
  • domain-specific cores
  • L1 caches
  • reuse level raised from standard cells (SC) to IP blocks
• communication: straightforward
  • buses + bridges (heterogeneous)
  • data is communicated via external memory, under synchronisation control of a programmable core
• example: Nexperia
  • 0.18 µm process, 8 metal layers
  • 1.8 V / 4.5 W
  • 75 clock domains
  • 35 M transistors
Symmetric Multiprocessor [Culler]
[Diagram: a task graph (in1, in2 → job1, job2 → out; tasks annotated with rates, e.g. fh = 16 MHz) mapped onto a bus-based symmetric multiprocessor: processors (each running an (RT)OS, each with a cache) and accelerators sharing a memory over a bus-based interconnect]
• rationale: driven by flexibility (applications are unknown at design time)
  • dynamic load balancing via flexible binding of Kahn processes to processors
• extension of well-known computer architectures (shared memory, caches, …), adopting a general-purpose view and using existing skills
• key issue: cache coherency and memory consistency
• performance analysis via simulation
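Since the rationale leans on Kahn process networks, a minimal C sketch of the model (an illustration, not the cited system): processes communicate only through FIFO channels and a read blocks until data is present, so the streams produced are independent of how processes are bound to processors. Here the three "processes" are simply run to completion in sequence, which Kahn determinism makes equivalent to any concurrent schedule.

#include <stdio.h>

/* Minimal Kahn-style pipeline: source -> scale -> sink.
 * Channels are FIFOs; blocking reads make the result the same
 * no matter how the processes are scheduled or placed. */
enum { N = 4 };
typedef struct { int buf[N]; int r, w; } fifo_t;

static void put(fifo_t *f, int v) { f->buf[f->w++] = v; }
static int  can_get(const fifo_t *f) { return f->r < f->w; }
static int  get(fifo_t *f) { return f->buf[f->r++]; }

int main(void) {
    fifo_t a = {{0}, 0, 0}, b = {{0}, 0, 0};
    for (int i = 0; i < N; i++) put(&a, i);      /* process: source */
    while (can_get(&a)) put(&b, 2 * get(&a));    /* process: scale  */
    while (can_get(&b)) printf("%d\n", get(&b)); /* process: sink   */
    return 0;
}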
Problems (1): timing events
• Classic approach: processors communicate via SDRAM under synchronisation control of the CPU
[Diagram: coprocessors C_1, C_2, … C_20 hand a buffer through mem in four numbered steps, with the CPU relaying the synchronisation; events climb from the drivers through the SD level and TM sync level to the application level (task rates fA, fB)]
• P1: extra claims on a scarce resource (bandwidth); desired: point-to-point communication
• P2: lots of events exchanged with higher SW levels; desired: start when data is available
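A toy C model of the classic flow (step granularity and all names are our own): the producer writes a buffer to external memory, the CPU relays the synchronisation event, and the consumer reads the buffer back, so each transfer crosses the SDRAM twice and raises events into higher software levels.

#include <stdio.h>

/* Gen-1 handoff, steps 1-4: C_1 -> mem -> C_2 with the CPU in the
 * synchronisation loop. Illustrative only. */
enum { BUF_WORDS = 4 };
static int sdram[BUF_WORDS];          /* shared external memory */

static void c1_produce(void) {        /* step 1: write buffer   */
    for (int i = 0; i < BUF_WORDS; i++) sdram[i] = i * i;
}

static void cpu_forward_event(void) { /* steps 2 and 3          */
    /* C_1 interrupts the CPU, which signals C_2; two events per
     * buffer climb through drivers and OS to the application level. */
    printf("CPU: buffer-ready event forwarded to C_2\n");
}

static void c2_consume(void) {        /* step 4: read buffer    */
    int sum = 0;
    for (int i = 0; i < BUF_WORDS; i++) sum += sdram[i];
    printf("C_2: consumed buffer, sum = %d\n", sum);
}

int main(void) {
    c1_produce();        /* each transfer costs SDRAM bandwidth twice */
    cpu_forward_event(); /* ...and bothers the CPU with sync events   */
    c2_consume();
    return 0;
}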
Problems (2): timing & processor stalls
• Processor stalls, e.g. 60%, with large variation
  • miss penalty (best case BC, average case AC, worst case WC)
  • miss rate
• Unpredictability at every arbiter: caches, memory, busses
• “Easy to program, hard to tune”: programming effort? cost?
[Diagram: Task B runs through an L1$ (3% miss rate) and an L2$ to SDRAM; miss penalties BC = 8 cc, AC = 20 cc, WC = 3000 cc]
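The stall numbers compose simply: stall cycles per instruction ≈ miss rate × miss penalty. A minimal C sketch using the slide's figures (3% miss rate, AC = 20 cc) on an assumed 1-CPI, one-reference-per-instruction baseline; this is where the 60% figure in the problem summary below comes from.

#include <stdio.h>

/* Back-of-the-envelope stall model; the parameters are the slide's
 * numbers plus an assumed 1-CPI baseline. */
int main(void) {
    double miss_rate   = 0.03; /* L1 misses per reference           */
    double penalty_cc  = 20.0; /* average-case (AC) miss penalty    */
    double refs_per_op = 1.0;  /* memory references per instruction */
    double base_cpi    = 1.0;

    double stall_cpi = miss_rate * penalty_cc * refs_per_op;
    printf("stall cycles per instruction: %.2f (%.0f%% of base CPI)\n",
           stall_cpi, 100.0 * stall_cpi / base_cpi);
    /* With WC = 3000 cc the same 3% of references dominates the
     * schedule, which is why "easy to program" turns into "hard to tune". */
    return 0;
}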
A typical video flow graph
[Janssen, Dec. 02]
Problems (3): end-to-end timing?
• Interaction between multiple local arbiters
[Diagram: two TM-CPUs, an STC, and an ASIP, each with its own I$ and D$, all sharing DDR SDRAM]
Problems (4): compositionality
• Multiple independent applications active simultaneously
[Diagram: the same multiprocessor (two TM-CPUs, STC, ASIP, each with I$/D$) sharing DDR SDRAM across applications]
Generation 1: Problem summary
• Timing
  • events (coarse-level sync), latency critical
  • end-to-end timing behavior of the application: a 3% miss rate with a 20 cc penalty already gives 60% stalls, and local arbiters interact
  • composition of several applications sharing resources (virtualization)
• Power: 2× power dissipation
• Area: expensive caches
• NRE cost (>20 M$ per ITRS, due to SW): verification by simulation
[Diagram: processors with I$/D$ sharing DDR SDRAM]
Towards a solution
• Chips are becoming distributed systems: tiles will become very much autonomous.
• GALS (globally asynchronous, locally synchronous) timing techniques.
• For performance & predictability reasons we want to decouple communication from computation.
  • Tiles run independently from other tiles and from communication actions.
• Add a communication assist (CA):
  • acts as an initiator of a communication action
  • arbitrates the access to the memory
  • can stall the processor
• This way communication and computation concerns are separated.
[Diagram: a tile with processor, local mem, and CA; the CA sits between computation and the interconnect, with master and slave ports]
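A minimal sketch of the CA's contract, assuming a hypothetical blocking-FIFO interface (ca_chan_t, ca_send, ca_recv are illustrative names, not a real driver API): the processor touches only local state, and the stall cases mark exactly where the CA would hold the processor.

#include <stdio.h>

/* Bounded FIFO owned by the communication assist. A full/empty
 * channel is where the CA would stall the processor. */
enum { DEPTH = 2 };
typedef struct { int buf[DEPTH]; int head, count; } ca_chan_t;

static int ca_send(ca_chan_t *c, int v) {  /* blocks when full  */
    if (c->count == DEPTH) return 0;       /* processor stalled */
    c->buf[(c->head + c->count) % DEPTH] = v;
    c->count++;
    return 1;
}

static int ca_recv(ca_chan_t *c, int *v) { /* blocks when empty */
    if (c->count == 0) return 0;           /* processor stalled */
    *v = c->buf[c->head];
    c->head = (c->head + 1) % DEPTH;
    c->count--;
    return 1;
}

int main(void) {
    ca_chan_t ch = {{0}, 0, 0};
    /* Producer tile: computes against local mem, hands results to its CA. */
    for (int i = 0; i < 2; i++) ca_send(&ch, 10 * i);
    /* Consumer tile: its CA delivers the data; the computation never
     * touches the interconnect directly. */
    int v;
    while (ca_recv(&ch, &v)) printf("consumer got %d\n", v);
    return 0;
}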
Gen. 2 Architecture
cluster
[Culler][Hijdra]
.
cluster
Processor
Processor
stall
MEM
CA
stall
MEM
CA
Network on chip
MEM
Memory cluster
CA
MEM
.
.
Cluster/tile = computation +
local memory
. heterogeneous
. CPU, DSP, ASIPs, ASICs
. Memory only
. IO
Clusters are autonomous.
The communication is done via
an on-chip network
CA
SDRAM
CTRL
A generic scalable multiprocessor architecture = a collection of essentially
complete computers, including one or more processors and memory,
communicating through a general-purpose high-performance scalable
interconnect and a communication assist. [C&S, pp.51]