APPENDIX 1 - ADVANCED FPGA PRODUCTS
Heterogeneous Programmable Platforms
Centered around an FPGA. Example: Xilinx Virtex-II Pro
- FPGA fabric
- Embedded memories
- Embedded PowerPC cores
- Hardwired multipliers
- High-speed I/O (3.125 Gbps transceivers)
Courtesy Xilinx
Soft Cores
Concept figure, not a real device
MicroBlaze embedded processor
SOFT CORE:
- RISC processor optimized for implementation on Xilinx FPGAs
- Completely implemented in the field, in the general-purpose memory and logic fabric of the FPGA
Berkeley Pleiades Processor
Centered around an ARM8 core
[Die diagram: ARM8 core with interfaces to an FPGA and a reconfigurable datapath.]
- ARM8: system manager
- Intensive computations offloaded to a reconfigurable datapath (adders, multipliers, ASIP, ...)
- FPGA for bit manipulation
• 0.25 um 6-level metal CMOS
• 5.2 mm x 6.7 mm
• 1.2 million transistors
• 40 MHz at 1 V
• 2 extra supplies: 0.4 V, 1.5 V
• 1.5~2 mW power dissipation
Today: Xilinx Zynq-7000
Xilinx UltraScale MPSoC: an all-programmable heterogeneous MPSoC
...and of course programmable logic
APPENDIX 2 - ADVANCED PROTOTYPING EXAMPLE
Heterogeneous Parallel Computing

Template features
- Host processor core (ARM big.LITTLE)
- Programmable multi-core accelerator (GPPA)
- Hierarchical interconnect fabric
  – CCI-400 (a crossbar)
  – System NoC (a network-on-chip)

GOAL: prototype an innovative GPPA capable of running multiple concurrent offload applications by means of isolated and reserved computation partitions

Prototyping platform:
- Virtex-7 evaluation board VC-707
- XC7VX485T chip: ~486k logic cells, 76k slices, 36 Mb BRAM
- Advanced GHz-range transceivers, on-board RAM and flash, display, Ethernet, etc.
[Block diagram: the FPGA prototype. The main NoC master and an AXI bus connect Xilinx IP slaves (DRAM controller, memory, interrupt controller, UART, debug module, timer, GPIO). Ferrara IP comprises the fabric controller, a dual-NoC receiver and driver, traffic sniffers, a programmable fault injector, and a mesh of MicroBlaze (μB) cores, each behind a network interface (NI), interconnected by the dual NoC.]
ACCELERATOR ARCHITECTURE
- Computation clusters with distributed L2 banks
- Dual NoC for routing reconfiguration
- Fabric controller (NoC reconfiguration, partition setup, application start, ...)
- GPPA I/O interface
- MicroBlaze in place of clusters
- Hardware sniffers for user-accessible link traffic monitoring
[Diagram: a switch (SW) with sniffers on its NORTH, SOUTH, EAST, and WEST links.]
NETWORK-ON-CHIP
[Switch diagram: each tile (µP plus L2 bank) attaches through two 6x6 LOCAL switches, one carrying inter-processor requests to L2 and one carrying L2 responses, both supporting routing reconfiguration, and one 6x6 GLOBAL switch with no circuit and no routing reconfiguration. Virtual-channel arbiters (3x1 and 2x1) drive the vc_id_in/vc_id_out and stall_in/stall_out flow control on the local and global links.]
- Hardwired routing registers (full mesh)
- A set of registers, written from the dual bus, for OSR partition programming
- Dashed links are currently unused





Demo 1:
- Initial NoC testing (for stuck-at faults) and configuration
- Detection of a link failure
- NoC is configured to route around it (see the rerouting sketch below)
- Matrix Multiply benchmark starts on the 16 mesh MicroBlazes
- Objective: configuration OK, rerouting OK, benchmark OK
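A minimal sketch of the rerouting step, assuming a 4x4 mesh and destination-rooted BFS. The real dual NoC reprograms hardwired routing registers in the switches, so the table rebuild below is only a software illustration of the routing decision; the node numbering, failed-link encoding, and reconfigure() API are invented here.

/* reroute.c - toy model of NoC rerouting after a link failure.
 * Assumed: 4x4 mesh, node id = y*4 + x. */
#include <stdio.h>
#include <string.h>

#define N 4
#define NODES (N * N)

static int fail_a = -1, fail_b = -1;      /* endpoints of the dead link */

static int linked(int a, int b)           /* mesh adjacency, minus the fault */
{
    if ((a == fail_a && b == fail_b) || (a == fail_b && b == fail_a))
        return 0;
    int ax = a % N, ay = a / N, bx = b % N, by = b / N;
    return (ax == bx && (ay - by == 1 || by - ay == 1)) ||
           (ay == by && (ax - bx == 1 || bx - ax == 1));
}

static int next_hop[NODES][NODES];        /* per-switch forwarding table */

static void reconfigure(void)
{
    for (int d = 0; d < NODES; d++) {     /* BFS rooted at each destination */
        int dist[NODES], q[NODES], head = 0, tail = 0;
        memset(dist, -1, sizeof dist);
        dist[d] = 0;
        q[tail++] = d;
        while (head < tail) {
            int v = q[head++];
            for (int u = 0; u < NODES; u++)
                if (linked(u, v) && dist[u] < 0) {
                    dist[u] = dist[v] + 1;
                    next_hop[u][d] = v;   /* first hop of a shortest path */
                    q[tail++] = u;
                }
        }
    }
}

int main(void)
{
    fail_a = 5; fail_b = 6;               /* say the sniffers flagged link 5-6 */
    reconfigure();
    printf("switch 5 now reaches 6 via switch %d\n", next_hop[5][6]);
    return 0;
}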



Demo 2:
- Button press: the fabric controller (a supervising MicroBlaze) initiates dynamic space-division multiplexing (SDM)
- MicroBlazes start new SDM-aware tasks
- Objective: prove partition isolation and differentiated, partition-shape-dependent execution time (see the partition sketch below)
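To make partition isolation concrete, here is a toy sketch, not the fabric controller's actual code, of SDM bookkeeping assuming rectangular partitions and XY routing; the rectangle type and admission check are illustrative.

/* sdm.c - toy SDM bookkeeping: rectangular partitions on the mesh.
 * Assumed: XY routing, so a route never leaves the bounding box of
 * its endpoints; endpoint containment then implies path containment. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int x0, y0, x1, y1; } partition_t;   /* inclusive corners */

static bool inside(partition_t p, int x, int y)
{
    return x >= p.x0 && x <= p.x1 && y >= p.y0 && y <= p.y1;
}

static bool route_isolated(partition_t p, int sx, int sy, int dx, int dy)
{
    return inside(p, sx, sy) && inside(p, dx, dy);
}

int main(void)
{
    partition_t left = {0, 0, 1, 3};      /* a 2x4 slice of the 4x4 mesh */
    printf("(0,0)->(1,3) stays in partition: %d\n",
           route_isolated(left, 0, 0, 1, 3));   /* 1: admitted */
    printf("(0,0)->(3,0) stays in partition: %d\n",
           route_isolated(left, 0, 0, 3, 0));   /* 0: would escape */
    return 0;
}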
The 4x4 mesh, mesh NIs, dual NoC, MicroBlazes, and other logic push FPGA resource utilization beyond 90%!
Accelerator Offload
Example: offload packet (data and binary for the GPPA)
• The OpenMP RTE forwards the offload request to the guest GPPA driver
• The guest GPPA driver forwards it to the GPPA emulation device
• The GPPA emulation device forwards the request to the GPPA bridge and copies data and binary from guest memory space to host memory space
• The GPPA bridge forwards the packet to the host GPPA driver and copies data and binary from host virtual memory to contiguous memory shared with the GPPA (L3 memory)
[Diagram: in the guest, the app uses the resource allocation/management API, which builds a task descriptor and task data and issues an ioctl on /dev/GPPAv in the guest kernel; the GPPAv emulation device (QEMU/KVM) passes them via iowrite and a POSIX queue to the GPPA bridge (GPPAbrctl) and /dev/GPPA on the host, which places them in the contiguous memory seen by the GPPA.]
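The guest side of this path can be pictured with a short, hypothetical user-space call. The device node /dev/GPPAv and the ioctl entry point appear in the slide; the request code GPPA_IOC_OFFLOAD and the gppa_task descriptor layout below are invented for illustration.

/* offload.c - hypothetical guest-side offload call. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct gppa_task {                 /* assumed task-descriptor layout */
    const void *binary;            /* GPPA binary image */
    unsigned long binary_len;
    const void *data;              /* task input data */
    unsigned long data_len;
};

#define GPPA_IOC_OFFLOAD _IOW('G', 1, struct gppa_task)   /* assumed code */

int main(void)
{
    static const char bin[] = "...gppa binary...";
    static const char dat[] = "...task data...";

    int fd = open("/dev/GPPAv", O_RDWR);
    if (fd < 0) { perror("open /dev/GPPAv"); return 1; }

    struct gppa_task t = { bin, sizeof bin, dat, sizeof dat };
    /* The guest driver forwards this to the GPPA emulation device,
     * which copies data and binary toward host memory. */
    if (ioctl(fd, GPPA_IOC_OFFLOAD, &t) < 0)
        perror("GPPA offload");
    close(fd);
    return 0;
}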
GPPA Offload
CURRENTLY AIMING AT A UNIQUE PROTOTYPING PLATFORM
• The offload procedure relies on copies into a non-paged, contiguous memory range (seen as mmap-ed I/O)
• COPIES ARE AVOIDED IN REAL SYSTEMS BY MEANS OF AN IOMMU!
Validated on ODROID
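A sketch of how the contiguous window might be touched from host user space, given that the slide describes it as a non-paged range seen as mmap-ed I/O; the mapping size and offset are placeholders, not the platform's real memory map.

/* l3map.c - hypothetical host-side view of the contiguous window. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define L3_SIZE (1 << 20)          /* assumed window size: 1 MiB */

int main(void)
{
    int fd = open("/dev/GPPA", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/GPPA"); return 1; }

    /* the host driver copies task data/binary into this window,
     * which the GPPA sees as L3 memory */
    void *l3 = mmap(NULL, L3_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if (l3 == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    memset(l3, 0, 64);             /* e.g. clear a descriptor slot */

    munmap(l3, L3_SIZE);
    close(fd);
    return 0;
}

With an IOMMU, the device could translate guest/host virtual addresses directly, which is why real systems avoid these copies.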
OFFLOAD – the accelerator side
[Diagram: accelerator-side memory map and control flow. The fabric controller (FC) sets up the partition, generates a TASK in the OpenMP offload task queue (in BRAM, behind a BRAM controller), and triggers task execution. The map includes a 4K fabric-controller task region with TEST & SET support (0x10000000-0x10000FFF) and, per MicroBlaze thread (UB_0, UB_1, UB_2), a 64K OpenMP offload data-support region in an L2 bank (L2_0 up to 0x1003FFFF, L2_1 at 0x100C0000-0x100CFFFF, L2_2 up to 0x100FFFFF). NIs and switches connect the blocks to the AXI bus, I/O ports, and UART. A worker-loop sketch follows below.]
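A bare-metal sketch of the accelerator-side loop implied by the diagram: a MicroBlaze grabs the test-and-set lock, pops a descriptor from the offload task queue, and runs it. The TAS base address comes from the memory map above; the queue layout, task format, and read-to-lock TAS semantics (a read returns the old value and sets the flag) are assumptions.

/* worker.c - bare-metal MicroBlaze sketch, not host-runnable. */
#include <stdint.h>

#define TAS_BASE 0x10000000u       /* 4K TEST & SET region (memory map above) */

static volatile uint32_t *const tas = (volatile uint32_t *)TAS_BASE;

typedef struct {                   /* hypothetical task descriptor */
    void (*entry)(void *);
    void *args;
} task_t;

#define QUEUE_LEN 16
static task_t task_queue[QUEUE_LEN];      /* filled by the FC in BRAM */
static volatile uint32_t head, tail;      /* FC advances tail */

void worker_loop(void)
{
    for (;;) {
        while (*tas != 0)          /* each read is a lock attempt (assumed) */
            ;                      /* spin while another core holds it */
        if (head != tail) {        /* a TASK generated by the FC? */
            task_t t = task_queue[head];
            head = (head + 1) % QUEUE_LEN;
            *tas = 0;              /* release the lock */
            t.entry(t.args);       /* run the offloaded task */
        } else {
            *tas = 0;              /* nothing queued: release and retry */
        }
    }
}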
Going to ASIC: the synthesis flow
- 28 nm, 12T library, regular threshold voltage; 1.0 V / 0.9 V / 0.8 V supply voltages (best/typical/worst); 125 C / 25 C / 125 C temperatures (best/typical/worst)
- Flow tools: Design Compiler, IC Compiler, SoC Encounter
Target design (two switches, Switch 0 and Switch 1, each with attached NIs):
- switch radix: 7x7
- 32-bit flit width
- 3 VCs
- 2-slot input buffers
- 6-slot output buffers
- 3 NIs per cluster
- tight boundary constraints: 400 ps input transition slope; 1000 times the input capacitance of the biggest inverter in the library as output capacitance
Post-synthesis MAX speed: 800 MHz
Floorplanning
- Link length based on an estimated tile size of 2 mm in 28 nm
- Hard fences defined for the floorplanning blocks
- Row utilization set to 60%
[Floorplan legend: cluster CPU network interfaces, cluster L1 network interfaces, switches, L2 bank network interfaces.]
Post-Layout Analysis
• Post-layout:
– 800 MHz (highly predictable)
– 213,515 µm²
• Critical path:
– inside the FSM of the virtual-channel flit-level arbiter
– the link was not on the critical path