APPENDIX 1 - ADVANCED FPGA PRODUCTS

Heterogeneous Programmable Platforms
Centered around an FPGA (Xilinx Virtex-II Pro):
• FPGA fabric
• Embedded memories
• Embedded PowerPC
• Hardwired multipliers
• High-speed I/O (3.125 Gbps transceivers)
(Concept figure, not a real device; courtesy Xilinx)

Soft Cores
MicroBlaze embedded processor
• SOFT CORE: a RISC processor optimized for implementation on Xilinx FPGAs
• Completely implemented in the field, in the general-purpose memory and logic fabric of the FPGA

Berkeley Pleiades Processor
Centered around an embedded ARM core, with a reconfigurable datapath, FPGA fabric, and interconnect interface:
• ARM8: system manager
• Intensive computations offloaded to a reconfigurable datapath (adders, multipliers, ASIPs, ...)
• FPGA for bit manipulation
• 0.25 um, 6-level-metal CMOS
• 5.2 mm x 6.7 mm, 1.2 million transistors
• 40 MHz at 1 V; 2 extra supplies: 0.4 V and 1.5 V
• 1.5-2 mW power dissipation

Today: Xilinx Zynq-7000, Xilinx UltraScale+ MPSoC
• All-programmable heterogeneous MPSoC
• ...and of course programmable logic

APPENDIX 2 - ADVANCED PROTOTYPING EXAMPLE

Heterogeneous Parallel Computing
Template features:
• Host processor core (ARM big.LITTLE)
• Programmable multi-core accelerator (GPPA)
• Hierarchical interconnect fabric
  - CCI-400 (a crossbar)
  - System NoC (a network-on-chip)
GOAL: prototype an innovative GPPA capable of running multiple concurrent offload applications by means of isolated and reserved computation partitions.

Virtex-7 evaluation board VC-707
• XC7VX485T chip: 485k logic cells, 76k slices, 36 Mb BRAM
• Advanced GHz-range transceivers, on-board RAM and flash, display, Ethernet, etc.

[Prototype block diagram: an AXI bus fabric connects the DRAM controller, memory, interrupt controller, UART, debug module, timer, GPIO, the fabric controller, and the dual-NoC receiver/driver; the dual NoC is a mesh of MicroBlaze (uB) tiles and network interfaces (NI), with traffic sniffers and a programmable fault injector; blocks are marked as either Xilinx IP or Ferrara IP.]

ACCELERATOR ARCHITECTURE
• Computation clusters with distributed L2 banks
• Dual NoC for routing reconfiguration
• Fabric Controller (NoC reconfiguration, partition setup, application start, ...)
• GPPA I/O interface
• MicroBlaze cores in place of the clusters
• Hardware sniffers on the north/east/south/west switch ports for user-accessible link traffic monitoring

NETWORK-ON-CHIP
[Switch tile diagram: a 6x6 LOCAL request switch and a 6x6 LOCAL response switch carry inter-processor requests/responses, L2 traffic, and routing reconfiguration, while a 6x6 GLOBAL switch carries no circuits and no routing reconfiguration; 3x1 and 2x1 virtual-channel arbiters with vc_id and stall_in/stall_out signals implement flow control; hardwired full-mesh routing registers, plus a set of registers programmable from the dual bus, serve OSR/partition programming; dashed links are currently unused.]

Demo scenario (see the routing sketch after this appendix):
• Initial NoC testing (for stuck-at faults) and configuration
• Detection of a link failure: the NoC is configured to route around it
• The Matrix Multiply benchmark starts on the 16 mesh MicroBlazes
• Objective: configuration ok, rerouting ok, benchmark ok
• On a button press, the fabric controller (a supervision MicroBlaze) initiates dynamic space-division multiplexing (SDM)
• The MicroBlazes start new SDM-aware tasks
• Objective: prove partition isolation and differentiated, partition-shape-dependent execution times
• 4x4 mesh, mesh NIs, dual NoC, MicroBlazes and more: beyond 90% resource utilization!
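The demo scenario above reconfigures the NoC routing so that traffic steers around a failed link. As a rough illustration of the idea (not the actual Ferrara routing IP), the hedged C sketch below computes, for a 4x4 mesh, the output port a packet would take under dimension-order (XY) routing, falling back to the other dimension or a one-hop detour when the preferred link is marked faulty. The port names, fault-map encoding, and fallback policy are assumptions for illustration only.

```c
/* Hedged sketch of fault-aware hop selection in a 4x4 mesh.
 * Illustrative only: port names, fault map, and the detour policy are
 * assumptions, not the routing reconfiguration used in the prototype.
 * The simple detour is NOT deadlock/livelock free in general. */
#include <stdio.h>
#include <stdbool.h>

#define MESH_W 4
#define MESH_H 4

enum port { P_LOCAL, P_NORTH, P_EAST, P_SOUTH, P_WEST };

/* link_fault[x][y][p] == true: the outgoing link on port p of switch (x,y) failed */
static bool link_fault[MESH_W][MESH_H][5];

/* One hop of XY (dimension-order) routing from (x,y) toward (dx,dy). */
static enum port xy_hop(int x, int y, int dx, int dy)
{
    if (dx > x) return P_EAST;
    if (dx < x) return P_WEST;
    if (dy > y) return P_SOUTH;
    if (dy < y) return P_NORTH;
    return P_LOCAL;
}

/* One hop of YX routing, tried when the XY hop uses a faulty link. */
static enum port yx_hop(int x, int y, int dx, int dy)
{
    if (dy > y) return P_SOUTH;
    if (dy < y) return P_NORTH;
    if (dx > x) return P_EAST;
    if (dx < x) return P_WEST;
    return P_LOCAL;
}

/* Output port at switch (x,y) for destination (dx,dy), avoiding one failed link. */
enum port route_hop(int x, int y, int dx, int dy)
{
    enum port p = xy_hop(x, y, dx, dy);
    if (p == P_LOCAL || !link_fault[x][y][p])
        return p;

    /* Preferred link is down: try the other dimension first. */
    enum port alt = yx_hop(x, y, dx, dy);
    if (alt != p && !link_fault[x][y][alt])
        return alt;

    /* Otherwise take a one-hop detour around the broken link. */
    if (p == P_EAST || p == P_WEST)
        return !link_fault[x][y][P_SOUTH] ? P_SOUTH : P_NORTH;
    return !link_fault[x][y][P_EAST] ? P_EAST : P_WEST;
}

int main(void)
{
    /* Example: the NoC test reports the east link of switch (1,1) as faulty. */
    link_fault[1][1][P_EAST] = true;

    /* Traffic from (1,1) to (3,1) now leaves through the south port first. */
    printf("port = %d\n", route_hop(1, 1, 3, 1));
    return 0;
}
```

In the prototype this decision is not taken per packet in software; the fabric controller reprograms the switch routing registers once, but the table entries it writes follow the same kind of reasoning.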
Accelerator Offload
Example: offload packet (data and binary for the GPPA).
[Offload diagram: the application and the resource allocation/management API sit in the guest on top of /dev/GPPAv; the task descriptor (Tsk desc) and task data (Tsk data) flow through the GPPA emulation device (GPPAv: POSIX queue, iowrite) in QEMU/KVM, the GPPA bridge (GPPAbrctl), and /dev/GPPA on the host, ending in the contiguous memory shared with the GPPA.]
• The OpenMP RTE forwards the offload request (via ioctl) to the guest GPPA driver
• The guest GPPA driver forwards it to the GPPA emulation device
• The GPPA emulation device forwards the request to the GPPA bridge and copies data and binary from the guest memory space to the host memory space
• The GPPA bridge forwards the packet to the host GPPA driver and copies data and binary from host virtual memory to the contiguous memory shared with the GPPA (L3 memory)

GPPA Offload
CURRENTLY AIMING AT A UNIQUE PROTOTYPING PLATFORM
• The offload procedure relies on copies into a non-paged, contiguous memory range (seen as mmap-ed I/O); a hedged user-space sketch is given at the end of this appendix
• COPIES ARE AVOIDED IN REAL SYSTEMS BY MEANS OF AN IOMMU!
• Validated on ODROID

GPPA OFFLOAD - the accelerator side
[Accelerator-side memory map, seen through the AXI bus and I/O ports: a 4 KB TEST & SET region at 0x10000000-0x10000FFF; 64 KB L2 data banks (L2_0 up to 0x1003FFFF, L2_1 at 0x100C0000-0x100CFFFF, L2_2 up to 0x100FFFFF); OpenMP support regions around 0x10030000 and 0x100F0000; a BRAM controller with its BRAM; MicroBlaze tiles UB_0/UB_1/UB_2 attached to switches and NIs.]
• The fabric controller (FC) sets up the partition
• The OpenMP offload support generates a TASK and places it in the OpenMP offload task queue
• The FC triggers task execution (a hedged accelerator-side sketch is also given at the end of this appendix)

Going to ASIC: the synthesis flow
• 28 nm, 12T library, regular threshold voltage
• 1.0 V / 0.9 V / 0.8 V supply voltages (best/typical/worst); 125 C / 25 C / 125 C temperatures (best/typical/worst)
• Tools: Design Compiler, IC Compiler, SoC Encounter
Target design: two switches (Switch 0 and Switch 1) with their NIs
• Switch radix: 7x7
• 32-bit flit width
• 3 VCs
• 2-slot input buffers
• 6-slot output buffers
• 3 NIs per cluster
• Tight boundary constraints: 400 ps input transition slope; output capacitance equal to 1000x the input capacitance of the biggest inverter in the library
• Post-synthesis MAX speed: 800 MHz

Floorplanning
• Link length based on an estimated tile size of 2 mm in 28 nm
• Hard fences defined for the floorplanning blocks (cluster CPU, cluster L1, L2 bank, switches, network interfaces)
• Row utilization set to 60%

Post-Layout Analysis
• Post-layout: 800 MHz (highly predictable), 213,515 um2
• Critical path: inside the FSM of the virtual-channel flit-level arbiter
• The link was not on the critical path
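As referenced in the offload slides above, the copy-based path ends with the host driver exposing the contiguous (non-paged) L3 window as mmap-ed I/O. The hedged C sketch below shows what the final host-side step could look like from user space. The device node /dev/GPPA comes from the slides; the descriptor layout, the GPPA_IOC_OFFLOAD ioctl command, and the window size are hypothetical names introduced only for illustration.

```c
/* Hedged user-space sketch of the copy-based offload step (host side).
 * /dev/GPPA is named in the slides; everything else (descriptor layout,
 * ioctl command, window size) is a hypothetical illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct gppa_task_desc {          /* hypothetical descriptor layout */
    uint32_t binary_offset;      /* offset of the GPPA binary in the L3 window */
    uint32_t binary_size;
    uint32_t data_offset;        /* offset of the task data in the L3 window */
    uint32_t data_size;
};

#define GPPA_IOC_OFFLOAD _IOW('G', 1, struct gppa_task_desc)  /* hypothetical */
#define L3_WINDOW_SIZE   (1u << 20)                           /* hypothetical 1 MB window */

int gppa_offload(const void *binary, size_t bin_sz, const void *data, size_t data_sz)
{
    if (bin_sz + data_sz > L3_WINDOW_SIZE)
        return -1;

    int fd = open("/dev/GPPA", O_RDWR);
    if (fd < 0) { perror("open /dev/GPPA"); return -1; }

    /* The contiguous, non-paged L3 window is exposed as mmap-ed I/O. */
    uint8_t *l3 = mmap(NULL, L3_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (l3 == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    /* Copy binary and task data into the shared contiguous memory:
     * exactly the copies an IOMMU would make unnecessary in a real system. */
    struct gppa_task_desc d = {
        .binary_offset = 0,
        .binary_size   = (uint32_t)bin_sz,
        .data_offset   = (uint32_t)bin_sz,
        .data_size     = (uint32_t)data_sz,
    };
    memcpy(l3 + d.binary_offset, binary, bin_sz);
    memcpy(l3 + d.data_offset,   data,   data_sz);

    /* Hand the descriptor to the host GPPA driver, which notifies the GPPA. */
    int rc = ioctl(fd, GPPA_IOC_OFFLOAD, &d);

    munmap(l3, L3_WINDOW_SIZE);
    close(fd);
    return rc;
}
```

With an IOMMU, the two memcpy calls would disappear: the driver would map the application's existing buffers directly into the accelerator's address space instead of staging them in the contiguous window.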
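On the accelerator side, the slides describe the fabric controller setting up a partition, placing a TASK in the OpenMP offload task queue, and triggering execution, with a hardware TEST & SET region at 0x10000000 available for synchronization. The bare-metal-style C sketch below is one possible shape of a MicroBlaze worker loop under those assumptions; the queue layout, the queue base address, the task entry type, and the test-and-set semantics (a read returns 0 to exactly one reader and 1 to the others) are illustrative guesses, not the prototype's actual firmware.

```c
/* Hedged sketch of an accelerator-side task dispatch loop (bare-metal style).
 * The TEST & SET base address appears in the memory map above; the queue
 * base, entry layout, depth, and lock semantics are assumptions. */
#include <stdint.h>

#define TEST_AND_SET_BASE  0x10000000u   /* 4 KB hardware test-and-set region (from the map) */
#define TASK_QUEUE_BASE    0x10030000u   /* hypothetical: OpenMP offload task queue */
#define TASK_QUEUE_DEPTH   16u           /* hypothetical queue depth */

typedef void (*task_fn)(void *args);

struct offload_task {                    /* hypothetical queue entry */
    volatile uint32_t valid;             /* set by the fabric controller / OpenMP support */
    task_fn           entry;             /* task entry point (binary already in L2/L3) */
    void             *args;              /* task data */
};

/* Assumed semantics: reading the TEST & SET location returns 0 to exactly one
 * reader (which thereby acquires the lock) and 1 to all others. */
static inline void lock(volatile uint32_t *ts)   { while (*ts != 0) { } }
static inline void unlock(volatile uint32_t *ts) { *ts = 0; }

void worker_loop(void)
{
    volatile uint32_t   *queue_lock = (volatile uint32_t *)(uintptr_t)TEST_AND_SET_BASE;
    struct offload_task *queue      = (struct offload_task *)(uintptr_t)TASK_QUEUE_BASE;
    uint32_t head = 0;

    for (;;) {
        lock(queue_lock);
        struct offload_task *t = &queue[head];
        if (t->valid) {                  /* FC generated a task and triggered execution */
            task_fn  fn   = t->entry;
            void    *args = t->args;
            t->valid = 0;
            head = (head + 1) % TASK_QUEUE_DEPTH;
            unlock(queue_lock);
            fn(args);                    /* run the offloaded task */
        } else {
            unlock(queue_lock);          /* nothing queued yet: keep polling */
        }
    }
}
```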