Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
A Streaming Multi-Threaded Model Eylon Caspi, Joe Yeh, Randy Huang, Yury Markovskiy, André DeHon, John Wawrzynek BRASS Research Group University of California, Berkeley MSP-3 12/2/01 Protecting Software Investment Technology trends: bigger, faster Moore’s Law: 2x transistors every 18 months Device landscape growing Microprocessors, DSPs, FPGAs, communication processors, network processors, PSOCs, etc. Need a way to let SW survive, automatically scale to next-gen device Need a strong model for SW-HW interface with better parallelism 12/2/01 Eylon Caspi — MSP-3 2 Outline Motivation SCORE SCORE for Reconfigurable Hardware SCORE for Microprocessors Summary / Future Work 12/2/01 Eylon Caspi — MSP-3 3 A Lesson from ISA Processors ISA (Instruction Set Architecture) decouples SW from HW Survival to compatible, next generation devices Performance scales with device speed + size Survival for decades—e.g. IBM 360, x86 An ISA cannot scale forever Latency scales with device size (cycles to cross chip, access mem) Need parallelism to hide latency ILP: expensive to extract + exploit (caches, branch pred., etc.) Data: (Vector, MMX) limited applicability; MMX not scalable Thread: (MP, multi-threaded) IPC expensive; hard to program Gluing together conventional processors is insufficient 12/2/01 Eylon Caspi — MSP-3 4 Streams Stream = FIFO communication channel with blocking read, non-blocking write, conceptually unbounded capacity Basic primitive for communication, synchronization Exposed at all levels—programming model, architecture Application = data flow graph of threads, memories Kahn process network Stream semantics ensure determinism regardless of communication timing, thread scheduling (Kahn continuity) Thread Mem Thread Thread Mem Thread 12/2/01 Eylon Caspi — MSP-3 5 Stream-Aware Scheduling Streams expose inter-thread dependencies (data flow) Streams enable efficient, flexible schedules Efficient: fewer blocked cycles, shorter run time Automatically schedule to available resources Number of processors, memory size, network bandwidth, etc. E.g. Fully spatial, pipelined E.g. Time multiplexed with data batching Amortize cost of context swap over larger data set Thread Mem Thread Thread Mem Thread 12/2/01 Eylon Caspi — MSP-3 6 Stream Reuse Persistent streams enable reuse Establish connection once (network route / buffer) Reuse connection while threads loaded Cheap (single cycle) stream access Amortize per-message cost of communication Thread Mem Thread Thread Mem Thread 12/2/01 Eylon Caspi — MSP-3 7 SCORE Compute Model Program = data flow graph of stream-connected threads Kahn process network (blocking read, non-blocking write) Compute: Thread Communication: Stream Task with local control FIFO channel, unbounded buffer capacity, blocking read, non-blocking write Memory: Segment Dynamics: Model admits parallelism at multiple levels: ILP, pipeline, data Memory block with stream interface (e.g. streaming read) Dynamic local thread behavior dynamic flow rates Unbounded resource usage: may need stream buffer expansion Dynamic graph allocation 12/2/01 Eylon Caspi — MSP-3 8 SCORE for Reconfigurable Hardware SCORE: Stream Computations Organized for Reconfigurable Execution Programmable logic + Programmable Interconnect E.g. Field Programmable Gate Arrays (FPGAs) Hardware scales by tiling / duplicating High parallelism; spatial data paths But no abstraction for software survival No binary compatibility No performance scaling 12/2/01 Designer targets a specific device, specific resource constraints Eylon Caspi — MSP-3 9 Virtual Hardware Compute model has unbounded resources Paging Efficient virtualization Programmer no longer targets particular device size “Compute pages” swapped in/out (like VM) Page context = thread (FSM to access streams, block) Amortize reconfiguration cost over an entire input buffer compute pages buffers 10 Transform Quantize RLE Encode SCORE Hardware Model Paged FPGA Compute Page (CP) Fixed-size slice of RC hardware (e.g. 512 4-LUTs) Fixed number of I/O ports Configurable Memory Block (CMB) Distributed, on-chip memory (e.g. 2 Mbit) Stream access High-level interconnect Microprocessor Run-time support + user code 12/2/01 Eylon Caspi — MSP-3 11 Programming Model: TDF TDF = intermediate, behavioral language for: EFSM Operators • Static operator graphs State machine for: Firing signatures • Control flow (branching) Firing semantics: When in state X, wait for X’s inputs, then fire (consume, act) select (input boolean s, input unsigned[8] t, input unsigned[8] f, output unsigned[8] o ) { state S (s) : if (s) goto T; else goto F; state T (t) : o=t; goto S; state F (f) : o=f; goto S; } s t f select o 12 Page Scheduling Schedule = time-sliced eviction / loading Choose pages to run Manage stream buffers (modify page graph; swap memory) Configure CPs, CMBs, network Implemented several schedulers Dynamic: Static: Quasi-Static: Dynamic loading order based on buffered input Static, repeated loading order Static loading order, dynamic time slice Page loading order (static / quasi-static) Topological: Min-cut: Exhaustive: 12/2/01 dependence order (arbitrary topological sort of page graph) minimize # of live stream buffers (min-cut page graph) minimize stall cycles based on profiled I/O rates (exhaustively search all topological orders) Eylon Caspi — MSP-3 13 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 14 Heterogeneous SCORE SCORE extends to other processor types Network interface Route traffic to network or buffer Block on empty/full stream access Processor Processor Processor Processor FPU FPU FPU FPU IO FPU Processor Processor Processor Processor Processor Processor Processor Processor Processor 12/2/01 Processor Processor Processor FPU IO FPU FPU FPU FPU FPU FPU FPU FPU Processor Eylon Caspi — MSP-3 15 Microprocessor Stream Support Stream instructions: stream_read (reg,idx) stream_write (reg,idx) Network Interface 12/2/01 Eylon Caspi — MSP-3 16 Summary Exposing streams at all levels (programming model, architecture) enables software survival + performance scaling in high-capacity architectures Demonstrated scalable hybrid reconfigurable architecture; proposed heterogeneous / multi-processor extensions Future work Page partitioning for reconfigurable Scheduling with I/O rate matching More Information SCORE web page http://brass.cs.berkeley.edu/SCORE/ FPGA 2002 paper (February 24-26) 12/2/01 Eylon Caspi — MSP-3 17 Supplemental 12/2/01 Eylon Caspi — MSP-3 18 Functional Simulation FPGA based on HSRA [Berkeley, FPGA ’99] CP: 512 4-LUTs CMB: 2Mbit DRAM Area for CP-CMB pair: .25: 12.9mm2 (1/9 of PII-450) .18: 6.7mm2 (1/16 of PIII-600) Page reconfiguration: 5000 cycles (from CMB) Synchronous operation (same clock speed as processor) x86 microprocessor Page Scheduler task Swap on timer interrupt (every 250,000 cycles) Fully dynamic scheduling 12/2/01 Eylon Caspi — MSP-3 19 Application: JPEG Encode 12/2/01 Eylon Caspi — MSP-3 20 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 21 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 22 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 23