Download PowerPoint - Berkeley Reconfigurable Architectures, Systems, and

A Streaming Multi-Threaded Model Eylon Caspi, Joe Yeh, Randy Huang, Yury Markovskiy, André DeHon, John Wawrzynek BRASS Research Group University of California, Berkeley MSP-3 12/2/01 Protecting Software Investment  Technology trends: bigger, faster  Moore’s Law: 2x transistors every 18 months  Device landscape growing  Microprocessors, DSPs, FPGAs, communication processors, network processors, PSOCs, etc.  Need a way to let SW survive, automatically scale to next-gen device  Need a strong model for SW-HW interface with better parallelism 12/2/01 Eylon Caspi — MSP-3 2 Outline      Motivation SCORE SCORE for Reconfigurable Hardware SCORE for Microprocessors Summary / Future Work 12/2/01 Eylon Caspi — MSP-3 3 A Lesson from ISA Processors  ISA (Instruction Set Architecture) decouples SW from HW  Survival to compatible, next generation devices  Performance scales with device speed + size  Survival for decades—e.g. IBM 360, x86  An ISA cannot scale forever  Latency scales with device size (cycles to cross chip, access mem)  Need parallelism to hide latency  ILP: expensive to extract + exploit (caches, branch pred., etc.)  Data: (Vector, MMX) limited applicability; MMX not scalable  Thread: (MP, multi-threaded) IPC expensive; hard to program Gluing together conventional processors is insufficient 12/2/01 Eylon Caspi — MSP-3 4 Streams  Stream = FIFO communication channel with blocking read, non-blocking write, conceptually unbounded capacity  Basic primitive for communication, synchronization  Exposed at all levels—programming model, architecture  Application = data flow graph of threads, memories  Kahn process network  Stream semantics ensure determinism regardless of communication timing, thread scheduling (Kahn continuity) Thread Mem Thread Thread Mem Thread 12/2/01 Eylon Caspi — MSP-3 5 Stream-Aware Scheduling  Streams expose inter-thread dependencies (data flow)  Streams enable efficient, flexible schedules  Efficient: fewer blocked cycles, shorter run time  Automatically schedule to available resources  Number of processors, memory size, network bandwidth, etc.  E.g. Fully spatial, pipelined  E.g. Time multiplexed with data batching  Amortize cost of context swap over larger data set Thread Mem Thread Thread Mem Thread 12/2/01 Eylon Caspi — MSP-3 6 Stream Reuse  Persistent streams enable reuse  Establish connection once (network route / buffer)  Reuse connection while threads loaded  Cheap (single cycle) stream access  Amortize per-message cost of communication Thread Mem Thread Thread Mem Thread 12/2/01 Eylon Caspi — MSP-3 7 SCORE Compute Model  Program = data flow graph of stream-connected threads  Kahn process network (blocking read, non-blocking write)  Compute: Thread  Communication: Stream  Task with local control  FIFO channel, unbounded buffer capacity, blocking read, non-blocking write  Memory: Segment  Dynamics:  Model admits parallelism at multiple levels: ILP, pipeline, data  Memory block with stream interface (e.g. streaming read)  Dynamic local thread behavior  dynamic flow rates  Unbounded resource usage: may need stream buffer expansion  Dynamic graph allocation 12/2/01 Eylon Caspi — MSP-3 8 SCORE for Reconfigurable Hardware  SCORE: Stream Computations Organized for Reconfigurable Execution  Programmable logic + Programmable Interconnect  E.g. Field Programmable Gate Arrays (FPGAs)  Hardware scales by tiling / duplicating  High parallelism; spatial data paths  But no abstraction for software survival  No binary compatibility  No performance scaling  12/2/01 Designer targets a specific device, specific resource constraints Eylon Caspi — MSP-3 9 Virtual Hardware  Compute model has unbounded resources  Paging  Efficient virtualization  Programmer no longer targets particular device size  “Compute pages” swapped in/out (like VM)  Page context = thread (FSM to access streams, block)  Amortize reconfiguration cost over an entire input buffer compute pages buffers 10 Transform Quantize RLE Encode SCORE Hardware Model  Paged FPGA  Compute Page (CP)  Fixed-size slice of RC hardware (e.g. 512 4-LUTs)  Fixed number of I/O ports  Configurable Memory Block (CMB)  Distributed, on-chip memory (e.g. 2 Mbit)  Stream access  High-level interconnect  Microprocessor  Run-time support + user code 12/2/01 Eylon Caspi — MSP-3 11 Programming Model: TDF  TDF = intermediate, behavioral language for:  EFSM Operators • Static operator graphs  State machine for:  Firing signatures • Control flow (branching)  Firing semantics:  When in state X, wait for X’s inputs, then fire (consume, act) select (input boolean s, input unsigned[8] t, input unsigned[8] f, output unsigned[8] o ) { state S (s) : if (s) goto T; else goto F; state T (t) : o=t; goto S; state F (f) : o=f; goto S; } s t f select o 12 Page Scheduling  Schedule = time-sliced eviction / loading  Choose pages to run  Manage stream buffers (modify page graph; swap memory)  Configure CPs, CMBs, network  Implemented several schedulers  Dynamic:  Static:  Quasi-Static:  Dynamic loading order based on buffered input Static, repeated loading order Static loading order, dynamic time slice Page loading order (static / quasi-static)  Topological:  Min-cut:  Exhaustive: 12/2/01 dependence order (arbitrary topological sort of page graph) minimize # of live stream buffers (min-cut page graph) minimize stall cycles based on profiled I/O rates (exhaustively search all topological orders) Eylon Caspi — MSP-3 13 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 14 Heterogeneous SCORE  SCORE extends to other processor types  Network interface  Route traffic to network or buffer  Block on empty/full stream access Processor Processor Processor Processor FPU FPU FPU FPU IO FPU Processor Processor Processor Processor Processor Processor Processor Processor Processor 12/2/01 Processor Processor Processor FPU IO FPU FPU FPU FPU FPU FPU FPU FPU Processor Eylon Caspi — MSP-3 15 Microprocessor Stream Support  Stream instructions: stream_read (reg,idx) stream_write (reg,idx) Network Interface 12/2/01 Eylon Caspi — MSP-3 16 Summary  Exposing streams at all levels (programming model, architecture) enables software survival + performance scaling in high-capacity architectures  Demonstrated scalable hybrid reconfigurable architecture; proposed heterogeneous / multi-processor extensions  Future work  Page partitioning for reconfigurable  Scheduling with I/O rate matching  More Information  SCORE web page http://brass.cs.berkeley.edu/SCORE/  FPGA 2002 paper (February 24-26) 12/2/01 Eylon Caspi — MSP-3 17 Supplemental 12/2/01 Eylon Caspi — MSP-3 18 Functional Simulation  FPGA based on HSRA [Berkeley, FPGA ’99]  CP: 512 4-LUTs  CMB: 2Mbit DRAM  Area for CP-CMB pair: .25: 12.9mm2 (1/9 of PII-450) .18: 6.7mm2 (1/16 of PIII-600)  Page reconfiguration: 5000 cycles (from CMB)  Synchronous operation (same clock speed as processor)  x86 microprocessor  Page Scheduler task  Swap on timer interrupt (every 250,000 cycles)  Fully dynamic scheduling 12/2/01 Eylon Caspi — MSP-3 19 Application: JPEG Encode 12/2/01 Eylon Caspi — MSP-3 20 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 21 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 22 Execution Results Hardware Size (CP-CMB Pairs) 12/2/01 Eylon Caspi — MSP-3 23

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PowerPoint - Berkeley Reconfigurable Architectures, Systems, and