Paper Review: Area-Performance Trade-offs in Tiled Dataflow Architectures
Ke Peng
Instructor: Chun-Hsi Huang
CSE 340 Computer Architecture, University of Connecticut

Reference
Steven Swanson, Andrew Putnam, Martha Mercaldi, Ken Michelson, Andrew Petersen, Andrew Schwerin, Mark Oskin, and Susan J. Eggers (Computer Science & Engineering, University of Washington). "Area-Performance Trade-offs in Tiled Dataflow Architectures." Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), IEEE, 2006.

Outline
- Background Introduction
- Experimental Infrastructure
- WaveScalar Architecture
- Evaluation
- Conclusion

Background Introduction
- Many issues must be addressed in processor design:
  - Wire delay
  - Fabrication reliability
  - Design complexity
- In a tiled architecture, processing elements (PEs) are designed once and replicated across the chip.
- Examples of tiled architectures: RAW, SmartMemories, TRIPS, and WaveScalar.

[Figure: Tiled WaveScalar architecture. Captured online, University of Washington.]

Background Introduction: Benefits and Challenges
- Benefits of the replicated-PE design:
  - Decreases design and verification time
  - Provides robustness against fabrication errors
  - Reduces wire delay for data and control signal transmission
- Good performance is achievable only if all aspects of the microarchitecture are properly designed.
- Challenges:
  - Number of tiles vs. tile size
  - Highly utilized tiles vs. possibly more powerful tiles
  - Partitioning and distribution of data memory across the chip
  - Tile interconnection
  - Etc.

Background Introduction: This Paper
- This paper focuses on the WaveScalar processor and explores the area-performance trade-offs encountered when designing a tiled architecture.
- WaveScalar is a tiled dataflow architecture:
  - Based on PE replication
  - Hierarchical data networks
  - Distributed hardware data structures, including the caches, store buffers, and specialized dataflow memories (the token store)

Experimental Infrastructure
- Synthesizable RTL WaveScalar model
- TSMC (Taiwan Semiconductor Manufacturing Company) 90 nm technology
- Synopsys DesignWare IP
- Synopsys Design Compiler for front-end synthesis
- Cadence First Encounter for back-end synthesis
- Synopsys VCS for RTL simulation and functional verification
- Hierarchical architecture and single-voltage design, which can be extended to a multiple-voltage design

Experimental Infrastructure: Workloads
Three workloads evaluate the WaveScalar processor:
- SPEC2000 benchmark suite (ammp, art, equake, gzip, twolf, and mcf) for single-threaded performance
- Mediabench (rawdaudio, mpeg2encoder, djpeg) for media-processing performance
- Splash-2 benchmarks (fft, lu-contiguous, ocean-noncontiguous, raytrace, water-spatial, radix) for multi-threaded performance

WaveScalar Architecture: Processing Elements (PEs)
The PE is the heart of a WaveScalar machine.

[Figure: PE execution resources. Captured from Swanson et al., ISCA'06.]

Each PE has five pipeline stages (a behavioral sketch follows this list):
- Input: operand messages arrive at the PE, either from the PE itself or from another PE.
- Match: operands enter the matching table, which determines which instructions are ready to fire and issues the table indices of eligible instructions into the instruction scheduling queue.
- Dispatch: selects an instruction from the scheduling queue and reads its operands from the matching table for execution.
- Execute: executes the instruction.
- Output: sends the result to consumer instructions via the interconnection network.
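As a concrete illustration of the dataflow firing rule behind these stages, here is a minimal Python sketch. The ProcessingElement class, message format, and example graph are invented for illustration; real WaveScalar PEs also match operands by tag (wave number), which this sketch omits, and this is not the paper's RTL.

# Minimal behavioral sketch of the five PE stages described above.
# All names and data structures are illustrative assumptions.
from collections import deque

class ProcessingElement:
    def __init__(self, instructions):
        # instructions: name -> (operation, operand count, consumer names)
        self.instructions = instructions
        self.matching_table = {}         # name -> operands received so far
        self.scheduling_queue = deque()  # instructions ready to fire
        self.inbox = deque()             # arriving operand messages
        self.outbox = deque()            # departing operand messages

    def input_stage(self, message):
        # Input: an operand message (destination instruction, value)
        # arrives from this PE or another PE.
        self.inbox.append(message)

    def match_stage(self):
        # Match: file operands in the matching table; once an
        # instruction has all of its operands, it becomes eligible
        # and is issued into the scheduling queue.
        while self.inbox:
            dest, value = self.inbox.popleft()
            operands = self.matching_table.setdefault(dest, [])
            operands.append(value)
            _, arity, _ = self.instructions[dest]
            if len(operands) == arity:
                self.scheduling_queue.append(dest)

    def dispatch_execute_output(self):
        # Dispatch: select a ready instruction and read its operands.
        if not self.scheduling_queue:
            return
        name = self.scheduling_queue.popleft()
        op, _, consumers = self.instructions[name]
        operands = self.matching_table.pop(name)
        result = op(*operands)            # Execute
        for consumer in consumers:        # Output: notify consumers
            self.outbox.append((consumer, result))

# Tiny dataflow graph: mul = (3 + 4) * 2.
pe = ProcessingElement({
    "add": (lambda x, y: x + y, 2, ["mul"]),
    "mul": (lambda x, y: x * y, 2, []),
})
for msg in [("add", 3), ("add", 4), ("mul", 2)]:
    pe.input_stage(msg)
pe.match_stage()
pe.dispatch_execute_output()              # "add" fires, sends 7 to "mul"
pe.inbox.extend(pe.outbox); pe.outbox.clear()
pe.match_stage()
pe.dispatch_execute_output()              # "mul" fires: 7 * 2 = 14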
WaveScalar Architecture: PEs, Pods, Domains, Clusters
- Several PEs are combined into a single pod; PEs within a pod share bypass networks.
- Several pods are combined into a single domain.
- Several domains are combined into a single cluster.
- These groupings are the parameters that drive WaveScalar's area-performance trade-off:
  - A 2-PE pod is 15% faster on average than isolated PEs.
  - Increasing the number of PEs in each pod will further increase performance but adversely affects cycle time.

[Figure: The configuration of the baseline WaveScalar processor. Captured from Swanson et al., ISCA'06.]
[Figure: Hierarchical organization of the WaveScalar microarchitecture. Captured from Swanson et al., ISCA'06.]

WaveScalar Architecture: Interconnect
A 4-level hierarchical interconnect network:
- Intra-pod.
- Intra-domain: broadcast-based; pseudo-PEs (Mem, Net) serve as gateways to the memory system and to PEs in other domains or clusters; 7% area overhead.
- Intra-cluster: a small network whose area overhead is negligible.
- Inter-cluster: responsible for all long-distance communication; about 1% of the total chip area.

[Figure: Hierarchical cluster interconnects. Captured from Swanson et al., ISCA'06.]

WaveScalar Architecture: Memory Subsystem
- Wave-ordered store buffers:
  - A memory interface that enables WaveScalar to execute programs written in imperative languages (C, C++, Java)
  - A store-decoupling technique processes store-address and store-data messages separately
  - Partial store queues for buffering store addresses
  - Occupy approximately 6.2% of the cluster area
- Conventional memory hierarchy with distributed L1 and L2 caches:
  - The L1 data cache is 4-way set associative with 128-byte lines and a 3-cycle hit cost
  - L2 hits cost 20-30 cycles
  - Main memory latency is modeled at 200 cycles

Evaluation
[Figure: Die area breakdown of the baseline design. Captured from Swanson et al., ISCA'06.]
[Figure: The configuration of the baseline WaveScalar processor. Captured from Swanson et al., ISCA'06.]

Many parameters affect the area required for WaveScalar designs. This paper considers the 7 parameters with the strongest effect on area requirements and ignores some minor effects (for example, it assumes that wiring costs do not decrease with fewer than 4 domains).

[Figure: WaveScalar processor area model. Captured from Swanson et al., ISCA'06.]

Evaluation: Design-Space Search
The parameter ranges allow over 21,000 WaveScalar processor configurations. To select configurations (a selection sketch appears after the figures below), the authors:
- Eliminate clearly poor, unbalanced designs
- Bound die size at 400 mm^2 in 90 nm technology
- Reduce the number of designs to 201
They report AIPC (Alpha-equivalent instructions executed per cycle) instead of raw IPC.

[Figure: Pareto-optimal WaveScalar designs. Captured from Swanson et al., ISCA'06.]
[Figure: Pareto-optimal configurations for Splash-2. Captured from Swanson et al., ISCA'06.]
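The Pareto-optimal designs in these figures are the configurations that no other configuration beats in both die area and AIPC at once. Below is a minimal Python sketch of that selection step; the configuration names and numbers are invented placeholders for illustration, not the paper's measured data.

# Sketch of Pareto-front selection over an (area, AIPC) design space.
def pareto_front(designs):
    """Return designs for which no other design has both
    smaller-or-equal area and greater-or-equal AIPC
    (exact ties are not treated as dominating)."""
    front = []
    for name, area, aipc in designs:
        dominated = any(
            o_area <= area and o_aipc >= aipc
            and (o_area, o_aipc) != (area, aipc)
            for _, o_area, o_aipc in designs
        )
        if not dominated:
            front.append((name, area, aipc))
    return sorted(front, key=lambda d: d[1])  # order by area

# (configuration name, die area in mm^2, AIPC) -- hypothetical values.
candidates = [
    ("cfg-A", 50, 1.0), ("cfg-B", 90, 1.8), ("cfg-C", 90, 1.2),
    ("cfg-D", 160, 2.5), ("cfg-E", 400, 2.4), ("cfg-F", 250, 2.5),
]
# Bound die size at 400 mm^2 in 90 nm technology, as the paper does.
feasible = [d for d in candidates if d[1] <= 400]
for name, area, aipc in pareto_front(feasible):
    print(f"{name}: {area} mm^2, {aipc} AIPC")
# Prints cfg-A, cfg-B, cfg-D: the designs on the area-performance frontier.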
Evaluation: Interconnect Traffic
The goal of WaveScalar's hierarchical interconnect is to isolate as much traffic as possible in the lower levels of the hierarchy: within a PE, a pod, or a domain.
- On average, 40% of network traffic stays within a pod and 52% stays within a domain.
- On average, just 1.5% of traffic traverses the inter-cluster interconnect.

[Figure: Pareto-optimal WaveScalar designs. Captured from Swanson et al., ISCA'06.]

Conclusion
- This paper presents the WaveScalar processor architecture in detail and identifies the parameters that most strongly affect area and performance.
- It explores the area/performance trade-offs through simulation and analysis.
- It shows that WaveScalar processors can be tuned for either area efficiency or maximum performance across a wide range of processor sizes.
- The hierarchical interconnect network is very effective:
  - Over 50% of messages stay within a domain
  - Over 80% of messages stay within a cluster

Thank you!