High performance, power-efficient DSPs
based on the TI C64x
Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner
Rice University
{sridhar,cavallar,rixner}@rice.edu
Recent (2003) Research Results
 Stream-based programmable processors meet the real-time requirements of a set of base-station physical-layer algorithms+,*
 Mapped the algorithms onto stream processors and studied tradeoffs between packing, ALU utilization, and memory operations
 Improved power efficiency in stream processors by adapting compute resources to workload variations and varying voltage and clock frequency to real-time requirements*
 Explored the design space between #ALUs and clock frequency to minimize the power consumption of the processor
+ S. Rajagopal, S. Rixner, J. R. Cavallaro, 'A programmable baseband processor design for software radios', 2002
* Paper draft sent previously; the rest of the contributions are in the thesis
Recent (2003) Research Results
 Peak computation rate available: ~200 billion arithmetic operations per second at 1.2 GHz
 Estimated peak power (0.13 micron): 12.38 W at 1.2 GHz
 Power:
 12.38 W for 32 users, constraint length 9 decoding, at 128 Kbps/user (at 1.2 GHz, 1.4 V)
 300 mW for 4 users, constraint length 7 decoding, at 128 Kbps/user (at 433 MHz, 0.875 V)
Motivation
 This research could be applied to DSP design!
 Designing DSPs that are
 High performance
 Power-efficient
 Able to adapt computing resources to workload changes
 Such that
 Changes in the C64x architecture are gradual
 Changes in compilers and tools are gradual
Levels of changes
 To allow gradual changes in TI DSPs and tools
 Changes classified into 3 levels
 Level 1: simple, minimal changes (next silicon)
 Level 2: intermediate, handover changes (1-2 years)
 Level 3: actual proposed changes (2-3 years)
We want to go to Level 3 but in steps!
Level 1 changes:
Power-efficiency
Level 1 changes: Power saving features
 (1) Use Dynamic Voltage and Frequency Scaling (DVS)
 When the workload changes, e.g.
 Users, data rates, modulation, coding rates, ...
 Already in industry: Crusoe, XScale, ...
 (2) Use voltage gating to turn off unused resources
 When units are idle for a 'sufficiently' long time
 Saves static and dynamic power dissipation
 See the example on the next slide
Turning off ALUs
[Figure: two instruction schedules across adders and multipliers. The default schedule spreads work over all ALUs; the schedule after exploration packs the work onto fewer units, and a 'sleep' instruction turns 2 multipliers off to save power. The idle units are turned off using voltage gating to eliminate static and dynamic power dissipation.]
Level 1: Architecture tradeoffs
DVS:
 Advanced voltage regulation scheme
 Cannot use NMOS pass gates
 Cannot use tri-state buffers
 Use at a coarser time scale (once every ~1 million cycles)
 100-1000 cycles settling time
Voltage gating:
 Gating device design important
 Should be able to supply current to gated circuit
 Use at a coarser time scale (once every 100-1000 cycles)
 1-10 cycles settling time
Level 1: Tools/Programming impact
 Need a DSP/BIOS "TASK" running continuously that watches for workload changes and changes the voltage/frequency using a look-up table in memory (a minimal sketch follows this list)
 The compiler should be made 're-targetable'
 Target a subset of the ALUs and explore static performance with different adder-multiplier schedules
 Voltage gating uses a 'sleep' instruction that the compiler generates for unused ALUs
 ALUs should be idle for > 100 cycles for this to occur
 Other resources can be gated off similarly to save static power dissipation
 The programmer is not aware of these changes
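
A minimal sketch of such a task, in plain C. The operating points echo the frequency/voltage pairs quoted later in this talk, but the workload-estimation hook and the clock/voltage control functions are hypothetical placeholders, not TI APIs:

#include <stdint.h>

/* Hypothetical operating points (frequency/voltage pairs) indexed by
   workload level; the control interface is invented for illustration. */
typedef struct { uint32_t freq_mhz; uint32_t millivolts; } op_point_t;

static const op_point_t op_table[] = {
    {  433,  875 },   /* light load, e.g. 4 users   */
    {  533,  950 },
    {  667, 1050 },
    { 1200, 1400 },   /* worst case, e.g. 32 users  */
};

extern uint32_t estimate_workload_level(void);  /* placeholder: 0..3 */
extern void set_pll_freq_mhz(uint32_t mhz);     /* placeholder       */
extern void set_core_voltage_mv(uint32_t mv);   /* placeholder       */

/* Body of a continuously running task: watch the workload and move to
   the matching operating point from the look-up table when it changes. */
void dvs_task(void)
{
    uint32_t current = 3;                       /* boot at worst case */
    for (;;) {
        uint32_t level = estimate_workload_level();
        if (level == current)
            continue;
        if (level > current) {          /* speeding up: raise V, then f */
            set_core_voltage_mv(op_table[level].millivolts);
            set_pll_freq_mhz(op_table[level].freq_mhz);
        } else {                        /* slowing down: lower f, then V */
            set_pll_freq_mhz(op_table[level].freq_mhz);
            set_core_voltage_mv(op_table[level].millivolts);
        }
        current = level;
    }
}

The ordering matters: voltage is raised before the clock speeds up and lowered only after it slows down, so the circuit always has headroom during the 100-1000 cycle settling time noted on the previous slide.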
Level 2 changes:
Performance
Solutions to increase DSP performance
 (1) Increasing clock frequency
 C64x: 600 - 720 - 1000 MHz - ?
 Easiest solution but limited benefits
 Not good for power, given the cubic dependence of power on frequency
 (2) Increasing ALUs
 Limited instruction level parallelism (ILP)
 Register file area and port counts explode
 Compiler issues in extracting more ILP
 (3) Multiprocessors (MIMD)
 Usually from 3rd-party vendors (except C40-types)
DSP multiprocessors
[Figure: board-level DSP multiprocessors - multiple DSPs connected through an interconnection fabric, with a network interface, ASSPs, and co-processors.]
Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80
Multiprocessing tradeoffs
 Advantages:
 Performance, and the tools don't have to change!
 Disadvantages:
 Load-balancing algorithms across multiple DSPs is not straightforward+
 The burden is pushed onto the programmer
 Not scalable with the number of processors
 Difficult to adapt to workload changes
 Traditional DSPs are not built for multiprocessing* (except C40-types)
 I/O impacts throughput, power, and area
 (E)DMA use minimizes the throughput problem
 The power and area problems still remain
* R. Baines, The DSP bottleneck, IEEE Communications Magazine, May 1995, pp. 46-54 (outdated?)
+ S. Rajagopal, B. Jones and J. R. Cavallaro, Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs, ICSPAT 2001
Options
 Chip multiprocessors with SIMD parallelism (Level 3)
 SIMD parallelism can alleviate load balancing (shown in Level 3)
 Scalable with the number of processors
 Automatic SIMD parallelization can be done by the compiler
 A single chip alleviates I/O bottlenecks
 Tools will need changes
 To get to Level 3, an intermediate (Level 2) investigation is needed
 Level 2:
 Do SPMD on a DSP multiprocessor
Texas Instruments C64x DSP
[Figure: the C64x datapath.]
Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)
A possible, plausible solution
Exploit data parallelism (DP)*
 Available in many wireless algorithms
 This is what ASICs do!
#define N 1024
int a[N], b[N], sum[N];      // 32-bit elements
short c[N], d[N], diff[N];   // 16-bit elements, packed two per word
for (int i = 0; i < N; ++i)
{
    sum[i] = a[i] + b[i];    // independent iterations: DP
    diff[i] = c[i] - d[i];   // 16-bit operations: subword candidates
}
The independent loop iterations expose DP, the independent operations within an iteration expose ILP, and the 16-bit operations expose subword parallelism (a packed-arithmetic sketch follows the footnote below).
*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
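
As an illustration of the subword piece, the 16-bit subtraction above can be packed two per word; a sketch using the TI C6000 compiler's _sub2() intrinsic, with the header name, alignment, and even trip count assumed for this example:

#include <c6x.h>    /* TI C6000 compiler intrinsics (assumed header)  */

/* Two 16-bit subtractions per 32-bit ALU operation via _sub2().
   Assumes c[], d[], diff[] are 4-byte aligned and n is even. */
void packed_diff(const short *c, const short *d, short *diff, int n)
{
    const int *cp = (const int *)c;   /* view pairs of shorts as words */
    const int *dp = (const int *)d;
    int *op = (int *)diff;
    for (int i = 0; i < n / 2; ++i)
        op[i] = _sub2(cp[i], dp[i]);  /* SIMD within a register        */
}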
SPMD multiprocessor DSP
[Figure: four C64x datapaths, with the same program running on all DSPs.]
Level 2: Architecture tradeoffs
 The interconnection between C64x's could be similar to those used by 3rd-party vendors:
 FPGA-based C40 comm ports (Sundance) ~400 MBps
 VIM modules (Pentek) ~300 MBps
 Others developed by TI, BlueWave Systems
Level 2: Tools/Programming impact
 All DSPs run the same program (see the sketch after this list)
 The programmer thinks of only 1 DSP program
 The burden is now on the tools
 Can use C8x compiler and tool support expertise
 Integration of the C8x and C6x compilers
 Data parallelism is used for SPMD
 DMA data movement can be left to the programmer at this stage to keep data fed to all the processors
 MPI (Message Passing Interface) could alternatively be applied
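
To make the SPMD model concrete, a minimal sketch of the single program every DSP would run; my_dsp_id() and NUM_DSPS are hypothetical stand-ins for whatever the tools would actually provide:

#define NUM_DSPS 4             /* assumed board configuration         */
extern int my_dsp_id(void);    /* hypothetical: returns 0..NUM_DSPS-1 */

/* SPMD: every DSP runs this same function, and each one claims a
   contiguous slice of the data-parallel iteration space by its ID. */
void spmd_vector_add(const int *a, const int *b, int *sum, int n)
{
    int chunk = n / NUM_DSPS;
    int start = my_dsp_id() * chunk;
    int end = (my_dsp_id() == NUM_DSPS - 1) ? n : start + chunk;
    for (int i = start; i < end; ++i)
        sum[i] = a[i] + b[i];
}

The programmer reasons about one program; partitioning by ID is mechanical, which is what lets the compiler (or a thin runtime) take over that burden.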
Level 3 changes:
Performance and
Power
A chip multiprocessor (CMP) DSP
[Figure: on the left, a C64x DSP core (1 cluster) - an instruction decoder, L2 internal memory, and one cluster of 3 adders and 3 multipliers exploiting ILP and subword parallelism. On the right, a C64x-based CMP DSP core - one instruction decoder and shared L2 internal memory feeding many identical clusters that execute the same operations, exploiting DP across clusters. The number of clusters adapts to the DP; unused ALUs and clusters are powered down.]
A 4-cluster CMP using TI C64x
[Figure: four C64x datapaths combined as clusters on a single chip. Significant savings are possible in area and power, with increasing benefits at larger cluster counts (8, 16, 32 clusters).]
Alternate view of the CMP DSP
[Figure: clusters of C64x cores (core 0 through core C) share a DMA controller, banked L2 internal memory (banks 1 through C), prefetch buffers, an inter-cluster communication network, and a single instruction decoder.]
Adapting #clusters to Data Parallelism
[Figure: an adaptive multiplexer network feeds the clusters. With full DP all clusters run (no reconfiguration); as DP drops, the network reconfigures 4:2, then 4:1, until all clusters are off. Idle clusters are turned off using voltage gating to eliminate static and dynamic power dissipation.]
Level 3: Architecture tradeoffs
 Single processor -> SPMD -> SIMD
 Single chip:
 Maximum die size limits the design to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
 Number of memory banks = #clusters
 An instruction is added to turn off clusters when data parallelism is insufficient
Level 3: Tools/Programming impact
 The Level 2 compiler provides support for data parallelism
 Adapt #clusters to the data parallelism for power savings
 Check the loop trip count after loop unrolling
 If it is less than #clusters, emit the instruction to turn off the surplus clusters (a sketch follows this list)
 Design of parallel algorithms and their mapping is important
 The programmer still writes regular C code
 Transparent to the programmer
 Burden on the compiler
 Automatic DMA data movement keeps data feeding into the arithmetic units
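
A minimal sketch of the check the compiler could emit, assuming a hypothetical cluster_off() intrinsic wrapping the proposed cluster-off instruction:

#define NUM_CLUSTERS 32             /* assumed machine configuration   */
extern void cluster_off(int id);    /* hypothetical intrinsic for the
                                       proposed cluster-off instruction */

/* Compiler-emitted prologue for a data-parallel loop: after unrolling,
   trip_count iterations remain to be spread across clusters. If the DP
   is less than #clusters, gate the surplus clusters off. */
static void adapt_clusters(int trip_count)
{
    int needed = (trip_count < NUM_CLUSTERS) ? trip_count : NUM_CLUSTERS;
    for (int id = needed; id < NUM_CLUSTERS; ++id)
        cluster_off(id);            /* removes static + dynamic power  */
}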
Verification of potential benefits
 The Level 3 potential is verified using the Imagine stream processor simulator
 The C64x DSP is replaced with a cluster containing 3 adders, 3 multipliers, and a distributed register file
Need for adapting to flexibility
 Base-stations are designed for the worst-case workload
 Base-stations rarely operate at the worst-case workload
 Adapting the resources to the workload can save power!
Example of flexibility needed in workloads
[Chart: operation count (in GOPs) vs. (Users, Constraint lengths) from (4,7) to (32,9), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user). GOPs refer only to arithmetic computations.]
Billions of computations per second are needed; the workload varies from ~1 GOPs for 4 users with constraint length 7 Viterbi to ~23 GOPs for 32 users with constraint length 9 Viterbi.
Flexibility affects Data Parallelism*
U - Users, K - constraint length, N - spreading gain, R - decoding rate

Workload (U,K)   Estimation f(U,N)   Detection f(U,N)   Decoding f(U,K,R)
(4,7)            32                  4                  16
(4,9)            32                  4                  64
(8,7)            32                  8                  16
(8,9)            32                  8                  64
(16,7)           32                  16                 16
(16,9)           32                  16                 64
(32,7)           32                  32                 16
(32,9)           32                  32                 64

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
Cluster utilization variation with workload
[Chart: cluster utilization (0-100%) vs. cluster index (0-30) on a 32-cluster processor, shown for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), and (32,9); (32,9) = 32 users, constraint length 9 Viterbi.]
Frequency variation with workload
[Chart: real-time frequency (in MHz, 0-1200) needed for each workload from (4,7) to (32,9), broken down into busy time, L2 stalls, and memory stalls.]
Operation
 DVS when the system changes significantly
 Users, data rates, ...
 Coarse time scale (every few seconds)
 Turn off clusters when the parallelism changes significantly
 Parallelism can change within the same algorithm
 e.g., the spreading gain changes during matched filtering
 Finer time scales (100's of microseconds)
 Turn off ALUs when the algorithms change significantly
 Estimation, detection, decoding
 Finer time scales (100's of microseconds)
Power savings: Voltage Gating & Scaling

Workload   Freq needed   Freq used   Voltage   Savings: clocking / memory / clusters (W)   New power   Base power   Savings
(U,K)      (MHz)         (MHz)       (V)                                                   (W)         (W)
(4,7)      345.09        433         0.875     0.325 / 1.05 / 0.366                        0.3         2.05         85.14 %
(4,9)      380.69        433         0.875     0.193 / 0.56 / 0.604                        0.69        2.05         66.41 %
(8,7)      408.89        433         0.875     0.089 / 0.54 / 0.649                        0.77        2.05         62.44 %
(8,9)      463.29        533         0.95      0.304 / 0.71 / 0.643                        1.33        2.98         55.46 %
(16,7)     528.41        533         0.95      0.02  / 0.44 / 0.808                        1.71        2.98         42.54 %
(16,9)     637.21        667         1.05      0.156 / 0.58 / 0.603                        3.21        4.55         29.46 %
(32,7)     902.89        1000        1.3       0.792 / 1.18 / 1.375                        7.11        10.46        32.03 %
(32,9)     1118.3        1200        1.4       0.774 / 1.41 / 0                            12.38       14.56        14.98 %

Cluster power consumption: 78 %; L2 memory power consumption: 11.5 %; instruction decoder power consumption: 10.5 %
Chip area (0.13 micron process): 45.7 mm²
Power can change from 12.38 W to 300 mW depending on workload changes
How to decide ALUs vs. clock frequency
 The variables are not independent
 Clusters, ALUs, frequency, voltage
 Trade-offs exist:
V ∝ f,  P = C·V²·f  ⇒  P ∝ f³
 How to find the right combination for real-time at the lowest power? (A worked example follows the figure.)
[Figure: extreme design points. (A) 1 cluster with 1 adder and 1 multiplier, running at 100 GHz; (B) 100 clusters, each with 10 adders and 10 multipliers, running at 10 MHz; (C) the general case - 'c' clusters with 'a' adders and 'm' multipliers per cluster, running at 'f' MHz.]
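
As an idealized sanity check of these relations (static power and per-cluster overheads ignored; the numbers are illustrative):

P = CV^2 f, \quad V \propto f \;\Rightarrow\; P \propto f^3
\text{Halving } f \text{ (and hence } V\text{): } P' = C\Bigl(\tfrac{V}{2}\Bigr)^2 \tfrac{f}{2} = \tfrac{P}{8}
\text{Two clusters at } f/2 \text{ restore throughput at } 2 \cdot \tfrac{P}{8} = \tfrac{P}{4}

This is why trading clock frequency for clusters, as in design point (B), can win on power, provided there is enough DP to keep the extra clusters busy.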
Setting clusters, adders, multipliers
 If there is sufficient DP, the frequency decreases linearly with the number of clusters
 Set the number of clusters based on the DP and an execution time estimate
 To find the number of adders and multipliers:
 Let the compiler schedule the algorithm workloads across different numbers of adders and multipliers and report the execution time
 Put all the numbers into the previous equation
 Compare the increase in capacitance due to added ALUs and clusters with the benefit in execution time
 Choose the solution that minimizes the power (a sketch of this search follows)
Details are available in Sridhar's thesis
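
A sketch of that search loop, in C. The cycle-count model and the constants are placeholders for illustration; the real flow would query the compiler's static schedule, as described above:

#include <stdio.h>

/* Placeholder model for the compiler-reported cycles to process one
   frame on (a adders, m multipliers) per cluster: more ALUs means
   fewer cycles, with diminishing returns. Purely illustrative. */
static double cycles_per_frame(int a, int m)
{
    return 1.0e6 / (a + 0.5 * m);
}

int main(void)
{
    const double frames_per_sec = 100.0;  /* real-time target (assumed) */
    double best_p = 1.0e30;
    int best_c = 0, best_a = 0, best_m = 0;

    for (int c = 1; c <= 128; c *= 2)          /* #clusters              */
        for (int a = 1; a <= 8; ++a)           /* adders per cluster     */
            for (int m = 1; m <= 8; ++m) {     /* multipliers per cluster */
                /* With sufficient DP the work divides across clusters,
                   so the required clock falls linearly with c. */
                double f = cycles_per_frame(a, m) * frames_per_sec / c;
                double cap = c * (a + m);      /* relative capacitance   */
                double p = cap * f * f * f;    /* V ∝ f  =>  P ∝ C·f³    */
                if (p < best_p) {
                    best_p = p;
                    best_c = c; best_a = a; best_m = m;
                }
            }
    printf("lowest-power point: %d clusters, %d adders, %d multipliers\n",
           best_c, best_a, best_m);
    return 0;
}

A real exploration would also cap the cluster count by the available DP (the table earlier shows DP as low as 4), which is what keeps the optimum away from the degenerate corners.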
Conclusions
 We propose a step-by-step methodology to design high-performance, power-efficient DSPs based on the TI C64x architecture
 Initial results show benefits in power/performance of greater than an order of magnitude over a conventional C64x
 We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
 We are interested in exploring opportunities within TI for the design and actual fabrication of a chip and the associated tool development
 We are interested in feedback:
 Limitations that we have not accounted for
 Unreasonable assumptions that we have made
Recommended reading:
S. Rixner et al., A register organization for media processing, HPCA 2000
B. Khailany et al., Exploring the VLSI scalability of stream processors, HPCA 2003
U. J. Kapasi et al., Programmable Stream Processors, IEEE Computer, August 2003