High performance, power-efficient DSPs
based on the TI C64x
Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner
Rice University
{sridhar,cavallar,rixner}@rice.edu
Recent (2003) Research Results
 Stream-based programmable processors meet the real-time requirements of a set of base-station physical-layer algorithms+,*
 Mapped the algorithms onto stream processors and studied tradeoffs between packing, ALU utilization, and memory operations
 Improved power efficiency in stream processors by adapting compute resources to workload variations and varying voltage and clock frequency to real-time requirements*
 Explored the design space between #ALUs and clock frequency to minimize the power consumption of the processor
+ S. Rajagopal, S. Rixner, J. R. Cavallaro, 'A programmable baseband processor design for software radios', 2002
* Paper draft sent previously; the rest of the contributions are in the thesis
Recent (2003) Research Results
 Peak computation rate available: ~200 billion arithmetic operations per second at 1.2 GHz
 Estimated peak power (0.13 micron): 12.38 W at 1.2 GHz
 Power:
 12.38 W for 32 users, constraint length 9 decoding, at 128 Kbps/user (at 1.2 GHz, 1.4 V)
 300 mW for 4 users, constraint length 7 decoding, at 128 Kbps/user (at 433 MHz, 0.875 V)
Motivation
 This research could be applied to DSP design!
 Designing DSPs that are
 High performance
 Power-efficient
 Able to adapt computing resources to workload changes
 Such that
 Changes in the C64x architecture are gradual
 Changes in compilers and tools are gradual
Levels of changes
 To allow gradual changes in TI DSPs and tools
 Changes classified into 3 levels
 Level 1: simple, minimal changes (next silicon)
 Level 2: intermediate, handover changes (1-2 years)
 Level 3: actual proposed changes (2-3 years)
We want to go to Level 3 but in steps!
Level 1 changes:
Power-efficiency
Level 1 changes: Power saving features
 (1) Use Dynamic Voltage and Frequency Scaling (DVS)
 When the workload changes, e.g.
 Users, data rates, modulation, coding rates, ...
 Already in industry: Crusoe, XScale, ...
 (2) Use voltage gating to turn off unused resources
 When units are idle for a 'sufficiently' long time
 Saves static and dynamic power dissipation
 See the example on the next slide
Turning off ALUs
[Figure: two instruction schedules across adders and multipliers. The default schedule spreads work over all ALUs; the schedule after exploration packs the work onto fewer units, and a 'sleep' instruction turns 2 multipliers off to save power. The idle units are turned off using voltage gating to eliminate static and dynamic power dissipation.]
Level 1: Architecture tradeoffs
DVS:
 Advanced voltage regulation scheme
 Cannot use NMOS pass gates
 Cannot use tri-state buffers
 Use at a coarser time scale (once every ~1 million cycles)
 100-1000 cycles settling time
Voltage gating:
 Gating device design important
 Should be able to supply current to gated circuit
 Use at a coarser time scale (once every 100-1000 cycles)
 1-10 cycles settling time
Level 1: Tools/Programming impact
 Need a DSP/BIOS "TASK" running continuously that watches for workload changes and changes the voltage/frequency using a look-up table in memory (a minimal sketch follows this list)
 The compiler should be made 're-targetable'
 Target a subset of the ALUs and explore static performance with different adder-multiplier schedules
 Voltage gating uses a 'sleep' instruction that the compiler generates for unused ALUs
 ALUs should be idle for > 100 cycles for this to occur
 Other resources can be gated off similarly to save static power dissipation
 The programmer is not aware of these changes
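
A minimal sketch of such a task, in plain C. The operating points echo the frequency/voltage pairs quoted later in this talk, but the workload-estimation hook and the clock/voltage control functions are hypothetical placeholders, not TI APIs:

#include <stdint.h>

/* Hypothetical operating points (frequency/voltage pairs) indexed by
   workload level; the control interface is invented for illustration. */
typedef struct { uint32_t freq_mhz; uint32_t millivolts; } op_point_t;

static const op_point_t op_table[] = {
    {  433,  875 },   /* light load, e.g. 4 users   */
    {  533,  950 },
    {  667, 1050 },
    { 1200, 1400 },   /* worst case, e.g. 32 users  */
};

extern uint32_t estimate_workload_level(void);  /* placeholder: 0..3 */
extern void set_pll_freq_mhz(uint32_t mhz);     /* placeholder       */
extern void set_core_voltage_mv(uint32_t mv);   /* placeholder       */

/* Body of a continuously running task: watch the workload and move to
   the matching operating point from the look-up table when it changes. */
void dvs_task(void)
{
    uint32_t current = 3;                       /* boot at worst case */
    for (;;) {
        uint32_t level = estimate_workload_level();
        if (level == current)
            continue;
        if (level > current) {          /* speeding up: raise V, then f */
            set_core_voltage_mv(op_table[level].millivolts);
            set_pll_freq_mhz(op_table[level].freq_mhz);
        } else {                        /* slowing down: lower f, then V */
            set_pll_freq_mhz(op_table[level].freq_mhz);
            set_core_voltage_mv(op_table[level].millivolts);
        }
        current = level;
    }
}

The ordering matters: voltage is raised before the clock speeds up and lowered only after it slows down, so the circuit always has headroom during the 100-1000 cycle settling time noted on the previous slide.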
Level 2 changes:
Performance
Solutions to increase DSP performance
 (1) Increasing clock frequency
 C64x: 600 - 720 - 1000 MHz - ?
 Easiest solution but limited benefits
 Not good for power, given the cubic dependence of power on frequency
 (2) Increasing ALUs
 Limited instruction level parallelism (ILP)
 Register file area and port counts explode
 Compiler issues in extracting more ILP
 (3) Multiprocessors (MIMD)
 Usually from 3rd-party vendors (except C40-types)
DSP multiprocessors
[Figure: board-level DSP multiprocessors - multiple DSPs connected through an interconnection fabric, with a network interface, ASSPs, and co-processors.]
Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80
Multiprocessing tradeoffs
 Advantages:
 Performance, and the tools don't have to change!
 Disadvantages:
 Load-balancing algorithms across multiple DSPs is not straightforward+
 The burden is pushed onto the programmer
 Not scalable with the number of processors
 Difficult to adapt to workload changes
 Traditional DSPs are not built for multiprocessing* (except C40-types)
 I/O impacts throughput, power, and area
 (E)DMA use minimizes the throughput problem
 The power and area problems still remain
* R. Baines, The DSP bottleneck, IEEE Communications Magazine, May 1995, pp. 46-54 (outdated?)
+ S. Rajagopal, B. Jones and J. R. Cavallaro, Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs, ICSPAT 2001
Options
 Chip multiprocessors with SIMD parallelism (Level 3)
 SIMD parallelism can alleviate load balancing (shown in Level 3)
 Scalable with the number of processors
 Automatic SIMD parallelization can be done by the compiler
 A single chip alleviates I/O bottlenecks
 Tools will need changes
 To get to Level 3, an intermediate (Level 2) investigation is needed
 Level 2:
 Do SPMD on a DSP multiprocessor
Texas Instruments C64x DSP
[Figure: the C64x datapath.]
Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)
A possible, plausible solution
Exploit data parallelism (DP)*
 Available in many wireless algorithms
 This is what ASICs do!
#define N 1024
int a[N], b[N], sum[N];      // 32-bit elements
short c[N], d[N], diff[N];   // 16-bit elements, packed two per word
for (int i = 0; i < N; ++i)
{
    sum[i] = a[i] + b[i];    // independent iterations: DP
    diff[i] = c[i] - d[i];   // 16-bit operations: subword candidates
}
The independent loop iterations expose DP, the independent operations within an iteration expose ILP, and the 16-bit operations expose subword parallelism (a packed-arithmetic sketch follows the footnote below).
*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
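
As an illustration of the subword piece, the 16-bit subtraction above can be packed two per word; a sketch using the TI C6000 compiler's _sub2() intrinsic, with the header name, alignment, and even trip count assumed for this example:

#include <c6x.h>    /* TI C6000 compiler intrinsics (assumed header)  */

/* Two 16-bit subtractions per 32-bit ALU operation via _sub2().
   Assumes c[], d[], diff[] are 4-byte aligned and n is even. */
void packed_diff(const short *c, const short *d, short *diff, int n)
{
    const int *cp = (const int *)c;   /* view pairs of shorts as words */
    const int *dp = (const int *)d;
    int *op = (int *)diff;
    for (int i = 0; i < n / 2; ++i)
        op[i] = _sub2(cp[i], dp[i]);  /* SIMD within a register        */
}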
SPMD multiprocessor DSP
[Figure: four C64x datapaths, with the same program running on all DSPs.]
Level 2: Architecture tradeoffs
 The interconnection between C64x's could be similar to those used by 3rd-party vendors:
 FPGA-based C40 comm ports (Sundance) ~400 MBps
 VIM modules (Pentek) ~300 MBps
 Others developed by TI, BlueWave Systems
Level 2: Tools/Programming impact
 All DSPs run the same program (see the sketch after this list)
 The programmer thinks of only 1 DSP program
 The burden is now on the tools
 Can use C8x compiler and tool support expertise
 Integration of the C8x and C6x compilers
 Data parallelism is used for SPMD
 DMA data movement can be left to the programmer at this stage to keep data fed to all the processors
 MPI (Message Passing Interface) could alternatively be applied
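
To make the SPMD model concrete, a minimal sketch of the single program every DSP would run; my_dsp_id() and NUM_DSPS are hypothetical stand-ins for whatever the tools would actually provide:

#define NUM_DSPS 4             /* assumed board configuration         */
extern int my_dsp_id(void);    /* hypothetical: returns 0..NUM_DSPS-1 */

/* SPMD: every DSP runs this same function, and each one claims a
   contiguous slice of the data-parallel iteration space by its ID. */
void spmd_vector_add(const int *a, const int *b, int *sum, int n)
{
    int chunk = n / NUM_DSPS;
    int start = my_dsp_id() * chunk;
    int end = (my_dsp_id() == NUM_DSPS - 1) ? n : start + chunk;
    for (int i = start; i < end; ++i)
        sum[i] = a[i] + b[i];
}

The programmer reasons about one program; partitioning by ID is mechanical, which is what lets the compiler (or a thin runtime) take over that burden.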
Level 3 changes:
Performance and
Power
A chip multiprocessor (CMP) DSP
[Figure: on the left, a C64x DSP core (1 cluster) - an instruction decoder, L2 internal memory, and one cluster of 3 adders and 3 multipliers exploiting ILP and subword parallelism. On the right, a C64x-based CMP DSP core - one instruction decoder and shared L2 internal memory feeding many identical clusters that execute the same operations, exploiting DP across clusters. The number of clusters adapts to the DP; unused ALUs and clusters are powered down.]
A 4-cluster CMP using TI C64x
[Figure: four C64x datapaths combined as clusters on a single chip. Significant savings are possible in area and power, with increasing benefits at larger cluster counts (8, 16, 32 clusters).]
Alternate view of the CMP DSP
[Figure: clusters of C64x cores (core 0 through core C) share a DMA controller, banked L2 internal memory (banks 1 through C), prefetch buffers, an inter-cluster communication network, and a single instruction decoder.]
Adapting #clusters to Data Parallelism
[Figure: an adaptive multiplexer network feeds the clusters. With full DP all clusters run (no reconfiguration); as DP drops, the network reconfigures 4:2, then 4:1, until all clusters are off. Idle clusters are turned off using voltage gating to eliminate static and dynamic power dissipation.]
Level 3: Architecture tradeoffs
 Single processor -> SPMD -> SIMD
 Single chip:
 Maximum die size limits the design to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
 Number of memory banks = #clusters
 An instruction is added to turn off clusters when data parallelism is insufficient
Level 3: Tools/Programming impact
 The Level 2 compiler provides support for data parallelism
 Adapt #clusters to the data parallelism for power savings
 Check the loop trip count after loop unrolling
 If it is less than #clusters, emit the instruction to turn off the surplus clusters (a sketch follows this list)
 Design of parallel algorithms and their mapping is important
 The programmer still writes regular C code
 Transparent to the programmer
 Burden on the compiler
 Automatic DMA data movement keeps data feeding into the arithmetic units
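
A minimal sketch of the check the compiler could emit, assuming a hypothetical cluster_off() intrinsic wrapping the proposed cluster-off instruction:

#define NUM_CLUSTERS 32             /* assumed machine configuration   */
extern void cluster_off(int id);    /* hypothetical intrinsic for the
                                       proposed cluster-off instruction */

/* Compiler-emitted prologue for a data-parallel loop: after unrolling,
   trip_count iterations remain to be spread across clusters. If the DP
   is less than #clusters, gate the surplus clusters off. */
static void adapt_clusters(int trip_count)
{
    int needed = (trip_count < NUM_CLUSTERS) ? trip_count : NUM_CLUSTERS;
    for (int id = needed; id < NUM_CLUSTERS; ++id)
        cluster_off(id);            /* removes static + dynamic power  */
}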
Verification of potential benefits
 The Level 3 potential is verified using the Imagine stream processor simulator
 The C64x DSP is replaced with a cluster containing 3 adders, 3 multipliers, and a distributed register file
Need for adapting to flexibility
 Base-stations are designed for the worst-case workload
 Base-stations rarely operate at the worst-case workload
 Adapting the resources to the workload can save power!
Example of flexibility needed in workloads
[Chart: operation count (in GOPs) vs. (Users, Constraint lengths) from (4,7) to (32,9), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user). GOPs refer only to arithmetic computations.]
Billions of computations per second are needed; the workload varies from ~1 GOPs for 4 users with constraint length 7 Viterbi to ~23 GOPs for 32 users with constraint length 9 Viterbi.
Flexibility affects Data Parallelism*
U - Users, K - constraint length, N - spreading gain, R - decoding rate

Workload (U,K)   Estimation f(U,N)   Detection f(U,N)   Decoding f(U,K,R)
(4,7)            32                  4                  16
(4,9)            32                  4                  64
(8,7)            32                  8                  16
(8,9)            32                  8                  64
(16,7)           32                  16                 16
(16,9)           32                  16                 64
(32,7)           32                  32                 16
(32,9)           32                  32                 64

*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
Cluster utilization variation with workload
[Chart: cluster utilization (0-100%) vs. cluster index (0-30) on a 32-cluster processor, shown for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), and (32,9); (32,9) = 32 users, constraint length 9 Viterbi.]
Frequency variation with workload
[Chart: real-time frequency (in MHz, 0-1200) needed for each workload from (4,7) to (32,9), broken down into busy time, L2 stalls, and memory stalls.]
Operation
 DVS when the system changes significantly
 Users, data rates, ...
 Coarse time scale (every few seconds)
 Turn off clusters when the parallelism changes significantly
 Parallelism can change within the same algorithm
 e.g., the spreading gain changes during matched filtering
 Finer time scales (100's of microseconds)
 Turn off ALUs when the algorithms change significantly
 Estimation, detection, decoding
 Finer time scales (100's of microseconds)
Power savings: Voltage Gating & Scaling

Workload   Freq needed   Freq used   Voltage   Savings: clocking / memory / clusters (W)   New power   Base power   Savings
(U,K)      (MHz)         (MHz)       (V)                                                   (W)         (W)
(4,7)      345.09        433         0.875     0.325 / 1.05 / 0.366                        0.3         2.05         85.14 %
(4,9)      380.69        433         0.875     0.193 / 0.56 / 0.604                        0.69        2.05         66.41 %
(8,7)      408.89        433         0.875     0.089 / 0.54 / 0.649                        0.77        2.05         62.44 %
(8,9)      463.29        533         0.95      0.304 / 0.71 / 0.643                        1.33        2.98         55.46 %
(16,7)     528.41        533         0.95      0.02  / 0.44 / 0.808                        1.71        2.98         42.54 %
(16,9)     637.21        667         1.05      0.156 / 0.58 / 0.603                        3.21        4.55         29.46 %
(32,7)     902.89        1000        1.3       0.792 / 1.18 / 1.375                        7.11        10.46        32.03 %
(32,9)     1118.3        1200        1.4       0.774 / 1.41 / 0                            12.38       14.56        14.98 %

Cluster power consumption: 78 %; L2 memory power consumption: 11.5 %; instruction decoder power consumption: 10.5 %
Chip area (0.13 micron process): 45.7 mm²
Power can change from 12.38 W to 300 mW depending on workload changes
How to decide ALUs vs. clock frequency
 The variables are not independent
 Clusters, ALUs, frequency, voltage
 Trade-offs exist:
V ∝ f,  P = C·V²·f  ⇒  P ∝ f³
 How to find the right combination for real-time at the lowest power? (A worked example follows the figure.)
[Figure: extreme design points. (A) 1 cluster with 1 adder and 1 multiplier, running at 100 GHz; (B) 100 clusters, each with 10 adders and 10 multipliers, running at 10 MHz; (C) the general case - 'c' clusters with 'a' adders and 'm' multipliers per cluster, running at 'f' MHz.]
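
As an idealized sanity check of these relations (static power and per-cluster overheads ignored; the numbers are illustrative):

P = CV^2 f, \quad V \propto f \;\Rightarrow\; P \propto f^3
\text{Halving } f \text{ (and hence } V\text{): } P' = C\Bigl(\tfrac{V}{2}\Bigr)^2 \tfrac{f}{2} = \tfrac{P}{8}
\text{Two clusters at } f/2 \text{ restore throughput at } 2 \cdot \tfrac{P}{8} = \tfrac{P}{4}

This is why trading clock frequency for clusters, as in design point (B), can win on power, provided there is enough DP to keep the extra clusters busy.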
Setting clusters, adders, multipliers
 If there is sufficient DP, the frequency decreases linearly with the number of clusters
 Set the number of clusters based on the DP and an execution time estimate
 To find the number of adders and multipliers:
 Let the compiler schedule the algorithm workloads across different numbers of adders and multipliers and report the execution time
 Put all the numbers into the previous equation
 Compare the increase in capacitance due to added ALUs and clusters with the benefit in execution time
 Choose the solution that minimizes the power (a sketch of this search follows)
Details are available in Sridhar's thesis
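
A sketch of that search loop, in C. The cycle-count model and the constants are placeholders for illustration; the real flow would query the compiler's static schedule, as described above:

#include <stdio.h>

/* Placeholder model for the compiler-reported cycles to process one
   frame on (a adders, m multipliers) per cluster: more ALUs means
   fewer cycles, with diminishing returns. Purely illustrative. */
static double cycles_per_frame(int a, int m)
{
    return 1.0e6 / (a + 0.5 * m);
}

int main(void)
{
    const double frames_per_sec = 100.0;  /* real-time target (assumed) */
    double best_p = 1.0e30;
    int best_c = 0, best_a = 0, best_m = 0;

    for (int c = 1; c <= 128; c *= 2)          /* #clusters              */
        for (int a = 1; a <= 8; ++a)           /* adders per cluster     */
            for (int m = 1; m <= 8; ++m) {     /* multipliers per cluster */
                /* With sufficient DP the work divides across clusters,
                   so the required clock falls linearly with c. */
                double f = cycles_per_frame(a, m) * frames_per_sec / c;
                double cap = c * (a + m);      /* relative capacitance   */
                double p = cap * f * f * f;    /* V ∝ f  =>  P ∝ C·f³    */
                if (p < best_p) {
                    best_p = p;
                    best_c = c; best_a = a; best_m = m;
                }
            }
    printf("lowest-power point: %d clusters, %d adders, %d multipliers\n",
           best_c, best_a, best_m);
    return 0;
}

A real exploration would also cap the cluster count by the available DP (the table earlier shows DP as low as 4), which is what keeps the optimum away from the degenerate corners.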
Conclusions
 We propose a step-by-step methodology to design high-performance, power-efficient DSPs based on the TI C64x architecture
 Initial results show benefits in power/performance of greater than an order of magnitude over a conventional C64x
 We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
 We are interested in exploring opportunities within TI for the design and actual fabrication of a chip and the associated tool development
 We are interested in feedback:
 Limitations that we have not accounted for
 Unreasonable assumptions that we have made
Recommended reading:
S. Rixner et al., A register organization for media processing, HPCA 2000
B. Khailany et al., Exploring the VLSI scalability of stream processors, HPCA 2003
U. J. Kapasi et al., Programmable Stream Processors, IEEE Computer, August 2003