High performance, power-efficient DSPs
based on the TI C64x
Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner
Rice University
{sridhar,cavallar,rixner}@rice.edu
RICE UNIVERSITY
Recent (2003) Research Results
Stream-based programmable processors meet real-time
requirements for a set of base-station phy layer algorithms+,*
Mapped algorithms on stream processors and studied tradeoffs
between packing, ALU utilization and memory operations
Improved power efficiency in stream processors by adapting
compute resources to workload variations and varying voltage
and clock frequency to real-time requirements*
Explored the design space between #ALUs and clock frequency to
minimize power consumption of the processor
+ S. Rajagopal, S. Rixner, J. R. Cavallaro, 'A programmable baseband processor design for software radios', 2002
*Paper draft sent previously, rest of the contributions in thesis
Recent (2003) Research Results
Peak computation rate available : ~200 billion arithmetic
operations per second at 1.2 GHz
Estimated Peak Power (0.13 micron) : 12.38 W at 1.2 GHz
Power:
12.38 W for 32 users, constraint 9 decoding, at 128Kbps/user
At 1.2 GHz, 1.4 V
300 mW for 4 users, constraint 7 decoding, at 128Kbps/user
At 433 MHz, 0.875 V
Motivation
This research could be applied to DSP design!
Designing
High performance DSPs
Power-efficient
Adapt computing resources with workload changes
Such that
Gradual changes in C64x architecture
Gradual changes in compilers and tools
Levels of changes
To allow changes in TI DSPs and tools gradually
Changes classified into 3 levels
Level 1 : simple, minimum changes (next silicon)
Level 2 : intermediate, handover changes (1-2 years)
Level 3 : actual proposed changes (2-3 years)
We want to go to Level 3 but in steps!
Level 1 changes:
Power-efficiency
Level 1 changes: Power saving features
(1) Use Dynamic Voltage and Frequency scaling
When workload changes such as
Users, data rates, modulation, coding rates, …
Already in industry : Crusoe, XScale …
(2) Use Voltage gating to turn off unused resources
When units idle for a ‘sufficiently’ long time
Saves static and dynamic power dissipation
See example on next page
Turning off ALUs
[Figure: instruction schedules across adders and multipliers, comparing the default schedule with the schedule after exploration; 'sleep' instructions mark the units that are switched off]
2 multipliers turned off to save power
Turned off using voltage gating to eliminate static and dynamic power dissipation
Level 1: Architecture tradeoffs
DVS:
Advanced voltage regulation scheme
Cannot use NMOS pass gates
Cannot use tri-state buffers
Use at a coarser time scale (once in a million cycles)
100-1000 cycles settling time
Voltage gating:
Gating device design important
Should be able to supply current to gated circuit
Use at coarser time scale (once in 100-1000 cycles)
1-10 cycles settling time
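A minimal sketch of the gating decision, assuming a per-cycle busy/idle trace for one unit is available from the static schedule. The function name, the trace format and the exact threshold are illustrative assumptions, not part of the proposal; the only figure taken from the slides is that a unit should be idle for more than about 100 cycles before it is gated.

```c
#include <stddef.h>

/* Assumed threshold: gate a unit only if it stays idle for more than
   this many consecutive cycles (the slides suggest > 100 cycles). */
#define MIN_IDLE_CYCLES 100

/* busy[i] is nonzero if the unit executes an operation in cycle i.
   Returns the number of idle runs long enough to justify gating. */
size_t count_gateable_runs(const unsigned char *busy, size_t n)
{
    size_t runs = 0;
    size_t idle = 0;

    for (size_t i = 0; i < n; ++i) {
        if (busy[i]) {
            if (idle > MIN_IDLE_CYCLES)
                ++runs;          /* the idle run that just ended qualifies */
            idle = 0;
        } else {
            ++idle;
        }
    }
    if (idle > MIN_IDLE_CYCLES)
        ++runs;                  /* trailing idle run at the end of the trace */
    return runs;
}
```

Each counted run would be a candidate place for the compiler to emit a 'sleep' instruction; shorter idle gaps are left alone because of the 1-10 cycle settling time.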
Level 1: Tools/Programming impact
Need a DSP BIOS “TASK” running continuously which looks at
the workload change and changes voltage/frequency using a
look-up table in memory
Compiler should be made ‘re-targetable’
Target subset of ALUs and explore static performance with
different adder-multiplier schedules
Voltage gating using a ‘sleep’ instruction that the compiler
generates for unused ALUs
ALUs should be idle for > 100 cycles for this to occur
Other resources can be gated off similarly to save static
power dissipation
Programmer is not aware of these changes
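The look-up table used by such a task could be sketched as follows. The frequency/voltage pairs are the operating points listed in the power-savings table of this deck; the struct, the table name and the function name are hypothetical, and the actual register writes that change voltage and frequency are platform specific and left out.

```c
#include <stddef.h>

/* One supported operating point: clock frequency and supply voltage. */
struct op_point {
    int    mhz;
    double volts;
};

/* Operating points from the power-savings table, sorted by frequency. */
static const struct op_point OP_TABLE[] = {
    {  433, 0.875 },
    {  533, 0.95  },
    {  667, 1.05  },
    { 1000, 1.3   },
    { 1200, 1.4   },
};

/* Returns the slowest operating point that still meets the real-time
   frequency requirement, or NULL if the workload cannot be met. */
const struct op_point *select_op_point(double needed_mhz)
{
    size_t n = sizeof OP_TABLE / sizeof OP_TABLE[0];

    for (size_t i = 0; i < n; ++i) {
        if ((double)OP_TABLE[i].mhz >= needed_mhz)
            return &OP_TABLE[i];
    }
    return NULL;
}
```

For the (4,7) workload, which needs 345.09 MHz, the lookup returns the 433 MHz, 0.875 V point. Because DVS settles slowly, the task would call this only when users, data rates or coding change.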
Level 2 changes:
Performance
Solutions to increase DSP performance
(1) Increasing clock frequency
C64x: 600 – 720 – 1000 - ?
Easiest solution but limited benefits
Not good for power, given the cubic dependence of power on frequency
(2) Increasing ALUs
Limited instruction level parallelism (ILP)
Register file area, ports explosion
Compiler issues in extracting more ILP
(3) Multiprocessors (MIMD)
Usually 3rd party vendors (except C40-types)
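As a rough worked example of the cubic dependence in option (1), assume that the supply voltage has to scale roughly linearly with clock frequency and that the switched capacitance $C$ stays fixed. Both are simplifying assumptions. Then:

\[
P \propto C V^2 f, \qquad V \propto f \;\Rightarrow\; P \propto f^3
\]

\[
\frac{P(1000\ \text{MHz})}{P(600\ \text{MHz})} \approx \left(\frac{1000}{600}\right)^3 \approx 4.6
\]

Under these assumptions, moving from 600 MHz to 1000 MHz buys about 1.67x in performance for about 4.6x in power.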
DSP multiprocessors
[Figure: example DSP multiprocessor boards in which several DSPs, ASSPs and co-processors are joined by an interconnection and a network interface]
Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80
Multiprocessing tradeoffs
Advantages:
Performance, and tools don’t have to change!!
Disadvantages:
Load-balancing algorithms on multiple DSPs not straightforward+
Burden pushed on to the programmer
Not scalable with number of processors
difficult to adapt with workload changes
Traditional DSPs not built for multiprocessing* (except C40-types)
I/O impacts throughput, power and area
(E)DMA use minimizes the throughput problem
Power and area problems still remain
*R. Baines, The DSP bottleneck, IEEE Communications Magazine, May 1995, pp 46-54 (outdated?)
+S. Rajagopal, B. Jones and J.R. Cavallaro, Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs, ICSPAT’2001
Options
Chip multiprocessors with SIMD parallelism (Level 3)
SIMD parallelism can alleviate load balancing
(shown in Level 3)
Scalable with processors
Automatic SIMD parallelism can be done by the compiler
Single chip will alleviate I/O bottlenecks
Tools will need changes
To get to Level 3, an intermediate (Level 2) investigation is needed
Level 2
Do SPMD on DSP multiprocessor
Texas Instruments C64x DSP
[Figure: C64x datapath]
Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)
A possible, plausible solution
Exploit data parallelism (DP)*
Available in many wireless algorithms
This is what ASICs do!
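#define N 1024

int   a[N], b[N], sum[N];      /* 32-bit elements */
short c[N], d[N], diff[N];     /* 16-bit elements, two packed per 32-bit word */

void add_sub(void)
{
    for (int i = 0; i < N; ++i) {   /* independent iterations: DP */
        sum[i]  = a[i] + b[i];      /* add and subtract issue together: ILP */
        diff[i] = c[i] - d[i];      /* 16-bit operands: subword parallelism */
    }
}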
*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
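Subword packing can be illustrated with a portable emulation of a packed 16-bit subtract: two 16-bit lanes share one 32-bit word and are subtracted independently. The function names below are hypothetical and this is only a sketch of the idea; on a real DSP the same effect comes from a single packed instruction rather than from the masking shown here.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in the upper half). */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Subtract both 16-bit lanes independently: no borrow crosses between lanes,
   and each lane wraps around modulo 2^16. */
static uint32_t sub2_emul(uint32_t x, uint32_t y)
{
    uint32_t lo = (x - y) & 0x0000FFFFu;          /* low lane only */
    uint32_t hi = ((x >> 16) - (y >> 16)) << 16;  /* high lane only */
    return hi | lo;
}
```

With packing, the 1024-element 16-bit loop needs only 512 packed subtracts, which is why the footnote counts data parallelism after subword packing.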
SPMD multiprocessor DSP
[Figure: four C64x datapaths side by side]
Same Program running on all DSPs
Level 2: Architecture tradeoffs
Interconnection between C64x’s could be similar to the ones used by 3rd party
vendors
FPGA-based C40 comm ports (Sundance) ~400 MBps
VIM modules (Pentek) ~300 MBps
Others developed by TI, BlueWave systems
Level 2: Tools/Programming impact
All DSPs run the same program
Programmer thinks of only 1 DSP program
Burden now on tools
Can use C8x compiler and tool support expertise
Integration of C8x and C6x compilers
Data parallelism used for SPMD
DMA data movement can be left to the programmer at this stage
to keep data fed to all the processors
MPI (message passing) can be applied as an alternative
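A sketch of what "same program on all DSPs" could look like for a data-parallel loop: every DSP runs identical code and uses only its own id to pick a slice of the iteration space. The names `spmd_slice` and `spmd_add`, and the way a DSP learns its id, are assumptions for illustration; data movement between DSPs is not shown.

```c
#include <stddef.h>

/* Half-open range [begin, end) of loop iterations owned by one DSP. */
struct slice {
    size_t begin;
    size_t end;
};

/* Split n iterations as evenly as possible across num_dsps processors;
   the first (n % num_dsps) processors get one extra iteration. */
struct slice spmd_slice(size_t n, size_t dsp_id, size_t num_dsps)
{
    size_t base  = n / num_dsps;
    size_t extra = n % num_dsps;
    struct slice s;

    s.begin = dsp_id * base + (dsp_id < extra ? dsp_id : extra);
    s.end   = s.begin + base + (dsp_id < extra ? 1 : 0);
    return s;
}

/* The same function runs on every DSP; only dsp_id differs. */
void spmd_add(const int *a, const int *b, int *sum,
              size_t n, size_t dsp_id, size_t num_dsps)
{
    struct slice s = spmd_slice(n, dsp_id, num_dsps);

    for (size_t i = s.begin; i < s.end; ++i)
        sum[i] = a[i] + b[i];
}
```

With 1024 iterations on 4 DSPs, each processor handles 256 iterations. At Level 2 this split would come from the tools, so the programmer still writes the single-DSP loop.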
Level 3 changes:
Performance and
Power
A chip multiprocessor (CMP) DSP
[Figure: a single C64x DSP core (1 cluster: instruction decoder, L2 internal memory, adders and multipliers exploiting ILP and subword parallelism) next to a C64x based CMP DSP core, in which one instruction decoder and one internal memory (L2) drive many identical clusters to exploit DP]
C64x based CMP DSP Core
adapt #clusters to DP
Identical clusters, same operations.
Power-down unused ALUs, clusters
A 4 cluster CMP using TI C64x
[Figure: four C64x datapaths on one chip]
Significant savings possible in area and power
Increasing benefits with larger #clusters (8, 16, 32 clusters)
Alternate view of the CMP DSP
[Figure: DMA controller; L2 internal memory split into banks 1 to C; prefetch buffers; clusters of C64x cores 0 to C; inter-cluster communication network; instruction decoder]
Adapting #clusters to Data Parallelism
Turned off using voltage gating to eliminate static and dynamic power dissipation
[Figure: adaptive multiplexer network in front of the clusters (C), shown for no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration and all clusters off]
Level 3: Architecture tradeoffs
Single processor -> SPMD -> SIMD
Single chip :
Max die size limits a single chip to 128 clusters with 8 functional
units/cluster at 90 nm technology [estimate]
Number of memory banks = #clusters
Instruction addition to turn off clusters when data parallelism is
insufficient
Level 3: Tools/Programming impact
Level 2 compiler provides support for data parallelism
adapt #clusters to data parallelism for power savings
check for loop count index after loop unrolling
If less than #clusters, provide instruction to turn off clusters
Design of parallel algorithms and mapping important
Programmer still writes regular C code
Transparent to the programmer
Burden on the compiler
Automatic DMA data movement to keep data feeding into
the arithmetic units
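The compiler check described above can be sketched in a few lines. The function names are hypothetical; the rule itself (compare the loop count after unrolling with the number of clusters, and turn off the surplus) is the one stated on this slide.

```c
/* Number of clusters that can do useful work for a loop whose trip count
   after subword packing and unrolling is loop_count. */
unsigned active_clusters(unsigned loop_count, unsigned num_clusters)
{
    return loop_count < num_clusters ? loop_count : num_clusters;
}

/* Number of clusters the compiler would turn off for this loop. */
unsigned clusters_to_gate(unsigned loop_count, unsigned num_clusters)
{
    return num_clusters - active_clusters(loop_count, num_clusters);
}
```

For example, a loop with a data parallelism of 4 on a 32-cluster processor leaves 28 clusters that can be gated off, while a loop with a data parallelism of 64 keeps all 32 clusters busy.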
Verification of potential benefits
Level 3 potential verification using
the Imagine stream processor simulator
Replacing the C64x DSP with a
cluster containing 3 adders, 3 multipliers
and a distributed register file
Need for adapting to flexibility
Base-stations are designed for worst case workload
Base-stations rarely operate at worst case workload
Adapting the resources to the workload can save power!
Example of flexibility needed in workloads
[Figure: bar chart of operation count (in GOPs, axis 0 to 25) against (Users, Constraint lengths) from (4,7) to (32,9), for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user)]
Note: GOPs refer only to arithmetic computations
Billions of computations per second needed
Workload variation from ~1 GOPs for 4 users, constraint 7 Viterbi
to ~23 GOPs for 32 users, constraint 9 Viterbi
Flexibility affects Data Parallelism*
U - Users, K - constraint length,
N - spreading gain, R - decoding rate
Workload   Estimation   Detection   Decoding
(U,K)      f(U,N)       f(U,N)      f(U,K,R)
(4,7)      32           4           16
(4,9)      32           4           64
(8,7)      32           8           16
(8,9)      32           8           64
(16,7)     32           16          16
(16,9)     32           16          64
(32,7)     32           32          16
(32,9)     32           32          64
*Data Parallelism is defined as the parallelism available after subword packing and loop unrolling
Cluster utilization variation with workload
[Figure: four panels of cluster utilization (0 to 100) against cluster index (0 to 30), one panel each for workloads (4,7)/(4,9), (8,7)/(8,9), (16,7)/(16,9) and (32,7)/(32,9)]
Cluster utilization variation on a 32-cluster processor
(32, 9) = 32 users, constraint length 9 Viterbi
Frequency variation with workload
[Figure: stacked bars of real-time frequency (in MHz, axis 0 to 1200) for workloads (4,7) to (32,9), broken down into Busy, L2 Stall and Mem Stall]
Operation
DVS when system changes significantly
Users, data rates …
Coarse time scale (every few seconds)
Turn off clusters when parallelism changes significantly
Parallelism can change within the same algorithm
Eg: spreading gain changes during matched filtering
Finer time scales (100’s of microseconds)
Turn off ALUs when algorithms change significantly
estimation, detection, decoding
Finer time scales (100’s of microseconds)
Power savings: Voltage Gating & Scaling
Workload  Freq needed  Freq used  Voltage  Estimated power savings (W)    Estimated power (W)  Savings
          (MHz)        (MHz)      (V)      clocking  memory  clusters     New      Base
(4,7)     345.09       433        0.875    0.325     1.05    0.366        0.3      2.05        85.14 %
(4,9)     380.69       433        0.875    0.193     0.56    0.604        0.69     2.05        66.41 %
(8,7)     408.89       433        0.875    0.089     0.54    0.649        0.77     2.05        62.44 %
(8,9)     463.29       533        0.95     0.304     0.71    0.643        1.33     2.98        55.46 %
(16,7)    528.41       533        0.95     0.02      0.44    0.808        1.71     2.98        42.54 %
(16,9)    637.21       667        1.05     0.156     0.58    0.603        3.21     4.55        29.46 %
(32,7)    902.89       1000       1.3      0.792     1.18    1.375        7.11     10.46       32.03 %
(32,9)    1118.3       1200       1.4      0.774     1.41    0            12.38    14.56       14.98 %

Cluster power consumption: 78 %
L2 memory power consumption: 11.5 %
Instruction decoder power consumption: 10.5 %
Chip area (0.13 micron process): 45.7 mm²
Power can change from 12.38 W to 300 mW depending on workload changes
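The Base power column appears consistent with scaling the 14.56 W figure at 1.2 GHz, 1.4 V by the usual dynamic-power relation, power proportional to V^2 times f. The sketch below checks that reading; the function name is hypothetical, and treating all power as dynamic is a simplifying assumption rather than something the slides state.

```c
/* Scale a reference power to a new operating point, assuming power is
   proportional to V^2 * f (dynamic power only, fixed capacitance). */
double scaled_power(double p_ref, double v_ref, double f_ref,
                    double v, double f)
{
    double v_ratio = v / v_ref;

    return p_ref * v_ratio * v_ratio * (f / f_ref);
}
```

Scaling 14.56 W from (1.4 V, 1200 MHz) to (0.875 V, 433 MHz) gives about 2.05 W, and to (0.95 V, 533 MHz) about 2.98 W, which matches the Base entries for those rows. The gap between Base and New is then what gating clocking, memory and clusters saves on top of voltage and frequency scaling.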
How to decide ALUs vs. clock frequency
No independent variables
Clusters, ALUs, frequency, voltage
Trade-offs exist
$V \propto f$
$P \propto C V^2 f$
$P \propto f^3$
How to find the right combination for real-time @ lowest power!
[Figure: three design points built from clusters of adders (+) and multipliers (*): (A) '1' cluster with '1' adder and '1' multiplier at 100 GHz; (B) '100' clusters with '10' adders and '10' multipliers each at 10 MHz; (C) 'c' clusters with 'a' adders and 'm' multipliers each at 'f' MHz]
Setting clusters, adders, multipliers
If sufficient DP, linear decrease in frequency with clusters
Set clusters depending on DP and execution time estimate
To find adders and multipliers,
Let compiler schedule algorithm workloads across different
numbers of adders and multipliers and let it find execution
time
Put all numbers in previous equation
Compare increase in capacitance due to added ALUs and
clusters with benefits in execution time
Choose the solution that minimizes the power
Details available in Sridhar’s thesis
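The final selection step can be sketched as a search over candidate configurations. Each candidate carries a relative capacitance (growing with the number of ALUs and clusters) and the clock frequency it needs to meet real time, as found from the compiler's schedules. The struct, the function names and the use of C times f cubed as the cost are assumptions for illustration; the thesis should be consulted for the actual model.

```c
#include <stddef.h>

/* One candidate design: relative switched capacitance and the clock
   frequency (MHz) at which it just meets the real-time requirement. */
struct design {
    double rel_capacitance;
    double freq_mhz;
};

/* Relative power under P ~ C * f^3 (voltage assumed to scale with f). */
double rel_power(const struct design *d)
{
    return d->rel_capacitance * d->freq_mhz * d->freq_mhz * d->freq_mhz;
}

/* Returns the index of the lowest-power candidate; n must be at least 1. */
size_t pick_min_power(const struct design *cands, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; ++i) {
        if (rel_power(&cands[i]) < rel_power(&cands[best]))
            best = i;
    }
    return best;
}
```

With made-up candidates (C = 1 at 1000 MHz, C = 2 at 600 MHz, C = 4 at 550 MHz), the second wins: doubling the capacitance is worth it because the frequency drop is cubed, while the third adds hardware without lowering the frequency enough.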
Conclusions
We propose a step-by-step methodology to design high performance
power-efficient DSPs based on the TI C64x architecture
Initial results show benefits in power/performance greater than an
order-of-magnitude over a conventional C64x
We tailor the design to ensure maximum compatibility with TI’s C6x
architecture and tools
We are interested in exploring opportunities at TI for the design and
actual fabrication of a chip and associated tool development
We are interested in feedback
Limitations that we have not accounted for
Unreasonable assumptions that we have made
Recommended reading:
S. Rixner et al, A register organization for media processing, HPCA 2000
B. Khailany et al, Exploring the VLSI scalability of stream processors, HPCA 2003
U. J. Kapasi et al, Programmable Stream Processors, IEEE Computer, August 2003