System-on-a-Chip Platform Tuning for Embedded Systems
Frank Vahid
Associate Professor
Dept. of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC,
Trimedia, and Triscend
Frank Vahid, UC Riverside
How Much is Enough?
[Figures: a series of images, captioned in turn]
* Perhaps a bit small
* Reasonably sized
* Probably plenty big
* More than typically necessary
* Very few people could use this
How Much is Enough for an IC?
[Figure: IC inside an IC package]
* 1993: ~1 million logic transistors (perhaps a bit small)
* 1996: ~5-8 million logic transistors (reasonably sized)
* 1999: ~10-50 million logic transistors (probably plenty big)
* 2002: ~100-200 million logic transistors (more than typically necessary)
How Much is Enough for an IC?
* 1993: 1 M
* 2008: >1 BILLION logic transistors
  • Perhaps very few people could design this
* Point of diminishing returns
  • 8-bit uC: ~15K
  • 32-bit ARM: ~30K
  • MPEG dcd: ~1M
  • 100M good enough for audio/video/etc.?
* Other examples
  • Fast cars (> 100 mph)
  • High res digital cameras (> 4M)
  • Disk space
  • Even IC performance
Very Few Companies Can Design High-End ICs
[Figure: Design productivity gap. IC capacity (logic transistors per chip, in millions) plotted against designer productivity (K transistors per staff-month), with the gap between the two curves marked. Source: ITRS’99]
* Designer productivity growing at slower rate
  • 1981: 100 designer months -> ~$1M
  • 2002: 30,000 designer months -> ~$300M
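The two data points above imply the same cost per designer month, so the growth in cost comes from effort alone. A minimal check using only the figures on the slide (the variable names are illustrative):

```python
# Figures taken from the slide: designer months and approximate design cost.
projects = {
    1981: {"designer_months": 100, "cost_usd": 1_000_000},
    2002: {"designer_months": 30_000, "cost_usd": 300_000_000},
}

def cost_per_designer_month(project):
    """Approximate cost of one designer month for a project."""
    return project["cost_usd"] / project["designer_months"]

rates = {year: cost_per_designer_month(p) for year, p in projects.items()}
effort_growth = projects[2002]["designer_months"] / projects[1981]["designer_months"]

for year, rate in rates.items():
    print(year, f"${rate:,.0f} per designer month")
print(f"Design effort grew {effort_growth:.0f}x between 1981 and 2002")
```

Both years work out to about $10,000 per designer month, while effort grows by a factor of 300.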
Meanwhile, ICs Themselves are Costlier

Tech (micron):  0.8       0.35      0.18      0.13
NRE:            $40k      $100k     $350k     $1,000k
Turnaround:     42 days   49 days   56 days   76 days
Market:         $3.5B     $6B       $12B      $18B

Source: DAC’01 panel on embedded programmable logic
* And take longer to fabricate
* While market windows are shrinking
* Less than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron
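One way to read the NRE row is as a fixed cost that must be spread over every unit sold. The sketch below amortizes the slide's NRE figures over a few production volumes; the volumes are hypothetical and chosen only for illustration.

```python
# NRE (non-recurring engineering) cost per technology node in microns, from the slide.
NRE_USD = {"0.8": 40_000, "0.35": 100_000, "0.18": 350_000, "0.13": 1_000_000}

# Hypothetical production volumes (not from the slide).
VOLUMES = (10_000, 100_000, 1_000_000)

def nre_per_unit(tech, volume):
    """NRE cost carried by each unit when amortized over `volume` units."""
    return NRE_USD[tech] / volume

for tech in NRE_USD:
    per_unit = [f"${nre_per_unit(tech, v):,.2f}" for v in VOLUMES]
    print(f"{tech} micron:", per_unit)
```

At 0.13 micron the NRE alone adds $100 per unit at 10,000 units but only $1 per unit at a million units, which is consistent with the slide's point that few designs have the volume to justify the newest process.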
Summarizing So Far...
* Transistors are less scarce
  • ICs are big enough, fast enough
* ICs take more time and money to design and fabricate
  • While market windows are shrinking
Designers -> Buy pre-fabricated system-level ICs: platforms
Trend Towards Pre-Fabricated Platforms: ASSPs
* ASSP: application specific standard product
  • Domain-specific prefabricated IC
  • e.g., digital camera IC
* ASIC: application specific IC
* ASSP revenue > ASIC
* ASSP design starts > ASIC
  • Unique IC design
  • Ignores quantity of same IC
* ASIC design starts decreasing
  • Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest September’01
A Sample Pre-Fabricated Platform
[Figure: Pre-fabricated platform IC containing uP, L1 cache, L2 cache, DSP, JPEG dcd, FPGA, and peripherals]
* Must be programmable for use in variety of products
  • Ideally also configurable
* Means high volume
  • Platform designer’s investment pays off
  • Cost per IC is reasonable
* Use additional (readily available) transistors for high configurability
* Our research focus
  • Design and use of highly configurable platforms
Commercial Highly-Configurable Platform Type: Single-Chip Microprocessor/FPGA Platforms
* Triscend E5: based on 8-bit 8051 CISC core
  • 10 Dhrystone MIPS at 40 MHz
  • 60 kbytes on-chip RAM
  • Up to 40K logic gates
  • Cost only about $4 (in volume)
[Figure: Triscend E5 chip, showing configurable logic, memory, and the 8051 processor plus other peripherals]
Single-Chip Microprocessor/FPGA Platforms
* Atmel FPSLIC
  • Field-Programmable System-Level IC
  • Based on AVR 8-bit RISC core
* Features
  • 20 Dhrystone MIPS
  • 5k-40k configurable logic gates
  • On-chip RAM (20-36Kb) and EEPROM
  • $5-$10
Courtesy of Atmel
Single-Chip Microprocessor/FPGA Platforms
* Triscend A7 chip
* Based on ARM7 32-bit RISC processor
  • 54 Dhrystone MIPS at 60 MHz
  • Up to 40k logic gates
  • On-chip cache and RAM
  • $10-$20 in volume
Courtesy of Triscend
Single-Chip Microprocessor/FPGA Platforms
* Altera’s Excalibur EPXA 10
  • ARM (922T) hard core
  • ~200 Dhrystone MIPS at ~200 MHz
  • Devices range from ~200k to ~2 million programmable logic gates
Source: www.altera.com
Single-Chip Microprocessor/FPGA Platforms
* Xilinx Virtex II Pro
* PowerPC based
  • 420 Dhrystone MIPS at 300 MHz
  • 1 to 4 PowerPCs
  • 4 to 16 gigabit transceivers
  • 12 to 216 multipliers
  • 3,000 to 50,000 logic cells
  • 200k to 4M bits RAM
  • 204 to 852 I/O
  • $100-$500 (>25,000 units)
[Figure: Virtex II Pro, showing configurable logic, PowerPCs, and up to 16 serial transceivers (622 Mbps to 3.125 Gbps)]
Courtesy of Xilinx
Single-Chip Microprocessor/FPGA Platforms

Why wouldn’t future microprocessor chips include
some amount of on-chip FPGA?
Single-Chip Microprocessor/FPGA Platforms
* Lots of silicon area taken up by configurable logic
  • As discussed earlier, less of an issue every year
* Smaller area doesn’t necessarily mean higher yield (lower costs) any more
  • Previously could pack more die onto a wafer
  • But die are becoming pad (pin) limited in nanoscale technologies
* Configurable logic typically used for peripherals, glue logic, etc.
  • We have investigated another use...
Software Improvements using On-Chip Configurable Logic
* Partitioned software critical loops onto on-chip FPGA for several benchmarks
* Performed physical measurements on Triscend A7 and E5 devices

A7 results
Benchmark   Time orig  Time sw/hw  Sp.   P orig  P sw/hw  E orig   E sw/hw  E sav
PS_g3fax    11.47      7.44        1.5   1.320   1.332    15.140   9.910    35%
PS_crc      10.92      4.51        2.4   1.320   1.320    14.414   5.953    59%
PS_brev     9.84       3.28        3.0   1.332   1.344    13.107   4.408    66%
Average:                           2.3                                      53%

E5 results
Benchmark   Time orig  Time sw/hw  Sp.   P orig  P sw/hw  E orig   E sw/hw  E sav
PS_g3fax    15.16      7.11        2.1   0.252   0.270    3.820    1.920    50%
PS_crc      10.64      4.64        2.3   0.207   0.225    2.202    1.044    53%
PS_brev     17.81      1.81        9.8   0.252   0.270    4.488    0.489    89%
Average:                           4.8                                      64%

[Figure: A7 IC on the Triscend A7 development board]
Work done by Greg Stitt, Brian Grattan, Shawn Nematbakhsh at UCR
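The energy columns in the A7 and E5 tables follow from Energy = Power * Time, and the speedup and savings columns follow from the ratios of the original and partitioned values. A short sketch that recomputes them from the measured time and power columns (the dictionary layout and function name are illustrative, not from the slides):

```python
# (time_orig, time_swhw, power_orig, power_swhw) per benchmark, copied from the tables.
MEASUREMENTS = {
    "A7": {
        "PS_g3fax": (11.47, 7.44, 1.320, 1.332),
        "PS_crc": (10.92, 4.51, 1.320, 1.320),
        "PS_brev": (9.84, 3.28, 1.332, 1.344),
    },
    "E5": {
        "PS_g3fax": (15.16, 7.11, 0.252, 0.270),
        "PS_crc": (10.64, 4.64, 0.207, 0.225),
        "PS_brev": (17.81, 1.81, 0.252, 0.270),
    },
}

def summarize(time_orig, time_swhw, power_orig, power_swhw):
    """Derive speedup, energies, and fractional energy savings for one benchmark."""
    energy_orig = power_orig * time_orig      # Energy = Power * Time
    energy_swhw = power_swhw * time_swhw
    return {
        "speedup": time_orig / time_swhw,
        "energy_orig": energy_orig,
        "energy_swhw": energy_swhw,
        "savings": 1 - energy_swhw / energy_orig,
    }

for device, benchmarks in MEASUREMENTS.items():
    for name, row in benchmarks.items():
        s = summarize(*row)
        print(f"{device} {name}: speedup {s['speedup']:.1f}, "
              f"energy {s['energy_orig']:.3f} -> {s['energy_swhw']:.3f}, "
              f"savings {s['savings']:.0%}")
```

Power rises slightly when the FPGA is active, but the shorter run time dominates, so energy still drops.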
Software Improvements using On-Chip Configurable Logic
* Extensive simulated results for 8051 and MIPS
  • (Physical measurement very time consuming)
  • For Powerstone (PS), MediaBench (MB) and Netbench (NB)

Example    Archit  Cycles orig    Cycles sw      Cycles hw   Clk hw  Sp.   Psw   Phw    Eorig   Esw/hw   ESav  Area
PS_g3fax   8051    19,675,456     10,812,544     176,562     25      2.2   0.05  0.032  0.1142  0.05408  53%   2,858
PS_crc     8051    291,196        180,224        7,168       25      2.5   0.05  0.028  0.0017  0.00071  58%   770
PS_summin  8051    109,821,892    20,394,080     384,416     25      1.2   0.05  0.033  0.6376  0.53657  16%   4,191
PS_brev    8051    330,064        305,768        1,360       25      12.9  0.05  0.034  0.0019  0.00015  92%   3,961
PS_matmul  8051    119,420        101,576        2,560       25      5.9   0.05  0.035  0.0007  0.00012  82%   5,882
PS_g3fax   MIPS    15,600,000     4,720,000      599,000     100     1.4   0.07  0.111  0.0265  0.02163  18%   2,858
PS_adpcm   MIPS    113,000        29,300         5,440       100     1.3   0.07  0.181  0.0002  0.00018  6%    8,075
PS_crc     MIPS    5,040,000      3,480,000      460,800     100     2.5   0.07  0.061  0.0086  0.00379  56%   770
PS_des     MIPS    142,000        70,700         15,100      100     1.6   0.07  0.197  0.0002  0.00019  20%   9,031
PS_engine  MIPS    915,000        145,000        28,100      100     1.1   0.07  0.082  0.0016  0.00146  6%    2,074
PS_jpeg    MIPS    7,900,000      646,000        171,000     100     1.1   0.07  0.092  0.0134  0.01360  -1%   3,161
PS_summin  MIPS    2,920,000      1,270,000      266,000     100     1.5   0.07  0.111  0.0050  0.00375  24%   4,191
PS_v42     MIPS    3,850,000      846,000        216,000     100     1.2   0.07  0.102  0.0065  0.00605  7%    3,319
PS_brev    MIPS    3,566          2,499          138         100     3.0   0.07  0.107  0.0000  0.00000  62%   3,961
MB_g721    MIPS    838,230,002    457,674,179    9,985,261   100     2.1   0.07  0.152  1.4250  0.75035  47%   5,811
MB_adpcm   MIPS    32,894,094     32,866,110     1,183,260   42      11.6  0.07  0.130  0.0559  0.00821  85%   14,132
MB_pegwit  MIPS    42,752,919     33,276,287     2,167,651   50      3.1   0.07  0.170  0.0727  0.03241  55%   18,150
NB_dh      MIPS    1,793,032,157  1,349,063,192  45,156,767  69      3.5   0.07  0.121  3.0482  1.00547  67%   21,383
NB_md5     MIPS    5,374,034      3,046,881      289,877     47      1.8   0.07  0.251  0.0091  0.00722  21%   90,074
NB_tl      MIPS    57,412,470     29,244,221     2,479,552   58      1.8   0.07  0.059  0.0976  0.05930  39%   5,478
Average:                                                             3.2                                 34%   10,507

Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
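The Sp. column can be reproduced with a simple model: the cycles spent in the partitioned loops are removed from the software run and replaced by the hardware cycles, scaled by the hardware clock. The sketch below rests on assumptions that the slide does not state: that "Cycles sw" is the software cycle count of the loops moved to hardware, that "Clk hw" is in MHz, and that the processor runs at 25 MHz for the 8051 rows and 100 MHz for the MIPS rows. Under those assumptions it matches the table for the rows checked.

```python
def partitioned_speedup(cycles_orig, cycles_sw, cycles_hw, clk_cpu_mhz, clk_hw_mhz):
    """Speedup when loops costing `cycles_sw` in software run in `cycles_hw` on hardware.

    Times are in microseconds (cycles divided by MHz). The processor clock values
    passed in are assumptions, not figures from the slide.
    """
    time_orig = cycles_orig / clk_cpu_mhz
    time_remaining_sw = (cycles_orig - cycles_sw) / clk_cpu_mhz
    time_hw = cycles_hw / clk_hw_mhz
    return time_orig / (time_remaining_sw + time_hw)

# Rows copied from the table: (cycles orig, cycles sw, cycles hw, assumed CPU MHz, Clk hw).
rows = {
    "PS_g3fax (8051)": (19_675_456, 10_812_544, 176_562, 25, 25),
    "PS_brev (8051)": (330_064, 305_768, 1_360, 25, 25),
    "MB_adpcm (MIPS)": (32_894_094, 32_866_110, 1_183_260, 100, 42),
    "NB_tl (MIPS)": (57_412_470, 29_244_221, 2_479_552, 100, 58),
}

for name, row in rows.items():
    print(f"{name}: speedup {partitioned_speedup(*row):.1f}")
```

The model also shows why PS_brev and MB_adpcm stand out: nearly all of their original cycles sit in the loops that were moved to hardware.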
Speedup Gained with Relatively Few Gates
* Created several partitioned versions of each benchmark
  • Most speedup gained with first 20,000 gates; diminishing returns after that
  • Surprisingly few gates
[Figure: Speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721(MB), ADPCM(MB), PEGWIT(MB), DH(NB), MD5(NB), TL(NB), and URL(NB); annotations read "27.", "27.", and "2.05 at 90,000"]
Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
Other Types of Configurability
* Microprocessor (other researchers)
  • VLIW configurations
  • Voltage scaling
* Memory hierarchy
  • Our focus: build a highly-configurable cache that can be tuned to a particular program
  • Work by Chuanjun Zhang, along with Walid Najjar, at UCR
Cache Contributes Much to Performance and Power
* Well-known for performance
* Energy
  • ARM920T: caches consume nearly half of total power (Segars 01)
  • M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
[Figure labels: Mem, L1 Cache, Processor. ARM920T. Source: Segars ISSCC’01]
Associativity Plays a Big Role
* Reduces miss rate, thus improving performance
* Impact on power and energy?
  • (Energy = Power * Time)
[Figure: Miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
Associativity is Costly
* Associativity improves hit rate, but at the cost of more power per access
* Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: Energy per access (nJ, 0.0 to 1.0) versus associativity (1-way, 2-way, 4-way) for an 8 Kbyte cache]
[Figure: Energy-per-access breakdown for an 8 Kbyte, 4-way set associative cache (considering dynamic power only): data output driver, decode_data, mux driver, comparator, sa_tag, bitline_tag, wordline_data, bitline_data, wordline_tag, decode_tag, sa_data]
Associativity and Energy
* Best performing cache is not always lowest energy
[Figure: Miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
[Figure: Normalized energy (0.0 to 1.0) versus associativity (1, 2, 4) for epic and mpeg2; annotated "Significantly poorer energy"]
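The trade-off can be made concrete with a toy energy model: every access pays an energy cost that rises with associativity, and every miss pays an additional penalty. All numbers in this sketch are hypothetical and chosen only to illustrate the shape of the trade-off; none are taken from the slides or from measurements.

```python
# Hypothetical inputs: none of these numbers come from the slides.
ACCESSES = 1_000_000
MISS_ENERGY_NJ = 5.0                                 # assumed energy cost of one miss
ENERGY_PER_ACCESS_NJ = {1: 0.30, 2: 0.45, 4: 0.80}   # assumed, rises with associativity
MISS_RATE = {1: 0.020, 2: 0.005, 4: 0.004}           # assumed, falls with associativity

def total_energy_nj(assoc):
    """Total energy: per-access energy for every access plus a penalty per miss."""
    misses = ACCESSES * MISS_RATE[assoc]
    return ACCESSES * ENERGY_PER_ACCESS_NJ[assoc] + misses * MISS_ENERGY_NJ

best_performance = min(MISS_RATE, key=MISS_RATE.get)          # lowest miss rate
lowest_energy = min(ENERGY_PER_ACCESS_NJ, key=total_energy_nj)

for assoc in (1, 2, 4):
    print(f"{assoc}-way: miss rate {MISS_RATE[assoc]:.1%}, "
          f"total energy {total_energy_nj(assoc):,.0f} nJ")
print("best performing:", best_performance, "way; lowest energy:", lowest_energy, "way")
```

With these assumed values the four-way cache has the lowest miss rate while the direct-mapped cache uses the least energy. A program with a much higher direct-mapped miss rate would tip the balance the other way, which is why the best configuration depends on the program.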
So What’s the Best Cache?

                         Instruct. Cache        Data Cache
Processor                Size    As.  Line      Size    As.  Line
AMD-K6-IIIE              32K     2    32        32K     2    32
Alchemy AU1000           16K     4    32        16K     4    32
ARM 7                    8K/U    4    16        8K/U    4    16
ColdFire                 0-32K   DM   16        0-32K   N/A  N/A
Hitachi SH7750S (SH4)    8K      DM   32        16K     DM   32
Hitachi SH7727           16K/U   4    16        16K/U   4    16
IBM PPC 750CX            32K     8    32        32K     8    32
IBM PPC 7603             16K     4    32        16K     4    32
IBM750FX                 32K     8    32        32K     8    32
IBM403GCX                16K     2    16        8K      2    16
IBM Power PC 405CR       16K     2    32        8K      2    32
Intel 960JA              2K      2    N/A       1K      2    N/A
Intel 960JD              4K      2    N/A       2K      2    N/A
Intel 960IT              16K     2    N/A       4K      2    N/A
Motorola MPC8240         16K     4    32        16K     4    32
Motorola MPC8540         32K     4    32/64     32K     4    32/64
Motorola MPC7455         32K     8    32        32K     8    32
NEC VR5500               32K     2    32        32K     2    32
NEC VR4131               16K     2    16/32     16K     2    16/32
NEC VR4181               4K      DM   16        4K      DM   16
NEC VR4181A              8K      DM   32        8K      DM   32
NEC VR4121               16      DM   16        8K      DM   16
PMC Sierra RM9000X2      16K     4    N/A       16K     4    N/A
PMC Sierra RM7000A       16K     4    32        16K     4    32
SandCraft sr71000        32K     4    32        32K     4    32
Sun Ultra SPARC IIe      16K     2    N/A       16K     DM   N/A
SuperH                   32K     4    32        32K     4    32
TI TMS320C6414           16K     DM   N/A       16K     2    N/A
TriMedia TM32A           32K     8    64        16K     8    64
Xilinx Virtex IIPro      16K     2    32        8K      2    32

(As. = associativity; DM = direct mapped)

* Looking at popular embedded processors, there’s obviously no standard cache
* Dilemma
  • Direct mapped: good performance and energy for most programs
  • Four-way: good performance for all programs, but at cost of higher power per access for all programs
  • Do we design for the average case or the worst case?
Solution to the Dilemma
* Configurable cache
  • Can be configured as four-way, two-way, or one-way
  • Ways can be concatenated
  • Furthermore, ways can even be shut down to decrease total size
[Figure: Memory and cache, with the cache shown as four-way, then two-way, then one-way (direct mapped)]
Configurable Cache Design: Way Concatenation
* Small area and performance overhead
[Figure: Way-concatenation circuit. The address (a31 to a0) is divided into tag address, index, and line offset. Address bits a11 and a12, together with registers reg0 and reg1, feed a configuration circuit that produces signals c0 to c3 for the tag part and the data array (blocks labeled 6x64, bitlines, column mux, sense amps, mux driver, data output). The critical path is marked.]
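The idea of way concatenation can be sketched in software: the configuration decides how many ways are probed on each access, and address bits a11 and a12 select among the ways when fewer than four are probed. The sketch below is a behavioral illustration, not the circuit from the slide. It assumes a 5-bit line offset (a4 to a0) and a 6-bit index (a10 to a5), read from the figure's address labels, and the mapping from a11/a12 to the enable signals c0 to c3 is hypothetical.

```python
OFFSET_BITS = 5   # a4..a0, assumed from the figure's address labels
INDEX_BITS = 6    # a10..a5, assumed from the figure's address labels

def bit(addr, n):
    """Return bit n of the address."""
    return (addr >> n) & 1

def set_index(addr):
    """Index into a single way (same in every configuration)."""
    return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def way_enables(addr, ways):
    """Return (c0, c1, c2, c3): which ways are probed for this access.

    The assignment of a11/a12 values to particular ways is a hypothetical choice.
    """
    a11, a12 = bit(addr, 11), bit(addr, 12)
    if ways == 4:                      # four-way: probe every way
        return (1, 1, 1, 1)
    if ways == 2:                      # two-way: a11 picks one pair of ways
        return (1, 1, 0, 0) if a11 == 0 else (0, 0, 1, 1)
    if ways == 1:                      # one-way: a12 and a11 pick a single way
        selected = (a12 << 1) | a11
        return tuple(int(i == selected) for i in range(4))
    raise ValueError("ways must be 1, 2, or 4")

for ways in (4, 2, 1):
    print(ways, "way:", way_enables(0x1800, ways), "index", set_index(0x1800))
```

In every configuration all four ways still hold data, so total capacity is unchanged; what changes is how many ways are activated per access, which is where the energy difference comes from.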
Configurable Cache Experiments
[Figure: Energy normalized to a 4-way conventional cache (100%) for CnvI1D1, cnct, shut, and both, on benchmarks padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, jpeg, mpeg2, pegwit, g721, art, mcf, parser, vpr, and Average; bar labels 116%, 268%, and 114%]
* Configurable cache with both way concatenation and way shutdown is superior on every benchmark
* Considered Powerstone, MediaBench, and Spec2000
* Tuning the cache to the program is important
Work submitted to High-Performance Computer Architectures 2003, Zhang, Vahid and Najjar
Conclusions
* Trend is away from semi-custom IC fabrication
  • ICs are big enough; other pressures encourage buying pre-fabricated platforms
* Platforms must be highly configurable
  • To be useful for a variety of applications, and hence mass produced
* We have discussed
  • Software speedup/energy benefits of on-chip configurable logic: 3x speedups with only ~10,000 gates
  • Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
* Current/future work
  • Automatically partitioning software loops to configurable logic (collaborators: Walid Najjar UCR, Nik Dutt UCI)
    • Several approaches: platform-assisted, and dynamically on-chip
    • Work being done by Roman Lysecky, Susan Cotterell, Greg Stitt, and Shawn Nematbakhsh at UCR
  • Automatically tuning a configurable cache
    • Ann Gordon-Ross at UCR