System-on-a-Chip Platform Tuning for Embedded Systems

Frank Vahid
Associate Professor
Dept. of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid

This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend

How Much is Enough?
* Perhaps a bit small
* Reasonably sized
* Probably plenty big
* More than typically necessary
* Very few people could use this

How Much is Enough for an IC?
* 1993: ~1 million logic transistors (perhaps a bit small)
* 1996: ~5-8 million logic transistors (reasonably sized)
* 1999: ~10-50 million logic transistors (probably plenty big)
* 2002: ~100-200 million logic transistors (more than typically necessary)
* 2008: >1 BILLION logic transistors (perhaps very few people could design this)
* Point of diminishing returns
  • 8-bit uC: ~15K
  • 32-bit ARM: ~30K
  • MPEG dcd: ~1M
  • 100M good enough for audio/video/etc.?
* Other examples
  • Fast cars (> 100 mph)
  • High res digital cameras (> 4M)
  • Disk space
  • Even IC performance

Very Few Companies Can Design High-End ICs
[Figure: Design productivity gap. Axes: logic transistors per chip (in millions) and productivity (K trans./staff-mo.). Curves: IC capacity and productivity, with the gap between them marked. Source: ITRS'99]
* Designer productivity growing at slower rate
  • 1981: 100 designer months, ~$1M
  • 2002: 30,000 designer months, ~$300M

Meanwhile, ICs Themselves are Costlier
Tech (micron):  0.8      0.35     0.18     0.13
NRE:            $40k     $100k    $350k    $1,000k
Turnaround:     42 days  49 days  56 days  76 days
Market:         $3.5B    $6B      $12B     $18B
Source: DAC'01 panel on embedded programmable logic
* And take longer to fabricate
* While market windows are shrinking
* Less than 1,000 out of 10,000 ASIC designs have volumes to justify fabrication in 0.13 micron

Summarizing So Far...
* Transistors are less scarce
  • ICs are big enough, fast enough
* ICs take more time and money to design and fabricate
  • While market windows are shrinking
* Designers buy pre-fabricated system-level ICs: platforms

Trend Towards Pre-Fabricated Platforms: ASSPs
* ASSP: application specific standard product
  • Domain-specific prefabricated IC
  • e.g., digital camera IC
* ASIC: application specific IC
* ASSP revenue > ASIC
* ASSP design starts > ASIC
  • Design start: unique IC design
  • Ignores quantity of same IC
* ASIC design starts decreasing
  • Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest September'01

A Sample Pre-Fabricated Platform
[Figure: pre-fabricated platform IC; labels: uP, L1 cache, L2 cache, DSP, JPEG dcd, peripherals, FPGA]
* Must be programmable for use in variety of products
  • Ideally also configurable
* Means high volume
  • Platform designer's investment pays off
  • Cost per IC is reasonable
* Use additional (readily available) transistors for high configurability
* Our research focus
  • Design and use of highly configurable platforms

Commercial Highly-Configurable Platform Type: Single-Chip Microprocessor/FPGA Platforms
* Triscend E5
  • Based on 8-bit 8051 CISC core
  • 10 Dhrystone MIPS at 40 MHz
  • 60 kbytes on-chip RAM
  • Up to 40K logic gates
  • Cost only about $4 (in volume)
  [Figure: Triscend E5 chip; labels: configurable logic, 8051 processor plus other peripherals, memory]
* Atmel FPSLIC (Field-Programmable System-Level IC)
  • Based on AVR 8-bit RISC core
  • 20 Dhrystone MIPS
  • 5k-40k configurable logic gates
  • On-chip RAM (20-36Kb) and EEPROM
  • $5-$10
  Courtesy of Atmel
* Triscend A7
  • Based on ARM7 32-bit RISC processor
  • 54 Dhrystone MIPS at 60 MHz
  • Up to 40k logic gates
  • On-chip cache and RAM
  • $10-$20 in volume
  [Figure: Triscend A7 chip. Courtesy of Triscend]
* Altera's Excalibur EPXA 10
  • ARM (922T) hard core
  • ~200 Dhrystone MIPS at ~200 MHz
  • Devices range from ~200k to ~2 million programmable logic gates
  Source: www.altera.com
* Xilinx Virtex II Pro
  • PowerPC based
  • 420 Dhrystone MIPS at 300 MHz
  • 1 to 4 PowerPCs
  • 4 to 16 gigabit transceivers
  • 12 to 216 multipliers
  • 3,000 to 50,000 logic cells
  • 200k to 4M bits RAM
  • 204 to 852 I/O
  • $100-$500 (>25,000 units)
  [Figure: Virtex II Pro chip; labels: config. logic, PowerPCs, up to 16 serial transceivers, 622 Mbps to 3.125 Gbps. Courtesy of Xilinx]

Single-Chip Microprocessor/FPGA Platforms
* Why wouldn't future microprocessor chips include some amount of on-chip FPGA?
* Lots of silicon area taken up by configurable logic
  • As discussed earlier, less of an issue every year
  • Smaller area doesn't necessarily mean higher yield (lower costs) any more
    • Previously could pack more die onto a wafer
    • But die are becoming pad (pin) limited in nanoscale technologies
* Configurable logic typically used for peripherals, glue logic, etc.
  • We have investigated another use...

Software Improvements using On-Chip Configurable Logic
* Partitioned software critical loops onto on-chip FPGA for several benchmarks
* Performed physical measurements on Triscend A7 and E5 devices

A7 results
Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E sav
PS_g3fax   11.47      7.44        1.5  1.320   1.332    15.140  9.910    35%
PS_crc     10.92      4.51        2.4  1.320   1.320    14.414  5.953    59%
PS_brev    9.84       3.28        3.0  1.332   1.344    13.107  4.408    66%
Average                           2.3                                    53%

E5 results
Benchmark  Time_orig  Time_sw/hw  Sp.  P_orig  P_sw/hw  E_orig  E_sw/hw  E sav
PS_g3fax   15.16      7.11        2.1  0.252   0.270    3.820   1.920    50%
PS_crc     10.64      4.64        2.3  0.207   0.225    2.202   1.044    53%
PS_brev    17.81      1.81        9.8  0.252   0.270    4.488   0.489    89%
Average                           4.8                                    64%

[Figure: Triscend A7 development board, with the A7 IC marked]
Work done by Greg Stitt, Brian Grattan, Shawn Nematbaktsh at UCR

Software Improvements using On-Chip Configurable Logic
* Extensive simulated results for 8051 and MIPS
  • (Physical measurement very time consuming)
* For Powerstone (PS), MediaBench (MB) and Netbench (NB)

Example    Archit  Cycles orig    Cycles sw      Cycles hw   Clk hw  Sp.   P_sw  P_hw   E_orig  E_sw/hw  E sav  Area
PS_g3fax   8051    19,675,456     10,812,544     176,562     25      2.2   0.05  0.032  0.1142  0.05408  53%    2,858
PS_crc     8051    291,196        180,224        7,168       25      2.5   0.05  0.028  0.0017  0.00071  58%    770
PS_summin  8051    109,821,892    20,394,080     384,416     25      1.2   0.05  0.033  0.6376  0.53657  16%    4,191
PS_brev    8051    330,064        305,768        1,360       25      12.9  0.05  0.034  0.0019  0.00015  92%    3,961
PS_matmul  8051    119,420        101,576        2,560       25      5.9   0.05  0.035  0.0007  0.00012  82%    5,882
PS_g3fax   MIPS    15,600,000     4,720,000      599,000     100     1.4   0.07  0.111  0.0265  0.02163  18%    2,858
PS_adpcm   MIPS    113,000        29,300         5,440       100     1.3   0.07  0.181  0.0002  0.00018  6%     8,075
PS_crc     MIPS    5,040,000      3,480,000      460,800     100     2.5   0.07  0.061  0.0086  0.00379  56%    770
PS_des     MIPS    142,000        70,700         15,100      100     1.6   0.07  0.197  0.0002  0.00019  20%    9,031
PS_engine  MIPS    915,000        145,000        28,100      100     1.1   0.07  0.082  0.0016  0.00146  6%     2,074
PS_jpeg    MIPS    7,900,000      646,000        171,000     100     1.1   0.07  0.092  0.0134  0.01360  -1%    3,161
PS_summin  MIPS    2,920,000      1,270,000      266,000     100     1.5   0.07  0.111  0.0050  0.00375  24%    4,191
PS_v42     MIPS    3,850,000      846,000        216,000     100     1.2   0.07  0.102  0.0065  0.00605  7%     3,319
PS_brev    MIPS    3,566          2,499          138         100     3.0   0.07  0.107  0.0000  0.00000  62%    3,961
MB_g721    MIPS    838,230,002    457,674,179    9,985,261   100     2.1   0.07  0.152  1.4250  0.75035  47%    5,811
MB_adpcm   MIPS    32,894,094     32,866,110     1,183,260   42      11.6  0.07  0.130  0.0559  0.00821  85%    14,132
MB_pegwit  MIPS    42,752,919     33,276,287     2,167,651   50      3.1   0.07  0.170  0.0727  0.03241  55%    18,150
NB_dh      MIPS    1,793,032,157  1,349,063,192  45,156,767  69      3.5   0.07  0.121  3.0482  1.00547  67%    21,383
NB_md5     MIPS    5,374,034      3,046,881      289,877     47      1.8   0.07  0.251  0.0091  0.00722  21%    90,074
NB_tl      MIPS    57,412,470     29,244,221     2,479,552   58      1.8   0.07  0.059  0.0976  0.05930  39%    5,478
Average                                                              3.2                                 34%    10,507

* Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)

Speedup Gained with Relatively Few Gates
* Created several partitioned versions of each benchmark
* Most speedup gained with first 20,000 gates; diminishing returns after that
  • Surprisingly few gates
[Figure: speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721(MB), ADPCM(MB), PEGWIT(MB), DH(NB), MD5(NB), TL(NB), URL(NB); annotations: "27." (twice) and "2.05 at 90,000"]

Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
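
The speedup and energy-savings columns of the A7 results table follow from the measured times and powers through Energy = Power * Time. A minimal Python sketch of that arithmetic, using the three A7 rows from the slide; the class and function names are illustrative assumptions, and units are left out because the slide does not state them:

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One benchmark row: time and power before and after partitioning."""
    name: str
    time_orig: float
    time_swhw: float
    power_orig: float
    power_swhw: float

    def speedup(self) -> float:
        return self.time_orig / self.time_swhw

    def energy_savings(self) -> float:
        # Energy = Power * Time, applied to both versions.
        energy_orig = self.power_orig * self.time_orig
        energy_swhw = self.power_swhw * self.time_swhw
        return 1.0 - energy_swhw / energy_orig


# Values copied from the A7 results table in the slides.
A7_RESULTS = [
    Measurement("PS_g3fax", 11.47, 7.44, 1.320, 1.332),
    Measurement("PS_crc", 10.92, 4.51, 1.320, 1.320),
    Measurement("PS_brev", 9.84, 3.28, 1.332, 1.344),
]


def averages(rows):
    """Return (mean speedup, mean energy savings) over the rows."""
    n = len(rows)
    mean_speedup = sum(r.speedup() for r in rows) / n
    mean_savings = sum(r.energy_savings() for r in rows) / n
    return mean_speedup, mean_savings


if __name__ == "__main__":
    for r in A7_RESULTS:
        print(f"{r.name}: speedup {r.speedup():.1f}, "
              f"energy savings {r.energy_savings():.0%}")
    mean_sp, mean_sav = averages(A7_RESULTS)
    print(f"Average: speedup {mean_sp:.1f}, energy savings {mean_sav:.0%}")
```

In these rows the power of the sw/hw version is equal to or slightly higher than the original, yet energy still falls because execution time drops by a larger factor.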

Other Types of Configurability
* Microprocessor (other researchers)
  • VLIW configurations
  • Voltage scaling
* Memory hierarchy
  • Our focus: build a highly-configurable cache that can be tuned to a particular program
  • Work by Chaunjun Zhang, along with Walid Najjar, at UCR

Cache Contributes Much to Performance and Power
* Well-known for performance
* Energy
  • ARM920T: caches consume nearly half of total power (Segars 01)
  • M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
[Figure: ARM920T; labels: processor, L1 cache, mem. Source: Segars ISSCC'01]

Associativity Plays a Big Role
* Reduces miss rate, thus improving performance
* Impact on power and energy?
  • (Energy = Power * Time)
[Figure: miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]

Associativity is Costly
* Associativity improves hit rate, but at the cost of more power per access
* Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: energy per access (nJ, 0.0 to 1.0) versus associativity (1-way, 2-way, 4-way) for 8 Kbyte cache]
[Figure: energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only); components: data output driver, decode_data, mux driver, comparator, sa_tag, bitline_tag, wordline_data, bitline_data, wordline_tag, decode_tag, sa_data]

Associativity and Energy
* Best performing cache is not always lowest energy
[Figure: miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
[Figure: normalized energy (0.0 to 1.0) versus associativity (1, 2, 4) for epic and mpeg2; annotation: "Significantly poorer energy"]

So What's the Best Cache?

                         Instruct. Cache          Data Cache
Processor                Size    As.  Line        Size    As.  Line
AMD-K6-IIIE              32K     2    32          32K     2    32
Alchemy AU1000           16K     4    32          16K     4    32
ARM 7                    8K/U    4    16          8K/U    4    16
ColdFire                 0-32K   DM   16          0-32K   N/A  N/A
Hitachi SH7750S (SH4)    8K      DM   32          16K     DM   32
Hitachi SH7727           16K/U   4    16          16K/U   4    16
IBM PPC 750CX            32K     8    32          32K     8    32
IBM PPC 7603             16K     4    32          16K     4    32
IBM750FX                 32K     8    32          32K     8    32
IBM403GCX                16K     2    16          8K      2    16
IBM Power PC 405CR       16K     2    32          8K      2    32
Intel 960JA              2K      2    N/A         1K      2    N/A
Intel 960JD              4K      2    N/A         2K      2    N/A
Intel 960IT              16K     2    N/A         4K      2    N/A
Motorola MPC8240         16K     4    32          16K     4    32
Motorola MPC8540         32K     4    32/64       32K     4    32/64
Motorola MPC7455         32K     8    32          32K     8    32
NEC VR5500               32K     2    32          32K     2    32
NEC VR4131               16K     2    16/32       16K     2    16/32
NEC VR4181               4K      DM   16          4K      DM   16
NEC VR4181A              8K      DM   32          8K      DM   32
NEC VR4121               16      DM   16          8K      DM   16
PMC Sierra RM9000X2      16K     4    N/A         16K     4    N/A
PMC Sierra RM7000A       16K     4    32          16K     4    32
SandCraft sr71000        32K     4    32          32K     4    32
Sun Ultra SPARC Iie      16K     2    N/A         16K     DM   N/A
SuperH                   32K     4    32          32K     4    32
TI TMS320C6414           16K     DM   N/A         16K     2    N/A
TriMedia TM32A           32K     8    64          16K     8    64
Xilinx Virtex IIPro      16K     2    32          8K      2    32
(As. = associativity; DM = direct mapped)

* Looking at popular embedded processors, there's obviously no standard cache
* Dilemma
  • Direct mapped: good performance and energy for most programs
  • Four-way: good performance for all programs, but at cost of higher power per access for all programs
  • Do we design for the average case or the worst case?
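
The dilemma above can be made concrete with a very simple energy model: each access costs a per-access energy that grows with associativity, and each miss adds a fixed energy penalty. The Python sketch below is only an illustration under that assumed model; it is not the model used in the talk, and every number in it is hypothetical rather than read from the slides' figures.

```python
def energy_per_access(access_energy_nj, miss_rate, miss_energy_nj):
    """Average energy per cache access, including the cost of misses."""
    return access_energy_nj + miss_rate * miss_energy_nj


def best_associativity(access_energy_nj, miss_rates, miss_energy_nj):
    """Return the associativity with the lowest average energy per access."""
    return min(
        miss_rates,
        key=lambda ways: energy_per_access(
            access_energy_nj[ways], miss_rates[ways], miss_energy_nj
        ),
    )


# Hypothetical inputs, NOT taken from the slides.
ACCESS_ENERGY_NJ = {1: 0.4, 2: 0.6, 4: 0.9}   # energy per access, by ways
MISS_ENERGY_NJ = 20.0                          # assumed energy cost of a miss

# Hypothetical miss rates, by ways, for two imaginary programs.
PROGRAM_A = {1: 0.020, 2: 0.004, 4: 0.003}     # benefits from associativity
PROGRAM_B = {1: 0.005, 2: 0.004, 4: 0.004}     # barely benefits

if __name__ == "__main__":
    for name, rates in (("A", PROGRAM_A), ("B", PROGRAM_B)):
        ways = best_associativity(ACCESS_ENERGY_NJ, rates, MISS_ENERGY_NJ)
        print(f"Program {name}: lowest-energy associativity is {ways}-way")
```

Under these made-up inputs the two programs prefer different associativities, which is the situation a fixed cache cannot serve well.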

Solution to the Dilemma
* Configurable cache
  • Can be configured as four way, two way, or one way
    • Ways can be concatenated
  • Furthermore, ways can even be shut down to decrease total size
[Figure: memory and cache; labels: direct mapped cache, four-way, now two-way, now one-way]

Configurable Cache Design: Way Concatenation
[Figure: way-concatenation circuit. Labels: address bits a31 to a13 (tag address), a12, a11, a10 to a5 (index), a4 to a0 (line offset); configuration circuit with reg0, reg1 and signals c0 to c3; 6x64 decoders; tag part; data array; bitline; column mux; sense amps; mux driver; data output; critical path]
* Small area and performance overhead

Configurable Cache Experiments
[Figure: normalized energy per benchmark (100% = 4-way conventional cache) for CnvI1D1, cnct, shut and both; annotated values: 116%, 268%, 114%. Benchmarks: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, jpeg, mpeg2, pegwit, g721, art, mcf, parser, vpr, Average]
* Configurable cache with both way concatenation and way shutdown is superior on every benchmark
* Considered Powerstone, MediaBench, and Spec2000
* Tuning the cache to the program is important
Work submitted to High-Performance Computer Architectures 2003, Zhang, Vahid and Najjar

Conclusions
* Trend is away from semi-custom IC fabrication
  • Big enough; other pressures encourage buying pre-fabricated platforms
* Platforms must be highly configurable
  • To be useful for a variety of applications, and hence mass produced
* We have discussed
  • Software speedup/energy benefits of on-chip configurable logic: 3x speedups with only ~10,000 gates
  • Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
* Current/future work
  • Automatically partitioning software loops to configurable logic (collaborators: Walid Najjar UCR, Nik Dutt UCI)
    • Several approaches: platform-assisted, and dynamically on-chip
    • Work being done by Roman Lysecky, Susan Cotterell, Greg Stitt, and Shawn Nematbaktsh at UCR
  • Automatically tuning a configurable cache
    • Ann Gordon-Ross at UCR
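
As a rough illustration of the way-concatenation idea from the Configurable Cache Design slide, the Python sketch below shows how an address could split into tag, index and line offset in four-way, two-way and one-way modes. It assumes an 8 Kbyte cache with 32-byte lines (consistent with the a4 to a0 line offset and a10 to a5 index labels, but not stated on that slide), and it computes the tag in the conventional way; the actual circuit may handle tag bits differently. All names are hypothetical.

```python
CACHE_BYTES = 8 * 1024   # assumed total cache size
LINE_BYTES = 32          # assumed line size (offset bits a4..a0)


def split_address(address, ways):
    """Split an address into (tag, index, offset) for a given associativity.

    Fewer ways means more sets, so address bits a11 and a12 move from the
    tag into the index as ways are concatenated.
    """
    if ways not in (1, 2, 4):
        raise ValueError("ways must be 1, 2 or 4")
    offset_bits = LINE_BYTES.bit_length() - 1    # 5 bits for 32-byte lines
    sets = CACHE_BYTES // (ways * LINE_BYTES)    # 64, 128 or 256 sets
    index_bits = sets.bit_length() - 1           # 6, 7 or 8 bits
    offset = address & (LINE_BYTES - 1)
    index = (address >> offset_bits) & (sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    address = 0x1ABC
    for ways in (4, 2, 1):
        tag, index, offset = split_address(address, ways)
        print(f"{ways}-way: tag={tag:#x} index={index} offset={offset}")
```

The total capacity stays the same in all three modes; only the number of ways searched per access changes, which is where the per-access energy difference comes from.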