Download PDF

A Low-Power 0.7-V H.264 720p Video Decoder D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken, A.P. Chandrakasan A-SSCC 2008 Outline • Motivation for low-power video decoders • Low-power techniques – pipelining and parallelism – independent voltage/clock domains – efficient memory accessing • ASIC results • Comparison with state of the art • Summary Motivation • High demand for video capture and playback on mobile devices iPhone • H.264 state of the art video coding standard • Goal: Ultra Low Power H.264 decoder in 65nm – 1280x720 @ 30fps DVC Digital Camera PSP H.264 Decoder Architecture Bitstream Input LUMA SHARED ED FIFO MVS MVS COEFFS MODES OFF-CHIP MVS MEM MC MODES CHROMA PARALLEL INTRA MUX + DB INTRA MUX + DB IT MC MEM MVS YUV->RGB (FPGA) FRAME BUFFER (ZBT SRAM) Pipelined, highly parallel architecture to reduce voltage (and power) Pipeline FIFO Sizing Normalized System Throughput 1.5 1.45 1.4 1.35 1.3 1.25 1.2 1.15 1.1 1.05 1 1 2 4 8 16 32 256 FIFO Depth Sizes on this chip • Pipeline stages have variable latencies – ex: ED latency is 0-33 cycles per 4x4 block • Larger FIFOs help average out workload – increase performance by up to 45% – FIFOs of depths 1-4 chosen to reduce area Deblocking Filter Parallelism 4 edges in parallel • Process entire 4x4 edge (4 filters) in parallel • Filter luma and chroma in parallel • 192 cycles reduced to ~ 46 cycles per 16x16 block Deblocking Filter Architecture Datapath Control Last Line Cache 104kb [SRAM] 4x4 4x4 4 PARALLEL FILTERS Datapath Control Filters (bS=1 to 3) threshold calc clip PIN 4x4 Boundary Strength (bS) Datapath Control << 4x1 POUT QIN 4x4 4x4 4x1 Filters (bS=4) threshold ...... << Block IN Block OUT QOUT ...... >> Datapath Control (bS=0) 4x4 4x4 Internal Memory 4x(4x4x8b) [DFF] 4x4 Datapath Control Motion Compensation (MC) 4x4 block in current frame Reference block Vector (1, -1) Reference block Vector (0.5, -0.5) • Use two interpolators in parallel • Interpolate luma and chroma in parallel • 176 reduced to ~ 72 cycles per 16x16 block Parallel MC Interpolators column0 Frame Buffer (FB) column1 MC0 6:1 6:1 6:1 Entropy Decoding (ED) MV0 MV1 MC1 MC0 0 1 4 5 MC1 2 3 6 7 MC0 8 9 12 13 10 11 14 15 MC1 6:1 6:1 x9 4x4 • Interpolators can run in same cycle when… – motion vectors are all available – memory interface supplies 2 columns per cycle • Interpolators are synchronized – MC0: even 4x4 rows, MC1: odd 4x4 rows – shared interpolation data reused 16x16 Clock & Voltage Domains 20 18 16 14 12 10 8 6 4 2 0 Vlow Vhigh Average Cycles / block u _l M E M a m M EM DQ a m ro ED Memory Controller Vhigh Vlow a m o hr _c a om a h a m c u _ l hr um _ B l c D C_ DB C_ M M IT Core Domain Core Domain DQ Level Shifters Memory Controller Mem DFF Array QD FIFO LOGIC DQ Dual-Clock FIFO CLKslow CLKfast • Decouple voltage / clock domains – lower core voltage and frequency – 25% power savings vs. single domain • 4% further savings if we used 3 domains – dual-clock FIFOs and level-shifters link domains Workload Variation Core Domain [ MHz @ V ] Memory Controller [ MHz @ V ] P-frame 14 @ 0.70 50 @ 0.84 I-frame 53 @ 0.90 25 @ 0.76 Relative Power @720p 1 I-frame, 14 P-frames [%] MAX 53 @ 0.90 50 @ 0.84 100 DVFS 14 @ 0.70 53 @ 0.90 25 @ 0.76 50 @ 0.84 73 FA 17 @ 0.72 48 @ 0.83 73 Power No DVFS Perfect DVFS ΔP Workload 100% • INTER-INTRA workload variation – MAX: maximum frequency on each domain – DVFS: 1 frame every 33ms – Frame Averaging (FA): 15 frames every 15 * 33ms • switches less often than DVFS, but needs output buffer MC Data Overlap Horizontal Current Neighbors 4x4 block Vertical Neighbors Interpolation Area Overlapped Interpolation Area – Neighboring 4x4 with same MVs – Overlap area shared horizontally and vertically – Reduced MC read bandwidth Last-line On-chip Caches TopLeft Top Left 16x16 block TopRight Deblocking Cache Size [kb] 120 807 Mbps 100 80 60 40 122 Mbps 20 0 26 Mbps DB INTRA MC 21 Mbps ED MA B C D E F GH I J K L Intra prediction Off-chip Bandwidth 26% P-frame off-chip BW [ Gbps ] 1 2 19% 0 original caching cache & fewer reads • Frame buffer off-chip (1.4 MByte per frame) • P-frames more common than I-frames – P-frame off-chip BW larger due to MC • 40% (0.9 Gbps) total reduction – last-line caches – avoided redundant reads in MC Voltage Scalable SRAM 8T SRAM Cell Write Assist to improve writability at low voltages Extra 2 Tx ensures read stability at low voltages DOUT • Low voltage SRAM needed – Typical 6T SRAMs fail at low voltages – 8T SRAMs work down to 0.5V RDBL snsRef snsEn Pseudo-differential sense amplifier with global snsRef H.264 Decoder ASIC 3.3 mm 176 I/O PADS DECODER STATISTICS 3.3 mm CACHES CORE DOMAIN MEMORY CONTROLLER DOMAIN Area (w/o pads) Area Utilization Technology I/O Pads On-chip SRAM : : : : : 2.76 x 2.76 mm2 31 % 65-nm 176 17kB Area Breakdown Logic 25% • Cache area 3x larger than logic • Standard Cells: 134k Caches 75% • Parallelism Overhead – 1.5% of active chip area – 4 luma + 2 chroma filters: – 1.5% of DB – 2 luma + 4 chroma interpolators: – 9% of MC MISC 1.4% ED 1.6% IT 8.2% INTRA 15.6% DB 56% MC 16.6% Power Measurements 720p Video mobcal shields parkrun Input Bitrate [ Mbps ] 5.4 7.0 26 Core [ MHz @ V ] 14 @ 0.70 14 @ 0.70 25 @ 0.80 MEM [ MHz @ V ] 50 @ 0.84 50 @ 0.84 50 @ 0.84 Power [ mW ] 1.8 10 9 Distribution 8 7 Across 6 15 Dies 5 4 3 2 1 0 1.8 0.69 0.70 0.71 0.72 Minimum Core Vdd @ 720p 3.2 Power Breakdown Pipeline control & FIFOs 19% ED 3% IDCT 1% MEM write 7% INTRA 1% Motion Vector predictor Interpolators 5% 20% MC 42% MEM Read 75% DB 26% P-Frame • P-frame power dominated by: – MC (frame buffer reads) – deblocking filter Survey of Other Decoders Resolution 15fps QCIF 1W CIF [work] - process, profile Core Domain Memory Cntl 0.85 V 1.15 V [3] - 130-nm, Baseline 100 mW Power 30fps D1 720p 1080p 0.70 V 0.85 V [4] - 180-nm, Baseline 10 mW 0.55 V 0.68 V [5] - 180-nm, Baseline 1 mW [6] - 180-nm, Main 0.66 V 0.74 V 0.1 mW 0.5 V 0.5 V 0.01 mW 0.1 1 This work - 65-nm, Baseline 10 Mpixels/s 100 Summary • Pipeline and parallelism – Concurrency allows 14MHz @ 720p – Parallelism: luma DB = 4x, luma MC = 2x • Separate voltage/clock domains – 25% P-frame power savings – DVFS on each domain for I/P-frame differences • Efficient memory accesses – Low-voltage on-chip caches and data reuse – Off-chip BW lowered by 40% Acknowledgements • Funding: Nokia, TI, and NSERC • Chip fabrication: TI • Valuable feedback: – Nokia: J. Hicks, G. Raghavan, J. Ankcorn – TI: M. Budagavi, D. Buss, M. Zhou – MIT: Arvind, E. Fleming Video Demo

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PDF