Download PDF

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Pulse-width modulation wikipedia , lookup

Electrical substation wikipedia , lookup

Variable-frequency drive wikipedia , lookup

Buck converter wikipedia , lookup

Power engineering wikipedia , lookup

Rectiverter wikipedia , lookup

Stray voltage wikipedia , lookup

History of electric power transmission wikipedia , lookup

Immunity-aware programming wikipedia , lookup

Power electronics wikipedia , lookup

Distribution management system wikipedia , lookup

Magnetic-core memory wikipedia , lookup

Switched-mode power supply wikipedia , lookup

Voltage optimisation wikipedia , lookup

Alternating current wikipedia , lookup

Mains electricity wikipedia , lookup

Transcript
A Low-Power 0.7-V
H.264 720p Video Decoder
D. Finchelstein, V. Sze,
M.E. Sinangil, Y. Koken,
A.P. Chandrakasan
A-SSCC 2008
Outline
• Motivation for low-power video decoders
• Low-power techniques
– pipelining and parallelism
– independent voltage/clock domains
– efficient memory accessing
• ASIC results
• Comparison with state of the art
• Summary
Motivation
• High demand for video
capture and playback on
mobile devices
iPhone
• H.264 state of the art video
coding standard
• Goal: Ultra Low Power
H.264 decoder in 65nm
– 1280x720 @ 30fps
DVC
Digital
Camera
PSP
H.264 Decoder Architecture
Bitstream
Input
LUMA
SHARED
ED
FIFO
MVS
MVS
COEFFS
MODES
OFF-CHIP
MVS
MEM
MC
MODES
CHROMA
PARALLEL
INTRA
MUX
+
DB
INTRA
MUX
+
DB
IT
MC
MEM
MVS
YUV->RGB
(FPGA)
FRAME BUFFER
(ZBT SRAM)
Pipelined, highly parallel architecture to
reduce voltage (and power)
Pipeline FIFO Sizing
Normalized
System
Throughput
1.5
1.45
1.4
1.35
1.3
1.25
1.2
1.15
1.1
1.05
1
1
2
4
8
16
32
256
FIFO Depth
Sizes on this chip
• Pipeline stages have variable latencies
– ex: ED latency is 0-33 cycles per 4x4 block
• Larger FIFOs help average out workload
– increase performance by up to 45%
– FIFOs of depths 1-4 chosen to reduce area
Deblocking Filter Parallelism
4 edges in parallel
• Process entire 4x4 edge (4 filters) in parallel
• Filter luma and chroma in parallel
• 192 cycles reduced to ~ 46 cycles per 16x16 block
Deblocking Filter Architecture
Datapath
Control
Last Line Cache
104kb
[SRAM]
4x4
4x4
4 PARALLEL FILTERS
Datapath
Control
Filters (bS=1 to 3)
threshold
calc clip
PIN
4x4
Boundary
Strength (bS)
Datapath
Control
<<
4x1
POUT
QIN
4x4
4x4
4x1
Filters (bS=4)
threshold
......
<<
Block IN
Block OUT
QOUT
...... >>
Datapath
Control
(bS=0)
4x4
4x4
Internal Memory
4x(4x4x8b)
[DFF]
4x4
Datapath
Control
Motion Compensation (MC)
4x4 block in
current frame
Reference block
Vector (1, -1)
Reference block
Vector (0.5, -0.5)
• Use two interpolators in parallel
• Interpolate luma and chroma in parallel
• 176 reduced to ~ 72 cycles per 16x16 block
Parallel MC Interpolators
column0
Frame
Buffer
(FB)
column1
MC0
6:1
6:1
6:1
Entropy
Decoding
(ED)
MV0
MV1
MC1
MC0
0
1
4
5
MC1
2
3
6
7
MC0
8
9
12 13
10 11 14 15
MC1
6:1
6:1
x9
4x4
• Interpolators can run in same cycle when…
– motion vectors are all available
– memory interface supplies 2 columns per cycle
• Interpolators are synchronized
– MC0: even 4x4 rows, MC1: odd 4x4 rows
– shared interpolation data reused
16x16
Clock & Voltage Domains
20
18
16
14
12
10
8
6
4
2
0
Vlow Vhigh
Average Cycles / block
u
_l
M
E
M
a
m
M
EM
DQ
a
m
ro
ED
Memory Controller
Vhigh
Vlow
a
m
o
hr
_c
a
om
a
h
a
m
c
u
_
l
hr
um
_
B
l
c
D
C_
DB
C_
M
M
IT
Core Domain
Core
Domain
DQ
Level
Shifters
Memory
Controller
Mem
DFF
Array
QD
FIFO
LOGIC
DQ
Dual-Clock
FIFO
CLKslow
CLKfast
• Decouple voltage / clock domains
– lower core voltage and frequency
– 25% power savings vs. single domain
• 4% further savings if we used 3 domains
– dual-clock FIFOs and level-shifters link domains
Workload Variation
Core
Domain
[ MHz @ V ]
Memory
Controller
[ MHz @ V ]
P-frame
14 @ 0.70
50 @ 0.84
I-frame
53 @ 0.90
25 @ 0.76
Relative
Power
@720p
1 I-frame,
14 P-frames
[%]
MAX
53 @ 0.90
50 @ 0.84
100
DVFS
14 @ 0.70
53 @ 0.90
25 @ 0.76
50 @ 0.84
73
FA
17 @ 0.72
48 @ 0.83
73
Power
No
DVFS
Perfect
DVFS
ΔP
Workload 100%
• INTER-INTRA workload variation
– MAX: maximum frequency on each domain
– DVFS: 1 frame every 33ms
– Frame Averaging (FA): 15 frames every 15 * 33ms
• switches less often than DVFS, but needs output buffer
MC Data Overlap
Horizontal
Current
Neighbors
4x4 block
Vertical
Neighbors
Interpolation
Area
Overlapped
Interpolation
Area
– Neighboring 4x4 with same MVs
– Overlap area shared horizontally and vertically
– Reduced MC read bandwidth
Last-line On-chip Caches
TopLeft
Top
Left
16x16
block
TopRight
Deblocking
Cache Size [kb]
120
807 Mbps
100
80
60
40
122 Mbps
20
0
26 Mbps
DB
INTRA
MC
21 Mbps
ED
MA B C D E F GH
I
J
K
L
Intra prediction
Off-chip Bandwidth
26%
P-frame
off-chip BW
[ Gbps ] 1
2
19%
0
original
caching
cache &
fewer reads
• Frame buffer off-chip (1.4 MByte per frame)
• P-frames more common than I-frames
– P-frame off-chip BW larger due to MC
• 40% (0.9 Gbps) total reduction
– last-line caches
– avoided redundant reads in MC
Voltage Scalable SRAM
8T SRAM Cell
Write Assist to improve
writability at low voltages
Extra 2 Tx ensures read
stability at low voltages
DOUT
• Low voltage SRAM needed
– Typical 6T SRAMs fail at low
voltages
– 8T SRAMs work down to 0.5V
RDBL
snsRef
snsEn
Pseudo-differential
sense amplifier
with global snsRef
H.264 Decoder ASIC
3.3 mm
176 I/O PADS
DECODER STATISTICS
3.3 mm
CACHES
CORE
DOMAIN
MEMORY CONTROLLER
DOMAIN
Area (w/o pads)
Area Utilization
Technology
I/O Pads
On-chip SRAM
:
:
:
:
:
2.76 x 2.76 mm2
31 %
65-nm
176
17kB
Area Breakdown
Logic
25%
• Cache area 3x larger than logic
• Standard Cells: 134k
Caches
75%
• Parallelism Overhead
– 1.5% of active chip area
– 4 luma + 2 chroma filters:
– 1.5% of DB
– 2 luma + 4 chroma interpolators:
– 9% of MC
MISC
1.4%
ED
1.6%
IT
8.2%
INTRA
15.6%
DB
56%
MC
16.6%
Power Measurements
720p Video
mobcal
shields
parkrun
Input Bitrate [ Mbps ]
5.4
7.0
26
Core [ MHz @ V ]
14 @ 0.70 14 @ 0.70
25 @ 0.80
MEM [ MHz @ V ]
50 @ 0.84 50 @ 0.84
50 @ 0.84
Power [ mW ]
1.8
10
9
Distribution 8
7
Across
6
15 Dies
5
4
3
2
1
0
1.8
0.69 0.70 0.71 0.72
Minimum Core Vdd @ 720p
3.2
Power Breakdown
Pipeline
control & FIFOs
19%
ED
3%
IDCT
1%
MEM
write
7%
INTRA
1%
Motion Vector
predictor
Interpolators
5%
20%
MC
42%
MEM
Read 75%
DB
26%
P-Frame
• P-frame power dominated by:
– MC (frame buffer reads)
– deblocking filter
Survey of Other Decoders
Resolution
15fps
QCIF
1W
CIF
[work] - process, profile
Core Domain
Memory Cntl
0.85 V
1.15 V
[3] - 130-nm, Baseline
100 mW
Power
30fps
D1
720p 1080p
0.70 V
0.85 V
[4] - 180-nm, Baseline
10 mW
0.55 V
0.68 V
[5] - 180-nm, Baseline
1 mW
[6] - 180-nm, Main
0.66 V
0.74 V
0.1 mW
0.5 V
0.5 V
0.01 mW
0.1
1
This work - 65-nm, Baseline
10
Mpixels/s
100
Summary
• Pipeline and parallelism
– Concurrency allows 14MHz @ 720p
– Parallelism: luma DB = 4x, luma MC = 2x
• Separate voltage/clock domains
– 25% P-frame power savings
– DVFS on each domain for I/P-frame differences
• Efficient memory accesses
– Low-voltage on-chip caches and data reuse
– Off-chip BW lowered by 40%
Acknowledgements
• Funding: Nokia, TI, and NSERC
• Chip fabrication: TI
• Valuable feedback:
– Nokia: J. Hicks, G. Raghavan, J. Ankcorn
– TI: M. Budagavi, D. Buss, M. Zhou
– MIT: Arvind, E. Fleming
Video Demo