Multi-Core System on Chip Design Trends (continued). Presenter: 조준동

Low Power
System on Chip
Design
1
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation
2
Levels for Low Power Design
Techniques by level of abstraction:
• System: Hardware-software partitioning, Power down
• Algorithm: Complexity, Concurrency, Locality, Regularity, Data representation
• Architecture: Parallelism, Pipelining, Signal correlations, Instruction set selection, Data representation
• Circuit/Logic: Sizing, Logic Style, Logic Design
• Technology: Threshold Reduction, Scaling, Advanced packaging, SOI
Expected Saving:
• Algorithm: 10 - 100 times
• Architecture: 10 - 90%
• Logic Level: 20 - 40%
• Layout Level: 10 - 30%
• Device Level: 10 - 30%
3
Elements Required to Implement a High Performance System
• High Speed
• High Density
• Low Power per Gate
Enabled by: Reduced Swing Logic, Deep Submicron Technology, Channel Engineering, Low Voltage / Low VT, Advanced Technology, Low Capacitance
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation
5
A Look at Power Consumption
• Components of power consumption in a digital circuit:
P = α·f·C·V_DD^2 + I_leak·V_DD + Q_short-circuit·f·V_DD
α : Switching Activity
f : Frequency
C : Capacitance
V_DD : Supply Voltage
I_leak : Leakage Current
Q_short-circuit : Short-Circuit Charge
6
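The power equation above can be sketched numerically. Everything below (the operating-point values and function name) is an illustrative assumption, not from the slides:

```python
# Sketch: evaluate P = a*f*C*Vdd^2 + Ileak*Vdd + Qsc*f*Vdd from the slide.
# All numeric values are illustrative, not from the slides.

def total_power(a, f, C, Vdd, Ileak, Qsc):
    """Return (dynamic, leakage, short_circuit) power components in watts."""
    dynamic = a * f * C * Vdd ** 2      # switching (dynamic) power
    leakage = Ileak * Vdd               # static leakage power
    short_circuit = Qsc * f * Vdd       # short-circuit power
    return dynamic, leakage, short_circuit

# Example point: a=0.2, f=100 MHz, C=1 nF switched cap, Vdd=2.5 V
dyn, leak, sc = total_power(0.2, 100e6, 1e-9, 2.5, 1e-3, 1e-12)
print(dyn, leak, sc)   # dynamic power dominates at this operating point
```

Note how the quadratic V_DD term makes supply-voltage reduction the most effective lever, which is the theme of the following slides.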
Vdd, power, and current trend
[Figure: projected supply voltage (V), power per chip (W), and V_DD current (A) versus year, 1998-2014. Source: International Technology Roadmap for Semiconductors, 1998 update.]
7
Three Factors affecting Energy
– Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
– All-in-one approach (SOC): I/O pin and buffer reduction
– Voltage-reducible hardware
– 2-D pipelining (systolic arrays)
– SIMD / parallel processing: useful for data with parallel structure
– VLIW approach: flexible
8
Design Techniques for Reducing Power Consumption
• Adjusting the supply voltage
– Use a higher voltage only in the parts of the IC that need high speed.
– Put unused blocks into a sleep mode to cut their power consumption.
• Lowering the operating frequency
– Use parallel processing to keep the same throughput at a lower clock frequency; the resulting area increase is unavoidable.
– Avoid large clock buffers.
– Use a Phase-Locked Loop (PLL) to raise the frequency only where it is needed.
9
Design Techniques for Reducing Power Consumption
• Reducing parasitic capacitance
– Use short wires on critical nodes.
– Avoid fan-out greater than 3.
– Narrow the wires when a low supply voltage is used.
– Use the smallest transistors possible.
• Reducing switching activity
– Reduce the number of bits.
– Prefer static circuits to dynamic circuits.
– Reduce the total number of transistors.
– Make the most active nodes internal nodes.
10
Design Techniques for Reducing Power Consumption
• Reducing switching activity (continued)
– Design the logic so that the sum over all nodes of the frequency-capacitance product is minimized, i.e., so that the switching activity is statistically minimized:
sum_{i=1..n} f_i·C_i → min
f_i = mean switching frequency of node i
C_i = capacitance of node i
– When building a logic tree, place inputs with higher activity farther from V_DD or ground.
– Implement high-activity cells as dynamic logic and low-activity cells as static logic.
– Turn off the clock of flip-flops whose data do not change.
– Make it possible to disable the clock of cells that are not in use at all times.
11
Web browsing is slow with 802.11 PSM
[Cartoon]
Dad: "Son! Haven't I told you to turn on power-saving mode? Batteries don't grow on trees, you know!"
Son: "But dad! Performance SUCKS when I turn on power-saving mode!"
Dad: "So what! When I was your age, I walked 2 miles through the snow to fetch my Web pages!"
• Users complain about performance degradation
12
IBM's PowerPC
Lower Power Architecture
• Optimum supply voltage through hardware parallelism: pipelining and parallel instruction execution
– The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
– The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
– Low-power 3.3-volt design
• Small, simple instructions with a short instruction length
– IBM's PowerPC 603e is RISC
• Superscalar: CPI < 1
– The 603e issues as many as three instructions per cycle
• Low Power Management
– The 603e provides four software-controllable power-saving modes
• Copper process with SOI
• IBM's Blue Logic ASIC: the new design reduces power by a factor of 10
13
Power-Down Techniques
◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
14
Voltage vs Delay
• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for the slower operation, e.g., parallel processing and pipelining to increase concurrency and reduce the critical path.
15
Why Copper Processor?
• Motivation: aluminum increasingly resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: less power drawn from batteries
• Chip Size: 60% smaller than an aluminum chip
16
Silicon-on-Insulator
• How does SOI reduce capacitance?
Junction capacitance is eliminated because an insulating SOI layer (similar to glass) is placed between the transistor impurities and the silicon substrate.
→ high performance, low power, low soft-error rate
17
Clock Network Power Management
• 50% of the total power
• FIR (massively pipelined circuit): video processing (edge detection), voice processing (data transmission such as xDSL)
• Telephony: 50% (70%/30%) idle, since both parties do not talk at the same time
• With every clock cycle, data are loaded into the working register banks, even if there are no data changes
18
Partitioning
• Performance Requirements
– Some functions are easier to implement in hardware
– Blocks that are used repeatedly
– Blocks with a parallel structure
• Modifiability
– Blocks implemented in software are easy to modify
• Implementation Cost
– Blocks implemented in hardware can be shared
• Scheduling
– Schedule the blocks partitioned into HW and SW so that the given constraints are met
– SW operations must be scheduled sequentially
– SW and HW can be scheduled concurrently as long as there are no data or control dependencies
19
Low Power Partitioning Approach
Different HW resources are invoked according to the instruction executed at a specific point in time.
• During the execution of an add operation, the ALU and registers are used, but the multiplier is idle.
• Non-active resources still consume energy, since the corresponding circuits continue to switch.
• Calculate the wasted energy.
20
Design Flow
[Flow diagram] Application → divide application into clusters → list schedule → compute utilization rate (uP) / compute utilization rate (ASIC) → core energy estimation → select cluster → HW synthesis → evaluate.
Results: up to 94% energy saving, in most cases even reduced execution time; 16k-cell overhead.
21
Integrated H/W and S/W Low-Power Design Optimization
[Flow diagram] S/W side: S/W core energy estimation → clustering → SW energy-efficiency computation → system-level energy estimation → algorithm selection. H/W side: integrated HW/SW cluster scheduling → HW energy-efficiency computation → cluster selection → H/W synthesis and energy estimation.
Results: up to 94% energy saving, in most cases even reduced execution time; 16k-cell overhead.
22
Integrated H/W and S/W Design of an IS-95 CDMA Searcher (황인기, Sungkyunkwan University)
[Design-space exploration diagram] Starting from an all-software implementation (synchronous accumulator, asynchronous accumulator, comparators, with SW energy estimates), blocks are progressively moved to hardware (PN-code generation, synchronous accumulators 1 and 2, asynchronous accumulator, comparators with precomputation, with HW energy estimates), trading off cost (speed, area, power) until the goal is reached.
23
Low Power DSP
• DO-loop dominant workloads:
– VSELP vocoder: 83.4%
– 2-D 8x8 DCT: 98.3%
– LPC computation: 98.0%
• DO-loop power minimization ==> DSP power minimization
VSELP : Vector Sum Excited Linear Prediction
LPC : Linear Prediction Coding
24
Loop unrolling
• The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism, and improves register, data cache, or TLB locality.

for i = 2 to N - 1
  A(i) = A(i) + A(i - 1) * A(i + 1)

becomes

for i = 2 to N - 2 step 2
  A(i) = A(i) + A(i - 1) * A(i + 1)
  A(i + 1) = A(i + 1) + A(i) * A(i + 2)

Loop overhead is cut in half because two iterations are performed per pass. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
25
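The unrolled loop above must produce bit-identical results to the rolled one. A minimal sketch checking this for the slide's recurrence (0-extended Python lists stand in for the 1-based arrays; the leftover-iteration handling is an assumption for odd trip counts):

```python
# Sketch of the slide's unrolling example (unrolling factor u = 2).
# The recurrence A(i) = A(i) + A(i-1)*A(i+1) is from the slide; the
# list must extend one element past index N, as in the pseudocode.

def rolled(a, n):
    a = a[:]                       # work on a copy
    for i in range(2, n):          # i = 2 .. N-1
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def unrolled(a, n):
    a = a[:]
    i = 2
    while i <= n - 2:              # i = 2 .. N-2 step 2
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i == n - 1:                 # leftover iteration for odd trip counts
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # indices 0..9, N = 9
print(rolled(data, 9) == unrolled(data, 9))  # -> True
```

Both versions perform the same operations in the same order, so the transformation changes loop overhead and parallelism without changing results.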
Loop Unrolling (IIR filter example)
Loop unrolling localizes the data to reduce the activity at the inputs of the functional units: two output samples are computed in parallel from two input samples.

Y_{n-1} = X_{n-1} + A * Y_{n-2}
Y_n = X_n + A * Y_{n-1} = X_n + A * (X_{n-1} + A * Y_{n-2})

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:

Y_{n-1} = X_{n-1} + A * Y_{n-2}
Y_n = X_n + A * X_{n-1} + A^2 * Y_{n-2}

The transformation yields a critical path of 3, so the voltage can be dropped.
26
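A sketch verifying the algebra above: the unrolled first-order IIR, which computes two outputs from Y_{n-2} only, matches the rolled recurrence. The coefficient and input values are illustrative; the unrolled version assumes an even number of input samples:

```python
# Rolled form:   y[n] = x[n] + A*y[n-1]
# Unrolled form: y[n] = x[n] + A*x[n-1] + A^2*y[n-2]  (per the slide)

def iir_rolled(x, A, y_init):
    y, prev = [], y_init
    for xn in x:
        prev = xn + A * prev                   # y[n] = x[n] + A*y[n-1]
        y.append(prev)
    return y

def iir_unrolled(x, A, y_init):
    assert len(x) % 2 == 0                     # two outputs per iteration
    y, y2 = [], y_init                         # y2 carries y[n-2]
    for n in range(0, len(x), 2):
        yn1 = x[n] + A * y2                    # first output of the pair
        yn = x[n + 1] + A * x[n] + A * A * y2  # depends only on y[n-2]
        y += [yn1, yn]
        y2 = yn
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(iir_rolled(x, 0.5, 0.0))
print(iir_unrolled(x, 0.5, 0.0))   # same sequence, shorter recurrence chain
```

Because the second output no longer waits on the first, the two can be computed concurrently, which is what allows the voltage reduction the slide mentions.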
Loop Unrolling for Low Power
27
Loop Unrolling for Low Power
28
Loop Unrolling for Low Power
29
Designing a Parallel FIR
 To obtain a parallel processing structure, the SISO (single-input, single-output) system must be converted into a MIMO (multiple-input, multiple-output) system:
y(3k) = a·x(3k) + b·x(3k-1) + c·x(3k-2)
y(3k+1) = a·x(3k+1) + b·x(3k) + c·x(3k-1)
y(3k+2) = a·x(3k+2) + b·x(3k+1) + c·x(3k)
 Parallel processing systems are also referred to as block processing systems.
30
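The SISO-to-MIMO conversion above can be sketched directly: the block form consumes three inputs and emits three outputs per iteration, carrying only the last two inputs between blocks. Coefficients and data are illustrative:

```python
# Block processing (block size 3) of the slide's 3-tap FIR:
#   y[n] = a*x[n] + b*x[n-1] + c*x[n-2], zero initial conditions.

def fir_siso(x, a, b, c):
    xp = [0, 0] + x                     # prepend zero initial conditions
    return [a * xp[n + 2] + b * xp[n + 1] + c * xp[n]
            for n in range(len(x))]

def fir_block3(x, a, b, c):
    assert len(x) % 3 == 0
    y, x1, x2 = [], 0, 0                # x1 = x[3k-1], x2 = x[3k-2]
    for k in range(len(x) // 3):
        x0, xA, xB = x[3 * k], x[3 * k + 1], x[3 * k + 2]
        y += [a * x0 + b * x1 + c * x2,      # y(3k)
              a * xA + b * x0 + c * x1,      # y(3k+1)
              a * xB + b * xA + c * x0]      # y(3k+2)
        x1, x2 = xB, xA                 # carry the last two inputs
    return y

print(fir_siso([1, 2, 3, 4, 5, 6], 2, 3, 5))    # -> [2, 7, 17, 27, 37, 47]
print(fir_block3([1, 2, 3, 4, 5, 6], 2, 3, 5))  # same output, 1/3 the clock rate
```

The three output computations inside a block are independent, so in hardware they run in parallel at one third of the sample rate, enabling the voltage reduction discussed earlier.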
Parallel Processing (2)
 Parallel processing architecture for a 3-tap FIR
filter (with block size 3)
31
Parallel Processing (3)
<Combined fine-grain pipelining and parallel processing
for 3-tap FIR filter>
32
Motion Estimation
33
Block Matching Algorithm
34
Configurable H/W Paradigms
35
Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
– 720 x 576 pixels
– 16 x 16 macroblock (n = 16)
– 32 x 32 search area (p = 8)
– 25 Hz frame rate (f_frame = 25)
• 9 Giga operations/sec are needed for the full-search block matching algorithm.
36
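The ~9 GOPS figure above can be reproduced with back-of-the-envelope arithmetic. The accounting below assumes (2p+1)^2 candidate vectors per block position and 3 operations (subtract, absolute value, accumulate) per pixel comparison; other candidate-count conventions give slightly different totals:

```python
# Reproduce the slide's ~9 GOPS estimate for full-search block matching
# on CCIR 601 video. Operation accounting is an assumption (sub+abs+add).

pixels_per_frame = 720 * 576
p = 8                                    # search range: +/- p pixels
candidates = (2 * p + 1) ** 2            # 289 candidate motion vectors
ops_per_pixel_match = 3                  # subtract, absolute value, add
frame_rate = 25

ops_per_sec = pixels_per_frame * candidates * ops_per_pixel_match * frame_rate
print(ops_per_sec / 1e9)                 # ~9.0 Giga operations/sec
```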
Why Reconfiguration in Motion
Estimation?
• Adjusting the search
area at frame-rate
according to the
changing
characteristics of video
sequences
• Reducing Power
Consumption by
avoiding unnecessary
computation
Motion Vector Distributions
37
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995
38
Re-configurable Architecture for
ME
39
Power Estimation in Reconfigurable
Architecture
40
Power vs Search area
41
Resource Reuse in FPGAs
42
Motion Estimation - Conventional
43
Motion Estimation - Data Reuse
P_a = 2·P_add2 + P_abs / 2
P_b = P_add2 + P_add1 + P_abs / 2
P_abs ≈ 0.45·P_add2
Therefore, the power reduction factor is 11%.
44
Vector Quantization
• Lossy compression technique which exploits the
correlation that exists between neighboring samples and
quantizes samples together
45
Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector C_i is computed as:
D_i = sum_{j=0..15} (X_j - C_ij)^2
Three VQ encoding algorithms will be evaluated: full search, tree search, and differential codebook tree search.
46
Full Search
• Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent to the decoder.
• For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications, and 15 additions. In addition, the minimum of 256 distortion values, which involves 255 comparison operations, must be determined.
47
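A minimal sketch of the brute-force search just described; the codebook contents are illustrative (256 distinct 16-dimensional codewords), not a trained codebook:

```python
# Full-search VQ: compute the squared-error distortion against every
# codeword and return the index of the minimum, as on the slide.

def full_search_vq(x, codebook):
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        # per codeword: 16 memory accesses, 16 sub, 16 mul, 15 add
        d = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
        if d < best_d:                   # part of the 255 comparisons
            best_i, best_d = i, d
    return best_i                        # code index sent to the decoder

# Illustrative codebook: entry i starts at value i, so all are distinct.
codebook = [[(i + 3 * j) % 256 for j in range(16)] for i in range(256)]
print(full_search_vq(codebook[42][:], codebook))   # -> 42 (distortion 0)
```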
Tree-structured Vector Quantization
If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree is never compared below level 2, and an index bit 0 is transmitted. Here only 2 x log2(256) = 16 distortion calculations with 8 comparisons are needed.
48
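A toy sketch of the tree search: descending an 8-level binary tree costs 2 distortion calculations and 1 comparison per level (16 and 8 in total) instead of 256 and 255. Real TSVQ trees are trained; here internal nodes are simply elementwise means of an illustrative codebook, which is an assumption:

```python
def dist(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def build_tree(codebook):
    """Pair codewords level by level; parents are elementwise means."""
    levels = [codebook]
    while len(levels[0]) > 1:
        cur = levels[0]
        parents = [[(a + b) / 2 for a, b in zip(cur[2 * i], cur[2 * i + 1])]
                   for i in range(len(cur) // 2)]
        levels.insert(0, parents)
    return levels                        # levels[8] holds the 256 leaves

def tsvq(x, tree):
    node = 0
    for level in range(1, len(tree)):    # 8 levels for 256 leaves
        left = tree[level][2 * node]
        right = tree[level][2 * node + 1]
        # 2 distortion calculations and 1 comparison per level
        node = 2 * node if dist(x, left) <= dist(x, right) else 2 * node + 1
    return node                          # leaf index sent to the decoder

codebook = [[float(i)] * 16 for i in range(256)]
tree = build_tree(codebook)
print(tsvq([42.0] * 16, tree))           # -> 42 for this codebook
```

For this monotone codebook the greedy descent finds the true nearest codeword; in general, tree search trades some distortion for the large reduction in operations.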
Algorithmic Optimization
• Minimizing the number of operations
– Example: a video data stream using the vector quantization (VQ) algorithm
– Distortion metric:
D_i = sum_{j=0..15} (X_j - C_ij)^2
– Full-search VQ
• exhaustive full search
• distortion calculations: 256
• value comparisons: 255
– Tree-structured VQ
• binary tree search
• some performance degradation
• distortion calculations: 16 (2 x log2 256)
• value comparisons: 8
[Figure: example binary VQ search tree with branch labels]
49
Differential Codebook Tree-structure Vector Quantization
• The distortion difference between the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations.
50
Algorithmic Optimization
– Differential codebook tree-structure VQ
• modify the equation to optimize the number of operations:
D_left-right = sum_{j=0..15} (X_j - C_left,j)^2 - sum_{j=0..15} (X_j - C_right,j)^2
             = sum_{j=0..15} (C_left,j^2 - C_right,j^2) + 2·sum_{j=0..15} X_j·(C_right,j - C_left,j)

algorithm | # of mem. access | # of mul. | # of add. | # of sub.
full search | 4096 | 4096 | 3840 | 4096
tree search | 256 | 256 | 240 | 264
differential tree search | 136 | 128 | 128 | 0
51
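A sketch verifying the rearrangement: the left-right distortion difference equals a precomputable codebook term plus one correlation term, removing all subtractions from the encoding loop. The vectors below are illustrative 16-dimensional integer data:

```python
# naive:        sum((x-cl)^2) - sum((x-cr)^2)      (needs subtractions)
# differential: pre + 2*sum(x * (cr-cl))           (mul/add only online)

def naive_diff(x, cl, cr):
    return (sum((xj - aj) ** 2 for xj, aj in zip(x, cl))
            - sum((xj - bj) ** 2 for xj, bj in zip(x, cr)))

def differential(x, cdiff, pre):
    # cdiff[j] = cr[j]-cl[j] and pre = sum(cl^2 - cr^2) are computed
    # offline when the codebook is built (the "differential codebook"),
    # so the per-input loop needs only multiplications and additions.
    return pre + 2 * sum(xj * dj for xj, dj in zip(x, cdiff))

cl = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
cr = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5]
cdiff = [b - a for a, b in zip(cl, cr)]            # offline
pre = sum(a * a - b * b for a, b in zip(cl, cr))   # offline
x = [1, 6, 1, 8, 0, 3, 3, 9, 8, 8, 7, 4, 9, 6, 2, 6]
print(naive_diff(x, cl, cr) == differential(x, cdiff, pre))  # -> True
```

The sign of the difference is all the tree descent needs, which is why this form achieves the zero-subtraction column in the table above.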
Multiplication and Accumulation: MAC
• Major operation in DSP
[Datapath figure] X and Y feed a multiplier (MULT, built from a CSA tree and a CPA), followed by an accumulator (ACC) and product register (PR). Modified Booth Encoding selects one of 0, X, -X, 2X, -2X based on each 2 bits of Y.
• A multiply costs more than five ALU operations: MUL > (5 * ALU)
52
Operand Swapping (1/2)
• Weight = how many additions are needed?
Y = 00111100 → 00X000X0 by Booth encoding → Weight = 2
• Put the low-weight operand on the Booth-encoded port, even when the inputs show high switching:

Consecutive operand pattern | A*B current (mW) | B*A current (mW) | Saving
A: 7FFF↔0001, B: AAAA↔AAAA | 22.0 | 10.0 | 54%
A: 7FFF↔0001, B: 6666↔AAAA | 31.6 | 10.0 | 68%
A: 7FFF↔0001, B: AAAA↔0001 | 28.8 | 12.2 | 58%
53
DIGLOG multiplier
C_mult(n) ≈ 253·n^2, C_add(n) ≈ 214·n, where n = word length in bits

A = 2^j + A_R, B = 2^k + B_R
A·B = (2^j + A_R)·(2^k + B_R) = 2^j·B + 2^k·A_R + A_R·B_R

                    1st Iter | 2nd Iter | 3rd Iter
Worst-case error:     -25%   |   -6%    |  -1.6%
Prob. of error < 1%:   10%   |   70%    |  99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
54
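A sketch of the scheme above: take 2^j as the leading-one weight of A, approximate A·B as 2^j·B + 2^k·A_R, and iterate on the dropped residual product A_R·B_R. The function name and iteration cap are illustrative:

```python
# DIGLOG-style approximate multiplication: each iteration replaces the
# residual product A_R*B_R; the dropped term bounds the error at -25%
# after one step, and iterating until a residual is zero is exact.

def diglog_mult(a, b, iters):
    total = 0
    for _ in range(iters):
        if a == 0 or b == 0:
            break                        # residual product is exactly 0
        j = a.bit_length() - 1           # A = 2^j + A_R
        k = b.bit_length() - 1           # B = 2^k + B_R
        ar, br = a - (1 << j), b - (1 << k)
        total += (b << j) + (ar << k)    # 2^j*B + 2^k*A_R
        a, b = ar, br                    # next: approximate A_R*B_R
    return total

exact = 200 * 100
print(diglog_mult(200, 100, 1) / exact)    # one step: within -25% of exact
print(diglog_mult(200, 100, 16) == exact)  # -> True (residuals reach 0)
```

Only shifts and additions appear in the loop, which is the point: C_add grows linearly in n while C_mult grows quadratically.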
Voltage Scaling
• Merely changing a processor clock frequency is
not an effective technique for reducing energy
consumption. Reducing the clock frequency will
reduce the power consumed by a processor,
however, it does not reduce the energy required
to perform a given task.
• Lowering the voltage along with the clock
actually alters the energy-per-operation of the
microprocessor, reducing the energy required to
perform a fixed amount of work.
55
Energy consumption ( Vdd2)
Different Voltage Schedules
Timing constraint
40J
1000Mcycles
5.02
(A)
50MHz
0
5.02
5
10
15
32.5J
20
25
750Mcycles
250Mcycles
50MHz
25MHz
Time(sec)
(B)
2.52
0
5
5.02
4.02
10
15
20
25
Time(sec)
(C)
25J
1000Mcycles
40MHz
0
5
10
15
20
25
Time(sec)
56
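The three schedules above follow from E ∝ Vdd^2 × cycles. The sketch below normalizes the proportionality constant so that schedule (A) dissipates 40 J; that constant is an illustrative assumption, not given on the slide:

```python
# Energy of the three voltage schedules, E = k * Vdd^2 * cycles.
k = 40.0 / (5.0 ** 2 * 1000e6)           # J per (V^2 * cycle), normalized

def energy(segments):
    """segments: list of (cycles, Vdd) pairs."""
    return sum(k * v * v * c for c, v in segments)

E_A = energy([(1000e6, 5.0)])                 # full speed, idle afterwards
E_B = energy([(750e6, 5.0), (250e6, 2.5)])    # drop voltage for the tail
E_C = energy([(1000e6, 4.0)])                 # run slower the whole time
print(E_A, E_B, E_C)   # ~40.0, ~32.5, ~25.6 (slide rounds the last to 25 J)
```

Schedule (C) is best: stretching the work over the entire deadline at the lowest feasible voltage beats racing to idle, which is the core argument for voltage scheduling.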
Data Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply. Using a ping-pong buffering scheme, data samples I_{n+2}, I_{n+3} are being buffered while I_n, I_{n+1} are being processed.
57
RTL: Multiple Supply Voltages
Scheduling
Filter Example
58
A hardware / software partitioning technique with
hierarchical design space exploration
Houria Oudghiri, Bozena Kaminska, and Janusz Rajski,
Mentor Graphics Corp.
• A set of DSP examples is considered for co-design on a specific architecture in order to accelerate their performance on a target architecture comprising a standard DSP processor running concurrently with a custom SIMD (Single Instruction Multiple Data) processor.
59
Proposed methodology
Input: list of blocks and time constraints. Output: two subsets (HW and SW) to which the blocks are assigned.
Step 1: Construct the complete weighted dependency graph G.
Step 2: Assign all blocks to software; compute the complete system execution time.
Step 3: while (time constraints not satisfied) do
  Step 3.1: Select the node with the maximum execution time (i).
  Step 3.2: Assign i to hardware; update the system execution time.
  Step 3.3: while (time constraints not satisfied) do
    Step 3.3.1: Select the maximum-weighted edge connected to i with the most time-consuming node (j).
    Step 3.3.2: Assign j to hardware; update the dependency graph G and the system execution time.
  end do
end do
60
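A minimal executable sketch of the greedy partitioning loop above, on a toy task graph. Block names, times, edge weights, and the uniform 4x hardware speedup are invented, and execution time is modeled as a plain sum of assigned times (ignoring concurrency), which is a simplifying assumption:

```python
def partition(sw_time, hw_time, edges, constraint):
    """Greedy sketch. edges: {(u, v): weight}. Returns block -> 'HW'/'SW'."""
    part = {n: "SW" for n in sw_time}                       # step 2
    def exec_time():
        return sum((hw_time if part[n] == "HW" else sw_time)[n]
                   for n in part)                           # crude sum model
    while exec_time() > constraint:                         # step 3
        sw_nodes = [n for n in part if part[n] == "SW"]
        if not sw_nodes:
            break                                           # infeasible
        i = max(sw_nodes, key=lambda n: sw_time[n])         # step 3.1
        part[i] = "HW"                                      # step 3.2
        # step 3.3: follow i's heaviest edges while still too slow
        neighbours = sorted(
            ((w, v) for (a, b), w in edges.items()
             for v in ((b,) if a == i else (a,) if b == i else ())
             if part[v] == "SW"),
            reverse=True)
        for _, j in neighbours:
            if exec_time() <= constraint:
                break
            part[j] = "HW"
    return part

# Toy example (all numbers invented; HW assumed 4x faster than SW)
sw = {"fft": 10, "bitrev": 4, "win": 3, "io": 2}
hw = {n: t / 4 for n, t in sw.items()}
edges = {("fft", "bitrev"): 5, ("fft", "win"): 2, ("bitrev", "io"): 1}
print(partition(sw, hw, edges, 12))   # only "fft" moves to hardware
```

Tightening the constraint pulls the heaviest-edge neighbours of the moved node into hardware as well, mirroring step 3.3.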
co-design target architecture
The Texas Instruments DSP processor TMS320C40 is
used as the master processor and the custom SIMD
processor PULSE (Parallel Ultra Large Scale Engine, 4
processors in parallel) as the slave processor
61
The Hierarchical Model of the FFT Transform
[Hierarchical block diagram, levels 1-8] Level 1: FFT. Level 2: Initialize, Bit Reversal, Danielson, Output. Lower levels decompose these into blocks such as Initialize Variable, Initialize Data, Bit_init, Index_init, Read_data, Index_incr, Bit_loop1, Bit_loop2, Bit_cond, Bit_incr, Bit_shift, Bit_test, Bit_swap1, Bit_swap2, Bit_acc, Loop2_init, Loop2_test, Loop2_ass, Loop2_shift, Loop2_body, Loop2_incr, Data_test, Dan_init, Dan_loop1, Loop1_init, Loop1_body, Loop1_incr, Dan_real, Dan_imag, Danielson control, Out_init, Out_write, Out_incr, and Update Variables.
62
Block Assignment at Different Hierarchical Levels
(time constraint = 25 ms)

level | Nb. of blocks | C40 | PULSE | PULSE time (ms) | C40 time (ms) | Total (ms)
1 | 4 | 2 | 2 | 18.14 | 4.8 | 22.94
2 | 10 | 6 | 4 | 18.8 | 2.96 | 21.76
3 | 17 | 11 | 6 | 15.56 | 9 | 24.56
4 | 22 | 18 | 6 | 14.68 | 10.24 | 24.92
5 | 24 | 17 | 7 | 14.56 | 10.4 | 24.94
6 | 24 | 22 | 2 | 6.82 | 17.72 | 24.54
7 | 25 | 22 | 3 | 7 | 17.92 | 24.92
8 | 27 | 18 | 9 | 5.88 | 18.64 | 24.52
63
Function-Architecture Co-Design
Cadence
64
65
SystemC support:
– Mentor Graphics - Seamless® C-Bridge™
– Verisity - SpecMan™ Elite
– Forte Design Systems - ESC Library
– Emulation & Verification Engineering - Zebu
– Axys Design - MaxSim™
– CoWare - N2C updated for SystemC 2.0
– Cadence - SPW 4.8 / SystemC v2.0 IF
– Synopsys - CoCentric System Studio
• Plus Kluwer book - "System Design Using SystemC", 2002
66
OCAPI-xl design flow
67
Application Structure
68
Specification and modeling
• Executable specification - Verilog, VHDL, C, C++, Java.
• Common models: synchronous dataflow (SDF), sequential programs (Prog.), communicating sequential processes (CSP), object-oriented programming (OOP), FSMs, hierarchical/concurrent FSM (HCFSM).
• Depending on the application domain and specification semantics, they are based on different models of computation.
69
Hardware Synthesis
• Many RTL, logic-level, and physical-level commercial CAD tools.
• Some emerging high-level synthesis tools: Behavioral Compiler (Synopsys), Monet (Mentor Graphics), and RapidPath (DASYS).
• Many open problems: memory optimization, parallel heterogeneous hardware architectures, programmable hardware synthesis and optimization, communication optimization.
70
Software synthesis
• The use of real-time operating systems (RTOSs)
• The use of DSPs and micro-controllers - code generation issues
• Special processor compilation in many cases is still far less efficient than manual code generation!
• Retargeting issues - C code developed for the TI TMS320C6x is not optimized for running on the Philips TriMedia processor.
71
Interface synthesis
• Interface between: hardware-hardware, hardware-software, software-software
• Timing and protocols
• Recently, the first commercial tools appeared: the CoWare system (hw-sw protocols) and the Synopsys Protocol Compiler (hw interface synthesis tool)
72
Co-design Sites
• Bibliography of Hardware/Software Codesign: http://www-ti.informatik.uni-tuebingen.de/~buchen/
• Ralf Niemann's Codesign Links and Literature: http://ls12-www.informatik.uni-dortmund.de/~niemann/codesign/codesign_links.html
• URLs to Hardware/Software Co-Design Research: http://www.ece.cmu.edu/~thomas/hsURL.html
• RASSP Architecture Guide: http://www.sanders.com/hpc/ArchGuide/TOC.html
• EDA, Electronic Design Automation: http://www.eda.org
• COMET (Case Western Reserve University): http://bear.ces.cwru.edu/research/hard_soft.html
• COSMOS (Tima - Cmp, France): http://tima-cmp.imag.fr/Homepages/cosmos/research.html
• COSYMA (Braunschweig): http://www.ida.ing.tu-bs.de/projects/cosyma/
• Handel-C (Oxford): http://oldwww.comlab.ox.ac.uk/oucl/hwcomp.html
• Lycos (Technical University of Lyngby, Denmark): http://www.it.dtu.dk/~lycos/
• MOVE (Technical University Delft): http://cardit.et.tudelft.nl/MOVE/
• Polis (University of Berkeley): http://www-cad.eecs.berkeley.edu/Respep/Research/hsc/abstract.html
• ProCos (UK Research): http://www.comlab.ox.ac.uk/archive/procos/codesign.html
• Ptolemy (University of Berkeley): http://ptolemy.eecs.berkeley.edu/
• SPAM (Princeton): http://www.ee.princeton.edu/~spam/
• TRADES (University of Twente, INF/CAES): http://wwwspa.cs.utwente.nl/aid/aid.html
• Specification languages - SystemC: http://www.systemc.org
73
SOC CAD Companies
• Cadence: www.cadence.com
• Duet Tech: www.duettech.com
• Escalade: www.escalade.com
• LogicVision: www.logicvision.com
• Mentor Graphics: www.mentor.com
• Palmchip: www.palmchip.com
• Sonics: www.sonicsinc.com
• Summit Design: www.summit-design.com
• Synopsys: www.synopsys.com
• Topdown Design Solutions: www.topdown.com
• Xynetix Design Systems: www.xynetix.com
• Zuken-Redac: www.redac.co.uk
74