L10: Low Power High-Level Synthesis (1)
1999. 8
Prof. Jun-Dong Cho, Sungkyunkwan University
http://vada.skku.ac.kr
Low Power Design Flow
[Figure: low power design flow]
• System-Level Specification → Function Partitioning and HW/SW Allocation (with System-Level Power Analysis)
• Hardware branch: Behavioral Description → Power-driven Behavioral Transformation (with Behavioral-Level Power Analysis) → Power-Conscious Behavioral Description → High-Level Synthesis and Optimization (with RT-Level Power Analysis) → To RT-Level Design
• Software branch: Software Functions → Processor Selection → Software Optimization (with Software-Level Power Analysis)
Early Analysis Leads to Power Savings
National Semiconductor Success
• A LAN switch ASIC of 200K gates and 41 memories was characterized for state-dependent power.
• DesignPower revealed excessive power consumption by the memories due to redundant read cycles.
Module Selection
• Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints.
• Full pipelining is ineffective when the clock period mismatches the execution times of the operators. Performing operations in sequence without intermediate buffering can reduce the critical path.
• Cluster operations into non-pipelined hardware modules so that the reusability of these modules over the complete computational graph is maximized.
• During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
High-Level Power Estimation
• $P_{core} = P_{DP} + P_{MEM} + P_{CNTR} + P_{PROC}$
• $P_{DP} = P_{REG} + P_{MUX} + P_{FU} + P_{INT}$, where
  – $P_{REG}$ is the power of the registers
  – $P_{MUX}$ is the power of the multiplexers
  – $P_{FU}$ is the power of the functional units
  – $P_{INT}$ is the power of the physical interconnect capacitance
• $C_{int} = \alpha \cdot C_{total} / N$, where $\alpha$ is the average activity (the total number of interconnect accesses multiplied by an average signal transition probability), $C_{total}$ is the total estimated capacitance of the chip, and $N$ is an estimate of the number of physical interconnects (HYPER).
Estimation
• Estimate min and max bounds on the required resources to
  – delimit the design space; the min bounds serve as an initial solution
  – serve as entries in a resource utilization table, which guides the transformation, assignment and scheduling operations
• The max bound on execution time is $t_{max}$: topological ordering of the DFG using ASAP and ALAP
• Minimum bounds on the number of resources are computed for each resource class, where
  $N_{R_i}$: the number of resources of class $R_i$
  $d_{R_i}$: the duration of a single operation
  $O_{R_i}$: the number of operations
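The ASAP/ALAP ordering and the minimum resource bound can be sketched as follows. The bound formula itself is not in the transcript; the sketch assumes the common form $N_{R_i} \ge \lceil O_{R_i} d_{R_i} / t_{max} \rceil$, and the small DFG, the unit durations and all function names are hypothetical.

```python
from math import ceil

def asap(ops, deps, dur):
    """Earliest start step of each operation (ops in topological order)."""
    start = {}
    for op in ops:
        start[op] = max((start[p] + dur[p] for p in deps.get(op, [])), default=0)
    return start

def alap(ops, deps, dur, t_max):
    """Latest start step of each operation that still finishes by t_max."""
    succs = {op: [] for op in ops}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    start = {}
    for op in reversed(ops):
        latest_end = min((start[s] for s in succs[op]), default=t_max)
        start[op] = latest_end - dur[op]
    return start

def min_resources(ops, kind, dur, t_max):
    """Assumed lower bound per class: ceil(total busy time / t_max)."""
    busy = {}
    for op in ops:
        busy[kind[op]] = busy.get(kind[op], 0) + dur[op]
    return {k: ceil(total / t_max) for k, total in busy.items()}

# Hypothetical DFG: e = a + b; g = c + d; f = e + b; h = f * g
ops = ["add1", "add2", "add3", "mul1"]
deps = {"add3": ["add1"], "mul1": ["add3", "add2"]}
dur = {op: 1 for op in ops}
kind = {"add1": "adder", "add2": "adder", "add3": "adder", "mul1": "multiplier"}
t_max = 3

asap_start = asap(ops, deps, dur)
alap_start = alap(ops, deps, dur, t_max)
bounds = min_resources(ops, kind, dur, t_max)
```

Operations whose ASAP and ALAP steps coincide have no slack; here only `add2` can move, which is what lets a scheduler trade it against the adder count.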
High-Level Power Estimation: PREG
• Compute the lifetimes of all the variables in the given VHDL code.
• Represent the lifetime of each variable as a vertical line from statement $i$ through statement $i + n$ in the column $j$ reserved for the corresponding variable $v_j$.
• Determine the maximum number $N$ of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
• Estimate the minimal number $N$ of register sets necessary to implement the code by using register sharing. Register sharing is applied to groups of variables with the same bit-width $b_i$.
• Select a possible mapping of variables into registers by using register sharing.
• Compute the number $w_i$ of writes to the variables mapped to the same set of registers. Estimate the activity $\alpha_i$ of each register set by dividing $w_i$ by the number of statements $S$: $\alpha_i = w_i / S$; hence $TR_{i,max} = \alpha_i f_{clk}$.
• Power in latches and flip-flops is consumed not only during output transitions, but also at every clock edge by the internal clock buffers.
• The non-switching power $P_{NSK}$ dissipated by the internal clock buffers accounts for 30% of the average power for a 0.38-micron process operating at 3.3 V.
• In total,
  $P_{REG} = \sum_{k=1}^{N} (P_k + P_{NSK})$, $P_k = n_k P_{tk} TR_k$, $P_{NSK} = n_k P_{Nk} (f_{clk} - TR_k)$
PCNTR
• After scheduling, the control is defined and optimized by the hardware mapper and further by the logic synthesis process before mapping to layout.
• Like interconnect, therefore, the control needs to be estimated statistically.
• Global control model:
  $C_{FSM} = \alpha_1 N_{states} + \alpha_2$
  For a 1.2 µm technology, $\alpha_1$ is 4.9 fF and $\alpha_2$ is 22.1 fF. The total number of transitions is strongly dependent on the number of states.
• Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller.
  $C_{lc} = \beta_0 + \beta_1 N_{trans} + \beta_2 N_{states} + \beta_3 B_f$
  For a 1.2 µm technology, $\beta_0 = 72$, $\beta_1 = 0.15$, $\beta_2 = 8.3$, $\beta_3 = 0.55$,
  where $N_{trans}$ is the number of transitions, $N_{states}$ is the number of states, $B_f$ is the bus factor, and $C_{lc}$ is the capacitance switched in any local controller in one sample period. $B_f$ is the ratio of the number of bus accesses to the number of busses.
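As a worked example, the two controller models can be evaluated directly. This sketch assumes both models are plain linear sums of the listed terms; the coefficient labels (`ALPHA`, `BETA`) are naming choices made here, and the controller sizes passed in are hypothetical. The source gives units only for the global coefficients (fF).

```python
# Coefficients quoted for a 1.2 um technology.
ALPHA = (4.9, 22.1)             # global model, in fF
BETA = (72.0, 0.15, 8.3, 0.55)  # local model, units not given in the source

def c_fsm(n_states):
    """Global controller capacitance: linear in the number of states."""
    return ALPHA[0] * n_states + ALPHA[1]

def c_lc(n_trans, n_states, bus_factor):
    """Capacitance switched in a local controller per sample period."""
    b0, b1, b2, b3 = BETA
    return b0 + b1 * n_trans + b2 * n_states + b3 * bus_factor

# Hypothetical controller: 10 states, 200 output transitions, bus factor 2.
global_cap = c_fsm(10)
local_cap = c_lc(n_trans=200, n_states=10, bus_factor=2.0)
```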
Ntrans
• The number of transitions depends on assignment, scheduling, optimizations, logic optimization, the standard cell library used, the amount of glitching and the statistics of the inputs.
• $N_{trans} = \gamma_1 + \gamma_2 (N_{nodes} + N_{edges}) + \gamma_3 (S + N_{Exu})$
  where $N_{trans}$ is the number of transitions on the outputs of the local controllers, $S$ is the number of control cycles per sample period, $N_{edges}$ and $N_{nodes}$ are the number of edges and nodes in the CDFG, and $N_{Exu}$ is an estimate for the total number of execution units. For a 1.2 µm technology, $\gamma_1 = 178.7$, $\gamma_2 = 7.2$, $\gamma_3 = 2.0$.
Exploring the Design Space
• Find the minimal-area solution subject to the timing constraints.
• By checking the critical paths, determine whether the proposed graph violates the timing constraints. If so, retiming, pipelining and tree-height reduction can be applied.
• After an acceptable graph is obtained, the resource allocation process is initiated:
  – change the available hardware (FUs, registers, busses)
  – redistribute the time allocation over the sub-graphs
  – transform the graph to reduce the hardware requirements.
• Use a rejectionless probabilistic iterative search technique (a variant of simulated annealing), where moves are always accepted. This approach reduces computational complexity and gives faster convergence.
Behavioral Synthesis
• Loop unrolling: localize the data to reduce the activity of the inputs of the functional units, or compute two output samples in parallel based on two input samples.
  $Y_{n-1} = X_{n-1} + A \cdot Y_{n-2}$
  $Y_n = X_n + A \cdot Y_{n-1} = X_n + A \cdot (X_{n-1} + A \cdot Y_{n-2})$
  Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation,
  $Y_{n-1} = X_{n-1} + A \cdot Y_{n-2}$
  $Y_n = X_n + A \cdot X_{n-1} + A^2 \cdot Y_{n-2}$
  The transformation yields a critical path of 3, thus the voltage can be dropped.
• Clock selection: choose the optimal system clock period to eliminate slack, improve resource utilization, and enable greater voltage scaling.
• Module selection: for each operation, choose a library template.
• Flow graph restructuring: pull out operations on the critical cycle.
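The loop-unrolling transformation in the first bullet can be checked numerically. The sketch below assumes the first-order recurrence $Y_n = X_n + A \cdot Y_{n-1}$ and compares it with the form unrolled by two after distributivity and constant propagation; the function names and sample values are illustrative only.

```python
def iir_direct(x, a, y_init=0.0):
    """Reference recurrence: Y[n] = X[n] + A * Y[n-1]."""
    y, prev = [], y_init
    for sample in x:
        prev = sample + a * prev
        y.append(prev)
    return y

def iir_unrolled(x, a, y_init=0.0):
    """Unrolled by two, after distributivity and constant propagation.

    Both outputs of a pair depend only on Y[n-2], so they can be
    computed in parallel. Assumes len(x) is even.
    """
    a2 = a * a                      # constant A^2, computed once
    y, y_nm2 = [], y_init
    for i in range(0, len(x), 2):
        y_nm1 = x[i] + a * y_nm2                    # Y[n-1]
        y_n = x[i + 1] + a * x[i] + a2 * y_nm2      # Y[n]
        y.extend([y_nm1, y_n])
        y_nm2 = y_n
    return y

samples = [1.0, 2.0, 3.0, 4.0]
direct = iir_direct(samples, a=0.5)
unrolled = iir_unrolled(samples, a=0.5)
```

The unrolled form does more arithmetic per pair of samples, but neither output waits on the other, which is what shortens the critical path per sample and leaves room for voltage scaling.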
High-Level Power Estimation: PMUX and PFU
Critical Path
• Longest-delay path from input to output in combinational logic
• Determines the operating clock frequency
• Resizing non-critical-path transistors (In-Place Optimization)
• Critical path in synchronous sequential logic:
  $t_{cycle,min} = t_{ff,max} + t_{logic,max} + t_{setup,max} + t_{skew,max}$
  $t_{cycle,min}$: min. value of the clock period
  $t_{ff,max}$: max. value of the flip-flop delay
  $t_{logic,max}$: max. value of the critical path delay
  $t_{setup,max}$: max. value of the setup time of the flip-flop
  $t_{skew,max}$: max. value of the clock skew
[Figure: flip-flops (D/Q) clocked by clk around combinational logic, with path A and path B marked]
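A small worked example of the minimum clock period: the slowest combinational path sets the logic delay, and flip-flop delay, setup time and skew are added to it. The path names echo the figure, but all delay values (in ns) are hypothetical.

```python
def critical_path(paths):
    """Return (name, delay) of the slowest combinational path."""
    return max(paths.items(), key=lambda item: item[1])

def min_clock_period(t_ff, t_logic, t_setup, t_skew):
    """Minimum clock period: sum of the four worst-case delays."""
    return t_ff + t_logic + t_setup + t_skew

# Hypothetical delays in nanoseconds.
paths = {"A": 6.0, "B": 4.5}
name, t_logic = critical_path(paths)
t_cycle = min_clock_period(t_ff=0.5, t_logic=t_logic, t_setup=0.25, t_skew=0.25)
f_max_mhz = 1000.0 / t_cycle
```

Path B has 1.5 ns of slack in this example, so its transistors could be downsized without changing the clock period.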
Data path Synthesis
System Partitioning
• Decide which components of the system will be realized in hardware and which will be implemented in software.
• High-quality partitioning is critical in high-level synthesis. To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance.
• Decide whether the system functions are distributed or not. Distributed processors, memories and controllers can lead to significant power savings. The drawback is the increase in area. E.g., a non-distributed and a distributed design of a vector quantizer.
Circuit Partitioning
• Graph and physical representation
[Figure: VHDL example; process communication; behavioral description; control/data flow graph]
Clustering Example
• Two-cluster Partition
• Three-cluster Partition
Clustering (Cont’d)
High-Level Synthesis Steps
[Figure: abstraction levels and the high-level synthesis flow]
Abstraction levels: system level, behavioral level, architecture level, logic/circuit level, device/process level
Flow:
• Design Specification
• CDFG (Control Data Flow Graph)
• LOW POWER AND FAST SCHEDULING: fast, and enables resource sharing for low power scheduling
• REGISTER ALLOCATION FOR LOW POWER: minimizing switching activity in registers
• RESOURCE ALLOCATION FOR LOW POWER: minimizing switching activity in resources and interconnection
• DATAPATH GENERATION AND CONTROLLER SYNTHESIS
• WRITE VHDL
- Design Automation Laboratory -
High-Level Synthesis
[Figure: high-level synthesis turns a behavioral description of the circuit into an RTL (register transfer level) architecture]
Behavioral description (fragment):
    for (I = 0; I <= 2; I = I + 1) begin
      @(posedge clk);
      if (rgb[I] % 8) begin
        p = rgb[I] % 8;
        g = filter(x, y) * 8;
      end
      ............
• Inputs taken from the description: instructions, operations, variables, arrays
• Synthesis tasks, under constraints: scheduling, memory inferencing, register sharing, control inferencing
• Resulting RTL architecture: control; datapath (operators, registers, memory, multiplexers); memory; control signals
High-Level Synthesis
• The allocation task determines the type and quantity of
resources used in the RTL design. It also determines the
clocking scheme, memory hierarchy and pipelining style. To
perform the required trade-offs, the allocation task must
determine the exact area and performance values.
• The scheduling task schedules operations and memory
references into clock cycles. If the number of clock cycles is a
constraint, the scheduler has to produce a design with the
fewest functional units.
• The binding task assigns operations and memory references
within each clock cycle to available hardware units. A resource
can be shared by different operations if they are mutually
exclusive, i.e. they will never execute simultaneously.
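High-Level Synthesis Example
[Figure: a CDFG with +, * and < operators is scheduled into control steps 1 to 4, resources are allocated from the hardware library, and operations are bound to modules. Panel labels: control step, operator, CDFG, scheduling, hardware library, resource allocation, module binding]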
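Low Power Scheduling
Low-power methods proposed at the high level:
• Operator sharing among sibling operations [Fang, 96]
• Resource sharing that takes data correlation into account [Gebotys, 97]
• Shutting down FUs (demand-driven operation) [Alidina, 94]
• Exploiting the regularity of operations [Rabaey, 96]
• Using dual supply voltages [Sarrafzadeh, 96]
• Minimizing spurious operations [Hwang, 96]
• Minimizing switching activity with a minimum-cost flow algorithm, and minimizing capacitance by simplifying the interconnect structure [Cho, 97]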
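Register Power Consumption Model
$Power(Register) = switching(x) \cdot (C_{out,Mux} + C_{in,Register}) + switching(y) \cdot (C_{out,Register} + C_{in,DeMux})$
Since $switching(x) = switching(y)$, $Power(Register) = switching(y) \cdot C_{total}$
[Figure: inputs i, j, k pass through a MUX (node x) into the register, whose output (node y) feeds a DeMux producing i*, j*, k*; MUX and DeMux are driven by Control. $C_{out,Mux}$ and $C_{in,Register}$ load node x; $C_{out,Register}$ and $C_{in,DeMux}$ load node y.]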
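The register power model can be illustrated with a short sketch. It assumes that switching(y) is counted as the number of bit flips between consecutive values written to the register, which is one common reading and not something the slide states; `c_total` and the value sequences are hypothetical.

```python
def toggles(values, width=8):
    """Count bit flips between consecutive values written to one register."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(values, values[1:]))

def register_switching_power(values, c_total, width=8):
    """Power(Register) = switching(y) * C_total, in arbitrary units."""
    return toggles(values, width) * c_total

# The same four values stored in two different orders.
order_a = [0b0000, 0b1111, 0b0001, 0b1110]
order_b = [0b0000, 0b0001, 0b1111, 0b1110]

power_a = register_switching_power(order_a, c_total=1.0)
power_b = register_switching_power(order_b, c_total=1.0)
```

The second ordering flips fewer bits for the same data, which is the effect a low-power register allocation tries to exploit when it decides which variables share a register.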
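CDFG Representation of a Circuit
    e = a + b;
    g = c + d;
    f = e + b;
    h = f * g;
[Figure: CDFG (control data flow graph) of the code above. Inputs a and b feed adder +1, which produces e; c and d feed +2, which produces g; e and b feed +3, which produces f; f and g feed multiplier *1, which produces h.]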
Schematic to CDFG of FIR3
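Determining the Number of Registers and Resources
[Figure: scheduled CDFG and variable lifetime chart]
• Control step 1: +1 on adder A1 (inputs a, b; produces e)
• Control step 2: +2 and +3, on adders A1 and A2 (produce g and f)
• Control step 3: *1 on multiplier M1 (produces h)
• Lifetime chart: columns a, b, c, d, e, f, g, h against control steps 1 to 4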
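The resource and register counts for this example can be derived from the schedule with a short sketch. The schedule below is a reading of the slide (step 1: +1; step 2: +2 and +3; step 3: *1). The counting rules are assumptions made here: primary inputs are held in registers before step 1, the final result is held after the last step, and a value needs a register across every step boundary between its creation and its last use.

```python
from collections import Counter

# (result, operator kind, operands, control step), as read from the slide.
SCHEDULE = [
    ("e", "add", ("a", "b"), 1),
    ("g", "add", ("c", "d"), 2),
    ("f", "add", ("e", "b"), 2),
    ("h", "mul", ("f", "g"), 3),
]

def resources_needed(schedule):
    """Units of each kind = peak number of same-kind operations in one step."""
    per_step = Counter((kind, step) for _, kind, _, step in schedule)
    need = {}
    for (kind, _), count in per_step.items():
        need[kind] = max(need.get(kind, 0), count)
    return need

def registers_needed(schedule):
    """Peak number of values live across a control-step boundary."""
    last_step = max(step for *_, step in schedule)
    born = {res: step for res, _, _, step in schedule}
    last_use = {}
    for _, _, operands, step in schedule:
        for v in operands:
            born.setdefault(v, 0)                  # primary inputs exist before step 1
            last_use[v] = max(last_use.get(v, 0), step)
    for v in born:
        last_use.setdefault(v, last_step + 1)      # unused results are outputs
    # v is live across boundary s -> s+1 when born[v] <= s < last_use[v]
    return max(
        sum(1 for v in born if born[v] <= s < last_use[v])
        for s in range(last_step + 1)
    )

units = resources_needed(SCHEDULE)
registers = registers_needed(SCHEDULE)
```

Under these assumptions the schedule needs two adders, one multiplier and four registers; moving +2 into step 1 would not change the adder count, since step 2 would still hold +3 alone and step 1 would hold two additions.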