Multi-Core System on Chip Design Trends (continued). Presenter: 조준동

Low Power
System on Chip
Design
1
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation
2
Levels for Low Power Design
Techniques by level of abstraction:
• System: Hardware-software partitioning, Power down
• Algorithm: Complexity, Concurrency, Locality, Regularity, Data representation
• Architecture: Parallelism, Pipelining, Signal correlations, Instruction set selection, Data representation
• Circuit/Logic: Sizing, Logic Style, Logic Design
• Technology: Threshold Reduction, Scaling, Advanced packaging, SOI
Expected Saving:
• Algorithm: 10 - 100 times
• Architecture: 10 - 90%
• Logic Level: 20 - 40%
• Layout Level: 10 - 30%
• Device Level: 10 - 30%
3
Elements Required to Implement a High Performance System
• High Speed
• High Density
• Low Power per Gate
Enabled by: Reduced Swing Logic, Deep Submicron Technology, Channel Engineering, Low Voltage / Low VT, Advanced Technology, Low Capacitance
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low Power data encoding
• Quality of Service vs. Power
• Low Power Memory mapping
• Resource Sharing / Allocation
5
A Look at Power Consumption
• Components of power consumption in a digital circuit:
P = α·f·C·V_DD^2 + I_leak·V_DD + Q_short-circuit·f·V_DD
α : Switching Activity
f : Frequency
C : Capacitance
V_DD : Supply Voltage
I_leak : Leakage Current
Q_short-circuit : Short-Circuit Charge
6
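The power equation above can be sketched numerically. Everything below (the operating-point values and function name) is an illustrative assumption, not from the slides:

```python
# Sketch: evaluate P = a*f*C*Vdd^2 + Ileak*Vdd + Qsc*f*Vdd from the slide.
# All numeric values are illustrative, not from the slides.

def total_power(a, f, C, Vdd, Ileak, Qsc):
    """Return (dynamic, leakage, short_circuit) power components in watts."""
    dynamic = a * f * C * Vdd ** 2      # switching (dynamic) power
    leakage = Ileak * Vdd               # static leakage power
    short_circuit = Qsc * f * Vdd       # short-circuit power
    return dynamic, leakage, short_circuit

# Example point: a=0.2, f=100 MHz, C=1 nF switched cap, Vdd=2.5 V
dyn, leak, sc = total_power(0.2, 100e6, 1e-9, 2.5, 1e-3, 1e-12)
print(dyn, leak, sc)   # dynamic power dominates at this operating point
```

Note how the quadratic V_DD term makes supply-voltage reduction the most effective lever, which is the theme of the following slides.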
Vdd, power, and current trend
[Figure: projected supply voltage (V), power per chip (W), and V_DD current (A) versus year, 1998-2014. Source: International Technology Roadmap for Semiconductors, 1998 update.]
7
Three Factors affecting Energy
– Reducing waste by hardware simplification: redundant h/w extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing
– All-in-one approach (SOC): I/O pin and buffer reduction
– Voltage-reducible hardware
– 2-D pipelining (systolic arrays)
– SIMD / parallel processing: useful for data with parallel structure
– VLIW approach: flexible
8
Design Techniques for Reducing Power Consumption
• Adjusting the supply voltage
– Use a higher voltage only in the parts of the IC that need high speed.
– Put unused blocks into a sleep mode to cut their power consumption.
• Lowering the operating frequency
– Use parallel processing to keep the same throughput at a lower clock frequency; the resulting area increase is unavoidable.
– Avoid large clock buffers.
– Use a Phase-Locked Loop (PLL) to raise the frequency only where it is needed.
9
Design Techniques for Reducing Power Consumption
• Reducing parasitic capacitance
– Use short wires on critical nodes.
– Avoid fan-out greater than 3.
– Narrow the wires when a low supply voltage is used.
– Use the smallest transistors possible.
• Reducing switching activity
– Reduce the number of bits.
– Prefer static circuits to dynamic circuits.
– Reduce the total number of transistors.
– Make the most active nodes internal nodes.
10
Design Techniques for Reducing Power Consumption
• Reducing switching activity (continued)
– Design the logic so that the sum over all nodes of the frequency-capacitance product is minimized, i.e., so that the switching activity is statistically minimized:
sum_{i=1..n} f_i·C_i → min
f_i = mean switching frequency of node i
C_i = capacitance of node i
– When building a logic tree, place inputs with higher activity farther from V_DD or ground.
– Implement high-activity cells as dynamic logic and low-activity cells as static logic.
– Turn off the clock of flip-flops whose data do not change.
– Make it possible to disable the clock of cells that are not in use at all times.
11
Web browsing is slow with 802.11 PSM
[Cartoon]
Dad: "Son! Haven't I told you to turn on power-saving mode? Batteries don't grow on trees, you know!"
Son: "But dad! Performance SUCKS when I turn on power-saving mode!"
Dad: "So what! When I was your age, I walked 2 miles through the snow to fetch my Web pages!"
• Users complain about performance degradation
12
IBM's PowerPC
Lower Power Architecture
• Optimum supply voltage through hardware parallelism: pipelining and parallel instruction execution
– The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU)
– The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
– Low-power 3.3-volt design
• Small, simple instructions with a short instruction length
– IBM's PowerPC 603e is RISC
• Superscalar: CPI < 1
– The 603e issues as many as three instructions per cycle
• Low Power Management
– The 603e provides four software-controllable power-saving modes
• Copper process with SOI
• IBM's Blue Logic ASIC: the new design reduces power by a factor of 10
13
Power-Down Techniques
◆ Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work.
14
Voltage vs Delay
• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for the slower operation, e.g., parallel processing and pipelining to increase concurrency and reduce the critical path.
15
Why Copper Processor?
• Motivation: aluminum increasingly resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: less power drawn from batteries
• Chip Size: 60% smaller than an aluminum chip
16
Silicon-on-Insulator
• How does SOI reduce capacitance?
Junction capacitance is eliminated because an insulating SOI layer (similar to glass) is placed between the transistor impurities and the silicon substrate.
→ high performance, low power, low soft-error rate
17
Clock Network Power Management
• 50% of the total power
• FIR (massively pipelined circuit): video processing (edge detection), voice processing (data transmission such as xDSL)
• Telephony: 50% (70%/30%) idle, since both parties do not talk at the same time
• With every clock cycle, data are loaded into the working register banks, even if there are no data changes
18
Partitioning
• Performance Requirements
– Some functions are easier to implement in hardware
– Blocks that are used repeatedly
– Blocks with a parallel structure
• Modifiability
– Blocks implemented in software are easy to modify
• Implementation Cost
– Blocks implemented in hardware can be shared
• Scheduling
– Schedule the blocks partitioned into HW and SW so that the given constraints are met
– SW operations must be scheduled sequentially
– SW and HW can be scheduled concurrently as long as there are no data or control dependencies
19
Low Power Partitioning Approach
Different HW resources are invoked according to the instruction executed at a specific point in time.
• During the execution of an add operation, the ALU and registers are used, but the multiplier is idle.
• Non-active resources still consume energy, since the corresponding circuits continue to switch.
• Calculate the wasted energy.
20
Design Flow
[Flow diagram] Application → divide application into clusters → list schedule → compute utilization rate (uP) / compute utilization rate (ASIC) → core energy estimation → select cluster → HW synthesis → evaluate.
Results: up to 94% energy saving, in most cases even reduced execution time; 16k-cell overhead.
21
Integrated H/W and S/W Low-Power Design Optimization
[Flow diagram] S/W side: S/W core energy estimation → clustering → SW energy-efficiency computation → system-level energy estimation → algorithm selection. H/W side: integrated HW/SW cluster scheduling → HW energy-efficiency computation → cluster selection → H/W synthesis and energy estimation.
Results: up to 94% energy saving, in most cases even reduced execution time; 16k-cell overhead.
22
Integrated H/W and S/W Design of an IS-95 CDMA Searcher (황인기, Sungkyunkwan University)
[Design-space exploration diagram] Starting from an all-software implementation (synchronous accumulator, asynchronous accumulator, comparators, with SW energy estimates), blocks are progressively moved to hardware (PN-code generation, synchronous accumulators 1 and 2, asynchronous accumulator, comparators with precomputation, with HW energy estimates), trading off cost (speed, area, power) until the goal is reached.
23
Low Power DSP
• DO-loop dominant workloads:
– VSELP vocoder: 83.4%
– 2-D 8x8 DCT: 98.3%
– LPC computation: 98.0%
• DO-loop power minimization ==> DSP power minimization
VSELP : Vector Sum Excited Linear Prediction
LPC : Linear Prediction Coding
24
Loop unrolling
• The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism, and improves register, data cache, or TLB locality.

for i = 2 to N - 1
  A(i) = A(i) + A(i - 1) * A(i + 1)

becomes

for i = 2 to N - 2 step 2
  A(i) = A(i) + A(i - 1) * A(i + 1)
  A(i + 1) = A(i + 1) + A(i) * A(i + 2)

Loop overhead is cut in half because two iterations are performed per pass. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
25
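The unrolled loop above must produce bit-identical results to the rolled one. A minimal sketch checking this for the slide's recurrence (0-extended Python lists stand in for the 1-based arrays; the leftover-iteration handling is an assumption for odd trip counts):

```python
# Sketch of the slide's unrolling example (unrolling factor u = 2).
# The recurrence A(i) = A(i) + A(i-1)*A(i+1) is from the slide; the
# list must extend one element past index N, as in the pseudocode.

def rolled(a, n):
    a = a[:]                       # work on a copy
    for i in range(2, n):          # i = 2 .. N-1
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def unrolled(a, n):
    a = a[:]
    i = 2
    while i <= n - 2:              # i = 2 .. N-2 step 2
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i == n - 1:                 # leftover iteration for odd trip counts
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # indices 0..9, N = 9
print(rolled(data, 9) == unrolled(data, 9))  # -> True
```

Both versions perform the same operations in the same order, so the transformation changes loop overhead and parallelism without changing results.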
Loop Unrolling (IIR filter example)
Loop unrolling localizes the data to reduce the activity at the inputs of the functional units: two output samples are computed in parallel from two input samples.

Y_{n-1} = X_{n-1} + A * Y_{n-2}
Y_n = X_n + A * Y_{n-1} = X_n + A * (X_{n-1} + A * Y_{n-2})

Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:

Y_{n-1} = X_{n-1} + A * Y_{n-2}
Y_n = X_n + A * X_{n-1} + A^2 * Y_{n-2}

The transformation yields a critical path of 3, so the voltage can be dropped.
26
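A sketch verifying the algebra above: the unrolled first-order IIR, which computes two outputs from Y_{n-2} only, matches the rolled recurrence. The coefficient and input values are illustrative; the unrolled version assumes an even number of input samples:

```python
# Rolled form:   y[n] = x[n] + A*y[n-1]
# Unrolled form: y[n] = x[n] + A*x[n-1] + A^2*y[n-2]  (per the slide)

def iir_rolled(x, A, y_init):
    y, prev = [], y_init
    for xn in x:
        prev = xn + A * prev                   # y[n] = x[n] + A*y[n-1]
        y.append(prev)
    return y

def iir_unrolled(x, A, y_init):
    assert len(x) % 2 == 0                     # two outputs per iteration
    y, y2 = [], y_init                         # y2 carries y[n-2]
    for n in range(0, len(x), 2):
        yn1 = x[n] + A * y2                    # first output of the pair
        yn = x[n + 1] + A * x[n] + A * A * y2  # depends only on y[n-2]
        y += [yn1, yn]
        y2 = yn
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(iir_rolled(x, 0.5, 0.0))
print(iir_unrolled(x, 0.5, 0.0))   # same sequence, shorter recurrence chain
```

Because the second output no longer waits on the first, the two can be computed concurrently, which is what allows the voltage reduction the slide mentions.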
Loop Unrolling for Low Power
27
Loop Unrolling for Low Power
28
Loop Unrolling for Low Power
29
Designing a Parallel FIR
 To obtain a parallel processing structure, the SISO (single-input, single-output) system must be converted into a MIMO (multiple-input, multiple-output) system:
y(3k) = a·x(3k) + b·x(3k-1) + c·x(3k-2)
y(3k+1) = a·x(3k+1) + b·x(3k) + c·x(3k-1)
y(3k+2) = a·x(3k+2) + b·x(3k+1) + c·x(3k)
 Parallel processing systems are also referred to as block processing systems.
30
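The SISO-to-MIMO conversion above can be sketched directly: the block form consumes three inputs and emits three outputs per iteration, carrying only the last two inputs between blocks. Coefficients and data are illustrative:

```python
# Block processing (block size 3) of the slide's 3-tap FIR:
#   y[n] = a*x[n] + b*x[n-1] + c*x[n-2], zero initial conditions.

def fir_siso(x, a, b, c):
    xp = [0, 0] + x                     # prepend zero initial conditions
    return [a * xp[n + 2] + b * xp[n + 1] + c * xp[n]
            for n in range(len(x))]

def fir_block3(x, a, b, c):
    assert len(x) % 3 == 0
    y, x1, x2 = [], 0, 0                # x1 = x[3k-1], x2 = x[3k-2]
    for k in range(len(x) // 3):
        x0, xA, xB = x[3 * k], x[3 * k + 1], x[3 * k + 2]
        y += [a * x0 + b * x1 + c * x2,      # y(3k)
              a * xA + b * x0 + c * x1,      # y(3k+1)
              a * xB + b * xA + c * x0]      # y(3k+2)
        x1, x2 = xB, xA                 # carry the last two inputs
    return y

print(fir_siso([1, 2, 3, 4, 5, 6], 2, 3, 5))    # -> [2, 7, 17, 27, 37, 47]
print(fir_block3([1, 2, 3, 4, 5, 6], 2, 3, 5))  # same output, 1/3 the clock rate
```

The three output computations inside a block are independent, so in hardware they run in parallel at one third of the sample rate, enabling the voltage reduction discussed earlier.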
Parallel Processing (2)
 Parallel processing architecture for a 3-tap FIR
filter (with block size 3)
31
Parallel Processing (3)
<Combined fine-grain pipelining and parallel processing
for 3-tap FIR filter>
32
Motion Estimation
33
Block Matching Algorithm
34
Configurable H/W Paradigms
35
Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
– 720 x 576 pixels
– 16 x 16 macroblock (n = 16)
– 32 x 32 search area (p = 8)
– 25 Hz frame rate (f_frame = 25)
• 9 Giga operations/sec are needed for the full-search block matching algorithm.
36
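The ~9 GOPS figure above can be reproduced with back-of-the-envelope arithmetic. The accounting below assumes (2p+1)^2 candidate vectors per block position and 3 operations (subtract, absolute value, accumulate) per pixel comparison; other candidate-count conventions give slightly different totals:

```python
# Reproduce the slide's ~9 GOPS estimate for full-search block matching
# on CCIR 601 video. Operation accounting is an assumption (sub+abs+add).

pixels_per_frame = 720 * 576
p = 8                                    # search range: +/- p pixels
candidates = (2 * p + 1) ** 2            # 289 candidate motion vectors
ops_per_pixel_match = 3                  # subtract, absolute value, add
frame_rate = 25

ops_per_sec = pixels_per_frame * candidates * ops_per_pixel_match * frame_rate
print(ops_per_sec / 1e9)                 # ~9.0 Giga operations/sec
```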
Why Reconfiguration in Motion
Estimation?
• Adjusting the search
area at frame-rate
according to the
changing
characteristics of video
sequences
• Reducing Power
Consumption by
avoiding unnecessary
computation
Motion Vector Distributions
37
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995
38
Re-configurable Architecture for
ME
39
Power Estimation in Reconfigurable
Architecture
40
Power vs Search area
41
Resource Reuse in FPGAs
42
Motion Estimation - Conventional
43
Motion Estimation - Data Reuse
P_a = 2·P_add2 + P_abs / 2
P_b = P_add2 + P_add1 + P_abs / 2
P_abs ≈ 0.45·P_add2
Therefore, the power reduction factor is 11%.
44
Vector Quantization
• Lossy compression technique which exploits the
correlation that exists between neighboring samples and
quantizes samples together
45
Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector C_i is computed as:
D_i = sum_{j=0..15} (X_j - C_ij)^2
Three VQ encoding algorithms will be evaluated: full search, tree search, and differential codebook tree search.
46
Full Search
• Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent to the decoder.
• For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications, and 15 additions. In addition, the minimum of 256 distortion values, which involves 255 comparison operations, must be determined.
47
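A minimal sketch of the brute-force search just described; the codebook contents are illustrative (256 distinct 16-dimensional codewords), not a trained codebook:

```python
# Full-search VQ: compute the squared-error distortion against every
# codeword and return the index of the minimum, as on the slide.

def full_search_vq(x, codebook):
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        # per codeword: 16 memory accesses, 16 sub, 16 mul, 15 add
        d = sum((xj - cj) ** 2 for xj, cj in zip(x, c))
        if d < best_d:                   # part of the 255 comparisons
            best_i, best_d = i, d
    return best_i                        # code index sent to the decoder

# Illustrative codebook: entry i starts at value i, so all are distinct.
codebook = [[(i + 3 * j) % 256 for j in range(16)] for i in range(256)]
print(full_search_vq(codebook[42][:], codebook))   # -> 42 (distortion 0)
```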
Tree-structured Vector Quantization
If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree is never compared below level 2, and an index bit 0 is transmitted. Here only 2 x log2(256) = 16 distortion calculations with 8 comparisons are needed.
48
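A toy sketch of the tree search: descending an 8-level binary tree costs 2 distortion calculations and 1 comparison per level (16 and 8 in total) instead of 256 and 255. Real TSVQ trees are trained; here internal nodes are simply elementwise means of an illustrative codebook, which is an assumption:

```python
def dist(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def build_tree(codebook):
    """Pair codewords level by level; parents are elementwise means."""
    levels = [codebook]
    while len(levels[0]) > 1:
        cur = levels[0]
        parents = [[(a + b) / 2 for a, b in zip(cur[2 * i], cur[2 * i + 1])]
                   for i in range(len(cur) // 2)]
        levels.insert(0, parents)
    return levels                        # levels[8] holds the 256 leaves

def tsvq(x, tree):
    node = 0
    for level in range(1, len(tree)):    # 8 levels for 256 leaves
        left = tree[level][2 * node]
        right = tree[level][2 * node + 1]
        # 2 distortion calculations and 1 comparison per level
        node = 2 * node if dist(x, left) <= dist(x, right) else 2 * node + 1
    return node                          # leaf index sent to the decoder

codebook = [[float(i)] * 16 for i in range(256)]
tree = build_tree(codebook)
print(tsvq([42.0] * 16, tree))           # -> 42 for this codebook
```

For this monotone codebook the greedy descent finds the true nearest codeword; in general, tree search trades some distortion for the large reduction in operations.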
Algorithmic Optimization
• Minimizing the number of operations
– Example: a video data stream using the vector quantization (VQ) algorithm
– Distortion metric:
D_i = sum_{j=0..15} (X_j - C_ij)^2
– Full-search VQ
• exhaustive full search
• distortion calculations: 256
• value comparisons: 255
– Tree-structured VQ
• binary tree search
• some performance degradation
• distortion calculations: 16 (2 x log2 256)
• value comparisons: 8
[Figure: example binary VQ search tree with branch labels]
49
Differential Codebook Tree-structure Vector Quantization
• The distortion difference between the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations.
50
Algorithmic Optimization
– Differential codebook tree-structure VQ
• modify the equation to optimize the number of operations:
D_left-right = sum_{j=0..15} (X_j - C_left,j)^2 - sum_{j=0..15} (X_j - C_right,j)^2
             = sum_{j=0..15} (C_left,j^2 - C_right,j^2) + 2·sum_{j=0..15} X_j·(C_right,j - C_left,j)

algorithm | # of mem. access | # of mul. | # of add. | # of sub.
full search | 4096 | 4096 | 3840 | 4096
tree search | 256 | 256 | 240 | 264
differential tree search | 136 | 128 | 128 | 0
51
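A sketch verifying the rearrangement: the left-right distortion difference equals a precomputable codebook term plus one correlation term, removing all subtractions from the encoding loop. The vectors below are illustrative 16-dimensional integer data:

```python
# naive:        sum((x-cl)^2) - sum((x-cr)^2)      (needs subtractions)
# differential: pre + 2*sum(x * (cr-cl))           (mul/add only online)

def naive_diff(x, cl, cr):
    return (sum((xj - aj) ** 2 for xj, aj in zip(x, cl))
            - sum((xj - bj) ** 2 for xj, bj in zip(x, cr)))

def differential(x, cdiff, pre):
    # cdiff[j] = cr[j]-cl[j] and pre = sum(cl^2 - cr^2) are computed
    # offline when the codebook is built (the "differential codebook"),
    # so the per-input loop needs only multiplications and additions.
    return pre + 2 * sum(xj * dj for xj, dj in zip(x, cdiff))

cl = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
cr = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5]
cdiff = [b - a for a, b in zip(cl, cr)]            # offline
pre = sum(a * a - b * b for a, b in zip(cl, cr))   # offline
x = [1, 6, 1, 8, 0, 3, 3, 9, 8, 8, 7, 4, 9, 6, 2, 6]
print(naive_diff(x, cl, cr) == differential(x, cdiff, pre))  # -> True
```

The sign of the difference is all the tree descent needs, which is why this form achieves the zero-subtraction column in the table above.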
Multiplication and Accumulation: MAC
• Major operation in DSP
[Datapath figure] X and Y feed a multiplier (MULT, built from a CSA tree and a CPA), followed by an accumulator (ACC) and product register (PR). Modified Booth Encoding selects one of 0, X, -X, 2X, -2X based on each 2 bits of Y.
• A multiply costs more than five ALU operations: MUL > (5 * ALU)
52
Operand Swapping (1/2)
• Weight = how many additions are needed?
Y = 00111100 → 00X000X0 by Booth encoding → Weight = 2
• Put the low-weight operand on the Booth-encoded port, even when the inputs show high switching:

Consecutive operand pattern | A*B current (mW) | B*A current (mW) | Saving
A: 7FFF↔0001, B: AAAA↔AAAA | 22.0 | 10.0 | 54%
A: 7FFF↔0001, B: 6666↔AAAA | 31.6 | 10.0 | 68%
A: 7FFF↔0001, B: AAAA↔0001 | 28.8 | 12.2 | 58%
53
DIGLOG multiplier
C_mult(n) ≈ 253·n^2, C_add(n) ≈ 214·n, where n = word length in bits

A = 2^j + A_R, B = 2^k + B_R
A·B = (2^j + A_R)·(2^k + B_R) = 2^j·B + 2^k·A_R + A_R·B_R

                    1st Iter | 2nd Iter | 3rd Iter
Worst-case error:     -25%   |   -6%    |  -1.6%
Prob. of error < 1%:   10%   |   70%    |  99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
54
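A sketch of the scheme above: take 2^j as the leading-one weight of A, approximate A·B as 2^j·B + 2^k·A_R, and iterate on the dropped residual product A_R·B_R. The function name and iteration cap are illustrative:

```python
# DIGLOG-style approximate multiplication: each iteration replaces the
# residual product A_R*B_R; the dropped term bounds the error at -25%
# after one step, and iterating until a residual is zero is exact.

def diglog_mult(a, b, iters):
    total = 0
    for _ in range(iters):
        if a == 0 or b == 0:
            break                        # residual product is exactly 0
        j = a.bit_length() - 1           # A = 2^j + A_R
        k = b.bit_length() - 1           # B = 2^k + B_R
        ar, br = a - (1 << j), b - (1 << k)
        total += (b << j) + (ar << k)    # 2^j*B + 2^k*A_R
        a, b = ar, br                    # next: approximate A_R*B_R
    return total

exact = 200 * 100
print(diglog_mult(200, 100, 1) / exact)    # one step: within -25% of exact
print(diglog_mult(200, 100, 16) == exact)  # -> True (residuals reach 0)
```

Only shifts and additions appear in the loop, which is the point: C_add grows linearly in n while C_mult grows quadratically.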
Voltage Scaling
• Merely changing a processor clock frequency is
not an effective technique for reducing energy
consumption. Reducing the clock frequency will
reduce the power consumed by a processor,
however, it does not reduce the energy required
to perform a given task.
• Lowering the voltage along with the clock
actually alters the energy-per-operation of the
microprocessor, reducing the energy required to
perform a fixed amount of work.
55
Energy consumption ( Vdd2)
Different Voltage Schedules
Timing constraint
40J
1000Mcycles
5.02
(A)
50MHz
0
5.02
5
10
15
32.5J
20
25
750Mcycles
250Mcycles
50MHz
25MHz
Time(sec)
(B)
2.52
0
5
5.02
4.02
10
15
20
25
Time(sec)
(C)
25J
1000Mcycles
40MHz
0
5
10
15
20
25
Time(sec)
56
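The three schedules above follow from E ∝ Vdd^2 × cycles. The sketch below normalizes the proportionality constant so that schedule (A) dissipates 40 J; that constant is an illustrative assumption, not given on the slide:

```python
# Energy of the three voltage schedules, E = k * Vdd^2 * cycles.
k = 40.0 / (5.0 ** 2 * 1000e6)           # J per (V^2 * cycle), normalized

def energy(segments):
    """segments: list of (cycles, Vdd) pairs."""
    return sum(k * v * v * c for c, v in segments)

E_A = energy([(1000e6, 5.0)])                 # full speed, idle afterwards
E_B = energy([(750e6, 5.0), (250e6, 2.5)])    # drop voltage for the tail
E_C = energy([(1000e6, 4.0)])                 # run slower the whole time
print(E_A, E_B, E_C)   # ~40.0, ~32.5, ~25.6 (slide rounds the last to 25 J)
```

Schedule (C) is best: stretching the work over the entire deadline at the lowest feasible voltage beats racing to idle, which is the core argument for voltage scheduling.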
Data Driven Signal Processing
The basic idea of averaging: two samples are buffered and their workloads are averaged. The averaged workload is then used as the effective workload to drive the power supply. Using a ping-pong buffering scheme, data samples I_{n+2}, I_{n+3} are being buffered while I_n, I_{n+1} are being processed.
57
RTL: Multiple Supply Voltages
Scheduling
Filter Example
58
A hardware / software partitioning technique with
hierarchical design space exploration
Houria Oudghiri, Bozena Kaminska, and Janusz Rajski,
Mentor Graphics Corp.
• A set of DSP examples is considered for co-design on a specific architecture in order to accelerate their performance on a target architecture comprising a standard DSP processor running concurrently with a custom SIMD (Single Instruction Multiple Data) processor.
59
Proposed methodology
Input: list of blocks and time constraints. Output: two subsets (HW and SW) to which the blocks are assigned.
Step 1: Construct the complete weighted dependency graph G.
Step 2: Assign all blocks to software; compute the complete system execution time.
Step 3: while (time constraints not satisfied) do
  Step 3.1: Select the node with the maximum execution time (i).
  Step 3.2: Assign i to hardware; update the system execution time.
  Step 3.3: while (time constraints not satisfied) do
    Step 3.3.1: Select the maximum-weighted edge connected to i with the most time-consuming node (j).
    Step 3.3.2: Assign j to hardware; update the dependency graph G and the system execution time.
  end do
end do
60
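A minimal executable sketch of the greedy partitioning loop above, on a toy task graph. Block names, times, edge weights, and the uniform 4x hardware speedup are invented, and execution time is modeled as a plain sum of assigned times (ignoring concurrency), which is a simplifying assumption:

```python
def partition(sw_time, hw_time, edges, constraint):
    """Greedy sketch. edges: {(u, v): weight}. Returns block -> 'HW'/'SW'."""
    part = {n: "SW" for n in sw_time}                       # step 2
    def exec_time():
        return sum((hw_time if part[n] == "HW" else sw_time)[n]
                   for n in part)                           # crude sum model
    while exec_time() > constraint:                         # step 3
        sw_nodes = [n for n in part if part[n] == "SW"]
        if not sw_nodes:
            break                                           # infeasible
        i = max(sw_nodes, key=lambda n: sw_time[n])         # step 3.1
        part[i] = "HW"                                      # step 3.2
        # step 3.3: follow i's heaviest edges while still too slow
        neighbours = sorted(
            ((w, v) for (a, b), w in edges.items()
             for v in ((b,) if a == i else (a,) if b == i else ())
             if part[v] == "SW"),
            reverse=True)
        for _, j in neighbours:
            if exec_time() <= constraint:
                break
            part[j] = "HW"
    return part

# Toy example (all numbers invented; HW assumed 4x faster than SW)
sw = {"fft": 10, "bitrev": 4, "win": 3, "io": 2}
hw = {n: t / 4 for n, t in sw.items()}
edges = {("fft", "bitrev"): 5, ("fft", "win"): 2, ("bitrev", "io"): 1}
print(partition(sw, hw, edges, 12))   # only "fft" moves to hardware
```

Tightening the constraint pulls the heaviest-edge neighbours of the moved node into hardware as well, mirroring step 3.3.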
co-design target architecture
The Texas Instruments DSP processor TMS320C40 is
used as the master processor and the custom SIMD
processor PULSE (Parallel Ultra Large Scale Engine, 4
processors in parallel) as the slave processor
61
The Hierarchical Model of the FFT Transform
[Hierarchical block diagram, levels 1-8] Level 1: FFT. Level 2: Initialize, Bit Reversal, Danielson, Output. Lower levels decompose these into blocks such as Initialize Variable, Initialize Data, Bit_init, Index_init, Read_data, Index_incr, Bit_loop1, Bit_loop2, Bit_cond, Bit_incr, Bit_shift, Bit_test, Bit_swap1, Bit_swap2, Bit_acc, Loop2_init, Loop2_test, Loop2_ass, Loop2_shift, Loop2_body, Loop2_incr, Data_test, Dan_init, Dan_loop1, Loop1_init, Loop1_body, Loop1_incr, Dan_real, Dan_imag, Danielson control, Out_init, Out_write, Out_incr, and Update Variables.
62
Block Assignment at Different Hierarchical Levels
(time constraint = 25 ms)

level | Nb. of blocks | C40 | PULSE | PULSE time (ms) | C40 time (ms) | Total (ms)
1 | 4 | 2 | 2 | 18.14 | 4.8 | 22.94
2 | 10 | 6 | 4 | 18.8 | 2.96 | 21.76
3 | 17 | 11 | 6 | 15.56 | 9 | 24.56
4 | 22 | 18 | 6 | 14.68 | 10.24 | 24.92
5 | 24 | 17 | 7 | 14.56 | 10.4 | 24.94
6 | 24 | 22 | 2 | 6.82 | 17.72 | 24.54
7 | 25 | 22 | 3 | 7 | 17.92 | 24.92
8 | 27 | 18 | 9 | 5.88 | 18.64 | 24.52
63
Function-Architecture Co-Design
Cadence
64
65
SystemC support:
– Mentor Graphics - Seamless® C-Bridge™
– Verisity - SpecMan™ Elite
– Forte Design Systems - ESC Library
– Emulation & Verification Engineering - Zebu
– Axys Design - MaxSim™
– CoWare - N2C updated for SystemC 2.0
– Cadence - SPW 4.8 / SystemC v2.0 IF
– Synopsys - CoCentric System Studio
• Plus Kluwer book - "System Design Using SystemC", 2002
66
OCAPI-xl design flow
67
Application Structure
68
Specification and modeling
• Executable specification - Verilog, VHDL, C, C++, Java.
• Common models: synchronous dataflow (SDF), sequential programs (Prog.), communicating sequential processes (CSP), object-oriented programming (OOP), FSMs, hierarchical/concurrent FSM (HCFSM).
• Depending on the application domain and specification semantics, they are based on different models of computation.
69
Hardware Synthesis
• Many RTL, logic-level, and physical-level commercial CAD tools.
• Some emerging high-level synthesis tools: Behavioral Compiler (Synopsys), Monet (Mentor Graphics), and RapidPath (DASYS).
• Many open problems: memory optimization, parallel heterogeneous hardware architectures, programmable hardware synthesis and optimization, communication optimization.
70
Software synthesis
• The use of real-time operating systems (RTOSs)
• The use of DSPs and micro-controllers - code generation issues
• Special processor compilation in many cases is still far less efficient than manual code generation!
• Retargeting issues - C code developed for the TI TMS320C6x is not optimized for running on the Philips TriMedia processor.
71
Interface synthesis
• Interface between: hardware-hardware, hardware-software, software-software
• Timing and protocols
• Recently, the first commercial tools appeared: the CoWare system (hw-sw protocols) and the Synopsys Protocol Compiler (hw interface synthesis tool)
72
Co-design Sites
• Bibliography of Hardware/Software Codesign: http://www-ti.informatik.uni-tuebingen.de/~buchen/
• Ralf Niemann's Codesign Links and Literature: http://ls12-www.informatik.uni-dortmund.de/~niemann/codesign/codesign_links.html
• URLs to Hardware/Software Co-Design Research: http://www.ece.cmu.edu/~thomas/hsURL.html
• RASSP Architecture Guide: http://www.sanders.com/hpc/ArchGuide/TOC.html
• EDA, Electronic Design Automation: http://www.eda.org
• COMET (Case Western Reserve University): http://bear.ces.cwru.edu/research/hard_soft.html
• COSMOS (Tima - Cmp, France): http://tima-cmp.imag.fr/Homepages/cosmos/research.html
• COSYMA (Braunschweig): http://www.ida.ing.tu-bs.de/projects/cosyma/
• Handel-C (Oxford): http://oldwww.comlab.ox.ac.uk/oucl/hwcomp.html
• Lycos (Technical University of Lyngby, Denmark): http://www.it.dtu.dk/~lycos/
• MOVE (Technical University Delft): http://cardit.et.tudelft.nl/MOVE/
• Polis (University of Berkeley): http://www-cad.eecs.berkeley.edu/Respep/Research/hsc/abstract.html
• ProCos (UK Research): http://www.comlab.ox.ac.uk/archive/procos/codesign.html
• Ptolemy (University of Berkeley): http://ptolemy.eecs.berkeley.edu/
• SPAM (Princeton): http://www.ee.princeton.edu/~spam/
• TRADES (University of Twente, INF/CAES): http://wwwspa.cs.utwente.nl/aid/aid.html
• Specification languages - SystemC: http://www.systemc.org
73
SOC CAD Companies
• Cadence: www.cadence.com
• Duet Tech: www.duettech.com
• Escalade: www.escalade.com
• LogicVision: www.logicvision.com
• Mentor Graphics: www.mentor.com
• Palmchip: www.palmchip.com
• Sonics: www.sonicsinc.com
• Summit Design: www.summit-design.com
• Synopsys: www.synopsys.com
• Topdown Design Solutions: www.topdown.com
• Xynetix Design Systems: www.xynetix.com
• Zuken-Redac: www.redac.co.uk
74