Techniques for Low Power Turbo Coding in Software Radio
Joe Antoon, Adam Barnett

Software Defined Radio
• Single transmitter for many protocols
• Protocols completely specified in memory
• Implementation:
– Microprocessors
– Field-programmable logic

Why Use Software Radio?
• Wireless protocols are constantly reinvented
– 5 Wi-Fi protocols
– 7 Bluetooth protocols
– Proprietary mouse and keyboard protocols
– Mobile phone protocol alphabet soup
• Custom DSP logic for each protocol is costly

So Why Not Use Software Radio?
• Requires high-performance processors
• Consumes more power
[Figure: efficiency spectrum from inefficient general-purpose processors, through field-programmable logic, to efficient application-specific hardware]

Turbo Coding
• Channel coding technique
• Throughput nears the theoretical limit
• Great for bandwidth-limited applications
– CDMA2000
– WiMAX
– NASA's MESSENGER probe

Turbo Coding Considerations
• Presents a design trade-off
• Turbo coding is computationally expensive
• But it reduces cost in other areas
– Bandwidth
– Transmission power

Reducing Power in Turbo Decoders
• FPGA turbo decoders
– Use dynamic reconfiguration
• General-processor turbo decoders
– Use a logarithmic number system

Generic Turbo Encoder
[Figure: data stream s feeds one component encoder directly (parity p1) and a second component encoder through an interleaver (parity p2)]

Generic Turbo Decoder
[Figure: received streams r, q1, q2 pass through two component decoders connected by an interleaver]

Decoder Design Options
• Multiple algorithms can be used to decode
• Maximum A-Posteriori (MAP)
– Most accurate estimate possible
– Requires complex computations
• Soft-Output Viterbi Algorithm (SOVA)
– Less accurate
– Simpler calculations

Decoder FPGA Design Options
• Goal: make an adaptive decoder
[Figure: decoder takes the received data, parity, and original sequence; a tunable parameter trades low power and low accuracy against high power and high accuracy]

Component Encoder
• The generator function is built from M blocks that are 1-bit registers
• This memory provides the encoder state
[Figure: trellis of the 2-bit encoder state (00, 01, 10, 11) over time for generator-function outputs 0 and 1]

Viterbi's Algorithm
• Determine the most likely output
• Simulate the encoder state given the received values
[Figure: trellis relating states s0, s1, s2, …, received values r0, r1, r2, … and parities p0, p1, p2, … to data bits d0, d1, d2, … over time]
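The generic encoder structure above can be sketched in a few lines of Python. This is a behavioral sketch only: the shift-register taps and the interleaver permutation are hypothetical, and practical turbo codes normally use recursive systematic component encoders rather than the plain feed-forward register chain shown here.

```python
def component_parity(bits, taps=(1, 1, 1)):
    """Parity stream from a chain of 1-bit memory registers.

    Each output bit is the XOR (mod-2 sum) of the tapped positions of
    the current input bit and the register contents. The taps are a
    hypothetical generator function, not one from the talk.
    """
    state = [0] * (len(taps) - 1)   # M one-bit registers hold the encoder state
    out = []
    for b in bits:
        window = [b] + state
        out.append(sum(t & w for t, w in zip(taps, window)) % 2)
        state = [b] + state[:-1]    # shift the new bit into the register chain
    return out


def turbo_encode(bits, interleave):
    """Generic turbo encoder: systematic stream s plus two parity streams,
    the second computed on an interleaved copy of the data."""
    p1 = component_parity(bits)
    p2 = component_parity([bits[i] for i in interleave])
    return bits, p1, p2
```

With taps (1, 1, 1), a single 1 followed by zeros produces the impulse response 1, 1, 1, 0 as the bit shifts through the two registers.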
Viterbi's Algorithm
• Write: compute the branch metric (likelihood)
• Traceback: compute the path metric, output data
• Update: compute the distance between paths
• Rank paths by path metric and choose the best
• For N bits of memory:
– Must calculate 2^(N-1) paths for each state

Adaptive SOVA
• SOVA: the inflexible path system scales poorly
• Adaptive SOVA: a heuristic
– Limit to at most M paths
– Discard a path if its path metric falls below a threshold T
– Discard all but the top M paths when there are too many

Implementing in Hardware
[Figure: controller feeding r and q into a branch metric unit, then an add-compare-select unit, then survivor memory]

Implementing in Hardware
• Controller
– Controls memory
– Selects paths
• Branch Metric Unit
– Computes likelihood
– Considers all possible "next" states
• Add, Compare, Select
– Appends path metrics
– Discards paths
• Survivor Memory
– Stores / discards path bits

Implementing in Hardware: Add, Compare, Select Unit
[Figure: present-state path values and branch values are combined and compared against the threshold T, producing next-state path values and path distances]

Dynamic Reconfiguration
• Bit Error Rate (BER)
– Changes with signal strength
– Changes with the number of paths used
• Change the hardware at runtime
– Weak signal: use many paths, preserve accuracy
– Strong signal: use few paths, save power
– Sample the SNR every 250k bits, then reconfigure

Dynamic Reconfiguration: Experimental Results
• K (the number of encoder bits) is proportional to average speed and power

Experimental Results
• FPGA decoding has a much higher throughput
• Due to parallelism

Experimental Results
• ASOVA performs worse than commercial cores
• However, it is much better in other metrics
– Power
– Memory usage
– Complexity

Future Work
• Use present reconfiguration means to design
– Partial reconfiguration
– Dynamic voltage scaling
• Compare to power-efficient software methods

Power-Efficient Implementation of a Turbo Decoder in an SDR System
• Turbo coding systems are built on one of three general processor types
– Fixed Point (FXP)
• Cheapest, simplest to implement, fastest
– Floating Point (FLP)
• More precision than fixed point
– Logarithmic Number System (LNS)
• Simplifies complex operations
• Complicates simple add/subtract operations

Logarithmic Number System
• X = {s, x = log_b(|X|)}
– s is the sign bit; the remaining bits hold the number's value
• Example
– Let b = 2
– Then the decimal number 8 is represented as log_2(8) = 3
– Numbers are stored in memory in two's complement form (3 = 01111101, sign bit = 0)

Why Use a Logarithmic System?
• Greatly simplifies multiplication, division, roots, and exponents
– Multiplication simplifies to addition
• E.g. 8 * 4 = 32; in LNS, 3 + 2 = 5 (2^5 = 32)
– Division simplifies to subtraction
• E.g. 8 / 4 = 2; in LNS, 3 - 2 = 1 (2^1 = 2)

Why Use a Logarithmic System?
• Roots become right shifts
– E.g. sqrt(16) = 4; in LNS, 4 shifted right = 2 (2^2 = 4)
• Exponents become left shifts
– E.g. 8^2 = 64; in LNS, 3 shifted left = 6 (2^6 = 64)

So Why Not Use LNS for All Processors?
• Unfortunately, addition and subtraction are greatly complicated in LNS
– Addition: log_b(|X| + |Y|) = x + log_b(1 + b^z)
– Subtraction: log_b(|X| - |Y|) = x + log_b(1 - b^z)
• where z = y - x
• Turbo coding/decoding is computationally intense, requiring more multiplies, divides, roots, and exponentiations than adds or subtracts

Turbo Decoder Block Diagram
[Figure: block diagram of the decoder datapath]
• Each bit decision requires a subtraction, a table lookup, and an addition

Proposed New Block Diagram
• As the difference between e^a and e^b grows, the error between the value stored in the lookup table and the exact computation becomes negligible
• For this simulation a difference of > 5 was used

How It Works
• For d > 5, the new mux (on the right) ignores the SRAM input and simply adds 0 to the MAX result
• For d > 5, pre-decoder circuitry disables the SRAM to conserve power
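The d > 5 gating described above can be modeled in Python. This is a behavioral sketch, not the hardware: `math.log1p` stands in for the SRAM lookup table of correction values, and the 5.0 cutoff is the threshold the simulation used.

```python
import math

def max_star_exact(a, b):
    # Max*(a, b) = ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a - b|)
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

LUT_CUTOFF = 5.0  # difference beyond which the correction term is negligible

def max_star_gated(a, b):
    """Proposed design: when |a - b| > 5, the mux bypasses the lookup table
    and adds 0 to the MAX result, so the SRAM can be disabled to save power."""
    d = abs(a - b)
    if d > LUT_CUTOFF:
        return max(a, b)  # correction skipped; SRAM would be powered down
    return max(a, b) + math.log1p(math.exp(-d))  # stands in for the SRAM read
```

When the gate fires, the dropped correction term is ln(1 + e^-d) < ln(1 + e^-5) ≈ 0.0067, which is why the lookup-table error becomes negligible for large differences.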
Comparing the Three Simulations
• Comparisons were made between a 16-bit fixed-point microcontroller, a 16-bit floating-point processor, and a 20-bit LNS processor
• 11 bits would be sufficient for FXP and FLP, but 16-bit processors are much more common
• Similarly, 17 bits would suffice for the LNS processor, but 20-bit is a common width

Power Consumption and Latency
• Recall: Max*(a, b) = ln(e^a + e^b)

Power Savings
• The pre-decoder circuitry adds 11.4% power consumption relative to an SRAM read
• So when an SRAM read is required, the modified system uses 111.4% of the unmodified system's power
• However, when the SRAM is blocked, it uses only 11.4% of the power used before

Power Savings
• The CACTI simulations reported that the Max* operation accounted for 40% of all operations in the decoder
• The modified system's Max* operations used 69% less power than the unmodified system's
• This yields an overall power saving of 69% * 40% = 27.6%

Conclusion
• Turbo codes are computationally intense, requiring more complex operations than simple ones
• LNS processors simplify complex operations at the expense of making addition and subtraction more difficult

Conclusion
• Using an LNS processor with slight modifications can reduce power consumption by 27.6%
• Overall latency is also reduced, because complex operations are easier in an LNS processor than in FXP or FLP processors
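The savings arithmetic above can be checked in a few lines of Python. The percentages are the figures reported on the slides; the reading that the modified Max* unit saves 69% of its power (rather than retaining 69%) is an assumption made so the numbers multiply out to the stated 27.6%.

```python
# Reported figures from the CACTI simulations:
predecoder_overhead = 0.114  # extra power of the pre-decoder vs. one SRAM read
max_star_share = 0.40        # fraction of decoder operations that are Max*
max_star_saving = 0.69       # assumed fractional power saved on modified Max*

# When the SRAM must still be read, the lookup costs 100% + 11.4%.
read_cost = 1.0 + predecoder_overhead   # 1.114 of the original read power
# When the lookup is gated off, only the pre-decoder itself draws power.
gated_cost = predecoder_overhead        # 0.114 of the original read power

# Overall saving: 69% saved on the 40% of operations that are Max*.
overall_saving = max_star_saving * max_star_share
print(round(overall_saving, 3))  # 0.276, i.e. the slides' 27.6%
```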