2 Computer arithmetics

Digital systems are implemented on hardware with finite word length. Implementations require special attention because of possible quantization and arithmetic errors.

Part I: Real number representations
• Characteristics
• Basis: integers
• Fixed-point numbers
• Floating-point formats

Part II: Design flows of fixed-point solutions
• Analysis-based flow
• Simulation-based flow

2.1 Background: real number representations

Two categories: fixed-point and floating-point. In both cases, a certain number of significant bits represents an integer value, and an associated scaling maps that integer to some real number. Both consist of a sign bit (s) and a significand (f), the significant bits interpreted as an integer or a fraction; a floating-point representation additionally carries an exponent (e), whereas in a fixed-point representation the scaling is fixed during design.

Exponent: floating-point hardware performs the appropriate scaling at run-time. In the case of fixed-point numbers this is a design-time decision. Can be hard!

Implications for choosing the computational platform:
• Do we really need an optimized fixed-point solution?
• Or do we want an easier, money-saving design process?
• In addition, the floating-point HW might not contain all features of the standards, and the word length might also be limited.

2.1.1 Characterization of representations

Word length: the total number of bits in a representation
Bit precision: the number of significant bits
Range: the smallest and the largest representable value; exceeding it causes overflow
Precision: the smallest interval between two consecutive numbers (unit in the last place, ulp); relates to roundoff noise and underflow
Dynamic range: the ratio between the largest and the smallest representable absolute values: dynamic range in dB = 20 log10 (Amax / Amin)
• Fixed-point case: all numbers are represented with the same precision; 6 dB per bit
• Floating-point case: large numbers are represented with less precision; the dynamic range is huge

Comparison of three 32-bit representations:
• 32-bit signed integer: bit precision 32, max. absolute value 2^31 ≈ 2.1 × 10^9, ulp 1, dynamic range 187 dB
• 32-bit signed fractional: bit precision 32, max. absolute value 1, ulp 2^-31, dynamic range 187 dB
• IEEE-754 single-precision floating-point: bit precision 24 (23-bit significand + implicit leading bit), max. absolute value ≈ 3.4 × 10^38, ulp from 2^-149 (for Emin) to 2^104 (for Emax), e.g. 256 around 2.1 × 10^9; dynamic range 1535 dB

2.1.2 Integers

Fixed-point and floating-point representations are based on representations of integer values. The choice of a particular representation depends on what has to be done with the numbers.

1. Unsigned integers
2. Signed integers
• sign-magnitude: used to encode the significant bits in floating-point formats
• one's complement: its advantage is easy negation
• two's complement: the common choice for fixed-point formats
• biased representation (excess-B): a bias value B is subtracted from an unsigned integer value. Examples: offset binary (ADC output, DAC input), encoding of exponents in floating-point formats

Example. 3-bit representations (offset binary with B = 4):

a2 a1 a0   sign-magnitude   1's complement   2's complement   offset binary
011        +3               +3               +3               -1
010        +2               +2               +2               -2
001        +1               +1               +1               -3
000         0                0                0               -4
111        -3               -0               -1               +3
110        -2               -1               -2               +2
101        -1               -2               -3               +1
100        -0               -3               -4                0

2.1.3 Fixed-point representation

Fixed-point numbers are based on scaling of integers (unsigned or signed). Two's complement integers are used as the basis for signed fixed-point.

(1) Binary point scaling: the value represented is Ṽ = Vint / 2^n, where Vint is the integer value represented by the bit string and n is the number of fraction bits. Notation: up.n for unsigned and sp.n for signed formats, where p is the word length and n is the number of fraction bits. For example, s8.3 is an 8-bit signed format with three fraction bits: a7 a6 a5 a4 a3 • a2 a1 a0.

Example. 4-bit signed fixed-point numbers:

format   binary point position   range                precision (ulp)
s4.-1    a3 a2 a1 a0 0 •         -16 ... +14          2
s4.0     a3 a2 a1 a0 •           -8 ... +7            1
s4.1     a3 a2 a1 • a0           -4 ... +3.5          0.5
s4.2     a3 a2 • a1 a0           -2 ... +1.75         0.25
s4.3     a3 • a2 a1 a0           -1 ... +0.875        0.125
s4.4     • a3 a2 a1 a0           -0.5 ... +0.4375     0.0625
s4.5     • _ a3 a2 a1 a0         -0.25 ... +0.21875   0.03125

(In s4.5 the binary point lies one position to the left of a3.)
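Binary point scaling can be sketched in a few lines of Python (an illustration added here, not part of the slides; the helper name to_fixed is invented):

```python
def to_fixed(x, p, n):
    """Quantize a real x to the signed fixed-point format sp.n.

    Rounds to the nearest representable value and saturates on overflow.
    Returns (stored integer, real value represented). Negative n is
    allowed, as in the s4.-1 format above.
    """
    lo, hi = -(1 << (p - 1)), (1 << (p - 1)) - 1   # two's complement range
    v = round(x * 2 ** n)                          # scale by 2^n, round
    v = max(lo, min(hi, v))                        # saturate
    return v, v / 2 ** n

print(to_fixed(0.8, 4, 3))    # (6, 0.75): nearest s4.3 value, ulp = 0.125
print(to_fixed(2.0, 4, 3))    # (7, 0.875): saturated to the s4.3 maximum
print(to_fixed(20.0, 4, -1))  # (7, 14.0): saturated to the s4.-1 maximum
```

The returned pairs reproduce the ranges and ulps listed in the table above.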
(2) Slope-bias scaling: the value represented is Ṽ = s × Vint + b, where s > 0 is called the slope and b is the (offset) bias. The slope can be represented as s = f × 2^e, where 1 ≤ f < 2 is called the fractional slope and e shows the radix point position.
- binary point scaling is the special case b = 0, f = 1, e = -n
- the precision (the weight of the least significant bit) is equal to the slope
- the goal of slope (and bias) selection: utilization of the full dynamic range

Example. We want to represent angles in the range [-π, π) with maximal precision using 6 bits.
1) Binary point scaling:
- we must have two integer bits, since the integer part has to reach 011 = 3
- thus the format to be used is s6.3
- the range is [-4, 4 - 2^-3]
- the precision of the format is ulp = 2^-3 = 0.125
2) Slope-bias scaling:
- using zero bias, we choose s = π / 2^5 to utilize the full dynamic range
- the range is then π × [-1, 1 - 2^-5] and ulp = s ≈ 0.0982

2.1.4 Fixed-point arithmetics

(1) Addition: guard bits
- sp.n + sp.n → s(p + 1).n
- g guard bits are added to the accumulator of the MAC data path: two n-bit operands produce a 2n-bit product, which is accumulated in a (2n + g)-bit register
- g ≥ ⌈log2(N)⌉, where N is the number of terms to be added
- in MAC-based FIR filtering, the bound for g depends on the coefficient values

(2) Multiplication:
• The law of conservation of bits: sp1.n1 × sp2.n2 → s(p1 + p2).(n1 + n2)
- note: one extra integer bit is introduced (e.g. s4.3 × s4.3 → s8.6; the operands have 4 - 3 - 1 = 0 integer bits, the product 8 - 6 - 1 = 1)
- if the largest negative value does not occur, that extra integer bit is not needed
• Modes of arithmetic:
- in integer arithmetic, fixed-point values are treated as integers, and the programmer must take care that overflows do not occur (e.g. intermediate scaling operations, coefficient magnitude restrictions)
- in fractional arithmetic, one uses fractional fixed-point values.
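As a numeric sketch of the conservation-of-bits rule (illustrative Python added here, not from the slides; fixed_mul is an invented name), two s4.3 operands are multiplied as integers and the product is read in the s8.6 format:

```python
def fixed_mul(a_int, b_int, n_a, n_b):
    """Multiply two signed fixed-point values given as stored integers.

    For operands in s(p_a).n_a and s(p_b).n_b, the integer product is
    exact in s(p_a + p_b).(n_a + n_b), per the conservation-of-bits law.
    Returns (product integer, real value it represents).
    """
    prod_int = a_int * b_int                    # exact integer product
    return prod_int, prod_int / 2 ** (n_a + n_b)

# s4.3 operands: 0.875 is stored as 7, -1.0 as -8
print(fixed_mul(7, -8, 3, 3))   # (-56, -0.875) in s8.6
# the extreme case -1.0 x -1.0 = +1.0 needs the extra integer bit:
print(fixed_mul(-8, -8, 3, 3))  # (64, 1.0): fits s8.6 but not s7.6
```

The second call shows why the extra integer bit can be dropped only when the largest negative value is known not to occur.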
Fractional multiplication and storing of the result can be implemented in a special manner (shown on the slide as a data-path figure):
(1) integer multiplication of the operands, saturating to handle -1 × -1
(2) arithmetic shift left by one position together with a binary point movement, which removes the redundant duplicated sign bit
(3) (rounding and) taking the most significant bits of the result

• Multiplication by a power of two:
(1) can be implemented simply as an arithmetic shift
- left: may cause overflow; right: precision may be lost
(2) or as a movement of the binary point to the right/left
- overflow and precision loss are not possible
- this is the basis of the CORDIC algorithm discussed later

(3) Signal quantization: rounding of the arithmetic results to specific word lengths
- There are different kinds of rounding methods:
(1) truncation: simply discard the least significant bits
(2) round-to-nearest
(3) convergent rounding
(4) magnitude truncation
- Rounding introduces roundoff noise: the quantizer sq = round(s) is modelled as an additive noise source, sq = s + e
- Depending on the rounding method, the noise can be biased (expectation E{e} ≠ 0)
- The quantization noise gets amplified through noise transfer functions

(4) Overflow handling: hardware may use guard bits, wrapping, or saturation
- in the case of wrapping, overflows are neglected in HW. Therefore, one must either (1) ascertain that the final result is within the range, or (2) check that overflows cannot occur (by analysis/simulation)
- saturating operations are not associative! Therefore some standards for algorithms may specify the exact order of performing operations

2.1.5 Floating-point representation

Design of the representation is based on HW implementation issues.
Bit string parts: (1) sign bit, (2) exponent, and (3) significand (the order is important!).
Modes of bit string interpretation:
• normalized number: the exponent is adjusted so that the maximum precision is achieved.
Ṽ = (-1)^s × (1 + f) × 2^(e - eb), where s is the value of the sign bit, e is the unsigned integer encoded by the exponent part, eb is the exponent bias, and f is the unsigned fractional fixed-point value encoded by the significand.
• zero: representations of +0 and -0
• denormalized number: for representing small numbers, which fill the underflow gap around zero
• infinity, not-a-number (NaN)

IEEE 754 standard formats are commonly used:
• single precision (32 bits = sign bit + 8-bit exponent + 23-bit significand; eb = 127)
• double precision (64 bits = 1 + 11 + 52; eb = 1023)
• half precision (16 bits = 1 + 5 + 10; eb = 15): used especially in computer graphics

A non-standard format may be designed for the arithmetics of a particular application. Support for all modes might not be needed in HW.

Example. IEEE 754 single precision:

Mode           When
normalized     e ∈ {1, 2, ..., 254}
zero           e = 0, f = 0
denormalized   e = 0, f ≠ 0
infinity       e = 255, f = 0
not-a-number   e = 255, f ≠ 0
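The mode table can be turned into a small decoder. The sketch below (illustrative Python added here, not from the slides; decode_single is an invented name, and the bias 127 and field widths are those of single precision) cross-checks itself against the platform's native interpretation via the standard struct module:

```python
import struct

def decode_single(bits):
    """Decode a 32-bit IEEE 754 single-precision pattern per the mode table."""
    s = bits >> 31                      # sign bit
    e = (bits >> 23) & 0xFF             # unsigned exponent field
    f = (bits & 0x7FFFFF) / 2 ** 23     # fractional significand field
    sign = -1.0 if s else 1.0
    if e == 255:                        # infinity (f == 0) or NaN
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                          # zero (f == 0) or denormalized
        return sign * f * 2 ** (1 - 127)
    return sign * (1 + f) * 2 ** (e - 127)   # normalized

# cross-check against the hardware interpretation
for bits in (0x3F800000, 0xC0490FDB, 0x00000001, 0x7F800000):
    ref = struct.unpack(">f", struct.pack(">I", bits))[0]
    print(hex(bits), decode_single(bits), ref)
```

Note the denormalized branch uses the exponent 1 - eb rather than 0 - eb, which is what fills the underflow gap smoothly; 0x00000001 decodes to 2^-149, the smallest positive single-precision value.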