2 Computer arithmetics
Digital systems are implemented on hardware with finite word length.
Implementations require special attention because of possible quantization and arithmetic errors.
Part I: Real number representations
• Characteristics
• Basis: integers
• Fixed-point numbers
• Floating-point formats
Part II: Design flows of fixed-point solutions
• Analysis based flow
• Simulation based flow
2.1 Background: real number representations
Two categories: fixed-point and floating-point.
In both cases, a certain number of significant bits represent an integer value, and associated scaling maps that integer to some
real number.
[Figure: both representations consist of a sign (s) and significant bits (significand f) encoding an integer or fraction, plus scaling. In the floating-point representation the scaling is carried by an exponent (e) stored with the number; in the fixed-point representation the scaling is fixed during design.]
Exponent: floating-point hardware performs appropriate scaling during run-time.
In the case of fixed-point numbers this is a design-time decision. Can be hard!
Implications for choosing the computational platform:
• Do we really need an optimized fixed-point solution?
• Or, do we prefer an easier, cost-saving design process using floating-point?
• Note that floating-point HW might not implement all features of the standards, and word length might also be limited
2.1.1 Characterization of representations
Word length: the total number of bits in a representation
Bit precision: the number of significant bits
Range: the smallest and the largest representable value - overflow
Precision: the smallest interval between two consecutive numbers (unit in the last place, ulp) - roundoff noise, underflow
Dynamic range: measures the ratio between the smallest and largest absolute values
Dynamic range in dB = 20 log10 (AMax/AMin).
• Fixed-point case: all numbers represented with the same precision, 6 dB per one bit
• Floating-point case: large numbers represented with less precision, dynamic range huge
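As a quick illustration of the dB formula (a minimal Python sketch, not part of the original notes; the variable names are mine):

```python
import math

def dynamic_range_db(a_max: float, a_min: float) -> float:
    """Dynamic range in dB = 20 log10(AMax / AMin)."""
    return 20 * math.log10(a_max / a_min)

# 32-bit signed integer: AMax = 2^31, AMin = 1 ulp
fixed_dr = dynamic_range_db(2**31, 1)

# Each extra bit doubles AMax, adding 20 log10(2) ~ 6.02 dB
per_bit = 20 * math.log10(2)

print(round(fixed_dr), per_bit)   # ~187 dB total, ~6 dB per bit
```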
                         32-bit signed     32-bit signed     IEEE-754 single precision
                         integer           fractional        floating-point
word length              32                32                32
bit precision            32                32                24 (= 23-bit significand + hidden bit)
range (max abs. value)   2^31 ≈ 2.1×10^9   1                 ≈ 3.4×10^38
precision (ulp)          1                 2^-31             2^-149 (for Emin) ... 2^104 (for Emax);
                                                             note: 256 for 2.1×10^9
dynamic range            187 dB            187 dB            1535 dB
2.1.2 Integers
Fixed-point and floating-point representations are based on representations of integer values.
The choice of particular representation depends on what has to be done with the numbers.
1. Unsigned integers
2. Signed integers
• sign-magnitude: encoding of significant bits in floating-point formats
• one’s complement: advantage is easy negation
• two’s complement: the common choice for fixed-point formats
• biased representation (excess-B): a bias value, B, is subtracted from an unsigned integer value. Example: offset
binary (ADC output, DAC input), encoding of exponents in floating-point formats
a2 a1 a0   sign-magnitude   1's complement   2's complement   offset binary
  011           +3               +3               +3               -1
  010           +2               +2               +2               -2
  001           +1               +1               +1               -3
  000            0               +0                0               -4
  111           -3               -0               -1               +3
  110           -2               -1               -2               +2
  101           -1               -2               -3               +1
  100           -0               -3               -4                0
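The four signed-integer systems can be checked with a small decoder; `decode` is a hypothetical helper written for this sketch, not part of the notes:

```python
def decode(bits: str, system: str) -> int:
    """Decode a bit string under one of the four signed-integer systems."""
    v = int(bits, 2)              # unsigned value of the bit pattern
    n = len(bits)
    if system == "two's complement":
        return v - 2**n if bits[0] == "1" else v
    if system == "one's complement":
        return v - (2**n - 1) if bits[0] == "1" else v
    if system == "sign-magnitude":
        mag = int(bits[1:], 2)
        return -mag if bits[0] == "1" else mag
    if system == "offset binary":  # bias B = 2^(n-1)
        return v - 2**(n - 1)
    raise ValueError(system)

# Reproduce one row of the table: pattern 101
print([decode("101", s) for s in
       ("sign-magnitude", "one's complement", "two's complement", "offset binary")])
# -> [-1, -2, -3, 1]
```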
2.1.3 Fixed-point representation
Fixed-point numbers are based on scaling of integers (unsigned or signed).
Two’s complement integers used as a basis for signed fixed-point.
(1) Binary point scaling: The value represented is
Ṽ = Vint / 2^n
where Vint is the integer value represented by the bit string, and n is the number of fraction bits.
Notation: up.n for unsigned and sp.n for signed formats, where p is the word length and n is the number of fraction bits. For
example, s8.3 denotes a signed 8-bit format with 3 fraction bits.
Example. 4-bit signed fixed-point numbers:

s format   Binary point position   Range                Precision (ulp)
s4.-1      a3 a2 a1 a0 0 •         -16 ... +14          2
s4.0       a3 a2 a1 a0 •           -8 ... +7            1
s4.1       a3 a2 a1 • a0           -4 ... +3.5          0.5
s4.2       a3 a2 • a1 a0           -2 ... +1.75         0.25
s4.3       a3 • a2 a1 a0           -1 ... +0.875        0.125
s4.4       • a3 a2 a1 a0           -0.5 ... +0.4375     0.0625
s4.5       • a3 a2 a1 a0           -0.25 ... +0.21875   0.03125
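The ranges and ulps above follow directly from Ṽ = Vint / 2^n; a short sketch (the function `fx_range` is an illustrative name, not from the notes):

```python
def fx_range(p: int, n: int):
    """Range and ulp of a signed fixed-point format sp.n (two's complement)."""
    ulp = 2.0 ** (-n)
    lo = -(2 ** (p - 1)) * ulp        # most negative integer, scaled by 2^-n
    hi = (2 ** (p - 1) - 1) * ulp     # most positive integer, scaled by 2^-n
    return lo, hi, ulp

# Reproduce the 4-bit table rows
for n in (-1, 0, 1, 2, 3, 4, 5):
    lo, hi, ulp = fx_range(4, n)
    print(f"s4.{n}: range [{lo}, {hi}], ulp {ulp}")
```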
(2) Slope-bias scaling: The value represented is
Ṽ = s × Vint + b
where s > 0 is called the slope, and b is the (offset) bias.
The slope can be represented as s = f × 2^e, where 1 ≤ f < 2 is called the fractional slope and e shows the radix point position.
- binary point scaling is a special case of this: b = 0, f = 1, e = −n.
- precision (the weight of the least significant bit) is equal to the slope
- the goal of slope (and bias) selection: utilization of the full dynamic range
Example. We want to represent angles in the range [−π, π) with maximal precision using 6 bits.
1) Binary point scaling:
- we need two integer bits, since the integer part of π is 3 = 011₂
- thus the format to be used is s6.3
- the range is [−4, 4 − 2^-3]
- the precision of the format is ulp = 2^-3 = 0.125
2) Slope-bias scaling:
- using zero bias, we choose s = π/2^5 to use the full dynamic range
- the range is then π × [−1, 1 − 2^-5] and ulp = s ≈ 0.0982
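The slope-bias variant can be sketched in code (the helpers `quantize` and `reconstruct` are hypothetical names introduced for this sketch):

```python
import math

# Slope-bias scaling with zero bias: slope s = pi / 2^5, so ulp == slope
slope = math.pi / 2**5                     # ~0.0982

def quantize(angle: float) -> int:
    """Map an angle to a signed 6-bit integer code (range -32 ... 31)."""
    v = round(angle / slope)
    return max(-32, min(31, v))

def reconstruct(code: int) -> float:
    """V~ = s * Vint + b, here with bias b = 0."""
    return slope * code

err = abs(reconstruct(quantize(1.0)) - 1.0)
print(slope, err)                          # quantization error is at most slope/2
```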
2.1.4 Fixed-point arithmetics
(1) Addition: guard bits
- sp.n + sp.n ↦ s(p + 1).n
- guard bits g added to the accumulator of the MAC data path:
[Figure: MAC data path: two n-bit operands are multiplied into a 2n-bit product, which is accumulated in a (2n+g)-bit adder/accumulator.]
- g ≥ ⌈log2 (N)⌉, where N is the number of terms to be added
- in MAC based FIR filtering, the bound for g depends on the coefficient values
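The guard-bit bound can be illustrated as follows (an assumed MAC scenario, not from the notes):

```python
import math

def guard_bits(n_terms: int) -> int:
    """g >= ceil(log2(N)) guard bits so a sum of N products cannot overflow."""
    return math.ceil(math.log2(n_terms))

# Hypothetical MAC: accumulate N products that each fit in 32 bits (|p| <= 2^31)
N = 100
g = guard_bits(N)                    # 7 guard bits
assert N * 2**31 <= 2**(31 + g)      # worst-case sum fits in 32 + g bits
print(g)
```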
(2) Multiplication:
• The law of conservation of bits: sp1.n1 × sp2.n2 ↦ s(p1 + p2).(n1 + n2)
- note: one extra integer bit is introduced (e.g. s4.3 × s4.3 ↦ s8.6; 4−3−1 = 0 integer bits in, 8−6−1 = 1 out)
- if the largest negative value does not occur, that extra integer bit is not needed
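A sketch of the bit-growth rule (illustrative only; `fx_mul` is a name introduced here):

```python
def fx_mul(a_int: int, b_int: int, n1: int, n2: int):
    """sp1.n1 * sp2.n2 -> s(p1+p2).(n1+n2): integer product, fraction bits add."""
    return a_int * b_int, n1 + n2

# s4.3 * s4.3 -> s8.6: multiply (-1.0) by (-1.0)
a = -8                       # -1.0 in s4.3 (Vint = -8, n = 3)
prod, n = fx_mul(a, a, 3, 3)
print(prod / 2**n)           # 1.0: this needs the extra integer bit, since
                             # +1.0 is not representable in s7.6 (max 63/64)
```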
• Modes of arithmetic:
- in integer arithmetic fixed-point values are treated as integers, and the programmer must take care that overflows do
not occur (e.g. intermediate scaling operations, coefficient magnitude restrictions).
- in fractional arithmetic, one uses fractional fixed-point values. Multiplication and storing the result can be implemented in a special manner:
[Figure: fractional multiplication: the integer multiplication of operands A and B (saturating to handle −1 × −1) yields a product with a duplicated sign bit; an arithmetic shift left with binary point movement removes it, and the result is obtained by (rounding and) taking the most significant bits.]
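The fractional multiplication scheme might be sketched for Q15 (s16.15) operands as follows; this is an assumed model of the data path, not a definitive implementation:

```python
def q15_mul(a: int, b: int) -> int:
    """Fractional multiply: s16.15 x s16.15 -> s16.15, keeping the MSBs."""
    p = a * b               # s32.30 product, with a duplicated sign bit
    p <<= 1                 # arithmetic shift left -> s32.31 alignment
    if p >= 2**31:          # only -1 x -1 overflows: saturate
        p = 2**31 - 1
    return p >> 16          # take the 16 most significant bits (truncation)

half = 2**14                            # 0.5 in Q15
print(q15_mul(half, half) / 2**15)      # 0.25
print(q15_mul(-2**15, -2**15))          # 32767, i.e. saturated to ~ +1.0
```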
• Multiplication by a power of two:
(1) can be implemented simply as an arithmetic shift
- left: may cause overflow, right: precision may be lost
(2) the movement of the binary point to the right/left
- neither overflow nor precision loss is possible
- basis of the CORDIC algorithm discussed later
(3) Signal quantization: rounding of the arithmetic results to specific word lengths
- There are different kinds of rounding methods:
(1) truncation: simply discard least significant bits
(2) round-to-nearest
(3) convergent rounding
(4) magnitude truncation
- Introduces roundoff noise
[Figure: quantization s → round → sq, modelled as additive noise: sq = s + e.]
- Depending on the rounding method, the noise can be biased (expectation E{e} ≠ 0)
- The quantization noise gets amplified through noise transfer functions
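The bias of the different rounding methods can be demonstrated numerically (a sketch with illustrative function names, not from the notes):

```python
def truncate(v: int, d: int) -> int:
    """(1) truncation: drop the d least significant bits (floor)."""
    return v >> d

def round_nearest(v: int, d: int) -> int:
    """(2) round-to-nearest: add half an ulp, then truncate."""
    return (v + (1 << (d - 1))) >> d

def magnitude_truncate(v: int, d: int) -> int:
    """(4) magnitude truncation: truncate toward zero, never increases |v|."""
    q = abs(v) >> d
    return -q if v < 0 else q

# Mean error over a symmetric input range exposes the bias E{e}:
d = 3
for f in (truncate, round_nearest, magnitude_truncate):
    errs = [f(v, d) * 2**d - v for v in range(-64, 64)]
    print(f.__name__, sum(errs) / len(errs))
```

Truncation is strongly biased toward −∞, round-to-nearest has a small residual bias from the halfway cases, and magnitude truncation is unbiased on this symmetric input.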
(4) Overflow handling: hardware may use guard bits, wrapping, or saturation
- in the case of wrapping, overflows are neglected in HW. Therefore, one must either (1) ascertain that the final result is within
the range, or (2) check that overflows cannot occur (by analysis/simulation)
- saturating operations are not associative! Therefore some standards for algorithms may specify exact order of performing
operations
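The non-associativity of saturating addition can be shown with 8-bit values (a sketch; the helper names are mine):

```python
INT8_MIN, INT8_MAX = -128, 127

def sat_add(a: int, b: int) -> int:
    """8-bit saturating addition."""
    return max(INT8_MIN, min(INT8_MAX, a + b))

def wrap_add(a: int, b: int) -> int:
    """8-bit wrapping addition (two's complement)."""
    return (a + b + 128) % 256 - 128

# Saturating addition is not associative:
left = sat_add(sat_add(100, 100), -100)    # sat(200) = 127, then 27
right = sat_add(100, sat_add(100, -100))   # 100 + 0 = 100
print(left, right)

# Wrapping is: a final result known to be in range comes out right
# even though an intermediate sum overflowed.
assert wrap_add(wrap_add(100, 100), -100) == 100
```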
2.1.5 Floating-point representation
Design of the representation is based on HW implementation issues
Bit string parts: (1) sign bit, (2) exponent, and (3) significand (order is important!).
Modes of bit string interpretation:
• normalized number: the exponent is adjusted so that the maximum precision is achieved.
Ṽ = (−1)^s × (1 + f) × 2^(e−eb),
where s is the value of the sign bit, e is the unsigned integer encoded by the exponent part, eb is the exponent bias, and
f is the unsigned fractional fixed point value encoded by the significand.
• zero: representations of +0 and -0
• denormalized number: for representing small numbers, which fill the underflow gap around zero
• infinity, not-a-number (NaN)
IEEE 754 standard formats are commonly used
• single precision (32 bits = sign bit + exponent 8 bits + significand 23 bits; eb = 127)
• double precision (64 bits = 1 + 11 + 52; eb = 1023)
• half precision (16 bits = 1 + 5 + 10; eb = 15): used especially in computer graphics
Non-standard format may be designed for arithmetics of a particular application.
Support for all modes might not be needed in HW.
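Decoding a normalized number by the formula above can be sketched as follows (illustrative only; it cross-checks against the machine's native decoding via Python's struct module):

```python
import struct

def decode_f32(bits: int) -> float:
    """Decode a normalized IEEE 754 single-precision pattern:
    V~ = (-1)^s * (1 + f) * 2^(e - 127)"""
    s = (bits >> 31) & 1
    e = (bits >> 23) & 0xFF
    f = (bits & 0x7FFFFF) / 2**23          # significand as unsigned fraction
    assert 1 <= e <= 254, "zero/denormal/infinity/NaN need separate handling"
    return (-1)**s * (1 + f) * 2.0**(e - 127)

for pattern in (0x3F800000, 0xC0490FDB):   # 1.0 and approximately -pi
    (native,) = struct.unpack(">f", pattern.to_bytes(4, "big"))
    print(hex(pattern), decode_f32(pattern), native)
    assert decode_f32(pattern) == native
```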
Example. IEEE 754 single precision

Mode            When
normalized      e ∈ {1, 2, ..., 254}
zero            e = 0, f = 0
denormalized    e = 0, f ≠ 0
infinity        e = 255, f = 0
not-a-number    e = 255, f ≠ 0