Fixed Point Conversion
Integer numbers: whole numbers with no fractional part. They are discrete points on the number line, e.g. -10, -4, 0, 7, 9
www.rajeshsharma.co.in 2
7/17/2010
Format: how numbers are represented in the binary system. For example, for a signed number we have:
Float: mantissa bits and exponent bits
Fixed: integer bits and fractional bits
Total bits = mantissa (or integer) bits + exponent (or fractional) bits + sign bit
Few Terms
Precision: the effective word length, N bits
Accuracy: the maximum error between the actual number and its representation
e.g. the representation is accurate to the 8th decimal place
e.g. the representation has a maximum error of 0.001%
Dynamic Range
While processing or storing variables, we need to find:
Var_max: the absolute maximum value a variable can have
Var_min: the absolute minimum value a variable can have
Dynamic Range
The ratio of the largest and smallest possible values of the variable quantity
Dynamic range is most often expressed as a ratio (5000:1), in dB (75 dB) or in bits (12 bits):
Dynamic Range (ratio) = Var_max / Var_res
Dynamic Range (bits) = log2(Var_max / Var_res) = N bits
Dynamic Range (dB) = 20 * log10(Var_max / Var_res)
Dynamic Range (dB) = 6.02 * N + 1.76
It is quoted for e.g. microphones, loudspeakers, ADCs, DACs etc. For display devices it is often called contrast ratio.
For fixed point conversion, the signed representation is normally used when calculating dynamic range
Results are stored in registers having a finite word length
N-bit storage can have only 2 possibilities (floating point or fixed point)
The choice is governed by:
Dynamic range of the variable (most important)
Power consumption of the specific type of ALU
Cost of implementing the ALU
Other possibilities
Floating Point
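The decimal point (or binary point) is floating
Economical for storing very large or very small numbers
FPU silicon implementation is costly
Usually represented as
r = (-1)^s * 2^(e - B) * 0.f
or
r = (-1)^s * 2^(e - B) * 1.f
where s is the sign bit, e the stored exponent, B the exponent bias and f the fraction (mantissa) bits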
The range of the variable is very high
Still, there are only 2^N distinct values for N bits
The accuracy of the variable varies with its value
Spacing between numbers is not constant
IEEE 754 supports four precisions:
Half precision, 16 bits; single precision, 32 bits
Double precision, 64 bits; quadruple precision, 128 bits
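The non-constant spacing can be observed directly. The sketch below assumes `float` is IEEE 754 single precision (24-bit significand); `step_is_visible` is a hypothetical helper name.

```c
#include <float.h>

/* Returns 1 if adding `step` to `x` yields a different float, else 0. */
int step_is_visible(float x, float step) {
    /* volatile forces the sum to be rounded to float precision */
    volatile float sum = x + step;
    return sum != x;
}
```

Near 1.0 a step of `FLT_EPSILON` (about 1.19e-7) is visible, while near 16777216.0 (2^24) a step of 1.0 is lost and only a step of 2.0 registers.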
Floating point arithmetic is very easy to use
Floating point arithmetic: problems
It may not be distributive (depends on the processor implementation)*
a * (b + c) != a*b + a*c
Operations on large numbers have more error
The mantissa limits the resolution
The exponent limits the largest possible number
*PS: Different floating point implementations of one DSP algorithm may not be bit exact.
Number line (figure)
Double precision: 64 bits; mantissa = 52 bits; exponent = 11 bits
Number 1.7687*10^7
1.7687*10^-2 and 1.7688*10^-2: spacing 0.0001*10^-2 = 0.000001
Number line
2^24 values/decade
To achieve higher speed of operation at lower cost
To reduce the silicon area of the hardware, and so its cost
To save power during code execution
Fixed point implementations may be used directly on many DSPs
Range is directly limited by the number of bits
The spacing between numbers is constant
Q format: written as MQN or M.N
K bits for the integer part
N bits for the fractional part
M = K + sign bit = K + 1
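The range and spacing implied by an MQN format can be sketched as follows. The `QFormat` struct and function names are hypothetical, and a two's complement interpretation is assumed, so the most negative value is -2^(M-1) and the most positive is 2^(M-1) - 2^-N.

```c
/* A signed MQN format: M = K integer bits + 1 sign bit, N fractional bits. */
typedef struct {
    int m;  /* integer bits, including the sign bit */
    int n;  /* fractional bits */
} QFormat;

/* Resolution: the constant spacing between adjacent values, 2^-N. */
double q_resolution(QFormat q) {
    return 1.0 / (double)((long long)1 << q.n);
}

/* Most negative representable value: -2^(M-1). */
double q_min(QFormat q) {
    return -(double)((long long)1 << (q.m - 1));
}

/* Most positive representable value: 2^(M-1) - 2^-N. */
double q_max(QFormat q) {
    return (double)((long long)1 << (q.m - 1)) - q_resolution(q);
}
```

For 1Q3 this gives a spacing of 0.125 and a range of -1.0 to 0.875.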
Fractional Format
1Q3 number
How to Convert
A fractional number can be converted to a fixed point number for a given signed Q format MQN:
Set all N LSBs to 1 and the M bits to 0; call the result maxN
e.g. for 1Q31, max31 = 0x7FFFFFFF; for 1Q15, max15 = 0x7FFF = 32767; for 3Q13, max13 = 0x1FFF = 8191
Now multiply the given fractional number by maxN
Table: the numbers 0.375, -0.625, 1, 1.25, 2.005 and 0.0003 converted to the 1Q15 and 5Q11 formats (hex/decimal)
Surviving entries: 0009 (9), 2, 2, 10238, 16422
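The conversion recipe above can be sketched in C. The function names are hypothetical; the code assumes the product is truncated toward zero and does no range checking, so the caller must pick a format wide enough for the value.

```c
#include <stdint.h>

/* maxN: the N least significant bits set to 1, all others 0 (2^N - 1). */
int64_t q_max_n(int n) {
    return ((int64_t)1 << n) - 1;
}

/* Scale x by maxN and truncate toward zero to get the fixed point integer. */
int64_t float_to_q(double x, int n) {
    return (int64_t)(x * (double)q_max_n(n));
}

/* Map the integer back to a real value, to inspect the representation error. */
double q_to_float(int64_t q, int n) {
    return (double)q / (double)q_max_n(n);
}
```

With this sketch, 0.0003 with N = 15 gives 9, and 1.25 and 2.005 with N = 13 give 10238 and 16422.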
Setting Q Format
The variable to be converted is explored for:
Range of the variable's value
Absolute maximum / minimum values of the variable
Resolution of the variable: its smallest non-zero number
Accuracy: the maximum error of a representation
The accuracy required depends heavily on the arithmetic involved.
It also depends on the reference used for the final comparison.
Calculate the accuracy from the tolerable % error in the representation.
Exploration Process
DR bits = B = K + N
Total bits = B + 1(S) => 1(S) + K + N => M + N
S is the sign bit
RO Table Example
const float Table[5] = {
    (5.0979557966e-001),
    (6.0134489352e-001),
    (8.9997625993e-001),
    (2.5629159405e+000),
    (-4.4628159401e+000) };
We will set the Q format for this RO table and compare two choices of resolution.
Range: 4.4628159401e+000 (signed representation)
Resolution: 5.0979557966e-001 = variable minimum value
Resolution = Table[1] - Table[0] = 0.09154931386
MaxInt = 5
Example cont
DR = 4.4628159401 / 0.09154931386 = 48.74
DR bits B = 6; K = 3 bits (as MaxInt = 5)
M = 4 bits, N = 3 bits
Total bits = 6(DR) + 1(S) => 1(S) + 3(K) + 3(N)
The conversion format MQN is 4Q3: M = 4 bits; N = 3 bits
max3 = 7; 5.0979557966e-001 * max3 = 3 (after truncation)
3 / max3 = 0.42857142857
Representation error = 15.93% (a very large error)
This error is not acceptable
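The two calculations in this example, the bit count for a dynamic range and the percentage error of one converted entry, can be sketched as below. The function names are hypothetical. Rounding the bit count up to the next whole bit is an assumption of this sketch, and the conversion truncates toward zero.

```c
#include <stdint.h>

/* Smallest number of bits B such that 2^B >= ratio (assumed round-up rule). */
int bits_for_ratio(double ratio) {
    int bits = 0;
    double span = 1.0;
    while (span < ratio) {
        span *= 2.0;
        bits++;
    }
    return bits;
}

/* Percentage error after scaling x by maxN = 2^N - 1, truncating, scaling back. */
double representation_error_pct(double x, int n) {
    double max_n = (double)(((int64_t)1 << n) - 1);
    int64_t q = (int64_t)(x * max_n);       /* fixed point integer */
    double back = (double)q / max_n;        /* value actually represented */
    double err = (x - back) / x * 100.0;
    return err < 0.0 ? -err : err;
}
```

For the table above, `bits_for_ratio(4.4628159401 / 0.09154931386)` gives 6, and `representation_error_pct(5.0979557966e-001, 3)` comes out near 15.93.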
Example cont
Resolution (last significant digit of Table[0]) = 0.00000000006
DR = 4.4628159401 / 0.00000000006 = 74380265668
DR bits = 36 bits
Total bits = 36(DR) + 1(S) => 1(S) + 3(K) + 33(N)
max33 = 0x1FFFFFFFF = 8589934591
5.0979557966e-001 * max33 = 4379110684
4379110684 / max33 = 0.509795579652976
Representation error = 1.37e-9%
The Q format is set as 4Q33
PolyPhase4ptDCT4toDCT3[] = {
    4379110684,    Error = 1.37e-9%
    5165513301,    Error = 1.87e-8%
    7730737206,    Error = 3.25e-9%
    -38335297017,  Error = 1.53e-9% }
The 37-bit representation error is very small, and this much precision may not be needed for our application.
We may try 32 bits and check whether it fits the application.
This is a very good representation, but many other things still matter, e.g.
Precision of the arithmetic involved
Bits available for the representation, e.g. 32 bits
For example:
Shifting the binary point only changes the maximum/minimum value representable in the resulting format
Exercise
Fractional Addition
The output has the same format as the inputs,
i.e. the binary point is at the same place for both input numbers and for the output
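A minimal sketch of same-format addition, using 1Q15 operands held in `int16_t`. The function names are hypothetical. Saturating on overflow is an assumption added here, not something the slides specify, and the alignment helper relies on an arithmetic right shift for negative values, which is implementation-defined in C.

```c
#include <stdint.h>

/* Add two 1Q15 values; the result is also 1Q15.
   Saturation on overflow is an added assumption of this sketch. */
int16_t add_q15(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Bring a 1Q15 value to 5Q11 by dropping 4 fractional bits,
   so that its binary point lines up with a 5Q11 operand. */
int16_t q15_to_q11(int16_t a) {
    return (int16_t)(a >> 4);
}
```

Operands in different formats have to be aligned first, for example with `q15_to_q11`, before an ordinary integer add gives a meaningful result.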
Headroom/Guard Bits
Example: addition
for( i=0; i<256; i++ ) sum = sum + ar[i];
Adding 256 values can grow the sum by up to 8 bits (log2(256) = 8), so there is a possibility of an 8-bit overflow
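The headroom needed for such a loop can be estimated and provided as in the sketch below. The function names are hypothetical, and the samples are assumed to be 16-bit values summed in a 32-bit accumulator, which leaves 16 spare bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Guard bits needed to add `count` values without overflow: ceil(log2(count)). */
int guard_bits(size_t count) {
    int bits = 0;
    while (((size_t)1 << bits) < count) bits++;
    return bits;
}

/* Sum 16-bit samples in a 32-bit accumulator, i.e. with 16 bits of headroom. */
int32_t sum_q15(const int16_t *ar, size_t count) {
    int32_t sum = 0;
    for (size_t i = 0; i < count; i++) sum += ar[i];
    return sum;
}
```

For 256 samples `guard_bits` returns 8, so a 16-bit input needs an accumulator of at least 24 bits.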
2's Complement Multiplication
Fractional Multiplication
The product of two N-bit numbers is 2N bits wide
Example: the product of MQN and KQL formats
The output format is (M+K)Q(N+L)
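The format rule can be sketched both symbolically and for a concrete case. The `QFormat` struct and the function names are hypothetical; the concrete case multiplies two 1Q15 values into a 32-bit 2Q30 product.

```c
#include <stdint.h>

/* A signed MQN format: M integer bits (sign included), N fractional bits. */
typedef struct {
    int m;
    int n;
} QFormat;

/* Product format of MQN and KQL: (M+K)Q(N+L). */
QFormat mul_format(QFormat a, QFormat b) {
    QFormat out = { a.m + b.m, a.n + b.n };
    return out;
}

/* Multiply two 1Q15 numbers; the full 32-bit product is in 2Q30 format. */
int32_t mul_q15_full(int16_t a, int16_t b) {
    return (int32_t)a * (int32_t)b;
}
```

Two 1Q3 operands therefore give a 2Q6 product, and two 16-bit operands need a 32-bit result.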
Fractional Multiplication
1Q3 Numbers
2Q6 Numbers
Normalizing Fractions
The product of two signed numbers has two sign bits, so the result can be left shifted by 1 bit.
If one of the formats is 1QN, i.e. N+1 bits, then the result of the left shift has the integer part of the other operand's format.
If the other operand is 5Q11, the result of the multiplication is 6Q(11+N).
After the 1-bit left shift, the result is 5Q(12+N) with a zero in the new LSB, i.e. 5Q11 + (N+1) LSBs
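For two 1Q15 operands the normalization step looks like the sketch below. The function name is hypothetical. Truncating to the upper 16 bits and saturating the single overflow case (-1 times -1) are assumptions of this sketch, and the final right shift relies on arithmetic shifting of negative values, which is implementation-defined in C.

```c
#include <stdint.h>

/* 1Q15 x 1Q15 -> 2Q30, shift left by 1 to drop the duplicate sign bit
   -> 1Q31, then keep the upper 16 bits -> 1Q15 (truncated). */
int16_t mul_q15_normalized(int16_t a, int16_t b) {
    /* -1.0 * -1.0 = +1.0 does not fit in 1Q15; saturating is an assumption. */
    if (a == INT16_MIN && b == INT16_MIN) return INT16_MAX;

    int32_t product = (int32_t)a * (int32_t)b;             /* 2Q30 */
    /* Shift through unsigned to avoid signed-overflow undefined behaviour. */
    int32_t shifted = (int32_t)((uint32_t)product << 1);   /* 1Q31 */
    return (int16_t)(shifted >> 16);                       /* upper 16 bits */
}
```

With operands of 16384 (one half of full scale) the result is 8192, one quarter of full scale.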
Rounding
Why Rounding
To replace a numerical value with an approximately equal but shorter, i.e. lower precision, representation
e.g. replacing 45.6782 with 45.68
Rounding is a necessary evil (remember quantization)
Many times it is unavoidable
It introduces round-off errors
It is required in float to fixed conversion, in floating and fixed point arithmetic, and for function approximation on fixed point processors
Rounding Methods
Tie Breaking
Dithering
Stochastic rounding:
Choose q randomly between x+0.5 and x-0.5
It is also bias free because of the random component
All of them introduce a non-linear response in a system
Harmonics appear in the filter response because of the rounding method
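How the rounding method changes a result when low-order bits are dropped can be sketched as follows. Truncation and round-half-up are used here only as two common examples; the function names are hypothetical, `shift` is assumed to be at least 1, and `x` is assumed to be non-negative and small enough that adding the half LSB does not overflow.

```c
#include <stdint.h>

/* Drop `shift` LSBs by truncation: the discarded bits are simply lost. */
int32_t drop_bits_truncate(int32_t x, int shift) {
    return x >> shift;
}

/* Drop `shift` LSBs with round-half-up: add half an output LSB, then shift. */
int32_t drop_bits_round(int32_t x, int shift) {
    return (x + ((int32_t)1 << (shift - 1))) >> shift;
}
```

Dropping 2 bits from 23 (5.75 in units of the output LSB) gives 5 with truncation and 6 with rounding; the tie value 22 (5.5) rounds up to 6.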
When the dynamic range of a signal is very high
When the input fluctuates between very high and very small values
Exercise1
Exercise2
Exercise3
Given below is the DCT-III code, with the gain applied at the end. Convert it into fixed point code.
Exercise4
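The following IIR filter is used for filtering 16-bit PCM input.
Convert it into fixed point and ensure that the filter response is linear.
(Figure: second-order IIR filter block diagram with input x[n], Z^-1 delay elements giving x[n-1], x[n-2], y[n-1] and y[n-2], coefficients b0, b1, b2, a1, a2, and output y[n])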
Exercise4: C code
References
ARM Application Note 33, "Fixed point arithmetic on the ARM", document number ARM DAI 0033A
Wiki
THANK YOU