Introduction To Fixed Point Math
9/14/2004
Back in the olden days of dragons and sorcerers, before floating point coprocessors, computers could only directly manipulate whole numbers (integers). Things like decimal points were new fangled technologies only used by crazy mathematicians with slide rules and pen and paper. To overcome this deficiency crufty old-school programmers, many of whom today are over 30 years old, had to resort to various hacks and tricks in order to represent wacky numerical concepts like "0.5". The most common technique was fixed point math, which is the topic of this article.
== Preamble ==
I'm eventually, maybe, going to write a bunch of stuff here about real numbers and integers and number theory and domains and what not, but right now that doesn't interest me so I'm going to dodge the issue and do a quick summary for those late to class:
* computers represent integers and real numbers differently
* not all computers can perform math on real numbers quickly
* computers that do perform math on real numbers typically use a floating point representation which dynamically adjusts range and precision depending on the inputs to a particular operation.
* converting from a floating point value to an integer value is often very slow and usually inexact
Hope you're caught up. Next.
== Fixed Point Representation ==
Given that there are processors out there that don't support floating point numbers, how do we represent something like "0.5" on such beasts? This is where fixed point math comes into play. As the name implies, a fixed point number places the "decimal" point between the whole and fractional parts of a number at a fixed location, providing ''f'' bits of fractional precision. For example, an 8.24 fixed point number has an eight bit integer portion and twenty four bits of fractional precision. Since the split between the whole and fractional portion is fixed, we know exactly what our range and precision will be.
So let's think about this for a second. Using a 16.16 fixed point format (which, handily enough, fits in a typical 32-bit integer in C), the high 16-bits represent the whole part, and the low 16-bits represent the fractional portion:
{{{
0xWWWWFFFF
}}}
With 16-bits in the whole part, we can represent 2^16^ (65536) discrete values (0..65535 unsigned or -32768 to +32767 signed). The fractional part gives us, again, 2^16^ steps to represent the values from 0 to almost 1 -- specifically, 0 to 65535/65536 or approximately 0.99999. Our fractional resolution is 1/65536, or about 0.000015 -- that's pretty good resolution as well.
Let's look at a concrete example: the 16.16 fixed point value {{{0x00104ACF}}} approximately equals the decimal value 16.29222. The high sixteen bits are 0x10 (16 decimal) and the low 16-bits are 0x4ACF, so 0x4ACF/65536.0 ~= 0.29222. The simple method to convert a fixed point value with ''f'' fractional bits back to a real value is to divide by 2^f^, so:
http://www.bookofhook.com/Images/Fixed_Point_Images/fixrep_1.gif
However, we have to be careful since these integers are stored in 2's complement form. The value -16.29222 is 0xFFEFB531 -- notice that the fractional bits (0xB531) are not the same as for the positive value. So you can't just grab the fractional part by masking off the fractional bits -- sign matters.
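To make that concrete, here's a minimal C sketch (my own illustration, not code from the article -- the {{{fixed16_16}}} typedef is just a convenient name) that recovers the real values of both examples above by dividing the raw integers by 2^16^:
{{{
#include <stdint.h>
#include <stdio.h>

typedef int32_t fixed16_16;   /* a 16.16 value stored in a 32-bit integer */

int main( void )
{
    fixed16_16 a = 0x00104ACF;                  /* ~  16.29222 */
    fixed16_16 b = ( fixed16_16 ) 0xFFEFB531;   /* ~ -16.29222 */

    /* dividing the raw integer by 2^16 recovers the real value */
    printf( "%f %f\n", a / 65536.0, b / 65536.0 );
    return 0;
}
}}}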
== 2's Complement Summary ==
If you're reading this document there's a chance that you're not familiar with the concept of 2's complement arithmetic, which is fairly vital to understanding fixed point math. Fundamentally 2's complement solves the problem of representing negative numbers in binary form. Given a 32-bit integer, how can we represent negative numbers?
The obvious method is to incorporate a sign bit, which is precisely what floating point numbers in IEEE-754 format do. This signed magnitude form reserves one bit to represent positive/negative, and the rest of the bits represent the number's magnitude. Unfortunately there are a couple problems with this.
 1. Zero has two representations, positive (0x00000000) and negative (0x80000000), making comparisons cumbersome. It's also wasteful since two bit configurations represent the same value.
 2. Adding a positive integer to a negative integer requires extra logic. For example, a naive binary addition of -1 (0x80000001 in 32-bit sign magnitude form) and +1 (0x00000001) yields 0x80000002, which reads as -2 in sign magnitude form instead of the correct answer, 0.
Some really smart guys came up with a much better system a long time ago called 2's complement arithmetic. In this system positive numbers are represented as you'd expect, however negative numbers are represented by their complement, plus one. This requires taking all bits in the positive representation, inverting them, then adding one.
So to encode -1 we take {{{0x00000001}}}, invert (complement) the bits to get {{{0xFFFFFFFE}}}, and then add 1, giving us {{{0xFFFFFFFF}}} -- which you may recognize as the hex form for -1. This gives us a range of -2^(n-1)^ to +2^(n-1)^-1, where ''n'' is the number of storage bits.
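If you want to see this in action, a one-liner in C will do it (just an illustrative snippet of mine, not something from the article):
{{{
#include <stdio.h>

int main( void )
{
    int x = 1;

    /* complement, then add one: ~0x00000001 + 1 == 0xFFFFFFFF == -1 */
    printf( "%d\n", ~x + 1 );   /* prints -1 */
    return 0;
}
}}}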
2's complement solves a host of problems.
1. Zero has a single representation, simplifying comparisons and removing the
wasted value.
 2. Addition and subtraction just work. A binary addition of 2's complement values will give the correct result regardless of sign. -1 + 1 = 0 ({{{0xFFFFFFFF + 0x00000001 = 0x00000000}}}); -1 + -1 gives us -2 ({{{0xFFFFFFFF + 0xFFFFFFFF = 0xFFFFFFFE}}}); and 1 + 1 = 2 ({{{0x00000001 + 0x00000001 = 0x00000002}}}).
But...it introduces a couple unintuitive bits. Decoding a negative number takes a little bit of thought (you have to do the complement and add one), and the magnitude of a negative number is inversely related to the magnitude of its binary representation: -1 = {{{0xFFFFFFFF}}} but -2147483648 = {{{0x80000000}}}. This has at least one important ramification when dealing with fixed point -- the fractional bits of a fixed point number are not interpreted the same for positive and negative values.
Anyway, that's a quick refresher course. If you really need to know more about 2's complement, Google is your friend or, if you're feeling gung-ho, find a good book on discrete math or numerical analysis.
== Range and Precision Trade Offs ==
The whole trick with fixed point is that you're sacrificing some range in exchange for precision. In this case, instead of a 32-bit integer you have a 16-bit integer space with 16-bits of fractional precision.
Different operations often require different amounts of range and precision. A format like 16.16 is good for many different things, but depending on the specific problem you may need highly precise formats such as 2.30, or longer range formats such as 28.4. For example, we know that the sine function only returns values in the range -1.0 to +1.0, so representing sine values with a 16.16 is wasteful since 14-bits of the integer component will always be zero. Not only that, but the difference between sine(90) and sine(89.9) is a very small number, ~.0000015230867, which our 16-bits of fractional precision cannot represent accurately. In this particular case we'd be much better off using a 2.30 format.
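Changing formats is just a matter of shifting by the difference in fractional bits. As a rough sketch (mine, with an illustrative function name), moving a sine value from 16.16 to 2.30 means shifting left by 30 - 16 = 14 bits:
{{{
#include <stdint.h>

/* sketch: convert a value known to fit the 2.30 range (like a sine) from
   16.16 to 2.30 by shifting left by the difference in fractional bits;
   relies on typical 2's complement shift behavior for negative values */
int32_t fixed_16_16_to_2_30( int32_t a )
{
    return a << ( 30 - 16 );
}
}}}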
Conversely, some calculations will generate very large numbers, sometimes unexpectedly. The area of a screen sized triangle can easily exceed 2^16^ (a triangle half the size of a 640x480 pixel screen covers 153600 pixels), so even though the individual coordinates of such a triangle might be well within what we'd intuitively believe to be acceptable range (e.g. (0,0),(639,0),(639,479)), the area formed can vastly outstrip our available integer bits.
This brings to light a serious problem with fixed point -- you can silently overflow or underflow without expecting it, especially when dealing with sums-of-products or power series. This is one of the most common sources of bugs when working on a fixed point code base: some set of values that you expect to stay within range rapidly overflows when you least expect it.
A very common case is vector normalization, which has a sum-of-products during the vector length calculation. Normalizing a vector consists of dividing each of the vector's elements by the length of the vector. The length of a vector is:
http://bookofhook.com/Images/Fixed_Point_Images/vector_length.gif
One might believe that the vector (125,125,125) can be easily normalized using a 16.16 signed fixed point format (-32K to +32K range)...right? Unfortunately, that's not the case. The problem is that the sum of those squares is 46875 -- greater than our maximum value of +2^15^ (32767). After the square root it would drop down to a very manageable value (~216.51), but by then the damage is done. The length calculation will fail, and thus the normalization will fail -- the scary thing is that the vector normalization will work correctly with a lot of values, but not with others. The vector (100,100,100) will normalize just fine, but the vector (40,40,180) will not -- and you can't tell at a glance which is fine and which isn't. Fixed point math is highly input sensitive.
This is why choosing your range and precision is very important, and putting in a lot of range and overflow sanity checks is extremely important if you want robust software.
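As an example of the kind of sanity check I mean, here's a small sketch (my own, not from the article) that verifies the sum of squares in a vector length calculation still fits in the whole part of a signed 16.16 value before we trust the result:
{{{
#include <stdint.h>

/* returns non-zero if x*x + y*y + z*z fits in the whole part of a signed
   16.16 value; (100,100,100) passes, (125,125,125) and (40,40,180) don't */
int length_squared_fits_16_16( int32_t x, int32_t y, int32_t z )
{
    int64_t sum = ( int64_t ) x * x + ( int64_t ) y * y + ( int64_t ) z * z;

    return sum <= 32767;
}
}}}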
== Conversion ==
Converting to and from fixed point is a simple affair if you're not too worried about rounding and details like that. A fixed point value is derived from a floating point value by multiplying by 2^f^. A fixed point value is generated from an integer value by shifting left ''f'' bits:
http://bookofhook.com/Images/Fixed_Point_Images/fixops_1.gif
{{{
F = fixed point value
i = integer value
r = floating point value
f = number of fractional bits for F
}}}
Converting back to floating point is trivial -- just perform the reverse of the
float-to-fixed conversion:
http://bookofhook.com/Images/Fixed_Point_Images/fixops_2.gif
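Here's what those conversions look like as a minimal C sketch for a 16.16 format (the helper names are mine, and there's no rounding or overflow checking here):
{{{
#include <stdint.h>

typedef int32_t fixed16_16;   /* 16.16: f = 16, so the scale factor is 2^16 = 65536 */

/* float -> fixed: multiply by 2^f */
fixed16_16 float_to_fixed( float r )
{
    return ( fixed16_16 )( r * 65536.0f );
}

/* int -> fixed: shift left by f (assumes i fits in the 16-bit whole part) */
fixed16_16 int_to_fixed( int i )
{
    return ( fixed16_16 )( i << 16 );
}

/* fixed -> float: the reverse, divide by 2^f */
float fixed_to_float( fixed16_16 F )
{
    return ( float ) F / 65536.0f;
}
}}}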
However, converting to an integer is a bit trickier due to rounding rules and the 2's complement encoding. In the world of floating point there are four common methods for converting a floating point value to an integer:
 1. round-towards-zero
 2. round-towards-positive-infinity
 3. round-towards-negative-infinity
 4. round-to-nearest
(Note: there are other rounding methods as well, such as round-to-even, round-to-odd, round-away-from-zero, etc. but those are rarer and I don't feel like devoting space to them). These are conceptually simpler in floating point than in fixed point -- for example, truncating a floating point value always rounds towards zero, but truncating the fractional bits of a fixed point number actually rounds towards negative infinity (due to the miracle of 2's complement, where the fractional bits move you away from 0 for positive numbers but towards 0 for negative numbers).
=== Round-Towards-Zero ===
Rounding towards zero "shrinks" a number so that it always gets closer to zero during the rounding operation. For a positive fixed point value a simple truncation works, however a negative number needs to have 2^f^-1 (i.e. {{{(1 << f) - 1}}}) added to it before the truncation. This is just "round towards negative-infinity if positive; round towards positive-infinity if negative".
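A sketch of this in C for a 16.16 value might look like the following (my own illustration; it assumes right shifting a negative value is an arithmetic, sign-extending shift, which is implementation-defined in C but true on most compilers):
{{{
#include <stdint.h>

int32_t fixed_to_int_round_to_zero( int32_t a )
{
    if ( a < 0 )
        a += ( 1 << 16 ) - 1;   /* nudge negatives up before truncating */

    return a >> 16;             /* truncate the fractional bits */
}
}}}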
=== Round-Towards-Negative-Infinity ===
Rounding towards negative-infinity is the simplest operation: it's the truncation we talked about earlier. A positive number that loses its fractional bits goes from A.B to A.0, and will always get closer to zero. However, the binary representation of a negative number gets larger as the number approaches 0 due to the 2's complement encoding. This means that removing the fractional bits actually pushes a negative number away from zero (and thus towards negative infinity).
This can be somewhat confusing, but an example illustrates the point. The fixed point value 0xFFFF9999 represents roughly -0.40000. By truncating the fractional bits the new value is -1.0 -- we've rounded towards negative infinity. The fixed point value 0xFFFF0003 represents a value just above -1, so truncating the fractional bits will, again, give us a value of -1.
Note that we need to perform an arithmetic shift, also known as a sign-extended shift, to ensure that the sign bit is propagated correctly for negative numbers. Consider the 16.16 fixed point value for -1 ({{{0xFFFF0000}}}). If we do an unsigned shift to the right by 16 places we have 0x0000FFFF, which would be 65535 -- incorrect. By sign extending the shift the result will be {{{0xFFFFFFFF}}}, which is -1 in 2's complement integer representation.
http://bookofhook.com/Images/Fixed_Point_Images/fixops_3.gif
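The difference is easy to demonstrate with a throwaway snippet (mine, not from the article; as before, the signed shift relies on the compiler sign-extending):
{{{
#include <stdint.h>
#include <stdio.h>

int main( void )
{
    uint32_t u = 0xFFFF0000u;               /* -1.0 in 16.16, viewed unsigned */
    int32_t  s = ( int32_t ) 0xFFFF0000u;   /* -1.0 in 16.16, viewed signed   */

    /* unsigned shift loses the sign; arithmetic shift preserves it */
    printf( "%u %d\n", ( unsigned )( u >> 16 ), ( int )( s >> 16 ) );   /* prints "65535 -1" */
    return 0;
}
}}}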
=== Multiplication ===
Multiplying two fixed point values is mostly a matter of multiplying the raw integers and then rescaling the result -- and the two operands don't even have to be in the same format, all that matters is how many fractional bits each one carries. As a quick example, 1.0 * 2.0 in 16.16 format is {{{0x10000 * 0x20000}}}, which gives {{{0x200000000}}} -- clearly not 2.0 in 16.16 format (that would be {{{0x20000}}}). To see why, write out what we're actually multiplying: A*2^m^ * B*2^n^ = (A*B)*2^(m+n)^, where:
{{{
A = first number
m = number of fractional bits for A
B = second number
n = number of fractional bits for B
}}}
Looking at the above we find that the product is scaled by the sum of ''m'' and ''n'', so the product is "overscaled". We can fix this by multiplying through by some value that will bring it back down into the range we desire. Given that we eventually want, say, ''f'' bits of precision (i.e. the result is scaled by ''f'' bits) in our result, we need to convert the 2^(m+n)^ term to 2^f^.
Multiplying through by:
http://bookofhook.com/Images/Fixed_Point_Images/multiply_4.gif
will scale things so that we have f fractional bits in our result since:
http://bookofhook.com/Images/Fixed_Point_Images/multiply_2.gif
So our final equation is:
http://bookofhook.com/Images/Fixed_Point_Images/multiply_3.gif
In plain English, that's just saying that if we want the product of two fixed point numbers, of ''m'' and ''n'' fractional bits respectively, to have ''f'' bits of precision, we multiply the two fixed point values then scale the result by 2^(f-(m+n))^.
Let's use our previous example of {{{0x10000 * 0x20000}}} (1.0 * 2.0 in 16.16 format) but this time with 24-bits of fractional precision in the result. We have ''m'' = 16, ''n'' = 16, and ''f'' = 24. The initial multiplication results in {{{0x200000000}}}, which we then multiply by 2^(24-(16+16))^ = 2^-8^ or 1/256. Of course, we'd just shift right by 8 places, which in the end gives us 0x02000000, which is 2.0 in 8.24 format.
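Putting that into code, a general multiply might look something like this sketch (my own, with illustrative names; it assumes f <= m+n and leans on a 64-bit intermediate product, which ties into the overflow discussion below):
{{{
#include <stdint.h>

/* multiply two fixed point values with m and n fractional bits, returning
   a result with f fractional bits (f <= m + n) */
int32_t fixed_mul( int32_t a, int m, int32_t b, int n, int f )
{
    int64_t product = ( int64_t ) a * b;               /* scaled by 2^(m+n) */

    return ( int32_t )( product >> ( m + n - f ) );    /* rescale to 2^f */
}
}}}
With the example above, {{{fixed_mul( 0x10000, 16, 0x20000, 16, 24 )}}} computes {{{0x200000000 >> 8}}} and returns 0x02000000.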
One thing to keep in mind is that along with loss of precision (any bits shifted out of the result are lost), we still have to contend with overflow in some situations. For example, if we had multiplied 0xFF0000 and 0x20000 (255.0 x 2.0) the result should be 510.0 in 8.24 format, or 0x1FE000000...which is larger than 32-bits, so it'll be truncated to 0xFE000000 (254.0) on systems that don't support 64-bit intermediate products. This is a silent error, and overflow bugs like this can be excruciating to hunt down.
One way to get around this is to prescale the two multiplicands down, avoiding the final shift, but in the process we lose precision in exchange for avoiding overflow. This is not the ideal way of dealing with insufficient intermediate bits -- instead a piecewise multiplication (which I might explain one day when I'm real bored) would be used to preserve accuracy at the cost of some performance.
The moral being: never assume you have enough bits, always make sure that your inputs and outputs have the precision and range you need and expect!
=== Division ===
Division is similar to multiplication, with a couple caveats. Let's look at an example to get an intuitive feel for the process. Assuming we have 2.0 and 1.0 again in 16.16 fixed point form: 0x20000 and 0x10000. If we divide the former by the latter, we get the value 2. Not 0x20000, but just 2. Which is correct, granted, but it's not in a 16.16 output form. We could take the result and shift it back by the number of bits we need, but...that's just gross. And wrong, since shifting after the integer divide means the precision was already lost in the divide. In other words, if A=0x10000 and B=0x20000 our final result would be 0, instead of 0x8000 (0x8000 = 0.5 in 16.16 fixed point, since 0x8000/65536.0 = 0.5). But if we do the shift first, we manage to retain our precision and also have a properly scaled result.
Well, that all makes sense, but we're assuming that A and B are of the same fractional precision, and what if they're not? Now it's time to look at the "real" math of a fixed point division. Again, this is grade school stuff:
http://bookofhook.com/Images/Fixed_Point_Images/div_1.gif
What this says is that if you take two fixed point numbers of ''m'' and ''n'' fractional bits and divide them, the quotient will be a value with m-n fractional bits. With our previous example of a 16.16 divided by a 16.16, that means 0 fractional bits, which is precisely what we got. If we want to have ''f'' fractional bits, we must scale the numerator by 2^(f-(m-n))^. Since order of operations matters when performing integer operations, we want to scale the numerator before we do the divide, so our final equation looks like:
http://bookofhook.com/Images/Fixed_Point_Images/div_2.gif
Order of operations matters, so actual evaluation would use the middle form, where the numerator is fully expanded before division. The far right form is just to show what the final result will look like.
Overflow is still a concern, so we have to make sure that when we do the adjustment before the divide we don't run out of bits -- this is all too easy, unfortunately. For example, a 16.16 divided by a 16.16 needs to be prescaled by 2^16^, which is a guaranteed overflow. Again, modern CPUs can come to the rescue, this time by providing 64-bit / 32-bit division operations.
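As a sketch of how that looks in C (again my own illustration, not the article's code), we can lean on a 64-bit numerator so the prescale doesn't overflow; this assumes f >= m-n so the shift amount is non-negative:
{{{
#include <stdint.h>

/* divide two fixed point values with m and n fractional bits, returning a
   quotient with f fractional bits (f >= m - n); b must be non-zero */
int32_t fixed_div( int32_t a, int m, int32_t b, int n, int f )
{
    int64_t numerator = ( int64_t ) a << ( f - ( m - n ) );   /* prescale first */

    return ( int32_t )( numerator / b );
}
}}}
For example, {{{fixed_div( 0x10000, 16, 0x20000, 16, 16 )}}} divides {{{0x100000000}}} by {{{0x20000}}} and returns 0x8000, i.e. 0.5 in 16.16.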
== Summary ==
Fixed point is a wonderful way to get around the limitations of systems with slow (or non-existent) floating point units. It also allows you to write very fast inner loops that can generate integer values without a costly float-to-int conversion. However, fixed point math can be very tricky due to range and precision issues. This article attempts to cover the basics of fixed point math so that at least a theoretical grounding is established before tackling the tricky implementation problems (which I might write an article about later).
Thanks to Tom Hubina and Jonathan Blow for commenting on and proof reading this
article.