Floating Point Numbers in Delphi
Floating point values in a Delphi environment
Many novice programmers, and even more experienced ones, are unfamiliar with
the ins and outs of floating point values and their use, which often leads to
unexpected program behaviour or strange output. This article focuses on this
topic in order to make you aware of the particular points of interest when
handling, storing and maintaining floating point values in Borland Delphi.
The first step is to look at the way binary floating point values (FPV) represent
numbers. In fact, there is not one, but a series of different possible
representations, depending on the precision you need for your purposes. The
most important formats for storing and handling floating point values are the
single, double and extended types. These formats are based on the IEEE
standard and are directly supported by the CPU's Floating Point Unit (FPU)
hardware, which is based on the i387 architecture. By contrast, the real48
format is not native to the Intel family of processors. Therefore, it has to be
manipulated in software, which makes it much slower than the native formats.
Programmers should avoid the non-native, 6-byte wide real48 type (1) as
much as possible. If, for compatibility purposes, it is necessary to use the
real48 format, then you should limit its use as much as possible by converting
it to one of the other formats immediately after obtaining the real48 value and
converting it back just before storage. We will focus our attention here on the
native formats: single, double and extended.
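The convert-once advice above can be sketched as follows. This is only an illustration: the record TLegacyRecord and its Amount field are hypothetical stand-ins for whatever legacy data actually holds the real48 value, and the sketch assumes a compiler that still supports the Real48 type.

```pascal
program Real48Demo;
{$APPTYPE CONSOLE}

type
  // Hypothetical legacy layout that still stores a 6-byte real48.
  TLegacyRecord = packed record
    Amount: Real48;
  end;

var
  Rec: TLegacyRecord;
  Work: Double;
begin
  Rec.Amount := 1.5;          // pretend this was just read from storage

  Work := Rec.Amount;         // convert once, right after obtaining the value
  Work := Work * 2 + 0.25;    // all arithmetic happens on the native type

  Rec.Amount := Work;         // convert back just before storage
  Writeln(Work:0:2);
end.
```

All intermediate calculations run on a Double, so the software-emulated format is touched only at the two boundaries.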
Let's have a closer look at the single format. As the table above shows, it
represents the stored value in 32 bits as follows:
sign (1 bit) | exponent (8 bits) | mantissa (23 bits)
The mantissa is in fact not 23, but 24 bits wide. So where is the 24th bit then?
The answer is simple, but at the same time ingenious: the actual value is
always stored normalised. This means that the digit before the binary point
(which can only be 0 or 1 in our binary system, of course) is always 1.
By adjusting the exponent part, the actual value can be scaled in
order to make sure the digit before the binary point equals 1.
Because the first digit of the mantissa is always one, there is no need to
store it! In this format, the first digit is implicitly known to be one.
The exponent is biased: to get the real exponent, you have to subtract 127 from
the stored value. Thus, when the stored exponent is less than 127, the real
exponent will be negative. Hence, the actual value will be less than 1. The
stored value 255, or all bits set, is reserved: it indicates infinity when the
mantissa is zero and the NAN (Not A Number) value otherwise.
The sign bit indicates the sign, with 0 equalling positive and 1 negative.
76.71270751953 = 0 10000101 00110010110110011101000
Mantissa = 00110010110110011101000
= 2^-3 + 2^-4 + 2^-7 + 2^-9 + 2^-10 + 2^-12 + 2^-13 + 2^-16 + 2^-17 + 2^-18 + 2^-20
= 0.1986360549927
= 1.1986360549927 (with the implicit leading 1 added)
Exponent = 10000101
= 133
= 6 (after subtracting the bias of 127)
Sign = 0 = positive
Value = 1.1986360549927 * 2^6
= 1.1986360549927 * 64
= 76.71270751953
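The decoding steps above can be reproduced in code. The following is a minimal sketch rather than a general-purpose routine: it only handles normalised values (no zero, denormals, infinity or NAN), and the variable names are chosen for this illustration.

```pascal
program DecodeSingle;
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

var
  Value: Single;
  Rebuilt: Double;
  Bits, SignBit, Exponent, Mantissa: Cardinal;
begin
  Value := 76.71270751953125;          // the value from the worked example
  Move(Value, Bits, SizeOf(Bits));     // copy the 4 raw bytes into an integer

  SignBit  := Bits shr 31;             // bit 31
  Exponent := (Bits shr 23) and $FF;   // bits 30..23, still biased by 127
  Mantissa := Bits and $7FFFFF;        // bits 22..0, without the implicit 1

  Writeln('Raw bits : $', IntToHex(Bits, 8));       // $42996CE8
  Writeln('Sign     : ', SignBit);                  // 0
  Writeln('Exponent : ', Exponent);                 // 133
  Writeln('Mantissa : $', IntToHex(Mantissa, 6));   // $196CE8

  // Rebuild the value: (1 + mantissa / 2^23) * 2^(exponent - 127)
  Rebuilt := (1 + Mantissa / 8388608) * IntPower(2, Integer(Exponent) - 127);
  if SignBit = 1 then
    Rebuilt := -Rebuilt;
  Writeln('Rebuilt  : ', Rebuilt:0:14);
end.
```

Move is used instead of a pointer cast so that the sketch does not depend on any particular pointer type being declared.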
The double format follows the same rules, but the mantissa and exponent
parts are bigger and can therefore store numbers with greater precision. The
extended format, however, differs slightly from its single and double
counterparts in that the integer part is explicitly stored in bit 63. This integer
part in bit 63 absorbs any carry values, thus ensuring precision up to 19 digits.
By contrast, in single and double formats, the integer part is always one. You
should also realise that the FPU internally always works on extended
types (also known as temporary real format). This means that every time
you load a single or double value into the FPU, it is converted to the extended
format.
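To see how much storage each format takes on your own compiler and target, a small sketch such as the following can help. On a 32-bit x86 target the sizes are normally 4, 8 and 10 bytes for single, double and extended; other targets may differ (some map Extended onto Double), so treat the output as something to check rather than assume.

```pascal
program FloatSizes;
{$APPTYPE CONSOLE}

begin
  // SizeOf is evaluated by the compiler for the current target platform.
  Writeln('Single   : ', SizeOf(Single), ' bytes');
  Writeln('Double   : ', SizeOf(Double), ' bytes');
  Writeln('Extended : ', SizeOf(Extended), ' bytes');
end.
```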
As one can see, the floating point format has a major disadvantage: if the
value cannot be written as a limited and exact sequence of powers of the
base number (2 for our binary system), the resulting value will only be an
approximation of the desired value. Only numbers that are a finite sum of
powers of 2, and that fit within the mantissa, can be exactly represented.
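For instance, the value 0.25 can be exactly represented in the single
format as:
0 01111101 00000000000000000000000
= $3E800000
Mantissa = 00000000000000000000000
= 0
= 1 (with the implicit leading 1 added)
Exponent = 01111101
= 125
= -2 (after subtracting the bias of 127)
Sign = 0 = positive
Value = 1 * 2^-2
= 0.25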
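The difference between an exactly representable value and an approximated one can be made visible by storing the same literal in a single and in a double and comparing the two. This is a small sketch under the assumption of a console application; the variable names are illustrative only.

```pascal
program ExactOrNot;
{$APPTYPE CONSOLE}

var
  S: Single;
  D: Double;
begin
  // 0.25 = 2^-2 fits exactly in both formats, so both hold the same number.
  S := 0.25;
  D := 0.25;
  Writeln('0.25: single = double ? ', S = D);   // TRUE

  // 2.1 has an endless binary expansion; each format rounds it differently.
  S := 2.1;
  D := 2.1;
  Writeln('2.1 : single = double ? ', S = D);   // FALSE

  // Print both approximations with more digits than either can hold exactly.
  Writeln('2.1 as single: ', S:0:17);
  Writeln('2.1 as double: ', D:0:17);
end.
```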
This has consequences for equality comparisons. Consider a test of the form
"if MyVar=2.1 then DoThis". MyVar being an FPV, the DoThis part would in
general never be executed, simply because the key value of 2.1 cannot be
exactly represented in floating point format. Furthermore, in many cases, the
contents of MyVar will be the result of several calculations beforehand. That
would mean that even if the key value could be exactly represented in the
Binary Floating Point format, the calculation would involve other numbers
that might be approximations instead of exact representations, so that the
final result would only be close to the desired value. Hence, the algorithm
would fail. Possible techniques one could use to overcome this problem are to
truncate or round the FPV to an integer:
if Round(MyVar)=2 then …
You could easily scale the variable if you need precision involving one or more
digits after the decimal separator:
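A minimal sketch of such a scaled comparison is given below. It assumes that one digit after the decimal separator is enough, so the value is multiplied by 10 before rounding; MyVar is declared as a Single here purely for illustration.

```pascal
program ScaledCompare;
{$APPTYPE CONSOLE}

var
  MyVar: Single;
begin
  MyVar := 2.1;   // stored as the nearest single, slightly below 2.1

  // Direct equality against the literal: the literal is held at a higher
  // precision than the single, so this test is expected to fail.
  if MyVar = 2.1 then
    Writeln('direct comparison: equal')
  else
    Writeln('direct comparison: not equal');

  // Scale by 10 and round to an integer: 20.99999... rounds to 21.
  if Round(MyVar * 10) = 21 then
    Writeln('scaled comparison: match');
end.
```

Keep in mind that Round uses banker's rounding for values that lie exactly halfway between two integers, which matters if the scaled value can end in .5.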
You could also use a smaller than/greater than comparison instead of the
equality comparison:
if MyVar>2.1 then …
Often, the best solution is to use (scaled) integers where possible. In Delphi,
the Currency type in fact uses a scaled integer to represent numbers with up
to four digits after the decimal separator. So, the stored integer value 65481 in
fact designates 6.5481. When performing calculations on integers, the
resulting integer will be an exact representation of a certain number. Needless
to say, integer handling is more efficient and much faster than using FPVs.
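The scaled-integer nature of Currency can be checked by looking at its raw storage. The sketch below assumes that Currency occupies 8 bytes holding a 64-bit integer scaled by 10000, which is how Delphi documents the type; the variable names are illustrative.

```pascal
program CurrencyRaw;
{$APPTYPE CONSOLE}

var
  Amount: Currency;
  Raw: Int64;
begin
  Amount := 6.5481;

  // Copy the 8 bytes of the Currency into an Int64 to see the stored integer.
  Move(Amount, Raw, SizeOf(Raw));

  Writeln('Size of Currency : ', SizeOf(Currency), ' bytes');
  Writeln('Stored integer   : ', Raw);   // 65481
end.
```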