
Computer Floating-Point Arithmetic and Round-Off Errors


Kusal Kaluarachchi
8 min read · Jun 23, 2022

Photo by Mika Baumeister on Unsplash

At some point, we have all thought, “The computer calculated it, so it must be right.” Actually, it is not always so. How do we know the computer's calculations are correct?

You don't. In fact, they are not always correct, and you must understand the limitations. In this article I'm going to talk about one such limitation.

Compacting infinitely many real numbers into a finite number of bits requires an approximate representation. Most programs can store the results of integer computations in 32 or 64 bits. But given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be represented exactly using that many bits. Therefore, the result of a floating-point calculation must often be rounded off to fit into a finite representation. This rounding error is a characteristic feature of floating-point computation.

Let’s see how the “real” numbers are represented.

First, consider a number like 25. The computer doesn't store the decimal digits 2 and 5 directly; everything is converted into binary.

So 25 is represented as 11001 (base 2), since 25 = 16 + 8 + 1 = 2⁴ + 2³ + 2⁰.

Let’s see how fractions are represented.

We can represent 0.375 as 0.011 (base 2), since 0.375 = 1/4 + 1/8 = 2⁻² + 2⁻³.

Another example: 0.1 = 0.000110011001100110011001100… (base 2)

As you can see, 0.1 has no finite binary representation. Computer memory is finite, yet we need to represent numbers whose binary expansions never end. So how do we do that?
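You can see this limitation directly on a real machine. Here is a minimal sketch (the class name is mine) using java.math.BigDecimal, discussed later in this article, to reveal the exact value a double actually stores for 0.1:

import java.math.BigDecimal;

public class ExactValueOfTenth {
    public static void main(String[] args) {
        // new BigDecimal(double) exposes the exact binary value held by
        // the double, with no decimal rounding applied on top.
        System.out.println(new BigDecimal(0.1));
        // prints 0.1000000000000000055511151231257827021181583404541015625
    }
}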

IEEE 754 standard for floating-point representation
Most computers these days implement IEEE 754 binary floating-point arithmetic. IEEE 754 specifies very careful rules to ensure accurate results at the available precision. It also specifies different degrees of precision for different applications: for example, 32-bit single precision, 43-bit extended precision, and 64-bit double precision. A format has three main components, as follows:

1. The sign of the mantissa — 0 represents a positive number, while 1 represents a negative number.

2. The biased exponent — the exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent to get the stored exponent.

3. The normalised mantissa — the mantissa is the fractional part of the number written in base-2 scientific notation; only the digits after the binary point are stored.

Any number can be expressed in scientific notation in many different ways. For example, the number 8 can be written as 8.0 × 10⁰, 0.8 × 10¹, or 80 × 10⁻¹. In normalized form, 8 is represented as 8.0 × 10⁰.

So, to sum up:

1. The sign bit is 0 for positive, 1 for negative.

2. The exponent’s base is two.

3. The exponent field contains 127 plus the true exponent for single precision, or 1023 plus the true exponent for double precision.

4. The first bit of the mantissa is assumed to be 1 and is not stored explicitly.
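Putting the three fields together, a stored single-precision pattern decodes as:

value = (−1)^sign × 1.mantissa × 2^(exponent − 127)

For double precision, the bias 127 is replaced by 1023.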

Single precision IEEE 754 floating-point layout: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 bits total).

Double precision IEEE 754 floating-point layout: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 bits total).

Bit representation

Let’s see an example of how to represent a decimal number in the IEEE 754 single-precision standard. Let’s take 9.1 as an example.

First, we convert it into binary. We have to find the binary formats of 9 and 0.1 separately.

The binary format for 9 is 1001 (base 2).

For the fraction 0.1, we repeatedly multiply by 2 and record the integer part at each step:

0.1 × 2 = 0.2 → 0
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1
0.6 × 2 = 1.2 → 1
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
…

As the steps show, this is never going to end: the marked block of digits (0011) repeats again and again. It is an infinite binary representation.
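This repeated-multiplication procedure is easy to automate. A minimal sketch in Java (the class name is mine; it stops after 24 digits since the expansion never terminates):

public class FractionToBinary {
    public static void main(String[] args) {
        double fraction = 0.1;
        StringBuilder bits = new StringBuilder("0.");
        for (int i = 0; i < 24; i++) {
            fraction *= 2;            // shift one binary place left
            if (fraction >= 1) {      // the integer part is the next digit
                bits.append('1');
                fraction -= 1;
            } else {
                bits.append('0');
            }
        }
        System.out.println(bits);     // 0.000110011001100110011001
    }
}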

So the binary format for 0.1 is

0.000110011001100110011001100…

So for 9.1:

1001.0001100110011001100…

If we write this in scientific notation, it becomes

1.0010001100110011001100… × 2³
Now let’s write this in the IEEE standard. To allow for negative exponents, 127 is added to the actual exponent to get the stored representation; we say the exponent is “biased” by 127. The stored exponent field can hold 0 to 255, and since 0 and 255 are reserved for special values, the actual exponents of normal numbers range from −126 to +127. Because we have a positive power (2³), we add the exponent bias (127):

3 + 127 = 130

Now let’s convert 130 into binary format:

130 = 10000010 (base 2)

Now let’s look at the mantissa. The mantissa is the next 23 bits (for single precision). Since the binary value of 9.1 is 1.0010001100110011001100… × 2³, the mantissa will be

00100011001100110011001

Now let’s write the full value. Since 9.1 is a positive number, the sign bit is zero. So the complete representation of 9.1 in single precision, according to the IEEE standard, is

01000001000100011001100110011001

But before storing, the computer rounds: if the 24th mantissa bit is 1, it adds 1 to the 23rd bit (if it is 0, it does nothing). In our case the 24th bit is 1, so the stored answer will be

01000001000100011001100110011010

If you check this with an IEEE 754 calculator, you will get the same answer:

https://www.binaryconvert.com/result_float.html?decimal=057046049

As you can see, there is a small difference at the end of the mantissa. This difference is the rounding error, and it happens because there is a limit to the number of bits in which we can represent the mantissa.

Now let’s convert the computer-stored value back to a normal decimal value. Decoding the stored bits gives sign 0, exponent 130 − 127 = 3, and mantissa 1.00100011001100110011010, which works out to approximately 9.10000038.

So first we converted 9.1 to the IEEE 754 standard, and then converted the IEEE 754 value back to decimal; we got 9.10000038 instead of 9.1. This is the floating-point rounding problem.
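You can verify both the stored bit pattern and the decoded value in Java; a minimal sketch (the class name is mine):

import java.math.BigDecimal;

public class SinglePrecisionBits {
    public static void main(String[] args) {
        float f = 9.1f;
        // Float.floatToIntBits returns the raw IEEE 754 bit pattern.
        String bits = Integer.toBinaryString(Float.floatToIntBits(f));
        System.out.println(String.format("%32s", bits).replace(' ', '0'));
        // prints 01000001000100011001100110011010

        // The exact decimal value of the stored float:
        System.out.println(new BigDecimal(f));
        // prints 9.1000003814697265625
    }
}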

https://www.smbc-comics.com/?id=2999

The comic above makes the point. Internally, the computer carries several binary digits of precision. The values 0.1 and 0.2 do not have exact representations in binary floating point, so when you convert from decimal to binary and back, you can get answers that look a little funny in decimal.
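The classic demonstration takes two lines of Java:

public class PointOnePlusPointTwo {
    public static void main(String[] args) {
        // Neither 0.1 nor 0.2 is exactly representable in binary,
        // so their sum rounds to a value just above 0.3.
        System.out.println(0.1 + 0.2);        // 0.30000000000000004
        System.out.println(0.1 + 0.2 == 0.3); // false
    }
}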

So, if we are writing an algorithm, we should handle this floating-point error.

Why does this matter?

Here are some real-life examples of what can happen when numerical algorithms are not applied correctly.

Patriot missile accident — In 1991, an American Patriot missile failed to track and destroy an incoming missile, which instead hit a US Army barracks, killing 28 people. The system tracked time in tenths of seconds, and the error in approximating 0.1 with 24 bits was magnified in its calculations.

Sinking of an oil rig — In 1992, the Sleipner A oil and gas platform sank in
the North Sea near Norway. Numerical issues in modelling the structure
caused shear stresses to be underestimated by 47%. As a result, concrete
walls were not built thick enough.

Explosion of the Ariane 5 rocket — the rocket exploded just after lift-off on its maiden voyage from French Guiana on June 4, 1996; the failure was ultimately the consequence of a simple overflow.

What should you do?

95% of folks out there are completely clueless about floating-point. — James Gosling (in a JavaOne keynote address)

Use double instead of float.

When adding floating-point numbers, add the smallest first.

More generally, try to avoid adding dissimilar quantities.

Specific scenario: when adding a list of floating-point numbers, sort them first (see the sketch after this list).

Don’t use floating-point variables to control what is essentially a counted loop.

Use fewer arithmetic operations where possible.
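Here is a minimal sketch of the sorting advice. The data is illustrative and assumed (one huge term plus many tiny ones; the class name and values are mine): adding the large value first swallows every small term, while adding the small terms first lets them accumulate before meeting the large one.

import java.util.Arrays;

public class SortedSummation {
    public static void main(String[] args) {
        double[] values = new double[1001];
        values[0] = 1e16;                           // one huge term
        Arrays.fill(values, 1, values.length, 1.0); // 1000 tiny terms

        System.out.println(sum(values)); // 1.0E16 (tiny terms all lost)
        Arrays.sort(values);             // smallest first
        System.out.println(sum(values)); // 1.0000000000001E16
    }

    static double sum(double[] xs) {
        double total = 0;
        for (double x : xs) total += x;
        return total;
    }
}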

BigDecimal in Java
The java.math.BigDecimal class provides operations for arithmetic,
manipulation, rounding, comparison and format conversion.

BigDecimal consists of two parts:

unscaled value — an arbitrary-precision integer

scale — a 32-bit integer representing the number of digits to the right of the decimal point

For instance, the BigDecimal 2.18 has an unscaled value of 218 and a scale of 2.
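A minimal sketch of BigDecimal in use (the class name is mine). Note that constructing from a String keeps the value exact, whereas new BigDecimal(0.1) would capture the inexact double:

import java.math.BigDecimal;

public class BigDecimalBasics {
    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("2.18");
        System.out.println(price.unscaledValue()); // 218
        System.out.println(price.scale());         // 2

        BigDecimal tenth = new BigDecimal("0.1");
        // Repeated addition stays exact: no drift, unlike double.
        System.out.println(tenth.add(tenth).add(tenth)); // 0.3
    }
}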

Let’s see a problematic piece of code.
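A loop like this, repeatedly subtracting 0.1 from 10.0 as a double, is a minimal reconstruction consistent with the output below (details of the original snippet may differ):

public class FloatingPointDrift {
    public static void main(String[] args) {
        // 0.1 has no exact binary representation, so each subtraction
        // accumulates a tiny error that soon shows up in the output.
        for (double d = 10.0; d > 0; d -= 0.1) {
            System.out.println(d);
        }
    }
}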

Output

10.0
9.9
9.8
9.700000000000001
9.600000000000001
9.500000000000002
9.400000000000002
9.300000000000002
9.200000000000003
9.100000000000003
9.000000000000004
8.900000000000004
8.800000000000004
8.700000000000005
8.600000000000005
8.500000000000005
8.400000000000006
8.300000000000006
8.200000000000006
8.100000000000007
8.000000000000007
7.9000000000000075
7.800000000000008
7.700000000000008
7.6000000000000085
7.500000000000009
7.400000000000009
7.30000000000001
7.20000000000001
7.10000000000001
7.000000000000011
6.900000000000011
6.800000000000011
6.700000000000012
6.600000000000012
6.500000000000012
6.400000000000013
6.300000000000013
6.2000000000000135
6.100000000000014
6.000000000000014
5.900000000000015
5.800000000000015
5.700000000000015
5.600000000000016
5.500000000000016
5.400000000000016
5.300000000000017
5.200000000000017
5.100000000000017
5.000000000000018
4.900000000000018
4.8000000000000185
4.700000000000019
4.600000000000019
4.5000000000000195
4.40000000000002
4.30000000000002
4.200000000000021
4.100000000000021
4.000000000000021
3.9000000000000212
3.800000000000021
3.700000000000021
3.600000000000021
3.500000000000021
3.400000000000021
3.3000000000000207
3.2000000000000206
3.1000000000000205
3.0000000000000204
2.9000000000000203
2.8000000000000203
2.70000000000002
2.60000000000002
2.50000000000002
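Rewriting the same loop with BigDecimal removes the drift entirely; a sketch (the class name is mine):

import java.math.BigDecimal;

public class BigDecimalCountdown {
    public static void main(String[] args) {
        BigDecimal step = new BigDecimal("0.1");    // exactly 0.1
        for (BigDecimal d = new BigDecimal("10.0");
             d.signum() > 0;
             d = d.subtract(step)) {
            System.out.println(d); // 10.0, 9.9, 9.8, ... with no drift
        }
    }
}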
