Floating Point Representation Practice Problems

Practice Problems: Chapter 1

1.​ The floating point representation can be expressed in any of the following forms:

Standard Form : 𝐹 = (± 0.𝑑1𝑑2𝑑3···𝑑𝑚)β × β^𝑒; 𝑑1 ≠ 0.

IEEE Normalized Form : 𝐹 = (± 0.1𝑑1𝑑2𝑑3···𝑑𝑚)β × β^𝑒.

IEEE Denormalized Form : 𝐹 = (± 1.𝑑1𝑑2𝑑3···𝑑𝑚)β × β^𝑒.

a) Consider a system with β = 2, 𝑚 = 4, and −3 ≤ 𝑒 ≤ 4. Find the maximum and
minimum numbers this system can store, with and without negative support. Express the
numbers in both binary and decimal digits for all three forms.

b) How many numbers can this system represent or store in all these forms?

c) Using the Standard Form, find all the decimal numbers without negative support, plot them
on a real line, and show whether the number line is equally spaced.

d) For the IEEE standard for double-precision (64-bit) arithmetic, find the smallest positive
number and the largest number representable by a system that follows this standard. Do
not find their decimal values, but simply represent the numbers in the following format:
(± 0.1𝑑1𝑑2𝑑3···𝑑𝑚)β × β^(𝑒 − 𝑒𝑥𝑝𝑜𝑛𝑒𝑛𝑡𝐵𝑖𝑎𝑠).
Be mindful of the conditions for representing ±∞ and ±0 in this IEEE standard.

e) In the above IEEE standard, if the exponent bias were altered to
𝑒𝑥𝑝𝑜𝑛𝑒𝑛𝑡𝐵𝑖𝑎𝑠 = 500, what would the smallest positive number and the largest number
be? Write your answers in the same format as in part (d). Note that the conditions for
representing ±∞ and ±0 are still maintained as before.
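As a sanity check for parts (a)–(c), the Standard-Form values of the toy system in part (a) can be enumerated by brute force. This is a minimal Python sketch, not part of the original problem; every value here is an exact dyadic fraction, so host round-off plays no role:

```python
# Enumerate every positive Standard-Form number
#   F = (0.d1 d2 d3 d4)_2 * 2**e,  d1 != 0,  -3 <= e <= 4
# for the system of part (a): beta = 2, m = 4.
from itertools import product

beta, m = 2, 4
e_min, e_max = -3, 4

values = set()
for digits in product(range(beta), repeat=m):
    if digits[0] == 0:          # Standard Form requires d1 != 0
        continue
    mantissa = sum(d * beta ** -(i + 1) for i, d in enumerate(digits))
    for e in range(e_min, e_max + 1):
        values.add(mantissa * beta ** e)

print(len(values))               # 64 distinct positive numbers
print(min(values), max(values))  # 0.0625 and 15.0
```

The count confirms the hand calculation 2^(m−1) mantissas × 8 exponents = 64; doubling it for a sign bit gives the "with negative support" count.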

2.​ If 𝑥 = 3/8 and 𝑦 = 5/8, find 𝑓𝑙(𝑥 × 𝑦) where 𝑚 = 4. Also check whether
𝑥 × 𝑦 = 𝑓𝑙(𝑥 × 𝑦). If not, find the rounding error of the product of these two numbers.
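Problem 2 can be checked by machine with a small helper that rounds to 𝑚 significant base-2 digits. This is my own sketch of fl (not a library function), assuming round-to-nearest:

```python
import math

def fl(x, m, beta=2):
    """Round x to m significant base-beta digits (round-to-nearest),
    mimicking storage in the Standard Form (0.d1...dm) * beta**e."""
    if x == 0:
        return 0.0
    # choose e so the mantissa lies in [1/beta, 1)
    e = math.floor(math.log(abs(x), beta)) + 1
    mantissa = x / beta ** e
    return round(mantissa * beta ** m) / beta ** m * beta ** e

x, y = 3 / 8, 5 / 8
p = x * y                  # exact in binary: 15/64 = 0.234375
print(fl(p, m=4))          # 0.234375
print(fl(p, 4) == p)       # True: 15/64 = (0.1111)_2 * 2**-2 fits in m = 4
```

Here the product happens to fit exactly in four binary digits, so the rounding error is zero; the same helper exposes a nonzero error whenever the exact product needs more than 𝑚 digits.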
3.​ Consider the quadratic equation 𝑥² − 60𝑥 + 1 = 0. Working to 6 significant figures,
compute the roots of the quadratic equation and check that there is a loss of significance.
Find the correct roots such that loss of significance does not occur.
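Problems 3, 5, and 8 all hinge on the same remedy: never subtract nearly equal quantities in the quadratic formula. A common fix, sketched here as a hypothetical helper applied to the equation of problem 3, computes the larger-magnitude root first and recovers the other from the product of roots:

```python
import math

def quadratic_roots_stable(a, b, c):
    """Compute both roots of a*x**2 + b*x + c = 0 without catastrophic
    cancellation: -b and the discriminant term are given the same sign
    so they add, and the small root comes from x1 * x2 = c / a."""
    disc = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(disc, b))  # never a near-cancellation
    return q / a, c / q                      # (large root, small root)

big, small = quadratic_roots_stable(1, -60, 1)
print(big, small)   # roughly 59.9833 and 0.0166713
```

For 𝑥² − 60𝑥 + 1 = 0 the naive formula gives the small root as 30 − √899, a subtraction of nearly equal numbers; the reformulation 1/(30 + √899) keeps all six significant figures.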
4.​ Given β = 2, 𝑚 = 5 , − 100 ≤ 𝑒 ≤ 100. Using the IEEE Normalized form, answer the
following:

a) Compute the Machine Epsilon (𝞊M).

b) Compute the minimum of ∣x∣.


c) How many non-negative numbers can you represent using this system?
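For part (a), the defining property of the machine epsilon, the smallest power of β for which 1 + ε is still distinguishable from 1 in stored arithmetic, can be demonstrated on the host's IEEE doubles; the same halving argument applies to the toy system with 𝑚 = 5, just with far fewer digits:

```python
import sys

# Find the machine epsilon of the host's doubles by repeated halving:
# stop when adding eps/2 to 1.0 no longer changes the stored result.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                            # 2**-52 for IEEE double precision
print(eps == sys.float_info.epsilon)  # True: matches Python's reported value
```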

5.​ Consider the quadratic equation 𝑥² − 16𝑥 + 3 = 0. Explain how the loss of significance
occurs in finding the roots of the quadratic equation if we restrict to 4 significant figures.
Discuss how to avoid this and find the roots.

6.​ Given a system parameterized by β = 2, 𝑚 = 3, and 𝑒min = −1 ≤ 𝑒 ≤ 𝑒max = 2, where
𝑒 ∈ ℤ, answer the following:

(a) Find the floating-point representation of the numbers (6.25)₁₀ and (6.875)₁₀ in the
Normalized Form. That is, find 𝑓𝑙(6.25)₁₀ and 𝑓𝑙(6.875)₁₀.

(b) What are the rounding errors 𝛿1,𝛿2 in part (a)?

(c) Can the values (6.25)₁₀ and (6.875)₁₀ be represented in the Denormalized Form? If so,
find the floating-point representations. If not, concisely explain why.

(d) Find the rounding errors for the Standard, Normalized, and Denormalized Forms.
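A machine check for parts (a) and (b) can use exact rational arithmetic. Note that 6.25 and 6.875 only fit inside −1 ≤ 𝑒 ≤ 2 under the implicit-leading-1 convention (the sheet's third form), so this sketch assumes that convention plus round-half-to-even; `fl_normalized` is a hypothetical helper, and a classroom rule that breaks ties differently would change the result for 6.25:

```python
from fractions import Fraction

def fl_normalized(x, m, e_min, e_max, beta=2):
    """Round x > 0 to the form (1.d1...dm)_beta * beta**e, round-to-nearest
    with ties to even. Returns (value, e), or None if the required exponent
    falls outside [e_min, e_max]."""
    x = Fraction(x)
    e = 0
    while x >= beta:          # scale the mantissa into [1, beta)
        x /= beta
        e += 1
    while x < 1:
        x *= beta
        e -= 1
    if not e_min <= e <= e_max:
        return None
    rounded = Fraction(round(x * beta ** m), beta ** m)  # keep m fraction digits
    return float(rounded * Fraction(beta) ** e), e

print(fl_normalized(6.25, m=3, e_min=-1, e_max=2))   # (6.0, 2): tie rounded to even
print(fl_normalized(6.875, m=3, e_min=-1, e_max=2))  # (7.0, 2): rounded up
```

The returned values make the rounding errors in part (b) immediate: δ = |x − fl(x)| for each input.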

7.​ Consider the real number x = (8.235)₁₀.

(a) First convert the decimal number x to binary format, to at least 8 binary places.

(b) What will be the binary value of x [find fl(x)] if you store it in a system with m = 6 using
the Denormalized form of floating point representation?

(c) Now convert the stored value you obtained in the previous part back to decimal form,
and calculate the rounding error.
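The repeated-doubling conversion asked for in part (a) can be sketched as follows (a minimal helper of my own; it truncates rather than rounds the fractional bits):

```python
def to_binary(x, places=8):
    """Convert a non-negative decimal to a binary string with `places`
    fractional bits, obtained by repeatedly doubling the fraction and
    peeling off the integer bit each time (truncated, not rounded)."""
    ip = int(x)
    int_bits = bin(ip)[2:]          # integer part via Python's bin()
    frac = x - ip
    frac_bits = []
    for _ in range(places):
        frac *= 2
        bit = int(frac)             # the next binary digit
        frac_bits.append(str(bit))
        frac -= bit
    return int_bits + "." + "".join(frac_bits)

print(to_binary(8.235))   # 1000.00111100
```

Doubling 0.235 yields the digit sequence 0, 0, 1, 1, 1, 1, 0, 0, which matches the printed result.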

8.​ Consider the quadratic equation:

𝑥² − 12𝑥 + 5 = 0

a) Compute the roots of the quadratic equation while keeping to four significant figures.

b) Explain how loss of significance occurs in this case due to the subtraction of nearly
equal numbers.

c) Discuss an alternative approach to computing the roots that avoids loss of significance,
and use this method to determine the correct roots.

9.​ Consider a computing system with base β = 2, 𝑚 = 3, and 𝑒min = −3 ≤ 𝑒 ≤ 𝑒max = 2.

a) In the Standard form of this system, determine the total number of representable values,
including support for negative numbers. Also, compute the maximum value of δ, the gap
between consecutive representable numbers.

b) Express the floating-point representations (binary format) for the numbers 𝑥 = 4/8 and
𝑦 = 7/8 in this system.
c) Compute 𝑓𝑙(𝑥 × 𝑦) and determine whether this value can be stored within the given
floating-point system.
