Floating-point representation helps computers handle real numbers with a large range of values, both very small and very large. The single precision format uses 32 bits, while double precision uses 64 bits, allowing for more precision and a larger range. These formats are based on scientific notation, with a sign bit, exponent, and mantissa.
The following description explains the terminology and primary details of the IEEE 754 binary floating-point representation. The discussion is confined to the single and double precision formats.
Usually, a real number in binary is written in the following format:
$I_m I_{m-1} \ldots I_2 I_1 I_0 . F_1 F_2 \ldots F_{n-1} F_n$
where each $I_k$ (integer part) and each $F_k$ (fraction part) is either 0 or 1.
A finite number can also be represented by four components: a sign (s), a base (b), a significand (m), and an exponent (e). The numerical value of the number is then evaluated as
$(-1)^s \times m \times b^e$, where $m < |b|$
Depending on base and the number of bits used to encode various components, the IEEE 754 standard defines five basic formats. Among the five formats, the binary32 and the binary64 formats are single precision and double precision formats respectively in which the base is 2.
What is Floating Point Representation
Floating point representation is a way to encode numbers in a format that can handle very large and very small values. It is based on scientific notation, where a number is represented as a fraction and an exponent. In computing, this representation allows a trade-off between range and precision.
Format: A floating point number is typically represented as:
$\text{Value} = \text{Sign} \times \text{Significand} \times \text{Base}^{\text{Exponent}}$
where:
- Sign: Indicates whether the number is positive or negative.
- Significand (Mantissa): Holds the significant digits of the number and so determines its precision.
- Base: Usually 2 in binary systems.
- Exponent: Determines the scale of the number.
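The formula above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `fp_value` is made up for illustration, and exact fractions are used so that no rounding hides the arithmetic.

```python
from fractions import Fraction

def fp_value(sign: int, significand: Fraction, base: int, exponent: int) -> Fraction:
    """Evaluate (-1)^sign * significand * base^exponent exactly."""
    return (-1) ** sign * significand * Fraction(base) ** exponent

# 4.5 = +1.125 * 2^2, where 1.125 is 1.001 in binary
print(fp_value(0, Fraction(9, 8), 2, 2))   # 9/2
```

Changing only the exponent rescales the number, while changing only the significand alters its digits, which is the range versus precision trade-off described above.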
Need for Floating Point Representation
Floating point representation is crucial because:
- Range: It can represent a wide range of values, from very large to very small numbers.
- Precision: It provides a good balance between precision and range, making it suitable for scientific computations, graphics and other applications where both many significant digits and wide ranges are necessary.
- Flexibility: It adapts to different scales of numbers, allowing efficient storage and computation of real numbers in computer systems.
Number System and Data Representation
- Number Systems: Floating point representation usually uses the binary (base-2) system in digital computers. Other number systems like decimal (base-10) or hexadecimal (base-16) may be used in different contexts.
- Data Representation: This covers how numbers are stored in computer memory, including binary encoding and the representation of the various data types.
Table – Precision Representation
Precision | Base | Sign (bits) | Exponent (bits) | Significand (bits)
Single precision | 2 | 1 | 8 | 23+1
Double precision | 2 | 1 | 11 | 52+1
Single Precision Format
The single precision format has 23 bits for the significand (plus 1 implied bit, details below), 8 bits for the exponent and 1 bit for the sign.
For example, the rational number 9 ÷ 2 can be converted to single precision float format as follows:
$9_{10} \div 2_{10} = 4.5_{10} = 100.1_2$
The result is said to be normalized if it is represented with a single leading 1 bit, i.e. $1.001_2 \times 2^2$. (Similarly, when the number $0.000000001101_2 \times 2^3$ is normalized, it appears as $1.101_2 \times 2^{-6}$.) Omitting this implied 1 on the extreme left gives the mantissa of the floating point number. A normalized number provides more accuracy than the corresponding de-normalized number, because the implied most significant bit is not stored and the significand therefore effectively has 23 + 1 = 24 bits. Floating point numbers are to be represented in normalized form wherever possible.
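Normalization can be sketched in Python with the standard `math.frexp` function. The helper name `normalize` is hypothetical, and the sketch assumes a positive, finite input.

```python
import math

def normalize(x: float):
    """Return (significand, exponent) with 1 <= significand < 2 and
    x == significand * 2**exponent, for a positive finite x."""
    m, e = math.frexp(x)      # frexp gives 0.5 <= m < 1 and x == m * 2**e
    return m * 2, e - 1       # shift one binary place to get a leading 1

print(normalize(4.5))         # (1.125, 2)   i.e. 1.001 (binary) x 2^2
print(normalize(13 / 512))    # (1.625, -6)  i.e. 1.101 (binary) x 2^-6
```

The second call uses 13/512, which is the value of 0.000000001101 (binary) x 2^3 from the example in the text.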
Subnormal numbers fall into the category of de-normalized numbers. They use the smallest exponent together with a leading 0 bit instead of the implied 1, because normalizing them would require an exponent that doesn't fit in the field. Reserving an exponent pattern for them slightly reduces the exponent range. Subnormal numbers are less accurate than normalized numbers, i.e. they have less room for nonzero bits in the fraction field; indeed, the accuracy drops as the magnitude of the subnormal number decreases. However, the subnormal representation is useful for filling the gap in the floating point scale near zero.
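To make the gap near zero concrete, the following sketch builds a few single precision bit patterns with Python's `struct` module. The helper `bits_to_float32` is an illustrative name, not part of any library.

```python
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = bits_to_float32(0x00800000)     # exponent field 1, fraction 0
largest_subnormal = bits_to_float32(0x007FFFFF)   # exponent field 0, fraction all 1s
smallest_subnormal = bits_to_float32(0x00000001)  # exponent field 0, fraction 1

print(smallest_normal == 2.0 ** -126)      # True
print(smallest_subnormal == 2.0 ** -149)   # True
print(largest_subnormal < smallest_normal) # True
```

Without subnormals, nothing between zero and 2^-126 could be represented in single precision; with them, the spacing stays at 2^-149 all the way down to zero.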
In other words, the above result can be written as $(-1)^0 \times 1.001_2 \times 2^2$, which yields the components s = 0, b = 2, significand (m) = 1.001, mantissa = 001 and e = 2. The corresponding single precision floating point number is represented in binary as sign, exponent and mantissa fields:
0 10000001 00100000000000000000000
Here the exponent is 2, yet the exponent field holds 129 (127 + 2), which is called the biased exponent. The exponent field must also represent negative exponents, which could be done with a signed encoding (sign magnitude, 1's complement, 2's complement, etc.). IEEE 754 instead stores a biased exponent in plain unsigned binary. The biased exponent has an advantage over the signed encodings when floating point numbers are compared bitwise: two positive floating point numbers compare in the same order as their bit patterns read as unsigned integers.
A bias of $2^{n-1} - 1$, where n is the number of bits used in the exponent, is added to the exponent (e) to get the biased exponent (E). So, the biased exponent (E) of a single precision number can be obtained as
E = e + 127
The exponent of a normalized single precision number ranges from -126 to +127 (biased values 1 to 254). The biased values 0 and 255 are reserved for special values.
Note: When we unpack a floating point number, the exponent obtained is the biased exponent. Subtracting 127 from the biased exponent gives the unbiased exponent.
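The unpacking step can be sketched with Python's `struct` module. The function name `float32_fields` is an assumption made for this example; the shifts and masks follow the 1 / 8 / 23 bit layout described above.

```python
import struct

def float32_fields(x: float):
    """Return (sign, biased_exponent, fraction) of x stored as single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 bits (implied 1 not stored)
    return sign, biased_exponent, fraction

sign, E, fraction = float32_fields(4.5)
print(sign, f"{E:08b}", f"{fraction:023b}")  # 0 10000001 00100000000000000000000
print(E - 127)                               # 2, the unbiased exponent
```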
Double Precision Format
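The double precision format has 52 bits for the significand (plus 1 implied bit), 11 bits for the exponent and 1 bit for the sign. All other definitions are the same as for the single precision format, except for the sizes of the various components. With n = 11 exponent bits the bias is 1023, so E = e + 1023 and the exponent of a normalized number ranges from -1022 to +1023.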
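The same field extraction works for double precision with wider masks. This is a sketch only; `float64_fields` is a hypothetical helper name.

```python
import struct

def float64_fields(x: float):
    """Return (sign, biased_exponent, fraction) of x stored as double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF   # 11 bits
    fraction = bits & ((1 << 52) - 1)        # 52 bits (implied 1 not stored)
    return sign, biased_exponent, fraction

sign, E, fraction = float64_fields(4.5)
print(sign, E, E - 1023)   # 0 1025 2
```

For 4.5 the stored exponent is 1025, which is the unbiased exponent 2 plus a bias of 1023.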
Precision
The smallest change that can be represented in a floating point format is called its precision. The fractional part of a single precision normalized number has exactly 23 bits of resolution (24 bits with the implied bit). This corresponds to $\log_{10}(2^{23}) = 6.924 \approx 7$ decimal digits of accuracy. Similarly, for double precision numbers the precision is $\log_{10}(2^{52}) = 15.654 \approx 16$ decimal digits.
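The digit counts above can be reproduced directly. The sketch below also uses `sys.float_info.epsilon`, which in CPython reports the spacing between 1.0 and the next double; it assumes Python floats are IEEE 754 doubles, which holds on common platforms.

```python
import math
import sys

single_digits = math.log10(2 ** 23)
double_digits = math.log10(2 ** 52)
print(round(single_digits, 3), round(double_digits, 3))   # 6.924 15.654

eps = sys.float_info.epsilon        # 2**-52 for double precision
print(1.0 + eps > 1.0)              # True: a change of eps is representable
print(1.0 + eps / 2 == 1.0)         # True: a smaller change is rounded away
```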
Accuracy
Accuracy in floating point representation is governed by the number of significand bits, whereas range is limited by the exponent. Not all real numbers can be represented exactly in floating point format. For any number x which is not a floating point number, there are two candidates for its floating point approximation: the closest floating point number less than x, written x_, and the closest floating point number greater than x, written x+. A rounding operation is performed on the significant bits in the mantissa field based on the selected mode. The round down mode sets x to x_, the round up mode sets x to x+, and the round towards zero mode sets x to whichever of x_ or x+ lies between zero and x. The round to nearest mode sets x to whichever of x_ or x+ is nearest to x, and it is the most commonly used mode. The closeness of the floating point representation to the actual value is called accuracy.
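The four modes can be illustrated in Python for double precision. This is a sketch under stated assumptions: `neighbours` and `round_with_mode` are made-up helper names, the input is a positive real that is not itself a double, and `math.nextafter` requires Python 3.9 or later. Python does not let a program switch the hardware rounding mode, so the modes are simulated by choosing between the two neighbours.

```python
import math
from fractions import Fraction

def neighbours(x: Fraction):
    """Return (x_, x+): the doubles just below and just above the real x."""
    near = x.numerator / x.denominator       # some double close to x
    if Fraction(near) > x:
        return math.nextafter(near, -math.inf), near
    return near, math.nextafter(near, math.inf)

def round_with_mode(x: Fraction, mode: str) -> float:
    lo, hi = neighbours(x)
    if mode == "down":
        return lo
    if mode == "up":
        return hi
    if mode == "toward_zero":
        return lo                            # x > 0, so x_ lies between zero and x
    # "nearest": ties go to x_ in this sketch (IEEE 754 uses ties-to-even)
    return lo if x - Fraction(lo) <= Fraction(hi) - x else hi

x = Fraction(1, 10)                          # the real number 0.1
for mode in ("down", "up", "toward_zero", "nearest"):
    print(mode, repr(round_with_mode(x, mode)))
```

For 0.1, the nearest mode returns the same double that the literal `0.1` produces, which is slightly larger than one tenth.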
Special Bit Patterns
The standard defines a few special floating point bit patterns. Zero has no most significant 1 bit, so it can't be normalized, and the hidden bit representation requires a special technique for storing it. There are two different bit patterns, +0 and -0, for the same numerical value zero. For single precision floating point representation, these patterns are given below,
0 00000000 00000000000000000000000 = +0
1 00000000 00000000000000000000000 = -0
Similarly, the standard defines two different bit patterns for +INF and -INF, given below,
0 11111111 00000000000000000000000 = +INF
1 11111111 00000000000000000000000 = -INF
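These four patterns can be decoded in Python to confirm what they represent. The helper `bits_to_float32` is an illustrative name; the hexadecimal constants are the bit patterns listed above.

```python
import math
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_zero = bits_to_float32(0x00000000)   # 0 00000000 000...0
neg_zero = bits_to_float32(0x80000000)   # 1 00000000 000...0
pos_inf = bits_to_float32(0x7F800000)    # 0 11111111 000...0
neg_inf = bits_to_float32(0xFF800000)    # 1 11111111 000...0

print(pos_zero == neg_zero)              # True: same numerical value
print(math.copysign(1.0, neg_zero))      # -1.0: the sign bit is still there
print(pos_inf, neg_inf)                  # inf -inf
```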
All of these special values, as well as the others below, are represented through reserved bit patterns in the exponent field (all 0s or all 1s); they are not normalized numbers. Reserving these patterns slightly reduces the exponent range, but this is quite acceptable since the range is so large.
Expressions like 0 x INF or INF ÷ INF make no mathematical sense. The standard calls the result of such expressions Not a Number (NaN). In general, any subsequent arithmetic expression involving NaN yields NaN. The representation of NaN has all 1s in the exponent field and a non-zero significand field. This is shown below for single precision format (x is a don't care bit, and at least one fraction bit must be 1),
x 11111111 mxxxxxxxxxxxxxxxxxxxxxx
where m, the most significant fraction bit, can be 0 or 1. This gives two different kinds of NaN, for example:
0 11111111 00000000000000000000001 = Signaling NaN (SNaN)
0 11111111 10000000000000000000001 = Quiet NaN (QNaN)
QNaN and SNaN are usually used for error handling. A QNaN does not raise any exception as it propagates through most operations, whereas an SNaN raises an invalid operation exception when consumed by most operations.
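The behaviour of a quiet NaN can be observed from Python, as a rough illustration. The helper `float32_bits` is a made-up name, and the exact fraction bits of the NaN that Python produces are platform dependent, so the sketch checks only the properties stated above.

```python
import math
import struct

def float32_bits(x: float) -> int:
    """Return the 32-bit single precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

nan = 0.0 * math.inf                   # an expression with no meaningful result
bits = float32_bits(nan)
exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

print(exponent == 0xFF, fraction != 0) # True True: all 1s exponent, non-zero fraction
print(math.isnan(nan + 1.0))           # True: NaN propagates through arithmetic
print(nan == nan)                      # False: NaN compares unequal even to itself
```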
Overflow and Underflow
Overflow is said to occur when the true result of an arithmetic operation is finite but larger in magnitude than the largest floating point number that can be stored in the given precision. Underflow is said to occur when the true result of an arithmetic operation is nonzero but smaller in magnitude than the smallest normalized floating point number that can be stored. Overflow can't be ignored in calculations, whereas an underflowed result can often be replaced by a subnormal number or zero.
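Both conditions are easy to trigger with double precision values in Python. This is a small sketch that assumes Python floats are IEEE 754 doubles; note that Python's `*` and `/` on floats return infinity or zero here rather than raising an exception.

```python
import math
import sys

largest = sys.float_info.max           # largest finite double
print(largest * 2)                     # inf: the result overflowed

smallest_normal = sys.float_info.min   # smallest normalized double, 2**-1022
subnormal = smallest_normal / 2        # underflows into the subnormal range
print(0.0 < subnormal < smallest_normal)   # True

tiny = 5e-324                          # smallest subnormal double, 2**-1074
print(tiny / 2)                        # 0.0: underflow all the way to zero
```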
Endianness
The IEEE 754 standard defines a binary floating point format, but the architecture details are left to the hardware manufacturers. As a result, the storage order of the individual bytes of a binary floating point number varies from architecture to architecture.
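The effect of byte order can be seen by packing the same value both ways with Python's `struct` module, where `">"` requests big-endian and `"<"` little-endian. The value 4.5 is reused from the single precision example.

```python
import struct
import sys

big = struct.pack(">f", 4.5)       # most significant byte first
little = struct.pack("<f", 4.5)    # least significant byte first

print(big.hex(), little.hex())     # 40900000 00009040
print(sys.byteorder)               # byte order of the machine running this code
```

The bits of the number are identical in both cases; only the order in which the four bytes are laid out in memory differs.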
Advantages
- Wide Range: Can represent very large and very small numbers.
- Efficient Calculation: Suitable for a wide range of scientific, engineering and graphics applications where precision and range are important.
- Standardization: Floating point representation is standardized (IEEE 754), which ensures consistency and compatibility across different systems and programming languages.
Disadvantages
- Precision Issues: Floating point numbers can suffer from precision errors due to rounding and truncation.
- Complexity: More complex than fixed-point representation, requiring more computational resources.
- Overhead: Operations involving floating point numbers can be slower and require more memory compared to integer operations.
Applications
- Scientific Computations: Used in simulations, modeling and calculations requiring high precision and large ranges.
- Graphics: Essential in rendering and manipulating graphical data where precise calculations are needed.
- Engineering: Applied in fields such as aerospace, mechanical engineering and electronics for accurate measurements and simulations.