Complete Floating Point (Blog)
FLOATING POINT : )
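The types float and double in C++ and C (the programming languages) store numbers in a binary form of scientific notation. Some decimal examples of scientific notation:
i. -4.44 x 10^77 (normalized)
ii. +9.943 x 10^-5 (normalized)
iii. 0.001 x 10^3 (not normalized)
A number is normalized when it has exactly one nonzero digit before the point.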
The term floating point comes from the fact that there is no fixed number of digits before and after the decimal point: the point can "float". That is why the type is called float.
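To see normalization in practice, here is a minimal sketch that prints a value in scientific notation using the standard `%e` conversion. The helper name `to_scientific` is made up for this example and is not part of any library.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes x into buf in scientific notation
   with three digits after the point. */
int to_scientific(double x, char *buf, size_t size)
{
    /* %e always prints exactly one digit before the decimal point,
       so the output is normalized even if the input was not. */
    return snprintf(buf, size, "%.3e", x);
}
```

Passing the non-normalized 0.001 x 10^3 (written `0.001e3` in C) yields a string with a single leading digit and exponent 0, because the value itself is simply 1.0.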
Videos about IEEE floating point can help you understand it better, hopefully.
Single-precision range
Exponents 00000000 and 11111111 are reserved, so the usable exponents run from 00000001 to 11111110.
smallest value
- Exponent: 00000001, actual exponent = 1 - 127 = -126
- Fraction: 000...00, significand = 1.0
- 1.0 x 2^-126 ≈ 1.2 x 10^-38
largest value
- Exponent: 11111110, actual exponent = 254 - 127 = +127
- Fraction: 111...11, significand ≈ 2.0
- 2.0 x 2^127 ≈ 3.4 x 10^38
Double-Precision Range
smallest value
- Exponent: 00000000001, actual exponent = 1 - 1023 = -1022
- Fraction: 000...00, significand = 1.0
- 1.0 x 2^-1022 ≈ 2.2 x 10^-308
largest value
- Exponent: 11111111110, actual exponent = 2046 - 1023 = +1023
- Fraction: 111...11, significand ≈ 2.0
- 2.0 x 2^1023 ≈ 1.8 x 10^308
FLOATING-POINT PRECISION
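Relative precision: all fraction bits are significant
- single: approx 2^-23, equivalent to 23 x log10(2) ≈ 23 x 0.3 ≈ 6 decimal digits of precision
- double: approx 2^-52, equivalent to 52 x log10(2) ≈ 52 x 0.3 ≈ 16 decimal digits of precision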
Floating-point example
Represent -0.75
-0.75 = (-1)^1 x 1.1 (binary) x 2^-1
- S = 1
- Fraction = 1000...00 (binary)
- Exponent = -1 + Bias
  - single: -1 + 127 = 126 = 01111110
  - double: -1 + 1023 = 1022 = 01111111110
- single: 1 01111110 1000...00
- double: 1 01111111110 1000...00
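The encoding of -0.75 can be inspected by pulling the three fields out of a float. A minimal sketch, assuming `float` is a 32-bit IEEE 754 single; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical struct for illustration: the three fields of a single. */
struct single_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 bits */
};

struct single_fields split_single(float value)
{
    uint32_t bits;
    struct single_fields f;

    memcpy(&bits, &value, sizeof bits);  /* copy the raw bit pattern */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For `split_single(-0.75f)` the fields should be sign 1, exponent 126, and a fraction with only its top bit set (0x400000).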
What number is represented by the single-precision float 1 10000001 01000...00?
- S = 1
- Fraction = 01000...00 (binary)
- Exponent = 10000001 (binary) = 129
x = (-1)^1 x (1 + 0.01 (binary)) x 2^(129 - 127) = (-1) x 1.25 x 2^2 = -5.0
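The decoding formula can be written out directly. This sketch only covers normalized numbers (exponent field 1 to 254) and uses an invented function name; zeros, denormals, infinities and NaNs are deliberately left out.

```c
#include <math.h>
#include <stdint.h>

/* Applies x = (-1)^S x (1 + Fraction) x 2^(Exponent - 127)
   to a 32-bit pattern. Normalized numbers only. */
double decode_normal_single(uint32_t bits)
{
    uint32_t sign = bits >> 31;
    int exponent = (int)((bits >> 23) & 0xFFu);
    uint32_t fraction = bits & 0x7FFFFFu;

    /* The 23 fraction bits sit after the binary point. */
    double significand = 1.0 + ldexp((double)fraction, -23);
    double magnitude = ldexp(significand, exponent - 127);
    return sign ? -magnitude : magnitude;
}
```

The pattern from the example above is 0xC0A00000 in hexadecimal, and the function returns -5.0 for it.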
Floating-point addition
Consider a 4-digit decimal example: 9.999 x 10^1 + 1.610 x 10^-1
1. Align decimal points: shift the number with the smaller exponent
   9.999 x 10^1 + 0.016 x 10^1
2. Add significands
   9.999 x 10^1 + 0.016 x 10^1 = 10.015 x 10^1
3. Normalize result and check for over/underflow
   1.0015 x 10^2
4. Round and renormalize if necessary
   1.002 x 10^2
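The four steps of the decimal example can be sketched with a toy format. Everything here is an assumption made for illustration: `struct dec4` stores a 4-digit significand as an integer, only positive values are handled, and there is no overflow check. Real hardware works in binary and keeps extra guard digits.

```c
/* Toy format: value = (digits / 1000) x 10^exponent,
   with digits in 1000..9999, i.e. a significand d.ddd. */
struct dec4 {
    long digits;
    int exponent;
};

struct dec4 dec4_add(struct dec4 a, struct dec4 b)
{
    struct dec4 big = a, small = b, sum;

    if (b.exponent > a.exponent) {
        big = b;
        small = a;
    }

    /* Step 1: align by shifting the operand with the smaller
       exponent to the right; shifted-out digits are dropped. */
    while (small.exponent < big.exponent) {
        small.digits /= 10;
        small.exponent++;
    }

    /* Step 2: add the significands. */
    sum.digits = big.digits + small.digits;
    sum.exponent = big.exponent;

    /* Steps 3 and 4: normalize back to four digits, rounding half up. */
    while (sum.digits >= 10000) {
        sum.digits = (sum.digits + 5) / 10;
        sum.exponent++;
    }
    return sum;
}
```

Adding `{9999, 1}` and `{1610, -1}` gives digits 1002 with exponent 2, which is 1.002 x 10^2.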
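Now consider a 4-digit binary example: 1.000 (binary) x 2^-1 + -1.110 (binary) x 2^-2 (that is, 0.5 + -0.4375)
1. Align binary points: shift the number with the smaller exponent
   1.000 x 2^-1 + -0.111 x 2^-1
2. Add significands
   1.000 x 2^-1 + -0.111 x 2^-1 = 0.001 x 2^-1
3. Normalize result and check for over/underflow
   1.000 x 2^-4, with no over/underflow
4. Round and renormalize if necessary
   1.000 x 2^-4 (no change) = 0.0625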
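The binary example can be reproduced with real floats. This is a short sketch with made-up function names, assuming IEEE 754 singles; both operands and the result are exactly representable, so no rounding takes place.

```c
#include <math.h>

/* 1.000 (binary) x 2^-1 + -1.110 (binary) x 2^-2 */
float binary_example_sum(void)
{
    float a = ldexpf(1.0f, -1);    /* 0.5 */
    float b = ldexpf(-1.75f, -2);  /* 1.110 (binary) is 1.75, so b = -0.4375 */
    return a + b;
}

/* Exponent e of x when written as 1.xxx x 2^e.
   x must be nonzero, finite and normalized. */
int normalized_exponent(float x)
{
    int e;
    (void)frexpf(x, &e);  /* frexpf uses a significand in [0.5, 1) */
    return e - 1;         /* convert to the [1, 2) convention */
}
```

The sum is 0.0625 and its normalized exponent is -4, matching steps 3 and 4.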
Floating-point arithmetic hardware (FP adder hardware) usually does:
- addition, subtraction, multiplication, division, reciprocal, square root
- FP <-> integer conversion
Operations usually take several cycles.
Floating-point multiplication
Consider a 4-digit decimal example: (1.110 x 10^10) x (9.200 x 10^-5)
1. Add exponents (for biased exponents, subtract the bias from the sum)
   New exponent = 10 + -5 = 5
2. Multiply significands
   1.110 x 9.200 = 10.212, giving 10.212 x 10^5
3. Normalize result and check for over/underflow
   1.0212 x 10^6
4. Round and renormalize if necessary
   1.021 x 10^6
5. Determine sign of result from signs of operands
   +1.021 x 10^6
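The multiplication steps can be sketched with a similar toy decimal format. As before, this is an illustration under stated assumptions and not how real hardware is built: `struct sdec4` and `sdec4_mul` are hypothetical names, exponents are unbiased, normalizing and rounding are folded into one division, and overflow and underflow are not checked.

```c
/* Toy format: value = +/- (digits / 1000) x 10^exponent,
   with digits in 1000..9999. */
struct sdec4 {
    int negative;   /* 1 if the value is negative */
    long digits;
    int exponent;
};

struct sdec4 sdec4_mul(struct sdec4 a, struct sdec4 b)
{
    struct sdec4 p;
    long product, divisor;

    /* Step 1: add exponents (no bias to subtract in this toy). */
    p.exponent = a.exponent + b.exponent;

    /* Step 2: multiply significands; the product has six digits
       after the point and is at most 9999 * 9999. */
    product = a.digits * b.digits;

    /* Steps 3 and 4: normalize and round half up to four digits. */
    divisor = 1000;
    if (product >= 10000000L) {   /* significand product is 10.0 or more */
        divisor = 10000;
        p.exponent++;
    }
    p.digits = (product + divisor / 2) / divisor;
    if (p.digits >= 10000) {      /* rounding carried out, renormalize */
        p.digits /= 10;
        p.exponent++;
    }

    /* Step 5: the result is negative when the operand signs differ. */
    p.negative = (a.negative != b.negative);
    return p;
}
```

Multiplying `{0, 1110, 10}` by `{0, 9200, -5}` gives a positive result with digits 1021 and exponent 6, which is +1.021 x 10^6.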