Unit 3
A Q M
00000 1010 00011 Initial Values
Remainder=00001
Quotient=0011
Real Numbers
• Numbers with fractions
• Convert Binary to Decimal
(1101.01101)
Integer part 1101:
=1*2^3+1*2^2+0*2^1+1*2^0
=8+4+1
=13
Fraction part .01101:
=1*2^-2+1*2^-3+1*2^-5
=1/4+1/8+1/32
=0.25+0.125+0.03125
=0.40625
Ans-(13.40625)
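The positional weights used above can be checked with a short sketch. The function name `binary_to_decimal` is only an illustrative choice, and the input is assumed to be an unsigned binary string.

```python
def binary_to_decimal(bits: str) -> float:
    """Convert an unsigned binary string such as '1101.01101' to its value."""
    integer_part, _, fraction_part = bits.partition(".")
    value = 0.0
    # Integer bits carry weights 2^0, 2^1, ... counted from the right.
    for position, bit in enumerate(reversed(integer_part)):
        value += int(bit) * 2 ** position
    # Fraction bits carry weights 2^-1, 2^-2, ... counted from the left.
    for position, bit in enumerate(fraction_part, start=1):
        value += int(bit) * 2 ** -position
    return value


print(binary_to_decimal("1101.01101"))  # 13.40625
```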
Convert Decimal to Binary
• (28.125)
28=11100
(.125)
=0.125*2 =0.250 (bit 0)
=0.250*2 =0.500 (bit 0)
=0.500*2 =1.000 (bit 1)
Reading the integer bits from top to bottom: .001
Ans-(28.125)=(11100.001)
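The repeated multiply-by-2 procedure can be sketched as follows. The helper name `decimal_to_binary` and the cap `max_fraction_bits` are assumptions made for illustration, and the input is assumed to be non-negative.

```python
def decimal_to_binary(value: float, max_fraction_bits: int = 8) -> str:
    """Convert a non-negative number to a binary string by repeated doubling."""
    integer_part = int(value)
    fraction = value - integer_part
    bits = []
    # Each doubling pushes the next fraction bit into the integer position.
    while fraction > 0 and len(bits) < max_fraction_bits:
        fraction *= 2
        bit = int(fraction)
        bits.append(str(bit))
        fraction -= bit
    return f"{integer_part:b}." + ("".join(bits) or "0")


print(decimal_to_binary(28.125))  # 11100.001
```

The bits are collected in the order they are produced, so the first doubling gives the bit immediately after the radix point.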
Single-Precision Floating Point
• Total 32 bits:
• Bits 0-22 = Mantissa
• Bits 23-30 = E' = E + Bias (Bias for single precision = 127)
• Bit 31 = Sign bit (0 = positive, 1 = negative)
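As a rough illustration of this layout, the three fields can be pulled out of a 32-bit word with Python's standard `struct` module. The function name `float32_fields` is hypothetical, and -6.5 is just a sample value.

```python
import struct


def float32_fields(value: float):
    """Return (sign, biased exponent E', mantissa field) of a 32-bit float."""
    # Pack as big-endian single precision, then reread the same bytes as an integer.
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31                      # bit 31
    biased_exponent = (word >> 23) & 0xFF  # bits 23-30
    mantissa = word & 0x7FFFFF             # bits 0-22
    return sign, biased_exponent, mantissa


sign, e_biased, mantissa = float32_fields(-6.5)  # -6.5 = -1.101 * 2^2
print(sign, e_biased, e_biased - 127, f"{mantissa:023b}")
# 1 129 2 10100000000000000000000
```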
Double-Precision Floating Point
• Total 64 bits:
• Bits 0-51 = Mantissa
• Bits 52-62 = E' = E + Bias (Bias for double precision = 1023)
• Bit 63 = Sign bit (0 = positive, 1 = negative)
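The 64-bit layout can be inspected in the same way. This is only a sketch: `float64_fields` is an illustrative name, and the printed value shows that the stored exponent of 1.0 (true exponent 0) equals the double-precision bias of 1023.

```python
import struct


def float64_fields(value: float):
    """Return (sign, biased exponent E', mantissa field) of a 64-bit float."""
    (word,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = word >> 63                        # bit 63
    biased_exponent = (word >> 52) & 0x7FF   # bits 52-62 (11 bits)
    mantissa = word & ((1 << 52) - 1)        # bits 0-51
    return sign, biased_exponent, mantissa


print(float64_fields(1.0))  # (0, 1023, 0)
```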
Signs for Floating Point
• Mantissa is stored in 2's complement (the IEEE 754 formats use a separate sign bit with the mantissa magnitude instead)
• Exponent is in excess or biased notation
• e.g. Excess (bias) 128 means
• 8 bit exponent field
• Pure value range 0-255
• Subtract 128 to get correct value
• Range -128 to +127
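A minimal sketch of excess-128 encoding and decoding for an 8-bit exponent field follows. The names `BIAS`, `to_excess` and `from_excess` are chosen here only for illustration.

```python
BIAS = 128  # excess-128, 8-bit exponent field


def to_excess(true_exponent: int) -> int:
    """Encode a true exponent in -128..127 as a stored value in 0..255."""
    if not -128 <= true_exponent <= 127:
        raise ValueError("exponent does not fit in an 8-bit excess-128 field")
    return true_exponent + BIAS


def from_excess(stored: int) -> int:
    """Decode a stored value in 0..255 back to the true exponent."""
    if not 0 <= stored <= 255:
        raise ValueError("stored value is not an 8-bit quantity")
    return stored - BIAS


print(to_excess(-128), to_excess(0), to_excess(127))  # 0 128 255
print(from_excess(130))                               # 2
```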
Normalization
• FP numbers are usually normalized
• i.e. exponent is adjusted so that leading bit (MSB) of
mantissa is 1
• Since it is always 1 there is no need to store it
• (c.f. Scientific notation where numbers are normalized to
give a single digit before the decimal point
• e.g. 3.123 x 10^3)
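Normalization can be sketched with `math.frexp` from the standard library. `frexp` returns a fraction in the range 0.5 to 1, so the sketch shifts it by one position to obtain the 1.xxx form with a leading 1. The function name `normalize` is an assumption.

```python
import math


def normalize(value: float):
    """Return (significand, exponent) with 1 <= |significand| < 2."""
    if value == 0:
        raise ValueError("zero has no normalized form")
    fraction, exponent = math.frexp(value)  # 0.5 <= |fraction| < 1
    # Shift left by one bit so the leading 1 sits before the radix point.
    return fraction * 2, exponent - 1


print(normalize(28.125))  # (1.7578125, 4), i.e. 1.1100001 * 2^4
```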
FP Ranges
• For a 32 bit number
• 8 bit exponent
• +/- 2^256 ≈ 1.2 x 10^77
• Accuracy
• The effect of changing lsb of mantissa
• 23 bit mantissa: 2^-23 ≈ 1.2 x 10^-7
• About 6 decimal places
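The two figures quoted for range and accuracy can be reproduced directly. This is only a numeric check of the powers of two named on the slide, and the variable names are illustrative.

```python
import math

largest = 2.0 ** 256   # magnitude quoted for the range
step = 2.0 ** -23      # effect of changing the lsb of a 23-bit mantissa

print(f"{largest:.3e}")  # 1.158e+77
print(f"{step:.3e}")     # 1.192e-07

# Number of whole decimal places covered by a step of 2^-23.
digits = math.floor(-math.log10(step))
print(digits)            # 6
```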
IEEE 754
• Standard for floating point storage
• 32 and 64 bit standards
• 8 and 11 bit exponent respectively
• Extended formats (both mantissa and exponent) for
intermediate results
IEEE 754 Formats
Convert a number to IEEE 754 (32 bit)
Example:- 263.3
1. Convert the number to binary
263.3 =(100000111.0100110011001100…)
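The remaining steps of the conversion can be cross-checked with the standard `struct` module. Normalizing gives 263.3 = 1.000001110100110011… x 2^8, so E' = 8 + 127 = 135 = 10000111, and the mantissa field holds the first 23 bits after the leading 1. The helper name `float32_bits` is an assumption, and the rounding is whatever `struct` applies when packing to single precision.

```python
import struct


def float32_bits(value: float) -> str:
    """Return the single-precision pattern as 'sign exponent mantissa'."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{word:032b}"
    # 1 sign bit, 8 exponent bits, 23 mantissa bits.
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


print(float32_bits(263.3))  # 0 10000111 00000111010011001100110
```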
• Rules
• 1. Compare the magnitudes of the two exponents and align the number with the smaller exponent by shifting its mantissa.
• 2. Perform addition/subtraction on the mantissas.
• 3. Normalize by shifting the resulting mantissa and adjusting the resulting exponent.
Addition
• Add (1.1100*2^4 & 1.100*2^2)
1. Alignment: 1.100*2^2 is aligned to 0.01100*2^4
2. Addition:
=1.1100*2^4+0.01100*2^4
=10.0010*2^4
3.Normalization:
Final Normalization Result is
=0.100010*2^6
=0.1000*2^6(Assuming 4 bits are allowed after radix Point)
Ans=0.1000*2^6
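The three steps (align, add, normalize) can be sketched with exact fractions. The function name `fp_add`, the `places` parameter, and the use of truncation rather than rounding are assumptions that mirror this worked example. The result is normalized to the 0.1xxx form used in the answer above.

```python
from fractions import Fraction


def fp_add(m1, e1, m2, e2, places=4):
    """Add m1*2^e1 and m2*2^e2; return (mantissa, exponent) in 0.1xxx form."""
    # 1. Alignment: make (m1, e1) the operand with the larger exponent,
    #    then shift the other mantissa right by the exponent difference.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2 / 2 ** (e1 - e2)
    # 2. Addition of the aligned mantissas.
    total, exponent = m1 + m2, e1
    # 3. Normalization: shift until 1/2 <= |mantissa| < 1.
    while abs(total) >= 1:
        total, exponent = total / 2, exponent + 1
    while total != 0 and abs(total) < Fraction(1, 2):
        total, exponent = total * 2, exponent - 1
    # Keep only `places` bits after the radix point (truncation).
    scaled = int(total * 2 ** places)
    return Fraction(scaled, 2 ** places), exponent


# 1.1100 = 7/4 and 1.100 = 3/2 in binary.
mantissa, exponent = fp_add(Fraction(7, 4), 4, Fraction(3, 2), 2)
print(mantissa, exponent)  # 1/2 6, i.e. 0.1000 * 2^6
```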
Subtraction
• Subtract (1.1100*2^4 - 1.100*2^2)
1. Alignment: 1.100*2^2 is aligned to 0.01100*2^4
2. Subtraction:
=1.1100*2^4- 0.01100*2^4
=1.1100*2^4+(-0.01100*2^4)
=1.11000*2^4+1.10100*2^4 (2's complement of 0.01100)
=11.01100*2^4
=1.01100*2^4 (the carry out is discarded in 2's complement)
3.Normalization:
Final Normalization Result is
=0.101100*2^5
=0.1011*2^5(Assuming 4 bits are allowed after radix Point)
Ans=0.1011*2^5
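The 2's complement step can be traced on fixed-width integers. In this sketch the aligned mantissas are held as 6-bit patterns (1 integer bit and 5 fraction bits); the constants `WIDTH`, `FRACTION_BITS` and `MASK` are illustrative choices, not part of the original example.

```python
WIDTH = 6           # 1 integer bit + 5 fraction bits, as in 1.11000
FRACTION_BITS = 5
MASK = (1 << WIDTH) - 1

minuend = 0b111000      # 1.11000 (x 2^4)
subtrahend = 0b001100   # 0.01100 (x 2^4), already aligned

complement = (-subtrahend) & MASK   # 2's complement: 1.10100
raw_sum = minuend + complement      # 11.01100, includes a carry out
difference = raw_sum & MASK         # carry discarded: 1.01100

print(f"{complement:06b} {raw_sum:07b} {difference:06b}")
# 110100 1101100 101100

# Interpret the pattern as a mantissa and apply the shared exponent 2^4.
value = difference / 2 ** FRACTION_BITS * 2 ** 4
print(value)  # 22.0, which equals 28 - 6
```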
FP Addition & Subtraction Flowchart
FP Arithmetic x/÷
• Multiplication of a pair of floating point numbers
• x=mx*2^a & y=my*2^b is represented as x*y=(mx*my)*2^(a+b)
a. Add Exponents=(-2)+(-1)=-3
Ans=-0.1010*2^-4
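The rule x*y = (mx*my)*2^(a+b) can be sketched as below. The operands are hypothetical values chosen for illustration, not the ones behind the answer above, and the sketch assumes both mantissas are normalized to the 1.xxx form, so at most one right shift is needed afterwards.

```python
def fp_multiply(mx: float, a: int, my: float, b: int):
    """Multiply mx*2^a by my*2^b, assuming 1 <= |mx|, |my| < 2."""
    mantissa = mx * my   # multiply the mantissas
    exponent = a + b     # add the exponents
    # The product magnitude lies in [1, 4); renormalize if it reached 2.
    if abs(mantissa) >= 2:
        mantissa /= 2
        exponent += 1
    return mantissa, exponent


print(fp_multiply(1.5, 3, 1.5, -1))    # (1.125, 3): 12 * 0.75 = 9
print(fp_multiply(1.5, 3, -1.25, -1))  # (-1.875, 2): 12 * -0.625 = -7.5
```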
Division
• Ex. X=91.34375 =1011011.01011 =1.01101101011*2^6
Y=0.14453125 =0.00100101 =1.00101*2^-3
(Xs, Ys are the mantissas; Xe, Ye are the exponents)
a. X/Y =(Xs/Ys)*2^(Xe-Ye)
=(Xs/Ys)*2^(6-(-3))
b. =(Xs/Ys)*2^9
=1.001111*2^9 // (Xs/Ys)=1.001111
c. Normalization
=1.001111*2^9
=0.1001111*2^10
=0.1001*2^10 (Assuming 4 bits are allowed after radix Point)
Ans= 0.1001*2^10
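The division example can be verified with exact fractions: divide the mantissas and subtract the exponents. The variable names are illustrative. The check stops before the final truncation to 4 bits, which is why it yields the exact quotient 632 rather than the truncated answer.

```python
from fractions import Fraction

# X = 1.01101101011 * 2^6 and Y = 1.00101 * 2^-3, written as exact fractions.
xs, xe = Fraction(0b101101101011, 2 ** 11), 6
ys, ye = Fraction(0b100101, 2 ** 5), -3

mantissa = xs / ys   # divide the mantissas
exponent = xe - ye   # subtract the exponents

print(mantissa, exponent)               # 79/64 9
print(f"{mantissa.numerator:b}")        # 1001111, i.e. 1.001111 with 6 fraction bits
print(float(mantissa * 2 ** exponent))  # 632.0 = 91.34375 / 0.14453125
```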
Floating Point Multiplication
Floating Point Division
Required Reading
• Stallings Chapter 9
• IEEE 754 on IEEE Web site