Computer Architecture & Organization Unit 2

This document discusses different methods for representing numbers in digital computers. It describes fixed point notation, which reserves a fixed number of bits for the integer and fractional parts of a number. Floating point notation allows a varying number of bits after the decimal point. The key advantages of floating point are flexibility and a wider range of representable values, while fixed point has better performance. The document then explains IEEE floating point standards for single, double and quadruple precision formats.


Unit - 2

By: Namrata Singh


Fixed Point and Floating Point Number Representations
Digital computers use the binary number system to represent all types of information internally. Alphanumeric characters are represented using binary bits (i.e., 0 and 1). Digital representations are easier to design, storage is easy, and accuracy and precision are greater.

There are various number representation techniques for digital systems, for example the binary, octal, decimal, and hexadecimal number systems; but the binary number system is the most relevant and popular for representing numbers in a digital computer system.

Storing Real Numbers

There are two major approaches to storing real numbers (i.e., numbers with a fractional component) in modern computing. These are (i) Fixed-Point Notation and (ii) Floating-Point Notation. In fixed-point notation there is a fixed number of digits after the radix point, whereas floating-point notation allows a varying number of digits after the radix point.

Fixed-Point Representation −
This representation has a fixed number of bits for the integer part and for the fractional part. For example, if the fixed-point representation is IIII.FFFF, then the minimum value you can store is 0000.0001 and the maximum value is 9999.9999. A fixed-point number representation has three parts: the sign field, the integer field, and the fractional field.



We can represent these numbers using:

1) Sign-magnitude representation: range from −(2^(k−1) − 1) to (2^(k−1) − 1), for k bits.
2) 1's complement representation: range from −(2^(k−1) − 1) to (2^(k−1) − 1), for k bits.
3) 2's complement representation: range from −2^(k−1) to (2^(k−1) − 1), for k bits.

2's complement representation is preferred in computer systems because zero has an unambiguous representation and arithmetic operations are easier.

Example − Assume a number uses a 32-bit format which reserves 1 bit for the sign, 15 bits for the integer part, and 16 bits for the fractional part.

Then −43.625 is represented as:

1 | 000000000101011 | 1010000000000000

where 0 represents + and 1 represents −; 000000000101011 is the 15-bit binary value of the integer 43, and 1010000000000000 is the 16-bit binary value of the fraction 0.625.

The advantage of a fixed-point representation is performance; the disadvantage is the relatively limited range of values it can represent. So it is usually inadequate for numerical analysis, as it does not allow enough numbers and accuracy. A number whose representation exceeds 32 bits would have to be stored inexactly.

With the 32-bit format given above, the smallest positive number that can be stored is 2^−16 ≈ 0.000015, and the largest positive number is (2^15 − 1) + (1 − 2^−16) = 2^15 − 2^−16 ≈ 32768. The gap between consecutive representable numbers is always 2^−16; the radix point stays fixed, so range can only be traded for precision by changing how the bits are split between the integer and fractional fields.
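The 1/15/16 layout from the example above can be sketched in Python, scaling the value by 2^16 so the low 16 bits hold the fraction (a minimal sketch; the function names are illustrative, not from the text):

```python
def encode_fixed(x):
    """Pack x into the 1-bit sign / 15-bit integer / 16-bit fraction format."""
    sign = 1 if x < 0 else 0
    raw = round(abs(x) * (1 << 16))       # scale: low 16 bits become the fraction
    assert raw < (1 << 31), "magnitude does not fit in 15 integer bits"
    return (sign << 31) | raw

def decode_fixed(word):
    """Recover the value: magnitude / 2^16, negated if the sign bit is set."""
    mag = (word & 0x7FFFFFFF) / (1 << 16)
    return -mag if (word >> 31) else mag
```

For −43.625 this yields sign bit 1, integer field 43, and fraction field 1010000000000000₂ = 40960, matching the worked example.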



Floating-Point Representation −
This representation does not reserve a specific number of bits for the integer part or the
fractional part. Instead it reserves a certain number of bits for the number (called the
mantissa or significand) and a certain number of bits to say where within that number the
decimal place sits (called the exponent).

The floating-point representation of a number has two parts: the first part represents a signed, fixed-point number called the mantissa; the second part designates the position of the decimal (or binary) point and is called the exponent. The fixed-point mantissa may be a fraction or an integer. A floating-point number is always interpreted to represent a number in the following form: m × r^e.

Only the mantissa m and the exponent e are physically represented in the register (including their signs). A floating-point binary number is represented in a similar manner except that it uses base 2 for the exponent. A floating-point number is said to be normalized if the most significant digit of the mantissa is 1.

So, the actual number is (−1)^s × (1 + m) × 2^(e − Bias), where s is the sign bit, m is the mantissa, e is the exponent value, and Bias is the bias number.

Note that signed integers and exponents can be represented by sign-magnitude, one's complement, or two's complement representation.

The floating-point representation is more flexible. Any non-zero number can be represented in the normalized form ±(1.b1b2b3…)₂ × 2^n. This is the normalized form of a number x.

Example − Suppose a number uses a 32-bit format: 1 sign bit, 8 bits for the signed exponent, and 23 bits for the fractional part. The leading bit 1 is not stored (as it is always 1 for a normalized number) and is referred to as a "hidden bit".

Then −53.5 is normalized as −53.5 = (−110101.1)₂ = (−1.101011)₂ × 2^5, which is represented as:

1 | 00000101 | 10101100000000000000000

where 00000101 is the 8-bit binary value of the exponent +5.


Note that 8-bit exponent field is used to store integer exponents -126 ≤ n ≤ 127.

The smallest normalized positive number that fits into 32 bits is (1.00000000000000000000000)₂ × 2^−126 = 2^−126 ≈ 1.18 × 10^−38, and the largest normalized positive number that fits into 32 bits is (1.11111111111111111111111)₂ × 2^127 = (2^24 − 1) × 2^104 ≈ 3.40 × 10^38. These numbers are represented as shown below.

The precision of a floating-point format is the number of positions reserved for binary digits
plus one (for the hidden bit). In the examples considered here the precision is 23+1=24.

The gap between 1 and the next normalized floating-point number is known as the machine epsilon. For the above example the gap is (1 + 2^−23) − 1 = 2^−23. Note that this is not the same as the smallest positive floating-point number: unlike the fixed-point case, the spacing between floating-point numbers is non-uniform.
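The field layout can be inspected with Python's struct module. Note that actual IEEE 754 hardware stores a biased exponent, so for −53.5 the stored exponent is 5 + 127 = 132 rather than the signed +5 used in the example above (a sketch; the helper name is illustrative):

```python
import struct

def float_bits(x):
    """Return the (sign, biased exponent, fraction) fields of x as a float32."""
    (w,) = struct.unpack('>I', struct.pack('>f', x))   # reinterpret the 32 bits
    return w >> 31, (w >> 23) & 0xFF, w & 0x7FFFFF

# float_bits(-53.5) gives sign 1, exponent 132 (= 5 + 127),
# and fraction 10101100000000000000000 (the hidden leading 1 is dropped).
```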

Note that numbers with a non-terminating binary expansion cannot be represented exactly in floating-point representation; e.g., 1/3 = (0.010101…)₂ cannot be a floating-point number, as its binary representation is non-terminating.

IEEE Floating point Number Representation −

IEEE (Institute of Electrical and Electronics Engineers) has standardized floating-point representation as shown in the following diagram.

So, the actual number is (−1)^s × (1 + m) × 2^(e − Bias), where s is the sign bit, m is the mantissa, e is the exponent value, and Bias is the bias number. The sign bit is 0 for a positive number and 1 for a negative number. Exponents are represented in biased (excess) form.

According to IEEE 754 standard, the floating-point number is represented in following ways:

1) Half Precision (16 bit): 1 sign bit, 5 bit exponent, and 10 bit mantissa
2) Single Precision (32 bit): 1 sign bit, 8 bit exponent, and 23 bit mantissa
3) Double Precision (64 bit): 1 sign bit, 11 bit exponent, and 52 bit mantissa
4) Quadruple Precision (128 bit): 1 sign bit, 15 bit exponent, and 112 bit mantissa
Special Value Representation −

There are some special values depending upon different values of the exponent and mantissa in the IEEE 754 standard.

1) All exponent bits 0 with all mantissa bits 0 represents 0. If the sign bit is 0, then +0, else −0.
2) All exponent bits 1 with all mantissa bits 0 represents infinity. If the sign bit is 0, then +∞, else −∞.
3) All exponent bits 0 with non-zero mantissa bits represents a denormalized number.
4) All exponent bits 1 with non-zero mantissa bits represents NaN (Not a Number), used to signal invalid results.
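The four rules can be sketched as a classifier over the single-precision fields (a minimal sketch; the function name and labels are illustrative):

```python
def classify(sign, exp, frac):
    """Classify a single-precision (sign, 8-bit exponent, 23-bit fraction)."""
    if exp == 0:                                       # rule 1 and rule 3
        return ('+0' if sign == 0 else '-0') if frac == 0 else 'denormalized'
    if exp == 0xFF:                                    # rule 2 and rule 4
        return ('+inf' if sign == 0 else '-inf') if frac == 0 else 'NaN'
    return 'normalized'
```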



HALF-ADDER
Half-adder is a circuit that can add two binary bits. Its outputs are
SUM and CARRY. The following truth table shows various
combinations of inputs and their corresponding outputs of a
half-adder. X and Y denote inputs and C and S denote CARRY
and SUM.

X Y | CARRY (C) | SUM (S)
0 0 | 0 | 0
0 1 | 0 | 1 (X'Y)
1 0 | 0 | 1 (XY')
1 1 | 1 (XY) | 0

Truth Table for a Half-Adder

The minterms for SUM and CARRY are shown in brackets.
The Sum-Of-Products (SOP) equation for SUM is:

S = X'Y + XY' = X ⊕ Y …..………… ( 1 )

Similarly, the SOP equation for the CARRY is:

C = XY …………. ………………………( 2 )

Combining the logic circuits for equations ( 1 ) and ( 2 ) we get the circuit for the Half-Adder as:

Half-Adder Circuit and Symbol
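Equations (1) and (2) can be checked with a small Python sketch (the function name is illustrative):

```python
def half_adder(x, y):
    """Half-adder: CARRY = XY (eq. 2), SUM = X xor Y (eq. 1)."""
    return x & y, x ^ y   # (carry, sum)

# Reproduces every row of the half-adder truth table above.
```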

FULL-ADDER
Full-Adder is a logic circuit that adds three binary bits. Its outputs are SUM and CARRY. In the following truth table X, Y, Z are inputs and C and S are CARRY and SUM.



X Y Z | CARRY (C) | SUM (S)
0 0 0 | 0 | 0
0 0 1 | 0 | 1 (X'Y'Z)
0 1 0 | 0 | 1 (X'YZ')
0 1 1 | 1 (X'YZ) | 0
1 0 0 | 0 | 1 (XY'Z')
1 0 1 | 1 (XY'Z) | 0
1 1 0 | 1 (XYZ') | 0
1 1 1 | 1 (XYZ) | 1 (XYZ)

Truth Table for a Full-Adder


The minterms are written in brackets for each 1 output in the truth table. From these, the SOP equation for SUM can be written as:

S = X'Y'Z + X'YZ' + XY'Z' + XYZ

  = X'(Y'Z + YZ') + X(Y'Z' + YZ)

  = X'S1 + XS1' ……………………. (3)

(the exclusive-OR and equivalence functions are complements of each other). Here S1 = Y ⊕ Z is the SUM of a Half-Adder.

Again, the SOP equation for the Full-Adder CARRY is:

C = X'YZ + XY'Z + XYZ' + XYZ

  = YZ(X' + X) + X(Y'Z + YZ')

  = YZ + XS1

  = C1 + XS1 ................................. (4)

Here also C1 means the CARRY of the half-adder and S1 means the SUM of the half-adder.

Now, using two half-adder circuits and one OR gate, we can implement equations ( 3 ) and ( 4 ) to obtain a full-adder circuit as follows:



Full-Adder Circuit and its Symbol
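The construction in equations (3) and (4) — two half-adders plus an OR gate — can be sketched as (names illustrative):

```python
def half_adder(x, y):
    """CARRY = XY, SUM = X xor Y."""
    return x & y, x ^ y

def full_adder(x, y, z):
    """Two half-adders plus an OR gate, per equations (3) and (4)."""
    c1, s1 = half_adder(y, z)    # first half-adder adds Y and Z: S1, C1
    c2, s2 = half_adder(x, s1)   # second adds X to S1: SUM = X xor S1
    return c1 | c2, s2           # CARRY = C1 + X*S1 (the OR gate)
```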

HALF-SUBTRACTOR

A half-subtractor subtracts one bit from another bit. It has two outputs, viz. DIFFERENCE (D) and BORROW (B).

X Y | BORROW (B) | DIFFERENCE (D)
0 0 | 0 | 0
0 1 | 1 (X'Y) | 1 (X'Y)
1 0 | 0 | 1 (XY')
1 1 | 0 | 0

Truth Table for Half-Subtractor

The minterms are written within parentheses for each 1 output in each column. The SOP equations are:

D = X'Y + XY'
  = X ⊕ Y ................................. (5)

B = X'Y ................................. (6)

The half-subtractor circuit and the symbol

FULL-SUBTRACTOR

A full-subtractor circuit can find the DIFFERENCE and BORROW arising from a subtraction operation involving three binary bits.

X Y Z | BORROW (B') | DIFFERENCE (D')
0 0 0 | 0 | 0
0 0 1 | 1 (X'Y'Z) | 1 (X'Y'Z)
0 1 0 | 1 (X'YZ') | 1 (X'YZ')
0 1 1 | 1 (X'YZ) | 0
1 0 0 | 0 | 1 (XY'Z')
1 0 1 | 0 | 0
1 1 0 | 0 | 0
1 1 1 | 1 (XYZ) | 1 (XYZ)

Truth Table for Full-Subtractor

The SOP equation for the DIFFERENCE is:

D' = X'Y'Z + X'YZ' + XY'Z' + XYZ

   = (X'Y' + XY)Z + (X'Y + XY')Z'

   = (X ⊕ Y)'Z + (X ⊕ Y)Z'

   = D ⊕ Z ................................. (7)

And the SOP equation for BORROW is:

B' = X'Y'Z + X'YZ' + X'YZ + XYZ

   = X'Y(Z' + Z) + (X'Y' + XY)Z

   = X'Y + (X ⊕ Y)'Z

   = B + (X ⊕ Y)'Z ................................. (8)

In equations (7) and (8), D = X ⊕ Y and B = X'Y stand for the DIFFERENCE and BORROW outputs of the half-subtractor, and (X ⊕ Y)' is the complement of D. Now, from equations (7) and (8), we can construct a full-subtractor using two half-subtractors and an OR gate.



Full-Subtractor circuit
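The two-half-subtractor construction can be sketched in Python (names illustrative; the second half-subtractor removes the borrow-in Z from the first difference):

```python
def half_subtractor(x, y):
    """D = X xor Y (eq. 5), B = X'Y (eq. 6); returns (borrow, difference)."""
    return (x ^ 1) & y, x ^ y

def full_subtractor(x, y, z):
    """Two half-subtractors plus an OR gate, per equations (7) and (8)."""
    b1, d1 = half_subtractor(x, y)   # subtract Y from X
    b2, d2 = half_subtractor(d1, z)  # subtract borrow-in Z from D
    return b1 | b2, d2               # (borrow-out, difference)
```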



Carry look ahead adder:
Motivation behind Carry Look-Ahead Adder:

In ripple carry adders, for each adder block the two bits to be added are available instantly. However, each adder block waits for the carry to arrive from its previous block, so it is not possible to generate the sum and carry of any block until the input carry is known.
The ith block waits for the (i−1)th block to produce its carry, so there is a considerable time delay, called the carry propagation delay.

Figure – Digital Logic

Consider the above 4-bit ripple carry adder. The sum S4 is produced by the corresponding full adder as soon as the input signals are applied to it. But the carry input C4 does not reach its final steady-state value until carry C3 is available at its steady-state value. Similarly C3 depends on C2, and C2 on C1. Therefore the carry must propagate through all the stages before the output S3 and carry C4 settle to their final steady-state values. The propagation time is equal to the propagation delay of one adder block multiplied by the number of adder blocks in the circuit. For example, if each full-adder stage has a propagation delay of 20 nanoseconds, then S3 will reach its final correct value only after 60 (20 × 3) nanoseconds. The situation gets worse as we extend the number of stages to add more bits.



Carry Look-ahead Adder:
A carry look-ahead adder reduces the propagation delay by introducing more complex hardware. In this design,
the ripple carry design is suitably transformed such that the carry logic over fixed groups of bits of the adder is
reduced to two-level logic. Let us discuss the design in detail.

Figure - Design.

Figure – Truth table.
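The generate/propagate scheme behind the design can be sketched in Python: Gi = Ai·Bi, Pi = Ai ⊕ Bi, and Ci+1 = Gi + Pi·Ci. Hardware expands this recurrence into two-level logic so all carries appear at once; the loop below is the behavioral equivalent (a sketch, names illustrative):

```python
def cla_4bit(a, b, c0=0):
    """4-bit carry look-ahead adder over 4-bit integers a and b."""
    g = [(a >> i & 1) & (b >> i & 1) for i in range(4)]  # generate  Gi = Ai.Bi
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(4)]  # propagate Pi = Ai^Bi
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))   # C(i+1) = Gi + Pi.Ci
    s = [p[i] ^ c[i] for i in range(4)]  # Si = Pi ^ Ci
    return sum(s[i] << i for i in range(4)) | (c[4] << 4)
```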



Generally, we perform many mathematical operations in our daily life, such as addition, subtraction, multiplication, division, and so on. The multiplication process can be performed by different methods: different algorithms can be used, such as the grid multiplication method, long multiplication, lattice multiplication, peasant or binary multiplication, and so on.
Binary multiplication is usually performed in digital electronics by an electronic circuit called a binary multiplier. These binary multipliers are implemented using different computer arithmetic techniques. The Booth multiplier, which works on the Booth algorithm, is one of the most frequently used binary multipliers.



Addition and Subtraction
Four basic computer arithmetic operations are addition, subtraction, multiplication, and division. The arithmetic operations in a digital computer manipulate data to produce results. It is necessary to design arithmetic procedures and circuits that implement arithmetic operations as algorithms. An algorithm is a solution to a problem stated as a finite number of well-defined procedural steps. Algorithms can be developed for the following types of data.
1. Fixed point binary data in signed magnitude representation
2. Fixed point binary data in signed 2’s complement representation.
3. Floating point representation
4. Binary Coded Decimal (BCD) data

Addition and Subtraction with signed magnitude


Consider two numbers having magnitudes A and B. When the signed numbers are added or subtracted, there are 8 different conditions depending on the signs and the operation performed, as shown in the table below:
Operation   | Add magnitudes | When A > B | When A < B | When A = B
(+A) + (+B) | +(A + B)       | --         | --         | --
(+A) + (-B) | --             | +(A - B)   | -(B - A)   | +(A - B)
(-A) + (+B) | --             | -(A - B)   | +(B - A)   | +(A - B)
(-A) + (-B) | -(A + B)       | --         | --         | --
(+A) - (+B) | --             | +(A - B)   | -(B - A)   | +(A - B)
(+A) - (-B) | +(A + B)       | --         | --         | --
(-A) - (+B) | -(A + B)       | --         | --         | --
(-A) - (-B) | --             | -(A - B)   | +(B - A)   | +(A - B)
From the table, we can derive an algorithm for addition and subtraction as follows:
Addition (Subtraction) Algorithm:
 When the signs of A and B are identical, add the two magnitudes and attach the sign of A to the result.
 When the signs of A and B are different, compare the magnitudes and subtract the smaller number from the larger one. Choose the sign of the result to be the same as A if A > B, or the complement of the sign of A if A < B. If the two numbers are equal, subtract B from A and make the sign of the result positive.
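The algorithm above can be sketched in Python (sign bits 0 = +, 1 = −; the function name and argument order are illustrative):

```python
def signed_mag_addsub(sa, a, sb, b, subtract=False):
    """Signed-magnitude add/subtract: sa, sb are sign bits; a, b magnitudes."""
    if subtract:
        sb ^= 1                      # (+A) - (+B) is treated as (+A) + (-B)
    if sa == sb:
        return sa, a + b             # identical signs: add, keep sign of A
    if a >= b:                       # different signs: subtract smaller
        return (0 if a == b else sa), a - b   # equal magnitudes give +0
    return sa ^ 1, b - a             # A < B: complement the sign of A
```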



Hardware Implementation

fig: Hardware for signed magnitude addition and subtraction

The hardware consists of two registers A and B to store the magnitudes, and two flip-flops As and Bs to store the corresponding signs. The result is stored in register A and flip-flop As, which act as an accumulator. The subtraction is performed by adding A to the 2's complement of B. The output carry is transferred to flip-flop E, and an overflow during the add operation is stored in the add-overflow flip-flop AVF. When M = 0, the content of B is transferred to the adder without any change along with an input carry of 0.

The output of the parallel adder is then equal to A + B, which is an add operation. When M = 1, the content of register B is complemented and transferred to the parallel adder along with an input carry of 1. Therefore, the output of the parallel adder is equal to A + B' + 1 = A − B, which is a subtract operation.



Hardware Algorithm

fig: flowchart for add and subtract operations

As and Bs are compared by an exclusive-OR gate. If the output is 0 the signs are identical; if 1, the signs are different.
 For an add operation, identical signs dictate that the magnitudes be added; for a subtract operation, different signs dictate that the magnitudes be added. Magnitudes are added with the micro-operation EA ← A + B.
 The two magnitudes are subtracted if the signs are different for an add operation or identical for a subtract operation. Magnitudes are subtracted with the micro-operation EA ← A + B' + 1. If E = 1, then A ≥ B and the number in A is the correct result (this number is checked for 0 to produce a positive zero [As = 0]). If E = 0, then A < B, so we take the 2's complement of A.

Multiplication
Hardware Implementation and Algorithm
Generally, the multiplication of two fixed-point binary numbers in signed-magnitude representation is performed by a process of successive shift and add operations. The process consists of looking at successive bits of the multiplier (least significant bit first). If the multiplier bit is 1, the multiplicand is copied down; otherwise, 0's are copied. The numbers copied down in successive lines are shifted one position to the left, and finally all the numbers are added to get the product.
But in digital computers, an adder for the summation of only two binary numbers is used, and the partial product is accumulated in a register. Similarly, instead of shifting the multiplicand to the left, the partial product is shifted to the right. The hardware for the multiplication of signed-magnitude data is shown in the figure below.

Hardware for multiply operation


Initially, the multiplier is stored in the Q register and the multiplicand in the B register. The A register is used to store the partial product, and the sequence counter (SC) is set to the number of bits in the multiplier. The sum of A and B forms the partial product, and both are shifted to the right using the statement "shr EAQ" as shown in the hardware algorithm. The flip-flops As, Bs and Qs store the signs of A, B and Q respectively. A binary 0 is inserted into flip-flop E during the shift right.
Hardware Algorithm

flowchart for multiply algorithm
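The shift-and-add flowchart can be sketched for magnitudes (signs are handled separately via As ⊕ Bs, as the text describes); E catches the adder carry before each right shift. A sketch with illustrative names:

```python
def shift_add_multiply(b, q, n):
    """Shift-right multiply of n-bit magnitudes: A holds the partial
    product, E the carry, Q the multiplier; SC counts down from n."""
    a, e = 0, 0
    for _ in range(n):                          # SC = n .. 1
        if q & 1:                               # low multiplier bit Qn = 1
            a += b                              # EA <- A + B
            e, a = a >> n, a & ((1 << n) - 1)   # carry out goes into E
        # shr EAQ: shift E, A, Q right one place, 0 into E
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (e << (n - 1))
        e = 0
    return (a << n) | q                         # 2n-bit product in A:Q
```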



 The multiplicand is subtracted from the partial product upon encountering the first least significant 1 in a string of 1's in the multiplier.
 The multiplicand is added to the partial product upon encountering the first 0 (provided that there was a previous 1) in a string of 0's in the multiplier.
 The partial product does not change when the current multiplier bit is identical to the previous multiplier bit.
This algorithm works for both positive and negative numbers in signed 2's-complement form. The hardware implementation of this algorithm is shown in the figure below:

The flowchart for the Booth multiplication algorithm is given below:

flowchart for booth multiplication algorithm

Numerical Example: Booth algorithm


BR=10111(Multiplicand)
QR=10011(Multiplier)
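The example above can be checked with a Python sketch of Booth's algorithm. Assuming 5-bit two's-complement operands, BR = 10111 = −9 and QR = 10011 = −13, so the expected product is 117 (function and variable names are illustrative):

```python
def booth_multiply(br, qr, n):
    """Booth's algorithm on n-bit two's-complement bit patterns.
    a accumulates partial products; q_1 is the appended bit Q(-1)."""
    mask = (1 << n) - 1
    a, q, q_1 = 0, qr & mask, 0
    for _ in range(n):                      # SC = n .. 1
        pair = ((q & 1) << 1) | q_1
        if pair == 0b10:                    # first 1 of a run of 1's
            a = (a - br) & mask             # A <- A - BR
        elif pair == 0b01:                  # first 0 after a run of 1's
            a = (a + br) & mask             # A <- A + BR
        q_1 = q & 1                         # arithmetic shift right (A, Q, Q-1)
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = (a >> 1) | (a & (1 << (n - 1)))
    product = (a << n) | q                  # 2n-bit two's-complement result
    return product - (1 << 2 * n) if product >> (2 * n - 1) else product
```

booth_multiply(0b10111, 0b10011, 5) returns 117, confirming (−9) × (−13) = 117.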
Array Multiplier
The multiplication algorithm above checks the bits of the multiplier one at a time and forms partial products. This is a sequential process that requires a sequence of add and shift micro-operations, which is complicated and time consuming. The multiplication of two binary numbers can also be done in one micro-operation by using a combinational circuit that produces the product all at once.
Example.
Consider that the multiplicand bits are b1 and b0 and the multiplier bits are a1 and a0. The partial product is c3c2c1c0. The multiplication of two bits such as a0 and b0 produces a binary 1 if both bits are 1; otherwise it produces a binary 0. This is identical to the AND operation and can be implemented with AND gates, as shown in the figure.

2-bit by 2-bit array multiplier
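The 2-bit by 2-bit array multiplier can be sketched gate-by-gate: AND gates form the partial products and two half-adders combine them, all combinationally (intermediate names are illustrative):

```python
def array_multiply_2x2(a1, a0, b1, b0):
    """Combinational 2x2 array multiplier producing c3 c2 c1 c0."""
    c0 = a0 & b0               # AND gate: lowest product bit
    s, t = a0 & b1, a1 & b0    # middle partial products
    c1 = s ^ t                 # half-adder sum
    k = s & t                  # half-adder carry
    c2 = (a1 & b1) ^ k         # second half-adder combines a1b1 and carry
    c3 = (a1 & b1) & k
    return (c3 << 3) | (c2 << 2) | (c1 << 1) | c0
```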

Division Algorithm
The division of two fixed-point signed numbers can be done by a process of successive compare, shift, and subtract operations. When implemented in digital computers, instead of shifting the divisor to the right, the dividend or the partial remainder is shifted to the left. The subtraction is obtained by adding the number A to the 2's complement of number B. Information about the relative magnitudes of the numbers is obtained from the end carry.
Hardware Implementation
The hardware implementation for the division of signed numbers is shown in the figure.



Division Algorithm
The divisor is stored in register B and a double length dividend is stored in register A and Q.
the dividend is shifted to the left and the divider is subtracted by adding twice complement of
the value. If E = 1, then A >= B. In this case, a quotient bit 1 is inserted into Qn and the partial
remainder is shifted to the left to repeat the process. If E = 0, then A > B. In this case, the
quotient bit Qn remains zero and the value of B is added to restore the partial remainder in A
to the previous value. The partial remainder is shifted to the left and approaches continues
until the sequence counter reaches to 0. The registers E, A & Q are shifted to the left with 0
inserted into Qn and the previous value of E is lost as shown in the flow chart for division
algorithm.

flowchart for division algorithm


This algorithm can be explained with the help of an example.
Consider that the divisor is 10001 and the dividend is 01110.



binary division with digital hardware
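The restoring flowchart can be sketched in Python; the hardware's end-carry test on E becomes a sign test on the partial remainder (a sketch, names illustrative, unsigned operands without overflow assumed):

```python
def restoring_divide(dividend, divisor, n):
    """Restoring division: shift AQ left, subtract the divisor,
    restore on a negative result; quotient bits enter Qn."""
    a, q = dividend >> n, dividend & ((1 << n) - 1)  # double-length dividend
    for _ in range(n):                               # SC = n .. 1
        a = (a << 1) | (q >> (n - 1))                # shl AQ
        q = (q << 1) & ((1 << n) - 1)
        a -= divisor                                 # A <- A + B' + 1
        if a < 0:
            a += divisor                             # E = 0: restore A
        else:
            q |= 1                                   # E = 1: quotient bit 1
    return q, a                                      # (quotient, remainder)
```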
Restoring method
The method described above is the restoring method, in which the partial remainder is restored by adding the divisor back to the negative result. Other methods:
Comparison method: A and B are compared prior to subtraction. If A ≥ B, B is subtracted from A; if A < B, nothing is done. The partial remainder is then shifted left and the numbers are compared again. The comparison inspects the end carry out of the parallel adder before transferring it to E.
Non-restoring method: In contrast to the restoring method, when A − B is negative, B is not added back to restore A; instead, the negative difference is shifted left and then B is added. How is this possible? Let's argue:
 In the flowchart for the restoring method, when A < B, we restore A by the operation A − B + B. Next time in the loop, this number is shifted left (multiplied by 2) and B is subtracted again, which gives: 2(A − B + B) − B = 2A − B.
 In the non-restoring method, we leave A − B as it is. Next time around the loop, the number is shifted left and B is added: 2(A − B) + B = 2A − B (same as above).



Divide Overflow
The division algorithm may produce a quotient overflow, called divide overflow. Overflow can occur if the number of bits in the quotient exceeds the storage capacity of the register. The overflow flip-flop DVF is set to 1 if overflow occurs.
Divide overflow occurs if the value of the most significant half of the bits of the dividend is equal to or greater than the value of the divisor. Similarly, overflow occurs if the dividend is divided by 0. The overflow may cause an error in the result, or it may stop the operation. When the overflow stops the operation of the system, it is called a divide stop.

Arithmetic Operations on Floating-Point Numbers


The rules apply to the single-precision IEEE standard format. These rules specify only the major steps needed to perform the four operations. Intermediate results for both mantissas and exponents might require more than 24 and 8 bits, respectively, and overflow or underflow may occur. These and other aspects of the operations must be carefully considered in designing an arithmetic unit that meets the standard. If their exponents differ, the mantissas of floating-point numbers must be shifted with respect to each other before they are added or subtracted. Consider a decimal example in which we wish to add 2.9400 × 10^2 to 4.3100 × 10^4. We rewrite 2.9400 × 10^2 as 0.0294 × 10^4 and then perform addition of the mantissas to get 4.3394 × 10^4. The rule for addition and subtraction can be stated as follows:

Add/Subtract Rule

The steps in addition (FA) or subtraction (FS) of floating-point numbers (s1, e1, f1) and (s2, e2, f2) are as follows.

1. Unpack the sign, exponent, and fraction fields. Handle special operands such as zero, infinity, or NaN (Not a Number).
2. Shift the significand of the number with the smaller exponent right by |e1 − e2| bits.
3. Set the result exponent er to max(e1, e2).
4. If the instruction is FA and s1 = s2, or if the instruction is FS and s1 ≠ s2, then add the significands; otherwise subtract them.


5. Count the number z of leading zeros. A carry can make z = -1. Shift the result
significand left z bits or right 1 bit if z = -1.
6. Round the result significand, and shift right and adjust z if there is rounding overflow,
which is a carry-out of the leftmost digit upon rounding.
7. Adjust the result exponent by er = er - z, check for overflow or underflow, and pack
the result sign, biased exponent, and fraction bits into the result word.
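Steps 2–4 can be sketched for binary operands, treating each number as an integer significand f scaled by 2^e (a simplified sketch: unpacking, sign handling, rounding, and normalization are omitted, and the function name is illustrative):

```python
def fp_align_add(e1, f1, e2, f2):
    """Align-and-add for values f * 2**e with integer significands."""
    if e1 < e2:                           # make operand 1 the larger exponent
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)                      # step 2: shift the smaller operand right
    return e1, f1 + f2                    # steps 3-4: er = max(e1, e2); add
```

For instance, fp_align_add(4, 43100, 2, 29400) shifts the smaller operand right two places before adding, mirroring the decimal rewrite of 2.9400 × 10^2 as 0.0294 × 10^4 above.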

Multiplication and division are somewhat easier than addition and subtraction, in that
no alignment of mantissas is needed.
