FPGA Implementation of Modified Non-Restoring Square Root Core

This document summarizes an article that proposes three architectures - pipelined, combinatorial, and iterative - for efficiently implementing a modified non-restoring square root algorithm on an FPGA. The pipelined architecture divides the algorithm into n stages to allow parallel execution and faster speeds but at higher cost. The combinatorial architecture is the simplest implementation without pipelining, suitable when cost is critical. The iterative architecture reuses hardware over multiple cycles to reduce cost while maintaining higher speeds than combinatorial. The architectures are compared based on speed, cost, reliability and application suitability.


International Journal of Engineering and Technical Research (IJETR)

ISSN: 2321-0869, Volume-3, Issue-4, April 2015

FPGA Implementation of Modified Non-Restoring Square Root Core
ShabirAhmed B J, Narendra K, Swaroop Kumar K, Asha G H
Abstract: The aim of this paper is to synthesize and implement an algorithm that computes the square root efficiently and cost-effectively. The square root is required in many mathematical problems arising in computer multimedia communication and in space-related data processing, so there is a need to implement this operation efficiently. In this paper, a core for finding the square root is designed around a modified non-restoring division algorithm. Three structures have been developed: basic combinatorial, iterative and pipelined. The basic combinatorial structure is a simple implementation of the non-restoring algorithm; it is essentially the pipelined architecture collapsed into a single stage. This architecture can be used when cost is the major factor and speed can be compromised. The iterative architecture is a hardware-efficient, cost-effective architecture in which a single hardware unit is reused iteratively for one computation. The pipelined architecture is the fastest architecture implemented in this paper. It uses several stages that execute in parallel, which shortens the execution time. The pipelined architecture can be used in real-time processing systems where speed is the major factor. The three structures are developed in order to compare parameters such as speed, cost and reliability, and to identify the applications each one suits. The core targets any FPGA and is simulated and debugged using Xilinx ISE 14.1. The architecture is implemented on the Virtex family and debugged on a Spartan-3 XC3S400TQ144.

Index Terms: Combinatorial, FPGA, Iterative, Pipelined, Spartan, Virtex

I. INTRODUCTION
Square root is an operation required by system graphics and scientific computation applications such as math coprocessors, DSP algorithms, data processing and control [1]. Hence, it is an important computation that needs to be made efficient. In 1996, Li and Chu [1] proposed a new non-restoring square root algorithm for VLSI implementation, which is better than the earlier VLSI algorithms for computing the square root. In many real-time VLSI image processing applications, computing the square root of a binary-coded number with low power dissipation and low propagation delay is a high-priority requirement. Square root calculation is one of the most useful and vital operations in computer graphics and scientific calculation applications, such as digital signal processing (DSP) algorithms, math coprocessors, data processing and control, and even multimedia applications [1-6]. It is a classical problem in computational number theory, one that is often encountered and for which an exact result is hard to obtain [7-8].
The paper is organized as follows: Section II describes the algorithm. Section III presents the implemented architectures. Section IV explains the results and analysis, including a detailed comparison between the Spartan core and the Virtex core. Finally, conclusions are given.

II. NON RESTORING ALGORITHM
The earlier restoring and non-restoring algorithms focus on one bit of the square root in each iteration. In this section, the non-restoring square root algorithm is described as in [1]. Each operation is an addition or a subtraction, chosen according to the sign of the result of the previous operation. The partial remainder generated in each iteration is used in the next iteration even if it is negative [1]. At the final iteration, if the partial remainder is not negative, it becomes the final precise remainder.

Radicand: D of 2n bits. Square root: Q of n bits:

$D = D_{2n-1} D_{2n-2} D_{2n-3} D_{2n-4} \ldots D_1 D_0$
$Q = Q_{n-1} Q_{n-2} \ldots Q_0$

Manuscript received April 15, 2015.
ShabirAhmed B J, Student (M.Tech), Digital Electronics and Communication Systems, Malnad College of Engineering, Hassan, Karnataka, India, +91-9738417860.
Narendra K, Student (M.Tech), Digital Electronics and Communication Systems, Malnad College of Engineering, Hassan, Karnataka, India, +91-9738543811.
Swaroop Kumar K, Student (M.Tech), Digital Electronics and Communication Systems, Malnad College of Engineering, Hassan, Karnataka, India, +91-7411379265.
Asha G H, Associate Professor, Dept. of Electronics and Communication, Malnad College of Engineering, Hassan, Karnataka, India, +91-9448033837.

*Note that $q_k$ has n-k bits.

For k = n-2 downto 0 do
    ...
End
Remainder R = $r_0$
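As an illustration of the iteration, the following Python sketch models the integer non-restoring scheme of [1] as it is commonly stated. It is a software model under stated assumptions, not the HDL of the core: the function name `nonrestoring_sqrt` is hypothetical, Python's unbounded signed integers stand in for the sign-extended partial remainder, and the loop starts at k = n-1 with r = 0 and q = 0, which folds the first step into the loop instead of treating it separately.

```python
def nonrestoring_sqrt(d, n):
    """Return (Q, R) for a 2n-bit radicand d, such that d = Q*Q + R.

    Illustrative software model only; names are hypothetical.
    """
    q = 0  # partial root, grows by one bit per iteration
    r = 0  # signed partial remainder, kept even when it is negative
    for k in range(n - 1, -1, -1):
        pair = (d >> (2 * k)) & 0b11                   # bits D_{2k+1} D_{2k}
        if r >= 0:
            r = ((r << 2) | pair) - ((q << 2) | 0b01)  # subtract q followed by 01
        else:
            r = ((r << 2) | pair) + ((q << 2) | 0b11)  # add q followed by 11
        q = (q << 1) | (1 if r >= 0 else 0)            # sign of r gives the new root bit
    if r < 0:
        r += (q << 1) | 1                              # single final correction
    return q, r


print(nonrestoring_sqrt(0b1001, 2))  # (3, 0)
```

Only one addition or subtraction is performed per root bit, and no restoring step is needed inside the loop; the remainder is corrected once at the end, and only if it is negative.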

At each iteration, $q_k$ (the square root of $d_{2k}$) is computed. Since $d_{2k} = d_{2(k+1)} D_{2k+1} D_{2k}$, that is, $D_{2k+1} D_{2k}$ is appended to $d_{2(k+1)}$ to form $d_{2k}$, it follows that $D_{2k+1} D_{2k}$ must be used to get $q_k$. This explains why the algorithm appends $D_{2k+1} D_{2k}$ to $r_{k+1}$ to form $r_k$ in order to get $q_k$. The remainder at each iteration, called $r_k$, has n-k+1 bits, one more bit than $q_k$ [1]: $r_k = R_n R_{n-1} R_{n-2} \ldots R_k$. However, the algorithm uses an estimated remainder, called $r'_k$, that has n-k+2 bits. Its MSB is the sign bit, which decides the value of $Q_k$, and it can be shown that only the n-k+1 least significant bits of $r'_k$ are used to get the next estimated remainder $r'_{k-1}$. Likewise, to get the real remainder $R = r_0$, only the n+1 LSBs of $r'_0$ are needed (the MSB determines $Q_0$). This lowers the gate count, since a register of only n-k+1 bits is needed for the remainder at step k.
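The bit-width claims above can be checked numerically. The sketch below is a hedged illustration: it assumes that $d_{2k}$ denotes the leading bits $D_{2n-1} \ldots D_{2k}$ of the radicand and that $q_k$ denotes the leading bits $Q_{n-1} \ldots Q_k$ of the root, and it uses the true (non-negative) remainder rather than the estimated one. The helper name `check_prefix_property` is hypothetical.

```python
from math import isqrt


def check_prefix_property(d, n):
    """Check, for every k, that q_k is the root of d_2k and that widths match."""
    q = isqrt(d)                       # full n-bit root of the 2n-bit radicand
    for k in range(n - 1, -1, -1):
        d_2k = d >> (2 * k)            # D_{2n-1} ... D_{2k}
        q_k = q >> k                   # Q_{n-1} ... Q_k
        r_k = d_2k - q_k * q_k         # true remainder at step k
        assert q_k == isqrt(d_2k)      # q_k is the square root of d_2k
        assert q_k.bit_length() <= n - k
        assert r_k >= 0 and r_k.bit_length() <= n - k + 1
    return True


print(all(check_prefix_property(d, 4) for d in range(256)))  # True
```

The remainder bound follows from $r_k \le 2 q_k$, which needs at most one bit more than $q_k$.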

III. ARCHITECTURES

As mentioned in the abstract, three types of architectures have been implemented; they are described in this section.

A. Pipelined
To implement this architecture, the algorithm explained in Section II has to be unfolded. Therefore n stages with n adders/subtractors appear. Observing the first iteration yields a reduction:

$r_{n-1} = D_{2n-1} D_{2n-2} - 01$
$Q_{n-1} = 1$, if $r_{n-1} \ge 0$
$Q_{n-1} = 0$, if $r_{n-1} < 0$

There is no need to perform the first subtraction and wait one cycle if the result of the first iteration can be obtained directly from the first 2 MSBs of D. So the first stage can be embedded into the second stage, and there will be n-1 pipeline stages.
This architecture is depicted in Figure 1. The computation of the remainder is not considered, although the core computes it if the user wants. Note that the dotted rectangles indicate the registers that would have appeared if the reduction of the first stage had not been performed. Such an architecture can deliver a new square root each cycle. The initial latency is n cycles.

Figure 1: Pipelined architecture

The longest path delay occurs in the last stage, because the adder/subtractor increases in size as the stages advance. A further improvement can be made if the last stages are pipelined and the initial ones merged.

B. Combinatorial Architecture
This architecture is implemented because some non-real-time applications need it, and also in order to establish a comparison with the core that does have a fully combinatorial architecture. The architecture is very simple: it is the fully pipelined architecture without the pipelining registers. It only has one register at the input and one at the output.
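The relationship between the combinatorial and pipelined architectures described above can be modelled in software. The sketch below is a behavioural illustration under stated assumptions, not the core's HDL: it keeps all n stages (the first-stage merge is not modelled), the names `stage`, `combinatorial_sqrt` and `pipelined_sqrt` are hypothetical, and each pipeline register is assumed to carry the radicand along with the partial remainder and root.

```python
def stage(r, q, pair):
    """One unfolded iteration: add or subtract, then derive the next root bit."""
    if r >= 0:
        r = ((r << 2) | pair) - ((q << 2) | 0b01)
    else:
        r = ((r << 2) | pair) + ((q << 2) | 0b11)
    return r, (q << 1) | (1 if r >= 0 else 0)


def combinatorial_sqrt(d, n):
    """All n stages chained with no registers in between."""
    r = q = 0
    for k in range(n - 1, -1, -1):
        r, q = stage(r, q, (d >> (2 * k)) & 0b11)
    return q


def pipelined_sqrt(radicands, n):
    """Clock radicands through n register-separated stages.

    Returns a list of (clock_count, root) pairs.
    """
    regs = [None] * n                     # regs[i]: (d, r, q) latched after stage i
    results = []
    feed = list(radicands) + [None] * n   # extra clocks to flush the pipeline
    for clock, d in enumerate(feed, start=1):
        # Every stage reads the value latched by its predecessor on the previous clock.
        inputs = [None if d is None else (d, 0, 0)] + regs[:-1]
        regs = [
            None if item is None
            else (item[0], *stage(item[1], item[2],
                                  (item[0] >> (2 * (n - 1 - i))) & 0b11))
            for i, item in enumerate(inputs)
        ]
        if regs[-1] is not None:
            results.append((clock, regs[-1][2]))
    return results


print(pipelined_sqrt([9, 8], 2))  # [(2, 3), (3, 2)]
```

In this model the first root appears after n clocks and one further root appears on every following clock, while the combinatorial version produces the same values by evaluating the whole chain at once.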

Figure 2: Combinatorial Architecture

C. Iterative Architecture
The size of the elements (registers, adder/subtractor) is the size of the last stage of the pipelined architecture:
Register R: n+1 bits
Register Q: n bits
Adder/subtractor: n+2 bits
Since all iterations are embedded in one stage, the reduction of Section III-A cannot be used. However, a simplification exists for this case.

In the adder/subtractor, the 2 LSBs perform either xy - 01 or xy + 11, where xy is the pair of D bits used at each step. The operation yields cba. The truth table is shown:

xy    cba = xy + 11    cba = xy - 01
00    011              111
01    100              000
10    101              001
11    110              010

c: carry-in for the next stage of the adder/subtractor. ba: result of the operation.

ba depends only on xy, but c depends on the type of operation. Fortunately, a conventional adder/subtractor with carry-in (e.g. the lpm_add_sub megafunction) treats the carry-in as positive logic when adding and as negative logic when subtracting [3] (this is done to reduce gate usage). So, for subtraction, c has to be inverted to ensure that the adder/subtractor works properly. The new truth table is:

xy    cba = xy + 11    cba = xy - 01
00    011              011
01    100              100
10    101              101
11    110              110

Now c and ba depend only on xy:

$c = x \lor y, \quad b = \overline{x \oplus y}, \quad a = \overline{y}$

This reduces the width of the adder/subtractor by 2 bits. The result ba is obtained in parallel, and the carry-in comes from just an OR gate. So the new adder/subtractor uses n bits and has a carry-in. Also, note that the MSB of the second operand of the adder/subtractor is 0, as in the pipelined case. Figure 3 depicts this architecture.

Figure 3: Iterative Architecture

The finite state machine (FSM) that controls the iterative architecture is depicted in Figure 4. The process starts when s = 1. After n clock cycles, the result is available in register Q, done = 1, and a new process can be started.

Figure 4: FSM for iterative architecture

IV. RESULTS
The architectures were synthesized successfully using Xilinx ISE v14.1. After synthesis, the core was implemented on the FPGA device XC3S400-TQ144 (Xilinx Spartan-3 family) with speed grade -5. The core presented does not compute the remainder, since it is rarely used. Figure 5 depicts the core with all its options. Table 1 establishes a comparison between this core and the ALTERA core.
Results are shown only for a specific device (Spartan 3), because even a single device produces a large amount of result data, and these results are enough to demonstrate the benefits of the core implemented.
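The two-LSB simplification of Section III-C can be checked exhaustively in software. The sketch below is an illustration under stated assumptions: the function names are hypothetical, Python integers stand in for sign-extended hardware words, and the narrower adder/subtractor is assumed to compute r + q + c when adding and r - q - 1 + c when subtracting, which is one reading of the positive/negative carry-in convention described in that section.

```python
def low_bits(x, y):
    """Gate-level c, b, a for the two LSBs (c already inverted for subtraction)."""
    c = x | y          # OR gate
    b = 1 - (x ^ y)    # XNOR gate
    a = 1 - y          # inverter
    return c, b, a


def full_step(r, q, x, y, subtract):
    """Reference: full-width operation on (r followed by xy) and (q followed by 01 or 11)."""
    lhs = (r << 2) | (x << 1) | y
    return lhs - ((q << 2) | 0b01) if subtract else lhs + ((q << 2) | 0b11)


def reduced_step(r, q, x, y, subtract):
    """Same result from an adder/subtractor that is 2 bits narrower."""
    c, b, a = low_bits(x, y)
    upper = r + (~q if subtract else q) + c   # ~q equals -q - 1 (assumed carry convention)
    return (upper << 2) | (b << 1) | a


def agree():
    """Compare both formulations over a small exhaustive range."""
    return all(
        full_step(r, q, x, y, s) == reduced_step(r, q, x, y, s)
        for r in range(-16, 16) for q in range(16)
        for x in (0, 1) for y in (0, 1) for s in (False, True)
    )


print(agree())  # True
```

Because ba is identical for both operations and c reduces to a single OR gate, the upper bits are the only part that still needs a real adder/subtractor.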
Figure 5: Parameter comparison graph for Spartan

Figure 6: Parameter comparison graph for Virtex

Figure 7: Area comparison of all three architectures for Spartan 3

Figure 8: Area comparison of all three architectures for Virtex 4


Result analysis:
From the graphs shown in Figures 5 and 6 we can infer that latency is highest in the combinatorial architecture. Up to 16-bit inputs, the latency of the pipelined architecture is lower than that of the iterative architecture. Beyond 16 bits the two latencies cross over, i.e. for wide inputs the latency of the pipelined architecture increases. Overall, the pipelined architecture is the most efficient, and it also reaches the highest maximum operating frequency.

Area Comparison:
Figures 7 and 8 show the area occupied by the three architectures. The PlanAhead tool from Xilinx was used to generate the area views and compare them. We can infer from Figures 7 and 8 that the pipelined architecture uses the most area and the iterative architecture uses the least.

V. CONCLUSION
The core implemented achieves high speed at minimum cost, since it uses only an adder/subtractor unit to perform the operations. The architecture is very flexible, so the user can choose the best architecture for a given application. The efficiency of this core can be observed from the graphs and the table.
The results are better in terms of speed and resource usage than the earlier implementation. As a further improvement, the simplification used for the iterative architecture can be applied to each stage of the pipelined architecture.


REFERENCES
[1] Y. Li and W. Chu, "A New Non-Restoring Square Root Algorithm and Its VLSI Implementations," Proc. of 1996 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Austin, Texas, USA, October 1996, pp. 538-544.
[2] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann Publishers, Inc., 1996.
[3] G. Knittel, "A VLSI-Design for Fast Vector Normalization," Comput. & Graphics, Vol. 19, No. 2, 1995, pp. 261-271.
[4] J. Bannur and A. Varma, "The VLSI Implementation of A Square Root Algorithm," Proc. IEEE Symposium on Computer Arithmetic, IEEE Computer Society Press, Washington D.C., 1985, pp. 159-165.
[5] J. O'Leary, M. Leeser, J. Hickey, M. Aagaard, "Non-Restoring Integer Square Root: A Case Study in Design by Principled Optimization," Proc. 2nd International Conference on Theorem Provers in Circuit Design (TPCD94), 1994, pp. 52-71.
[6] K. C. Johnson, "Efficient Square Root Implementation on the 68000," ACM Transactions on Mathematical Software, Vol. 13, No. 2, 1987, pp. 138-151.
[7] H. Kabuo, T. Taniguchi, A. Miyoshi, H. Yamashita, M. Urano, H. Edamatsu, S. Kuninobu, "Accurate Rounding Scheme for the Newton-Raphson Method Using Redundant Binary Representation," IEEE Transactions on Computers, Vol. 43, No. 1, 1994, pp. 43-51.
[8] Brown and Vranesic, Fundamentals of Digital Logic with VHDL Design, McGraw Hill, 2000.
[9] U. Meyer-Baese, Digital Signal Processing with Field Programmable Gate Arrays, Springer-Verlag Berlin Heidelberg, May 2001.

