A Novel Pipelined Algorithm and Modular Architecture For Non-Square Matrix Transposition
Authorized licensed use limited to: National Taiwan University. Downloaded on December 19,2023 at 12:32:07 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: NOVEL PIPELINED ALGORITHM AND MODULAR ARCHITECTURE FOR NON-SQUARE MATRIX TRANSPOSITION 1425
TABLE I
TIMING DIAGRAM OF CONTROL SIGNALS AND OUTPUT RESULTS FOR SINGLE-PATH 8 × 4 MATRIX TRANSPOSITION ARCHITECTURE
1426 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 68, NO. 4, APRIL 2021
TABLE II
TIMING DIAGRAM OF CONTROL SIGNALS AND OUTPUT RESULTS FOR 4-PATH 8 × 4 MATRIX TRANSPOSITION ARCHITECTURE
As discussed in Section III, for an $N_R \times N_C$ matrix, the proposed single-path transposition architecture has $K(N - 1)$ permutation circuit stages. $N - 1$ of these circuit stages have an SR of length $N - 1$, while $(K - 1)(N - 1)$ of these circuit stages have an SR of length $N$. Hence, the total memory size is as follows:

$$D = (N - 1) \times (N - 1) + (K - 1)(N - 1) \times N = (N_R - 1)(N_C - 1). \tag{7}$$

Therefore, the total latency is $L = D/P = (N_R - 1)(N_C - 1)$ cycles, as each SR has a latency of one clock cycle, as listed in Table I. Considering these properties, the proposed single-path ($P = 1$) architecture achieves the theoretical minimums for both memory and latency.

IV. N-PARALLEL PIPELINED ARCHITECTURE

The proposed algorithm is applicable not only to serial but also to parallel architecture implementations. Serial implementations transpose a matrix using a series of permutations over a single data path, whereas N-parallel implementations do so using a set of multi-path exchanges. Cheng and Yu [16] proposed a simple yet efficient circuit for performing two-path exchanges; it is illustrated in Fig. 5. Using this circuit, the positions of data in a stream of length $L$ can be swapped using two parallel paths. Wang et al. [10] extended this circuit to a multi-path permutation circuit. However, it does not support transposition of a non-square matrix.

A. Cascade N-Parallel Architecture for Non-Square Matrix

We propose a multi-path non-square permutation circuit by extending the basic exchange circuit proposed by Cheng and Yu [16] and the N-parallel permutation circuit proposed by Wang et al. [10]; our proposed circuit is shown in Fig. 6.

Fig. 6. 4-parallel transposition architecture for an 8 × 4 matrix.

In general, the control strategy for this circuit is the same as that described in Section III-B. However, the control signals are slightly changed because, in current multi-path circuits, $C_0$ is not required for circuit control. Thus, control signals can be generated using only two counters $C_1$ and $C_2$, where $C_1$ and $C_2$ take values from $0$ to $N - 1$ and $0$ to $KN - 1$, respectively; these values are incremented with each clock cycle. A timing diagram of the 4-parallel architecture is shown in Table II. Four elements enter through the IN ports (IN0, IN1, IN2, IN3) in each clock cycle, and each stage has a latency of 1 cycle. After 6 cycles, the transposed data are received from the OUT ports (OUT0, OUT1, OUT2, OUT3). $M_1$–$M_5$ are the intermediate states of the matrix transposition process.

B. Memory and Latency

For an $N_R \times N_C$ matrix, when $N_R > N_C$, the proposed multi-path transposition architecture has $K(N - 1)$ permutation circuit stages. $N - 1$ of these circuit stages have an SR of length $N$ in Step A, while $(K - 1)(N - 1)$ of these circuit stages have an SR of length $N$ in Step B. Hence, the total memory of the entire architecture can be calculated as follows:

$$D = (N - 1) \times N + (K - 1)(N - 1) \times N = (N_R - 1)(N_C - 1) + N - 1. \tag{8}$$

As discussed in Section III-C, the optimal delay $L_{\min} = D/P = K(N - 1)$, which is true in any case where $N_R < N_C$. Therefore, the proposed N-parallel transposition architecture would also achieve the theoretical minimums for memory and latency.
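The memory and latency expressions in (7) and (8) can be checked numerically with a short script. The following is a minimal sketch, not the authors' implementation: the function names `stage_memory` and `latency_cycles` are hypothetical, and it assumes $N_R = KN$ and $N_C = N$, which is the reading under which the stage-by-stage sums in (7) and (8) reduce to the stated closed forms. It simply adds up the shift-register lengths per stage and divides by the number of paths $P$.

```python
def stage_memory(K: int, N: int, parallel: bool) -> int:
    """Total shift-register (SR) length D, summed stage by stage.

    Assumption (not stated verbatim in the text): N_R = K * N and N_C = N.
    Single-path: N-1 stages of length N-1, then (K-1)(N-1) stages of length N.
    N-parallel:  N-1 stages of length N,   then (K-1)(N-1) stages of length N.
    """
    first_len = N if parallel else N - 1
    first_group = [first_len] * (N - 1)
    second_group = [N] * ((K - 1) * (N - 1))
    return sum(first_group) + sum(second_group)


def latency_cycles(D: int, P: int) -> int:
    """Latency L = D / P, where P is the number of parallel paths."""
    return D // P


if __name__ == "__main__":
    K, N = 2, 4                      # the 8 x 4 example used in Tables I and II
    n_r, n_c = K * N, N

    d_single = stage_memory(K, N, parallel=False)
    d_parallel = stage_memory(K, N, parallel=True)

    # Closed forms from (7) and (8)
    assert d_single == (n_r - 1) * (n_c - 1)
    assert d_parallel == (n_r - 1) * (n_c - 1) + N - 1

    print("single-path:", d_single, "registers,", latency_cycles(d_single, 1), "cycles")
    print("4-parallel: ", d_parallel, "registers,", latency_cycles(d_parallel, N), "cycles")
```

For the 8 × 4 case this gives 21 registers and 21 cycles for the single-path design, and 24 registers and 6 cycles for the 4-parallel design; the latter agrees with the 6-cycle delay stated for Table II.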
TABLE III
COMPARISON OF SEVERAL TRANSPOSITION CIRCUITS

TABLE IV
COMPARISONS OF IMPLEMENTATION RESULTS

V. COMPARISON AND EXPERIMENTAL RESULTS

In this section, we present some results for our proposed architecture and compare them with those obtained using other matrix transposition circuits. The results listed in Table III indicate that the circuit proposed in [9] supports both serial and parallel implementations for $2^n \times 2^n$ matrix transposition with theoretical minimal memory and latency, and the complexity of multiplexers is $O(N \log_2 N)$. In contrast, the architecture reported in [10] provides a transposition scheme for an $N \times N$ matrix and achieves the theoretical minimums for memory and latency, but the complexity of multiplexers is $O(N^2)$. Nevertheless, neither of the architectures supports non-square matrix transposition. In [14], the authors provided a detailed solution for non-square matrix transposition using smaller multiplexers with a complexity of $O(N \log_2 N)$. However, the corresponding architecture was not optimal in terms of memory or latency, as the transposition of an $M \times N$ matrix required a memory size of $MN$ and a latency of $MN$. Compared with [14], our proposed architecture requires slightly more multiplexers, with a complexity of $O(N^2)$, but it achieves the theoretical minimums for memory and latency, requiring a memory size of $(M - 1)(N - 1)$ and a latency of $(M - 1)(N - 1)$, and it supports all matrices in which one dimension is an integer multiple of the other. Moreover, our proposed architecture also supports both serial and N-parallel implementations for matrix transposition.

To verify the proposed design, the architectures were implemented with the Vivado tool on a Xilinx Virtex-7 FPGA (XC7VX485T). Table IV shows the corresponding place and route results for different input sizes. The matrix data are assumed to have a 16-bit word width. Furthermore, to ensure a fair comparison, we used registers as memory to implement the architecture proposed by Garrido. Compared with Garrido's design, the architecture we proposed saves a few Look-Up-Tables (LUTs) and Flip-Flops (FFs) and achieves lower latency and greater throughput. The total power, as listed in Table IV, is primarily consumed by the LUTs and FFs of the matrix transpose module in the proposed architecture.

VI. CONCLUSION

In this brief, we formulate a novel pipelined algorithm and the corresponding architecture for non-square matrix transposition. The proposed architecture achieves the theoretical minimums for memory and latency and has a simple control strategy. Thus, the proposed circuit, which also supports continuous data, is suitable for realizing pipelined multi-dimensional FHT and FFT operations, and other radix-$2^s$ butterfly algorithms in signal and image processing that use non-square matrix transpositions.

REFERENCES

[1] K. Han et al., “An accurate 2-D nonuniform fast Fourier transform method applied to high resolution SAR image reconstruction,” in Proc. Int. Workshop Metamater. (Meta), Oct. 2012.
[2] F. Mahmood, M. Toots, L.-G. Öfverstedt, and U. Skoglund, “2D discrete Fourier transform with simultaneous edge artifact removal for real-time applications,” in Proc. Int. Conf. Field Program. Technol. (FPT), 2015, pp. 236–239.
[3] U. Nidhi, K. Paul, A. Hemani, and A. Kumar, “High performance 3D-FFT implementation,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2013, pp. 2227–2230.
[4] S. Murugan and K. Jayakumar, “A DSP based real-time 3D FFT system for analysis of dynamic parameters,” in Proc. IEEE Int. Conf. Adv. Commun. Control Comput. Technol., 2014, pp. 1489–1492.
[5] C. Paik and M. Fox, “Fast Hartley transforms for image processing,” IEEE Trans. Med. Imag., vol. 7, no. 2, pp. 149–153, Jun. 1988.
[6] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, “Efficient integer DCT architectures for HEVC,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 168–178, Jan. 2014.
[7] D. Im, D. Han, S. Choi, S. Kang, and H.-J. Yoo, “DT-CNN: An energy-efficient dilated and transposed convolutional neural network processor for region of interest based image segmentation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3471–3483, Oct. 2020.
[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[9] T. Järvinen, P. Salmela, H. Sorokin, and J. Takala, “Stride permutation networks for array processors,” J. VLSI Signal Process. Syst. Signal Image Video Technol., vol. 49, no. 1, pp. 51–71, 2007.
[10] Y. Wang, Z. Ma, and F. Yu, “Pipelined algorithm and modular architecture for matrix transposition,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 4, pp. 652–656, Apr. 2019.
[11] M. V. Noskov and V. S. Tutatchikov, “Modification of a two-dimensional fast Fourier transform algorithm by the analog of the Cooley-Tukey algorithm for a rectangular signal,” Pattern Recognit. Image Anal., vol. 25, no. 1, pp. 81–83, Jan. 2015.
[12] M. W. Czekalski, “Corner turn memory address generator,” U.S. Patent 4 484 265, Nov. 20, 1984.
[13] I. de Lotto and D. Dotti, “Large-matrix-ordering technique with applications to transposition,” Electron. Lett., vol. 9, no. 16, pp. 374–375, 1973.
[14] M. Garrido and P. Pirsch, “Continuous-flow matrix transposition using memories,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3035–3046, Sep. 2020.
[15] M. Garrido, J. Grajal, and O. Gustafsson, “Optimum circuits for bit reversal,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 657–661, Oct. 2011.
[16] C. Cheng and F. Yu, “An optimum architecture for continuous-flow parallel bit reversal,” IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2334–2338, Dec. 2015.