A Novel Pipelined Algorithm and Modular Architecture For Non-Square Matrix Transposition

The document presents a novel pipelined algorithm and modular architecture for non-square matrix transposition. The architecture is composed of a series of identical cascaded basic circuits and can be controlled via a simple control strategy based on several counters. It achieves theoretical minimum memory and latency and supports matrices whose rows and columns are integer multiples.


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 68, NO. 4, APRIL 2021, p. 1423

A Novel Pipelined Algorithm and Modular Architecture for Non-Square Matrix Transposition

Bo Zhang, Zhenguo Ma, and Feng Yu, Member, IEEE

Abstract: In this brief, we present a novel pipelined algorithm for transposing non-square matrices and describe the corresponding architecture. The architecture is composed of a series of identical cascaded basic circuits and can be controlled via a simple control strategy based on several counters. It is optimal in terms of both memory and latency, achieving the theoretical minimums. Moreover, the proposed algorithm and architecture can be easily extended to N-parallel implementations of matrix transposition. The architecture supports matrices in which one dimension is an integer multiple of the other; it is mainly intended for radix-$2^s$ butterfly algorithms that use matrix transpositions. Experimental results indicate that, for a 32×16 matrix transposition, the proposed single-path architecture reduces the computation cycles and circuit area by 9.18% and 5.87%, respectively, compared with a recently proposed state-of-the-art architecture for matrix transposition.

Index Terms: Non-square matrix transposition, continuous-flow, pipelined algorithm, simple control strategy.

Manuscript received September 29, 2020; revised October 17, 2020; accepted November 3, 2020. Date of publication November 5, 2020; date of current version March 26, 2021. This brief was recommended by Associate Editor C. W. Sham. (Corresponding author: Zhenguo Ma.)
The authors are with the Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]; [email protected]; [email protected]).
Digital Object Identifier 10.1109/TCSII.2020.3036183
1549-7747 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

Matrix transposition, which swaps the rows and columns of a matrix, is widely used in signal and image processing applications. For example, in image compression and synthetic aperture radar (SAR) image reconstruction [1], transpositions of image matrices are required for two-dimensional (2D) fast Fourier transform (FFT) operations between two one-dimensional (1D) FFTs, as shown in Fig. 1. Transposition is typically performed in the implementation of multi-dimensional transforms before each dimension is transformed, such as the multi-dimensional Cooley-Tukey radix-$2^s$ FFT [2], [3], [4], the fast Hartley transform (FHT) [5], and the discrete cosine transform (DCT) [6]. Matrix transposition is an essential part of these algorithms. Furthermore, the matrix transposition operation is also used in convolutional neural networks (CNNs) [7].

Fig. 1. Algorithm of calculating a 2D FFT in a continuous flow.

Several researchers prefer to reshape two-dimensional signals into square signals. In the computer vision community, to adapt to the model and maintain data consistency, scholars usually use zero-padding to resize images to a square shape when training or testing. For example, in Residual Network (ResNet), the test image is resized to 224×224 [8]. Therefore, several square matrix transposition algorithms and architectures have been proposed in [9], [10]. Järvinen et al. [9] proposed stride permutation networks for $2^n \times 2^n$ matrix transposition with theoretical minimal memory and latency. In contrast, Wang et al. [10] proposed a pipelined algorithm and modular architecture for $N \times N$ matrix transposition with minimum memory and latency. However, these approaches can transpose only square matrices. In fact, there are many rectangular signals in practice, and some two-dimensional signals are easier to process after reshaping into special rectangular signals [11] than after reshaping into square signals. Therefore, increasing attention has been paid to the transposition of non-square matrices. Previously, [12], [13] mentioned some design methods but did not present sufficient details; specifically, there is no report on hardware resources. Recently, Garrido and Pirsch [14] proposed a fundamental theory for matrix transposition in a continuous flow that addresses the problem of non-square matrix transposition in detail. However, Garrido's architecture is not optimal in terms of memory or latency.

This brief presents a pipelined algorithm and architecture for realizing $N_R \times N_C$ matrix transposition with theoretical minimums for memory and latency. The organization of this brief is as follows. Section II describes the basic algorithm. Section III presents the implementation of the serial modular pipelined architecture. Section IV presents an N-parallel architecture for $N_R \times N_C$ matrix transposition. Section V shows comparisons with existing matrix transposition architectures. Conclusions are drawn in Section VI.

II. BASIC ALGORITHM

Matrix transposition is a fundamental operation that can be expressed as

$$A_{i,j} = (A^{T})_{j,i}, \tag{1}$$

where $i = 0, 1, 2, \ldots, N_R - 1$ and $j = 0, 1, 2, \ldots, N_C - 1$.
Authorized licensed use limited to: National Taiwan University. Downloaded on December 19,2023 at 12:32:07 UTC from IEEE Xplore. Restrictions apply.
Several high-speed implementations of digital signal processing algorithms are based on matrix transposition; e.g., Cooley-Tukey radix-$2^s$ 2D FFT algorithms have been realized with cascaded processing elements computing radix-$2^s$ butterfly operations. Furthermore, in radix-$2^s$ algorithms, $N_R = 2^r$ and $N_C = 2^c$. The variables $N$, $M$, and $K$ are defined as follows:

$$N = \min(N_R, N_C), \quad M = \max(N_R, N_C), \quad K = \frac{M}{N}. \tag{2}$$

Therefore, $K$ is an integer. There are three cases associated with the size relation between $N_R$ and $N_C$. For $N_R$ equal to $N_C$, [9], [10] proposed other algorithms and architectures. We focus on the cases where $N_R$ is not equal to $N_C$. In this brief, we take the case where $N_R$ is greater than $N_C$ as an example. When $N_R$ is less than $N_C$, the situation is similar.

The matrix to be transposed has $M \times N$ elements, and we group each $N \times N$ elements into one block to obtain $K$ blocks, namely block 1, block 2, ..., block $K$. A pseudocode description is given in Algorithm 1, which is mainly composed of Step A (which exchanges the elements within a block) and Step B (which exchanges the elements between the blocks). In the rows marked (∗) at the end of Algorithm 1, mod denotes the modulo operation and int(a, b) denotes the integer part of the quotient a/b.

Algorithm 1 Pipeline Non-Square Matrix Transposition
Input: N_R, N_C, matrix A
Output: matrix A^T
Step A: Exchange of elements within a block
for stage = 0 to N − 2 do
    for k = 1 to K do
        for row = (k − 1)N to kN − 2 − stage do
            for col = stage + 1 to N − 1 do
                Swap(A[row, col], A[row + 1, col − 1])
            endfor
        endfor
    endfor
endfor
Step B: Exchange of elements between blocks
for stage = 0 to (K − 1)(N − 1) − 1 do
    for r = ROW − 1 to ROW + row − 2 do
        for col = 0 to N − 1 do
            Swap(A[r, col], A[r + 1, col])
        endfor
    endfor
endfor
return A^T
(∗) ROW = [K − 1 − mod(stage, K − 1)] × [N − int(stage, K − 1)]
(∗) row = [mod(stage, K − 1) + 1] × [N − 1 − int(stage, K − 1)]

The mechanism of Algorithm 1 can be illustrated using a simple example. Suppose matrix A is an 8 × 4 matrix. Thus, $N = 4$ and $K = 2$; accordingly, Steps A and B in Algorithm 1 each involve three stages, as depicted in Fig. 2. The ellipses and arrows in Fig. 2 represent the sample points and the exchanges between them.

Fig. 2. An example of Algorithm 1.

Based on the depictions of all steps and corresponding stages in Algorithm 1, it is clear that the stages in Steps A and B can be pipelined in a circuit system. Notably, the architectures for matrix transposition in the two cases, $N_C < N_R$ and $N_C > N_R$, are similar; they differ only slightly in control strategy and execution order. When $N_R > N_C$, we execute Step A first, followed by Step B. When $N_R < N_C$, we execute Step B first, followed by Step A. The serial pipelined architectures and control strategies in Section III are presented for $N_R > N_C$; the other situation is similar.

III. SERIAL PIPELINED ARCHITECTURE

As illustrated in Section II, transposition of an $N_R \times N_C$ matrix can be achieved via a series of steps involving permutations of matrix values. Suppose that the matrix is pumped into a pipelined circuit in serial order. In Step A of Algorithm 1, the two elements of every pair to be swapped are separated by a constant distance $L_1$, which is given by Eq. (3). Likewise, in Step B of Algorithm 1, the two elements of every pair to be swapped are separated by a constant distance $L_2$, which is given by Eq. (4).

$$\begin{aligned} L_1 &= \mathrm{index}(A_{i+1,j-1}) - \mathrm{index}(A_{i,j}) \\ &= N \times (i+1) + (j-1) - (N \times i + j) \\ &= N - 1, \end{aligned} \tag{3}$$

$$\begin{aligned} L_2 &= \mathrm{index}(A_{i+1,j}) - \mathrm{index}(A_{i,j}) \\ &= N \times (i+1) + j - (N \times i + j) \\ &= N. \end{aligned} \tag{4}$$

Therefore, the permutation circuits that implement the proposed Algorithm 1 have two fixed offsets, $L_1$ and $L_2$.

A. Basic Serial Exchange Circuit

Garrido et al. [15] proposed a bit-reversal circuit that can swap the positions of two elements with a fixed interval between them; it is illustrated in Fig. 3, where $S$ and $\bar{S}$ represent the control signals that switch the two multiplexers in the circuit between pass-by and exchange modes, and $L$ refers to the length of the buffer used for element swapping.

Fig. 3. Basic permutation circuit with a shift register (SR) of length L.
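As a software cross-check of Algorithm 1, the following is a minimal Python sketch, not the authors' implementation. It assumes $N_R \geq N_C$ with $N_R$ an integer multiple of $N_C$, reads int(a, b) as floor division, and applies the swaps sequentially in loop order. The function name `transpose_algorithm1` and the final reshaping of the $KN \times N$ working array into $N_C$ rows of length $N_R$ are illustrative choices, not part of the original text.

```python
def transpose_algorithm1(a):
    """Software model of Algorithm 1 (assumes NR = K * NC, K a positive integer)."""
    nr, nc = len(a), len(a[0])
    if nr % nc != 0:
        raise ValueError("this sketch assumes NR is an integer multiple of NC")
    n, k = nc, nr // nc
    a = [row[:] for row in a]  # work on a copy, keep the caller's matrix intact

    # Step A: exchange elements within each N x N block (distance N - 1 in the stream).
    for stage in range(n - 1):
        for blk in range(k):
            for row in range(blk * n, (blk + 1) * n - 1 - stage):
                for col in range(stage + 1, n):
                    a[row][col], a[row + 1][col - 1] = a[row + 1][col - 1], a[row][col]

    # Step B: exchange whole rows between blocks (distance N in the stream).
    for stage in range((k - 1) * (n - 1)):
        m, q = stage % (k - 1), stage // (k - 1)  # mod(stage, K-1), int(stage, K-1)
        big_row = (k - 1 - m) * (n - q)           # "ROW" in Algorithm 1
        span = (m + 1) * (n - 1 - q)              # "row" in Algorithm 1
        for r in range(big_row - 1, big_row + span - 1):
            a[r], a[r + 1] = a[r + 1], a[r]       # swaps every column of rows r, r+1

    # The row-major stream of the result is the row-major stream of A^T (NC x NR).
    flat = [x for row in a for x in row]
    return [flat[i * nr:(i + 1) * nr] for i in range(nc)]


if __name__ == "__main__":
    example = [[4 * i + j for j in range(4)] for i in range(8)]  # the 8 x 4 example
    for line in transpose_algorithm1(example):
        print(line)
```

Comparing the result against a direct transpose (for instance `zip(*a)`) for a few sizes is a quick way to check one's reading of the (∗) expressions for ROW and row.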

B. Cascade Transposition Architecture

The proposed $N_R \times N_C$ matrix transposition architecture is composed of $N - 1$ cascaded basic permutation circuits with a shift register (SR) of length $N - 1$ and $(K - 1)(N - 1)$ cascaded basic permutation circuits with an SR of length $N$. In total, the proposed cascade transposition architecture is composed of $K(N - 1)$ cascaded basic permutation circuits.

$S_i$ refers to the control signal of the $i$-th stage among all stages, while $S_n$ refers to the control signal of the $n$-th stage within Step A or Step B. These control signals can be generated using three counters, namely $C_0$, $C_1$, and $C_2$. $C_0$ and $C_1$ count from 0 to $N - 1$, while $C_2$ counts from 0 to $KN - 1$. The value of $C_0$ is incremented after each clock cycle, whereas those of $C_1$ and $C_2$ are incremented after every $N$ clock cycles.

In the case $N_R > N_C$, all the circuits are split into two steps according to Algorithm 1, depending on the value of the integer $i \in [0, K(N - 1) - 1]$, which represents the algorithmic stage in the cascaded basic permutation circuits. The control signals are fixed only for the following two cases: $i \in [0, N - 2]$ or $i \in [N - 1, K(N - 1) - 1]$.

When $i \in [0, N - 2]$ and $n = i$ in Step A, we have $n \in [0, N - 2]$ and $S_i = S_n$; then,

$$S_n = \begin{cases} 1, & n \le C_0 \le N - 2 \ \&\ 1 \le C_1 \le N - n - 1, \\ 0, & \text{else.} \end{cases} \tag{5}$$

When $i \in [N - 1, K(N - 1) - 1]$ and $n = i - (N - 1)$ in Step B, we have $n \in [0, (K - 1)(N - 1) - 1]$ and $S_i = S_n$; then,

$$S_n = \begin{cases} 1, & \mathrm{ROW} \le C_2 \le \mathrm{ROW} + \mathrm{row} - 1, \\ 0, & \text{else.} \end{cases} \tag{6}$$

For convenience, we refer to the previous example involving an 8×4 matrix. The architecture for the 8×4 matrix transposition requires six basic permutation circuit stages, as shown in Fig. 4, wherein the first three stages include an SR of length 3, while the next three stages include an SR of length 4.

Fig. 4. Cascade transposition architecture for an 8×4 matrix.

To clarify the pipelined transposition process implemented via our proposed architecture, Table I lists the control signals and outputs of each stage at different clock cycles in the matrix transposition process. As depicted in Fig. 2, the input data for matrix A arrives in row-major order, which is also listed in column IN in Table I. Control signals $S_0$ to $S_5$ shown in Fig. 4 change the status of the permutation circuits. In particular, when $S_n$ is 0, the input data enters the SR, and the output port receives the data leaving the SR; when $S_n$ is 1, the input data is passed directly to the output port and the data at the SR's output is fed back to its input, forming a loop. As shown in Table I, the delay of each stage in Step A is fixed to 3 cycles, while the delay of each stage in Step B is fixed to 4 cycles.

TABLE I. TIMING DIAGRAM OF CONTROL SIGNALS AND OUTPUT RESULTS FOR SINGLE-PATH 8 × 4 MATRIX TRANSPOSITION ARCHITECTURE

C. Memory and Latency

In this brief, memory refers to all the SRs used in the architecture, while memory size indicates register complexity. The following results from Järvinen et al. [9] provide the lower bounds for register complexity and latency for pipelined matrix transposition circuit systems:

Property 1: The lower bound for the register complexity of an $N_R \times N_C$ matrix transposed over $P$ ports is $(N_R - 1)(N_C - 1) + P - 1$ registers, where $P \le N$.

Property 2: The latency $L$ can be calculated from the number of registers $D$ and the number of ports $P$ using the ceiling function $\lceil \cdot \rceil$ as follows: $L = \lceil D / P \rceil$.
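To make the behaviour of the cascade concrete, here is a hedged Python simulation of the single-path architecture for the $N_R > N_C$ case. It models each basic permutation circuit as a shift register plus the bypass/loop rule described above, and derives $S_n$ from Eqs. (5) and (6). Two points are assumptions of this sketch rather than statements from the brief: each stage evaluates its counters on a local sample index (the global clock minus the accumulated delay of the preceding stages), and int(a, b) in the definitions of ROW and row is read as floor division. All function names are hypothetical.

```python
from collections import deque


def build_cascade(nr, nc):
    """Return a list of (sr_length, control_fn) pairs; assumes NR = K * NC."""
    n, k = nc, nr // nc
    stages = []
    for s in range(n - 1):  # Step A stages, control per Eq. (5)
        def ctrl_a(t, s=s):
            c0, c1 = t % n, (t // n) % n
            return s <= c0 <= n - 2 and 1 <= c1 <= n - s - 1
        stages.append((n - 1, ctrl_a))
    for s in range((k - 1) * (n - 1)):  # Step B stages, control per Eq. (6)
        m, q = s % (k - 1), s // (k - 1)
        big_row, span = (k - 1 - m) * (n - q), (m + 1) * (n - 1 - q)
        def ctrl_b(t, lo=big_row, hi=big_row + span - 1):
            c2 = (t // n) % (k * n)
            return lo <= c2 <= hi
        stages.append((n, ctrl_b))
    return stages


def simulate_serial(matrix):
    """Stream the matrix row-major through the cascade; return (A^T, latency)."""
    nr, nc = len(matrix), len(matrix[0])
    stream = [x for row in matrix for x in row]
    spec = build_cascade(nr, nc)
    regs = [deque([None] * length) for length, _ in spec]
    latency = sum(length for length, _ in spec)
    out = []
    for t in range(len(stream) + latency):
        x = stream[t] if t < len(stream) else None  # flush with empty samples
        offset = 0
        for (length, ctrl), sr in zip(spec, regs):
            y = sr.popleft()                 # sample leaving the shift register
            t_local = t - offset             # assumed per-stage counter alignment
            if t_local >= 0 and ctrl(t_local):
                sr.append(y)                 # S = 1: SR output loops back, input bypasses
            else:
                sr.append(x)                 # S = 0: input enters SR, SR output leaves
                x = y
            offset += length
        out.append(x)
    flat = out[latency:]
    return [flat[i * nr:(i + 1) * nr] for i in range(nc)], latency


if __name__ == "__main__":
    a = [[4 * i + j for j in range(4)] for i in range(8)]
    result, cycles = simulate_serial(a)
    print(cycles)
    for line in result:
        print(line)
```

In this model the first valid output appears after a number of cycles equal to the sum of all SR lengths, so the model can be checked directly against the register count of the architecture.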

As discussed in Section III, for an $N_R \times N_C$ matrix, the proposed single-path transposition architecture has $K(N - 1)$ permutation circuit stages. $N - 1$ of these circuit stages have an SR of length $N - 1$, while $(K - 1)(N - 1)$ of these circuit stages have an SR of length $N$. Hence, the total memory size is as follows:

$$\begin{aligned} D &= (N - 1) \times (N - 1) + (K - 1)(N - 1) \times N \\ &= (N_R - 1)(N_C - 1). \end{aligned} \tag{7}$$

Therefore, the total latency is $L = \lceil D / P \rceil = (N_R - 1)(N_C - 1)$ cycles, as each register in an SR contributes a latency of one clock cycle, as listed in Table I. Considering these properties, the proposed single-path ($P = 1$) architecture achieves the theoretical minimums for both memory and latency.

IV. N-PARALLEL PIPELINED ARCHITECTURE

The proposed algorithm is applicable not only to serial but also to parallel architecture implementations. Serial implementations transpose a matrix using a series of permutations over a single data path, whereas N-parallel implementations do so using a set of multi-path exchanges. Cheng and Yu [16] proposed a simple yet efficient circuit for performing two-path exchanges; it is illustrated in Fig. 5. Using this circuit, the positions of data in a stream of length $L$ can be swapped using two parallel paths. Wang et al. [10] extended this circuit to a multi-path permutation circuit. However, it does not support transposition of a non-square matrix.

Fig. 5. 2-path basic exchange circuit with SRs of length L.

A. Cascade N-Parallel Architecture for Non-Square Matrix

We propose a multi-path non-square permutation circuit by extending the basic exchange circuit proposed by Cheng and Yu [16] and the N-parallel permutation circuit proposed by Wang et al. [10]; our proposed circuit is shown in Fig. 6.

Fig. 6. 4-parallel transposition architecture for an 8 × 4 matrix.

In general, the control strategy for this circuit is the same as that described in Section III-B. However, the control signals are slightly changed because, in multi-path circuits, $C_0$ is not required for circuit control. Thus, control signals can be generated using only two counters, $C_1$ and $C_2$, which take values from 0 to $N - 1$ and from 0 to $KN - 1$, respectively; these values are incremented with each clock cycle. A timing diagram of the 4-parallel architecture is shown in Table II. Four elements enter through the IN ports ($\mathrm{IN}_0$, $\mathrm{IN}_1$, $\mathrm{IN}_2$, $\mathrm{IN}_3$) in each clock cycle, and each stage has a latency of 1 cycle. After 6 cycles, we receive the transposed data from the OUT ports ($\mathrm{OUT}_0$, $\mathrm{OUT}_1$, $\mathrm{OUT}_2$, $\mathrm{OUT}_3$). $M_1$ to $M_5$ are the intermediate states of the matrix transposition process.

TABLE II. TIMING DIAGRAM OF CONTROL SIGNALS AND OUTPUT RESULTS FOR 4-PATH 8 × 4 MATRIX TRANSPOSITION ARCHITECTURE

B. Memory and Latency

For an $N_R \times N_C$ matrix, when $N_R > N_C$, the proposed multi-path transposition architecture has $K(N - 1)$ permutation circuit stages. $N - 1$ of these circuit stages have an SR of length $N$ in Step A, while $(K - 1)(N - 1)$ of these circuit stages have an SR of length $N$ in Step B. Hence, the total memory of the entire architecture can be calculated as follows:

$$\begin{aligned} D &= (N - 1) \times N + (K - 1)(N - 1) \times N \\ &= (N_R - 1)(N_C - 1) + N - 1. \end{aligned} \tag{8}$$

As discussed in Section III-C, the optimal delay is $L_{\min} = \lceil D / P \rceil = K(N - 1)$, which also holds in the case where $N_R < N_C$. Therefore, the proposed N-parallel transposition architecture also achieves the theoretical minimums for memory and latency.
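The memory and latency figures of Eqs. (7) and (8) can be checked numerically against Properties 1 and 2. The short Python sketch below does this for the 8 × 4 example; the function names are illustrative and not taken from the brief, and the parallel case assumes $P = N$ ports.

```python
def register_lower_bound(nr, nc, ports):
    """Property 1: minimum number of registers for an NR x NC transposition over P ports."""
    return (nr - 1) * (nc - 1) + ports - 1


def latency(registers, ports):
    """Property 2: L = ceil(D / P), computed with integer arithmetic."""
    return (registers + ports - 1) // ports


def serial_registers(nr, nc):
    """Eq. (7): N-1 stages with SR length N-1, (K-1)(N-1) stages with SR length N."""
    n, m = min(nr, nc), max(nr, nc)
    k = m // n
    return (n - 1) * (n - 1) + (k - 1) * (n - 1) * n


def parallel_registers(nr, nc):
    """Eq. (8): all K(N-1) stages carry SRs totalling N registers each (P = N ports)."""
    n, m = min(nr, nc), max(nr, nc)
    k = m // n
    return (n - 1) * n + (k - 1) * (n - 1) * n


if __name__ == "__main__":
    nr, nc = 8, 4
    d_serial = serial_registers(nr, nc)
    d_parallel = parallel_registers(nr, nc)
    print(d_serial, register_lower_bound(nr, nc, 1), latency(d_serial, 1))
    print(d_parallel, register_lower_bound(nr, nc, 4), latency(d_parallel, 4))
```

For the 8 × 4 case both register counts coincide with the Property 1 bound, and the parallel latency evaluates to the 6 cycles quoted for the 4-parallel architecture.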

V. COMPARISON AND EXPERIMENTAL RESULTS

In this section, we present some results for our proposed architecture and compare them with those obtained using other matrix transposition circuits. The results listed in Table III indicate that the circuit proposed in [9] supports both serial and parallel implementations for $2^n \times 2^n$ matrix transposition with theoretical minimal memory and latency, and the complexity of its multiplexers is $O(N \log_2 N)$. In contrast, the architecture reported in [10] provides a transposition scheme for an $N \times N$ matrix and achieves the theoretical minimums for memory and latency, but the complexity of its multiplexers is $O(N^2)$. Neither of these architectures supports non-square matrix transposition. In [14], the authors provided a detailed solution for non-square matrix transposition using fewer multiplexers, with a complexity of $O(N \log_2 N)$. However, the corresponding architecture was not optimal in terms of memory or latency, as the transposition of an $M \times N$ matrix required a memory size of $MN$ and a latency of $MN$. Compared with [14], our proposed architecture requires slightly more multiplexers, whose complexity is $O(N^2)$, but it achieves the theoretical minimums for memory and latency, requiring a memory size of $(M - 1)(N - 1)$ and a latency of $(M - 1)(N - 1)$, and it supports all matrices in which one dimension is an integer multiple of the other. Moreover, our proposed architecture supports both serial and N-parallel implementations of matrix transposition.

TABLE III. COMPARISON OF SEVERAL TRANSPOSITION CIRCUITS

To verify the proposed design, the architectures were implemented with the Vivado tool on a Xilinx Virtex-7 FPGA (XC7VX485T). Table IV shows the corresponding place and route results for different input sizes. The matrix data is assumed to have a 16-bit word width. Furthermore, to ensure a fair comparison, we used registers as memory to implement the architecture proposed by Garrido. Compared with Garrido's design, the architecture we propose saves a few Look-Up Tables (LUTs) and Flip-Flops (FFs) and achieves lower latency and greater throughput. The total power consumption, as listed in Table IV, is primarily consumed by the LUTs and FFs of the matrix transpose module in the proposed architecture.

TABLE IV. COMPARISONS OF IMPLEMENTATION RESULTS

VI. CONCLUSION

In this brief, we formulate a novel pipelined algorithm and the corresponding architecture for non-square matrix transposition. The proposed architecture achieves the theoretical minimums for memory and latency and has a simple control strategy. Thus, the proposed circuit, which also supports continuous data, is suitable for realizing pipelined multi-dimensional FHT and FFT operations, and other radix-$2^s$ butterfly algorithms in signal and image processing that use non-square matrix transpositions.

REFERENCES

[1] K. Han et al., "An accurate 2-D nonuniform fast Fourier transform method applied to high resolution SAR image reconstruction," in Proc. Int. Workshop Metamater. (Meta), Oct. 2012.
[2] F. Mahmood, M. Toots, L.-G. Öfverstedt, and U. Skoglund, "2D discrete Fourier transform with simultaneous edge artifact removal for real-time applications," in Proc. Int. Conf. Field Program. Technol. (FPT), 2015, pp. 236–239.
[3] U. Nidhi, K. Paul, A. Hemani, and A. Kumar, "High performance 3D-FFT implementation," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2013, pp. 2227–2230.
[4] S. Murugan and K. Jayakumar, "A DSP based real-time 3D FFT system for analysis of dynamic parameters," in Proc. IEEE Int. Conf. Adv. Commun. Control Comput. Technol., 2014, pp. 1489–1492.
[5] C. Paik and M. Fox, "Fast Hartley transforms for image processing," IEEE Trans. Med. Imag., vol. 7, no. 2, pp. 149–153, Jun. 1988.
[6] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, "Efficient integer DCT architectures for HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 1, pp. 168–178, Jan. 2014.
[7] D. Im, D. Han, S. Choi, S. Kang, and H.-J. Yoo, "DT-CNN: An energy-efficient dilated and transposed convolutional neural network processor for region of interest based image segmentation," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3471–3483, Oct. 2020.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[9] T. Järvinen, P. Salmela, H. Sorokin, and J. Takala, "Stride permutation networks for array processors," J. VLSI Signal Process. Syst. Signal Image Video Technol., vol. 49, no. 1, pp. 51–71, 2007.
[10] Y. Wang, Z. Ma, and F. Yu, "Pipelined algorithm and modular architecture for matrix transposition," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 4, pp. 652–656, Apr. 2019.
[11] M. V. Noskov and V. S. Tutatchikov, "Modification of a two-dimensional fast Fourier transform algorithm by the analog of the Cooley-Tukey algorithm for a rectangular signal," Pattern Recognit. Image Anal., vol. 25, no. 1, pp. 81–83, Jan. 2015.
[12] M. W. Czekalski, "Corner turn memory address generator," U.S. Patent 4 484 265, Nov. 20, 1984.
[13] I. de Lotto and D. Dotti, "Large-matrix-ordering technique with applications to transposition," Electron. Lett., vol. 9, no. 16, pp. 374–375, 1973.
[14] M. Garrido and P. Pirsch, "Continuous-flow matrix transposition using memories," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3035–3046, Sep. 2020.
[15] M. Garrido, J. Grajal, and O. Gustafsson, "Optimum circuits for bit reversal," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 657–661, Oct. 2011.
[16] C. Cheng and F. Yu, "An optimum architecture for continuous-flow parallel bit reversal," IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2334–2338, Dec. 2015.
