
Architectural Exploration for Energy-Efficient Fixed-Point Kalman Filter VLSI Design

This document summarizes an article that proposes several architectural designs for implementing a Kalman filter (KF) in dedicated hardware for energy-efficient applications. The authors explore fully sequential, semiparallel, and fully parallel architectures at the register-transfer level using fixed-point representation. They optimize quantization to reduce bit-width and evaluate the designs on three case studies, finding that the semiparallel and sequential architectures offer the best balance of area, power, speed, and energy efficiency. Compared to state-of-the-art solutions, their KF architecture uses 2.8 times fewer arithmetic operators and requires 3.3 times fewer clock cycles while maintaining precision close to floating-point.


This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Architectural Exploration for Energy-Efficient Fixed-Point Kalman Filter VLSI Design
Pedro Tauã Lopes Pereira, Student Member, IEEE, Guilherme Paim, Member, IEEE, Patrícia Ücker Leleu da Costa, Eduardo Antonio César da Costa, Member, IEEE, Sérgio José Melo de Almeida, Member, IEEE, and Sergio Bampi, Senior Member, IEEE

Abstract— Efficient Kalman filter (KF) designs for real-time mobile applications, such as nano-drone navigation, robot localization, spacecraft orbit control, GPS positioning, image recognition, and multisensor data fusion for wearable systems, are key technology goals. The KF is a compute-intensive kernel composed of consecutive complex matrix operations, such as multiplications and matrix inversions. The most complex block in the KF is the Kalman gain (KG) function, which involves matrix inversion at each iteration, applying the matrix determinant calculation and division operations. In this article, we combine architectural solutions of different types, for which balancing the conflicting low-power and high-performance requirements of real-time KF processing is a key design issue. The key finding of our architectural exploration is that the KF architectures in semiparallel and sequential forms offer the best balance of circuit area, power dissipation, and processing speed. Compared to the state-of-the-art solutions, our KF architecture is more efficient, with 2.8 times fewer arithmetic operators, requiring 3.3 times fewer clock cycles. The usefulness of the developed KF in digital signal processing (DSP) is shown by simulations of system identification, noise elimination, and state estimation applications. These figures highlight the results of the KF architecture: the speed of adaptation in system identification, with a root mean square error (RMSE) of 0.01 after 12 samples; the precision level in noise elimination, with an RMSE of 0.13; and the reliability in state estimation, with an RMSE of less than 10% of the system peak response.

Index Terms— Design space exploration (DSE), digital signal processing (DSP), hardware architectures, Kalman filter (KF), VLSI design.

Manuscript received September 5, 2020; revised January 18, 2021 and March 18, 2021; accepted April 16, 2021. This work was supported in part by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) Research and Development and in part by the Research Foundation of the State of Rio Grande do Sul–Brazil (FAPERGS) under Grant 19/2551-0001844-4. (Corresponding author: Guilherme Paim.)
Pedro Tauã Lopes Pereira and Sergio Bampi are with the Graduate Program in Microelectronics (PGMICRO), Institute of Informatics (INF), Federal University of Rio Grande do Sul (UFRGS), 91501-970 Porto Alegre, Brazil (e-mail: [email protected]; [email protected]).
Guilherme Paim is with the Graduate Program in Microelectronics (PGMICRO), Institute of Informatics (INF), Federal University of Rio Grande do Sul (UFRGS), 91501-970 Porto Alegre, Brazil, and also with Graduate Program on Electronics Engineering and Computing, Catholic University of Pelotas (UCPel), 96015-560 Pelotas, Brazil (e-mail: [email protected]).
Patrícia Ücker Leleu da Costa, Eduardo Antonio César da Costa, and Sérgio José Melo de Almeida are with the Graduate Program on Electronics Engineering and Computing, Catholic University of Pelotas (UCPel), 96015-560 Pelotas, Brazil.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3075379.
Digital Object Identifier 10.1109/TVLSI.2021.3075379
1063-8210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Carleton University. Downloaded on June 04,2021 at 14:16:29 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Intensive research efforts to leverage energy efficiency in system-on-chip (SoC) design have been undertaken in academia and by cutting-edge industry players, as in [1] and [2], to sustain the evolution of battery-powered systems, such as mobile and internet of things (IoT) devices. Notably, application-specific integrated circuits (ASICs) and dedicated accelerators allow maximum optimization of power through techniques such as parallelism exploration, precision scaling, truncation, optimized quantization, and hardware module reuse, leading to high energy efficiency [3].

Digital signal processing (DSP) algorithms fit the demands of a wide range of applications, such as adaptive filtering, real-time state estimation, and intelligent systems, which demand efficient hardware accelerators to integrate them into SoCs. The Kalman filter (KF) plays an essential role in DSP for mobile applications, such as navigation by nano-drones [4], spaceship orbit control [5], GPS positioning [6], image recognition [7], target tracking [8], and sensor data fusion for health monitoring [9]. These applications are battery-driven, with limited energy capacity, while still demanding reliable real-time performance. Such scenarios require low-energy/power systems, and thus, dedicated VLSI architectures are a solution due to their higher energy efficiency.

The main drawback of the KF is its compute-intensive signal-processing kernel, which demands complex arithmetic operations such as matrix multiplications and matrix inversions. Therefore, designing low-power application-specific accelerators for the most time-consuming tasks of the KF is a key challenge [10]–[13].

This work explores architectures for energy-efficient implementations of each KF equation in dedicated hardware. We propose and implement fully sequential, semiparallel, and fully parallel dedicated KF architectures, investigating the most energy-efficient one (lowest energy per operation ratio) for each of the equations. Our architectures were designed at the register-transfer level (RTL) and developed in fixed-point employing decimal representation. We optimize the quantization aiming to reduce the bit-width to achieve high-speed operation and low hardware complexity. We investigate our architectures in three case studies to demonstrate a precision close to that of the double-precision floating-point (FP) algorithm.

We offer different designs for all equations that compose the KF processing to determine which architectural configuration presents the best results of circuit area, power dissipation, processing speed, and energy per operation. In these architectures, we explore different dedicated circuits for the matrix inversion operation, using the analytical method necessary to calculate the Kalman gain (KG) equation and a dedicated circuit based on the Goldschmidt iterative division algorithm. Our results show that the KF equation architectures that balance the sequential and semiparallel parallelism levels maximize energy efficiency. The results also show that the proposed solution is accurate and reliable for identifying unknown systems disturbed by white noise, eliminating noise, and estimating processes.
The contributions presented in this article are as follows.
1) We thoroughly present the partition and explore architectural solutions for the KF VLSI design.
2) We perform a coarse-grained design space exploration (DSE) for the VLSI implementation of the key KF equations, aiming at the best tradeoff between circuit area, power dissipation, energy consumption per operation, and processing speed.
3) We demonstrate the complete KF hardware architecture performance in three different scenarios. Our VLSI architecture is accurate and robust in applications such as system identification, noise elimination, and process estimation.

This article is organized as follows. Section II shows a brief background on the KF, the Goldschmidt divider, the noise covariance matrices determination, and the related work. Section III presents the architectures developed for the key blocks of the KF, with different parallelism levels. Section IV offers the main results and comparisons for the architectures. Finally, the main conclusions are drawn in Section V.

II. BACKGROUND

This section presents the background on (A) the KF, (B) the Goldschmidt division algorithm and architecture, (C) the calculation of the system noise covariance matrix and the measurement noise covariance matrix, and (D) the related work regarding KF design realizations.

A. Kalman Filter

The KF is a set of mathematical equations for a recursive process of optimal estimation based on the decrease of the mean square error (MSE). A matrix system consisting of the state equation

$$x_{(k)} = A x_{(k-1)} + B^{T} u_{(k)} + w_{(k)} \tag{1}$$

and the observation equation

$$z_{(k)} = H^{T}_{(k)} x_{(k)} + r_{(k)} \tag{2}$$

describes its mathematical model [14].

The term $x_{(k)}$ is the state vector. The term $A$ is the transition matrix, $B$ is the control input matrix, $u_{(k)}$ is the control input vector, $z_{(k)}$ represents the observation vector, $H_{(k)}$ is the observation matrix, and the signals $w_{(k)}$ and $r_{(k)}$ are zero-mean white Gaussian noises that model uncertainties in the desired signal $z_{(k)}$, with normal distributions $N(0, Q)$ and $N(0, R)$, respectively. The term $k$ is the sampling instant.

The correction equation

$$\hat{x}_{(k)} = \hat{x}_{(k-1)} + K_{(k)} \left( z_{(k)} - \hat{z}_{(k)} \right) \tag{3}$$

produces the best estimate of the state vector. The term $\hat{z}_{(k)}$ is the estimated output, given by $H^{T}_{(k)} \hat{x}_{(k-1)}$. In this equation, $K_{(k)}$ is known as the KG, and it is given by

$$K_{(k)} = P^{-}_{(k)} H^{T}_{(k)} \left( H_{(k)} P^{-}_{(k)} H^{T}_{(k)} + R \right)^{-1}. \tag{4}$$

The term $R$ represents the measurement noise covariance matrix, derived from the signal $r$.

The optimal estimation minimizes both the prior $P^{-}_{(k)}$ and posterior $P_{(k)}$ system error covariances, which can be evaluated by

$$P^{-}_{(k)} = A P_{(k-1)} A^{T} + Q \tag{5}$$

$$P_{(k)} = \left( I_{L} - K_{(k)} H_{(k)} \right) P^{-}_{(k)}. \tag{6}$$

The term $Q$ describes the system noise covariance matrix, and $I_{L}$ represents an identity matrix of the same order as the error covariance matrix $P$.

The KG equation holds the highest computational complexity of the algorithm since it requires the inversion of matrices at each iteration. Using suitable iterative-based dividers and optimized dedicated architectures for this equation minimizes its hardware requirements and increases its performance. In this work, we employed the optimized iterative-based Goldschmidt divider hardware architecture proposed in [15].

B. Goldschmidt Division Algorithm and Architecture

Division by functional iteration uses multiplication as a fundamental operation. Based on the division operation $Q_c = N \times (1/D)$, where $Q_c$ is the quotient, $N$ is the dividend, and $D$ is the divisor, the primary design challenge is how to compute the reciprocal of the divisor efficiently [16].

The Goldschmidt algorithm operations are based on three equations:

$$Q_{c(i+1)} = Q_{c(i)} \times F_{(i)} \tag{7}$$

$$D_{(i+1)} = D_{(i)} \times F_{(i)} \tag{8}$$

$$F_{(i+1)} = 2 - D_{(i+1)}. \tag{9}$$

Equation (7) calculates the new quotient value with the prior reciprocal approximation of the divisor, $F_{(i)}$. Equation (8) calculates the reciprocal of the divisor error. Finally, (9) calculates the response error [17].

Fig. 1 illustrates the Goldschmidt divider circuit based on two iterations. Note the use of truncation (represented by a truncate symbol in Fig. 1) to minimize the size of the arithmetic operators, reducing the width of the multiplier output from $4n$ bits to $2n$ bits. The variable $n$ represents the input bit-width of the system.

Fig. 1. Goldschmidt divider hardware architecture [15].

Toward a faster and more power-efficient hardware algorithm than the original Goldschmidt algorithm, we use a new fixed-point method, presented in [15], for the first approximation of the divisor term, $F_0$. The architecture selects a suitable denominator value that adjusts the algorithm's first iteration, contributing to fast and precise convergence.

C. System Noise and Measurement Noise Covariance Matrices

The correct values for the system noise and measurement noise covariance matrices are essential to obtain high reliability of the system's estimated signals. Its determination can be


PEREIRA et al.: ARCHITECTURAL EXPLORATION FOR ENERGY-EFFICIENT FIXED-POINT KF VLSI DESIGN 3

for an empirical form or statistical modeling, using a linear estimation during the process. The empirical form requires a large number of experiments that allow observing the desired system behavior and thus determining the appropriate values of the system and measurement noise covariances. In statistical modeling, the covariance values change based on the features of the estimated system.

The work in [18] proposes strategies based on the fundamental equations of the KF, i.e., (1) and (2), to calculate the system noise covariance matrix and the measurement noise covariance matrix based, respectively, on

$$Q_{(k)} = \frac{\left\| \hat{x}_{(k)} - \hat{x}_{(k-1)} \right\|_{2}^{2}}{L} \tag{10}$$

$$R_{(k)} = \hat{\sigma}^{2}_{d(k)} - \hat{\sigma}^{2}_{\hat{z}(k)}. \tag{11}$$

The term $L$ represents the state vector length, $\hat{\sigma}^{2}_{d(k)}$ is the observed signal power, and $\hat{\sigma}^{2}_{\hat{z}(k)}$ is the estimated signal power. The last two variables are calculated using

$$\hat{\sigma}^{2}_{d(k)} = \beta \hat{\sigma}^{2}_{d(k-1)} + (1 - \beta) d^{2}_{(k)} \tag{12}$$

$$\hat{\sigma}^{2}_{\hat{z}(k)} = \beta \hat{\sigma}^{2}_{\hat{z}(k-1)} + (1 - \beta) \hat{z}^{2}_{(k)} \tag{13}$$

respectively, where $\beta$ is a positive value between 0 and 1 [18].

D. Related Work

Due to its great applicability, various works from the literature propose different solutions for the KF implementation aiming at several applications [19]–[33]. Table I summarizes the main features and contributions found in the related works present in the literature.

TABLE I
SUMMARY OF RELATED WORK ABOUT KF DESIGN REALIZATIONS

Generally, the KF applications require parameter estimation and validation of the implemented algorithm's estimation in real time. The proposed solutions from the literature use field-programmable gate arrays (FPGAs) [19]–[23], [29]–[32], FPGA/GPU [24], or general purpose processors (GPPs) [25], [26] as target devices. In the work in [34], only the coefficient estimation and the error correction are FPGA-based. A C/C++ software implements the KF algorithm for ellipse estimation. A microprocessor executes the software and manages each fundamental operation of coefficient estimation sequentially.

However, when the power dissipation analysis is mandatory, ASIC solutions are the best choices, and just the solution of [33] was found in the literature. This work proposes architectural solutions for each KF equation aiming at a low-power circuit.

Among the mentioned works from the literature, the solutions of [26], [23], [31], and [33] present some power dissipation results. However, the first is exclusively related to adaptive battery state monitoring. The second one does not show the methodology for the obtained FPGA power dissipation results. The last two are the only ones that present both the methodology and the power dissipation results.

When the focus is on architectural exploration, only [30], [31], and [33] present such an approach. However, they propose exploration in blocks made up of a set of equations; hence, each architecture is not explored individually. One of the processes of the KF, the KG calculation, involves the inversion of matrices at each iteration. For this task, we have applied the matrix inversion by the analytical method using the efficient iteration-based Goldschmidt divider for the division operation, with considerable advantage. To the best of our knowledge, no previous work for KF uses this divider circuit in the internal block for matrix inversion.

The work in [35] explored dedicated architectures for the KG equation aiming at the best combination of power dissipation, circuit area, and processing speed. There was no detailed implementation of the entire KF in that prior work. Our work exercises three KF case studies for validating the entire KF hardware structure that best balances precision and low energy.

III. KF HARDWARE ARCHITECTURES

In addition to developing and studying different KF architectures that can be generally used from simple up to complex industrial problems (i.e., to model, analyze, and control electrical variables), this work aims to employ a circuit that is easily replicated and reused in complex systems. Fig. 2 shows the general KF circuit block diagram with the configuration and connections of all equations. Multiplexers replace the initial values of $x_0$ and $P_0$ with the values calculated after the first iteration. The diagram shows the number of bits for each stage connection.

Fig. 2. General high-level KF block diagram configuration.

During the KG calculation, a specific block for the matrix inversion calculation is implemented, i.e., the INV block. A general INV block architecture at design time is shown in Fig. 3, which can calculate the matrix inversion by the analytical method [36] for any square matrix of order $N = 2^{m}$, with $m$ being a positive integer. The INV block process


calculates and divides (with the Goldschmidt divider) the matrix cofactors [Cofactor block, Fig. 3(b)] by the matrix determinant [Det block, Fig. 3(d)]. It reorders the resulting matrix components to respect the properties of the method. Both the Cofactor and Det block architectures are based on calculating the determinant of a square matrix of order two, the B2 block, Fig. 3(e). Fig. 3(c) shows the C block internal architecture to calculate each component $c_{ij}$ of the cofactor matrix, with $i$ and $j$ representing the matrix line and column, respectively. Fig. 3(b) shows that the C block circuit is replicated $2^{2m}$ times. The C block applies $W = (2^{m} - 1)!/2$ B2 blocks to calculate the determinants of $W$ selected matrices $A^{*}$, followed by a line of multipliers using a predefined matrix component $a^{*}$ and an adder tree. The circuit of the multiplier line and adder tree repeats depending on the matrix length. The matrix determinant, Fig. 3(d), is calculated by multiplying the first line of both the cofactor matrix and the original matrix and adding the results. The B2 block is reused in the Det block with a control signal to ensure the adder process. For the matrix determinant calculation, $(2^{m}/2)$ B2 blocks are used.

Fig. 3. General architectures of: (a) INV block, (b) Cofactor block, (c) C block, (d) Det block, and (e) B2 block.

The architectures developed in this work aim at a general DSP circuit that allows the implementation of systems with $m = 1$. It requires two state variables, two observed signals, and two control signals. Thus, it is only necessary to know the output vector and the input vector of the desired system, both with two components, for its implementation. Therefore, a square matrix of order two represents: 1) the transition matrix $A$; 2) the control inputs matrix $B$; 3) the prior and posterior error covariance matrices $P^{-}_{(k)}$ and $P_{(k)}$; 4) the observation matrix $H_{(k)}$; 5) the KG $K_{(k)}$; 6) the system noise covariance matrix $Q$; and 7) the measurement noise covariance matrix $R$. The prior and posterior state vectors $x^{-}_{(k)}$ and $x_{(k)}$, respectively, the control inputs vector $u_{(k)}$, and the estimated output $\hat{z}_{(k)}$ are vectors with two coefficients. A complex system composed of square matrices of any order $2^{m}$ can be implemented using this configuration, since such matrices can be decomposed into smaller ones, reusing or replicating our architectures $2^{2(m-1)}$ times. The exception is the INV block, which will be implemented following the general architecture of Fig. 3.

We explore three different architectures for each equation, whenever possible, to observe the best combination of circuit area, power dissipation, energy consumption per operation, and processing speed. The architectural solutions are in sequential, semiparallel, and parallel forms. The sequential architecture reduces hardware utilization by reusing hardware modules as much as possible. The parallel architecture is the alternative for minimizing processing time as much as possible; however, the number of arithmetic operators increases significantly. The semiparallel form combines sequential and parallel structures, processing more samples at a time than the sequential one, with fewer hardware requirements than the fully parallel architecture. In the architectures, considering $n$ as the bit-width of the system inputs, we use the truncation technique (truncate) to reduce the bit-width from $2n$ to $n$. Conversely, the logical extension increases the bit-width from $n$ to $2n$ when needed. The techniques respect the size of the data buses and the bit-width of all operators. The truncation and logical extension techniques respect the fractional position of the fixed-point decimal representation.

TABLE II
EQUATION STEPS OF THE KF

The proposed architectures for each equation work according to the scheduling in steps of Table II. Whenever possible, we design three different architectures to identify which one obtains the best combination of circuit area, power dissipation, and processing speed. Output variables are assigned to registers for each module/equation, as shown in the last column of Table II. The multiplications and additions in all the figures are implemented using the '∗' and '+' operators in the VHDL input, leaving logic optimizations to the ASIC


synthesis tool. All architectures use fixed-point representation, where $(n/2) - 1$ bits represent the integer part, $(n/2)$ bits the fractional part, and one bit the mathematical sign.

1) Prior State Vector ($\hat{x}^{-}_{(k)}$): Fig. 4 presents the architectures developed for the prior state vector equation in the sequential form, Fig. 4(a), semiparallel form, Fig. 4(b), and parallel form, Fig. 4(c). The sequential form processing begins with the simultaneous multiplication of the components of the matrix operation, followed by their sum. This process continues until all parts of the first step are calculated and is repeated in the second step. Lastly, in the third step, only the sums of the results of steps one and two are applied. The semiparallel form performs the same processing, but instead of calculating a single component of the stage, it calculates two components. The parallel arrangement calculates all components of the step in just one cycle.

Fig. 4. Prior state vector ($\hat{x}^{-}_{(k)}$) architectures in: (a) sequential form, (b) semiparallel form, and (c) parallel form.

Fig. 5. Prior system error covariance ($P^{-}_{(k)}$) architectures in: (a) sequential form, (b) semiparallel form, and (c) parallel form.

Fig. 6. KG hardware architecture in: (a) sequential form, (b) semiparallel form, and (c) parallel form.

2) Prior System Error Covariance ($P^{-}_{(k)}$): Fig. 5 shows different hardware architectures for the prior system error covariance, (5), in sequential [Fig. 5(a)], semiparallel [Fig. 5(b)], and parallel [Fig. 5(c)] forms. The processing is similar to the prior state vector processing. The primary modification is in the number of input signals and the register feedback. In this process, the transposed matrix $A^{T}$ is obtained from the matrix $A$ by rearranging its coefficients at the multiplexer inputs. For the prior system error covariance calculation in steps after the first, a pipeline is used, with previously calculated outputs ($T$ and $S_o$), to ensure reduced operations and high processing speed. To reuse mathematical operators and keep the circuit area and power dissipation low, we use truncation and logical extension techniques during the processes.
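As an illustration only, the fixed-point format and the truncation step described above can be mimicked in software. The following Python sketch is not the authors' RTL: it assumes a word length of n = 16, saturating quantization of the inputs, and no overflow in the intermediate sums, and all function names are hypothetical. It evaluates the prior system error covariance (5) for order-two matrices, truncating every 2n-bit product back to the n-bit format.

```python
def to_fixed(x, n=16):
    """Quantize x to a signed n-bit word: 1 sign bit, n/2 - 1 integer bits, n/2 fractional bits."""
    frac = n // 2
    raw = int(round(x * (1 << frac)))
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, raw))  # saturation is an assumption of this sketch


def to_float(raw, n=16):
    """Convert a fixed-point word back to a real number."""
    return raw / (1 << (n // 2))


def fx_mul(a, b, n=16):
    """Multiply two n-bit words and drop the n/2 extra fractional LSBs of the 2n-bit product.

    Python's >> is an arithmetic shift, which models discarding LSBs of a
    two's-complement value. The upper bits are assumed to be sign extension.
    """
    return (a * b) >> (n // 2)


def mat_mul(X, Y, n=16):
    """Order-two matrix product with truncated fixed-point multiplications."""
    return [[sum(fx_mul(X[i][k], Y[k][j], n) for k in range(2))
             for j in range(2)] for i in range(2)]


def transpose(X):
    """Transposition only rearranges coefficients; no arithmetic is needed."""
    return [[X[j][i] for j in range(2)] for i in range(2)]


def prior_covariance(A, P, Q, n=16):
    """Equation (5): P_prior = A * P * A^T + Q, computed in three steps."""
    T = mat_mul(A, P, n)                 # step 1: A * P
    S = mat_mul(T, transpose(A), n)      # step 2: (A * P) * A^T
    return [[S[i][j] + Q[i][j] for j in range(2)] for i in range(2)]  # step 3: + Q


def quantize_matrix(M, n=16):
    return [[to_fixed(v, n) for v in row] for row in M]


if __name__ == "__main__":
    A = quantize_matrix([[1.0, 0.5], [0.0, 1.0]])
    P = quantize_matrix([[0.5, 0.1], [0.1, 0.25]])
    Q = quantize_matrix([[0.01, 0.0], [0.0, 0.01]])
    result = prior_covariance(A, P, Q)
    print([[to_float(v) for v in row] for row in result])
```

With 8 fractional bits, each truncation costs at most one least significant bit (1/256), so the result stays within a few LSBs of the floating-point value for well-scaled inputs.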
3) Kalman Gain ($K_{(k)}$): As already mentioned, the KG equation represents the bottleneck in the KF. Therefore, for the sake of performance exploration, regarding the number of arithmetic operators and accuracy, we developed three architectures in sequential [Fig. 6(a)], semiparallel [Fig. 6(b)], and parallel [Fig. 6(c)] designs.
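To make the data flow of the KG equation (4) concrete for $m = 1$, the sketch below evaluates it in floating-point Python, inverting the order-two matrix analytically (determinant plus reordered components) and dividing with the Goldschmidt recurrences (7)–(9). It is a behavioral illustration under stated assumptions, not the proposed circuit: the divisor is pre-scaled by a power of two into [0.5, 1), the first factor is taken as $F_0 = 2 - D_0$ rather than the optimized first approximation of [15], the sign is handled outside the divider, and all function names are hypothetical.

```python
def goldschmidt_reciprocal(d, iterations=2):
    """Approximate 1/d for d > 0 using the recurrences (7)-(9)."""
    if d <= 0:
        raise ValueError("the divider handles magnitudes only")
    # Assumed pre-scaling: bring d into [0.5, 1) with power-of-two shifts.
    shift = 0
    while d >= 1.0:
        d /= 2.0
        shift += 1
    while d < 0.5:
        d *= 2.0
        shift -= 1
    q = 1.0          # dividend N = 1, so q converges to 1/d
    f = 2.0 - d      # assumed first factor F0
    for _ in range(iterations):
        q *= f       # (7) update the quotient
        d *= f       # (8) drive the divisor toward 1
        f = 2.0 - d  # (9) next correction factor
    return q / (2.0 ** shift)  # undo the pre-scaling


def divide(num, den, iterations=2):
    """Signed division: the sign is resolved separately from the magnitudes."""
    sign = -1.0 if (num < 0) != (den < 0) else 1.0
    return sign * abs(num) * goldschmidt_reciprocal(abs(den), iterations)


def inv2(M, iterations=2):
    """Analytical inverse of an order-two matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c              # order-two determinant
    adj = [[d, -b], [-c, a]]         # reordered components
    return [[divide(v, det, iterations) for v in row] for row in adj]


def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]


def kalman_gain(P_prior, H, R, iterations=2):
    """Equation (4): K = P_prior H^T (H P_prior H^T + R)^(-1)."""
    PHt = mat_mul(P_prior, transpose(H))
    HPHt = mat_mul(H, PHt)
    S = [[HPHt[i][j] + R[i][j] for j in range(2)] for i in range(2)]
    return mat_mul(PHt, inv2(S, iterations))


if __name__ == "__main__":
    P_prior = [[0.5, 0.1], [0.1, 0.25]]
    H = [[1.0, 0.0], [0.0, 1.0]]
    R = [[0.1, 0.0], [0.0, 0.1]]
    print(kalman_gain(P_prior, H, R, iterations=2))
    print(kalman_gain(P_prior, H, R, iterations=6))
```

Under these assumptions, the relative error of the reciprocal after $i$ iterations is $(1 - D_0)^{2^{i}}$, so two iterations are accurate only when the scaled divisor is already close to 1; this is one way to see why a good first approximation $F_0$ matters for a two-iteration divider.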
The INV block is responsible for the matrix inversion operation in the architectures. The appropriate choice of matrix inversion technique can determine the energy efficiency of the circuit. According to [36], when comparing matrix inversion operations by LU, QR, and Cholesky decomposition, the analytical method presents the smallest number of mathematical operations, mainly when performed with square matrices of order 2. Consequently, it leads to the most moderate power consumption when implemented in hardware. Based on Fig. 3, three different architectures for the INV block applying the analytical method were proposed and developed, i.e., in sequential, semiparallel, and parallel forms, respectively, Fig. 7(a), (b), and (c), following the same principle of the other architectures shown previously. As we implement a system with square matrices of order 2, $m = 1$, and following the properties of the analytical method, the general INV architecture, Fig. 3, can be simplified. The Cofactor block is replaced by the reordering of the matrix components, and a single B2 block without the control signal defines the Det block. The divider circuits use the Goldschmidt algorithm, as previously presented. As the division algorithm used in this article does not consider the sign, it is necessary to implement an extra control signal in both systems.

Fig. 7. Internal INV block architectures. (a) Sequential form. (b) Semiparallel form. (c) Parallel form.

In the first and second steps, the KG architectures perform the same processes as the prior system error covariance ($P^{-}_{(k)}$) calculation, modifying its input. As the main stage begins, step 3 starts with a simple coefficient sum. When calculating


all results, they are applied to the INV block to perform the determinant calculation and the matrix inversion. After this process, step 4 concludes the KG processing with the matrix multiplication of the inversion results by the resulting output of step 1. The truncate and logical extender operators follow the same behavior as in the $P^{-}_{(k)}$ process.

4) Posterior State Vector ($\hat{x}_{(k)}$): To compute the posterior state vector, it is necessary to calculate the estimated output, $\hat{z}_{(k)}$. Fig. 8 presents the architectures for the estimated output in sequential form, Fig. 8(a), and parallel form, Fig. 8(b). In this case, the semiparallel form is equal to the parallel one, and thus, there is no need to implement a third circuit variant. The simplicity of these architectures is a consequence of performing just a matrix multiplication. Fig. 9 offers the architectures for the posterior state vector implementation. The solution is in three architectures, in sequential form, Fig. 9(a), semiparallel form, Fig. 9(b), and parallel form, Fig. 9(c). Step one starts with two parallel subtractions between the desired system output and the estimated system output, followed by the matrix multiplication with the KG results. Finally, the sum between the previous step output coefficients and the extended $\hat{x}^{-}_{(k)}$ input coefficients is performed.

Fig. 8. Estimated output ($\hat{z}_{(k)}$) architectures in: (a) sequential form and (b) parallel form.

5) Posterior System Error Covariance ($P_{(k)}$): The posterior system error covariance processing is done in three steps. Its implementation has three versions: the sequential, Fig. 10(a), semiparallel, Fig. 10(b), and parallel, Fig. 10(c), implementations. The process is similar to the $P^{-}_{(k)}$ process. The main differences are the use of matrix subtraction instead of sum and the positions of the multiplexer inputs.

6) Noise Covariance Matrices ($Q_{(k)}$ and $R_{(k)}$): Aiming to [...] calculation can be implemented efficiently by right shifts. This simplification using simple shifts is also scalable at design time to any system of order $2^{m}$. In our case, as the vector size is $2^{1}$, the division is simply a right shift of the final sum.

Fig. 12(a) and (b) shows the architectures developed for the measurement noise covariance in sequential and parallel forms, respectively. The calculation of the measurement noise covariance is more complex than that of the system noise covariance since it depends on the observed signal power $\hat{\sigma}^{2}_{d(k)}$ and the estimated signal power $\hat{\sigma}^{2}_{\hat{z}(k)}$. Even with just one step, it requires a subsystem, named Eq.1 and shown in Fig. 12(c), to calculate the current signal powers based on the past signal powers, thus implementing (12) and (13).

IV. RESULTS

The results of the proposed architectural design explorations are presented next. Initially, the most suitable number of bits for the fixed-point hardware design is determined for the architecture implementations. Then, the VLSI hardware synthesis results follow. Finally, we offer the KF architecture validation for three case studies: system identification, noise elimination, and process estimation.

A. Bit-Width Determination

To develop efficient architectures, the best bit-width applied to the system needs to be determined. The system bit-width choice considers the best compromise between the lowest number of bits and the best system error result, aiming at VLSI circuits with the smallest circuit area and power dissipation. For this purpose, we used the root mean square error (RMSE) metric, which gives the error measurements in the time domain.

A cosimulation environment was developed to perform the simulations in a closed-loop interaction between MATLAB¹ and the ModelSim¹ logic hardware simulation tool with timing accuracy at the gate level. The MATLAB software generates the input signals, and through Simulink¹, the data are sent to the hardware simulation tool, ModelSim, to simulate the logic circuit behavior. MATLAB receives back the extracted results, processes them, and evaluates them by comparison with a double-precision FP reference system.

The bit-width tests use the KG equation. The KG equation needs to be more accurate, given the division operations
implement a system with realistic characteristics, we devel- inherent to its functioning. We choose to implement the circuit
oped architectures to calculate both the system noise covari- in the parallel form, with the INV block in the parallel
ance and the measurement noise covariance. Fig. 11(a) in mode, due to the high processing speed compared to the
sequential form and Fig. 11(b) in parallel form, show dedi- other architectures. The exploration of bit-widths varied from
cated circuits for the system noise covariance calculation. The 8, 10, 12, 14, 16, 18, 20, 22, 24, up to 26 bits, for pseudoran-
parallel and semiparallel forms have the same configuration dom and independent input signals generated in MATLAB.
for this equation. The Q (k) resolution is given by the sum of The simulation in ModelSim considers a fixed-point repre-
the difference square between current and past state vector sentation of reals. The most significant bit is the sign value.
coefficients divided by the vector size of the state variables. The (n/2) − 1 following most significant bits are the integer
Based on (10), we calculate the system noise covariance by part, and the (n/2) least significant bits represent the fractional
dividing the state vector by its length L. As we present part. Fig. 13 shows the RMSE for each bit-width.
architectures implementing systems with square matrices of Fig. 13 shows that the increasing the bit-width of the
order 2m , Q (k) it will always be calculated with a division by developed architectures results in a reduction of the average
2m value. Thus, a simple right-shift block m  is added to error incurred in the hardware implementation, compared to
the architectures to facilitate this division operation. It is worth the reference FP benchmark. This characteristic refers to the
mentioning that the KF architecture requires both constant and increase in the fractional part precision, which goes along with
general divisions among the different blocks. The inversion the growth in the representation of integer values, leading
block requires general dividers and a Goldshmidt one was
used here. On the other hand, the system noise covariance 1 Trademarked.

Authorized licensed use limited to: Carleton University. Downloaded on June 04,2021 at 14:16:29 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 9. Posterior state vector (x̂(k) ) architectures in: (a) sequential form, (b) semiparallel form, and (c) parallel form.

Fig. 10. Posterior system error covariance (P(k) ) architectures in (a) sequential form, (b) semiparallel form, and (c) parallel form.
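The word format used in the bit-width exploration (one sign bit, (n/2) − 1 integer bits, and n/2 fractional bits) can be illustrated with a short Python sketch. It is not the authors' cosimulation flow: the helper names `quantize` and `rmse`, the round-to-nearest and saturation behavior, and the random test signal are all assumptions made for illustration.

```python
import numpy as np

def quantize(x, n_bits):
    """Map x onto a signed fixed-point grid: 1 sign bit, n/2 - 1 integer bits,
    n/2 fractional bits. Round-to-nearest with saturation is an assumption;
    the text does not state the rounding or overflow mode."""
    frac_bits = n_bits // 2
    scale = 2 ** frac_bits
    q_min = -(2 ** (n_bits - 1))      # most negative integer code
    q_max = 2 ** (n_bits - 1) - 1     # most positive integer code
    codes = np.clip(np.round(np.asarray(x, dtype=float) * scale), q_min, q_max)
    return codes / scale

def rmse(reference, estimate):
    """Root mean square error between a reference and an estimate."""
    diff = np.asarray(reference) - np.asarray(estimate)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
signal = rng.uniform(-1.0, 1.0, size=1000)   # arbitrary double-precision reference

# RMSE of the quantized signal for the bit-widths explored in the text (8 to 26).
errors = {n: rmse(signal, quantize(signal, n)) for n in range(8, 28, 2)}
```

In this toy setting the error shrinks as the word grows, because each extra pair of bits adds one fractional bit; it says nothing about where the error of the actual KG circuit levels off.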

Fig. 11. System noise covariance (Q(k)) architectures in: (a) sequential form and (b) parallel form.

to increases in logic hardware. To define a good enough precision, we aim for the shortest bit-width with reduced RMSE. Taking the lowest RMSE result, obtained at 26 bits, as the reference, we target a circuit with an RMSE variation of at most 0.01. The desired result is observed for bit-widths greater than 18 bits. Thus, 20 bits presents the best compromise between VLSI implementation quality and circuit area. All hardware results for the KF presented henceforth are for the 20-bit word size.

B. Hardware Synthesis

Hardware results from the logic synthesis (LS) performed with the Cadence Genus¹ tool established which architecture presents the best combination of power dissipation, circuit area, and processing speed. In the LS step, we set the target operating frequency for the CMOS VLSI circuit. The timing and power estimations for the VLSI circuit were carried out with real input signals and followed the methodology depicted in Fig. 14. The input signals are generated with the Simulink¹ tool and exported in a text format suitable for logic simulation, followed by power estimations. We implement the KF architecture in the simulation tool Incisive, which,


PEREIRA et al.: ARCHITECTURAL EXPLORATION FOR ENERGY-EFFICIENT FIXED-POINT KF VLSI DESIGN 9

Fig. 12. Measurement noise covariance (R(k) ) architectures in: (a) sequential form and (b) parallel form.

Fig. 13. Normalized value of RMSE versus bit-width precision.

Fig. 14. Synthesis methodology flow for power/timing evaluation.

together with the prior synthesis results and design files, obtains the value change dump (VCD) file, which contains all logic transitions in time-domain logic simulation. This method greatly improves the power dissipation estimation accuracy for the real circuit under real input stimuli. Our results report very realistic and accurate figures for circuit area, power dissipation, and timing behavior with the ASIC design method and VCD files from logic simulations.

Our hardware architecture proposals are synthesized with a commercial ST 65-nm standard cell library with the voltage supply at 1.25 V for hardware mapping. The signal input has 20 bits of resolution, and the output signal has 40 bits for each equation. Table III shows the synthesis results for all architecture explorations proposed in this work.

The results for power are for a clock frequency of 100 MHz. This is the maximum operating frequency obtained for the KG, which represents the throughput bottleneck of the overall KF. The maximum frequency and throughput (in MSamples/s) attainable by the individual VLSI blocks of each equation in Table III are shown in columns 5 and 6, respectively. Aiming for a precise comparison among architectures, the figure of merit (FoM) energy per operation (Energy/op.) was calculated for each equation. This FoM is shown in the last column of Table III. The calculation is based on [37] by the equation Energy/op = Po × CC × T / O, where Po is the total system power, CC is the number of clock cycles, T is the clock period, and O is the number of system outputs.

Table IV shows that the sequential structure gives impressive reductions in total power dissipation and energy per operation, compared to the parallel and semiparallel forms. This is mainly due to the inherent sequential nature of the iterative-based Goldschmidt divider used in the INV block. Therefore, the results referring to KG architectures in Table III consider the INV block implemented in its sequential form.

All the sequential architectures appear with the smallest circuit area, as they reuse circuits as much as possible. That leads to the power dissipation minimization in these circuits. However, since the architectures present different maximum operating frequencies and a different number of clock cycles, it is essential to observe the system cost in energy per operation. The prior state vector, estimated output, and posterior state vector in sequential architectures are the best options for a balanced KF circuit, as long as processing speed is not a severe constraint in these modules, since their processing is carried out in parallel with equations that demand a high number of clock cycles. For these modules, we observe in Table III that circuit area, power dissipation, and energy per operation are much smaller than for the KG module.

To implement the balanced KF version, we select the semiparallel form for the prior system error covariance, KG, and posterior system error covariance, which are the most complex steps. These semiparallel structures present reduced circuit area and power dissipation, close to the sequential form results, and high-speed operation, close to the parallel form results. Moreover, the semiparallel architectures show lower energy consumption per operation when compared to the parallel form.

We selected the sequential architectures with the best results for both the system and measurement noise covariance equations. As the processing time does not represent a bottleneck for these equations, we prioritize the reduction of the circuit area and the power dissipation in the sequential structure. Table V shows the architectures determined for each KF equation, followed by their respective forms and clock cycles. The number of clock cycles in the KG considers the INV block in the sequential form.

Based on each configuration chosen, as above, the balanced entire KF in VLSI was implemented. Table VI shows the


TABLE III
EQUATIONS SYNTHESIS RESULTS¹ @ RUNNING AT 100 MHz
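The energy-per-operation FoM reported in the last column of Table III is defined in the text as Energy/op = Po × CC × T / O. A minimal Python sketch of that formula follows; the function name and the numeric inputs are illustrative assumptions and are not values taken from Table III.

```python
def energy_per_operation(power_w, clock_cycles, clock_period_s, num_outputs):
    """Energy/op = Po * CC * T / O, returned in joules per system output."""
    return power_w * clock_cycles * clock_period_s / num_outputs

# Illustrative values only: 1 mW, 10 clock cycles, 100 MHz clock, 2 outputs.
example = energy_per_operation(
    power_w=1e-3,
    clock_cycles=10,
    clock_period_s=1 / 100e6,
    num_outputs=2,
)  # 5e-11 J, i.e. 50 pJ per output
```

Because the clock-cycle count multiplies the power, a block with lower power but a much longer latency can still cost more energy per output, which is why the text compares architectures on this FoM rather than on power alone.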

TABLE IV
COMPARING THE INV BLOCK STRUCTURES

TABLE V
ARCHITECTURAL FORM CHOSEN FOR EACH KF EQUATION

architectural features chosen for the balanced KF, synthesized for 100 MHz operation, and the latency (in clock cycles) contributed by each block. As shown above, the speed bottleneck for the entire KF is the KG circuit block.

C. KF Architecture Validation

The validation of the balanced KF architecture proposed above was performed in real DSP applications. This way, the KF architecture developed is tested for its accuracy, precision, reliability, and behavior in three different case studies. This section presents simulations for system identification, noise elimination, and estimation process applications.

1) System Identification: The system identification process identifies a noisy, corrupted, or unknown observable system without knowing its electrical, digital, and mechanical characteristics. When implementing the inputs and outputs in the KF algorithm, it is possible to calculate quickly and accurately the state variables of the desired system. This simulation considered the parallel KF architecture for an observed system with unknown variables (black box system), using its input and output to estimate the system state variables based on the decrease of the system MSE. The simulations were performed with an observed system presenting two state variables x1 and x2, generated pseudorandomly and disturbed by Gaussian white noise. The simulation used 100 repetitions containing 50 samples in each repetition. Fig. 15 shows the results obtained for the KF VLSI architecture running system identification compared with the reference system implemented in MATLAB. The RMSE results demonstrate the estimated output error, Fig. 15(a), the estimated state vector error, Fig. 15(b), the desired output z1 versus the estimated output ẑ1, Fig. 15(c), and the observed state x2 versus the estimated state x̂2, Fig. 15(d). Fig. 15(a) and (b) highlight the decrease of the estimation error as the sample processing evolves, demonstrating high performance of the proposed filter, which converges to a steady-state regime presenting RMSE values equal to 0.01, in both cases, after 12 samples. Further validation of the KF hardware operation is presented by combining these two results: a small RMSE in estimating the system state variables and in calculating the estimated system outputs.

2) Noise Elimination: This validation removes a pseudorandom noise from a corrupted electroencephalogram (EEG) signal. It is possible to calculate the original EEG signal with the input of a signal with the same behavior as the noise in the EEG and the corrupted signal as the desired output. The difference between the desired output signal and the estimated output from the KF results in the filtered EEG signal. Fig. 16 shows normalized values of the original EEG signal [Fig. 16(a)], the corrupted EEG signal [Fig. 16(c)], and the filtered EEG signal [Fig. 16(e)]. The right side of Fig. 16 shows the


TABLE VI
SYNTHESIS RESULTS: THOROUGH KF ARCHITECTURE BALANCED FOR HIGHER ENERGY-EFFICIENCY. RUNNING AT 100 MHz IN 65-nm CMOS

Fig. 15. System identification validation. (a) Estimated output Ẑ RMSE. (b) Estimated state vector x̂ RMSE. (c) Observed output versus estimated output.
(d) Real state vector versus estimated state vector.

Fig. 16. Noise elimination validation. (a) Original EEG signal and (b) its frequency response. (c) Corrupted EEG signal and (d) its frequency response.
(e) Filtered EEG signal and (f) its frequency response.

frequency response of the original EEG signal [Fig. 16(b)], the corrupted EEG signal [Fig. 16(d)], and the filtered EEG signal [Fig. 16(f)]. With the results, it is possible to note that the system obtained, as a response, a signal that resembles the original signal, both in the frequency and in the time domains, standing out considerably from the signal corrupted by the noise agent.

To observe the precision level achieved, we evaluated the RMSE metric results, presenting a value equal to 0.13. In the noise-canceling case study, the output of the KF is the estimated noise. Therefore, if the noise varies over time (as in a pseudorandom noise), no stationary regime is reached. The KF will update its coefficients dynamically, according to the noise characteristics.

3) Estimation Process: The following scenario realizes the estimation process validation implementing an RLC circuit in state-space representation. The KF architecture obtains the current through an inductor and the voltage at the capacitor terminals. Fig. 17 shows the resonant RLC circuit with a dependent current source.

Fig. 17. RLC circuit with dependent current source.

The state variables of interest are the current i_l in inductor L1 and the voltage v_c in capacitor C1. The observed output will be the current i_r2 and voltage v_r2 in resistor R2. The state-space representation is given by

\[
\begin{bmatrix} \dot{i}_l \\ \dot{v}_c \end{bmatrix}
=
\begin{bmatrix} 0.5004 & -0.0005 \\ -1.0005 & 0 \end{bmatrix}
\begin{bmatrix} i_l \\ v_c \end{bmatrix}
+
\begin{bmatrix} -0.5004 & 0 \\ 1.0005 & 0 \end{bmatrix}
\begin{bmatrix} i_t \\ 0 \end{bmatrix}
\tag{14}
\]

\[
\begin{bmatrix} v_{R2} \\ i_{R2} \end{bmatrix}
=
\begin{bmatrix} 0.2502 & -1.0003 \\ 0.000250 & -0.001 \end{bmatrix}
\begin{bmatrix} i_l \\ v_c \end{bmatrix}.
\tag{15}
\]
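To make the roles of the matrices in (14) and (15) concrete, the following Python sketch evaluates both equations for a given state and input. The coefficient values are transcribed from the equations; the names `A`, `B`, `C`, `state_update`, and `outputs`, as well as the sample state `x0`, are hypothetical labels introduced here for illustration.

```python
import numpy as np

# Coefficients transcribed from (14) and (15); A, B, C are illustrative names.
A = np.array([[0.5004, -0.0005],
              [-1.0005, 0.0]])
B = np.array([[-0.5004, 0.0],
              [1.0005, 0.0]])
C = np.array([[0.2502, -1.0003],
              [0.000250, -0.001]])

def state_update(x, i_t):
    """Right-hand side of (14) for state x = [i_l, v_c] and input current i_t."""
    u = np.array([i_t, 0.0])
    return A @ x + B @ u

def outputs(x):
    """Observed outputs [v_R2, i_R2] from (15)."""
    return C @ x

# Hypothetical state: 1 A in the inductor, 0 V across the capacitor, no input.
x0 = np.array([1.0, 0.0])
lhs = state_update(x0, i_t=0.0)
z = outputs(x0)
```

With a zero input and a unit inductor current, the results are simply the first columns of the state and output matrices, which is a quick way to check a transcription of the coefficients.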


Fig. 18. System estimation validation in an RLC circuit. (a) Current in the inductor versus estimated current. (b) Voltage in the capacitor versus estimated voltage.
(c) Voltage in the resistor versus estimated voltage. (d) Current in the resistor versus estimated current.
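The stimulus for this case study, a sinusoidal current source at 100 Hz with a 10 A peak disturbed by small Gaussian noise, can be sketched in Python as follows. The sample rate, the noise standard deviation of 1e-3, the seed, and the function name `make_stimulus` are assumptions made for illustration; the paper does not state them.

```python
import numpy as np

def make_stimulus(num_samples, sample_rate_hz, noise_std=1e-3, seed=0):
    """Return time axis, clean 100 Hz / 10 A peak sinusoid, and a noisy copy.

    sample_rate_hz, noise_std, and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(num_samples) / sample_rate_hz
    clean = 10.0 * np.sin(2.0 * np.pi * 100.0 * t)
    noisy = clean + rng.normal(0.0, noise_std, size=num_samples)
    return t, clean, noisy

# 1000 samples at an assumed 10 kHz sample rate cover ten periods of the source.
t, i_t_clean, i_t_noisy = make_stimulus(num_samples=1000, sample_rate_hz=10_000.0)
```

The same helper could be called with different seeds to produce independent disturbances for the input and for each observed output.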

TABLE VII
COMPARISONS WITH PREVIOUS LITERATURE RESULTS

Simulation is performed for a sinusoidal current source oscillating at 100 Hz with a peak of 10 A. To simulate noise and measurement errors, before the signals enter the KF, we added three different Gaussian noise signals, with peak amplitude on the order of 10^−3, to the system output signals (i_r2 and v_r2) and to the system input signal (i_t).

Fig. 18 shows the simulation results. The system and the estimated state vector are shown in Fig. 18(a) and (b), and the system and estimated outputs in Fig. 18(c) and (d). The KF architecture accurately estimates the circuit variables even with noisy signals simulating noise and measurement errors. The RMSE of the estimated state variables x̂1 and x̂2 and of the estimated outputs ẑ1 and ẑ2 are 4.03, 8.57, 11.45, and 0.02, respectively. The elevated RMSE values are caused by the large signal values used in the simulation and by the response delay of the state estimation. However, when expressed as percentages, the RMSE results are less than 10% of the signal wave peak, which, depending on the electrical variables, is an acceptable estimation of the system's response.

The different scenarios of KF implementation validated the hardware system in relevant DSP problems. We can note the speed of adaptation for the system identification applications, with a steady-state regime RMSE of 0.01 after 12 samples; the precision level for noise elimination applications, with an RMSE of 0.13; and the reliability in state estimation processes, with reduced RMSE.

This section presented scenarios that comply with the KF calculation by square matrices of size 2^1. Previously, we highlighted the development of a circuit that could be replicated in more complex systems applying square matrices of order 2^m. As presented previously and with the results obtained, it is worth mentioning that for more complex systems, we can implement the KF architecture without modifying the combinational architecture by repeating the steps of Table V 2^(2(m−1)) times. However, this increases both the delay and the power consumption of the final architecture. For the internal INV block, we presented a scalable general architecture to be applied at design time, depending on the system size.

D. Discussions and Comparisons With the Literature

A comparison with previous KF implementations is provided in Table VII. It is difficult to compare all metrics of our results with the literature since some figures (like bit-width, latency in clock cycles, number of operators, power, or voltage supply) are not available (N/A) for many of the previous articles, as shown in Table VII. The KF hardware implementation has gained special attention, such as in [30], which presents an FPGA solution for the KF, but with no dedicated VLSI circuit for calculating the noise covariances. Ref. [30] is the only one reporting the number of arithmetic operators used. That solution is a system with two state variables, a control input, and one observed signal, and it comprises 35 adders, 5 subtractors, 75 multipliers, and 2 dividers, for a total of 117 arithmetic operators. We synthesized our VLSI hardware KF design for the same operating frequency as in that work (100 MHz). Our final KF architecture needs 10 adders, 22 multipliers, 8 subtractors, and 1 divider for a total of 41 arithmetic operators, representing a 2.8× reduction.

The works of [24], [31], and [33] show power dissipation figures, which allow comparisons. The work in [24] has a KF circuit with a bit-width of 20 bits, as in our work, and presents a power dissipation of 1433 mW under a reduced operating frequency of 10.27 MHz. This power dissipation is significantly higher than that provided by our solution, even considering our design operating at a higher frequency, i.e., 100 MHz. Furthermore, the authors do not state the length of the state, observed, and control vectors used in their implementation. The work of [31] offers a system configured by three state variables and two observed vectors with a matrix inversion circuit using Cholesky decomposition. Their best result for power dissipation is 212 mW, which is almost 200× our power dissipation figure for the same operating frequency, as their design is targeted to FPGA.
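The operator totals and the reduction factor quoted for the comparison with [30] can be checked with a few lines of arithmetic. The Python sketch below uses only the counts stated in the text; the dictionary and variable names are illustrative.

```python
# Operator counts quoted in the text for [30] and for the proposed balanced KF.
ref30 = {"adders": 35, "subtractors": 5, "multipliers": 75, "dividers": 2}
proposed = {"adders": 10, "multipliers": 22, "subtractors": 8, "dividers": 1}

total_ref30 = sum(ref30.values())        # 117
total_proposed = sum(proposed.values())  # 41

# Ratio of operator counts; the text rounds this down to 2.8x.
reduction = total_ref30 / total_proposed
```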


The work in [33] is the only one with an ASIC implementation, using a 0.5-μm CMOS technology with 24 bits of resolution. The authors quote power dissipation at a 20 MHz operating frequency and fail to mention the voltage supply. This lack of information precludes parameter scaling for a fair power comparison with our 65-nm CMOS ASIC. Therefore, we directly compare key design parameters of five solutions in Table VII, considering two state variables (with one control input and one observed signal). In this comparison, since the operating frequency is highly dependent on technology parameters and voltage supply, the number of clock cycles is a better metric for the characteristics of the architecture. It is noticeable that, although presenting a more extensive system with two state variables, two control inputs, and two observed signals, our balanced KF architecture has a latency of just 34 clock cycles against 113 clock cycles of [33], representing a reduction of 3.3 times. Besides, the solution at 0.5-μm CMOS presented in [33] dissipates 55.3 mW at just 20 MHz, i.e., 5× slower than our VLSI KF frequency of operation. Moreover, our balanced KF architecture dissipates just 1.30 mW at a 5× higher frequency. Our architecture shows a power reduction of 42.5 times, in direct comparison to the best result (55.3 mW) in prior literature for a VLSI implementation. The most relevant FoM for comparison, i.e., energy expended per operation, shows that our proposed balanced architecture reduces this FoM by 710× compared to said VLSI solution.

V. CONCLUSION

This work presented dedicated architectures implementing the entire KF process in DSP applications. Eight different architectures were developed, one for each equation of the filter, configured in fully sequential, semiparallel, and fully parallel forms. Such wide design exploration enabled us to determine which configuration leads to the best compromise or balance between circuit area, power dissipation, and processing speed. Our KG proposal uses the iterative-based Goldschmidt divider for the matrix inversions. Simulations determined 20 bits as the best fixed-point word width to implement the architectures with the best system error. Our results pointed to sequential and semiparallel architectures as the best tradeoff between reduced circuit area, power dissipation, and high processing speed for the KF equations. This best-balanced VLSI implementation of the entire KF architecture is a comprehensive and new solution for VLSI DSP applications. Finally, we explored the VLSI KF circuit in application scenarios of systems identification, noise cancellation, and systems estimation to show its end-user performance. The results presented reduced RMSE values regarding estimated outputs and estimated state vectors, confirming and validating the accuracy, precision, and reliability of the new KF dedicated VLSI architectures proposed in this work. Comparisons with previous literature revealed our new balanced KF with the best results concerning the number of arithmetic operators, power dissipation, energy per operation, and processing speed. In future works, we aim to investigate the system order scalability at design time and its impacts by using other case studies. Runtime scalable KF architectures will be addressed as well in future research.

REFERENCES

[1] S. Vangal et al., “Near-threshold voltage design techniques for heterogenous manycore system-on-chips,” J. Low Power Electron. Appl., vol. 10, no. 2, p. 16, May 2020.
[2] J. Myers, A. Savanth, R. Gaddh, D. Howard, P. Prabhat, and D. Flynn, “A subthreshold ARM cortex-M0+ subsystem in 65 nm CMOS for WSN applications with 14 power domains, 10T SRAM, and integrated voltage regulator,” IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 31–44, Jan. 2016.
[3] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for energy-efficient design,” in Proc. 18th IEEE Eur. TEST Symp. (ETS), May 2013, pp. 1–6.
[4] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, “Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones,” IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1106–1119, Apr. 2019.
[5] A. Bellar and M. A. S. Mohammed, “Satellite inertia parameters estimation based on extended Kalman filter,” J. Aerosp. Technol. Manage., vol. 11, pp. 1–11, Mar. 2019.
[6] W. Zhou, J. Hou, L. Liu, T. Sun, and J. Liu, “Design and simulation of the integrated navigation system based on extended Kalman filter,” Open Phys., vol. 15, no. 1, pp. 182–187, Apr. 2017.
[7] M. Oskoei, “Adaptive Kalman filter applied to vision based head gesture tracking for playing video games,” Robotics, vol. 6, no. 4, p. 33, 2017.
[8] H. Wang et al., “Kalman filter slope measurement method based on improved genetic algorithm-back propagation,” in Proc. WCX SAE World Congr. Exper., 2020, pp. 1–10.
[9] S. Acharya et al., “Ensemble learning approach via Kalman filtering for a passive wearable respiratory monitor,” IEEE J. Biomed. Health Informat., vol. 23, no. 3, pp. 1022–1031, May 2019.
[10] C.-S.-A. Gong et al., “Design and implementation of acoustic sensing system for online early fault detection in industrial fans,” J. Sensors, vol. 2018, Jun. 2018, Art. no. 4105208.
[11] S.-A. Li and C. Li, “FPGA implementation of adaptive Kalman filter for industrial ultrasonic applications,” Microsyst. Technol., pp. 1–8, May 2019, doi: 10.1007/s00542-019-04456-6.
[12] X. Lai, T. Yang, Z. Wang, and P. Chen, “IoT implementation of Kalman filter to improve accuracy of air quality monitoring and prediction,” Appl. Science, vol. 9, no. 9, p. 1831, 2019.
[13] L. Torres, J. Jiménez-Cabas, O. González, L. Molina, L. Estrada, and R. Francisco, “Kalman filters for leak diagnosis in pipelines: Brief history and future research,” J. Mar. Sci. Eng., vol. 8, no. 3, p. 173, 2020.
[14] R. E. Kalman, “A new approach to linear filtering and prediction problems,” J. Basic Eng., vol. 82, no. 1, pp. 35–45, Mar. 1960.
[15] G. Paim, P. Marques, E. Costa, S. Almeida, and S. Bampi, “Improved goldschmidt algorithm for fast and energy-efficient fixed-point divider,” in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2017, pp. 74–77.
[16] S. F. Obermann and M. J. Flynn, “Division algorithms and implementations,” IEEE Trans. Comput., vol. 46, no. 8, pp. 833–854, Aug. 1997.
[17] R. E. Goldschmidt, “Applications of division by convergence,” Ph.D. dissertation, Massachusetts Inst. Technol., Cambridge, MA, USA, 1964.
[18] C. Paleologu, J. Benesty, and S. Ciochina, “Study of the general Kalman filter for echo cancellation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 8, pp. 1539–1549, Aug. 2013.
[19] J. Baliyan, A. Aggarwal, and A. Kumar, “Implementation of Kalman filter using VHDL,” Int. J. Sci. Eng. Technol. Res, vol. 3, no. 8, pp. 1569–1575, 2014.
[20] R. Inan, M. Barut, and F. Karakaya, “FPGA implementation of extended Kalman filter for speed-sensorless control of induction motors,” in Proc. 7th IET Int. Conf. Power Electron., Mach. Drives (PEMD), 2014, pp. 1–6.
[21] A. A. Q. Al Rababah, “Embedded architecture for object tracking using Kalman filter,” J. Comput. Sci., vol. 12, no. 5, pp. 241–245, 2016.
[22] M. Terra, R. Montanari, and V. Guizilini, “FPGA implementation of robust array Kalman filter based On Givens rotation,” in Proc. XIII Intell. Automat. Brazilian Symp. (SBIA), Oct. 2017, pp. 1844–1849.
[23] N. Noordin, Z. Ibrahim, M. Xie, R. Samad, and N. Hasan, “FPGA implementation of simulated Kalman filter optimization algorithm,” J. Telecommun., Electron. Comput. Eng., vol. 10, nos. 1–3, pp. 21–24, 2018.
[24] A. Jarrah, “Optimized parallel architecture of Kalman filter for radar tracking applications,” Jordan J. Electr. Eng., vol. 2, no. 3, pp. 215–230, May 2016.
[25] A. Valade, P. Acco, P. Grabolosa, and J.-Y. Fourniols, “A study about Kalman filters applied to embedded sensors,” Sensors, vol. 17, no. 12, p. 2810, Dec. 2017.


[26] S. Nejad, D. T. Gladwin, and D. A. Stone, “On-chip implementation of extended Kalman filter for adaptive battery states monitoring,” in Proc. 42nd Annu. Conf. IEEE Ind. Electron. Soc. (IECON), Oct. 2016, pp. 5513–5518.
[27] L.-C. Zai, C. L. DeMarco, and T. A. Lipo, “An extended Kalman filter approach to rotor time constant measurement in PWM induction motor drives,” IEEE Trans. Ind. Appl., vol. 28, no. 1, pp. 96–104, Jan./Feb. 1992.
[28] S. Y. Chen, “Kalman filter for robot vision: A survey,” IEEE Trans. Ind. Electron., vol. 59, no. 11, pp. 4409–4420, Nov. 2012.
[29] F. Sandhu, H. Selamat, S. E. Alavi, and V. B. S. Mahalleh, “FPGA-based implementation of Kalman filter for real-time estimation of tire velocity and acceleration,” IEEE Sensors J., vol. 17, no. 17, pp. 5749–5758, Sep. 2017.
[30] M. Ricco, P. Manganiello, E. Monmasson, G. Petrone, and G. Spagnuolo, “FPGA-based implementation of dual Kalman filter for PV MPPT applications,” IEEE Trans. Ind. Informat., vol. 13, no. 1, pp. 176–185, Feb. 2017.
[31] J. Soh and X. Wu, “A scalable, FPGA-based implementation of the unscented Kalman filter,” in Introduction and Implementations of the Kalman Filter. Rijeka, Croatia: InTechOpen, 2018.
[32] J. Liao et al., “FPGA implementation of a Kalman-based motion estimator for levitated nanoparticles,” IEEE Trans. Instrum. Meas., vol. 68, no. 7, pp. 2374–2386, Jul. 2019.
[33] R. Chávez-Bracamontes, M. A. Gurrola-Navarro, H. J. Jiménez-Flores, and M. Bandala-Sánchez, “VLSI architecture of a Kalman filter optimized for real-time applications,” IEICE Electron. Exp., pp. 1–11, Feb. 2016, Art. no. 20160043.
[34] C. Wang, E. D. Burnham-Fay, and J. D. Ellis, “Real-time FPGA-based Kalman filter for constant and non-constant velocity periodic error correction,” Precis. Eng., vol. 48, pp. 133–143, Apr. 2017.
[35] P. T. L. Pereira, G. Paim, P. Ucker, E. Costa, S. Almeida, and S. Bampi, “Exploring architectural solutions for an energy-efficient Kalman filter gain realization,” in Proc. 26th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Nov. 2019, pp. 650–653.

Patrícia Ücker Leleu da Costa received the engineering degree in electronics engineering from the Federal University of Pelotas, Pelotas, Brazil, in 2018, and the M.Sc. degree in electronic engineering and computing from the Catholic University of Pelotas, Pelotas, in 2020.
Her research interests are low-power VLSI architectures, arithmetic operators, digital signal processing architecture, and approximate computing.

Eduardo Antonio César da Costa (Member, IEEE) received the five-year engineering degree in electrical engineering from the University of Pernambuco, Recife, Brazil, in 1988, the M.Sc. degree in electrical engineering from the Federal University of Paraiba, Campina Grande, Brazil, in 1991, and the Ph.D. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2002.
Part of his doctoral work was developed at the INESC-ID, Lisbon, Portugal. He is currently a Full Professor with the Catholic University of Pelotas (UCPel), Pelotas, Brazil. He is a Co-Founder and a Coordinator of the Graduate Program on Electronic Engineering and Computing at UCPel. His research interests are VLSI architectures and low-power design.
[36] A. U. Irtürk, “GUSTO: General architecture design utility and synthesis
tool for optimization,” Ph.D. dissertation, Univ. California, San Diego, Sérgio José Melo de Almeida (Member, IEEE)
CA, USA, 2009. received the B.E.E. degree from the Federal Univer-
[37] R. Muller, H.-J. Pfleiderer, and K.-U. Stein, “Energy per logic operation– sity of Pernambuco (UFPE), Recife, Brazil, in 1988,
A figure of merit for IC’s,” in Proc. 2nd Eur. Solid State Circuits Conf., the M.Sc. degree in electrical engineering from
Sep. 1976, pp. 50–51. Federal University of Paraba (UFPB), João Pessoa,
Brazil, in 1991, and the Ph.D. degree in electrical
engineering from the Federal University of Santa
Catarina (UFSC), Florianópolis, Brazil, in 2004.
Pedro Tauã Lopes Pereira (Student Member, He was a Postdoctoral with the Department of
IEEE) received the engineering degree in control Electrical and Electronic Engineering, Federal Uni-
and automation engineering from the Federal Uni- versity of Santa Catarina, from 2009 to 2010. He is
versity of Pelotas (UFPEL), Pelotas, Brazil, in 2018, currently a Professor of Electrical Engineering and Computer Science with
and the M.Sc. degree in electronic engineering and the Catholic University of Pelotas, Pelotas, Brazil. His research interests are
computing from the Catholic University of Pelotas, in digital signal processing, including statistical signal processing, adaptive
Pelotas, Brazil, in 2019. He is a currently working algorithm, hyperspectral image processing, and dedicated hardware for signal
toward the Ph.D. degree at the Federal University processing.
of Rio Grande do Sul (UFRGS), Porto Alegre,
Brazil.
His research interests are digital signal processing
architecture, adaptive filters, and approximate computing.

Sergio Bampi (Senior Member, IEEE) received the


electronics engineering degree and the B.Sc. degree
Guilherme Paim (Member, IEEE) received the in physics from Federal University of Rio Grande do
engineering degree (Hons.) in electronics engineer- Sul (UFRGS), Porto Alegre, Brazil, both in 1979,
ing from the Federal University of Pelotas (UFPel), and the M.S.E.E. and Ph.D. degrees in electrical
Pelotas, Brazil, in 2015, and the Ph.D. degree engineering from Stanford University, Stanford, CA,
(summa cum laude) in microelectronics from the USA, in 1982 and 1986, respectively.
Federal University of Rio Grande do Sul (UFRGS), He is a Full Professor with the Informatics Insti-
Porto Alegre, Brazil, in 2021. tute, UFRGS, which he joined in 1981. He was a
He developed part of his Ph.D. degree as a Visiting Former President of the Brazilian Microelectronics
Researcher with the Karlsruhe Institute of Tech- Society, of the FAPERGS Brazilian Research Fund-
nology (KIT), Karlsruhe, Germany, in collaboration ing Agency, and the CEITEC Technical Director. He was a Distinguished
with the University of Stuttgart, Stuttgart, Germany Lecturer of IEEE Circuits and Systems Society (CAS) from 2009 to 2010.
from 2019 to 2020. He is a Professor with the Catholic University of Pelotas He has published more than 460 research articles in conferences and journals,
(UCPel), Pelotas, and a Postdoctoral Researcher at UFRGS, Porto Alegre, in the fields of CMOS Analog, Digital and RF Design, Video Coding
Brazil. He has around 60 research articles on Circuits and Systems. His algorithms and hardware architectures.
research interests are energy-efficient VLSI design, near-threshold computing Dr. Bampi served as the Technical Program Chair of SBCCI in 1997 and
(NTC), reliability, side-channel attack-resistant cryptographic circuits, and 2005, the IEEE LASCAS in 2013, VARI in 2015, SBMICRO Congress
cross-layer approximate computing (AxC) for machine learning (ML), DSP, in 1989 and 1995, and served on TPC Committees of DAC, ICCAD, SBCCI,
and video coding. ICM, LASCAS, VLSI-SoC, ICECS, and many other international conferences.
