An Efficient Hardware Implementation of Artificial Neural Network Based On Stochastic Computing
Abstract—Recently, the Artificial Neural Network (ANN) has emerged as the main driving force behind the rapid development of many applications. Although ANN provides high computing capabilities, its prohibitive computational complexity, together with the large area footprint of ANN hardware implementations, has made it unsuitable for embedded applications with real-time constraints. Stochastic Computing (SC), an unconventional computing technique which can offer low-power and area-efficient hardware implementations, has shown promising results when applied to ANN hardware circuits. In this paper, efficient hardware implementations of ANN with both conventional binary radix computation and the SC technique are proposed. The system's performance is benchmarked with a handwritten digit recognition application. Simulation results show that, on the MNIST dataset, the 10-bit binary implementation of the system incurs an accuracy loss of only 0.44% compared to the software simulations. Preliminary simulation results of the SC neuron block show that its output is comparable to the binary radix results. The FPGA implementation of the SC neuron block shows a 67% reduction in the number of LUT slices.

Index Terms—Artificial Neural Network, Stochastic Computing, MNIST dataset

I. INTRODUCTION

The Artificial Neural Network (ANN) is currently the main driving force behind the development of many modern Artificial Intelligence (AI) applications. Popular variants of ANN such as the Deep Neural Network (DNN) and the Convolutional Neural Network (CNN) have made major breakthroughs in self-driving cars [1], cancer detection [2], and playing the game of Go [3]. The Spiking Neural Network (SNN) model, the third generation of ANN, has been extensively used to simulate the human brain and to study its behavior [4]. Despite its state-of-the-art performance, ANN requires a very high level of computational complexity. This has made ANN unsuitable for many embedded image and video processing applications, e.g. smart surveillance camera systems, portable voice recognition systems, etc. Such applications impose real-time processing and low-power consumption constraints. To enable real-time, in-field processing of ANN on such platforms, developing dedicated hardware architectures has attracted growing interest [5][6][7]. Dedicated platforms have two major advantages. The first is high computational speed, due to their innate parallel processing capabilities. The second is low power and low area cost, due to their highly optimized architectures. In this paper, a hardware architecture for efficient processing of ANN is proposed, with the major goals of achieving low area cost and moderate power consumption.

To further optimize the area cost and power consumption of the architecture, the Stochastic Computing (SC) technique is also introduced. SC is an unconventional computing technique first proposed in the 1960s by Gaines [8]. SC allows the implementation of complex arithmetic operations with simple logic elements; hence it can greatly reduce the area cost of hardware architectures. Recently, SC has been successfully applied to many applications, such as edge-detection circuits [9] and LDPC decoder circuits [10]. However, SC suffers from low computational accuracy due to the random nature of computation in the SC domain [8]. On the other hand, ANN is inherently fault-tolerant and can show good results even with very low precision computing [11]. Therefore, combining a dedicated ANN hardware architecture with the SC technique is a solution to reduce the area cost and power consumption of the system.
The two main contributions of the paper are as follows. Firstly, a dedicated hardware architecture for efficient processing of ANN is proposed. Secondly, the SC technique is introduced to the existing platform to reduce the area cost. Simulation results show that, on the MNIST dataset, the 10-bit fixed point implementation of the platform incurs an accuracy loss of only 0.44% compared to the software simulation results. Preliminary results of the system with the SC technique applied show that its output is comparable to the base system. The FPGA implementation of the SC neuron block also shows a large reduction of 67% in the number of LUT slices used.

The outline of the paper is as follows. Section II presents the basic concepts of SC and its computational elements. Section III covers the proposed hardware architecture of the system in detail. Section IV reports the simulation and implementation results on an FPGA platform. Finally, Section V concludes the paper.
II. STOCHASTIC COMPUTING

A. Basic concept

Stochastic Computing is an unconventional computing method based on the probability of bits in a pseudo-random bit stream. Numbers in the SC domain are represented as pseudo-random bit streams, called Stochastic Numbers (SNs), which can be processed by very simple logic circuits. The value of each stochastic number is determined by the probabilities of observing bits 0 and 1 in the bit stream. For example, let a pseudo-random bit stream S of length N denote the stochastic number X, where S contains N_1 1's and N − N_1 0's. In the unipolar format, the probability p_x of X lies in the range [0; 1] and is given as:

    p_x = N_1 / N    (1)

The stochastic representation of a number is not unique. For each number N_1/N, there are C(N, N_1) (binomial coefficient) representations. For example, with N = 6, there are 6 ways to represent the number 5/6: {111110, 111101, 111011, 110111, 101111, 011111}. With a given bit stream length N, only a small subset of the real numbers in [0; 1] can be expressed exactly in SC.

To express real numbers in different intervals, several other SN formats have been proposed [12]. One popular variant is the bipolar format. A stochastic number X in unipolar format with value p_x ∈ [0; 1] can be mapped to the range [−1; 1] via the mapping function:

    p_y = 2·p_x − 1 = (N_1 − N_0) / N    (2)

B. SC computational elements

Stochastic Computing uses a fundamentally different computing paradigm from the traditional method and requires some basic computational elements, whose details are given in this section.

1) Stochastic Number converters: Circuits that act as interfaces between binary numbers and stochastic numbers are fundamental elements of SC. A circuit which converts a binary number to the SC format is called a Stochastic Number Generator (SNG). Conversely, a circuit that converts an SC number to the binary number format is called a Stochastic Number Decoder. Fig. 1 shows such circuits.

Figure 1: Circuits to convert between stochastic numbers and binary numbers: (a) Stochastic Number Generator, (b) Stochastic Number Decoder.

An SNG consists of a Linear Feedback Shift Register (LFSR) and a comparator. A k-bit pseudo-random binary number is generated in each clock cycle by the LFSR and compared to the k-bit input binary number b. The comparator produces '1' if the random number is less than b and '0' otherwise. Assuming that the random numbers are uniformly distributed over the interval [0; 1], the probability of a '1' appearing at the output of the comparator in each clock cycle is p_x = b/2^k. A stochastic number decoder, on the other hand, consists of a binary counter: it simply counts the number of '1' bits in the input stochastic number X.

2) Multiplication in SC: Multiplication of two stochastic bit streams is performed using AND and XNOR gates in the unipolar and bipolar formats, respectively. Figs. 2a and 2b demonstrate examples of multiplication in the SC domain.

3) Addition in SC: Addition in SC is usually performed using either scaled adders or OR gates. Figs. 2c and 2d show examples of addition in SC using either a MUX or an OR gate. Addition in SC can also be performed with a stochastic adder that is more accurate and does not require additional random inputs like the scaled adder does. This adder was proposed by V. T. Lee [13] and is illustrated in Fig. 3.
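To make these elements concrete, the following minimal Python sketch models an SNG, a decoder, the unipolar AND multiplier, and the MUX-based scaled adder. It is a behavioral sketch only: the hardware LFSR is replaced by Python's built-in pseudo-random generator, and the 1024-bit stream length is an arbitrary choice.

    import random

    LENGTH = 1024  # stream length; an arbitrary choice for this sketch

    def sng(b, k=8, length=LENGTH):
        # Stochastic Number Generator: each cycle, compare the k-bit
        # binary input b against a pseudo-random number. A real SNG
        # uses an LFSR; here the random source is Python's PRNG.
        return [1 if random.getrandbits(k) < b else 0 for _ in range(length)]

    def decode(stream):
        # Stochastic Number Decoder: a binary counter of the '1' bits.
        return sum(stream) / len(stream)

    a = sng(192)   # p_a = 192/256 = 0.75
    b = sng(64)    # p_b = 64/256  = 0.25

    # Unipolar multiplication with an AND gate: p_y = p_a * p_b.
    mult = [x & y for x, y in zip(a, b)]

    # Scaled addition with a MUX: a select stream of probability 1/2
    # yields p_y = (p_a + p_b) / 2.
    sel = sng(128)  # p_sel = 128/256 = 0.5
    add = [x if s else y for x, y, s in zip(a, b, sel)]

    print(decode(mult))  # close to 0.1875
    print(decode(add))   # close to 0.5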
Figure 3: Example of the stochastic adder of [13], adding the input streams A = 1010_1111 (6/8) and B = 1000_0010 (2/8) to produce Y = 1010_1010 (4/8), i.e. the scaled sum (6/8 + 2/8)/2.

Figure 4: The proposed architecture, with a Controller driving the Start, Restart, and Finish signals.
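The value in the Fig. 3 example can be reproduced behaviorally with a toggle flip-flop (TFF) construction, a common way to build a deterministic scaled adder that needs no random select input; we do not claim this is the exact circuit of [13], and the output bit pattern may differ from the figure, but the decoded value matches.

    def tff_scaled_add(a, b):
        # Deterministic scaled adder: equal input bits pass through;
        # unequal bits alternate via a toggle flip-flop, so the output
        # stream decodes to (p_a + p_b) / 2 with no random select input.
        t, out = 0, []
        for x, y in zip(a, b):
            if x == y:
                out.append(x)
            else:
                out.append(t)
                t = 1 - t  # toggle
        return out

    A = [1, 0, 1, 0, 1, 1, 1, 1]  # 6/8
    B = [1, 0, 0, 0, 0, 0, 1, 0]  # 2/8
    Y = tff_scaled_add(A, B)
    print(sum(Y) / len(Y))  # 0.5, i.e. 4/8 as in the Fig. 3 example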
between x and M. The output is loaded to the Sum register. The Activation function block implements the activation function of each neuron. In this work, two of the most popular activation functions are implemented: the Sigmoid function and the ReLU function [14].

C. The SC neuron block

The binary neuron block uses many parallel multipliers and adders, which are costly in terms of area. On the other hand, multiplication and addition in SC can be implemented with area-efficient basic components. Hence, a novel neuron block based on the SC technique is proposed in this section to reduce the area footprint. The proposed SC neuron block still maximizes parallelism with M inputs. Because our application requires negative weights and biases, SC with the bipolar format representation is chosen in this work. The architecture of the proposed SC neuron block is given in Fig. 6.

Figure 6: The architecture of the proposed SC neuron block (SNGs for each input x_i and weight w_i, a counter with a clear signal accumulating the products, addition of the bias b, and the Activation function block producing the output).
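A behavioral sketch of this block is given below; it is not a cycle-accurate model of Fig. 6. Each input/weight pair is converted to bipolar streams by SNGs and multiplied with an XNOR gate, the products are accumulated, and the bias and activation are applied in the binary domain. The stream length and the example inputs are arbitrary choices, and the per-product decoding is a software simplification of the counter-based accumulation in the hardware.

    import random

    LENGTH = 4096  # stream length; longer streams reduce random error

    def encode_bipolar(v, length=LENGTH):
        # Bipolar SNG model: a value v in [-1, 1] is encoded as a
        # stream with bit probability p = (v + 1) / 2 (Eq. 2 inverted).
        p = (v + 1) / 2
        return [1 if random.random() < p else 0 for _ in range(length)]

    def decode_bipolar(stream):
        # Eq. 2: v = 2*p - 1, with p estimated by a counter.
        return 2 * sum(stream) / len(stream) - 1

    def sc_neuron(xs, ws, bias, act):
        # One XNOR multiplier per (x_i, w_i) pair; each product stream
        # is decoded and summed in software, whereas the hardware
        # accumulates all product bits with a counter.
        total = 0.0
        for x, w in zip(xs, ws):
            sx, sw = encode_bipolar(x), encode_bipolar(w)
            prod = [1 - (p ^ q) for p, q in zip(sx, sw)]  # bipolar multiply
            total += decode_bipolar(prod)                 # ~ x * w
        return act(total + bias)

    def relu(v):
        return max(0.0, v)

    # 0.5*0.1 - 0.25*0.6 + 0.8*(-0.5) + 0.05 = -0.45 -> ReLU -> 0.0
    print(sc_neuron([0.5, -0.25, 0.8], [0.1, 0.6, -0.5], 0.05, relu))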
One of the most popular data sets for the handwritten digit recognition task is the MNIST dataset [15]. This dataset contains 60,000 images for training and 10,000 images for testing. The size of each image is 28 × 28.

The software implementation of the ANN is realized with Caffe [16], a popular open-source platform for machine learning tasks. The training phase of the ANN is performed off-line; the trained connection weights and biases are then used in the inference phase of our hardware system. The ANN model chosen for our application is a multi-layer perceptron with a single hidden layer. Since each input image has a size of 28 × 28, there are 784 neurons in the input layer. The classification digits range from 0 to 9, hence there are 10 neurons in the output layer. In this work, we also explore the effect of varying the number of neurons in the hidden layer on classification accuracy. Table I summarizes the classification results on the Caffe platform; N is the number of neurons in the hidden layer.

Table I: Classification accuracies with Caffe

The chosen activation functions are the Sigmoid and ReLU functions. The classification results on the software platform serve as a golden model to benchmark our hardware implementation results.
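For reference, the inference phase described above, a 784-N-10 multi-layer perceptron, amounts to the following sketch. The weights and biases here are random stand-ins for the parameters trained off-line with Caffe, and N = 48 is one of the explored hidden-layer sizes.

    import numpy as np

    N = 48  # hidden-layer neurons; one of the explored sizes (16/48/64)

    rng = np.random.default_rng(0)
    # Stand-ins for the connection weights and biases trained with Caffe.
    W1, b1 = rng.standard_normal((N, 784)) * 0.01, np.zeros(N)
    W2, b2 = rng.standard_normal((10, N)) * 0.01, np.zeros(10)

    def relu(v):
        return np.maximum(v, 0.0)

    def infer(image):
        # image: flattened 28x28 input, 784 values in [0, 1].
        hidden = relu(W1 @ image + b1)   # single hidden layer
        scores = W2 @ hidden + b2        # 10 output neurons, digits 0-9
        return int(np.argmax(scores))    # predicted digit

    print(infer(rng.random(784)))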
Table II: Comparison of classification accuracies between the proposed design and Caffe's implementation

    N                                 16        48        64
    Sigmoid  Proposed design          87.60%    86.63%    85.94%
             Software simulation      88.04%    89.01%    89.10%
             Accuracy loss            0.44%     2.38%     3.16%

This comes from the fact that our design uses a 10-bit fixed point representation while the Caffe platform uses a 32-bit floating point representation. Another observation is that the hardware implementation of the Sigmoid incurs a greater accuracy loss compared to the ReLU. This is because, in the proposed design, the Sigmoid function is approximated with a Look-Up Table (LUT).
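As an illustration of this approximation, a minimal sketch of a LUT-based Sigmoid is given below. The table size (256 entries) and the clamped input range ([-8, 8]) are assumptions of the sketch; the paper does not specify these parameters.

    import math

    LUT_SIZE = 256               # assumed number of LUT entries
    IN_MIN, IN_MAX = -8.0, 8.0   # assumed clamping range of the input

    # Precompute sigmoid samples over the clamped input range.
    STEP = (IN_MAX - IN_MIN) / (LUT_SIZE - 1)
    LUT = [1.0 / (1.0 + math.exp(-(IN_MIN + i * STEP)))
           for i in range(LUT_SIZE)]

    def sigmoid_lut(x):
        # Clamp the input, quantize it to a table index, and look up
        # the stored value; this quantization is the source of the
        # extra accuracy loss discussed above.
        x = max(IN_MIN, min(IN_MAX, x))
        return LUT[round((x - IN_MIN) / STEP)]

    print(sigmoid_lut(0.0))   # ~0.5
    print(sigmoid_lut(2.0))   # ~0.88, vs. the exact 0.8808...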
C. Results of SC neuron block implementations

To evaluate our SC neuron block implementation, the Mean Square Error (MSE) between the outputs of each block with the SC technique and the outputs of SC's software implementation is used. The MSE is calculated as in [13]:

    MSE = (1/N) · Σ_{i=0}^{N−1} (O_floating_point,i − O_proposed,i)^2    (3)
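As an illustration of this metric, the sketch below estimates the MSE of the unipolar AND-gate multiplier of Section II by Monte Carlo simulation. The 256-bit stream length (matching the 8-bit format), the number of trials, and the uniform operand distribution are assumptions; the result lands in the same order of magnitude as the unipolar multiplier entries of Table III, with the exact value depending on the random source used.

    import random

    TRIALS = 2000
    LENGTH = 256  # 2^8 bits, matching the 8-bit SC format

    def unipolar(p, length=LENGTH):
        # Behavioral SNG: bit probability p, a PRNG instead of an LFSR.
        return [1 if random.random() < p else 0 for _ in range(length)]

    # Eq. 3 applied to the AND-gate multiplier: average the squared
    # error between the decoded SC product and the exact product.
    mse = 0.0
    for _ in range(TRIALS):
        pa, pb = random.random(), random.random()
        y = [u & v for u, v in zip(unipolar(pa), unipolar(pb))]
        mse += (pa * pb - sum(y) / LENGTH) ** 2
    print(mse / TRIALS)  # on the order of 1e-4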
Table III compares the MSE of the SC multiplier and the SC adder between our implementation and the design in [13]. We used an 8-bit fixed point representation in both SC's unipolar and bipolar formats.

Table III: Comparison of the SC adder's and SC multiplier's MSE with other works

                  Design in [13]    Proposed design
    SC format     8-bit unipolar    8-bit unipolar    8-bit bipolar
    Multiplier    2.57 × 10^−4      7.43 × 10^−6      1.98 × 10^−4
    Adder         1.91 × 10^−6      1.89 × 10^−6      1.32 × 10^−5

The SC neuron's output MSE is comparable to that of the binary radix implementation. However, the SC technique results in longer latency.

D. Results of FPGA implementations

We have implemented the proposed binary radix design on a Xilinx Artix-7 FPGA platform. The number of hidden-layer neurons chosen is 48, with the ReLU activation function and a 10-bit fixed point representation. The results are summarized in Table V and Table VI.

Table V: Performance report of the FPGA implementation

    Performance metric    Result
    Frequency             158 MHz
    Clock cycles          4888
    Execution time        30.93 μs
    Total power           0.202 W

The maximum frequency is 158 MHz. The execution time for one classification is 30.93 μs. The total power
consumption is 0.202 W. The area cost is also small compared to the available FPGA resources, with 1659 LUT slices and 1020 register slices.

Table VI: Hardware resource utilization report of the FPGA implementation

    Cost        Design    FPGA resources    Utilization
    LUTs        1659      133800            1.24%
    Register    1020      267600            0.38%
    Mux         0         100350            0%
    Slice       578       33450             1.72%
    DSP         0         740               0%
    BRAM        0.5       365               0.14%
    IO          8         400               2.0%

We have also implemented the SC neuron block on this FPGA platform. Table VII summarizes the area costs of the SC neuron block and the binary neuron block.
Table VII: FPGA implementation results of the binary neuron block and the SC neuron block

                         Binary neuron block    SC neuron block
    Parallel inputs      16                     16
    Frequency            250 MHz                286 MHz
    LUTs                 1268                   416
    Register             23                     299
    IO                   277                    277
    Power consumption    0.045 W                0.039 W

The SC neuron block's implementation results show that, with the SC technique applied, the neuron block achieves a higher operating frequency, uses fewer LUT slices (a 67% reduction), and consumes less power. However, the SC design uses more register slices due to the need for a large number of SNGs.
V. CONCLUSION

ANN has been the major driving force behind the development of many applications. However, its high computational complexity has made it unsuitable for many embedded applications. In this work, we introduced an efficient hardware implementation of ANN, with the SC technique applied to reduce the area cost and power consumption of the design. The chosen ANN model is a feed-forward multi-layer perceptron for a digit recognition application. The binary radix implementation of the design shows results comparable to the software implementation, with up to 92.18% accuracy. With the SC technique applied, the neuron block has lower power consumption and uses fewer LUT slices.

REFERENCES

[1] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA: IEEE Computer Society, 2015, pp. 2722-2730.
[2] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, pp. 115-117, Jan 2017.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484-485, Jan 2016.
[4] E. M. Izhikevich, "Simple model of spiking neurons," IEEE Transactions on Neural Networks, vol. 14, no. 6, pp. 1569-1572, Nov 2003.
[5] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in 2009 International Conference on Field Programmable Logic and Applications, Aug 2009, pp. 32-37.
[6] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), New York, NY, USA: ACM, 2016, pp. 26-35.
[7] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A runtime reconfigurable dataflow processor for vision," in CVPR 2011 Workshops, June 2011, pp. 109-116.
[8] B. R. Gaines, Stochastic Computing Systems. Boston, MA: Springer US, 1969, pp. 37-172.
[9] A. Alaghi, C. Li, and J. P. Hayes, "Stochastic circuits for real-time image-processing applications," in 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), May 2013, pp. 1-6.
[10] W. J. Gross, V. C. Gaudet, and A. Milner, "Stochastic implementation of LDPC decoders," in Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, Oct 2005, pp. 713-717.
[11] B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst, "Energy-efficient ConvNets through approximate computing," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), March 2016, pp. 1-8.
[12] B. R. Gaines, "Stochastic computing," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 Spring), New York, NY, USA: ACM, 1967, pp. 149-156.
[13] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2017, pp. 13-18.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS '12), USA: Curran Associates Inc., 2012, pp. 1097-1105.
[15] "The MNIST database of handwritten digits." [Online]. Available: http://yann.lecun.com/exdb/mnist/
[16] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.