An Efficient Hardware Implementation of Artificial Neural Network Based On Stochastic Computing
Abstract—Recently, the Artificial Neural Network (ANN) has emerged as the main driving force behind the rapid development of many applications. Although ANN provides high computing capabilities, its prohibitive computational complexity, together with the large area footprint of ANN hardware implementations, has made it unsuitable for embedded applications with real-time constraints. Stochastic Computing (SC), an unconventional computing technique which can offer low-power and area-efficient hardware implementations, has shown promising results when applied to ANN hardware circuits. In this paper, efficient hardware implementations of ANN with both conventional binary radix computation and the SC technique are proposed. The system's performance is benchmarked with a handwritten digit recognition application. Simulation results show that, on the MNIST dataset, the 10-bit binary implementation of the system incurs an accuracy loss of only 0.44% compared to the software simulations. Preliminary simulation results of the SC neuron block show that its output is comparable to the binary radix results. The FPGA implementation of the SC neuron block shows a 67% reduction in the number of LUT slices.

Index Terms—Artificial Neural Network, Stochastic Computing, MNIST dataset

I. INTRODUCTION

The Artificial Neural Network (ANN) is currently the main driving force behind the development of many modern Artificial Intelligence (AI) applications. Popular variants of ANN such as the Deep Neural Network (DNN) and the Convolutional Neural Network (CNN) have made major breakthroughs in self-driving cars [1], cancer detection [2], and playing the game of Go [3]. The Spiking Neural Network (SNN) model, the third generation of ANN, has been extensively used to simulate the human brain and to study its behavior [4]. Despite its state-of-the-art performance, ANN requires a very high level of computational complexity. This has made ANN unsuitable for many embedded image and video processing applications, e.g. smart surveillance camera systems, portable voice recognition systems, etc. Such applications impose real-time processing and low-power consumption constraints. To enable real-time, in-field processing of ANN on such platforms, developing dedicated hardware architectures has attracted growing interest [5][6][7]. Dedicated platforms have two major advantages. The first is high computational speed, due to their innate parallel processing capabilities. The second is low power and low area cost, due to their highly optimized architectures. In this paper, a hardware architecture for efficient processing of ANN is proposed, with the major goals of achieving low area cost and moderate power consumption.

To further optimize the area cost and power consumption of the architecture, the Stochastic Computing (SC) technique is also introduced. SC is an unconventional computing technique first proposed in the 1960s by Gaines [8]. SC allows the implementation of complex arithmetic operations with simple logic elements; hence it can greatly reduce the area cost of hardware architectures. Recently, SC has been successfully applied to many applications, such as edge-detection circuits [9] and LDPC decoder circuits [10]. However, SC suffers from low computational accuracy due to the random nature of computation in the SC domain [8]. On the other hand, ANN is inherently fault-tolerant and can show good results even with very low precision computing [11]. Therefore, combining a dedicated ANN hardware architecture with the SC technique is a solution to reduce the area cost and power consumption of the system.
The two main contributions of the paper are as follows. Firstly, a dedicated hardware architecture for efficient processing of ANN is proposed. Secondly, the SC technique is introduced to the existing platform to reduce the area cost. Simulation results show that, on the MNIST dataset, the 10-bit fixed point implementation of the platform incurs an accuracy loss of only 0.44% compared to the software simulation results. Preliminary results of the system with the SC technique applied show that its output is comparable to the base system. The FPGA implementation of the SC neuron block also shows a large reduction of 67% in the number of LUT slices used.

The outline of the paper is as follows. Section II presents the basic concepts of SC and its computational elements. Section III covers the proposed hardware architecture of the system in detail. Section IV reports the simulation and implementation results on an FPGA platform. Finally, Section V concludes the paper.
II. STOCHASTIC COMPUTING

A. Basic concept

Stochastic Computing is an unconventional computing method based on the probability of bits in a pseudo-random bit stream. Numbers in the SC domain are represented as pseudo-random bit streams, called Stochastic Numbers (SNs), which can be processed by very simple logic circuits. The value of each stochastic number is determined by the probabilities of observing bits 0 and 1 in the bit stream. For example, let a pseudo-random bit stream S of length N denote the stochastic number X, where S contains N_1 1's and N − N_1 0's. In the unipolar format, the probability p_x of X lies in the range [0; 1] and is given as:

    p_x = N_1 / N    (1)

The stochastic representation of a number is not unique. For each number N_1/N, there are C(N, N_1) (binomial coefficient) representations. For example, with N = 6, there are 6 ways to represent the number 5/6: {111110, 111101, 111011, 110111, 101111, 011111}. With a given bit stream length N, only a small subset of the real numbers in [0; 1] can be expressed exactly in SC.

To express real numbers in different intervals, several other SN formats have been proposed [12]. One popular variant is the bipolar format. A stochastic number X in unipolar format with value p_x ∈ [0; 1] can be mapped to the range [−1; 1] via the mapping function:

    p_y = 2·p_x − 1 = (N_1 − N_0) / N    (2)

B. SC computational elements

Stochastic Computing uses a fundamentally different computing paradigm from the traditional method and requires some basic computational elements, whose details are given in this section.

1) Stochastic Number converters: Circuits that act as interfaces between binary numbers and stochastic numbers are fundamental elements of SC. A circuit which converts a binary number to the SC format is called a Stochastic Number Generator (SNG). Conversely, a circuit that converts an SC number to the binary number format is called a Stochastic Number Decoder. Fig. 1 shows such circuits.

Figure 1: Circuits to convert between stochastic numbers and binary numbers: (a) Stochastic Number Generator, (b) Stochastic Number Decoder.

An SNG consists of a Linear Feedback Shift Register (LFSR) and a comparator. A k-bit pseudo-random binary number is generated in each clock cycle by the LFSR and compared to the k-bit input binary number b. The comparator produces '1' if the random number is less than b and '0' otherwise. Assuming that the random numbers are uniformly distributed over the interval [0; 1], the probability of a '1' appearing at the output of the comparator in each clock cycle is p_x = b/2^k. A stochastic number decoder, on the other hand, consists of a binary counter: it simply counts the number of '1' bits in the input stochastic number X.

2) Multiplication in SC: Multiplication of two stochastic bit streams is performed using AND and XNOR gates in the unipolar and bipolar formats, respectively. Figs. 2a and 2b demonstrate examples of multiplication in the SC domain.

3) Addition in SC: Addition in SC is usually performed using either scaled adders or OR gates. Figs. 2c and 2d show examples of addition in SC using either a MUX or an OR gate. Addition in SC can also be performed with a stochastic adder that is more accurate and does not require additional random inputs like the scaled adder does. This adder was proposed by V. T. Lee [13] and is illustrated in Fig. 3.
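To make these elements concrete, the following minimal Python sketch models an SNG, a decoder, the unipolar AND multiplier, and the MUX-based scaled adder. It is a behavioral sketch only: the hardware LFSR is replaced by Python's built-in pseudo-random generator, and the 1024-bit stream length is an arbitrary choice.

    import random

    LENGTH = 1024  # stream length; an arbitrary choice for this sketch

    def sng(b, k=8, length=LENGTH):
        # Stochastic Number Generator: each cycle, compare the k-bit
        # binary input b against a pseudo-random number. A real SNG
        # uses an LFSR; here the random source is Python's PRNG.
        return [1 if random.getrandbits(k) < b else 0 for _ in range(length)]

    def decode(stream):
        # Stochastic Number Decoder: a binary counter of the '1' bits.
        return sum(stream) / len(stream)

    a = sng(192)   # p_a = 192/256 = 0.75
    b = sng(64)    # p_b = 64/256  = 0.25

    # Unipolar multiplication with an AND gate: p_y = p_a * p_b.
    mult = [x & y for x, y in zip(a, b)]

    # Scaled addition with a MUX: a select stream of probability 1/2
    # yields p_y = (p_a + p_b) / 2.
    sel = sng(128)  # p_sel = 128/256 = 0.5
    add = [x if s else y for x, y, s in zip(a, b, sel)]

    print(decode(mult))  # close to 0.1875
    print(decode(add))   # close to 0.5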
Figure 3: Example of the stochastic adder of [13], adding the input streams A = 1010_1111 (6/8) and B = 1000_0010 (2/8) to produce Y = 1010_1010 (4/8), i.e. the scaled sum (6/8 + 2/8)/2.

Figure 4: The proposed architecture, with a Controller driving the Start, Restart, and Finish signals.
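The value in the Fig. 3 example can be reproduced behaviorally with a toggle flip-flop (TFF) construction, a common way to build a deterministic scaled adder that needs no random select input; we do not claim this is the exact circuit of [13], and the output bit pattern may differ from the figure, but the decoded value matches.

    def tff_scaled_add(a, b):
        # Deterministic scaled adder: equal input bits pass through;
        # unequal bits alternate via a toggle flip-flop, so the output
        # stream decodes to (p_a + p_b) / 2 with no random select input.
        t, out = 0, []
        for x, y in zip(a, b):
            if x == y:
                out.append(x)
            else:
                out.append(t)
                t = 1 - t  # toggle
        return out

    A = [1, 0, 1, 0, 1, 1, 1, 1]  # 6/8
    B = [1, 0, 0, 0, 0, 0, 1, 0]  # 2/8
    Y = tff_scaled_add(A, B)
    print(sum(Y) / len(Y))  # 0.5, i.e. 4/8 as in the Fig. 3 example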
between x and M. The output is loaded to the Sum register. The Activation function block implements the activation function of each neuron. In this work, two of the most popular activation functions are implemented: the Sigmoid function and the ReLU function [14].

C. The SC neuron block

The binary neuron block uses many parallel multipliers and adders, which are costly in terms of area. On the other hand, multiplication and addition in SC can be implemented with area-efficient basic components. Hence, a novel neuron block based on the SC technique is proposed in this section to reduce the area footprint. The proposed SC neuron block still maximizes parallelism with M inputs. Because our application requires negative weights and biases, SC with the bipolar format representation is chosen in this work. The architecture of the proposed SC neuron block is given in Fig. 6.

Figure 6: The architecture of the proposed SC neuron block (SNGs for each input x_i and weight w_i, a counter with a clear signal accumulating the products, addition of the bias b, and the Activation function block producing the output).
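A behavioral sketch of this block is given below; it is not a cycle-accurate model of Fig. 6. Each input/weight pair is converted to bipolar streams by SNGs and multiplied with an XNOR gate, the products are accumulated, and the bias and activation are applied in the binary domain. The stream length and the example inputs are arbitrary choices, and the per-product decoding is a software simplification of the counter-based accumulation in the hardware.

    import random

    LENGTH = 4096  # stream length; longer streams reduce random error

    def encode_bipolar(v, length=LENGTH):
        # Bipolar SNG model: a value v in [-1, 1] is encoded as a
        # stream with bit probability p = (v + 1) / 2 (Eq. 2 inverted).
        p = (v + 1) / 2
        return [1 if random.random() < p else 0 for _ in range(length)]

    def decode_bipolar(stream):
        # Eq. 2: v = 2*p - 1, with p estimated by a counter.
        return 2 * sum(stream) / len(stream) - 1

    def sc_neuron(xs, ws, bias, act):
        # One XNOR multiplier per (x_i, w_i) pair; each product stream
        # is decoded and summed in software, whereas the hardware
        # accumulates all product bits with a counter.
        total = 0.0
        for x, w in zip(xs, ws):
            sx, sw = encode_bipolar(x), encode_bipolar(w)
            prod = [1 - (p ^ q) for p, q in zip(sx, sw)]  # bipolar multiply
            total += decode_bipolar(prod)                 # ~ x * w
        return act(total + bias)

    def relu(v):
        return max(0.0, v)

    # 0.5*0.1 - 0.25*0.6 + 0.8*(-0.5) + 0.05 = -0.45 -> ReLU -> 0.0
    print(sc_neuron([0.5, -0.25, 0.8], [0.1, 0.6, -0.5], 0.05, relu))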
One of the most popular data sets for the handwritten digit recognition task is the MNIST dataset [15]. This dataset contains 60,000 images for training and 10,000 images for testing. The size of each image is 28 × 28.

The software implementation of the ANN is realized with Caffe [16], a popular open-source platform for machine learning tasks. The training phase of the ANN is performed off-line; the trained connection weights and biases are then used in the inference phase of our hardware system. The ANN model chosen for our application is a multi-layer perceptron with a single hidden layer. Since each input image has a size of 28 × 28, there are 784 neurons in the input layer. The classification digits range from 0 to 9, hence there are 10 neurons in the output layer. In this work, we also explore the effect of varying the number of neurons in the hidden layer on classification accuracy. Table I summarizes the classification results on the Caffe platform; N is the number of neurons in the hidden layer.

Table I: Classification accuracies with Caffe

The chosen activation functions are the Sigmoid and ReLU functions. The classification results on the software platform serve as a golden model to benchmark our hardware implementation results.
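For reference, the inference phase described above, a 784-N-10 multi-layer perceptron, amounts to the following sketch. The weights and biases here are random stand-ins for the parameters trained off-line with Caffe, and N = 48 is one of the explored hidden-layer sizes.

    import numpy as np

    N = 48  # hidden-layer neurons; one of the explored sizes (16/48/64)

    rng = np.random.default_rng(0)
    # Stand-ins for the connection weights and biases trained with Caffe.
    W1, b1 = rng.standard_normal((N, 784)) * 0.01, np.zeros(N)
    W2, b2 = rng.standard_normal((10, N)) * 0.01, np.zeros(10)

    def relu(v):
        return np.maximum(v, 0.0)

    def infer(image):
        # image: flattened 28x28 input, 784 values in [0, 1].
        hidden = relu(W1 @ image + b1)   # single hidden layer
        scores = W2 @ hidden + b2        # 10 output neurons, digits 0-9
        return int(np.argmax(scores))    # predicted digit

    print(infer(rng.random(784)))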
Table II: Comparison of classification accuracies between the proposed design and Caffe's implementation

    N                                 16        48        64
    Sigmoid  Proposed design          87.60%    86.63%    85.94%
             Software simulation      88.04%    89.01%    89.10%
             Accuracy loss            0.44%     2.38%     3.16%

This comes from the fact that our design uses a 10-bit fixed point representation while the Caffe platform uses a 32-bit floating point representation. Another observation is that the hardware implementation of the Sigmoid incurs a greater accuracy loss compared to the ReLU. This is because, in the proposed design, the Sigmoid function is approximated with a Look-Up Table (LUT).
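As an illustration of this approximation, a minimal sketch of a LUT-based Sigmoid is given below. The table size (256 entries) and the clamped input range ([-8, 8]) are assumptions of the sketch; the paper does not specify these parameters.

    import math

    LUT_SIZE = 256               # assumed number of LUT entries
    IN_MIN, IN_MAX = -8.0, 8.0   # assumed clamping range of the input

    # Precompute sigmoid samples over the clamped input range.
    STEP = (IN_MAX - IN_MIN) / (LUT_SIZE - 1)
    LUT = [1.0 / (1.0 + math.exp(-(IN_MIN + i * STEP)))
           for i in range(LUT_SIZE)]

    def sigmoid_lut(x):
        # Clamp the input, quantize it to a table index, and look up
        # the stored value; this quantization is the source of the
        # extra accuracy loss discussed above.
        x = max(IN_MIN, min(IN_MAX, x))
        return LUT[round((x - IN_MIN) / STEP)]

    print(sigmoid_lut(0.0))   # ~0.5
    print(sigmoid_lut(2.0))   # ~0.88, vs. the exact 0.8808...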
C. Results of SC neuron block implementations

To evaluate our SC neuron block implementation, the Mean Square Error (MSE) between the outputs of each block with the SC technique and the outputs of SC's software implementation is used. The MSE is calculated as in [13]:

    MSE = (1/N) · Σ_{i=0}^{N−1} (O_floating_point,i − O_proposed,i)^2    (3)
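As an illustration of this metric, the sketch below estimates the MSE of the unipolar AND-gate multiplier of Section II by Monte Carlo simulation. The 256-bit stream length (matching the 8-bit format), the number of trials, and the uniform operand distribution are assumptions; the result lands in the same order of magnitude as the unipolar multiplier entries of Table III, with the exact value depending on the random source used.

    import random

    TRIALS = 2000
    LENGTH = 256  # 2^8 bits, matching the 8-bit SC format

    def unipolar(p, length=LENGTH):
        # Behavioral SNG: bit probability p, a PRNG instead of an LFSR.
        return [1 if random.random() < p else 0 for _ in range(length)]

    # Eq. 3 applied to the AND-gate multiplier: average the squared
    # error between the decoded SC product and the exact product.
    mse = 0.0
    for _ in range(TRIALS):
        pa, pb = random.random(), random.random()
        y = [u & v for u, v in zip(unipolar(pa), unipolar(pb))]
        mse += (pa * pb - sum(y) / LENGTH) ** 2
    print(mse / TRIALS)  # on the order of 1e-4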
Table III compares the MSE of the SC multiplier and the SC adder between our implementation and the design in [13]. We used an 8-bit fixed point representation in both SC's unipolar and bipolar formats.

Table III: Comparison of the SC adder's and SC multiplier's MSE with other works

                  Design in [13]    Proposed design
    SC format     8-bit unipolar    8-bit unipolar    8-bit bipolar
    Multiplier    2.57 × 10^−4      7.43 × 10^−6      1.98 × 10^−4
    Adder         1.91 × 10^−6      1.89 × 10^−6      1.32 × 10^−5

The SC neuron's output MSE is comparable to that of the binary radix implementation. However, the SC technique results in longer latency.

D. Results of FPGA implementations

We have implemented the proposed binary radix design on a Xilinx Artix-7 FPGA platform. The number of hidden-layer neurons chosen is 48, with the ReLU activation function and a 10-bit fixed point representation. The results are summarized in Table V and Table VI.

Table V: Performance report of the FPGA implementation

    Performance metric    Result
    Frequency             158 MHz
    Clock cycles          4888
    Execution time        30.93 μs
    Total power           0.202 W

The maximum frequency is 158 MHz. The execution time for one classification is 30.93 μs. The total power
consumption is 0.202 W. The area cost is also small compared to the available FPGA resources, with 1659 LUT slices and 1020 register slices.

Table VI: Hardware resource utilization report of the FPGA implementation

    Cost        Design    FPGA resources    Utilization
    LUTs        1659      133800            1.24%
    Register    1020      267600            0.38%
    Mux         0         100350            0%
    Slice       578       33450             1.72%
    DSP         0         740               0%
    BRAM        0.5       365               0.14%
    IO          8         400               2.0%

We have also implemented the SC neuron block on this FPGA platform. Table VII summarizes the area costs of the SC neuron block and the binary neuron block.
Table VII: FPGA implementation results of the binary neuron block and the SC neuron block

                         Binary neuron block    SC neuron block
    Parallel inputs      16                     16
    Frequency            250 MHz                286 MHz
    LUTs                 1268                   416
    Register             23                     299
    IO                   277                    277
    Power consumption    0.045 W                0.039 W

The SC neuron block's implementation results show that, with the SC technique applied, the neuron block achieves a higher operating frequency, uses fewer LUT slices (a 67% reduction), and consumes less power. However, the SC design uses more register slices due to the need for a large number of SNGs.
V. CONCLUSION

ANN has been the major driving force behind the development of many applications. However, its high computational complexity has made it unsuitable for many embedded applications. In this work, we introduced an efficient hardware implementation of ANN, with the SC technique applied to reduce the area cost and power consumption of the design. The chosen ANN model is a feed-forward multi-layer perceptron for a digit recognition application. The binary radix implementation of the design shows results comparable to the software implementation, with up to 92.18% accuracy. With the SC technique applied, the neuron block has lower power consumption and uses fewer LUT slices.

REFERENCES

[1] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA: IEEE Computer Society, 2015, pp. 2722-2730.
[2] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, pp. 115-117, Jan 2017.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484-485, Jan 2016.
[4] E. M. Izhikevich, "Simple model of spiking neurons," IEEE Transactions on Neural Networks, vol. 14, no. 6, pp. 1569-1572, Nov 2003.
[5] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in 2009 International Conference on Field Programmable Logic and Applications, Aug 2009, pp. 32-37.
[6] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), New York, NY, USA: ACM, 2016, pp. 26-35.
[7] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A runtime reconfigurable dataflow processor for vision," in CVPR 2011 Workshops, June 2011, pp. 109-116.
[8] B. R. Gaines, Stochastic Computing Systems. Boston, MA: Springer US, 1969, pp. 37-172.
[9] A. Alaghi, C. Li, and J. P. Hayes, "Stochastic circuits for real-time image-processing applications," in 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), May 2013, pp. 1-6.
[10] W. J. Gross, V. C. Gaudet, and A. Milner, "Stochastic implementation of LDPC decoders," in Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, Oct 2005, pp. 713-717.
[11] B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst, "Energy-efficient ConvNets through approximate computing," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), March 2016, pp. 1-8.
[12] B. R. Gaines, "Stochastic computing," in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 Spring), New York, NY, USA: ACM, 1967, pp. 149-156.
[13] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2017, pp. 13-18.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS '12), USA: Curran Associates Inc., 2012, pp. 1097-1105.
[15] "The MNIST database of handwritten digits." [Online]. Available: http://yann.lecun.com/exdb/mnist/
[16] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.