
Complex & Intelligent Systems (2023) 9:2209–2219
https://doi.org/10.1007/s40747-020-00192-x

ORIGINAL ARTICLE

A hybrid deep learning-based fruit classification using attention model and convolution autoencoder

Gang Xue¹ · Shifeng Liu¹ · Yicao Ma¹

¹ School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China
Corresponding author: Gang Xue, [email protected]

Received: 1 July 2020 / Accepted: 18 August 2020 / Published online: 12 October 2020
© The Author(s) 2020

Abstract
Image recognition supports several applications, such as facial recognition and image classification, and achieving accurate fruit and vegetable classification is very important in the fresh supply chain, factories, supermarkets, and other fields. In this paper, we develop a hybrid deep learning-based fruit image classification framework, named attention-based densely connected convolutional networks with convolution autoencoder (CAE-ADN), which uses a convolution autoencoder to pre-train on the images and an attention-based DenseNet to extract image features. In the first part of the framework, an unsupervised method with a set of images is applied to pre-train the greedy layer-wise CAE. We use the CAE structure to initialize a set of weights and biases of the ADN. In the second part of the framework, the supervised ADN with the ground truth is implemented. The final part of the framework predicts the category of fruits. We use two fruit datasets to test the effectiveness of the model. Experimental results show the effectiveness of the framework, which can improve the efficiency of fruit sorting and thereby reduce costs in fresh supply chains, factories, supermarkets, etc.

Keywords Fruit classification · DenseNet · CBAM · Convolutional neural networks

Introduction

Nowadays, image classification is popular in many fields and plays an important role. Image recognition supports several applications, for instance, facial recognition, image classification, video analysis, and so on. Deep learning technology has been a core topic in machine learning, with outstanding results in image identification [1]. Deep learning uses a multilayer structure to process image features, which greatly enhances the performance of image identification [2]. Image recognition and deep learning are developing fast, and more and more fields benefit from them. Fresh supply chains, factories, and supermarkets are among the fields relying on image recognition and deep learning to develop further. In other words, the application of image recognition and deep learning in the logistics and supply chain field has become a trend. For example, image recognition can help guide the path of logistics and transportation, solving the problem that automatic transportation vehicles make mistakes because of large path identification errors [3]. Another example is fruit and vegetable classification: deep learning can extract image features effectively and then perform the classification.

In the past, fruit picking and processing were based on manual methods, resulting in a large waste of labor [4]. Recently, researchers have tended to apply near-infrared imaging, gas sensors, and high-performance liquid chromatography devices to scan the fruit. Nevertheless, these methods need expensive devices (different types of sensors) and professional operators, and their overall accuracies are commonly lower than 85% [4]. With the fast development of 4G communication and the widespread popularity of mobile Internet devices, people have generated a great quantity of images, sounds, videos, and other information, and image recognition technology has gradually matured. Image-based fruit identification has attracted the attention of researchers because of its cheap devices and excellent performance [4]. The intelligent identification of fruit can be used not only in the early picking stage, but also in the later picking and processing stages. Fruit recognition technology based on deep learning can significantly improve the performance of fruit recognition and has a positive effect on promoting the development of smart agriculture. Compared with the combination of artificial features and traditional machine learning, deep learning can automatically extract features and has better performance, so it has gradually become the mainstream method of smart identification [5]. For instance, Rocha et al. [6] presented a unified method that could combine features and classifiers. Tu et al. [7] developed a machine vision method to detect passion fruit and identify maturity using RGB-D images. Fruit and vegetable classification is challenging because it is hard to give each kind of fruit an adequate definition. However, achieving accurate fruit and vegetable classification is very important in the fresh supply chain for several reasons. First, automatic fruit and vegetable classification decreases labor cost, because factories no longer need workers for this classification task; the saved cost can be invested in other aspects. Second, accurate classification contributes to automatic fruit packing and transportation in factories [5]. In these fields, packing and transportation are two core parts; if any errors happen there, they negatively affect the following processes, for instance by delaying the time at which customers receive the fruits and vegetables. Third, accurate classification saves time and brings higher efficiency: after sufficient training, automatic classification can reach a high standard, which guarantees the accuracy of the classification results. In addition, unlike humans, an automatic classifier retains what it has learned and can repeat the classification process continuously, saving much time and greatly enhancing efficiency [8].

The development of attention mechanisms and autoencoders is constantly improving the performance of deep learning. The attention mechanism in a deep learning model simulates the attention of the human brain [8]. Attention mechanisms were first presented in the field of image research and were then used in wider fields [9]. When people observe images, they only focus their attention on important parts instead of looking at every pixel of the image. As more and more people use attention mechanisms, more cutting-edge models have been created and used in various studies. There are four categories: number of sequences, number of abstraction levels, number of positions, and number of representations [10]. Each category corresponds to different types: the number of sequences covers distinctive, co-attention, and self-attention types; the number of abstraction levels covers single-level and multi-level types; the number of positions covers soft, hard, and local types; and the number of representations covers multi-representational and multi-dimensional types. As for the autoencoder, it is popularly used as an effective feature extraction method. An autoencoder (AE) is a special artificial neural network used in semi-supervised and unsupervised learning, composed of an encoder and a decoder [11]. Its function is to perform representation learning on the input information, using the input information itself as the learning target [12].

There are two main innovations and contributions in this paper: (1) earlier scholars used manual features decided by experts, and when there are many fruit subspecies, it is very difficult to identify a specific fruit subspecies using artificial feature extraction; we construct a model that can identify specific fruit subspecies accurately. (2) An attention module and a CAE are integrated into DenseNet, which refines the features of fruit images and improves the interpretability of the method.

In this paper, we develop a hybrid deep learning-based fruit recognition method, named attention-based densely connected convolutional networks with convolution autoencoder (CAE-ADN), which uses a convolution autoencoder to pre-train on the images and an attention-based DenseNet to process the images. Experimental results illustrate the effectiveness of the method, and the model can improve the efficiency of related work.

The rest of this paper is organized as follows. Section 2 reviews the related work on fruit classification. Section 3 presents the detailed methodology. Section 4 describes the experimental settings and results, including a comparison with other studies. Finally, Sect. 5 concludes this work.

Related work

With the rapid development of machine vision technology, automatic sorting methods using machine vision have been adopted in production and processing fields. To resolve the low efficiency and accuracy of traditional sorting methods under huge fruit output, machine vision and deep learning technology can assuredly promote the efficiency and accuracy of sorting. However, under actual conditions, the image will be affected by light, fruit reflection, and occlusion. For example, the different shapes and colors of fruits make them difficult to identify and locate under varying conditions (such as different lighting and noise). In addition, because the color and texture features of a fruit image are related to the growth period, this further increases the complexity of fruit recognition.

In previous studies, color, texture, and edge properties were considered to categorize fruits [13–15]. Garcia et al. [16] used artificial vision for fruit recognition and extracted shape, color chromaticity, and texture features. To distinguish broad types of images, low- and middle-level features have been applied [17–19]. When referring to product classification, the first attempt at a fruit recognition system in supermarkets, which considered texture, color, and density, must be mentioned [20]. Its accuracy could reach nearly 95% in some situations. Later, Jurie et al. and Agarwal et al. [21, 22] broke the classification problem down into the recognition of different parts, that is, the features of each object class. These techniques were called bag-of-features, and they presented promising results, even though they did not model spatial constraints between features [23, 24].

Besides color, texture, and edge properties, many different methods have been used in fruit and vegetable classification. For example, scholars use gas sensors, near-infrared devices, and high-performance liquid chromatography devices to scan the fruit [25–27]. Fei-Fei et al. [28] introduced prior knowledge into the estimation of the distribution, thus reducing the number of training examples to around ten images while preserving a good recognition rate. Even with this improvement, the problem of exponential growth with the number of parts persists, which makes it impractical for the problem presented in this paper, which requires speed for on-line operation. Moreover, these methods, despite their expensive devices, do not bring good results, because their accuracies are lower than 85% [4]. To solve this problem, later scholars began to pay attention to image-based fruit classification for its cheap devices and excellent performance. Wang et al. [29] applied a backpropagation neural network (BPNN) and used fractional Fourier entropy (FRFE) as the features. Lu et al. [30] proposed an improved hybrid genetic algorithm (IHGA) to take the place of the BPNN. Zhang et al. [31] presented a novel method called biogeography-based optimization and feedforward neural network (BBO-FNN), which achieved higher accuracy. Zhang et al. [29] created a categorization model using the fitness-scaled chaotic artificial bee colony (FSCABC) method to replace kSVM.

According to past studies, two main problems must be solved to promote the accuracy of fruit recognition. One limitation is that some scholars used manual features decided by experts; when there are many fruit subspecies, it is very difficult to identify specific fruit subspecies using artificial feature extraction. Another limitation is that the accuracy of existing models is not enough to support the recognition of dozens of fruits in related fields [32, 33].

Method

Model designing

To explore all fruit features contained in the image, we use an attention module to force the networks to learn high-level features. Attention not only indicates where to focus, but also enhances the representation of features. Woo et al. [34] proposed an effective attention module, the convolutional block attention module (CBAM), which can be widely used to boost the representation power of convolutional neural networks (CNNs). In addition, considering the complexity of fruit features, it is difficult for traditional CNNs to train on these images; densely connected convolutional networks (DenseNet) [35] are more suitable for the proposed problem due to their improved feature delivery. The problem of gradient vanishing is resolved by providing a direct path to all preceding layers to route residuals during backpropagation. We propose an attention-based DenseNet to train on the images.

An autoencoder (AE) is a special artificial neural network applied to unsupervised learning and efficient coding [36]. Hinton and the PDP Group [37] first proposed AEs in the 1980s to solve the problem of "backpropagation without teachers" by using the training dataset itself as the teacher [38]. Nowadays, AEs are widely applied in the field of learning generative models [39], and the convolutional autoencoder (CAE) also performs well in image identification [40].

In this work, an attention-based densely connected convolutional network with convolution autoencoder (CAE-ADN) framework is developed, which uses a convolution autoencoder to pre-train the attention-based densely connected convolutional networks. The whole framework of CAE-ADN is illustrated in Fig. 1. Details are as follows: in the first part of the framework, an unsupervised method with a set of images is applied to pre-train the greedy layer-wise CAE. We use the CAE structure to initialize a set of weights and biases of the ADN. In the second part of the framework, the supervised ADN with the ground truth is implemented. The final part of the framework predicts the category of fruits.

Fig. 1 The structure of the CAE-ADN

Attention-based DenseNet

Attention mechanism

CBAM stands for the convolutional block attention module. It is a kind of attention mechanism module that combines spatial and channel attention. Compared with a channel-only attention mechanism, it can achieve better results. An intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$ is given as the input of the CBAM, and a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$ are computed consecutively, as shown in Fig. 2. The attention structure is summarized as:

$$F' = M_c(F) \otimes F, \qquad (1)$$

$$F'' = M_s(F') \otimes F', \qquad (2)$$

where $\otimes$ represents element-wise multiplication and $F''$ is the final refined output of CBAM. Details of each attention module are described as follows.

Fig. 2 Structure of CBAM

First, we aggregate the spatial information of a feature map by utilizing average-pooling and max-pooling layers, generating two different channel context descriptors: $F^c_{avg}$ denotes average-pooled features and $F^c_{max}$ denotes max-pooled features. Both $F^c_{avg}$ and $F^c_{max}$ are then fed into a shared network to compute the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. To avoid oversized parameters, the hidden activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where $r$ denotes the reduction ratio. In a nutshell, we compute the channel attention as:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big), \qquad (3)$$

where $\sigma$ represents the sigmoid function, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ denote the weights of the MLP; $W_0$ and $W_1$ are shared for both inputs, and the ReLU activation function follows $W_0$.

It has been shown that using pooling operations along the channel axis can highlight informative regions effectively [41–43]. The channel information of a feature map is aggregated by applying two pooling layers, computing two 2D maps: $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ represents average-pooled features across the channel and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$ represents max-pooled features across the channel. On the concatenated feature descriptor, a convolution layer is applied to compute a spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$, which encodes where to focus. In a nutshell, we compute the spatial attention as [34]:

$$M_s(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times7}([F^s_{avg}; F^s_{max}])\big), \qquad (4)$$

where $\sigma$ represents the sigmoid function and $f^{7\times7}$ denotes a convolution operation with a filter size of 7 × 7.
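To make Eqs. (1)–(4) concrete, the sketch below implements CBAM as a Keras layer. This is a minimal illustration written against TensorFlow (which the authors report using), not the authors' released code; the class and variable names are our own, and it assumes channels-last (N, H, W, C) tensors rather than the C × H × W notation of the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

class CBAM(layers.Layer):
    """Channel attention (Eq. 3) followed by spatial attention (Eq. 4)."""

    def __init__(self, channels, reduction=16, **kwargs):
        super().__init__(**kwargs)
        hidden = max(channels // reduction, 1)        # C/r hidden activations
        # Shared MLP (W0 with ReLU, then W1) applied to both pooled descriptors.
        self.w0 = layers.Dense(hidden, activation="relu")
        self.w1 = layers.Dense(channels)
        # f^{7x7}: a 7x7 convolution producing the spatial map M_s.
        self.spatial_conv = layers.Conv2D(1, kernel_size=7, padding="same",
                                          activation="sigmoid")

    def call(self, f):
        # Channel attention: M_c(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        f_avg = tf.reduce_mean(f, axis=[1, 2])        # F^c_avg, shape (B, C)
        f_max = tf.reduce_max(f, axis=[1, 2])         # F^c_max, shape (B, C)
        m_c = tf.sigmoid(self.w1(self.w0(f_avg)) + self.w1(self.w0(f_max)))
        f1 = f * m_c[:, None, None, :]                # Eq. (1): F' = M_c(F) (x) F

        # Spatial attention: pool across channels, concatenate, 7x7 conv + sigmoid.
        s_avg = tf.reduce_mean(f1, axis=-1, keepdims=True)   # F^s_avg, (B, H, W, 1)
        s_max = tf.reduce_max(f1, axis=-1, keepdims=True)    # F^s_max, (B, H, W, 1)
        m_s = self.spatial_conv(tf.concat([s_avg, s_max], axis=-1))
        return f1 * m_s                               # Eq. (2): F'' = M_s(F') (x) F'
```

Note that the refinement preserves the feature-map shape, so the block can be dropped in after any convolution without disturbing the surrounding architecture.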

Attention-based dense block

Huang et al. [35] introduced direct connections from any layer to all subsequent layers. To refine the features of each layer, we combine CBAM and DenseNet to pay more attention to the scale features of targets. Figure 3 illustrates the structure of the attention-based dense block. The $\ell$-th layer takes the feature maps of all preceding layers, $x_0, x_1, \ldots, x_{\ell-1}$, as input:

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]), \qquad (5)$$

where $[x_0, x_1, \ldots, x_{\ell-1}]$ denotes the concatenation of the feature maps generated in layers $0, \ldots, \ell-1$; the multiple inputs of $H_\ell(\cdot)$ in Eq. (5) are concatenated into a single tensor. Inspired by Huang et al. [35], $H_\ell(\cdot)$ is defined as a composite function of four sequential operations: batch normalization (BN) → rectified linear unit (ReLU) → 3 × 3 convolutional layer (Conv) → CBAM. The other settings of the attention-based dense block, including growth rate, bottleneck layers, compression, etc., are the same as in DenseNet.

Attention-based DenseNet

The input of the attention-based DenseNet is the image followed by a convolution layer with a 7 × 7 filter, represented by $D_0^1$:

$$D_i^q = \Psi\big(D_{i-1}^p, W_i^d\big), \qquad (6)$$

where $D_i^q$ denotes the $i$-th feature map of the attention-based DenseNet with channel size $q$, $\Psi$ denotes a set of operations (an attention-based dense block followed by a 1 × 1 convolution layer and a 2 × 2 average-pooling layer with stride 2), and $W_i^d$ denotes the set of parameters of the $i$-th stage of the attention-based DenseNet.
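A sketch of the attention-based dense block and the operation set $\Psi$ of Eq. (6) follows, reusing the CBAM layer above. The function names are our own; it is a minimal reconstruction of the composite function BN → ReLU → 3 × 3 Conv → CBAM stated in the text, not the published implementation.

```python
def attention_dense_block(x, num_layers, growth_rate=4):
    """Layer l applies H_l = BN -> ReLU -> 3x3 Conv -> CBAM to [x0, ..., x_{l-1}] (Eq. 5)."""
    for _ in range(num_layers):
        h = layers.BatchNormalization()(x)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)
        h = CBAM(growth_rate)(h)              # refine the newly produced feature maps
        x = layers.Concatenate()([x, h])      # dense connectivity: concatenate all inputs
    return x

def transition(x):
    """Psi of Eq. (6): 1x1 convolution, then 2x2 average pooling with stride 2."""
    x = layers.Conv2D(int(x.shape[-1]), 1)(x)
    return layers.AveragePooling2D(pool_size=2, strides=2)(x)
```

Because each layer concatenates its output onto its input, the channel count grows by the growth rate per layer, and the transition's 1 × 1 convolution keeps the width in check between blocks.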


Fig. 3 Structure of attention-based dense block

Stacked convolutional autoencoders

The AE is a well-known learning algorithm derived from the idea of sparse coding, and it has a huge advantage in data feature extraction. A traditional AE consists of an encoder and a decoder, using a backpropagation algorithm to find the optimal solution that makes the prediction equal to the ground truth. Traditional AEs ignore the 2D image structure. The convolutional autoencoder uses convolutional layers instead of fully connected layers; the principle is the same as that of an autoencoder. It downsamples the input to provide a smaller latent representation, forcing the autoencoder to learn a compressed version of the input [44, 45]. When processing high-dimensional data such as images, traditional AEs cause the network to generate many redundant parameters, especially for three-channel color images. Moreover, because the layer parameters of traditional AEs are global, traditional AEs cannot retain spatial locality, which slows down the network learning rate.

The structure of a CAE is shown in Fig. 4. It has N attention-based dense blocks, where consecutive blocks are connected by a convolution operation and a pooling operation, and there is a convolution operation after the input and before the output. The reconstruction target is the input image itself, so the structure can be pre-trained with a supervised (reconstruction) training method.

Fig. 4 The structure of the CAE section
and forces the autoencoder to learn a compressed version
of the symbol [44, 45]. When processing high-dimensional
data like images, the traditional AEs will cause the net- Experimental result
work to generate many redundant parameters, especially for
three-channel color images. And due to the network layer Datasets
parameters of traditional AEs which are global, so traditional
AEs cannot retain the spatial locality, and slow down the net- We use two fruit datasets to test the effectiveness of the
work learning rate. model. Mureşan et al. [45] collected a fruit dataset including
The structure of a CAE is shown in Fig. 4, which has N 26 labels (Fruit 26), they displayed a fruit with the original
attention-based dense blocks, two blocks are connected by background in the image, and after removing the background,
a convolution operation and a pooling operation; the output the image was downsampling to 100 × 100 pixels. The Fruit
is the same as input image, which can pre-train the structure 26 dataset included 124,212 fruit images spread across 26
using supervision training method, and there is a convolution labels. 85,260 images are for training and 38,952 images are
operation between the input or output. for testing. Figure 6 shows some samples of Fruit 26.
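A minimal sketch of the supervised stage, under the same assumptions as the CAE sketch above: the pre-trained encoder half is kept, a global-pooling and SoftMax head is added, and the logarithmic loss of Eq. (7) is minimized with Adam. The helper name and the encoder argument are hypothetical.

```python
def build_adn(pretrained_encoder, num_classes=26):
    """Keep the pre-trained encoder half, add global pooling and a SoftMax head."""
    inp = layers.Input(shape=(64, 64, 3))
    x = pretrained_encoder(inp)                   # Ceiling(N/2) dense blocks from the CAE
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

# Eq. (7): categorical cross-entropy is exactly -sum_i g_i log(p_i), minimized by Adam.
# adn = build_adn(encoder)
# adn.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
#             loss="categorical_crossentropy",
#             metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=5)])
```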


Experimental result

Datasets

We use two fruit datasets to test the effectiveness of the model. Mureşan et al. [45] collected a fruit dataset including 26 labels (Fruit 26). Each fruit was displayed against its original background, and after removing the background, the images were downsampled to 100 × 100 pixels. The Fruit 26 dataset includes 124,212 fruit images spread across 26 labels; 85,260 images are for training and 38,952 images are for testing. Figure 6 shows some samples of Fruit 26.

Fig. 6 Samples of Fruit 26

Hussain et al. [46] collected a fruit dataset including 15 labels (Fruit 15). It contains 15 different kinds of fruits in 44,406 images. All the images were captured on a clear background at a resolution of 320 × 258 pixels. Figure 7 shows some samples of Fruit 15.

Fig. 7 Samples of Fruit 15
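For reproduction, datasets organized with one subdirectory per label can be read with a standard Keras directory loader; a sketch follows. The paths are hypothetical, and the 100 × 100 size matches the Fruit 26 images described above.

```python
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fruit26/train",                 # hypothetical path, one subfolder per label
    image_size=(100, 100),           # Fruit 26 images are 100 x 100 pixels
    batch_size=10,
    label_mode="categorical",        # one-hot labels for the cross-entropy loss
)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "fruit26/test", image_size=(100, 100), batch_size=10,
    label_mode="categorical",
)
rescale = tf.keras.layers.Rescaling(1.0 / 255)    # scale pixels to [0, 1]
train_ds = train_ds.map(lambda x, y: (rescale(x), y))
test_ds = test_ds.map(lambda x, y: (rescale(x), y))
```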

Experimental implementation

The loss is defined as the cross-entropy loss between the predicted result and the ground truth:

$$L = -\sum_{i=1}^{M} g_i \log(p_i), \qquad (8)$$

where M represents the number of images in the training dataset, $g_i$ denotes the ground truth of image $i$, and $p_i$ is the output of the neural network after SoftMax.


Accuracy is defined as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (9)$$

where TP and FP represent true positives and false positives, and TN and FN represent true negatives and false negatives.

Considering the GPU memory, we set the batch size to 10. The learning rate is set to 0.0001 and reduced by half at epochs 50, 100, and 150, and the network is fine-tuned with a learning rate of 0.000001 for the last 10 epochs; we conduct 200 epochs in total to train the network.

In this paper, we evaluate the ADN-q variants of CAE-ADN with q ∈ {121, 169, 201}, where q denotes the number of layers in the ADN. Table 1 shows the ADN-q architectures. The corresponding CAE is generated from each ADN-q to pre-train the structure. All experiments are conducted on one GPU card (NVIDIA GeForce RTX 2080) with 8 GB RAM using TensorFlow.
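The schedule just described (initial rate 1e-4, halved at epochs 50, 100, and 150, then 1e-6 for the final 10 of 200 epochs) maps naturally onto a Keras LearningRateScheduler callback. The sketch below is one plausible encoding of that schedule, not the authors' training script.

```python
def lr_schedule(epoch, lr):
    """1e-4, halved at epochs 50/100/150; 1e-6 for the final 10 of 200 epochs."""
    if epoch >= 190:
        return 1e-6
    rate = 1e-4
    for milestone in (50, 100, 150):
        if epoch >= milestone:
            rate *= 0.5
    return rate

scheduler = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# adn.fit(train_ds, validation_data=test_ds, epochs=200, callbacks=[scheduler])
```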

Results

Comparison with baselines

We compare CAE-ADN with two baselines: ResNet-50 [47] and DenseNet-169 [35] (these two state-of-the-art methods have parameter counts similar to CAE-ADN). For ResNet-50, we use a kernel size of 3 × 3 at each convolutional layer. For DenseNet-169, k is set to 4, where k represents the growth rate of the network; for convolutional layers with kernel size 3 × 3, each side of the inputs is zero-padded by one pixel to keep the feature-map size fixed, and a 1 × 1 convolution followed by 2 × 2 average pooling is applied as the transition layer between two contiguous dense blocks. The training configurations of CAE-ADN, ResNet-50, and DenseNet-169 are the same for a fair comparison. Table 2 shows the Top-1 and Top-5 accuracy of fruit classification with the different networks: ADNs without pre-training perform better than ResNet-50 and DenseNet-169, and ADNs with pre-training have higher accuracy than ADNs without pre-training. The results show that ADN-169 with pre-training is the best structure of CAE-ADN, with Top-1 accuracies of 95.86% and 93.78% and Top-5 accuracies of 99.98% and 98.78% on Fruit 26 and Fruit 15, respectively.

Performance of each class

The precision and recall on the testing datasets are presented in Tables 3 and 4. The results show that the model performs well on all kinds of fruits. The model also has a good ability for fruit color recognition and can identify different varieties of the same kind of fruit. For instance, the precisions of Apple Red 1, Apple Red 2, and Apple Red 3 are 96.07%, 95.36%, and 95.28%, respectively, and the recalls of Apple Red 1, Apple Red 2, and Apple Red 3 are 95.67%, 95.58%, and 94.89%, respectively. Due to the difference in shooting scenes, the accuracy on Fruit 26 is generally higher than on Fruit 15.

Comparison with other studies

We compared our framework with five up-to-date methods: PCA + FSCABC [48], WE + BBO [49], FRFE + BPNN [50], FRFE + IHGA [51], and 13-layer CNN [52]. Some of them used traditional machine learning methods with feature engineering, and others used simple BPNN and CNN structures. The overall accuracies of these methods are shown in Table 5; the accuracies of our method reach 95.86% and 93.78% for Fruit 26 and Fruit 15, which illustrates that CAE-ADN performs better than these state-of-the-art approaches. It can be summarized that the attention model and pre-training with the CAE can improve the performance of a CNN model on the fruit classification problem. Compared with traditional recognition methods using sample features, a deep learning algorithm based on convolutional neural networks has stronger adaptability, better robustness, and higher recognition accuracy. Image recognition based on deep learning learns the abstract features of the data, which avoids the difficulty of generating specific features for specific tasks and makes the whole recognition process more intelligent. Due to its strong learning ability, it can be transplanted to other tasks well; only the convolutional neural network needs to be retrained. Considering the development of IoT technology [34, 53], it is meaningful to establish a decision-making system [54–56], and the proposed method could be used in practice.


Table 1 ADN architectures of CAE-ADN

| Layers | ADN-121 | ADN-169 | ADN-201 |
|---|---|---|---|
| Convolution | 7 × 7 conv, stride 2 | 7 × 7 conv, stride 2 | 7 × 7 conv, stride 2 |
| Pooling | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 | 3 × 3 max pool, stride 2 |
| A-Dense block (1) | [1 × 1 conv, 3 × 3 BRCC] × 6 | [1 × 1 conv, 3 × 3 BRCC] × 6 | [1 × 1 conv, 3 × 3 BRCC] × 6 |
| Transition layer (1) | 1 × 1 conv; 2 × 2 average pool, stride 2 | 1 × 1 conv; 2 × 2 average pool, stride 2 | 1 × 1 conv; 2 × 2 average pool, stride 2 |
| A-Dense block (2) | [1 × 1 conv, 3 × 3 BRCC] × 12 | [1 × 1 conv, 3 × 3 BRCC] × 12 | [1 × 1 conv, 3 × 3 BRCC] × 12 |
| Transition layer (2) | 1 × 1 conv; 2 × 2 average pool, stride 2 | 1 × 1 conv; 2 × 2 average pool, stride 2 | 1 × 1 conv; 2 × 2 average pool, stride 2 |
| A-Dense block (3) | [1 × 1 conv, 3 × 3 BRCC] × 24 | [1 × 1 conv, 3 × 3 BRCC] × 32 | [1 × 1 conv, 3 × 3 BRCC] × 48 |
| Transition layer (3) | 1 × 1 conv; 2 × 2 average pool, stride 2 | 1 × 1 conv; 2 × 2 average pool, stride 2 | 1 × 1 conv; 2 × 2 average pool, stride 2 |
| A-Dense block (4) | [1 × 1 conv, 3 × 3 BRCC] × 16 | [1 × 1 conv, 3 × 3 BRCC] × 32 | [1 × 1 conv, 3 × 3 BRCC] × 32 |
| Classification layer | 7 × 7 global average pool; 1000D fully connected, SoftMax | 7 × 7 global average pool; 1000D fully connected, SoftMax | 7 × 7 global average pool; 1000D fully connected, SoftMax |

The growth rate of the ADN is set to k = 4. Each "conv" operation denotes the sequence BN-ReLU-Conv; each "BRCC" operation denotes the sequence BN-ReLU-Conv-CBAM.

Table 2 Performance of fruit classification with different networks

| Net | Fruit 26 Top-1 (%) | Fruit 26 Top-5 (%) | Fruit 15 Top-1 (%) | Fruit 15 Top-5 (%) |
|---|---|---|---|---|
| ResNet-50 | 93.59 | 98.35 | 91.44 | 96.41 |
| DenseNet-169 | 93.87 | 98.85 | 91.46 | 96.57 |
| ADN-121 (without pre-train) | 94.21 | 98.42 | 92.07 | 97.52 |
| ADN-169 (without pre-train) | 94.65 | 98.97 | 92.52 | 97.89 |
| ADN-201 (without pre-train) | 94.26 | 98.24 | 92.12 | 97.38 |
| ADN-121 (pre-train) | 95.15 | 99.58 | 93.03 | 98.29 |
| ADN-169 (pre-train) | 95.86 | 99.98 | 93.78 | 98.78 |
| ADN-201 (pre-train) | 95.64 | 99.32 | 93.49 | 98.64 |

Conclusion

In this work, we develop a hybrid deep learning-based fruit image classification approach, which uses a convolution autoencoder to pre-train on the images and an attention-based DenseNet to extract image features. In the first part of the framework, an unsupervised method with a set of images is applied to pre-train the greedy layer-wise CAE. We use the CAE structure to initialize a set of weights and biases of the ADN. In the second part of the framework, the supervised ADN with the ground truth is implemented. The final part of the framework predicts the category of fruit. We use two fruit datasets to test the effectiveness of the model. The overall accuracies of the compared methods are shown in Table 5; the accuracies of our method reach 95.86% and 93.78% for Fruit 26 and Fruit 15, which illustrates that CAE-ADN performs better than the other approaches. It can be summarized that the attention model and pre-training with the CAE can promote the performance of a CNN algorithm on the fruit classification problem. In addition, compared with traditional algorithms, deep learning simulates human visual perception through a neural network: each layer receives the input of the layer below, abstracting local features upward from the lowest layer, so the network can automatically learn the hidden features of the data and finally obtain a perception of the whole target.


Our method currently remains at the stage of offline experiments. In the future, we will build an application to experiment in an actual supply chain scenario. In addition, we will expand the dataset so that the model can adapt to more complex classification tasks. The proposed model cannot yet be installed in some embedded machine vision systems, and it still works on two-dimensional images, which cannot truly locate points in space. In the future, we need to study the transformation from 2-D coordinates to 3-D coordinates, which can be combined with a Kinect sensor for three-dimensional positioning.

Table 3 Performance of each class (Fruit 26)

| Label | Precision (%) | Recall (%) | Number of test images |
|---|---|---|---|
| Not fruits | 96.07 | 95.67 | 15,750 |
| Apple red 1 | 95.36 | 95.58 | 984 |
| Apple red 2 | 95.28 | 94.89 | 984 |
| Apple red 3 | 95.90 | 96.20 | 864 |
| Apple red yellow | 95.50 | 95.02 | 984 |
| Apricot | 95.85 | 96.64 | 984 |
| Avocado | 95.47 | 96.21 | 858 |
| Braeburn (apple) | 95.66 | 95.64 | 984 |
| Cherry | 95.61 | 96.11 | 984 |
| Apple golden 1 | 95.23 | 95.82 | 984 |
| Apple golden 2 | 95.40 | 94.83 | 984 |
| Apple golden 3 | 96.00 | 95.99 | 966 |
| Granny Smith (apple) | 96.33 | 96.69 | 984 |
| Grape | 95.62 | 95.42 | 984 |
| Grapefruit | 95.92 | 95.97 | 984 |
| Kiwi | 95.87 | 96.54 | 936 |
| Lemon | 95.57 | 95.68 | 492 |
| Nectarine | 95.94 | 96.35 | 984 |
| Orange | 95.57 | 95.45 | 960 |
| Papaya | 95.58 | 95.48 | 984 |
| Peach | 96.40 | 96.20 | 984 |
| Peach flat | 95.12 | 95.45 | 984 |
| Pear | 96.25 | 95.67 | 984 |
| Plum | 95.50 | 95.88 | 906 |
| Pomegranate | 95.25 | 95.78 | 492 |
| Strawberry | 95.26 | 95.77 | 984 |

Table 4 Performance of each class (Fruit 15)

| Label | Precision (%) | Recall (%) | Number of test images |
|---|---|---|---|
| Apple | 93.70 | 93.35 | 1507 |
| Banana | 94.29 | 93.34 | 908 |
| Carambola | 93.33 | 94.34 | 624 |
| Guava | 94.61 | 94.13 | 1202 |
| Kiwi | 93.32 | 93.45 | 1252 |
| Mango | 94.33 | 93.61 | 1246 |
| Orange | 93.91 | 94.19 | 904 |
| Peach | 93.27 | 93.75 | 792 |
| Pear | 93.85 | 93.23 | 904 |
| Persimmon | 94.04 | 93.85 | 622 |
| Pitaya | 93.40 | 93.27 | 750 |
| Plum | 93.33 | 94.64 | 689 |
| Pomegranate | 93.64 | 93.71 | 650 |
| Tomatoes | 94.02 | 92.96 | 651 |
| Muskmelon | 93.63 | 94.45 | 623 |

Table 5 Comparison with other studies

| Method | Fruit 26 accuracy (%) | Fruit 15 accuracy (%) |
|---|---|---|
| PCA + FSCABC [48] | 89.18 | 87.75 |
| WE + BBO [49] | 89.85 | 87.17 |
| FRFE + BPNN [50] | 88.78 | 87.52 |
| FRFE + IHGA [51] | 89.96 | 88.85 |
| 13-Layer CNN [52] | 94.59 | 92.38 |
| CAE-ADN | 95.86 | 93.78 |

Acknowledgements This work was supported by Beijing Social Science Foundation Grant 19JDGLA002 and was partially supported by the Beijing Logistics Informatics Research Base. We are very grateful for their support.

Author contributions Conceptualization, SL; methodology, GX; software, GX; writing—original draft preparation, GX and YM.

Funding Beijing Social Science Foundation Grant 19JDGLA002.

Availability of data and material The authors declare availability of data and material.

Code availability The authors declare code availability.

Compliance with ethical standards

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Pak M, Kim S (2017) A review of deep learning in image recognition. In: 2017 4th International conference on computer applications and information processing technology (CAIPT)
2. Zhai H (2016) Research on image recognition based on deep learning technology. In: 2016 4th International conference on advanced materials and information technology processing (AMITP 2016)
3. Jiang L, Fan Y, Sheng Q, Feng X, Wang W (2018) Research on path guidance of logistics transport vehicle based on image recognition and image processing in port area. EURASIP J Image Video Process
4. Liu F, Snetkov L, Lima D (2017) Summary on fruit identification methods: a literature review. Adv Soc Sci Educ Hum Res 119:1629–1633
5. Getahun S, Ambaw A, Delele M, Meyer CJ, Opara UL (2017) Analysis of airflow and heat transfer inside fruit packed refrigerated shipping container: Part I—model development and validation. J Food Eng 203:58–68
6. Rocha A, Hauagge DC, Wainer J, Goldenstein S (2010) Automatic fruit and vegetable classification from images. Comput Electron Agric 70(1):96–104. https://doi.org/10.1016/j.compag.2009.09.002
7. Tu S, Xue Y, Zheng C, Qi Y, Wan H, Mao L (2018) Detection of passion fruits and maturity classification using red-green-blue depth images. Biosyst Eng 175:156–167. https://doi.org/10.1016/j.biosystemseng.2018.09.004
8. Wang C, Han D, Liu Q, Luo S (2018) A deep learning approach for credit scoring of peer-to-peer lending using attention mechanism LSTM. IEEE Access 7:1–1
9. Mnih V, Heess N, Graves A, Kavukcuoglu K (2014) Recurrent models of visual attention. In: Advances in neural information processing systems
10. Chaudhari S, Polatkan G, Ramanath R, Mithal V (2019) An attentive survey of attention models
11. Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the twenty-fifth international conference on machine learning (ICML 2008), Helsinki, Finland, June 5–9, 2008. ACM
12. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
13. Unser M (1986) Sum and difference histograms for texture classification. IEEE TPAMI 8(1):118–125
14. Pass G, Zabih R, Miller J (1997) Comparing images using color coherence vectors. In: ACMMM, pp 1–14
15. Stehling R, Nascimento M, Falcao A (2002) A compact and efficient image retrieval approach based on border/interior pixel classification. In: CIKM, pp 102–109
16. Garcia F, Cervantes J, Lopez A, Alvarado M (2016) Fruit classification by extracting color chromaticity, shape and texture features: towards an application for supermarkets. IEEE Lat Am Trans 14(7):3434–3443
17. Serrano N, Savakis A, Luo J (2004) A computationally efficient approach to indoor/outdoor scene classification. In: ICPR, pp 146–149
18. Lyu S, Farid H (2005) How realistic is photorealistic? IEEE Trans Signal Process (TSP) 53(2):845–850
19. Rocha A, Goldenstein S (2007) PR: more than meets the eye. In: ICCV, pp 1–8
20. Bolle R, Connell J, Haas N, Mohan R, Taubin G (1996) Veggievision: a produce recognition system. WACV, Sarasota, pp 1–8
21. Jurie F, Triggs B (2005) Creating efficient code books for visual recognition. ICCV 1:604–610
22. Agarwal S, Awan A, Roth D (2004) Learning to detect objects in images via a sparse, part-based representation. TPAMI 26(11):1475–1490
23. Marszalek M, Schmid C (2006) Spatial weighting for bag-of-features. In: CVPR, pp 2118–2125
24. Sivic J, Russell B, Efros A, Zisserman A, Freeman W (2005) Discovering objects and their location in images. In: ICCV, pp 370–377
25. Pardo-Mates N, Vera A, Barbosa S, Hidalgo-Serrano M, Núñez O, Saurina J et al (2017) Characterization, classification and authentication of fruit-based extracts by means of HPLC-UV chromatographic fingerprints, polyphenolic profiles and chemometric methods. Food Chem 221:29
26. Shao W, Li Y, Diao S, Jiang J, Dong R (2017) Rapid classification of Chinese quince (Chaenomeles speciosa Nakai) fruit provenance by near-infrared spectroscopy and multivariate calibration. Anal Bioanal Chem 409(1):115–120
27. Radi CS, Litananda WS et al (2016) Electronic nose based on partition column integrated with gas sensor for fruit identification and classification. Comput Electron Agric 121:429–435
28. Fei-Fei L, Fergus R, Perona P (2006) One-shot learning of object categories. IEEE TPAMI 33(3):239–253
29. Zhang Y, Phillips P, Wang S, Ji G, Yang J, Wu J (2016) Fruit classification by biogeography-based optimization and feedforward neural network. Expert Syst 33(3):239–253
30. Wang S, Lu Z, Yang J, Zhang Y, Dong Z (2016) Fractional Fourier entropy increases the recognition rate of fruit type detection. BMC Plant Biol 16(S2):85
31. Lu Z, Lu S, Wang S, Li Y, Lu H (2017) A fruit sensing and classification system by fractional Fourier entropy and improved hybrid genetic algorithm. In: International conference on industrial application engineering 2017
32. Zhang Y, Wang S, Ji G, Phillips P (2014) Fruit classification using computer vision and feedforward neural network. J Food Eng 143:167–177
33. Kuo Y-H, Yeh Y-T, Pan S-Y, Hsieh S-C (2019) Identification and structural elucidation of anti-inflammatory compounds from Chinese olive (Canarium album L.) fruit extracts. Foods 8(10):441. https://doi.org/10.3390/foods8100441
34. Zhang Y, Dong Z, Chen X, Jia W, Du S, Muhammad K et al (2017) Image based fruit category classification by 13-layer deep convolutional neural network and data augmentation. Multimed Tools Appl 78:3613
35. Woo S, Park J, Lee JY, Kweon IS (2018) CBAM: convolutional block attention module. Springer, New York
36. Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: IEEE Conference on computer vision and pattern recognition (CVPR), Honolulu, HI, 2017, pp 2261–2269
37. Liou CY, Cheng WC, Liou JW, Liou DR (2014) Autoencoder for words. Neurocomputing 139:84–96
38. Rumelhart DE (1986) Learning internal representations by error propagation, parallel distributed processing. Explorations in the microstructure of cognition. MIT Press, Cambridge
39. Baldi P (2012) Autoencoders, unsupervised learning, and deep architectures. ICML Unsuperv Transf Learn 27:37–50
40. Kingma DP, Welling M (2013) Auto-encoding variational Bayes
41. Masci J, Meier U, Cireşan D, Schmidhuber J (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. Artif Neural Netw Mach Learn ICANN 89:52–59. https://doi.org/10.1007/978-3-642-21735-7_7


42. Zagoruyko S, Komodakis N (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: ICLR
43. Lowe D (1999) Object recognition from local scale-invariant features. Proc Seventh IEEE Int Conf Comput Vis 2:1150–1157
44. Serre T, Wolf L, Poggio T (2007) Object recognition with features inspired by visual cortex. In: Proceedings of computer vision and pattern recognition conference (2007)
45. Kingma D, Ba J (2014) ADAM: a method for stochastic optimization. Comput Sci
46. Mureşan H, Oltean M (2017) Fruit recognition from images using deep learning
47. Israr H, Qianhua H, Zhuliang C, Wei X (2018) Fruit recognition dataset (version V 1.0). Zenodo. https://doi.org/10.5281/zenodo.1310165
48. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE Conference on computer vision and pattern recognition. IEEE Computer Society
49. Ji G (2014) Fruit classification using computer vision and feedforward neural network. J Food Eng 143:167–177
50. Wei L (2015) Fruit classification by wavelet-entropy and feedforward neural network trained by fitness scaled chaotic ABC and biogeography-based optimization. Entropy 17(8):5711–5728
51. Lu Z (2016) Fractional Fourier entropy increases the recognition rate of fruit type detection. BMC Plant Biol 16(S2):10
52. Lu Z, Li Y (2017) A fruit sensing and classification system by fractional Fourier entropy and improved hybrid genetic algorithm. In: 5th International conference on industrial application engineering (IIAE). Institute of Industrial Applications Engineers, Kitakyushu, Japan, pp 293–299
53. Brahmachary TK, Ahmed S, Mia MS (2018) Health, safety and quality management practices in construction sector: a case study. J Syst Manag Sci 8(2):47–64
54. Hai L, Fan Chunxiao W, Yuexin LJ, Lilin R (2014) Research of LDAP-based IOT object information management scheme. J Logist Inform Serv Sci 1(1):51–60
55. Zhao PX, Gao WQ, Han X, Luo WH (2019) Bi-objective collaborative scheduling optimization of airport ferry vehicle and tractor. Int J Simul Model 18(2):355–365. https://doi.org/10.2507/IJSIMM18(2)CO9
56. Xu W, Yin Y (2018) Functional objectives decision-making of discrete manufacturing system based on integrated ant colony optimization and particle swarm optimization approach. Adv Prod Eng Manag 13(4):389–404. https://doi.org/10.14743/apem2018.4.298

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
