Classification of Spam Emails Using Deep Learning
emails create. The proposed method is built and implemented through the use of data analysis.

II. RELATED WORKS

Various works have previously been published that demonstrate the architectures, techniques, and algorithms under which deep learning is used to classify email. Nevertheless, according to the literature, the Min-hash technique has not commonly been used for email classification; instead, a one-dimensional Convolutional Neural Network (1D CNN) alone was used to classify emails into Ham and Spam.

DNNs have been widely used in many fields since the development of Deep Learning (DL), which eventually highlighted the critical security issues related to DNNs. Many studies have examined the security of DNNs and identified several vulnerabilities by proposing a number of attack methods [3], including black-box attacks [4] and white-box attacks [5].

The study presented in [6] makes use of a Deep Neural Network classifier, one of the DL architectures, to classify a dataset. The authors coupled the classifier with the Discrete Wavelet Transform (DWT), a strong feature extraction method, and Principal Component Analysis (PCA). Their results turned out to be very successful across all performance measures.

As for [7], the authors classified emails into Spam and not-Spam (Ham) by analyzing the whole content (i.e., both image and text) and processing it through independent classifiers based on Convolutional Neural Networks. They proposed two hybrid multi-modal architectures by merging the image and text classifiers. It is the study closest to our research.

A comprehensive survey is presented in [8], wherein the authors shed light on a broad range of text classification algorithms, such as the Support Vector Machine, Decision Tree, and rule-based classifiers.

The authors in [9] present the preliminary results of their study, in which deep learning is used for legal document analysis. Their experiments were performed on four datasets of actual legal matters, after which their deep learning findings were compared to the results obtained using an SVM algorithm. Their outcomes revealed that the Convolutional Neural Network (CNN) performs well when used with a larger training dataset, and it is therefore found to be a suitable tool for text classification in the legal industry.

As for the studies and experiments presented in [10] [11] [12] [13], the authors of each work proposed novel strategies for email Spam identification based on SVM and feature extraction. The application of their proposed strategies to test datasets resulted in high accuracy rates of about 98%.

III. OVERVIEW OF DEEP LEARNING

A general introduction to Deep Learning (DL) defines it as a sub-field of machine learning that focuses on learning several levels of representations. This is done by building a hierarchy of features in which higher levels are built from lower ones, so that lower-level features can be used to describe several higher-level features [6] [14], as illustrated in Fig. 2.

Fig. 2. Machine learning and Deep Learning are subfields of Artificial Intelligence.

Different DL architectures exist. A popular one is the Convolutional Neural Network (CNN), which has often been used recently because of its ability to perform complicated operations through the use of convolution filters [6] [15]. A standard CNN structure consists of a sequence of feedforward layers that introduce convolutional filters and pooling layers, followed by several fully-connected layers that convert the previous layers' 2D feature maps into 1D vectors for classification [15] [16].

DNNs are also a type of DL architecture that has been successfully used for classifying and regressing data in a variety of fields. In this type of network, the information flows from the input layer to the output layer through several hidden layers (more than two) in a typical feed-forward fashion [17].

Figure 3 shows a standard DNN architecture, which includes an input layer with neurons for the input characteristics, an output layer with neurons for the output classes, and hidden layers in between. As for Fig. 4, it illustrates a neural node in detail.

Fig. 3. Structure diagram of a deep neural network
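To make the feed-forward structure of Fig. 3 concrete, the following is a minimal sketch of such a network for binary Spam/Ham classification. It is an illustration only: the Keras API, the layer sizes, and the activation functions are assumptions, not the exact configuration used in this paper.

# Minimal sketch (assumed Keras API and layer sizes, not this paper's exact model):
# an input layer, several hidden layers, and a single sigmoid output neuron.
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_dim, hidden_units=(64, 32, 16)):
    model = keras.Sequential([layers.Input(shape=(input_dim,))])
    for units in hidden_units:                         # hidden layers of Fig. 3
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))   # output: Spam probability
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model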
TABLE I. CHARACTERISTIC MATRIX BASED ON K-SHINGLES (K = 3)

#Shingling        E1   E2
This email is      1    1
email is Spam      1    1
is Spam and        1    0
Spam and is        0    1
and is not         1    1
...                ...  ...
Shingle m          m    m
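To illustrate how a characteristic matrix such as Table I can be produced, the sketch below extracts word-level 3-shingles from two short example emails and marks which shingles appear in each. The example texts and names are illustrative assumptions, not the code or data used in this study.

# Sketch of k-shingling (k = 3 words) and a Table I-style characteristic matrix.
# The example emails and all names below are assumptions for illustration.
def k_shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

emails = {"E1": "this email is spam and is not wanted",
          "E2": "this email is spam and is not useful"}
shingle_sets = {name: k_shingles(body) for name, body in emails.items()}

# Rows are shingles, columns are emails; 1 means the shingle occurs in that email.
for shingle in sorted(set().union(*shingle_sets.values())):
    row = [int(shingle in shingle_sets[e]) for e in ("E1", "E2")]
    print(f"{shingle:<22} {row[0]} {row[1]}")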
[Table continued from the previous page: the hashed value of each shingle, its presence in E1 and E2, and the corresponding hash-function values; the header row was lost in extraction.]

316870923    1  0    1  6  2  6
1976815438   0  1    9  2  7  9
339473466    1  1    6  9  1  1
...

The algorithm that implements the main steps of Min-hash hashing on the emails, whose input is the characteristic matrix M and the hash functions (h1, h2, h3, …, hn) and whose output is the signature matrix S, is clarified in the following steps:

1) Pick n random hashing functions h1, h2, h3, …, hn.

2) Construct the signature matrix S from the characteristic matrix M, where each row (i) corresponds to a hash function and each column (c) corresponds to an email.

3) Then, set SIG(i, c) as the signature matrix element for the i-th hash function and column c.

4) Convert the long bit vectors into short signatures.

5) For every column c in the documents, do the following:

a) If c has 0 in both documents' rows r, do nothing.

b) If the row has 1, then for each i = 1, 2, …, n set SIG(i, c) to the smaller of the current value of SIG(i, c) and hi(r). Then Pr[hπ(c1) = hπ(c2)] = sim(c1, c2).

D. Min-hash Signatures

After finding the similarities between the emails, as well as the Min-hash for all emails, the Min-hash signatures can be obtained by constructing the signature matrix, considering each row in the order in which it appears. SIG(i, c) is the signature matrix entry for the i-th hash function and column c. Supposing that SIG(i, c) is initially ∞ for every i and c, row r is handled in the following manner:

1) Calculate h1(shingles) to hn(shingles).

2) Do the following for each column c:

a) If c has 0 in row r, do nothing.

b) If c has 1 in row r (shingles), then set SIG(i, c) to the smaller of the current value of SIG(i, c) and hi(r), for each i = 1, 2, …, n [21], as shown in Table III.

TABLE III. SIGNATURE MATRIX FOR ALL THE EMAILS

Hash   Email 1   Email 2
#1        1         5
#2        2         2
#3        1         1
#4        0         0

V. THE PROPOSED METHODOLOGY

The proposed methodology consists of the following steps:

1) Dataset: The dataset used in this study can be found on Kaggle, a machine learning database. There are 5725 instances in the "Spam filter" dataset, with two columns for the class and the email string. Fig. 5 shows a sample of the dataset.

2) Data cleaning: Data cleaning is an essential aspect of data science, since working with skewed data causes a large number of problems. This process entails breaking the text down into words and dealing with punctuation and case. In this work, unnecessary values are excluded, stop words, symbols, and punctuation marks are removed, and data types are converted.

3) Calculating hashed shingles (k-shingles): k-shingles of length k = 3 were used in this work. The characteristic matrix and the signature matrix are used to generate a dense matrix from the large sparse matrix. The characteristic matrix was condensed using h = 4 Min-hash functions, and the Min-hash values were used to build a signature matrix. These values are fed to the deep neural network as input vectors.

4) Calculating the crc32 hash of the shingles: After the k-shingling stage, the crc32 hash function is used, and the Min-hash algorithm is then applied to the hashed tokens (see the sketch at the end of this section).

5) The data is split into training and testing sets.

6) The number of hidden layers, the nodes in each layer, the training batches, and the amount of training data per batch are set for a DNN with several hidden layers.

7) The DNN Spam classifier is used to classify the checked emails, so as to decide whether or not they are considered Spam. The DNN classifier is created by training the DNN, and the output of this stage is the optimum weights.

8) The obtained email classification results are compared with the real labels to ensure that the algorithm is indeed accurate. The result is given as Spam or Ham.

The essential stages of the proposed system are illustrated in Fig. 6.
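The sketch referred to in step 4 follows. It hashes each shingle with crc32 and builds one signature column per email by taking, for each of h = 4 hash functions, the minimum value over the email's hashed shingles, i.e. SIG(i, c). The random linear hash family and the toy shingle sets here are assumptions for illustration, not the exact functions used in this work; for truly random permutations, Pr[hπ(c1) = hπ(c2)] equals the Jaccard similarity sim(c1, c2).

# Sketch of the crc32 + Min-hash stage (steps 3 and 4); the hash family
# h_i(x) = (a_i * x + b_i) mod p and the toy shingle sets are assumptions.
import random
import zlib

P = 4294967311  # a prime larger than 2**32

def crc32_shingles(shingles):
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}

def make_hash_family(n=4, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(n)]

def minhash_signature(hashed, family):
    # SIG(i, c): the smallest h_i value seen over the email's hashed shingles.
    return [min((a * x + b) % P for x in hashed) for a, b in family]

family = make_hash_family(n=4)
toy_emails = {"Email 1": {"this email is", "email is spam"},
              "Email 2": {"this email is", "email is ham"}}
signature_matrix = {name: minhash_signature(crc32_shingles(sh), family)
                    for name, sh in toy_emails.items()}
print(signature_matrix)  # one column of the signature matrix per email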
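To connect steps 5 to 7, the following sketch splits Min-hash signature vectors and their labels into training and test sets and trains a small feed-forward classifier on them. The scikit-learn split, the layer sizes, and the epoch and batch settings are illustrative assumptions rather than the configuration reported in this paper.

# Sketch of steps 5-7: split the signature vectors, configure the DNN, train it.
# Split ratio, layer sizes, epochs, and batch size are assumed values.
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

def train_spam_dnn(signatures, labels):
    # signatures: (n_emails, n_hash_functions) array; labels: 0 = Ham, 1 = Spam.
    X_train, X_test, y_train, y_test = train_test_split(
        signatures, labels, test_size=0.2, random_state=42)          # step 5
    model = keras.Sequential([                                        # step 6
        layers.Input(shape=(signatures.shape[1],)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)  # step 7
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)        # step 8 check
    return model, accuracy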
VII. CONCLUSION

The Min-hash technique had not been used for email classification before; previously, emails were classified into Ham and Spam using a 1D CNN. This paper presented an email classification algorithm in which data mining is used to design and execute an effective method to differentiate and classify email into Spam and Ham. Neural networks have a lot to offer the computing community. Their ability to learn makes them very adaptable and robust, and there is no need to comprehend the task's internal mechanics. Because of their parallel architecture, they are also well suited for real-time systems due to their fast response and computation times. The training and test data generated by the Min-hash technique were fed into the DNN algorithm, which built several hidden layers and generated NN classifiers through training. Compared to alternative works, it has been observed that the proposed method is relatively more effective, as the accuracy rate obtained was remarkably high (98%). The findings show that the signature matrix is well suited for this task, as it emphasizes speed, confidentiality, and integrity. The success criterion for consistency received a high ranking.

ACKNOWLEDGMENT

The authors would like to thank the University of Babylon, College of IT, for supporting this paper.

REFERENCES

[1] D. Coss and S. Samonas, "The CIA Strikes Back: Redefining Confidentiality, Integrity and Availability in Security," Journal of Information System Security, vol. 10, no. 3, pp. 21–45, 2014.
[2] T. Sultana, K. A. Sapnaz, F. Sana, and N. Jamedar, "Email Based Spam Detection," Int. J. Eng. Res. Technol., vol. 9, no. 6, June 2020.
[3] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, "Can machine learning be secure?," Proc. 2006 ACM Symp. on Information, Computer and Communications Security (ASIACCS '06), pp. 16–25, 2006, doi: 10.1145/1128817.1128824.
[4] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar, "Can machine learning be secure?," Proc. 2006 ACM Symp. on Information, Computer and Communications Security (ASIACCS '06), pp. 16–25, 2006, doi: 10.1145/1128817.1128824.
[5] B. Liang, M. Su, W. You, W. Shi, and G. Yang, "Cracking classifiers for evasion: A case study on Google's phishing pages filter," Proc. 25th Int. World Wide Web Conf. (WWW 2016), pp. 345–356, 2016, doi: 10.1145/2872427.2883060.
[6] H. Mohsen, E.-S. A. El-Dahshan, E.-S. M. El-Horbaty, and A.-B. M. Salem, "Classification using deep learning neural networks for brain tumors," Future Computing and Informatics Journal, vol. 3, no. 1, pp. 68–71, 2018, doi: 10.1016/j.fcij.2017.12.001.
[7] S. Shikhar and S. Biswas, "Multimodal spam classification using deep learning techniques," 2017 13th Int. Conf. on Signal-Image Technology & Internet-Based Systems (SITIS), pp. 346–349, 2017.
[8] C. C. Aggarwal and C. X. Zhai, "A survey of text classification algorithms," in Mining Text Data, pp. 163–222, 2013.
[9] F. Wei, H. Qin, S. Ye, and H. Zhao, "Empirical study of deep learning for text classification in legal document review," arXiv, 2019.
[10] B. K. Dedeturk and B. Akay, "Spam filtering using a logistic regression model trained by an artificial bee colony algorithm," Applied Soft Computing Journal, vol. 91, 2020, doi: 10.1016/j.asoc.2020.106229.
[11] M. A. Hassan and N. Mtetwa, "Feature extraction and classification of spam emails," 5th International Conference on Soft Computing and Machine Intelligence (ISCMI 2018), pp. 93–98, 2018, doi: 10.1109/ISCMI.2018.8703222.
[12] S. Sumathi and G. K. Pugalendhi, "Cognition based spam mail text analysis using combined approach of deep neural network classifier and random forest," J. Ambient Intell. Humaniz. Comput., 2020, doi: 10.1007/s12652-020-02087-8.
[13] D. K. Dewangan and P. Gupta, "Email spam classification using support vector machine," International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 6, no. VI, June 2018.
[14] G. Litjens et al., "A survey on deep learning in medical image analysis," Med. Image Anal., vol. 42, pp. 60–88, 2017, doi: 10.1016/j.media.2017.07.005.
[15] Y. Pan et al., "Brain tumor grading based on neural networks and convolutional neural networks," Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), pp. 699–702, 2015, doi: 10.1109/EMBC.2015.7318458.
[16] "Diving deep into deep learning: History, evolution, types and applications," International Journal of Innovative Technology and Exploring Engineering, vol. 9, no. 3, pp. 2835–2846, 2020, doi: 10.35940/ijitee.a4865.019320.
[17] A. Anuse and V. Vyas, "A novel training algorithm for convolutional neural network," Complex Intell. Syst., vol. 2, no. 3, pp. 221–234, 2016, doi: 10.1007/s40747-016-0024-6.
[18] M. Awad and M. Foqaha, "Email spam classification using hybrid approach of RBF neural network and particle swarm optimization," International Journal of Network Security & Its Applications, vol. 8, no. 5, pp. 19–38, 2016, doi: 10.5121/ijnsa.2016.8402.
[19] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: Scalable online collaborative filtering," 16th International World Wide Web Conference (WWW 2007), pp. 271–280, 2007, doi: 10.1145/1242572.1242610.
[20] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets, 2020, doi: 10.1017/9781108684163.
[21] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets, 2014, doi: 10.1017/cbo9781139924801.
[22] M. Majumder, "Email classification using artificial neural network," pp. 49–54, 2015, doi: 10.1007/978-981-4560-73-3_3.
[23] M. Rout, J. K. Rout, and H. Das, "Correction to: Nature Inspired Computing for Data Science," SCI, vol. 871, pp. C1–C1, 2020.