Paper 64-A Sophisticated Deep Learning Framework
616 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 14, No. 12, 2023
detection systems, and firewalls are examples of traditional security measures that can help prevent attacks to some extent but are not fully effective when it comes to malicious user detection. Machine learning algorithms have been used to analyze user activity and detect anomalies. However, the accuracy of these algorithms is largely determined by the quality and volume of the training data.

A malicious user uses a computer system or network with the intent to cause harm, steal data, or disrupt normal operations [1]. Malicious users may have numerous motives, including financial gain, retaliation, or political involvement. They may use a variety of strategies to accomplish their goals, including malware, phishing, social engineering, and exploiting vulnerabilities in software and hardware.

Analyzing user behavior is one way of identifying malicious users. By keeping an eye on user activity patterns and spotting abnormalities, it can flag suspicious behavior for further investigation. CNN and LSTM networks are examples of machine learning techniques that can be used to automatically analyze large datasets of user behavior and detect patterns that may be indicative of harmful conduct [2]. By searching for the best set of hyperparameters, genetic algorithms (GAs) may be employed to improve the performance of the model.

Malicious users pose a severe threat to entities, governments, and organizations. They can steal private information, jeopardize the security of systems, and harm a company's reputation and brand. Therefore, it is essential to have effective techniques for identifying and curbing the actions of harmful users [3]. The rise of these daily threats over the past ten years is the main cause for concern for data security. Fig. 1 illustrates the trend of the threats in the past decade.

[Figure: bar chart "Frequency of Threats"; axis "Threats in Billions" (0 to 300); one bar per year, 2012 to 2022]
Fig. 1. Frequency threats in real-time.

A unique approach is envisioned with a dual-powered CLM (convolutional neural network and LSTM) and optimization technique. The combination of deep learning and evolutionary computation gives the technique the adaptive capabilities needed to safeguard OSNs. The suggested method is evaluated on a user activity dataset from an OSN, and the outcomes are compared with those of conventional machine learning techniques [4].

The motivation for the proposed CLM and optimization method is to identify hazardous users in order to improve security and defend against cyberattacks. Exploiting system vulnerabilities, gaining unauthorized access, stealing sensitive data, and interrupting system operations can harm people and companies. Firewalls and antivirus software do not always stop complex attacks, so modern methods are needed to detect and prevent them [13].

A. Organisation of the Paper
The paper is organised as follows: Section II - Literature Review, Section III - Proposed Methodology, Section IV - Experimental Evaluations and Results, Section V - Conclusion and References.

II. LITERATURE REVIEW
CNNs are a type of deep learning neural network frequently used for processing images and videos. They have been shown to be highly effective in solving challenging computer vision problems, including segmentation, object identification, and image classification. The key principle of CNNs is to extract features from images using convolutional filters and then to classify or detect objects using these features [5]. CNNs have revolutionised the field of computer vision and enabled a variety of applications, from self-driving cars to medical imaging.

CNNs have also gained considerable popularity in voice and image recognition tasks. A CNN captures spatial and temporal patterns in data because it is built on the notions of local connectivity and shared weights. In a CNN model, inputs such as images or data sequences are passed through numerous layers of convolution, pooling, and activation functions. The output of these layers is then fed to fully connected layers that assign the input to one of several classes [7].

The LSTM variant of the recurrent neural network (RNN) addresses the vanishing gradient problem that affects regular RNNs [6]. The vanishing gradient problem occurs when gradients become smaller as they propagate back through time, making it difficult to train the network on long sequences.
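To make the vanishing gradient effect concrete, the following sketch (an illustration only, not part of the proposed method) backpropagates through a scalar tanh RNN of the form h_t = tanh(w * h_{t-1} + x_t). The recurrence, the weight w = 0.5, the constant input, and the function name are all assumptions chosen for the example. By the chain rule, the gradient of h_T with respect to h_0 is a product of per-step factors w * (1 - h_t^2), each of magnitude below 1 here, so it shrinks geometrically with the sequence length.

```python
import math
from typing import List, Sequence

def gradient_through_time(w: float, inputs: Sequence[float]) -> List[float]:
    """Return |d h_t / d h_0| for t = 1..T in a scalar tanh RNN.

    Assumed recurrence (illustrative): h_t = tanh(w * h_{t-1} + x_t),
    so the local derivative is d h_t / d h_{t-1} = w * (1 - h_t ** 2).
    """
    h = 0.0
    grad = 1.0
    magnitudes = []
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # multiply in this step's local derivative
        magnitudes.append(abs(grad))
    return magnitudes

magnitudes = gradient_through_time(w=0.5, inputs=[0.1] * 50)
print(f"after  1 step : {magnitudes[0]:.3e}")
print(f"after 10 steps: {magnitudes[9]:.3e}")
print(f"after 50 steps: {magnitudes[49]:.3e}")
```

With these assumed values every factor is at most 0.5, so after 50 steps the gradient magnitude is bounded by 0.5^50, which is why a plain RNN struggles to learn dependencies that span long sequences.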
LSTM resolves this problem with a special memory cell that can store information for longer periods. Three gates govern the cell: the input gate, the forget gate, and the output gate. The forget gate regulates how much of the preceding data is retained, the input gate controls the flow of new information into the cell, and the output gate regulates the cell's output.

LSTM has been shown to be effective in a wide range of applications, including speech recognition, machine translation, and natural language processing (NLP). It has also been used for anomaly detection and time-series prediction tasks, where it can discover temporal relationships and long-term trends in data [8].

The Genetic Algorithm (GA) is a heuristic optimization method based on natural selection and evolution. It is used to solve optimization problems that require finding the optimal parameter combination for a given objective function. The GA generates a population of candidate solutions, known as chromosomes. Each chromosome is composed of a series of genes that represent the parameters of the problem being optimized. These parameters can take any form of data, including numerical values, Boolean values, and text. The GA then evaluates the fitness of each chromosome in the population using the objective function. The fitness value measures how well the chromosome solves the problem. The GA then selects the fittest chromosomes in the population to serve as the parents of the next generation.

Ranjan and Kumar [6]. The authors analysed user behavioural data using multiple machine-learning methods to identify unusual behaviours. The study demonstrated that UBA can be an effective method for detecting malicious users. Tanuja et al. [12] proposed a machine learning technique for identifying fraudulent social network users. The authors analysed user activity data using multiple machine-learning methods to identify abnormal conduct that may indicate a deceitful user. To identify various anomalous user behaviours and lessen their negative impacts, the authors of [10] performed statistical analysis to find unusual conduct that may point to a malicious user. Several patents pertaining to the detection of malicious users are accessible on Google Patents, including a framework for mobile advanced persistent threat detection, a deep learning method for detecting covert channels in the domain name system, and a technique for detecting insider and masquerade attacks by identifying malicious user behaviour [11] [12].

III. PROPOSED METHODOLOGY
A. System Model
This section presents the system model for malicious user detection through user behavior using the CLM and optimization technique. Fig. 2 gives an overview of the system. The classification model comprises the data collection and preprocessing module, the CLM and optimization technique, and the evaluation module.
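The optimization part of the system relies on the Genetic Algorithm summarised in Section II: a population of chromosomes is scored by a fitness function, the top chromosomes become parents, and the next generation is produced from them. The sketch below is a minimal illustration of that loop under stated assumptions, not the authors' implementation. The function names, the uniform crossover, the resampling mutation, the elitism, and the toy objective (standing in for a validation score over two hyperparameters) are all hypothetical choices.

```python
import random

def run_ga(fitness, gene_ranges, pop_size=20, generations=40,
           n_parents=6, mutation_rate=0.2, seed=0):
    """Toy genetic algorithm: keep the top chromosomes, cross over, mutate."""
    rng = random.Random(seed)
    # Each chromosome is a list of genes, one per parameter being optimized.
    population = [[rng.uniform(lo, hi) for lo, hi in gene_ranges]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate fitness and keep the top chromosomes as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:n_parents]
        children = []
        while len(children) < pop_size - n_parents:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each gene is copied from one of the parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally resample a gene within its range.
            for i, (lo, hi) in enumerate(gene_ranges):
                if rng.random() < mutation_rate:
                    child[i] = rng.uniform(lo, hi)
            children.append(child)
        population = parents + children  # parents survive (elitism)
    return max(population, key=fitness)

def toy_fitness(chromosome):
    """Hypothetical objective whose maximum lies at genes (0.3, 0.7)."""
    target = (0.3, 0.7)
    return -sum((g - t) ** 2 for g, t in zip(chromosome, target))

best = run_ga(toy_fitness, gene_ranges=[(0.0, 1.0), (0.0, 1.0)])
print(best, toy_fitness(best))
```

In a hyperparameter search, `toy_fitness` would be replaced by a function that trains the model with the decoded genes and returns a validation metric. Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.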
should be able to handle large datasets, noisy data, and a wide variety of malicious behavior types, including network attacks, system intrusions, and user impersonation. The objective is to develop a prototype that can be deployed in a real-world setting to detect and prevent malicious user behavior before it can cause damage to users or systems. The system aims to provide a reliable and precise methodology for identifying malicious users by scrutinizing their behavioral patterns. The system attempts to capture both the spatial and temporal aspects of user behavior data by utilizing the capabilities of CNN and LSTM neural networks [17] [18]. The Genetic Algorithm is also used to improve the model's performance and optimize its parameters.

A dataset of user behavior that comprises elements such as login patterns, session length, transaction history, and other relevant data is used as the system's input. The data is pre-processed in order to normalize and encode it for the neural networks. The CNN module of the classifier extracts spatial features from the input data, while the LSTM component captures the temporal relationships [26]. The CNN and LSTM model is then trained using the training dataset [19] [20]. The Genetic Algorithm is used to optimize the model: it examines various combinations of hyperparameters to find the set of parameters that maximizes the detection accuracy [23].

Algorithm 1: Data Preprocessing
Initialize
BEGIN
Step 1: Load the dataset
Step 2: Handle the missing values
    Replace with mean or median values
Step 3: Normalize the features
Step 4: Split the dataset
    Divide the dataset into
    1. Training dataset
    2. Testing dataset
Step 5: Feature selection and feature extraction
Step 6: Handle the time series data
Step 7: Data augmentation
Step 8: Finalize the pre-processed dataset
END
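Steps 2 to 4 of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the authors' pipeline: missing values are represented as `None` and replaced by the column mean, normalization is min-max scaling to [0, 1], the split is a seeded random shuffle, and the feature names and sample records are hypothetical.

```python
import random

def impute_mean(rows):
    """Step 2: replace missing values (None) with the column mean."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def min_max_normalize(rows):
    """Step 3: rescale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Step 4: shuffle the indices and divide into training and testing sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

# Hypothetical records: [logins per day, session length in minutes]
raw = [[3, 12.0], [40, None], [5, 30.0], [None, 8.0]]
labels = [0, 1, 0, 0]

clean = min_max_normalize(impute_mean(raw))
(train_x, train_y), (test_x, test_y) = train_test_split(clean, labels)
print(clean)
```

The sketch assumes every column has at least one observed value. Median imputation, which Algorithm 1 also allows, would only change the statistic computed in `impute_mean`.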
1) Data collection: The data collection and preparation module is responsible for gathering user behavior data and converting it into a format that can be used by the AIMDS model. This module collects data from many sources, such as network traffic logs, user input logs, and system logs, and then pre-processes the data to eliminate noise, missing values, and other abnormalities [16].

a) Dataset acquisition: A large-scale dataset containing user behaviour information is obtained from a reliable internet platform. The dataset includes a variety of features such as user activities, timestamps, and session information.

b) Data pre-processing: The dataset is pre-processed to remove superfluous or redundant features, handle missing values, and normalize the data [14]. Pre-processing steps may include feature selection, data cleaning, and categorical variable encoding.

c) Data preparation: The pre-processed data is then prepared for model training by splitting it into training, validation, and testing sets. The training set is used to train the model, the validation set is used to fine-tune the hyperparameters, and the testing set is used to assess the performance of the model.

Preprocessing the input data begins with filtering and normalizing the user behavior data to remove noise and insignificant data. The preprocessed data is then passed through the feature extraction layer, which extracts useful features for further analysis [21]. Algorithm 1 gives an overview of the sequence of preprocessing steps applied to the dataset after it has been assembled.

2) Algorithm implementation: The CLM and optimization technique model is responsible for assessing the pre-processed user behavior data and determining whether or not a given user is acting maliciously. This model is made up of two key parts: the CNN and LSTM layers, which extract features from user behavior data, and the genetic algorithm, which optimizes the parameters of the CLM (convolutional neural network and LSTM) model to enhance its accuracy [24][25].

Detecting malicious user behavior using the dual-powered CLM technique and an optimization technique involves several algorithms, formulas, and techniques.

a) Architecture: The CLM and optimization model architecture is designed around the three algorithms. The CNN layer extracts spatial features from the data, the LSTM layer captures the temporal dynamics of user behaviour [15], and the GA layer optimizes the model's hyperparameters.

b) Training: The CLM and optimization model is trained using the prepared data. The model is trained on the training set and then validated on the validation set. During the training phase, the loss function is minimized using optimization techniques such as stochastic gradient descent or Adam optimization.

c) Hyperparameter tuning: The hyperparameters of the model are optimized using the GA. The GA explores the hyperparameter space for the combination of hyperparameters that maximizes the model's performance. The GA's fitness function is based on evaluation measures such as Acc, Prec, Rc, and f1s.

d) CLM algorithm: Algorithm 2 gives the details of initializing the convolution layer parameters and applying the
activation function. The mathematical formulation of the CNN component involves convolution and pooling operations. Let X denote the input data, C the convolutional layer output, and P the pooling layer output. Eq. (1) and Eq. (2) give the desired outcome.

$$C = \mathrm{ReLU}(\mathrm{Conv1D}(X)) \tag{1}$$

$$P = \mathrm{MaxPool1D}(C) \tag{2}$$

Algorithm 2: CLM
BEGIN
# Initialize CNN parameters
f = filter_size
n = num_of_filters
d = dropout_rate
fz = filter_sizes
# Define CNN
input_layer = Input(shape=input_shape)
conv_layers = []
for f in fz:
    conv_layer = Conv1D(filters=n, kernel_size=f, activation='relu')(input_layer)
    pool_layer = MaxPooling1D(pool_size=x)(conv_layer)
    conv_layers.append(pool_layer)
merged_layer = Concatenate(axis=1)(conv_layers)
flatten_layer = Flatten()(merged_layer)
dropout_layer = Dropout(d)(flatten_layer)
output_layer = Dense(num_classes, activation='softmax')(dropout_layer)
# Compile and train the model
END

e) Optimization algorithm: Algorithm 3 describes the Genetic Algorithm, covering initialization of the population size, evaluation of fitness, and probabilistic selection to find the best solution. The fitness function in the genetic algorithm evaluates the quality of each candidate solution (chromosome). The fitness value is determined by the problem's objective and can be a combination of metrics such as accuracy, precision, recall, or F1-score. The fitness function directs the genetic algorithm's selection, crossover, and mutation processes.

Algorithm 3: Genetic Algorithm
BEGIN
GA(Ft, Ft_th, s, f, m)
Ft: Fitness function that assigns an evaluation score
Ft_th: Fitness threshold
s: Number of hypotheses to be included in the population
f: Fraction of the population to be replaced
m: Mutation rate
Step 1: Initialization
    Define population size
Step 2: Evaluation
    Compute fitness
    Calculate fitness score
Step 3: Selection
    The probability Pr(h_i) of selecting hypothesis h_i is
    $$\Pr(h_i) = \frac{Ft(h_i)}{\sum_{j=1}^{s} Ft(h_j)}$$
Step 4: Crossover
    Select pairs of hypotheses from P
    For each pair, produce offspring by applying crossover
Step 5: Mutation
    Choose members with uniform probability
Step 6: Update
Step 7: Evaluate
    Retrieve the best solution
END

3) Evaluation and detection: The evaluation module is responsible for establishing whether the CLM and optimization technique model is accurate and successful at detecting harmful user behavior. This module typically tests the model's performance on a test set and compares its accuracy, precision, recall, and F1 score with those of other state-of-the-art machine learning models such as SVM and Random Forest.

a) Evaluation: The efficacy of the CLM and optimization technique is assessed using the testing set. The assessment metrics used include Acc, Prec, Rc, and f1s. The results are compared with other state-of-the-art methodologies to assess the efficacy of the suggested methodology.

b) Malicious user detection: The trained CLM and optimization model is applied to detect malicious users based on their conduct. The model accepts user behaviour data as input and produces the probability of the individual being malicious. A threshold on the output probability is then defined to classify users as malicious or non-malicious. The methodology's architecture is depicted in Fig. 3.
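The final detection step, turning the model's output probability into a malicious or non-malicious label, can be illustrated as follows. The paper does not state the threshold value, so the default of 0.5, the function name, and the sample scores are assumptions made purely for the example.

```python
def classify_users(probabilities, threshold=0.5):
    """Label a user 'malicious' when its score reaches the threshold."""
    return {user: ("malicious" if p >= threshold else "non-malicious")
            for user, p in probabilities.items()}

# Hypothetical model outputs: probability that each user is malicious.
scores = {"user_a": 0.91, "user_b": 0.08, "user_c": 0.55}

default_labels = classify_users(scores)                # assumed threshold 0.5
strict_labels = classify_users(scores, threshold=0.8)  # stricter cut-off

print(default_labels)
print(strict_labels)
```

Raising the threshold flags fewer users, which tends to reduce false positives at the cost of more false negatives, so in practice the value would be chosen on the validation set.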
IV. EXPERIMENTAL EVALUATIONS AND RESULTS
A. Dataset Description
The TwiBot-20 dataset, designed specifically for social media bot detection, serves as a large and comprehensive benchmark for detecting Twitter bots. Its purpose is to address the difficulties posed by a small dataset size and to accurately reflect both the genuine users and the Twitter bots found in the real world. The collection comprises 229,573 users, 33,488,192 tweets, 8,723,736 user property items, and 455,958 follow relationships. It covers a broad range of automated accounts and authentic users so as to depict the real Twitter community more accurately. The dataset contains three types of user information, which may be used both for classifying individual users into two categories and for developing community-aware methods. The three modalities are semantic information, property information, and neighborhood information. The TwiBot-20 dataset is available for academic research and is hosted by the Bot Repository [22]. This benchmark is one of the most extensive collections of Twitter bot detection data available. It serves as a useful resource for training and assessing the proposed model, which aims to identify harmful users in online social networks, specifically in the context of Twitter bot identification.

To achieve optimal performance in identifying harmful user activity, it is vital to conduct experiments and carefully tune the settings. The properties of the dataset, the kind of malicious activity, and the computational resources available for training and optimization all play a role in the selection of parameters. Fine-tuning these parameters efficiently frequently requires iterative refinement based on performance data and domain expertise.

B. Experimental Results
Numerous indicators may be used to measure the success of a system built to identify harmful user behaviour using CLM and optimization techniques [28] [31]. The frequently used assessment metrics are considered here.

C. Accuracy
Accuracy assesses the overall efficacy of the model's predictions. It is the ratio of correctly identified cases (both malicious and non-malicious) to the total number of cases in the dataset. A higher accuracy indicates better performance. Accuracy is evaluated using Eq. (3), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. Fig. 4 compares the present model with previous models.

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

[Figure: bar chart "Performance of Accuracy"; accuracy for AIMDS (95), SVM (89), and RF (91)]
Fig. 4. Accuracy comparison.

D. Precision
Precision measures the proportion of correctly recognized malicious users among all cases predicted to be malicious [22]. It is the ratio of TP (malicious users accurately predicted) to the sum of TP and FP (non-malicious users wrongly categorized as malicious). A higher precision
suggests that there are fewer false positives. Eq. (4) is used to evaluate the precision. Fig. 5 compares the present model with previous models.

$$\mathrm{Prec} = \frac{TP}{TP + FP} \tag{4}$$

E. Recall
Recall (Rc), also known as sensitivity or true positive rate, measures the fraction of actual malicious users correctly recognized by the model. It is the ratio of TP to the sum of TP and FN (malicious users mistakenly categorized as non-malicious). A higher recall means that there are fewer false negatives. Eq. (5) is used to evaluate the recall. Fig. 5 compares the present model with previous models.

$$\mathrm{Rc} = \frac{TP}{TP + FN} \tag{5}$$

[Figure: bar chart "Performance"; precision and recall by technique (CLM, RF, SVM)]
Fig. 5. Comparison of precision and recall.

F. F1 Score
The F1 score (f1s) combines precision and recall into a single statistic that balances their trade-off. It is the harmonic mean of precision and recall and provides a comprehensive evaluation of the model's performance. A higher F1 score indicates a better balance of precision and recall. Eq. (6) evaluates the F1 score.

$$\mathrm{f1s} = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rc}}{\mathrm{Prec} + \mathrm{Rc}} \tag{6}$$

Table I and Fig. 6 summarize the overall performance of the CLM and optimization technique alongside the traditional algorithms. The experimental study of the CLM and optimization technique used a combination of CNN, LSTM, and genetic algorithms (GA) to assess user behaviour in order to identify malicious users. The performance of the suggested strategy was assessed on the user behaviour dataset, which consists of TwiBot-20 data gathered from an online platform. The outcomes show how well the CLM technique (convolutional neural network and LSTM) and optimization technique classify malicious users based on their behaviour patterns. Fig. 7 gives a comparison of the evaluation metrics.

The proposed methodology consistently outperforms existing methodologies and traditional models, as demonstrated by the assessment measures. The genetic algorithm's ability to adapt is a crucial factor in achieving enhanced performance through hyperparameter optimization and feature selection. Conventional models may have difficulty capturing the ever-changing and dynamic aspects of user behavior, while the proposed model excels in identifying intricate patterns.

[Figure: chart "Comparison"; metric values per technique]
Fig. 6. Performance of proposed approach.

TABLE I. PERFORMANCE EVALUATION
Metric      Proposed Technique   SVM   RF
Accuracy    95                   89    91
Precision   94                   88    92
Recall      93                   90    89
F1 Score    93                   89    90
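The four metrics reported in this section follow directly from the confusion counts TP, TN, FP, and FN. The sketch below shows the arithmetic on a small hand-made example. The label vectors are hypothetical and unrelated to the TwiBot-20 results, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with 1 = malicious and 0 = non-malicious."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 score as a dict."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rc = tp / (tp + fn) if (tp + fn) else 0.0
    f1s = 2 * prec * rc / (prec + rc) if (prec + rc) else 0.0
    return {"Acc": acc, "Prec": prec, "Rc": rc, "f1s": f1s}

# Hypothetical labels: 4 malicious and 6 non-malicious users.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

Here TP = 3, TN = 5, FP = 1, and FN = 1, so accuracy is 8/10 while precision, recall, and F1 score are all 3/4.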
[16] Terumalasetti, S. (2022, August). A Comprehensive Study on Review of AI Techniques to Provide Security in the Digital World. In 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT) (pp. 407-416). IEEE.
[17] Wu, X., Sun, Y. E., Du, Y., Xing, X., Gao, G., & Huang, H. (2020). An efficient malicious user detection mechanism for crowdsensing system. In Wireless Algorithms, Systems, and Applications: 15th International Conference, WASA 2020, Qingdao, China, September 13–15, 2020, Proceedings, Part I 15 (pp. 507-519). Springer International Publishing.
[18] Sarker, I. H., Kayes, A. S. M., Badsha, S., Alqahtani, H., Watters, P., & Ng, A. (2020). Cybersecurity data science: an overview from machine learning perspective. Journal of Big data, 7, 1-29.
[19] Wanda, P., Hiswati, M. E., & Jie, H. J. (2020). DeepOSN: Bringing deep learning as malicious detection scheme in online social network. IAES International Journal of Artificial Intelligence, 9(1), 146.
[20] Mou, G., & Lee, K. (2020). Malicious bot detection in online social networks: arming handcrafted features with deep learning. In Social Informatics: 12th International Conference, SocInfo 2020, Pisa, Italy, October 6–9, 2020, Proceedings 12 (pp. 220-236). Springer International Publishing.
[21] Samokhvalov, D. I. (2020). Machine learning-based malicious users' detection in the VKontakte social network. Труды института системного программирования РАН, 32(3), 109-117.
[22] Rabbani, M., Wang, Y. L., Khoshkangini, R., Jelodar, H., Zhao, R., & Hu, P. (2020). A hybrid machine learning approach for malicious behaviour detection and recognition in cloud computing. Journal of Network and Computer Applications, 151, 102507.
[23] https://botometer.osome.iu.edu/bot-repository/datasets.html [Dataset].
[24] Kim, J., Park, M., Kim, H., Cho, S., & Kang, P. (2019). Insider threat detection based on user behavior modeling and anomaly detection algorithms. Applied Sciences, 9(19), 4018.
[25] Qiu, J., Shen, X., Guo, Y., Yao, J., & Fang, R. (2019, August). Detecting malicious users in online dating application. In 2019 5th International Conference on Big Data Computing and Communications (BIGCOM) (pp. 255-260). IEEE.
[26] Kiran, K., Manjunatha, C., Harini, T. S., Shenoy, P. D., & Venugopal, K. R. (2019, March). Identification of anomalous users in Twitter based on user behaviour using artificial neural networks. In 2019 IEEE 5th International Conference for Convergence in Technology (I2CT) (pp. 1-5). IEEE.
[27] Hong, T., Choi, C., & Shin, J. (2018). CNN-based malicious user detection in social networks. Concurrency and Computation: Practice and Experience, 30(2), e4163.
[28] Yu, J., Wang, K., Li, P., Xia, R., Guo, S., & Guo, M. (2017). Efficient trustworthiness management for malicious user detection in big data collection. IEEE Transactions on Big Data, 8(1), 99-112.
[29] Saracino, A., Sgandurra, D., Dini, G., & Martinelli, F. (2016). Madam: Effective and efficient behavior-based android malware detection and prevention. IEEE Transactions on Dependable and Secure Computing, 15(1), 83-97.
[30] Khan, M. U. S., Ali, M., Abbas, A., Khan, S. U., & Zomaya, A. Y. (2016). Segregating spammers and unsolicited bloggers from genuine experts on twitter. IEEE Transactions on Dependable and Secure Computing, 15(4), 551-560.
[31] Khan, M. U. S., Ali, M., Abbas, A., Khan, S. U., & Zomaya, A. Y. (2016). Segregating spammers and unsolicited bloggers from genuine experts on twitter. IEEE Transactions on Dependable and Secure Computing, 15(4), 551-560.