
Misr University of Science and Technology

Faculty of Engineering
Department of Computer and Software
Final year project
Credit Card Fraud Detection Using Machine Learning
Submitted by:
NAME ID
Mohamed Ebrahim 76946
Ahmed Ebrahim 76953
Mahmoud Mohamed 80506
Mohamed Hassan 80834

Supervised by:
Prof. Dr. Heba Elnemr

2022-2023

Declaration
We hereby declare that the work presented in this thesis has not been submitted for any other degree or professional qualification, and that it is the result of our own independent work.

Names:

Date:

Acknowledgement
This endeavor would not have been possible without:

• Prof. Heba Elnemr, our supervisor, who supported us in all circumstances.
• Prof. Ashraf Mahrous, Head of the Computer and Software Department.
• Prof. Tamer Nassef, Vice Dean of the Faculty of Engineering.
• Prof. Ghada Amer, Dean of the Faculty of Engineering.

Thank you all.

Abstract

Credit card fraud is a serious problem that can result in significant financial losses
for both individuals and institutions. In recent years, machine learning algorithms
have proved to be an effective solution for detecting fraudulent transactions. In this
project, we investigate the use of various machine learning techniques such as
logistic regression, random forests, and support vector machines to identify
fraudulent credit card transactions. We evaluate the performance and accuracy of
these techniques using a publicly available credit card fraud dataset. The results
reveal that the Random Forest algorithm outperforms other algorithms when applied
to imbalanced data. Furthermore, three sampling techniques were employed to enhance the system's performance: random undersampling, random oversampling, and a hybrid technique that merges oversampling and undersampling methods.
This study demonstrates that machine learning, along with the utilization of
appropriate data sampling techniques, can be effective in detecting credit card fraud,
emphasizing the importance of developing and deploying such systems to protect
against fraudulent activities in financial institutions.

Contents
1 Chapter 1 Introduction .....................................................................................10
1.1 Introduction .................................................................................................10
1.1.1 What is credit card fraud ....................................................................11
1.1.2 Types of credit card fraud ..................................................................11
1.1.3 What is credit card fraud detection ....................................................12
1.1.4 Anomaly detection .............................................................................13
1.1.5 Data used to create the user profile includes .....................................13
1.2 Importance of machine learning in credit card fraud detection ..................14
2 Chapter 2 Literature Review ............................................................................16
2.1 First paper [4] ..............................................................................................16
2.2 Second paper [5] ..........................................................................................16
2.3 Third paper [6].............................................................................................17
2.4 Fourth paper [7] ...........................................................................................17
2.5 Fifth paper [8] ..............................................................................................17
3 Chapter 3 Methodology ...................................................................................19
3.1 Data set ........................................................................................................19
3.2 Data Preprocessing ......................................................................................20
3.2.1 Normalization ....................................................................................21
3.3 Machine Learning........................................................................................21
3.3.1 Types of Machine learning methods..................................................22
3.3.2 Supervised learning techniques ..........................................24
3.3.3 Regression ..........................................................................................24
3.4 Machine learning techniques .......................................................................25
3.4.1 Logistic regression: ............................................................................25
3.4.2 Decision tree: .....................................................................................26
3.4.3 Random forest: ...................................................................................26
3.4.4 K-nearest neighbor:............................................................................28
3.4.5 Naïve Bayesian Classifier: .................................................................28
3.4.6 Support Vector Machines: .................................................................29
3.5 Data Balancing ............................................................................................30
4 Chapter 4 Experimental results ........................................................................32
4.1 Evaluation Metrics: .....................................................................................33
4.1.1 Confusion Matrix ...............................................................................33
4.1.2 Accuracy ............................................................................................34
4.1.3 Precision .............................................................................................34
4.1.4 Recall .................................................................................................34
4.1.5 F1-score..............................................................................................34
4.2 Training ............................................................................................................35
4.3 Evaluation Strategy ..........................................................................................35
4.3.1 Evaluation of imbalanced Dataset .....................................................35
4.3.2 Evaluation of Undersampling Dataset ...............................................39
4.3.3 Evaluation of Oversampling Dataset .................................................42
4.3.4 Evaluation of Hybrid Sampling Dataset ............................................46
5 Conclusion and Future work ............................................................................54
5.1 Conclusion ...................................................................................................54
5.2 Recommendation .........................................................................................55
5.3 Future work .................................................................................................56
6 References ........................................................................................................57
Appendix A: .........................................................................................................60
Appendix B: .........................................................................................................60
Appendix C: .........................................................................................................60
Appendix D: .........................................................................................................60
Appendix E: .........................................................................................................61
Appendix F:..........................................................................................................61
Appendix G: .........................................................................................................64
Appendix H: .........................................................................................................67

Appendix I: ..........................................................................................................70

Table of Abbreviations
CCF Credit Card Fraud
CCFD Credit Card Fraud Detection
ML Machine Learning
LR Logistic Regression
DT Decision Tree
RF Random Forest
KNN K-Nearest Neighbor
SVM Support Vector Machine

List of Tables
Table 4-1 Confusion Matrix _________________________________________ 33
Table 4-2 Training performance for imbalanced dataset. ___________________ 36
Table 4-3 Testing performance for imbalanced dataset. ____________________ 37
Table 4-4 Training performance for Under sampling Dataset. _______________ 40
Table 4-5 Testing performance for Under sampling Dataset ________________ 40
Table 4-6 Training performance for Oversampling Dataset _________________ 43
Table 4-7 Testing performance for Oversampling Dataset. _________________ 44
Table 4-8 Training performance for Hybrid Sampling Dataset. ______________ 47
Table 4-9 Testing performance for Hybrid Sampling Dataset. _______________ 47

List Of Figures
Figure 3-1 a snapshot of the utilized features ____________________________ 20
Figure 3-2 Types of Machine learning _________________________________ 22
Figure 4-1 Computer Properties ______________________________________ 32
Figure 4-2 Imbalance Data __________________________________________ 36
Figure 4-3 Logistic Regression confusion matrix of imbalanced data ___________ 37
Figure 4-4 Random Forest confusion matrix of imbalanced data ____________ 38
Figure 4-5 SVM confusion matrix of imbalanced data ____________________ 38
Figure 4-6 Under sampling Data ______________________________________ 39
Figure 4-7 Logistic Regression confusion matrix of under sampling data ______ 41
Figure 4-8 Random Forest confusion matrix of undersampling data __________ 41
Figure 4-9 SVM confusion matrix of undersampling data __________________ 42
Figure 4-10 Over sampling __________________________________________ 43
Figure 4-11 Logistic Regression confusion matrix of Oversampling__________ 44
Figure 4-12 Random Forest confusion matrix of Oversampling _____________ 45
Figure 4-13 SVM confusion matrix of Oversampling ____________________ 45
Figure 4-14 Hybrid sampling _______________________________________ 46
Figure 4-15 Logistic Regression confusion matrix of Hybrid sampling _______ 48
Figure 4-16 Random Forest confusion matrix of Hybrid sampling ___________ 48
Figure 4-17 SVM confusion matrix of Hybrid sampling ___________________ 49
Figure 4-18 comparison of under sampling results ________________________ 49
Figure 4-19 comparison of oversampling results _________________________ 51
Figure 4-20 comparison of hybrid sampling results ________________________ 52

1 CHAPTER 1 INTRODUCTION
1.1 Introduction
In the last decade, there has been an exponential growth of the Internet. This has
sparked the proliferation and increase in the use of services such as e-commerce,
tap-and-pay systems, online bill payment systems, etc. However, fraudsters have also increased their activities, attacking transactions made using credit cards. As a
result, various protection mechanisms, such as credit card data encryption and
tokenization, have been implemented to protect credit card transactions [1].
E-commerce has come a long way since its inception. It has become an essential
tool for most organizations, companies, and government agencies to increase their
productivity in global trade. One of the main reasons for the success of e-commerce
is the easy online credit card transaction. Whenever we talk about monetary
transactions, we must also consider financial fraud. Credit card transactions have recently become one of the most common payment methods. Consequently, fraudulent activities have increased rapidly.
Losses related to credit card fraud will grow to $43 billion within five years and
climb to $408.5 billion globally within the next decade, according to a recent Nilson
Report [2], meaning that credit card fraud detection has become more crucial than
ever.
All parties involved in the payment lifecycle will experience the impact of these increasing costs: from the banks and credit card companies that foot the bill of such fraud, to the consumers who pay higher fees or receive lower credit scores, to the merchants and small businesses that are slapped with chargeback fees.
With digital crime and online fraud of all kinds on the rise, it’s more important
than ever for organizations to take firm and unambiguous steps to prevent payment
card fraud through advanced technology and strong security measures.

1.1.1 What is credit card fraud
Credit card fraud is the act of using another person’s credit card to make
purchases or request cash advances without the cardholder’s knowledge or consent.
These criminals may obtain the card itself through physical theft, though nowadays,
they are more often using digital methods to steal both the credit card number and
personal details to carry out fraudulent transactions.
There is some overlap between identity theft and credit card theft. Credit card
theft is one of the most common forms of identity theft. In such cases, a fraudster
uses an individual’s personal information, which is often stolen as part of a
cyberattack or data breach, to open a new account that the victim does not know
about. This activity is considered both identity fraud and credit card fraud.

1.1.2 Types of credit card fraud


Credit card fraud falls into two basic categories.
➢ Card-present fraud
➢ Card-not-present fraud

1.1.2.1 Card-present fraud


Card-present fraud is when the criminal uses a physical card, which has been stolen or duplicated, to make fraudulent purchases. Card-present fraud may occur
when a card is stolen, either through robbery, pickpocketing, or mail theft.
Criminals may also leverage card skimmers installed at frequently used payment
points to collect and store the card details when swiped; this data can then be used
to produce a duplicate payment card or clone.

1.1.2.2 Card-not-present fraud
Card-not-present fraud is when the criminal uses the details associated with
the card, such as the card number, accountholder name, and CVV code, without
having the card in their possession.
In some cases, card-not-present crime is accompanied by account takeover
techniques. This is when fraudsters contact a credit card issuer and purport to be a
legitimate cardholder to change information associated with the account, such as a
phone number or address. This will allow them to verify purchases and authenticate
activity, evading many fraud detection tools [3].

1.1.3 What is credit card fraud detection


Credit card fraud detection is the collective term for the policies, tools,
methodologies, and practices credit card companies and financial institutions take to
combat identity fraud and stop fraudulent transactions.
In recent years, as the amount of data has exploded and the number of payment card
transactions has skyrocketed, fraud detection has become significantly digitized and
automated. Most modern solutions leverage artificial intelligence (AI) and machine
learning (ML) to manage data analysis, predictive modeling, decision-making, fraud
alerts, and remediation activity, which occur when individual instances of credit card
fraud are detected.
Fraud detection involves monitoring the activities of populations of users to
estimate, perceive or avoid objectionable behavior, which consists of fraud,
intrusion, and defaulting.
Also, transaction patterns often change their statistical properties over the course of
time. In real world examples, the massive stream of payment requests is quickly
scanned by automatic tools, which determine which transactions to authorize.
Those are relevant problems that demand the attention of communities, such as
machine learning and data science, where the solution to these problems can be
automated.

1.1.4 Anomaly detection
Anomaly detection is the process of analyzing massive amounts of data points from both internal and external sources to identify unusual or unexpected patterns or data points in a dataset that deviate from the norm or expected behavior. It produces a framework of “normal” activity for each individual user and establishes regular patterns in their activity. It is used to identify outliers, anomalies, or suspicious
events that may indicate fraudulent or abnormal behavior, system malfunctions, or
other issues. Anomaly detection techniques are commonly used in various fields
such as cyber security, fraud detection, medical diagnosis, and predictive
maintenance.

1.1.5 Data used to create the user profile includes


➢ Purchase history and other historical data
➢ Location
➢ Device ID
➢ IP address
➢ Payment amount
➢ Transaction information.

When a transaction falls outside the scope of normal activity, the anomaly
detection tool will then alert the card issuer and, in some cases, the user. Depending
on the transaction details and risk score assigned to the action, these fraud detection
systems may flag the purchase for review or put a hold on the transaction until the
user verifies their activity.

Credit card fraud detection is an important problem in the financial industry, and
machine learning techniques can be used to help identify fraudulent transactions.
Machine learning algorithms can analyze patterns in transaction data and
automatically detect anomalies that may be indicative of fraud. This approach is
particularly useful for detecting fraud in real-time, allowing financial institutions to quickly respond to suspicious activity and protect their customers.

1.2 Importance of machine learning in credit card fraud detection

➢ Detection of Complex Patterns: Machine learning algorithms can analyze large volumes of transaction data and identify complex patterns and
anomalies that may indicate fraudulent activity. These algorithms can
uncover subtle relationships and behaviors that human analysts might
miss, allowing for more accurate and effective fraud detection.
➢ Real-time Monitoring: Machine learning models can be trained to monitor
transactions in real-time and make instant decisions on whether a
transaction is likely to be fraudulent. This enables financial institutions to
quickly identify and block suspicious transactions, preventing potential
fraud before it occurs.
➢ Adaptability to Evolving Fraud Techniques: Fraudsters are constantly
evolving their tactics to evade detection. Machine learning models can
adapt and learn from new fraud patterns as they emerge. By continuously
analyzing and updating the models with new data, financial institutions can
stay ahead of fraudsters and effectively detect new types of fraud.
➢ Reduced False Positives: Traditional rule-based systems may generate a
high number of false positives, leading to inconvenience for legitimate
customers. Machine learning models can significantly reduce false
positives by accurately distinguishing between genuine and fraudulent
transactions. This improves the overall user experience and reduces the
need for manual reviews.
➢ Scalability and Efficiency: Machine learning algorithms can handle large
volumes of data efficiently, making them suitable for processing vast
amounts of credit card transactions in real-time. This scalability allows
financial institutions to process transactions quickly and accurately, even
during peak periods, without compromising the detection accuracy.
➢ Risk Score Calculation: Machine learning models can assign risk scores to
individual transactions based on their likelihood of being fraudulent. These
risk scores can help prioritize and focus resources on the highest-risk
transactions, improving the efficiency of fraud detection and investigation
efforts.
➢ Continuous Learning and Improvement: Machine learning models can
continuously learn and improve over time as they are exposed to more data.
By regularly retraining the models with new data, financial institutions can
enhance the accuracy and effectiveness of their fraud detection systems.
Overall, machine learning plays a crucial role in credit card fraud detection by
enabling the analysis of large datasets, real-time monitoring, adaptability to evolving
fraud techniques, and the ability to reduce false positives. By leveraging the power
of machine learning, financial institutions can enhance their fraud detection
capabilities, protect their customers, and minimize financial losses due to fraudulent
activity.
This report is organized as follows. Chapter 2 presents a comprehensive
summary of previous research. Chapter 3 demonstrates the proposed methodology
used to solve the problem of Credit card fraud detection. The experimental results
are displayed in chapter 4. Chapter 5 summarizes the proposed work.

2 CHAPTER 2 LITERATURE REVIEW
In this section, we explore several research works and advancements in the use of machine learning algorithms for credit card fraud detection.

2.1 First paper [4]


In the work of [4], different supervised machine learning techniques are
applied for fraud detection purposes, including logistic regression, decision tree,
random forest, K-Nearest Neighbors (KNN), and XGBoost. Furthermore, this study
endeavors to assess their efficacy on genuine data and construct an ensemble model
that could serve as a viable resolution for this issue.
KNN and logistic regression tend to have a better overall performance than the other models, which makes them more effective in detecting fraudulent transactions. Their low false-negative rates make them sounder at capturing fraudulent transactions and extracting the underlying data patterns. Furthermore, to enhance the
prediction accuracy of individual classifiers, a voting classifier based on a new
ensemble model will be implemented. This approach combines various classification
techniques and aims to minimize errors in each model. Consequently, the ensemble
model is expected to generate more precise predictions compared to the single
classifiers.

2.2 Second paper [5]


In this research [5], the authors investigate utilizing both linear and nonlinear
statistical modeling techniques along with machine learning models on real credit
card transaction data. The models constructed are supervised fraud models that
attempt to identify which transactions are most likely fraudulent. This work
incorporates the processes of data exploration, data cleaning, variable creation,
feature selection, model algorithms, and results. Five different supervised models
are explored and compared including logistic regression, neural networks, random
forest, boosted tree and support vector machines (SVM). The boosted tree model
performs the most outstanding fraud detection.

2.3 Third paper [6]
This study aims to explore fifteen various techniques for detecting credit card
fraud, which includes Neural Networks, Decision Trees, genetic algorithm, case-
Based Reasoning, Bayesian Network, SVM, KNN, Artificial Immune System,
Hidden Markov Model, fuzzy neural network, fuzzy Darwinian system, Inductive
Logic programming, Clustering Techniques, Logistic Regression, and Outlier
Detection. The study focuses on investigating the relative effectiveness of these
methods to achieve the fundamental goal of detecting fraud in credit cards, along
with the advantage and disadvantages of every technique.

2.4 Fourth paper [7]


The aim of this work is to compare several well-known supervised learning
algorithms in order to accurately distinguish between authentic and fraudulent
transactions. The authors analyzed KNN, Naive Bayes, Decision Tree, Logistic
Regression, and Random Forest using an imbalanced dataset to identify the most
effective model among the aforementioned options for detecting credit card fraud.
The obtained analysis determined that the Decision Tree model is the most
appropriate for predicting credit card fraud. While the KNN model showed greater
sensitivity than the Decision Tree, its testing time was significantly longer.
Considering that rapid fraud detection is essential, the authors concluded that the
Decision Tree model is the optimal choice.

2.5 Fifth paper [8]

This study highlights the application of machine learning techniques to develop a model for credit card fraud detection utilizing multiple classifiers and data
balance techniques. To develop the credit card fraud detection model, the authors
utilized several machine learning algorithms such as Decision Tree, Logistic
Regression, XGBoost Classifier, and Artificial Neural Network. These models are
tested on imbalanced data, and XGBoost was found to perform well. Afterward, XGBoost was combined with various sampling techniques, including oversampling,
under sampling, and Synthetic Minority Oversampling Technique (SMOTE), to
improve the results. Out of all the techniques, oversampling proved to be the most
effective in enhancing the performance of the XGBoost model.
In this chapter we have conducted a comprehensive review of literature on
existing fraud detection techniques, which allowed us to gain insight into the various
approaches used in this field. The upcoming chapter will demonstrate the suggested
methodology in this project, which aims to create a proficient system for detecting
credit card fraud.

3 CHAPTER 3 METHODOLOGY
All models were implemented in Visual Studio Code and Anaconda.
This chapter outlines the approach we used to develop a credit card fraud
detection model. First, we selected a dataset consisting of credit card transactions,
which served as the foundation for our analysis. Subsequently, data preprocessing
technique is implemented. We then utilized several machine learning algorithms,
including Logistic Regression, Random Forest, and SVM, to develop our model. To
address the issue of data imbalance, a common challenge in fraud detection, we
experimented with various sampling techniques such as oversampling, under
sampling, and hybrid method that incorporates oversampling and under sampling, to
enhance the performance of our models. Throughout the process, we analyzed and
compared the performance of different models, assessed their strengths and
weaknesses, and ultimately selected the most suitable model for credit card fraud
detection. In the following sections, we provide a detailed account of the research
methodology we employed to develop our credit card fraud detection framework.

3.1 Data set


In this research, the adopted Credit Card Fraud Detection dataset can be
downloaded from Kaggle [9]. This dataset contains transactions made over two days in September 2013 by European cardholders. The dataset was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) [10] on big data mining and fraud detection. The dataset contains 31 numerical features. Since some of the input variables contain sensitive financial information, the Principal Component Analysis (PCA) transformation was conducted on these input variables to keep the data anonymous. The features V1, V2, ..., V28 are the principal components obtained with PCA. The features that were not transformed are Time, Amount, and Class.
Feature "Time" holds the seconds elapsed between each transaction and the first transaction in the dataset.
Feature "Amount" is the amount of the transaction made by credit card.
Feature "Class" represents the label and takes only two values: 1 in case of a fraudulent transaction and 0 otherwise.
Figure 3-1 illustrates a snapshot of the features that will be utilized for credit card
fraud detection.
The dataset contains 284,807 transactions, of which 492 are frauds and the rest are genuine. Considering these numbers, we can see that the dataset is highly imbalanced: only 0.173% of the transactions are labeled as fraud. Since the distribution ratio of the classes plays a significant role in model accuracy and precision, preprocessing of the data is crucial [Appendix B].

Figure 3-1 a snapshot of the utilized features

3.2 Data Preprocessing


Data pre-processing involves the task of cleaning and organizing data. Data cleaning removes any duplicate or irrelevant data, inconsistent values, and missing data, and ensures that all the data is in a consistent format across all records.
We would first like to check whether there are missing values. To do this, we can use the function dataframe.isnull() in the pandas library. It returns True for missing components and False for non-missing cells. However, when the dimension of the dataset is large, it can be difficult to spot missing values this way. In general, we may just want to know whether there are any missing values at all before trying to find where they are. The function dataframe.isnull().values.any() returns True when at least one missing value occurs in the data, while dataframe.isnull().sum() returns the number of missing values per column. Furthermore, to deal with duplicate values, the pandas function dataframe.drop_duplicates() is used to remove duplicate rows from the data frame [Appendix C].
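A minimal sketch of these checks, assuming the dataset has been loaded into a pandas DataFrame (the file name creditcard.csv is assumed here, following the Kaggle dataset):

```python
import pandas as pd

# Load the Kaggle credit card fraud dataset (file name assumed)
df = pd.read_csv("creditcard.csv")

# True if there is at least one missing value anywhere in the frame
print(df.isnull().values.any())

# Number of missing values per column
print(df.isnull().sum())

# Remove duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()
```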
3.2.1 Normalization
Normalization is a technique used in machine learning to rescale numeric
variables to a common scale, typically between 0 and 1. Normalization is important
because many machine learning algorithms assume that the input features are on a
similar scale, and features that are on different scales can have a disproportionate
impact on the model's performance. Normalization is done by subtracting the
minimum value of the feature and dividing by the range (the difference between the
maximum and minimum values) of the feature. This results in values between 0 and
1, where 0 represents the minimum value and 1 represents the maximum value [11].
Normalization is a useful technique for improving the performance of machine
learning algorithms, especially those that are sensitive to the scale of the input
features. By rescaling the features to a common scale, normalization can help to
reduce the impact of outliers and improve the convergence of some optimization
algorithms [Appendix D] [12].
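As an illustration, min-max normalization of the Amount feature can be sketched both directly from the formula and with scikit-learn's MinMaxScaler (the column name, and df from the preceding sketch, are assumptions for illustration):

```python
from sklearn.preprocessing import MinMaxScaler

# Direct application of the formula: x' = (x - min) / (max - min)
df["Amount_norm"] = (df["Amount"] - df["Amount"].min()) / (
    df["Amount"].max() - df["Amount"].min()
)

# The same rescaling using scikit-learn
scaler = MinMaxScaler()
df["Amount_norm"] = scaler.fit_transform(df[["Amount"]]).ravel()
```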

3.3 Machine Learning


Machine learning is a branch of artificial intelligence (AI) and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving in accuracy. Machines improve over time, becoming increasingly accurate when making predictions or classifications, or uncovering data-driven insights. Machine learning works in three basic steps: a decision process that uses a combination of data and algorithms to predict patterns and classify data, an error function that evaluates the accuracy of those predictions, and an optimization process that adjusts the model to best fit the data points [13].
Machine learning is a family of statistical and mathematical modeling
techniques that use a variety of approaches to automatically learn and improve the
prediction of a target objective without explicit programming.

3.3.1 Types of Machine learning methods
The learning algorithms can be categorized into three major types, such as
supervised, unsupervised, and reinforcement learning [14]. Figure 3-2 displays the
basic types of machine learning approaches.

Figure 3-2 Types of Machine learning

3.3.1.1 Supervised learning


In supervised learning, the operator provides the machine learning algorithm with a known dataset that includes desired inputs and outputs, and the algorithm must find a method to determine how to produce those outputs from the inputs. While the operator knows the correct answers to the problem, the algorithm identifies patterns in the data, learns from observations, and makes predictions. The algorithm makes predictions and is corrected by the operator, and this process continues until the algorithm achieves a high level of accuracy and performance.

Supervised learning feeds historical input and output data in machine learning
algorithms, with processing in between each input/output pair that allows the
algorithm to shift the model to create outputs as closely aligned with the desired
result as possible. Common algorithms used during supervised learning include
linear regression, SVM, Decision tree, Random Forest, and KNN.
The machine learning techniques proceed by partitioning the input data into
two sets, training and test set. Afterward, the input features of the training data are
extracted. These features are used to build and train the model using a suitable
machine learning algorithm. Training is the process through which the model learns
or recognizes the patterns in the given data for making suitable predictions. The test set contains known labels that are withheld during training and used only for evaluation. Hence, the model is trained on the training set and tested on the test set [15].

3.3.1.2 Unsupervised learning


While supervised learning requires users to help the machine learn,
unsupervised learning doesn't use the same labeled training sets and data. Instead,
the machine looks for less apparent patterns in the data. This machine learning type
is extremely helpful in identifying patterns and using data to make decisions.
The machine learning algorithm studies data to identify patterns. There is no answer
key or human operator to provide instruction. Instead, the machine determines the
correlations and relationships by analyzing available data. In an unsupervised
learning process, the machine learning algorithm is left to interpret large data sets
and address that data accordingly. The algorithm tries to organize data in some way
to describe its structure. This might mean grouping the data into clusters or arranging
it in a way that looks more organized [16].

3.3.1.3 Reinforcement learning


Reinforcement learning focuses on regimented learning processes, where a
machine learning algorithm is equipped with a set of actions, parameters and end
values. By defining the rules, the machine learning algorithm then tries to explore
different options and possibilities, monitoring and evaluating each result to
determine which one is optimal. Reinforcement learning teaches machines through trial and error. It learns from past experiences and adapts its approach in response to the situation to achieve the best possible result [17].
3.3.2 Supervised learning techniques
In this work, we will focus on and explore the basics of supervised learning
approaches and find out which strategy is suitable for achieving the goal of our
project, which is detecting fraud in credit cards. The most common supervised learning tasks are classification and regression.

3.3.2.1 Classification
Classification refers to the problem of identifying the category to which an input belongs among a possible set of categories. The possible categories are labelled, and models are generally learned from training data.
models can be created using simple thresholds, regression techniques, or other
machine learning techniques like Neural Networks, Random Forests, or Markov
models. Classification is a supervised learning algorithm where a training set of
correctly identified or labelled data is available. The model learned from training
data to identify the category or class of the input feature or data is called a classifier.

Types of classifiers
➢ Binary classifier that identifies the input as belonging to one of the two output
categories.
➢ Multi-class classification has at least two mutually exclusive class labels,
where the goal is to predict to which class a given input example belongs.
➢ Multi-label classification can predict more than one class for each input
example. In this case, there is no mutual exclusion because the input example
can have more than one label.

3.3.3 Regression

Regression is a process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends or house prices. The task of the
Regression algorithm is to find the mapping function to map the input variable (x)
to the continuous output variable (y) [18].

Example: Suppose we want to do weather forecasting, so for this, we will use the
regression algorithm. In weather prediction, the model is trained on the past data,
and once the training is completed, it can easily predict the weather for future days
[19].

Types of Regression Algorithm:

Regression algorithms can be classified into

➢ Simple Linear Regression


➢ Multiple Linear Regression
➢ Polynomial Regression
➢ Support Vector Regression
➢ Decision Tree Regression
➢ Random Forest Regression

3.4 Machine learning techniques

3.4.1 Logistic regression:


Logistic regression belongs to the family of supervised machine learning
models. It is also considered a discriminative model, which means that it attempts to
distinguish between classes (or categories).
Logistic regression is used when the dependent variable can have one of two
values, such as true or false, or success or failure. Logistic regression models can be
used to predict the probability of a dependent variable occurring. Generally, the
output values must be binary. A sigmoid curve can be used to map the relationship
between the dependent variable and independent variables.

Types of logistic regression:

There are three types of logistic regression models, which are defined based
on categorical response:

1- Binary logistic regression: In this approach, the response or dependent variable is dichotomous in nature, i.e., it has only two possible outcomes
(e.g., 0 or 1). Some popular examples of its use include predicting if an e-mail
is spam or not spam or if a tumor is malignant or not malignant. Within logistic regression, this is the most used approach, and more generally, it is one of the
most common classifiers for binary classification.
2- Multinomial logistic regression: In this type of logistic regression model, the
dependent variable has three or more outcomes; however, these values have
no specified order. For example, movie studios want to predict what genre of
film a moviegoer is likely to see to market films more effectively. A
multinomial logistic regression model can help the studio to determine the
strength of influence a person's age, gender, and dating status may have on the
type of film that they prefer. The studio can then orient an advertising
campaign for a specific movie toward a group of people likely to see it.
3- Ordinal logistic regression: This type of logistic regression model is
leveraged when the response variable has three or more possible outcomes,
but in this case, these values have a defined order. Examples of ordinal
responses include grading scales from A to F or rating scales from 1 to 5.
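As a minimal sketch, binary logistic regression can be fit with scikit-learn; the variables X_train, y_train, and X_test are assumed to come from an earlier train/test split. Internally, the model applies the sigmoid sigma(z) = 1 / (1 + e^(-z)) to a linear combination of the features to obtain a class probability:

```python
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test are assumed to exist from a prior split.
# max_iter is raised because the solver may need extra iterations to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Probability of the positive (fraud) class, produced via the sigmoid
fraud_prob = clf.predict_proba(X_test)[:, 1]
y_pred = clf.predict(X_test)
```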

3.4.2 Decision tree:


A decision tree is a predictive model for classification and regression problems that maps the inputs to the possible classes. This supervised technique has a tree-like structure that contains a root node, internal nodes that split the data based on the features, and leaves. At each node, the records in the dataset are separated based on conditions on the features, and the splitting process aims to achieve the purest possible subsets [20].

3.4.3 Random forest:


The instability of single trees and their sensitivity to the training data led to the development of another model: random forests. Since each tree is built independently of the others, the computational efficiency of random forests is comparatively better [21].
Random Forest is a popular algorithm that is often used in credit card fraud
detection. In this context, the algorithm can be used to classify transactions as either
fraudulent or non-fraudulent based on various features and attributes of the
transactions.

One advantage of using Random Forest for credit card fraud detection is that
it can handle high-dimensional data with many features, which is often the case in
fraud detection. Additionally, Random Forest is a powerful ensemble algorithm that
can reduce the risk of overfitting and improve the accuracy of the model.
To use Random Forest for credit card fraud detection, the algorithm is trained
on a dataset of historical credit card transactions, where each transaction is labeled
as either fraudulent or non-fraudulent. The algorithm learns to classify new
transactions based on the patterns and relationships found in the historical data.
Once the Random Forest model is trained, it can be used to classify new
transactions as either fraudulent or non-fraudulent in real-time. The model will
analyze the features of each transaction and predict whether it is likely to be
fraudulent or not. If the model predicts a transaction as fraudulent, the transaction
can be flagged for further investigation by the relevant authorities.
Overall, Random Forest is a powerful algorithm that can be used to improve
the accuracy of credit card fraud detection. By analyzing the features of credit card
transactions, the algorithm can learn to identify patterns and relationships that are
indicative of fraud and make accurate predictions in real-time.

3.4.3.1 Benefits of Random Forest


• Reduced risk of overfitting: Decision trees run the risk of overfitting as they
tend to tightly fit all the samples within training data. However, when there’s
a robust number of decision trees in a random forest, the classifier won’t
overfit the model since the averaging of uncorrelated trees lowers the overall
variance and prediction error.
• Provides flexibility: Since random forest can handle both regression and
classification tasks with a high degree of accuracy, it is a popular method
among data scientists. Feature bagging also makes the random forest classifier
an effective tool for estimating missing values as it maintains accuracy when
a portion of the data is missing.
• Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or contribution, to the model. There are a few ways to evaluate feature importance. Gini importance and mean decrease in impurity (MDI) are usually used to measure how much the model's accuracy decreases when a given variable is excluded. Permutation importance, also known as mean decrease accuracy (MDA), is another importance measure; MDA identifies the average decrease in accuracy from randomly permuting the feature values in out-of-bag samples [22]. A short code sketch illustrating these importances is given below.
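A minimal sketch of fitting a random forest and reading out the MDI (Gini) feature importances with scikit-learn; the variables X_train, y_train, X_test, and feature_names are assumed, and the hyperparameters are illustrative rather than the project's exact settings:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 independently grown trees; random_state fixes the result for repeatability
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Mean-decrease-in-impurity (Gini) importance of each feature
for name, score in zip(feature_names, rf.feature_importances_):
    print(f"{name}: {score:.4f}")
```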

3.4.4 K-nearest neighbor:


The KNN algorithm is one of the most famous classification algorithms used
for predicting the class of a record or (sample) with unspecified class based on the
class of its neighbor records [23].
The algorithm consists of three steps, sketched in code below:
1. Calculate the distance of the input record from all training records.
2. Sort the training records by distance and select the K nearest neighbors.
3. Assign the class that holds the majority among the K nearest neighbors (i.e., the class observed most often among those neighbors becomes the class of the input record).
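A minimal NumPy sketch of these three steps for a single input record (Euclidean distance and NumPy arrays are assumed for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 1: distance of the input record from all training records
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: sort by distance and keep the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among the labels of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```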

3.4.5 Naïve Bayesian Classifier:


Naive Bayesian classifiers assume that the effect of an attribute value on a
given class is independent of the values of the other attributes. This assumption is
called class conditional independence. It is made to simplify the computation
involved and, in this sense, is considered naïve. “It is not a single algorithm but a
family of algorithms where all of them share a common principle, i.e., every pair of
features being classified is independent of each other.” [24].

The main steps of Naïve Bayesian Classifier


1- Convert the given dataset into frequency tables.
2- Generate a Likelihood table by finding the probabilities of given features.
3- Now, use Bayes theorem to calculate the posterior probability.

3.4.6 Support Vector Machines:
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems. The main idea behind SVM is to find the best line (or hyperplane) that separates the data into different classes in a high-dimensional space. The line that provides the largest margin between the classes is considered the best. SVMs are commonly used to detect cancerous cells based on millions of images, or to predict future driving routes with a well-fitted regression model [25].

Furthermore, SVMs are employed in applications like handwriting recognition, intrusion detection, face detection, email classification, gene classification, and web page classification.
One of the motivations for recommending SVM in machine learning is its ability to handle both classification and regression on linear and non-linear data. Another reason is that SVMs can find complex relationships between data without needing a lot of manual transformations. SVM is also a great option when working with smaller datasets that have tens to hundreds of thousands of features. SVMs typically produce more accurate results than other algorithms because of their ability to handle small, complex datasets [26].

SVM works by transforming the data into a higher-dimensional space, known as a feature space, where a hyperplane can be used to separate the data. The goal of
SVM is to find the hyperplane with the maximum margin between the classes, which
is known as the maximum margin classifier. The margin is the distance between the
hyperplane and the closest data points, known as support vectors. These support
vectors are the critical elements of SVM, as they determine the location of the
hyperplane.
SVM can be used for both linear and non-linear classification problems. For
linear problems, the data can be separated by a single straight line, or hyperplane.
For non-linear problems, SVM uses a technique known as the kernel trick, which maps the data into a higher-dimensional space where a linear separation can be
achieved. Common kernels used in SVM include the radial basis function (RBF)
kernel, the polynomial kernel, and the sigmoid kernel.
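A minimal scikit-learn sketch of an SVM classifier with an RBF kernel (the hyperparameter values are illustrative, and the split variables are assumed to exist from earlier):

```python
from sklearn.svm import SVC

# The RBF kernel maps the data implicitly into a higher-dimensional space;
# C trades off margin width against training errors.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
```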
Based on the previous review, logistic regression, random forest, and SVM
classifiers were adopted to develop the proposed credit card fraud detection system.

3.5 Data Balancing


Classification techniques attempt to categorize data into different buckets. In an imbalanced dataset, the bias in the data can influence many machine learning algorithms, leading some to ignore the minority class entirely. This is a problem, as it is typically the minority class for which predictions are most important.

The two main approaches to randomly resampling an imbalanced dataset are random undersampling and random oversampling. There is also a contemporary technique that is considered quite interesting, as it merges the methods of oversampling and undersampling.

3.5.1.1 Random Undersampling Imbalanced Datasets:


Random undersampling involves randomly selecting examples from the majority class to delete from the dataset. This has the effect of reducing the number of examples in the majority class in the transformed version of the dataset. This approach may be more suitable for datasets where, despite the class imbalance, there are enough examples in the minority class that a useful model can be fit. A limitation of undersampling is that the examples deleted from the majority class may be useful, important, or perhaps critical to fitting a robust decision boundary [27].
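A minimal sketch of random undersampling using the imbalanced-learn library (its use here is an assumption; the project code may implement the resampling differently):

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly delete majority-class examples until both classes are the same size
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
```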

3.5.1.2 Random Oversampling Imbalanced Datasets:


Random oversampling involves randomly duplicating examples from the minority class and adding them to the dataset. This means that examples from the minority class can be chosen and added to the new, more balanced dataset multiple times; they are selected from the original dataset, added to the new dataset, and then returned or “replaced” in the original dataset, allowing them to be selected again. This technique can be effective for machine learning algorithms that are affected by a skewed distribution and for which multiple duplicate examples of a given class can influence the fit of the model. This might include algorithms that seek good splits of the data, such as support vector machines. It might be useful to tune the target class distribution: in some cases, seeking a balanced distribution for a severely imbalanced dataset can cause affected algorithms to overfit the minority class, leading to increased generalization error. The increase in the number of examples for the minority class, especially if the class skew was severe, can also result in a marked increase in the computational cost when fitting the model, especially considering that the model sees the same examples again and again.
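The corresponding random oversampling sketch, again assuming imbalanced-learn:

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class examples (sampling with replacement)
ros = RandomOverSampler(sampling_strategy=1.0, random_state=42)
X_over, y_over = ros.fit_resample(X_train, y_train)
```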

3.5.1.3 Hybrid Sampling Imbalanced Datasets:


Here we suggest combining both the random oversampling and undersampling approaches. For example, a modest amount of oversampling can be applied to the minority class to improve the model's exposure to these examples, whilst also applying a modest amount of undersampling to the majority class to reduce the bias towards that class. The resulting hybrid sampling distribution is illustrated in Figure 4-14.
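A sketch of this hybrid approach, chaining modest oversampling with modest undersampling via imbalanced-learn; the 0.5 and 1.0 ratios are illustrative, not the project's exact settings:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# First oversample the minority class up to half the majority size,
# then undersample the majority class down to the new minority size.
over = RandomOverSampler(sampling_strategy=0.5, random_state=42)
X_mid, y_mid = over.fit_resample(X_train, y_train)
under = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_hybrid, y_hybrid = under.fit_resample(X_mid, y_mid)
```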
The forthcoming chapter will highlight and discuss the obtained system results.

4 CHAPTER 4 EXPERIMENTAL RESULTS

In this chapter, we present the results of our experiments, which involved testing the performance of multiple machine learning models, including Logistic Regression, Random Forest, and SVM, with various data sampling techniques, such as oversampling, undersampling, and hybrid sampling. The evaluation metrics are
presented, and the impact of different parameters on the performance of the models
is discussed. Furthermore, we compare our models' performance with state-of-the-
art models and highlight the strengths and limitations of our approach. Overall, this
chapter provides a comprehensive analysis of the experimental results and offers
insights into the effectiveness of our solution.
All the experiments are conducted on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz, with 16.0 GB of RAM and an NVIDIA GeForce GTX 1650 Ti GPU. Figure 4-1 shows
the utilized computer properties.

Figure 4-1 Computer Properties

4.1 Evaluation Metrics:
The proposed credit card fraud detection models are assessed using the following
metrics:

4.1.1 Confusion Matrix


A confusion matrix is a table that is often used to describe the performance of
a classification model on a set of test data for which the true values are known. It is
a way to visualize how well a machine learning algorithm is performing in terms of
the precision, recall, accuracy, and F1 score. The confusion matrix is typically a
square matrix with the number of rows and columns equal to the number of classes
in the problem. The rows in the matrix represent the true classes, while the columns
represent the predicted classes. Therefore, the diagonal elements of the matrix
represent the number of correct predictions for each class, while the off-diagonal
elements represent the misclassifications. Table 4-1 presents a confusion matrix for
a binary classification problem.

                | Predicted Negative  | Predicted Positive
Actual Negative | True Negative (TN)  | False Positive (FP)
Actual Positive | False Negative (FN) | True Positive (TP)

Table 4-1 Confusion Matrix


True positive (TP) denotes the number of times the model correctly predicted a positive class, while false positive (FP) represents the number of times the model incorrectly predicted a positive class. Similarly, false negative (FN) depicts the number of times the model incorrectly predicted a negative class, whereas true negative (TN) represents the number of times the model correctly predicted a negative class.

4.1.2 Accuracy
The accuracy is used to find the portion of correctly classified values. It tells us how
often our classifier is right. It is the sum of all true values divided by total values.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
4.1.3 Precision

Precision is used to calculate the model's ability to classify positive values correctly. It is the true positives divided by the total number of predicted positive values.

Precision = TP / (TP + FP)

4.1.4 Recall
It is used to calculate the model's ability to predict positive values. "How often
does the model predict the correct positive values?". It is the true positives divided
by the total number of actual positive values.
Recall = TP / (TP + FN)

4.1.5 F1-score
It is the harmonic mean of Recall and Precision. It is useful when you need to take
both Precision and Recall into account.

F1-Score = (2 × Precision × Recall) / (Precision + Recall)
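All four metrics follow directly from the confusion matrix; a minimal scikit-learn sketch (y_test and y_pred are assumed to come from one of the fitted models):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy :", accuracy_score(y_test, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_test, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_test, y_pred))     # TP/(TP+FN)
print("F1-score :", f1_score(y_test, y_pred))         # harmonic mean of the two
```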

4.2 Training
To prepare the dataset for training, validation, and testing, it was divided into 70% for training and 30% for validation and testing. Based on these percentages, the numbers of genuine and fraudulent transactions in the training set are 192,622 and 311, respectively. The validation set contains 48,143 genuine and 91 fraudulent transactions. Finally, the test set contains 42,488 genuine and 71 fraudulent transactions [Appendix E].
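A sketch of how such a split can be produced with scikit-learn; X and y denote the preprocessed features and labels (assumed names), the second split fraction of about 0.47 is chosen so the reported validation and test counts are approximately reproduced, and the stratification is an assumption that preserves the fraud ratio in every subset:

```python
from sklearn.model_selection import train_test_split

# 70% training; the remaining 30% is split again into validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.47, stratify=y_rest, random_state=42)
```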

4.3 Evaluation Strategy


To improve the quality of the data, it initially undergoes a preprocessing stage, making it easier for us to extract meaningful insights from big data. This process involves organizing, cleaning, and normalizing data that may be incomplete, inaccurate, or inconsistent.
Subsequently, the collected data was found to be imbalanced, as there were only 492 fraudulent transactions out of a total of 284,807 transactions. Figure 4-2 represents the count of each class category in the data. Since class 0 (genuine transactions) constitutes more than 99 percent of the data distribution, biased results and poor model performance may arise. Thus, data sampling techniques must be applied to overcome the imbalanced data issue.

4.3.1 Evaluation of imbalanced Dataset


The results of utilizing logistic regression, random forest, and SVM classifiers
on the imbalanced dataset are presented in Table 4-2 and Table 4-3 for training and
testing stages, respectively, depicting their precision, recall, F1 score, and accuracy.
Furthermore, the confusion matrices for applying the three machine learning models
are illustrated in Figure 4-3, Figure 4-4, and Figure 4-5.

Figure 4-2 Imbalance Data

Model               | Precision | Recall | F1 Score | Accuracy
Logistic Regression | 0.866     | 0.607  | 0.714    | 0.999
Random Forest       | 1.0       | 1.0    | 1.0      | 1.0
SVM                 | 0.980     | 0.794  | 0.877    | 0.999

Table 4-2 Training performance for imbalanced dataset.

Model               | Precision | Recall | F1 Score | Accuracy
Logistic Regression | 0.999     | 0.563  | 0.720    | 0.781
Random Forest       | 0.999     | 0.788  | 0.88     | 0.894
SVM                 | 0.999     | 0.661  | 0.796    | 0.830

Table 4-3 Testing performance for imbalanced dataset.

Figure 4-3 Logistic Regression confusion matrix of imbalanced data

Figure 4-4 Random Forest confusion matrix of imbalanced data

Figure 4-5 SVM confusion matrix of imbalanced data

The results show that, at the testing stage, all three models perform inadequately, but Random Forest has the best outcomes, with precision, recall, F1 score, and accuracy of 99.9%, 78.8%, 88%, and 89.4%, respectively, indicating that it is the best-performing model in this experiment. These results reveal the biased behavior of the algorithms during validation, reflecting the imbalanced nature of the data.

4.3.2 Evaluation of Undersampling Dataset


In this section, we applied the undersampling technique to balance the
dataset, yielding 311 fraud and 311 non-fraud transactions. Figure 4-6 depicts the
count of each class category after undersampling. Table 4-4 and Table 4-5 showcase
the outcomes of employing logistic regression, random forest, and SVM classifiers
on the undersampled dataset, displaying the precision, recall, F1 score, and
accuracy for each classifier during the training and testing phases. Additionally,
Figure 4-7, Figure 4-8, and Figure 4-9 present the confusion matrices, offering a
visual representation of how the three machine learning models perform [Appendix G].
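The balancing step itself is a single call; a minimal sketch, assuming features X and labels y from the preprocessed training data (the full pipeline is in Appendix G):

from imblearn.under_sampling import RandomUnderSampler

# shrink the majority (genuine) class down to the size of the fraud class
undersample = RandomUnderSampler(sampling_strategy='majority')
X_balanced, y_balanced = undersample.fit_resample(X, y)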

Figure 4-6 Undersampled data class distribution

Model                 Precision   Recall   F1 Score   Accuracy
Logistic Regression   0.966       0.916    0.940      0.942
Random Forest         0.981       1.0      0.990      0.990
SVM                   0.985       0.871    0.924      0.929

Table 4-4 Training performance for the undersampling dataset.

Model                 Precision   Recall   F1 Score   Accuracy
Logistic Regression   0.967       0.943    0.955      0.956
Random Forest         0.969       0.929    0.949      0.950
SVM                   0.981       0.887    0.932      0.935

Table 4-5 Testing performance for the undersampling dataset.

Figure 4-7 Logistic Regression confusion matrix of undersampling data

Figure 4-8 Random Forest confusion matrix of undersampling data

Figure 4-9 SVM confusion matrix of undersampling data

4.3.3 Evaluation of Oversampling Dataset


In this section, we employed the oversampling technique to address the
imbalance in the dataset, yielding 173,359 fraud transactions alongside the 192,622
genuine transactions. Figure 4-10 visualizes the distribution of each class category
after applying the oversampling approach. The outcomes of employing logistic
regression, random forest, and SVM classifiers on the oversampled dataset are
outlined in Table 4-6 and Table 4-7 for the training and testing stages,
respectively, presenting the precision, recall, F1 score, and accuracy metrics.
Additionally, the confusion matrices for the three machine learning models are
provided for further clarity in Figure 4-11, Figure 4-12, and Figure 4-13 [Appendix H].
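A minimal sketch of the oversampling step, assuming features X and labels y from the preprocessed training data (the full pipeline is in Appendix H):

from imblearn.over_sampling import RandomOverSampler

# replicate fraud samples until the fraud class reaches 90% of the genuine class
oversample = RandomOverSampler(sampling_strategy=0.9)
X_balanced, y_balanced = oversample.fit_resample(X, y)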

Figure 4-10 Oversampled data class distribution

Model                 Precision   Recall   F1 Score   Accuracy
Logistic Regression   0.974       0.917    0.945      0.949
Random Forest         1.0         1.0      1.0        1.0
SVM                   0.987       0.965    0.976      0.977

Table 4-6 Training performance for the oversampling dataset.

Model                 Precision   Recall   F1 Score   Accuracy
Logistic Regression   0.976       0.943    0.959      0.960
Random Forest         0.999       0.774    0.872      0.887
SVM                   0.944       0.901    0.942      0.944

Table 4-7 Testing performance for the oversampling dataset.

Figure 4-11 Logistic Regression confusion matrix of Oversampling

Figure 4-12 Random Forest confusion matrix of Oversampling

Figure 4-13 SVM confusion matrix of Oversampling

4.3.4 Evaluation of Hybrid Sampling Dataset
In this section, we highlight the outcomes of implementing a hybrid technique
that combines the undersampling and oversampling methods, yielding 134,835 fraud
and 134,835 non-fraud transactions. The hybrid sampling dataset is visualized in
Figure 4-14. Table 4-8 and Table 4-9 present the results of utilizing logistic
regression, random forest, and SVM classifiers on the hybrid sampling dataset for
the training and testing stages, respectively, with details on precision, recall,
F1 score, and accuracy. Additionally, the confusion matrices for the three machine
learning models are included in Figure 4-15, Figure 4-16, and Figure 4-17
[Appendix I].
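A minimal sketch of the hybrid step, assuming features X and labels y from the preprocessed training data (the full pipeline is in Appendix I):

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# first oversample fraud to 70% of the genuine class, then undersample the
# genuine class down to the fraud class, so both classes end up equal in size
X_tmp, y_tmp = RandomOverSampler(sampling_strategy=0.7).fit_resample(X, y)
X_balanced, y_balanced = RandomUnderSampler(sampling_strategy='majority').fit_resample(X_tmp, y_tmp)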

Figure 4-14 Hybrid sampling data class distribution

Model                 Precision   Recall   F1 Score   Accuracy
Logistic Regression   0.973       0.919    0.945      0.947
Random Forest         0.999       1.0      0.999      0.999
SVM                   0.987       0.964    0.975      0.976

Table 4-8 Training performance for the hybrid sampling dataset.

Model                 Precision   Recall   F1 Score   Accuracy
Logistic Regression   0.974       0.943    0.958      0.959
Random Forest         0.999       0.774    0.872      0.887
SVM                   0.985       0.901    0.941      0.944

Table 4-9 Testing performance for the hybrid sampling dataset.

Figure 4-15 Logistic Regression confusion matrix of Hybrid sampling

Figure 4-16 Random Forest confusion matrix of Hybrid sampling

Figure 4-17 SVM confusion matrix of Hybrid sampling

Figure 4-18 Comparison of undersampling results

From Figure 4-18, the logistic regression model has a precision of 0.967,
meaning that it correctly identifies 96.7% of the predicted positive cases, and a
recall of 0.943, meaning that it correctly identifies 94.3% of the actual positive
cases. Its F1 score, a balanced measure of precision and recall, is 95.5%, and its
accuracy is 95.6%, indicating that it is overall effective at making correct
predictions. The random forest model has a precision of 0.969, a recall of 0.929,
an F1 score of 94.9%, and an accuracy of 95%. The SVM model has the highest
precision, 0.981, but the lowest recall, 0.887, yielding an F1 score of 93.2% and
an accuracy of 93.5%.
According to these results, all three models perform satisfactorily; however,
Logistic Regression demonstrates the highest recall, F1 score, and accuracy, at
94.3%, 95.5%, and 95.6% respectively, indicating its superiority as the
top-performing model in this specific experiment. However, SVM has the highest
precision (98.1%), with an insignificant difference (1.4%) from that of Logistic
Regression.

Figure 4-19 Comparison of oversampling results
From Figure 4-19, the logistic regression model has a precision of 0.976, a
recall of 0.943, an F1 score of 95.9%, and an accuracy of 96.0%, indicating that it
is overall effective at making correct predictions. The random forest model has a
very high precision of 99.9%, correctly identifying almost all predicted positive
cases; however, its recall of 77.4% is the lowest of the three models, indicating
that it misses some positive cases, and its F1 score of 87.2% is likewise the
lowest. Despite the lower F1 score, the model still reaches an accuracy of 88.7%.
The SVM model has a precision of 0.944, a recall of 0.901, an F1 score of 94.2%,
and an accuracy of 94.4%.
Based on these results, all three models demonstrate strong performance.
However, Logistic Regression outperforms the others in terms of recall, F1 score,
and accuracy, achieving 94.3%, 95.9%, and 96% respectively. This indicates that
Logistic Regression is the top-performing model in this specific experiment. On the
other hand, Random Forest boasts the highest precision of 99.9%, only a marginal
2.3% above that of Logistic Regression.
Figure 4-20 Comparison of hybrid sampling results

From Figure 4-20, the logistic regression model has a precision of 0.974, a
recall of 0.943, an F1 score of 0.958, and an accuracy of 0.959, indicating that it
is overall effective at making correct predictions. The random forest model again
has a very high precision of 99.9%, correctly identifying almost all predicted
positive cases; however, its recall of 77.4% is the lowest of the three models,
indicating that it misses some positive cases, and its F1 score of 87.2% is
likewise the lowest. Despite the lower F1 score, the model still reaches an
accuracy of 88.7%. The SVM model has a precision of 98.5%, a recall of 90.1%, an F1
score of 94.1%, and an accuracy of 94.4%.
The results indicate that all three models exhibit robust performance.
However, Logistic Regression surpasses the others in terms of recall, F1 score, and
accuracy, achieving 94.3%, 95.8%, and 95.9%, respectively. This suggests that
Logistic Regression is the top-performing model in this specific experiment.
Conversely, Random Forest demonstrates the highest precision of 99.9%, only a
slight 2.5% above that of Logistic Regression.
In addition, the results exhibit that oversampling and hybrid sampling
techniques yield the most satisfactory performance for Logistic Regression and
SVM. However, Logistic Regression surpasses SVM in terms of performance.
Logistic regression is a probabilistic model that estimates the probability of a
particular outcome. In the case of class imbalance, oversampling can help balance
the dataset by increasing the number of instances in the minority class. Logistic
regression can better utilize this information to estimate the probability and make
predictions. Additionally, Logistic Regression is computationally less expensive
compared to SVM, especially for large datasets. Oversampling can increase the
number of samples, making the dataset even larger. This can impact the performance
of SVM due to increased training time and memory requirements. As a result,
logistic regression may outperform SVM in terms of speed. SVM performs best
when the decision boundary is well-separated and when the number of samples is
smaller. Oversampling can lead to overlapping regions between classes, making the
decision boundary more complex. Logistic regression is more flexible in handling
overlapping classes and can adapt to the increased complexity of the dataset.
In contrast, Random Forest exhibits superior performance when utilizing the
undersampling technique; nonetheless, its performance with oversampling and hybrid
sampling is comparatively weaker than with undersampling.
This is because oversampling may lead to an over-representation of the minority
class and cause the Random Forest algorithm to be biased towards this class.
Moreover, introducing duplicate samples that may be similar or identical to existing
instances in the minority class can lead to overfitting, where the model memorizes
the training instances and performs poorly on unseen data. Random Forests, which
inherently have the potential to overfit, can be particularly prone to this issue.

5 CONCLUSION AND FUTURE WORK

5.1 Conclusion
This work presents the application of three supervised machine learning
techniques: Logistic Regression, Random Forest, and SVM. When comparing these
models for credit card fraud detection, it is necessary to consider the
effectiveness of different sampling strategies such as oversampling, undersampling,
and hybrid sampling.
Oversampling, which involves replicating the minority class instances, can help
improve the performance of Logistic Regression and SVM models. By increasing
the presence of fraudulent cases, oversampling provides more information for the
models to learn from and hence improves their ability to accurately detect fraud.
However, in the case of Random Forest, this causes deficient performance due to
overfitting and potentially biased results.
Undersampling, on the other hand, reduces the number of majority class
instances, which can help mitigate bias in heavily imbalanced datasets. This
technique ensures a more balanced training set, allowing the models to better
understand and detect the minority class. However, undersampling can lead to loss
of information due to the removal of majority class instances, potentially negatively
impacting the model's generalization and overall performance.
Hybrid sampling combines oversampling and undersampling techniques to
address the limitations of each method. It tries to strike a balance by oversampling
the minority class and undersampling the majority class, which can result in a more
robust and accurate model. Hybrid sampling provides an opportunity to capture the
essence of both classes and reduce biases while maintaining the integrity of the data.
In terms of the models themselves, Logistic Regression is a simple yet widely
used algorithm that tends to perform well when the classes are well separated. It
benefits from undersampling, oversampling, and hybrid sampling techniques, all of
which improve its performance.
Random Forest is an ensemble classifier that combines multiple decision trees,
leading to better generalization and robustness, and it handles imbalanced datasets
reasonably well. It also benefits from the undersampling technique, which improves
its performance, whereas its performance is inadequate with oversampling and hybrid
sampling.
SVM, with its ability to classify data into different classes based on a
hyperplane, can also be effective in credit card fraud detection. It can handle
complex relationships and nonlinearities while minimizing the influence of outliers.
SVM can be improved through oversampling, undersampling, or hybrid sampling,
which provide a broader representation of both classes, enabling the model to learn
better decision boundaries.
In conclusion, the choice of model and sampling technique depends on the
characteristics of the dataset, the level of imbalance, and the desired trade-off
between accuracy and computational complexity. Implementing oversampling,
undersampling, or hybrid sampling can significantly enhance the performance of
logistic regression, random forest, and SVM in credit card fraud detection tasks.
This leads us to believe that using supervised machine learning techniques will
help decrease the amount of credit card fraud and increase customer satisfaction,
as it provides customers with a better and more secure experience.
After the comparative analysis of the various supervised learning models, we
can infer that Logistic Regression is the best approach for detecting credit card
fraud.

5.2 Recommendation
There are many ways to improve the model, such as applying it to datasets of
various sizes and data types, changing the data splitting ratio, or approaching the
problem with different algorithms. One example is merging telecom data to estimate
the location of the cardholder while his/her credit card is being used. This would
ease detection: if the card owner is in New Valley and a transaction on the card is
made in Cairo, it can easily be flagged as fraud.

5.3 Future work
• Explore advanced anomaly detection techniques: Investigate unsupervised
methods like autoencoders, clustering, and one-class SVMs to detect fraud
patterns without labeled data.
• Exploring deep learning approaches: Deep learning techniques, such as
recurrent neural networks (RNNs) or convolutional neural networks (CNNs),
have shown promise in various fields. Future work could investigate the
application of deep learning models to credit card fraud detection, considering
their ability to capture complex patterns and dependencies in data.
• Real-time fraud detection: The report primarily focused on offline fraud
detection using pre-processed datasets. A valuable extension would be to
develop a real-time fraud detection system that can analyze transactions as
they occur, providing immediate alerts for suspicious activities. This could
involve the use of streaming data processing frameworks and adaptive
learning algorithms.
• Incorporate domain knowledge: Collaborate with industry experts or fraud
analysts to incorporate their insights and domain-specific knowledge into the
fraud detection process.
• Research and development: Continuously enhance and refine the fraud
detection model by staying updated with the latest techniques and
advancements in machine learning. Publish research papers and contribute to
academic conferences to establish credibility and attract potential partnerships
or funding opportunities.
• Collaboration with financial institutions: Collaborate with banks, credit card
companies, or other financial institutions to integrate our fraud detection
model into their systems. This partnership could involve licensing
agreements, revenue-sharing models, or other mutually beneficial
arrangements.
• Customization and integration: Provide customization options to businesses
based on their specific needs. Offer additional features or modules that can be
integrated into their existing fraud detection systems, allowing them to adapt
the model to their unique requirements. Charge an additional fee for
customization and integration services.

Appendix A:
### Imports
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score, f1_score


import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler

import pickle
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.svm import SVC


from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
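
The appendices that follow also call two helpers, metrics() and save_model(), whose definitions are not shown in these listings. A minimal sketch of what they might look like, reconstructed from how they are used (an assumption, not the original code):

def metrics(y_true, y_pred):
    # assumed helper: print the classification report and the four headline metrics
    print(classification_report(y_true, y_pred))
    print('accuracy_score ', accuracy_score(y_true, y_pred))
    print('precision_score', precision_score(y_true, y_pred))
    print('recall_score   ', recall_score(y_true, y_pred))
    print('f1_score       ', f1_score(y_true, y_pred))

def save_model(model, path):
    # assumed helper: persist the trained model with pickle for later reuse
    with open(path, 'wb') as f:
        pickle.dump(model, f)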

Appendix B:
### Exploratory Data Analysis
data = pd.read_csv('creditcard.csv')

pd.options.display.max_columns = None
data
data.shape
data.info()

Appendix C:
#### Data Cleaning
data.isnull().sum()
data.duplicated().any()
data = data.drop_duplicates()

Appendix D:
#### Data Normalization

data.hist(bins=30, figsize=(20, 20))

sc = StandardScaler()
data['Amount']=sc.fit_transform(pd.DataFrame(data['Amount']))
sc = StandardScaler()
data['Time']=sc.fit_transform(pd.DataFrame(data['Time']))
data.hist(bins=30, figsize=(20, 20))

Appendix E:
### Train Test Split
# X: feature matrix, y: 'Class' labels (assumed to be defined from the cleaned dataframe)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.15, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.20, random_state=42)

Appendix F:
#### Pure DataSet
def resultOfPureDataset (model, X_train, y_train, X_val, y_val, X_test,
y_test, Model_Path):
x = model()

x.fit(X_train, y_train)

y_predtrain = x.predict(X_train)

print('\t\tTrain Classification Report:')


print(classification_report(y_train, y_predtrain))
print('Train accuracy_score ',accuracy_score(y_train, y_predtrain))

print('Train precision_score ',precision_score(y_train, y_predtrain))


print('Train recall_score ',recall_score(y_train, y_predtrain))
print('Train f1_score ',f1_score(y_train, y_predtrain))

print('\t\tValidation Classification Report:')


y_predval = x.predict(X_val)
metrics(y_val, y_predval)

cm = confusion_matrix(y_val, y_predval)
sns.heatmap(cm, annot=True, fmt='d')

plt.title('Unnormalized validation confusion matrix')

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, fmt='.11g', annot=True, linewidths = 0.01)
plt.title('normalized validation confusion matrix')

plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

tn_val, fp_val, fn_val, tp_val = np.ravel(cm_normalized)


print("True Negatives (TN):", tn_val)

print("False Positives (FP):", fp_val)

print("False Negatives (FN):", fn_val)


print("True Positives (TP):", tp_val)

precision_val=tp_val/(tp_val+fp_val)
print ( "precision_val", precision_val )

recall_val=tp_val/(tp_val+fn_val)

print ( "recall_val", recall_val )


F1_score_val = (2*precision_val*recall_val)/(precision_val+recall_val)

print ( "F1-score_val", F1_score_val )


specificity_val=tn_val/(tn_val+fp_val)
print ( "specificity_val", specificity_val )

False_positive_rate_val=fp_val/(fp_val+tn_val)
print ( "False_positive_rate_val", False_positive_rate_val )
False_negative_rate_val=fn_val/(fn_val+tp_val)
print ( "False_negative_rate_val", False_negative_rate_val )

print('\t\tTest Classification Report:')


y_predtest = x.predict(X_test)
metrics(y_test, y_predtest)
cm = confusion_matrix(y_test, y_predtest)

sns.heatmap(cm, annot=True, fmt='d')


plt.title('Unnormalized testing confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_normalized,fmt='.11g', annot=True, linewidths = 0.01)


plt.title('normalized testing confusion matrix')

plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

tn_test, fp_test, fn_test, tp_test = np.ravel(cm_normalized)

print("True Negatives (TN):", tn_test)


print("False Positives (FP):", fp_test)

print("False Negatives (FN):", fn_test)


print("True Positives (TP):", tp_test)

precision_test=tp_test/(tp_test+fp_test)

print ( "precision_test", precision_test )


recall_test=tp_test/(tp_test+fn_test)

print ( "recall_test", recall_test )


F1_score_test = (2*precision_test*recall_test)/(precision_test+recall_test)
print ( "F1-score_test", F1_score_test )

specificity_test=tn_test/(tn_test+fp_test)

print ( "specificity_test", specificity_test )

False_positive_rate_test=fp_test/(fp_test+tn_test)
print ( "False_positive_rate_test", False_positive_rate_test )
False_negative_rate_test=fn_test/(fn_test+tp_test)

print ( "False_negative_rate_test", False_negative_rate_test )

save_model(x, Model_Path)

print("Evaluation of Logistic regression Classifier")
resultOfPureDataset(Lr_model_pure, X_train, y_train, X_val, y_val, X_test,
y_test , lr_pure)

print("Evaluation of Random Forest Classifier")


resultOfPureDataset(RF_model_pure, X_train, y_train, X_val, y_val, X_test,
y_test, RF_pure)
print("Evaluation of Support Vector Machine")
resultOfPureDataset(SVC_model_pure, X_train, y_train, X_val, y_val, X_test,
y_test , SVC_pure)

Appendix G:
#### UnderSampling DataSet
undersample = RandomUnderSampler(sampling_strategy='majority')
X_train_under, y_train_under = undersample.fit_resample(X_under, y_under)
def resultOfUndersamplingDataset (model, X_train_under, y_train_under,
X_val_under, y_val_under, X_test_under, y_test_under, Model_Path):
x = model()

x.fit(X_train_under, y_train_under)
print('\t\tTrain Classification Report:')
y_predtrain_under = x.predict(X_train_under)
print(classification_report(y_train_under, y_predtrain_under))
print('Train_accuracy_score',accuracy_score(y_train_under,
y_predtrain_under))
print('Train_precision_score',precision_score(y_train_under,
y_predtrain_under))
print('Train_recall_score',recall_score(y_train_under,
y_predtrain_under))

print('Train f1_score',f1_score(y_train_under, y_predtrain_under))

y_predval_under = x.predict(X_val_under)
print('\t\tValidation Classification Report:')

metrics(y_val_under, y_predval_under)
cm_val_under = confusion_matrix(y_val_under, y_predval_under)
sns.heatmap(cm_val_under, annot=True, fmt='d')

plt.title('Unnormalized validation confusion matrix')


plt.xlabel('Predicted')

plt.ylabel('Actual')

plt.show()
cm_normalized_val_under = cm_val_under.astype('float') / cm_val_under.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized_val_under,fmt='.11g', annot=True, linewidths =
0.01)
plt.title('normalized validation confusion matrix')

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

tn_val_under, fp_val_under, fn_val_under, tp_val_under = np.ravel(cm_normalized_val_under)

print("True Negatives (TN):", tn_val_under)

print("False Positives (FP):", fp_val_under)


print("False Negatives (FN):", fn_val_under)

print("True Positives (TP):", tp_val_under)


precision_val_under=tp_val_under/(tp_val_under+fp_val_under)

print ( "precision_val_under", precision_val_under )

recall_val_under=tp_val_under/(tp_val_under+fn_val_under)
print ( "recall_val_under", recall_val_under )
F1_score_val_under = (2*precision_val_under*recall_val_under)/(precision_val_under+recall_val_under)
print ( "F1-score_val_under", F1_score_val_under )
specificity_val_under=tn_val_under/(tn_val_under+fp_val_under)

print ( "specificity_val_under", specificity_val_under )


False_positive_rate_val_under=fp_val_under/(fp_val_under+tn_val_under)

print ( "False_positive_rate_val_under", False_positive_rate_val_under )

False_negative_rate_val_under=fn_val_under/(fn_val_under+tp_val_under)
print ( "False_negative_rate_val_under", False_negative_rate_val_under )

print('\t\tTest Classification Report:')


y_predtest_under = x.predict(X_test_under)

metrics(y_test_under, y_predtest_under)

cm_test_under = confusion_matrix(y_test_under, y_predtest_under)


sns.heatmap(cm_test_under, annot=True, fmt='d')
plt.title('Unnormalized testing confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
cm_normalized_test_under = cm_test_under.astype('float') / cm_test_under.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized_test_under,fmt='.11g', annot=True, linewidths
= 0.01)
plt.title('normalized testing confusion matrix')

plt.xlabel('Predicted')

plt.ylabel('Actual')

plt.show()

tn_test_under, fp_test_under, fn_test_under, tp_test_under = np.ravel(cm_normalized_test_under)

print("True Negatives (TN):", tn_test_under)

print("False Positives (FP):", fp_test_under)


print("False Negatives (FN):", fn_test_under)

print("True Positives (TP):", tp_test_under)


precision_test_under=tp_test_under/(tp_test_under+fp_test_under)
print ( "precision_test_under", precision_test_under )

recall_test_under=tp_test_under/(tp_test_under+fn_test_under)
print ( "recall_test_under", recall_test_under )
F1_score_test_under = (2*precision_test_under*recall_test_under)/(precision_test_under+recall_test_under)

print ( "F1-score_test_under", F1_score_test_under )


specificity_test_under=tn_test_under/(tn_test_under+fp_test_under)

print ( "specificity_test_under", specificity_test_under )


False_positive_rate_test_under=fp_test_under/(fp_test_under+tn_test_under)
print ( "False_positive_rate_test_under", False_positive_rate_test_under )
False_negative_rate_test_under=fn_test_under/(fn_test_under+tp_test_under)
print ( "False_negative_rate_test_under", False_negative_rate_test_under )

save_model(x, Model_Path)
print("Evaluation of Logistic regression Classifier")
resultOfUndersamplingDataset(Lr_model_under, X_train_under, y_train_under,
X_val_under, y_val_under, X_test_under, y_test_under, lr_under)

print("Evaluation of Random Forest Classifier")


resultOfUndersamplingDataset(RF_model_under, X_train_under, y_train_under,
X_val_under, y_val_under, X_test_under, y_test_under, RF_under)
print("Evaluation of Support Vector Machine")
resultOfUndersamplingDataset(SVC_model_under, X_train_under, y_train_under,
X_val_under, y_val_under, X_test_under, y_test_under, SVC_under)

Appendix H:
#### OverSampling DataSet
oversample = RandomOverSampler(sampling_strategy=0.9)

X_train_over, y_train_over = oversample.fit_resample(X_over,y_over)


def resultOfoversamplingDataset (model, X_train_over, y_train_over,
X_val_over, y_val_over, X_test_over, y_test_over, Model_Path):
x = model()
x.fit(X_train_over, y_train_over)

print('\t\tTrain Classification Report:')


y_predtrain_over = x.predict(X_train_over)

print(classification_report(y_train_over, y_predtrain_over))
print('Train_accuracy_score',accuracy_score(y_train_over,
y_predtrain_over))
print('Train_precision_score',precision_score(y_train_over,
y_predtrain_over))

print('Train_recall_score',recall_score(y_train_over, y_predtrain_over))
print('Train_f1_score',f1_score(y_train_over, y_predtrain_over))

y_predval_over = x.predict(X_val_over)

print('\t\tValidation Classification Report:')


metrics(y_val_over, y_predval_over)
cm_val_over = confusion_matrix(y_val_over, y_predval_over)
sns.heatmap(cm_val_over, annot=True, fmt='d')
plt.title('Unnormalized validation confusion matrix')
plt.xlabel('Predicted')

plt.ylabel('Actual')
plt.show()
cm_normalized_val_over = cm_val_over.astype('float') / cm_val_over.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized_val_over,fmt='.11g', annot=True, linewidths =
0.01)

plt.title('normalized validate confusion matrix')

plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

tn_val_over, fp_val_over, fn_val_over, tp_val_over = np.ravel(cm_normalized_val_over)

print("True Negatives (TN):", tn_val_over)


print("False Positives (FP):", fp_val_over)
print("False Negatives (FN):", fn_val_over)

print("True Positives (TP):", tp_val_over)


precision_val_over=tp_val_over/(tp_val_over+fp_val_over)
print ( "precision_val_over", precision_val_over )

recall_val_over=tp_val_over/(tp_val_over+fn_val_over)
print ( "recall_val_over", recall_val_over )
F1_score_val_over = (2*precision_val_over*recall_val_over)/(precision_val_over+recall_val_over)
print ( "F1-score_val_over", F1_score_val_over )
specificity_val_over=tn_val_over/(tn_val_over+fp_val_over)
print ( "specificity_val_over", specificity_val_over )

False_positive_rate_val_over=fp_val_over/(fp_val_over+tn_val_over)

print ( "False_positive_rate_val_over", False_positive_rate_val_over )


False_negative_rate_val_over=fn_val_over/(fn_val_over+tp_val_over)
print ( "False_negative_rate_val_over", False_negative_rate_val_over )

print('\t\tTest Classification Report:')

y_predtest_over = x.predict(X_test_over)
metrics(y_test_over, y_predtest_over)

cm_test_over = confusion_matrix(y_test_over, y_predtest_over)


sns.heatmap(cm_test_over, annot=True, fmt='d')

plt.title('Unnormalized testing confusion matrix')


plt.xlabel('Predicted')

plt.ylabel('Actual')

plt.show()
cm_normalized_test_over = cm_test_over.astype('float') / cm_test_over.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized_test_over, fmt='.11g',annot=True, linewidths =
0.01)

plt.title('normalized testing confusion matrix')


plt.xlabel('Predicted')

plt.ylabel('Actual')
plt.show()

tn_test_over, fp_test_over, fn_test_over, tp_test_over = np.ravel(cm_normalized_test_over)
print("True Negatives (TN):", tn_test_over)

print("False Positives (FP):", fp_test_over)


print("False Negatives (FN):", fn_test_over)

print("True Positives (TP):", tp_test_over)


precision_test_over=tp_test_over/(tp_test_over+fp_test_over)
print ( "precision_test_over", precision_test_over )
recall_test_over=tp_test_over/(tp_test_over+fn_test_over)
print ( "recall_test_over", recall_test_over )

F1_score_test_over = (2*precision_test_over*recall_test_over)/(precision_test_over+recall_test_over)
print ( "F1-score_test_over", F1_score_test_over )
specificity_test_over=tn_test_over/(tn_test_over+fp_test_over)

print ( "specificity_test_over", specificity_test_over )


False_positive_rate_test_over=fp_test_over/(fp_test_over+tn_test_over)

print ( "False_positive_rate_test_over", False_positive_rate_test_over )


False_negative_rate_test_over=fn_test_over/(fn_test_over+tp_test_over)
print ( "False_negative_rate_test_over", False_negative_rate_test_over )

save_model(x, Model_Path)

print("Evaluation of Logistic regression Classifier")


resultOfoversamplingDataset(Lr_model_Over, X_train_over, y_train_over,
X_val_over, y_val_over, X_test_over, y_test_over, lr_Over)
print("Evaluation of Random Forest Classifier")
resultOfoversamplingDataset(RF_model_Over, X_train_over, y_train_over,
X_val_over, y_val_over, X_test_over, y_test_over, RF_Over)

print("Evaluation of Support Vector Machine")


resultOfoversamplingDataset(SVC_model_Over, X_train_over, y_train_over,
X_val_over, y_val_over, X_test_over, y_test_over,SVC_Over)

Appendix I:
#### Hybrid Sampling DataSet
oversample2 = RandomOverSampler(sampling_strategy=0.7)
Xover, yover = oversample2.fit_resample(X_comb,y_comb)
X_train_comb, y_train_comb = undersample.fit_resample(Xover, yover)
def resultOfCombDataset (model, X_train_comb, y_train_comb, X_val_comb,
y_val_comb, X_test_comb, y_test_comb, Model_Path):
x = model()
x.fit(X_train_comb, y_train_comb)

print('\t\tTrain Classification Report:')


y_predtrain_comb = x.predict(X_train_comb)
print(classification_report(y_train_comb, y_predtrain_comb))

print('Train_accuracy_score',accuracy_score(y_train_comb, y_predtrain_comb))
print('Train_precision_score',precision_score(y_train_comb,
y_predtrain_comb))
print('Train_recall_score ',recall_score(y_train_comb, y_predtrain_comb))

print('Train f1_score ',f1_score(y_train_comb, y_predtrain_comb))

y_predval_comb = x.predict(X_val_comb)
print('\t\tValidation Classification Report:')
metrics(y_val_comb, y_predval_comb)

cm_val_comb = confusion_matrix(y_val_comb, y_predval_comb)


sns.heatmap(cm_val_comb, annot=True, fmt='d')

plt.title('Unnormalized validation confusion matrix')

plt.xlabel('Predicted')

plt.ylabel('Actual')
plt.show()
cm_normalized_val_comb = cm_val_comb.astype('float') / cm_val_comb.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_normalized_val_comb, fmt='.11g',annot=True, linewidths = 0.01)

plt.title('normalized validate confusion matrix')


plt.xlabel('Predicted')

plt.ylabel('Actual')
plt.show()

tn_val_comb, fp_val_comb, fn_val_comb, tp_val_comb = np.ravel(cm_normalized_val_comb)
print("True Negatives (TN):", tn_val_comb)

print("False Positives (FP):", fp_val_comb)


print("False Negatives (FN):", fn_val_comb)
print("True Positives (TP):", tp_val_comb)

precision_val_comb=tp_val_comb/(tp_val_comb+fp_val_comb)
print ( "precision_val_comb", precision_val_comb )

recall_val_comb=tp_val_comb/(tp_val_comb+fn_val_comb)
print ( "recall_val_comb", recall_val_comb )

F1_score_val_comb = (2*precision_val_comb*recall_val_comb)/(precision_val_comb+recall_val_comb)
print ( "F1-score_val_comb", F1_score_val_comb )

specificity_val_comb=tn_val_comb/(tn_val_comb+fp_val_comb)
print ( "specificity_val_comb", specificity_val_comb )
False_positive_rate_val_comb=fp_val_comb/(fp_val_comb+tn_val_comb)
print ( "False_positive_rate_val_comb", False_positive_rate_val_comb )

False_negative_rate_val_comb=fn_val_comb/(fn_val_comb+tp_val_comb)
print ( "False_negative_rate_val_comb", False_negative_rate_val_comb )

print('\t\tTest Classification Report:')

y_predtest_comb = x.predict(X_test_comb)
metrics(y_test_comb, y_predtest_comb)
cm_test_comb = confusion_matrix(y_test_comb, y_predtest_comb)
sns.heatmap(cm_test_comb, annot=True, fmt='d')

plt.title('Unnormalized testing confusion matrix')


plt.xlabel('Predicted')

plt.ylabel('Actual')
plt.show()
cm_normalized_test_comb = cm_test_comb.astype('float') / cm_test_comb.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized_test_comb,fmt='.11g', annot=True, linewidths =
0.01)

plt.title('normalized testing confusion matrix')


plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()
tn_test_comb, fp_test_comb, fn_test_comb, tp_test_comb = np.ravel(cm_normalized_test_comb)

print("True Negatives (TN):", tn_test_comb)


print("False Positives (FP):", fp_test_comb)

print("False Negatives (FN):", fn_test_comb)


print("True Positives (TP):", tp_test_comb)

precision_test_comb=tp_test_comb/(tp_test_comb+fp_test_comb)

print ( "precision_test_comb", precision_test_comb )


recall_test_comb=tp_test_comb/(tp_test_comb+fn_test_comb)
print ( "recall_test_comb", recall_test_comb )
F1_score_test_comb = (2*precision_test_comb*recall_test_comb)/(precision_test_comb+recall_test_comb)

print ( "F1-score_test_comb", F1_score_test_comb )


specificity_test_comb=tn_test_comb/(tn_test_comb+fp_test_comb)
print ( "specificity_test_comb", specificity_test_comb )

False_positive_rate_test_comb=fp_test_comb/(fp_test_comb+tn_test_comb)
print ( "False_positive_rate_test_comb", False_positive_rate_test_comb )

False_negative_rate_test_comb=fn_test_comb/(fn_test_comb+tp_test_comb)

print ( "False_negative_rate_test_comb", False_negative_rate_test_comb )

save_model(x, Model_Path)
print("Evaluation of Logistic regression Classifier")
resultOfCombDataset(Lr_model_Comb, X_train_comb, y_train_comb, X_val_comb,
y_val_comb, X_test_comb, y_test_comb, lr_Comb)
print("Evaluation of Random Forest Classifier")
resultOfCombDataset(RF_model_Comb, X_train_comb, y_train_comb, X_val_comb,
y_val_comb, X_test_comb, y_test_comb, RF_Comb)
print("Evaluation of Support Vector Machine")
resultOfCombDataset(SVC_model_Comb, X_train_comb, y_train_comb, X_val_comb,
y_val_comb, X_test_comb, y_test_comb, SVC_Comb)
