Miniproject Group E 1
DECLARATION
I hereby declare that this project report titled "Credit Card Fraud Detection" is my original work
and has not been submitted previously for any degree or diploma at any other educational institution.
This report is a result of my own research and the information provided is accurate to the best of my
knowledge.
Credit card fraud detection leverages a range of sophisticated algorithms to accurately identify
and prevent fraudulent transactions. It includes logistic regression, which provides probabilistic
classification by estimating the likelihood of fraud; decision trees, which offer a clear decision-making
process through hierarchical splitting of data; and ensemble methods like Random Forest and Gradient
Boosting, which combine multiple models to improve prediction accuracy. Machine learning
techniques such as Support Vector Machines (SVM) and Neural Networks are also employed to
capture complex patterns in transaction data. Unsupervised learning methods, including clustering
algorithms like k-means and anomaly detection techniques, help identify outliers without pre-labeled
data.
The output of these algorithms is typically evaluated by accuracy, with many systems achieving
rates above 90%. For example, models like Random Forest score in the range of 85-95%, reflecting
their effectiveness in balancing the detection of fraudulent transactions while minimizing false
positives. The continuous tuning of these models, along with feature engineering and behavioral
analytics, ensures that fraud detection systems remain robust and adaptive to emerging threats.
Our system also provides a thorough comparative analysis of different machine learning
algorithms. We benchmark a range of models, including Support Vector Machines (SVM), Random
Forests, Gradient Boosting, and Neural Networks. By evaluating these algorithms based on multiple
performance metrics—such as accuracy, precision, recall, F1-score, and ROC-AUC—we establish
comprehensive performance benchmarks. Systematic hyperparameter tuning is employed to optimize
each model’s configuration, ensuring that we achieve the best possible performance for fraud detection.
I would like to extend my gratitude to my project supervisor, Dr. S Krishna Anand, for their
guidance and support throughout this project. Their expertise and feedback have been invaluable in
shaping this work. I also thank my peers and family for their encouragement and assistance.
Credit card fraud detection 2023-2024
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to all those who have supported and guided me
throughout the course of this project on "Credit Card Fraud Detection." This project would not
have been possible without the help, encouragement, and expertise of many individuals.
We would like to extend our heartfelt gratitude to everyone who contributed to the
development and success of the Credit Card Fraud Detection System. This innovative project
would not have been possible without the collective efforts of a dedicated team of professionals,
researchers, and collaborators.
Firstly, we acknowledge the invaluable contributions of our machine learning experts and
data scientists. Their expertise in developing and fine-tuning state-of-the-art algorithms, including
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), was
instrumental in creating a system capable of accurate and efficient fraud predictions. Their
dedication to feature engineering, ensemble methods, and hyperparameter tuning has greatly
enhanced the system's performance.
Our thanks extend to the data providers that contributed valuable datasets for training and
validating the system. The integration of diverse and comprehensive data sources was vital for
developing a robust and reliable predictive model. We appreciate their collaboration and support
in making this project a success.
This project has been a significant learning experience, and I am grateful for the
opportunity to work on such a relevant and impactful topic. The knowledge and skills gained
during this project will undoubtedly benefit my future endeavors in the field of cloud computing
and monitoring.
NOMENCLATURE
DT : Decision Tree
FN : False Negative
FP : False Positive
LR : Logistic Regression
NB : Naive Bayes
NN : Neural Network
PCA : Principal Component Analysis
RF : Random Forest
TN : True Negative
TP : True Positive
Abstract
The lure of money has led to a sharp increase in anonymous fraudsters who exploit
unsuspecting people and swindle them out of their hard-earned money through fraudulent
transactions, carried out by means such as tampering with ATMs (Automated Teller
Machines), forged signatures, unauthorized use of credit cards and blatant theft. This work
focuses exclusively on the detection of fraudulent transactions carried out through credit cards
and takes a small step towards offsetting a few of the difficulties faced by customers. To make the
system more realistic and robust, four models with different parameters have been
experimented with: KNN (k-Nearest Neighbors), Logistic Regression,
Naive Bayes and SVM (Support Vector Machine). The accuracy levels of these techniques are
compared and the most appropriate model is chosen.
Table of Contents
CHAPTER NO TITLE PG NO
ACKNOWLEDGMENTS I
NOMENCLATURE II
ABSTRACT III
LIST OF FIGURES IV
LIST OF TABLES IV
1. INTRODUCTION 1
1.2 OBJECTIVES 1
2. LITERATURE REVIEW 7
4. ALGORITHM APPROACH 14
4.1 INTRODUCTION 14
4.6 KNN 18
5. CONCLUSION 33
5.1 CONCLUSION 33
5.2 FUTURE WORK 33
List of Figures
Fig No Names Pg no
1 Dataset Structure 11
2 Class Distribution 12
3 Correlations 12
4 Variable 18 13
5 Variable 28 13
6 Weka K=3 15
7 RStudio K=3 15
8 RStudio K=7 16
9 Weka K=7 16
10 Weka Naïve Bayes 17
11 RStudio Naïve Bayes 17
12 Weka Logistic Regression 18
13 RStudio Logistic Regression 18
14 Support Vector Machine 19
List of Tables
Table no Title Pg No
1 Confusion Matrix 20
2 Table of Accuracies 21
Chapter 1: Introduction
1.1 General Introduction
With more and more people using credit cards in their daily lives, credit card companies
must take special care over the security and safety of their customers. According to the World Bank,
2.8 billion people around the world used credit cards in 2019, and 70%
of those users own at least one card.
Reports of credit card fraud in the US rose by 44.7%, from 271,927 in 2019 to 393,207
in 2020. There are two kinds of credit card fraud. In the first, an identity thief opens a credit card
account in the victim's name; reports of this kind of fraud increased
48% from 2019 to 2020. In the second, an identity thief uses an existing account that the victim
created, usually by stealing the credit card's information; reports of this type of
fraud increased 9% from 2019 to 2020. These statistics caught the attention of researchers because the
numbers have been rising sharply over the years, which motivated
them to tackle the issue analytically by using different machine learning methods to
detect fraudulent credit card transactions among large numbers of legitimate ones.
1.2 Objectives
The primary objective of this work is the accurate detection of fraudulent transactions
made with credit cards. Choosing a machine learning technique that achieves
high detection accuracy goes a long way towards increasing customer trust. This
work is a step in that direction.
Enhancing a fraud detection system involves several key strategies. Firstly, integrating
advanced techniques like deep learning, reinforcement learning, and ensemble methods can
significantly improve detection accuracy and reduce false positives. Implementing real-time
adaptation mechanisms allows the model to continuously learn from new transaction data, staying
ahead of emerging fraud patterns. Expanding data sources, including social media activity, device
information, and biometric data, enriches the model's ability to detect fraud, while incorporating
global transaction data ensures effectiveness across different regions and fraud types. System
integration should be enhanced to connect seamlessly with broader financial platforms, such as
customer service and risk management tools, ensuring a holistic approach to fraud prevention.
Ensuring cross-platform compatibility will further allow the system to work effortlessly with
various payment processing systems. Improving user experience through a more intuitive interface
and customizable alerts helps fraud analysts and administrators interact efficiently with the
system. Performance optimization is essential, focusing on scalability to handle larger transaction
volumes and reducing latency for near-instantaneous response times. The system should also be
updated regularly to comply with evolving regulatory requirements, with robust audit trails to
ensure transparency. Enhanced fraud detection capabilities, including advanced behavioral
analytics and improved anomaly detection, are crucial for identifying subtle and new types of fraud.
Additionally, user education and training programs will ensure that end-users and analysts are
well-equipped to utilize the system effectively, with best practices guiding the interpretation of
alerts. Finally, fostering partnerships with other financial institutions and adopting industry
standards will enable shared insights and strategies, strengthening the overall approach to fraud
prevention.
1.4.1 Characteristics:
Accuracy: The ability to correctly identify fraudulent transactions while minimizing false
positives (legitimate transactions flagged as fraud) and false negatives (fraudulent transactions not
detected).
Real-time Processing: The capability to analyze and flag transactions instantly as they occur,
enabling immediate intervention to prevent fraudulent activities.
Scalability: The system must handle large volumes of transactions efficiently, especially
during peak times, without compromising performance or accuracy.
Adaptability: The ability to update and adapt to new fraud patterns and techniques as
fraudsters continuously evolve their strategies.
Feature Engineering: Extraction and selection of relevant features from transaction data,
such as transaction amount, location, time, frequency, and user behavior patterns, to enhance model
performance.
Anomaly Detection: Identifying deviations from normal transaction behavior that could
indicate potential fraud, such as unusual spending patterns or transactions from unexpected
locations.
1.4.2 Advantages:
Increased Security: Fraud detection systems help to identify and prevent fraudulent
transactions in real-time, significantly reducing the risk of financial losses for both consumers and
financial institutions.
Financial Savings: By detecting and preventing fraudulent activities, these systems save
financial institutions millions of dollars annually that would otherwise be lost to fraud. This also
minimizes the financial impact on customers who might otherwise be liable for fraudulent charges.
Enhanced Customer Trust: Customers feel more secure knowing that their transactions are
being monitored and protected against fraud, which enhances their trust and confidence in the
financial institution and its services.
Real-time Fraud Detection: Modern fraud detection systems can analyze transactions in real-
time, allowing for immediate action to be taken to prevent fraud, such as blocking a suspicious
transaction or alerting the customer.
Reduction in False Positives: Advanced machine learning algorithms improve the accuracy
of fraud detection, reducing the number of legitimate transactions that are incorrectly flagged as
fraudulent, which in turn reduces customer inconvenience and dissatisfaction.
Hardware requirements:
CPU : Intel i7 or AMD Ryzen 7 and above
GPU : NVIDIA GTX 1080 Ti, RTX 2080, or higher
RAM : At least 32GB
Storage : SSD with at least 1TB
Network : High-speed internet connection
Backup and Redundancy : External Hard Drives, NAS, Cloud Storage
Peripherals : Dual Monitors, Ergonomic Keyboard and Mouse
Software requirements:
Operating System : Windows 11
Programming Languages and Libraries : Python, NumPy, Pandas, Scikit-learn, Matplotlib/Seaborn, Jupyter Notebook
In the bustling 1960s, as credit cards began to revolutionize consumer behavior, fraud
emerged as a significant concern. Early attempts to combat fraud were rudimentary and labor-
intensive. Merchants and banks relied on lists of stolen card numbers, manually scrutinizing
transactions to spot potential fraud. This manual process was inefficient and often inadequate.
By the 1980s, as computers became more integrated into business operations, the landscape
of fraud detection began to change. Simple automated systems emerged, employing rule-based
algorithms to flag suspicious activities. For example, if a card was used in New York and then
suddenly in Los Angeles within an hour, the system would raise an alert for potential fraud. These
early automated systems marked the beginning of a more structured approach to fraud prevention.
The 1990s brought significant advancements with the rise of more sophisticated algorithms
and databases. Statistical methods and anomaly detection techniques were developed to better
analyze transaction data. These methods allowed for the identification of unusual spending patterns
that could indicate fraud, laying the groundwork for more advanced fraud detection mechanisms.
With the dawn of the 2000s, the internet era brought e-commerce to the forefront, presenting
new challenges and opportunities for fraud detection. Machine learning and data mining techniques
started to take center stage. Neural networks and decision trees were among the early machine
learning methods used, significantly enhancing the ability to detect fraudulent activities by learning
from vast amounts of transaction data.
The 2010s saw an explosion of big data and advancements in computing power, leading to a
quantum leap in fraud detection capabilities. More complex machine learning models, such as
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and ensemble methods, became
commonplace. These models could analyze large datasets in real-time, identifying even the subtlest
of fraudulent patterns with remarkable accuracy.
With the start of the third decade of the 21st century, fraud detection systems became marvels of
modern technology. They now leverage deep learning, real-time analytics, and artificial intelligence
to stay ahead of increasingly sophisticated fraud tactics. Techniques such as deep neural networks,
convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are employed to
achieve unprecedented levels of detection accuracy. Moreover, innovations in biometric
authentication, behavioral analytics, and blockchain technology are being integrated into these
systems, providing a multi-faceted defense against fraud.
The evolution of credit card fraud detection, from manual checks in the 1960s to today's
cutting-edge AI systems, is a testament to human ingenuity and the relentless pursuit of security in
the digital age. It remains a dynamic field, continually evolving to outpace the ever-advancing
strategies of fraudsters, ensuring that the convenience of credit cards remains safe and secure for
users worldwide.
Suraya Nurain Kalid and colleagues proposed a model for credit card fraud detection in 2002. In
the proposed methodology, the researchers used various machine learning algorithms such as
support vector machine (SVM), artificial neural network (ANN), Bayesian networks, K-Nearest
Neighbors (KNN), fuzzy logic systems and decision trees. In their paper, they observed that
k-nearest neighbor, decision tree and SVM give a medium level of accuracy, while
fuzzy logic and logistic regression give the lowest accuracy among all the algorithms.
Neural networks, Naive Bayes, fuzzy systems and KNN offer a high detection rate, whereas logistic
regression, SVM and decision trees offer a medium detection rate. Two
algorithms, ANN and Naive Bayesian networks, perform better on all parameters,
but they are very expensive to train. A major drawback common to all the algorithms
is that they do not give the same results in all types of environments: they
give improved results with one type of data set and poor results with another.
Algorithms like KNN and SVM give excellent results with small data sets, and algorithms like
logistic regression and fuzzy logic systems give good accuracy with raw and unsampled data.
In the year 2001, Fraud A. Ghaleb and his team used the decision tree, random forest,
SVM and logistic regression algorithms for credit card fraud detection. The researchers used a
highly skewed data set. They concluded that random forest provides the best results and good
accuracy compared to the other algorithms, and that the SVM algorithm has a data imbalance
problem.
transaction or through rotating categories. Cash back can be redeemed as statement credits,
deposits, or checks.
Each type of credit card has its own set of features, benefits, and terms, making it important to
choose one that aligns with your financial goals and spending habits.
1. Phishing
Fraudsters use fake emails, texts, or websites that appear legitimate to trick individuals into
revealing their credit card details or personal information. For instance, an email might claim to
be from a trusted financial institution, prompting the recipient to click on a link and enter their
credit card information on a fraudulent website.
2. Skimming
Skimming involves using a small device, called a skimmer, to capture credit card
information from the magnetic stripe of a card during legitimate transactions. The skimmer is
often placed on ATMs or gas station card readers, where it reads and stores card data while the
transaction appears normal.
3. Cloning
Cloning occurs when a fraudster copies the data from a stolen credit card and encodes it
onto a blank card, which can then be used to make fraudulent transactions. The fraudster uses
skimming or other methods to obtain the card data, then creates a counterfeit card using a card
reader/writer.
4. Account Takeover
This method involves gaining access to a person’s credit card account by stealing personal
information and using it to make unauthorized changes or transactions. Fraudsters may use stolen
login credentials or personal information obtained through social engineering to log into the
account, change account details, or make purchases.
6. Application Fraud
This type of fraud involves applying for a credit card using stolen or falsified information
to receive a new card. The fraudster uses fake or stolen identities to fill out credit card
applications, often with the intention of making fraudulent purchases or obtaining cash advances.
7. Identity Theft
Identity theft involves stealing someone’s personal information, such as Social Security
numbers or bank account details, to open credit card accounts or commit other forms of fraud.
The stolen identity information is used to apply for credit cards or loans in the victim’s name,
leading to unauthorized charges and damage to their credit.
8. Data Breaches
Data breaches occur when cyber criminals hack into databases of companies or financial
institutions to steal large amounts of credit card information. The stolen data is often sold on the
dark web or used directly by the criminals to make fraudulent transactions.
9. Social Engineering
Social engineering involves manipulating individuals into divulging confidential
information through deception or coercion. Fraudsters might pose as bank representatives or tech
support to trick victims into providing their credit card information over the phone or through
email.
institutions. Financially, it can lead to substantial losses for all parties involved, with fraudulent
transactions draining resources and profits. Legally, those responsible for such activities may face
severe repercussions, including fines and imprisonment. The reputational damage to organizations
affected by credit card fraud can be significant, as it undermines customer trust and credibility. For
consumers, the inconvenience is considerable, as they must deal with the aftermath of fraud,
including the process of disputing charges, obtaining new cards, and restoring their accounts to
normal. These combined impacts highlight the importance of robust fraud prevention measures.
4.1 Introduction
To accomplish the objective of the project, which is to find the most
suitable model for detecting fraud in credit card transactions, several steps need to be taken.
First, a suitable data set is identified and preprocessed. Then a series
of algorithms, namely K-Nearest Neighbor (KNN), Naive Bayes, SVM and Logistic Regression,
is applied. For the KNN model, two values of K were chosen: K=3 and K=7. All models were
created using both R and Weka, except SVM, for which only the Weka model has been taken into
account. In addition, all visualizations have been taken from these applications.
Fig. 4.3.1 shows the structure of the data set. All attributes are shown along with their
type and a glimpse of the values within each attribute. The type of the Class attribute is integer.
Fig. 4.3.2 shows the distribution of the class. The red bar, which contains
284,315 instances, represents the non-fraudulent transactions. The blue bar, with 492
instances, represents the fraudulent transactions.
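The class counts above imply a heavily imbalanced data set. The short sketch below works out the share of fraudulent transactions and the accuracy of a hypothetical baseline that labels every transaction as non-fraudulent. The two counts are taken from the text; the baseline is an illustrative assumption and not one of the models built in this project.

```python
# Class counts as reported for the data set (Fig. 4.3.2).
non_fraud = 284_315
fraud = 492

total = non_fraud + fraud
fraud_share = fraud / total

# Hypothetical baseline: a model that labels every transaction as non-fraudulent.
baseline_accuracy = non_fraud / total

print(f"total transactions: {total}")
print(f"fraud share: {fraud_share:.4%}")
print(f"always-non-fraud accuracy: {baseline_accuracy:.4%}")
```

Because such a baseline already scores very high on accuracy without detecting a single fraud, accuracy figures on this data set are best read together with the counts of misclassified fraudulent transactions.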
Fig. 4.3.3 shows the correlation between attributes (image from R). It is necessary to find the
values of the various principal components for a given data set. The graphical plot
representing the various principal components for each class is depicted in Fig.
4.3.3.
Fig. 4.3.4 shows the attribute V18, the eighteenth
principal component already shown in Fig. 4.3.3. This is the attribute
with the most fraudulent credit card transactions. The blue line represents class value 1, which
indicates the fraudulent transactions.
Fig. 4.3.5 shows the attribute with the lowest number of fraudulent transactions,
attribute number 28. As mentioned earlier, the blue line represents the fraudulent
instances within the data set.
As there are neither missing nor duplicated values, the preparation of the data set
was simple. The first alteration, made so that the data set could be opened in Weka,
was to change the type of the class attribute from Numeric to Class and to declare the class values as
{1,0} using the Sublime Text editor. A similar alteration of the type was made in
R so that the model and the visualizations could be created.
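The manual edit described above could also be scripted. The sketch below assumes the data set was saved in Weka's ARFF format, where each attribute is declared on an "@attribute" header line; the function name make_class_nominal and the sample header are hypothetical and only illustrate the idea.

```python
def make_class_nominal(arff_text, attribute="Class", values=("1", "0")):
    """Rewrite the declaration of `attribute` as a nominal ARFF attribute."""
    out = []
    for line in arff_text.splitlines():
        parts = line.split()
        # ARFF keywords are case-insensitive, so compare in lower case.
        is_target = (
            len(parts) >= 3
            and parts[0].lower() == "@attribute"
            and parts[1] == attribute
        )
        if is_target:
            line = "@attribute " + attribute + " {" + ",".join(values) + "}"
        out.append(line)
    return "\n".join(out)


# Made-up, shortened header used only for illustration.
sample = "\n".join([
    "@relation creditcard",
    "@attribute Amount numeric",
    "@attribute Class numeric",
    "@data",
    "149.62,0",
])

converted = make_class_nominal(sample)
print(converted)
```

Only the header line of the class attribute is changed; the data rows are left untouched.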
4.6 KNN
K-Nearest Neighbor (KNN) is a supervised machine learning algorithm. It is popular for its
simplicity and requires no parameter estimation and no likelihood calculations. The K-Nearest
Neighbor algorithm works in three major steps.
1. Evaluation of the distance. Distances are calculated between the test data and the training data. The most
common distance metrics are Euclidean distance, Manhattan distance and Hamming distance;
Euclidean is the most frequently used.
2. Identification of the nearest neighbors according to the distance information. Distances are sorted
in ascending order and the top k neighbors are selected.
3. Prediction based on the nearest neighbors. The top k
distances are chosen from the sorted list and the test point is assigned the
most frequent class among them.
The Euclidean distance between two points can be calculated using equation (3.3):
\[ d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2} \qquad (3.3) \]
where d represents the distance between an element of the test data x and each training element x',
x_i and x'_i are their i-th attribute values, and n is the number of attributes. Once the
Euclidean distance has been evaluated for each element, the classes are ready for graphical
representation, as demonstrated in Figure 4.6.1, which illustrates three classes in the graph,
each separated from the others by a particular Euclidean distance. Each class is represented by
circles of a specific color: yellow, blue or green.
The second step of the K-Nearest Neighbor algorithm is to find the nearest
neighbors: the distance values are sorted in ascending order and the top k distances are selected.
The last step of the K-Nearest Neighbor algorithm is the classification itself, which
is based on the frequency of the classes among the k nearest neighbors in the training
data set. The test point is assigned the most frequent class among its k nearest
neighbors.
In other words, the similarity in distance leads to the decision that the test entry belongs to
the class it is most similar to. Figure 4.6.2 demonstrates the process of the K-Nearest Neighbor
algorithm from the beginning until the decision is made.
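The three steps of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration and not the R or Weka implementation used in the project; the function names and the two-dimensional toy points are made up for the example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: sort in ascending order and keep the k closest.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Step 3: assign the most frequent class among the k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up data: 0 = non-fraudulent, 1 = fraudulent.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]

near_legit = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
near_fraud = knn_predict(train_points, train_labels, (5.0, 5.2), k=3)
print(near_legit, near_fraud)
```

With two classes, an odd k such as 3 or 7 avoids tied votes, which is one practical reason for the values of K used in this work.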
K=3
K=7
For K=7, the model scored an accuracy of 99.82%,
correctly identifying 91,719 transactions and missing 52. In Weka, the
model scored an accuracy of 99.88% and misclassified 52 transactions. As the level of accuracy
changes with different types and numbers of transactions, the average value has been chosen. It was
found that the average of the accuracies is 99.88%. The various accuracy levels are shown
in Fig. 4.6.5, while the average accuracy value is shown in Fig. 4.6.6.
The Naive Bayes algorithm is a machine learning approach that
works mainly on the basis of likelihood. It is regarded as one of the best classification
techniques and is well known for treating the features of the data as independent. It is a lazy learning
algorithm, but it can also be applied to unbalanced data sets. The algorithm calculates the
probability of each class for a record and classifies the record according to the highest probability value. The
algorithm cannot make a prediction for a value that appears in the test data set but not in the training data set. This
situation is called “zero frequency”. Smoothing techniques such as Laplace estimation have been
proposed in the literature to solve this problem.
The concept of this algorithm can be derived using equation (3.2):
\[ P(S|L) = \frac{P(L|S) \, P(S)}{P(L)} \qquad (3.2) \]
It works in such a way that the next event can be decided based on previous events.
In other words, the logic builds on previous experience (which is why it is
regarded as a learning algorithm). To apply the Naive Bayes algorithm to real-life
problems, the first step is identifying the data set. The data set classes must be clearly visible so that
class abstraction can be performed easily. The probability of observing a factor L given
that the event S occurred is the main likelihood term to be calculated in the Naive Bayes algorithm,
denoted P(L|S). The other terms of Bayes' law are defined as follows:
P(S): the prior probability, i.e. the likelihood of observing the event S independent of anything
else.
P(L): the probability of observing the factor L independent of any other factor of the event.
P(S|L): the posterior probability, i.e. the probability of the event S
given that the factor L has been observed.
The algorithm evaluates the terms described above and
multiplies and divides them to obtain the required probability. The class with the highest probability
is always taken as the prediction. To apply this concept to the data set, the
data set classes should first be visible and clearly identified. The class frequency is the
number of times each class occurs. Building the class frequency table is therefore the first important
step in the Naive Bayes algorithm.
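The role of the class frequency table and of the Laplace estimate can be illustrated with a small categorical example. This is a simplified sketch under stated assumptions: the features and labels below are invented, the function names are hypothetical, and the real data set has numeric attributes that R and Weka handle differently. The score computed for each class is the numerator of Bayes' law, P(S) multiplied by the per-feature likelihoods P(L|S).

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Build the class frequency table and per-class feature value counts."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values


def predict(model, row, alpha=1.0):
    """Return the class with the highest unnormalised posterior score."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(S) from the class frequency table
        for i, value in enumerate(row):
            seen = feature_counts[label][i][value]
            n_values = len(feature_values[i])
            # Laplace estimate: alpha keeps unseen values from giving zero.
            score *= (seen + alpha) / (count + alpha * n_values)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Made-up toy records: (amount level, merchant location).
rows = [("high", "foreign"), ("low", "domestic"), ("low", "domestic"),
        ("high", "domestic"), ("low", "foreign"), ("high", "foreign")]
labels = ["fraud", "legit", "legit", "legit", "legit", "fraud"]

model = train_naive_bayes(rows, labels)
label_a, scores_a = predict(model, ("high", "foreign"))
label_b, scores_b = predict(model, ("low", "domestic"))
print(label_a, label_b)
```

In the second query the fraud class has never been seen with either feature value, yet its score stays above zero because of the Laplace term, which is the zero-frequency fix mentioned earlier.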
Naive Bayes is a classification algorithm that assumes the presence of a certain feature within a
class is unrelated to the presence of any other feature. It is mainly used for clustering and
classification, based on conditional probability.
The second model created in R is Naive Bayes. Figure 4.7.2 shows the performance of the
model: it scored an accuracy of 97.77% and misclassified a total of 2,051 transactions, 33
fraudulent as non-fraudulent and 2,018 non-fraudulent as fraudulent. The Naive Bayes model
created in Weka differs slightly, with an accuracy of 97.73% and
1,938 misclassified instances.
The Naive Bayes model showed a much lower level of accuracy than the other models. The
performance of the model is also shown in Fig. 4.7.3.
Logistic regression is a statistical method used for classification tasks where the goal is to
predict the probability of an instance belonging to a particular category. Unlike linear regression,
which predicts continuous values, logistic regression outputs probabilities ranging from 0 to 1. It
employs a sigmoid function to map linear combinations of input features to these probabilities.
By setting a threshold (typically 0.5), these probabilities can be converted into binary
predictions. Logistic regression is widely used in various domains, such as finance, healthcare,
and marketing, to model the relationship between predictors and categorical outcomes. Its
simplicity, interpretability, and efficiency make it a valuable tool in the machine learning arsenal.
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
where e is the base of the natural logarithm. Logistic regression estimates the parameters of
a logistic model, which can be used to determine the relationship between the independent
variables and the log-odds of the dependent variable. The model's parameters are typically
estimated using maximum likelihood estimation. In practice, logistic regression is widely used in
various fields such as medicine, finance, and social sciences for predicting binary outcomes.
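The mapping from a linear combination of features to a probability, and from a probability to a binary prediction, can be sketched as follows. The weights and bias are hypothetical values chosen for illustration; in practice they would be estimated by maximum likelihood, which this sketch does not do.

```python
import math


def sigmoid(z):
    """Logistic function; written in two branches to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(weights, bias, features, threshold=0.5):
    """Convert the probability into a binary prediction."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical coefficients, not fitted to the project's data set.
weights = [0.8, -1.5]
bias = -0.2

p_high = predict_proba(weights, bias, [2.0, 0.5])
p_low = predict_proba(weights, bias, [0.0, 1.0])
print(p_high, predict_label(weights, bias, [2.0, 0.5]))
print(p_low, predict_label(weights, bias, [0.0, 1.0]))

# The log-odds of the predicted probability recover the linear term.
log_odds = math.log(p_high / (1.0 - p_high))
print(log_odds)
```

The last lines show the relationship stated in the text: the log-odds of the output equal the linear combination of the inputs.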
The last model created using both R and Weka is Logistic Regression. The model
scored an accuracy of 99.92% in R, with 70 misclassified instances, as shown in figure 4.8.3, while it
scored 99.91% in Weka with 77 misclassified instances, as presented in figure 4.8.2.
The aim of a support vector machine algorithm is to find the best possible line, or decision
boundary, that separates the data points of different data classes. This boundary is called
a hyperplane when working in high-dimensional feature spaces. The idea is to maximize the
margin, which is the distance between the hyperplane and the closest data points of each
category, thus making it easy to distinguish data classes.
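The idea of the margin can be made concrete by measuring the distance from each point to a candidate hyperplane. The sketch below compares two hand-picked separating hyperplanes on made-up two-dimensional points; it does not train an SVM, and the function names, points and hyperplane coefficients are all illustrative assumptions.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def separates(w, b, points, labels):
    """True if every point lies on the side given by its label (+1 or -1)."""
    return all(y * signed_distance(w, b, p) > 0 for p, y in zip(points, labels))


def margin(w, b, points):
    """Distance from the hyperplane to the closest point."""
    return min(abs(signed_distance(w, b, p)) for p in points)


# Made-up points: -1 = non-fraudulent, +1 = fraudulent.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating hyperplanes.
plane_a = ((1.0, 1.0), -5.5)
plane_b = ((1.0, 0.0), -3.0)

margin_a = margin(plane_a[0], plane_a[1], points)
margin_b = margin(plane_b[0], plane_b[1], points)
print(margin_a, margin_b)
```

Both hyperplanes classify the toy points correctly, but the first keeps the closest points farther away, so a maximum-margin method would prefer it over the second.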
The Support Vector Machine model, as shown in figure 4.9.2, scored an accuracy of
99.94% and misclassified 51 instances.
To ensure the proper and appropriate usage of the model, its accuracy needs to
be computed. Accuracy represents the overall proportion of instances that are predicted
correctly. Accuracies are derived from the confusion matrix, which shows the True
Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). True
Positive represents the transactions that are fraudulent and were correctly classified by the
model as fraudulent. True Negative represents the non-fraudulent transactions that were
correctly predicted by the model as non-fraudulent. False Positive
represents the non-fraudulent transactions that were misclassified as fraudulent. False
Negative represents the fraudulent transactions that were misclassified as non-fraudulent. The
confusion matrix representing these parameters is illustrated in Table 1.
Table 1: Confusion matrix
                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
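The accuracy calculation can be sketched in a few lines of Python. The sketch takes fraud as the positive class (label 1), so a false positive is a legitimate transaction flagged as fraud and a false negative is a fraudulent transaction that was missed. The label vectors are made-up examples, not project results.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 (fraud) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # fraud predicted as fraud
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # valid predicted as valid
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # valid predicted as fraud
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # fraud predicted as valid
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for illustration only
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy(tp, tn, fp, fn))
```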
Table 2: Model accuracies
Model                  Accuracy
KNN (K=3)              99.89%
KNN (K=7)              99.88%
Naive Bayes
Logistic Regression
The last stage of the CRISP-DM (Cross-Industry Standard Process for Data Mining)
model covers evaluation and deployment. Table 2 shows the accuracies of all the models
that were created in the project. All models performed well in detecting fraudulent
transactions and scored high accuracies. The Support Vector Machine was marginally
more accurate than the other models. The Naïve Bayes model had the lowest accuracy
of the models compared, although its score of 97.76% is still sizable.
Chapter 5: Conclusion
5.1 Conclusion
A number of models were designed to identify fraud in credit card transactions.
Apart from Naïve Bayes, the other three models detected fraudulent transactions with
an accuracy of more than 99%. Among them, the SVM technique performed marginally
better than the other models, with a detection accuracy of 99.94%. It misclassified
only 51 instances among a set of 2051 transactions. The high levels of accuracy in each
of the models indicate that these models could be explored for different kinds of
applications covering a wide range of domains.
Future work in credit card fraud detection using machine learning holds several exciting
opportunities for advancement. One promising area is the incorporation of heterogeneous data
sources to build more comprehensive models. For example, integrating contextual information
about transactions, such as geographic location and merchant details, along with behavioral data
like spending habits and device usage patterns, could improve the detection of sophisticated
fraud tactics. Additionally, experimenting with state-of-the-art machine learning techniques, such
as deep learning architectures and reinforcement learning, might enhance the system's ability to
recognize subtle and evolving fraud patterns.
Another critical aspect is improving the model's interpretability and reducing false
positives. Techniques such as explainable AI (XAI) could make it easier for practitioners to
understand and trust the model's predictions, while adaptive algorithms and feedback loops can
help fine-tune models in response to new fraud trends and minimize disruptions to legitimate
transactions. Real-time fraud detection is another area of focus, with research aimed at optimizing
the speed and efficiency of processing transactions without sacrificing accuracy.
Exploring the scalability of these models to handle large volumes of data and diverse
transaction types is also essential. Collaborations with financial institutions and other
stakeholders can facilitate access to extensive and varied datasets, aiding in the development and
validation of more robust models. Furthermore, addressing ethical and privacy concerns, such as
ensuring data security and transparency in model decisions, will be crucial for maintaining
consumer trust and compliance with regulations.
Overall, future work should aim to enhance the effectiveness, efficiency, and fairness of
credit card fraud detection systems, paving the way for more secure and reliable financial
transactions.
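import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import sklearn
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings('ignore')
sns.set(style="whitegrid")

# Load the dataset and preview the first rows
data = pd.read_csv("creditcard.csv")
print(data.head(5))

# Separate the features from the target column 'Class' (1 = fraud, 0 = valid)
target = 'Class'
columns = [c for c in data.columns.tolist() if c != target]
X = data[columns]
y = data[target]
print(X.shape)
print(y.shape)

# Split the records by class and count the fraudulent ones
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
n_outliers = len(fraud)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Fit the scaler on the training data only, then apply the same scaling to the
# test data so that no test-set statistics leak into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()


def plot_learningcurve(history, epochs):
    """Plot training and validation accuracy and loss from a Keras History object."""
    epoch = range(1, epochs + 1)

    plt.plot(epoch, history.history['accuracy'])
    plt.plot(epoch, history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend(['train', 'val'], loc='upper left')
    plt.show()

    plt.plot(epoch, history.history['loss'])
    plt.plot(epoch, history.history['val_loss'])
    plt.title('Model loss')
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.legend(['train', 'val'], loc='upper left')
    plt.show()


def relabel_outlier_predictions(y_pred):
    """Map outlier-detector output to class labels: 0 = valid, 1 = fraud.

    y_pred is assumed to be a NumPy array produced by an outlier detector
    that returns +1 for inliers and -1 for outliers.
    """
    y_pred = y_pred.copy()
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    return y_pred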
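The label remapping at the end of the listing above matches the output convention of scikit-learn outlier detectors, which return +1 for inliers and -1 for outliers. The sketch below shows the remapping in context. It assumes `IsolationForest` as the detector and uses synthetic data, since the listing does not show which detector produced `y_pred`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption): 200 ordinary points plus 5 distant outliers
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X = np.vstack([inliers, outliers])

# contamination is the expected share of outliers in the data
detector = IsolationForest(contamination=5 / 205, random_state=0)
raw = detector.fit_predict(X)  # +1 = inlier, -1 = outlier

# Remap to the dataset's convention: 0 = valid, 1 = fraud
y_pred = raw.copy()
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1

print("flagged as fraud:", int(y_pred.sum()), "of", len(y_pred))
```

The order of the two assignments matters: mapping +1 to 0 first ensures that the new 1 labels are not overwritten afterwards.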
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
0 -1.35981 -0.07278 2.536347 1.378155 -0.33832 0.462388 0.239599 0.098698 0.363787 0.090794
0 1.191857 0.266151 0.16648 0.448154 0.060018 -0.08236 -0.0788 0.085102 -0.25543 -0.16697
1 -1.35835 -1.34016 1.773209 0.37978 -0.5032 1.800499 0.791461 0.247676 -1.51465 0.207643
1 -0.96627 -0.18523 1.792993 -0.86329 -0.01031 1.247203 0.237609 0.377436 -1.38702 -0.05495
2 -1.15823 0.877737 1.548718 0.403034 -0.40719 0.095921 0.592941 -0.27053 0.817739 0.753074
2 -0.42597 0.960523 1.141109 -0.16825 0.420987 -0.02973 0.476201 0.260314 -0.56867 -0.37141
4 1.229658 0.141004 0.045371 1.202613 0.191881 0.272708 -0.00516 0.081213 0.46496 -0.09925
7 -0.64427 1.417964 1.07438 -0.4922 0.948934 0.428118 1.120631 -3.80786 0.615375 1.249376
7 -0.89429 0.286157 -0.11319 -0.27153 2.669599 3.721818 0.370145 0.851084 -0.39205 -0.41043
9 -0.33826 1.119593 1.044367 -0.22219 0.499361 -0.24676 0.651583 0.069539 -0.73673 -0.36685
10 1.449044 -1.17634 0.91386 -1.37567 -1.97138 -0.62915 -1.42324 0.048456 -1.72041 1.626659
10 0.384978 0.616109 -0.8743 -0.09402 2.924584 3.317027 0.470455 0.538247 -0.55889 0.309755
10 1.249999 -1.22164 0.38393 -1.2349 -1.48542 -0.75323 -0.6894 -0.22749 -2.09401 1.323729
11 1.069374 0.287722 0.828613 2.71252 -0.1784 0.337544 -0.09672 0.115982 -0.22108 0.46023
12 -2.79185 -0.32777 1.64175 1.767473 -0.13659 0.807596 -0.42291 -1.90711 0.755713 1.151087
12 -0.75242 0.345485 2.057323 -1.46864 -1.15839 -0.07785 -0.60858 0.003603 -0.43617 0.747731
12 1.103215 -0.0403 1.267332 1.289091 -0.736 0.288069 -0.58606 0.18938 0.782333 -0.26798
13 -0.43691 0.918966 0.924591 -0.72722 0.915679 -0.12787 0.707642 0.087962 -0.66527 -0.73798
14 -5.40126 -5.45015 1.186305 1.736239 3.049106 -1.76341 -1.55974 0.160842 1.23309 0.345173
15 1.492936 -1.02935 0.454795 -1.43803 -1.55543 -0.72096 -1.08066 -0.05313 -1.97868 1.638076
16 0.694885 -1.36182 1.029221 0.834159 -1.19121 1.309109 -0.87859 0.44529 -0.4462 0.568521
17 0.962496 0.328461 -0.17148 2.109204 1.129566 1.696038 0.107712 0.521502 -1.19131 0.724396
18 1.166616 0.50212 -0.0673 2.261569 0.428804 0.089474 0.241147 0.138082 -0.98916 0.922175
18 0.247491 0.277666 1.185471 -0.0926 -1.31439 -0.15012 -0.94636 -1.61794 1.544071 -0.82988
22 -1.94653 -0.0449 -0.40557 -1.01306 2.941968 2.955053 -0.06306 0.855546 0.049967 0.573743
22 -2.07429 -0.12148 1.322021 0.410008 0.295198 -0.95954 0.543985 -0.10463 0.475664 0.149451
23 1.173285 0.353498 0.283905 1.133563 -0.17258 -0.91605 0.369025 -0.32726 -0.24665 -0.04614
23 1.322707 -0.17404 0.434555 0.576038 -0.83676 -0.83108 -0.2649 -0.22098 -1.07142 0.868559
23 -0.41429 0.905437 1.727453 1.473471 0.007443 -0.20033 0.740228 -0.02925 -0.59339 -0.34619
23 1.059387 -0.17532 1.26613 1.18611 -0.786 0.578435 -0.76708 0.401046 0.6995 -0.06474
24 1.237429 0.061043 0.380526 0.761564 -0.35977 -0.49408 0.006494 -0.13386 0.43881 -0.20736
25 1.114009 0.085546 0.493702 1.33576 -0.30019 -0.01075 -0.11876 0.188617 0.205687 0.082262
26 -0.52991 0.873892 1.347247 0.145457 0.414209 0.100223 0.711206 0.176066 -0.28672 -0.48469
V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
0.403993 0.251412 -0.01831 0.277838 -0.11047 0.066928 0.128539 -0.18911 0.133558 -0.02105 149.62
-0.14578 -0.06908 -0.22578 -0.63867 0.101288 -0.33985 0.16717 0.125895 -0.00898 0.014724 2.69
-2.26186 0.52498 0.247998 0.771679 0.909412 -0.68928 -0.32764 -0.1391 -0.05535 -0.05975 378.66
-1.23262 -0.20804 -0.1083 0.005274 -0.19032 -1.17558 0.647376 -0.22193 0.062723 0.061458 123.5
0.803487 0.408542 -0.00943 0.798278 -0.13746 0.141267 -0.20601 0.502292 0.219422 0.215153 69.99
-0.03319 0.084968 -0.20825 -0.55982 -0.0264 -0.37143 -0.23279 0.105915 0.253844 0.08108 3.67
-0.04558 -0.21963 -0.16772 -0.27071 -0.1541 -0.78006 0.750137 -0.25724 0.034507 0.005168 4.99
0.324505 -0.15674 1.943465 -1.01545 0.057504 -0.64971 -0.41527 -0.05163 -1.20692 -1.08534 40.8
0.570328 0.052736 -0.07343 -0.26809 -0.20423 1.011592 0.373205 -0.38416 0.011747 0.142404 93.2
0.451773 0.203711 -0.24691 -0.63375 -0.12079 -0.38505 -0.06973 0.094199 0.246219 0.083076 3.68
-0.22137 -0.38723 -0.0093 0.313894 0.02774 0.500512 0.251367 -0.12948 0.04285 0.016253 7.8
0.707664 0.125992 0.049924 0.238422 0.00913 0.99671 -0.76731 -0.49221 0.042472 -0.05434 9.99
-0.68319 -0.10276 -0.23181 -0.48329 0.084668 0.392831 0.161135 -0.35499 0.026416 0.042422 121.5
-0.98292 -0.1532 -0.03688 0.074412 -0.07141 0.104744 0.548265 0.104094 0.021491 0.021293 27.5
2.221868 -1.58212 1.151663 0.222182 1.020586 0.028317 -0.23275 -0.23556 -0.16478 -0.03015 58.8
0.432535 0.263451 0.499625 1.35365 -0.25657 -0.06508 -0.03912 -0.08709 -0.181 0.129394 15.99
-0.57568 -0.11391 -0.02461 0.196002 0.013802 0.103758 0.364298 -0.38226 0.092809 0.037051 12.99
0.025436 -0.04702 -0.1948 -0.67264 -0.15686 -0.88839 -0.34241 -0.04903 0.079692 0.131024 0.89
-0.40687 -2.19685 -0.5036 0.98446 2.458589 0.042119 -0.48163 -0.62127 0.392053 0.949594 46.8
0.05423 -0.38791 -0.17765 -0.17507 0.040002 0.295814 0.332931 -0.22038 0.022298 0.007602 5
-1.30041 -0.13833 -0.29558 -0.57196 -0.05088 -0.30421 0.072001 -0.42223 0.086553 0.063499 231.71
-2.02761 -0.26932 0.143997 0.402492 -0.04851 -1.37187 0.390814 0.199964 0.016371 -0.01461 34.09
-0.8166 -0.30717 0.018702 -0.06197 -0.10385 -0.37042 0.6032 0.108556 -0.04052 -0.01142 2.28
2.177807 -0.23098 1.65018 0.200454 -0.18535 0.423073 0.820591 -0.22763 0.336634 0.250475 22.75
0.488603 -0.21672 -0.57953 -0.79923 0.8703 0.983421 0.321201 0.14965 0.707519 0.0146 0.89
0.505751 -0.38669 -0.40364 -0.2274 0.742435 0.398535 0.249212 0.274404 0.359969 0.243232 26.43
-0.39093 0.027878 0.067003 0.227812 -0.15049 0.435045 0.724825 -0.33708 0.016368 0.030041 41.88
-1.24062 -0.52295 -0.28438 -0.32336 -0.03771 0.347151 0.559639 -0.28016 0.042335 0.028822 16
0.543969 0.097308 0.077237 0.457331 -0.0385 0.642522 -0.18389 -0.27746 0.182687 0.152665 33
-0.27783 -0.17802 0.013676 0.213734 0.014462 0.002951 0.294638 -0.39507 0.081461 0.02422 12.99
0.348416 -0.06635 -0.24568 -0.5309 -0.04427 0.079168 0.509136 0.288858 -0.0227 0.011836 17.28
-0.14571 -0.27383 -0.05323 -0.00476 -0.03147 0.198054 0.565007 -0.33772 0.029057 0.004453 4.45
-0.82337 -0.29035 0.046949 0.208105 -0.18555 0.001031 0.098816 -0.5529 -0.07329 0.023307 6.14
Chapter 6: References
• S. N. Kalid, K.-C. Khor, K.-H. Ng and G.-K. Tong, "Detecting Frauds and Payment
Defaults on Credit Card Data Inherited with Imbalanced Class Distribution and Overlapping
Class Problems: A Systematic Review," IEEE Access, vol. 12, 2024, pp. 23636-23652.
• F. A. Ghaleb, F. Saeed, M. Al-Sarem, S. N. Qasem and T. Al-Hadhrami, "Ensemble
Synthesized Minority Oversampling-Based Generative Adversarial Networks and Random
Forest Algorithm for Credit Card Fraud Detection," IEEE Access, vol. 11, 2023,
pp. 89694-89710.
• F. K. Alarfaj, I. Malik, H. U. Khan, N. Almusallam, M. Ramzan and M. Ahmed, "Credit
Card Fraud Detection Using State-of-the-Art Machine Learning and Deep Learning
Algorithms," IEEE Access, vol. 10, 2022, pp. 39700-39715.