Credit Card Fraud Detection Using Machine Learning PDF
Abstract— It is vital that credit card companies are able to identify fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Such problems can be tackled with Data Science, and its importance, along with Machine Learning, cannot be overstated. This project intends to illustrate the modelling of a data set using machine learning with Credit Card Fraud Detection. The Credit Card Fraud Detection Problem includes modelling past credit card transactions with the data of the ones that turned out to be fraud. This model is then used to recognize whether a new transaction is fraudulent or not. Our objective here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications. Credit Card Fraud Detection is a typical sample of classification. In this process, we have focused on analysing and pre-processing data sets as well as the deployment of multiple anomaly detection algorithms such as Local Outlier Factor and Isolation Forest algorithm on the PCA transformed Credit Card Transaction data.
I. INTRODUCTION
'Fraud' in credit card transactions is unauthorized and unwanted usage of an account by someone other than the owner of that account. Necessary prevention measures can be taken to stop this abuse, and the behaviour of such fraudulent practices can be studied to minimize it and protect against similar occurrences in the future. In other words, Credit Card Fraud can be defined as a case where a person uses someone else’s credit card for personal reasons while the owner and the card issuing authorities are unaware of the fact that the card is being used.
Fraud detection involves monitoring the activities of populations of users in order to estimate, perceive or avoid objectionable behaviour, which consists of fraud, intrusion, and defaulting.
This is a very relevant problem that demands the attention of communities such as machine learning and data science, where the solution to this problem can be automated.
This problem is particularly challenging from the perspective of learning, as it is characterized by various factors such as class imbalance. The number of valid transactions far outnumbers fraudulent ones. Also, the transaction patterns often change their statistical properties over the course of time.
These are not the only challenges in the implementation of a real-world fraud detection system, however. In real-world examples, the massive stream of payment requests is quickly scanned by automatic tools that determine which transactions to authorize.
Machine learning algorithms are employed to analyse all the authorized transactions and report the suspicious ones. These reports are investigated by professionals who contact the cardholders to confirm if the transaction was genuine or fraudulent.
The investigators provide feedback to the automated system, which is used to train and update the algorithm to eventually improve the fraud-detection performance over time.
Fraud detection methods are continuously developed to defend against criminals adapting their fraudulent strategies. These frauds are classified as:
• Credit Card Frauds: Online and Offline
• Card Theft
• Account Bankruptcy
• Device Intrusion
• Application Fraud
• Counterfeit Card
• Telecommunication Fraud
Some of the currently used approaches to detection of such fraud are:
• Artificial Neural Network
• Fuzzy Logic
• Genetic Algorithm
• Logistic Regression
• Decision tree
• Support Vector Machines
• Bayesian Networks
• Hidden Markov Model
• K-Nearest Neighbour
[...] was accompanied by a classification problem with variable misclassification costs.
III. METHODOLOGY
The approach that this paper proposes uses the latest machine learning algorithms to detect anomalous activities, called outliers.
The basic rough architecture diagram can be represented with the following figure:
This graph represents the amount that was transacted. A majority of transactions are relatively small and only a handful of them come close to the maximum transacted amount.
After checking this dataset, we plot a histogram for every column. This is done to get a graphical representation of the dataset which can be used to verify that there are no missing values.
These algorithms are a part of sklearn. The ensemble module in the sklearn package includes ensemble-based methods and functions for classification, regression and outlier detection.
This free and open-source Python library is built using the NumPy, SciPy and matplotlib modules, and provides a lot of simple and efficient tools which can be used for data analysis and machine learning. It features various classification, clustering and regression algorithms and is designed to interoperate with the numerical and scientific libraries.
We’ve used the Jupyter Notebook platform to make a program in Python to demonstrate the approach that this paper suggests. This program can also be executed on the cloud using the Google Colab platform, which supports all Python notebook files.
Detailed explanations about the modules with pseudocodes for their algorithms and output graphs are given as follows:
A. Local Outlier Factor
It is an Unsupervised Outlier Detection algorithm. 'Local Outlier Factor' refers to the anomaly score of each sample. It measures the local deviation of the sample data with respect to its neighbours.
More precisely, locality is given by k-nearest neighbours, whose distance is used to estimate the local density.
The pseudocode for this algorithm is written as:
On plotting the results of the Local Outlier Factor algorithm, we get the following figure:
By comparing the local density of a sample to that of its neighbours, one can identify samples that have a substantially lower density than their neighbours. These samples are anomalous and are considered outliers.
As the dataset is very large, we used only a fraction of it in our tests to reduce processing times.
The final result with the complete dataset processed is also determined and is given in the results section of this paper.
B. Isolation Forest Algorithm
The Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Recursive partitioning can be represented by a tree, so the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node.
The average of this path length over a forest of such random trees gives a measure of normality and the decision function which we use.
The pseudocode for this algorithm can be written as:
On plotting the results of the Isolation Forest algorithm, we get the following figure:
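The two detectors described in this section can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the program used for the reported results: the data is synthetic (a dense cluster of "valid" points plus a few scattered "fraud" points standing in for the PCA-transformed transactions), and the names `X`, `y`, `outlier_fraction`, `detectors` and `results`, as well as the parameter values, are chosen here for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for the transaction features (hypothetical data).
rng = np.random.RandomState(42)
n_valid, n_fraud = 2000, 20
X_valid = rng.normal(loc=0.0, scale=1.0, size=(n_valid, 5))     # dense cluster
X_fraud = rng.uniform(low=-8.0, high=8.0, size=(n_fraud, 5))    # scattered points
X = np.vstack([X_valid, X_fraud])
y = np.concatenate([np.zeros(n_valid, dtype=int), np.ones(n_fraud, dtype=int)])

# Expected share of outliers, passed to both detectors as `contamination`.
outlier_fraction = n_fraud / len(X)

detectors = {
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20, contamination=outlier_fraction
    ),
    "Isolation Forest": IsolationForest(
        n_estimators=100, contamination=outlier_fraction, random_state=42
    ),
}

results = {}
for name, detector in detectors.items():
    # Both estimators return +1 for inliers and -1 for outliers.
    raw = detector.fit_predict(X)
    # Map to the class labels used in this paper: 0 = valid, 1 = fraud.
    y_pred = np.where(raw == -1, 1, 0)
    n_errors = int((y_pred != y).sum())
    results[name] = (y_pred, n_errors)
    print(f"{name}: {n_errors} errors, accuracy {accuracy_score(y, y_pred):.4f}")
```

Both estimators are unsupervised, so the labels `y` are used only to count errors afterwards. Mapping the -1/+1 output to 1/0 puts both detectors in the same output format, which makes their results directly comparable.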
IV. IMPLEMENTATION
This idea is difficult to implement in real life because it requires cooperation from banks, which aren’t willing to share information due to market competition, legal reasons and the protection of their users’ data.
Therefore, we looked up some reference papers which followed similar approaches and gathered results. As stated in one of these reference papers:
“This technique was applied to a full application data set supplied by a German bank in 2006. For banking confidentiality reasons, only a summary of the results obtained is presented below. After applying this technique, the level 1 list encompasses a few cases but with a high probability of being fraudsters.
All individuals mentioned in this list had their cards closed to avoid any risk due to their high-risk profile. The condition is more complex for the other list. The level 2 list is still restricted adequately to be checked on a case by case basis. Credit and collection officers considered that half of the cases in this list could be considered as suspicious fraudulent behaviour. For the last list and the largest, the work is equitably heavy. Less than a third of them are suspicious.
In order to maximize the time efficiency and the overhead charges, a possibility is to include a new element in the query; this element can be the five first digits of the phone numbers, the email address, and the password, for instance, those new queries can be applied to the level 2 list and level 3 list.”
V. RESULTS
The code prints out the number of false positives it detected and compares it with the actual values. This is used to calculate the accuracy score and precision of the algorithms.
The fraction of data we used for faster testing is 10% of the entire dataset. The complete dataset is also used at the end and both the results are printed.
These results, along with the classification report for each algorithm, are given in the output as follows, where class 0 means the transaction was determined to be valid and 1 means it was determined to be a fraudulent transaction.
This result is matched against the class values to check for false positives.
Results when 10% of the dataset is used:
Results when the complete dataset is used:
VI. CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This article has listed out the most common methods of fraud along with their detection methods and reviewed recent findings in this field. This paper has also explained in detail how machine learning can be applied to get better results in fraud detection, along with the algorithm, pseudocode, explanation of its implementation and experimentation results.
While the algorithm does reach over 99.6% accuracy, its precision remains only at 28% when a tenth of the data set is taken into consideration. However, when the entire dataset is fed into the algorithm, the precision rises to 33%. This high percentage of accuracy is to be expected due to the huge imbalance between the number of valid and number of fraudulent transactions.
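The gap between accuracy and precision follows directly from the class imbalance, and a small worked example makes the arithmetic explicit. The counts below are hypothetical and chosen only for illustration; they are not the confusion matrix of the experiments in this paper. The helper `accuracy_and_precision` is likewise an illustrative name.

```python
def accuracy_and_precision(tp, fp, fn, tn):
    """Return (accuracy, precision) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is flagged; report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Hypothetical sample: 28,000 valid and 50 fraudulent transactions.
# Baseline that never flags anything as fraud.
acc_never, prec_never = accuracy_and_precision(tp=0, fp=0, fn=50, tn=28000)

# Hypothetical detector that flags 50 transactions, 14 of them correctly.
acc_det, prec_det = accuracy_and_precision(tp=14, fp=36, fn=36, tn=27964)

print(f"never flags: accuracy={acc_never:.4f}, precision={prec_never:.2f}")
print(f"detector:    accuracy={acc_det:.4f}, precision={prec_det:.2f}")
```

With these assumed counts, the baseline that flags nothing already scores about 99.8% accuracy, while the detector scores about 99.7% accuracy with a precision of 28%. Accuracy alone therefore says little about fraud detection quality on such data, which is why precision is reported alongside it.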
Since the entire dataset consists of only two days’ transaction records, it’s only a fraction of the data that can be made available if this project were to be used on a commercial scale. Being based on machine learning algorithms, the program will only increase its efficiency over time as more data is put into it.
VII. FUTURE ENHANCEMENTS
While we couldn’t reach our goal of 100% accuracy in fraud detection, we did end up creating a system that can, with enough time and data, get very close to that goal. As with any such project, there is some room for improvement here.
The very nature of this project allows for multiple algorithms to be integrated together as modules and their results can be combined to increase the accuracy of the final result.
This model can further be improved with the addition of more algorithms into it. However, the output of these algorithms needs to be in the same format as the others. Once that condition is satisfied, the modules are easy to add as done in the code. This provides a great degree of modularity and versatility to the project.
More room for improvement can be found in the dataset. As demonstrated before, the precision of the algorithms increases when the size of the dataset is increased. Hence, more data is likely to make the model more accurate in detecting frauds and reduce the number of false positives. However, this requires official support from the banks themselves.
REFERENCES
[1] John Richard D. Kho and Larry A. Vea, “Credit Card Fraud Detection Based on Transaction Behaviour,” Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017.
[2] Clifton Phua, Vincent Lee, Kate Smith and Ross Gayler, “A Comprehensive Survey of Data Mining-based Fraud Detection Research,” School of Business Systems, Faculty of Information Technology, Monash University, Wellington Road, Clayton, Victoria 3800, Australia.
[3] Suman, Research Scholar, GJUS&T Hisar HCE, Sonepat, “Survey Paper on Credit Card Fraud Detection,” International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3, Issue 3, March 2014.
[4] Wen-Fang Yu and Na Wang, “Research on Credit Card Fraud Detection Model Based on Distance Sum,” 2009 International Joint Conference on Artificial Intelligence.
[5] Massimiliano Zanin, Miguel Romance, Regino Criado and Santiago Moral, “Credit Card Fraud Detection through Parenclitic Network Analysis,” Hindawi Complexity, Volume 2018, Article ID 5764370, 9 pages.
[6] “Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning Strategy,” IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 8, August 2018.
[7] Ishu Trivedi, Monika, Mrigya and Mridushi, “Credit Card Fraud Detection,” International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 1, January 2016.
[8] David J. Wetson, David J. Hand, M. Adams, Whitrow and Piotr Jusczak, “Plastic Card Fraud Detection using Peer Group Analysis,” Springer, Issue 2008.