Credit Card Fraud Detection Using Machine Learning PDF

This document discusses credit card fraud detection using machine learning and data science. It presents an overview of credit card fraud, challenges in detecting fraud, and current fraud detection methods like artificial neural networks and logistic regression. The methodology proposes using machine learning algorithms like isolation forest and local outlier factor to detect anomalous credit card transactions. The full architecture involves obtaining a transaction dataset, preprocessing it, applying machine learning models to detect outliers and fraudulent transactions, and providing feedback to update the models over time. The goal is to accurately detect 100% of fraudulent transactions while minimizing incorrect fraud classifications.


Published by : International Journal of Engineering Research & Technology (IJERT)

http://www.ijert.org ISSN: 2278-0181


Vol. 8 Issue 09, September-2019

Credit Card Fraud Detection using Machine Learning and Data Science

S P Maniraj
Assistant Professor (O.G.)
Department of Computer Science and Engineering
SRM Institute of Science and Technology

Aditya Saini, Swarna Deep Sarkar, Shadab Ahmed
Department of Computer Science and Engineering
SRM Institute of Science and Technology

Abstract: It is vital that credit card companies are able to identify fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Such problems can be tackled with Data Science, and its importance, along with Machine Learning, cannot be overstated. This project intends to illustrate the modelling of a data set using machine learning with Credit Card Fraud Detection. The Credit Card Fraud Detection Problem includes modelling past credit card transactions with the data of the ones that turned out to be fraud. This model is then used to recognize whether a new transaction is fraudulent or not. Our objective here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications. Credit Card Fraud Detection is a typical sample of classification. In this process, we have focused on analysing and pre-processing data sets as well as the deployment of multiple anomaly detection algorithms such as Local Outlier Factor and Isolation Forest algorithm on the PCA transformed Credit Card Transaction data.

Keywords: Credit card fraud, applications of machine learning, data science, isolation forest algorithm, local outlier factor, automated fraud detection.

I. INTRODUCTION
'Fraud' in credit card transactions is unauthorized and unwanted usage of an account by someone other than the owner of that account. Necessary prevention measures can be taken to stop this abuse, and the behaviour of such fraudulent practices can be studied to minimize it and protect against similar occurrences in the future. In other words, credit card fraud can be defined as a case where a person uses someone else’s credit card for personal reasons while the owner and the card-issuing authorities are unaware of the fact that the card is being used.
Fraud detection involves monitoring the activities of populations of users in order to estimate, perceive or avoid objectionable behaviour, which consists of fraud, intrusion, and defaulting.
This is a very relevant problem that demands the attention of communities such as machine learning and data science, where the solution to this problem can be automated.
This problem is particularly challenging from the perspective of learning, as it is characterized by various factors such as class imbalance. The number of valid transactions far outnumbers the fraudulent ones. Also, the transaction patterns often change their statistical properties over the course of time.
These are not the only challenges in the implementation of a real-world fraud detection system, however. In real-world examples, the massive stream of payment requests is quickly scanned by automatic tools that determine which transactions to authorize.
Machine learning algorithms are employed to analyse all the authorized transactions and report the suspicious ones. These reports are investigated by professionals who contact the cardholders to confirm if the transaction was genuine or fraudulent.
The investigators provide feedback to the automated system, which is used to train and update the algorithm to eventually improve the fraud-detection performance over time.
Fraud detection methods are continuously developed to defend against criminals who keep adapting their fraudulent strategies. These frauds are classified as:
• Credit Card Frauds: Online and Offline
• Card Theft
• Account Bankruptcy
• Device Intrusion
• Application Fraud
• Counterfeit Card
• Telecommunication Fraud

IJERTV8IS090031 www.ijert.org 110


(This work is licensed under a Creative Commons Attribution 4.0 International License.)

Some of the currently used approaches to detection of such fraud are:
• Artificial Neural Network
• Fuzzy Logic
• Genetic Algorithm
• Logistic Regression
• Decision tree
• Support Vector Machines
• Bayesian Networks
• Hidden Markov Model
• K-Nearest Neighbour

II. LITERATURE REVIEW
Fraud is an unlawful or criminal deception intended to result in financial or personal benefit. It is a deliberate act that is against the law, rule or policy with an aim to attain unauthorized financial benefit.
Numerous papers pertaining to anomaly or fraud detection in this domain have been published already and are available for public usage. A comprehensive survey conducted by Clifton Phua and his associates has revealed that techniques employed in this domain include data mining applications, automated fraud detection, and adversarial detection. In another paper, Suman, Research Scholar, GJUS&T at Hisar HCE, presented techniques like Supervised and Unsupervised Learning for credit card fraud detection. Even though these methods and algorithms achieved unexpected success in some areas, they failed to provide a permanent and consistent solution to fraud detection.
A similar research domain was presented by Wen-Fang YU and Na Wang, where they used Outlier mining, Outlier detection mining and Distance sum algorithms to accurately predict fraudulent transactions in an emulation experiment on the credit card transaction data set of one certain commercial bank. Outlier mining is a field of data mining which is basically used in monetary and internet fields. It deals with detecting objects that are detached from the main system, i.e. the transactions that aren’t genuine. They have taken attributes of customer’s behaviour and, based on the value of those attributes, they’ve calculated the distance between the observed value of that attribute and its predetermined value.
Unconventional techniques, such as a hybrid data mining/complex network classification algorithm that is able to perceive illegal instances in an actual card transaction data set, based on a network reconstruction algorithm that allows creating representations of the deviation of one instance from a reference group, have proved efficient, typically on medium-sized online transactions.
There have also been efforts to progress from a completely new aspect. Attempts have been made to improve the alert-feedback interaction in case of a fraudulent transaction.
In case of a fraudulent transaction, the authorised system would be alerted and feedback would be sent to deny the ongoing transaction.
Artificial Genetic Algorithm, one of the approaches that shed new light in this domain, countered fraud from a different direction.
It proved accurate in finding out the fraudulent transactions and minimizing the number of false alerts, even though it was accompanied by a classification problem with variable misclassification costs.

III. METHODOLOGY
The approach that this paper proposes uses the latest machine learning algorithms to detect anomalous activities, called outliers.
The basic rough architecture diagram can be represented with the following figure:

When looked at in detail on a larger scale along with real life elements, the full architecture diagram can be represented as follows:

First of all, we obtained our dataset from Kaggle, a data analysis website which provides datasets.
Inside this dataset, there are 31 columns, out of which 28 are named v1-v28 to protect sensitive data.
The other columns represent Time, Amount and Class. Time shows the time elapsed between the first transaction in the dataset and the given one. Amount is the amount of money transacted. Class 0 represents a valid transaction and 1 represents a fraudulent one.
We plot different graphs to check for inconsistencies in the dataset and to visually comprehend it.
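The dataset checks described in the methodology (column layout, class counts, missing values) can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not the notebook code of this paper; the function name `summarize_transactions`, the header spellings `V1` and `V2`, and the four sample rows are assumptions made up for the example.

```python
import csv
import io
from collections import Counter


def summarize_transactions(csv_file):
    """Count rows per Class value and empty cells in a transactions CSV."""
    reader = csv.DictReader(csv_file)
    class_counts = Counter()
    missing = 0
    for row in reader:
        class_counts[row["Class"]] += 1
        # In this sketch an empty string or None marks a missing value.
        missing += sum(1 for value in row.values() if value in ("", None))
    total = sum(class_counts.values())
    fraud_fraction = class_counts["1"] / total if total else 0.0
    return {
        "columns": reader.fieldnames,
        "class_counts": dict(class_counts),
        "fraud_fraction": fraud_fraction,
        "missing": missing,
    }


# Made-up sample with the column layout described in the text; only two of
# the 28 anonymized columns are shown to keep the example short.
sample = io.StringIO(
    "Time,V1,V2,Amount,Class\n"
    "0,0.10,-0.20,12.50,0\n"
    "3,0.30,0.05,7.00,0\n"
    "9,-0.15,0.40,120.00,0\n"
    "14,2.75,-3.10,1.00,1\n"
)
summary = summarize_transactions(sample)
print(summary)
```

On the real file, the same function could be called with an opened CSV file object in place of the in-memory sample; the fraud fraction it reports is what makes the class imbalance visible before any plotting.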


This graph shows that the number of fraudulent transactions is much lower than the legitimate ones.

This graph shows the times at which transactions were done within two days. It can be seen that the least number of transactions were made during night time and the highest during the days.

This graph represents the amount that was transacted. A majority of transactions are relatively small and only a handful of them come close to the maximum transacted amount.

After checking this dataset, we plot a histogram for every column. This is done to get a graphical representation of the dataset which can be used to verify that there are no missing values in the dataset. This is done to ensure that we don’t require any missing value imputation and the machine learning algorithms can process the dataset smoothly.

After this analysis, we plot a heatmap to get a coloured representation of the data and to study the correlation between our predicting variables and the class variable. This heatmap is shown below:

The dataset is now formatted and processed. The Time and Amount columns are standardized and the Class column is removed to ensure fairness of evaluation. The data is processed by a set of algorithms from modules. The following module diagram explains how these algorithms work together:

This data is fit into a model and the following outlier detection modules are applied on it:
• Local Outlier Factor
• Isolation Forest Algorithm

These algorithms are a part of sklearn. The ensemble module in the sklearn package, which provides Isolation Forest, includes ensemble-based methods and functions for classification, regression and outlier detection; Local Outlier Factor is provided by the neighbors module.
This free and open-source Python library is built on NumPy, SciPy and matplotlib, and provides a lot of simple and efficient tools which can be used for data analysis and machine learning. It features various classification, clustering and regression algorithms and is designed to interoperate with the numerical and scientific libraries.
We’ve used the Jupyter Notebook platform to make a program in Python to demonstrate the approach that this paper suggests. This program can also be executed on the cloud using the Google Colab platform, which supports all Python notebook files.
Detailed explanations about the modules with pseudocodes for their algorithms and output graphs are given as follows:

A. Local Outlier Factor
It is an unsupervised outlier detection algorithm. 'Local Outlier Factor' refers to the anomaly score of each sample. It measures the local deviation of the sample data with respect to its neighbours.
More precisely, locality is given by k-nearest neighbours, whose distance is used to estimate the local density.
The pseudocode for this algorithm is written as:

On plotting the results of Local Outlier Factor algorithm, we get the following figure:

By comparing the local density of a sample to that of its neighbours, one can identify samples whose density is substantially lower than that of their neighbours. These values are quite anomalous and they are considered as outliers.
As the dataset is very large, we used only a fraction of it in our tests to reduce processing times.
The final result with the complete dataset processed is also determined and is given in the results section of this paper.

B. Isolation Forest Algorithm
The Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree, the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node.
The average of this path length over a forest of such random trees gives a measure of normality and the decision function which we use.
The pseudocode for this algorithm can be written as:

On plotting the results of Isolation Forest algorithm, we get the following figure:


Partitioning them randomly produces shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for specific samples, they are extremely likely to be anomalies.
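The claim that random partitioning isolates anomalies in fewer splits can be checked with a small experiment. The sketch below is a deliberately simplified, one-dimensional illustration and not the Isolation Forest implementation used in this paper: it has no subsampling, no depth limit and only a single feature, and the helper names and the made-up amounts are assumptions.

```python
import random


def isolation_depth(points, target, rng, depth=0):
    """Number of random splits needed to separate `target` from `points`."""
    if len(points) <= 1:
        return depth
    lo, hi = min(points), max(points)
    if lo == hi:  # identical values cannot be separated any further
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the target.
    same_side = [p for p in points if (p < split) == (target < split)]
    return isolation_depth(same_side, target, rng, depth + 1)


def average_depth(points, target, n_trees=200, seed=0):
    """Average isolation depth of `target` over `n_trees` random partitionings."""
    rng = random.Random(seed)
    depths = [isolation_depth(points, target, rng) for _ in range(n_trees)]
    return sum(depths) / n_trees


# Made-up data: a tight cluster of typical amounts plus one extreme value.
data_rng = random.Random(42)
amounts = [data_rng.gauss(50.0, 5.0) for _ in range(100)] + [5000.0]

depth_typical = average_depth(amounts, amounts[0])
depth_extreme = average_depth(amounts, amounts[-1])
print(f"typical point: {depth_typical:.2f} splits on average")
print(f"extreme point: {depth_extreme:.2f} splits on average")
```

Because the extreme value lies far from the cluster, almost any split value separates it immediately, whereas a typical point needs several further splits inside the cluster. This difference in average path length is what the anomaly score is built on.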
Once the anomalies are detected, the system can be used to
report them to the concerned authorities. For testing purposes,
we are comparing the outputs of these algorithms to determine
their accuracy and precision.
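As a rough illustration of this comparison, the sketch below fits both detectors from scikit-learn on synthetic data and counts how many predictions disagree with the known labels. It is not the notebook used for this paper: the synthetic points, the random seeds, `n_neighbors=20`, and the helper `to_class_labels` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for the transaction features (an assumption; the paper
# uses the Kaggle credit card dataset instead).
rng = np.random.default_rng(0)
n_valid, n_fraud = 500, 10
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_valid, 2)),   # dense cloud of valid points
    rng.uniform(5.0, 9.0, size=(n_fraud, 2)),  # points far away from the cloud
])
y_true = np.array([0] * n_valid + [1] * n_fraud)  # 0 = valid, 1 = fraud

# Expected share of outliers, taken here from the known labels.
outlier_fraction = n_fraud / len(y_true)

detectors = {
    "Isolation Forest": IsolationForest(
        contamination=outlier_fraction, random_state=1
    ),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20, contamination=outlier_fraction
    ),
}


def to_class_labels(raw_pred):
    """Map sklearn's +1 (inlier) / -1 (outlier) to the paper's 0 / 1."""
    return np.where(raw_pred == -1, 1, 0)


results = {}
for name, detector in detectors.items():
    # Both estimators are unsupervised: the labels are never passed to fit.
    y_pred = to_class_labels(detector.fit_predict(X))
    results[name] = {
        "flagged": int(y_pred.sum()),
        "errors": int((y_pred != y_true).sum()),
    }
    print(name, results[name])
```

The labels are used only after prediction, to count errors, which mirrors the idea of removing the Class column before fitting and using it solely for evaluation.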

IV. IMPLEMENTATION
This idea is difficult to implement in real life because it requires cooperation from banks, which aren’t willing to share information due to market competition, and also due to legal reasons and protection of their users’ data.
Therefore, we looked up some reference papers which followed similar approaches and gathered results. As stated in one of these reference papers:
“This technique was applied to a full application data set supplied by a German bank in 2006. For banking confidentiality reasons, only a summary of the results obtained is presented below. After applying this technique, the level 1 list encompasses a few cases but with a high probability of being fraudsters.
All individuals mentioned in this list had their cards closed to avoid any risk due to their high-risk profile. The condition is more complex for the other list. The level 2 list is still restricted adequately to be checked on a case by case basis. Credit and collection officers considered that half of the cases in this list could be considered as suspicious fraudulent behaviour. For the last list and the largest, the work is equitably heavy. Less than a third of them are suspicious.
In order to maximize the time efficiency and the overhead charges, a possibility is to include a new element in the query; this element can be the five first digits of the phone numbers, the email address, and the password, for instance, those new queries can be applied to the level 2 list and level 3 list.”

V. RESULTS
The code prints out the number of false positives it detected and compares it with the actual values. This is used to calculate the accuracy score and precision of the algorithms.
The fraction of data we used for faster testing is 10% of the entire dataset. The complete dataset is also used at the end and both the results are printed.
These results, along with the classification report for each algorithm, are given in the output as follows, where class 0 means the transaction was determined to be valid and 1 means it was determined to be a fraud transaction.
This result is matched against the class values to check for false positives.
Results when 10% of the dataset is used:

Results when the complete dataset is used:

VI. CONCLUSION
Credit card fraud is without a doubt an act of criminal dishonesty. This article has listed out the most common methods of fraud along with their detection methods and reviewed recent findings in this field. This paper has also explained in detail how machine learning can be applied to get better results in fraud detection, along with the algorithm, pseudocode, an explanation of its implementation and experimentation results.
While the algorithm does reach over 99.6% accuracy, its precision remains only at 28% when a tenth of the data set is taken into consideration. However, when the entire dataset is fed into the algorithm, the precision rises to 33%. This high percentage of accuracy is to be expected due to the huge imbalance between the number of valid and the number of fraudulent transactions.
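Why accuracy stays high while precision stays low under such an imbalance can be seen from a small worked example. The confusion-matrix counts below are hypothetical round numbers chosen for illustration and are not the results of this paper; the function name `confusion_metrics` is likewise an assumption.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when nothing is flagged; report 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 100,000 transactions, 200 of them fraudulent.
# A detector that flags 200 transactions, 60 of them correctly:
detector = confusion_metrics(tp=60, fp=140, fn=140, tn=99_660)
# A trivial baseline that labels every transaction as valid:
always_valid = confusion_metrics(tp=0, fp=0, fn=200, tn=99_800)

print("detector     acc=%.4f prec=%.2f rec=%.2f" % detector)
print("always valid acc=%.4f prec=%.2f rec=%.2f" % always_valid)
```

With these counts the detector reaches 99.72% accuracy at 30% precision, while the baseline that never flags anything reaches 99.80% accuracy without catching a single fraud. Accuracy alone therefore says little on data this imbalanced, and precision and recall are the more informative figures.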


Since the entire dataset consists of only two days’ transaction records, it is only a fraction of the data that can be made available if this project were to be used on a commercial scale. Being based on machine learning algorithms, the program will only increase its efficiency over time as more data is put into it.

VII. FUTURE ENHANCEMENTS
While we couldn’t reach our goal of 100% accuracy in fraud detection, we did end up creating a system that can, with enough time and data, get very close to that goal. As with any such project, there is some room for improvement here.
The very nature of this project allows for multiple algorithms to be integrated together as modules and their results can be combined to increase the accuracy of the final result.
This model can further be improved with the addition of more algorithms into it. However, the output of these algorithms needs to be in the same format as the others. Once that condition is satisfied, the modules are easy to add as done in the code. This provides a great degree of modularity and versatility to the project.
More room for improvement can be found in the dataset. As demonstrated before, the precision of the algorithms increases when the size of the dataset is increased. Hence, more data should make the model more accurate in detecting frauds and reduce the number of false positives. However, this requires official support from the banks themselves.

REFERENCES
[1] John Richard D. Kho and Larry A. Vea, “Credit Card Fraud Detection Based on Transaction Behaviour,” Proc. of the 2017 IEEE Region 10 Conference (TENCON), Malaysia, November 5-8, 2017.
[2] Clifton Phua, Vincent Lee, Kate Smith and Ross Gayler, “A Comprehensive Survey of Data Mining-based Fraud Detection Research,” School of Business Systems, Faculty of Information Technology, Monash University, Wellington Road, Clayton, Victoria 3800, Australia.
[3] Suman, Research Scholar, GJUS&T Hisar HCE, Sonepat, “Survey Paper on Credit Card Fraud Detection,” International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3, Issue 3, March 2014.
[4] Wen-Fang YU and Na Wang, “Research on Credit Card Fraud Detection Model Based on Distance Sum,” 2009 International Joint Conference on Artificial Intelligence.
[5] Massimiliano Zanin, Miguel Romance, Regino Criado and Santiago Moral, “Credit Card Fraud Detection through Parenclitic Network Analysis,” Hindawi Complexity, Volume 2018, Article ID 5764370, 9 pages.
[6] “Credit Card Fraud Detection: A Realistic Modeling and a Novel Learning Strategy,” IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 8, August 2018.
[7] Ishu Trivedi, Monika, Mrigya and Mridushi, “Credit Card Fraud Detection,” International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 1, January 2016.
[8] David J. Wetson, David J. Hand, M Adams, Whitrow and Piotr Jusczak, “Plastic Card Fraud Detection using Peer Group Analysis,” Springer, Issue 2008.
