Credit Card Fraud Detection Using Machine Learning Techniques: A Comparative Analysis
Abstract—Financial fraud is an ever growing menace with far reaching consequences in the financial industry. Data mining has played an imperative role in the detection of credit card fraud in online transactions. Credit card fraud detection, which is a data mining problem, becomes challenging due to two major reasons: first, the profiles of normal and fraudulent behaviours change constantly, and secondly, credit card fraud data sets are highly skewed. The performance of fraud detection in credit card transactions is greatly affected by the sampling approach on the dataset, the selection of variables and the detection technique(s) used. This paper investigates the performance of naïve bayes, k-nearest neighbour and logistic regression on highly skewed credit card fraud data. The dataset of credit card transactions is sourced from European cardholders and contains 284,807 transactions. A hybrid technique of under-sampling and over-sampling is carried out on the skewed data. The three techniques are applied on the raw and preprocessed data. The work is implemented in Python. The performance of the techniques is evaluated based on accuracy, sensitivity, specificity, precision, Matthews correlation coefficient and balanced classification rate. The results show that the optimal accuracies for the naïve bayes, k-nearest neighbour and logistic regression classifiers are 97.69%, 97.92% and 54.86% respectively. The comparative results show that k-nearest neighbour performs better than the naïve bayes and logistic regression techniques.

Keywords—credit card fraud; data mining; naïve bayes; k-nearest neighbour; logistic regression; comparative analysis

I. INTRODUCTION

Financial fraud is an ever growing menace with far reaching consequences in the finance industry, corporate organizations, and government. Fraud can be defined as criminal deception with intent of acquiring financial gain. Growing dependence on internet technology has brought increased credit card transactions. As credit card transactions become the most prevailing mode of payment for both online and offline transactions, the credit card fraud rate also accelerates. Credit card fraud can come as either inner card fraud or external card fraud. Inner card fraud occurs as a result of collusion between cardholders and the bank, using a false identity to commit fraud, while external card fraud involves the use of a stolen credit card to obtain cash through dubious means. Much research has been devoted to the detection of external card fraud, which accounts for the majority of credit card fraud. Detecting fraudulent transactions using traditional manual methods is time consuming and inefficient, and the advent of big data has made manual methods even more impractical. Financial institutions have therefore turned their attention to recent computational methodologies to handle the credit card fraud problem.

Data mining is one notable method used in solving the credit card fraud detection problem. Credit card fraud detection is the process of sorting transactions into two classes: legitimate (genuine) and fraudulent transactions [1]. Credit card fraud detection is based on analysis of a card's spending behaviour. Many techniques have been applied to credit card fraud detection: artificial neural networks [2], genetic algorithms [3, 4], support vector machines [5], frequent itemset mining [6], decision trees [7], the migrating birds optimization algorithm [8] and naïve bayes [9]. A comparative analysis of logistic regression and naïve bayes is carried out in [10]. The performance of bayesian and neural networks [11] is evaluated on credit card fraud data. Decision trees, neural networks and logistic regression are tested for their applicability in fraud detection [12]. The paper [13] evaluates two advanced data mining approaches, support vector machines and random forests, together with logistic regression, as part of an attempt to better detect credit card fraud, while neural networks and logistic regression are applied on the credit card fraud detection problem in [14].

A number of challenges are associated with credit card fraud detection: fraudulent behaviour profiles are dynamic, that is, fraudulent transactions tend to look like legitimate ones; credit card transaction datasets are rarely available and are highly imbalanced (or skewed); optimal selection of features (variables) for the models; and the choice of a suitable metric to evaluate the performance of techniques on skewed credit card fraud data. Credit card fraud detection performance is greatly affected by the type of sampling approach used, the selection of variables and the detection technique(s) used. This study investigates the effect of hybrid sampling on the fraud detection performance of naïve bayes, k-nearest neighbour and logistic regression classifiers.
The performance of the techniques is evaluated based on the true positive, true negative, false positive and false negative rates metric. The performance comparison of the classifiers is analyzed based on accuracy, sensitivity, specificity, precision, Matthews correlation coefficient and balanced classification rate.

A. Dataset

The dataset is sourced from the ULB Machine Learning Group and its description is found in [32]. The dataset contains credit card transactions made by European cardholders in September 2013. This dataset presents transactions that occurred in two days, consisting of 284,807 transactions. The positive class (fraud cases) makes up 0.172% of the transaction data, so the dataset is highly unbalanced and skewed towards the negative class. It contains only numerical (continuous) input variables which are the result of a Principal Component Analysis (PCA) transformation, resulting in 28 principal components. Together with the 'time' and 'amount' features, a total of 30 input features are utilized in this study. The details and background information of the features cannot be presented due to confidentiality issues. The 'time' feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The 'amount' feature is the transaction amount. The 'class' feature is the target class for the binary classification and takes value 1 for positive cases (fraud) and 0 for negative cases (non fraud).
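As a concrete illustration, the dataset can be loaded and its class imbalance confirmed with a few lines of pandas. This is a minimal sketch, assuming the public ULB CSV file name ('creditcard.csv') and its column names ('Time', 'Amount', 'Class', 'V1' to 'V28'), which the paper itself does not state.

```python
import pandas as pd

# Load the ULB credit card fraud dataset (assumed file name: creditcard.csv).
df = pd.read_csv("creditcard.csv")

print(df.shape)                     # expected: (284807, 31) -- 30 inputs plus 'Class'
print(df["Class"].value_counts())   # class 1 (fraud) is the rare positive class
print(100 * df["Class"].mean())     # roughly 0.172% positive cases
```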
B. Hybrid Sampling of Dataset

Data pre-processing is carried out on the data. A hybrid of under-sampling and over-sampling is applied to the highly unbalanced dataset to achieve two sets of distribution (10:90 and 34:66) for analysis. This is done by stepwise addition of minority-class (fraud) instances through over-sampling and removal of majority-class (legitimate) instances through under-sampling until each target distribution is reached.
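The paper does not give the exact resampling procedure, so the following is a minimal sketch of one plausible hybrid scheme: random under-sampling of the legitimate class combined with random over-sampling (with replacement) of the fraud class until a target ratio such as 10:90 or 34:66 is reached. The function name, the total-size parameter and the use of pandas sampling are assumptions.

```python
import pandas as pd

def hybrid_resample(df, target_col="Class", pos_frac=0.34, n_total=1000, seed=42):
    """Return a resampled frame with roughly pos_frac positives
    (e.g. 0.10 for a 10:90 split or 0.34 for a 34:66 split)."""
    pos = df[df[target_col] == 1]
    neg = df[df[target_col] == 0]
    n_pos = int(n_total * pos_frac)
    # Over-sample the scarce fraud cases with replacement (over-sampling) ...
    pos_s = pos.sample(n=n_pos, replace=True, random_state=seed)
    # ... and draw a subset of the legitimate cases (under-sampling).
    neg_s = neg.sample(n=n_total - n_pos, replace=False, random_state=seed)
    # Shuffle so the classes are interleaved before the train/test split.
    return pd.concat([pos_s, neg_s]).sample(frac=1.0, random_state=seed)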
C. Naïve Bayes Classifier

The classifier computes the posterior probability of each class given the feature values:

P(ci | fk) = [P(fk | ci) × P(ci)] / P(fk)    (4)

P(f1, …, fn | ci) = ∏k=1..n P(fk | ci),  i = 1, 2    (5)

where n represents the maximum number of features (30), P(ci | fk) is the probability of feature value fk being in class ci, P(fk | ci) is the probability of generating feature value fk given class ci, and P(ci) and P(fk) are the probability of occurrence of class ci and the probability of feature value fk occurring respectively. The classifier performs the binary classification based on the Bayesian classification rule:

If P(c1 | fk) > P(c2 | fk) then the classification is C1
If P(c1 | fk) < P(c2 | fk) then the classification is C2

Ci is the target class for classification, where C1 is the negative class (non fraud cases) and C2 is the positive class (fraud cases).
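The paper names no library for its naïve bayes implementation; the sketch below uses scikit-learn's GaussianNB on stand-in data as one plausible realization of the rule in (4)-(5). The Gaussian likelihood is an assumption here, chosen because the PCA features are continuous; the synthetic data is illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data shaped like the study's inputs: 30 continuous features, 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = (rng.random(1000) < 0.34).astype(int)       # roughly a 34:66 class ratio

# 70:30 train/test split, as used later in the evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)   # estimates P(fk | ci) and the priors P(ci)
y_pred = nb.predict(X_te)           # picks the class with the larger posterior, per (4)
```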
D. K-Nearest Neighbour Classifier

The k-nearest neighbour is an instance based learner which carries out its classification based on a similarity measure, such as the Euclidean, Manhattan or Minkowski distance functions. The Euclidean and Manhattan distances work well with continuous variables, while measures such as the Hamming distance suit categorical variables. The Euclidean distance measure is used in this study for the kNN classifier. The Euclidean distance (Dij) between two input vectors (Xi, Xj) is given by:

Dij = √( Σk=1..n (Xik − Xjk)² ),  k = 1, 2, …, n    (6)

For every data point in the dataset, the Euclidean distance between the input data point and the current point is calculated. These distances are sorted in increasing order and the k items with the lowest distances to the input data point are selected. The majority class among these items is found and the classifier returns that majority class as the classification for the input point. Parameter tuning for k is carried out for k = 1, 3, 5, 7, 9, 11, and k = 3 showed optimal performance. Thus, the value k = 3 is used in the classifier.
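A from-scratch sketch of the procedure just described: Euclidean distances per (6), sorted in increasing order, with a majority vote among the k = 3 nearest neighbours. This is illustrative NumPy code, not the authors' implementation, and the array names are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify one query point x by majority vote among its k nearest
    training points under the Euclidean distance of (6)."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Usage (hypothetical arrays): y_hat = knn_predict(X_tr, y_tr, X_te[0], k=3)
```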
E. Logistic Regression Classifier

Logistic regression uses a functional approach to estimate the probability of a binary response based on one or more variables (features). It finds the best-fit parameters to a nonlinear function called the sigmoid. The sigmoid function (σ) and the input (x) to the sigmoid function are shown in (7) and (8).

σ(x) = 1 / (1 + e^(−x))    (7)

x = w0z0 + w1z1 + … + wnzn    (8)

The vector z is the input data and w is the vector of best-fit coefficients; corresponding elements are multiplied and summed to give one number, whose sigmoid value determines the classification of the target class. If the value of the sigmoid is more than 0.5, it is considered a 1; otherwise, it is a 0. An optimization method is used to train the classifier and find the best-fit parameters. The gradient ascent (9) and modified stochastic gradient ascent optimization methods were experimented on to evaluate their performance on the classifier.

w := w + α ∇w f(w)    (9)

where α is the step size that controls the magnitude of movement along the gradient and ∇w f(w) is the gradient of the objective function with respect to the weights. The steps are continued until a stopping criterion is met. The optimization methods are investigated (for 50 to 1000 iterations) to check whether the parameters are converging, that is, whether they are reaching steady values or are constantly changing. At 100 iterations, steady values of the parameters are achieved.

Stochastic gradient ascent incrementally updates the classifier as each new instance arrives rather than processing all of the data at once. It starts with all weights set to 1. Then, for every instance in the dataset, the gradient is calculated, the weight vector is updated by the product of the step size alpha and the gradient, and the weight vector is returned. Stochastic gradient ascent is used in this study because, given the large size of the data, it updates the weights using only one instance at a time, thus reducing computational complexity.
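The description above translates directly into code. The sketch below follows the text (weights initialized to 1, one-instance-at-a-time updates per (9), sigmoid threshold of 0.5); the step size alpha = 0.01 is an assumption, and 100 passes mirrors the iteration count at which the paper reports steady parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # equation (7)

def stochastic_gradient_ascent(X, y, alpha=0.01, n_iter=100):
    """Train logistic regression weights one instance at a time, per (9)."""
    w = np.ones(X.shape[1])                     # all weights start at 1, as in the text
    for _ in range(n_iter):                     # repeat until the weights are steady
        for i in range(X.shape[0]):
            h = sigmoid(np.dot(X[i], w))        # predicted probability, (7)-(8)
            w = w + alpha * (y[i] - h) * X[i]   # w := w + alpha * gradient
    return w

def predict(X, w):
    """Sigmoid output above 0.5 is classified 1 (fraud), otherwise 0."""
    return (sigmoid(X @ w) > 0.5).astype(int)
```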
IV. PERFORMANCE EVALUATION AND RESULTS

Four basic metrics are used in evaluating the experiments, namely the True Positive (TPR), True Negative (TNR), False Positive (FPR) and False Negative (FNR) rates.

TPR = TP / P    (10)

TNR = TN / N    (11)

FPR = FP / N    (12)

FNR = FN / P    (13)

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative test cases classified, while P and N are the total numbers of positive and negative class cases under test. True positives are cases classified as positive which are actually positive. True negatives are cases rightly classified as negative. False positives are cases classified as positive which are actually negative. False negatives are cases classified as negative which are actually positive.

The performance of the naïve bayes, k-nearest neighbour and logistic regression classifiers is evaluated based on accuracy, sensitivity, specificity, precision, Matthews correlation coefficient (MCC) and balanced classification rate. These evaluation metrics are employed based on their relevance in evaluating imbalanced binary classification problems.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (14)

Sensitivity = TP / (TP + FN)    (15)

Specificity = TN / (FP + TN)    (16)

Precision = TP / (TP + FP)    (17)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (18)

BCR = ½ (TP/P + TN/N)    (19)

Sensitivity (recall) gives the accuracy on positive (fraud) case classification. Specificity gives the accuracy on negative (legitimate) case classification. Precision gives the accuracy among cases classified as fraud (positive). The Matthews Correlation Coefficient (MCC) is an evaluation metric for binary classification problems; it is used mainly with unbalanced data sets because its evaluation takes TP, FP, TN and FN into account. The MCC value lies between -1 and +1: a value of +1 represents perfect classification while -1 represents total disagreement between classification and observation. The balanced classification rate is the average of sensitivity (the portion of positives classified as positive) and specificity (the portion of negatives classified as negative) [33].
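All of equations (10)-(19) reduce to arithmetic on the four confusion counts, as the sketch below shows (plain Python; the function name and dictionary output are illustrative only).

```python
import math

def evaluate(tp, tn, fp, fn):
    """Compute the metrics of equations (10)-(19) from confusion counts."""
    p, n = tp + fn, tn + fp                    # total positive / negative cases
    tpr, tnr = tp / p, tn / n                  # (10) sensitivity, (11) specificity
    fpr, fnr = fp / n, fn / p                  # (12), (13)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
        "accuracy": (tp + tn) / (p + n),       # (14)
        "precision": tp / (tp + fp),           # (17)
        "MCC": (tp * tn - fp * fn) / mcc_den,  # (18)
        "BCR": 0.5 * (tpr + tnr),              # (19)
    }
```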
A. Results

In this study, three classifier models based on naïve bayes, k-nearest neighbour and logistic regression are developed. To evaluate these models, 70% of the dataset is used for training while 30% is set aside for validation and testing. Accuracy, sensitivity, specificity, precision, Matthews correlation coefficient (MCC) and balanced classification rate are used to evaluate the performance of the three classifiers. The accuracies of the classifiers for the original 0.172:99.828 dataset distribution and the sampled 10:90 and 34:66 distributions are presented in Tables 1, 2 and 3 respectively.
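The 70:30 protocol can be reproduced along the following lines. This sketch uses scikit-learn's GaussianNB, KNeighborsClassifier (k = 3) and LogisticRegression as stand-ins for the classifiers described in Section III; the arrays X and y are hypothetical placeholders for one resampled distribution, and evaluate() is the function from the metrics sketch above.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# X, y: features/labels of one resampled distribution (e.g. 34:66), assumed defined.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "naive bayes": GaussianNB(),
    "k-nearest neighbour": KNeighborsClassifier(n_neighbors=3),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(name, evaluate(tp, tn, fp, fn))   # evaluate() from the metrics sketch
```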
TABLE 2. Accuracy results for 10:90 data distribution

Metrics                            Naïve Bayes   k-Nearest Neighbour   Logistic Regression
Accuracy                           0.9752        0.9715                0.3639
Sensitivity                        0.8210        0.8285                0.7155
Specificity                        0.9754        1.0000                0.2939
Precision                          0.0546        1.0000                0.1678
Matthews Correlation Coefficient   +0.2080       +0.8950               +0.0077
Balanced Classification Rate       0.8975        0.9143                0.5047

TABLE 3. Accuracy results for 34:66 data distribution

Metrics                            Naïve Bayes   k-Nearest Neighbour   Logistic Regression
Accuracy                           0.9769        0.9792                0.5486
Sensitivity                        0.9514        0.9375                0.5833
Specificity                        0.9896        1.0000                0.5313
Precision                          0.9786        1.0000                0.3836
Matthews Correlation Coefficient   +0.9478       +0.9535               +0.1080
Balanced Classification Rate       0.9705        0.9688                0.5573

An observation of the metric tables shows that there is significant improvement from the sampled dataset distribution of 10:90 to 34:66 for the accuracy, sensitivity, specificity, Matthews correlation coefficient and balanced classification rate of the classifiers. This shows that hybrid sampling (under-sampling and over-sampling) of a highly imbalanced dataset greatly improves the performance of binary classification. The true positive, true negative, false positive and false negative rates of the classifiers for each un-sampled and sampled data distribution are shown in Tables 4, 5 and 6. Logistic regression is the only technique that did not show improvement in false negative rate from the 10:90 to the 34:66 data distribution. However, it showed its overall best performance in the un-sampled distribution.

TABLE 4. Basic metric rates for un-sampled data distribution

Metrics               Naïve Bayes   k-Nearest Neighbour   Logistic Regression
True Positive Rate    0.8072        0.8835                0.9767

TABLE 6. Basic metric rates for 34:66 data distribution

Metrics               Naïve Bayes   k-Nearest Neighbour   Logistic Regression
True Positive Rate    0.9514        0.9375                0.5833
False Positive Rate   0.0104        0.0000                0.4688
True Negative Rate    0.9896        1.0000                0.5313
False Negative Rate   0.0486        0.0625                0.4167
B. Comparative Performance

The performance evaluation of the three classifiers for the 34:66 data distribution is shown in Figure 1; this distribution showed the better performance of the two sampled distributions. The k-nearest neighbour technique showed superior performance across the evaluation metrics used, reaching the highest possible value for specificity and precision (that is, 1.0) for the two data distributions because the kNN classifier recorded no false positives in the classification. The naïve bayes classifier only outperformed kNN in accuracy for the 10:90 data distribution. The logistic regression classifier showed the least performance among the three classifiers evaluated, although there was significant improvement in its performance between the two sets of sampled data distributions. Since not all related works carried out evaluation based on accuracy, sensitivity, specificity, precision, Matthews correlation coefficient and balanced classification rate, other related works are compared with this study based on the basic true positive and false positive rates. Figures 2 and 3 show the TPR and FPR evaluation of the proposed naïve bayes, kNN and LR classifiers against other related works. The related works are referenced using their reference numbers delimited within square brackets "[ ]".

Figure 3. TPR and FPR evaluation of k-nearest neighbour classifiers (TPR = True Positive Rate; FPR = False Positive Rate; Proposed kNN = proposed k-nearest neighbour classifier)