
A NOVEL HYBRID DATA BALANCING AND

FRAUD DETECTION APPROACH FOR


AUTOMOBILE INSURANCE CLAIMS

Atul Kumar Agrawal

Computer Science & Engineering


Veer Surendra Sai University Of Technology, Burla
2019

A NOVEL HYBRID DATA BALANCING AND
FRAUD DETECTION APPROACH FOR
AUTOMOBILE INSURANCE CLAIMS

A minor project submitted in partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY IN
COMPUTER SCIENCE & ENGINEERING

Submitted by:
Atul Kumar Agrawal
Registration no. : 1602040031

Under The supervision of:

Dr. Suvasini Panigrahi

Associate Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY, BURLA
2019

VEER SURENDRA SAI UNIVERSITY OF TECHNOLOGY, BURLA, ODISHA

Declaration

I declare that this written submission represents my ideas in my own words and wherever
others’ ideas or words have been included, I have adequately cited and referenced the
original sources. I also declare that I have adhered to all principles of academic honesty and
integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in
my submission. I understand that any violation of the above will be cause for disciplinary
action by the university and can also evoke penal action from the sources which have thus
not been properly cited or from whom proper permission has not been taken when needed.

DATE: Atul Kumar Agrawal


Regn. No.: 1602040031
Department of Computer Science & Engineering
Veer Surendra Sai University Of Technology, Burla, Odisha

Certificate

This is to certify that the dissertation entitled " A novel hybrid data balancing and fraud
detection approach for automobile insurance claims" submitted by Atul Kumar Agrawal
is approved for the degree of bachelor of technology in Computer Science and Engineering
is a record of an original research work carried out by him under my supervision and
guidance.

Dr. Manas Ranjan Kabat Dr. Suvasini Panigrahi


Head of Department Supervisor
Acknowledgment

I would like to express my sincere gratitude to my supervisor, Dr. Suvasini Panigrahi, for
her invaluable help during the course work towards this dissertation. She was a source of
constant ideas and encouragement and provided a friendly atmosphere to work in. I am really
very thankful to her for everything.

I am also thankful to Dr. Manas Ranjan Kabat, Head of the Department and to all the
faculties of Department of Computer Science and Engineering for having supported me to
carry out this dissertation and for their constant advice. I would like to thank all my friends
for their encouragement and understanding. I would like to express my heartfelt gratitude to
them.

Atul Kumar Agrawal

Regn. No.: 1602040031

Approval Sheet

This dissertation entitled "A novel hybrid data balancing and fraud detection approach for
automobile insurance claims" by Atul Kumar Agrawal is approved for the degree of
bachelor of technology in "Computer Science and Engineering", Department of
Computer Science and Engineering.

Date: Supervisor
Place: Burla

ABSTRACT

Automobile insurance fraud has been a major issue for insurance companies and has caused
losses of several crores due to fraudulent and false claims. It is a serious crime in most parts
of the world, and offenders may be sentenced to between one and twenty years in jail. Fraud
rings may involve fake patients, fake doctors and fake lawyers working together. Various
machine learning and deep learning techniques have been developed to detect these kinds of
fraud, and research is ongoing to find the method best suited to detecting new fraud patterns
over time. As the number of frauds is low in comparison to the legitimate transactions, we use
one-class classification for the minority class to get better results, so as to minimize the
classification of non-fraudulent data as fraud, i.e. to minimize the false positive alarm rate.
Machine learning techniques such as Support Vector Machine, K-nearest neighbors, Decision
tree, Logistic regression and Naive Bayes have been deployed to detect fraud, and most of
these classifiers achieve good accuracies. The methods have been trained using undersampling
and oversampling of the data to reduce the class imbalance between the fraud and non-fraud
records, and validated using the 10-fold cross validation technique.

Keywords: Automobile insurance fraud, false positive alarm rates, undersampling, Support
Vector Machine, K-nearest neighbors, Decision tree, Logistic regression, Naive Bayes

Table of Contents

CHAPTER 1 INTRODUCTION
1.1 Definitions
1.2 Fraudsters
1.3 Types of fraud
1.4 Methods of fake fraud claims
1.5 Types of automobile fraud
1.6 Problems associated
1.7 Organization of the thesis

CHAPTER 2 MOTIVATION

CHAPTER 3 LITERATURE REVIEW

CHAPTER 4 BACKGROUND STUDY
4.1 Methods for class balancing
4.2 Undersampling and oversampling methods used
4.3 Supervised model training algorithms
4.3.1 Decision Tree
4.3.2 Naive Bayes
4.3.3 Support Vector Machine (SVM)
4.3.4 KNN (K-Nearest Neighbors)
4.3.5 Logistic Regression
4.4 Fraud detection measures

CHAPTER 5 PROPOSED SYSTEM
5.1 Proposed system diagram
5.2 Steps involved in the system implemented
5.3 Data set attributes used
5.4 Data set description
5.5 Implementation environment
5.6 Performance metrics

CHAPTER 6 RESULTS AND DISCUSSION

CHAPTER 7 CONCLUSION AND FUTURE WORK

REFERENCES

List of Figures
Figure 4.1 : Decision Tree: rules leading to the colored clusters
Figure 4.2 : Decision tree: insurance classification
Figure 4.3 : SVM trained with samples from two classes
Figure 4.4 : KNN classification model
Figure 4.5 : Logistic regression model
Figure 5.1 : Proposed model flow chart

List of Tables
Table 3.1 : Related work with algorithms used on insurance fraud
Table 6.1 : Accuracies after application of the 5 models to the data set
Table 6.2 : Results after undersampling

LIST OF ABBREVIATIONS
Abbreviation Description
KNN K- Nearest Neighbors
SVM Support Vector Machine
TP True Positive
TN True Negative
FP False Positive
FN False Negative
CM Confusion Matrix
DT Decision Tree
LR Logistic Regression
NB Naive Bayes

Chapter 1
INTRODUCTION
1.1 Definitions :
Fraud : A serious crime involving the use of one's occupation for personal enrichment
through the deliberate misuse or misapplication of the employing organization's resources or
assets; in this context, the illegal misuse of insurance policies for personal benefit.

Fraud detection : Monitoring the behavior of a population of users, using data sets, to
estimate, detect or avoid undesirable behavior.

Automobile insurance fraud : A major issue for insurance companies, causing heavy losses
due to false claims.

1.2 Fraudsters :
Fraud done by:
1. False accident claims and injury
2. False stolen reports
3. False claims that accident/damage happened after policy or coverage was purchased.
4. Claimants hide the fact that an excluded driver was driving at the time of the accident.
1.3 Types of fraud:
1. Credit card fraud
2. Telecommunication fraud
3. Bankruptcy fraud
4. Theft / counterfeit fraud
5. Application fraud
6. Behavioral fraud
7. Insurance fraud
8. Statement fraud
9. Security fraud
1.4 Various methods of fraudulent claims are:
1. Staged collisions : Fraudsters use a motor vehicle to stage fake accidents with an
innocent party.
2. Exaggerated claims : Claims involving injuries and damages that were already present
before the actual accident took place.
3. False stolen reports : The claimant may have sold the vehicle or gifted it to a relative,
and then claims insurance on the basis of theft.
4. Hidden information : Claimants may hide the fact that the driver at the time of the
accident was excluded under the terms of the insurance.
5. Multiple claims : People who claim multiple times for the same loss.

1.5 Types of automobile fraud:

1. Soft auto-insurance fraud: Examples of soft auto-insurance fraud include filing more
than one claim for a single injury, filing claims for injuries not related to an automobile
accident, misreporting wage losses due to injuries, and reporting higher costs for car repairs
than those that were actually paid.
2. Hard auto-insurance fraud: It includes activities such as staging automobile collisions,
filing claims when the claimant was not actually involved in the accident, submitting claims
for medical treatments that were not received, or inventing injuries and false stolen reports.

1.6 Problems associated :
a. Class imbalance problem [minority class = fraud, majority class = legitimate]
b. Outlier problem : Outliers are records that are dissimilar to the defined set of clusters;
they cannot serve as representatives while undersampling the data, and such noise needs to
be eliminated to enhance the data quality.

1.7 Organization of the thesis:

The rest of the thesis is organized as follows:

Chapter 2 – Motivation for the research project.
Chapter 3 – A review of the related literature.
Chapter 4 – Background on class balancing and the machine learning techniques used in
this work.
Chapter 5 – The proposed system, with its methodology, data set and flow chart.
Chapter 6 – Results and discussion on the sample data set.
Chapter 7 – Conclusion and prospects for future work.

CHAPTER 2

MOTIVATION

Automobile Fraud Statistics

The Insurance Fraud Bureau in the UK estimated there were more than 20,000 staged collisions
and false insurance claims across the UK from 1999 to 2006. One tactic fraudsters use is to
drive to a busy junction or roundabout and brake sharply, causing a motorist to drive into the
back of them. They then claim the other motorist was at fault for driving too fast or too close
behind, and make a false and inflated claim to that motorist's insurer for injury and damage,
which can pay the fraudsters up to 30 lakhs. In the Insurance Fraud Bureau's first year of
operation, data mining initiatives exposed insurance fraud networks and led to 74 arrests and
a five-to-one return on investment. The Insurance Research Council estimated that in 1996,
21 to 36 percent of auto-insurance claims contained elements of suspected fraud. There is a
wide variety of schemes used to defraud automobile insurance providers.

According to data released by the Beijing bureau of China, 10% of total insurance claims were
fraudulent. The Coalition Against Insurance Fraud estimates that in 2006 about $80 billion was
lost in the United States due to insurance fraud. According to estimates by the Insurance
Information Institute, insurance fraud accounts for about 10 percent of the property/casualty
insurance industry's incurred losses and loss adjustment expenses. The Indiaforensic Center
of Studies estimates that insurance fraud in India costs about $6.25 billion annually.

Problem Statement

1. Given an automobile insurance dataset comprising various features of claimants. The
dataset is labelled and split into training and test sets.
2. A hybrid technique has been implemented in which the data is sampled, and the models
are trained and tested on the given data set.

Chapter 3
Literature Review
Table 3.1 : Related work with algorithms used on insurance fraud

1. "An Experimental Study With Imbalanced Classification Approaches for Credit Card Fraud
Detection" (2019). Techniques: LR, C5.0 decision tree, SVM, ANN. Result: these are the best
methods according to the 3 considered performance measures (Accuracy, Sensitivity and
AUPRC).
2. "Predicting Fraudulent Claims in Automobile Insurance" (2018). Techniques: Random
forest + J48 + Naive Bayes. Result: Random forest outperforms the remaining two algorithms.
3. "One-class support vector machine based undersampling: Application to churn prediction
and insurance fraud detection" (2015). Technique: OCSVM. Result: OCSVM-based
undersampling improves the performance of classifiers.
4. "The Identification Algorithm and Model Construction of Automobile Insurance Fraud
Based on Data Mining" (2015). Technique: outlier detection based on KNN. Result: data
mining had the advantages of low time complexity, high recognition rate and high accuracy.
5. "Random Rough Subspace based Neural Network Ensemble for Insurance Fraud Detection"
(2011). Technique: random rough subspace based neural network ensemble. Result: the
random subspace method can be used for online fraud detection systems.

Various deep learning, Machine learning and data mining techniques have been implemented
in the case of automobile insurance fraud detection. These are:
1. Decision tree based
2. Machine learning
i. Supervised learning
A. Classification
a. Support vector machine (SVM) [4]
b. Recursive neural network(RNN)
c. Radial Basis Function neural network [5]
B. Regression and statistics
a. Logistic regression
b. Binary regression
ii. Unsupervised learning
A. Clustering
a. K-means clustering
b. Hierarchical clustering
B. Spectral Ranking Anomaly (SRA) [14]
iii. Semi- supervised learning : Combination of supervised and unsupervised learning
iv. Reinforcement learning
3. Multi layered perceptron based (MLP)
4. Data mining
5. Random forest
6. Naive Bayes Tree [4]
7. Probabilistic neural network
8. Group method of data handling (GMDH)
9. Synthetic Minority Oversampling Technique [2]
10. kRNN and K-Means hybrid for outlier elimination and undersampling [3]
11. Geometric mean based [6]
12. fuzzy Gaussian membership based oversampling [7]
13. data gravitation [8]
14. Fuzzy logic control (FLC)
15. Genetic algorithms
16. hybrid of back-propagation neural networks (ANN) and self-organizing maps (SOM) [9]
17. 10-fold cross validation method
18. OVERSAMPLING TECHNIQUES (for minority class):
a. ADASYN (Adaptive Synthetic Sampling)
b. SMOTE (Synthetic Minority Oversampling Technique)
19. Back Propagation
20. C4.5 algorithm [15]
21. Meta learning approaches [15]
22. Stacking – bagging method - MLP together with Naïve Bayesian (NB) and C4.5
algorithm [16]
23. Bayesian belief network
24. Non negative matrix factorization approach for health care fraud detection. [17]
25. Random Rough subspace based Neural Network Ensemble [13]
26. Iterative Assessment Algorithm based on Graph components [18]
27. Recursive Feature Elimination for feature selection, with active learning methods
employed for synthetic data generation. [4]

Limitations:
1. Data sets are not available / public [privacy concerns].
2. Results are often not disclosed to the public.
3. Data sets and results are confidential.
4. False alarms may be generated.
5. Imbalanced classification approaches are required because the fraud : legitimate ratio is
very low.
Chapter 4
BACKGROUND STUDY
4.1 Methods for class balancing :

1. Oversampling the minority class by creating artificial data points
2. Undersampling the majority class data
3. Cost-sensitive models (higher misclassification cost for the minority class)
4. One-class classification (train using only the minority class)

4.2 Undersampling and oversampling methods used :

1. Random oversampling : Balances the classes by replicating minority observations, so the
model concentrates on both the minority and the majority class.
2. One-class classification : A cost-sensitive approach that trains on the minority class only.
After training, it detects whether a transaction belongs to the minority class or not.
3. One-class SVM : It finds a small region capturing most of the training points, and
classifies a point x as

f(x) = +1 if x lies inside the region, -1 otherwise.

Let Φ be the mapping function from the variable space to a higher dimensional feature
space F.
Hyperplane equation : ω^T Φ(x_i) = ρ (1)
Objective of one-class SVM : maximize the margin = ρ / ||ω|| (2)
i.e. minimize (1/2) ||ω||² + (1/(νl)) Σ_{i=1..l} ξ_i − ρ (3)
subject to ω^T Φ(x_i) ≥ ρ − ξ_i , ξ_i ≥ 0 ∀ i = 1, 2, ..., l

where ω is the weight vector, ρ the offset parameter of the hyperplane in F, ξ_i the slack
variables, ν ∈ (0, 1] an upper bound on the fraction of outliers, and l the number of
training observations.
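As a sketch of how this could be realized in the project's Python environment, a one-class SVM can be fitted on minority-class samples alone. The use of scikit-learn here is an assumption for illustration; the report does not name a library, and the toy data is synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM  # assumed library choice, not named in the report

# Toy stand-in for the minority (fraud) class.
rng = np.random.default_rng(0)
X_fraud = rng.normal(0.0, 0.3, size=(200, 2))

# nu plays the role of v in Eq. (3): an upper bound on the fraction of
# training points treated as outliers.
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_fraud)

# predict() implements f: +1 inside the learned region, -1 outside.
pred = clf.predict([[0.0, 0.0], [5.0, 5.0]])
```

A new claim predicted as -1 would then be treated as not belonging to the minority class.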
Oversampling methods :
1. ADASYN (Adaptive Synthetic Sampling) : Applied to the minority class, it creates
artificial minority instances and replicates more heavily those instances that are difficult to
learn, using a density distribution factor r_i that captures the degree of learning difficulty of
each minority instance.
2. SMOTE (Synthetic Minority Oversampling Technique) : It generates artificial minority
samples by creating new instances from the existing minority cases supplied as input.
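The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration of the core idea, not the reference implementation, and `smote_like` is a hypothetical helper name:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all minority points
        j = rng.choice(np.argsort(d)[1:k + 1])         # pick one of the k nearest (skip self)
        gap = rng.random()                             # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so the new samples stay inside the region occupied by the minority class.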
Undersampling methods : random undersampling and clustering.
Clustering : Building clusters and using association rule mining to identify correlated data
and generate association rules.
Validation technique : k-fold cross validation. It chooses the test and train data randomly
to train the model effectively and estimate the accuracy.
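The k-fold validation step can be sketched as an index-splitting routine (`kfold_indices` is a hypothetical helper for illustration):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index arrays: the n samples are shuffled and cut
    into k near-equal folds; each fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Each classifier would then be trained on `train` and scored on `test`, averaging the accuracy over the 10 folds.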
4.3 Supervised model training algorithms used :
4.3.1. Decision Tree
4.3.2. Naive Bayes
4.3.3. Support Vector Machine (SVM)
4.3.4. KNN (K – Nearest Neighbors)
4.3.5. Logistic Regression

4.3.1. Decision Tree

Figure 4.1 : Decision Tree: rules leading to the colored clusters

Figure 4.2 : Decision tree : insurance classification

The decision tree uses the C5.0 algorithm.
> It uses cross entropy (information statistics and information gain).
> It is a classification-based algorithm.
> It recursively divides the values into subtrees for decision making.
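The information-gain criterion mentioned above can be computed as follows (a small illustrative sketch, not the project's code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence: the impurity a C5.0-style
    tree tries to reduce at each split."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
```

At each node the tree picks the attribute split with the highest information gain; a perfect split of a balanced two-class node yields a gain of 1 bit.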

4.3.2. Naive Bayes

Uses Bayes' conditional probability rule for classification.

Objective : To find the class of a new observation that maximizes its posterior probability,
i.e. the y for which P(Y | X1, X2, ..., Xn) is maximum.

P(Y | X1, X2, ..., Xn) = P(X1, X2, ..., Xn | Y) * P(Y) / P(X1, X2, ..., Xn) (4)

Since the denominator does not depend on Y, maximizing the posterior amounts to
maximizing P(X1, X2, ..., Xn | Y) * P(Y). (5)

Assuming the attributes are conditionally independent given the class,
P(X1, X2, ..., Xn | Y) = P(X1 | Y) * P(X2 | Y) * ... * P(Xn | Y) (6)

Disadvantages :

1. The assumption that the variables are independent may not hold.
2. Continuous variables must be discretized.
3. Information may be lost in the process.
4. The data may not be normally distributed.
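The counting behind Eqs. (4)-(6) can be sketched for categorical attributes. This is an illustration only: `nb_train` is a hypothetical helper, and real data would need smoothing for unseen attribute values:

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Estimate P(Y) and P(Xi|Y) by counting, and return a classifier that
    picks the class maximizing P(Y) * prod_i P(Xi|Y), as in Eqs. (4)-(6)."""
    n = len(labels)
    prior = Counter(labels)
    likelihood = defaultdict(Counter)        # (class, attribute index) -> value counts
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            likelihood[(y, i)][value] += 1

    def predict(row):
        def score(y):
            p = prior[y] / n                 # P(Y)
            for i, value in enumerate(row):
                p *= likelihood[(y, i)][value] / prior[y]   # P(Xi|Y)
            return p
        return max(prior, key=score)
    return predict
```

The classifier never normalizes by P(X1, ..., Xn), exactly because Eq. (5) shows the denominator is constant across classes.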

4.3.3. Support Vector Machine (SVM) :

Figure 4.3 : SVM trained with samples from two classes

A Support Vector Machine (SVM) performs classification by constructing an N-dimensional
hyperplane that optimally separates the data into two categories. SVM models are closely
related to neural networks: using a kernel function, SVMs provide an alternative training
method for polynomial, Radial Basis Function (RBF) and MLP classifiers, in which the
weights of the network are found by solving a quadratic programming problem with linear
constraints, rather than by solving the non-convex, unconstrained minimization problem of
standard neural network training.

It is a classification tool that divides the data points into 2 classes using a hyperplane, and it
performs a form of dimensionality reduction (by picking the support vectors) while
classifying the data sets.

For given training vectors x_i in R^n, i = 1, 2, ..., l, where n is the number of explanatory
variables and l the number of observations in the training set, and labels y ∈ {-1, 1}^l,
binary classification is done by solving the following optimization problem :

minimize (1/2) ||w||² + C Σ_{i=1..l} ξ_i (7)
subject to y_i (w^T Φ(x_i) + b) ≥ 1 − ξ_i , ξ_i ≥ 0 , i = 1, 2, ..., l (8)

Hyperplane equation : w^T Φ(x) + b = 0 (9)

w = vector of weights; it maximizes the distance between the two margins
ξ_i = slack variables (accounting for misclassifications)
C = cost parameter > 0
Φ = kernel mapping function
y = dependent variable, x = independent variables
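Given weights already obtained from the quadratic program, the decision rule of Eq. (9) and the margin constraint of Eq. (8) reduce to simple vector arithmetic. This sketch uses hand-picked w and b for illustration, not a trained model:

```python
import numpy as np

def svm_classify(w, b, X):
    """Classify each row of X by the sign of the hyperplane score w.x + b (Eq. 9)."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

def functional_margins(w, b, X, y):
    """y_i (w.x_i + b): values >= 1 satisfy the constraint of Eq. (8) with zero slack."""
    return y * (X @ w + b)

w, b = np.array([1.0, 1.0]), -1.0            # hand-picked hyperplane x1 + x2 = 1
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([1, -1])
```

Points with a functional margin below 1 would incur slack ξ_i > 0 and be penalized through the cost parameter C.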

4.3.4. KNN (K – Nearest Neighbors)

Figure 4.4 : KNN classification model

-> It classifies a point based on its K nearest training points (by distance between points).
-> It is a classification algorithm.
-> Euclidean distance : d(p, q) = √( Σ_{i=1..n} (p_i − q_i)² ) for two observations p and q.
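The rule above fits in a few lines of plain Python (a minimal sketch; `knn_predict` is a hypothetical helper name):

```python
from collections import Counter

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2), as in the formula above."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

An odd k avoids ties in the two-class (fraud / legitimate) setting.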

4.3.5. Logistic Regression

Figure 4.5 : Logistic regression model

Let α = (α0, α1, ..., αn) be the vector of coefficients, x = (x0, x1, ..., xn) the explanatory
variables, and ε the model error.

Y = α0 + α1·x1 + α2·x2 + ... + αn·xn + ε = x·α + ε (10)

g is the logistic link function mapping probabilities in [0, 1] to R (used to obtain values
between 0 and 1):
g(p) = x·α (11)
where p is the probability of fraud risk, and
g(p) = ln( p / (1 − p) ) (12)
so that
p = e^{x·α} / (1 + e^{x·α}) (13)
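Eqs. (12)-(13) can be checked numerically: the logistic function and the logit link invert each other (a small illustrative sketch):

```python
from math import exp, log

def fraud_probability(alpha, x):
    """p = e^{x.a} / (1 + e^{x.a}), Eq. (13); x0 = 1 carries the intercept a0."""
    z = sum(a * v for a, v in zip(alpha, x))
    return exp(z) / (1 + exp(z))

def logit(p):
    """Link function g(p) = ln(p / (1 - p)), Eq. (12)."""
    return log(p / (1 - p))
```

A claim would be flagged as fraud when p exceeds a chosen threshold such as 0.5.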

4.4 Fraud detection measures :


1. False positive
2. False negative
3. True positive
4. True negative
Chapter 5
PROPOSED SYSTEM
The majority of the above-mentioned work has limitations due to the data imbalance
problem. The model proposed in this work uses one-class classification to deal with the
minority class imbalance problem.

As mentioned in the proposed methodology, we extracted two subsets of the data in the
ratio 80% / 20%, ensuring that each subset has the same proportion of positive and negative
samples.
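The 80/20 split with preserved class proportions described above can be sketched as follows (`stratified_split` is a hypothetical helper; libraries provide equivalents):

```python
import random

def stratified_split(y, train_frac=0.8, seed=0):
    """Return (train, test) index lists where each class keeps the same
    proportion of samples in both subsets."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_frac * len(indices))
        train += indices[:cut]
        test += indices[cut:]
    return train, test
```

Splitting each class separately is what guarantees that the rare fraud class is represented in both the training and the test subset.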
5.1 Proposed system diagram
Flow Chart:

Figure 5.1 : Proposed model flow chart

5.2 Steps :
1. Data pre-processing : oversampling and undersampling
2. Clustering and classification
3. Training the model
4. Testing
5. Validation

Data pre-processing :
a. Handling missing data
b. Joining the claim payment data
c. Removing redundancy
d. Data cleaning (using only essential attributes)

Data set used : [Link] [15]

5.3 Data set attributes used for automobile insurance fraud detection :
week_past, is_holiday, age_price_wsum, make, accidentarea, sex, maritalstatus, fault,
vehicleCategory, RepNumber, Deductible, DriverRating, Days:policy_accident,
Days:policy_claim, PastNumberOfClaims, AgeOfPolicyHolder, PolicyReportFiled,
WitnessPresent, AgentType, NumberOfSuppliments, AddressChange_claim, NumberOfCars,
BasePolicy, FraudFound, Claim number, Policy number, Claim occurrence date and report
date, Claim occurrence time, Claim open date, claim report date, claim loss date, claim event
location name, claim amount, policy premium, part market cost, claim on vehicle, count of
customer communication, are claim documents submitted, policy effective date, claim
occurrence date, claim on same vehicle check.

5.4 Dataset Description :


Total = 15,420 insurance claims
Year : 1994 – 96 in US
Genuine = 14,497 (94%)
Fraud = 923 (6%)
Imbalance ratio = 0.06 : 0.94
No. of attributes = 24

5.5 Implementation Environment

The experiments were implemented under the following hardware and software specifications

5.5.1 Hardware Specification

The study of the data set and the project has been implemented on a laptop with an Intel core
i3-5005U at 2 GHz with 8 GB of RAM and 2 GB of Graphics. The Operating System used is
Ubuntu 18.04.3.

5.5.2 Software Specification

The project has been implemented on Spyder IDE and language used is Python 3.6.

5.6 Performance metrics :

The following metrics are derived from the confusion matrix.

1. Accuracy
Accuracy = (No. of True Positives + No. of True Negatives) / (No. of True Positives + No. of
True Negatives + No. of False Positives + No. of False Negatives)

Accuracy = (TP + TN) / (TP + FP + FN + TN) (14)

2. Specificity
Specificity, also known as the True Negative Rate, is calculated as

Specificity = No. of True Negatives / (No. of True Negatives + No. of False Positives)

Specificity = TN / (TN + FP) (15)

3. Sensitivity
Sensitivity, also known as the True Positive Rate or Recall, is calculated as

Sensitivity = No. of True Positives / (No. of True Positives + No. of False Negatives)

Sensitivity = TP / (TP + FN) (16)

4. Precision
Precision, also known as the Positive Predictive Value, is calculated as

Precision = No. of True Positives / (No. of True Positives + No. of False Positives)

Precision = TP / (TP + FP) (17)
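Eqs. (14)-(17) translate directly to code (a small sketch over the four confusion-matrix counts):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity and precision from the
    confusion-matrix counts, following Eqs. (14)-(17)."""
    return {
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),
        'specificity': tn / (tn + fp),
        'sensitivity': tp / (tp + fn),
        'precision':   tp / (tp + fp),
    }
```

On an imbalanced fraud data set, sensitivity and precision are more informative than raw accuracy, which is dominated by the legitimate majority class.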

Chapter 6
RESULTS AND DISCUSSION

Thus, a complete survey was done of the various techniques used in automobile insurance
fraud detection. It shows that there is still scope for improvement: the minority class can be
oversampled to further increase accuracy, which would help prevent losses for the insurance
companies to a great extent.
The following are the results produced :

Model                    Accuracy (in %)
1. Decision Tree         89.06
2. SVM                   94.04
3. KNN                   93.36
4. Logistic regression   93.60
5. Naive Bayes           73.97

Table 6.1 : Accuracies after application of the 5 models to the data set.

After random undersampling of the data [no. of majority samples = 5000] :

Model                    Accuracy (in %)
1. Decision Tree         85.84
2. SVM                   93.55
3. KNN                   93.19
4. Logistic regression   94.03
5. Naive Bayes           69.86

Table 6.2 : Results after undersampling.

Observation : It is evident from the tables that after undersampling the majority class,
logistic regression showed an increase in accuracy while the other algorithms showed a
slight decrease. This is expected: on the imbalanced data the overall accuracy is inflated by
the majority (legitimate) class, so a small drop in accuracy after balancing can still mean
better detection of the minority (fraud) class.

Chapter 7
Conclusion and Future Work

The system has been implemented using supervised learning methods. In future work, the
proposed system shall also be implemented with unsupervised algorithms and/or hybrids of
these models, and the accuracies will be compared and tested to find the method that gives
the highest accuracy with the lowest false positive alarm rate.

REFERENCES

1. Sundarkumar, G. Ganesh, Vadlamani Ravi, and V. Siddeshwar. "One-class support vector


machine based undersampling: Application to churn prediction and insurance fraud
detection." In 2015 IEEE International Conference on Computational Intelligence and
Computing Research (ICCIC), pp. 1-7. IEEE, 2015.

2. N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. Kegelmeyer, "SMOTE: Synthetic
Minority Oversampling Technique", Journal of Artificial Intelligence Research, vol. 16(1),
pp. 321-357, 2002.

3. M. Vasu, and V. Ravi, “ A hybrid undersampling approach for mining unbalanced datasets:
Application to Banking and insurance”, International Journal of Data Mining Modeling and
Management, Vol. 3(1), pp. 75-105, 2011.

4. M.A.H. Farquad, V. Ravi and S. Bapi Raju, "Analytical CRM in banking and finance using
SVM: a modified active learning-based rule extraction approach", International Journal of
Electronic Customer Relationship Management, vol. 6(1), pp. 48-73, 2011.

5. M. D. Pérez-Godoy, A. J. Rivera, C. J. Carmona, M. J. D. Jesus, "Training algorithm for
radial basis function networks to tackle learning processes with imbalanced datasets",
Applied Soft Computing, Vol. 25, pp. 26-39, 2014.

6. M. J. Kim, D. K. Kang, and H. B. Kim, "Geometric mean based boosting algorithm with
oversampling to resolve data imbalance problem for bankruptcy prediction", Expert Systems
with Applications, Vol. 41(3), pp. 1074-1082, 2015.

7. D. C. Li, C. W. Liu, S. C. Hu, “A learning method for the class imbalance problem with
medical datasets”, Computers in Biology and Medicine, Vol. 40(5), pp. 509-518, 2010.

8. L. Peng, H. Zhang, B. Yang, and Y. Chen, "A new approach for imbalanced data
classification based on data gravitation", Information Sciences, Vol. 288, pp. 347-373, 2014.

9. C. F. Tsai, and Y. H. Lu, "Customer churn prediction by hybrid neural networks", Expert
Systems with Applications, Vol. 36(10), pp. 12547-12553, 2009.

10. Makki, Sara, Zainab Assaghir, Yehia Taher, Rafiqul Haque, Mohand-Saïd Hacid, and
Hassan Zeineddine. "An Experimental Study With Imbalanced Classification Approaches for
Credit Card Fraud Detection." IEEE Access 7 (2019): 93010-93022.

28 | P a g e
11. Kowshalya, G., and M. Nandhini. "Predicting Fraudulent Claims in Automobile
Insurance." In 2018 Second International Conference on Inventive Communication and
Computational Technologies (ICICCT), pp. 1338-1343. IEEE, 2018.

12. Yan, Chun, and Yaqi Li. "The Identification Algorithm and Model Construction of
Automobile Insurance Fraud Based on Data Mining." In 2015 Fifth International Conference
on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), pp.
1922-1928. IEEE, 2015.

13. Xu, Wei, Shengnan Wang, Dailing Zhang, and Bo Yang. "Random rough subspace based
neural network ensemble for insurance fraud detection." In 2011 Fourth International Joint
Conference on Computational Sciences and Optimization, pp. 1276-1280. IEEE, 2011.

14. K. Nian, H. Zhang, A. Tayal, T. Coleman, and Y. Li, “Auto insurance fraud detection
using unsupervised spectral ranking for anomaly,” The Journal of Finance and Data Science,
vol. 2, no. 1, pp. 58–75, 2016.

15. C. Phua, D. Alahakoon, and V. Lee, “Minority report in fraud detection: classification of
skewed data,” Acm sigkdd explorations newsletter, vol. 6, no. 1, pp. 50–59, 2004.

16. Phua, C., Damminda, A., Lee, V., 2004. Minority report in fraud detection: classification
of skewed data (Special Issue on Imbalanced Data Sets). SIGKDD Explor. 6 (1), 50–59.

17. Zhu. S., Wang, Y., Wu, Y., 2011. Health care fraud detection using non-negative matrix
factorization. In: Proceedings of the IEEE International Conference on Computer Science and
Education, pp. 499–503.

18. Šubelj, L., Furlan, Š., Bajec, M., 2011. An expert system for detecting automobile
insurance fraud using social network analysis. Expert Syst. Appl. 38 (1), 1039–1052.

