
Advanced Techniques in Insurance Claim Fraud Detection

ACKNOWLEDGEMENTS

I wish to express my sincere gratitude to everyone who made it possible to complete this dissertation.

First of all, I would like to thank my supervisor, Monari, Dennis, for their endless support, patience, and guidance throughout the research process. I was deeply enlightened by their views and knowledge of the subject, which have shaped the approach taken in this paper.

I would also like to acknowledge the faculty members of the computer science department for their continuous support and valuable comments during my course of study. Special thanks also go to [Other Important Individuals, Professors, or Mentors] for their support and for providing the resources necessary to carry out my research.

I am deeply indebted to colleagues and friends who offered their encouragement and shared their knowledge and experience. Their support was vital to my focus and motivation.

Finally, I would like to express my sincere gratitude to my family, who have supported me unconditionally and with great understanding throughout this demanding journey. Their love and encouragement have been, and remain, my greatest source of strength.

Contents

CHAPTER 1 BACKGROUND AND STUDY ORIENTATION
1.1 AIM
1.2 OBJECTIVES
1.3 Tasks
1.4 Problem Statement
CHAPTER 2 LITERATURE REVIEW
CHAPTER 3 RESEARCH DESIGN AND METHODOLOGY
3.1 Collection and Preparation of Data
3.2 Data Preprocessing
3.3 Exploratory Data Analysis (EDA)
CHAPTER 4 TECHNICAL ANALYSIS AND IMPLEMENTATION
4.1 Data Exploration and Preprocessing
4.2 Feature Engineering and Selection
4.3 Model Implementation and Evaluation
4.4 Hyperparameter Tuning Results
4.5 Feature Importance Analysis
4.6 Model Interpretability
4.7 Challenges and Limitations
CHAPTER 5 SURVEY FINDINGS AND ANALYSIS
5.1 Dataset Overview
5.2 Exploratory Data Analysis Findings
5.3 Model Performance Analysis
5.4 Model Comparison and Analysis
5.5 Feature Importance Analysis
CHAPTER 6 DISCUSSION AND IMPLICATIONS
6.1 Model Performance and Selection
6.2 Addressing Class Imbalance
6.3 Feature Engineering and Selection
6.4 Practical Implementation Considerations
CHAPTER 7 CONCLUSION AND FUTURE DIRECTIONS
7.1 Summary of Findings
7.2 Limitations of the Study
7.3 Future Research Directions
7.4 Concluding Remarks
8. REFERENCES
9. APPENDICES
CHAPTER 1 BACKGROUND AND STUDY ORIENTATION
Insurance fraud is one of the most pervasive and costly problems facing insurers worldwide. Fraudulent claims are believed to cost insurers billions of dollars annually, a cost that is passed back to honest policyholders as premium increases. Devising effective ways to detect and prevent insurance fraud is therefore essential. This study focuses on applying machine learning techniques to identify insurance claims with a high likelihood of being fraudulent. We look for patterns of fraud in large data sets of historical claims, from which we can build predictive models on the attributes in the data that best reflect red flags. This can help insurers deploy their fraud-detection resources more efficiently, catching more fraud in the process. Our study used an auto insurance claim dataset with details on policyholders, incidents, and claim amounts. Within this research, we analyze this data for patterns and indicators that point toward possibly frivolous, wasteful, or fraudulent claims, and we apply different machine learning models, evaluating their performance in classifying fraudulent versus non-fraudulent claims.

1.1 AIM
This study designs and evaluates machine learning models for the detection of fraudulent insurance claims, delivering a system capable of flagging possibly fraudulent claims with high accuracy while keeping false positives low.

1.2 OBJECTIVES
1. Conduct exploratory data analysis on the insurance claims dataset to identify key features
and patterns associated with fraudulent claims.

2. Clean and preprocess data for a machine learning task, including handling missing
values, encoding categorical variables, and feature engineering.

3. Develop and compare various machine learning models, including decision trees, random forests, gradient boosting, and support vector machines.

4. Tune model hyperparameters with grid search in conjunction with cross-validation to improve model performance.

5. Measure model performance using relevant metrics such as accuracy, precision, recall, and F1-score.

6. Analyze feature importance to identify the most influential factors in predicting fraudulent cases.

7. Derive insights and recommendations for practical applications of fraud detection models in insurance.

1.3 Tasks
1. Exploratory Data Analysis and Visualization

2. Data Preprocessing

3. Model Selection and Implementation

4. Hyperparameter Tuning

5. Model Evaluation and Comparison

6. Feature Importance Analysis

7. Results Interpretation and Reporting

1.4 Problem Statement


Insurance fraud is extremely complex and dynamic, which makes detection difficult. Traditional rule-based approaches cannot adapt to new fraud schemes quickly enough, while manual review of all submitted claims has proved inefficient. Machine learning can help by automatically identifying subtle patterns indicative of fraud across large volumes of claims data. Several challenges arise in developing effective fraud detection models:

1. Class imbalance: Fraudulent claims generally form a very small fraction of all claims, which biases the model toward the majority class if not addressed.

2. Feature selection: Relevant features for fraud detection must be identified from a large pool of potential variables.

3. Model interpretability: Predictive performance must be paired with explainable results so that fraud investigators can understand and act on the model's outputs.

4. Adaptability: Models must be able to identify new and evolving patterns of fraud over time.

5. False positives: Legitimate claims misclassified as fraud can damage customer relationships and waste investigative resources, so their number must be kept low.

To this end, this study carefully analyzes the data, engineers features, and evaluates various machine learning approaches with a view to producing more accurate and efficient fraud detection models. If successful, this work will help insurers reduce losses, keep premiums lower for honest customers, and maintain the integrity of the insurance system.

CHAPTER 2 LITERATURE REVIEW
Insurance fraud detection is an active area of research, and numerous publications have focused on applying machine learning to this problem. This literature review summarizes key findings and approaches from recent academic papers and industry reports. Phua et al. (2010) presented a comprehensive survey of data mining techniques for insurance fraud detection. Their results indicated that various supervised learning techniques, such as decision trees, neural networks, and support vector machines, perform well in detecting fraudulent claims. The authors also observed that feature selection and dimensionality reduction matter greatly to model performance. In 2015, Sundarkumar and Ravi proposed a novel approach that incorporates feature selection with homogeneity-oriented behavior analysis for insurance fraud detection. The proposed methodology used a genetic algorithm for feature selection and a one-class support vector machine for anomaly detection, with very encouraging results on fraudulent auto insurance claim datasets.

Nian et al. (2016) investigated deep learning techniques for insurance fraud detection. The authors developed a deep learning model based on the autoencoder principle and demonstrated its effectiveness in detecting fraudulent health insurance claims; their findings indicated that the proposed approach outperformed conventional machine learning methods in both accuracy and false positive rate. In a more recent study, Wang and Xu (2018) investigated the application of ensemble learning methods to insurance fraud detection, comparing several ensemble techniques such as random forests, gradient boosting, and stacking. Their results indicated that, in general, ensemble methods outperformed any single classifier, with gradient boosting achieving the highest accuracy.

Correia et al. (2020) focused on the class imbalance from which insurance fraud detection suffers. To handle the imbalance between fraudulent and non-fraudulent claims, they compared different sampling techniques, including oversampling, undersampling, and hybrids of the two. The study concluded that combining SMOTE (Synthetic Minority Over-sampling Technique) with random undersampling performs best across all classifiers. At the industry level, MarketsandMarkets published an Insurance Fraud Detection Report in 2021 substantiating that AI and machine learning are being visibly integrated into fraud detection systems. The report also gives greater weight to real-time analytics and the integration of external data sources to better detect fraud.

Several important themes are clear from the literature:

1. Machine learning methods are much better at fraud detection than traditional rule-based approaches.

2. Feature selection and engineering are critical to good performance.

3. Ensemble methods, especially gradient boosting, tend to achieve the highest accuracy.

4. Sampling techniques that address class imbalance are needed to support the minority class.

5. Deep learning methods have considerable potential for capturing complex fraud patterns.

6. Integrating several sources of data can create value, and real-time analytics can be used effectively.

Informed by these findings, the current research implements and compares various machine learning approaches. We also address the class imbalance issue and the problem of model interpretability in order to provide actionable insights to fraud investigators.

CHAPTER 3 RESEARCH DESIGN AND METHODOLOGY
This quantitative research methodology applies statistical and machine learning techniques to a large database of insurance claims. The main steps in the process are as follows:

3.1 Collection and Preparation of Data


The dataset used here contains information on 1,000 automobile insurance claims, including detailed attributes about policyholders, incident specifics, and the corresponding claim amounts. The data set was derived from an open source, and extensive preprocessing was performed to assure its quality and consistency. Preprocessing involved data cleaning, handling missing values and inconsistencies, and data transformation, including encoding categorical variables and converting date/time variables into more workable formats. Further feature engineering was conducted to create new variables and interaction terms to enhance the predictive power of the dataset. Numeric features were normalized to prevent any one feature from disproportionately affecting the analysis. The data set was then split into three subsets, a training set, a validation set, and a test set, allowing a reliable assessment of model performance on data it had not seen. Great attention was paid to the preparation of the dataset to make the analyses as accurate and robust as possible and to provide reliable and actionable insights (Agarwal, 2023).

3.2 Data preprocessing


This code snippet imports the Python libraries used for data analysis and machine learning. Pandas and NumPy handle data and numerical operations; Matplotlib and Seaborn handle plotting; and Scikit-learn provides data splitting, encoding, the classifier models used here (Logistic Regression, Decision Tree, Random Forest, SVC, and K-Nearest Neighbors), and functions that compute metrics for model performance evaluation.
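A representative version of this import cell might look as follows; the exact set of imports is an assumption based on the description above.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data splitting, encoding, models, and metrics from scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, confusion_matrix)
```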

Loading CSV Data

This line of code reads the CSV file `fraud_insurance_claims.csv` from the path '/content/' into a Pandas DataFrame, `df`. All subsequent analysis and manipulation is done on `df`, which holds the data from this CSV file.
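A minimal sketch, assuming the Colab-style path described above:

```python
df = pd.read_csv('/content/fraud_insurance_claims.csv')
```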

Display the First Few Rows of DataFrame


The `df.head()` function displays the first five rows of the DataFrame `df`. This is useful for inspecting the structure, the column names, and the first data entries, giving a sense of what the dataset looks like and what kinds of values it contains.

Displaying the Last Few Rows of the DataFrame

The `df.tail()` function displays the last five rows of the DataFrame `df`. This is useful for checking the end of the dataset, confirming that the final entries are as expected and that the data loaded properly, which matters when working with large datasets.

The dimensionality of the DataFrame

The attribute `df.shape` returns the dimensions of the DataFrame `df`, here (1000, 39): 1,000 rows and 39 columns. This indicates how big the dataset is, in both feature columns and records.

Statistical summary of the DataFrame

The `df.describe()` function computes summary statistics for the DataFrame `df`. The results show that `months_as_customer` averages about 204 months, while ages range from 19 to 64 years. Policy numbers range from 100,804 to 999,435, with an average deductible of 1,136 and an average annual premium of 1,256.41. The `umbrella_limit` is highly variable, with a mean of 1,101,000. Capital gains average 25,126.10 and capital losses about 26,793.70. Total claim amounts vary from a low of 100 to a high of 114,920, with a mean of 52,761.94.

Summary of DataFrame

The `df.info()` function returns the size of the DataFrame and the column dtypes: `df` has 1,000 entries and 39 columns. Except for the column `authorities_contacted`, every column contains 1,000 non-null entries. The data types are `int64` for integer columns, `float64` for floating-point columns, and `object` for categorical columns. The memory usage of the DataFrame is about 304.8 KB.

Missing Values Summary

The result of `df.isnull().sum()` shows that most columns in the DataFrame `df` contain no missing values; the exception is `authorities_contacted`, which has 91 missing entries. All other columns, such as `months_as_customer`, `age`, and `policy_number`, are complete with 1,000 non-null entries. This summary gives a view of the completeness of the dataset and points to missing data in a single column.

Handling Missing Values in `authorities_contacted`

The missing values in `authorities_contacted` are handled by first computing the column's mode with `df['authorities_contacted'].mode()[0]`, then filling the missing entries with `df['authorities_contacted'].fillna(mode_authorities_contacted, inplace=True)`. This replaces all missing values in the column with its most frequent value, maintaining the consistency of the dataset.
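Put together, the imputation step described here reads:

```python
# Fill missing authorities_contacted values with the column's most frequent value
mode_authorities_contacted = df['authorities_contacted'].mode()[0]
df['authorities_contacted'].fillna(mode_authorities_contacted, inplace=True)
```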

3.3 Exploratory Data Analysis (EDA)


We performed an exhaustive exploratory analysis of the dataset to shed light on how the variables were distributed, whether any correlations existed, and to identify patterns that could raise red flags for fraudulent claims. This included the following steps:

 Descriptive statistics for numerical variables
 Frequency analysis for categorical variables
 Plotting/charts for key relationships
 Correlation testing between features

1. Distribution of Incident Types

This plot illustrates the distribution of the different incident types in the dataset with a donut chart. First, the count of each unique type in the column `incident_type` is computed using `value_counts()`. These counts are then plotted as a pie chart whose slices represent the different incident types.

Colouring and Design: The slices are coloured with a soft pastel palette from the Seaborn library, which is easy on the eyes and helps differentiate the incident types. The slices start at a 140-degree angle to optimize the visual placement of the chart, and every slice is outlined with a white border for better separation and clarity.

Donut effect: A white circle is inserted at the center of the pie chart to create the donut effect. This not only looks good but also puts the focus on how the slices are apportioned around the circle.

Labels and Percentages: The percentage of the total of each incident type is displayed
directly on its corresponding slice, formatted to one decimal place. This allows viewers to get
a sense of the relative proportion of each incident type at a glance.

Title: The chart is labelled "Distribution of Incident Types," providing immediate context to
the viewer and making it clear that the visualization presents the types of incidents recorded
in the dataset.

This donut chart gives a swift summary of the kinds of incidents represented in the dataset in a clear, easy-to-understand format, making it possible to see at a glance which incident types occur most and least often.
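A sketch of plotting code consistent with this description; the figure size and donut radius are assumptions:

```python
incident_counts = df['incident_type'].value_counts()
colors = sns.color_palette('pastel')[:len(incident_counts)]

plt.figure(figsize=(8, 8))
plt.pie(incident_counts, labels=incident_counts.index, colors=colors,
        autopct='%1.1f%%', startangle=140,
        wedgeprops={'edgecolor': 'white'})
# A white circle at the centre turns the pie into a donut
centre_circle = plt.Circle((0, 0), 0.55, fc='white')
plt.gca().add_artist(centre_circle)
plt.title('Distribution of Incident Types')
plt.show()
```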

2. Incident Severity Distribution

This pie chart illustrates the incident severity distribution, showing how often each unique value of the `incident_severity` column occurs in the dataset.

Coloring and design: The slices are color-coded with the Seaborn "coolwarm" palette. The gradient is not only aesthetically pleasing but can also subtly suggest an ordering of the severity levels.

Slice Size and Percentages: Each slice's size depicts the frequency of that severity level in the dataset, and the percentage of the total that each slice represents is labelled directly on the chart, making comparisons among the severity levels easy.

Title and Labels: The chart is titled "Incident Severity Distribution" to guide the viewer on what it represents. The y-axis label was removed to keep the chart clean and uncluttered; a well-labelled pie chart needs no axis labels.

This pie chart is a simple yet effective visualization for grasping how the severity levels are distributed in the dataset, quickly showing which level is most prevalent.

3. Capital Gains vs. Capital Losses

This scatter plot helps reveal the relationship between the capital gains and capital losses recorded in the dataset, with data points distinguished by whether fraud was reported.

Axes and Data Points: The x-axis shows capital gains and the y-axis shows capital losses. Each point in the plot represents a single record in the dataset, positioned according to that record's capital gains and losses.

Coloring (Hue): Points are colored by the `fraud_reported` column, depending on whether the incident was flagged as fraud. The difference in hue helps the viewer quickly spot patterns or clusters relating financial gains and losses to the incidence of fraud.

Interpretation: The scatter plot can reveal trends, for instance whether regions of higher capital gains or losses carry a different probability of fraud, and it can help identify outliers or anomalies in the data.

Title and Labels: The plot is titled "Capital Gains vs. Capital Losses," with the x-axis labelled "Capital Gains" and the y-axis "Capital Losses." The labelling clearly describes what viewers are looking at and aids the interpretation of the data.

Most importantly, this scatter plot is a powerful starting point for exploring relationships between financial factors and fraud, which feed into further analyses and possible decisions.
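A sketch of the scatter plot, assuming the gain and loss columns are named `capital-gains` and `capital-loss`:

```python
plt.figure(figsize=(10, 6))
# Column names are assumptions; hue separates fraud vs non-fraud records
sns.scatterplot(data=df, x='capital-gains', y='capital-loss', hue='fraud_reported')
plt.title('Capital Gains vs. Capital Losses')
plt.xlabel('Capital Gains')
plt.ylabel('Capital Losses')
plt.show()
```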

4. Relationship of Insured

This line plot analyzes the distribution of the insured's relationship status in the dataset. It counts each unique relationship status in the column `insured_relationship` and then visualizes those counts.

Data Transformation: The counts for each relationship status are put into a DataFrame for easy plotting, producing two columns: `insured_relationship` for the status and `count` for its frequency.

Plotting the Line Chart: A line chart of relationship status (x-axis) against count (y-axis) is created with `sns.lineplot`. Each point on the line represents the count of one relationship status, and markers (`marker='o'`) make the points visible.

X-axis Labels and Rotation: The `x-axis` is used for the relationship status, whose labels
are rotated 45 degrees so that they will still be readable if there are many categories.

Grid and Title: A grid is added to the plot (`plt.grid(True)`) for legibility, and the chart is titled "Relationship of Insured," indicating right away that the plot shows how the different relationship statuses are represented in the dataset.

This line plot clearly illustrates the distribution of relationship statuses, showing which status most of the insured in this dataset fall into.
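A sketch consistent with this description:

```python
# Count each relationship status and reshape into a two-column DataFrame
rel_counts = df['insured_relationship'].value_counts().reset_index()
rel_counts.columns = ['insured_relationship', 'count']

plt.figure(figsize=(10, 5))
sns.lineplot(data=rel_counts, x='insured_relationship', y='count', marker='o')
plt.xticks(rotation=45)
plt.grid(True)
plt.title('Relationship of Insured')
plt.show()
```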

5. Auto Make Distribution

This bar chart shows the distribution of the different car makes in the dataset and their relative frequencies.

Y-axis and Data: The y-axis lists the car makes and the x-axis counts how often each make occurs. The data is sorted so that the most common makes appear at the top, making it easy to see which are most prevalent.

Coloring: The bars are colored using Seaborn's "Paired" palette, which gives a sequence of
quite different colors, making it easier to tell the various makes apart.

Interpretation: The lengths of the bars reflect the frequency of each car make in the dataset, quickly conveying which makes appear most often in the records.

Title and Labels: The chart is titled "Auto Make Distribution," with the x-axis labelled "Count" and the y-axis "Auto Make." The title and labels state the chart's objective so viewers can easily see what is shown.

This bar chart simply summarizes the distribution of car makes, giving insight into which types of vehicles are most commonly involved in the incidents recorded in this dataset.

6. Incident Severity Distribution (Custom Palette)

This bar chart shows the distribution of incident severity levels across the dataset, using the customized blue-red-green color scheme quoted below.

Color Palette: The chart uses a custom palette with the specific colors `"#3498db"`, `"#e74c3c"`, and `"#2ecc71"`. These shades of blue, red, and green clearly separate the severity levels visually and make the chart more attractive.

Plot Details: The `sns.countplot` function creates a bar chart where the x-axis shows the incident severity levels and the y-axis the number of occurrences of each. Each bar is colored from the custom palette, making the severity levels easy to tell apart.

Labels and Title: The chart is titled "Incident Severity Distribution," with the x-axis labeled "Incident Severity" and the y-axis "Count." Together with the title, these labels make sure the audience knows what the chart represents and can read the data properly.

This bar chart gives a clean, distinct visual summary of the incident severity levels, with custom colors that further improve the distinction between the categories.
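A sketch using the palette quoted above:

```python
custom_palette = ["#3498db", "#e74c3c", "#2ecc71"]

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='incident_severity', palette=custom_palette)
plt.title('Incident Severity Distribution')
plt.xlabel('Incident Severity')
plt.ylabel('Count')
plt.show()
```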

7. Correlation Heatmap

The correlation matrix covers the numerical features of the dataset, illuminating how strongly the different numerical variables relate to one another.

Correlation Matrix: The `numeric_df.corr()` function computes the correlation coefficients between the numerical features in the DataFrame `numeric_df`. These coefficients lie in the range −1 to 1, indicating the strength and direction of the linear relationship between each pair of variables.

Heatmap Details: The `sns.heatmap` function is used to generate the heatmap, where:

`annot=True`: displays the correlation coefficients directly on the heatmap, allowing more precise interpretation.

`cmap='coolwarm'`: applies a cool-warm color map in which cool colors such as blue signify negative correlations and warm colors such as red signify positive ones, making strong and weak relationships visible at a glance.

Labels and Title: The heatmap is titled "Correlation Heatmap." No explicit x and y labels are set, keeping the focus on the correlation values themselves rather than the feature names; axis labels could be added for clarity if desired.

The heatmap thus gives an overview of how the numerical variables in the dataset are interrelated, helping to recognize patterns, relationships, and possible multicollinearity among features.
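A sketch, assuming `numeric_df` is built from the numeric columns of `df`:

```python
numeric_df = df.select_dtypes(include='number')

plt.figure(figsize=(14, 10))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
```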

4. Data Preprocessing

Outlier detection and handling: Box plots were used to spot outliers, after which extreme outliers were removed to improve the model.

Visualizing Numeric Data with Boxplots

This code generates boxplots for all numeric columns in the DataFrame `df`, which helps visualize the distribution of the data and check for outliers. First, the necessary libraries are imported: `pandas`, `matplotlib.pyplot`, and `seaborn`. Then the numeric columns are extracted from the DataFrame, and the number of subplot rows and columns is computed as a function of the number of numeric columns.

It creates a figure of 15 by 10 inches and iterates over the numeric columns, drawing one boxplot per iteration. Subplots are titled after their column, and `plt.tight_layout()` ensures that the plots do not spill over into each other. Finally, `plt.show()` displays the grid of boxplots, allowing a clear visual assessment of how the numeric data is distributed and varies within the dataset.
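A sketch of this boxplot grid; the four-column layout is an assumption:

```python
import math

numeric_cols = df.select_dtypes(include='number').columns
n_cols = 4                                   # assumed grid width
n_rows = math.ceil(len(numeric_cols) / n_cols)

plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols, start=1):
    plt.subplot(n_rows, n_cols, i)
    sns.boxplot(y=df[col])
    plt.title(col)
plt.tight_layout()
plt.show()
```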

Outlier Removal and Data Visualization

This code snippet identifies outliers in the DataFrame `df`. First, it calculates the quartiles Q1 and Q3 and the interquartile range (IQR) for all numeric columns. It then defines the lower and upper outlier bounds as 1.5 times the IQR below Q1 and above Q3, respectively. Rows with values beyond these bounds are filtered out to produce a new DataFrame, `df_no_outliers`, that excludes the outliers. The original dataset had 1,000 rows; after removing outliers, 600 rows remain, a removal of 400 rows. The cleaned data is then visualized with boxplots for every numeric column, arranged in a grid to give a clear view of the distributions without interference from outliers.
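A sketch of the IQR-based filtering described above:

```python
numeric_cols = df.select_dtypes(include='number').columns
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only rows with no numeric value outside the bounds
mask = ~((df[numeric_cols] < lower) | (df[numeric_cols] > upper)).any(axis=1)
df_no_outliers = df[mask]
print(df.shape, df_no_outliers.shape)  # (1000, 39) -> roughly (600, 39) per the text
```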

5. Model Selection and Implementation

To compare results on fraudulent claim detection, we selected and applied several machine learning algorithms. These include:

- Random Forest

- Decision Tree

- Support Vector Machine (SVM)

- Gradient Boosting

- K-Nearest Neighbors (KNN)

These algorithms were selected for their performance on classification tasks like this one and their ability to handle numerical and categorical features.

6. Model Training and Hyperparameter Tuning

First, we split the dataset into a training set and a test set in an 80:20 ratio for model performance evaluation. We then performed hyperparameter tuning for each algorithm using grid search with cross-validation to see how model performance could be improved; this involved defining ranges for each model's hyperparameters.

- Using GridSearchCV to evaluate all possible combinations of the provided hyperparameters and choosing the best combination based on its cross-validation score.
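A sketch of this tuning workflow, assuming `X` and `y` hold the encoded features and the `fraud_reported` target; the parameter ranges shown are illustrative:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Illustrative grid for Gradient Boosting; each model gets its own grid
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```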

7. Model Evaluation

The evaluation metrics used are as follows for each model:

- Accuracy: Overall rate of correct predictions

- Precision: Proportion of true positive predictions out of total positive predictions

- Recall: Proportion of true positive predictions out of actual positive cases

- F1-score: Harmonic mean of precision and recall

- Confusion matrix: Detailed breakdown of correct and incorrect predictions

We paid special attention to the tradeoff between precision and recall, since in fraud detection it is undesirable to have too many false positives or too many false negatives.
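A sketch of how these metrics might be computed for the tuned model from the previous step:

```python
y_pred = grid.best_estimator_.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))        # TN, FP / FN, TP breakdown
```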

8. Feature Importance Analysis

To shed light on which variables were most influential in predicting fraudulent cases, feature importances were extracted from the best models. This information is very valuable to fraud investigators and helps ensure that the data needed in any future collection effort is captured.

9. Results Interpretation and Reporting

We then interpreted the analysis results and drew conclusions about the effectiveness of different machine learning approaches for insurance fraud detection, relating the findings to practical implications and possible future research areas.

This methodology allows a thorough evaluation of machine learning approaches for insurance fraud detection, combining quantitative performance measures with qualitative insights into the factors behind fraudulent claims.

CHAPTER 4 TECHNICAL ANALYSIS AND IMPLEMENTATION
4.1 Data Exploration and Preprocessing
We began with a deep dive into the insurance claims dataset, which contained 1,000 records and 39 features spanning both numerical and categorical variables. Key observations from this first data exploration can be summarized as follows:

- The dataset mixes policyholder information, such as age and education level, policy details, such as premium and deductible, and claim-specific information, such as incident type and claim amount.

- The `authorities_contacted` column had missing values, which we treated with mode imputation.

- Several date fields, such as `policy_bind_date` and `incident_date`, were stored as text and needed to be transformed into numerical values for use in machine learning models.

- The target variable, `fraud_reported`, was imbalanced, with most claims not fraudulent.

We performed the following preprocessing steps:

1. Handling missing values in `authorities_contacted` by mode imputation.

2. Converting date fields into numerical values, such as the number of days from a reference date.

3. Encoding categorical variables: label encoding for ordinal categories and one-hot encoding for nominal ones (see the sketch after this list).

4. Removing extreme outliers identified with box plots to improve the model.
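A minimal sketch of steps 2 and 3; the reference date, the severity ordering, and the choice of nominal columns are all assumptions:

```python
# Step 2: convert date fields to day counts from an assumed reference date
for col in ['policy_bind_date', 'incident_date']:
    df[col + '_days'] = (pd.to_datetime(df[col]) - pd.Timestamp('1990-01-01')).dt.days

# Step 3: map an ordinal category to integers (assumed level ordering),
# then one-hot encode assumed nominal columns
severity_order = ['Trivial Damage', 'Minor Damage', 'Major Damage', 'Total Loss']
df['incident_severity'] = df['incident_severity'].map(
    {s: i for i, s in enumerate(severity_order)})
df = pd.get_dummies(df, columns=['insured_relationship', 'auto_make'],
                    drop_first=True)
```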

4.2 Feature Engineering and Selection


We created several new features designed to capture useful information, listed below (a construction sketch follows the list):

- Policy age: number of days from the policy bind date to the incident date

- Vehicle age: the current year minus the car's model year

- Claim ratio: total claim amount divided by the policy annual premium
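A sketch of these derived features, assuming the columns are named `policy_bind_date`, `incident_date`, `auto_year`, `total_claim_amount`, and `policy_annual_premium`:

```python
# Computed from the raw date columns, before they are converted to day counts
df['policy_age_days'] = (pd.to_datetime(df['incident_date'])
                         - pd.to_datetime(df['policy_bind_date'])).dt.days
df['vehicle_age'] = pd.Timestamp.now().year - df['auto_year']
df['claim_ratio'] = df['total_claim_amount'] / df['policy_annual_premium']
```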

Feature selection was accomplished using correlation analysis and by examining the feature importance scores attributed by a tree-based model. We dropped highly correlated features to reduce multicollinearity and improve the interpretability of the model.

4.3 Model Implementation and Evaluation


We used and compared five machine learning models, specifically:

1. Random Forest

2. Decision Tree

3. Support Vector Machine (SVM)

4. Gradient Boosting

5. K-Nearest Neighbors (KNN)

Each of these models was trained on the preprocessed dataset using an 80-20 train-test split for evaluation. Grid search with 5-fold cross-validation was employed to tune the optimal hyperparameters of each model.

Random Forest Classification Evaluation

A random forest classifier was trained on the training set and assessed on the test set. The model is 71% accurate, meaning that it predicted the target variable correctly 71% of the time. However, the performance metrics reveal a serious problem with how the model performs on the positive class (label 1) alone:

- Precision: The precision for the positive class is 0.00; no cases of the positive class were
classified correctly by the model.

- Recall: For the positive class, the recall is 0.00; the model failed to recognize any of the
actual positive cases.

- F1-Score: The F1 score for the positive class is also 0.00, thus indicating that the model
did not perform well in terms of classification concerning positive instances.

According to the classification report, this model works well on the negative class (label 0), with very high precision, recall, and F1-score, but the positive class is completely missed, with literally no correct predictions. This is clearly reflected in the confusion matrix: 142 true negatives, 3 false positives, 55 false negatives, and 0 true positives.

The random forest model thus has reasonable accuracy over all cases but fares poorly on the positive class, indicating that further tuning, or more sophisticated models and strategies for the class imbalance, need to be explored.

Decision Tree Classification Evaluation

Next, a Decision Tree classifier was trained and evaluated, achieving an accuracy of 80% on the test set. Its performance metrics are as follows:

- Accuracy: 80%; the model predicted the target variable correctly 80% of the time.

- Precision: For the positive class (label 1), precision is 0.63, so 63% of the cases the model predicted as positive are actual positives.

- Recall: The recall for the positive class is 0.60, which means that the model identified 60%
of the actual positive cases.

- F1-Score: The F1-score for the positive class is 0.62; this is a balance between the
precision and recall for the positive class.

Based on this classification report, the model performs reasonably on both classes: fairly high precision, recall, and F1-score for the negative class, and a more balanced result for the positive class than the random forest model.

The confusion matrix reads 126 true negatives, 19 false positives, 22 false negatives, and 33 true positives. The Decision Tree model thus does somewhat better at recognizing positive cases than the Random Forest, but still leaves much to be desired.

Support Vector Machine (SVM) Classification Evaluation:

The Support Vector Machine (SVM) classifier was trained and evaluated next, returning the following results:

- Accuracy: The model is 72% accurate in predicting the target variable.

- Precision: For the positive class, that is, label 1, the precision is 0.00, indicating that no
positive cases were classified correctly by the model.

- Recall: The recall for the positive class is also 0.00, which is indicative of its failure to recall
any of the actual positive cases.

- F1-Score: For the positive class, this comes out to be 0.00, which means the model does
very poorly in predicting positive instances.

According to the classification report, the SVM model's performance on the negative class (label 0) is high in precision and recall, and hence F1-score, yet it completely misses the positive cases. This is attested by the confusion matrix, which contains 145 true negatives and 0 false positives, along with 55 false negatives and 0 true positives.

The SVM model thus performs well on the negative class but fails to detect positive instances, pointing to a severe class imbalance problem or a model configuration issue.

Gradient Boosting Classification Evaluation

The Gradient Boosting classifier produced the following results:

- Accuracy: This model gives an accuracy of 82%, which means it has predicted the target
variable correctly 82% of the time.

- Precision: For the positive class, the precision is 0.66 for label 1, which means of all the
cases that were predicted as positive, 66% actually were.

- Recall: Recall for the positive class is 0.73, which means it found 73% of actual positive
cases.

- F1-Score: The F1-score for the positive class is 0.69.

The classification report shows good performance of the Gradient Boosting model on both classes, with markedly higher precision and recall for the positive class than the SVM model. The confusion matrix contains 124 true negatives, 21 false positives, 15 false negatives, and 40 true positives, meaning the model is good at catching positive cases while keeping a balanced performance on negative cases.

Optimized Decision Tree Classifier Evaluation

Parameter combinations were evaluated across 5 cross-validation folds. The optimal parameters identified were:

- Criterion: 'gini'

- Max Depth: None

- Min Samples Leaf: 10

- Min Samples Split: 2

With these parameters, the model achieved an accuracy of 83% in cross-validation; on the test set, the optimized Decision Tree Classifier returned an accuracy of 81%.

In detail:

- Precision of the positive class is 0.69, meaning that out of all cases predicted as positive,
69% actually are.

- Recall of the positive class is 0.53, so it found 53% of the actual positive cases.

- F1-Score of the positive class is 0.60, which is a balanced measure between precision and
recall.

The confusion matrix returns 132 true negatives, 13 false positives, 26 false negatives, and 29 true positives. The result improves on the base Decision Tree model, driving better precision for the positive class while keeping high precision and recall for the negative class.

Key findings from our model evaluation:

Gradient Boosting performed best, with an accuracy of 82% and an F1-score of 0.69 on fraudulent claims.

The optimized Decision Tree was competitive, with an accuracy of 81% and an F1-score of 0.60 on fraudulent claims.

In the initial implementations, random forest and SVM struggled with the class imbalance,
often predicting all claims as non-fraudulent.

KNN showed middling performance and lost out to the ensemble methods.

4.4 Hyperparameter Tuning Results


Gradient Boosting:

Optimized Gradient Boosting Classifier Evaluation

The Gradient Boosting Classifier was fine-tuned with GridSearchCV over an extensive parameter grid, concluding with the following best parameters: a learning rate of 0.1, a maximum depth of 3, 200 estimators, and specific values of min samples split and leaf. The optimized model scored 85% under cross-validation and 82% on the test set, with strong performance metrics: a precision of 0.66, recall of 0.75, and F1-score of 0.70 for the positive class.

These findings are confirmed by the confusion matrix, which gives 124 true negatives, 41 true positives, 14 false negatives, and 21 false positives. The optimized Gradient Boosting Classifier strikes a balance between precision and recall for the positive class, proving to be a robust model among those considered, with good overall performance and a clear improvement in the detection of positive cases.

Decision Tree:

Best parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 10, 'min_samples_split': 2}

Best cross-validation accuracy: 0.83

Optimized Random Forest Classifier Evaluation

The GridSearchCV fine-tuned Random Forest Classifier, with its number of estimators and max depth optimized, scored an accuracy of 71.0% on the test set. It was very good at predicting the negative class, with 140 true negatives against just 5 false positives, but did badly on positive cases, with a low precision of 0.29 and a recall of only 0.04. The F1-score for the positive class is 0.06, underscoring how poorly this model trades off precision against recall.

The confusion matrix shows that the model identified only 2 of the 55 actual positive cases and missed 53. The model is therefore strong on negative cases but weak on positive ones; better tuning, or other approaches entirely, are needed to improve the detection of positive cases.

Random Forest:

Best parameters: (not provided in the output, but they typically include n_estimators, max_depth, min_samples_split, and min_samples_leaf)

4.5 Feature Importance Analysis


Feature importance analysis for the Gradient Boosting model indicates the following key
predictors for fraudulent cases:

1. Total claim amount

2. Vehicle claim amount

3. Incident severity

4. Number of vehicles

5. Policy annual premium

This information may be very useful to fraud investigators in focusing their efforts on the
most relevant factors while reviewing suspicious claims.
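A sketch of how these importances might be extracted, assuming `best_gb` is the tuned Gradient Boosting model and `X_train` is a DataFrame:

```python
# Rank features by their importance to the tuned Gradient Boosting model
importances = pd.Series(best_gb.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
```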

4.6 Model Interpretability


Code Overview for Model Accuracy Comparison

This code snippet benchmarks the accuracy of various machine learning models, including
the non-optimized and optimized variants of the Decision Tree and Random Forest classifier.
Here is a small description of what each part does:

1. Model Definitions and Fitting: Seven models in total are constructed, comprising the base Random Forest, Decision Tree, SVM, Gradient Boosting, and KNN classifiers plus optimized versions of the Decision Tree and Random Forest. The models are fit to the training data (`X_train`, `y_train`) and scored for accuracy on the test data (`X_test`).

2. Accuracy Calculation: The accuracy of each model is computed with `accuracy_score` and stored in a list called `accuracies`.

3. Plotting: A bar graph shows the accuracy of all models. Each bar is annotated with its accuracy to make the figure easier to read, the x-axis labels are rotated for visibility, and the y-axis is fixed between 0 and 1 to give a common scale.
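A sketch of this comparison; the variables for the optimized classifiers (`best_dt`, `best_rf`) are assumptions standing in for the tuned estimators:

```python
models = {
    'Random Forest': RandomForestClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'KNN': KNeighborsClassifier(),
    'Optimized Decision Tree': best_dt,   # assumed tuned estimator
    'Optimized Random Forest': best_rf,   # assumed tuned estimator
}

accuracies = []
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))

plt.figure(figsize=(10, 5))
bars = plt.bar(models.keys(), accuracies)
for bar, acc in zip(bars, accuracies):
    # Annotate each bar with its accuracy value
    plt.text(bar.get_x() + bar.get_width() / 2, acc + 0.01, f'{acc:.2f}', ha='center')
plt.xticks(rotation=45)
plt.ylim(0, 1)
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.show()
```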

While Gradient Boosting provided the best performance, it is less interpretable than simpler models such as Decision Trees. We therefore recommend using the Gradient Boosting model for the initial screening of claims, while keeping a more interpretable model, such as a decision tree, for cases that require a detailed explanation.

4.7 Challenges and Limitations


Several challenges were encountered during the implementation:

1. Class imbalance: Because non-fraudulent claims vastly outnumbered fraudulent ones, the dataset suffered heavily from class imbalance, which caused very poor fraud-detection performance in the initial stages of model development.

2. Feature encoding: Properly encoding each variable in our models, particularly the categorical ones, was very important and required careful consideration of ordinal versus nominal categories.

3. Hyperparameter tuning: Some models, such as the Random Forest, have a large hyperparameter space, which can make an exhaustive grid search very computationally expensive.

4. Model interpretability: While Gradient Boosting and other ensembles performed well, the added complexity can make individual predictions hard to explain.
