0% found this document useful (0 votes)
5 views

Application-of-machine-learning-and-data-visua_2021_International-Journal-of

This research paper discusses the application of machine learning and data visualization techniques in the insurance sector, particularly focusing on claim analysis to differentiate between fraudulent and genuine claims. It emphasizes the importance of utilizing big data and machine learning algorithms to improve decision-making processes, enhance customer satisfaction, and optimize claims management. The study evaluates various classification algorithms and feature selection methods to derive meaningful insights for better underwriting and policy enrollment.

Uploaded by

manya.singh1999m
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

Application-of-machine-learning-and-data-visua_2021_International-Journal-of

This research paper discusses the application of machine learning and data visualization techniques in the insurance sector, particularly focusing on claim analysis to differentiate between fraudulent and genuine claims. It emphasizes the importance of utilizing big data and machine learning algorithms to improve decision-making processes, enhance customer satisfaction, and optimize claims management. The study evaluates various classification algorithms and feature selection methods to derive meaningful insights for better underwriting and policy enrollment.

Uploaded by

manya.singh1999m
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

International Journal of Information Management Data Insights 1 (2021) 100012

Contents lists available at ScienceDirect

International Journal of Information Management Data


Insights
journal homepage: www.elsevier.com/locate/jjimei

Application of machine learning and data visualization techniques for


decision support in the insurance sector
Seema Rawat a, Aakankshu Rawat a, Deepak Kumar b,∗, A. Sai Sabitha a
a
Department of Information Technology, Amity School of Engineering and Technology (ASET), Amity University Uttar Pradesh, Sector 125, Noida-201313, Gautam
Buddha Nagar, Uttar Pradesh, India.
b
Amity Institute of Geoinformatics & Remote Sensing (AIGIRS), Amity University Uttar Pradesh, Sector 125, Noida-201313, Gautam Buddha Nagar, Uttar Pradesh,
India.

a r t i c l e i n f o a b s t r a c t

Keywords: The insurance industry has a giant role in the sustainable economic growth of any country. With an increase in
Insurance the number of insurance buyers, it has become an absolute necessity for an insurance company to have a detailed
Claim analysis claim analysis system in place. Claim Analysis is performed by insurance companies to distinguish between
Classification
fraudulent and genuine claims. Apart from that, Claim Analysis can also be used to understand the client strata
Data mining
in a much better way and implement the results further during the underwriting and acceptance/denial stage
Exploratory data analysis
Feature selection of policy enrollment. The main objective of this research work is to identify meaningful and decisive factors for
InsurTech claim filing and acceptance in a learning context through exploratory data analysis (EDA) and feature selection
Machine learning techniques. Also, machine learning algorithms are applied to the datasets and are evaluated using performance
metrics.

1. Introduction claims as well as generate sufficient profit to survive in the industry


Blackstone (2013).
Insurance is financial protection against any type of risk. This pro- “InsurTech” is the term used in association with bringing new tech-
tection is defined by a contract between two parties: the insurer and nologies and innovations in the field of insurance to impact the regu-
the insured (beneficiary). The insurer is the company that sells the poli- latory practices of insurance markets. Here comes the concept of “Big
cies and the insured is the person or the party which buys the insurance Data”. Big Data has transformed the way insurance companies deals
policy for its benefits. The insurer agrees to take risk of an insured en- with the enormous amount of data that they receive. Due to the ad-
tity against future events in return for monetary compensation known vancement of technologies, the volume of data to be dealt with has
as Premium (Aswani, Ghrera, Chandra, & Kar, 2020). In case of an un- enormously increased and this data can be structured, semi-structured
foreseen event, the insurer must pay a claim to the insured i.e. the ben- as well as unstructured (Chakraborty & Kar, 2017; Kar, 2016). Initially,
efit amount to be paid to the beneficiary according to the policy doc- and even now, many insurance companies use actuarial formulas for
ument. Depending on the types of risk involved, insurance policies are underwriting and mortality table for life expectancy assessment. Big
of different types. Life Insurance, Travel Insurance, Health Insurance, Data technologies help to analyze data and extract important informa-
Auto Insurance and Property Insurance are some of the different Line of tion from it that can lead to better decision making and strategic busi-
Business’ (LOB) in the insurance industry (Bacry et al., 2020). Another ness development as compared to the above-mentioned traditional data-
type of insurance is Re-insurance in which an insurance company pur- processing techniques (Chowdhury, Mayilvahanan, & Govindaraj, 2020;
chases insurance from another insurance company to protect itself from Karhade et al., 2019). Data Mining plays an important role in various
financial risk due to the event of an enormous claim Barry & Charpen- aspects of the insurance industry such as risk assessment, fraud detec-
tier (2020). The whole industry runs on the concept of risk or financial tion, underwriting analysis, claim analysis, marketing analytics, product
loss reduction (Batra, Jain, Tikkiwal, & Chakraborty, 2021). In such a development, customer profiling etc (D. Das, Chakraborty, & Banerjee,
deal, where the insurer has to ensure the client against any type of fi- 2020; S. Das, Datta, Zubaidi, & Obaid, 2021). Along with Data Min-
nancial loss against any unforeseen event, there must be a way for the ing, the industry is shifting towards machine learning algorithms to pre-
insurer or the insurance company to manage their transactions to pay dict using the analysis done on big datasets for better fraud detection,


Correpsonding author.
E-mail addresses: [email protected] (S. Rawat), [email protected] (A. Rawat), [email protected], [email protected] (D. Kumar),
[email protected] (A.S. Sabitha).

https://doi.org/10.1016/j.jjimei.2021.100012
Received 4 February 2021; Received in revised form 17 March 2021; Accepted 18 March 2021
2667-0968/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/)
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

KYC verification, behavioural policy evaluation and custom claim settle- improves risk selection and pricing strategies (Larson & Sinclair, 2021;
ments. The use of Machine Learning (ML) is increasing tremendously in Maehashi & Shintani, 2020). Not only that, but different insights can
the insurance industry despite initial resistance by the industry because also be drawn from the client’s data as well as personalization of poli-
of its explicitly recursive approach in predictive modelling which helps cies has become easier due to faster underwriting and analysis (McGlade
to improve the model at each repetition. In this analysis, we will deal & Scott-Hayward, 2019; Mita et al., 2021). Claim processing has become
with the claim analysis aspect of the insurance industry (Dave, Patwa, automated using mobile applications (Nian, Zhang, Tayal, Coleman, &
& Pandit, 2021; Doupe, Faghmous, & Basu, 2019). ML is used in claim Li, 2016). A thorough analysis can help with fraud detection. Insurance
analysis & processing for triaging claims, identifying outlier claims & fraud is a 40 billion dollar industry per year, hence, the use of ML tech-
even fraud, and automating where possible i.e. reducing the human in- niques can help to notify brokers in case of fraud to further investigate.
tervention in claim processing and making the whole process hassle- The analysis tools used to gather insights on customer behaviour can
free (Gupta, Kar, Baabdullah, & Al-Khowaiter, 2018; Kakhki, Freeman, also be used to gather insights on employees to retain valuable talent.
& Mosher, 2020). The use of ML algorithms in this process helps the This can be done by understanding the behaviour, interests, learning
company to understand the beneficiary’s claim filing pattern as well styles of the brokers as well as their job satisfaction and the potential to
their claim acceptance pattern which can be used to optimize the whole look out for a job somewhere else (Ozbayoglu, Gudelek, & Sezer, 2020;
process flow for policy enrollment Kar (2016). Sengupta et al., 2020). Lastly, AI & ML can be used for the effective mar-
The purpose of this research is to understand how machine learn- keting of insurance policies. Table 1 summarizes the major work done
ing algorithms can help in insurance companies to deduce patterns in in the insurance sector.
various segments/branches of InsurTech (claim analysis is done in this
research paper). In this analysis, two datasets are used to perform claim
2.3. Claim analysis in insurance sector
analysis using different classification algorithms. Three feature selection
algorithms have been used to reduce the dimensionality of the data and
Statistics has been in use in the insurance industry from the onset
to improve the results of the analysis. The algorithms are finally evalu-
of the industry. There is a whole discipline for the use of statistics in
ated and compared based on four widely approved and trusted metrics:
the insurance industry known as Actuarial Science. With the massive
accuracy, precision, recall and f1-score. The following sections of the
increase in data processed by the industry, Predictive Analytics is com-
research work are literature review, proposed computational methodol-
ing into the limelight (Pal, Mandana, Pal, Sarkar, & Chakraborty, 2012;
ogy, experimentation, results, discussion and conclusion.
Yang et al., 2021). It encompasses data mining, predictive modelling,
and machine learning techniques like classification, regression, cluster-
2. Literature review
ing and outlier detection to make accurate and fast predictions about
unforeseen events in the future using the current data (Palanisamy &
2.1. All about “InsurTech”
Thirunavukarasu, 2019).
Claim Analysis is an important aspect of Predictive Analytics in the
InsurTech is simply bringing technology to the world of insurance.
insurance industry as approximately 80% of the premium revenue gen-
InsurTech companies have been able to come up with several business
erated is spent on claims by the insurance companies Pappas & Wood-
models to improve the contracting process as well the claims manage-
side (2021). Hence, it is essential to do a thorough analysis of claims
ment process, helping customers to clearly understand where their pre-
to improve cash flow. By analyzing the insurance data, relations be-
mium goes as well as how their premiums are structured. InsurTech
tween various factors (variables) is observed and a function is derived to
technologies and innovations have helped to reach out to all the sec-
model predictions (Petropoulos, Siakoulis, Stavroulakis, & Vlachogian-
tions of the economy, especially the lower-income bracket and the less
nakis, 2020). These predictions can be used for making decisions. Apart
developed markets. Most of the InsurTech companies are startups and
from the structured data used by the companies, there is a huge scope
due to the huge scale of customers they attract, many reinsurance com-
of unstructured data that provides vital information (Pourhabibi, Ong,
panies invest in them. But while bringing new technologies to the mar-
Kam, & Boo, 2020; Waring, Lindvall, & Umeton, 2020). It can be used
ket, they must be aware of the regulations of the insurance industry and
to cluster the different categories of beneficiaries and calculate the ex-
meet them. Many countries have a well-established regulatory sandbox
pected claim payout and processing for these categories (Pramanik et al.,
approach to allow InsurTech companies to enter the market so that such
2020). There are certain Key Performance Indicators (KPI) of insurance
startups can grow their business models as well as keep up with the reg-
claims such as claims cycle time, customer satisfaction, fraud detec-
ulatory requirements. India has a regulation policy in place for micro
tion, claims recovery and claim handling costs (Richter & Khoshgof-
insurance regulations since 2015 (Kasy, 2018; Kaur, Sharma, & Mittal,
taar, 2018). It is observed that out of the enormous amount of data that
2018). Apart from the benefits, InsurTech needs to balance the need for
an insurance company has access to, it utilizes only 10-15% of that data
innovation as well as the need for the protection of data. There is a need
(Ringshausen et al., 2021; Saggi & Jain, 2018). ML can help to increase
for close inspection of the internal controls that ensure that the client’s
the utilization of data keeping in mind the KPIs of claim, to automate
data use is free of bias and the law is adhered to.
many routine processes to reduce claims cycle time, increase customer
satisfaction, combat fraud, optimize claims recovery and reduce claim
2.2. Need for artificial intelligence (AI) & machine learning (ML) in
handling costs .
insurance

The insurance industry is one of the oldest industries in the world 3. Proposed computational methodology
with marine insurance as the first form of insurance, used for self-
insurance. With the increase in client data, there is a huge scope for In this analysis, two different datasets are used to perform claim anal-
improvement in methodologies used in the sector for policy enrollment ysis using ML classification algorithms. Classification algorithms are a
and claims settlement (Khan, Bashir, & Qamar, 2014; Knighton et al., type of Supervised ML algorithms. Supervised Learning is used when
2020). Artificial Intelligence is the buzzword of the decade and it sim- the data is divided into a set of input variables or features and there is a
ply means the enhancement of machine to simulate human intelligence corresponding output or target variable. Supervised algorithms are fur-
(Kose, Gokturk, & Kilic, 2015; Kraus, Feuerriegel, & Oztekin, 2020). AI ther divided into classification and regression algorithms. Classification
encompasses ML and predictive analysis which is taking the insurance algorithms are used when the target variable is categorical in nature
industry by storm. AI helps in almost all the business needs of an in- and regression algorithms are used when the target variable is continu-
surance company. AI speeds up the process of underwriting, as well as ous. For example, if an insurance company has to predict the premium

2
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Table 1
Research work in the Insurance Industry.

S. No. Authors Year Application Techniques used

1. Bartl et al. (2020) Claim Prediction in Export Decision Tree, Random Forest, Neural Networks (NN) & Probabilistic Neural
Credit Finance Networks (PNN) for prediction and Accuracy, Cohen’s K & R-square for
assessment
2. Baudry et al. (2019) Claim Reserving Non-parametric ML model used for prediction, chain ladder method used for
assessment w.r.t. bias and variance of estimates.
3. Insurance Claim Analysis Naïve Bayes, Naïve Bayes Updatable, Multi-Layer Perceptron, J48, Random
Tree, LMT, Random Forest used for prediction with Recall, Precision,
F-Measure, MCC, PRC & ROC Area for assessment
4. Risk Prediction in Life Univariate & Bivariate analysis for data visualization, PCA &
Insurance Correlation-based feature selection for dimensional reduction and multiple
linear regression, multilayer perceptron, REPTree & Random Tree for
prediction
5. Dehghanpour et al. (2018) Portfolio Insurance Adaptive Neuro-Fuzzy Inference Systems (ANFIS) for prediction with the
Strategy Markowitz portfolio optimization model for determining optimal portfolio
weights
6. Kang et al. (2018) Aggregate Auto-Insurance Feature Selection techniques to classify the dataset into homogenous risk
Data Analysis groups
7. Panigrahi et al. (2018) Auto-Insurance Fraud Univariate, L1 based & Tree-based feature selection and Decision Tree, Naïve
Detection Bayes, KNN & Random Forest classification algorithms
8. Patil et al. (2018) Survey on ML techniques Supervised, Unsupervised and Hybrid- Bagging, Boosting, Stacking &
used for Fraud Detection Ensemble Learners
9. Quan et al. (2018) Predictive Analytics of Multivariate Decision Trees compared using R2, Gini, ME, MPE, MSE, MAE
Claims and MAPE
10. Wang et al. (2018) Auto-Insurance Fraud Deep Learning model with Latent Dirichlet Allocation (LDA) based text
Detection analytics
11. Role of Data Mining in the Classification, Clustering, Regression, Association, Summarization
Insurance Industry
12. Rao et al. (2013) Factors influencing Claims Regression analysis
in General Insurance,
India
13. Guelman (2012) Insurance Lost Cost Gradient Boosting (GB) compared to the Linear Model approach
Modeling
14. Salcedo‐Sanz et al. (2004) Insolvency Prediction Simulated Annealing (SA) and Walsh Analysis for feature selection using
SVM as underlying classifier
15. Viaene et al. (2002) Insurance Claim Fraud Logistic Regression, C4.5 Decision Tree, K-Nearest Neighbor, Bayesian
Detection Multilayer Neural Network, Naïve Bayes and SVM for classification and
PCC, AUROC and ROC curves for assessment

ML algorithms are applied to a dataset, a framework is used to outline


the process. This framework simplifies the process as well as makes it
easier to understand. The framework used in this analysis is designed
based on Yufeng Guo’s 7 Steps of Machine Learning. Fig. 2 shows the
framework used to perform claim analysis.

3.1. Data collection

Data Collection initiates the onset of the ML process. Data can be col-
lected through various sources and methods. Searching and sharing is
the most effective and common way to collect data. Data can be searched
on the web or obtained from a ‘data lake’ and be shared through the web.
Fig. 1. Outcomes of claim analysis done in this analysis.
Another method in use is Data Augmentation. In this method instead of
collecting new data, existing data is augmented with external data to in-
crease the diversity of the dataset. This approach is mostly used in deep
or claim amount, then regression algorithms can be used. In this anal-
learning to develop and train large artificial neural networks (ANN).
ysis, the outcomes of the analyses using the two datasets are different,
The last way to collect data is through crowdsourcing and generating
depending on the type of the datasets and the variables (features) in-
synthetic datasets.
volved.
The datasets used in the analysis are collected from Kaggle.com and
Apart from Supervised Learning, Unsupervised Learning is also im-
Github.com. Further description is given in the “Case Studies” section.
plemented in the insurance industry. In the case of unsupervised learn-
ing, there is no target variable. Unsupervised algorithms are further clas-
sified into Clustering and Association algorithms. Some use cases of un-
supervised learning are to find customers having similarities in various 3.2. Data preparation
attributes, analyze customer attrition using clustering algorithms and
discovering associations that improve or promotes the business in the Data Preparation is the process of transforming data such that it is
case of association algorithms. suitable for an ML algorithm. It can have a great impact on the per-
As shown in Fig. 1, the target variable in both datasets is categorical. formance of the model. It consists of data cleaning, exploratory data
Hence, classification algorithms are used to do the analyses. Whenever analysis (EDA), normalization and dimensionality reduction processes.

3
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 2. ML Framework used for Claim Analysis.

3.2.1. Data cleaning ing is variable-by-variable cleaning. In this approach, illegal or misspelt
Data Cleaning is the first step after the data is retrieved. It consists features values are removed from the dataset based on certain factors
of detecting and removing inaccurate, false, incomplete, corrupt, or ir- such as the minimum and maximum value should not be outside the per-
relevant records from the dataset. Data cleaning is also called “Data missible range, the variance and standard deviation should not be more
Pre-Processing”. One of the most common approaches for data clean- than the threshold value and there should not be any misspelt values in

4
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

the dataset. Data values are either removed or manipulated depending 3.2.4.1.1. Chi-Square test. It is a type of statistical filter method that
on the coarseness of the data value. In case there are missing feature is used to evaluate the correlation between different features using their
values, the feature is either eliminated if there are many missing values frequency distribution. In this method, feature selection is based on the
or the missing values are replaced by a dummy value (treating the miss- intrinsic properties of the features and is independent of any ML algo-
ing value itself as a new value). Mean substitution is the most common rithm.
approach for treating missing values. 3.2.4.1.2. Recursive feature elimination (RFE). It is a type of wrapper
method used for feature selection. The term “wrapper” is used because
3.2.2. Exploratory data analysis (EDA) this method wraps up a classifier in a feature selection algorithm. In RFE,
EDA is an aid to understanding the data before applying any ma- features are recursively removed from the dataset based on an external
chine learning model to it. It is done by visualizing data with the help estimator used which is the classifier. The classifier assigns weights to a
of different graphs to understand the different characteristics of the data feature based on its performance. It is a greedy algorithm that seeks to
like hidden relationships among various features, which is not possible generate the best performing subset.
by simply looking at the dataset Table 3. 3.2.4.1.3. Tree-based feature selection. It is a type of embedded
method in which there is an inbuilt method for feature importance
3.2.3. Feature engineering which generates a set of features along with their importance. Embed-
Feature engineering is one of the most fundamental and important ded methods can be used with the help of algorithms that have inbuilt
parts of data preparation before moving further to model training and feature selection methods.
evaluation. In this stage, new features are created based on EDA per-
formed on the dataset and existing knowledge of the domain of the 3.3. Model selection
dataset to improve the performance of the model. It is one of the most
difficult and time-consuming parts of data preparation. Surveys show After the data preparation is done and the data is divided into train-
that data scientists spend almost 80% of their time on data preparation. ing and testing tests, suitable models need to be selected to do the train-
The new features are created based on different calculations between ing. Since it is a classification problem, appropriate classifiers need to
the existing features. The new features might be a ratio, or a mathemat- be selected to conduct the classification. Both the datasets used in the
ical transformation or any statistical or scientific formula to generate a analysis fall under the category of binary classification. The classifica-
more significant feature. Feature engineering can be done both manu- tion algorithms used in this analysis are Logistic Regression, Random
ally by statisticians and by using feature encoding techniques in the case Forest, Decision Tree, Support Vector Machine, Gaussian Naïve Bayes,
of categorical variables. There is a general misconception that feature Bernoulli Naïve Byes, Mixed Naïve Bayes and K-Nearest Neighbors.
engineering can only be beneficial for linear regression or text classifi-
cation problems. Feature engineering has proved to be greatly benefi- 3.4. Model training
cial for support vector machines, random forests, neural networks, and
gradient boosting machines. Encoding is important as machine learn- After model selection, the training data is trained using the models
ing itself is based on mathematical models and algorithms, so most of selected with all features initially. Then, classification algorithms are
the algorithms cannot classify between categorical and continuous val- applied only to the features that are selected through feature selection
ues. Encoding follows two methodologies: nominal and ordinal. Nomi- techniques.
nal Encoding is performed where the order of the data is not of much
importance and vice-versa. 3.5. Model evaluation and comparison
Apart from encoding, there are other techniques for feature engi-
neering such as normalization. Normalization is used for scaling all the Finally, the models are compared with each other along with the fea-
values in a dataset in a fixed range between 0 and 1. The formula used ture selection techniques to come up with the best model and feature
for normalization is. selection technique for the two datasets. Four metrics are used to evalu-
𝑋 − 𝑋𝑚𝑖𝑛 ate the models in the analysis: Precision, Recall, F1 Score and Accuracy.
𝑋𝑛𝑜𝑟𝑚 =
𝑋𝑚𝑎𝑥 − 𝑋𝑚𝑖𝑛 All four metrics are given equal importance rather than relying on one
Although it does improve the numerical scalability of the model, it performance metric.
should not be used always it can harm the performance of a model.
4. Experimentation results
3.2.4. Dimensionality reduction
Dimensionality Reduction is simply reducing the dimensionality of The analysis consists of two case studies: one of the health insurance
your features. There are two approaches to the process: Feature Selec- sector and the other one of the travel insurance sector. The output in
tion and Feature Extraction. In this analysis, only feature selection is both cases is different as shown in Fig. 1. The first one is from the per-
utilized to reduce the dimensionality of the feature set as feature extrac- spective of the beneficiary and the second one is from the perspective
tion is more suitable for data used for pattern recognition or image pro- of the insurance company.
cessing where meaningful inferences cannot be obtained just by looking
at the data. 4.1. Case Study 1: health insurance

3.2.4.1. Feature selection. Having all features considered for modelling For the case study, the dataset is obtained from Kaggle.com. It con-
can reduce the predictability of the model. It is always preferred to se- sists of 1338 rows and 9 columns: 8 features and 1 target variable.
lect features that contribute more to the target variable. It can be done Table 2 describes the features and target variable of the dataset for bet-
using manual methods like univariate selection where each feature is ter understanding.
evaluated to decipher its importance. Statistical methods like variance Once the data is collected, the next step is data preparation. First,
and Pearson correlation are used for univariate analysis. But univariate the data is checked for any missing values. There are no missing values
analysis is more reliable when the data is linear, also it is exceedingly in the dataset. Next, different statistics of the features are observed.
difficult to perform univariate analysis on a large dataset. In such a case, Before proceeding further for EDA, one-hot encoding is performed
the multivariate analysis can be performed. There are three methods of for the feature ‘region’ to better understand the distribution of policy-
performing multivariate analysis: filter, wrapper and embedded. holders in different regions along with other features. The feature itself
Following are the feature selection methods used in the analysis: is removed from the dataset and four new features ‘NorthEast USA’,

5
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 3. Graphical Representation of relationship b/w Number of Claims and the BMI of the beneficiary & Number of Claims and the Age of the beneficiary.

Table 2 policyholders with an average of 3–6k steps day claim the most and
Description of Health Insurance Dataset. the policyholders with more than 6k steps per day claim the least and
S.No. Column Description the policyholders with no children claim the most. Apart from these
Heading findings, it is also observed that the policyholders of NorthWest USA
are neutral to filing a claim while policyholders of SouthEast USA are
1. age Age of the beneficiary
2. sex Gender of the beneficiary (female = 0, male = 1) more likely to file a claim as compared to other regions.
3. bmi Body Mass Index of the beneficiary, i.e. the ratio of The final set of features selected before proceeding to feature selec-
weight to height (kg / mˆ2), ideally 18.5 to 25 tion are ‘age’, ‘sex’, ‘BMI’, ‘steps’, ‘children’, ‘smoker’, ‘NorthEast USA’,
4. steps Average walking steps per day of the beneficiary
‘NorthWest USA’, ‘SouthEast USA’, ‘SouthWest USA’, ‘charges’ and the
5. children Number of children or dependents of the beneficiary
6. smoker Smoking status of the beneficiary (non-smoker = 0,
target variable ‘insurance claim’.
smoker = 1) From Table 6 and Fig. 8, it can be inferred that Logistic Regression,
7. region The place of residence of the beneficiary in the US Random Forest and Decision Tree Classifiers show the best performance.
(northeast = 0, northwest = 1, southeast = 2, The Decision Tree Classifier has the best performance among these three
southwest = 3)
as three out of the four performance metrics have a higher value in
8. charges Individual medical costs billed by health insurance
9. insurance Whether the beneficiary files a claim or not (Yes = 1, Decision Tree classifier as compared to the rest of the two classifiers.
claim No = 0) One thing to note is that no single performance evaluation metric is
given more importance.
Now, once again the dataset is trained using the same eight classi-
Table 3
Statistics of features of the Health Insurance Dataset. fiers but with feature selection techniques to observe the changes in the
result as well as evaluate the best feature selection technique for all the
Statistics Age BMI Steps Children Charges classifiers.
Min 18 15.96 3000 0 1121.87 From Table 7 and Fig. 8, it can be inferred that the Logistic Regres-
Max 64 53.13 10010 5 63770.42 sion, Random Forest and Decision Tree Classifiers show the best perfor-
Mean 39.2 30.66 5328.62 1 13270.42 mance once again. Random Forest is the best among these three classi-
fiers as all four of its performance metrics have a higher value than the
rest of the classifiers. Using Chi-Squared Test, four features are elimi-
‘NorthWest USA’, ‘SouthEast USA’,’ SouthWest USA’ are added to the nated leaving the dataset with seven features: ‘age’, ‘BMI’, ‘steps’, ‘chil-
dataset. dren’, ‘smoker’, ‘NorthWest USA’ and ‘Southeast USA’. The performance
By observing Fig. 3, it can be inferred that the policyholders having of the SVM Classifier has decreased, there is an improvement in the per-
a BMI between 14 and 24 claim the least, policyholders having a BMI formance of Gaussian and Mixed NB Classifiers & the performance of
between 24–29 have a neutral response and the policyholders with a the KNN Classifier has increased tremendously.
BMI of more than 29 are most likely to file a claim. Also, it can be From Table 8 and Fig. 8, it can be inferred that Logistic Regression,
concluded that the age bracket of 29–39 has the maximum number of Random Forest, Decision Tree, Gaussian NB and Mixed NB show the
policyholders that have not filed a claim. best performance. Random Forest is once again the best among all the
By observing Fig. 4, it can be inferred that while most of the smok- classifiers. Logistic Regression, Gaussian NB and Mixed NB show equal
ers file a claim, non-smokers are neutral towards filing a claim and the performance. The performance of SVM, Gaussian NB, Bernoulli NB and
sex of the policyholder does not affect the same. Whether a male or fe- Mixed NB classifiers increased, the rest of the classifiers’ performance
male, if the policyholder is a smoker, he or she will most probably file a has decreased as compared to the results obtained from the Chi-Squared
claim. Test. As Chi-Squared Test is a statistical test and has no involvement of
From Fig. 5, it can be inferred that policyholders with charges less any ML model, the number of features considered important by this
than 9999 are least likely to file a claim and policyholders with charges test i.e. seven is considered as the number of comparing all the feature
more than 39,999 are going to file a claim. It can also be inferred that selection techniques. The seven most important features selected using

6
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 4. Graphical Representation of the relationship between Number of Claims and the smoking category of the policyholders, Number of Claims and the smoking
category when the policyholders are female & Number of Claims and the smoking category when the policyholders are male.

RFE are: ‘NorthEast USA’, ‘age’, ‘BMI’, ‘charges’, ‘children’, ‘smoker’ and as most of the models performed their best after RFE. KNN is the only
‘steps’. classifier that performed best with the features selected by Chi-Squared
From Table 9 and Fig. 8, it can be inferred that Logistic Regression, Test. Decision Tree gives the best result with the Tree-Based method. Lo-
Random Forest and Decision Tree classifiers show the best performance. gistic Regression gives the best result without feature selection. Gaussian
Decision Tree is the best among all the classifiers. Except for the Deci- NB and Mixed NB produce the same results in all the cases; hence it can
sion Tree, all the classifiers’ performance has decreased as compared to be concluded that continuous variables are more important than cate-
RFE. The seven most important features selected using the Tree-Based gorical variables for the dataset. The best set of features, therefore, is:
Feature Importance Method are: ‘children’, ‘steps’, ‘BMI’, ‘charges’, ‘age’, ‘NorthEast USA’, ‘age’, ‘BMI’, ‘charges’, ‘children’, ‘smoker’ and ‘steps’.
‘smoker’ and ‘sex’.
By observing the performance of all the eight classifiers with and 4.2. Case study 2: travel insurance
without feature selection it can be concluded that Decision Tree is the
best classifier without feature selection and Random Forest is the best For the case study, the dataset is obtained from Kaggle.com. It con-
classifier with feature selection. RFE is the best feature selection method sists of 62,288 rows and 12 columns: 11 features and 1 target variable.

7
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig 5. Graphical representation of the relationship between Number of Claims by the policyholder and the Charges billed by the health insurance, Number of Claims
and the average walking steps of the policyholder per day & Number of Claims and the number of children of the policyholder.

Table 4 describes the features and target variable for a better under- ative when the claim is rejected and the expenses incurred for doing the
standing of the dataset. investigation is more than the actual policy amount paid by the policy-
Once the data is collected, the next step is data preparation. First, holder. The minimum value of duration is -2 is which not possible under
the data is checked for any missing values. There are 39,575 missing any circumstances and the maximum value is 4881. Even if the unit for
values for Gender i.e. 63.54 %. Hence, the column is dropped from the the duration is considered to be in days then also this value is not possi-
dataset before proceeding for further evaluation. Also, the ID column is ble as travel insurance policies can be applied for a maximum duration
dropped as it has no significance in predicting claim acceptance. Next, of 1–2 years in the case of an Annual Plan. Considering a maximum du-
different statistics of the features are observed. ration of 731 days i.e. one year and one leap year, the values above 731
According to the statistics observed as in Table 5, the maximum age and below 1 are imputed as they constitute only 0.026% of the whole
is 118 which is not a suitable age to travel for anybody, also most of dataset. Values above 731 are imputed as 731 and values below 1 are
the insurance companies do not provide insurance to people above 85 imputed by the median value of duration.
years of age, hence considering 100–118 as an outlier category which By observing Fig. 6, it can be inferred that Net Sales and Acceptance
comprises of 1.44% of the total policyholders, it is replaced by 99 hence % are directly proportional to each other and only 3 agencies have a
keeping 99 as the maximum age of a policyholder. The minimum neg- good amount of sales. Also, Agency ‘C2B’ despite having high net sales
ative value of Net Sales is justified as net sales are calculated as the and acceptance % provide a low commission to the mediator or maybe
difference between the value for which the insurance was sold and the most of their sales are direct and there is no mediator in between.
expenses incurred, or the claim amount paid by the insurance company From Fig. 7, it can be inferred that products having high Commis-
to the policyholder/beneficiary. So the net sale may be negative if the sion Value have high Net Sales as well as high Acceptance %. Apart
claim amount paid or even if the claim amount is not paid it can be neg- from these findings, it is also observed that most of the Travel Agency

8
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 6. Graphical representation of the relationship between Agency and Net Sales of policies & Agency and Acceptance % of claims along with the mean commission.

Table 4 gorical columns. Frequency Encoding is done for Destination, Agency,


Description of Travel Insurance Dataset. and Product Name columns. Dummy Encoding is done for Agency
S.No. Column Heading Description Type and Distribution Channel columns & Label Encoding is performed
for Destination Category. Destination Category is decided based on
1. Agency Name of the insurance agency
Destination Risk. Destination Risk is calculated based on the count
2. Agency Type Type of agency: travel or airlines
3. Distribution Channel Distribution channel of the insurance of claims. If the value is more than 0.3 i.e. more than 30% of the
agency: online or offline policyholders (travellers) have claimed the destination is marked as
4. Product Name Name of the insurance policies (products) ‘High Risk’. Similarly, if the value of risk is between 0.2 and 0.3,
5. Duration Duration of travel of the policyholder
then the destination is marked as ‘Moderate Risk’ and if the value
6. Destination Destination of travel
7. Net Sales The total amount of sales of the insurance
of risk is between 0 and 0.2, then the destination is marked as ‘Low
policies Risk’.
8. Commission (in value) The commission received to the mediator The final set of features selected before proceeding to fea-
(agent) ture selection are ‘Age’, ‘Commission (in value)’, ‘Duration’,
9. Gender Gender of the policyholder
‘Net Sales’, ‘Dest_freq_encoding’, ‘Agency_freq_encoding’, Prod-
10. Age Age of the policyholder
11. ID ID of the policyholder uct_Name_freq_encoding’, ‘Destination Category (labels)’, ‘Agency
12. Claim Claim status of the insurance policy: Type_Travel Agency’, ‘Distribution Channel_Online’ and the target
accepted or denied variable ‘Claim’.
After feature engineering, the next step is to do dimensionality re-
Table 5 duction. In both the datasets, as previously discussed there is no need
Statistics of the features of the Travel Insurance Dataset. for feature extraction. In this section, first of all, both the datasets are
trained and evaluated using the aforementioned eight classifiers without
Statistics Age Commission Net Sales Duration performing feature selection. Then feature selection is performed using
Min 0 0 -389 -2 three different methods, namely filter, wrapper and embedded methods
Max 118 262.76 682 4881 & then the models are evaluated using four performance metrics. The
Mean 39.67 12.83 50.71 60.96
feature selection techniques used under these methods in this analysis
are the Chi-Squared Test, Recursive Feature Elimination using Logistic
Regression Classifier & Tree-Based Feature Importance using ExtraTrees
claims are denied and despite having a low count of claims under Air- Classifier.
lines Agency, around 40% of the claims are accepted. Before moving From Table 10 and Fig. 9, it can be inferred that Random Forest,
onto the next comparison, the age is divided into three groups: Child Decision Tree and KNN show the best performance. Random Forest is
(less than 21 years old), Adult (21–50 years old) and Senior (above 50 best among all the classifiers as all four of its performance metrics have
years old). a higher value than the rest of the classifiers.
The next step after EDA is Feature Engineering. In this stage, label, Now, once again the dataset is trained using the same eight classi-
dummy, and frequency encoding are performed to deal with the cate- fiers but with feature selection techniques to observe the changes in the

Table 6
Classification of Health Insurance dataset without Feature Selection.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.8137255 0.96078 0.95098 0.52941 0.80392 0.73529 0.80392 0.5490196


Recall 0.8556701 0.95146 0.9898 0.7013 0.73874 0.68182 0.73874 0.6021505
F1 Score 0.8341709 0.9561 0.97 0.60335 0.76995 0.70755 0.76995 0.574359
Accuracy 0.8768657 0.96642 0.97761 0.73507 0.81716 0.76866 0.81716 0.6902985

9
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Table 7
Classification of Health Insurance dataset with Chi-Squared Test.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.773913 0.99107 0.97321 0.5625 0.875 0.76786 0.875 0.723214


Recall 0.872549 0.97368 0.9646 0.62376 0.75385 0.66667 0.75385 0.771429
F1 Score 0.820276 0.9823 0.96889 0.59155 0.80992 0.71369 0.80992 0.746544
Accuracy 0.854478 0.98507 0.97388 0.67537 0.82836 0.74254 0.82836 0.794776

Table 8
Classification of Health Insurance dataset with Recursive Feature Elimination.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.7714286 0.99048 0.96639 0.54622 0.80672 0.79832 0.80672 0.5714286


Recall 0.8350515 0.97196 0.95833 0.80247 0.83478 0.73077 0.83478 0.7234043
F1 Score 0.8019802 0.98113 0.96234 0.65 0.82051 0.76305 0.82051 0.6384977
Accuracy 0.8507463 0.98507 0.96642 0.73881 0.84328 0.77985 0.84328 0.7126866

Table 9
Classification of Health Insurance dataset with Tree Based Feature Importance.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.754237 0.9322 0.97458 0.55932 0.80508 0.69492 0.80508 0.542373


Recall 0.89899 0.99099 1 0.74157 0.79832 0.76636 0.79832 0.752941
F1 Score 0.820276 0.9607 0.98712 0.63768 0.80169 0.72889 0.80169 0.630542
Accuracy 0.854478 0.96642 0.98881 0.72015 0.82463 0.77239 0.82463 0.720149

Table 10
Classification of Travel Insurance dataset without Feature Selection.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.91919 0.98727 0.97604 0.97323 0.83818 0.86154 0.83818 0.939743


Recall 0.799163 1 1 0.83515 0.90457 0.8981 0.90457 0.999147
F1 Score 0.854985 0.99359 0.98787 0.89892 0.87011 0.87944 0.87011 0.968535
Accuracy 0.750361 0.98981 0.98082 0.82477 0.79965 0.81088 0.79965 0.951116

Table 11
Classification of Travel Insurance dataset with Chi-Squared Test.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.965773 0.98706 0.97662 0.97461 0.83895 0.86033 0.83895 0.942103


Recall 0.837788 1 1 0.83266 0.90204 0.89452 0.90204 1
F1 Score 0.89724 0.99349 0.98817 0.89806 0.86935 0.87709 0.86935 0.970189
Accuracy 0.823086 0.98965 0.9813 0.82301 0.79828 0.80711 0.79828 0.953684

result as well as evaluate the best feature selection technique for all the maining classifiers has decreased as compared to modelling without fea-
classifiers. ture selection.
From Table 11 and Fig. 9, it can be inferred that Random For- From Table 12 and Fig. 9, it can be inferred that Random Forest,
est, Decision Tree and KNN classifiers show the best performance. Decision Tree and KNN classifiers show the best performance. Random
Random Forest is once again the best among all the classifiers. Forest is the best among all the classifiers used for the dataset. Except
Using Chi-Squared Test, only one feature is discarded leaving the for Decision Tree and KNN, the rest all the classifiers have increased
dataset with nine features: ‘Age’, ‘Commission (in value)’, ‘Dura- performance as compared to the results of the Chi-Squared Test. Also,
tion’, ‘Net Sales’, ‘Dest_freq_encoding’, ‘Agency_freq_encoding’, ‘Prod- the performance of the Bernoulli NB classifier has decreased as com-
uct_Name_freq_encoding’, ‘Destination Category (labels)’ and ‘Agency pared to its performance without any feature selection technique. As
Type_Travel Agency’. The performance of Logistic Regression, Decision Chi-Squared Test is a statistical test and has no involvement of any
Tree and KNN classifiers has increased and the performance of the re- ML model, the number of features considered important by this test

10
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 7. Graphical Representation of the relationship between Product Name and Net Sales along with Mean Commission & between Product Name and Acceptance
% along with mean commission.

i.e. nine is considered as the number of comparing all the feature From Table 13 and Fig. 9, it can be inferred that Random Forest, De-
selection techniques. The nine most important features identified by cision Tree and KNN classifiers show the best performance. Random is
RFE are: ‘Age’, ‘Agency Type_Travel Agency’, ‘Commission (in value)’, once again the best among all the classifiers. The performance of Logistic
‘Dest_freq_encoding’, ‘Destination Category (labels)’, ‘Distribution Chan- Regression, Random Forest, Decision Tree and Bernoulli NB classifiers
nel_Online’, ‘Duration’, ‘Net Sales’ and ‘Product_Name_freq_encoding’. has increased, rest four classifiers’ performance has decreased compared

11
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 8. Graphical Representation of the performance of the aforementioned classifiers without feature selection, with Chi-Squared Test, with RFE (using LogisticRe-
gression Classifier) and with Tree-Based Feature Importance (using ExtraTreesClassifier).

Table 12
Classification of Travel Insurance dataset with Recursive Feature Elimination.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.966322 0.98751 0.97601 0.97401 0.85927 0.85767 0.85927 0.938331


Recall 0.839476 1 1 0.83533 0.88913 0.89825 0.88913 0.999574
F1 Score 0.898444 0.99371 0.98786 0.89936 0.87395 0.87749 0.87395 0.967985
Accuracy 0.825574 0.98997 0.98074 0.82493 0.80093 0.80767 0.80093 0.950153

Table 13
Classification of Travel Insurance dataset with Tree Based Feature Importance.

Metrics Used Classification Algorithms Used

Logistic Random Forest Decision Tree Support Vector Gaussian Naïve Bernoulli Naïve Mixed Naïve K-Nearest
Regression Machine Bayes Bayes Bayes Neighbors

Precision 0.965455 0.98794 0.97659 0.97408 0.83886 0.86277 0.83886 0.936095


Recall 0.844541 1 1 0.83306 0.90348 0.89626 0.90348 0.999569
F1 Score 0.90096 0.99394 0.98816 0.89807 0.86997 0.8792 0.86997 0.966792
Accuracy 0.829347 0.99037 0.9813 0.82333 0.79965 0.81056 0.79965 0.948788

to RFE. The nine best features selected by the Tree-Based Feature Im- the best classifier both with and without feature selection. Chi-Squared
portance method are: ‘Duration’, ‘Age’, ‘Net Sales’, ‘Dest_freq_encoding’, and Tree-Based Feature Importance methods are the best feature se-
‘Commission (in value)’, ‘Agency_freq_encoding’, ‘Destination Cate- lection techniques for the dataset as four models perform their best
gory (labels)’, ‘Product_Name_freq_encoding’ and ‘Agency Type_Travel with features selected from these two. The slight difference between
Agency’. These are the same features as selected by Chi-Squared Test. the results of these two techniques is neglected as both have selected
The performance of the models using both feature selection methods is the same set of features and any ML model trains itself continuously &
also almost the same. both the techniques are not applied parallelly. Gaussian NB and Mixed
By observing the performance of all the eight classifiers with and NB produce the same results in all the cases; hence it can be con-
without feature selection it can be concluded that Random Forest is cluded that continuous variables are more important than categorical

12
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

Fig. 9. Graphical Representation of the performance of the aforementioned classifiers without Feature Selection, with Chi-Squared Test, with RFE (using LogisticRe-
gression Classifier), with Tree-Based Feature Importance (using ExtraTreesClassifier).

variables for the dataset. Bernoulli NB classifier performs best without with RFE and Tree-Based Feature Importance embedded method, re-
any feature selection. Therefore, the best set of features for the dataset spectively. The choice of model used to fit the dataset for RFE is not
is: ‘Duration’, ‘Age’, ‘Net Sales’, ‘Dest_freq_encoding’, ‘Commission (in much of a concern, it will not bring a huge lot of difference in the per-
value)’, ‘Agency_freq_encoding’, ‘Destination Category (labels)’, ‘Prod- formance of the model. ExtraTrees classifier is used for the Tree-Based
uct_Name_freq_encoding’ and ‘Agency Type_Travel Agency’. Feature Importance method as it is an extremely randomized classifier
and is computationally less expensive than other tree-based algorithms.
5. Discussion Unlike the results of the healthcare insurance dataset, all the eight clas-
sifiers used in the travel insurance dataset perform extremely well as
With the help of the experimentation conducted, various character- wells as the Chi-Squared Test identifies nine out of the ten features se-
istics of the client strata have been deduced in correlation with the in- lected for model training as important. This is because the features of
surance claim and claim status. Further, machine learning models have the second dataset are highly engineered already to increase the model’s
been used to train the model to predict the claim status/insurance claim performance. It is also observed that the data is high imbalanced based
with accuracy (Testoni & Boeri, 2015) . Feature selection techniques are on the target variable in the second dataset, 80% of the claims are not
used to reduce the dimensionality of the datasets as well as increase the accepted. This may result in biased predictions even though the model
accuracy of the trained models Guo & Yu (2016). The results have been is properly trained.
further discussed in the following section. Claim Analysis of both the
datasets is performed successfully with the help of eight classification
algorithms. For both datasets, Random Forest is the best classifier with 5.1. Theoretical contributions and implications
suitable feature selection methods. For the first and second dataset, Lo-
gistic Regression and Bernoulli NB classifiers have performed better with It can be deduced that introducing technologies such as Machine
all the features. Also, the KNN classifier has performed better with Chi- Learning into the field of insurance can be very helpful. InsurTech as a
Squared Test in the case of both the datasets despite the Chi-Squared whole can help identify and understand the customer in a much better
Test being a filter method that selects features solely based on their cor- way than the narrow definition defined by the insurance industry re-
relation with each other based on their frequency distribution. Although garding their needs and investing patterns (Doupe et al., 2019; McGlade
wrapper methods like RFE improve the model’s performance by a bet- & Scott-Hayward, 2019). Using claim analysis, customer claiming pat-
ter margin than the filter and embedded methods, it is not always the terns and their demography can be understood which can further help
same, as observed in the analysis. The performance of the feature selec- to improve policies and decide more viable premiums for the customers
tion method also depends on the dataset, the models used for training (Doupe et al., 2019; Karhade et al., 2019). Also, by understanding the
and evaluation of the dataset & the models used for feature selection. insurance company’s acceptance patterns, the policies can be modified
In the analysis, LogisticRegression and ExtraTrees classifiers are used to monitor their profit/loss ratio.

13
S. Rawat, A. Rawat, D. Kumar et al. International Journal of Information Management Data Insights 1 (2021) 100012

5.2. Implications for practice

It is observed that using feature selection techniques is indeed very beneficial before classifying data with classification algorithms (Dave et al., 2021). Not all attributes are equally significant, and feature selection techniques help to choose the best subset of attributes for optimal results (Nian et al., 2016). Feature selection reduces the overfitting of data, increases the accuracy of the algorithm used, and also reduces the computation time (Larson & Sinclair, 2021). For example, in the Travel Insurance case study, using logistic regression with all attributes results in a model with an accuracy of 0.750361, while using logistic regression along with tree-based feature selection results in an accuracy of 0.829347.
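In outline, the comparison reported above can be reproduced along the lines of the sketch below; X and y are placeholder stand-ins for the encoded travel insurance attributes and claim labels, so the printed accuracies will naturally differ from the 0.750361 and 0.829347 obtained in this study's own experiments.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((1000, 20))             # placeholder attribute matrix
y = rng.integers(0, 2, size=1000)      # placeholder claim labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: logistic regression on all attributes.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("all attributes:", base.score(X_te, y_te))

# Tree-based feature selection, then the same classifier on the subset.
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
sel.fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
print("selected subset:", lr.score(sel.transform(X_te), y_te))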
6. Conclusions

Despite the innumerable advantages of InsurTech, only 28% of the big companies opt for partnerships with InsurTech companies, and fewer than 14% actively participate in incubator programs or ventures (PricewaterhouseCoopers, 2016). Owing to these sparse partnerships, it is difficult for new InsurTech companies to survive for long, which brings into question the sustainability of this model for scaling. Also, most of the funds of startups go into the distribution of policies, which makes it difficult for them to invest in the other units of insurance, i.e. underwriting, claims servicing, and maintaining regulatory compliance.

The future scope of this research work is to address the problem of high data imbalance by resampling the dataset, clustering the abundant class, or applying the Synthetic Minority Oversampling Technique (SMOTE), XGBoost, or Adaptive Boosting (AdaBoost) algorithms.
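As a pointer to that future work, the sketch below shows one way the suggested SMOTE-plus-boosting remedy could be wired together. It relies on the third-party imbalanced-learn package, and the data, the 80/20 class ratio, and every parameter setting are assumed purely for illustration.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.random((1000, 15))                  # stand-in claim features
y = (rng.random(1000) < 0.2).astype(int)    # ~80% majority class, as in the data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority class on the training split only, so the
# held-out test set keeps the original imbalanced distribution.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

# One of the boosting algorithms named above, trained on the balanced sample.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_res, y_res)
print("accuracy on imbalanced test set:", clf.score(X_te, y_te))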
Author contributions

Dr Seema Rawat conceived and designed the study; Mr Aakankshu Rawat performed the research; Dr A. Sai Sabitha and Dr Deepak Kumar analyzed the data and contributed editorial input.

Declaration of Competing Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

References

Aswani, R., Ghrera, S. P., Chandra, S., & Kar, A. K. (2020). A hybrid evolutionary approach for identifying spam websites for search engine marketing. Evolutionary Intelligence. 10.1007/s12065-020-00461-1.
Bacry, E., Gaïffas, S., Leroy, F., Morel, M., Nguyen, D. P., Sebiat, Y., & Sun, D. (2020). SCALPEL3: A scalable open-source library for healthcare claims databases. International Journal of Medical Informatics, 141(May). 10.1016/j.ijmedinf.2020.104203.
Barry, L., & Charpentier, A. (2020). Personalization as a promise: Can big data change the practice of insurance? Big Data and Society, 7(1). 10.1177/2053951720935143.
Batra, J., Jain, R., Tikkiwal, V. A., & Chakraborty, A. (2021). A comprehensive study of spam detection in e-mails using bio-inspired optimization techniques. International Journal of Information Management Data Insights, 1(1), Article 100006. 10.1016/j.jjimei.2020.100006.
Blackstone, E. H. (2013). Generating new knowledge in cardiac interventions. Anesthesiology Clinics, 31(2), 217–248. 10.1016/j.anclin.2012.12.006.
Chakraborty, A., & Kar, A. K. (2017). Swarm intelligence: A review of algorithms. Modeling and Optimization in Science and Technologies, 10, 475–494. 10.1007/978-3-319-50920-4_19.
Chowdhury, S., Mayilvahanan, P., & Govindaraj, R. (2020). Optimal feature extraction and classification-oriented medical insurance prediction model: Machine learning integrated with the internet of things. International Journal of Computers and Applications, 0(0), 1–13. 10.1080/1206212X.2020.1733307.
Das, D., Chakraborty, C., & Banerjee, S. (2020). A framework development on big data analytics for terahertz healthcare. In Terahertz biomedical and healthcare technologies. Elsevier Inc. 10.1016/b978-0-12-818556-8.00007-0.
Das, S., Datta, S., Zubaidi, H. A., & Obaid, I. A. (2021). Applying interpretable machine learning to classify tree and utility pole related crash injury types. IATSS Research. 10.1016/j.iatssr.2021.01.001.
Dave, H. S., Patwa, J. R., & Pandit, N. B. (2021). Facilitators and barriers to participation of the private sector health facilities in health insurance & government-led schemes in India. Clinical Epidemiology and Global Health, 10(January), Article 100699. 10.1016/j.cegh.2021.100699.
Doupe, P., Faghmous, J., & Basu, S. (2019). Machine learning for health services researchers. Value in Health, 22(7), 808–815. 10.1016/j.jval.2019.02.012.
Guo, Y., & Yu, S. (2016). A new histogram based shape descriptor in image retrieval. International Journal of Signal Processing, Image Processing and Pattern Recognition, 9(4), 233–246. 10.14257/ijsip.2016.9.4.22.
Gupta, S., Kar, A. K., Baabdullah, A., & Al-Khowaiter, W. A. A. (2018). Big data with cognitive computing: A review for the future. International Journal of Information Management, 42, 78–89. 10.1016/j.ijinfomgt.2018.06.005.
Kakhki, F. D., Freeman, S. A., & Mosher, G. A. (2020). Applied machine learning in agro-manufacturing occupational incidents. Procedia Manufacturing, 48, 24–30. 10.1016/j.promfg.2020.05.016.
Kar, A. K. (2016). Bio inspired computing – A review of algorithms and scope of applications. Expert Systems with Applications, 59, 20–32. 10.1016/j.eswa.2016.04.018.
Karhade, A. V., Ogink, P. T., Thio, Q. C. B. S., Broekman, M. L. D., Cha, T. D., Hershman, S. H., …, & Schwab, J. H. (2019). Machine learning for prediction of sustained opioid prescription after anterior cervical discectomy and fusion. Spine Journal, 19(6), 976–983. 10.1016/j.spinee.2019.01.009.
Kasy, M. (2018). Optimal taxation and insurance using machine learning — Sufficient statistics and beyond. Journal of Public Economics, 167, 205–219. 10.1016/j.jpubeco.2018.09.002.
Kaur, P., Sharma, M., & Mittal, M. (2018). Big data and machine learning based secure healthcare framework. Procedia Computer Science, 132, 1049–1059. 10.1016/j.procs.2018.05.020.
Khan, F. H., Bashir, S., & Qamar, U. (2014). TOM: Twitter opinion mining framework using hybrid classification scheme. Decision Support Systems, 57(January), 245–257.
Knighton, J., Buchanan, B., Guzman, C., Elliott, R., White, E., & Rahm, B. (2020). Predicting flood insurance claims with hydrologic and socioeconomic demographics via machine learning: Exploring the roles of topography, minority populations, and political dissimilarity. Journal of Environmental Management, 272, Article 111051. 10.1016/j.jenvman.2020.111051.
Kose, I., Gokturk, M., & Kilic, K. (2015). An interactive machine-learning-based electronic fraud and abuse detection system in healthcare insurance. Applied Soft Computing Journal, 36, 283–299. 10.1016/j.asoc.2015.07.018.
Kraus, M., Feuerriegel, S., & Oztekin, A. (2020). Deep learning in business analytics and operations research: Models, applications and managerial implications. European Journal of Operational Research, 281(3), 628–641. 10.1016/j.ejor.2019.09.018.
Larson, W. D., & Sinclair, T. M. (2021). Nowcasting unemployment insurance claims in the time of COVID-19. International Journal of Forecasting. 10.1016/j.ijforecast.2021.01.001.
Maehashi, K., & Shintani, M. (2020). Macroeconomic forecasting using factor models and machine learning: An application to Japan. Journal of the Japanese and International Economies, 58(March), Article 101104. 10.1016/j.jjie.2020.101104.
McGlade, D., & Scott-Hayward, S. (2019). ML-based cyber incident detection for electronic medical record (EMR) systems. Smart Health, 12, 3–23. 10.1016/j.smhl.2018.05.001.
Mita, Y., Inose, R., Goto, R., Kusama, Y., Koizumi, R., Yamasaki, D., …, & Muraki, Y. (2021). An alternative index for evaluating AMU and anti-methicillin-resistant Staphylococcus aureus agent use: A study based on the national database of health insurance claims and specific health checkups data of Japan. Journal of Infection and Chemotherapy. 10.1016/j.jiac.2021.02.009.
Nian, K., Zhang, H., Tayal, A., Coleman, T., & Li, Y. (2016). Auto insurance fraud detection using unsupervised spectral ranking for anomaly. The Journal of Finance and Data Science, 2(1), 58–75. 10.1016/j.jfds.2016.03.001.
Ozbayoglu, A. M., Gudelek, M. U., & Sezer, O. B. (2020). Deep learning for financial applications: A survey. Applied Soft Computing Journal, 93, Article 106384. 10.1016/j.asoc.2020.106384.
Pal, D., Mandana, K. M., Pal, S., Sarkar, D., & Chakraborty, C. (2012). Fuzzy expert system approach for coronary artery disease screening using clinical parameters. Knowledge-Based Systems, 36, 162–174. 10.1016/j.knosys.2012.06.013.
Palanisamy, V., & Thirunavukarasu, R. (2019). Implications of big data analytics in developing healthcare frameworks – A review. Journal of King Saud University - Computer and Information Sciences, 31(4), 415–425. 10.1016/j.jksuci.2017.12.007.
Pappas, I. O., & Woodside, A. G. (2021). Fuzzy-set qualitative comparative analysis (fsQCA): Guidelines for research practice in Information Systems and marketing. International Journal of Information Management, 58, Article 102310. 10.1016/j.ijinfomgt.2021.102310.
Petropoulos, A., Siakoulis, V., Stavroulakis, E., & Vlachogiannakis, N. E. (2020). Predicting bank insolvencies using machine learning techniques. International Journal of Forecasting, 36(3), 1092–1113. 10.1016/j.ijforecast.2019.11.005.
Pourhabibi, T., Ong, K. L., Kam, B. H., & Boo, Y. L. (2020). Fraud detection: A systematic literature review of graph-based anomaly detection approaches. Decision Support Systems, 133(April), Article 113303. 10.1016/j.dss.2020.113303.
Pramanik, M. I., Lau, R. Y. K., Azad, M. A. K., Hossain, M. S., Chowdhury, M. K. H., & Karmaker, B. K. (2020). Healthcare informatics and analytics in big data. Expert Systems with Applications, 152, Article 113388. 10.1016/j.eswa.2020.113388.
Richter, A. N., & Khoshgoftaar, T. M. (2018). A review of statistical and machine learning methods for modeling cancer risk using structured clinical data. Artificial Intelligence in Medicine, 90, 1–14. 10.1016/j.artmed.2018.06.002.
Ringshausen, F. C., Ewen, R., Multmeier, J., Monga, B., Obradovic, M., van der Laan, R., & Diel, R. (2021). Predictive modeling of nontuberculous mycobacterial pulmonary disease epidemiology using German health claims data. International Journal of Infectious Diseases, 104, 398–406. 10.1016/j.ijid.2021.01.003.
Saggi, M. K., & Jain, S. (2018). A survey towards an integration of big data analytics to big insights for value-creation. Information Processing and Management, 54(5), 758–790. 10.1016/j.ipm.2018.01.010.
Sengupta, S., Basak, S., Saikia, P., Paul, S., Tsalavoutis, V., Atiah, F., …, & Peters, A. (2020). A review of deep learning with special emphasis on architectures, applications and recent trends. Knowledge-Based Systems, 194, Article 105596. 10.1016/j.knosys.2020.105596.
Testoni, C., & Boeri, A. (2015). Smart governance: Urban regeneration and integration policies in Europe. Turin and Malmö case studies. City and Community, 26(27), 28.
Waring, J., Lindvall, C., & Umeton, R. (2020). Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artificial Intelligence in Medicine, 104, Article 101822. 10.1016/j.artmed.2020.101822.
Yang, C., Yang, Z., Wang, J., Wang, H.-Y., Su, Z., Chen, R., & Zhou, Z. (2021). Estimation of prevalence of kidney disease treated with dialysis in China: A study of insurance claims data. American Journal of Kidney Diseases. 10.1053/j.ajkd.2020.11.021.