Journal of Information and Computational Science ISSN: 1548-7741

Prediction of medical costs using regression algorithms


A. Lakshmanarao, Chandra Sekhar Koppireddy, G.Vijay Kumar

1
Assistant Professor, Department of CSE, Raghu Engineering College, Dakamarri,
Visakhapatnam
Email: [email protected]

2
Assistant Professor, Department of CSE, Pragati Engineering College, Surampalem,
Andhra Pradesh, India.
Email: [email protected]

3
Assistant Professor, Department of CSE, Pragati Engineering College, Surampalem,
Andhra Pradesh, India
Email: [email protected]

Abstract
Health care costs are increasing day by day. As a growing number of new viruses infect
people, there is a need to predict health charges. This type of prediction helps
governments make decisions regarding health issues, and it helps people understand the
importance of health care costs. Machine learning is a field that has an impact on every
other field, and health care systems also use machine learning models for several
health-related applications. In this paper, we perform predictive analysis on medical
health insurance charges. We build a model to predict the medical insurance cost of a
person based on attributes such as age, gender, BMI and smoking status. We collected the
dataset from Kaggle; it contains 1338 rows of data with the features age, gender, smoker,
BMI, children, region and insurance charges. The data contains medical information and
costs billed by health insurance companies. We applied various regression algorithms to
this dataset to predict medical costs. For implementation, we used the Python programming
language.

Keywords: Medical insurance costs, Kaggle, Machine Learning

1. Introduction

According to the World Bank, total expenditure on health care as a proportion of GDP in
2015 was 3.89%. Out of this 3.89%, government health expenditure as a proportion of GDP
was just 1%, and out-of-pocket expenditure as a proportion of current health expenditure
was 65.06% in 2015. Over the last few decades, advances in medical technology have made
it possible to cure diseases that were once considered serious. However, the cost of
treatment is so high that it is practically impossible for a middle-class individual to
afford it. According to statistics, a Rs 5 lakh family floater policy covering self,
spouse and one child costs anywhere between Rs 10,000 and Rs 17,000 per year, whereas a
Rs 5 lakh individual health plan costs an individual Rs 4,000-7,000 per year.

Volume 10 Issue 5 - 2020 751 www.joics.org

2. Literature Survey

Machine learning is a technology in which machines learn from previous data and make
predictions for new samples. Machine learning models are applicable in all fields, and
the medical field is no exception: it has used ML models in different situations for
several years. Many researchers have applied machine learning techniques to medical cost
prediction. B. Nithya et al. [1] applied machine learning models to predictive analytics
in health care. They applied various supervised and unsupervised models for predictive
analysis. They also suggested that machine learning tools and techniques are decisive in
the health care domain and are exclusively used in the diagnosis and prediction of
various types of cancer. Anuja Tike et al. [2] applied hierarchical decision trees to a
medical price prediction system. Their experiments showed that the price prediction
system achieves high accuracy. Moran et al. [3] used linear regression techniques to
predict Intensive Care Unit (ICU) costs, using patient demographics, DRG (Diagnosis
Related Group), length of stay in the hospital and a few others as features. Gregori et
al. [4] applied various regression models to analyze medical costs in the health care
system. They mainly concentrated on reducing the bias in the cost estimates to achieve
good results. Dimitris Bertsimas et al. [5] applied different data mining techniques that
provided accurate predictions of medical costs and represent a powerful tool for the
prediction of health-care costs.

3. Proposed Method

The dataset used for the experiments was collected from the Kaggle [6] machine learning
repository. This dataset was inspired by the book Machine Learning with R by Brett
Lantz. The data contains medical information and costs billed by health insurance
companies. It contains 1338 rows of data and the following columns: age, gender, BMI,
children, smoker, region and insurance charges. Among these features, insurance charges
is the dependent variable and the remaining features are the independent variables. In
regression analysis, we predict the value of the dependent variable from the independent
variables. First, we collected the dataset and applied various data preprocessing
methods. Data preprocessing includes removing missing values from the data, because many
machine learning algorithms cannot be applied while values are missing. After the removal
of missing values, we apply label encoding or one-hot encoding to the categorical
features. Categorical features are features whose values are labels rather than numbers.
After that, we apply standardization or normalization to the data. This step is used when
the attribute values are not all on the same scale.
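
The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the rows are made-up values in the shape of the insurance dataset, and the helper names (drop_missing, label_encode, one_hot, standardize) are hypothetical, not functions from the paper. In practice, libraries such as pandas (get_dummies) and scikit-learn (StandardScaler) provide equivalent operations.

```python
from statistics import mean, pstdev

# Hypothetical rows in the shape of the insurance dataset (values are made up).
rows = [
    {"age": 19, "bmi": 27.9, "smoker": "yes", "region": "southwest"},
    {"age": 45, "bmi": 33.0, "smoker": "no", "region": "southeast"},
    {"age": 62, "bmi": 22.7, "smoker": "no", "region": "northwest"},
    {"age": 30, "bmi": None, "smoker": "yes", "region": "southeast"},
]

def drop_missing(records):
    """Remove every record that has at least one missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]

def label_encode(records, column, positive):
    """Label-encode a two-valued column as 1 (positive label) or 0."""
    return [1 if r[column] == positive else 0 for r in records]

def one_hot(records, column):
    """One-hot encode a column; returns (sorted categories, list of 0/1 rows)."""
    categories = sorted({r[column] for r in records})
    return categories, [[int(r[column] == c) for c in categories] for r in records]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

clean = drop_missing(rows)
smoker = label_encode(clean, "smoker", positive="yes")
regions, region_onehot = one_hot(clean, "region")
age_scaled = standardize([r["age"] for r in clean])
```

Label encoding suits two-valued features such as smoker, while one-hot encoding avoids imposing an artificial order on multi-valued features such as region.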

Figure 1: Proposed model. The flowchart shows the following steps: collection of the
dataset from the Kaggle store; splitting of the data into a training set and a testing
set (and applying ML methods on the training set); applying the Multiple Linear, Support
Vector, Decision Tree and Random Forest regressors; applying each model on the testing
data; comparing and selecting the best model.

We applied the following four regression models to the dataset:

i) Multiple Linear Regression
ii) Support Vector Regression
iii) Decision Tree Regression
iv) Random Forest Regression
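
A minimal sketch of how the four models listed above could be fitted and compared with scikit-learn is given here. It rests on several assumptions: the data is synthetic (generated in the code, not the Kaggle file), the 80/20 split and the default hyperparameters are illustrative choices, and none of these settings are reported in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance data (assumption: not the Kaggle file).
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 3000, n)

# Hold out 20% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Multiple Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(random_state=0),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
```

The scores obtained on synthetic data say nothing about the results on the real dataset; the sketch only shows the train, test and compare workflow.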

3.1 Multiple Linear Regression:

Multiple linear regression (MLR) is a basic machine learning regression model in which
there is one dependent variable and multiple independent variables. The value of the
dependent variable is calculated from the independent variables. In this dataset, the
dependent variable is medical charges and the independent variables are age, gender,
smoker, BMI, children and region.
Multiple linear regression uses the ordinary least squares (OLS) method to find the
best-fitting line, which involves multiple independent variables.
The formula for multiple linear regression is as follows:
Y = b0 + b1X1 + … + bkXk + α

Here, Y is the dependent variable, Xi are the independent variables, b0 is the
y-intercept (constant term), bi is the slope coefficient of the independent variable Xi,
and α is the model error term.
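
As a small worked example of the formula, the coefficients b0, b1 and b2 can be estimated by ordinary least squares with NumPy. The data below is hypothetical and generated without an error term, so the fit recovers the coefficients exactly; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data generated from known coefficients: Y = 2 + 3*X1 - 1*X2.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient is the intercept b0.
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||design @ b - y||^2.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coef

# Predict for a new sample with X1 = 6 and X2 = 2.
prediction = b0 + b1 * 6.0 + b2 * 2.0
```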

3.2 Support Vector Regression:

Support Vector Regression (SVR) is the regression variant of the Support Vector Machine,
a model that is used for both regression and classification. In Support Vector
Regression, a hyperplane is fitted to predict the value of the dependent variable, and a
margin of tolerance is defined around it within which errors are ignored. In regression,
this hyperplane is used to predict a continuous value.
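
The margin of tolerance can be illustrated with the epsilon-insensitive loss that is commonly used in SVR. The paper does not state which formulation was used, so this is an assumption; the function name, the sample values and the tolerance of 500 are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample SVR loss: zero inside the tolerance margin, linear outside."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Hypothetical charges and predictions, with a tolerance of 500 currency units.
actual = [4000.0, 12000.0, 30000.0]
predicted = [4300.0, 11000.0, 30500.0]
losses = epsilon_insensitive_loss(actual, predicted, epsilon=500.0)
```

Only the second prediction, which misses by 1000, falls outside the margin and contributes to the loss.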

3.3. Decision Tree Regression:

The decision tree is one of the most widely used regression models. It is a
tree-structured machine learning model. In this model, the mean squared error (MSE) is
used at each step to choose the split, starting with the root node, and this is applied
recursively to build the tree. It breaks the dataset down into smaller subsets while
incrementally developing the decision tree. The final tree contains decision nodes and
leaf nodes. A decision node has two or more child nodes, denoting values of the attribute
tested. A leaf node represents a decision on the numerical target. Decision trees are
capable of handling both numerical and categorical data.
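
The MSE-based choice of a split can be sketched for a single numerical feature as follows. This is a simplified illustration with made-up values, not the implementation used in the experiments; best_split is a hypothetical helper that evaluates every midpoint between consecutive feature values and keeps the one with the lowest weighted MSE.

```python
def mse(values):
    """Mean squared error of predicting the mean of `values` for every sample."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted MSE) of the best split 'x <= threshold'."""
    best = (None, float("inf"))
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: charges jump once BMI exceeds 30.
bmi = [22.0, 25.0, 28.0, 31.0, 34.0, 37.0]
charges = [3000.0, 3200.0, 3100.0, 9000.0, 9300.0, 9100.0]
threshold, score = best_split(bmi, charges)
```

A full regression tree repeats this search on each resulting subset until a stopping rule is met, and each leaf predicts the mean target of its subset.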

3.4. Random Forest Regression:

Random forest is a combination of more than one model, which is also called an ensemble
approach. In the ensemble technique, we combine the predictions from more than one
decision tree to predict the value of the dependent variable. It can be treated as a
bagging method, where the average of the individual trees' predictions is used as the
final prediction.
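
The bagging idea can be sketched with very small trees. The code below is a simplified illustration under stated assumptions: each "tree" is a one-split stump, the data is made up, and the random feature selection that a full random forest also performs is omitted. The helper names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split tree: predict the mean of y on each side of the median x."""
    cut = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x < cut] or ys
    right = [y for x, y in zip(xs, ys) if x >= cut] or ys
    left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: left_mean if x < cut else right_mean

def fit_bagged(xs, ys, n_trees, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_bagged(trees, x):
    """Final prediction is the plain average of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical data: charges rise linearly with age.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
charges = [2000.0 + 150.0 * a for a in ages]
forest = fit_bagged(ages, charges, n_trees=25)
young, old = predict_bagged(forest, 22), predict_bagged(forest, 62)
```

Because every stump sees a different bootstrap sample, the split points differ between trees, and averaging them gives a smoother prediction than any single stump.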

4. Experimentation and Results

We conducted all experiments in Python. Python has a library named "scikit-learn", which
provides a vast number of functions and classes for machine learning models. The results
of the experiments are tabulated. The regression analysis is based on the following
measures.
MAE (Mean Absolute Error): the average of the absolute differences between the original
and predicted values over the dataset.
MSE (Mean Squared Error): the average of the squared differences between the original
and predicted values over the dataset.
RMSE (Root Mean Squared Error): the square root of the mean squared error.

R-squared (Coefficient of determination): R-squared measures how well the predicted
values fit the original values. Its value usually lies between 0 and 1, and the best
possible score is 1.0, although it can be negative for a model that fits worse than
always predicting the mean. The higher the value, the better the model.
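
The four measures can be computed directly from their definitions. The following sketch uses hypothetical true and predicted charges; regression_metrics is an illustrative helper, not a function from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical true and predicted charges.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]
metrics = regression_metrics(actual, predicted)
```

For these values the MAE is 150, the MSE is 25000, the RMSE is about 158.11 and R-squared is 0.98.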
Using the above four measures, we compared the different models; the results are
tabulated below:

ML model                     R-Squared   Mean Absolute Error   Mean Squared Error   Root Mean Squared Error
Multiple Linear Regression   0.78        4008                  33571665             5794
Support Vector Regression    0.26        5678                  1734584581           13175
Decision Tree Regression     0.68        3401                  50404643             7099
Random Forest Regression     0.85        2760                  23294452             4826

Table 1: Results of regression models

Comparison of models based on R-squared values:

R-squared is the most widely used measure. After applying the four models, random forest
performed best on this dataset, with an R-squared of 0.85.

Figure 2: R-squared values of models

Comparison of models based on MAE:

MAE is another measure of the performance of a machine learning model. After applying
the four machine learning algorithms, random forest again performs best on this dataset,
with the lowest MAE (2760).

Figure 3: MAE values comparison

So, after applying the four algorithms, Random Forest Regression gives the best results.

We also applied Multiple Linear Regression with the backward elimination technique to
find the most predominant variables for determining medical insurance charges. In the
backward elimination method, we initially start with all independent variables. After
that, we remove variables with high p-values until we find the best independent
variables.

Steps for implementing MLR with backward elimination:

Step 1: Start with all independent variables.
Step 2: Identify the variable with the highest p-value and remove it.
Step 3: Identify the next variable with the highest p-value and remove it.
Step 4: Repeat step 3 until one or two variables remain.
Step 5: The remaining features are the valuable features for regression analysis.
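
The steps above can be sketched with NumPy as follows. Several parts are assumptions rather than the authors' procedure: the data is synthetic, the p-values use a normal approximation to the t distribution (reasonable only for large samples), and the loop stops when every remaining p-value is at or below a 0.05 threshold instead of at a fixed number of variables. The function names and the threshold and min_features parameters are hypothetical.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS with an intercept; return approximate p-values per column of X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    dof = len(y) - design.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    z = coef / std_err
    # Normal approximation to the t distribution (assumption: large sample).
    p = [math.erfc(abs(v) / math.sqrt(2.0)) for v in z]
    return p[1:]  # skip the intercept

def backward_elimination(X, y, names, threshold=0.05, min_features=1):
    """Repeatedly drop the feature with the highest p-value above `threshold`."""
    keep = list(range(X.shape[1]))
    while len(keep) > min_features:
        p = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= threshold:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic data (assumption): the target depends on "age" and "bmi", not "noise".
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(size=500)
selected = backward_elimination(X, y, ["age", "bmi", "noise"])
```

Refitting the model after every removal matters, because the p-values of the remaining variables change once a variable is dropped.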

After step 1, we obtained the following values.

            coef         std err    t         P>|t|    [0.025       0.975]
------------------------------------------------------------------------------
const       -102.5428    472.699    -0.217    0.828    -1029.859    824.773
x1          13.0485      286.203    0.046     0.964    -548.409     574.506
x2          -115.5913    292.189    -0.396    0.692    -688.793     457.610
x3          -1.196e+04   294.329    -40.645   0.000    -1.25e+04    -1.14e+04
x4          1.186e+04    331.935    35.731    0.000    1.12e+04     1.25e+04
x5          257.7350     11.904     21.651    0.000    234.383      281.087
x6          322.3642     27.419     11.757    0.000    268.576      376.153
x7          474.4111     137.856    3.441     0.001    203.973

5. Conclusion
In this paper, we proposed a machine learning model for predicting medical costs. We
applied four regression techniques: Multiple Linear Regression, Support Vector
Regression, Decision Tree Regression and Random Forest Regression. We also applied MLR
with the backward elimination technique and observed that age and BMI are the features
that determine the dependent variable. Of all the experiments, the Random Forest model
gave the best result.

References
1) B. Nithya and V. Ilango, "Predictive Analytics in Health Care Using Machine Learning
Tools and Techniques", International Conference on Intelligent Computing and Control
Systems (ICICCS 2017), 978-1-5386-2745-7/17/$31.00 ©2017 IEEE.
2) A. Tike and S. Tavarageri, "A Medical Price Prediction System using Hierarchical
Decision Trees", IEEE Big Data Conference 2017, 978-1-5386-2715-0/17/$31.00 ©2017 IEEE.
3) Lahiri and N. Agarwal, "Predicting healthcare expenditure increase for an individual
from medicare data", in Proceedings of the ACM SIGKDD Workshop on Health Informatics,
2014.
4) Gregori, M. Petrinco, S. Bo, A. Desideri, F. Merletti, and E. Pagano, "Regression
models for analyzing costs and their determinants in health care: an introductory
review", International Journal for Quality in Health Care, vol. 23, no. 3, pp. 331–341,
2011.
5) Bertsimas, M. V. Bjarnadóttir, M. A. Kane, J. C. Kryder, R. Pandey, S. Vempala, and
G. Wang, "Algorithmic prediction of health-care costs", Operations Research, vol. 56,
no. 6, pp. 1382–1392, 2008.
6) https://www.kaggle.com/mirichoi0218/insurance
