Prediction of Medical Costs Using Regression Algorithms: A. Lakshmanarao, Chandra Sekhar Koppireddy, G. Vijay Kumar
1 Assistant Professor, Department of CSE, Raghu Engineering College, Dakamarri, Visakhapatnam
Email: [email protected]
2 Assistant Professor, Department of CSE, Pragati Engineering College, Surampalem, Andhra Pradesh, India
Email: [email protected]
3 Assistant Professor, Department of CSE, Pragati Engineering College, Surampalem, Andhra Pradesh, India
Email: [email protected]
Abstract
Health care costs are increasing day by day. As a growing number of new viruses infect
people, there is a need to predict health charges. This type of prediction helps
governments make decisions on health issues. People are also aware of the importance of
health care costs. Machine learning is a field that has an impact on every other field,
and the health care system also uses machine learning models for several health-related
applications. In this paper, we perform predictive analysis on medical health insurance
charges. We build a model to predict the medical insurance cost of a person based on
attributes such as gender. We collected the dataset from Kaggle; it contains 1338 rows of
data with the features age, gender, smoker, BMI, children, region and insurance charges.
The data contains medical information and costs billed by health insurance companies. We
applied various regression algorithms to this dataset to predict medical costs. For
implementation, we used the Python programming language.
1. Introduction
According to the World Bank, total expenditure on health care as a proportion of GDP in
2015 was 3.89%. Of this 3.89%, government health expenditure as a proportion of GDP was
just 1%, and out-of-pocket expenditure as a proportion of current health expenditure
was 65.06% in 2015. Over the last few decades, advances in medical technology have made
it possible to cure illnesses that were once considered serious. However, the cost of
this treatment is so high that it is practically impossible for a middle-class individual
to afford it. According to statistics, a Rs 5 lakh family floater policy covering self,
spouse and one child costs anywhere between Rs 10,000 and Rs 17,000 per year, whereas a
Rs 5 lakh individual health plan costs an individual Rs 4,000-7,000 per year.
2. Literature Survey
Machine learning is a technology in which machines learn from previous data and make
predictions for new samples. Machine learning models are applicable in all fields, and
the medical field is no exception: it has been using ML models in different situations
for the last several years. Many researchers have applied machine learning techniques to
medical cost prediction. B. Nithya et al. [1] applied machine learning models to
predictive analytics in health care. They applied various supervised and unsupervised
models for predictive analysis. They also suggested that machine learning tools and
techniques are decisive in the health care domain and are used especially in the
diagnosis and prediction of various types of cancer. Anuja Tike et al. [2] applied
hierarchical decision trees in a medical price prediction system. Their experiments
showed that the price prediction system achieves high accuracy. Moran et al. [3] used
linear regression techniques to predict Intensive Care Unit (ICU) costs, using patient
demographics, DRG (Diagnosis-Related Group), length of stay in the hospital and a few
others as features. Gregori et al. [4] applied various regression models for analyzing
medical costs in the health care system. They mainly concentrated on reducing the bias
in the cost estimates to achieve good results. Dimitris Bertsimas et al. [5] applied
different data mining techniques, which provided accurate predictions of medical costs
and represent a powerful tool for the prediction of health-care costs.
3. Proposed Method
The dataset used for the experiments was collected from the Kaggle [6] machine learning
repository. This dataset was inspired by the book Machine Learning with R by Brett
Lantz. The data contains medical information and costs billed by health insurance
companies. It contains 1338 rows of data and the following columns: age, gender, BMI,
children, smoker, region and insurance charges. Among these features, insurance charges
is the dependent variable and the remaining features are the independent variables. In
regression analysis, we predict the value of the dependent variable from the independent
variables. First, we collected the dataset and applied various data preprocessing
methods. Data preprocessing includes the removal of missing values from the data; many
machine learning algorithms cannot be applied while such missing values are present.
After removing missing values, we apply label encoding and one-hot encoding to the
categorical features. Categorical features are features whose values are labels instead
of numbers. After that, we apply standardization or normalization techniques to the
data. This step is used when the attribute values are not all on the same scale.
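The preprocessing steps described above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the code used in the experiments: the helper names (drop_missing, label_encode, one_hot_encode, standardize), the choice of which columns receive which encoding, and the toy rows are all hypothetical, and the rows are made up rather than taken from the Kaggle dataset.

```python
from statistics import mean, pstdev


def drop_missing(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def label_encode(rows, column):
    """Replace each label in `column` by an integer code (sorted label order)."""
    codes = {label: i for i, label in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows]


def one_hot_encode(rows, column):
    """Replace `column` by one 0/1 indicator column per distinct label."""
    labels = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for label in labels:
            new_row[f"{column}_{label}"] = 1 if r[column] == label else 0
        encoded.append(new_row)
    return encoded


def standardize(rows, column):
    """Rescale `column` to zero mean and unit (population) standard deviation."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**r, column: (r[column] - mu) / sigma} for r in rows]


# Made-up toy rows (assumption), not taken from the Kaggle dataset.
rows = [
    {"age": 20, "gender": "female", "bmi": 22.0, "smoker": "no",
     "region": "north", "charges": 1000.0},
    {"age": 40, "gender": "male", "bmi": None, "smoker": "yes",
     "region": "south", "charges": 9000.0},
    {"age": 60, "gender": "male", "bmi": 30.0, "smoker": "yes",
     "region": "south", "charges": 12000.0},
]

clean = drop_missing(rows)                # the row with a missing BMI is removed
clean = label_encode(clean, "gender")     # two-valued labels become 0/1 codes
clean = label_encode(clean, "smoker")
clean = one_hot_encode(clean, "region")   # multi-valued label becomes indicators
for col in ("age", "bmi"):
    clean = standardize(clean, col)       # put numeric features on one scale
```

Label encoding is used here for the two-valued columns and one-hot encoding for region, so that no artificial ordering is imposed on the region labels; this split is a design choice of the sketch.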
[Figure: workflow of the proposed method, starting with collection of the dataset from Kaggle]
Support Vector Regression is a variant of the Support Vector Machine, which can be used
for both regression and classification. In Support Vector Regression, a hyperplane is
fitted to predict the value of the dependent variable. A margin of tolerance is defined
around this hyperplane, and points that fall within the margin incur no penalty. In
regression, this hyperplane is used to predict a continuous value.
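The role of the margin of tolerance can be illustrated with the epsilon-insensitive loss that is commonly used in Support Vector Regression. The sketch below is an assumption-laden illustration, not the implementation used in the paper: the weights, bias, epsilon and sample values are hypothetical, and no training is performed.

```python
def predict(weights, bias, x):
    """Value of the hyperplane w.x + b at the feature vector x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the margin of tolerance, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical hyperplane parameters and tolerance (assumptions).
weights, bias, epsilon = [2.0, -1.0], 0.5, 1.0
x = [3.0, 1.0]

y_hat = predict(weights, bias, x)                        # 2*3 - 1*1 + 0.5 = 5.5
inside = epsilon_insensitive_loss(6.0, y_hat, epsilon)   # error 0.5 is within the margin
outside = epsilon_insensitive_loss(8.0, y_hat, epsilon)  # error 2.5 exceeds the margin by 1.5
```

Errors smaller than epsilon are ignored entirely, which is what distinguishes this loss from the squared error used in ordinary linear regression.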
Decision Tree is one of the most widely used regression models. It is a tree-structured
machine learning model. In this model, mean squared error (MSE) is used at each step to
choose the best split, starting with the root node. This is applied recursively to build
the tree. It breaks the dataset down into subsets while incrementally developing the
decision tree. The final tree contains decision nodes and leaf nodes. Decision nodes
have two or more child nodes, denoting values of the attribute tested. Leaf nodes
represent a decision on the numerical target. Decision trees are capable of dealing with
both numerical and categorical data.
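A single MSE-based split, the step that is repeated recursively when building a regression tree, might look like the sketch below. It assumes one numeric feature and candidate thresholds at the midpoints between consecutive distinct values; the function names and toy values are hypothetical and serve only as an illustration.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split x <= threshold."""
    best = (None, float("inf"))
    points = sorted(set(xs))
    # Candidate thresholds lie halfway between consecutive distinct x values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        # Weight each side's MSE by the share of samples it receives.
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy values (assumption): one feature and a numeric target.
xs = [20, 25, 50, 55]
ys = [100.0, 120.0, 900.0, 940.0]
threshold, score = best_split(xs, ys)
```

In a full tree, the same search would be run over every feature, and then repeated on the left and right subsets until a stopping rule is met; each leaf would predict the mean target of its subset.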
Random Forest is a combination of more than one model, which is also called an ensemble
approach. In the ensemble technique, we combine the predictions from more than one
decision tree to predict the value of the dependent variable. It can be treated as a
bagging method, in which the average of the individual tree predictions is used as the
final prediction.
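The bagging idea can be sketched with very small trees. The example below is a simplified illustration under several assumptions: each tree is a depth-1 regression tree (a stump) on a single feature, the per-split random feature selection of a real random forest is omitted, and the helper names and toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature; return a predict function."""
    def sse(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # every x in this bootstrap sample is identical
        constant = sum(ys) / len(ys)
        return lambda x: constant
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def fit_forest(xs, ys, n_trees, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def forest_predict(trees, x):
    """Final prediction: the plain average of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


# Made-up toy values (assumption).
xs = [20, 25, 30, 50, 55, 60]
ys = [100.0, 120.0, 110.0, 900.0, 940.0, 920.0]
trees = fit_forest(xs, ys, n_trees=25)
prediction = forest_predict(trees, 22)
```

Because every tree sees a slightly different sample, their individual errors partly cancel when averaged, which is the usual motivation for bagging.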
R-squared is the most widely used measure of the performance of a regression model.
After applying the four models, random forest performed well on this dataset.
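For reference, R-squared compares the squared error of the model with the squared error of always predicting the mean. The sketch below uses the standard definition; the function name and the toy values are hypothetical and are not results from the experiments.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # error of the mean
    return 1.0 - ss_res / ss_tot


# Made-up toy values (assumption), not experimental results.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
score = r_squared(y_true, y_pred)   # ss_res = 0.5, ss_tot = 5.0
```

A value of 1 means perfect predictions, and a value of 0 means the model is no better than predicting the mean of the target.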
MAE (mean absolute error) is another measure of the performance of a machine learning
model. After applying the four machine learning algorithms, random forest performed
well on this dataset.
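MAE is the average absolute difference between the actual and predicted values, so lower values are better. A minimal sketch follows; the function name and the toy values are hypothetical and are not results from the experiments.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy values (assumption), not experimental results.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
mae = mean_absolute_error(actual, predicted)   # (10 + 10 + 30) / 3
```

Unlike R-squared, MAE is expressed in the same unit as the target, which here would be the unit of the insurance charges.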
5. Conclusion
In this paper, we proposed a machine learning model for predicting medical costs. We
applied four regression techniques: Multiple Linear Regression, Support Vector
Regression, Decision Tree Regression and Random Forest Regression. We also applied
MLR with the backward elimination technique and observed that age and BMI are the
features that determine the dependent variable. Of all the experiments, the Random
Forest model gave the best result.
References
1) B. Nithya and V. Ilango, “Predictive Analytics in Health Care Using Machine
Learning Tools and Techniques,” International Conference on Intelligent Computing
and Control Systems (ICICCS), 2017, 978-1-5386-2745-7/17/$31.00 ©2017 IEEE.
2) A. Tike and S. Tavarageri, “A Medical Price Prediction System using Hierarchical
Decision Trees,” in IEEE Big Data Conference, 2017,
978-1-5386-2715-0/17/$31.00 ©2017 IEEE.
3) Lahiri and N. Agarwal, “Predicting healthcare expenditure increase for an
individual from medicare data,” in Proceedings of the ACM SIGKDD Workshop on
Health Informatics, 2014.
4) Gregori, M. Petrinco, S. Bo, A. Desideri, F. Merletti, and E. Pagano, “Regression
models for analyzing costs and their determinants in health care: an introductory
review,” International Journal for Quality in Health Care, vol. 23, no. 3, pp. 331–341,
2011.
5) Bertsimas, M. V. Bjarnadóttir, M. A. Kane, J. C. Kryder, R. Pandey, S. Vempala, and
G. Wang, “Algorithmic prediction of health-care costs,” Operations Research, vol. 56,
no. 6, pp. 1382–1392, 2008.
6) https://www.kaggle.com/mirichoi0218/insurance