Predictive and Analytics using Data Mining and Machine Learning for
Customer Churn Prediction
Chandra Lukita 1,*, Lalu Darmawan Bakti 2, Umi Rusilowati 3, Asep Sutarman 4, Untung Rahardja 5
1Information Systems Study Program Management Science Expertise, Catur Insan Cendekia University, Indonesia
2Software Engineering, Faculty of Information and Communication Technology, Mataram University of Technology, Indonesia
3Faculty of Economics and Business, University of Pamulang, Jakarta, Indonesia
4Faculty of Economics and Business, University of Muhammadiyah Prof. Dr. HAMKA, Indonesia
5Digital Business Study Program, University of Raharja, Indonesia
(Received: October 15, 2023; Revised: November 18, 2023; Accepted: November 18, 2023; Available online: December 8, 2023)
Abstract
This research aims to predict and analyze customer churn using Data Mining and Machine Learning methods. The background of this research
is based on the importance of understanding the factors that influence customer decisions to churn, as well as improving the effectiveness of
customer retention strategies in a business context. The method used in this research involves a bank customer dataset that includes
information about customers who left in the past month, services registered by customers, customer account information, and demographic information
about customers. The factors most influential to churn were identified through heatmap analysis, including MonthlyCharges, PaperlessBilling,
SeniorCitizen, PaymentMethod, MultipleLines, and PhoneService. This research compares the performance of several machine learning
algorithms, including Random Forest, Logistic Regression, Adaboost, and Extreme Gradient Boosting (XGBoost), to predict customer churn.
Accuracy and confusion matrix results are used to evaluate the performance of these algorithms. The results show that XGBoost is
the best of the compared algorithms at predicting customer churn, with high accuracy. The factors were identified correctly and show
a significant influence on customer churn decisions. The novelty and uniqueness of this research lies in focusing on the factors
that have the most influence on customer churn and comparing the performance of machine learning algorithms. This research provides more
specific and relevant insights for companies in developing effective customer retention strategies. However, this research has some limitations.
One of them is the use of a dataset limited to a customer bank, so the generalizability of the findings of this research may be limited to that
business context. In addition, other factors that are not the focus of this research may also contribute to the prediction of customer churn.
1. Introduction
In the era of rapidly developing technology and increasing competition, organizations in various industries realize the
importance of retaining existing customers [1][2][3]. Customer churn, which refers to the phenomenon of customers
terminating a relationship with a business, has become a critical challenge that can significantly impact the profitability
and long-term success of an enterprise. To address the adverse effects of customer churn, organizations are increasingly
relying on predictive and analytical techniques supported by data mining and machine learning [3][4][5]. The objective
of this research is to explore the potential of data mining and machine learning techniques in predicting customer churn.
By leveraging the vast amount of customer data available to organizations, these techniques can uncover valuable
insights and patterns that enable proactive interventions and targeted retention strategies. The ability to accurately
identify customers who are likely to churn gives businesses the opportunity to implement timely and personalized
actions, such as customized retention offers or proactive customer service, to prevent churn and strengthen customer
loyalty.
The field of data mining offers a variety of algorithms and methodologies designed to extract valuable knowledge from
large and complex data sets [3][6]. By applying data mining techniques to customer data, businesses can uncover
hidden patterns, correlations, and trends that are indicators of churn behavior [7]. On the other hand, machine learning
algorithms enable the development of predictive models that can learn from historical customer data and make accurate
predictions about future churn events. This research aims to investigate the effectiveness of various data mining and
machine learning techniques in predicting customer churn. By analyzing a comprehensive dataset that includes
customer attributes, historical transactions, and behavioral patterns, this research seeks to identify the most influential
factors in contributing to churn and evaluate the performance of different predictive models. The findings from this
study will provide valuable insights for organizations looking to improve their customer retention strategies and
optimize their business outcomes.
This research addresses several gaps in the literature. First, although customer churn prediction has been a
widely researched topic, there is still a need to dig deeper into the potential of data mining and machine learning
techniques in the face of increasingly complex customer churn challenges. This research aims to fill that void by
examining the effectiveness of various techniques in predicting customer churn. Secondly, this research focuses on the
use of data mining and machine learning for customer churn analysis and prediction. Meanwhile, many previous studies
have leaned more towards statistical approaches or simple regression analysis. By utilizing the capabilities of more
complex data mining and machine learning algorithms, this research seeks to provide a deeper understanding of the
factors that influence customer churn and develop more accurate predictive models.
In addition, this research also seeks to fill the gap in the application of data mining and machine learning techniques in
the context of customer churn in different business environments. Each industry or organization may have unique
characteristics that affect customer churn behavior. Therefore, this research will explore the use of data mining and
machine learning techniques in various industry sectors, such as telecommunications, banking, retail, and so on, to
evaluate the diversity of factors that contribute to customer churn.
By filling these gaps, this research is expected to make a significant contribution to the field of customer churn
prediction and improve our understanding of how data mining and machine learning techniques can be effectively
applied to optimize customer retention efforts. The results of this research can provide practical guidance for
organizations in dealing with customer churn challenges and formulating more effective strategies in retaining
customers.
In summary, this research aims to connect customer churn prediction, data mining, and machine learning. By harnessing
the power of advanced analytics and predictive modeling, organizations can gain a competitive advantage in retaining
their valuable customers and building long-term relationships. Through empirical investigation, this research aims to
contribute to existing knowledge and provide practical insights for businesses aiming to reduce customer churn and
optimize their customer relationship management strategies.
2. Literature Review
Another factor that can affect customer churn is changes in customer needs or preferences. Customer needs and
preferences can change over time, and if companies are not able to adjust to these changes, customers may look for
solutions that better suit their current needs [19][20][21]. To address this, companies should proactively gather
customer feedback, conduct market research, and stay up-to-date with industry trends. Lastly, strong competition can
also be a significant factor in customer churn. If competitors offer better or more attractive products or services,
customers are likely to switch to them. To overcome this competition, companies must build a competitive advantage,
identify the advantages of their products or services, and consistently communicate them to customers.
In order to overcome customer churn, it is important for companies to implement effective customer retention
strategies. This can include efforts to increase customer satisfaction, increase engagement through loyalty programs,
improve customer service, keep up with changing customer needs, and maintain a competitive advantage [22][23]. In
addition, customer data analysis can also help in identifying potential churn patterns, so that companies can take
appropriate preventive measures to retain customers. In conclusion, customer churn is a significant problem for
companies and can have a negative impact on growth and profitability. Factors such as lack of customer satisfaction,
lack of engagement, poor customer service, changing customer needs, and strong competition can influence a
customer's decision to switch. By understanding these factors and implementing effective customer retention strategies,
companies can reduce churn rates and maintain long-term relationships with customers.
2.2. Machine Learning
Machine learning is a field in artificial intelligence that focuses on developing computer algorithms that allow systems
to learn and improve their performance from data without being explicitly programmed [24][25][26]. Machine learning
theory involves the use of mathematics, statistics, and data processing to build models that can recognize patterns and
make predictions based on previous experience. The main capability of machine learning is its ability to make
predictions. Using existing data, machine learning systems can analyze and find patterns hidden in it [27][28]. With
the trained model, the system can make predictions or inferences about new data that has never been seen before. This
ability makes machine learning very useful in various applications, such as stock price prediction, medical diagnosis,
facial recognition, sentiment analysis, and many more. Machine learning has several types of models used for
prediction, including:
1) Regression: Regression models are used to predict continuous values based on input variables. An example is
predicting house prices based on size and location.
2) Classification: Classification models are used to predict the class or label of data. For example, predicting
whether an email is spam or not based on its content.
3) Clustering: Clustering models are used to group data based on their similar characteristics. This helps in
identifying hidden patterns or segmentation of data.
4) Decision Tree: A decision tree model is a tree structure used to make decisions based on a set of questions or
rules. It allows for more complex predictions by considering multiple input variables.
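Two of the model types listed above can be sketched in a few lines of plain Python. This is only an illustration: the house sizes and prices, the tree thresholds, and the `tenure_months` field are hypothetical values chosen for the example and are not taken from the dataset used in this research.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one input: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Regression: predict a continuous value (hypothetical toy data).
sizes = [50, 70, 90, 110]        # house size
prices = [100, 140, 180, 220]    # house price, arbitrary units
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 80 + intercept

# Decision tree: each internal node asks one question about one variable.
# Thresholds and the "tenure_months" field are assumptions for illustration.
tree = {
    "feature": "MonthlyCharges", "threshold": 70.0,
    "left": {"label": "stay"},
    "right": {
        "feature": "tenure_months", "threshold": 12,
        "left": {"label": "churn"},
        "right": {"label": "stay"},
    },
}


def predict_tree(node, row):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]
```

For instance, `predict_tree(tree, {"MonthlyCharges": 90, "tenure_months": 3})` follows the right branch and then the left branch of the hand-written tree.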
Machine learning can often make predictions with high accuracy because it learns from large and complex data.
By noticing patterns and trends in the data, machine learning models can adjust and improve their predictions over
time. However, it is important to note that the performance of machine learning models is highly dependent on the
quality and representativeness of the data used for training. In making predictions, machine learning can also highlight
the factors that contribute most to the prediction results, although such associations do not by themselves establish
causation. This allows for a deeper understanding of the factors related to the target variable, which can be used for
better decision-making. In conclusion, machine learning
theory involves the use of algorithms and mathematical techniques to learn patterns from data and make predictions.
The ability of machine learning to make predictions is useful in various applications and allows the use of large and
complex data to make better decisions.
2.3. Past Related Study
Johnson's research [23] is an important contribution to the field of prediction and analysis using machine learning,
especially in the context of the telecommunications industry. The objective of this research is to use machine learning
algorithms to analyze and predict customer churn, which is a major challenge faced by companies in the
Journal of Applied Data Sciences ISSN 2723-6471
Vol. 4, No. 4, December 2023, pp. 454-465 457
telecommunications industry. By understanding the factors that influence churn and the ability to predict customers at
risk of churn, companies can take appropriate precautions to retain customers. This research uses several machine
learning algorithms, including Random Forest, Support Vector Machines, and Naive Bayes, to analyze customer data
covering various features such as call duration, data usage, and customer satisfaction. The results show that machine
learning algorithms can provide churn prediction with a high degree of accuracy. In the context of the
telecommunications industry, having the ability to predict churn with high accuracy is crucial in planning effective
marketing and customer retention strategies.
The uniqueness of this research lies in the use of various machine learning algorithms that are compared to see the
performance and accuracy of churn prediction. By using this approach, this research provides a deeper understanding
of which algorithms are most effective in predicting churn in the context of the telecommunications industry. In
addition, this research also provides valuable insights into the factors that contribute to customer churn, which can help
telecommunication companies develop effective prevention strategies. However, this research also has some
weaknesses that need to be noted. One possible drawback is the need for considerable computational resources,
especially if used on large and complex datasets. The process of training models with machine learning algorithms can
take a long time and require significant computing power. In addition, interpretation of the model results generated by
such machine learning algorithms can be difficult, due to the high complexity of the models.
Overall, this research makes a valuable contribution to the development of prediction and analysis using machine
learning in the telecommunications industry. In facing the challenge of customer churn, machine learning algorithms
can provide accurate prediction capabilities, thus helping companies make better decisions based on existing data. By
continuing to conduct research and development in this field, it is hoped that more effective and efficient solutions can
be found to face various complex business challenges.
First, Random Forest is an ensemble learning method that combines a number of randomized decision trees. Each tree
in a Random Forest makes its prediction independently, and the results are combined (by majority vote in classification)
to produce the final prediction. Random Forest can handle various types of data and is relatively robust to overfitting.
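The voting idea can be sketched as follows. This is a deliberately reduced illustration, not the configuration used in this research: each "tree" is a one-split stump trained on a bootstrap sample and one randomly chosen feature, whereas practical implementations grow deeper trees and sample features at every split. The data values are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature: (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback for samples with no usable split: always predict the majority.
    best = (feature, float("inf"), majority, majority)
    best_errors = sum(1 for y in labels if y != majority)
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= thr]
        right = [y for r, y in zip(rows, labels) if r[feature] > thr]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best_errors:
            best, best_errors = (feature, thr, l_lab, r_lab), errors
    return best


def predict_stump(stump, row):
    feature, thr, l_lab, r_lab = stump
    return l_lab if row[feature] <= thr else r_lab


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(n_features)          # random feature choice
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Each tree votes independently; the majority class wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical data: two numeric features, label 1 = churn.
rows = [[1, 10], [2, 20], [3, 30], [4, 40], [7, 70], [8, 80], [9, 90], [10, 100]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
predictions = [predict_forest(forest, r) for r in ([2, 20], [9, 90])]
```

Using an odd number of trees avoids tied votes in the two-class case.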
Secondly, Logistic Regression is one of the most commonly used classification algorithms. It models the relationship
between one or more independent variables and a binary dependent variable. Logistic Regression generates predicted
probabilities and performs classification based on a specified threshold value.
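A minimal sketch of the probability-then-threshold behaviour, assuming plain batch gradient descent on the log loss and a single standardized feature with made-up values (the learning rate, epoch count, and data are illustrative choices, not settings reported in this research):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(z) - y          # derivative of log loss w.r.t. z
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def churn_probability(model, x):
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def classify(model, x, threshold=0.5):
    """Turn the predicted probability into a class label."""
    return int(churn_probability(model, x) >= threshold)


# Hypothetical standardized feature (e.g. monthly charges), label 1 = churn.
rows = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]
model = train_logistic(rows, labels)
```

Lowering the threshold flags more customers as likely churners at the cost of more false alarms; raising it does the opposite.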
Third, AdaBoost (Adaptive Boosting) is a boosting algorithm that combines a number of "weak learners" (e.g.,
simple decision trees) to form a stronger model. AdaBoost assigns a weight to each data sample and adaptively
increases the weights of samples that earlier learners misclassified, so that later learners focus on them.
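The reweighting step can be sketched with the standard discrete AdaBoost update, using one-split stumps on a single numeric feature. Labels are coded as +1 and -1, and the ten data points are a hypothetical toy set chosen so that no single stump can classify them all.

```python
import math


def stump_output(x, thr, polarity):
    """A stump predicts `polarity` below the threshold and the opposite above."""
    return polarity if x < thr else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) with the lowest weighted error."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_output(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best


def train_adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, thr, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, thr, polarity))
        # Misclassified samples (y * output == -1) gain weight, others lose it.
        weights = [w * math.exp(-alpha * y * stump_output(x, thr, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, x):
    score = sum(alpha * stump_output(x, thr, polarity)
                for alpha, thr, polarity in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = train_adaboost(xs, ys, rounds=3)
fitted = [predict_adaboost(ensemble, x) for x in xs]
```

On this toy set each stump alone misclassifies at least three points, while the weighted vote of three stumps reproduces every training label.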
Finally, Extreme Gradient Boosting (XGBoost) is an ensemble learning method based on gradient boosting. It builds
trees sequentially, with each new tree correcting the prediction errors left by the trees built before it. This
makes XGBoost a highly effective algorithm in dealing with churn prediction
challenges.
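The error-correcting iteration can be illustrated with a bare gradient boosting loop. This sketch is not XGBoost itself: it uses squared error and one-split regression stumps, and it omits the regularization and second-order information that the XGBoost library adds. The data and the learning rate are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares one-split tree: (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    best, best_sse = None, float("inf")
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best_sse:
            best, best_sse = (thr, lm, rm), sse
    return best


def train_boosting(xs, ys, rounds=10, lr=0.5):
    base = sum(ys) / len(ys)                 # initial prediction: the mean
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        # Residuals are the negative gradient of the squared error.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_regression_stump(xs, residuals)
        stumps.append((thr, lm, rm))
        preds = [p + lr * (lm if x <= thr else rm) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, lr, history


def predict_boosting(model, x):
    base, stumps, lr, _ = model
    return base + sum(lr * (lm if x <= thr else rm) for thr, lm, rm in stumps)


# Hypothetical data: one numeric feature, target 1 = churn.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = train_boosting(xs, ys)
mse_per_round = model[3]
```

`mse_per_round` records the training error after each round, which makes the step-by-step correction visible.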
In this research, all four algorithms will be used to predict churn in business. Each algorithm has its own uniqueness
and advantages. Through performance comparison and comprehensive analysis, this research will identify the most
effective algorithm in churn prediction in a specific business context. This will provide valuable insights for
organizations to develop more effective customer retention strategies and significantly reduce churn.
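Since the comparison relies on accuracy and the confusion matrix, a small sketch of both may help. The label vectors below are hypothetical and do not reproduce any result from this research.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(counts)
```

Accuracy alone can look good when churners are rare, which is why the individual cells of the confusion matrix (especially missed churners, `fn`) are worth inspecting alongside it.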
By using three different datasets for each technique, this research makes it possible to evaluate the performance and
effectiveness of each technique in different contexts. Each technique has a unique approach in selecting the most
informative features, and by using datasets that match the principles and criteria of each technique, it is expected to
find better and optimized prediction results for each technique used.
accurate prediction models to identify customers at high risk of churn. Figure 2 below is a heatmap of the feature factors
from the dataset.
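A heatmap of this kind is commonly built from pairwise correlation coefficients. The sketch below assumes Pearson correlation, which the text does not state explicitly, and uses made-up values for two of the named features; it computes the numbers that such a heatmap would color.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values; only the column names come from the text.
columns = {
    "MonthlyCharges": [20.0, 30.0, 70.0, 90.0],
    "SeniorCitizen": [0, 1, 0, 1],
    "Churn": [0, 0, 1, 1],
}
matrix = correlation_matrix(columns)

# Rank features by the strength of their correlation with the target.
ranked = sorted((name for name in columns if name != "Churn"),
                key=lambda name: abs(matrix[name]["Churn"]), reverse=True)
```

Ranking by absolute correlation is one simple way to read "most influential" off the heatmap, though it only captures linear, pairwise relationships.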
1) Identify Potential Churn Customers: By utilizing the developed churn prediction model, companies can
identify customers who have the potential to churn. This allows companies to take preventive action or
implement appropriate retention strategies, such as offering special loyalty programs, discounts, or other
special offers.
2) Personalize the Customer Experience: By understanding the factors that influence churn, companies can craft
more personalized and relevant customer experiences. Information on influencing factors can be used to tailor
services, offers, and communications to individual customers, thereby increasing customer satisfaction and
reducing the likelihood of churn.
3) Marketing Strategy Improvement: This research also provides valuable insights in designing effective
marketing strategies. By knowing the factors that have the most influence on churn, companies can devise
more careful and relevant marketing campaigns, and improve targeting to prevent customer churn.
4) Data-Driven Decision Making: This research underscores the importance of using data mining and machine
learning in customer churn analysis. By utilizing these techniques, companies can make more informed and
fact-based decisions to manage customer churn more effectively.
In addition, this study makes a new contribution in the business context by identifying more specific factors for
predicting customer churn. It can therefore provide useful information for business practitioners and researchers
interested in managing customer churn and improving customer retention more effectively. The results of this study
offer strong implementation potential and contribute to improved customer churn management, more targeted
marketing strategies, and more data-driven business decisions.
5. Conclusion
This research successfully applies Data Mining and Machine Learning methods to predict and analyze customer churn.
In this research, the Extreme Gradient Boosting (XGBoost) algorithm proved to be the best of the compared algorithms, with high accuracy
in predicting churn. Factors that affect churn, such as MonthlyCharges, PaperlessBilling, SeniorCitizen,
PaymentMethod, MultipleLines, and PhoneService, were identified correctly and contributed significantly to customer
churn decisions.
The novelty of this research lies in the focus on the most influential factors in customer churn, as well as the
performance comparison of machine learning algorithms. This research provides more specific and relevant insights in
the management of customer churn in the context of the business under study.
The discussion of the implementation of the results of this research shows strong potential for application in the
business world. By using the model developed based on the XGBoost algorithm and the identified factors, companies
can improve customer retention strategies, personalization of customer experience, and data-driven decision-making.
Identification of potential churn customers, personalization of services, and improvement of marketing strategies are
practical implications that can be implemented from the results of this study.
However, this study has some limitations. One limitation is the use of a dataset limited to bank customers, which may
affect the generalizability of the findings to different business contexts. In addition, other factors that are not the focus
of this study may also contribute to the prediction of customer churn.
Further research could expand the scope of the dataset and consider other factors that could potentially
affect customer churn. It could also deepen the analysis and extend the performance comparison
to other machine learning algorithms. The application of other ensemble learning techniques or of methods
such as Deep Learning could also be the focus of further research.
Overall, this research makes a valuable contribution to managing customer churn and improving customer retention in
a business context. By understanding the factors that influence churn and applying appropriate machine learning
algorithms, companies can optimize their strategies to retain customers and strengthen their competitive advantage.
6. Declarations
References
[1] M. Tarokh and M. EsmaeiliGookeh, "A New Model to Speculate CLV Based on Markov Chain Model," J. Ind. Eng. Manag.
Stud., vol. 4, no. 2, pp. 85-102, 2017, doi: 10.22116/jiems.2017.54609.
[2] U. Salunkhe, B. Rajan, and V. Kumar, "Understanding firm survival in a global crisis," Int. Mark. Rev., vol. 1, no. 1, Jan.
2021, doi: 10.1108/IMR-05-2021-0175.
[3] B. Durkaya Kurtcan and T. Ozcan, "Predicting customer churn using grey wolf optimization-based support vector machine
with principal component analysis," J. Forecast., vol. 1, no. 1, pp. 1-7, 2023, doi: 10.1002/for.2960.
[4] V. Morozov, O. Mezentseva, A. Kolomiiets, and M. Proskurin, "Predicting Customer Churn Using Machine Learning in IT
Startups," Lecture Notes on Data Engineering and Communications Technologies, vol. 77, no. 1, pp. 645-664, 2022, doi:
10.1007/978-3-030-82014-5_45.
[5] M. Mujiya Ulkhaq, A. T. Wibowo, M. R. Tribosnia, R. Putawara, and A. B. Firdauz, "Predicting Customer Churn: A
Comparison of Eight Machine Learning Techniques: A Case Study in an Indonesian Telecommunication Company," in 2021
International Conference on Data Analytics for Business and Industry, ICDABI 2021, vol. 1, no. 1, pp. 42-46, 2021, doi:
10.1109/ICDABI53623.2021.9655790.
[6] R. Priyadarshi, A. Panigrahi, S. Routroy, and G. K. Garg, "Demand forecasting at retail stage for selected vegetables: a
performance analysis," J. Model. Manag., vol. 14, no. 4, pp. 1042-1063, Jan. 2019, doi: 10.1108/JM2-11-2018-0192.
[7] R. Manivannan, R. Saminathan, and S. Saravanan, "An improved analytical approach for customer churn prediction using
Grey Wolf Optimization approach based on stochastic customer profiling over a retail shopping analysis. CUPGO," Evol.
Intell., vol. 14, no. 2, pp. 479-488, 2021, doi: 10.1007/s12065-019-00282-x.
[8] A. Zaky, S. Ouf, and M. Roushdy, "Predicting Banking Customer Churn based on Artificial Neural Network," in 5th
International Conference on Computing and Informatics, ICCI 2022, vol. 1, no. 1, pp. 132-139, 2022. doi:
10.1109/ICCI54321.2022.9756072.
[9] W. Park and H. Ahn, "Not All Churn Customers Are the Same: Investigating the Effect of Customer Churn Heterogeneity
on Customer Value in the Financial Sector," Sustain., vol. 14, no. 19, pp. 1-13, 2022, doi: 10.3390/su141912328.
[10] Y. Yamato, "Server Structure Proposal and Automatic Verification Technology on IAAS Cloud of Plural Type Servers,"
IJIIS Int. J. Informatics Inf. Syst., vol. 1, no. 2, pp. 97-106, 2018, doi: 10.47738/ijiis.v1i2.104.
[11] M. F. Kokasih and A. S. Paramita, "Property Rental Price Prediction Using the Extreme Gradient Boosting Algorithm," IJIIS
Int. J. Informatics Inf. Syst., vol. 3, no. 2, pp. 54-59, 2020, doi: 10.47738/ijiis.v3i2.65.
[12] A. R. Lubis, S. Prayudani, Julham, O. Nugroho, Y. Y. Lase, and M. Lubis, "Comparison of Models in Predicting Customer
Churn Based on Users' habits on E-Commerce," in 2022 5th International Seminar on Research of Information Technology
and Intelligent Systems, ISRITI 2022, vol. 1, no. 1, pp. 300-305, 2022. doi: 10.1109/ISRITI56927.2022.10052834.
[13] L. Sook Ling, N. Mustafa, and S. F. Abdul Razak, "Customer churn prediction for telecommunication industry: A Malaysian
Case Study," F1000Research, vol. 10, no. 1, pp. 1-13, 2021, doi: 10.12688/f1000research.73597.1.
[14] Q. Tang, G. Xia, and X. Zhang, "A hybrid classification model for churn prediction based on customer clustering," J. Intell.
Fuzzy Syst., vol. 39, no. 1, pp. 69-80, 2020, doi: 10.3233/JIFS-190677.
[15] O. M. Mirza et al., "Optimal Deep Canonically Correlated Autoencoder-Enabled Prediction Model for Customer Churn
Prediction," Comput. Mater. Contin., vol. 73, no. 2, pp. 3757-3769, 2022, doi: 10.32604/cmc.2022.030428.
[16] A. Widiyanto, N. A. Prabowo, M. Ircham, N. Amarullah, and A. Soni, "The Effect of E-Learning as One of the Information
Technology-Based Learning Media on Student Learning Motivation," IJIIS Int. J. Informatics Inf. Syst., vol. 4, no. 2, pp.
123-129, 2021.
[17] W.-J. Su, "The Effects of Safety Management Systems, Attitude and Commitment on Safety Behaviors and Performance,"
Int. J. Appl. Inf. Manag., vol. 1, no. 4, pp. 187-199, 2021, doi: 10.47738/ijaim.v1i4.20.
[18] H.-T. Le, "Knowledge Management in Vietnamese Small and Medium Enterprises: Review of Literature," Int. J. Appl. Inf.
Manag., vol. 1, no. 2, pp. 50-59, 2021, doi: 10.47738/ijaim.v1i2.12.
[19] P. Jeyaprakaash and K. Sashirekha, "Accuracy Measure of Customer Churn Prediction in Telecom Industry using Adaboost
over Decision Tree Algorithm," J. Pharm. Negat. Results, vol. 13, no. 1, pp. 1495-1503, 2022, doi:
10.47750/pnr.2022.13.S04.179.
[20] H. K. Thakkar, A. Desai, S. Ghosh, P. Singh, and G. Sharma, "Clairvoyant: AdaBoost with Cost-Enabled Cost-Sensitive
Classifier for Customer Churn Prediction," Comput. Intell. Neurosci., vol. 2022, no. 1, pp. 1-7, 2022, doi:
10.1155/2022/9028580.
[21] R. A. de Lima Lemos, T. C. Silva, and B. M. Tabak, "Propensity to customer churn in a financial institution: a machine
learning approach," Neural Comput. Appl., vol. 34, no. 14, pp. 11751-11768, 2022, doi: 10.1007/s00521-022-07067-x.
[22] N. Tomasevic, N. Gvozdenovic, and S. Vranes, "An overview and comparison of supervised data mining techniques for
student exam performance prediction," Comput. Educ., vol. 143, pp. 103676-103684, 2020, doi:
10.1016/j.compedu.2019.103676.
[23] J. Prayitno, B. Saputra, and R. P. Bernarte, "The Naive Bayes Algorithm in Predicting the Spread of the Omicron Variant of
Covid-19 in Indonesia: Implementation and Analysis," IJIIS Int. J. Informatics Inf. Syst., vol. 5, no. 2, pp. 84-91, 2022.
[24] M.-H. Tayarani N., "Applications of artificial intelligence in battling against covid-19: A literature review," Chaos, Solitons
& Fractals, vol. 142, no. 1, pp. 110338-110345, 2021, doi: 10.1016/j.chaos.2020.110338.
[25] H. N. Do, W. Shih, and Q. A. Ha, "Effects of mobile augmented reality apps on impulse buying behavior: An investigation
in the tourism field," Heliyon, vol. 6, no. 8, pp. 1-12, 2020, doi: 10.1016/j.heliyon.2020.e04667.
[26] A. Efendi, D. Purwana, and A. D. Buchdadi, "Human Capital Management of Government Internal Supervisory at the
Ministry of Defense of the Republic of Indonesia," Int. J. Appl. Inf. Manag., vol. 2, no. 2, pp. 81-89, 2021, doi:
10.47738/ijaim.v2i2.30.
[27] S. Hidayat, M. Matsuoka, S. Baja, and D. A. Rampisela, "Object-based image analysis for sago palm classification: The most
important features from high-resolution satellite imagery," Remote Sens., vol. 10, no. 8, pp. 1-12, 2018, doi:
10.3390/RS10081319.
[28] C. Ricciardi et al., "Application of data mining in a cohort of Italian subjects undergoing myocardial perfusion imaging at an
academic medical center," Comput. Methods Programs Biomed., vol. 189, pp. 105343-105350, 2020, doi:
10.1016/j.cmpb.2020.105343.