2023 International Conference for Advancement in Technology (ICONAT)

Goa, India. Jan 24-26, 2023

Evaluating the Predictive Ability of the LightGBM Classifier for Assessing Customer Satisfaction in the Airline Industry
2023 International Conference for Advancement in Technology (ICONAT) | 978-1-6654-7517-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICONAT57137.2023.10080120

Pankaj Kunekar, Department of Information Technology, Vishwakarma Institute of Technology, Pune
Mihir Deshpande, Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune
Adwait Gharpure, Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune
Vedant Gokhale, Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune
Aayush Gore, Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune
Harsh Yadav, Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune

Abstract— The aviation industry is heavily influenced by customer reviews and satisfaction. This study aims to identify the major factors that impact passenger satisfaction in the aviation industry and to provide insights that can help airlines improve their services, gain a competitive advantage, and achieve business success. To this end, we employed several classification algorithms, including Logistic Regression, SVM, Naive Bayes, LightGBM, AdaBoost, and XGBoost. Our results indicate that the LightGBM classifier produced the highest accuracy. Our findings suggest that the top five factors affecting passenger satisfaction are: (1) Inflight Wi-Fi service, (2) Age, (3) Flight distance, (4) Customer type, and (5) Type of travel. These findings can be used by airlines to prioritize their efforts and resources in order to enhance customer satisfaction and improve their business performance.

Keywords— Aviation, Airline industry, Customer satisfaction, Classification, Data mining, Inflight services, LightGBM

I. INTRODUCTION

The aviation industry is booming after the Covid restrictions were eased. Measuring customer satisfaction is a key way that businesses such as airlines evaluate their performance [1]. Customer loyalty and passenger satisfaction are increasingly acknowledged as key factors in determining business performance and as tactical instruments for attaining a competitive edge. In order to improve consumer satisfaction and ultimately boost revenues and profits, airline firms invest a significant amount of money in providing high-quality services.

As delivering high-quality service becomes increasingly crucial for the survival and competitiveness of airlines, the measurement of customer satisfaction in the airline industry is becoming more frequent and relevant [2]. Airline carriers now need to provide elevated services because doing so sustains consumer loyalty. Customers who are unsatisfied or disconnected naturally lead to fewer passengers on the plane and lower net profits. It is crucial that customers have the best experience every time they fly. A few of the factors that contribute to a better travel experience are on-time flights, decent in-flight entertainment, refreshments, and greater legroom.

In this study, we analyse a dataset of more than 100,000 samples, perform EDA, and run different machine learning classification techniques to predict whether or not the passenger was satisfied with the experience the airline provided. By giving a list of criteria in decreasing order of effect, we also intend to help airlines understand which component has the greatest impact on consumer satisfaction.

II. LITERATURE SURVEY

Eren Sezgen et al. [3] used a text mining approach to analyze online reviews and determine the factors impacting passenger satisfaction. Their results show that, depending on the airline business model and service class, the factors that determine consumer satisfaction and dissatisfaction differ slightly.

Hayadi et al. [4] used the Random Forest algorithm in their study. Their results show Inflight Wi-Fi service to be an important factor in customer satisfaction.

We have used LightGBM, a boosting algorithm, to improve the accuracy.

R. Archana and Dr. M. V. Subha [5] found that various factors can affect customer satisfaction with Indian Airlines. Their research showed that the impact of different dimensions of airline service on passenger satisfaction and the image of the airline was significant and positive.

Rahim Hussain et al. [6] examined the relationship between service quality, service provider reputation, client expectations, perceived value, client happiness, and brand loyalty in a Dubai-based airline, using the SERVQUAL framework. Since only one airline was included in the data collection, the generalizability of the conclusions is called into question.

Hariguna Taqwa et al. [7] proposed a new method combining K-means and Naive Bayes classifier algorithms to distinguish between positive and negative classes in product review comments. The results showed that the accuracy value using K-means and naive Bayes classifier without manual data achieved a higher accuracy value of

77.12%, compared to the accuracy value of 56.86% when using K-means, Naive Bayes classifier, and manual data.

Jin-Woo Park et al. [8] studied air passengers' decision-making processes. Specifically, they tested a conceptual model that considers service expectation, service perception, service value, passenger satisfaction, airline image, and behavioral intentions all at once. The results from the data analysis showed that service value, passenger satisfaction, and airline image each have a direct effect on air passengers' decision-making processes.

III. METHODOLOGY

Fig. 1. Methodology

This study consists of three stages, each carried out in order to achieve the objectives specified below:
• EDA: Analyze the data, remove the unnecessary feature columns, perform data transformation, and visualize the data using different plots.
• Data Modeling: Apply different algorithms such as Logistic Regression, SVM, Naïve Bayes, and LightGBM.
• Model Selection: Select the best model based on the accuracy score and interpret the different features affecting passenger satisfaction.

The dataset contains a total of 104,000 samples and 28 feature columns. Fourteen of the attributes are passengers' ratings of their flying experience on a scale of one to five. Our target variable is satisfaction. Some of the columns are not necessary, so we drop them. We make the following changes to make the data more suitable for building models:
• Drop the ID column from the dataset.
• Replace Satisfied with 1 and Dissatisfied or Neutral with 0.
• Remove the columns: Gate location, Arrival delay in minutes, Departure/Arrival time convenient.
The cleaned data is obtained after performing these changes.

A. EDA
Exploratory data analysis allows us to visualize and better understand large datasets through the use of graphical representations such as heat maps, pie charts, and scatter plots. By visualizing data, we can quickly and easily identify patterns, trends, and relationships that would be difficult to discern from looking at raw data alone.

Python libraries like Pandas, NumPy, Matplotlib, and Seaborn make it easy to create powerful visualizations and gain valuable insights from data. Through the use of these tools, we can effectively analyse and interpret data, leading to more informed decision making and better outcomes.

B. Data Modeling and Model Selection
In this study, models were developed using machine learning methods to understand the factors behind passenger satisfaction. Logistic Regression, Linear SVM, Naïve Bayes, SGD, Random Forest, AdaBoost, Gradient Boosting, LightGBM, and XGBoost models have been compared. The model with the highest accuracy is selected for further analysis.

Numerical features have been rescaled using MinMaxScaler. Note that MinMaxScaler maps each feature to a fixed range ([0, 1] by default). Standardization in the strict sense, which rescales the values so that their mean equals zero and their standard deviation equals one, is a different transformation.

• Logistic Regression: Logistic regression is used when the variable to be classified is binary (zero or one), for example where "good" equals 1 and "insufficient" equals 0. The model estimates the probability of the outcome as a function of the independent variables.
• Naïve Bayes: This algorithm is based on Bayes' theorem and is widely used for text classification. It is a type of supervised learning algorithm.
• SGD: In each iteration of Stochastic Gradient Descent, a small number of randomly selected samples is used rather than the entire data set.
• Decision Tree: Decision trees are a type of supervised machine learning algorithm that partitions the training data into subsets based on a specific feature or parameter, where the input and corresponding output are defined. This process is repeated recursively until a satisfactory level of granularity is reached, resulting in a tree-like model that can be used to make predictions on unseen data.
• Random Forest: It is made up of several decision trees and helps to identify the elements that have an impact on the outcome variable.

Gradient Boosting Methods (GBDT models):
Boosting is an ensemble method that attempts to build an accurate classifier from a number of imperfect (weak) classifiers.

XGBoost:
XGBoost is an ensemble learning method [13]. Ensemble learning provides a systematic approach to combining the predictive power of several learners. An

improved variant of the gradient boosting method is called XGBoost (Extreme Gradient Boosting). In comparison to other gradient boosting approaches, XGBoost is reported to be about ten times faster and to have strong predictive performance.
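The boosting idea described above, building a strong model by adding weak learners one at a time, can be illustrated with a minimal sketch. This is not the paper's model and not XGBoost: it assumes a squared-error loss on hypothetical one-dimensional toy data, uses depth-1 trees (stumps) as weak learners, and all function names are illustrative. Under squared error the negative gradient is simply the residual, so each round fits a stump to the current residuals.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: pick the threshold with the lowest squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for d in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= d]
        right = [r for x, r in zip(xs, residuals) if x > d]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, d, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    d, left_value, right_value = stump
    return left_value if x <= d else right_value


def fit_boosted(xs, ys, n_rounds=10, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data, not taken from the airline dataset.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 1, 1, 0, 0, 1]
model = fit_boosted(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict_boosted(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

A single stump cannot fit this data, but the sum of several stumps fits it progressively better, which is the sense in which weak learners are combined into a stronger one.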
XGBoost selects its splits by calculating the variance gain [13]. For instance, let O be the training set on a fixed node of the decision tree (DT). The variance gain of splitting feature j at a point d for this node is given in (1) [13]:

$$V_{j|O}(d) = \frac{1}{n_O}\left(\frac{\left(\sum_{\{x_i \in O : x_{ij} \le d\}} g_i\right)^2}{n^{j}_{l|O}(d)} + \frac{\left(\sum_{\{x_i \in O : x_{ij} > d\}} g_i\right)^2}{n^{j}_{r|O}(d)}\right) \qquad (1)$$

where $n_O = \sum I[x_i \in O]$, $n^{j}_{l|O}(d) = \sum I[x_i \in O : x_{ij} \le d]$ and $n^{j}_{r|O}(d) = \sum I[x_i \in O : x_{ij} > d]$.

Here $n_O$ is the total number of observations in O, $n^{j}_{l|O}(d)$ is the number of observations that fall to the left of the split ($x_{ij} \le d$), $n^{j}_{r|O}(d)$ is the number of observations that fall to the right of the split ($x_{ij} > d$), and $g_i$ is the negative gradient of the loss function with respect to the model output.

For a feature j, the DT algorithm selects

$$d_j^{*} = \arg\max_{d} V_j(d) \qquad (2)$$

and calculates the largest gain $V_j(d_j^{*})$. Then, the data is split according to feature $j^{*}$ at point $d_{j^{*}}$ into the left and right child nodes.

LightGBM:
LightGBM is an open-source Gradient Boosting Decision Tree (GBDT) algorithm developed by Microsoft. It employs a leaf-wise tree growth strategy, which involves selecting the leaf node with the greatest gain in variance as the split point at each iteration of tree construction [13]. LightGBM's multi-thread optimization and leaf growth technique with depth restriction help to reduce the excessive memory consumption of XGBoost so that big data processing can be done more quickly, with fewer false alarms, and with fewer missed detections [14].

LightGBM can be differentiated from other GBDT models by the way the variance gain is calculated [13]. Considering the same inputs as in the variance gain calculation above, LightGBM chooses its splits by distinguishing between training instances with small and large gradients $g_i$. The training instances are ranked in descending order of the absolute values of their gradients. The top $a \times 100\%$ of instances with the larger gradients are kept to form an instance subset A. From the remaining set $A^c$, formed by the $(1-a) \times 100\%$ of instances with smaller gradients, a subset B of size $b \times |A^c|$ is randomly sampled. Finally, the instances are split according to an estimated variance gain over the subset $A \cup B$.

As presented in (3) [13], the estimated variance gain is calculated as

$$\tilde{V}_j(d) = \frac{1}{n}\left(\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n^{j}_{l}(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n^{j}_{r}(d)}\right) \qquad (3)$$

where $A_l$ and $B_l$ denote the instances of A and B with $x_{ij} \le d$, and $A_r$ and $B_r$ denote those with $x_{ij} > d$.

For XGBoost, the gain of every candidate split at a node must be compared and the largest one chosen for splitting, and this comparison must take into account the information gain of all samples. In comparison, LightGBM calculates the information gain from a much smaller number of samples and is substantially more efficient.

The LightGBM model works as follows:
i. The input data is divided into smaller subsets called "leaves" and organized into a tree-like structure called a decision tree.
ii. The model uses the decision tree to make predictions based on the features of the input data.
iii. For each tree in the model, the data is split into two subsets based on the value of a selected feature. The process is repeated until the data is divided into leaves.
iv. The model uses the leaves of the decision tree to make predictions by taking the average of the responses in each leaf.
v. The final prediction is made by combining (summing) the predictions of all the trees in the model.

Fig. 2. LightGBM Model Leaf-wise Growth

The benefits of using LightGBM:
1. Improved training speed and efficiency: LightGBM is known for its fast training speed and high efficiency compared to other gradient boosting algorithms.
2. Enhanced accuracy: LightGBM is often able to achieve better accuracy in classification and regression tasks compared to other algorithms, due to its ability to handle large-scale data and incorporate feature interactions.
3. Reduced memory usage: LightGBM is designed to minimize memory usage, making it well-suited for working with large datasets.
4. Support for parallel, distributed, and GPU learning: LightGBM supports parallel and distributed training, as well as training on GPU, allowing it to scale to larger datasets and make use of advanced hardware resources.
5. Ability to handle large-scale data: LightGBM is able to handle large-scale data efficiently, making it a good choice for tasks that require processing large datasets.
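To make the split criterion discussed earlier in this section concrete, the following sketch computes the exact variance gain for a candidate threshold, picks the best threshold, and evaluates the gradient-based sampled estimate. It follows the description in the text literally (subset B has size b × |A^c| and is weighted by (1 − a)/b) and is not the LightGBM library's implementation. The function names, the toy gradients, and the values of a and b are assumptions made for illustration.

```python
import random


def variance_gain(samples, d):
    """Exact variance gain of splitting at threshold d.

    samples: list of (feature_value, gradient) pairs for the instances at one node.
    """
    left = [g for x, g in samples if x <= d]
    right = [g for x, g in samples if x > d]
    if not left or not right:
        return 0.0
    return (sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)) / len(samples)


def best_split(samples):
    """Return the threshold d that maximises the exact variance gain."""
    candidates = sorted({x for x, _ in samples})[:-1]
    return max(candidates, key=lambda d: variance_gain(samples, d))


def goss_subsets(samples, a, b, rng):
    """Keep the top a fraction by |gradient| (A) and sample a b fraction of the rest (B)."""
    ranked = sorted(samples, key=lambda s: abs(s[1]), reverse=True)
    top_n = int(a * len(ranked))
    subset_a, rest = ranked[:top_n], ranked[top_n:]
    subset_b = rng.sample(rest, int(b * len(rest)))
    return subset_a, subset_b


def estimated_gain(subset_a, subset_b, d, a, b, n):
    """Estimated variance gain over A and B, with B re-weighted by (1 - a) / b."""
    weight = (1 - a) / b

    def side(keep):
        total = (sum(g for x, g in subset_a if keep(x))
                 + weight * sum(g for x, g in subset_b if keep(x)))
        count = sum(1 for x, _ in subset_a + subset_b if keep(x))
        return total ** 2 / count if count else 0.0

    return (side(lambda x: x <= d) + side(lambda x: x > d)) / n


# Hypothetical node with four instances: (feature value, gradient).
samples = [(1, -1.0), (2, -1.0), (3, 1.0), (4, 1.0)]
exact = variance_gain(samples, 2)
split_point = best_split(samples)

# With a = 1.0 every instance is kept in A, so the estimate reduces to the exact gain.
subset_a, subset_b = goss_subsets(samples, a=1.0, b=0.5, rng=random.Random(0))
estimate_full = estimated_gain(subset_a, subset_b, 2, a=1.0, b=0.5, n=len(samples))
print(exact, split_point, estimate_full)
```

For smaller values of a the estimate is computed from fewer instances, which is the source of the efficiency gain described in the text, at the cost of some approximation error.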
A cross-validation procedure was used to optimize the
performance of the model. This approach involves dividing
the dataset into multiple subsets, training the model on
different subsets, and evaluating its performance on the
remaining subsets. This helps to reduce the risk of overfitting
and improve the generalizability of the model.
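The cross-validation procedure described above can be sketched in a few lines. The paper does not state the number of folds or whether the data were shuffled, so k, the contiguous unshuffled folds, the toy data, and the simple threshold classifier standing in for the real model are all assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cross_val_accuracy(xs, ys, fit, predict, k=5):
    """Train on k-1 folds, score on the held-out fold, and average the accuracies."""
    scores = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in validation)
        scores.append(correct / len(validation))
    return sum(scores) / len(scores)


def fit_threshold(xs, ys):
    """Hypothetical stand-in model: threshold halfway between the two class means."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Hypothetical, well-separated toy data.
xs = [0, 1, 2, 3, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
cv_score = cross_val_accuracy(xs, ys, fit_threshold, predict_threshold, k=4)
print(cv_score)
```

In a hyperparameter search, this averaged score would be computed once per candidate setting and the setting with the best score retained. In practice the data are usually shuffled, and often stratified by class, before the folds are formed.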

Based on the cross-validation analysis, the following hyperparameters were identified as optimal:
• Maximum tree depth: 8
• Minimum child samples: 18
• Minimum child weight: 0.001
• Number of leaves: 40

These hyperparameters were found to produce the best model performance in terms of accuracy and generalizability. They represent the most appropriate balance between model complexity and fit to the data.

IV. RESULT ANALYSIS

A. EDA Results

Fig. 3. Pie chart showing percentage of satisfied passengers vs neutral or dissatisfied passengers.

Fig. 3 shows that 56.7% of passengers are dissatisfied or neutral while 43.3% are satisfied. This shows that our dataset is reasonably balanced.

From Fig. 4, we observed that ID, Departure/Arrival time convenient, Gate location, Departure Delay in Minutes and Arrival Delay in Minutes are not significantly correlated with satisfaction.

Fig. 4. Correlation Heatmap

Similarly, other features are visualised using box plots, violin plots and other visualisation plots.

B. Modeling Results
After the models have been evaluated, it is crucial to compare their performance in order to determine which model is the most accurate and performs best for this study.

TABLE I. COMPARISON OF ACCURACY OF DIFFERENT MODELS

Model                 Accuracy (%)   F1 Score (%)
Logistic Regression   87.57          85.45
Linear SVM            87.33          85.12
Naive Bayes           85.29          82.97
SGD                   85.59          80.86
Decision Tree         94.48          93.65
Random Forest         96.25          95.61
AdaBoost              92.41          91.21
Gradient Boosting     94.01          93.01
LightGBM              96.28          95.63
XGBoost               96.16          95.5

Based on the results presented in Table I, the LightGBM classifier had the highest accuracy and F1 score. Therefore, we utilized the LightGBM classifier in our study.

Fig. 5. Top Features obtained by LightGBM Classifier
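The Accuracy and F1 Score columns of Table I are standard functions of confusion-matrix counts. The sketch below shows the usual definitions. The counts passed to the function are hypothetical and are not the paper's confusion matrix; the function name is likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, fn=20, tn=80)
print(metrics)

# F1 is the harmonic mean of precision and recall, so it can also be
# computed directly from the precision and recall the paper reports.
reported_precision, reported_recall = 0.9736, 0.9388
f1_from_reported = (2 * reported_precision * reported_recall
                    / (reported_precision + reported_recall))
print(round(f1_from_reported, 4))
```

Applying the F1 formula to the precision (97.36%) and recall (93.88%) that the paper reports for the LightGBM model gives roughly 95.6%, which is in line with the 95.63 listed in Table I.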

According to the results depicted in Fig. 5, the top 10 most influential features on passenger satisfaction are:

1. Inflight Wi-Fi service
2. Age
3. Flight Distance
4. Customer Type - Loyal Customer
5. Type of Travel - Business travel
6. Baggage handling
7. Online boarding
8. Inflight service
9. Seat comfort
10. Inflight entertainment

These features have the greatest effect on the satisfaction of airline passengers.

Out of these features, the airline company can improve customer satisfaction and business performance by focusing on the following:
1. Inflight Wi-Fi service: Providing reliable and high-speed Wi-Fi connectivity during flights can enhance the overall travel experience for customers.
2. Baggage handling: Ensuring efficient and timely baggage handling can reduce the stress and inconvenience for passengers.
3. Online boarding: Implementing online boarding processes can streamline the check-in process and improve convenience for customers.
4. Inflight services: Providing high-quality inflight services, such as food and beverage options and attentive crew, can enhance the overall travel experience.
5. Seat comfort: Ensuring comfortable seating can improve the overall travel experience for customers.
6. Inflight entertainment: Offering a variety of entertainment options, such as movies, TV shows, and games, can help to keep passengers occupied and satisfied during flights.

Fig. 6. ROC Curve on Test Data

The area under the curve (AUC) is a measure of the performance of a binary classifier, defined as the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Fig. 6 shows an AUC score of 0.995, which is very close to 1. This indicates that the classifier has a good fit and is able to distinguish between positive and negative instances with high accuracy.

Fig. 7. Confusion Matrix

The model yields an accuracy of 96.2%, a precision of 97.36% and a recall of 93.88%. These results suggest that the model is able to make accurate predictions with a high level of consistency.

V. CONCLUSION AND FUTURE SCOPE
The present study utilized the LightGBM classifier to develop a predictive model for customer satisfaction in the airline industry. The model demonstrated high accuracy, with a value of 96.2%. Results indicate that focusing on improving in-flight Wi-Fi service, baggage handling, seat comfort, and inflight entertainment can significantly impact customer satisfaction. Additionally, optimizing online boarding services and inflight services can help to foster customer loyalty and improve business performance. These findings have practical implications for the airline industry, highlighting the importance of prioritizing customer experience in order to drive satisfaction and loyalty.

A potential future direction for this research could involve analyzing the impact of various in-flight services on passenger satisfaction. Specifically, the study could examine how changes to services such as Wi-Fi availability, baggage handling, and seat comfort affect passenger satisfaction on a Likert scale ranging from one to five. This type of analysis could provide valuable insights into the factors that influence passenger satisfaction and could inform decisions related to the provision of in-flight services.

ACKNOWLEDGMENT
We would like to express our sincere gratitude to Prof. Pankaj Kunekar for his invaluable insights and guidance during the course of this study. His expertise in the field and his willingness to share his knowledge with us were essential to the success of this work. We are deeply grateful for his support and encouragement throughout the research. We would also like to thank him for his constructive feedback

and suggestions, which helped us improve the quality of our research.

REFERENCES
[1] Clement Kong Wing Chow, “Customer satisfaction and service quality in the Chinese airline industry,” in Journal of Air Transport Management, vol. 35, 2014, pp. 102–107.
[2] Stelios Tsafarakis, Theodosios Kokotas and Angelos Pantouvakis, “A multiple criteria approach for airline passenger satisfaction measurement and service quality improvement,” in Journal of Air Transport Management, vol. 68, 2018, pp. 61–75.
[3] Eren Sezgen, Keith J. Mason and Robert Mayer, “Voice of airline passenger: A text mining approach to understand customer satisfaction,” in Journal of Air Transport Management, vol. 77, 2019, pp. 65–74.
[4] Hayadi, B. Herawan, Jin-Mook Kim, Khodijah Hulliyah, and Husni Teja Sukmana, "Predicting Airline Passenger Satisfaction with Classification Algorithms," International Journal of Informatics and Information Systems [Online], 4.1 (2021): 82-94. Web. 28 Nov. 2022.
[5] R. Archana and Dr. M. V. Subha, “A study on service quality and passenger satisfaction on Indian airlines,” International Journal of Multidisciplinary Research, vol. 2, Issue 2, February 2012.
[6] Rahim Hussain, Amjad Al Nasser and Yomna K. Hussain, “Service quality and customer satisfaction of a UAE-based airline: An empirical investigation,” in Journal of Air Transport Management, vol. 42, 2015, pp. 167–175.
[7] Hariguna Taqwa, Wiga Maulana Baihaqi, and Aulia Nurwanti, "Sentiment Analysis of Product Reviews as A Customer Recommendation Using the Naive Bayes Classifier Algorithm," International Journal of Informatics and Information Systems [Online], 2.2 (2019): 48-55. Web. 28 Nov. 2022.
[8] Jin-Woo Park, Rodger Robertson and Cheng-Lung Wu, “The effect of airline service quality on passengers’ behavioural intentions: a Korean case study,” in Journal of Air Transport Management, vol. 10, Issue 6, 2004, pp. 435–439.
[9] Saha, G.C. and Theingi, "Service quality, satisfaction, and behavioural intentions: A study of low-cost airline carriers in Thailand," Managing Service Quality: An International Journal, vol. 19, no. 3, 2009, pp. 350-372.
[10] Lacic, E., Kowald, D., and Lex, E., “High Enough? Explaining and Predicting Traveler Satisfaction Using Airline Review,” HT 2016 - Proc. 27th ACM Conf. Hypertext Soc. Media, 2016, pp. 249–254.
[11] Gures, Nuriye, Arslan, Seda and Tun, Sevil, “Customer Expectation, Satisfaction and Loyalty Relationship in Turkish Airline Industry,” International Journal of Marketing Studies, vol. 6, no. 1, 2014, pp. 66–74.
[12] Ban, Hyun-Jeong and Kim, Hak-Seon, “Understanding Customer Experience and Satisfaction through Airline Passengers’ Online Review,” Sustain., vol. 11, no. 15, 2019.
[13] M. R. Machado, S. Karray and I. T. de Sousa, "LightGBM: an Effective Decision Tree Gradient Boosting Method to Predict Customer Loyalty in the Finance Industry," 2019 14th International Conference on Computer Science & Education (ICCSE), 2019, pp. 1111-1116, doi: 10.1109/ICCSE.2019.8845529.
[14] M. Tang et al., “An Improved LightGBM Algorithm for Online Fault Detection of Wind Turbine Gearboxes,” Energies, vol. 13, no. 4, p. 807, Feb. 2020, doi: 10.3390/en13040807.
[15] Gan, M.; Pan, S.; Chen, Y.; Cheng, C.; Pan, H.; Zhu, X., “Application of the Machine Learning LightGBM Model to the Prediction of the Water Levels of the Lower Columbia River,” J. Mar. Sci. Eng. 2021, 9, 496. https://doi.org/10.3390/jmse9050496.