Cdu 1121 09
[email protected], [email protected], [email protected]
Keywords: prediction, employee attrition, class imbalance, data cleaning, feature selection,
classification
1. Introduction
Employees are a crucial part of any organization; they are the key asset responsible for its growth. Without employees, projects can neither be scaled up nor completed within deadlines. Companies therefore increasingly invest in their employees by providing a good working atmosphere and benefits such as insurance, paid leave, and transport.
Today, many organizations face a serious problem referred to as employee attrition: a worker retiring from or resigning from a firm. Various factors influence employees to quit a company, such as work pressure, low growth, and poor job satisfaction. In one survey of 1,000 employees, 31% reported quitting a job within their very first six months at the company [1].
Reports indicate that over 50% of companies globally face issues with employee retention. Corporations spend considerable resources and time recruiting and training new employees, and replacing an experienced workforce with freshers is an expensive, lengthy process. Publicly available estimates suggest it takes around 1-2 years for a newly hired employee to match the speed, productivity, and knowledge level of an existing one. If this problem continues to grow, it will over time result in substantial knowledge loss, and also financial loss, for companies [2].
* Rakshith.A.C, Student, Dept. of Computer Applications, JSS Science and Technology University, Mysuru, India. [email protected]
To overcome the employee attrition problem, employers have in recent times started estimating which employees are likely to resign, which helps reduce non-scheduled staff-replacement costs. By predicting employee attrition at an early stage, corporations can start hiring new candidates in advance, helping them cope with project deadlines [3].
For this work, multiple machine learning models, namely Naive Bayes, Random Forest, Support Vector Classifier (SVC), k-Nearest Neighbours (kNN), Decision Trees, and eXtreme Gradient Boosting (XGBoost), are used to predict employee attrition by classifying the employee data into two categories: Employee will stay and Employee will exit.
2. Literature Survey
One study used monthly work reports of 3,638 software developers from two IT companies. Three datasets were considered for the experiment: the dataset of company C1, the dataset of company C2, and the combination of both companies' reports into a single dataset. 67 features along six dimensions were extracted from each employee's first six months of report data; the six dimensions include total hours worked per month, complete statistics of hours worked and projects done, task-report statistics, readability of task reports, and monthly project statistics. According to this study, the main reasons developers left were heavy workload and tight deadlines, and the Random Forest classifier gave good results [3].
Another work used the dataset provided by IBM analytics, which consists of around 1,500 samples and 35 features. The dataset was imbalanced, with 84% of the workers staying and 16% leaving. The study found "Monthly Income" to be the major factor leading workers to quit. Gaussian Naive Bayes, the Naive Bayes classifier for multivariate Bernoulli models, the Logistic Regression classifier, kNN, the Decision Tree classifier, Random Forest, SVM classification, and Linear SVM were used, and the Gaussian Naive Bayes algorithm gave the best result [2].
A new ERP (Employee Resignation Predictor) approach has been reported, in which publicly available professional profiles on LinkedIn were mined and features then extracted. Using data mining, the crawler collected 120,000 expert profiles, from which 11 features were selected. Three classification algorithms, Decision Trees, Back Propagation, and Self-Organizing Maps, were used to evaluate the ERP approach. The work demonstrated that the Decision Tree algorithm outperformed Back Propagation and Self-Organizing Maps with good accuracy [4].
Various data pre-processing techniques have also been discussed, focusing on outlier detection; handling missing values (ignoring instances, mean substitution, hot-deck imputation, and more); discretization; data normalization (min-max and z-score normalization); feature selection (based on distance, information, dependence, or consistency, and, by selection movement, Sequential Backward Floating Selection (SBFS) and Sequential Forward Floating Selection (SFFS)); and feature construction performed by the GALA algorithm [5].
A study has described several techniques for handling imbalanced datasets, discussing previously proposed solutions to class imbalance at both the algorithmic and the data levels. At the data level, random undersampling and random oversampling are non-heuristic methods: random oversampling balances the classes by randomly duplicating minority-class examples, while random undersampling does so by eliminating random examples of the majority class. At the algorithmic level, the solutions include the threshold method, one-class learning, and cost-sensitive learning, which requires defining fixed, unequal misclassification costs between the classes [6].
Different variants of the Decision Tree algorithm were used in another work. The dataset consisted of 309 worker records from a reputed company in Nigeria; 9 features were extracted from the records, of which only 6 (Sex, State of Origin, Length of Service, Rank, Salary, and Reason for Leaving) were used for modelling. Two tools were used for the experiments. The first was WEKA (Waikato Environment for Knowledge Analysis), developed by the University of Waikato in New Zealand, which provides a complete suite for machine learning, including visualization tools and ML models for predictive modelling and data analysis, along with a GUI. The second was See5, used for discovering patterns in the data. The classifiers C4.5 (J48), REPTree, and CART were used; among these, the See5 decision tree gave favourable results compared to the other algorithms, and the attributes that contributed most to employees' decisions to leave the organization were Salary and Length of Service [7].
3. Dataset
In this work, we used two datasets: the Generic Employee dataset and the HR Analytics: Job Change of Data Scientists dataset. Both contain historical employee data labelled with two categories: Employee will stay and Employee will leave.
The Generic Employee dataset was collected from GitHub and comprises 14,999 samples along with 11 (9+2) features [8].
The HR Analytics: Job Change of Data Scientists dataset is available on Kaggle; it has 19,159 entries and 14 parameters [9].
Sample records from the Data Scientist dataset (empty cells shown as "-"):

major_discipline  experience  company_size  company_type    last_new_job  training_hours  target
STEM              >20         -             -               1             36              1
STEM              15          50-99         Pvt Ltd         >4            47              0
STEM              5           -             -               never         83              0
Business Degree   <1          -             Pvt Ltd         never         52              1
STEM              >20         50-99         Funded Startup  4             8               0
4. Proposed System
In this work, we propose a new way to handle the employee attrition problem; it addresses the limitations of existing systems and is trained and evaluated on larger data. Two different datasets are used for the experiments: one covers generic employee departments such as sales, accounting, and support, while the other covers only data scientists. Employee satisfaction is considered, since it is one of the key factors determining whether an employee will depart the company/organization or not.
Multiple classifiers: kNN, Decision Trees, SVC, Random Forest, Naïve Bayes, and
XGBoost are used.
(Figure: system workflow comprising Exploratory Data Analysis (EDA), Data Preprocessing, and Feature Selection.)
The Generic Employee dataset is available as two separate sets, one with 9 attributes and the other with 2. In this phase, the separate sets are combined using the set_index() and join() functions, and the features satisfaction_level and last_evaluation, which contained null values, are filled with the mean of the respective feature. Two attempts are made on the Data Scientist dataset: in Attempt 1 the missing values are filled with the mean of the respective features, while in Attempt 2 all rows with even a single missing value are deleted.
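As a minimal sketch of this merge-and-impute step (the tiny frames, file contents, and values below are illustrative stand-ins, not the real data):

```python
import numpy as np
import pandas as pd

# Two parts of the Generic Employee dataset (values are illustrative).
part_a = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "satisfaction_level": [0.38, np.nan, 0.72],
    "last_evaluation": [0.53, 0.86, np.nan],
})
part_b = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department": ["sales", "support", "accounting"],
})

# Combine the two sets on the shared key, as described above.
merged = part_a.set_index("employee_id").join(part_b.set_index("employee_id"))

# Fill nulls in the two affected features with the feature mean.
for col in ["satisfaction_level", "last_evaluation"]:
    merged[col] = merged[col].fillna(merged[col].mean())

print(merged.isna().sum().sum())  # no missing values remain
```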
In this step, the data is explored in depth, both statistically and visually. The datasets are checked for missing values, class imbalance, and categorical features. The visualization libraries Matplotlib and Seaborn are used to examine the data in many ways. Plots such as the correlation plot are used for feature selection, while the pairplot technique produces a grid of scatterplots showing the pairwise bivariate distributions for every (n, 2) combination of variables as a matrix of plots, with the univariate plots along the diagonal.
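The correlation matrix that underlies such plots can be computed directly with pandas; a small sketch on toy stand-in data follows (rendering it visually is then a single Seaborn call, e.g. sns.heatmap(corr) or sns.pairplot(df)):

```python
import pandas as pd

# Toy frame standing in for the employee data (values are illustrative;
# left = 1 means the employee exited).
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37],
    "average_montly_hours": [157, 262, 272, 223, 159],
    "left": [1, 0, 1, 0, 1],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Features most (anti-)correlated with attrition surface immediately.
print(corr["left"].sort_values())
```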
Features that are irrelevant or only partly relevant can negatively impact a model's performance, so feature selection is performed to retain only those features which contribute most to the output variable. Three feature-selection methods are employed: Correlation Matrix with Heatmap, Feature Importance, and Univariate feature selection. However, only SelectKBest, a univariate feature-selection method, is ultimately used because of its accuracy. This method scores the features with a statistical test between X and y and selects the k highest-scoring features; its built-in Chi-Square (chi2) test is used here. The employee_id feature of the Generic Employee dataset, and the three features gender, company_size, and enrolled_id of the Data Scientist dataset, are deleted since they are of no use for prediction [10].
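A sketch of SelectKBest with the chi2 scoring function, on a toy non-negative feature matrix invented for illustration (chi2 requires non-negative inputs):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative features (e.g. years at company, hours, a binary flag)
# and binary attrition labels.
X = np.array([
    [1, 30, 0],
    [9,  2, 1],
    [2, 28, 0],
    [8,  3, 1],
    [1, 25, 0],
    [9,  1, 1],
])
y = np.array([0, 1, 0, 1, 0, 1])

# Score every feature with the chi-squared test and keep the best k.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)             # (6, 2): only the top-2 features remain
print(selector.get_support())  # boolean mask of the selected columns
```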
At this stage, the dataset is prepared for the classification algorithms. First, the categorical variables are converted to numerical features: the Generic Employee dataset had 2 categorical features, which are converted using the LabelEncoder() function, while the Data Scientist dataset had 10, which are transformed manually through user-defined methods. The datasets are then split into training and testing sets using the train_test_split() function. For a few models such as kNN and SVC, data normalization is necessary, since they work on the basis of distances between data points, and large distances may lead to inaccuracies. The StandardScaler() function is used to bring the features onto a comparable, smaller scale.
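The encode-split-scale sequence might look like the following sketch (toy values; note the scaler is fitted on the training split only, so no information leaks from the test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative data: one categorical and one numeric feature.
dept = np.array(["sales", "support", "sales", "accounting", "support", "sales"])
hours = np.array([[157.0], [262.0], [272.0], [223.0], [159.0], [240.0]])
y = np.array([1, 0, 1, 0, 1, 0])

# 1) Categorical -> numeric with LabelEncoder.
dept_enc = LabelEncoder().fit_transform(dept)
X = np.column_stack([dept_enc, hours])

# 2) Train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# 3) Standardize: fit on the training set, apply to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.mean(axis=0))  # ~0 per feature after scaling
```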
Class imbalance is a problem where the distribution of samples across the classes is unequal. It is described as a ratio: a slight class imbalance of 4:6 is negligible, whereas a severe imbalance of 1:3, 1:5, or 1:100 is not, as it biases models towards the class with more samples (the majority class); it should therefore be treated with resampling techniques. Two methods from the imblearn library, random oversampling and NearMiss undersampling, are employed to handle class imbalance. RandomOverSampler() randomly duplicates samples of the minority class until the dataset is balanced, while the NearMiss() undersampling method selects which majority-class entries to keep. The latter is an efficient method, since it keeps the essential samples and deletes the less important records [11].
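Random oversampling is simple enough to sketch with the standard library alone (imblearn's RandomOverSampler provides the equivalent, plus bookkeeping); the rows and labels below are illustrative:

```python
import random
from collections import Counter

random.seed(0)

# Illustrative imbalanced data: label 1 ("will leave") is the minority.
samples = [("row%d" % i, 0) for i in range(10)] + [("row10", 1), ("row11", 1)]

counts = Counter(label for _, label in samples)
minority = min(counts, key=counts.get)
deficit = max(counts.values()) - counts[minority]

# Random oversampling: duplicate randomly chosen minority-class rows
# until both classes have the same number of samples.
minority_rows = [s for s in samples if s[1] == minority]
balanced = samples + [random.choice(minority_rows) for _ in range(deficit)]

print(Counter(label for _, label in balanced))  # both classes now equal
```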
The two datasets are experimented on with the six classification algorithms in three settings: the imbalanced dataset, the dataset balanced with oversampling, and the dataset balanced with undersampling. The ML models are then trained and tested for their performance, and hyperparameter tuning is done for three of the models used on the Data Scientist dataset.
Two metrics are used for evaluation: the F1 score and accuracy. The F1 score measures the performance of models fed with the imbalanced dataset [12]; it is the harmonic mean of recall and precision, since it internally combines those two metrics. The accuracy score measures the performance of models fed with a dataset having equal class distribution; accuracy is the proportion of all accurately classified cases [13].
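The two metrics can be worked through by hand on a small illustrative set of true labels and predictions (1 = employee will exit):

```python
# Toy labels and predictions, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 2
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: fraction of all correctly classified cases.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(f1, 4), round(accuracy, 4))  # 0.6667 0.75
```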
5. Results and Discussion

Sl. No.  Model          Imbalanced (F1 %)  Oversampled (Acc. %)  Undersampled (Acc. %)
3        Random Forest  98                 99                    98
4        SVC            92                 96                    93
5        XGBoost        95                 97                    96
6        kNN            94                 97                    93
In the experiments performed, all the models except Naive Bayes performed well on both the balanced and the imbalanced datasets, and Random Forest gave the highest accuracy in every case.
On the Generic Employee dataset, all models excluding Naive Bayes have an average accuracy/F1 score of 94 across all three cases (imbalanced, oversampled, and undersampled), so no hyperparameter tuning was needed. Moreover, the dataset was clean, with minimal missing values, which allowed the models to be highly accurate. The best-performing classifier on this dataset is Random Forest.
In contrast, hyperparameter tuning was required to increase the accuracy of the models used on the Data Scientist dataset. Two attempts, Attempt 1 and Attempt 2, were made to improve accuracy; in both, the models performed better only on the oversampled dataset. In Attempt 1, the models with the highest accuracies were Random Forest at 78% and XGBoost at 77%. Hyperparameter tuning was applied to SVC, Random Forest, and XGBoost. After tuning, the XGBoost model overfitted, while SVC's accuracy increased by 2%, from 74% to 76%. The Random Forest classifier overfitted with an accuracy of 78% before tuning; after tuning it no longer overfitted and gave the best result. The hyperparameters used for the Random Forest model were criterion set to 'gini', n_estimators of 500, and max_depth of 7.
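With those stated hyperparameters, the tuned model can be sketched in scikit-learn (the synthetic data below merely stands in for the oversampled Data Scientist features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in: 200 samples, 5 features, a separable-ish label.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Random Forest with the hyperparameters reported above.
clf = RandomForestClassifier(
    criterion="gini", n_estimators=500, max_depth=7, random_state=42)
clf.fit(X, y)

print(round(clf.score(X, y), 3))  # training accuracy on the toy data
```

Capping max_depth at 7 is what curbs the overfitting described above: shallower trees generalize better than fully grown ones, while the 500-tree ensemble keeps the aggregated prediction stable.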
The Data Scientist dataset had too many missing values, due to which the models could not perform well despite hyperparameter tuning. Hence another attempt (Attempt 2) was made on this dataset: accuracy was improved by deleting all rows with even a single missing value, and the results were positive. This time, the hyperparameter-tuned Random Forest and SVC models achieved accuracies of 86% and 85% respectively, followed by the vanilla XGBoost classifier at 83%. The models' performance is therefore dependent on the dataset, and a good dataset can increase the performance of the algorithms.
6. Conclusion
The application of machine learning is expanding rapidly across domains, thanks to the availability of large data and the growth of data science, which together enable accurate, objective, data-driven decisions.
In this application of machine learning to predicting employee attrition, the Random Forest classifier performed best on both the balanced and the imbalanced datasets, owing to its ensemble of many decision trees whose predictions are aggregated, while the weakest results came from the Naive Bayes model. The algorithms gave their best results on the oversampled datasets. Therefore, apart from good classification algorithms and hyperparameter tuning, it is equally important to have a clean dataset; a poor one leads to poor model performance.
7. References
[1] 20 Surprising Employee Retention Statistics You Need to Know, https://blog.bonus.ly/surprising-
employee-retention-statistics.
[2] Fallucchi, Francesca, Marco Coladangelo, Romeo Giuliano, and Ernesto William De Luca. "Predicting
Employee Attrition Using Machine Learning Techniques." Computers 9, no. 4 (2020): 86.
[3] Bao, Lingfeng, Zhenchang Xing, Xin Xia, David Lo, and Shanping Li. "Who will leave the company?: a large-scale industry study of developer turnover by mining monthly work report." In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp. 170-181. IEEE, 2017.
[4] de Jesus, Ana Carolina C., Márcio Enio GD Júnior, and Wladmir C. Brandao. "Exploiting linkedin to
predict employee resignation likelihood." In Proceedings of the 33rd Annual ACM Symposium on
Applied Computing, pp. 1764-1771. 2018.
[5] Kotsiantis, Sotiris B., Dimitris Kanellopoulos, and Panagiotis E. Pintelas. "Data preprocessing for supervised leaning." International Journal of Computer Science 1, no. 2 (2006): 111-117.
[6] Kotsiantis, Sotiris, Dimitris Kanellopoulos, and Panayiotis Pintelas. "Handling imbalanced datasets: A
review." GESTS International Transactions on Computer Science and Engineering 30, no. 1 (2006): 25-
36.
[7] Alao, D. A. B. A., and A. B. Adeyemo. "Analyzing employee attrition using decision tree algorithms."
Computing, Information Systems, Development Informatics and Allied Research Journal 4, no. 1 (2013):
17-28.
[8] Dataset 1: Generic Employee, https://github.com/pydeveloperashish/Predicting-which-of-your-Employee-will-Quit-your-Company-Data-Science-Project and https://www.kaggle.com/arvindbhatt/hrcsv.
[9] Dataset 2: DataScientist, https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists.
[10] Feature Selection For Machine Learning in Python, https://machinelearningmastery.com/feature-selection-machine-learning-python.
[11] 10 Techniques to deal with Imbalanced Classes in Machine Learning,
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-
learning.
[12] F1 Score – Classification Error Metric, https://www.journaldev.com/45165/f1-score-in-python.
[13] Accuracy vs F1 – Score, https://medium.com/analytics-vidhya/accuracy-vs-f1-score-6258237beca2.