Ibm Attrition Practices
finding that monthly income, age, and the number of companies worked significantly impacted employee attrition. Next, we classified people into two clusters by using K-means Clustering. Finally, we performed a quantitative analysis with binary logistic regression: the attrition of people who traveled frequently was 2.4 times that of people who rarely traveled. We also found that employees who work in Human Resources have a higher tendency to leave.

1 INTRODUCTION
Employee attrition is defined as the natural process by which employees leave the workforce (for example, through resignation for personal reasons or retirement) and are not immediately replaced [1]. Employee turnover is regarded as a key issue for all organizations these days because of its adverse effects on workplace productivity and on accomplishing organizational objectives on time [2]. For an organization to keep a competitive advantage over its competition, it should make it a duty to minimize employee attrition. Therefore, for the better development of a corporation, it is essential for company leaders to know the main reasons why their employees choose to leave, and then take relevant measures to improve the company's productivity, overall workflow and business performance.

Objectives. In this paper, we aim to select the main causes that contribute to an employee's decision to leave a company, and to predict whether a particular employee will leave the company by utilizing machine learning models.

Contributions. Following are the main contributions of this paper:
• We select the main factors affecting employee attrition by using Random Forest, and classify which types of people are more likely to quit by utilizing K-means Clustering.
• We represent a given reality in terms of a numerical value to compare employee attrition in different categories by utilizing quantitative analysis.

The rest of the paper is organized as follows. We present some related work in Section 2. We interpret the data and introduce the methodologies in Section 3. We process the original data set and remove some variables which are not very correlated with other features in Section 4. We implement our machine learning models in Section 5. Finally, we draw a conclusion in Section 6.

2 RELATED WORK
Frye et al. [5] applied Principal Component Analysis and the classification methods K-Nearest Neighbors and Random Forest, finding that Logistic Regression predicts employee quits with the highest accuracy. Yadav et al. [6] and Srivastava et al. [7] each provided a framework for predicting employee churn by analyzing the employee's precise behaviors and attributes using classification techniques. El-Rayes et al. [8] presented a framework for predicting employee attrition with respect to voluntary termination, employing predictive analytics. Setiawan et al. [9] found that eleven variables have a significant impact on employee attrition.

3 DATA AND METHODOLOGY
Employee attrition data is internal company data that is difficult to obtain, and some of it is confidential; therefore, our paper used the data set published on Kaggle. The sample size of the data set is 1471, and there are 34 feature variables, mainly divided into three types: basic personal information, work experience, and attendance rate. This paper explored the relationship between employees' characteristics and employee attrition to find which characteristics have a great influence on employee attrition. In addition, we used machine learning algorithms to select important features that influenced employee attrition and to predict it. In this paper, we exploited three machine learning algorithms: Random Forest (built on decision trees), Logistic Regression and K-means clustering.

3.1 Random Forest
Random forest is an ensemble learning method used for classification, regression and other tasks, which combines multiple decision trees to select the best result. It corrects the tendency of decision trees to rely too much on the training set and improves the accuracy of the model. First, sub-datasets are randomly sampled with replacement from the original data [10]. Different sub-datasets may share elements, and the content of each sub-dataset is different. Then each sub-dataset is used to construct a sub-decision tree; each tree outputs a result, the trees vote, and the majority result is the final output. As shown in Figure 1, the data set yields 4 sub-datasets to construct 4 sub-decision trees; 3 trees vote A, one tree votes B, and the final output is A.
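The sampling-and-voting procedure described in Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the toy row indices and the helper names `bootstrap_sample` and `majority_vote` are assumptions, and the four votes simply reproduce the Figure 1 example rather than coming from trained trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one sub-dataset)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the label chosen by the largest number of sub-decision trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
data = list(range(10))  # hypothetical row indices of a toy data set

# Four sub-datasets, as in the Figure 1 example; rows may repeat within
# a sub-dataset and may be shared between sub-datasets.
sub_datasets = [bootstrap_sample(data, rng) for _ in range(4)]

# Votes of the four sub-decision trees from the Figure 1 example
# (hard-coded here instead of training real trees).
figure1_votes = ["A", "A", "A", "B"]
final_output = majority_vote(figure1_votes)
print(final_output)  # A
```

In a real model each sub-dataset would be used to fit one decision tree, and the votes would be the trees' predictions for a given employee.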
Shenghuan Yang and Md Tariqul Islam
$$\operatorname{logit}(p) = \ln\frac{p(y=1)}{1-p(y=1)} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \tag{1}$$
Here, employee attrition is the dependent variable $y$, where 0 represents No and 1 represents Yes. Gender, EducationLevel, OverTime, JobLevel and so on are the predictor variables $x_1, x_2, \dots, x_{34}$ ($n = 34$), and $\beta_0, \beta_1, \dots, \beta_n$ are the regression coefficients.

Figure 2: Clustering process of K-means Clustering
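To make Equation (1) concrete, the sketch below shows how a coefficient relates to an odds ratio, using a single predictor for simplicity. The coefficient is taken as the natural log of the odds ratio 2.411 that Table 3 reports for Travel_Frequently; the intercept `beta0` and the function names are hypothetical and chosen only for illustration.

```python
import math


def predict_probability(beta0, betas, xs):
    """P(y = 1) from Equation (1): invert the logit of the linear predictor."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p, the left-hand side of Equation (1)."""
    return math.log(p / (1.0 - p))


# Coefficient implied by the odds ratio 2.411 reported for Travel_Frequently.
beta_travel = math.log(2.411)
# Hypothetical intercept, chosen only for illustration (not from the paper).
beta0 = -2.0

p_rare = predict_probability(beta0, [beta_travel], [0])  # reference category
p_freq = predict_probability(beta0, [beta_travel], [1])  # frequent travelers

odds_rare = p_rare / (1.0 - p_rare)
odds_freq = p_freq / (1.0 - p_freq)

# The ratio of the two odds equals exp(beta_travel), whatever the intercept.
odds_ratio = odds_freq / odds_rare
print(round(odds_ratio, 3))  # 2.411
```

The odds ratio does not depend on the intercept, which is why a regression table can report it for each category relative to a reference category fixed at 1.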
IBM Employee Attrition Analysis
4 DATA PROCESS
In our study, there are 34 variables. However, some features have only one level and carry no information for our research, such as EmployeeCount, Over18 and StandardHours; EmployeeNumber likewise has no meaning for the analysis, so we deleted these features. In addition, some features are not strongly correlated with other attributes, so we removed them from our dataset to improve computing efficiency. We built a correlation matrix, which is a table showing the correlation coefficients between variables. Each cell in the table shows the correlation between two variables.

employee attrition, and ignore the variables that are not significant in explaining such behavior.
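The preprocessing described in Section 4 (dropping single-level and identifier columns, then building a correlation matrix) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the toy values are invented, the helper names are hypothetical, and Pearson correlation is assumed because the paper does not name the coefficient it used.

```python
import math


def drop_uninformative(columns, id_columns=()):
    """Drop columns with a single distinct value and listed identifier columns."""
    return {
        name: values
        for name, values in columns.items()
        if len(set(values)) > 1 and name not in id_columns
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy data with invented values; only the column names come from the paper.
toy = {
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeNumber": [1, 2, 4, 5],
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
}

kept = drop_uninformative(toy, id_columns=("EmployeeNumber",))
corr = correlation_matrix(kept)
print(sorted(kept))  # ['Age', 'MonthlyIncome']
```

Each entry `corr[a][b]` corresponds to one cell of the correlation matrix; the diagonal is 1 and the matrix is symmetric.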
Table 1: Prediction Accuracy of Random Forest
Times              Accuracy
1                  0.85714
2                  0.83673
3                  0.84353
4                  0.84693
...                ...
100                0.85374
Average Accuracy   0.84561

5.2 Classification of people
To distinguish which types of people are more likely to resign, we use K-means clustering to divide the dataset into two categories. The first type is prone to leave, and the other is less likely to quit. As Table 2 shows (complete form in appendix Table 4), cluster 0 represents low attrition and cluster 1 represents high attrition; people who are older and have a high job level, high job satisfaction, high monthly income, a larger number of companies worked and so forth are less likely to leave. These findings are in line with people's behavior in the real world and with the earlier feature selection by Random Forest. In the last section, we illustrated the relationship between attrition and age, monthly income, and distance from home; in this section, we also find that the number of companies an employee has worked for is related to the probability of leaving the corporation. People who have worked in 3-4 companies are less likely to quit because by this time they have roughly found their direction of employment, whereas having worked in more than four companies indicates that a person is unstable and often changes jobs. In addition, people at a higher job level enjoy a higher salary and more respect, so they are less likely to leave, while employees in lower-level positions are often not satisfied with the status quo and want to seek better job opportunities. What is more, job satisfaction is also a main cause influencing the employee attrition rate. Intention to stay on the job is clearly correlated with job satisfaction in such aspects as educational system and environment, income and welfare, leadership and administration [14].

Table 2: Clustering
Cluster Type               0              1
Age                        44.215152      34.813158
Attrition                  0.1            0.178947
DistanceFromHome           9.072727       9.227193
Education                  3.039394       2.876316
EnvironmentSatisfaction    2.693939       2.729825
JobInvolvement             2.690909       2.741228
JobLevel                   3.684848       1.594737
JobSatisfaction            2.709091       2.734211
MonthlyIncome              14060.49394    4315.215789
NumCompaniesWorked         3.342424       2.505263
OverTime                   0.290909       0.280702
PercentSalaryHike          15.066667      15.250877
PerformanceRating          3.148485       3.155263
RelationshipSatisfaction   2.781818       2.692105
StockOptionLevel           0.80303        0.791228
TotalWorkingYears          21.072727      8.444737
...                        ...            ...

5.3 Logistic Regression
In this section, we used binary logistic regression to model the relationship between the predictors (employee characteristics) and the predicted variable (employee attrition), where the dependent variable is binary (NO: 0, YES: 1). We also compare the differences between categories to help us understand which type of person is more likely to quit (the complete regression table can be seen in appendix Table 5).

Table 3: Regression Result
                       OR(95% CI)   P-value
Travel_Rarely          1
NonTravel              0.361
Travel_Frequently      2.411        <0.05
Male                   1
Female                 0.659        <0.05
Sales Representative   1
Healthcare             0.160
Human Resource         4.060
Laboratory             0.556
Manager                0.347
Manufacturing          0.200
Research               0.200
Sales Executive        0.484        <0.05
Single                 1
Divorced               0.304
Married                0.427        <0.05
OverTime               0.138        <0.05

From Table 3, taking people who rarely travel as the reference (OR = 1), employees who travel frequently have 2.411 times the odds of leaving the company, whereas people who never travel have 0.361 times the odds. Women have 0.659 times the odds of leaving compared with men. As for JobRole, taking Sales Representative as the reference (OR = 1), employees who work in Research, Laboratory, Manufacturing and Healthcare roles are not likely to quit, while Human Resource has high employee attrition. Single people are more likely to leave their jobs than married and divorced people. Finally, not surprisingly, people who usually work overtime tend to leave the company. To evaluate the logistic regression model, we tested the accuracy of the model in Python; the test result is 0.8843, which means our model fits well.

6 CONCLUSION
According to the above model results, our findings are in line with people's behavior in the real world and with previous studies by other scholars. We utilized Random Forest and K-means Clustering to select important features that had an obvious impact on employee attrition. Firstly, according to the Random Forest results,
monthly income, age, and the number of companies worked are the main reasons why people choose to resign. Then, based on the K-means clustering result, we found that people who are older and have a high job level, high job satisfaction, high monthly income, and a larger number of companies worked are not likely to leave. However, different people have different intentions, so further and more detailed research is needed to obtain quantitative findings by using quantitative analysis. We therefore used binary logistic regression to compare the differences between people. Our study found that females' attrition was 0.659 times that of males, and that married and divorced people's attrition was 0.427 and 0.304 times that of single people, respectively. Besides, the attrition of people who traveled frequently was 2.4 times that of people who rarely traveled. We also found that employees who work in Human Resource have a higher tendency to leave. Finally, there are other interesting findings in our study: in terms of number of companies worked, people who worked in 2-4 companies are less likely to leave, the female attrition rate is lower than the male rate after working for six companies, and people who earned a Doctor's Degree almost always have the lowest attrition rate.

To evaluate model performance, we split the dataset into two parts (80% for training, 20% for testing), trained the models to predict employee attrition, and recorded the test set's accuracy. Random Forest and Logistic Regression accuracy were 0.8456 and 0.8843, respectively, which meant Logistic Regression fitted better and was more suitable for prediction on our dataset.

We also want to make some suggestions to the company through this research, hoping that they will care more about their employees and improve their job satisfaction. At the same time, they must pay more attention to human resources employees because they have very low job satisfaction. Besides, the company should allow employees to have enough time to rest and spend time with their families. There is a general belief that employees who take regular breaks are more productive.

REFERENCES
[1] Budhwar, P. S., and J. Bhatnagar. "Talent management strategy of employee engagement in Indian ITES employees: key to retention." Employee Relations (2007).
[2] Jain, Rachna, and Anand Nayyar. "Predicting employee attrition using xgboost machine learning approach." 2018 International Conference on System Modeling & Advancement in Research Trends (SMART). IEEE, 2018.
[3] Alao, D. A. B. A., and A. B. Adeyemo. "Analyzing employee attrition using decision tree algorithms." Computing, Information Systems, Development Informatics and Allied Research Journal 4.1 (2013): 17-28.
[4] Alduayj, Sarah S., and Kashif Rajpoot. "Predicting employee attrition using machine learning." 2018 International Conference on Innovations in Information Technology (IIT). IEEE, 2018.
[5] Frye, Alex, et al. "Employee Attrition: What Makes an Employee Quit?" SMU Data Science Review 1.1 (2018): 9.
[6] Yadav, Sandeep, Aman Jain, and Deepti Singh. "Early Prediction of Employee Attrition using Data Mining Techniques." 2018 IEEE 8th International Advance Computing Conference (IACC). IEEE, 2018.
[7] Srivastava, Devesh Kumar, and Priyanka Nair. "Employee attrition analysis using predictive techniques." International Conference on Information and Communication Technology for Intelligent Systems. Springer, Cham, 2017.
[8] El-Rayes, Nesreen, et al. "Predicting employee attrition using tree-based models." International Journal of Organizational Analysis (2020).
[9] Setiawan, I., et al. "HR analytics: Employee attrition analysis using logistic regression." IOP Conference Series: Materials Science and Engineering. Vol. 830. No. 3. IOP Publishing, 2020.
[10] Pal, M. "Random forest classifier for remote sensing classification." International Journal of Remote Sensing 26.1 (2005): 217-222.
[11] Qi, Yanjun. "Random forest for bioinformatics." Ensemble Machine Learning. Springer, Boston, MA, 2012. 307-323.
[12] Meng, X. H., Y. X. Huang, D. P. Rao, et al. "Comparison of three data mining models for predicting diabetes or prediabetes by risk factors." The Kaohsiung Journal of Medical Sciences 29.2 (2013): 93-99.
[13] Pham, D. T., S. S. Dimov, and C. D. Nguyen. "Selection of K in K-means clustering." Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 219.1 (2005): 103-119.
[14] Weiqi, Chen. "The structure of secondary school teacher job satisfaction and its relationship with attrition and work enthusiasm." Chinese Education & Society 40.5 (2007): 17-31.
Cluster Type 0 1
Age 44.215152 34.813158
Attrition 0.1 0.178947
DistanceFromHome 9.072727 9.227193
Education 3.039394 2.876316
EnvironmentSatisfaction 2.693939 2.729825
JobInvolvement 2.690909 2.741228
JobLevel 3.684848 1.594737
JobSatisfaction 2.709091 2.734211
MonthlyIncome 14060.49394 4315.215789
NumCompaniesWorked 3.342424 2.505263
OverTime 0.290909 0.280702
PercentSalaryHike 15.066667 15.250877
PerformanceRating 3.148485 3.155263
RelationshipSatisfaction 2.781818 2.692105
StockOptionLevel 0.80303 0.791228
TotalWorkingYears 21.072727 8.444737
TrainingTimesLastYear 2.79697 2.8
WorkLifeBalance 2.781818 2.755263
YearsAtCompany 11.927273 5.584211
YearsInCurrentRole 6.133333 3.67807
YearsSinceLastPromotion 4.109091 1.631579
YearsWithCurrManager 5.821212 3.631579
Male 0.581818 0.605263
BusinessTravelNon-Travel 0.087879 0.10614
BusinessTravelTravelFrequently 0.187879 0.188596
BusinessTravelTravelRarely 0.724242 0.705263
DepartmentHuman Resources 0.045455 0.042105
DepartmentResearch & Development 0.642424 0.657018
DepartmentSales 0.312121 0.300877
EducationFieldHuman Resources 0.021212 0.017544
EducationFieldLife Sciences 0.39697 0.416667
EducationFieldMarketing 0.130303 0.101754
EducationFieldMedical 0.324242 0.313158
EducationFieldOther 0.045455 0.058772
EducationFieldTechnical Degree 0.081818 0.092105
JobRoleHealthcare Representative 0.112121 0.082456
JobRoleHuman Resources 0.012121 0.042105
JobRoleLaboratory Technician 0 0.227193
JobRoleManager 0.309091 0
JobRoleManufacturing Director 0.121212 0.092105
JobRoleResearch Director 0.242424 0
JobRoleResearch Scientist 0.00303 0.255263
JobRoleSales Executive 0.2 0.22807
JobRoleSales Representative 0 0.072807
MaritalStatusDivorced 0.251515 0.214035
MaritalStatusMarried 0.506061 0.44386
MaritalStatusSingle 0.242424 0.342105
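A table of per-cluster averages like the one above could be produced by running K-means with two clusters and then averaging each column within each cluster. The sketch below is a minimal pure-Python version under stated assumptions: the six employees and their values are invented, only two features are clustered, the function names are hypothetical, and no feature scaling is applied, so it is not the authors' pipeline.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Basic Lloyd's algorithm: returns (labels, centers) for tuples of numbers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers are k distinct points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def cluster_means(values, labels, k):
    """Mean of one column within each cluster (0.0 for an empty cluster)."""
    means = []
    for c in range(k):
        members = [v for v, lab in zip(values, labels) if lab == c]
        means.append(sum(members) / len(members) if members else 0.0)
    return means


# Invented toy employees: (Age, TotalWorkingYears), plus an Attrition flag.
points = [(44, 21), (46, 23), (42, 20), (34, 8), (35, 9), (33, 7)]
attrition = [0, 0, 0, 1, 0, 0]

labels, centers = kmeans(points, k=2)
summary = {
    "Age": cluster_means([p[0] for p in points], labels, 2),
    "Attrition": cluster_means(attrition, labels, 2),
}
print(summary)
```

Which group receives label 0 or 1 depends on the random initialization, so cluster numbers carry no meaning by themselves; the low-attrition and high-attrition clusters are identified from the Attrition row.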