Report
Table of Contents
1. Introduction
2. Dataset Description
3. Exploratory Data Analysis (EDA)
4. K-Means Clustering
5. Logistic Regression
7. Conclusion
1. Introduction
Employee attrition, also called turnover, is a major problem for many businesses. Understanding what
drives employees to leave can help companies devise ways to retain good workers, reduce the cost of
hiring replacements, and keep their staffing stable. This study uses a synthetic dataset that mimics a
real-life situation in which different factors affect whether an employee stays with a company or leaves.
The dataset contains variables such as age, gender, department, job title, years at the company,
satisfaction level, average monthly hours worked, promotion in the last five years, salary, and attrition
status, giving a full picture of the factors that might cause employees to leave their jobs.
The main goal of this study is to use both unsupervised and supervised machine learning to identify
and analyse the main factors that drive employee attrition. The study looks for trends and relationships
in the data using exploratory data analysis (EDA). K-means clustering is used to segment employees
into groups based on shared traits, and logistic regression is then used to predict how likely an
employee is to leave, revealing which factors contribute most to attrition. The results should be useful
to companies looking to improve their retention strategies.
2. Dataset Description
The dataset used in this study is a synthetic dataset created to simulate a real-life employee attrition
scenario. It comes from Kaggle, a well-known platform for data science and machine learning
competitions, and is freely available for study and analysis. The dataset contains 1,000 records and 11
variables, described below, which together give a full picture of the many factors that could cause
employees to leave their jobs.
• Employee_ID: A unique identifier for each employee. This variable is not used in the analysis
but helps in identifying individual records.
• Age: The age of the employee. Age can be a significant factor in attrition as it may relate to
career stage and stability.
• Gender: The gender of the employee (Male/Female). Gender analysis can reveal any
disparities in attrition rates.
• Department: The department in which the employee works (e.g., Marketing, Sales,
Engineering). Different departments may have varying attrition rates due to job roles and work
environments.
• Job_Title: The job title of the employee (e.g., Manager, Engineer, Analyst). Job roles can
influence job satisfaction and turnover.
• Years_at_Company: The number of years the employee has worked at the company. Longer
tenure often correlates with lower attrition.
• Satisfaction_Level: The employee's job satisfaction score on a scale from 0 to 1. Lower
satisfaction is commonly associated with higher attrition.
• Average_Monthly_Hours: The average number of hours the employee works per month.
Work hours can impact job satisfaction and turnover.
• Promotion_Last_5Years: Whether the employee has been promoted in the last five years (0 =
No, 1 = Yes). Promotion opportunities can influence retention.
• Salary: The annual salary of the employee. Compensation is a critical factor in employee
retention.
• Attrition: The target variable indicating whether the employee has left the company (0 = No,
1 = Yes).
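To ground the analysis, the dataset can be loaded and inspected with pandas. The sketch below is illustrative only: the file name employee_attrition.csv is an assumption, since the original analysis code is not shown.

```python
import pandas as pd

# Load the synthetic attrition dataset (file name is an assumption).
df = pd.read_csv("employee_attrition.csv")

# Expected shape: 1000 records, 11 variables.
print(df.shape)

# Column names and data types.
print(df.dtypes)

# Class balance of the target variable (roughly 50.5% / 49.5%).
print(df["Attrition"].value_counts(normalize=True))
```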
A first look at the dataset reveals some notable patterns. The target variable is balanced: roughly as
many employees have left the company (49.5%) as have stayed (50.5%). Because it does not favour one
class over the other, this balance is well suited to training predictive models.
Employee ages range from 25 to 59, with an average of 42 years. With a mean score of 0.51, workforce
satisfaction sits around the middle of the scale. Employees work about 199 hours per month on
average, with a standard deviation of about 30 hours, showing considerable variation in working time.
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding a dataset's patterns and
characteristics.
3.1 Summary Statistics
Summary statistics give a numerical overview of the dataset, including measures of central tendency
(mean, median) and dispersion (standard deviation, minimum, maximum, and quartiles).
• Age: Employees range from 25 to 59 years old, with a mean age of 42.21 years and a standard
deviation of 10.02 years, indicating a workforce of varied ages.
• Years_at_Company: Average tenure is 5.61 years, ranging from 1 to 10 years, with a standard
deviation of 2.82 years, showing varied lengths of service.
• Satisfaction_Level: Satisfaction has a mean of 0.51 and a standard deviation of 0.29 on a scale
from 0 to 1. The near-full range shows that employees differ widely in how satisfied they are
with their jobs.
• Average_Monthly_Hours: Mean monthly hours worked is 199.49, with a standard deviation of
29.63 hours and a range of 150 to 249 hours, showing substantial variation in working time.
• Promotion_Last_5Years: A binary variable indicating whether an employee was promoted in
the last five years. Its mean of 0.49 means about half of the employees were promoted during
this period.
• Salary: Pay ranges from $30,099 to $99,991, with a mean of $64,624.98 and a standard
deviation of $20,262.98. The wide spread and high standard deviation show large differences
in compensation.
• Attrition: The attrition rate is about 49.5%, meaning almost half of the employees have left.
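These figures match what pandas reports via describe(). A minimal sketch, assuming the DataFrame df from the loading sketch above:

```python
# Mean, standard deviation, min, quartiles, and max for each numeric column.
print(df[["Age", "Years_at_Company", "Satisfaction_Level",
          "Average_Monthly_Hours", "Salary"]].describe())

# Attrition rate: the mean of a 0/1 indicator is the share of leavers.
print(df["Attrition"].mean())
```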
3.2 Visualizations
Visualising the data shows how the variables are distributed and how they relate to each other. Two
important types of visualisation are distribution plots for numeric variables and count plots for
categorical variables.
Distribution plots of key variables:
• Age: Ages are fairly evenly spread, with a slight peak around 42 years.
• Satisfaction_Level: Satisfaction levels are fairly evenly distributed, with no large concentration
at either extreme, meaning employees span the full range of job satisfaction.
• Salary: The salary distribution is right-skewed, with most employees earning under $80,000
and a few earning close to the maximum of $99,991.
• Department: The count plot shows employees fairly evenly spread across departments, with
slightly more in Engineering and Finance.
• Job_Title: Job titles are also fairly evenly distributed, with slightly more Managers and
Engineers.
• Gender: The numbers of male and female employees are nearly equal.
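The plots described above can be reproduced with matplotlib and seaborn. A sketch under the same df assumption; the column names follow the dataset description in Section 2:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Top row: distribution plots for the numeric variables.
for ax, col in zip(axes[0], ["Age", "Satisfaction_Level", "Salary"]):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)

# Bottom row: count plots for the categorical variables.
for ax, col in zip(axes[1], ["Department", "Job_Title", "Gender"]):
    sns.countplot(x=col, data=df, ax=ax)
    ax.tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()
```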
The summary statistics and visualisations yield several important insights into the data.
The correlation matrix shows that satisfaction level and attrition are moderately negatively correlated:
higher satisfaction is associated with a lower attrition rate. There is also a positive correlation between
average monthly hours and attrition, meaning employees who work longer hours may be more likely
to quit. Other variables show weaker correlations, underscoring how complex the drivers of employee
attrition are.
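The correlation matrix can be computed and visualised as a heatmap; a minimal sketch assuming the same df and a recent pandas (the numeric_only argument requires pandas 1.5 or later):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric variables.
corr = df.corr(numeric_only=True)

# Annotated heatmap of the correlation matrix.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```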
4. K-Means Clustering
K-means clustering is an unsupervised learning method that partitions a dataset into K distinct,
non-overlapping clusters. The algorithm starts from K randomly chosen centroids, assigns each data
point to its nearest centroid, and then updates each centroid to the mean of the points assigned to it.
These steps repeat until the centroids barely move or a maximum number of iterations is reached.
K-means was chosen for this study because it is simple and effective at uncovering structure in the
data. The choice of the number of clusters (K) is critical for interpreting the results. The Elbow Method
guided the choice of K=3: the sum of squared distances from each point to its assigned centroid is
plotted against K, and the "elbow point" where the rate of decrease slows marks the number of clusters
that best balances simplicity and cluster compactness.
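A sketch of the elbow analysis and the final K=3 fit with scikit-learn. Using satisfaction level and average monthly hours as the clustering features follows the cluster plot below; the scaling step, n_init, and random_state are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardise the two features so neither dominates the distance metric.
X = StandardScaler().fit_transform(
    df[["Satisfaction_Level", "Average_Monthly_Hours"]])

# Elbow Method: inertia (within-cluster sum of squares) for K = 1..10.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.show()

# Final model with the elbow choice K = 3; store cluster labels on df.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X)
```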
Fig. 3 K-means Clustering of Employees
Applying K-means clustering to the dataset, the resulting groups were visualised using average
monthly hours and satisfaction level as axes.
• Cluster 0 (Purple): Employees with a wide range of satisfaction levels who generally work
fewer hours each month.
• Cluster 1 (Yellow): Employees who are moderately satisfied with their jobs and whose average
monthly hours vary widely.
• Cluster 2 (Teal): Employees who are less satisfied with their jobs and who work more hours on
average each month.
The clusters show that employees can be segmented by job satisfaction and hours worked, pointing
to areas that retention strategies should target.
• Satisfaction Levels: Satisfaction varies within each cluster, suggesting that it is not uniform
across the workforce and can strongly affect turnover.
• Working Hours: The variation in average monthly hours within clusters suggests that working
hours influence job satisfaction and, in turn, turnover.
• Employee Segmentation: The clusters divide employees into groups with similar traits,
enabling more focused interventions.
Because employees in Cluster 2 work longer hours each month on average, they may be more prone
to burnout and attrition; rebalancing workloads might improve their retention.
Employees in Cluster 0, who work fewer hours each month, may face different workplace stressors
that affect their satisfaction and motivation, which can also lead to turnover.
These findings allow tailored strategies to be designed to raise satisfaction and reduce attrition.
5. Logistic Regression
5.1 Methodology
Logistic regression is a supervised learning method used for binary classification. It estimates the
probability that a given observation belongs to a particular class. The logistic function, also called the
sigmoid function, σ(z) = 1 / (1 + e^(−z)), maps any real number into the [0, 1] range, which can then
be interpreted as a probability.
Data Splitting and Model Training: An 80-20 split separated the dataset into training and testing sets,
so that the model could be trained on one portion of the data and evaluated on data it had not seen
before. The logistic regression model was allowed up to 1,000 iterations to ensure convergence.
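A minimal sketch of the split and fit. Encoding the categorical columns with pd.get_dummies, stratifying the split, and the random_state value are assumptions not spelled out in the report; max_iter=1000 matches the convergence setting described above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Features: all columns except the identifier and the target,
# with categorical variables one-hot encoded.
X = pd.get_dummies(df.drop(columns=["Employee_ID", "Attrition"]),
                   drop_first=True)
y = df["Attrition"]

# 80-20 train-test split, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Logistic regression, allowed up to 1000 iterations to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```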
5.2 Model Performance
The model is 50% accurate overall, meaning it predicts attrition correctly only half of the time.
Precision is 0.51 for class 0 (no attrition) and 0.49 for class 1 (attrition). Recall is 0.59 for class 0 and
0.41 for class 1. The F1-score, the harmonic mean of precision and recall, is 0.55 for class 0 and 0.44
for class 1.
The confusion matrix shows that the model correctly predicted 60 no-attrition cases (true negatives)
and 40 attrition cases (true positives), while misclassifying 42 no-attrition cases as attrition (false
positives) and 58 attrition cases as no attrition (false negatives).
Classification Report:
              precision    recall    f1-score    support
Class 0 (No)       0.51      0.59        0.55        102
Class 1 (Yes)      0.49      0.41        0.44         98
Accuracy                                 0.50        200

Confusion Matrix:
[[60 42]
 [58 40]]
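The report and matrix above are the standard outputs of scikit-learn's metrics module; a sketch assuming the fitted model and test split from the previous step:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Per-class precision, recall, and F1-score, plus overall accuracy.
print(classification_report(y_test, y_pred))

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```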
Fig. 4 Correlation Matrix
• Model Accuracy: A score of 50% means the model performs no better than random guessing. The
linear relationships that logistic regression captures may not fully reflect the complexity of the
factors driving attrition.
• Precision and Recall: The model's middling precision and recall show that it cannot reliably
distinguish employees who will stay from employees who will leave, possibly because the two
groups share many characteristics.
• Possible Causes of the Low Accuracy:
• Feature Overlap: Employees who stay and employees who leave may have similar characteristics,
making it hard for the model to separate the classes.
• Linearity Assumption: Logistic regression assumes a linear relationship between the features and
the log-odds of the target variable. If the true relationships are non-linear, this limits the model's
accuracy.
• Additional Factors: Attrition may also be driven by factors not captured in the dataset, such as
company culture, conditions in the external job market, or personal circumstances.
7. Conclusion
This study used a synthetic dataset to examine employee attrition, with the aim of identifying the main
drivers of turnover and building predictive models that can help companies design better retention
strategies. Exploratory data analysis (EDA) revealed important trends and relationships in the data; for
example, satisfaction level and average monthly hours worked had a strong effect on the attrition rate.
Unsupervised learning with K-means clustering identified three distinct groups of employees based on
job satisfaction and hours worked, highlighting issues that may make employees hard to retain.
Supervised learning with logistic regression achieved only modest accuracy, illustrating how difficult it
is to predict attrition with linear models and suggesting that more advanced methods are needed.
The most important findings of this study are the importance of job satisfaction and a healthy work-life
balance, and the potential for targeted interventions based on the clustering results. The logistic
regression results, on the other hand, show that attrition is a complex problem, probably influenced
by factors not included in the dataset.