Report
Table of Contents
1. Introduction
2. Dataset Description
3. Exploratory Data Analysis (EDA)
4. K-Means Clustering
5. Logistic Regression
7. Conclusion
1. Introduction
Employee attrition, also called turnover, is a major problem for many businesses. Understanding what
drives employees to leave can help companies devise ways to retain good workers, reduce the cost of
hiring replacements, and keep their staffing stable. This study uses a synthetic dataset that mimics a
real-life situation in which different factors affect whether an employee stays with a company or leaves.
The dataset contains variables such as age, gender, department, job title, years at the company,
satisfaction level, average monthly hours worked, promotion in the last five years, salary, and attrition
status, giving a full picture of the factors that might cause employees to leave their jobs.
The main goal of this study is to use both unsupervised and supervised machine learning to identify
and analyse the main factors that drive employee attrition. The study looks for trends and relationships
in the data using exploratory data analysis (EDA). K-means clustering is used to segment employees
into groups based on shared traits, and logistic regression is then used to predict how likely an
employee is to leave, revealing which factors contribute most to attrition. The results should be useful
to companies looking to improve their retention strategies.
2. Dataset Description
The dataset used in this study is a synthetic dataset created to simulate a real-life employee attrition
scenario. It comes from Kaggle, a well-known platform for data science and machine learning
competitions, and is freely available for study and analysis. The dataset contains 1,000 records and 11
variables, described below, which together give a full picture of the many factors that could cause
employees to leave their jobs.
• Employee_ID: A unique identifier for each employee. This variable is not used in the analysis
but helps in identifying individual records.
• Age: The age of the employee. Age can be a significant factor in attrition as it may relate to
career stage and stability.
• Gender: The gender of the employee (Male/Female). Gender analysis can reveal any
disparities in attrition rates.
• Department: The department in which the employee works (e.g., Marketing, Sales,
Engineering). Different departments may have varying attrition rates due to job roles and work
environments.
• Job_Title: The job title of the employee (e.g., Manager, Engineer, Analyst). Job roles can
influence job satisfaction and turnover.
• Years_at_Company: The number of years the employee has worked at the company. Longer
tenure often correlates with lower attrition.
• Satisfaction_Level: The employee's job satisfaction score on a scale from 0 to 1. Lower
satisfaction is commonly associated with higher attrition.
• Average_Monthly_Hours: The average number of hours the employee works per month.
Work hours can impact job satisfaction and turnover.
• Promotion_Last_5Years: Whether the employee has been promoted in the last five years (0 =
No, 1 = Yes). Promotion opportunities can influence retention.
• Salary: The annual salary of the employee. Compensation is a critical factor in employee
retention.
• Attrition: The target variable indicating whether the employee has left the company (0 = No,
1 = Yes).
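To ground the analysis, the dataset can be loaded and inspected with pandas. The sketch below is illustrative only: the file name employee_attrition.csv is an assumption, since the original analysis code is not shown.

```python
import pandas as pd

# Load the synthetic attrition dataset (file name is an assumption).
df = pd.read_csv("employee_attrition.csv")

# Expected shape: 1000 records, 11 variables.
print(df.shape)

# Column names and data types.
print(df.dtypes)

# Class balance of the target variable (roughly 50.5% / 49.5%).
print(df["Attrition"].value_counts(normalize=True))
```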
A first look at the dataset reveals some notable patterns. The target variable is balanced: roughly as
many employees have left the company (49.5%) as have stayed (50.5%). Because it does not favour one
class over the other, this balance is well suited to training predictive models.
Employee ages range from 25 to 59, with an average of 42 years. With a mean score of 0.51, workforce
satisfaction sits around the middle of the scale. Employees work about 199 hours per month on
average, with a standard deviation of about 30 hours, showing considerable variation in working time.
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in understanding a dataset's patterns and
characteristics.
3.1 Summary Statistics
Summary statistics give a numerical overview of the dataset, including measures of central tendency
(mean, median) and dispersion (standard deviation, minimum, maximum, and quartiles).
• Age: Employees range from 25 to 59 years old, with a mean age of 42.21 years and a standard
deviation of 10.02 years, indicating a workforce of varied ages.
• Years_at_Company: Average tenure is 5.61 years, ranging from 1 to 10 years, with a standard
deviation of 2.82 years, showing varied lengths of service.
• Satisfaction_Level: Satisfaction has a mean of 0.51 and a standard deviation of 0.29 on a scale
from 0 to 1. The near-full range shows that employees differ widely in how satisfied they are
with their jobs.
• Average_Monthly_Hours: Mean monthly hours worked is 199.49, with a standard deviation of
29.63 hours and a range of 150 to 249 hours, showing substantial variation in working time.
• Promotion_Last_5Years: A binary variable indicating whether an employee was promoted in
the last five years. Its mean of 0.49 means about half of the employees were promoted during
this period.
• Salary: Pay ranges from $30,099 to $99,991, with a mean of $64,624.98 and a standard
deviation of $20,262.98. The wide spread and high standard deviation show large differences
in compensation.
• Attrition: The attrition rate is about 49.5%, meaning almost half of the employees have left.
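These figures match what pandas reports via describe(). A minimal sketch, assuming the DataFrame df from the loading sketch above:

```python
# Mean, standard deviation, min, quartiles, and max for each numeric column.
print(df[["Age", "Years_at_Company", "Satisfaction_Level",
          "Average_Monthly_Hours", "Salary"]].describe())

# Attrition rate: the mean of a 0/1 indicator is the share of leavers.
print(df["Attrition"].mean())
```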
3.2 Visualizations
Visualising the data shows how the variables are distributed and how they relate to each other. Two
important types of visualisation are distribution plots for numeric variables and count plots for
categorical variables.
Distribution plots of key variables:
• Age: Ages are fairly evenly spread, with a slight peak around 42 years.
• Satisfaction_Level: Satisfaction levels are fairly evenly distributed, with no large concentration
at either extreme, meaning employees span the full range of job satisfaction.
• Salary: The salary distribution is right-skewed, with most employees earning under $80,000
and a few earning close to the maximum of $99,991.
• Department: The count plot shows employees fairly evenly spread across departments, with
slightly more in Engineering and Finance.
• Job_Title: Job titles are also fairly evenly distributed, with slightly more Managers and
Engineers.
• Gender: The numbers of male and female employees are nearly equal.
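The plots described above can be reproduced with matplotlib and seaborn. A sketch under the same df assumption; the column names follow the dataset description in Section 2:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Top row: distribution plots for the numeric variables.
for ax, col in zip(axes[0], ["Age", "Satisfaction_Level", "Salary"]):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)

# Bottom row: count plots for the categorical variables.
for ax, col in zip(axes[1], ["Department", "Job_Title", "Gender"]):
    sns.countplot(x=col, data=df, ax=ax)
    ax.tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()
```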
The summary statistics and visualisations yield several important insights into the data.
The correlation matrix shows that satisfaction level and attrition are moderately negatively correlated:
higher satisfaction is associated with a lower attrition rate. There is also a positive correlation between
average monthly hours and attrition, meaning employees who work longer hours may be more likely
to quit. Other variables show weaker correlations, underscoring how complex the drivers of employee
attrition are.
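The correlation matrix can be computed and visualised as a heatmap; a minimal sketch assuming the same df and a recent pandas (the numeric_only argument requires pandas 1.5 or later):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric variables.
corr = df.corr(numeric_only=True)

# Annotated heatmap of the correlation matrix.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```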
4. K-Means Clustering
K-means clustering is an unsupervised learning method that partitions a dataset into K distinct,
non-overlapping clusters. The algorithm starts from K randomly chosen centroids, assigns each data
point to its nearest centroid, and then updates each centroid to the mean of the points assigned to it.
These steps repeat until the centroids barely move or a maximum number of iterations is reached.
K-means was chosen for this study because it is simple and effective at uncovering structure in the
data. The choice of the number of clusters (K) is critical for interpreting the results. The Elbow Method
guided the choice of K=3: the sum of squared distances from each point to its assigned centroid is
plotted against K, and the "elbow point" where the rate of decrease slows marks the number of clusters
that best balances simplicity and cluster compactness.
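A sketch of the elbow analysis and the final K=3 fit with scikit-learn. Using satisfaction level and average monthly hours as the clustering features follows the cluster plot below; the scaling step, n_init, and random_state are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardise the two features so neither dominates the distance metric.
X = StandardScaler().fit_transform(
    df[["Satisfaction_Level", "Average_Monthly_Hours"]])

# Elbow Method: inertia (within-cluster sum of squares) for K = 1..10.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.show()

# Final model with the elbow choice K = 3; store cluster labels on df.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X)
```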
Fig. 3 K-means Clustering of Employees
Applying K-means clustering to the dataset, the resulting groups were visualised using average
monthly hours and satisfaction level as axes.
• Cluster 0 (Purple): Employees with a wide range of satisfaction levels who generally work
fewer hours each month.
• Cluster 1 (Yellow): Employees who are moderately satisfied with their jobs and whose average
monthly hours vary widely.
• Cluster 2 (Teal): Employees who are less satisfied with their jobs and who work more hours on
average each month.
The clusters show that employees can be segmented by job satisfaction and hours worked, pointing
to areas that retention strategies should target.
• Satisfaction Levels: Satisfaction varies within each cluster, suggesting that it is not uniform
across the workforce and can strongly affect turnover.
• Working Hours: The variation in average monthly hours within clusters suggests that working
hours influence job satisfaction and, in turn, turnover.
• Employee Segmentation: The clusters divide employees into groups with similar traits,
enabling more focused interventions.
Because employees in Cluster 2 work longer hours each month on average, they may be more prone
to burnout and attrition; rebalancing workloads might improve their retention.
Employees in Cluster 0, who work fewer hours each month, may face different workplace stressors
that affect their satisfaction and motivation, which can also lead to turnover.
These findings allow tailored strategies to be designed to raise satisfaction and reduce attrition.
5. Logistic Regression
5.1 Methodology
Logistic regression is a supervised learning method used for binary classification. It estimates the
probability that a given observation belongs to a particular class. The logistic function, also called the
sigmoid function, σ(z) = 1 / (1 + e^(−z)), maps any real number into the [0, 1] range, which can then
be interpreted as a probability.
Data Splitting and Model Training: An 80-20 split separated the dataset into training and testing sets,
so that the model could be trained on one portion of the data and evaluated on data it had not seen
before. The logistic regression model was allowed up to 1,000 iterations to ensure convergence.
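A minimal sketch of the split and fit. Encoding the categorical columns with pd.get_dummies, stratifying the split, and the random_state value are assumptions not spelled out in the report; max_iter=1000 matches the convergence setting described above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Features: all columns except the identifier and the target,
# with categorical variables one-hot encoded.
X = pd.get_dummies(df.drop(columns=["Employee_ID", "Attrition"]),
                   drop_first=True)
y = df["Attrition"]

# 80-20 train-test split, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Logistic regression, allowed up to 1000 iterations to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```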
5.2 Model Performance
The model is 50% accurate overall, meaning it predicts attrition correctly only half of the time.
Precision is 0.51 for class 0 (no attrition) and 0.49 for class 1 (attrition). Recall is 0.59 for class 0 and
0.41 for class 1. The F1-score, the harmonic mean of precision and recall, is 0.55 for class 0 and 0.44
for class 1.
The confusion matrix shows that the model correctly predicted 60 no-attrition cases (true negatives)
and 40 attrition cases (true positives), while misclassifying 42 no-attrition cases as attrition (false
positives) and 58 attrition cases as no attrition (false negatives).
Classification Report:
              precision    recall    f1-score    support
Class 0 (No)       0.51      0.59        0.55        102
Class 1 (Yes)      0.49      0.41        0.44         98
Accuracy                                 0.50        200

Confusion Matrix:
[[60 42]
 [58 40]]
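The report and matrix above are the standard outputs of scikit-learn's metrics module; a sketch assuming the fitted model and test split from the previous step:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

# Per-class precision, recall, and F1-score, plus overall accuracy.
print(classification_report(y_test, y_pred))

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```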
Fig. 4 Correlation Matrix
• Model Accuracy: A score of 50% means the model performs no better than random guessing. The
linear relationships that logistic regression captures may not fully reflect the complexity of the
factors driving attrition.
• Precision and Recall: The model's middling precision and recall show that it cannot reliably
distinguish employees who will stay from employees who will leave, possibly because the two
groups share many characteristics.
• Possible Causes of the Low Accuracy:
• Feature Overlap: Employees who stay and employees who leave may have similar characteristics,
making it hard for the model to separate the classes.
• Linearity Assumption: Logistic regression assumes a linear relationship between the features and
the log-odds of the target variable. If the true relationships are non-linear, this limits the model's
accuracy.
• Additional Factors: Attrition may also be driven by factors not captured in the dataset, such as
company culture, conditions in the external job market, or personal circumstances.
7. Conclusion
This study used a synthetic dataset to examine employee attrition, with the aim of identifying the main
drivers of turnover and building predictive models that can help companies design better retention
strategies. Exploratory data analysis (EDA) revealed important trends and relationships in the data; for
example, satisfaction level and average monthly hours worked had a strong effect on the attrition rate.
Unsupervised learning with K-means clustering identified three distinct groups of employees based on
job satisfaction and hours worked, highlighting issues that may make employees hard to retain.
Supervised learning with logistic regression achieved only modest accuracy, illustrating how difficult it
is to predict attrition with linear models and suggesting that more advanced methods are needed.
The most important findings of this study are the importance of job satisfaction and a healthy work-life
balance, and the potential for targeted interventions based on the clustering results. The logistic
regression results, on the other hand, show that attrition is a complex problem, probably influenced
by factors not included in the dataset.