Student Notebook HR Analysis
• shorturl.at/hkPR6
Data Science
• Data science is the field of exploring, manipulating, and analyzing data, and using data to
answer real-world questions.
• Statistics, Visualization, Deep Learning, and Machine Learning are important Data
Science concepts.
Data Scientist
• Someone who finds solutions to problems by analyzing big or small data using
appropriate tools, and then tells stories to communicate the findings to the relevant
stakeholders.
• Important qualities
– Curious
– Judgemental
– Argumentative
– Storyteller, i.e. finds something in the data and tells the whole world about the
findings
Problem Statement
Employee churn is a costly problem for companies: the cost of recruiting a new employee,
rather than retaining an existing one, can be very large.
A study shows that replacing an existing employee is expensive, with costs ranging anywhere
from 16% to 213% of the employee's salary.
Understanding why and when employees are likely to leave helps a company's HR
professionals focus on employee retention strategies, as well as plan new
hiring in advance.
Data Understanding
Data Ingestion
Data:
• https://www.kaggle.com/datasets/giripujar/hr-analytics?datasetId=11142&sortBy=voteCount
[Term] Data Ingestion
Data ingestion is the process of obtaining and importing data for immediate use or storage.
For this session we'll "ingest" the data from Google Drive and save it as a CSV file using the
following code.
#to download data
# !gdown 1XybZVhpRMBhZZCelDwH3dArKeDT1AxCS
!gdown 1g1nwk4k-h9FceEHKZc8ocfu_xp3xnZ8R
Downloading...
From: https://drive.google.com/uc?id=1g1nwk4k-h9FceEHKZc8ocfu_xp3xnZ8R
To: /content/hr.csv
0% 0.00/580k [00:00<?, ?B/s] 100% 580k/580k [00:00<00:00, 80.8MB/s]
[Library] NumPy
NumPy makes it easier to manipulate large, multi-dimensional arrays and matrices, and
provides high-level mathematical functions to operate on them.
[Library] Pandas
Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and
manipulation tool. Its name derives from "panel data", and it is often described as the
"Python Data Analysis Library".
Let's import numpy and pandas as we'll be using them a lot for the upcoming steps.
import numpy as np
import pandas as pd
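As a quick illustration of what these two libraries provide (a minimal sketch with made-up values, not part of the analysis):
# NumPy: vectorized math over arrays
hours = np.array([160, 150, 170])
print(hours.mean())  # 160.0
# Pandas: labeled, tabular data built on top of NumPy arrays
demo = pd.DataFrame({'employee': ['a', 'b', 'c'], 'hours': hours})
print(demo.head())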
[Term] DataFrame
A pandas DataFrame is a two-dimensional data structure, i.e. data is aligned in a tabular
fashion in rows and columns.
We will use the pandas function read_csv() to load the data from the saved CSV file into a
DataFrame. We will also use the method head() to display the first five rows of the dataframe.
df = pd.read_csv('hr.csv')
df.head()
We can call the info() method of the dataframe to print its information, including the
index, the columns (#) with their datatypes (dtype), the non-null value counts (will be
discussed later), and the memory usage.
## Use info() method to print dataframe information
### START CODE HERE ###
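A possible fill-in for the exercise above (a sketch; df.info() prints the summary directly):
# Print index, columns, dtypes, non-null counts, and memory usage
df.info()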
Numerical Variables
[Term] Numerical Variable
In statistics, a numeric variable (also called a quantitative variable) is a quantifiable
characteristic whose values are numbers. Numeric variables may be either continuous
(e.g. satisfaction_level in this dataset) or discrete (e.g. number_project).
We use the describe() method of a pandas dataframe to analyze the distribution of the
numeric variables.
# Summary statistics of the numerical (non-object) columns
round(df.describe(exclude='object'), 2)
df.salary.value_counts()
df.salary.unique()
df.promotion_last_5years.unique()
array([1, 0])
Missing Data
In programming we refer to missing values as null values.
We can use the following functions to identify missing values:
1. isnull()
2. notnull()
The output is a boolean value for each element, indicating whether that element is in fact
missing data.
Note:
• isnull() is an alias for isna()
• notnull() is an alias for notna()
Q: What will be the output when you add False and False?
False + False
0
In Python, booleans behave as integers (True is 1, False is 0), so summing a boolean mask
counts the True values. We can therefore take a column-wise sum using the method sum()
to get the count of missing values in each column.
• axis=0, operate down the rows (one result per column)
– E.g. making a pile of books on a table or floor
• axis=1, operate across the columns (one result per row)
– E.g. arranging books along a bookshelf
# Find null value in each column of dataframe using isnull()
### START CODE HERE ###
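A possible fill-in (the output below is the column-wise sum of the boolean mask):
# Count missing values in each column
df.isnull().sum()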
satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
Department 0
salary 29
age 11924
dtype: int64
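The percentages below come from a cell not shown here; presumably something like this sketch:
# Percentage of missing values per column
df.isnull().sum() / len(df) * 100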
satisfaction_level 0.000000
last_evaluation 0.000000
number_project 0.000000
average_montly_hours 0.000000
time_spend_company 0.000000
Work_accident 0.000000
left 0.000000
promotion_last_5years 0.000000
Department 0.000000
salary 0.193282
age 79.472141
dtype: float64
The age column contains around 80% missing data. Filling this column makes no sense,
therefore we will drop it. However, the salary column has only about 0.2% missing data,
so we can fill in these missing values. This technique is called imputation.
There are two ways to access a single column of a DataFrame
• df.column_name
• df["column_name"]
Q: How can we display the first five rows of the column salary?
# Display top 5 rows of salary column using head()
### START CODE HERE ###
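A possible fill-in (either column accessor works):
# Display the first five rows of the salary column
df['salary'].head()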
0 low
1 medium
2 medium
3 high
4 high
Name: salary, dtype: object
We can see NaN values in the salary, age and left columns; these are the missing values
which may hinder our further analysis.
• We need to replace the various missing-value markers (such as the empty string, ?,
null, etc.) with Python's default missing value marker, NaN, so that it will be easier
to handle the missing values of the dataset later. Let's see two different approaches
to convert missing values to Python's default missing value marker.
• Pass a list of these markers in the na_values parameter while reading the data
using pandas:
missing_values = ['', '?', 'null']  # example markers from the list above
df = pd.read_csv(
"hr.csv",
na_values=missing_values,
)
• Replace the markers after loading, using replace():
df = df.replace("?", np.nan)
– If you want to avoid assigning the new DataFrame to the same variable, you
can modify it "in place" using the inplace parameter.
In our dataset, missing values are already represented as Python's default missing value
marker. However, ? is also used as a missing value. Let's read the data again, treating ?
as missing data.
df = pd.read_csv('hr.csv', na_values = ['?', np.nan])
df.isna().sum()
satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 4
promotion_last_5years 0
Department 0
salary 29
age 11924
dtype: int64
Let's handle the missing data first and then we will do EDA.
• If we have missing values in the target column, we drop those rows. In our case, we
will drop the 4 rows having NaN in the left column.
• If a column contains more than 70% missing data we usually drop it (the exact
threshold depends on domain knowledge). In our case, the age column qualifies, so
we will drop it.
• In other cases, to handle missing values in a numerical feature we usually impute
them using the mean or median value:
– Distribution with outliers (skewed): use the median value to fill in the missing
values.
– Normally distributed without outliers: use the mean value to fill in the missing
values.
• For categorical data we mostly impute missing values using the mode. In our case the
salary column is categorical, so we will use the mode to fill in the missing data. (See
the sketch after this list.)
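A minimal sketch of the imputation rules above, on a hypothetical toy dataframe (numeric_col and cat_col are illustrative names, not columns of this dataset):
# Toy data: a numeric column with an outlier, and a categorical column
df_demo = pd.DataFrame({
    'numeric_col': [1.0, 2.0, np.nan, 100.0],
    'cat_col': ['a', 'a', None, 'b'],
})
# Numeric with outliers: fill with the median
df_demo['numeric_col'] = df_demo['numeric_col'].fillna(df_demo['numeric_col'].median())
# Categorical: fill with the mode
df_demo['cat_col'] = df_demo['cat_col'].fillna(df_demo['cat_col'].mode()[0])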
df.shape
(15004, 11)
# Drop the 4 rows with NaN in the target column 'left' (the original cell is not shown; presumably:)
df.dropna(subset=['left'], inplace=True)
df.shape
(15000, 11)
Drop the age column because it has more than 70% missing values.
df.drop(columns=['age'], inplace=True)
Impute the salary column with the mode value of the column, because it is a categorical
column.
df.salary.mode().values[0]
df.isna().sum()
satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
Department 0
salary 29
dtype: int64
df.salary = df.salary.fillna(df.salary.mode()[0])
df.isna().sum()
satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
Department 0
salary 0
dtype: int64
df.groupby('left').mean()
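The cell above averages the numeric columns within each value of left. Note that in recent pandas versions, mean() raises an error if non-numeric columns such as salary or Department are still present, so you may need to restrict it:
# Restrict the aggregation to numeric columns (needed in pandas >= 2.0)
df.groupby('left').mean(numeric_only=True)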
Another way:
# pd.crosstab(df.salary, df.left)
# pd.crosstab(df.salary, df.left).plot(kind='bar')
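The department chart referenced below is not shown in this export; presumably it was built the same way as the salary crosstab (a sketch):
# Presumed crosstab of Department against the left flag
pd.crosstab(df.Department, df.left).plot(kind='bar')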
From the above chart there seems to be some impact of department on employee retention,
but it is not major; hence we will ignore department in our analysis.
Conclusion
From the data analysis so far, we can conclude that we will use the following variables as
independent variables in our model:
• Satisfaction Level
• Average Monthly Hours
• Promotion Last 5 Years
• Salary
subdf = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years', 'salary']]
## Use head() to display the first five rows of the updated "salary" column
### START CODE HERE ###
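A possible fill-in for the final exercise (assuming subdf from the cell above):
# First five rows of the salary column in the subset
subdf['salary'].head()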