Student Notebook HR Analysis

The document discusses analyzing employee data to understand and predict employee churn. It covers ingesting data, exploring numerical and categorical variables, handling missing data, and imputing missing values. The goal is to help companies retain employees by identifying factors that predict churn.

Data Science
• Data science is the field of exploring, manipulating, and analyzing data, and using data to answer real-world questions.
• Statistics, Visualization, Deep Learning, and Machine Learning are important Data Science concepts.

Data Scientist
• Someone who finds solutions to problems by analyzing big or small data using appropriate tools, and then tells stories to communicate the findings to the relevant stakeholders.
• Important qualities
– Curious
– Judgemental
– Argumentative
– Storyteller, i.e. finds something in the data and tells the whole world about the findings

Problem Statement
Employee churn is a costly problem for companies: the cost of recruiting a new employee is far larger than the cost of retaining an existing one.
A study shows that replacing an existing employee is expensive, with costs ranging anywhere from 16% to 213% of the employee's salary.
Understanding why and when employees are likely to leave helps a company's HR professionals focus on employee retention strategies as well as plan new hiring in advance.

Data Understanding
Data Ingestion
Data:
• https://www.kaggle.com/datasets/giripujar/hr-analytics?datasetId=11142&sortBy=voteCount
[Term] Data Ingestion
Data ingestion is the process of obtaining and importing data for immediate use or storage.
For this session we'll "ingest" the data from Google Drive and save it as a CSV file using the following code:
# To download the data
# !gdown 1XybZVhpRMBhZZCelDwH3dArKeDT1AxCS
!gdown 1g1nwk4k-h9FceEHKZc8ocfu_xp3xnZ8R

Downloading...
From: https://drive.google.com/uc?id=1g1nwk4k-h9FceEHKZc8ocfu_xp3xnZ8R
To: /content/hr.csv
100% 580k/580k [00:00<00:00, 80.8MB/s]

[Library] NumPy
NumPy makes it easier to manipulate large, multi-dimensional arrays and matrices, and provides high-level mathematical functions to operate on them.
[Library] Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool. It stands for "Python Data Analysis Library".
Let's import numpy and pandas as we'll be using them a lot for the upcoming steps.
import numpy as np
import pandas as pd

[Term] Dataframe
A pandas Dataframe is a two-dimensional data structure i.e. data is aligned in a tabular
fashion in rows and columns.
We will use the pandas function read_csv() to load the data from the saved CSV file, and the method head() to display the first five rows of the dataframe.
df = pd.read_csv('hr.csv')
df.head()

   id     gender  age   hypertension  heart_disease  ever_married  work_type      Residence_type  avg_glucose_level  bmi   smoking_status   stroke
0  9046   Male    67.0  0             1              Yes           Private        Urban           228.69             36.6  formerly smoked  1
1  51676  Female  61.0  0             0              Yes           Self-employed  Rural           202.21             NaN   never smoked     1
2  31112  Male    80.0  0             1              Yes           Private        Rural           105.92             32.5  never smoked     1
3  60182  Female  49.0  0             0              Yes           Private        Urban           171.23             34.4  smokes           1
4  1665   Female  79.0  1             0              Yes           Self-employed  Rural           174.12             24.0  never smoked     1

We can call the info() method of the dataframe to print its information, including the index and column datatypes (dtype), the non-null counts (discussed later) and the memory usage.
## Use info() method to print dataframe information
### START CODE HERE ###

### END CODE HERE ###
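One possible fill for the blank above (the printed summary is omitted here):
# Print a concise summary: column names, dtypes, non-null counts and memory usage
df.info()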

Numerical Variables
[Term] Numerical Variable
In statistics, a numeric variable (also called a quantitative variable) is a quantifiable characteristic whose values are numbers. Numeric variables may be either continuous or discrete.
We use the describe() method of a pandas dataframe to analyze the distribution of the numeric variables:
# Summary statistics of the numerical (non-object) columns, rounded to 2 decimals
round(df.describe(exclude='object'), 2)

             id      age  hypertension  heart_disease  avg_glucose_level      bmi   stroke
count   5110.00  5110.00        5110.0        5110.00            5110.00  4909.00  5110.00
mean   36517.83    43.23           0.1           0.05             106.15    28.89     0.05
std    21161.72    22.61           0.3           0.23              45.28     7.85     0.22
min       67.00     0.08           0.0           0.00              55.12    10.30     0.00
25%    17741.25    25.00           0.0           0.00              77.24    23.50     0.00
50%    36932.00    45.00           0.0           0.00              91.88    28.10     0.00
75%    54682.00    61.00           0.0           0.00             114.09    33.10     0.00
max    72940.00    82.00           1.0           1.00             271.74    97.60     1.00

# Find stats of categorical data excluding int64 and float type


### START CODE HERE ###

### END CODE HERE ###
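One possible fill, excluding the numeric dtypes so that only the object (categorical) columns are summarized; df.describe(include='object') is equivalent:
# Count, unique, top and freq of each categorical column
df.describe(exclude=['int64', 'float64'])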

df.gender.value_counts()

Female    2994
Male      2115
Other        1
Name: gender, dtype: int64

df = df[df.salary != 'low']

df.salary.unique()

array(['medium', 'high', nan], dtype=object)

df.promotion_last_5years.unique()

array([1, 0])

Missing Data
In programming, missing values are referred to as null values.
We can use the following functions to identify missing values:
1. isnull()
2. notnull()

The output is a boolean mask indicating, for each element, whether the value is in fact missing.
Note:
• isnull() is an alias for isna()
• notnull() is an alias for notna()
Q: What will be the output when you add False and False?
False + False
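The answer is 0: Python booleans are a subclass of integers, with True behaving as 1 and False as 0. This is exactly why summing a boolean mask counts the missing values, for example:
False + False          # 0
True + True + False    # 2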

We can take a column-wise sum using the method sum() to get the count of the number of missing values in each column.
• axis=0: operate down the rows, giving one result per column
– e.g. making a pile of books on a table or floor
• axis=1: operate across the columns, giving one result per row
– e.g. arranging books on a bookshelf
# Find null value in each column of dataframe using isnull()
### START CODE HERE ###

### END CODE HERE ###

satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
Department 0
salary 29
age 11924
dtype: int64
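A possible fill for the blank above; since isnull() is an alias for isna() and sum() defaults to axis=0, the boolean mask is summed down the rows, giving one total per column:
# Count missing values in each column
df.isnull().sum()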

Let's see these numbers as percentages.

# Percentage of missing values per column
df.isna().sum() / len(df) * 100

satisfaction_level 0.000000
last_evaluation 0.000000
number_project 0.000000
average_montly_hours 0.000000
time_spend_company 0.000000
Work_accident 0.000000
left 0.000000
promotion_last_5years 0.000000
Department 0.000000
salary 0.193282
age 79.472141
dtype: float64

The age column contains around 80% missing data. Filling this column makes no sense, therefore we will drop it. However, the salary column has only about 0.2% missing data, so we can fill it in. This technique is called imputation.
There are two ways to access a single column of a DataFrame
• df.column_name
• df["column_name"]

Q: How can we display the first five rows of the column salary?
# Display top 5 rows of salary column using head()
### START CODE HERE ###

### END CODE HERE ###

0 low
1 medium
2 medium
3 high
4 high
Name: salary, dtype: object
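One possible solution, using dot notation; the bracket notation df['salary'].head() works equally well:
# First five rows of the salary column
df.salary.head()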

We can see NaN values in the salary, age and left columns; these missing values may hinder our further analysis.

Replacing missing values


• NaN (np.nan) is Python's default marker for missing values; it stands for "Not a Number".

• We need to replace the various other missing-value representations (such as an empty string, ?, null, etc.) with Python's default missing value marker, NaN, so that the missing values of a dataset are easier to handle later. Let's see two different approaches to convert missing values to the default marker.

– Approach 1: Replace the missing values while reading the data

• Make a list of different missing values, i.e. missing_values = ["?", "", "n/a", "--"]

• Pass this list in the na_values parameter while reading the data using pandas

df = pd.read_csv(
    "hr.csv",
    na_values=missing_values,
)

– Approach 2: Replace the missing values after reading the data

• Pandas provides a replace() method which can be used to replace the missing values

df = df.replace("?", np.nan)

– If you want to avoid assigning the new DataFrame to the same variable, you can do it "in place" using the inplace parameter

df.replace("?", np.nan, inplace=True)

In our dataset, missing values are already represented as Python's default missing value marker. However, ? also denotes a missing value here. Let's read the data again, treating ? as missing data.
df = pd.read_csv('hr.csv', na_values = ['?', np.nan])

df.isna().sum()

satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 4
promotion_last_5years 0
Department 0
salary 29
age 11924
dtype: int64

Let's handle the missing data first and then we will do EDA.
• If we have missing values in the target column, we drop those rows. In our case, we will drop the 4 rows having NaN in the left feature.

• If a column contains more than 70% missing data we usually drop it (the exact threshold depends on domain knowledge). In our case, we will drop the age column.

• In other cases, to handle missing values in a numerical feature we usually impute them using the mean or median value (a small sketch follows this list):
– Distribution with outliers: use the median value to fill in the missing values
– Distribution without outliers (e.g. normally distributed): use the mean value to fill in the missing values

• For categorical data we mostly impute missing values using the mode. In our case the salary column is categorical, so we will use the mode to fill in the missing data.
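Our remaining columns need no numerical imputation (age is simply dropped), but as an illustration only, mean/median imputation for a hypothetical numeric column named col could look like this:
# Median is robust to outliers; mean suits a distribution without outliers
df['col'] = df['col'].fillna(df['col'].median())  # when outliers are present
df['col'] = df['col'].fillna(df['col'].mean())    # when there are no outliers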

Drop rows where left has a missing value


# Find unique value in 'left' column using unique()
### START CODE HERE ###
### END CODE HERE ###

array([ 0., 1., nan])

# Print shape of dataframe


### START CODE HERE ###

### END CODE HERE ###

(15004, 11)

df.dropna(subset=['left'], inplace = True)

df.shape

(15000, 11)

Drop the age column because it has more than 70% missing values.
# 'columns=' already selects the axis, so no axis argument is needed
df.drop(columns=['age'], inplace=True)

Impute the salary column with the mode value of the column, because it is a categorical column.
df.salary.mode().values[0]

# Find number of null value in each column using isna()


### START CODE HERE ###

### END CODE HERE ###

satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
Department 0
salary 29
dtype: int64

df.salary = df.salary.fillna(df.salary.mode()[0])

df.isna().sum()

satisfaction_level 0
last_evaluation 0
number_project 0
average_montly_hours 0
time_spend_company 0
Work_accident 0
left 0
promotion_last_5years 0
Department 0
salary 0
dtype: int64

# Find unique value in 'salary' column


### START CODE HERE ###

### END CODE HERE ###

array(['low', 'medium', 'high'], dtype=object)

Data exploration and visualization


# import necessary library

## Use info() method to print dataframe information


### START CODE HERE ###

### END CODE HERE ###

## Use dtypes to print datatypes information


### START CODE HERE ###

### END CODE HERE ###

## Use pairplot() method to display pairplot of the dataframe


### START CODE HERE ###

### END CODE HERE ###

## Use barplot() method to display barplot of target


### START CODE HERE ###

### END CODE HERE ###
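One possible way to fill the blanks above, assuming seaborn and matplotlib are the intended plotting libraries (the target's "barplot" is sketched with countplot(), which draws one bar per category count):
import matplotlib.pyplot as plt
import seaborn as sns

df.info()    # column names, dtypes, non-null counts
df.dtypes    # datatype of each column

# Pairwise scatterplots of the numerical columns, colored by the target
sns.pairplot(df, hue='left')
plt.show()

# Bar plot of the target: how many employees left vs stayed
sns.countplot(x='left', data=df)
plt.show()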

# Mean of each numeric column, grouped by whether the employee left
df.groupby('left').mean(numeric_only=True)

From the above table we can draw the following conclusions:

• Satisfaction Level: satisfaction is noticeably lower among employees leaving the firm (0.44) than among retained ones (0.66)
• Average Monthly Hours: average monthly hours are higher among employees leaving the firm (207 vs 199 for retained employees)
• Promotion Last 5 Years: employees who were given a promotion are more likely to be retained at the firm
Impact of salary on employee retention
## Countplot salary based on left column
### START CODE HERE ###

### END CODE HERE ###
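One possible fill, assuming the seaborn import from the previous section:
# Count of employees in each salary band, split by whether they left
sns.countplot(x='salary', hue='left', data=df)
plt.show()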

Another way:
# pd.crosstab(df.salary, df.left)

# pd.crosstab(df.salary, df.left).plot(kind='bar')

Department-wise employee retention rate


plt.figure(figsize = (10, 4))
## Countplot department wise employee retention rate
### START CODE HERE ###

### END CODE HERE ###
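A possible fill for the blank, again assuming seaborn:
# Count of employees per department, split by the left flag
sns.countplot(x='Department', hue='left', data=df)
plt.show()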

From the above chart there seems to be some impact of department on employee retention, but it is not major, hence we will ignore department in our analysis.

Conclusion
From the data analysis so far we can conclude that we will use the following variables as independent variables in our model:
• Satisfaction Level
• Average Monthly Hours
• Promotion Last 5 Years
• Salary

subdf = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years', 'salary']]

How to handle the salary column?

Salary contains only text data. It needs to be converted to numbers, and we will use dummy variables for that.
Since we will apply a linear model, let's use one hot encoding.
## Use head() to display the first five rows of the 'subdf' dataframe
### START CODE HERE ###
### END CODE HERE ###

## Use dummy variable to convert "salary" text data to numerical data


### START CODE HERE ###

### END CODE HERE ###

## Use head() to display the first five rows of the updated "salary" columns
### START CODE HERE ###

### END CODE HERE ###

## Concatenate the 'subdf' dataframe and the updated 'salary' columns


### START CODE HERE ###

### END CODE HERE ###

## Remove 'salary' column from updated dataframe


### START CODE HERE ###

### END CODE HERE ###
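One possible way to fill the blanks above, using pd.get_dummies() for the one hot encoding; the names salary_dummies and df_with_dummies are illustrative:
subdf.head()

# One hot encode the salary text column into indicator columns
salary_dummies = pd.get_dummies(subdf.salary, prefix='salary')
salary_dummies.head()

# Concatenate the indicator columns and drop the original text column
df_with_dummies = pd.concat([subdf, salary_dummies], axis=1)
df_with_dummies = df_with_dummies.drop(columns=['salary'])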

Prepare training dataset

Fit and Evaluate the Model
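These two steps are left as the next exercise; a minimal sketch, assuming scikit-learn's logistic regression as the linear model and the df_with_dummies features from the sketch above:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df_with_dummies   # encoded features
y = df.left           # target: 1 if the employee left, 0 otherwise

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
model.fit(X_train, y_train)
model.score(X_test, y_test)  # accuracy on the held-out set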
