Data Wrangling Report
Introduction:
This document describes the data wrangling steps that I took to prepare the
IBM HR Analytics Employee Attrition & Performance dataset for the later stages
of the project. It explains which steps were performed on this dataset and how
missing values and outliers were handled.
Data Retrieval:
The dataset is publicly available on the Kaggle website and can be reached from this
link. I downloaded it in CSV format and read it into a Jupyter notebook after
importing the necessary libraries.
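The loading step can be sketched as follows. This is a minimal illustration that assumes pandas was the library used; the helper name `load_dataset` and the two sample rows are hypothetical stand-ins, not part of the original notebook or the real data.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read the attrition CSV from a file path or file-like object."""
    return pd.read_csv(source)


# Stand-in for the downloaded file: two invented rows, not real records.
sample_csv = io.StringIO(
    "Age,Attrition,Department\n"
    "41,Yes,Sales\n"
    "35,No,Research & Development\n"
)

df = load_dataset(sample_csv)
print(df.shape)
print(df.head())
```

In the notebook, the same helper would be called with the path of the CSV file downloaded from Kaggle instead of the in-memory sample.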
Data Specifications:
The dataset has 1470 rows and 35 columns. Each row is an observation for one
employee, and each column is a feature collected in order to help explain
employee attrition. By data type, the features consist of 27 integers and 8 objects. For
some features, it is important to understand what their coded values represent.
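A quick way to obtain these figures is sketched below, assuming the data is held in a pandas DataFrame. The toy frame uses invented values only to keep the example runnable; on the real data the same calls would report the row, column, and data type counts.

```python
import pandas as pd

# Toy frame with invented values, standing in for the real dataset.
df = pd.DataFrame({
    "Age": [41, 35, 29],
    "MonthlyIncome": [5993, 5130, 2090],
    "Attrition": ["Yes", "No", "No"],
})

# Number of observations (rows) and features (columns).
n_rows, n_cols = df.shape

# Split the features into numeric and non-numeric (text) columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```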
I searched for missing values in every feature of the dataset; all features have
1470 non-null entries. However, missing values can be encoded in a number of different
ways, such as zeros or question marks. For that reason, I checked for both disguised
missing values and duplicate rows in the dataset. The checks raised no issues, so I continued to the next step.
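These checks can be sketched as below, assuming pandas. The list of placeholder tokens is an assumption chosen for illustration, and the toy rows are invented so that each check has something to find; the original notebook may have checked different tokens.

```python
import pandas as pd

# Toy frame with invented values; rows 1 and 2 are deliberate duplicates.
df = pd.DataFrame({
    "Age": [41, 35, 35, 0],
    "Department": ["Sales", "?", "?", "Sales"],
})

# Explicit missing values (NaN / None) per column.
null_counts = df.isnull().sum()

# Disguised missing values in text columns (assumed list of tokens).
placeholders = ["?", "", "NA", "N/A"]
text_cols = df.select_dtypes(exclude="number")
placeholder_counts = text_cols.isin(placeholders).sum()

# Zeros in numeric columns, which may or may not be legitimate values.
zero_counts = (df.select_dtypes(include="number") == 0).sum()

# Fully duplicated rows (every repeat after the first occurrence).
duplicate_rows = int(df.duplicated().sum())

print(null_counts, placeholder_counts, zero_counts, duplicate_rows, sep="\n")
```

A zero is not automatically a missing value, so the zero counts are a prompt for a closer look at each feature rather than a verdict.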
I looked at 5 random sample records from the dataset to get a general sense of the
whole picture. Besides that, I explored the summary statistics of each feature, such as
the mean, standard deviation, and quartile values, in order to detect outliers. This
also gave me a general impression of the unique and most frequent values of each
attribute, in addition to their frequencies in the dataset. I double-checked some
of the features to make sure that everything was in order. Those results were also
fine.
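The exploration described above can be sketched as follows, assuming pandas. The 1.5 × IQR rule used to flag outliers is a common convention and an assumption here, since the text does not state which rule was applied; the toy values are invented.

```python
import pandas as pd

# Toy frame with invented values; 20000 is a deliberate extreme value.
df = pd.DataFrame({
    "MonthlyIncome": [2000, 2500, 3000, 3500, 4000, 20000],
    "Department": ["Sales", "Sales", "HR", "Sales", "HR", "Sales"],
})

# Five random records for a first impression of the data.
preview = df.sample(n=5, random_state=0)

# Mean, standard deviation, and quartiles for numeric features.
numeric_summary = df.describe()

# Count, number of unique values, top value, and its frequency for text features.
categorical_summary = df.describe(exclude=["number"])

# Flag values outside 1.5 * IQR of the quartiles (assumed rule).
q1 = df["MonthlyIncome"].quantile(0.25)
q3 = df["MonthlyIncome"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["MonthlyIncome"] < q1 - 1.5 * iqr) | (df["MonthlyIncome"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(numeric_summary)
print(categorical_summary)
print(outliers)
```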
I looked for uninformative features to drop from the dataset. “Over18”,
“StandardHours”, and “EmployeeCount” each had only one unique value across all
observations, so they carry no information that could help explain attrition. For that reason, I
dropped those three columns.
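One way to find and drop such single-valued columns is sketched below, assuming pandas. The toy rows are for illustration only; the approach detects constant columns generically rather than naming them by hand.

```python
import pandas as pd

# Toy frame for illustration; three columns hold a single repeated value.
df = pd.DataFrame({
    "Age": [41, 35, 29],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "EmployeeCount": [1, 1, 1],
})

# A column with exactly one unique value cannot distinguish between employees.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]

df = df.drop(columns=constant_cols)

print(constant_cols)
print(df.columns.tolist())
```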
To make it easier to use in the later steps, I recoded the response variable
(Attrition), which previously had the values “Yes” and “No”. They were assigned to 1 and 0
respectively. After that, I moved the response variable to the last column position.
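The recoding and reordering can be sketched as follows, assuming pandas. The toy rows are invented, and the original notebook may have used a different but equivalent approach.

```python
import pandas as pd

# Toy frame with invented values.
df = pd.DataFrame({
    "Age": [41, 35, 29],
    "Attrition": ["Yes", "No", "No"],
    "Department": ["Sales", "Sales", "Human Resources"],
})

# Recode the response variable: "Yes" -> 1, "No" -> 0.
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})

# Move the response variable to the last column position.
feature_cols = [col for col in df.columns if col != "Attrition"]
df = df[feature_cols + ["Attrition"]]

print(df)
```

Any value other than “Yes” or “No” would become NaN under `map`, so a check with `df["Attrition"].isnull().sum()` after the recoding is a cheap safeguard.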