Data Wrangling Report

This document summarizes the data wrangling steps performed on an IBM HR Analytics Employee Attrition & Performance dataset. The dataset contained 1470 rows and 35 columns describing employee features. Missing values and outliers were checked, and three useless columns were dropped. The response variable was reassigned from text to numeric values and moved to the last column. Object type features were changed to category types to reduce memory usage and increase processing speed.


Data Wrangling Report

Introduction:

This document describes the data wrangling steps I undertook to prepare the IBM
HR Analytics Employee Attrition & Performance dataset for the later stages of the
project. It explains which operations were performed on the dataset and how
missing values and outliers were handled.

Data Retrieval:

The dataset is hosted on the open-source Kaggle website and can be reached from this link. I
downloaded it in CSV format and read it into a Jupyter notebook after importing
the necessary libraries.
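The loading step can be sketched as below. In the project the CSV comes from Kaggle; here a tiny inline CSV stands in so the snippet is self-contained, and the column names shown are only a small subset of the real 35.

```python
import io
import pandas as pd

# Stand-in for the Kaggle CSV file (assumption: in the project the
# real file is read with pd.read_csv("<path-to-downloaded-csv>")).
csv_text = """Age,Attrition,Department
41,Yes,Sales
49,No,Research & Development
"""
df = pd.read_csv(io.StringIO(csv_text))

# With the real file this prints (1470, 35); here (2, 3).
print(df.shape)
```

With the actual download, `pd.read_csv` is pointed at the saved file path instead of the inline string.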

Data Specifications:

The dataset has 1470 rows and 35 columns. Each row is an observation for one
employee, and each column is a feature collected to help explain employee
attrition. The feature data types consist of 27 integers and 8 objects. Several
features are ordinal codes, so it is important to know what each level stands for;
the mapping is shown below.
Field                      1              2        3          4
-------------------------  -------------  -------  ---------  -----------
Education*                 Below College  College  Bachelor   Master
Environment Satisfaction   Low            Medium   High       Very High
Job Involvement            Low            Medium   High       Very High
Job Satisfaction           Low            Medium   High       Very High
Performance Rating         Low            Good     Excellent  Outstanding
Relationship Satisfaction  Low            Medium   High       Very High
Work Life Balance          Bad            Good     Better     Best

* For the ‘Education’ field, 5 stands for ‘Doctor’.
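The data-type breakdown stated above (27 integer columns, 8 object columns) can be verified with pandas; a minimal sketch on a stand-in frame with the same kinds of columns:

```python
import pandas as pd

# Small stand-in frame: two integer columns and one object column
# (the real dataset has 27 int64 and 8 object columns).
df = pd.DataFrame({
    "Age": [41, 49],
    "JobSatisfaction": [4, 2],
    "Department": ["Sales", "Research & Development"],
})

# Count columns per dtype; on the real data this shows
# int64: 27 and object: 8.
print(df.dtypes.value_counts())
```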
The list of attributes is presented below.
Data Preprocessing:

I searched for missing values in every feature of the dataset; all features appear to
have 1470 non-null entries. However, missing values can be encoded in a number of
different ways, such as zeroes or question marks. For that reason, I checked for both
disguised missing values and duplicate rows. Neither was present, so it was safe to
continue to the next step.

I inspected 5 random sample records from the dataset to get a general intuition about
the whole picture. Besides that, I explored the statistical attributes of each feature,
such as the mean, standard deviation, and interquartile values, in order to detect
outliers. This also gave a general impression of the unique and most frequent values
of each attribute, along with their frequencies. I double-checked some of the features
to make sure everything was in order; those results were also fine.
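A sketch of this inspection step, including a common IQR-based outlier check built from the quartiles that `describe` reports (the column and values here are illustrative stand-ins):

```python
import pandas as pd

# Stand-in numeric column; the real check covers every numeric feature.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068],
})

# Five random records for a first impression
sample = df.sample(5, random_state=0)

# Summary statistics: mean, std, quartiles
stats = df.describe()

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1 = stats.loc["25%", "MonthlyIncome"]
q3 = stats.loc["75%", "MonthlyIncome"]
iqr = q3 - q1
outliers = df[(df["MonthlyIncome"] < q1 - 1.5 * iqr) |
              (df["MonthlyIncome"] > q3 + 1.5 * iqr)]
print(len(outliers))
```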

I then inspected the dataset for useless features to drop. “Over18”,
“StandardHours”, and “EmployeeCount” each had only one unique value across all
observations, so they carry no information for the analysis. For that reason, I
dropped those three columns.
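Constant columns can be found and dropped generically; a minimal sketch with two of the constant columns from the real data:

```python
import pandas as pd

# Stand-in with two constant columns, like 'Over18' and
# 'StandardHours' in the real dataset.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with a single unique value carries no information
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

print(constant_cols)          # ['Over18', 'StandardHours']
print(df.columns.tolist())    # ['Age']
```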

To make the response variable (Attrition) usable in the later steps, I reassigned its
“Yes” and “No” values to 1 and 0, respectively. After that, I moved the response
variable to the last column position.
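The encoding and reordering can be sketched as:

```python
import pandas as pd

# Stand-in frame with the response variable first
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes"],
    "Age": [41, 49, 37],
})

# Encode the response: Yes -> 1, No -> 0
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})

# Move the response variable to the last column position
cols = [c for c in df.columns if c != "Attrition"] + ["Attrition"]
df = df[cols]

print(df.columns.tolist())    # ['Age', 'Attrition']
```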

After encoding Attrition, the dataset has 7 remaining object-type features:
'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
'MaritalStatus', and 'OverTime'. To reduce memory usage and speed up processing, I
converted these object columns to the category type. Memory usage was 402.0+ KB
before the conversion and 298.3 KB after.
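The conversion can be sketched as below; the repeated values in the stand-in frame mimic the low-cardinality columns of the real dataset, which is why the categorical representation is smaller.

```python
import pandas as pd

# Stand-in: low-cardinality object columns, as in the real data
df = pd.DataFrame({
    "Department": ["Sales", "Research & Development", "Sales"] * 100,
    "Gender": ["Male", "Female", "Male"] * 100,
})

before = df.memory_usage(deep=True).sum()

# Category dtype stores each distinct string once plus small codes,
# so low-cardinality object columns compress well
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)    # the category version is substantially smaller
```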
