Data Wrangling Report

This document summarizes the data wrangling steps performed on an IBM HR Analytics Employee Attrition & Performance dataset. The dataset contained 1470 rows and 35 columns describing employee features. Missing values and outliers were checked, and three useless columns were dropped. The response variable was reassigned from text to numeric values and moved to the last column. Object type features were changed to category types to reduce memory usage and increase processing speed.


Data Wrangling Report

Introduction:

This document describes the data wrangling steps I undertook to prepare the IBM
HR Analytics Employee Attrition & Performance dataset for the later stages of the
project. It explains which operations were performed on the dataset and how
missing values and outliers were handled.

Data Retrieval:

The dataset is hosted on the open-source Kaggle website and can be reached from this link. I
downloaded it in CSV format and read it into a Jupyter notebook after importing
the necessary libraries.
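The loading step can be sketched as below. In the project the CSV comes from Kaggle; here a tiny inline CSV stands in so the snippet is self-contained, and the column names shown are only a small subset of the real 35.

```python
import io
import pandas as pd

# Stand-in for the Kaggle CSV file (assumption: in the project the
# real file is read with pd.read_csv("<path-to-downloaded-csv>")).
csv_text = """Age,Attrition,Department
41,Yes,Sales
49,No,Research & Development
"""
df = pd.read_csv(io.StringIO(csv_text))

# With the real file this prints (1470, 35); here (2, 3).
print(df.shape)
```

With the actual download, `pd.read_csv` is pointed at the saved file path instead of the inline string.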

Data Specifications:

The dataset has 1470 rows and 35 columns. Each row is an observation for one
employee, and each column is a feature collected to help explain employee
attrition. The feature data types consist of 27 integers and 8 objects. Several
features are ordinal codes, so it is important to know what each level stands for;
the mapping is shown below.
Field                      1              2        3          4
-------------------------  -------------  -------  ---------  -----------
Education*                 Below College  College  Bachelor   Master
Environment Satisfaction   Low            Medium   High       Very High
Job Involvement            Low            Medium   High       Very High
Job Satisfaction           Low            Medium   High       Very High
Performance Rating         Low            Good     Excellent  Outstanding
Relationship Satisfaction  Low            Medium   High       Very High
Work Life Balance          Bad            Good     Better     Best

* For the ‘Education’ field, 5 stands for ‘Doctor’.
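The data-type breakdown stated above (27 integer columns, 8 object columns) can be verified with pandas; a minimal sketch on a stand-in frame with the same kinds of columns:

```python
import pandas as pd

# Small stand-in frame: two integer columns and one object column
# (the real dataset has 27 int64 and 8 object columns).
df = pd.DataFrame({
    "Age": [41, 49],
    "JobSatisfaction": [4, 2],
    "Department": ["Sales", "Research & Development"],
})

# Count columns per dtype; on the real data this shows
# int64: 27 and object: 8.
print(df.dtypes.value_counts())
```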
The list of attributes is presented below.
Data Preprocessing:

I searched for missing values in every feature of the dataset; all features appear to
have 1470 non-null entries. However, missing values can be encoded in a number of
different ways, such as zeroes or question marks. For that reason, I checked for both
disguised missing values and duplicate rows. Neither was present, so it was safe to
continue to the next step.

I inspected 5 random sample records from the dataset to get a general intuition about
the whole picture. Besides that, I explored the statistical attributes of each feature,
such as the mean, standard deviation, and interquartile values, in order to detect
outliers. This also gave a general impression of the unique and most frequent values
of each attribute, along with their frequencies. I double-checked some of the features
to make sure everything was in order; those results were also fine.
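A sketch of this inspection step, including a common IQR-based outlier check built from the quartiles that `describe` reports (the column and values here are illustrative stand-ins):

```python
import pandas as pd

# Stand-in numeric column; the real check covers every numeric feature.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068],
})

# Five random records for a first impression
sample = df.sample(5, random_state=0)

# Summary statistics: mean, std, quartiles
stats = df.describe()

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1 = stats.loc["25%", "MonthlyIncome"]
q3 = stats.loc["75%", "MonthlyIncome"]
iqr = q3 - q1
outliers = df[(df["MonthlyIncome"] < q1 - 1.5 * iqr) |
              (df["MonthlyIncome"] > q3 + 1.5 * iqr)]
print(len(outliers))
```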

I then inspected the dataset for useless features to drop. “Over18”,
“StandardHours”, and “EmployeeCount” each had only one unique value across all
observations, so they carry no information for the analysis. For that reason, I
dropped those three columns.
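Constant columns can be found and dropped generically; a minimal sketch with two of the constant columns from the real data:

```python
import pandas as pd

# Stand-in with two constant columns, like 'Over18' and
# 'StandardHours' in the real dataset.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column with a single unique value carries no information
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

print(constant_cols)          # ['Over18', 'StandardHours']
print(df.columns.tolist())    # ['Age']
```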

To make the response variable (Attrition) usable in the later steps, I reassigned its
“Yes” and “No” values to 1 and 0, respectively. After that, I moved the response
variable to the last column position.
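The encoding and reordering can be sketched as:

```python
import pandas as pd

# Stand-in frame with the response variable first
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes"],
    "Age": [41, 49, 37],
})

# Encode the response: Yes -> 1, No -> 0
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})

# Move the response variable to the last column position
cols = [c for c in df.columns if c != "Attrition"] + ["Attrition"]
df = df[cols]

print(df.columns.tolist())    # ['Age', 'Attrition']
```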

After encoding Attrition, the dataset has 7 remaining object-type features:
'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole',
'MaritalStatus', and 'OverTime'. To reduce memory usage and speed up processing, I
converted these object columns to the category type. Memory usage was 402.0+ KB
before the conversion and 298.3 KB after.
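The conversion can be sketched as below; the repeated values in the stand-in frame mimic the low-cardinality columns of the real dataset, which is why the categorical representation is smaller.

```python
import pandas as pd

# Stand-in: low-cardinality object columns, as in the real data
df = pd.DataFrame({
    "Department": ["Sales", "Research & Development", "Sales"] * 100,
    "Gender": ["Male", "Female", "Male"] * 100,
})

before = df.memory_usage(deep=True).sum()

# Category dtype stores each distinct string once plus small codes,
# so low-cardinality object columns compress well
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)    # the category version is substantially smaller
```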
