Data Preprocessing
An Overview:
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Standardization
Importing the necessary libraries and reading the .csv file
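A minimal sketch of this step, assuming pandas is used. The file name is assumed from the dataset title, and the column "satisfaction_level" and all values are hypothetical. A tiny inline sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. With the actual data this would be
# pd.read_csv("Human_Resources_Employee_Attrition.csv") (file name assumed).
SAMPLE_CSV = """satisfaction_level,left,department,salary
0.38,1,sales,low
0.80,0,hr,medium
0.11,1,technical,high
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# head() returns at most the first five rows; the sample has only three.
print(df.head())
```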
Understanding the dataset:
We have one dataset, titled “Human_Resources_Employee_Attrition”.
In human resources terminology, attrition refers to employees leaving the
company. Attrition is usually measured with a metric called the attrition
rate, which is the share of employees who leave the company over a given
period.
First five rows of the given dataset: df.head()
Dataset information:
In the given dataset, the salary and department columns have the object data type.
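The column types can be inspected as sketched below. The frame is a hypothetical three-row stand-in for the real data, and "satisfaction_level" is an assumed column name.

```python
import pandas as pd

# Hypothetical stand-in for the attrition dataset.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],  # assumed numeric column
    "left": [1, 0, 1],
    "department": ["sales", "hr", "technical"],
    "salary": ["low", "medium", "high"],
})

# Prints column names, non-null counts, and data types.
df.info()

# Columns that are not numeric and will need encoding or dropping.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```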
Identifying the target variable and the independent variables
The target (output/dependent) variable is the column “left” in the given
dataset.
In the “left” column, 0 means the employee is still working in the
organization and 1 means the employee has left the organization.
Next, we need to find the predictor (input/independent) variables, that is,
the columns whose values affect the dependent variable (“left”).
The department column does not affect the target (output) variable, so we
drop it.
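Dropping the column and separating the target from the predictors might look like the following sketch. The data is hypothetical, and the per-department attrition rate is only one possible way to check how much a column relates to the target.

```python
import pandas as pd

# Hypothetical stand-in for the attrition dataset.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],  # assumed column
    "left": [1, 0, 1, 0],
    "department": ["sales", "hr", "sales", "hr"],
    "salary": ["low", "medium", "high", "low"],
})

# Share of leavers per department, as a quick look at the relationship.
print(df.groupby("department")["left"].mean())

# Drop the column judged not to affect the target.
df = df.drop(columns=["department"])

# Target (dependent) variable and predictors (independent variables).
y = df["left"]
X = df.drop(columns=["left"])
```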
Finding null values
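A short sketch of counting null values per column. The sample deliberately contains two missing entries, and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing number and one missing category.
df = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11],  # assumed column
    "left": [1, 0, 1],
    "salary": ["low", "medium", None],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```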
The last column (‘salary’) is non-numerical, and it also affects the ‘left’
column, so we have to convert it to numerical data using “OneHotEncoder”.
The column contains three distinct values (‘low’, ‘medium’, ‘high’).
Converting character values to numerical values
Using StandardScaler to bring all the values to a similar scale
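A minimal sketch of standardization with scikit-learn's StandardScaler, which rescales each column to zero mean and unit variance. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric predictors on very different scales.
X = pd.DataFrame({
    "satisfaction_level": [0.2, 0.4, 0.6, 0.8],
    "average_monthly_hours": [120.0, 160.0, 200.0, 240.0],
})

scaler = StandardScaler()
# Each column becomes (value - column mean) / column standard deviation.
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled)
```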
Finding outliers after converting the values to a similar scale
There are some outliers here, which we then reduce by using Normalizer
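One common way to flag outliers is the interquartile range (IQR) rule that box plots use. It is shown here as an assumption, since the detection method is not stated above. The helper name iqr_outlier_mask and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical already-scaled values with one extreme point.
scaled = pd.Series([-0.5, -0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 4.0])

outliers = scaled[iqr_outlier_mask(scaled)]
print(outliers)
```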
Using Normalizer to reduce outliers
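A small sketch of what scikit-learn's Normalizer does: it rescales each row (sample) to unit norm, so every value ends up between -1 and 1. Whether this is a suitable outlier treatment depends on the data, because only the direction of each row is kept and its magnitude is discarded. The values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical rows; the last one is ten times larger than the first.
X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [30.0, 40.0],
])

# Divide each row by its Euclidean (L2) length.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(X_norm)
```

The first and last rows map to the same normalized values, which shows that the magnitude of a row is removed.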