Data Preprocessing

The document discusses the key steps in data preprocessing, including data cleaning, integration, reduction, and transformation. It describes common techniques for data cleaning like filling in missing values and removing outliers. Data integration involves combining multiple data sources. Data reduction includes dimensionality reduction and data compression. Data transformation techniques mentioned are normalization, standardization, and discretization.

Uploaded by

naveen kumar Malineni


An Overview: major tasks in data preprocessing for data quality

• Data cleaning
    - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
    - Integration of multiple databases, data cubes, or files
• Data reduction
    - Dimensionality reduction
    - Numerosity reduction
    - Data compression
• Data transformation and data discretization
    - Normalization
    - Standardization
Importing necessary libraries and reading the .csv file
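The slide's code is not preserved in this text, so the following is only a minimal sketch of the step, assuming pandas is used. The rows and the `satisfaction_level` column are hypothetical stand-ins for the real file; with the actual data, the path to the .csv file would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real .csv file so the sketch runs anywhere.
sample_csv = io.StringIO(
    "satisfaction_level,average_monthly_hours,left,department,salary\n"
    "0.38,157,1,sales,low\n"
    "0.80,262,1,sales,medium\n"
    "0.11,272,1,sales,medium\n"
    "0.72,223,0,technical,low\n"
    "0.37,159,0,support,high\n"
    "0.58,180,0,hr,high\n"
)

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(sample_csv)

# Show the first five rows, as on the slide.
print(df.head())
```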
Understanding the dataset:
• The dataset is titled “Human_Resources_Employee_Attrition”.
In human resources terminology, attrition refers to employees leaving the company. Attrition in a company is usually measured with a metric called the attrition rate, which measures the number of employees leaving the company.
First five rows of the dataset: df.head()
• Dataset information:

In this dataset, the salary and department columns have the object data type.
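One way to confirm which columns are non-numeric is sketched below, assuming a pandas DataFrame. The sample values are hypothetical; only the column names `salary`, `department`, `left` and `average_monthly_hours` come from the slides.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
df = pd.DataFrame(
    {
        "average_monthly_hours": [157, 262, 272, 223],
        "left": [1, 1, 1, 0],
        "department": ["sales", "sales", "sales", "technical"],
        "salary": ["low", "medium", "medium", "low"],
    }
)

# Print column names, non-null counts and data types.
df.info()

# Collect the columns that are not numeric (text columns such as salary).
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```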
Identifying the target variable and the independent variables
The target (output/dependent) variable is the column “left” in the given dataset.
In the “left” column, 0 means the employee is still working in the organization and 1 means the employee has left the organization.
Next we need to find the predictor (input/independent) variables, that is, the columns whose values affect the dependent variable “left”.
The “department” column does not affect the target (output) variable, so we drop it.
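A minimal sketch of this step, assuming pandas and hypothetical sample rows. Computing the attrition rate as the share of rows with `left == 1` is an assumption based on the definition given earlier, not something shown in the slides.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
df = pd.DataFrame(
    {
        "average_monthly_hours": [157, 262, 272, 223, 159, 180],
        "left": [1, 1, 1, 0, 0, 0],
        "department": ["sales", "sales", "sales", "technical", "support", "hr"],
        "salary": ["low", "medium", "medium", "low", "high", "high"],
    }
)

# Share of employees who left (left == 1), as a fraction between 0 and 1.
attrition_rate = df["left"].mean()
print(attrition_rate)

# Drop the department column, which the slides treat as not affecting "left".
df = df.drop(columns=["department"])
print(df.columns.tolist())
```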
Finding null values

There are no null values in the given dataset.
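A check like this is commonly done with `isnull().sum()` in pandas. The sketch below uses hypothetical rows, so it only illustrates the call, not the result for the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
df = pd.DataFrame(
    {
        "average_monthly_hours": [157, 262, 272, 223],
        "left": [1, 1, 1, 0],
        "salary": ["low", "medium", "medium", "low"],
    }
)

# Count missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# True when no column contains a missing value.
has_no_nulls = int(null_counts.sum()) == 0
```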

Showing how each variable is distributed, using histograms, before normalizing the data
Finding outliers using a boxplot

Many points appear as outliers here because the ‘average_monthly_hours’ column is on a very different scale from the other columns, so the data must be normalized after splitting it into dependent and independent variables.
Finding outliers using a boxplot

Here only four columns are used for detecting outliers, because these four are on the same scale of values.
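A boxplot conventionally marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically to one column, assuming pandas; the values are hypothetical and chosen so that one point is clearly extreme.

```python
import pandas as pd

# Hypothetical values for a single numeric column.
hours = pd.Series([150, 155, 160, 165, 170, 310], name="average_monthly_hours")

# Quartiles and interquartile range.
q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the usual boxplot convention.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits are the ones a boxplot would draw as outliers.
outliers = hours[(hours < lower) | (hours > upper)]
print(outliers.tolist())
```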
Splitting the dataset into dependent and independent variables

• x holds the independent variables

• y is the dependent variable
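A minimal sketch of the split, assuming pandas and that `left` is the target as stated earlier. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset (department already dropped).
df = pd.DataFrame(
    {
        "average_monthly_hours": [157, 262, 272, 223],
        "left": [1, 1, 1, 0],
        "salary": ["low", "medium", "medium", "low"],
    }
)

# x: every column except the target; y: the target column only.
x = df.drop(columns=["left"])
y = df["left"]

print(x.columns.tolist())
print(y.tolist())
```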

The last column (‘salary’) is non-numerical and it also affects the ‘left’ column, so we convert it to numerical data using “OneHotEncoder”, because this column contains three kinds of values (‘low’, ‘medium’, ‘high’).
Converting character values to numerical values
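The conversion can be sketched with scikit-learn's `OneHotEncoder`, which the slides name. The sample rows are hypothetical, and calling `.toarray()` on the default sparse result is just one way to get a dense array.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical salary values; the three categories come from the slides.
x = pd.DataFrame({"salary": ["low", "medium", "high", "low"]})

# Fit the encoder and turn each category into its own 0/1 column.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(x[["salary"]]).toarray()

# Categories are stored in sorted order, one output column per category.
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```

Each row of `encoded` contains a single 1 in the column of its category and 0 elsewhere.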
Using StandardScaler to convert all the values to a similar scale
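`StandardScaler` subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and unit variance. A small sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: two columns on very different scales.
values = np.array(
    [
        [0.38, 157.0],
        [0.80, 262.0],
        [0.11, 272.0],
        [0.72, 223.0],
    ]
)

# Standardize each column independently: (value - mean) / standard deviation.
scaled = StandardScaler().fit_transform(values)

print(scaled.mean(axis=0))  # column means after scaling
print(scaled.std(axis=0))   # column standard deviations after scaling
```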
Finding outliers after converting the values to a similar scale

Some outliers are still present, so we reduce them by using Normalizer, which rescales each row to unit norm.
Using Normalizer to reduce outliers
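For reference, scikit-learn's `Normalizer` works on rows rather than columns: with its default setting, each sample is divided by its own L2 norm. It changes the scale of the values and does not delete any rows. A small sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical rows; each row is one sample.
rows = np.array(
    [
        [3.0, 4.0],
        [1.0, 0.0],
        [6.0, 8.0],
    ]
)

# Divide every row by its Euclidean (L2) length.
normalized = Normalizer().fit_transform(rows)

# Every row now has length 1.
row_lengths = np.linalg.norm(normalized, axis=1)
print(normalized)
print(row_lengths)
```

Rows that point in the same direction, such as the first and third here, become identical after this step.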

After using Normalizer, the boxplot will be…

• A small number of outliers remain in the data after using Normalizer, so we use MinMaxScaler to reduce the remaining outliers.
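`MinMaxScaler` maps each column linearly onto the range [0, 1] using (value - min) / (max - min). Because the mapping is linear, it changes the range of a column but keeps the order and relative spacing of its values. A small sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data.
values = np.array([[150.0], [200.0], [310.0]])

# Rescale the column so its minimum becomes 0 and its maximum becomes 1.
scaled = MinMaxScaler().fit_transform(values)

print(scaled.ravel())
```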
Checking again for outliers after using MinMaxScaler

• The boxplot will be…

• Finally, all the outliers in the data have been reduced.

Thank you
