
Unit-2

Exploratory Data Analysis


Agenda
• Missing Values Treatment
• Handling Categorical Data: Mapping ordinal features, Encoding class labels, Performing one-hot encoding on nominal features
• Outlier Detection and Treatment
• Feature Engineering: Variable Transformation and Variable Creation, Selecting meaningful features
Why pre-processing?

• Real-world data are generally:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much the data can be trusted to be correct
• Interpretability: how easily the data can be understood?
Tasks in data pre-processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Noisy Data
Data Cleaning – Numeric Attribute
• Mean / Median / Mode Imputation
• Random Sample Imputation
• Capturing NAN Values With a New Feature
• End of Tail Distribution Imputation
• Arbitrary Imputation
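
A minimal sketch of a few of these numeric imputation strategies, assuming a hypothetical dataframe df with a salary column:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"salary": [35000.0, 42000.0, np.nan, 58000.0, np.nan, 61000.0]})

# Mean/median imputation: fill NaNs with a central value of the column
df["salary_median"] = SimpleImputer(strategy="median").fit_transform(df[["salary"]]).ravel()

# Capture the missingness itself as a new binary feature
df["salary_was_nan"] = df["salary"].isna().astype(int)

# Random sample imputation: fill NaNs with values drawn from the observed data
observed = df["salary"].dropna()
df["salary_random"] = df["salary"].copy()
df.loc[df["salary_random"].isna(), "salary_random"] = observed.sample(
    df["salary"].isna().sum(), random_state=0).values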
Categorical Data
• Discrete/Categorical Data: quantitative data that can be counted and has a finite number of possible values, or data that can be divided into groups
• e.g. days in a week, number of months in a year, Gender (Male/Female/Others), Grades (High/Medium/Low), etc.
Data Cleaning – Categorical Attribute
• Adding a Variable To Capture NAN
• Frequent Categorical Imputation
• Create a New Category (Random Category) for NAN Values
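
A minimal sketch of these categorical imputation options, assuming a hypothetical genre column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"genre": ["Pop", "Rock", np.nan, "Pop", np.nan, "Jazz"]})

# Add a variable to capture which rows were missing
df["genre_was_nan"] = df["genre"].isna().astype(int)

# Frequent-category imputation: fill NaNs with the most common label
df["genre_frequent"] = df["genre"].fillna(df["genre"].mode()[0])

# Or create an explicit new category for the missing values
df["genre_missing_cat"] = df["genre"].fillna("Missing")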
Feature Engineering on Numeric Data

• Numeric data can be directly fed into machine learning models.
• Still, we need to engineer features which are relevant to the scenario, problem, and domain before building a model.
• Typically these features indicate values or counts.
Feature Scaling Techniques

• Absolute Maximum Scaling
• Min-Max Scaling
• Normalization
• Standardization
• Robust Scaling
Absolute Maximum Scaling

• Find the absolute maximum value of the feature in the dataset
• Divide all the values in the column by that absolute maximum value

Problems:
If we do this for all the numerical columns, all their values will lie between -1 and 1. The main disadvantage is that the technique is sensitive to outliers. Consider the feature square feet: if 99% of the houses have a square-foot area of less than 1,000 and just one house has an area of 20,000, then all the other house values will be scaled down to less than 0.05.
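
A minimal sketch of absolute maximum scaling on a hypothetical square_feet column (scikit-learn's MaxAbsScaler implements the same idea):

import pandas as pd

df = pd.DataFrame({"square_feet": [850, 900, 760, 1000, 20000]})

# Divide every value by the absolute maximum of the column
df["square_feet_scaled"] = df["square_feet"] / df["square_feet"].abs().max()
# The single 20,000 sq ft outlier pushes all the other values below 0.05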
Min Max Scaling

In min-max scaling you subtract the minimum value of the feature from every value and then divide by the range of the dataset (maximum - minimum). The scaled values will lie between 0 and 1 in all cases, whereas in the previous case they were between -1 and +1. Again, this technique is also sensitive to outliers.
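
A minimal sketch using scikit-learn's MinMaxScaler on a hypothetical age column:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [18, 25, 32, 47, 60]})

# (x - min) / (max - min): the scaled values lie between 0 and 1
df["age_minmax"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()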
Normalization
What is variance?
• Variance is a measure of how notably a collection of data is spread out. If all the data values are identical, the variance is zero; all non-zero variances are positive. A small variance indicates that the data points are close to the mean and to each other, whereas a high variance indicates that the data points are highly spread out from the mean and from one another. In short, the variance is the average of the squared distances from each point to the mean.
What is Standard deviation?
Standard deviation is a measure of how much variation (spread, dispersion) from the mean exists. It indicates a "typical" deviation from the mean and is a popular measure of variability because it is expressed in the original units of the data set. Like the variance, a small standard deviation means the data points are close to the mean, whereas a large standard deviation means they are highly spread out from the mean. The standard deviation is the square root of the variance.
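
A minimal sketch, on hypothetical values, of how the variance and standard deviation are computed and how standardization (listed earlier among the scaling techniques) uses them:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])

variance = x.var()   # average of the squared distances from the mean
std_dev = x.std()    # square root of the variance, in the original units

# Standardization (z-scoring): (x - mean) / standard deviation
z = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()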
Robust Scaling

• In this method, you subtract the median from every data point and then divide by the interquartile range (IQR).
• The IQR is the distance between the 25th percentile and the 75th percentile.
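
A minimal sketch using scikit-learn's RobustScaler on a hypothetical income column with one extreme value:

import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({"income": [32000, 35000, 38000, 40000, 45000, 250000]})

# (x - median) / IQR: the 250,000 outlier barely shifts the scaled values of the rest
df["income_robust"] = RobustScaler().fit_transform(df[["income"]]).ravel()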
Binarization
• Convert count data into binary.
• For instance, if I am building a song recommendation system, I would just want to know whether a person is interested in or has listened to a particular song. This does not require the number of times a song has been listened to, since I am more concerned with the various songs he or she has listened to. In this case, a binary feature is preferred over a count-based feature.
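
A minimal sketch of binarizing a hypothetical listen_count feature with scikit-learn's Binarizer:

import pandas as pd
from sklearn.preprocessing import Binarizer

df = pd.DataFrame({"listen_count": [0, 3, 0, 17, 1]})

# Any count above 0 becomes 1: "has this user listened to the song at all?"
df["listened"] = Binarizer(threshold=0).fit_transform(df[["listen_count"]]).ravel()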
Scaling
• In most cases, the numerical features of a dataset do not share a common range and differ from each other. In real life, it makes no sense to expect the age and income columns to have the same range. But from a machine learning point of view, how can these two columns be compared?
• After a scaling process, the continuous features become identical in terms of range.
• Min-Max Normalization:
Xnorm = (X - Xmin) / (Xmax - Xmin)
Binning
• Often the distribution of values in features
will be skewed.
• some values will occur quite frequently
while some will be quite rare.
• For instance view counts of specific music
videos could be abnormally large and some
could be really small.
• Hence there are strategies to deal with this,
which include binning and transformations.
• Binning, also known as quantization, is used for transforming continuous numeric features into discrete ones (categories).
• Specific strategies of binning data include
fixed-width and adaptive binning
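
A minimal sketch of both binning strategies on hypothetical view counts, using pandas:

import pandas as pd

views = pd.Series([12, 300, 4500, 98, 7600, 120000, 2500000])

# Fixed-width binning: bins of equal value range
fixed_bins = pd.cut(views, bins=4, labels=["low", "medium", "high", "very high"])

# Adaptive (quantile) binning: each bin holds roughly the same number of observations
quantile_bins = pd.qcut(views, q=4, labels=["q1", "q2", "q3", "q4"])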
Statistical Transformations

• Log Transform
• The log transform belongs to the power transform family of functions.
This function can be mathematically represented as
• y = log_b(x), which is equivalent to b^y = x (b is the base of the logarithm)
• Log transforms are useful when applied to skewed distributions as
they tend to expand the values which fall in the range of lower
magnitudes and tend to compress or reduce the values which fall in
the range of higher magnitudes.
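
A minimal sketch of a log transform on skewed, hypothetical view counts; log1p(x) = log(1 + x) is commonly used so that zero values are handled safely:

import numpy as np
import pandas as pd

views = pd.Series([0, 10, 100, 1000, 100000, 10000000])

# Large values are compressed, small values are spread out
views_log = np.log1p(views)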
Box-Cox Transform

• The Box-Cox transform is another popular function belonging to the power transform family of functions. This function has a prerequisite that the numeric values to be transformed must be positive.
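
A minimal sketch using scipy's boxcox on hypothetical, strictly positive income values:

import numpy as np
from scipy import stats

income = np.array([20000, 25000, 30000, 45000, 90000, 250000])

# Box-Cox requires strictly positive inputs; the lambda parameter is fitted from the data
income_boxcox, fitted_lambda = stats.boxcox(income)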
Feature Engineering on Categorical Data

• Transformation of these categorical values into numeric labels
• Apply some encoding scheme on these values
• The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines.
Transforming Nominal Attribute
Encoding : One-hot Encoding Scheme
• Consider the numeric representation of a categorical attribute with m labels (after transformation).
• The one-hot encoding scheme encodes or transforms the attribute into m binary features, each of which can only contain a value of 1 or 0.
• Each observation in the categorical feature is thus converted into a vector of size m with only one of the values as 1 (indicating it as active).
Encoding : One-hot Encoding Scheme
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Fit the encoder on the 'Genre' column and get a dense array of 0/1 columns
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(mydf[['Genre']]).toarray()
# Use the encoder's own category order for the new column names
mylist = list(gen_ohe.categories_[0])
encoded_df = pd.DataFrame(gen_feature_arr, columns=mylist)
# Append the one-hot columns to the original dataframe
mydf1 = pd.concat([mydf, encoded_df], axis=1)
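
For quick exploration, pandas offers the same result in one step: pd.get_dummies(mydf, columns=['Genre']) returns a dataframe with the Genre column replaced by one-hot columns. The scikit-learn encoder is preferable inside pipelines because it remembers the categories seen during fitting.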
Outlier detection
• Outliers are observations that are far away from the other data points
in a random sample of a population.
• Outliers may reveal unexpected knowledge about a population, which also justifies their special handling during EDA.
Different types of outliers

• Outliers can be univariate or multivariate.
• Univariate outliers are extreme values in the distribution of a specific variable, whereas multivariate outliers are an unlikely combination of values in an observation.
• For example, a univariate outlier could be a human age measurement of
120 years or a temperature measurement in Antarctica of 50 degrees
Celsius.
• A multivariate outlier could be an observation of a human with a height
measurement of 2 meters (in the 95th percentile) and a weight
measurement of 50kg
• Both types of outliers can affect the outcome of an analysis but are
detected and treated differently.
Visualizing outliers

• A first and useful step in detecting univariate outliers is the visualization of a variable's distribution.
• An easy way to visually summarize the distribution of a variable is the box plot.
• In a box plot, introduced by John Tukey in 1970, the data is divided into quartiles. It usually shows a rectangular box representing 25%-75% of a sample's observations, extended by so-called whiskers that reach the minimum and maximum data entries that are not outliers. Observations shown outside of the whiskers are outliers.
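
A minimal sketch of drawing a box plot for a hypothetical age variable with one extreme value:

import matplotlib.pyplot as plt
import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

# The 120-year observation appears as a point beyond the whiskers
ages.plot(kind="box")
plt.show()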
Tukey’s box plot method

• Tukey distinguishes between possible and probable outliers. A possible outlier is located between the inner and the outer fence, whereas a probable outlier is located outside the outer fence.
• The inner fence (often confused with the whiskers) and the outer fence are usually not shown on the actual box plot, but they can be calculated using the interquartile range (IQR).
• IQR = Q3 - Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile
• Inner fence = [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
• Outer fence = [Q1 - 3*IQR, Q3 + 3*IQR]
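
A minimal sketch of computing the Tukey fences for the same hypothetical age values:

import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

inner_fence = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # possible outliers lie outside this
outer_fence = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # probable outliers lie outside this

possible_outliers = ages[(ages < inner_fence[0]) | (ages > inner_fence[1])]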
multivariate outliers
• When dealing with multivariate outliers, distance metrics can be
helpful for detection. With distance metrics, the distance between
two vectors is determined. These two vectors can be two different
observations (rows) or an observation (row) compared to the mean
vector (row of means of all columns). Distance metrics can be calculated independently of the number of variables (columns) in the dataset.
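
A minimal sketch, on hypothetical height/weight data, of measuring each observation's Mahalanobis distance from the mean vector; the 2 m / 50 kg row stands out even though neither value is extreme on its own:

import numpy as np

# Columns: height in metres, weight in kg
X = np.array([[1.70, 70], [1.80, 82], [1.65, 60], [1.75, 75], [2.00, 50]])

mean_vec = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance of every row from the mean vector
diff = X - mean_vec
distances = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
# The last observation (tall but very light) gets the largest distance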
Methods to Pre-Process Outliers

• Mean/Median or random imputation
• Trimming
Mean/Median or random Imputation

• Since outliers are in nature similar to missing data, any method used for missing-data imputation can be used to replace them. The number of outliers is small (otherwise, they would not be called outliers), so it is reasonable to use mean/median/random imputation to replace them.
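
A minimal sketch of replacing outliers (flagged with the Tukey fences from earlier) by the median of the remaining values, using the same hypothetical ages:

import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Replace the flagged outliers with the median of the non-outlying values
ages_imputed = ages.mask(is_outlier, ages[~is_outlier].median())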
Trimming:

• In this method, we discard the outliers completely, i.e., we eliminate the data points that are considered outliers. In situations where you will not be removing a large number of values from the dataset, trimming is a good and fast approach.
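
A minimal sketch of trimming, simply dropping the rows that fall outside the inner fence:

import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Keep only observations inside the inner fence; the 120-year entry is discarded
trimmed = ages[(ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)]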
