
Unit-2

Exploratory Data Analysis


Agenda
• Missing Values Treatment
• Handling Categorical Data: Mapping ordinal features, Encoding class labels, Performing one-hot encoding on nominal features
• Outlier Detection and Treatment
• Feature Engineering: Variable Transformation and Variable Creation, Selecting meaningful features
Why pre-processing?

• Real-world data are generally:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names

Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much the data can be trusted to be correct
• Interpretability: how easily the data can be understood?
Tasks in data pre-processing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Noisy Data
Data Cleaning – Numeric Attribute
• Mean / Median / Mode Imputation
• Random Sample Imputation
• Capturing NAN Values With a New Feature
• End of Tail Distribution Imputation
• Arbitrary Imputation
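
A minimal sketch of a few of these numeric imputation strategies, assuming a hypothetical dataframe df with a salary column:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"salary": [35000.0, 42000.0, np.nan, 58000.0, np.nan, 61000.0]})

# Mean/median imputation: fill NaNs with a central value of the column
df["salary_median"] = SimpleImputer(strategy="median").fit_transform(df[["salary"]]).ravel()

# Capture the missingness itself as a new binary feature
df["salary_was_nan"] = df["salary"].isna().astype(int)

# Random sample imputation: fill NaNs with values drawn from the observed data
observed = df["salary"].dropna()
df["salary_random"] = df["salary"].copy()
df.loc[df["salary_random"].isna(), "salary_random"] = observed.sample(
    df["salary"].isna().sum(), random_state=0).values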
Categorical Data
• Discrete/Categorical Data: quantitative data that can be counted and has a finite number of possible values, or data that can be divided into groups
• e.g. days in a week, number of months in a year, Gender (Male/Female/Others), Grades (High/Medium/Low), etc.
Data Cleaning – Categorical Attribute
• Adding a Variable To Capture NAN
• Frequent Categorical Imputation
• Create a New Category (Random Category) for NAN Values
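
A minimal sketch of these categorical imputation options, assuming a hypothetical genre column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"genre": ["Pop", "Rock", np.nan, "Pop", np.nan, "Jazz"]})

# Add a variable to capture which rows were missing
df["genre_was_nan"] = df["genre"].isna().astype(int)

# Frequent-category imputation: fill NaNs with the most common label
df["genre_frequent"] = df["genre"].fillna(df["genre"].mode()[0])

# Or create an explicit new category for the missing values
df["genre_missing_cat"] = df["genre"].fillna("Missing")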
Feature Engineering on Numeric Data

• Numeric data can be directly fed into machine learning models.
• Still, we need to engineer features which are relevant to the scenario, problem, and domain before building a model.
• Typically these features indicate values or counts.
Feature Scaling Techniques

• Absolute Maximum Scaling
• Min-Max Scaling
• Normalization
• Standardization
• Robust Scaling
Absolute Maximum Scaling

• Find the absolute maximum value of the feature in the dataset
• Divide all the values in the column by that absolute maximum value

Problems:
If we do this for all the numerical columns, all their values will lie between -1 and 1. The main disadvantage is that the technique is sensitive to outliers. Consider the feature square feet: if 99% of the houses have a square-foot area of less than 1,000 and just one house has an area of 20,000, then all the other house values will be scaled down to less than 0.05.
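
A minimal sketch of absolute maximum scaling on a hypothetical square_feet column (scikit-learn's MaxAbsScaler implements the same idea):

import pandas as pd

df = pd.DataFrame({"square_feet": [850, 900, 760, 1000, 20000]})

# Divide every value by the absolute maximum of the column
df["square_feet_scaled"] = df["square_feet"] / df["square_feet"].abs().max()
# The single 20,000 sq ft outlier pushes all the other values below 0.05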
Min Max Scaling

In min-max scaling you subtract the minimum value of the feature from every value and then divide by the range of the dataset (maximum - minimum). The scaled values will lie between 0 and 1 in all cases, whereas in the previous case they were between -1 and +1. Again, this technique is also sensitive to outliers.
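
A minimal sketch using scikit-learn's MinMaxScaler on a hypothetical age column:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [18, 25, 32, 47, 60]})

# (x - min) / (max - min): the scaled values lie between 0 and 1
df["age_minmax"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()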
Normalization
What is variance?
• Variance is a measure of how notably a collection of data is spread out. If all the data values are identical, the variance is zero; all non-zero variances are positive. A small variance indicates that the data points are close to the mean and to each other, whereas a high variance indicates that the data points are highly spread out from the mean and from one another. In short, the variance is the average of the squared distances from each point to the mean.
What is Standard deviation?
Standard deviation is a measure of how much variation (spread, dispersion) from the mean exists. It indicates a "typical" deviation from the mean and is a popular measure of variability because it is expressed in the original units of the data set. Like the variance, a small standard deviation means the data points are close to the mean, whereas a large standard deviation means they are highly spread out from the mean. The standard deviation is the square root of the variance.
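
A minimal sketch, on hypothetical values, of how the variance and standard deviation are computed and how standardization (listed earlier among the scaling techniques) uses them:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])

variance = x.var()   # average of the squared distances from the mean
std_dev = x.std()    # square root of the variance, in the original units

# Standardization (z-scoring): (x - mean) / standard deviation
z = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()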
Robust Scaling

• In this method, you subtract the median from every data point and then divide by the interquartile range (IQR).
• The IQR is the distance between the 25th percentile and the 75th percentile.
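
A minimal sketch using scikit-learn's RobustScaler on a hypothetical income column with one extreme value:

import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({"income": [32000, 35000, 38000, 40000, 45000, 250000]})

# (x - median) / IQR: the 250,000 outlier barely shifts the scaled values of the rest
df["income_robust"] = RobustScaler().fit_transform(df[["income"]]).ravel()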
Binarization
• Convert count data into binary.
• For instance, if I am building a song recommendation system, I would just want to know whether a person is interested in or has listened to a particular song. This does not require the number of times a song has been listened to, since I am more concerned with the various songs he or she has listened to. In this case, a binary feature is preferred over a count-based feature.
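
A minimal sketch of binarizing a hypothetical listen_count feature with scikit-learn's Binarizer:

import pandas as pd
from sklearn.preprocessing import Binarizer

df = pd.DataFrame({"listen_count": [0, 3, 0, 17, 1]})

# Any count above 0 becomes 1: "has this user listened to the song at all?"
df["listened"] = Binarizer(threshold=0).fit_transform(df[["listen_count"]]).ravel()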
Scaling
• In most cases, the numerical features of a dataset do not share a common range and differ from each other. In real life, it makes no sense to expect the age and income columns to have the same range. But from a machine learning point of view, how can these two columns be compared?
• After a scaling process, the continuous features become identical in terms of range.
• Min-Max Normalization:
Xnorm = (X - Xmin) / (Xmax - Xmin)
Binning
• Often the distribution of values in features
will be skewed.
• some values will occur quite frequently
while some will be quite rare.
• For instance view counts of specific music
videos could be abnormally large and some
could be really small.
• Hence there are strategies to deal with this,
which include binning and transformations.
• Binning, also known as quantization, is used for transforming continuous numeric features into discrete ones (categories).
• Specific strategies of binning data include
fixed-width and adaptive binning
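
A minimal sketch of both binning strategies on hypothetical view counts, using pandas:

import pandas as pd

views = pd.Series([12, 300, 4500, 98, 7600, 120000, 2500000])

# Fixed-width binning: bins of equal value range
fixed_bins = pd.cut(views, bins=4, labels=["low", "medium", "high", "very high"])

# Adaptive (quantile) binning: each bin holds roughly the same number of observations
quantile_bins = pd.qcut(views, q=4, labels=["q1", "q2", "q3", "q4"])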
Statistical Transformations

• Log Transform
• The log transform belongs to the power transform family of functions.
This function can be mathematically represented as
• y = log_b(x), which is equivalent to b^y = x (b is the base of the logarithm)
• Log transforms are useful when applied to skewed distributions as
they tend to expand the values which fall in the range of lower
magnitudes and tend to compress or reduce the values which fall in
the range of higher magnitudes.
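
A minimal sketch of a log transform on skewed, hypothetical view counts; log1p(x) = log(1 + x) is commonly used so that zero values are handled safely:

import numpy as np
import pandas as pd

views = pd.Series([0, 10, 100, 1000, 100000, 10000000])

# Large values are compressed, small values are spread out
views_log = np.log1p(views)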
Box-Cox Transform

• The Box-Cox transform is another popular function belonging to the power transform family of functions. This function has a prerequisite that the numeric values to be transformed must be positive.
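
A minimal sketch using scipy's boxcox on hypothetical, strictly positive income values:

import numpy as np
from scipy import stats

income = np.array([20000, 25000, 30000, 45000, 90000, 250000])

# Box-Cox requires strictly positive inputs; the lambda parameter is fitted from the data
income_boxcox, fitted_lambda = stats.boxcox(income)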
Feature Engineering on Categorical Data

• Transformation of these categorical values into numeric labels
• Apply some encoding scheme on these values
• The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines.
Transforming Nominal Attribute
Encoding : One-hot Encoding Scheme
• Consider the numeric representation of a categorical attribute with m labels (after transformation).
• The one-hot encoding scheme encodes or transforms the attribute into m binary features, each of which can only contain a value of 1 or 0.
• Each observation in the categorical feature is thus converted into a vector of size m with only one of the values as 1 (indicating it as active).
Encoding : One-hot Encoding Scheme
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Fit the encoder on the 'Genre' column and get a dense array of 0/1 columns
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(mydf[['Genre']]).toarray()
# Use the encoder's own category order for the new column names
mylist = list(gen_ohe.categories_[0])
encoded_df = pd.DataFrame(gen_feature_arr, columns=mylist)
# Append the one-hot columns to the original dataframe
mydf1 = pd.concat([mydf, encoded_df], axis=1)
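
For quick exploration, pandas offers the same result in one step: pd.get_dummies(mydf, columns=['Genre']) returns a dataframe with the Genre column replaced by one-hot columns. The scikit-learn encoder is preferable inside pipelines because it remembers the categories seen during fitting.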
Outlier detection
• Outliers are observations that are far away from the other data points
in a random sample of a population.
• Outliers may reveal unexpected knowledge about a population, which also justifies their special handling during EDA.
Different types of outliers

• Outliers can be univariate or multivariate.
• Univariate outliers are extreme values in the distribution of a specific variable, whereas multivariate outliers are an unlikely combination of values in an observation.
• For example, a univariate outlier could be a human age measurement of
120 years or a temperature measurement in Antarctica of 50 degrees
Celsius.
• A multivariate outlier could be an observation of a human with a height
measurement of 2 meters (in the 95th percentile) and a weight
measurement of 50kg
• Both types of outliers can affect the outcome of an analysis but are
detected and treated differently.
Visualizing outliers

• A first and useful step in detecting univariate outliers is the visualization of a variable's distribution.
• An easy way to visually summarize the distribution of a variable is the box plot.
• In a box plot, introduced by John Tukey in 1970, the data is divided into quartiles. It usually shows a rectangular box representing 25%-75% of a sample's observations, extended by so-called whiskers that reach the minimum and maximum data entries that are not outliers. Observations shown outside of the whiskers are outliers.
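
A minimal sketch of drawing a box plot for a hypothetical age variable with one extreme value:

import matplotlib.pyplot as plt
import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

# The 120-year observation appears as a point beyond the whiskers
ages.plot(kind="box")
plt.show()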
Tukey’s box plot method

• Tukey distinguishes between possible and probable outliers. A possible outlier is located between the inner and the outer fence, whereas a probable outlier is located outside the outer fence.
• The inner fence (often confused with the whiskers) and the outer fence are usually not shown on the actual box plot, but they can be calculated using the interquartile range (IQR).
• IQR = Q3 - Q1, where Q3 is the 75th percentile and Q1 is the 25th percentile
• Inner fence = [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
• Outer fence = [Q1 - 3*IQR, Q3 + 3*IQR]
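
A minimal sketch of computing the Tukey fences for the same hypothetical age values:

import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

inner_fence = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # possible outliers lie outside this
outer_fence = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # probable outliers lie outside this

possible_outliers = ages[(ages < inner_fence[0]) | (ages > inner_fence[1])]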
multivariate outliers
• When dealing with multivariate outliers, distance metrics can be
helpful for detection. With distance metrics, the distance between
two vectors is determined. These two vectors can be two different
observations (rows) or an observation (row) compared to the mean
vector (row of means of all columns). Distance metrics can be calculated independently of the number of variables (columns) in the dataset.
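
A minimal sketch, on hypothetical height/weight data, of measuring each observation's Mahalanobis distance from the mean vector; the 2 m / 50 kg row stands out even though neither value is extreme on its own:

import numpy as np

# Columns: height in metres, weight in kg
X = np.array([[1.70, 70], [1.80, 82], [1.65, 60], [1.75, 75], [2.00, 50]])

mean_vec = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance of every row from the mean vector
diff = X - mean_vec
distances = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
# The last observation (tall but very light) gets the largest distance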
Methods to Pre-Process Outliers

• Mean/Median or random imputation
• Trimming
Mean/Median or random Imputation

• Since outliers are in nature similar to missing data, any method used for missing-data imputation can be used to replace them. The number of outliers is small (otherwise, they would not be called outliers), so it is reasonable to use mean/median/random imputation to replace them.
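
A minimal sketch of replacing outliers (flagged with the Tukey fences from earlier) by the median of the remaining values, using the same hypothetical ages:

import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Replace the flagged outliers with the median of the non-outlying values
ages_imputed = ages.mask(is_outlier, ages[~is_outlier].median())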
Trimming:

• In this method, we discard the outliers completely, i.e., we eliminate the data points that are considered outliers. In situations where you will not be removing a large number of values from the dataset, trimming is a good and fast approach.
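
A minimal sketch of trimming, simply dropping the rows that fall outside the inner fence:

import pandas as pd

ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 41, 120])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Keep only observations inside the inner fence; the 120-year entry is discarded
trimmed = ages[(ages >= q1 - 1.5 * iqr) & (ages <= q3 + 1.5 * iqr)]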
