Unit-2Exploratory-Analysis
Unit-2Exploratory-Analysis
Problems:
If we do this for all the numerical columns, then all their values will lie between -1 and
1. The main disadvantage is that the technique is sensitive to outliers. Like consider
the feature *square feet*, if 99% of the houses have square feet area of less than
1000, and even if just 1 house has a square feet area of 20,000, then all those other
house values will be scaled down to less than 0.05.
Min Max Scaling
• In this method, you need to subtract all the data points with the
median value and then divide it by the Inter Quartile Range(IQR)
value.
• IQR is the distance between the 25th percentile point and the 50th
percentile point.
Binarization
• Convert count data into binary.
• For instance if I’m building a recommendation system for song
recommendations, I would just want to know if a person is interested
or has listened to a particular song. This doesn’t require the number
of times a song has been listened to since I am more concerned about
the various songs he\she has listened to. In this case, a binary feature
is preferred as opposed to a count based feature
Scaling
• In most cases, the numerical features of the dataset do not have a
certain range and they differ from each other. In real life, it is
nonsense to expect age and income columns to have the same range.
But from the machine learning point of view, how these two columns
can be compared?
• The continuous features become identical in terms of the range, after
a scaling process
• Min Max Normalization
Xnorm = X-Xmin/Xmax -Xmin
Binning
• Often the distribution of values in features
will be skewed.
• some values will occur quite frequently
while some will be quite rare.
• For instance view counts of specific music
videos could be abnormally large and some
could be really small.
• Hence there are strategies to deal with this,
which include binning and transformations.
• Binning, also known as quantization is used
for transforming continuous numeric
features into discrete ones (categories).
• Specific strategies of binning data include
fixed-width and adaptive binning
Statistical Transformations
• Log Transform
• The log transform belongs to the power transform family of functions.
This function can be mathematically represented as
• Y = logb(x) == by = X
• Log transforms are useful when applied to skewed distributions as
they tend to expand the values which fall in the range of lower
magnitudes and tend to compress or reduce the values which fall in
the range of higher magnitudes.
Box-Cox Transform ****
• outliers are in nature similar to missing data, then any method used
for missing data imputation can we used to replace outliers. The
number are outliers are small (otherwise, they won't be called
outliers), and it's reasonable to use mean/median/random
imputation to replace them.
Trimming: