Data Exploration and Visualization Unit 1
Exploratory Data Analysis (EDA) is a critical step in understanding and analyzing data
before any modeling or formal hypothesis testing. EDA involves summarizing the main
characteristics of data, identifying patterns, detecting anomalies, and testing assumptions.
Below are detailed notes on the sub-topics for a better understanding of EDA.
When analyzing a single variable, we are interested in summarizing and understanding the
characteristics of the data associated with that variable. This is done using basic statistical
tools like mean, median, mode, range, etc.
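As a minimal sketch of these single-variable summaries, the snippet below uses Python's standard `statistics` module. The `scores` list is hypothetical data made up for illustration; it does not come from these notes.

```python
import statistics

# Hypothetical test scores, chosen only for illustration.
scores = [70, 85, 85, 90, 60]

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value
score_range = max(scores) - min(scores)   # largest minus smallest value

print(mean_score, median_score, mode_score, score_range)
```

Here the median and mode are both 85 while the mean is 78, because the single low score of 60 pulls the average down.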
2. Distribution Variables
A distribution represents how frequently values occur within a dataset. Understanding the
shape of the distribution is key to analyzing the behavior of the data.
Definition: The distribution of a variable shows how its values are spread or clustered
across different intervals.
Shape of a Distribution:
o Symmetrical: Values are mirrored around the center (e.g., Normal Distribution).
o Skewed: Values are concentrated on one side, with a long tail on the other.
Right-skewed: Long tail on the right.
Left-skewed: Long tail on the left.
o Kurtosis: A measure (not a type of distribution) of the "tailedness" of the distribution, i.e., how heavy its tails are.
Example: If we examine the income of individuals, we might find that most people
earn around a certain amount, but a few individuals earn significantly more, leading to
a right-skewed distribution.
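One way to check the income example numerically is to compare the mean with the median and to compute a skewness value. The sketch below assumes a small hypothetical `incomes` list (in thousands) and uses the moment-based population skewness, which is only one of several common definitions.

```python
import statistics

# Hypothetical annual incomes (in thousands); the last value is one high earner.
incomes = [30, 32, 35, 36, 38, 40, 42, 45, 50, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Moment-based (population) skewness: the average cubed z-score.
sigma = statistics.pstdev(incomes)
skewness = sum(((x - mean_income) / sigma) ** 3 for x in incomes) / len(incomes)

print(mean_income, median_income, skewness)
```

A mean well above the median, together with a positive skewness, is the typical signature of a right-skewed distribution.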
Numerical summaries describe the central tendency (level) and the spread of data.
Central Tendency (Level): Describes where the center of the data lies.
o Mean, Median, Mode (as explained earlier).
Spread: Describes how far data points are from the center.
o Range: The difference between the maximum and minimum values.
Formula: Range = Max – Min
o Variance: Measures the spread of data points around the mean.
Formula: $\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
o Standard Deviation:
The square root of the variance, providing a measure of the typical distance of
data points from the mean.
Example:
For the data 2, 4, 6, 8, the mean is 5 and the variance is (9 + 1 + 1 + 9) / 4 = 5,
so the standard deviation is: √5 ≈ 2.24
Example (Numerical Summaries):
Given the data 55, 60, 65, 70, 75, calculate the mean, range, variance,
and standard deviation:
Mean = (55 + 60 + 65 + 70 + 75) / 5 = 65
Range = 75 - 55 = 20
Variance = (100 + 25 + 0 + 25 + 100) / 5 = 50
Standard Deviation = √50 ≈ 7.07
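The worked example above can be reproduced with the standard library, as sketched below. The notes divide by n, which matches the population functions `statistics.pvariance` and `statistics.pstdev`. The sample variant (dividing by n - 1) is shown only for contrast and is not used in the notes.

```python
import statistics

data = [55, 60, 65, 70, 75]

mean = statistics.mean(data)
data_range = max(data) - min(data)

# Population variance and standard deviation (divide by n), as in the notes.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Sample variance (divide by n - 1), shown only for comparison.
sample_variance = statistics.variance(data)

print(mean, data_range, variance, round(std_dev, 2), sample_variance)
```

The sample variance comes out larger (62.5 rather than 50), so it is worth stating which convention a reported figure uses.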
Definition:
Scaling and standardizing are preprocessing techniques used to adjust the values of numerical
variables to make them comparable, especially when different variables have different units
or scales.
Scaling:
Scaling adjusts the range of data to a specific range, usually between 0 and 1. This is done
using Min-Max Scaling.
Min-Max Scaling Formula: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Example:
If data has a range from 10 to 100, and the value is 40, the scaled value is:
$\frac{40 - 10}{100 - 10} = \frac{30}{90} \approx 0.33$
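A minimal sketch of the same calculation in code follows. The function name `min_max_scale` and the explicit `lo`/`hi` parameters are illustrative choices, not part of the notes.

```python
def min_max_scale(x, lo, hi):
    """Map x from the interval [lo, hi] onto [0, 1]."""
    if hi == lo:
        raise ValueError("min and max are equal; scaling is undefined")
    return (x - lo) / (hi - lo)

# The example from the text: value 40 in data ranging from 10 to 100.
scaled = min_max_scale(40, 10, 100)
print(round(scaled, 2))  # 0.33

# Scaling a whole list using its own minimum and maximum.
values = [10, 40, 100]
scaled_values = [min_max_scale(v, min(values), max(values)) for v in values]
```

The guard for `hi == lo` covers the case where every value is identical, which would otherwise cause a division by zero.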
Standardizing:
Standardizing transforms data so that it has a mean of 0 and a standard deviation of 1. This is
useful when comparing variables with different units or ranges.
Z-score Standardization Formula: $z = \frac{x - \mu}{\sigma}$
Example:
For a value of 50, with a mean of 40 and a standard deviation of 10, the z-score is:
$z = \frac{50 - 40}{10} = 1$
This means the value is 1 standard deviation above the mean.
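The z-score example can be sketched as below. The helper name `z_score` is illustrative. The second part standardizes the list 55, 60, 65, 70, 75 from the earlier numerical-summaries example using the population standard deviation, which is an assumption consistent with the variance formula in these notes.

```python
import statistics

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    if std == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return (x - mean) / std

# The example from the text: value 50, mean 40, standard deviation 10.
z = z_score(50, 40, 10)
print(z)  # 1.0

# Standardizing a whole dataset (population standard deviation assumed).
data = [55, 60, 65, 70, 75]
mu = statistics.mean(data)
sigma = statistics.pstdev(data)
standardized = [z_score(x, mu, sigma) for x in data]
```

After standardizing, the values have a mean of 0 and a standard deviation of 1 (up to floating-point rounding), which is what makes variables with different units comparable.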
5. Inequality
Definition:
Inequality measures how evenly or unevenly values are distributed, often used in economics
to describe income or wealth distribution.
Key Metrics:
Gini Coefficient:
A measure of inequality that ranges from 0 (perfect equality) to 1 (perfect inequality).
It is often used to measure income or wealth inequality.
Example:
A high Gini coefficient (e.g., 0.7) indicates high inequality, while a low Gini
coefficient (e.g., 0.2) indicates more equality.
Lorenz Curve:
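A graphical representation of inequality. It plots the cumulative percentage of total
income or wealth against the cumulative percentage of the population, ordered from
poorest to richest. The further the curve bows below the diagonal line of perfect
equality, the greater the inequality.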
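The two metrics are linked: the Gini coefficient can be computed as 1 minus twice the area under the Lorenz curve. The sketch below illustrates this with the trapezoid rule. The function names and the small income lists are hypothetical, and other equivalent Gini formulas exist.

```python
def lorenz_points(values):
    """Return (cumulative population share, cumulative income share) pairs,
    with the population ordered from poorest to richest."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def gini(values):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve)."""
    pts = lorenz_points(values)
    # Trapezoid rule over consecutive Lorenz points.
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area

equal = [10, 10, 10, 10]    # hypothetical: everyone earns the same
unequal = [0, 0, 0, 100]    # hypothetical: one person earns everything

print(gini(equal), gini(unequal))
```

With a finite sample of n people, the most unequal case gives (n - 1) / n rather than exactly 1, so `unequal` yields 0.75 here.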
Example Question
"Given the dataset of test scores, calculate the mean, median, mode, range, variance, and
standard deviation. Then, apply a 3-point moving average to smooth the data and comment
on the results."
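The question does not supply a dataset, so the sketch below uses hypothetical scores. It assumes a simple (unweighted) 3-point moving average with no padding at the ends, so the smoothed series is two values shorter than the input. Other conventions, such as centered or padded windows, are also common.

```python
def moving_average(values, window=3):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical test scores, made up for illustration.
scores = [60, 75, 90, 70, 85, 65]
smoothed = moving_average(scores)

print([round(s, 2) for s in smoothed])
```

The smoothed values vary less than the raw scores, which is the kind of observation the "comment on the results" part of the question asks for.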