Data Exploration and Visualization Unit 1

Exploratory Data Analysis (EDA) is essential for understanding data characteristics before modeling. It involves single-variable analysis, understanding distributions, and computing numerical summaries. Key concepts include measures of central tendency (mean, median, mode), measures of spread (range, variance, standard deviation), and techniques for scaling and standardizing data. EDA also addresses inequality through metrics such as the Gini coefficient and the Lorenz curve.


Unit 1: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a critical step in understanding and analyzing data
before any modeling or formal hypothesis testing. EDA involves summarizing the main
characteristics of data, identifying patterns, detecting anomalies, and testing assumptions.
Below are detailed notes on the sub-topics for a better understanding of EDA.

1. Introduction to Single Variables

When analyzing a single variable, we are interested in summarizing and understanding the
characteristics of the data associated with that variable. This is done using basic statistical
tools like mean, median, mode, range, etc.

 Definition: A single-variable analysis focuses on exploring and summarizing one variable from the dataset.
 Common Measures:
o Mean: The average value of the data points.
 Formula: Mean = Σ xᵢ / n
 Example: For the values 4, 5, 7, the mean is (4 + 5 + 7) / 3 ≈ 5.33

o Median: The middle value when the data is ordered.


 Example: For the values 3, 7, 9, the median is 7.
o Mode: The most frequent value in the data.
 Example: In the dataset 2, 3, 3, 4, the mode is 3 (a short code sketch for these measures follows this list).
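
The following is a minimal Python sketch of these three measures, using only the standard library's statistics module (the sample values are illustrative, not part of the original notes):

import statistics

data = [2, 3, 3, 4]

print(statistics.mean(data))    # 3.0 -> arithmetic average
print(statistics.median(data))  # 3.0 -> middle value of the sorted data
print(statistics.mode(data))    # 3   -> most frequent value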

2. Distributions of Variables

A distribution represents how frequently values occur within a dataset. Understanding the
shape of the distribution is key to analyzing the behavior of the data.

 Definition: The distribution of a variable shows how its values are spread or clustered
across different intervals.
 Types of Distributions:
o Symmetrical: Data is spread evenly around the center (e.g., the Normal Distribution).
o Skewed: Data is concentrated on one side.
 Right-skewed: Long tail on the right.
 Left-skewed: Long tail on the left.
o Kurtosis: Not a distribution shape on its own, but a measure of the "tailedness" (heaviness of the tails) of a distribution.

 Example: If we examine the income of individuals, we might find that most people
earn around a certain amount, but a few individuals earn significantly more, leading to
a right-skewed distribution.
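
To illustrate the income example above in code, the sketch below (assuming NumPy and SciPy are installed; the lognormal parameters are arbitrary) generates right-skewed data and checks that the mean sits above the median:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=0.8, size=1000)  # right-skewed "income" data

print(round(incomes.mean()))      # pulled upward by the long right tail
print(round(np.median(incomes)))  # typically below the mean for right-skewed data
print(round(skew(incomes), 2))    # positive skewness confirms the right skew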

3. Numerical Summaries of Level and Spread

Numerical summaries are essential in summarizing the central tendency (level) and the
spread of data.

 Central Tendency (Level): Describes where the center of the data lies.
o Mean, Median, Mode (as explained earlier).
 Spread: Describes how far data points are from the center.
o Range: The difference between the maximum and minimum values.
 Formula: Range = Max – Min
o Variance: Measures the spread of data points around the mean.
 Formula: Variance = Σᵢ (xᵢ − μ)² / n, where the sum runs over all n data points and μ is the mean

 Example: For the data 2, 4, 6, 8, the mean is 5. Variance:

   [(2 − 5)² + (4 − 5)² + (6 − 5)² + (8 − 5)²] / 4 = (9 + 1 + 1 + 9) / 4 = 5

o Standard Deviation:
The square root of the variance, providing a measure of the typical distance of
data points from the mean.

𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 = √Variance

Example:
For the data 2, 4, 6, 8, the standard deviation is: √5 ≈ 2.24
Example (Numerical Summaries):

Given the data 55, 60, 65, 70, 75, calculate the mean, range, variance,
and standard deviation:

 Mean = 65
 Range = 75 - 55 = 20
 Variance = 50
 Standard Deviation = √50 ≈ 7.07
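
A quick way to check this worked example is the standard library's statistics module; the population functions (pvariance, pstdev) are used to match the divide-by-n formula above:

import statistics

data = [55, 60, 65, 70, 75]

print(statistics.mean(data))       # 65
print(max(data) - min(data))       # 20      -> range
print(statistics.pvariance(data))  # 50      -> population variance (divides by n)
print(statistics.pstdev(data))     # 7.07... -> square root of the variance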

4. Scaling and Standardizing

Definition:
Scaling and standardizing are preprocessing techniques used to adjust the values of numerical
variables to make them comparable, especially when different variables have different units
or scales.

Scaling:

Scaling adjusts the range of data to a specific range, usually between 0 and 1. This is done
using Min-Max Scaling.

 Min-Max Scaling Formula: x′ = (x − Min(x)) / (Max(x) − Min(x))

 Example:
If data has a range from 10 to 100, and the value is 40, the scaled value is:

   x′ = (40 − 10) / (100 − 10) ≈ 0.33
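
A minimal sketch of Min-Max scaling in Python (the helper name min_max_scale and the sample values are illustrative):

def min_max_scale(values):
    # Rescale every value into the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 40, 100]))  # [0.0, 0.333..., 1.0]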

Standardizing:

Standardizing transforms data so that it has a mean of 0 and a standard deviation of 1. This is
useful when comparing variables with different units or ranges.

 Z-score Standardization Formula: z = (x − μ) / σ
 Example:
For a value of 50, with a mean of 40 and a standard deviation of 10, the z-score is:
   z = (50 − 40) / 10 = 1, meaning the value is 1 standard deviation above the mean.
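
The same z-score calculation as a small Python function (the function name is illustrative):

def z_score(x, mu, sigma):
    # Standardize a single value given the mean (mu) and standard deviation (sigma).
    return (x - mu) / sigma

print(z_score(50, mu=40, sigma=10))  # 1.0 -> one standard deviation above the mean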

5. Inequality

Definition:
Inequality measures how evenly or unevenly values are distributed, often used in economics
to describe income or wealth distribution.

Key Metrics:

 Gini Coefficient:
A measure of inequality that ranges from 0 (perfect equality) to 1 (perfect inequality).
It is often used to measure income or wealth inequality.

   G = Σᵢ Σⱼ |xᵢ − xⱼ| / (2n²μ), where the double sum runs over all pairs of the n observations and μ is the mean

Example:
A high Gini coefficient (e.g., 0.7) indicates high inequality, while a low Gini
coefficient (e.g., 0.2) indicates more equality.

 Lorenz Curve:
A graphical representation of inequality. It plots the cumulative percentage of total
income or wealth against the cumulative percentage of the population.
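
A small Python sketch of the pairwise-difference formula for the Gini coefficient given above (the function name and sample data are illustrative; with a single holder of everything the value approaches 1 only as n grows):

def gini(values):
    # Gini coefficient: sum of absolute differences over all pairs, divided by 2*n^2*mean.
    n = len(values)
    mu = sum(values) / n
    total_diff = sum(abs(xi - xj) for xi in values for xj in values)
    return total_diff / (2 * n * n * mu)

print(gini([25, 25, 25, 25]))  # 0.0  -> everyone has the same income (perfect equality)
print(gini([0, 0, 0, 100]))    # 0.75 -> one person holds everything (high inequality)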

Example Question

"Given the dataset of test scores, calculate the mean, median, mode, range, variance, and
standard deviation. Then, apply a 3-point moving average to smooth the data and comment
on the results."
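
The summary statistics in this question are covered in the sections above; the 3-point moving average is not, so here is a rough sketch of the idea using a hypothetical list of test scores. Each interior point is replaced by the average of itself and its two neighbours, which smooths out short-term fluctuations:

def moving_average_3(values):
    # Replace each interior point with the mean of itself and its two neighbours.
    return [(values[i - 1] + values[i] + values[i + 1]) / 3
            for i in range(1, len(values) - 1)]

scores = [70, 85, 60, 90, 75]    # hypothetical test scores
print(moving_average_3(scores))  # approximately [71.67, 78.33, 75.0] -> a smoother series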
