Data Visualization & Analytics
LECTURE 2 NOTES
Descriptive Statistics
● Introduction to Statistics
▪ Statistics is a mathematical science that includes methods for collecting, organizing,
analyzing, and visualizing data so that meaningful conclusions can be drawn.
Statistics is also a field of study that summarizes the data, interprets the data and
makes decisions based on the data.
▪ Statistics is composed of two broad categories:
1. Descriptive Statistics (the focus of these lecture notes)
2. Inferential Statistics (draws conclusions about a population from a sample of
data taken from that population, which makes it a tool for prediction and
decision-making)
▪ Descriptive statistics: Descriptive statistics refers to a branch of statistics that
involves summarizing, organizing, and presenting data meaningfully and concisely. It
focuses on describing and analyzing a dataset's main features and characteristics
without making any generalizations or inferences to a larger population.
▪ Common Techniques:
1. Measures of Central Tendency:
○ Mean (average)
○ Median (middle value)
○ Mode (most frequent value)
2. Measures of Dispersion:
○ Range (difference between highest and lowest values)
○ Quartiles (divide data into four equal parts)
○ Interquartile Range (IQR - spread of the middle 50% of data)
○ Coefficient of Variation (CV - standardized measure of spread)
3. Frequency Distributions:
○ Show how often each value (or range of values) appears in the data.
○ They can be presented in tables or charts (histograms, bar charts).
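A frequency distribution can be sketched in a few lines of Python. The small dataset and the bin width of 2 below are assumptions chosen purely for illustration; the sketch counts how often each value occurs and how often values fall into each range.

```python
from collections import Counter

# Illustrative data (an assumption, not taken from a real study).
data = [5, 4, 2, 3, 2, 1, 5, 4, 5]

# Frequency of each distinct value.
value_counts = Counter(data)

# Frequency of each range of values (bins of width 2: 0-1, 2-3, 4-5).
bin_width = 2
bin_counts = Counter((x // bin_width) * bin_width for x in data)

for value in sorted(value_counts):
    print(f"value {value}: {value_counts[value]}")
for start in sorted(bin_counts):
    print(f"bin {start}-{start + bin_width - 1}: {bin_counts[start]}")
```

The per-bin counts are the numbers a histogram would draw as bar heights.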
● Measures of Central Tendency: (Mean, Median, Mode)
▪ A measure of central tendency is a single value that attempts to describe a set of data
by identifying the central position within that set of data. The mean, median, and
mode are all valid measures of central tendency.
⮚ Mean (Arithmetic)
▪ The mean (or average) is the most popular and well-known measure of central
tendency. It can be used with both discrete and continuous data, although its use is
most often with continuous data. The mean is equal to the sum of all the values in the
data set divided by the number of values in the data set. So, if we have n values in a
data set, x1, x2, …, xn, the sample mean, usually denoted by x̄, is:
x̄ = (x1 + x2 + … + xn) / n
▪ An important property of the mean is that it includes every value in the dataset as
part of the calculation. In addition, the mean is the only measure of central tendency
where the sum of the deviations of each value from the mean is always zero.
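The two properties above can be checked with a short sketch. The values are an illustrative assumption; exact fractions are used so that the sum of deviations comes out as exactly zero rather than a tiny floating-point remainder.

```python
from fractions import Fraction

# Illustrative values (an assumption).
values = [2, 4, 4, 5, 10]

n = len(values)
mean = Fraction(sum(values), n)  # exact arithmetic avoids float rounding

# Deviation of each value from the mean; these always sum to zero.
deviations = [x - mean for x in values]

print("mean =", float(mean))
print("sum of deviations =", sum(deviations))
```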
⮚ Median:
▪ The median is the middle score for a set of data that has been arranged in order of
magnitude. It is less affected by outliers and skewed data than the mean. It is a
holistic measure: it must be computed over the whole data set and cannot be obtained
by combining the medians of subsets, so for a large data set it is often approximated
instead.
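○ Median for an Odd Number of Observations:
Median = x((n + 1)/2), the value in position (n + 1)/2
○ Median for an Even Number of Observations:
Median = [x(n/2) + x(n/2 + 1)] / 2, the mean of the values in positions n/2 and n/2 + 1
○ Where:
○ n = Number of observations
○ x(i) = The i-th value of the ordered data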
⮚ Mode:
▪ The mode is the most frequent score in our dataset. It is often used for categorical
data where we want to know which is the most common category occurring in the
population. The greatest frequency can be shared by more than one value, so a dataset
can have one, two, or more modes. Such datasets are called unimodal, bimodal, and
multimodal, respectively. If every value occurs only once, the dataset has no mode.
▪ For a unimodal frequency curve with a symmetric data distribution, the mean, median,
and mode are all the same.
Example Median: Consider the given dataset with the odd number of observations
arranged in descending order – 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2. Here 12 is
the middle or median number that has 6 values above it and 6 values below it.
Now, consider another example with an even number of observations that are
arranged in descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and
17. The two middle values of this dataset are 29 and 27. Taking the mean of these two
numbers gives (27 + 29)/2 = 28. Therefore, the median for the given data distribution
is 28.
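The odd and even cases worked through above can be reproduced with a small function. This is a minimal sketch; the function name `median` is our own choice, and the two datasets are the ones from the examples.

```python
def median(values):
    """Return the median: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values

odd_data = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
even_data = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

print(median(odd_data))
print(median(even_data))
```

Sorting in ascending rather than descending order does not change the result, because the middle positions are the same either way.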
Example Mode: Consider the given dataset 5, 4, 2, 3, 2, 1, 5, 4, 5.
The mode is the most common value. Here 5 occurs three times, more often than any
other value, so the mode of the given dataset is 5.
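A minimal sketch of finding the mode (or modes) by counting frequencies. The helper name `modes` and the second dataset are illustrative assumptions; the first dataset is the one from the example.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([5, 4, 2, 3, 2, 1, 5, 4, 5]))  # dataset from the example
print(modes([1, 1, 2, 2, 3]))              # illustrative data with two modes
```

When every value occurs exactly once, this sketch returns all of the values; many texts describe that case as having no mode.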
● Measures of Dispersion in Statistics:
▪ Measures of dispersion describe the scattering of the data: they tell us how widely
the values are spread around a central value. In statistics, a measure of dispersion
is a number that quantifies this variability.
▪ Types of Measures of Dispersion:
○ Absolute Measure of Dispersion
○ Relative Measure of Dispersion
▪ Absolute Measure of Dispersion: The measures of dispersion that are measured and
expressed in the units of data themselves are called Absolute Measure of Dispersion.
For example – Meters, Dollars, Kg, etc.
▪ Some absolute measures of dispersion are:
○ Range: It is defined as the difference between the largest and the smallest
value in the distribution.
➢ R=L–S
➢ where,
➢ L is the largest value in the Distribution
➢ S is the smallest value in the Distribution
○ Mean Deviation: It is the arithmetic mean of the absolute differences between the
values and their mean.
○ Standard Deviation: It is the square root of the arithmetic average of the
squares of the deviations measured from the mean.
➢ Concept: Imagine the mean as the center point of your data set. Standard
deviation tells you, on average, how far each data point deviates from that
center.
➢ Formula: SD = σ = √(σ²)
➢ Where:
➢ σ² (sigma squared) represents the variance of the data set.
○ Variance: It is defined as the average of the square deviation from the mean of
the given data set. Variance is a statistical measure that tells you how spread
out a set of data is relative to its mean (average). In simpler terms, it tells you
how much the data points tend to deviate from the average value.
➢ Variance (σ²) is calculated by finding the average of the squared deviations
from the mean.
➢ Here's the mathematical formula:
➢ Population variance: σ² = Σ (x - μ)² / N
➢ Sample variance: s² = Σ (x - x̄)² / (N - 1)
➢ Where:
→ Σ (sigma) represents the sum over all data points.
→ x represents each data point.
→ μ (mu) represents the population mean (x̄ is the sample mean).
→ N represents the total number of data points.
○ Quartile Deviation: It is defined as half of the difference between the third
quartile and the first quartile in a given dataset.
○ Interquartile Range: The difference between the upper(Q3 ) and lower(Q1)
quartile is called the interquartile range. Its formula is given as Q3 – Q1.
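The range, mean deviation, variance and standard deviation can be computed directly from their definitions. This is a sketch on illustrative data (an assumption); the mean deviation uses absolute differences, since signed deviations from the mean would cancel to zero, and the variance is shown with both the N and the N - 1 denominator.

```python
import math

# Illustrative data (an assumption).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                      # R = L - S
mean_deviation = sum(abs(x - mean) for x in data) / n   # average absolute deviation
squared_deviations = sum((x - mean) ** 2 for x in data)
population_variance = squared_deviations / n            # divide by N
sample_variance = squared_deviations / (n - 1)          # divide by N - 1
population_sd = math.sqrt(population_variance)          # SD is the square root of the variance

print(data_range, mean_deviation, population_variance, population_sd)
print(round(sample_variance, 3))
```

All of these values are expressed in the units of the data (or squared units, for the variance), which is what makes them absolute measures.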
▪ Relative Measure of Dispersion: Relative measures of dispersion are unit-free ratios.
We use them to compare the scattering of two or more datasets that are measured in
different units or that have very different means.
▪ Here are some of the relative measures of dispersion:
○ Coefficient of Range: It is defined as the ratio of the difference between the
highest and lowest value in a dataset to the sum of the highest and lowest
value.
○ Coefficient of Variation: It is defined as the ratio of the standard deviation to
the mean of the data set. We use percentages to express the coefficient of
variation.
○ Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to
the value of the central point of the dataset.
○ Coefficient of Quartile Deviation: It is defined as the ratio of the difference
between the third quartile and the first quartile to the sum of the third and first
quartiles.
▪ Measure of Spread: Measures of spread are the ways of summarizing a group of data
by describing how scores are spread out. To describe this spread, several statistics
are available to us, including the range, quartiles, absolute deviation, variance, and
standard deviation. The degree to which numerical data tends to spread is called the
dispersion, or variability, of the data. Common measures of data dispersion are the
range, the quartiles, and the interquartile range; boxplots display these measures and
help to identify outliers.
⮚ Range:
▪ The range of the set is the difference between the largest (max()) and smallest (min())
values. Ex: Step 1: Sort the numbers in order, from smallest to largest: 7, 10, 21,
33, 43, 45, 45, 65, 67, 87, 98, 99. Step 2: Subtract the smallest value from the
largest: 99 - 7 = 92.
▪ The range is the simplest measure of dispersion. It's the difference between the
highest and lowest values in your dataset.
▪ Formula: Range = Maximum Value - Minimum Value
▪ Interpretation: While easy to calculate, the range can be sensitive to outliers. A single
extreme value can significantly inflate the range and not accurately reflect the spread
of most data points.
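A short sketch of the range calculation on the dataset from the example, followed by the effect of a single extreme value. The added value 1000 is a hypothetical outlier used only to illustrate the sensitivity described above.

```python
data = [7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99]

# Range = Maximum Value - Minimum Value
data_range = max(data) - min(data)
print(data_range)

# One hypothetical extreme value inflates the range,
# even though the other twelve points are unchanged.
with_outlier = data + [1000]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)
```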
⮚ Quartiles:
▪ Quartiles divide your ordered dataset into four equal parts. There are three quartiles
(Q1, Q2, and Q3):
○ Q1 (First Quartile): The value below which 25% of the data falls.
○ Q2 (Second Quartile or Median): The middle value of the dataset.
○ Q3 (Third Quartile): The value below which 75% of the data falls.
▪ Interquartile Range (IQR): The IQR is the difference between Q3 and Q1 and
represents the middle 50% of the data (between the first and third quartiles).
▪ Formula: IQR = Q3 - Q1
▪ Interpretation: Unlike the range, the IQR is less sensitive to outliers as it focuses on
the central portion of the data. It's a good initial measure to understand the spread of
your data.
▪ Percentiles: The kth percentile of a set of data in numerical order is the value xi
having the property that k percent of the data entries lie at or below xi.
▪ The first quartile (Q1) is the 25th percentile.
▪ The third quartile (Q3) is the 75th percentile.
▪ The distance between the first and third quartiles is the range covered by the middle
half of the data.
▪ Outliers: A common rule is to single out values falling at least 1.5 × IQR above the
third quartile or below the first quartile.
▪ Five-number summary: The median, the quartiles Q1 and Q3, and the smallest and largest
individual observations comprise the five-number summary, written in the order
Minimum; Q1; Median; Q3; Maximum.
▪ Example: Quartiles
▪ Start with the following dataset:
▪ 1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20
▪ There are a total of twenty data points in the set. There is an even number of data
values, hence the median is the mean of the tenth and eleventh values. The median is:
(7 + 8)/2 = 7.5. The median of the first half of the set is found between the fifth and
sixth values of 1, 2, 2, 3, 4, 6, 6, 7, 7, 7. Thus the first quartile is found to equal Q1 = (4 +
6)/2 = 5
▪ To find the third quartile, examine the top half of the original dataset. The median of
8, 11, 12, 15, 15, 15, 17, 17, 18, 20 is (15 + 15)/2 = 15. Thus the third quartile Q3 = 15.
▪ A small interquartile range indicates data that is clumped about the median. A larger
interquartile range shows that the data is more spread out.
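The quartile example can be reproduced with the median-of-halves method used above. This is a sketch under that convention; other conventions (for example, interpolated percentiles in numerical libraries) can give slightly different quartiles for the same data. The variable names are our own.

```python
from statistics import median

data = [1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20]

ordered = sorted(data)
n = len(ordered)
half = n // 2

lower_half = ordered[:half]
# For odd n the median itself is left out of both halves (one common convention).
upper_half = ordered[half + 1:] if n % 2 else ordered[half:]

q1 = median(lower_half)
q2 = median(ordered)
q3 = median(upper_half)
iqr = q3 - q1
quartile_deviation = iqr / 2

# Fences for flagging outliers, following the "at least 1.5 * IQR" wording in these notes.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x <= lower_fence or x >= upper_fence]

five_number_summary = (ordered[0], q1, q2, q3, ordered[-1])
print(five_number_summary, iqr, outliers)
```

For this dataset the fences fall at -10 and 30, so no value is flagged as an outlier.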
⮚ Coefficient of Variation (CV)
▪ The coefficient of variation (CV) is a standardized measure of dispersion. It expresses
the standard deviation relative to the mean and is particularly useful for comparing
the spread of datasets with different units.
▪ Formula: CV = (Standard Deviation / Mean) * 100%
▪ Interpretation: The CV is expressed as a percentage. A higher CV indicates a larger
spread of data relative to the mean, compared to a lower CV.
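A sketch of the CV formula applied to two hypothetical datasets in different units. The height and weight values are assumptions chosen so that both have the same standard deviation; the population standard deviation (`pstdev`) is used here, which is also an assumption, since the notes do not say which variant is intended.

```python
from statistics import mean, pstdev

# Hypothetical measurements in different units (assumptions for illustration).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, using the population SD."""
    return pstdev(values) / mean(values) * 100

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV of heights: {cv_height:.2f}%")
print(f"CV of weights: {cv_weight:.2f}%")
```

Both datasets have the same standard deviation, but the weights vary more relative to their mean, so their CV is higher.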
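▪ Variance and Standard Deviation: These are absolute measures, defined earlier in these
notes: the variance is σ² = Σ (x - μ)² / N and the standard deviation is σ = √(σ²).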
Relative measures of dispersion:
1. Coefficient of Range: The ratio of the difference between two extreme items (the largest
and smallest) of the distribution to their sum is known as the Coefficient of Range. The
coefficient of range is a relative measure of dispersion. Symbolically, it can be
expressed as:
Coefficient of Range = (L - S) / (L + S)
where L is the largest value and S is the smallest value in the distribution.
2. Coefficient of Variation: Coefficient of Variation is a relative measure introduced by Karl
Pearson (also known as Karl Pearson’s Coefficient of Variation) through which two or more
groups of similar data are compared concerning stability, homogeneity or consistency. It
relates the standard deviation to the arithmetic mean of the given distribution/series.
The Coefficient of Variation is expressed in terms of percentage and can be determined
using the following formula:
C.V. = (σ / X̄) × 100
Where:
C.V. = Coefficient of Variation
σ = Standard Deviation
X̄ = Arithmetic Mean
3. Coefficient of Mean Deviation: Mean Deviation is an absolute measure of dispersion. To
convert it into a relative measure, it is divided by the average from which it has been
calculated. It is known as the Coefficient of Mean Deviation.
Coefficient of Mean Deviation from Mean = (Mean Deviation from Mean) / Mean
Coefficient of Mean Deviation from Median = (Mean Deviation from Median) / Median
4. Coefficient of Quartile Deviation: It is defined as the ratio of the difference between the
third quartile and the first quartile to the sum of the third and first quartiles.
○ Here's a breakdown of CQD:
○ Formula:
➢ CQD = (Q3 - Q1) / (Q3 + Q1)
➢ Q3: Third quartile (upper quartile)
➢ Q1: First quartile (lower quartile)
➢ Equivalently, it is the quartile deviation (Q3 - Q1)/2 divided by the average of
the two quartiles (Q3 + Q1)/2.
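The relative measures can be sketched from their ratio definitions: difference over sum for the range and quartile coefficients, and mean deviation over the center it was measured from. The data are an illustrative assumption, and the quartiles are taken with the median-of-halves method.

```python
from statistics import mean, median

data = sorted([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative data (an assumption)
n = len(data)

# Coefficient of range: (L - S) / (L + S)
largest, smallest = data[-1], data[0]
coefficient_of_range = (largest - smallest) / (largest + smallest)

def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)

# Coefficient of mean deviation, from the mean and from the median.
avg, med = mean(data), median(data)
coef_md_mean = mean_deviation(data, avg) / avg
coef_md_median = mean_deviation(data, med) / med

# Coefficient of quartile deviation: (Q3 - Q1) / (Q3 + Q1); n is even here.
q1 = median(data[: n // 2])
q3 = median(data[n // 2 :])
coef_quartile_deviation = (q3 - q1) / (q3 + q1)

print(round(coefficient_of_range, 4), coef_md_mean,
      round(coef_md_median, 4), coef_quartile_deviation)
```

Each result is a pure number with no unit, which is what allows comparison across datasets measured in different units.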
● Measures of Dispersion and Central Tendency
■ Both measures of dispersion and measures of central tendency are numbers that
summarize a dataset, but they answer different questions: a measure of central
tendency identifies the typical or central value, while a measure of dispersion
describes how widely the values are spread around that center.
● Introduction to Dataset Modality
■ In statistics, the modality of a dataset refers to the number of peaks, or modes,
present in its distribution. These peaks represent the most frequently occurring
values within the dataset. Understanding the modality of a dataset helps to identify
patterns, clusters, and trends within the data, which can be crucial for data analysis
and decision-making.
1. Unimodal Dataset
■ Definition: A unimodal dataset has a single peak, indicating one predominant value
or a narrow range of values that occurs most frequently. This type of dataset is the
simplest form of modality and is commonly encountered in natural and social
sciences.
■ Characteristics:
➢ Single Peak: The distribution has one clear peak, representing the most
common value.
➢ Symmetry: Often, unimodal distributions are symmetric, but they can also
be skewed.
➢ Typical Examples: Heights of individuals, test scores of students, or weights
of fruits.
■ Visualization:
➢ Histogram: A histogram of a unimodal dataset will show a single peak.
➢ Density Plot: A smooth curve with one peak, representing the highest
frequency.
2. Bimodal Dataset
■ A bimodal dataset has two peaks, indicating two distinct values or ranges of values
that are most frequent. This type of dataset often suggests the presence of two
different subgroups within the data.
■ Characteristics:
➢ Two Peaks: The distribution has two prominent peaks.
➢ Indication of Subgroups: The two peaks often correspond to two distinct
groups or categories within the data.
➢ Typical Examples: Test scores from two different classes, income levels of
two demographic groups.
■ Visualization:
➢ Histogram: A histogram of a bimodal dataset will show two distinct peaks.
➢ Density Plot: A smooth curve with two peaks, indicating two frequently
occurring values.
3. Multimodal Dataset
■ A multimodal dataset has more than two peaks, indicating multiple values or
ranges of values that are frequently occurring. This type of dataset suggests the
presence of multiple subgroups or clusters.
■ Characteristics:
➢ Multiple Peaks: The distribution has more than two prominent peaks.
➢ Indication of Multiple Subgroups: The peaks correspond to several distinct
groups within the data.
➢ Typical Examples: Population ages in a large, diverse country, product
preferences among different market segments.
■ Visualization:
➢ Histogram: A histogram of a multimodal dataset will show several peaks.
➢ Density Plot: A smooth curve with multiple peaks, each representing a
frequently occurring value.
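One rough way to label modality is to count the peaks in a histogram's bin counts. The sketch below is a simplification under stated assumptions: a peak is a bin strictly higher than both neighbours, the bin counts are hypothetical, and real data would usually need smoothing first so that noise is not mistaken for extra peaks.

```python
def count_peaks(bin_counts):
    """Count bins whose frequency is strictly higher than both neighbours."""
    padded = [0] + list(bin_counts) + [0]  # treat the outside of the histogram as empty
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

def modality(bin_counts):
    """Label a histogram as unimodal, bimodal, or multimodal by its peak count."""
    peaks = count_peaks(bin_counts)
    return {0: "no clear peak", 1: "unimodal", 2: "bimodal"}.get(peaks, "multimodal")

# Hypothetical histogram bin counts (assumptions for illustration).
print(modality([1, 4, 9, 4, 1]))
print(modality([2, 8, 3, 1, 4, 9, 2]))
print(modality([1, 6, 2, 7, 1, 5, 2]))
```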
● Visualization Techniques for Descriptive Statistics
Visualization techniques for descriptive statistics help in understanding the
distribution, central tendency, variability, and relationships within datasets. Here are
some commonly used visualization methods for descriptive statistics:
▪ Histograms: Histograms display the distribution of a continuous variable by dividing
the data into bins and plotting the frequency of observations in each bin. They provide
insights into the shape, central tendency, and spread of the data.
▪ Box Plots (Box-and-Whisker Plots): Box plots summarize the distribution of a
continuous variable through five statistics: minimum, first quartile (Q1), median (Q2),
third quartile (Q3), and maximum. They are useful for comparing distributions and
identifying outliers.
▪ Bar Charts: Bar charts are used to visualize the distribution of categorical variables
by plotting the frequency or proportion of each category as bars. They are effective
for showing comparisons between categories.
▪ Pie Charts: Pie charts represent the composition of a categorical variable as slices of
a pie, where each slice corresponds to a category and its size represents the
proportion of that category in the dataset.
▪ Scatter Plots: Scatter plots display the relationship between two continuous variables
by plotting each observation as a point on a graph. They are useful for identifying
patterns, trends, and correlations between variables.
▪ Line Graphs: Line graphs are used to show changes in a variable over time or another
ordered dimension. They connect data points with lines, making them ideal for
visualizing trends and patterns over continuous sequences.
▪ Heatmaps: Heatmaps use color gradients to represent the magnitude of a
relationship between two categorical variables or to visualize the density of points in
a scatter plot. They are particularly useful for large datasets.
▪ Violin Plots: Violin plots combine aspects of box plots and density plots to display the
distribution of a continuous variable, providing insights into both the summary
statistics and the shape of the distribution.
▪ Pair Plots (Scatterplot Matrix): Pair plots are grids of scatter plots showing
relationships between pairs of variables in a dataset. They allow for quick visual
inspection of relationships across multiple dimensions.
▪ Probability Plots (Q-Q Plots): Q-Q plots compare the quantiles of a dataset's
distribution to those of a theoretical distribution (e.g., normal distribution). They are
useful for assessing whether a dataset follows a particular distribution.
▪ Example:
▪ Central Tendency: This refers to the "middle" of your data set. Common measures
include mean, median, and mode.
○ Bar Charts: Great for comparing means or medians of different categories. For
example, a bar chart could show the average income across different age
groups.
[Figure: Bar chart central tendency]
▪ Distribution: This describes how your data is spread out.
○ Histograms: Show the frequency of data points across different ranges (bins).
Useful for understanding the shape of the distribution (normal, skewed, etc.).
[Figure: Histogram visualization]
○ Box Plots: Display the median, quartiles, and outliers of the data. Quartiles
divide the data into four equal parts. They are useful for comparing the spread
and potential outliers between groups.
[Figure: Box plot visualization]
▪ Dispersion: This refers to how spread out the data is from the central tendency.
Common measures include variance and standard deviation.
○ Violin Plots: Combine features of boxplots and density plots. They show the
distribution of data within each category and can reveal differences in spread.
[Figure: Violin plot visualization]
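A histogram and a box plot of the same data can be drawn side by side with matplotlib. This is a sketch, assuming matplotlib is installed; the data are the twenty values from the quartile example, and the choice of 5 bins is arbitrary. Note that matplotlib computes box-plot quartiles by interpolation, so the box edges may differ slightly from quartiles found with the median-of-halves method, while the median line is the same.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: frequency of observations in each of 5 equal-width bins.
counts, bin_edges, _ = ax_hist.hist(data, bins=5)
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Histogram")

# Box plot: median, quartiles, whiskers, and any points flagged as outliers.
box = ax_box.boxplot(data)
ax_box.set_ylabel("Value")
ax_box.set_title("Box plot")

fig.tight_layout()
# Call plt.show() (interactive backend) or fig.savefig(...) to view the result.
```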
▪ Example: Bar Charts
○ Distribution of favorite fruits among a group of people.
○ Data: Number of people preferring different fruits.
○ Visualization:
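import matplotlib.pyplot as plt

# Number of people preferring each fruit
fruits = ['Apple', 'Banana', 'Orange', 'Grapes']
counts = [10, 15, 7, 8]

# One bar per fruit, with height equal to the count
plt.bar(fruits, counts)
plt.xlabel('Fruit')
plt.ylabel('Number of People')
plt.title('Favorite Fruits')
plt.show()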
▪ Insight: The bar chart shows the popularity of each fruit, making it easy to compare
preferences.