Notation
We use an upper-case letter to denote a variable, and the corresponding lower-case
letter to denote a general value of the variable. For example, when $X$ is used to denote
a variable, $x$ is used to denote a particular value of it.
Suppose our sample consists of $n$ values of a variable $X$. We use $\sum x$ to denote the
sum of all these values.
Describing the center of the data
There are three common measures to describe the central tendency of the sample
data. These are:
1. Mean
2. Median
3. Mode
Mean or arithmetic mean (A.M.)
Let a sample of $n$ values of a variable $X$ be taken. Data: $x_1, x_2, \dots, x_n$. Then the mean
(also known as the arithmetic mean) is defined as
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example
Data: 4, 8, 5, 9, 15
\[ \bar{x} = \frac{1}{n} \sum x = \frac{1}{5}(4 + 8 + 5 + 9 + 15) = 8.2 \]
• Keep the result as a fraction or decimal even if the variable is discrete.
• Mean cannot be calculated for categorical variables.
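The calculation above can be checked with a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is an illustrative choice and does not come from these notes.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)


data = [4, 8, 5, 9, 15]
mean_value = arithmetic_mean(data)
print(mean_value)  # 8.2
```

Note that the result is a decimal even though every value in the data is a whole number.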
Median
It is the middlemost value in the sorted data. If $n$ is an odd number, the median is the
middle value, i.e., the $\left(\frac{n+1}{2}\right)$th value of the sorted data. If $n$ is an even number, the median
is the average of the two middle values, i.e.,
\[ \text{Median} = \frac{\left(\frac{n}{2}\right)\text{th value} + \left(\frac{n}{2} + 1\right)\text{th value}}{2} \]
When the sample size is large, approximately 50% of the values are less than the
median and approximately 50% are greater than it.
Example
Data: 4, 8, 5, 9, 15
Sorted data: 4, 5, 8, 9, 15
Median = 8
Example:
Data: 4, 8, 5, 9, 15, 13
Sorted data: 4, 5, 8, 9, 13, 15
\[ \text{Median} = \frac{1}{2}(8 + 9) = 8.5 \]
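The odd and even cases can be sketched in Python as follows. The function name `sample_median` is a hypothetical choice for illustration; positions in the text are counted from 1, while Python lists are indexed from 0.

```python
def sample_median(values):
    """Median of a sample: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th value is at 0-based index n // 2.
        return ordered[mid]
    # Even n: average the (n / 2)th and (n / 2 + 1)th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([4, 8, 5, 9, 15]))      # 8
print(sample_median([4, 8, 5, 9, 15, 13]))  # 8.5
```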
Mode
Mode is the value that occurs most frequently.
Sometimes two or more values occur with highest frequency.
• If there are two modes, the data is bimodal.
• If there are more than two modes, the data is multimodal.
If all values occur with equal frequency, there is no mode.
Example:
Twenty people were asked to give a satisfaction rating after a restaurant meal on a scale of
1 (not satisfied) to 10 (extremely satisfied).
Data: 9, 3, 7, 5, 5, 10, 8, 9, 9, 10, 9, 8, 9, 6, 9, 8, 7, 7, 10, 6.
Mode = 9 (occurred 6 times in the data)
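Counting frequencies by hand is error-prone for longer data sets. The sketch below uses `collections.Counter` from the Python standard library; the helper name `modes` is an assumption made for this illustration. It returns a list so that bimodal and multimodal data are handled, and an empty list when all values occur equally often.

```python
from collections import Counter


def modes(values):
    """Return all values with the highest frequency, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # If several distinct values all share the same frequency, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


ratings = [9, 3, 7, 5, 5, 10, 8, 9, 9, 10, 9, 8, 9, 6, 9, 8, 7, 7, 10, 6]
print(modes(ratings))        # [9]
print(Counter(ratings)[9])   # 6
```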
Which measure to use when
For categorical data, mode can be used.
For numerical (discrete or continuous) data, any of the three measures can be used.
However, for mathematical reasons, mean or median is preferred.
Center for numerical data: mean or median?
Data: 2, 3, 4, 5, 7
Here, mean = 4.2, median = 4.
(The results are close. The mean is preferred because it is easy to calculate and
has convenient mathematical properties.)
Data: 2, 3, 4, 5, 507 (the last value is an ‘outlier’)
Here, mean = 104.2, median = 4.
The median represents the majority of the data. The mean represents neither the majority
nor the outlier. The median is preferred because it gives a reasonable result.
• When data have outliers, median is preferred.
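The effect of the outlier can be reproduced with the standard-library `statistics` module. This is a small sketch using the two data sets above; the variable names are illustrative.

```python
from statistics import mean, median

without_outlier = [2, 3, 4, 5, 7]
with_outlier = [2, 3, 4, 5, 507]

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    # The mean moves a lot when one value is extreme; the median does not.
    print(label, "mean =", mean(data), "median =", median(data))
```

Replacing 7 by 507 moves the mean from 4.2 to 104.2, while the median stays at 4.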
Geometric Mean
It is used to calculate average growth rates, interest rates, etc. Let a sample of $n$ values
of a variable $X$ be taken: $x_1, x_2, \dots, x_n$. The geometric mean is defined as
\[ G.M. = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \]
Example
Data: 4, 8, 5, 9, 15
\[ G.M. = (4 \times 8 \times 5 \times 9 \times 15)^{1/5} = 7.36 \]
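A possible Python sketch of this calculation is given below. It averages logarithms instead of multiplying the values directly, which is mathematically equivalent for positive values and avoids very large intermediate products; that choice, and the function name `geometric_mean`, are assumptions of this sketch rather than part of the notes.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty sample of positive values")
    # exp(mean of logs) equals (product of values) ** (1 / n).
    return math.exp(sum(math.log(v) for v in values) / len(values))


gm = geometric_mean([4, 8, 5, 9, 15])
print(round(gm, 2))  # 7.36
```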
Harmonic Mean
It gives less weight to large values. It is defined as follows.
\[ H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}} \]
Harmonic mean is often used to calculate the average of ratios or rates.
Example
Data: 4, 8, 5, 9, 15
\[ H.M. = \frac{5}{\frac{1}{4} + \frac{1}{8} + \cdots + \frac{1}{15}} = 6.64 \]
Note: For any set of positive values, $A.M. \geq G.M. \geq H.M.$
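The harmonic mean and the ordering of the three means can be checked numerically. In the sketch below, `harmonic_mean` is written out by hand to mirror the definition, while `mean` and `geometric_mean` come from the standard-library `statistics` module (this assumes Python 3.8 or later, where `statistics.geometric_mean` is available).

```python
from statistics import geometric_mean, mean


def harmonic_mean(values):
    """n divided by the sum of reciprocals of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty sample of positive values")
    return len(values) / sum(1 / v for v in values)


data = [4, 8, 5, 9, 15]
am = mean(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)

print(round(hm, 2))    # 6.64
print(am >= gm >= hm)  # True
```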
Weighted Mean
\[ W.M. = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n} \]
where $w_i$ is the weight of $x_i$.
Example
Grade Point Credit
3.7 3
3.3 4
4.0 3
\[ \text{GPA} = \frac{3.7 \times 3 + 3.3 \times 4 + 4.0 \times 3}{3 + 4 + 3} = 3.63 \]
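The GPA example can be sketched in Python as below. The function name `weighted_mean` and its argument names are illustrative assumptions; the credits play the role of the weights.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


grade_points = [3.7, 3.3, 4.0]
credits = [3, 4, 3]
gpa = weighted_mean(grade_points, credits)
print(round(gpa, 2))  # 3.63
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean.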
Mean from frequency table
Example
$x$ Frequency
0 40
2 20
3 30
4 10
Total 100
Mean:
\[ \bar{x} = \frac{1}{100}(0 \times 40 + 2 \times 20 + 3 \times 30 + 4 \times 10) = 1.7 \]
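That is,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} x_i f_i \]
where $k$ is the number of distinct values, $f_i$ is the frequency of $x_i$, and $n = \sum f_i$.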
* From the table above, mode is 0 and median is 2. (Why?)
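One way to see where these numbers come from is to treat the table as a mapping from value to frequency. The sketch below is illustrative only (the variable names are assumptions); it expands the table back into the 100 sorted raw values to locate the median.

```python
# Frequency table: value -> frequency
table = {0: 40, 2: 20, 3: 30, 4: 10}

n = sum(table.values())                                  # total number of observations
mean_value = sum(x * f for x, f in table.items()) / n    # (1/n) * sum of x_i * f_i
mode_value = max(table, key=table.get)                   # value with the largest frequency

# Rebuild the sorted raw data: forty 0s, twenty 2s, thirty 3s, ten 4s.
expanded = [x for x in sorted(table) for _ in range(table[x])]
# n = 100 is even, so average the 50th and 51st values (0-based indices 49 and 50).
median_value = (expanded[n // 2 - 1] + expanded[n // 2]) / 2

print(mean_value, mode_value, median_value)  # 1.7 0 2.0
```

The first 40 sorted values are 0 and the next 20 are 2, so both the 50th and 51st values equal 2.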
Mean from grouped data
Example
Class Frequency
0 − 5 40
5 − 10 20
10 − 15 10
15 − 20 30
Total 100
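We use the mid-value of each class in our calculation.
Mean:
\[ \bar{x} = \frac{1}{100}(2.5 \times 40 + 7.5 \times 20 + 12.5 \times 10 + 17.5 \times 30) = 9 \]
That is,
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} m_i f_i \]
Here, $m_i$ is the mid-value of the $i$th class, $f_i$ is its frequency, and $k$ is the number of classes.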
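The grouped-data calculation can be sketched in Python as follows. Each class is stored as a (lower limit, upper limit, frequency) triple; this representation and the variable names are assumptions made for the illustration. The result is an approximation of the mean, since the individual values inside each class are not known.

```python
# Each entry: (lower limit, upper limit, frequency)
classes = [(0, 5, 40), (5, 10, 20), (10, 15, 10), (15, 20, 30)]

n = sum(freq for _, _, freq in classes)

# Use the mid-value (lower + upper) / 2 of each class as its representative value.
grouped_mean = sum((lower + upper) / 2 * freq for lower, upper, freq in classes) / n

print(grouped_mean)  # 9.0
```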
Relation between mean, median and mode
For a symmetric, bell-shaped distribution:
mean = median = mode (shown with a bullet point in the plot below).
For a positively skewed distribution, typically:
mean > median > mode (shown with 3 bullet points in the plot below).
For a negatively skewed distribution, typically:
mean < median < mode (shown with 3 bullet points in the plot below).
Exercise
Consider the data: 2, 4, 10, 10, 12, 6, 11, 12, 12, 8. Compute mean, median and
mode. Comment on the shape of the distribution.
Solution
Sorted data: 2, 4, 6, 8, 10, 10, 11, 12, 12, 12.
Mean = 8.7
Median = (10 + 10)/2 = 10
Mode = 12
Since Mean < Median < Mode, the distribution is negatively skewed (or skewed to
the left).
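The solution can be checked with the standard-library `statistics` module. The helper `skew_hint` below is a hypothetical name, and it only encodes the rule of thumb from this section (the ordering of mean, median and mode); it is not a formal measure of skewness. The sketch assumes Python 3.8 or later, where `statistics.mode` returns the first mode if there are several.

```python
from statistics import mean, median, mode


def skew_hint(values):
    """Label the shape using the ordering of mean, median and mode (rule of thumb)."""
    m, med, mo = mean(values), median(values), mode(values)
    if m < med < mo:
        return "negatively skewed"
    if m > med > mo:
        return "positively skewed"
    if m == med == mo:
        return "roughly symmetric"
    return "no clear indication"


data = [2, 4, 10, 10, 12, 6, 11, 12, 12, 8]
print(mean(data), median(data), mode(data))  # 8.7 10.0 12
print(skew_hint(data))                       # negatively skewed
```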
Quartiles
There are three quartiles that divide the total area of the histogram into 4 equal parts.
The first quartile Q1 is the 25th percentile. The second quartile Q2 (or median) is
the 50th percentile. The third quartile Q3 is the 75th percentile.
[Figure: histogram with the quartiles Q1, Q2 and Q3 marked.]
Example
20 customers’ satisfaction ratings:
5, 1, 7, 3, 5, 10, 10, 9, 8, 8, 10, 8, 8, 9, 9, 8, 8, 10, 9, 9.
Sorted data:
1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10.
Median = (8+8)/2 = 8
Q1 = (7+8)/2 = 7.5
Q3 = (9+9)/2 = 9
Five-number summary
We often describe a set of data by using a five-number summary. The summary
consists of (1) the minimum (the smallest value), (2) the first quartile Q1, (3) the median,
(4) the third quartile Q3, and (5) the maximum (the largest value).
Example
The five-number summary of the previous data: 1, 7.5, 8, 9, 10.
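The quartiles and the five-number summary of the ratings data can be reproduced with the sketch below. It follows the approach used in the example: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half. For an odd sample size the sketch leaves the overall median out of both halves; that is an assumption of this sketch, since the notes only show an even sample size and other conventions exist. Library routines such as `statistics.quantiles` interpolate differently and may give slightly different quartiles.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half (assumption).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


ratings = [5, 1, 7, 3, 5, 10, 10, 9, 8, 8, 10, 8, 8, 9, 9, 8, 8, 10, 9, 9]
summary = five_number_summary(ratings)
print(summary)  # (1, 7.5, 8.0, 9.0, 10)
```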
Percentiles
When data are arranged in increasing order, the $p$th percentile is a value such that $p$
percent of the values fall at or below the value, and $(100 - p)$ percent of the values
fall at or above the value. There are 99 percentiles that divide the total area of the
histogram into 100 equal parts.
Example
Let the 83rd percentile = 39.5. This means about 83% of the values in the data are at or below 39.5,
and about (100 − 83)% = 17% of the values are at or above 39.5.
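The interpretation of a percentile can be checked by counting. The sketch below reuses the 20 satisfaction ratings from the quartiles example, where Q1 = 7.5 is the 25th percentile; the function names are hypothetical and chosen only for this illustration.

```python
def percent_at_or_below(values, x):
    """Percentage of the values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


def percent_at_or_above(values, x):
    """Percentage of the values that are greater than or equal to x."""
    return 100 * sum(v >= x for v in values) / len(values)


ratings = [5, 1, 7, 3, 5, 10, 10, 9, 8, 8, 10, 8, 8, 9, 9, 8, 8, 10, 9, 9]
below = percent_at_or_below(ratings, 7.5)
above = percent_at_or_above(ratings, 7.5)
print(below, above)  # 25.0 75.0
```

When a percentile coincides with a value that is repeated in the data, the two percentages can add up to more than 100, which is why the definition says "at or below" and "at or above".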