Chapter 3 - CT & Dispersion
Measure of location: A single value that summarizes a set of data. It locates the
center of the values.
The arithmetic mean, or simply mean, is the most widely used measure of location.
In terms of symbols, the formula for the arithmetic mean of a population is:
Population Mean: \[ \mu = \frac{\sum X}{N} \qquad [3-1] \]
Where:
µ is the population mean.
N is the number of items in the population.
X is a particular value.
∑ indicates the operation of adding all the values. It is pronounced “sigma.”
The mean of a sample and the mean of a population are computed in the same way, but the
shorthand notation is different.
Sample Mean: \[ \bar{X} = \frac{\sum X}{n} \qquad [3-2] \]
Chapter 3
Describing Data: Numerical Measures
Where:
X̄ is the sample mean; it is read “X bar.”
n is the number of values in the sample.
X is a particular value.
∑ indicates the operation of adding all the values.
The mean of a sample, or any other measure based on sample data, is called a statistic.
Note that in both of the above formulas the mean is calculated by summing the observations and
dividing by the total number of observations.
As an example, the Kellogg Company had quarterly earnings per share of $0.89, $0.77, $1.05, $0.79,
and $0.95. The mean is found by:
\[ \bar{X} = \frac{\sum X}{n} = \frac{\$0.89 + \$0.77 + \$1.05 + \$0.79 + \$0.95}{5} = \frac{\$4.45}{5} = \$0.89 \]
The mean quarterly earnings per share is $0.89. In some situations the mean may not be
representative of the data.
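As an example, the annual salaries of five executives are $40,000, $42,000, $44,000, $48,000, and
$300,000. The mean is:
\[ \bar{X} = \frac{\$40{,}000 + \$42{,}000 + \$44{,}000 + \$48{,}000 + \$300{,}000}{5} = \frac{\$474{,}000}{5} = \$94{,}800 \]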
Notice how the one extreme value ($300,000) pulled the mean upward. Four of the five executives
earned less than the mean, raising the question whether the arithmetic mean value of $94,800 is
typical of the salary of the five executives.
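The two examples above can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original notes.

```python
from statistics import mean

# Data quoted in the text above.
kellogg_eps = [0.89, 0.77, 1.05, 0.79, 0.95]           # quarterly earnings per share ($)
salaries = [40_000, 42_000, 44_000, 48_000, 300_000]   # executive salaries ($)

eps_mean = mean(kellogg_eps)
salary_mean = mean(salaries)

# Count how many executives earn less than the mean salary.
below_mean = sum(1 for s in salaries if s < salary_mean)

print(f"Mean EPS: ${eps_mean:.2f}")
print(f"Mean salary: ${salary_mean:,.0f}")
print(f"Executives below the mean: {below_mean} of {len(salaries)}")
```

The count of executives below the mean makes the effect of the single extreme value visible.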
iii) Based on all observations.
iv) Suitable for further mathematical treatment
v) Least affected by sampling fluctuations
1. Every set of interval level and ratio level data has a mean.
2. All the data values are used in the calculation.
3. A set of data has only one mean, that is, the mean is unique.
4. The mean is a useful measure for comparing two or more populations.
5. The sum of the deviations of each value from the mean will always be zero, that is:
\[ \sum (X - \bar{X}) = 0 \]
6. Mean of composite series: If X̄ᵢ (i = 1, 2, …, k) are the means of k component series
of sizes nᵢ (i = 1, 2, …, k) respectively, then the mean X̄ of the composite series
obtained on combining the component series is given by the formula:
\[ \bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \cdots + n_k \bar{X}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{\sum_{i=1}^{k} n_i} \]
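The composite-mean formula can be sketched in a few lines. The helper name `combined_mean` is hypothetical; the group sizes and means are the ones quoted in the combined-mean exercise later in this chapter (two groups of 100 and 200 students).

```python
def combined_mean(sizes, means):
    """Mean of the composite series: size-weighted average of the component means."""
    total = sum(sizes)
    return sum(n * m for n, m in zip(sizes, means)) / total

group_sizes = [100, 200]       # n_1, n_2
group_means = [49.96, 52.32]   # component means

overall = combined_mean(group_sizes, group_means)
print(f"Combined mean: {overall:.2f}")
```

The result agrees with the answer printed for that exercise (51.53).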
Geometric Mean: \[ GM = \sqrt[n]{X_1 \times X_2 \times X_3 \times \cdots \times X_n} \]
Where:
X₁, X₂, X₃, etc. are the data values.
n is the number of values.
ⁿ√ denotes the nth root.
The geometric mean can be used for averaging percents. Suppose the return on investment for
McDermoll International for the past 4 years is 0.4%, 2.9%, 2.1%, and 12.3%. The GM increase over
the period is 4.3 percent, found by:
\[ GM = \sqrt[n]{(X_1)(X_2)(X_3) \cdots (X_n)} = \sqrt[4]{1.004 \times 1.029 \times 1.021 \times 1.123} = \sqrt[4]{1.18455} = 1.043 \]
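The calculation can be reproduced as follows. This is a sketch; it assumes each percentage return is first converted to a growth factor (1 plus the rate), as in the worked figures above, and that 1 is subtracted at the end to recover the average percentage increase.

```python
import math

returns_pct = [0.4, 2.9, 2.1, 12.3]                    # yearly returns in percent
growth_factors = [1 + r / 100 for r in returns_pct]    # 1.004, 1.029, 1.021, 1.123

# n-th root of the product of the n growth factors.
gm = math.prod(growth_factors) ** (1 / len(growth_factors))
gm_increase_pct = (gm - 1) * 100

print(f"GM of growth factors: {gm:.3f}")
print(f"GM increase: {gm_increase_pct:.1f} percent")
```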
Disadvantages of Harmonic Mean:
i) It is not easy to understand and is difficult to calculate.
ii) If any observation is zero, the harmonic mean cannot be computed, because the reciprocal of zero is undefined.
Uses of HM:
Weighted Mean
The weighted mean is a special case of the arithmetic mean. It is often useful when there are several
observations of the same value.
Weighted mean: The value of each observation is multiplied by the number of times it
occurs. The sum of these products is divided by the total number of observations to give
the weighted mean.
In general, the weighted mean of a set of numbers, designated X1, X2, X3, … Xn, with the
corresponding weights w1, w2, w3, …, wn is computed by:
Weighted Mean: \[ \bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n}{w_1 + w_2 + w_3 + \cdots + w_n} \qquad [3-3] \]
The weighted mean is particularly useful when various classes or groups contribute differently to
the total. For example, the coronary care unit of a hospital consists of nurses’ aides who are paid
$12 per hour, nurses’ assistants who earn $15 per hour, and registered nurses who earn $21 per
hour.
It would not be accurate to say the average hourly wage for the coronary unit is $16 per hour, found by
($12 + $15 + $21) / 3, unless there were the same number of people in each group.
Suppose the coronary care unit has ten employees: two aides who earn $12 per hour, three nurses’
assistants who earn $15 per hour, and five registered nurses who earn $21 per hour. The weighted
mean is:
\[ \bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{(2 \times \$12) + (3 \times \$15) + (5 \times \$21)}{2 + 3 + 5} = \frac{\$24 + \$45 + \$105}{10} = \frac{\$174}{10} = \$17.40 \]
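A minimal sketch of the same calculation, contrasting the weighted mean with the unweighted mean of the three wage rates. The function name `weighted_mean` is illustrative only.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

hourly_wages = [12, 15, 21]   # aides, assistants, registered nurses ($ per hour)
headcount = [2, 3, 5]         # number of employees in each group (the weights)

simple = sum(hourly_wages) / len(hourly_wages)
weighted = weighted_mean(hourly_wages, headcount)

print(f"Unweighted mean of the three rates: ${simple:.2f}")
print(f"Weighted mean hourly wage: ${weighted:.2f}")
```

The two figures differ because the groups are of unequal size; they would coincide only if every group had the same number of people.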
Median: The midpoint of the values after all observations have been ordered from the
smallest to the largest or from the largest to the smallest.
Fifty percent of the observations are above the median and 50 percent are below the median. To
determine the median, the values are ordered from low to high, or high to low, and the middle
value selected. Hence, half the observations are above the median and half are below it. For the
executive incomes, the middle value is $44,000, the median.
$40,000   $42,000   $44,000   $48,000   $300,000
                       ↑
                    median
Obviously, it is a more representative value in this problem than the mean of $94,800.
Note that there were an odd number of executive incomes (5). For an odd number of ungrouped
values we just order them and select the middle value. To determine the median of an even number
of ungrouped values, the first step is to arrange them from low to high as usual, and then
determine the value half way between the two middle values.
As an example, the final grades of the six students in Mathematics 126 were 87, 62, 91, 58, 99, and
85. Ordering these from low to high:
58   62   85   87   91   99
          ↑    ↑
The median grade is halfway between the two middle values of 85 and 87. The median grade is 86.
Thus we note that the median (86) may not be one of the values in a set of data.
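Both cases, an odd and an even number of values, can be checked with the standard library. This is a sketch; `statistics.median` sorts the data itself and averages the two middle values when the count is even.

```python
from statistics import median

salaries = [40_000, 42_000, 44_000, 48_000, 300_000]   # odd number of values
grades = [87, 62, 91, 58, 99, 85]                      # even number of values

salary_median = median(salaries)   # the middle value after sorting
grade_median = median(grades)      # halfway between the two middle values

print(f"Median salary: ${salary_median:,}")
print(f"Median grade: {grade_median}")
```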
Advantages of Median:
i) Well defined
ii) Readily comprehensible and easy to calculate.
iii) Not affected by extreme values
iv) Can be calculated for a distribution when an extreme class is open-ended
Disadvantages of Median:
i) Not based on all observations.
ii) Not suitable for further mathematical treatment
iii) Compared with the AM, it is affected more by sampling fluctuations
Uses of Median:
i) It is the only average to be used while dealing with qualitative data, which cannot be
measured quantitatively but still can be arranged in ascending or descending order of
magnitude. e.g., to find the average intelligence, or average honesty among a group of
people.
ii) It is used to determine the typical values in problems concerning wages,
distribution of wealth, etc.
Properties of the Median
The major properties of the median are:
1. The median is a unique value, that is, like the mean, there is only one median for a set of
data.
2. It is not influenced by extremely large or small values.
3. It can be computed for ratio level, interval level, and ordinal-level data.
4. Fifty percent of the observations are greater than the median and fifty percent of the
observations are less than the median.
The mode is the value that occurs most often in a set of raw data. The dividends per share declared
on five stocks were: $3, $2, $4, $5, and $4. Since $4 occurred twice, which was the most frequent,
the mode is $4.
Below is the formula for calculating the mode from grouped data:
\[ \text{Mode} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i \]
Where:
L1 is the lower limit of the modal class.
∆1 is the difference between the frequency of the modal class and the frequency of the class just
preceding the modal class.
∆2 is the difference between the frequency of the modal class and the frequency of the class just
succeeding the modal class.
i is the width of the modal class.
Advantages of Mode:
i) Rigidly defined
ii) Readily comprehensible and easy to calculate.
iii) Not affected by extreme values
Disadvantages of Mode:
i) Not based on all observations.
ii) Ill-defined: a data set may have no mode or more than one mode
iii) Not suitable for further mathematical treatment
iv) Compared with the AM, it is affected more by sampling fluctuations
For example, the mean salary paid to baseball players for the New York Yankees is $4,342,365.
However, the range is $14,390,000, with a low of $210,000 and a high of $14,600,000. The Tampa
Devil Rays have a mean salary of $1,227,857. The range is $8,550,000, with a low of $200,000 and a
high of $8,750,000.
As another example, suppose a statistics instructor had two classes, one in the morning and one in
the evening; each with six students. In the morning class (AM) the students’ ages are 18, 20, 21, 21,
23, and 23 years. In the evening class (PM) the ages are 17, 17, 18, 20, 25, and 29 years. Note that for
both classes the mean age is 21 years but there is more variation or dispersion in the ages of the
evening students.
A small value for a measure of dispersion indicates that the data are clustered closely, say, around
the arithmetic mean. Thus the mean is considered representative of the data, that is, it is reliable.
Conversely, a large measure of dispersion indicates that the mean is not reliable and is not
representative of the data.
There are several measures of dispersion. We will consider six: the range, the mean deviation, the
variance, the standard deviation, the interquartile range, and quartile deviation.
3.2.1 RANGE
Perhaps the simplest measure of dispersion is the range.
Range: The difference between the highest and lowest value in a set of data.
iv) It can be distorted by an extreme value.
The range has two disadvantages. It can be distorted by a single extreme value. Suppose the same
statistics instructor has a third class of five students. The ages of these students are given below.
Ages of Students
20, 20, 21, 22, 60
The range of ages is 40 years, yet four of the five students’ ages are within two years of each other.
The 60-year-old student has distorted the spread. Another disadvantage is that only two values, the
largest and the smallest, are used in its calculation.
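As a quick illustration, the range of each of the three classes described in this section can be computed as below. This is a sketch only; the helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Range: the highest value minus the lowest value."""
    return max(values) - min(values)

am_ages = [18, 20, 21, 21, 23, 23]      # morning class
pm_ages = [17, 17, 18, 20, 25, 29]      # evening class
third_class_ages = [20, 20, 21, 22, 60]

for label, ages in [("AM", am_ages), ("PM", pm_ages), ("Third", third_class_ages)]:
    print(f"{label} class range: {value_range(ages)} years")
```

Only the largest and smallest values enter the calculation, which is why the single 60-year-old student dominates the result for the third class.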
Mean Deviation: The arithmetic mean of the absolute values of the deviations from the arithmetic
mean.
In terms of symbols, the formula for the mean deviation is:
Mean Deviation: \[ MD = \frac{\sum |X - \bar{X}|}{n} \qquad [3-5] \]
Where:
X is the value of each observation.
X̄ is the arithmetic mean.
n is the number of observations in the sample.
The parallel lines indicate absolute value. To interpret, 4.0 years is the mean amount by which
the ages differ from the arithmetic mean age of 21.0 years for the PM students.
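The 4.0-year figure for the evening class can be reproduced with a small function. This is a sketch of formula [3-5]; the function name `mean_deviation` is illustrative, and the ages are those given earlier for the two classes.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

am_ages = [18, 20, 21, 21, 23, 23]   # morning class
pm_ages = [17, 17, 18, 20, 25, 29]   # evening class

md_am = mean_deviation(am_ages)
md_pm = mean_deviation(pm_ages)

print(f"MD, AM class: {md_am:.2f} years")
print(f"MD, PM class: {md_pm:.2f} years")
```

Both classes have a mean age of 21, but the larger mean deviation for the evening class reflects its greater spread.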
3.2.3 VARIANCE AND STANDARD DEVIATION
The disadvantage of the mean deviation is that the absolute values are difficult to manipulate
mathematically. Squaring the differences between each value and the mean eliminates the problem of
absolute values. These squared differences are used in the computation of both the variance and the
standard deviation.
Variance: The arithmetic mean of the squared deviations from the mean.
Note that the variance is non-negative and is zero only if all observations are the same.
Squaring units of measurement, such as dollars or years, makes the variance cumbersome to use
since it yields units like “dollars squared” or “years squared.” However, by calculating the
standard deviation, which is the positive square root of the variance, we can return to the original
units, such as years or dollars. Because the standard deviation is easier to interpret, it is more
widely used than the mean deviation or the variance.
Population Variance
The formulas for the population variance and the sample variance are slightly different. The formula
for the population variance is:
Population Variance: \[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \qquad [3-6] \]
Where:
σ² is the symbol for the population variance.
X is a value of an observation in the population.
µ is the arithmetic mean of the population.
N is the total number of observations in the population.
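The population standard deviation is the positive square root of the population variance:
Population Standard Deviation: \[ \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \qquad [3-7] \]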
Sample Variance
The conversion of the population variance formula to the sample variance formula is not as direct
as the change made when we went from the population mean formula to the sample mean formula.
Recall that we replaced µ with X̄ and N with n.
The conversion from population variance to sample variance requires a change in the denominator.
Instead of substituting n, the number in the sample, for N, the number in the population, we
replace N with (n – 1). Thus the formula for the sample variance is:
Sample Variance: \[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \qquad [3-8] \]
Where:
s² is the symbol for the sample variance. It is pronounced as “s squared.”
X is the value of each observation in the sample.
X̄ is the mean of the sample.
n is the total number of observations in the sample.
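To see the effect of the two denominators, the sketch below applies both formulas to the same six ages (the evening class from earlier in the chapter). Treating one data set as both a population and a sample is purely for illustration. The functions come from Python's standard `statistics` module: `pvariance` and `pstdev` divide by N, `variance` and `stdev` divide by n − 1.

```python
from statistics import pvariance, pstdev, variance, stdev

pm_ages = [17, 17, 18, 20, 25, 29]   # mean is 21; squared deviations sum to 122

pop_var = pvariance(pm_ages)     # 122 / 6, population formula
pop_sd = pstdev(pm_ages)         # positive square root of the population variance
sample_var = variance(pm_ages)   # 122 / 5, sample formula
sample_sd = stdev(pm_ages)       # positive square root of the sample variance

print(f"Population variance {pop_var:.2f}, SD {pop_sd:.2f}")
print(f"Sample variance     {sample_var:.2f}, SD {sample_sd:.2f}")
```

The sample variance is always the larger of the two for the same data, because the same sum of squared deviations is divided by a smaller number.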
Relative Dispersion
Suppose we want to compare the variability of two sets of data that are measured in different units
such as one in dollars and the other in years. How can this be done? Relative dispersion is the
answer. Below are the four relative measures of dispersion:
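Coefficient of Range: \[ \frac{L - S}{L + S} \]
Coefficient of Quartile Deviation: \[ \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
Coefficient of Mean Deviation: \[ \frac{MD}{\bar{X}} \]
Coefficient of Standard Deviation: \[ \frac{\sigma}{\bar{X}} \]
Here L and S denote the largest and the smallest values, and Q₁ and Q₃ the first and third quartiles.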
The coefficient of variation is another relative measure of dispersion.
Coefficient of variation: The ratio of the standard deviation to the arithmetic mean, expressed as a percent.
If, for example, in a study of executives the coefficient of variation for incomes is 29 percent and for their
ages it is 12 percent, we would conclude that there is more relative dispersion in the incomes of the
executives than in their ages.
Characteristics of the coefficient of variation are:
• It reports the variation relative to the mean.
• It is useful for comparing distributions with different units.
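A minimal sketch of the coefficient of variation, computed as the standard deviation divided by the mean and multiplied by 100. It assumes the sample standard deviation (`statistics.stdev`) is intended; the text does not say which form to use. The two classes of student ages from earlier in the chapter serve as data, although the measure is most useful when the units differ.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

am_ages = [18, 20, 21, 21, 23, 23]
pm_ages = [17, 17, 18, 20, 25, 29]

cv_am = coefficient_of_variation(am_ages)
cv_pm = coefficient_of_variation(pm_ages)

print(f"CV, AM class: {cv_am:.1f} percent")
print(f"CV, PM class: {cv_pm:.1f} percent")
```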
CHAPTER PROBLEMS
Problem 1
A comparison shopper employed by a large grocery chain recorded these prices for a 340-gram jar
of Kraft blackberry preserves at a sample of six supermarkets selected at random.
Supermarket   Price X
1             $1.31
2              1.35
3              1.26
4              1.42
5              1.31
6              1.33
Total         $7.98
a. Compute the arithmetic mean.
b. Compute the median.
c. Compute the mode.
Solution:
a. Determine the mean price of this raw data by summing the prices for the six jars and dividing
the total by six. Using the formula for the mean of a sample we get:
\[ \bar{X} = \frac{\sum X}{n} = \frac{\$7.98}{6} = \$1.33 \]
b. As noted above the median is defined as the middle value of a set of data, after the data is
arranged from smallest to largest. The prices for the six jars of blackberry preserves have
been ordered from a low of $1.26 up to $1.42. Because this is an even number of prices the
median price is halfway between the third and the fourth price. The median is $1.32.
Suppose there are an odd number of blackberry preserve prices, such as shown in the table.
c. The mode is the price that occurs most often. The price of $1.31 occurs twice in the original data
and is the mode.
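The three answers to Problem 1 can be verified with the standard library. This is a sketch; results are rounded to cents because the prices are stored as floating-point numbers.

```python
from statistics import mean, median, mode

prices = [1.31, 1.35, 1.26, 1.42, 1.31, 1.33]   # six supermarket prices ($)

price_mean = mean(prices)
price_median = median(prices)   # halfway between the 3rd and 4th ordered prices
price_mode = mode(prices)       # the most frequent price

print(f"Mean   ${price_mean:.2f}")
print(f"Median ${price_median:.2f}")
print(f"Mode   ${price_mode:.2f}")
```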
Problem 2
A sample of the amounts spent in November for propane gas to heat homes of similar sizes in Duluth
revealed these amounts (to the nearest dollar):
191 212 176 129 106 92 108 109 103 121 175 194
What is the range? Interpret your results.
Solution:
Recall that the range is the difference between the largest value and the smallest value.
Range = $212 − $92 = $120
This indicates that there is a difference of $120 between the largest and the smallest heating cost.
Problem 3
Using the heating cost data in Problem 2, compute the mean deviation.
Solution:
The mean deviation is the mean of the absolute deviations from the arithmetic mean. For raw, or
ungrouped data, it is computed by first determining the mean. Next, the difference between each
value and the arithmetic mean is determined. Finally, the absolute values of these differences are
totaled and the total is divided by the number of observations.
The table below shows the data values, each data value minus the mean, and the absolute value
of the deviations from the mean. In other words, the signs of the deviations from the mean are
disregarded.
Payment X    X − X̄     |X − X̄| (absolute deviation)
$191         +$48       $48
 212          +69        69
 176          +33        33
 129          −14        14
 106          −37        37
  92          −51        51
 108          −35        35
 109          −34        34
 103          −40        40
 121          −22        22
 175          +32        32
 194          +51        51
$1,716                  $466
\[ \bar{X} = \frac{\sum X}{n} = \frac{\$1{,}716}{12} = \$143.00 \]
\[ MD = \frac{\sum |X - \bar{X}|}{n} = \frac{\$466}{12} = \$38.83 \]
The mean deviation of $38.83 indicates that the typical heating cost deviates $38.83 from the mean of
$143.00.
Problem 4
The hourly wages for a sample of plumbers were grouped into the following frequency distribution.
Since the wages have been grouped into classes, we refer to the following distribution as being
grouped data.
Hourly Wages      Number f
$8 up to $10       3
$10 up to $12      6
$12 up to $14     12
$14 up to $16     10
$16 up to $18      7
$18 up to $20      2
Total             40
a. Compute the arithmetic mean.
b. Compute the mode.
Solution:
a. The arithmetic mean of this sample data, grouped into a frequency distribution, is computed by
the formula:
\[ \bar{X} = \frac{\sum fX}{n} \]
Where:
X̄ is the designation for the arithmetic mean.
X is the mid-value, or midpoint, of each class.
f is the frequency in each class.
fX is the frequency in each class times the midpoint of the class.
∑fX is the sum of these products.
n is the total number of frequencies.
It is assumed that the observations in each class are represented by the midpoint of the class. The
midpoint of the first class is $9.00, found by ($8.00 + $10.00)/2. For the next higher class, the
midpoint is $11.00.
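Using the formula, the arithmetic mean hourly wage is $13.90, found by:
Wage Rate        Frequency f   Class Midpoint X   fX
$8 up to $10      3            $9.00              $27.00
$10 up to $12     6            11.00               66.00
$12 up to $14    12            13.00              156.00
$14 up to $16    10            15.00              150.00
$16 up to $18     7            17.00              119.00
$18 up to $20     2            19.00               38.00
Total            40                               $556.00
\[ \bar{X} = \frac{\sum fX}{n} = \frac{\$556}{40} = \$13.90 \]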
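b. The mode is the value that occurs most often. The class $12 up to $14 has the highest frequency (12),
so the mode of this distribution lies in that class. Here ∆1 = 12 − 6 = 6, ∆2 = 12 − 10 = 2, and i = 2.
For data grouped into a frequency distribution the mode is:
\[ \text{Mode} = L_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i = 12 + \frac{6}{6 + 2} \times 2 = 13.5 \]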
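Both parts of Problem 4 can be sketched in code. The function names are illustrative. The mode function assumes there is a single class with the highest frequency, and treats a missing neighbouring class as having frequency 0; that edge-case choice is an assumption, not something stated in the notes.

```python
# Problem 4 classes as (lower limit, upper limit, frequency).
classes = [(8, 10, 3), (10, 12, 6), (12, 14, 12),
           (14, 16, 10), (16, 18, 7), (18, 20, 2)]

def grouped_mean(classes):
    """Sum of f * midpoint divided by the total frequency."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

def grouped_mode(classes):
    """Mode = L1 + d1 / (d1 + d2) * i, applied to the modal class."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                            # position of the modal class
    lo, hi, f_modal = classes[m]
    f_prev = freqs[m - 1] if m > 0 else 0                  # assumption: 0 if no class before
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0     # assumption: 0 if no class after
    d1, d2 = f_modal - f_prev, f_modal - f_next
    return lo + d1 / (d1 + d2) * (hi - lo)

wage_mean = grouped_mean(classes)
wage_mode = grouped_mode(classes)
print(f"Mean ${wage_mean:.2f}, mode ${wage_mode:.2f}")
```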
Problem 5
Determine the mean and SD of the sales of 100 First Food Restaurants in the Eastern Districts (in $’000).
Sales          Number of Restaurants
700 - 800       4
800 - 900       7
900 - 1000      8
1000 - 1100    10
1100 - 1200    12
1200 - 1300    17
1300 - 1400    13
1400 - 1500    10
1500 - 1600     9
1600 - 1700     7
1700 - 1800     2
1800 - 1900     1
Solution:
\[ \sum fX = 125000, \qquad \bar{X} = \frac{125000}{100} = 1250 \]
\[ \sum f(X - \bar{X})^2 = 6680000, \qquad \sigma^2 = \frac{6680000}{100} = 66800, \qquad \sigma = 258.5 \]
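The totals in the solution to Problem 5 can be reproduced as follows. This is a sketch: each class is represented by its midpoint, and the sum of squared deviations is divided by N = 100, which is what the printed answer σ² = 66800 implies.

```python
import math

# Problem 5: (lower class limit, number of restaurants); every class is 100 wide.
classes = [(700, 4), (800, 7), (900, 8), (1000, 10), (1100, 12), (1200, 17),
           (1300, 13), (1400, 10), (1500, 9), (1600, 7), (1700, 2), (1800, 1)]
width = 100

midpoints = [lo + width / 2 for lo, _ in classes]
freqs = [f for _, f in classes]
n = sum(freqs)

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sales_mean = sum_fx / n
sum_sq = sum(f * (x - sales_mean) ** 2 for f, x in zip(freqs, midpoints))
sales_var = sum_sq / n            # divides by N, not n - 1
sales_sd = math.sqrt(sales_var)

print(f"Sum fX = {sum_fx:.0f}, mean = {sales_mean:.0f}")
print(f"Sum f(X - mean)^2 = {sum_sq:.0f}, variance = {sales_var:.0f}, SD = {sales_sd:.1f}")
```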
Exercises: The Measures of Central Tendency
1. The annual exports of 50 medium-sized manufacturers were organized into a frequency
distribution. (Exports are in $ millions).
Exports Frequency
$6 up to $9 2
9 up to 12 8
12 up to 15 20
15 up to 18 14
18 up to 21 6
5. The arithmetic mean of the following series is 30.5. Find the missing figure.
Values : 10 20 ? 40 50
Frequency : 8 10 20 15 7
Correcting incorrect values:
6. The mean and median of 100 items are 50 and 52 respectively. The value of the largest item is
100. It was later found that it is actually 110. Find the correct mean and median.
7. The mean of 20 observations is 50.1. By mistake, one observation was recorded as 70 instead of −70.
Find the correct mean.
Combined Mean:
8. The mean marks obtained in an examination by a group of 100 students were found to be 49.96.
The mean marks obtained in the same examination by another group of 200 students were
52.32. Find the mean of marks obtained by both groups of students taken together. [Ans. 51.53]
9. The mean marks obtained by 300 students in the subject statistics is 45. The mean of the top 100 of
them was found to be 70 and the mean of the last 100 was known to be 20. What is the mean of the
remaining 100 students?
10. The mean weekly salary paid to all employees in a company is Tk. 500. The mean weekly salaries
paid to male and female employees are Tk. 520 and Tk. 420 respectively. Determine the percentage of
males and females employed by the company.
2. Ten observations have mean 20 and SD 4. If each of these observations is doubled, what will be
the mean and SD of the new observations? [Ans. 40, 8]
6. For a group of 200 candidates, the mean and SD were found to be 40 and 15 respectively. Later
on it was discovered that the scores 43 and 35 were misread as 34 and 53 respectively. Find the
corrected mean and SD corresponding to the figures.
Combined Variance:
7. For a group containing 100 observations, the AM and SD are 8 and 10.5 respectively. For 50
observations, selected from those 100 observations the mean and SD are 10 and 2 respectively.
Find the AM and SD of the other 50 observations.
8. In two factories A and B engaged in the same industry, the average weekly wages and SD’s are
as follows:
Factory Ave. weekly SD of wage No. of wage
wage earners
A 460 50 100
B 490 40 80
a) Which factory, A or B, pays the larger total amount in weekly wages?
b) Which factory shows greater variability in the distribution of wages?
c) What are the mean and SD of the wages of all workers in the two factories taken together?
9. The number of employees, average wage per employee and the variance of the distribution of
wages per employee for two factories are given below:
Factory - A Factory – B
No. of employees 50 100
Average wage 120 85
Variance of wage 9 16
a) In which factory is there greater variability in the distribution of wages per employee?
b) Suppose in factory B, the wages of an employee were wrongly noted as 120 instead of
100. What would be the correct mean and variance for factory B?
10. FundInfo provides information to its subscribers to enable them to evaluate the performance of
mutual funds they are considering as potential investment vehicles. A recent survey of funds
whose stated investment goal was growth and income produced the following data on total
annual rate of return over the five years:
Annual rate of return   11 - 12   12 - 13   13 - 14   14 - 15   15 - 16   16 - 17   17 - 18   18 - 19
Frequency                  2         2         8        10        11         8         3         1
Calculate the mean, Variance and SD of the annual rate of return for this sample of 45 funds.
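r-th raw moment about an arbitrary value A: \[ \mu'_r = \frac{1}{N} \sum (X - A)^r \]
r-th central moment: \[ \mu_r = \frac{1}{N} \sum (X - \bar{X})^r \]
Sheppard’s correction for moments, where h is the width of the class interval:
\[ \mu_2(\text{corrected}) = \mu_2 - \frac{h^2}{12} \]
\[ \mu_3(\text{corrected}) = \mu_3 \]
\[ \mu_4(\text{corrected}) = \mu_4 - \frac{h^2}{2}\mu_2 + \frac{7}{240}h^4 \]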
Relationship between raw and central moments:
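The relations themselves did not survive at this point in the notes. The standard identities, which are an addition here rather than a quotation, are µ₂ = µ′₂ − µ′₁², µ₃ = µ′₃ − 3µ′₂µ′₁ + 2µ′₁³ and µ₄ = µ′₄ − 4µ′₃µ′₁ + 6µ′₂µ′₁² − 3µ′₁⁴. The sketch below applies them to the raw moments given in the first two moment exercises at the end of the chapter; the function name is illustrative.

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert the first four raw moments (about any fixed value A) to central moments."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu2, mu3, mu4

# Raw moments about the origin (exercise 1) and about the value 4 (exercise 2).
about_origin = central_from_raw(2.5, 21, 166, 1132)
about_four = central_from_raw(-1.5, 17, -30, 108)

print("Central moments from moments about the origin:", about_origin)
print("Central moments from moments about 4:", about_four)
```

Both sets of raw moments lead to the same central moments, since central moments do not depend on the value A about which the raw moments are taken.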
Measures of Skewness
Skewness: Another characteristic of a set of data is the shape of the distribution. Skewness means
“Lack of Symmetry”. There are four shapes commonly observed: symmetric, positively skewed,
negatively skewed, and bimodal. The measures of location and the measures of dispersion are
both descriptive characteristics of a set of data. A third characteristic of a distribution is its
skewness. As noted before, a symmetric distribution has the same shape on either side of the
median and it has no skewness. For a positively skewed distribution the long tail is to the right, the
mean is larger than the median or the mode, and the mode appears at the highest point on the
curve. For a negatively skewed distribution the mode is the largest value and is at the highest point
of the curve, while the mean is the smallest. A bimodal distribution will have two or more peaks.
The coefficient of skewness is used to describe how a distribution is skewed. Different kinds of
skewness are shown below.
[Figure: Positive Skewness. Histogram of Number (vertical axis) against Years of Service (horizontal axis).]
[Figures: two further histograms of Number, one against Hours of Useful Life and one against Years of Service.]
Measure of Kurtosis
Exercise:
1. The first four raw moments of a distribution about the origin of the variable are 2.5, 21, 166
and 1132. Calculate all central moments, SD, β1 and β2. (Ans: 0, 14.75, 39.75, 142.3125, 3.84,
0.4926, .6543)
2. The first four moments of a distribution about the value 4 of the variable are -1.5, 17, -30 and
108. Find the moments about mean, β1 and β2. Also find the moments about the origin. (Ans:
0, 14.75, 39.75, 142.3125, 0.4926, 0.6543,2.5, 21, 166, 1132)
3. The SD of a distribution is 3. What must be the value of the fourth moment about the mean in
order that the distribution be mesokurtic? What must be the value of the third moment about
the mean when β1 of the distribution is 1.5?
4. For a distribution, the mean is 10, variance is 16, γ1 is +1, and β2 is 4. Obtain first four
moments about the origin. Comment upon the nature of the distribution.
[Ans. 10, 116, 1544, 23184]
5. Calculate the first four moments of the following distribution about the mean and hence
find β1 and β2.
X: 52 - 54 54 - 56 56 - 58 58 - 60 60 - 62 62 - 64 64 - 66
f: 2 5 12 18 39 15 9