Handout 04 Data Description
Contents
Organization and Interpretation of data: Frequency distribution, graphical representation, Histogram, frequency curve
and Ogive. Central Measures: Arithmetic Mean, Geometric Mean, Harmonic Mean, Median, Mode, Quartiles, Deciles and
Percentiles for grouped and ungrouped data. Dispersion measures: variance, standard deviation, mean deviation, coefficient of
variation, skewness.
Objectives
After careful study of this chapter, students should be able to compute and interpret the central measures and the
measures of dispersion.
References
1. Introduction to Statistical Theory, Shehzad Ahmad and Sher Muhammad Ch.
2. Elementary Statistics, 7th Edition, Allan G. Bluman
3. Statistics for Management, 7th Edition, Richard Levin and David Rubin
4. Statistics for Business and Economics, 10th Edition, David R. Anderson, Dennis J. Sweeney and Thomas A. Williams
Data Description
There are three main tasks in descriptive statistics: (i) collection and organization, (ii) analysis,
and (iii) interpretation of data.
(i) Collection and Organization of Data:
Graphically: through the use of charts and graphs
Numerically: through the use of tables of data
(ii) Analysis of Data:
Once the data is organized, we can go ahead and compute various quantities (called statistics or
parameters) associated with the data.
(iii) Interpretation of Data:
Once we have performed the analysis, we can use the information to make assertions about the real world.
Samples versus Population
The term "population" is used in statistics to represent all possible measurements or outcomes
that are of interest to us in a particular study. The term "sample" refers to a portion of the population that
is representative of the population from which it was selected.
In order to use statistics to learn things about the population, the sample must be random. A
random sample is one in which every member of a population has an equal chance of being selected. The
most commonly used sample is a simple random sample. It requires that every possible sample of the
selected size has an equal chance of being used.
A parameter is a characteristic of a population. A statistic is a characteristic of a sample.
Inferential statistics enables you to make an educated guess about a population parameter based on a
statistic computed from a sample randomly drawn from that population.
Muhammad Naeem Sandhu, Assistant Professor, Department of Mathematics, University of Engineering and Technology, Lahore
Statistical procedures
Statistical procedures can be divided into two major categories: descriptive statistics and
inferential statistics.
(i) Descriptive Statistics
Descriptive statistics includes statistical procedures that we use to describe the data at hand. The
data could be collected from either a sample or a population, but the results only help us organize and describe
those data. Descriptive statistics can only be used to describe the group that is being studied. Frequency
distributions, measures of central tendency (mean, median, and mode), and graphs like pie charts and bar
charts that describe the data are all examples of descriptive statistics.
(ii) Inferential Statistics
Inferential statistics is concerned with making predictions or inferences about a population from
observations and analysis of a sample. Regression analysis, tests of hypotheses, tests of significance and analysis of
variance are examples of inferential statistics.
(A) Frequency Distribution
The main object of descriptive statistics is to put the information contained in a set of data into a
more useable form.
By condensing the raw data into tabular form, we distribute the data into classes or categories
and determine the number of individuals belonging to each class, called the class frequency. A tabular
arrangement of data by classes together with the corresponding class frequencies is called a frequency
distribution or frequency table or categorical data. We can also use relative frequency and percentage
frequency in a frequency distribution, where
\[ \text{relative frequency} = \frac{\text{frequency}}{n} \qquad \text{and} \qquad \text{percent frequency} = 100 \times \text{relative frequency} \]
Examples (1)
Thirty batteries were tested to determine how long they would last. The results, to the nearest
minute, were recorded as:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400,
381, 399, 415, 428, 422, 396, 372, 410, 419, 386, 390
Construct a frequency distribution table.
Solution
The lowest value is 363 and the highest is 431. Using the given data and a class interval of 10, the
interval for the first class is 360 to 369 and includes 363 (the lowest value). Remember, there should
always be enough class intervals so that the highest value is
included. The completed frequency distribution table should
look like this:
Life of batteries in minutes:
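The tallying described in this solution can be sketched in Python. This is only an illustration under the assumptions stated in the solution (first class starting at 360, class interval of 10); the function name `frequency_table` is a hypothetical helper, not part of the handout.

```python
from collections import Counter

# Battery lifetimes (minutes) from Example (1).
lifetimes = [423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
             392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
             399, 415, 428, 422, 396, 372, 410, 419, 386, 390]

def frequency_table(data, first_lower, width):
    """Return rows (lower limit, upper limit, frequency, relative frequency)."""
    # Class index of each value, counted from the first class.
    counts = Counter((x - first_lower) // width for x in data)
    n = len(data)
    rows = []
    for k in range(max(counts) + 1):
        lower = first_lower + k * width
        upper = lower + width - 1      # integer data, so limits look like 360-369
        f = counts.get(k, 0)
        rows.append((lower, upper, f, f / n))
    return rows

battery_table = frequency_table(lifetimes, first_lower=360, width=10)
for lower, upper, f, rel in battery_table:
    print(f"{lower}-{upper}  f={f:2d}  relative={rel:.3f}  percent={100 * rel:.1f}")
```

The relative and percent frequencies printed in the last two columns follow the definitions given earlier in this section.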
Examples (2)
These data represent the record high temperatures in degrees Fahrenheit (°F) for each of the 50
states. Construct a grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Source: The World Almanac and Book of Facts
Example 2-2 page 41 Elementary Statistics by Bluman
Solution
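One possible way to carry out the construction is sketched below. It assumes the common textbook rule of taking the class width as the range divided by the number of classes, rounded up to a whole number, and starting the first class at the smallest observation; both choices are assumptions of this sketch.

```python
import math

# Record high temperatures (°F) for the 50 states, from Example (2).
temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

num_classes = 7
data_range = max(temps) - min(temps)           # 134 - 100 = 34
width = math.ceil(data_range / num_classes)    # 34 / 7 rounded up gives 5

temp_classes = []
for k in range(num_classes):
    lower = min(temps) + k * width
    upper = lower + width - 1                  # integer data: 100-104, 105-109, ...
    frequency = sum(lower <= t <= upper for t in temps)
    temp_classes.append((lower, upper, frequency))

for lower, upper, frequency in temp_classes:
    print(f"{lower}-{upper}: {frequency}")
```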
Examples (4)
The data shown here represent the number of miles per gallon (mpg) that 30 selected four-wheel-
drive sports utility vehicles obtained in city driving. Construct a frequency distribution, and analyze the
distribution.
12 17 12 14 16 18 16 18 12 16 17 15 15 16 12
15 16 16 12 14 15 12 15 15 19 13 16 18 16 14
Source: Model Year Fuel Economy Guide. United States
Environmental Protection Agency.
The complete ungrouped frequency distribution is
In this case, almost one-half (14) of the vehicles get 15 or 16 miles per gallon.
The cumulative frequencies are:
Class boundary      Cumulative frequency
Less than 11.5       0
Less than 12.5       6
Less than 13.5       7
Less than 14.5      10
Less than 15.5      16
Less than 16.5      24
Less than 17.5      26
Less than 18.5      29
Less than 19.5      30
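The cumulative frequencies above are running totals of the ungrouped frequencies. A minimal Python sketch of that step, using the mpg data of this example (variable names are illustrative only):

```python
from collections import Counter
from itertools import accumulate

# City mpg of the 30 vehicles from Example (4).
mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16, 17, 15, 15, 16, 12,
       15, 16, 16, 12, 14, 15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

counts = Counter(mpg)
values = sorted(counts)                       # distinct mpg values: 12 ... 19
frequencies = [counts[v] for v in values]
running_totals = list(accumulate(frequencies))

# "Less than" table: upper boundary of each value (value + 0.5) with its running total.
less_than = [(values[0] - 0.5, 0)]
less_than += [(v + 0.5, c) for v, c in zip(values, running_totals)]

for boundary, cumulative in less_than:
    print(f"Less than {boundary}: {cumulative}")
```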
Exercise (1)
The number of passengers (in thousands) for the leading U.S. passenger airlines in 2004 is
indicated below. Use the data to construct a grouped frequency distribution and a cumulative frequency
distribution with a reasonable number of classes and comment on the shape of the distribution.
91,570 86,755 81,066 70,786 55,373
42,400 40,551 21,119 16,280 4,869
13,659 13,417 13,170 12,632 11,731
10,420 10,024 9,122 7,041 6,954
6,406 6,362 5,930 5,585 5,427
The Pareto chart is a simple and powerful graphic for identifying where the majority of
problems in a process originate.
(3) Histogram
A histogram is a bar graph of raw data that creates a picture of the data distribution. The bars
represent the frequency of occurrence by classes of data. A histogram shows basic information about the
data set, such as central location, width of spread etc.
Histograms show how data can pile up; in any distribution of values, some values will occur
more frequently than others. The peaks on the histogram show where there is similarity among the data.
This is the central location, which is measured by mean, median, and mode. While these statistics provide
valuable information about the process, central location alone does not provide a complete picture of the
process. When you consider the spread of the data, you will see its extremes. The shape of the histogram
can show if the system leans toward one extreme or the other, or if there are multiple peaks.
When you use a histogram for prediction, the system must be stable. If not, the central location,
spread, and shape may vary dramatically in histograms created from data taken at different times and will
not be an accurate reflection of the process. If you are not using histograms to make predictions, stability
is not required.
We can construct a histogram by taking class boundaries along the x-axis and frequency along the y-axis,
then drawing a rectangular bar over each class, between its class boundaries, with a height equal to the
corresponding frequency.
Examples (5)
Using the data given in Example (1), we can construct a histogram by taking class boundaries along the
x-axis and frequency along the y-axis, then drawing a rectangular bar over each class with a
height equal to the corresponding frequency.
Further, joining the midpoints of the tops of all rectangular bars with a smooth curve gives
a frequency curve, as shown in the figure. It is not necessary for the smooth curve to pass through all
the points.
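As a rough illustration of this construction, the sketch below derives class boundaries and midpoints for the battery data of Example (1) and prints a text histogram. It assumes the classes 360-369, 370-379, ... and takes each boundary half a unit beyond the class limits, since the data are recorded to the nearest minute.

```python
# Battery lifetimes (minutes) from Example (1).
lifetimes = [423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
             392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
             399, 415, 428, 422, 396, 372, 410, 419, 386, 390]

first_lower, width = 360, 10
num_classes = (max(lifetimes) - first_lower) // width + 1

bars = []
for k in range(num_classes):
    lower_limit = first_lower + k * width
    upper_limit = lower_limit + width - 1
    lower_boundary = lower_limit - 0.5        # bar starts here on the x-axis
    upper_boundary = upper_limit + 0.5        # bar ends here on the x-axis
    midpoint = (lower_boundary + upper_boundary) / 2
    frequency = sum(lower_limit <= x <= upper_limit for x in lifetimes)
    bars.append((lower_boundary, upper_boundary, midpoint, frequency))

# One row per bar; the number of '#' marks is the bar height.
for lb, ub, mid, f in bars:
    print(f"{lb:5.1f}-{ub:5.1f} | {'#' * f}  (midpoint {mid}, f = {f})")
```

The pairs (midpoint, frequency) are the points through which a frequency curve would be drawn.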
If we find the midpoint of each class and then draw a smooth curve of the
cumulative frequency against the midpoints, the diagram would be as follows:
[Figure: Cumulative Frequency Curve or Ogive; x-axis: Mid Points, y-axis: C.F]
Class (average inventory, days)   Frequency   Relative frequency
2.0-2.5                             1          0.05
2.6-3.1                             0          0.00
3.2-3.7                             2          0.10
3.8-4.3                             8          0.40
4.4-4.9                             5          0.25
5.0-5.5                             4          0.20
Total                              20          1.00
Some conclusions:
The frequency of an average inventory of 4.4 to 4.9 days is 5.
The relative frequency of an average inventory of 4.4 to 4.9 days is 0.25.
Examples (6)
Construct a histogram and ogive to represent the data shown for the record high temperatures for
each of the 50 states.
Classes     100-104  105-109  110-114  115-119  120-124  125-129  130-134
Frequency      2        8       18       13        7        1        1
Examples (7)
Here is a frequency distribution of the weights of 150 people who used a ski lift on a certain day.
Construct a histogram for these data.
Class Frequency Class Frequency
75-89 10 150-164 23
90-104 11 165-179 9
105-119 23 180-194 9
120-134 26 195-209 6
135-149 31 210-224 2
(a) What can you see from the histogram about the data that was not immediately apparent
from the frequency distribution?
(b) If each ski lift chair holds two people but is limited in total safe weight capacity to 400
pounds, what can the operator do to maximize the people capacity of the ski lift without
exceeding the safe weight capacity of a chair? Do the data support your proposal?
Solution
(a) The lower tail of the distribution is fatter (has more observations in it) than the upper tail.
(b) Because there are so few people who weigh 180 pounds or more, the operator can afford to
pair each person who appears to be heavy with a lighter person. This can be done without
greatly delaying any individual's turn at the lift.
Exercise (2)
The number of passengers (in thousands) for the leading U.S. passenger airlines in 2004 is
given in Exercise (1). Use the data to construct a grouped frequency distribution and a cumulative frequency
distribution with a reasonable number of classes and comment on the shape of the distribution.
In such cases (classes of unequal width) we find a proportional height for each rectangular bar. So we construct the table as follows:
Class Boundaries   Frequency   Width of Classes (in units)   Proportional Height
10-20                 6                 1                          6
20-30                 7                 1                          7
30-40                 8                 1                          8
40-50                10                 1                         10
50-70                10                 2                          5
70-100                9                 3                          3
100-140               8                 4                          2
Now we construct the histogram by taking class boundaries along the x-axis and proportional height
along the y-axis.
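The proportional heights in the table are the frequencies divided by the class width expressed in units of the narrowest class. A short sketch of that calculation, assuming (as the table suggests) that the unit is the smallest class width:

```python
# (lower boundary, upper boundary, frequency) from the table above.
classes = [(10, 20, 6), (20, 30, 7), (30, 40, 8), (40, 50, 10),
           (50, 70, 10), (70, 100, 9), (100, 140, 8)]

# Assumption: one "unit" of width is the narrowest class (here 10).
unit = min(upper - lower for lower, upper, _ in classes)

heights = []
for lower, upper, frequency in classes:
    width_in_units = (upper - lower) / unit
    heights.append(frequency / width_in_units)   # bar area stays proportional to frequency

print(heights)
```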
Exercises
(2) The following is a frequency distribution of students of different ages. Construct a histogram.
Ages   18-19   20-24   25-29   30-34   35-44   45-59
No.      9      188     160     123      84      15
(3) Here are the ages of 30 people who bought video recorders at Liberty Music Shop last week:
26 37 40 18 14 45 32 68 31 37
20 32 15 27 46 44 62 58 30 42
22 26 44 41 34 55 50 63 29 22
(a) From looking at the data just as they are, what conclusions can you come to quickly about
Liberty's market?
(b) Construct a 6-category closed classification. Does having this enable you to conclude
anything more about Liberty's market?
(4) At a newspaper office, the time required to set the entire front page in type was recorded for 50
days. The data, to the nearest tenth of a minute, are given below.
20.8 22.8 21.9 22.0 20.7 20.9 25.0 22.2 22.8 20.1
25.3 20.7 22.5 21.2 23.8 23.3 20.9 22.9 23.5 19.5
23.7 20.3 23.6 19.0 25.1 25.0 19.5 24.1 24.2 21.8
21.3 21.5 23.1 19.9 24.2 24.1 19.8 23.9 22.8 23.9
19.7 24.2 23.8 20.7 23.8 24.3 21.1 20.9 21.6 22.7
0.4 1.9 1.5 0.9 0.3 1.6 0.4 1.5 1.2 0.8
0.9 0.7 0.9 0.7 0.9 1.5 0.5 1.5 1.7 1.8
(6) The administrator of a hospital has ordered a study of the amount of time a patient must wait before
being treated by emergency room personnel. The following data were collected during a typical
day.
12 16 21 20 24 3 11 17 29 18
26 4 7 14 25 1 27 15 16 5
(a) Arrange the data in an array from lowest to highest. What comment can you make about
patient waiting time from your data array?
(b) Now construct a frequency distribution using 6 classes. What additional interpretation can
you give to the data from the frequency distribution?
(c) From an ogive, state how long 75 percent of the patients should expect to wait based on these data.
(4) The Bureau of Labor Statistics has sampled 30 communities nationwide and compiled prices in each
community at the beginning and end of August in order to find out approximately how the
Consumer Price Index has changed during August. The percentage changes in prices for the 30
communities are as follows (Ref. Ex. 2.19, Statistics for Management, 7th Ed., by Levin and Rubin):
0.7 0.4 0.3 0.2 0.1 0.1 0.3 0.7 0.0 0.4
0.1 0.5 0.2 0.3 1.0 0.3 0.0 0.2 0.5 0.1
0.5 0.3 0.1 0.5 0.4 0.0 0.2 0.3 0.5 0.4
(C) Averages
The following average measures are also called the central measures
(i) Arithmetic Mean
(ii) Geometric Mean
(iii) Harmonic Mean
(1) Arithmetic Mean
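The arithmetic mean, or simply the mean, is the most familiar average. It is defined as
\[ \text{Mean} = \frac{\text{Sum of all the observations}}{\text{Number of the observations}} \]
For ungrouped data,
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}, \qquad (i = 1, 2, \ldots, n) \]
For grouped data,
\[ \bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum f_i x_i}{\sum f_i}, \qquad \left(n = \sum f_i\right) \]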
Advantages of Arithmetic Mean
Its concept is familiar to most people and intuitively clear.
It is a measure that can be calculated, and it is unique because every data set has one and only one mean.
The mean is useful for performing statistical procedures such as comparing the means from several
data sets.
Disadvantages of Arithmetic Mean
It may be affected by extreme values that are not representative of the rest of the data, e.g. the
mean of the values 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0 is 5.3. But if we exclude the value 9.0, the
answer is about 4.7. The one extreme value 9.0 distorts the value we get for the mean.
It may sometimes be time consuming to compute.
We are unable to compute the mean for data with open-ended classes.
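Properties
\( \text{Mean}(a) = a \), where \(a\) is a constant
\( \text{Mean}(X \pm a) = \text{Mean}(X) \pm a \)
\( \text{Mean}(bX) = b \, \text{Mean}(X) \)
The sum of the deviations from the mean value is equal to zero.
For two sets of data with \(n_1\), \(n_2\) values and mean values \(\bar{X}_1\), \(\bar{X}_2\) respectively,
the joint mean is
\[ \bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \]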
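Some of these statements can be checked numerically. The sketch below reuses the seven values quoted under the disadvantages of the mean; splitting them into two groups for the joint-mean check is an arbitrary choice made only for illustration.

```python
from statistics import mean

values = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]

# Effect of the extreme value 9.0 on the mean.
mean_all = mean(values)
mean_without_extreme = mean(values[:-1])

# Property: deviations from the mean sum to zero (up to rounding error).
deviation_sum = sum(x - mean_all for x in values)

# Property: joint mean of two groups, weighted by their sizes.
first, second = values[:3], values[3:]
joint_mean = (len(first) * mean(first) + len(second) * mean(second)) / (len(first) + len(second))

print(round(mean_all, 2), round(mean_without_extreme, 2))
print(abs(deviation_sum) < 1e-9, abs(joint_mean - mean_all) < 1e-9)
```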
Exercise
(1) Find the arithmetic mean, geometric mean and harmonic mean of the series
(i) \(1, 2, 4, 8, 16, \ldots, 2^n\)
(ii) \(1, 3, 9, 27, 81, \ldots, 3^n\). (Sher)
(2) Find the average rate of
a. motion in case of a person who rides the first mile at the rate of 10 miles an hour, the next
mile at the rate of 8 miles per hour and the third mile at the rate of 6 miles per hour.
b. Increase in the population, which in the first decade has increased 20%, in the next 25%
and in the third 44%.
Problem 4-108 Elementary Statistics by Bluman, chapter 3, page 122
Exercise
(1) A salesperson drives 300 miles round trip at 30 miles per hour going to Chicago and 45 miles per
hour returning home. Find the average miles per hour.
(2) A bus driver drives 50 miles to West Chester at 40 miles per hour and returns driving 25 miles per
hour. Find the average miles per hour.
(3) A carpenter buys $500 worth of nails at $50 per pound and $500 worth of nails at $10 per pound.
Find the average cost of 1 pound of nails.
(4) The following are the monthly salaries in rupees of 30 employees of a firm:
The firm gave bonuses of Rs. 10, 15, 20, 25, 30 and 35 for individuals in the respective salary
groups; exceeding 60 but not exceeding 75, exceeding 75 but not exceeding 90 and so on up to
exceeding 135 but not exceeding 150. Find the average bonus paid per employee.
Examples (10)
Dave's Giveaway Store advertises, "If our average prices are not equal or lower than everyone
else's, you get it free." One of Dave's customers came into the store one day and threw on the counter
bills of sale for six items she bought from a competitor for an average price less than Dave's. (Statistics
for Management, 7th Ed, by Richard Levin and David Rubin, Chap 3)
The items cost:
$1.29, $2.97, $3.49, $5.00, $7.50, $10.95
Dave's prices for the same six items are:
$1.35, $2.89, $3.19, $4.98, $7.59, $11.50
Dave told the customer, "My ad refers to a weighted average price of these items. Our average is lower
because our sales of these items have been
7, 9, 12, 8, 6, 3."
Is Dave getting himself into or out of trouble by talking about weighted averages?
Solution
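With the unweighted average, we get
\[ \bar{x}_C = \frac{\sum x_i}{n} = \frac{1.29 + 2.97 + 3.49 + 5.00 + 7.50 + 10.95}{6} = \frac{31.20}{6} = \$5.20 \text{ at the competition} \]
\[ \bar{x}_D = \frac{\sum x_i}{n} = \frac{1.35 + 2.89 + 3.19 + 4.98 + 7.59 + 11.50}{6} = \frac{31.50}{6} = \$5.25 \text{ at Dave's} \]
With the weighted average,
\[ \bar{x}_C = \frac{\sum w x_i}{\sum w} = \frac{7(1.29) + 9(2.97) + 12(3.49) + 8(5.00) + 6(7.50) + 3(10.95)}{7 + 9 + 12 + 8 + 6 + 3} = \frac{195.49}{45} = \$4.344 \text{ at the competition} \]
\[ \bar{x}_D = \frac{\sum w x_i}{\sum w} = \frac{7(1.35) + 9(2.89) + 12(3.19) + 8(4.98) + 6(7.59) + 3(11.50)}{7 + 9 + 12 + 8 + 6 + 3} = \frac{193.62}{45} = \$4.303 \text{ at Dave's} \]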
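These averages can be checked numerically. The following is a minimal sketch; `weighted_mean` is a hypothetical helper written for this example, and the weights are the sales figures quoted by Dave. It also evaluates the weighted average of Dave's prices with the same weights.

```python
competitor_prices = [1.29, 2.97, 3.49, 5.00, 7.50, 10.95]
dave_prices = [1.35, 2.89, 3.19, 4.98, 7.59, 11.50]
units_sold = [7, 9, 12, 8, 6, 3]          # weights: Dave's sales of each item

def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

unweighted_competitor = sum(competitor_prices) / len(competitor_prices)
unweighted_dave = sum(dave_prices) / len(dave_prices)
weighted_competitor = weighted_mean(competitor_prices, units_sold)
weighted_dave = weighted_mean(dave_prices, units_sold)

print(f"unweighted: competitor {unweighted_competitor:.2f}, Dave {unweighted_dave:.2f}")
print(f"weighted:   competitor {weighted_competitor:.3f}, Dave {weighted_dave:.3f}")
```

With these figures the ordering of the two stores reverses between the unweighted and the weighted comparison, which is the point of the example.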
Examples (11)
Bennett Distribution Company, a subsidiary of a major appliance manufacturer, is forecasting
regional sales for the next year. The Atlantic branch, with current yearly sales of $193.8 million, is
expected to achieve a sales growth of 7.25 percent; the Midwest branch, with current sales of $79.3
million, is expected to grow by 8.20 percent; and the Pacific branch, with sales of $57.5 million, is
expected to increase sales by 7.15 percent. What is the average rate of sales growth forecasted for next
year? (Statistics for Management, 7th Ed, by Richard Levin and David Rubin Chap 3)
Solution
\[ \bar{x}_w = \frac{\sum w x_i}{\sum w} = \frac{193.8(7.25) + 79.3(8.20) + 57.5(7.15)}{193.8 + 79.3 + 57.5} = \frac{2466.435}{330.6} = 7.46\% \]
Exercise ( Bluman )
1. Find the weighted mean price of three models of automobiles sold. The number and price
of each model sold are shown in this list.
Model Number Price
A 8 $10,000
B 10 $12,000
C 12 $8,000
2. Using the weighted mean, find the average number of grams of fat per ounce of meat or fish that
a person would consume over a 5-day period if he ate these:
Meat or Fish Fat (g/oz)
3 oz fried shrimp 3.33
3 oz veal cutlet (broiled) 3.00
2 oz roast beef (lean) 2.50
2.5 oz fried chicken drumstick 4.40
4 oz tuna (canned in oil) 1.75
Source:- The World Almanac and Book of Facts
3. A recent survey of a new diet cola reported the following percentages of people who liked the
taste. Find the weighted mean of the percentages.
i.
the growth factor is the amount by which we multiply the savings at the beginning of the year to
get the saving at the end of the year.
The simple arithmetic mean of the growth factors would be (1.07 + 1.08 + 1.10 + 1.12 + 1.18)/5 = 1.11,
which corresponds to an average interest rate of 11 percent per year. If the bank gives interest at
a constant rate of 11 percent per year, however, a $100 deposit would grow in five years to
$100 × 1.11 × 1.11 × 1.11 × 1.11 × 1.11 = $168.51
The table shows that the actual figure is only $168.00. Thus the correct average growth factor
must be slightly less than 1.11.
To find the correct average growth factor, we can multiply together the 5 yearly growth factors and
then take the 5th root of the product. The result is the geometric mean growth rate, which is the
appropriate average to use here.
\[ G.M = \sqrt[5]{1.07 \times 1.08 \times 1.10 \times 1.12 \times 1.18} = \sqrt[5]{1.679965} = 1.1093 \]
Notice that the correct average interest rate of 10.93 percent per year obtained with the geometric
mean is very close to the incorrect average rate of 11 percent obtained with arithmetic mean.
This happens because the interest rates are relatively small.
In highly inflationary economies, banks pay high interest rates to attract savings. Suppose that
over 5 years in an unbelievably inflationary economy, banks pay interest at annual rates of 100,
200, 250, 300 and 400 percent, which correspond to growth factors of 2, 3, 3.5, 4, and 5.
(Calculate the growth factor with both the arithmetic mean and the geometric mean as you did in the above
table; you will find a significant difference.)
Solution
In 5 years, an initial deposit of $100 would grow to $100 × 2 × 3 × 3.5 × 4 × 5 = $42000. The
arithmetic mean growth factor is (2 + 3 + 3.5 + 4 + 5)/5 or 3.5. This corresponds to an average interest
rate of 250 percent. Yet if the bank gave interest at a constant rate of 250 percent per year, then $100
would grow to $52521.88 in 5 years:
$100 × 3.5 × 3.5 × 3.5 × 3.5 × 3.5 = $52521.88
This answer exceeds the actual $42000 by more than $10500, a sizable error.
Let's use the formula for finding the geometric mean of a series of numbers to
determine the correct growth factor.
\[ GM = \sqrt[n]{\text{product of all } x \text{ values}} = \sqrt[5]{2 \times 3 \times 3.5 \times 4 \times 5} = \sqrt[5]{420} = 3.347 \quad \text{(average growth factor)} \]
This growth factor corresponds to an average interest rate of 235 percent per year.
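Both comparisons of arithmetic and geometric mean growth factors can be reproduced with Python's standard library (`statistics.geometric_mean` and `math.prod`, available from Python 3.8). This is only a checking sketch; the $100 deposit and the growth factors are those used in the text.

```python
import math
from statistics import geometric_mean, mean

deposit = 100
scenarios = {
    "moderate rates": [1.07, 1.08, 1.10, 1.12, 1.18],
    "inflationary rates": [2, 3, 3.5, 4, 5],
}

results = {}
for name, factors in scenarios.items():
    years = len(factors)
    am = mean(factors)                              # arithmetic mean growth factor
    gm = geometric_mean(factors)                    # geometric mean growth factor
    actual = deposit * math.prod(factors)           # what the deposit really becomes
    results[name] = (am, gm, actual, deposit * am ** years, deposit * gm ** years)

for name, (am, gm, actual, with_am, with_gm) in results.items():
    print(f"{name}: AM={am:.4f} GM={gm:.4f} "
          f"actual={actual:.2f} using AM={with_am:.2f} using GM={with_gm:.2f}")
```

Compounding at the geometric mean reproduces the actual final amount, while compounding at the arithmetic mean overstates it.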
Examples (13)
The growth in bad-debt expense for a company over the last few years follows. Calculate the
average percentage increase in bad-debt expense over this time period. If this rate continues, estimate the
percentage increase in bad debt for 1997, relative to 1995.
Year       1989   1990   1991    1992   1993    1994    1995
Increase   0.11   0.09   0.075   0.08   0.095   0.108   0.120
Solution
\[ G.M = \sqrt[7]{1.11 \times 1.09 \times 1.075 \times 1.08 \times 1.095 \times 1.108 \times 1.120} = \sqrt[7]{1.908769992} = 1.09675 \]
The average increase is 9.675 percent per year. The estimate for bad-debt expense in 1997 is
\( (1.09675)^2 - 1 = 0.2029 \), i.e. 20.29% higher than in 1995.
Exercise
Find the geometric mean of each of these.
a). The growth rates of the Living Life Insurance Corporation for the past 3 years
were 35, 24, and 18%.
b). A person received these percentage raises in salary over a 4-year period: 8, 6, 4,
and 5%.
c). A stock increased each year for 5 years at these percentages: 10, 8, 12, 9, and 3%.
d). The price increases, in percentages, for the cost of food in a specific geographic
region for the past 3 years were 1, 3, and 5.5%.
The advantages of geometric mean are
It is based on all observed values.
It gives equal weightage to all the observations.
It is not much affected by sampling variability.
The disadvantages of geometric mean are
It vanishes if any observation is zero.
In case of negative values, it cannot be computed at all.
(4) The Harmonic Mean
This mean is useful for finding the average speed. Suppose a person drove 100 miles at 40 miles
per hour and returned driving 50 miles per hour. The average miles per hour is not \( \frac{40 + 50}{2} = 45 \) miles
per hour. The correct average is found as shown:
Since Time = distance / rate, then
\[ \text{Time 1} = \frac{100}{40} = 2.5 \text{ hours to make the trip, and} \]
\[ \text{Time 2} = \frac{100}{50} = 2 \text{ hours to return.} \]
Hence the total time is 4.5 hours, and the total miles driven are 200. Now the average speed is
\[ \text{Rate} = \frac{\text{distance}}{\text{time}} = \frac{200}{4.5} = 44.44 \text{ miles per hour} \]
This value can also be found by using the harmonic mean as
\[ HM = \frac{2}{1/40 + 1/50} = 44.44 \]
Definition
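The harmonic mean is the reciprocal of the mean of the reciprocals.
For ungrouped data,
\[ H = \text{Reciprocal of } \frac{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}{n} = \frac{n}{\sum \frac{1}{x_i}} \]
For grouped data,
\[ H = \text{Reciprocal of } \frac{\sum \left( \frac{f}{x} \right)_i}{\sum f_i} = \frac{\sum f_i}{\sum \frac{f_i}{x_i}} \]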
The advantages of the harmonic mean are
It is based on all observed values.
It is an appropriate type for averaging rates and ratios.
Its disadvantage is that it is neither easy to calculate nor easy to understand.
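The round-trip example can be verified with `statistics.harmonic_mean` from the Python standard library. The grouped-data helper below is a hypothetical function written for this sketch; it simply applies the definition with frequencies as weights.

```python
from statistics import harmonic_mean

speeds = [40, 50]            # miles per hour, out and back
distance_each_way = 100      # miles

# Direct calculation: total distance divided by total time.
total_time = sum(distance_each_way / s for s in speeds)
average_speed = distance_each_way * len(speeds) / total_time

# Same result from the harmonic mean of the two speeds.
hm_speed = harmonic_mean(speeds)

def grouped_harmonic_mean(xs, fs):
    """Harmonic mean of values xs occurring with frequencies fs."""
    return sum(fs) / sum(f / x for x, f in zip(xs, fs))

print(round(average_speed, 2), round(hm_speed, 2))
print(round(grouped_harmonic_mean([40, 50], [1, 1]), 2))
```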
Weights (grams)   65-84   85-104   105-124   125-144   145-164   165-184   185-204
f                   9       10       17        10         5         4         5
Solution
The necessary calculations are given below:
Exercise
The tensile strength of silicone rubber is thought to be a function of curing temperature. A study
was carried out in which samples of 12 specimens of the rubber were prepared using curing temperatures
of 20 °C and 45 °C.
The data below show the tensile strength values in megapascals. Calculate the sample mean and
median for the data at the two curing temperatures. (Walpole p-35)
Exercise
The ages of residents in a community have the following distribution
Class       47-51.9   52-56.9   57-61.9   62-66.9   67-71.9   72-76.9   77-81.9
Frequency      4         9        13        42        39        20         9
Estimate the modal value of the distribution.
For Q1 we check whether \( \frac{n}{4} \) is an integer or a non-integer (the data being arranged in ascending order).
If \( \frac{n}{4} \) is not an integer, then Q1 = the \( \left( \left[ \frac{n}{4} \right] + 1 \right) \)th item in the data, where \( [\,\cdot\,] \) denotes the integer part.
If \( \frac{n}{4} \) is an integer, then Q1 = average of the \( \frac{n}{4} \)th and \( \left( \frac{n}{4} + 1 \right) \)th items.
Similarly, for Q2 and Q3 we check whether \( \frac{2n}{4} \) and \( \frac{3n}{4} \) are integers or non-integers respectively, then
we find the values of Q2 and Q3 in the same way as we did in the case of Q1.
When the data is in grouped form, then
\[ Q_1 = l + \frac{h}{f} \left( \frac{n}{4} - c \right) \]
Where
l = lower limit of the class for Q1
n = number of observations in the sample
c = sum of the frequencies in all classes preceding the class for Q1.
f = frequency of the class for Q1
h = class interval of the class for Q1
Similarly we can find Q2 and Q3.
(2) Deciles
Deciles divide the distribution into 10 groups, as shown. They are denoted by D1, D2, etc.
For D7 we check whether \( \frac{7n}{10} \) is an integer or a non-integer.
If \( \frac{7n}{10} \) is not an integer, then D7 = the \( \left( \left[ \frac{7n}{10} \right] + 1 \right) \)th item in the data.
If \( \frac{7n}{10} \) is an integer, then D7 = average of the \( \frac{7n}{10} \)th and \( \left( \frac{7n}{10} + 1 \right) \)th items.
Similarly, for D2 and D3 we check whether \( \frac{2n}{10} \) and \( \frac{3n}{10} \) are integers or non-integers respectively, then
we find the values of D2 and D3 in the same way as we did in the case of D7.
When the data is in grouped form, then
\[ D_7 = l + \frac{h}{f} \left( \frac{7n}{10} - c \right) \]
Where
l = lower limit of the class for D7
n = number of observations in the sample
c = sum of the frequencies in all classes preceding the class for D7.
f = frequency of the class for D7
h = class interval of the class for D7
Similarly we can find D2 and D3.
(3) Percentiles
Percentiles are position measures used in educational and health-related fields to indicate the
position of an individual in a group.
Percentiles divide the data set into 100 equal groups and are symbolized by
P1, P2, P3, . . . , P99.
For instance,
For P27 we check whether \( \frac{27n}{100} \) is an integer or a non-integer.
If \( \frac{27n}{100} \) is not an integer, then P27 = the \( \left( \left[ \frac{27n}{100} \right] + 1 \right) \)th item in the data.
If \( \frac{27n}{100} \) is an integer, then P27 = average of the \( \frac{27n}{100} \)th and \( \left( \frac{27n}{100} + 1 \right) \)th items.
Similarly, for P25 and P30 we check whether \( \frac{25n}{100} \) and \( \frac{30n}{100} \) are integers or non-integers
respectively, then we find the values of P25 and P30 in the same way as we did in the case of P27.
When the data is in grouped form, then
\[ P_{27} = l + \frac{h}{f} \left( \frac{27n}{100} - c \right) \]
Where
l = lower limit of the class for P27
n = number of observations in the sample
c = sum of the frequencies in all classes preceding the class for P27.
f = frequency of the class for P27
h = class interval of the class for P27
Similarly we can find P25 and P30.
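For ungrouped data, the integer / non-integer rule used above for quartiles, deciles and percentiles can be written as one small function. This is a sketch of that rule only (statistical software often uses other interpolation conventions); `position_measure` is a hypothetical name, and the mpg data of Example (4) are reused as test input.

```python
def position_measure(data, k, m):
    """k-th of m parts: quartile (m=4), decile (m=10), percentile (m=100)."""
    if not 0 < k < m:
        raise ValueError("k must satisfy 0 < k < m")
    items = sorted(data)                   # the rule assumes ordered data
    n = len(items)
    whole, remainder = divmod(k * n, m)    # integer part of k*n/m and what is left over
    if remainder:                          # k*n/m is not an integer
        return items[whole]                # the ([k*n/m] + 1)th item, counting from 1
    # k*n/m is an integer: average the (k*n/m)th and (k*n/m + 1)th items
    return (items[whole - 1] + items[whole]) / 2

mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16, 17, 15, 15, 16, 12,
       15, 16, 16, 12, 14, 15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

q1 = position_measure(mpg, 1, 4)      # n/4 = 7.5, so the 8th item
q2 = position_measure(mpg, 2, 4)      # 2n/4 = 15, so average of 15th and 16th items
q3 = position_measure(mpg, 3, 4)      # 3n/4 = 22.5, so the 23rd item
d7 = position_measure(mpg, 7, 10)     # 7n/10 = 21, so average of 21st and 22nd items
p27 = position_measure(mpg, 27, 100)  # 27n/100 = 8.1, so the 9th item
print(q1, q2, q3, d7, p27)
```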
Examples (17)
The weights in milligrams of 2538 seeds of the longleaf pine were as follows:
Weight (milligrams)   Number of Seeds     Weight (milligrams)   Number of Seeds
10-24.9                     16            85-99.9                    655
25-39.9                     68            100-114.9                  803
40-54.9                    204            115-129.9                  294
55-69.9                    233            130-144.9                   21
70-84.9                    240            145-159.9                    4
(a) Find the average weight, the median weight and the most common weight (mode) of the seeds
(b) Find the first and third quartiles. Find the third decile and the 45th percentile.
Solution:
The necessary calculations are given below:
Class Boundaries (c.b)   No. of Seeds (f)   Mid points (x)   fx           Cumulative Frequency (c.f)
9.95-24.95                    16               17.45            279.20        16
24.95-39.95                   68               32.45           2206.60        84
39.95-54.95                  204               47.45           9679.80       288
54.95-69.95                  233               62.45          14550.85       521
69.95-84.95                  240               77.45          18588.00       761
84.95-99.95                  655               92.45          60554.75      1416
99.95-114.95                 803              107.45          86282.35      2219
114.95-129.95                294              122.45          36000.30      2513
129.95-144.95                 21              137.45           2886.45      2534
144.95-159.95                  4              152.45            609.80      2538
Total                       2538              ---            231638.10      ---
(a)
(i) Average weight:
\[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{231638.10}{2538} = 91.27 \text{ milligrams} \]
(ii) Median = weight of the (n/2)th seed = weight of the (2538/2)th seed, i.e. the 1269th seed, which lies
in the group 84.95–99.95. Our median class is therefore 84.95–99.95.
For grouped data we have the median as
\[ \text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - C\right) \]
where
l = lower limit of the median class = 84.95
n = number of observations in the sample = 2538
C = cumulative frequency preceding the median class = 761
f = frequency of the median class = 655
h = class interval of the median class = 15
\[ \text{Median} = 84.95 + \frac{15}{655}(1269 - 761) = 84.95 + 11.63 = 96.58 \text{ milligrams} \]
(iii) Since the class that carries the highest frequency is 99.95–114.95, it is the modal class.
For grouped data,
\[ \text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h \]
where
l = lower class boundary of the modal class = 99.95
$f_m$ = frequency of the modal class = 803
$f_1$ = frequency of the class preceding the modal class = 655
$f_2$ = frequency of the class following the modal class = 294
h = width of the class interval = 15
\[ \text{Mode} = 99.95 + \frac{803 - 655}{(803 - 655) + (803 - 294)} \times 15 = 99.95 + \frac{148}{148 + 509} \times 15 = 99.95 + \frac{148}{657} \times 15 \]
\[ = 99.95 + 3.38 = 103.33 \text{ milligrams} \]
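The grouped mean and mode computed above can be checked numerically. The following is only a minimal sketch: the function names `grouped_mean` and `grouped_mode` are illustrative and not part of the handout, and a common class width of 15 is assumed, as in Example 17.

```python
# Lower class boundaries and frequencies of the seed-weight data (Example 17).
lower = [9.95, 24.95, 39.95, 54.95, 69.95, 84.95, 99.95, 114.95, 129.95, 144.95]
freq = [16, 68, 204, 233, 240, 655, 803, 294, 21, 4]
h = 15  # common class width (assumed equal for every class)

def grouped_mean(lower, freq, h):
    """Mean from class midpoints: sum(f * x) / sum(f)."""
    mids = [lo + h / 2 for lo in lower]
    return sum(f * x for f, x in zip(freq, mids)) / sum(freq)

def grouped_mode(lower, freq, h):
    """Mode = l + (fm - f1) / ((fm - f1) + (fm - f2)) * h for the modal class."""
    i = freq.index(max(freq))                      # position of the modal class
    fm = freq[i]
    f1 = freq[i - 1] if i > 0 else 0               # class preceding the modal class
    f2 = freq[i + 1] if i < len(freq) - 1 else 0   # class following the modal class
    return lower[i] + (fm - f1) / ((fm - f1) + (fm - f2)) * h

mean_weight = grouped_mean(lower, freq, h)
mode_weight = grouped_mode(lower, freq, h)
print(round(mean_weight, 2), round(mode_weight, 2))
```

If several classes tie for the highest frequency, this sketch simply takes the first one; the handout does not discuss that case.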
(b)
(i) For grouped data, $Q_1$ and $Q_3$ are computed as
\[ Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - C\right) \quad \text{and} \quad Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - C\right) \]
Now, $Q_1$ = weight of the (n/4)th seed = weight of the (2538/4)th seed, i.e. the 634.5th seed, which lies
in the group 69.95–84.95. Thus
\[ Q_1 = 69.95 + \frac{15}{240}(634.5 - 521) = 69.95 + 7.09 = 77.04 \text{ milligrams} \]
And $Q_3$ = weight of the (3n/4)th seed = weight of the (3 × 2538/4)th seed, i.e. the 1903.5th seed, which
lies in the group 99.95–114.95. Thus
\[ Q_3 = 99.95 + \frac{15}{803}(1903.5 - 1416) = 99.95 + 9.11 = 109.06 \text{ milligrams} \]
(ii) For grouped data, $D_3$ is computed as
\[ D_3 = l + \frac{h}{f}\left(\frac{3n}{10} - C\right) \]
Now, $D_3$ = weight of the (3n/10)th seed = weight of the (3 × 2538/10)th seed, i.e. the 761.4th seed,
which lies in the group 84.95–99.95. Thus
\[ D_3 = 84.95 + \frac{15}{655}(761.4 - 761) = 84.95 + 0.01 = 84.96 \text{ milligrams} \]
(iii) For grouped data, $P_{45}$ is computed as
\[ P_{45} = l + \frac{h}{f}\left(\frac{45n}{100} - C\right) \]
Now, $P_{45}$ = weight of the (45n/100)th seed = weight of the (45 × 2538/100)th seed, i.e. the 1142.10th
seed, which lies in the group 84.95–99.95. Thus
\[ P_{45} = 84.95 + \frac{15}{655}(1142.10 - 761) = 84.95 + 8.73 = 93.68 \text{ milligrams} \]
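The median, quartile, decile and percentile formulas used in Example 17 all have the same form, l + (h/f)(pn - C), with p = 1/2, 1/4, 3/4, 3/10 and 45/100. A minimal sketch of that common form follows; the helper name `grouped_quantile` is hypothetical, and equal class widths are assumed.

```python
from itertools import accumulate

# Lower class boundaries and frequencies of the seed-weight data (Example 17).
lower = [9.95, 24.95, 39.95, 54.95, 69.95, 84.95, 99.95, 114.95, 129.95, 144.95]
freq = [16, 68, 204, 233, 240, 655, 803, 294, 21, 4]
h = 15  # common class width

def grouped_quantile(p, lower, freq, h):
    """Return l + (h / f) * (p*n - C) for the class holding the (p*n)-th value."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    target = p * sum(freq)                 # position p*n in the ordered data
    cum = list(accumulate(freq))           # cumulative frequencies
    i = next(k for k, cf in enumerate(cum) if target <= cf)
    c = cum[i - 1] if i > 0 else 0         # cumulative frequency before the class
    return lower[i] + h / freq[i] * (target - c)

median = grouped_quantile(0.50, lower, freq, h)
q1 = grouped_quantile(0.25, lower, freq, h)
q3 = grouped_quantile(0.75, lower, freq, h)
d3 = grouped_quantile(0.30, lower, freq, h)
p45 = grouped_quantile(0.45, lower, freq, h)
print([round(v, 2) for v in (median, q1, q3, d3, p45)])
```

Writing the position as a proportion p keeps one function for all quantiles; the handout's separate formulas are recovered by choosing p.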
Quartiles, Deciles and Percentiles with the help of Ogive
Examples (18)
Suppose you kept a record of the marks of a quiz of 80 students. The exam is out of 10 and you
have grouped the marks and recorded the data in a frequency table shown below:
The mean of all three curves is the same, but curve A has less spread (or variability) than
curve B, and curve B has less variability than curve C. If we measure only the mean of these three
distributions, we will miss an important difference among the three curves. To increase our understanding
of the pattern of the data, we must also measure its dispersion.
This additional information enables us to judge the reliability of our measure of
central tendency. A wide spread of values away from the centre may indicate an unacceptable risk. A quantity
that measures this characteristic is called a measure of dispersion, scatter or variability. The main measures
are
(1) Range
The range R is defined as the difference between the largest value $x_{\max}$ and the smallest value
$x_{\min}$ in a set of data, i.e.
\[ R = x_{\max} - x_{\min} = x_n - x_0 \]
where $x_n$ and $x_0$ denote the largest and smallest observations.
The main disadvantage is that it depends on only two values (the extreme values) and may be seriously
affected by a single unusual observation. It is therefore an unsatisfactory measure of dispersion. However,
it is appropriately used in statistical quality control charts of manufactured products, daily temperatures,
stock prices etc.
This is an absolute measure of dispersion. Its relative measure, known as the co-efficient of dispersion,
is defined as
\[ \text{co-efficient of dispersion} = \frac{x_n - x_0}{x_n + x_0} \]
(2) Inter-quartile Range
A measure of variability that overcomes the dependency on extreme values is the inter-quartile range
(IQR), defined as the difference between the third and first quartiles:
\[ \text{IQR} = Q_3 - Q_1 \]
In other words, the interquartile range is the range for the middle 50% of the data.
Half of this range is called the semi-interquartile range or the quartile deviation (Q.D); symbolically,
\[ \text{Q.D} = \frac{Q_3 - Q_1}{2} \]
For the data on monthly starting salaries, the quartiles are Q3 = 3600 and Q1 = 3465. Thus, the
interquartile range is 3600 - 3465 = 135.
The quartile deviation is also an absolute measure of dispersion. Its relative measure, called the
co-efficient of quartile deviation or the coefficient of semi-interquartile range, is defined as
\[ \text{co-efficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
which is used for comparing the variation in two or more sets of data.
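As a quick numerical illustration of the range and quartile measures, here is a small sketch. The function names are illustrative only; the quartiles 3465 and 3600 are those quoted for the salary data, and 3310 and 3925 are the smallest and largest salaries quoted for the same data in the five-number summary.

```python
def range_measures(data):
    """Return (range, co-efficient of dispersion) from the extreme values."""
    lo, hi = min(data), max(data)
    return hi - lo, (hi - lo) / (hi + lo)

def quartile_measures(q1, q3):
    """Return (IQR, quartile deviation, co-efficient of quartile deviation)."""
    iqr = q3 - q1
    return iqr, iqr / 2, iqr / (q3 + q1)

# Only the extremes matter for the range, so two values are enough here.
rng, coef_disp = range_measures([3310, 3925])
iqr, qd, coef_qd = quartile_measures(3465, 3600)
print(rng, round(coef_disp, 4))        # absolute and relative range measures
print(iqr, qd, round(coef_qd, 4))      # absolute and relative quartile measures
```

The relative measures are unit-free, which is what makes them usable for comparing data sets measured on different scales.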
(3) Mean Deviation
The mean deviation (M.D) of a set of data is defined as the arithmetic mean (A.M) of the absolute
deviations measured from the mean, the median or the mode. The algebraic signs are disregarded to avoid
the difficulty arising from the property that the sum of the deviations of the observations from their
mean is zero.
\[ \text{M.D} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]
For grouped data with k classes having mid points $x_1, x_2, \ldots, x_k$ and corresponding frequencies
$f_1, f_2, \ldots, f_k$,
\[ \text{M.D} = \frac{\sum_{i=1}^{k} f_i |x_i - \bar{x}|}{n}, \qquad n = \sum_{i=1}^{k} f_i \]
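A single function can cover both the ungrouped and the grouped form of the mean deviation. The sketch below is illustrative: `mean_deviation` is a hypothetical helper, the data are made up for the demonstration, and deviations are taken from the mean unless another centre (median or mode) is supplied.

```python
def mean_deviation(values, freqs=None, center=None):
    """Mean of absolute deviations; freqs=None treats every value as one observation."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    if center is None:
        # Default centre is the (frequency-weighted) arithmetic mean.
        center = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - center) for x, f in zip(values, freqs)) / n

# Hypothetical ungrouped data: mean 6, absolute deviations 4, 2, 0, 2, 4.
md_raw = mean_deviation([2, 4, 6, 8, 10])

# Hypothetical grouped data: mid points with their frequencies, mean 16.
md_grouped = mean_deviation([5, 15, 25], freqs=[2, 5, 3])
print(md_raw, md_grouped)
```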
(4) Population Variance and Standard Deviation
The variance is the average of the squares of the distances of each value from the mean. The symbol for the
population variance is $\sigma^2$ ($\sigma$ is the Greek lowercase letter sigma). The symbolic definition of
the variance is
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \ \text{(for ungrouped data)} \quad \text{and} \quad \sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i} \ \text{(for grouped data)} \]
An alternative formula is
\[ \sigma^2 = \frac{\sum X_i^2}{N} - \left(\frac{\sum X_i}{N}\right)^2 \ \text{(for ungrouped data)} \quad \text{and} \quad \sigma^2 = \frac{\sum f_i X_i^2}{\sum f_i} - \left(\frac{\sum f_i X_i}{\sum f_i}\right)^2 \ \text{(for grouped data)} \]
The positive square root of the variance is called the standard deviation. Symbolically,
\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \ \text{(for ungrouped data)} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{\sum f_i}} \ \text{(for grouped data)} \]
The standard deviation is a very important concept that serves as a basic measure of variability. A smaller
value of the standard deviation indicates that most of the observations in the data are close to the mean,
while a larger value implies that the observations are scattered widely about the mean.
The standard deviation is an absolute measure of dispersion. Its relative measure, called the coefficient of
standard deviation, is defined as
\[ \text{Coefficient of S.D.} = \frac{\text{Standard Deviation}}{\text{Mean}} \]
(5) Sample Variance and Standard Deviation
In most cases the purpose of calculating a statistic is to estimate the corresponding parameter. For
example, the sample mean $\bar{x}$ is used to estimate the population mean $\mu$. The expression
\[ \frac{\sum (x_i - \bar{x})^2}{n} \]
does not give the best estimate of the population variance because, when the population is large and the
sample is small (usually less than 30), the variance computed by this formula usually underestimates the
population variance. Therefore, instead of dividing by n, find the variance of the sample by dividing by
n - 1, giving a slightly larger value and an unbiased estimate of the population variance. The formula for
the sample variance, denoted by $s^2$, is
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
and the standard deviation of a sample (denoted by s) is
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
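The only difference between the population and sample formulas is the divisor, which the following sketch makes explicit. The data set is hypothetical, chosen so that the sum of squared deviations is a round number, and the function name `variance` is illustrative.

```python
from math import sqrt

def variance(data, sample=True):
    """Sum of squared deviations divided by n - 1 (sample) or n (population)."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations
    return ss / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical values, mean 5, ss = 32
pop_var = variance(data, sample=False)    # divisor n = 8
samp_var = variance(data)                 # divisor n - 1 = 7
print(pop_var, sqrt(pop_var))
print(round(samp_var, 4), round(sqrt(samp_var), 4))
```

For the same data the sample variance is always the larger of the two, and the gap shrinks as n grows.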
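(6) Properties of Variance
i). Var(a) = 0, where a is a constant
ii). Var(X + a) = Var(X) = $\sigma^2$
iii). Var(aX) = $a^2$ Var(X)
iv). Var(X ± Y) = Var(X) + Var(Y), provided X and Y are independent
v). Let $\bar{x}_1$ and $s_1^2$ be the mean and variance of $n_1$ observations and $\bar{x}_2$ and $s_2^2$ be the
mean and variance of $n_2$ observations ($n_1$ and $n_2$ are sufficiently large). If $S^2$ is the variance
and $\bar{x}$ the mean of the combined $n_1 + n_2$ observations, prove that
\[ S^2 = \frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2} + \frac{n_1 (\bar{x}_1 - \bar{x})^2}{n_1 + n_2} + \frac{n_2 (\bar{x}_2 - \bar{x})^2}{n_1 + n_2} \]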
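Property (v) can be checked numerically before attempting the proof. The sketch below uses two small hypothetical groups and variances with divisor n, for which the identity holds exactly; with divisor n - 1 it holds only approximately, which is presumably why the groups are required to be large. All names are illustrative.

```python
def mean(xs):
    return sum(xs) / len(xs)

def pvar(xs):
    """Variance with divisor n."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def combined_variance(n1, m1, v1, n2, m2, v2):
    """Variance of the pooled observations from group sizes, means and variances."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n                                # combined mean
    within = (n1 * v1 + n2 * v2) / n                           # first term
    between = (n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2) / n    # last two terms
    return within + between

a = [1, 2, 3, 4, 5]    # hypothetical first group
b = [10, 12, 14]       # hypothetical second group
s2 = combined_variance(len(a), mean(a), pvar(a), len(b), mean(b), pvar(b))
direct = pvar(a + b)   # variance computed directly from the pooled data
print(s2, direct)
```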
Examples (19)
The breaking strength of test pieces of a certain alloy is given as under:
95  103  97  130  96  73  78  95  89  68
82   79  69   67  83  108 94  87  93  117
Calculate the average breaking strength of the alloy and the standard deviation.
Breaking Strength (X)     X²        Breaking Strength (X)     X²
        67               4489               93                8649
        68               4624               94                8836
        69               4761               95                9025
        73               5329               95                9025
        78               6084               96                9216
        79               6241               97                9409
        82               6724              103               10609
        83               6889              108               11664
        87               7569              117               13689
        89               7921              130               16900
Total: ΣX = 1803, ΣX² = 167653
\[ \text{Mean} = \frac{\sum X}{n} = \frac{1803}{20} = 90.15 \]
\[ \sigma^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{167653}{20} - \left(\frac{1803}{20}\right)^2 = 8382.65 - 8127.0225 = 255.6275 \]
\[ \sigma = \sqrt{255.6275} = 15.99 \]
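The arithmetic of Example 19 can be reproduced in a few lines. This is a sketch only; it follows the example in dividing by n (the population form) and compares the shortcut formula with the definition.

```python
from math import sqrt

# Breaking strengths of the 20 test pieces (Example 19).
strength = [95, 103, 97, 130, 96, 73, 78, 95, 89, 68,
            82, 79, 69, 67, 83, 108, 94, 87, 93, 117]

n = len(strength)
mean = sum(strength) / n

# Shortcut formula: (sum of squares) / n minus the squared mean.
var_shortcut = sum(x * x for x in strength) / n - mean ** 2

# Definition: average squared deviation from the mean.
var_definition = sum((x - mean) ** 2 for x in strength) / n

sd = sqrt(var_shortcut)
print(mean, round(var_shortcut, 4), round(sd, 2))
```

Both forms agree up to floating-point rounding; the shortcut form is the one suited to hand calculation from a table of X and X².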
Examples (21)
The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The
mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
Solution
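The coefficients of variation are
\[ \text{C.V} = \frac{s}{\bar{x}} \times 100 = \frac{5}{87} \times 100 = 5.7\% \quad \text{(sales)} \]
\[ \text{C.V} = \frac{s}{\bar{x}} \times 100 = \frac{773}{5225} \times 100 = 14.8\% \quad \text{(commissions)} \]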
Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales.
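The comparison in Example 21 amounts to one division per data set, as the following sketch shows. The function name is illustrative; the figures are those given in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_sales = coefficient_of_variation(5, 87)             # number of cars sold
cv_commissions = coefficient_of_variation(773, 5225)   # commissions in dollars
print(round(cv_sales, 1), round(cv_commissions, 1))
```

Because the units cancel, car counts and dollar amounts can be compared directly, which the raw standard deviations (5 and 773) do not allow.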
Exercise
The lengths (in feet) of the main span of the longest suspension bridges in the United States and the rest
of the world are shown below. Which set of data is more variable?
United States: 4205, 4200, 3800, 3500, 3478, 2800, 2800, 2310
World: 6570, 5538, 5328, 4888, 4626, 4544, 4518, 3970 (Bluman Ex. 3.2, 29)
(8) Range Rule of Thumb
The range can be used to approximate the standard deviation. The approximation is called the
range rule of thumb.
A rough estimate of the standard deviation is
\[ s \approx \frac{\text{range}}{4} \]
For example, the standard deviation for the data set 5, 8, 8, 9, 10, 12, and 13 is 2.7, and the range
is 13 - 5 = 8. The range rule of thumb gives s ≈ 2.
A note of caution should be mentioned here. The range rule of thumb is only an approximation
and should be used when the distribution of data values is unimodal and roughly symmetric.
The range rule of thumb can be used to estimate the largest and smallest data values of a data set.
The smallest data value will be approximately 2 standard deviations below the mean, and the largest data
value will be approximately 2 standard deviations above the mean of the data set. The mean for the
previous data set is 9.3; hence,
Smallest data value ≈ $\bar{x}$ - 2s = 9.3 - 2(2.7) = 3.9
Largest data value ≈ $\bar{x}$ + 2s = 9.3 + 2(2.7) = 14.7
Notice that the smallest data value was 5, and the largest data value was 13. Again, these are
rough approximations. For many data sets, almost all data values will fall within 2 standard deviations of
the mean. Better approximations can be obtained by using Chebyshev's theorem and the empirical rule.
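To see how rough the rule of thumb is for the small data set above, the sample standard deviation can be compared with range/4. The sketch uses Python's standard `statistics` module, whose `stdev` divides by n - 1.

```python
from statistics import mean, stdev

data = [5, 8, 8, 9, 10, 12, 13]

s_exact = stdev(data)                  # sample standard deviation (divisor n - 1)
s_rule = (max(data) - min(data)) / 4   # range rule of thumb
x_bar = mean(data)

print(round(x_bar, 1), round(s_exact, 1), s_rule)
```

Here the rule gives 2 against a computed value of about 2.7, which illustrates why it should be treated only as a rough approximation.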
Chebyshev's theorem, developed by the Russian mathematician Chebyshev (1821–1894),
specifies the proportions of the spread in terms of the standard deviation.
We start by examining a specific set of data. The following Table shows the heights in inches of
100 randomly selected adult men. The mean and standard deviation of the data are, rounded to two
decimal places, $\bar{x}$ = 69.92 and $\sigma$ = 1.70. If we go through the data and count the number of
observations that are within one standard deviation of the mean, that is, that are between
69.92 - 1.70 = 68.22 and 69.92 + 1.70 = 71.62 inches, there are 69 of them. If we count the number of
observations that are within two standard deviations of the mean, that is, that are between
69.92 - 2(1.70) = 66.52 and 69.92 + 2(1.70) = 73.32 inches, there are 95 of them.
All of the measurements are within three standard deviations of the mean, that is, between
69.92 - 3(1.70) = 64.82 and 69.92 + 3(1.70) = 75.02 inches. These tallies are not coincidences, but are in
agreement with the following result that has been found to be widely applicable.
Remarks
Two key points in regard to the Empirical Rule are that the data distribution must be
approximately bell-shaped and that the percentages are only approximately true. The Empirical Rule does
not apply to data sets with severely asymmetric distributions, and the actual percentage of observations in
any of the intervals specified by the rule could be either greater or less than those given in the rule. We
see this with the example of the heights of the men: the Empirical Rule suggested 68 observations
between 68.22 and 71.62 inches but we counted 69.
Examples (22)
Heights of 18-year-old males have a bell-shaped distribution with mean 69.6 inches and standard
deviation 1.4 inches.
(a) About what proportion of all such men are between 68.2 and 71 inches tall?
(b) What interval centered on the mean should contain about 95% of all such men?
Solution
(a) Since the interval from 68.2 to 71.0 has endpoints $\bar{x}$ - s and $\bar{x}$ + s,
by the Empirical Rule about 68% of all 18-year-old males should have heights in this range.
(b) By the Empirical Rule the shortest such interval containing 95% of the data is $\bar{x}$ ± 2s. So the
interval from $\bar{x}$ - 2s = 69.6 - 2(1.4) = 66.8 to $\bar{x}$ + 2s = 69.6 + 2(1.4) = 72.4 contains 95% of
the data values.
Examples (23)
Scores on IQ tests have a bell-shaped distribution with mean $\mu$ = 100 and standard deviation $\sigma$ =
10. Discuss what the Empirical Rule implies concerning individuals with IQ scores of 110, 120, and 130.
Solution
A sketch of the IQ distribution is given in Figure. The Empirical Rule states that
(i) approximately 68% of the IQ scores in the population lie between 90 and 110,
(ii) approximately 95% of the IQ scores in the population lie between 80 and 120, and
(iii) approximately 99.7% of the IQ scores in the population lie between 70 and 130.
Since 68% of the IQ scores lie within the interval from 90 to 110, it must be the case that 32% lie
outside that interval. By symmetry approximately half of that 32%, or 16% of all IQ scores, will lie above
110. If 16% lie above 110, then 84% lie below. We conclude that the IQ score 110 is the 84th percentile.
The same analysis applies to the score 120. Since approximately 95% of all IQ scores lie within
the interval from 80 to 120, only 5% lie outside it, and half of them, or 2.5% of all scores, are above 120.
The IQ score 120 is thus higher than 97.5% of all IQ scores, and is quite a high score.
By a similar argument, only 0.15% of all adults (half of the 0.3% outside the interval from 70 to 130),
or about one or two in every thousand, would have an IQ score above 130. This fact makes the score 130
extremely high.
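The symmetry argument used for the scores 110, 120 and 130 can be written as a short calculation. This sketch assumes the usual Empirical Rule percentages (68%, 95%, 99.7%) and a symmetric bell-shaped distribution; the names `EMPIRICAL` and `percent_below` are illustrative.

```python
# Approximate share of the data within k standard deviations of the mean.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

def percent_below(k):
    """Approximate percentage of values below mean + k standard deviations."""
    outside = 1 - EMPIRICAL[k]      # share outside the central interval
    upper_tail = outside / 2        # by symmetry, half lies above the interval
    return (1 - upper_tail) * 100

mu, sigma = 100, 10                 # IQ mean and standard deviation (Example 23)
for k in (1, 2, 3):
    print(mu + k * sigma, round(percent_below(k), 2))
```

The three printed lines correspond to the 84th percentile for 110, 97.5% below 120, and 99.85% below 130, that is, 0.15% above 130.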
(2) Chebyshev's Theorem
The Empirical Rule does not apply to all data sets, only to those that are bell-shaped, and even
then is stated in terms of approximations. A result that applies to every data set is known as Chebyshev's
Theorem.
For any numerical data set,
1. at least 3/4 of the data lie within two standard deviations of the mean, that is, in the
interval with endpoints $\bar{x}$ ± 2s for samples and with endpoints $\mu$ ± 2$\sigma$ for populations;
2. at least 8/9 of the data lie within three standard deviations of the mean, that is, in the
interval with endpoints $\bar{x}$ ± 3s for samples and with endpoints $\mu$ ± 3$\sigma$ for populations;
3. at least $1 - 1/k^2$ of the data lie within k standard deviations of the mean, that is, in the
interval with endpoints $\bar{x}$ ± ks for samples and with endpoints $\mu$ ± k$\sigma$ for populations,
where k is any positive whole number that is greater than 1.
Remark
It is important to pay careful attention to the words "at least" at the beginning of each of the
three parts of Chebyshev's Theorem. The theorem gives the minimum proportion of the data which must
lie within a given number of standard deviations of the mean; the true proportions found within the
indicated regions could be greater than what the theorem guarantees.
Examples (24)
A sample of size n = 50 has mean $\bar{x}$ = 28 and standard deviation s = 3. Without knowing anything
else about the sample, what can be said about the number of observations that lie in the interval (22,34)?
What can be said about the number of observations that lie outside that interval?
Solution
The interval (22,34) is the one that is formed by adding and subtracting two standard deviations
from the mean. By Chebyshev's Theorem, at least 3/4 of the data are within this interval. Since 3/4 of 50
is 37.5, this means that at least 37.5 observations are in the interval. But one cannot take a fractional
observation, so we conclude that at least 38 observations must lie inside the interval (22,34).
If at least 3/4 of the observations are in the interval, then at most 1/4 of them are outside it. Since
1/4 of 50 is 12.5, at most 12.5 observations are outside the interval. Since again a fraction of an
observation is impossible, at most 12 observations can lie outside the interval (22,34).
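The reasoning of Example 24, including the rounding to whole observations, can be sketched as follows. The helper names `chebyshev_bound` and `chebyshev_counts` are hypothetical; exact fractions are used so that 37.5 is rounded up without floating-point surprises.

```python
from fractions import Fraction
from math import ceil

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations: 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / Fraction(k) ** 2

def chebyshev_counts(n, k):
    """Return (least number inside, greatest number outside) for n observations."""
    at_least_inside = ceil(chebyshev_bound(k) * n)   # round up to a whole observation
    at_most_outside = n - at_least_inside
    return at_least_inside, at_most_outside

# Example 24: n = 50, mean 28, s = 3, interval (22, 34), so k = (34 - 28) / 3 = 2.
inside, outside = chebyshev_counts(50, 2)
print(chebyshev_bound(2), chebyshev_bound(3), inside, outside)
```

The same function gives the minimum count for any sample size once k has been read off from the interval endpoints.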
Examples (25)
The number of vehicles passing through a busy intersection between 8:00 a.m. and 10:00 a.m.
was observed and recorded on every weekday morning of the last year. The data set contains n = 251
numbers. The sample mean is $\bar{x}$ = 725 and the sample standard deviation is s = 25. Identify which of the
following statements must be true.
a. On approximately 95% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was between 675 and 775.
b. On at least 75% of the weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was between 675 and 775.
c. On at least 189 weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was between 675 and 775.
d. On at most 25% of the weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was either less than 675 or greater than 775.
e. On at most 12.5% of the weekday mornings last year the number of vehicles passing through
the intersection from 8:00 a.m. to 10:00 a.m. was less than 675.
f. On at most 25% of the weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was less than 675.
Solution
a. Since it is not stated that the relative frequency histogram of the data is bell-shaped, the Empirical
Rule does not apply. Statement (a) is based on the Empirical Rule and therefore it might not be
correct.
b. Statement (b) is a direct application of part (1) of Chebyshev's Theorem because $\bar{x}$ - 2s = 675 and
$\bar{x}$ + 2s = 775. It must be correct.
c. Statement (c) says the same thing as statement (b) because 75% of 251 is 188.25, so the minimum
whole number of observations in this interval is 189. Thus statement (c) is definitely correct.
d. Statement (d) says the same thing as statement (b) but in different words, and therefore is
definitely correct.
e. Statement (d), which is definitely correct, states that at most 25% of the time either fewer than
675 or more than 775 vehicles passed through the intersection. Statement (e) says that half of that
25% corresponds to days of light traffic. This would be correct if the relative frequency histogram
of the data were known to be symmetric. But this is not stated; perhaps all of the observations
outside the interval (675,775) are less than 675. Thus statement (e) might not be correct.
f. Statement (d) is definitely correct and statement (d) implies statement (f): even if every
measurement that is outside the interval (675,775) is less than 675 (which is conceivable, since
symmetry is not known to hold), even so at most 25% of all observations are less than 675. Thus
statement (f) must definitely be correct.
Exercise
The mean of a distribution is 20 and the standard deviation is 2. Use Chebyshev's theorem.
a. At least what percentage of the values will fall between 10 and 30?
b. At least what percentage of the values will fall between 12 and 28? (Bluman ch. 3)
Exercise
The Energy Information Administration reported that the mean retail price per gallon of regular
grade gasoline was $2.30 (Energy Information Administration, February 27, 2006). Suppose that the
standard deviation was $.10 and that the retail price per gallon has a bell shaped distribution.
a. What percentage of regular grade gasoline sold between $2.20 and $2.40 per gallon?
b. What percentage of regular grade gasoline sold between $2.20 and $2.50 per gallon?
c. What percentage of regular grade gasoline sold for more than $2.50 per gallon?
(prob. 3.30, Sweeny Chap 3 )
The easiest way to develop a five-number summary is to first place the data in ascending order.
Then it is easy to identify the smallest value, the three quartiles, and the largest value. The monthly
starting salaries shown in the above table for a sample of 12 business school graduates are repeated here
in ascending order.
The median is 3505 and the quartiles are Q1 = 3465 and Q3 = 3600. Reviewing the data shows a
smallest value of 3310 and a largest value of 3925. Thus the five-number summary for the salary data is
3310, 3465, 3505, 3600, 3925. Approximately one-fourth, or 25%, of the observations are between
adjacent numbers in a five-number summary.
(1) Box Plot
A box plot is a graphical summary of data that is based on a five-number summary. A key to the
development of a box plot is the computation of the median and the quartiles, Q1 and Q3. The
interquartile range, IQR = Q3 - Q1, is also used. The following figure is the box plot for the monthly
starting salary data. The steps used to construct the box plot follow.
1. A box is drawn with the ends of the box located at the first and third quartiles. For the salary
data, Q1 = 3465 and Q3 = 3600. This box contains the middle 50% of the data.
2. A vertical line is drawn in the box at the location of the median (3505 for the salary data).
3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the box plot
are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the salary data, IQR = Q3 - Q1 = 3600 - 3465 =
135. Thus, the limits are 3465 - 1.5(135) = 3262.5 and 3600 + 1.5(135) = 3802.5. Data outside these
limits are considered outliers.
4. The dashed lines in the figure are called whiskers. The whiskers are drawn from the ends of the box
to the smallest and largest values inside the limits computed in step 3. Thus, the whiskers end at salary
values of 3310 and 3730.
5. Finally, the location of each outlier is shown with the symbol *. In the figure we see one outlier, 3925.
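The limits and the outlier check in the box plot construction can be sketched in a few lines. The function name `box_plot_limits` is illustrative, and only the salary values actually quoted in the text (3310, 3730 and 3925) are tested, since the full data table is not reproduced here.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower_limit, upper_limit = box_plot_limits(3465, 3600)

quoted_values = [3310, 3730, 3925]   # salary values mentioned in the steps above
outliers = [v for v in quoted_values if v < lower_limit or v > upper_limit]
inside = [v for v in quoted_values if lower_limit <= v <= upper_limit]

print(lower_limit, upper_limit)
print("whisker candidates:", inside, "outliers:", outliers)
```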
Exercise
The nine measurements that follow are furnace temperature recorded on successive
batches in a semiconductor manufacturing process (units are °F): 953, 950, 948, 955, 951, 949,
957, 954, 955.
(a) Calculate the sample mean, sample variance, and standard deviation.
(b) Find the median. How much could the largest temperature measurement
increase without changing the median value?
(c) Construct a box plot of the data.
If the frequency curve of a distribution has a longer tail to the left of the central maximum than to
the right, the distribution is said to be skewed to the left or to have negative skewness.
\[ \text{Skewness} = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{\sum (X_i - \bar{X})^3 / N}{s^3} \]
This is an adjustment for sample size. The adjustment approaches 1 as N gets large. For reference, the
adjustment factor is 1.49 for N = 5, 1.19 for N = 10, 1.08 for N = 20, 1.05 for N = 30, and 1.02 for N = 100.
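The quoted adjustment factors can be reproduced if the factor is taken to be sqrt(N(N - 1))/(N - 2). That form is an assumption inferred from the quoted values (it matches all five of them), and the function name below is illustrative.

```python
from math import sqrt

def skewness_adjustment(n):
    """Sample-size adjustment factor, assumed to be sqrt(n(n - 1)) / (n - 2)."""
    if n <= 2:
        raise ValueError("the factor is only defined for n > 2")
    return sqrt(n * (n - 1)) / (n - 2)

factors = {n: round(skewness_adjustment(n), 2) for n in (5, 10, 20, 30, 100)}
print(factors)
```

The factor is always above 1 and decreases towards 1 as N grows, so the adjustment matters mainly for small samples.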
Karl Pearson proposed the following formula to measure skewness:
\[ \text{Skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]
Bowley introduced the following measure of skewness:
\[ \text{Quartile coefficient of skewness} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} \]
This measure is equal to zero when the quartiles are equidistant from the median, as they are for a
perfectly symmetrical curve. It is positive when the upper quartile is farther from the median than the
lower quartile; the distribution is then positively skewed. It is negative when the lower quartile is
farther from the median than the upper quartile; the distribution is then negatively skewed.
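Both coefficients are one-line calculations. In the sketch below the function names are illustrative; the Bowley example uses the quartiles and median found for the seed-weight data in Example 17, while the Pearson example uses made-up figures.

```python
def pearson_skewness(mean, mode, sd):
    """Pearson's coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd

def bowley_skewness(q1, q2, q3):
    """Quartile coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Seed weights (Example 17): Q1 = 77.04, median = 96.58, Q3 = 109.06.
sk_bowley = bowley_skewness(77.04, 96.58, 109.06)

# Hypothetical summary figures, for illustration only.
sk_pearson = pearson_skewness(mean=50, mode=44, sd=12)

print(round(sk_bowley, 3), sk_pearson)
```

The negative Bowley value for the seed weights reflects the lower quartile lying farther from the median than the upper quartile.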
Problems (Skewness)
1) What can you say about the skewness in each of the following cases?
(i) The median is 26.01, while the two quartiles are 13.73 and 38.29.
(ii) Mean = 140 and mode = 148.7
(iii) Mean = 129.5 and median = 128.7
2) Which of the following is correct in a positively skewed distribution, and which in a negatively skewed one?
(i) The arithmetic mean is greater than the mode.
(ii) The arithmetic mean is less than the mode.
(iii) The arithmetic mean is greater than the median.
(iv) The median is greater than the mode.
3) The lengths of stay on the cancer floor of Apolo Hospital were organized into a frequency
distribution. The mean length of stay was 28 days, the median 25 days and the modal length 23
days. The standard deviation was computed to be 4.2 days.
(2) Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low
kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.
The histogram is an effective graphical technique for showing both the skewness and kurtosis of
a data set.
Kurtosis is also described as the degree of peakedness of a distribution. A distribution having a relatively
high peak is called lepto-kurtic, whereas a flat-topped distribution is called platy-kurtic. A frequency
curve which is neither very peaked nor very flat-topped is called meso-kurtic; the normal curve is
meso-kurtic.
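For Skewness,
\[ \text{Variance} = \frac{\sum f (X_i - \bar{X})^2}{N} = \frac{852.75}{100} = 8.5275 \]
\[ \text{Standard Deviation} = s = \sqrt{8.5275} = 2.92 \]
\[ \text{Skewness} = \frac{\sum f (X_i - \bar{X})^3 / N}{s^3} = \frac{-269.33/100}{(2.92)^3} = \frac{-2.6933}{24.90} = -0.11 \]
This means that the distribution is slightly negatively skewed.
For Kurtosis,
  X      f     X - X̄     f(X - X̄)⁴
 61      5     -6.45       8653.84
 64     18     -3.45       2550.05
 67     42     -0.45          1.72
 70     27      2.55       1141.63
 73      8      5.55       7590.35
Total  100      ---       19937.60
\[ \text{Kurtosis} = \frac{\sum f (X_i - \bar{X})^4 / N}{s^4} = \frac{19937.60/100}{(2.92)^4} = \frac{199.3760}{72.70} = 2.74 < 3 \]
This means the frequency curve is flat, that is, platy-kurtic.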
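The moment calculations for this frequency distribution can be verified with a short script. This is a sketch that uses the moment ratios m3 / m2^1.5 and m4 / m2^2 with divisor N throughout; the helper `central_moment` is illustrative.

```python
# Values and frequencies from the table above.
x = [61, 64, 67, 70, 73]
f = [5, 18, 42, 27, 8]

n = sum(f)
mean = sum(fi * xi for xi, fi in zip(x, f)) / n

def central_moment(r):
    """r-th moment about the mean for the frequency distribution (divisor N)."""
    return sum(fi * (xi - mean) ** r for xi, fi in zip(x, f)) / n

m2 = central_moment(2)       # variance
m3 = central_moment(3)
m4 = central_moment(4)

skewness = m3 / m2 ** 1.5    # third moment divided by s cubed
kurtosis = m4 / m2 ** 2      # fourth moment divided by s to the fourth

print(round(mean, 2), round(m2, 4), round(skewness, 3), round(kurtosis, 3))
```

A kurtosis below 3 is what classifies the curve as platy-kurtic, and the small negative skewness indicates only mild asymmetry.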