Chapter 3
LEARNING OBJECTIVES
In this chapter, you learn:
To describe the properties of central tendency, variation,
and shape in numerical data
To calculate descriptive summary measures for a population
To construct and interpret a box-and-whisker plot
To describe the covariance and the coefficient of correlation
USING STATISTICS
Evaluating the Performance of Mutual Funds
Return to the study of mutual funds introduced in Chapter 2. You want to
decide which types of mutual funds to invest in. In the last chapter you
learned how to present data in tables and charts. However, when dealing
with numerical data, such as the return on investments in mutual funds in
2003, you also need to summarize the data, and ask statistical questions.
What is the central tendency for returns of the various funds? For example, what is the mean return in 2003 for the low-risk, average-risk, and
high-risk mutual funds? How much variability is present in the returns?
Are the returns for high-risk funds more variable than for average-risk
funds or low-risk funds? How can you use this information when deciding
what mutual funds to invest in?
For numerical variables, you need more than the visual picture of a variable
that you get from the graphs discussed in Chapter 2. For example, for the 2003 returns,
you would like to determine not only whether the riskier funds had a higher 2003 return, but
whether they also had greater variation, and how the returns for each risk group were distributed. You also want to examine whether there is a relationship between the expense ratio and
the 2003 return. Reading this chapter will allow you to learn about some of the methods to
measure:
central tendency, the extent to which all of the data values group around a central value
variation, the amount of dispersion or scattering of values away from a central point
shape, the pattern of the distribution of values from the lowest value to the highest value
You will also learn about the covariance and the coefficient of correlation that help measure the
strength of the association between two numerical variables.
3.1
The Mean
The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean is the only common measure in which all the values play an equal role.
The mean serves as a balance point in a set of data (like the fulcrum on a seesaw). You calculate the mean by adding together all the values in a data set and then dividing that sum by the
number of values in the data set.
The symbol $\bar{X}$, called X bar, is used to represent the mean of a sample. For a sample containing n values, the equation for the mean of a sample is written as

$$\bar{X} = \frac{\text{sum of the values}}{\text{number of values}}$$

Using the series $X_1, X_2, \ldots, X_n$ to represent the set of n values and n to represent the number of
values, the equation becomes:

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

By using summation notation (discussed fully in Appendix B), you replace the numerator
$X_1 + X_2 + \cdots + X_n$ by the term $\sum_{i=1}^{n} X_i$, which means sum all the $X_i$ values from the first X
value, $X_1$, to the last X value, $X_n$, to form Equation (3.1), a formal definition of the sample
mean.
SAMPLE MEAN
The sample mean is the sum of the values divided by the number of values.

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)$$

where
$\bar{X}$ = sample mean
n = number of values or sample size
Because all the values play an equal role, a mean will be greatly affected by any value that
is greatly different from the others in the data set. When you have such extreme values, you
should avoid using the mean.
The mean can suggest what is a typical or central value for a data set. For example, if you
knew the typical time it takes you to get ready in the morning, you might be able to better plan
your morning and minimize any excessive lateness (or earliness) going to your destination.
Suppose you define the time to get ready as the time in minutes (rounded to the nearest minute)
from when you get out of bed to when you leave your home. You collect the times shown below
for 10 consecutive work days:
Day:              1   2   3   4   5   6   7   8   9  10
Time (minutes):  39  29  43  52  39  44  40  31  44  35
TIMES

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{39 + 29 + 43 + 52 + 39 + 44 + 40 + 31 + 44 + 35}{10} = \frac{396}{10} = 39.6$$
Even though no one day in the sample actually had the value 39.6 minutes, allotting about 40
minutes to get ready would be a good rule for planning your mornings, but only because the 10
days do not contain extreme values.
Contrast this to the case in which the value on day four was 102 minutes instead of 52 minutes. This extreme value would cause the mean to rise to 44.6 minutes as follows:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{446}{10} = 44.6$$
The one extreme value has increased the mean by more than 10% from 39.6 to 44.6 minutes. In
contrast to the original mean that was in the middle, greater than 5 of the get-ready times
(and less than the 5 other times), the new mean is greater than 9 of the 10 get-ready times. The
extreme value has caused the mean to be a poor measure of central tendency.
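The effect of the single extreme value can be checked with a short Python sketch. This is only an illustration of the calculation above; the variable names are chosen here and are not part of the text.

```python
from statistics import mean

# Get-ready times (minutes) for the 10 consecutive work days
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]

# Same data with day 4 changed from 52 to the extreme value 102
times_extreme = times.copy()
times_extreme[3] = 102

mean_original = mean(times)          # 396 / 10
mean_extreme = mean(times_extreme)   # 446 / 10

# How many of the 10 values fall below each mean?
below_original = sum(t < mean_original for t in times)
below_extreme = sum(t < mean_extreme for t in times_extreme)

print(f"original mean = {mean_original:.1f}, values below it: {below_original}")
print(f"mean with extreme value = {mean_extreme:.1f}, values below it: {below_extreme}")
```

With the original data the mean sits in the middle of the values; with the extreme value it exceeds all but one of them.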
EXAMPLE 3.1
THE MEAN 2003 RETURN FOR SMALL CAP MUTUAL FUNDS WITH HIGH RISK
The 121 mutual funds that are part of the Using Statistics scenario (see page 72) are classified
according to the risk level of the mutual funds (low, average, and high) and type (small cap, mid
cap, and large cap). Compute the mean 2003 return for the small cap mutual funds with high risk.
SOLUTION The mean 2003 return for the small cap mutual funds with high risk (MUTUALFUNDS2004) is 51.53, calculated as follows:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{463.8}{9} = 51.53$$
The ordered array for the nine small cap mutual funds with high risk is:
37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
Four of these returns are below the mean of 51.53 and five of these returns are above the mean.
The Median
The median is the value that splits a ranked set of data into two equal parts. The median is not
affected by extreme values, so you can use the median when extreme values are present.
The median is the middle value in a set of data that has been ordered from lowest to highest
value.
To calculate the median for a set of data, you first rank the values from smallest to largest.
Then use Equation (3.2) to compute the rank of the value that is the median.
MEDIAN
50% of the values are smaller than the median and 50% of the values are larger than the
median.

$$\text{Median} = \frac{n + 1}{2} \text{ ranked value} \qquad (3.2)$$
Rule 1 If there are an odd number of values in the data set, the median is the middle
ranked value.
Rule 2 If there are an even number of values in the data set, then the median is the average
of the two middle ranked values.
To compute the median for the sample of 10 times to get ready in the morning, you rank the
daily times as follows:
Ranked values:  29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9  10
Median = 39.5
Because the result of dividing n + 1 by 2 is (10 + 1)/2 = 5.5 for this sample of 10, you must use
Rule 2 and average the fifth and sixth ranked values, 39 and 40. Therefore, the median is 39.5.
The median of 39.5 means that for half of the days, the time to get ready is less than or equal to
39.5 minutes, and for half of the days the time to get ready is greater than or equal to 39.5 minutes. The median time to get ready of 39.5 minutes is very close to the mean time to get ready
of 39.6 minutes.
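A minimal Python sketch of Equation (3.2) and the two rules follows. It uses the standard library's `statistics.median`, which averages the two middle values for an even-sized sample just as Rule 2 does; the variable names are illustrative.

```python
from statistics import median

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
small_cap_high_risk = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]

# Ranked position of the median from Equation (3.2): (n + 1) / 2
position_times = (len(times) + 1) / 2                  # 5.5, so Rule 2 applies
position_funds = (len(small_cap_high_risk) + 1) / 2    # 5.0, so Rule 1 applies

median_times = median(times)                  # average of 5th and 6th ranked values
median_funds = median(small_cap_high_risk)    # the 5th ranked value

print(position_times, median_times)
print(position_funds, median_funds)
```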
EXAMPLE 3.2
Ranked values:  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
Ranks:             1     2     3     4     5     6     7     8     9
Median = 53.8
The median return is 53.8. Half the small cap high-risk mutual funds have returns equal to or
below 53.8 and half have returns equal to or above 53.8.
The Mode
The mode is the value in a set of data that appears most frequently. Like the median and unlike
the mean, extreme values do not affect the mode. You should use the mode only for descriptive
purposes as it is more variable from sample to sample than either the mean or the median.
Often there is no mode or there are several modes in a set of data. For example, consider the
time to get ready data shown below.
29  31  35  39  39  40  43  44  44  52
There are two modes, 39 minutes and 44 minutes, since each of these values occurs twice.
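As a quick check, the modes can be found by counting how often each value occurs. The sketch below uses `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest count. Note one difference from the convention in this chapter: when no value repeats, `multimode` returns all the values rather than reporting that there is no mode.

```python
from collections import Counter
from statistics import multimode

times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]

# Frequency of each get-ready time
counts = Counter(times)

# All values that share the highest frequency
modes = multimode(times)

print(counts[39], counts[44])   # each occurs twice
print(modes)
```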
EXAMPLE 3.3
A set of data will have no mode if none of the values is most typical. Example 3.4 presents a
data set with no mode.
EXAMPLE 3.4
Quartiles
Quartiles split a set of data into four equal parts. The first quartile Q1 divides the smallest
25.0% of the values from the other 75.0% that are larger. The second quartile Q2 is the
median; 50.0% of the values are smaller than the median and 50.0% are larger. The third
quartile Q3 divides the smallest 75.0% of the values from the largest 25.0%. Equations (3.3)
and (3.4) define the first and third quartiles.
FIRST QUARTILE Q1
25.0% of the values are smaller than Q1, the first quartile, and 75.0% are larger than the
first quartile Q1.

$$Q_1 = \frac{n + 1}{4} \text{ ranked value} \qquad (3.3)$$

THIRD QUARTILE Q3
75.0% of the values are smaller than the third quartile Q3, and 25.0% are larger than the
third quartile Q3.

$$Q_3 = \frac{3(n + 1)}{4} \text{ ranked value} \qquad (3.4)$$
Rule 1 If the result is a whole number, then the quartile is equal to that ranked value. For example, if the sample size n = 7, the first quartile Q1 is equal to the (7 + 1)/4 = 2nd ranked value.
Rule 2 If the result is a fractional half (2.5, 4.5, etc.), then the quartile is equal to the average of the corresponding ranked values. For example, if the sample size n = 9, the first
quartile Q1 is equal to the (9 + 1)/4 = 2.5 ranked value, halfway between the second ranked
value and the third ranked value.
Rule 3 If the result is neither a whole number nor a fractional half, you round the result to
the nearest integer and select that ranked value. For example, if the sample size n = 10, the
first quartile Q1 is equal to the (10 + 1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the
third ranked value.
To illustrate the computation of the quartiles for the time-to-get-ready data, rank the data from
smallest to largest.
Ranked values:  29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9  10
The first quartile is the (n + 1)/4 = (10 + 1)/4 = 2.75 ranked value. Using the third rule for quartiles, you round up to the third ranked value. The third ranked value for the get-ready time data
is 35 minutes. You interpret the first quartile of 35 to mean that on 25% of the days the time to
get ready is less than or equal to 35 minutes, and on 75% of the days, the time to get ready is
greater than or equal to 35 minutes.
The third quartile is the 3(n + 1)/4 = 3(10 + 1)/4 = 8.25 ranked value. Using the third rule
for quartiles, you round this down to the eighth ranked value. The eighth ranked value for the
get-ready time data is 44 minutes. You interpret this to mean that on 75% of the days, the time
to get ready is less than or equal to 44 minutes, and on 25% of the days, the time to get ready is
greater than or equal to 44 minutes.
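The three rules can be written as a small Python function. This is a sketch of the ranked-value rules stated above, not a general-purpose routine: the function name is made up for illustration, it assumes at least two values, and library functions such as `statistics.quantiles` use different interpolation conventions and may return slightly different quartiles.

```python
import math

def textbook_quartile(values, which):
    """Return Q1 (which=1) or Q3 (which=3) using the three ranked-value rules.

    Illustrative helper; assumes len(values) >= 2.
    """
    ranked = sorted(values)
    n = len(ranked)
    position = which * (n + 1) / 4        # 1-based ranked position
    lower = math.floor(position)
    fraction = position - lower           # 0, 0.25, 0.5, or 0.75
    if fraction == 0:                     # Rule 1: whole number
        return ranked[lower - 1]
    if fraction == 0.5:                   # Rule 2: fractional half
        return (ranked[lower - 1] + ranked[lower]) / 2
    # Rule 3: round to the nearest integer rank
    nearest = lower + 1 if fraction > 0.5 else lower
    return ranked[nearest - 1]

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
q1 = textbook_quartile(times, 1)   # 2.75 ranked value, rounded up to rank 3
q3 = textbook_quartile(times, 3)   # 8.25 ranked value, rounded down to rank 8
print(q1, q3)
```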
EXAMPLE 3.5
Ranked values:  37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
Ranks:             1     2     3     4     5     6     7     8     9

$$Q_1 = \frac{n + 1}{4} \text{ ranked value} = \frac{9 + 1}{4} \text{ ranked value} = 2.5 \text{ ranked value}$$
Therefore, using the second rule, Q1 is the 2.5 ranked value, halfway between the second
ranked value and the third ranked value. Since the second ranked value is 39.2 and the third
ranked value is 44.2, the first quartile Q1 is halfway between 39.2 and 44.2. Thus,
$$Q_1 = \frac{39.2 + 44.2}{2} = 41.7$$

$$Q_3 = \frac{3(n + 1)}{4} \text{ ranked value} = \frac{3(9 + 1)}{4} \text{ ranked value} = 7.5 \text{ ranked value}$$
Therefore, using the second rule, Q3 is the 7.5 ranked value, halfway between the seventh
ranked value and the eighth ranked value. Since the seventh ranked value is 59.3 and the eighth
ranked value is 62.4, the third quartile Q3 is halfway between 59.3 and 62.4. Thus,
$$Q_3 = \frac{59.3 + 62.4}{2} = 60.85$$
The first quartile of 41.7 indicates that 25% of the returns in 2003 for small cap high-risk funds
are below or equal to 41.7 and 75% are greater than or equal to 41.7. The third quartile of 60.85
indicates that 75% of the returns in 2003 for small cap high-risk funds are below or equal to
60.85 and 25% are greater than or equal to 60.85.
(3.5)

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \qquad (3.6)$$
To illustrate using these measures, consider an investment of $100,000 that declined to a value
of $50,000 at the end of year 1 and then rebounded back to its original $100,000 value at the
end of year 2. The rate of return for this investment for the two-year period is 0, because the
starting and ending value of the investment is unchanged. However, the arithmetic mean of the
yearly rates of return of this investment is
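$$\bar{X} = \frac{(-0.50) + (1.00)}{2} = 0.25 \text{ or } 25\%$$

since the rate of return for year 1 is

$$R_1 = \frac{50{,}000 - 100{,}000}{100{,}000} = -0.50 \text{ or } -50\%$$

and the rate of return for year 2 is

$$R_2 = \frac{100{,}000 - 50{,}000}{50{,}000} = 1.00 \text{ or } 100\%$$

Using Equation (3.6), the geometric mean rate of return for the two years is

$$\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/n} - 1$$
$$= [(1 + (-0.50)) \times (1 + (1.0))]^{1/2} - 1$$
$$= [(0.50) \times (2.0)]^{1/2} - 1$$
$$= [1.0]^{1/2} - 1$$
$$= 1 - 1 = 0$$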
Thus, the geometric mean rate of return more accurately reflects the (zero) change in the value
of the investment for the two-year period than does the arithmetic mean.
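The comparison between the two means can be reproduced with a few lines of Python. The function below is a minimal sketch of the geometric mean rate of return used in the calculation above; its name is illustrative, and rates are entered as proportions (for example, -0.50 for a 50% loss).

```python
def geometric_mean_return(rates):
    """Geometric mean rate of return: [(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) - 1."""
    growth = 1.0
    for r in rates:
        growth *= 1 + r          # compound the growth factors
    return growth ** (1 / len(rates)) - 1

# Year 1: $100,000 falls to $50,000; year 2: $50,000 rises back to $100,000
rates = [(50_000 - 100_000) / 100_000, (100_000 - 50_000) / 50_000]

arithmetic_mean = sum(rates) / len(rates)
geometric_mean = geometric_mean_return(rates)

print(rates)             # [-0.5, 1.0]
print(arithmetic_mean)   # 0.25
print(geometric_mean)    # 0.0
```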
EXAMPLE 3.6
The Range
The range is the simplest numerical descriptive measure of variation in a set of data.
RANGE
The range is equal to the largest value minus the smallest value.
$$\text{Range} = X_{largest} - X_{smallest} \qquad (3.7)$$

To determine the range of the times to get ready, you rank the data from smallest to largest:
29 31 35 39 39 40 43 44 44 52
Using Equation (3.7), the range is 52 - 29 = 23 minutes. The range of 23 minutes indicates that
the largest difference between any two days in the time to get ready in the morning is 23 minutes.
EXAMPLE 3.7
37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
The range measures the total spread in the set of data. Although the range is a simple measure of total variation in the data, it does not take into account how the data are distributed
between the smallest and largest values. In other words, the range does not indicate if the values are evenly distributed throughout the data set, clustered near the middle, or clustered near
one or both extremes. Thus, using the range as a measure of variation when at least one value is
an extreme value is misleading.
$$\text{Interquartile range} = Q_3 - Q_1 \qquad (3.8)$$
The interquartile range measures the spread in the middle 50% of the data; therefore, it is not
influenced by extreme values. To determine the interquartile range of the times to get ready
29 31 35 39 39 40 43 44 44 52
you use Equation (3.8) and the earlier results on page 78, Q1 = 35 and Q3 = 44.
Interquartile range = 44 - 35 = 9 minutes
Therefore, the interquartile range in the time to get ready is 9 minutes. The interval 35 to 44 is
often referred to as the middle fifty.
EXAMPLE 3.8
37.3  39.2  44.2  44.5  53.8  56.6  59.3  62.4  66.5
Using Equation (3.8) and the earlier results on page 78, Q1 = 41.7 and Q3 = 60.85.
Interquartile range = 60.85 - 41.7 = 19.15
Therefore, the interquartile range in the 2003 return is 19.15.
Because the interquartile range does not consider any value smaller than Q1 or larger than
Q3, it cannot be affected by extreme values. Summary measures such as the median, Q1, Q3,
and the interquartile range, which cannot be influenced by extreme values, are called resistant
measures.
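To see what "resistant" means in practice, the sketch below computes the range and the interquartile range for the get-ready times, then repeats the calculation with the largest value replaced by 102 minutes. The helper `textbook_quartile` is an illustrative implementation of the three quartile rules from this chapter (it assumes at least two values); it is not a library function.

```python
def textbook_quartile(values, which):
    """Q1 (which=1) or Q3 (which=3) by the chapter's ranked-value rules."""
    ranked = sorted(values)
    position = which * (len(ranked) + 1) / 4    # 1-based ranked position
    lower = int(position)
    fraction = position - lower
    if fraction == 0:                           # Rule 1
        return ranked[lower - 1]
    if fraction == 0.5:                         # Rule 2
        return (ranked[lower - 1] + ranked[lower]) / 2
    nearest = lower + 1 if fraction > 0.5 else lower   # Rule 3
    return ranked[nearest - 1]

def data_range(values):
    return max(values) - min(values)

def interquartile_range(values):
    return textbook_quartile(values, 3) - textbook_quartile(values, 1)

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
times_extreme = [102 if t == 52 else t for t in times]

print(data_range(times), interquartile_range(times))
print(data_range(times_extreme), interquartile_range(times_extreme))
```

The range changes when the extreme value is introduced, while the interquartile range stays the same.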
$$S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$

SAMPLE VARIANCE
The sample variance is the sum of the squared differences around the mean divided by the
sample size minus one.

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \qquad (3.9)$$

where
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad (3.10)$$
If the denominator were n instead of n - 1, Equation (3.9) [and the inner term in Equation
(3.10)] would calculate the average of the squared differences around the mean. However,
n - 1 is used because of certain desirable mathematical properties possessed by the statistic
$S^2$ that make it appropriate for statistical inference (which will be discussed in Chapter 7).
As the sample size increases, the difference between dividing by n or n - 1 becomes smaller
and smaller.
You will most likely use the sample standard deviation as your measure of variation
[defined in Equation (3.10)]. Unlike the sample variance, which is a squared quantity, the standard deviation is always a number that is in the same units as the original sample data. The standard deviation helps you to know how a set of data clusters or distributes around its mean. For
almost all sets of data, the majority of the observed values lie within an interval of plus and
minus one standard deviation above and below the mean. Therefore, knowledge of the mean
and the standard deviation usually helps define where at least the majority of the data values
are clustering.
To hand-calculate the sample variance $S^2$ and the sample standard deviation S:
Step 1: Compute the difference between each value and the mean.
Step 2: Square each difference.
Step 3: Add the squared differences.
Step 4: Divide this total by n - 1 to get the sample variance.
Step 5: Take the square root of the sample variance to get the sample standard deviation.
Table 3.1 shows the first four steps for calculating the variance and standard deviation
for the getting ready times data with a mean ($\bar{X}$) equal to 39.6 (see page 74 for the calculation of the mean). The second column of Table 3.1 shows Step 1. The third column of Table
3.1 shows Step 2. The sum of the squared differences (Step 3) is shown at the bottom of
Table 3.1. This total is then divided by 10 - 1 = 9 to compute the variance (Step 4).
TABLE 3.1 Computing the Variance of the Getting Ready Times

$\bar{X}$ = 39.6

Time (X)    Step 1: $(X_i - \bar{X})$    Step 2: $(X_i - \bar{X})^2$
39            -0.60                          0.36
29           -10.60                        112.36
43             3.40                         11.56
52            12.40                        153.76
39            -0.60                          0.36
44             4.40                         19.36
40             0.40                          0.16
31            -8.60                         73.96
44             4.40                         19.36
35            -4.60                         21.16
Step 3: Sum:                               412.40
Step 4: Divide by (n - 1):                  45.82
You can also calculate the variance by substituting values for the terms in Equation (3.9):
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{412.4}{9} = 45.82$$
Because the variance is in squared units (in squared minutes for these data), to compute the
standard deviation you take the square root of the variance. Using Equation (3.10) on page 82,
the sample standard deviation S is
$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{45.82} = 6.77$$
This indicates that the get-ready times in this sample are clustering within 6.77 minutes around
the mean of 39.6 minutes (i.e., clustering between $\bar{X} - 1S = 32.83$ and $\bar{X} + 1S = 46.37$). In
fact, 7 out of 10 get-ready times lie within this interval.
Using the second column of Table 3.1, you can also calculate the sum of the differences
between each value and the mean to be zero. For any set of data, this sum will always be zero:

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0$$

This property is one of the reasons that the mean is used as the most common measure of central tendency.
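The five hand-calculation steps for the get-ready times can be traced in Python as follows. This is a sketch for checking the arithmetic; the variable names are illustrative. The standard library's `statistics.variance` and `statistics.stdev` also divide by n - 1 and are used here only as a cross-check.

```python
import math
from statistics import stdev, variance

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
n = len(times)
x_bar = sum(times) / n

deviations = [x - x_bar for x in times]          # Step 1
squared = [d ** 2 for d in deviations]           # Step 2
sum_of_squares = sum(squared)                    # Step 3
sample_variance = sum_of_squares / (n - 1)       # Step 4
sample_sd = math.sqrt(sample_variance)           # Step 5

# Values within one standard deviation of the mean
within_one_sd = [x for x in times if x_bar - sample_sd <= x <= x_bar + sample_sd]

print(round(sum(deviations), 10))    # the deviations sum to zero
print(round(sum_of_squares, 2), round(sample_variance, 2), round(sample_sd, 2))
print(len(within_one_sd))
```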
EXAMPLE 3.9
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{891.16}{8} = 111.395$$
TABLE 3.2 Computing the Variance of the 2003 Return for the Small Cap Mutual Funds with High Risk

$\bar{X}$ = 51.5333

Return 2003    Step 1: $(X_i - \bar{X})$    Step 2: $(X_i - \bar{X})^2$
44.5             -7.0333                       49.4678
39.2            -12.3333                      152.1111
62.4             10.8667                      118.0844
59.3              7.7667                       60.3211
56.6              5.0667                       25.6711
53.8              2.2667                        5.1378
37.3            -14.2333                      202.5878
44.2             -7.3333                       53.7778
66.5             14.9667                      224.0011
Step 3: Sum:                                  891.16
Step 4: Divide by (n - 1):                    111.395
$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{111.395} = 10.55$$

The standard deviation of 10.55 indicates that the 2003 returns for the small cap mutual
funds with high risk are clustering within 10.55 around the mean of 51.53 (i.e., clustering
between $\bar{X} - 1S = 40.98$ and $\bar{X} + 1S = 62.08$). In fact, 55.6% (5 out of 9) of the 2003 returns
lie within this interval.
The following summarizes the characteristics of the range, interquartile range, variance,
and standard deviation.
The more spread out, or dispersed, the data are, the larger the range, interquartile range,
variance, and standard deviation.
The more concentrated, or homogeneous the data are, the smaller the range, interquartile
range, variance, and standard deviation.
If the values are all the same (so that there is no variation in the data), the range, interquartile range, variance, and standard deviation will all equal zero.
None of the measures of variation (the range, interquartile range, standard deviation, and
variance) can ever be negative.
COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean,
multiplied by 100%.

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \qquad (3.11)$$

where
S = sample standard deviation
$\bar{X}$ = sample mean
For the sample of 10 get-ready times, since $\bar{X}$ = 39.6 and S = 6.77, the coefficient of variation is

$$CV = \left(\frac{S}{\bar{X}}\right) 100\% = \left(\frac{6.77}{39.6}\right) 100\% = 17.10\%$$
For the get-ready times, the standard deviation is 17.1% of the size of the mean.
You will find the coefficient of variation very useful when comparing two or more sets of
data that are measured in different units as Example 3.10 illustrates.
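Because the coefficient of variation is unit-free, the same function can be applied to data sets measured in different units. The sketch below applies Equation (3.11) to the get-ready times (minutes) and to the nine small cap high-risk fund returns (percent) used earlier in this chapter; the function name is illustrative. Small differences from the hand calculation arise because the code does not round S before dividing.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = (S / X-bar) * 100%, using the sample standard deviation."""
    return stdev(values) / mean(values) * 100

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]                          # minutes
returns = [44.5, 39.2, 62.4, 59.3, 56.6, 53.8, 37.3, 44.2, 66.5]          # percent

cv_times = coefficient_of_variation(times)
cv_returns = coefficient_of_variation(returns)

print(f"get-ready times: CV = {cv_times:.1f}%")
print(f"2003 returns:    CV = {cv_returns:.1f}%")
```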
EXAMPLE 3.10
Z Scores
An extreme value or outlier is a value located far away from the mean. Z scores are useful in
identifying outliers. The larger the Z score, the farther the distance from the value to the mean.
The Z score is the difference between the value and the mean, divided by the standard deviation.
Z SCORES

$$Z = \frac{X - \bar{X}}{S} \qquad (3.12)$$
For the time to get ready in the morning data, the mean is 39.6 minutes and the standard deviation is 6.77 minutes. The time to get ready on the first day is 39.0 minutes. You compute the Z
score for day 1 from

$$Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09$$
Table 3.3 shows the Z scores for all 10 days. The largest Z score is 1.83 for day 4, on which
the time to get ready was 52 minutes. The lowest Z score was -1.57 for day 2, on which the
time to get ready was 29 minutes. As a general rule, a value is considered an outlier if its Z score
is less than -3.0 or greater than +3.0. None of the times met that criterion to be considered
outliers.
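The Z scores and the outlier check can be reproduced with the following Python sketch. It applies Equation (3.12) to each get-ready time and flags any value whose Z score falls outside -3.0 to +3.0; the variable names are illustrative.

```python
from statistics import mean, stdev

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
x_bar = mean(times)
s = stdev(times)            # sample standard deviation (n - 1 denominator)

# Z score for each day: (X - X-bar) / S
z_scores = [(x - x_bar) / s for x in times]

# General rule from the text: |Z| > 3.0 marks an outlier
outliers = [x for x, z in zip(times, z_scores) if z < -3.0 or z > 3.0]

for day, (x, z) in enumerate(zip(times, z_scores), start=1):
    print(f"day {day:2d}: time = {x}, Z = {z:5.2f}")
print("outliers:", outliers)
```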
TABLE 3.3 Z Scores for the 10 Get-Ready Times

                      Time (X)    Z Score
                      39          -0.09
                      29          -1.57
                      43           0.50
                      52           1.83
                      39          -0.09
                      44           0.65
                      40           0.06
                      31          -1.27
                      44           0.65
                      35          -0.68
Mean                  39.6
Standard deviation     6.77

EXAMPLE 3.11
TABLE 3.4 Z Scores of the 2003 Return for the Small Cap Mutual Funds with High Risk

                      Return 2003    Z Scores
                      44.5           -0.67
                      39.2           -1.17
                      62.4            1.03
                      59.3            0.74
                      56.6            0.48
                      53.8            0.21
                      37.3           -1.35
                      44.2           -0.69
                      66.5            1.42
Mean                  51.53
Standard Deviation    10.55
Shape
A third important property that describes a set of numerical data is shape. Shape is the pattern
of the distribution of data values throughout the entire range of all the values. A distribution
will either be symmetrical, when low and high values balance each other out, or skewed, not
symmetrical and showing an imbalance of low values or high values.
Shape influences the relationship of the mean to the median in the following ways:
Mean < median: negative, or left-skewed
Mean = median: symmetrical
Mean > median: positive, or right-skewed

FIGURE 3.1 A Comparison of Three Data Sets Differing in Shape
(Panel A: Negative, or left-skewed; Panel B: Symmetrical; Panel C: Positive, or right-skewed)
The data in panel A are negative, or left-skewed. In this panel, most of the values are in the
upper portion of the distribution. There is a long tail and distortion to the left that is caused by
some extremely small values. These extremely small values pull the mean downward so that the
mean is less than the median.
The data in panel B are symmetrical. Each half of the curve is a mirror image of the other
half of the curve. The low and high values on the scale balance, and the mean equals the median.
The data in panel C are positive, or right-skewed. In this panel, most of the values are in
the lower portion of the distribution. There is a long tail on the right of the distribution and a
distortion to the right that is caused by some extremely large values. These extremely large values pull the mean upward so that the mean is greater than the median.
Experiment by entering an extreme value such as 10 minutes into one of the tinted cells of
column A. Which measures are affected by this change? Which ones are not? You can flip
between the before and after diagrams by repeatedly pressing Ctrl-Z (undo) followed by
Ctrl-Y (redo) to help see the changes the extreme value caused in the diagram.
three risk levels. High-risk funds had a slightly higher mean and median than did low-risk and
average-risk funds. There was very little difference in the standard deviations of the three
groups.
3.5 Suppose that the rate of return for a particular stock during the past two years was 10% and
30%. Compute the geometric mean rate of return.
(Note: A rate of return of 10% is recorded as 0.10 and a rate
of return of 30% is recorded as 0.30.)
3.6 The operations manager of a plant that manufactures tires wants to compare the actual inner
diameter of two grades of tires, each of which is
expected to be 575 millimeters. A sample of five tires of
each grade was selected, and the results representing the
inner diameters of the tires, ranked from smallest to largest,
are as follows:
Grade X:  568  570  575  578  584
Grade Y:  573  574  575  577  578
a. For each of the two grades of tires, compute the mean,
median, and standard deviation.
b. Which grade of tire is providing better quality? Explain.
c. What would be the effect on your answers in (a) and (b) if
the last value for grade Y were 588 instead of 578? Explain.
7 9 15 16 16 18 22 25 27 33 39
Source: Extracted from Quick Bites, Copyright 2001 by
Consumers Union of U.S., Inc., Yonkers, NY 10703-1057. Adapted
with permission from Consumer Reports, March 2001, 46.
3.9 In the 2002-2003 academic year, many public universities in the United States raised tuition and fees due to a
decrease in state subsidies (Mary Beth Marklein, "Public
Universities Raise Tuition, Fees - and Ire," USA Today,
August 8, 2002, 1A-2A). The following represents the
change in the cost of tuition, a shared dormitory room, and
the most popular meal plan between the 2001-2002 academic year and the 2002-2003 academic year for a sample of
10 public universities. COLLEGECOST
University
1,589
593
1,223
869
423
1,720
708
1,425
922
308
Calories    Fat
240          8.0
260          3.5
350         22.0
350         20.0
420         16.0
510         22.0
530         19.0
Hotel    Cars
205      47
179      41
185      49
210      38
128      32
145      48
177      49
117      41
221      56
159      41
205      50
128      32
165      34
180      46
198      41
158      40
132      39
283      67
269      69
204      40
Source: Extracted from The Wall Street Journal, October 10, 2003, W4.
3.18 The time period from 2000 to 2003 saw a great deal
of volatility in the value of stocks. The data in the following
table STOCKRETURN represent the total rate of return of the
Dow Jones Industrial Index, the Standard & Poor's 500, the
Russell 2000 Index, and the Wilshire 5000 Index from
2000 to 2003.

Year    DJIA     SP500    Russell2000    Wilshire5000
2003    25.30    26.40    45.40          29.40
2002    15.01    22.10    21.58          20.90
2001     5.44    11.90     1.03          10.97
2000     6.20     9.10     3.02          10.89
Year    One Year    30 Month    Money Market
2003    1.20        1.76        0.61
2002    1.98        2.74        1.02
2001    3.60        3.97        1.73
2000    5.46        5.64        2.09
Year    Platinum    Gold    Silver
2003    34.2        19.5    24.0
2002    24.5        24.5     5.5
2001    21.3         1.2     3.0
2000    23.3         1.8     5.9
3.2

TABLE 3.5 2003 Return for the Population Consisting of the Five Largest Bond Funds

Bond Fund                      2003 Return
Vanguard GNMA                   3.8
Vanguard Total Bond Index       6.5
Pimco Total Return Admin        7.0
Pimco Total Return Instl        7.3
America Bond Fund              12.9
Source: Extracted from The Wall Street Journal, March 25, 2004, C2.
POPULATION MEAN
The population mean is the sum of the values in the population divided by the population
size N.

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

where
$\mu$ = population mean
$X_i$ = ith value of the variable X

To compute the mean return for the population of bond funds given in Table 3.5, use
Equation (3.13),

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{3.8 + 6.5 + 7.0 + 7.3 + 12.9}{5} = \frac{37.5}{5} = 7.5$$

Thus, the mean 2003 return for these bond funds is 7.5%.
POPULATION VARIANCE
The population variance is the sum of the squared differences around the population mean
divided by the population size N.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (3.14)$$

where
$\mu$ = population mean
$X_i$ = ith value of the variable X

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (3.15)$$
To compute the population variance for the data of Table 3.5 on page 94, you use Equation
(3.14),

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{(3.8 - 7.5)^2 + (6.5 - 7.5)^2 + (7.0 - 7.5)^2 + (7.3 - 7.5)^2 + (12.9 - 7.5)^2}{5} = \frac{44.14}{5} = 8.828$$
Thus, the variance of the returns is 8.828 squared percentage return. The squared units
make the variance hard to interpret. You should use the standard deviation that uses the original
units of the data (percentage return). From Equation (3.15),
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} = \sqrt{8.828} = 2.97$$
Therefore, the typical 2003 return differs from the mean of 7.5 by approximately 2.97. This
large amount of variation suggests that these large bond funds produce results that differ greatly.
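The population calculations differ from the sample calculations only in the denominator (N rather than n - 1). The Python sketch below checks the bond fund results by hand and against `statistics.pvariance` and `statistics.pstdev`, the standard library's population versions; the variable names are illustrative.

```python
import math
from statistics import mean, pstdev, pvariance

# 2003 returns (%) for the population of five bond funds in Table 3.5
returns = [3.8, 6.5, 7.0, 7.3, 12.9]
N = len(returns)

mu = sum(returns) / N                                     # Equation (3.13)
sigma_squared = sum((x - mu) ** 2 for x in returns) / N   # Equation (3.14)
sigma = math.sqrt(sigma_squared)                          # Equation (3.15)

print(round(mu, 2), round(sigma_squared, 3), round(sigma, 2))

# Cross-check with the standard library's population functions
print(round(pvariance(returns), 3), round(pstdev(returns), 2))
```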
Approximately 68% of the values are within a distance of 1 standard deviation from the
mean.
Approximately 95% of the values are within a distance of 2 standard deviations from the
mean.
Approximately 99.7% are within a distance of 3 standard deviations from the mean.
The empirical rule helps you measure how the values distribute above and below the mean.
This can help you to identify outliers when analyzing a set of numerical data. The empirical rule
implies that for bell-shaped distributions only about one out of 20 values will be beyond two
standard deviations from the mean in either direction. As a general rule, you can consider values not found in the interval $\mu \pm 2\sigma$ as potential outliers. The rule also implies that only about
three in 1,000 will be beyond three standard deviations from the mean. Therefore, values not
found in the interval $\mu \pm 3\sigma$ are almost always considered outliers. For heavily skewed data sets,
or those not appearing bell-shaped for any other reason, the Chebyshev rule discussed on page
97 should be applied instead of the empirical rule.
EXAMPLE 3.12
Using the empirical rule, approximately 68% of the cans will contain between 12.04 and 12.08
ounces, approximately 95% will contain between 12.02 and 12.10 ounces, and approximately
99.7% will contain between 12.00 and 12.12 ounces. Therefore, it is highly unlikely that a can
will contain less than 12 ounces.
TABLE 3.6 How Data Vary Around the Mean

Interval                              Chebyshev                 Empirical Rule
                                      (for any distribution)    (bell-shaped distribution)
$(\mu - \sigma, \mu + \sigma)$        At least 0%               Approximately 68%
$(\mu - 2\sigma, \mu + 2\sigma)$      At least 75%              Approximately 95%
$(\mu - 3\sigma, \mu + 3\sigma)$      At least 88.89%           Approximately 99.7%

EXAMPLE 3.13
Because the distribution may be skewed, you cannot use the empirical rule. Using the
Chebyshev rule, you cannot say anything about the percentage of cans containing between
12.04 and 12.08 ounces. You can state that at least 75% of the cans will contain between 12.02
and 12.10 ounces, and at least 88.89% will contain between 12.00 and 12.12 ounces. Therefore,
between 0 and 11.11% of the cans contain less than 12 ounces.
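The Chebyshev percentages quoted above follow from the bound 1 - 1/k² for values within k standard deviations of the mean (for k greater than 1). The Python sketch below tabulates that bound for the can-filling intervals. The mean of 12.06 ounces and standard deviation of 0.02 ounces are assumptions inferred from the intervals quoted in the text (12.04 to 12.08 being one standard deviation either side), and the function name is illustrative.

```python
def chebyshev_minimum_fraction(k):
    """Minimum fraction of values within k standard deviations of the mean.

    The bound 1 - 1/k^2 is informative only for k > 1; for k <= 1 the
    rule guarantees nothing, so 0 is returned.
    """
    if k <= 1:
        return 0.0
    return 1 - 1 / k ** 2

# Assumed from the intervals in the text, not stated there directly
mean_fill = 12.06   # ounces
sd_fill = 0.02      # ounces

for k in (1, 2, 3):
    low = mean_fill - k * sd_fill
    high = mean_fill + k * sd_fill
    bound = chebyshev_minimum_fraction(k)
    print(f"{low:.2f} to {high:.2f} ounces: at least {bound:.2%}")
```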
You can use these two rules for understanding how data are distributed around the mean
when you have sample data. In each case, use the value you calculated for $\bar{X}$ in place of $\mu$
and the value you calculated for S in place of $\sigma$. The results you compute using the sample
statistics are approximations since you used sample statistics ($\bar{X}$, S) and not population parameters ($\mu$, $\sigma$).
10.3
13.0
13.0
8.0
11.1
11.6
10.0
12.5
9.3
10.5
11.1
6.7
11.2
11.8
10.2
15.1
12.9
9.3
11.5
7.6
9.6
11.0
7.3
8.7
11.1
12.5
9.2
10.4
10.7
10.1
9.0
8.4
5.3
10.6
9.9
6.5
10.0
12.7
11.6
8.9
14.5
10.3
12.5
9.5
9.8
7.5
12.8
10.5
7.8
8.6
Assets (Billions $)
19.5
16.8
13.7
12.8
10.9
3.3
$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \qquad (3.16)$$

where
$\bar{X}$ = sample mean
n = number of values or sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = number of values in the jth class
To calculate the standard deviation from a frequency distribution, you assume that all values
within each class interval are located at the midpoint of the class.
APPROXIMATING THE STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

$$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}} \qquad (3.17)$$
Example 3.14 illustrates the computation of the mean and the standard deviation from a
frequency distribution.
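EXAMPLE 3.14

TABLE 3.7 Frequency Distribution of the 2003 Return for Growth Mutual Funds

Percentage Return    Frequency
10 to under 20        2
20 to under 30        9
30 to under 40       13
40 to under 50       15
50 to under 60        5
60 to under 70        5
Total                49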
SOLUTION The computations that you need to calculate the approximations of the mean and
standard deviation of the 2003 return for growth mutual funds are summarized in Table 3.8.
Percentage Return    Number of Funds ($f_j$)    Midpoint ($m_j$)    $m_j f_j$    $(m_j - \bar{X})$    $(m_j - \bar{X})^2$    $(m_j - \bar{X})^2 f_j$
10 to under 20        2                          15                    30          -25.51                650.7601                1,301.5202
20 to under 30        9                          25                   225          -15.51                240.5601                2,165.0409
30 to under 40       13                          35                   455           -5.51                 30.3601                  394.6813
40 to under 50       15                          45                   675            4.49                 20.1601                  302.4015
50 to under 60        5                          55                   275           14.49                209.9601                1,049.8005
60 to under 70        5                          65                   325           24.49                599.7601                2,998.8005
Total                49                                             1,985                                                        8,212.2449
$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} = \frac{1{,}985.0}{49} = 40.51$$

and

$$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}} = \sqrt{\frac{8{,}212.2449}{49 - 1}} = \sqrt{171.08843} = 13.08$$
Class Intervals    Frequency
0 to under 10         10
10 to under 20        20
20 to under 30        40
30 to under 40        20
40 to under 50        10
Total                100
Approximate
a. the mean.
b. the standard deviation.

Class Intervals    Frequency
0 to under 10         40
10 to under 20        25
20 to under 30        15
30 to under 40        15
40 to under 50         5
Total                100
Approximate
a. the mean.
b. the standard deviation.
Amount                        March Frequency    April Frequency
$0 to under $2,000                   6                 10
$2,000 to under $4,000              13                 14
$4,000 to under $6,000              17                 13
$6,000 to under $8,000              10                 10
$8,000 to under $10,000              4                  0
$10,000 to under $12,000             0                  3
Total                               50                 50
0
1
2
3
0.0
4.0
8.0
12.0
3.4
Foreign-Made
Automobile Models
Less Than
Indicated Values
Number Percentage
0
1
4
19
0.0
1.4
5.6
26.4
U.S.-Made
Automobile Models
Less Than
Braking
Indicated Values
Distance
(in Ft)
Number Percentage
250
260
270
280
290
300
310
320
4
8
11
17
21
23
25
25
101
Foreign-Made
Automobile Models
Less Than
Indicated Values
Number Percentage
16.0
32.0
44.0
68.0
84.0
92.0
100.0
100.0
32
54
61
68
68
70
71
72
44.4
75.0
84.7
94.4
94.4
97.2
98.6
100.0
A
Frequency
B
Frequency
8
17
11
8
2
15
32
20
4
0
The five-number summary
Xsmallest Q1 Median Q3 Xlargest
provides a way to determine the shape of the distribution. Table 3.9 explains how the relationships among the five numbers allow you to recognize the shape of a data set.
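TABLE 3.9 Relationships among the Five-Number Summary and the Type of Distribution

Comparison: Distance from Xsmallest to the median versus the distance from the median to Xlargest.
  Left-Skewed: The distance from Xsmallest to the median is greater than the distance from the median to Xlargest.
  Symmetric: Both distances are the same.
  Right-Skewed: The distance from Xsmallest to the median is less than the distance from the median to Xlargest.

Comparison: Distance from Xsmallest to Q1 versus the distance from Q3 to Xlargest.
  Left-Skewed: The distance from Xsmallest to Q1 is greater than the distance from Q3 to Xlargest.
  Symmetric: Both distances are the same.
  Right-Skewed: The distance from Xsmallest to Q1 is less than the distance from Q3 to Xlargest.

Comparison: Distance from Q1 to the median versus the distance from the median to Q3.
  Left-Skewed: The distance from Q1 to the median is greater than the distance from the median to Q3.
  Symmetric: Both distances are the same.
  Right-Skewed: The distance from Q1 to the median is less than the distance from the median to Q3.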
For the sample of 10 get-ready times, the smallest value is 29 minutes and the largest value
is 52 minutes (see pages 75 and 77). Calculations done previously in Section 3.1 show that the
median = 39.5, the first quartile = 35, and the third quartile = 44. Therefore, the five-number
summary is
29 35 39.5 44 52
The distance from Xsmallest to the median (39.5 - 29 = 10.5) is slightly less
than the distance from the median to Xlargest (52 - 39.5 = 12.5). The distance from Xsmallest to Q1 (35 - 29 = 6)
is slightly less than the distance from Q3 to Xlargest (52 - 44 = 8). Therefore, the get-ready times
are slightly right-skewed.
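The distance comparisons just described can be automated. The sketch below is one possible way to do it, under the assumption that two distances count as equal when `math.isclose` says so; the function names `compare` and `shape_from_five_numbers` are illustrative and not part of the text.

```python
from math import isclose


def compare(left: float, right: float) -> str:
    """Label one pair of distances by the side that is more stretched out."""
    if isclose(left, right):
        return "symmetric"
    return "left-skewed" if left > right else "right-skewed"


def shape_from_five_numbers(x_smallest, q1, median, q3, x_largest):
    """Apply the three distance comparisons to a five-number summary."""
    return {
        "Xsmallest-to-median vs median-to-Xlargest":
            compare(median - x_smallest, x_largest - median),
        "Xsmallest-to-Q1 vs Q3-to-Xlargest":
            compare(q1 - x_smallest, x_largest - q3),
        "Q1-to-median vs median-to-Q3":
            compare(median - q1, q3 - median),
    }


# Five-number summary of the get-ready times: 29 35 39.5 44 52
get_ready = shape_from_five_numbers(29, 35, 39.5, 44, 52)
for comparison, verdict in get_ready.items():
    print(f"{comparison}: {verdict}")
```

The three comparisons need not agree with one another, so the function reports each verdict separately rather than forcing a single label.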
EXAMPLE 3.15
SOLUTION From previous computations for the 2003 return for the small cap mutual funds
with high risk (see pages 76 and 78), the median = 53.8, the first quartile = 41.7, and the third
quartile = 60.85. In addition, the smallest value in the data set is 37.3 and the largest value is
66.5. Therefore, the five-number summary is
37.3 41.7 53.8 60.85 66.5
The distance from the median to Xlargest (66.5 - 53.8 = 12.7) is less than the distance
from Xsmallest to the median (53.8 - 37.3 = 16.5). This indicates left-skewness. The distance
from Xsmallest to Q1 (41.7 - 37.3 = 4.4) is slightly less than the distance from Q3 to Xlargest
(66.5 - 60.85 = 5.65). This indicates slight right-skewness. Therefore, the results are
inconsistent.
[Figure 3.4: box-and-whisker plot of the get-ready times. Horizontal axis: Time (minutes), 20 to 55. Labeled points: Xsmallest, Q1, Median, Q3, Xlargest.]
The vertical line drawn within the box represents the median. The vertical line at the
left side of the box represents the location of Q1 and the vertical line at the right side of the
box represents the location of Q3. Thus, the box contains the middle 50% of the values in
the distribution. The lower 25% of the data are represented by a line (i.e., a whisker) connecting the left side of the box to the location of the smallest value, Xsmallest. Similarly, the
upper 25% of the data are represented by a whisker connecting the right side of the box to
Xlargest.
The box-and-whisker plot of the get-ready times in Figure 3.4 indicates very slight right-skewness since the distance between the median and the highest value is slightly more than the
distance between the lowest value and the median. The right whisker is slightly longer than the
left whisker.
EXAMPLE 3.16
SOLUTION Figure 3.5 is the Minitab box-and-whisker plot of the 2003 return for low-risk,
average-risk, and high-risk mutual funds. Minitab displays the box-and-whisker plot vertically
from bottom (low) to top (high). The asterisk (*) for the average-risk funds represents the presence of outlier values. The median percentage return and the quartiles are higher for the high-risk funds than for the low-risk and average-risk funds. The average-risk funds are right-skewed
due to the extremely large return of one fund (78). The high-risk funds appear left-skewed
because of the long lower whisker, but the median return is closer to the first quartile than to
the third quartile. The low-risk funds appear to be slightly right-skewed since the upper whisker
is longer than the lower whisker.
FIGURE 3.5 Minitab Box-and-Whisker Plot of the 2003 Return for Low-Risk, Average-Risk, and High-Risk Mutual Funds
Figure 3.6 demonstrates the relationship between the box-and-whisker plot and the polygon for four different types of distributions. (Note: The area under each polygon is split into
quartiles corresponding to the five-number summary for the box-and-whisker plot.)
FIGURE 3.6 Box-and-Whisker Plots and Corresponding Polygons for Four Distributions
Panel A: Bell-shaped distribution
Panel B: Left-skewed distribution
Panel C: Right-skewed distribution
Panel D: Rectangular distribution
Panels A and D of Figure 3.6 are symmetric. In these distributions, the mean and median
are equal. In addition, the length of the left whisker is equal to the length of the right whisker,
and the median line divides the box in half.
Panel B of Figure 3.6 is left-skewed. The few small values distort the mean toward the left tail.
For this left-skewed distribution, the skewness indicates that there is a heavy clustering of values at
the high end of the scale (i.e., the right side); 75% of all values are found between the left edge of
the box (Q1) and the end of the right whisker (Xlargest). Therefore, the long left whisker contains the
smallest 25% of the values, demonstrating the distortion from symmetry in this data set.
Panel C of Figure 3.6 is right-skewed. The concentration of values is on the low end of the
scale (i.e., the left side of the box-and-whisker plot). Here, 75% of all data values are found
between the beginning of the left whisker (Xsmallest) and the right edge of the box (Q3), and the
remaining 25% of the values are dispersed along the long right whisker at the upper end of the scale.
SELF
Test
1,589
593
1,223
869
423
1,720
708
1,425
922
308
106
3.5
Burgers
19 31 34 35 39 39 43
Chicken
7 9 15 16 16 18 22 25 27 33 39
Source: Extracted from "Quick Bites," Copyright 2001 by
Consumers Union of U.S., Inc., Yonkers, NY 10703-1057. Adapted
with permission from Consumer Reports, March 2001, 46.
The Covariance
The covariance measures the strength of the linear relationship between two numerical variables
(X and Y). Equation (3.18) defines the sample covariance and Example 3.17 illustrates its use.
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \qquad (3.18)$$
EXAMPLE 3.17
$$\text{cov}(X, Y) = \frac{9.579}{9 - 1} = 1.19738$$
TABLE 3.10 Expense Ratio and 2003 Return for the Small Cap High-Risk Funds

Expense Ratio    2003 Return
1.25             37.3
0.72             39.2
1.57             44.2
1.40             44.5
1.33             53.8
1.61             56.6
1.68             59.3
1.42             62.4
1.20             66.5

FIGURE 3.7 Microsoft Excel Worksheet for the Covariance between Expense Ratio and 2003 Return for the Small Cap High-Risk Funds
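As a cross-check on Example 3.17, the sample covariance of Equation (3.18) can be computed directly from the nine pairs in Table 3.10. This is a minimal sketch rather than a reproduction of the Excel worksheet; the function name `sample_covariance` is an assumption made for illustration.

```python
# Expense ratio (X) and 2003 return (Y) for the nine small cap high-risk funds.
expense_ratio = [1.25, 0.72, 1.57, 1.40, 1.33, 1.61, 1.68, 1.42, 1.20]
return_2003 = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]


def sample_covariance(x, y):
    """Sample covariance: sum of cross-products of deviations divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross_products / (n - 1)


cov_xy = sample_covariance(expense_ratio, return_2003)
print(f"cov(X, Y) = {cov_xy:.4f}")
```

The result depends on the units of both variables, which is why the covariance by itself says little about the relative strength of the relationship.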
The covariance has a major flaw as a measure of the linear relationship between two numerical
variables. Since the covariance can have any value, you are unable to determine the relative
strength of the relationship. To better determine the relative strength of the relationship, you
need to compute the coefficient of correlation.
Panel A: Perfect negative correlation (r = -1)
Panel B: No correlation (r = 0)
Panel C: Perfect positive correlation (r = +1)
In panel A of Figure 3.8 there is a perfect negative linear relationship between X and Y.
Thus, the coefficient of correlation $\rho$ equals -1, and when X increases, Y decreases in a
perfectly predictable manner. Panel B shows a situation in which there is no relationship
between X and Y. In this case, the coefficient of correlation $\rho$ equals 0, and as X increases,
there is no tendency for Y to increase or decrease. Panel C illustrates a perfect positive relationship where $\rho$ equals +1. In this case, Y increases in a perfectly predictable manner when
X increases.
When you have sample data, the sample coefficient of correlation r is calculated. When
using sample data, you are unlikely to have a sample coefficient of exactly +1, 0, or -1. Figure
3.9 on page 109 presents scatter diagrams along with their respective sample coefficients of
correlation r for six data sets, each of which contains 100 values of X and Y.
In panel A, the coefficient of correlation r is -0.9. You can see that for small values of
X there is a very strong tendency for Y to be large. Likewise, the large values of X tend to
be paired with small values of Y. The data do not all fall on a straight line, so the association between X and Y cannot be described as perfect. The data in panel B have a coefficient
of correlation equal to -0.6, and the small values of X tend to be paired with large values of
Y. The linear relationship between X and Y in panel B is not as strong as in panel A. Thus,
the coefficient of correlation in panel B is not as negative as in panel A. In panel C the linear relationship between X and Y is very weak, r = -0.3, and there is only a slight tendency
for the small values of X to be paired with the larger values of Y. Panels D through F depict
data sets that have positive coefficients of correlation because small values of X tend to be
paired with small values of Y, and the large values of X tend to be associated with large
values of Y.
In the discussion of Figure 3.9, the relationships were deliberately described as tendencies
and not as cause-and-effect. Correlation alone cannot prove
that there is a causation effect, that is, that the change in the value of one variable caused the
change in the other variable. A strong correlation can be produced simply by chance, by the
effect of a third variable not considered in the calculation of the correlation, or by a cause-and-effect relationship. You would need to perform additional analysis to determine which of these
three situations actually produced the correlation. Therefore, you can say that causation implies
correlation, but correlation alone does not imply causation.
FIGURE 3.9 Six Scatter Diagrams Created from Minitab and Their Sample Coefficients of Correlation r (Panels A through F)
Equation (3.19) defines the sample coefficient of correlation r and Example 3.18 illustrates its use.
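$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} \qquad (3.19)$$
where
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
$$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$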
Example 3.18 illustrates the computation of the sample coefficient of correlation using
Equation (3.19).
EXAMPLE 3.18
$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} = \frac{1.19738}{(0.287663)(10.554383)} = 0.3943786$$
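The calculation in Example 3.18 can be reproduced from the raw data in Table 3.10. The sketch below follows Equation (3.19) step by step; the helper names `sample_std` and `sample_correlation` are illustrative assumptions, not functions named in the text.

```python
from math import sqrt

# Expense ratio (X) and 2003 return (Y) for the nine small cap high-risk funds.
expense_ratio = [1.25, 0.72, 1.57, 1.40, 1.33, 1.61, 1.68, 1.42, 1.20]
return_2003 = [37.3, 39.2, 44.2, 44.5, 53.8, 56.6, 59.3, 62.4, 66.5]


def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_correlation(x, y):
    """Sample coefficient of correlation r = cov(X, Y) / (S_X * S_Y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return cov / (sample_std(x) * sample_std(y))


s_x = sample_std(expense_ratio)
s_y = sample_std(return_2003)
r = sample_correlation(expense_ratio, return_2003)
print(f"S_X = {s_x:.4f}, S_Y = {s_y:.4f}, r = {r:.4f}")
```

Unlike the covariance, r has no units and always lies between -1 and +1, so it can be compared across pairs of variables.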
FIGURE 3.10 Microsoft Excel Worksheet for the Sample Coefficient of Correlation r between the Expense Ratio and the 2003 Return for Small Cap High-Risk Funds
The expense ratio and the 2003 return for the small cap high-risk funds are positively correlated. Those mutual funds with the lowest expense ratios tend to be associated with the lowest 2003
returns. Those mutual funds with the highest expense ratios tend to be associated with the highest
2003 returns. This relationship is fairly weak, as indicated by a coefficient of correlation, r = 0.394.
You cannot assume that having a low expense ratio caused the low 2003 return. You can
only say that this is what tended to happen in the sample. As with all investments, past performance does not guarantee future performance.
21
15
24
10
12
15
18
9 18
30
36
12
27
45
54
Calories    Fat
240          8.0
260          3.5
350         22.0
350         20.0
420         16.0
510         22.0
530         19.0
                    Exports    Imports
European Union       874.1      912.8
United States        730.8     1180.2
Japan                403.5      349.1
China                266.2      243.6
Canada               259.9      227.2
Hong Kong            191.1      202.0
Mexico               158.5      176.2
South Korea          150.4      141.1
Taiwan               122.5      107.3
Singapore            121.8      116.0
3.6
Violations
110
100
90
88
79
70
64
53
47
37
20.7
9.9
14.8
25.1
13.5
10.3
13.1
30.1
31.8
14.9
St. Louis
Atlanta
Houston
Boston
Chicago
Denver
Dallas
Baltimore
Seattle/Tacoma
San Francisco
Orlando
WashingtonDulles
Los Angeles
Detroit
San Juan
Miami
New YorkJFK
WashingtonReagan
Honolulu
Turnover
City
City
Turnover
Violations
416
375
237
207
200
193
156
155
140
11.9
7.3
10.6
22.9
6.5
15.2
18.2
21.7
31.5
3.49 The following data CELLPHONE represent the digital-mode talk time in hours and the battery capacity in milliampere-hours of cellphones.

Talk Time    Battery Capacity
4.50          800
4.00         1500
3.00         1300
2.00         1550
2.75          900
1.75          875
1.75          750
2.25         1100
1.75          850
1.50          450
2.25          900
2.25          900
3.25          900
2.25          700
2.25          800
2.50          800
2.25          900
2.00          900
Summary
tation is subjective. You must avoid errors that may arise either in the objectivity of your analysis or in the subjectivity of your interpretation.
The analysis of the mutual funds based on risk level is objective and reveals several impartial findings. Objectivity in data analysis means reporting the most appropriate descriptive
summary measures for a given data set. Now that you have read the chapter and have become
familiar with various descriptive summary measures and their strengths and weaknesses, how
should you proceed with the objective analysis? Because the data distribute in a slightly asymmetrical manner, shouldn't you report the median in addition to the mean? Doesn't the standard
deviation provide more information about the property of variation than the range? Should you
describe the data set as right-skewed?
On the other hand, data interpretation is subjective. Different people form different conclusions
when interpreting the analytical findings. Everyone sees the world from different perspectives.
Thus, because data interpretation is subjective, you must do it in a fair, neutral, and clear manner.
Ethical Issues
Ethical issues are vitally important to all data analysis. As a daily consumer of information, you
need to question what you read in newspapers and magazines, what you hear on the radio or television, and what you see on the World Wide Web. Over time, much skepticism has been expressed
about the purpose, the focus, and the objectivity of published studies. Perhaps no comment on
this topic is more telling than a quip often attributed to the famous nineteenth-century British
statesman Benjamin Disraeli: "There are three kinds of lies: lies, damned lies, and statistics."
Ethical considerations arise when you are deciding what results to include in a report. You
should document both good and bad results. In addition, when making oral presentations and
presenting written reports, you need to give results in a fair, objective, and neutral manner.
Unethical behavior occurs when you willfully choose an inappropriate summary measure (e.g.,
the mean for a very skewed set of data) to distort the facts in order to support a particular position. In addition, unethical behavior occurs when you selectively fail to report pertinent findings because it would be detrimental to the support of a particular position.
SUMMARY
This chapter was about numerical descriptive measures. In
this and the previous chapter, you studied descriptive statistics: how data are presented in tables and charts, and
then summarized, described, analyzed, and interpreted.
When dealing with the mutual fund data, you were able to
present useful information through the use of pie charts,
histograms, and other graphical methods. You explored
characteristics of past performance such as central ten-
TABLE 3.11
Summary of Numerical
Descriptive Measures
Type of Analysis
Numerical Data
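KEY FORMULAS

Sample Mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)$$

Median
$$\text{Median} = \frac{n + 1}{2}\ \text{ranked value} \qquad (3.2)$$

First Quartile Q1
$$Q_1 = \frac{n + 1}{4}\ \text{ranked value} \qquad (3.3)$$

Third Quartile Q3
$$Q_3 = \frac{3(n + 1)}{4}\ \text{ranked value} \qquad (3.4)$$

Geometric Mean
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \qquad (3.5)$$

Geometric Mean Rate of Return
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \qquad (3.6)$$

Range
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$

Sample Variance
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \qquad (3.9)$$

Sample Standard Deviation
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad (3.10)$$

Coefficient of Variation
$$CV = \left(\frac{S}{\bar{X}}\right) 100\% \qquad (3.11)$$

Z Scores
$$Z = \frac{X - \bar{X}}{S} \qquad (3.12)$$

Population Mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (3.13)$$

Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (3.14)$$

Population Standard Deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad (3.15)$$

Approximating the Mean from a Frequency Distribution
$$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} \qquad (3.16)$$

Approximating the Standard Deviation from a Frequency Distribution
$$S = \sqrt{\frac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n - 1}} \qquad (3.17)$$

Sample Covariance
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \qquad (3.18)$$

Sample Coefficient of Correlation
$$r = \frac{\text{cov}(X, Y)}{S_X S_Y} \qquad (3.19)$$

KEY TERMS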
arithmetic mean
box-and-whisker plot
central tendency
Chebyshev rule
coefficient of correlation
coefficient of variation
covariance
dispersion
empirical rule
extreme value
five-number summary
geometric mean
interquartile range
left-skewed
mean
median
midspread
mode
outlier
population mean
population standard deviation
population variance
Q1: first quartile
Q2: second quartile
standard deviation
sum of squares
symmetrical
variance
variation
Z scores

CHAPTER REVIEW PROBLEMS
3.61 A quality characteristic of interest for a tea-bag-filling process is the weight of the tea in the individual bags. If
the bags are underfilled, two problems arise. First, customers may not be able to brew the tea to be as strong as
they wish. Second, the company may be in violation of the
truth-in-labeling laws. For this product, the label weight on
the package indicates that, on average, there are 5.5 grams
of tea in a bag. If the average amount of tea in a bag
exceeds the label weight, the company is giving away product. Getting an exact amount of tea in a bag is problematic
because of variation in the temperature and humidity inside
the factory, differences in the density of the tea, and the
extremely fast filling operation of the machine (approximately 170 bags a minute). The following table provides
the weight in grams of a sample of 50 tea bags produced in
one hour by a single machine. TEABAGS
5.65 5.57 5.47 5.77 5.61 5.44 5.40 5.40 5.57 5.45
5.42 5.53 5.47 5.42 5.44 5.40 5.54 5.61 5.58 5.25
5.53 5.55 5.53 5.58 5.56 5.34 5.62 5.32 5.50 5.63
5.54 5.56 5.67 5.32 5.50 5.45 5.46 5.29 5.50 5.57
5.52 5.44 5.49 5.53 5.67 5.41 5.51 5.55 5.58 5.36
a. Calculate the mean, median, range, and standard deviation for the width. Interpret these measures of central
tendency and variability.
b. List the five-number summary.
c. Construct a box-and-whisker plot and describe the shape.
d. What can you conclude about the number of troughs
that will meet the company's requirements of troughs
being between 8.31 and 8.61 inches wide?
3.65 The manufacturing company in problem 3.64 also
produces electric insulators. If the insulators break when in
use, a short-circuit is likely to occur. To test the strength of
the insulators, destructive testing is carried out to determine
how much force is required to break the insulators. Force is
measured by observing how many pounds must be applied
to the insulator before it breaks. The data from 30 insulators
from this experiment are as follows: FORCE
1,870 1,728 1,656 1,610 1,634 1,784 1,522 1,696 1,592 1,662
1,866 1,764 1,734 1,662 1,734 1,774 1,550 1,756 1,762 1,866
1,820 1,744 1,788 1,688 1,810 1,752 1,680 1,810 1,652 1,736
a. Calculate the mean, median, range, and standard deviation for the force variable.
b. Interpret the measures of central tendency and variability in (a).
c. Construct a box-and-whisker plot and describe the shape.
d. What can you conclude about the strength of the insulators if the company requires a force measurement of at
least 1,500 pounds?
3.66 Problems with a telephone line that prevent a customer from receiving or making calls are disconcerting to
both the customer and the telephone company. The following
data represent samples of 20 problems reported to two different offices of a telephone company and the time to clear
these problems (in minutes) from the customers' lines:
PHONE
3.67 In many manufacturing processes the term work-in-process (often abbreviated WIP) is used. In a book manufacturing plant the WIP represents the time it takes for
sheets from a press to be folded, gathered, sewn, tipped on
end sheets, and bound. The following data represent samples of 20 books at each of two production plants and the
processing time (operationally defined as the time in days
from when the books came off the press to when they were
packed in cartons) for these jobs. WIP
Plant A
5.62 5.29 16.25 10.92 11.46 21.62 8.45 8.58 5.41 11.42
11.62 7.29 7.50 7.96 4.42 10.50 7.58 9.29 7.54 8.92
Plant B
9.54 11.46 16.62 12.62 25.75 15.41 14.29 13.13 13.71 10.04
5.75 12.46 9.17 13.21 6.00 2.33 14.25 5.37 6.25 9.71
For the four types of food (dry dog food, canned dog food,
dry cat food and canned cat food), for the variables of cost
per serving, protein in grams, and fat in grams:
a. Compute the mean, median, first quartile, and third
quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot for the
four types (dry dog food, canned dog food, dry cat food,
and canned cat food). Are the data for any of the types of
food skewed? If so, how?
d. What conclusions can you reach concerning any differences among the four types (dry dog food, canned dog
food, dry cat food, and canned cat food)?
3.72 The manufacturer of Boston and Vermont asphalt
shingles provides its customers with a 20-year warranty
on most of its products. To determine whether a shingle
will last as long as the warranty period, accelerated-life
testing is conducted at the manufacturing plant.
Accelerated-life testing exposes the shingle to the stresses
it would be subject to in a lifetime of normal use in a laboratory setting via an experiment that takes only a few minutes to conduct. In this test, a shingle is repeatedly scraped
with a brush for a short period of time and the amount of
shingle granules that are removed by the brushing is
weighed (in grams). Shingles that experience low amounts
of granule loss are expected to last longer in normal use
than shingles that experience high amounts of granule loss.
In this situation, a shingle should experience no more than
0.8 grams of granule loss if it is expected to last the length
of the warranty period. The data file GRANULE contains a
sample of 170 measurements made on the company's
Boston shingles, and 140 measurements made on Vermont
shingles.
a. List the five-number summary for the Boston shingles
and for the Vermont shingles.
b. Construct side-by-side box-and-whisker plots for the
two brands of shingles and describe the shapes of the
distributions.
c. Comment on the shingles' ability to achieve a granule
loss of 0.8 grams or less.
3.73 The data in the file STATES represent the results of the
American Community Survey, a sampling of 700,000
households taken in each state during the 2000 U.S.
Census. For each of the variables of average travel-to-work
time in minutes, percentage of homes with eight or more
rooms, median household income, and percentage of mortgage-paying homeowners whose housing costs exceed 30%
of income:
a. Compute the mean, median, first quartile, and third
quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a box-and-whisker plot. Are the data skewed?
If so, how?
d. What conclusions can you reach concerning the average
travel-to-work time in minutes, percentage of homes
with eight or more rooms, median household income,
and percentage of mortgage-paying homeowners whose
housing costs exceed 30% of income?
3.74 The economics of baseball has caused a great deal of
controversy with owners arguing that they are losing money,
players arguing that owners are making money, and fans
complaining about how expensive it is to attend a game and
3.76 The data in the file PRINTERS represent the price, text
speed, text cost, color photo time, and color photo cost of
computer printers.
a. Compute the coefficient of correlation between price and
each of the following: text speed, text cost, color photo
time, and color photo cost.
b. Based on the results of (a), do you think that any of the
other variables might be useful in predicting printer
price? Explain.
Source: Extracted from "Printers," Copyright 2002 by Consumers
Union of U.S., Inc., Yonkers, NY 10703-1057. Adapted with
permission from Consumer Reports, March 2002, 51.
For New York City and Long Island restaurants, for the
variables of food rating, decor rating, service rating, and
price per person:
a. Compute the mean, median, first quartile, and third
quartile.
b. Compute the range, interquartile range, variance, standard deviation, and coefficient of variation.
c. Construct a side-by-side box-and-whisker plot for the
New York City and Long Island restaurants. Are the data
for any of the variables skewed? If so, how?
d. What conclusions can you reach concerning differences
between New York City and Long Island restaurants?
3.80 As an illustration of the misuse of statistics, an article by Glenn Kramon ("Coaxing the Stanford Elephant to
Dance," The New York Times Sunday Business Section,
November 11, 1990) implied that costs at Stanford
Medical Center had been driven up higher than at competing institutions because the former was more likely
than other organizations to treat indigent, Medicare,
Medicaid, sicker, and more complex patients. The chart
below was provided to compare the average 1989 to 1990
hospital charges for three medical procedures (coronary
bypass, simple birth, and hip replacement) at three competing institutions (El Camino, Sequoia, and Stanford).
Suppose you were working in a medical center. Your CEO
knows you are currently taking a course in statistics and calls
you in to discuss this. She tells you that the article was presented in a discussion group setting as part of a meeting of
regional area medical center CEOs last night and that one of
them mentioned that this chart was totally meaningless and
asked her opinion. She now requests that you prepare her
response. You smile, take a deep breath, and reply . . .
[Bar chart of average hospital charges (vertical axis: Dollars) for coronary bypass, simple birth, and hip replacement at El Camino, Sequoia, and Stanford; one value is marked N/A.]
El Camino costs are the average of high and low charges for a simple
birth with a two-day stay and a hip replacement with a nine-day stay.
Sequoia costs are averages of the middle 50% of all charges for each
operation.
Stanford data are the average cost of all operations.
3.81 You are planning to study for your statistics examination with a group of classmates, one of whom you particularly want to impress. This individual has volunteered
to use Microsoft Excel, Minitab, or SPSS to get the needed
summary information, tables, and charts for a data set containing several numerical and categorical variables
assigned by the instructor for study purposes. This person
comes over to you with the printout and exclaims, "I've got
it all: the means, the medians, the standard deviations, the
box-and-whisker plots, the pie charts, for all our variables. The problem is, some of the output looks weird,
like the box-and-whisker plots for gender and for major
and the pie charts for grade point index and for height.
Also, I can't understand why Professor Krehbiel said we
can't get the descriptive stats for some of the variables. I
got it for everything! See, the mean for height is 68.23, the
mean for grade point index is 2.76, the mean for gender is
1.50, the mean for major is 4.33." What is your reply?
TEAM
PROJECTS
Appendix
121
RUNNING CASE
MANAGING THE SPRINGVILLE HERALD
For what variable in the Chapter 2 Managing the
Springville Herald case (see page 62) are numerical
descriptive measures needed? For the variable you identify:
1. Compute the appropriate numerical descriptive measures, and generate a box-and-whisker plot.
WEB
CASE
Appendix 3
Using Software
for Descriptive Statistics
A3.1 MICROSOFT EXCEL
For Descriptive Statistics
Use the Data Analysis ToolPak. Open to the worksheet containing the data you want to summarize. Select Tools → Data Analysis. From the list that appears in the Data
A3.2 MINITAB
Computing Descriptive Statistics
To produce descriptive statistics for the 2003 return for different risk levels shown in Figure 3.3 on page 90, open the
MUTUALFUNDS2004.MTW worksheet. Select Stat →
Basic Statistics → Display Descriptive Statistics.
FIGURE A3.1 Data Analysis Descriptive
Statistics Dialog Box
To enter one of these functions into a worksheet, select
an empty cell and then select Insert → Function. In the
Function dialog box, select Statistical from the drop-down
list and then scroll to and select the function you want to
use. Click OK. In the Function Arguments dialog box,
enter the cell range of the data to be summarized and click
OK. (For LARGE and SMALL, enter 1 as the K value; and
for QUARTILE, enter either 1 or 3 as the Quart value, for
either first or third quartile.) In versions of Microsoft Excel
earlier than Excel 2003, you may encounter errors in
results when using the QUARTILE function.
For Covariance
Open the Covariance.xls Excel file, shown in Figure 3.7 on
page 107. Follow the onscreen instructions for modifying the
table area if you want to use this worksheet with other pairs of
variables. Note in Figure 3.7 that cell C15 contains a formula
that uses the COUNT function. This allows Excel to automatically update the value of n when the size of the table area is
changed, and ensures that the n - 1 term is always correct.