Lecture 2
Alexey Rubtsov
Ryerson University
1 Measures of Center
2 Measures of Variability
3 Measures of Relative Standing
4 The Five Number Summary and the Box Plot
Solution:
The dotplot in the following figure appears to be centered between
6 and 8.
Figure 2.2
Example – Solution
To find the sample mean, calculate
Definition (Median)
The median of the sample is the middle measurement when the
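measurements are ranked from smallest to largest:
$$M = \begin{cases} y_{0.5(n+1)} & \text{if } n \text{ is odd} \\ \tfrac{1}{2}\left(y_{0.5n} + y_{0.5(n+2)}\right) & \text{if } n \text{ is even} \end{cases}$$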
Definition (Mode)
The mode is the category that occurs most frequently, or the most
frequently occurring value of x.
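The two definitions above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names `sample_median` and `sample_modes` are made up for the example, and the short list passed to `sample_modes` is hypothetical data.

```python
from collections import Counter


def sample_median(values):
    """Median by ranked position: y_{0.5(n+1)} for odd n,
    the average of y_{0.5n} and y_{0.5(n+2)} for even n."""
    y = sorted(values)
    n = len(y)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        return y[(n + 1) // 2 - 1]            # 1-based position 0.5(n+1)
    return (y[n // 2 - 1] + y[n // 2]) / 2    # positions 0.5n and 0.5(n+2)


def sample_modes(values):
    """All values tied for the highest frequency (there may be several)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(sample_median([2, 9, 11, 5, 6, 27]))   # even n: average of the two middle values
print(sample_modes([1, 2, 2, 3, 3]))         # hypothetical data with two modes
```

Returning a list of modes rather than a single value is a design choice that keeps the bimodal case visible.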
Solution:
Rank the n = 5 measurements from smallest to largest:
Example
Find the median for the set of measurements 2, 9, 11, 5, 6, 27.
Solution:
Rank the measurements from smallest to largest:
Example – Solution
Now there are two “middle” observations, shown in the box.
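For reference, the calculation for the measurements 2, 9, 11, 5, 6, 27 can be written out with the even-n case of the median definition. The ranked list and the result below follow directly from the data given in the example.

```latex
\[
2,\; 5,\; 6,\; 9,\; 11,\; 27 \qquad (n = 6)
\]
\[
M = \tfrac{1}{2}\left(y_{3} + y_{4}\right) = \tfrac{1}{2}(6 + 9) = 7.5
\]
```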
When a data set has extremely small or extremely large
observations, the sample mean is drawn toward the
direction of the extreme measurements.
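A small sketch of this effect, using hypothetical numbers that are not from the slides: replacing one value with an extreme one moves the mean a long way while the median stays put.

```python
import statistics

# Hypothetical data: the second list differs only in its largest value.
typical = [4, 5, 6, 7, 8]
with_extreme = [4, 5, 6, 7, 80]

for label, data in (("typical", typical), ("with extreme value", with_extreme)):
    # The mean uses every value; the median only depends on the middle rank.
    print(label, statistics.mean(data), statistics.median(data))
```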
Example
For this example, the mode and the modal class, shown in the
figure, are the same: the highest peak in the graph.
It is possible for a distribution of measurements to have
more than one mode. These modes would appear as “local
peaks” in the relative frequency distribution. For example, if
we were to tabulate the length of fish taken from a lake
during one season, we might get a bimodal distribution,
possibly reflecting a mixture of young and old fish in the
population.
2. Measures of Variability
Both distributions in Figure 2.5 are centered at x = 4, but there is a big difference in
the way the measurements spread out, or vary. The
measurements in Figure 2.5(a) vary from 3 to 5; in
Figure 2.5(b) the measurements vary from 0 to 8.
Definition (Range)
The range, R, of a set of n measurements is the difference between the
largest (maximum) and the smallest (minimum) measurements:
$R = y_n - y_1$.
Table 2.2: Computation of $\sum (x_i - \bar{x})^2$
Example
Adding, we obtain
Example
Notation
Example
If you need to calculate $s^2$ by hand, it is much easier to
use the alternative computing formula. This
formula is sometimes called the shortcut method for
calculating $s^2$.
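The formula itself did not survive on this slide. For reference, the standard computing formula for the sample variance, which is algebraically equivalent to the definition $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$, is:

```latex
\[
s^2 = \frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}
\]
```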
Example
You know from the formula for the sample mean that $\sum x_i$ is the sum of all
the measurements. To find $\sum x_i^2$, you square each individual
measurement and then add them together.
Example
Calculate the variance and standard deviation for the five
measurements from Table 2.2 (5, 7, 1, 2, 4), reproduced in the
table. Use the computing formula for $s^2$ and compare your
results with those obtained using the original definition of $s^2$.
Example
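The entries in the table are the individual measurements, $x_i$, and
their squares, together with their sums. Using the
computing formula for $s^2$, you have
$s^2 = \dfrac{95 - (19)^2/5}{4} = \dfrac{22.8}{4} = 5.7$ and $s = \sqrt{5.7} \approx 2.39$.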
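The comparison the example asks for can be checked with a short Python sketch. The function names are illustrative, not from the slides; the data are the five measurements 5, 7, 1, 2, 4 quoted in the example.

```python
import math


def variance_definition(xs):
    """Sample variance from the definition: sum of squared deviations / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """Sample variance from the computing formula:
    (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    total = sum(xs)                      # sum of the measurements
    total_sq = sum(x * x for x in xs)    # sum of the squared measurements
    return (total_sq - total ** 2 / n) / (n - 1)


data = [5, 7, 1, 2, 4]
s2_def = variance_definition(data)
s2_short = variance_shortcut(data)
s = math.sqrt(s2_short)
print(round(s2_def, 4), round(s2_short, 4), round(s, 2))
```

Both routes agree up to floating-point rounding, which is why the results are compared after rounding rather than with exact equality.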
Now that you have learned how to calculate the variance
and standard deviation, remember these points:
Example
Population: Torontonians.
Sample: People at a bus stop.
Variable: Number of coffee cups per week.
Data: x1 = 3, x2 = 14, x3 = 2, x4 = 5, x5 = 6. Then
y1 = 2, y2 = 3, y3 = 5, y4 = 6, y5 = 14, n = 5.
$\bar{x} = \dfrac{3 + 14 + 2 + 5 + 6}{5} = 6$
$M = 5$
$R = 14 - 2 = 12$
$s^2 = \dfrac{\sum_{i=1}^{5} (x_i - 6)^2}{4} = \dfrac{90}{4} = 22.5$
$s = \sqrt{22.5} \approx 4.74$
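As a check, the summary statistics for the coffee data can be recomputed with Python's standard `statistics` module. This is only a verification sketch; the variable names are illustrative.

```python
import statistics

cups = [3, 14, 2, 5, 6]                # x1, ..., x5 from the example

mean = statistics.mean(cups)
median = statistics.median(cups)
data_range = max(cups) - min(cups)     # largest minus smallest
s2 = statistics.variance(cups)         # sample variance, divisor n - 1
s = statistics.stdev(cups)             # square root of the sample variance

print(mean, median, data_range, s2, round(s, 2))
```

The squared deviations from the mean are 9, 64, 16, 1 and 0, which sum to 90, so the sample variance is 90 / 4 = 22.5.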
Theorem (Tchebysheff)
Given a number k greater than or equal to 1 and a set of n
measurements, at least $1 - 1/k^2$ of the measurements will lie within
k standard deviations of their mean.
Example
The mean and variance of a sample of n = 25
measurements are 75 and 100, respectively. Use
Tchebysheff’s Theorem to describe the measurements.
Solution:
You are given $\bar{x} = 75$ and $s^2 = 100$. The standard deviation is
$s = \sqrt{100} = 10$. The distribution of measurements is centered
at $\bar{x} = 75$, and Tchebysheff's Theorem states:
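• At least 3 ∕ 4 of the measurements lie in the interval
$\bar{x} \pm 2s$, that is, 55 to 95.
• At least 8 ∕ 9 of the measurements lie in the interval
$\bar{x} \pm 3s$, that is, 45 to 105.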
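The theorem's bound and the matching interval can be tabulated for several values of k. The sketch below uses the mean 75 and standard deviation 10 from this example; the function name `tchebysheff_bound` is made up for illustration.

```python
def tchebysheff_bound(mean, sd, k):
    """Return the minimum fraction 1 - 1/k^2 and the interval mean +/- k*sd."""
    if k < 1:
        raise ValueError("k must be at least 1")
    fraction = 1 - 1 / k ** 2
    return fraction, (mean - k * sd, mean + k * sd)


for k in (1, 2, 3):
    fraction, (low, high) = tchebysheff_bound(75, 10, k)
    print(f"k={k}: at least {fraction:.3f} of the measurements in [{low}, {high}]")
```

For k = 1 the bound is 0, so the theorem says nothing useful about the interval 65 to 85.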
The Empirical Rule
Another rule for describing the
variability of a data set does not
work for all data sets, but it does
work very well for data that “pile
up” in the familiar mound shape.
Mound-shaped distribution
Figure 2.9
Solution:
To describe the data, calculate these intervals:
According to the Empirical Rule, you expect
• Approximately 68% of the measurements will fall in the
interval 11.1 to 14.5.
• Approximately 95% of the measurements will fall in the
interval 9.4 to 16.2.
• Approximately 99.7% of the measurements will fall in the
interval 7.7 to 17.9.
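The three intervals can be reproduced with a short sketch. The mean 12.8 and standard deviation 1.7 used here are not stated in the surviving text; they are inferred from the intervals above (midpoint 12.8, half-width 1.7) and should be treated as an assumption. The function name is illustrative.

```python
def empirical_rule_intervals(mean, sd, ndigits=1):
    """Intervals mean +/- k*sd for k = 1, 2, 3 with their approximate coverage."""
    coverage = {1: 0.68, 2: 0.95, 3: 0.997}
    return {
        k: (round(mean - k * sd, ndigits), round(mean + k * sd, ndigits), share)
        for k, share in coverage.items()
    }


# Assumed values, inferred from the intervals quoted in the example.
intervals = empirical_rule_intervals(12.8, 1.7)
for k, (low, high, share) in intervals.items():
    print(f"about {share:.1%} of the measurements in {low} to {high}")
```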
If you doubt that the distribution of measurements is
mound-shaped, or if you wish for some other reason to be
conservative, you can apply Tchebysheff’s Theorem and be
absolutely certain of your statements.
The Empirical Rule is a “rule of thumb” that can be used
only when the data tend to be roughly mound-shaped.
3. Measures of Relative Standing
Sometimes you need to know the position of one
observation relative to others in a data set.
z-Scores
The mean and standard deviation of a data set can be used
to calculate a z-score, which measures the distance
between a particular observation x and the mean, measured
in units of standard deviation.
DEFINITION
The sample z-score is a measure of relative standing
defined as
$z = \dfrac{x - \bar{x}}{s}$
Example
Two students are preparing for college admissions by
taking college preparatory exams. One student takes the
SAT test and scores 1440 out of 1600 while the other takes
the ACT test and scores 31 out of 36.
We can find the means and standard deviations for the SAT
and ACT tests from collegeboard.org and nces.ed.gov in the
following table:
Test   Score   Mean   Standard deviation
ACT    31      21     5.2
Then we compare the students by using their respective
z-scores:
The student who took the SAT test has performed better on
her exam than the student who took the ACT test.
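A z-score calculation of this kind can be sketched as follows. Only the ACT figures (score 31, mean 21, standard deviation 5.2) survive in the table, so the sketch computes that z-score alone; the SAT mean and standard deviation are not available here and are deliberately left out. The function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


act_z = z_score(31, 21, 5.2)
print(round(act_z, 2))   # ACT score is about 1.92 standard deviations above the mean
```

The same function applied to the SAT score, with that test's mean and standard deviation, gives the second z-score for the comparison.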
Percentiles and Quartiles
A percentile is another measure of relative standing, most
often used for large data sets.
DEFINITION
A set of n measurements on the variable x has been
arranged from smallest to largest. The pth percentile is
the value of x that is greater than p% of the
measurements and is less than the remaining (100 − p)%.
Example
Suppose you have been notified that your score of 158 on
the Verbal Graduate Record Examination placed you at the
80th percentile in the distribution of scores.
Solution:
Scoring at the 80th percentile means that 80% of all the
examination scores were lower than your score and 20%
were higher.
DEFINITION
A set of n measurements on the variable x has been
arranged from smallest to largest. The lower quartile
(first quartile), Q1, is the value of x that is greater than
one-fourth of the measurements and is less than the
remaining three-fourths. The second quartile is the
median. The upper quartile (third quartile), Q3, is the
value of x that is greater than three-fourths of the
measurements and is less than the remaining one-fourth.
Calculating Sample Quartiles
• When the measurements are arranged from smallest to
largest, the lower quartile, Q1, is the value of x in
position .25(n + 1), and the upper quartile, Q3, is the
value of x in position .75(n + 1).
Example
Find the lower and upper quartiles for this set of
measurements:
Solution:
Rank the n = 10 measurements from smallest to largest:
Calculate
Position of Q1 = .25(n + 1) = .25(10 + 1) = 2.75
Position of Q3 = .75(n + 1) = .75(10 + 1) = 8.25
Since these positions are not integers, we take the lower
quartile to be the value 3∕4 of the distance between the
second and third ordered measurements, and we take the
upper quartile to be the value 1∕4 of the distance between
the eighth and ninth ordered measurements.
Therefore,
DEFINITION
The interquartile range (IQR) for a set of measurements
is the difference between the upper and lower quartiles;
that is, IQR = Q3 − Q1.
How to Calculate Sample Quartiles
1. Arrange the data set in order of magnitude from smallest
to largest.
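2. Calculate the quartile positions:
Position of Q1 = .25(n + 1)
Position of Q3 = .75(n + 1)
3. If the positions are integers, then Q1 and Q3 are the values in the
ordered data set found in those positions.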
4. If the positions in step 2 are not integers, find the two
measurements in positions just above and just below the
calculated position. Calculate the quartile by finding a
value either one-fourth, one-half, or three-fourths of the
way between these two measurements.
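The procedure can be sketched in Python as below. The helper names are made up for this example, and the test data (the integers 1 to 10) are hypothetical. Note that library quartile functions often use other position rules, so their results may differ slightly from the .25(n + 1) and .75(n + 1) rule used in these slides.

```python
import math


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    n = len(ordered)
    position = min(max(position, 1), n)    # clamp for very small samples
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]          # integer position: take that value
    # Otherwise move `fraction` of the way from one neighbor to the next.
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def sample_quartiles(values):
    """Q1 and Q3 at positions .25(n + 1) and .75(n + 1) of the ranked data."""
    y = sorted(values)
    n = len(y)
    if n == 0:
        raise ValueError("quartiles of an empty sample are undefined")
    return value_at_position(y, 0.25 * (n + 1)), value_at_position(y, 0.75 * (n + 1))


q1, q3 = sample_quartiles(range(1, 11))    # hypothetical data, n = 10
print(q1, q3, q3 - q1)                     # positions 2.75 and 8.25, then the IQR
```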
4. The Five-Number Summary and the Box Plot
From the box plot, you can quickly detect any skewness in
the shape of the distribution and see whether there are any
outliers in the data set.
Figure 2.16
The upper and lower fences are shown with broken lines in
Figure 2.16, but they are not usually drawn on the box plot.
Example
As American consumers become more careful about the
foods they eat, food processors try to avoid large amounts
of fat, cholesterol, and sodium in the foods they sell. The
following data are the amounts of sodium per slice (in
milligrams) for each of eight brands of regular American
cheese. Draw a box plot for the data and look for outliers.
The n = 8 measurements are ranked from smallest to
largest:
so that m = (320 + 330) ∕ 2 = 325, Q1 = 290 + .25(10) = 292.5,
and Q3 = 340. The interquartile range is calculated as
IQR = Q3 − Q1 = 340 − 292.5 = 47.5.
The value x = 520, a brand of cheese containing 520
milligrams of sodium, is the only outlier, lying beyond the
upper fence.
The box plot for the data is shown in the following figure.
Figure 2.17
The outlier is marked with an asterisk (*). Once the
outlier is excluded, we find that the smallest and largest
measurements are x = 260 and x = 340.
These are the two values that form the whiskers. Since
the value x = 340 is the same as Q3, there is no whisker
on the right side of the box.
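The quantities behind this box plot can be recomputed with the sketch below. Two assumptions are made and should be checked against the original slides: the data list is reconstructed from the values quoted in the solution (the table itself did not survive), and the fences are placed 1.5 × IQR beyond the quartiles, a common convention that the surviving text does not state. All function names are illustrative.

```python
import math


def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def box_plot_summary(values):
    """Five-number summary, IQR, fences, outliers and whisker ends (needs n >= 3)."""
    y = sorted(values)
    n = len(y)
    q1 = value_at(y, 0.25 * (n + 1))
    median = value_at(y, 0.50 * (n + 1))
    q3 = value_at(y, 0.75 * (n + 1))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr           # assumed 1.5 * IQR convention
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in y if lower_fence <= v <= upper_fence]
    return {
        "five_number": (y[0], q1, median, q3, y[-1]),
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "outliers": [v for v in y if v < lower_fence or v > upper_fence],
        "whiskers": (inside[0], inside[-1]),
    }


# Sodium per slice (mg); reconstructed from the solution text, an assumption.
sodium = [260, 290, 300, 320, 330, 340, 340, 520]
summary = box_plot_summary(sodium)
print(summary)
```

With these data the upper whisker end coincides with Q3, which is why no whisker appears on the right side of the box.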