
Lecture 2:

Describing Data with Numerical Measures

Alexey Rubtsov

Ryerson University

Alexey Rubtsov (Ryerson University) Probability and Statistics I 1/9


Lecture content

1 Measures of Center
2 Measures of Variability
3 Measures of Relative Standing
4 The Five Number Summary and the Box Plot



1. Measures of Center
Let $x_1, \ldots, x_n$ be a set of n measurements. Assume $y_1, \ldots, y_n$ is the
ordered version of $x_1, \ldots, x_n$ (ranked from smallest to largest).

Measure of Center: A measure along the horizontal axis of the data
distribution that locates the center of the distribution.

Definition (Arithmetic Mean)
The arithmetic mean of the sample is the sum of the measurements
divided by the total number of measurements:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

Remark: If we are able to enumerate the whole population, then we
can compute the “population mean”, denoted µ.



Example
Draw a dotplot for the n = 5 measurements 2, 9, 11, 5, 6.
Find the sample mean and compare its value with what you
might consider the “center” of these observations on the
dotplot.

Solution:
The dotplot in Figure 2.2 seems to be centered between
6 and 8.

[Figure 2.2: dotplot of the five measurements]

To find the sample mean, calculate

$$\bar{x} = \frac{\sum x_i}{n} = \frac{2 + 9 + 11 + 5 + 6}{5} = \frac{33}{5} = 6.6$$

If you think of the dots in Figure 2.2 as equal weights on a
scale or “see-saw,” the mean is the point at which
the scale or “see-saw” is balanced.
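
The balance-point idea can be checked numerically. The following is a minimal sketch, assuming plain Python with no extra libraries; the variable names are illustrative and not part of the lecture. It computes the sample mean and confirms that the deviations from it sum to zero, up to floating-point rounding.

```python
# Sample mean of the five measurements from the example.
measurements = [2, 9, 11, 5, 6]

n = len(measurements)
sample_mean = sum(measurements) / n  # (2 + 9 + 11 + 5 + 6) / 5

# Balance-point property: the deviations from the mean cancel out,
# so their sum is zero (up to floating-point rounding).
deviations = [x - sample_mean for x in measurements]

print(f"mean = {sample_mean}")
print(f"sum of deviations = {sum(deviations):.10f}")
```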

Definition (Median)
The median of the sample is the middle measurement when the
measurements are ranked from smallest to largest:

$$M = \begin{cases} y_{0.5(n+1)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(y_{0.5n} + y_{0.5(n+2)}\right) & \text{if } n \text{ is even} \end{cases}$$

Definition (Mode)
The mode is the category that occurs most frequently, or the most
frequently occurring value of x.



Example
Find the median for the set of measurements 2, 9, 11, 5, 6.

Solution:
Rank the n = 5 measurements from smallest to largest:

2  5  6  9  11
      ↑

The middle observation, marked with an arrow, is in the
center of the set, or M = 6.

Example
Find the median for the set of measurements 2, 9, 11, 5, 6, 27.

Solution:
Rank the measurements from smallest to largest:

2  5  [6  9]  11  27

Now there are two “middle” observations, shown in brackets.

To find the median, choose a value halfway between the two
middle observations:

M = (6 + 9)/2 = 7.5
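
The two cases of the median definition can be written directly as code. This is a small sketch, assuming plain Python; the function name `median` is chosen here for illustration, and the index arithmetic only converts the lecture's 1-based positions to Python's 0-based lists.

```python
def median(values):
    """Median following the lecture's definition, with y the sorted data."""
    y = sorted(values)
    n = len(y)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # n odd: M = y_{0.5(n+1)}; subtract 1 for 0-based indexing.
        return y[(n + 1) // 2 - 1]
    # n even: M = (y_{0.5n} + y_{0.5(n+2)}) / 2
    return (y[n // 2 - 1] + y[(n + 2) // 2 - 1]) / 2


median_odd = median([2, 9, 11, 5, 6])        # n = 5
median_even = median([2, 9, 11, 5, 6, 27])   # n = 6
print(median_odd, median_even)
```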

When a data set has extremely small or extremely large
observations, the sample mean is drawn toward the
direction of the extreme measurements.

• If a distribution is skewed to the right, the mean shifts to
the right and the mean is greater than the median.
• If a distribution is skewed to the left, the mean shifts to
the left and the mean is less than the median.
• When a distribution is symmetric, the mean and the
median are equal.
• If a distribution is strongly skewed by one or more
extreme values, you should use the median rather than
the mean as a measure of center.
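
A quick way to see this effect is to reuse the two data sets from the median examples, which differ only by the large value 27. The sketch below assumes Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

base = [2, 9, 11, 5, 6]
with_extreme = base + [27]  # one unusually large observation added

for label, data in [("without 27", base), ("with 27", with_extreme)]:
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data):.2f}")

# How far each measure of center moves when the extreme value is added.
mean_shift = mean(with_extreme) - mean(base)
median_shift = median(with_extreme) - median(base)
print(f"mean shifts by {mean_shift:.2f}, median shifts by {median_shift:.2f}")
```

The mean moves further to the right than the median does, which is the behavior described for right-skewed data.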
Example
The mode is generally used to describe large data sets,
whereas the mean and median are used for both large and
small data sets.
For the data shown in Table 2.1(a), the mode is 5 visits per week,
occurring 8 times.

Table 2.1(a): Starbucks data
6  7  1  5  6
4  6  4  6  8
6  5  6  3  4
5  5  5  7  6
3  5  7  5  5
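
The mode can be found by tallying how often each value occurs. The following sketch assumes Python's standard `collections.Counter`; the list `visits` simply retypes the 25 table entries row by row.

```python
from collections import Counter

# Starbucks data, entered row by row from the table.
visits = [
    6, 7, 1, 5, 6,
    4, 6, 4, 6, 8,
    6, 5, 6, 3, 4,
    5, 5, 5, 7, 6,
    3, 5, 7, 5, 5,
]

counts = Counter(visits)                      # value -> frequency
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value}, occurring {mode_count} times")
for value in sorted(counts):
    print(f"{value}: {'*' * counts[value]}")  # crude text histogram
```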

For this example, the mode and the modal class, shown in
Figure 2.4(a), are the same: the highest peak in the graph.

Figure 2.4(a): Relative frequency histograms for the Starbucks data

It is possible for a distribution of measurements to have
more than one mode. These modes would appear as “local
peaks” in the relative frequency distribution. For example, if
we were to tabulate the length of fish taken from a lake
during one season, we might get a bimodal distribution,
possibly reflecting a mixture of young and old fish in the
population.

Sometimes bimodal distributions of sizes or weights reflect
a mixture of measurements taken on males and females. In
any case, a set or distribution of measurements may have
more than one mode.

2. Measures of Variability

Measure of Variability: A measure of the data distribution that
describes the spread of the distribution from the center.


Data sets may have the same center but look different
because of the way the numbers spread out from the center.
Look at the two distributions shown in Figure 2.5.

Figure 2.5: Variability or dispersion of data

Both are centered at x = 4, but there is a big difference in
the way the measurements spread out, or vary. The
measurements in Figure 2.5(a) vary from 3 to 5; in
Figure 2.5(b) the measurements vary from 0 to 8.

Variability is a very important characteristic of data.

Definition (Range)
The range, R, of a set of n measurements is the difference between the
largest (maximum) and the smallest (minimum) measurements:

$$R = y_n - y_1.$$

Definition (Variance of a sample)
The variance of a sample measures the average of the squared
deviations of the measurements about their mean:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$


Remark: If we are able to enumerate the whole population, then we
can compute the “population variance”, denoted σ².

Definition (Variance of a population)
The variance of a population of size N measures the average of the
squared deviations of the measurements about their mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2.$$

Definition (Standard Deviation)
The standard deviation of a sample is the positive square root of the
variance. Taking the square root returns the measure to the original
units of measurement:

$$s = \sqrt{s^2}.$$


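Example
For the set of n = 5 sample measurements presented in
Table 2.2, the square of each deviation is recorded in the third
column.

Table 2.2: Computation of $\sum (x_i - \bar{x})^2$, with $\bar{x} = 19/5 = 3.8$
x_i    (x_i − x̄)    (x_i − x̄)²
5        1.2          1.44
7        3.2         10.24
1       −2.8          7.84
2       −1.8          3.24
4        0.2          0.04
Sum 19   0.0         22.80
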
Adding, we obtain

$$\sum (x_i - \bar{x})^2 = 22.80$$

and the sample variance is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{22.80}{4} = 5.70$$

The variance is measured in terms of the square of the
original units of measurement.

For the set of n = 5 sample measurements in Table 2.2, the sample
variance is $s^2 = 5.70$, so the sample standard deviation is
$s = \sqrt{5.70} \approx 2.39$. The more variable the data set is, the larger
the value of s.

If you need to calculate $s^2$ by hand, it is much easier to
use the alternative computing formula given next. This
formula is sometimes called the shortcut method for
calculating $s^2$.

The Computing Formula for Calculating $s^2$

$$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$$

The quantity $\sum x_i$ is the sum of all the measurements, as in
the formula for the sample mean. To find $\sum x_i^2$, you square each
individual measurement and then add the squares together.

The sample standard deviation, s, is the positive square root
of $s^2$.

Example
Calculate the variance and standard deviation for the five
measurements from Table 2.2 (5, 7, 1, 2, 4), reproduced in
Table 2.3. Use the computing formula for $s^2$ and compare your
results with those obtained using the original definition of $s^2$.

Table 2.3: Table for Simplified Calculation of $s^2$ and s
x_i     x_i²
5       25
7       49
1        1
2        4
4       16
Sum 19  95

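The entries in Table 2.3 are the individual measurements, $x_i$, and
their squares, together with their sums. Using the
computing formula for $s^2$, you have

$$s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{95 - \frac{(19)^2}{5}}{4} = \frac{22.80}{4} = 5.70$$

and $s = \sqrt{5.70} \approx 2.39$, as before.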
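
The agreement between the definition and the computing formula can be verified in a few lines. This is a sketch assuming plain Python plus `math.sqrt`; the variable names are illustrative. One caveat worth noting: in floating-point arithmetic the shortcut formula can lose precision when the measurements are large relative to their spread, so it is mainly a convenience for hand calculation.

```python
from math import sqrt

x = [5, 7, 1, 2, 4]          # the five measurements
n = len(x)
x_bar = sum(x) / n

# Definition: sum of squared deviations about the mean, divided by n - 1.
ss_definition = sum((xi - x_bar) ** 2 for xi in x)
var_definition = ss_definition / (n - 1)

# Computing (shortcut) formula: needs only sum(x) and sum(x^2).
sum_x = sum(x)
sum_x_sq = sum(xi ** 2 for xi in x)
var_shortcut = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

s = sqrt(var_shortcut)

print(f"sum x = {sum_x}, sum x^2 = {sum_x_sq}")
print(f"s^2 (definition) = {var_definition:.2f}")
print(f"s^2 (shortcut)   = {var_shortcut:.2f}")
print(f"s = {s:.2f}")
```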

Now that you have learned how to calculate the variance
and standard deviation, remember these points:

• The value of s is always greater than or equal to zero.
• The larger the value of $s^2$ or s, the greater the variability of
the data set.
• If $s^2$ or s is equal to zero, all the measurements must
have the same value.
• In order to measure the variability in the same units as
the original observations, we calculate the standard
deviation $s = \sqrt{s^2}$.

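Example
Population: Torontonians.
Sample: People at a bus stop.
Variable: Number of coffee cups per week.
Data: $x_1 = 3$, $x_2 = 14$, $x_3 = 2$, $x_4 = 5$, $x_5 = 6$. Then
$y_1 = 2$, $y_2 = 3$, $y_3 = 5$, $y_4 = 6$, $y_5 = 14$, n = 5.

$$\bar{x} = \frac{3 + 14 + 2 + 5 + 6}{5} = 6$$

$$M = y_3 = 5$$

$$R = y_5 - y_1 = 14 - 2 = 12$$

$$s^2 = \frac{\sum_{i=1}^{5} (x_i - 6)^2}{4} = \frac{9 + 64 + 16 + 1 + 0}{4} = \frac{90}{4} = 22.5$$

$$s = \sqrt{22.5} \approx 4.74$$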
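
These summary measures can be recomputed from the raw data with Python's standard `statistics` module. The sketch below is illustrative only; `variance` and `stdev` use the sample divisor n − 1, while `pvariance` uses the population divisor N and is included purely for contrast.

```python
import statistics

cups = [3, 14, 2, 5, 6]          # x_1, ..., x_5
y = sorted(cups)                 # ordered version y_1, ..., y_5

x_bar = statistics.mean(cups)
M = statistics.median(cups)
R = y[-1] - y[0]                 # largest minus smallest
s2 = statistics.variance(cups)   # sample variance, divisor n - 1
s = statistics.stdev(cups)       # square root of the sample variance
sigma2 = statistics.pvariance(cups)  # population variance, divisor N

print(f"mean = {x_bar}, median = {M}, range = {R}")
print(f"s^2 = {s2}, s = {s:.2f}")
print(f"population-style variance (divisor N) = {sigma2}")
```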



Significance of standard deviation

Theorem (Tchebysheff)
Given a number k greater than or equal to 1 and a set of n
measurements, at least $1 - 1/k^2$ of the measurements will lie within
k standard deviations of their mean.

• k = 1: At least none of the measurements lie in the interval [µ − σ, µ + σ]
(the bound is 1 − 1/1² = 0, so it gives no information).
• k = 2: At least 3/4 of the measurements lie in the interval [µ − 2σ, µ + 2σ].
• k = 3: At least 8/9 of the measurements lie in the interval [µ − 3σ, µ + 3σ].

Empirical Rule: Given a distribution of measurements that is
approximately mound-shaped:
• The interval [µ − σ, µ + σ] contains approximately 68% of the
measurements.
• The interval [µ − 2σ, µ + 2σ] contains approximately 95% of the
measurements.
• The interval [µ − 3σ, µ + 3σ] contains approximately 99.7% of the
measurements.
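
Both rules can be compared against real counts. The sketch below is an illustration under stated assumptions: it reuses the 25 Starbucks values from the mode example, and it uses the sample mean and sample standard deviation in place of µ and σ. The helper names `fraction_within` and `tchebysheff_bound` are made up for this example.

```python
import statistics

# Starbucks visits per week (the 25 values from the mode example).
visits = [
    6, 7, 1, 5, 6,
    4, 6, 4, 6, 8,
    6, 5, 6, 3, 4,
    5, 5, 5, 7, 6,
    3, 5, 7, 5, 5,
]

x_bar = statistics.mean(visits)
s = statistics.stdev(visits)


def fraction_within(data, center, spread, k):
    """Fraction of data lying within k spreads of the center (inclusive)."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo <= x <= hi for x in data) / len(data)


def tchebysheff_bound(k):
    """Guaranteed minimum fraction within k standard deviations (k >= 1)."""
    if k < 1:
        raise ValueError("Tchebysheff's Theorem requires k >= 1")
    return 1 - 1 / k ** 2


empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}  # approximate, mound-shaped data
observed = {k: fraction_within(visits, x_bar, s, k) for k in (1, 2, 3)}

for k in (1, 2, 3):
    print(f"k={k}: observed {observed[k]:.2f}, "
          f"Tchebysheff >= {tchebysheff_bound(k):.2f}, "
          f"Empirical Rule ~ {empirical_rule[k]}")
```

For every k the observed fraction is at least the Tchebysheff bound, as the theorem guarantees, while the Empirical Rule values are only approximations.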
We use the population mean and standard deviation, μ
and σ, for this example.

Figure 2.8: Illustrating Tchebysheff’s Theorem

Example
The mean and variance of a sample of n = 25
measurements are 75 and 100, respectively. Use
Tchebysheff’s Theorem to describe the measurements.

Solution:
You are given $\bar{x} = 75$ and $s^2 = 100$. The standard deviation is
$s = \sqrt{100} = 10$. The distribution of measurements is centered
at $\bar{x} = 75$, and Tchebysheff’s Theorem states:

• At least 3/4 of the 25 measurements lie in the interval
$\bar{x} \pm 2s = 75 \pm 2(10)$, that is, 55 to 95.
• At least 8/9 of the measurements lie in the interval
$\bar{x} \pm 3s = 75 \pm 3(10)$, that is, 45 to 105.
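
Because n = 25 is known, the guaranteed fractions can also be turned into minimum counts of measurements. The following is a small sketch, assuming Python's standard `math` and `fractions` modules; `Fraction` is used so that the bound 1 − 1/k² is exact before rounding up.

```python
import math
from fractions import Fraction

n = 25
x_bar = 75
variance = 100
s = math.sqrt(variance)  # standard deviation = 10

summary = {}
for k in (2, 3):
    lo, hi = x_bar - k * s, x_bar + k * s
    bound = Fraction(k * k - 1, k * k)   # 1 - 1/k^2, kept exact
    min_count = math.ceil(bound * n)     # smallest whole number of measurements
    summary[k] = (lo, hi, min_count)
    print(f"k={k}: at least {bound} of the data, i.e. at least "
          f"{min_count} of {n} measurements, lie in [{lo:g}, {hi:g}]")
```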

The Empirical Rule
Another rule for describing the variability of a data set does not
work for all data sets, but it does work very well for data that “pile
up” in the familiar mound shape.

Figure 2.9: Mound-shaped distribution

The closer your data distribution is to the mound-shaped
curve, the more accurate the rule will be. We call it the
Empirical Rule.

Figure 2.10: Illustrating the Empirical Rule

Intervals are constructed by measuring distances of one, two,
and three standard deviations on either side of the mean.
The Empirical Rule tells you the approximate percentage of
measurements falling in each of these intervals.

Example
In a study conducted at a manufacturing plant, the length of
time to complete a specified operation is measured for each
of n = 40 workers. The mean and standard deviation are
found to be 12.8 and 1.7, respectively. Describe the sample
data using the Empirical Rule.

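Solution:
To describe the data, calculate these intervals:

$$\bar{x} \pm s = 12.8 \pm 1.7, \text{ or } 11.1 \text{ to } 14.5$$
$$\bar{x} \pm 2s = 12.8 \pm 2(1.7), \text{ or } 9.4 \text{ to } 16.2$$
$$\bar{x} \pm 3s = 12.8 \pm 3(1.7), \text{ or } 7.7 \text{ to } 17.9$$

According to the Empirical Rule, you expect
• Approximately 68% of the measurements will fall in the
interval 11.1 to 14.5.
• Approximately 95% of the measurements will fall in the
interval 9.4 to 16.2.
• Approximately 99.7% of the measurements will fall in the
interval 7.7 to 17.9.
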
If you doubt that the distribution of measurements is
mound-shaped, or if you wish for some other reason to be
conservative, you can apply Tchebysheff’s Theorem and be
absolutely certain of your statements.

Tchebysheff’s Theorem tells you that at least 3/4 of the
measurements fall into the interval from 9.4 to 16.2 and at
least 8/9 into the interval from 7.7 to 17.9.
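
The intervals for this example, together with what each rule says about them, can be tabulated in a short loop. This is a sketch assuming plain Python; the rounding to one decimal place matches the precision of the given mean and standard deviation, and the dictionary names are illustrative.

```python
x_bar, s = 12.8, 1.7   # sample mean and standard deviation from the study

empirical = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}

intervals = {}
for k in (1, 2, 3):
    lo = round(x_bar - k * s, 1)
    hi = round(x_bar + k * s, 1)
    intervals[k] = (lo, hi)
    guaranteed = 1 - 1 / k ** 2   # Tchebysheff lower bound (0 when k = 1)
    print(f"{lo} to {hi}: Empirical Rule {empirical[k]}, "
          f"Tchebysheff at least {guaranteed:.1%}")
```

The same interval gets two statements: an approximate percentage that relies on a mound shape, and a weaker guarantee that holds for any data set.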

The Empirical Rule is a “rule of thumb” that can be used
only when the data tend to be roughly mound-shaped.

Tchebysheff’s Theorem will always work, but it gives a very
conservative estimate of the fraction of measurements
falling in a particular interval. If the data are approximately
mound-shaped, the Empirical Rule will give you a more
accurate estimate of the fraction of measurements falling
within 1, 2, or 3 standard deviations of the mean.

3. Measures of Relative Standing
Sometimes you need to know the position of one
observation relative to others in a data set.

These types of measures are called measures of relative
standing.

z-Scores
The mean and standard deviation of a data set can be used
to calculate a z-score, which measures the distance
between a particular observation x and the mean, measured
in units of standard deviation.

DEFINITION
The sample z-score is a measure of relative standing
defined as

$$z\text{-score} = \frac{x - \bar{x}}{s}$$

Example
Two students are preparing for college admissions by
taking college preparatory exams. One student takes the
SAT test and scores 1440 out of 1600 while the other takes
the ACT test and scores 31 out of 36.

Which student has performed better on the exam?

We can find the means and standard deviations for the SAT
and ACT tests from collegeboard.org and nces.ed.gov in the
following table:

Test   Score   Mean   Standard Deviation
SAT    1440    1002   168
ACT    31      21     5.2

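Then we compare the students by using their respective
z-scores:

SAT: $z = \frac{1440 - 1002}{168} \approx 2.61$
ACT: $z = \frac{31 - 21}{5.2} \approx 1.92$

The student who took the SAT test has the larger z-score, so she
has performed better on her exam than the student who took the ACT test.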
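
The comparison reduces to one formula applied twice. The sketch below assumes plain Python; the function name `z_score` is illustrative, and the means and standard deviations are the ones listed in the table for this example.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


z_sat = z_score(1440, mean=1002, sd=168)
z_act = z_score(31, mean=21, sd=5.2)

print(f"SAT z-score: {z_sat:.2f}")
print(f"ACT z-score: {z_act:.2f}")
print("Higher relative standing:", "SAT" if z_sat > z_act else "ACT")
```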

Percentiles and Quartiles
A percentile is another measure of relative standing, most
often used for large data sets.

DEFINITION
A set of n measurements on the variable x has been
arranged from smallest to largest. The pth percentile is
the value of x that is greater than p% of the
measurements and is less than the remaining (100 − p)%.

Example
Suppose you have been notified that your score of 158 on
the Verbal Graduate Record Examination placed you at the
80th percentile in the distribution of scores.

Where does your score of 158 stand in relation to the
scores of others who took the examination?

Solution:
Scoring at the 80th percentile means that 80% of all the
examination scores were lower than your score and 20%
were higher.

DEFINITION
A set of n measurements on the variable x has been
arranged from smallest to largest. The lower quartile
(first quartile), Q1, is the value of x that is greater than
one-fourth of the measurements and is less than the
remaining three-fourths. The second quartile is the
median. The upper quartile (third quartile), Q3, is the
value of x that is greater than three-fourths of the
measurements and is less than the remaining one-fourth.

Calculating Sample Quartiles
• When the measurements are arranged from smallest to
largest, the lower quartile, Q1, is the value of x in
position .25(n + 1), and the upper quartile, Q3, is the
value of x in position .75(n + 1).
• When .25(n + 1) and .75(n + 1) are not integers, the
quartiles are found by interpolation, using the values in
the two adjacent positions.

Example
Find the lower and upper quartiles for this set of
measurements:

16, 25, 4, 18, 11, 13, 20, 8, 11, 9

Solution:
Rank the n = 10 measurements from smallest to largest:

4, 8, 9, 11, 11, 13, 16, 18, 20, 25

Calculate
Position of Q1 = .25(n + 1) = .25(10 + 1) = 2.75
Position of Q3 = .75(n + 1) = .75(10 + 1) = 8.25

Since these positions are not integers, we take the lower
quartile to be the value 3∕4 of the distance between the
second and third ordered measurements, and we take the
upper quartile to be the value 1∕4 of the distance between
the eighth and ninth ordered measurements.

Therefore,

Q1 = 8 + .75(9 − 8) = 8 + .75 = 8.75


and

Q3 = 18 + .25(20 − 18) = 18 + .5 = 18.5
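
The position-and-interpolation procedure can be written as a short function. This is a sketch under stated assumptions: the helper names `value_at_position` and `quartiles` are made up here, and clamping the position to the range 1..n is an added safeguard for very small samples, not something the lecture specifies. To the best of my knowledge, `statistics.quantiles` with its default method ("exclusive") uses the same (n + 1)-based positions, so it is printed alongside for comparison.

```python
import math
import statistics


def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    n = len(sorted_data)
    position = min(max(position, 1), n)   # assumption: clamp to valid positions
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


def quartiles(data):
    """Q1 and Q3 at positions .25(n + 1) and .75(n + 1) of the ordered data."""
    y = sorted(data)
    n = len(y)
    return value_at_position(y, 0.25 * (n + 1)), value_at_position(y, 0.75 * (n + 1))


data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
q1, q3 = quartiles(data)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")

# Standard-library cross-check (returns Q1, median, Q3).
print(statistics.quantiles(data, n=4))
```

Other software may use different quartile conventions, so results from spreadsheets or numerical libraries can differ slightly from the hand method shown in the lecture.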


DEFINITION
The interquartile range (IQR) for a set of measurements
is the difference between the upper and lower quartiles;
that is, IQR = Q3 − Q1.

How to Calculate Sample Quartiles
1. Arrange the data set in order of magnitude from smallest
to largest.
2. Calculate the quartile positions:
   • Position of Q1: .25(n + 1)
   • Position of Q3: .75(n + 1)
3. If the positions are integers, then Q1 and Q3 are the
values in the ordered data set found in those positions.
4. If the positions in step 2 are not integers, find the two
measurements in positions just above and just below the
calculated position. Calculate the quartile by finding a
value either one-fourth, one-half, or three-fourths of the
way between these two measurements.

4. The Five-Number Summary and the Box Plot

The five-number summary consists of the following
numerical measures:

Min   Q1   Median   Q3   Max

By definition, one-fourth of the measurements in the data
set lie between each of the four adjacent pairs of numbers.

The five-number summary can be used to create a simple
graph called a box plot to visually describe the data
distribution.

From the box plot, you can quickly detect any skewness in
the shape of the distribution and see whether there are any
outliers in the data set.

An outlier may result from transposing digits when recording
a measurement, from incorrectly reading an instrument dial,
from a broken piece of equipment, or from other problems.

Even when there are no recording errors, a data set may
contain one or more measurements that, for one reason or
another, are very different from the others in the set.

These outliers can cause a distortion in commonly used
numerical measures such as $\bar{x}$ and s.

In fact, outliers may themselves contain important
information not shared with the other measurements in the
set.

Therefore, isolating outliers, if they are present, is an
important first step in analyzing a data set. The box plot is
designed exactly for this purpose.

To Construct a Box Plot
• Calculate the median, the upper and lower quartiles, and
the IQR for the data set.
• Draw a horizontal line and mark the scale of
measurement. Form a box just above the horizontal line
with the left and right ends at Q1 and Q3. Draw a vertical
line through the box at the location of the median.

A box plot is shown in Figure 2.16.

The box plot uses the IQR to create imaginary “fences” to
separate outliers from the rest of the data set:

Detecting Outliers: Observations That Are Beyond
• Lower fence: Q1 − 1.5(IQR)
• Upper fence: Q3 + 1.5(IQR)

The upper and lower fences are shown with broken lines in
Figure 2.16, but they are not usually drawn on the box plot.

Any measurement beyond the upper or lower fence is an
outlier; the rest of the measurements, inside the fences, are
not unusual. Finally, the box plot marks the range of the
data set using “whiskers” to connect the smallest and
largest measurements to the box.

To Finish the Box Plot
• Mark any outliers with an asterisk (*) on the graph.
• Draw horizontal lines called “whiskers” from the ends of
the box to the smallest and largest observations that are
not outliers.

Example
As American consumers become more careful about the
foods they eat, food processors try to avoid large amounts
of fat, cholesterol, and sodium in the foods they sell. The
following data are the amounts of sodium per slice (in
milligrams) for each of eight brands of regular American
cheese. Draw a box plot for the data and look for outliers.

340, 300, 520, 340, 320, 290, 260, 330

The n = 8 measurements are ranked from smallest to
largest:

260, 290, 300, 320, 330, 340, 340, 520

The positions of the median, Q1, and Q3 are

Median: .5(n + 1) = .5(9) = 4.5
Q1: .25(n + 1) = .25(9) = 2.25
Q3: .75(n + 1) = .75(9) = 6.75

so that M = (320 + 330)/2 = 325, Q1 = 290 + .25(300 − 290) = 292.5,
and Q3 = 340 + .75(340 − 340) = 340. The interquartile range is calculated as

IQR = Q3 − Q1 = 340 − 292.5 = 47.5

Calculate the upper and lower fences:

Lower fence: 292.5 − 1.5(47.5) = 221.25
Upper fence: 340 + 1.5(47.5) = 411.25

The value x = 520, a brand of cheese containing 520
milligrams of sodium, is the only outlier, lying beyond the
upper fence.

The box plot for the data is shown in Figure 2.17.

The outlier is marked with an asterisk (*). Once the
outlier is excluded, we find that the smallest and largest
measurements are x = 260 and x = 340.

These are the two values that form the whiskers. Since
the value x = 340 is the same as Q3, there is no whisker
on the right side of the box.
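
The numbers behind this box plot can be reproduced without drawing anything. The sketch below assumes Python's standard `statistics` module; as far as I know, `statistics.quantiles` with its default "exclusive" method places the quartiles at the (n + 1)-based positions used in this lecture, and that is treated as an assumption here. The variable names are illustrative.

```python
import statistics

# Sodium per slice (mg) for eight brands of regular American cheese.
sodium = [340, 300, 520, 340, 320, 290, 260, 330]

q1, med, q3 = statistics.quantiles(sodium, n=4)   # Q1, median, Q3
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond a fence are flagged as outliers.
outliers = sorted(x for x in sodium if x < lower_fence or x > upper_fence)

# Whiskers reach the most extreme observations that are not outliers.
inside = [x for x in sodium if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

five_number = (min(sodium), q1, med, q3, max(sodium))

print("five-number summary:", five_number)
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("outliers:", outliers)
print(f"whiskers end at {whisker_low} and {whisker_high}")
```

Because the upper whisker ends at the same value as Q3, it has zero length, which matches the missing right-hand whisker described above.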
