Chapter 01
Introduction to Statistics
Introduction
Statistics is the science of conducting
studies to
collect,
organize,
summarize,
analyze, and
draw conclusions from data.
2
1-1 Descriptive and Inferential
Statistics
A variable is a characteristic or attribute
that can assume different values.
The values that a variable can assume
are called data.
A population consists of all subjects
(human or otherwise) that are studied.
A sample is a subset of the population.
3
1-1 Descriptive and Inferential
Statistics
Descriptive statistics consists of the
collection, organization, summarization,
and presentation of data.
Inferential statistics consists of
generalizing from samples to populations,
performing estimations and hypothesis
tests, determining relationships among
variables, and making predictions.
4
1-2 Variables and Types of Data
Data
  Qualitative – categorical
  Quantitative – numerical, can be ranked
    Discrete – countable (5, 29, 8000, etc.)
    Continuous – can be decimals (2.59, 312.1, etc.)
5
1-2 Recorded Values and
Boundaries
Variable      Recorded Value         Boundaries
Length        15 centimeters (cm)    14.5-15.5 cm
Temperature   86° Fahrenheit (°F)    85.5-86.5 °F
Time          0.43 second (sec)      0.425-0.435 sec
Mass          1.6 grams (g)          1.55-1.65 g
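The boundaries in the table follow one rule: take the recorded value and go half a unit of its last recorded digit in each direction. A minimal Python sketch of that rule is below. The function name `boundaries` and the choice to pass the recorded value as a string (so the number of decimal places is preserved) are assumptions made for illustration.

```python
from decimal import Decimal

def boundaries(recorded):
    """Return (lower, upper) boundaries for a recorded value given as a string."""
    value = Decimal(recorded)
    # One unit in the last recorded digit: 1 for "15", 0.01 for "0.43".
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half = unit / 2
    return value - half, value + half

for recorded in ["15", "86", "0.43", "1.6"]:
    low, high = boundaries(recorded)
    print(f"{recorded}: {low} - {high}")
```

Using `Decimal` rather than `float` keeps values such as 0.425 exact, so the printed boundaries match the table.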
6
1-2 Variables and Types of Data
Levels of Measurement
1. Nominal – categorical (names)
7
1-2 Variables and Types of Data
8
1-3 Data Collection and Sampling
Techniques
Some Sampling Techniques
Random – random number generator
Systematic – every kth subject
Stratified – divide population into “layers”
Cluster – use intact groups
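The four techniques can be sketched in a few lines of Python. This is an illustration only: the population of 100 numbered subjects, the two strata, the ten groups, and all function names are hypothetical, and a fixed seed is used so the run is repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Random: every subject has the same chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Systematic: random starting point, then every kth subject."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified: sample separately inside each layer (stratum)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: pick whole groups and keep every subject in them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]

rng = random.Random(0)                 # fixed seed for repeatability
population = list(range(1, 101))       # hypothetical subjects 1..100
strata = {"first_half": population[:50], "second_half": population[50:]}
clusters = {f"group_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

print(simple_random_sample(population, 5, rng))
print(systematic_sample(population, 20, rng))
print(stratified_sample(strata, 2, rng))
print(cluster_sample(clusters, 2, rng))
```

Note the difference between the last two: stratified sampling takes a few subjects from every layer, while cluster sampling takes every subject from a few groups.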
9
1-4 Observational and
Experimental Studies
In an observational study, the researcher
merely observes and tries to draw conclusions
based on the observations.
In an experimental study, the researcher
manipulates the independent (explanatory)
variable and tries to determine how the
manipulation influences the dependent
(outcome) variable.
A confounding variable influences the
dependent variable but cannot be separated
from the independent variable.
10
1-6 Computers and Calculators
Microsoft Excel
Microsoft Excel with MegaStat
TI-83/84
Minitab
SAS
SPSS
11
Frequency Distributions
and Graphs
12
Organizing Data
Data collected in original form is called
raw data.
A frequency distribution is the
organization of raw data in table form,
using classes and frequencies.
Nominal- or ordinal-level data that can be
placed in categories is organized in
categorical frequency distributions.
13
Categorical Frequency Distribution
Twenty-five army inductees were given a blood
test to determine their blood type.
14
Rules for Classes in Grouped
Frequency Distributions
1. There should be 5-20 classes.
2. The class width should be an odd
number.
3. The classes must be mutually exclusive.
4. The classes must be continuous.
5. The classes must be exhaustive.
6. The classes must be equal in width
(except in open-ended distributions).
18
Constructing a Grouped Frequency
Distribution
The following data represent the record
high temperatures for each of the 50 states.
Construct a grouped frequency distribution
for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
19
Constructing a Grouped Frequency
Distribution
STEP 1 Determine the classes.
Find the class width by dividing the range by
the number of classes, 7, and rounding up.
Range = High – Low
= 134 – 100 = 34
Width = Range / 7 = 34 / 7 ≈ 4.9, rounded up to 5
20
Constructing a Grouped Frequency
Distribution
For convenience, we will choose the lowest
data value, 100, for the first lower class limit.
The subsequent lower class limits are found by
adding the width to the previous lower class limits.
The first upper class limit is one less than the
next lower class limit. The subsequent upper class
limits are found by adding the width to the
previous upper class limits.
Class Limits
100 - 104
105 - 109
110 - 114
115 - 119
120 - 124
125 - 129
130 - 134
21
Constructing a Grouped Frequency
Distribution
The class boundary is midway between an upper
class limit and the subsequent lower class limit.
For example, 104.5 is midway between 104 and 105.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5
105 - 109      104.5 - 109.5
110 - 114      109.5 - 114.5
115 - 119      114.5 - 119.5
120 - 124      119.5 - 124.5
125 - 129      124.5 - 129.5
130 - 134      129.5 - 134.5
22
Constructing a Grouped Frequency
Distribution
STEP 2 Tally the data.
STEP 3 Find the frequencies.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1
23
Constructing a Grouped Frequency
Distribution
STEP 4 Find the cumulative frequencies by
keeping a running total of the frequencies.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2           2
105 - 109      104.5 - 109.5      8           10
110 - 114      109.5 - 114.5      18          28
115 - 119      114.5 - 119.5      13          41
120 - 124      119.5 - 124.5      7           48
125 - 129      124.5 - 129.5      1           49
130 - 134      129.5 - 134.5      1           50
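Steps 1 through 4 can be reproduced with a short Python sketch. It assumes whole-number data (so each upper limit is the next lower limit minus 1 and boundaries lie 0.5 beyond the limits) and rounds the width up with `math.ceil`. The variable names are illustrative, not part of the original procedure.

```python
import math

temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]
num_classes = 7

# STEP 1: range and class width (34 / 7 = 4.86, rounded up to 5).
low, high = min(temps), max(temps)
width = math.ceil((high - low) / num_classes)

rows = []
cumulative = 0
for i in range(num_classes):
    lower = low + i * width
    upper = lower + width - 1              # one less than the next lower limit
    # STEPS 2-3: tally the values that fall inside the class limits.
    freq = sum(lower <= t <= upper for t in temps)
    # STEP 4: running total of the frequencies.
    cumulative += freq
    rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))

for lower, upper, lb, ub, freq, cum in rows:
    print(f"{lower} - {upper}   {lb} - {ub}   {freq:2d}   {cum:2d}")
```

The last cumulative frequency should always equal the number of data values, which is a quick check that the classes are exhaustive and mutually exclusive.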
24
Histograms, Frequency Polygons,
and Ogives
3 Most Common Graphs in Research
1. Histogram
2. Frequency Polygon
3. Cumulative Frequency Polygon (Ogive)
25
Histograms, Frequency Polygons,
and Ogives
The histogram is a graph that
displays the data by using vertical
bars of various heights to represent
the frequencies of the classes.
26
Histograms
Construct a histogram to represent the
data for the record high temperatures for
each of the 50 states.
27
Histograms
Histograms use class boundaries and
frequencies of the classes.
Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1
28
Histograms, Frequency Polygons,
and Ogives
The frequency polygon is a graph that
displays the data by using lines that
connect points plotted for the
frequencies at the class midpoints. The
frequencies are represented by the
heights of the points.
The class midpoints are represented on
the horizontal axis.
30
Frequency Polygons
Construct a frequency polygon to
represent the data for the record high
temperatures for each of the 50 states.
31
Frequency Polygons
Frequency polygons use class midpoints
and frequencies of the classes.
Class Limits   Class Midpoints   Frequency
100 - 104      102               2
105 - 109      107               8
110 - 114      112               18
115 - 119      117               13
120 - 124      122               7
125 - 129      127               1
130 - 134      132               1
32
Frequency Polygons
Frequency polygons use class midpoints
and frequencies of the classes.
A frequency polygon
is anchored on the
x-axis before the first
class and after the
last class.
33
Histograms, Frequency Polygons,
and Ogives
The ogive is a graph that represents
the cumulative frequencies for the
classes in a frequency distribution.
34
Ogives
Construct an ogive to represent the data
for the record high temperatures for each
of the 50 states.
35
Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
100 - 104      99.5 - 104.5       2           2
105 - 109      104.5 - 109.5      8           10
110 - 114      109.5 - 114.5      18          28
115 - 119      114.5 - 119.5      13          41
120 - 124      119.5 - 124.5      7           48
125 - 129      124.5 - 129.5      1           49
130 - 134      129.5 - 134.5      1           50
36
Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Boundaries   Cumulative Frequency
Less than 104.5    2
Less than 109.5    10
Less than 114.5    28
Less than 119.5    41
Less than 124.5    48
Less than 129.5    49
Less than 134.5    50
37
Histograms, Frequency Polygons,
and Ogives
If proportions are used instead of
frequencies, the graphs are called
relative frequency graphs.
42
Frequency Polygons
The following is a frequency distribution of
miles run per week by 20 selected runners.
Class Boundaries   Class Midpoints   Relative Frequency
5.5 - 10.5         8                 0.05
10.5 - 15.5        13                0.10
15.5 - 20.5        18                0.15
20.5 - 25.5        23                0.25
25.5 - 30.5        28                0.20
30.5 - 35.5        33                0.15
35.5 - 40.5        38                0.10
43
Frequency Polygons
Use the class midpoints and the
relative frequencies of the classes.
44
Ogives
The following is a frequency distribution of
miles run per week by 20 selected runners.
Class Boundaries   Frequency   Cumulative Frequency   Cum. Rel. Frequency
5.5 - 10.5         1           1                      1/20 = 0.05
10.5 - 15.5        2           3                      3/20 = 0.15
15.5 - 20.5        3           6                      6/20 = 0.30
20.5 - 25.5        5           11                     11/20 = 0.55
25.5 - 30.5        4           15                     15/20 = 0.75
30.5 - 35.5        3           18                     18/20 = 0.90
35.5 - 40.5        2           20                     20/20 = 1.00
Σf = 20
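The relative and cumulative relative frequencies in the runners example are each frequency (or running total) divided by Σf. A small Python sketch of that arithmetic follows; the variable names are illustrative.

```python
from itertools import accumulate

class_boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
                    (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
frequencies = [1, 2, 3, 5, 4, 3, 2]

total = sum(frequencies)                                  # Σf
midpoints = [(lo + hi) / 2 for lo, hi in class_boundaries]
relative = [f / total for f in frequencies]               # frequency / Σf
cumulative = list(accumulate(frequencies))                # running totals
cum_relative = [c / total for c in cumulative]            # running total / Σf

for (lo, hi), mid, rel, cum, cum_rel in zip(
        class_boundaries, midpoints, relative, cumulative, cum_relative):
    print(f"{lo} - {hi}   {mid:4.0f}   {rel:.2f}   {cum:2d}   {cum_rel:.2f}")
```

The relative frequencies sum to 1, and the last cumulative relative frequency is 1.00, which is why a relative-frequency ogive always ends at 1.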
45
Ogives
Ogives use upper class boundaries and
cumulative frequencies of the classes.
Class Boundaries   Cum. Rel. Frequency
Less than 10.5     0.05
Less than 15.5     0.15
Less than 20.5     0.30
Less than 25.5     0.55
Less than 30.5     0.75
Less than 35.5     0.90
Less than 40.5     1.00
46
Ogives
Use the upper class boundaries and the
cumulative relative frequencies.
47
Shapes of Distributions
48
Other Types of Graphs
Bar Graphs
49
Other Types of Graphs
Time Series Graphs
50
Other Types of Graphs
Pie Graphs
51
Other Types of Graphs
Stem and Leaf Plots
A stem and leaf plot is a data plot that
uses part of a data value as the stem
and part of the data value as the leaf to
form groups or classes.
It has the advantage over a grouped
frequency distribution of retaining the
actual data while showing them in
graphic form.
52
At an outpatient testing center, the
number of cardiograms performed each
day for 20 days is shown. Construct a
stem and leaf plot for the data.
25 31 20 32 13
14 43 2 57 23
36 32 33 32 44
32 52 44 51 45
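One way to build the plot for the cardiogram data is sketched below in Python. It assumes the tens digit is the stem and the units digit is the leaf, so the single-digit value 2 gets stem 0. The variable names are illustrative.

```python
from collections import defaultdict

data = [25, 31, 20, 32, 13,
        14, 43, 2, 57, 23,
        36, 32, 33, 32, 44,
        32, 52, 44, 51, 45]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)   # tens digit -> stem, units digit -> leaf
    stems[stem].append(leaf)

# Print every stem from the smallest to the largest, leaves in ascending order.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

The row for stem 3 is the longest, showing that the 30s are the most common range, yet every original value can still be read back from the plot.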
1.3 Measures of Location
A statistic is a characteristic or measure
obtained by using the data values from a
sample.
A parameter is a characteristic or
measure obtained by using all the data
values for a specific population.
56
Measures of Location
Mean
Median
Mode
Percentile
Quartile
57
Measures of Central Tendency:
Mean
The mean is the quotient of the sum of
the values and the total number of values.
The symbol $\bar{X}$ is used for the sample mean.
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n}$$
For a population, the Greek letter μ (mu)
is used for the mean.
$$\mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum X}{N}$$
58
Example 3-1: Days Off per Year
The data represent the number of days off per
year for a sample of individuals selected from
nine different countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n}$$
$$\bar{X} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} \approx 30.7$$
59
Measures of Central Tendency:
Median
The median is the midpoint of the data
array. The symbol for the median is $\tilde{x}$.
The median will be one of the data values
if there is an odd number of values: the
$\left(\frac{n+1}{2}\right)$th value.
The median will be the average of two
data values if there is an even number
of values: the average of the
$\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th values.
60
Example 3-4: Hotel Rooms
The number of rooms in the seven hotels in
downtown Pittsburgh is 713, 300, 618, 595,
311, 401, and 292. Find the median.
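The position rule for the median can be sketched in Python as follows. The helper name `median_by_position` is hypothetical. It sorts the data into an array first, then picks the middle value for odd n or averages the two middle values for even n.

```python
def median_by_position(values):
    """Median via the position rule: (n+1)/2 for odd n, n/2 and n/2 + 1 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Positions are 1-based, list indexes are 0-based.
        return ordered[(n + 1) // 2 - 1]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

rooms = [713, 300, 618, 595, 311, 401, 292]
print(sorted(rooms))               # [292, 300, 311, 401, 595, 618, 713]
print(median_by_position(rooms))   # 401, the 4th of 7 ordered values
```

With seven hotels, (n + 1)/2 = 4, so the median is the fourth value of the sorted array.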
63
Example 3-9: NFL Signing Bonuses
Find the mode of the signing bonuses of
eight NFL players for a specific year. The
bonuses in millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
64
Example 3-10: Coal Employees in PA
Find the mode for the number of coal employees
per county for 10 selected counties in
southwestern Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
There is no mode.
65
Example 3-11: Licensed Nuclear
Reactors
The data show the number of licensed nuclear
reactors in the United States for a recent 15-year
period. Find the mode.
104 104 104 104 104 107 109 109 109 110
109 111 112 111 109
104 and 109 both occur the most. The data set
is said to be bimodal.
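The three mode examples (one mode, no mode, two modes) can be checked with a short Python sketch. The helper `modes` is a hypothetical name. It treats a data set in which every value occurs exactly once as having no mode, and it expects a non-empty list.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
coal = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]

print(modes(bonuses))    # [10]
print(modes(coal))       # []
print(modes(reactors))   # [104, 109]
```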
67
Properties of the Median
Gives the midpoint
Used when it is necessary to find out
whether the data values fall into the upper
half or lower half of the distribution.
Affected less than the mean by extremely
high or extremely low values.
68
Distributions
69
Percentiles
Percentiles separate the data set into
100 equal groups.
A percentile rank for a datum represents
the percentage of data values below the
datum.
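$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$$
To find the value corresponding to the $p$th percentile of $n$ ordered values, compute the position
$$c = \frac{n \cdot p}{100}$$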
70
Example 3-32: Test Scores
A teacher gives a 20-point test to 10 students.
Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
There are 6 values below 12.
$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\% = \frac{6 + 0.5}{10} \cdot 100\% = 65\%$$
A student whose score was 12 did better than 65% of the class.
71
Example 3-34: Test Scores
A teacher gives a 20-point test to 10 students. Find
the value corresponding to the 25th percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
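$$c = \frac{n \cdot p}{100} = \frac{10 \cdot 25}{100} = 2.5 \approx 3$$
Since c is not a whole number, round it up to 3. The 3rd value, 5, corresponds to the 25th percentile.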
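Both percentile calculations on the test scores can be sketched in Python. The function names are hypothetical. The handling of a whole-number position c (averaging the c-th and (c+1)-th values) is a common textbook convention assumed here, since the slides only show the rounding-up case, and p is assumed to lie strictly between 0 and 100.

```python
import math

def percentile_rank(values, x):
    """Percentile rank of x: ((number of values below x) + 0.5) / n * 100."""
    below = sum(v < x for v in values)
    return (below + 0.5) * 100 / len(values)

def value_at_percentile(values, p):
    """Value at the pth percentile using the position c = n * p / 100."""
    ordered = sorted(values)
    c = len(ordered) * p / 100
    if c == int(c):
        # Assumed convention: average the c-th and (c+1)-th values.
        c = int(c)
        return (ordered[c - 1] + ordered[c]) / 2
    # c is not whole: round up and take that position (1-based).
    return ordered[math.ceil(c) - 1]

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))       # 65.0
print(value_at_percentile(scores, 25))   # 5
```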
72
Quartiles and Interquartile Range
Quartiles separate the data set into 4
equal groups. Q1=P25, Q2=Median, Q3=P75
Q2 = median(Low,High)
Q1 = median(Low,Q2)
Q3 = median(Q2,High)
73
Example 3-36: Quartiles
Find Q1, Q2, and Q3 for the data set.
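15, 13, 6, 5, 12, 50, 22, 18
Sort in ascending order.
5, 6, 12, 13, 15, 18, 22, 50
$$Q_1 = \text{median}(\text{Low}, \text{MD}) = \frac{6 + 12}{2} = 9$$
$$Q_2 = \text{median}(\text{Low}, \text{High}) = \frac{13 + 15}{2} = 14$$
$$Q_3 = \text{median}(\text{MD}, \text{High}) = \frac{18 + 22}{2} = 20$$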
74
1.4 Measures of Variability
Range
Variance
Standard Deviation
Interquartile Range
Outliers
75
Measures of Variation: Range
The range is the difference between the
highest and lowest values in a data set.
R = Highest − Lowest
76
Example 3-18/19: Outdoor Paint
Two experimental brands of outdoor paint are
tested to see how long each will last before
fading. Six cans of each brand constitute a
small population. The results (in months) are
shown. Find the mean and range of each group.
Brand A Brand B
10 35
60 45
50 30
30 35
40 40
20 25
77
Example 3-18: Outdoor Paint
Brand A:
$$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$$
R = 60 − 10 = 50
Brand B:
$$\mu = \frac{\sum X}{N} = \frac{210}{6} = 35$$
R = 45 − 25 = 20
The average for both brands is the same, but the range
for Brand A is much greater than the range for Brand B.
78
Measures of Variation: Variance &
Standard Deviation
The variance is the average of the
squares of the distance each value is
from the mean.
The standard deviation is the square
root of the variance.
The standard deviation is a measure of
how spread out your data are.
79
Measures of Variation:
Variance & Standard Deviation
(Population Theoretical Model)
The population variance is
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
The population standard deviation is
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
80
Example 3-21: Outdoor Paint
Find the variance and standard deviation for the
data set for Brand A paint. 10, 60, 50, 30, 40, 20
Months, X   µ    X − µ   (X − µ)²
10          35   −25     625
60          35   25      625
50          35   15      225
30          35   −5      25
40          35   5       25
20          35   −15     225
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7$$
$$\sigma = \sqrt{\frac{1750}{6}} \approx 17.1$$
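The population formulas can be checked against the paint data with a brief Python sketch. The function name `population_variance` is illustrative. Brand B is included only to show how the same mean can come with a much smaller spread.

```python
import math

def population_variance(values):
    """Average of the squared distances from the population mean."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

for name, months in [("Brand A", brand_a), ("Brand B", brand_b)]:
    variance = population_variance(months)
    std_dev = math.sqrt(variance)          # standard deviation = sqrt(variance)
    print(f"{name}: variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```

For Brand A this prints a variance of 291.7 and a standard deviation of 17.1, matching the worked example.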
81
Measures of Variation:
Variance & Standard Deviation
(Sample Theoretical Model)
The sample variance is
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
The sample standard deviation is
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
82
Measures of Variation:
Variance & Standard Deviation
(Sample Computational Model)
The sample variance is
$$s^2 = \frac{n \sum X^2 - \left(\sum X\right)^2}{n(n - 1)}$$
83
Example 3-23: European Auto Sales
Find the variance and standard deviation for the
amount of European auto sales for a sample of 6
years. The data are in millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
X       X²
11.2    125.44
11.9    141.61
12.0    144.00
12.8    163.84
13.4    179.56
14.3    204.49
ΣX = 75.6   ΣX² = 958.94
$$s^2 = \frac{n \sum X^2 - \left(\sum X\right)^2}{n(n - 1)} = \frac{6(958.94) - (75.6)^2}{6(5)} \approx 1.28$$
$$s = \sqrt{1.28} \approx 1.13$$
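The computational formula can be verified on the auto sales data with a few lines of Python. As a cross-check, the sketch also calls `statistics.variance` from the standard library, which uses the n − 1 definition. The variable names are illustrative.

```python
import math
import statistics

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]

n = len(sales)
sum_x = sum(sales)                        # ΣX  = 75.6
sum_x2 = sum(x * x for x in sales)        # ΣX² = 958.94

# Computational formula: s² = (n ΣX² − (ΣX)²) / (n (n − 1))
s2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s = math.sqrt(s2)

print(round(s2, 2), round(s, 2))                       # 1.28 1.13
print(math.isclose(s2, statistics.variance(sales)))    # True
```

The computational form avoids computing each deviation from the mean, which is convenient by hand. In floating-point code the definitional form is usually the safer choice, because subtracting two large, nearly equal sums can lose precision.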
84
Interquartile Range
85
Example 3-36: Quartiles
Find IQR for the data set.
15, 13, 6, 5, 12, 50, 22, 18
$$Q_1 = \frac{6 + 12}{2} = 9 \qquad Q_3 = \frac{18 + 22}{2} = 20$$
$$\text{IQR} = Q_3 - Q_1 = 20 - 9 = 11$$
86
Outliers
An outlier is an extremely high or low
data value when compared with the rest of
the data values.
A data value less than Q1 – 1.5(IQR) or
greater than Q3 + 1.5(IQR) can be
considered an outlier.
87
Example 3-36: Quartiles
Find the outliers for the data set.
15, 13, 6, 5, 12, 50, 22, 18
Sort in ascending order.
5, 6, 12, 13, 15, 18, 22, 50
$$Q_1 = \frac{6 + 12}{2} = 9 \qquad Q_3 = \frac{18 + 22}{2} = 20$$
$$\text{IQR} = Q_3 - Q_1 = 20 - 9 = 11$$
$$Q_1 - 1.5(\text{IQR}) = 9 - 1.5(11) = -7.5$$
$$Q_3 + 1.5(\text{IQR}) = 20 + 1.5(11) = 36.5$$
50 is greater than 36.5, so 50 is an outlier.
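The outlier rule amounts to two cutoffs, Q1 − 1.5(IQR) and Q3 + 1.5(IQR). A minimal Python sketch follows, taking Q1 = 9 and Q3 = 20 from the worked example as inputs. The function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper cutoffs Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [15, 13, 6, 5, 12, 50, 22, 18]
lower, upper = outlier_fences(9, 20)      # Q1 and Q3 from the example
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)   # -7.5 36.5
print(outliers)       # [50]
```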
88
Exploratory Data Analysis
The Five-Number Summary is
composed of the following numbers:
Low, Q1, Median, Q3, High
The Five-Number Summary can be
graphically represented using a
Boxplot.
89
Procedure Table
Constructing Boxplots
1. Find the five-number summary.
2. Draw a horizontal axis with a scale that includes
the maximum and minimum data values.
3. Draw a box with vertical sides through Q1 and
Q3, and draw a vertical line through the median.
4. Draw a line from the minimum data value to the
left side of the box and a line from the maximum
data value to the right side of the box.
90
Example 3-38: Meteorites
The number of meteorites found in 10 U.S. states
is shown. Construct a boxplot for the data.
89, 47, 164, 296, 30, 215, 138, 78, 48, 39
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
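Step 1 of the boxplot procedure, finding the five-number summary, can be sketched in Python for the meteorite data. The sketch takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data. Leaving out the middle value when n is odd is an assumption, since the slides only show even-sized data sets. The function name is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (Low, Q1, Median, Q3, High) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + n % 2:]   # skip the middle value when n is odd
    return (ordered[0],
            statistics.median(lower_half),
            statistics.median(ordered),
            statistics.median(upper_half),
            ordered[-1])

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
print(five_number_summary(meteorites))   # (30, 47, 83.5, 164, 296)
```

The box of the boxplot would then span 47 to 164 with the median line at 83.5, and the whiskers would reach 30 and 296. Statistical packages often use other quartile conventions, so their values can differ slightly from this hand method.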
91
92