Module 2 MIDTERM StatProb
STATISTICS
Introduction:
Whatever exists at all exists in some amount…and whatever
exists in some amount can be measured.
Edward Thorndike
Learning Objectives:
A. Mean
Arithmetic mean – the only common measure of central tendency in which all values play an equal role; to
determine its value, you need to consider every value in the data set. It is appropriate
for describing the central tendency of interval or ratio data.
The symbol \(\bar{x}\), called “x bar”, is used to represent the mean of a sample and the symbol \(\mu\),
called “mu”, is used to denote the mean of a population.
Properties of Mean
1. A set of data has only one mean.
2. Mean can be applied for interval and ratio data.
3. All values in the data set are included in computing the mean.
4. The mean is very useful in comparing two or more data sets.
5. The mean is affected by extremely small or large values in a data set.
6. The mean is most appropriate for symmetrical data.
Solution:
\[ \bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{n} \]
\[ \bar{x} = \frac{550+420+560+500+700+670+860+480}{8} = \frac{4,740}{8} = 592.50 \]
Example 2: Find the population mean of the ages of 9 middle-management employees of a certain company.
The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.
Solution:
\[ \mu = \frac{53+45+59+48+54+46+51+58+55}{9} = \frac{469}{9} = 52.11 \]
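The two mean computations above can be checked with a few lines of Python. This is only a minimal sketch: the variable names `daily_rates` and `ages` are illustrative and not part of the module, and the data are copied from the two worked solutions.

```python
from statistics import fmean

# Data from the sample-mean solution: daily rates of eight employees (pesos)
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]
# Data from Example 2: ages of nine middle-management employees (a population)
ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]

# Sample mean: x-bar = (sum of x) / n
sample_mean = sum(daily_rates) / len(daily_rates)

# Population mean: mu = (sum of x) / N; fmean performs the same arithmetic
population_mean = fmean(ages)

print(f"{sample_mean:.2f}")      # 592.50
print(f"{population_mean:.2f}")  # 52.11
```

The arithmetic is identical for a sample and a population; only the symbol (\(\bar{x}\) or \(\mu\)) changes.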
A. Median
- Is the midpoint of the data array.
- Is an appropriate measure of central tendency for data that are ordinal or above, but is more
valuable in an ordinal type of data.
Properties of Median
1. The median is unique; there is only one median for a set of data.
2. The median is found by arranging the set of data from lowest to highest (or highest to lowest) and
getting the value of the middle observation.
3. The median is not affected by extremely small or large values.
4. The median can be applied to ordinal, interval, and ratio data.
5. The median is most appropriate for skewed data.
To determine the value of the median for ungrouped data, we need to consider two rules:
1. If n is odd, the median is the middle-ranked value.
2. If n is even, the median is the average of the two middle-ranked values.
\[ \text{Median (rank value)} = \frac{n+1}{2} \]
Example 1: Find the median of the ages of 9 middle-management employees of a certain company. The
ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.
Solution:
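Step 1: Arrange the data in order.
45, 46, 48, 51, 53, 54, 55, 58, 59
With n = 9, the rank value is (9 + 1)/2 = 5, so the median is the 5th value, which is 53.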
Example 2: The daily rates of a sample of eight employees at GMS Inc. are P550, P420, P560, P500, P700,
P670, P860, P480. Find the median daily rate of the employees.
Solution:
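Step 1: Arrange the data in order.
P420, P480, P500, P550, P560, P670, P700, P860
With n = 8, the rank value is (8 + 1)/2 = 4.5, so the median lies between the 4th and 5th values.
Since the middle point falls between P550 and P560, we can determine the median of the
data set by getting the average of the two values.
\[ \text{Median} = \frac{550+560}{2} = \frac{1,110}{2} = 555 \]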
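The odd and even rules for the median can be sketched in Python as follows. The function name `median_ungrouped` is hypothetical and chosen only for illustration; the data come from the two median examples.

```python
def median_ungrouped(values):
    """Median of ungrouped data using the rank rule (n + 1) / 2."""
    ordered = sorted(values)          # Step 1: arrange the data in order
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the rank (n + 1) / 2 is a whole number; take that value
        rank = (n + 1) // 2
        return ordered[rank - 1]      # ranks start at 1, list indexes at 0
    # Even n: the rank ends in .5; average the two middle-ranked values
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

print(median_ungrouped(ages))         # 53
print(median_ungrouped(daily_rates))  # 555.0
```

Python's built-in `statistics.median` applies the same two rules and can be used instead of a hand-written function.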
B. Mode
- is the value in a data set that appears most frequently. Like the median and unlike the mean,
the mode is not affected by extreme values in a data set.
Unimodal – a data set that has only one value that occurs with the greatest frequency.
Bimodal – a data set that has two values with the same greatest frequency; both values are considered
modes.
Multimodal – a data set that has more than two modes.
No mode – a data set in which all values occur with the same frequency.
Properties of Mode
1. The mode is found by locating the most frequently occurring value.
2. The mode is the easiest average to compute.
3. There can be more than one mode or even no mode in any given data set.
4. The mode is not affected by extremely small or large values.
5. Mode can be applied for nominal, ordinal, interval and ratio data.
Example 1: The following data represents the total unit sales for Smartphones from a sample of 10
Communication Centers for the month of August: 15, 17, 10, 12, 13, 10, 14, 10, 8, and 9. Find the mode.
Solution:
The ordered array for these data is 8, 9, 10, 10, 10, 12, 13, 14, 15, 17.
Because 10 appears 3 times, more often than any other value, the mode is 10.
Example 2: An operations manager in charge of a company’s manufacturing keeps track of the number of
LED televisions manufactured in a day. The following data represent the number of LED
televisions manufactured for the past three weeks: 20, 18, 19, 25, 20, 21, 20, 25, 30, 29, 28, 29, 25, 25, 27,
26, 22, and 20. Find the mode of the given data set.
Solution:
The ordered array for these data is 18, 19, 20, 20, 20, 20, 21, 22, 25, 25, 25, 25, 26, 27, 28, 29, 29, 30.
There are two modes 20 and 25, since each of these values occurs four times.
Example 3: Find the mode of the ages of 9 middle-management employees of a certain company. The ages
are 53, 45, 59, 48, 54, 46, 51, 58, and 55.
Solution:
The ordered array for these data is 45, 46, 48, 51, 53, 54, 55, 58, 59.
There is no mode since every value in the data set occurs with the same frequency (once).
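The three mode examples (unimodal, bimodal, and no mode) can be reproduced with a short Python sketch. The helper name `modes` and the variable names are hypothetical. Returning an empty list when every value is equally frequent is a choice made here to match the "no mode" definition above, and it assumes the data set has more than one distinct value.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list when there is no mode."""
    counts = Counter(values)                 # frequency of each value
    highest = max(counts.values())           # greatest frequency
    # "No mode": several distinct values that all share the same frequency
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

phone_sales = [15, 17, 10, 12, 13, 10, 14, 10, 8, 9]
tv_output = [20, 18, 19, 25, 20, 21, 20, 25, 30, 29, 28, 29, 25, 25, 27, 26, 22, 20]
ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]

print(modes(phone_sales))  # [10]       unimodal
print(modes(tv_output))    # [20, 25]   bimodal
print(modes(ages))         # []         no mode
```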
C. Weighted Mean
- Is particularly useful when various classes or groups contribute differently to the total.
- Is found by multiplying each value by its corresponding weight and dividing by the sum of the
weights.
\[ \bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n}{w_1 + w_2 + w_3 + \cdots + w_n} \]
where: \(\bar{x}_w\) = weighted mean
\(w_i\) = corresponding weight
\(x_i\) = the value of any particular observation or measurement
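Example 1: At the Mathematics Department of San Sebastian College there are 18 instructors, 12 assistant
professors, 7 associate professors, and 3 professors. Their monthly salaries are P30,500, P33,700, P38,600,
and P45,000, respectively. What is the weighted mean salary?
Solution:
Let \(w_1 = 18\), \(w_2 = 12\), \(w_3 = 7\), \(w_4 = 3\)
\(x_1 = 30,500\), \(x_2 = 33,700\), \(x_3 = 38,600\), \(x_4 = 45,000\)
\[ \bar{x}_w = \frac{30,500(18) + 33,700(12) + 38,600(7) + 45,000(3)}{18+12+7+3} = \frac{1,358,600}{40} = 33,965 \]
The weighted mean salary is P33,965.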
Solution:
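Let \(w_1 = 3\), \(w_2 = 3\), \(w_3 = 3\), \(w_4 = 2\), \(w_5 = 1\)
\(x_1 = 90\), \(x_2 = 87\), \(x_3 = 88\), \(x_4 = 95\), \(x_5 = 96\)
\[ \bar{x}_w = \frac{90(3) + 87(3) + 88(3) + 95(2) + 96(1)}{3+3+3+2+1} = \frac{1,081}{12} = 90.08 \]
The weighted mean of Riana’s GPA for the first quarter is 90.08.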
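Example 3: A certain subdivision in Laguna consists of 50 homes. The table shows the frequency distribution
of homes with respect to the number of bedrooms each has. Find the mean number of bedrooms for the 50
homes.
No. of Bedrooms    2    3    4    5    6
No. of Homes      13   21   10    4    2
Solution:
The number of bedrooms is the value being averaged, and the number of homes is its weight.
Let \(x_1 = 2\), \(x_2 = 3\), \(x_3 = 4\), \(x_4 = 5\), \(x_5 = 6\)
\(w_1 = 13\), \(w_2 = 21\), \(w_3 = 10\), \(w_4 = 4\), \(w_5 = 2\)
\[ \bar{x}_w = \frac{2(13) + 3(21) + 4(10) + 5(4) + 6(2)}{13+21+10+4+2} = \frac{161}{50} = 3.22 \]
The mean number of bedrooms for the 50 homes is 3.22.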
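The weighted mean formula can be sketched in Python using the data of Examples 1 and 3. The function name `weighted_mean` and the variable names are hypothetical. In Example 3 the sketch treats the number of bedrooms as the value and the number of homes as the weight, which is the usual reading of a frequency table.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(x * w for x, w in zip(values, weights))
    return total / sum(weights)

# Example 1: monthly salary of each rank, weighted by the number of staff
salaries = [30_500, 33_700, 38_600, 45_000]
staff_counts = [18, 12, 7, 3]

# Example 3: number of bedrooms, weighted by the number of homes
bedrooms = [2, 3, 4, 5, 6]
homes = [13, 21, 10, 4, 2]

print(weighted_mean(salaries, staff_counts))        # 33965.0
print(round(weighted_mean(bedrooms, homes), 2))     # 3.22
```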
MEASURES OF DISPERSION
Standard deviation – is a statistical term that provides a good indication of volatility. It measures how widely
values are dispersed from the average.
Dispersion – is the difference between the actual value and the average value.
A. Range
- Is the difference of the highest value and the lowest value in the data set.
Advantages:
a. It is easy to compute.
b. It is easy to understand.
Disadvantages:
a. It can be distorted by a single extreme value (or outlier).
b. Only two values are used in the calculation.
Example 1: The daily rates of a sample of eight employees at GMS Inc. are P550, P420, P560, P500, P700,
P670, P860, P480. Find the range.
Solution:
Step 1: Determine the highest value and lowest value in the data set.
Highest Value (HV) = P860     Lowest Value (LV) = P420
Step 2: Subtract the lowest value from the highest value.
Range = HV − LV = P860 − P420 = P440
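As a quick check of the range for the daily rates, the following minimal Python sketch finds the highest and lowest values and subtracts them. The variable names are illustrative only.

```python
daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]

highest = max(daily_rates)       # HV
lowest = min(daily_rates)        # LV
data_range = highest - lowest    # Range = HV - LV

print(highest, lowest, data_range)  # 860 420 440
```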
Standard deviation
- is calculated as the square root of variance.
- In finance, it is applied to the annual rate of return of an investment to measure the investment’s
volatility.
- Is also known as historical volatility and is used by investors as a gauge for the amount of expected
volatility.
Variance
- Is a mathematical expectation of the average squared deviations from the mean.
Volatility
- Is a measure of risk, so this statistic can help determine the risk an investor might take on when
purchasing a specific security.
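\[ s^2 = \frac{\sum (x-\bar{x})^2}{n-1} \qquad s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} \]
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} \qquad s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}} \]
where:
\(s^2\) = sample variance.
\(s\) = sample standard deviation.
\(x\) = the value of any particular observation or measurement.
\(\bar{x}\) = sample mean.
\(n\) = sample size (number of observations in the sample).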
Example 2: The daily rates of a sample of eight employees at GMS Inc. are P550, P420, P560, P500, P700,
P670, P860, P480. Find the variance and standard deviation.
Solution:
Step 1: Compute the mean of the data set.
\[ \bar{x} = \frac{\sum x}{n} = \frac{550+420+560+500+700+670+860+480}{8} = \frac{4,740}{8} = 592.50 \]
Step 2: Subtract the mean from each value in the data set.
\(x\)      \(x - \bar{x}\)
550      −42.5
420      −172.5
560      −32.5
500      −92.5
700      107.5
670      77.5
860      267.5
480      −112.5
\(\sum x = 4,740\)      \(\sum (x - \bar{x}) = 0\)
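Step 3: Square each deviation \(x - \bar{x}\), then get the sum. For example, \((-42.5)^2 = 1,806.25\).
\(x\)      \(x - \bar{x}\)      \((x - \bar{x})^2\)
550      −42.5       1,806.25
420      −172.5      29,756.25
560      −32.5       1,056.25
500      −92.5       8,556.25
700      107.5       11,556.25
670      77.5        6,006.25
860      267.5       71,556.25
480      −112.5      12,656.25
\(\sum x = 4,740\)      \(\sum (x - \bar{x}) = 0\)      \(\sum (x - \bar{x})^2 = 142,950\)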
Step 4: Solve for variance and the standard deviation. We can also obtain the standard deviation by
simply extracting the square root of the variance.
\[ s^2 = \frac{\sum (x-\bar{x})^2}{n-1} = \frac{142,950}{8-1} = 20,421.43 \]
\[ s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} = \sqrt{\frac{142,950}{8-1}} = \sqrt{20,421.43} = 142.90 \]
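Alternatively, the variance and standard deviation can be obtained with the computational formula.
Step 1: Get the sum of the values in the data set.
\(x\)
550
420
560
500
700
670
860
480
\(\sum x = 4,740\)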
Step 2: Square the values in the data set and get the sum.
\(x\)      \(x^2\)
550      302,500
420      176,400
560      313,600
500      250,000
700      490,000
670      448,900
860      739,600
480      230,400
\(\sum x = 4,740\)      \(\sum x^2 = 2,951,400\)
Step 3: Solve for the values of the variance and standard deviation.
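\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{2,951,400 - \frac{(4,740)^2}{8}}{8-1} = \frac{2,951,400 - 2,808,450}{7} = 20,421.43 \]
\[ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}} = \sqrt{\frac{2,951,400 - \frac{(4,740)^2}{8}}{8-1}} = \sqrt{\frac{2,951,400 - 2,808,450}{7}} = \sqrt{20,421.43} = 142.90 \]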
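Both routes to the sample variance (the deviation formula and the computational formula) can be verified with the Python sketch below. Variable names such as `var_definitional` and `var_computational` are illustrative only. The standard-library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and serve as a cross-check.

```python
from math import sqrt, isclose
from statistics import variance, stdev

daily_rates = [550, 420, 560, 500, 700, 670, 860, 480]
n = len(daily_rates)
mean = sum(daily_rates) / n

# Deviation formula: sum of squared deviations divided by n - 1
sum_sq_dev = sum((x - mean) ** 2 for x in daily_rates)
var_definitional = sum_sq_dev / (n - 1)

# Computational formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)
sum_x = sum(daily_rates)
sum_x2 = sum(x * x for x in daily_rates)
var_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Standard deviation is the square root of the variance
sd = sqrt(var_definitional)

print(sum_sq_dev)                         # 142950.0
print(f"{var_definitional:.2f}")          # 20421.43
print(f"{var_computational:.2f}")         # 20421.43
print(f"{sd:.2f}")                        # 142.90
print(isclose(sd, stdev(daily_rates)))    # True
```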
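For a population, the variance and standard deviation divide by N instead of n − 1:
\[ \sigma^2 = \frac{\sum (x-\mu)^2}{N} \qquad \sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}} \]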
Example 3: The monthly incomes of the five research directors of Recoletos schools are: P55,000, P59,500,
P62,500, P57,000, and P61,000. Find the variance and standard deviation.
Solution:
Step 1: Compute the mean of the data set.
\[ \mu = \frac{\sum x}{N} = \frac{55,000+59,500+62,500+57,000+61,000}{5} = \frac{295,000}{5} = 59,000 \]
Step 2: Subtract the population mean from each value in the data set.
\(x\)         \(x - \mu\)
55,000      −4,000
59,500      500
62,500      3,500
57,000      −2,000
61,000      2,000
Step 3: Get the square of \(x - \mu\), then get the sum.
\(x\)         \(x - \mu\)      \((x - \mu)^2\)
55,000      −4,000      16,000,000
59,500      500         250,000
62,500      3,500       12,250,000
57,000      −2,000      4,000,000
61,000      2,000       4,000,000
\(\sum x = 295,000\)      \(\sum (x - \mu) = 0\)      \(\sum (x - \mu)^2 = 36,500,000\)
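Step 4: Solve for the population variance and population standard deviation.
\[ \sigma^2 = \frac{\sum (x-\mu)^2}{N} = \frac{36,500,000}{5} = 7,300,000 \]
\[ \sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}} = \sqrt{7,300,000} = 2,701.85 \]
Hence, the population variance is 7,300,000 and the population standard deviation is 2,701.85.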
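The population formulas can be checked the same way. In this sketch the variable names are illustrative. `statistics.pvariance` and `statistics.pstdev` are the standard-library counterparts that divide by N rather than n − 1.

```python
from math import sqrt, isclose
from statistics import pvariance, pstdev

incomes = [55_000, 59_500, 62_500, 57_000, 61_000]
N = len(incomes)
mu = sum(incomes) / N                            # population mean

# Population variance: sum of squared deviations divided by N
sum_sq_dev = sum((x - mu) ** 2 for x in incomes)
pop_variance = sum_sq_dev / N
pop_sd = sqrt(pop_variance)

print(mu)                    # 59000.0
print(sum_sq_dev)            # 36500000.0
print(pop_variance)          # 7300000.0
print(f"{pop_sd:.2f}")       # 2701.85
```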
A. Quartiles
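\[ Q_k = \frac{k(N+1)}{4} \]
where: \(Q_k\) = position of the kth quartile in the ordered data.
\(N\) = number of observations in the data set.
\(k\) = quartile location (1, 2, or 3).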
Example 1: Find the first, second, and third quartiles of the ages of 9 middle-management employees of a
certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55.
Solution:
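Step 1: Arrange the data in order.
45, 46, 48, 51, 53, 54, 55, 58, 59
Step 2: Select the first, second, and third quartiles value using Formula 4-14.
\[ Q_1 = \frac{1(N+1)}{4} = \frac{1(9+1)}{4} = \frac{10}{4} = 2.5 \]
\[ Q_2 = \frac{2(N+1)}{4} = \frac{2(9+1)}{4} = \frac{20}{4} = 5 \]
\[ Q_3 = \frac{3(N+1)}{4} = \frac{3(9+1)}{4} = \frac{30}{4} = 7.5 \]
Step 3: Identify the first, second, and third quartiles values in the data set.
The first quartile lies midway between the 2nd and 3rd values: (46 + 48)/2 = 47. The second quartile is the 5th value, 53.
The third quartile lies midway between the 7th and 8th values: (55 + 58)/2 = 56.5.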
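The quartile position rule k(N + 1)/4 can be sketched in Python as shown below. The function name `quartile_n_plus_1` is hypothetical. When the position is fractional, the sketch interpolates linearly between the two neighboring values; this is an assumption that coincides with simple averaging whenever the position ends in .5, as it does for the ages data. It also assumes the position falls inside the data (between 1 and N), which holds for the data sets in this module.

```python
def quartile_n_plus_1(values, k):
    """k-th quartile (k = 1, 2, 3) located at position k(N + 1)/4."""
    ordered = sorted(values)
    position = k * (len(ordered) + 1) / 4     # 1-based, may be fractional
    whole = int(position)                     # rank of the lower neighbor
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower                          # position is an exact rank
    upper = ordered[whole]
    # Assumed rule: linear interpolation between the two neighbors
    return lower + fraction * (upper - lower)

ages = [53, 45, 59, 48, 54, 46, 51, 58, 55]
print([quartile_n_plus_1(ages, k) for k in (1, 2, 3)])  # [47.0, 53, 56.5]
```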
B. z-Score
- is used to know the position of one observation relative to others in a set of data.
- measures the distance between an observation and the mean, measured in units of standard
deviation.
The following formulas show how to compute the z-score for a data value x in a population and in a sample.
\[ z = \frac{x-\mu}{\sigma} \ \text{(for population)} \qquad z = \frac{x-\bar{x}}{s} \ \text{(for sample)} \]
Example 1: The monthly expenditures of a large group of households are normally distributed with a mean
of P48,700 and a standard deviation of P10,400. What are the z-values of monthly expenditures of P59,100 and
P38,300?
Solution:
Let \(\mu = 48,700\), \(\sigma = 10,400\)
Using the z formula, the z-values for the two x values (P59,100 and P38,300) are computed as
follows:
For \(x = 59,100\): \[ z = \frac{x-\mu}{\sigma} = \frac{59,100-48,700}{10,400} = 1.00 \]
For \(x = 38,300\): \[ z = \frac{x-\mu}{\sigma} = \frac{38,300-48,700}{10,400} = -1.00 \]
*The z of 1.00 indicates that a monthly expenditure of P59,100 for households is one standard deviation
above the mean, and a z of −1.00 shows that a P38,300 monthly expenditure is one standard deviation
below the mean. Note that both household monthly expenditures (P59,100 and P38,300) are the same
distance (P10,400) from the mean.
Example 2: A normal curve has a mean of 650 and a standard deviation of 40. An analyst is interested in
a value of 575 and wants to find its equivalent z-score.
Solution:
Given: \(\bar{x} = 650\), \(s = 40\), \(x = 575\)
\[ z = \frac{x-\bar{x}}{s} = \frac{575-650}{40} = -1.875 \]
Example 3: A time study report indicates that an assembly line task should be finished in an average of
5.64 minutes, with a standard deviation of 0.97 minutes. One particular item had a z-score of 1.53. What was
the completion time of this item?
Solution:
Given: \(\bar{x} = 5.64\), \(s = 0.97\), \(z = 1.53\)
Solving the z formula for \(x\) and substituting the given values, we get
\[ z = \frac{x-\bar{x}}{s} \quad\Rightarrow\quad x = \bar{x} + zs = 5.64 + (1.53)(0.97) = 7.12 \]
The completion time of this item was about 7.12 minutes.
Example 4: The salary of junior executives in a large corporation in the Ortigas area is normally distributed with
a standard deviation of P15,600. A cutback is pending, at which time those who earn less than P85,000 will be
discharged. If such a cut represents a z-score of −1.28 of the junior executives, what is the mean salary of
the group of junior executives?
Solution:
Given: \(s = 15,600\), \(x = 85,000\), \(z = -1.28\)
\[ z = \frac{x-\bar{x}}{s} \quad\Rightarrow\quad \bar{x} = x - zs = 85,000 - (-1.28)(15,600) = 85,000 + 19,968 = 104,968 \]
The mean salary of the group of junior executives is P104,968.
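The z formula and its two rearrangements (solving for the value and solving for the mean) can be sketched in Python. The function names `z_score`, `value_from_z`, and `mean_from_z` are hypothetical helpers introduced only for illustration; the inputs come from the z-score examples.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviations."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Rearranged formula: x = mean + z * sd."""
    return mean + z * sd

def mean_from_z(x, z, sd):
    """Rearranged formula: mean = x - z * sd."""
    return x - z * sd

# Household expenditure of P38,300 with mean P48,700 and sd P10,400
print(z_score(38_300, 48_700, 10_400))               # -1.0
# Normal curve with mean 650 and sd 40, value 575
print(z_score(575, 650, 40))                         # -1.875
# Assembly task: mean 5.64 min, sd 0.97 min, z = 1.53
print(round(value_from_z(1.53, 5.64, 0.97), 2))      # 7.12
# Junior executives: cut-off P85,000 at z = -1.28, sd P15,600
print(round(mean_from_z(85_000, -1.28, 15_600)))     # 104968
```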
C. Box-and-Whisker Plot
- Introduced by John Wilder Tukey (1915-2000) in the 1970’s.
- A boxplot (or box-and-whisker plot) is a graph of a data set obtained by drawing a horizontal line
from the minimum data value to the first quartile (\(Q_1\)), drawing a horizontal line from the third quartile (\(Q_3\)) to
the maximum data value, and drawing a box whose vertical sides pass through \(Q_1\) and \(Q_3\), with a
vertical line inside the box passing through the median or second quartile (\(Q_2\)).
[Figure: boxplot marking the lowest value, \(Q_1\), \(Q_2\) (median), \(Q_3\), and the highest value on a scale from 0 to 60]
Example 2: Construct a boxplot for the data set of the ages of 9 middle-management employees of a
certain company. The ages are 53, 45, 59, 48, 54, 46, 51, 58, and 55. What can we say about the
distribution of the data set?
Solution:
Step 1: Determine the 𝑄1, Median, and 𝑄3 of the given data set.
Recall that 𝑄1 = 47, Median= 53, and 𝑄3 = 56.5.
Step 2: Locate the lowest value, 𝑄1, the median, 𝑄3, and the highest value on the scale.
Step 3: Draw a box around 𝑄1 and 𝑄3, draw a vertical line through the median, and connect the upper
and lower values, as shown in Figure 4.11.
[Figure 4.11: boxplot on a scale from 40 to 60, with lowest value 45, \(Q_1 = 47\), median = 53, \(Q_3 = 56.5\), and highest value 59]
The distribution of the data set is negatively skewed, since the median falls to the right of the center of the
box.
Normal curve
- Was developed mathematically in 1733 by Abraham de Moivre (1667-1754) as an approximation
to the binomial distribution.
- Is often called the Gaussian distribution
Figure 4.12: Histogram for the Distribution of Heights of Adult Males in the Philippines
(a) Random sample of 100 males
(b) Sample size increased and class width decreased
(c) Sample size increased and class width decreased further
(d) Normal distribution for the population
A normal distribution can be converted into a standard normal distribution by obtaining the z value.
z value – is the signed distance between a selected value, designated x, and the mean, \(\mu\), divided by the
standard deviation.
- also called the z score, the z statistic, the standard normal deviate, or the standard normal
value.
Standard normal value: \[ z = \frac{x-\mu}{\sigma} \]
where: \(z\) = z value.
\(x\) = the value of any particular observation or measurement.
\(\mu\) = the mean of the distribution.
\(\sigma\) = standard deviation of the distribution.
Example 1: Determine the area under the standard normal distribution curve between 𝑧 = 0 and 𝑧 = 1.85.
Solution:
Draw the figure and represent the area as shown in the figure below.
Since Table A gives the area between 0 and any z value to the right of 0, we only need to look up the z value
in the table. Find 1.8 in the left column and 0.05 in the top row. The value where the column and row meet in
the table is the answer, 0.4678.
[Figure: standard normal curve with the area 0.4678 shaded between z = 0 and z = 1.85]
Hence, the area is 0.4678 or 46.78%.
Example 2: Determine the area under the standard normal distribution curve between 𝑧 = 0 and 𝑧 = −1.15.
Solution:
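The desired area is shown below.
[Figure: standard normal curve with the area 0.3749 shaded between z = −1.15 and z = 0]
Because the curve is symmetric about 0, this area equals the table value for z = 1.15. The area between
z = 0 and z = −1.15, or P(−1.15 < z < 0), is 0.3749. Therefore, the area is 0.3749 or
37.49%.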
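The table look-ups in the two area examples can be cross-checked in Python with `statistics.NormalDist`, whose `cdf` method gives the area to the left of a z value. The helper name `area_between` is hypothetical. A printed table gives the area between 0 and z, so the sketch subtracts two cumulative areas to obtain the same quantity.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z values."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(round(area_between(0, 1.85), 4))    # 0.4678
print(round(area_between(-1.15, 0), 4))   # 0.3749
```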
The z formula is used to gain information about an individual data value when the variable is normally
distributed.
Example 1: The average Pag-ibig salary loan for RFS Pharmacy Inc. employees is P23,000. If the debt is
normally distributed with a standard deviation of P2,500, find the probability that the employee owes less
than P18,500.
Solution:
Step 1: Draw a figure and represent the area.
[Figure: normal curve centered at 23,000 with the area P(x < 18,500) shaded to the left of 18,500]
Step 2: Find the z value for P18,500.
\[ z = \frac{x-\mu}{\sigma} = \frac{18,500-23,000}{2,500} = -1.80 \]
Step 3: Find the appropriate area. The area obtained in the Standardized Normal Distribution Table (kindly
download your own copy from the internet) is 0.4641, which corresponds to the area between z = 0 and z =
−1.80.
\[ P(x < 18,500) = P(z < -1.80) = 0.5000 - P(-1.80 < z < 0) = 0.5000 - 0.4641 = 0.0359 \]
Hence, the probability that an employee owes less than P18,500 in Pag-ibig salary loan is 0.0359 or 3.59%.
Example 2: Consider an investment in the stock market whose return is normally distributed with a mean of
20% and standard deviation of 8%. Determine the probability of earning money.
Solution:
Note that the investment earns money when the return is positive. Thus, we can represent the probability of
earning money as P(x > 0). The z value for a return of 0% is
\[ z = \frac{x-\mu}{\sigma} = \frac{0-20}{8} = -2.50 \]
\[ P(x > 0) = P(z > -2.50) = P(-2.50 < z < 0) + 0.5000 = 0.4938 + 0.5000 = 0.9938 \]
Hence, the probability of earning money is 0.9938 or 99.38%.
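Both applied examples (the salary loan and the stock return) can be verified without a printed table. The sketch below uses `statistics.NormalDist` with the stated means and standard deviations; the variable names are illustrative only.

```python
from statistics import NormalDist

# Pag-ibig salary loan: mean P23,000, standard deviation P2,500
loans = NormalDist(mu=23_000, sigma=2_500)
p_owes_less = loans.cdf(18_500)            # P(x < 18,500)

# Stock return: mean 20%, standard deviation 8%
returns = NormalDist(mu=20, sigma=8)
p_positive_return = 1 - returns.cdf(0)     # P(x > 0)

print(round(p_owes_less, 4))        # 0.0359
print(round(p_positive_return, 4))  # 0.9938
```

Since the mean return of 20% is 2.5 standard deviations above zero, nearly all of the area lies to the right of zero, so a probability close to 1 is expected.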
Regression analysis – is a statistical method used to describe the nature of the relationship between
variables, that is, either positive or negative, linear or nonlinear.
Two types of relationships
a. Simple
- There are 2 variables (a) an independent variable (or explanatory variable or predictor variable)
and, (b) a dependent variable (or response variable).
- Can be positive or negative – (a) a positive relationship exists when both variables increase at the
same time or both decrease at the same time; on the contrary, in (b) a negative relationship,
as one variable increases, the other variable decreases, or vice versa.
b. Multiple
Correlation coefficient – is defined as the covariance divided by the product of the standard deviations of the two variables.
*The value of the correlation coefficient varies between +1 and −1. When the value of the correlation
coefficient is ±1, there is a perfect degree of association between the two variables. As
the value of the correlation coefficient goes closer to zero, the relationship between the two variables becomes
weaker. This information is summarized in the charts below.
Non-linear Correlation
*A test of significance for the coefficient of correlation may be used to find out whether the computed Pearson’s r
could have occurred in a population in which the two variables are not related. The test statistic follows
the t distribution with n − 2 degrees of freedom. The significance is computed using the formula of the t test as
shown below:
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
where: \(t\) = t-test for correlation coefficient
\(r\) = correlation coefficient
\(n\) = number of paired samples
Figure 4.14: Testing the Hypothesis of Correlation Coefficient at 0.05 Significance Level
When the null hypothesis has been rejected for a specific significance level, there are possible relationships
between x and y variables.
Example 1: The owner of a chain of fruit shake stores would like to study the correlation between atmospheric
temperature and sales during the summer season. A random sample of 12 days is selected with the results
given as follows:
Day                    1    2    3    4    5    6    7    8    9   10   11   12
Temperature           79   76   78   84   90   83   93   94   97   85   88   82
Total Sales (Units)  147  143  147  168  206  155  192  211  209  187  200  150
Plot the data on a scatter diagram. Does it appear there is a relationship between atmospheric temperature
and sales? Compute the coefficient of correlation. Determine at the 0.05 significance level whether the
correlation in the population is greater than zero.
Solution:
Step 1: Graph the scatter plot.
[Figure: scatter plot of total sales in units (vertical axis, 0 to 250) against temperature (horizontal axis, 0 to 120)]
\(DF = N - 2 = 12 - 2 = 10\) and \(t = \pm 2.228\)
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} \]
\[ r = \frac{12(183,222) - (1,029)(2,115)}{\sqrt{[12(88,733) - (1,029)^2][12(380,887) - (2,115)^2]}} = \frac{22,329}{\sqrt{[5,955][97,419]}} = 0.9270572554 \approx 0.93 \]
The coefficient of correlation, r = 0.93, between the atmospheric temperature and total sales indicates a very
high positive correlation (very dependable relationship); that is, an increase in atmospheric temperature is
highly associated with an increase in total sales of fruit shake.
Using the rounded value r = 0.93, the test statistic is
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.93\sqrt{12-2}}{\sqrt{1-(0.93)^2}} = 8.00 \]
Since the computed t-value of 8.00 is greater than the tabular value of 2.228 at the 0.05 level of significance,
we would need to reject the null hypothesis.
Step 7: Conclusion.
Since the null hypothesis has been rejected, we can conclude that there is evidence that shows significant
association between the atmospheric temperature and the total sales of fruit shake.
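The correlation coefficient and its t statistic can be recomputed from the raw data with the Python sketch below. Variable names are illustrative only. Note that the sketch keeps r unrounded, so its t statistic is about 7.82 rather than the 8.00 obtained from r rounded to 0.93; both exceed the tabular value of 2.228, so the conclusion is the same.

```python
from math import sqrt

temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]
n = len(temperature)

# Sums required by the computational formula for Pearson's r
sum_x = sum(temperature)                                   # 1,029
sum_y = sum(sales)                                         # 2,115
sum_xy = sum(x * y for x, y in zip(temperature, sales))    # 183,222
sum_x2 = sum(x * x for x in temperature)                   # 88,733
sum_y2 = sum(y * y for y in sales)                         # 380,887

numerator = n * sum_xy - sum_x * sum_y                     # 22,329
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# t test for the correlation coefficient, with n - 2 degrees of freedom
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
critical_t = 2.228   # tabular value used in the example (0.05 level, 10 df)

print(f"{r:.4f}")          # 0.9271
print(f"{t:.2f}")          # 7.82
print(t > critical_t)      # True, so the null hypothesis is rejected
```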
Residual – the difference between an observed and predicted value. For a least-squares regression line, the mean of the residuals is always zero.
Outliers – the points that fall outside the overall pattern of the other points.
Influential scores
- Scores whose removal greatly changes the regression line
Example 2: Referring to Example 1 on atmospheric temperature and sales, determine the
regression equation, plot the regression line, and interpret it.
Solution:
Computation of the Simple Linear Regression Equation
Step 1: Obtain the sum of \(x\), \(y\), \(x^2\), \(y^2\), and \(xy\). (Recall that we already obtained these values.)
\[ b_1 = \frac{12(183,222) - (1,029)(2,115)}{12(88,733) - (1,029)^2} = \frac{2,198,664 - 2,176,335}{1,064,796 - 1,058,841} = \frac{22,329}{5,955} = 3.7496 \]
\[ b_0 = \bar{y} - b_1\bar{x} = 176.25 - 3.7496(85.75) = -145.2782 \]
Step 5: Substitute the slope and intercept in the general simple linear regression equation.
\[ \hat{y} = 3.7496x - 145.2782 \]
[Figure: scatter plot of the temperature and sales data with the fitted regression line]
Thus, the regression equation is \(\hat{y} = 3.7496x - 145.2782\). The \(b_1\) of 3.7496 indicates that for each additional
degree Fahrenheit of temperature, sales are expected to increase by 3.7496 units. The \(b_0\) value of −145.2782
indicates that the intercept with the y-axis is below the origin. A literal interpretation is that if the
temperature in Fahrenheit is zero, a negative 145.2782 units would be sold; this has no practical meaning because
zero lies far outside the observed temperatures (76 to 97).
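The slope and intercept can be recomputed from the raw data with the Python sketch below, which also checks that the residuals sum to zero. Variable names and the helper `predict` are illustrative only. Because the sketch keeps the slope unrounded, its intercept (about −145.2801) differs slightly from the −145.2782 obtained with the slope rounded to 3.7496.

```python
temperature = [79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82]
sales = [147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150]
n = len(temperature)

sum_x = sum(temperature)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(temperature, sales))
sum_x2 = sum(x * x for x in temperature)

# Least-squares slope and intercept
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)          # b0 = y-bar - b1 * x-bar

def predict(x):
    """Predicted sales (units) at temperature x."""
    return b0 + b1 * x

# Residual = observed value - predicted value
residuals = [y - predict(x) for x, y in zip(temperature, sales)]

print(f"{b1:.4f}")                      # 3.7496
print(f"{b0:.4f}")                      # -145.2801
print(abs(sum(residuals)) < 1e-6)       # True: residuals sum to zero
```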
POST - ASSESSMENT
A. Find the mean, median, and mode of the following data: (5pts each)
77 56 47 73 67 84 33 37 49 67
B. At the SM Department Store, there are 10 supervisors, 40 Salesmen, 8 cashiers and 12 baggers. Their
monthly salaries are P25,300, P18,100, P21,500, P14,200, respectively. What is the weighted mean
salary? (5pts)
C. Find the value of the first, second, and third quartiles of the following data: (5pts each)
11, 13, 9, 16, 23, 31, 15, 49, 33, 17 and 52
D. The monthly expenditures of a school department are normally distributed with a mean of P13,800 and
a standard deviation of P3,300. What is the z-value of monthly expenditures of P21,200 and P7,400?
(5pts each)
E. A time study report indicates that an assembly line task should be finished in an average of 8.96 minutes,
with a standard deviation of 0.83 minutes. One particular item had a z-score of 2.12. What was the
completion time of this item? (5pts)
NOTE:
1. SUBMIT YOUR OUTPUT ON TIME, NO OUTPUT=NO ATTENDANCE!!!