STA 2470/SPM 2291: PROBABILITY AND STATISTICS LECTURE NOTES [2023]

MUSEMBI N.S, MSC.

SEPTEMBER 2023

MUSEMBI N.S, MSC. | STA 2470/SPM 2291: PROBABILITY AND STATISTICS LECTURE NOTES [2023]
Table of contents

LESSON FOUR: CORRELATION AND REGRESSION ANALYSIS


Lesson Objectives
Introduction
Correlation
Scatter Diagrams
Positive and Negative Correlation
Assumptions of Correlation
Correlation Coefficient
Pearson Correlation Coefficient
Spearman’s Rank Correlation Coefficient
Regression Analysis
Assumptions of Regression Analysis
Simple Linear Regression
Coefficient of Determination
Exercise


Lesson Objectives
By the end of the lesson learners should be able to:
▶ Interpret scatter diagrams for bivariate data.
▶ Fit the equations of the least squares regression line and use
them to estimate values.
▶ Calculate and interpret the value of the product-moment
correlation coefficient.
▶ Calculate and interpret the value of Spearman’s rank
correlation coefficient.

Introduction

Correlation
Correlation is a statistical measure that indicates the extent to which two
or more variables fluctuate together. A positive correlation indicates the
extent to which those variables increase or decrease in parallel; a negative
correlation indicates the extent to which one variable increases as the
other decreases.

Scatter Diagrams
A scatter diagram is a tool for analyzing relationships between two
variables. One variable is plotted on the horizontal axis and the other is
plotted on the vertical axis. The pattern of their intersecting points can
graphically show relationship patterns.

Positive and Negative Correlation


▶ Correlation is Positive or direct when the values increase together.
▶ Correlation is Negative when one value decreases as the other
increases.

Interpreting a Scatter Plot

Scatter diagrams will generally show one of six possible correlations between the variables:
1. Strong Positive Correlation: The value of Y clearly increases as the value of X increases.
2. Strong Negative Correlation: The value of Y clearly decreases as the value of X increases.
3. Weak Positive Correlation: The value of Y increases slightly as the value of X increases.
4. Weak Negative Correlation: The value of Y decreases slightly as the value of X increases.
5. Complex Correlation: The value of Y seems to be related to the value of X, but the relationship is not easily determined.
6. No Correlation: There is no demonstrated connection between the two variables.

Assumptions of Correlation

▶ The use of correlation relies on some underlying assumptions. The observations are assumed to be independent and to have been randomly selected from the population; the two variables are assumed to be normally distributed; and the data are assumed to be homoscedastic (homogeneous).
▶ Homoscedastic data have the same standard deviation in different groups, whereas heteroscedastic data have different standard deviations in different groups. Correlation also assumes that the relationship between the two variables is linear.
▶ An inspection of a scatterplot can give an impression of whether two variables are related and of the direction of their relationship. On its own, however, it is not sufficient to determine whether there is an association between the two variables. The relationship depicted in the scatterplot needs to be described quantitatively.
▶ Descriptive statistics that express the degree of relation between two variables are called correlation coefficients. Commonly employed correlation coefficients are the Pearson, Kendall rank and Spearman coefficients.

Correlation Coefficient

The correlation coefficient, r, measures the degree of linear association between two paired variables. It takes values from +1 to -1.
▶ If r = +1, we have a perfect positive linear relationship.
▶ If r = -1, we have a perfect negative linear relationship.
▶ If r = 0, there is no linear relationship; the variables are uncorrelated.

Pearson Correlation Coefficient


Bivariate correlation is a measure of the relationship between two
variables; it measures the strength and direction of their
relationship. The strength is the absolute value of the coefficient
and ranges from 0 to 1: the stronger the relationship, the closer the
absolute value is to 1. The sign gives the direction. The
Pearson correlation coefficient is given by:
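
For reference, the standard textbook definition of the product-moment coefficient for n paired observations is sketched below in two equivalent forms (deviation form and computational form). The symbols x_i, y_i, x̄ and ȳ are assumed notation and may differ from the notation used in the lecture.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{n \sum x_i y_i - \sum x_i \sum y_i}
         {\sqrt{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right]
                \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]}}
```

The computational form avoids calculating deviations from the mean and is usually the more convenient one for hand calculation from a table of sums.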

Example

A study is conducted involving 10 students to investigate the
association between their scores on statistics and science tests. Use the data to
calculate the Pearson correlation coefficient.
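
A calculation of this kind can be sketched in Python as follows. This is a minimal illustration using the deviation form of the coefficient; the function name `pearson_r` and the five pairs of scores are hypothetical and are not the data from the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation coefficient for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / sqrt(ss_xx * ss_yy)


# Hypothetical scores for five students (illustration only).
statistics_scores = [1, 2, 3, 4, 5]
science_scores = [2, 4, 5, 4, 5]

r = pearson_r(statistics_scores, science_scores)
print(round(r, 4))  # 0.7746
```

Here SSxy = 6, SSxx = 10 and SSyy = 6, so r = 6/√60 ≈ 0.77, a fairly strong positive linear association.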

Spearman rank correlation


Spearman rank correlation is a non-parametric measure of the degree of
association between two variables. It was developed by Spearman, hence
the name. The Spearman rank correlation makes no assumptions about the
distribution of the data and is the appropriate correlation analysis
when the variables are measured on a scale that is at least ordinal.
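
The usual textbook formula for the Spearman coefficient is sketched below. The symbols r_s, d_i and n are assumed notation: d_i is the difference between the two ranks of the i-th observation and n is the number of pairs. This form is exact only when there are no tied ranks.

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)},
\qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)
```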

Example
Calculate the Spearman rank-order correlation coefficient using
the following data:
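
The ranking procedure can be sketched in Python as follows. This is a minimal illustration, assuming the shortcut formula 1 - 6Σd²/(n(n² - 1)), which is exact only when there are no ties. The helper names `ranks` and `spearman_rho` and the data are hypothetical, not the data from the example.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman coefficient from rank differences (shortcut formula, no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired observations with no tied values (illustration only).
x = [10, 20, 30, 40, 50]
y = [12, 25, 22, 48, 40]

rho = spearman_rho(x, y)
print(round(rho, 2))  # 0.8
```

The ranks of y are 1, 3, 2, 5, 4, so Σd² = 4 and the coefficient is 1 - 24/120 = 0.8. When ties are present, a common alternative is to apply the Pearson formula to the average ranks.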


Exercise
A study is conducted involving 14 infants to investigate the association between gestational age at birth, measured in weeks, and birth
weight, measured in grams.

Use this data to calculate the Pearson correlation coefficient and Spearman's rank correlation coefficient, and comment on your
answers.

REGRESSION ANALYSIS

▶ Regression analysis is one of the most commonly used statistical techniques. It involves identifying and evaluating the relationship between a dependent variable and one or more independent variables, which are also called predictor or explanatory variables.
▶ Linear regression explores relationships that can be readily described by straight lines or their generalization to many dimensions. A surprisingly large number of problems can be solved by linear regression, and even more by transforming the original variables so that the transformed variables are linearly related.
▶ When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis. Multiple regression is used to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.
▶ Independent variables are characteristics that can be measured directly. They are also called predictor or explanatory variables because they are used to predict or explain the behavior of the dependent variable.
▶ The dependent variable is a characteristic whose value depends on the values of the independent variables.

Assumptions of Regression Analysis

The regression model is based on the following assumptions.


▶ The relationship between the independent variable and the dependent variable is linear.
▶ The expected value of the error term is zero.
▶ The variance of the error term is constant for all values of the independent variable (the assumption of homoscedasticity).
▶ There is no autocorrelation.
▶ The independent variable is uncorrelated with the error term.
▶ The error term is normally distributed.
▶ On average, the difference between the observed value (yi) and the predicted value (ŷi) is zero.
▶ On average, the estimated errors and the values of the independent variable are not related to each other.
▶ The squared differences between the observed values and the predicted values are similar.
▶ There is some variation in the independent variable. If there is more than one independent variable in the equation, no two of them should be perfectly correlated.
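
Two of the points above, that the residuals average to zero and are unrelated to the independent variable, hold automatically for a least squares line fitted with an intercept. The sketch below illustrates this numerically; the function name `residuals` and the data are hypothetical.

```python
def residuals(x, y):
    """Residuals (observed minus fitted) from a least squares line fitted to x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Hypothetical data (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

e = residuals(x, y)
mean_residual = sum(e) / len(e)
residual_x_product = sum(a * r for a, r in zip(x, e))

# Both quantities are zero up to floating-point rounding.
print(abs(mean_residual) < 1e-9, abs(residual_x_product) < 1e-9)
```

The remaining assumptions (constant variance, normality, no autocorrelation) are not guaranteed by the fitting method and are typically checked by inspecting plots of the residuals.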

Reliability and Validity

▶ Does the model make intuitive sense? Is the model easy to understand and interpret?
▶ Are all coefficients statistically significant? (p-values less than 0.05)
▶ Are the signs associated with the coefficients as expected?
▶ Does the model predict values that are reasonably close to the actual values?
▶ Is the model sufficiently sound? (High R², low standard error, etc.)

Simple Linear Regression


Simple linear regression is a statistical method that allows us to
summarize and study relationships between two continuous
(quantitative) variables. In a cause and effect relationship, the
independent variable is the cause, and the dependent variable is
the effect. Least squares linear regression is a method for
predicting the value of a dependent variable y, based on the value
of an independent variable x.

▶ One variable, denoted (x), is regarded as the predictor, explanatory, or independent variable.
▶ The other variable, denoted (y), is regarded as the response, outcome, or dependent variable.
▶ Mathematically, the regression model is represented by the following equation:
Yi = β0 + β1 Xi + ϵi
where β0 is the intercept, β1 is the slope and ϵi is the random error term for the i-th observation.
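
The least squares estimates of the two parameters are commonly written as shown below. This is the standard textbook result; the hats and the shorthand SS_xy and SS_xx are assumed notation, chosen to match the SSxy and SSyy that appear in the coefficient of determination later in the lesson.

```latex
\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\qquad \text{where} \quad
SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad
SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The fitted line is then ŷ = β̂0 + β̂1 x, and it always passes through the point (x̄, ȳ).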

Example

A study is conducted involving 10 patients to investigate the relationship
between patients' age and their blood pressure. Use the data below to
obtain the linear regression line (line of best fit).
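
A least squares fit of this kind can be sketched in Python as follows. This is a minimal illustration using slope = SSxy/SSxx and intercept = ȳ - slope·x̄; the function name `fit_line` and the five age and blood pressure pairs are hypothetical and are not the data from the example.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    if ss_xx == 0:
        # The slope is undefined when x shows no variation.
        raise ValueError("the independent variable must show some variation")
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ages (years) and blood pressures (mmHg), illustration only.
ages = [30, 40, 50, 60, 70]
pressures = [115, 122, 131, 138, 144]

intercept, slope = fit_line(ages, pressures)
predicted_at_45 = intercept + slope * 45
print(intercept, slope, predicted_at_45)
```

For these values SSxy = 740 and SSxx = 1000, so the fitted line is roughly ŷ = 93 + 0.74x, and the predicted blood pressure at age 45 is about 126.3.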

Coefficient of Determination

▶ We may ask the question: How good is the regression model? In other words: How well does the independent variable explain the dependent variable in the regression model? The coefficient of determination is one concept that answers this question. In the absence of a regression model, we use ȳ to perform estimation or prediction. Consequently, the error of prediction is the difference between the actual observed value and the mean of the observed values. If we calculate such errors in the sample and then square and add them, the resulting sum is called the total sum of squares and is denoted by SST.
▶ The coefficient of determination, denoted by R², represents the proportion of SST that is explained by use of the regression model and is given by:

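$$R^2 = \beta_1 \frac{SS_{xy}}{SS_{yy}} = \frac{SSR}{SST}$$

where $SSR = \sum (\hat{Y} - \bar{Y})^2$, $SST = \sum (Y - \bar{Y})^2$, and $\beta_1$ is the fitted slope.

$$0 \le R^2 \le 1$$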
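
The two expressions for R² can be checked against each other numerically. The sketch below computes SSR/SST from the fitted values and compares it with slope·SSxy/SSyy; the function name `r_squared` and the data are hypothetical.

```python
def r_squared(x, y):
    """Return R^2 computed two ways: SSR/SST and slope * SSxy / SSyy."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)  # SSyy is the same quantity as SST
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * a for a in x]
    ssr = sum((f - mean_y) ** 2 for f in fitted)  # variation explained by the line
    return ssr / ss_yy, slope * ss_xy / ss_yy


# Hypothetical ages (years) and blood pressures (mmHg), illustration only.
ages = [30, 40, 50, 60, 70]
pressures = [115, 122, 131, 138, 144]

r2_from_sums, r2_from_slope = r_squared(ages, pressures)
print(round(r2_from_sums, 4), round(r2_from_slope, 4))  # 0.9956 0.9956
```

Here SSR = 547.6 and SST = 550, so about 99.6% of the variation in the hypothetical blood pressures is explained by age. In simple linear regression R² also equals the square of the Pearson correlation coefficient.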

Example

A study is conducted involving 10 patients to investigate the relationship between patients' age and their blood pressure. Use the
data below to obtain the coefficient of determination for the fitted line of best fit.

Exercise

1. The cetane number is a critical property in specifying the ignition quality of a fuel used in a diesel engine. Determination of
this number for a biodiesel fuel is expensive and time-consuming. A study was conducted that included the following data on x = iodine value
(g) and y = cetane number for a sample of 14 biofuels.

x 132.0 129.0 120.0 113.2 105.0 92.0 84.0 83.2 88.4 59.0 80.0 81.5 71.
y 46.0 48.0 51.0 52.1 54.0 52.0 59.0 58.7 61.6 64.0 61.4 54.6 58.

Use this data to:
i. Fit a simple linear regression line to this data.
ii. Calculate the product-moment correlation coefficient and comment on your answer.
2. The data given below are obtained from student records: Grade Point Average (x) and Graduate Record Exam score (y).
Subject 1 2 3 4 5 6 7 8 9 10
x 8.3 8.6 9.2 9.8 8.0 7.8 9.4 9.0 7.2 8.6
y 2300 2250 2380 2400 2000 2100 2360 2350 2000 2260
Calculate the rank correlation coefficient ‘R’ for the data.
3. An accurate assessment is important in the manufacturing of computer processors. A certain researcher argues that a correctly
manufactured processor is partly determined by the number of grams of a particular component. The table below presents the
number of correctly manufactured processors (y) and the number of grams of the component (x).
x 2.4 3.4 4.6 3.7 2.2 3.3 4.0 2.1
y 1.33 2.12 1.80 1.65 2.00 1.76 2.11 1.63

Use this data to obtain:
i. The product-moment correlation coefficient, and comment on your answer.
ii. The line of best fit.
iii. The coefficient of determination, and comment on your answer.
4. The scores obtained by students in a statistics class in the mid-term and final examination are given below.
Student 1 2 3 4 5 6 7 8 9 10
Mid-term 98 66 100 96 88 45 76 60 74 82
Final 90 74 98 88 80 62 78 74 86 80
Develop a least squares regression line that can be used to predict the final examination scores from the Mid-term score.
