Introduction to Correlation and Regression Analysis
Submitted to
Krishna prasad Aryal
Department of Mathematics
Daisy English Boarding Secondary School
Table of Contents
➢ Introduction
➢ Methodology
➢ Interpretation of data
➢ Calculation
➢ Conclusion
➢ Reference
Introduction to Correlation and Regression Analysis
In this section we will first discuss correlation analysis, which is used to
quantify the association between two continuous variables (e.g.,
between an independent and a dependent variable or between two
independent variables). Regression analysis is a related technique to
assess the relationship between an outcome variable and one or more
risk factors or confounding variables. The outcome variable is also called
the response or dependent variable and the risk factors and
confounders are called the predictors, or explanatory or independent
variables. In regression analysis, the dependent variable is denoted "y"
and the independent variables are denoted by "x".
Correlation Analysis
In correlation analysis, we estimate a sample correlation coefficient, more
specifically the Karl Pearson product-moment correlation coefficient.
The sample correlation coefficient, denoted r, ranges between -1 and +1 and
quantifies the direction and strength of the linear association between the two
variables.
The correlation between two variables can be positive (i.e., higher levels of one
variable are associated with higher levels of the other) or negative (i.e., higher
levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association.
The magnitude of the correlation coefficient indicates the strength of the
association.
For example, a correlation of r = 0.9 suggests a strong, positive association
between two variables, whereas a correlation of r = -0.2 suggests a weak, negative
association. A correlation close to zero suggests no linear association between two
continuous variables.
[Figures: Fig no. 1, Fig no. 2, Fig no. 3, Fig no. 4]
Methods for the Determination of Correlation:
Three methods are commonly used to determine the correlation:
1. Scatter Plot Diagram
2. Karl Pearson Coefficient of Correlation
3. Spearman's Rank-Correlation Coefficient
Of these, Karl Pearson's coefficient of correlation and Spearman's
rank-correlation coefficient are used most often.
Karl Pearson Coefficient of Correlation:
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$$
where,
n = Number of values or elements
∑X = Sum of 1st values list
∑Y = Sum of 2nd values list
∑XY = Sum of the product of 1st and 2nd values
∑X² = Sum of squares of 1st values
∑Y² = Sum of squares of 2nd values
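The raw-sum formula for r can be tried out with a short Python sketch. The function name `pearson_r` and the toy lists below are our own illustrative choices, not part of the project data; they are picked so that the points lie exactly on a straight line.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from the raw sums n, ΣX, ΣY, ΣXY, ΣX², ΣY²."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical toy data, chosen only to show the two extreme cases.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # exact straight line with positive slope
falling = [10, 8, 6, 4, 2]   # exact straight line with negative slope

print(pearson_r(hours, rising))    # very close to +1
print(pearson_r(hours, falling))   # very close to -1
```

Because of floating-point rounding, the printed values may differ from +1 and -1 in the last decimal places.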
Properties of Karl Pearson Coefficient of Correlation:
Karl Pearson's coefficient of correlation, commonly referred to as Pearson's
correlation coefficient or simply Pearson's (r), possesses several important
properties:
1. Range: Pearson's (r) always falls between -1 and 1, inclusive. A value of -1
indicates a perfect negative linear relationship, 0 indicates no linear relationship,
and 1 indicates a perfect positive linear relationship.
2. Linearity: Pearson's (r) measures the strength of a linear relationship between
two variables. It assumes a linear relationship between the variables; therefore, it
may not accurately represent nonlinear relationships.
3. Symmetry: Pearson's (r) is symmetric, meaning that the correlation between
variables (x) and (y) is the same as the correlation between variables (y) and (x).
4. Not affected by scale: Pearson's (r) is not affected by changes in the origin or scale of
measurement of the variables. This means that multiplying all the values of one
variable by a positive constant or adding a constant to all values does not change the
correlation coefficient (multiplying by a negative constant only reverses its sign).
5. Sensitive to outliers: Pearson's (r) can be influenced by outliers in the data.
Outliers can disproportionately affect the correlation coefficient, potentially
leading to misleading interpretations.
6. Affected by range: The correlation coefficient may be affected by the range of
values in the dataset. Limited variability in the data can result in an
underestimation of the true correlation.
7. Sample dependence: The sample size
influences the reliability of Pearson's (r). Generally, larger sample sizes provide
more accurate estimates of the population correlation.
8. Does not imply causation: A high correlation coefficient does not necessarily
imply a causal relationship between the variables. Correlation only measures the
strength and direction of association, not causation.
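Some of these properties (symmetry, independence of origin and scale, and sensitivity to outliers) can be checked numerically. The sketch below is only an illustration: the data are hypothetical and the helper `pearson_r` is our own, written from the deviations about the means.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                             # symmetry: same value
r_scaled = pearson_r([10 * v + 3 for v in x], y)   # change of scale and origin
r_outlier = pearson_r(x + [7], y + [-20])          # one extreme point added

print(r_xy, r_yx, r_scaled, r_outlier)
```

In this example the first three values agree, while the single added outlier is enough to turn a clearly positive correlation into a negative one.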
Regression Analysis
Regression analysis is a method of estimating how an effect (dependent) variable
changes with one or more cause (independent) variables. Correlation measures
the direction and strength of the linear association between two variables,
whereas regression also gives an equation from which the value of one variable can be estimated from the other.
Regression is one of the most widely used data analysis techniques among researchers and
academicians today. A regression equation expresses the linear relationship between
two variables.
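The regression equation of Y on X is
$$Y - \bar{Y} = b_{YX}\,(X - \bar{X}),$$
where
$$b_{YX} = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$
is the regression coefficient of Y on X.
Similarly, the regression equation of X on Y is
$$X - \bar{X} = b_{XY}\,(Y - \bar{Y}),$$
where
$$b_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum Y^2 - (\sum Y)^2}$$
is the regression coefficient of X on Y.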
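As a small illustration of the regression coefficient of Y on X, the sketch below computes the slope and the constant term from the raw sums. The function name `regression_y_on_x` and the data are hypothetical; the points are chosen to lie exactly on the line Y = 2X + 1 so the result is easy to check by hand.

```python
def regression_y_on_x(xs, ys):
    """Return (b_yx, intercept) of the regression line of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b_yx = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Y - mean_y = b_yx * (X - mean_x)  =>  Y = b_yx * X + (mean_y - b_yx * mean_x)
    intercept = sum_y / n - b_yx * sum_x / n
    return b_yx, intercept

# Hypothetical data lying exactly on Y = 2X + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = regression_y_on_x(xs, ys)
print(slope, intercept)   # prints: 2.0 1.0
```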
Spearman's Rank Correlation Coefficient
Rank correlation is the degree of association between two variables when the
data are arranged in order or ranks. Data that are measured quantitatively
are analysed with Karl Pearson's correlation coefficient, and data that cannot be measured
quantitatively are analysed by assigning ranks. Spearman's rank correlation coefficient is generally denoted by R.
There are 3 cases while calculating R. They are
1) when ranks are given
2) when ranks are not given
3) when ranks are repeated
The formula for calculating Spearman's rank correlation coefficient for case 1 and
case 2 is
$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where,
n = number of items ranked
d = difference between paired ranks (R1 - R2)
R1 = the rank of items with respect to the first variable
R2 = the rank of items with respect to the second variable
For case 3, a correction term is added for each repeated rank:
$$R = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}m_1(m_1^2 - 1) + \frac{1}{12}m_2(m_2^2 - 1) + \cdots\right]}{n(n^2 - 1)}$$
where m1, m2, ... are the numbers of items sharing each repeated rank.
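For the case without repeated ranks, the steps (rank each variable, take the differences d, apply the formula) can be sketched as below. The marks are hypothetical and the helper names are our own; the ranking helper assumes that all values are distinct.

```python
def ranks_descending(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Spearman's R = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks."""
    n = len(xs)
    r1, r2 = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical marks of five students in two tests.
test_a = [40, 35, 50, 20, 45]
test_b = [38, 30, 47, 25, 49]
print(spearman_no_ties(test_a, test_b))
```

Here only two students swap places between the two tests, so ∑d² = 2 and R = 1 - 12/120 = 0.9.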
Methodology
For the statistical analysis, the marks of the second terminal examination of class 10 were
taken from SHANGRILA ENGLISH BOARDING SCHOOL, Rapti-6. The purpose of the observation is
to compare the students' performance in two different subjects
(i.e. Mathematics and Science). Data for about 28 students were collected. The full marks and
pass marks are 75 and 27, respectively. The marks are given below.
CALCULATION:
The data are the theory marks of the students in Mathematics and Science.
Let the marks in Mathematics be X and the marks in Science be Y. R1 and R2
are the ranks of X and Y (the highest mark gets rank 1, and equal marks share
the average rank). The data are tabulated below.
S.N   X    Y    X²     Y²     XY     R1     R2     d = R1 - R2   d²
1     75   70   5625   4900   5250   1      2      -1            1
2     56   71   3136   5041   3976   3      1      2             4
3     50   46   2500   2116   2300   6.5    3      3.5           12.25
4     60   37   3600   1369   2220   2      4      -2            4
5     48   35   2304   1225   1680   9      6      3             9
6     53   27   2809   729    1431   5      12.5   -7.5          56.25
7     27   29   729    841    783    14     10     4             16
8     38   28   1444   784    1064   12     11     1             1
9     55   30   3025   900    1650   4      9      -5            25
10    50   36   2500   1296   1800   6.5    5      1.5           2.25
11    39   31   1521   961    1209   11     8      3             9
12    45   17   2025   289    765    10     14     -4            16
13    30   27   900    729    810    13     12.5   0.5           0.25
14    49   32   2401   1024   1568   8      7      1             1
Totals: ∑X = 675, ∑Y = 516, ∑X² = 34519, ∑Y² = 22204, ∑XY = 26506, ∑d² = 157
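The column totals of the table can be re-checked with a few lines of Python. This is only a verification sketch: the marks are copied from the table, while the helper `average_ranks` is our own and assumes the ranking convention used there (highest mark = rank 1, tied marks share the average rank).

```python
# Marks copied from the table: X = Mathematics, Y = Science.
X = [75, 56, 50, 60, 48, 53, 27, 38, 55, 50, 39, 45, 30, 49]
Y = [70, 71, 46, 37, 35, 27, 29, 28, 30, 36, 31, 17, 27, 32]

def average_ranks(values):
    """Rank 1 = highest mark; tied marks share the average of their positions."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first position occupied by v
        count = ordered.count(v)                # how many times v occurs
        ranks.append(first + (count - 1) / 2)   # average of the tied positions
    return ranks

R1, R2 = average_ranks(X), average_ranks(Y)
totals = {
    "n": len(X),
    "sum_x": sum(X),
    "sum_y": sum(Y),
    "sum_x2": sum(x * x for x in X),
    "sum_y2": sum(y * y for y in Y),
    "sum_xy": sum(x * y for x, y in zip(X, Y)),
    "sum_d2": sum((a - b) ** 2 for a, b in zip(R1, R2)),
}
print(totals)
```

The printed totals should agree with the last row of the table (n = 14, ∑X = 675, ∑Y = 516, ∑X² = 34519, ∑Y² = 22204, ∑XY = 26506, ∑d² = 157).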
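First, for the coefficient of correlation,
n = 14
Now, by the correlation coefficient formula,
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{n\sum X^2 - (\sum X)^2}\,\sqrt{n\sum Y^2 - (\sum Y)^2}}$$
$$r = \frac{14 \times 26506 - 675 \times 516}{\sqrt{14 \times 34519 - 675^2}\,\sqrt{14 \times 22204 - 516^2}} = \frac{22784}{\sqrt{27641}\,\sqrt{44600}}$$
∴ r ≈ 0.649
Now, the regression coefficient of X on Y is
$$b_{XY} = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum Y^2 - (\sum Y)^2} = \frac{14 \times 26506 - 675 \times 516}{14 \times 22204 - 516^2} = \frac{22784}{44600}$$
∴ b_XY = 0.51
and the regression coefficient of Y on X is
$$b_{YX} = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} = \frac{14 \times 26506 - 675 \times 516}{14 \times 34519 - 675^2} = \frac{22784}{27641}$$
∴ b_YX = 0.82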
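The arithmetic in this step can be repeated in Python from the column totals alone. The variable names below are our own; the numbers are the totals from the calculation table for the 14 students.

```python
import math

# Column totals taken from the calculation table (14 students).
n = 14
sum_x, sum_y = 675, 516
sum_x2, sum_y2, sum_xy = 34519, 22204, 26506

numerator = n * sum_xy - sum_x * sum_y   # shared by r, b_XY and b_YX
var_x_term = n * sum_x2 - sum_x ** 2
var_y_term = n * sum_y2 - sum_y ** 2

r = numerator / (math.sqrt(var_x_term) * math.sqrt(var_y_term))
b_xy = numerator / var_y_term            # regression coefficient of X on Y
b_yx = numerator / var_x_term            # regression coefficient of Y on X

print(f"r = {r:.3f}, b_XY = {b_xy:.2f}, b_YX = {b_yx:.2f}")
# prints: r = 0.649, b_XY = 0.51, b_YX = 0.82
```

A handy cross-check is that r² equals the product b_XY × b_YX, since all three share the same numerator.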
For the regression equations,
$$\text{Mean of } X = \bar{X} = \frac{\sum X}{n} = \frac{675}{14} = 48.21$$
$$\text{Mean of } Y = \bar{Y} = \frac{\sum Y}{n} = \frac{516}{14} = 36.86$$
Now, the regression equation of X on Y is
$$X - \bar{X} = b_{XY}\,(Y - \bar{Y})$$
or, X - 48.21 = 0.51(Y - 36.86)
∴ X = 0.51Y + 29.41 is the required regression equation of X on Y.
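A regression equation is mainly useful for estimation. The sketch below, with function names of our own choosing, uses the unrounded means and coefficients to estimate a Mathematics mark from a Science mark (X on Y). It also writes out the line of Y on X, which the calculation above does not state explicitly, so that part should be read as an extension under the same formulas.

```python
# Column totals taken from the calculation table (14 students).
n = 14
sum_x, sum_y = 675, 516
sum_x2, sum_y2, sum_xy = 34519, 22204, 26506

mean_x, mean_y = sum_x / n, sum_y / n
numerator = n * sum_xy - sum_x * sum_y
b_xy = numerator / (n * sum_y2 - sum_y ** 2)   # regression coefficient of X on Y
b_yx = numerator / (n * sum_x2 - sum_x ** 2)   # regression coefficient of Y on X

def estimate_x(y):
    """Estimated Mathematics mark for a given Science mark (line of X on Y)."""
    return mean_x + b_xy * (y - mean_y)

def estimate_y(x):
    """Estimated Science mark for a given Mathematics mark (line of Y on X)."""
    return mean_y + b_yx * (x - mean_x)

print(f"Estimated X when Y = 40: {estimate_x(40):.1f}")
print(f"Estimated Y when X = 50: {estimate_y(50):.1f}")
```

Because the code keeps all decimal places, its estimates can differ slightly in the second decimal from those obtained with the rounded equation X = 0.51Y + 29.41.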
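For Spearman's rank correlation coefficient,
∑d² = 157
n = 14
m1 = 2 (the rank 6.5 is repeated twice in R1)
m2 = 2 (the rank 12.5 is repeated twice in R2)
According to Spearman's rank correlation coefficient formula for repeated ranks,
$$R = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}m_1(m_1^2 - 1) + \frac{1}{12}m_2(m_2^2 - 1)\right]}{n(n^2 - 1)}$$
$$\text{or, } R = 1 - \frac{6\left[157 + \frac{1}{12} \times 2(2^2 - 1) + \frac{1}{12} \times 2(2^2 - 1)\right]}{14(14^2 - 1)} = 1 - \frac{6 \times 158}{2730}$$
∴ R ≈ 0.65
Therefore, Spearman's rank correlation coefficient of the given
data is about 0.65.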
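The repeated-rank formula can be checked with a short Python sketch. The function name `spearman_with_ties` is our own; the inputs are the values used in this calculation (∑d² = 157, n = 14, and two tied pairs, so m1 = m2 = 2).

```python
def spearman_with_ties(sum_d2, n, tie_sizes):
    """Spearman's R with the correction m(m^2 - 1)/12 added for each tied group."""
    correction = sum(m * (m ** 2 - 1) / 12 for m in tie_sizes)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

# sum of d^2 = 157, n = 14, and two tied pairs
# (marks of 50 in X and marks of 27 in Y), so m1 = m2 = 2.
R = spearman_with_ties(157, 14, [2, 2])
print(round(R, 3))   # prints: 0.653
```

Each tied pair adds 2(2² - 1)/12 = 0.5 to ∑d², so the bracket becomes 158 and R = 1 - 948/2730.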
Conclusion
From the above calculation, the observed value of the correlation between
the two variables X (Mathematics) and Y (Science) is about 0.65 (0.649 to
three decimal places), which is moderately close to 1, so we conclude that
the association is positive and moderately strong.
Hence, students who score higher marks in Science tend to score higher
marks in Mathematics as well.
The same result was obtained from the regression analysis (both regression
coefficients are positive) and from Spearman's rank correlation coefficient.
Reference
I took the definitions and formulae used in my project from the
class 11 textbook and from websites.
✓ Foundation of MATHEMATICS class 11
✓ www.Wikipedia.com
✓ www.freedictionary.com
Some sentences are also from our subject teacher (Krishna
prasad Aryal).
signature
Krishna prasad Aryal
Subject teacher