
Linear Regression Analysis

Lecture 1
Variable
Qualitative and Quantitative variables
Association between Quantitative Variables
Correlation
Regression
Measure of association
• In statistics, any measure used to quantify a relationship between two
or more variables is a measure of association.
• Measures of association are used in various fields of research. For
example, in the areas of epidemiology and psychology, measures of
association are frequently used to quantify relationships between
exposures and diseases or behaviors.
• Data may be measured on an interval/ratio scale, an ordinal/rank
scale, or a nominal/categorical scale.
These three scales can be thought of as yielding continuous, integer-valued (ordered), and qualitative data, respectively.
• The method used to determine the strength of an association
depends on the characteristics of the data for each variable.
Pearson’s correlation coefficient

• Pearson’s correlation coefficient, r (ρ denotes the corresponding population parameter), measures the strength of the linear relationship between two variables measured on a continuous scale.
• A typical example of quantifying the association between two quantitative variables (measured on an interval/ratio scale) is the analysis of the relationship between a person’s height and weight.
• Each of these two characteristic variables is measured on a continuous
scale.
• The appropriate measure of association in this situation is Pearson’s correlation coefficient.
Estimate of correlation coefficient
• The correlation ρ is estimated by the sample correlation coefficient r, given in the expression below:
$$
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}}
$$

• Here the sample covariance between the two variables is divided by the square root of the product of the individual variances.
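As a quick illustration, here is a minimal Python sketch of this computational formula; the function name and the height/weight values are invented for the example, not taken from the lecture.

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation r via the computational formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # corrected sum of cross-products
    sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # corrected sum of squares of x
    syy = np.sum(y ** 2) - np.sum(y) ** 2 / n         # corrected sum of squares of y
    return sxy / np.sqrt(sxx * syy)

# Made-up height (cm) and weight (kg) values for six people
height = [160, 165, 170, 175, 180, 185]
weight = [55, 60, 63, 70, 72, 80]
print(round(pearson_r(height, weight), 3))  # near +1: strong positive linear association
```

The same value is returned by np.corrcoef(height, weight)[0, 1], which uses the mathematically equivalent covariance form of the formula.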
Pearson’s product moment correlation
coefficient
• The coefficient r ranges from −1 to +1 inclusive.
• Values of −1 or +1 indicate a perfect linear relationship between the
two variables, whereas a value of 0 indicates no linear relationship.
(Negative values simply indicate the direction of the association,
whereby as one variable increases, the other decreases.)
• Correlation coefficients that differ from 0 but are not −1 or +1
indicate a linear relationship, although not a perfect linear
relationship.
• In practice, ρ (the population correlation coefficient) is estimated by r,
which is the correlation coefficient derived from sample data.
• Although Pearson’s correlation coefficient is a measure of the
strength of an association (specifically the linear relationship), it is not
a measure of the significance of the association.
• The significance of an association is assessed separately by testing the sample correlation coefficient, r, with a t-test that compares the observed r with the value expected under the null hypothesis.
Inferences for Correlation
• Let us consider testing the null hypothesis that there is zero
correlation between two variables Xj and Xk. Mathematically we write
this as shown below:
• H0: ρ = 0 against Ha: ρ ≠ 0
• To test the null hypothesis, we form the test statistic, t as below
$$
t = r\sqrt{\frac{n-2}{1-r^{2}}} \;\sim\; t_{n-2}
$$
• Under the null hypothesis, H0, this test statistic will be approximately
distributed as t with n - 2 degrees of freedom.
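A minimal sketch of this test in Python, using simulated data; the helper function is ours for illustration, not a standard API.

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Test H0: rho = 0 against Ha: rho != 0 via t = r * sqrt((n-2)/(1-r^2))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value from t with n-2 df
    return r, t, p

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)        # data simulated with a positive association
r, t, p = correlation_t_test(x, y)
print(f"r = {r:.3f}, t = {t:.3f}, p = {p:.4f}")
```

scipy.stats.pearsonr(x, y) carries out the same computation, returning r together with its two-sided p-value.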
Regression Analysis
Regression is a set of statistical processes for estimating the relationships between a dependent and one (or
more) independent variable(s).

• Dependent variable: also called the outcome or response.
• Independent variable: also called a predictor, covariate, or explanatory variable.
Linear Regression
The most common form of regression analysis is linear regression in
which one finds the line that most closely fits the data according to a
specific mathematical criterion.
Usage
Regression analysis is primarily used for two conceptually distinct purposes.
1. Prediction or forecasting
2. In some situations, regression analysis can be used to infer causal relationships between the independent
and dependent variables.
Importantly, regressions by themselves only reveal relationships between a dependent variable and a
collection of independent variables in a fixed dataset.
To use regressions for prediction or to infer causal relationships, respectively, a researcher must carefully justify
why existing relationships have predictive power for a new context or why a relationship between two
variables has a causal interpretation. The latter is especially important when researchers hope to estimate
causal relationships using observational data.
• The earliest form of regression was the method of least squares published by Legendre in 1805 and by Gauss
in 1809.
• Legendre and Gauss both applied the method to the problem of determining, from astronomical
observations, the orbits of bodies about the Sun (mostly comets, but also later the then newly discovered
minor planets).
• The term "regression" was coined by Francis Galton in the 19th century to describe a biological
phenomenon that the heights of descendants of tall ancestors tend to regress down towards a normal
average (a phenomenon also known as regression towards the mean.
• For Galton, regression had only a biological meaning, but later Yule and Karl Pearson extended his idea to a
more general statistical context.
Regression Model
In practice, researchers first select a model they would like to estimate and then use their chosen method (e.g.,
ordinary least squares) to estimate the parameters of that model.
Regression models involve the following components:

• The unknown parameters, often denoted as a scalar or vector β.
• The independent variables, which are observed in data and are often denoted as a scalar or vector
Xi (where i denotes a row of data).
• The dependent variable, which is observed in data and often denoted by Yi.
• The error terms, which are not directly observed in data and are often denoted by ei.
Most regression models propose that Yi is a function of Xi and β, with ei representing an additive error term that may stand in for un-modeled determinants of Yi or random statistical noise:

Yi = f(Xi, β) + ei .

The researchers' goal is to estimate the function f that most closely fits the data. To carry out regression
analysis, the form of the function f must be specified.
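To make the four components concrete, here is a small simulation sketch in which f is chosen to be linear; the parameter values and sample size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
beta = np.array([2.0, 0.5])          # unknown parameters (known here only because we simulate)
X = rng.uniform(0.0, 10.0, size=n)   # independent variable, observed in data
e = rng.normal(0.0, 1.0, size=n)     # error terms, not directly observed in practice
Y = beta[0] + beta[1] * X + e        # dependent variable: Y_i = f(X_i, beta) + e_i
```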
Linear Regression
In linear regression, the model specification is that the dependent variable, yi, is a
linear combination of the parameters (but need not be linear in the independent
variables).
For example, in simple linear regression for modelling n data points, there is one
independent variable, Xi, and two parameters, β0 and β1:
Straight line: yi = β0 + β1Xi + εi , i = 1, 2, …, n.
In multiple linear regression, there are several independent variables or functions
of independent variables. Adding a term in Xi² to the preceding regression gives:
Parabola: yi = β0 + β1Xi + β2Xi² + εi , i = 1, 2, …, n.
This is still a linear regression: although the expression on the right-hand side is
quadratic in the independent variable Xi, it is linear in the parameters β0, β1, and β2.
In both cases, εi is an error term and the subscript i indexes a particular
observation.
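A minimal sketch of fitting both models by least squares on simulated data; the parabola is handled by exactly the same linear machinery because the model is linear in β.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 40)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(0.0, 0.5, size=40)

# Straight line uses the design matrix [1, x]; the parabola adds an x^2 column.
X_line = np.column_stack([np.ones_like(x), x])
X_para = np.column_stack([np.ones_like(x), x, x ** 2])

beta_line, *_ = np.linalg.lstsq(X_line, y, rcond=None)
beta_para, *_ = np.linalg.lstsq(X_para, y, rcond=None)
print("line:    ", beta_line)   # [b0, b1]
print("parabola:", beta_para)   # [b0, b1, b2], close to [1.0, 2.0, -0.5]
```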
Given a random sample from the population, we estimate the
population parameters and obtain the sample linear regression model

$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i .
$$
The residual, ei = yi − ŷi, is the difference between the observed value of the
dependent variable and the value predicted by the fitted model above.
One method is to obtain parameter estimates that minimize the sum of
squared residuals, SSR.
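As a sketch of this idea, the standard closed-form solution for the simple linear case is implemented below (the function name is ours); the quantities it returns correspond to the review questions that follow.

```python
import numpy as np

def ols_simple(x, y):
    """Least-squares estimates minimizing the sum of squared residuals (SSR)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)   # e_i = y_i - yhat_i
    ssr = np.sum(resid ** 2)    # sum of squared residuals
    mse = ssr / (len(x) - 2)    # MSE: unbiased estimate of the error variance
    return b0, b1, ssr, mse
```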
• What is the formula for the least square estimates for simple linear
regression?
• What is the estimate of the variance?
• What is MSE?
• What are the assumptions of a simple linear regression model?
