Unit 3

Least squares estimation is a method for estimating coefficients in linear regression by minimizing the sum of squared differences between observed and predicted values. The document explains the process of applying least squares to a dataset, including defining the model, calculating coefficients, and making predictions, while also discussing statistical inference, model adequacy checking, and variable selection techniques. Additionally, it covers Generalized Least Squares (GLS) and Weighted Least Squares (WLS) as extensions of Ordinary Least Squares (OLS) for handling issues like autocorrelation and heteroscedasticity in regression models.


Least squares estimation in regression models

Least squares estimation is a method used to estimate the coefficients of a linear regression
model by minimizing the sum of the squared differences between the observed values and the
values predicted by the model. Let's walk through an example problem to illustrate how it
works.
Suppose you have a dataset that shows the relationship between the number of hours studied
(independent variable, X) and exam scores (dependent variable, Y) for five students:

Student   Hours Studied (X)   Exam Score (Y)
1         2                   50
2         4                   60
3         6                   65
4         8                   70
5         10                  80

You want to find the linear relationship between hours studied and exam score using linear
regression.
Step 1: Define the Linear Model
The linear model for simple linear regression is:

Y = β0 + β1·X + ε

where β0 is the intercept, β1 is the slope, and ε is the random error term.
Step 2: Set Up the Objective Function


The objective of least squares estimation is to find the values of β0 and β1 that minimize the
sum of squared errors (SSE):

SSE = Σ (yi − (β0 + β1·xi))²
Step 3: Calculate the Coefficients

For simple linear regression, the formulas to estimate the coefficients β0 (intercept) and β1
(slope) are:

β1 = (n·Σxiyi − (Σxi)(Σyi)) / (n·Σxi² − (Σxi)²)
β0 = ȳ − β1·x̄

Where n is the number of observations, x̄ = Σxi/n, and ȳ = Σyi/n.

Given Data:
Let's calculate the coefficients using the provided data:
n = 5, Σxi = 30, Σyi = 325, Σxiyi = 2090, Σxi² = 220, x̄ = 6, ȳ = 65
β1 = (5·2090 − 30·325) / (5·220 − 30²) = (10450 − 9750) / (1100 − 900) = 700 / 200 = 3.5
β0 = 65 − 3.5·6 = 44
Step 4: Write the Regression Equation
The estimated regression equation is:

ŷ = 44 + 3.5·x
Step 5: Use the Model to Make Predictions


You can use the regression equation to predict exam scores for any given number of hours
studied. For example, if a student studies for 7 hours:

ŷ = 44 + 3.5·7 = 44 + 24.5 = 68.5

So, the predicted exam score for a student who studies for 7 hours is 68.5.
The least squares estimation method provides a way to find the best-fitting line for your data
by minimizing the sum of the squared differences between observed and predicted values. In
this example, we found that the number of hours studied has a positive effect on exam scores,
with each additional hour of study associated with about 3.5 additional points.
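As an illustration only (not part of the original worked example), the following short Python sketch reproduces these estimates with NumPy, using the closed-form formulas from Step 3:

import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)     # hours studied
y = np.array([50, 60, 65, 70, 80], dtype=float) # exam scores
n = len(x)

# Closed-form least squares estimates (same formulas as in Step 3)
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)        # 44.0 3.5
print(b0 + b1 * 7)   # 68.5 -> predicted score for 7 hours of study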
Statistical Inference in Linear Regression
The test for the significance of regression tells you whether your regression model as
a whole is meaningful. A significant result (small p-value) indicates that your model,
including all predictors, explains a significant amount of variation in the dependent variable,
making it a useful tool for prediction and inference.

Tests on Individual Regression Coefficients


In a regression model, you often want to test whether each individual predictor
(independent variable) significantly contributes to the model. This is done using a t-test on
each regression coefficient.
Hypotheses for Individual Coefficient Testing:
• Null Hypothesis (H₀): The coefficient of the predictor is equal to zero (βi = 0).
• Alternative Hypothesis (H₁): The coefficient of the predictor is not equal to zero (βi ≠ 0).
The test statistic is t = β̂i / se(β̂i), the estimated coefficient divided by its standard error.
p-Value for the t-Test:
• If the p-value < 0.05: You reject the null hypothesis, concluding that the predictor Xi
has a significant effect on the dependent variable.
• If the p-value ≥ 0.05: You fail to reject the null hypothesis, suggesting that the
predictor Xi does not have a statistically significant effect on the dependent variable.
Conclusion
There is sufficient evidence to conclude that the number of study hours significantly
affects exam scores. Specifically, for each additional hour studied, the exam score is expected
to increase by approximately 3.5 points.
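To connect this with software output, here is a minimal sketch (assuming the statsmodels package is available) that fits the same model and reports the t statistics and p-values used in these tests:

import numpy as np
import statsmodels.api as sm

hours = np.array([2, 4, 6, 8, 10], dtype=float)
scores = np.array([50, 60, 65, 70, 80], dtype=float)

X = sm.add_constant(hours)            # adds the intercept column
model = sm.OLS(scores, X).fit()

print(model.params)                   # [44.0, 3.5] -> intercept and slope
print(model.tvalues, model.pvalues)   # t statistics and p-values for each coefficient
print(model.summary())                # full table, including the overall F test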

Model adequacy checking


Model adequacy checking in time series refers to evaluating whether the fitted time series
model is appropriate for the data and meets the assumptions required for accurate forecasting.
This process is crucial for ensuring the model's predictions are reliable and that the model
captures all the key patterns in the data, such as trends, seasonality, and autocorrelation.

Residual Analysis
The residuals (the difference between the actual values and the fitted values) are an important
diagnostic tool for checking model adequacy. For a well-fitted model, the residuals should
behave like white noise, meaning they should have:
• Zero mean: The residuals should average out to zero.
• Constant variance: Residuals should have constant variability over time
(homoscedasticity).
• No autocorrelation: Residuals should not exhibit patterns or correlations; they should
be independent over time.
• Normality: Residuals should follow a normal distribution (for models that assume
normally distributed errors).
Diagnostic Plots for Residuals:
• Residual Plot: A plot of residuals over time. A good model shows no systematic
patterns (e.g., trends or clusters).
• ACF/PACF of Residuals: The autocorrelation function (ACF) and partial
autocorrelation function (PACF) plots should show no significant correlations for the
residuals. This indicates that all relevant time-dependent structure has been captured
by the model.
• Histogram or Q-Q Plot: A histogram or Q-Q plot of residuals helps check for
normality (a short plotting sketch follows this list).
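The sketch below (an illustration that uses simulated residuals as a stand-in for those of a fitted model) produces the plots listed above with matplotlib and statsmodels:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(0)
resid = rng.normal(size=100)   # stand-in for the residuals of a fitted model

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
axes[0, 0].plot(resid)
axes[0, 0].set_title("Residuals over time")
plot_acf(resid, ax=axes[0, 1])              # autocorrelation of residuals
plot_pacf(resid, ax=axes[1, 0])             # partial autocorrelation of residuals
sm.qqplot(resid, line="45", ax=axes[1, 1])  # normality check
plt.tight_layout()
plt.show()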
1. Scaled Residuals:
Residuals are the differences between the observed and predicted values of a model. Scaled
residuals refer to residuals that have been normalized or adjusted to account for the
variability of the model, making them easier to interpret, especially when comparing across
different models or when identifying outliers.
Key points to understand scaled residuals:
• Definition: Scaled residuals are computed by dividing the residuals by an estimate of
the standard deviation of the residuals or other model-specific factors.
• Purpose: They help in identifying unusual points that deviate significantly from the
model's predictions. Scaling puts the residuals on a common scale, so large residuals
are easily spotted.
• Formula: for a residual ei, the scaled residual is ei / σ, where σ is the standard deviation of
the residuals (or a related measure).
• Types:
o Standardized Residuals: Divide the residuals by the estimated standard
deviation of the residuals.
o Studentized Residuals: Similar to standardized residuals, but the divisor also
accounts for the leverage hii of the data point being considered: ei / (σ·√(1 − hii)).

2. PRESS (Prediction Error Sum of Squares):


PRESS is a key metric used in regression analysis to assess the predictive performance of
a model, particularly in the context of cross-validation or model validation.
Definition: PRESS measures the sum of squares of prediction errors, where each
observation is left out one at a time, and the model is fitted to the remaining data. The
residual for each left-out observation is calculated and squared.
Purpose: PRESS helps in detecting overfitting and ensuring the generalizability of the
model. If a model has a low PRESS value, it means the model is likely to perform well on
unseen data.

• Formula: PRESS = Σ (yi − ŷ(i))², where yi is the observed value for the i-th observation and
ŷ(i) is the predicted value for the i-th observation when the model is fitted without the i-th
data point.

• Interpretation: A smaller PRESS value indicates a model with better predictive
accuracy. It is particularly useful in assessing how well the model might perform on
new or out-of-sample data.
In summary:
• Scaled residuals provide insights into the quality of individual data points and help in
identifying outliers.
• PRESS is a global measure of model performance, focusing on predictive ability and helping
prevent overfitting.

Let’s consider a simple linear regression model with 5 data points:

x   y
1   2
2   4
3   5
4   7
5   8
PRESS provides an estimate of how well the model generalizes to new data. Lower PRESS
values indicate a better fit, meaning that the model is likely to have good predictive accuracy.
It is particularly useful in detecting overfitting: a model whose residual sum of squares on the
training data is very small but whose PRESS is much larger is likely overfitting.
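As an illustration, the sketch below computes PRESS for this five-point dataset by leaving each observation out in turn and refitting the simple linear regression (np.polyfit is used for the straight-line fit):

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 7, 8], dtype=float)

press = 0.0
for i in range(len(x)):
    mask = np.arange(len(x)) != i             # leave observation i out
    b1, b0 = np.polyfit(x[mask], y[mask], 1)  # slope, intercept of the refitted line
    y_pred = b0 + b1 * x[i]                   # predict the left-out point
    press += (y[i] - y_pred) ** 2

print(press)   # PRESS for this dataset (approximately 0.93)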

Variable selection in regression


Variable selection in regression is a crucial step to improve model accuracy and
interpretability and to prevent overfitting.
1. Forward Selection
Forward selection is a step-by-step approach where variables are added to the model one by
one based on a criterion (e.g., R², p-value, AIC, or BIC). The goal is to find the set of
variables that maximizes the model's performance; a short code sketch at the end of this
section illustrates the procedure.
Example:
Let's assume you have the following dataset for predicting house prices, with potential
predictor variables: size, bedrooms, bathrooms, age, location.
• Step 1: Start with an empty model (no variables).
• Step 2: Test each variable individually and choose the one that improves the model
the most (e.g., gives the highest R²).
Fit model with size: R² = 0.60
Fit model with bedrooms: R² = 0.45
Fit model with bathrooms: R² = 0.40
Fit model with age: R² = 0.20
Fit model with location: R² = 0.55
Select size (highest R²).
Step 3: Add the next variable that improves the model the most when combined with size.
Fit model with size + bedrooms: R² = 0.72
Fit model with size + bathrooms: R² = 0.65
Fit model with size + age: R² = 0.63
Fit model with size + location: R² = 0.80
Select location.
Step 4: Continue adding variables until no further improvement occurs. In this example,
you stop after adding bedrooms if the improvement is negligible afterward.

2. Backward Elimination
Backward elimination starts with all the variables in the model. Variables are removed one
by one based on a significance test (e.g., p-value), AIC, or BIC. The goal is to eliminate
variables that don't contribute much to the model.
Example:
Using the same dataset (size, bedrooms, bathrooms, age, location):
• Step 1: Start with a full model, including all variables: size + bedrooms + bathrooms
+ age + location.
• Step 2: Examine the significance of each variable. Remove the least significant
variable (highest p-value).
Assume p-values are as follows:
size: p-value = 0.02 (significant)
bedrooms: p-value = 0.25 (insignificant)
bathrooms: p-value = 0.10 (insignificant)
age: p-value = 0.03 (significant)
location: p-value = 0.01 (significant)
Remove bedrooms (highest p-value).
Step 3: Refit the model and check p-values again. Continue removing the least significant
variable.
In the next iteration, remove bathrooms if it has the highest p-value.
Step 4: Continue this process until all remaining variables are statistically significant.

3. Stepwise Selection (Bidirectional Elimination)


Stepwise selection is a combination of forward selection and backward elimination. You start
by adding variables like forward selection, but at each step, you also check if any of the
already included variables can be removed. This way, variables are added and removed
iteratively.
Example:
Using the same dataset:
Step 1: Start with an empty model.
Step 2: Add size (as in forward selection, since it has the highest R²).
Step 3: Add location next, as it improves the model the most. Now, your model has size
and location.
Step 4: Check if size is still significant after adding location. If it’s no longer significant,
you remove it. Otherwise, you keep it.
Step 5: Add bedrooms next, but if it doesn't improve the model, remove it (backward
elimination).
Step 6: Continue this process until no variables can be added or removed without
degrading the model's performance.
Comparison:
Forward Selection: Variables are only added to the model.
Backward Elimination: Variables are only removed from the model.
Stepwise Selection: Variables can be both added and removed, allowing more flexibility
in finding the best set of predictors.
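A minimal sketch of forward selection is shown below, assuming the candidate predictors (e.g., size, bedrooms, bathrooms, age, location) are columns of a pandas DataFrame X_df and the target (e.g., price) is a Series y; the names and the min_improvement threshold are illustrative, not part of the original example.

import statsmodels.api as sm

def forward_selection(X_df, y, min_improvement=0.01):
    """Greedily add the predictor that raises R^2 the most at each step."""
    selected = []
    remaining = list(X_df.columns)
    best_r2 = 0.0
    while remaining:
        r2_by_var = {}
        for var in remaining:
            X = sm.add_constant(X_df[selected + [var]])
            r2_by_var[var] = sm.OLS(y, X).fit().rsquared
        best_var = max(r2_by_var, key=r2_by_var.get)
        if r2_by_var[best_var] - best_r2 < min_improvement:
            break                       # no candidate improves R^2 enough; stop
        selected.append(best_var)
        remaining.remove(best_var)
        best_r2 = r2_by_var[best_var]
    return selected

Backward elimination and stepwise selection can be written the same way by removing (or re-checking) variables based on their p-values at each step.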

Generalized Least Squares (GLS)


Generalized Least Squares (GLS) is an extension of Ordinary Least Squares (OLS)
used when the error terms in a time series are correlated or heteroscedastic (i.e., they
have non-constant variance). This commonly occurs in time series data, where errors
from different time periods are often related.
In time series analysis, violations of OLS assumptions (like autocorrelation) leave the OLS
coefficient estimates unbiased but inefficient, and they make the usual standard errors
unreliable. GLS corrects for these issues by transforming the data.

Purpose of GLS in Time Series


The key goal of GLS is to remove correlations in the residuals (autocorrelation) and
adjust for changing variance over time (heteroscedasticity), ensuring that the model's
estimates are efficient and unbiased.
It does this by transforming the variables and error terms in a way that the errors become
independent (no autocorrelation) and have constant variance (homoscedastic).
Key Concepts in GLS for Time Series
Autocorrelation: In time series data, the error term at one time point is often correlated
with errors from previous time points. This happens in many processes where past events
influence future outcomes, such as stock prices or temperature data.
Variance-Covariance Matrix: In GLS, the errors' covariance structure is represented by a
variance-covariance matrix Σ. This matrix captures how errors at different time points are
correlated. The GLS approach adjusts for this structure, ensuring that the errors behave as
required by OLS after transformation.
Transformation Process: The core of GLS involves transforming the original data by
multiplying it by a matrix derived from the variance-covariance matrix of the error terms.
This transformation results in a new dataset where the errors are independent and
homoscedastic, allowing for the use of OLS on this transformed data.
Ordinary Least Squares (OLS) is the most common method used to estimate the parameters
of a linear regression model. It works by finding the line (or hyperplane) that minimizes
the sum of the squared differences (errors) between the observed data points and the
predicted values.
The OLS method minimizes the sum of squared residuals:

SSE = Σ (yi − ŷi)²

Where:
yi are the observed values (e.g., the actual scores),
ŷi are the predicted values from the regression line.

Example:
Imagine you're trying to predict the temperature in different cities based on their latitude.
However, the errors in your model might not have constant variance. For example, in cities
near the equator, the temperature might vary less, while in polar regions, there might be
more variability in temperature due to extreme weather conditions. This violates the
assumption of constant variance in OLS.
GLS accounts for this by adjusting the model to handle different error variances, providing
more accurate parameter estimates.
Steps:
1. You first identify the variance structure of the errors (if they have unequal variances or
are correlated).
2. Then, you transform the model using a weighting matrix that reflects this error
structure.
3. Finally, you apply OLS to the transformed model.
In mathematical terms, the model is:

y = Xβ + ε

Where:
• y is the vector of the dependent variable (e.g., temperature),
• X is the matrix of predictors (e.g., latitude),
• β is the vector of parameters we want to estimate,
• ε is the vector of error terms with non-constant variance or correlation.
If Σ denotes the variance-covariance matrix of ε, the GLS estimator is
β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y, which accounts for the non-constant variance or correlations in ε.
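As an illustration (a sketch with simulated data, not a prescribed recipe), statsmodels can fit GLS directly once a variance-covariance matrix Σ for the errors is supplied; here Σ is assumed to follow an AR(1) pattern Σ[i, j] = ρ^|i − j|:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x = np.linspace(0, 10, n)
X = sm.add_constant(x)

# Simulate AR(1) errors, which violate the OLS independence assumption
rho = 0.7
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal()
y = 2.0 + 0.5 * x + e

# GLS with an assumed AR(1) error covariance structure
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
gls_res = sm.GLS(y, X, sigma=Sigma).fit()
print(gls_res.params)   # estimates of the intercept and slope (true values 2.0 and 0.5)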
Weighted Least Squares (WLS)
Weighted Least Squares (WLS) is a special case of GLS. It is used when the errors have unequal
variances (heteroscedasticity), but there is no correlation between them. In WLS, observations
with smaller variance (i.e., more precise measurements) are given more weight than
observations with larger variance (i.e., noisier measurements).
Example:
Suppose you're measuring the heights of trees in a forest, but your measuring device is less
accurate for very tall trees. Thus, the errors for taller trees are larger than for shorter trees.
If you were to use OLS, it would treat all errors equally. However, with WLS, you can give
more weight to observations with smaller errors (shorter trees) and less weight to those with
larger errors (taller trees), resulting in more reliable parameter estimates.
Steps:
1. Assign weights to each observation based on the inverse of the variance of the errors.
If an observation has more error variance, it gets a smaller weight.
2. Apply OLS to the weighted data.

Mathematically:
The WLS objective function is:

SSEw = Σ wi·(yi − Xiβ)²

Where:
• wi is the weight for each observation, often wi = 1/σi², where σi² is the variance of the
error for observation i,
• yi is the observed value,
• Xi are the predictors.
GLS is more general and can handle both heteroscedasticity and correlated errors.
WLS is a specific case of GLS used when the errors have unequal variances but are
uncorrelated.
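A minimal WLS sketch with simulated heteroscedastic data (the error standard deviation is assumed known here for illustration; in practice it is usually estimated):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x = np.linspace(1, 10, n)
sigma = 0.5 * x                        # error standard deviation grows with x
y = 1.0 + 2.0 * x + rng.normal(0, sigma)

X = sm.add_constant(x)
weights = 1.0 / sigma ** 2             # weight = inverse of the error variance
wls_res = sm.WLS(y, X, weights=weights).fit()
ols_res = sm.OLS(y, X).fit()

print(ols_res.params, ols_res.bse)     # OLS estimates and standard errors
print(wls_res.params, wls_res.bse)     # WLS gives more precise estimates here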
REGRESSION MODELS FOR GENERAL TIME SERIES DATA

The presence of autocorrelation in the errors has several effects on the OLS regression
procedure. These are summarized as follows:
1. The OLS regression coefficients are still unbiased, but they are no longer
minimum-variance estimates.
2. When the errors are positively autocorrelated, the residual mean square may
seriously underestimate the error variance σ². Consequently, the standard errors
of the regression coefficients may be too small. As a result, confidence and
prediction intervals are shorter than they really should be, and tests of
hypotheses on individual regression coefficients may be misleading, in that
they may indicate that one or more predictor variables contribute significantly
to the model when they really do not. Generally, underestimating the error
variance σ² gives the analyst a false impression of precision of estimation and
potential forecast accuracy.
3. The confidence intervals, prediction intervals, and tests of hypotheses based on
the t and F distributions are, strictly speaking, no longer exact procedures.

The Durbin-Watson test is used to detect the presence of autocorrelation (specifically, first-
order autocorrelation) in the residuals of a regression model. It helps determine whether the
residuals are independent from one another, a key assumption in ordinary least squares (OLS)
regression. Here's how it works and how to interpret it:
Durbin-Watson Statistic Formula:
The Durbin-Watson statistic d is calculated as:

d = Σ (et − et−1)² / Σ et²

(the numerator sums over t = 2, …, n and the denominator over t = 1, …, n)

Where:

et is the residual at time t.


n is the number of observations.
Interpreting the Durbin-Watson Statistic:
d=2: No autocorrelation (ideal scenario).
d<2: Positive autocorrelation (residuals are positively correlated).
d>2: Negative autocorrelation (residuals are negatively correlated).
The Durbin-Watson statistic ranges between 0 and 4:
A value near 0 suggests strong positive autocorrelation.
A value near 4 suggests strong negative autocorrelation.
A value around 2 indicates no autocorrelation.
Steps to Perform the Durbin-Watson Test:
1. Fit a Regression Model: First, you must fit a regression model to your time series or
dataset and calculate the residuals.
2. Calculate the Durbin-Watson Statistic: Using the residuals from your regression
model, you can calculate the Durbin-Watson statistic either manually using the formula
above or by using statistical software.
Durbin-Watson Table for Critical Values:
To determine statistical significance, you would compare the Durbin-Watson statistic with
critical values from a Durbin-Watson table. These tables depend on the sample size and the
number of predictors in the model.
However, as a general rule of thumb:
d in the range [1.5, 2.5] suggests little to no autocorrelation.
Values below 1.5 or above 2.5 suggest potential autocorrelation.
Limitations of the Durbin-Watson Test:
The test only detects first-order autocorrelation. If there’s higher-order autocorrelation
(e.g., autocorrelation at lags 2 or 3), the Durbin-Watson test may not capture it.
It requires that the residuals from the regression are from a stationary process.
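For illustration, the sketch below simulates a regression with AR(1) errors and computes the Durbin-Watson statistic with statsmodels:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 100
x = np.arange(n, dtype=float)

# AR(1) errors produce positively autocorrelated residuals
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 3.0 + 0.5 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(res.resid))   # expected to be well below 2 (positive autocorrelation)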
Estimating the Parameters in Time Series Regression Models
Apparent autocorrelation in a time series model can arise when important predictors are
missing from the model. Autocorrelation refers to the correlation of a variable with a lagged
version of itself, and it can indicate that there is some pattern in the data that the model has not
captured. If the source of autocorrelation is due to missing predictors, identifying and
incorporating these predictors can reduce or eliminate the autocorrelation by accounting for the
underlying patterns or trends driving the correlation.
Example: Sales and Weather Impact on Ice Cream Sales
Suppose you are building a model to predict ice cream sales based on past sales data. You fit a
simple linear regression model where the target variable is current sales, and the predictor is
past sales (lagged by one period).
After fitting the model, you find that there is significant autocorrelation in the residuals. This
could suggest that the model is missing some key predictors that explain variations in sales.
One potential missing predictor could be the weather, since ice cream sales are often influenced
by temperature.
Step 1: Residuals Show Autocorrelation
The autocorrelation function (ACF) of the residuals might look like this:

Lag   ACF Value
1     0.6
2     0.4
3     0.2
...   ...

This autocorrelation suggests that the model is not capturing all the important variables.
Step 2: Identifying the Missing Predictor
After investigating, you realize that weather (specifically temperature) plays a significant role
in ice cream sales. Hotter days result in more sales, and this variation was not included in the
original model.
Step 3: Incorporating the Missing Predictor
Now, you include temperature as an additional predictor in the model:

Salest = β0 + β1·Salest−1 + β2·Tempt + εt

Where:
Tempt is the temperature at time t.

Step 4: Reduced Autocorrelation


After fitting the new model, you check the autocorrelation of the residuals again. The ACF
values may now look like this:

Lag   ACF Value
1     0.1
2     0.05
3     0.01
...   ...

By adding the temperature variable, the residuals no longer show significant autocorrelation.
This means that the new model better captures the underlying relationship between the
predictors and the target, resolving the apparent autocorrelation problem.
In this example, the apparent autocorrelation was due to the missing predictor, temperature.
Once the temperature variable was identified and included in the model, the autocorrelation in
the residuals diminished, improving the model's accuracy and interpretation.

Exponential smoothing
Exponential smoothing is a forecasting technique used in time series analysis to smooth data
by applying a weighted average of past observations. The weights decrease exponentially as
observations get older, which gives more importance to recent data points. There are different
types of exponential smoothing methods, with the first-order and second-order being common
ones.
1. First-Order Exponential Smoothing (Simple Exponential Smoothing)
This method is used when there is no trend or seasonality in the data. It calculates the smoothed
value by combining the previous smoothed value and the current observation, with a smoothing
constant (α) that determines the weight assigned to the current observation:

St = α·Xt + (1 − α)·St−1

Where:
St = smoothed value at time t
Xt = observed value at time t
St−1 = smoothed value at time t−1
α = smoothing constant (0 < α < 1)
Example:
Consider a dataset of monthly sales: 50,52,53,49,48. Let's assume α=0.3 and the first
smoothed value S1=50.
S2=0.3⋅52+0.7⋅50=15.6+35=50.6
S3=0.3⋅53+0.7⋅50.6=15.9+35.42=51.32
S4=0.3⋅49+0.7⋅51.32=14.7+35.92=50.62
S5=0.3⋅48+0.7⋅50.62=14.4+35.43=49.83
Here, you can see the values get smoother over time, with less fluctuation.
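A few lines of Python reproduce this calculation (the last values may differ slightly because the hand calculation above rounds at each step):

x = [50, 52, 53, 49, 48]
alpha = 0.3

s = [x[0]]                            # initialize S1 with the first observation
for t in range(1, len(x)):
    s.append(alpha * x[t] + (1 - alpha) * s[-1])

print([round(v, 2) for v in s])       # approximately [50, 50.6, 51.32, 50.62, 49.84]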
Second-Order Exponential Smoothing (Double Exponential Smoothing)
This method is used when the data exhibits a trend. It accounts for both the level and trend
of the data by applying exponential smoothing twice: once for the level and again for the
trend:

St = α·Xt + (1 − α)·(St−1 + Tt−1)
Tt = β·(St − St−1) + (1 − β)·Tt−1

Where:
• St = smoothed value (level) at time t
• Tt = smoothed trend value at time t
• α = smoothing constant for the level
• β = smoothing constant for the trend
Example:
Let’s use the same data, 50, 52, 53, 49, 48, with α=0.3, β=0.2 and initial values
S1=50 and T1=1.
• S2=0.3⋅52+0.7⋅(50+1)=15.6+35.7=51.3
• T2=0.2⋅(51.3−50)+0.8⋅1=0.26+0.8=1.06
• S3=0.3⋅53+0.7⋅(51.3+1.06)=15.9+36.65=52.55
• T3=0.2⋅(52.55−51.3)+0.8⋅1.06=0.25+0.85=1.10
Here, the second-order smoothing adjusts for the trend, improving forecast accuracy
when a trend is present in the data.
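The same trend-adjusted calculation in Python (a minimal sketch; statsmodels' Holt class offers a full implementation):

x = [50, 52, 53, 49, 48]
alpha, beta = 0.3, 0.2
level, trend = x[0], 1.0              # initial values S1 = 50, T1 = 1 as in the example

for t in range(1, len(x)):
    prev_level = level
    level = alpha * x[t] + (1 - alpha) * (prev_level + trend)
    trend = beta * (level - prev_level) + (1 - beta) * trend
    print(t + 1, round(level, 2), round(trend, 2))

print(round(level + trend, 2))        # one-step-ahead forecast: last level + last trend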
