Unit 3
Least squares estimation is a method used to estimate the coefficients of a linear regression
model by minimizing the sum of the squared differences between the observed values and the
values predicted by the model. Let's walk through an example problem to illustrate how it
works.
Suppose you have a dataset that shows the relationship between the number of hours studied
(independent variable, X) and exam scores (dependent variable, Y) for five students:
Student  Hours Studied (X)  Exam Score (Y)
1        2                  50
2        4                  60
3        6                  65
4        8                  70
5        10                 80
You want to find the linear relationship between hours studied and exam score using linear
regression.
Step 1: Define the Linear Model
The linear model for simple linear regression is:
Y = β0 + β1⋅X + ε
Step 2: Estimate the Coefficients
For simple linear regression, the formulas to estimate the coefficients β0 (intercept) and β1 (slope) are:
β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
β0 = ȳ − β1⋅x̄
Step 3: Calculate the Coefficients from the Given Data
Using the data above, x̄ = (2 + 4 + 6 + 8 + 10)/5 = 6 and ȳ = (50 + 60 + 65 + 70 + 80)/5 = 65.
Σ(xi − x̄)(yi − ȳ) = (−4)(−15) + (−2)(−5) + (0)(0) + (2)(5) + (4)(15) = 140
Σ(xi − x̄)² = 16 + 4 + 0 + 4 + 16 = 40
So β1 = 140/40 = 3.5 and β0 = 65 − 3.5⋅6 = 44.
Step 4: Write the Regression Equation
The estimated regression equation is:
ŷ = 44 + 3.5⋅x
So, the predicted exam score for a student who studies for 7 hours is 44 + 3.5⋅7 = 68.5.
The least squares estimation method provides a way to find the best-fitting line for your data
by minimizing the sum of the squared differences between observed and predicted values. In
this example, we found that the number of hours studied has a clear positive effect
on exam scores.
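As a quick numerical check (not part of the original worked example), the same slope and intercept can be reproduced in a few lines of Python:

```python
import numpy as np

# Hours studied (X) and exam scores (Y) from the table above
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([50, 60, 65, 70, 80], dtype=float)

# Least squares estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)        # 44.0 3.5
print(beta0 + beta1 * 7)   # predicted score for 7 hours of study: 68.5
```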
Statistical Inference in Linear Regression
The test for the significance of regression tells you whether your regression model as
a whole is meaningful. A significant result (small p-value) indicates that your model,
including all predictors, explains a significant amount of variation in the dependent variable,
making it a useful tool for prediction and inference.
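As an illustration (this sketch is not from the original text), the overall F-test can be read off a fitted model; here statsmodels is used on the study-hours data from the example above:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([50, 60, 65, 70, 80], dtype=float)

X = sm.add_constant(x)            # design matrix with an intercept column
results = sm.OLS(y, X).fit()

# Test for significance of regression: F statistic and its p-value
print(results.fvalue, results.f_pvalue)
print(results.summary())          # full table, including t-tests on each coefficient
```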
Residual Analysis
The residuals (the difference between the actual values and the fitted values) are an important
diagnostic tool for checking model adequacy. For a well-fitted model, the residuals should
behave like white noise, meaning they should have:
• Zero mean: The residuals should average out to zero.
• Constant variance: Residuals should have constant variability over time
(homoscedasticity).
• No autocorrelation: Residuals should not exhibit patterns or correlations; they should
be independent over time.
• Normality: Residuals should follow a normal distribution (for models that assume
normally distributed errors).
Diagnostic Plots for Residuals:
• Residual Plot: A plot of residuals over time. A good model shows no systematic
patterns (e.g., trends or clusters).
• ACF/PACF of Residuals: The autocorrelation function (ACF) and partial
autocorrelation function (PACF) plots should show no significant correlations for the
residuals. This indicates that all relevant time-dependent structure has been captured
by the model.
• Histogram or Q-Q Plot: A histogram or Q-Q plot of residuals helps check for
normality.
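A minimal plotting sketch for these diagnostics (the data below are simulated purely for illustration; with your own model you would use its residuals instead):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from statsmodels.graphics.tsaplots import plot_acf

# Illustrative data: a simple linear relationship plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 + 2 * x + rng.normal(0, 1, x.size)

results = sm.OLS(y, sm.add_constant(x)).fit()
resid = results.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Residual plot: look for trends, clusters, or changing spread
axes[0].plot(resid, marker="o", linestyle="none")
axes[0].axhline(0, color="grey")
axes[0].set_title("Residuals")

# ACF of residuals: spikes outside the bands suggest leftover autocorrelation
plot_acf(resid, lags=20, ax=axes[1])

# Q-Q plot: points close to the line support the normality assumption
stats.probplot(resid, plot=axes[2])

plt.tight_layout()
plt.show()
```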
Scaled Residuals:
Residuals are the differences between the observed and predicted values of a model. Scaled
residuals refer to residuals that have been normalized or adjusted to account for the
variability of the model, making them easier to interpret, especially when comparing across
different models or when identifying outliers.
Key points to understand scaled residuals:
• Definition: Scaled residuals are computed by dividing the residuals by an estimate of
the standard deviation of the residuals or other model-specific factors.
• Purpose: They help in identifying unusual points that deviate significantly from the
model's predictions. Scaling puts the residuals on a common scale, so large residuals
are easily spotted.
• Formula: a simple scaled residual divides each residual by the residual standard deviation, di = ei / σ̂; the (internally) studentized residual also accounts for leverage, ri = ei / (σ̂⋅√(1 − hii)), where ei is the i-th residual, σ̂ is the estimated residual standard deviation, and hii is the i-th diagonal element of the hat matrix.
Example data:
x  y
1  2
2  4
3  5
4  7
5  8
1. PRESS (Prediction Error Sum of Squares)
PRESS is computed by leaving out each observation in turn, refitting the model, and summing the squared prediction errors: PRESS = Σ(yi − ŷ(i))², where ŷ(i) is the prediction of yi from the model fitted without observation i.
PRESS provides an estimate of how well the model generalizes to new data. Lower PRESS
values indicate a better fit, meaning that the model is likely to have good predictive accuracy.
It is particularly useful in detecting overfitting. A model with a very low PRESS on the
training data but high PRESS on validation data may be overfitting.
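As an illustration (not part of the original text), both scaled residuals and PRESS can be computed for the small (x, y) dataset above; the sketch uses the standard leave-one-out identity ei/(1 − hii) rather than literally refitting the model n times:

```python
import numpy as np

# Small example dataset from the table above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 7, 8], dtype=float)

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares fit
resid = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix
h = np.diag(H)

n, p = X.shape
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - p))

# Scaled (studentized) residuals: residuals divided by sigma_hat * sqrt(1 - h_ii)
scaled_resid = resid / (sigma_hat * np.sqrt(1 - h))

# PRESS via the leave-one-out identity e_i / (1 - h_ii)
press = np.sum((resid / (1 - h)) ** 2)

print(scaled_resid)
print(press)
```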
2. Backward Elimination
Backward elimination starts with all the variables in the model. Variables are removed one
by one based on a significance test (e.g., p-value), AIC, or BIC. The goal is to eliminate
variables that don't contribute much to the model.
Example:
Consider a housing dataset with the predictors size, bedrooms, bathrooms, age, and location:
• Step 1: Start with a full model, including all variables: size + bedrooms + bathrooms
+ age + location.
• Step 2: Examine the significance of each variable. Remove the least significant
variable (highest p-value).
Assume p-values are as follows:
size: p-value = 0.02 (significant)
bedrooms: p-value = 0.25 (insignificant)
bathrooms: p-value = 0.10 (insignificant)
age: p-value = 0.03 (significant)
location: p-value = 0.01 (significant)
Remove bedrooms (highest p-value).
• Step 3: Refit the model and check the p-values again. Continue removing the least significant variable; in the next iteration, remove bathrooms if it has the highest p-value.
• Step 4: Continue this process until all remaining variables are statistically significant. (A code sketch of this loop follows.)
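A minimal sketch of this loop, assuming a pandas DataFrame df whose columns are the predictors plus a target named price (these names are illustrative, not from the original text); a categorical variable such as location would need to be numerically encoded first:

```python
import pandas as pd
import statsmodels.api as sm

def backward_elimination(df, target, alpha=0.05):
    """Repeatedly drop the predictor with the highest p-value above alpha."""
    predictors = [c for c in df.columns if c != target]
    while predictors:
        X = sm.add_constant(df[predictors])
        results = sm.OLS(df[target], X).fit()
        pvalues = results.pvalues.drop("const")    # ignore the intercept
        worst = pvalues.idxmax()                   # least significant variable
        if pvalues[worst] <= alpha:
            break                                  # everything left is significant
        predictors.remove(worst)
    return predictors

# Example call (hypothetical housing data):
# selected = backward_elimination(df, target="price")
```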
Generalized Least Squares (GLS)
Ordinary least squares (OLS) chooses the coefficients that minimize the sum of squared residuals, Σ(yi − ŷi)².
Where:
yi are the observed values (e.g., the actual scores),
ŷi are the predicted values from the regression line.
OLS assumes the errors are uncorrelated and have constant variance; Generalized Least Squares (GLS) relaxes these assumptions.
Example:
Imagine you're trying to predict the temperature in different cities based on their latitude.
However, the errors in your model might not have constant variance. For example, in cities
near the equator, the temperature might vary less, while in polar regions, there might be
more variability in temperature due to extreme weather conditions. This violates the
assumption of constant variance in OLS.
GLS accounts for this by adjusting the model to handle different error variances, providing
more accurate parameter estimates.
Steps:
1. You first identify the variance structure of the errors (if they have unequal variances or
are correlated).
2. Then, you transform the model using a weighting matrix that reflects this error
structure.
3. Finally, you apply OLS to the transformed model.
In mathematical terms, the model is:
y = Xβ + ε
Where:
• y is the dependent variable (e.g., temperature),
• X is the matrix of predictors (e.g., latitude),
• β are the parameters we want to estimate,
• ε are the error terms with non-constant variance or correlation.
GLS estimates β while taking the non-constant variance or correlations in ε into account.
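A minimal GLS sketch with statsmodels; the temperature/latitude data and the error-variance structure below are assumptions made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: temperature explained by latitude, with error variance
# that grows with latitude (heteroscedastic errors)
rng = np.random.default_rng(0)
latitude = np.linspace(0, 70, 50)
sigma = 1 + 0.1 * latitude                      # assumed error standard deviation per city
temperature = 30 - 0.4 * latitude + rng.normal(0, sigma)

X = sm.add_constant(latitude)

# GLS takes the error covariance matrix (here diagonal: unequal variances, no correlation)
Sigma = np.diag(sigma ** 2)
gls_results = sm.GLS(temperature, X, sigma=Sigma).fit()
print(gls_results.params)
```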
Weighted Least Squares (WLS)
Weighted Least Squares (WLS) is a special case of GLS. It is used when the errors have unequal
variances (heteroscedasticity), but there is no correlation between them. In WLS, observations
with smaller variance (i.e., more precise measurements) are given more weight than
observations with larger variance (i.e., noisier measurements).
Example:
Suppose you're measuring the heights of trees in a forest, but your measuring device is less
accurate for very tall trees. Thus, the errors for taller trees are larger than for shorter trees.
If you were to use OLS, it would treat all errors equally. However, with WLS, you can give
more weight to observations with smaller errors (shorter trees) and less weight to those with
larger errors (taller trees), resulting in more reliable parameter estimates.
Steps:
1. Assign weights to each observation based on the inverse of the variance of the errors.
If an observation has more error variance, it gets a smaller weight.
2. Apply OLS to the weighted data.
Mathematically:
The WLS objective function is:
minimize Σ wi⋅(yi − Xiβ)²
Where:
• wi is the weight for each observation, often 1/σi², where σi² is the variance of the error for observation i,
• yi is the observed value,
• Xi are the predictors.
GLS is more general and can handle both heteroscedasticity and correlated errors.
WLS is a specific case of GLS used when the errors have unequal variances but are
uncorrelated.
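A minimal WLS sketch with statsmodels, loosely following the tree-measurement scenario; the data, the predictor, and the weights below are assumptions for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical tree data: measurement error grows for taller trees
trunk_diameter = np.linspace(10, 100, 40)
error_sd = 0.05 * trunk_diameter                 # larger trees measured less precisely
height = 2 + 0.3 * trunk_diameter + rng.normal(0, error_sd)

X = sm.add_constant(trunk_diameter)

# Weights are the inverse of the error variance: precise observations count more
weights = 1.0 / error_sd ** 2
wls_results = sm.WLS(height, X, weights=weights).fit()
print(wls_results.params)
```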
REGRESSION MODELS FOR GENERAL TIME SERIES DATA
The presence of autocorrelation in the errors has several effects on the OLS regression
procedure. These are summarized as follows:
1. The OLS regression coefficients are still unbiased, but they are no longer
minimum-variance estimates.
2. When the errors are positively autocorrelated, the residual mean square may
seriously underestimate the error variance σ². Consequently, the standard errors
of the regression coefficients may be too small. As a result, confidence and
prediction intervals are shorter than they really should be, and tests of
hypotheses on individual regression coefficients may be misleading, in that
they may indicate that one or more predictor variables contribute significantly
to the model when they really do not. Generally, underestimating the error
variance σ² gives the analyst a false impression of precision of estimation and
potential forecast accuracy.
3. The confidence intervals, prediction intervals, and tests of hypotheses based on
the t and F distributions are, strictly speaking, no longer exact procedures.
The Durbin-Watson test is used to detect the presence of autocorrelation (specifically, first-
order autocorrelation) in the residuals of a regression model. It helps determine whether the
residuals are independent from one another, a key assumption in ordinary least squares (OLS)
regression. Here's how it works and how to interpret it:
Durbin-Watson Statistic Formula:
The Durbin-Watson statistic d is calculated as:
d = Σ(et − et−1)² / Σ et²
Where:
• et is the residual at time t (the numerator is summed over t = 2, …, n and the denominator over t = 1, …, n),
• n is the number of observations.
A value of d near 2 suggests no first-order autocorrelation; values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation. (A short computational sketch follows.)
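A minimal computational sketch (the residual values are made up for illustration); statsmodels also provides the same statistic directly:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# resid: residuals from a fitted time series regression (e.g. results.resid)
resid = np.array([0.5, 0.4, 0.2, -0.1, -0.3, -0.2, 0.1, 0.3])  # illustrative values

# Direct computation from the formula above
d = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# statsmodels gives the same statistic
print(d, durbin_watson(resid))
```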
Example:
Suppose you are modeling ice cream sales over time.
Step 1: Fit the Initial Model and Check the Residuals
After fitting the initial regression model, the ACF of the residuals shows noticeable autocorrelation at the first few lags:
Lag  ACF Value
1    0.6
2    0.4
3    0.2
...  ...
This autocorrelation suggests that the model is not capturing all the important variables.
Step 2: Identifying the Missing Predictor
After investigating, you realize that weather (specifically temperature) plays a significant role
in ice cream sales. Hotter days result in more sales, and this variation was not included in the
original model.
Step 3: Incorporating the Missing Predictor
Now, you include temperature as an additional predictor in the model:
Salest = β0 + β1⋅Xt + β2⋅Tempt + εt
Where:
Xt denotes the original predictor(s) at time t,
Tempt is the temperature at time t.
The ACF of the residuals from the refitted model is:
Lag  ACF Value
1    0.1
2    0.05
3    0.01
...  ...
By adding the temperature variable, the residuals no longer show significant autocorrelation.
This means that the new model better captures the underlying relationship between the
predictors and the target, resolving the apparent autocorrelation problem.
In this example, the apparent autocorrelation was due to the missing predictor, temperature.
Once the temperature variable was identified and included in the model, the autocorrelation in
the residuals diminished, improving the model's accuracy and interpretation.
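A minimal sketch of this diagnostic workflow on simulated data (the data-generating process, variable names, and coefficients below are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(2)
n = 200
advertising = rng.normal(10, 2, n)                  # hypothetical original predictor
temperature = 20 + 10 * np.sin(np.arange(n) / 10)   # slowly varying missing predictor
sales = 5 + 2 * advertising + 3 * temperature + rng.normal(0, 1, n)

# Model without temperature: residuals inherit its structure and look autocorrelated
r1 = sm.OLS(sales, sm.add_constant(advertising)).fit().resid

# Model with temperature: the structure is explained and the residual ACF drops
X2 = sm.add_constant(np.column_stack([advertising, temperature]))
r2 = sm.OLS(sales, X2).fit().resid

print(acf(r1, nlags=3))   # large values at low lags
print(acf(r2, nlags=3))   # close to zero beyond lag 0
```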
Exponential Smoothing
Exponential smoothing is a forecasting technique used in time series analysis to smooth data
by applying a weighted average of past observations. The weights decrease exponentially as
observations get older, which gives more importance to recent data points. There are different
types of exponential smoothing methods, with the first-order and second-order being common
ones.
1. First-Order Exponential Smoothing (Simple Exponential Smoothing)
This method is used when there is no trend or seasonality in the data. It calculates the smoothed
value by combining the previous smoothed value and the current observation, with a smoothing
constant (α) that determines the weight assigned to the current observation.
St = α⋅Xt + (1 − α)⋅St−1
Where:
St = smoothed value at time t
Xt = observed value at time t
St−1 = smoothed value at time t−1
α = smoothing constant (0 < α < 1)
Example:
Consider a dataset of monthly sales: 50,52,53,49,48. Let's assume α=0.3 and the first
smoothed value S1=50.
S2=0.3⋅52+0.7⋅50=15.6+35=50.6
S3=0.3⋅53+0.7⋅50.6=15.9+35.42=51.32
S4=0.3⋅49+0.7⋅51.32=14.7+35.92=50.62
S5=0.3⋅48+0.7⋅50.62=14.4+35.43=49.83
Here, you can see the values get smoother over time, with less fluctuation.
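The same recursion can be written as a short loop (not part of the original text); note that the loop keeps full precision, so the last value prints as 49.84 rather than the 49.83 obtained above by rounding at each step:

```python
# Simple (first-order) exponential smoothing of the monthly sales series above
sales = [50, 52, 53, 49, 48]
alpha = 0.3

smoothed = [sales[0]]                      # S1 is initialized to the first observation
for x in sales[1:]:
    smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])

print([round(s, 2) for s in smoothed])     # [50, 50.6, 51.32, 50.62, 49.84]
```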
2. Second-Order Exponential Smoothing (Double Exponential Smoothing)
This method is used when the data exhibits a trend. It accounts for both the level and trend
of the data by applying exponential smoothing twice: once for the level and again for the
trend.
St = α⋅Xt + (1 − α)⋅(St−1 + Tt−1)
Tt = β⋅(St − St−1) + (1 − β)⋅Tt−1
Where:
• St = smoothed value (level) at time t
• Tt = smoothed trend value at time t
• α = smoothing constant for the level
• β = smoothing constant for the trend
Example:
Let's use the same data, 50, 52, 53, 49, 48, with α=0.3, β=0.2, and initial values S1=50 and T1=1.
• S2=0.3⋅52+0.7⋅(50+1)=15.6+35.7=51.3
• T2=0.2⋅(51.3−50)+0.8⋅1=0.26+0.8=1.06
• S3=0.3⋅53+0.7⋅(51.3+1.06)=15.9+36.65=52.55
• T3=0.2⋅(52.55−51.3)+0.8⋅1.06=0.25+0.85=1.10
Here, the second-order smoothing adjusts for the trend, improving forecast accuracy
when a trend is present in the data.
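The level and trend recursions can likewise be written as a short loop (not part of the original text), using the same α, β, and initial values as the example; later values are carried at full precision rather than rounded at each step:

```python
# Double (second-order / Holt) exponential smoothing of the same sales series
sales = [50, 52, 53, 49, 48]
alpha, beta = 0.3, 0.2

level, trend = [50.0], [1.0]               # initial values S1 = 50, T1 = 1 as in the example
for x in sales[1:]:
    new_level = alpha * x + (1 - alpha) * (level[-1] + trend[-1])
    new_trend = beta * (new_level - level[-1]) + (1 - beta) * trend[-1]
    level.append(new_level)
    trend.append(new_trend)

print([round(s, 2) for s in level])        # [50.0, 51.3, 52.55, 52.26, 51.55]
print([round(t, 2) for t in trend])        # [1.0, 1.06, 1.1, 0.82, 0.51]
```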