Unit 3 Notes
Time series may also exhibit short-term seasonal effects (over a year, month, week,
or even a day) as well as longer-term cyclical effects, or nonlinear trends. A seasonal
effect is one that repeats at fixed intervals of time, typically a year, month, week, or day.
At a neighborhood grocery store, for instance, short-term seasonal patterns may occur
over a week, with the heaviest volume of customers on weekends; seasonal patterns may
also be evident during the course of a day, with higher volumes in the mornings and late
afternoons. A figure shows seasonal changes in natural gas usage for a homeowner over
the course of a year (Excel file Gas & Electric). Cyclical effects describe ups and downs
over a much longer time frame, such as several years; a chart of the data in the
Excel file Federal Funds Rates shows such a pattern.
1. Mean absolute deviation (MAD):
The mean absolute deviation (MAD) is the absolute difference between the
actual value and the forecast, averaged over a range of forecasted values:

MAD = ( Σ |At - Ft| ) / n

where At is the actual value of the time series at time t, Ft is the forecast value for time t,
and n is the number of forecast values (not the number of data points, since we do not
have a forecast value associated with the first k data points). MAD provides a robust
measure of error and is less affected by extreme observations.
2. Mean square error (MSE):
Mean square error (MSE) is probably the most commonly used error metric.
It penalizes larger errors because squaring larger numbers has a greater impact than
squaring smaller numbers. The formula for MSE is

MSE = ( Σ (At - Ft)^2 ) / n
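As a quick illustration of these two metrics, the following Python sketch computes MAD and MSE for a short series of actual and forecast values; the numbers are hypothetical and chosen only for illustration.

def mad(actuals, forecasts):
    # Mean absolute deviation: average of |At - Ft| over the forecasted periods.
    errors = [abs(a - f) for a, f in zip(actuals, forecasts)]
    return sum(errors) / len(errors)

def mse(actuals, forecasts):
    # Mean square error: average of (At - Ft)^2, which penalizes large errors more heavily.
    errors = [(a - f) ** 2 for a, f in zip(actuals, forecasts)]
    return sum(errors) / len(errors)

actuals = [102, 98, 107, 111, 104]     # At (hypothetical)
forecasts = [100, 101, 103, 108, 109]  # Ft (hypothetical)

print(mad(actuals, forecasts))  # 3.4
print(mse(actuals, forecasts))  # 12.6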
Simple Exponential Smoothing
A simple exponential smoothing forecast is computed from

Ft+1 = (1 - α)Ft + αAt = Ft + α(At - Ft)

where Ft+1 is the forecast for time period t + 1, Ft is the forecast for period t, At is the
observed value in period t, and α is a constant between 0 and 1 called the smoothing
constant.
To begin, set F1 and F2 equal to the actual observation in period 1, A1.
Using the two forms of the forecast equation just given, we can interpret the simple
exponential smoothing model in two ways. In the first form, the forecast for the next
period, Ft+1, is a weighted average of the forecast made for period t, Ft, and the actual
observation in period t, At. The second form of the model, obtained by simply rearranging
terms, states that the forecast for the next period, Ft+1, equals the forecast for the last
period, Ft, plus a fraction α of the forecast error made in period t, At - Ft. Thus, to make a
forecast once we have selected the smoothing constant, we need to know only the
previous forecast and the actual value. By repeated substitution for Ft in the equation, it
is easy to demonstrate that Ft+1 is a decreasingly weighted average of all past time-series
data. Thus, the forecast actually reflects all the data, provided that α is strictly between 0
and 1.
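A minimal Python sketch of this recursion, using the second form of the equation and a made-up demand series, might look like the following; the data and the choice α = 0.3 are purely illustrative.

def simple_exponential_smoothing(actuals, alpha):
    # Returns forecasts F1..Fn+1 for the observed series A1..An,
    # with F1 (and hence F2) set equal to A1.
    forecasts = [actuals[0]]                          # F1 = A1
    for a_t in actuals:
        f_t = forecasts[-1]
        forecasts.append(f_t + alpha * (a_t - f_t))   # Ft+1 = Ft + alpha * (At - Ft)
    return forecasts

demand = [42, 40, 43, 41, 45]                         # hypothetical At values
print(simple_exponential_smoothing(demand, alpha=0.3))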
Double Exponential Smoothing
In double exponential smoothing, the estimates of at and bt are obtained from
the following equations:

at = αAt + (1 - α)(at-1 + bt-1)
bt = β(at - at-1) + (1 - β)bt-1

Here α and β are smoothing constants between 0 and 1.
In essence, we are smoothing both parameters of the linear trend model. From the first
equation, the estimate of the level in period t is a weighted average of the observed value
at time t and the predicted value at time t, at-1 + bt-1, based on simple exponential
smoothing. For large values of α, more weight is placed on the observed value. Lower
values of α put more weight on the smoothed predicted value. Similarly, from the second
equation, the estimate of the trend in period t is a weighted average of the differences in
the estimated levels in periods t and t - 1 and the estimate of the trend in period t - 1.
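A minimal Python sketch of these level and trend updates (double exponential smoothing, also known as Holt's linear trend method) is shown below; the data, the simple initializations of a1 and b1, and the smoothing constants are all hypothetical.

def double_exponential_smoothing(actuals, alpha, beta):
    # Returns the one-step-ahead forecasts F2..Fn+1, where Ft+1 = at + bt.
    level = actuals[0]                       # a1: start at the first observation
    trend = actuals[1] - actuals[0]          # b1: start at the first observed change
    forecasts = []
    for a_t in actuals[1:]:
        forecasts.append(level + trend)      # forecast for this period, made in the prior period
        prev_level = level
        level = alpha * a_t + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    forecasts.append(level + trend)          # forecast for the next, not-yet-observed period
    return forecasts

sales = [10, 12, 13, 15, 16, 18]             # hypothetical At values with an upward trend
print(double_exponential_smoothing(sales, alpha=0.5, beta=0.4))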
Forecasting Time Series with Seasonality:
When time series exhibit seasonality, different techniques provide better forecasts.
Regression-Based Seasonal Forecasting Models
One approach is to use linear regression. Multiple linear regression models
with categorical variables can be used for time series with seasonality.
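As a sketch of this idea, the following Python example uses NumPy to fit a linear trend plus dummy (categorical) variables for the quarter of the year by least squares; the quarterly sales figures are hypothetical.

import numpy as np

sales = np.array([10, 14, 8, 12, 11, 15, 9, 13, 12, 16, 10, 14], dtype=float)  # 3 years of quarterly data
t = np.arange(1, len(sales) + 1)             # time index
quarter = (np.arange(len(sales)) % 4) + 1    # season (quarter) of each observation

# Design matrix: intercept, linear trend, and dummies for quarters 2-4 (quarter 1 is the baseline).
X = np.column_stack([
    np.ones_like(t, dtype=float),
    t,
    (quarter == 2).astype(float),
    (quarter == 3).astype(float),
    (quarter == 4).astype(float),
])

coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef)                                   # intercept, trend, and the three seasonal effects

# Forecast the next period (t = 13 falls in quarter 1, so all dummies are 0).
x_next = np.array([1.0, 13.0, 0.0, 0.0, 0.0])
print(x_next @ coef)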
Holt-Winters Forecasting for Seasonal Time Series
Holt-Winters models are similar to exponential smoothing models in that
smoothing constants are used to smooth out variations in the level and seasonal patterns
over time. For time series with seasonality but no trend, XLMiner supports a Holt-Winters
method but does not have the ability to optimize the parameters.
Holt-Winters Models for Forecasting Time Series with Seasonality and Trend
Many time series exhibit both trend and seasonality. Such might be the case for
growing sales of a seasonal product. These models combine elements of both the trend
and seasonal models. Two types of Holt-Winters smoothing models are often used.
The Holt-Winters additive model is based on the forecast equation

Ft+1 = at + bt + St-s+1

and the Holt-Winters multiplicative model on

Ft+1 = (at + bt)St-s+1

where St is the seasonal factor for period t and s is the number of periods in a seasonal cycle.
The additive model applies to time series with relatively stable seasonality, whereas the
multiplicative model applies to time series whose amplitude increases or decreases over
time. Therefore, a chart of the time series should be viewed first to identify the
appropriate type of model to use. Three parameters, α, β, and γ, are used to smooth the level,
trend, and seasonal factors in the time series. XLMiner supports both models.
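A minimal Python sketch of the additive model's smoothing recursions is shown below. The quarterly data, the smoothing constants, and the simple initializations of the level, trend, and seasonal factors are all hypothetical, and XLMiner's internal initialization may differ.

def holt_winters_additive(actuals, season_length, alpha, beta, gamma):
    # Returns the one-step-ahead forecast for the period following the observed series.
    s = season_length
    season_avg = sum(actuals[:s]) / s
    level = season_avg                                            # initial level
    trend = (sum(actuals[s:2 * s]) - sum(actuals[:s])) / (s * s)  # average per-period change
    seasonals = [actuals[i] - season_avg for i in range(s)]       # additive seasonal factors

    for i in range(s, len(actuals)):
        a_t = actuals[i]
        last_level = level
        level = alpha * (a_t - seasonals[i % s]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonals[i % s] = gamma * (a_t - level) + (1 - gamma) * seasonals[i % s]

    return level + trend + seasonals[len(actuals) % s]

quarterly = [30, 40, 25, 35, 34, 44, 29, 39]    # hypothetical series with seasonality and a mild trend
print(holt_winters_additive(quarterly, season_length=4, alpha=0.4, beta=0.2, gamma=0.3))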
Predictive modeling is often performed using curve and surface fitting, time series
regression, or machine learning approaches. Regardless of the approach used, the
process of creating a predictive model is the same across methods.
The two steps in supervised machine learning.
Table 1.1 lists a set of historical instances,
or dataset, of mortgages that a bank has granted in the past. This dataset includes
descriptive features that describe the mortgage, and a target feature that indicates
whether the mortgage applicant ultimately defaulted on the loan or paid it back in full.
The descriptive features tell us three pieces of information about the mortgage: the
OCCUPATION (which can be professional or industrial) and AGE of the applicant and the
ratio between the applicant’s salary and the amount borrowed (LOAN-SALARY RATIO).
The target feature, OUTCOME, is set to either default or repay. In machine learning terms,
each row in the dataset is referred to as a training instance, and the overall dataset is
referred to as a training dataset.
Table 1.1
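As an illustration of this structure, the following Python sketch represents a few training instances, each pairing the descriptive features with the target feature; the rows are hypothetical stand-ins, not the actual values from Table 1.1.

training_dataset = [
    {"OCCUPATION": "industrial",   "AGE": 34, "LOAN_SALARY_RATIO": 3.0, "OUTCOME": "repay"},
    {"OCCUPATION": "professional", "AGE": 41, "LOAN_SALARY_RATIO": 4.5, "OUTCOME": "default"},
    {"OCCUPATION": "professional", "AGE": 36, "LOAN_SALARY_RATIO": 1.3, "OUTCOME": "repay"},
]

descriptive_features = ["OCCUPATION", "AGE", "LOAN_SALARY_RATIO"]
target_feature = "OUTCOME"

for instance in training_dataset:
    inputs = {f: instance[f] for f in descriptive_features}
    print(inputs, "->", instance[target_feature])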
Table 1.4(b) also illustrates the fact that the training dataset does not contain an instance
for every possible descriptive feature value combination and that there are still a large
number of potential prediction models that remain consistent with the training dataset
after the inconsistent models have been excluded. Specifically, there are three remaining
descriptive feature value combinations for which the correct target feature value is not
known, and therefore there are 3^3 = 27 potential models that remain consistent with the
training data. Three of these, M2, M4, and M5, are shown in Table 1.4(b). Because a single
consistent model cannot be found based on the sample training dataset alone, we say that
machine learning is fundamentally an ill-posed problem.
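The count of 27 follows from a simple enumeration: each of the three uncovered descriptive feature value combinations can be assigned any one of the three target feature values independently, as the short Python sketch below illustrates (the value names are taken from the retail example discussed next).

from itertools import product

target_values = ["single", "family", "couple"]   # possible values of the target feature GRP
uncovered_combinations = 3                       # feature combinations missing from the training set

consistent_models = list(product(target_values, repeat=uncovered_combinations))
print(len(consistent_models))                    # 3 ** 3 = 27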
We might be tempted to think that having multiple models that are consistent with the
data is a good thing. The problem is, however, that although these models agree on what
predictions should be made for the instances in the training dataset, they disagree with
regard to what predictions should be returned for instances that are not in the training
dataset. For example, if a new customer starts shopping at the supermarket and buys
baby food, alcohol, and organic vegetables, our set of consistent models will contradict
each other with respect to the prediction that should be returned for this customer:
M2 will return GRP = single, M4 will return GRP = family, and M5 will return GRP = couple.
The criterion of consistency with the training data doesn’t provide any guidance with
regard to which of the consistent models to prefer when dealing with queries that are
outside the training dataset. As a result, we cannot use the set of consistent models to
make predictions for these queries. In fact, searching for predictive models that are
consistent with the dataset is equivalent to just memorizing the dataset. As a result, no
learning is taking place because the set of consistent models tells us nothing about the
underlying relationship between the descriptive and target features beyond what a
simple look-up of the training dataset would provide.
If a predictive model is to be useful, it must be able to make predictions for queries that
are not present in the data. A prediction model that makes the correct predictions for
these queries captures the underlying relationship between the descriptive and target
features and is said to generalize well. Indeed, the goal of machine learning is to find the
predictive model that generalizes best. In order to find this single best model, a machine
learning algorithm must use some criteria for choosing among the candidate models it
considers during its search.
Given that consistency with the dataset is not an adequate criterion to select the best
prediction model, what criteria should we use? There are a lot of potential answers to this
question, and that is why there are a lot of different machine learning algorithms. Each
machine learning algorithm uses different model selection criteria to drive its search for
the best predictive model. So, when we choose to use one machine learning algorithm
instead of another, we are, in effect, choosing to use one model selection criterion instead
of another.
All the different model selection criteria consist of a set of assumptions about the
characteristics of the model that we would like the algorithm to induce. The set of
assumptions that defines the model selection criteria of a machine learning algorithm is
known as the inductive bias of the machine learning algorithm.
There are two types of inductive bias that a machine learning algorithm can use, a
restriction bias and a preference bias. A restriction bias constrains the set of models that
the algorithm will consider during the learning process. A preference bias guides the
learning algorithm to prefer certain models over others.
For example, we introduce a machine learning algorithm called multivariable linear
regression with gradient descent, which implements the restriction bias of only
considering prediction models that produce predictions based on a linear combination of
the descriptive feature values and applies a preference bias over the order of the linear
models it considers in terms of a gradient descent approach through a weight space. As a
second example, we introduce the Iterative Dichotomizer 3 (ID3) machine learning
algorithm, which uses a restriction bias of only considering tree prediction models where
each branch encodes a sequence of checks on individual descriptive features but also
utilizes a preference bias by preferring shallower (less complex) trees over larger trees.
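As a concrete sketch of the first example, the following Python code fits a multivariable linear regression by gradient descent on synthetic data; the data, learning rate, and iteration count are hypothetical choices made only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # two descriptive features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0          # target generated by a known linear relationship

Xb = np.column_stack([np.ones(len(X)), X])       # add a bias (intercept) column
weights = np.zeros(Xb.shape[1])                  # starting point in the weight space
learning_rate = 0.1

for _ in range(500):
    predictions = Xb @ weights                   # restriction bias: only linear models are considered
    gradient = (2 / len(y)) * Xb.T @ (predictions - y)   # gradient of the mean squared error
    weights -= learning_rate * gradient          # preference bias: follow the error surface downhill

print(weights)                                   # approaches [1.0, 3.0, -2.0]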
It is important to recognize that using an inductive bias is a necessary prerequisite for
learning to occur; without inductive bias, a machine learning algorithm cannot learn
anything beyond what is in the data.
In summary, machine learning works by searching through a set of potential models to
find the prediction model that best generalizes beyond the dataset. Machine learning
algorithms use two sources of information to guide this search, the training dataset and
the inductive bias assumed by the algorithm.