A Comparison of Deep Learning Methods For Time Series Forecasting With Limited Data
Master’s thesis
for the degree
M.Sc. Mathematics
submitted by
Abinaya Jayaprakash (5391958)
This thesis was written in cooperation with Elia. The goal was to develop a
forecaster that would predict the future output of photovoltaic (PV) systems, which
would then be used to optimize energy usage in households. Better knowledge
about renewable energy production would result in efficient usage of this energy
and reduced energy costs. The historical data used covered six months, January to
June 2022. Traditional time series forecasting techniques were compared with
emerging machine learning approaches on their ability to predict future values
using the limited input data. The established time series forecasting technique used
as a baseline model was simple linear regression. The machine learning techniques
consisted of two main types of neural networks: LSTMs and CNNs.
The performance of all methods was measured and compared via a chosen set of
evaluation metrics. The best performance among all implemented methods was
achieved by a CNN model with no hidden layers.
Acknowledgements
I would first like to thank my thesis supervisors, Professor Volker John and Rachel
Berryman, for their time, support and vital feedback. My thanks also go to Mason
Samuel for his help and guidance. I would also like to thank my friends and family
for motivating me throughout and mainly Subodh Singh Khangar for guiding me
when I first decided to explore the field of machine learning. Lastly I am also very
grateful to Elia for the opportunity, the access to data and coding facilities.
Table of Contents

1. Introduction
   1.1 Motivation
   1.2 Related Work
   1.3 Outline
2. Theoretical Background
   2.1 Machine Learning
   2.2 Mathematical Background
       2.2.1 Artificial Neural Networks (ANNs)
       2.2.2 Training FNNs
       2.2.3 Overfitting
       2.2.4 Vanishing/Exploding gradients
   2.3 Types of ANNs
       2.3.1 RNNs
       2.3.2 CNNs
   2.4 Multivariate forecasting using LSTMs
   2.5 Hyperparameters
   2.6 Software used
       3.4.2 Normalization
   3.5 Train and Test Data
4. Results
   4.1 Baseline model
   4.2 LSTM
   4.3 CNN
   4.4 CNN-LSTM
5. Evaluation
6. Conclusion
   6.1 Future Work
1. Introduction
1.1 Motivation
Increasing developments in the area of machine learning have clearly demonstrated
the value that machines can add in producing better outcomes for businesses.
Deep learning in particular is one of the most recognized subfields of machine
learning; inspired by the human brain’s neural networks, it has set exceptional
records of accuracy in recent years, irrespective of the business sector.
Consequently, deep learning has strongly established a place for itself in the energy
sector too, and one of its main applications has been to maintain the balance
between energy supply and demand. According to Forbes [1], "As the world shifts
in the direction of personalized digitized services, the energy sector is lagging
behind", and this is mainly because this sector heavily depends on predictive
diagnostics and the cost of error in it is known to be particularly high.
In attempts to reduce the wastage of excess energy in households, working with
renewable energy has proven to be quite challenging, given that it requires precise
forecasts of both its generation and its consumption. With respect to solar energy,
the uncertainty associated with the factors that affect its production, such as
constantly changing weather conditions, makes this difficult. Another hardship is
gaining access to historical data from comparable households in similar geographical
settings, since installing PV panels in a household is a big-budget, high-involvement
decision with many barriers and substantial maintenance requirements, [2].
Achieving an equilibrium in households amidst these drawbacks would lead to
energy savings, which would then result in reduced utility bills and personalized
energy usage/maintenance schemes. Energy providers would be able to dispatch their
1.3 Outline
This thesis is further divided into chapters that elaborate on the different theories
and methods used in the thesis.
Chapter 2 explains the concept behind neural networks, how they are trained, their
drawbacks and possible ways to counteract these. It also introduces the two main
types of neural networks the thesis focuses on and the software needed to develop
and train them. Chapter 3 describes the task and the data sets used in detail. It
also gives an overview of the different techniques used to process the raw data
and then further divide it for training and testing purposes. Chapters 4 and 5
present and evaluate the results obtained via the various approaches chosen. At
the end, chapter 6 lays out the outcome of the thesis and how this can be utilized
or improved for future use.
2. Theoretical Background
As seen in Figure 2.1, a neuron comprises four main components. The first is
the inputs, which could be feature values from the training data set or the output
from a preceding neuron. Second, each input (x_i) is multiplied by a weight (w_i)
that controls its level of significance in determining the output (y). The larger the
magnitude of the weight multiplied by the input, the more influence it has on the
output value. Third, all these weighted inputs (w_i x_i) are added to yield a weighted
sum. This could also include a bias component (b) in some instances, which is used
to adjust the value of the output obtained, thereby ensuring the model is a better
fit for the input data. The final component is the activation function (ϕ), through
which the weighted sum is passed to output a
single number

y = ϕ( ∑_{i=1}^{m} w_i x_i + b ).
The purpose of these activation functions can differ based on their mathematical
properties such as linearity, continuity, range and order of differentiation, [11].
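To make this computation concrete, here is a minimal NumPy sketch of a single neuron's forward pass (the input, weight and bias values are illustrative and not taken from the thesis code):

```python
import numpy as np

def neuron_output(x, w, b, activation=np.tanh):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # example inputs x_i
w = np.array([0.1, 0.4, -0.2])   # example weights w_i
b = 0.05                         # bias term
y = neuron_output(x, w, b)       # y = phi(sum(w_i * x_i) + b)
```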
network outputs and also as an average of all cost functions for individual training
samples when there are many. Although the cost function also entails the actual
output, it is a fixed parameter and will not be affected by the changes made to the
weights/biases, [13].
An epoch refers to one cycle of forward propagation followed by backpropagation, [14].
The updated internal parameters are then used for the next epoch. The
number of epochs is traditionally large, permitting the learning algorithm to run
until the network error has reached an optimal minimum (satisfies certain criteria
or the number of iterations exceeds the allocated computational budget, [15]).
w_1 = w_0 − α · dJ(w)/dw,

where α denotes the chosen learning rate. When the gradient is positive, this update
decreases the weight; when the gradient is negative, the update effectively becomes

w_1 = w_0 + α · |dJ(w)/dw|,

increasing the weight.
This method, referred to as gradient descent, is fairly expensive since it requires the
computation of the gradient during every single iteration, especially when there are
many training samples.
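As a rough illustration of the update rule (a toy sketch, not the thesis training code), a single gradient descent step and its repeated application look as follows:

```python
def gradient_descent_step(w, grad, learning_rate):
    """One gradient descent update: move the weight against the gradient of the cost."""
    return w - learning_rate * grad

# Toy example: minimize J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, grad=2.0 * (w - 3.0), learning_rate=0.1)
print(w)  # approaches the minimizer w = 3
```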
2.2.3 Overfitting
One common problem encountered when training FNNs is that the neural network
overfits (learns the training data set too well) and is unable to perform with unseen
data, [17]. This issue is particularly frequent for small training data sets like the
one used in this context. Several techniques can be used to counteract it.
One possibility is to add a regularization/penalty term to the error function (L2
regularization) of the form
(λ/2) wᵀw,
where λ determines its impact. This additional term, which is proportional to the
sum of all squared weights, helps in minimizing the values of the weights (weight
decay) during backpropagation and adds stability by making the model less sensitive
to the training data.
Another method we could use is dropout. This method ignores a set of neurons in
the neural network at random with set probability. This reduces interdependent
learning across the network, makes it simpler and results in better spread out
weights, [18].
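As an illustration of how both techniques are typically added in Keras (a minimal sketch with placeholder layer sizes and rates, not the architecture used in the thesis):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(32, activation="elu",
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 penalty on this layer's weights
    layers.Dropout(0.3),   # randomly ignores 30% of the neurons during training
    layers.Dense(1),
])
```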
ELU(x) = x for x ≥ 0, and ELU(x) = α(eˣ − 1) for x < 0.
Negative entries are handled better by the ELU activation function with the help
of α, which controls the saturation of negative net inputs and thereby avoids the
dying ReLU problem. Since its mean output values are closer to zero, it also tends
to converge faster and has several other advantages over the ReLU function
overall, [20, 21]. The value of α was kept constant at the default value of 1.0 for all
the models tested.
A well-known approach for the initialization of weights along with the ELU activation
function is He initialization, named after the last name of its author. The weights
are random numbers that follow a Gaussian probability distribution with a
mean of 0 and a standard deviation of √(2/n), where n denotes the number of inputs
fed into the node, [22].
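A minimal sketch of He initialization in NumPy follows (layer sizes are arbitrary); in Keras, a comparable behaviour is available through kernel_initializer="he_normal":

```python
import numpy as np

def he_normal(n_in, n_out, seed=0):
    """Zero-mean Gaussian weights with standard deviation sqrt(2 / n_in)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / n_in), size=(n_in, n_out))

weights = he_normal(n_in=20, n_out=10)
```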
2.3.1 RNNs
RNNs are specialized in processing a sequence of values of varying length.
As illustrated in Figure 2.4, RNNs are structured to learn dependencies in sequential
input data, by storing previous values in its memory/hidden state for a short period
Figure 2.4: Main difference between MLP and RNN, from [24]
of time. The loop in the figure depicts how the output of the current layer depends
on the current and previous inputs, which enables the network to handle sequential
data well, [25].
2.3.2 CNNs
CNNs were originally designed to solve problems with image data but have also
produced impressive results when introduced to sequential data and this is mainly
due to the one dimensional convolutional layers it can contain.
Figures 2.5 and 2.6 give us an overview as to how a convolutional layer works as
opposed to a fully connected layer.
output generated which makes CNNs capable of handling high dimensional data
and picking up important features in the data.
The weight matrix in Figure 2.5 would have varying values in all columns of the
matrix, whereas the values of weights in the kernel/filter in Figure 2.6 do not change
as it slides horizontally and vertically through the input data. This minimizes the
number of parameters per layer and teaches the network to flexibly learn and detect
important features irrespective of where/how they appear in the input data, which
allows for better generalization when a CNN is exposed to new input data, [26].
As mentioned above, Figure 2.7 shows how the kernel slides only horizontally in
a one dimensional convolutional layer used when dealing with time series data.
Ref. [28] contains further information on how the time series data was restructured
in order to be fed into the 1D convolutional layers.
Similarly, the pooling layers in a CNN also enhance its capability to identify vital
features and control the computational power needed to run it via dimension
reduction.
Figure 2.8: Types of pooling and their respective outputs from [29]
As illustrated in Figure 2.8, there are two main types of pooling and their names
describe how their functionalities and outputs are different. In Figure 2.8 a
window size of 2 and a stride (number of columns/rows to skip when moving
horizontally/vertically) value of 2 have been used and these hyperparameters also
play a vital role in determining how efficient the pooling layers are. Along with
these the type of padding used also matters, [29].
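To make the convolution and pooling mechanics concrete, here is a minimal Keras sketch of a 1D convolutional block for time series input (the window length of 96 quarter-hour steps, the feature count and the layer sizes are illustrative assumptions, not the final configuration used in the thesis):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(96, 12)),                                # (timesteps, features)
    layers.Conv1D(filters=16, kernel_size=2, activation="elu"),  # kernel slides along the time axis only
    layers.MaxPooling1D(pool_size=2, strides=2),                 # dimension reduction via max pooling
    layers.Flatten(),
    layers.Dense(1),                                             # single forecast value
])
model.summary()
```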
Figure 2.10: Detailed overview of a memory cell, extracted from [31] and edited
f_t = σ(W_f · [h_{t−1}, x_t] + b_f).
• Input Gate (2) - Updates the new cell state (signifying long-term memory)
for the current timestamp, C_t, which is a combination of a percentage
(i_t) of the processed input data C̄_t along with the information from the previous
cell state C_{t−1}:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),
C_t = f_t ∗ C_{t−1} + i_t ∗ C̄_t.
• Output Gate (3) - Decides what percentage (o_t) of the new cell state C_t
needs to be stored short term and carried forward to the next timestamp, also
referred to as the new hidden state h_t:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o),
h_t = o_t ∗ tanh(C_t).
The cell state enables information flow through the entire chain of cells, with
a few linear interactions that determine which parts of it are stored long term.
Constantly updated cell state values help control the gradient values, and thereby
avoid vanishing/exploding gradients, [33].
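The gate equations above can be written out directly; the following NumPy sketch performs one LSTM cell step (the weight matrices are placeholders, and the candidate cell state C̄_t is computed with its own weights W_c following the standard formulation, which is not shown explicitly in the text above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One LSTM cell step; p holds the weight matrices and bias vectors."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])      # input gate
    c_bar = np.tanh(p["W_c"] @ z + p["b_c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_bar            # new cell state C_t
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state h_t
    return h_t, c_t
```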
2.5 Hyperparameters
• Number of hidden layers and neurons - The number of hidden layers
and neurons in each of these layers depends on the use case and the input data,
and largely impacts the complexity of the model, [34]. Considering the limited
amount of data available, I chose to test with 1 and 2 hidden layers, with the
number of neurons per layer in the range of 10 to 50.
• Dropout probability - I chose to test with dropout probabilities between
20% and 40% on my hidden layers, [35].
• Learning rate - I tested with learning rates ranging from 10⁻³ to 10⁻⁶ for
the chosen optimizer, [36].
• Batch size - I chose to test batch sizes ranging from 32 to 128, [37].
• Number of epochs - The number of epochs used for testing varied from 50
to 300.
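As a rough sketch of where these hyperparameters enter the code (the single-LSTM-layer architecture, the layer sizes and the choice of the Adam optimizer are assumptions for illustration, not the final setup):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_features, units=25, dropout=0.3, learning_rate=1e-3):
    model = tf.keras.Sequential([
        layers.Input(shape=(None, n_features)),
        layers.LSTM(units),          # number of neurons in the (hypothetical) LSTM layer
        layers.Dropout(dropout),     # dropout probability
        layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model

# model = build_model(n_features=12)
# model.fit(X_train, y_train, batch_size=32, epochs=300)  # batch size and epochs from the ranges above
```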
As seen in Figure 3.1, the optimizer designed by Elia, a Belgian electricity system
operator, as part of a research project focuses on optimizing the energy usage and
minimizing the overall energy costs for all agreed-upon devices controlled by a
battery in each household. Some households additionally have solar PV systems
installed.
One of the main purposes of the optimizer in this scenario would be to make
efficient use of the energy converted to electricity by the PV. In order to carry
this out, an accurate forecast of how much electricity is generated in this form
throughout different times of the day is absolutely crucial and this thesis contributes
to this project by reviewing and analyzing a few approaches that can be used to
forecast this data for a specified period in the future using deep neural networks.
Precise predictions would lead to less wastage of the excess solar energy converted
to electricity and reduced energy imports from the grid, thereby reducing overall
household energy costs.
The following sections include a more detailed account of how this was done
for one of the households with a PV system in the Limburg region in Belgium. A simple
linear regression model is used as the baseline model and acts as a reference
to compare and evaluate the performance of the chosen methods.
3.2 Data
The models were tested on data over a span of 6 months; January to June 2022.
The variables that were taken into consideration are as seen in Figure 3.3. The
names of the columns are abbreviations of these variables listed below;
• HCC : High Cloud Cover
• LCC : Low Cloud Cover
• MCC : Medium Cloud Cover
• Pressure : PressureReducedMSL
• RH : Relative Humidity
• SDR : Solar Downward Radiation
• Temp : Temperature
• TCC : Total Cloud Cover
• TP : Total Precipitation
• WD : Wind Direction
• WS : Wind Speed
Ref. [49] contains a detailed insight into what these variables mean and how this
data is collected.
3.3 Preprocessing
Raw data, in other words unformatted real-world data, can undoubtedly contain
inconsistent values, errors, outliers and sometimes missing entries. Preprocessing
is essential in order to address these problems and make the data more consistent, [52].
The code used for data analysis and preprocessing can be found at https:
//github.com/Abinaya-J/ThesisFiles/blob/main/HHDataAnalysis.ipynb.
The regional solar forecast data did not have any inconsistent entries; however,
the household PV and the weather forecast data sets did need some preprocessing.
Figure 3.6: Snippet of a few missing values present in the household data
Figure 3.7: Snippet of a few negative values present in the household data
them with the mean of the observed values for the same hour over the month of
occurrence, [56].
There were also a few occurrences of inconsistent values during the late hours on
some days as shown by Figure 3.11 due to errors that didn’t correspond to the
general pattern observed. Again Ref. [55] was used to cross check the sunrise and
sunset times on these days and these inconsistent values were then replaced with
zeroes if they occurred after sunset or before sunrise.
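A pandas sketch of this kind of imputation is shown below; the file name and the column name 'pv_output' are hypothetical placeholders for the household PV data:

```python
import pandas as pd

# Household PV data with a DatetimeIndex (file and column names are placeholders).
df = pd.read_csv("household_pv.csv", index_col=0, parse_dates=True)

# Mean of the observed values for the same hour within the month of occurrence.
hourly_monthly_mean = df.groupby([df.index.month, df.index.hour])["pv_output"].transform("mean")
df["pv_output"] = df["pv_output"].fillna(hourly_monthly_mean)
```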
Another important thing to check in time series data is whether trend and seasonality
are present, [57]. This is necessary in order to see what effect these components
have on the target variable, and removing them can reveal other interesting facts
about the data. Since it was quite hard to decide on this just by observing
Figure 3.10, I used the Augmented Dickey-Fuller (ADF) test for further
verification, [58].
The p-value (the probability of observing such a result under the null hypothesis,
which states that the data is non-stationary, i.e., contains trend and seasonality
components) as observed in Figure 3.12 is very small, and the test statistic is less
than all critical values at all confidence levels, so the data used is stationary and
no further preprocessing steps are needed.
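The ADF test itself is available in statsmodels; a minimal sketch of the check described above (reusing the placeholder column name from the preprocessing sketch):

```python
from statsmodels.tsa.stattools import adfuller

stat, p_value, _, _, critical_values, _ = adfuller(df["pv_output"].dropna())
print(f"ADF statistic: {stat:.3f}, p-value: {p_value:.4f}")
for level, value in critical_values.items():
    print(f"critical value ({level}): {value:.3f}")
```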
Figure 3.10: Snippet of the household data line graph without negative entries
• Binary indicator - The information obtained about the sunrise and sunset
times for each day in the data set from Ref. [55] was used to add in a binary
indicator variable where 1 represented all timestamps in between the sunrise
and sunset times and 0 otherwise.
When predicting values for 24 hours in the future, it is not possible to have the
actual observed value for timestamps over the last 24 hours with respect to the
target timestamp. Therefore, based on Figures 3.13 and 3.14, the lagged variables
and the rolling window statistics were chosen from a period of 4 days before each
timestamp (the first few spikes with high autocorrelation values), [60].
Figure 3.13 clearly shows us that the highly correlated values with the target
timestamp from any chosen day are the values that correspond to the target
timestamp (96th lag), one timestamp before (95th lag) and after this (97th lag) on
the previous day. The mean of these values corresponded to the rolling window
statistics that were included.

Figure 3.13: Autocorrelation plot obtained for values of the target variable with
102 previous timestamps
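A pandas sketch of these lag and rolling-window features follows (reusing the placeholder data frame from the preprocessing sketch; with 15-minute data, 96 steps correspond to 24 hours, so lags 95-97 sit around the same time on the previous day):

```python
# Lagged values of the target variable around the same timestamp on the previous day.
df["lag_95"] = df["pv_output"].shift(95)
df["lag_96"] = df["pv_output"].shift(96)
df["lag_97"] = df["pv_output"].shift(97)
# Mean of these highly correlated lags as a rolling-window statistic.
df["prev_day_mean"] = df[["lag_95", "lag_96", "lag_97"]].mean(axis=1)
```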
3.4.2 Normalization
Since the range of values for each feature in our data set varied, normalization was
essential in order to prevent features with larger values from having a bigger influence
on the final output, [61].
Each feature and the target variable were standardized as shown below, so that each
follows a normal distribution with zero mean and unit variance. Each variable's
mean value is denoted by x̄ and σ denotes its standard deviation:

x′ = (x − x̄) / σ.
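This standardization can be carried out with scikit-learn's StandardScaler, which implements exactly x′ = (x − x̄)/σ per feature; a minimal sketch with placeholder feature matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 5) * 50   # placeholder training feature matrix
X_test = np.random.rand(20, 5) * 50     # placeholder test feature matrix

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics computed on the training data
X_test_scaled = scaler.transform(X_test)        # same statistics reused on the test data
```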
Figure 3.14: Autocorrelation plot obtained for values of the target variable with timestamps over 2 weeks before
As illustrated in Figure 3.15, the test data included inputs for 24 hours in the
future each time but two major differences in our use case were that:
• Test data sets overlapped since the model was run every 15 minutes during
the day and predicted values for 24 hours ahead.
• The train data set was fixed, since retraining the model with just one more
entry every 15 minutes would have required more computational time than is
feasible within 15 minutes.
Model performance on the test data was used to choose the best combination of
hyperparameters for each model. Each model along with this combination was
then fitted on all 6 months of data (January to June) and was then made to predict
for the first two weeks of July (unseen data) and the results obtained were used to
compare the suitability of different models.
4. Results
The following four error metrics, [64, 65, 66] were used to assess and compare the
performance of the chosen models.
In all of the formulas below, yi denotes the actual value, ŷi denotes the forecasted
value, n denotes the number of timesteps to predict for in the future, ȳ denotes
the mean actual value over the n timestamps and k denotes the number of input
features.
• Root mean squared error (RMSE) - A quadratic scoring rule that
measures the average magnitude of the error.
RMSE = √( (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)² ).
adjusted R² = 1 − ( (1 − R²)(n − 1) ) / (n − k − 1).
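A small sketch of how these metrics can be computed with NumPy (MAPE is restricted to timestamps with non-zero actual values, as explained below):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    mask = y_true != 0                      # only instances with a non-zero actual value
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))

def adjusted_r2(y_true, y_pred, k):
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
```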
Adjusted R2 scores were used when testing with the baseline model in order to
choose the best subset of input features and then the rest of the models were tested
on the same subset but with varying combinations of hyperparameter values via
the following method.
To begin with, each model was tested with a subset of hyperparameters that resulted
in a model that was neither too simple nor too complex, in order to avoid underfitting
or overfitting, given that we were dealing with small amounts of data. The top 20
combinations of hyperparameters were then filtered based on the lowest MAPE
value in the results obtained from the model's predictions for the test data set.
This information was then used to test simpler and more complex models with the
same test data set in order to see if they performed better, and the best model was
chosen to be the model that produced the lowest MAPE value overall.
Although two other metrics were also used to assess model performance, MAPE
was chosen to be the deciding metric based on the following factors, [67]:
• The fact that it returns the error as a percentage when multiplied by 100
makes it easier to interpret the model’s ability to predict and compare its
performance across a range of hyperparameter values.
• MAPE is calculated in this scenario by taking only instances where the actual
value was non-zero and these are the crucial times of the day when we expect
the predictions made by the model to be as accurate as possible. Given that
there are fewer data points of this nature overall, it is important that the
model learns these equally well and is able to identify the pattern/correlation
between the feature and output values corresponding to these data points
and MAPE gives us a good idea of how well the model can do this.
• MAPE is not sensitive to the magnitude of the observed and predicted values
(it is scale-independent), which is another advantage, especially given that the
PV output can vary largely across the months of the year. This makes it an
ideal metric to assess and compare model performance across months/seasons
or when more data is used for training in the future.
4.1 Baseline model
The loss/error function I chose to use when training the neural networks was mean
squared error (MSE), the squared version of RMSE, [68]. The code used for the
training and testing of models can be found at https://github.com/Abinaya-J/
ThesisFiles/blob/main/ModelTraining.ipynb
In order to see if removing the less significant features improved the adjusted R²
scores, I first chose to drop the features whose regression coefficients were less
than 0.01 in absolute value. As per Figure 4.1, the features HCC, MCC, TCC, TP, WS, hour,
minute and prev4 were dropped and Figure 4.3 contains the error metrics obtained
as a result.
Figure 4.3: Test results after the first set of features were dropped
When comparing the results in Figures 4.2 and 4.3, it is clear that removing features
did result in a slight improvement of the adjusted R², RMSE and MAPE scores.
I therefore tried to refine the subset further by removing the features Pressure,
SDR, dayofmonth and prev2mean, whose regression coefficients were less than 0.02
in absolute value, but, as seen in Figure 4.4, this did not improve the results overall,
and therefore the first subset of features was used to test the rest of the models.
Figure 4.4: Test results after the second set of features were dropped
An overview of the results obtained using the baseline model (linear regression
model fitted to the chosen subset of features) in order to predict for two weeks of
unseen data can be observed in Figures 4.5 and 4.16.
The predicted values plotted in the final prediction plot corresponding to each model
refer to the latest predictions made by the respective model for each corresponding
timestamp.
4.2 LSTM
Initially, a simple LSTM with one hidden layer with 25 and 10 neurons in its
first and second layers respectively was tested with a range of values for other
hyperparameters.
Figure 4.5: Final prediction error metrics using the linear regression model
Figure 4.6 was referred to when choosing a suitable subset of values to test further
for each hyperparameter for LSTMs with no hidden layers (Figure 4.7) and two
hidden layers (Figure 4.8). As a result, dropout probabilities of 0.3 and 0.4 were
chosen, along with learning rates of 0.01 and 0.001 and a batch size of 32, and only
300 epochs were tested. I additionally tested models that contained 50 neurons in
their first LSTM layer.
Having compared all the results obtained, the best result achieved via an LSTM
when compared to that of the baseline model corresponded to the first row in
Figure 4.8. This model was then used to predict for the two weeks of unseen data
and produced the results shown in Figures 4.9 and 4.17.
4.3 CNN
To begin with, a simple CNN with one hidden layer with 32 and 16 filters in its
first and second layers respectively was tested with the default kernel size of 1 and
a range of values for other hyperparameters.
Figure 4.10 was referred to when choosing a suitable subset of values to test further
for each hyperparameter for CNNs with no hidden layers (Figure 4.11) and two
hidden layers (Figure 4.12). As a result, dropout probabilities of 0.3 and 0.4 were
chosen, along with learning rates of 0.01 and 0.001, kernel sizes of 1 and 2, and a
batch size of 32, and only 300 epochs were tested.
Having compared all the results obtained, the best result obtained via a CNN when
compared to that of the baseline model corresponded to the first row in Figure
4.11. This model was then used to predict for the two weeks of unseen data and
produced the results shown by Figures 4.13 and 4.18.
4.4 CNN-LSTM
As a first step, a simple CNN-LSTM model that consisted of one CNN layer with
16 filters and the default kernel size of 1 and one LSTM layer with 10 neurons was
tested with a range of values for other hyperparameters.
Figure 4.6: The top 20 entries corresponding to the lowest MAPE scores for the
LSTM with one hidden layer
Having compared all the results obtained, the best result obtained via a CNN-
LSTM when compared to that of the baseline model corresponded to the first row
in Figure 4.14. This model was used to predict for the two weeks of unseen data
and produced the results shown by Figures 4.15 and 4.19.
Figure 4.7: Results obtained using a LSTM with no hidden layer and the chosen
subset of hyperparameters
Figure 4.8: Results obtained using a LSTM with two hidden layers and the chosen
subset of hyperparameters
Figure 4.9: Final prediction error metrics using the LSTM model
Figure 4.10: The top 20 entries corresponding to the lowest MAPE scores for the
CNN with one hidden layer
Figure 4.11: Results obtained using a CNN with no hidden layer and the chosen
subset of hyperparameters
Figure 4.12: Results obtained using a CNN with two hidden layers and the chosen
subset of hyperparameters
Figure 4.13: Final prediction error metrics using the CNN model
Figure 4.14: The top 20 entries corresponding to the lowest MAPE scores for the
CNN-LSTM with one CNN layer and one LSTM layer
Figure 4.15: Final prediction error metrics using the CNN-LSTM model
5. Evaluation
To start off, let us analyze the suitability of the various models chosen for this task
based on how well they could predict for unseen data.
Figure 5.1: Final error metric scores obtained using each model during prediction
Figure 5.1 clearly shows that the CNN achieved the lowest MAPE and RMSE
scores and also the highest R² score, thereby proving to be the best model. Neither
the LSTM nor the CNN-LSTM performed well enough to justify choosing them over
the baseline model. Although the LSTM did achieve a better MAPE score than
the baseline model, this was not observed in the other two error
metrics.
In order to get an in-depth picture of how the different models performed when
compared with one another during both testing and prediction, we will also take a
look at the final test and pred MAPE values (since this metric was used to choose
the final model used for prediction), the optimal hyperparameter values and the
final plots obtained using each of them.
When comparing the final MAPE scores obtained when testing (column 1 in Figure
5.2), the CNN-LSTM evidently outperformed the baseline model itself, while both
the LSTM and CNN models did not. In contrast, looking closely at
the MAPE values obtained when using these models to predict for unseen data
(column 2 in Figure 5.2), both the LSTM and CNN models outperform the baseline
Figure 5.2: Final MAPE scores obtained using each model during testing and
prediction
whereas the CNN-LSTM does not. This is one clear example of the CNN-LSTM
model being too complex for limited data, which has in turn resulted in overfitting;
this model therefore fails to generalize when exposed to new data, whereas the
LSTM and CNN models seem to generalize well.
Another important fact to note is that the best-performing LSTM model needed
two hidden layers whereas the CNN needed no hidden layer, which means complexity
does have an impact on performance depending on the model. The optimal
hyperparameter values for the ones that were common to all models were identical
for the LSTM and CNN models (dropout probability of 0.3, learning rate of 0.01,
a batch size of 32 and 300 epochs) but mostly varied for the CNN-LSTM model
(dropout probability of 0.3, learning rate of 0.0001, a batch size of 64 and 100
epochs).
Closely observing the final plots obtained, Figures 4.16, 4.17, 4.18 and 4.19, they
match the results mentioned above. Figure 4.18 clearly establishes how the CNN
model outperforms the baseline model and predicts both at the peaks and troughs
better than both the LSTM and CNN-LSTM models. As shown in Figure 4.17,
although the LSTM performed better than the CNN-LSTM model, it still does
not manage to predict well at all peaks and at most troughs. Lastly, Figure 4.19
reveals how the CNN-LSTM fails to predict well at almost all peaks and troughs.
6. Conclusion
In an attempt to find an answer to the question Can deep learning (or combining
best known methods) yield accurate predictions regarding the PV output
needed to maintain the balance between the production and usage of
solar energy in households with limited input? by mainly referring to the
findings in Ref. [4], the LSTM model, the CNN model and a combination of the two,
the CNN-LSTM model, were tested for their ability to predict future values when
provided with limited data to learn from, and were compared against a simple linear
regression model that was used as a baseline. The CNN model performed the
best when compared to the other two models used, but this can certainly change
depending on the features and hyperparameters chosen to work with. Recent
advancements in machine learning based techniques, particularly deep learning
algorithms, have caused these methods to gain popularity among researchers for
time series forecasting, and they have also proved to be as effective and accurate as
traditional methods, which I think was quite evident from the findings presented in
this thesis. However, in order to deploy these models instead of simple models like
the baseline, a lot more testing with new data is required to see whether they
consistently outperform the baseline model, and not just by a minor difference in
the error metrics, mainly because of the extra time and resources needed to train and
update these models.
6.1 Future Work
The subset of features chosen to test the models with could also be varied in order to
check if this would have an impact on the models' ability to predict future values.
The models could also be made to predict for shorter periods in the future, for
example 6 hours instead of 24, which could also improve accuracy.
References
[9] Katrina Wakefield. “A guide to machine learning algorithms and their ap-
plications”. 2019. url: https://www.sas.com/en_gb/insights/articles/
analytics/machine-learning-algorithms.html.
[10] Emma Juliana Gachancipa Castelblanco. How to choose an activation func-
tion? May 2020. url: https://www.linkedin.com/pulse/how-choose-
activation-function-emma-juliana-gachancipa-castelblanco.
[11] Warren E. Agin. “A Simple Guide to Machine Learning”. In: Business Law
Today (2017), pp. 1–5. issn: 10599436, 23758112. url: https://www.jstor.
org/stable/90003559.
[12] Simeon Kostadinov. “Understanding Backpropagation Algorithm”. Aug. 2019.
url: https://towardsdatascience.com/understanding-backpropagation-
algorithm-7bb3aa2f95fd.
[13] Michael A Nielsen. “Neural Networks and Deep Learning”. Dec. 2019. url:
http://neuralnetworksanddeeplearning.com/chap2.html.
[14] Sagar Sharma. Epoch vs Batch Size vs Iterations. Sept. 2017. url:
https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9.
[15] Catherine F. Higham and Desmond J. Higham. “Deep Learning: An Intro-
duction for Applied Mathematicians”. In: SIAM Review 61.3 (Jan. 2019),
pp. 860–891. doi: 10.1137/18m1165748. url: https://arxiv.org/pdf/
1801.05894.pdf.
[16] datahacker.rs. “Gradient Descent”. Oct. 2018. url: https://datahacker.
rs/gradient-descent/.
[17] Jason Brownlee. How to Avoid Overfitting in Deep Learning Neural Networks.
Dec. 2018. url: https://machinelearningmastery.com/introduction-
to-regularization-to-reduce-overfitting-and-improve-generalization-
error/?source=post_page-----e05e64f9f07----------------------.
[18] Artem Oppermann. Regularization in Deep Learning — L1, L2, and Dropout.
Aug. 2020. url: https://towardsdatascience.com/regularization-in-
deep-learning-l1-l2-and-dropout-377e75acc036.
[19] Yash Bohra. “Vanishing and Exploding Gradients in Deep Neural Networks”.
June 2021. url: https://www.analyticsvidhya.com/blog/2021/06/the-challenge-of-vanishing-exploding-gradients-in-deep-neural-networks/.
[20] Aurélien Géron. Hands-on machine learning with Scikit-Learn and TensorFlow
concepts, tools, and techniques to build intelligent systems. O’Reilly Media,
Inc., Sept. 2019. isbn: 9781492032649. url: https://www.knowledgeisle.com/wp-content/uploads/2019/12/2-Aur%C3%A9lien-G%C3%A9ron-Hands-On-Machine-Learning-with-Scikit-Learn-Keras-and-Tensorflow_-
[33] Nir Arbel. How LSTM networks solve the problem of vanishing gradients.
May 2020. url: https://medium.datadriveninvestor.com/how-do-lstm-
networks-solve-the-problem-of-vanishing-gradients-a6784971a577.
[34] Harpreet Singh Sachdev. Choosing number of Hidden Layers and number
of hidden neurons in Neural Networks. Jan. 2020. url: https://www.linkedin.com/pulse/choosing-number-hidden-layers-neurons-neural-networks-sachdev#:~:text=1%20Well%20if%20the%20data%20is%20linearly%20separable.
[35] Jason Brownlee. Dropout Regularization in Deep Learning Models With
Keras. June 2016. url: https://machinelearningmastery.com/dropout-
regularization-deep-learning-models-keras/.
[36] Saulo Barreto. Choosing a Learning Rate | Baeldung on Computer Science.
Nov. 2021. url: https://www.baeldung.com/cs/ml-learning-rate.
[37] Enes Zvornicanin. Relation Between Learning Rate and Batch Size | Baeldung
on Computer Science. Jan. 2022. url: https://www.baeldung.com/cs/
learning-rate-batch-size.
[38] Renu Khandelwal. Convolutional Neural Network(CNN) Simplified. Oct.
2018. url: https://medium.datadriveninvestor.com/convolutional-
neural-network-cnn-simplified-ecafd4ee52c5.
[39] Swarnima Pandey. How to choose the size of the convolution filter or Kernel
size for CNN? July 2020. url: https://medium.com/analytics-vidhya/how-to-choose-the-size-of-the-convolution-filter-or-kernel-size-for-cnn-86a55a1e2d15.
[40] Krisha Samir Mehta. Weights & Biases. Feb. 2022. url: https://wandb.ai/krishamehta/seo/reports/Difference-Between-SAME-and-VALID-Padding-in-TensorFlow--VmlldzoxODkwMzE.
[41] Python. Welcome to Python.org. May 2019. url: https://www.python.org/.
[42] Numpy. NumPy. 2009. url: https://numpy.org/.
[43] Pandas. Python Data Analysis Library — pandas: Python Data Analysis
Library. 2018. url: https://pandas.pydata.org/.
[44] Scikit-Learn. User guide: contents — scikit-learn 0.22.1 documentation. 2019.
url: https://scikit-learn.org/stable/user_guide.html.
[45] TensorFlow. Effective TensorFlow 2 | TensorFlow Core. url: https://www.
tensorflow.org/guide/effective_tf2.
[46] Plotly. Plotly Python Graphing Library. url: https : / / plotly . com /
python/.
[47] Vidhyashankar Venkatachalaperumal and Afshin Bakhtiari. “How Photo-
voltaic modules operate in different weather”. June 2021. url: https://ae-
solar.com/solar-panels-in-different-weather/.
[48] url: https://api.rebase.energy/weather/docs/v2/#hirlam-fmi-identifier-fmi_hirlam.
[49] url: https://api.rebase.energy/weather/docs/v2/#variables-3.
[50] url: https://www.elia.be/en/grid-data/power-generation/solar-
pv-power-generation-data.
[51] url: https://www.elia.be/en/grid-data/power-generation.
[52] Neha Seth. What Is Data Preprocessing in Machine Learning, and Its Im-
portance? Nov. 2021. url: https://www.analytixlabs.co.in/blog/data-
preprocessing-in-machine-learning/.
[53] Will Badr. Why Feature Correlation Matters . . . . A Lot! Jan. 2019. url:
https://towardsdatascience.com/why-feature-correlation-matters-
a-lot-847e8ba439c4.
[54] Aishwarya V. Srinivasan. Why exclude highly correlated features when build-
ing regression model ?? Sept. 2019. url: https://towardsdatascience.com/why-exclude-highly-correlated-features-when-building-regression-model-34d77a90ea8e.
[55] url: https://sunrise-sunset.org/api.
[56] Akshita Chugh. How to deal with missing values in data set ? Jan. 2021. url:
https://medium.com/analytics-vidhya/how-to-deal-with-missing-values-in-data-set-8e8f70ecf155#:~:text=%20How%20to%20deal%20with%20missing%20values%20in.
[57] Shay Palachy. Detecting stationarity in time series data. Nov. 2019. url:
https://towardsdatascience.com/detecting-stationarity-in-time-
series-data-d29e0a21e638.
[58] Selva Prabhakaran. Augmented Dickey Fuller Test (ADF Test) – Must Read
Guide. Nov. 2019. url: https://www.machinelearningplus.com/time-
series/augmented-dickey-fuller-test/.
[59] Jason Brownlee. Basic Feature Engineering With Time Series Data in Python.
Dec. 2016. url: https://machinelearningmastery.com/basic-feature-
engineering-time-series-data-python/.
[60] Alan Anderson and David Semmelroth. Autocorrelation Plots: Graphical
Technique for Statistical Data. Mar. 2016. url: https://www.dummies.com/article/technology/information-technology/data-science/big-data/autocorrelation-plots-graphical-technique-for-statistical-data-141241/.
[61] Urvashi Jaitley. Why Data Normalization is necessary for Machine Learning
models. Oct. 2018. url: https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029.