PHYLIS
SC212/1256/2017
A report submitted in partial fulfillment of the requirements for the award of a Bachelor’s
degree in Software Engineering at the Department of Computer Science, School of
Computing and Information Technology, Murang’a University of Technology
2021
DECLARATION
This project is my original work and has not been presented before to the School of Computing
and Information Technology for the award of a Bachelor’s degree in Software Engineering
of Murang’a University of Technology. No part of this report shall be duplicated without my prior
consent.
…………………………… ……………………………..
SIGN DATE
NAME……………………………………
REGISTRATION NUMBER………………………………
SUPERVISOR
……………………………. ………………………….
DEDICATION
I am grateful to my friends, lecturers and family members for their support, whether
informational, financial, educational, physical or of any other kind.
This report exists courtesy of these people, and I dedicate my findings, experience and
achievements to them.
ACKNOWLEDGEMENT
This project would not have been successful without the cooperation and support of a number
of people.
First, I would like to thank Almighty God for the time, good health, continuous grace
and strength that enabled me to complete my research.
Secondly, my gratitude goes to my supervisor for the valuable guidance he gave me and for
assessing my progress during my research.
ABSTRACT
Sales forecasting is an important activity in supermarkets, and it has recently gained immense
popularity as new technologies make it possible to boost market operations and productivity. The
industry has traditionally relied on conventional statistical models, but in recent years machine
learning techniques have received more attention.
The use of traditional statistical methods to forecast supermarket sales has left many challenges
unaddressed and often results in predictive models that perform poorly.
The era of big data, coupled with access to massive computing power, has made machine learning a
go-to approach for sales forecasting.
The objective of this project is to develop a model for predicting sales in supermarkets, taking
into account past sales and the amount spent on advertising.
Using regression analysis, variables such as supermarket type, product price and
supermarket opening year are used to predict sales.
TABLE OF CONTENTS
DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACRONYMS AND ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1: BACKGROUND INFORMATION
1.2: PROBLEM STATEMENT
1.3: OBJECTIVES
1.3.1: General Objectives
1.3.2: Specific Objectives
1.4: Significance of the study
1.5: Scope of the study
1.6: Limitations
CHAPTER 2: LITERATURE REVIEW
2.1: INTRODUCTION
2.2: EXISTING SYSTEMS
2.2.1: Time series forecasting using Artificial Neural Networks Methodologies
2.2.2: Time series sales forecasting for short shelf-life food products based on ANN and evolutionary computing
2.2.3: A survey of machine learning techniques for food sales prediction
2.2.4: Sales prediction for a pharmaceutical distribution company: A data mining based approach
2.2.5: Proposed System
2.3: Existing software design and development tools
2.3.1: Python Programming Language
2.4: Justification
2.5: Conclusion
CHAPTER 3: RESEARCH METHODOLOGY
3.1: Introduction
3.2: Data Collection Techniques
3.2.1: Interview
3.2.2: Questionnaires
3.2.3: Observation
3.2.4: Documents and records
3.2.5: Justification
3.3: Software Development Techniques
3.3.1: Waterfall Methodology
3.3.2: Rapid Application Development Methodology
3.3.3: Agile Methodology
3.3.4: Justification
3.4: System Requirements
3.4.1: Software Requirements
3.4.2: Hardware Requirements
3.4.3: Functional Requirements
3.4.4: Non-Functional Requirements
3.5: Conclusion
Chapter 4: System design, Implementation and Testing
4.1: Introduction
4.2: System design
4.2.1: Logical Design
4.2.2: User Interface Design
4.2.3: Data Design
4.2.4: Process Design
4.3: Implementation Approaches
4.3.1: Multiple Linear Regression Algorithm
4.3.2: Flask framework
4.4: Coding Details and Code Efficiency
4.5: Testing Approach
4.6: Modifications and Improvements
Chapter 5
5.1: Test Reports
5.2: User Documentation
Chapter 6: Conclusions and Future Works
6.1: Conclusion
6.2: Future Works
APP 1: Budget
APP 2: Schedule
References
LIST OF FIGURES
LIST OF TABLES
Table 1: Budget
Table 2: Schedule
ACRONYMS AND ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1: BACKGROUND INFORMATION
Sales prediction is an estimate of the sales volume that a company can expect to attain within the
plan period, based on historical data and industry trends [1]. It is also the determination of a
firm’s share of the market over a specified future period.
In the past, companies produced goods without considering the number of sales and the demand.
For any manufacturer to decide whether to increase or decrease production, data on the demand
for its products on the market is required. Without such data, companies faced losses while
competing in the market because they did not know how much they could sell.
Managers used to make sales predictions by guesswork. Professional managers, however, are
hard to find and are not always available.
In today’s highly competitive and ever-changing consumer landscape, accurate and timely
forecasting of future revenue or sales can offer valuable insight to companies engaged in the
manufacture and distribution of retail goods. Short-term forecasts help with production planning
and stock management, while long-term forecasts support business growth and decision
making.
Sales prediction can be assisted by computer systems that play the qualified manager’s role when
one is not available. One way of implementing such a system is to model a professional
manager’s skill inside a computer program so that a company can obtain better results.
In this project, we propose a predictive model that uses the linear regression technique to predict
sales in a supermarket. The major aim of this machine learning project is to build a predictive
model and to estimate the sales of each product at a particular selected supermarket. Using a
machine learning model, supermarket sales prediction tries to understand the properties of
products and stores that play a key role in increasing product sales. Python is used as the
programming language and Jupyter Notebook as the development tool. To build this application,
a regression task is used to predict the future sales of a given store.
The processes used are data preprocessing, feature engineering, model creation and evaluation.
Supervised learning helps in understanding the flow of data and in estimating sales figures.
The regression task includes data visualization, cleaning and transformation. The linear regression
algorithm will be used in the proposed system.
The approach of using machine learning to predict sales is accurate, simple and flexible. A linear
regression model is useful because it is easy to interpret and shows how each input variable
relates to sales.
The aim of developing a sales prediction system is to enable companies to allocate resources
efficiently for future growth and to manage cash flow. It also helps businesses estimate their costs
and revenue accurately, on the basis of which they can predict their short-term and long-term
performance. The motivation for this project lies in a natural passion for market research.
Regression is an important machine learning technique for this kind of problem. Predicting the
sales of a company requires time series data from the company; based on that data, the model can
predict the future sales of a supermarket or product.
For this sales prediction project, linear regression will be applied and its results evaluated on the
training, testing and validation sets of the data. The main aim of linear regression is to
find the line of best fit between the target variable and the independent variables of the data.
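The idea of a line of best fit can be illustrated with a minimal sketch. It assumes a single input variable (advertising spend) and uses the ordinary least squares formulas; the function names `fit_line` and `predict` and the figures are hypothetical and are not taken from the project's code.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = sum of cross-deviations / sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the target for input x from the fitted line."""
    return slope * x + intercept


# Hypothetical advertising spend and sales figures (illustrative only).
advertising = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]  # constructed as 2 * spend + 5
slope, intercept = fit_line(advertising, sales)
```

With several input variables the same principle applies, but the coefficients are usually found with a library routine rather than by hand.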
Grigorios Tsoumakas [2] surveyed machine learning techniques for forecasting food sales. The
survey addresses the design decisions of a data analyst, such as the choice of output variable and
input variables. The experiments reported used point-of-sale data as internal data, together with
external data covering different environments, to improve the efficiency of demand forecasting.
Algorithms such as boosted decision tree regression and Bayesian linear regression were used.
Most recent studies focused on sales modeling without considering the relationship
between training and test data; they used the training data directly. This causes many errors,
which reduce accuracy.
Clustering techniques have been suggested to separate the entire forecasting data into several
clusters of predictable data before designing predictive models, in order to minimize
computational time and achieve effective forecasting performance.
1.3: OBJECTIVES
To develop a model that can predict sales of products from different supermarkets based on
the amount spent on advertising the items.
The proposed system aims to help supermarkets identify benchmarks, determine the incremental
impact of new initiatives, plan resources in response to expected demand and project future
budgets.
1.5: Scope of the study
The project aims at providing supermarkets with an efficient prediction system for managing
their inventory. The system analyzes current sales, compares them with past sales and predicts
future sales.
The proposed system uses a linear regression machine learning model, implemented in the Python
programming language, to predict sales in supermarkets.
1.6: Limitations
A sales history or past records are essential for a sound forecast. If past data are not
available, the forecast is based on guesswork and may fail.
Since customers’ attitudes may change at any time, the forecast may not be able to predict
customer behavior exactly.
CHAPTER 2: LITERATURE REVIEW
2.1: INTRODUCTION
A literature review is a survey of scholarly sources on a specific topic that provides an overview of
current knowledge, allowing you to identify relevant theories, methods and gaps in existing
research [3].
Due to the importance of forecasting in many fields, many prominent approaches have been
developed. Statistical methods, machine learning methods and hybrid models have all been
applied.
The study reviewed advances in time series forecasting models using artificial neural network
methodologies through a systematic literature review (SLR) based on a manual search of published
papers. It applied the SLR research methodology in the context of software engineering. The
methodology promotes a systematic strategy for defining the research questions, declaring the
search strategy, identifying primary studies, synthesizing data and analyzing data [4].
The objective of this SLR was to identify the most important theoretical contributions to the
development of neural network models for forecasting non-linear time series made in the
period between 2006 and 2016, and also to identify new research problems arising from the
published proposals.
The search process consisted of a manual search of articles published in journals using SCOPUS,
a bibliographic system that includes one of the largest collections of abstracts, bibliographic
references and indexes. Two search criteria were used: the first was non-linear neural models for
forecasting, and the second was neural networks and non-linear time series modeling.
Although there is a very large number of publications on ANNs, few studies
propose new models with appropriate theoretical support. According to Ahmed Tealab,
several quality criteria were used to analyze the best ANN models that can be used in
forecasting.
ERNN-STNN is a model based on Elman recurrent networks and a stochastic time effective
function. The empirical results, analyzed using linear regression, complexity invariant distance
and multiscale complexity invariant distance, show that the proposed neural network performs
better than the back-propagation neural network in financial time series forecasting [5].
A novel neural network technique, the support vector machine (SVM), was applied to examine
its feasibility in financial time series forecasting. SVMs achieve an optimum network structure
by implementing the structural risk minimization principle, which seeks to minimize an upper
bound on the generalization error rather than minimize the training error. SVMs have also been
extended to solve non-linear regression estimation problems [6].
An attempt was also made with ensembles, aiming to improve prediction performance.
Ensembles were recognized as one of the most promising approaches to predictive tasks and as
effective in reducing the variance and bias components of the forecasting error by taking
advantage of diversity among models. Bagging was compared with ARIMA, and positive results
were achieved, showing that the approach can be used as an alternative for forecasting time
series.
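Bagging, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the base learner is a simple nearest-neighbour predictor chosen for brevity (the reviewed study does not specify it here), and the function names and data are hypothetical.

```python
import random
from statistics import mean


def nearest_neighbour_predict(train, x):
    """Base learner: return the y of the training point whose x is closest."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def bagged_predict(data, x, n_models=50, seed=0):
    """Average the base learner's predictions over bootstrap resamples of the data."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        predictions.append(nearest_neighbour_predict(sample, x))
    return mean(predictions)


# Hypothetical (week, sales) observations, illustrative only.
history = [(1, 100.0), (2, 120.0), (3, 90.0), (4, 130.0), (5, 110.0)]
estimate = bagged_predict(history, x=3)
```

Averaging over resamples smooths out the prediction of a high-variance base learner, which is the variance reduction the text refers to.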
Financial time series forecasting is inevitably a focal point for practitioners because of the
availability of data and its profitability [7]. Ensemble algorithms are effective in improving the
performance of base learners in financial time series forecasting. The research compared and
evaluated support vector regression (SVR), the back-propagation neural network (BPNN), the
radial basis function neural network (RBFNN) and bagging.
The authors also experimented with financial time series forecasting using intelligent hybrid
models to overcome the difficulty of capturing the non-stationary property and to identify the
accurate movements. Empirical mode decomposition and support vector regression were used to
evaluate performance.
Advantages
Neural networks can approximate non-linear functions.
Time series analysis allows you to analyze major patterns such as trends, seasonality,
cyclicity and irregularity.
Neural networks are data driven.
Disadvantages
It was observed that the original pattern of the time series of the index is not stationary.
2.2.2: Time series sales forecasting for short shelf-life food products based on ANN and
evolutionary computing
In the retail food industry, the main cause of wasted products and stock-outs is inaccurate
sales forecasting, which leads to incorrect orders. In the fresh food industry in particular,
including refrigerated segments such as dairy, fruit and juice, the need to maintain quality during
the storage and distribution process makes sales forecasting accuracy an important factor in
planning and minimizing wastage.
The authors presented a framework for developing non-linear time series sales forecasting
models that combines two artificial intelligence technologies, namely the radial basis function
neural network and a specially designed genetic algorithm. The methodology was applied
successfully to sales data of fresh milk provided by a major dairy products company [8].
A hybrid system of non-linear methods was used: a genetic algorithm for variable selection and an
adaptive radial basis function (RBF) neural network to model the relationship between the
selected variables and sales volume. To integrate linear and non-linear models, they used ARMA
for linear autoregression and a neural network for modeling the moving average forecasting errors.
RBF networks are non-linear modeling structures with a simple mathematical relationship
between the hidden nodes and the output node. This special structure gives them certain
advantages, including faster training algorithms and more successful forecasting capabilities.
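A minimal sketch of why RBF networks can train quickly: once the hidden-node centres and width are fixed, the output weights follow from a single linear least squares solve. The fixed, evenly spaced centres, the Gaussian width and the sine-wave data are assumptions made for illustration; the reviewed paper uses an adaptive scheme that is not reproduced here.

```python
import numpy as np


def rbf_features(x, centers, width):
    """Gaussian activation of each hidden node for each input value."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-(diff ** 2) / (2.0 * width ** 2))


def fit_rbf(x, y, centers, width):
    """Solve for the linear output weights by least squares."""
    hidden = rbf_features(x, centers, width)
    weights, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return weights


def predict_rbf(x, centers, width, weights):
    """Network output: weighted sum of the hidden-node activations."""
    return rbf_features(x, centers, width) @ weights


# Illustrative non-linear target: one period of a sine wave.
x_train = np.linspace(0.0, 2.0 * np.pi, 50)
y_train = np.sin(x_train)
centers = np.linspace(0.0, 2.0 * np.pi, 10)  # assumed fixed hidden-node centres
weights = fit_rbf(x_train, y_train, centers, width=1.0)
fitted = predict_rbf(x_train, centers, 1.0, weights)
rmse = float(np.sqrt(np.mean((fitted - y_train) ** 2)))
```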
Genetic algorithms are machine learning procedures that derive their behavior from the
process of evolution in nature and are used to solve complicated optimization problems.
The combined GA-RBF method was applied to sales data of fresh milk. It selects the appropriate
factors to be used as inputs to the model.
The problem studied was the evaluation of the forecasting
performance of the GA-RBF methodology on the daily sales of fresh milk in the area of Athens,
Greece, and more specifically on the 11 pack. Daily sales data of the 11 pack for the first few
months of the year were provided by a leading manufacturer of dairy products. The effects of
national holidays on sales were analyzed and accounted for.
Past sales data were also utilized in order to exploit the information they contain. Past sales data
from the current year reflect the changes that have meanwhile occurred in the market and have
affected the level and trend of sales.
The change in trend could be fed into a model by providing it with the percentage change in sales
between the current year and the previous year.
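The year-on-year input mentioned above could be computed as in the following minimal sketch. The function name `percent_change` and the sales figures are hypothetical, and the sketch assumes the two series are aligned day by day.

```python
def percent_change(current, previous):
    """Percentage change of each current-year value relative to the same period last year."""
    changes = []
    for now, before in zip(current, previous):
        if before == 0:
            changes.append(None)  # undefined when last year's sales were zero
        else:
            changes.append(100.0 * (now - before) / before)
    return changes


# Hypothetical daily sales for the same three days in two consecutive years.
last_year = [200.0, 250.0, 0.0]
this_year = [220.0, 225.0, 40.0]
trend = percent_change(this_year, last_year)
```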
Accurate forecasting minimizes lost sales due to lack of products and reduces returns due to
proximity of expiration dates.
GA-RBF utilizes only historical data and therefore does not show how additional information,
such as prices and promotions, can be explicitly taken into account in the development of the
time series model.
The type of non-linearity is not known in advance, hence the model produces an error of about
28.2%.
For time series forecasting to be carried out, historical data covering a long time period is
needed to capture seasonality. When a new product is launched, for example a
perishable good, and a time series exists for a similar product, it may be assumed that
the new product will have a similar sales pattern.
Food sales prediction is concerned with estimating the future sales of companies in the food
industry, such as supermarkets, groceries, restaurants, bakeries and patisseries. Accurate
short-term sales prediction allows companies to minimize stocked and expired products inside
the stores.
This survey reviewed existing machine learning approaches to food sales prediction. It
discussed important design decisions of a data analyst working on food sales prediction, such as
the temporal granularity of sales data, the input variables to use for predicting sales and the
representation of the sales output variable [2].
It reviews the machine learning algorithms that have been applied to food sales prediction and
appropriate measures for evaluating accuracy. It also discusses the challenges and
opportunities for applied machine learning in the domain of food sales prediction.
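As an illustration of accuracy measures, the sketch below implements three that are commonly used for sales forecasts: mean absolute error, root mean squared error and mean absolute percentage error. Choosing these three is an assumption; the text above does not name the specific measures the survey covers, and the figures are hypothetical.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more heavily than MAE."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; skips periods with zero actual sales."""
    terms = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical actual and forecast sales for three periods.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
scores = {
    "mae": mae(actual, predicted),
    "rmse": rmse(actual, predicted),
    "mape": mape(actual, predicted),
}
```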
The author experimented with point-of-sale data as internal data, together with external data from
different environments, to enhance the efficiency of demand forecasting. Different machine
learning algorithms, such as Boosted Decision Tree Regression, Bayesian Linear Regression and
Decision Forest Regression, were considered for evaluation.
The author also researched the prediction of the number of customers coming to restaurants using
Random Forests, k-nearest neighbors and XGBoost. Two real-world data sets from
different booking sites were chosen, and different input variables were derived from restaurant
features. XGBoost was found to be the most appropriate for the data set.
It was observed that sales in regular restaurants are influenced by the weather. Two algorithms
were considered, XGBoost and a neural network, and the results showed that XGBoost was more
accurate and improved the performance of the system. To improve accuracy, numerous variables
such as date characteristics, sales history and weather factors were considered [9].
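The three kinds of input variables named above can be combined into one feature row, as in this minimal sketch. The feature names, the choice of previous-day sales as the history variable and temperature as the weather variable are all assumptions made for illustration.

```python
from datetime import date


def date_features(day):
    """Turn a calendar date into simple numeric inputs for a sales model."""
    return {
        "day_of_week": day.weekday(),  # Monday = 0 ... Sunday = 6
        "month": day.month,
        "is_weekend": int(day.weekday() >= 5),
    }


def build_row(day, previous_day_sales, temperature_c):
    """Combine date, sales-history and weather inputs into one feature row."""
    row = date_features(day)
    row["previous_day_sales"] = previous_day_sales
    row["temperature_c"] = temperature_c
    return row


# Hypothetical observation for Saturday, 1 May 2021.
row = build_row(date(2021, 5, 1), previous_day_sales=320.0, temperature_c=24.5)
```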
However, the study focused on sales without considering the relationship between the training
and testing data. The training data were used directly, causing many errors that reduced
accuracy. Recent studies suggest clustering techniques to separate the entire data set into
several clusters of predictable data before assigning predictive models, in order to minimize
computational time and achieve effective forecasting performance.
2.2.4: Sales prediction for a pharmaceutical distribution company: A data mining based
approach
The authors explored the use of time series data mining techniques for predicting the sales of
individual products of a pharmaceutical distribution company in Portugal [10].
Through data mining techniques, historical product sales data are analyzed to detect
patterns and to make predictions based on the experience contained in the data.
The results obtained with the technique, as well as with the proposed method, suggested that the
modeling performed may be considered appropriate for short-term product sales prediction.
They examined the role of mining prescription data and pharmacy sales data in the
pharmaceutical industry and the various types of techniques that can be used.
They found that most pharmaceutical distribution companies (PDCs) in Portugal still use
heuristics or simple statistical models for their sales forecasting. With access to past sales data
and the use of data mining techniques, almost all companies, and especially pharmaceutical
distribution centers, can make accurate and reliable predictions of future sales. Since sales
prediction should be performed with high accuracy and in a short time, it is impractical to do it
with manual or traditional methods. Data mining techniques enhance accuracy and speed up the
process.
They collected the required data from a large PDC that distributes medicine to customers in a
number of provinces in Iran. After receiving orders, the company is committed to supplying
drugs to provinces within 24 hours, cities within 48 hours and remote areas within 72 hours. In
keeping with its market-leading position, this company needs to hold large product inventories in
order to meet customer demand, as a shortage of drugs is not acceptable.
The company keeps inventories for about 2 months. This causes excessive costs and
investments for Iranian PDCs. Because this gap causes undesired expenses, precise monthly
sales prediction would shorten or even eliminate it.
Owing to constraints on medicine sales data, such as new items with few
past sales records and a great diversity of medicines, their objective was to develop
a novel and accurate sales forecasting method for pharmaceutical products by
means of data mining approaches, in order to overcome the problem of having
numerous kinds of medicine and not having enough past sales records for each medicine.
To predict the sales of the company, past sales records were collected. The company provided the
sales data of nearly 1200 kinds of medicine that were sold to different provinces or centers in Iran
over three years. The company database included the names and codes of medicines, sales
numbers, names and codes of centers, names of manufacturers, prices and monthly dates of sales.
To approach their objective, the code, date and number of products sold were selected from the
database. Three years of monthly sales data were gathered from the PDC. In the preprocessing
phase, the raw data was prepared to suit the research objectives, exploratory analysis was
performed to establish the nature of the data, and a comprehensive graph-based analysis was
performed to find clique sets and group members and to visualize the network of drugs.
Their research verified that applying data mining approaches can considerably improve
forecasting performance, since the approach captured different patterns in the data.
Data mining is not perfectly accurate. If inaccurate information is used in prediction, it
can have serious consequences.
Data mining may violate user privacy, because it collects information about people who use
pharmaceutical products.
model are applied to the training dataset to train the model. The trained model is then tested on
the test dataset and validation dataset to check the accuracy of the model.
[Figure: Raw sales data -> Data cleaning -> Feature extraction and selection -> ML model for classification -> Testing and validation of model]
Pandas
It’s an open source Python package used for data science and machine learning tasks. It is built
on top of NumPy, which provides support for multi-dimensional arrays.
It simplifies the following tasks when working with data: data exploration, data cleaning and
data visualization [12].
Plotly
It’s an open source tool used for data visualization and for understanding data simply and easily.
It supports various types of plots such as line charts, scatter charts, histograms and box plots.
Scikit-Learn
It’s a Python library that provides supervised and unsupervised learning algorithms.
It contains efficient tools for machine learning and statistical modelling, including regression,
clustering and classification.
The proposed system will use regression analysis, which is a supervised learning technique [14].
2.4: Justification
The literature review summarizes and synthesizes the arguments and ideas of existing sales
prediction systems and other prediction systems; it does not add new contributions of its own.
With a thorough knowledge of the gaps exposed in the existing systems, the proposed system
aims to overcome them.
Python will be used to develop the prediction model because its selection of machine
learning-specific libraries and frameworks simplifies the development process and cuts
development time. Python has a simple syntax, and its readability promotes rapid testing of
complex algorithms.
2.5: Conclusion
According to the presented literature review, numerous prediction methods have been offered,
and each method has specific advantages and disadvantages in comparison with other
techniques. However, none of the reviewed studies described the applications of linear
networks in forecasting. They also did not offer a novel technique for handling the problem of
not having enough past records for prediction.
This motivates the use of regression analysis to make precise sales predictions. Regression
analysis is used in determining the strength of predictors, forecasting an effect and forecasting
trends.
With traditional methods not being of much help to business organizations in revenue growth,
machine learning approaches prove to be important for shaping business strategies that take
into consideration the purchase patterns of customers. Predicting sales with respect to various
factors, including the sales of previous years, helps businesses adopt suitable strategies for
increasing sales and establish themselves confidently in a competitive world.
This chapter discusses how data was collected or generated and how it was analyzed. I obtained
data from both primary and secondary sources. The primary sources were more reliable and gave
me confidence in decision making.
3.2.1: Interview
Personal interview: questions are asked directly and in person to the respondent, which gives a
higher response rate.
Telephone interviews are widely used and are easy to combine with online surveys to carry out
research effectively.
Email or web-page interviews: since online research is growing and more consumers are
migrating to a more virtual world, e-mail and web-page interviews are efficient [16].
I was able to gain valuable insights based on the depth of the information gathered and
the wisdom.
Interviews require only simple equipment and build on conversation skills which
researchers already have.
Interviews are more flexible
Direct contact at the point of interview means data can be checked for accuracy and
relevance as they are collected
3.2.2: Questionnaires
A questionnaire is the main instrument for collecting data in survey research. It is a set of
standardized questions, often called items, which follow a fixed scheme in order to collect
individual data about one or more specific topics [17].
Advantages
They result in a wide range of views from customers
Questionnaires are among the most affordable ways to gather quantitative data.
It’s easy and quick to collect results
When data has been quantified it can be used to compare and contrast other research and
may be used to measure change.
Disadvantages
There is a chance that some questions will be ignored and left unanswered
Differences in understanding and interpretation
Questionnaire cannot fully capture emotional responses and feelings
3.2.3: Observation
It’s a technique that involves systematically selecting, watching, listening, reading, touching and
recording behavior and characteristics of living beings, objects or phenomena [18].
Advantages
Disadvantages
Difficulties in quantification
Sample size observed is usually small
There is no opportunity to study the past when using observation method
Some companies still record their sales history in books. I therefore obtained data from their
sales records, which contained sales for every month of the year.
The data obtained was useful for predicting the company's sales for the next year.
Advantages of using Documents and Records
3.2.5: Justification
Since data collection is essential in research, two methods will be used to gather information
for the proposed system: interviews and the use of documents and records.
Interviewing specific persons in a supermarket makes it possible to obtain information such as
how much sales they make weekly, monthly and quarterly, the factors behind high and low
sales, and how a prediction system may help utilize resources if implemented.
Interviews also give first-hand information and help in gaining more insight into the current
systems.
The waterfall model is a linear approach to application development that uses rigid phases: when
one phase ends, the next begins. Steps occur in sequence and, if unmodified, the model does not
allow developers to go back to previous steps [20].
It’s also referred to as the linear-sequential lifecycle model. It follows a structured sequential
path from requirements to maintenance, setting out milestones at each step before the next step
begins [21].
Figure 2: Waterfall model
[21]
Waterfall model divides the entire process of software development into finite
independent stages making controlling of each stage easier.
Requirements are stable and known to the developer at the starting point of the project
Only one stage is processed at a time thus avoiding confusion
It’s simple and easy to implement [22]
RAD is an agile software development approach that focuses more on ongoing software projects
and user feedback and less on following a strict plan [23].
RAD develops software using prototypes and dummy backend databases. Its goal is to meet the
business need of the system, and the customer is heavily involved in the process [24].
Requirement analysis- Developers, clients and team members communicate to determine the
goals and expectations for the project
User Design- involves building out user design through various prototype iterations
Rapid construction- Takes the prototypes and beta systems from design phase and converts them
into a working model.
[26]
RAD lets you break the project into smaller and more manageable tasks
The task-oriented structure allows project managers to optimize their team’s efficiency by
assigning tasks according to members’ specialties and experience.
Clients get a working product delivered in a shorter time frame
Regular communication and constant feedback between team members and stakeholders
increase the efficiency of the design and build process
Disadvantages of RAD
Only systems which can be modularized can be developed using RAD
Agile methodology is a type of project management process, mainly used for software
development, where demands and solutions evolve through the collaborative effort of self-
organizing and cross-functional teams and their customers [27].
Project initiation which is about discussing project vision and ROI justification. Team members,
time and work resources required are determined.
Planning- it is where the team gets together with their sponsor or product owner and identifies
exactly what they are looking for.
Production –a handover with relevant training should take place between the production and
support teams
Retirement – it is the final stage. Customers are notified and informed about migration to newer
releases or alternative options
Agile has several frameworks, such as:
Kanban is a visual method used to paint a picture of the workflow process, with the aim of
identifying any bottlenecks early in the process
FDD (Feature-Driven Development) is a lightweight, iterative and incremental software
development process whose objective is to deliver tangible, working software in a timely manner.
Better product quality- agile methods have excellent safeguards to make sure that quality
is as high as possible
Higher customer satisfaction- by keeping customers involved and engaged.
High team morale-being part of self-managing team allows people to be creative,
innovative and acknowledged for their expertise.
Increased collaboration and ownership- development team, product owner and scrum
master work closely together on a daily basis
3.3.4: Justification
In this project I have used both the agile methodology and the waterfall methodology because:
Agile methodology is suitable for projects which comprise multiple iterations of understanding a
business problem by asking questions, data acquisition from multiple sources, data cleaning,
feature engineering and modelling.
Waterfall methodology is easy to implement and doesn’t need a lot of resources and effort.
3.4.3: Functional Requirements
Functional requirements are the functions or features that a system must include to satisfy the
business needs and be acceptable to the users. The developed system has the following
functional requirements:
3.5: Conclusion
Sales forecasting plays a vital role in the business sector in every field. With the help of the sales
forecasts, sales revenue analysis will help to get the details needed to estimate both revenue and
the income. Linear regression has been evaluated on supermarket sales to find critical factors that
influence sales to provide a solution for forecasting sales.
Chapter 4: System design, Implementation and Testing.
4.1: Introduction.
System design is the process of defining the architecture, product design, modules,
interfaces, and data for a system to satisfy specified requirements [30].
To develop the proposed system, the following process of defining the architecture will be
followed.
This process is iterative in nature, as it trains the model to get the best-suited information for
business purposes, in this case to predict the amount of sales based on the money spent on
advertising.
4.2.2: User Interface Design.
User interface design is the visual layout of the elements that a user might interact with in a
system.
The sales prediction model will have the following layout, where the user can enter the amounts
spent on TV, radio and newspaper advertising in order to predict future sales.
4.2.3: Data Design.
Data design is concerned with how the data is represented and stored within the system.
The dataset used in the model is in tabular form and is stored in database as follows.
4.2.4: Process Design.
Process Design is concerned with how data moves through the system, and with how and where
it is validated, secured and/or transformed as it flows into, through and out of the system.
To develop the proposed system the following process of defining the architecture will be
followed.
Data Collection and Cleaning.
In the proposed system we will use the Advertising dataset given in ISLR (An Introduction to
Statistical Learning) and analyze the relationship between TV, Radio and Newspaper advertising
spend and sales using a multiple regression model.
Once the data has been collected, it is cleaned. Data cleaning is the process of fixing or removing
incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
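The cleaning step described above can be sketched with pandas as follows. This is only a minimal illustration and not the project's actual cleaning code: the helper name clean_advertising_data and the small in-memory table are assumptions, and only the column names follow the advertising dataset used in this report.

```python
import pandas as pd


def clean_advertising_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the advertising data (hypothetical helper)."""
    columns = ["TV", "Radio", "Newspaper", "Sales"]
    cleaned = df[columns].copy()
    # Incorrectly formatted values (e.g. text in a numeric column) become NaN.
    for column in columns:
        cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce")
    # Remove incomplete rows first, then exact duplicates.
    cleaned = cleaned.dropna().drop_duplicates()
    return cleaned.reset_index(drop=True)


# Illustrative rows only; these are not values from the real dataset.
raw = pd.DataFrame(
    {
        "TV": [100.0, 100.0, "n/a", 50.0],
        "Radio": [20.0, 20.0, 10.0, None],
        "Newspaper": [30.0, 30.0, 5.0, 8.0],
        "Sales": [15.0, 15.0, 7.0, 9.0],
    }
)
cleaned = clean_advertising_data(raw)
print(cleaned)
```

In this sketch the row with a non-numeric TV value and the row with a missing Radio value are dropped, and the duplicated first row is kept only once.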
Feature Engineering.
In order to test a feature’s usefulness, we will proceed to split the data, create some models, and
check their efficiency by setting the values for the independent (X) variables and the dependent
(y) variable.
X = dataset[['TV', 'Radio', 'Newspaper']]
y = dataset['Sales']
Split Train/Test
Once the useful features have been identified, we must split our dataset into a Train and Test
dataset.
In the proposed system, we will train the model on the Train dataset and test it on the Test
dataset.
The split can be done by taking 70% of the data for training and 30% for testing.
As shown:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
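To make the 70/30 split concrete, the following sketch runs the same train_test_split call on a synthetic table of 200 rows and prints the resulting shapes. The synthetic data and its coefficients are assumptions made for illustration; they stand in for the advertising dataset and are not project results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (an assumption for illustration).
rng = np.random.default_rng(0)
dataset = pd.DataFrame(
    rng.uniform(0, 100, size=(200, 3)), columns=["TV", "Radio", "Newspaper"]
)
dataset["Sales"] = 5 + 0.05 * dataset["TV"] + 0.1 * dataset["Radio"]

X = dataset[["TV", "Radio", "Newspaper"]]
y = dataset["Sales"]

# test_size=0.3 holds out 30% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)
print(X_train.shape, X_test.shape)
```

Fixing random_state means the same rows end up in the Train and Test sets on every run, which keeps the evaluation metrics comparable between runs.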
Model Tuning.
The proposed model uses the multiple linear regression algorithm to predict sales.
The multiple linear regression (MLR) algorithm is used to estimate the relationship between two
or more independent variables and one dependent variable.
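For the three advertising channels, the MLR relationship has the form Sales = b0 + b1*TV + b2*Radio + b3*Newspaper. The sketch below fits scikit-learn's LinearRegression to synthetic, noise-free data generated from assumed coefficients, to show that the fitted intercept and coefficients recover that equation. The numbers are illustrative assumptions, not results from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (assumed values, not project results).
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(100, 3))  # columns: TV, Radio, Newspaper
true_intercept = 4.0
true_coefs = np.array([0.05, 0.10, 0.01])
y = true_intercept + X @ true_coefs  # noise-free, so the fit should be exact

model = LinearRegression()
model.fit(X, y)

# The fitted equation: Sales = intercept + b1*TV + b2*Radio + b3*Newspaper
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```

On real data the relationship is noisy, so the fitted coefficients are estimates and the evaluation metrics are needed to judge how well the equation approximates the data.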
Simple implementation
Linear regression is a very simple algorithm that can be implemented very easily to give
satisfactory results.
Linear regression fits datasets with a linear relationship well and is often used to find the
nature of the relationship between variables.
Overfitting is a situation that arises when a machine learning model fits a dataset very closely
and hence captures the noisy data as well.
The proposed model will use the following evaluation metrics to measure how well a model
performs and how well it approximates the relationship.
Mean Squared Error (MSE)
It is the most common metric for regression tasks. It has a convex shape. It is the average of the
squared differences between the predicted and actual values.
Mean Absolute Error (MAE)
This is simply the average of the absolute difference between the target value and the value
predicted by the model.
R-squared
This metric represents the part of the variance of the dependent variable explained by the
independent variables of the model. It measures the strength of the relationship between your
model and the dependent variable.
Root Mean Squared Error (RMSE)
This is the square root of the average of the squared difference of the predicted and actual value.
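The four metrics used in this project (MSE, MAE, R-squared and RMSE) can be computed directly from their definitions. The sketch below does so with NumPy on a small made-up set of actual and predicted values; the numbers are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Small made-up example of actual and predicted sales values.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)                      # Mean Squared Error
mae = np.mean(np.abs(errors))                   # Mean Absolute Error
rmse = np.sqrt(mse)                             # Root Mean Squared Error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # R-squared

print(mse, mae, rmse, r2)
```

Every error here has magnitude 1, so MSE, MAE and RMSE all equal 1, while R-squared is 1 - 4/20 = 0.8.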
Final Model
The last step the proposed system will undergo is getting the final model. Once we have obtained
the best tuning for a model, we train that model on the full dataset (Train and Test) in order to
train the model with all the available data.
Finally, the model is ready to predict future sales, so we can enter new advertising budgets and
start showing the predictions.
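A minimal sketch of this final step is given below: the model is refitted on all available rows, serialized with pickle, restored, and used to predict sales for a new set of budgets. The data is synthetic and the coefficients and budget values are assumptions. The sketch pickles to in-memory bytes to stay self-contained, whereas the project saves the model to a file.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the full (train + test) dataset; values are assumed.
rng = np.random.default_rng(1)
X_full = rng.uniform(0, 100, size=(50, 3))  # TV, Radio, Newspaper budgets
y_full = 3.0 + X_full @ np.array([0.05, 0.10, 0.0])

# Refit on all available rows once tuning is finished.
final_model = LinearRegression().fit(X_full, y_full)

# Serialize to bytes and restore, as one would when saving the model to disk.
restored = pickle.loads(pickle.dumps(final_model))

new_budgets = np.array([[100.0, 20.0, 10.0]])  # hypothetical future budgets
prediction = restored.predict(new_budgets)
print(round(float(prediction[0]), 2))
```

Because the restored object is an ordinary fitted estimator, the web application only needs to load it and call predict on the values entered by the user.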
The purpose of the System Design process is to provide sufficient detailed data and
information about the system and its system elements to enable the implementation
consistent with architectural entities as defined in models and views of the system
architecture.
Using the proposed system design we will be able to implement the stated steps to come up
Implementation plan is designed to document, in detail, the critical steps necessary to put your
solutions into practice.
To implement the steps identified in proposed system design the following approaches have
been used.
The main goal of regression is the construction of an efficient model to predict the total sales
from a set of attribute variables, namely the money spent on TV, radio and newspaper
advertising.
4.3.2: Flask framework.
To develop and implement the user interface design, the system uses the Flask framework for
the front end.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
numpy: NumPy stands for Numerical Python, a Python package for the computation and
processing of multi-dimensional and single-dimensional array elements.
pandas: Pandas provides high-performance data manipulation in Python.
matplotlib: Matplotlib is a library used for data visualization. It is mainly used for basic plotting.
Visualization using Matplotlib generally consists of bars, pies, lines, scatter plots, and so on.
seaborn: Seaborn is a library used for making statistical graphics of the dataset. It provides a
variety of visualization patterns. It uses less syntax and has attractive default themes. It is
used to summarize data in visualizations and show the data’s distribution.
dataset = pd.read_csv("advertising.csv")
dataset
Data Inspection
dataset.tail(10)
Data Cleaning
# Outlier Analysis
fig, axs = plt.subplots(3, figsize = (5,5))
plt1 = sns.boxplot(dataset['TV'], ax = axs[0])
plt2 = sns.boxplot(dataset['Newspaper'], ax = axs[1])
plt3 = sns.boxplot(dataset['Radio'], ax = axs[2])
plt.tight_layout()
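The boxplots above flag outliers visually. The same idea can be checked numerically with the common 1.5 x IQR rule, which is also the default whisker length in seaborn's boxplot. The helper name iqr_outliers and the sample values below are assumptions for illustration, not data from the project.

```python
import pandas as pd


def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside 1.5 * IQR of the quartiles (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]


# Made-up newspaper budgets with one obviously extreme value.
newspaper = pd.Series([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 100.0])
outliers = iqr_outliers(newspaper)
print(outliers.tolist())
```

Whether such points should be removed or kept is a modelling decision; the rule only identifies candidates for inspection.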
Exploratory data analysis (EDA) is used to analyze and investigate data sets and summarize
their main characteristics, often employing data visualization methods.
It can also help determine if the statistical techniques you are considering for data analysis
are appropriate.
Splitting datasets.
Setting the values for independent (X) variable and dependent (Y) variable
Data Visualization.
It is the graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an accessible way to see and understand
trends, outliers, and patterns in data.
Scatter plot
Let's see how Sales are related with other variables using scatter plot.
sns.pairplot(dataset, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()
Boxplot
sns.boxplot(dataset['Sales'])
plt.show()
Heatmap
plt.show()
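A heatmap of this kind is usually drawn from the pairwise correlation matrix of the variables. Assuming that is what the heatmap in this project shows, the matrix itself can be computed with pandas as sketched below; the small table is made up for illustration and is not the project's data.

```python
import pandas as pd

# Made-up values: Sales is constructed to rise exactly with TV spend.
data = pd.DataFrame(
    {
        "TV": [10.0, 20.0, 30.0, 40.0, 50.0],
        "Radio": [5.0, 3.0, 4.0, 1.0, 2.0],
        "Sales": [12.0, 22.0, 32.0, 42.0, 52.0],
    }
)
corr = data.corr()  # Pearson correlation by default
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship between two variables, while values near 0 indicate a weak one.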
Model Equation
#Model Evaluation
from sklearn import metrics
meanAbErr = metrics.mean_absolute_error(y_test, y_pred)
meanSqErr = metrics.mean_squared_error(y_test, y_pred)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('R squared: {:.2f}'.format(model.score(X,y)*100))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
import pickle
pickle.dump(model, open('model.pkl','wb'))
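from flask import Flask, request, render_template
import numpy as np
import pickle

app = Flask(__name__)
# Load the model that was saved with pickle.dump
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    # Advertising budgets may be decimal, so parse them as floats
    features = [float(x) for x in request.form.values()]
    final_features = [np.array(features)]
    prediction = model.predict(final_features)
    output = round(prediction[0], 2)
    # prediction_text is the variable shown by index.html; the message wording is illustrative
    return render_template('index.html', prediction_text='Predicted sales: {}'.format(output))

if __name__ == '__main__':
    app.run(debug=True)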
<!DOCTYPE html>
<html >
<!--From https://codepen.io/frytyler/pen/EGdtg-->
<head>
<meta charset="UTF-8">
<title>MODEL FOR PREDICTING SUPERMARKET SALES</title>
<link href='https://fonts.googleapis.com/css?family=Pacifico' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Arimo' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Hind:300' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<body>
<div class="login">
<h1> MODEL FOR PREDICTING SALES</h1>
<br>
<br>
{{ prediction_text }}
</div>
</body>
</html>
Software testing can point out defects and flaws during development.
Different kinds of testing allow us to catch bugs that are visible only during runtime.
The purpose of machine learning testing is to ensure that this learned logic will remain
consistent, no matter how many times we call the program.
Functional Testing
It is a type of software testing that validates the software system against the functional
requirements/specifications.
The purpose of Functional tests is to test each function of the software application, by providing
appropriate input, verifying the output against the Functional requirements.
Black-box testing of machine learning (ML) models refers to testing with no knowledge about
the internal details of the model, such as the algorithm used to create it and the features in it. The
main objective of black-box testing is to ensure the quality of the models in a sustained manner.
Unit tests. The program is broken down into blocks, and each element (unit) is tested separately.
Unit testing involves testing individual units of the source code, such as functions, methods, and
classes, to ascertain that they meet the requirements and produce the expected results.
Each piece of code has been tested individually and its results verified.
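As an illustration of a unit test, the sketch below tests a small input-parsing function with Python's unittest module. The function parse_features is a hypothetical helper introduced only for this example and is not taken from the project's code; the sample input values are likewise illustrative.

```python
import unittest


def parse_features(form_values):
    """Convert form strings to floats (hypothetical helper, not from the project)."""
    return [float(value) for value in form_values]


class ParseFeaturesTest(unittest.TestCase):
    def test_converts_strings_to_floats(self):
        self.assertEqual(
            parse_features(["230.1", "37.8", "69.2"]), [230.1, 37.8, 69.2]
        )

    def test_rejects_non_numeric_input(self):
        with self.assertRaises(ValueError):
            parse_features(["abc", "1", "2"])


# Run the tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseFeaturesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test exercises one unit in isolation, so a failure points directly at the function that needs attention.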
Regression tests. They re-run tests on already tested software to check that it has not broken
after new changes, and they also ensure the quality of the user experience alongside those changes.
Integration tests
These tests aim to determine whether modules that have been developed separately work as
expected when brought together. In terms of a data pipeline, these can check that:
The data cleaning process results in a dataset appropriate for the model
The model training can handle the data provided to it and outputs results (ensuring that
code can be refactored in the future)
The data is consumable by the model (a label exists for every input, the types of the data
are accepted by the type of model chosen)
We are able to refactor our code in the future, without breaking the end to end
functionality.
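The checks listed above can be sketched as a single integration-style function that validates the data and then trains on it. The helper name check_pipeline and the sample rows are assumptions for illustration; only the column names follow the advertising dataset used in this report.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["TV", "Radio", "Newspaper"]
TARGET = "Sales"


def check_pipeline(df: pd.DataFrame) -> LinearRegression:
    """Run the data checks, then train the model (hypothetical helper)."""
    # A label exists for every input row.
    assert df[TARGET].notna().all(), "every row needs a Sales label"
    # The data types are accepted by the model.
    for column in FEATURES + [TARGET]:
        assert pd.api.types.is_numeric_dtype(df[column]), f"{column} must be numeric"
    # Model training can handle the data and outputs results.
    model = LinearRegression().fit(df[FEATURES], df[TARGET])
    predictions = model.predict(df[FEATURES])
    assert len(predictions) == len(df)
    return model


# Illustrative rows only, not the project's data.
sample = pd.DataFrame(
    {
        "TV": [10.0, 20.0, 30.0, 40.0],
        "Radio": [1.0, 3.0, 2.0, 5.0],
        "Newspaper": [4.0, 1.0, 3.0, 2.0],
        "Sales": [5.0, 8.0, 9.0, 13.0],
    }
)
model = check_pipeline(sample)
print(model.coef_.shape)
```

Running such a check after every change to the cleaning or training code gives early warning when a refactoring breaks the end-to-end flow.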
Chapter 5
Test report is a document which contains a summary of all test activities and final test results of a
testing project [34].
Figure 10 Dataset
Outlier Analysis
Correlation between different variables
Model Equation
From the above results, the Multiple Linear Regression model performs well, with an R-squared
value of 90.11%, meaning the model explains about 90.11% of the variance in sales. Also, the
mean absolute error, mean square error, and root mean square error are low.
Deploying the model using Flask and a sample prediction
Figure 12 Model Deployment
The dataset used in this project is from Kaggle.com. You can also create your own dataset.
Anaconda: a scientific Python distribution that comes with the packages needed to build
the model. The packages include pandas, numpy, sklearn and Jupyter Notebook, which is an interactive,
open source web application for creating and sharing documents that integrate live code.
Jupyter Notebook is used to perform tasks such as data cleaning, data transformation, exploratory data
analysis, statistical modelling, machine learning and data visualization.
Visual Studio Code: a code editor optimized for building and debugging modern web
and cloud applications. In this project the user interface has been designed using the Flask
framework.
The interface has fields that enable users to enter the test data. After the test data is entered, the
system predicts the sales.
The model has been trained using the multiple linear regression algorithm.
Chapter 6: Conclusions and Future Works.
6.1. Conclusion
Sales forecasting is a pivotal part of financial planning for any organization. It can be
described as a self-assessment tool which uses the statistics of past and current sales in
order to predict future performance.
Sales forecasting plays an important role in optimizing the supermarket sales process. Financial
and Sales planning with the help of the sales forecasts helps to get the information needed to
predict the revenue as well as the profit.
Thus, in finding such a solution for sales forecasts, the Linear Regression algorithm has been
evaluated on sales data; it can forecast short-term sales and help the organization in
making key decisions. After performing the various statistical tests and computing the
performance metrics, it was found that Linear Regression is a suitable algorithm for the chosen
dataset, thus accomplishing the aim of this project.
In future work one can consider additional performance metrics, such as the time taken to
predict sales. These metrics can play a crucial role in evaluating multiple Machine Learning
algorithms.
One can also attempt to use more accurate data in the continued study. Machine Learning has
the advantage of analyzing data and key variables, so one can aim to develop a systematic
approach using a variety of Machine Learning techniques.
APPENDICES
APP 1: Budget
Table 1: Budget
APP 2: Schedule
Table 2: Schedule
ACTIVITY MARCH APRIL MAY JUNE JULY AUGUST SEPTEMBER
Project identification
System analysis
System Design
Coding and Testing
Implementation
Documentation
Project submission
References
[2] G. Tsoumakas, "A Survey of Machine Learning Techniques for Food Sales Prediction,"
Artificial Intelligence Review, vol. 52, pp. 441-447, 2018.
[4] A. Teelab, "Time Series Forecasting using Artificial Neural Networks," Future Computing
and Informatics Journal, vol. 3, no. 2, pp. 334-340, 2018.
[5] W. F. H. N. Jun Wang, "Financial Time Series Prediction Using Elman Recurrent Random
Neural Networks," Computational Intelligence and Neuroscience, vol. 2016, p. 14, 2016.
[8] A. H. P. Doganis, "Forecasting for shelf life food using AI and evolutionary computing,"
Food Engineering, vol. 75, no. 2, pp. 196-204, 2006.
[9] M. &. H. P. Holmberg, Abstract Machine Learning for Restaurant Forecast, 2018.
[10] I. Ribeiro, "Sales Prediction for a pharmaceutical distribution company," in 11th Iberian
Conference on Information Systems and Technologies, Las Palmas, 2016.
[12] "Pandas and numpy fundamentals," Dataquest Labs, 15 November 2018. [Online]. Available:
https://www.dataquest.io/course/pandas-fundamentals/. [Accessed 8 June 2021].
[14] "scikit learn: Machine learning in python," Dataquest Labs, 15 November 2018. [Online].
Available: https://www.dataquest.io/blog/sci-kit-learn-tutorial/. [Accessed 8 June 2021].
[15] M. Patel and N. Patel, "Exploring Research Methodology," International Journal of
Research and Review, vol. 6, no. 3, March 2019.
[16] "QuestionPro: Types and methods of interview in research," QuestionPro survey software,
[Online]. Available: https://www.questionpro.com/blog/types-of-interviews/. [Accessed 10
June 2021].
[19] S. M. S. Kabir, Methods of Data Collection, Bangladesh: Book Zone Publication, July,
2016.
[21] P. S. Ganney and E. Claridge, Clinical Engineering, UK: Elsevier ltd, 2020.
[23] S. Idesis, "Rapid Application Development: Why RAD and Why Now," 9 October 2020.
[26] "Rapid Application Development: Changing How Developers Work," Kissflow, 31 March
2021. [Online]. Available: https://kissflow.com/low-code/rad/rapid-application-development/.
[Accessed 12 June 2021].
[29] T. Bunsiri, "Benefits of Agile Project Management," APHEIT JOURNAL, vol. 5, no. 1, pp.
23-29, 2016.
17 September 2021].
[31] S. Link, "The Logic of Design as a Conceptual Logic of Information," Minds and Machines,
no. 27, pp. 495–519, 14 June 2017.
[34] T. Hamilton, "Test Summary Reports Tutorial: Learn with Example & Template," Guru 99,
27 August 2021. [Online]. Available:
https://www.guru99.com/how-test-reports-predict-the-success-of-your-testing-project.html.
[Accessed 17 September 2021].
[35] S. m.