Big Mart Sales Prediction Using Machine Learning: Project Report
CONTENTS
Certificate
Acknowledgement
Table of Contents
List of Abbreviations
Abstract
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Objectives
1.4 Methodology Used
REFERENCES
LIST OF ABBREVIATIONS
ABBREVIATION WORD
ML MACHINE LEARNING
EDA EXPLORATORY DATA ANALYSIS
FIG FIGURE
DESCR DESCRIPTION
RF RANDOM FOREST
LR LINEAR REGRESSION
LIST OF FIGURES
Fig 4.3 Performance of Hyperparameter Tuning
Fig 4.4 Performance of Decision Tree
Fig 4.5 Performance of XGBoost Regression
Fig 4.6 Performance of Ridge Regression
LIST OF GRAPHS
Fig 1.1 Process of building a model
Fig 1.2 Working procedure of proposed model
Fig 3.1 Line of regression
Fig 3.2 Flowchart of Random Forest
Fig 3.3 Relationship between feature importance and F score in hyperparameter tuning
Fig 3.4 Types of XGBoost Regression
LIST OF TABLES
Table 4.1 Algorithms Performance
ABSTRACT
Many shopping malls now track individual item sales data in order to forecast future customer demand and manage inventory. To stay ahead of the competition and earn more profit, a store needs a model that can predict the sales of the various products present in it. Machine Learning (ML), the field of computer science that gives computers the ability to learn without being explicitly programmed, is the key tool for this task. Using ML concepts together with the basics of data science, one can build a model that predicts Big Mart's sales. Given the increasing competition among shopping complexes, such a predictive model yields useful insights that help maximize profit and stay ahead of the competition.
Chapter 1
INTRODUCTION
1.1 INTRODUCTION
Competition among malls, including large ones, is becoming more and more intense because of the rapid rise of international supermarkets and online shopping. Every mall or mart tries to attract more customers on a daily basis by offering personalized, short-term discounts and benefits, so the sales of each item must be forecast and managed through functions such as corporate asset management, logistics, and transportation services. Modern machine learning algorithms are sophisticated and provide strategies for forecasting a company's long-term sales demand, which also helps with budgeting and planning.
In this report we address the problem of predicting sales for a large mart, that is, forecasting customers' future demand for items in several supermarkets at various locations, based on previous records. Various ML algorithms such as linear regression and random forest are used to predict sales volume. Since sales are the lifeblood of any organization, sales forecasting plays an important role in every shopping mall: it helps develop business strategies, identify useful markets, and improve market knowledge. Regular sales-forecasting research enables in-depth analysis of pre-existing conditions, and the resulting assumptions inform decisions on customer acquisition, funding shortfalls, and capacity before budgets and marketing plans are set for the coming year.
In other words, sales forecasts are based on past performance. In-depth knowledge of the past is required to develop and enhance market opportunities under any circumstances, especially in the external environment, and it allows a business to prepare for its future needs. Extensive research is ongoing in the retail domain on predicting long-term sales demand. Mathematical methods, also called conventional methods, have been an important and effective way to forecast sales, but they take more time and cannot manage non-linear data; machine learning techniques overcome these problems, handling not only non-linear data but also large datasets well.
Due to increasing competition, many malls and big marts are trying their best to stay ahead. To find out which factors affect Big Mart's sales and which strategies yield more profit, one needs a reliable model. A predictive model can therefore be built to extract useful information and increase profit.
1.3 OBJECTIVES
b) To understand the key features that are responsible for the sale of a particular product.
c) To find the best algorithm that predicts sales with the greatest accuracy.
1.4 METHODOLOGY
Figure 1.1 represents the steps of building a model. The following steps were followed while creating the model.
1. Data collection - The first step of every project is to collect the data. We collected our data from Kaggle:
https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data/code
2. Data preprocessing - In this step we clean the dataset, for example by checking for missing values and handling any that are present. In our dataset the Item_Weight and Outlet_Size attributes had missing values.
3. EDA - Exploratory data analysis is one of the most important parts of data analysis; it is needed to gain important insights into the data. In this project we used two libraries, klib and D-Tale.
4. Testing various algorithms - Various algorithms, such as simple LR and the XGBoost algorithm, were applied to find out which one best predicts the sales.
5. Building the model - After completing all the previous phases, the dataset is ready for the next phase, building the model. Once built, the model is ready to be used as a predictive model to forecast Big Mart sales.
6. Web deployment - Finally, to make the predictions more user-friendly, the model was deployed on the web.
CHAPTER 2
LITERATURE SURVEY
Kadam, et al. [1] observed that when Big Mart sales were predicted using algorithms such as random forest and LR, the accuracy was low. To overcome this problem, the XGBoost algorithm can be used, which not only gives better accuracy but is also more efficient.
Makridakis, et al. [2] surveyed forecasting methods and applications in settings with scarce historical data and short product life cycles. In such consumer-focused markets demand is uncertain, which makes accurate prediction of outcomes difficult.
Das, et al. [4] predicted retail sales of footwear using recurrent and feedforward neural networks. Since neural networks are not an efficient method for this prediction task, the XGBoost algorithm can be used instead.
S. Cheriyan, et al. [5] implemented three ML algorithms on a given dataset and evaluated the resulting models. Based on testing, the algorithm with the highest accuracy, which was found to be gradient boosting, was chosen for the prediction.
A. Krishna, et al. [6] implemented both normal regression and boosting algorithms and found that the boosting algorithms give better results than the regular ones.
CHAPTER 3
SYSTEM DEVELOPMENT
Regression is a parametric technique: a continuous (dependent) variable is predicted on the basis of a given dataset of independent variables. Simple linear regression fits the line

Y = β0 + β1·X + ε

where β0 is the intercept (the value of Y when X = 0), β1 is the slope (the change in Y when X changes by 1 unit), and ε is the error term.
Fig 3.1: Line of regression
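As a minimal sketch of this idea (synthetic data, not the code used in this report), scikit-learn can fit the regression line and recover β0 and β1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from Y = 2 + 3*X plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X.ravel() + rng.normal(0, 0.5, size=100)

# Fit the line Y = beta0 + beta1*X by least squares
model = LinearRegression().fit(X, y)
beta0, beta1 = model.intercept_, model.coef_[0]
```

The fitted coefficients come out close to the true values (2 and 3) used to generate the data.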
Random Forest is a tree-based bootstrapping (bagging) algorithm that combines a certain number of decision trees to build a powerful predictive model. Each individual learner (a decision tree) is grown on a random sample of rows and a randomly selected subset of variables. The final prediction is a function of the predictions made by all the individual learners; in the case of regression, it is the mean of all the trees' predictions.
Fig 3.2 : Flowchart of Random Forest Regression
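A hedged illustration of random forest regression with scikit-learn (synthetic data; the hyperparameter values here are illustrative, not the report's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] ** 2 + 5.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

# 100 decision trees, each grown on a bootstrap sample of the rows and
# a random subset of the features; the forest averages their outputs
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
preds = forest.predict(X[:5])
```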
In ML, hyperparameter optimization (tuning) is the problem of selecting the right set of hyperparameters for a learning algorithm. A hyperparameter is a value used to control the learning process; in contrast, the values of ordinary model parameters are learned from the data.
The same type of ML model may require different weights, learning rates or constraints in order to generalize to different data and information patterns. These settings, called hyperparameters, must be tuned so that the model can solve the ML problem well.
Fig 3.3: Relationship between Feature Importance and their F score in Hyper parameter tuning
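The tuning step above can be sketched with scikit-learn's GridSearchCV (synthetic data; the parameter grid is illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(150, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=150)

# Hyperparameters are fixed before training; grid search tries every
# combination and scores each with 3-fold cross-validation
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```

`grid.best_params_` then holds the combination with the highest cross-validated R² score.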
XGBOOST REGRESSION
XGBoost (extreme gradient boosting) builds an ensemble of decision trees sequentially, where each new tree corrects the residual errors of the ensemble built so far. It applies ridge-style (L2) and lasso-style (L1) regularization to the leaf weights, which in effect performs feature selection and mitigates multicollinearity.
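To illustrate the boosting idea without requiring the third-party xgboost package, the sketch below uses scikit-learn's GradientBoostingRegressor; xgboost's XGBRegressor exposes a similar fit/score interface:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 3))
y = 10.0 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 0.5, size=200)

# Trees are added one at a time, each one fitted to the residual errors
# of the ensemble built so far -- the core idea behind XGBoost as well
gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
).fit(X, y)
train_r2 = gbr.score(X, y)
```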
3.2 PHASE OF MODEL
In our work we used the 2013 Big Mart sales data as our dataset. The dataset contains 12 features: Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type, and Item_Outlet_Sales. Item_Outlet_Sales is the response (target) feature; the other features are used as predictor variables. The dataset has 8523 products in total, across various regions and cities. It covers both store-level and product-level considerations: store-level features include city, population density, store capacity, location, etc., while product-level factors include the product itself, advertising, etc. After all considerations the final dataset is created and then split into train and test parts in a ratio of 80:20.
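The 80:20 split can be done with scikit-learn's train_test_split; the toy DataFrame below is a stand-in, not the real Big Mart data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the Big Mart table (the real dataset has 8523 rows)
df = pd.DataFrame({
    "Item_MRP": [float(i) for i in range(100)],
    "Item_Outlet_Sales": [float(2 * i) for i in range(100)],
})
X = df[["Item_MRP"]]
y = df["Item_Outlet_Sales"]

# 80:20 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```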
Fig 3.6 : How libraries, train and test datasets are imported.
Fig 3.8 : Description of dataset using info() method
In figure 3.8 we can clearly see that there are 12 features in total, of which 5 are numeric and 7 are categorical.
Fig 3.9 : Description of dataset using describe() method
In figure 3.9 we can see that the Item_Visibility feature has a minimum value of 0.00 and that Item_Weight has a count of 7060.
3.2.2 HANDLING MISSING VALUES
While analyzing the dataset we came across some missing values. To check for missing values we used the following code-
From Fig 3.10 we can clearly see that the columns item_weight and outlet_size have 976 and 1606 missing values respectively.
There are different approaches to handling these missing values, e.g. dropping the rows that contain them or filling them in with suitable values using different methods. Since our dataset has only 8523 rows, dropping rows is not a good option, as it would decrease the prediction accuracy.
Since item_weight is a numerical feature, its missing values are filled using the mean (average) imputation method. Outlet_Size is a categorical feature, so its missing values are filled using the mode imputation method.
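A minimal sketch of both imputation methods with pandas (the toy values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Medium", "Small"],
})

# Numerical column: fill missing values with the column mean
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical column: fill missing values with the most frequent value
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])
```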
Fig 3.14 : After imputation there are no missing values in the item_weight and Outlet_Size columns.
3.2.3 EDA
D-Tale is a powerful Flask- and React-based tool used to analyze and visualize pandas data structures seamlessly.
Fig 3.15: Represents how to import dtale library and display the table
Fig 3.16: The Dtale Window
Fig 3.17: Frequency of values in the Outlet_Size column
Fig 3.18 : This figure represents the Item_Weight value range
Klib is a Python library used for importing, cleaning, analyzing and preprocessing data.
Fig 3.19 : Categorical data plot of all variables present in dataset using Klib Library
Fig 3.20 : Feature- correlation using klib Library
Fig 3.22: Distribution plot for every numeric feature.
From figure 3.23 we can clearly see that the item_visibility attribute has the lowest correlation with the target variable, while Item_MRP has a strong positive correlation with the target variable (0.57).
Data cleaning is the process in which corrupt or inaccurate records, tables or databases are detected and then corrected by replacing, modifying, or deleting the dirty or coarse data.
Fig 3.24 :Cleaning the data using klib library
Fig 3.26 : The 12 features of the dataset, i.e. numerical and categorical
Fig 3.27 : Converting to more efficient data types using convert_datatypes function
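The report uses klib's convert_datatypes; the sketch below shows a comparable conversion in plain pandas (illustrative column names, not the report's code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "x"]})

# Infer tighter nullable dtypes, and store the repetitive string
# column as a category to save memory
df2 = df.convert_dtypes()
df2["b"] = df2["b"].astype("category")
```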
3.2.5 FEATURE ENGINEERING
Feature Engineering is the use of domain knowledge to construct features that make machine learning algorithms work. When feature engineering is done properly, the predictive power of ML algorithms improves, because useful features created from the raw data simplify the ML process. Feature engineering includes correcting incorrect values. In our dataset, Item_Visibility has a minimum value of 0, which is not acceptable because every item must be visible to customers, so zero values are replaced by the mean of the column.
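This replacement can be sketched with pandas (illustrative values; zeros are replaced with the column mean, as described above):

```python
import pandas as pd

df = pd.DataFrame({"Item_Visibility": [0.0, 0.05, 0.10, 0.0, 0.25]})

# A visibility of 0 is not plausible (every stocked item is visible),
# so replace zeros with the mean of the column
col_mean = df["Item_Visibility"].mean()
df["Item_Visibility"] = df["Item_Visibility"].replace(0.0, col_mean)
```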
Fig 3.28 : Label Encoding Code
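A minimal label-encoding sketch with scikit-learn (the category values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Label encoding maps each category to an integer code
sizes = ["Small", "Medium", "High", "Medium"]
le = LabelEncoder()
codes = le.fit_transform(sizes)
# Classes are sorted alphabetically: High -> 0, Medium -> 1, Small -> 2
```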
Fig 3.29 : Splitting of data into train and test data set.
Fig 3.30 : Standardization of dataset
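A sketch of standardization with scikit-learn's StandardScaler (toy values; the key point is that the scaler is fitted on the training set only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5]])

# Fit the scaler on the training data only, then apply the same
# mean/std transform to both train and test sets
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```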
Fig 3.32 Y_train array and Y_test array
In figures 3.33 and 3.34 we just split the train and test data into X_train_std , Y_train,
X_test_std and Y_test.
3.2.6 MODEL BUILDING
Now, after Data Preprocessing and Feature Transformation, the dataset is ready for model fitting. The training set is fed to each algorithm so that it learns to predict the target values; the testing data is then given as input to the built model to predict the target variable. The models are built using:
a) LR
b) RF Regression
c) Hyper Parameter Tuning
d) XGBoost Regression
e) Decision Tree
f) Ridge Regression
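The comparison above can be sketched as follows (synthetic data, not the report's code; XGBoost is omitted here because it is a third-party package, but its XGBRegressor follows the same fit/score pattern):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(500, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}
# Train on the 80% split, then report R^2 on the held-out 20%
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```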
Fig 3.35: Value of R2 in Linear Regression = 0.50
Fig 3.37: Value of R2 = 0.55
Fig 3.39: Value of R2 in Decision Tree = 0.17
Fig 3.40: Value of R2 in Ridge Regression = 0.4916
CHAPTER 4
4. PERFORMANCE ANALYSIS
For performance analysis we can look at the R2 values of the different algorithms and check which one gives the best performance.
LR
RF regression
Decision Tree
XGBoost Regression
Ridge Regression
TABLE 4.1 : Algorithms Performance
From the above table we conclude that the XGBoost algorithm is the most efficient and gives accurate and fast results.
PERFORMANCE ANALYSIS USING GRAPHS
RMSE AND MSE VALUES
Figure 4.7 shows the comparative analysis of RMSE and MSE values. MSE is the mean of the squared differences between the actual and predicted values in the dataset, and RMSE is the square root of MSE. In this experiment the Decision Tree has the highest RMSE and MSE values and XGBoost regression has the lowest.
R2 AND MAE VALUES
Figure 4.8 shows the comparative analysis of R2 and MAE values. MAE is the average of the absolute differences between the actual and predicted values in the dataset, and R2 is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares (the sum of the squared deviations of the data from its mean). In this experiment the Decision Tree has the highest MAE value and XGBoost the lowest, while for R2 XGBoost has the highest value and the Decision Tree the lowest.
Overall, XGBoost shows the best performance, with the lowest RMSE, MSE and MAE and the highest R2.
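The four metrics can be computed directly; the sketch below uses hand-picked toy values (scikit-learn's mean_squared_error, mean_absolute_error and r2_score would give the same results):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = np.mean((y_true - y_pred) ** 2)           # mean squared error
rmse = np.sqrt(mse)                             # square root of MSE
mae = np.mean(np.abs(y_true - y_pred))          # mean absolute error
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # R^2 = 1 - SS_res/SS_tot
```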
CHAPTER 5
5. CONCLUSIONS
5.1 CONCLUSION
From this project we conclude that a smart sales forecasting system is required to manage the vast volumes of data held by business organizations.
The algorithms presented in this report (LR, RF regression, Decision Tree and XGBoost regression) provide an effective basis for decision-making and offer new approaches for better identifying consumer needs and formulating the marketing plans to be implemented.
The outcomes of the ML algorithms applied in this project will help pick the most suitable demand-prediction algorithm, with the aid of which BigMart can prepare its marketing campaigns.
As future scope, this project could be integrated with devices that have built-in intelligence by virtue of the Internet of Things (IoT), making it more convenient to use. Incorporating multiple instance parameters and additional factors would make the sales prediction more innovative and successful, since accuracy, the most important property of any prediction-based system, often increases significantly with the number of parameters.
6. REFERENCES
1. Beheshti-Kashi, S., Karimi, H.R., Thoben, K.-D., Lutjen, M., Teucke, M.: A survey on retail sales forecasting and prediction in fashion markets. Systems Science & Control Engineering 3(1), pp. 154-161 (2015)
2. Bose, I., Mahapatra, R.K.: Business data mining - a machine learning perspective. Information & Management 39(3), pp. 211-225 (2001)
3. Mitchell, T.M.: Machine learning and data mining. Communications of the ACM 42(11), pp. 30-36 (1999)
4. Das, P., Chaudhury, S.: Prediction of retail sales of footwear using feedforward and recurrent neural networks. Neural Computing and Applications 16(4-5), pp. 491-502 (2007)
5. Punam, K., Pamula, R., Jain, P.K.: A two-level statistical model for big mart sales prediction. In: 2018 International Conference on Computing, Power and Communication Technologies (GUCON), IEEE (2018), pp. 617-620