ITS665 Report
GROUP PROJECT
CLASS : AS2013B1
GROUP MEMBERS :
NO NAME ID NUMBER
1. AYU AFEZAH BINTI ASNAN SULUNG 2022958459
2. INTAN SYAFIKA BINTI BAHARUDIN 2022919289
3. MAYA SYAMELIA BINTI YAHAYA 2022923665
4. MUHAMMAD MIRZAIEMAN BINTI OTHMAN 2022755525
5. YUSNURDALILA BINTI DALI 2022991343
TABLE OF CONTENTS
1.0 OVERVIEW
4.3 Model Development and Evaluation Comparison Between Full Dataset and Reduced Dataset
PROBLEM STATEMENT
In Malaysia, purchasing power has been rising day by day. Consumer spending in Malaysia averaged RM140,822.48 million in the first quarter of 2022 and recorded a low of RM56,768 million in the second quarter of 2005. The average cost of living in Malaysia varies depending on factors such as lifestyle, location and family size. However, based on recent estimates, a family of four can expect to spend about RM7,000 – RM9,000 per month. The main problem in this study is identifying the best classification method for predicting the country with the highest revenue in the study area. A further problem is gaining clear insight into customers' spending habits across different countries. Finally, there is still little understanding of why people make particular decisions when shopping online or offline.
OBJECTIVES
• To identify the best classification method to predict the highest revenue of a country in the study area.
• To offer unique insight into customers' spending behaviour across countries around the globe.
• To determine the reasons that drive customers' decisions when shopping offline or online.
2.0 DATA ACQUISITION AND DATA UNDERSTANDING
In this report, we obtained the data from the Kaggle website and downloaded the sales dataset provided there (https://www.kaggle.com/datasets/thedevastator/analyzing-customer-spending-habits-to-improve-sa). Based on the information on the Kaggle page, the data was gathered from people in the states of the United States of America and in France, to provide unique insight into customers' spending behaviour across countries. The dataset does not cite any other sources or a digital object identifier (DOI). The data was created by Vineet Bahl using web scraping.
3.0 DATA PREPARATION
Step 1 - Work began in the Preprocess tab. The numeric attributes were studied in the Preprocess tab and their mean, minimum, maximum and standard deviation were recorded. The file was converted from Excel format and opened as a CSV file. From Figure 1.0, the mean for this attribute is 17433, the minimum value is 0, the maximum value is 34866, and the standard deviation is 10065.38. These values are shown in the statistics section of the Preprocess tab.
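As an illustration only (the statistics above were read directly from Weka), the same figures can be reproduced in Python with pandas. The sketch below assumes the attribute in question is the dataset's Index column running from 0 to 34866; that column name is an assumption, not something confirmed by the report.

```python
import pandas as pd

# Stand-in for the assumed "Index" attribute (values 0..34866); in the report
# these statistics are read from the Preprocess tab in Weka.
index_col = pd.Series(range(0, 34867))

print("Mean:", index_col.mean())                    # 17433.0
print("Min:", index_col.min())                      # 0
print("Max:", index_col.max())                      # 34866
print("Std deviation:", round(index_col.std(), 2))  # sample std, ~10065.38
```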
Step 2 - Then, the NumericToNominal filter was applied to make the attributes Nominal. The steps are Filter – unsupervised – attribute – NumericToNominal. A nominal attribute represents a fixed set of nominal values, whereas a numeric attribute represents a floating-point number; a numeric variable refers to a quantifiable characteristic whose values are numbers. After the CSV import, these attributes are treated as numeric by default, so they are changed to nominal. This is because keeping categorical codes as numeric introduces an ordering over what are really nominal labels, which is a poor representation of the data and may lead to unwanted effects in the classifier. The NumericToNominal filter was therefore used to turn the attributes into nominal ones. Unlike discretization, it simply takes all the numeric values and adds them to the attribute's list of nominal values. It is very useful after CSV imports for forcing certain attributes to become nominal. Before the conversion, the data contained both Nominal and Numeric attribute types; after the conversion, every attribute is labelled Nominal and there are no Numeric attributes left. The values displayed for each attribute also change, with some counts decreasing and some values increasing, and the count for each distinct value becomes 1. Column 1 is also labelled as Nominal after the conversion.
Figure 3.1.1 shows steps to convert Numeric attribute type to Nominal attribute type
Figure 3.1.2 shows the Preprocess Tab before (left) and after (right) converting to Nominal
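For readers who prefer code, a minimal sketch of the same idea outside Weka is shown below: a numeric code column is re-typed as categorical so that no ordering is implied between the labels. The column names and values are made up for illustration; the actual conversion in this project was done with Weka's NumericToNominal filter.

```python
import pandas as pd

# Made-up example frame; the real conversion is performed in Weka.
df = pd.DataFrame({"Country_code": [1, 2, 1, 3],
                   "Revenue": [1200.0, 1800.0, 950.0, 2100.0]})

# Treat the numeric code as a nominal (categorical) attribute so no ordering
# is implied between labels, mirroring what NumericToNominal does in Weka.
df["Country_code"] = df["Country_code"].astype("category")

print(df.dtypes)  # Country_code is now category; Revenue stays numeric
```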
Step 3 - The filter used next is ReplaceMissingValues. The steps for applying this filter are Filters – Unsupervised – ReplaceMissingValues. This filter was chosen because it replaces all missing values in the dataset with the modes and means of the training data. The class attribute is skipped by default. After the filter is applied, the missing-value count for every attribute returns to zero, meaning there are no missing values left. In other words, to remove the missing values, we replace all of them.
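A minimal Python sketch of the same replacement rule (means for numeric attributes, modes for nominal ones) is given below. The column names are placeholders; in the project the replacement was done with Weka's ReplaceMissingValues filter.

```python
import pandas as pd

# Placeholder data with missing entries.
df = pd.DataFrame({"Revenue": [1200.0, None, 950.0, 2100.0],
                   "Country": ["US", "France", None, "US"]})

# Numeric attributes: replace missing values with the column mean.
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].mean())

# Nominal attributes: replace missing values with the mode (most frequent value).
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isna().sum())  # every count should now be zero
```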
Step 4 - No attributes with outliers were identified, which means this data has no outliers. Outliers can be checked by clicking the "Edit" button in the Preprocess tab: in the outlier section, any outliers are marked "yes", and "no" is shown when there are none. Outlier detection and analysis is also called outlier mining. Using data mining techniques, we may identify many interesting and hidden patterns in the data. Outliers are extreme values; they increase the error variance and reduce the power of statistical tests, so removing them avoids these effects. If the outliers are non-randomly distributed, they may also decrease normality.
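Weka's editor was used for the check above; as a rough, assumed alternative, the interquartile-range rule below flags values that sit far outside the bulk of a column. The data and threshold are illustrative only.

```python
import pandas as pd

# Placeholder numeric column standing in for one attribute of the dataset.
revenue = pd.Series([1200.0, 1800.0, 950.0, 2100.0, 1500.0])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]

if outliers.empty:
    print("No outliers found.")
else:
    print("Outliers found:", list(outliers))
```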
Step 5 – Save the file and rename it as file-cleaned.arff. First, click the "Save" button in the Preprocess tab. Then change the File Name to "file-cleaned" and change Files of Type to "Arff data files (*.arff)". Then click Save.
3.2 DATA TRANSFORMATION
Perform normalization
Machine learning algorithms perform better and are more accurate when the features in a dataset are transformed to a common scale using the data preprocessing approach known as normalization. Normalization's primary objective is to remove potential biases and distortions caused by the different scales of the features. We apply normalization to all the data because it makes the dataset more consistent and improves the accuracy of the models built on it. Simply put, data normalization guarantees consistency in our data's appearance, readability and usability across all the records in our database.
Step 1 - Load the data set in Weka after performing all the tasks from A1 to A2. Then, click on the Preprocess tab > Open file > Choose file > Open.
Figure 3.2.1
Step 2 – Identify the numeric attributes. Under the Preprocess tab, click Choose > Filters > Unsupervised > Attribute > Normalize.
Figure 3.2.2
Step 3 - Click Normalize and set the scale to 1.0 and translate to 0.0.
Figure 3.2.3
Step 4 – The attributes have been normalized by the filter. The figure below shows the data of the attribute before normalization, while the next figure shows the data after being normalized.
Figure 3.2.4
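The Normalize filter with scale 1.0 and translate 0.0 rescales every numeric attribute into the range [0, 1]. A minimal sketch of the same min-max formula in Python, on made-up values, is shown below.

```python
import pandas as pd

# Made-up numeric attribute; Weka's Normalize (scale=1.0, translate=0.0)
# maps each numeric column into the range [0, 1].
df = pd.DataFrame({"Revenue": [1200.0, 1800.0, 950.0, 2100.0]})

# Min-max normalization: (x - min) / (max - min).
df_norm = (df - df.min()) / (df.max() - df.min())

print(df_norm)  # 950 -> 0.0, 2100 -> 1.0, values in between scaled linearly
```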
Perform Discretization
Our group did not perform discretization. This is because the ranges of our data are not far from each other, meaning there are no large differences between the values, so binning or discretization is not necessary for our data.
An attribute construction method's objective is to create new attributes from the existing ones, changing the original data representation into a new one in which the classification algorithm can more easily identify patterns in the data. This tends to increase predictive accuracy.
Step 1 – Save the data in CSV file form.
Figure 3.2.5
Figure 3.2.6
Step 3 – Apply the formula =IF(M2 > 1500, "Yes", "No") and name the column 'Country with highest revenue'. We used the IF function because it enables us to compare values logically against expectations. An IF statement can thus produce two outcomes: if the comparison is True, we get the first result, and if it is False, we get the second.
Figure 3.2.7
Figure 3.2.8
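The same labelling rule used in the Excel IF formula can be sketched in Python as below. Here "Revenue" is an assumed stand-in for column M of the spreadsheet, and the 1500 threshold is taken from the formula above.

```python
import pandas as pd

# Placeholder data; "Revenue" stands in for column M of the Excel sheet.
df = pd.DataFrame({"Revenue": [900.0, 1600.0, 2300.0, 1450.0]})

# Same rule as =IF(M2 > 1500, "Yes", "No").
df["Country with highest revenue"] = df["Revenue"].apply(
    lambda r: "Yes" if r > 1500 else "No")

print(df)
```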
3.3 DATA REDUCTION
In this section the data set is reduced after having been cleaned and transformed. In general, data mining datasets are too large to be processed in full, and this can cause overfitting when building and evaluating the model. Therefore, reduction can be done through attribute selection or sampling. Hence, we apply the attribute selection technique, using suitable methods for the reduction process. The methods used in this section are the Ranker method and the Greedy Stepwise method.
Step 1 – Load the ‘file-process.arff’ file in Weka (after performing the tasks from A1 until A3). Then, click Open.
Step 2 – Navigate to the ‘Select Attributes’ tab at the top and choose ‘InfoGainAttributeEval’ from the Attribute Evaluator. Click ‘Yes’, and the ‘Ranker’ method will automatically be selected under Search Method. Then, click the ‘Start’ button to begin.
Step 3 – Repeat the same steps using the ‘CfsSubsetEval’ method. With this evaluator, the ‘Greedy Stepwise’ search method is chosen automatically, as depicted in the figure below. Then, click ‘Yes’ to begin the process.
Figure 3.3.2: The Greedy Stepwise method shown.
The attribute selection output is displayed along with each attribute's rank. The attributes are listed in order of their rank values. The attributes with the highest rank values are kept and the rest are removed.
The table below shows the attributes selected by each method: the Ranker method selected 13 attributes, while Greedy Stepwise selected only 2. In this project we decided to use the Ranker method. However, we still need to remove some attributes to avoid processing a huge dataset. Therefore, the 5 highest-ranked attributes were kept, and the other 8 lower-ranked attributes were removed.
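The ranking itself was produced in Weka with InfoGainAttributeEval and the Ranker search. As a hedged illustration of the same idea outside Weka, the sketch below ranks attributes with scikit-learn's mutual_info_classif, a related but not identical information-theoretic score; the attribute names and values are made up.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Made-up data; the real ranking comes from Weka's InfoGainAttributeEval + Ranker.
# mutual_info_classif is a related (not identical) information-theoretic score.
X = pd.DataFrame({
    "Revenue":   [900, 1600, 2300, 1450, 1700, 800, 2100, 1300],
    "Quantity":  [1, 3, 5, 2, 4, 1, 5, 2],
    "Unit_Cost": [450, 500, 460, 700, 420, 800, 430, 650],
})
y = ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"]

# Score each attribute against the class and list them from highest to lowest.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```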
Step 3: In the Classify tab, under the classifier option choose trees → J48 as the algorithm.
Step 6: Save the model by right-clicking the trees.J48 result → Save model (saved as a model object file). Then load the model in WEKA.
Figure 4.5 How to save model.
Step 7: To review the predictions on the data set, select Supplied test set → load the test dataset (reduced processed) → under More options, select Plain text for Output predictions → Re-evaluate the model.
Figure 4.14 Prediction result of the model using the test set.
Step 8: To visualize the tree, right-click on the trees.J48 result → Visualize tree.
• The steps are the same as steps 1 to 8, except that at step 4 the cross-validation folds are set to 20.
• The figure below shows the accuracy of cross-validation with 20 folds (97.7%).
Figure 4.16 Result of cross validation k=20.
• The figure below shows the accuracy of the 70% percentage split (97.4%). A rough Python sketch of this evaluation workflow is given after this list.
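As referenced above, the sketch below mirrors the same evaluation protocol (k-fold cross-validation and percentage splits) in Python. scikit-learn has no J48 (C4.5), so DecisionTreeClassifier (CART) is used as an assumed analogue, and the data is synthetic; the accuracies printed will not match the report's figures.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; the report uses the processed sales dataset in Weka.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] > 0.5).astype(int)

# J48 (C4.5) is not available in scikit-learn; CART is used here as an analogue.
clf = DecisionTreeClassifier(random_state=0)

# Cross-validation with k = 10 and k = 20 folds.
for k in (10, 20):
    scores = cross_val_score(clf, X, y, cv=k)
    print(f"{k}-fold cross-validation accuracy: {scores.mean():.3f}")

# Percentage splits 70:30 and 80:20, mirroring Weka's test options.
for test_size in (0.3, 0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"Split {round((1 - test_size) * 100)}:{round(test_size * 100)} "
          f"accuracy: {acc:.3f}")
```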
Table 4.3.2 Accuracy based on different test options: Cross Validation (k=20), Split Percentage (70:30), and Split Percentage (80:20).
Figure 4.23 Bar graph of accuracy between the full dataset and the reduced dataset for cross-validation (k=10, k=20) and percentage split (70%, 80%)
Based on the results above, the full dataset has a higher accuracy than the reduced dataset. This is caused by the reduction of attributes and instances: several attributes and instances were removed from the dataset, and the resulting lack of data and less appropriate data selection lowered the accuracy compared with the full dataset. Among all the results, the percentage split (80:20) method produced the most accurate predictions compared to the other methods.
4.4 Tree Visualizer
The best tree visualizer for the full dataset among all the methods is the percentage split (80:20) method. It has the highest accuracy, 97.5757%, compared to the other methods. The tree size and the number of leaves for each tree are stated below:
Figure 4.25 The number of leaves and the size of the tree
Step 2: Then click on the Cluster tab; the SimpleKMeans algorithm was chosen. K-means is the most basic clustering algorithm. It works by partitioning the dataset into K clusters. The objective is to find a good-quality partition that places similar items in one cluster and dissimilar objects in other clusters.
Step 3: Set displayStdDevs to True and set the number of clusters to 3, 4, 5 and 6. Two distance measures were used in this project: Euclidean and Manhattan distance. Manhattan distance is computed by summing the absolute differences between each pair of variable values, while Euclidean distance is the square root of the sum of the squared differences.
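A small sketch of the two distance measures on a pair of made-up feature vectors is shown below; in the project both measures were selected inside Weka's SimpleKMeans.

```python
import numpy as np
from scipy.spatial import distance

# Two made-up feature vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean: square root of the sum of squared differences.
print("Euclidean:", distance.euclidean(a, b))   # sqrt(9 + 4 + 0) ≈ 3.606

# Manhattan (city-block): sum of absolute differences.
print("Manhattan:", distance.cityblock(a, b))   # 3 + 2 + 0 = 5
```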
Data in cluster 3
Scheme, Relation, Instances, and Attributes describe the clustering setup and the dataset's characteristics. In this case, the customer spending dataset consists of 2574 instances and 16 attributes. The K-means clustering took five iterations, and the within-cluster sum of squared errors is 22277.0.
Euclidean
distance.
Manhattan
distance.
Step 4: Next, the number of clusters was increased to see the difference in the number of iterations for each cluster. Increasing the number of clusters reduces the error. The clustered instances show the count and proportion of all instances belonging to each cluster.
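The effect described in Step 4 can be sketched in Python: as the number of clusters grows, the within-cluster sum of squared errors (inertia) falls. The data below is synthetic, and scikit-learn's KMeans (Euclidean only) stands in for Weka's SimpleKMeans, so the error values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer spending data (2574 x 16 in the report).
rng = np.random.default_rng(0)
X = rng.random((300, 4))

# Increasing k reduces the within-cluster sum of squared errors (inertia_),
# matching the trend observed with Weka's SimpleKMeans.
for k in (3, 4, 5, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: sum of squared errors = {km.inertia_:.1f}")
```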
Cluster 4
Euclidean
distance.
Manhattan
distance.
Cluster 5
Euclidean
distance.
Manhattan
distance.
Cluster 6
Euclidean
distance.
Manhattan
distance.
Step 6: Cluster 6 was then chosen since its error was lower than that of the other clusters. Now we need to visualise the cluster assignments for both methods.
Euclidean
distance.
Manhattan
distance.
The best accuracy for predicting the class is obtained with 6 clusters because it has the lowest sum of squared errors, 21102.0, compared to the other cluster counts. Thus, the best method for this project is Euclidean distance: as the cluster assignment visualisation shows, its points are grouped more tightly, whereas with Manhattan distance the points are spread further apart.
After reducing the data
Step 1: The class attribute, which is unnecessary for clustering and does not affect it, will be removed. Then the steps will be repeated to compare the results before and after reducing the attributes.
Step 2: Cluster 6 and the Euclidean method were used, as they gave the best and most accurate results.
CONCLUSION
In short, the dataset contains errors classified as noisy data, which can be described as the outliers discussed in the case study. The noisy data cannot simply be deleted because the inaccuracies in the data readings are due to a few factors, such as large data values. Only the unnecessary attribute known as Index, or Column1, was deleted. Therefore, cross-validation with k=10 and k=20 is the best classification method for this condensed crowdsourced dataset, because it has the highest accuracy percentage, 97.7%, based on Table 4.4.
REFERENCES
Kaggle dataset: https://www.kaggle.com/datasets/thedevastator/analyzing-customer-spending-habits-to-improve-sa
Basit, A. (2023, February 20). Unraveling the Mystery of Low Model Accuracy: Causes and Solutions. LinkedIn. https://www.linkedin.com/pulse/unraveling-mystery-low-model-accuracy-causes-solutions-abdul-basit/