ITS665 Report
GROUP PROJECT
CLASS : AS2013B1
GROUP MEMBERS :
NO NAME ID NUMBER
1. AYU AFEZAH BINTI ASNAN SULUNG 2022958459
2. INTAN SYAFIKA BINTI BAHARUDIN 2022919289
3. MAYA SYAMELIA BINTI YAHAYA 2022923665
4. MUHAMMAD MIRZAIEMAN BINTI OTHMAN 2022755525
5. YUSNURDALILA BINTI DALI 2022991343
TABLE OF CONTENTS
1.0 OVERVIEW
4.3 Model Development and Evaluation Comparison Between Full Dataset and Reduced Dataset
PROBLEM STATEMENT
In Malaysia, purchasing power has been rising day by day. Consumer spending in Malaysia averaged RM140,822.48 million in the first quarter of 2022 and recorded a low of RM56,768 million in the second quarter of 2005. The average cost of living in Malaysia varies depending on factors such as lifestyle, location and family size. However, based on recent estimates, a family of four can expect to spend about RM7,000 – RM9,000 per month. The main problem in this study is identifying the best classification method for predicting the country with the highest revenue in the study area. A further problem is gaining clear insight into customers' spending habits across different countries. Finally, there is still little understanding of why people make particular decisions when shopping online or offline.
OBJECTIVES
• To identify the best classification method to predict the highest revenue of a country in the study area.
• To offer unique insight into customers' spending behaviour across countries around the globe.
• To determine the reasons that drive customers' decisions when shopping offline or online.
2.0 DATA ACQUISITION AND DATA UNDERSTANDING
In this report, we obtained the data from the Kaggle website and downloaded the sales dataset provided there (https://www.kaggle.com/datasets/thedevastator/analyzing-customer-spending-habits-to-improve-sa). Based on the information on the Kaggle page, the data was gathered from people in the states of the United States of America and in France, to provide unique insight into customers' spending behaviour across countries. The dataset does not cite any other sources or a digital object identifier (DOI). The data was created by Vineet Bahl using web scraping.
3.0 DATA PREPARATION
Step 1 - Work began in the Preprocess tab. The numeric attributes were studied in the Preprocess tab and their mean, minimum, maximum and standard deviation were recorded. The file was converted from Excel format and opened as a CSV file. From Figure 1.0, the mean for this attribute is 17433, the minimum value is 0, the maximum value is 34866, and the standard deviation is 10065.38. These values are shown in the statistics section of the Preprocess tab.
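As an illustration only (the statistics above were read directly from Weka), the same figures can be reproduced in Python with pandas. The sketch below assumes the attribute in question is the dataset's Index column running from 0 to 34866; that column name is an assumption, not something confirmed by the report.

```python
import pandas as pd

# Stand-in for the assumed "Index" attribute (values 0..34866); in the report
# these statistics are read from the Preprocess tab in Weka.
index_col = pd.Series(range(0, 34867))

print("Mean:", index_col.mean())                    # 17433.0
print("Min:", index_col.min())                      # 0
print("Max:", index_col.max())                      # 34866
print("Std deviation:", round(index_col.std(), 2))  # sample std, ~10065.38
```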
Step 2 - Then, the NumericToNominal filter was applied to make the attributes Nominal. The steps are Filter – unsupervised – attribute – NumericToNominal. A nominal attribute represents a fixed set of nominal values, whereas a numeric attribute represents a floating-point number; a numeric variable refers to a quantifiable characteristic whose values are numbers. After the CSV import, these attributes are treated as numeric by default, so they are changed to nominal. This is because keeping categorical codes as numeric introduces an ordering over what are really nominal labels, which is a poor representation of the data and may lead to unwanted effects in the classifier. The NumericToNominal filter was therefore used to turn the attributes into nominal ones. Unlike discretization, it simply takes all the numeric values and adds them to the attribute's list of nominal values. It is very useful after CSV imports for forcing certain attributes to become nominal. Before the conversion, the data contained both Nominal and Numeric attribute types; after the conversion, every attribute is labelled Nominal and there are no Numeric attributes left. The values displayed for each attribute also change, with some counts decreasing and some values increasing, and the count for each distinct value becomes 1. Column 1 is also labelled as Nominal after the conversion.
Figure 3.1.1 shows steps to convert Numeric attribute type to Nominal attribute type
Figure 3.1.2 shows the Preprocess Tab before (left) and after (right) converting to Nominal
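For readers who prefer code, a minimal sketch of the same idea outside Weka is shown below: a numeric code column is re-typed as categorical so that no ordering is implied between the labels. The column names and values are made up for illustration; the actual conversion in this project was done with Weka's NumericToNominal filter.

```python
import pandas as pd

# Made-up example frame; the real conversion is performed in Weka.
df = pd.DataFrame({"Country_code": [1, 2, 1, 3],
                   "Revenue": [1200.0, 1800.0, 950.0, 2100.0]})

# Treat the numeric code as a nominal (categorical) attribute so no ordering
# is implied between labels, mirroring what NumericToNominal does in Weka.
df["Country_code"] = df["Country_code"].astype("category")

print(df.dtypes)  # Country_code is now category; Revenue stays numeric
```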
Step 3 - The filter used next is ReplaceMissingValues. The steps for applying this filter are Filters – Unsupervised – ReplaceMissingValues. This filter was chosen because it replaces all missing values in the dataset with the modes and means of the training data. The class attribute is skipped by default. After the filter is applied, the missing-value count for every attribute returns to zero, meaning there are no missing values left. In other words, to remove the missing values, we replace all of them.
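A minimal Python sketch of the same replacement rule (means for numeric attributes, modes for nominal ones) is given below. The column names are placeholders; in the project the replacement was done with Weka's ReplaceMissingValues filter.

```python
import pandas as pd

# Placeholder data with missing entries.
df = pd.DataFrame({"Revenue": [1200.0, None, 950.0, 2100.0],
                   "Country": ["US", "France", None, "US"]})

# Numeric attributes: replace missing values with the column mean.
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].mean())

# Nominal attributes: replace missing values with the mode (most frequent value).
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isna().sum())  # every count should now be zero
```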
Step 4 - No attributes with outliers were identified, which means this data has no outliers. Outliers can be checked by clicking the "Edit" button in the Preprocess tab: in the outlier section, any outliers are marked "yes", and "no" is shown when there are none. Outlier detection and analysis is also called outlier mining. Using data mining techniques, we may identify many interesting and hidden patterns in the data. Outliers are extreme values; they increase the error variance and reduce the power of statistical tests, so removing them avoids these effects. If the outliers are non-randomly distributed, they may also decrease normality.
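Weka's editor was used for the check above; as a rough, assumed alternative, the interquartile-range rule below flags values that sit far outside the bulk of a column. The data and threshold are illustrative only.

```python
import pandas as pd

# Placeholder numeric column standing in for one attribute of the dataset.
revenue = pd.Series([1200.0, 1800.0, 950.0, 2100.0, 1500.0])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
q1, q3 = revenue.quantile(0.25), revenue.quantile(0.75)
iqr = q3 - q1
outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]

if outliers.empty:
    print("No outliers found.")
else:
    print("Outliers found:", list(outliers))
```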
Step 5 – Save the file and rename it as file-cleaned.arff. First, click the "Save" button in the Preprocess tab. Then change the File Name to "file-cleaned" and change Files of Type to "Arff data files (*.arff)". Then click Save.
3.2 DATA TRANSFORMATION
Perform normalization
Machine learning algorithms perform better and are more accurate when the features in a dataset are transformed to a common scale using the data preprocessing approach known as normalization. Normalization's primary objective is to remove potential biases and distortions caused by the different scales of the features. We apply normalization to all the data because it makes the dataset more consistent and improves the accuracy of the models built on it. Simply put, data normalization guarantees consistency in our data's appearance, readability and usability across all the records in our database.
Step 1 - Load the data set in Weka after performing all the tasks from A1 to A2. Then, click on the Preprocess tab > Open file > Choose file > Open.
Figure 3.2.1
Step 2 – Identify the numeric attributes. Under the Preprocess tab, click Choose > Filters > Unsupervised > Attribute > Normalize.
Figure 3.2.2
Step 3 - Click Normalize and set the scale to 1.0 and translate to 0.0.
Figure 3.2.3
Step 4 – The attributes have been normalized by the filter. The figure below shows the data of the attribute before normalization, while the next figure shows the data after being normalized.
Figure 3.2.4
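The Normalize filter with scale 1.0 and translate 0.0 rescales every numeric attribute into the range [0, 1]. A minimal sketch of the same min-max formula in Python, on made-up values, is shown below.

```python
import pandas as pd

# Made-up numeric attribute; Weka's Normalize (scale=1.0, translate=0.0)
# maps each numeric column into the range [0, 1].
df = pd.DataFrame({"Revenue": [1200.0, 1800.0, 950.0, 2100.0]})

# Min-max normalization: (x - min) / (max - min).
df_norm = (df - df.min()) / (df.max() - df.min())

print(df_norm)  # 950 -> 0.0, 2100 -> 1.0, values in between scaled linearly
```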
Perform Discretization
Our group did not perform discretization. This is because the ranges of our data are not far from each other, meaning there are no large differences between the values, so binning or discretization is not necessary for our data.
An attribute construction method's objective is to create new attributes from the existing ones, changing the original data representation into a new one in which the classification algorithm can more easily identify patterns in the data. This tends to increase predictive accuracy.
Step 1 – Save the data in CSV file form.
Figure 3.2.5
Figure 3.2.6
Step 3 – Apply the formula =IF(M2 > 1500, "Yes", "No") and name the column 'Country with highest revenue'. We used the IF function because it enables us to compare values logically against expectations. An IF statement can thus produce two outcomes: if the comparison is True, we get the first result, and if it is False, we get the second.
Figure 3.2.7
Figure 3.2.8
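The same labelling rule used in the Excel IF formula can be sketched in Python as below. Here "Revenue" is an assumed stand-in for column M of the spreadsheet, and the 1500 threshold is taken from the formula above.

```python
import pandas as pd

# Placeholder data; "Revenue" stands in for column M of the Excel sheet.
df = pd.DataFrame({"Revenue": [900.0, 1600.0, 2300.0, 1450.0]})

# Same rule as =IF(M2 > 1500, "Yes", "No").
df["Country with highest revenue"] = df["Revenue"].apply(
    lambda r: "Yes" if r > 1500 else "No")

print(df)
```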
3.3 DATA REDUCTION
In this section the data set is reduced after having been cleaned and transformed. In general, data mining datasets are too large to be processed in full, and this can cause overfitting when building and evaluating the model. Therefore, reduction can be done through attribute selection or sampling. Hence, we apply the attribute selection technique, using suitable methods for the reduction process. The methods used in this section are the Ranker method and the Greedy Stepwise method.
Step 1 – Load the ‘file-process.arff’ file in Weka (after performing the tasks from A1 until A3). Then, click Open.
Step 2 – Navigate to the ‘Select Attributes’ tab at the top and choose ‘InfoGainAttributeEval’ from the Attribute Evaluator. Click ‘Yes’, and the ‘Ranker’ method will automatically be selected under Search Method. Then, click the ‘Start’ button to begin.
Step 3 – Repeat the same steps using the ‘CfsSubsetEval’ method. With this evaluator, the ‘Greedy Stepwise’ search method is chosen automatically, as depicted in the figure below. Then, click ‘Yes’ to begin the process.
Figure 3.3.2: The Greedy Stepwise method shown.
The attribute selection output is displayed along with each attribute's rank. The attributes are listed in order of their rank values. The attributes with the highest rank values are kept and the rest are removed.
The table below shows the attributes selected by each method: the Ranker method selected 13 attributes, while Greedy Stepwise selected only 2. In this project we decided to use the Ranker method. However, we still need to remove some attributes to avoid processing a huge dataset. Therefore, the 5 highest-ranked attributes were kept, and the other 8 lower-ranked attributes were removed.
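The ranking itself was produced in Weka with InfoGainAttributeEval and the Ranker search. As a hedged illustration of the same idea outside Weka, the sketch below ranks attributes with scikit-learn's mutual_info_classif, a related but not identical information-theoretic score; the attribute names and values are made up.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Made-up data; the real ranking comes from Weka's InfoGainAttributeEval + Ranker.
# mutual_info_classif is a related (not identical) information-theoretic score.
X = pd.DataFrame({
    "Revenue":   [900, 1600, 2300, 1450, 1700, 800, 2100, 1300],
    "Quantity":  [1, 3, 5, 2, 4, 1, 5, 2],
    "Unit_Cost": [450, 500, 460, 700, 420, 800, 430, 650],
})
y = ["No", "Yes", "Yes", "No", "Yes", "No", "Yes", "No"]

# Score each attribute against the class and list them from highest to lowest.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```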
Step 3: In the Classify tab, under the classifier option choose trees → J48 as the algorithm.
Step 6: Save the model by right-clicking the trees.J48 result → Save model (saved as a model object file). Then load the model in WEKA.
Figure 4.5 How to save model.
Step 7: To review the predictions on the data set, select Supplied test set → load the test dataset (reduced processed) → under More options, select Plain text for Output predictions → Re-evaluate the model.
Figure 4.14 Prediction result of the model using the test set.
Step 8: To visualize the tree, right-click on the trees.J48 result → Visualize tree.
• The steps are the same as steps 1 to 8, except that at step 4 the cross-validation folds are set to 20.
• The figure below shows the accuracy of cross-validation with 20 folds (97.7%).
Figure 4.16 Result of cross validation k=20.
• The figure below shows the accuracy of the 70% percentage split (97.4%). A rough Python sketch of this evaluation workflow is given after this list.
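As referenced above, the sketch below mirrors the same evaluation protocol (k-fold cross-validation and percentage splits) in Python. scikit-learn has no J48 (C4.5), so DecisionTreeClassifier (CART) is used as an assumed analogue, and the data is synthetic; the accuracies printed will not match the report's figures.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; the report uses the processed sales dataset in Weka.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] > 0.5).astype(int)

# J48 (C4.5) is not available in scikit-learn; CART is used here as an analogue.
clf = DecisionTreeClassifier(random_state=0)

# Cross-validation with k = 10 and k = 20 folds.
for k in (10, 20):
    scores = cross_val_score(clf, X, y, cv=k)
    print(f"{k}-fold cross-validation accuracy: {scores.mean():.3f}")

# Percentage splits 70:30 and 80:20, mirroring Weka's test options.
for test_size in (0.3, 0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"Split {round((1 - test_size) * 100)}:{round(test_size * 100)} "
          f"accuracy: {acc:.3f}")
```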
Table 4.3.2 Accuracy based on different test options: Cross Validation (k=20), Split Percentage (70:30), and Split Percentage (80:20).
Figure 4.23 Bar graph of accuracy between the full dataset and the reduced dataset for cross-validation (k=10, k=20) and percentage split (70%, 80%)
Based on the results above, the full dataset has a higher accuracy than the reduced dataset. This is caused by the reduction of attributes and instances: several attributes and instances were removed from the dataset, and the resulting lack of data and less appropriate data selection lowered the accuracy compared with the full dataset. Among all the results, the percentage split (80:20) method produced the most accurate predictions compared to the other methods.
4.4 Tree Visualizer
The best tree visualizer for the full dataset among all the methods is the percentage split (80:20) method. It has the highest accuracy, 97.5757%, compared to the other methods. The tree size and the number of leaves for each tree are stated below:
Figure 4.25 The number of leaves and the size of the tree
Step 2: Then click on the Cluster tab; the SimpleKMeans algorithm was chosen. K-means is the most basic clustering algorithm. It works by partitioning the dataset into K clusters. The objective is to find a good-quality partition that places similar items in one cluster and dissimilar objects in other clusters.
Step 3: Set displayStdDevs to True and set the number of clusters to 3, 4, 5 and 6. Two distance measures were used in this project: Euclidean and Manhattan distance. Manhattan distance is computed by summing the absolute differences between each pair of variable values, while Euclidean distance is the square root of the sum of the squared differences.
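A small sketch of the two distance measures on a pair of made-up feature vectors is shown below; in the project both measures were selected inside Weka's SimpleKMeans.

```python
import numpy as np
from scipy.spatial import distance

# Two made-up feature vectors.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean: square root of the sum of squared differences.
print("Euclidean:", distance.euclidean(a, b))   # sqrt(9 + 4 + 0) ≈ 3.606

# Manhattan (city-block): sum of absolute differences.
print("Manhattan:", distance.cityblock(a, b))   # 3 + 2 + 0 = 5
```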
Data in cluster 3
Scheme, Relation, Instances, and Attributes describe the clustering setup and the dataset's characteristics. In this case, the customer spending dataset consists of 2574 instances and 16 attributes. The K-means clustering took five iterations, and the within-cluster sum of squared errors is 22277.0.
Euclidean
distance.
Manhattan
distance.
Step 4: Next, the number of clusters was increased to see the difference in the number of iterations for each cluster. Increasing the number of clusters reduces the error. The clustered instances show the count and proportion of all instances belonging to each cluster.
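The effect described in Step 4 can be sketched in Python: as the number of clusters grows, the within-cluster sum of squared errors (inertia) falls. The data below is synthetic, and scikit-learn's KMeans (Euclidean only) stands in for Weka's SimpleKMeans, so the error values are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer spending data (2574 x 16 in the report).
rng = np.random.default_rng(0)
X = rng.random((300, 4))

# Increasing k reduces the within-cluster sum of squared errors (inertia_),
# matching the trend observed with Weka's SimpleKMeans.
for k in (3, 4, 5, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: sum of squared errors = {km.inertia_:.1f}")
```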
Cluster 4
Euclidean
distance.
Manhattan
distance.
Cluster 5
Euclidean
distance.
Manhattan
distance.
Cluster 6
Euclidean
distance.
Manhattan
distance.
Step 6: Cluster 6 was then chosen since its error was lower than that of the other clusters. Now we need to visualise the cluster assignments for both methods.
Euclidean
distance.
Manhattan
distance.
The best accuracy for predicting the class is obtained with 6 clusters because it has the lowest sum of squared errors, 21102.0, compared to the other cluster counts. Thus, the best method for this project is Euclidean distance: as the cluster assignment visualisation shows, its points are grouped more tightly, whereas with Manhattan distance the points are spread further apart.
After reducing the data
Step 1: The class attribute, which is unnecessary for clustering and does not affect it, will be removed. Then the steps will be repeated to compare the results before and after reducing the attributes.
Step 2: Cluster 6 and the Euclidean method were used, as they gave the best and most accurate results.
CONCLUSION
In short, the dataset contains errors classified as noisy data, which can be described as the outliers discussed in the case study. The noisy data cannot simply be deleted because the inaccuracies in the data readings are due to a few factors, such as large data values. Only the unnecessary attribute known as Index, or Column1, was deleted. Therefore, cross-validation with k=10 and k=20 is the best classification method for this condensed crowdsourced dataset, because it has the highest accuracy percentage, 97.7%, based on Table 4.4.
REFERENCES
Kaggle dataset: https://www.kaggle.com/datasets/thedevastator/analyzing-customer-spending-habits-to-improve-sa
Basit, A. (2023, February 20). Unraveling the Mystery of Low Model Accuracy: Causes and Solutions. LinkedIn. https://www.linkedin.com/pulse/unraveling-mystery-low-model-accuracy-causes-solutions-abdul-basit/