Machine Learning Based Food Recipe Recommendation System
M.B. Vivek, Manju N. (JSS Science and Technology University, Mysuru), et al.
Abstract Recommender systems make use of user profiles and filtering techniques to help users find appropriate information in large volumes of data. The user profile is important for successful recommendations. In this paper, we present two approaches to recommending recipes based on user preferences given in the form of ratings, and compare them to identify which approach suits the dataset better: an item-based approach and a user-based approach. For the item-based approach, Tanimoto coefficient similarity and log-likelihood similarity are used to compute similarities between different recipes. For the user-based approach, Euclidean distance and Pearson correlation are used, and we introduce both a fixed-size neighborhood and a threshold-based neighborhood to these similarity techniques. The performance of the user-based approach is found to be better than that of the item-based approach. Performance on the Allrecipe dataset is found to be better than on the simulated dataset, since there are more interactions between users and items.
Keywords Collaborative filtering · Item based · User based · Fixed size neighborhood · Threshold-based neighborhood
1 Introduction
2 Methodology
The Tanimoto coefficient similarity between two items a and b is

T(a, b) = N_c / (N_a + N_b − N_c)

where
N_a  number of customers who rate item A,
N_b  number of customers who rate item B,
N_c  number of customers who rate both items A and B.
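The coefficient can be sketched in a few lines of Python; the function name and the sets of user IDs below are hypothetical, standing in for the customers who rated each recipe, and are not taken from the paper's implementation:

```python
def tanimoto_similarity(raters_a, raters_b):
    """Tanimoto (Jaccard) coefficient between two items, given the
    sets of users who rated each item: N_c / (N_a + N_b - N_c)."""
    both = len(raters_a & raters_b)  # N_c: rated both items
    if both == 0:
        return 0.0
    return both / (len(raters_a) + len(raters_b) - both)

# Hypothetical user IDs who rated each recipe.
item_a = {"u1", "u2", "u3", "u4"}
item_b = {"u3", "u4", "u5"}
print(tanimoto_similarity(item_a, item_b))  # 2 / (4 + 3 - 2) = 0.4
```

Note that only the sets of raters matter; the rating values themselves are ignored, which is exactly the property shared with the log-likelihood similarity described next.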
Log-likelihood-based similarity [22] is similar to the Tanimoto coefficient-based similarity: it also does not take into account the values of individual preferences. It is based on the number of recipes two users have in common, but its value reflects how unlikely it is for the two users to have so much overlap, given the number of recipes present and the number of recipes each user has a preference for.
To compute the score, let the counts be: the number of times the two events occurred together (n_11), the number of times each occurred without the other (n_12 and n_21), and the number of times neither event took place (n_22). Given these counts, the log-likelihood ratio score (also known as G²) is the standard likelihood-ratio statistic for the resulting 2 × 2 contingency table,

G² = 2 Σ_ij n_ij ln(n_ij / E_ij)

where E_ij = (row sum i × column sum j) / N is the count expected under independence.
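Assuming G² is the standard likelihood-ratio statistic for the 2 × 2 table of counts described above, a minimal Python sketch (the function name is ours) is:

```python
import math

def llr_score(n11, n12, n21, n22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    n11 = both events occurred, n12/n21 = one without the other,
    n22 = neither occurred."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    g2 = 0.0
    for observed, expected in [
        (n11, row1 * col1 / n),
        (n12, row1 * col2 / n),
        (n21, row2 * col1 / n),
        (n22, row2 * col2 / n),
    ]:
        if observed > 0:  # 0 * ln(0) is taken as 0
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

# Independent counts give G^2 = 0; strong co-occurrence gives a large score.
print(llr_score(5, 5, 5, 5))
print(llr_score(20, 30, 10, 940))
```

A score of 0 means the two events co-occur exactly as often as independence predicts; larger scores indicate stronger association, and the score can then be mapped into a bounded similarity value.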
User-based recommendations are based on the preferences given by a user and on how similar users are according to those preferences. The similarity values are used to obtain a list of recommended recipes. To calculate the similarity between two users, we make use of two similarity measures, namely Pearson correlation coefficient similarity and Euclidean distance similarity, along with a fixed-size neighborhood and a threshold-based neighborhood.
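The two neighborhood schemes can be sketched as follows; the selection function, user IDs, and similarity values are invented for illustration and are not the paper's implementation:

```python
def neighbors(similarities, n=None, threshold=None):
    """Select a neighborhood from a {user: similarity} mapping, either
    as the n most similar users (fixed-size neighborhood) or as all
    users whose similarity meets a threshold (threshold-based)."""
    if n is not None:
        return dict(sorted(similarities.items(),
                           key=lambda kv: kv[1], reverse=True)[:n])
    return {u: s for u, s in similarities.items() if s >= threshold}

# Hypothetical similarities of four users to the active user.
sims = {"u1": 0.9, "u2": 0.4, "u3": 0.75, "u4": 0.1}
print(neighbors(sims, n=2))            # {'u1': 0.9, 'u3': 0.75}
print(neighbors(sims, threshold=0.5))  # {'u1': 0.9, 'u3': 0.75}
```

The fixed-size scheme always yields exactly n neighbors regardless of how similar they are, while the threshold scheme yields a variable number of neighbors whose similarity is guaranteed to be at least t; this trade-off is what the experiments below vary over n and t.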
The Euclidean distance similarity [23] between two users X and Y treats recipes as dimensions and preferences, or ratings, as points on those dimensions: a distance is computed over all recipes (dimensions) for which both users have expressed a preference. This is simply the square root of the sum of the squared differences in preference, or position, along each dimension.
The value 1/(1 + distance) lies in the range (0, 1], but it would weigh against pairs that overlap in more dimensions, even though more overlap should indicate more similarity, since more dimensions generally offer more opportunities to be farther apart. The similarity is therefore computed as

√n / (1 + distance)

where n is the number of dimensions. The factor √n is chosen because the distance between randomly chosen points grows as √n. This can cause the similarity value to exceed 1, so such values are capped at 1. The distance itself is not normalized in any way; within one domain this does not matter much, as normalizing would not change the ordering.
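A minimal Python sketch of this similarity, assuming the distance is taken over co-rated recipes only and n is the number of such recipes (the function name and rating dictionaries are ours):

```python
import math

def euclidean_similarity(prefs_x, prefs_y):
    """sqrt(n) / (1 + Euclidean distance) over co-rated recipes,
    capped at 1 as described in the text."""
    common = set(prefs_x) & set(prefs_y)  # recipes both users rated
    if not common:
        return 0.0
    distance = math.sqrt(sum((prefs_x[r] - prefs_y[r]) ** 2 for r in common))
    n = len(common)
    return min(1.0, math.sqrt(n) / (1.0 + distance))

# Hypothetical ratings; two co-rated recipes, distance = 1.
x = {"r1": 5.0, "r2": 3.0, "r3": 4.0}
y = {"r1": 4.0, "r2": 3.0, "r4": 2.0}
print(euclidean_similarity(x, y))  # sqrt(2) / 2 ≈ 0.7071
```

Identical rating vectors give a raw value of √n / 1, which exceeds 1 for n > 1; the cap maps such pairs to a similarity of exactly 1.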
The implementation of the Pearson correlation [24] for two users X and Y uses the following sums, taken over all items for which both X and Y indicate a preference:

ΣX²  sum of the squares of X's preference values,
ΣY²  sum of the squares of Y's preference values,
ΣXY  sum of the products of X's and Y's preference values.

The correlation is then

ΣXY / √(ΣX² · ΣY²)
This implementation centers its data: each user's rating values are shifted so that their mean is 0. This is important to achieve the expected behavior on both datasets. Because the data it receives is mean-centered, the implementation is equivalent to the cosine similarity; the correlation may also be interpreted as the cosine of the angle between the two vectors defined by the users' preference values.
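A minimal sketch of this mean-centered Pearson computation, assuming the means are taken over the co-rated items (the function name and rating dictionaries are hypothetical):

```python
import math

def pearson_similarity(prefs_x, prefs_y):
    """Pearson correlation over co-rated items: centre each user's
    ratings so their mean is 0, then take sum(XY)/sqrt(sum(X^2)sum(Y^2)),
    i.e. the cosine of the angle between the centered vectors."""
    common = sorted(set(prefs_x) & set(prefs_y))
    if len(common) < 2:
        return 0.0  # correlation undefined on fewer than 2 points
    mean_x = sum(prefs_x[i] for i in common) / len(common)
    mean_y = sum(prefs_y[i] for i in common) / len(common)
    xs = [prefs_x[i] - mean_x for i in common]
    ys = [prefs_y[i] - mean_y for i in common]
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    denom = math.sqrt(sum(a * a for a in xs) * sum(b * b for b in ys))
    return sum_xy / denom if denom else 0.0

# Hypothetical ratings: perfectly aligned preferences give 1.0.
print(pearson_similarity({"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 4, "c": 6}))
```

Because the centering removes each user's rating bias, a user who rates everything one star higher than another can still be perfectly correlated with them.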
For our work, the recipe data was collected from the Allrecipe website, which holds about 46,336 recipes, 1,966,920 user reviews, and data from approximately 530,609 users, covering the fundamentals of cooking and user preferences. We scraped this data and obtained a subset of about 940 users and 1.6 K recipes with 98 K user preferences. Along with the Allrecipe data, we also have our own dataset collected from users of our application: 24 users and 124 recipes with 323 user preferences.
We implemented the item-based approach making use of the preferences given by the users, with two similarity techniques, namely Tanimoto coefficient similarity and log-likelihood similarity. The results in Table 1 show an average recall of about 23% and 28% for Tanimoto coefficient similarity and log-likelihood similarity respectively on the Allrecipe dataset, but only about 4% and 1% on the simulated dataset.
We implemented the user-based approach making use of the preferences given by the users, with similarity based on Euclidean distance and Pearson correlation, each combined with a fixed-size neighborhood and a threshold-based neighborhood. The recommendations are ranked according to the value of the similarity measure. The performance of this approach is measured by five-fold evaluation using the Average Absolute Difference (AAD) and Root Mean Squared Error (RMSE) between estimated and actual preferences, for fixed-size and threshold-based neighborhoods and for different percentages of training and test data. Lower values indicate more accurate recommendations for the respective datasets.
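The two error measures can be computed directly from their definitions; the rating lists below are invented for illustration, not drawn from the paper's experiments:

```python
import math

def aad(actual, estimated):
    """Average Absolute Difference between actual and estimated ratings."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def rmse(actual, estimated):
    """Root Mean Squared Error between actual and estimated ratings."""
    return math.sqrt(sum((a - e) ** 2
                         for a, e in zip(actual, estimated)) / len(actual))

# Hypothetical held-out ratings vs. the recommender's estimates.
actual    = [4.0, 3.0, 5.0, 2.0]
estimated = [3.5, 3.0, 4.0, 3.0]
print(aad(actual, estimated))   # 0.625
print(rmse(actual, estimated))  # 0.75
```

RMSE penalizes large individual errors more heavily than AAD, so RMSE ≥ AAD on the same data; this is why the RMSE rows in the tables below sit above the corresponding AAD rows.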
Table 2 shows the best results in estimated versus actual preferences for the Allrecipe dataset with a user-based recommender using the two similarity metrics and a nearest-n user neighborhood.
Table 3 shows the corresponding results for the Allrecipe dataset using the two similarity metrics with a threshold-based user neighborhood.
Table 4 shows the results for the simulated dataset using the two similarity metrics with a nearest-n user neighborhood.
Table 2 Results for the Allrecipe dataset (nearest-n user neighborhood) with 90% training data and 10% test data
Similarity n = 100 n = 200 n = 300 n = 500 n = 1000
Pearson correlation (AAD) 0.83 0.7906 0.7782 0.8133 0.842
Euclidean distance (AAD) 0.7745 0.7484 0.7511 0.7667 0.820
Pearson correlation (RMSE) 1.04 0.9993 1.006 1.005 1.065
Euclidean distance (RMSE) 0.9783 0.9593 0.9649 0.9697 1.000
Table 3 Results for the Allrecipe dataset (threshold-based user neighborhood) with 80% training data and 20% test data
Similarity t = 0.9 t = 0.8 t = 0.7 t = 0.6 t = 0.5
Pearson correlation (AAD) 0.9140 0.8771 0.8624 0.8339 0.8207
Euclidean distance (AAD) 0.8795 0.8919 0.8714 0.8022 0.7488
Pearson correlation (RMSE) 1.1177 1.1009 1.1005 1.0486 1.0424
Euclidean distance (RMSE) 1.1097 1.2740 1.1037 1.0090 0.9555
Table 4 Results for the simulated dataset (nearest-n user neighborhood) with 90% training data and 10% test data
Similarity n=2 n=4 n=6 n=8 n = 16
Pearson correlation (AAD) 1.663 1.69 1.69 1.576 1.639
Euclidean distance (AAD) 3.07 1.29 1.346 1.166 1.18
Pearson correlation (RMSE) 2.1357 1.839 1.678 2.078 1.8231
Euclidean distance (RMSE) 2.07 1.265 1.801 1.693 1.5893
Table 5 Results for the simulated dataset (threshold-based user neighborhood) with 80% training data and 20% test data
Similarity t = 0.9 t = 0.8 t = 0.7 t = 0.6 t = 0.5
Pearson correlation (AAD) 1.211 0.8977 1.6216 1.1686 0.9553
Euclidean distance (AAD) 1.666 0.8886 1.7776 1.95 1.2538
Pearson correlation (RMSE) 1.1956 2.2223 1.9247 1.8125 1.3195
Euclidean distance (RMSE) 2.84 2.0955 1.4513 2.2580 1.7751
Table 5 shows the best results in estimated versus actual preferences for the simulated dataset using the two similarity metrics with a threshold-based user neighborhood.
The performance, that is, the quality of the estimated ratings for different users, is found to be better on the Allrecipe dataset than on the simulated dataset. This is because the number of interactions between users and recipes, in the form of ratings, is far greater in the Allrecipe dataset than in the simulated dataset. More interactions mean that the matrix constructed to compute similarity between users is less sparse, indicating that more data is available to the recommender system to identify similarities between different users.
4 Conclusion
References