Module-4_Notes_13-12-2024.docx
Syllabus
Recommender Systems:
Datasets, Association rules, collaborative filtering, user-based similarity, item-
based similarity using surprise library, matrix factorization
Text Analytics:
Overview, sentiment classification, Naïve Bayes, model for sentiment
classification, using TF-IDF vectorizer, challenges of text analytics
Textbook: Machine Learning using Python by Manaranjan Pradhan (Consultant, Indian Institute of Management Bangalore) and U Dinesh Kumar (Professor, Indian Institute of Management Bangalore), Wiley, 2019.
Chapters 9 and 10
First Edition: 2019
ISBN: 978-81-265-7990-7; ISBN (ebk): 978-81-265-8855-8
www.wileyindia.com
To illustrate the association rule mining concept, let us consider a set of baskets
and the items in those baskets purchased by customers as depicted in Figure 9.1.
Items purchased in different baskets are:
• Basket 1: egg, beer, sugar, bread, diaper
• Basket 2: egg, beer, cereal, bread, diaper
• Basket 3: milk, beer, bread
• Basket 4: cereal, diaper, bread
Apriori Algorithm
The Apriori algorithm is a widely used data mining technique for discovering
frequent itemsets and association rules from large datasets. It is particularly
popular in market basket analysis, where it helps identify items frequently
purchased together.
How the Apriori Algorithm Works
The Apriori algorithm relies on the Apriori property, which states that:
• "If an itemset is frequent, then all its subsets must also be frequent."
This property reduces the search space by eliminating candidate itemsets that
include infrequent subsets.
Steps of the Apriori Algorithm
1. Generate Candidate Itemsets:
o Start with single-item itemsets (e.g., {A}, {B}, {C}).
o Extend these into larger itemsets by combining frequent itemsets
from the previous iteration.
2. Prune Infrequent Itemsets:
o For a candidate itemset to be frequent, its support (occurrence in
transactions) must meet or exceed a minimum support threshold.
o Remove itemsets that do not meet this threshold.
3. Repeat:
o Increase the size of the itemsets (e.g., 2-itemsets, 3-itemsets, etc.)
until no further frequent itemsets can be generated.
4. Generate Association Rules:
o Once frequent itemsets are identified, association rules are
generated by calculating confidence and lift for potential rules.
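To make these steps concrete, here is a small sketch on the four baskets from Figure 9.1. It uses the mlxtend library, which is an assumed choice since the notes do not name a specific implementation; the thresholds (50% support, 60% confidence) are also illustrative.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [
    ["egg", "beer", "sugar", "bread", "diaper"],
    ["egg", "beer", "cereal", "bread", "diaper"],
    ["milk", "beer", "bread"],
    ["cereal", "diaper", "bread"],
]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)

# Frequent itemsets with support >= 50%, then rules with confidence >= 60%
frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])

With a minimum support of 50%, an itemset must occur in at least two of the four baskets to survive pruning; the rules table then reports support, confidence, and lift for each surviving rule.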
From the suggested learning resource (the Manaranjan Pradhan textbook):
9.3 | COLLABORATIVE FILTERING
Collaborative filtering is based on the notion of similarity (or distance). For
example, if two users A and B have purchased the same products and have rated
them similarly on a common rating scale, then A and B can be considered
similar in their buying and preference behavior. Hence, if A buys a new product and rates it highly, that product can be recommended to B. Alternatively, the products that A has already bought and rated highly can be recommended to B, if B has not already bought them.
Similarity or distance between users can be computed using the ratings the users have given to the common items they have purchased. If the users are similar, then similarity measures such as the Jaccard coefficient and cosine similarity will have values closer to 1, while distance measures such as Euclidean distance will have low values. Calculating similarity and distance has already been discussed in Chapter 7. The most widely used distances and similarities are Euclidean distance, the Jaccard coefficient, cosine similarity, and Pearson correlation. We will discuss the collaborative filtering technique using the example described below.
The picture in Figure 9.2 depicts three users Rahul, Purvi, and Gaurav and the
books they have bought and rated.
The users are represented by their ratings in the Euclidean space in Figure 9.3. Here the dimensions correspond to the two books Into Thin Air and Missoula, which are the two books commonly bought by Rahul, Purvi, and Gaurav.
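As a rough illustration (the exact rating values from Figure 9.3 are not reproduced in these notes, so the numbers below are assumed), the Euclidean distance between users can be computed directly from their rating vectors:

import numpy as np

# Ratings for [Into Thin Air, Missoula]; values are illustrative only
rahul = np.array([5.0, 4.0])
purvi = np.array([4.5, 5.0])
gaurav = np.array([2.0, 2.5])

print(np.linalg.norm(rahul - purvi))    # small distance => similar users
print(np.linalg.norm(rahul - gaurav))   # larger distance => less similar users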
Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users based on common items
they have bought.
2. Item-Based Similarity: Finds K similar items based on common users who have bought those items.
Both algorithms are similar to K-Nearest Neighbors (KNN).
9.3.2 | User-Based Similarity
We will use the MovieLens dataset (see https://grouplens.org/datasets/movielens/) for finding similar users based on the common movies the users have watched and how they have rated those movies. The file ratings.csv in the dataset contains ratings given by users. Each line in this file represents a rating given by a user to a movie. The ratings are on a scale of 1 to 5. The dataset has the following features:
1. userId
2. movieId
3. rating
4. timestamp
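A minimal loading sketch is shown below; the file path is an assumption, and the pivot simply rearranges the ratings into the user × movie matrix on which user-based similarity operates.

import pandas as pd

# Columns in ratings.csv: userId, movieId, rating, timestamp
ratings_df = pd.read_csv("ratings.csv")

# Pivot into a user x movie rating matrix (NaN = movie not rated by that user)
user_item = ratings_df.pivot(index="userId", columns="movieId", values="rating")
print(user_item.head())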
1. Jaccard Coefficient
Definition:
The Jaccard coefficient measures the similarity between two sets by comparing
their intersection and union. It is particularly useful for binary data or datasets
with categorical features.
Formula:
J(A, B) = |A ∩ B| / |A ∪ B|
Range:
• Values range from 0 to 1:
o 0: No overlap.
o 1: Perfect overlap.
Use Cases:
• Text similarity (e.g., comparing documents based on word sets).
• Collaborative filtering with binary data (e.g., user preferences).
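A tiny worked example (the book titles echo Figure 9.2; the exact sets are assumed for illustration):

# Jaccard coefficient between the sets of books two users have bought
user_a = {"Into Thin Air", "Missoula", "Siddhartha"}
user_b = {"Into Thin Air", "Missoula", "Into the Wild"}

jaccard = len(user_a & user_b) / len(user_a | user_b)
print(jaccard)   # 2 common books / 4 distinct books = 0.5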
2. Cosine Similarity
Definition:
Cosine similarity measures the cosine of the angle between two non-zero
vectors in a multidimensional space. It is useful for comparing high-
dimensional data such as text embeddings or user-item interactions.
Range:
• Values range from −1 to 1:
o 1: Vectors point in the same direction (high similarity).
o 0: Vectors are orthogonal (no similarity).
o −1: Vectors point in opposite directions (high dissimilarity).
Use Cases:
• Document similarity in Natural Language Processing (e.g., TF-IDF
vectors).
• User-item interaction data in recommender systems.
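A minimal sketch, using the formula cos(θ) = (A · B) / (‖A‖ ‖B‖) on two illustrative rating vectors:

import numpy as np

# Ratings for [Into Thin Air, Missoula]; values are illustrative only
rahul = np.array([5.0, 4.0])
purvi = np.array([4.0, 5.0])

cos_sim = rahul @ purvi / (np.linalg.norm(rahul) * np.linalg.norm(purvi))
print(round(cos_sim, 2))   # close to 1 => very similar preferences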
3. Pearson Correlation
Definition:
Pearson correlation measures the linear relationship between two variables. It indicates whether an increase in one variable corresponds to an increase or decrease in the other.
Formula:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
Range:
• Values range from −1 to 1:
o 1: Perfect positive linear relationship.
o 0: No linear relationship.
o −1: Perfect negative linear relationship.
Use Cases:
• Collaborative filtering in recommender systems (e.g., user-user
similarity).
• Correlation analysis in statistical studies.
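A short sketch with made-up ratings of the same five movies by two users:

import numpy as np

user1 = np.array([5, 3, 4, 4, 2])
user2 = np.array([4, 2, 4, 5, 1])

# Pearson correlation coefficient (off-diagonal entry of the 2x2 matrix)
r = np.corrcoef(user1, user2)[0, 1]
print(round(r, 2))   # close to +1 => the two users' ratings move together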
Comparison
Metric | Use Case | Data Type | Key Characteristic
Jaccard Coefficient | Binary data (e.g., sets, categorical data) | Sets or binary vectors | Measures overlap between sets.
Cosine Similarity | High-dimensional data (e.g., text, vectors) | Numeric vectors | Measures the angle between two vectors.
Pearson Correlation | Ratings or continuous numerical data | Numeric data | Measures linear correlation between variables.
Key Points
• Jaccard Coefficient is best for sets or binary data.
• Cosine Similarity focuses on the direction of vectors, ignoring
magnitude.
• Pearson Correlation evaluates the strength and direction of a linear
relationship between variables.
These metrics serve different purposes, so the choice depends on the data type
and task at hand.
import pandas as pd

# Load the MovieLens movies file (path as used in Google Colab)
movies_df = pd.read_csv("/content/movies.csv")

print(movies_df.head(20))     # first 20 rows
print(movies_df.shape)        # (number of rows, number of columns)
print(movies_df.tail(10))     # last 10 rows
print(movies_df.describe())   # summary statistics of the numeric columns
print(movies_df.columns)      # column names
1. Similarity Matrix:
o A matrix where each entry (i, j) represents the similarity between item i and item j.
o Common similarity measures:
▪ Cosine Similarity: Measures the cosine of the angle
between item vectors.
▪ Pearson Correlation: Measures the linear relationship
between ratings for two items.
▪ Jaccard Similarity: Measures the overlap in users who
interacted with two items.
2. Recommendation Process:
o Identify items the user has interacted with.
o Compute the similarity of these items to all other items.
o Rank items based on their similarity scores and recommend the
top-ranked items.
3. Advantages of Item-Based Similarity:
o More stable compared to user-based similarity because item
relationships change less frequently.
o Scales well for scenarios with many users and fewer items (e.g.,
product recommendation).
4. Challenges:
o Cold-start problem for new items.
o Requires a sufficient number of user interactions to compute
meaningful similarities.
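A compact sketch of this process on a toy ratings matrix (all values assumed): build the item-item cosine similarity matrix, then rank other items by their similarity to the items a user has already rated.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x item rating matrix; 0 means "not rated"
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 5, 4],
    [0, 1, 5, 4],
])

item_sim = cosine_similarity(ratings.T)   # compare item columns -> 4 x 4 matrix
print(np.round(item_sim, 2))

# Items most similar to item 0, in descending order (excluding item 0 itself)
print(np.argsort(item_sim[0])[::-1][1:])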
9.4 | USING SURPRISE LIBRARY
For real-world implementations, we need a more extensive library which hides all the implementation details and provides abstract Application Programming Interfaces (APIs) to build recommender systems. Surprise is a Python library for accomplishing this. It provides the following features:
1. Various ready-to-use prediction algorithms, such as neighborhood methods (user similarity and item similarity) and matrix factorization-based methods. It also has built-in similarity measures such as cosine, mean squared difference (MSD), and Pearson correlation coefficient.
2. Tools to evaluate, analyze, and compare the performance of the algorithms. It also provides methods to generate recommendations.
We import the required modules or classes from the surprise library.
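The exact import cell from the textbook is not reproduced in these notes; a possible sketch (the file path and rating scale are assumptions) of loading the MovieLens-style ratings into Surprise and running a user-based KNN model is:

import pandas as pd
from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import cross_validate

# Load the ratings into Surprise from a pandas DataFrame
ratings_df = pd.read_csv("ratings.csv")
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[["userId", "movieId", "rating"]], reader)

# User-based KNN with cosine similarity
algo = KNNBasic(sim_options={"name": "cosine", "user_based": True})
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

Setting user_based to False in sim_options switches the same algorithm to item-based similarity.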
Matrix Factorization
Useful webpages
• Matrix Factorization made easy (Recommender Systems) | by Rohan Naidu |
Analytics Vidhya | Medium
• Recommender Systems: Matrix Factorization from scratch | by Aakanksha NS |
Towards Data Science
We come across recommendations multiple times a day — while deciding what to watch
on Netflix/Youtube, item recommendations on shopping sites, song suggestions on
Spotify, friend recommendations on Instagram, job recommendations on LinkedIn…the
list goes on! Recommender systems aim to predict the “rating” or “preference” a user
would give to an item. These ratings are used to determine what a user might like and
make informed suggestions.
There are two broad types of Recommender systems:
1. Content-Based systems: These systems try to match users with items based on
items’ content (genre, color, etc) and users’ profiles (likes, dislikes, demographic
information, etc.). For example, YouTube might suggest cooking videos to me based on the fact that I'm a chef, and/or that I've watched a lot of baking videos in the past, hence utilizing the information it has about a video's content and my profile.
2. Collaborative filtering: They rely on the assumption that similar users like
similar items. Similarity measures between users and/or items are used to make
recommendations.
This article talks about a very popular collaborative filtering technique called Matrix
factorization.
Matrix Factorization
A recommender system has two entities — users and items. Let’s say we have m users
and n items. The goal of our recommendation system is to build an m × n matrix (called the utility matrix) which consists of the rating (or preference) for each user-item pair.
Initially, this matrix is usually very sparse because we only have ratings for a limited number
of user-item pairs.
Here’s an example. Say we have 4 users and 5 superheroes and we’re trying to predict the
rating each user would give to each superhero. This is what our utility matrix initially looks
like:
[Figure: the initial 4 × 5 utility matrix, with only a few ratings filled in]
Now, our goal is to populate this matrix by finding similarities between users and items. To
get an intuition, for example, we see that User3 and User4 gave the same rating to Batman, so
we can assume the users are similar and they’d feel the same way about Spiderman and
predict that User3 would give a rating of 4 to Spiderman. In practice, however, this is not as
straightforward because there are multiple users interacting with many different items.
In practice, the matrix is populated by decomposing (or factorizing) the utility matrix into two tall and skinny matrices. The decomposition can be written as R ≈ U Vᵀ, where R is the m × n utility matrix, U is an m × k matrix of user factors, and V is an n × k matrix of item factors, with k much smaller than m and n.
Implementation
To implement matrix factorization, we can represent users and items with embedding matrices and use gradient descent to find the optimal decomposition. If you're unfamiliar with embeddings, you can check out this article where I've talked about them in detail.
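As a rough stand-in for that approach (this is not the article's code), the same idea can be sketched with plain NumPy: initialize small user and item embedding matrices and update them with stochastic gradient descent on the observed entries of a toy utility matrix.

import numpy as np

R = np.array([            # toy utility matrix; 0 means "no rating yet"
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2   # k latent factors (assumed)
rng = np.random.default_rng(42)
U = rng.normal(scale=0.1, size=(n_users, k))      # user embedding matrix
V = rng.normal(scale=0.1, size=(n_items, k))      # item embedding matrix

lr, reg = 0.01, 0.02                              # learning rate, L2 penalty
for _ in range(500):
    for u, i in zip(*R.nonzero()):                # observed ratings only
        err = R[u, i] - U[u] @ V[i]               # prediction error
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

print(np.round(U @ V.T, 2))                       # predicted utility matrix

After training, U @ V.T reconstructs the full m × n matrix, so the previously empty cells hold predicted ratings.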
Code
All the code I’ve used in this article can be found here: https://jovian.ml/aakanksha-ns/anime-
ratings-matrix-factorization
Dataset
I’ve used the Anime Recommendations dataset from Kaggle:
Surprise Library
Leveraging Surprise Library for Recommender Systems in Python | by Mario Montalvo
García | Medium
Introduction
Recommender systems play a crucial role in our daily lives, assisting us in discovering new
products, services, and content that align with our preferences. Python provides numerous
libraries for building recommender systems, and one powerful option is the Surprise library.
Surprise is an open-source Python library specifically designed for recommendation tasks,
making it easier to develop and evaluate recommender systems. In this article, we will
explore the uses and applications of the Surprise library and highlight its key features.
Defining Surprise
Surprise is a Python scikit for building and evaluating recommender systems. It provides a
simple and intuitive API, making it accessible even to beginners. Developed on top of SciPy,
Surprise offers a wide range of collaborative filtering algorithms, including matrix
factorization-based methods such as Singular Value Decomposition (SVD) and Non-negative
Matrix Factorization (NMF). It also supports neighborhood-based approaches like k-Nearest
Neighbors (k-NN) and provides tools for model selection and evaluation.
Applications of Surprise
• Movie Recommendations: Surprise is commonly used for movie recommendation
systems. By leveraging collaborative filtering algorithms, Surprise can analyze user
preferences and provide personalized movie suggestions based on similar users’
ratings.
• Music Recommendations: With the rise of music streaming platforms, building
accurate music recommendation systems has become crucial. Surprise can help create
personalized playlists and recommend new songs or artists based on users’ listening
habits.
• Book Recommendations: Recommending books based on user preferences is
another popular application of Surprise. By analyzing past ratings or reviews, the
library can suggest books that align with users’ reading preferences and interests.
• E-commerce Recommendations: Surprise can also be employed in e-commerce
platforms to recommend products to users based on their browsing history, purchase
behavior, and similarities with other users.
Key Features of Surprise
• Easy Integration: Surprise seamlessly integrates with other popular Python libraries,
such as NumPy and Pandas, making it convenient to preprocess and manipulate data
for recommendation tasks.
• Variety of Algorithms: Surprise offers a wide range of built-in algorithms, including
collaborative filtering, matrix factorization, and neighborhood-based methods. These
algorithms provide flexibility and allow developers to choose the most suitable
approach for their specific recommendation problem.
• Built-in Datasets: Surprise provides several built-in datasets, including famous
benchmark datasets like MovieLens and Jester. These datasets can be readily used for
experimentation and evaluation of recommendation models.
• Cross-Validation and Evaluation: Surprise simplifies the evaluation process by
providing built-in functions for cross-validation and performance metrics. Developers
can easily assess the accuracy and performance of their recommendation models
using metrics such as RMSE and MAE.
• Hyperparameter Tuning: The library also includes tools for hyperparameter tuning,
allowing developers to optimize the performance of their models. Grid search and
random search functionalities help in finding the best combination of hyperparameters
for improved recommendation accuracy.
Getting Started with Surprise
To begin using Surprise, follow these steps:
• Install the library: You can install Surprise using pip by running the command pip
install surprise in your terminal.
• Import Surprise and relevant modules: Start by importing Surprise and other
required modules using import surprise.
• Load or create a dataset: You can either load one of the built-in datasets provided by
Surprise or create your own dataset using Pandas or NumPy.
• Choose an algorithm: Select an algorithm from Surprise’s extensive collection based
on your recommendation task. Each algorithm has its own parameters that can be
tuned to improve performance.
• Instantiate and train the model: Create an instance of the chosen algorithm and fit it
to your dataset using the fit() method. This step trains the model on the provided data.
• Generate recommendations: Once the model is trained, you can generate predictions for users by calling the appropriate method, such as predict() for a single user-item pair or test() for an entire test set.
• Evaluate the model: Use Surprise’s built-in evaluation functions to assess the
performance of your model. Compute metrics like RMSE or MAE to measure the
accuracy of your recommendations.
• Fine-tune and iterate: Iterate through the previous steps, experimenting with
different algorithms, parameters, and evaluation metrics to refine your recommender
system.
Building a Book Recommendation System with Surprise
The following example builds a book recommendation system with the Surprise library in Python, using collaborative filtering and the Book-Crossing dataset. It splits the data, trains the model with the SVD algorithm, evaluates accuracy with RMSE, and generates personalized book recommendations. In the sketch below, the built-in MovieLens data stands in for the Book-Crossing ratings, since the original data-loading code is not reproduced in these notes. Here is the code:
from surprise import Dataset, SVD, accuracy
from surprise.model_selection import train_test_split

# The built-in MovieLens 100k dataset stands in for Book-Crossing here,
# since the original data-loading code is not reproduced in these notes.
data = Dataset.load_builtin("ml-100k")
trainset, testset = train_test_split(data, test_size=0.25)

algo = SVD()                        # matrix-factorization algorithm
algo.fit(trainset)                  # train on the training split
predictions = algo.test(testset)    # predict the held-out ratings
accuracy.rmse(predictions)          # evaluate with RMSE