
Module-4 Notes

Syllabus

Recommender Systems:
Datasets, Association rules, collaborative filtering, user-based similarity, item-
based similarity using surprise library, matrix factorization
Text Analytics:
Overview, sentiment classification, Naïve Bayes, model for sentiment
classification, using TF-IDF vectorizer, challenges of text analytics
Textbook: Manaranjan Pradhan (Consultant, Indian Institute of Management Bangalore) and U Dinesh Kumar (Professor, Indian Institute of Management Bangalore), Machine Learning using Python, Wiley, First Edition, 2019. Chapters 9 and 10.
ISBN: 978-81-265-7990-7; ISBN: 978-81-265-8855-8 (ebk)
www.wileyindia.com

ASSOCIATION RULES (ASSOCIATION RULE MINING)


Association rule mining finds combinations of items that frequently occur together in
orders or baskets (in a retail context). Items that frequently occur together are called
itemsets. Itemsets help discover relationships between items that people buy together,
which can then drive strategies such as bundling products into combo offers or placing
them next to each other on retail shelves to attract customer attention. A common
application of association rule mining is Market Basket Analysis (MBA), a technique
used mostly by retailers to find associations between items purchased by customers.

To illustrate the association rule mining concept, let us consider a set of baskets
and the items in those baskets purchased by customers as depicted in Figure 9.1.
Items purchased in different baskets are:
• Basket 1: egg, beer, sugar, bread, diaper
• Basket 2: egg, beer, cereal, bread, diaper
• Basket 3: milk, beer, bread
• Basket 4: cereal, diaper, bread

Apriori Algorithm
The Apriori algorithm is a widely used data mining technique for discovering
frequent itemsets and association rules from large datasets. It is particularly
popular in market basket analysis, where it helps identify items frequently
purchased together.
How the Apriori Algorithm Works
The Apriori algorithm relies on the Apriori property, which states that:
• "If an itemset is frequent, then all its subsets must also be frequent."
This property reduces the search space by eliminating candidate itemsets that
include infrequent subsets.

Steps of the Apriori Algorithm
1. Generate Candidate Itemsets:
o Start with single-item itemsets (e.g., {A}, {B}, {C}).
o Extend these into larger itemsets by combining frequent itemsets
from the previous iteration.
2. Prune Infrequent Itemsets:
o For a candidate itemset to be frequent, its support (occurrence in
transactions) must meet or exceed a minimum support threshold.
o Remove itemsets that do not meet this threshold.
3. Repeat:
o Increase the size of the itemsets (e.g., 2-itemsets, 3-itemsets, etc.)
until no further frequent itemsets can be generated.
4. Generate Association Rules:
o Once frequent itemsets are identified, association rules are
generated by calculating confidence and lift for potential rules.
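To make these steps concrete, the following is a minimal sketch using the third-party mlxtend library (an assumption; the textbook does not prescribe a specific library), run on the four baskets from Figure 9.1. The min_support and min_threshold values are illustrative choices.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# The four baskets from Figure 9.1
baskets = [
    ["egg", "beer", "sugar", "bread", "diaper"],
    ["egg", "beer", "cereal", "bread", "diaper"],
    ["milk", "beer", "bread"],
    ["cereal", "diaper", "bread"],
]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = te.fit(baskets).transform(baskets)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with support >= 0.5, then rules above a confidence floor
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])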

Key Terms in Apriori Algorithm


1. Support:
o Measures how often an itemset appears in the dataset.
o Formula: Support(X) = (Number of transactions containing X) / (Total number of transactions)
o Example from the baskets above: {beer, bread} appears in Baskets 1, 2, and 3, so Support({beer, bread}) = 3/4 = 0.75.
2. Confidence:
o Measures how often the rule A → B holds, given that A occurs.
o Formula: Confidence(A → B) = Support(A ∪ B) / Support(A)
3. Lift:
o Measures how much more often A and B occur together than expected if they were independent; lift > 1 indicates a positive association.
o Formula: Lift(A → B) = Confidence(A → B) / Support(B)

Pros and Cons of Association Rule Mining


The following are advantages of using association rules:
1. Transactions data, which is used for generating rules, is always available and mostly clean.
2. The rules generated are simple and can be interpreted.
However, association rules do not take into account the preferences or ratings given by
customers, which is important information for generating rules. If customers have bought
two items but disliked one of them, then the association should not be considered.
Collaborative filtering takes both what customers bought and how they liked (rated) the
items into consideration before recommending. Association rule mining is used across
several use cases, including product recommendations.

9.3 | COLLABORATIVE FILTERING
Collaborative filtering is based on the notion of similarity (or distance). For
example, if two users A and B have purchased the same products and rated them
similarly on a common rating scale, then A and B can be considered similar in their
buying and preference behavior. Hence, if A buys a new product and rates it highly,
that product can be recommended to B. Alternatively, products that A has already
bought and rated highly can be recommended to B, if B has not already bought them.

How to Find Similarity between Users?

Similarity or distance between users can be computed using the ratings the users have
given to the common items they purchased. If two users are similar, then similarity
measures such as the Jaccard coefficient and cosine similarity will have values closer
to 1, and distance measures such as Euclidean distance will have low values.
Calculating similarity and distance has already been discussed in Chapter 7. The most
widely used distances and similarities are Euclidean distance, the Jaccard coefficient,
cosine similarity, and Pearson correlation. We will discuss the collaborative filtering
technique using the example described below. Figure 9.2 depicts three users Rahul,
Purvi, and Gaurav and the books they have bought and rated.

The users are represented by their ratings in Euclidean space in Figure 9.3. Here the
dimensions are the two books Into Thin Air and Missoula, the two books commonly
bought by Rahul, Purvi, and Gaurav.

What is Euclidean Distance?


Euclidean distance is a measure of the straight-line distance between two
points in Euclidean space. It is the most common and familiar distance
metric, often referred to as the "ordinary" distance. For two points
p = (p1, ..., pn) and q = (q1, ..., qn):

d(p, q) = sqrt((p1 − q1)² + (p2 − q2)² + ... + (pn − qn)²)
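As a quick illustration, here is a small sketch computing Euclidean distance between users from their ratings; the rating values are hypothetical, not taken from Figure 9.3.

import numpy as np

# Hypothetical ratings on the two common books (Into Thin Air, Missoula)
rahul = np.array([5, 4])
purvi = np.array([4, 5])
gaurav = np.array([1, 2])

def euclidean(u, v):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((u - v) ** 2))

print(euclidean(rahul, purvi))   # small distance -> similar preferences
print(euclidean(rahul, gaurav))  # larger distance -> dissimilar preferences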

Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users based on common items
they have bought.
2. Item-Based Similarity: Finds K similar items based on common users
who have bought those items.
Both algorithms are similar to K-Nearest Neighbors (KNN).
9.3.2 | User-Based Similarity
We will use the MovieLens dataset (see https://grouplens.org/datasets/movielens/)
for finding similar users based on the common movies they have watched and how they
have rated those movies. The file ratings.csv in the dataset contains the ratings
given by users; each line represents one rating given by a user to a movie, on a scale
of 1 to 5. The dataset has the following features:
1. userId
2. movieId
3. rating
4. timestamp
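A minimal sketch of user-based similarity on this dataset is shown below, assuming ratings.csv is in the working directory and using scikit-learn's cosine similarity (the textbook's exact code may differ).

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("ratings.csv")  # columns: userId, movieId, rating, timestamp

# Build a user x movie rating matrix; unrated cells are filled with 0 here
user_movie = ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# Pairwise cosine similarity between all users
sim = pd.DataFrame(cosine_similarity(user_movie),
                   index=user_movie.index, columns=user_movie.index)

# The K users most similar to user 1 (excluding user 1 itself)
k = 5
print(sim.loc[1].drop(1).nlargest(k))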

9.3.2.6 Challenges with User-Based Similarity


Finding user similarity does not work for new users: we need to wait until a new user
buys a few items and rates them, and only then can users with similar preferences be
found and recommendations be made. This is called the cold start problem in recommender
systems. It can be overcome by using item-based similarity, which is based on the
notion that if two items have been bought by many users and rated similarly, then there
must be some inherent relationship between the two items. In other words, if a user
buys one of those two items in the future, he or she will most likely buy the other one.

9.3.3 | Item-Based Similarity


If two movies, movie A and movie B, have been watched by several users and rated very
similarly, then movie A and movie B are likely similar in taste. In other words, if a
user watches movie A, then he or she is very likely to watch movie B, and vice versa.
Jaccard coefficient, cosine similarity, and Pearson correlation are commonly
used similarity/distance metrics, each suited for different types of data and
scenarios. Here's a breakdown of these concepts:

1. Jaccard Coefficient
Definition:
The Jaccard coefficient measures the similarity between two sets by comparing
their intersection and union. It is particularly useful for binary data or datasets
with categorical features.
Formula:

J(A, B) = |A ∩ B| / |A ∪ B|
Range:
• Values range from 0 to 1:
o 0: No overlap.
o 1: Perfect overlap.

Use Cases:
• Text similarity (e.g., comparing documents based on word sets).
• Collaborative filtering with binary data (e.g., user preferences).
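For instance, a minimal sketch with plain Python sets (the item names are made up):

# Jaccard coefficient over the sets of items two users have bought
user_a = {"bread", "beer", "egg", "diaper"}
user_b = {"bread", "beer", "milk"}

jaccard = len(user_a & user_b) / len(user_a | user_b)
print(jaccard)  # |{bread, beer}| / |union of 5 items| = 2/5 = 0.4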

2. Cosine Similarity
Definition:
Cosine similarity measures the cosine of the angle between two non-zero
vectors in a multidimensional space. It is useful for comparing high-
dimensional data such as text embeddings or user-item interactions.
Formula:

cos(A, B) = (A · B) / (||A|| ||B||)
Range:
• Values range from −1 to 1:
o 1: Vectors point in the same direction (high similarity).
o 0: Vectors are orthogonal (no similarity).
o −1: Vectors point in opposite directions (high dissimilarity).

Use Cases:
• Document similarity in Natural Language Processing (e.g., TF-IDF
vectors).
• User-item interaction data in recommender systems.
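A minimal sketch with NumPy (the rating vectors are hypothetical):

import numpy as np

# Ratings given by two users to the same three items
a = np.array([5, 3, 4])
b = np.array([4, 2, 5])

# Cosine similarity = dot product divided by the product of vector norms
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos, 3))  # about 0.97, close to 1 -> very similar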

3. Pearson Correlation Coefficient

Definition:
Pearson correlation measures the linear relationship between two variables. It
indicates whether an increase in one variable corresponds to an increase or decrease in
the other.
Formula:

r = Σ(x_i − x̄)(y_i − ȳ) / sqrt(Σ(x_i − x̄)² · Σ(y_i − ȳ)²)

Use Cases:
• Collaborative filtering in recommender systems (e.g., user-user
similarity).
• Correlation analysis in statistical studies.
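A minimal sketch with NumPy (the rating vectors are hypothetical):

import numpy as np

# Ratings by two users on the same four movies
x = np.array([5, 4, 2, 1])
y = np.array([4, 5, 1, 2])

# Pearson correlation via the correlation-coefficient matrix
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.8 -> strong positive linear relationship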

Comparison
Metric              | Use Case                                    | Data Type               | Key Characteristic
Jaccard Coefficient | Binary data (e.g., sets, categorical data)  | Sets or binary vectors  | Measures overlap between sets.
Cosine Similarity   | High-dimensional data (e.g., text, vectors) | Numeric vectors         | Measures the angle between two vectors.
Pearson Correlation | Ratings or continuous numerical data        | Numeric data            | Measures linear correlation between variables.

Key Points
• Jaccard Coefficient is best for sets or binary data.
• Cosine Similarity focuses on the direction of vectors, ignoring
magnitude.
• Pearson Correlation evaluates the strength and direction of a linear
relationship between variables.
These metrics serve different purposes, so the choice depends on the data type
and the task at hand.

import pandas as pd

# Load the MovieLens movies file and inspect it
movies_df = pd.read_csv("/content/movies.csv")

print(movies_df.head(20))    # first 20 rows
print(movies_df.shape)       # (number of rows, number of columns)
print(movies_df.tail(10))    # last 10 rows
print(movies_df.describe())  # summary statistics (note the parentheses)
print(movies_df.columns)     # column names

Item-Based Similarity in Recommender Systems
Item-based similarity is a technique used in collaborative filtering for
recommending items to users. Instead of focusing on user-user relationships, it
measures the similarity between items based on user interactions (e.g., ratings,
purchases). The idea is that if a user liked one item, they are likely to like
similar items.
Key Concepts

1. Similarity Matrix:
o A matrix where each entry (i, j) represents the similarity between item i
and item j.
o Common similarity measures:
▪ Cosine Similarity: Measures the cosine of the angle
between item vectors.
▪ Pearson Correlation: Measures the linear relationship
between ratings for two items.
▪ Jaccard Similarity: Measures the overlap in users who
interacted with two items.
2. Recommendation Process:
o Identify items the user has interacted with.
o Compute the similarity of these items to all other items.
o Rank items based on their similarity scores and recommend the
top-ranked items.
3. Advantages of Item-Based Similarity:
o More stable compared to user-based similarity because item
relationships change less frequently.
o Scales well for scenarios with many users and fewer items (e.g.,
product recommendation).
4. Challenges:
o Cold-start problem for new items.
o Requires a sufficient number of user interactions to compute
meaningful similarities.
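The recommendation process above can be sketched in a few lines; the similarity matrix here is hand-made for illustration, not computed from real data.

import pandas as pd

items = ["A", "B", "C", "D"]

# Hypothetical item-item similarity matrix: entry (i, j) is the similarity
# between item i and item j
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.3],
     [0.8, 1.0, 0.2, 0.4],
     [0.1, 0.2, 1.0, 0.7],
     [0.3, 0.4, 0.7, 1.0]],
    index=items, columns=items,
)

seen = ["A"]  # items the user has already interacted with

# Score each unseen item by its highest similarity to any seen item,
# then rank the candidates
scores = sim.loc[seen].drop(columns=seen).max()
print(scores.sort_values(ascending=False))  # B (0.8) is recommended first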
9.4 | USING SURPRISE LIBRARY
For real-world implementations, we need a more extensive library that hides the
implementation details and provides abstract Application Programming Interfaces (APIs)
to build recommender systems. Surprise is a Python library for accomplishing this. It
provides the following features:
1. Various ready-to-use prediction algorithms, including neighborhood methods (user
similarity and item similarity) and matrix factorization-based methods. It also has
built-in similarity measures such as cosine, mean squared difference (MSD), and the
Pearson correlation coefficient.
2. Tools to evaluate, analyze, and compare the performance of the algorithms. It also
provides methods to recommend.
We import the required modules or classes from the surprise library.
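For example, a typical set of imports and a neighborhood-based model might look like this (a sketch, not the textbook's exact listing):

from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

# Built-in MovieLens 100k dataset (downloaded on first use)
data = Dataset.load_builtin("ml-100k")

# User-based KNN with cosine similarity; set user_based to False for
# item-based similarity
sim_options = {"name": "cosine", "user_based": True}
algo = KNNBasic(sim_options=sim_options)

# 5-fold cross-validation reporting RMSE and MAE
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)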

9.3.3.1 Calculating Cosine Similarity between Movies
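The book's listing for this section is not reproduced in these notes; as a sketch, one way to compute movie-to-movie cosine similarity on the MovieLens ratings (file path and movie ID are assumptions) is:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("ratings.csv")

# Movie x user matrix: each row is one movie's rating vector across users
movie_user = ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)

# Cosine similarity between every pair of movies
movie_sim = pd.DataFrame(cosine_similarity(movie_user),
                         index=movie_user.index, columns=movie_user.index)

# The five movies most similar to movieId 1 (excluding itself)
print(movie_sim.loc[1].drop(1).nlargest(5))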

Matrix Factorization
Useful webpages
• Matrix Factorization made easy (Recommender Systems) | by Rohan Naidu |
Analytics Vidhya | Medium
• Recommender Systems: Matrix Factorization from scratch | by Aakanksha NS |
Towards Data Science
We come across recommendations multiple times a day — while deciding what to watch
on Netflix/Youtube, item recommendations on shopping sites, song suggestions on
Spotify, friend recommendations on Instagram, job recommendations on LinkedIn…the
list goes on! Recommender systems aim to predict the “rating” or “preference” a user
would give to an item. These ratings are used to determine what a user might like and
make informed suggestions.
There are two broad types of Recommender systems:
1. Content-Based systems: These systems try to match users with items based on the
items' content (genre, color, etc.) and users' profiles (likes, dislikes, demographic
information, etc.). For example, Youtube might suggest cooking videos to me based on
the fact that I'm a chef, and/or that I've watched a lot of baking videos in the past,
hence utilizing the information it has about a video's content and my profile.
2. Collaborative filtering: They rely on the assumption that similar users like
similar items. Similarity measures between users and/or items are used to make
recommendations.
This article talks about a very popular collaborative filtering technique called Matrix
factorization.
Matrix Factorization
A recommender system has two entities — users and items. Let’s say we have m users
and n items. The goal of our recommendation system is to build an mxn matrix (called
the utility matrix) which consists of the rating (or preference) for each user-item pair.
Initially, this matrix is usually very sparse because we only have ratings for a limited number
of user-item pairs.
Here’s an example. Say we have 4 users and 5 superheroes and we’re trying to predict the
rating each user would give to each superhero. This is what our utility matrix initially looks
like:

Now, our goal is to populate this matrix by finding similarities between users and items. To
get an intuition, for example, we see that User3 and User4 gave the same rating to Batman, so
we can assume the users are similar and they’d feel the same way about Spiderman and
predict that User3 would give a rating of 4 to Spiderman. In practice, however, this is not as
straightforward because there are multiple users interacting with many different items.
In practice, the matrix is populated by decomposing (or factorizing) the utility matrix
into two tall and skinny matrices. The decomposition has the equation:

R (m × n) ≈ U (m × k) × Vᵀ (k × n)

where U holds the user embeddings, V holds the item embeddings, and k is the number of
latent factors.

Implementation
To implement matrix factorization, we can use embeddings for the user and item
embedding matrices and use gradient descent to find the optimal decomposition. If
you're unfamiliar with embeddings, you can check out this article where I've talked
about them in detail.
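As a minimal sketch of this idea (made-up ratings and plain NumPy rather than the article's embedding-layer code):

import numpy as np

# Utility matrix with NaN for unknown ratings (made-up example)
R = np.array([
    [5, 3, np.nan, 1],
    [4, np.nan, np.nan, 1],
    [1, 1, np.nan, 5],
    [np.nan, 1, 5, 4],
])

m, n = R.shape
k = 2  # number of latent factors (a hyperparameter)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, k))  # user embeddings
V = rng.normal(scale=0.1, size=(n, k))  # item embeddings

lr, reg = 0.01, 0.02
mask = ~np.isnan(R)

for _ in range(2000):
    pred = U @ V.T
    err = np.where(mask, R - pred, 0.0)  # error only on observed ratings
    # Gradient descent step on squared error with L2 regularization
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 2))  # dense matrix of predicted ratings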
Code
All the code I’ve used in this article can be found here: https://jovian.ml/aakanksha-ns/anime-
ratings-matrix-factorization
Dataset
I’ve used the Anime Recommendations dataset from Kaggle:
Surprise Library
Leveraging Surprise Library for Recommender Systems in Python | by Mario Montalvo
García | Medium

Introduction
Recommender systems play a crucial role in our daily lives, assisting us in discovering new
products, services, and content that align with our preferences. Python provides numerous
libraries for building recommender systems, and one powerful option is the Surprise library.
Surprise is an open-source Python library specifically designed for recommendation tasks,
making it easier to develop and evaluate recommender systems. In this article, we will
explore the uses and applications of the Surprise library and highlight its key features.
Defining Surprise
Surprise is a Python scikit for building and evaluating recommender systems. It provides a
simple and intuitive API, making it accessible even to beginners. Developed on top of SciPy,
Surprise offers a wide range of collaborative filtering algorithms, including matrix
factorization-based methods such as Singular Value Decomposition (SVD) and Non-negative
Matrix Factorization (NMF). It also supports neighborhood-based approaches like k-Nearest
Neighbors (k-NN) and provides tools for model selection and evaluation.
Applications of Surprise
• Movie Recommendations: Surprise is commonly used for movie recommendation
systems. By leveraging collaborative filtering algorithms, Surprise can analyze user
preferences and provide personalized movie suggestions based on similar users’
ratings.
• Music Recommendations: With the rise of music streaming platforms, building
accurate music recommendation systems has become crucial. Surprise can help create
personalized playlists and recommend new songs or artists based on users’ listening
habits.
• Book Recommendations: Recommending books based on user preferences is
another popular application of Surprise. By analyzing past ratings or reviews, the
library can suggest books that align with users’ reading preferences and interests.
• E-commerce Recommendations: Surprise can also be employed in e-commerce
platforms to recommend products to users based on their browsing history, purchase
behavior, and similarities with other users.
Key Features of Surprise
• Easy Integration: Surprise seamlessly integrates with other popular Python libraries,
such as NumPy and Pandas, making it convenient to preprocess and manipulate data
for recommendation tasks.
• Variety of Algorithms: Surprise offers a wide range of built-in algorithms, including
collaborative filtering, matrix factorization, and neighborhood-based methods. These
algorithms provide flexibility and allow developers to choose the most suitable
approach for their specific recommendation problem.
• Built-in Datasets: Surprise provides several built-in datasets, including famous
benchmark datasets like MovieLens and Jester. These datasets can be readily used for
experimentation and evaluation of recommendation models.
• Cross-Validation and Evaluation: Surprise simplifies the evaluation process by
providing built-in functions for cross-validation and performance metrics. Developers
can easily assess the accuracy and performance of their recommendation models
using metrics such as RMSE and MAE.
• Hyperparameter Tuning: The library also includes tools for hyperparameter tuning,
allowing developers to optimize the performance of their models. Grid search and
random search functionalities help in finding the best combination of hyperparameters
for improved recommendation accuracy.
Getting Started with Surprise
To begin using Surprise, follow these steps:

• Install the library: You can install Surprise using pip by running the command pip
install scikit-surprise in your terminal (the package is published on PyPI as scikit-surprise).
• Import Surprise and relevant modules: Start by importing Surprise and other
required modules using import surprise.
• Load or create a dataset: You can either load one of the built-in datasets provided by
Surprise or create your own dataset using Pandas or NumPy.
• Choose an algorithm: Select an algorithm from Surprise’s extensive collection based
on your recommendation task. Each algorithm has its own parameters that can be
tuned to improve performance.
• Instantiate and train the model: Create an instance of the chosen algorithm and fit it
to your dataset using the fit() method. This step trains the model on the provided data.
• Generate recommendations: Once the model is trained, you can generate rating
predictions by calling predict() for individual user-item pairs; top-N recommendation
lists are built by ranking these predictions (Surprise does not ship a recommend() method).
• Evaluate the model: Use Surprise’s built-in evaluation functions to assess the
performance of your model. Compute metrics like RMSE or MAE to measure the
accuracy of your recommendations.
• Fine-tune and iterate: Iterate through the previous steps, experimenting with
different algorithms, parameters, and evaluation metrics to refine your recommender
system.
Building a Book Recommendation System with Surprise
The following example builds a book recommendation system with the Surprise library in
Python, using a collaborative filtering algorithm and the Book-Crossing dataset. It
splits the data, trains an SVD model, evaluates accuracy with RMSE, and generates
personalized book recommendations. Here is the code:
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# Load the Book-Crossing dataset. Note: Surprise's built-in datasets are only
# 'ml-100k', 'ml-1m' and 'jester', so the Kaggle Book-Crossing ratings file
# (path assumed) is loaded into a DataFrame and wrapped with a Reader.
ratings_df = pd.read_csv("BX-Book-Ratings.csv", sep=";", encoding="latin-1")
ratings_df.columns = ["user_id", "isbn", "rating"]

reader = Reader(rating_scale=(0, 10))  # Book-Crossing ratings run from 0 to 10
data = Dataset.load_from_df(ratings_df[["user_id", "isbn", "rating"]], reader)

# Split the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2)

# Define the matrix factorization-based algorithm and train it
algo = SVD()
algo.fit(trainset)

# Predict ratings for the test set and evaluate the accuracy of the model
predictions = algo.test(testset)
accuracy.rmse(predictions)

# Get top-N book recommendations for a specific user. Surprise has no
# get_top_n() method on its algorithms; instead, predict ratings for the
# books the user has not rated yet and rank them.
user_id = 276729
top_n = 5

all_books = set(ratings_df["isbn"])
seen = set(ratings_df.loc[ratings_df["user_id"] == user_id, "isbn"])
candidates = [algo.predict(user_id, isbn) for isbn in all_books - seen]
candidates.sort(key=lambda p: p.est, reverse=True)

# Print the top-N book recommendations for the user
for pred in candidates[:top_n]:
    print(f"Book: {pred.iid}, Predicted Rating: {pred.est:.2f}")
Conclusion
The Surprise library provides a user-friendly and powerful framework for building
recommender systems in Python. Its intuitive API, variety of algorithms, and convenient
evaluation tools make it an excellent choice for both beginners and experienced developers.
With applications ranging from movie and music recommendations to e-commerce and book
suggestions, Surprise enables the creation of personalized recommendation systems tailored to
different domains. By leveraging Surprise’s features and following the steps outlined in this
article, you can start developing your own recommender systems and deliver personalized
recommendations to users based on their preferences and behaviors.
