How to store a TfidfVectorizer for future use in scikit-learn?
The TfidfVectorizer in scikit-learn is a powerful tool for converting text data into numerical features, making it essential for many Natural Language Processing (NLP) tasks. Once you have fitted and transformed your data with TfidfVectorizer, you might want to save the vectorizer for future use.
This guide will show you how to store a TfidfVectorizer using scikit-learn and load it later for transforming new text data.
What is TfidfVectorizer?
The TfidfVectorizer is a feature extraction technique in the scikit-learn library for converting a collection of raw text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. This is a common step in Natural Language Processing (NLP) and text mining tasks, transforming text into numerical data that machine learning algorithms can work with.
How TfidfVectorizer Works
- Term Frequency (TF): This measures how frequently a term (word) appears in a document. The assumption is that the more frequently a term appears in a document, the more important it is. However, this alone can be misleading, as common words (like "the", "is", "and") will appear frequently in many documents.
- Inverse Document Frequency (IDF): This measures how important a term is by considering how often it appears across all documents in the dataset. The more documents a term appears in, the less important it is. The IDF value of a term decreases as the number of documents containing the term increases.
- TF-IDF: The product of TF and IDF. This score gives us an indication of how important a term is within a particular document while reducing the weight of commonly occurring terms that are less informative.
Formula of TF-IDF
The TF-IDF score for a term t in a document d is calculated as:
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
Where:
- \text{tf}(t, d) is the term frequency of term t in document d.
- \text{idf}(t) is the inverse document frequency of term t, calculated as:
\text{idf}(t) = \log \left( \frac{N}{1 + \text{df}(t)} \right)
Where:
- N is the total number of documents.
- \text{df}(t) is the number of documents containing the term t.
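To make the formula concrete, the short sketch below computes idf values for a tiny three-document corpus by hand and compares them with the idf_ attribute of a fitted TfidfVectorizer. Note that scikit-learn's default setting (smooth_idf=True) uses a slightly different smoothing, \text{idf}(t) = \ln\left(\frac{1 + N}{1 + \text{df}(t)}\right) + 1, so the two sets of numbers will differ; the corpus and the hand-counted document frequencies here are made up purely for illustration.
import math
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
N = len(corpus)  # total number of documents

# Document frequency of each term, counted by hand from the corpus above
df = {"the": 3, "cat": 2, "sat": 2, "dog": 1, "ran": 1}

# Textbook idf from the formula above: log(N / (1 + df(t)))
for term, count in df.items():
    print(term, math.log(N / (1 + count)))

# scikit-learn's smoothed variant, for comparison
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))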
Steps to store a TfidfVectorizer in scikit-learn
TF-IDF evaluates how important a word is to a document in a collection. Storing a TfidfVectorizer can be useful when you need to preprocess text data in a consistent way across different sessions or applications.
Step 1: Import Necessary Libraries
Import the necessary libraries. TfidfVectorizer from sklearn is used for transforming text data into TF-IDF features, while pickle and joblib are used for saving and loading the fitted vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
from joblib import dump, load
import pickle
Step 2: Prepare Sample Data
Define a list of sample text documents. These documents will be used to fit the TfidfVectorizer.
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]
Step 3: Create and Fit the TfidfVectorizer
Create an instance of TfidfVectorizer and fit it to the sample documents. The fit_transform method learns the vocabulary and idf weights from the documents and returns the transformed TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
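It can be helpful to peek at what fit_transform actually learned, since this is precisely the state that the following steps persist to disk: vocabulary_ maps each term to its column index, and idf_ holds the learned inverse document frequency weights.
print(vectorizer.vocabulary_)  # term -> column index, e.g. {'this': 8, ...}
print(vectorizer.idf_)         # one learned idf weight per vocabulary term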
Step 4: Save the Vectorizer using pickle
Save the fitted TfidfVectorizer to a file using pickle. This allows the vectorizer to be reused later without needing to refit it to the data.
with open('tfidf_vectorizer.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)
Step 5: Load the Vectorizer using pickle
Load the saved TfidfVectorizer from the file using pickle. This restores the vectorizer to the state it was in when it was saved. Keep in mind that unpickling can execute arbitrary code, so only load pickle files from sources you trust.
with open('tfidf_vectorizer.pkl', 'rb') as file:
    loaded_vectorizer_pickle = pickle.load(file)
Step 6: Save the Vectorizer using joblib
Save the fitted TfidfVectorizer to a file using joblib. joblib is optimized for objects that contain large NumPy arrays, making it a good choice for saving scikit-learn models.
dump(vectorizer, 'tfidf_vectorizer.joblib')
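If the fitted vectorizer is large (say, a vocabulary of hundreds of thousands of terms), joblib can also compress the file on the fly via its compress argument. This is optional and not needed for the small example here; the filename below is just an illustrative choice.
# Optional: compress level 0-9; higher means smaller files but slower I/O
dump(vectorizer, 'tfidf_vectorizer_compressed.joblib', compress=3)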
Step 7: Load the Vectorizer using joblib
Load the saved TfidfVectorizer from the file using joblib. This restores the vectorizer to the state it was in when it was saved.
loaded_vectorizer_joblib = load('tfidf_vectorizer.joblib')
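As a quick sanity check (not part of the original walkthrough), you can confirm that both loaded objects carry the same learned vocabulary as the vectorizer you fitted:
assert loaded_vectorizer_pickle.vocabulary_ == vectorizer.vocabulary_
assert loaded_vectorizer_joblib.vocabulary_ == vectorizer.vocabulary_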
Step 8: Prepare Sample New Data
Define a list of new text documents. These documents will be transformed using the loaded vectorizers.
new_documents = [
    "This is a new document.",
    "This document is different from the others."
]
Step 9: Transform the New Data with the Loaded Vectorizer from pickle
Transform the new text documents using the vectorizer loaded from the pickle file. This converts the new documents into TF-IDF features.
X_new_pickle = loaded_vectorizer_pickle.transform(new_documents)
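Note that transform returns a SciPy sparse matrix, and its width is fixed by the training-time vocabulary: terms the vectorizer never saw during fitting (such as "new", "different", "from", and "others") are silently ignored. A quick check:
print(type(X_new_pickle))   # a SciPy CSR sparse matrix
print(X_new_pickle.shape)   # (2, 9): 2 new documents x 9 training-time features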
Step 10: Transform the New Data with the Loaded Vectorizer from joblib
Transform the new text documents using the vectorizer loaded from the joblib file. This also converts the new documents into TF-IDF features.
X_new_joblib = loaded_vectorizer_joblib.transform(new_documents)
Step 11: Print the Feature Names and the Transformed Data
Print the feature names and the transformed data. This lets you see the features (terms) extracted by the TfidfVectorizer and the TF-IDF values for both the original and new documents.
print("Feature names:")
print(vectorizer.get_feature_names_out())
print("\nOriginal transformed data:")
print(X.toarray())
print("\nTransformed new data using loaded vectorizer from pickle:")
print(X_new_pickle.toarray())
print("\nTransformed new data using loaded vectorizer from joblib:")
print(X_new_joblib.toarray())
Complete Implementation and Output:
Python
from sklearn.feature_extraction.text import TfidfVectorizer
from joblib import dump, load
import pickle
# Sample data
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]
# Create and fit the TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
# Save the vectorizer using pickle
with open('tfidf_vectorizer.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)
# Load the vectorizer using pickle
with open('tfidf_vectorizer.pkl', 'rb') as file:
    loaded_vectorizer_pickle = pickle.load(file)
# Save the vectorizer using joblib
dump(vectorizer, 'tfidf_vectorizer.joblib')
# Load the vectorizer using joblib
loaded_vectorizer_joblib = load('tfidf_vectorizer.joblib')
# Sample new data
new_documents = [
    "This is a new document.",
    "This document is different from the others."
]
# Transform the new data with the loaded vectorizer from pickle
X_new_pickle = loaded_vectorizer_pickle.transform(new_documents)
# Transform the new data with the loaded vectorizer from joblib
X_new_joblib = loaded_vectorizer_joblib.transform(new_documents)
# Print the feature names and the transformed data
print("Feature names:")
print(vectorizer.get_feature_names_out())
print("\nOriginal transformed data:")
print(X.toarray())
print("\nTransformed new data using loaded vectorizer from pickle:")
print(X_new_pickle.toarray())
print("\nTransformed new data using loaded vectorizer from joblib:")
print(X_new_joblib.toarray())
Output:
Feature names:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Original transformed data:
[[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.6876236 0. 0.28108867 0. 0.53864762
0.28108867 0. 0.28108867]
[0.51184851 0. 0. 0.26710379 0.51184851 0.
0.26710379 0.51184851 0.26710379]
[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]]
Transformed new data using loaded vectorizer from pickle:
[[0. 0.65416415 0. 0.53482206 0. 0.
0. 0. 0.53482206]
[0. 0.57684669 0. 0.47160997 0. 0.
0.47160997 0. 0.47160997]]
Transformed new data using loaded vectorizer from joblib:
[[0. 0.65416415 0. 0.53482206 0. 0.
0. 0. 0.53482206]
[0. 0.57684669 0. 0.47160997 0. 0.
0.47160997 0. 0.47160997]]
The output represents:
- Feature Names: The terms extracted from the documents.
- Original Transformed Data: TF-IDF scores for the original documents.
- Transformed New Data (pickle): TF-IDF scores for the new documents using the vectorizer loaded from the pickle file.
- Transformed New Data (joblib): TF-IDF scores for the new documents using the vectorizer loaded from the joblib file.
Both the pickle and joblib methods successfully store and restore the TfidfVectorizer, allowing for consistent transformation of new data.
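In practice, you often want to persist the vectorizer together with the downstream model so the two can never drift apart. The sketch below shows that pattern using a scikit-learn Pipeline; the LogisticRegression classifier and the labels list are hypothetical additions for illustration, not part of the example above.
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]
# Hypothetical binary labels for the four sample documents
labels = [1, 1, 0, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(documents, labels)

# One file now stores the fitted vectorizer and classifier together
dump(pipeline, 'tfidf_pipeline.joblib')
restored = load('tfidf_pipeline.joblib')
print(restored.predict(["Is this yet another document?"]))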
Conclusion
Storing a TfidfVectorizer for future use is a practical way to ensure consistent text preprocessing. Whether you use pickle or joblib, the process is straightforward and can save time in your machine learning workflow. One caveat: scikit-learn does not guarantee that objects serialized with one library version will load under another, so it is good practice to record the scikit-learn version alongside the saved file.