Bag of Words and Frequency Count in Text Using sklearn
Last Updated: 21 May, 2024
Text data is ubiquitous in today's digital world, from emails and social media posts to research articles and customer reviews. To analyze and derive insights from this textual information, it's essential to convert text into a numerical form that machine learning algorithms can process. One of the fundamental methods for this conversion is the "Bag of Words" (BoW) model, which represents text as a collection of word frequencies. In this article, we will explore the BoW model, its implementation, and how to perform frequency counts using Scikit-learn, a powerful machine-learning library in Python.
What is the Bag of Words Model?
The Bag of Words model is a simple and effective way of representing text data. It treats a text document as an unordered collection of words, disregarding grammar and word order while preserving the word frequency. The primary steps involved in creating a BoW model are:
- Tokenization: Splitting the text into individual words (tokens).
- Vocabulary Building: Creating a vocabulary of unique words from the entire corpus.
- Vectorization: Transforming each document into a numerical vector based on the frequency of each word in the vocabulary.
Example: Consider a small corpus with the following two sentences:
"The cat sat on the mat."
"The dog sat on the log."
The vocabulary would consist of the unique words: ["the", "cat", "sat", "on", "mat", "dog", "log"]. Each sentence is then represented as a vector of word counts:
"The cat sat on the mat.": [2, 1, 1, 1, 1, 0, 0]"
The dog sat on the log.": [2, 0, 1, 1, 0, 1, 1]
Implementing Bag of Words with Scikit-learn
Scikit-learn provides a straightforward implementation of the BoW model through its CountVectorizer class.
Here’s a step-by-step guide to implementing BoW and performing frequency counts using Scikit-learn.
Python
# Step 1: Importing the Required Libraries
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Step 2: Preparing the Corpus
corpus = [
"The cat sat on the mat.",
"The dog sat on the log."
]
# Step 3: Initializing and Fitting the CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Step 4: Displaying the Vocabulary and Frequency Counts
print("Vocabulary:", vectorizer.vocabulary_)
print("Feature Names:", vectorizer.get_feature_names_out())
print("Bag of Words Representation:\n", X.toarray())
# Step 5: Analyzing Word Frequencies
word_counts = np.sum(X.toarray(), axis=0)
word_freq = dict(zip(vectorizer.get_feature_names_out(), word_counts))
print("Word Frequencies:", word_freq)
Output:
Vocabulary: {'the': 6, 'cat': 0, 'sat': 5, 'on': 4, 'mat': 3, 'dog': 1, 'log': 2}
Feature Names: ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
Bag of Words Representation:
[[1 0 0 1 1 1 2]
[0 1 1 0 1 1 2]]
Word Frequencies: {'cat': 1, 'dog': 1, 'log': 1, 'mat': 1, 'on': 2, 'sat': 2, 'the': 4}
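Once the CountVectorizer has been fitted, its learned vocabulary can be reused to encode unseen text with the transform() method; words that were not seen during fitting are simply ignored. Here is a short sketch continuing from the code above (the new sentence is made up for illustration):
Python
# Encode a new document against the already-learned vocabulary.
# Out-of-vocabulary words ("bird", "flew", "over", "and") are dropped.
new_docs = ["The bird flew over the mat and the log."]
X_new = vectorizer.transform(new_docs)

print("Feature Names:", vectorizer.get_feature_names_out())
print("New Document Representation:\n", X_new.toarray())
# Expected: [[0 0 1 1 0 0 3]]  ->  'log': 1, 'mat': 1, 'the': 3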
Conclusion
In this article, we've covered the basic steps to create a BoW model and perform frequency count analysis using Scikit-learn. This knowledge serves as a stepping stone to more advanced text processing techniques, such as TF-IDF, word embeddings, and neural network-based models, which build upon the concepts introduced here.
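As a pointer to that next step, here is a minimal sketch (not part of the walkthrough above) of how the same corpus can be weighted with sklearn's TfidfVectorizer instead of raw counts; its interface mirrors that of CountVectorizer.
Python
# Minimal TF-IDF sketch on the same corpus. Words occurring in every
# document ("the", "sat", "on") receive a lower idf weight than words
# unique to a single document ("cat", "dog", "mat", "log").
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log."
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print("Feature Names:", tfidf.get_feature_names_out())
print("TF-IDF Representation:\n", X_tfidf.toarray().round(2))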