Applying Multinomial Naive Bayes to NLP Problems
Last Updated: 10 Jul, 2024
Multinomial Naive Bayes (MNB) is a popular machine learning algorithm for text classification problems in Natural Language Processing (NLP). It is particularly useful for problems that involve text data with discrete features such as word frequency counts. MNB works on the principle of Bayes' theorem and assumes that the features are conditionally independent given the class variable.
Here are the steps for applying Multinomial Naive Bayes to NLP problems:
Preprocessing the text data: The text data needs to be preprocessed before applying the algorithm. This involves steps such as tokenization, stop-word removal, stemming, and lemmatization.
Feature extraction: The text data needs to be converted into a feature vector format that can be used as input to the MNB algorithm. The most common method of feature extraction is to use a bag-of-words model, where each document is represented by a vector of word frequency counts.
Splitting the data: The data needs to be split into training and testing sets. The training set is used to train the MNB model, while the testing set is used to evaluate its performance.
Training the MNB model: The MNB model is trained on the training set by estimating the probabilities of each feature given each class. This involves calculating the prior probabilities of each class and the likelihood of each feature given each class.
Evaluating the performance of the model: The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1-score on the testing set.
Using the model to make predictions: Once the model is trained, it can be used to make predictions on new text data. The text data is preprocessed and transformed into the feature vector format, which is then input to the trained model to obtain the predicted class label.
MNB is a simple and efficient algorithm that works well for many NLP problems such as sentiment analysis, spam detection, and topic classification. However, it has some limitations, such as the assumption of independence between features, which may not hold true in some cases. Therefore, it is important to carefully evaluate the performance of the model before using it in a real-world application.
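As a quick orientation before the worked example, here is a minimal sketch of these six steps with scikit-learn. The tiny corpus, labels, and split are made up for illustration; a real task would use a proper labeled dataset.
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# illustrative data, not a real corpus
texts = ["I liked the movie", "Nice story", "Sad, boring movie",
         "boring ending", "good movie", "bad acting"]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

# feature extraction: bag-of-words counts
cv = CountVectorizer()
X = cv.fit_transform(texts)

# split, train, and evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=0)
clf = MultinomialNB()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

# prediction on new text
print(clf.predict(cv.transform(["overall liked the movie"])))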
The Naive Bayes classifier algorithm is a family of probabilistic algorithms based on applying Bayes' theorem with the “naive” assumption of conditional independence between every pair of features.
Bayes' theorem calculates the probability P(c|x), where c is the class of the possible outcomes and x is the given instance to be classified, represented by certain features.
P(c|x) = P(x|c) * P(c) / P(x)
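Combined with the “naive” independence assumption discussed below, picking the class with the highest posterior gives the standard Multinomial Naive Bayes decision rule (a textbook formulation, added here for reference):

$$\hat{c} = \underset{c}{\arg\max}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

where $x_1, \dots, x_n$ are the words of the document; the divisor P(x) is dropped because it is the same for every class.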
Naive Bayes classifiers are mostly used in natural language processing (NLP) problems. They predict the tag of a text by calculating the probability of each tag for the given text and outputting the tag with the highest probability.
How Does the Naive Bayes Algorithm Work?
Let’s consider an example: classifying a review as positive or negative.
Training Dataset:
| Text | Reviews |
|---|---|
| “I liked the movie” | positive |
| “It’s a good movie. Nice story” | positive |
| “Nice songs. But sadly boring ending.” | negative |
| “Hero’s acting is bad but heroine looks good. Overall nice movie” | positive |
| “Sad, boring movie” | negative |
We want to classify whether the text “overall liked the movie” is a positive or a negative review. We have to calculate:
P(positive | overall liked the movie) — the probability that the tag of a sentence is positive given that the sentence is “overall liked the movie”.
P(negative | overall liked the movie) — the probability that the tag of a sentence is negative given that the sentence is “overall liked the movie”.
Before that, we first apply stop-word removal and stemming to the text.
Removing stopwords: these are common words that don’t really add anything to the classification, such as a, able, either, else, ever, and so on.
Stemming: reducing each word to its root form.
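As an illustration of these two steps, here is a minimal sketch using NLTK's stopword list and PorterStemmer; the helper function preprocess is our own, not from the article.
Python
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess(text):
    # lowercase, keep letters only, split into tokens
    tokens = ''.join(ch if ch.isalpha() else ' ' for ch in text.lower()).split()
    # drop stopwords, then stem what remains
    return ' '.join(ps.stem(t) for t in tokens if t not in stop_words)

print(preprocess("I liked the movie"))  # -> "like movi"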
After applying these two techniques, our text becomes:
| Text | Reviews |
|---|---|
| “i liked the movi” | positive |
| “its a good movie nice stori” | positive |
| “nice songs but sadly boring end” | negative |
| “heros acting is bad but heroine looks good overall nice movi” | positive |
| “sad boring movi” | negative |
Feature Engineering:
The important part is finding the features in the data that make machine learning algorithms work. In this case, we have text, and we need to convert it into numbers we can calculate on. We use word frequencies: we treat every document as a bag of the words it contains, and our features are the counts of each of these words.
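As a minimal sketch of this step, scikit-learn's CountVectorizer builds exactly such count vectors; the two documents here are illustrative.
Python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["overall liked the movie", "sad boring movie"]  # illustrative documents
cv = CountVectorizer()
X = cv.fit_transform(docs)

print(cv.get_feature_names_out())
# ['boring' 'liked' 'movie' 'overall' 'sad' 'the']
print(X.toarray())
# [[0 1 1 1 0 1]
#  [1 0 1 0 1 0]]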
In our case, for P(positive | overall liked the movie), Bayes' theorem gives:
P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)
Since for our classifier we only have to find out which tag has the bigger probability, we can discard the divisor, which is the same for both tags, and simply compare:
P(overall liked the movie | positive) * P(positive) with P(overall liked the movie | negative) * P(negative)
There’s a problem though: “overall liked the movie” doesn’t appear in our training dataset, so this probability is zero. Here, we make the ‘naive’ assumption that every word in a sentence is independent of the other ones, which means we can now look at individual words.
We can write this as:
P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)
The next step is applying the same assumption to the class-conditional probability:
P(overall liked the movie| positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)
And now, these individual words actually show up several times in our training data, and we can calculate them!
Calculating probabilities:
First, we calculate the a priori probability of each tag: for a given sentence in our training data, the probability that it is positive, P(positive), is 3/5, and P(negative) is 2/5.
Then, calculating P(overall | positive) means counting how many times the word “overall” appears in positive texts (1) divided by the total number of words in positive texts (17). Therefore, P(overall | positive) = 1/17, P(liked | positive) = 1/17, P(the | positive) = 2/17, P(movie | positive) = 3/17.
If a probability comes out to be zero, we use Laplace smoothing: we add 1 to every count so it is never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. In our case, the number of possible words is 21.
Applying smoothing, the results are:

| Word | P(word \| positive) | P(word \| negative) |
|---|---|---|
| overall | (1 + 1) / (17 + 21) | (0 + 1) / (7 + 21) |
| liked | (1 + 1) / (17 + 21) | (0 + 1) / (7 + 21) |
| the | (2 + 1) / (17 + 21) | (0 + 1) / (7 + 21) |
| movie | (3 + 1) / (17 + 21) | (1 + 1) / (7 + 21) |
Now we just multiply all the probabilities and see which is bigger:
P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^{-5} = 0.0000138
P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^{-5} = 0.0000013
Our classifier gives “overall liked the movie” the positive tag.
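As a sanity check, here is a short sketch that reproduces these smoothed products by hand, using the counts from the tables above.
Python
# word counts from the training data (positive: 17 words total, negative: 7)
pos_counts = {"overall": 1, "liked": 1, "the": 2, "movie": 3}
neg_counts = {"overall": 0, "liked": 0, "the": 0, "movie": 1}
total_pos, total_neg, vocab = 17, 7, 21

def smoothed(count, total):
    # Laplace smoothing: add 1 to the count, add the vocabulary size to the divisor
    return (count + 1) / (total + vocab)

p_pos, p_neg = 3 / 5, 2 / 5  # prior probabilities of each tag
for w in ["overall", "liked", "the", "movie"]:
    p_pos *= smoothed(pos_counts[w], total_pos)
    p_neg *= smoothed(neg_counts[w], total_neg)

print(p_pos)  # ~1.38e-05
print(p_neg)  # ~1.30e-06
print("positive" if p_pos > p_neg else "negative")  # positive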
Below is the implementation:
Python
# cleaning texts
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

dataset = [["I liked the movie", "positive"],
           ["It’s a good movie. Nice story", "positive"],
           ["Hero’s acting is bad but heroine looks good. Overall nice movie", "positive"],
           ["Nice songs. But sadly boring ending.", "negative"],
           ["sad movie, boring movie", "negative"]]

dataset = pd.DataFrame(dataset)
dataset.columns = ["Text", "Reviews"]

nltk.download('stopwords')
all_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []
for i in range(0, 5):
    # keep letters only (replacing everything else with a space), lowercase, tokenize
    text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i]).lower().split()
    # remove stopwords and stem the remaining tokens
    text = [ps.stem(word) for word in text if word not in all_stopwords]
    corpus.append(' '.join(text))

# creating the bag-of-words model
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
Python
# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
Python
# fitting Multinomial Naive Bayes to the training set
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# predicting test set results
y_pred = classifier.predict(X_test)

# making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
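The step list at the top also mentions accuracy, precision, recall, and F1-score; one minimal way to report them with scikit-learn is shown below (with a test set this small the numbers are not very meaningful, and zero_division=0 avoids undefined-metric warnings).
Python
# evaluating the model on the held-out test set
from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))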