
Classification of Text Documents using Naive Bayes

Last Updated : 16 May, 2025

In natural language processing and machine learning, Naive Bayes is a popular method for classifying text documents. It classifies documents into predefined categories by using Bayes' theorem to combine the likelihood of each word occurring in each category. In this article we will implement text classification using Naive Bayes in Python.
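
To make the Bayes' theorem step concrete, here is a minimal worked example. The priors and word likelihoods are hypothetical numbers chosen for illustration, not values from the dataset. It computes how likely each category is, given that a document contains the word "match".

```python
# Hypothetical numbers chosen only for illustration.
priors = {"Sports": 0.5, "Technology": 0.5}          # P(class)
likelihoods = {"Sports": 0.20, "Technology": 0.05}   # P("match" | class)

# Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)
joint = {c: likelihoods[c] * priors[c] for c in priors}
evidence = sum(joint.values())                        # P("match")
posteriors = {c: joint[c] / evidence for c in joint}

best = max(posteriors, key=posteriors.get)
print(posteriors)
print(best)  # Sports
```

With these numbers the posterior for Sports is about 0.8 and for Technology about 0.2. A Naive Bayes classifier repeats this for every word in a document and multiplies the per-word likelihoods, treating the words as independent given the category.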

Figure: Text Classification using Naive Bayes

The dataset consists of text data categorized into four labels: Technology, Sports, Politics and Entertainment. Each entry contains a short sentence or statement about a specific topic, with the label indicating the category it belongs to.

1. Importing Libraries

We will import the necessary libraries: scikit-learn, Pandas and NumPy for modelling and data handling, plus Matplotlib and Seaborn for the confusion matrix plot.

  • CountVectorizer: Converts text data into numerical features using word counts.
  • MultinomialNB: The Naive Bayes classifier for multinomial data, well suited to text classification.
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

2. Loading the Dataset

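You can download the dataset from here. The code below expects a CSV file named synthetic_text_data.csv with a text column and a label column.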

Python
data = pd.read_csv('synthetic_text_data.csv')
X = data['text']
y = data['label']
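
Before training, it can help to confirm that the file has the expected columns and to see how many examples each category has. The sketch below uses a small inline DataFrame as a stand-in for synthetic_text_data.csv; the rows are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Stand-in for pd.read_csv('synthetic_text_data.csv'); these rows are invented.
data = pd.DataFrame({
    "text": [
        "The new smartphone has a faster processor",
        "The team won the championship final",
        "Parliament passed the new budget bill",
        "The film won three awards this year",
        "The striker scored twice in the match",
    ],
    "label": ["Technology", "Sports", "Politics", "Entertainment", "Sports"],
})

# Confirm the expected columns exist and contain no missing values.
assert {"text", "label"}.issubset(data.columns)
missing = data[["text", "label"]].isna().sum().sum()

# Count how many examples each category has.
class_counts = data["label"].value_counts()

print(f"Missing values: {missing}")
print(class_counts)
```

If one category has far fewer rows than the others, accuracy alone can be misleading, so this check is worth doing before the split.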

3. Splitting the Data

Now we split the dataset into training and testing sets. The training set is used to train the model while the testing set is used to evaluate its performance.

  • train_test_split: Splits the data into training (80%) and testing (20%) sets, as set by test_size=0.2.
  • random_state: Fixes the random seed so the same split is produced on every run.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
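
With four categories and a small dataset, a purely random split can leave a category under-represented in the test set. One optional variant, not used in this article, is to pass stratify=y so that both splits keep the same class proportions. The sketch below uses toy data invented for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 5 examples for each of the four categories (invented for illustration).
labels = ["Technology", "Sports", "Politics", "Entertainment"]
y = [label for label in labels for _ in range(5)]
X = [f"sample text {i}" for i in range(len(y))]

# stratify=y asks scikit-learn to keep class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(Counter(y_train))  # 4 examples per category
print(Counter(y_test))   # 1 example per category
```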

4. Text Preprocessing: Converting Text to Numeric Features

We need to convert the text data into numerical format before feeding it to the model. We use CountVectorizer to convert the text into a matrix of token counts.

  • CountVectorizer(): Converts the raw text into a matrix of word counts.
  • fit_transform(): Learns the vocabulary from the training data and transforms the training text into a matrix of word counts.
  • transform(): Applies the vocabulary learned from the training data to the test data. Words that were not seen during training are ignored.
Python
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

5. Training the Naive Bayes Classifier

With the data now in the right format we train the Naive Bayes classifier on the training data. Here we use Multinomial Naive Bayes.

Multinomial Naive Bayes is a variant of the Naive Bayes classifier suited to classification tasks where the features are discrete counts, such as word counts or frequencies in text classification.

Python
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
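
For readers curious about what fit() estimates, the sketch below reproduces the Multinomial Naive Bayes calculation by hand with NumPy on four invented sentences and compares it with scikit-learn. It assumes the default Laplace smoothing (alpha=1.0); the sentences and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "the team won the match",
    "the striker scored a goal",
    "the new phone has a fast chip",
    "the laptop runs new software",
]
labels = np.array(["Sports", "Sports", "Technology", "Technology"])

X = CountVectorizer().fit_transform(docs).toarray()
classes = np.unique(labels)
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

# log P(class): fraction of training documents in each class.
log_prior = np.log(np.array([(labels == c).mean() for c in classes]))

# log P(word | class): smoothed share of each word among the class's word counts.
word_counts = np.array([X[labels == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score = log prior + sum over words of (count * log likelihood).
scores = X @ log_likelihood.T + log_prior
manual_pred = classes[scores.argmax(axis=1)]

model = MultinomialNB(alpha=alpha).fit(X, labels)
print(manual_pred)
print(np.allclose(log_likelihood, model.feature_log_prob_))  # True
```

The smoothing term keeps a word that never appeared in a category from forcing that category's probability to zero.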

6. Making Predictions

Now that the model is trained we can use it to predict the labels for the test data using X_test_vectorized.

Python
y_pred = model.predict(X_test_vectorized)
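
Besides hard labels, the classifier can also report how confident it is in each category through predict_proba(). The following is a minimal sketch on invented training sentences, not the article's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "the team won the match",
    "the striker scored a late goal",
    "the new phone has a fast chip",
    "the laptop runs new software",
]
train_labels = ["Sports", "Sports", "Technology", "Technology"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

new_text = ["the team scored a goal"]
new_vector = vectorizer.transform(new_text)
probabilities = model.predict_proba(new_vector)[0]
predicted = model.predict(new_vector)[0]

# model.classes_ gives the label that each probability column refers to.
for label, p in zip(model.classes_, probabilities):
    print(f"{label}: {p:.3f}")
print(predicted)  # Sports
```

The probabilities sum to 1, and predict() returns the label with the highest one.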

7. Evaluating the Model

After making predictions we need to evaluate the model's performance. We'll calculate the accuracy and confusion matrix to understand how well the model is performing.

  • accuracy_score(): Calculates the accuracy of the model by comparing the predicted labels (y_pred) with the true labels (y_test).
  • confusion_matrix(): Generates a confusion matrix to visualize how well the model classifies each category.
Python
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy *100}%')

class_labels = np.unique(y_test)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels)
plt.title('Confusion Matrix Heatmap')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Output:

Accuracy: 88.23529411764706%

Figure: Confusion Matrix

The accuracy of the model is approximately 88% meaning it correctly predicted the categories for about 88% of the test data.
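
Accuracy is a single number for the whole test set, so it can hide a category that the model handles poorly. As an optional complement, scikit-learn's classification_report() gives precision and recall per category. The labels below are invented for illustration and are not the article's test set.

```python
from sklearn.metrics import accuracy_score, classification_report

# Invented labels for illustration; they are not the article's test set.
y_true = ["Sports", "Sports", "Technology", "Technology", "Politics", "Politics"]
y_hat = ["Sports", "Technology", "Technology", "Technology", "Politics", "Politics"]

print(f"Accuracy: {accuracy_score(y_true, y_hat):.3f}")

# Text table for reading, dictionary form for programmatic access.
print(classification_report(y_true, y_hat))
report = classification_report(y_true, y_hat, output_dict=True)
print(report["Sports"]["recall"])  # 0.5
```

In this toy case one Sports sentence is predicted as Technology, so Sports recall drops to 0.5 and Technology precision to about 0.67, even though overall accuracy is about 0.83.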

Looking at the confusion matrix heatmap, we can see the model made correct predictions for Sports (2), Technology (5), Politics (2) and Entertainment (6). These counts lie on the diagonal of the heatmap, and darker cells indicate higher counts. Because the accuracy is below 100%, a few test samples were misclassified; they appear in the off-diagonal cells.
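
The accuracy and the diagonal counts can be tied together with a little arithmetic. The sketch below uses only the numbers reported above and works out the size of the test set and the number of misclassified samples.

```python
# Diagonal (correct) counts quoted above for each category.
correct_per_class = {"Sports": 2, "Technology": 5, "Politics": 2, "Entertainment": 6}
reported_accuracy = 0.8823529411764706  # 88.23529411764706% from the output

correct = sum(correct_per_class.values())
# accuracy = correct / total, so total = correct / accuracy
total = round(correct / reported_accuracy)
misclassified = total - correct

print(correct, total, misclassified)  # 15 17 2
```

So 15 of 17 test samples were classified correctly and 2 were not.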

8. Prediction on Unseen Data

Python
user_input = "I love artificial intelligence and machine learning"

user_input_vectorized = vectorizer.transform([user_input])
predicted_label = model.predict(user_input_vectorized)
print(f"The input text belongs to the '{predicted_label[0]}' category.")

Output:

The input text belongs to the 'Technology' category.

Here the model assigns the expected category to a sentence it has not seen before. Naive Bayes is a strong baseline model for text classification tasks, especially when the dataset is large and the features (words) are relatively independent.
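
When predicting on raw strings, the vectorizer and the classifier always have to be applied together. One way to bundle them, shown here as an optional sketch rather than the approach used above, is scikit-learn's make_pipeline(). The training sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "machine learning improves search engines",
    "the new phone uses artificial intelligence",
    "the team won the football match",
    "the striker scored in the final",
]
train_labels = ["Technology", "Technology", "Sports", "Sports"]

# The pipeline vectorizes and classifies in one call, so raw strings go straight in.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_labels)

prediction = text_clf.predict(["I love artificial intelligence and machine learning"])
print(prediction[0])  # Technology
```

Keeping both steps in one object removes the risk of calling predict() on text that was vectorized with a different vocabulary.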

