
Text Classification Using NLTK

Last Updated : 28 Jun, 2025

Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning categories or labels to text based on its content. One of the most accessible tools for performing text classification in Python is the Natural Language Toolkit (NLTK). NLTK provides a comprehensive suite of tools for text processing, including tokenization, stemming, stopword removal and built-in classifiers like Naive Bayes.


Implementation

Step 1: Install necessary libraries

  • This code imports essential libraries for text preprocessing, model training and evaluation.
  • Pandas is used for handling the dataset, while nltk provides tools for text processing like tokenization and stopword removal. From sklearn, you import modules for splitting data, converting text to TF-IDF vectors, training a Naive Bayes classifier and evaluating the model's performance.
  • The three nltk.download() lines ensure that the necessary resources, the Punkt tokenizer models (punkt and punkt_tab) and the stopword list, are downloaded and available.
Python
import pandas as pd
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

Output:

[Output screenshot: NLTK download confirmations]

Step 2: Load the dataset

  • This block loads the dataset from a CSV file (you can download it from here - Emotion dataset) into a DataFrame called df using pandas. It then drops any rows with missing values using dropna() to avoid errors during later processing.
  • After that, the columns are renamed to text and label for consistency and easier reference in the rest of the code. Finally, the first five rows are printed to give a quick look at the loaded data.
Python
df = pd.read_csv("Emotion_classify_Data (2).csv")
df.dropna(inplace=True)

df.columns = ['text', 'label']
print(df.head())

Output:

[Output screenshot: first five rows of the dataset]
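
Before preprocessing, it is also worth checking how the emotion classes are distributed, since a heavily skewed dataset can bias the classifier. A minimal sanity check (optional, not part of the original pipeline):

Python
# Count how many examples belong to each emotion label
print(df['label'].value_counts())

# Relative frequencies make any class imbalance easier to spot
print(df['label'].value_counts(normalize=True).round(3))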

Step 3: Preprocessing the text

  • This block defines a preprocessing function to clean the text data. First, it creates a set of English stopwords from NLTK's built-in list. Inside the preprocess() function each text is converted to lowercase, tokenized into words using word_tokenize() and filtered to keep only alphabetic words that are not in the stopword list.
  • The cleaned tokens are joined back into a single string. This function is then applied to the text column and the result is stored in a new column called clean_text. Finally, the original and cleaned text are printed side by side for the first few rows.
Python
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(tokens)

df['clean_text'] = df['text'].apply(preprocess)
print(df[['text', 'clean_text']].head())

Output:

[Output screenshot: original and cleaned text side by side]
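
To see what the cleaning step does on its own, you can call preprocess() on a single sentence. The sentence below is made up purely for illustration and the expected result is hedged in the comment:

Python
sample = "The movie was absolutely wonderful, and I loved every minute of it!"
# Stopwords and punctuation are removed and everything is lowercased;
# this should print something like: movie absolutely wonderful loved every minute
print(preprocess(sample))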

Step 4: TF-IDF Vectorization

  • This block initializes a TfidfVectorizer which converts the cleaned text into numerical features based on the importance of each word (TF-IDF: Term Frequency-Inverse Document Frequency).
  • The fit_transform() method learns the vocabulary from the clean_text column and transforms the text into a sparse TF-IDF matrix X.
  • The corresponding target labels are stored in y from the label column. Finally, it prints the shape of the TF-IDF matrix showing the number of documents (rows) and the number of unique words (columns) used as features.
Python
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['label']
print("TF-IDF matrix shape:", X.shape)

Output:

[Output screenshot: TF-IDF matrix shape]
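
Each entry of the matrix is a term weight of the form tf(t, d) * idf(t). With scikit-learn's defaults the idf is smoothed, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and each row is L2-normalised. You can peek at the learned vocabulary to confirm what the columns represent (get_feature_names_out() assumes scikit-learn 1.0 or newer):

Python
# A few of the vocabulary terms that became columns of X
print(vectorizer.get_feature_names_out()[:10])

# The number of features matches the column count of X
print(len(vectorizer.get_feature_names_out()) == X.shape[1])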

Step 5: Train-Test Split

  • This line splits the dataset into training and testing sets. train_test_split() takes the TF-IDF feature matrix X and the corresponding labels y and randomly divides them, 80% for training and 20% for testing.
  • random_state=42 ensures that the split is reproducible, so the same rows are selected each time the code is run.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
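
If the classes turn out to be imbalanced, a purely random split can under-represent some emotions in the test set. An optional refinement (not in the original code) is to stratify the split by label:

Python
# Optional: preserve the class proportions in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)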

Step 6: Train Classifier

  • This line initializes a Multinomial Naive Bayes classifier which is well suited for text classification tasks using word frequencies or TF-IDF features.
  • The .fit() method trains the model on the training data allowing it to learn the patterns between the input features and their corresponding emotion labels.
Python
model = MultinomialNB()
model.fit(X_train, y_train)

Output:

[Output screenshot: fitted MultinomialNB model]
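
Since the article uses NLTK for preprocessing but scikit-learn for the classifier, it is worth knowing that NLTK also ships its own Naive Bayes implementation, which works on feature dictionaries instead of TF-IDF matrices. A minimal sketch of that alternative (slower on large vocabularies, shown here only for comparison):

Python
import random
from nltk.classify import NaiveBayesClassifier, accuracy as nltk_accuracy

# Represent each cleaned document as a bag-of-words feature dict
def to_features(text):
    return {word: True for word in text.split()}

labeled = [(to_features(t), l) for t, l in zip(df['clean_text'], df['label'])]
random.seed(42)
random.shuffle(labeled)  # shuffle before splitting, in case the file is ordered

split = int(0.8 * len(labeled))
train_set, test_set = labeled[:split], labeled[split:]

nltk_model = NaiveBayesClassifier.train(train_set)
print("NLTK Naive Bayes accuracy:", nltk_accuracy(nltk_model, test_set))
nltk_model.show_most_informative_features(5)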

Step 7: Evaluate the Model

  • This block uses the trained Naive Bayes model to make predictions on the test set X_test, storing the results in y_pred. The accuracy_score() function then compares the predicted labels with the actual labels (y_test) to calculate the overall accuracy of the model.
  • The classification_report() provides a detailed performance summary including precision, recall and F1-score for each emotion class, helping you understand how well the model performs on each label.
Python
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Output:

[Output screenshot: accuracy and classification report]
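
Beyond the single accuracy number, a confusion matrix shows which emotions the model mixes up. A short optional addition using scikit-learn:

Python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels, in sorted label order
labels = sorted(y.unique())
print(labels)
print(confusion_matrix(y_test, y_pred, labels=labels))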

Step 8: Make Predictions

  • This block defines a helper function predict_emotion() that cleans a new sentence with the same preprocess() function, converts it into a TF-IDF vector with the already fitted vectorizer and returns the label predicted by the trained model.

Python
def predict_emotion(text):
    clean = preprocess(text)
    vector = vectorizer.transform([clean])
    return model.predict(vector)[0]

# Example:
print(predict_emotion("I feel amazing and joyful today!"))

Output:

Joy
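
The same helper can be called on any new sentence. The inputs below are made-up examples; the predicted labels will depend on the trained model:

Python
for sentence in ["I am terrified of what might happen next",
                 "This is so frustrating and unfair"]:
    print(sentence, "->", predict_emotion(sentence))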

You can download the complete source code from here - Text Classification Using NLTK

