Text Classification Using NLTK

Last Updated: 28 Jun, 2025

Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning categories or labels to text based on its content. One of the most accessible tools for performing text classification in Python is the Natural Language Toolkit (NLTK). NLTK provides a comprehensive suite of tools for text processing, including tokenization, stemming, stopword removal and built-in classifiers like Naive Bayes.

Implementation

Step 1: Install necessary libraries

This code imports the essential libraries for text preprocessing, model training and evaluation. pandas is used for handling the dataset, while nltk provides tools for text processing like tokenization and stopword removal. From sklearn, you import modules for splitting data, converting text to TF-IDF vectors, training a Naive Bayes classifier and evaluating the model's performance. The nltk.download() lines ensure that the necessary resources, such as the tokenizer models (punkt) and the stopword list, are downloaded and available.

```python
import pandas as pd
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
```

Step 2: Load the dataset

This block loads the dataset from a CSV file (you can download it from here - Emotion dataset) into a DataFrame called df using pandas. It then removes any rows with missing values using dropna() to avoid errors during processing. After that, the column names are renamed to text and label for consistency and easier reference in the rest of the code. Finally, the first five rows are printed to give a quick look at the loaded data.

```python
df = pd.read_csv("Emotion_classify_Data (2).csv")
df.dropna(inplace=True)
df.columns = ['text', 'label']
print(df.head())
```
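Before preprocessing, it is also worth checking how the examples are distributed across the emotion labels, since a heavily skewed dataset can make accuracy misleading. A quick optional check, using only the df loaded above:

```python
# Count how many rows belong to each emotion label.
# A roughly even split means plain accuracy is a reasonable metric.
print(df['label'].value_counts())
```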
Step 3: Preprocessing the text

This block defines a preprocessing function to clean the text data. First it creates a set of English stopwords using NLTK's built-in list. Inside the preprocess() function, each text is converted to lowercase, tokenized into words using word_tokenize() and filtered to keep only alphabetic words that are not in the stopword list. The cleaned tokens are joined back into a single string. This function is then applied to the text column and the result is stored in a new column called clean_text. Finally, it prints the original and cleaned text side by side for the first few rows.

```python
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(tokens)

df['clean_text'] = df['text'].apply(preprocess)
print(df[['text', 'clean_text']].head())
```

Step 4: TF-IDF Vectorization

This block initializes a TfidfVectorizer, which converts the cleaned text into numerical features based on the importance of each word (TF-IDF: Term Frequency-Inverse Document Frequency). The fit_transform() method learns the vocabulary from the clean_text column and transforms the text into a sparse TF-IDF matrix X. The corresponding target labels are stored in y from the label column. Finally, it prints the shape of the TF-IDF matrix, showing the number of documents (rows) and the number of unique words (columns) used as features.

```python
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['label']
print("TF-IDF matrix shape:", X.shape)
```

Step 5: Train-Test Split

This line splits the dataset into training and testing sets. train_test_split() takes the TF-IDF feature matrix X and the corresponding labels y and randomly divides them, with 80% for training and 20% for testing. random_state=42 ensures that the split is reproducible, so the same data is selected each time the code is run.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Step 6: Train Classifier

This line initializes a Multinomial Naive Bayes classifier, which is well suited for text classification tasks using word frequencies or TF-IDF features. The .fit() method trains the model on the training data, allowing it to learn the patterns between the input features and their corresponding emotion labels.

```python
model = MultinomialNB()
model.fit(X_train, y_train)
```

Step 7: Evaluate the Model

This block uses the trained Naive Bayes model to make predictions on the test set X_test, storing the results in y_pred. The accuracy_score() function then compares the predicted labels with the actual labels (y_test) to calculate the overall accuracy of the model. The classification_report() provides a detailed performance summary including precision, recall and F1-score for each emotion class, helping you understand how well the model performs on each label.

```python
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
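If you want to see which emotions the model mixes up, a confusion matrix is a useful complement to the classification report. A minimal optional sketch using scikit-learn's confusion_matrix and the y_test and y_pred variables from above:

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels, ordered by
# model.classes_; large off-diagonal counts show commonly confused pairs.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))
```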
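Before wrapping the model in a helper function, note that MultinomialNB can also report how confident it is: predict_proba() returns one probability per class. A small optional sketch, reusing the trained model and X_test:

```python
# Per-class probabilities for the first test document; the column
# order matches model.classes_.
probs = model.predict_proba(X_test[:1])[0]
for label, prob in zip(model.classes_, probs):
    print(f"{label}: {prob:.3f}")
```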
Step 8: Make Predictions

This helper function runs a new sentence through the same preprocess() function, transforms it with the already-fitted vectorizer and returns the emotion label predicted by the trained model.

```python
def predict_emotion(text):
    clean = preprocess(text)
    vector = vectorizer.transform([clean])
    return model.predict(vector)[0]

# Example:
print(predict_emotion("I feel amazing and joyful today!"))
```

Output:

Joy

You can download the complete source code from here - Text Classification Using NLTK