
Introduction to Stemming

Last Updated : 17 Jul, 2025

Stemming is an important text-processing technique that reduces words to their base or root form by removing prefixes and suffixes. This process standardizes words which helps to improve the efficiency and effectiveness of various natural language processing (NLP) tasks.

In NLP, stemming simplifies words to their most basic form, making it easier to analyze and process text. For example, "chocolates" becomes "chocolate" and "retrieval" becomes "retrieve". This is important in the early stages of NLP tasks where words are extracted from a document and tokenized (broken into individual words).

It helps in tasks such as text classification, information retrieval and text summarization by reducing words to a base form. While it is effective, it can sometimes introduce drawbacks including potential inaccuracies and a reduction in text readability.

Note: It's important to thoroughly understand the concept of 'tokenization' as it forms the foundational step in text preprocessing.

Examples of stemming for the word "like":

  • "likes" → "like"
  • "liked" → "like"
  • "likely" → "like"
  • "liking" → "like"

Types of Stemmer in NLTK 

Python's NLTK (Natural Language Toolkit) provides several stemming algorithms, each suited to different scenarios and languages. Let's look at an overview of the most commonly used stemmers:

1. Porter's Stemmer

Porter's Stemmer is one of the most popular and widely used stemming algorithms. Proposed in 1980 by Martin Porter, this stemmer works by applying a series of rules to remove common suffixes from English words. It is well-known for its simplicity, speed and reliability. However, the stemmed output is not guaranteed to be a meaningful word and its applications are limited to the English language.

Example:

  • 'agreed' → 'agree'
  • Rule: if the word ends in the suffix EED and the stem before it contains at least one vowel-consonant sequence (measure m > 0), replace EED with EE.

Advantages:

  • Very fast and efficient.
  • Commonly used for tasks like information retrieval and text mining.

Limitations:

  • Outputs may not always be real words.
  • Limited to English words.

Now let's implement Porter's Stemmer in Python using the NLTK library.

Python
from nltk.stem import PorterStemmer

# Create a Porter stemmer instance
porter_stemmer = PorterStemmer()

words = ["running", "jumps", "happily", "running", "happily"]

# Apply the stemmer to every word in the list
stemmed_words = [porter_stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words:", stemmed_words)

Output:

Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']

2. Snowball Stemmer

The Snowball Stemmer, also developed by Martin Porter, is an enhanced version of the Porter Stemmer and is often referred to as Porter2. It is faster and somewhat more aggressive than its predecessor, and a key advantage is its support for multiple languages, making it a multilingual stemmer.

Example:

  • 'running' → 'run'
  • 'quickly' → 'quick'

Advantages:

  • More efficient than Porter Stemmer.
  • Supports multiple languages.

Limitations:

  • More aggressive which might lead to over-stemming.

Now let's implement the Snowball Stemmer in Python using the NLTK library.

Python
from nltk.stem import SnowballStemmer

# Snowball supports several languages; we use English here
stemmer = SnowballStemmer(language='english')

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

# Apply the stemmer to every word in the list
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)

Output:

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox']

3. Lancaster Stemmer

The Lancaster Stemmer (also known as the Paice/Husk stemmer) is known for being more aggressive and faster than other stemmers. However, it is also more destructive and may produce excessively shortened stems. It applies a table of rules in an iterative manner, and custom rule sets can be supplied.

Example:

  • 'running' → 'run'
  • 'happily' → 'happy'

Advantages:

  • Very fast.
  • Good for smaller datasets or quick preprocessing.

Limitations:

  • Aggressive which can result in over-stemming.
  • Its heavy truncation can make stems harder to interpret than Snowball's output.

Now let's implement the Lancaster Stemmer in Python using the NLTK library.

Python
from nltk.stem import LancasterStemmer

# Create a Lancaster (Paice/Husk) stemmer instance
stemmer = LancasterStemmer()

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

# Apply the stemmer to every word in the list
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)

Output:

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']

4. Regexp Stemmer

The Regexp Stemmer or Regular Expression Stemmer is a flexible stemming algorithm that allows users to define custom rules using regular expressions (regex). This stemmer can be helpful for very specific tasks where predefined rules are necessary for stemming.

Example:

  • 'running' → 'runn'
  • Custom rule: r'ing$' removes the suffix ing.

Advantages:

  • Highly customizable using regular expressions.
  • Suitable for domain-specific tasks.

Limitations:

  • Requires manual rule definition.
  • Can be computationally expensive for large datasets.

Now let's implement the Regexp Stemmer in Python using the NLTK library.

Python
from nltk.stem import RegexpStemmer

# Custom rule: strip the suffix 'ing' from the end of a word
custom_rule = r'ing$'
regexp_stemmer = RegexpStemmer(custom_rule)

word = 'running'
stemmed_word = regexp_stemmer.stem(word)

print(f'Original Word: {word}')
print(f'Stemmed Word: {stemmed_word}')

Output:

Original Word: running
Stemmed Word: runn

5. Krovetz Stemmer 

The Krovetz Stemmer was developed by Robert Krovetz in 1993. It is designed to be more linguistically accurate and tends to preserve meaning better than other stemmers: it checks candidate stems against a dictionary, converts plural forms to singular and converts inflected verbs (past tense, "ing" forms) back to their base form.

Example:

  • 'children' → 'child'
  • 'running' → 'run'

Advantages:

  • More accurate, as it preserves linguistic meaning.
  • Works well with both singular/plural and past/present tense conversions.

Limitations:

  • Slower and less efficient than purely rule-based stemmers, especially on large corpora, because of its dictionary lookups.

Note: The Krovetz Stemmer is not natively available in the NLTK library, unlike other stemmers such as Porter, Snowball or Lancaster.
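Since NLTK has no Krovetz implementation (a third-party `krovetzstemmer` package exists on PyPI), here is a minimal, purely illustrative sketch of the dictionary-checked idea described above; the `toy_lexicon`, `IRREGULAR` table and rule list are hypothetical assumptions, not the real algorithm:

```python
# Illustrative sketch of Krovetz-style stemming: strip common inflectional
# suffixes, but accept a candidate only if a dictionary confirms it is a
# real word -- this lookup step is what keeps the output meaningful.
toy_lexicon = {"child", "run", "box", "agree"}

IRREGULAR = {"children": "child"}  # irregular forms handled by direct lookup

def krovetz_like_stem(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)]
            if candidate in toy_lexicon:
                return candidate
            if candidate + "e" in toy_lexicon:   # restore a dropped final 'e'
                return candidate + "e"
            if candidate[:-1] in toy_lexicon:    # undouble a final consonant
                return candidate[:-1]
    return word  # no dictionary-approved stem found; leave the word alone

print(krovetz_like_stem("children"))  # child
print(krovetz_like_stem("running"))   # run
print(krovetz_like_stem("agreed"))    # agree
```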

Stemming vs. Lemmatization

Let's compare stemming and lemmatization side by side for better understanding:

  • Stemming reduces words to a root form using simple rules or algorithms, often producing non-valid words; lemmatization reduces words to their base form (lemma), which is always a valid word.
  • Stemming considers no context; lemmatization considers the word's meaning, context and part of speech.
  • Example: an aggressive stemmer may reduce "better" to "bet", while lemmatization maps "better" to "good".

Applications of Stemming

Stemming plays an important role in many NLP tasks. Some of its key applications include:

  1. Information Retrieval: It is used in search engines to improve the accuracy of search results. By reducing words to their root form, it ensures that documents with different word forms like "run," "running," "runner" are grouped together.
  2. Text Classification: In text classification, it helps in reducing the feature space by consolidating variations of words into a single representation. This can improve the performance of machine learning algorithms.
  3. Document Clustering: It helps in grouping similar documents by normalizing word forms, making it easier to identify patterns across large text corpora.
  4. Sentiment Analysis: Before sentiment analysis, it is used to process reviews and comments. This allows the system to analyze sentiments based on root words which improves its ability to understand positive or negative sentiments despite word variations.
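The feature-space reduction mentioned in point 2 is easy to quantify: stem a token list and compare vocabulary sizes before and after (a small sketch using the Porter stemmer):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
tokens = ["run", "running", "runs", "jump", "jumped", "jumping", "jumps"]

# Seven distinct surface forms collapse to two distinct stems,
# shrinking the feature space a classifier has to handle.
vocab_before = set(tokens)
vocab_after = {porter.stem(t) for t in tokens}

print(len(vocab_before), "->", len(vocab_after))  # 7 -> 2
```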

Challenges in Stemming

While stemming is beneficial, it also has some challenges:

  1. Over-Stemming: When words are reduced too aggressively, leading to the loss of meaning. For example, "arguing" becomes "argu" making it harder to understand.
  2. Under-Stemming: Occurs when related words are not reduced to a common base form, causing inconsistencies. For example, "argument" and "arguing" might not be stemmed similarly.
  3. Loss of Meaning: Stemming ignores context which can result in incorrect interpretations in tasks like sentiment analysis.
  4. Choosing the Right Stemmer: Different stemmers may produce different results, which requires careful selection and testing to find the best fit.

These challenges can be solved by fine-tuning the stemming process or using lemmatization when necessary.
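Challenge 4 is easy to demonstrate: the same word can come out of each NLTK stemmer differently (here with the classic example "maximum", which Lancaster truncates to "maxim"):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

word = "maximum"
results = {
    "Porter": PorterStemmer().stem(word),
    "Snowball": SnowballStemmer("english").stem(word),
    "Lancaster": LancasterStemmer().stem(word),
}
print(results)  # Porter and Snowball leave it intact; Lancaster over-stems
```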

Advantages of Stemming

Stemming provides various benefits which are as follows:

  1. Text Normalization: By reducing words to their root form, it helps to normalize text which makes it easier to analyze and process.
  2. Improved Efficiency: It reduces the dimensionality of text data which can improve the performance of machine learning algorithms.
  3. Information Retrieval: It enhances search engine performance by ensuring that variations of the same word are treated as the same entity.
  4. Facilitates Language Processing: It simplifies the text by reducing variations of words which makes it easier to process and analyze large text datasets.

Mastering the different stemming techniques in NLTK improves text analysis by letting you choose the right method for your needs.

