Raw text data is often unstructured, noisy and inconsistent, containing typos, punctuation, stopwords and irrelevant information. Text preprocessing converts this data into a clean, structured and standardized format, enabling effective feature extraction and improving model performance.
- Improves feature representation, helping NLP models achieve higher accuracy and robustness.
- Simplifies text data, reducing computational overhead and accelerating model training.
Implementation
Here we implement text preprocessing techniques in Python, showing how raw text is cleaned, transformed and prepared for NLP tasks.
Step 1: Preparing the Sample Corpus
Here we define a sample corpus containing a variety of text examples, including HTML tags, emojis, URLs, numbers, punctuation and typos. This corpus will be used to demonstrate each preprocessing step in detail.
corpus = [
    "I can't wait for the new season of my favorite show! 😍",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??",
    "Check out https://www.example.com for more info!",
    "He won 1st prize in the comp3tition!!!",
    "I luvv this movie sooo much!!!"
]
Step 2: Text Cleaning and Regular Expressions
Text cleaning is the process of removing noise and unwanted elements from raw text to make it structured and easier for NLP models to analyze. Regular expressions (regex) are a useful tool in text preprocessing that allows you to find, match and manipulate patterns in text efficiently.
- Converts all text to lowercase to maintain consistency.
- Removes HTML tags using BeautifulSoup to extract only meaningful text.
- Eliminates numbers and punctuation to reduce noise.
- Uses regex (\W+ and \s+) to remove special characters and extra spaces.
import re
import string
from bs4 import BeautifulSoup

def clean_text(text):
    text = text.lower()                                               # lowercase for consistency
    text = BeautifulSoup(text, "html.parser").get_text()              # strip HTML tags
    text = re.sub(r'\d+', '', text)                                   # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\W+', ' ', text)                                  # replace remaining special characters (e.g., emojis) with spaces
    text = re.sub(r'\s+', ' ', text).strip()                          # collapse extra whitespace
    return text
cleaned_corpus = [clean_text(doc) for doc in corpus]
print("Cleaned Corpus:\n", cleaned_corpus)
Output:
Cleaned Corpus:
['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language', 'check out httpswwwexamplecom for more info', 'he won st prize in the comptition', 'i luvv this movie sooo much']
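Notice that this pipeline never strips URLs, which is why the sixth document keeps the squashed token httpswwwexamplecom. If URLs should be removed entirely, a regex pass before punctuation stripping handles the common cases. The helper below is a minimal sketch (the name clean_text_no_urls is illustrative, not from the original code):

def clean_text_no_urls(text):
    # drop URLs first, while their punctuation still makes them recognizable
    text = re.sub(r'http\S+|www\S+', '', text)
    return clean_text(text)

print(clean_text_no_urls(corpus[5]))
# expected: 'check out for more info'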
Step 3: Tokenization
Tokenization is the process of breaking text into smaller units, such as words or sentences. This step converts raw text into a structured format that NLP models can analyze and process.
- Splits each sentence into individual words for easier processing.
- Preserves the sequence of words for context in analysis.
- Prepares the text for further steps like stopword removal, stemming and POS tagging.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
tokenized_corpus = [word_tokenize(doc) for doc in cleaned_corpus]
print("Tokenized Corpus:\n", tokenized_corpus)
Output:
Tokenized Corpus:
[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language'], ['check', 'out', 'httpswwwexamplecom', 'for', 'more', 'info'], ['he', 'won', 'st', 'prize', 'in', 'the', 'comptition'], ['i', 'luvv', 'this', 'movie', 'sooo', 'much']]
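The example above tokenizes words, but NLTK can also split raw text into sentences, which matters for tasks like summarization. A minimal sketch using sent_tokenize (applied to the original corpus, since sentence boundaries rely on punctuation):

from nltk.tokenize import sent_tokenize

sentence_corpus = [sent_tokenize(doc) for doc in corpus]
print(sentence_corpus[1])
# expected: ['The COVID-19 pandemic has affected millions of people worldwide.']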
Step 4: Stopword Removal
Stopwords are common words in a language (like “is”, “the”, “and”) that usually do not add significant meaning to text analysis. Removing them helps NLP models focus on the more meaningful words in the text.
- Loads the list of English stopwords from NLTK.
- Loops through each word in every document and removes any word that is in the stopword list.
- Creates a new corpus (filtered_corpus) that contains only the meaningful words for further processing.
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print("Stopword Removed Corpus:\n", filtered_corpus)
Output:
Stopword Removed Corpus:
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language'], ['check', 'httpswwwexamplecom', 'info'], ['st', 'prize', 'comptition'], ['luvv', 'movie', 'sooo', 'much']]
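One caution: the default English list includes negations such as "not" and "no", and removing them can invert the meaning of a sentence in tasks like sentiment analysis. A common adjustment, sketched below with an illustrative custom_stop_words set, is to keep negations:

custom_stop_words = stop_words - {'not', 'no', 'nor'}  # keep negations for sentiment-style tasks
filtered_with_negations = [[word for word in doc if word not in custom_stop_words] for doc in tokenized_corpus]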
Step 5: Stemming
Stemming is the process of reducing words to their root or base form. It helps in normalizing text by treating different forms of a word (e.g., “running”, “runs”) as the same word (“run”).
- Initializes the PorterStemmer from NLTK to perform stemming.
- Loops through each word in every document of the filtered corpus.
- Converts each word to its stemmed form and creates a new corpus (stemmed_corpus) for further processing.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
print("Stemmed Corpus:\n", stemmed_corpus)
Output:
Stemmed Corpus:
[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'], ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['welcom', 'websit'], ['python', 'great', 'program', 'languag'], ['check', 'httpswwwexamplecom', 'info'], ['st', 'prize', 'comptit'], ['luvv', 'movi', 'sooo', 'much']]
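As the output shows, stemming is a rule-based suffix chop: it produces non-words like "favorit" and "pandem", and it cannot relate irregular forms. A quick illustration:

for word in ['running', 'runs', 'ran', 'easily']:
    print(word, '->', stemmer.stem(word))
# running -> run, runs -> run, ran -> ran (irregular past tense is untouched), easily -> easili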
Step 6: Lemmatization
Lemmatization is the process of converting a word to its meaningful base or dictionary form, called a lemma. Unlike stemming, it ensures that the root word is an actual word in the language.
- Initializes the WordNetLemmatizer from NLTK for lemmatization.
- Iterates through each word in every document of the filtered corpus.
- Converts each word to its lemma and stores the result in lemmatized_corpus for further analysis.
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print("Lemmatized Corpus:\n", lemmatized_corpus)
Output:
Lemmatized Corpus:
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language'], ['check', 'httpswwwexamplecom', 'info'], ['st', 'prize', 'comptition'], ['luvv', 'movie', 'sooo', 'much']]
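Two quirks are visible above: "us" became "u" (it is treated as a plural noun), and "affected" and "rising" passed through unchanged, because lemmatize treats every token as a noun by default. Supplying a part-of-speech hint fixes the verbs; a small sketch:

print(lemmatizer.lemmatize('affected', pos='v'))  # affect
print(lemmatizer.lemmatize('rising', pos='v'))    # rise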
Step 7: Contractions Expansion
Contractions expansion is the process of converting shortened forms of words (like “can’t”, “won’t”) into their full forms (“cannot”, “will not”). This helps NLP models better understand the meaning of the text.
- Imports the contractions library to handle contraction expansion.
- Iterates through each document in the original corpus.
- Replaces all contractions in the text with their full forms and stores the result in expanded_corpus.
import contractions
expanded_corpus = [contractions.fix(doc) for doc in corpus]
print("Expanded Corpus:\n", expanded_corpus)
Output:
Expanded Corpus:
['I cannot wait for the new season of my favorite show! 😍', 'The COVID-19 pandemic has affected millions of people worldwide.', 'YOU.S. stocks fell on Friday after news of rising inflation.', '<html><body>Welcome to the website!</body></html>', 'Python is a great programming language!!! ??', 'Check out https://www.example.com for more info!', 'He won 1st prize in the comp3tition!!!', 'I luvv this movie sooo much!!!']
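Note the side effect in the third document: the library reads the standalone "U" in "U.S." as texting shorthand for "you" and produces "YOU.S.". Rule-based expanders can misfire on abbreviations, so it is worth spot-checking their output before feeding it downstream.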
Step 8: Emoji Conversion
Emoji conversion is the process of converting emojis in text into descriptive text labels. This allows NLP models to understand the meaning conveyed by emojis.
- Imports the emoji library to handle emoji processing.
- Iterates through each document in the original corpus.
- Replaces all emojis with their descriptive text equivalents and stores the result in emoji_corpus.
import emoji
emoji_corpus = [emoji.demojize(doc) for doc in corpus]
print("Emoji Converted Corpus:\n", emoji_corpus)
Output:
Emoji Converted Corpus:
["I can't wait for the new season of my favorite show! :smiling_face_with_heart-eyes:", 'The COVID-19 pandemic has affected millions of people worldwide.', 'U.S. stocks fell on Friday after news of rising inflation.', '<html><body>Welcome to the website!</body></html>', 'Python is a great programming language!!! ??', 'Check out https://www.example.com for more info!', 'He won 1st prize in the comp3tition!!!', 'I luvv this movie sooo much!!!']
Step 9: Spell Correction
Spell correction is the process of identifying and correcting misspelled words in text. This ensures that NLP models receive accurate and meaningful words for analysis.
- Imports the SpellChecker class from the pyspellchecker library to detect and correct spelling errors.
- Initializes the spell checker object using SpellChecker().
- Iterates through each token in every document of the tokenized corpus, replacing misspelled words with their correct forms, and stores the result in corrected_corpus.
from spellchecker import SpellChecker
spell = SpellChecker()
corrected_corpus = [[spell.correction(word) for word in doc] for doc in tokenized_corpus]
print("Spell Corrected Corpus:\n", corrected_corpus)
Output:
Spell Corrected Corpus:
[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covin', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language'], ['check', 'out', None, 'for', 'more', 'info'], ['he', 'won', 'st', 'prize', 'in', 'the', 'competition'], ['i', 'luvs', 'this', 'movie', 'soon', 'much']]
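Two failure modes show up above: spell.correction returns None when it finds no plausible candidate (the squashed URL token), and it can "correct" informal words to the wrong ones ('luvv' became 'luvs', 'sooo' became 'soon'). A defensive variant, sketched below, falls back to the original token when no correction is found:

corrected_safe = [[spell.correction(word) or word for word in doc] for doc in tokenized_corpus]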
Step 10: Parts of Speech (POS) Tagging
POS tagging assigns grammatical labels (like noun, verb, adjective) to each word in a sentence. This helps NLP models understand the role of words and their relationships in the text.
- Downloads the NLTK POS tagger data required for tagging words.
- Iterates through each tokenized document in the corpus.
- Assigns a POS tag to each word and stores the result in pos_tagged_corpus for further linguistic analysis.
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
pos_tagged_corpus = [nltk.pos_tag(doc) for doc in tokenized_corpus]
print("POS Tagged Corpus:\n", pos_tagged_corpus)
Output:
POS Tagged Corpus:
[[('i', 'NN'), ('cant', 'VBP'), ('wait', 'NN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('season', 'NN'), ('of', 'IN'), ('my', 'PRP$'), ('favorite', 'JJ'), ('show', 'NN')], [('the', 'DT'), ('covid', 'NN'), ('pandemic', 'NN'), ('has', 'VBZ'), ('affected', 'VBN'), ('millions', 'NNS'), ('of', 'IN'), ('people', 'NNS'), ('worldwide', 'VBP')], [('us', 'PRP'), ('stocks', 'NNS'), ('fell', 'VBD'), ('on', 'IN'), ('friday', 'NN'), ('after', 'IN'), ('news', 'NN'), ('of', 'IN'), ('rising', 'VBG'), ('inflation', 'NN')], [('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'), ('website', 'NN')], [('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('programming', 'NN'), ('language', 'NN')], [('check', 'VB'), ('out', 'RP'), ('httpswwwexamplecom', 'NN'), ('for', 'IN'), ('more', 'JJR'), ('info', 'NN')], [('he', 'PRP'), ('won', 'VBD'), ('st', 'JJ'), ('prize', 'NN'), ('in', 'IN'), ('the', 'DT'), ('comptition', 'NN')], [('i', 'NN'), ('luvv', 'VBP'), ('this', 'DT'), ('movie', 'NN'), ('sooo', 'VBZ'), ('much', 'RB')]]
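The individual steps above can be chained into one pipeline. The sketch below shows one reasonable ordering, expanding contractions and converting emojis first (while punctuation is still intact) and lemmatizing last; the preprocess function is illustrative, not part of the original code.

def preprocess(text):
    text = contractions.fix(text)   # "can't" -> "cannot" before punctuation is stripped
    text = emoji.demojize(text)     # emojis -> text labels (note: the label is later squashed into one token by punctuation removal)
    text = clean_text(text)         # lowercase, strip HTML, numbers, punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return [lemmatizer.lemmatize(word) for word in tokens]

processed_corpus = [preprocess(doc) for doc in corpus]
print(processed_corpus[0])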
Applications
- Preprocessed text helps models accurately detect opinions and emotions in reviews, tweets or social media posts.
- Cleaning and normalizing text improves performance in spam detection, news categorization, or topic labeling.
- Search engines and recommendation systems rely on processed text for better matching and ranking results.
- Properly preprocessed text ensures that chatbots understand user queries and respond accurately.
- Normalizing and cleaning text allows translation and summarization models to produce more accurate outputs.
- Removing noise and tokenizing text helps in detecting entities like names, locations, and dates correctly.
Advantages
- Removes noise, irrelevant content and inconsistencies, ensuring that the text is clean and standardized.
- Helps NLP models learn meaningful patterns more effectively, improving predictions and classification results.
- Simplifies text by removing stopwords, punctuation and unnecessary symbols, reducing data size and speeding up model training.
- Makes it easier to extract relevant features like n-grams, embeddings or semantic representations.
- Makes text and results easier to analyze and interpret, improving understanding of model outputs.
Limitations
- Important information may be lost during cleaning (e.g., removing stopwords or punctuation).
- Over-processing can reduce context and affect model performance.
- Language-specific rules make it harder to generalize across languages.
- Requires additional time and computational effort.
- Errors in preprocessing (e.g., wrong stemming or spelling correction) can impact final results.