In this article, we introduce the textacy module for Python, which is used to perform a variety of NLP tasks on text. It is built on top of the spaCy library.
Some of the features of the textacy module are as follows:
- It cleans and preprocesses text by replacing or removing punctuation, extra whitespace, numbers, etc. before the text is processed with spaCy.
- It identifies a text's language automatically, and it can tokenize and vectorize documents and then train and interpret topic models.
- Custom extensions can be added to extend the main functionality of spaCy for working with one or more documents.
- It loads prepared datasets that contain both text content and metadata, such as Reddit comments, Congressional speeches, and historical books.
- It extracts features such as n-grams, entities, acronyms, keyphrases, and SVO triples as structured data from processed documents.
- Strings and sequences can be compared using a variety of similarity metrics.
- It calculates text readability and lexical diversity statistics, such as the Type-Token Ratio, Multilingual Flesch Reading Ease, and Flesch-Kincaid Grade Level.
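To make one of these statistics concrete, here is a minimal sketch of the Type-Token Ratio: the number of distinct words divided by the total number of words. It uses only the standard library and a deliberately naive regex tokenizer, so it is an illustration of the idea, not textacy's implementation (which works on spaCy documents). The function name `type_token_ratio` is our own.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct words / total words, ignoring case."""
    # Naive tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "Our bruised arms hung up for monuments; our stern alarums changed to merry meetings"
print(round(type_token_ratio(sample), 3))
```

A value close to 1 means that few words are repeated; longer texts usually score lower because common words recur.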
Installation of the textacy module:
We can install the textacy module using pip:
pip install textacy
If you use conda, run the following command instead:
conda install -c conda-forge textacy
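After installing, it can be handy to confirm that Python can actually find the packages. The sketch below uses only the standard library; the helper name `is_installed` is our own. Note that features which build spaCy documents also need a spaCy language model, which is typically downloaded separately (for example with `python -m spacy download en_core_web_sm`).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Report whether textacy and its spaCy dependency are importable.
for name in ("textacy", "spacy"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```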
Examples of some of its features:
Here we will look at some of the notable features of the textacy module.
Remove Punctuation
Using the preprocessing module of textacy, we can easily remove punctuation from our text.
Python3
from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
print(rm_punc)
The text used here is the opening soliloquy of Shakespeare's Richard III. First, we imported the preprocessing module of textacy, and then we called the punctuation function of its remove submodule to strip the punctuation.
Output:
Now is the winter of our discontent
Made glorious summer by this sun of York
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried
Now are our brows bound with victorious wreaths
Our bruised arms hung up for monuments
Our stern alarums changed to merry meetings
Our dreadful marches to delightful measures
Grim visaged war hath smooth d his wrinkled front
And now instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I that am rudely stamp d and want love s majesty
To strut before a wanton ambling nymph
I that am curtail d of this fair proportion
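Notice in the output above that each punctuation mark is replaced by a space and not simply deleted, which is why "lour'd" becomes "lour d". The standard-library sketch below reproduces that behaviour for ASCII punctuation only; it is an approximation for illustration, not a copy of textacy's implementation, and the name `strip_punctuation` is our own.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text: str) -> str:
    """Replace each ASCII punctuation character with a single space."""
    return text.translate(_PUNCT_TO_SPACE)


print(strip_punctuation("Grim-visaged war hath smooth'd his wrinkled front;"))
```

Replacing with a space keeps neighbouring words from being glued together, at the cost of leaving stray spaces behind.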
Remove unnecessary Whitespace
We can remove unnecessary whitespace from our text. This collapses every run of extra spaces so that only a single space remains between words.
Python3
from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
print(rm_wsp)
Here we used the whitespace function of the normalize submodule to clean up the whitespace.
Output:
In the output, we can see that all the excess whitespace has been removed, but the punctuation is still there. If we want to remove that too, we can combine both operations.
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
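For intuition, whitespace normalization can be approximated with two regular expressions: one that collapses runs of spaces and tabs, and one that collapses blank lines. This is a simplified sketch under our own assumptions (the function name `normalize_whitespace` and the exact rules are ours); textacy's function may treat special Unicode spaces differently.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and runs of line breaks, then trim the ends."""
    # Many spaces or tabs in a row become one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Blank lines and stray spaces around a line break become one newline.
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()


print(normalize_whitespace("  Now   is \n\n  the\twinter  "))
```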
Removing Punctuation and Whitespace together
Python3
from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
# Remove Punctuation and Whitespace both
rm_all = preprocessing.normalize.whitespace(rm_punc)
print(rm_all)
Output:
Now is the winter of our discontent
Made glorious summer by this sun of York
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried
Now are our brows bound with victorious wreaths
Our bruised arms hung up for monuments
Our stern alarums changed to merry meetings
Our dreadful marches to delightful measures
Grim visaged war hath smooth d his wrinkled front
And now instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I that am rudely stamp d and want love s majesty
To strut before a wanton ambling nymph
I that am curtail d of this fair proportion
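When several cleaning steps are always applied in the same order, it can be tidier to bundle them into one reusable function. The sketch below shows the idea with the standard library only; `compose_steps` and the two stand-in steps are hypothetical names of ours, used so the example runs without textacy. Recent textacy releases appear to ship a comparable `preprocessing.make_pipeline` helper, so check the documentation for your version before writing your own.

```python
import re
import string
from functools import reduce
from typing import Callable


def compose_steps(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that applies each step to the text, left to right."""
    def pipeline(text: str) -> str:
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline


# Stand-in steps (ASCII-only approximations of the textacy calls above).
def strip_punctuation(text: str) -> str:
    return text.translate(str.maketrans({c: " " for c in string.punctuation}))


def squeeze_spaces(text: str) -> str:
    return re.sub(r"[ \t]+", " ", text).strip()


clean = compose_steps(strip_punctuation, squeeze_spaces)
print(clean("And now, instead of mounting barded steeds"))
```

The order matters: punctuation is removed first because that step leaves extra spaces behind, which the second step then collapses.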
Partition a text
Sometimes the text we receive is 'raw', meaning unstructured or messy. Before analysis, in the preprocessing stage, we may need to clean it up and partition it based on certain criteria.
Python3
from textacy import preprocessing
from textacy import extract
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
# Remove Punctuation and Whitespace both
rm_all = preprocessing.normalize.whitespace(rm_punc)
# Extracting text
ext = list(extract.keyword_in_context(
    rm_all, 'I', window_width=20, pad_context=True))
print(ext)
Output:
The output looks cluttered because, as the results show, the keyword 'I' is matched case-insensitively and anywhere in the text, so every letter 'i' inside a word counts as a match. Since the input had already been stripped of punctuation and extra whitespace, neither appears in the results. The blank spaces at the edges of some contexts come from padding each context to window_width characters (pad_context=True); they are not whitespace left over from the text.
[(' Now ', 'i', 's the winter of our '),
(' Now is the w', 'i', 'nter of our disconte'),
(' the winter of our d', 'i', 'scontent\nMade glorio'),
('discontent\nMade glor', 'i', 'ous summer by this s'),
('lorious summer by th', 'i', 's sun of York \nAnd a'),
('ur d upon our house\n', 'I', 'n the deep bosom of '),
('som of the ocean bur', 'i', 'ed \nNow are our brow'),
('re our brows bound w', 'i', 'th victorious wreath'),
('r brows bound with v', 'i', 'ctorious wreaths \nOu'),
('ws bound with victor', 'i', 'ous wreaths \nOur bru'),
('ous wreaths \nOur bru', 'i', 'sed arms hung up for'),
('hanged to merry meet', 'i', 'ngs \nOur dreadful ma'),
('adful marches to del', 'i', 'ghtful measures \nGri'),
('ightful measures \nGr', 'i', 'm visaged war hath s'),
('ful measures \nGrim v', 'i', 'saged war hath smoot'),
(' war hath smooth d h', 'i', 's wrinkled front \nAn'),
('hath smooth d his wr', 'i', 'nkled front \nAnd now'),
('kled front \nAnd now ', 'i', 'nstead of mounting b'),
('now instead of mount', 'i', 'ng barded steeds\nTo '),
(' barded steeds\nTo fr', 'i', 'ght the souls of fea'),
(' of fearful adversar', 'i', 'es \nHe capers nimbly'),
('rsaries \nHe capers n', 'i', 'mbly in a lady s cha'),
('s \nHe capers nimbly ', 'i', 'n a lady s chamber\nT'),
(' chamber\nTo the lasc', 'i', 'vious pleasing of a '),
('hamber\nTo the lasciv', 'i', 'ous pleasing of a lu'),
('the lascivious pleas', 'i', 'ng of a lute \nBut I '),
('sing of a lute \nBut ', 'I', ' that am not shaped '),
('not shaped for sport', 'i', 've tricks \nNor made '),
('aped for sportive tr', 'i', 'cks \nNor made to cou'),
('ourt an amorous look', 'i', 'ng glass \nI that am '),
('rous looking glass \n', 'I', ' that am rudely stam'),
('before a wanton ambl', 'i', 'ng nymph \nI that am '),
('nton ambling nymph \n', 'I', ' that am curtail d o'),
('mph \nI that am curta', 'i', 'l d of this fair pro'),
('t am curtail d of th', 'i', 's fair proportion '),
('curtail d of this fa', 'i', 'r proportion '),
('of this fair proport', 'i', 'on ')]
The section below shows the result if we don't remove the punctuation or whitespace first. Only the beginning of the output is included, since the full list is long and the remaining punctuation and whitespace make it harder to read.
[(' \nNow ', 'i', 's the winter of our '),
(' \nNow is the w', 'i', 'nter of our dis'),
('winter of our d', 'i', 'scontent\nMade glorio'),
('discontent\nMade glor', 'i', 'ous summer by this s'),
('lorious summer by th', 'i', 's sun of York;\nAnd a'),
("ur'd upon our house\n", 'I', 'n the deep b'),
('som of the ocean bur', 'i', 'ed.\nNow are our brow').......]
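The keyword-in-context idea itself is simple enough to sketch with the standard library, which also makes it easy to see why 'I' matched inside so many words. The function below is our own illustration (the name `kwic` and the `whole_word` option are assumptions, not textacy parameters), and it does not pad the contexts.

```python
import re


def kwic(text, keyword, window=20, whole_word=False):
    """Yield (left context, match, right context) for each case-insensitive hit."""
    pattern = re.escape(keyword)
    if whole_word:
        # Word boundaries restrict matches to the standalone word.
        pattern = rf"\b{pattern}\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        yield left, m.group(), right


line = "But I that am not shaped for sportive tricks"
print(list(kwic(line, "I")))                   # substring matches, as in the output above
print(list(kwic(line, "I", whole_word=True)))  # only the pronoun
```

Choosing a longer or more distinctive keyword is the simplest way to get cleaner results from a substring-based search.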
Replace URLs in text with other text
We can find any URLs in our text and replace them with some other text:
Python3
from textacy import preprocessing
# Replace URLs
txt = "https://www.geeksforgeeks.org/ is the best place to learn anything"
rm_url = preprocessing.replace.urls(txt,"GeeksforGeeks")
print(rm_url)
Output:
Replace emails with other text
Python3
from textacy import preprocessing
# Replace Emails
mail = "Send me a mail in the following address - [email protected]"
rm_mail = preprocessing.replace.emails(mail,"UserMail")
print(rm_mail)
Output:
Replace phone number
Python3
from textacy import preprocessing
# Replace phone number
num = "Call me at 12345678910"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
print(rm_num)
Output:
If the text contains more than one phone number, all of them are replaced with NUM.
Python3
from textacy import preprocessing
# Replace phone number
num = "Call me at 12345678910 or 7896451235"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
print(rm_num)
Output:
Replace any number
Python3
from textacy import preprocessing
# Replace Number
n = "Any number like 12 or 86 , maybe 100 etc"
rm_n = preprocessing.replace.numbers(n,"Numbers")
print(rm_n)
Output:
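The replace functions in the sections above all follow the same idea: find every span that matches a pattern and substitute a placeholder string. The sketch below imitates this with three deliberately simple regular expressions. The patterns, the placeholder strings, the function name `replace_all`, and the sample address are all our own assumptions for illustration; textacy's real patterns are considerably more thorough.

```python
import re

# Simple illustrative patterns; real URLs and addresses have many more edge cases.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def replace_all(text, url="_URL_", email="_EMAIL_", number="_NUMBER_"):
    """Swap URLs, then e-mail addresses, then numbers for placeholder strings."""
    text = URL_RE.sub(url, text)
    text = EMAIL_RE.sub(email, text)
    # Numbers go last so digits inside URLs or addresses are already gone.
    return NUMBER_RE.sub(number, text)


print(replace_all("Write to user@example.com or visit https://example.org/docs before 12 June"))
```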
Remove text surrounded by brackets, along with the brackets
Python3
from textacy import preprocessing
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling (from the start plus new capabilities in Python 3.11)"""
print(preprocessing.remove.brackets(txt))
Output:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling
We can also pass a keyword argument called only, containing a list of the bracket types we want removed. It supports three values: square, curly, and round.
Python3
from textacy import preprocessing
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the [ABC programming language], which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}"""
print(preprocessing.remove.brackets(txt,only=["round","square"]))
Output:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica in the Netherlands
as a successor to the , which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}
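To see how selecting bracket types can work, here is a small standard-library sketch that keeps one regular expression per bracket type and applies only the requested ones. It is our own approximation: the names `BRACKET_PATTERNS` and `remove_brackets` are hypothetical, and the patterns do not handle nested brackets.

```python
import re

# One pattern per bracket type; each matches an unnested bracketed span.
BRACKET_PATTERNS = {
    "round": re.compile(r"\([^()]*\)"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "curly": re.compile(r"\{[^{}]*\}"),
}


def remove_brackets(text, only=None):
    """Delete bracketed spans of the selected kinds (all three by default)."""
    kinds = only if only is not None else list(BRACKET_PATTERNS)
    for kind in kinds:
        text = BRACKET_PATTERNS[kind].sub("", text)
    return text


print(remove_brackets("Informatica (CWI) in the [ABC] {keep me}", only=["round", "square"]))
```

In this sketch, deleting a bracketed span leaves the spaces on either side behind, so a whitespace normalization step afterwards tidies up the result.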