Before we get into tokenization, let’s first take a look at what spaCy is. spaCy is a popular library used in Natural Language Processing (NLP). It’s an object-oriented library that helps with processing and analyzing text. We can use spaCy to clean and prepare text, break it into sentences and words, and extract useful information from it using its various tools and functions. This makes spaCy a great tool for tasks like tokenization, part-of-speech tagging and named entity recognition.
What is Tokenization?
Tokenization is the process of splitting a text or a sentence into segments, which are called tokens. These tokens can be individual words, phrases, or characters depending on the tokenization method used. It is the first step of text preprocessing and is used as input for subsequent processes like text classification, lemmatization and part-of-speech tagging. This step is essential for converting unstructured text into a structured format that can be processed further for tasks such as sentiment analysis, named entity recognition and translation.
Example of Tokenization
Consider the sentence: “I love natural language processing!”
After tokenization: [“I”, “love”, “natural”, “language”, “processing”, “!”]
Each token here represents a word or punctuation mark, making it easier for algorithms to process and analyze the text.
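spaCy can produce exactly this list of tokens with just a few lines of code. Below is a minimal sketch (assuming spaCy is installed; a blank English pipeline is enough for plain tokenization); the next section walks through the same idea in more detail.
Python
import spacy

# A blank pipeline only tokenizes; no trained components are needed
nlp = spacy.blank("en")
doc = nlp("I love natural language processing!")

# Collect the text of each token to reproduce the list shown above
print([token.text for token in doc])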
Implementation of Tokenization Using the spaCy Library
Python
import spacy

# Creating a blank English language object, then
# tokenizing the words of the sentence
nlp = spacy.blank("en")

doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

for token in doc:
    print(token)
Output:
GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.
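Even with this blank pipeline, each token already exposes useful lexical attributes such as its text, its character offset in the original string and flags for punctuation or alphabetic characters. A short sketch using standard spaCy token attributes:
Python
import spacy

nlp = spacy.blank("en")
doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# Lexical attributes work without any trained components
for token in doc:
    print(token.text, token.idx, token.is_alpha, token.is_punct)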
We can also add more functionality to tokens by loading a pretrained pipeline with spacy.load(), which brings additional components (modules) into the processing pipeline.
Python
import spacy

# Loading a pretrained English pipeline
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
Output:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
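Each name in this list is a pipeline component that runs on the doc after tokenization. If only some of them are needed, components can be switched off when loading the model, which speeds up processing. A minimal sketch using spaCy's disable argument:
Python
import spacy

# Disable the components we do not need; only the remaining ones will run
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# The disabled components no longer appear in the pipeline
print(nlp.pipe_names)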
Here is an example of the additional functionality that becomes available once these components are added to the pipeline.
Python
import spacy

# Loading the pretrained pipeline with its components
nlp = spacy.load("en_core_web_sm")

# Initialising a doc with a sentence
doc = nlp("If you want to be an excellent programmer, be consistent to practice daily on GFG.")

# Using token properties, i.e. part-of-speech tag and lemma
for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)
Output:
If | subordinating conjunction | if
you | pronoun | you
want | verb | want
to | particle | to
be | auxiliary | be
an | determiner | an
excellent | adjective | excellent
programmer | noun | programmer
, | punctuation | ,
be | auxiliary | be
consistent | adjective | consistent
to | particle | to
practice | verb | practice
daily | adverb | daily
on | adposition | on
GFG | proper noun | GFG
. | punctuation | .
In the example above, we used the part-of-speech (POS) tagging and lemmatization components of the spaCy pipeline to obtain the POS tag for each word and to reduce each token to its base form. This functionality is not available with a blank language object; it requires loading a pretrained pipeline such as en_core_web_sm, which supplies the necessary linguistic features (POS tagging, lemmatization, named entity recognition and more) on top of basic tokenization.
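The same pretrained pipeline also contains the ner component listed earlier, so named entity recognition comes for free once the model is loaded. Below is a short sketch using the well-known example sentence from the spaCy documentation; the exact entities and labels can vary with the model version.
Python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# doc.ents holds the named entities detected by the ner component
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))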