Token Classification in Natural Language Processing

Token classification is a core task in Natural Language Processing (NLP) in which each token (typically a word or sub-word) in a sequence is assigned a label. By labeling tokens with categories such as entity types, parts of speech, or syntactic chunks, it extracts structured information from unstructured text and helps models understand both the semantic meaning and the grammatical structure of language.

What is a Token?

Tokens are the smallest units of a text, typically separated by whitespace (such as spaces or newlines) and punctuation marks (such as commas or periods). For example, consider the sentence:

"The quick brown fox jumps over the lazy dog."

The tokens for this sentence would be:

"The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."

Word level vs Sub-word level (BERT-style)

1. In traditional NLP pipelines, tokenization is done at the word level, meaning each token corresponds directly to a word in the sentence.

Text: "running"

Tokens: ["running"]

This approach is simple but has drawbacks, as it struggles with:

  • Rare or misspelled words
  • Morphological variation (changes in word structure and formation)
  • Out-of-vocabulary (OOV) terms

2. Modern transformer-based models like BERT, RoBERTa, and GPT use sub-word tokenization algorithms such as WordPiece, Byte Pair Encoding (BPE), or Unigram. These break words into smaller, meaningful fragments.

Text: "running"

Tokens: ["run", "##ing"]

"##ing" shows that it is a continuation of a previous sub-word. This method helps with:

  • Efficiently handling rare or unseen words
  • Sharing sub-word representations across different words (e.g., “runner”, “running”, “run”)
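
A short sketch of sub-word tokenization using the Hugging Face tokenizer for bert-base-uncased (exact splits depend on the model's learned vocabulary, so the outputs in the comments are indicative):

Python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer used by BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("running"))
# A frequent word may stay whole, e.g. ['running']
print(tokenizer.tokenize("tokenization"))
# A rarer word splits into sub-words, e.g. ['token', '##ization']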

BIO/IOB tagging scheme

The IOB tagging scheme also referred to as BIO (Beginning, Inside, Outside) is a widely used labeling format in token classification tasks such as Named Entity Recognition (NER), chunking, and semantic role labeling. This format helps models identify spans of consecutive tokens that together represent a meaningful entity (like a person’s name, organization, or location).

Components of IOB/BIO Tags

Each token is assigned one of these tags:

Tag    Meaning
B      Beginning of a chunk/entity
I      Inside a chunk/entity
O      Outside of any chunk/entity

In practice, B and I are combined with an entity type: for example, B-PER marks the first token of a person's name and I-LOC marks a continuation token of a location.
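
A small illustration of BIO tags in practice; the sentence and tags here are chosen purely for illustration:

Python
# "Mary" starts a person entity, "New York" spans two location tokens,
# and every other token falls outside any entity
tokens = ["Mary", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]

for token, tag in zip(tokens, tags):
    print(f"{token:>5}  {tag}")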

Applications of Token Classification

1. Named Entity Recognition (NER)

NER assigns a specific label to each token in the text indicating the type of entity it belongs to; typical categories include person names, locations, organizations, and dates. Tokens that are not part of any named entity receive a neutral label (often "O").

Code Example:

Python
from transformers import pipeline

# Load the default pre-trained NER pipeline and tag a sentence
classifier = pipeline("ner")
classifier("Hello I'm Mary and I live in Amsterdam.")

Output:

[Output for NER model]

Interpretation:

  1. "Mary" labeled as a Person (I-PER) with 99.8% confidence
  2. "Amsterdam" labeled as a Location (I-LOC) with 99.85% confidence

2. Part-of-Speech (POS) Tagging

POS tagging is a classic token classification task, where the goal is to assign a grammatical category (such as noun, verb, adjective, etc.) to each word in a sentence. The model processes the input text and labels each token based on its syntactic role within the sentence.

Code Example:

Python
from transformers import pipeline

# Load a BERT model fine-tuned for part-of-speech tagging
classifier = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")
classifier("Hello I'm Mary and I live in Amsterdam.")

Output:

[Output for POS model]

Interpretation:

Token       POS Tag   Meaning                           Confidence
hello       INTJ      Interjection (e.g., greetings)    99.68%
i           PRON      Pronoun                           99.94%
'           AUX       Auxiliary verb (part of "I'm")    99.67%
m           AUX       Auxiliary verb ("am")             99.65%
mary        PROPN     Proper noun                       99.85%
and         CCONJ     Coordinating conjunction          99.92%
i           PRON      Pronoun                           99.95%
live        VERB      Main verb                         99.86%
in          ADP       Adposition (preposition)          99.94%
amsterdam   PROPN     Proper noun                       99.88%
.           PUNCT     Punctuation                       99.96%

Challenges in Token Classification

  • Sub-word Token Handling: Sub-word tokenization splits words into smaller units, complicating token-level labeling.
    For example, "playing" → ["play", "##ing"]. The challenge lies in aligning word-level labels with sub-word tokens. A common solution is to assign the label only to the first sub-word and mask the rest during training and evaluation (see the alignment sketch after this list).
  • Imbalanced Label Distribution: In tasks like NER, the "O" class dominates, making the model biased toward non-entity predictions and hurting performance on rare labels. To address this, oversampling minority classes or weighting the loss function helps ensure the model pays attention to them (see the class-weighting sketch after this list).
  • Domain Adaptation: Pretrained models often fail on specialized domains like medical or legal texts due to unfamiliar terms. This mismatch reduces performance. Fine-tuning or using domain adapted models like BioBERT or LegalBERT helps bridge the vocabulary and context gap.
  • Label Ambiguity: Words like "Apple" may refer to a company or a fruit, depending on context. Such ambiguity confuses models in the absence of strong contextual clues. Transformer models like BERT use contextual embeddings to disambiguate meanings, but errors still occur in subtle or low-resource contexts.
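
Below is a minimal sketch of the sub-word alignment strategy described above, using the word_ids() mapping exposed by Hugging Face fast tokenizers; the sentence and labels are illustrative, and -100 is the index that PyTorch's cross-entropy loss ignores by convention:

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["Mary", "lives", "in", "Amsterdam"]
word_labels = ["B-PER", "O", "O", "B-LOC"]

# is_split_into_words=True tells the tokenizer the input is pre-split into words
encoding = tokenizer(words, is_split_into_words=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        aligned_labels.append(-100)                   # special tokens like [CLS]/[SEP]
    elif word_id != previous_word_id:
        aligned_labels.append(word_labels[word_id])   # first sub-word keeps the label
    else:
        aligned_labels.append(-100)                   # continuation sub-words are masked
    previous_word_id = word_id

print(list(zip(encoding.tokens(), aligned_labels)))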
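
And a sketch of class weighting, a common alternative remedy for the imbalanced-label problem, using PyTorch's CrossEntropyLoss; the label set, weights, and logits here are illustrative:

Python
import torch
import torch.nn as nn

# Down-weight the dominant "O" class so errors on entity classes cost more
num_labels = 3                                 # e.g. O, B-PER, I-PER
class_weights = torch.tensor([0.1, 1.0, 1.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(8, num_labels)            # scores for 8 tokens
labels = torch.tensor([0, 0, 1, 2, 0, 0, 0, -100])  # -100 = masked position
print(loss_fn(logits, labels))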
