
Part of Speech (POS) tagging with Hidden Markov Model

Last Updated : 10 Jun, 2025

Part of Speech (POS) tagging is the process of assigning a grammatical category to each word in a sentence based on its role in the context. These categories include noun, verb, adjective, adverb, preposition, and others. POS tagging helps computers understand the structure of language, making it easier for them to process and analyze text.

Example:
Sentence: “The cat is sitting on the mat.”

POS tags:

  • The → Determiner
  • cat → Noun
  • is → Verb
  • sitting → Verb
  • on → Preposition
  • the → Determiner
  • mat → Noun

Importance of POS Tagging

POS tagging plays an important role in a wide range of Natural Language Processing (NLP) tasks. From text analysis and grammar checking to information retrieval, translation, and sentiment analysis, POS tags help systems understand sentence structure and word meaning in context. They are also important in more advanced processes like parsing and semantic analysis, acting as a foundational step for deeper language understanding.

Techniques for POS Tagging

Over the years, several approaches have evolved to perform POS tagging. These methods mainly fall into two broad categories: rule-based and stochastic (probabilistic). More recently, deep learning has also made its mark in this area.

1. Rule-Based POS Tagging

Rule-based systems rely on a set of manually written grammar rules and contextual cues. These rules help decide the correct tag for each word. A classic example: if an unknown word ends in ‘ing’ and follows a verb, tag it as a verb. These models typically use dictionaries of words with their possible tags and a set of disambiguation rules based on surrounding words, also called context frames. While accurate in structured settings, rule-based taggers can struggle with exceptions and the dynamic nature of real-world language.
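
To make this concrete, here is a minimal sketch of a rule-based tagger that combines a tiny hand-written lexicon with the suffix rule described above. The lexicon entries and the fallback tag are illustrative choices, not part of any standard tagger.

Python
def rule_based_tag(words):
    # Tiny illustrative lexicon; real rule-based taggers use much larger dictionaries
    lexicon = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN", "is": "VERB"}
    tags = []
    for i, word in enumerate(words):
        if word.lower() in lexicon:
            tags.append(lexicon[word.lower()])
        elif word.endswith("ing") and i > 0 and tags[i - 1] == "VERB":
            # Context-frame rule: an unknown '-ing' word after a verb is tagged as a verb
            tags.append("VERB")
        else:
            tags.append("NOUN")  # fallback tag for anything the rules do not cover
    return tags

print(rule_based_tag(["the", "dog", "is", "running"]))
# ['DET', 'NOUN', 'VERB', 'VERB']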

2. Transformation-Based Tagging

This is a hybrid between rule-based and statistical approaches. Initially, each word is tagged with a most-likely guess. Then, transformation rules learned from training data are applied iteratively to correct the tags based on context. The famous Brill Tagger is based on this method. It is an elegant balance between human intuition in the form of rules and machine learning in the form of pattern discovery.
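
As a rough sketch of the idea, the snippet below starts from baseline most-likely tags and applies one hypothetical transformation rule of the kind the Brill tagger learns automatically (the rule, tag names, and example sentence are illustrative).

Python
def apply_transformations(words, tags):
    # Start from the initial most-likely tags, then patch them with context rules
    tags = list(tags)
    for i in range(1, len(words)):
        # Hypothetical learned rule: change NOUN to VERB when the previous word is "to"
        if tags[i] == "NOUN" and words[i - 1].lower() == "to":
            tags[i] = "VERB"
    return tags

words = ["I", "want", "to", "book", "a", "flight"]
baseline = ["PRON", "VERB", "PART", "NOUN", "DET", "NOUN"]  # per-word most frequent tags
print(apply_transformations(words, baseline))
# ['PRON', 'VERB', 'PART', 'VERB', 'DET', 'NOUN']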

3. Stochastic or Probabilistic Tagging

These models use statistics derived from large annotated corpora. The simplest method assigns each word the most frequently occurring tag from training data. However, this can produce grammatically incorrect tag sequences. To fix this, probabilistic models consider tag sequences rather than individual words. A popular method is the Hidden Markov Model (HMM), which calculates the most probable sequence of tags for a sentence by considering both the word-tag likelihood and the likelihood of tag transitions. These models provide a good mix of accuracy and flexibility.
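
For intuition, the score an HMM assigns to one candidate tag sequence multiplies, word by word, the probability of each tag given the previous tag (transition) by the probability of each word given its tag (emission); the tagger then picks the sequence with the highest score. A minimal sketch, assuming the probability tables are plain nested dictionaries:

Python
def sequence_score(words, tags, start_p, trans_p, emit_p):
    # P(tags, words) = P(tag_1) * P(word_1 | tag_1) * P(tag_2 | tag_1) * P(word_2 | tag_2) * ...
    score = start_p.get(tags[0], 0.0) * emit_p.get(tags[0], {}).get(words[0], 0.0)
    for i in range(1, len(words)):
        score *= trans_p.get(tags[i - 1], {}).get(tags[i], 0.0)
        score *= emit_p.get(tags[i], {}).get(words[i], 0.0)
    return score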

4. Deep Learning-Based Tagging

With the rise of deep learning, models such as BiLSTM (Bidirectional Long Short-Term Memory) and Meta-BiLSTM have set new benchmarks in POS tagging accuracy. These models automatically learn complex patterns from large datasets and can capture long-distance dependencies in language, something rule-based and HMM taggers often miss.

POS tagging with Hidden Markov Model

A Hidden Markov Model is a statistical model used to understand systems where the actual states are not directly visible, but we can see outcomes that depend on those hidden states. In simple terms, HMM helps us figure out what is going on behind the scenes based on what we observe.

Let’s understand this with an example. Suppose you are trying to guess the weather each day, but you can't look outside. Instead, you watch what people do. If someone is walking, shopping, or cleaning, these actions give you clues about what the weather might be like.

[Figure] HMM Model Example: Weather and Activities

This HMM determines the most likely sequence of hidden weather states from a sequence of observed activities. Its parameters are listed below and collected into Python dictionaries in the sketch that follows these lists.

1. Self-Transition Probabilities (Self-Loops)

These represent the probability of the weather remaining in the same state from one step to the next:

  • Probability of staying in Rainy: 0.5
  • Probability of staying in Sunny: 0.6
  • Probability of staying in Cloudy: 0.5

2. Transition Probabilities (Weather to Weather)

These indicate the likelihood of transitioning from one weather state to another:

  • From Rainy to Sunny: 0.2
  • From Rainy to Cloudy: 0.3
  • From Sunny to Rainy: 0.3
  • From Sunny to Cloudy: 0.2
  • From Cloudy to Rainy: 0.2
  • From Cloudy to Sunny: 0.2

3. Emission Probabilities (Weather to Activities)

These describe the probability of performing an activity given a specific weather condition:

When the weather is Rainy:

  • Probability of going for a walk: 0.1
  • Probability of going shopping: 0.4
  • Probability of cleaning: 0.5

When the weather is Sunny:

  • Probability of going for a walk: 0.6
  • Probability of going shopping: 0.3
  • Probability of cleaning: 0.1

When the weather is Cloudy:

  • Probability of going for a walk: 0.3
  • Probability of going shopping: 0.5
  • Probability of cleaning: 0.2
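
To make the example concrete, one way to put these numbers into Python dictionaries is shown below. The activity names are shortened, and the uniform starting distribution is an assumption, since the lists above only give self-loop, transition and emission values. These tables have the same shape as the ones the POS tagger builds later in this article, so the same Viterbi function could decode an activity sequence such as ["walk", "shop", "clean"] into its most likely weather sequence.

Python
# Hidden weather states and an assumed uniform starting distribution
weather_states = ["Rainy", "Sunny", "Cloudy"]
weather_start = {"Rainy": 1/3, "Sunny": 1/3, "Cloudy": 1/3}

# Transition probabilities, with the self-loops on the diagonal
weather_trans = {
    "Rainy":  {"Rainy": 0.5, "Sunny": 0.2, "Cloudy": 0.3},
    "Sunny":  {"Rainy": 0.3, "Sunny": 0.6, "Cloudy": 0.2},
    "Cloudy": {"Rainy": 0.2, "Sunny": 0.2, "Cloudy": 0.5},
}

# Emission probabilities: activity observed given the weather
weather_emit = {
    "Rainy":  {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny":  {"walk": 0.6, "shop": 0.3, "clean": 0.1},
    "Cloudy": {"walk": 0.3, "shop": 0.5, "clean": 0.2},
}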

Components associated with Hidden Markov Models

Hidden Markov Models, or HMMs, are built around a few key ideas that help describe systems where we can observe some things, but not everything directly.

  • States: These are the underlying conditions or steps in a process that we can’t directly see. For example, in applications like speech recognition, the hidden states could represent sounds or words that are being spoken, even though we don’t hear them in isolated form.
  • Observations: These are the actual signals or data points we can detect. Using the same example of speech recognition, the observations would be features of the audio, such as frequency patterns or sound waves, that we can measure from a recording.
  • Transition Probabilities: HMMs include a set of probabilities that show how likely it is to move from one hidden state to another. These are usually arranged in a matrix and help the model understand the usual flow or sequence between states.
  • Emission Probabilities: Each hidden state has its own set of probabilities for producing different observations. This tells the model how likely it is to see a certain observation when the system is in a specific state.
  • Initial State Probabilities: This part of the model shows how likely it is for the system to start in a particular state before any observations are made.

POS Tagging with HMM in Python

Step 1: Define the Training Data

We start by creating a small dataset of sentences. Each word is labeled with its correct part of speech.

Python
train_data = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sat", "VERB")],
]

Step 2: Calculate Probabilities

This step builds the statistical foundation for the HMM. The model counts how likely:

  • A sentence starts with each tag
  • One tag follows another (transition)
  • A word appears with a tag (emission)

Then it converts these counts into probabilities, which are used by the Viterbi algorithm later.

Python
from collections import defaultdict

transition = defaultdict(lambda: defaultdict(int))   # counts of tag -> next tag
emission = defaultdict(lambda: defaultdict(int))     # counts of tag -> word
start_prob = defaultdict(int)                        # counts of sentence-initial tags
tag_counts = defaultdict(int)                        # total occurrences of each tag

for sentence in train_data:
    prev_tag = None
    for i, (word, tag) in enumerate(sentence):
        tag_counts[tag] += 1
        emission[tag][word] += 1
        
        if i == 0:
            start_prob[tag] += 1
        else:
            transition[prev_tag][tag] += 1
        prev_tag = tag

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

start_prob = normalize(start_prob)
for tag in emission:
    emission[tag] = normalize(emission[tag])
for prev in transition:
    transition[prev] = normalize(transition[prev])
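
For this tiny corpus the learned tables are easy to inspect: every sentence starts with DET, DET is always followed by NOUN, and NOUN by VERB. Printing them confirms this (values in the comments below are rounded):

Python
print("Start:", dict(start_prob))
print("Transitions:", {tag: dict(nxt) for tag, nxt in transition.items()})
print("Emissions:", {tag: dict(word_probs) for tag, word_probs in emission.items()})
# Start: {'DET': 1.0}
# Transitions: {'DET': {'NOUN': 1.0}, 'NOUN': {'VERB': 1.0}}
# Emissions: {'DET': {'the': 0.67, 'a': 0.33}, 'NOUN': {'cat': 0.33, 'dog': 0.67},
#             'VERB': {'sat': 0.67, 'barked': 0.33}}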

Step 3: Define the Viterbi Algorithm

The Viterbi algorithm uses the probabilities from Step 2 to find the most likely sequence of tags for a given sentence. At each word, it checks every possible tag and keeps the path with the highest probability so far, then returns the best complete sequence at the end of the sentence. Unseen words and transitions are given a tiny probability (1e-6) instead of zero, so a single unknown word does not wipe out an otherwise good path.

Python
def viterbi(sentence, states, start_p, trans_p, emit_p):
    V = [{}]    # V[t][state] = best probability of any tag path ending in state at word t
    path = {}   # best tag path ending in each state

    # Initialization: start probability times emission of the first word
    for state in states:
        V[0][state] = start_p.get(state, 0) * emit_p[state].get(sentence[0], 1e-6)
        path[state] = [state]

    # Recursion: extend the best path to each state, one word at a time
    for t in range(1, len(sentence)):
        V.append({})
        new_path = {}

        for curr_state in states:
            max_prob, prev_state = max(
                (V[t - 1][y0] * trans_p[y0].get(curr_state, 1e-6) * emit_p[curr_state].get(sentence[t], 1e-6), y0)
                for y0 in states
            )
            V[t][curr_state] = max_prob
            new_path[curr_state] = path[prev_state] + [curr_state]

        path = new_path

    # Termination: pick the state with the highest probability at the last word
    n = len(sentence) - 1
    prob, final_state = max((V[n][y], y) for y in states)
    return path[final_state]

Step 4: Run the Tagger on a New Sentence

Now we test the HMM model on a new sentence. Even though the model hasn’t seen this exact sentence before, it will use what it learned to predict the most likely part of speech tags for each word.

Python
test_sentence = ["a", "cat", "barked"]
states = list(tag_counts.keys())

predicted_tags = viterbi(test_sentence, states, start_prob, transition, emission)
print("Sentence:", test_sentence)
print("Predicted Tags:", predicted_tags)

Output

Sentence: ['a', 'cat', 'barked']

Predicted Tags: ['DET', 'NOUN', 'VERB']
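
It is easy to check this result by hand using the tables from Step 2: every training sentence starts with DET, DET is always followed by NOUN, NOUN by VERB, and each of the words "a", "cat" and "barked" appears with its tag in one of the three training sentences.

Python
# Start in DET and emit "a", move DET -> NOUN and emit "cat", move NOUN -> VERB and emit "barked"
path_prob = 1.0 * (1/3) * 1.0 * (1/3) * 1.0 * (1/3)
print(path_prob)  # 0.037..., i.e. 1/27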

Challenges in POS Tagging with HMM

  • Unknown Words (Out-of-Vocabulary Problem): HMMs rely on training data to learn which words belong to which tags. If a word in a new sentence was never seen during training, the model has no emission probability for it and struggles to assign a tag. This is especially common with names, technical terms, or misspelled words (a small demonstration follows this list).
  • Data Sparsity: For rare word-tag combinations or tag transitions, the model may not have enough training examples to calculate accurate probabilities. This can lead to zero probabilities or incorrect tagging, especially with small datasets.
  • Context Limitation: HMMs only look at the previous tag (first-order Markov assumption). This means the model can’t fully understand longer-range dependencies in language, like a word being influenced by tags several steps before.
  • Ambiguity in Words: Many words can have multiple valid tags depending on context. For example, the word “book” can be a noun or a verb. HMMs might make the wrong choice if context isn’t strong or the probabilities are too close.
  • Independence Assumption: HMMs assume that the current word depends only on its tag, not on neighboring words. In reality, word meanings often rely on the surrounding context, which HMMs ignore.
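
To see the out-of-vocabulary issue with the tagger built above, try a word that never appears in the training data. Because of the 1e-6 smoothing, the unknown word's emission probability is identical for every tag, so the decision rests entirely on the transition probabilities; here they happen to point the right way, but with a larger tag set and weaker transition evidence the guess can easily go wrong.

Python
# "elephant" never occurs in train_data, so every tag emits it with the same smoothed
# probability (1e-6); only the learned DET -> NOUN -> VERB transitions decide the tags
oov_sentence = ["the", "elephant", "sat"]
print(viterbi(oov_sentence, states, start_prob, transition, emission))
# ['DET', 'NOUN', 'VERB']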
