NLP | Proper Noun Extraction

Last Updated : 08 Jul, 2025

Proper noun extraction is a foundational task in Natural Language Processing (NLP), frequently used in applications like information retrieval, question answering and named entity recognition (NER). It involves identifying and isolating proper nouns from raw text.

Understanding Proper Nouns in NLP

In English grammar, proper nouns denote specific names and are typically capitalized. In NLP, they are tagged using the Penn Treebank POS tagset:

NNP: Proper noun, singular (London, Python, Amazon)
NNPS: Proper noun, plural (Americans, Italians)

POS taggers like the one in nltk can identify these tokens. Once tagged, we can apply rule-based chunking methods to group sequences of proper nouns into meaningful entities.
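To make the two tags concrete, the following minimal sketch selects NNP and NNPS tokens from an already-tagged sentence. The sentence and its tags are written by hand for illustration rather than produced by a tagger, and `proper_noun_tokens` is a hypothetical helper name.

```python
# Hand-tagged example sentence (an assumption for illustration);
# in practice a POS tagger would produce these (word, tag) pairs.
tagged_sentence = [
    ("Americans", "NNPS"),
    ("visiting", "VBG"),
    ("London", "NNP"),
    ("often", "RB"),
    ("use", "VBP"),
    ("Amazon", "NNP"),
    (".", "."),
]

# The two Penn Treebank tags that mark proper nouns.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def proper_noun_tokens(tagged):
    """Return the words tagged NNP or NNPS, in sentence order."""
    return [word for word, tag in tagged if tag in PROPER_NOUN_TAGS]


print(proper_noun_tokens(tagged_sentence))
```

Filtering by tag returns isolated words only; grouping adjacent proper nouns into one name is the job of the chunking step.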

Extracting Proper Noun Chunks Using RegexpParser

The code in this section extracts sequences of singular proper nouns (tagged NNP; the grammar used here does not match NNPS) from a POS-tagged sentence using a regular expression-based chunk parser.

Implementation using RegexpParser

1. Setup and Importing libraries

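treebank_chunk: An NLTK corpus of POS-tagged, chunked sentences. Its tagged_sents() method returns each sentence as a list of tuples like ('word', 'POS'). The corpus data must be installed first, for example with nltk.download('treebank').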
RegexpParser: A class used to define and apply chunking patterns using regular expressions over POS tags.

Python

from nltk.corpus import treebank_chunk
from nltk.chunk import RegexpParser

2. Helper Function: sub_leaves()

This function takes a chunked parse tree and a chunk label (like 'NAME').
It loops through all subtrees of the tree where the label matches.
It extracts the actual words (leaves) from each of those subtrees.
It joins them into strings to return clean, readable named entities like "Pierre Vinken".

Python

def sub_leaves(tree, label):
    """
    Return the words of each subtree with the given label, joined into strings.
    """
    phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        # Leaves are (word, tag) tuples when the input was POS-tagged,
        # and plain strings otherwise; keep only the word in both cases.
        words = [leaf[0] if isinstance(leaf, tuple) else str(leaf)
                 for leaf in subtree.leaves()]
        if words:
            phrases.append(" ".join(words))
    return phrases

3. Chunk Grammar Definition

This defines a simple chunking grammar that groups one or more consecutive proper nouns (NNP) into a chunk labeled 'NAME'.
Example: If a sentence has ('Pierre', 'NNP'), ('Vinken', 'NNP'), both will be grouped into a single chunk: NAME.

Python

chunker = RegexpParser(r'''
                       NAME:
                       {<NNP>+}
                       ''')
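What the rule `{<NNP>+}` matches can be illustrated without NLTK. The sketch below is a simplified stand-in, not NLTK's actual implementation: it writes the tag sequence as one string and runs an ordinary regular expression over it. The function name `name_chunks` and the abridged, hand-tagged sentence are assumptions made for this example.

```python
import re


def name_chunks(tagged):
    """Group runs of consecutive NNP tokens, mimicking the rule {<NNP>+}."""
    # Encode the tag sequence as one string, e.g. "<NNP><NNP><,><CD>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"(?:<NNP>)+", tag_string):
        # Every token contributes exactly one "<", so counting them maps
        # character offsets in the tag string back to token positions.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        words = [word for word, _ in tagged[start:start + length]]
        chunks.append(" ".join(words))
    return chunks


# Abridged, hand-tagged sentence (illustrative assumption).
sentence = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), (",", ","), ("will", "MD"),
    ("join", "VB"), ("the", "DT"), ("board", "NN"),
    ("Nov.", "NNP"), ("29", "CD"), (".", "."),
]

print(name_chunks(sentence))
```

Because the pattern requires the closing `>` right after `NNP`, a token tagged NNPS is not matched, which mirrors the behavior of the grammar above.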

4. Extracting Named Entities

treebank_chunk.tagged_sents()[0] gives the first POS-tagged sentence from the corpus.
chunker.parse(...) applies the NAME grammar to this sentence, producing a chunked tree.
sub_leaves(..., 'NAME') extracts all the 'NAME' chunks (proper noun phrases) as text.

Python

print("Named Entities : \n",
      sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME'))

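Output: the first sentence of the corpus ("Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.") yields two NAME chunks, Pierre Vinken and Nov.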

For a single-rule grammar like this one, chunking with RegexpParser is roughly linear in the number of tokens in the sentence, O(n): the tag sequence is scanned once for runs of NNP. Grammars with many rules or more complex patterns may cost more.
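The single-pass idea can be sketched in plain Python with `itertools.groupby`, which visits each token once. This is an illustration of the linear scan only, not how RegexpParser is implemented; `nnp_runs` and the sample sentence are hypothetical.

```python
from itertools import groupby


def nnp_runs(tagged):
    """Yield each maximal run of consecutive NNP tokens as a list of words."""
    # groupby walks the sequence once, so the work grows linearly
    # with the number of tokens.
    for is_nnp, group in groupby(tagged, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            yield [word for word, _ in group]


# Hand-tagged sample sentence (illustrative assumption).
tagged = [
    ("New", "NNP"), ("York", "NNP"), ("and", "CC"),
    ("Boston", "NNP"), ("are", "VBP"), ("cities", "NNS"),
]

print(list(nnp_runs(tagged)))
```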

Limitations

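Fails when a name contains tokens that are not tagged NNP: "Bank of America" is split at "of" (tagged IN), and the title in "Dr. John Smith" is dropped whenever the tagger does not label it NNP.
Cannot distinguish entity types: "Amazon" is chunked the same way whether it refers to the company or the river.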
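The first limitation can be reproduced with a hand-tagged sentence in which a name contains a preposition. The sketch below first groups plain NNP runs, then shows one possible workaround: bridging a small set of connector words when they sit between two proper nouns. The helper `chunk_names`, its `connectors` parameter, and the sample sentence are assumptions for illustration, not part of NLTK.

```python
def chunk_names(tagged, connectors=frozenset()):
    """Group NNP runs; optionally bridge a connector word between two NNPs."""
    chunks, current = [], []
    for i, (word, tag) in enumerate(tagged):
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "NNP":
            current.append(word)
        elif current and word.lower() in connectors and next_tag == "NNP":
            # Keep the connector only when it sits between two proper nouns.
            current.append(word)
        else:
            # Any other token ends the run collected so far.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hand-tagged sample sentence (illustrative assumption).
tagged = [
    ("She", "PRP"), ("joined", "VBD"), ("Bank", "NNP"), ("of", "IN"),
    ("America", "NNP"), ("in", "IN"), ("Boston", "NNP"), (".", "."),
]

print(chunk_names(tagged))                     # plain NNP runs split the name
print(chunk_names(tagged, connectors={"of"}))  # "of" is bridged
```

Bridging connectors is a heuristic and can merge unrelated names that happen to be joined by the same word. It does nothing for the second limitation, since telling a company from a river needs context that POS tags do not carry.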

mohit gupta_omg :)
