
NLP | Proper Noun Extraction

Last Updated : 08 Jul, 2025

Proper noun extraction is a foundational task in Natural Language Processing (NLP), frequently used in applications like information retrieval, question answering and named entity recognition (NER). It involves identifying and isolating proper nouns from raw text.

Understanding Proper Nouns in NLP

In English grammar, proper nouns denote specific names and are typically capitalized. In NLP, they are tagged using the Penn Treebank POS tagset:

  • NNP: Proper noun, singular (London, Python, Amazon)
  • NNPS: Proper noun, plural (Americans, Italians)

POS taggers like the one in nltk can identify these tokens. Once tagged, we can apply rule-based chunking methods to group sequences of proper nouns into meaningful entities.
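
As a minimal sketch of the first step, the snippet below filters proper nouns out of a sentence that has already been tagged. The sentence and its tags are hand-written for illustration (an assumption, not tagger output), and `proper_nouns` is a hypothetical helper name.

```python
# Hand-tagged tokens using Penn Treebank tags (illustrative input, not tagger output).
tagged = [
    ("Americans", "NNPS"),
    ("visit", "VBP"),
    ("London", "NNP"),
    ("every", "DT"),
    ("summer", "NN"),
]

# The two Penn Treebank tags that mark proper nouns.
PROPER_TAGS = {"NNP", "NNPS"}


def proper_nouns(tagged_tokens):
    """Return the words whose tag is NNP or NNPS, in sentence order."""
    return [word for word, tag in tagged_tokens if tag in PROPER_TAGS]


print(proper_nouns(tagged))  # ['Americans', 'London']
```

Filtering token by token like this loses the grouping of multi-word names, which is what chunking adds.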

Extracting Proper Noun Chunks Using RegexpParser

This code extracts sequences of proper nouns (tagged NNP) from a POS-tagged sentence using a regular expression-based chunk parser.

Implementation using RegexpParser

1. Setup and Importing libraries

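  • treebank_chunk: A corpus reader in NLTK that provides POS-tagged sentences. Each sentence is a list of tuples like ('word', 'POS'). The corpus data must be available locally, which usually means running nltk.download('treebank') once.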
  • RegexpParser: A class used to define and apply chunking patterns using regular expressions over POS tags.
Python
from nltk.corpus import treebank_chunk
from nltk.chunk import RegexpParser

2. Helper Function: sub_leaves()

  • This function takes a chunked parse tree and a chunk label (like 'NAME').
  • It loops through all subtrees of the tree where the label matches.
  • It extracts the actual words (leaves) from each of those subtrees.
  • It joins them into strings to return clean, readable named entities like "Pierre Vinken".
Python
def sub_leaves(tree, label):
    """
    Extracts the leaves (words) from subtrees with a specific label.
    """
    leaves = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        # Leaves of a chunked POS-tagged sentence are (word, tag) tuples,
        # so keep only the word; plain-string leaves are used as they are.
        words = [leaf if isinstance(leaf, str) else leaf[0]
                 for leaf in subtree.leaves()]
        if words:
            leaves.append(" ".join(words))
    return leaves

3. Chunk Grammar Definition

  • This defines a simple chunking grammar that groups one or more consecutive singular proper nouns (NNP) into a chunk labeled 'NAME'. Plural proper nouns (NNPS) are not matched by this pattern.
  • Example: If a sentence has ('Pierre', 'NNP'), ('Vinken', 'NNP'), both will be grouped into a single chunk: NAME.
Python
chunker = RegexpParser(r'''
                       NAME:
                       {<NNP>+}
                       ''')
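
To see what the grammar produces without loading a corpus, here is a small sketch that applies the same rule to a hand-tagged sentence. The token list is an assumption written for illustration; only RegexpParser and the tree methods come from NLTK.

```python
from nltk.chunk import RegexpParser

# Same rule as above, written on one line.
chunker = RegexpParser(r"NAME: {<NNP>+}")

# Hand-tagged sentence (illustrative input, not read from the corpus).
sentence = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), ("will", "MD"), ("join", "VB"),
    ("the", "DT"), ("board", "NN"), ("Nov.", "NNP"), ("29", "CD"),
]

# parse() returns an nltk Tree whose NAME subtrees hold the grouped tokens.
tree = chunker.parse(sentence)

# Collect the (word, tag) leaves of every NAME chunk.
name_chunks = [
    subtree.leaves()
    for subtree in tree.subtrees()
    if subtree.label() == "NAME"
]
print(name_chunks)
# [[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]
```

Note that the leaves are still (word, tag) tuples at this point; turning them into plain strings is a separate step.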

4. Extracting Named Entities

  • treebank_chunk.tagged_sents()[0] gives the first POS-tagged sentence from the corpus.
  • chunker.parse(...) applies the NAME grammar to this sentence, producing a chunked tree.
  • sub_leaves(..., 'NAME') extracts all the 'NAME' chunks (proper noun phrases) as text.
Python
print("Named Entities : \n",
      sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME'))

Output:

[Image: NounRecognition, output of the code above]

For a simple single-rule grammar like this one, chunking with RegexpParser takes time approximately linear in the number of tokens in the sentence (O(n)), because the rule is applied in a single pass over the sentence's tag sequence.
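
The single-pass idea can be sketched in plain Python. This is an illustration of a linear scan that groups consecutive NNP tokens, not a description of how RegexpParser works internally; `group_proper_nouns` is a hypothetical name and the sentence is hand-tagged.

```python
from itertools import groupby


def group_proper_nouns(tagged_tokens):
    """Join each run of consecutive NNP tokens into one string.

    Every token is visited exactly once, so the work grows linearly
    with sentence length.
    """
    names = []
    # groupby splits the sentence into runs of NNP and non-NNP tokens.
    for is_nnp, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            names.append(" ".join(word for word, _tag in run))
    return names


# Hand-tagged sentence (illustrative input).
sentence = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), ("will", "MD"), ("join", "VB"),
    ("the", "DT"), ("board", "NN"), ("Nov.", "NNP"), ("29", "CD"),
]
print(group_proper_nouns(sentence))  # ['Pierre Vinken', 'Nov.']
```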

Limitations

  • Fails when a name contains tokens that are not tagged NNP, such as the preposition in Bank of America, which is split into separate chunks.
  • Cannot distinguish entities of different types ("Amazon" as a company vs. river).
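
The first limitation is easy to reproduce. The sketch below runs the same NAME rule on a hand-tagged sentence (an assumption for illustration) in which a multi-word name contains a preposition.

```python
from nltk.chunk import RegexpParser

chunker = RegexpParser(r"NAME: {<NNP>+}")

# Hand-tagged sentence; "of" carries the preposition tag IN, not NNP.
tagged = [
    ("Bank", "NNP"), ("of", "IN"), ("America", "NNP"), ("opened", "VBD"),
    ("an", "DT"), ("office", "NN"), ("in", "IN"), ("London", "NNP"),
]

tree = chunker.parse(tagged)

# Join the words of each NAME chunk into a string.
names = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NAME"
]
print(names)  # ['Bank', 'America', 'London']
```

The organization name is broken into two chunks, and nothing in the result says that London is a place while the other two pieces belong to a company, which is the second limitation.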
