Unsupervised Noun Extraction in NLP

Last Updated : 24 Jul, 2024

Unsupervised noun extraction is a technique in Natural Language Processing (NLP) used to identify and extract nouns from text without relying on labelled training data. Instead, it leverages statistical and linguistic patterns to detect noun phrases. This approach is particularly valuable for processing large volumes of text where manual annotation is impractical.

In this article, we will explore several methods used in unsupervised noun extraction, including branching entropy, accessor variety, and cohesion score. We will also implement these techniques using Python to gain hands-on experience.

What is Unsupervised Noun Extraction?

Unsupervised noun extraction exploits the statistical properties of a language to identify nouns without relying on labelled datasets. This approach is particularly helpful for languages where labelled data is unavailable or very limited. By analysing word co-occurrences and context distributions, unsupervised methods can detect noun phrases directly from raw text. These techniques rely on metrics such as entropy, mutual information, and accessor variety to evaluate the likelihood of a word being a noun based on its contextual usage. We will cover these metrics in depth in the next sections.

One of the main benefits of unsupervised noun extraction is its scalability: we can process large collections of documents easily, whereas traditional supervised methods require manual annotation, which is both time-consuming and costly.

Branching Entropy for Unsupervised Noun Extraction in NLP

Branching entropy is a concept used in natural language processing (NLP) for unsupervised noun extraction. It helps identify noun phrases by measuring the uncertainty or randomness in the continuation of a word sequence. Specifically, branching entropy can be used to determine the boundaries of noun phrases by analyzing the likelihood of different words following a given word or sequence of words.

Branching entropy is derived from the Shannon entropy formula and is used to measure the uncertainty of word occurrences in different contexts.

There are two types of branching entropy:

  1. Forward Branching Entropy: Measures the uncertainty in the possible continuations of a word or sequence.
  2. Backward Branching Entropy: Measures the uncertainty in the possible preceding words of a word or sequence.

The entropy value is higher when there are many possible continuations (or preceding words) with relatively equal probabilities, indicating high uncertainty. Conversely, a lower entropy value indicates fewer possible continuations, suggesting more certainty.

Calculating Branching Entropy

Forward Branching Entropy

Given a word w in a sequence, the forward branching entropy H_f(w) is calculated as:

H_f(w) = -\sum_{i=1}^{N} P(w_i | w) \log P(w_i | w)

where P(w_i | w) is the conditional probability of the word w_i following the word w.

Backward Branching Entropy

Similarly, the backward branching entropy H_b(w) is calculated as:

H_b(w) = -\sum_{i=1}^{N} P(w_i | w) \log P(w_i | w)

where P(w_i | w) is the conditional probability of the word w_i preceding the word w.
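As a minimal illustration of how these two quantities behave, the sketch below builds follower and predecessor counts from a small, hypothetical toy corpus and plugs the resulting distributions into the entropy formula. A word that is followed (or preceded) by many different words with similar frequencies gets a high entropy, while a word that always appears next to the same token gets an entropy of zero; the toy sentence and helper names are illustrative only.

Python
from collections import defaultdict, Counter
import math

def entropy(counts):
    """Shannon entropy (in bits) of a frequency distribution given as a dict of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical toy corpus: the words after "fox" vary more than the words after "the"
tokens = "the fox runs the fox sleeps the fox eats the dog sleeps".split()

followers = defaultdict(Counter)   # counts of words following w  -> forward branching entropy
preceders = defaultdict(Counter)   # counts of words preceding w  -> backward branching entropy
for prev, nxt in zip(tokens, tokens[1:]):
    followers[prev][nxt] += 1
    preceders[nxt][prev] += 1

for w in ("the", "fox"):
    print(w, "H_f =", round(entropy(followers[w]), 3), "H_b =", round(entropy(preceders[w]), 3))

In this toy corpus, "fox" is followed by several different verbs, so its forward branching entropy is higher than that of "the", which is almost always followed by "fox".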

Unsupervised Noun Extraction Using Branching Entropy in Python

  1. Setup and Import Libraries:
    • Import necessary libraries such as nltk, Counter, and math.
    • Download required NLTK resources (tokenizers, POS taggers, and stopwords).
  2. Define Helper Function to Calculate Entropy: Implement calculate_entropy function to compute entropy based on word counts.
  3. Tokenize and Preprocess Text:
    • Tokenize the input text into words.
    • Remove stopwords from the tokenized words.
  4. Perform POS Tagging:
    • Use NLTK’s POS tagger to tag the filtered tokens.
    • Focus on nouns (tags starting with 'NN').
  5. Create Co-occurrence Matrix: Construct a co-occurrence matrix to count how often each word appears with its context words within a specified window size.
  6. Calculate Branching Entropy for Each Noun: Calculate the entropy for each noun based on its context word counts.
  7. Set Entropy Threshold and Extract Nouns:
    • Determine a threshold for noun extraction based on the calculated entropies.
    • Extract nouns that meet or exceed the threshold.
  8. Test the Function: Apply the function to a sample text and print the extracted nouns and their branching entropies.
Python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from collections import Counter
import math

# Download NLTK resources (only need to do this once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

def calculate_entropy(word_counts):
    total_count = sum(word_counts.values())
    entropy = -sum((count / total_count) * math.log2(count / total_count) for count in word_counts.values())
    return entropy

def extract_nouns_with_entropy(text, window_size=5):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Perform POS tagging
    tagged_tokens = pos_tag(filtered_tokens)
    
    # Create co-occurrence matrix
    co_occurrence_matrix = Counter()
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag.startswith('NN'):  # Focus on nouns
            context_words = filtered_tokens[max(0, i-window_size):i] + filtered_tokens[i+1:i+1+window_size]
            for context_word in context_words:
                if context_word != word:
                    co_occurrence_matrix[(word, context_word)] += 1
    
    # Calculate branching entropy for nouns
    word_entropy = Counter()
    for (word, _), count in co_occurrence_matrix.items():
        word_entropy[word] += count
    
    entropies = {word: calculate_entropy({context: count for (w, context), count in co_occurrence_matrix.items() if w == word})
                 for word in word_entropy}
    
    # Set a threshold for noun extraction
    threshold = sorted(entropies.values(), reverse=True)[int(0.5 * len(entropies))]
    nouns = [word for word, entropy in entropies.items() if entropy >= threshold]
    
    return nouns, entropies

text = "The quick brown fox jumps over the lazy dog."
nouns, entropies = extract_nouns_with_entropy(text)
print("Extracted Nouns:", nouns)
print("Nouns with Branching Entropy:", entropies)

Output:

Extracted Nouns: ['brown', 'fox', 'jumps', 'dog']
Nouns with Branching Entropy: {'brown': 2.584962500721156, 'fox': 2.584962500721156, 'jumps': 2.584962500721156, 'dog': 2.584962500721156}

On such a short sentence every candidate sees essentially the same context words, so all four tokens receive identical entropy values, and the POS tagger mislabels 'brown' and 'jumps' as nouns. This highlights the importance of context size and tagging accuracy in noun extraction with branching entropy: with a larger corpus and more refined tagging, the results would be more accurate and varied.

Accessor Variety for Unsupervised Noun Extraction in NLP

Accessor Variety (AV) is another concept used in NLP for unsupervised noun extraction. It measures the diversity of contexts (both preceding and following words) in which a candidate word appears. The idea is that nouns often appear in a variety of contexts, while other types of words (like function words) do not.

Concept of Accessor Variety

  1. Forward Accessor Variety (FAV): The number of unique words that follow a given word.
  2. Backward Accessor Variety (BAV): The number of unique words that precede a given word.

The total accessor variety (TAV) is the sum of FAV and BAV:

\text{TAV}(w) = \text{FAV}(w) + \text{BAV}(w)

By calculating the AV values for words, we can identify candidate nouns based on their contextual diversity. Words with high AV values are more likely to be nouns.
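Before moving to the full pipeline, the short sketch below illustrates the counting itself on a hypothetical token list: for every word it collects the sets of distinct right and left neighbours and sums their sizes to obtain TAV. The toy sentence and helper name are illustrative only.

Python
from collections import defaultdict

def accessor_variety(tokens):
    """Return (FAV, BAV, TAV): counts of distinct following/preceding neighbours per word."""
    following = defaultdict(set)
    preceding = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        following[left].add(right)
        preceding[right].add(left)
    fav = {w: len(following[w]) for w in set(tokens)}
    bav = {w: len(preceding[w]) for w in set(tokens)}
    tav = {w: fav[w] + bav[w] for w in set(tokens)}
    return fav, bav, tav

# Hypothetical toy corpus: "fox" appears in more distinct contexts than the other words
tokens = "the quick fox runs the old fox sleeps a fox eats".split()
fav, bav, tav = accessor_variety(tokens)
print(sorted(tav.items(), key=lambda kv: -kv[1]))

Here "fox" appears after several different modifiers and before several different verbs, so it receives the highest TAV.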

Unsupervised Noun Extraction Using Accessor Variety in Python

  1. Setup and Import Libraries:
    • Import necessary libraries such as nltk, Counter, and math.
    • Download required NLTK resources (tokenizers, POS taggers, and stopwords).
  2. Tokenize and Preprocess Text:
    • Tokenize the input text into words.
    • Remove stopwords from the tokenized words.
  3. Perform POS Tagging:
    • Use NLTK’s POS tagger to tag the filtered tokens.
    • Focus on nouns (tags starting with 'NN').
  4. Calculate Forward and Backward Accessor Variety: For each word, count the number of unique preceding and following words.
  5. Combine FAV and BAV to Get Total Accessor Variety (TAV): Sum the FAV and BAV for each word to get the total accessor variety.
  6. Set Threshold for Noun Extraction: Determine a threshold for noun extraction based on the AV values.
Python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from collections import defaultdict, Counter

# Download NLTK resources (only need to do this once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

def calculate_accessor_variety(tokens, window_size=5):
    # Create dictionaries to store forward and backward contexts
    forward_contexts = defaultdict(set)
    backward_contexts = defaultdict(set)
    
    # Populate the context dictionaries
    for i, word in enumerate(tokens):
        if i > 0:
            backward_contexts[word].add(tokens[i-1])
        if i < len(tokens) - 1:
            forward_contexts[word].add(tokens[i+1])
    
    # Calculate forward and backward accessor variety
    fav = {word: len(context) for word, context in forward_contexts.items()}
    bav = {word: len(context) for word, context in backward_contexts.items()}
    
    # Calculate total accessor variety
    tav = {word: fav.get(word, 0) + bav.get(word, 0) for word in set(tokens)}
    
    return fav, bav, tav

def extract_nouns_with_accessor_variety(text, window_size=5):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Perform POS tagging
    tagged_tokens = pos_tag(filtered_tokens)
    
    # Calculate accessor variety
    fav, bav, tav = calculate_accessor_variety(filtered_tokens, window_size)
    
    # Extract nouns based on accessor variety, keeping only words the POS tagger marks as nouns
    pos_tags = dict(tagged_tokens)
    threshold = sorted(tav.values(), reverse=True)[int(0.5 * len(tav))]
    nouns = [word for word, av in tav.items() if av >= threshold and pos_tags.get(word, '').startswith('NN')]
    
    return nouns, fav, bav, tav

text = "The quick brown fox jumps over the lazy dog."
nouns, fav, bav, tav = extract_nouns_with_accessor_variety(text)
print("Extracted Nouns:", nouns)
print("Forward Accessor Variety:", fav)
print("Backward Accessor Variety:", bav)
print("Total Accessor Variety:", tav)

Output:

Extracted Nouns: ['dog', 'brown', 'jumps', 'fox']
Forward Accessor Variety: {'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1}
Backward Accessor Variety: {'brown': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'dog': 1, '.': 1}
Total Accessor Variety: {'dog': 2, 'lazy': 2, '.': 1, 'brown': 2, 'quick': 1, 'jumps': 2, 'fox': 2}

The extracted words are those whose accessor variety meets the threshold and which the POS tagger marks as nouns. High accessor variety indicates a high diversity of contexts, which is typical for nouns in natural language; on a single short sentence, however, the values are nearly identical, so a larger corpus is needed before the measure becomes discriminative.

Cohesion Score for Unsupervised Noun Extraction in NLP

Cohesion Score is another method used in NLP for unsupervised noun extraction. It measures the strength of association between words in a phrase or sequence, indicating how likely they are to form a meaningful unit, such as a noun phrase.

Cohesion score is calculated based on the mutual information between adjacent words in a sequence. Mutual Information (MI) measures the degree of association between two words by comparing the observed frequency of their co-occurrence with the frequency expected if the words were independent.

Mutual Information Formula

For two words w_i and w_j, the mutual information MI(w_i, w_j) is calculated as:

MI(w_i, w_j) = \log \left( \frac{P(w_i, w_j)}{P(w_i) P(w_j)} \right)

Where:

  • P(w_i, w_j) is the joint probability of words w_i and w_j occurring together.
  • P(w_i) and P(w_j) are the individual probabilities of words w_i and w_j.
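As a small, hypothetical illustration, the sketch below estimates these probabilities from unigram and bigram counts in a toy corpus and computes the mutual information of two adjacent word pairs; summing these values over the adjacent pairs inside a candidate phrase yields its cohesion score, as done in the implementation that follows. The toy sentence and the pmi helper are illustrative only.

Python
from collections import Counter
import math

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2), estimated from raw counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    p_pair = bigrams[(w1, w2)] / (n - 1)
    if p_pair == 0:
        return float("-inf")   # the pair never occurs adjacently
    return math.log2(p_pair / (p_w1 * p_w2))

# Hypothetical toy corpus: "new york" always co-occurs, "the city" only sometimes
tokens = "new york is a busy city the city in the north of new york never sleeps".split()
print("MI(new, york) =", round(pmi(tokens, "new", "york"), 3))
print("MI(the, city) =", round(pmi(tokens, "the", "city"), 3))

The pair "new york", whose words always occur together, scores noticeably higher than "the city", whose words also appear in other contexts.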

Steps for Implementation

  1. Setup and Import Libraries:
    • Import necessary libraries such as nltk, Counter, and math.
    • Download required NLTK resources (tokenizers, POS taggers, and stopwords).
  2. Tokenize and Preprocess Text:
    • Tokenize the input text into words.
    • Remove stopwords from the tokenized words.
  3. Calculate Mutual Information: For each pair of adjacent words, calculate the mutual information.
  4. Calculate Cohesion Score: Sum the mutual information values for all pairs in a candidate phrase to get the cohesion score.
  5. Set Threshold for Noun Extraction: Determine a threshold for noun extraction based on the cohesion scores.
Python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from collections import Counter
import math

# Download NLTK resources (only need to do this once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

def calculate_mutual_information(word_counts, bigram_counts, total_bigrams, word1, word2):
    p_word1 = word_counts[word1] / total_bigrams
    p_word2 = word_counts[word2] / total_bigrams
    p_word1_word2 = bigram_counts[(word1, word2)] / total_bigrams
    if p_word1_word2 == 0:
        return 0
    return math.log2(p_word1_word2 / (p_word1 * p_word2))

def calculate_cohesion_score(tokens, window_size=2):
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_bigrams = len(tokens) - 1
    
    cohesion_scores = {}
    for i in range(len(tokens) - window_size + 1):
        phrase = tuple(tokens[i:i+window_size])
        cohesion_score = sum(calculate_mutual_information(word_counts, bigram_counts, total_bigrams, phrase[j], phrase[j+1])
                             for j in range(len(phrase) - 1))
        cohesion_scores[phrase] = cohesion_score
    
    return cohesion_scores

def extract_nouns_with_cohesion_score(text, window_size=2):
    # Tokenize the text
    tokens = word_tokenize(text.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    # Perform POS tagging
    tagged_tokens = pos_tag(filtered_tokens)
    
    # Calculate cohesion scores
    cohesion_scores = calculate_cohesion_score(filtered_tokens, window_size)
    
    # Extract noun phrases based on cohesion score
    threshold = sorted(cohesion_scores.values(), reverse=True)[int(0.5 * len(cohesion_scores))]
    noun_phrases = [' '.join(phrase) for phrase, score in cohesion_scores.items() if score >= threshold and any(tag.startswith('NN') for word, tag in tagged_tokens if word in phrase)]
    
    return noun_phrases, cohesion_scores

text = "The quick brown fox jumps over the lazy dog."
noun_phrases, cohesion_scores = extract_nouns_with_cohesion_score(text)
print("Extracted Noun Phrases:", noun_phrases)
print("Cohesion Scores:", cohesion_scores)

Output:

Extracted Noun Phrases: ['quick brown', 'brown fox', 'fox jumps', 'jumps lazy', 'lazy dog', 'dog .']
Cohesion Scores: {('quick', 'brown'): 2.584962500721156, ('brown', 'fox'): 2.584962500721156, ('fox', 'jumps'): 2.584962500721156, ('jumps', 'lazy'): 2.584962500721156, ('lazy', 'dog'): 2.584962500721156, ('dog', '.'): 2.584962500721156}

For the input text "The quick brown fox jumps over the lazy dog.", every adjacent pair receives the same cohesion score because each word and each bigram occurs exactly once in the filtered tokens. The output consists of:

  • Extracted Noun Phrases: Candidate phrases that meet the cohesion-score threshold and contain at least one word tagged as a noun.
  • Cohesion Scores: A dictionary mapping each candidate phrase to its cohesion score.

Conclusion

Unsupervised noun extraction techniques allow for effective identification of noun phrases without the need for annotated datasets. By utilizing methods such as branching entropy, accessor variety, and cohesion score, we can detect nouns based on contextual usage. Each technique offers unique insights: branching entropy evaluates contextual unpredictability, accessor variety measures context diversity, and cohesion score assesses word associations. These methods collectively enhance our ability to identify and analyze noun phrases in large texts.

