Semantic Analysis with NLTK
Semantic Analysis in NLP is the process of extracting meaning from text beyond the words themselves. While syntax focuses on structure, semantics helps machines understand what the text actually means. The Natural Language Toolkit (NLTK) is a popular Python library that provides foundational tools for this.
WordNet in NLTK
WordNet is like a smart English dictionary that groups words into sets of synonyms called synsets. Each synset represents a single concept or meaning, so we can easily look up all the meanings of a word and see example sentences that use each one.
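For example, looking up "bank" returns one synset per sense, each with a gloss and usage examples. A minimal sketch, assuming the wordnet corpus has already been downloaded (see the installation step below):
Python
from nltk.corpus import wordnet as wn

# Each synset is one sense of "bank", with a definition and examples
for syn in wn.synsets('bank'):
    print(syn.name(), '-', syn.definition())
    if syn.examples():
        print('   e.g.,', syn.examples()[0])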
- Word Sense Disambiguation (WSD): WSD is the task of figuring out which meaning of a word is correct in a given sentence. NLTK includes the classic Lesk algorithm, which chooses the best sense by comparing the words in the sentence with the dictionary definitions of each candidate sense. For example, in the sentence “He went to the bank to deposit money,” WSD should pick the sense of “bank” as a financial institution, not a riverbank.
- Semantic Similarity: WordNet lets you calculate how similar two words are based on how closely related their meanings are in the WordNet hierarchy. For example, “dog” and “cat” are semantically similar because they both belong to the “animal” group (see the sketch after this list). In NLTK you can use measures such as Wu-Palmer similarity, which compare how deep the word senses sit in the WordNet tree and how closely they share a common ancestor concept.
- Named Entity Recognition (NER): NER goes a step beyond words and their meanings; it identifies real-world entities in text. NLTK’s built-in NER can detect names of people, places, organizations, dates and more. For example, in the sentence “Barack Obama was born in Hawaii,” NER labels “Barack Obama” as a person and “Hawaii” as a location.
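To make the dog/cat example concrete, here is a minimal sketch, again assuming the WordNet corpora from the installation step below are available:
Python
from nltk.corpus import wordnet as wn

# First noun sense of each word
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Wu-Palmer scores by the depth of the lowest common ancestor
# ('carnivore' here), so closely related animals score high (~0.86)
print(dog.wup_similarity(cat))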
Implementation
Step 1: Install and Download Necessary Libraries
This step installs NLTK if needed, downloads various language resources including tokenizers, taggers and named entity chunkers, and imports modules required for word sense disambiguation and linguistic analysis.
Python
!pip install nltk

import nltk

# Tokenizers, taggers, WordNet data and the named-entity chunker
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
import pandas as pd
Step 2: Load the Dataset
This step reads the CSV file into a DataFrame with the specified column names and displays the first three rows to verify that the data loaded correctly. You can download the Sentiment140 dataset, which contains 1.6 million tweets, from Kaggle. Note that read_csv can load the zipped file directly, since pandas infers the compression from the .zip extension.
Python
df = pd.read_csv('training.1600000.processed.noemoticon.csv.zip',
                 encoding='latin-1',
                 names=['target', 'ids', 'date', 'flag', 'user', 'text'])
df.head(3)
Output:
Step 3: Select and Display a Sample Tweet
This step picks the first tweet from the dataset and prints it, providing a sample text to work with for further analysis or demonstration.
Python
tweet = df['text'][0]
print(f"Tweet: {tweet}")
Output:
Tweet: @switchfoot https://twitpic.com/ - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
Step 4: Tokenize and POS Tag the Sample Tweet
This step tokenizes the selected tweet into words and assigns a part-of-speech tag to each token, helping identify the grammatical role of each word for deeper linguistic analysis.
Python
import nltk

# Newer NLTK releases ship the English tagger as a separate resource
nltk.download('averaged_perceptron_tagger_eng')
tokens = word_tokenize(tweet)
tags = pos_tag(tokens)
print("Tokens:", tokens)
print("POS Tags:", tags)
Output:
Step 5: Perform Word Sense Disambiguation (WSD)
This step uses the Lesk algorithm to choose the most appropriate sense of the ambiguous word "bank" in the given sentence, then prints the selected synset and its definition to clarify the intended meaning based on context.
Python
sentence = "He went to the bank to deposit money."
tokens = word_tokenize(sentence)
sense = lesk(tokens, 'bank')
print("Best sense:", sense)
print("Definition:", sense.definition())
Output:
Best sense: Synset('savings_bank.n.02')
Definition: a container (usually with a slot in the top) for keeping money at home
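Note that Lesk picked the "piggy bank" sense rather than the financial institution: its definition-overlap heuristic is simple and can misfire on short contexts. The function also accepts an optional POS tag ('n', 'v', 'a', 'r') to narrow the candidate senses; a minimal sketch reusing the tokens from above:
Python
# List every noun sense of "bank" to compare against Lesk's pick
for syn in wn.synsets('bank', pos=wn.NOUN):
    print(f"{syn.name()}: {syn.definition()}")

# Restrict the Lesk search to noun senses only
sense_n = lesk(tokens, 'bank', 'n')
print("Noun-only sense:", sense_n.name(), "-", sense_n.definition())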
Step 6: Calculate Semantic Similarity Between Words
This step selects the first adjective senses of "good" and "bad" from WordNet and computes their semantic similarity using the Wu-Palmer measure, which quantifies how closely related two concepts are based on their positions in the lexical hierarchy.
Python
good = wn.synsets('good', pos=wn.ADJ)[0]
bad = wn.synsets('bad', pos=wn.ADJ)[0]
similarity = good.wup_similarity(bad)
print(f"Semantic Similarity (Wu-Palmer): {similarity}")
Output:
Semantic Similarity (Wu-Palmer): 0.5
Step 7: Named Entity Recognition (NER) with Chunking
This step downloads the resource required for NER, applies the named-entity chunker to the POS-tagged tokens and prints the resulting parse tree, which identifies and groups the named entities in the sentence.
Python
import nltk

# Newer NLTK releases ship the NE chunker as a separate resource
nltk.download('maxent_ne_chunker_tab')
tree = ne_chunk(tags)
print(tree)
Output:
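The printed tree marks entity chunks with labels such as PERSON or GPE. Here is a minimal sketch (not part of the original snippet) of pulling (entity, label) pairs out of the chunk tree:
Python
from nltk.tree import Tree

# Labelled chunks are Tree nodes; plain tokens are (word, tag) tuples
entities = [(" ".join(tok for tok, _ in st.leaves()), st.label())
            for st in tree if isinstance(st, Tree)]
print("Entities:", entities)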
Step 8: Perform Comprehensive Semantic Analysis
This step tokenizes and POS-tags the input text, performs named entity recognition, and then retrieves and prints every WordNet synset with its definition for each word. Running it on the first two tweets gives a detailed linguistic and semantic overview.
Python
def semantic_analysis(text):
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    print("Original:", text)
    print("Tokens:", tokens)
    print("NER Tree:", ne_chunk(tags))
    # Print every WordNet sense of each token that has one
    for word in tokens:
        synsets = wn.synsets(word)
        if synsets:
            print(f"\nWord: {word}")
            for syn in synsets:
                print(f" - {syn.name()}: {syn.definition()}")
    print("\n")

# Run the analysis on the first two tweets
for i in range(2):
    semantic_analysis(df['text'][i])
Output:
Here, we have performed semantic analysis of the word "Day".
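As a possible refinement (not part of the original walkthrough), the synset lookup can be filtered by each token's POS tag so that, for example, verb senses are not printed for a word used as a noun. The penn_to_wordnet helper below is a hypothetical mapping, not an NLTK built-in:
Python
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

def penn_to_wordnet(tag):
    # Hypothetical helper: map Penn Treebank tags to WordNet POS constants
    if tag.startswith('J'):
        return wn.ADJ
    if tag.startswith('V'):
        return wn.VERB
    if tag.startswith('N'):
        return wn.NOUN
    if tag.startswith('R'):
        return wn.ADV
    return None

for word, tag in pos_tag(word_tokenize("They deposit money at the bank")):
    wn_pos = penn_to_wordnet(tag)
    if wn_pos:
        print(word, tag, [s.name() for s in wn.synsets(word, pos=wn_pos)[:3]])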