Part-of-Speech Tagging with Stop Words using NLTK in Python

The Natural Language Toolkit (NLTK) is a popular Python library for text processing. Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a text and is a building block for NLP tasks such as text analysis, sentiment analysis and information extraction.

Installation

Open your terminal or command prompt and run the following command to install NLTK:

pip install nltk

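Import NLTK and download the resources used in this article: the tokenizer models, the tagger model and the stop word list. Calling nltk.download() with no arguments opens an interactive downloader instead. Resource names depend on the NLTK version: recent releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng' rather than the first two names below. If a resource is missing, the LookupError that NLTK raises names the one to download.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')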

Quick Vocabulary

Corpus: A body of text. Plural: corpora.
Lexicon: Words and their meanings.
Token: A single unit in a text, often a word or punctuation.

POS Tagging Basics

POS tagging assigns grammatical information to each token in a sentence. For example:

Input:

Everything is all about money.

Output:

[('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'), ('about', 'IN'), ('money', 'NN'), ('.', '.')]

Here, 'NN' is a singular noun, 'VBZ' is a third-person singular present-tense verb, 'DT' is a determiner and 'IN' is a preposition. The final '.' tag marks sentence-ending punctuation.
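
Because the tagger returns an ordinary list of (token, tag) tuples, the result can be processed with plain Python. The sketch below is only an illustration: it copies the output shown above as literal data instead of calling NLTK, and group_by_tag is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict

# Tagged output copied from the example above; no NLTK call is made here.
tagged = [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'),
          ('about', 'IN'), ('money', 'NN'), ('.', '.')]

def group_by_tag(pairs):
    """Collect tokens under their POS tag, preserving sentence order."""
    groups = defaultdict(list)
    for token, tag in pairs:
        groups[tag].append(token)
    return dict(groups)

groups = group_by_tag(tagged)
print(groups['NN'])   # ['Everything', 'money']
```

The same helper should work on any list returned by nltk.pos_tag, since it relies only on the tuple structure.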

Common POS Tags

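NLTK's default tagger uses the Penn Treebank tag set. The most common tags and their meanings:

CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential "there"
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ / JJR / JJS - Adjective / comparative / superlative
NN / NNS / NNP / NNPS - Noun, singular / plural / proper singular / proper plural
PRP / PRP$ - Personal pronoun / possessive pronoun
RB / RBR / RBS - Adverb / comparative / superlative
VB / VBD / VBG / VBN / VBP / VBZ - Verb, base form / past tense / gerund or present participle / past participle / non-3rd person singular present / 3rd person singular present
MD - Modal
POS - Possessive ending
RP - Particle
TO - "to"
UH - Interjection
WDT / WP / WP$ / WRB - Wh-determiner / wh-pronoun / possessive wh-pronoun / wh-adverb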
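
Related tags in this list share a prefix (all noun tags start with NN, all verb tags with VB), which makes it easy to collapse fine-grained tags into broad classes. The following is a minimal sketch of that idea; the COARSE_PREFIXES table and the coarse_category function are hypothetical names introduced here, and the choice of classes is an assumption rather than anything NLTK defines.

```python
# Hypothetical prefix table: maps the start of a Penn Treebank tag to a broad class.
COARSE_PREFIXES = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("PRP", "pronoun"),
    ("W", "wh-word"),
]

def coarse_category(tag):
    """Return a broad word class for a fine-grained tag, or 'other'."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"

sample_tags = ["NNS", "VBZ", "JJR", "RBS", "PRP$", "WDT", "CD"]
print([coarse_category(t) for t in sample_tags])
# ['noun', 'verb', 'adjective', 'adverb', 'pronoun', 'wh-word', 'other']
```

Tags outside the table, such as CD or MD, fall through to 'other'; extend the table if those classes matter for your task.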

Stop Words

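Stop words are common words like 'the', 'is', 'are' that carry little meaning on their own and are often removed before analysis. NLTK provides a built-in list of stop words for English and several other languages; the entries are lowercase. You can also add custom stop words if needed.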

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
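
Customizing the list usually means adding domain-specific words and sometimes keeping words the default list would remove, for example 'not' in sentiment analysis. Here is a minimal sketch of one way to do that. build_stop_words is a hypothetical helper, and the short base list is a stand-in so the snippet runs without downloaded corpora.

```python
def build_stop_words(base_words, extra_words=(), keep_words=()):
    """Return a lowercase stop word set: base plus extras, minus words to keep."""
    stop_words = {w.lower() for w in base_words}
    stop_words.update(w.lower() for w in extra_words)
    stop_words.difference_update(w.lower() for w in keep_words)
    return stop_words

# Small stand-in list (an assumption for illustration only).
# With NLTK available, pass stopwords.words('english') as base_words instead.
base = ["the", "is", "are", "not"]
custom = build_stop_words(base, extra_words=["said", "also"], keep_words=["not"])
print(sorted(custom))   # ['also', 'are', 'is', 'said', 'the']
```

Storing the result as a set keeps membership tests fast when filtering long token lists.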

POS Tagging with Stop Words Removal Example:

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

txt = ("Sukanya, Rajib and Naba are my good friends. "
       "Sukanya is getting married next year. "
       "Marriage is a big step in one’s life. "
       "It is both exciting and frightening. "
       "But friendship is a sacred bond between people. "
       "It is a special kind of love between us. "
       "Many of you must have tried searching for a friend "
       "but never found the right one.")

# Split the text into sentences, then process each sentence separately.
tokenized = sent_tokenize(txt)

for sentence in tokenized:
    words = word_tokenize(sentence)
    # Drop stop words and ASCII punctuation. Matching is case-sensitive, so
    # capitalized tokens such as 'It' and 'But' are kept; compare w.lower()
    # against stop_words to remove them as well.
    words = [w for w in words if w not in stop_words and w not in punctuation]
    tagged = nltk.pos_tag(words)
    print(tagged)

Output (exact tags may vary with the NLTK version and tagger model)

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'JJ'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'NN'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), ('never', 'RB'), ('found', 'VBD'), ('right', 'JJ'), ('one', 'CD')]

Explanation:

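stop_words = set(stopwords.words('english')): Loads the English stop words into a set for fast membership tests.
tokenized = sent_tokenize(txt): Splits the text into sentences.
words = word_tokenize(sentence): Splits each sentence into word and punctuation tokens.
The list comprehension: Removes stop words from the token list before tagging.
tagged = nltk.pos_tag(words): Assigns a POS tag to each remaining token.

The tagger uses neighbouring tokens as context, so tagging a sentence after its stop words have been removed can give different tags than tagging the full sentence. Also note that the typographic apostrophe in "one’s" is split into its own token, which is not a stop word, so it appears in the output tagged as NN.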
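
An alternative ordering is to tag the complete sentence first and remove stop words from the tagged pairs afterwards, so the tagger still sees every token. The sketch below illustrates the idea under stated assumptions: tag_then_filter is a hypothetical helper, and toy_tagger is a stand-in lookup built from the earlier example output so the snippet runs without downloaded models.

```python
def tag_then_filter(tokens, stop_words, tagger):
    """Tag the complete token list, then drop stop words from the tagged pairs."""
    tagged = tagger(tokens)  # the tagger sees every token, including stop words
    return [(tok, tag) for tok, tag in tagged if tok.lower() not in stop_words]

# Stand-in tagger (hypothetical): looks tags up from the earlier example output.
KNOWN_TAGS = {'Everything': 'NN', 'is': 'VBZ', 'all': 'DT',
              'about': 'IN', 'money': 'NN', '.': '.'}

def toy_tagger(tokens):
    return [(tok, KNOWN_TAGS.get(tok, 'NN')) for tok in tokens]

tokens = ['Everything', 'is', 'all', 'about', 'money', '.']
small_stop_words = {'is', 'all', 'about'}   # illustrative subset, not NLTK's full list
result = tag_then_filter(tokens, small_stop_words, toy_tagger)
print(result)   # [('Everything', 'NN'), ('money', 'NN'), ('.', '.')]
```

With NLTK installed, passing nltk.pos_tag as the tagger argument and the stop_words set loaded from the corpus should give the same behaviour on real text. Which ordering is preferable depends on whether tag accuracy or a smaller token list matters more.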
