The Natural Language Toolkit (NLTK) is a popular Python library for text processing. Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a text and underpins NLP tasks such as text analysis, sentiment analysis and information extraction.
Installation
Open your terminal or command prompt and run the following command to install NLTK:
pip install nltk
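Import NLTK and download the data packages used in this article. Calling nltk.download() with no arguments opens an interactive downloader; naming the packages fetches only what is needed:
import nltk
for package in ('punkt', 'averaged_perceptron_tagger', 'stopwords'):
    nltk.download(package)
Recent NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead. If a LookupError names one of these, download it the same way.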
Quick Vocabulary
- Corpus: A body of text. Plural: corpora.
- Lexicon: Words and their meanings.
- Token: A single unit in a text, often a word or punctuation.
POS Tagging Basics
POS tagging assigns grammatical information to each token in a sentence. For example:
Input:
Everything is all about money.
Output:
[('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'), ('about', 'IN'), ('money', 'NN'), ('.', '.')]
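Here, 'NN' is a singular noun, 'VBZ' is a third-person singular present-tense verb, 'DT' is a determiner, and 'IN' is a preposition. The tags come from the Penn Treebank tagset, which nltk.pos_tag uses by default for English.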
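Because the tagger returns an ordinary list of (token, tag) tuples, the result can be filtered with plain Python. The following is a minimal sketch that copies the output above in as a literal, so it runs without NLTK. The helper name tokens_with_tag is hypothetical and not part of NLTK.

```python
# Tagged output from the example above, copied in as a literal.
tagged = [('Everything', 'NN'), ('is', 'VBZ'), ('all', 'DT'),
          ('about', 'IN'), ('money', 'NN'), ('.', '.')]


def tokens_with_tag(tagged_tokens, tag):
    """Return the tokens whose POS tag equals `tag`, in sentence order."""
    return [token for token, token_tag in tagged_tokens if token_tag == tag]


nouns = tokens_with_tag(tagged, 'NN')
print(nouns)  # ['Everything', 'money']
```

The same pattern works for any tag in the list, for example tokens_with_tag(tagged, 'VBZ') to pull out the verb.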
Common POS Tags
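Here are the most common Penn Treebank tags and what they mean:
- CC - Coordinating conjunction
- CD - Cardinal number
- DT - Determiner
- EX - Existential "there"
- FW - Foreign word
- IN - Preposition or subordinating conjunction
- JJ / JJR / JJS - Adjective / Comparative adjective / Superlative adjective
- NN / NNS / NNP / NNPS - Noun, singular / Noun, plural / Proper noun, singular / Proper noun, plural
- PRP / PRP$ - Personal pronoun / Possessive pronoun
- RB / RBR / RBS - Adverb / Comparative adverb / Superlative adverb
- VB / VBD / VBG / VBN / VBP / VBZ - Verb: base form / past tense / gerund or present participle / past participle / non-3rd-person singular present / 3rd-person singular present
- MD - Modal verb
- POS - Possessive ending ('s)
- RP - Particle
- TO - The word "to"
- UH - Interjection
- WDT / WP / WP$ / WRB - Wh-determiner / Wh-pronoun / Possessive wh-pronoun / Wh-adverb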
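Many of these tags share a prefix (NN*, VB*, JJ*, RB*), which makes it easy to collapse them into coarse categories when the fine distinctions are not needed. The sketch below is one way to do that. The prefix-to-category mapping and the function names are illustrative assumptions, not something NLTK defines, and the sample list is hand-written in the (token, tag) format.

```python
from collections import defaultdict

# Illustrative mapping from tag prefixes to coarse categories (an assumption
# made for this sketch; tags such as WRB or PRP fall through to 'other').
COARSE_PREFIXES = [('NN', 'noun'), ('VB', 'verb'),
                   ('JJ', 'adjective'), ('RB', 'adverb')]


def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category by its prefix."""
    for prefix, category in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return category
    return 'other'


def group_by_category(tagged_tokens):
    """Group tokens by coarse category, keeping first-seen order."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[coarse_category(tag)].append(token)
    return dict(groups)


# Hand-written sample in the (token, tag) format.
sample = [('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'),
          ('next', 'JJ'), ('year', 'NN')]
print(group_by_category(sample))
# {'noun': ['Sukanya', 'year'], 'verb': ['getting', 'married'], 'adjective': ['next']}
```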
Stop Words
Stop words are common words like 'the', 'is', 'are' that often do not carry significant meaning. NLTK provides a built-in list of stop words. You can also add custom stop words if needed.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
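Adding custom stop words amounts to extending the set. The sketch below shows one way to do it. The function names build_stop_words and remove_stop_words are hypothetical, and the short base list is a stand-in for stopwords.words('english') so that the example runs without NLTK data.

```python
def build_stop_words(base_words, extra_words=()):
    """Return a lowercase stop-word set: the base list plus custom additions."""
    return {w.lower() for w in base_words} | {w.lower() for w in extra_words}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Stand-in for stopwords.words('english'); in real use pass that list instead.
base = ['the', 'is', 'are', 'a', 'my']
stop_words = build_stop_words(base, extra_words=['Please'])

tokens = ['Please', 'read', 'the', 'manual']
print(remove_stop_words(tokens, stop_words))  # ['read', 'manual']
```

Lowercasing both the list and the tokens keeps the comparison case-insensitive, which matches the w.lower() check used in the example that follows.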
POS Tagging with Stop Word Removal: Example
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
txt = ("Sukanya, Rajib and Naba are my good friends. "
"Sukanya is getting married next year. "
"Marriage is a big step in one’s life. "
"It is both exciting and frightening. "
"But friendship is a sacred bond between people. "
"It is a special kind of love between us. "
"Many of you must have tried searching for a friend "
"but never found the right one.")
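# Split the text into sentences, then process each sentence separately
tokenized = sent_tokenize(txt)
for sentence in tokenized:
    # Tokenize the sentence into words and punctuation
    words = word_tokenize(sentence)
    # Drop stop words (case-insensitive comparison)
    words = [w for w in words if w.lower() not in stop_words]
    # Tag the remaining tokens and print one list per sentence
    tagged = nltk.pos_tag(words)
    print(tagged)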
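Output (punctuation tokens such as ',' and '.' are not stop words, so the code above also prints them; they are not shown here, and exact tags may vary with the NLTK version):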
[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]
[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]
[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]
[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]
[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'JJ'), ('bond', 'NN'), ('people', 'NNS')]
[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'NN'), ('us', 'PRP')]
[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'), ('never', 'RB'), ('found', 'VBD'), ('right', 'JJ'), ('one', 'CD')]
Explanation:
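- stop_words = set(stopwords.words('english')): Loads the English stop words into a set for fast lookup.
- tokenized = sent_tokenize(txt): Splits the text into sentences.
- words = word_tokenize(sentence): Splits each sentence into word and punctuation tokens.
- words = [w for w in words if w.lower() not in stop_words]: Removes stop words, ignoring case.
- tagged = nltk.pos_tag(words): Assigns a POS tag to each remaining token.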
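One caveat: nltk.pos_tag uses neighbouring words as context, so tagging a sentence after its stop words have been removed can change the tags it assigns. A common alternative is to tag the full sentence first and filter afterwards. The sketch below illustrates that order. The names tag_then_filter and toy_tagger are hypothetical, and toy_tagger is a stand-in with hand-written tags so that the example runs without NLTK; in real use, pass nltk.pos_tag as the tagger argument.

```python
def tag_then_filter(tokens, stop_words, tagger):
    """Tag the full token list first, then drop stop words and pure punctuation."""
    tagged = tagger(tokens)
    return [
        (token, tag)
        for token, tag in tagged
        if token.lower() not in stop_words
        and any(ch.isalnum() for ch in token)
    ]


# Hypothetical stand-in tagger with hand-written tags (not real NLTK output).
TOY_TAGS = {'It': 'PRP', 'is': 'VBZ', 'a': 'DT', 'special': 'JJ',
            'kind': 'NN', 'of': 'IN', 'love': 'NN', '.': '.'}


def toy_tagger(tokens):
    """Look each token up in TOY_TAGS, defaulting to 'NN'."""
    return [(token, TOY_TAGS.get(token, 'NN')) for token in tokens]


tokens = ['It', 'is', 'a', 'special', 'kind', 'of', 'love', '.']
stop_words = {'it', 'is', 'a', 'of'}
print(tag_then_filter(tokens, stop_words, toy_tagger))
# [('special', 'JJ'), ('kind', 'NN'), ('love', 'NN')]
```

With NLTK installed, the equivalent call would be tag_then_filter(word_tokenize(sentence), stop_words, nltk.pos_tag).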