Accessing Text Corpora and Lexical Resources using NLTK

Last Updated : 12 Jun, 2024

Accessing Text Corpora and Lexical Resources using NLTK provides efficient access to extensive text data and linguistic resources, empowering researchers and developers in natural language processing tasks. Natural Language Toolkit (NLTK) is a powerful Python library for natural language processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources, including WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and more.

This article will guide you through accessing text corpora and lexical resources using NLTK, illustrating with practical examples.

Accessing Text Corpora using NLTK

NLTK provides access to various text corpora, including books, news, chats, and more. Some popular corpora include:

  • Gutenberg: Contains text from classic literature.
  • Brown: The first million-word electronic corpus of English.
  • Reuters: A collection of news documents.
  • Inaugural: Presidential inaugural speeches.

NLTK is a comprehensive library that supports complex NLP tasks. It is ideal for academic and research purposes due to its extensive collection of linguistic data and tools.

Before proceeding with the implementation, make sure that you have installed NLTK and the necessary data.

pip install nltk

After installation, you need to download the corpus data. Downloading 'all' fetches every available dataset; you can instead download individual packages such as 'gutenberg' or 'wordnet' to save space:

import nltk
nltk.download('all')

Loading and Using Corpora

You can load and use these corpora easily. For example, to access the Gutenberg corpus:

Python
from nltk.corpus import gutenberg
print(gutenberg.fileids())

Output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

The output represents the list of file identifiers (fileids) available in the Gutenberg corpus of the NLTK library. Each file identifier corresponds to a text file containing a literary work included in the Gutenberg collection.

To read the text of a specific file:

Python
hamlet = gutenberg.words('shakespeare-hamlet.txt')
print(hamlet[:100])

Output:

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

Working with Lexical Resources using NLTK

NLTK includes several lexical resources, with WordNet being the most significant. WordNet is a large lexical database of English that groups words into sets of synonyms.

Using WordNet with NLTK

To use WordNet:

Python
from nltk.corpus import wordnet as wn


Find the synonyms of a word:

Python
synonyms = wn.synsets('book')
print(synonyms)

Output:

[Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11'), Synset('book.v.01'), Synset('reserve.v.04'), Synset('book.v.03'), Synset('book.v.04')]

The output represents a list of synsets (synonym sets) for the word "book" from the WordNet lexical database in NLTK. Each synset corresponds to a different meaning or sense of the word "book." The notation Synset('book.n.01') provides the following information:

  • book: The word for which the synset is defined.
  • n: Indicates the part of speech (in this case, "n" for noun).
  • 01: The sense number, distinguishing different meanings of the word.

To get definitions and examples:

Python
for syn in synonyms:
    print(syn.definition())
    print(syn.examples())

Output:

a written work or composition that has been published (printed on pages bound together)
['I am reading a good book on economics']
physical objects consisting of a number of pages bound together
['he used a large book as a doorstop']
a compilation of the known facts regarding something or someone
["Al Smith used to say, `Let's look at the record'", 'his name is in all the record books']
a written version of a play or other dramatic composition; used in preparing for a performance
[]
a record in which commercial accounts are recorded
['they got a subpoena to examine our books']
a collection of playing cards satisfying the rules of a card game
[]
a collection of rules or prescribed standards on the basis of which decisions are made
['they run things by the book around here']
the sacred writings of Islam revealed by God to the prophet Muhammad during his life at Mecca and Medina
[]
the sacred writings of the Christian religions
['he went to carry the Word to the heathen']
a major division of a long written composition
['the book of Isaiah']
a number of sheets (ticket or stamps etc.) bound together on one edge
['he bought a book of stamps']
engage for a performance
['Her agent had booked her for several concerts in Tokyo']

The output provides the definitions and example sentences for each sense (synset) of the word "book" as retrieved from WordNet, illustrating the various contexts in which the word can be used.

Practical Examples

Tokenizing Text

Tokenization is the process of breaking text into words or sentences.

  • word_tokenize: Splits the text into individual words.
  • sent_tokenize: Splits the text into sentences.
Python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello, world! This is a test."
print(word_tokenize(text))
print(sent_tokenize(text))

Output:

['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
['Hello, world!', 'This is a test.']

Finding Synonyms and Antonyms

Using WordNet, you can find synonyms and antonyms for words.

  • wn.synsets('good'): Retrieves all synsets (sets of synonyms) for the word "good".
  • lemma.name(): Extracts the lemma name from each synset.
  • lemma.antonyms(): Checks if the lemma has antonyms and adds the first antonym's name to the antonyms list.
Python
from nltk.corpus import wordnet as wn

synonyms = []
antonyms = []

for syn in wn.synsets('good'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print("Synonyms:", set(synonyms))
print("Antonyms:", set(antonyms))

Output:

Synonyms: {'salutary', 'just', 'in_effect', 'full', 'skillful', 'adept', 'safe', 'skilful', 'beneficial', 'near', 'unspoiled', 'trade_good', 'secure', 'estimable', 'respectable', 'effective', 'unspoilt', 'soundly', 'serious', 'well', 'commodity', 'good', 'practiced', 'dependable', 'in_force', 'right', 'sound', 'expert', 'honest', 'upright', 'thoroughly', 'honorable', 'ripe', 'goodness', 'proficient', 'dear', 'undecomposed'}
Antonyms: {'evilness', 'badness', 'bad', 'evil', 'ill'}

Frequency Distributions

Frequency Distribution is used to analyze the frequency of words in a text, helping to identify the most common words.

  • gutenberg.words('shakespeare-hamlet.txt'): Loads the words from the text of "Hamlet".
  • FreqDist(text): Creates a frequency distribution of the words in the text.
  • fdist.most_common(10): Returns the 10 most common words along with their frequencies.
Python
from nltk.corpus import gutenberg
from nltk.probability import FreqDist

text = gutenberg.words('shakespeare-hamlet.txt')
fdist = FreqDist(text)
print(fdist.most_common(10))

Output:

[(',', 2892), ('.', 1886), ('the', 860), ("'", 729), ('and', 606), ('of', 576), ('to', 576), (':', 565), ('I', 553), ('you', 479)]

Conclusion

NLTK is a versatile tool for NLP, offering access to a wealth of corpora and lexical resources. Whether you're performing text analysis, developing NLP applications, or conducting research, NLTK provides the functionalities needed to preprocess and analyze textual data effectively. By following the examples provided, you can leverage NLTK's capabilities to enhance your NLP projects.

