Assignment 1 IR

This document provides an overview of code to create an inverted index for a text document collection. The code uses NLTK and other libraries for natural language processing tasks like tokenization and stemming. It initializes stopwords and a Porter stemmer, then iterates over files, tokenizing sentences and words. Nouns and verbs are extracted and stemmed, and an index is built with stemmed words mapping to files. A search function stems queries and looks up words in the index to return matching files. The main function interacts with a user to search the index.

Uploaded by

Pac SaQii

Information Retrieval

Assignment 1

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore, Pakistan

Inverted Index for Text Files
Overview
This document provides an overview and explanation of the code used to create an inverted
index for a collection of text files.

Libraries Used
The following libraries are used in this code:

● os: Provides functions for interacting with the operating system; used here for file
operations and directory traversal.
● NLTK: The Natural Language Toolkit is used for natural language processing tasks
such as tokenization, stemming, and part-of-speech tagging.
● string: Provides string constants, including the set of punctuation characters.
● nltk.corpus.stopwords: Provides a list of common English stopwords.
● nltk.stem.PorterStemmer: Implements the Porter stemming algorithm for word
stemming.
● nltk.tokenize.word_tokenize: Tokenizes sentences into words.
● nltk.tokenize.sent_tokenize: Tokenizes text into sentences.

Code Flow
The code is structured as follows:

Import Libraries

Import the required libraries at the beginning of the code.

Initialize Variables

● Initialize a set of English stopwords using nltk.corpus.stopwords.
● Initialize a Porter stemmer using nltk.stem.PorterStemmer.
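
A minimal sketch of this initialization step is given here. The function name load_resources is hypothetical, and fetching the stopword corpus with nltk.download is an assumption about how the data is obtained; the report only states that a stopword set and a Porter stemmer are created.

```python
def load_resources():
    """Return (stop_words, stemmer) for use by the indexing and search code."""
    # Assumption: imports are kept local so this sketch can sit in a module
    # that is loaded before NLTK is installed.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # The stopword list ships as an NLTK data package; fetch it if missing.
    nltk.download("stopwords", quiet=True)

    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
    stemmer = PorterStemmer()
    return stop_words, stemmer
```

Storing the stopwords in a set rather than a list keeps the "is this word a stopword" check cheap when it runs once per token.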

create_index Function

This function takes a directory path as input and returns an inverted index.

1. It iterates over the text files in the specified directory.
2. For each file, it reads the content, tokenizes it into sentences, and then tokenizes
each sentence into words.
3. It tags each word with its part of speech and checks whether the word is a noun or a verb.
4. If the word is a noun or verb and is not in the stopword list, it is stemmed using the
Porter stemmer.
5. The filename is recorded in the inverted index under the stemmed word, which serves
as the key.
6. The function handles UnicodeDecodeError exceptions for files that cannot be
decoded.
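
The steps above could be sketched roughly as follows. This is not the assignment's actual code: the ".txt" filter, the UTF-8 encoding, the use of a set per term, and the choice to skip undecodable files are all assumptions, and the analyze parameter and the name nltk_analyze are hypothetical. nltk_analyze covers steps 2 to 4 with NLTK and needs the NLTK tokenizer and tagger data to be downloaded first; create_index covers steps 1, 5, and 6 and accepts any function that turns text into terms.

```python
from collections import defaultdict
from pathlib import Path


def nltk_analyze(text, stop_words, stemmer):
    """Yield stemmed nouns and verbs from text (steps 2 to 4)."""
    # Imported here so create_index below can be tried without NLTK installed.
    from nltk import pos_tag
    from nltk.tokenize import sent_tokenize, word_tokenize

    for sentence in sent_tokenize(text):
        for word, tag in pos_tag(word_tokenize(sentence)):
            # Penn Treebank tags: NN* marks nouns, VB* marks verbs.
            if tag.startswith(("NN", "VB")) and word.lower() not in stop_words:
                yield stemmer.stem(word.lower())


def create_index(directory, analyze):
    """Map each term returned by analyze(text) to the set of filenames containing it."""
    index = defaultdict(set)
    for path in sorted(Path(directory).glob("*.txt")):  # assumption: .txt files only
        try:
            text = path.read_text(encoding="utf-8")  # assumption: UTF-8 input
        except UnicodeDecodeError:
            # Step 6: report and skip files that cannot be decoded.
            print(f"Skipping {path.name}: cannot decode as UTF-8")
            continue
        for term in analyze(text):
            index[term].add(path.name)
    return dict(index)
```

With NLTK available, the two pieces would be combined as create_index(folder, lambda text: nltk_analyze(text, stop_words, stemmer)), where folder, stop_words, and stemmer are supplied by the caller.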

search Function

This function takes a user's search query as input.

1. It tokenizes and stems the query.
2. For each stemmed query word, it retrieves the filenames associated with it from the
inverted index.
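
One way the lookup might look is sketched here, assuming the index is a dictionary from stemmed word to a collection of filenames. The tokenize and stem parameters are hypothetical; their defaults are simple stand-ins (whitespace splitting and lowercasing) so that the sketch runs without NLTK. With NLTK, word_tokenize and PorterStemmer().stem would be passed instead.

```python
def search(query, index, tokenize=str.split, stem=str.lower):
    """Return {stemmed_query_word: sorted list of filenames} for each word in query."""
    results = {}
    for word in tokenize(query):
        term = stem(word)
        # Words that are absent from the index map to an empty list.
        results[term] = sorted(index.get(term, ()))
    return results
```

Returning one entry per query word, rather than a single merged list, matches the per-word lookup described in step 2.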

User Interaction

In the main function, the program interacts with the user.

● The user can input a search query, and the code returns the filenames in which each
query word appears.

Execution

The main function is executed when the script is run.
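
The interaction could be sketched as a simple prompt loop. The name run_search_loop, the prompt text, the output format, and the convention that a blank line ends the session are all assumptions; the report only says that the user enters a query and receives filenames per query word. The search argument is any function that maps a query string to a dictionary of query word to filenames.

```python
def run_search_loop(search, read=input, write=print):
    """Prompt for queries until a blank line; report the files for each query word."""
    while True:
        query = read("Search query (blank to quit): ").strip()
        if not query:
            break
        for word, files in search(query).items():
            write(f"{word}: {', '.join(files) if files else 'no matching files'}")


# In a script, the loop would typically be started from a guard such as:
# if __name__ == "__main__":
#     run_search_loop(my_search)   # my_search is a hypothetical query function
```

Passing read and write as parameters keeps the loop easy to exercise without a terminal.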

Block Diagram
A block diagram is a visual representation of the code's structure and key
components.

Data Flow Diagram (DFD)
A DFD illustrates how data moves through the code.
