UNIT_2-PART_1
Indexing :
Indexing in information retrieval systems is like creating a quick-reference guide (or) a map for a
huge collection of information such as a library of books (or) a database of documents
(or)
It is the process of building a map of words and documents to make search faster and more
accurate
Ex :
Books example :
If you have 100 books and want to find the ones about “ dinosaurs ”, you would look at an index
like the one in the back of a book
The index might list
“ Dinosaurs ” – Book 1 : Page 3, Page 45
“ Dinosaurs ” – Book 5 : Page 12, Page 89
Digital example :
In a search engine, when you type “ dinosaurs ”, the system uses its index to find all the web
pages that mention “ dinosaurs ” and shows you a list of results in seconds
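The lookup described above can be sketched in Python as a map from words to the pages that contain them. This is a minimal illustration, not how any particular search engine works; the page texts and the names `pages`, `build_index` and `search` are made up for the example.

```python
from collections import defaultdict

# Hypothetical collection: page id -> text
pages = {
    "page1": "Dinosaurs lived millions of years ago",
    "page2": "Birds are living relatives of dinosaurs",
    "page3": "Gardening tips for beginners",
}

def build_index(docs):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Look the word up in the index instead of scanning every page."""
    return sorted(index.get(word.lower(), set()))

index = build_index(pages)
print(search(index, "dinosaurs"))  # ['page1', 'page2']
```

The index is built once; after that each search is a single dictionary lookup, which is why results come back quickly even for a large collection.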
History and objectives of indexing :
History :
Historical background :
Old methods :
In the late 1800s, a more structured way to index books, the Dewey Decimal system, was
introduced
This system organized books into categories and subcategories making it easier to find them
MARC system :
In the 1960s, computers began to assist with cataloguing through the MARC ( Machine
Readable Cataloguing ) system, which standardized how information about books was stored
electronically
This made it easier to share catalogs between libraries
DIALOG system :
In 1965, a system called DIALOG was developed for NASA; it later became a commercial
indexing system, allowing users to search technical publications
Changes with technology :
Manual to digital :
Originally, indexing was done manually with indexers choosing keywords to represent the
content of books
With the advent of computers, this process became digital, allowing librarians to share and
manage indexes more efficiently
Full-text search :
By the 1990s, with the decrease in computer costs and the availability of full-text documents (like digital
books and articles), the role of human indexers changed
Instead of relying only on manually chosen keywords, users could search the full text of documents
directly
This means users can now search for any word (or) phrase within the entire content, not just in
the index, making searches more accurate
Ex :
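A small Python sketch of full-text phrase search, assuming a positional index (each word remembers where it occurs in each document). The two documents and the function names are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical full-text documents
documents = {
    "doc1": "the history of the dewey decimal system",
    "doc2": "a decimal number system for beginners",
}

def build_positional_index(docs):
    """Map word -> {doc id -> list of positions where the word occurs}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after the first word
            if all(start + i in index[w].get(doc_id, []) for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

positional_index = build_positional_index(documents)
print(phrase_search(positional_index, "decimal system"))  # ['doc1']
```

Both documents contain “ decimal ” and “ system ”, but only doc1 contains them next to each other, so only doc1 matches the phrase.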
Objectives :
The objectives of indexing have evolved with advancements in information retrieval systems
Traditionally, manual indexing involved selecting a few key terms (index terms) from a
controlled vocabulary that represented the main topics of a document
This helped standardize searches and make finding information easier
However, with modern information retrieval systems (IRS), entire documents are searchable, meaning every word in a
document can potentially be an index term
This is known as “ total document indexing ”
In this system every word in a document is considered when searching for information, making
it easier to find specific content within documents
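The difference between the two approaches can be sketched as follows. The controlled vocabulary and the sample sentence are assumptions made for illustration only.

```python
# Hypothetical controlled vocabulary chosen by human indexers
CONTROLLED_VOCABULARY = {"exercise", "health", "nutrition"}

def manual_style_terms(text):
    """Keep only the words that appear in the controlled vocabulary."""
    return {w for w in text.lower().split() if w in CONTROLLED_VOCABULARY}

def total_document_terms(text):
    """Total document indexing: every distinct word becomes an index term."""
    return set(text.lower().split())

text = "Regular exercise improves sleep and mental health"
print(sorted(manual_style_terms(text)))    # ['exercise', 'health']
print(sorted(total_document_terms(text)))
```

A search for “ sleep ” would miss this document under the controlled vocabulary but find it under total document indexing.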
Indexing process :
Document file :
It contains the full content of every document, so every topic in the collection can be found in it
Public index file :
It contains important keywords (or) concepts that are relevant to the general public
These keywords are selected to help everyone find the most important topics in the document
It is more focused than the document file because it only includes important terms
Private index file :
It contains index terms chosen for one user's personal interests
Note :
Document file contains everything, so it overlaps with both the public and private index files
Public index file overlaps the document file because it includes some of the key topics but it is
smaller and more focused
Private index file is the most specific and only overlaps with the document file in areas that
match your personal interests
Ex :
Suppose the library has every book on cooking, gardening, science and more
This collection is like the document file, containing all possible information
The library creates a general guide for visitors, highlighting popular sections like
“ cooking books ”, “ gardening books ” and “ science books ”
This guide is like the public index file, helping most people find the major topics they are
interested in
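One way to picture the three files in Python, as a rough sketch: the document file holds everything, and each index file is built from a chosen list of terms. The book texts, term lists and the helper `build_index_file` are hypothetical.

```python
# Hypothetical document file: every item the library holds
document_file = {
    "b1": "cooking italian pasta at home",
    "b2": "gardening roses in small spaces",
    "b3": "science of cooking and baking bread",
}

def build_index_file(docs, terms):
    """Index only the given terms: term -> sorted ids of documents containing it."""
    index = {}
    for term in terms:
        matches = sorted(d for d, text in docs.items() if term in text.split())
        if matches:
            index[term] = matches
    return index

# Public index: broad topics for all visitors (assumed term list)
public_index = build_index_file(document_file, ["cooking", "gardening", "science"])
# Private index: one user's own interests (assumed term list)
private_index = build_index_file(document_file, ["bread", "roses"])

print(public_index)   # {'cooking': ['b1', 'b3'], 'gardening': ['b2'], 'science': ['b3']}
print(private_index)  # {'bread': ['b3'], 'roses': ['b2']}
```

Both index files point back into the same document file; they differ only in which terms were chosen.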
Scope of indexing :
When creating an index for documents (or) articles, the process involves deciding which key
terms (words or concepts) best represent the content of the text
This can be tricky, especially when done manually, because both the person writing the
document (author) and the person creating the index (indexer) might use different words to
describe the same ideas
The indexer may also not fully understand some of the specialized topics in the document,
leading to differences in how well the document is indexed
Exhaustivity :
How many different topics (or) ideas from a document are included in the index
Specificity :
How precise (or) detailed the index terms are
Ex :
In a book about animals, using “ Animals ” as an index term is low specificity
But if you use more specific terms like “ Elephants ”, “ Sparrows ” and “ Lizards ” then you are
using high specificity because the terms are more detailed
Note :
Index term weighting means assigning a weight to each index term based on how important it is in the document
Ex :
If a document heavily discusses “ Pentium processors ”, the indexer might give the term
“ Pentium ” a higher weight
This helps search systems prioritize documents that are more relevant to that concept
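A minimal sketch of how stored weights could be used to order results. The weights and document ids below are invented, and the 0 to 1 scale is an assumption.

```python
# Hypothetical weighted index: term -> {doc id -> weight between 0 and 1}
weighted_index = {
    "pentium": {"doc1": 0.9, "doc2": 0.2},
    "processor": {"doc1": 0.6, "doc3": 0.7},
}

def rank(index, term):
    """Return (doc id, weight) pairs for the term, highest weight first."""
    postings = index.get(term.lower(), {})
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)

print(rank(weighted_index, "Pentium"))  # [('doc1', 0.9), ('doc2', 0.2)]
```

doc1 is listed first because “ Pentium ” carries more weight there than in doc2.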
Note :
In indexing, a decision needs to be made about whether (or) not to create linkages between
related terms that describe the same concept
This is important to ensure that terms connected to the same idea are grouped together
Pre-coordination :
When you create the index, you decide how related terms are linked together
(Or)
This is when related index terms are linked together at the time of indexing (before searching)
Ex :
Suppose you create an index entry that links both the animal and the habitat (natural
environment) together
Suppose your index entry is like “ Elephant – Africa ”
Here “ Elephant ” and “ Africa ” are linked together in the index at the time of indexing
If someone searches for “ Elephant Africa ”, the system will find entries where both terms are
linked and return books specifically about elephants in Africa
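As a rough Python sketch, linking terms at indexing time can be modeled by storing the linked terms together as one key. The book titles and the function name `search_linked` are hypothetical.

```python
# Hypothetical index where terms were linked when the index was built
linked_index = {
    ("elephant", "africa"): ["Elephants of the African Savanna"],
    ("elephant", "asia"): ["Asian Elephants"],
}

def search_linked(index, *terms):
    """Look up an entry whose terms were linked together at indexing time."""
    key = tuple(t.lower() for t in terms)
    return index.get(key, [])

print(search_linked(linked_index, "Elephant", "Africa"))
# ['Elephants of the African Savanna']
```

In this sketch the key is an ordered tuple, so (“ Africa ”, “ Elephant ”) would not match; a real system would decide separately whether order matters.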
Post-coordination :
The system only links terms together when you search, usually by combining terms with “ AND ”
(Or)
This happens when terms are combined at the search stage, not during indexing
Ex :
If you search for “climate change” AND “carbon emissions” the system looks for documents that
have both terms
They are not linked at the indexing stage but only when the user makes the search
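Combining terms at search time can be sketched as a set intersection over separately stored terms. The document ids are made up, and `search_and` is an illustrative name.

```python
# Hypothetical index of single, unlinked terms: term -> set of doc ids
term_index = {
    "climate change": {"doc1", "doc2", "doc4"},
    "carbon emissions": {"doc2", "doc3", "doc4"},
}

def search_and(index, *terms):
    """Combine independent postings at search time with a set intersection."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

print(search_and(term_index, "climate change", "carbon emissions"))  # ['doc2', 'doc4']
```

Nothing in the index connects the two terms; the connection exists only for the duration of the query.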
Number of linked terms :
Ex :
You might link “ CITGO ”, “ Oil drilling ” and “ Mexico ” all together to specify where the drilling is
happening
Ordering constraints :
Ex :
If you have “ CITGO ” first and “ oil drilling ” second, it might suggest a different meaning than if
they were reversed
Additional descriptors :
Ex :
You might add descriptors like “ source ”, “ problem ” (or) “ affected ” to clarify the roles of each
term in the context
Note :
These are extra pieces of information added to the index terms to provide more context (or)
detail
Suppose the index term is “ Gardening – Beginners ” and the additional descriptor is “ How-To-Guide ”
Suppose a user searches for “ Gardening – Beginners – How-To-Guide ”
The system will find and display a document such as “ Gardening Basics ” because it matches the combined term and
additional descriptor
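A small sketch of an index entry whose linked terms carry role descriptors. Which role each term receives here is an assumption made for the example, as are the document id and the function name.

```python
# Hypothetical index entry: each linked term carries a role descriptor
entry = {
    "document": "doc7",
    "terms": [
        {"term": "CITGO", "role": "source"},
        {"term": "oil drilling", "role": "problem"},
        {"term": "Mexico", "role": "affected"},
    ],
}

def terms_with_role(index_entry, role):
    """Return the terms in an entry that play the given role."""
    return [t["term"] for t in index_entry["terms"] if t["role"] == role]

print(terms_with_role(entry, "affected"))  # ['Mexico']
```

With the descriptor stored next to each term, a search can ask for documents where “ Mexico ” is the affected party rather than just mentioned.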
Positional roles :
Ex :
For “ CITGO introduces oil refineries in Peru, Bolivia and Argentina ” you would list
Entry 1 : “ CITGO introduces oil refineries in Peru ”
Entry 2 : “ CITGO introduces oil refineries in Bolivia ”
Entry 3 : “ CITGO introduces oil refineries in Argentina ”
Each entry has a fixed role based on its position
Modifiers :
Ex :
Instead of separate entries, you could have one entry like “CITGO introduces oil refineries in
Peru, Bolivia, Argentina” with modifiers indicating the affected countries
Note :
Suppose the index term is “ Smartphone – Latest Model ” and the modifier is
“ Advanced Features ”
Suppose a user searches for “ Smartphone – Latest Model ” with the modifier
“ Advanced Features ”
The system finds and displays products that match both the main term and the modifier
For example, “ Smartphone – Latest Model with Advanced Features ”
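The two ways of recording the CITGO example can be compared in a short sketch. The dictionary layout used for the modifier form is an assumption, not a standard format.

```python
subject = "CITGO introduces oil refineries"
countries = ["Peru", "Bolivia", "Argentina"]

def positional_entries(statement, places):
    """Positional roles: one full entry per place, with the place in a fixed position."""
    return [f"{statement} in {place}" for place in places]

def modifier_entry(statement, places):
    """Modifiers: a single entry with the places attached as modifiers."""
    return {"entry": statement, "affected_countries": list(places)}

for line in positional_entries(subject, countries):
    print(line)
print(modifier_entry(subject, countries))
```

Positional roles repeat the statement three times; the modifier form stores it once and lists the three countries beside it.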
Automatic indexing :
Automatic indexing is when a computer system automatically selects the keywords (or) index terms that best
represent the main ideas of a document, without human intervention
Automatic indexing helps to organize and retrieve documents quickly by choosing keywords
automatically
It can do this by treating all words equally (unweighted) (or) by giving more importance to
certain words (weighted)
This makes searching for documents more efficient, especially in large databases
Ex :
Suppose you have a document about “ The benefits of exercise for mental health ”
An automatic indexing system would scan the document and might choose keywords like
“ Exercise ”
“ Mental Health ”
“ Benefits ”
These keywords help you find the document later when you search for related topics
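One simple way a program could pick such keywords is to drop common words and keep the most frequent remaining ones. This is only a sketch of that one approach; the stop-word list and the sample text are assumptions.

```python
import re
from collections import Counter

# Small illustrative stop list; real systems use much longer ones
STOP_WORDS = {"the", "of", "for", "and", "a", "an", "is", "to", "in"}

def extract_keywords(text, top_n=3):
    """Pick the most frequent non-stop words as index terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

text = (
    "The benefits of exercise for mental health. "
    "Exercise improves mental health and exercise reduces stress."
)
print(extract_keywords(text))
```

Here “ exercise ” appears three times and “ mental ” and “ health ” twice each, so those three words are selected.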
Weighted indexing :
The system gives more importance (weight) to keywords that appear more frequently (or) are
more central to the document’s main ideas
Ex :
In a document where “ Mental Health ” is mentioned many times and “ Exercise ” only a few times,
“ Mental Health ” would be given a higher weight, meaning it is considered more important in
representing the document’s content
Unweighted indexing :
All keywords are treated equally, without trying to figure out which ones are more important
Ex :
If a document mentions “ Exercise ” once and “ Mental Health ” ten times, both keywords would
still be treated the same
Note :
Automated indexing – The system scans the document and automatically identifies key terms to
include in the index
Weighted indexing – The system assigns a weight to each term based on its frequency (or)
importance in the document
Unweighted indexing – The system includes all extracted terms in the index without considering their importance (or)
frequency
Ex :
The automated indexing system scans the document and extracts terms like “healthy”, “eating”,
“nutrition” and “diet”
Weighted indexing – The term “eating” appears frequently in the document, while “diet” is
mentioned less often. The weighted index might look like Eating (weight 0.8), Healthy (weight 0.5),
Nutrition (weight 0.4), Diet (weight 0.3)
The weights help indicate the relative importance of each term, making searches more precise
Unweighted indexing – In the index, you might see Healthy, Eating, Nutrition, Diet
All terms are treated equally, so there is no differentiation between how frequently (or)
significantly each term appears in the document
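The two kinds of index can be sketched side by side. The weighting rule used here (a term's count divided by the count of the most frequent term) is an assumption chosen for simplicity, so the numbers differ from the illustrative weights above; the term counts are also made up.

```python
from collections import Counter

def unweighted_index(terms):
    """Every extracted term is listed once, with no weight attached."""
    return sorted(set(terms))

def weighted_index_terms(terms):
    """Weight = term count divided by the count of the most frequent term."""
    counts = Counter(terms)
    if not counts:
        return {}
    top = max(counts.values())
    return {term: round(count / top, 2) for term, count in counts.items()}

# Hypothetical extraction result: "eating" found 5 times, "diet" once
extracted = ["eating"] * 5 + ["healthy"] * 3 + ["nutrition"] * 2 + ["diet"]

print(unweighted_index(extracted))      # ['diet', 'eating', 'healthy', 'nutrition']
print(weighted_index_terms(extracted))  # {'eating': 1.0, 'healthy': 0.6, 'nutrition': 0.4, 'diet': 0.2}
```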
After the initial setup, automatic indexing is cheaper and faster than using humans to do the
work
Ex :
It might take a human several minutes to read and choose keywords for a document but a
computer can do it in seconds
Consistency :
The computer follows the same rules every time, so the keywords it picks are consistent
Ex :
If two different human indexers read the same document, they might choose slightly different
keywords
But an automatic system will pick the same keywords every time
Information extraction :
Information extraction is the process of automatically identifying and pulling out important
details (or) facts from a large amount of text and organizing them in a structured format like
a table (or) database
Information extraction uses two key processes: fact extraction and summarization
Fact extraction and summarization are used for extracting important information from a text,
but they do it in different ways
Fact extraction :
Fact extraction is like picking specific details (or) facts from a document to update a database
The goal is to find important information based on set criteria, without trying to understand
the entire document
The system looks for specific types of information like names, dates, places and so on
Ex :
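A minimal sketch of fact extraction in Python, assuming the criteria are “ dates written as YYYY-MM-DD ” and “ place names from a known list ”. The pattern, the place list and the sample sentence are all made up for illustration.

```python
import re

# Assumed criteria: ISO-style dates and a tiny hypothetical list of places
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
KNOWN_PLACES = {"Peru", "Bolivia", "Argentina", "Mexico"}

def extract_facts(text):
    """Pull dates and known place names out of text into a structured record."""
    dates = DATE_PATTERN.findall(text)
    words = re.findall(r"[A-Za-z]+", text)
    places = [w for w in words if w in KNOWN_PLACES]
    return {"dates": dates, "places": places}

report = "On 2024-03-15 the company opened refineries in Peru and Bolivia."
print(extract_facts(report))  # {'dates': ['2024-03-15'], 'places': ['Peru', 'Bolivia']}
```

The function never tries to understand the sentence; it only picks out the pieces that match the criteria, and the resulting record could be stored as a database row.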
Summarization :
Summarization involves taking a long document and creating a shorter version (summary) that
captures the main ideas
This is more complex because it needs to consider all the major concepts, not just specific facts
Ex :
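One simple approach is extractive summarization: score each sentence by how frequent its words are in the whole document and keep the top-scoring sentences. The sketch below assumes that approach; it is only one of many possible methods, and the stop-word list and sample text are invented.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on"}

def content_words(text):
    """Lowercase words with common stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    best = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Keep the chosen sentences in their original order
    return " ".join(s for s in sentences if s in best)

text = (
    "Exercise improves mental health. "
    "The weather was cold on Monday. "
    "Regular exercise also supports mental health and sleep."
)
print(summarize(text))  # Exercise improves mental health.
```

The sentence about the weather shares no frequent words with the rest of the text, so it scores lowest and is left out of the summary.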
Note :
In this context pulling out means extracting (or) selecting specific pieces of information from a
larger text
Summarization – Highlighting