UNIT_2-PART_1

Uploaded by

teamkiller334

UNIT II CATALOGING AND INDEXING (PART I)

Subtopics :

 History and objectives of indexing
 History
 Objectives
 Indexing process
 Scope of indexing
 Pre-coordination and linkages
 Automatic indexing
 Information extraction

Cataloguing and indexing :

Indexing :

 Indexing in information retrieval systems is like creating a quick-reference guide (or) a map for a
huge collection of information such as a library of books (or) a database of documents
(or)
 It is the process of building a map from words to the documents that contain them, to make search faster and more accurate

 Indexing makes searching fast and efficient

Ex :

Books example :

 If you have 100 books and want to find the ones about “dinosaurs”, you would look at an index for the collection instead of reading every book
 The index might list
 “Dinosaurs” – Book 1 : Page 3, Page 45
 “Dinosaurs” – Book 5 : Page 12, Page 89

Digital example :

 In a search engine, when you type “ dinosaurs ”, the system uses its index to find all the web
pages that mention “ dinosaurs ” and shows you a list of results in seconds
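
The lookup described above can be sketched as a small inverted index. This is only an illustration: the function name `build_index` and the three sample pages are made up for the example, and a real search engine does far more (ranking, stemming, and so on).

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Strip simple punctuation so "dinosaurs." matches "dinosaurs".
            index[word.strip('.,;:!?"')].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical mini-collection standing in for web pages or books.
documents = {
    "page1": "Dinosaurs ruled the Earth for millions of years.",
    "page2": "Birds are living relatives of dinosaurs.",
    "page3": "Gardening tips for beginners.",
}

index = build_index(documents)
print(index["dinosaurs"])  # ['page1', 'page2']
```

The search only touches the entry for “dinosaurs” instead of scanning every page, which is why an index makes searching fast.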
History and objectives of indexing :

History :

 Indexing is also called cataloguing
 It is one of the oldest methods used to help people find information
 The goal of indexing is to create access points, such as keywords (or) topics, that help users find what they are looking for in a collection of items such as books (or) documents

Historical background :

Old methods :

 For centuries, indexing was done manually


 Librarians (or) professional indexers would create cards with details about each book such as
title, author and subject
 These cards were then organized in card catalogs to help users find books on specific topics

Dewey decimal system :

 In the late 1800s, a more structured way to index books, the Dewey decimal system, was introduced
 This system organized books into categories and subcategories making it easier to find them

MARC system :

 In the 1960s, computers began to assist with cataloguing through the MARC ( Machine
Readable Cataloguing ) system, which standardized how information about books was stored
electronically
 This made it easier to share catalogs between libraries

DIALOG system :

 In 1965, a system called DIALOG was developed for NASA; it later became a commercial indexing system that allowed users to search technical publications
Changes with technology :

Manual to digital :

 Originally, indexing was done manually with indexers choosing keywords to represent the
content of books
 With the advent of computers, this process became digital, allowing librarians to share and
manage indexes more efficiently

Full-text search :

 By the 1990s, with the decrease in computer costs and the availability of full-text documents (like digital books and articles), the role of human indexers changed
 Instead of relying only on manually chosen keywords, users could search the full text of documents directly
 This means users can now search for any word (or) phrase within the entire content, not just in the index, so searches can find material that the index terms miss

Ex :

 Consider a library where books are stored on shelves
 In the past, if you wanted to find a book on “dinosaurs”, you would go to the card catalog, look under “D” for “dinosaurs”, and that card would tell you which shelf to find the book on
 Librarians would have written down the subject “dinosaurs” on the card
 With modern technology, instead of just looking at a card with the word “dinosaurs”, you can type “dinosaurs” into a computer and it will search the full text of every book to find any that mention dinosaurs, even if “dinosaurs” was not listed as a subject by the librarian

Objectives :

 The objectives of indexing have evolved with advancements in information retrieval systems
 Traditionally, manual indexing involved selecting a few key terms (index terms) from a
controlled vocabulary that represented the main topics of a document
 This helped standardize searches and make finding information easier
 However, with modern information retrieval systems (IRS), entire documents are searchable, meaning every word in a document can potentially be an index term
 This is known as “ total document indexing ”
 In this system every word in a document is considered when searching for information, making
it easier to find specific content within documents
Indexing process :

 There are different types of index files in an information retrieval system
 The different index files are shown in the figure below

Document file :

 It includes all the words and content within a document
 Every word in the document is part of the total document file
 This is the broadest and most detailed level of indexing

Public index file :

 It contains important keywords (or) concepts that are relevant to the general public
 These keywords are selected to help everyone find the most important topics in the document
 It is more focused than the document file because it only includes important terms

Private index file :

 It is even more focused and specific


 It includes only those concepts and keywords that are important to a particular user (or) small group of users
 It’s like a personalized subset of the public index file that reflects individual needs (or) interests

How they relate :

 Document file contains everything, so it overlaps with both the public and private index files
 Public index file overlaps the document file because it includes some of the key topics but it is
smaller and more focused
 Private index file is the most specific and only overlaps with the document file in areas that
match your personal interests
Ex :

Document file (Entire library collection) :

 Suppose the library has every book on cooking, gardening, science and more
 This collection is like the document file, containing all possible information

Public index file (General guide for all users ) :

 The library creates a general guide for visitors, highlighting popular sections like “cooking books”, “gardening books” and “science books”
 This guide is like the public index file, helping most people find the major topics they are
interested in

Private index file (Personalized guide for specific needs) :

 Now, let’s say you love Italian cooking
 You create your own list that focuses only on “Italian cooking books”
 Only you can access this private index; other users cannot access it
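
The three index files in the library example can be sketched with plain dictionaries and sets. All titles, the user names `alice` and `bob`, and the `lookup` helper are hypothetical and only illustrate the idea that a private index is visible to its owner while the public index is shared.

```python
# Hypothetical library: every title is part of the document file.
document_file = {
    "Italian Cooking Basics", "French Pastry", "Rose Gardening",
    "Intro to Physics", "Pasta at Home",
}

# Public index: broad headings chosen for all visitors.
public_index = {
    "cooking books": {"Italian Cooking Basics", "French Pastry", "Pasta at Home"},
    "gardening books": {"Rose Gardening"},
    "science books": {"Intro to Physics"},
}

# Private index: one user's own heading, visible only to that user.
private_indexes = {
    "alice": {"Italian cooking books": {"Italian Cooking Basics", "Pasta at Home"}},
}

def lookup(heading, user=None):
    """Search the user's private index first, then fall back to the public index."""
    private = private_indexes.get(user, {})
    if heading in private:
        return sorted(private[heading])
    return sorted(public_index.get(heading, set()))

print(lookup("Italian cooking books", user="alice"))  # ['Italian Cooking Basics', 'Pasta at Home']
print(lookup("Italian cooking books", user="bob"))    # []
print(lookup("gardening books", user="bob"))          # ['Rose Gardening']
```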

Scope of indexing :

 When creating an index for documents (or) articles, the process involves deciding which key
terms( words or concepts ) best represent the content of the text
 This can be tricky, especially when done manually, because the person writing the document (author) and the person creating the index (indexer) might use different words to describe the same ideas
 The indexer may also not fully understand some of the specialized topics in the document,
leading to differences in how well the document is indexed

Key concepts in manual indexing :

Exhaustivity :

 How many different topics (or) ideas from a document are included in the index

Ex :

 Consider a book about animals
 If you index only “Mammals” and “Birds”, you have low exhaustivity
 But if you include “Mammals”, “Birds”, “Fish” and “Insects”, then you have high exhaustivity because you are covering more topics from the book
Specificity :

 How detailed (or) precise the index terms are

Ex :

 In the same book about animals, using “ Animals ” as an index term is low specificity
 But if you use more specific terms like “ Elephants ”, “ Sparrows ” and “ Lizards ” then you are
using high specificity because the terms are more detailed

Note :

 Exhaustivity – How many topics you cover


 Specificity – How detailed the terms you use are
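
One rough way to put numbers on these two ideas is sketched below. The measures are assumptions made for illustration, not standard definitions: exhaustivity is taken as the fraction of the book's topics that appear in the index, and specificity as the average depth of the chosen terms in a small, made-up term hierarchy.

```python
# Hypothetical topics actually discussed in the animal book.
topics_in_book = {"Mammals", "Birds", "Fish", "Insects"}

def exhaustivity(index_terms):
    """Fraction of the book's topics that the index covers (0.0 to 1.0)."""
    return len(topics_in_book & set(index_terms)) / len(topics_in_book)

# Hypothetical hierarchy: depth 0 is the broadest term, larger depth is more specific.
term_depth = {"Animals": 0, "Mammals": 1, "Birds": 1, "Elephants": 2, "Sparrows": 2}

def average_specificity(index_terms):
    """Mean hierarchy depth of the chosen terms; higher means more specific."""
    return sum(term_depth[t] for t in index_terms) / len(index_terms)

print(exhaustivity(["Mammals", "Birds"]))                     # 0.5
print(exhaustivity(["Mammals", "Birds", "Fish", "Insects"]))  # 1.0
print(average_specificity(["Animals"]))                       # 0.0
print(average_specificity(["Elephants", "Sparrows"]))         # 2.0
```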

Weighting index terms :

 Assigning importance to each index term based on how central it is to the document

Ex :

 If a document heavily discusses “Pentium processors”, the indexer might give the term “Pentium” a higher weight
 This helps search systems prioritize documents that are more relevant to that concept

Note :

 Suppose you have an article about “healthy eating”
 The article mentions “Vegetables” many times and “Fruit” a few times
 If you are indexing this article
 Vegetables might get a higher weight (for example, a score of 10) because it is discussed in depth and is a key part of the article
 Fruit might get a lower weight (for example, a score of 5) because it is mentioned but not as thoroughly discussed
 The use of weighting index terms is to help a search system understand which topics (or)
concepts in a document are more important than others and it also makes searches more
accurate and relevant
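
The healthy-eating example can be sketched with a simple count-based weight. The scheme here is an assumption chosen to reproduce the scores of 10 and 5 above: each term is scored relative to the most frequent term on a 0 to 10 scale. A human indexer or a real system may weight terms quite differently.

```python
from collections import Counter

def term_weights(text, terms, scale=10):
    """Weight each term by its count relative to the most frequent term."""
    counts = Counter(text.lower().split())
    top = max(counts[t] for t in terms) or 1  # avoid dividing by zero
    return {t: round(scale * counts[t] / top) for t in terms}

# Hypothetical article text reduced to its key words.
article = "vegetables vegetables fruit vegetables grains vegetables fruit"
print(term_weights(article, ["vegetables", "fruit"]))  # {'vegetables': 10, 'fruit': 5}
```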
Pre-coordination and linkages :

 In indexing, a decision needs to be made about whether (or) not to create linkages between
related terms that describe the same concept
 This is important to ensure that terms connected to the same idea are grouped together

Pre-coordination : (Linking at index creation)

 When you create the index, you decide how related terms are linked together
(Or)
 This is when related index terms are linked together at the time of indexing (before searching)

Ex :

 Suppose you create an index entry that links both the animal and the habitat (natural
environment) together
 Suppose your index entry is like “ Elephant – Africa ”
 Here “ Elephant ” and “ Africa ” are linked together in the index at the time of indexing
 If someone searches for “Elephant Africa”, the system will find entries where both terms are linked and return books specifically about elephants in Africa

Post-coordination : (Linking at search time)

 The system only links terms together when you search, usually by combining terms with “AND”
(Or)
 This happens when terms are combined at the search stage, not during indexing

Ex :

 If you search for “climate change” AND “carbon emissions” the system looks for documents that
have both terms
 They are not linked at the indexing stage but only when the user makes the search
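
The difference between the two approaches can be sketched as follows. The book titles and the helper names `search_pre` and `search_post` are made up for illustration; the point is only where the linking happens.

```python
# Pre-coordinated index: terms are linked into one entry at indexing time.
precoordinated = {
    ("elephant", "africa"): ["Elephants of the African Savanna"],
    ("elephant", "asia"): ["Asian Elephant Field Guide"],
}

# Post-coordinated index: each term is stored separately.
postcoordinated = {
    "elephant": {"Elephants of the African Savanna", "Asian Elephant Field Guide"},
    "africa": {"Elephants of the African Savanna", "Lions of Africa"},
}

def search_pre(*terms):
    """Look up an entry whose linked terms were fixed when the index was built."""
    return precoordinated.get(tuple(terms), [])

def search_post(*terms):
    """Combine single-term postings with AND (set intersection) at search time."""
    postings = [postcoordinated.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_pre("elephant", "africa"))   # ['Elephants of the African Savanna']
print(search_post("elephant", "africa"))  # ['Elephants of the African Savanna']
```

Both searches return the same book here, but the pre-coordinated index can only answer combinations the indexer anticipated, while the post-coordinated one can combine any terms at query time.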

Factors in linkage process :

Number of terms linked :

 How many terms you can link together

Ex :

 You might link “ CITGO ”, “ Oil drilling” and “ Mexico ” all together to specify where the drilling is
happening
Ordering constraints :

 Whether the order of the terms matters

Ex :

 If you have “ CITGO ” first and “ oil drilling ” second, it might suggest a different meaning than if
they were reversed

Additional descriptors :

 Extra information about the terms

Ex :

 You might add descriptors like “ source ”, “ problem ” (or) “ affected ” to clarify the roles of each
term in the context

Note :

 These are extra pieces of information added to the index terms to provide more context (or)
detail
 Suppose the index term is “Gardening – Beginners” and the additional descriptor is “How-To-Guide”
 Suppose a user searches for “Gardening – Beginners – How-To-Guide”
 The system will find and display a matching item, for example a book titled “Gardening Basics”, because it matches the combined term and the additional descriptor

Positional roles :

 Each term’s position in the list indicates its role

Ex :

 For “CITGO introduces oil refineries in Peru, Bolivia and Argentina” you would list
 Entry 1 : “CITGO introduces oil refineries in Peru”
 Entry 2 : “CITGO introduces oil refineries in Bolivia”
 Entry 3 : “CITGO introduces oil refineries in Argentina”
 Within each entry, a term’s position fixes its role, so each country needs its own entry

Modifiers :

 Extra words added to terms to describe their role

Ex :

 Instead of separate entries, you could have one entry like “CITGO introduces oil refineries in
Peru, Bolivia, Argentina” with modifiers indicating the affected countries
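
Positional roles and modifiers can be contrasted in a short sketch. The field names `source`, `action` and `location` and the helper `terms_with_role` are assumptions for illustration; the role labels “source” and “affected” follow the descriptors mentioned earlier.

```python
from collections import namedtuple

# Positional roles: the slot a term occupies fixes its role, one value per slot.
Entry = namedtuple("Entry", ["source", "action", "location"])
positional_entries = [
    Entry("CITGO", "introduces oil refineries", country)
    for country in ("Peru", "Bolivia", "Argentina")
]

# Modifiers: a single entry, with each term tagged by an explicit role label.
modifier_entry = [
    ("CITGO", "source"),
    ("oil refineries", "action"),
    ("Peru", "affected"),
    ("Bolivia", "affected"),
    ("Argentina", "affected"),
]

def terms_with_role(entry, role):
    """Return the terms in a modifier-style entry that carry the given role."""
    return [term for term, r in entry if r == role]

print(len(positional_entries))                      # 3
print(terms_with_role(modifier_entry, "affected"))  # ['Peru', 'Bolivia', 'Argentina']
```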
Note :

 Suppose the index term is “Smartphone – Latest Model” and the modifier is “Advanced Features”
 Suppose a user searches for “Smartphone – Latest Model” with the modifier “Advanced Features”
 The system finds and displays products that match both the main term and the modifier
 For example, “Smartphone – Latest Model with Advanced Features”

Automatic indexing :

 Automatic indexing is when a computer system automatically selects the keywords (or) index terms that best represent the main ideas of a document, without human intervention
 Automatic indexing helps to organize and retrieve documents quickly by choosing keywords automatically
 It can do this by treating all words equally (unweighted) (or) by giving more importance to certain words (weighted)
 This makes searching for documents more efficient, especially in large databases

Ex :

 Suppose you have a document about “ The benefits of exercise for mental health ”
 An automatic indexing system would scan the document and might choose keywords like
 “Exercise”
 “Mental Health”
 “Benefits”
 These keywords help you find the document later when you search for related topics

Weighted indexing :

 The system gives more importance (weight) to keywords that appear more frequently (or) are
more central to the document’s main ideas

Ex :

 In a document where “Mental Health” is mentioned many times and “Exercise” only a few times
 “Mental Health” would be given a higher weight, meaning it is considered more important in representing the document’s content
Unweighted indexing :

 All keywords are treated equally, without trying to figure out which ones are more important

Ex :

 If a document mentions “Exercise” once and “Mental Health” ten times, both keywords would still be treated the same

Note :

 Automatic indexing – The system scans the document and automatically identifies key terms to include in the index
 Weighted indexing – The system assigns a weight to each term based on its frequency (or) importance in the document
 Unweighted indexing – The system includes all extracted terms in the index without considering their importance (or) frequency

Ex :

 The automatic indexing system scans the document and extracts terms like “healthy”, “eating”, “nutrition” and “diet”
 Weighted indexing – The term “eating” appears frequently in the document, while “diet” is mentioned less often. The weighted index might look like Eating (weight 0.8), Healthy (weight 0.5), Nutrition (weight 0.4), Diet (weight 0.3)
 The weights indicate the relative importance of each term, making searches more precise
 Unweighted indexing – In the index, you might see Healthy, Eating, Nutrition, Diet
 All terms are treated equally, so there is no differentiation between how frequently (or) significantly each term appears in the document
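
The two styles of automatic indexing can be sketched as below. This is a minimal illustration under stated assumptions: the tiny stop-word list, the sample sentence and the weight formula (term count divided by the number of extracted terms) are all made up, so the weights differ from the 0.8 and 0.5 figures quoted above.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "and", "for", "is", "of", "the", "to"}  # assumed tiny stop list

def extract_terms(text):
    """Lowercase the words, strip simple punctuation and drop stop words."""
    words = [w.strip(".,").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def unweighted_index(text):
    """Every extracted term is listed once, with no notion of importance."""
    return sorted(set(extract_terms(text)))

def weighted_index(text):
    """Weight = term count divided by the total number of extracted terms."""
    terms = extract_terms(text)
    counts = Counter(terms)
    return {t: round(c / len(terms), 2) for t, c in counts.most_common()}

doc = "Healthy eating is the key to nutrition. Eating a healthy diet is eating for nutrition."
print(unweighted_index(doc))  # ['diet', 'eating', 'healthy', 'key', 'nutrition']
weights = weighted_index(doc)
print(weights["eating"], weights["diet"])  # 0.33 0.11
```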

Advantages of automatic indexing :

Cost and speed :

 After the initial setup, automatic indexing is cheaper and faster than using humans to do the
work

Ex :

 It might take a human several minutes to read and choose keywords for a document but a
computer can do it in seconds
Consistency :

 The computer follows the same rules every time, so the keywords it picks are consistent

Ex :

 If two different human indexers read the same document, they might choose slightly different
keywords
 But an automatic system will pick the same keywords every time

Information extraction :

 Information extraction is the process of automatically identifying and pulling out important details (or) facts from a large amount of text and organizing them in a structured format like a table (or) database
 Information extraction involves two key processes : fact extraction and summarization
 Fact extraction and summarization are used for extracting important information from a text,
but they do it in different ways

Fact extraction :

 Fact extraction is like picking specific details (or) facts from a document to update a database
 The goal is to find important information based on set criteria, without trying to understand the entire document
 The system looks for specific types of information like names, dates, places and so on

Ex :

 Imagine a news article about a change of CEO at a company
 The system’s job is to fill certain slots in a database
 Company name : XYZ Corporation
 Old CEO : John
 New CEO : David
 Date : September 8th 2024
 The system only looks for these specific facts and fills them into the correct slots in the database
 It does not care about the rest of the article that might talk about the CEO’s background (or) the
company’s future plans
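
Slot filling of this kind can be sketched with a regular expression. The article sentence and the pattern are hypothetical and tailored to one sentence shape; a real fact extraction system would need many patterns or a trained model.

```python
import re

# Hypothetical article text; only the first sentence carries the facts we want.
article = (
    "XYZ Corporation announced on September 8th 2024 that David will replace "
    "John as CEO. The company also discussed its future plans."
)

# Assumed pattern for this one sentence shape, with one named group per slot.
PATTERN = re.compile(
    r"(?P<company>[A-Z][\w ]*?) announced on (?P<date>\w+ \d{1,2}(?:st|nd|rd|th)? \d{4}) "
    r"that (?P<new_ceo>\w+) will replace (?P<old_ceo>\w+) as CEO"
)

def extract_ceo_change(text):
    """Fill the database slots if the pattern matches; otherwise return None."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

record = extract_ceo_change(article)
print(record["company"], "|", record["old_ceo"], "->", record["new_ceo"], "|", record["date"])
```

The second sentence about future plans is ignored, which mirrors how fact extraction fills only the predefined slots.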

Summarization :

 Summarization involves taking a long document and creating a shorter version (summary) that
captures the main ideas
 This is more complex because it needs to consider all the major concepts, not just specific facts
Ex :

 Imagine you have a long research paper about climate change


 Summarization would extract the key points such as
 Climate change is causing rising sea levels
 Global temperatures are increasing
 Urgent action is needed to reduce carbon emissions
 This summary gives you the essential ideas of the research paper without having to read the
entire document
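
One simple way to approximate this is extractive summarization: keep the sentences whose words occur most often in the whole text. The sketch below assumes that approach; it is only one of several possible techniques, and the stop-word list, sample text and `summarize` function are made up for the example.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "this", "to", "was"}

def content_words(text):
    """Lowercase alphabetic words with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Preserve the original order of the selected sentences.
    return [s for s in sentences if s in top]

paper = (
    "Climate change is causing rising sea levels. "
    "Global temperatures are increasing because of climate change. "
    "The paper was written in 2020. "
    "Urgent action is needed to reduce carbon emissions and slow climate change."
)
summary = summarize(paper)
print(summary)
```

With this sample text the two highest-scoring sentences are the first two, so the sentence about when the paper was written is dropped.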

Note :

 Fact extraction – Pulling out, that is extracting (or) selecting, specific pieces of information from a larger text
 Summarization – Highlighting the main ideas of the whole text
