UNIT 1 Notes
Course: TOPICS (Credits : 03 Lectures/Week: 03)
USCS604 Information Retrieval
Objectives:
To provide an overview of the important issues in classical and web information retrieval. The focus
is to give an up-to-date treatment of all aspects of the design and implementation of systems for
gathering, indexing, and searching documents, and of methods for evaluating such systems.
Expected Learning Outcomes:
After completing this course, the learner should understand the field of information
retrieval and its relationship to search engines, and should be able to apply
information retrieval models.
Unit I (15L)
Introduction to Information Retrieval: Introduction, History of IR,
Components of IR, and Issues related to IR, Boolean retrieval,
Dictionaries and tolerant retrieval.
Unit II (15L)
Link Analysis and Specialized Search: Link Analysis, hubs and
authorities, PageRank and HITS algorithms, Similarity, Hadoop &
MapReduce, Evaluation, Personalized search, Collaborative filtering and
content-based recommendation of documents and products, handling
"invisible" Web, Snippet generation, Summarization, Question
Answering, Cross-Lingual Retrieval.
Unit III (15L)
Web Search Engine: Web search overview, web structure, the user, paid
placement, search engine optimization/spam, Web size measurement,
Web Search Architectures.
XML retrieval: Basic XML concepts, Challenges in XML retrieval, A
vector space model for XML retrieval, Evaluation of XML retrieval,
Text-centric versus data-centric XML retrieval.
Text book(s):
1) Introduction to Information Retrieval, C. Manning, P. Raghavan, and H. Schütze,
Cambridge University Press, 2008
2) Modern Information Retrieval: The Concepts and Technology behind Search, Ricardo
Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition, ACM Press Books, 2011.
3) Search Engines: Information Retrieval in Practice, Bruce Croft, Donald Metzler and Trevor
Strohman, 1st Edition, Pearson, 2009.
Additional Reference(s):
1) Information Retrieval: Implementing and Evaluating Search Engines, Stefan Büttcher,
Charles L. A. Clarke and Gordon V. Cormack, The MIT Press, Reprint edition (February 12,
2016).
1 INTRODUCTION TO INFORMATION RETRIEVAL
Unit Structure
1.0 Objectives
1.1 Introduction and History of IR
1.2 Components of IR
1.3 Issues related to IR
1.4 Boolean retrieval
1.5 Dictionaries and tolerant retrieval
1.5.1 Search structures for dictionary
1.5.2 Wildcard queries
1.6 Summary
1.7 List of References
1.8 Unit End Exercises
1.0 OBJECTIVES
To define information retrieval
To understand the importance of, and need for, information retrieval systems
To explain the concept of the subject approach to information
To illustrate the process of information retrieval
Since the 1950s, text and text documents have been the field's main focus.
Documents come in a wide variety of forms, including web pages, emails,
books, academic papers, and news articles, to name just a few. Every one
of these documents has some structure to it, such as the title, author,
date, and abstract information associated with the content of articles
published in scholarly journals. When referring to database records, the
components of this structure are called attributes or fields. The key
distinction between a document and a typical database record, like one for
a bank account or a ticket reservation, is that a document's content is
mostly presented as text, which is a largely unstructured format.
To illustrate this distinction, consider the information contained in two
typical attributes of an account record: the account number and the current
balance. Both have very clearly defined formats (a six-digit integer for an
account number, for instance, and a real number with two decimal places for
a balance), as well as clearly defined meanings. As a result, it is simple
to develop algorithms that find the records answering queries like "Find
account number 321456" or "Find accounts with balances larger than
$50,000.00." It is very easy to compare the values of these attributes.
Now consider a recent news story about a bank merger. The headline and
the story's source are some of its attributes, but the story's actual
content is what matters most. In a database system, this important piece of
data would normally be stored as a single large attribute with no internal
structure. Most searches for this topic on search engines like Google will
be of the form "bank merger" or "bank takeover." To carry out this search,
we must design algorithms that can decide whether the story contains the
information sought by comparing the text of the query with the text of the
story. Defining the meaning of a word, a sentence, a paragraph, or a whole
news story is much harder than defining an account number, and consequently
comparing text is difficult. The foundation of information retrieval is the
understanding and modelling of how people compare texts, together with the
development of computer algorithms that carry out this comparison
accurately and efficiently.
Information retrieval applications increasingly involve multimedia
documents that have structure, substantial text content, and other media.
Pictures, video, and audio, including speech and music, are common forms
of information media. Scanned document images are also important in some
applications, such as legal support. As with text, the content of these
media is difficult to describe and compare. Current technology for
searching non-text materials relies on text descriptions of their contents
rather than the contents themselves, but progress is being made on methods
for comparing images directly, for instance.
Information retrieval encompasses a variety of tasks and applications in
addition to a variety of media. In the typical search scenario, a user types
a query into a search engine and receives results in the form of a ranked
list of documents. Although web search is by far the most common application
of information retrieval, search is also an essential component of
applications in businesses, government, and many other fields. Vertical
search is a specialized form of web search that restricts the domain of the
search to a single subject. Enterprise search involves finding the required
information among the enormous variety of digital files scattered across a
company intranet. Web pages are certainly part of that distributed
information store, but most of the information will be found in sources such
as emails, reports, presentations, spreadsheets, and structured data in
corporate databases. Desktop search is the personal version of enterprise
search, where the information sources are the files stored on a user's PC,
including emails and recently visited web pages. Peer-to-peer search
involves finding information in networks of nodes or computers without any
centralized control. This kind of search began as a music file-sharing
tool, but it can now be used by any group of people who share similar
interests or, in the case of mobile devices, a local area. Search and
related information retrieval techniques are used in a variety of
industries, including advertising, intelligence gathering, science,
healthcare, customer service, real estate, and others. Any application that
involves a collection of unstructured data, such as text or other types of
information, will need to organize and search that data.
Information retrieval research covers other text-based tasks besides search
based on a user query (sometimes called ad hoc search because the range of
possible queries is broad and not prespecified). These tasks include
filtering, classification, and question answering. Filtering, or
monitoring, involves finding stories that match a person's interests and
alerting that person by email or some other mechanism. Classification, or
categorization, automatically assigns documents labels drawn from a
predefined set of labels or classes. Question answering is similar to
search but is aimed at more specific questions such as "What is the height
of Mt. Everest?". The goal of question answering is to return a specific
answer found in the text rather than a list of documents. Some of these
aspects or dimensions of the information retrieval field are summarized in
Table 1.
Table 1: Some dimensions of information retrieval
1.2 COMPONENTS OF IR
The main components of an IR system are shown in Figure 1. Before
performing a search, a user has an information need, which underlies and
drives the search process. When this information need is provided in
written form as part of a test collection for IR evaluation, we sometimes
refer to it as a topic. As a result of her information need, the user
constructs and issues a query to the IR system. Typically, this query
consists of a small number of terms, with two or three terms being typical
for a Web search. We use "term" rather than "word" here because a query
term may not be a word at all. Depending on the information need, a query
term may be a date, a number, a musical note, or a phrase. Query terms
may also be allowed to contain wildcard operators and other partial-match
operators. The term "inform*," for instance, might match any word
beginning with that prefix (e.g., "informs," "informal," "informant,"
"informative," etc.).
A AND B: intersection of A and B (A ∩ B)
A OR B: union of A and B (A ∪ B)
NOT A: complement of A with respect to the document collection (Ā)
where A and B are terms or other Boolean queries.
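The set interpretation of the Boolean operators can be tried out directly with Python sets. The following is only a minimal sketch: the document IDs, the two terms, and their posting sets are made up for illustration, and the helper names bool_and, bool_or, and bool_not are hypothetical.

```python
# Hypothetical toy collection: five documents with IDs 1..5.
ALL_DOCS = {1, 2, 3, 4, 5}

# Made-up posting sets: the documents in which each term occurs.
POSTINGS = {
    "quarrel": {1, 2, 4},
    "sir": {2, 3, 4, 5},
}


def bool_and(a, b):
    """A AND B: documents that are in both sets (intersection)."""
    return a & b


def bool_or(a, b):
    """A OR B: documents that are in at least one set (union)."""
    return a | b


def bool_not(a, universe=ALL_DOCS):
    """NOT A: complement of A with respect to the document collection."""
    return universe - a


A = POSTINGS["quarrel"]
B = POSTINGS["sir"]

both = bool_and(A, B)                # quarrel AND sir
either = bool_or(A, B)               # quarrel OR sir
a_not_b = bool_and(A, bool_not(B))   # quarrel AND NOT sir
```

Because the operands may themselves be results of other Boolean queries, the three helpers compose freely, as in the last line.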
Consider Table 2 as an example.
Table 2: Text fragment from Shakespeare's Romeo and Juliet, act I, scene 1
Figure 3: Function to locate the next occurrence of a cover for the term
vector <t1, ..., tn> after a given position
Both of the aforementioned algorithms have the same fundamental mode
of operation. Lines 1-6 of the phrase search algorithm identify a range that
contains every term in the phrase in order, such that no smaller range
contained within it also contains every term in order. Lines 1-4 of the
cover-finding algorithm similarly locate all the terms as close together as
possible. An additional constraint is then applied in both algorithms.
To simplify our definition of our Boolean search algorithm, we define two
functions that operate over Boolean queries, extending the nextDoc and
prevDoc methods of schema-dependent inverted indices.
Definitions for the NOT operator are more difficult, so we defer discussing
them until after the core algorithm has been introduced.
The nextSolution function, shown in Figure 4, locates the next solution to
a Boolean query after a given position.
Figure 4: Function to locate the next solution to the Boolean query Q after
a given position. The function nextSolution calls docRight and docLeft
to generate a candidate solution. These functions make recursive calls that
depend on the structure of the query
The function calls docRight and docLeft to generate a candidate solution.
Just after line 4, this candidate solution lies in the interval [u,v]. If
the candidate solution consists of a single document, it is returned.
Otherwise, the function makes a recursive call. Given this function, all
solutions to the Boolean query Q may be generated by the following:
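A minimal Python sketch of this procedure is given here for orientation. It is not the listing from Figure 4. It assumes that docRight takes the maximum over the operands of AND and the minimum over the operands of OR (and docLeft the reverse), that a query is either a term string or a tuple such as ("AND", A, B), and that the toy index INDEX is made up. NOT is not handled.

```python
import bisect
import math

INF = math.inf

# Hypothetical toy index: term -> sorted list of document IDs.
INDEX = {
    "quarrel": [1, 2, 4],
    "sir": [2, 3, 4, 5],
    "draw": [3, 5],
}


def next_doc(term, u):
    """Smallest document ID greater than u that contains term, else INF."""
    docs = INDEX.get(term, [])
    i = bisect.bisect_right(docs, u)
    return docs[i] if i < len(docs) else INF


def prev_doc(term, v):
    """Largest document ID smaller than v that contains term, else -INF."""
    docs = INDEX.get(term, [])
    i = bisect.bisect_left(docs, v)
    return docs[i - 1] if i > 0 else -INF


def doc_right(q, u):
    """End point of the first candidate solution to q after document u."""
    if isinstance(q, str):
        return next_doc(q, u)
    op, a, b = q
    pick = max if op == "AND" else min   # assumed combination rule
    return pick(doc_right(a, u), doc_right(b, u))


def doc_left(q, v):
    """Start point of the last candidate solution to q before document v."""
    if isinstance(q, str):
        return prev_doc(q, v)
    op, a, b = q
    pick = min if op == "AND" else max   # assumed combination rule
    return pick(doc_left(a, v), doc_left(b, v))


def next_solution(q, position):
    """Next document after position that satisfies q, or INF if none."""
    v = doc_right(q, position)
    if v == INF:
        return INF
    u = doc_left(q, v + 1)
    if u == v:                 # candidate interval [u, v] is one document
        return u
    return next_solution(q, v)


def all_solutions(q):
    """Generate every solution by repeatedly asking for the next one."""
    results = []
    u = -INF
    while u < INF:
        u = next_solution(q, u)
        if u < INF:
            results.append(u)
    return results
```

For example, all_solutions(("AND", "quarrel", "sir")) walks the two posting lists and reports the documents that contain both terms.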
This transformation does not change the number of AND and OR operators,
and as a result does not change the number of terms in the query (n). After
appropriately applying De Morgan's laws, we are left with a query that
contains expressions of the form NOT t, where t is a term. To process
queries with expressions of this form, we need corresponding definitions
of docRight and docLeft. These definitions can be expressed in terms of
nextDoc and prevDoc.
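The effect of applying De Morgan's laws can be illustrated with a short sketch. It assumes a hypothetical tuple representation of queries, ("AND", A, B), ("OR", A, B), and ("NOT", A), with plain strings as terms. The function names push_not and count_and_or are illustrative only.

```python
def push_not(q, negate=False):
    """Rewrite q so that NOT is applied only to individual terms.

    Uses De Morgan's laws:
      NOT (A AND B) = (NOT A) OR (NOT B)
      NOT (A OR B)  = (NOT A) AND (NOT B)
    and cancels double negation.
    """
    if isinstance(q, str):
        return ("NOT", q) if negate else q
    if q[0] == "NOT":
        # Entering a NOT flips the pending negation.
        return push_not(q[1], not negate)
    op, a, b = q
    if negate:
        op = "OR" if op == "AND" else "AND"
    return (op, push_not(a, negate), push_not(b, negate))


def count_and_or(q):
    """Number of AND and OR operators in q."""
    if isinstance(q, str):
        return 0
    if q[0] == "NOT":
        return count_and_or(q[1])
    return 1 + count_and_or(q[1]) + count_and_or(q[2])


query = ("NOT", ("AND", "a", ("OR", "b", ("NOT", "c"))))
rewritten = push_not(query)
```

Here rewritten is ("OR", ("NOT", "a"), ("AND", ("NOT", "b"), "c")): every NOT now sits directly on a term, and the number of AND and OR operators is the same before and after the rewrite.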
Many of these problems are resolved by search trees, which, for example,
allow us to enumerate all vocabulary terms beginning with automat. The
binary tree, which has two children at each internal node, is the best-known
search tree. The search for a term begins at the root of the tree. Each
internal node (including the root) represents a binary test, the outcome of
which determines which of the two subtrees below that node the search
proceeds to. An illustration of a binary search tree used for a dictionary
may be found in Figure 5. Efficient search (with a number of comparisons
that is O(log M), where M is the number of terms in the vocabulary) depends
on the tree being balanced. As terms are inserted into or deleted from the
tree, it must be rebalanced to keep this property. To mitigate rebalancing,
one approach is to allow the number of sub-trees under an internal node to
vary in a fixed interval, as in the B-tree.
Figure 5: A binary search tree. In this example the branch at the root
partitions vocabulary terms into two subtrees, those whose first letter is
between a and m, and the rest
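Enumerating all vocabulary terms with a given prefix can be sketched as follows. This is not a tree implementation: it uses binary search over a sorted list (Python's bisect module) as a simple stand-in for a balanced search tree, which likewise needs O(log M) comparisons to reach the first matching term. The vocabulary and the function name terms_with_prefix are made up for illustration.

```python
import bisect

# Hypothetical vocabulary, kept in sorted order as a search tree would keep it.
VOCAB = sorted([
    "auto", "automat", "automata", "automatic", "automation",
    "autumn", "lemon", "salmon", "sermon",
])


def terms_with_prefix(sorted_terms, prefix):
    """Return all terms in sorted_terms that begin with prefix."""
    # Binary search locates the first term that is >= prefix.
    lo = bisect.bisect_left(sorted_terms, prefix)
    # Matching terms are contiguous in sorted order, so scan until one fails.
    hi = lo
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]


matches = terms_with_prefix(VOCAB, "automat")
```

With this vocabulary, matches contains "automat", "automata", "automatic", and "automation", while "auto" and "autumn" are excluded.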
But what about wildcard queries in which the * symbol is not required to
appear at the end of the search string? Before dealing with this general
case, we mention a slight generalization of trailing wildcard queries.
Consider first leading wildcard queries, or queries of the form *mon. Think
of a reverse B-tree on the dictionary, in which each root-to-leaf path
represents a term in the dictionary written backwards. For example, the
term "lemon" would be represented by the path root-n-o-m-e-l in the reverse
B-tree. A walk down the reverse B-tree then enumerates all terms R in the
vocabulary that end with a given suffix, since that suffix is a prefix of
the reversed spelling.
In fact, by combining a conventional B-tree with a reverse B-tree we can
handle an even more general case: wildcard queries with a single * symbol,
like se*mon. To do this, we first enumerate the set W of dictionary
terms beginning with the prefix se using the ordinary B-tree, and then we
enumerate the set R of terms ending in the suffix mon using the reverse
B-tree. The set of terms that start with the prefix se and end with the
suffix mon is then obtained by taking the intersection W ∩ R of these two
sets. Finally, we retrieve all documents that contain any of the terms in
this intersection using the conventional inverted index. We can thus handle
wildcard queries that contain a single * symbol using two B-trees, the
normal B-tree and a reverse B-tree.
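The two-tree procedure for a query such as se*mon can be sketched as follows. Sorted lists searched with bisect stand in for the normal and reverse B-trees. The vocabulary, the postings, and all function names are hypothetical. The length check is an added detail not discussed above: it stops the prefix and suffix from overlapping inside a short term.

```python
import bisect

# Hypothetical inverted index: term -> set of document IDs.
POSTINGS = {
    "lemon": {1}, "salmon": {2, 5}, "seaman": {3}, "season": {4},
    "semester": {6}, "sermon": {3, 7}, "simon": {8},
}

FORWARD = sorted(POSTINGS)                      # stand-in for the normal B-tree
BACKWARD = sorted(t[::-1] for t in POSTINGS)    # stand-in for the reverse B-tree


def terms_with_prefix(sorted_terms, prefix):
    """Return all terms in sorted_terms that begin with prefix."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = lo
    while hi < len(sorted_terms) and sorted_terms[hi].startswith(prefix):
        hi += 1
    return sorted_terms[lo:hi]


def single_wildcard_terms(query):
    """Vocabulary terms matching a query that contains exactly one *."""
    prefix, suffix = query.split("*")
    # W: terms that begin with the prefix (normal order).
    w = set(terms_with_prefix(FORWARD, prefix))
    # R: terms that end with the suffix, found as a prefix of reversed terms.
    r = {t[::-1] for t in terms_with_prefix(BACKWARD, suffix[::-1])}
    # Intersect, and require room for both prefix and suffix without overlap.
    return sorted(t for t in w & r if len(t) >= len(prefix) + len(suffix))


def wildcard_docs(query):
    """Documents containing any term that matches the wildcard query."""
    docs = set()
    for term in single_wildcard_terms(query):
        docs |= POSTINGS[term]
    return docs


terms = single_wildcard_terms("se*mon")
docs = wildcard_docs("se*mon")
```

With this toy vocabulary, W holds the four terms beginning with se, R holds the four terms ending in mon, and their intersection is the single term "sermon", whose postings are then read from the inverted index.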
1.6 SUMMARY
Calvin Mooers coined the term "information retrieval" in 1950. From
1961 onwards, when computers came into use for information handling,
it gained popularity among researchers. Information retrieval was
subsequently understood to mean the retrieval of bibliographic information
from databases of stored documents. Those early information retrieval
systems were, however, really document retrieval systems: they were
designed to establish the existence (or absence) of bibliographic documents
relevant to a user's query. To put it another way, early information
retrieval systems (IRS) were built to return a complete document (a book,
an article, etc.) in answer to a search query. Although IR systems still
operate in this manner today, numerous advanced design techniques have been
developed and implemented over time. The meaning of information retrieval
has evolved over time, and scholars and information specialists have used
many terms to describe it. Information access, text retrieval, information
representation and retrieval, information processing and retrieval, and
information storage and retrieval are only a few of them. We have also
discussed the components of an information retrieval system and the issues
associated with it.