CONTENTS

Unit No.  Title

1.  Introduction to Information Retrieval
2.  Link Analysis and Specialized Search
3.  Web Search Engine
4.  XML Retrieval

Course: USCS604 Information Retrieval (Credits: 03, Lectures/Week: 03)
Objectives:
To provide an overview of the important issues in classical and web information retrieval. The focus is to give an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents, and of methods for evaluating such systems.
Expected Learning Outcomes:
After completing this course, the learner should have an understanding of the field of information retrieval and its relationship to search engines, and should be able to apply information retrieval models.
Unit I (15L): Introduction to Information Retrieval: Introduction, History of IR, Components of IR, Issues related to IR, Boolean retrieval, Dictionaries and tolerant retrieval.
Unit II (15L): Link Analysis and Specialized Search: Link Analysis, hubs and authorities, PageRank and HITS algorithms, Similarity, Hadoop & MapReduce, Evaluation, Personalized search, Collaborative filtering and content-based recommendation of documents and products, handling the "invisible" Web, Snippet generation, Summarization, Question Answering, Cross-Lingual Retrieval.
Unit III (15L): Web Search Engine: Web search overview, web structure, the user, paid placement, search engine optimization/spam, Web size measurement, Web Search Architectures. XML retrieval: Basic XML concepts, Challenges in XML retrieval, A vector space model for XML retrieval, Evaluation of XML retrieval, Text-centric versus data-centric XML retrieval.
Text book(s):
1) Introduction to Information Retrieval, C. Manning, P. Raghavan, and H. Schütze,
Cambridge University Press, 2008
2) Modern Information Retrieval: The Concepts and Technology behind Search, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition, ACM Press Books, 2011.
3) Search Engines: Information Retrieval in Practice, Bruce Croft, Donald Metzler and Trevor
Strohman, 1st Edition, Pearson, 2009.

Additional Reference(s):
1) Information Retrieval Implementing and Evaluating Search Engines, Stefan Büttcher,
Charles L. A. Clarke and Gordon V. Cormack, The MIT Press; Reprint edition (February 12,
2016)
1
INTRODUCTION TO INFORMATION RETRIEVAL
Unit Structure
1.0 Objectives
1.1 Introduction and History of IR
1.2 Components of IR
1.3 Issues related to IR
1.4 Boolean retrieval
1.5 Dictionaries and tolerant retrieval
1.5.1 Search structures for dictionary
1.5.2 Wildcard queries
1.6 Summary
1.7 List of References
1.8 Unit End Exercises

1.0 OBJECTIVES
• To define information retrieval
• To understand the importance and need of information retrieval systems
• To explain the concept of the subject approach to information
• To illustrate the process of information retrieval

1.1 INTRODUCTION AND HISTORY OF IR


Gerard Salton, a pioneer of information retrieval and a leading figure in the field from the 1960s through the 1990s, proposed the following definition in his influential 1968 textbook (Salton, 1968):
"Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information."
Despite the enormous advances in search technology and in our understanding of search over the more than forty years since, this definition remains appropriate and accurate. Because "information" is such a broad term, information retrieval encompasses research on a wide range of information types and applications.

Since the 1950s, the field has focused primarily on text and text documents. Documents come in a wide variety of forms, including web pages, emails, books, academic papers, and news articles, to name just a few. All of these documents have some structure, such as the title, author, date, and abstract associated with the content of an article published in a scholarly journal. When referring to database records, the elements of this structure are called attributes or fields. The key distinction between a document and a typical database record, such as one for a bank account or a flight reservation, is that most of a document's content is text, which is a relatively unstructured format.
To illustrate this difference, consider the information contained in two typical attributes of an account record: the account number and the current balance. Both have very well-defined formats (for example, a six-digit integer for the account number and a real number with two decimal places for the balance) and well-defined meanings. As a result, it is easy to compare the values of these attributes, and therefore straightforward to implement algorithms that identify the records satisfying queries such as "Find account number 321456" or "Find accounts with balances greater than $50,000.00".
Now consider a recent news story about a bank merger. The headline and the source of the story are among its attributes, but what matters most is the story's actual content. In a database system, this critical piece of information would typically be stored as a single large attribute with no internal structure. Most queries about this topic submitted to a search engine such as Google will be of the form "bank merger" or "bank takeover". To answer these queries, we must design algorithms that can compare the text of the query with the text of the story and decide whether the story contains the information being sought. Defining the meaning of a word, a sentence, a paragraph, or a whole news story is much harder than defining an account number, and consequently comparing texts is difficult. Understanding and modelling how people compare texts, and designing computer algorithms that carry out this comparison accurately and efficiently, lie at the core of information retrieval.

Increasingly, information retrieval applications involve multimedia documents that combine structure, significant text content, and other media. Common forms of information media include pictures, video, and audio, including speech and music. Scanned document images are also important in some applications, such as legal support. Like text, the content of these media is difficult to describe and compare. Rather than using the contents themselves, current technology for searching non-text media generally relies on text descriptions of their content, although progress is being made on techniques for direct comparison of images, for example.
In addition to this variety of media, information retrieval encompasses a variety of tasks and applications. In the typical search scenario, a user types a query into a search engine and receives results in the form of a ranked list of documents. Although web search is by far the most popular application involving information retrieval, search is also an essential component of applications in business, government, and many other domains. Vertical search is a form of web search that restricts the domain of the search to a single topic. Enterprise search involves finding the required information in the huge variety of digital assets scattered across a company intranet. Web pages are certainly part of that distributed store of knowledge, but most of the information will be found in sources such as emails, reports, presentations, spreadsheets, and structured data in corporate databases. Desktop search is the personal version of enterprise search, where the information sources are the documents stored on a user's computer, including emails and recently visited web pages. Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file-sharing tool for music, but it can be used in any community of people who share interests or, in the case of mobile devices, a local area. Search and related information retrieval techniques are used in a wide range of industries, including advertising, intelligence gathering, science, healthcare, customer service, real estate, and others. Any application that involves a collection of unstructured data, such as text or other types of information, will need to organize and search that data.
Besides search based on a user query (sometimes referred to as ad hoc search, because the range of possible queries is huge and not prespecified), other text-based tasks are studied in information retrieval. Filtering, classification, and question answering are additional tasks. Filtering or monitoring involves detecting stories relevant to a person's interests and sending an alert by email or some other mechanism. Classification or categorization uses a predefined set of labels or classes and automatically assigns those labels to documents. Question answering is similar to search but is aimed at more specific queries, such as "What is the height of Mt. Everest?". Instead of returning a list of documents, the goal of question answering is to return a specific answer found in the text. Some of these components or dimensions of the information retrieval field are summarized in Table 1.
Table 1: Some dimensions of information retrieval

1.2 COMPONENTS OF IR
The main components of an IR system are shown in Figure 1. Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic, particularly when it is presented in written form as part of a test collection for IR evaluation. As a result of her information need, the user constructs and issues a query to the IR system. Typically, this query consists of only one or two terms, with two to three terms being typical for a Web search. We use "term" instead of "word" here because a query term may not be a word at all. Depending on the information need, a query term may be a date, a number, a musical note, or a phrase, and wildcard operators and other partial-match operators may also be permitted in query terms. For example, the term "inform*" (with a trailing wildcard) might match any word beginning with that prefix ("informs", "informed", "informal", "informant", "informative", etc.).

Figure 1: Components of an IR system


Although users typically issue simple keyword queries, IR systems generally support a richer query syntax, often with complex Boolean and pattern-matching operators. These facilities may be used to restrict a search to a particular Web site, to specify constraints on fields such as author and title, or to apply other filters that limit the search to a subset of the collection. A user interface mediates between the user and the IR system, simplifying the query-creation process when these richer query facilities are required.
The user's query is processed by a search engine, which may be running on the user's local workstation, on a large cluster of machines in a remote geographic location, or anywhere in between. A major task of a search engine is to maintain and manipulate an inverted index for a document collection. This index forms the principal data structure that the engine uses for relevance ranking. The primary purpose of an inverted index is to map the relationship between terms and the locations in the collection where they occur. Because the size of an inverted index is on the same order as that of the document collection itself, care must be taken that index access and update operations are carried out efficiently.
To support relevance ranking algorithms, the search engine maintains collection statistics associated with the index, such as the number of documents containing each term and the length of each document. In addition, the search engine usually has access to the original content of the documents, in order to report meaningful results back to the user.
The search engine accepts queries from its users, processes these queries, and produces ranked lists of results using the inverted index, the collection statistics, and other data. To perform relevance ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for each document. After the documents are sorted by score, the result list may undergo further processing, such as the removal of duplicate or redundant results. For example, a web search engine might report only one or two results from a single host or domain, replacing the others with pages drawn from different sources. Scoring documents with respect to a user's query is one of the most fundamental problems in the field.
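As a small illustration of the result-processing step just described, the following Python sketch scores documents, sorts them by RSV, and suppresses more than two results per host. The scoring values, document IDs, and host names are invented for illustration; this is not the implementation of any particular search engine.

    from collections import defaultdict

    def rank_results(docs, rsv, max_per_host=2):
        """docs: iterable of dicts with 'id' and 'host'; rsv: docid -> score."""
        ranked = sorted(docs, key=lambda d: rsv[d["id"]], reverse=True)
        kept_per_host = defaultdict(int)
        results = []
        for d in ranked:
            # keep at most max_per_host results from any single host
            if kept_per_host[d["host"]] < max_per_host:
                kept_per_host[d["host"]] += 1
                results.append(d)
        return results

    docs = [{"id": 1, "host": "a.com"}, {"id": 2, "host": "a.com"},
            {"id": 3, "host": "a.com"}, {"id": 4, "host": "b.org"}]
    rsv = {1: 2.4, 2: 1.9, 3: 1.7, 4: 1.2}            # invented RSV scores
    print([d["id"] for d in rank_results(docs, rsv)])  # [1, 2, 4]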

1.3 ISSUES RELATED TO IR


Since the 1960s, when experiments were conducted on document collections totaling around 1.5 gigabytes of text, information retrieval researchers have concentrated on a few core issues that remain just as important in the era of commercial web search engines dealing with billions of web pages. The following are a few of the issues related to IR:

1] Relevance: Relevance is a fundamental concept in information retrieval. Loosely speaking, a relevant document contains the information that a user was looking for when she submitted a query to the search engine. Although this sounds simple, many factors go into a person's decision as to whether a particular document is relevant, and these factors must be taken into account when designing algorithms for comparing text and ranking documents. Simply comparing the text of a query with the text of a document and looking for an exact match, as is done in a database system or with the grep utility in Unix, produces very poor results in terms of relevance. One obvious reason for this is that language can be used to express the same concepts in many different ways, often with very different words. This is known as the vocabulary mismatch problem in information retrieval.
It is also important to distinguish between topical relevance and user relevance. A text document is topically relevant to a query if it is on the same topic. For example, a news story about a tornado in Kansas would be topically relevant to the query "severe weather events". The person who submitted the query (commonly referred to as the user) may not, however, consider the story relevant if she has already seen it, if it is five years old, or if it is in Chinese from a Chinese news agency. User relevance takes these additional features of the story into account.
To address the issue of relevance, researchers propose retrieval models and evaluate their effectiveness. A retrieval model is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that a search engine uses to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the user who submitted the query. Some retrieval models place greater emphasis on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that take user relevance into account.

An interesting feature of retrieval models is that they typically model the statistical rather than the linguistic structure of text. This means, for example, that word frequency counts, rather than whether a word is a noun or an adjective, are often far more important to ranking algorithms. More sophisticated models do incorporate linguistic features, but these tend to play a secondary role. The use of word frequency information to represent text was introduced in the 1950s by another information retrieval pioneer, H. P. Luhn. This view of text did not become popular in other branches of computer science, such as natural language processing, until the 1990s.
2] Evaluation: Since the quality of a document ranking depends on how well it matches a user's expectations, it was necessary early on to develop evaluation measures and experimental procedures for acquiring this data and using it to compare ranking algorithms. Cyril Cleverdon led the way in developing evaluation methods in the early 1960s, and the measures precision and recall are still widely used today. Precision is a very intuitive measure: it is the proportion of retrieved documents that are relevant. Recall is the proportion of relevant documents that are retrieved. The recall measure assumes that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but this approach can be useful for smaller test collections of documents. A test collection for information retrieval experiments consists of a collection of text documents, a sample of typical queries, and a list of relevant documents for each query (the relevance judgments). The best-known test collections are those associated with the TREC evaluation forum.
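As a small worked example of these two measures, suppose a query returns five documents, three of which appear in the relevance judgments, and that six documents in total are judged relevant for that query. The document IDs below are invented purely for illustration; the sketch is in Python.

    # Precision and recall computed from sets of document IDs.
    retrieved = {3, 7, 12, 19, 25}        # documents returned for the query
    relevant  = {3, 12, 25, 31, 44, 58}   # all documents judged relevant

    hits = retrieved & relevant           # relevant documents that were retrieved
    precision = len(hits) / len(retrieved)   # 3 / 5 = 0.6
    recall    = len(hits) / len(relevant)    # 3 / 6 = 0.5
    print(precision, recall)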
The evaluation of retrieval models and search engines is a very active area of research today, with much attention focused on the use of large volumes of log data from user interactions, such as clickthrough data, which records the documents that were clicked on during a search session. Clickthrough and other log data are strongly correlated with relevance and can be used to evaluate search, but search engine companies still rely on relevance judgments in addition to log data to ensure the validity of their results.
3] Emphasis on users and their information needs: This should be obvious, given that the evaluation of search is user-centered. In other words, the users of a search engine are the ultimate judges of quality. This has led to numerous studies of how people interact with search engines and, in particular, to the development of techniques that help users express their information needs. An information need is the underlying cause of the query that a person submits to a search engine. In contrast to a request to a database system, such as for the balance of a bank account, text queries are often poor descriptions of what the user actually wants. A query such as "cats", for example, could be a request for information on where to buy cats or for a description of the Broadway musical. Despite their lack of specificity, one-word queries are very common in web search. Techniques such as query expansion, query suggestion, and relevance feedback use interaction and context to refine the initial query in order to produce better ranked lists.

1.4 BOOLEAN RETRIEVAL


In addition to the implicit Boolean filters applied by Web search engines, explicit support for Boolean queries is important in particular application areas, such as digital libraries and the legal domain. In contrast to ranked retrieval, Boolean retrieval returns sets of documents rather than ranked lists. Under the Boolean retrieval model, a term t is treated as specifying the set of documents that contain it. Boolean queries are built from the standard Boolean operators AND, OR, and NOT, which are interpreted as operations on these sets, as follows:
A AND B: intersection of A and B (A ∩ B)
A OR B: union of A and B (A ∪ B)
NOT A: complement of A with respect to the document collection (Ā)
where A and B are terms or other Boolean queries.
Consider Table 2 as an example.
Table 2: Text fragment from Shakespeare's Romeo and Juliet, act I, scene 1

For example, over the collection in Table 2, the query


(“quarrel” OR “sir”) AND “you”
specifies the set {1, 3}, whereas the query
("quarrel" OR "sir") AND NOT "you"
specifies the set {2, 5}.
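Because the Boolean model treats each term as the set of documents containing it, such queries can be evaluated with ordinary set operations. The following minimal Python sketch illustrates the idea; the postings are hypothetical, chosen only to be consistent with the two example results above, and are not a reconstruction of Table 2.

    postings = {
        "quarrel": {1, 2},
        "sir":     {1, 2, 3, 5},
        "you":     {1, 3},
    }
    all_docs = {1, 2, 3, 4, 5}        # the whole (hypothetical) collection

    def docs(term):
        return postings.get(term, set())

    # ("quarrel" OR "sir") AND "you"
    print((docs("quarrel") | docs("sir")) & docs("you"))              # {1, 3}

    # ("quarrel" OR "sir") AND NOT "you"
    print((docs("quarrel") | docs("sir")) & (all_docs - docs("you")))  # {2, 5}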
Our approach for solving Boolean queries is a variant of the phrase searching algorithm of Figure 2 and the cover finding algorithm of Figure 3. The algorithm locates candidate solutions to a Boolean query, where each candidate solution is a range of documents that together satisfy the query, such that no smaller range contained within it also satisfies the query. When a candidate solution represents a range of length 1, the single document it contains satisfies the query and belongs in the result set.

Figure 2: Function to locate the first occurrence of a phrase after a given position. The function calls the next and prev methods of the inverted index ADT and returns an interval in the text collection as a result

Figure 3: Function to locate the next occurrence of a cover for the term
vector <t1, ..., tn> after a given position
Both algorithms operate in essentially the same way. Lines 1-6 of the phrase searching algorithm identify a range that contains every term of the phrase, in the order in which they appear, such that no smaller range within it also contains every term of the phrase. Lines 1-4 of the cover finding algorithm similarly locate all the terms as close together as possible. An additional constraint is then applied in both algorithms.
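Since the figures themselves are images, the following Python sketch illustrates the phrase-search idea described above. It assumes a positional inverted index offering next(t, pos) and prev(t, pos), which return the first occurrence of term t after or before position pos (or plus/minus infinity if none exists); the toy index and example positions are invented for illustration.

    INF = float("inf")

    class ToyPositionalIndex:
        """Toy stand-in for the inverted index ADT: term -> sorted positions."""
        def __init__(self, positions):
            self.pos = positions
        def next(self, t, current):
            return next((p for p in self.pos.get(t, []) if p > current), INF)
        def prev(self, t, current):
            return next((p for p in reversed(self.pos.get(t, [])) if p < current), -INF)

    def next_phrase(index, terms, position):
        """First interval [u, v] after `position` where the terms occur
        consecutively and in order; (INF, INF) if there is none."""
        v = position
        for t in terms:                      # walk forward through each term in turn
            v = index.next(t, v)
        if v == INF:
            return (INF, INF)
        u = v
        for t in reversed(terms[:-1]):       # walk back to the tightest starting position
            u = index.prev(t, u)
        if v - u == len(terms) - 1:          # the terms are adjacent: a phrase match
            return (u, v)
        return next_phrase(index, terms, u)  # otherwise continue the search from u

    # "to be or not to be", word positions 1..6 (an invented example)
    idx = ToyPositionalIndex({"to": [1, 5], "be": [2, 6], "or": [3], "not": [4]})
    print(next_phrase(idx, ["to", "be"], 0))   # (1, 2)
    print(next_phrase(idx, ["to", "be"], 2))   # (5, 6)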
To simplify the definition of our Boolean search algorithm, we define two functions that operate over Boolean queries, extending the nextDoc and prevDoc methods of schema-dependent inverted indices.


Definitions for the NOT operator are more complicated, so we postpone their discussion until after the core algorithm has been introduced.
The nextSolution function, shown in Figure 4, locates the next solution to a Boolean query after a given position.

Figure 4: Function to locate the next solution to the Boolean query Q after
a given position. The function nextSolution calls docRight and docLeft
to generate a candidate solution. These functions make recursive calls that
depend on the structure of the query

The function calls docRight and docLeft to generate a candidate solution. Immediately after line 4, this candidate solution is represented by the interval [u, v]. If the candidate solution consists of a single document, it is returned; otherwise, the function makes a recursive call. Given this function, all solutions to the Boolean query Q may be generated as follows:
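Since Figure 4 and the driver loop are images, the following is a rough Python sketch, under stated assumptions, of how docRight, docLeft, and nextSolution fit together. The inverted index is assumed to expose nextDoc(t, u) and prevDoc(t, v), returning plus or minus infinity past the ends, and queries are represented as nested tuples such as ("AND", ("OR", "quarrel", "sir"), "you"); both the query representation and the toy index class are assumptions for illustration, not the book's exact pseudocode. NOT is deliberately omitted here, as in the text.

    INF = float("inf")

    def docRight(index, q, u):
        """Smallest candidate docid greater than u for query q."""
        if isinstance(q, str):                              # a bare term
            return index.nextDoc(q, u)
        op, children = q[0], q[1:]
        f = max if op == "AND" else min                     # AND needs all, OR needs any
        return f(docRight(index, c, u) for c in children)

    def docLeft(index, q, v):
        """Largest candidate docid smaller than v for query q."""
        if isinstance(q, str):
            return index.prevDoc(q, v)
        op, children = q[0], q[1:]
        f = min if op == "AND" else max
        return f(docLeft(index, c, v) for c in children)

    def nextSolution(index, q, position):
        """Next document after `position` that satisfies the Boolean query q."""
        v = docRight(index, q, position)
        if v == INF:
            return INF
        u = docLeft(index, q, v + 1)
        if u == v:                          # candidate interval [u, v] is a single document
            return u
        return nextSolution(index, q, v)

    def allSolutions(index, q):             # the driver loop the text refers to
        u = -INF
        while u < INF:
            u = nextSolution(index, q, u)
            if u < INF:
                yield u

    class DictIndex:
        """Toy index over {term: sorted list of docids}, for demonstration only."""
        def __init__(self, postings):
            self.postings = postings
        def nextDoc(self, t, u):
            return next((d for d in self.postings.get(t, []) if d > u), INF)
        def prevDoc(self, t, v):
            return next((d for d in reversed(self.postings.get(t, [])) if d < v), -INF)

    # Using the same hypothetical postings as the earlier Boolean example:
    idx = DictIndex({"quarrel": [1, 2], "sir": [1, 2, 3, 5], "you": [1, 3]})
    print(list(allSolutions(idx, ("AND", ("OR", "quarrel", "sir"), "you"))))   # [1, 3]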

Using a galloping search implementation of nextDoc and prevDoc, the time complexity of this algorithm is O(n·l·log(L/l)), where n is the number of terms in the query, and l and L are the lengths of the shortest and longest postings lists of the query terms, measured in number of documents, assuming a docid or frequency index is used and positional information is not stored in the index. The reasoning required to establish this time complexity is similar to that for our proximity ranking and phrase searching algorithms. Expressed in terms of the number of candidate solutions κ, the time complexity becomes O(n·κ·log(L/κ)), which reflects the adaptive nature of the algorithm. Although the call to docLeft in line 4 of the algorithm could be eliminated, it aids our analysis of the algorithm's complexity by explicitly defining a candidate solution.
We did not take the NOT operator into account when defining docRight and docLeft. In fact, implementing generalized versions of these functions is not required in order to support the NOT operator. Instead, a query can be transformed by applying De Morgan's laws, which push any NOT operators inward until they are immediately adjacent to the query terms:
NOT (A AND B) becomes (NOT A) OR (NOT B)
NOT (A OR B) becomes (NOT A) AND (NOT B)
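This transformation can be applied mechanically to a query tree. The following self-contained Python sketch illustrates the idea, using the same assumed nested-tuple query representation as before (an illustration only, not the book's code):

    def push_not(q, negate=False):
        """Push NOT operators inward via De Morgan's laws."""
        if isinstance(q, str):                   # a bare term
            return ("NOT", q) if negate else q
        op = q[0]
        if op == "NOT":
            return push_not(q[1], not negate)
        if negate:                               # flip AND/OR under negation
            op = "OR" if q[0] == "AND" else "AND"
        return (op,) + tuple(push_not(c, negate) for c in q[1:])

    print(push_not(("NOT", ("AND", "quarrel", ("OR", "sir", "you")))))
    # ('OR', ('NOT', 'quarrel'), ('AND', ('NOT', 'sir'), ('NOT', 'you')))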

This transformation does not change the number of AND and OR operators, and as a result does not change the number of terms in the query (n). After repeatedly applying De Morgan's laws, we are left with a query containing expressions of the form NOT t, where t is a term. To process queries containing expressions of this form, we need corresponding definitions of docRight and docLeft. These definitions could be expressed in terms of nextDoc and prevDoc.


Unfortunately, this strategy raises the possibility of inefficiencies.


Although such a definition works well when few documents contain the term t, it may perform poorly when most documents contain t, effectively reverting to the linear scan of the postings list that galloping search was intended to avoid. In addition, the corresponding implementation of docLeft(NOT t, v) requires a backward scan of the postings list, which violates the access pattern needed to benefit from galloping search.
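To make the inefficiency concrete, here is one hypothetical way (an illustration only, not the definition used in the book) to build nextDoc(NOT t, u) on top of nextDoc(t, ·), assuming docids form the consecutive range 1..num_docs. When almost every document contains t, the loop degenerates into the linear walk along t's postings list that galloping search is meant to avoid.

    INF = float("inf")

    def nextDocNot(index, t, u, num_docs):
        """First docid greater than u whose document does NOT contain t (naive)."""
        d = max(u, 0)
        while True:
            nxt = index.nextDoc(t, d)        # next document that DOES contain t
            if nxt > d + 1:                  # gap found: document d+1 lacks t
                return d + 1 if d + 1 <= num_docs else INF
            d = nxt                          # no gap yet; keep scanning forward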
Instead, we may implement the NOT operator directly over the data
structures, extending the methods supported by our inverted index with
explicit methods for nextDoc(NOT t, u) and prevDoc(NOT t, v).

1.5 DICTIONARIES AND TOLERANT RETRIEVAL


1.5.1 Search structures for dictionary
Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary and, if so, to locate the pointer to the corresponding postings. This vocabulary lookup operation uses a classical data structure called the dictionary, and it has two broad classes of solutions: hashing and search trees. In the data structure literature, the entries in the vocabulary (in our case, terms) are generally referred to as keys. The choice of solution (hashing or search trees) is governed by several questions: (1) How many keys are we likely to have? (2) Is the number likely to remain static or to change a great deal, and in the latter case, will new keys only be inserted, or will some keys also be deleted from the dictionary? (3) What are the relative frequencies with which various keys will be accessed?
Hashing has been used for dictionary lookup in some search engines. Each vocabulary term (key) is hashed into an integer over a space large enough that hash collisions are unlikely; collisions, if any, are resolved by auxiliary structures that can demand care to maintain. At query time, we hash each query term separately and follow a pointer to the corresponding postings, taking into account any logic for resolving hash collisions. There is no easy way to find minor variants of a query term (such as the accented and unaccented versions of a word like "resume"), since these could be hashed to very different integers. Finally, in a setting such as the Web, where the vocabulary keeps growing, a hash function designed for current needs may not suffice a few years from now.

Search trees overcome many of these problems; for instance, they allow us to enumerate all vocabulary terms beginning with automat. The best-known search tree is the binary tree, in which each internal node has two children. The search for a term begins at the root of the tree. Each internal node (including the root) represents a binary test, based on whose outcome the search proceeds to one of the two subtrees below that node. Figure 5 gives an example of a binary search tree used for a dictionary. Efficient search (with a number of comparisons that is O(log M), where M is the size of the vocabulary) hinges on the tree being balanced. To mitigate rebalancing, one approach is to allow the number of subtrees under an internal node to vary within a fixed interval.

Figure 5: A binary search tree. In this example the branch at the root partitions vocabulary terms into two subtrees, those whose first letter is between a and m, and the rest
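The key property the discussion above relies on, ordered access that supports prefix enumeration, can be illustrated with a sorted vocabulary array and binary search. This is only a flat stand-in for the search tree (like a tree walk, binary search costs O(log M) comparisons), and the sample terms are invented for illustration.

    from bisect import bisect_left

    vocabulary = sorted(["automata", "automatic", "automation",
                         "autumn", "colour", "resume"])

    def terms_with_prefix(prefix):
        """Enumerate all vocabulary terms beginning with `prefix`."""
        start = bisect_left(vocabulary, prefix)   # first term >= prefix
        result = []
        for i in range(start, len(vocabulary)):
            if not vocabulary[i].startswith(prefix):
                break
            result.append(vocabulary[i])
        return result

    print(terms_with_prefix("automat"))   # ['automata', 'automatic', 'automation']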

1.5.2 Wildcard queries


Wildcard queries are used in any of the following situations: (1) the user is uncertain of the spelling of a query term (e.g., Sydney vs. Sidney, which leads to the wildcard query S*dney); (2) the user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any of the variants (e.g., color vs. colour); (3) the user seeks documents containing variants of a term that would be caught by stemming, but is unsure whether the search engine performs stemming (e.g., judicial vs. judiciary, leading to the wildcard query judicia*); (4) the user is uncertain of the correct rendition of a foreign word or phrase (e.g., the query Universit* Stuttgart).
A trailing wildcard query, such as mon*, is one in which the * symbol occurs only once, at the end of the search string. Handling trailing wildcard queries is easy using a search tree on the dictionary: walking down the tree and following the symbols m, o, and n in turn, we can enumerate the set W of terms in the dictionary with the prefix mon. Finally, we use |W| lookups on the standard inverted index to retrieve all documents containing any term in W.

But what about wildcard queries in which the * symbol is not restricted to the end of the search string? Before handling this general case, we briefly generalize trailing wildcard queries. Consider first leading wildcard queries, or queries of the form *mon. Think of a reverse B-tree on the dictionary, one in which each root-to-leaf path corresponds to a term in the dictionary written backwards; for example, the term "lemon" would be represented by the path root-n-o-m-e-l. A walk down the reverse B-tree then enumerates all terms in the lexicon that end with a given suffix.
In fact, by combining a standard B-tree with a reverse B-tree we can handle an even more general case: wildcard queries containing a single * symbol, such as se*mon. To do this, we first use the standard B-tree to enumerate the set W of dictionary terms beginning with the prefix se, and then use the reverse B-tree to enumerate the set R of terms ending in the suffix mon. Taking the intersection W ∩ R of these two sets gives the set of terms that begin with the prefix se and end with the suffix mon. Finally, we use the standard inverted index to retrieve all documents containing any term in this intersection. We can thus handle wildcard queries containing a single * symbol using two B-trees, the standard B-tree and the reverse B-tree.
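The following Python sketch illustrates this prefix/suffix intersection. Two sorted term lists stand in for the forward and reverse B-trees; the sample vocabulary is invented for illustration and is not from the textbook.

    from bisect import bisect_left

    vocabulary = sorted(["salmon", "semon", "sermon", "session", "summon"])
    reversed_vocab = sorted(t[::-1] for t in vocabulary)   # terms spelled backwards

    def with_prefix(sorted_terms, prefix):
        start = bisect_left(sorted_terms, prefix)
        result = set()
        for i in range(start, len(sorted_terms)):
            if not sorted_terms[i].startswith(prefix):
                break
            result.add(sorted_terms[i])
        return result

    def single_wildcard(prefix, suffix):
        W = with_prefix(vocabulary, prefix)                               # terms starting with "se"
        R = {t[::-1] for t in with_prefix(reversed_vocab, suffix[::-1])}  # terms ending with "mon"
        return W & R                                                      # W ∩ R

    print(single_wildcard("se", "mon"))   # {'semon', 'sermon'}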

1.6 SUMMARY
Calvin Mooers coined the term "information retrieval" in 1950. From 1961 onwards, as computers came to be used for information handling, the term gained popularity among researchers. Information retrieval then came to mean the extraction of bibliographic information from databases of stored documents. Those early information retrieval systems, however, were really document retrieval systems: they were designed to establish the existence (or absence) of bibliographic documents relevant to a user's query. In other words, early IR systems were built to return a complete document (a book, an article, and so on) in answer to a search query. Although IR systems still operate in this way today, many advanced design techniques have been developed and implemented over time. The meaning of information retrieval has evolved over the years, and researchers and information professionals have used many different terms to describe it, including information access, text retrieval, information representation and retrieval, information processing and retrieval, and information storage and retrieval. In this unit we have also discussed the components of an information retrieval system, along with the issues associated with it.

1.7 LIST OF REFERENCES


1] Introduction to Information Retrieval, C. Manning, P. Raghavan, and H. Schütze, Cambridge University Press, 2008.
2] Modern Information Retrieval: The Concepts and Technology behind Search, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2nd Edition, ACM Press Books, 2011.
3] Search Engines: Information Retrieval in Practice, Bruce Croft, Donald Metzler and Trevor Strohman, 1st Edition, Pearson, 2009.
4] Information Retrieval: Implementing and Evaluating Search Engines, Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack, The MIT Press, Reprint edition (February 12, 2016).

1.8 UNIT END EXERCISES


1] Define an information retrieval system and describe its components.
2] Discuss the history of information retrieval systems.
3] What are the different issues related to information retrieval systems?
4] Explain Boolean retrieval.
5] Discuss dictionaries and tolerant retrieval.
6] Explain search structures for the dictionary.
7] What are wildcard queries?



