IR Question Bank
OBJECTIVES:
The student should be made to:
• Learn the information retrieval models.
• Be familiar with web search engines and be exposed to link analysis.
• Understand Hadoop and MapReduce.
• Learn document text mining techniques.
UNIT I INTRODUCTION 9
Information Retrieval – Early Developments – The IR Problem – The User's Task –
Information versus Data Retrieval - The IR System – The Software Architecture of the IR
System – The Retrieval and Ranking Processes - The Web – The e-Publishing Era – How the
web changed Search – Practical Issues on the Web – How People Search – Search Interfaces
Today – Visualization in Search Interfaces.
TEXT BOOKS:
1. C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University
Press, 2008.
2. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and
Technology behind Search, 2nd Edition, ACM Press Books, 2011.
3. Bruce Croft, Donald Metzler and Trevor Strohman, Search Engines: Information Retrieval in Practice,
1st Edition, Addison Wesley, 2009.
4. Mark Levene, An Introduction to Search Engines and Web Navigation, 2nd Edition, Wiley, 2010.
REFERENCES:
1. Stefan Buettcher, Charles L. A. Clarke, Gordon V. Cormack, Information Retrieval: Implementing and
Evaluating Search Engines, The MIT Press, 2010.
2. David A. Grossman and Ophir Frieder, "Information Retrieval: Algorithms and Heuristics" (The Information
Retrieval Series), 2nd Edition, Springer, 2004.
3. Manu Konchady, "Building Search Applications: Lucene, LingPipe, and Gate", First Edition, Mustru
Publishing, 2007.
UNIT I – INTRODUCTION
Information Retrieval – Early Developments – The IR Problem – The User's Task – Information versus
Data Retrieval - The IR System – The Software Architecture of the IR System – The Retrieval and Ranking
Processes - The Web – The e-Publishing Era – How the web changed Search – Practical Issues on the Web
– How People Search – Search Interfaces Today – Visualization in Search Interfaces.
PART * A
Q.No. Questions
Nonobjective terms are intended to reflect the information manifested in the document; there is no
agreement about the choice or degree of applicability of these terms.
4. Write the type of natural language technology used in information retrieval. BTL1
Two types:
• Natural language interfaces make the task of communicating with the information source easier,
allowing a system to respond to a range of inputs.
• Natural language text processing allows a system to scan the source texts, either to retrieve
particular information or to derive knowledge structures that may be used in accessing information
from the texts.
5. What are the search engine types? (April/May 2021) BTL1
A search engine is a document retrieval system designed to help find information stored in a computer system,
such as on the WWW. The search engine allows one to ask for content meeting specific criteria and retrieves
a list of items that match those criteria.
6. What is conflation? BTL1
Stemming is the process of reducing inflected words to their stem, base, or root form, generally a written
word form. The process of stemming is often called conflation.
7. What is an invisible web? BTL1
Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the
invisible web.
8. Define Zipf's law. BTL1
An empirical rule that describes the frequency of the text words. It states that the i-th most frequent word
appears as many times as the most frequent one divided by i^θ, for some θ > 1.
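A minimal sketch of the rule (the most-frequent-word count and θ below are assumed values, purely for illustration):

    f1 = 10000      # assumed frequency of the most frequent word
    theta = 1.5     # assumed exponent, theta > 1

    # Zipf's law: the i-th most frequent word appears about f1 / i**theta times
    for i in range(1, 6):
        print(i, round(f1 / i ** theta))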
PART * B
Architecture (2 M)
Components:
• Text operations (3 M)
• Indexing (2 M)
• Searching (2 M)
• Ranking (2 M)
• Query operations (2 M)
3. Explain the issues in the process of Information Retrieval. (8M) (Nov/Dec'17) (April/May 2021) BTL1
Answer: Page 4-6 - Bruce Croft
• Relevance (4 M)
• Evaluation (2 M)
• Emphasis information (2 M)
4. Discuss in detail about the framework of an Open Source Search Engine with necessary diagrams. (13M)
BTL2
Answer: Page 27-28 - Stefan Buettcher
Nutch:
• Architecture (2 M)
• Four major components (3 M)
• Searcher
• Indexer
• Database
• Fetcher
• Crawling (2 M)
• Indexing text (2 M)
• Indexing hypertext (1 M)
• Removing duplicates (1 M)
• Link analysis (1 M)
• Summarizing (1 M)
5. Explain the Role of AI in IR. (8M) (Nov/Dec'16) BTL3
Answer: Page 1.19 – 1.20 - I.A.Dhotre
• AI (1 M)
• NLP (2 M)
• NLP techniques (2 M)
• Two types - Natural language technology (2M)
• Applying NLP (2M)
6. Differentiate Information Retrieval from Web Search. (8M) (Nov/Dec'17) BTL4
Answer: Page 1-2 - Ricardo Baeza-Yates
Language, file types, document length, document structure, spam, amount of data, size, queries, ranking,
etc. (8M)
8. Write a short note on Characterizing the web for search. (8M) (Nov/Dec'16) BTL4
Answer: Page 367-371 - Ricardo Baeza-Yates
• Measuring web (4 M)
• Modeling web (4 M)
PART * C
1. What is a search engine? Explain with diagrammatic illustration the components of a search engine.
(15M) (Apr/May'18) (April/May 2021) BTL2
Answer: Page 13-16 - Bruce Croft
• Search engine-collecting information. (4 M)
• Architecture - search engine. (4 M)
• Indexing process. (4 M)
• Querying process. (3M)
2. Examine the various impacts of the Web on IR. (15M) BTL1
Answer: Page 27-28 - Stefan Buettcher
• Heterogeneity (4M)
• Web information (3 M)
• Inexperienced users (4 M)
• Software tools (2 M)
• Multimedia information (2 M)
PART * A
Q.No. Questions
word appears in the document. TF-IDF can be just this TF count in the naive case; to add the IDF effect,
multiply TF by log(number of documents / number of documents in which the word is present).
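A minimal sketch of this weighting (the toy corpus and whitespace tokenization are assumed, for illustration only):

    import math

    docs = ["tom and jerry are friends", "jack and tom are friends"]   # assumed corpus

    def tf_idf(word, doc):
        tf = doc.split().count(word)                 # times the word appears in the document
        df = sum(word in d.split() for d in docs)    # documents in which the word is present
        return tf * math.log(len(docs) / df)         # IDF effect described above

    print(tf_idf("jerry", docs[0]))   # rare word: positive weight
    print(tf_idf("tom", docs[0]))     # word in every document: weight 0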
20. Consider the two texts, "Tom and Jerry are friends" and "Jack and Tom are friends". What is the
cosine similarity for these two texts? (Apr/May'18) BTL1
Here are two very short texts to compare:
• "Tom and Jerry are friends"
• "Jack and Tom are friends"
We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We
begin by making a list of the words from both texts: Tom, and, Jerry, are, friends, Jack.
Now we count the number of times each of these words appears in each text:

Word      Text A   Text B
Tom         1        1
and         1        1
Jerry       1        0
are         1        1
friends     1        1
Jack        0        1

This gives the vectors:
A: [1,1,1,1,1,0]
B: [1,1,0,1,1,1]
The cosine similarity is their dot product over the product of their lengths: 4 / (√5 × √5) = 4/5 = 0.8.
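A minimal sketch that reproduces this computation by hand (no external libraries):

    import math

    a = [1, 1, 1, 1, 1, 0]   # "Tom and Jerry are friends"
    b = [1, 1, 0, 1, 1, 1]   # "Jack and Tom are friends"

    dot = sum(x * y for x, y in zip(a, b))       # 4
    norm_a = math.sqrt(sum(x * x for x in a))    # sqrt(5)
    norm_b = math.sqrt(sum(y * y for y in b))    # sqrt(5)
    print(dot / (norm_a * norm_b))               # 0.8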
PART * B
1. Explain Boolean model with an example. (April/May 2021) (8M) BTL1
Answer: Page 1-15 - C. Manning
• Boolean model (2 M)
• Boolean Relevance (2 M)
• Implementation (2 M)
• Advantages (2 M)
2. Explain vector space retrieval model with an example. (April/May 2021) (10M) (Apr/May'17) BTL1
Answer: Page No. 237-243 - Bruce Croft
• Vector space retrieval model concepts (2 M)
• Vector representation – documents, Queries (2 M)
• Vector space retrieval model (2 M)
• Advantage (2 M)
• Disadvantage (2 M)
3. Briefly explain weighting. (8M) (Nov/Dec'16) BTL1
Answer: Page No. 107-110 - C. Manning
• Weighting scheme (4 M)
• Idf (2 M)
• Tf-idf weighting (2 M)
4. Briefly explain cosine similarity. (8M) (Nov/Dec'16) BTL1
Answer: Page 2.11 - I.A.Dhotre
• Metric frequently - find similarity (4 M)
• Cosine similarity representation (4 M)
5. Explain the inverted indices in detail. (8M) BTL1
Answer: Page No. 33-45 - Stefan Buettcher
• Phrase search (2 M)
• Implementing inverted indices (4 M)
• Documents and other elements (2 M)
6. Explain the Probabilistic IR model in detail. (8M) (Nov/Dec'17) BTL1
Answer: Page 201-216 - C. Manning
• Basic probability theory (2 M)
• Probability Ranking Principle (4 M)
• Binary independence model (2 M)
7. Give an example for latent semantic indexing and explain the same. (13M) (Apr/May'17) (April/May
2021) BTL1
Answer: Page 2.18-2.19 - I.A.Dhotre
• Semantic indexing (4 M)
• General idea (4 M)
• Goal- indexing (4 M)
• Example (1 M)
8. Write about query expansion. (8M) (Nov/Dec'16) BTL1
• Query Expansion (2 M)
• Automatic thesaurus generation ( 4M)
• Example (2 M)
PART * C
What is relevance feedback? Explain the different types of relevance feedback and also explain with
an example an algorithm for relevance feedback. (15 M) (Nov/Dec’16) (April/May 2021) BTL1
Answer: Page 163 to 172 - C. Manning
PART * A
Q.No. Questions
A focused crawler or topical crawler is a web crawler that attempts to download only pages that are relevant
to a pre-defined topic or set of topics.
10. What is hard and soft focused crawling? BTL1
In hard focused crawling the classifier is invoked on a newly crawled document in a standard manner. When
it returns the best matching category path, the out-neighbors of the page are checked into the database if and
only if some node on the best matching category path is marked as good.
In soft focused crawling all out-neighbors of a visited page are checked into DB2, but their crawl priority
is based on the relevance of the current page.
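A minimal sketch of the soft variant's priority rule (the classifier scores and page names below are made-up stand-ins, for illustration only):

    import heapq

    def relevance(page):
        # Stand-in for the topic classifier's score of the current page
        return {"p1": 0.9, "p2": 0.2}.get(page, 0.5)

    frontier = []   # max-priority queue via negated scores

    def enqueue_out_neighbors(page, out_links):
        # Soft focused crawling: every out-neighbor is enqueued, with
        # priority inherited from the relevance of the linking page
        for link in out_links:
            heapq.heappush(frontier, (-relevance(page), link))

    enqueue_out_neighbors("p1", ["a", "b"])
    enqueue_out_neighbors("p2", ["c"])
    print(heapq.heappop(frontier))   # (-0.9, 'a'): neighbors of relevant pages come first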
Distinguish between SEO and Pay-per-click. BTL1
In many cases, several different XML schemas occur in a collection since the XML documents in an IR
application often come from more than one source. This phenomenon is called schema heterogeneity or
schema diversity.
15. What are the politeness policies used in web crawling? (Nov/Dec'17) BTL1
Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them.
These politeness policies must be respected.
16. Define an inverted index. (Apr/May'17) BTL1
An inverted index is, for each word, a list of all documents containing that word.
• The index may be a bit vector.
• It may also contain the location(s) of the word in the document.
17. What is inversion in indexing process? (Nov/Dec'18) BTL1
An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a
mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set
of documents (named in contrast to a forward index, which maps from documents to content). The purpose of
an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is
added to the database. The inverted file may be the database file itself, rather than its index. It is the most
popular data structure used in document retrieval systems used on a large scale, for example in search engines.
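A minimal sketch of building such a mapping (toy documents assumed, for illustration only):

    from collections import defaultdict

    docs = {1: "tom and jerry are friends", 2: "jack and tom are friends"}   # assumed collection

    index = defaultdict(set)          # word -> set of document ids (the postings)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    print(sorted(index["tom"]))       # [1, 2]
    print(sorted(index["jerry"]))     # [1]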
18. How do spammers use cloaking to serve spam to the web users? (Apr/May'17) BTL1
Search engines soon became sophisticated enough in their spam detection to screen out a large number of
repetitions of particular keywords. Spammers responded with a richer set of spam techniques, the best known
of which is cloaking. Here, the spammer's web server returns different pages depending on whether the HTTP
request comes from a web search engine's crawler (the part of the search engine that gathers web pages) or
from a human user's browser.
19. Can a digest of the characters in a web page be used to detect near-duplicate web pages? Why? BTL1
Yes. The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a
succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages
are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy
of the other. This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near
duplication. A solution to the problem of detecting near-duplicate web pages is the shingling technique.
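A minimal sketch of the shingling idea (shingle size and texts are assumed; real systems hash the shingles and compare compact sketches of the hash sets):

    def shingles(text, k=3):
        # k-word shingles: every run of k consecutive words
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    a = shingles("a rose is a rose is a rose")
    b = shingles("a rose is a flower which is a rose")

    # Jaccard overlap of the shingle sets estimates near-duplication
    print(len(a & b) / len(a | b))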
20. Mention the processing steps in indexing process. BTL1
Indexing process:
• Input: list of normalized tokens for each document
• Sort the terms alphabetically
• Merge multiple occurrences of the same term
• Record the frequency of occurrence of the term in the document
o not needed by Boolean models
o used by vector space models
• Group instances of the same term and split dictionary and postings
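A minimal sketch of these steps over assumed (term, document) pairs, for illustration only:

    # Input: normalized tokens tagged with their document ids
    pairs = [("tom", 1), ("and", 1), ("jerry", 1), ("jack", 2), ("and", 2), ("tom", 2)]

    pairs.sort()                                  # sort the terms alphabetically
    index = {}
    for term, doc in pairs:                       # merge occurrences, record frequency
        index.setdefault(term, {})
        index[term][doc] = index[term].get(doc, 0) + 1

    # Dictionary = the sorted terms; postings = per-term {doc: frequency} lists
    for term, postings in index.items():
        print(term, postings)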
21. List out the bitwise methods used in compression schemes for inverted lists. BTL1
• Unary coding
• Elias gamma coding
• Delta coding
• Golomb coding
• Variable-byte scheme
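Of these, the variable-byte scheme is the simplest to sketch (one common formulation, assumed here: 7 data bits per byte, with the high bit set on the final byte of each number):

    def vbyte_encode(n):
        # Split n into 7-bit groups; mark the last byte by adding 128
        out = []
        while True:
            out.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        out[-1] += 128
        return bytes(out)

    print(vbyte_encode(5).hex())     # '85'   (one byte)
    print(vbyte_encode(824).hex())   # '06b8' (two bytes)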
PART * B
1. Write short note on web search overview and web structure. (8M) BTL1
Answer: Page No. 507-512 - Stefan Buettcher
• Web structure (4M)
• Bow-Tie Structure (4M)
Write a short note on the user and paid placement. (8M) BTL1
Answer: Page 3.3 – 3.4 - I.A. Dhotre
PART * C
1. Draw the basic crawler architecture and describe its components and also explain the working of a web
crawler with an example. (15M) (Apr/May'18) BTL1
Answer: Page No. 541-549 - Stefan Buettcher
• Web crawler definition (4M)
• Features- Crawler (4M)
• Basic operation (4M)
• Web Crawler Architecture (3M)
2. Write short notes on i) Deep Web ii) Site maps iii) Distributed crawling (15M) (Nov/Dec'17) BTL2
Answer: Page 41 to 45 - Bruce Croft
PART * A
Q.No. Questions
• Write once and read many data. It allows for parallelism without mutexes.
• Map and Reduce are the main operations: Simple code
• All the map should be completed before reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• Number of map tasks and reduce tasks are configurable.
• Operations are provisioned near the data.
• Commodity hardware and storage.
10. What are the limitations of Hadoop/MapReduce? BTL1
• Cannot control the order in which the maps or reductions are run.
• For maximum parallelism, you need Maps and Reduces to not depend on data generated in the same
MapReduce job.
• A database with an index will always be faster than a MapReduce job on unindexed data.
• Reduce operations do not take place until all Maps are complete.
• General assumption that the output of Reduce is smaller than the input to Map: a large data source is
used to generate smaller final values.
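A minimal sketch of the map/shuffle/reduce pattern itself (plain Python standing in for Hadoop, word count as the usual example; note that no reduce runs until all map output has been grouped, as the bullets above describe):

    from itertools import groupby

    docs = ["tom and jerry", "jack and tom"]                      # assumed toy input

    # Map phase: emit (word, 1) pairs from every document
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle phase: group all pairs by key before any reduce starts
    mapped.sort(key=lambda kv: kv[0])

    # Reduce phase: sum the counts for each word
    for word, pairs in groupby(mapped, key=lambda kv: kv[0]):
        print(word, sum(count for _, count in pairs))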
11. What is Cross-Lingual Retrieval? (Apr/May'18) BTL1
Cross-lingual retrieval is the retrieval of documents that are in a different language from the one in which
the query is expressed. This allows users to search document collections in multiple languages and retrieve
relevant information in a form that is useful to them, even when they have little or no linguistic competence
in the target languages.
12. Define Snippets. (Nov/Dec'17) BTL1
Snippets are short fragments of text extracted from the document content or its metadata. They may be
static or query-based. A static snippet always shows, for example, the first 50 words of the document, or the
content of its description metadata, or a description taken from a directory site such as dmoz.org.
13. List the advantages of invisible web content. BTL1
• Specialized content focus – large amounts of information focused on an exact subject.
• Contains information that might not be available on the visible web.
• Allows a user to find a precise answer to a specific question.
• Allows a user to find web pages from a specific date or time.
14. What is collaborative filtering? BTL1
Collaborative filtering is a method of making automatic predictions about the interests of a single user by
collecting preferences or taste information from many users. It uses rating data given by many users for many
items as the basis for predicting missing ratings and/or for creating a top-N recommendation list for a given
user, called the active user.
15. What do you mean by item-based Content based Filtering? (Apr/May'17) (April/May 2021) BTL1
Item-based CF is a model-based approach which produces recommendations based on the relationship
between items inferred from the rating matrix. The assumption behind this approach is that users will prefer
items that are similar to other items they like.
16. What are the two problems of user based CF? BTL1
The two main problems of user-based CF are that the whole user database has to be kept in memory and that
expensive similarity computation between the active user and all other users in the database has to be
performed.
17. Define user based collaborative Filtering. (Nov/Dec'16) (April/May 2021) BTL1
User-based collaborative filtering is defined as algorithms that work off the premise that if a user (A) has a
similar profile to another user (B), then A is more likely to prefer things that B prefers when compared with
a user chosen at random.
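A minimal sketch of that premise (the rating vectors are made-up stand-ins; cosine similarity between users picks the neighbor whose preferences inform the prediction):

    import math

    # Assumed ratings over the same four items (0 = unrated)
    ratings = {
        "A": [5, 3, 0, 1],
        "B": [4, 3, 0, 1],
        "C": [1, 0, 5, 4],
    }

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv)

    # B's profile is far more similar to A's than C's is, so B's
    # preferences drive recommendations for the active user A
    for other in ("B", "C"):
        print(other, round(cosine(ratings["A"], ratings[other]), 2))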
PART * B
Write short notes on i) Link analysis ii) Hubs and Authorities (13M) BTL1
Answer: Page: 461-477 - C. Manning
Link analysis
1. Explain in detail about Community-based Question Answering system. (15M) (Nov/Dec'17) &
(Apr/May'18) BTL2
Answer: Page: 4.85 to 4.91 - N.Jayanthi
• Community-based Question. (3M)
• Natural Language Annotations. (3M)
• The AI lab, factual queries. (3M)
• Open domain QA. (2M)
• Redundancy based method. (2M)
• Semantic Headers. (2M)
2. Write short notes on i) Handling invisible web ii) Snippet generation iii) Summarization (15M) BTL2
Answer: Page: 4.25 – 4.26 - I.A.Dhotre
(i) Handling invisible web
(i) Handling invisible web
• Using the search engine- find hidden content. (4M)
(ii) Snippet generation