IR Question Bank

Uploaded by

Amaya Ema

REGULATION : 2017 ACADEMIC YEAR : 2021-2022

CS8080 INFORMATION RETRIEVAL TECHNIQUES    L T P C: 3 0 0 3

OBJECTIVES:
The student should be made to:
• Learn the information retrieval models.
• Be familiar with Web Search Engines and be exposed to Link Analysis.
• Understand Hadoop and MapReduce.
• Learn document text mining techniques.
UNIT I INTRODUCTION 9
Information Retrieval – Early Developments – The IR Problem – The User's Task –
Information versus Data Retrieval - The IR System – The Software Architecture of the IR
System – The Retrieval and Ranking Processes - The Web – The e-Publishing Era – How the
web changed Search – Practical Issues on the Web – How People Search – Search Interfaces
Today – Visualization in Search Interfaces.

UNIT II MODELLING AND RETRIEVAL EVALUATION 9


Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse Document Frequency)
Weighting - Vector Model – Probabilistic Model – Latent Semantic Indexing Model – Neural
Network Model – Retrieval Evaluation – Retrieval Metrics – Precision and Recall – Reference
Collection – User-based Evaluation – Relevance Feedback and Query Expansion – Explicit
Relevance Feedback.

UNIT III TEXT CLASSIFICATION AND CLUSTERING 9


A Characterization of Text Classification – Unsupervised Algorithms: Clustering – Naïve Text
Classification – Supervised Algorithms – Decision Tree – k-NN Classifier – SVM Classifier –
Feature Selection or Dimensionality Reduction – Evaluation metrics – Accuracy and Error –
Organizing the classes – Indexing and Searching – Inverted Indexes – Sequential Searching –
Multi-dimensional Indexing.

UNIT IV WEB RETRIEVAL AND WEB CRAWLING 9


The Web – Search Engine Architectures – Cluster based Architecture – Distributed
Architectures – Search Engine Ranking – Link based Ranking – Simple Ranking Functions –
Learning to Rank – Evaluations – Search Engine Ranking – Search Engine User Interaction –
Browsing – Applications of a Web Crawler – Taxonomy – Architecture and Implementation –
Scheduling Algorithms – Evaluation.

UNIT V RECOMMENDER SYSTEMS 9


Recommender Systems Functions – Data and Knowledge Sources – Recommendation
Techniques – Basics of Content-based Recommender Systems – High Level Architecture –
Advantages and Drawbacks of Content-based Filtering – Collaborative Filtering – Matrix
factorization models – Neighborhood models.
TOTAL: 45 PERIODS
OUTCOMES:
Upon completion of the course, students will be able to
▪ Apply information retrieval models.
▪ Design Web Search Engine.
▪ Use Link Analysis.
▪ Use Hadoop and Map Reduce.
▪ Apply document text mining techniques.
JIT-2106/CSE/4th Yr/SEM 08/CS8080/INFORMATION RETRIEVAL TECHNIQUES/UNIT 1-5/QB+Keys/Ver2.0

TEXT BOOKS:
1. C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University
Press, 2008.
2. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval: The Concepts and
Technology behind Search, 2nd Edition, ACM Press Books, 2011.
3. Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice,
1st Edition, Addison Wesley, 2009.
4. Mark Levene, An Introduction to Search Engines and Web Navigation, 2nd Edition, Wiley, 2010.

REFERENCES:
1. Stefan Buettcher, Charles L. A. Clarke, and Gordon V. Cormack, Information Retrieval: Implementing and
Evaluating Search Engines, The MIT Press, 2010.
2. Ophir Frieder, Information Retrieval: Algorithms and Heuristics (The Information Retrieval Series), 2nd
Edition, Springer, 2004.
3. Manu Konchady, Building Search Applications: Lucene, LingPipe, and Gate, First Edition, Mustru,
2007.


Subject Code: CS8080 Year/Semester: IV/08


Subject Name: INFORMATION RETRIEVAL TECHNIQUES

UNIT I – INTRODUCTION
Information Retrieval – Early Developments – The IR Problem – The User's Task – Information versus
Data Retrieval - The IR System – The Software Architecture of the IR System – The Retrieval and Ranking
Processes - The Web – The e-Publishing Era – How the web changed Search – Practical Issues on the Web
– How People Search – Search Interfaces Today – Visualization in Search Interfaces.

PART * A

Q.No. Questions

1. What is information retrieval? (Nov/Dec’16) (April/May 2021)
Information Retrieval is finding material of an unstructured nature that satisfies an information need from
within large collections.
2. List out the components of IR block diagram. BTL1
• Input – Stores only a representation of the document.
• Document representative – Could be a list of extracted words considered to be significant.
• Processor – Involved in performing the actual retrieval function.
• Feedback – Improves subsequent retrieval runs by letting the user refine the request after a sample retrieval.
• Output – A set of document numbers.
3. What are objective and nonobjective terms? BTL1
Objective Terms – Are extrinsic to semantic content, and there is generally no disagreement about how to
assign them.
Nonobjective Terms – Are intended to reflect the information manifested in the document, and there is no
agreement about the choice or degree of applicability of these terms.
4. Write the types of natural language technology used in information retrieval. BTL1
Two types:
• Natural language interfaces make the task of communicating with the information source easier,
allowing a system to respond to a range of inputs.
• Natural language text processing allows a system to scan the source texts, either to retrieve
particular information or to derive knowledge structures that may be used in accessing information
from the texts.
5. What are the search engine types? (April/May 2021) BTL1
A search engine is a document retrieval system designed to help find information stored in a computer system,
such as on the WWW. The search engine allows one to ask for content meeting specific criteria and retrieves
a list of items that match those criteria.
6. What is conflation? BTL1
Stemming is the process of reducing inflected words to their stem, base or root form, generally a written
word form. The process of stemming is often called conflation.
7. What is an invisible web? BTL1
Many dynamically generated sites are not indexable by search engines; this phenomenon is known as the
invisible web.
8. Define Zipf’s law. BTL1
An empirical rule that describes the frequency of the text words. It states that the i-th most frequent word appears
as many times as the most frequent one divided by i^θ, for some θ ≥ 1.
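The rule can be illustrated with a short Python sketch. The function name `zipf_expected_count` and the sample counts are assumptions made for illustration; they do not come from the question bank or the textbooks.

```python
def zipf_expected_count(top_count: float, rank: int, theta: float = 1.0) -> float:
    """Expected occurrences of the rank-th most frequent word under Zipf's law.

    top_count is the count of the most frequent word (rank 1);
    theta is the exponent in f(i) = f(1) / i**theta.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return top_count / rank ** theta


# Hypothetical corpus in which the most frequent word occurs 1000 times.
for rank in (1, 2, 10):
    print(rank, zipf_expected_count(1000, rank, theta=1.0))
```

With θ = 1 the second most frequent word is expected about half as often as the first, and the tenth about one tenth as often.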


9. What is open source software? BTL1
Open source software is software whose source code is available for modification or enhancement by anyone.
"Source code" is the part of software that most computer users don't ever see; it's the code computer
programmers can manipulate to change how a piece of software (a "program" or "application") works.
Programmers who have access to a computer program's source code can improve that program by adding
features to it or fixing parts that don't always work correctly.
10. What is proprietary software? BTL1
Proprietary software is computer software which is the legal property of one party. The terms of use for other
parties are defined by contracts or licensing agreements.
These terms may include various privileges to share, alter, disassemble, and use the software and its code.
11. What is closed software? BTL1
Closed software is a term for software whose license does not allow for the release or distribution of the
software’s source code. Generally it means only the binaries of a computer program are distributed and the
license provides no access to the program's source code. The source code of such programs is usually regarded
as a trade secret of the company. Access to source code by third parties commonly requires the party to sign
a non-disclosure agreement.
12. List the advantages of open source. BTL1
• The right to use the software in any way.
• There is usually no license cost.
• The source code is open and can be modified freely.
• Open standards.
• It provides higher flexibility.
13. List the disadvantages of open source. BTL1
• There is no guarantee that development will happen.
• It is sometimes difficult to know that a project exists, and its current status.
• No secured follow-up development strategy.
14. What are the reasons for selecting open software? BTL1
• Development and maintenance of open source software is a community-based activity.
• Open source software licenses are copyright protected; they ensure the user's freedom to use,
modify and distribute the programs.
• It is interoperable, customizable according to needs, and fulfills software industry standards.
• Open source software allows everyone to use, study, modify and distribute the software.
• It allows a broader perspective when it comes to support.
15. What do you mean by Apache License? BTL1
The Apache License is a free software license written by the Apache Software Foundation (ASF). The name
Apache is a registered trademark and may only be used with the trademark holder's express permission.
Apache Lucene, which is released under this license, is a high-performance, full-featured text search engine
library written entirely in Java.
16. Write the features of GPL version 2. BTL1
• It gives permission to copy and distribute the program's unmodified source code. It allows
modifying the program's source code and distributing the modified source code.
• Users may distribute compiled versions of the program, both modified and unmodified.
• All modified copies must be distributed under the GPL v2.
• All compiled versions of the program must be accompanied by the relevant source code.
17. Outline the impact of the Web on information retrieval. (Apr/May’17) BTL1
Finding the documents relevant to user queries. Technically, IR studies the acquisition, organization, storage,
retrieval and distribution of information.


18. Specify the role of an IR system. (Nov/Dec’17) BTL1
The role of an IR system is to support the user in:
• Exploring a problem domain, understanding its terminology, concepts and structure
• Clarifying, refining and formulating an information need
• Finding documents that match the information need description
  • As many relevant documents as possible
  • As few non-relevant documents as possible
19. What is Peer-to-peer Search? (Nov/Dec’17) BTL1
Peer-to-peer search currently means storing files in a directory that is accessible by people outside a
local network. Essentially, this is file sharing with the entire Internet. Other uses include reducing corporate
server bandwidth bottlenecks, storing enterprise data, distributed processing, knowledge management
aggregation, collaboration, automatically distributing updates for software and real-time updating such as
auctions and news syndication.
20. What are the performance measures for search engine? (Nov/Dec’17) BTL1
• Effectiveness (quality of results)
• Efficiency (response time and throughput)
21. Give any two advantages of using artificial intelligence in information retrieval tasks. (Apr/May’18)
BTL1
• Focused on the representation of knowledge, reasoning and intelligent action
• Formalisms for representing knowledge and queries
22. How does the large amount of information available in web affect information retrieval system
implementation? (Apr/May’18) BTL1
• Distributed data
• Volatile data
• Large volume
• Unstructured and redundant data
• Quality of data
• Heterogeneous data
23. State the difficulties faced in Information retrieval. BTL1
• Vocabulary mismatch (synonymy, polysemy)
• Queries are ambiguous; they are partial specifications of the user’s need
• Content representation may be inadequate and incomplete
• The notion of relevance is imprecise, context- and user-dependent

PART * B

1. Appraise the history of IR. (8M) (Apr/May ‘17) BTL5
Answer: Page 6-9 - Ricardo Baeza-Yates
• Finding material - unstructured nature (4 M)
• Historical milestones (4 M)
2. Explain in detail about the components of IR. (13 M) (Nov/Dec ‘16) BTL1
Answer: Page No. 9-10 - Ricardo Baeza-Yates
Architecture (2 M)
Components:
• Text operations (3 M)
• Indexing (2 M)
• Searching (2 M)
• Ranking (2 M)
• Query operations (2 M)
3. Explain the issues in the process of Information Retrieval. (8M) (Nov/Dec’17) (April/May 2021) BTL1
Answer: Page 4-6 - Bruce Croft
• Relevance (4 M)
• Evaluation (2 M)
• Emphasis on users and their information needs (2 M)
4. Discuss in detail about the framework of Open Source Search engine with necessary diagrams. (13M)
BTL2
Answer: Page 27-28 - Stefan Buettcher
Nutch:
• Architecture (2 M)
• Four major components (3 M)
  • Searcher
  • Indexer
  • Database
  • Fetcher
• Crawling (2 M)
• Indexing text (2 M)
• Indexing hypertext (1 M)
• Removing duplicates (1 M)
• Link analysis (1 M)
• Summarizing (1 M)
5. Explain the Role of AI in IR. (8M) (Nov/Dec’16) BTL3
Answer: Page 1.19 – 1.20 - I.A.Dhotre
• AI (1 M)
• NLP (2 M)
• NLP techniques (2 M)
• Two types - Natural language technology (2M)
• Applying NLP (2M)
6. Differentiate Information Retrieval from Web Search. (8M) (Nov/Dec'17) BTL4
Answer: Page 1-2 - Ricardo Baeza-Yates
Language, file types, document length, document structure, spam, amount of data, size, queries, ranking,
etc. (8M)
8. Write a short note on Characterizing the web for search. (8M) (Nov/Dec’16) BTL4
Answer: Page 367-371 - Ricardo Baeza-Yates
• Measuring web (4 M)
• Modeling web (4 M)
PART * C
1. What is search engine? Explain with diagrammatic illustration the components of a search engine.
(15M) (Apr/May’18) (April/May 2021) BTL2
Answer: Page 13-16 - Bruce Croft
• Search engine - collecting information. (4 M)
• Architecture - search engine. (4 M)
• Indexing process. (4 M)
• Querying process. (3M)
2. Examine the various impact of WEB on IR. (15M) BTL1
Answer: Page 27-28 - Stefan Buettcher
• Heterogeneity (4M)
• Web information (3 M)
• Inexperienced users (4 M)
• Software tools (2 M)
• Multimedia information (2 M)


UNIT II - MODELING AND RETRIEVAL EVALUATION


Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse Document Frequency) Weighting -
Vector Model – Probabilistic Model – Latent Semantic Indexing Model – Neural Network Model –
Retrieval Evaluation – Retrieval Metrics – Precision and Recall – Reference Collection – User-based
Evaluation – Relevance Feedback and Query Expansion – Explicit Relevance Feedback.

PART * A

Q.No. Questions

1. What do you mean by information retrieval models? BTL1
A retrieval model can be a description of either the computational process or the human process of retrieval:
the process of choosing documents for retrieval; the process by which information needs are first articulated
and then refined.
2. What is cosine similarity? BTL1
Cosine similarity is a metric frequently used to determine the similarity between two documents. It is the
cosine of the angle between the documents' term vectors. Because two documents usually share many common
words and differ in length, raw overlap counts are misleading; the cosine normalizes for document length.
3. What is language model based IR? BTL1
A language model is a probabilistic mechanism for generating text. Language models estimate the probability
distribution of various natural language phenomena.
4. Define unigram language model. BTL1
A unigram (1-gram) language model makes the strong independence assumption that words are generated
independently from a multinomial distribution.
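The independence assumption can be sketched in a few lines of Python. This is a minimal maximum-likelihood sketch without smoothing; the helper names `train_unigram` and `sequence_probability` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def train_unigram(tokens):
    """Estimate P(word) by relative frequency (maximum likelihood, no smoothing)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sequence_probability(model, tokens):
    """P(w1..wn) as the product of per-word probabilities (independence assumption).

    Unseen words get probability 0 because this sketch applies no smoothing.
    """
    probability = 1.0
    for word in tokens:
        probability *= model.get(word, 0.0)
    return probability


model = train_unigram("the cat sat on the mat".split())
print(sequence_probability(model, ["the", "cat"]))  # (2/6) * (1/6)
```

In a retrieval setting, one such model would typically be estimated per document, and documents would be ranked by the probability their model assigns to the query; practical systems add smoothing so unseen query words do not zero out the score.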
5. What are the characteristics of relevance feedback? BTL1
• It shields the user from the details of the query reformulation process.
• It breaks down the whole searching task into a sequence of small steps which are easier to grasp.
• It provides a controlled process designed to emphasize some terms and de-emphasize others.
6. What are the assumptions of vector space model? BTL1
• The degree of matching can be used to rank-order documents.
• This rank-ordering corresponds to how well a document satisfies a user’s information needs.
7. What are the disadvantages of Boolean model? BTL1
• It is not simple to translate an information need into a Boolean expression.
• Exact matching may lead to retrieval of too few or too many documents.
• The retrieved documents are not ranked.
• The model does not use term weights.
8. Define term frequency. (April/May 2021) BTL1
Term frequency: the number of times a term (for example, a query keyword) occurs in a document.
9. Explain Luhn’s ideas. BTL1
Luhn’s basic idea of using various properties of texts, including statistical ones, was critical in opening up
the handling of input by computers for IR. Automatic input joined the already automated output.
10. What is stemming? Give example. (Apr/May’17) BTL1
Stemming is the process of replacing all the variants of a word with the single stem of the word. Variants include
plurals, gerund forms (ing-form), third person suffixes, past tense suffixes, etc. Example: connect: connects,
connected, connecting, connection, etc. Stemming is based either on linguistic dictionaries or on
algorithms.
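The "connect" example can be reproduced with a toy suffix stripper. This is only an illustrative sketch: the suffix list and the minimum stem length of 3 are assumptions, and it is not the Porter stemmer or any dictionary-based method.

```python
# Toy suffix list, checked in this order; chosen only to cover the example words.
SUFFIXES = ("ion", "ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


variants = ["connect", "connects", "connected", "connecting", "connection"]
print({word: toy_stem(word) for word in variants})
```

All five variants conflate to the single stem "connect", so a query on any one of them can match documents containing the others.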


11. What is Recall? BTL1
Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents
in the collection.
12. What is precision? BTL1
Precision is the ratio of the number of relevant documents retrieved to the total number of documents
retrieved.
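Both ratios share the same numerator (relevant documents retrieved) and differ only in the denominator: precision divides by the number of documents retrieved, recall by the number of relevant documents in the collection. A minimal Python sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 relevant in the collection, 2 in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(p, r)  # 0.5 0.4
```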
13. Explain Latent semantic Indexing. BTL1
Latent Semantic Indexing is a technique that projects queries and documents into a space with “latent”
semantic dimensions. It is a statistical method for automatic indexing and retrieval that attempts to solve the
major problems of the current technology. It is intended to uncover latent semantic structure in the data that
is hidden. It creates a semantic space wherein terms and documents that are associated are placed near one
another.
14. List the retrieval models. (Nov/Dec’16) BTL1
• Boolean model
• Vector space model
• Language model
• Probabilistic model
15. Define document preprocessing. (Nov/Dec’16) BTL1
Document preprocessing is a procedure which takes documents as input and converts them into a structure
through a number of transformations.
Document preprocessing includes 5 stages of transformation:
• Lexical analysis
• Stop word elimination
• Stemming
• Index-term selection
• Construction of term categorization structures
16. Define an inverted index. (Apr/May’17) (April/May 2021) BTL1
An inverted index maps each word to a list of all documents containing that word.
• The index may be a bit vector.
• It may also contain the location(s) of the word in the document.
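A positional inverted index of this kind can be sketched as a nested dictionary. The function name `build_inverted_index`, the whitespace tokenizer, and the three sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} using lowercase whitespace tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index


# Hypothetical three-document collection.
docs = {1: "new home sales", 2: "home prices rise", 3: "new prices"}
index = build_inverted_index(docs)
print(sorted(index["home"]))    # documents containing "home": [1, 2]
print(dict(index["prices"]))    # positions per document: {2: [1], 3: [1]}
```

Looking up a term returns its postings directly, so query processing does not need to scan every document.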
17. What is Zone index? (Nov/Dec’17) BTL1
Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take
on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For
instance, document titles and abstracts are generally treated as zones.
18. State Bayes rule. (Nov/Dec’17) BTL1
The theorem provides a way to revise existing predictions or theories given new or additional evidence. In
finance, Bayes' theorem can be used to rate the risk of lending money to potential borrowers.
The formula is as follows:
P(A|B) = P(B|A) × P(A) / P(B)
where P(A|B) is the probability of A given that B has occurred, and P(B) > 0.
Bayes' theorem is also called Bayes' Rule or Bayes' Law.


19. Can the tf-idf weight of a term in a document exceed 1? Why? (Apr/May’18) BTL1
Yes, the tf-idf weight of a term in a document can exceed 1. TF-IDF is a family of measures for scoring a term
with respect to a document (relevance). The simplest form of TF(word, document) is the number of times the
word appears in the document, which is unbounded. The IDF factor can be taken as 1 in the naive case; to add
the IDF effect, multiply by log(number of documents / number of documents in which the word is present).
For example, a term that occurs 3 times in a document and appears in 10 of 1000 documents has weight
3 × log10(1000/10) = 6.
20. Consider the two texts, “Tom and Jerry are friends” and “Jack and Tom are friends”. What is the
cosine similarity for these two texts? (Apr/May’18) BTL1
Here are two very short texts to compare:
• A: “Tom and Jerry are friends”
• B: “Jack and Tom are friends”
We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We
begin by making a list of the words from both texts:
Tom, and, Jerry, are, friends, Jack
Now we count the number of times each of these words appears in each text:
Word     A  B
Tom      1  1
and      1  1
Jerry    1  0
are      1  1
friends  1  1
Jack     0  1
A: [1,1,1,1,1,0]
B: [1,1,0,1,1,1]
The dot product is A · B = 4 and each vector has length √5, so the cosine similarity for these two texts is
4 / (√5 × √5) = 0.80.
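The calculation can be checked with a short Python sketch that builds word-count vectors and applies cos(A, B) = (A · B) / (|A| |B|). The function name and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[word] * b[word] for word in set(a) | set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


similarity = cosine_similarity("Tom and Jerry are friends", "Jack and Tom are friends")
print(round(similarity, 2))  # 4 shared words / (sqrt(5) * sqrt(5)) = 0.8
```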

21. What is Boolean Model? BTL1
Index terms are considered to be either present or absent in a document and to provide equal evidence with
respect to information needs. A document is represented as a set of keywords. Queries are Boolean expressions
of keywords connected by AND, OR and NOT, including the use of brackets to indicate scope. Example:
[[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
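The example query can be evaluated over documents represented as keyword sets. The four documents below are hypothetical, and terms are lowercased for matching; this is a sketch of exact-match Boolean retrieval, not an optimized implementation.

```python
# Hypothetical collection: each document is a set of (lowercased) keywords.
docs = {
    1: {"rio", "brazil", "hotel"},
    2: {"hilo", "hawaii", "hotel", "hilton"},
    3: {"hilo", "hawaii", "hotel"},
    4: {"rio", "brazil", "beach"},
}


def matches(terms) -> bool:
    """((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton."""
    place = ("rio" in terms and "brazil" in terms) or (
        "hilo" in terms and "hawaii" in terms
    )
    return place and "hotel" in terms and "hilton" not in terms


result = sorted(doc_id for doc_id, terms in docs.items() if matches(terms))
print(result)  # [1, 3]
```

Document 2 is excluded by the NOT clause and document 4 by the missing "hotel" term. The result is an unranked set, which is the main limitation of the model.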
22. What are the advantages of Boolean retrieval model? BTL1
• Easy to understand for simple queries
• Clean formalism
• Boolean models can be extended to include ranking
• Reasonably efficient implementation possible for normal queries
23. Define Vector space model. BTL1
In this model, documents and queries are assumed to be part of a t-dimensional vector space, where t is the
number of index terms (words, stems, phrases, etc.). A document Di is represented by a vector of index terms:
Di = (di1, di2, ..., dit), where dij represents the weight of the jth term.
24. What is Inverse document frequency? (April/May 2021) BTL1
Terms that occur in many documents in the collection are less useful for discriminating among documents. If
a term appears in all documents in a set, then it loses its distinguishing power. The inverse document
frequency (idf) of a term i is given by:
idf_i = log(N / n_i)
where n_i is the number of documents in which the term i occurs and
N is the total number of documents.
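25. Define tf-idf weighting. BTL1
A term occurring frequently in the document but rarely in the rest of the collection is given high weight. A
typical combined term importance is tf-idf weighting:
w_ij = tf_ij × idf_i = tf_ij × log2(N / df_i)
where tf_ij is the frequency of term i in document j, df_i is the number of documents containing term i, and
N is the total number of documents.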
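The weighting can be sketched over a tiny collection in Python, using raw counts for tf and log base 2 for idf as in the formula. The function name `tf_idf_weights` and the four sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf * log2(N / df)} dictionary per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log2(n / df[term]) for term in tf})
    return weights


# Hypothetical four-document collection.
docs = ["search engine search", "web search", "web crawler", "text mining"]
weights = tf_idf_weights(docs)
print(weights[0])  # {'search': 2.0, 'engine': 2.0}
```

In the first document, "search" occurs twice but appears in half the collection (2 × log2(4/2) = 2), while "engine" occurs once but is rare (1 × log2(4/1) = 2), so both end up with the same weight.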


PART * B
1. Explain Boolean model with an example. (April/May 2021) (8M) BTL1
Answer: Page 1-15 - C. Manning
• Boolean model (2 M)
• Boolean relevance (2 M)
• Implementation (2 M)
• Advantages (2 M)
2. Explain vector space retrieval model with an example. (April/May 2021) (10M) (Apr/May’17) BTL1
Answer: Page No. 237-243 - Bruce Croft
• Vector space retrieval model concepts (2 M)
• Vector representation – documents, queries (2 M)
• Vector space retrieval model (2 M)
• Advantage (2 M)
• Disadvantage (2 M)
3. Briefly explain weighting. (8M) (Nov/Dec’16) BTL1
Answer: Page No. 107-110 - C. Manning
• Weighting scheme (4 M)
• Idf (2 M)
• Tf-idf weighting (2 M)
4. Briefly explain cosine similarity. (8M) (Nov/Dec’16) BTL1
Answer: Page 2.11 - I.A.Dhotre
• Metric frequently - find similarity (4 M)
• Cosine similarity representation (4 M)
5. Explain the inverted indices in detail. (8M) BTL1
Answer: Page No. 33-45 - Stefan Buettcher
• Phrase search (2 M)
• Implementing inverted indices (4 M)
• Documents and other elements (2 M)
6. Explain the Probabilistic IR model in detail. (8M) (Nov/Dec’17) BTL1
Answer: Page 201-216 - C. Manning
• Basic probability theory (2 M)
• Probability Ranking Principle (4 M)
• Binary independence model (2 M)
7. Give an example for latent semantic indexing and explain the same. (13 M) (Apr/May’17) (April/May
2021) BTL1
Answer: Page 2.18-2.19 - I.A.Dhotre
• Semantic indexing (4 M)
• General idea (4 M)
• Goal - indexing (4 M)
• Example (1 M)
8. Write about query expansion. (8 M) (Nov/Dec’16) BTL1
Answer: Page No. 173 to 176 - C. Manning
• Query expansion (2 M)
• Automatic thesaurus generation (4 M)
• Example (2 M)

PART*C

1. What is relevance feedback? Explain the different types of relevance feedback and also explain with
an example an algorithm for relevance feedback. (15 M) (Nov/Dec’16) (April/May 2021) BTL1
Answer: Page 163 to 172 - C. Manning
Relevance feedback and pseudo relevance feedback:
• The Rocchio algorithm (3M)
• Probabilistic relevance feedback (2M)
• Relevance feedback work (2M)
• Relevance feedback – web (2M)
• Relevance feedback strategies (2M)
• Pseudo relevance feedback (2M)
• Indirect relevance feedback (2M)

2. Explain the Language model IR in detail. (15 M) (Apr/May’18) BTL1
Answer: Page 2.14-2.15 - I.A.Dhotre
• Finite automata and language models (6 M)
• Types of language models (2 M)
• Unigram Language Model (4 M)
• N-gram Language Model (3 M)
3. Explain the preprocessing in detail. (15M) BTL1
Answer: Page 86-97 - Bruce Croft
• Stop word removal (4 M)
• Stemming (4 M)
• Text preprocessing (4 M)
• Web page preprocessing (3M)


UNIT III – TEXT CLASSIFICATION AND CLUSTERING

A Characterization of Text Classification – Unsupervised Algorithms: Clustering – Naïve Text Classification –
Supervised Algorithms – Decision Tree – k-NN Classifier – SVM Classifier – Feature Selection or Dimensionality
Reduction – Evaluation metrics – Accuracy and Error – Organizing the classes – Indexing and Searching – Inverted
Indexes – Sequential Searching – Multi-dimensional Indexing.

PART * A

Q.No. Questions

1. What is information filtering? BTL2
An information filtering system is a system that removes redundant or unwanted information from an
information stream using (semi)automated or computerized methods prior to presentation to a human user.
Its main goals are the management of information overload and an increase in the semantic signal-to-noise ratio.
2. What are the characteristics of information filtering? (Nov/Dec’18) BTL4
• Filtering systems involve large amounts of data.
• Information filtering systems deal with textual information.
• It is applicable to unstructured or semi-structured data.
3. Differentiate information filtering and information retrieval. (Apr/May’17) BTL4
S.No. | Information Filtering (IF) | Information Retrieval (IR)
1 | IF is concerned with the removal of textual information from an incoming stream and its dissemination to groups or individuals. | IR systems are concerned with the collection and organization of texts so that users can then easily find a text in the collection.
2 | Information filtering is concerned with repeated uses of the system by users with long-term, but changing, interests and needs. | A query represents a one-time information need.
3 | Filtering is based on descriptions of individual or group interests or needs that are usually called profiles. | Retrieval of information is instead based on user-specified information needs in the form of a query.
4 | IF systems deal with dynamic data. | IR systems deal with static databases.
4. What is text mining? BTL1
Text mining is understood as a process of automatically extracting meaningful, useful, previously unknown
and ultimately comprehensible information from textual document repositories.
Text mining can be visualized as consisting of two phases: text refining, which transforms free-form text
documents into a chosen intermediate form, and knowledge distillation, which deduces patterns or knowledge
from the intermediate form.
5. State classification. BTL1
Classification is a technique used to predict group membership for data instances. For example, you may
wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or
“cloudy”.

6. What is clustering? BTL2
Clustering is a process of partitioning a set of data into a set of meaningful subclasses. Every data item in a
subclass shares a common trait. It helps a user to understand the natural grouping or structure in a data set.
7. What are the desirable properties of a clustering algorithm? (Apr/May’17) BTL4
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input parameters
• Interpretability and usability
8. What is decision tree? BTL1
A decision tree is a simple representation for classifying examples. Decision tree learning is one of the most
successful techniques for supervised classification learning. A decision tree or a classification tree is a tree
in which each internal node is labeled with an input feature. The arcs coming from a node labeled with a
feature are labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a class
or a probability distribution over the classes.
9. List the advantages of decision tree. BTL4
• Decision trees can handle both nominal and numeric input attributes.
• Decision tree representation is rich enough to represent any discrete-value classifier.
• Decision trees are capable of handling datasets that may have errors.
• Decision trees are capable of handling datasets that may have missing values.
• They are self-explanatory and, when compacted, they are also easy to follow.
10. List the disadvantages of decision tree. BTL4
• Most of the algorithms require that the target attribute have only discrete values.
• Most decision-tree algorithms only examine a single field at a time.
• Decision trees are prone to errors in classification problems with many classes.
• As decision trees use the “divide and conquer” method, they tend to perform well if a few highly
relevant attributes exist, but less so if many complex interactions are present.
11. What is supervised learning? (Nov/Dec’17) BTL1
In supervised learning, both the inputs and the outputs are provided. The network then processes the inputs
and compares its resulting outputs against the desired outputs. Errors are then propagated back through the
system, causing the system to adjust the weights which control the network.
12. What is unsupervised learning? (Nov/Dec’17) BTL1
In unsupervised learning, the network adapts purely in response to its inputs. Such networks can learn to
pick out structure in their input.
13. What is dendrogram? (Nov/Dec’17) BTL1
A dendrogram is a tree that decomposes data objects into several levels of nested partitioning. A clustering of the
data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms
a cluster.
What is the use of Expectation-Maximization algorithm? (Apr/May’18) BTL1
14 The Expectation-Maximization algorithm is widely used for “hidden-data” problems, for example estimating the parameters of Hidden Markov
Models.
List the advantages and disadvantages of EM algorithm. BTL4
Advantages:
• Its simplicity
15 • Ease of implementation
Disadvantages:
• Finding good seeds is more critical for EM than for K-means
• EM is prone to getting stuck in local optima if the seeds are not chosen well
What is k-means algorithm? BTL1
16 K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering
problem. K-means clustering aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean serving as a prototype of the cluster.
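The definition above can be illustrated with a minimal sketch of the usual iterative procedure (assign each observation to the nearest mean, then recompute the means). The function name `kmeans`, the sample points and the fixed random seed are assumptions made for illustration; they do not come from the answer key.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Partition equal-length numeric tuples into k clusters (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct observations as initial means
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: each mean becomes the average of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

With these two well-separated groups, the first three points end up in one cluster and the last three in the other; on harder data the result depends on the initial seeds.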
Give the applications of clustering in IR. BTL4
• Search result clustering
17 • Scatter-gather
• Collection clustering
• Language modeling
• Cluster-based retrieval
What is relevance feedback? BTL1
18 Relevance feedback is a feature of information retrieval systems. The idea behind relevance feedback is to
take the results that are initially returned from a given query and to use information about whether or not
those results are relevant to perform a new query. The information comes from user feedback on the relevance of the documents in the initial set of results.
List pros and cons of Hierarchical agglomerative clustering. BTL4
Advantages
➢ It can produce an ordering of the objects which may be informative for data display
➢ Smaller clusters are generated, which may be helpful for discovery
19 Disadvantages
➢ No provision can be made for a relocation of objects that may have been ‘incorrectly’ grouped at an
early stage. The result should be examined closely to ensure it makes sense.
➢ Use of different distance metrics for measuring distances between clusters may generate different
results. Performing multiple experiments and comparing the results is recommended to support the
veracity of the original results.
Mention the different types of clustering methods. BTL1
• Partitioning methods
• Hierarchical methods
20 • Density-based methods
• Grid-based methods
• Model-based clustering methods


Name the types of data in clustering analysis. BTL1


• Interval-scaled variable – e.g., salary, height
• Binary variables – e.g., gender(M/F), has cancer (T/F)
21 • Nominal variables – e.g., religion (Hindu, Christian, Muslim, etc.)
• Ordinal variables – e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
• Ratio-scaled variables – e.g., population growth (1,10,1000, …)
• Variables of mixed types – e.g., multiple attributes with various types
List out the category of algorithms in classification. BTL1
22 • Naïve Bayesian classification
• Decision tree classification
• K-Nearest Neighbor algorithm (KNN)
Differentiate Agglomerative and Divisive clustering. (April/May 2021) BTL4
23
S.No.  Agglomerative clustering (bottom up)                 Divisive clustering (top down)
1      Start with single points (singleton clusters)        Start with one big cluster
2      Recursively merge two or more appropriate clusters   Recursively divide into smaller clusters
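The bottom-up direction can be sketched as follows. This is a minimal illustration that assumes one-dimensional points and single-link distance (the distance between the closest members of two clusters); the function name and the sample data are hypothetical.

```python
def single_link_agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain (sketch)."""
    clusters = [[p] for p in points]  # bottom up: every point starts as a singleton
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


three = single_link_agglomerative([1, 2, 9, 10, 25], target_clusters=3)
two = single_link_agglomerative([1, 2, 9, 10, 25], target_clusters=2)
```

Stopping at different values of `target_clusters` corresponds to cutting the dendrogram at different levels. A divisive method would run in the opposite direction, starting from one cluster that holds all five points.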
Differentiate Data mining and Text mining. BTL4
24
S.No.  Data mining                    Text mining
1      Structured data                Unstructured data (text)
2      Process directly               Linguistic processing or natural language processing (NLP)
3      Identify causal relationship   Discover unknown information
PART * B

Explain in detail about organization and relevance feedback. (13M) BTL2


Answer: Page 5.2 - 5.7 - N.Jayanthi-Charulatha Publications
1 • Rocchio algorithm for relevance feedback (4M)
• Probabilistic relevance feedback (4M)
• Pseudo relevance feedback (4M)
• Indirect relevance feedback (1M)
Write short notes on i) Text mining ii) Text classification and clustering (13M) BTL1
Answer:
i) Page 5.8 - I.A.Dhotre, Technical Publications
ii) Page No. 5.9 – 5.16 - I.A.Dhotre, Technical Publications

2 Binary classification (1M)


Assessing classification performance (2M)
Clustering (2M)
Desirable properties-clustering algorithm (2M)
Distance between clusters (2M)
Supervised learning after clustering (2M)
Clustering applications (2M)
3 State Bayes theorem. Discuss in detail about the working of Naïve Bayesian classifier with an example.
(13M) (Nov/Dec’18) (April/May 2021) BTL4


Answer: Page 5.16 - 5.22 - I.A.Dhotre, Technical Publications

Bayes Theorem. (4M)


Prior Probability (4M)
Posterior Probability. (5M)
Explain in detail the Multiple-Bernoulli and the multinomial models. (13M) (Nov/Dec’17) BTL1
Answer: Page 234- 264 - C. Manning

6 • Relation - Multinomial Unigram Language Model (4M)


• The Bernoulli Model (4M)
• Properties of Naïve Bayes (1M)
• Variant of Multinomial Model And Feature Selection (4M)
How does hierarchical agglomerative clustering work? Explain with an example. (13M) BTL2
Answer: Page 5.38 - 5.40 - I.A.Dhotre, Technical Publications

• Hierarchical Clustering (3M)


7
• Agglomerative Clustering (2M)
• Method Based on Probabilities (2M)
• K-nearest neighbours clustering. (2M)
• Euclidean Distance (2M)
• Mahalanobis Distance (2M)
Explain in detail about the decision tree. (13M) BTL1
Answer: Page No. 5.13 - 5.17 - N.Jayanthi, Charulatha Publications

9 • 2 Phases Algorithm (3M)


• Attribute Selection Measures (3M)
• Information Gain (4M)
• Information filtering. (3M)
PART*C

What is clustering? Explain k-means clustering algorithm with an example. (15M) (Apr/May’18) (April/May 2021) BTL1
Answer: Page 5.36 - 5.38 - I.A.Dhotre, Technical Publications

1 Select initial centroid at random (4M)


Assign each object to the cluster with the nearest centroid. (4M)
Measurement of attribute selection (4M)
Probabilistic models (2M)
Unsupervised selection (1M)
Explain the Expectation Maximization problem in detail. (15M) (Nov/Dec’17) BTL2
Answer: Page 5.36 - 5.38 - I.A.Dhotre, Technical Publications
2 Basic Setting In EM And Process (5M)
Regression model (4M)
Eclureit circuit (3M)
Relevance Expansion (3M)


UNIT IV - WEB RETRIEVAL AND WEB CRAWLING


The Web – Search Engine Architectures – Cluster based Architecture – Distributed Architectures – Search Engine
Ranking – Link based Ranking – Simple Ranking Functions – Learning to Rank – Evaluations -- Search Engine
Ranking – Search Engine User Interaction – Browsing – Applications of a Web Crawler – Taxonomy –
Architecture and Implementation – Scheduling Algorithms – Evaluation.

PART * A

Q.No. Questions

Define web server. BTL1


1. Web server is a computer connected to the internet that runs a program that takes responsibility for storing,
retrieving and distributing some of the web files.
What is a web browser? BTL1
2 A web browser is a program used to communicate with web servers on the Internet, which
enables it to download and display web pages. Netscape Navigator and Microsoft Internet Explorer are
the most popular browser software available in the market.
Define paid submission of search service. BTL1
In paid submission, a user submits a website for review by a search service for a preset fee with the expectation
3 that the site will be accepted and included in that company’s search engine, provided it meets the stated
guidelines for submission. Yahoo! is the major search engine that accepts this type of submission. While paid
submissions guarantee a timely review of the submitted site and notice of acceptance or rejection, you’re not
guaranteed inclusion or a particular placement order in the listings.
State paid inclusion programs of search services. BTL1
4 Paid inclusion programs allow you to submit your website for guaranteed inclusion in a search engine's
database of listings for a set period of time. While paid inclusion guarantees indexing of submitted pages or
sites in a search database, you’re not guaranteed that the pages will rank well for particular queries.
Define pay-for-placement. BTL1
5 In pay-for-placement, you can guarantee a ranking in a search listing for the terms of your choice. Also known
as paid placement, paid listing, or sponsored listings, this program guarantees placement in search results.
The leaders in pay-for-placement are Google, Yahoo! and Bing.
Define Search Engine Optimization. BTL1
6 Search Engine Optimization is the act of modifying a website to increase its ranking in organic, crawler-based
listing of search engines. There are several ways to increase the visibility of your website through the major
search engines on the internet today.
Describe benefit of SEO. BTL1
• Increase your search engine visibility
7 • Generate more traffic from the major search engines
• Make sure your website and business get NOTICED and VISITED
• Grow your client base and increase business revenue
What is web crawler? (Nov/Dec’16) (April/May 2021) BTL1
8 A web crawler is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers
are mainly used to create a copy of all the visited pages for later processing by a search engine that will index
the downloaded pages to provide fast searches.
9 Define focused crawler. BTL1


A focused crawler or topical crawler is a web crawler that attempts to download only pages that are relevant
to a pre-defined topic or set of topics.
What is hard and soft focused crawling? BTL1
In hard focused crawling, the classifier is invoked on a newly crawled document in a standard manner. When
10 it returns the best matching category path, the out-neighbors of the page are checked into the database if and
only if some node on the best matching category path is marked as good.
In soft focused crawling all out-neighbors of a visited page are checked into DB2, but their crawl priority
is based on the relevance of the current page.
Distinguish between SEO and Pay-per-click. BTL1

11
S.No  SEO                                                                  Pay-per-click
1     Results take 2 weeks to 4 months.                                    Results appear in 1-2 days.
2     It is very difficult to control the flow of traffic.                 Can be turned on and off at any moment.
3     Requires ongoing learning and experience to reap results.            Easier for a novice.
4     It is more difficult to target local markets.                        Able to target “local” markets.
5     Better for long-term and lower-margin campaigns.                     Better for short-term and high-margin campaigns.
6     Generally more cost-effective; does not penalize for more traffic.   Generally more costly per visitor and per conversion.

What is near-duplicate detection? BTL1


12 Near-duplicate detection is the task of identifying documents with almost identical content. Near-duplicate web
documents are abundant. Two such documents differ from each other in a very small portion that displays
advertisements, for example. Such differences are irrelevant for web search.
What are the requirements of XML information retrieval systems? (Nov/Dec’16) BTL1
• Query language that allows users to specify the nature of relevant components, in particular with
respect to their structure.
13
• Representation strategies providing a description not only of the content of XML documents, but also
their structure.
• Ranking strategies that determine the most relevant elements and rank these appropriately for a given
query.
14 What is schema heterogeneity? (Apr/May’17) BTL1


In many cases, several different XML schemas occur in a collection since the XML documents in an IR
application often come from more than one source. This phenomenon is called schema heterogeneity or
schema diversity.
What are the politeness policies used in web crawling? (Nov/Dec’17) BTL1
15
Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them.
These politeness policies must be respected.
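One explicit policy is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module to check it offline; the robots.txt content, the crawler name `ExampleBot` and the helper `seconds_to_wait` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the host.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_home = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/x.html")
delay = parser.crawl_delay("ExampleBot")  # seconds requested between visits


def seconds_to_wait(last_fetch_time, now, delay_seconds):
    """Time the crawler should still wait before contacting the same host again."""
    return max(0.0, delay_seconds - (now - last_fetch_time))
```

Implicit policies (for example, not overloading a slow server) have no file to parse, so a crawler usually applies its own minimum delay per host as well.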
Define an inverted index. (Apr/May’17) BTL1
16 An inverted index stores, for each word, a list of all documents containing that word.
• The index may be a bit vector.
• It may also contain the location(s) of the word in the document.
What is inversion in indexing process? (Nov/Dec’18) BTL1
An inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping
from content, such as words or numbers, to its locations in a database file, or in a document or a set of
17 documents (named in contrast to a forward index, which maps from documents to content). The purpose of
an inverted index is to allow fast full text searches, at a cost of increased processing when a document is
added to the database. The inverted file may be the database file itself, rather than its index. It is the most
popular data structure used in document retrieval systems, and is used on a large scale, for example in search engines.
How do spammers use cloaking to serve spam to the web users? (Apr/May’17) BTL1
Search engines soon became sophisticated enough in their spam detection to screen out a large number of
18 repetitions of particular keywords. Spammers responded with a richer set of spam techniques, the best known
of which we now describe. The first of these techniques is cloaking. Here, the spammer's web server returns
different pages depending on whether the http request comes from a web search engine's crawler (the part of
the search engine that gathers web pages) or from a human user's browser.
Can a digest of the characters in a web page be used to detect near-duplicate web pages? Why? BTL1
Not on its own. The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a
19 succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages
are equal, we test whether the pages themselves are equal and if so declare one of them to be a duplicate copy
of the other. This finds exact duplicates only, because changing even one character changes the digest, so it fails to capture a crucial and widespread phenomenon on the Web: near
duplication. A solution to the problem of detecting near-duplicate web pages is the shingling technique.
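The difference between the two ideas can be shown with a small sketch. It assumes a 64-bit fingerprint taken from the first 8 bytes of SHA-256 and word-level shingles of length 4; both choices, and the function names, are illustrative rather than prescribed by the answer.

```python
import hashlib


def fingerprint(text):
    """64-bit digest of the page text (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def shingles(text, k=4):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def resemblance(a, b, k=4):
    """Jaccard coefficient of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


page1 = "a rose is a rose is a rose"
page2 = "a rose is a rose is a flower"
same_digest = fingerprint(page1) == fingerprint(page2)  # False: digests differ
score = resemblance(page1, page2)                        # 0.75: 3 of 4 shingles shared
```

The digests differ although the pages differ in a single word, while the shingle overlap stays high; a threshold on that overlap is what marks a pair as near duplicates.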
Mention the processing steps in indexing process. BTL1
Indexing process:
• Input: list of normalized tokens for each document
• Sort the terms alphabetically
20 • Merge multiple occurrences of the same term
• Record the frequency of occurrence of the term in the document
o not needed by Boolean models
o used by vector space models
• Group instances of the same term and split dictionary and postings
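The steps listed above can be followed in a short sketch. The function name `build_index` and the three toy documents are assumptions; the postings here store (document id, term frequency) pairs.

```python
from collections import defaultdict


def build_index(docs):
    """docs maps a document id to its list of normalized tokens."""
    # Input: one (term, doc_id) pair per token occurrence.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]
    pairs.sort()  # sort alphabetically by term, then by document id
    postings = defaultdict(dict)
    for term, doc_id in pairs:
        # Merge repeated occurrences and record the term frequency in the document.
        postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Split into a dictionary (term -> document frequency) and the postings lists.
    dictionary = {term: len(entries) for term, entries in postings.items()}
    postings_lists = {term: sorted(entries.items()) for term, entries in postings.items()}
    return dictionary, postings_lists


docs = {
    1: ["new", "home", "sales", "top", "forecasts"],
    2: ["home", "sales", "rise", "in", "july"],
    3: ["home", "home", "prices"],
}
dictionary, postings_lists = build_index(docs)
```

A Boolean model would only need the document ids in each postings list; the frequencies are kept for vector space weighting.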
List out the bitwise methods used in compression schemes for inverted lists. BTL1
• Unary coding
21 • Elias gamma coding
• Delta coding
• Golomb coding
• Variable-byte scheme
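Three of these codes are sketched below under common textbook conventions, which are assumptions here: unary writes n as n ones followed by a zero, Elias gamma writes the length of the offset in unary followed by the offset, and variable-byte sets the high bit of the last byte. Delta and Golomb coding are not shown.

```python
def unary(n):
    """n ones followed by a terminating zero."""
    return "1" * n + "0"


def elias_gamma(n):
    """Unary length of the offset, then the offset (binary of n without its leading 1)."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers")
    offset = bin(n)[3:]  # strip the '0b' prefix and the leading 1 bit
    return unary(len(offset)) + offset


def variable_byte(n):
    """7 payload bits per byte; the high bit marks the final byte of the number."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the termination bit on the last byte
    return bytes(chunks)


gamma_13 = elias_gamma(13)   # "1110101"
vb_824 = variable_byte(824)  # bytes 00000110 10111000
```

In an inverted list these codes are normally applied to the gaps between successive document ids rather than to the ids themselves, since gaps are small for frequent terms.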


Differentiate Text centric and Data centric XML. BTL1


Text Centric
XML document retrieval is characterized by
• Long text fields (e.g., section of a document)
• Inexact matching
22
• Relevance ranked results
• Relational databases do not deal well with this use case
Data Centric
Data centric XML mainly encodes numerical and non-text XML attribute-value data. When querying data-
centric XML, we want to impose exact match conditions in most cases. This puts the emphasis on the
structural aspects of XML documents and queries.
List out the search engine optimization techniques. BTL1
• Domain name strategies
23 • Linking strategies
• Keywords
• Title tags
• Meta description tags
List out the parameters used to measure the size of the web. BTL1
24 • Number of hosts
• Number of HTML pages
List out web search engine components. BTL1
• Spider
25 • Indexer
• User
• Online query processor
Define XML retrieval. BTL1
26 XML retrieval or XML information retrieval is the content-based retrieval of documents structured with
XML (eXtensible Markup Language). As such it is used for computing relevance of XML documents.
PART * B

Write short note on web search overview and web structure.(8M) BTL1
Answer: Page No.507 -512 -Stefan Buettcher
1
• Web structure (4M)
• Bow-Tie Structure (4M)
Write a short note on the user and paid placement. (8M) BTL1
Answer: Page 3.3 – 3.4 - I.A. Dhotre

2 •Web search user (4M)


•Paid search service (4M)
▪ Paid submission
▪ Pay-for-inclusion
▪ Pay-for-placement
3 Explain about search engine optimization/Spam. (13M) (April/May 2021) (Nov/Dec’17) BTL1
Answer: Page 427 - 429 - C. Manning


• Early history (2M)


• First generation of spam (2M)
• diagram –Cloaking (2M)
• Paid inclusion model (2M)
• Doorway page (2M)
• Search engine optimizers (SEO) (3M)
Discuss in detail of web size measurement.(13M) (Nov/Dec’16) BTL1
Answer: Page 225-230 - Mark Levene

4 • Random searches (3M)


• Random IP addresses (4M)
• Random walks (3M)
• Random queries (3M)
Elaborate on web search architectures. (13M) (Nov/Dec’18) BTL1
Answer: Page 1.20 – 1.25 - I.A.Dhotre

Centralized architecture-components (7M)


• Crawler - diagram
• Index
5 • Query engine
• User Interface
Distributed Architecture (6M)
• Harvest architecture
• Goals of Harvest
• Components of Harvest
• Features of Harvest
Describe meta crawlers. (13M) (Nov/Dec’16) BTL1
Answer: Page 3.10 – 3.12 - I.A.Dhotre

6 • Meta searcher (4M)


• Advantages (1M)
• Metasearch engine (4M)
• Components - Meta crawler (4M)
Describe focused crawling. (13M) (Nov/Dec’17) BTL1
Answer: Page- 41-Bruce Croft

7 • Vertical Search (4M)


• Web Pages (4M)
• Focused, Topical, Crawling (4M)
• Outgoing Links (1M)
Explain in detail about finger print algorithm for near-duplicate detection.(8M) (Nov/Dec’17) BTL1
8 Answer: Page 437- 441 - C. Manning

• Finger print algorithm (2M)


• finding duplicates (2M)


• Finding near-duplicates (2M)
• shingle sketches (2M)
Explain the index compression with an example. (10M) (Apr/May’17) BTL1
Answer: Page 85 - 105 - C. Manning

• Lossy And Lossless Compression (2M)


9
• Heaps’ Law (2M)
• Zipf’s Law (2M)
• Dictionary Compression (2M)
• Dictionary As A String (2M)

PART*C

Draw the basic crawler architecture and describe its components and also explain the working of a web
crawler with an example. (15M)(Apr/May’18) BTL1
Answer: Page No. 541-549 in Stefan Buettcher
1
• Web crawler definition (4M)
• Features- Crawler (4M)
• Basic operation (4M)
• Web Crawler Architecture (3M)
Write short notes on i) Deep Web ii) Site maps iii) Distributed crawling (15M) (Nov/Dec’17) BTL2
Answer: Page 41 to 45 - Bruce Croft

Deep Web (8M)


• Deep Web fall
• Private sites
• Form results
2 • Scripted pages
Site maps (4M)
• Crawling
• Sitemap File
Distributed crawling (3M)
• Crawling
• sites
• crawler
• computing resource
Explain the framework of an XML retrieval system and its challenges with appropriate
examples. (15M) (Apr/May’18) BTL2
Answer: Page 195- 215 - C. Manning
3
• Challenges - XML retrieval (4M)
• vector space model (4M)
• Evaluation (4M)
• Text-centric and data-centric (3M)


UNIT V – RECOMMENDER SYSTEMS


Recommender Systems Functions – Data and Knowledge Sources – Recommendation Techniques – Basics of
Content-based Recommender Systems – High Level Architecture – Advantages and Drawbacks of Content-based
Filtering – Collaborative Filtering – Matrix factorization models – Neighborhood models.

PART * A

Q.No. Questions

What is link analysis? BTL1


1. Link analysis is the study of the link structure of a collection of documents connected by hyperlinks. Hyperlinks provide a valuable source
of information for web information retrieval.
What is query independent ranking? BTL1
2 Query-independent ranking is a score assigned to each page without a specific user query with the goal of
measuring the intrinsic quality of a page.
What is query dependent ranking? BTL1
3 Query-dependent ranking is a score, assigned to some of the pages, that measures the quality and the relevance of a page to a given user
query.
Define authorities. (Nov/Dec’18) BTL1
4 Authorities are pages that are recognized as providing significant, trustworthy and useful information on a
topic. In-degree is one simple measure of authority. However in-degree treats all links as equal.
Define hubs. BTL1
5 Hubs are defined as index pages that provide lots of useful links to relevant content pages. Hub pages for IR
are included in the home page.
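Hub and authority scores reinforce each other: good hubs point to good authorities, and good authorities are pointed to by good hubs. A minimal sketch of that iteration on a toy link graph follows; the page names, the iteration count and the function name `hits` are hypothetical.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it links to (illustrative sketch)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub score: sum of the authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm  # normalize so the scores do not grow without bound
    return hub, auth


links = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "a": [], "b": [], "c": []}
hub, auth = hits(links)
```

Here `portal` receives the highest hub score, and pages `a` and `b` outrank `c` as authorities because two hubs point to them. Unlike plain in-degree, a link from a strong hub counts for more than a link from a weak one.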
What are the properties of Hadoop? BTL1
• Hadoop includes a distributed file system. At Google, MapReduce operations are run on a special file system
called Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source. Doug Cutting and Yahoo! reverse engineered the GFS and called it Hadoop
6 Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the
project Hadoop or simply Hadoop.

What is the Hadoop Distributed File System? BTL1


The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream
7 those data sets at high bandwidth to user applications. HDFS stores file system metadata and application data
separately. The HDFS namespace is a hierarchy of files and directories. Files and directories are represented
on the NameNode by nodes, which record attributes like permissions, modification and access times,
namespace and disk space quotas.
Define Map Reduce. BTL1
8 Map Reduce is defined as a programming model and software framework first developed by Google. It is intended
to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity
hardware in a reliable, fault-tolerant manner.
9 List out the characteristics of Map Reduce? (Apr/May’17) & (Nov/Dec’17) BTL1
• Very large scale data: peta, exa bytes


• Write once and read many data. It allows for parallelism without mutexes
• Map and Reduce are the main operations: Simple code
• All the map tasks should be completed before the reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• Number of map tasks and reduce tasks are configurable.
• Operations are provisioned near the data.
• Commodity hardware and storage.
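The map-then-reduce data flow can be imitated in a single process with a word count. This is only a sketch of the programming model, not Hadoop code; the function names `map_words`, `reduce_counts` and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(documents):
    intermediate = defaultdict(list)
    # Every map call finishes before any reduce call starts.
    for document in documents:
        for key, value in map_words(document):
            intermediate[key].append(value)  # shuffle: group values by key
    return dict(reduce_counts(key, values) for key, values in intermediate.items())


counts = run_job(["time flies like an arrow", "how time flies"])
```

Because each `map_words` call reads only its own record and each `reduce_counts` call sees only one key, the calls could run on different machines without locks, which is the property the list above refers to.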
What are the limitations of Hadoop/MapReduce? BTL1
• Cannot control the order in which the maps or reductions are run.
• For maximum parallelism, you need Maps and Reduces to not depend on data generated in the same
10 MapReduce job.
• A database with an index will always be faster than a MapReduce job on unindexed data.
• Reduce operations do not take place until all Maps are complete.
• There is a general assumption that the output of Reduce is smaller than the input to Map: a large data source is used
to generate smaller final values
What is Cross-Lingual Retrieval? (Apr/May’18) BTL1
Cross-lingual retrieval is the retrieval of documents that are in a different language from the one in which
11 the query is expressed. This allows users to search document collections in multiple languages and retrieve
relevant information in a form that is useful to them, even when they have little or no linguistic competence
in the target languages.
Define Snippets. (Nov/Dec’17) BTL1
12 Snippets are defined as short fragments of text extracted from the document content or its metadata. They
may be static or query based. A static snippet always shows the first 50 words of the document, or the
content of its description metadata, or a description taken from a directory site such as dmoz.org.
List the advantages of invisible web content. BTL1
• Specialized content focus – large amounts of information focused on an exact subject.
13 • Contains information that might not be available on the visible web.
• Allows a user to find a precise answer to a specific question
• Allows a user to find web pages from a specific date or time.
What is collaborative filtering? BTL1
Collaborative filtering is a method of making automatic predictions about the interests of a single user by
14 collecting preferences or taste information from many users. It uses rating data given by many users for many
items as the basis for predicting missing ratings and/or for creating a top-N recommendation list for a given
user, called the active user.
What do you mean by item-based collaborative filtering? (Apr/May’17) (April/May 2021) BTL1
15 Item-based CF is a model-based approach which produces recommendations based on the relationship
between items inferred from the rating matrix. The assumption behind this approach is that users will prefer
items that are similar to other items they like.
What are the two problems of user based CF? BTL1
16 The two main problems of user-based CF are that the whole user database has to be kept in memory and that
expensive similarity computation between the active user and all other users in the database has to be
performed.
Define user based collaborative Filtering. (Nov/Dec’16) (April/May 2021) BTL1
17 User-based collaborative filtering algorithms work off the premise that if a user (A) has a
similar profile to another user (B), then A is more likely to prefer things that B prefers when compared with
a user chosen at random.
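This premise can be turned into a rating prediction. The sketch below assumes cosine similarity over co-rated items and a similarity-weighted average of the other users' ratings; the rating data, the user and item names and both function names are made up for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two rating dictionaries over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(ratings, active, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for user, user_ratings in ratings.items():
        if user == active or item not in user_ratings:
            continue
        similarity = cosine(ratings[active], user_ratings)
        numerator += similarity * user_ratings[item]
        denominator += abs(similarity)
    return numerator / denominator if denominator else None


ratings = {
    "A": {"x": 5, "y": 4},
    "B": {"x": 5, "y": 4, "z": 5},
    "C": {"x": 1, "y": 5, "z": 1},
}
predicted = predict(ratings, "A", "z")
```

User B agrees with A on every co-rated item, so B's rating of `z` dominates the prediction. The loop over all other users also shows why the whole rating database has to be available at prediction time.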


What is the goal of snippet generation? BTL1


The goal of snippet generation is
18 • Present the most informative bit of a document in light of the query.
• Present something which is self-contained i.e., a clause or a sentence.
• Present something short enough to fit in output.
• Be fast, accurate (where are the snippets stored?).
Compute the Jaccard similarity for the two lists of words (time, flies, like, an, arrow) and
(how, time, flies). (Apr/May’18) BTL1
Consider two sets A = {time, flies, like, an, arrow} and B = {how, time, flies}. How similar are A and B?

19 A can be represented as A={1,2,3,4,5} and B can be represented as B={6,1,2}

The Jaccard similarity is defined as JS(A, B) = |A ∩ B| / |A ∪ B|

JS(A,B) = |{1,2}| / |{1,2,3,4,5,6}| = 2/6=0.33
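The same computation can be checked directly on the word sets with a minimal sketch; the function name `jaccard` is an assumption.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


A = ["time", "flies", "like", "an", "arrow"]
B = ["how", "time", "flies"]
similarity = jaccard(A, B)  # 2 shared words out of 6 distinct words
```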

What is summarization? BTL1


Summarization is the process of reducing a text document with a computer program in order to create a
20 summary that retains the most important points of the original document. Technologies that can make a
coherent summary take into account variables such as length, writing style and syntax.
What are types of PWS (personalized web search)? BTL1
• User profiling
21 • Hyperlink analysis
• Community based PWS
• User location based PWS
Differentiate Text centric and Data centric XML. BTL1
Text Centric
• Long text fields (e.g., section of a document)
• Inexact matching
22 • Relevance ranked results
• Relational databases do not deal well with this use case
Data Centric
• Data centric XML mainly encodes numerical and non-text XML attribute-value data.
• When querying data-centric XML, we want to impose exact match conditions in most cases.
• This puts the emphasis on the structural aspects of XML documents and queries.
Define Individualization. BTL1
23 Individualization is defined as the totality of characteristics that distinguishes an individual. Uses the user’s
goals, prior and tacit knowledge, past information-seeking behaviors.
Define contextualization. BTL1
24 Contextualization is defined as the interrelated conditions that occur within an activity, which include factors
like the nature of information available, the information currently being examined and the applications in use.
Define Corpus. BTL1
Corpus is defined as the collection of written texts, especially the entire works of a particular author or a body
25 of writing on a particular subject.


PART * B

Write short notes on i) Link analysis ii) Hubs and Authorities (13M) BTL1
Answer: Page: 461-477 - C. Manning

Link analysis

1 • Web as a graph. (2M)


• Anchor text and the web graph. (3M)
Hubs and Authorities
• Broad-topic searches. (3M)
• Key consequences. (2M)
• Example. (3M)
Explain Page Rank in detail. (13M) (Nov/Dec’16) & (Apr/May’17) BTL2
Answer: Page:421-439 - C. Manning
2 • Page rank. (3M)
• Markov chains. (3M)
• The Page Rank computation. (3M)
• Topic-specific Page Rank. (4M)
Explain searching and ranking with an example. (13M) BTL2
Answer: Page: 4.23 to 4.33 - N.Jayanthi
• Indexes and ranking. (3M)
3 • Inverted indexes. (3M)
• Ranking. (2M)
• Page rank. (2M)
• Queries and Challenges. (3M)
Explain how relevance scoring is used to rank the pages. (13M) BTL2
Answer: Page:4.34 to 4.35 - N.Jayanthi
4 • Relevance ranking. (4M)
• Page Ranking (3M)
• Relevance using hyperlinks. (2M)
• User query (4M)
Explain the various methods to find the similarity of two documents using similarity measures. (13M) BTL2
Answer: Page:4.36 to 4.43 - N.Jayanthi
• Similarity measures. (2M)
6 • Methods to find similarity of documents. (2M)
• Computing similarity. (2M)
• Set similarity. (2M)
• Using Jaccard coefficient. (2M)
• Comparing signatures. (3M)
How does Map Reduce Work? Illustrate the usage of Map Reduce programming model in Hadoop.
7 (13M) (Apr/May’18) BTL2
Answer: Page:4.6 – 4.8 - I.A.Dhotre


• Hadoop framework. (3M)


• Map reduce. (3M)
• Characteristics. (2M)
• Logical data flow diagram. (3M)
• Hadoop/Map reduce limitations. (2M)
Explain Personalized search. (13M) (Nov/Dec’18) BTL2
Answer: Page No. 4.65 to 4.69 - N.Jayanthi
• Personalized web search. (3M)
9 • User profiling. (3M)
• Hyperlink analysis. (2M)
• Community based PWS. (2M)
• User location based PWS. (3M)

Explain collaborative filtering with an example. (13M) (Apr/May’17) BTL2


Answer: Page:333-346 - Mark Levene
• User-Based Collaborative Filtering. (2M)
• Item-Based Collaborative Filtering. (2M)
10 • Model-Based Collaborative Filtering. (2M)
• Content-Based Recommendation Systems. (2M)
• Evaluation of Collaborative Filtering Systems. (2M)
• Scalability of Collaborative Filtering Systems. (2M)
• Case Study of Amazon.co.uk. (1M)
Explain recommendation system with an example. (13M) (Apr/May’17) & (Nov/Dec’17) (April/May 2021) BTL2
11 Answer: Page:4.23 – 4.24 - I.A.Dhotre
• Process performed - content based recommender. (3M)
• Recommendation process. (3M)
• Advantages of Content-based approach. (4M)
• Disadvantages of Content-based approach. (3M)
PART*C

Explain in detail about Community-based Question Answering system. (15M) (Nov/Dec’17) &
(Apr/May’18) BTL2
Answer: Page:4.85 to 4.91 - N.Jayanthi
• Community-based Question. (3M)
1 • Natural Language Annotations. (3M)
• The AI lab, factual queries. (3M)
• Open domain QA. (2M)
• Redundancy based method. (2M)
• Semantic Headers. (2M)
Write short notes on i) Handling invisible web ii) Snippet generation iii) Summarization (15M) BTL2
Answer: Page: 4.25 – 4.26 - I.A.Dhotre
2
(i) Handling invisible web
• Using the search engine- find hidden content. (4M)
(ii) Snippet generation


• Steps for document based snippet generation. (3M)


• Steps for indexed based snippet generation. (2M)
• Query based summarization. (2M)
(iii) Summarization
• Static summarization. (2M)
• Dynamic summarization. (2M)
Explain in detail cross lingual information retrieval and its limitations in web search.(15M)
(Nov/Dec’16) BTL2
Answer: Page: 4.28 – 4.29 - I.A.Dhotre
3 • Information retrieval cycle. (4M)
• Query translation. (4M)
• Document translation. (4M)
• Limitations (3M)
