
T.Y.B.Sc. (CS) Sem VI Information Retrieval Question Bank

Unit I
Theory Questions
Foundations of Information Retrieval
1. Define Information Retrieval (IR) and explain its goals.
2. Discuss the key components of an IR system.
3. What are the major challenges faced in Information Retrieval?
4. Provide examples of applications of Information Retrieval.
Introduction to Information Retrieval (IR) systems
1. Explain the process of constructing an inverted index. How does it facilitate efficient information retrieval?
2. Discuss techniques for compressing inverted indexes.
3. How are documents represented in an IR system? Discuss different term weighting schemes.
4. With the help of examples, explain the process of storing and retrieving indexed documents.
5. Discuss storage mechanisms for indexed documents.
6. Explain the retrieval process of indexed documents.
7. Define k-gram indexing and explain its significance in Information Retrieval systems.
8. Describe the process of constructing a k-gram index. Highlight the key steps involved and the data structures
used.
9. Explain how wildcard queries are handled in k-gram indexing. Discuss the challenges associated with
wildcard queries and potential solutions.
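The following is a minimal Python sketch of 2-gram index construction and wildcard lookup, intended only to fix ideas for questions 7-9 above; the '$' boundary marker is the standard textbook convention, while the function names and the regex post-filtering step are illustrative assumptions.
```python
import re
from collections import defaultdict

def build_kgram_index(vocabulary, k=2):
    """Map each k-gram (with '$' marking term boundaries) to the terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        padded = f"${term}$"
        for i in range(len(padded) - k + 1):
            index[padded[i:i + k]].add(term)
    return index

def wildcard_query(index, pattern, vocabulary, k=2):
    """Answer a single-'*' wildcard query such as 'mo*y': intersect the posting
    sets of the k-grams of '$' + prefix and suffix + '$', then post-filter with
    the full pattern, since k-gram intersection alone can admit false positives."""
    candidates = None
    for piece in f"${pattern}$".split("*"):
        for i in range(len(piece) - k + 1):
            postings = index.get(piece[i:i + k], set())
            candidates = postings if candidates is None else candidates & postings
    if candidates is None:  # pattern pieces too short to contain a full k-gram
        candidates = set(vocabulary)
    regex = re.compile("^" + pattern.replace("*", ".*") + "$")
    return sorted(t for t in candidates if regex.match(t))

vocab = ["money", "monkey", "mozzarella", "many"]
idx = build_kgram_index(vocab)
print(wildcard_query(idx, "mo*y", vocab))  # ['money', 'monkey']
```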
Retrieval Models
1. Describe the Boolean model in Information Retrieval. Discuss Boolean operators and query processing.
2. Explain the Vector Space Model (VSM) in Information Retrieval. Discuss TF-IDF, cosine similarity, and
query-document matching.
3. What is the Probabilistic Model in Information Retrieval? Discuss Bayesian retrieval and relevance feedback.
4. How does cosine similarity measure the similarity between queries and documents in the Vector Space
Model?
5. What is relevance feedback in the context of retrieval models? How does it enhance search results?
Spelling Correction in IR Systems
1. What are the challenges posed by spelling errors in queries and documents?
2. What is edit distance, and how is it used in measuring string similarity? Provide examples.
3. Discuss string similarity measures used for spelling correction in IR systems.
4. Describe techniques employed for spelling correction in IR systems. Assess their effectiveness and
limitations.
5. What is the Soundex Algorithm and how does it address spelling errors in IR systems?
6. Discuss the steps involved in the Soundex Algorithm for phonetic matching.
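A compact sketch of the widely taught Soundex variant is given below for reference; exact treatment of 'h' and 'w' separators differs between formulations, so this should be read as one plausible implementation rather than the definitive algorithm.
```python
def soundex(name):
    """Textbook Soundex: keep the first letter, encode remaining consonants as
    digits, drop vowels/h/w/y, collapse adjacent duplicate codes, pad to 4 chars."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    encoded = [codes.get(ch, "0") for ch in name]  # '0' marks dropped letters
    result = first
    prev = encoded[0]
    for code in encoded[1:]:
        if code != "0" and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

for n in ["Williams", "Gonzalez", "Jackson"]:
    print(n, soundex(n))  # W452, G524, J250 under this variant
```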
Performance Evaluation
1. Define evaluation metrics used in Information Retrieval, including precision, recall, and F-measure.
2. Explain the concept of average precision in evaluating IR systems.
3. Explain the importance of test collections and relevance judgments in evaluating Information Retrieval
systems.
4. Discuss the process of relevance judgments and their importance in performance evaluation.
5. Describe experimental design and significance testing in the context of evaluating IR systems.
6. Discuss significance testing in Information Retrieval and its role in performance evaluation.
Numerical Questions
1. Given the following document-term matrix:
Document Terms
Doc1 cat, dog, fish
Doc2 cat, bird, fish
Doc3 dog, bird, elephant
Doc4 cat, dog, elephant
Construct the posting list for each term: cat, dog, fish, bird, elephant.
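A minimal sketch of how the posting lists for Problem 1 can be built programmatically; the dictionary layout and variable names are assumptions for illustration.
```python
from collections import defaultdict

# Document-term data from Problem 1.
docs = {
    "Doc1": ["cat", "dog", "fish"],
    "Doc2": ["cat", "bird", "fish"],
    "Doc3": ["dog", "bird", "elephant"],
    "Doc4": ["cat", "dog", "elephant"],
}

postings = defaultdict(list)
for doc_id in sorted(docs):  # sorted doc IDs yield sorted posting lists
    for term in docs[doc_id]:
        postings[term].append(doc_id)

for term in ["cat", "dog", "fish", "bird", "elephant"]:
    print(f"{term}: {postings[term]}")
# cat -> [Doc1, Doc2, Doc4]; dog -> [Doc1, Doc3, Doc4]; fish -> [Doc1, Doc2];
# bird -> [Doc2, Doc3]; elephant -> [Doc3, Doc4]
```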
2. Consider the following document-term matrix:
Document Terms
Doc1 apple, banana, grape
Doc2 apple, grape, orange
Doc3 banana, orange, pear
Doc4 apple, grape, pear
Create the posting list for each term: apple, banana, grape, orange, pear.
3. Given the inverted index with posting lists:
Term Posting List
cat Doc1, Doc2, Doc4
dog Doc1, Doc3, Doc4
fish Doc1, Doc2
Construct the term-document matrix and find the documents that contain both 'cat' and 'fish' using the Boolean retrieval model.
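Boolean AND over an inverted index is typically computed as a two-pointer merge of sorted posting lists; a minimal sketch for Problem 3 follows (the function name is illustrative).
```python
def intersect(p1, p2):
    """Standard two-pointer merge of two sorted posting lists (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

cat = ["Doc1", "Doc2", "Doc4"]
fish = ["Doc1", "Doc2"]
print(intersect(cat, fish))  # ['Doc1', 'Doc2'] contain both terms
```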
4. Given the following term-document matrix for a set of documents:
Term Doc1 Doc2 Doc3 Doc4
cat 15 28 0 0
dog 18 0 32 25
fish 11 19 13 0
The total numbers of terms in Doc1, Doc2, Doc3, and Doc4 are 48, 85, 74, and 30 respectively.
Calculate the TF-IDF score for each term-document pair using the following TF and IDF calculations:
• Term Frequency (TF) = (Number of occurrences of term in document) / (Total number of terms in the
document)
• Inverse Document Frequency (IDF) = log(Total number of documents / Number of documents
containing the term) + 1
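A short sketch of the TF-IDF computation for Problem 4, using exactly the TF and IDF formulas stated above; the question does not fix a logarithm base, so natural log is assumed here.
```python
import math

# Term counts and document lengths from Problem 4.
counts = {
    "cat":  {"Doc1": 15, "Doc2": 28, "Doc3": 0,  "Doc4": 0},
    "dog":  {"Doc1": 18, "Doc2": 0,  "Doc3": 32, "Doc4": 25},
    "fish": {"Doc1": 11, "Doc2": 19, "Doc3": 13, "Doc4": 0},
}
doc_len = {"Doc1": 48, "Doc2": 85, "Doc3": 74, "Doc4": 30}
N = len(doc_len)

for term, row in counts.items():
    df = sum(1 for c in row.values() if c > 0)
    idf = math.log(N / df) + 1  # base of the log is an assumption
    for doc, c in row.items():
        tf = c / doc_len[doc]
        print(f"{term:5s} {doc}: tf-idf = {tf * idf:.4f}")
# e.g. cat/Doc1 with natural log: (15/48) * (ln(4/2) + 1) ≈ 0.5291
```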
5. Given the term-document matrix and the TF-IDF scores calculated from Problem 4, calculate the cosine
similarity between each pair of documents (Doc1, Doc2), (Doc1, Doc3), (Doc1, Doc4), (Doc2, Doc3), (Doc2,
Doc4), and (Doc3, Doc4).
6. Consider the following queries expressed in terms of TF-IDF weighted vectors:
Query1: cat: 0.5, dog: 0.5, fish: 0
Query2: cat: 0, dog: 0.5, fish: 0.5
Calculate the cosine similarity between each query and each document from the term-document matrix in
Problem 4.
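Cosine similarity reduces to a dot product divided by the product of the vector norms; a minimal sketch follows. The document vector shown is Doc1's raw TF values used as stand-in weights, not the graded TF-IDF answer.
```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Query1 from Problem 6 over the axes (cat, dog, fish).
q = [0.5, 0.5, 0.0]
# Doc1's raw TFs (15/48, 18/48, 11/48) as stand-in weights; TF-IDF rescales each axis.
d = [0.3125, 0.3750, 0.2292]
print(round(cosine(q, d), 4))  # ≈ 0.9014 for these stand-in numbers
```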
7. Given the following term-document matrix:
Term Doc1 Doc2 Doc3 Doc4
apple 22 9 0 40
banana 14 0 12 0
orange 0 23 14 0
The total numbers of terms in Doc1, Doc2, Doc3, and Doc4 are 65, 48, 36, and 92 respectively.
Calculate the TF-IDF score for each term-document pair using the TF and IDF formulas from Problem 4.
8. Suppose you have a test collection with 50 relevant documents for a given query. Your retrieval system
returns 30 documents, out of which 20 are relevant. Calculate the Recall, Precision, and F-score for this
retrieval.
• Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
• Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
• F-score = 2 * (Precision * Recall) / (Precision + Recall)
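A worked computation for Problem 8 using the formulas above; the numbers are taken directly from the problem statement.
```python
relevant_total, retrieved, relevant_retrieved = 50, 30, 20

recall = relevant_retrieved / relevant_total              # 20 / 50 = 0.4
precision = relevant_retrieved / retrieved                # 20 / 30 ≈ 0.6667
f_score = 2 * precision * recall / (precision + recall)   # ≈ 0.5
print(recall, round(precision, 4), round(f_score, 4))
```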
9. You have a test collection containing 100 relevant documents for a query. Your retrieval system retrieves 80
documents, out of which 60 are relevant. Calculate the Recall, Precision, and F-score for this retrieval.
10. In a test collection, there are a total of 50 relevant documents for a query. Your retrieval system retrieves 60
documents, out of which 40 are relevant. Calculate the Recall, Precision, and F-score for this retrieval.
11. You have a test collection with 200 relevant documents for a query. Your retrieval system retrieves 150
documents, out of which 120 are relevant. Calculate the Recall, Precision, and F-score for this retrieval.
12. In a test collection, there are 80 relevant documents for a query. Your retrieval system retrieves 90 documents,
out of which 70 are relevant. Calculate the Recall, Precision, and F-score for this retrieval.
13. Construct 2-gram, 3-gram, and 4-gram indexes for the following terms:
a. banana
b. pineapple
c. computer
d. programming
e. elephant
f. database
14. Calculate the Levenshtein distance between the following pairs of words:
a. kitten and sitting
b. intention and execution
c. robot and orbit
d. power and flower
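A standard dynamic-programming implementation of Levenshtein distance, shown as a reference sketch for Problem 14.
```python
def levenshtein(a, b):
    """Classic DP edit distance (insert/delete/substitute, each with cost 1)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(levenshtein("kitten", "sitting"))  # 3
```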
15. Using the Soundex algorithm, encode the following:
a. Williams
b. Gonzalez
c. Harrison
d. Parker
e. Jackson
f. Thompson
Unit II
Theory Questions
Text Categorization and Filtering:
1. Define text categorization and explain its importance in information retrieval systems.
Discuss the challenges associated with text categorization.
2. Discuss the Naive Bayes algorithm for text classification. How does it work, and what are its
assumptions?
3. Explain Support Vector Machines (SVM) and their application in text categorization. How
does SVM handle text classification tasks?
4. Compare and contrast the Naive Bayes and Support Vector Machines (SVM) algorithms for
text classification. Highlight their strengths and weaknesses.
5. Describe feature selection and dimensionality reduction techniques used in text
categorization. Why are these techniques important?
6. Discuss the applications of text categorization and filtering in real-world scenarios such as
spam detection, sentiment analysis, and news categorization.
Text Clustering for Information Retrieval:
1. Explain the K-means clustering algorithm and how it is applied to text data. What are its key
steps, and how does it handle document clustering? Discuss its strengths and limitations.
2. Describe hierarchical clustering techniques and their relevance in organizing text data for
information retrieval. What are the advantages and disadvantages of hierarchical clustering
compared to K-means?
3. Discuss the evaluation measures used to assess the quality of clustering results in text data.
Explain purity, normalized mutual information, and F-measure in the context of text
clustering evaluation.
4. How can clustering be utilized for query expansion and result grouping in information
retrieval systems? Provide examples.
5. Compare and contrast the effectiveness of K-means and hierarchical clustering in text data
analysis. Discuss their suitability for different types of text corpora and retrieval tasks.
6. Discuss challenges and issues in applying clustering techniques to large-scale text data.
Web Information Retrieval:
1. Describe the architecture of a web search engine. Explain the components involved in
crawling and indexing web pages.
2. Discuss the challenges faced by web search engines, such as spam, dynamic content, and
scale. How are these challenges addressed in modern web search engines?
3. Explain link analysis and the PageRank algorithm. How does PageRank work to determine
the importance of web pages?
4. Describe the PageRank algorithm and how it calculates the importance of web pages based on
their incoming links. Discuss its role in web search ranking.
5. Explain how link analysis algorithms like HITS (Hypertext Induced Topic Search) contribute
to improving search engine relevance.
6. Discuss the impact of web information retrieval on modern search engine technologies and
user experiences.
7. Discuss applications of link analysis in information retrieval systems beyond web search.
Learning to Rank
1. Explain the concept of learning to rank and its importance in search engine result ranking.
2. Discuss algorithms and techniques used in learning to rank for Information Retrieval. Explain
the principles behind RankSVM, RankBoost, and their application in ranking search results.
3. Compare and contrast pairwise and listwise learning to rank approaches. Discuss their
advantages and limitations.
4. Explain evaluation metrics used to assess the performance of learning to rank algorithms.
Discuss metrics such as Mean Average Precision (MAP), Normalized Discounted Cumulative
Gain (NDCG), and Precision at K (P@K).
5. Discuss the role of supervised learning techniques in learning to rank and their impact on
search engine result quality.
6. How does supervised learning for ranking differ from traditional relevance feedback methods
in Information Retrieval? Discuss their respective advantages and limitations.
7. Describe the process of feature selection and extraction in learning to rank. What are the key
features used to train ranking models, and how are they selected or engineered?
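For the NDCG metric mentioned in question 4 above, a minimal sketch follows; the log2(i+1) discount is the common formulation, and the relevance labels in the example are hypothetical.
```python
import math

def dcg(relevances, k):
    """DCG@k with the common log2(i+1) discount; gain formulations vary."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels for a ranked result list.
print(round(ndcg([3, 2, 0, 1], k=4), 4))  # ≈ 0.9854
```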
Link Analysis and its Role in IR Systems:
1. Describe web graph representation in link analysis. How are web pages and hyperlinks represented in a web graph? OR Explain how web graphs are represented in link analysis. Discuss the concepts of nodes, edges, and directed graphs in the context of web pages and hyperlinks.
2. Explain the HITS algorithm for link analysis. How does it compute authority and hub scores?
3. Discuss the PageRank algorithm and its significance in web search engines. How is PageRank
computed?
4. Discuss the difference between the PageRank and HITS algorithms.
5. How are link analysis algorithms applied in information retrieval systems? Provide examples.
6. Discuss future directions and emerging trends in link analysis and its role in modern IR
systems. OR Discuss how link analysis can be used in social network analysis and
recommendation systems.
7. How do link analysis algorithms contribute to combating web spam and improving search
engine relevance?
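A power-iteration sketch of PageRank may help with questions 3 and 4 above; the teleport term (1-d)/N and the dangling-page handling follow the common textbook formulation, and the example graph is hypothetical.
```python
def pagerank(links, d=0.85, iterations=50):
    """Power-iteration PageRank; 'links' maps each page to the pages it links to."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    pr = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}  # teleport contribution
        for p in pages:
            outs = links.get(p, [])
            if outs:
                for q in outs:
                    new[q] += d * pr[p] / len(outs)
            else:  # dangling page: distribute its rank evenly
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr

# Hypothetical four-page graph, not taken from a specific exercise.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 4))
```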

Numerical Questions
1. Consider a simplified web graph with the following link structure:
• Page A has links to pages B, C, and D.
• Page B has links to pages C and E.
• Page C has links to pages A and D.
• Page D has a link to page E.
• Page E has a link to page A.
Using initial authority and hub scores of 1 for all pages, calculate the authority and hub scores for each page after one and after two iterations of the HITS algorithm.
2. Consider a web graph with the following link structure:
• Page A has links to pages B and C.
• Page B has a link to page C.
• Page C has links to pages A and D.
• Page D has a link to page A.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores for
each page. Assume the initial authority and hub scores are both 1 for all pages.
3. Given the following link structure:
• Page A has links to pages B and C.
• Page B has a link to page D.
• Page C has links to pages B and D.
• Page D has links to pages A and C.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and hub
scores for each page after one iteration of the HITS algorithm.
4. Consider a web graph with the following link structure:
• Page A has links to pages B and C.
• Page B has links to pages C and D.
• Page C has links to pages A and D.
• Page D has a link to page B.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores for
each page. Assume the initial authority and hub scores are both 1 for all pages.
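A reference sketch of the HITS update used in Problems 1-4; whether scores are normalized after each iteration varies by textbook, so normalization is optional here. The example uses the link structure from Problem 2.
```python
import math

def hits(links, iterations=2, normalize=True):
    """Per iteration: authority(p) = sum of hub scores of pages linking to p;
    hub(p) = sum of (updated) authority scores of pages p links to."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        if normalize:
            for scores in (auth, hub):
                norm = math.sqrt(sum(s * s for s in scores.values()))
                for p in scores:
                    scores[p] /= norm
    return auth, hub

# Link structure from Problem 2 above.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
auth, hub = hits(graph, iterations=2, normalize=False)
print({p: auth[p] for p in sorted(auth)})  # unnormalized: A=5, B=3, C=5, D=3
print({p: hub[p] for p in sorted(hub)})    # unnormalized: A=8, B=5, C=8, D=5
```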
Unit III
Theory Questions
Web Page Crawling Techniques:
1. Explain the breadth-first and depth-first crawling strategies. Compare their advantages and
disadvantages.
2. Describe focused crawling and its significance in building specialized search engines. Discuss
the key components of a focused crawling system. Discuss the importance of focused crawling
in targeted web data collection. Provide examples of scenarios where focused crawling is
preferred over general crawling.
3. How do web crawlers handle dynamic web content during crawling? Explain techniques such
as AJAX crawling, HTML parsing, URL normalization and session handling for dynamic
content extraction. Explain the challenges associated with handling dynamic web content
during crawling.
4. Describe the role of AJAX crawling scheme and the use of sitemaps in crawling dynamic web
content. Provide examples of how these techniques are implemented in practice.
Near-Duplicate Page Detection:
1. Define near-duplicate page detection and its significance in web search. Discuss the challenges
associated with identifying near-duplicate pages.
2. Discuss common techniques used for near-duplicate detection, such as fingerprinting and
shingling.
3. Compare and contrast local and global similarity measures for near-duplicate detection. Provide
examples of scenarios where each measure is suitable.
4. Describe common near-duplicate detection algorithms such as SimHash and MinHash. Explain
how these algorithms work and their computational complexities.
5. Provide examples of applications where near-duplicate page detection is critical, such as
detecting plagiarism and identifying duplicate content in search results.
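As a companion to questions 2 and 4 above, a minimal MinHash sketch follows; seeded MD5 is used as a stand-in family of hash functions, which is an illustrative choice rather than a production design.
```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a string (word shingles are equally common)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the shingles."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over a lazy dog")
print(round(estimated_jaccard(minhash_signature(a), minhash_signature(b)), 3))
```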
Text Summarization:
1. Explain the difference between extractive and abstractive text summarization methods.
Compare their advantages and disadvantages.
2. Describe common techniques used in extractive text summarization, such as graph-based
methods and sentence scoring approaches.
3. Discuss challenges in abstractive text summarization and recent advancements in neural
network-based approaches.
4. Discuss common evaluation metrics used to assess the quality of text summaries, such as
ROUGE and BLEU. Explain how these metrics measure the similarity between generated
summaries and reference summaries.
Question Answering:
1. Discuss different approaches for question answering in information retrieval, including
keyword-based, document retrieval, and passage retrieval methods.
2. Explain how natural language processing techniques such as Named Entity Recognition (NER)
and semantic parsing contribute to question answering systems.
3. Provide examples of question answering systems and evaluate their effectiveness in providing
precise answers.
4. Discuss the challenges associated with question answering, including ambiguity resolution,
answer validation, and handling of incomplete or noisy queries.
Recommender Systems:
1. Define collaborative filtering and content-based filtering in recommender systems. Compare
their strengths and weaknesses.
2. Explain how collaborative filtering algorithms such as user-based and item-based methods
work. Discuss techniques to address the cold start problem in collaborative filtering.
3. Describe content-based filtering approaches, including feature extraction and similarity
measures used in content-based recommendation systems.
Cross-Lingual and Multilingual Retrieval:
1. Discuss the challenges associated with cross-lingual retrieval, including language barriers,
lexical gaps, and cultural differences.
2. Describe the role of machine translation in information retrieval. Discuss different approaches
to machine translation, including rule-based, statistical, and neural machine translation models.
3. Describe methods for multilingual document representations and query translation, including
cross-lingual word embeddings and bilingual lexicons.
Evaluation Techniques for IR Systems:
1. Explain user-based evaluation methods, including user studies and surveys, and their role in
assessing the effectiveness of IR systems. Discuss methodologies for conducting user studies,
including usability testing, eye-tracking experiments, and relevance assessments.
2. Describe the role of test collections and benchmarking datasets in evaluating IR systems.
Discuss common test collections, such as TREC and CLEF, and their use in benchmarking
retrieval algorithms.
3. Define A/B testing and interleaving experiments as online evaluation methods for information
retrieval systems. Explain how these methods compare different retrieval algorithms or features
using real user interactions.
4. Discuss the advantages and limitations of online evaluation methods compared to offline
evaluation methods, such as test collections and user studies.
Numerical Questions
1. Given two sets of shingles representing web pages:
{ "apple", "banana", "orange", "grape" }
{ "apple", "orange", "grape", "kiwi" }
Compute the Jaccard similarity between the two pages using the formula J(A, B) = |A ∩ B| / |A
∪ B|.
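A direct computation for Problem 1, with the sets copied from the problem statement.
```python
a = {"apple", "banana", "orange", "grape"}
b = {"apple", "orange", "grape", "kiwi"}

jaccard = len(a & b) / len(a | b)  # |A ∩ B| = 3, |A ∪ B| = 5
print(jaccard)                     # 0.6
```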
2. Given two sets of tokens representing documents:
Document 1: { "machine", "learning", "algorithm", "data", "science" }
Document 2: { "algorithm", "data", "science", "model", "prediction" }
Compute the Jaccard similarity between the two documents.
3. Given two sets of terms representing customer transaction documents:
Transaction Document 1: { "bread", "milk", "eggs", "cheese" }
Transaction Document 2: { "bread", "butter", "milk", "yogurt" }
Compute the Jaccard similarity between the two transaction documents.
4. Given two sets of features representing product description documents:
Product Document 1: { "smartphone", "camera", "battery", "display" }
Product Document 2: { "smartphone", "camera", "storage", "processor" }
Compute the Jaccard similarity between the features of the two product documents.
