IR QB
Numerical Questions
1. Consider a simplified web graph with the following link structure:
• Page A has links to pages B, C, and D.
• Page B has links to pages C and E.
• Page C has links to pages A and D.
• Page D has a link to page E.
• Page E has a link to page A.
Using initial authority and hub scores of 1 for all pages, calculate the authority and hub
scores for each page after one and after two iterations of the HITS algorithm (a worked
sketch follows question 4 below).
2. Consider a web graph with the following link structure:
• Page A has links to pages B and C.
• Page B has a link to page C.
• Page C has links to pages A and D.
• Page D has a link to page A.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores for
each page. Assume the initial authority and hub scores are both 1 for all pages.
3. Given the following link structure:
• Page A has links to pages B and C.
• Page B has a link to page D.
• Page C has links to pages B and D.
• Page D has links to pages A and C.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and hub
scores for each page after one iteration of the HITS algorithm.
4. Consider a web graph with the following link structure:
• Page A has links to pages B and C.
• Page B has links to pages C and D.
• Page C has links to pages A and D.
• Page D has a link to page B.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores for
each page. Assume the initial authority and hub scores are both 1 for all pages.
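For questions 1–4, the HITS updates can be checked programmatically. The following is a minimal Python sketch, assuming the unnormalized variant these questions imply (scores start at 1 and are not rescaled each iteration; many textbook presentations normalize after every step). The adjacency list encodes the graph from question 1; substitute the other graphs as needed.

# Adjacency list for question 1; replace with the other graphs as needed.
links = {
    "A": ["B", "C", "D"],
    "B": ["C", "E"],
    "C": ["A", "D"],
    "D": ["E"],
    "E": ["A"],
}
pages = sorted(links)
auth = {p: 1.0 for p in pages}  # initial authority scores
hub = {p: 1.0 for p in pages}   # initial hub scores

for iteration in range(2):
    # Authority update: sum the hub scores of the pages linking to p.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub update: sum the updated authority scores of the pages p links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    print(f"iteration {iteration + 1}: authorities={auth} hubs={hub}")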
Unit III
Theory Questions
Web Page Crawling Techniques:
1. Explain the breadth-first and depth-first crawling strategies and compare their advantages
and disadvantages (a minimal sketch of the two crawl orders follows question 4 below).
2. Describe focused crawling and its significance in building specialized search engines. Discuss
the key components of a focused crawling system and the importance of focused crawling in
targeted web data collection. Provide examples of scenarios where focused crawling is
preferred over general crawling.
3. How do web crawlers handle dynamic web content during crawling? Explain techniques such
as AJAX crawling, HTML parsing, URL normalization, and session handling for dynamic
content extraction, and discuss the challenges associated with handling dynamic web content.
4. Describe the role of the AJAX crawling scheme and the use of sitemaps in crawling dynamic
web content. Provide examples of how these techniques are implemented in practice.
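For question 1 above, the difference between the two strategies reduces to how the crawl frontier is managed: a queue yields breadth-first order, a stack yields depth-first. A minimal sketch, using a toy in-memory link graph in place of real page fetching:

from collections import deque

link_graph = {  # hypothetical stand-in for fetching a page's out-links
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [], "d": [], "e": [],
}

def crawl(seed, breadth_first=True):
    frontier = deque([seed])
    visited, order = set(), []
    while frontier:
        # BFS pops from the front (queue); DFS pops from the back (stack).
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        frontier.extend(link_graph.get(url, []))
    return order

print(crawl("seed", breadth_first=True))   # ['seed', 'a', 'b', 'c', 'd', 'e']
print(crawl("seed", breadth_first=False))  # ['seed', 'b', 'e', 'a', 'd', 'c']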
Near-Duplicate Page Detection:
1. Define near-duplicate page detection and explain its significance in web search. Discuss the
challenges associated with identifying near-duplicate pages.
2. Discuss common techniques used for near-duplicate detection, such as fingerprinting and
shingling.
3. Compare and contrast local and global similarity measures for near-duplicate detection. Provide
examples of scenarios where each measure is suitable.
4. Describe common near-duplicate detection algorithms such as SimHash and MinHash. Explain
how these algorithms work and their computational complexities (a MinHash sketch follows
question 5 below).
5. Provide examples of applications where near-duplicate page detection is critical, such as
detecting plagiarism and identifying duplicate content in search results.
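For questions 2 and 4 above, shingling and MinHash combine naturally: each document is reduced to a set of shingles, and a MinHash signature approximates the Jaccard similarity between those sets. A minimal sketch, with salted MD5 standing in for a proper family of random hash functions:

import hashlib

def shingles(text, k=3):
    # k-word shingles of the document
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    # The i-th signature entry is the minimum of the i-th hash over the set.
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for i in range(num_hashes)
    ]

def estimated_jaccard(sig1, sig2):
    # The fraction of agreeing positions estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

s1 = shingles("the quick brown fox jumps over the lazy dog")
s2 = shingles("the quick brown fox leaps over the lazy dog")
print(estimated_jaccard(minhash_signature(s1), minhash_signature(s2)))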
Text Summarization:
1. Explain the difference between extractive and abstractive text summarization methods.
Compare their advantages and disadvantages.
2. Describe common techniques used in extractive text summarization, such as graph-based
methods and sentence scoring approaches.
3. Discuss challenges in abstractive text summarization and recent advancements in neural
network-based approaches.
4. Discuss common evaluation metrics used to assess the quality of text summaries, such as
ROUGE and BLEU. Explain how these metrics measure the similarity between generated
summaries and reference summaries (a minimal ROUGE-1 sketch follows below).
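For question 4 above, ROUGE-1 measures unigram overlap between a generated summary and a reference. A minimal sketch on toy strings (real evaluations typically use a library such as rouge-score, with stemming and multiple references):

from collections import Counter

def rouge1(candidate, reference):
    # Clipped unigram overlap between candidate and reference.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall

p, r = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 precision={p:.2f} recall={r:.2f}")  # both 0.83 here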
Question Answering:
1. Discuss different approaches to question answering in information retrieval, including
keyword-based matching, document retrieval, and passage retrieval methods.
2. Explain how natural language processing techniques such as Named Entity Recognition (NER)
and semantic parsing contribute to question answering systems.
3. Provide examples of question answering systems and evaluate their effectiveness in providing
precise answers.
4. Discuss the challenges associated with question answering, including ambiguity resolution,
answer validation, and handling of incomplete or noisy queries.
Recommender Systems:
1. Define collaborative filtering and content-based filtering in recommender systems. Compare
their strengths and weaknesses.
2. Explain how collaborative filtering algorithms such as user-based and item-based methods
work (a user-based sketch follows question 3 below). Discuss techniques to address the
cold-start problem in collaborative filtering.
3. Describe content-based filtering approaches, including feature extraction and similarity
measures used in content-based recommendation systems.
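For question 2 above, user-based collaborative filtering predicts a rating as a similarity-weighted average of other users' ratings. A minimal sketch with cosine similarity on a small hypothetical ratings matrix (missing ratings treated as zero):

import math

ratings = {  # hypothetical user-item ratings
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 4, "item2": 2, "item3": 5},
    "carol": {"item2": 5, "item3": 1},
}

def cosine(u, v):
    # Cosine similarity over sparse rating vectors (missing ratings = 0).
    num = sum(u[i] * v[i] for i in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def predict(user, item):
    # Similarity-weighted average of the other users' ratings for the item.
    pairs = [
        (cosine(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    total = sum(sim for sim, _ in pairs)
    return sum(sim * r for sim, r in pairs) / total if total else None

print(predict("carol", "item1"))  # roughly 4.5 on this toy data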
Cross-Lingual and Multilingual Retrieval:
1. Discuss the challenges associated with cross-lingual retrieval, including language barriers,
lexical gaps, and cultural differences.
2. Describe the role of machine translation in information retrieval. Discuss different approaches
to machine translation, including rule-based, statistical, and neural machine translation models.
3. Describe methods for multilingual document representations and query translation, including
cross-lingual word embeddings and bilingual lexicons.
Evaluation Techniques for IR Systems:
1. Explain user-based evaluation methods, including user studies and surveys, and their role in
assessing the effectiveness of IR systems. Discuss methodologies for conducting user studies,
including usability testing, eye-tracking experiments, and relevance assessments.
2. Describe the role of test collections and benchmarking datasets in evaluating IR systems.
Discuss common test collections, such as TREC and CLEF, and their use in benchmarking
retrieval algorithms.
3. Define A/B testing and interleaving experiments as online evaluation methods for information
retrieval systems. Explain how these methods compare different retrieval algorithms or features
using real user interactions (a team-draft interleaving sketch follows question 4 below).
4. Discuss the advantages and limitations of online evaluation methods compared to offline
evaluation methods, such as test collections and user studies.
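For question 3 above, team-draft interleaving merges two rankings by letting the rankers alternate picks, then credits each click to the ranker whose pick was clicked. A minimal sketch on hypothetical rankings and clicks:

import random

def team_draft(ranking_a, ranking_b, rng=random.Random(42)):
    a, b = list(ranking_a), list(ranking_b)
    interleaved, team = [], {}
    picks = {"A": 0, "B": 0}
    while a or b:
        # The ranker with fewer picks goes next; ties are broken randomly.
        if picks["A"] != picks["B"]:
            turn = "A" if picks["A"] < picks["B"] else "B"
        else:
            turn = rng.choice(["A", "B"])
        src = a if turn == "A" else b
        if not src:  # that ranker is exhausted; take from the other list
            turn = "B" if turn == "A" else "A"
            src = a if turn == "A" else b
        doc = src.pop(0)
        interleaved.append(doc)
        team[doc] = turn
        picks[turn] += 1
        for lst in (a, b):  # show each document only once
            if doc in lst:
                lst.remove(doc)
    return interleaved, team

shown, team = team_draft(["d1", "d2", "d3"], ["d2", "d4", "d1"])
credit = {"A": 0, "B": 0}
for doc in ["d2"]:  # simulated user clicks on the interleaved list
    credit[team[doc]] += 1
print(shown, credit)  # the ranker with more credited clicks wins the query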
Numerical Questions
1. Given two sets of shingles representing web pages:
{ "apple", "banana", "orange", "grape" }
{ "apple", "orange", "grape", "kiwi" }
Compute the Jaccard similarity between the two pages using the formula
J(A, B) = |A ∩ B| / |A ∪ B| (a short sketch follows question 4 below).
2. Given two sets of tokens representing documents:
Document 1: { "machine", "learning", "algorithm", "data", "science" }
Document 2: { "algorithm", "data", "science", "model", "prediction" }
Compute the Jaccard similarity between the two documents.
3. Given two sets of terms representing customer transaction documents:
Transaction Document 1: { "bread", "milk", "eggs", "cheese" }
Transaction Document 2: { "bread", "butter", "milk", "yogurt" }
Compute the Jaccard similarity between the two transaction documents.
4. Given two sets of features representing product description documents:
Product Document 1: { "smartphone", "camera", "battery", "display" }
Product Document 2: { "smartphone", "camera", "storage", "processor" }
Compute the Jaccard similarity between the features of the two product documents.
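For questions 1–4 above, the computation is the same set operation each time. A minimal sketch, worked on the sets from question 1 (|A ∩ B| = 3, |A ∪ B| = 5):

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

page1 = {"apple", "banana", "orange", "grape"}
page2 = {"apple", "orange", "grape", "kiwi"}
print(jaccard(page1, page2))  # 3 / 5 = 0.6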