Supervision Guide 2016/17 (Students)
Ronan Cummins
Contents
1 Boolean Model [Lecture 1]
2 Indexing and document normalisation [Lecture 2]
3 Tolerant Retrieval [Lecture 3]
4 Term Weighting and VSM [Lecture 4]
5 Language Modelling and Text Classification [Lecture 5]
6 IR evaluation [Lecture 6]
7 Query Expansion and Relevance Feedback [Lecture 7]
8 Link Analysis [Lecture 8]
Optional practical work with Apache Lucene
Apache Lucene is a freely available, open-source, full-text information retrieval software library written in Java. There is some code available on GitHub (https://github.com/ronancummins/spud) for students to experiment with. The repository also contains a small IR test collection (the original Cranfield collection, which contains about 1,400 documents and 225 queries). The students should be able to do the following:
• Download the code and index the collection by following the instructions in the README.
• Run the queries against the collection, collecting effectiveness metrics for each query.
• Change the underlying retrieval model to retrieve documents using a different scoring function (e.g. BM25 or a unigram multinomial language model); a minimal sketch follows this list.
• Manually select a subset of the queries and create a new file containing these.
• Change these queries by formulating different versions of the queries (e.g. try removing stop
words or try adding words you think will improve the query).
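As a concrete starting point, here is a minimal sketch (my own, not taken from the repository; the index path is a placeholder) of how Lucene's scoring function can be swapped:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class SwapSimilarity {
    public static void main(String[] args) throws Exception {
        // Open the index built by following the repository's README
        // ("index" is a hypothetical path).
        DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // BM25 with the usual defaults k1 = 1.2, b = 0.75 ...
        searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
        // ... or a unigram language model with Dirichlet smoothing:
        // searcher.setSimilarity(
        //     new org.apache.lucene.search.similarities.LMDirichletSimilarity(2000f));
        reader.close();
    }
}
```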
1 Boolean Model [Lecture 1]
1.1 Exercises from book
• Exercise 1.2 Draw the term-document incidence matrix and the inverted index represen-
tation for the following document collection:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
• Exercise 1.3 For the document collection shown in Exercise 1.2, what are the returned
results for these queries:
– schizophrenia AND drug
– for AND NOT(drug OR approach)
• Exercise 1.4 * For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
– Brutus AND NOT Caesar
– Brutus OR NOT Caesar
• First make sure they understand the merge algorithm for AND (a sketch appears at the end of this subsection); then, if they seem particularly interested, you can make them do:
• Exercise 1.5 * Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider (Brutus OR Caesar) AND NOT (Antony OR Cleopatra). Can we always merge in linear time? Linear in what? Can we do better than this?
• Exercise 1.7 Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes), given the postings list sizes in the book.
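Since Exercises 1.4 to 1.7 all build on the AND merge, here is a linear-time sketch of it; the Brutus/Caesar postings are the book's illustrative ones:

```java
import java.util.ArrayList;
import java.util.List;

public class Intersect {
    // Merge two sorted postings lists in O(x + y):
    // every iteration advances at least one pointer.
    static List<Integer> intersect(int[] p1, int[] p2) {
        List<Integer> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.length && j < p2.length) {
            if (p1[i] == p2[j]) { answer.add(p1[i]); i++; j++; }
            else if (p1[i] < p2[j]) i++;  // advance the list with the smaller docID
            else j++;
        }
        return answer;
    }

    public static void main(String[] args) {
        int[] brutus = {1, 2, 4, 11, 31, 45, 173, 174};
        int[] caesar = {2, 31, 54, 101};
        System.out.println(intersect(brutus, caesar));  // [2, 31]
    }
}
```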
1.2 Other Exercises/Discussion Points
1. Why don’t we use grep for information retrieval?
2. Why don’t we use a relational database for information retrieval?
2 Indexing and document normalisation [Lecture 2]
2.1 Exercises from the Book
• Exercise 2.1 Are the following statements true or false?
– In a Boolean retrieval system, stemming never lowers precision.
– In a Boolean retrieval system, stemming never lowers recall.
– Stemming increases the size of the vocabulary.
– Stemming should be invoked at indexing time but not while processing a query.
• Exercise 2.4 For the top Porter stemmer rule group (2.1) shown on page 33:
– What is the purpose of including an identity rule such as SS → SS?
– Applying just this rule group, what will the following words be stemmed to?
circus canaries boss
– What rule should be added to correctly stem pony?
– The stemming for ponies and pony might seem strange. Does it have a deleterious effect
on retrieval? Why or why not?
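A compact sketch of that rule group (my own rendering; the longest matching suffix wins), against which the students can check their answers for circus, canaries, boss and ponies:

```java
public class PorterStep1a {
    // The top rule group: SSES -> SS, IES -> I, SS -> SS (identity), S -> "".
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2);
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2);  // ponies -> poni
        if (w.endsWith("ss"))   return w;  // identity rule stops "boss" losing an s
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"circus", "canaries", "boss", "ponies"})
            System.out.println(w + " -> " + step1a(w));
    }
}
```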
• Exercise 2.5 Why are skip pointers not useful for queries of the form x OR y?
• Exercise 2.6 * We have a two-word query. For one term the postings list consists of the 16 entries given in the book, and for the other a single-entry list; work out how many comparisons the intersection takes, with and without skip pointers.
• Exercise 2.9 Using the positional index excerpted in the book, which document(s) if any match each of the following queries, where each expression within quotes is a phrase query?
– “fools rush in”
– “fools rush in” AND “angels fear to tread”
• Exercise 2.12 * Consider the adaptation of the basic algorithm for intersection of two postings lists (Figure 1.6, page 11) to the one in Figure 2.12 (page 42), which handles proximity queries. A naive algorithm for this operation could be $O(P \, L_{\max}^2)$, where $P$ is the sum of the lengths of the postings lists (i.e., the sum of document frequencies) and $L_{\max}$ is the maximum length of a document (in tokens).
– Go through this algorithm carefully and explain how it works.
– What is the complexity of this algorithm? Justify your answer carefully.
3 Tolerant Retrieval [Lecture 3]
3.1 Exercises from the book
• Exercise 3.1 In the permuterm index, each permuterm vocabulary term points to the
original vocabulary term(s) from which it was derived. How many original vocabulary terms
can there be in the postings list of a permuterm vocabulary term?
• Exercise 3.3 If you wanted to search for s*ng in a permuterm wildcard index, what key(s)
would one do the lookup on?
• Exercise 3.4 Refer to Figure 3.4 in the textbook; it is pointed out in the caption that the
vocabulary terms in the postings are lexicographically ordered. Why is this ordering useful?
• Exercise 3.5 Consider again the query fi*mo*er from Section 3.2.1. What Boolean query
on a bigram index would be generated for this query? Can you think of a term that matches
the permuterm query in Section 3.2.1, but does not satisfy this Boolean query?
• Exercise 3.6 Give an example of a sentence that falsely matches the wildcard query mon*h
if the search were to simply use a conjunction of bigrams.
• Exercise 3.7 If $|s_i|$ denotes the length of string $s_i$, show that the edit distance between $s_1$ and $s_2$ is never more than $\max(|s_1|, |s_2|)$.
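To make the permuterm questions concrete, a small sketch (my own) that generates the rotations of a term; each rotation becomes an index key pointing back to the original term:

```java
import java.util.ArrayList;
import java.util.List;

public class Permuterm {
    // All rotations of term + "$", where "$" is the end-of-word marker.
    static List<String> rotations(String term) {
        String s = term + "$";
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            keys.add(s.substring(i) + s.substring(0, i));
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rotations("song"));
        // [song$, ong$s, ng$so, g$son, $song]
        // For the query s*ng, rotate s*ng$ until the * is at the end (ng$s*),
        // then do a prefix lookup on "ng$s"; the rotation ng$so of song matches.
    }
}
```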
3.2 Other Exercises/Discussion Points
6. Now draw a hash table in which these same terms are stored. Consider a word with letters indexed $l_1 l_2 \ldots l_n$. Let $code(l_i)$ be the position of letter $l_i$ in the alphabet (e.g., $code(a) = 1$ and $code(z) = 26$). Use the hash function $h(word) = [code(l_1) + code(l_2)] \bmod 26$.
7. Calculate the edit distance between “cat” and “catcat”.
8. How many transformations exist to turn “cat” into “catcat”? How can these be read off the edit distance matrix? (A dynamic-programming sketch follows below.)
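For items 7 and 8, a standard dynamic-programming sketch (unit cost for insertion, deletion and substitution); the full matrix d is what the transformations are read off:

```java
public class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // deletions only
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // insertions only
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("cat", "catcat"));  // prints 3
    }
}
```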
3.3 Exam questions
• 2006 P7Q11 (note that this question was asked when tolerant retrieval was NOT specifically
treated in the lectures).
4 Term Weighting and VSM [Lecture 4]
4.1 Exercises from the Book
• Exercise 6.8 Why is the idf of a term always finite?
• Exercise 6.9 What is the idf of a term that occurs in every document? Compare this with
the use of stop word lists.
• Exercise 6.10 Consider the table of term frequencies for 3 documents denoted Doc1, Doc2,
Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for
each document, using the idf values from Figure 6.8.
• Exercise 6.15 Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean
normalized document vectors for each of the documents, where each vector has four compo-
nents, one for each of the four terms.
• Exercise 6.16 Verify that the sum of the squares of the components of each of the document
vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?
• Exercise 6.17 With term weights as computed in Exercise 6.15, rank the three documents
by computed score for the query car insurance, for each of the following cases of term
weighting in the query:
– The weight of a term is 1 if present in the query, 0 otherwise.
– Euclidean normalized idf.
• Exercise 6.19 Compute the vector space similarity between the query digital cameras
and the document digital cameras and video cameras by filling out the empty columns in
Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and
document, idf weighting for the query only and cosine normalization for the document only.
Treat and as a stop word. Enter term counts in the tf columns. What is the final similarity
score?
• Exercise 6.20 Show that for the query “affection”, the relative ordering of the scores of
the three documents in Figure 6.13 is the reverse of the ordering of the scores for the query
“jealous gossip”.
• Exercise 6.23 Refer to the tf and idf values for four terms and three documents in Exercise
6.10. Compute the two top scoring documents on the query best car insurance for each of
the following weighting schemes: (i) nnn.atc; (ii) ntc.atc.
4.2 Other Exercises/Discussion Points
5. In the figure below, which of the three vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ is most similar to $\vec{x}$ according to (i) dot-product similarity ($\sum_i x_i y_i$), (ii) cosine similarity ($\frac{\sum_i x_i y_i}{|\vec{x}|\,|\vec{y}|}$), and (iii) Euclidean distance ($|\vec{x} - \vec{y}|$)? The vectors are $\vec{a} = (0.5\ 1.5)^T$, $\vec{b} = (6\ 6)^T$, $\vec{c} = (12\ 9)^T$ and $\vec{x} = (2\ 2)^T$.
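A small sketch (my own, with the vectors from the question hard-coded) for checking the three rankings; note that each measure can prefer a different vector:

```java
public class VectorSimilarity {
    static double dot(double[] u, double[] v) { return u[0] * v[0] + u[1] * v[1]; }
    static double norm(double[] u) { return Math.sqrt(dot(u, u)); }

    public static void main(String[] args) {
        double[][] vs = {{0.5, 1.5}, {6, 6}, {12, 9}};  // a, b, c
        double[] x = {2, 2};
        for (double[] v : vs) {
            double[] diff = {v[0] - x[0], v[1] - x[1]};
            System.out.printf("dot=%.2f  cos=%.3f  dist=%.2f%n",
                    dot(v, x), dot(v, x) / (norm(v) * norm(x)), norm(diff));
        }
    }
}
```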
7. Compute the similarity between the query “smart phones” and the document “smart phones and video phones at smart prices” under lnc.ltn similarity. Assume N = 10,000,000 and treat “and” and “at” as stop words. $df_{smart}$ = 5,000; $df_{video}$ = 50,000; $df_{phones}$ = 25,000; $df_{prices}$ = 30,000. When computing length-normalised weights, you can round the length of a vector to the nearest integer. Here are some log values you may need (the question assumes an exam setting where no calculator is allowed):

x          1    2    3    4    5    6    7    8    9
log10(x)   0    0.3  0.5  0.6  0.7  0.8  0.8  0.9  1.0

(A worked sketch follows this item.)
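A worked sketch of the lnc.ltn computation for item 7 (my own; term order and variable names are arbitrary):

```java
public class LncLtn {
    public static void main(String[] args) {
        double N = 10_000_000;
        // Document "smart phones and video phones at smart prices" with
        // stop words removed: smart(2), phones(2), video(1), prices(1).
        double[] tf = {2, 2, 1, 1};                  // smart, phones, video, prices
        double[] df = {5_000, 25_000, 50_000, 30_000};
        double[] w = new double[4];
        double lenSq = 0;
        for (int i = 0; i < 4; i++) {
            w[i] = 1 + Math.log10(tf[i]);            // l: log term frequency
            lenSq += w[i] * w[i];
        }
        double len = Math.sqrt(lenSq);               // c: cosine normalisation
        double score = 0;
        for (int i : new int[]{0, 1})                // query terms: smart, phones
            score += (w[i] / len) * Math.log10(N / df[i]);  // t: idf on the query side
        System.out.println(score);  // about 3.3 (about 3.8 if len is rounded to 2)
    }
}
```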
8. Ask the students to choose three very short documents (of their own choice, or they can
make them up!) and then
9. If we were to have only one-term queries, explain why the use of weight-ordered postings
lists (i.e., postings lists sorted according to weight, instead of docID) truncated at position
k in the list suffices for identifying the k highest scoring documents. Assume that the weight
w stored for a document d in the postings list of t is the cosine-normalised weight of t for d.
10. Play around with your own implementation:
• Modify your implementation from lecture 2 so that your search model is now a VSM –
several parameters can be varied
4.3 Exam questions
• 2013 P8Q9ab
5 Language Modelling and Text Classification [Lecture 5]
[Please note that this lecture has been restructured to include language modelling for IR.]
You only need to provide the subset of the parameters required to classify the test set (e.g., it is not necessary to estimate the lexical probabilities for “London”).
2. What is bad about maximum likelihood estimates of the parameters P (t|c) in Naive Bayes?
3. What is the time complexity of a Naive Bayes classifier, and why?
4. What is the main independence assumption of Naive Bayes? (Insist on them giving you a formula, not a natural-language description.)
5. *What are the differences/similarities between the multinomial Naive Bayes classifier and
the query-likelihood approach to ranking documents?
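For reference when discussing items 4 and 5, the standard textbook forms of the two scoring rules; structurally they are the same product of unigram probabilities, with classes playing the role of documents:

\[
\hat{c} \;=\; \arg\max_{c}\; P(c) \prod_{i} P(t_i \mid c)
\qquad \text{vs.} \qquad
P(q \mid M_d) \;=\; \prod_{t \in q} P(t \mid M_d)
\]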
6 IR evaluation [Lecture 6]
6.1 Exercises from book
• Exercise 8.3
Derive the equivalence between the two formulae given for the F-measure in the lectures ($F_\alpha$ vs. $F_\beta$).
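For reference, the identity to be derived, in the book's notation with $\beta^2 = (1 - \alpha)/\alpha$:

\[
F \;=\; \frac{1}{\alpha \frac{1}{P} + (1 - \alpha)\frac{1}{R}}
  \;=\; \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}
\]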
• Exercise 8.4
What are the possible values for interpolated precision at a recall level of 0?
• Exercise 8.5
Must there always be a break-even point between Precision and Recall? Either show there
must or provide a counterexample.
• Exercise 8.8 (use average 11-point precision instead of R-precision)
Consider an information need for which there are 4 relevant documents in the collection.
Contrast two systems run on this collection. Their top ten results are judged for relevance
as follows (with the leftmost being the top-ranked search result):
System 1 RNRNNNNNRR
System 2 NRNNRRRNNN
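A small sketch (my own) for checking average precision on judged rankings like the two above; relevant documents that are never retrieved contribute zero:

```java
public class AveragePrecision {
    static double averagePrecision(String ranking, int totalRelevant) {
        int seen = 0;
        double sum = 0.0;
        for (int rank = 1; rank <= ranking.length(); rank++) {
            if (ranking.charAt(rank - 1) == 'R') {
                seen++;
                sum += (double) seen / rank;  // precision at each relevant rank
            }
        }
        return sum / totalRelevant;
    }

    public static void main(String[] args) {
        System.out.println(averagePrecision("RNRNNNNNRR", 4));  // System 1: 0.6
        System.out.println(averagePrecision("NRNNRRRNNN", 4));  // System 2: ~0.49
    }
}
```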
• Exercise 8.9
The following list of Rs and Ns represents relevant (R) and non-relevant (N) returned doc-
uments (as above) in a ranked list of 20 documents retrieved in response to a query from a
collection of 10,000 documents. The top of the ranked list is on the left of the list. The list
shows 6 relevant documents. Assume that there are 8 relevant documents in the collection.
• Exercise 8.10bc Below is a table showing how two human judges rated the relevance of a
set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us
assume that you have written an IR system for this query that returns the set of documents
{4,5,6,7,8}.
docID Judge 1 Judge 2
1 0 0
2 0 0
3 1 1
4 1 1
5 1 0
6 1 0
7 1 0
8 1 0
9 0 1
10 0 1
11 0 1
12 0 1
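For the kappa part of the exercise, a sketch (my own) with the table's judgments hard-coded; expected agreement uses the pooled marginals, as in the book:

```java
public class Kappa {
    public static void main(String[] args) {
        int[] j1 = {0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0};  // Judge 1, docs 1..12
        int[] j2 = {0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1};  // Judge 2
        int n = j1.length, agree = 0, ones = 0;
        for (int i = 0; i < n; i++) {
            if (j1[i] == j2[i]) agree++;
            ones += j1[i] + j2[i];
        }
        double pA = (double) agree / n;        // observed agreement: 4/12
        double pRel = ones / (2.0 * n);        // pooled P(relevant): 12/24
        double pE = pRel * pRel + (1 - pRel) * (1 - pRel);  // expected agreement
        System.out.println((pA - pE) / (1 - pE));  // kappa = -1/3
    }
}
```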
7 Query Expansion and Relevance Feedback [Lecture 7]
7.1 Exercises from Textbook
• Exercise 9.1 In the Rocchio algorithm, what weight setting for α, β, γ does a “Find pages
like this one” search correspond to?
• Exercise 9.2 Give three reasons why relevance feedback has been little used in web search.
• Exercise 9.3 Under what conditions would the modified query qm in Equation 9.3 be the
same as the original query q0 ? In all other cases, is qm closer than q0 to the centroid of the
relevant documents?
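For reference in Exercises 9.1 and 9.3, Rocchio's update (Equation 9.3 in the book), where $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents:

\[
\vec{q}_m \;=\; \alpha \vec{q}_0
\;+\; \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
\;-\; \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
\]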
• Exercise 9.4 Why is positive feedback likely to be more useful than negative feedback to an
IR system? Why might only using one nonrelevant document be more effective than using
several?
• Exercise 16.1 Define two documents as similar if they have at least two proper names like
Clinton or Sarkozy in common. Give an example of an information need and two documents,
for which the cluster hypothesis does not hold for this notion of similarity.
8 Link Analysis [Lecture 8]
8.1 Exercises from book
• Exercise 21.1 Is it always possible to follow directed edges (hyperlinks) in the web graph
from any node (web page) to any other? Why or why not?
• Exercise 21.2 Find an instance of misleading anchor text on the web.
• Exercise 21.5 What is the transition probability matrix for the example given in the book?
• Exercise 21.6 Consider a web graph with three nodes 1, 2 and 3. The links are as follows: 1 ← 2, 3 ← 2, 2 ← 1, 2 ← 3. Write down the transition probability matrix for the surfer's walk with teleporting, for the following three values of the teleport probability: (i) α = 0, (ii) α = 0.5, and (iii) α = 1.
• Exercise 21.7* A user of a browser can, in addition to clicking a hyperlink on the page x
she is currently browsing, use the back button to go back to the page from which she arrived
at x. Can such a use of back buttons be modelled as a Markov chain? How would we model
repeated invocations of the back button?
• Exercise 21.8 Consider a Markov chain with three states, A, B and C, and transition probabilities as follows. From state A, the next state is B with probability 1. From B, the next state is either A with probability $p_A$, or state C with probability $1 - p_A$. From C, the next state is A with probability 1. For what values of $p_A \in [0, 1]$ is this Markov chain ergodic?
• Exercise 21.11 Verify that the PageRank vector for the following transition matrix (from the book and lectures)
d0 d1 d2 d3 d4 d5 d6
d0 0.02 0.02 0.88 0.02 0.02 0.02 0.02
d1 0.02 0.45 0.45 0.02 0.02 0.02 0.02
d2 0.31 0.02 0.31 0.31 0.02 0.02 0.02
d3 0.02 0.02 0.02 0.45 0.45 0.02 0.02
d4 0.02 0.02 0.02 0.02 0.02 0.02 0.88
d5 0.02 0.02 0.02 0.02 0.02 0.45 0.45
d6 0.02 0.02 0.02 0.31 0.31 0.02 0.31
is indeed $\vec{x} = (0.05\ 0.04\ 0.11\ 0.25\ 0.21\ 0.04\ 0.31)$.
Write a small routine (a sketch follows below) or use a scientific calculator to do so. [Please additionally check that they know how to construct a transition matrix with teleportation in general.]
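One possible small routine (my own) doing the verification by power iteration; the matrix rows are copied from the table above:

```java
import java.util.Arrays;

public class PageRank {
    public static void main(String[] args) {
        double[][] P = {
            {0.02, 0.02, 0.88, 0.02, 0.02, 0.02, 0.02},
            {0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02},
            {0.31, 0.02, 0.31, 0.31, 0.02, 0.02, 0.02},
            {0.02, 0.02, 0.02, 0.45, 0.45, 0.02, 0.02},
            {0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.88},
            {0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45},
            {0.02, 0.02, 0.02, 0.31, 0.31, 0.02, 0.31}};
        int n = P.length;
        double[] x = new double[n];
        Arrays.fill(x, 1.0 / n);            // uniform start distribution
        for (int it = 0; it < 100; it++) {  // iterate x <- x P until convergence
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[j] += x[i] * P[i][j];
            x = next;
        }
        System.out.println(Arrays.toString(x));
        // Converges to roughly (0.05 0.04 0.11 0.25 0.21 0.04 0.31).
    }
}
```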
• Exercise 21.22 For the web graph in the corresponding figure in the book, compute PageRank for each of the three pages.
8.2 Other Exercises/Discussion Points
1. Compute PageRank for the web graph below for each of the three pages. Teleportation
probability is 0.6.
[Figure: a web graph over three pages, including d1 and d2, with transition probabilities 0.3, 0.8, 0.7 and 0.2 on the edges.]
3. When using PageRank for ranking, what assumptions are we making about the meaning of
hyperlinks?
4. Why is PageRank a better measure of quality than a simple count of inlinks?
5. What is the meaning of the PageRank q of a page d in the random surfer model?
6. Make sure they understand the two main differences between PageRank and HITS (namely (1) offline vs. online computation, and (2) a different underlying model of importance, i.e. of the inherent properties of a web page).
7. Discuss with them exactly how anchor text is used for queries.