Supervision Guide 2016/17 (Students)
Ronan Cummins
Contents
1 Boolean Model [Lecture 1]
2 Indexing and document normalisation [Lecture 2]
3 Tolerant Retrieval [Lecture 3]
4 Term Weighting and VSM [Lecture 4]
5 Language Modelling and Text Classification [Lecture 5]
6 IR evaluation [Lecture 6]
7 Query Expansion and Relevance Feedback [Lecture 7]
8 Link Analysis [Lecture 8]
Optional practical work with Apache Lucene
Apache Lucene is a freely available, open-source, full-text information retrieval software library written in Java. There is some code available on GitHub (https://github.com/ronancummins/spud) for students to experiment with. The repository also contains a small IR test collection (the original Cranfield collection, which contains about 1,400 documents and 225 queries). The students should be able to do the following:
• Download the code and index the collection by following the instructions in the README.
• Run the queries against the collection, collecting effectiveness metrics for each query.
• Change the underlying retrieval model to retrieve documents using a different scoring function (e.g. BM25 or a unigram multinomial language model); a minimal sketch follows this list.
• Manually select a subset of the queries and create a new file containing these.
• Change these queries by formulating different versions of the queries (e.g. try removing stop
words or try adding words you think will improve the query).
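As a concrete starting point, here is a minimal sketch (my own, not taken from the repository; the index path is a placeholder) of how Lucene's scoring function can be swapped:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class SwapSimilarity {
    public static void main(String[] args) throws Exception {
        // Open the index built by following the repository's README
        // ("index" is a hypothetical path).
        DirectoryReader reader =
                DirectoryReader.open(FSDirectory.open(Paths.get("index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // BM25 with the usual defaults k1 = 1.2, b = 0.75 ...
        searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));
        // ... or a unigram language model with Dirichlet smoothing:
        // searcher.setSimilarity(
        //     new org.apache.lucene.search.similarities.LMDirichletSimilarity(2000f));
        reader.close();
    }
}
```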
1 Boolean Model [Lecture 1]
1.1 Exercises from book
• Exercise 1.2 Draw the term-document incidence matrix and the inverted index represen-
tation for the following document collection:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
• Exercise 1.3 For the document collection shown in Exercise 1.2, what are the returned
results for these queries:
– schizophrenia AND drug
– for AND NOT(drug OR approach)
• Exercise 1.4 * For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
– Brutus AND NOT Caesar
– Brutus OR NOT Caesar
• First make sure they understand the merge algorithm for AND (a sketch appears at the end of this subsection); then, if they seem particularly interested, you can make them do:
• Exercise 1.5 * Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider (Brutus OR Caesar) AND NOT (Antony OR Cleopatra). Can we always merge in linear time? Linear in what? Can we do better than this?
• Exercise 1.7 Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes), given the postings list sizes in the book.
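Since Exercises 1.4 to 1.7 all build on the AND merge, here is a linear-time sketch of it; the Brutus/Caesar postings are the book's illustrative ones:

```java
import java.util.ArrayList;
import java.util.List;

public class Intersect {
    // Merge two sorted postings lists in O(x + y):
    // every iteration advances at least one pointer.
    static List<Integer> intersect(int[] p1, int[] p2) {
        List<Integer> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.length && j < p2.length) {
            if (p1[i] == p2[j]) { answer.add(p1[i]); i++; j++; }
            else if (p1[i] < p2[j]) i++;  // advance the list with the smaller docID
            else j++;
        }
        return answer;
    }

    public static void main(String[] args) {
        int[] brutus = {1, 2, 4, 11, 31, 45, 173, 174};
        int[] caesar = {2, 31, 54, 101};
        System.out.println(intersect(brutus, caesar));  // [2, 31]
    }
}
```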
1.2 Other Exercises/Discussion Points
1. Why don’t we use grep for information retrieval?
2. Why don’t we use a relational database for information retrieval?
2 Indexing and document normalisation [Lecture 2]
2.1 Exercises from the Book
• Exercise 2.1 Are the following statements true or false?
– In a Boolean retrieval system, stemming never lowers precision.
– In a Boolean retrieval system, stemming never lowers recall.
– Stemming increases the size of the vocabulary.
– Stemming should be invoked at indexing time but not while processing a query.
• Exercise 2.4 For the top Porter stemmer rule group (2.1) shown on page 33:
– What is the purpose of including an identity rule such as SS → SS?
– Applying just this rule group, what will the following words be stemmed to?
circus canaries boss
– What rule should be added to correctly stem pony?
– The stemming for ponies and pony might seem strange. Does it have a deleterious effect
on retrieval? Why or why not?
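A compact sketch of that rule group (my own rendering; the longest matching suffix wins), against which the students can check their answers for circus, canaries, boss and ponies:

```java
public class PorterStep1a {
    // The top rule group: SSES -> SS, IES -> I, SS -> SS (identity), S -> "".
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2);
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2);  // ponies -> poni
        if (w.endsWith("ss"))   return w;  // identity rule stops "boss" losing an s
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"circus", "canaries", "boss", "ponies"})
            System.out.println(w + " -> " + step1a(w));
    }
}
```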
• Exercise 2.5 Why are skip pointers not useful for queries of the form x OR y?
• Exercise 2.6 * We have a two-word query. For one term the postings list consists of the 16 entries given in the book, and for the other a single-entry list; work out how many comparisons the intersection takes, with and without skip pointers.
• Exercise 2.9 Using the positional index excerpted in the book, which document(s) if any match each of the following queries, where each expression within quotes is a phrase query?
– “fools rush in”
– “fools rush in” AND “angels fear to tread”
• Exercise 2.12 * Consider the adaptation of the basic algorithm for intersection of two postings lists (Figure 1.6, page 11) to the one in Figure 2.12 (page 42), which handles proximity queries. A naive algorithm for this operation could be $O(P \, L_{\max}^2)$, where $P$ is the sum of the lengths of the postings lists (i.e., the sum of document frequencies) and $L_{\max}$ is the maximum length of a document (in tokens).
– Go through this algorithm carefully and explain how it works.
– What is the complexity of this algorithm? Justify your answer carefully.
3 Tolerant Retrieval [Lecture 3]
3.1 Exercises from the book
• Exercise 3.1 In the permuterm index, each permuterm vocabulary term points to the
original vocabulary term(s) from which it was derived. How many original vocabulary terms
can there be in the postings list of a permuterm vocabulary term?
• Exercise 3.3 If you wanted to search for s*ng in a permuterm wildcard index, what key(s)
would one do the lookup on?
• Exercise 3.4 Refer to Figure 3.4 in the textbook; it is pointed out in the caption that the
vocabulary terms in the postings are lexicographically ordered. Why is this ordering useful?
• Exercise 3.5 Consider again the query fi*mo*er from Section 3.2.1. What Boolean query
on a bigram index would be generated for this query? Can you think of a term that matches
the permuterm query in Section 3.2.1, but does not satisfy this Boolean query?
• Exercise 3.6 Give an example of a sentence that falsely matches the wildcard query mon*h
if the search were to simply use a conjunction of bigrams.
• Exercise 3.7 If $|s_i|$ denotes the length of string $s_i$, show that the edit distance between $s_1$ and $s_2$ is never more than $\max(|s_1|, |s_2|)$.
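To make the permuterm questions concrete, a small sketch (my own) that generates the rotations of a term; each rotation becomes an index key pointing back to the original term:

```java
import java.util.ArrayList;
import java.util.List;

public class Permuterm {
    // All rotations of term + "$", where "$" is the end-of-word marker.
    static List<String> rotations(String term) {
        String s = term + "$";
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            keys.add(s.substring(i) + s.substring(0, i));
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rotations("song"));
        // [song$, ong$s, ng$so, g$son, $song]
        // For the query s*ng, rotate s*ng$ until the * is at the end (ng$s*),
        // then do a prefix lookup on "ng$s"; the rotation ng$so of song matches.
    }
}
```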
3.2 Other Exercises/Discussion Points
6. Now draw a hash table in which these same terms are stored. Consider a word with letters indexed $l_1 l_2 \ldots l_n$. Let $code(l_i)$ be the position of letter $l_i$ in the alphabet (e.g., $code(a) = 1$ and $code(z) = 26$). Use the hash function $h(word) = [code(l_1) + code(l_2)] \bmod 26$.
7. Calculate the edit distance between “cat” and “catcat”.
8. How many transformations exist to turn “cat” into “catcat”? How can these be read off the edit distance matrix? (A dynamic-programming sketch follows below.)
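For items 7 and 8, a standard dynamic-programming sketch (unit cost for insertion, deletion and substitution); the full matrix d is what the transformations are read off:

```java
public class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // deletions only
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // insertions only
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("cat", "catcat"));  // prints 3
    }
}
```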
3.3 Exam questions
• 2006 P7Q11 (note that this question was asked when tolerant retrieval was NOT specifically
treated in the lectures).
4 Term Weighting and VSM [Lecture 4]
4.1 Exercises from the Book
• Exercise 6.8 Why is the idf of a term always finite?
• Exercise 6.9 What is the idf of a term that occurs in every document? Compare this with
the use of stop word lists.
• Exercise 6.10 Consider the table of term frequencies for 3 documents denoted Doc1, Doc2,
Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for
each document, using the idf values from Figure 6.8.
• Exercise 6.15 Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean
normalized document vectors for each of the documents, where each vector has four compo-
nents, one for each of the four terms.
• Exercise 6.16 Verify that the sum of the squares of the components of each of the document
vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?
• Exercise 6.17 With term weights as computed in Exercise 6.15, rank the three documents
by computed score for the query car insurance, for each of the following cases of term
weighting in the query:
– The weight of a term is 1 if present in the query, 0 otherwise.
– Euclidean normalized idf.
• Exercise 6.19 Compute the vector space similarity between the query digital cameras
and the document digital cameras and video cameras by filling out the empty columns in
Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and
document, idf weighting for the query only and cosine normalization for the document only.
Treat and as a stop word. Enter term counts in the tf columns. What is the final similarity
score?
• Exercise 6.20 Show that for the query “affection”, the relative ordering of the scores of
the three documents in Figure 6.13 is the reverse of the ordering of the scores for the query
“jealous gossip”.
• Exercise 6.23 Refer to the tf and idf values for four terms and three documents in Exercise
6.10. Compute the two top scoring documents on the query best car insurance for each of
the following weighting schemes: (i) nnn.atc; (ii) ntc.atc.
4.2 Other Exercises/Discussion Points
5. In the figure below, which of the three vectors $\vec{a}$, $\vec{b}$ and $\vec{c}$ is most similar to $\vec{x}$ according to (i) dot-product similarity ($\sum_i x_i y_i$), (ii) cosine similarity ($\frac{\sum_i x_i y_i}{|\vec{x}|\,|\vec{y}|}$), and (iii) Euclidean distance ($|\vec{x} - \vec{y}|$)? The vectors are $\vec{a} = (0.5\ 1.5)^T$, $\vec{b} = (6\ 6)^T$, $\vec{c} = (12\ 9)^T$ and $\vec{x} = (2\ 2)^T$.
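A small sketch (my own, with the vectors from the question hard-coded) for checking the three rankings; note that each measure can prefer a different vector:

```java
public class VectorSimilarity {
    static double dot(double[] u, double[] v) { return u[0] * v[0] + u[1] * v[1]; }
    static double norm(double[] u) { return Math.sqrt(dot(u, u)); }

    public static void main(String[] args) {
        double[][] vs = {{0.5, 1.5}, {6, 6}, {12, 9}};  // a, b, c
        double[] x = {2, 2};
        for (double[] v : vs) {
            double[] diff = {v[0] - x[0], v[1] - x[1]};
            System.out.printf("dot=%.2f  cos=%.3f  dist=%.2f%n",
                    dot(v, x), dot(v, x) / (norm(v) * norm(x)), norm(diff));
        }
    }
}
```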
7. Compute the similarity between the query “smart phones” and the document “smart phones and video phones at smart prices” under lnc.ltn similarity. Assume N = 10,000,000 and treat “and” and “at” as stop words. $df_{smart}$ = 5,000; $df_{video}$ = 50,000; $df_{phones}$ = 25,000; $df_{prices}$ = 30,000. When computing length-normalised weights, you can round the length of a vector to the nearest integer. Here are some log values you may need (the question assumes an exam setting where no calculator is allowed):

x          1    2    3    4    5    6    7    8    9
log10(x)   0    0.3  0.5  0.6  0.7  0.8  0.8  0.9  1.0

(A worked sketch follows this item.)
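A worked sketch of the lnc.ltn computation for item 7 (my own; term order and variable names are arbitrary):

```java
public class LncLtn {
    public static void main(String[] args) {
        double N = 10_000_000;
        // Document "smart phones and video phones at smart prices" with
        // stop words removed: smart(2), phones(2), video(1), prices(1).
        double[] tf = {2, 2, 1, 1};                  // smart, phones, video, prices
        double[] df = {5_000, 25_000, 50_000, 30_000};
        double[] w = new double[4];
        double lenSq = 0;
        for (int i = 0; i < 4; i++) {
            w[i] = 1 + Math.log10(tf[i]);            // l: log term frequency
            lenSq += w[i] * w[i];
        }
        double len = Math.sqrt(lenSq);               // c: cosine normalisation
        double score = 0;
        for (int i : new int[]{0, 1})                // query terms: smart, phones
            score += (w[i] / len) * Math.log10(N / df[i]);  // t: idf on the query side
        System.out.println(score);  // about 3.3 (about 3.8 if len is rounded to 2)
    }
}
```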
8. Ask the students to choose three very short documents (of their own choice, or they can
make them up!) and then
9. If we were to have only one-term queries, explain why the use of weight-ordered postings
lists (i.e., postings lists sorted according to weight, instead of docID) truncated at position
k in the list suffices for identifying the k highest scoring documents. Assume that the weight
w stored for a document d in the postings list of t is the cosine-normalised weight of t for d.
10. Play around with your own implementation:
• Modify your implementation from lecture 2 so that your search model is now a VSM –
several parameters can be varied
4.3 Exam questions
• 2013 P8Q9ab
5 Language Modelling and Text Classification [Lecture 5]
[Please note that this lecture has been restructured to include language modelling for IR.]
You only need to provide the subset of the parameters required to classify the test set (e.g., it is not necessary to estimate the lexical probabilities for “London”).
2. What is bad about maximum likelihood estimates of the parameters P (t|c) in Naive Bayes?
3. What is the time complexity of a Naive Bayes classifier, and why?
4. What is the main independence assumption of Naive Bayes? (Insist on them giving you a formula, not a natural-language description.)
5. *What are the differences/similarities between the multinomial Naive Bayes classifier and
the query-likelihood approach to ranking documents?
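For reference when discussing items 4 and 5, the standard textbook forms of the two scoring rules; structurally they are the same product of unigram probabilities, with classes playing the role of documents:

\[
\hat{c} \;=\; \arg\max_{c}\; P(c) \prod_{i} P(t_i \mid c)
\qquad \text{vs.} \qquad
P(q \mid M_d) \;=\; \prod_{t \in q} P(t \mid M_d)
\]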
6 IR evaluation [Lecture 6]
6.1 Exercises from book
• Exercise 8.3
Derive the equivalence between the two formulae given for the F-measure in the lectures ($F_\alpha$ vs. $F_\beta$).
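For reference, the identity to be derived, in the book's notation with $\beta^2 = (1 - \alpha)/\alpha$:

\[
F \;=\; \frac{1}{\alpha \frac{1}{P} + (1 - \alpha)\frac{1}{R}}
  \;=\; \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}
\]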
• Exercise 8.4
What are the possible values for interpolated precision at a recall level of 0?
• Exercise 8.5
Must there always be a break-even point between Precision and Recall? Either show there
must or provide a counterexample.
• Exercise 8.8 (use average 11-point precision instead of R-precision)
Consider an information need for which there are 4 relevant documents in the collection.
Contrast two systems run on this collection. Their top ten results are judged for relevance
as follows (with the leftmost being the top-ranked search result):
System 1 RNRNNNNNRR
System 2 NRNNRRRNNN
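A small sketch (my own) for checking average precision on judged rankings like the two above; relevant documents that are never retrieved contribute zero:

```java
public class AveragePrecision {
    static double averagePrecision(String ranking, int totalRelevant) {
        int seen = 0;
        double sum = 0.0;
        for (int rank = 1; rank <= ranking.length(); rank++) {
            if (ranking.charAt(rank - 1) == 'R') {
                seen++;
                sum += (double) seen / rank;  // precision at each relevant rank
            }
        }
        return sum / totalRelevant;
    }

    public static void main(String[] args) {
        System.out.println(averagePrecision("RNRNNNNNRR", 4));  // System 1: 0.6
        System.out.println(averagePrecision("NRNNRRRNNN", 4));  // System 2: ~0.49
    }
}
```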
• Exercise 8.9
The following list of Rs and Ns represents relevant (R) and non-relevant (N) returned doc-
uments (as above) in a ranked list of 20 documents retrieved in response to a query from a
collection of 10,000 documents. The top of the ranked list is on the left of the list. The list
shows 6 relevant documents. Assume that there are 8 relevant documents in the collection.
• Exercise 8.10bc Below is a table showing how two human judges rated the relevance of a
set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us
assume that you have written an IR system for this query that returns the set of documents
{4,5,6,7,8}.
docID Judge 1 Judge 2
1 0 0
2 0 0
3 1 1
4 1 1
5 1 0
6 1 0
7 1 0
8 1 0
9 0 1
10 0 1
11 0 1
12 0 1
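For the kappa part of the exercise, a sketch (my own) with the table's judgments hard-coded; expected agreement uses the pooled marginals, as in the book:

```java
public class Kappa {
    public static void main(String[] args) {
        int[] j1 = {0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0};  // Judge 1, docs 1..12
        int[] j2 = {0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1};  // Judge 2
        int n = j1.length, agree = 0, ones = 0;
        for (int i = 0; i < n; i++) {
            if (j1[i] == j2[i]) agree++;
            ones += j1[i] + j2[i];
        }
        double pA = (double) agree / n;        // observed agreement: 4/12
        double pRel = ones / (2.0 * n);        // pooled P(relevant): 12/24
        double pE = pRel * pRel + (1 - pRel) * (1 - pRel);  // expected agreement
        System.out.println((pA - pE) / (1 - pE));  // kappa = -1/3
    }
}
```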
7 Query Expansion and Relevance Feedback [Lecture 7]
7.1 Exercises from Textbook
• Exercise 9.1 In the Rocchio algorithm, what weight setting for α, β, γ does a “Find pages
like this one” search correspond to?
• Exercise 9.2 Give three reasons why relevance feedback has been little used in web search.
• Exercise 9.3 Under what conditions would the modified query qm in Equation 9.3 be the
same as the original query q0 ? In all other cases, is qm closer than q0 to the centroid of the
relevant documents?
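For reference in Exercises 9.1 and 9.3, Rocchio's update (Equation 9.3 in the book), where $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents:

\[
\vec{q}_m \;=\; \alpha \vec{q}_0
\;+\; \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
\;-\; \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
\]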
• Exercise 9.4 Why is positive feedback likely to be more useful than negative feedback to an
IR system? Why might only using one nonrelevant document be more effective than using
several?
• Exercise 16.1 Define two documents as similar if they have at least two proper names like
Clinton or Sarkozy in common. Give an example of an information need and two documents,
for which the cluster hypothesis does not hold for this notion of similarity.
8 Link Analysis [Lecture 8]
8.1 Exercises from book
• Exercise 21.1 Is it always possible to follow directed edges (hyperlinks) in the web graph
from any node (web page) to any other? Why or why not?
• Exercise 21.2 Find an instance of misleading anchor text on the web.
• Exercise 21.5 What is the transition probability matrix for the example given in the book?
• Exercise 21.6 Consider a web graph with three nodes 1, 2 and 3. The links are as follows: 1 ← 2, 3 ← 2, 2 ← 1, 2 ← 3. Write down the transition probability matrix for the surfer's walk with teleporting, for the following three values of the teleport probability: (i) α = 0, (ii) α = 0.5, and (iii) α = 1.
• Exercise 21.7* A user of a browser can, in addition to clicking a hyperlink on the page x
she is currently browsing, use the back button to go back to the page from which she arrived
at x. Can such a use of back buttons be modelled as a Markov chain? How would we model
repeated invocations of the back button?
• Exercise 21.8 Consider a Markov chain with three states, A, B and C, and transition probabilities as follows. From state A, the next state is B with probability 1. From B, the next state is either A with probability $p_A$, or state C with probability $1 - p_A$. From C, the next state is A with probability 1. For what values of $p_A \in [0, 1]$ is this Markov chain ergodic?
• Exercise 21.11 Verify that the PageRank vector for the following transition matrix (from the book and lectures)
d0 d1 d2 d3 d4 d5 d6
d0 0.02 0.02 0.88 0.02 0.02 0.02 0.02
d1 0.02 0.45 0.45 0.02 0.02 0.02 0.02
d2 0.31 0.02 0.31 0.31 0.02 0.02 0.02
d3 0.02 0.02 0.02 0.45 0.45 0.02 0.02
d4 0.02 0.02 0.02 0.02 0.02 0.02 0.88
d5 0.02 0.02 0.02 0.02 0.02 0.45 0.45
d6 0.02 0.02 0.02 0.31 0.31 0.02 0.31
is indeed $\vec{x} = (0.05\ 0.04\ 0.11\ 0.25\ 0.21\ 0.04\ 0.31)$.
Write a small routine (a sketch follows below) or use a scientific calculator to do so. [Please additionally check that they know how to construct a transition matrix with teleportation in general.]
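One possible small routine (my own) doing the verification by power iteration; the matrix rows are copied from the table above:

```java
import java.util.Arrays;

public class PageRank {
    public static void main(String[] args) {
        double[][] P = {
            {0.02, 0.02, 0.88, 0.02, 0.02, 0.02, 0.02},
            {0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02},
            {0.31, 0.02, 0.31, 0.31, 0.02, 0.02, 0.02},
            {0.02, 0.02, 0.02, 0.45, 0.45, 0.02, 0.02},
            {0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.88},
            {0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45},
            {0.02, 0.02, 0.02, 0.31, 0.31, 0.02, 0.31}};
        int n = P.length;
        double[] x = new double[n];
        Arrays.fill(x, 1.0 / n);            // uniform start distribution
        for (int it = 0; it < 100; it++) {  // iterate x <- x P until convergence
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[j] += x[i] * P[i][j];
            x = next;
        }
        System.out.println(Arrays.toString(x));
        // Converges to roughly (0.05 0.04 0.11 0.25 0.21 0.04 0.31).
    }
}
```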
• Exercise 21.22 For the web graph in the corresponding figure in the book, compute PageRank for each of the three pages.
8.2 Other Exercises/Discussion Points
1. Compute PageRank for the web graph below for each of the three pages. Teleportation
probability is 0.6.
[Figure: a web graph over three pages, including d1 and d2, with transition probabilities 0.3, 0.8, 0.7 and 0.2 on the edges.]
3. When using PageRank for ranking, what assumptions are we making about the meaning of
hyperlinks?
4. Why is PageRank a better measure of quality than a simple count of inlinks?
5. What is the meaning of the PageRank q of a page d in the random surfer model?
6. Make sure they understand the two main differences between PageRank and HITS (namely (1) offline vs. online computation, and (2) a different underlying model of importance, i.e. of the inherent properties of a web page).
7. Discuss with them exactly how anchor text is used for queries.