SIGIR 2003 Workshop on Distributed Information Retrieval
Jamie Callan, Fabio Crestani, Mark Sanderson
Introduction
During the last decade companies, governments, and research groups worldwide have
directed significant effort towards the creation of sophisticated digital libraries across a
variety of disciplines. As digital libraries proliferate across many media and sources,
problems of resource selection and data fusion become major
obstacles. Traditional search engines, even very large systems such as Google, are
unable to provide access to the “Hidden Web” of information that is only available via
digital library search interfaces. Effective, reliable information retrieval also requires the
ability to pose multimedia queries across many digital libraries. The answer to a query
about the lyrics to a folk song might be text or an audio recording, but few systems today
could deliver both data types in response to a single, simple query. Distributed
Information Retrieval addresses issues that arise when people have routine access to
thousands of multimedia digital libraries.
The SIGIR 2003 Workshop on Distributed Information Retrieval was held on August 1,
2003, at the University of Toronto following the SIGIR conference, to provide a venue
for the presentation and discussion of recent research on the design and implementation
of methods and tools for resource description, resource selection, data fusion, and user
interaction. About 25 people attended, including representatives from university and
industrial research labs. Participants were encouraged to ask questions during and after
presentations, which they did. The formal presentations were followed by a general
discussion of the state of the art in distributed information retrieval, with a particular
emphasis on what still needs to be done.
Presentation Summaries
Below we provide very brief descriptions of the workshop presentations, to give a sense
of the range of themes and topics covered. The complete workshop proceedings are
available in electronic form at http://www.cs.cmu.edu/~callan/Workshops/dir03/.
Revised and extended versions of the workshop papers will also be published as part of a
book on distributed information retrieval that will appear later this year in Springer-
Verlag’s Lecture Notes in Computer Science (LNCS) series.
“Collection fusion for distributed image retrieval” by Berretti, Del Bimbo, and Pala,
described a model-based approach to image data fusion (i.e., merging results from
different image libraries). During an offline model-learning stage, training data is
acquired by a form of query-based sampling in which queries are submitted to an image
library, images are retrieved (with their library-specific scores), and normalized, library-
independent scores are computed with a fusion search engine. When sampling is
complete, images from each library are clustered into groups, and pairs of library-specific
and normalized scores are combined to learn group-specific linear models. During
interactive retrieval an image’s score is normalized by finding the most similar cluster
and using the model parameters associated with the cluster. The method is very fast and
worked well in experimental evaluations.
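
To make the model-learning step concrete, the following Python sketch fits one linear normalization model per cluster and applies the nearest cluster's model at query time. The function names, data structures, and nearest-centroid test are our own assumptions for illustration, not the authors' implementation.

    import numpy as np

    def fit_cluster_models(training_pairs):
        # training_pairs: cluster_id -> list of (library_score, normalized_score)
        # pairs gathered by query-based sampling against one image library.
        models = {}
        for cluster_id, pairs in training_pairs.items():
            x = np.array([p[0] for p in pairs])  # library-specific scores
            y = np.array([p[1] for p in pairs])  # library-independent scores
            slope, intercept = np.polyfit(x, y, 1)  # least-squares line
            models[cluster_id] = (slope, intercept)
        return models

    def normalize(image_features, library_score, centroids, models):
        # Find the most similar cluster, then apply its linear model.
        nearest = min(centroids,
                      key=lambda c: np.linalg.norm(image_features - centroids[c]))
        slope, intercept = models[nearest]
        return slope * library_score + intercept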
“Recent results on fusion of effective retrieval strategies in the same information retrieval
system” by Beitzel, Jensen, Chowdhury, Grossman, Goharian, and Frieder took a new
look at meta-search by studying it within a single retrieval system. Meta-search is known
to improve retrieval results, but prior research often focused on fusion from different
retrieval systems, which conflates effects due to different representations and retrieval
models. In this study the representation was held constant. The results were unexpected.
The number of documents that appear in multiple retrieval lists (“overlap documents”) is
considered a good clue to the effectiveness of data fusion; for example, the well-known
CombMNZ method exploits this feature. However, it was a poor predictor in this setting,
rewarding common “near miss” documents and penalizing “maverick” relevant
documents found by only a single method. This paper encourages a more careful
examination of representation vs. retrieval model effects in future meta-search research.
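
For reference, the well-known CombMNZ method multiplies the sum of a document's normalized scores by the number of result lists in which it appears, which is why overlap documents are rewarded. A minimal Python sketch, assuming scores have already been normalized to a common range:

    from collections import defaultdict

    def comb_mnz(result_lists):
        # result_lists: list of {doc_id: normalized_score} dictionaries,
        # one per retrieval method. Documents in many lists get boosted.
        score_sum = defaultdict(float)
        num_lists = defaultdict(int)
        for ranking in result_lists:
            for doc_id, score in ranking.items():
                score_sum[doc_id] += score
                num_lists[doc_id] += 1
        fused = {d: score_sum[d] * num_lists[d] for d in score_sum}
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)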
“Apoidea: A decentralized peer-to-peer architecture for crawling the World Wide Web”
by Singh, Srivatsa, Liu, and Miller described a new spider architecture based on
distributed hash tables (DHTs). Each node is responsible for a portion of the address (URL) space; each
domain is covered by a single node, which keeps communication among nodes to a
manageable level. Exact duplicate detection is handled in a similar manner, by
converting Web pages to hash values and making each peer responsible for a portion of
the hash address space. The distributed approach makes it easy to distribute crawling
geographically, possibly reducing communications costs. Initial experiments show very
nearly linear scale-up as the number of nodes is increased.
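
The partitioning idea can be illustrated with a short Python sketch; here simple modulo hashing over a fixed peer list stands in for the actual DHT routing, an assumption of the sketch rather than Apoidea's protocol.

    import hashlib
    from urllib.parse import urlparse

    def peer_for_url(url, peers):
        # Hash the domain, not the full URL, so all pages from one
        # domain are crawled by a single node, limiting inter-node chatter.
        domain = urlparse(url).netloc
        digest = int(hashlib.sha1(domain.encode()).hexdigest(), 16)
        return peers[digest % len(peers)]

    def peer_for_content(page_text, peers):
        # Exact-duplicate detection: each peer owns a slice of the
        # content-hash space and checks new pages against it.
        digest = int(hashlib.sha1(page_text.encode()).hexdigest(), 16)
        return peers[digest % len(peers)]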
“Towards virtual knowledge communities in peer-to-peer networks” by Gnasa, Alda,
Grigull, and Cremers described a peer-to-peer architecture consisting of personal digital
libraries (“personal search memory” or PeerSy) and an architecture that lets them
organize into virtual knowledge communities (VKCs). Virtual knowledge communities
are formed by clustering nodes based on each node’s frequently-asked and seldom-asked
queries and bookmarked documents (considered relevant). New queries are sent to one’s
personal digital library (PeerSy), one’s Virtual Knowledge Community, and Google. The
expectation is that documents found within a person’s personal digital library and Virtual
Knowledge Community will be better matches for an information need, possibly
reflecting some combination of past searching and browsing behavior. The work is at the
initial prototype stage.
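
A query in such a system fans out to several sources, roughly as in the hypothetical Python sketch below; the search interfaces and the merge order are our assumptions, not the prototype's API.

    def route_query(query, peersy, vkc_peers, web_search):
        # Ask the personal search memory first, then community peers,
        # then fall back to a general Web engine such as Google.
        local = peersy.search(query)
        community = [doc for peer in vkc_peers for doc in peer.search(query)]
        web = web_search(query)
        return local + community + web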
Conclusions
During the general discussion there was considerable debate about the state of resource
selection research. Resource selection has been the driving topic in this research area for
the last decade, and there has been steady improvement, but the upper bound remains
unknown. Precision-oriented methods dominated past research, with much success, but
high recall and high diversity are neglected topics that are particularly important in some
domains, for example to better represent the range of information available.
Participants felt that data fusion research needs to continue its transition to stronger
theoretical models. The field does not yet understand how differing levels of overlap
among resources affect fusion algorithms; the research community is split between
“much overlap” settings (e.g., meta-search) and “little overlap” settings (e.g., distributed IR), but the real world is
more complex. The field also needs to learn to model the interaction between resource
selection and data fusion. Improvements in resource selection may have a large effect or
none depending on the data fusion algorithm, but today the interaction is unpredictable.
The topic that generated the most discussion was, of course, evaluation. There was broad
agreement that there is too much focus on testbeds based on TREC data. Participants felt
that it is especially necessary to model the size and relevance distributions of real sites,
and that it might be possible to get such information from industry. There was
recognition that different tasks and environments will have different characteristics, and
that the research community needs to devote more effort to understanding what they are.
A major obstacle for many researchers is that distributed IR is still rare in the “real
world”, so it is difficult to find “real” data, users, and failures. The clearest example of
distributed IR in use today is in peer-to-peer networks such as KaZaA.
Relevance-based ranking (RBR) is a convenient and clear metric, but participants felt that
the field will need to transition to a utility-based metric, possibly something like Norbert
Fuhr’s decision-theoretic framework, which encompasses a wider range of user criteria.
Such a transition will require a much better understanding of user information needs in
distributed environments, for example, the importance of relevance vs. communication
time, search vs. browsing, and relevance vs. diversity.
One could summarize the discussion of evaluation as a strong worry that researchers are
stuck searching under the same old lampposts (i.e., searching where it is easiest to search)
due to a lack of realistic data and user information needs. Participants expressed support
for a TREC track or INEX-style project to focus attention on creating new datasets, task
models, and evaluation metrics. The TREC-4 and TREC-5 Database Merging tracks
were conducted before a distributed IR research community had developed, and hence
they attracted little participation. Today, with active interest in distributed IR and
federated search from a variety of research communities, a similar effort would have a
much better chance of success.
Acknowledgements
We thank the organizers of SIGIR 2003 for their support of the workshop. We also thank
the members of the Program Committee (Donatella Castelli, Jim French, Norbert Fuhr,
Luis Gravano, Umberto Straccia) for their efforts on behalf of the workshop. The
workshop was sponsored in part by the MIND Project (EU research contract IST-2000-
26061), the University of Toronto, and the Knowledge Media Design Institute.