SIGIR 2003 Workshop on Distributed Information Retrieval
Jamie Callan, Fabio Crestani, Mark Sanderson
Introduction
During the last decade companies, governments, and research groups worldwide have
directed significant effort towards the creation of sophisticated digital libraries across a
variety of disciplines. As digital libraries proliferate across many media and sources,
problems of resource selection and data fusion become major
obstacles. Traditional search engines, even very large systems such as Google, are
unable to provide access to the “Hidden Web” of information that is only available via
digital library search interfaces. Effective, reliable information retrieval also requires the
ability to pose multimedia queries across many digital libraries. The answer to a query
about the lyrics to a folk song might be text or an audio recording, but few systems today
could deliver both data types in response to a single, simple query. Distributed
Information Retrieval addresses issues that arise when people have routine access to
thousands of multimedia digital libraries.
The SIGIR 2003 Workshop on Distributed Information Retrieval was held on August 1,
2003, at the University of Toronto following the SIGIR conference, to provide a venue
for the presentation and discussion of recent research on the design and implementation
of methods and tools for resource description, resource selection, data fusion, and user
interaction. About 25 people attended, including representatives from university and
industrial research labs. Participants were encouraged to ask questions during and after
presentations, which they did. The formal presentations were followed by a general
discussion of the state of the art in distributed information retrieval, with a particular
emphasis on what still needs to be done.
Presentation Summaries
Below we provide very brief descriptions of the workshop presentations, to give a sense
of the range of themes and topics covered. The complete workshop proceedings are
available in electronic form at http://www.cs.cmu.edu/~callan/Workshops/dir03/.
Revised and extended versions of the workshop papers will also be published as part of a
book on distributed information retrieval that will appear later this year in Springer-
Verlag’s Lecture Notes in Computer Science (LNCS) series.
“Collection fusion for distributed image retrieval” by Berretti, Del Bimbo, and Pala,
described a model-based approach to image data fusion (i.e., merging results from
different image libraries). During an offline model-learning stage, training data is
acquired by a form of query-based sampling in which queries are submitted to an image
library, images are retrieved (with their library-specific scores), and normalized, library-
independent scores are computed with a fusion search engine. When sampling is
complete, images from each library are clustered into groups, and pairs of library-specific
and normalized scores are combined to learn group-specific linear models. During
interactive retrieval an image’s score is normalized by finding the most similar cluster
and using the model parameters associated with the cluster. The method is very fast and
worked well in experimental evaluations.
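
To make the model-learning step concrete, the following Python sketch fits one linear normalization model per cluster and applies the nearest cluster's model at query time. The function names, data structures, and nearest-centroid test are our own assumptions for illustration, not the authors' implementation.

    import numpy as np

    def fit_cluster_models(training_pairs):
        # training_pairs: cluster_id -> list of (library_score, normalized_score)
        # pairs gathered by query-based sampling against one image library.
        models = {}
        for cluster_id, pairs in training_pairs.items():
            x = np.array([p[0] for p in pairs])  # library-specific scores
            y = np.array([p[1] for p in pairs])  # library-independent scores
            slope, intercept = np.polyfit(x, y, 1)  # least-squares line
            models[cluster_id] = (slope, intercept)
        return models

    def normalize(image_features, library_score, centroids, models):
        # Find the most similar cluster, then apply its linear model.
        nearest = min(centroids,
                      key=lambda c: np.linalg.norm(image_features - centroids[c]))
        slope, intercept = models[nearest]
        return slope * library_score + intercept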
“Recent results on fusion of effective retrieval strategies in the same information retrieval
system” by Beitzel, Jensen, Chowdhury, Grossman, Goharian, and Frieder took a new
look at meta-search by studying it within a single retrieval system. Meta-search is known
to improve retrieval results, but prior research often focused on fusion from different
retrieval systems, which conflates effects due to different representations and retrieval
models. In this study the representation was held constant. The results were unexpected.
The number of documents that appear in multiple retrieval lists (“overlap documents”) is
considered a good clue to the effectiveness of data fusion; for example, the well-known
CombMNZ method exploits this feature. However, it was a poor predictor in this setting,
rewarding common “near miss” documents and penalizing “maverick” relevant
documents found by only a single method. This paper encourages a more careful
examination of representation vs. retrieval model effects in future meta-search research.
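
For reference, the well-known CombMNZ method multiplies the sum of a document's normalized scores by the number of result lists in which it appears, which is why overlap documents are rewarded. A minimal Python sketch, assuming scores have already been normalized to a common range:

    from collections import defaultdict

    def comb_mnz(result_lists):
        # result_lists: list of {doc_id: normalized_score} dictionaries,
        # one per retrieval method. Documents in many lists get boosted.
        score_sum = defaultdict(float)
        num_lists = defaultdict(int)
        for ranking in result_lists:
            for doc_id, score in ranking.items():
                score_sum[doc_id] += score
                num_lists[doc_id] += 1
        fused = {d: score_sum[d] * num_lists[d] for d in score_sum}
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)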
“Apoidea: A decentralized peer-to-peer architecture for crawling the World Wide Web”
by Singh, Srivatsa, Liu, and Miller described a new spider architecture based on
distributed hash tables (DHTs). Each node is responsible for a portion of the address (URL) space; each
domain is covered by a single node, which keeps communication among nodes to a
manageable level. Exact duplicate detection is handled in a similar manner, by
converting Web pages to hash values and making each peer responsible for a portion of
the hash address space. The distributed approach makes it easy to distribute crawling
geographically, possibly reducing communications costs. Initial experiments show very
nearly linear scale-up as the number of nodes is increased.
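
The partitioning idea can be illustrated with a short Python sketch; here simple modulo hashing over a fixed peer list stands in for the actual DHT routing, an assumption of the sketch rather than Apoidea's protocol.

    import hashlib
    from urllib.parse import urlparse

    def peer_for_url(url, peers):
        # Hash the domain, not the full URL, so all pages from one
        # domain are crawled by a single node, limiting inter-node chatter.
        domain = urlparse(url).netloc
        digest = int(hashlib.sha1(domain.encode()).hexdigest(), 16)
        return peers[digest % len(peers)]

    def peer_for_content(page_text, peers):
        # Exact-duplicate detection: each peer owns a slice of the
        # content-hash space and checks new pages against it.
        digest = int(hashlib.sha1(page_text.encode()).hexdigest(), 16)
        return peers[digest % len(peers)]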
“Towards virtual knowledge communities in peer-to-peer networks” by Gnasa, Alda,
Grigull, and Cremers described a peer-to-peer architecture consisting of personal digital
libraries (“personal search memory” or PeerSy) and an architecture that lets them
organize into virtual knowledge communities (VKCs). Virtual knowledge communities
are formed by clustering nodes based on each node’s frequently-asked and seldom-asked
queries and bookmarked documents (considered relevant). New queries are sent to one’s
personal digital library (PeerSy), one’s Virtual Knowledge Community, and Google. The
expectation is that documents found within a person’s personal digital library and Virtual
Knowledge Community will be better matches for an information need, possibly
reflecting some combination of past searching and browsing behavior. The work is at the
initial prototype stage.
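
A query in such a system fans out to several sources, roughly as in the hypothetical Python sketch below; the search interfaces and the merge order are our assumptions, not the prototype's API.

    def route_query(query, peersy, vkc_peers, web_search):
        # Ask the personal search memory first, then community peers,
        # then fall back to a general Web engine such as Google.
        local = peersy.search(query)
        community = [doc for peer in vkc_peers for doc in peer.search(query)]
        web = web_search(query)
        return local + community + web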
Conclusions
During the general discussion there was considerable debate about the state of resource
selection research. Resource selection has been the driving topic in this research area for
the last decade, and there has been steady improvement, but the upper bound remains
unknown. Precision-oriented methods dominated past research, with much success, but
high recall and high diversity are neglected topics that are particularly important in some
domains, for example to better represent the range of information available.
Participants felt that data fusion research needs to continue its transition to stronger
theoretical models. The field does not yet understand how differing levels of overlap
among resources affect fusion algorithms; the research community is split between
“much overlap” settings (e.g., meta-search) and “little overlap” settings (e.g., distributed IR), but the real world is
more complex. The field also needs to learn to model the interaction between resource
selection and data fusion. Improvements in resource selection may have a large effect or
none depending on the data fusion algorithm, but today the interaction is unpredictable.
The topic that generated the most discussion was, of course, evaluation. There was broad
agreement that there is too much focus on testbeds based on TREC data. Participants felt
that it is especially necessary to model the size and relevance distributions of real sites,
and that it might be possible to get such information from industry. There was
recognition that different tasks and environments will have different characteristics, and
that the research community needs to devote more effort to understanding what they are.
A major obstacle for many researchers is that distributed IR is still rare in the “real
world”, so it is difficult to find “real” data, users, and failures. The clearest example of
distributed IR in use today is in peer-to-peer networks such as KaZaA.
Relevance-based ranking (RBR) is a convenient and clear metric, but participants felt that
the field will need to transition to a utility-based metric, possibly something like Norbert
Fuhr’s decision-theoretic framework, which encompasses a wider range of user criteria.
Such a transition will require a much better understanding of user information needs in
distributed environments, for example, the importance of relevance vs. communication
time, search vs. browsing, and relevance vs. diversity.
One could summarize the discussion of evaluation as a strong worry that researchers are
stuck searching under the same old lampposts (i.e., searching where it is easiest to search)
due to a lack of realistic data and user information needs. Participants expressed support
for a TREC track or INEX-style project to focus attention on creating new datasets, task
models, and evaluation metrics. The TREC-4 and TREC-5 Database Merging tracks
were conducted before a distributed IR research community had developed, and hence
they attracted little participation. Today, with active interest in distributed IR and
federated search from a variety of research communities, a similar effort would have a
much better chance of success.
Acknowledgements
We thank the organizers of SIGIR 2003 for their support of the workshop. We also thank
the members of the Program Committee (Donatella Castelli, Jim French, Norbert Fuhr,
Luis Gravano, Umberto Straccia) for their efforts on behalf of the workshop. The
workshop was sponsored in part by the MIND Project (EU research contract IST-2000-
26061), the University of Toronto, and the Knowledge Media Design Institute.