Balancing Volume, Quality and Freshness in Web Crawling: Ricardo Baeza-Yates and Carlos Castillo

This document discusses balancing volume, quality, and freshness in web crawling. It provides an overview of search engines and web crawlers, and discusses the challenges of crawling including conflicting goals of refreshing pages versus discovering new pages. It outlines a software architecture for crawlers that is tightly integrated with search engine indices. The architecture prioritizes pages based on quality and freshness metrics. Implementation details are also covered, such as a non-trivial distributed architecture and challenges in performance benchmarking.


Balancing Volume, Quality and Freshness in Web Crawling


Ricardo Baeza-Yates and Carlos Castillo

Center for Web Research


DCC / UCHILE
2002

Agenda
Motivation
Search Engines
Web Crawlers
Crawling
Crawling problems
Software architecture
Implementation details
Concluding Remarks

Introduction 1/3

The Web
Web: largest public repository of data (more than 3 billion static pages)
Today, there are more than 40 million Web servers
Well-connected graph with power-law out-link and in-link distributions
[Figure: log-log plot of the link distribution, a power law y ∝ x^(−β); the Web graph is self-similar and self-organizing]

Introduction 2/3

Web Retrieval
Challenge: find information in this data
Problems:
– volume
– fast rate of change and growth
– dynamic content
– redundancy
– organization and data quality
– diversity, etc...

Introduction 3/3

Motivation

Main problem of the Web: scalability


Search engines are one of the most important tools
Crawling the Web is the current resource bottleneck
Our main motivation: the WIRE project (Web IR Environment)
Can we do it better?

Search Engines 1/2

Centralized Architectures

[Diagram: centralized search-engine architecture, with crawlers fetching pages from the Web]
Search Engines 2/2

Crawling
Crawlers: page collection and parsing
Crawlers do not crawl! They are static; pages are brought to them
Bottlenecks: available bandwidth and CPU
They use the wrong paradigm: pulling
Pulling vs. pushing

Web crawlers 1/3

First Generation
RBSE Spider
– Web of 100,000 pages
Crawler of the Internet Archive
WebCrawler
Common aspects:
– Avoid rejection from webmasters
– Try to discover new pages; known pages are scarce

Web crawlers 2/3

Second Generation
Mercator, Sphinx
– Extensible architectures
– Multi-protocol, multi-purpose
Lycos, Excite, Google
– Distributed (farm) crawlers
Parallel crawlers

Web crawlers 3/3

Standard Architecture
[Diagram: standard crawler architecture; a multi-threaded networking module and a scheduler connect the World Wide Web with a URL database and a document collection]
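In outline, this architecture amounts to a loop: take a URL from the scheduler, fetch it, store the document, and feed the extracted links back. A minimal single-threaded sketch (Python; the function name and the regex-based link extraction are ours, for illustration only):

```python
# Minimal sketch of the standard crawler loop: scheduler (frontier),
# networking (urlopen), URL database (seen set), document collection.
# Real crawlers are multi-threaded and far more careful than this.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)    # scheduler: URLs waiting to be fetched
    seen = set(seeds)          # URL database: avoid re-enqueueing URLs
    collection = {}            # document collection: url -> html
    while frontier and len(collection) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue           # networking failure: skip slow or dead servers
        collection[url] = html
        # parsing: extract links and feed them back to the scheduler
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return collection
```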

Crawling 1/4

Main Goals
The crawler's goals depend on the index of the search engine:
– The index should contain a large number of objects that are interesting for the users
– The index should accurately represent the real objects on the Web
– The representation should capture the most significant aspects of each object using the minimum amount of resources

Crawling 2/4

Main Issues
Interesting object?
Quality: intrinsic semantic value of the object. Possible estimators (a sketch of one follows this list):
– content analysis (e.g. vector model)
– link analysis (e.g. PageRank)
– similarity to a driving query (focused crawlers, searching agents)
– usage popularity
– location-based (depth, geography, etc.)
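As an illustration of the link-analysis estimator, here is a toy power-iteration version of PageRank; the three-page graph and the damping factor of 0.85 are made-up example values, not anything from the slides:

```python
# Toy PageRank by power iteration over a tiny hypothetical link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages   # dangling page: spread evenly
            for q in targets:
                new[q] += damping * rank[p] / len(targets)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))  # "c", with two in-links, ranks highest
```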

Crawling 3/4

Main Issues
Accurate content of the objects?
Quantity: as many objects as possible
– Accurate content depends on the size and format of the representation
– Accurate object through time?
Freshness: how recent is the object
– Web updates are common; half of the Web is less than 6 months old
– Freshness can be estimated for most Web servers (a sketch follows)
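One simple way to estimate freshness is from the HTTP Last-Modified header, when the server provides one (many do not). The exponential-decay model and the 90-day half-life below are illustrative assumptions, not the authors' formula:

```python
# Sketch: estimate freshness from the Last-Modified header, decaying
# exponentially with age. Model and constants are illustrative only.
import math
import time
from urllib.request import urlopen
from email.utils import parsedate_to_datetime

def estimated_freshness(url, half_life_days=90.0):
    """Return a value in (0, 1]: 1.0 means just modified, near 0 means stale."""
    with urlopen(url, timeout=10) as response:
        last_modified = response.headers.get("Last-Modified")
    if last_modified is None:
        return 0.5  # unknown: fall back to a neutral prior
    age = time.time() - parsedate_to_datetime(last_modified).timestamp()
    return math.exp(-math.log(2) * (age / 86400.0) / half_life_days)
```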

Crawling 4/4

Taxonomy
[Figure: taxonomy of crawler types]

Crawling Problems 1/3

Conflicting Goals

Refresh a known page or fetch a new one?
– New pages are only discovered through links in updated or new pages
– Trade-off between quantity (more objects) and quality-freshness (interesting and up-to-date objects)
Secondary Goals
– Resource usage and politeness

Crawling problems 2/3

Scheduling
Keep collection fresh
– More important pages must be kept fresher
Use the network to the maximum extent
Do not overload servers (robot politeness)
Discover new pages (updated pages first)
All this depends on the type of crawler

Crawling problems 3/3

Technical Problems
DNS is a bottleneck
Duplicates, slow servers
Crawler traps
Dynamic pages and session-ids in URLs
Thread implementations
– For example, given a constant C = p·k·t, what is the optimal number of processors p, each running k processes with t threads?
– Simulation is difficult, and so is experimentation

Software Architecture 1/3

Our Approach
Goals depend on knowledge held in the index, so crawlers should be tightly integrated with the index
Page scheduling:
– Long-term: quality-freshness vs. quantity
– Short-term: resource usage and politeness
This is the approach we use in WIRE

Software Architecture 2/3

High-Level Architecture
[Diagram: high-level architecture. The Manager performs long-term scheduling, taking page metadata and producing prioritized task batches. The Seeder checks new pages and performs link analysis; the Harvester performs short-term scheduling and networking. The Gatherer parses fetched documents (HTML and others), adds them to the collection, and passes extracted links (URLs) back to the Seeder]

Software architecture 3/3

[Diagram: the Manager, Seeder, Harvester and Gatherer modules connected to the indices and to the URL, metadata, link and text structures]

Implementation details 1/5

Non-Trivial

Shared-nothing distributed architecture
Performance: how good the result is, and how fast it is accomplished
No benchmarks, no standard measures
– Pages per processor cycle, per unit of bandwidth, or per unit of time?

Implementation details 2/5


Manager
From the N pages in the collection, the Manager selects a small batch of n << N pages for the harvester, ranked by priority = score × (1 − freshness)
[Diagram: example page records, e.g. score=0.7, freshness=0.9 → priority=0.07; score=0.2, freshness=0.8 → priority=0.04]
Implementation details 3/5
URL resolving (Seeder)
INPUT: http://host.domain.com/path/file.html
Hash the hostname: h1(host.domain.com) → SITEid = 235
Hash the path within that site: h2('235 path/file.html') → DOCid = 9421
OUTPUT: SITEid = 235; DOCid = 9421
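A toy version of this two-level resolution, with in-memory tables standing in for the on-disk hash structures (sequential ID assignment is our simplification):

```python
# Sketch of two-level URL-to-ID resolution: hostnames map to SITEids,
# and (SITEid, path) keys map to DOCids. Tables are in-memory stand-ins.
from urllib.parse import urlparse

site_ids = {}   # h1: hostname -> SITEid
doc_ids = {}    # h2: 'SITEid path' -> DOCid

def resolve(url):
    parts = urlparse(url)
    site_id = site_ids.setdefault(parts.netloc, len(site_ids) + 1)
    doc_id = doc_ids.setdefault(f"{site_id} {parts.path.lstrip('/')}",
                                len(doc_ids) + 1)
    return site_id, doc_id

print(resolve("http://host.domain.com/path/file.html"))  # e.g. (1, 1)
```

One benefit of resolving the site first is that per-site bookkeeping (DNS results, politeness timers) can hang off a single SITEid record.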



Implementation details 4/5


[Diagram: the Gatherer storing document DOCid 9421: (1) hash the URL; (2) the "URL seen?" check returns old=NULL, so new DOCid 9421 is inserted; (3) the offsets list maps DOCids (..., 9420, 9421, 9422, 9423, ...) to offsets in disk storage; (4) the document is written to disk storage, with a free-space list tracking available regions]
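A compact sketch of this store: an append-only data file plus a DOCid-to-offset table. The free-space list and on-disk index of the real system are omitted here:

```python
# Sketch of the Gatherer's "URL seen?" check and offset-based storage:
# an append-only data file and a DOCid -> (offset, length) table.
class DocStore:
    def __init__(self, path):
        self.path = path
        self.offsets = {}            # DOCid -> (offset, length)

    def seen(self, doc_id):
        return doc_id in self.offsets

    def add(self, doc_id, content):
        if self.seen(doc_id):        # URL seen? skip duplicate storage
            return False
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(content)
        self.offsets[doc_id] = (offset, len(content))
        return True

    def get(self, doc_id):
        offset, length = self.offsets[doc_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

store = DocStore("collection.dat")
store.add(9421, b"<html>...</html>")
print(store.seen(9421), store.get(9421))
```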

Implementation details 5/5
Harvester
[Diagram: short-term scheduling. A batch of URLs from the Manager (a.com/p, a.com/q, b.com/r, b.com/s, b.com/t, c.com/u, d.com/v) is grouped by site by the Scheduler; parallel fetchers (Fetch1 to Fetch4) each work on a different site, and fetched pages pass through temporary storage to the Gatherer]
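A sketch of the scheduling idea: keep one queue per site and serve the queues round-robin, so that concurrent fetchers never hit the same server with back-to-back requests. The hostnames are taken from the diagram above:

```python
# Short-term scheduling sketch: one queue per site, served round-robin,
# so a polite crawler spreads out its requests to each server.
from collections import defaultdict, deque
from urllib.parse import urlparse

def schedule(batch):
    """Yield URLs so that consecutive picks rotate across different sites."""
    queues = defaultdict(deque)
    for url in batch:
        queues[urlparse(url).netloc].append(url)
    while queues:
        for host in list(queues):
            yield queues[host].popleft()
            if not queues[host]:
                del queues[host]

batch = ["http://a.com/p", "http://a.com/q", "http://b.com/r",
         "http://b.com/s", "http://b.com/t", "http://c.com/u",
         "http://d.com/v"]
print(list(schedule(batch)))
# -> a.com/p, b.com/r, c.com/u, d.com/v, a.com/q, b.com/s, b.com/t
```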

Concluding Remarks
Prototype currently under evaluation
Future: Agents and pushing?
Real networks are dynamic
– Crawlers do not care about network topology... but they could (and should!)
– Redundancy and fault tolerance are based on randomness
– How to avoid changing the information market in the wrong (almost any!) direction?

Questions, comments? ...

