Balancing Volume, Quality and Freshness in Web Crawling: Ricardo Baeza-Yates and Carlos Castillo
Agenda
Motivation
Search Engines
Web Crawlers
Crawling
Crawling problems
Software architecture
Implementation details
Concluding Remarks
Introduction 1/3
The Web
Web: largest public repository of data (more than 3 billion static pages)
Today, there are more than 40 million Web servers
Well-connected graph with out-link and in-link power-law distributions
[Figure: log-log plot of a power-law link distribution, x^(−β); the Web is self-similar and self-organizing]
Introduction 2/3
Web Retrieval
Challenge: find information in this data
Problems:
– volume
– fast rate of change and growth
– dynamic content
– redundancy
– organization and data quality
– diversity, etc...
Introduction 3/3
Motivation
Centralized architectures
[Diagram: crawlers fetching pages from the Web into a central repository]
Search Engines 2/2
Crawling
Crawlers: page collection and parsing
Crawlers do not crawl! They are static
Bottlenecks: available bandwidth and CPU
They use the wrong paradigm: pulling
Pulling vs. pushing
First Generation
RBSE Spider
– Web of 100,000 pages
Crawler of the Internet Archive
Web Crawler
Common aspects:
– Avoid rejection from webmasters
– Try to discover new pages, as pages were scarce
Web crawlers 2/3
Second Generation
Mercator, Sphinx
– Extensible architectures
– Multi-protocol, multi-purpose
Lycos, Excite, Google
– Distributed (farm) crawlers
Parallel crawlers
Standard Architecture
[Diagram: a multi-threaded Scheduler drives the Networking component, which fetches pages from the World Wide Web into the URL database and the document collection]
Crawling 1/4
Main Goals
The crawler's goals depend on the search engine's index:
– The index should contain a large
number of objects that are interesting for
the users
– The index should accurately represent a
real object on the Web
– Generate a representation that captures
the most significant aspects of the object
using the minimum amount of resources
Crawling 2/4
Main Issues
Interesting object?
Quality: intrinsic semantic value of the
object. Possible estimators:
– content analysis (e.g. vector model)
– link analysis (e.g. PageRank)
– similarity to a driven query (focused
crawlers, searching agents)
– usage popularity
– location based (depth, geography, etc.)
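The link-analysis estimator above can be illustrated with a minimal PageRank power iteration. This is a sketch, not WIRE's implementation: it assumes the whole link graph fits in memory as a dict mapping each page to its out-links, and that every link target appears as a key.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over links: page -> list of out-links.

    Assumes every link target is also a key of `links`.
    d is the damping factor; rank mass sums to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page gets the teleportation share (1 - d) / n
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page: spread its mass evenly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank
```

Pages with many in-links (or in-links from high-ranked pages) end up with higher scores, which is what makes it usable as a crawl-priority estimator.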
Crawling 3/4
Main Issues
Accurate content of the objects?
Quantity: as many objects as possible
– Accurate content depends on the size
and format of the representation
– Accurate object through time?
Freshness: how recent is the object
– Web updates are common, half of the
Web is less than 6 months old
– Freshness can be estimated for most
Web servers
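One common way to estimate freshness (an assumption here, not a method stated on the slide) is to model page changes as a Poisson process with a per-page change rate λ; the probability that a copy fetched t days ago is still fresh is then e^(−λt):

```python
import math

def freshness(change_rate: float, age_days: float) -> float:
    """Probability the local copy is still fresh, assuming page changes
    follow a Poisson process with `change_rate` changes per day.
    """
    return math.exp(-change_rate * age_days)
```

A page that never changes stays fresh with probability 1; a frequently changing page decays toward 0 quickly, which is why it must be revisited more often.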
Crawling 4/4
Taxonomy
Crawling Problems 1/3
Conflicting Goals
Scheduling
Keep collection fresh
– More important pages must be kept fresher
Use the network to the maximum extent
Do not overload servers (robot politeness)
Discover new pages (updated pages first)
All this depends on the type of crawler
Crawling Problems 3/3
Technical Problems
DNS is a bottleneck
Duplicates, slow servers
Crawler traps
Dynamic pages and session-ids in URLs
Thread implementations
– For example, given a constant C = p × k × t, what is the optimal number of processors p running k processes with t threads each?
– Simulation is difficult, and so is experimenting
Our Approach
Goals depend on index knowledge, so crawlers should be tightly integrated with the index
Page scheduling:
– Long-term: quality-freshness vs. quantity
– Short-term: resource usage and politeness
Approach that we use in WIRE
Software Architecture 2/3
High-Level Architecture
[Diagram: the Manager performs long-term scheduling, turning page metadata and priorities into tasks (batch + metadata); the Harvester performs short-term scheduling and networking; the Seeder checks for new pages and does link analysis; the Gatherer parses fetched pages; all share indices of URLs, links, and text]
Implementation details 1/5
Non-trivial: priority = score × (1 − freshness)
Example: score = 0.2, freshness = 0.8 ⇒ priority = 0.2 × 0.2 = 0.04
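The long-term priority formula can be written directly as a function (a minimal sketch; the function name is mine, not WIRE's):

```python
def priority(score: float, freshness: float) -> float:
    """Long-term scheduling priority: high-quality pages whose local
    copies are stale come first; a perfectly fresh copy gets 0.
    """
    return score * (1.0 - freshness)
```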
Implementation details 3/5
URL resolving (Seeder)
Input: http://host.domain.com/path/file.html
[Diagram of the "URL seen?" test: (1) hash the host name, h1(host.domain.com), to a host id (here 235); (2) hash the host id plus path, h2('235 path/file.html'); (3) look up the URL — here old = NULL, so a new doc id 9421 is assigned; (4) the doc id indexes an offsets list and a free-space list over the disk storage]
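The two-level "URL seen?" test (hash the host name first, then hash host id + path) can be approximated in memory as follows. The class and its dict-based tables are illustrative stand-ins for WIRE's on-disk offset and free-space lists, not its actual code.

```python
class UrlSeen:
    """Two-level URL-seen test: host name -> host id,
    then (host id, path) -> doc id."""

    def __init__(self):
        self.hosts = {}  # host name -> host id
        self.docs = {}   # (host id, path) -> doc id

    def resolve(self, url):
        """Return (doc_id, seen_before) for a URL of the form
        'host.domain.com/path/file.html' (no scheme)."""
        host, _, path = url.partition('/')
        host_id = self.hosts.setdefault(host, len(self.hosts))
        key = (host_id, path)
        seen = key in self.docs
        doc_id = self.docs.setdefault(key, len(self.docs))
        return doc_id, seen
```

Hashing the host once per server keeps the per-URL key short and groups all of a server's URLs together, which helps both duplicate detection and per-host politeness.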
Implementation details 5/5
Harvester
[Diagram: multiple fetch threads (Fetch1–Fetch4) download pages in parallel into temporary storage, from which the Gatherer consumes them]
Concluding Remarks
Prototype currently under evaluation
Future: Agents and pushing?
Real networks are dynamic
– Crawlers do not care about network topology... but they could (and should!)
– Redundancy and fault tolerance are based on randomness
– How can we avoid changing the information market in the wrong (almost any!) direction?
Questions, comments? ...