Information Retrieval
Lecture 10 - Web crawling
Seminar für Sprachwissenschaft
International Studies in Computational Linguistics
Wintersemester 2007
Introduction
Crawling: gathering pages from the internet, in order to
index them
2 main objectives:
fast gathering
efficient gathering (as many useful web pages as possible, and the links interconnecting them)
Today's focus: issues arising when developing a web crawler
Overview
Features of a crawler
Crawling process
Architecture of a crawler
Web crawling and distributed indexes
Connectivity servers
Features of a crawler
Robustness: ability to handle spider-traps (cycles, dynamic
web pages)
Politeness: policies about the frequency of robot visits
Distribution: crawling should be distributed across several machines
Scalability: crawling should be extensible by adding machines, extending bandwidth, etc.
Efficiency: clever use of the processor, memory, and bandwidth (e.g. as few idle processes as possible)
Features of a crawler (continued)
Quality: should detect the most useful pages, to be
indexed first
Freshness: should continuously crawl the web (visiting
frequency of a page should be close to its modification
frequency)
Extensibility: should support new data formats (e.g.
XML-based formats), new protocols (e.g. ftp), etc.
Crawling process
(a) The crawler begins with a seed set of URLs to fetch
(b) The crawler fetches and parses the corresponding
webpages, and extracts both text and links
(c) The text is fed to a text indexer; the links (URLs) are added to a URL frontier (the crawling agenda)
(d) (continuous crawling) Already fetched URLs are appended to the URL frontier for later re-processing
⇒ traversal of the web graph
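A minimal sketch of this fetch-parse-enqueue loop in Python follows; fetch, parse and the commented-out index call are illustrative stand-ins for the modules discussed later, and politeness, duplicate elimination and distribution are ignored here:

# Minimal crawl loop: seed set -> fetch -> parse -> index text, enqueue links.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch(url):
    # Fetching module (sketch): download the page, no politeness or retries.
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def parse(base_url, html):
    # Parsing module (sketch): extract text and absolute links with a crude regex.
    links = [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]
    text = re.sub(r"<[^>]+>", " ", html)
    return text, links

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URL frontier (crawling agenda)
    seen = set(seed_urls)                # already-enqueued URLs
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue
        text, links = parse(url, html)
        # index(url, text)               # hand the text over to the indexer (placeholder)
        for link in links:
            if link not in seen:         # avoid re-enqueuing known URLs
                seen.add(link)
                frontier.append(link)
        max_pages -= 1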
Crawling process (continued)
Reference point: fetching a billion pages in a month-long
crawl requires fetching several hundred pages each second
Some potential issues:
Links encountered during parsing may be relative paths
⇒ normalization needed (see the sketch below)
Pages of a given web site may contain many duplicate links
Some links may point to areas from which robots are excluded (cf. robots.txt)
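Link normalization can be sketched with urllib.parse; the exact rules applied here (lowercasing the host, dropping fragments and default ports) vary between crawlers and are assumptions:

# Sketch of URL normalization: resolve relative links, lowercase the host,
# drop fragments and the default HTTP port. Real crawlers apply many more rules.
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base_url, href):
    absolute = urljoin(base_url, href)                # relative path -> absolute URL
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                          # drop default port
    if not path:
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))  # discard fragment

# e.g. normalize("http://Example.COM:80/a/b.html", "../c.html#top")
#      -> "http://example.com/c.html"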
Architecture of a crawler
Several modules interacting:
URL frontier managing the URLs to be fetched
DNS resolution determining the host (web server) from
which to fetch a page defined by a URL
Fetching module downloading a remote webpage for
processing
Parsing module extracting text and links
Duplicate elimination detecting URLs and contents that have recently been processed
Architecture of a crawler (continued)
Some remarks:
During parsing, the text (with HTML tags) is passed on to the indexer, as are the links contained in the page (link analysis)
⇒ link normalization
links are checked before being added to the frontier (see the robots.txt sketch after this list):
- URL duplicates
- content duplicates and near-duplicates
- robots policy checking (robots.txt file), e.g.:
  User-agent: *
  Disallow: /subsite/temp
priority score assignment (most useful links)
robustness via checkpoints
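The robots policy check can be sketched with Python's standard urllib.robotparser module; the host name below is a placeholder, and the commented results assume the robots.txt rules shown above:

# Sketch of a robots.txt check before adding a URL to the frontier.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder host
rp.read()                                          # fetch and parse robots.txt

# Assuming the rules above (Disallow: /subsite/temp), a generic crawler would get:
# rp.can_fetch("*", "http://www.example.com/subsite/temp/page.html")  -> False
# rp.can_fetch("*", "http://www.example.com/index.html")             -> True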
Architecture of a crawler (continued)
Figure from (Manning et al., 2008)
About distributed crawling
A crawling operation can be performed by several dedicated
threads
Parallel crawls can be distributed over the nodes of a distributed system (geographical distribution, link-based distribution, etc.)
Distribution involves a host splitter, which dispatches each URL to the crawling node responsible for its host (a hashing sketch follows below)
⇒ the duplicate elimination module cannot use a single cache of fingerprints/shingles, since duplicates do not necessarily belong to the same domain
⇒ documents change over time, so potential duplicates may have to be added back to the frontier
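A host splitter can be sketched as a hash of the URL's host, so that every URL of a given host is dispatched to the same crawling node; the node count and the choice of hash are assumptions:

# Sketch of a host splitter: every URL of a given host is routed to the same node,
# so politeness and per-host bookkeeping stay local to that node.
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4  # assumed size of the distributed crawler

def node_for(url):
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

# All URLs from one host map to the same node:
# node_for("http://www.example.com/a") == node_for("http://www.example.com/b")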
About distributed crawling (continued)
Figure from (Manning et al., 2008)
About DNS resolution
DNS resolution: translation of a server name and domain
into an IP address
Resolution done by contacting a DNS server (which may
itself contact other DNS servers)
DNS resolution is a bottleneck in crawling (due to its
recursive nature)
⇒ use of a DNS cache (recently resolved names)
DNS resolution difficulty: lookup is synchronous (a new request is processed only once the current one has completed)
⇒ thread i sends a request to a DNS server and resumes when either the time-out is reached or a signal from another thread is received
in case of time-out, up to 5 attempts are made, with increasing time-outs ranging from 1 to 90 sec. (cf. the Mercator system)
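A caching resolver with Mercator-style retries can be sketched as follows; since socket.gethostbyname takes no time-out argument, the call is wrapped in a thread-pool future whose result is awaited with an increasing deadline (the exact schedule is an assumption):

# Sketch of a DNS cache with retries and increasing time-outs.
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_dns_cache = {}                              # recently resolved names
_pool = ThreadPoolExecutor(max_workers=10)
TIMEOUTS = [1, 5, 15, 45, 90]                # seconds, roughly Mercator-like (assumed)

def resolve(hostname):
    if hostname in _dns_cache:
        return _dns_cache[hostname]
    for timeout in TIMEOUTS:                 # up to 5 attempts
        future = _pool.submit(socket.gethostbyname, hostname)
        try:
            ip = future.result(timeout=timeout)
            _dns_cache[hostname] = ip
            return ip
        except FutureTimeout:
            continue                         # try again with a larger time-out
        except socket.gaierror:
            break                            # name does not resolve at all
    return None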
About the URL frontier
Considerations governing the order in which URLs are
extracted from the frontier:
(a) high quality pages updated frequently should have
higher priority for frequent crawling
(b) politeness should be obeyed (no repeated fetches sent to a given host)
A frontier should:
(i) give a priority score to URLs reflecting their quality
(ii) open only one connection at a time to any host
(iii) wait a few seconds between successive requests to a
host
About the URL frontier (continued)
2 types of queues:
front queues for prioritization
back queues for politeness
Each back queue only contains URLs for a given host
(mapping between hosts and back queue identifiers)
When a back queue is empty, it is refilled with URLs taken from the front (priority) queues
A heap contains, for each host, the earliest time t_e at which that host may be contacted again
About the URL frontier (continued)
URL frontier extraction process:
(a) extraction of the root of the heap ⇒ back queue j
(b) extraction of the URL u at the head of back queue j
(c) fetching of the web page defined by u
(d) if back queue j is now empty, selection of a front queue using a biasing function (random, weighted by priority)
(e) selection of the head URL v of the selected front queue
(f) does host(v) already have a back queue? if yes, v goes there; otherwise, v is stored in back queue j
(g) the heap is updated for back queue j
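A stripped-down frontier along these lines might look as follows; the number of queues, the politeness delay and the biasing weights are assumptions, and the real Mercator frontier is considerably more involved:

# Sketch of a Mercator-style URL frontier: front queues for prioritization,
# one back queue per host for politeness, and a heap of earliest contact times t_e.
import heapq
import random
import time
from collections import deque
from urllib.parse import urlsplit

POLITENESS_DELAY = 2.0  # assumed gap between two requests to the same host

class Frontier:
    def __init__(self, num_front_queues=3):
        self.front = [deque() for _ in range(num_front_queues)]  # 0 = highest priority
        self.back = {}    # host -> deque of URLs for that host
        self.heap = []    # entries (t_e, host): earliest time the host may be contacted

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _refill_from_front(self):
        # Biased random choice of a front queue; its head URL is routed to the
        # back queue of its host, creating that back queue if necessary.
        weights = [2 ** (len(self.front) - i) for i in range(len(self.front))]
        while any(self.front):
            queue = random.choices(self.front, weights=weights)[0]
            if not queue:
                continue
            url = queue.popleft()
            host = urlsplit(url).netloc
            if host in self.back:
                self.back[host].append(url)          # host already has a back queue
            else:
                self.back[host] = deque([url])       # new back queue for this host
                heapq.heappush(self.heap, (time.time(), host))
                return

    def next_url(self):
        if not self.heap:
            self._refill_from_front()
        if not self.heap:
            return None                              # frontier is empty
        t_e, host = heapq.heappop(self.heap)         # (a) root of the heap -> back queue
        time.sleep(max(0.0, t_e - time.time()))      # politeness: wait until t_e
        url = self.back[host].popleft()              # (b) head URL of that back queue
        if self.back[host]:
            heapq.heappush(self.heap, (time.time() + POLITENESS_DELAY, host))
        else:
            del self.back[host]                      # (d)-(g) empty queue: refill
            self._refill_from_front()
        return url                                   # (c) the caller fetches this URL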
About the URL frontier (continued)
Figure from (Manning et al., 2008)
About the URL frontier (continued)
Remarks about the URL frontier:
The number of front queues, together with the biasing function, implements the prioritization policy of the crawler
The number of back queues determines how well the crawler avoids wasting time (by keeping its threads busy)
Web crawling and distributed indexes
Close cooperation between crawlers and indexers is required
Indexes are distributed over a large computer cluster
2 main partitioning techniques:
term partitioning (multi-word queries are harder to process)
document partitioning (inverse document frequencies are harder to compute ⇒ background processes)
In document-partitioned indexes, a hash function maps URLs to index nodes, so that crawlers know where to send the extracted text (see the sketch below)
In document-partitioned indexes, documents that are most likely to score highly (cf. links) are gathered together
⇒ low-score partitions are consulted only when there are too few results
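Such a mapping from documents to index nodes can be sketched as a simple hash of the URL; the cluster size and the choice of hash are assumptions:

# Sketch of document partitioning: a hash of the URL decides which index node
# receives the extracted text, so every crawler ships text for the same document
# to the same index partition.
import zlib

NUM_INDEX_NODES = 8   # assumed cluster size

def index_node_for(url):
    return zlib.crc32(url.encode("utf-8")) % NUM_INDEX_NODES

# index_node_for("http://www.example.com/page.html") is stable across crawler nodes.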
Connectivity servers
The quality of a page is a function of the links that point to it
For link analysis, queries on the connectivity of the web
graph are needed
Connectivity servers store information such as:
which URLs point to a given URL?
which URLs are pointed to by a given URL?
Mappings stored:
URL → out-links
URL → in-links
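Both mappings can be sketched as two dictionaries filled from the (source, target) pairs produced by the parser:

# Sketch of the two connectivity mappings: out-links and in-links.
from collections import defaultdict

out_links = defaultdict(list)   # URL -> URLs it points to
in_links = defaultdict(list)    # URL -> URLs that point to it

def add_link(source, target):
    out_links[source].append(target)
    in_links[target].append(source)

# Connectivity queries:
# in_links[u]  -> which URLs point to u?
# out_links[u] -> which URLs are pointed to by u?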
Connectivity servers (continued)
How much space is needed to store the connectivity underlying the web graph?
Estimate: 4 billion pages, 10 links per page, 4 bytes to encode a link extremity (i.e. a URL id), hence 8 bytes per link:
4 · 10^9 × 10 × 8 = 3.2 · 10^11 bytes
Graph compression is needed to ensure efficient processing of connectivity queries
Graph encoded as adjacency tables:
in-table: row = page p, columns = links to p
out-table: row = page p, columns = links from p
Space saved by using tables instead of lists of links: 50%
Connectivity servers (continued)
Compression based on the following ideas:
There is a large similarity between rows (e.g. menus)
The links tend to point to nearby pages
⇒ gaps between URL ids in the sorted list can be used (i.e. offsets rather than absolute ids)
Connectivity servers (continued)
(a): Each URL is associated with an integer i, where i is the position of the URL in the sorted list of URLs
(b): Contiguous rows are observed to have similar links (cf. locality via menus)
(c): Each row of the tables is encoded in terms of the 7 preceding rows
⇒ the offset can be expressed within 3 bits
⇒ limited to 7 to avoid an expensive search among the preceding rows
⇒ gap encoding inside a row
Rows (sorted URL ids):
1: 2, 5, 8, 12, 18, 24
2: 2, 4, 8, 12, 18, 24
Encoded rows:
1: 2, 3, 3, 4, 6, 6 (gap encoding)
2: row 1 − 5 + 4 (reference to row 1, remove 5, add 4)
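Gap encoding and decoding of a single row can be sketched as follows (the row-reference step, "row 1 − 5 + 4", is left out of this sketch):

# Sketch of gap encoding for one adjacency row of sorted URL ids:
# store the first id, then the differences between consecutive ids.
def gap_encode(row):
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def gap_decode(gaps):
    row, total = [], 0
    for g in gaps:
        total += g
        row.append(total)
    return row

# Row 1 from the example above:
# gap_encode([2, 5, 8, 12, 18, 24]) -> [2, 3, 3, 4, 6, 6]
# gap_decode([2, 3, 3, 4, 6, 6])    -> [2, 5, 8, 12, 18, 24]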
Connectivity servers (continued)
(d): Querying:
index lookup for a URL ⇒ row id
row reconstruction (with a threshold on the number of indirections through preceding rows)
Conclusion
Crawler: a robot fetching and parsing web pages via a traversal of the web graph
Issues of crawling relate to traps, politeness, dependency on DNS servers, and quality scoring
Core component of a crawler: the URL frontier (a FIFO queue architecture used to order the next URLs to process)
Next week: link analysis
References
C. Manning, P. Raghavan and H. Schütze
Introduction to Information Retrieval (2008)
http://nlp.stanford.edu/IR-book/pdf/chapter20-crawling.pdf
Allan Heydon and Marc Najork
Mercator: A Scalable, Extensible Web Crawler
(1999)
http://citeseer.ist.psu.edu/heydon99mercator.html
Sergey Brin and Lawrence Page
The Anatomy of a Large-Scale Hypertextual Web
Search Engine (1998)
http://citeseer.ist.psu.edu/brin98anatomy.html
http://www.robotstxt.org/