Information Retrieval Lecture 10 - Web Crawling
Winter Semester 2007
Introduction
2 main objectives:
fast gathering
efficient gathering (as many useful web pages as possible, and the links interconnecting them)
Overview
Features of a crawler
Crawling process
Architecture of a crawler
Web crawling and distributed indexes
Connectivity servers
Features of a crawler
Crawling process
(c) The extracted text is fed to a text indexer; the extracted links (URLs) are added to a URL frontier (the crawling agenda)
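A minimal sketch of this fetch-parse-index loop, assuming a single illustrative seed URL and a placeholder index_text() function (neither is part of the lecture material); it uses only the Python standard library and deliberately ignores politeness and robots.txt, which are covered in the architecture section.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTextExtractor(HTMLParser):
    """Collects the href links and the visible text of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # resolve relative links against the page URL
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.text_parts.append(data)


def index_text(url, text):
    """Placeholder for handing the extracted text to the text indexer."""
    print(f"indexing {url}: {len(text)} characters")


frontier = deque(["http://example.com/"])  # the crawling agenda, seeded with one URL
seen = {"http://example.com/"}

while frontier and len(seen) < 100:           # small bound to keep the sketch finite
    url = frontier.popleft()
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        continue                              # skip unreachable or malformed URLs
    parser = LinkAndTextExtractor(url)
    parser.feed(html)
    index_text(url, "".join(parser.text_parts))   # (c) the text goes to the indexer
    for link in parser.links:                     # (c) the links go to the frontier
        if link not in seen:
            seen.add(link)
            frontier.append(link)
```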
Architecture of a crawler
Several interacting modules:
Some remarks:
During parsing, the text (with its HTML tags) is passed on to the indexer, as are the links contained in the page (for link analysis)
link normalization
links are checked before being added to the frontier (see the sketch after this list):
- URL duplicates
- content duplicates and near-duplicates
- robots policy checking (robots.txt file) (*)
User-agent: *
Disallow: /subsite/temp
priority score assignment (the most useful links first)
robustness via checkpoints
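A minimal sketch of these per-link checks, under the following assumptions: the helper names admit_link, allowed_by_robots and is_duplicate_content are illustrative, and near-duplicate detection is approximated by an exact content hash (real crawlers use shingling or simhash instead). The robots.txt handling relies on Python's standard urllib.robotparser.

```python
import hashlib
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.robotparser import RobotFileParser

seen_urls = set()          # for URL-duplicate elimination
seen_fingerprints = set()  # exact content hashes (stand-in for near-duplicate detection)
robots_cache = {}          # host -> parsed robots.txt (or None if unreachable)


def allowed_by_robots(url, user_agent="*"):
    """Check the robots policy (robots.txt) of the target host, cached per host."""
    host = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    if host not in robots_cache:
        parser = RobotFileParser(host + "/robots.txt")
        try:
            parser.read()
        except OSError:
            parser = None          # robots.txt unreachable: be permissive in this sketch
        robots_cache[host] = parser
    parser = robots_cache[host]
    return parser is None or parser.can_fetch(user_agent, url)


def admit_link(base_url, href, frontier):
    """Normalize a link and append it to the frontier only if it passes the checks."""
    url, _ = urldefrag(urljoin(base_url, href))   # normalization: absolute URL, no #fragment
    if url in seen_urls:                          # URL duplicate
        return
    if not allowed_by_robots(url):                # robots policy
        return
    seen_urls.add(url)
    frontier.append(url)


def is_duplicate_content(page_bytes):
    """Exact-duplicate check on fetched pages; near-duplicates need shingling/simhash."""
    fingerprint = hashlib.sha1(page_bytes).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```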
A frontier should:
(i) give a priority score to URLs reflecting their quality
(ii) ensure politeness, i.e., avoid fetching from the same host too frequently
Two types of queues (sketched below):
front queues for prioritization
back queues for politeness
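A minimal sketch of this two-tier (Mercator-style) frontier. The names add_url and next_url, the number of front queues and the politeness delay are illustrative assumptions; the real Mercator frontier adds further details such as biased random selection among front queues and a host-to-back-queue table.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

NUM_FRONT_QUEUES = 3     # one front queue per priority level, 1 = highest
POLITENESS_DELAY = 2.0   # seconds to wait between two requests to the same host

front_queues = [deque() for _ in range(NUM_FRONT_QUEUES)]
back_queues = {}         # host -> deque of URLs waiting for that host
host_heap = []           # (earliest allowed fetch time, host)


def add_url(url, priority):
    """Front queues: the URL is placed according to its priority score."""
    front_queues[priority - 1].append(url)


def _refill():
    """Move one URL from the front queues (highest priority first) to its host's back queue."""
    for queue in front_queues:
        if queue:
            url = queue.popleft()
            host = urlsplit(url).netloc
            if host not in back_queues:
                back_queues[host] = deque()
                heapq.heappush(host_heap, (time.time(), host))
            back_queues[host].append(url)
            return


def next_url():
    """Back queues: return a URL whose host has not been contacted too recently."""
    while True:
        if not host_heap:
            _refill()
            if not host_heap:
                return None                            # frontier is exhausted
        ready_at, host = heapq.heappop(host_heap)
        time.sleep(max(0.0, ready_at - time.time()))   # enforce politeness
        if back_queues[host]:
            url = back_queues[host].popleft()
            heapq.heappush(host_heap, (time.time() + POLITENESS_DELAY, host))
            return url
        del back_queues[host]                          # drained host: pull new work
        _refill()
```

In this sketch the parser would call add_url() with the priority computed for each admitted link, and the fetch threads would repeatedly call next_url().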
Web crawling and distributed indexes
Connectivity servers
Mappings stored:
URL → out-links
URL → in-links
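A minimal sketch of the two mappings, on an illustrative toy link graph (the example URLs are not from the lecture). In a real connectivity server the URLs would be replaced by integer row ids and the adjacency lists compressed, as in the encoding example further below.

```python
from collections import defaultdict

# URL -> out-links, as produced by the crawler's link extraction
out_links = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

# URL -> in-links, obtained by inverting the out-link mapping
in_links = defaultdict(list)
for source, targets in out_links.items():
    for target in targets:
        in_links[target].append(source)

print(in_links["http://c.example/"])  # ['http://a.example/', 'http://b.example/']
```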
Example (adjacency lists of two similar rows):
Row 1: 2, 5, 8, 12, 18, 24
Row 2: 2, 4, 8, 12, 18, 24
Gap encoding of row 1: 2, 3, 3, 4, 6, 6
Reference encoding of row 2: row1 -5 +4
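A minimal sketch of the two encodings in this example: gap encoding of a sorted adjacency list, and reference encoding of a row as edits against a similar preceding row. The function names gap_encode and reference_encode are illustrative.

```python
def gap_encode(row):
    """Store the first id, then the differences between consecutive ids."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]


def reference_encode(row, reference):
    """Describe a row as edits against a reference row: removed and added ids."""
    removed = sorted(set(reference) - set(row))
    added = sorted(set(row) - set(reference))
    return removed, added


row1 = [2, 5, 8, 12, 18, 24]
row2 = [2, 4, 8, 12, 18, 24]

print(gap_encode(row1))              # [2, 3, 3, 4, 6, 6]
print(reference_encode(row2, row1))  # ([5], [4])  i.e. "row1 -5 +4"
```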
(d) Querying:
index lookup to map a URL to its row id
row reconstruction (with a threshold on the number of indirections through preceding reference rows)
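A minimal sketch of row reconstruction when rows may be reference-encoded against preceding rows. The table layout, the hypothetical third row, and the constant MAX_INDIRECTIONS are illustrative assumptions; in practice the bound on the reference chain is enforced when the encoding is chosen, so lookups stay cheap.

```python
MAX_INDIRECTIONS = 3   # threshold on the length of a reference chain

table = {
    1: [2, 5, 8, 12, 18, 24],   # stored directly (already gap-decoded)
    2: (1, [5], [4]),           # row 2 = row 1 - 5 + 4  (the example above)
    3: (2, [24], [30]),         # hypothetical extra row, to show a chain of references
}


def reconstruct(row_id, depth=0):
    """Follow reference rows until a directly stored row is reached."""
    entry = table[row_id]
    if isinstance(entry, list):
        return entry
    if depth >= MAX_INDIRECTIONS:
        raise ValueError("reference chain too long; such a row should be stored directly")
    ref_id, removed, added = entry
    base = reconstruct(ref_id, depth + 1)
    return sorted((set(base) - set(removed)) | set(added))


print(reconstruct(3))   # [2, 4, 8, 12, 18, 30]
```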
Conclusion
References
A. Heydon and M. Najork. Mercator: A Scalable, Extensible Web Crawler. http://citeseer.ist.psu.edu/heydon99mercator.html
S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. http://citeseer.ist.psu.edu/brin98anatomy.html
The Web Robots Pages (robots.txt). http://www.robotstxt.org/