Information Retrieval
Lecture 10 - Web crawling
Seminar für Sprachwissenschaft
International Studies in Computational Linguistics
Wintersemester 2007
Introduction
Crawling: gathering pages from the internet, in order to
index them
2 main objectives:
fast gathering
efficient gathering (as many useful web pages as possible, and the links interconnecting them)
Today's focus: issues arising when developing a web crawler
Overview
Features of a crawler
Crawling process
Architecture of a crawler
Web crawling and distributed indexes
Connectivity servers
Features of a crawler
Robustness: ability to handle spider-traps (cycles, dynamic
web pages)
Politeness: policies about the frequency of robot visits
Distribution: crawling should be distributed across several machines
Scalability: crawling should be extensible by adding machines, extending bandwidth, etc.
Efficiency: clever use of the processor, memory, and bandwidth (e.g. as few idle processes as possible)
Features of a crawler (continued)
Quality: should detect the most useful pages, to be
indexed first
Freshness: should continuously crawl the web (visiting
frequency of a page should be close to its modification
frequency)
Extensibility: should support new data formats (e.g.
XML-based formats), new protocols (e.g. ftp), etc.
Crawling process
(a) The crawler begins with a seed set of URLs to fetch
(b) The crawler fetches and parses the corresponding
webpages, and extracts both text and links
(c) The text is fed to a text indexer; the links (URLs) are added to a URL frontier (the crawling agenda)
(d) (continuous crawling) Already fetched URLs are appended to the URL frontier for later re-processing
⇒ traversal of the web graph
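A minimal sketch of this fetch-parse-enqueue loop in Python follows; fetch, parse and the commented-out index call are illustrative stand-ins for the modules discussed later, and politeness, duplicate elimination and distribution are ignored here:

# Minimal crawl loop: seed set -> fetch -> parse -> index text, enqueue links.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch(url):
    # Fetching module (sketch): download the page, no politeness or retries.
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def parse(base_url, html):
    # Parsing module (sketch): extract text and absolute links with a crude regex.
    links = [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]
    text = re.sub(r"<[^>]+>", " ", html)
    return text, links

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # URL frontier (crawling agenda)
    seen = set(seed_urls)                # already-enqueued URLs
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue
        text, links = parse(url, html)
        # index(url, text)               # hand the text over to the indexer (placeholder)
        for link in links:
            if link not in seen:         # avoid re-enqueuing known URLs
                seen.add(link)
                frontier.append(link)
        max_pages -= 1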
Crawling process (continued)
Reference point: fetching a billion pages in a month-long
crawl requires fetching several hundred pages each second
Some potential issues:
Links encountered during parsing may be relative paths
⇒ normalization needed (see the sketch below)
Pages of a given web site may contain many duplicate links
Some links may point to areas from which robots are excluded (cf. robots.txt)
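Link normalization can be sketched with urllib.parse; the exact rules applied here (lowercasing the host, dropping fragments and default ports) vary between crawlers and are assumptions:

# Sketch of URL normalization: resolve relative links, lowercase the host,
# drop fragments and the default HTTP port. Real crawlers apply many more rules.
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base_url, href):
    absolute = urljoin(base_url, href)                # relative path -> absolute URL
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                          # drop default port
    if not path:
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))  # discard fragment

# e.g. normalize("http://Example.COM:80/a/b.html", "../c.html#top")
#      -> "http://example.com/c.html"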
Architecture of a crawler
Several modules interacting:
URL frontier managing the URLs to be fetched
DNS resolution determining the host (web server) from
which to fetch a page defined by a URL
Fetching module downloading a remote webpage for
processing
Parsing module extracting text and links
Duplicate elimination detecting URLs and contents that have recently been processed
Architecture of a crawler (continued)
Some remarks:
During parsing, the text (with HTML tags) is passed on to the indexer, as are the links contained in the page (link analysis)
⇒ link normalization
links are checked before being added to the frontier (see the robots.txt sketch after this list):
- URL duplicates
- content duplicates and near-duplicates
- robots policy checking (robots.txt file), e.g.:
  User-agent: *
  Disallow: /subsite/temp
priority score assignment (most useful links)
robustness via checkpoints
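The robots policy check can be sketched with Python's standard urllib.robotparser module; the host name below is a placeholder, and the commented results assume the robots.txt rules shown above:

# Sketch of a robots.txt check before adding a URL to the frontier.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder host
rp.read()                                          # fetch and parse robots.txt

# Assuming the rules above (Disallow: /subsite/temp), a generic crawler would get:
# rp.can_fetch("*", "http://www.example.com/subsite/temp/page.html")  -> False
# rp.can_fetch("*", "http://www.example.com/index.html")             -> True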
Architecture of a crawler (continued)
Figure from (Manning et al., 2008)
About distributed crawling
A crawling operation can be performed by several dedicated
threads
Parallel crawls can be distributed over the nodes of a distributed system (geographical distribution, link-based distribution, etc.)
Distribution involves a host splitter, which dispatches each URL to the crawling node responsible for its host (a hashing sketch follows below)
⇒ the duplicate elimination module cannot use a single cache of fingerprints/shingles, since duplicates do not necessarily belong to the same domain
⇒ documents change over time, so potential duplicates may have to be added back to the frontier
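A host splitter can be sketched as a hash of the URL's host, so that every URL of a given host is dispatched to the same crawling node; the node count and the choice of hash are assumptions:

# Sketch of a host splitter: every URL of a given host is routed to the same node,
# so politeness and per-host bookkeeping stay local to that node.
import hashlib
from urllib.parse import urlsplit

NUM_NODES = 4  # assumed size of the distributed crawler

def node_for(url):
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

# All URLs from one host map to the same node:
# node_for("http://www.example.com/a") == node_for("http://www.example.com/b")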
About distributed crawling (continued)
Figure from (Manning et al., 2008)
About DNS resolution
DNS resolution: translation of a server name and domain
into an IP address
Resolution done by contacting a DNS server (which may
itself contact other DNS servers)
DNS resolution is a bottleneck in crawling (due to its
recursive nature)
⇒ use of a DNS cache (recently resolved names)
DNS resolution difficulty: lookup is synchronous (a new request is processed only once the current one has completed)
⇒ thread i sends a request to a DNS server and resumes when either the time-out is reached or a signal from another thread is received
in case of time-out, up to 5 attempts are made, with increasing time-outs ranging from 1 to 90 sec. (cf. the Mercator system)
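A caching resolver with Mercator-style retries can be sketched as follows; since socket.gethostbyname takes no time-out argument, the call is wrapped in a thread-pool future whose result is awaited with an increasing deadline (the exact schedule is an assumption):

# Sketch of a DNS cache with retries and increasing time-outs.
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_dns_cache = {}                              # recently resolved names
_pool = ThreadPoolExecutor(max_workers=10)
TIMEOUTS = [1, 5, 15, 45, 90]                # seconds, roughly Mercator-like (assumed)

def resolve(hostname):
    if hostname in _dns_cache:
        return _dns_cache[hostname]
    for timeout in TIMEOUTS:                 # up to 5 attempts
        future = _pool.submit(socket.gethostbyname, hostname)
        try:
            ip = future.result(timeout=timeout)
            _dns_cache[hostname] = ip
            return ip
        except FutureTimeout:
            continue                         # try again with a larger time-out
        except socket.gaierror:
            break                            # name does not resolve at all
    return None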
About the URL frontier
Considerations governing the order in which URLs are
extracted from the frontier:
(a) high quality pages updated frequently should have
higher priority for frequent crawling
(b) politeness should be obeyed (no repeated fetches sent to a given host)
A frontier should:
(i) give a priority score to URLs reflecting their quality
(ii) open only one connection at a time to any host
(iii) wait a few seconds between successive requests to a
host
About the URL frontier (continued)
2 types of queues:
front queues for prioritization
back queues for politeness
Each back queue only contains URLs for a given host
(mapping between hosts and back queue identifiers)
When a back queue is empty, it is refilled with URLs taken from the front (priority) queues
A heap contains, for each host, the earliest time t_e at which that host may be contacted again
About the URL frontier (continued)
URL frontier extraction process:
(a) extraction of the root of the heap ⇒ back queue j
(b) extraction of the URL u at the head of back queue j
(c) fetching of the web page defined by u
(d) if back queue j is now empty, selection of a front queue using a biasing function (random, weighted by priority)
(e) selection of the head URL v of the selected front queue
(f) does host(v) already have a back queue? if yes, v goes there; otherwise, v is stored in back queue j
(g) the heap is updated for back queue j
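A stripped-down frontier along these lines might look as follows; the number of queues, the politeness delay and the biasing weights are assumptions, and the real Mercator frontier is considerably more involved:

# Sketch of a Mercator-style URL frontier: front queues for prioritization,
# one back queue per host for politeness, and a heap of earliest contact times t_e.
import heapq
import random
import time
from collections import deque
from urllib.parse import urlsplit

POLITENESS_DELAY = 2.0  # assumed gap between two requests to the same host

class Frontier:
    def __init__(self, num_front_queues=3):
        self.front = [deque() for _ in range(num_front_queues)]  # 0 = highest priority
        self.back = {}    # host -> deque of URLs for that host
        self.heap = []    # entries (t_e, host): earliest time the host may be contacted

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _refill_from_front(self):
        # Biased random choice of a front queue; its head URL is routed to the
        # back queue of its host, creating that back queue if necessary.
        weights = [2 ** (len(self.front) - i) for i in range(len(self.front))]
        while any(self.front):
            queue = random.choices(self.front, weights=weights)[0]
            if not queue:
                continue
            url = queue.popleft()
            host = urlsplit(url).netloc
            if host in self.back:
                self.back[host].append(url)          # host already has a back queue
            else:
                self.back[host] = deque([url])       # new back queue for this host
                heapq.heappush(self.heap, (time.time(), host))
                return

    def next_url(self):
        if not self.heap:
            self._refill_from_front()
        if not self.heap:
            return None                              # frontier is empty
        t_e, host = heapq.heappop(self.heap)         # (a) root of the heap -> back queue
        time.sleep(max(0.0, t_e - time.time()))      # politeness: wait until t_e
        url = self.back[host].popleft()              # (b) head URL of that back queue
        if self.back[host]:
            heapq.heappush(self.heap, (time.time() + POLITENESS_DELAY, host))
        else:
            del self.back[host]                      # (d)-(g) empty queue: refill
            self._refill_from_front()
        return url                                   # (c) the caller fetches this URL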
About the URL frontier (continued)
Figure from (Manning et al., 2008)
About the URL frontier (continued)
Remarks about the URL frontier:
The number of front queues, together with the biasing function, implements the prioritization policy of the crawler
The number of back queues determines how well the crawler avoids wasting time (by keeping its threads busy)
Web crawling and distributed indexes
Close cooperation between crawlers and indexers is required
Indexes are distributed over a large computer cluster
2 main partitioning techniques:
term partitioning (multi-word queries are harder to process)
document partitioning (inverse document frequencies are harder to compute ⇒ background processes)
In document-partitioned indexes, a hash function maps URLs to index nodes, so that crawlers know where to send the extracted text (see the sketch below)
In document-partitioned indexes, documents that are most likely to score highly (cf. links) are gathered together
⇒ low-score partitions are consulted only when there are too few results
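Such a mapping from documents to index nodes can be sketched as a simple hash of the URL; the cluster size and the choice of hash are assumptions:

# Sketch of document partitioning: a hash of the URL decides which index node
# receives the extracted text, so every crawler ships text for the same document
# to the same index partition.
import zlib

NUM_INDEX_NODES = 8   # assumed cluster size

def index_node_for(url):
    return zlib.crc32(url.encode("utf-8")) % NUM_INDEX_NODES

# index_node_for("http://www.example.com/page.html") is stable across crawler nodes.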
Connectivity servers
The quality of a page is a function of the links that point to it
For link analysis, queries on the connectivity of the web
graph are needed
Connectivity servers store information such as:
which URLs point to a given URL?
which URLs are pointed to by a given URL?
Mappings stored:
URL → out-links
URL → in-links
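Both mappings can be sketched as two dictionaries filled from the (source, target) pairs produced by the parser:

# Sketch of the two connectivity mappings: out-links and in-links.
from collections import defaultdict

out_links = defaultdict(list)   # URL -> URLs it points to
in_links = defaultdict(list)    # URL -> URLs that point to it

def add_link(source, target):
    out_links[source].append(target)
    in_links[target].append(source)

# Connectivity queries:
# in_links[u]  -> which URLs point to u?
# out_links[u] -> which URLs are pointed to by u?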
Connectivity servers (continued)
How much space is needed to store the connectivity underlying the web graph?
Estimate: 4 billion pages, 10 links per page, 4 bytes to encode a link extremity (i.e. a URL id), hence 8 bytes per link:
4 · 10^9 × 10 × 8 = 3.2 · 10^11 bytes
Graph compression is needed to ensure efficient processing of connectivity queries
Graph encoded as adjacency tables:
in-table: row = page p, columns = links to p
out-table: row = page p, columns = links from p
Space saved by using tables instead of lists of links: 50%
Connectivity servers (continued)
Compression based on the following ideas:
There is a large similarity between rows (e.g. menus)
The links tend to point to nearby pages
⇒ gaps between URL ids in the sorted list can be used (i.e. offsets rather than absolute ids)
Connectivity servers (continued)
(a): Each URL is associated with an integer i, where i is the position of the URL in the sorted list of URLs
(b): Contiguous rows are observed to have similar links (cf. locality via menus)
(c): Each row of the tables is encoded in terms of the 7 preceding rows
⇒ the offset can be expressed within 3 bits
⇒ limited to 7 to avoid an expensive search among the preceding rows
⇒ gap encoding inside a row
Rows (sorted URL ids):
1: 2, 5, 8, 12, 18, 24
2: 2, 4, 8, 12, 18, 24
Encoded rows:
1: 2, 3, 3, 4, 6, 6 (gap encoding)
2: row 1 − 5 + 4 (reference to row 1, remove 5, add 4)
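Gap encoding and decoding of a single row can be sketched as follows (the row-reference step, "row 1 − 5 + 4", is left out of this sketch):

# Sketch of gap encoding for one adjacency row of sorted URL ids:
# store the first id, then the differences between consecutive ids.
def gap_encode(row):
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def gap_decode(gaps):
    row, total = [], 0
    for g in gaps:
        total += g
        row.append(total)
    return row

# Row 1 from the example above:
# gap_encode([2, 5, 8, 12, 18, 24]) -> [2, 3, 3, 4, 6, 6]
# gap_decode([2, 3, 3, 4, 6, 6])    -> [2, 5, 8, 12, 18, 24]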
Connectivity servers (continued)
(d): Querying:
index lookup for a URL ⇒ row id
row reconstruction (with a threshold on the number of indirections through preceding rows)
Conclusion
Crawler: a robot fetching and parsing web pages via a traversal of the web graph
Issues of crawling relate to traps, politeness, dependency on DNS servers, and quality scoring
Core component of a crawler: the URL frontier (a FIFO queue architecture used to order the next URLs to process)
Next week: link analysis
References
C. Manning, P. Raghavan and H. Schütze
Introduction to Information Retrieval (2008)
http://nlp.stanford.edu/IR-book/pdf/chapter20-crawling.pdf
Allan Heydon and Marc Najork
Mercator: A Scalable, Extensible Web Crawler
(1999)
http://citeseer.ist.psu.edu/heydon99mercator.html
Sergey Brin and Lawrence Page
The Anatomy of a Large-Scale Hypertextual Web
Search Engine (1998)
http://citeseer.ist.psu.edu/brin98anatomy.html
http://www.robotstxt.org/