
WEB SPIDER: A Focused Crawler

Acknowledgements
It would truly be unfair not to express our gratitude to all those who helped us complete this project. We would like to show our deepest gratitude to our project guide, Dr. Anupam Agarwal, without whom this project would not have been possible. It was he who motivated us towards this cause and was always present with his precious guidance and ideas, besides being extremely supportive and understanding at all times. This fuelled our enthusiasm even further and encouraged us to step boldly into what was a totally dark and unexplored expanse before us.

We would also like to thank our batch mates and seniors, who were always ready with a positive comment, whether it was an off-hand remark to encourage us or a constructive piece of criticism. Their encouragement as well as their critical comments were of great help in giving the project its present form.

Abstract
The world-wide web, having over 350 million pages, continues to grow rapidly at a million pages per day. About 600 GB of text changes every month. Such growth and flux pose basic limits of scale for today's generic crawlers and search engines. In spite of using high-end multiprocessors and exquisitely crafted crawling software, the largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. Faced with such unprecedented scaling challenges for general-purpose crawlers and search engines, we propose a hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. To achieve such goal-directed crawling, we evaluate the relevance of each hypertext document with respect to the focus topics, discarding the irrelevant pages and following the hyperlinks of relevant pages only. Focused crawling thus steadily acquires relevant pages only, while standard crawling quickly loses its way. It is therefore very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware.

Contents
Student Declaration
Supervisor Recommendation
Acknowledgements
Abstract
List of figures

Chapter 1: Introduction
1.1 Objective
1.2 Motivation
1.3 Problem Definition

Chapter 2: Literature Survey


2.1 Literature survey
2.2 Previous Work

Chapter 3: Project Model


3.1 Basic Architecture
3.2 Crawler Policies
3.3 Issues
Indian Institute of Information Technology, Allahabad

Web Spider: A Focused Crawler

Chapter 4: Algorithm Implementation


4.1 Outline
4.2 Parsing and Stemming
4.3 Threshold calculation
4.4 Document Frequency


4.5 Robots.txt


Chapter 5: Discussion and Results


5.1 Retrieval of relevant pages only
5.2 Multithreading
5.3 Crawl space reduction
5.4 Reduction of server overload
5.5 Robustness of Acquisition
5.6 Snapshots

Chapter 6: Conclusion
6.1 Conclusion
6.2 Challenges and Future work

Appendices
Appendix A: Term Vector Model
Appendix B: Basic Authentication Scheme
Appendix C: Term Frequency-Inverse Document Frequency


References
Technical references

Other references


List of figures:
Fig 1.1: Performance of an unfocused crawler
Fig 1.2: Performance of focused crawler
Fig 2.1: Basic Components of the crawler
Fig 2.2: Integration of crawler, classifier and distiller
Fig 2.3: Domain of focused web crawler
Fig 3.1: Simple Crawler Configuration
Fig 3.2: Control Flow of a Crawler Frontier
Fig 4.1: Basic functioning of crawl frontier
Fig 5.1: Comparison Analysis
Fig 5.2: Crawl Space reduction
Fig 5.3: Snapshot 1
Fig 5.4: Snapshot 2


Chapter I

Introduction
This section covers:

Objective
Motivation
Problem definition


1.1: Objective

To build a customized, multithreaded, focused crawler which will crawl the web based on the relevance of each web page, thus reducing the crawl space.

1.2: Motivation

The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. It continues to grow rapidly at a million pages per day. About 600 GB of text changes every month. Due to this explosion in size, web search engines are becoming increasingly important as the primary means of locating relevant information [2]. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, including crawling strategies, storage, indexing, ranking techniques, and a significant amount of work on the structural analysis of the web and web graph.


In spite of using high-end multiprocessors and exquisitely crafted crawling software, the largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. The overwhelming engineering challenges are in part due to the one-size-fits-all philosophy: the crawler tries to cater to every possible query.

Serious web users adopt the strategy of filtering by relevance and quality. The growth of the web matters little to a physicist if at most a few dozen pages dealing with quantum electrodynamics are added or updated per week. Seasoned users also rarely roam aimlessly; they have bookmarked sites important to them, and their primary need is to expand and maintain a community around these examples while preserving the quality. A focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics. It is crucial that the harvest rate (the fraction of page fetches which are relevant to the user's interest) of the focused crawler be high; otherwise it would be easier to crawl the whole web and bucket the results into topics as a post-processing step.


Fig 1.1: Performance of an unfocused crawler [10]

Fig 1.2: Performance of a focused crawler [10]

As we see in the case of the focused crawler (Fig 1.2), the fraction of page fetches which are relevant to the user's interest is very high when compared to that of the unfocused crawler (Fig 1.1). The crawl space of a focused crawler can thus be reduced to a large extent as compared to a normal crawler.

1.3: Problem Definition



Our project's ambition is to build a customized, multithreaded, focused crawler which crawls the web based on the relevance of each web page. The approach is concerned specifically with a particular domain.

In order to achieve these objectives, the crawler should be able to perform the following:

Efficient Preprocessing: This involves the preprocessing of the input documents. We aim to provide efficient parsing and stemming of pages. Initially, the user will be required to provide a set of example pages along with his search query. These example pages will be parsed, all the stop words will be removed, and finally the text will be stemmed.

Knowledge Retrieval: To provide efficient retrieval of the information-carrying words. Once the text has been stemmed, the information-carrying words will be picked out; these form the information which the crawler carries with it.

Crawling: To build a crawler that starts from a root node or URL, called the seed. As the crawler visits these URLs, it will identify all the hyperlinks in each page and add them to the list of URLs to visit, called the crawl frontier. URLs from the frontier will then be recursively visited.

Retrieving relevant pages: We aim to retrieve only those pages which are closely related to the corresponding query. In our case we will deal with the most relevant pages only. This reduces the burden on the user of scanning through all the retrieved pages to find the pages of his interest.


Chapter II

Literature Survey
This section covers:

Literature survey
Background and previous work


2.1: Literature survey

2.1.1: Basic Crawler

The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. We want to implement a hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we will design two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. [7]

Extensive focused-crawling experiments have been reported using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though both are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs: it discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. [5] As a result it is highly efficient as compared to normal crawlers. Normal crawlers work well for some time after they start crawling, but then lose their path, which is their biggest disadvantage compared to focused crawlers. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware. [3]

Fig 2.1: Basic Components of the crawler [2]

The focused crawler has three main components: a classifier which makes relevance judgments on pages crawled to decide on link expansion, a distiller which determines a measure of centrality of crawled pages to determine visit priorities, and a crawler with dynamically reconfigurable priority controls which is governed by the classifier and distiller. [2]


Its block diagram can be shown as

Fig 2.2: Integration of the crawler, classifier and distiller [1]

2.1.2: Classification

Relevance is enforced on the focused crawler using a hypertext classifier. We assume that the category taxonomy induces a hierarchical partition on Web documents (in real life, documents are often judged to belong to multiple categories). The emphasis is on acquiring useful pages, not merely on eliminating irrelevant pages. Human judgment, although subjective and even erroneous, would be best for measuring relevance. Clearly, even for an experimental crawler that acquires only ten thousand pages per hour, this is impossible. Therefore we use our classifier to estimate the relevance of the crawl graph. It is to be noted carefully that we are not, for instance, training and testing the classifier on the same set of documents, or checking the classifier's earlier evaluation of a document using the classifier itself. Just as human judgment is prone to variation and error, a statistical program can make mistakes. Based on such imperfect recommendations, we choose whether or not to expand pages. Later, when a page that was chosen is visited, we evaluate its relevance, and thus the value of that decision. [8]

2.1.3: Distillation

Relevance is not the only attribute used to evaluate a page while crawling. A long essay very relevant to the topic but without links is only a finishing point in the crawl. A good strategy for the crawler is to identify hubs: pages that are almost exclusively a collection of links to authoritative resources that are relevant to the topic.* Social network analysis is concerned with the properties of graphs formed between entities such as people, organizations and papers, through co-authoring, citations, mentoring, paying, telephoning, infecting, etc. Prestige is an important attribute of nodes in a social network, especially in the context of academic papers and Web documents. The number of citations to a paper is a reasonable but crude measure of its prestige. Also, many hubs are multi-topic in nature, e.g., a published bookmark file pointing to sports car sites and photography sites. [4]

2.1.4: Integration with the crawler

The crawler has one watchdog thread and many worker threads. The watchdog is in charge of checking out new work from the crawl frontier, which is stored on disk. New work is passed to workers using shared memory buffers. Workers save details of newly explored pages in private per-worker disk structures. In bulk-synchronous fashion, workers are stopped, and their results are collected and integrated into the central pool of work. [4]


While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability. [2]
* Refer Appendix B

Perhaps the most crucial evaluation of focused crawling is to measure the rate at which relevant pages are acquired, and how effectively irrelevant pages are filtered off from the crawl. This harvest ratio must be high, otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead! It would be good to judge the relevance of the crawl by human inspection, even though it is subjective and inconsistent. But this is not possible for the hundreds of thousands of pages our system crawled. Therefore we have to take recourse to running an automatic classifier over the collected pages. Specifically, we can use our classifier. It may appear that using the same classifier to guide the crawler and judge the relevance of crawled pages is flawed methodology, but it is not so. We are evaluating not the classifier but the basic crawling heuristic that neighbors of highly relevant pages tend to be relevant.


Fig 2.3: Domain of a focused web crawler [11]

The unfocused crawler starts out from the same set of dozens of highly relevant links as the focused crawler, but is completely lost within the next hundred page fetches: the relevance quickly goes to zero. In contrast, the focused crawl keeps up a healthy pace of acquiring relevant pages over thousands of pages, in spite of some short-range rate fluctuations, which is expected. On average, between a third and half of all page fetches result in success over the first several thousand fetches, and there is no sign of stagnation. Crawling the Web, in a certain way, resembles watching the sky on a clear night: what we see reflects the state of the stars at different times, as their light has travelled different distances. What a Web crawler gets is not a snapshot of the Web, because it does not represent the Web at any given instant of time. The last pages being crawled are probably represented very accurately, but the first pages that were downloaded have a high probability of having been changed. [6]


2.2: Previous work


The following is a list of published crawler architectures for general-purpose crawlers (excluding focused Web crawlers), with a brief description that includes the names given to the different components and their outstanding features.

2.2.1: RBSE was the first published web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads the pages from the Web. It was presented at the First International Conference on the World Wide Web, Geneva, Switzerland. [12]

2.2.2: Google Crawler is described in some detail, but the reference is only about an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. There is a URL server that sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether the URL had been seen before; if not, the URL was added to the queue of the URL server. [16]

2.2.3: Mercator is a distributed, modular web crawler written in Java. Its modularity arises from the usage of interchangeable "protocol modules" and "processing modules". Protocol modules are related to how to acquire the Web pages (e.g. by HTTP), and processing modules are related to how to process Web pages. The standard processing module just parses the pages and extracts new URLs, but other processing modules can be used to index the text of the pages, or to gather statistics from the Web. [15]


2.2.4: WebRACE is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading Web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for "subscriptions" to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of "seed" URLs, WebRACE continuously receives new starting URLs to crawl from. [18]

2.2.5: UbiCrawler is a distributed crawler written in Java, and it has no central process. It is composed of a number of identical "agents", and the assignment function is calculated using consistent hashing of the host names. There is zero overlap, meaning that no page is crawled twice, unless a crawling agent crashes (then another agent must re-crawl the pages from the failing agent). The crawler is designed to achieve high scalability and to be tolerant to failures. [13]

2.2.6: Some open-source crawlers [11]

DataparkSearch
GNU Wget
Heritrix
HTTrack
Methabot
Nutch
WebSPHINX
Sherlock Holmes
YaCy


Chapter III

Project Model

This section covers:

Basic Architecture
Crawler Policies
Issues

3.1: Basic Architecture


In this project we will develop a web crawler that will start with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.


3.1.1: Basic Concept

A web crawler, also known as a Web spider or Web robot, is a program or automated script which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Web crawlers start by parsing a specified web page, noting any hypertext links on that page that point to other web pages. They then parse those pages for new links, and so on, recursively. Web-crawler software doesn't actually move around to different computers on the Internet, as viruses or intelligent agents do. A crawler resides on a single machine. The crawler simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links. [2]


3.1.2: Architecture

The input to the focused crawler is the search query of the user. Also, a set of example pages relating to the query has to be given to the crawler. A series of parses is done on these example pages to finally extract the information-carrying words. These words are given as input to the crawler, which carries them along with it. Based on this information, the crawler calculates the relevance of an encountered page, and only if the relevance is satisfactory will the page be stored for further crawling. [4]

Fig 3.1: Simple Crawler Configuration [4]


The architecture can be classified into two major components: the crawling system and the crawling application. The crawling system itself consists of several specialized components, in particular a crawl manager, a downloader and a DNS resolver. The crawl manager is responsible for receiving the URL input stream from the applications. After loading the URLs of a request file, the manager queries the DNS resolvers for the IP addresses of the servers, unless a recent address is already cached. The manager then requests the file robots.txt from the web server's root directory, unless it already has a recent copy of the file. A downloader is a high-performance asynchronous HTTP client capable of downloading hundreds of web pages in parallel, while a DNS resolver is an optimized stub DNS resolver that forwards queries to local DNS servers. [6] Finally, after parsing the robots files and removing excluded URLs, the requested URLs are sent in batches to the downloader. The manager later notifies the application of the pages that have been downloaded and are available for processing. The crawling application starts out with a URL, giving it to the crawl manager. The application then parses each downloaded page for hyperlinks, checks whether these URLs have already been encountered before, and if not, sends them to the manager in batches of a few hundred or thousand. [9] The downloaded files are then forwarded to a storage manager for compression and storage in a repository.

3.1.3: Control flow

As the crawler gets the relevant pages, it retrieves their URLs and makes a list of them, from which it takes the URLs one by one and downloads the corresponding web page. The downloaded page is then converted to a text file for simplicity. This text file is parsed, removing all the stop words from it and stemming the remaining words using the Porter Stemmer. Then its relevance is tested. If it is relevant, the URLs present on the page are extracted and added to the list of URLs for further crawling.
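A minimal Python sketch of this control flow is given below. It is only an illustration: the regex-based tag stripping, the hypothetical TOPIC_WORDS set and the simple keyword-fraction relevance test stand in for the project's actual parser, Porter Stemmer and threshold calculation, which are described in Chapter 4.

import re
import urllib.request

TOPIC_WORDS = {"crawler", "spider", "index"}   # hypothetical information-carrying words

def download(url):
    # Fetch one page and return its raw HTML.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def to_text(html):
    # Very crude HTML-to-text conversion: drop the tags.
    return re.sub(r"<[^>]+>", " ", html)

def is_relevant(text, threshold=0.5):
    # Placeholder relevance test: fraction of topic words present in the page.
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = sum(1 for w in TOPIC_WORDS if w in words) / len(TOPIC_WORDS)
    return score >= threshold

def extract_links(html):
    return re.findall(r'href=["\'](http[^"\']+)["\']', html, flags=re.I)

def step(frontier):
    # Process one URL from the frontier; expand its links only if it is relevant.
    url = frontier.pop(0)
    html = download(url)
    if is_relevant(to_text(html)):
        frontier.extend(extract_links(html))   # hyperlinks of relevant pages only
    return url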


Fig 3.2: Control Flow of a Crawler Frontier


3.2: Crawler policies


There are three important characteristics of the Web that generate a scenario in which Web crawling is very difficult: its large volume, its fast rate of change, and dynamic page generation, which together produce a very wide variety of possible crawlable URLs. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted. The recent increase in the number of pages generated by server-side scripting languages has also created difficulty: endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight different URLs, all of which will be present on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content. The behavior of a Web crawler is the outcome of a combination of policies:

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate distributed web crawlers.


3.2.1: Selection policy

Given the current size of the Web, even large search engines cover only a portion of the publicly available Internet. A recent study showed that no search engine indexes more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web. [2] This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case for vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Crawling can be combined with different ordering strategies. The ordering metrics can be breadth-first, backlink count and partial PageRank calculations. One conclusion reported in the literature is that if the crawler wants to download pages with high PageRank early during the crawling process, then the partial PageRank strategy is the better one, followed by breadth-first and backlink count. However, these results are for just a single domain. In practice, the breadth-first strategy is often considered a better strategy than PageRank. The explanation is simple: it has been shown that the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates.

3.2.2: Re-visit policy

The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a Web crawler has finished its crawl, many events could have happened. These events can include creations, updates and deletions. [2]


From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most used cost functions are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not.
Age: This is a measure that indicates how outdated the local copy is.
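As an illustration, the following Python sketch computes the two measures for a single local copy, under a simple assumed bookkeeping scheme (not spelled out above): the crawler records when it fetched the page and, if known, when the live page first changed afterwards.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LocalCopy:
    fetched_at: float                    # when the crawler last downloaded the page
    changed_at: Optional[float] = None   # when the live page first changed afterwards, if it did

def freshness(copy):
    # Binary measure: 1 if the local copy still matches the live page, 0 otherwise.
    return 1 if copy.changed_at is None else 0

def age(copy, now):
    # How outdated the local copy is: time elapsed since the live page changed.
    return 0.0 if copy.changed_at is None else now - copy.changed_at

# A page fetched at t = 0 whose live version changed at t = 5 has, at t = 10,
# freshness 0 and age 5.
stale = LocalCopy(fetched_at=0.0, changed_at=5.0)
print(freshness(stale), age(stale, now=10.0))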

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is concerned only with how many pages are outdated, while in the second case, it is concerned with how old the local copies of the pages are. Two simple re-visiting policies are:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

In terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. The explanation for this result comes from the fact that, when a page changes too often, the crawler will waste time trying to re-crawl it too quickly and still will not be able to keep its copy of the page fresh.

3.2.3: Politeness policy

Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and downloading large files, a server would have a hard time keeping up with requests from multiple crawlers. [2] The use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:

Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too high.
Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, which is a standard that lets administrators indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Commercial search engines such as Ask Jeeves, MSN and Yahoo! are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than two months to download that entire website, and only a fraction of the resources of that Web server would be used; this does not seem acceptable. Normally an interval of 10 seconds between accesses is used, and some crawlers use 15 seconds as the default. Some even follow an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits 10t seconds before downloading the next page.
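The following Python sketch shows one possible way of combining the rules mentioned above: obey an explicit Crawl-delay when the site supplies one, and otherwise wait 10t seconds after a download that took t seconds, with the 10-second default interval as a lower bound. It illustrates the policy only and is not the exact scheme used by any particular crawler.

import time

def polite_delay(download_seconds, crawl_delay=None):
    # Honour an explicit Crawl-delay if given; otherwise use the adaptive 10t rule,
    # never going below the common 10-second default interval.
    if crawl_delay is not None:
        return crawl_delay
    return max(10.0, 10.0 * download_seconds)

start = time.time()
# ... download one page from the server here ...
elapsed = time.time() - start
time.sleep(polite_delay(elapsed))   # wait before the next request to the same server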


3.2.4: Parallelization policy

A parallel crawler is a crawler that runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes. There are two basic policies:

Dynamic assignment: With this type of policy, a central server assigns new URLs to the different crawlers dynamically. This allows the central server, for instance, to dynamically balance the load of each crawler. With dynamic assignment, the system can typically also add or remove downloader processes. The central server may become the bottleneck, so most of the workload must be transferred to the distributed crawling processes for large crawls. There are two configurations of crawling architectures with dynamic assignment: [2] a small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders; and a large crawler configuration, in which the DNS resolver and the queues are also distributed.

Static assignment: With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how to assign new URLs to the crawlers. For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. As there are external links that will go from a Web site assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur. [14] To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batches of several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g. using data from a previous crawl). An effective assignment function must have three main properties: each crawling process should get approximately the same number of hosts (balancing property); if the number of crawling processes grows, the number of hosts assigned to each process must shrink (contra-variance property); and the assignment must be able to add and remove crawling processes dynamically. Consistent hashing, which replicates the buckets so that adding or removing a bucket does not require re-hashing the whole table, achieves all of the desired properties. Crawling is an effective process synchronization tool between the users and the search engine.
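As a small illustration, the Python sketch below assigns URLs to crawling processes by hashing the host name, so that every page of one website is handled by the same process. A plain hash-modulo scheme is shown for brevity; proper consistent hashing would, as noted above, also avoid re-hashing everything when the number of processes changes.

import hashlib
from urllib.parse import urlparse

def assigned_process(url, num_processes):
    # Hash the host name (not the full URL) so that a whole website maps to one process.
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes

# Both URLs belong to the same host, so they map to the same process index.
print(assigned_process("http://example.com/a.html", 4),
      assigned_process("http://example.com/b/c.html", 4))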

3.3: Issues
3.3.1: How to re-visit web pages

The optimum method to re-visit the web and maintain a high average freshness of web pages is to ignore the pages that change too often. The approaches could be: [2]

Re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
Re-visiting more often the pages that change more frequently.

In both cases, the repeated crawling of pages can be done either in a random or in a fixed order. The re-visiting methods considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), something that is not a realistic scenario.

3.3.2: How to avoid overloading websites

Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers. The use of Web crawlers is useful for a number of tasks, but comes with a price for the general community.

The costs of using Web crawlers include: [1]

Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too high.
Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

To resolve this problem we can use the robots exclusion protocol, also known as the robots.txt protocol. The robots exclusion standard, or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website. We can specify a top-level directory of the web site in a file called robots.txt, and this will prevent crawlers from accessing that directory. This protocol uses simple substring comparisons to match the patterns defined in the robots.txt file, so while using the robots.txt file we need to make sure that a final '/' character is appended to the directory path. [17] Otherwise, files with names starting with that substring will be matched rather than the directory.


Chapter IV

Algorithm Implementation
This section covers:

Outline
Parsing and Stemming
Threshold calculation
Document Frequency
Robots.txt



4.1: Outline

The input to the focused crawler is the search query of the user, together with a set of example pages. Parsing is done on these pages and the information-carrying words are retrieved. These words are given as input to the crawler, which carries them along with it. Based on this information, the crawler calculates the relevance of each encountered page, and only if the relevance is satisfactory will the page be stored for further crawling.


Fig 4.1: Basic functioning of crawl frontier

The Pseudo-code summary of implementing the crawler:


Ask the user to specify the starting URL on the web and the file type that the crawler should crawl.
Add the URL to the empty list of URLs to search.
While the list of URLs to search is not empty {
    Take the first URL from the list of URLs.
    Mark this URL as an already searched URL.
    If the URL protocol is not HTTP then
        skip it; go back to While.
    If a robots.txt file exists on the site then
        If the file includes a "Disallow" statement for this URL then
            skip it; go back to While.
    Open the URL.
    If the opened URL is not an HTML file then
        skip it; go back to While.
    Iterate through the HTML file.
    While the HTML text contains another link {
        If a robots.txt file exists on the URL/site then
            If the file includes a "Disallow" statement for this link then
                skip it; go back to While.
        If the linked URL is an HTML file then
            If the URL isn't marked as searched then
                Mark this URL as an already searched URL.
        Else if the type of file is the type the user requested then
            Add it to the list of files found.
    }
}
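The following Python sketch is a runnable approximation of the pseudo-code above, restricted for brevity to HTTP pages and a simple breadth-first frontier. robots.txt is honoured through the standard library's urllib.robotparser, and the relevance test described in the later sections is omitted here; it is an illustration, not the project's actual implementation.

import re
import urllib.request
import urllib.robotparser
from urllib.parse import urljoin, urlparse

def allowed(url, agent="WebSpider"):
    # Check the site's robots.txt before fetching; fetch errors are treated as "allowed".
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    try:
        rp.read()
    except OSError:
        return True
    return rp.can_fetch(agent, url)

def crawl(seed, max_pages=10):
    frontier, searched, found = [seed], set(), []
    while frontier and len(found) < max_pages:
        url = frontier.pop(0)
        if url in searched or not url.startswith("http") or not allowed(url):
            continue
        searched.add(url)                      # mark this URL as already searched
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue                   # only HTML files are iterated
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        found.append(url)
        for link in re.findall(r'href=["\'](.*?)["\']', html, flags=re.I):
            frontier.append(urljoin(url, link))   # resolve relative links
    return found

if __name__ == "__main__":
    print(crawl("http://www.cert.org/research/papers.html"))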

4.2: Parsing and Stemming

4.2.1: Parsing

Parsing (more formally, syntactic analysis) is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. A parser is the component of a compiler that carries out this task. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. Lexical analysis creates tokens from a sequence of input characters, and it is these tokens that are processed by a parser to build a data structure such as a parse tree or an abstract syntax tree. Parsing is also an earlier term for the diagramming of sentences in the grammar of natural languages, and is still used to diagram the grammar of inflected languages, such as the Romance languages or Latin.

4.2.2: Removal of Stop Words

Firstly, the web page is converted into a text file for convenience. An initial parse removes all the stop words from the file. Stop words are words which are filtered out prior to, or after, processing of natural language data (text). Some of the most frequently used stop words include "a", "of", "the", "I", "it", "you", and "and". These are generally regarded as 'functional words' which do not carry meaning (they are not as important for communication). [16] The assumption is that, when assessing the contents of the web page, the meaning can be conveyed more clearly, or interpreted more easily, by ignoring the functional words. A stop list is maintained in a separate text file and all the words of that list are removed from the file being parsed.

4.2.3: Stemming

Next, a Porter Stemmer is run on this file. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". The Porter Stemmer is one algorithm for doing this process effectively. Some examples of the rules include:

if the word ends in 'ed', remove the 'ed'
if the word ends in 'ing', remove the 'ing'
if the word ends in 'ly', remove the 'ly'

Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute-force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and in encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude, given their poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well-known suffixes, with few exceptions. This, however, is a problem, as not all parts of speech have such a well-formulated set of rules. Lemmatization attempts to improve upon this challenge. [16]

When this is done for all the example pages, we perform a frequency analysis of each word, together with the number of pages in which it appears, and then select the words that are most likely to carry the information content of the page. These words are given to the crawler.
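The Python sketch below illustrates this preprocessing on a set of example pages, using a small illustrative stop list and only the three suffix rules quoted above in place of the full Porter Stemmer. It produces the per-word frequency counts and per-word document counts that the next section builds on.

import re
from collections import Counter

STOP_WORDS = {"a", "of", "the", "i", "it", "you", "and"}   # small illustrative stop list

def strip_suffix(word):
    # Toy stemming using only the example rules above; the real Porter algorithm has many more.
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]

def frequency_analysis(example_pages):
    # Word frequency over all pages, plus the number of pages containing each word.
    total, doc_count = Counter(), Counter()
    for page in example_pages:
        stems = preprocess(page)
        total.update(stems)
        doc_count.update(set(stems))
    return total, doc_count

total, doc_count = frequency_analysis(["Crawling and indexing pages.",
                                       "The crawler crawled quickly."])
print(total.most_common(3), doc_count["crawl"])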

4.3: Threshold Calculation

4.3.1: Identification of information-carrying words

Firstly, the frequency of each word in each of the example pages is found. The number of pages in which each word appears is also kept track of. Based on these two criteria, we select our information-carrying words.*


Fixing the threshold

We have used the Vector Space Model to fix the threshold from the initial set of pages. We then used the following formula to find the relevance of a particular page:

Relevance = (No. of information words with mean frequency) / (Total no. of information words)

4.3.2: Vector space model

It is an algebraic model used for information filtering, information retrieval, indexing and relevancy rankings. It represents natural language documents (or any objects, in general) in a formal manner through the use of vectors (of identifiers, such as, for example, index terms) in a multi-dimensional linear space. Its first use was in the SMART Information Retrieval System. Documents are represented as vectors of index terms (keywords). The set of terms is a predefined collection of terms, for example the set of all unique words occurring in the document corpus.*

*Refer Appendix A
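The following sketch shows one reading of the relevance formula above, in which an information word counts towards the numerator only if it occurs in the page at least as often as its mean frequency over the example pages. The mean_freq values are hypothetical and would in practice come from the frequency analysis of Section 4.2.

from collections import Counter

def relevance(page_stems, mean_freq):
    # page_stems: stemmed words of the downloaded page.
    # mean_freq: information word -> its mean frequency in the example pages.
    counts = Counter(page_stems)
    hits = sum(1 for word, mean in mean_freq.items() if counts[word] >= mean)
    return hits / len(mean_freq)

mean_freq = {"crawl": 3.0, "spider": 2.0, "index": 1.5}           # hypothetical values
page = ["crawl"] * 4 + ["spider"] + ["index"] * 2 + ["web"]       # stemmed page words
print(relevance(page, mean_freq))   # 2 of the 3 information words qualify -> 0.67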

4.4: Document frequency


4.4.1: Term frequency

The term frequency of a given term in a given document is simply the number of times the term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term frequency regardless of the actual importance of that term in the document) and to give a measure of the importance of the term t_i within the particular document: [10]


tf_i = n_i / Σ_k n_k

where n_i is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms in the document.

4.4.2: Inverse document frequency

The inverse document frequency is a measure of the general importance of the term, obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient:*

idf_i = log( |D| / |{d : t_i ∈ d}| )

A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weight hence tends to filter out common terms.

* Refer Appendix C

4.5: Robots.txt

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data. The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

An example robots.txt file:

# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
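In Python, the standard library's urllib.robotparser can be used to honour such a file. The sketch below parses the example rules directly; a live crawler would instead point set_url() at the site's /robots.txt and call read().

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /registration",
    "Disallow: /login",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)   # parse the example rules shown above (no network access needed)

print(rp.can_fetch("WebSpider", "http://somehost.com/cgi-bin/query"))   # False
print(rp.can_fetch("WebSpider", "http://somehost.com/index.html"))      # True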

Chapter V

Discussion and Results


This section covers:

Retrieval of relevant pages
Multithreading
Crawl Space reduction
Server overload
Robustness of Acquisition
Snapshots

5.1: Retrieval of the relevant pages only

Relevant pages are those which are closely related to the input document given to the crawler. Our focused crawler achieves a relevance of downloaded web pages of up to 80-85%, while for a normal crawler it is only up to 20-25%.

Moreover, the number of relevant pages downloaded is 50-100 per hour, as compared to a normal crawler, which goes on downloading pages most of which are irrelevant.

Fig 5.1: Comparison Analysis

5.2: Multithreading
This issue has been dealt with successfully: on receiving more than one hyperlink from a file, a number of parallel threads are generated that work together to download the pages and parse them.


5.3: Crawl space reduction

Crawl space is the number of pages visited by the crawler on the web. Our focused crawler reduces the crawl space to a great extent, as it visits the hyperlinks of only the relevant pages, thus pruning most of the web tree.

Fig 5.2: Crawl Space reduction

5.4: Reduction of server overload

We have used the robots exclusion protocol, also known as the robots.txt protocol, to prevent the web spider from accessing all or part of a website.


5.5: Robustness of acquisition


Web Spider has the ability to ramp up to and maintain a healthy acquisition rate without being too sensitive to the start set.

5.6: Snapshots

Fig 5.3: When the page is downloaded, it is parsed and stemmed. The frequency of the words in the page is calculated and its relevance is checked using the cosine similarity method.


Fig 5.4: Initially the URL http://www.cert.org/research/papers.html is given as input to the Web Spider; the links it visits are shown above, of which only the links of the relevant pages are downloaded.


Chapter VI

Conclusion
This section covers:

Conclusion
Challenges and Future Work


6.1: Conclusion

The objectives of our project, as set out in the problem definition, have been achieved in full. Web Spider, the customized multithreaded focused crawler, is ready to be used, with all its functionalities running properly. We have achieved a relatively high reduction in crawl space. The rate of pages being downloaded varies from 50 to 100 pages an hour, and they are the ones most relevant to the user's input document. We have taken care of the Robots Exclusion Protocol and have achieved a healthy acquisition rate without being too sensitive to the start document. Our project can perform successfully using modest desktop hardware. The process of developing Web Spider was extremely instructive and enjoyable. We got to learn the deeper concepts of information retrieval and their practical implementation, and are proud to have completed the project to a satisfactory level.

6.2: Challenges and future work

6.2.1: Challenges

Server-side checking: Web Spider in its present form downloads all the URLs present in a relevant page and discards the irrelevant URLs after downloading. It is a challenge for us to implement a server-side check, i.e. checking the URLs on the server side and thus downloading only the relevant ones.

Distributed web crawler: Our project presently works on a single system. To make it scalable, it is a challenge to turn it into a distributed system with many parallel crawlers running.

6.2.2: Future work

Extending the project to file formats on the web other than HTML and text.

Ranking the downloaded pages with respect to their priority: a page having a high cosine similarity with the example pages carries a high priority.

Increasing the harvest rate. Presently the relevant pages are downloaded at a rate of 50-100 pages per hour; implementing a better focused crawler can increase this rate.

Implementing better preprocessing algorithms. We have presently employed up to 650 stop words and implemented the Porter Stemmer algorithm. Using more stop words and implementing a better stemming algorithm may further enhance the crawler's performance.


Appendices
This section covers:

Term Vector Model
Basic Authentication Scheme

Term Frequency-Inverse Document Frequency


Appendix A: Term vector model

Term vector model is an algebraic model used for information filtering, information retrieval, indexing and relevancy rankings. It represents natural language documents (or any objects, in general) in a formal manner through the use of vectors (of identifiers, such as, for example, index terms) in a multi-dimensional linear space. Its first use was in the SMART Information Retrieval System. Documents are represented as vectors of index terms (keywords). The set of terms is a predefined collection of terms, for example the set of all unique words occurring in the document corpus. Relevancy rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors instead of the angle itself:

cos θ = (d · q) / (||d|| ||q||)

A cosine value of zero means that the query and document vectors are orthogonal and have no match (i.e. the query term did not exist in the document being considered).

Assumptions and Limitations of the Vector Space Model

The Vector Space Model has the following limitations:

Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
Search keywords must precisely match document terms; word substrings might result in a "false positive match".


Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
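As a small illustration, the Python sketch below treats a document and a query as bags of (stemmed) terms, builds term-count vectors over their combined vocabulary, and computes the cosine of the angle between them.

import math
from collections import Counter

def cosine(doc_terms, query_terms):
    d, q = Counter(doc_terms), Counter(query_terms)
    vocab = set(d) | set(q)
    dot = sum(d[t] * q[t] for t in vocab)
    norm = math.sqrt(sum(v * v for v in d.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

print(cosine(["focus", "crawl", "crawl", "web"], ["focus", "crawl"]))   # about 0.87
print(cosine(["photo", "gallery"], ["focus", "crawl"]))                 # 0.0, orthogonal vectors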


Appendix B: Basic authentication scheme

In the context of an HTTP transaction, the basic authentication scheme is a method designed to allow a web browser, or other client program, to provide credentials in the form of a user name and password when making a request. Although the scheme is easily implemented, it relies on the assumption that the connection between the client and server computers is secure and can be trusted. Specifically, the credentials are passed as plaintext and could be intercepted easily. The scheme also provides no protection for the information passed back from the server. To prevent the user name and password being read directly by a person, they are encoded as a sequence of base-64 characters before transmission. For example, the user name "Aladdin" and password "open sesame" would be combined as "Aladdin:open sesame" which is equivalent to QWxhZGRpbjpvcGVuIHNlc2FtZQ== when encoded in base-64. Little effort is required to translate the encoded string back into the user name and password, and many popular security tools will decode the strings "on the fly", so an encrypted connection should always be used to prevent interception. One advantage of the basic authentication scheme is that it is supported by almost all popular web browsers. It is rarely used on normal Internet web sites but may sometimes be used by small, private systems. A later mechanism, digest access authentication, was developed in order to replace the basic authentication scheme and enable credentials to be passed in a relatively secure manner over an otherwise insecure channel.

Example
Here is a typical transaction between an HTTP client and an HTTP server running on the local machine (localhost). It comprises the following steps (a client-side sketch of the exchange is given after the list):

1. The client asks for a page that requires authentication but does not provide a user name and password. Typically this is because the user simply entered the address or followed a link to the page.


2. The server responds with the 401 response code and provides the authentication realm. At this point, the client will present the authentication realm (typically a description of the computer or system being accessed) to the user and prompt for a user name and password. The user may decide to cancel at this point.

3. Once a user name and password have been supplied, the client re-sends the same request but includes the authentication header.

4. In this example, the server accepts the authentication and the page is returned. If the user name is invalid or the password is incorrect, the server might return the 401 response code again and the client would prompt the user once more.
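A minimal client-side sketch of these steps, assuming Java's HttpURLConnection and a hypothetical local URL protected by Basic authentication (the URL and credentials are illustrative, not project code), might look like this:

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of the transaction described above against an assumed local server.
public class BasicAuthClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost/private/index.html");  // hypothetical protected page

        // Step 1: request the page without credentials.
        HttpURLConnection first = (HttpURLConnection) url.openConnection();
        System.out.println("First response code: " + first.getResponseCode());        // 401 expected
        System.out.println("Realm header: " + first.getHeaderField("WWW-Authenticate"));

        // Step 3: repeat the request, now including the Authorization header.
        String encoded = Base64.getEncoder()
                .encodeToString("Aladdin:open sesame".getBytes(StandardCharsets.UTF_8));
        HttpURLConnection second = (HttpURLConnection) url.openConnection();
        second.setRequestProperty("Authorization", "Basic " + encoded);
        System.out.println("Second response code: " + second.getResponseCode());      // 200 if accepted
    }
}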


Appendix C: Term frequency-inverse document frequency

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines to score and rank a document's relevance given a user query. In addition to tf-idf weighting, Internet search engines use link-analysis-based ranking to determine the order in which the scored documents are presented to the user.

The term frequency in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term frequency regardless of the actual importance of that term in the document) to give a measure of the importance of the term t_i within the particular document:

tf_i = n_i / Σ_k n_k

where n_i is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms in the document.

The inverse document frequency is a measure of the general importance of the term, obtained by dividing the number of all documents by the number of documents containing the term and then taking the logarithm of that quotient:

idf_i = log( |D| / |{d : t_i ∈ d}| )


with

|D| : total number of documents in the corpus
|{d : t_i ∈ d}| : number of documents where the term t_i appears (that is, where n_i ≠ 0).

Numeric application of the document frequency

There are many different formulas used to calculate tf-idf. The term frequency (TF) is the number of times the word appears in a document divided by the total number of words in the document. If a document contains 100 words in total and the word "cow" appears 3 times, then the term frequency of the word "cow" in the document is 0.03 (3/100). One way of calculating document frequency (DF) is to determine how many documents contain the word "cow" divided by the total number of documents in the collection. So if "cow" appears in 1,000 documents out of a total of 10,000,000, then the document frequency is 0.0001 (1,000/10,000,000). The final tf-idf score is then calculated by dividing the term frequency by the document frequency. For our example, the tf-idf score for "cow" in the collection would be 300 (0.03/0.0001). Alternatives to this formula are to take the log of the document frequency.

Applications in the Vector Space Model

The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
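As a minimal sketch of the numeric example above (it assumes Java 9+ for List.of and uses a tiny made-up corpus and the simple TF/DF definitions from the example, not the project's own code), the score could be computed like this:

import java.util.List;

// Sketch: TF, DF and the resulting score as in the "cow" example above.
public class TfIdfDemo {

    // TF: occurrences of the term divided by the total number of words in the document.
    static double termFrequency(List<String> documentWords, String term) {
        long count = documentWords.stream().filter(term::equals).count();
        return (double) count / documentWords.size();
    }

    // DF: fraction of documents in the corpus that contain the term.
    static double documentFrequency(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(doc -> doc.contains(term)).count();
        return (double) containing / corpus.size();
    }

    public static void main(String[] args) {
        // Illustrative three-document corpus.
        List<List<String>> corpus = List.of(
                List.of("the", "cow", "jumped", "over", "the", "moon"),
                List.of("the", "crawler", "fetched", "the", "page"),
                List.of("focused", "crawling", "finds", "relevant", "pages"));

        List<String> doc = corpus.get(0);
        double tf = termFrequency(doc, "cow");
        double df = documentFrequency(corpus, "cow");

        // As in the example above, the score is TF divided by DF
        // (a logarithm may be applied to the document frequency instead).
        System.out.println("tf = " + tf + ", df = " + df + ", tf-idf = " + (tf / df));
    }
}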


References
This section covers:

Technical references
Other references


Technical references
[1] Soumen Chakrabarti, Martin van den Berg, Byron Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", The Eighth International World Wide Web Conference, Toronto, 1999. Published by Elsevier Science B.V., 1999.
[2] Vladislav Shkapenyuk, Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler", Proceedings of the 18th International Conference on Data Engineering (ICDE '02), 1063-6382/02, © 2002 IEEE.
[3] Ke Hu, Wing Shing Wong, "A probabilistic model for intelligent Web crawlers", Computer Software and Applications Conference 2003, Proceedings of the 27th Annual International Conference, pages 278-282, ISSN: 0730-3157, © 2003 IEEE.
[4] Castillo, C., "Effective Web Crawling", PhD thesis, University of Chile, 2004.
[5] Padmini Srinivasan, Gautam Pant, "Learning to Crawl: Comparing Classification Schemes", ACM Transactions on Information Systems (TOIS), Volume 23, Issue 4, pages 430-462, ISSN: 1046-8188, ACM Press, 2005.
[6] Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L., "Modeling and managing content in text databases", In Proceedings of the 21st IEEE International Conference, pages 606-617, ISSN: 1084-4627, ISBN: 0-7695-2285-8, © 2005 IEEE.
[7] Baeza-Yates, R., Castillo, C., Marin, M. and Rodriguez, A., "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering", In Proceedings of the Industrial and Practical Experience track of the 14th Conference on World Wide Web, pages 864-872, Chiba, Japan, ACM Press, 2005.
[8] Gautam Pant, Padmini Srinivasan, "Link Contexts in Classifier-Guided Topical Crawlers", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 1, January 2006, © 2006 IEEE.
[9] Jamali, M., Sayyadi, H., Hariri, B.B., Abolhassani, H., "A Method for Focused Crawling Using Combination of Link Structure and Content", Web Intelligence 2006 (WI 2006), IEEE/WIC/ACM International Conference, December 2006, pages 753-756, ISBN: 0-7695-2747-7, © 2006 IEEE.


Other References
[10] http://www.devbistro.com/articles/Misc/Effective-Web-Crawler
[11] http://en.wikipedia.org/wiki/Web_crawler
[12] http://www.depspid.net/
[13] http://www-db.stanford.edu/~backrub/google.html
[14] http://www.webtechniques.com/archives/1997/05/burner/
[15] http://www.ils.unc.edu/keyes/java/porter/
[16] http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
[17] http://combine.it.lth.se/
[18] http://www.cse.iitb.ac.in/~soumen/focus/

