SMILA/Documentation/Web Crawler
Outdated: This page needs to be revised for v0.9. For now, please look at the code of existing web crawler configurations.
Overview
The Web crawler fetches data from HTTP servers. Starting with an initial URL, it crawls all linked websites recursively.
Crawling configuration
The example configuration file is located at
configuration/org.eclipse.smila.connectivity.framework/web.xml
Defining schema:
org.eclipse.smila.connectivity.framework.crawler.web/schemas/WebDataSourceConnectionConfigSchema.xsd

Crawling configuration explanation

The configuration file contains the following elements.

SchemaID: specifies the schema for a crawler job.

DataConnectionID: describes which agent or crawler should be used.
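For orientation, the overall layout of such a configuration file looks roughly like the sketch below. It is only a sketch: the DataSourceID and Crawler values are illustrative placeholders and are not taken from this page.

<DataSourceConnectionConfig>
  <DataSourceID>web</DataSourceID>                  <!-- placeholder ID -->
  <SchemaID>org.eclipse.smila.connectivity.framework.crawler.web</SchemaID>
  <DataConnectionID>
    <Crawler>WebCrawlerDS</Crawler>                 <!-- placeholder crawler ID -->
  </DataConnectionID>
  <CompoundHandling>No</CompoundHandling>
  <Attributes>
    <!-- attribute mappings, see below -->
  </Attributes>
  <Process>
    <WebSite ProjectName="...">
      <!-- crawling settings, see below -->
    </WebSite>
  </Process>
</DataSourceConnectionConfig>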
CompoundHandling: specifies whether compound documents (e.g. archives) should be unpacked and their content processed.

Attributes: lists all attributes that describe a web page. Each Attribute element has the following attributes:
- Type (required): the data type (String, Integer or Date).
- Name (required): the attribute's name.
- HashAttribute: specifies if the attribute is used for the hash used for delta indexing (true or false). Must be true for at least one attribute, which must always have a value.
- KeyAttribute: specifies if the attribute is used for creating the record ID (true or false). Must be true for at least one attribute. All key attributes must identify the file uniquely, so usually you will set it true for the attribute containing the Url FieldAttribute.
- Attachment: specifies if the attribute's data is stored as an attachment of the record.
Each Attribute has one of the following sub elements:

FieldAttribute: selects a field of the crawled page. One of:
- Url: the URL of the web page. NOTE: must currently be mapped to an attribute named "Url". Mapping to additional attributes is allowed.
- Title: the title of the web page, taken from the <title> tag.
- Content: the content of the web page. The original binary content if mapped to an attachment; otherwise the crawler tries to convert it to a string using the encoding reported in the response headers.
- MimeType: the MIME type of the web page, taken from the response headers.

MetaAttribute: selects metadata of the crawled page.
- Attribute Type: one of MetaData, ResponseHeader or MetaDataWithResponseHeaderFallBack.
- Sub elements MetaName: the key of the value to get from the metadata (for example "Date" or "Server" for Type ResponseHeader).
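Put together, the attribute section of a configuration could look like the following sketch. It mirrors the example configuration further below; the attribute names are only illustrative.

<Attributes>
  <!-- the record ID is built from the page URL -->
  <Attribute Type="String" Name="Url" KeyAttribute="true">
    <FieldAttribute>Url</FieldAttribute>
  </Attribute>
  <Attribute Type="String" Name="Title">
    <FieldAttribute>Title</FieldAttribute>
  </Attribute>
  <!-- the page content is hashed for delta indexing and stored as an attachment -->
  <Attribute Type="String" Name="Content" HashAttribute="true" Attachment="true">
    <FieldAttribute>Content</FieldAttribute>
  </Attribute>
  <!-- copy selected response headers into a metadata attribute -->
  <Attribute Type="String" Name="ResponseHeader">
    <MetaAttribute Type="ResponseHeader">
      <MetaName>Date</MetaName>
      <MetaName>Server</MetaName>
    </MetaAttribute>
  </Attribute>
</Attributes>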
Process: this element is responsible for selecting the data to be crawled. It contains one or more WebSite elements.

WebSite: contains all important information for accessing and crawling a website. Its attributes and sub elements are described in the following.

ProjectName: defines the project name.

Sitemaps: enables support for Google site maps. The sitemap.xml, sitemap.xml.gz and sitemap.gz formats are supported (see the Google Sitemap Protocol link under External links). Links extracted from <loc> tags are added to the current level links. The crawler looks for the sitemap file at the root directory of the web server and then caches it for the particular host to avoid parsing the sitemap again for URLs that were already processed.

Header: request headers in the format "<header_name>:<header_content>", separated by semicolons.

Referer: includes a "Referer: URL" header in HTTP requests (see the HTTP Referer Header link under External links).

EnableCookies: enables or disables cookies for the crawling process (true or false; see the HTTP Cookie Header link under External links).

UserAgent: identifies the crawler to the server as a specific user agent originating the request. The generated UserAgent string looks like the following: Name/Version (Description, Url, Email). Attributes:
- Name (required)
- Version
- Description
- Url
- Email
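As a sketch, the per-site settings above appear as attributes of the WebSite element and as a nested UserAgent element. Header and Referer are taken from the example further below; the EnableCookies attribute and all concrete values shown here are assumptions for illustration.

<WebSite ProjectName="My Project"
         Header="Accept-Encoding: gzip,deflate; Via: myProxy"
         Referer="http://myReferer"
         EnableCookies="true">
  <!-- the crawler identifies itself as: Crawler/1.0 (example crawler, http://example.org, [email protected]) -->
  <UserAgent Name="Crawler" Version="1.0" Description="example crawler"
             Url="http://example.org" Email="[email protected]"/>
  <!-- Robotstxt, CrawlingModel, CrawlScope, CrawlLimits, Seeds, Filters, ... follow here -->
</WebSite>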
Robotstxt: the Robots Exclusion Standard tells the crawler how to crawl a website, or rather which resources should not be crawled (see The Web Robots Pages link under External links). Attributes:
- Policy: how to deal with robots.txt rules. The following policies are offered:
  - Classic: simply obey the robots.txt rules. Recommended unless you have special permission to collect a site more aggressively.
  - Ignore: completely ignore robots.txt rules.
  - Custom: obey your own, custom robots.txt instead of the one discovered on the relevant site. The attribute Value must contain the path to a locally available robots.txt file in this case.
  - Set: limit the robots names whose rules are followed to the given set. The Value attribute must contain the robots names, separated by semicolons, in this case.
- Value: specifies the filename with the robots.txt rules for the Custom policy, or the robots names for the Set policy.
- AgentNames: specifies the list of agents we advertise. This list should start with the same name as the UserAgent Name (for example, the crawler user-agent name that is used for the crawl job).
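A sketch of the two non-default policies described above; the file path and agent names are placeholders.

<!-- obey a locally maintained robots.txt file instead of the one on the server -->
<Robotstxt Policy="Custom" Value="configuration/my-robots.txt" AgentNames="mycrawler"/>

<!-- only follow the robots.txt rules addressed to the listed robots -->
<Robotstxt Policy="Set" Value="mycrawler;googlebot" AgentNames="mycrawler"/>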
CrawlingModel: the crawling model to use. Attributes:
- Type: the model type (MaxBreadth or MaxDepth).
  - MaxBreadth: crawls a web site through a limited number of links.
  - MaxDepth: crawls a web site down to a limited link depth.
- Value: the model parameter (Integer).
CrawlScope: decides for each discovered URI whether it is within the scope of the current crawl. The following scopes are provided:
- Broad: accept all. This scope does not impose any limits on the hosts, domains, or paths crawled.
- Domain: accept if on the same 'domain' as the seeds (start URLs). This scope limits discovered URIs to the set of domains defined by the provided seeds. That is, any URI discovered belonging to a domain from which one of the seeds came is within scope. Using the seed 'brox.de', a domain scope will fetch 'bugs.brox.de', 'confluence.brox.de', etc. It will fetch all discovered URIs from 'brox.de' and from any subdomain of 'brox.de'.
- Host: accept if on the exact host as the seeds. This scope limits discovered URIs to the set of hosts defined by the provided seeds. If the seed is 'www.brox.de', then we will only fetch items discovered on this host. The crawler will not go to 'bugs.brox.de'.
- Path: this scope goes yet further and limits the discovered URIs to a section of paths on hosts defined by the seeds. Any host that has a seed pointing at its root (i.e. www.sample.com/index.html) will be included in full, whereas a host whose only seed is www.sample2.com/path/index.html will be limited to URIs under /path/.
- Filters: every scope can have additional filters to select URIs that will be considered to be within or out of scope (see the section Filters for details).
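A short sketch combining the model and scope settings described above; the depth value and the path are placeholders, the element and attribute names follow the examples further below.

<!-- follow links up to a depth of 100, but stay on the hosts of the seed URLs -->
<CrawlingModel Type="MaxDepth" Value="100"/>
<CrawlScope Type="Host">
  <Filters>
    <!-- additionally restrict the scope to paths below /docs/ -->
    <Filter Type="BeginningPath" WorkType="Select" Value="/docs/"/>
  </Filters>
</CrawlScope>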
CrawlLimits: in addition to the limits imposed on the scope of the crawl, it is possible to enforce arbitrary limits on the duration and extent of the crawling process with the following settings:
- SizeLimits:
  - MaxBytesDownload: stop after a fixed number of bytes have been downloaded (0 means unlimited).
  - MaxDocumentDownload: stop after downloading a fixed number of documents (0 means unlimited).
  - MaxTimeSec: stop after a fixed number of seconds (0 means unlimited).
    These are not supposed to be hard limits. Once one of these limits is reached, it will trigger a graceful termination of the crawl job, which means that URIs already being crawled will be completed. As a result, the set limit will be exceeded by some amount.
  - MaxLengthBytes: the maximum number of bytes to download per document.
- TimeoutLimits: whenever the crawler connects to or reads from a host, it checks the timeouts and aborts the operation if any is exceeded. This prevents anomalous occurrences such as hanging reads or infinite connects.
  - Timeout: this limit is the total time needed to connect to and download a website, and as such represents the total of a ConnectTimeout plus a ReadTimeout.
  - ConnectTimeout: connection attempts that take longer will fail.
  - ReadTimeout: read operations that take longer will fail. The default value for the read timeout is 900 seconds.
- WaitLimits:
  - Wait: wait the specified number of seconds between retrievals. Use of this option is recommended, as it lightens the server load by making the requests less frequent. Specifying a large value for this option is useful if the network or the destination host is down, so that the crawler can wait long enough to reasonably expect the network error to be fixed before the retry.
  - RandomWait: some web sites perform log analysis to identify retrieval programs by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0 and 2 * wait seconds, where wait was specified using the Wait setting, in order to mask the crawler's presence from such analysis.
  - MaxRetries: the maximum number of retries for a failed download.
  - WaitRetry: the time to wait between such retries.

Proxy: optional proxy settings. Contains a ProxyServer element with the attributes Host, Port, Login and Password (see the complex website configuration below).
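A sketch of the limits and proxy settings with placeholder values; the element and attribute names match the examples further below.

<CrawlLimits>
  <!-- stop gracefully after 500 documents or one hour, whichever comes first -->
  <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="500"
              MaxTimeSec="3600" MaxLengthBytes="1000000"/>
  <TimeoutLimits Timeout="10000"/>
  <!-- wait 2 seconds (randomized) between requests, retry failed downloads up to 3 times -->
  <WaitLimits Wait="2" RandomWait="true" MaxRetries="3" WaitRetry="10"/>
</CrawlLimits>
<Proxy>
  <ProxyServer Host="proxy.example.com" Port="3128" Login="user" Password="pass"/>
</Proxy>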
Authentication: used to access areas of websites requiring authentication. Three types of authentication are available: RFC2617 (BASIC and DIGEST types of authentication), HTTP POST or GET of an HTML form, and SSL certificate based client authentication.
- Rfc2617:
  - Host, Port: together they equate to the canonical root URI of RFC2617.
  - Realm: the realm as per RFC2617. The realm string must match exactly the realm name presented in the authentication challenge served up by the web server.
  - Login
  - Password
- HtmlForm:
  - CredentialDomain: equates to the canonical root URI of RFC2617.
  - HttpMethod: POST or GET.
  - LoginUri: relative or absolute URI of the page that the HTML form is submitted to.
  - FormElements: listing of HTML form key/value pairs.
- SSLCertificate:
  - ProtocolName
  - Port
  - TruststoreUrl: URL of the truststore containing the trusted certificates.
  - TruststorePassword
  - KeystoreUrl: URL of the keystore containing the client key and certificate pair.
  - KeystorePassword
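There is no complete SSL example on this page, so the following is only an assumed sketch: the element and attribute spellings follow the list above and should be checked against WebDataSourceConnectionConfigSchema.xsd, and all values are placeholders.

<Authentication>
  <!-- assumed spelling per the list above; all values are placeholders -->
  <SSLCertificate ProtocolName="https" Port="443"
                  TruststoreUrl="file:configuration/truststore.jks"
                  TruststorePassword="changeit"
                  KeystoreUrl="file:configuration/keystore.jks"
                  KeystorePassword="changeit"/>
</Authentication>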
Seeds: contains a list of Seed elements.
- FollowLinks: enables analyzing the URLs of pages that otherwise would be ignored:
  - NoFollow: do not analyze pages that match some "Unselect" filter.
  - Follow: analyze everything that matches some "Unselect" filter, but do not index pages that match it.
  - FollowLinksWithCorrespondingSelectFilter: index anything that matches both "Select" and "Unselect" filters, and analyze everything else that matches some "Unselect" filter.
- Seed: defines a site start path from which the crawling process begins.

Filters: contains a list of Filter elements and optional Refinements, used to define filters for pages that should be crawled and indexed.
- Filter: defines a single filter.
  - Type: the following filter types are available:
    - BeginningPath: filters paths which begin with the specified characters.
    - RegExp: filters URLs based on a regular expression.
    - ContentType: filters the content type based on a regular expression. Use this, for example, to exclude unwanted content types such as images (see the examples below).
  - Value: the filter value that will be used to check if the given page or URL matches the filter.
  - WorkType: Select or Unselect.
  - Refinements: must be nested inside the Filter element. It allows modifying the filter settings under certain circumstances. The following refinements may be applied to the filters:
    - Port: applies the filter only to URIs on the given port (attribute Number).
    - TimeOfDay: the filter is only enabled between the hours specified each day. The From and To attributes must be in HH:mm:ss format (e.g. 23:00:00).
      - From: from this time the filter will be enabled.
      - To: till this time the filter will be enabled.

MetaTagFilters: contains a list of MetaTagFilter elements.
- MetaTagFilter: defines a filter for omitting content by meta tags.
  - Type: the type of meta tag to match: Name or Http-Equiv.
  - Name: the name of the tag, e.g. "author" for the Type "Name".
  - Content: the tag contents.
  - WorkType: Select or Unselect.
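The following sketch combines seeds, filters, refinements and meta tag filters. The seeds, filters and refinements mirror the examples below; since the MetaTagFilters part of the complex example is cut off, the MetaTagFilter line is only an assumption based on the attribute list above.

<Seeds FollowLinks="Follow">
  <Seed>http://www.example.com/</Seed>
</Seeds>
<Filters>
  <!-- exclude everything below /archive/, but only on port 80 and during the given hours -->
  <Filter Type="BeginningPath" WorkType="Unselect" Value="/archive/">
    <Refinements>
      <TimeOfDay From="09:00:00" To="18:00:00"/>
      <Port Number="80"/>
    </Refinements>
  </Filter>
  <!-- never download JPEG images -->
  <Filter Type="ContentType" WorkType="Unselect" Value="image/jpeg"/>
</Filters>
<MetaTagFilters>
  <!-- assumed: skip pages marked with <meta name="robots" content="noindex"> -->
  <MetaTagFilter Type="Name" Name="robots" Content="noindex" WorkType="Unselect"/>
</MetaTagFilters>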
Crawling configuration example

      </Attribute>
      <Attribute Type="String" Name="Content" HashAttribute="true" Attachment="true" MimeTypeAttribute="Content">
        <FieldAttribute>Content</FieldAttribute>
      </Attribute>
      <Attribute Type="String" Name="MimeType">
        <FieldAttribute>MimeType</FieldAttribute>
      </Attribute>
      <Attribute Type="String" Name="MetaData" Attachment="false">
        <MetaAttribute Type="MetaData"/>
      </Attribute>
      <Attribute Type="String" Name="ResponseHeader" Attachment="false">
        <MetaAttribute Type="ResponseHeader">
          <MetaName>Date</MetaName>
          <MetaName>Server</MetaName>
        </MetaAttribute>
      </Attribute>
      <Attribute Type="String" Name="MetaDataWithResponseHeaderFallBack" Attachment="false">
        <MetaAttribute Type="MetaDataWithResponseHeaderFallBack"/>
      </Attribute>
    </Attributes>
    <Process>
      <WebSite ProjectName="Example Crawler Configuration"
               Header="Accept-Encoding: gzip,deflate; Via: myProxy"
               Referer="http://myReferer">
        <UserAgent Name="Crawler" Version="1.0" Description="teddy crawler" Url="http://www.teddy.com" Email="[email protected]"/>
        <CrawlingModel Type="MaxDepth" Value="1000"/>
        <CrawlScope Type="Domain">
          <Filters>
            <Filter Type="BeginningPath" WorkType="Select" Value="/"/>
          </Filters>
        </CrawlScope>
        <CrawlLimits>
          <!-- Warning: The amount of files returned is limited to 1000 -->
          <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="1000" MaxTimeSec="3600" MaxLengthBytes="100000"/>
          <TimeoutLimits Timeout="10000"/>
          <WaitLimits Wait="0" RandomWait="false" MaxRetries="8" WaitRetry="0"/>
        </CrawlLimits>
        <Seeds FollowLinks="Follow">
          <Seed>http://en.wikipedia.org/</Seed>
        </Seeds>
        <Filters>
          <Filter Type="RegExp" Value=".*action=edit.*" WorkType="Unselect"/>
        </Filters>
      </WebSite>
    </Process>
  </DataSourceConnectionConfig>
<WebSite ProjectName="Login To Invision Powerboard Forum Example"> <UserAgent Name="Mozilla" Version="5.0" Description="" Url="" Email=""/> <Robotstxt Policy="Ignore" /> <CrawlLimits> <SizeLimits MaxDocumentDownload="15"/> </CrawlLimits> <Authentication> <HtmlForm CredentialDomain="http://forum .example.com/index.php? act=Login&CODE=00" LoginUri="http://forum.example .com/index.php? act=Login&CODE=01" HttpMethod="POST"> <FormElements> <FormElement Key="referer" Value=""/> <FormElement Key="CookieDate" Value="1"/> <FormElement Key="Privacy" Value="1"/> <FormElement Key="UserName" Value="User"/>
<FormElement Key="PassWord" Value="Password"/> <FormElement Key="submit" Value="Enter"/> </FormElements> </HtmlForm> </Authentication> <Seeds FollowLinks="Follow"> <Seed><! [CDATA[http://forum.example.co m/index.php? act=Login&CODE=00]]></Seed> </Seeds> </WebSite>
Multiple website configuration

        <Rfc2617 Host="localhost" Port="80" Realm="Restricted area" Login="user" Password="pass"/>
        <HtmlForm CredentialDomain="http://localhost:8081/admin/" LoginUri="/j_security_check" HttpMethod="GET">
          <FormElements>
            <FormElement Key="j_username" Value="admin"/>
            <FormElement Key="j_password" Value=""/>
            <FormElement Key="submit" Value="Login"/>
          </FormElements>
        </HtmlForm>
      </Authentication>
    </WebSite>
    <WebSite ProjectName="Second WebSite">
      <UserAgent Name="Mozilla" Version="5.0" Description="X11; U; Linux x86_64; en-US; rv:1.8.1.4"/>
      <Robotstxt Policy="Classic" AgentNames="mozilla, googlebot"/>
      <CrawlingModel Type="MaxDepth" Value="100"/>
      <CrawlScope Type="Host"/>
      <CrawlLimits>
        <WaitLimits Wait="5" RandomWait="true"/>
      </CrawlLimits>
      <Seeds FollowLinks="NoFollow">
        <Seed>http://example.com</Seed>
      </Seeds>
      <Filters>
        <Filter Type="BeginningPath" WorkType="Unselect" Value="/something/">
          <Refinements>
            <TimeOfDay From="09:00:00" To="23:00:00"/>
            <Port Number="80"/>
          </Refinements>
        </Filter>
        <Filter Type="RegExp" WorkType="Unselect" Value="news"/>
        <Filter Type="ContentType" WorkType="Unselect" Value="image/jpeg"/>
      </Filters>
    </WebSite>
Complex website configuration

        <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="1" MaxTimeSec="3600" MaxLengthBytes="1000000"/>
        <TimeoutLimits Timeout="10000"/>
        <WaitLimits Wait="0" RandomWait="false" MaxRetries="8" WaitRetry="0"/>
      </CrawlLimits>
      <Proxy>
        <ProxyServer Host="example.com" Port="3128" Login="user" Password="pass"/>
      </Proxy>
      <Authentication>
        <Rfc2617 Host="somehost.com" Port="80" Realm="realm string" Login="user" Password="pass"/>
      </Authentication>
      <Seeds FollowLinks="NoFollow">
        <Seed>http://example.com</Seed>
      </Seeds>
      <Filters>
        <Filter Type="BeginningPath" WorkType="Unselect" Value="/something/">
          <Refinements>
            <TimeOfDay From="09:00:00" To="23:00:00"/>
            <Port Number="80"/>
          </Refinements>
        </Filter>
        <Filter Type="RegExp" WorkType="Unselect" Value="news"/>
        <Filter Type="ContentType" WorkType="Unselect" Value="image/jpeg"/>
      </Filters>
      <MetaTagFilters>
Output example for default configuration

    <Record xmlns="http://www.eclipse.org/smila/record" version="1.0">
      <Val key="_recordid">web:<Url=http://en.wikipedia.org/wiki/Main_Page></Val>
      <Val key="Url">http://en.wikipedia.org/wiki/Main_Page</Val>
      <Val key="Content">
        Whole content of the Wikipedia main page. Too much to post here.
      </Val>
      <Val key="Title">Wikipedia, the free encyclopedia</Val>
      <Seq key="MetaData">
        <Val>base:null</Val>
        <Val>noCache:false</Val>
        <Val>noFollow:false</Val>
        <Val>noIndex:false</Val>
        <Val>refresh:false</Val>
        <Val>refreshHref:null</Val>
        <Val>
          keywords:Main Page,1266,1815,1919,1935,1948 NCAA Men's Division I Ice Hockey Tournament,1991,1993,2009,2009 Bangladesh Rifles revolt,Althea Byfield
        </Val>
        <Val>generator:MediaWiki 1.15alpha</Val>
        <Val>content-type:text/html; charset=utf-8</Val>
        <Val>content-style-type:text/css</Val>
      </Seq>
      <Val key="MimeType">text/html</Val>
      <Seq key="ResponseHeader">
        <Val>Server:Apache</Val>
        <Val>Date:Thu, 26 Feb 2009 14:33:37 GMT</Val>
      </Seq>
      <Seq key="MetaDataWithResponseHeaderFallBack">
        <Val>Age:2</Val>
        <Val>Content-Language:en</Val>
        <Val>Content-Length:57974</Val>
        <Val>Last-Modified:Thu, 26 Feb 2009 14:31:46 GMT</Val>
        <Val>X-Cache-Lookup:MISS from knsq25.knams.wikimedia.org:80</Val>
        <Val>Connection:Keep-Alive</Val>
        <Val>X-Cache:MISS from knsq25.knams.wikimedia.org</Val>
        <Val>Server:Apache</Val>
        <Val>X-Powered-By:PHP/5.2.4-2ubuntu5wm1</Val>
        <Val>Cache-Control:private, s-maxage=0, max-age=0, must-revalidate</Val>
        <Val>Date:Thu, 26 Feb 2009 14:33:37 GMT</Val>
        <Val>Vary:Accept-Encoding,Cookie</Val>
        <Val>
          X-Vary-Options:Accept-Encoding;list-contains=gzip,Cookie;string-contains=enwikiToken;string-contains=enwikiLoggedOut;string-contains=enwiki_session;string-contains=centralauth_Token;string-contains=centralauth_Session;string-contains=centralauth_LoggedOut
        </Val>
        <Val>
          Via:1.1 sq39.wikimedia.org:3128 (squid/2.7.STABLE6), 1.0 knsq29.knams.wikimedia.org:3128 (squid/2.7.STABLE6), 1.0 knsq25.knams.wikimedia.org:80 (squid/2.7.STABLE6), 1.0 HAN-HB-FW-001
        </Val>
        <Val>Content-Type:text/html; charset=utf-8</Val>
        <Val>Proxy-Connection:Keep-Alive</Val>
        <Val>base:null</Val>
        <Val>noCache:false</Val>
        <Val>noFollow:false</Val>
        <Val>noIndex:false</Val>
        <Val>refresh:false</Val>
        <Val>refreshHref:null</Val>
        <Val>
          keywords:Main Page,1266,1815,1919,1935,1948 NCAA Men's Division I Ice Hockey Tournament,1991,1993,2009,2009 Bangladesh Rifles revolt,Althea Byfield
        </Val>
        <Val>generator:MediaWiki 1.15alpha</Val>
        <Val>content-type:text/html; charset=utf-8</Val>
        <Val>content-style-type:text/css</Val>
      </Seq>
      <Val key="_HASH_TOKEN">eb1eff85a3e3d4ad4ffd0dd9d4883e3d1f7f988019ca9bfa4a4df2e7659aa6</Val>
      <Attachment>Content</Attachment>
    </Record>
Additional performance counters

- bytes: number of bytes read from the web server
- pages: number of web pages read
- averageHttpFetchTime: average time for fetching a page from the server
- producerExceptions: number of web server related errors
External links

- The Web Robots Pages (robots.txt reference)
- Google Sitemap Protocol
- HTTP Referer Header
- HTTP Cookie Header