Intranet Search With Nutch: Doug Cutting
Doug Cutting
Lucene is...
● A mature Apache open-source project;
● Java library for text indexing and search;
– Not an application;
● A large community of contributors;
● The search technology behind a lot of web sites &
applications.
● http://jakarta.apache.org/lucene/
● A book out this summer!
Nutch is...
● A young open-source project;
● Web search application software;
● A few part-time paid developers;
● A growing number of contributors;
– paid and un-paid.
● Behind a growing number of sites.
Nutch isn't...
● A business;
– But is a non-profit legal entity to own copyright;
– No employees.
● A search site;
– But want to power lots of search sites;
– From domain-specific, to whole-web.
● A research project.
– But want to be a platform for research.
Nutch Design Goals
● Scale to entire web
– pages on millions of different servers
– billions of pages
– complete crawl takes weeks
– very noisy
● Support high traffic
– thousands of searches per second
● State-of-the-art search quality
Nutch Architecture
[Architecture diagram. Components: web db, fetchers, content, indexers, searchers, web servers]
Web Database
● Page Database
– Used for fetch scheduling.
● Link Database
– Represents full link graph.
– Stores anchor text associated with each link.
– Used for:
● Link analysis;
● Anchor text indexing.
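To make the role of the link database concrete, here is a minimal in-memory sketch in Python. It is an illustration only: the class and method names (`LinkDb`, `add_link`, `anchors_for`, `inlink_count`) and the example URLs are hypothetical, and this is not how Nutch stores its web db.

```python
from collections import defaultdict


class LinkDb:
    """Toy link graph holding (source, target, anchor text) triples."""

    def __init__(self):
        # target URL -> list of (source URL, anchor text)
        self._inlinks = defaultdict(list)

    def add_link(self, source, target, anchor):
        """Record one link found on page `source` pointing at `target`."""
        self._inlinks[target].append((source, anchor))

    def anchors_for(self, url):
        """Anchor texts of links pointing at `url` (for anchor text indexing)."""
        return [anchor for _, anchor in self._inlinks.get(url, [])]

    def inlink_count(self, url):
        """Distinct pages linking to `url`, a crude input for link analysis."""
        return len({source for source, _ in self._inlinks.get(url, [])})


# Hypothetical example data.
db = LinkDb()
db.add_link("http://a.example/", "http://c.example/", "Company home")
db.add_link("http://b.example/", "http://c.example/", "home page")
```

Keying the structure by target URL keeps both uses cheap: the anchor texts for a page and its inlink count are each a single lookup.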
Scalability
● To meet scalability goals:
– multiple simultaneous fetches
(100+ pages/second / CPU, ~10M / day)
– parallel, distributed db update
(100M pages @ 100 pages/second / CPU)
– distributed search
(2-20M pages, 1-40 searches/second / CPU)
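The figures above can be sanity-checked with a little arithmetic. The sketch below reads them as per-CPU rates, which is an interpretation of the slide; the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_day(pages_per_second):
    """Pages one CPU handles in a day at a sustained rate."""
    return pages_per_second * SECONDS_PER_DAY


def days_to_process(total_pages, pages_per_second, cpus=1):
    """Days needed to process total_pages, assuming perfect scaling across CPUs."""
    return total_pages / (pages_per_second * cpus) / SECONDS_PER_DAY


# Fetching at 100 pages/second gives 8,640,000 pages/day, i.e. roughly 10M.
fetch_per_day = pages_per_day(100)

# Updating 100M pages at 100 pages/second takes about 11.6 days on one CPU.
update_days = days_to_process(100_000_000, 100)
```

Under the same assumptions, spreading the update over 10 CPUs brings it down to a little over a day, which is why the update has to be parallel and distributed.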
But intranets are different!
Part 1: Scale
● Fetch, DB & search can all run on one box.
● Complete crawl takes only hours.
● Handful of servers on LAN—easy to overload!
● Lessons:
– need to throttle fetcher
– need much simpler operation: a single command
– can crawl deeper
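One common way to throttle a fetcher is to enforce a minimum delay between requests to the same host. The sketch below shows that idea in Python; it is not Nutch's fetcher, and `HostThrottle` and its parameters are hypothetical names. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def wait(self, url):
        """Block until the host of `url` may be fetched; return seconds waited."""
        host = urlsplit(url).hostname
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        waited = max(0.0, ready_at - now)
        if waited > 0:
            self._sleep(waited)
        # The next request to this host must come at least `delay` later.
        self._next_allowed[host] = now + waited + self.delay
        return waited
```

A fetcher thread would call `throttle.wait(url)` just before each request. This sketch is not thread-safe; with several fetcher threads the dictionary update would need a lock.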
But intranets are different!
Part 2: Control
● cleaner content
● knowledge about structure of sites (cgi's, etc)
● lessons:
– can index more dynamic content (cgi's, etc.)
– can customize crawler better to site
But intranets are different!
Part 3: Quality
● only ~1M pages
● lesson:
– not great for link analysis
– but plenty for anchor text
Intranet How To
Step 1: Install
● Nutch requires only Java & JSP.
● Download & unpack.
● No admin GUI (yet)
– command line
– config files
Intranet How To
Step 2: Configure
● Specify root URLs.
● Specify URL filters.
– a separate config file, containing regexps
– each either includes or excludes URLs
– first matching pattern determines fate of each URL
● Optionally, add a config file specifying:
– delay between fetches
– num fetcher threads
– levels to crawl
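The first-match rule for URL filters can be sketched in a few lines of Python. This is an illustration, not Nutch's filter code. It makes several assumptions: the `-` prefix marks an exclude pattern, as in the deck's URL filter example; the `+` prefix for include patterns is assumed; URLs matching no pattern are rejected; and the `intranet.example.com` host is made up.

```python
import re


def parse_rules(text):
    """Parse lines of the form '+regex' (include) or '-regex' (exclude)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching pattern decides; unmatched URLs are rejected (assumption)."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


RULES = parse_rules(r"""
# skip image and other suffixes
-\.(gif|jpg|pdf|doc|sit|rtf|exe)$
# hypothetical: accept everything else on one intranet host
+^http://intranet\.example\.com/
""")
```

Because the first match wins, rule order matters: the suffix exclusion must come before the host-wide include, otherwise `logo.gif` on the intranet host would be accepted.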
URL Filter Example
# skip image and other suffixes
-\.(gif|jpg|pdf|doc|sit|rtf|exe)$