AI Assignment

Uploaded by

abhiran14082002

Assignment - 1 : Search Engine using Apache Nutch

Submitted By: Hirakjyoti Choudhury (2112042)

Abstract:
This assignment required us to design and implement a search engine utilizing Apache Nutch and
Apache Tomcat as the primary technologies. The aim was to create a web-based search engine
capable of efficiently crawling and indexing web content from diverse sources to deliver relevant
search results to users. Through this project, we successfully demonstrated the feasibility of building
a custom search engine using open-source frameworks such as Nutch and Tomcat.
Tools and Environment:
1. Operating System: macOS Sonoma 14.1 (Unix-based)
2. Apache Nutch: Version 0.9
3. Apache Tomcat: Version 9.0.82
4. Java: Version 21.0.1
Setup and Configuration:
• Apache Tomcat Installation:
To set up the environment, I downloaded "Apache-Tomcat-9.0.82.tar" from
the official Apache website and extracted it in the "Downloads" folder on my system. Tomcat,
commonly used for deploying web applications, was needed to serve the search interface over
the crawled data locally. I also set the JAVA_HOME environment variable, which Tomcat's
startup scripts use to locate the Java installation. To start the Tomcat server, I ran:
/Users/priyanshu/Downloads/apache-tomcat-9.0.82/bin/startup.sh
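The two steps above (setting JAVA_HOME, then starting Tomcat) can be sketched as a short script. This is only an illustration: the use of /usr/libexec/java_home to find the JDK is an assumption about the macOS setup, not something stated in this report, and the script skips the start if startup.sh is not found.

```shell
#!/bin/sh
# Path taken from the report; adjust if Tomcat was extracted elsewhere.
CATALINA_HOME="/Users/priyanshu/Downloads/apache-tomcat-9.0.82"

# Assumption: on macOS, /usr/libexec/java_home prints the home of an installed JDK.
# Only used when JAVA_HOME is not already set.
if [ -z "${JAVA_HOME:-}" ] && [ -x /usr/libexec/java_home ]; then
    JAVA_HOME="$(/usr/libexec/java_home -v 21)"
fi
export JAVA_HOME CATALINA_HOME

# Start Tomcat only if the startup script is actually present and executable.
if [ -x "$CATALINA_HOME/bin/startup.sh" ]; then
    "$CATALINA_HOME/bin/startup.sh"
else
    echo "startup.sh not found under $CATALINA_HOME" >&2
fi
```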
• Apache Nutch Installation and Configuration:
Apache Nutch, an open-source framework for web crawling and indexing, was chosen for
this project due to its flexibility in building web search engines. After downloading and
extracting "nutch-0.9.tar," I placed it in /Users/priyanshu/Downloads/nutch-0.9. In the
nutch-0.9/bin directory, I created a urls folder containing a seed.txt file that lists the URLs to crawl.
My seed file contained a single URL, http://www.nits.ac.in, for demonstration purposes.
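A minimal sketch of creating the seed list from the Terminal, assuming the commands are run from inside the nutch-0.9/bin directory and that the seed URL is http://www.nits.ac.in:

```shell
# Create the urls folder (no error if it already exists).
mkdir -p urls

# Write the single seed URL, one URL per line.
printf '%s\n' 'http://www.nits.ac.in' > urls/seed.txt

# Show the file to confirm its contents.
cat urls/seed.txt
```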
• In the conf directory of nutch-0.9, I made the following configuration adjustments:
o crawl-urlfilter.txt: Added the line:
+^http://([a-z0-9]*\.)*www.nits.ac.in/
o regex-urlfilter.txt: Added the line:
+^http://([a-z0-9]*\.)*www.nits.ac.in
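To see which URLs such a filter line lets through, the pattern can be tried against a few sample URLs. The sketch below uses grep -E as a stand-in for Nutch's own regular-expression matching, which is an assumption that holds only for simple patterns like this one. The leading "+" of the filter line marks an include rule and is not part of the pattern. The sample URLs, including the /academics path, are hypothetical.

```shell
# Pattern from crawl-urlfilter.txt without the leading "+".
pattern='^http://([a-z0-9]*\.)*www.nits.ac.in/'

check_url() {
    # Prints "accept" if the URL matches the include pattern, otherwise "reject".
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        echo "accept"
    else
        echo "reject"
    fi
}

# Hypothetical sample URLs, chosen only to exercise the pattern.
for url in \
    'http://www.nits.ac.in/' \
    'http://www.nits.ac.in/academics' \
    'https://www.nits.ac.in/' \
    'http://example.com/'
do
    printf '%s %s\n' "$(check_url "$url")" "$url"
done
```

Because the pattern begins with ^http://, an https:// URL on the same host is not matched by this rule.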

Crawling Process:
After configuring Nutch, I navigated to the nutch-0.9/bin directory in the Terminal and started the crawl
with this command:
./nutch crawl urls -dir Crawled_Data -depth 3 -topN 10
Here, urls is the folder holding seed.txt and Crawled_Data is the folder where the crawl output is stored.
The -depth option sets how many link levels from the seed URL are followed, and -topN caps the number
of pages fetched at each level.
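As a rough worked example, assuming -depth is the number of fetch rounds and -topN is the maximum number of pages fetched in each round, the crawl above can fetch at most depth × topN pages:

```shell
# Values taken from the crawl command above.
depth=3
topN=10

# Upper bound on fetched pages under the stated assumption.
max_pages=$((depth * topN))
echo "At most $max_pages pages can be fetched"
```

The actual number is usually lower, since a round fetches fewer pages when fewer unfetched URLs pass the filters.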
Deployment on Apache Tomcat:
With the crawled data prepared, I copied the nutch-0.9.war file to the Tomcat webapps directory
(/Users/priyanshu/Downloads/apache-tomcat-9.0.82/webapps). Then, I modified the searcher.dir
property in nutch-site.xml to point to Crawled_Data, enabling the search engine to access the
indexed data.
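A sketch of what the property entry might look like is given below. In Nutch's default configuration the property is named searcher.dir. The absolute path assumes Crawled_Data was created in nutch-0.9/bin, where the crawl was run, and the scratch file name nutch-site.sketch.xml is hypothetical; the entry would need to be merged into the nutch-site.xml that the deployed web application reads (assumed here to be under webapps/nutch-0.9/WEB-INF/classes).

```shell
# Assumption: the crawl was run from nutch-0.9/bin, so Crawled_Data lives there.
CRAWL_DIR="/Users/priyanshu/Downloads/nutch-0.9/bin/Crawled_Data"

# Write the property to a scratch file (hypothetical name) for review
# before merging it into the web application's nutch-site.xml.
cat > nutch-site.sketch.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$CRAWL_DIR</value>
  </property>
</configuration>
EOF

cat nutch-site.sketch.xml
```

Tomcat has to be restarted after the change so that the web application picks up the new value.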
Upon starting the Apache Tomcat server, I accessed the search engine at
http://localhost:8080/nutch-0.9/. Entering a search query like “b.tech” yielded 9 results from the
crawled data, demonstrating the search engine's functionality.
Challenges and Solutions:
During the setup, I encountered an error in Search.jsp, specifically at line 151. Adding an escape
sequence to the statement that includes the header.html file resolved this issue. Afterward, the
Apache Tomcat server was restarted for the change to take effect.
Result:
Upon completing the configuration and necessary adjustments, the Apache Nutch search engine
successfully displayed a homepage with its logo and a search bar. Entering search queries yielded
results based on the indexed web content. For instance, querying "practice" returned 42 relevant
results, proving the search engine’s functionality.