Search Engine
Architecture
By
Srivatsala V
Search Engine Architecture
• The ultimate challenge of web IR research is to provide improved
systems that retrieve the most relevant information available on the
Web to better satisfy the information needs of users.
• To address the challenges found in web IR, web search systems need
very specialized architectures.
• The architecture of a search engine is determined by two
requirements: effectiveness (quality of results) and efficiency
(response time and throughput).
• It has two primary parts: the indexing process and the query process.
Indexing Process
• The indexing process distills information contained within a body of
documents into a format that is amenable to quick access by the query
processor.
• Once the indexes are built, the system is ready to process queries. The indexing process itself consists of three steps:
• text acquisition, which identifies and stores the documents for indexing
• text transformation where the documents are transformed into index
terms or features,
• index creation, which takes the index terms and creates data structures
(indexes) to support fast searching. Figure 6.2 depicts the indexing process.
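The three steps above can be sketched as a minimal pipeline. This is purely illustrative; the function names (`acquire`, `transform`, `build_index`) are hypothetical, not part of any real search engine's API:

```python
# Minimal sketch of the indexing pipeline:
# acquisition -> transformation -> index creation.

def acquire(documents):
    # Text acquisition: a real system would crawl the Web here;
    # this sketch just passes through an in-memory collection.
    return documents

def transform(text):
    # Text transformation: lowercase and split into index terms.
    return text.lower().split()

def build_index(documents):
    # Index creation: map each term to the IDs of documents containing it.
    index = {}
    for doc_id, text in enumerate(acquire(documents)):
        for term in transform(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = ["web search engines", "web crawlers find pages"]
index = build_index(docs)
print(sorted(index["web"]))  # [0, 1] -- both documents contain "web"
```

Each stage feeds the next, which is why the text says the system is ready to process queries only once the indexes are built.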
Text Acquisition
• Text acquisition: A search engine on the Web has a “crawler” (also known as a
“spider” or a “bot”), which is typically programmed to visit web sites and read
their pages and other information in order to create entries for a search engine
index.
• A crawler is usually responsible for gathering the documents and storing them in
a document repository.
• Web crawlers, which follow links to find documents, must efficiently find huge
numbers of web pages (coverage) and keep them up to date (freshness).
• The web crawler is programmed to treat words contained in headings,
metadata, and the first few sentences of the body as likely to be more important
in the context of the page; keywords in such prime locations suggest that the
page is really “about” those keywords.
Text Acquisition
• Finally, information gathered is stored in the document repository.
• It stores the text, metadata, and other related content for each document,
where metadata is information about the document, such as its type, structure,
length, and creation date.
• Other content includes links and anchor text.
• The data store provides fast access to document contents for search engine
components.
• Thus, text acquisition identifies and stores the documents for indexing by
crawling the Web, then converts the gathered information into a consistent
format, and finally stores the findings in a document repository.
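The crawl-and-store behaviour described above can be sketched as a toy breadth-first crawler. This is only a sketch: the `fetch` function is injected so the example runs without a network, and link extraction via a regular expression is a deliberate simplification of real HTML parsing:

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=10):
    # Breadth-first crawl: fetch each page, store it, follow its links.
    frontier = deque(seed_urls)
    repository = {}            # url -> page text (the "document repository")
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository:
            continue           # avoid re-fetching pages we already hold
        text = fetch(url)
        repository[url] = text
        # Follow href links to discover new pages (simplified extraction).
        for link in re.findall(r'href="([^"]+)"', text):
            if link not in repository:
                frontier.append(link)
    return repository

# Tiny in-memory "web" standing in for real HTTP fetches.
pages = {
    "a": '<a href="b">next</a>',
    "b": '<a href="a">back</a> plain text',
}
repo = crawl(["a"], fetch=lambda url: pages.get(url, ""))
print(sorted(repo))  # ['a', 'b']
```

The `max_pages` cap and the visited check stand in for the coverage and freshness policies a real crawler would need.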
Text Transformation
• Text transformation: Once the text is acquired, the next step is to transform
the captured documents into index terms or features. This is a preprocessing
step whose sub-steps are parsing, stop-word removal, stemming, link analysis,
and information extraction.
• Parsing: the processing of the sequence of text tokens in a document to
recognize structural elements, for example, titles, links, and headings.
• Stopping: Commonly occurring words are unlikely to provide useful
information and may be removed from the vocabulary to speed up
processing.
• Stop word removal or stopping is a process that removes common words like
“and,” “or,” “the,” or “in.”
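The stopping step above can be sketched in a few lines. The stop list here is illustrative only; real systems derive much longer lists from collection statistics:

```python
# Stopping: drop common words that carry little content.
STOP_WORDS = {"and", "or", "the", "in", "a", "of", "to"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "anatomy", "of", "a", "search", "engine"]))
# ['anatomy', 'search', 'engine']
```

Removing such words shrinks the vocabulary and speeds up processing, as the slide notes.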
Text Transformation
• Stemming: Stemming is the process of removing suffixes from words to reduce
them to a common root. For example, “crawling,” “crawled,” and “crawls” all
reduce to “crawl.”
• Link analysis: This makes use of links and anchor text in web pages and
identifies popularity and community information—for example, PageRank™.
• Information extraction: This process identifies the classes of index terms that
are important for some applications. For example, named entity recognizers
identify classes, such as people, locations, companies, and dates.
• Classifier: This identifies the class-related metadata for documents. It assigns
labels to documents, such as topics, reading levels, sentiment, and genre.
The use of a classifier depends on the application.
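The stemming step described above can be sketched as a naive suffix stripper. A production system would use something like the Porter stemmer; the suffix list and minimum-stem-length rule here are assumptions chosen only to illustrate the idea:

```python
# Naive stemming: strip the longest matching suffix, keeping a stem
# of at least three characters. Illustrative only -- not Porter stemming.
SUFFIXES = ("ingly", "edly", "ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["crawling", "crawled", "crawls"]])
# ['crawl', 'crawl', 'crawl']
```

Mapping these variants to one index term lets a query on any of them match documents containing the others.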
Index Creation
• Index creation: Document statistics—such as counts of index term occurrences, positions in the
documents where the index terms occurred, counts of occurrences over groups of documents,
lengths of documents in terms of the number of tokens, and other features mostly used in
ranking algorithms—are gathered.
• Most indexes use variants of inverted files.
• An inverted file is a sorted list of words; each word has pointers to the pages in which it occurs.
• It is referred to as “inverted” because documents are associated with words, rather than words
with documents.
• A logical view of the text is indexed; each pointer carries a short description of the page it
points to.
• This is the core of the indexing process: it inverts document-to-term information into
term-to-document form, which is difficult to do for very large numbers of documents.
• The format of an inverted file is designed for fast query processing but must also handle updates.
• An inverted index allows quick lookup of the IDs of the documents that contain a particular word.
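A minimal inverted index with per-document term positions can be sketched as follows. This is an illustration of the data structure, not a production index format (which would be compressed and stored on disk):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to its postings: {doc_id: [positions]}.
    # Positions support phrase queries; term counts fall out
    # as len(positions).
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = ["search engines index the web", "web crawlers feed the index"]
index = build_inverted_index(docs)
print(sorted(index["index"]))   # [0, 1] -- both documents contain "index"
print(index["index"][0])        # [2] -- position of "index" in document 0
```

The structure is “inverted” in exactly the sense the slide describes: you look up a word and get documents, rather than looking up a document and getting its words.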
The Query Process
• The role of the indexing process is to build data structures that enable
searching. The query process makes use of these data structures to
produce a ranked list of documents for a user’s query.
• Query processing takes the user’s query and, depending on the
application, the context, and other inputs, builds a better query,
submits the enhanced query to the search engine on the user’s behalf,
and displays the ranked results.
• Thus, the query process comprises:
• a user interaction module
• a ranking module (Figure 6.3).
The Query Process and Enhancing Queries
• The user interaction module supports creation and refinement of a query and
displays the results.
• The ranking module uses the query and indexes (generated during the indexing
process) to generate a ranked list of documents.
• The user interacts with the system through an interface where the query is entered.
• Query transformation is then employed to improve the initial query, both before
and after the initial search.
• This can use a variety of techniques, such as spell checking, query suggestion
(which provides alternatives to the original query), query expansion, and
relevance feedback.
Enhancing Queries
• The query expansion approach attempts to expand the original
search query by adding further, new, or related terms.
• These additional terms are inserted into an existing query, either by
the user (interactive query expansion or IQE) or
• by the retrieval system (automatic query expansion or AQE) and aim
to increase the accuracy of the search.
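An automatic query expansion (AQE) step can be sketched as below. The synonym table is a hand-made stand-in: real systems derive related terms from query logs, thesauri, or relevance feedback rather than a fixed dictionary:

```python
# Automatic query expansion (AQE) sketch: add related terms to the
# original query. RELATED_TERMS is illustrative, not a real thesaurus.
RELATED_TERMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)          # keep the original terms first
    for term in terms:
        for related in RELATED_TERMS.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query("cheap car"))
# ['cheap', 'car', 'inexpensive', 'automobile', 'vehicle']
```

In interactive query expansion (IQE) the same candidate terms would instead be shown to the user, who decides which ones to add.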
Ranking Module
• The fundamental challenge of a search engine is to rank pages that
match the input query and return an ordered list.
• Search engines rank individual web pages of a web site, not the
entire site. There are many variations of ranking algorithms and
retrieval models.
• Search engines use two different kinds of ranking factors: query-dependent
factors and query-independent factors.
Ranking Module
• Query-dependent factors:
• Query-dependent factors are all ranking factors that are specific to a given query.
• These include measures such as:
• the frequency of the query terms in the document,
• the position of the query terms within the document,
• the inverse document frequency.
• Query-independent factors:
• Query-independent factors are attached to documents, regardless of a given query, and
consider measures such as
• an emphasis on anchor text,
• the language of the document in relation to the language of the query,
• a measure of the geographical distance between the user and the document.
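How the two kinds of factors combine can be sketched with a toy scoring function. The formula here (a crude term-frequency score multiplied by a static per-document score) is an assumption for illustration only; real engines combine many more signals with learned weights:

```python
def score(query_terms, doc_tokens, static_score):
    # Query-dependent part: how often the query terms occur in the
    # document, normalized by document length (a crude TF measure).
    tf = sum(doc_tokens.count(t) for t in query_terms) / len(doc_tokens)
    # Query-independent part: a static per-document score, e.g. a
    # link-based popularity measure like PageRank.
    return tf * (1.0 + static_score)

docs = {
    "d1": ("web search engines rank web pages".split(), 0.9),
    "d2": ("ranking factors for search".split(), 0.1),
}
query = ["search", "ranking"]
ranked = sorted(docs, key=lambda d: score(query, *docs[d]), reverse=True)
print(ranked)  # ['d2', 'd1'] -- d2 matches more query terms per token
```

Here d2 wins despite its lower static score because the query-dependent part dominates; tuning that balance is a central design choice of the ranking module.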