Search Engine
Architecture
By
Srivatsala V
Search Engine Architecture
• The ultimate challenge of web IR research is to provide improved
systems that retrieve the most relevant information available on the
Web to better satisfy the information needs of users.
• To address the challenges found in web IR, web search systems need
very specialized architectures.
• The architecture of a search engine is determined by two
requirements: effectiveness (quality of results) and efficiency
(response time and throughput).
• It has two primary parts: the indexing process and the query process.
Indexing Process
• The indexing process distills information contained within a body of
documents into a format that is amenable to quick access by the query
processor.
• Once the indexes are built, the system is ready to process queries. The indexing process itself consists of three steps:
• text acquisition, which identifies and stores the documents for indexing
• text transformation where the documents are transformed into index
terms or features,
• index creation, which takes the index terms and creates data structures
(indexes) to support fast searching. Figure 6.2 depicts the indexing process.
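The three steps above can be sketched as a minimal pipeline. This is purely illustrative; the function names (`acquire`, `transform`, `build_index`) are hypothetical, not part of any real search engine's API:

```python
# Minimal sketch of the indexing pipeline:
# acquisition -> transformation -> index creation.

def acquire(documents):
    # Text acquisition: a real system would crawl the Web here;
    # this sketch just passes through an in-memory collection.
    return documents

def transform(text):
    # Text transformation: lowercase and split into index terms.
    return text.lower().split()

def build_index(documents):
    # Index creation: map each term to the IDs of documents containing it.
    index = {}
    for doc_id, text in enumerate(acquire(documents)):
        for term in transform(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = ["web search engines", "web crawlers find pages"]
index = build_index(docs)
print(sorted(index["web"]))  # [0, 1] -- both documents contain "web"
```

Each stage feeds the next, which is why the text says the system is ready to process queries only once the indexes are built.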
Text Acquisition
• Text acquisition: A search engine on the Web has a “crawler” (also known as a
“spider” or a “bot”), which is typically programmed to visit web sites and read
their pages and other information in order to create entries for a search engine
index.
• A crawler is usually responsible for gathering the documents and storing them in
a document repository.
• Web crawlers, which follow links to find documents, must efficiently find huge
numbers of web pages (coverage) and keep them up to date (freshness).
• The web crawler is programmed to treat words contained in headings,
metadata, and the first few sentences of the body as likely to be more important
in the context of the page; keywords in such prime locations suggest that the
page is really “about” those keywords.
Text Acquisition
• Finally, information gathered is stored in the document repository.
• It stores the text, metadata, and other related content for each document,
where metadata is information about the document, such as its type, structure,
length, and creation date.
• Other content includes links and anchor text.
• The data store provides fast access to document contents for search engine
components.
• Thus, text acquisition identifies and stores the documents for indexing by
crawling the Web, then converts the gathered information into a consistent
format, and finally stores the findings in a document repository.
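The crawl-and-store behaviour described above can be sketched as a toy breadth-first crawler. This is only a sketch: the `fetch` function is injected so the example runs without a network, and link extraction via a regular expression is a deliberate simplification of real HTML parsing:

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=10):
    # Breadth-first crawl: fetch each page, store it, follow its links.
    frontier = deque(seed_urls)
    repository = {}            # url -> page text (the "document repository")
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository:
            continue           # avoid re-fetching pages we already hold
        text = fetch(url)
        repository[url] = text
        # Follow href links to discover new pages (simplified extraction).
        for link in re.findall(r'href="([^"]+)"', text):
            if link not in repository:
                frontier.append(link)
    return repository

# Tiny in-memory "web" standing in for real HTTP fetches.
pages = {
    "a": '<a href="b">next</a>',
    "b": '<a href="a">back</a> plain text',
}
repo = crawl(["a"], fetch=lambda url: pages.get(url, ""))
print(sorted(repo))  # ['a', 'b']
```

The `max_pages` cap and the visited check stand in for the coverage and freshness policies a real crawler would need.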
Text Transformation
• Text transformation: Once the text is acquired, the next step is to transform
the captured documents into index terms or features. This is a preprocessing
step whose sub-steps are parsing, stop-word removal, stemming, link analysis,
and information extraction.
• Parsing: the processing of the sequence of text tokens in a document to
recognize structural elements, for example, titles, links, and headings.
• Stopping: Commonly occurring words are unlikely to provide useful
information and may be removed from the vocabulary to speed up
processing.
• Stop word removal or stopping is a process that removes common words like
“and,” “or,” “the,” or “in.”
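The stopping step above can be sketched in a few lines. The stop list here is illustrative only; real systems derive much longer lists from collection statistics:

```python
# Stopping: drop common words that carry little content.
STOP_WORDS = {"and", "or", "the", "in", "a", "of", "to"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "anatomy", "of", "a", "search", "engine"]))
# ['anatomy', 'search', 'engine']
```

Removing such words shrinks the vocabulary and speeds up processing, as the slide notes.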
Text Transformation
• Stemming: Stemming is the process of removing suffixes from words to reduce
them to a common root. For example, “crawling,” “crawled,” and “crawls” all
reduce to “crawl.”
• Link analysis: This makes use of links and anchor text in web pages and
identifies popularity and community information—for example, PageRank™.
• Information extraction: This process identifies the classes of index terms that
are important for some applications. For example, named entity recognizers
identify classes, such as people, locations, companies, and dates.
• Classifier: This identifies the class-related metadata for documents. It assigns
labels to documents, such as topics, reading levels, sentiment, and genre.
The use of a classifier depends on the application.
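The stemming step described above can be sketched as a naive suffix stripper. A production system would use something like the Porter stemmer; the suffix list and minimum-stem-length rule here are assumptions chosen only to illustrate the idea:

```python
# Naive stemming: strip the longest matching suffix, keeping a stem
# of at least three characters. Illustrative only -- not Porter stemming.
SUFFIXES = ("ingly", "edly", "ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["crawling", "crawled", "crawls"]])
# ['crawl', 'crawl', 'crawl']
```

Mapping these variants to one index term lets a query on any of them match documents containing the others.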
Index Creation
• Index creation: Document statistics—such as counts of index term occurrences, positions in the
documents where the index terms occurred, counts of occurrences over groups of documents,
lengths of documents in terms of the number of tokens, and other features mostly used in
ranking algorithms—are gathered.
• Most indexes use variants of inverted files.
• An inverted file is a sorted list of words; each word has pointers to the pages in which it occurs.
• It is referred to as “inverted” because documents are associated with words, rather than words
with documents.
• A logical view of the text is indexed; each pointer carries a short description of the page it
points to.
• This is the core of the indexing process: it inverts document-to-term information into
term-to-document form, which is difficult to do for very large numbers of documents.
• The format of an inverted file is designed for fast query processing but must also handle updates.
• An inverted index allows quick lookup of the IDs of the documents that contain a particular word.
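A minimal inverted index with per-document term positions can be sketched as follows. This is an illustration of the data structure, not a production index format (which would be compressed and stored on disk):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to its postings: {doc_id: [positions]}.
    # Positions support phrase queries; term counts fall out
    # as len(positions).
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = ["search engines index the web", "web crawlers feed the index"]
index = build_inverted_index(docs)
print(sorted(index["index"]))   # [0, 1] -- both documents contain "index"
print(index["index"][0])        # [2] -- position of "index" in document 0
```

The structure is “inverted” in exactly the sense the slide describes: you look up a word and get documents, rather than looking up a document and getting its words.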
The Query Process
• The role of the indexing process is to build data structures that enable
searching. The query process makes use of these data structures to
produce a ranked list of documents for a user’s query.
• Query processing takes the user’s query and, depending on the
application, the context, and other inputs, builds a better query,
submits the enhanced query to the search engine on the user’s behalf,
and displays the ranked results.
• Thus, the query process comprises:
• a user interaction module
• a ranking module (Figure 6.3).
The Query Process and Enhancing Queries
• The user interaction module supports creation and refinement of a query and
displays the results.
• The ranking module uses the query and indexes (generated during the indexing
process) to generate a ranked list of documents.
• The user interacts with the system through an interface where the query is entered.
• Query transformation is then employed to improve the initial query, both before
and after the initial search.
• This can use a variety of techniques, such as spell checking, query suggestion
(which provides alternatives to the original query), query expansion, and
relevance feedback.
Enhancing Queries
• The query expansion approach attempts to expand the original
search query by adding further, new, or related terms.
• These additional terms are inserted into an existing query, either by
the user (interactive query expansion or IQE) or
• by the retrieval system (automatic query expansion or AQE) and aim
to increase the accuracy of the search.
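An automatic query expansion (AQE) step can be sketched as below. The synonym table is a hand-made stand-in: real systems derive related terms from query logs, thesauri, or relevance feedback rather than a fixed dictionary:

```python
# Automatic query expansion (AQE) sketch: add related terms to the
# original query. RELATED_TERMS is illustrative, not a real thesaurus.
RELATED_TERMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)          # keep the original terms first
    for term in terms:
        for related in RELATED_TERMS.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query("cheap car"))
# ['cheap', 'car', 'inexpensive', 'automobile', 'vehicle']
```

In interactive query expansion (IQE) the same candidate terms would instead be shown to the user, who decides which ones to add.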
Ranking Module
• The fundamental challenge of a search engine is to rank pages that
match the input query and return an ordered list.
• Search engines rank individual web pages of a web site, not the
entire site. There are many variations of ranking algorithms and
retrieval models.
• Search engines use two different kinds of ranking factors: query-dependent
factors and query-independent factors.
Ranking Module
• Query-dependent factors:
• Query-dependent factors are all ranking factors that are specific to a given query.
• These include measures such as:
• the frequency of the query terms in the document,
• the position of the query terms within the document,
• the inverse document frequency.
• Query-independent factors:
• Query-independent factors are attached to documents, regardless of a given query, and
consider measures such as
• an emphasis on anchor text,
• the language of the document in relation to the language of the query,
• a measure of the geographical distance between the user and the document.
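How the two kinds of factors combine can be sketched with a toy scoring function. The formula here (a crude term-frequency score multiplied by a static per-document score) is an assumption for illustration only; real engines combine many more signals with learned weights:

```python
def score(query_terms, doc_tokens, static_score):
    # Query-dependent part: how often the query terms occur in the
    # document, normalized by document length (a crude TF measure).
    tf = sum(doc_tokens.count(t) for t in query_terms) / len(doc_tokens)
    # Query-independent part: a static per-document score, e.g. a
    # link-based popularity measure like PageRank.
    return tf * (1.0 + static_score)

docs = {
    "d1": ("web search engines rank web pages".split(), 0.9),
    "d2": ("ranking factors for search".split(), 0.1),
}
query = ["search", "ranking"]
ranked = sorted(docs, key=lambda d: score(query, *docs[d]), reverse=True)
print(ranked)  # ['d2', 'd1'] -- d2 matches more query terms per token
```

Here d2 wins despite its lower static score because the query-dependent part dominates; tuning that balance is a central design choice of the ranking module.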