
CONTENTS

Preface
Acknowledgement

1. Introduction
1.1 Definition
1.2 History
1.3 Purpose
1.4 Basic IR system architecture
1.5 Databases vs. IR
1.6 How IR systems work
1.7 The process of IR

2. Applications of Information Retrieval
2.1 General applications of IR
2.2 Domain-specific applications of IR
2.3 Other retrieval methods

3. Web search engines
3.1 Overview & definition
3.2 Popular search engines
3.3 Google Search
3.4 Yahoo! Search
3.5 Bing Search
3.6 How a search engine works

4. Performance measures
4.1 Performance measures
4.2 Other measures

5. Models of Information Retrieval
5.1 Model types

6. Problems in Information Retrieval
6.1 Problem 1
6.2 Problem 2
6.3 Problem 3
6.4 Problem 4
6.5 Problem 5

References
1.1 DEFINITION

Information Retrieval (IR), also called information storage and retrieval (ISR or
ISAR) or information organization and retrieval, is the science of searching for documents, for
information within documents, and for metadata about documents, as well as of searching
relational databases and the World Wide Web. The central challenge is to retrieve what is useful
while leaving behind what is not. The process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in web search
engines. In information retrieval a query does not uniquely identify a single object in the
collection; instead, several objects may match the query, perhaps with different degrees of
relevancy.

An object is an entity that is represented by information in a database. User queries are matched
against the database information. Depending on the application the data objects may be, for
example, text documents, images, audio, mind maps or videos. Often the documents themselves
are not kept or stored directly in the IR system, but are instead represented in the system by
document surrogates or metadata.

Most IR systems compute a numeric score reflecting how well each object in the database matches
the query, and rank the objects according to this value. The top-ranking objects are then shown
to the user. The process may be iterated if the user wishes to refine the query.
1.2 HISTORY
The idea of using computers to search for relevant pieces of information was popularized in the
article As We May Think by Vannevar Bush in 1945. The first automated information retrieval
systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been
shown to perform well on small text corpora such as the Cranfield collection (several thousand
documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use
early in the 1970s.

1.3 PURPOSE
1.3.1 To process large document collections quickly. The amount of online data has
grown at least as quickly as the speed of computers, and we would now like to be
able to search collections that total on the order of billions to trillions of words.
1.3.2 To allow more flexible matching operations.
1.3.3 To allow ranked retrieval: in many cases you want the best answer to an
information need from among many documents that contain certain words.
1.4 BASIC IR SYSTEM ARCHITECTURE

[Figure: Components of an IR system. The user's information need gives rise to a query; the search engine matches the query against an index built from the document collection (which changes through additions and deletions) and returns results to the user.]
The above Figure illustrates the major components in an IR system. Before conducting a search,
a user has an information need, which underlies and drives the search process. We sometimes
refer to this information need as a topic, particularly when it is presented in written form as part
of a test collection for IR evaluation.

The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.

A major task of a search engine is to maintain and manipulate an inverted index for a document
collection. As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.

To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the length
of each document.

In addition, the search engine usually has access to the original content of the documents, in
order to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts queries
from its users, processes these queries, and returns ranked lists of results. To perform relevance
ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for
each document.

After sorting documents according to their scores, the result list may be subjected to further
processing, such as the removal of duplicate or redundant results.
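To make the inverted index and scoring ideas concrete, here is a minimal Python sketch. The function names and the toy collection are illustrative, not taken from any particular engine; it maps terms to the documents containing them and computes a crude retrieval status value by counting matching query terms.

    # Minimal sketch: an inverted index plus a crude retrieval status value (RSV).
    from collections import defaultdict

    def build_index(docs):
        """Map each term to the set of document IDs that contain it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def rank(query, index):
        """Score each document by the number of query terms it contains."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id in index.get(term, set()):
                scores[doc_id] += 1
        return sorted(scores.items(), key=lambda kv: -kv[1])

    docs = {1: "information retrieval systems",
            2: "database systems",
            3: "retrieval of information from text"}
    print(rank("information retrieval", build_index(docs)))
    # [(1, 2), (3, 2)]: documents 1 and 3 contain both query terms

A real engine would replace the raw counts with weighted scores (see the tf/idf discussion in section 3.6), but the index structure is the same.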
1.5 DATABASES VS. INFORMATION RETRIEVAL

- Schema: In databases we know the schema in advance, so the semantic correlation between queries and data is clear. In IR there is no schema, only unstructured natural-language text, so there is no clear semantic correlation between queries and data.

- Answers: Databases can return exact answers. IR returns inexact, estimated answers.

- Theory: Databases rest on a strong theoretical foundation (at least for relational databases). IR theory is not as well understood (especially where Natural Language Processing is involved).
1.6 HOW INFORMATION RETRIEVAL SYSTEMS WORK

IR is a component of an information system. An information system must make sure that
everybody it is meant to serve has the information needed to accomplish tasks, solve problems,
and make decisions, no matter where that information is available. To this end, an information
system must:

1.6.1 Actively find out what users need,


Determining user needs involves
(a) Studying user needs in general as a basis for designing
responsive systems (such as determining what information
students typically need for assignments),
(b) Actively soliciting the needs of specific users, expressed as
query descriptions, so that the system can provide the information.

1.6.2 Acquire documents (or computer programs, or products, or data items, and so on),
resulting in a collection, and

1.6.3 Match documents with needs.

Figuring out what information the user really needs to solve a problem is essential for successful
retrieval. Matching involves taking a query description and finding relevant documents in the
collection; this is the task of the IR system.
1.7 The Process of IR
The retrieval process can be illustrated as a black-box diagram of a typical IR system, with
three components: input, processor and output.

Starting with the input side, the main problem is to obtain a representation of each document
and query suitable for a computer to use. A document representative could, for example, be a
list of extracted words considered to be significant. Rather than have the computer process the
natural language, an alternative approach is to have an artificial language within which all
queries and documents can be formulated.

There is some evidence to show that this can be effective. Of course it presupposes that a user is
willing to be taught to express his information need in the language.
When the retrieval system is on-line, the user can change his request during a search session in
the light of sample retrievals, thereby, it is hoped, improving the subsequent retrieval run.
Such a procedure is commonly referred to as feedback.

Secondly, the processor part of the retrieval system is concerned with the retrieval process. The
process may involve structuring the information in some appropriate way, such as classifying it.
It will also involve performing the actual retrieval function, that is, executing the search strategy
in response to a query. In the diagram, the documents have been placed in a separate box to
emphasize the fact that they are not just input but can be used during the retrieval process in such
a way that their structure is more correctly seen as part of the retrieval process.

Finally, we come to the output, which is usually a set of citations or document numbers. In an
operational system the story ends here. However, in an experimental system it leaves the
evaluation to be done.
APPLICATIONS OF IR

Areas where information retrieval techniques are employed can be divided into three
categories:

2.1 General applications of information retrieval


2.2 Domain specific applications of information retrieval

2.3 Other retrieval methods

2.1 General applications of Information Retrieval
1. Digital libraries - A digital library is a library in which collections are stored in digital
formats and accessed by computers. The digital content may be stored locally, or
accessed remotely via computer networks. A digital library is a type of information
retrieval system.

2. Information filtering - An information filtering system is a system that removes
redundant or unwanted information from an information stream using (semi)automated or
computerized methods prior to presentation to a human user.
i. Recommender systems - A recommender system is a specific type of information
filtering system that attempts to recommend information items (movies, TV
programs/shows/episodes, video on demand, music, books, news, images, web pages,
scientific literature such as research papers, etc.) that are likely to be of
interest to the user.
3. Media search
i. Image retrieval - An image retrieval system is a computer system for browsing,
searching and retrieving images from a large database of digital images.
ii. Music retrieval - Music information retrieval (MIR) is the interdisciplinary
science of retrieving information from music
iii. News search
iv. Speech retrieval
v. Video retrieval

4. Search engines - A search engine is designed to help find information stored on a
computer system.
i. Desktop search - Desktop search is the name for the field of search tools which
search the contents of a user's own computer files, rather than searching the
Internet. These tools are designed to find information on the user's PC, including
web browser histories, e-mail archives, text documents, sound files, images and
video.
ii. Enterprise search - Enterprise search is the practice of making content from
multiple enterprise-type sources, such as databases and intranets, searchable to a
defined audience.
iii. Federated search - Federated search is an information retrieval technology that
allows the simultaneous search of multiple searchable resources.
iv. Mobile search - Mobile search is an evolving branch of information retrieval
services that is centered around the convergence of mobile platforms and mobile
phones and other mobile devices.
v. Social search - Social search or a social search engine is a type of web search that
takes into account the Social Graph of the person initiating the Search Query.
vi. Web search - A web search engine is designed to search for information on the
World Wide Web and FTP servers. The search results are generally presented in a
list of results and are often called hits.

2.2 Domain-specific applications of Information Retrieval
1. Geographic information retrieval - Geographic Information Retrieval (GIR) or
Geographical Information Retrieval is the augmentation of Information Retrieval with
geographic metadata.

2. Legal information retrieval - Legal information retrieval is the science of information
retrieval applied to legal text, including legislation, case law, and scholarly works.

3. Vertical search - A vertical search engine, as distinct from a general Web search engine,
focuses on a specific segment of online content. The vertical content area may be based
on topicality, media type, or genre of content. Common examples include legal, medical,
patent (intellectual property), travel, and automobile search engines.
2.3 Other retrieval methods

Other methods and techniques in which information retrieval is employed include:

1. Adversarial information retrieval - Adversarial information retrieval (adversarial IR) is
a topic in information retrieval related to strategies for working with a data source where
some portion of it has been manipulated maliciously.

2. Automatic summarization - Automatic summarization is the creation of a shortened
version of a text by a computer program. The product of this procedure still contains the
most important points of the original text.
i. Multi-document summarization - Multi-document summarization is an
automatic procedure aimed at extraction of information from multiple texts
written about the same topic.

3. Compound term processing - Compound term processing is the name used for a
category of techniques in information retrieval applications that perform matching on
the basis of compound terms. Compound terms are built by combining two (or more)
simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a
compound term.

4. Cross-lingual retrieval - Cross-language information retrieval (CLIR) is a subfield of
information retrieval dealing with retrieving information written in a language different
from the language of the user's query. For example, a user may pose a query in
English but retrieve relevant documents written in French.
5. Document classification - Document classification/categorization is a problem in
information science. The task is to assign an electronic document to one or more
categories, based on its contents.

6. Spam filtering - Email filtering is the processing of e-mail to organize it according to
specified criteria. Most often this refers to the automatic processing of incoming
messages, but the term also applies to the intervention of human intelligence in addition
to anti-spam techniques, and to outgoing emails as well as those being received.

7. Question answering - In information retrieval and natural language processing (NLP),
question answering (QA) is the task of automatically answering a question posed in
natural language.
3.1 Search Engine / Web Search Engine: Definition & Overview

A search engine is the popular term for an Information Retrieval (IR) System designed to help
find information stored on a Computer System, such as on the World Wide Web (WWW), inside
a corporate or proprietary network, or in a Personal Computer.

The search engine allows one to ask for content meeting specific criteria (typically those
containing a given word or Phrase) and retrieves a list of items that match those criteria.

Regular users of Web search engines casually expect to receive accurate and near-instantaneous
answers to questions and requests merely by entering a short query — a few words — into a text
box and clicking on a search button. Underlying this simple and intuitive interface are clusters of
computers, comprising thousands of machines, working cooperatively to generate a ranked list of
those Web pages that are likely to satisfy the information need embodied in the query. These
machines identify a set of Web pages containing the terms in the query, compute a score for each
page, eliminate duplicate and redundant pages, generate summaries of the remaining pages, and
finally return the summaries and links back to the user for browsing.

To achieve the sub-second response times expected of Web search engines, they incorporate
layers of caching and replication, taking advantage of commonly occurring queries and
exploiting parallel processing, which allows them to scale as the number of Web pages and users
increases. In order to produce accurate results, they store a "snapshot" of the Web. This snapshot
must be gathered and refreshed constantly by a Web crawler, also running on a cluster of
hundreds or thousands of machines, which downloads a fresh copy of each page periodically,
perhaps once a week. Pages that contain rapidly changing information of high quality, such as
news services, may be refreshed daily or hourly.
Consider a simple example. If you have a computer connected to the Internet nearby, pause for a
minute to launch a browser and try the query “information retrieval” on one of the major
commercial Web search engines. It is likely that the search engine responded in well under a
second. Take some time to review the top ten results. Each result lists the URL for a Web page
and usually provides a title and a short snippet of text extracted from the body of the page.
Overall, the results are drawn from a variety of different Web sites and include sites associated
with leading textbooks, journals, conferences, and researchers. As is common for informational
queries such as this one, the Wikipedia article may be present. Do the top ten results contain
anything inappropriate? Could their order be improved? Have a look through the next ten results
and decide whether any one of them could better replace one of the top ten results.

Now, consider the millions of Web pages that contain the words “information” and “retrieval”.
This set of pages includes many that are relevant to the subject of information retrieval but are
much less general in scope than those that appear in the top ten, such as student Web pages and
individual research papers. In addition, the set includes many pages that just happen to contain
these two words, without having any direct relationship to the subject. From these millions of
possible pages, a search engine’s ranking algorithm selects the top-ranked pages based on a
variety of features, including the content and structure of the pages (e.g., their titles), their
relationship to other pages (e.g., the hyperlinks between them), and the content and structure of
the Web as a whole. For some queries, characteristics of the user such as her geographic location
or past searching behavior may also play a role. Balancing these features against each other in
order to rank pages by their expected relevance to a query is an example of relevance ranking.
3.2 Popular Search Engines

Google » www.google.com

Bing » www.bing.com

Ask » www.ask.com

3.2.1 Web Directories

Web directories are human-compiled indexes of sites, organized into categories. Because a site
is reviewed by an editor before being placed in the index, getting listed in a directory is
often quite difficult. By the same token, a directory listing will bring a good amount of
well-targeted visitors. Most search engines will also rank a site higher if they find it in one
of the directories below.

Yahoo » www.yahoo.com

3.2.2 Open Directory

The Open Directory is another human-compiled directory, but one where any Internet user can
become an editor and be responsible for some part of the index. Many other services use Open
Directory listings, including Google, Netscape, Lycos, AOLsearch, AltaVista and HotBot.

» www.dmoz.org
3.2.3 And the rest...

You can submit your site to these search engines if you want, though most of these companies
are struggling in today's Google-dominated web.

Lycos » www.lycos.com

FAST Search » www.alltheweb.com

AOL Search » search.aol.com

AltaVista » www.altavista.com

DogPile » www.dogpile.com
3.2.4 The three most widely used web search engines

[Figure: The three most widely used web search engines and their approximate market share.]
3.3 Google Search

URL: www.google.com
Commercial? Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (124)
Owner: Google
Created by: Sergey Brin and Larry Page
Launched: September 15, 1997
Alexa rank: 1
Revenue: From AdWords
Current status: Active
3.4 Yahoo! Search

URL: search.yahoo.com
Commercial? Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Yahoo!
Created by: Yahoo!
Launched: March 1, 1995
Alexa rank: 4
Current status: Active
3.5 Bing (search engine)

URL: www.bing.com
Slogan: Bing & decide
Commercial? Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Microsoft
Created by: Microsoft
Launched: June 1, 2009
Alexa rank: 23
Current status: Active
3.6 How a Search Engine Works

 Search engines match queries against an index that they create.

 The index consists of the words in each document, plus pointers to their locations within
the documents. This is called an Inverted file.

 A search engine or IR system comprises four essential modules:

3.6.1 A Document Processor

3.6.2 A Query Processor

3.6.3 A Search and matching function

3.6.4 A Ranking capability

3.6.1 Document Processor

o The document processor prepares, processes, and inputs the documents, pages, or sites
that users search against.

o Steps of Document Processor

1. Normalizes the document stream.

2. Breaks the document stream into desired retrievable units.

3. Isolates and metatags subdocument pieces.

4. Identifies potential indexable elements in documents.

5. Deletes stop words.


6. Stems terms.

7. Extracts index entries.

8. Computes weights.

9. Creates and updates the main inverted file against which the search engine
searches in order to match queries against documents.

Steps 1 to 3: Preprocessing
 While essential and potentially important in affecting the outcome of a search,
these first three steps simply standardize the multiple formats encountered when deriving
documents from various providers or handling various Web sites.

 The steps serve to merge all the data into a single consistent data structure that all
the downstream processes can handle. The more sophisticated the later steps of document
processing are, the more important a well-formed, consistent format becomes.

 Step two is important because the pointers stored in the inverted file will enable a
system to retrieve various sized units — either site, page, document, section, paragraph,
or sentence.

Step 4: Identifying elements to index
 Identifying potential indexable elements in documents dramatically affects the nature and
quality of the document representation that the engine will search against.

 In designing the system, we must define the word "term." Is it the alpha-numeric
characters between blank spaces or punctuation? If so, what about non-compositional
phrases (phrases in which the separate words do not convey the meaning of the phrase,
like "skunk works" or "hot dog"), multi-word proper names, or inter-word symbols such
as hyphens or apostrophes that can denote the difference between "small business men"
versus “small-business men."

 Each search engine depends on a set of rules that its document processor must execute to
determine what action is to be taken by the "tokenizer," i.e. the software used to define a
term suitable for indexing.
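As a rough illustration of such tokenizer rules, the regular-expression sketch below (our own simplification, not any engine's actual rules) treats runs of alphanumeric characters as terms but keeps internal hyphens and apostrophes, so that "small-business" survives as a single token:

    # Sketch of a tokenizer: alphanumeric runs, keeping internal hyphens/apostrophes.
    import re

    def tokenize(text):
        return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

    print(tokenize('small business men versus "small-business men"'))
    # ['small', 'business', 'men', 'versus', 'small-business', 'men']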

Step 5: Deleting stop words
 This step helps save system resources by eliminating from further processing, as well
as potential matching, those terms that have little value in finding useful documents in
response to a customer's query.

 This step matters less than it once did, now that memory has become so much cheaper
and systems so much faster; but since stop words may comprise up to 40 percent of the
text words in a document, it still has some significance.

 A stop word list typically consists of those word classes known to convey little
substantive meaning, such as

o articles (a, the),

o conjunctions (and, but),

o interjections (oh, ah),

o prepositions (in, over),

o pronouns (he, it), and

o forms of the "to be" verb (is, are).
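A stop-word filter is then just a set-membership test; the sketch below uses a tiny illustrative stop list rather than a production one:

    # Sketch of stop-word deletion using a tiny illustrative stop list.
    STOP_WORDS = {"a", "the", "and", "but", "oh", "ah", "in", "over", "he", "it", "is", "are"}

    def remove_stop_words(tokens):
        return [t for t in tokens if t not in STOP_WORDS]

    print(remove_stop_words(["the", "talks", "in", "the", "province"]))
    # ['talks', 'province']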

Step 6: Term stemming

 Stemming removes word suffixes.

 The process has two goals.

o In terms of efficiency, stemming reduces the number of unique words in the
index, which in turn reduces the storage space required for the index and speeds
up the search process.

o In terms of effectiveness, stemming improves recall by reducing all forms of the
word to a base or stemmed form.

 For example, if a user asks for analyze, they may also want documents which contain
analysis, analyzing, analyzer, analyzes, and analyzed.

 Therefore, the document processor stems these document terms to a common form (e.g.
analy-) so that documents which include the various forms of analyze will have an equal
likelihood of being retrieved.
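The toy suffix-stripper below shows the idea; it is deliberately crude and tuned to the analyze example above, whereas real stemmers such as Porter's apply ordered rule sets with additional checks:

    # Toy stemmer: strips a few suffixes so all "analy-" forms share one stem.
    def crude_stem(word):
        for suffix in ("zing", "zers", "zer", "zed", "zes", "sis", "ze"):
            if word.endswith(suffix):
                return word[:-len(suffix)]
        return word

    for w in ("analyze", "analysis", "analyzing", "analyzer", "analyzes", "analyzed"):
        print(w, "->", crude_stem(w))   # every form stems to 'analy'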

Step 7: Extracting index entries
 Having completed steps 1 through 6, the document processor extracts the remaining
entries from the original document. For example, the following paragraph shows the
full text sent to a search engine for processing:

o Milosevic's comments, carried by the official news agency Tanjug, cast doubt
over the governments at the talks, which the international community has called
to try to prevent an all-out war in the Serbian province. "President Milosevic said
it was well known that Serbia and Yugoslavia were firmly committed to resolving
problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with
the participation of the representatives of all ethnic communities," Tanjug said.
Milosevic was speaking during a meeting with British Foreign Secretary Robin
Cook, who delivered an ultimatum to attend negotiations in a week's time on an
autonomy proposal for Kosovo with ethnic Albanian leaders from the province.
Cook earlier told a conference that Milosevic had agreed to study the proposal.

 Steps 1 to 6 reduce this text for searching to the following:

o Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna
commun call try prevent all-out war Serb province President Milosevic said well
known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia
peace Serbia particip representa ethnic commun Tanjug said Milosevic speak
meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week
time autonomy propos Kosovo ethnic Alban lead province Cook earl told
conference Milosevic agree study propos.

 The output of step 7 is then inserted and stored in an inverted file that lists the index
entries and an indication of their position and frequency of occurrence.

 The specific nature of the index entries, however, will vary based on the decision in
Step 4 concerning what constitutes an “indexable term.”

 Document processors will have phrase recognizers, as well as Named Entity
recognizers and categorizers, to ensure that index entries such as Milosevic are tagged as
a Person and entries such as Yugoslavia and Serbia as Countries.

Step 8: Term weight assignment
 Weights are assigned to terms in the index file. The simplest of search engines just
assign a binary weight: 1 for presence and 0 for absence.

 Measuring the frequency of occurrence of a term in the document creates more
sophisticated weighting, and length-normalization of those frequencies is more
sophisticated still. Extensive experience in information retrieval research over many years
has clearly demonstrated that the optimal weighting comes from use of "tf/idf". This
algorithm measures the frequency of occurrence of each term within a document.
Then it compares that frequency against the frequency of occurrence in the entire
database.

 A simple example would be the word "the." This word appears in too many
documents to help distinguish one from another. A less obvious example would be
the word "antibiotic." In a sports database when we compare each document to the
database as a whole, the term "antibiotic" would probably be a good discriminator
among documents, and therefore would be assigned a high weight. Conversely, in a
database devoted to health or medicine, "antibiotic" would probably be a poor
discriminator, since it occurs very often. The TF/IDF weighting scheme assigns
higher weights to those terms that really distinguish one document from the others.
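A small sketch of this weighting follows; it is one common tf/idf variant (exact normalizations differ from system to system), and the toy sports collection is our own:

    # Sketch of tf/idf weighting (one common variant).
    import math

    def tf_idf(term, doc, collection):
        tf = doc.count(term) / len(doc)                       # length-normalized term frequency
        df = max(1, sum(1 for d in collection if term in d))  # documents containing the term
        idf = math.log(len(collection) / df)                  # rare terms score higher
        return tf * idf

    sports_docs = [["match", "antibiotic", "injury"],
                   ["match", "goal", "score"],
                   ["score", "goal", "match"]]
    print(tf_idf("antibiotic", sports_docs[0], sports_docs))  # ~0.37: a good discriminator
    print(tf_idf("match", sports_docs[0], sports_docs))       # 0.0: occurs in every document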

Step 9: Creating the index
 The index or inverted file is the internal data structure that stores the index
information and that will be searched for each query. Inverted files range from a
simple listing of every alpha-numeric sequence in a set of documents/pages being
indexed along with the overall identifying numbers of the documents in which the
sequence occurs, to a more linguistically complex list of entries, the tf/idf weights,
and pointers to where inside each document the term occurs.

 The more complete the information in the index, the better the search result.
3.6.2 Query Processor

o Query processing has seven possible steps, though a system can cut these steps
short and proceed to match the query to the inverted file at any of a number of
places during the processing. Document processing shares many steps with query
processing.

o Steps of Query Processor

The steps in query processing are as follows (with the option to stop processing and start
matching indicated as "Matcher"):

1. Tokenize query terms.
2. Parse the query: recognize query terms vs. special operators.
————————> Matcher
3. Delete stop words.
4. Stem words.
5. Create the query representation.
————————> Matcher
6. Expand the query.
7. Compute weights.
————————> Matcher

Step 1: Tokenizing
 As soon as a user inputs a query, the search engine must tokenize the query stream, i.e.,
break it down into understandable segments. Usually a token is defined as an alpha-
numeric string that occurs between white space and/or punctuation.

Step 2: Parsing
 Since users may employ special operators in their query, including Boolean, adjacency,
or proximity operators, the system needs to parse the query first into query terms and
operators. These operators may occur in the form of reserved punctuation (e.g., quotation
marks) or reserved terms in specialized format (e.g., AND, OR).

Steps 3 & 4: Stop-listing and stemming
 Some search engines will go further and stop-list and stem the query, similar to the
processes described above in the Document Processor section. The stop list might also
contain words from commonly occurring querying phrases, such as, "I'd like information
about." However, since most publicly available search engines encourage very short
queries, as evidenced in the size of query window provided, the engines may drop these
two steps.

Step 5: Creating the query
 How each particular search engine creates a query representation depends on how the
system does its matching. If a statistically based matcher is used, then the query must
match the statistical representations of the documents in the system. Good statistical
queries should contain many synonyms and other terms in order to create a full
representation. If a Boolean matcher is utilized, then the system must create logical sets
of the terms connected by AND, OR, or NOT.

 An NLP system will recognize single terms, phrases, and Named Entities. If it uses any
Boolean logic, it will also recognize the logical operators from Step 2 and create a
representation containing logical sets of the terms to be AND'd, OR'd, or NOT'd.
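For the Boolean case, those logical sets map directly onto set operations over the inverted index. A minimal sketch (illustrative names, as before):

    # Sketch of Boolean matching as set operations on an inverted index.
    def AND(index, a, b):
        return index.get(a, set()) & index.get(b, set())

    def OR(index, a, b):
        return index.get(a, set()) | index.get(b, set())

    def NOT(index, all_ids, a):
        return all_ids - index.get(a, set())

    index = {"information": {1, 3}, "retrieval": {1, 2}}
    print(AND(index, "information", "retrieval"))   # {1}
    print(OR(index, "information", "retrieval"))    # {1, 2, 3}
    print(NOT(index, {1, 2, 3}, "retrieval"))       # {3}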

 At this point, a search engine may take the query representation and perform the search
against the inverted file. More advanced search engines may take two further steps.

Step 6: Query expansion
 Since users of search engines usually include only a single statement of their information
needs in a query, it becomes highly probable that the information they need may be
expressed using synonyms, rather than the exact query terms, in the documents which the
search engine searches against. Therefore, more sophisticated systems may expand the
query into all possible synonymous terms and perhaps even broader and narrower terms.

 This process approaches what search intermediaries did for end users in the earlier days
of commercial search systems. Back then, intermediaries might have used the same
controlled vocabulary or thesaurus used by the indexers who assigned subject descriptors
to documents. Today, resources such as WordNet are generally available, or specialized
expansion facilities may take the initial query and enlarge it by adding associated
vocabulary.
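Using WordNet through NLTK (this assumes the nltk package and its WordNet corpus have been installed and downloaded), a query term can be expanded with synonyms plus broader (hypernym) and narrower (hyponym) terms, roughly as follows:

    # Sketch of WordNet-based query expansion (requires nltk and its 'wordnet' corpus).
    from nltk.corpus import wordnet as wn

    def expand(term):
        expansion = {term}
        for synset in wn.synsets(term):
            expansion.update(l.name() for l in synset.lemmas())      # synonyms
            for related in synset.hypernyms() + synset.hyponyms():   # broader/narrower terms
                expansion.update(l.name() for l in related.lemmas())
        return expansion

    print(sorted(expand("retrieval")))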

Step 7: Query term weighting (assuming more than one query term)
 The final step in query processing involves computing weights for the terms in the query.
Sometimes the user controls this step by indicating either how much to weight each term
or simply which term or concept in the query matters most and must appear in each
retrieved document to ensure relevance.

 Leaving the weighting up to the user is not common, because research has shown that
users are not particularly good at determining the relative importance of terms in their
queries. They can't make this determination for several reasons. First, they don't know
what else exists in the database, and document terms are weighted by being compared to
the database as a whole. Second, most users seek information about an unfamiliar subject,
so they may not know the correct terminology.
 Few search engines implement system-based query weighting, but some do an implicit
weighting by treating the first term(s) in a query as having higher significance. The
engines use this information to provide a list of documents/pages to the user.

 After this final step, the expanded, weighted query is searched against the inverted file of
documents.
4.1 Performance measures
Many different measures for evaluating the performance of information retrieval systems have
been proposed. The measures require a collection of documents and a query. All common
measures described here assume a ground truth notion of relevancy: every document is known to
be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and
there may be different shades of relevancy.

4.1.1 Precision

Precision is the fraction of the retrieved documents that are relevant to the user's information
need:

precision = (number of relevant documents retrieved) / (total number of documents retrieved)

In binary classification, precision is analogous to positive predictive value. Precision takes all
retrieved documents into account. It can also be evaluated at a given cut-off rank, considering
only the topmost results returned by the system. This measure is called precision at n, or P@n.

Note that the meaning and usage of "precision" in the field of Information Retrieval differs from
the definition of accuracy and precision within other branches of science and technology.

4.1.2 Recall
Recall is the fraction of the documents relevant to the query that are successfully retrieved:

recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection)

In binary classification, recall is called sensitivity, so it can be looked at as the probability
that a relevant document is retrieved by the query.

It is trivial to achieve recall of 100% by returning all documents in response to any query.
Therefore recall alone is not enough; one also needs to measure the number of non-relevant
documents retrieved, for example by computing the precision.

4.1.3 Fall-Out

Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant
documents available:

fall-out = (number of non-relevant documents retrieved) / (total number of non-relevant documents in the collection)

In binary classification, fall-out is closely related to specificity: fall-out = 1 − specificity.
It can be looked at as the probability that a non-relevant document is retrieved by the query.

It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.

4.1.4 F-measure
The weighted harmonic mean of precision (P) and recall (R), the traditional F-measure or
balanced F-score, is:

F = 2 · P · R / (P + R)

This is also known as the F1 measure, because recall and precision are evenly weighted.

The general formula for non-negative real β is:

Fβ = (1 + β²) · P · R / (β² · P + R)

Two other commonly used F measures are the F2 measure, which weights recall twice as much
as precision, and the F0.5 measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures the effectiveness of
retrieval with respect to a user who attaches β times as much importance to recall as precision".
It is based on van Rijsbergen's effectiveness measure E = 1 − (1 / (α / P + (1 − α) / R)). Their
relationship is Fβ = 1 − E, where α = 1 / (β² + 1).
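These definitions translate directly into code. A small sketch, treating the retrieved and relevant documents as sets of IDs:

    # Sketch: precision, recall, and F-beta from sets of document IDs.
    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant)

    def f_beta(p, r, beta=1.0):
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5}
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    print(p, r, f_beta(p, r))   # 0.5, 0.667, 0.571 (the balanced F1)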

4.1.5 Mean Average precision

Precision and recall are single-value metrics based on the whole list of documents returned by
the system. For systems that return a ranked sequence of documents, it is desirable to also
consider the order in which the returned documents are presented. Average precision emphasizes
ranking relevant documents higher. It is the average of the precisions computed at the position
of each relevant document in the ranked sequence:

AP = ( Σ from r = 1 to N of (P(r) · rel(r)) ) / (number of relevant documents)

where r is the rank, N the number of documents retrieved, rel(r) a binary function that is 1 if
the document at rank r is relevant and 0 otherwise, and P(r) the precision computed at cut-off
rank r.

This metric is also sometimes referred to geometrically as the area under the precision-recall
curve.

Note that the denominator (number of relevant documents) is the number of relevant documents
in the entire collection, so the metric reflects performance over all relevant documents,
regardless of a retrieval cutoff.
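A sketch of average precision computed from the ranked relevance judgments (1 = relevant, 0 = not):

    # Sketch: average precision over a ranked result list.
    def average_precision(ranked_rel, total_relevant):
        hits, total = 0, 0.0
        for r, rel in enumerate(ranked_rel, start=1):
            if rel:
                hits += 1
                total += hits / r          # P(r) at each relevant document
        return total / total_relevant

    print(average_precision([1, 0, 1, 0, 0], total_relevant=3))
    # (1/1 + 2/3) / 3 = 0.556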

4.1.6 Discounted cumulative gain

DCG uses a graded relevance scale of documents from the result set to evaluate the usefulness,
or gain, of a document based on its position in the result list. The premise of DCG is that
highly relevant documents appearing lower in a search result list should be penalized, so the
graded relevance value is reduced logarithmically in proportion to the position of the result.

The DCG accumulated at a particular rank position p is defined as:

DCG_p = rel_1 + Σ from i = 2 to p of (rel_i / log2(i))

where rel_i is the graded relevance of the result at position i.

Since result sets may vary in size among different queries or systems, to compare performance
the normalized version of DCG divides by an ideal DCG (IDCG), obtained by sorting the documents
of the result list by relevance:

nDCG_p = DCG_p / IDCG_p

The nDCG values for all queries can be averaged to obtain a measure of the average performance
of a ranking algorithm. Note that for a perfect ranking algorithm, DCG_p will be the same as
IDCG_p, producing an nDCG of 1.0. All nDCG values are thus relative values on the interval
0.0 to 1.0 and so are cross-query comparable.
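A sketch of DCG and nDCG using the classic log2 discount defined above:

    # Sketch: DCG and nDCG with the classic log2 discount.
    import math

    def dcg(relevances):
        return sum(rel if i == 1 else rel / math.log2(i)
                   for i, rel in enumerate(relevances, start=1))

    def ndcg(relevances):
        ideal = sorted(relevances, reverse=True)   # best possible ordering
        return dcg(relevances) / dcg(ideal)

    print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))      # 0.932: close to the ideal ranking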

4.2 Other Measures

4.2.1 Mean reciprocal rank

Mean reciprocal rank is a statistic for evaluating any process that produces a list of possible
responses to a query, ordered by probability of correctness. The reciprocal rank of a query
response is the multiplicative inverse of the rank of the first correct answer. The mean
reciprocal rank is the average of the reciprocal ranks of the results over a sample of queries Q:

MRR = (1 / |Q|) · Σ from i = 1 to |Q| of (1 / rank_i)
For example, suppose we have the following three sample queries for a system that tries to
translate English words to their plurals. In each case, the system makes three guesses, with
the first one being the one it thinks is most likely correct:

Query   Results                 Correct response   Rank   Reciprocal rank
cat     catten, cati, cats      cats               3      1/3
torus   torii, tori, toruses    tori               2      1/2
virus   viruses, virii, viri    viruses            1      1

Given those three samples, the mean reciprocal rank is (1/3 + 1/2 + 1)/3 = 11/18, or about 0.61.

This basic definition does not specify what to do if:

1. None of the proposed results is correct (use reciprocal rank 0), or
2. There are multiple correct answers in the list (consider using mean average precision,
MAP, instead).
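A quick sketch, encoding "no correct answer" as rank 0 per option 1 above:

    # Sketch: mean reciprocal rank; rank 0 encodes "no correct answer".
    def mean_reciprocal_rank(first_correct_ranks):
        return sum(1.0 / r for r in first_correct_ranks if r > 0) / len(first_correct_ranks)

    print(mean_reciprocal_rank([3, 2, 1]))   # 0.611..., i.e. 11/18 as in the example above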

4.2.2 Spearman's rank correlation coefficient

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles
Spearman and often denoted by the Greek letter ρ (rho) or as r_s, is a non-parametric measure of
statistical dependence between two variables. The Spearman correlation coefficient is often
thought of as being the Pearson correlation coefficient between the ranked variables. In practice,
however, a simpler procedure is normally used to calculate ρ. The n raw scores Xi, Yi are
converted to ranks xi, yi, and the differences di = xi − yi between the ranks of each observation
on the two variables are calculated.

If there are no tied ranks, then ρ is given by:

ρ = 1 − (6 · Σ di²) / (n · (n² − 1))


If tied ranks exist, Pearson's correlation coefficient between ranks should be used for the
calculation. Each set of equal values is assigned the same rank, namely the average of their
positions in the ascending order of the values.

A Spearman correlation of 1 results when the two variables being compared are monotonically
related, even if their relationship is not linear. In contrast, the Pearson correlation is
perfect only when the two variables are linearly related.
When the data are roughly elliptically distributed and there are no prominent outliers, the
Spearman correlation and Pearson correlation give similar values.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers in
the tails of both samples.

A positive Spearman correlation coefficient corresponds to an increasing monotonic trend
between X and Y; a negative coefficient corresponds to a decreasing monotonic trend.
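In practice one rarely codes this by hand; SciPy's spearmanr computes ρ (handling ties via Pearson correlation on the ranks):

    # Sketch: Spearman's rho via SciPy.
    from scipy.stats import spearmanr

    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]        # monotonic but not linear
    rho, p_value = spearmanr(x, y)
    print(rho)                   # 1.0: a perfect monotonic relationship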
5.1 MODEL TYPES

For information retrieval to be efficient, the documents are typically transformed into a
suitable representation. There are several such representations. Common models can be
categorized along two dimensions: their mathematical basis and the properties of the model.

5.1.1 First dimension: mathematical basis

1. Set-theoretic models represent documents as sets of words or phrases. Similarities are
usually derived from set-theoretic operations on those sets. Common models are:
i. Standard Boolean model
ii. Extended Boolean model
iii. Fuzzy retrieval
2. Algebraic models represent documents and queries usually as vectors, matrices, or tuples.
The similarity of the query vector and document vector is represented as a scalar value
(see the cosine-similarity sketch after this list).
i. Vector space model
ii. Generalized vector space model
iii. (Enhanced) Topic-based Vector Space Model
iv. Extended Boolean model
v. Latent semantic indexing, a.k.a. latent semantic analysis

3. Probabilistic models treat the process of document retrieval as probabilistic inference.
Similarities are computed as probabilities that a document is relevant for a given query.
Probabilistic theorems like Bayes' theorem are often used in these models.
i. Binary Independence Model
ii. Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
iii. Uncertain inference
iv. Language models
v. Divergence-from-randomness model
vi. Latent Dirichlet allocation

4. Machine-learned ranking models view documents as vectors of ranking features (some of
which often incorporate other ranking models mentioned above) and try to find the best
way to combine these features into a single relevance score by machine learning methods.
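As promised above, a minimal vector space model sketch: documents and queries become term-count vectors, and similarity is the cosine of the angle between them. A real system would use tf/idf weights rather than the raw counts shown here.

    # Sketch: cosine similarity between term-count vectors (vector space model).
    import math
    from collections import Counter

    def cosine(u, v):
        dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
        norm = (math.sqrt(sum(c * c for c in u.values())) *
                math.sqrt(sum(c * c for c in v.values())))
        return dot / norm if norm else 0.0

    doc = Counter("models represent documents for information retrieval".split())
    query = Counter("information retrieval".split())
    print(round(cosine(query, doc), 3))   # 0.577: shared terms drive the similarity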
5.1.2 Second dimension: properties of the model

1. Models without term interdependencies treat different terms/words as independent. This
is usually represented in vector space models by the orthogonality assumption on term
vectors, and in probabilistic models by an independence assumption for term variables.

2. Models with immanent term interdependencies allow a representation of
interdependencies between terms. However, the degree of interdependency between
two terms is defined by the model itself. It is usually directly or indirectly derived
(e.g. by dimensionality reduction) from the co-occurrence of those terms in the whole
set of documents.

3. Models with transcendent term interdependencies allow a representation of
interdependencies between terms, but they do not allege how the interdependency
between two terms is defined. They rely on an external source for the degree of
interdependency between two terms (for example, a human or a sophisticated algorithm).
6.1 Problem 1: Assisting the user in clarifying and analyzing the problem and determining information needs

i. Clarifying and analyzing the problem.


ii. Determining what part of the problem solution can be affected by the system and what
part is left to the user.
iii. Determining what knowledge the user requires for her part in the problem solution.
iv. Determining what the user knows already.
v. Deducing what information is necessary to lead the user from her present knowledge state
to the required knowledge state.
6.2 Problem 2: Knowing how people use and process information

i. Assembling a package of information that enables the user to come closer to a solution
of his problem.
ii. Understanding the relationship of information use to the problem-solving/decision-making process.
iii. Understanding how people make relevance judgments.
iv. Understanding how people organize information in their minds, acquire it, and process it for output.

6.3 Problem 3: Knowledge representation

i. Choosing the general approach to knowledge representation.


ii. Constructing a conceptual schema.
iii. Constructing a list of values for each entity type or rules for generating such values
iv. Knowledge/data acquisition and assimilation
v. Representing uncertainty
vi. Developing fine-grained information systems
6.4 Problem 4: Procedures for processing knowledge/information

i. Transformation from one representation to another


ii. Translation from one natural language to another.
iii. Translation from natural language to a formal representation
iv. Translation from a formal representation to natural language
v. Translation from one formal representation into another
vi. Expression of data in tabular or graphic form adapted to the user's purpose
vii. Removing redundancy
viii. Summarizing
ix. Computing statistics
x. Deriving generalizations
xi. Drawing inferences
xii. Search and selection
xiii. Indexing: attaching predictive clues
6.5 Problem 5: The human-computer interface

i. Functions in the human-computer interface.
ii. Formal design of the human-computer interface.
REFERENCES

• Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, online edition, (c) 2009 Cambridge UP.
• Abraham Silberschatz, Henry F. Korth, and S. Sudarshan. Database System Concepts, 5th Edition.
• http://www.google.com
• http://www.wikipedia.com
