Documentation: Information Retrieval (IR)
Contents
Preface
Acknowledgment
1. Introduction
1.1 Definition
1.2 History
1.3 Purpose
1.4 Basic IR system architecture
1.5 Databases vs. IR
1.6 How an IR system works
1.7 The process of IR
2. Applications of IR
4. Performance measures
4.1 Performance measures
4.2 Other measures
5. Models of IR
6. Problems in IR
References
1.1 DEFINITION
Information Retrieval (IR), also called information storage and retrieval (ISR or ISAR) or information organization and retrieval, is the science of searching for documents, for information within documents, and for metadata about documents, as well as of searching relational databases and the World Wide Web. The central challenge is to retrieve what is useful while leaving behind what is not. The process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevance.
An object is an entity that is represented by information in a database. User queries are matched
against the database information. Depending on the application the data objects may be, for
example, text documents, images, audio, mind maps or videos. Often the documents themselves
are not kept or stored directly in the IR system, but are instead represented in the system by
document surrogates or metadata.
Most IR systems compute a numeric score indicating how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user. The process may be iterated if the user wishes to refine the query.
1.2 HISTORY
The idea of using computers to search for relevant pieces of information was popularized in the
article As We May Think by Vannevar Bush in 1945. The first automated information retrieval
systems were introduced in the 1950s and 1960s. By 1970 several different techniques had been
shown to perform well on small text corpora such as the Cranfield collection (several thousand
documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use
early in the 1970s.
1.3 PURPOSE
1.3.1 To process large document collections quickly. The amount of online data has
grown at least as quickly as the speed of computers, and we would now like to be
able to search collections that total in the order of billions to trillions of words.
1.3.2 To allow more flexible matching operations.
1.3.3 To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
1.4 BASIC IR SYSTEM ARCHITECTURE
[Figure: Components of an IR system. A user with an information need issues a query to the search engine, which consults an index built from the document collection (subject to additions and deletions) and returns a result.]
The above Figure illustrates the major components in an IR system. Before conducting a search,
a user has an information need, which underlies and drives the search process. We sometimes
refer to this information need as a topic, particularly when it is presented in written form as part
of a test collection for IR evaluation.
The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.
A major task of a search engine is to maintain and manipulate an inverted index for a document
collection. As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.
To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the length
of each document.
In addition, the search engine usually has access to the original content of the documents, in
order to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts queries
from its users, processes these queries, and returns ranked lists of results. To perform relevance
ranking, the search engine computes a score, sometimes called a retrieval status value (RSV), for
each document.
After sorting documents according to their scores, the result list may be subjected to further
processing, such as the removal of duplicate or redundant results.
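To make this index-plus-ranking flow concrete, here is a minimal sketch in Python (not taken from any particular engine) of a toy inverted index and a crude retrieval status value computed as a summed term frequency; the document collection and the scoring rule are illustrative assumptions only.

from collections import defaultdict

def build_index(docs):
    """docs: doc id -> text. Returns term -> {doc id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query, k=10):
    """Score documents by summed term frequency of the query terms (a crude RSV) and rank them."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

docs = {1: "information retrieval systems", 2: "database systems", 3: "retrieval of information"}
index = build_index(docs)
print(search(index, "information retrieval"))   # documents 1 and 3 score 2; document 2 is not returned

Real systems replace the raw term-frequency score with weighted schemes such as TF/IDF, discussed later in this document.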
1.5 DATABASES VS. INFORMATION RETRIEVAL
[Table: comparison of databases and IR systems]
1.6 HOW AN IR SYSTEM WORKS
1.6.2 Acquire documents (or computer programs, or products, or data items, and so on), resulting in a collection.
Figuring out what information the user really needs to solve a problem is essential for successful
retrieval. Matching involves taking a query description and finding relevant documents in the
collection; this is the task of the IR system.
1.7 The Process of IR
The process can be illustrated by means of a black box showing what a typical IR system would look like. The diagram shows three components: input, processor and output.
Starting with the input side, the main problem is to obtain a representation of each document and query suitable for a computer to use. A document representative could, for example, be a list of extracted words considered to be significant. Rather than have the computer process the natural language, an alternative approach is to have an artificial language within which all queries and documents can be formulated.
There is some evidence to show that this can be effective. Of course it presupposes that a user is
willing to be taught to express his information need in the language.
When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of sample retrievals, thereby, it is hoped, improving the subsequent retrieval run. Such a procedure is commonly referred to as feedback.
Secondly, there is the processor, the part of the retrieval system concerned with the retrieval process. The process may involve structuring the information in some appropriate way, such as classifying it.
It will also involve performing the actual retrieval function, that is, executing the search strategy
in response to a query. In the diagram, the documents have been placed in a separate box to
emphasize the fact that they are not just input but can be used during the retrieval process in such
a way that their structure is more correctly seen as part of the retrieval process.
Finally, we come to the output, which is usually a set of citations or document numbers. In an
operational system the story ends here. However, in an experimental system it leaves the
evaluation to be done.
2. APPLICATIONS OF IR
Areas where information retrieval techniques are employed can be divided into three categories:
3. Vertical search - A vertical search engine, as distinct from a general Web search engine,
focuses on a specific segment of online content. The vertical content area may be based
on topicality, media type, or genre of content. Common examples include legal, medical,
patent (intellectual property), travel, and automobile search engines.
2.3 Other retrieval methods
Methods and techniques in which information retrieval is employed include:
3. Compound term processing - Compound term processing is the name used for a category of techniques in information retrieval applications that perform matching on the basis of compound terms. Compound terms are built by combining two (or more) simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term.
A search engine is the popular term for an information retrieval (IR) system designed to help find information stored on a computer system, such as on the World Wide Web (WWW), inside a corporate or proprietary network, or on a personal computer.
The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria.
Regular users of Web search engines casually expect to receive accurate and near-instantaneous
answers to questions and requests merely by entering a short query — a few words — into a text
box and clicking on a search button. Underlying this simple and intuitive interface are clusters of
computers, comprising thousands of machines, working cooperatively to generate a ranked list of
those Web pages that are likely to satisfy the information need embodied in the query. These
machines identify a set of Web pages containing the terms in the query, compute a score for each
page, eliminate duplicate and redundant pages, generate summaries of the remaining pages, and
finally return the summaries and links back to the user for browsing.
In order to achieve the sub-second response times expected from Web search engines, they incorporate layers of caching and replication, taking advantage of commonly occurring queries and exploiting parallel processing, allowing them to scale as the number of Web pages and users increases. In order to produce accurate results, they store a "snapshot" of the Web. This snapshot must be gathered and refreshed constantly by a Web crawler, also running on a cluster of hundreds or thousands of machines, which downloads a fresh copy of each page periodically (perhaps once a week). Pages that contain rapidly changing information of high quality, such as news services, may be refreshed daily or hourly.
Consider a simple example. If you have a computer connected to the Internet nearby, pause for a
minute to launch a browser and try the query “information retrieval” on one of the major
commercial Web search engines. It is likely that the search engine responded in well under a
second. Take some time to review the top ten results. Each result lists the URL for a Web page
and usually provides a title and a short snippet of text extracted from the body of the page.
Overall, the results are drawn from a variety of different Web sites and include sites associated
with leading textbooks, journals, conferences, and researchers. As is common for informational
queries such as this one, the Wikipedia article may be present. Do the top ten results contain
anything inappropriate? Could their order be improved? Have a look through the next ten results
and decide whether any one of them could better replace one of the top ten results.
Now, consider the millions of Web pages that contain the words “information” and “retrieval”.
This set of pages includes many that are relevant to the subject of information retrieval but are
much less general in scope than those that appear in the top ten, such as student Web pages and
individual research papers. In addition, the set includes many pages that just happen to contain
these two words, without having any direct relationship to the subject. From these millions of
possible pages, a search engine’s ranking algorithm selects the top-ranked pages based on a
variety of features, including the content and structure of the pages (e.g., their titles), their
relationship to other pages (e.g., the hyperlinks between them), and the content and structure of
the Web as a whole. For some queries, characteristics of the user such as her geographic location
or past searching behavior may also play a role. Balancing these features against each other in
order to rank pages by their expected relevance to a query is an example of relevance ranking.
3.2 Popular Search Engines
Google » www.google.com
Bing » www.bing.com
Ask » www.ask.com
Web directories are human-compiled indexes of sites, which are then categorized. The fact that your site is reviewed by an editor before being placed in the index means that getting listed in a directory is often quite difficult. On the other hand, having a listing in a directory tends to bring a good amount of well-targeted visitors. Most search engines will rank your site higher if they find it in one of the directories below.
Yahoo » www.yahoo.com
The Open Directory is another human-compiled directory, but one where any Internet user can
become an editor and be responsible for some part of the index. Many other services use Open
Directory listings, including Google, Netscape, Lycos, AOLsearch, AltaVista and HotBot.
» dmoz.com
3.2.3 And the rest...
You can submit your site to these search engines if you want. Most of these companies are
struggling on this new Google-dominated web.
Lycos » www.lycos.com
AltaVista » www.altavista.com
DogPile » www.dogpile.com
3.2.4 The three most widely used web search engines and their approximate share
3.3 Google search
URL: www.google.com (list of domain names)
Commercial: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (124)
Owner: Google
Created by: Sergey Brin and Larry Page
Launched: September 15, 1997
Alexa rank: 1
Revenue: From AdWords
Current status: Active
3.4 Yahoo! Search
URL: search.yahoo.com
Commercial: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Yahoo!
Created by: Yahoo!
Launched: March 1, 1995
Alexa rank: 4
Current status: Active
3.5 Bing (search engine)
URL: www.bing.com
Slogan: Bing & decide
Commercial: Yes
Type of site: Search engine
Registration: Optional
Available language(s): Multilingual (40)
Owner: Microsoft
Created by: Microsoft
Launched: June 1, 2009
Alexa rank: 23
Current status: Active
3.6 How a Search Engine Works
The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file.
3.6.1 Document Processor
o The document processor prepares, processes, and inputs the documents, pages, or sites that users search against. Among its steps, it:
8. Computes weights.
9. Creates and updates the main inverted file against which the search engine searches in order to match queries.
Step 1 to 3
Preprocessing.
While essential and potentially important in affecting the outcome of a search,
these first three steps simply standardize the multiple formats encountered when deriving
documents from various providers or handling various Web sites.
The steps serve to merge all the data into a single, consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format grows in direct proportion to the sophistication of the later document-processing steps.
Step two is important because the pointers stored in the inverted file will enable a
system to retrieve various sized units — either site, page, document, section, paragraph,
or sentence.
Step 4
Identify elements to index.
Identifying potential indexable elements in documents dramatically affects the nature and
quality of the document representation that the engine will search against.
In designing the system, we must define the word "term." Is it the alpha-numeric
characters between blank spaces or punctuation? If so, what about non-compositional
phrases (phrases in which the separate words do not convey the meaning of the phrase,
like "skunk works" or "hot dog"), multi-word proper names, or inter-word symbols such
as hyphens or apostrophes that can denote the difference between "small business men"
versus “small-business men."
Each search engine depends on a set of rules that its document processor must execute to
determine what action is to be taken by the "tokenizer," i.e. the software used to define a
term suitable for indexing.
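As a sketch of one possible tokenization policy (the pattern and the examples are assumptions for illustration, not a description of any specific engine's tokenizer), the following Python snippet keeps hyphens and apostrophes inside a token so that compounds survive as single terms:

import re

def tokenize(text):
    """Lowercase the text and return alpha-numeric tokens; hyphens and apostrophes
    are kept inside a token, so "small-business" is one term rather than two."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

print(tokenize("Small-business men don't equal small business men."))
# ['small-business', 'men', "don't", 'equal', 'small', 'business', 'men']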
Step-5
Deleting stop words.
This step helps save system resources by eliminating from further processing, as well
as potential matching, those terms that have little value in finding useful documents in
response to a customer's query.
This step used to matter much more than it does now, when memory has become so much cheaper and systems so much faster; but since stop words may comprise up to 40 percent of the text words in a document, it still has some significance.
A stop word list typically consists of those word classes known to convey little substantive meaning, such as articles, conjunctions, and prepositions.
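A minimal illustration of stop-word removal; the stop list below is a tiny made-up sample, not a recommended list:

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}   # illustrative sample only

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "process", "of", "information", "retrieval"]))
# ['process', 'information', 'retrieval']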
Step-6
Term Stemming.
Stemming removes word suffixes.
For example, if a user asks for analyze, they may also want documents which contain
analysis, analyzing, analyzer, analyzes, and analyzed.
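A deliberately naive suffix-stripping sketch illustrates the idea; production systems use a proper algorithm such as Porter's stemmer, and the suffix list below is only an assumption for demonstration:

def naive_stem(token):
    """Strip a few common English suffixes; a real engine would use a full stemming algorithm."""
    for suffix in ("izations", "ization", "ings", "ing", "ers", "er", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

for word in ["analyze", "analysis", "analyzing", "analyzer", "analyzes", "analyzed"]:
    print(word, "->", naive_stem(word))   # most variants collapse toward the stem "analyz"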
Step-7
Extract index entries.
Having completed steps 1 through 6, the document processor extracts the remaining
entries from the original document. For example, the following paragraph shows the
full text sent to a search engine for processing:
o Milosevic's comments, carried by the official news agency Tanjug, cast doubt
over the governments at the talks, which the international community has called
to try to prevent an all-out war in the Serbian province. "President Milosevic said
it was well known that Serbia and Yugoslavia were firmly committed to resolving
problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with
the participation of the representatives of all ethnic communities," Tanjug said.
Milosevic was speaking during a meeting with British Foreign Secretary Robin
Cook, who delivered an ultimatum to attend negotiations in a week's time on an
autonomy proposal for Kosovo with ethnic Albanian leaders from the province.
Cook earlier told a conference that Milosevic had agreed to study the proposal.
o Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna
commun call try prevent all-out war Serb province President Milosevic said well
known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia
peace Serbia particip representa ethnic commun Tanjug said Milosevic speak
meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week
time autonomy propos Kosovo ethnic Alban lead province Cook earl told
conference Milosevic agree study propos.
The output of step 7 is then inserted and stored in an inverted file that lists the index
entries and an indication of their position and frequency of occurrence.
The specific nature of the index entries, however, will vary based on the decision in
Step 4 concerning what constitutes an “indexable term.”
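The inverted file described above can be sketched as a term-to-postings mapping that records positions (and therefore frequency). The structure below is a simplified illustration, not the on-disk layout of any particular engine:

from collections import defaultdict

def build_positional_index(docs):
    """docs: doc id -> text. Returns term -> doc id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

index = build_positional_index({1: "to be or not to be"})
print(dict(index["to"]))   # {1: [0, 4]} -> occurs twice in document 1, at positions 0 and 4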
Step-8
Term weight assignment.
Weights are assigned to terms in the index file. The simplest of search engines just assigns a binary weight: 1 for presence and 0 for absence.
A simple example would be the word "the." This word appears in too many
documents to help distinguish one from another. A less obvious example would be
the word "antibiotic." In a sports database when we compare each document to the
database as a whole, the term "antibiotic" would probably be a good discriminator
among documents, and therefore would be assigned a high weight. Conversely, in a
database devoted to health or medicine, "antibiotic" would probably be a poor
discriminator, since it occurs very often. The TF/IDF weighting scheme assigns
higher weights to those terms that really distinguish one document from the others.
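A small sketch of TF/IDF weighting along these lines; the toy documents are invented, and the raw tf · log(N/df) form shown is only one of several common variants:

import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: doc id -> list of tokens. Returns doc id -> {term: tf * idf}."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))   # document frequency
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # a term occurring in every document gets idf = log(1) = 0, i.e. no discriminating power
        weights[doc_id] = {term: tf[term] * math.log(n / df[term]) for term in tf}
    return weights

docs = {1: ["antibiotic", "dosage"], 2: ["football", "score"], 3: ["football", "injury", "antibiotic"]}
print(tf_idf_weights(docs)[3])   # rarer terms such as "injury" receive the highest weight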
Step-9
Create index.
The index or inverted file is the internal data structure that stores the index
information and that will be searched for each query. Inverted files range from a
simple listing of every alpha-numeric sequence in a set of documents/pages being
indexed along with the overall identifying numbers of the documents in which the
sequence occurs, to a more linguistically complex list of entries, the tf/idf weights,
and pointers to where inside each document the term occurs.
The more complete the information in the index, the better the search result.
3.6.2 Query Processor
o Query processing has seven possible steps, though a system can cut these steps
short and proceed to match the query to the inverted file at any of a number of
places during the processing. Document processing shares many steps with query
processing.
The steps in query processing are as follows (with the option to stop processing and start
matching indicated as "Matcher"):
1. Tokenize query terms.
2. Recognize query terms vs. special operators (parsing).
————————> Matcher
3. Delete stop words.
4. Stem words.
5. Create the query representation.
————————> Matcher
6. Query expansion.
7. Compute weights.
————————> Matcher
Step 1
Tokenizing.
As soon as a user inputs a query, the search engine must tokenize the query stream, i.e.,
break it down into understandable segments. Usually a token is defined as an alpha-
numeric string that occurs between white space and/or punctuation.
Step 2
Parsing
Since users may employ special operators in their query, including Boolean, adjacency,
or proximity operators, the system needs to parse the query first into query terms and
operators. These operators may occur in the form of reserved punctuation (e.g., quotation
marks) or reserved terms in specialized format (e.g., AND, OR).
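A rough sketch of such parsing, separating quoted phrases and reserved operators from plain terms (the tokenization rules here are assumptions for illustration):

import re

def parse_query(query):
    """Split a query into plain terms, quoted phrases, and reserved operators (AND, OR, NOT)."""
    tokens = re.findall(r'"[^"]+"|\S+', query)
    terms, operators = [], []
    for token in tokens:
        if token in ("AND", "OR", "NOT"):
            operators.append(token)
        else:
            terms.append(token.strip('"').lower())
    return terms, operators

print(parse_query('"information retrieval" AND evaluation NOT spam'))
# (['information retrieval', 'evaluation', 'spam'], ['AND', 'NOT'])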
Step 3 & 4
Stop list and stemming
Some search engines will go further and stop-list and stem the query, similar to the
processes described above in the Document Processor section. The stop list might also
contain words from commonly occurring querying phrases, such as, "I'd like information
about." However, since most publicly available search engines encourage very short
queries, as evidenced in the size of query window provided, the engines may drop these
two steps.
Step 5
Creating the query.
How each particular search engine creates a query representation depends on how the
system does its matching. If a statistically based matcher is used, then the query must
match the statistical representations of the documents in the system. Good statistical
queries should contain many synonyms and other terms in order to create a full
representation. If a Boolean matcher is utilized, then the system must create logical sets
of the terms connected by AND, OR, or NOT.
An NLP system will recognize single terms, phrases, and Named Entities. If it uses any
Boolean logic, it will also recognize the logical operators from Step 2 and create a
representation containing logical sets of the terms to be AND'd, OR'd, or NOT'd.
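With the postings stored as sets, Boolean matching reduces to set operations. The postings below are made up for illustration:

# postings lists from the inverted file: term -> set of document ids (invented data)
postings = {
    "kosovo": {1, 2, 5},
    "serbia": {1, 3, 5},
    "football": {4},
}

print(postings["kosovo"] & postings["serbia"])     # kosovo AND serbia  -> {1, 5}
print(postings["kosovo"] | postings["football"])   # kosovo OR football -> {1, 2, 4, 5}
print(postings["serbia"] - postings["kosovo"])     # serbia NOT kosovo  -> {3}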
At this point, a search engine may take the query representation and perform the search
against the inverted file. More advanced search engines may take two further steps.
Step 6
Query expansion
Since users of search engines usually include only a single statement of their information
needs in a query, it becomes highly probable that the information they need may be
expressed using synonyms, rather than the exact query terms, in the documents which the
search engine searches against. Therefore, more sophisticated systems may expand the
query into all possible synonymous terms and perhaps even broader and narrower terms.
This process approaches what search intermediaries did for end users in the earlier days
of commercial search systems. Back then, intermediaries might have used the same
controlled vocabulary or thesaurus used by the indexers who assigned subject descriptors
to documents. Today, resources such as WordNet are generally available, or specialized
expansion facilities may take the initial query and enlarge it by adding associated
vocabulary.
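A toy sketch of synonym-based expansion; the synonym table is a hand-made assumption standing in for a resource such as WordNet or a domain thesaurus:

# hypothetical, hand-built synonym table used only for this example
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "doctor": ["physician"],
}

def expand_query(terms):
    """Return the original terms plus any known synonyms, preserving order."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["car", "insurance"]))   # ['car', 'automobile', 'vehicle', 'insurance']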
Step 7
Query term weighting (assuming more than one query term)
The final step in query processing involves computing weights for the terms in the query.
Sometimes the user controls this step by indicating either how much to weight each term
or simply which term or concept in the query matters most and must appear in each
retrieved document to ensure relevance.
Leaving the weighting up to the user is not common, because research has shown that
users are not particularly good at determining the relative importance of terms in their
queries. They can't make this determination for several reasons. First, they don't know
what else exists in the database, and document terms are weighted by being compared to
the database as a whole. Second, most users seek information about an unfamiliar subject,
so they may not know the correct terminology.
Few search engines implement system-based query weighting, but some do an implicit
weighting by treating the first term(s) in a query as having higher significance. The
engines use this information to provide a list of documents/pages to the user.
After this final step, the expanded, weighted query is searched against the inverted file of
documents.
4.1 Performance measures
Many different measures for evaluating the performance of information retrieval systems have
been proposed. The measures require a collection of documents and a query. All common
measures described here assume a ground truth notion of relevancy: every document is known to
be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and
there may be different shades of relevancy.
4.1.1 Precision
Precision is the fraction of the documents retrieved that are relevant to the user's information need:
precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|
In binary classification, precision is analogous to positive predictive value. Precision takes all
retrieved documents into account. It can also be evaluated at a given cut-off rank, considering
only the topmost results returned by the system. This measure is called precision at n or P@n.
Note that the meaning and usage of "precision" in the field of Information Retrieval differs from
the definition of accuracy and precision within other branches of science and technology.
4.1.2 Recall
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved:
recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
In binary classification, recall is called sensitivity. So it can be looked at as the probability that a
relevant document is retrieved by the query.
It is trivial to achieve a recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision.
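The two measures can be computed directly from the sets of retrieved and relevant documents, as in this small sketch (the document identifiers are made up):

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                        # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7]))   # (0.5, 0.4)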
4.1.3 Fall-Out
The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:
fall-out = |{non-relevant documents} ∩ {retrieved documents}| / |{non-relevant documents}|
4.1.4 F-measure
The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is:
F = 2 · P · R / (P + R)
where P is precision and R is recall. This is also known as the F1 measure, because recall and precision are evenly weighted.
Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall. In general, Fβ = (1 + β²) · P · R / (β² · P + R).
The F-measure was derived by van Rijsbergen (1979) so that Fβ "measures the effectiveness of
retrieval with respect to a user who attaches β times as much importance to recall as precision".
It is based on van Rijsbergen's effectiveness measure E = 1 − (1 / (α / P + (1 − α) / R)). Their relationship is Fβ = 1 − E, where α = 1 / (β² + 1).
Precision and recall are single-value metrics based on the whole list of documents returned by
the system. For systems that return a ranked sequence of documents, it is desirable to also
consider the order in which the returned documents are presented. Average precision emphasizes ranking relevant documents higher. It is the average of the precisions computed at the position of each relevant document in the ranked sequence:
AP = ( Σ r=1..N P(r) · rel(r) ) / (number of relevant documents)
where r is the rank, N the number of documents retrieved, rel(r) a binary function equal to 1 if the document at rank r is relevant and 0 otherwise, and P(r) the precision at cut-off rank r:
P(r) = (number of relevant documents among the top r results) / r
This metric is also sometimes referred to geometrically as the area under the Precision-Recall
curve.
Note that the denominator (number of relevant documents) is the number of relevant documents
in the entire collection, so that the metric reflects performance over all relevant documents,
regardless of a retrieval cutoff.
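A short sketch of the computation over a ranked result list (the ranked list and the relevance judgments are illustrative assumptions):

def average_precision(ranked, relevant):
    """ranked: documents in result order; relevant: all relevant documents in the collection."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for r, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / r)    # P(r) evaluated at each relevant document
    return sum(precisions) / len(relevant) if relevant else 0.0

print(average_precision(ranked=[3, 1, 7, 5], relevant=[1, 5, 9]))   # (1/2 + 2/4) / 3 ≈ 0.33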
Discounted cumulative gain (DCG) uses a graded relevance scale of documents from the result set to evaluate the usefulness, or gain, of a document based on its position in the result list. The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically in proportion to the position of the result:
DCGp = rel1 + Σ i=2..p ( reli / log2 i )
Dividing by the ideal DCG for the query (IDCGp, the DCG of the best possible ordering) gives the normalized value nDCGp = DCGp / IDCGp.
The nDCG values for all queries can be averaged to obtain a measure of the average performance
of a ranking algorithm. Note that in a perfect ranking algorithm, the DCGp will be the same as
the IDCGp producing an nDCG of 1.0. All nDCG calculations are then relative values on the
interval 0.0 to 1.0 and so are cross-query comparable.
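A minimal sketch of DCG and nDCG over a list of graded relevance values, using the rel1 + Σ reli/log2(i) form given above; the relevance grades are invented for illustration:

import math

def dcg(relevances):
    """DCGp = rel1 + sum over i = 2..p of reli / log2(i)."""
    return sum(rel if i == 1 else rel / math.log2(i) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))           # IDCGp: the best possible ordering
    return dcg(relevances) / ideal if ideal else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))   # about 0.93; it would be 1.0 only for an ideally ordered list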
Mean reciprocal rank is a statistic for evaluating any process that produces a list of possible
responses to a query, ordered by probability of correctness. The reciprocal rank of a query
response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q [1]:
MRR = (1 / |Q|) · Σ i=1..|Q| ( 1 / ranki )
For example, suppose we have three sample queries for a system that tries to translate English words to their plurals. In each case the system makes three guesses, with the first one being the one it thinks is most likely correct, and the correct answer turns out to appear at rank 3 for the first query, rank 2 for the second, and rank 1 for the third. Given those three samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 = 11/18, or about 0.61.
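The arithmetic above can be reproduced with a few lines of Python; the guess lists are hypothetical, constructed only so that the correct answer appears at ranks 3, 2, and 1:

def mean_reciprocal_rank(guess_lists, correct_answers):
    """Reciprocal rank = 1 / position of the first correct answer (0 if it never appears)."""
    reciprocal_ranks = []
    for guesses, answer in zip(guess_lists, correct_answers):
        rank = guesses.index(answer) + 1 if answer in guesses else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

guesses = [["catten", "cati", "cats"], ["torii", "tori", "toruses"], ["viruses", "virii", "viri"]]
correct = ["cats", "tori", "viruses"]                        # found at ranks 3, 2 and 1
print(mean_reciprocal_rank(guesses, correct))                # (1/3 + 1/2 + 1) / 3 = 11/18 ≈ 0.61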
Two cases require special handling:
1. none of the proposed results is correct (the reciprocal rank is then taken as 0); or
2. there are multiple correct answers in the list, in which case consider using mean average precision (MAP).
In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of statistical dependence between two variables. The Spearman correlation coefficient is often thought of as the Pearson correlation coefficient between the ranked variables. In practice, however, a simpler procedure is normally used to calculate ρ. The n raw scores Xi, Yi are converted to ranks xi, yi, and the differences di = xi − yi between the ranks of each observation on the two variables are calculated. The coefficient is then given by
ρ = 1 − (6 Σ di²) / (n (n² − 1))
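A small sketch of this rank-difference procedure (assuming no ties, which would require average ranks in practice):

def spearman_rho(x, y):
    """Convert raw scores to ranks (1 = smallest) and apply rho = 1 - 6*sum(d^2) / (n*(n^2-1))."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))

print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))   # 1.0: perfectly monotonic, though not linear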
A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear; in contrast, such data do not in general give a perfect Pearson correlation.
When the data are roughly elliptically distributed and there are no prominent outliers, the Spearman correlation and the Pearson correlation give similar values.
The Spearman correlation is less sensitive than the Pearson correlation to strong outliers in the tails of both samples.
A positive Spearman correlation coefficient corresponds to an increasing monotonic trend between X and Y.
6. PROBLEMS IN IR
i. The problem of assembling a package of information that enables the user (or group) to come closer to a solution of the problem.
ii. The relationship of information use to the problem-solving/decision-making process.
iii. How do people make relevance judgments?
iv. How do people organize information in their minds, acquire it, and process it for output?