
Information Retrieval as Statistical Translation

Adam Berger and John Lafferty


School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
{aberger,lafferty}@cs.cmu.edu

Abstract

We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or "translate" a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well-motivated model of the document-to-query translation process, and describe an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents. As we show, one can view this approach as a generalization and justification of the "language modeling" strategy recently proposed by Ponte and Croft. In a series of experiments on TREC data, a simple translation-based retrieval system performs well in comparison to conventional retrieval techniques. This prototype system only begins to tap the full potential of translation-based retrieval.

1 Introduction

When a person formulates a query to a retrieval system, what he is really doing is distilling an information need into a succinct query. In this work, we take the view that this distillation is a form of translation from one language to another: from documents, which contain the normal superfluence of textual fat and connective tissue such as prepositions, commas and so forth, to queries, comprised of just the skeletal index terms that characterize the document.

We take this view not because it is an accurate model of how a user decides what to ask of an information retrieval system, but because it turns out to be a useful expedient. By thinking about retrieval in this way, we can formulate tractable mathematical models of the query generation process, models that naturally account for many of the features that are critical to modern high-performance retrieval systems, such as term weighting and query expansion. Moreover, statistical translation models for information retrieval can be implemented in a quite straightforward way, and our experiments with these models demonstrate very promising empirical behavior.

We begin by detailing a conceptual model of information retrieval. In formulating a query to a retrieval system, we imagine that a user begins with an information need. This information need is then represented as a fragment of an "ideal document": a portion of the type of document that the user hopes to receive from the system. The user then translates or "distills" this ideal document fragment into a succinct query, selecting key terms and replacing some terms with related terms: replacing pontiff with pope, for instance.

Summarizing the model of query generation:

1. The user has an information need ℑ.
2. From this need, he generates an ideal document fragment d_ℑ.
3. He selects a set of key terms from d_ℑ, and generates a query q from this set.

One can view this imaginary process of query formulation as a corruption of the ideal document. In this setting, the task of a retrieval system is to find those documents most similar to d_ℑ. In other words, retrieval is the task of finding, among the documents comprising the collection, likely preimages of the user's query. Figure 1 depicts this model of retrieval in a block diagram.

We have drawn Figure 1 in a way that suggests an information-theoretic perspective. One can view the information need ℑ as a signal that gets corrupted as the user U distills it into a query q. That is, the query-formulation process represents a noisy channel, corrupting the information need just as a telephone cable corrupts the data transmitted by a modem. Given q and a model of the channel (how an information need gets corrupted into a query), the retrieval system's task is to identify those documents d that best satisfy the information need of the user.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR '99 8/99 Berkeley, CA USA
Copyright 1999 ACM 1-58113-096-1/99/0007 ... $5.00
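The ranking that this noisy-channel view implies can be sketched in a few lines of code. The sketch below is illustrative only: it scores a toy two-document collection by log p(q | d) under an additively smoothed unigram model, with a uniform document prior p(d) as assumed later in the paper. The document data, the function name, and the smoothing constant eps are hypothetical, and the unigram model is a crude stand-in for the translation models developed in Section 3.

```python
import math
from collections import Counter

def rank(query, docs, eps=0.5):
    """Rank documents by log p(q | d); the prior p(d) is uniform and drops out."""
    vocab = {w for toks in docs.values() for w in toks}
    scored = []
    for doc_id, toks in docs.items():
        counts, n = Counter(toks), len(toks)
        # Each query term is generated independently from the document's
        # unigram model; additive smoothing keeps unseen terms from
        # zeroing out the product.
        logp = sum(math.log((counts[t] + eps) / (n + eps * len(vocab)))
                   for t in query)
        scored.append((doc_id, logp))
    return sorted(scored, key=lambda s: s[1], reverse=True)

docs = {
    "d1": "the pontiff visited cuba and met castro".split(),
    "d2": "the everest expedition reached the summit".split(),
}
print(rank("pope cuba".split(), docs)[0][0])  # prints "d1", on the strength of "cuba" alone
```

Note that in this toy example the query term pope matches nothing in either document and contributes only smoothing mass; bridging the gap between pope and pontiff is exactly what the translation probabilities t(q | w) of Section 3 are meant to supply.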
Figure 1. Model of query generation and retrieval. [Block diagram: an information need ℑ passes through a document generation model to yield an ideal document fragment d_ℑ; a document-query translation model distills this fragment into the user's query q; the search engine then maps q back to a set {d} of retrieved documents.]

More precisely, the retrieval system's task is to find the a posteriori most likely documents given the query; that is, those d for which p(d | q, U) is highest. By Bayes' law,

    p(d | q, U) = p(q | d, U) p(d | U) / p(q | U).    (1)

Since the denominator p(q | U) is fixed for a given query and user, we can ignore it for the purpose of ranking documents, and define the relevance r_q(d) of a document to a query as

    r_q(d) = p(q | d, U) p(d | U),    (2)

where the first factor is query-dependent and the second is query-independent. Equation (2) highlights the decomposition of relevance into two terms: first, a query-dependent term measuring the proximity of d to q, and second, a query-independent or "prior" term, measuring the quality of the document according to the user's general preferences and information needs. Though in this work we take the prior term to be uniform over all documents, we imagine that in real-world retrieval systems the prior will be crucial for improved performance, and for adapting to the user's needs and interests. At the very least, the document prior can be used to discount short documents, or perhaps documents in a foreign language.

High-performance document retrieval systems must be sophisticated enough to handle polysemy and synonymy: to know, for instance, that pontiff and pope are related terms. The field of statistical translation concerns itself with how to mine large text databases to automatically discover such semantic relations. Brown et al. [3, 4] showed, for instance, how a system can "learn" to associate French terms with their English translations, given only a collection of bilingual French/English sentences. We shall demonstrate how, in a similar fashion, an IR system can, from a collection of documents, automatically learn which terms are related, and exploit these relations to better rank documents by relevance to a query. (We use the convention that boldface roman letters refer to collections of words such as documents or queries, while italic roman letters refer to individual terms. Thus p(q | d) refers to the probability of generating a single query word from an entire document d.)

The rest of the paper proceeds as follows. Section 2 outlines some antecedents to this work, and describes in greater detail the relationship between information retrieval and statistical translation. Section 3 introduces two specific models of translation from documents to queries. Section 4 explains how to estimate the parameters of such models automatically from a collection of documents. Section 5 presents results of several experiments on widely-used benchmarks in information retrieval. These experimental results demonstrate the competitiveness of translation-based retrieval, compared to standard tf.idf vector space techniques. We conclude in Section 6 with some further discussion of this approach and directions for future research.

2 Background and Related Work

There is a large literature on probabilistic approaches to information retrieval, and we will not attempt to survey it here. Instead, we focus on the language modeling approach introduced recently by Ponte and Croft [10, 9], which is closest in spirit to the present work. To each document in the collection, this approach associates a probability distribution p(· | d) over terms; using the terminology developed in the field of speech recognition, Ponte and Croft call this distribution a "language model." The probability of a term t given a document is related to the frequency of t in the document. The probability of a query q = q1, q2, ..., qm is just the product of the individual term probabilities, p(q | d) = Π_i p(qi | d). The relevance of a document d to a query q is presumed to be monotonically related to p(q | d).

The language modeling approach represents a novel and theoretically motivated approach to retrieval, which Ponte and Croft demonstrate to be effective. However, this framework does not provide a way to model different forms or styles of queries, nor does it directly address the important issues of synonymy and polysemy: multiple terms sharing similar meanings, and the same term having multiple meanings. In formulating a statistical translation-based approach to IR, we aim to develop a general statistical framework for handling these issues.

The inspiration and foundation for the present work comes from statistical machine translation, an area of research that until now has existed independently of information retrieval. It would take us too far afield to present an overview of the motivation and methods of statistical translation. Instead, we refer the reader to two articles [3, 4] outlining the IBM work that first developed the idea of statistical machine translation. We will, however, briefly describe the general form of the IBM statistical translation models, illustrated for the case of translating from French into English. These models are comprised of translation probabilities t(f | e) for each English word e translating to each French word f, fertility probabilities to model the number of French
words that an English word can generate, and distortion probabilities for introducing a dependence on relative word order in the two languages. These parameters are used to form a model p(f, a | e) for generating a French sentence f = {f1, f2, ..., fm} from an English sentence e = {e1, e2, ..., en} in terms of a hidden alignment a between the words in the two sentences.

Brown et al. propose a series of increasingly complex and powerful statistical models of translation, the parameters of which are estimated by a bootstrapping procedure. We refer to [4] for a detailed presentation of these models, including the mathematical details of training them using the EM algorithm. In the next section we discuss analogous models for query generation.

3 Two Models of Document-Query Translation

Suppose that an information analyst is given a news article and asked to quickly generate a list of a few words to serve as a rough summary of the article's topic. As the analyst rapidly skims the story, he encounters a collection of words and phrases. Many of these are rejected as irrelevant, but his eyes rest on certain key terms as he decides how to render them in the summary. For example, when presented with an article about Pope John Paul II's visit to Cuba in 1998, the analyst decides that the words pontiff and vatican can simply be represented by the word pope, and that cuba, castro and island can be collectively referred to as cuba.

In this section we present two statistical models of this query formation process, making specific independence assumptions to derive computationally and statistically efficient algorithms. While our simple query generation models are mathematically similar to those used for statistical translation of natural language, the duties of the models are qualitatively different in the two settings. Document-query translation requires a distillation of the document, while translation of natural language will tolerate little being thrown away.

3.1. Model 1: A mixture model

We first consider the simplest of the IBM translation models for the document-to-query mapping. This model produces a query according to the following generative procedure. First we choose a length m for the query, according to the distribution φ(m | d). Then, for each position j in [1...m] in the query, we choose a position i in the document from which to generate qj, and generate the query word by "translating" di according to the translation model t(· | di). We include in position zero of the document an artificial "null word," written <null>. The purpose of the null word is to generate spurious or content-free terms in the query (consider, for example, a query q = Find all of the documents...).

Let's now denote the length of the document by |d| = n. The probability p(q | d) is then the sum over all possible alignments, given by

    p(q | d) = φ(m | d) / (n+1)^m  Σ_{a1=0}^{n} ... Σ_{am=0}^{n}  Π_{j=1}^{m} t(qj | d_{aj}).    (3)

Just as the most primitive version of IBM's translation model takes no account of the subtler aspects of language translation, including the way word order tends to differ across languages, so our basic IR translation approach is but an impressionistic model of the relation between queries and documents relevant to them. Since IBM called their most basic scheme Model 1, we shall do the same for this rudimentary retrieval model.

A little algebraic manipulation shows that the probability of generating query q according to Model 1 can be rewritten as

    p(q | d) = φ(m | d)  Π_{j=1}^{m} [ (n/(n+1)) p(qj | d) + (1/(n+1)) t(qj | <null>) ],

where

    p(qj | d) = Σ_w t(qj | w) l(w | d),

with the document language model l(w | d) given by relative counts. Thus, we see that the query terms are generated using a mixture model: the document language model provides the mixing weights for the translation model, which has parameters t(q | w). An alternative view (and terminology) for this model is to describe it as a Hidden Markov Model, where the states correspond to the words in the vocabulary, and the transition probabilities between states are proportional to the word frequencies.

The simplest version of Model 1, which we will distinguish as Model 0, is the one for which each word w can be translated only as itself; that is, the translation probabilities are "diagonal":

    t(q | w) = 1 if q = w, and 0 otherwise.

In this case, the query generation model is given by

    p(q | d) = (n/(n+1)) l(q | d) + (1/(n+1)) t(q | <null>),

a linear interpolation of the document language model and the background model associated with the null word.

3.2. Model 1': A binomial model

Our imaginary information analyst, when asked to generate a brief list of descriptive terms for a document, is unlikely to list multiple occurrences of the same word. To account for this assumption in terms of a statistical model, we assume that a list of words is generated by making several independent translations of the document d into a single query term q, in the following manner. First, the analyst chooses a word w at random
from the document. He chooses this word according to the document language model l(w | d). Next, he translates w into the word or phrase q according to the translation model t(q | w). Thus, the probability of choosing q as a representative of the document d is

    p(q | d) = Σ_{w in d} l(w | d) t(q | w).

We assume that the analyst repeats this process n times, where n is chosen according to the sample size model φ(n | d), and that the resulting list of words is filtered to remove duplicates before it is presented as the summary, or query, q = q1, q2, ..., qm.

In order to calculate the probability that a particular query q is generated in this way, we need to sum over all sample sizes n, and consider that each of the terms qi may have been generated multiple times. Thus, the process described above assigns to q a total probability

    p(q | d) = Σ_n φ(n | d)  Σ_{n1>0} ... Σ_{nm>0}  (n choose n1 ... nm)  Π_{i=1}^{m} p(qi | d)^{ni}.

In spite of its intimidating appearance, this expression can be calculated efficiently using simple combinatorial identities and dynamic programming techniques. Instead of pursuing this path, we will assume that the number of samples n is chosen according to a Poisson distribution with mean λ(d):

    φ(n | d) = e^{-λ(d)} λ(d)^n / n!.

Under this assumption, the above sum takes on a much friendlier appearance:

    p(q | d) = e^{-λ(d)}  Π_{i=1}^{m} ( e^{λ(d) p(qi | d)} - 1 ).    (4)

This formula shows that the probability of the query is given as a product of terms. Yet the query term translations are not independent, due to the process of filtering out the generated list to remove duplicates. We refer to the model expressed in equation (4) as Model 1'.

Model 1' has an interpretation in terms of binomial random variables. Suppose that a word w does not belong to the query with probability α_w = e^{-λ(d) p(w | d)}. Then Model 1' amounts to flipping independent α_w-biased coins to determine which set of words comprises the query. That is, we can reexpress the probability p(q | d) of equation (4) as

    p(q | d) = Π_{w in q} (1 - α_w)  Π_{w not in q} α_w.

Our use of this model was inspired by another IBM statistical translation model, one that was designed for modeling a bilingual dictionary [5].

Model 1' also has an interpretation in the degenerate case of diagonal translation probabilities. To see this, let us make a further simplification by fixing the average number of samples to be a constant λ independent of the document d, and suppose that the expected number of times a query word is drawn is less than one, so that max_i λ l(qi | d) < 1. Then to first order, the probability assigned to the query according to Model 1' is a constant times the product of the language model probabilities:

    p(q = q1, ..., qm | d) ≈ e^{-λ} λ^m  Π_{i=1}^{m} l(qi | d).    (5)

Since the mean λ is fixed for all documents, the document that maximizes the right-hand side of the above expression is that which maximizes Π_{i=1}^{m} l(qi | d). This is precisely the value assigned to the query in what Ponte and Croft (1998) call the "language modeling approach."

4 Building a Translation-Based IR System

We now describe an implementation of the models described in the previous section, and their application to TREC data. The key ingredient in these models is the collection of translation probabilities t(q | w). But how are we to obtain these probabilities? The statistical translation strategy is to learn these probabilities from an aligned bilingual corpus of translated sentences, using the likelihood criterion. Ideally, we should have a collection of query/document pairs to learn from, obtained by human relevance judgments. But we know of no publicly-available collection of data sufficiently large to estimate parameters for general queries.

4.1. Synthetic training data

Lacking a large corpus of queries and their associated relevant documents, we decided to tease out the semantic relationships among words by generating synthetic queries for a large collection of documents and estimating the translation probabilities from this synthetic data. To explain the rationale for this scheme, we return to our fictitious information analyst, and recall that when presented with a document d, he will tend to select terms that are suggestive of the content of the document. Suppose now that he himself selects an arbitrary document d from a database D, and asks us to guess, based only upon his summary q, which document he chose. The amount by which we are able to do better, on average, than randomly guessing a document from D is the mutual information I(D; Q) = H(D) - H(D | Q) between the random variables representing his choice of document D and query Q. Here H(D) is the entropy in the analyst's choice of document, and H(D | Q) is the conditional entropy of the document given the query. If he is playing this game cooperatively, he will generate queries for which this mutual information is large.

With this game in mind, we took a collection of TREC documents D, and for each document d in the
collection, computed the mutual information statistic I(w, d) for each of its words according to

    I(w, d) = p(w, d) log [ p(w | d) / p(w | D) ].

Here p(w | d) is the probability of the word in the document, and p(w | D) is the probability of the word in the collection at large. By scaling these I(w, d) values appropriately, we constructed an artificial cumulative distribution function over words in each document. We then drew m ~ φ(· | d) random samples from the document according to this distribution, forming a query q = q1, ..., qm, and several such queries were generated for each document.

4.2. EM training

The resulting corpus {(d, q)} of documents and synthetic queries was used to fit the translation probabilities of Models 1 and 1' with the EM algorithm [7], run for only three iterations as a means to avoid overfitting. Space limitations prevent us from explaining the details of the training process, but enough detail to enable implementation can be found in the papers [4, 5], which describe very similar models. A sample of the resulting translation probabilities, when trained on the Associated Press (AP) portion of the TREC volume 3 corpus, is shown in Figure 2. In this figure, a document word is shown together with the ten most probable query words that it will translate to according to the model. The probabilities in these tables are among the 47,065,200 translation probabilities that were trained for our 132,625-word vocabulary. They were estimated from a corpus obtained by generating five synthetic mutual information queries for each of the 78,325 documents in the collection.

  w = solzhenitsyn        w = carcinogen          w = zubin
  q            t(q|w)     q           t(q|w)      q            t(q|w)
  solzhenitsyn 0.319      carcinogen  0.667       zubin mehta  0.248
  citizenship  0.049      cancer      0.032       zubin        0.139
  exile        0.044      scientific  0.024       mehta        0.134
  archipelago  0.030      science     0.014       philharmonic 0.103
  alexander    0.025      environment 0.013       orchestra    0.046
  soviet       0.023      chemical    0.012       music        0.036
  union        0.018      exposure    0.012       bernstein    0.029
  komsomolskaya 0.017     pesticide   0.010       york         0.026
  treason      0.015      agent       0.009       end          0.018
  vishnevskaya 0.015      protect     0.008       sir          0.016

  w = pontiff             w = everest             w = wildlife
  q            t(q|w)     q           t(q|w)      q            t(q|w)
  pontiff      0.502      everest     0.439       wildlife     0.705
  pope         0.169      climb       0.057       fish         0.038
  paul         0.065      climber     0.045       acre         0.012
  john         0.035      whittaker   0.039       species      0.010
  vatican      0.033      expedition  0.036       forest       0.010
  ii           0.028      float       0.024       environment  0.009
  visit        0.017      mountain    0.024       habitat      0.008
  papal        0.010      summit      0.021       endangered   0.007
  church       0.005      highest     0.018       protected    0.007
  flight       0.004      reach       0.015       bird         0.007

Figure 2. Sample translation probabilities after EM training on synthetic data.

For statistical models of this form, smoothing or interpolating the parameters away from their maximum likelihood estimates is important. We used a linear interpolation of the background unigram model and the EM-trained translation model:

    p~(q | d) = α p(q | D) + (1 - α) p(q | d)
              = α p(q | D) + (1 - α) Σ_{w in d} l(w | d) t(q | w).

The weight was empirically set to α = 0.05 on heldout data. The models for the baseline language modeling approach, or Model 0, were also smoothed using linear interpolation:

    l~(w | d) = α p(w | D) + (1 - α) l(w | d).

This interpolation weight was simply fixed at α = 0.1. The Poisson parameter for the sample size distribution was fixed at λ = 15, independent of the document. No adjustment of any parameters, other than those determined by unsupervised EM training of the translation probabilities, was carried out on the TREC volume 3 data that we ran our evaluation on.

5 Experimental Results on TREC Data

In this section we summarize the quantitative results of using the models described in the previous two sections to rank documents for a set of queries. Though
AP, concept fields SJMN, concept fields

Model 1 Model 1
0.7 Model 1’ 0.7 Model 1’
Model 0 Model 0
TFIDF TFIDF

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

AP, concept fields AP, title fields

Model 1’, 3 EM iterations Model 1


0.7 Model 1’, 2 EM iterations 0.7 Model 1’
TFIDF Model 0
TFIDF

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

Spoken Document Retrieval Track Language Model vs. Model 0

Model 1 Model 0
0.7 Model 1’ 0.7 Language model
Model 0
TFIDF

0.6 0.6

0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

Figure 3. Precision-recall curves on TREC data. The two plots in the top row compare the performance of
Models 1 and 10 to the baseline t df and Model 0 performance on AP data (left) and SJMN data (right) when
ranking documents for queries formulated from the concept elds for topics 51{100. The middle row shows the
discrepancy between two and three EM iterations of training for Model 10 (left) and the performance of models
on the short (average 2.8 words/query) queries obtained from the title eld of topics 51{100 (right). The bottom
row compares Models 1 and 10 to Model 0 andQ t df on the SDR data (left) and the same language model scored
according to Model 0 and using the product m i=1 l (qi j d), demonstrating that the approximation in equation (5) is
very good (right).

227
not a real-time computation, calculating the relevance t df Model 1 %
score q(d) for each query can be done quickly enough Relevant: 5845 5845 |
to obviate the need for a \fast match" to throw out Rel.ret.: 5845 5845 |
Precision:
documents that are clearly irrelevant. at 0.00 0.6257 0.7125 +13.9
Our experiments fall into three categories. First, at 0.10 0.5231 0.5916 +13.1
we report on a series of experiments carried out on at 0.20 0.4569 0.5217 +14.2
the AP portion of the TREC data from volume 3 for at 0.30 0.3890 0.4554 +17.1
at 0.40 0.3425 0.4119 +20.3
both the concept and title elds in topics 51{100. The at 0.50 0.3035 0.3636 +19.8
concept eld queries comprise a set of roughly 20 key- at 0.60 0.2549 0.3148 +23.5
words, while the title elds are much more succinct| at 0.70 0.2117 0.2698 +27.4
typically not more than four words. Next, we tabulate at 0.80 0.1698 0.2221 +30.8
at 0.90 0.1123 0.1580 +40.7
a corresponding series of experiments carried out on at 1.00 0.0271 0.0462 +70.5
the 90,250 document San Jose Mercury News (SJMN) Avg.: 0.2993 0.3575 +19.4
portion of the data, also evaluated on topics 51{100. Precision at:
Finally, we present results on the Spoken Document 5 docs: 0.4809 0.5574 +15.9
Retrieval (SDR) track of the 1998 TREC evaluation, 10 docs: 0.4702 0.5170 +10.0
15 docs: 0.4326 0.5135 +18.7
a small collection of 2,866 broadcast news transcripts. 20 docs: 0.4213 0.4851 +15.1
All of the data is preprocessed by converting to upper 30 docs: 0.3894 0.4539 +16.6
case, stemming, and ltering with a list of 571 stop- 100 docs: 0.2960 0.3419 +15.5
words from the SMART system. 200 docs: 0.2350 0.2653 +12.9
500 docs: 0.1466 0.1610 +9.8
Precision-recall curves for the AP and SJMN data, 1000 docs: 0.0899 0.0980 +9.0
generated from the output of the TREC evaluation soft- R-Precision: 0.3254 0.3578 +10.0
ware, appear in Figure 3. The baseline curves in these
plots show the performance of the t df measure using Table 1. Model 1 compared to the baseline system
Robertson's tf score, as described by Ponte in [9]. They for queries constructed from the concept elds. These
also show the result of using Model 0 to score the docu- numbers correspond to the upper left plot in Figure 3.
ments, using only word-for-word translations. In this
approach, documents receive high relevance scores if
they contain terms appearing in the query. Model 1 im- 6 Discussion
proves the average precision over the t df baseline by
19.4% on the AP data, and by 27.3% on the SJMN data. Viewing a user's interaction with an information re-
The R-precision improves by 10.0% on the AP data and trieval system as translation of an information need into
by 22.8% on the SJMN data. The actual numbers for a query is natural. Exactly this formulation is made in
AP are collected in Table 1. a recent overview of issues in information science pre-
As discussed earlier, over tting during EM train- sented to the theoretical computer science community
ing is a concern. Figure 3 shows the performance of [2]. In this paper we have attempted to lay the ground-
Model 1 on the AP data when the probabilities are trained for two and three EM iterations. To study the effects of query length on the models, we also scored the documents for the title fields of topics 51-100, where the average query length is only 2.8 words. As the middle right plot of Figure 3 reveals, the precision-recall curves are qualitatively different, showing a degradation in performance in the high-precision range. Overall, Model 1 achieves an improvement over the tfidf baseline of 30.2% in average precision and 17.8% in R-precision on these short queries. The marginal improvement of Model 1 over Model 0 is smaller here: 6.3% in average precision and 4.9% in R-precision.

Two precision-recall plots for the SDR task are given. The bottom left plot of Figure 3 shows the improvement of Model 1 over the baseline. There is an improvement of 22.2% in average precision and 18.4% in R-precision. The bottom right plot compares Model 0 to the result of ranking documents according to the probability $\prod_{i=1}^{m} l(q_i \mid d)$ for the same language model, as in Ponte and Croft's method. There is very little difference in the results, showing that equation (5) is indeed a good approximation.

work for building practical IR systems that exploit this view, by demonstrating how statistical translation models can be built and used to advantage.

When designing a statistical model for language processing tasks, often the most natural route is to apply a generative model which builds up the output step-by-step. Yet to be effective, such models need to liberally distribute probability mass over a huge space of possible outcomes. This probability can be difficult to control, making an accurate direct model of the distribution of interest difficult to construct. The source-channel perspective suggests a different approach: turn the search problem around to predict the input. Far more than a simple application of Bayes' law, there are compelling reasons why reformulating the problem in this way should be rewarding. In speech recognition, natural language processing, and machine translation, researchers have time and again found that predicting what is already known (i.e., the query) from competing hypotheses can be more effective than directly predicting all of the hypotheses.
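As a concrete companion to the ranking criterion above, the following sketch scores documents by the query likelihood $\prod_{i=1}^{m} l(q_i \mid d)$, interpolating each document's unigram model with a background collection model so that unseen terms do not zero out the product. This is a toy illustration, not the system evaluated in the experiments: the function name, the interpolation weight `lam`, and the tiny corpus are all assumptions made here.

```python
import math

def query_likelihood(query, doc, collection, lam=0.5):
    """log prod_i l(q_i | d): the document's empirical unigram model,
    linearly interpolated with a collection-wide background model."""
    score = 0.0
    for q in query:
        p_doc = doc.count(q) / len(doc)
        p_coll = collection.count(q) / len(collection)
        score += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return score

docs = [["the", "translation", "of", "statistical", "models"],
        ["weather", "report", "for", "today"]]
collection = [w for d in docs for w in d]  # background model over all documents
query = ["statistical", "translation"]

# Rank documents by descending query likelihood.
ranked = sorted(range(len(docs)),
                key=lambda i: query_likelihood(query, docs[i], collection),
                reverse=True)
print(ranked)  # the document containing both query terms ranks first
```

Ranking by this product corresponds to the Ponte-and-Croft-style criterion that the experiments above compare against.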
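A document-to-query translation score in the spirit of these models can be sketched just as compactly: each query term $q$ is scored by $\sum_{w} t(q \mid w)\, l(w \mid d)$, summing over document words $w$ that may translate to it. The translation table below is hand-specified purely for illustration; in the translation approach such probabilities are learned with EM, so the words and values here are assumptions, not trained parameters.

```python
# Toy translation table t(q | w): probability that document word w
# generates query term q. Hand-specified for illustration only; the
# translation approach estimates these values with EM.
t = {
    ("auto", "car"): 0.6,         ("car", "car"): 0.4,
    ("auto", "automobile"): 0.7,  ("automobile", "automobile"): 0.3,
    ("loan", "loan"): 1.0,
}

def model1_style_score(query, doc):
    """score(q, d) = prod_i sum_w t(q_i | w) * l(w | d), where
    l(w | d) is the document's empirical unigram distribution."""
    score = 1.0
    for q in query:
        score *= sum(t.get((q, w), 0.0) * doc.count(w) / len(doc)
                     for w in set(doc))
    return score

doc = ["automobile", "loan"]
# "auto" never appears in the document, yet it scores through the
# translation t("auto" | "automobile"): 0.7 * (1/2) = 0.35.
print(model1_style_score(["auto"], doc))
```

Because each document word spreads probability over several query terms, a form of query expansion and term weighting falls out of the model itself.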
Our simple models only begin to tap the potential of the translation-based approach. More powerful models of the query generation process, even along the lines of IBM's more complex models of translation, should offer performance gains. For example, one of the fundamental notions of statistical translation is the idea of fertility, where a source word can generate zero or more words in the target sentence. While there appears to be no good reason why a word selected from the document should generate more than a single query term per trial, we should allow for infertility probabilities, where a word generates no terms at all. The use of stop word lists mitigates but does not eliminate the need for this improvement. The use of distortion probabilities could be important for discounting relevant words that appear toward the end of a document, and rewarding those that appear at the beginning. Many other natural extensions to the model are possible.

Our discussion has made the usual "bag of words" assumption about documents, ignoring word order for the sake of simplicity and computational ease. But the relative ordering of words is informative in almost all applications, and crucial in some. The sense of a word is often revealed by nearby words, and so by heeding contextual clues, one might hope to obtain a more accurate translation from document words to query words.

The retrieval-as-translation approach applies quite naturally to the problem of multilingual retrieval, where a user issues queries in a language T to retrieve documents in a source language S. The most direct approach to multilingual IR expands the query $q_T$ to $q'_T$ and then translates directly into a source language query $q'_S$, which is then issued to the database. The retrieval-as-translation approach would learn translation models which capture how words in S translate to words in T. The source-channel framework should also be well-suited to "higher-order" IR tasks, such as fact extraction and question answering from a database.

7 Conclusions

We have presented an approach to information retrieval that exploits ideas and methods of statistical machine translation. After outlining the approach, we presented two simple, motivated models of the document-query translation process. With the EM algorithm, the parameters of these models can be trained in an unsupervised fashion from a collection of documents. Experiments on TREC datasets demonstrate that even these simple methods can yield substantial improvements over standard baseline vector space methods, without requiring explicit query expansion and term weighting schemes; these mechanisms are inherent in the translation approach itself.

Acknowledgements

We would like to thank Jamie Callan, Jaime Carbonell, Ramesh Gopinath, Yiming Yang, and the anonymous reviewers for helpful comments on this work. This research was supported in part by NSF KDI grant IIS-9873009, an IBM University Partnership Award, an IBM Cooperative Fellowship, and a grant from Justsystem Corporation.

References

[1] A. Bookstein and D. Swanson (1974). "Probabilistic models for automatic indexing," Journal of the American Society for Information Science, 25, pp. 312-318.

[2] A. Broder and M. Henzinger (1998). "Information retrieval on the web: Tools and algorithmic issues," Invited tutorial at Foundations of Computer Science (FOCS).

[3] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin (1990). "A statistical approach to machine translation," Computational Linguistics, 16(2), pp. 79-85.

[4] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer (1993). "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, 19(2), pp. 263-311.

[5] P. Brown, S. Della Pietra, V. Della Pietra, M. Goldsmith, J. Hajic, R. Mercer, and S. Mohanty (1993). "But dictionaries are data too," In Proceedings of the ARPA Human Language Technology Workshop, Plainsboro, New Jersey.

[6] W. B. Croft and D. J. Harper (1979). "Using probabilistic models of document retrieval without relevance information," Journal of Documentation, 35, pp. 285-295.

[7] A. Dempster, N. Laird, and D. Rubin (1977). "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, 39(B), pp. 1-38.

[8] W. Gale and K. Church (1991). "Identifying word correspondences in parallel texts," in Fourth DARPA Workshop on Speech and Natural Language, Morgan Kaufmann Publishers, pp. 152-157.

[9] J. Ponte (1998). A language modeling approach to information retrieval. Ph.D. thesis, University of Massachusetts at Amherst.

[10] J. Ponte and W. B. Croft (1998). "A language modeling approach to information retrieval," Proceedings of the ACM SIGIR, pp. 275-281.

[11] S. E. Robertson and K. Sparck Jones (1976). "Relevance weighting of search terms," Journal of the American Society for Information Science, 27, pp. 129-146.

[12] S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau (1992). "Okapi at TREC," In Proceedings of the first Text REtrieval Conference (TREC-1), Gaithersburg, Maryland.

[13] G. Salton and C. Buckley (1988). "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 24, pp. 513-523.

[14] H. Turtle and W. B. Croft (1991). "Efficient probabilistic inference for text retrieval," Proceedings of RIAO 3.

[15] W. Weaver (1955). "Translation (1949)," In Machine Translation of Languages, MIT Press.