Cross Language
What is Cross-Lingual Retrieval?
Current Approaches to CLIR
Two Approaches
• Query translation
– Translate English query into Chinese query
– Search Chinese document collection
– Translate retrieved results back into English
• Document translation
– Translate entire document collection into English
– Search collection in English
• Translate both?
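A minimal sketch of the query-translation pipeline above, in Python. The mt and engine objects and their translate/search methods are hypothetical placeholders, not any specific system's API.

# Query-translation CLIR, end to end (hypothetical helpers).
def clir_query_translation(english_query, mt, engine, k=10):
    # 1. Translate the English query into Chinese.
    chinese_query = mt.translate(english_query, src="en", tgt="zh")
    # 2. Search the Chinese document collection.
    chinese_hits = engine.search(chinese_query, top_k=k)
    # 3. Translate the retrieved documents back into English for examination.
    return [mt.translate(doc.text, src="zh", tgt="en") for doc in chinese_hits]

Document translation reverses the order of work: run the translation system over the whole collection once, offline, after which retrieval is a plain English search.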
Query Translation
[Diagram: English queries pass through the translation system to become Chinese queries; the retrieval engine searches the Chinese document collection; retrieved Chinese documents go back through the translation system as results, which the user examines and selects from.]
Document Translation
[Diagram: the translation system converts the Chinese document collection into an English document collection offline; English queries go to the retrieval engine, which searches the English collection and returns results that the user examines and selects from.]
Tradeoffs
• Query Translation
– Often easier
– Disambiguation of query terms may be difficult with
short queries
– Translation of retrieved documents must be performed at query time
• Document Translation
– Documents can be translated and stored offline
– Automatic translation can be slow
• Which is better?
– Often depends on the availability of language-specific
resources (e.g., morphological analyzers)
– Both approaches present challenges for interaction
A Non-Statistical Approach
• Interlingua approaches
– Translate query into special language
– Translate all documents into same language
– Compare directly
– Cross-language retrieval becomes monolingual retrieval
• Choice of interlingua?
– Could use an existing language (e.g., English)
– Create your own
CINDOR
Does it work?
• Some background research suggested large
gains over word-by-word translation
Current Capabilities of CLIR
• Best performance obtained by
– probabilistic approach using translation probabilities
estimated from an aligned parallel corpus
– a “structured” query that treats translations from a bilingual dictionary as synonyms and uses an advanced search engine
– Combination of techniques including MT
• Most experiments done in Chinese, Spanish, French, German, and, recently, Arabic
CLIR errors
But how good is “monolingual”?
Adding New Languages
• Morphological processing (sketched after this list)
– segmenting (what is a word?)
– stemming (combining inflections and variants)
– stopwords (words that can be ignored)
• Language resources
– minimum is a bilingual dictionary
– parallel or comparable corpora are even better
– MT system is a luxury
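A minimal sketch of the morphological-processing step listed above (naive segmentation, stopword removal, crude suffix stripping) for an English-like language; the stopword and suffix lists are illustrative toys, not from any particular system.

# Illustrative normalizer: tokenize, drop stopwords, strip a few suffixes.
STOPWORDS = {"the", "a", "an", "of", "and", "in"}   # toy list
SUFFIXES = ["ation", "ing", "es", "s"]              # toy list

def normalize(text):
    terms = []
    for token in text.lower().split():              # naive segmentation
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:                     # crude stemming
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

# normalize("combining inflections and variants") -> ['combin', 'inflection', 'variant']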
Problems with CLIR
Problems with CLIR
• Morphological processing (contd.)
– Arabic stemming
– Root + patterns + suffixes + prefixes = word, e.g., ktb + CiCaC = kitab
• All verbs and nouns derived from fewer than 2000 roots
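A tiny sketch of the root-plus-pattern idea above: slot the root consonants into the pattern's C positions. The function name is made up for illustration.

def apply_pattern(root, pattern):
    # Replace each 'C' in the pattern with the next root consonant.
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# apply_pattern("ktb", "CiCaC") -> "kitab"

Stemming runs in the opposite direction: strip prefixes and suffixes from the surface form and map it back toward its root.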
Problems with CLIR
• Availability of resources
– Names and phrases are very important, but most lexicons do not have good coverage of them
Phrase translation
Example Phrase
CLIR Issues
[Diagram: candidate translations of a Chinese query, such as “oil probe,” “petroleum survey,” “take samples,” “cymbidium goeringii,” and “restrain,” annotated with three failure modes: wrong segmentation, no translation available, and choosing which translation to use.]
Learning to Translate
• Lexicons
– Phrase books, bilingual dictionaries, …
• Large text collections
– Translations (“parallel”)
– Similar topics (“comparable”)
• People
[Image: the same text in three scripts: Hieroglyphic, Demotic, and Greek.]
Word-Level Alignment
[Diagram: word-aligned sentence pairs; the English sides read “Diverging opinions about planned tax reform” and “Madam President, I had asked the administration …”.]
Query Expansion/Translation
[Diagram: the source-language query is expanded by source-language IR, passed through query translation, then expanded again by target-language IR; the expanded target-language terms produce the results.]
TREC 2002 CLIR/Arabic
• Most recent (US-based) study in CLIR occurred at TREC
– Results reported November 2002
• Topics
– 50 TREC topic statements in English
– Average of 118.2 relevant docs/topic (min 3, max 523)
Sample topic
<top>
<num>Number: AR26</num>
<title>Kurdistan Independence</title>
<desc> Description:
How does the National Council of Resistance relate to the
potential independence of Kurdistan?
</desc>
<narr> Narrative:
Articles reporting activities of the National Council of
Resistance are considered on topic. Articles discussing
Ocalan's leadership within the context of the Kurdish
efforts toward independence are also considered on
topic.
</narr>
</top>
Sample Topic: Arabic Document
Stemming
Stemming (Berkeley)
• Alternative way to build stem classes
Stemming
UMass core approaches
• InQuery
– For each English word, look up all translations in dictionary
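A minimal sketch of that lookup combined with the “translations as synonyms” structured query mentioned under Current Capabilities of CLIR. The #syn/#and wrapping mirrors InQuery-style operators, but the toy dictionary and helper below are illustrative, not UMass's actual code.

# Toy English->Arabic dictionary (illustrative, transliterated).
DICT = {
    "kurdistan": ["kurdistan", "kurdstan"],
    "independence": ["istiqlal"],
}

def structured_query(english_terms, dictionary):
    # Wrap each word's translations in a synonym operator so alternative
    # translations of one English word count as a single concept.
    parts = []
    for term in english_terms:
        translations = dictionary.get(term, [term])   # keep untranslatable terms as-is
        parts.append("#syn(" + " ".join(translations) + ")")
    return "#and(" + " ".join(parts) + ")"

# structured_query(["kurdistan", "independence"], DICT)
#   -> '#and(#syn(kurdistan kurdstan) #syn(istiqlal))'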
Breaking the LM approach apart
• Query likelihood model
• P(a|Da)
– Probability of Arabic word in the Arabic document
• P(e|a)
– Translation probability (prob. of English
word for Arabic word)
• P(e|GE)
– Smoothing of the probabilities
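Putting the three pieces together, a common way to write the cross-language query-likelihood score is sketched below; the mixture weight \lambda is an assumption, not given on the slide.

P(e \mid D_a) \;=\; \lambda \sum_{a} P(e \mid a)\, P(a \mid D_a) \;+\; (1-\lambda)\, P(e \mid GE)

P(Q \mid D_a) \;=\; \prod_{e \in Q} P(e \mid D_a)

Each English query word e is explained by the Arabic document through all of its candidate Arabic translations a, with the P(e|GE) term providing background smoothing when no translation matches.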
Calculating Translation Probabilities
• Dictionary or lexicon
– Assume equal probabilities for all translations
– Unless dictionary gives usage hints
• Parallel corpus
– Assume sentence-aligned parallel corpora
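A crude sketch of estimating P(e|a) from sentence-aligned pairs by co-occurrence counting; real systems typically use an alignment model (e.g., IBM Model 1), and the variable names here are illustrative.

from collections import defaultdict

def translation_probs(aligned_pairs):
    # aligned_pairs: list of (english_tokens, arabic_tokens) sentence pairs
    cooc = defaultdict(float)
    totals = defaultdict(float)
    for en_tokens, ar_tokens in aligned_pairs:
        for a in ar_tokens:
            for e in en_tokens:
                cooc[(e, a)] += 1.0
                totals[a] += 1.0
    # P(e|a): fraction of a's co-occurrences that involve e
    return {(e, a): count / totals[a] for (e, a), count in cooc.items()}

The dictionary case above is the degenerate version: every listed translation of a word gets an equal share unless usage hints say otherwise.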
Other techniques
• Query expansion
– Useful to bring in additional related words
– Same as in monolingual retrieval
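A minimal sketch of pseudo-relevance-feedback expansion (assume the top-ranked documents are relevant and add their most frequent new terms); the engine object and parameter values are placeholders.

from collections import Counter

def expand_query(query_terms, engine, top_docs=10, extra_terms=5):
    # Count terms in the top-ranked documents for the original query.
    counts = Counter()
    for doc in engine.search(" ".join(query_terms), top_k=top_docs):
        counts.update(doc.tokens)
    # Add the most frequent terms not already in the query.
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:extra_terms]

In CLIR this can be applied before translation (source side), after translation (target side), or both.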
International Research Programs
• Major ones are
– TREC (US, under the DARPA TIDES program),
– CLEF (EU), and
– NTCIR (Japan)
• These programs were initially designed for ad hoc cross-language text retrieval, then extended to multilingual, multimedia, domain-specific, and other dimensions.
CL image retrieval, CLEF 2003
Why a new CLEF task?
CL image retrieval, CLEF 2003
Given a user need expressed in a language
different from the document collection, find as
many relevant images as possible
• Fifty user needs (topics):
– Expressed with a short (title) and longer (narrative) textual description
– Also expressed with an example relevant image (QBE)
– Titles translated into 5 European languages (by Sheffield) and
Chinese (by NTU)
[Diagram: publicly available ImageCLEF resources: the image collection, the topics, and the relevance assessments.]
Evaluation
• Evaluation based on most stringent relevance set (strict
intersection)
• Compared systems using
– MAP across all topics (see the sketch below)
– Number of topics with no relevant image in the top 100
• 4 participants evaluated (used captions only):
– NTU – Chinese->English, manual and automatic, Okapi and
dictionary-based translation, focus on proper name translation
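For reference, a minimal sketch of average precision for a single topic; MAP is the mean of this value across all topics. The ranked list and relevance judgments are placeholder inputs.

def average_precision(ranked_ids, relevant_ids):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank        # precision at each relevant hit
    return precision_sum / max(len(relevant_ids), 1)

# MAP = sum of average_precision over all topics, divided by the number of topics.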
Results
• Surrey had problems
• NTU obtained highest Chinese results
– approx. 51% mono and 12 failed topics (NTUiaCoP)
• Sheffield obtained highest
– Italian: 72% mono and 7 failed topics
– German: 75% mono and 8 failed topics
– Dutch: 69% mono and 7 failed topics
– French: 78% mono and 3 failed topics
• Daedalus obtained highest
– Spanish: 76% mono and 5 failed topics (QTdoc)
– Monolingual: 0.5718 and 1 failed topic (Qor)
• For more information … see the ImageCLEF working notes