
Cross-Language IR

Many slides courtesy of
James Allan (UMass),
Jimmy Lin (University of Maryland),
Paul Clough and Mark Stevenson (University of Sheffield, UK)

1
What is Cross-Lingual Retrieval?

• Accepting questions in one language (English) and retrieving information in a variety of other languages
  – "questions" may be typical Web queries or full questions in a cross-lingual question answering (QA) system
  – "information" could be news articles, text fragments or passages, factual answers, audio broadcasts, written documents, images, etc.

• Searching distributed, unstructured, heterogeneous, multilingual data

• Often combined with summarization, translation, and discovery technology

2
Current Approaches to CLIR

• Typical approach is to translate query, use monolingual search engines, then combine answers
  – other approaches use machine translation of documents
  – or translation into an interlingua

• Translation ambiguity a major issue
  – multiple translations for each word
  – query expansion often used as part of solution
  – translation probabilities required for some approaches

• Requires significant language resources
  – bilingual dictionaries
  – parallel corpora
  – "comparable" corpora
  – MT systems

3
Two Approaches
• Query translation
– Translate English query into Chinese query
– Search Chinese document collection
– Translate retrieved results back into English
• Document translation
– Translate entire document collection into
English
– Search collection in English
• Translate both?
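
A minimal Python sketch of the two pipelines, assuming hypothetical helpers: translate_terms (a bilingual dictionary or MT lookup), translate_text (document translation), and chinese_index / english_index objects with a search(query, k) method. None of these names come from a specific library.

def query_translation_search(english_query, translate_terms, chinese_index,
                             translate_text, k=10):
    """Translate the query, search the Chinese collection, then translate
    the retrieved documents back for the user to examine."""
    chinese_query = " ".join(
        t for term in english_query.split()
        for t in translate_terms(term)            # may return several candidates
    )
    hits = chinese_index.search(chinese_query, k=k)   # [(doc_id, text), ...]
    return [(doc_id, translate_text(text)) for doc_id, text in hits]

def document_translation_search(english_query, english_index, k=10):
    """Search a collection that was already translated into English offline."""
    return english_index.search(english_query, k=k)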

4
Query Translation
[Diagram: an English query is translated into a Chinese query and run by the retrieval engine against the Chinese document collection; the retrieved Chinese documents pass through the translation system to produce results that the user can select and examine.]
5
Document Translation
[Diagram: the translation system translates the Chinese document collection into an English document collection offline; English queries are then run by the retrieval engine against the English collection, and the user selects and examines the results.]
6
Tradeoffs
• Query Translation
– Often easier
– Disambiguation of query terms may be difficult with
short queries
– Translation of documents must be performed at
query time
• Document Translation
– Documents can be translated and stored offline
– Automatic translation can be slow
• Which is better?
– Often depends on the availability of language-specific
resources (e.g., morphological analyzers)
– Both approaches present challenges for interaction

7
A non-statistical approach

• Interlingua approaches
– Translate query into special language
– Translate all documents into same language
– Compare directly
– Cross-language retrieval becomes monolingual
retrieval

• Choice of interlingua?
  – Could use an existing language (e.g., English)
  – Or create your own

• Textwise created a “conceptual interlingua”

8
CINDOR

9
CINDOR

10
Does it work?
• Some background research suggested large
gains over word-by-word translation

• Fielded in TREC-7 cross-language task

• Performed poorly overall


– System not completed at the time
– Interlingua incomplete
– Several small processing errors added up
– On queries without problems, comparable to
monolingual

• Statistical methods now dominate the field

11
Current Capabilities of CLIR
• Best performance obtained by
– probabilistic approach using translation probabilities
estimated from an aligned parallel corpus
– “structured” query that treats translations from
bilingual dictionary as synonyms and uses advanced
search engine
– Combination of techniques including MT
– Most experiments done in Chinese, Spanish, French,
German, and recently, Arabic

• Cross-lingual can achieve 80-90% of monolingual effectiveness
  – with sufficient language resources
  – sometimes does even better, but can also do worse

12
CLIR errors

13
But how good is “monolingual”?

• Not easy to summarize IR performance as a single number
  – We've considered average precision, Swets' number, utility functions, expected search length, …

• Based on measures of recall and precision…
  – Breakeven of 30% for "Web" queries; precision 40% in top 20, 20% in top 100
  – Breakeven of 45% for "analyst" queries; precision 65% in top 20, 45% in top 100
  – Recall can be improved through techniques such as query expansion and relevance feedback

14
Adding New Languages

• Morphological processing
  – segmenting (what is a word?)
  – stemming (combining inflections and variants)
  – stopwords (words that can be ignored)

• Language resources
  – minimum is a bilingual dictionary
  – parallel or comparable corpora are even better
  – MT system is a luxury

15
Problems with CLIR

• Morphological processing difficult for some languages (e.g., Arabic)

16
Problems with CLIR
• Morphological processing (contd.)
  – Arabic stemming
  – Root + patterns + suffixes + prefixes = word
    ktb + CiCaC = kitab

• All verbs and nouns derived from fewer than 2000 roots

• Roots too abstract for information retrieval
  ktb → kitab (a book), alkitab (the book), kataba (to write), maktab (office), maktaba (library, bookstore), …
  kitab → kitabi (my book), kitabuki (your book, f), kitabuka (your book, m), kitabuhu (his book), …
  Want stem = root + pattern + derivational affixes?

• No standard stemmers available, only morphological (root) analyzers
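
Because roots are too abstract, practical systems strip affixes instead. A minimal Python sketch of such "light" affix stripping is below; the transliterated affix lists are toy examples chosen to match the words above, not the tables of any real stemmer (e.g., the UMass light stemmer mentioned on a later slide).

# Illustrative light stemmer over transliterated Arabic; the affix lists
# are toy examples, not actual light-stemmer tables.
PREFIXES = ["al", "wal", "bal"]          # e.g. the definite article "al-"
SUFFIXES = ["uhu", "uka", "uki", "i"]    # e.g. possessive endings

def light_stem(word, min_len=3):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

# light_stem("alkitab") -> "kitab";  light_stem("kitabuhu") -> "kitab"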

17
Problems with CLIR
• Availability of resources
– Names and phrases are very important, most lexicons
do not have good coverage

• Difficult to get hold of bilingual dictionaries
  – can sometimes be found on the Web
    • e.g., for a recent Arabic cross-lingual evaluation we used 3 on-line Arabic-English dictionaries (including harvesting) and a small lexicon of country and city names
  – Parallel corpora are more difficult and require more formal arrangements

18
Phrase translation

• Phrases are a major source of translation error

• How to get phrases translated properly?

• Assume that correct translations of words in phrase co-occur
  – Given two-word phrase "A B"
  – Look at all translations of A: A1 or A2 or … or An (and B, similarly)
  – Look at all pairs "Ai Bj" and see which of them co-occur
    • Probably in passages of the collection
  – Use the best pair as the phrase translation
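
A Python sketch of this co-occurrence heuristic, assuming a bilingual_dict mapping each source word to its candidate translations and a cooccurrence_count(x, y) function over passages of the target collection (both hypothetical resources):

from itertools import product

def translate_phrase(term_a, term_b, bilingual_dict, cooccurrence_count):
    """Pick the translation pair (Ai, Bj) whose members co-occur most often,
    e.g. within passages of the target-language collection."""
    candidates = product(bilingual_dict.get(term_a, []),
                         bilingual_dict.get(term_b, []))
    return max(candidates, key=lambda pair: cooccurrence_count(*pair),
               default=None)

Applied to the "Proceso Paz" example on the next slide, the pair "peace process" would be ranked first because its members co-occur most often.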

19
Example Phrase

• Worked quite well in English-Spanish CLIR

• Consider Spanish phrase "Proceso Paz"
  – proceso: process, lapse of time, trial, prosecution, action, lawsuit, proceedings, processing
  – paz: peace, peacefulness, tranquility, peace treaty, kiss of peace, sign of peace

• Ranked possible translation pairs:


– peace process
– peacefulness process
– tranquility process
– …

20
CLIR Issues
[Figure: a Chinese query is segmented and each term translated, illustrating three problems: some terms have no translation; others have several candidate translations (e.g., "oil probe", "petroleum survey", "take samples": which translation?); and wrong segmentation introduces spurious candidates such as "cymbidium goeringii" and "restrain".]
21
Learning to Translate
• Lexicons
– Phrase books, bilingual dictionaries, …
• Large text collections
– Translations (“parallel”)
– Similar topics (“comparable”)
• People

22
[Figure: parallel texts of the same passage in Hieroglyphic, Demotic, and Greek (the Rosetta Stone): an early "parallel corpus".]
23
Word-Level Alignment
English: Diverging opinions about planned tax reform
German:  Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President, I had asked the administration …
Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

[In the original figure, alignment links connect corresponding words across each sentence pair.]
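
Word-translation probabilities can be estimated automatically from sentence pairs like these. Below is a compact Python sketch of IBM Model 1 style EM training; this is a standard technique offered as an illustration (the slides do not say which alignment model was actually used), and the NULL word is omitted for brevity.

from collections import defaultdict

def train_model1(sentence_pairs, iterations=5):
    """Estimate word-translation probabilities t(f|e) from sentence-aligned
    pairs (e_tokens, f_tokens) using IBM Model 1 style EM."""
    # start from a uniform distribution over target-vocabulary words
    f_vocab = {f for _, f_sent in sentence_pairs for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for e_sent, f_sent in sentence_pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t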

24
Query Expansion/Translation
[Diagram: the source-language query may first be expanded against a source-language collection (pre-translation expansion); the query is then translated, and the expanded target-language query terms may be expanded again against the target-language collection (post-translation expansion) before target-language retrieval produces the results.]

25
TREC 2002 CLIR/Arabic
• Most recent (US-based) study in CLIR occurred at TREC
– Results reported November 2002

• Problem was to retrieve Arabic documents in response to English queries
  – Translated Arabic queries provided for monolingual comparison

• Corpus of Arabic documents
  – 896 MB of news from Agence France Presse
  – May 13, 1994 through December 20, 2000
  – 383,872 articles

• Topics
  – 50 TREC topic statements in English
  – Average of 118.2 relevant docs/topic (min 3, max 523)

• Nine sites participated
  – 23 CL runs, 18 monolingual

26
Sample topic
<top>
<num>Number: AR26</num>
<title>Kurdistan Independence</title>
<desc> Description:
How does the National Council of Resistance relate to the
potential independence of Kurdistan?
</desc>
<narr> Narrative:
Articles reporting activities of the National Council of
Resistance are considered on topic. Articles discussing
Ocalan's leadership within the context of the Kurdish
efforts toward independence are also considered on
topic.
</narr>
</top>

27
Sample topic: Arabic document

28
Stemming

• TREC organizers provided standard resources
• To allow comparisons of algorithms vs. resources
• One of those was an Arabic stemmer
  – UMass developed a "light stemmer", also used heavily

29
Stemming (Berkeley)
• Alternative way to build stem classes
• Trying to deal with complex morphology
• Use MT system to translate Arabic words
  – Now have (arabic, english) pairs
• Stop and stem all of the English words/phrases using favorite stemmer
  – (arabic, english-stem) pairs
  – If English stem is the same, then assume Arabic words should be in the same stem class
• (Also used a light stemmer)
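
A Python sketch of this grouping step; mt_translate, english_stem, and stopwords are hypothetical stand-ins for the MT system, the English stemmer, and a stopword list.

from collections import defaultdict

def build_stem_classes(arabic_words, mt_translate, english_stem, stopwords):
    """Group Arabic words that translate to the same English stem."""
    classes = defaultdict(set)
    for ar in arabic_words:
        en = mt_translate(ar).lower()
        if en and en not in stopwords:
            classes[english_stem(en)].add(ar)
    # every Arabic word in a class is treated as the same "stem" at index time
    return classes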

30
Stemming

31
UMass core approaches
• InQuery
  – For each English word, look up all translations in the dictionary
    • If not found as is, try its stem
  – Stem all Arabic translations
  – Apply operators (see the sketch after this list)
    • Put Arabic phrases in a #filreq() operator
    • Use the synonym operator, #syn(), for alternate translations
    • Wrap all together in a #wsum() operator

• Cross-language language modeling (after BBN)
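
A Python sketch of assembling such a structured query string. The operator names (#syn, #wsum, #filreq) are the ones listed above; the uniform 1.0 weights and the exact nesting are illustrative guesses, not the actual UMass run configuration.

def build_inquery_query(english_terms, translations, arabic_phrases=()):
    """Assemble an InQuery-style structured query string.

    translations:   English term -> list of (stemmed) Arabic translations
    arabic_phrases: Arabic phrase translations to place in #filreq()
    """
    parts = []
    for term in english_terms:
        arabic = translations.get(term, [])
        if len(arabic) > 1:                      # alternates treated as synonyms
            parts.append("#syn(" + " ".join(arabic) + ")")
        elif arabic:
            parts.append(arabic[0])
    parts.extend(f"#filreq({p})" for p in arabic_phrases)
    return "#wsum(" + " ".join(f"1.0 {p}" for p in parts) + ")"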

32
Breaking the LM approach apart
• Query likelihood model
• P(a|Da)
– Probability of Arabic word in the Arabic
document
• P(e|a)
– Translation probability (prob. of English
word for Arabic word)
• P(e|GE)
  – Background "general English" probability, used for smoothing
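
A Python sketch of scoring with this decomposition: P(e|Da) is a translation-weighted sum over the Arabic words in the document, then linearly smoothed with the general-English model. The mixing weight and the linear smoothing form are illustrative assumptions, not the exact BBN/UMass configuration.

import math

def score_document(english_query, arabic_doc_tokens, p_e_given_a,
                   p_e_background, lam=0.7):
    """Log query likelihood of an English query given an Arabic document.

    p_e_given_a[(e, a)] -> translation probability P(e|a)
    p_e_background[e]   -> general-English probability P(e|GE)
    """
    doc_len = len(arabic_doc_tokens)
    counts = {}
    for a in arabic_doc_tokens:
        counts[a] = counts.get(a, 0) + 1

    log_score = 0.0
    for e in english_query:
        # P(e|Da) = sum over Arabic words a of P(e|a) * P(a|Da)
        p_e_doc = sum(p_e_given_a.get((e, a), 0.0) * (c / doc_len)
                      for a, c in counts.items())
        smoothed = lam * p_e_doc + (1 - lam) * p_e_background.get(e, 1e-9)
        log_score += math.log(smoothed)
    return log_score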

33
Calculating translation probabilities
• Dictionary or lexicon
– Assume equal probabilities for all translations
– Unless dictionary gives usage hints
• Parallel corpus
– Assume sentence-aligned parallel corpora
  • Know that sentence E is a translation of sentence A
– Estimate P(e|a) from those aligned sentences
– Count sentence pairs (E, A) where e is in E and a is in A
– To get P(e|a), divide that count by the number of Arabic sentences containing a
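
A Python sketch that follows the counting recipe above for a sentence-aligned corpus:

from collections import defaultdict

def estimate_translation_probs(sentence_pairs):
    """Estimate P(e|a) from sentence-aligned pairs (english_tokens, arabic_tokens):
    count aligned pairs in which e and a co-occur, then divide by the number
    of Arabic sentences containing a."""
    pair_count = defaultdict(int)   # (e, a) -> # of aligned pairs containing both
    a_count = defaultdict(int)      # a -> # of Arabic sentences containing a

    for e_sent, a_sent in sentence_pairs:
        e_set, a_set = set(e_sent), set(a_sent)
        for a in a_set:
            a_count[a] += 1
            for e in e_set:
                pair_count[(e, a)] += 1

    return {(e, a): c / a_count[a] for (e, a), c in pair_count.items()}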

34
Other techniques
• Query expansion
– Useful to bring in additional related words
– Same as in monolingual retrieval

• Expand query in English
  – Need comparable corpus (why comparable?)
  – Brings in synonyms and other words related to query

• Expand translated query in Arabic
  – Done on actual target corpus
  – Brings in Arabic synonyms not in dictionary
  – BBN in TREC 2002 was careful to expand only by translation of original query words

• Can do neither, either, or both

• UMass added 5 terms from English and 50 from Arabic
  – For LM runs, used "relevance modeling" instead in Arabic
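
A Python sketch of post-translation expansion as simple pseudo-relevance feedback against the target (Arabic) collection. The index interface (search, get_tokens) is hypothetical, and selecting candidates by raw frequency is a simplification of what real systems (e.g., relevance models) do.

from collections import Counter

def post_translation_expansion(arabic_query_terms, arabic_index,
                               feedback_docs=10, expansion_terms=50):
    """Expand the translated (Arabic) query with frequent terms from the
    top-ranked documents of an initial retrieval run."""
    hits = arabic_index.search(" ".join(arabic_query_terms), k=feedback_docs)
    counts = Counter()
    for doc_id, _score in hits:
        counts.update(arabic_index.get_tokens(doc_id))
    new_terms = [t for t, _ in counts.most_common()
                 if t not in arabic_query_terms][:expansion_terms]
    return list(arabic_query_terms) + new_terms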
35
CLIR better than IR?
• How can cross-language beat within-language?
– We know there are translation errors
– Surely those errors should hurt performance

• Hypothesis is that translation process may disambiguate some query terms
  – Words that are ambiguous in Arabic may not be ambiguous in English
  – Expansion during translation from English to Arabic prevents the ambiguity from re-appearing

• Has been proposed that CLIR is a model for IR
  – Translate query into one language and then back to original
  – Given hypothesis, should have an improved query
  – Should be reasonable to do this across many different languages

36
International Research Programs
• Major ones are
  – TREC (US, DARPA under the TIDES program),
  – CLEF (EU), and
  – NTCIR (Japan)
• Programs were initially designed for ad-hoc cross-language text retrieval, then extended to multi-lingual, multimedia, domain-specific and other dimensions.

37
CL image retrieval, CLEF 2003

• A pilot experiment in CLEF 2003


• Called ImageCLEF
• Combination of image retrieval and CLIR
• An ad hoc retrieval task
• 4 entries
– NTU (Taiwan)
– Daedalus (Spain)
– Surrey (UK)
– Sheffield (UK)

38
Why a new CLEF task?

• No existing TREC-style test collection
• Broadens the CLEF range of CLIR tasks
• Facilitates CL image retrieval research
• International forum for discussion

39
CL image retrieval, CLEF 2003
Given a user need expressed in a language
different from the document collection, find as
many relevant images as possible
• Fifty user needs (topics):
– Expressed with a short (title) and longer (narrative) textual
description
– Also expressed with an example relevant image (QBE)
– Titles translated into 5 European languages (by Sheffield) and
Chinese (by NTU)

• Two retrieval challenges


– Matching textual queries to visual documents (use captions)
– Matching non-English queries to English captions (use
translation)
• Essentially a bilingual CLIR task
• No retrieval constraints specified
40
Creating the test collection
Work undertaken at Sheffield

[Diagram of the process: created an image collection (images plus captions); selected and translated 50 topics (each with a title (translated), a description, and an example image); pooled all submitted runs from entrants, plus ISJ; judged the relevance of the pools (4 relevance sets created by 2 assessors per topic, using a ternary relevance scheme based on image and caption); evaluated all runs using trec_eval to produce the CLEF results; and released the publicly available ImageCLEF resources (image collection, topics, relevance assessments).]

41
Evaluation
• Evaluation based on the most stringent relevance set (strict intersection)
• Compared systems using
  – MAP (mean average precision) across all topics (see the sketch at the end of this slide)
  – Number of topics with no relevant image in the top 100
• 4 participants evaluated (used captions only):
  – NTU – Chinese->English, manual and automatic, Okapi and dictionary-based translation, focus on proper name translation
  – Daedalus – all->English (except Dutch and Chinese), Xapian and dictionary-based + on-line translation, Wordnet query expansion, focus on indexing query and ways of combining query terms
  – Surrey – all->English (except Chinese), SoCIS system and on-line translation, Wordnet expansion, focus on query expansion and analysis of topics
  – Sheffield – all->English, GLASS (BM25) and Systran translation, no language-specific processing, focus on translation quality
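
For reference, the MAP sketch mentioned above: average precision for one topic and its mean over a run, assuming run and qrels are dictionaries keyed by topic (hypothetical data structures, not a specific tool's format).

def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one topic: mean of the precision values at the
    ranks where relevant documents are retrieved."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(run, qrels):
    """MAP over all topics. run: topic -> ranked doc ids,
    qrels: topic -> set of relevant doc ids."""
    return sum(average_precision(run[t], qrels.get(t, set()))
               for t in run) / len(run)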

42
Results
• Surrey had problems
• NTU obtained highest Chinese results
– approx. 51% mono and 12 failed topics (NTUiaCoP)
• Sheffield obtained highest
– Italian: 72% mono and 7 failed topics
– German: 75% mono and 8 failed topics
– Dutch: 69% mono and 7 failed topics
– French: 78% mono and 3 failed topics
• Daedalus obtained highest
– Spanish: 76% mono and 5 failed topics (QTdoc)
– Monolingual: 0.5718 and 1 failed topic (Qor)
• For more information … see the ImageCLEF working notes

43
