Cross Language
What is Cross-Lingual Retrieval?
Current Approaches to CLIR
Two Approaches
• Query translation
– Translate English query into Chinese query
– Search Chinese document collection
– Translate retrieved results back into English
• Document translation
– Translate entire document collection into English
– Search collection in English
• Translate both?
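A minimal sketch of the query-translation pipeline above, in Python. The mt and engine objects and their translate/search methods are hypothetical placeholders, not any specific system's API.

# Query-translation CLIR, end to end (hypothetical helpers).
def clir_query_translation(english_query, mt, engine, k=10):
    # 1. Translate the English query into Chinese.
    chinese_query = mt.translate(english_query, src="en", tgt="zh")
    # 2. Search the Chinese document collection.
    chinese_hits = engine.search(chinese_query, top_k=k)
    # 3. Translate the retrieved documents back into English for examination.
    return [mt.translate(doc.text, src="zh", tgt="en") for doc in chinese_hits]

Document translation reverses the order of work: run the translation system over the whole collection once, offline, after which retrieval is a plain English search.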
Query Translation
[Diagram: English queries pass through the translation system to become Chinese queries; the retrieval engine searches the Chinese document collection; retrieved Chinese documents go back through the translation system as results, which the user examines and selects from.]
Document Translation
[Diagram: the translation system converts the Chinese document collection into an English document collection offline; English queries go to the retrieval engine, which searches the English collection and returns results that the user examines and selects from.]
Tradeoffs
• Query Translation
– Often easier
– Disambiguation of query terms may be difficult with
short queries
– Translation of retrieved documents must be performed at query time
• Document Translation
– Documents can be translated and stored offline
– Automatic translation can be slow
• Which is better?
– Often depends on the availability of language-specific
resources (e.g., morphological analyzers)
– Both approaches present challenges for interaction
A Non-Statistical Approach
• Interlingua approaches
– Translate query into special language
– Translate all documents into same language
– Compare directly
– Cross-language retrieval becomes monolingual retrieval
• Choice of interlingua?
– Could use an existing language (e.g., English)
– Create your own
CINDOR
Does it work?
• Some background research suggested large
gains over word-by-word translation
Current Capabilities of CLIR
• Best performance obtained by
– probabilistic approach using translation probabilities
estimated from an aligned parallel corpus
– a “structured” query that treats translations from a bilingual dictionary as synonyms and uses an advanced search engine
– Combination of techniques including MT
• Most experiments done in Chinese, Spanish, French, German, and, recently, Arabic
CLIR errors
But how good is “monolingual”?
Adding New Languages
• Morphological processing (sketched after this list)
– segmenting (what is a word?)
– stemming (combining inflections and variants)
– stopwords (words that can be ignored)
• Language resources
– minimum is a bilingual dictionary
– parallel or comparable corpora are even better
– MT system is a luxury
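A minimal sketch of the morphological-processing step listed above (naive segmentation, stopword removal, crude suffix stripping) for an English-like language; the stopword and suffix lists are illustrative toys, not from any particular system.

# Illustrative normalizer: tokenize, drop stopwords, strip a few suffixes.
STOPWORDS = {"the", "a", "an", "of", "and", "in"}   # toy list
SUFFIXES = ["ation", "ing", "es", "s"]              # toy list

def normalize(text):
    terms = []
    for token in text.lower().split():              # naive segmentation
        if token in STOPWORDS:
            continue
        for suffix in SUFFIXES:                     # crude stemming
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

# normalize("combining inflections and variants") -> ['combin', 'inflection', 'variant']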
Problems with CLIR
Problems with CLIR
• Morphological processing (contd.)
– Arabic stemming
– Root + patterns + suffixes + prefixes = word, e.g., ktb + CiCaC = kitab
• All verbs and nouns derived from fewer than 2000 roots
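A tiny sketch of the root-plus-pattern idea above: slot the root consonants into the pattern's C positions. The function name is made up for illustration.

def apply_pattern(root, pattern):
    # Replace each 'C' in the pattern with the next root consonant.
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# apply_pattern("ktb", "CiCaC") -> "kitab"

Stemming runs in the opposite direction: strip prefixes and suffixes from the surface form and map it back toward its root.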
Problems with CLIR
• Availability of resources
– Names and phrases are very important, but most lexicons do not have good coverage of them
Phrase translation
Example Phrase
CLIR Issues
[Diagram: candidate translations of a Chinese query, such as “oil probe,” “petroleum survey,” “take samples,” “cymbidium goeringii,” and “restrain,” annotated with three failure modes: wrong segmentation, no translation available, and choosing which translation to use.]
Learning to Translate
• Lexicons
– Phrase books, bilingual dictionaries, …
• Large text collections
– Translations (“parallel”)
– Similar topics (“comparable”)
• People
[Image: the same text in three scripts: Hieroglyphic, Demotic, and Greek.]
Word-Level Alignment
[Diagram: word-aligned sentence pairs; the English sides read “Diverging opinions about planned tax reform” and “Madam President, I had asked the administration …”.]
Query Expansion/Translation
[Diagram: the source-language query is expanded by source-language IR, passed through query translation, then expanded again by target-language IR; the expanded target-language terms produce the results.]
TREC 2002 CLIR/Arabic
• Most recent (US-based) study in CLIR occurred at TREC
– Results reported November 2002
• Topics
– 50 TREC topic statements in English
– Average of 118.2 relevant docs/topic (min 3, max 523)
Sample topic
<top>
<num>Number: AR26</num>
<title>Kurdistan Independence</title>
<desc> Description:
How does the National Council of Resistance relate to the
potential independence of Kurdistan?
</desc>
<narr> Narrative:
Articles reporting activities of the National Council of
Resistance are considered on topic. Articles discussing
Ocalan's leadership within the context of the Kurdish
efforts toward independence are also considered on
topic.
</narr>
</top>
Sample Topic: Arabic Document
Stemming
Stemming (Berkeley)
• Alternative way to build stem classes
Stemming
UMass core approaches
• InQuery
– For each English word, look up all translations in dictionary
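A minimal sketch of that lookup combined with the “translations as synonyms” structured query mentioned under Current Capabilities of CLIR. The #syn/#and wrapping mirrors InQuery-style operators, but the toy dictionary and helper below are illustrative, not UMass's actual code.

# Toy English->Arabic dictionary (illustrative, transliterated).
DICT = {
    "kurdistan": ["kurdistan", "kurdstan"],
    "independence": ["istiqlal"],
}

def structured_query(english_terms, dictionary):
    # Wrap each word's translations in a synonym operator so alternative
    # translations of one English word count as a single concept.
    parts = []
    for term in english_terms:
        translations = dictionary.get(term, [term])   # keep untranslatable terms as-is
        parts.append("#syn(" + " ".join(translations) + ")")
    return "#and(" + " ".join(parts) + ")"

# structured_query(["kurdistan", "independence"], DICT)
#   -> '#and(#syn(kurdistan kurdstan) #syn(istiqlal))'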
Breaking the LM approach apart
• Query likelihood model
• P(a|Da)
– Probability of Arabic word in the Arabic document
• P(e|a)
– Translation probability (prob. of English
word for Arabic word)
• P(e|GE)
– Smoothing of the probabilities
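Putting the three pieces together, a common way to write the cross-language query-likelihood score is sketched below; the mixture weight \lambda is an assumption, not given on the slide.

P(e \mid D_a) \;=\; \lambda \sum_{a} P(e \mid a)\, P(a \mid D_a) \;+\; (1-\lambda)\, P(e \mid GE)

P(Q \mid D_a) \;=\; \prod_{e \in Q} P(e \mid D_a)

Each English query word e is explained by the Arabic document through all of its candidate Arabic translations a, with the P(e|GE) term providing background smoothing when no translation matches.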
Calculating Translation Probabilities
• Dictionary or lexicon
– Assume equal probabilities for all translations
– Unless dictionary gives usage hints
• Parallel corpus
– Assume sentence-aligned parallel corpora
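A crude sketch of estimating P(e|a) from sentence-aligned pairs by co-occurrence counting; real systems typically use an alignment model (e.g., IBM Model 1), and the variable names here are illustrative.

from collections import defaultdict

def translation_probs(aligned_pairs):
    # aligned_pairs: list of (english_tokens, arabic_tokens) sentence pairs
    cooc = defaultdict(float)
    totals = defaultdict(float)
    for en_tokens, ar_tokens in aligned_pairs:
        for a in ar_tokens:
            for e in en_tokens:
                cooc[(e, a)] += 1.0
                totals[a] += 1.0
    # P(e|a): fraction of a's co-occurrences that involve e
    return {(e, a): count / totals[a] for (e, a), count in cooc.items()}

The dictionary case above is the degenerate version: every listed translation of a word gets an equal share unless usage hints say otherwise.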
Other techniques
• Query expansion
– Useful to bring in additional related words
– Same as in monolingual retrieval
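A minimal sketch of pseudo-relevance-feedback expansion (assume the top-ranked documents are relevant and add their most frequent new terms); the engine object and parameter values are placeholders.

from collections import Counter

def expand_query(query_terms, engine, top_docs=10, extra_terms=5):
    # Count terms in the top-ranked documents for the original query.
    counts = Counter()
    for doc in engine.search(" ".join(query_terms), top_k=top_docs):
        counts.update(doc.tokens)
    # Add the most frequent terms not already in the query.
    new_terms = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + new_terms[:extra_terms]

In CLIR this can be applied before translation (source side), after translation (target side), or both.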
International Research Programs
• Major ones are
– TREC (US, under the DARPA TIDES program),
– CLEF (EU), and
– NTCIR (Japan)
• These programs were initially designed for ad hoc cross-language text retrieval, then extended to multilingual, multimedia, domain-specific, and other dimensions.
CL image retrieval, CLEF 2003
Why a new CLEF task?
CL image retrieval, CLEF 2003
Given a user need expressed in a language
different from the document collection, find as
many relevant images as possible
• Fifty user needs (topics):
– Expressed with a short (title) and longer (narrative) textual description
– Also expressed with an example relevant image (QBE)
– Titles translated into 5 European languages (by Sheffield) and
Chinese (by NTU)
[Diagram: publicly available ImageCLEF resources: the image collection, the topics, and the relevance assessments.]
Evaluation
• Evaluation based on most stringent relevance set (strict
intersection)
• Compared systems using
– MAP across all topics (see the sketch below)
– Number of topics with no relevant image in the top 100
• 4 participants evaluated (used captions only):
– NTU – Chinese->English, manual and automatic, Okapi and
dictionary-based translation, focus on proper name translation
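For reference, a minimal sketch of average precision for a single topic; MAP is the mean of this value across all topics. The ranked list and relevance judgments are placeholder inputs.

def average_precision(ranked_ids, relevant_ids):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank        # precision at each relevant hit
    return precision_sum / max(len(relevant_ids), 1)

# MAP = sum of average_precision over all topics, divided by the number of topics.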
Results
• Surrey had problems
• NTU obtained highest Chinese results
– approx. 51% mono and 12 failed topics (NTUiaCoP)
• Sheffield obtained highest
– Italian: 72% mono and 7 failed topics
– German: 75% mono and 8 failed topics
– Dutch: 69% mono and 7 failed topics
– French: 78% mono and 3 failed topics
• Daedalus obtained highest
– Spanish: 76% mono and 5 failed topics (QTdoc)
– Monolingual: 0.5718 and 1 failed topic (Qor)
• For more information … see the ImageCLEF working notes