An Aligned French-Chinese corpus of 10K segments from university educational material
Ruslan Kalitvianski Lingxiao Wang Valérie Bellynck Christian Boitet
LIG-GETALP, Bâtiment IMAG, 700 av. Centrale,
CS 40700, 38058 Grenoble cedex 9, France
[email protected]

Abstract
This paper describes a corpus of nearly 10K French-Chinese aligned segments, produced by post-editing machine-translated computer science courseware. The corpus was built from 2013 to 2016 within the MACAU project by native Chinese students. The quality, as judged by native speakers, is adequate for understanding (far better than reading only the original French) and for getting better marks. The corpus is annotated at the segment level with a self-assessed quality score. It has been directly used as supplemental training data to build a statistical machine translation system dedicated to that sublanguage, and can be used to extract the specific bilingual terminology. To our knowledge, it is the first corpus of this kind to be released.

1 Introduction
The ongoing MACAU project, started in 2012 at the University of Grenoble, aims at providing multilingual access to course material taught at the university (Kalitvianski et al., 2015). It is motivated by the fact that many foreign students struggle to understand material taught in French, and have to spend extra time on dictionary lookup and translation to fully comprehend its meaning.
The MACAU platform¹ is designed to create multilingual versions of initially monolingual course material by producing machine translations into the desired language, and by providing an interface that allows readers to post-edit these translations, segment by segment, until the desired level of quality is achieved.
A direct by-product of this activity is a bilingual corpus of post-edited sentences, constituting full courses, exercises and so on, concerning several fields of theoretical and practical computer science. Such a corpus can be employed as supplemental data for training a custom machine translation system. It can also serve for the extraction of a domain-specific lexicon.
In this paper we describe the data, provide corpus statistics, and delineate potential uses for the corpus.

2 The MACAU corpus

In this section we describe the MACAU project, within which this corpus was constructed, and give the corpus's characteristics.
2.1 The MACAU project
The MACAU project has been ongoing since 2012. Its purpose is to help foreign students access educational material produced by the university in their native tongues, as those are the ones they understand best.
This is achieved by post-editing machine-translated documents, segment by segment. The post-edition is done via the iMAG web interface (Boitet et al., 2010). An iMAG is an interactive multilingual access gateway, which allows its users to visit a web page in the language of their choice while preserving its layout.

¹ Currently migrating to macau.imag.fr
This work is licenced under a Creative Commons Attribution 4.0 International Licence.
Licence details: http://creativecommons.org/licenses/by/4.0/
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications, pages 117–121, Osaka, Japan, December 12 2016.

Pages are automatically segmented into translation units, typically sentences or titles. Each segment is replaced either by machine translation output, if the segment is not found in the dedicated translation memory, or by the best post-edition available, if the segment has been post-edited. Users can contribute corrections directly on the page by hovering the mouse pointer over the segment they wish to correct, which makes a post-editing palette appear. The quality of a post-edition is explicitly “self-assessed” by the post-editor through a score in [0..20]. That score can later be revised, for example by other Chinese students using the Chinese version to study. Note that the interface allows the target (Chinese) and source (French) versions to be displayed side by side, so that, while learning some topic in computer science, Chinese students can also progress in French.
Post-editing through the iMAG interface is typically 3 times faster than translation from scratch (15-20 mins vs. 1 hour per standard page of 250 words). Such an interface also has the benefit of allowing post-editors to see each segment within its context.
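
The segment substitution rule described above (machine translation when the translation memory has no entry, otherwise the best post-edition) can be illustrated with a small sketch. Everything here is hypothetical: the class and function names are invented for illustration, the `machine_translate` callback stands in for whatever MT engine is used, and taking "best" to mean "highest self-assessed score" is an assumption, since the paper does not spell out the selection rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class PostEdition:
    text: str   # post-edited target segment
    score: int  # self-assessed quality score in [0..20]


@dataclass
class TranslationMemory:
    # Maps a source segment to all post-editions contributed for it.
    entries: Dict[str, List[PostEdition]] = field(default_factory=dict)

    def add(self, source: str, text: str, score: int) -> None:
        if not 0 <= score <= 20:
            raise ValueError("score must be in [0..20]")
        self.entries.setdefault(source, []).append(PostEdition(text, score))

    def best(self, source: str) -> Optional[PostEdition]:
        # Assumption: the "best" post-edition is the one with the highest score.
        candidates = self.entries.get(source)
        if not candidates:
            return None
        return max(candidates, key=lambda pe: pe.score)


def render_segment(source: str,
                   tm: TranslationMemory,
                   machine_translate: Callable[[str], str]) -> str:
    """Return the best post-edition if one exists, else fall back to MT."""
    best = tm.best(source)
    return best.text if best is not None else machine_translate(source)
```

With an empty memory, `render_segment` falls back to the MT callback; once a post-edition is added for the same source segment, that post-edition is shown instead.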

Figure 1: a chapter of a course on computational complexity, in bilingual view, with a post-editing window displayed over a segment.

For the MACAU project, the source documents are provided by teachers and also by students, and cover subjects in bachelor's- and master's-level computer science. Table 1 below lists the subject matters included in the corpus and the number of pages for each.

Subject matter | Content type | Pages (HTML)
Introduction to Propositional and First-Order Logic | Full book | 45
C programming | Teacher lectures | 14
Object-oriented programming | Teacher lectures | 13
Computational Complexity | Lecture notes | 13
Human-Machine interaction | Teacher lectures | 7
Formal Languages and Parsing | Teacher lectures, hand-outs | 5
Modelling of digital systems | Exam paper | 2
AI and automatic planning | Exam paper | 2
Introduction to Ergonomics | Student report | 1
Table 1. Current status of the MACAU platform

Post-editing was performed by three Chinese-speaking university students. Two were master's students and one was a third-year bachelor's student. All had been taught the subjects in French.
Students were selected on the basis of their knowledge of Chinese, their familiarity with the subject matters, and their interest in the task. The purpose of the task was explained to them, and they practiced on training documents before post-editing those that are included in the corpus. They received some monetary compensation for their work (as interns).

The students were asked to post-edit machine-generated translations segment by segment, through the iMAG interface, until an acceptable Chinese formulation of each segment was obtained. They were also told that the priority was not literary quality, but rather understandability. A group of two other native Chinese speakers subsequently verified the correctness of a randomly selected subset of the translations.
2.2 Corpus characteristics
The corpus is a collection of 9662 aligned French-Chinese segments, extracted from courseware in HTML. The corpus has been cleaned of all HTML markup; however, it still contains other non-linguistic elements, such as mathematical and logical formulas.
The segmentation was produced automatically and has not been corrected manually; therefore some segments correspond to fragments of sentences and, more rarely, to two sentences fused together. Primary translations were obtained automatically via Google Translate.
The average source segment length is ~72 characters, or about 11 French words, and the median length is 53 characters. 25% of the source segments are less than 26 characters long and another 25% are over 100 characters long. In total, the French side of the corpus contains 108860 words, of which 8819 are unique tokens.
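
As a rough illustration of how statistics of this kind can be computed, the sketch below takes a list of source segments and reports character-length and token counts. The function name is hypothetical, and whitespace tokenisation is an assumption: the paper does not say which tokeniser produced its figures, so the numbers above would not necessarily be reproduced exactly.

```python
import statistics
from typing import Dict, List


def source_length_stats(segments: List[str]) -> Dict[str, float]:
    """Summarise segment lengths (in characters) and token counts.

    Assumes at least two segments and naive whitespace tokenisation.
    """
    char_lengths = [len(s) for s in segments]
    tokens = [tok for s in segments for tok in s.split()]
    # quantiles(n=4) returns the three quartile cut points; we keep Q1 and Q3.
    q1, _, q3 = statistics.quantiles(char_lengths, n=4)
    return {
        "segments": len(segments),
        "mean_chars": statistics.mean(char_lengths),
        "median_chars": statistics.median(char_lengths),
        "q1_chars": q1,
        "q3_chars": q3,
        "total_tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "mean_tokens": len(tokens) / len(segments),
    }
```

Applied to the French side of the corpus, `q1_chars` and `q3_chars` would correspond to the 25% thresholds quoted above, and `mean_tokens` to the average number of words per segment.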
The corpus initially contained many redundancies, but has been substantially cleaned. The few remaining source redundancies differ in their Chinese translations. The quality of the segments, as judged by bilingual readers, is considered adequate for understanding.
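
The paper does not describe how the redundancies were removed. One simple cleaning pass that would leave exactly the situation described (repeated sources surviving only when their translations differ) is to drop exact duplicate pairs. The sketch below is an assumption along those lines, with a hypothetical function name, and is not the authors' procedure.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (French source, Chinese target)


def drop_exact_duplicates(pairs: List[Pair]) -> List[Pair]:
    """Keep the first occurrence of each (source, target) pair.

    A source segment may still appear several times afterwards,
    but only with different translations.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())  # ignore surrounding whitespace
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))
    return kept
```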

French | Chinese
La complexité d'un programme pseudo-Pascal est l'ordre de grandeur du nombre d'instructions élémentaires à exécuter sur une entrée de taille n. | 伪帕斯卡程序的复杂性是基本指令的数量级上的大小为 n 条输入运行。
Question: Sont-ils décidables dans le modèle de calcul déterministe? | 问题：他们是能用确定性计算解决的问题吗？
(a ⇒ b )∧( b ⇒ c )∧¬( a ⇒ c ) est insatisfaisable. | ( a ⇒ b )∧( b ⇒ c )∧¬( a ⇒ c )是不可满足的。
Voici la référence principale, et son résumé, qui nous semble tout à fait clair. | 这里是主要的参考和总结，这似乎是相当清楚的。
Ces trois formules n'ont pas de variables libres. | 这三个公式没有自由变量。
Table 2. Examples of segments from the corpus

This corpus is now available on GitHub².

3 Building a specialized MT system for that sublanguage

One possible use of this corpus is the training of a specialized MT system for educational documents.
3.1 Motivations and method
We are interested in increasing the usage quality of machine translation systems. We measure usage quality as a function of post-edition time relative to an estimate of the human translation time, which by default is assumed to be 60 minutes per standard 250-word page.

(1)
Formula 1: A measure for the usage quality of a MT system.

For example, Q = 40% if Tpetotal = 30 min/page (8/20), and Q = 90% if Tpetotal = 5 min/page (18/20).
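
The values quoted in this paper (Q = 40% at 30 min/page, Q = 90% at 5 min/page and, in Section 3.3, 70% at 15 min/page) all lie on the line Q = 1 − T/50, with T in minutes per page. The sketch below only checks that observation. The linear form and the constant 50 are inferred from these three points and are an assumption, not a statement of Formula 1; the function name is hypothetical.

```python
def usage_quality_linear(t_pe_min_per_page: float) -> float:
    """Linear relation fitted to the three (time, quality) pairs quoted in the text.

    Assumption: Q = 1 - T/50, with T the total post-editing time in min/page.
    """
    return 1.0 - t_pe_min_per_page / 50.0


# The three pairs reported in the paper: (minutes per page, usage quality).
quoted = [(30, 0.40), (5, 0.90), (15, 0.70)]
for minutes, quality in quoted:
    assert abs(usage_quality_linear(minutes) - quality) < 1e-9
```

Under the same assumed relation, a score out of 20 is simply 20 × Q, which matches the 8/20 and 18/20 given in the examples.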

This corpus has been used by Wang (2015) as supplemental data for training a specialized Moses (Koehn et al., 2007) probabilistic machine translation (PMT) system through incremental training, yielding better usage quality than a general-purpose PMT system.

² https://github.com/macau-getalp/macau

3.2 Usage of Moses incremental training
When new training data becomes available, one way of adding it to an existing model is incremental training. It is an iterative process that avoids the time-consuming retraining of a new model from scratch³.
Version V0 of the system was trained on 100K bilingual segments from the MultiUN corpus (Eisele and Chen, 2010). Batches of 5000 segments taken from several in-domain corpora were then added iteratively, including a raw form of this corpus that contained 16000 unfiltered segments.
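
The batching scheme can be sketched as follows. This is only an outline of the iteration structure: `update_model` is a hypothetical callback standing in for one incremental training step, and no actual Moses command or API is invoked or implied.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def batches(pairs: Sequence[Pair], size: int = 5000) -> Iterator[List[Pair]]:
    """Yield consecutive batches of at most `size` segment pairs."""
    for start in range(0, len(pairs), size):
        yield list(pairs[start:start + size])


def incremental_training(pairs: Sequence[Pair],
                         update_model: Callable[[List[Pair]], None],
                         size: int = 5000) -> int:
    """Feed the data to the model batch by batch; return the iteration count."""
    iterations = 0
    for batch in batches(pairs, size):
        update_model(batch)  # hypothetical: one incremental update of the model
        iterations += 1
    return iterations
```

For instance, 16000 segments split with the default size give three full batches of 5000 and a final batch of 1000.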
3.3 Evolution of post-editing times
After 16 iterations (about 90 hours of computation, without ever recompiling everything), results show that the incremental training method reduces post-edition times. The resulting system yields a usage quality of 70%, at 15 min per standard page, which is better than Google Translate.

Figure 2: observed reduction in post-edition times after incremental training

The BLEU score (Papineni et al. 2002) improved as well, going from 13.8% after the first iteration
to 48.3% after the last one.
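
For readers unfamiliar with the metric, BLEU combines modified n-gram precisions with a brevity penalty. The sketch below is a minimal, unsmoothed, single-reference corpus-level version written for illustration only. The paper does not say how its scores were computed, the character-level tokenisation of Chinese shown in the usage note is an assumption, and an established implementation should be preferred for real evaluations.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[Sequence[str]],
                references: List[Sequence[str]],
                max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

A Chinese hypothesis and reference could be passed as character lists, e.g. `corpus_bleu([list(hyp_zh)], [list(ref_zh)])`; multiplying the result by 100 gives a percentage of the kind reported above.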

Conclusion
We have presented a bilingual parallel corpus of nearly 10K aligned French-Chinese segments, produced over three years in the course of the MACAU project. This corpus is released under a free license and will be periodically updated as new post-editions become available. To our knowledge, this is the first corpus of this kind to be published.
Since the multilingual access platform is open to everyone, this corpus can be extended by anyone by post-editing either pre-existing or newly uploaded documents, something that we encourage.
Although a large-scale evaluation of the usefulness of the platform will be carried out in the near future, we have already observed that the process of post-editing improves understanding and exam grades. An example of this is a student whose exam grade rose from 2.5/20 to 11/20 after a month of post-editing material related to the subject. Undoubtedly, several factors were at play, but this appears to be an interesting avenue of investigation.

³ The details of the incremental training process are described here: http://www.statmt.org/moses/?n=Advanced.Incremental

Acknowledgements
The authors would like to express gratitude to the PédagoTICE initiative, as well as to Pr. Marie-Christine Rousset, Guillaume Huard and Pascal Lafourcade for their assistance.

References
Christian Boitet, Cong-Phap Huynh, Hong-Thai Nguyen and Valérie Bellynck. 2010. The iMAG concept: multilingual access gateway to an elected Web sites with incremental quality increase through collaborative post-edition of MT pretranslations. In Proceedings of TALN-2010, 8 p.

Ruslan Kalitvianski, Valérie Bellynck and Christian Boitet. 2015. Multilingual Access to Educational Material through Contributive Post-editing of MT Pretranslations by Foreign Students. In Proceedings of ICWL 2015, 10 p.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.

Lingxiao Wang. 2015. Outils et environnements pour l'amélioration incrémentale, la post-édition contributive et l'évaluation continue de systèmes de TA. Application à la TA français-chinois. PhD dissertation, Université de Grenoble.

Andreas Eisele and Yu Chen. 2010. MultiUN: A Multilingual Corpus from United Nation Documents. In Proceedings of the Seventh conference on International Language Resources and Evaluation, European Language Resources Association (ELRA), pp. 2868–2872.

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
