2008 - Learning Transfer Rules For Machine Translation From Parallel Corpora
ABSTRACT: In this paper we present JETCAT, a Japanese-English transfer-based machine translation system. Our main research contribution is that the transfer rules are not handcrafted but are learnt automatically from a parallel corpus. The system has been implemented in Amzi! Prolog, which offers scalability for large rule bases, full Unicode support for Japanese characters, and several APIs for the seamless integration of the translation functionality into common office environments. As a first user interface we have developed a translation environment under Microsoft Word. The dynamic nature of our system allows for an easy customization of the rule base according to the user's personal preferences by simply post-editing the translation results, which leads to an automatic update. The user interface for Microsoft Word also provides the possibility to display token lists, parse trees, and transfer rules, which makes JETCAT a very useful tool for language students as well.
Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Machine translation; I.2.8 [Problem Solving, Control Methods, and Search]: Graph and tree search strategies

General Terms
Machine translation, Unicode

Keywords: Japanese-English translation, Language transfer rules

Received 10 Sep. 2007; Reviewed and accepted 27 Jan. 2008
1. Introduction
Despite the large amount of effort invested in the development of machine translation systems, the achieved translation quality is still appalling. One major reason is the inability of most systems to learn from translation errors through incremental updates of the rule base.
In our research we use the bilingual data from the JENAAD corpus [27], comprising 150,000 Japanese-English sentence pairs from news articles, to learn specific transfer rules from the examples through structural matching between the parse trees for the source and target language. The rules are generalized in a consolidation phase to avoid overtraining and to increase the coverage for new input. Based on the resulting rule base we have developed JETCAT (Japanese-English Translation using Corpus-based Acquisition of Transfer rules), a Japanese-English transfer-based machine translation system.

Our current research work originates from the PETRA project (Personal Embedded Translation and Reading Assistant) [29], in which we had developed a translation system from Japanese into German. One main problem for that language pair was the lack of training material, i.e. high-quality Japanese-German parallel corpora, which was one of the reasons why we shifted our research focus to Japanese-English.

For the implementation of our machine translation system we have chosen Amzi! Prolog because it provides an expressive declarative programming language within the Eclipse Platform. It offers the powerful unification operations required for the efficient application of the transfer rules and full Unicode support, so that Japanese characters can be used as textual elements in the Prolog source code. Amzi! Prolog has also proven its scalability during past projects in which we accessed large bilingual dictionaries stored as fact files with several hundred thousand facts. Finally, it offers several APIs, which make it possible to run the translation program in the background so that users can invoke the translation functionality from their familiar text editor. For example, we have developed a prototype interface for Microsoft Word using Visual Basic macros.

The rest of the paper is organized as follows. We first provide a brief discussion of related work in Sect. 2 and an overview of the system architecture in Sect. 3. In Sect. 4 we describe the preprocessing stages that prepare the input for the acquisition and application of transfer rules. Section 5 gives a formal account of the three generic types of transfer rules that we use in our system, along with several illustrative examples. The processing steps for the acquisition of new rules and the subsequent consolidation phase to avoid overtraining are presented in Sect. 6. Finally, Sect. 7 focuses on the application of the transfer rules during translation and the generation of the final natural language output. We close the paper with some concluding remarks and an outlook on future work in Sect. 8.

2. Related Work
Research on machine translation has a long tradition (for good overviews see [6, 7, 12, 21]). The state of the art is that there are quite good solutions for narrow application domains with a limited vocabulary and concept space. However, it is the general opinion that fully automatic high-quality translation without any limitations on the subject and without any human intervention is far beyond the scope of today's machine translation technology, and there is serious doubt that it will ever be possible in the future [8, 10].
It is disappointing that translation quality has not improved much in the last 10 years [26]. One main obstacle to better quality is that most current machine translation systems are not able to learn from their mistakes [9]. Most translation systems consist of large static rule bases with limited coverage, which have been compiled manually with huge intellectual effort. All the valuable effort spent by users on post-editing translation results is usually lost for future translations.
As a solution to this knowledge acquisition bottleneck, corpus-based machine translation tries to learn the transfer knowledge automatically from large bilingual corpora for the language pair [3, 15]. Statistical machine translation [2], in its pure form, uses no additional linguistic knowledge: it trains a statistical translation model and a target language model. The two models are used to assign probabilities to translation candidates and to choose the candidate with the maximum score. For the first few years the translation model was built only at the word level; however, as the limitations became apparent, several extensions towards phrase-based translation [16, 18, 22] and syntax-based translation [13, 28, 30] have been proposed. Although some improvements in translation quality could be achieved, statistical machine translation still shares one main disadvantage with rule-based translation: an incremental adaptation of the model by the user is usually impossible.
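The scoring scheme of pure statistical machine translation can be made concrete with a small Python sketch; the candidate sentences and probability values below are invented purely for illustration and are not taken from any of the cited systems.

# Pure statistical MT picks the candidate e maximizing P(e | j), which is
# proportional to P(j | e) * P(e): translation model times target language model.
# All candidates and probabilities are invented toy values.

def best_candidate(candidates):
    # candidates: list of (english_sentence, p_translation_model, p_language_model)
    return max(candidates, key=lambda c: c[1] * c[2])

candidates = [
    ("he recognized the importance of the reform", 0.020, 0.0100),
    ("he the importance of the reform recognized", 0.020, 0.0004),
    ("it recognized importance of the reform",     0.004, 0.0080),
]

best = best_candidate(candidates)
print(best[0])   # the highest product of the two model scores wins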
The most prominent approach for the translation of Japanese has been example-based machine translation [11, 20, 25]. It uses a parallel corpus to create a database of translation examples for source language fragments. The different approaches vary in how they represent these fragments: as surface strings, structured representations, generalized templates with variables, etc. [1, 4, 5, 14]. However, most of the representations of translation examples used in example-based systems of reasonable size have to be manually crafted or at least reviewed for correctness [23] to achieve sufficient accuracy.
3. System Architecture
The three main tasks that we have to perform in our system are acquisition, consolidation, and translation. During acquisition (see Fig. 1) we derive new transfer rules by using a Japanese-English sentence pair as input. Both sentences are first analyzed by the tagging modules, which produce the correct segmentations into word tokens associated with their part-of-speech tags. The token lists are then transformed into parse trees by the parsing modules. The parse trees represent the input to the acquisition module, which uses a structural matching algorithm to discover new transfer rules.

Whereas the transfer rules learnt during acquisition are very accurate and guarantee consistent translations, this specificity reduces the coverage for new unseen data. Therefore, the consolidation step generalizes transfer rules as long as such relaxations do not result in conflicts with other rules in the rule base.

Finally, we perform the translation of a Japanese sentence by first tagging and parsing the sentence and then invoking the transfer module. It applies the transfer rules stored in the rule base to transform the Japanese parse tree into the corresponding English parse tree. The latter is the input to the generation module, which produces the surface form of the English sentence as a character string. In the following sections we give a more detailed technical description of the individual modules. We illustrate their mode of operation by using the sentence pair in Fig. 2 as a running example throughout the rest of this paper.
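To summarize the data flow between the modules, the following Python sketch outlines the three tasks; it is our own simplified illustration with placeholder function bodies and invented names, not the actual Amzi! Prolog implementation.

# Simplified outline of acquisition, consolidation, and translation.
# Function bodies are stubs that only document the data flow.

def tag(sentence):
    """Segment a sentence into tokens annotated with part-of-speech tags (stub)."""
    return [(word, "pos?") for word in sentence.split()]

def parse(tokens):
    """Build a parse tree from the token list (stub: a flat tree)."""
    return {"cat": "sentence", "args": tokens}

def acquire(rule_base, jp_sentence, en_sentence):
    """Derive new transfer rules by structural matching of the two parse trees."""
    jp_tree = parse(tag(jp_sentence))
    en_tree = parse(tag(en_sentence))
    rule_base.append(("rule-placeholder", jp_tree, en_tree))   # stand-in for matching

def consolidate(rule_base):
    """Generalize rules as long as the relaxation conflicts with no other rule (stub)."""
    return rule_base

def translate(rule_base, jp_sentence):
    """Tag and parse the input, transfer the tree, and generate the surface form."""
    jp_tree = parse(tag(jp_sentence))
    en_tree = jp_tree                  # the transfer step would rewrite the tree here
    return " ".join(word for word, _ in en_tree["args"])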
4. Tagging and Parsing
We use Python scripts for the basic string operations to import the sentence pairs from the JENAAD corpus. For the part-of-speech tagging of Japanese sentences we use a Japanese lexicon, which was compiled automatically by applying the Japanese morphological analysis system ChaSen [19] to the JENAAD corpus. Figure 3 shows the result produced by the tagging module. It segments the Japanese input into morphemes and tags each morpheme with its pronunciation, base form, part-of-speech, conjugation type, and conjugation form. The last three columns are encoded as numerical categories. For the ease of the reader we have added Roman transcriptions, approximate English equivalents of the numerical categories, and contextual English translations.

The Japanese language uses postpositions instead of prepositions. The postposition O indicates the direct object, the theme particle WA the subject (in this sentence). The verb SURU is used to derive verbs from sahen nouns (e.g. recognize from recognition), and the phrase DE ARU KOTO changes an adjectival noun into a noun (e.g. important into importance). Finally, the verb suffix RERU indicates passive voice and the auxiliary verb TA past tense.

The acquisition of the Japanese lexicon is performed in two steps. We first create a lexicon entry for each Japanese word in the corpus based on the tags assigned by ChaSen. For words with several different tags, we store one default tag and, in a second step, learn word sense disambiguation rules based on the local context. We had to correct some wrong tag assignments by ChaSen, in particular regarding different uses of postpositions.
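The two-step lexicon acquisition can be pictured with a small Python sketch; the entries, tags, and the single context rule below are invented examples in Roman transcription and do not show the actual lexicon format, which is stored as Prolog fact files.

# Step 1: one default tag per lexicon entry; step 2: context rules, learnt from
# the corpus, override the default for ambiguous words. All data are invented.

lexicon = {
    "DE":    "case_particle",     # DE can also be a form of the copula
    "KAIGI": "noun",
}

# (word, tag of the previous token) -> corrected tag; a stand-in for the learnt
# word sense disambiguation rules based on the local context.
context_rules = {
    ("DE", "adjectival_noun"): "copula",   # e.g. JUUYOU DE ARU ("to be important")
}

def tag_word(word, previous_tag):
    default = lexicon.get(word, "unknown")
    return context_rules.get((word, previous_tag), default)

print(tag_word("DE", "noun"))             # -> case_particle (default tag)
print(tag_word("DE", "adjectival_noun"))  # -> copula (context rule fires)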
5. Transfer Rules
One characteristic of our approach is that we model all
translation problems with only three generic types of transfer
rules. The transfer rules are stored as Prolog facts in the
rule base. We have defined three Prolog predicates for the
three different rule types. Therefore, for the rest of this paper,
when we talk about “transfer rules” we always refer to transfer
rules for machine translation in the more general sense, not
to logical rules in the strict sense of Prolog terminology. We
also use the terms “equal” and “equality” synonymously with
“unifiable” and “unifiability”.
In the next subsections we give an overview of the three
different rule types along with illustrative examples. For the
ease of the reader we use Roman transcription for the
Japanese examples instead of the original Japanese
characters.
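Since the concrete formats of the three rule types are introduced in the following subsections, we only sketch the shared mechanism here in Python: a rule pairs a source-side pattern with a target-side pattern, patterns may contain variables, and a rule is applicable when its source pattern is unifiable with a fragment of the Japanese parse tree. The tree encoding and the example rule are invented for illustration and do not show the actual Prolog fact format.

# Schematic illustration of "unifiability" between a rule pattern and a parse-tree
# fragment. Trees are nested tuples, variables are strings starting with "?",
# and both the rule and the fragment are invented examples.

def match(pattern, tree, bindings=None):
    """Return the variable bindings if pattern and tree are unifiable, else None."""
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings and bindings[pattern] != tree:
            return None
        bindings[pattern] = tree
        return bindings
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        if len(pattern) != len(tree):
            return None
        for p, t in zip(pattern, tree):
            bindings = match(p, t, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == tree else None

# Invented word-level rule: translate the head noun KAIGI to "conference" and
# carry over whatever else the noun phrase contains.
rule_source = ("np", ("head", "KAIGI"), "?Rest")
rule_target = ("np", ("head", "conference"), "?Rest")

fragment = ("np", ("head", "KAIGI"), ("mod", "KOKUSAI"))
bindings = match(rule_source, fragment)
if bindings is not None:
    # shallow substitution is sufficient for this flat example
    result = tuple(bindings.get(part, part) for part in rule_target)
    print(result)   # ('np', ('head', 'conference'), ('mod', 'KOKUSAI'))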
The surface form of the English sentence is computed recursively from the subconstituents of the parse tree as a nested list and flattened afterwards. The correct surface forms for words with irregular inflections are determined by accessing the English lexicon.

Since the order of the subconstituents in the argument of a complex constituent could have been arbitrarily rearranged through the application of phrase transfer rules, the generation module cannot derive the original sequence of several subconstituents with identical category (e.g. several modifying adjective phrases) from the information in the parse tree. However, maintaining the original sequence in the translation is an important default choice in such a case. Therefore, we have added an additional processing step after parsing a Japanese source sentence in which we add a sequence number as a simple constituent seq(Seq) to each argument of a complex constituent. As a consequence we had to extend the transfer component so that it ignores but preserves this sequence information during the application of transfer rules.
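The interplay of recursive surface generation, flattening, and the seq(Seq) constituents can be illustrated with a short Python sketch; the tree encoding and the example noun phrase are our own invention and only mimic the behaviour described above.

# Recursive surface generation: complex constituents are rendered by first
# ordering their arguments by the seq number attached during parsing, then
# concatenating the recursively generated parts and flattening the result.
# The tree format and the example are invented for illustration.

def generate(constituent):
    kind = constituent[0]
    if kind == "word":                   # simple constituent carrying a surface form
        return [constituent[1]]
    if kind == "seq":                    # the sequence marker itself produces no text
        return []
    # complex constituent: (category, [argument, ...])
    args = sorted(constituent[1], key=seq_number)
    words = []
    for arg in args:
        words.extend(generate(arg))      # nested lists are flattened here
    return words

def seq_number(arg):
    """seq value of an argument (markers first, arguments without seq last)."""
    if arg[0] == "seq":
        return -1
    if isinstance(arg[1], list):
        for part in arg[1]:
            if part[0] == "seq":
                return part[1]
    return 10**6

tree = ("np", [
    ("ap", [("seq", 2), ("word", "economic")]),
    ("ap", [("seq", 1), ("word", "new")]),
    ("word", "policy"),
])

print(" ".join(generate(tree)))   # -> "new economic policy"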
8. Conclusion
In this paper we have presented JETCAT, a Japanese-English machine translation system based on the automatic acquisition of transfer rules from a parallel corpus. We have finished the implementation of the system and demonstrated the feasibility of the approach based on a small subset of the JENAAD corpus.
We have also developed a convenient prototype interface to Microsoft Word (see Fig. 9 for an example screenshot). The user can simply click on a sentence and translate it. In addition, it is possible to view the token lists and parse trees for both source and target language, the application sequence of transfer rules during translation, and the resulting parse tree after the transfer step. Finally, the user can choose a Japanese-English sentence pair and watch the details of the acquisition and consolidation process. This enables the user to correct the result of a translation and verify the consequences for acquisition and consolidation before updating the rule base. The resulting system represents a valuable tool for language learning and customized machine translation.

Future work will focus on extending the coverage of the system so that we can process the full JENAAD corpus and perform a thorough quantitative evaluation of the translation quality using tenfold cross-validation. We also plan to make our system available to students of Japanese Studies at our university in order to receive valuable feedback from practical use. We have also just started implementing a Web interface using Java Server Faces with interactive features to manipulate lexical, syntactic, and translation knowledge. For example, it will be possible to correct parse trees by simple drag-and-drop and editing actions at the Web client. After submitting the changes, the grammar will be instantly updated. In the same way we plan to customize lexical data and transfer rules. The user only has to revise translation results to customize the rule base according to his personal preferences.

We have also agreed on a new research collaboration with Korean researchers from the Korea Advanced Institute of Science and Technology (KAIST) in September 2007. Due to the similar nature of Japanese and Korean, there is a high potential for a straightforward port to Korean-English machine translation. One important intended extension of the machine translation framework is the incorporation of ontological knowledge to improve translation quality and to address advanced text understanding tasks such as question answering or summarization. Finally, the translation knowledge shall serve as a valuable resource for ontology mining and refinement.

References
[1] Brockett, C. et al. (2002). English-Japanese example-based machine translation using abstract linguistic representations. Proceedings of the COLING-2002 Workshop on Machine Translation in Asia.