Seminar Sample Report
SEMINAR REPORT
Submitted by
LEARNER NAME
(Roll No. :)
of
In
YEAR
Table of Contents
Abstract
Chapter I – Introduction
    Motivation
Chapter II – Seminar Objectives
    Methodology
Chapter III – Literature Review
    Part 1: Indian Machine Translation Systems Studied
    Part 2: Internationally Recognized Work
Chapter IV – Noise Removal & Data Cleaning
    LR-based Strategy
    TR-based Strategy
    Corpus for Study Purpose
        SRILM for Language Modeling
        Moses
Chapter V – Development of Baseline & Improved System
Chapter VI – Results & Discussion
Chapter VII – Conclusion
References
List of Figures
List of Tables
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
Translation of human languages has been a need for the development of societies around the world, and has therefore been a subject of study for hundreds of years. The explosive growth of the internet has led to increasing demand for multilingual knowledge-based applications and services; translation is an effective tool for overcoming language barriers, and this has generated great interest in the language industry. This multilingualism has created the need for machine translation, so that information and services can be provided in any user's language of choice.
Designing an automatic machine translation system is a great challenge, since translating a sentence requires taking all of the relevant information into consideration. Even humans need several revisions to prepare a perfect translation, and no two human translators will generate identical translations of the same text in the same language pair. Cross-cultural understanding of the languages involved is also an important issue for maintaining the accuracy of a translation.
This report proposes a pre-processing method for noise removal that uses linguistic tools in the development of a Hindi-to-English machine translation system. In this translation framework, external linguistic tools are utilized to add linguistic information to the parallel corpora.
Motivation
Although Machine Translation was envisioned as a computer application as early as the 1950s, it is still considered an open problem. The demand for machine translation is growing rapidly, with around 20% of web pages and other resources on the World Wide Web available in national languages. In a linguistically diverse country like India, there are many reasons to consider Indian languages as objects of study. Although human-assisted translation has been widely prevalent in India since ancient times, these languages are still scarcely resourced. With nearly half a billion native speakers, Indian languages demonstrate a variety of divergent and understudied linguistic phenomena, and they exhibit a high degree of morphological complexity relative to English. Because of their highly agglutinative nature, the use of automation is largely restricted to word processing, and there is therefore a large market for translation between English and the various Indian languages.
In recent years, statistical machine translation (SMT) has demonstrated competitive translation performance compared to other translation strategies. The main advantage of SMT is that it can automatically learn translation knowledge from very large-scale training data. However, the performance of an SMT system relies heavily on the quality and quantity of both the training and the development data: the training data is used to build the translation and language models, while the development data is used to optimize the parameters through minimum error rate training (MERT).
A lot of previous work has focused on increasing the quantity of training data and collecting more bilingual data to achieve higher accuracy and better performance for Hindi-to-English translation, without paying much attention to the quality of the collected data. Clearly, merely expanding the training data will not improve the results if there is a large dissimilarity between the training data and the test data. In addition to this, if the quality of
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
Our major objective is to improve the quality of the machine translation system; in order to achieve this, we propose the following two objectives:
Because improvements in corpus quality are not feasible for all machine translation systems, this seminar is restricted to machine translation systems for the Hindi-English language pair.
Methodology
It is a known fact that in the Statistical Machine Translation model, probability information is extracted from the training data and the translation parameters are tuned on the basis of the development data. The target-language sentences are therefore generated on the basis of both the probabilities and the parameters.
The presence of noise in the translation data affects its quality, by causing improper word alignments or importing errors into translation rules, and hence reduces the performance of the Statistical Machine Translation system.
It is also commonly assumed that better performance is achieved only when more data is available; however, more data also reduces translation speed and costs more computing resources.
To filter the noise, we propose two simple policies: the Length Ratio (LR) policy and the Translation Ratio (TR) policy. Under the former, if a sentence pair is parallel, the length ratio of the two corresponding sentences should fall within a certain bound. Under the latter, if two sentences correspond, the translation of a word in the source-language sentence has a high probability of appearing in the target-language sentence.
In order to reduce the size of the training data, we propose selecting the sentences that cover the most information in the original corpus. Because a word is a special case of a phrase, the main focus is on phrase coverage and structure coverage for this research problem.
For the phrase-coverage-based approach, phrases are selected on the basis of the information they contain and their length; these constitute the new development data. In the structure-coverage-based approach, the sentences that cover the majority of the structures in the development data are extracted. The sentences that provide the most information are proposed for constituting the development data; a sketch of one such selection follows.
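The selection procedure itself is not spelled out at this point; one plausible greedy realisation of the phrase-coverage idea, assuming each sentence is scored by the number of still-uncovered n-grams (candidate phrases) it contributes per word, is sketched below. This is an illustrative criterion, not the report's exact method:

    def ngrams(tokens, max_n=4):
        """All n-grams (candidate phrases) of a token list, up to length max_n."""
        return [tuple(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)]

    def greedy_coverage_select(sentences, budget):
        """Pick up to `budget` sentences that add the most uncovered n-grams
        per token, favouring short but informative sentences."""
        covered = set()
        selected = []
        pool = list(enumerate(sentences))
        for _ in range(min(budget, len(pool))):
            def gain(item):
                _, toks = item
                new = sum(1 for g in set(ngrams(toks)) if g not in covered)
                return new / max(len(toks), 1)
            best = max(pool, key=gain)
            pool.remove(best)
            index, toks = best
            covered.update(ngrams(toks))
            selected.append(index)
        return selected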
The basics of the machine translation process were studied in this period, and it was identified that the standard machine translation process is divided into three major steps:
1. Training
i. In the training procedure, the corpus-preparation steps are followed: along with the basic steps of tokenization and truecasing, which maintain consistency within the corpus, the semantic noise in the given corpora is also removed.
ii. The pre-processed training data is then passed through an aligner, which computes the word and phrase alignments, along with probability scores for word and phrase substitution respectively.
iii. It was also identified that GIZA++ is used for the experiments; it is based on the IBM models and computes the alignment using the Expectation-Maximization (EM) algorithm (a minimal sketch follows this list).
iv. Phrases are extracted from the phrase alignments and scored to build a phrase table, which is the most important step in phrase-based SMT. In this step, four different phrase translation scores are computed:
• Direct phrase translation probability
• Inverse phrase translation probability
• Direct lexical weighting
• Inverse lexical weighting
2. Tuning
Tuning is a procedure that finds the optimal weights to maximize the translation scores on the development corpus. The translation quality is calculated using BLEU (an automatic scoring metric) from the best translation and the corresponding reference sentence. Minimum error rate training (MERT) is performed for batch tuning: generally, a sentence is decoded into an n-best list and the model weights are updated based on the output. This procedure continues until a convergence criterion is reached, and the resulting optimal weights are used for all future translations.
3. Testing
Testing the system involves decoding a small corpus and comparing the best translation with the reference translation using an automatic evaluation metric. The scores of the n-best lattices from the various modules are combined in a log-linear model, and the highest-scoring sentence is taken as the system's output.
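As noted in step iii above, GIZA++ trains the IBM alignment models with the Expectation-Maximization algorithm. Purely as an illustration of the EM idea, and not of GIZA++'s actual implementation, a minimal IBM Model 1 trainer might look like this, where `parallel` is a list of (source tokens, target tokens) sentence pairs:

    from collections import defaultdict

    def ibm_model1(parallel, iterations=5):
        """Minimal IBM Model 1 EM sketch: returns t[(s, w)] = P(w | s)."""
        target_vocab = {w for _, tgt in parallel for w in tgt}
        t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform start
        for _ in range(iterations):
            count = defaultdict(float)  # expected counts c(w | s)
            total = defaultdict(float)  # per-source-word normalizers
            for src, tgt in parallel:
                src = ['NULL'] + list(src)  # allow alignment to the empty word
                for tw in tgt:
                    # E-step: spread one count for tw over all source words.
                    norm = sum(t[(sw, tw)] for sw in src)
                    for sw in src:
                        frac = t[(sw, tw)] / norm
                        count[(sw, tw)] += frac
                        total[sw] += frac
            # M-step: re-estimate the translation probabilities.
            for (sw, tw), c in count.items():
                t[(sw, tw)] = c / total[sw]
        return t

Each iteration redistributes fractional alignment counts and re-normalizes them into translation probabilities; GIZA++ builds the HMM and fertility-based IBM models on top of this same scheme.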
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
[Figure: Workflow of the proposed system. The corpus undergoes data cleaning and semantic noise removal; a language model is built and phrase alignment produces the translation and re-ordering models; tuning applies the weights; and decoding of the test set yields the target translation.]
It is now clear that different types of MT systems are required to meet widely differing
translation needs. Those identified so far include:
a) traditional MT systems for large organisations, usually within a restricted domain;
b) translation tools and workstations (with MT modules as options) designed for professional translators;
c) cheap PC systems for occasional translations;
d) the use of systems to obtain a rough gist for the purposes of surveillance or information gathering;
e) the use of MT for translating electronic mail and Web pages;
f) systems for monolinguals to translate standard messages into unknown languages;
g) systems for speech translation in restricted domains.
Some of these needs are being met or are the subject of active research, but there are many other possibilities, in particular combining MT with other applications of language technology (information retrieval, information extraction, summarization, etc.). As MT systems of many varieties become more widely known and used, the range of possible translation needs and of possible types of MT systems will become more apparent, and will stimulate research and development in directions not yet envisioned.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
“Noise” is a general term that refers to superfluous data found in plain text. It denotes invalid data such as garbled character codes, mismatched sentence alignments, and so on. In practice, noise enters the data throughout the standard data-collection steps.
A typical source of training data is crawled multilingual websites, which can be extremely noisy: they carry formatting information such as the colour, font and style attributes of the text, along with its position in the global layout of the web page. For example, during the process of document alignment, HTML tags can be considered noise.
The key issue is how to filter the noise when processing the data, because noise lowers the quality of the translation results. Although noise reduction has been studied widely, it remains a routine preprocessing step for this type of data, as noise in parallel corpora leads to incorrect training of the SMT models and accordingly degrades the overall performance of the framework. For parallel data, a pair of sentences that is far from being a good translation can be considered noisy. It may also be desirable to filter out data that is not normally designated as noise; for example, non-literal translations are not desirable for training phrase-based SMT systems.
Consider the problem of translating a noisy sentence e into a clean sentence h. SMT assumes that e was originally produced in the clean language and was corrupted into a noisy sentence by transmission over a noisy channel. The objective of SMT can therefore be defined as recovering the original clean sentence. In this research work, techniques are presented to pre-process the noisy text sentences e before translating them into clean text sentences h.
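In the standard noisy-channel formulation (stated here for completeness), the best clean sentence is the one that maximizes the posterior probability, which Bayes' rule decomposes into a translation model and a language model:

    h* = argmax_h P(h | e) = argmax_h P(e | h) · P(h)

Here P(e | h) plays the role of the (pseudo-)translation model and P(h) is a language model over clean sentences, so that recovering the clean sentence becomes a standard decoding problem.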
In this approach, the translation model is emulated during decoding by analyzing the noise of the tokens in the input sentence. This research shows that it is possible to clean noisy text in an unsupervised way by constructing ranked lists of possible clean tokens and then searching for the best clean sentence. Because several clean sentences may be available for a given noisy sentence, the statistical machine learning paradigm is exploited to let the decoder pick the best alternative as the final translation.
Noise can be categorized into two categories: format noise and semantic noise. Format noise includes noise such as XML/HTML tags, wrongly encoded characters or words, multi-byte symbols in English text (such as Greek or currency symbols), full-width and half-width letters, numbers, punctuation, and so on. Semantic noise can be understood as misaligned pairs of Hindi-English sentences, pairs mismatched in length, pairs wrongly swapped between the two sides, and so on. In this research work, the focus is on measures to handle problems in the semantic noise, which is categorized as:
In this seminar work, a pseudo-translation model of clean and noisy words is used, together with a language model trained on a large monolingual corpus. A decoder uses these models to search for the best clean sentence for a given noisy sentence. Scores for the pseudo-translation model are generated using a weighting function for each token in the given file, and these scores are used along with the language model probabilities to hypothesize the best clean sentence for a given noisy file.
The first experiment in noise reduction is at the sentence level: a model based on a parallel corpus is trained, followed by reducing the complexity of the parallel corpus by selecting more informative sentences.
In order to filter the noise in the data, two simple strategies are proposed: one based on the length ratio (LR) and the other based on the translation ratio (TR).
LR-based Strategy
According to Zhou, the length ratio of the two sentences in a parallel sentence pair should be distributed within a certain range; otherwise, the two sentences do not correspond to each other, and such pairs should be filtered out as noise. The length ratio of the two sentences in a sentence pair is defined as follows:
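A plausible definition, consistent with the survey reported later (which plots the ratio of Hindi sentence length to English sentence length), is

    LR(h, e) = len(h) / len(e)

where len(·) counts the words of a sentence, and a pair is kept only if LR lies within the chosen lower and upper bounds.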
TR-based Strategy
In a sentence pair, if the two sentences correspond closely to each other, their words should also correspond well, which means that the translation of a word in the source-language sentence should have a good chance of appearing in the target-language sentence. Therefore, the translation ratio is used as a measure to filter noise. If the TR value of a sentence pair is less than a given threshold, the pair is filtered out as noise. A bilingual dictionary is used to estimate the number of word pairs co-occurring in the two sentences. The translation ratio is calculated using the following formula:
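A plausible form, consistent with the description above, is the fraction of source-language words whose dictionary translation occurs in the target sentence:

    TR(h, e) = |{w ∈ h : some dictionary translation of w occurs in e}| / len(h)

A minimal Python sketch of both filters follows, assuming a word-level Hindi-to-English dictionary that maps each Hindi word to a set of English translations; the thresholds (0.4 to 1.6 for LR, 0.2 for TR) are the ones reported in the experiments later in this report:

    def length_ratio_ok(hi_tokens, en_tokens, lower=0.4, upper=1.6):
        """Keep a pair only if the Hindi/English word-count ratio is in range."""
        if not en_tokens:
            return False
        return lower <= len(hi_tokens) / len(en_tokens) <= upper

    def translation_ratio(hi_tokens, en_tokens, dictionary):
        """Fraction of Hindi words with a dictionary translation on the English side."""
        if not hi_tokens:
            return 0.0
        en_set = set(en_tokens)
        matched = sum(1 for w in hi_tokens if dictionary.get(w, set()) & en_set)
        return matched / len(hi_tokens)

    def filter_corpus(pairs, dictionary, tr_threshold=0.2):
        """Apply the LR filter and then the TR filter to (hindi, english) token pairs."""
        return [(h, e) for h, e in pairs
                if length_ratio_ok(h, e)
                and translation_ratio(h, e, dictionary) >= tr_threshold]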
Moses
Packaged as a complete SMT system, Moses offers several advantages and extra functionalities, along with several challenges. Its efficient data structures allow memory-intensive translation and language models to be used on limited hardware, and it supports automatic training of translation models for a number of language pairs. It is an open-source toolkit which includes GIZA++ and its training scripts, SRILM, IRSTLM, and a number of tools for processing parallel corpora.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
The various steps for the development of the Baseline System for the Hindi language were executed using IRSTLM & Moses. The sequence is shown below:
1. Preliminary cleaning
To train a translation system we need parallel data (text translated into two different
languages) which is aligned at the sentence level.
2. Corpus Preparation
To begin with, some special preprocessing of the raw data is carried out, such as:
• conversion of the encoding from UTF-16 to UTF-8;
• extraction of sentences enclosed within XML/HTML mark-up;
• removal of bad characters, pictures, etc., in addition to removal of English text from the Hindi file.
The following steps are performed to prepare the data for training the translation system (a simplified sketch follows the list):
a. Tokenisation: in this first step, spaces are inserted between words and punctuation.
b. Truecasing: to reduce the data-sparsity problem, the initial word of each sentence is converted to its most probable casing. For this research work, this step was performed only for English.
c. Cleaning: to avoid problems in the training pipeline, long sentences and empty sentences are removed.
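In the Moses pipeline these steps are normally run with the toolkit's own scripts (tokenizer.perl, train-truecaser.perl and truecase.perl, and clean-corpus-n.perl). Purely as a simplified illustration of what each step does, not a replacement for those scripts, a Python sketch might be:

    import re
    from collections import Counter

    def tokenize(line):
        """Crude stand-in for tokenizer.perl: put spaces around punctuation."""
        return re.sub(r'([.,!?;:()"])', r' \1 ', line).split()

    def train_truecaser(corpus):
        """Learn each word's most frequent casing from non-initial positions."""
        counts = Counter(w for sent in corpus for w in sent[1:])
        best = {}
        for w, c in counts.items():
            key = w.lower()
            if key not in best or c > counts[best[key]]:
                best[key] = w
        return best

    def truecase(tokens, model):
        """Convert the sentence-initial word to its most probable casing."""
        if not tokens:
            return tokens
        return [model.get(tokens[0].lower(), tokens[0])] + tokens[1:]

    def keep_pair(src, tgt, max_len=80):
        """Cleaning: drop empty and overly long sentence pairs."""
        return 0 < len(src) <= max_len and 0 < len(tgt) <= max_len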
5. Tuning
Tuning, the slowest part of the process, requires a small amount of parallel data that is separate from the training data. The commands to launch the tuning process are executed from the home directory. The end result of tuning is an ini file with trained weights, which is placed in ~/working/mert-work/moses.ini.
6. Testing
To make the decoder start quickly, the phrase table and the lexicalized reordering models are first binarised. A Hindi sentence can then be typed in to see the result. The decoder can now be tested by translating the test set and then running the BLEU script on it, which gives a series of BLEU scores; a compact BLEU sketch is given below.
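BLEU is the n-gram precision metric of Papineni et al. (reference [7] below): a geometric mean of modified n-gram precisions multiplied by a brevity penalty. A compact single-reference, sentence-level sketch, with add-one smoothing that the official corpus-level metric does not use, is:

    import math
    from collections import Counter

    def bleu(hypothesis, reference, max_n=4):
        """Sentence-level BLEU sketch (smoothed); token lists in, score in [0, 1]."""
        log_precision = 0.0
        for n in range(1, max_n + 1):
            hyp = Counter(tuple(hypothesis[i:i + n])
                          for i in range(len(hypothesis) - n + 1))
            ref = Counter(tuple(reference[i:i + n])
                          for i in range(len(reference) - n + 1))
            # Modified precision: clip each n-gram count by the reference count.
            overlap = sum(min(c, ref[g]) for g, c in hyp.items())
            total = max(sum(hyp.values()), 1)
            log_precision += math.log((overlap + 1) / (total + 1)) / max_n
        # Brevity penalty: penalize hypotheses shorter than the reference.
        bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
        return bp * math.exp(log_precision)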
To construct the experiments, 500,000 words were chosen randomly from the original corpus, with 20,000 sentences as the test set.
As already discussed, four methods of filtering the sentences containing noise are adopted. The first method is based on the LR approach; the second is based on the TR approach; the third is based on the combination of the LR & TR approaches, referred to as the Combination Approach (CR); and the fourth is an extension of the LR approach with a special focus on three types of noise, viz. misaligned pairs, mismatched pairs and wrongly swapped pairs, referred to as the LR Variant (LRV) approach.
• Using LR Approach
A survey to identify the ratio of Hindi sentence length to English sentence length was carried out on bilingual Hindi-English data containing 10,000 sentence pairs. The statistical results are presented in Fig. 1, where the number of bilingual sentence pairs is denoted on the vertical axis and the length ratio of Hindi sentence length to English sentence length on the horizontal axis.
To apply this approach, all Hindi sentences were tokenized first, followed by the calculation of their lengths based on the number of words present.
It can easily be inferred from the figure that the length ratios are distributed within a range of roughly 0.1~4, and that more than 90% lie between 0.4 and 1.6. Therefore, these two values are taken as the thresholds: a sentence pair is considered only if the length ratio of its two sentences is between 0.4 and 1.6; otherwise it is discarded.
• Using TR Approach
As can easily be inferred from Fig. 2, the translation ratio of 80% of the sentence pairs is higher than 0.2; therefore, this value is taken as the threshold. Thus, all sentence pairs with a translation ratio of less than 0.2 are considered noisy and are eligible for discarding.
For this approach, the sentence length is limited to 40 for each sub-system, and the BLEU score is calculated on an in-house test set using the parameters tuned on development sets containing 983 generalized sentences.
Other statistics of this in-house corpus, which are helpful in understanding its structure, are presented below:
These sets of experiments were then applied to the larger corpora available, where additional data was used for language modeling. For the evaluation, NIST & TER scores were also calculated to verify the results, and the BLEU, NIST & TER scores obtained are as shown in
The BLEU score of each sub-system is compared against the Baseline System as well as the Combined System.
It is noted that many words in the source-language set are translated to just a single target output. This is due to poor tokenization for regional languages, as a standardized tokenizer is still not available for them. However, the corpora used for the English-Hindi language pair were the most domain-relevant and also the biggest in size, and they therefore resulted in significantly better translation.
Looking at these results, it can be concluded that the size and relevance of the parallel corpus have a direct relationship with the quality of the translation. It can also be inferred that data cleaning plays an even more important role in improving the translation quality.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
In this research work, approaches to data cleaning are proposed, based on removing noise and selecting more informative sentences from the bilingual corpus.
In the data-preparation stage, a number of measures are applied to the training data to remove semantic noise, such as the removal of misaligned sentences, sentences mismatched in length, and wrongly swapped sentences, in order to obtain better word alignments and phrase tables and thus better translation quality.
For the training corpus, sentences were selected to build a compact training corpus using the weighted-phrase method. With the newly generated compact training corpus, competitive performance was achieved compared to the baseline system that used all of the training data, thus reducing the computing resources required and improving the translation speed.
According to the experimental results achieved for the five-fold translation tasks, the following can be summarized:
1. As the scale of the data for MT training is very large, implementation through distributed/parallel computing is the trend for overcoming this problem.
2. Data cleaning plays a vital role in improving translation quality, hence the need to design smarter, automatic systems to optimize the training data.
3. The robustness of the Combination Approach needs to be improved so that it can better adapt to cases of low relevance between the development and test sets.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
[1]. P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer and P. S. Roossin: A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), June 1990.
[2]. P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, June 1993.
[3]. J. Hutchins: Research Methods and System Designs in Machine Translation: a Ten-Year Review, 1984-1994. International conference 'Machine Translation: Ten Years On', Cranfield University, England, 12-14 November 1994.
[4]. D. Jurafsky and J. H. Martin: Speech and Language Processing, 2nd Edition. May 2008. ISBN-10: 0131873210.
[5]. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst: Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, Demo and Poster Sessions.
[6]. A. Lopez: Statistical Machine Translation. ACM Computing Surveys, 40(3), Article 8, August 2008.
[7]. K. Papineni, S. Roukos, T. Ward and W.-J. Zhu: BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of the 40th ACL, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318.
[8]. F. J. Och and H. Ney: A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, March 2003.
[9]. F. J. Och: Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the ACL, pp. 160-167, July 2003, Sapporo, Japan.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>