Seminar Sample Report
SEMINAR REPORT
Submitted by
LEARNER NAME
(Roll No. :)
of
In
YEAR
Table of Contents
Abstract
Chapter I – Introduction
    Motivation
Chapter II – Seminar Objectives
    Methodology
Chapter III – Literature Review
    Part 1: Indian Machine Translation Systems Studied
    Part 2: Internationally Recognized Work
Chapter IV – Noise Removal & Data Cleaning
    LR-based Strategy
    TR-based Strategy
    Corpus for Study Purpose
        SRILM for Language Modeling
        Moses
Chapter V – Development of Baseline & Improved System
Chapter VI – Results & Discussion
Chapter VII – Conclusion
References
List of Figures
List of Tables
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
Translation of human languages has been a need for the development of societies around the world, and has therefore been a subject of study for hundreds of years. The explosive growth of the internet has led to increasing demand for multilingual knowledge-based applications and services; translation is an effective tool for overcoming language barriers, and this has generated great interest in the language industry. This multilingualism has created the need for machine translation, so that information and services can be provided in any user's language of choice.
Designing an automatic machine translation system is a great challenge, since translating a sentence requires taking all of the relevant information into consideration. Even humans need several revisions to prepare a perfect translation, and no two human translators will generate identical translations of the same text in the same language pair. Cross-cultural understanding of the languages involved is also an important issue for maintaining the accuracy of a translation.
This report proposes a pre-processing method for noise removal that uses linguistic tools in the development of a Hindi-to-English machine translation system. In this translation framework, external linguistic tools are utilized to add linguistic information to the parallel corpora.
Motivation
Although Machine Translation was envisioned as a computer application as early as the 1950s, it is still considered an open problem. The demand for machine translation is growing rapidly, with around 20% of web pages and other resources on the World Wide Web available in national languages. In a linguistically diverse country like India, there are many reasons to consider Indian languages as objects of study. Although human-assisted translation has been widely prevalent in India since ancient times, these languages are still scarcely resourced. With nearly half a billion native speakers, Indian languages demonstrate a variety of divergent and understudied linguistic phenomena, and they exhibit a high degree of morphological complexity relative to English. Because of their highly agglutinative nature, the use of automation is largely restricted to word processing, and there is therefore a large market for translation between English and the various Indian languages.
In recent years, statistical machine translation (SMT) has demonstrated competitive translation performance compared to other translation strategies. The main advantage of SMT is that it can automatically learn translation knowledge from very large-scale training data. However, the performance of an SMT system relies heavily on the quality and quantity of both the training and the development data: the training data is used to build the translation and language models, while the development data is used to optimize the parameters through minimum error rate training (MERT).
A lot of previous work has focused on increasing the quantity of training data and collecting more bilingual data to achieve higher accuracy and better performance for Hindi-to-English translation, without paying much attention to the quality of the collected data. Clearly, merely expanding the training data will not improve the results if there is a large dissimilarity between the training data and the test data. In addition to this, if the quality of
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
Our major objective is to improve the quality of the machine translation system; in order to achieve this, we propose the following two objectives:
Because improvements in corpus quality are not feasible for all machine translation systems, this seminar is restricted to machine translation systems for the Hindi-English language pair.
Methodology
It is a known fact that in the Statistical Machine Translation model, probability information is extracted from the training data and the translation parameters are tuned on the basis of the development data. The target-language sentences are therefore generated on the basis of both the probabilities and the parameters.
The presence of noise in the translation data affects its quality, by causing improper word alignments or importing errors into translation rules, and hence reduces the performance of the Statistical Machine Translation system.
It is also commonly assumed that better performance is achieved only when more data is available; however, more data also reduces translation speed and costs more computing resources.
To filter the noise, we propose two simple policies: the Length Ratio (LR) policy and the Translation Ratio (TR) policy. Under the former, if a sentence pair is parallel, the length ratio of the two corresponding sentences should fall within a certain bound. Under the latter, if two sentences correspond, the translation of a word in the source-language sentence has a high probability of appearing in the target-language sentence.
In order to reduce the size of the training data, we propose selecting the sentences that cover the most information in the original corpus. Because a word is a special case of a phrase, the main focus is on phrase coverage and structure coverage for this research problem.
For the phrase-coverage-based approach, phrases are selected on the basis of the information they contain and their length; these constitute the new development data. In the structure-coverage-based approach, the sentences that cover the majority of the structures in the development data are extracted. The sentences that provide the most information are proposed for constituting the development data; a sketch of one such selection follows.
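The selection procedure itself is not spelled out at this point; one plausible greedy realisation of the phrase-coverage idea, assuming each sentence is scored by the number of still-uncovered n-grams (candidate phrases) it contributes per word, is sketched below. This is an illustrative criterion, not the report's exact method:

    def ngrams(tokens, max_n=4):
        """All n-grams (candidate phrases) of a token list, up to length max_n."""
        return [tuple(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)]

    def greedy_coverage_select(sentences, budget):
        """Pick up to `budget` sentences that add the most uncovered n-grams
        per token, favouring short but informative sentences."""
        covered = set()
        selected = []
        pool = list(enumerate(sentences))
        for _ in range(min(budget, len(pool))):
            def gain(item):
                _, toks = item
                new = sum(1 for g in set(ngrams(toks)) if g not in covered)
                return new / max(len(toks), 1)
            best = max(pool, key=gain)
            pool.remove(best)
            index, toks = best
            covered.update(ngrams(toks))
            selected.append(index)
        return selected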
The basics of the machine translation process were studied in this period, and it was identified that the standard machine translation process is divided into three major steps:
1. Training
i. In the training procedure, the corpus-preparation steps are followed: along with the basic steps of tokenization and truecasing, which maintain consistency within the corpus, the semantic noise in the given corpora is also removed.
ii. The pre-processed training data is then passed through an aligner, which computes the word and phrase alignments, along with probability scores for word and phrase substitution respectively.
iii. It was also identified that GIZA++ is used for the experiments; it is based on the IBM models and computes the alignment using the Expectation-Maximization (EM) algorithm (a minimal sketch follows this list).
iv. Phrases are extracted from the phrase alignments and scored to build a phrase table, which is the most important step in phrase-based SMT. In this step, four different phrase translation scores are computed:
• Direct phrase translation probability
• Inverse phrase translation probability
• Direct lexical weighting
• Inverse lexical weighting
2. Tuning
Tuning is a procedure that finds the optimal weights to maximize the translation scores on the development corpus. The translation quality is calculated using BLEU (an automatic scoring metric) from the best translation and the corresponding reference sentence. Minimum error rate training (MERT) is performed for batch tuning: generally, a sentence is decoded into an n-best list and the model weights are updated based on the output. This procedure continues until a convergence criterion is reached, and the resulting optimal weights are used for all future translations.
3. Testing
Testing the system involves decoding a small corpus and comparing the best translation with the reference translation using an automatic evaluation metric. The scores of the n-best lattices from the various modules are combined in a log-linear model, and the highest-scoring sentence is taken as the system's output.
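As noted in step iii above, GIZA++ trains the IBM alignment models with the Expectation-Maximization algorithm. Purely as an illustration of the EM idea, and not of GIZA++'s actual implementation, a minimal IBM Model 1 trainer might look like this, where `parallel` is a list of (source tokens, target tokens) sentence pairs:

    from collections import defaultdict

    def ibm_model1(parallel, iterations=5):
        """Minimal IBM Model 1 EM sketch: returns t[(s, w)] = P(w | s)."""
        target_vocab = {w for _, tgt in parallel for w in tgt}
        t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform start
        for _ in range(iterations):
            count = defaultdict(float)  # expected counts c(w | s)
            total = defaultdict(float)  # per-source-word normalizers
            for src, tgt in parallel:
                src = ['NULL'] + list(src)  # allow alignment to the empty word
                for tw in tgt:
                    # E-step: spread one count for tw over all source words.
                    norm = sum(t[(sw, tw)] for sw in src)
                    for sw in src:
                        frac = t[(sw, tw)] / norm
                        count[(sw, tw)] += frac
                        total[sw] += frac
            # M-step: re-estimate the translation probabilities.
            for (sw, tw), c in count.items():
                t[(sw, tw)] = c / total[sw]
        return t

Each iteration redistributes fractional alignment counts and re-normalizes them into translation probabilities; GIZA++ builds the HMM and fertility-based IBM models on top of this same scheme.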
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
[Figure: Workflow of the proposed system. The corpus undergoes data cleaning and semantic noise removal; a language model is built and phrase alignment produces the translation and re-ordering models; tuning applies the weights; and decoding of the test set yields the target translation.]
It is now clear that different types of MT systems are required to meet widely differing
translation needs. Those identified so far include:
a) traditional MT systems for large organisations, usually within a restricted domain;
b) translation tools and workstations (with MT modules as options) designed for professional translators;
c) cheap PC systems for occasional translations;
d) the use of systems to obtain a rough gist for the purposes of surveillance or information gathering;
e) the use of MT for translating electronic mail and Web pages;
f) systems for monolinguals to translate standard messages into unknown languages;
g) systems for speech translation in restricted domains.
Some of these needs are being met or are the subject of active research, but there are many other possibilities, in particular combining MT with other applications of language technology (information retrieval, information extraction, summarization, etc.). As MT systems of many varieties become more widely known and used, the range of possible translation needs and of possible types of MT systems will become more apparent, and will stimulate research and development in directions not yet envisioned.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
“Noise” is a general term that refers to superfluous data found in plain text. It denotes invalid data such as garbled character codes, mismatched sentence alignments, and so on. In practice, noise enters the data throughout the standard data-collection steps.
A typical source of training data is crawled multilingual websites, which can be extremely noisy: they carry formatting information such as the colour, font and style attributes of the text, along with its position in the global layout of the web page. For example, during the process of document alignment, HTML tags can be considered noise.
The key issue is how to filter the noise when processing the data, because noise lowers the quality of the translation results. Although noise reduction has been studied widely, it remains a routine preprocessing step for this type of data, as noise in parallel corpora leads to incorrect training of the SMT models and accordingly degrades the overall performance of the framework. For parallel data, a pair of sentences that is far from being a good translation can be considered noisy. It may also be desirable to filter out data that is not normally designated as noise; for example, non-literal translations are not desirable for training phrase-based SMT systems.
Consider the problem of translating a noisy sentence e into a clean sentence h. SMT assumes that e was originally produced in the clean language and was corrupted into a noisy sentence by transmission over a noisy channel. The objective of SMT can therefore be defined as recovering the original clean sentence. In this research work, techniques are presented to pre-process the noisy text sentences e before translating them into clean text sentences h.
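In the standard noisy-channel formulation (stated here for completeness), the best clean sentence is the one that maximizes the posterior probability, which Bayes' rule decomposes into a translation model and a language model:

    h* = argmax_h P(h | e) = argmax_h P(e | h) · P(h)

Here P(e | h) plays the role of the (pseudo-)translation model and P(h) is a language model over clean sentences, so that recovering the clean sentence becomes a standard decoding problem.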
In this approach, the translation model is emulated during decoding by analyzing the noise of the tokens in the input sentence. This research shows that it is possible to clean noisy text in an unsupervised way by constructing ranked lists of possible clean tokens and then searching for the best clean sentence. Because several clean sentences may be available for a given noisy sentence, the statistical machine learning paradigm is exploited to let the decoder pick the best alternative as the final translation.
Noise can be categorized into two categories: format noise and semantic noise. Format noise includes noise such as XML/HTML tags, wrongly encoded characters or words, multi-byte symbols in English text (such as Greek or currency symbols), full-width and half-width letters, numbers, punctuation, and so on. Semantic noise can be understood as misaligned pairs of Hindi-English sentences, pairs mismatched in length, pairs wrongly swapped between the two sides, and so on. In this research work, the focus is on measures to handle problems in the semantic noise, which is categorized as:
In this seminar work, a pseudo-translation model of clean and noisy words is used, together with a language model trained on a large monolingual corpus. A decoder uses these models to search for the best clean sentence for a given noisy sentence. Scores for the pseudo-translation model are generated using a weighting function for each token in the given file, and these scores are used along with the language model probabilities to hypothesize the best clean sentence for a given noisy file.
The first experiment in noise reduction is at the sentence level: a model based on a parallel corpus is trained, followed by reducing the complexity of the parallel corpus by selecting more informative sentences.
In order to filter the noise in the data, two simple strategies are proposed: one based on the length ratio (LR) and the other based on the translation ratio (TR).
LR-based Strategy
According to Zhou, the length ratio of the two sentences in a parallel sentence pair should be distributed within a certain range; otherwise, the two sentences do not correspond to each other, and such pairs should be filtered out as noise. The length ratio of the two sentences in a sentence pair is defined as follows:
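A plausible definition, consistent with the survey reported later (which plots the ratio of Hindi sentence length to English sentence length), is

    LR(h, e) = len(h) / len(e)

where len(·) counts the words of a sentence, and a pair is kept only if LR lies within the chosen lower and upper bounds.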
TR-based Strategy
In a sentence pair, if the two sentences correspond closely to each other, their words should also correspond well, which means that the translation of a word in the source-language sentence should have a good chance of appearing in the target-language sentence. Therefore, the translation ratio is used as a measure to filter noise. If the TR value of a sentence pair is less than a given threshold, the pair is filtered out as noise. A bilingual dictionary is used to estimate the number of word pairs co-occurring in the two sentences. The translation ratio is calculated using the following formula:
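A plausible form, consistent with the description above, is the fraction of source-language words whose dictionary translation occurs in the target sentence:

    TR(h, e) = |{w ∈ h : some dictionary translation of w occurs in e}| / len(h)

A minimal Python sketch of both filters follows, assuming a word-level Hindi-to-English dictionary that maps each Hindi word to a set of English translations; the thresholds (0.4 to 1.6 for LR, 0.2 for TR) are the ones reported in the experiments later in this report:

    def length_ratio_ok(hi_tokens, en_tokens, lower=0.4, upper=1.6):
        """Keep a pair only if the Hindi/English word-count ratio is in range."""
        if not en_tokens:
            return False
        return lower <= len(hi_tokens) / len(en_tokens) <= upper

    def translation_ratio(hi_tokens, en_tokens, dictionary):
        """Fraction of Hindi words with a dictionary translation on the English side."""
        if not hi_tokens:
            return 0.0
        en_set = set(en_tokens)
        matched = sum(1 for w in hi_tokens if dictionary.get(w, set()) & en_set)
        return matched / len(hi_tokens)

    def filter_corpus(pairs, dictionary, tr_threshold=0.2):
        """Apply the LR filter and then the TR filter to (hindi, english) token pairs."""
        return [(h, e) for h, e in pairs
                if length_ratio_ok(h, e)
                and translation_ratio(h, e, dictionary) >= tr_threshold]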
Moses
Packaged as a complete SMT system, Moses offers several advantages and extra functionalities, along with several challenges. Its efficient data structures allow memory-intensive translation and language models to be used on limited hardware, and it supports automatic training of translation models for a number of language pairs. It is an open-source toolkit which includes GIZA++ and its training scripts, SRILM, IRSTLM, and a number of tools for processing parallel corpora.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
The various steps for the development of the Baseline System for the Hindi language were executed using IRSTLM & Moses. The sequence is shown below:
1. Preliminary cleaning
To train a translation system we need parallel data (text translated into two different
languages) which is aligned at the sentence level.
2. Corpus Preparation
To begin with, some special preprocessing of the raw data is carried out, such as:
• conversion of the encoding from UTF-16 to UTF-8;
• extraction of sentences enclosed within XML/HTML mark-up;
• removal of bad characters, pictures, etc., in addition to removal of English text from the Hindi file.
The following steps are performed to prepare the data for training the translation system (a simplified sketch follows the list):
a. Tokenisation: in this first step, spaces are inserted between words and punctuation.
b. Truecasing: to reduce the data-sparsity problem, the initial word of each sentence is converted to its most probable casing. For this research work, this step was performed only for English.
c. Cleaning: to avoid problems in the training pipeline, long sentences and empty sentences are removed.
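In the Moses pipeline these steps are normally run with the toolkit's own scripts (tokenizer.perl, train-truecaser.perl and truecase.perl, and clean-corpus-n.perl). Purely as a simplified illustration of what each step does, not a replacement for those scripts, a Python sketch might be:

    import re
    from collections import Counter

    def tokenize(line):
        """Crude stand-in for tokenizer.perl: put spaces around punctuation."""
        return re.sub(r'([.,!?;:()"])', r' \1 ', line).split()

    def train_truecaser(corpus):
        """Learn each word's most frequent casing from non-initial positions."""
        counts = Counter(w for sent in corpus for w in sent[1:])
        best = {}
        for w, c in counts.items():
            key = w.lower()
            if key not in best or c > counts[best[key]]:
                best[key] = w
        return best

    def truecase(tokens, model):
        """Convert the sentence-initial word to its most probable casing."""
        if not tokens:
            return tokens
        return [model.get(tokens[0].lower(), tokens[0])] + tokens[1:]

    def keep_pair(src, tgt, max_len=80):
        """Cleaning: drop empty and overly long sentence pairs."""
        return 0 < len(src) <= max_len and 0 < len(tgt) <= max_len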
5. Tuning
Tuning, the slowest part of the process, requires a small amount of parallel data that is separate from the training data. The commands to launch the tuning process are executed from the home directory. The end result of tuning is an ini file with trained weights, which is placed in ~/working/mert-work/moses.ini.
6. Testing
To make the decoder start quickly, the phrase table and the lexicalized reordering models are first binarised. A Hindi sentence can then be typed in to see the result. The decoder can now be tested by translating the test set and then running the BLEU script on it, which gives a series of BLEU scores; a compact BLEU sketch is given below.
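BLEU is the n-gram precision metric of Papineni et al. (reference [7] below): a geometric mean of modified n-gram precisions multiplied by a brevity penalty. A compact single-reference, sentence-level sketch, with add-one smoothing that the official corpus-level metric does not use, is:

    import math
    from collections import Counter

    def bleu(hypothesis, reference, max_n=4):
        """Sentence-level BLEU sketch (smoothed); token lists in, score in [0, 1]."""
        log_precision = 0.0
        for n in range(1, max_n + 1):
            hyp = Counter(tuple(hypothesis[i:i + n])
                          for i in range(len(hypothesis) - n + 1))
            ref = Counter(tuple(reference[i:i + n])
                          for i in range(len(reference) - n + 1))
            # Modified precision: clip each n-gram count by the reference count.
            overlap = sum(min(c, ref[g]) for g, c in hyp.items())
            total = max(sum(hyp.values()), 1)
            log_precision += math.log((overlap + 1) / (total + 1)) / max_n
        # Brevity penalty: penalize hypotheses shorter than the reference.
        bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
        return bp * math.exp(log_precision)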
To construct the experiments, 500,000 words were chosen randomly from the original corpus, with 20,000 sentences as the test set.
As already discussed, four methods of filtering the sentences containing noise are adopted. The first method is based on the LR approach; the second is based on the TR approach; the third is based on the combination of the LR & TR approaches, referred to as the Combination Approach (CR); and the fourth is an extension of the LR approach with a special focus on three types of noise, viz. misaligned pairs, mismatched pairs and wrongly swapped pairs, referred to as the LR Variant (LRV) approach.
• Using LR Approach
A survey to identify the ratio of Hindi sentence length to English sentence length was carried out on bilingual Hindi-English data containing 10,000 sentence pairs. The statistical results are presented in Fig. 1, where the number of bilingual sentence pairs is denoted on the vertical axis and the length ratio of Hindi sentence length to English sentence length on the horizontal axis.
To apply this approach, all Hindi sentences were tokenized first, followed by the calculation of their lengths based on the number of words present.
It can easily be inferred from the figure that the length ratios are distributed within a range of roughly 0.1~4, and that more than 90% lie between 0.4 and 1.6. Therefore, these two values are taken as the thresholds: a sentence pair is considered only if the length ratio of its two sentences is between 0.4 and 1.6; otherwise it is discarded.
• Using TR Approach
As can easily be inferred from Fig. 2, the translation ratio of 80% of the sentence pairs is higher than 0.2; therefore, this value is taken as the threshold. Thus, all sentence pairs with a translation ratio of less than 0.2 are considered noisy and are eligible for discarding.
For this approach, the sentence length is limited to 40 for each sub-system, and the BLEU score is calculated on an in-house test set using the parameters tuned on development sets containing 983 generalized sentences.
Other statistics of this in-house corpus, which are helpful in understanding its structure, are presented below:
These sets of experiments were then applied to the larger corpora available, where additional data was used for language modeling. For the evaluation, NIST & TER scores were also calculated to verify the results, and the BLEU, NIST & TER scores obtained are as shown in
The BLEU score of each sub-system is compared against the Baseline System as well as the Combined System.
It is noted that many words in the source-language set are translated to just a single target output. This is due to poor tokenization for regional languages, as a standardized tokenizer is still not available for them. However, the corpora used for the English-Hindi language pair were the most domain-relevant and also the biggest in size, and they therefore resulted in significantly better translation.
Looking at these results, it can be concluded that the size and relevance of the parallel corpus have a direct relationship with the quality of the translation. It can also be inferred that data cleaning plays an even more important role in improving the translation quality.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
In this research work, approaches to data cleaning are proposed, based on removing noise and selecting more informative sentences from the bilingual corpus.
In the data-preparation stage, a number of measures are applied to the training data to remove semantic noise, such as the removal of misaligned sentences, sentences mismatched in length, and wrongly swapped sentences, in order to obtain better word alignments and phrase tables and thus better translation quality.
For the training corpus, sentences were selected to build a compact training corpus using the weighted-phrase method. With the newly generated compact training corpus, competitive performance was achieved compared to the baseline system that used all of the training data, thus reducing the computing resources required and improving the translation speed.
According to the experimental results achieved for the five-fold translation tasks, the following can be summarized:
1. As the scale of the data for MT training is very large, implementation through distributed/parallel computing is the trend for overcoming this problem.
2. Data cleaning plays a vital role in improving translation quality, hence the need to design smarter, automatic systems to optimize the training data.
3. The robustness of the Combination Approach needs to be improved so that it can better adapt to cases of low relevance between the development and test sets.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>
[1]. P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer and P. S. Roossin: A Statistical Approach to Machine Translation. Computational Linguistics, 16(2), June 1990.
[2]. P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311, June 1993.
[3]. J. Hutchins: Research Methods and System Designs in Machine Translation: a Ten-Year Review, 1984-1994. International conference 'Machine Translation: Ten Years On', Cranfield University, England, 12-14 November 1994.
[4]. D. Jurafsky and J. H. Martin: Speech and Language Processing, 2nd Edition. May 2008. ISBN-10: 0131873210.
[5]. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin and E. Herbst: Moses: Open Source Toolkit for Statistical Machine Translation. ACL 2007, Demo and Poster Sessions.
[6]. A. Lopez: Statistical Machine Translation. ACM Computing Surveys, 40(3), Article 8, August 2008.
[7]. K. Papineni, S. Roukos, T. Ward and W.-J. Zhu: BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of the 40th ACL, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318.
[8]. F. J. Och and H. Ney: A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, March 2003.
[9]. F. J. Och: Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the ACL, pp. 160-167, July 2003, Sapporo, Japan.
<This is a sample report only, learners are expected to include the appropriate
data/paragraphs in their reports as per the guidelines shared.>